webarchive-indexing
WARC Indexer
Tools for bulk indexing of WARC/ARC files on Hadoop, EMR, or the local file system to create a shared URL index.
43 stars
9 watching
10 forks
Language: Python
Last commit: about 7 years ago
Linked from 1 awesome list
Related projects:
Repository | Description | Stars |
---|---|---|
helgeho/warcpartitioner | Tool for partitioning and merging Web archive files by MIME type and year | 1 |
ukwa/webarchive-discovery | Tools for indexing and discovering archived web content | 117 |
webrecorder/warcio | A fast streaming library for working with WARC format web archival data | 391 |
internetarchive/warctools | Tools for working with archived web content | 153 |
internetarchive/warcprox | An HTTP proxy designed to capture and archive web traffic, including encrypted HTTPS connections | 389 |
richardlehane/webarchive | Provides tools for reading and parsing web archive formats used in digital preservation | 20 |
archiveteam/grab-site | A web crawler designed to back up websites by recursively crawling and writing WARC files | 1,406 |
peterk/warcworker | A web archiving tool that archives websites with high-fidelity preservation capabilities | 57 |
webrecorder/har2warc | Converts HTTP Archive (HAR) files to Web ARChive (WARC) files | 48 |
turicas/crau | A command-line tool for archiving and playing back websites in WARC format | 59 |
nla/httrack2warc | Converts HTTrack crawls to WARC files by reconstructing requests and responses from logs | 32 |
n0tan3rd/node-warc | A tool for parsing and generating Web Archive files in JavaScript using Node.js | 95 |
chfoo/warcat | Tool for handling Web Archive files | 152 |
florents-tselai/warcdb | A library for storing and querying web crawl data in a compact, easily sharable format | 397 |
wabarc/rivet | A tool for archiving webpages to IPFS | 12 |