webarchive-indexing

WARC Indexer

Tools for bulk indexing of WARC/ARC files to create a shared URL index.

Tools for bulk indexing of WARC/ARC files on Hadoop, EMR, or a local file system.
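To make the idea of a URL index concrete, here is a minimal sketch in Python (the repository's language) that scans an uncompressed WARC stream and records, for each response record, its target URL, capture date, byte offset, and length. It is not the project's code: the function names `index_warc` and `make_record` are hypothetical, the parser uses only the standard library, and it skips details a production indexer would handle, such as gzip-compressed records, ARC files, and URL canonicalization.

```python
import io


def make_record(rec_type, uri, date, payload):
    """Build one uncompressed WARC record (hypothetical helper for the demo)."""
    head = (
        "WARC/1.0\r\n"
        f"WARC-Type: {rec_type}\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"WARC-Date: {date}\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    ).encode("utf-8")
    # Each record ends with two CRLF pairs after its content block.
    return head + payload + b"\r\n\r\n"


def index_warc(stream):
    """Return sorted (url, date, offset, length) tuples for response records."""
    entries = []
    while True:
        offset = stream.tell()
        line = stream.readline()
        if not line:
            break  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected data at offset {offset}")

        # Read the named header fields up to the empty line.
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip().lower()] = value.strip()

        # Skip the content block without reading it into memory.
        stream.seek(int(headers.get("content-length", "0")), io.SEEK_CUR)
        length = stream.tell() - offset

        if headers.get("warc-type") == "response":
            entries.append(
                (headers["warc-target-uri"], headers["warc-date"], offset, length)
            )

    # Sorting by URL, then date, lets a lookup use binary search.
    return sorted(entries)


# Demo on an in-memory WARC with two responses and one request.
sample = (
    make_record("response", "http://example.org/b", "2014-01-01T00:00:00Z", b"hello")
    + make_record("request", "http://example.org/a", "2014-01-02T00:00:00Z", b"GET")
    + make_record("response", "http://example.org/a", "2014-01-02T00:00:00Z", b"hi\nthere")
)
index = index_warc(io.BytesIO(sample))
for url, date, offset, length in index:
    print(url, date, offset, length)
```

Storing the offset and length is what makes such an index useful: a replay tool can seek straight to a capture instead of rescanning the whole archive. Bulk indexing on a cluster would run a step like this per file and then merge the sorted outputs, though how this project does so is not described here.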

43 stars

9 watching

10 forks

Language: Python

Last commit: over 8 years ago

Linked from 1 awesome list

Backlinks from these awesome lists:

iipc/awesome-web-archiving

Related projects:

Repository	Description	Stars
helgeho/warcpartitioner	Tool for partitioning and merging Web archive files by MIME type and year	1
ukwa/webarchive-discovery	Tools for indexing and discovering archived web content	117
webrecorder/warcio	A fast streaming library for working with WARC format web archival data	391
internetarchive/warctools	Tools for working with archived web content	153
internetarchive/warcprox	An HTTP proxy designed to capture and archive web traffic, including encrypted HTTPS connections.	389
richardlehane/webarchive	Provides tools for reading and parsing web archive formats used in digital preservation.	20
archiveteam/grab-site	A web crawler designed to backup websites by recursively crawling and writing WARC files.	1,406
peterk/warcworker	A web archiving tool that archives websites with high-fidelity preservation capabilities.	57
webrecorder/har2warc	Converts HTTP Archive format to Web Archive format	48
turicas/crau	A command-line tool for archiving and playing back websites in WARC format	59
nla/httrack2warc	Converts HTTrack crawls to WARC files by reconstructing requests and responses from logs	32
n0tan3rd/node-warc	A tool for parsing and generating Web Archive files in JavaScript using Node.js	95
chfoo/warcat	Tool for handling Web Archive files	152
florents-tselai/warcdb	A library for storing and querying web crawl data in a compact, easily sharable format.	397
wabarc/rivet	A tool for archiving webpages to IPFS	12