warctools

WARC tools

Tools for working with archived web content

Command line tools and libraries for handling and manipulating WARC files (and HTTP contents)
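
For orientation, here is a minimal sketch of what "handling a WARC file" involves at the lowest level. It uses only the Python standard library and does not use or reflect the warctools API. The function name `read_warc_record` and the sample record are hypothetical and exist only for illustration. It assumes an uncompressed WARC stream (real-world files are often gzip-compressed per record, which this sketch does not handle) and treats header names as case-sensitive for simplicity.

```python
import io


def read_warc_record(stream):
    """Read one uncompressed WARC record from a binary stream.

    Returns (version, headers, payload), or None at end of stream.
    Hypothetical helper for illustration; not part of warctools.
    """
    # Skip the blank lines that separate consecutive records.
    line = stream.readline()
    while line in (b"\r\n", b"\n"):
        line = stream.readline()
    if not line:
        return None

    # The first line of a record is the version, e.g. "WARC/1.0".
    version = line.strip().decode("utf-8")
    if not version.startswith("WARC/"):
        raise ValueError(f"not a WARC record: {version!r}")

    # Header fields are "Name: value" lines, terminated by an empty line.
    headers = {}
    while True:
        line = stream.readline()
        if line in (b"\r\n", b"\n", b""):
            break
        name, _, value = line.decode("utf-8").partition(":")
        headers[name.strip()] = value.strip()

    # Content-Length gives the size of the record block in bytes.
    payload = stream.read(int(headers["Content-Length"]))
    return version, headers, payload


# A hand-written sample record (made-up data, for illustration only).
body = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello"
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Length: " + str(len(body)).encode("ascii") + b"\r\n"
    b"\r\n" + body + b"\r\n\r\n"
)

version, headers, payload = read_warc_record(io.BytesIO(sample))
print(version, headers["WARC-Type"], len(payload))
```

For a `response` record like this one, the payload is itself a raw HTTP message, which is why the description mentions HTTP contents alongside WARC files.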


153 stars
44 watching
28 forks
Language: Python
last commit: over 4 years ago
Linked from 1 awesome list


Related projects:

| Repository | Description | Stars |
|---|---|---|
| chfoo/warcat | Tool for handling Web Archive files | 152 |
| internetarchive/warcprox | An HTTP proxy designed to capture and archive web traffic, including encrypted HTTPS connections | 389 |
| webrecorder/warcio | A fast streaming library for working with WARC format web archival data | 391 |
| webrecorder/har2warc | Converts HTTP Archive format to Web Archive format | 48 |
| nla/httrack2warc | Converts HTTrack crawls to WARC files by reconstructing requests and responses from logs | 32 |
| archiveteam/grab-site | A web crawler designed to back up websites by recursively crawling and writing WARC files | 1,406 |
| turicas/crau | A command-line tool for archiving and playing back websites in WARC format | 59 |
| ikreymer/webarchive-indexing | Tools for bulk indexing of WARC/ARC files to create a shared URL index | 43 |
| ukwa/webarchive-discovery | Tools for indexing and discovering archived web content | 117 |
| peterk/warcworker | A web archiving tool that archives websites with high-fidelity preservation capabilities | 57 |
| richardlehane/webarchive | Provides tools for reading and parsing web archive formats used in digital preservation | 20 |
| helgeho/warcpartitioner | Tool for partitioning and merging Web archive files by MIME type and year | 1 |
| iipc/jwarc | A Java library for reading and writing WARC files with a typed API | 48 |
| archiveteam/wpull | Downloads and crawls web pages, allowing for the archiving of websites | 556 |
| commoncrawl/whirlwind-python | A tour of Common Crawl's WARC format data that demonstrates its structure and contents | 14 |