warctools
WARC tools
Tools for working with archived web content
Command-line tools and libraries for handling and manipulating WARC files (and HTTP content)
153 stars
44 watching
28 forks
Language: Python
Last commit: over 4 years ago
Linked from 1 awesome list
Related projects:
Repository | Description | Stars |
---|---|---|
chfoo/warcat | Tool for handling Web Archive files | 152 |
internetarchive/warcprox | An HTTP proxy designed to capture and archive web traffic, including encrypted HTTPS connections | 389 |
webrecorder/warcio | A fast streaming library for working with WARC format web archival data | 391 |
webrecorder/har2warc | Converts HTTP Archive format to Web Archive format | 48 |
nla/httrack2warc | Converts HTTrack crawls to WARC files by reconstructing requests and responses from logs | 32 |
archiveteam/grab-site | A web crawler designed to back up websites by recursively crawling and writing WARC files | 1,406 |
turicas/crau | A command-line tool for archiving and playing back websites in WARC format | 59 |
ikreymer/webarchive-indexing | Tools for bulk indexing of WARC/ARC files to create a shared url index | 43 |
ukwa/webarchive-discovery | Tools for indexing and discovering archived web content | 117 |
peterk/warcworker | A web archiving tool for high-fidelity preservation of websites | 57 |
richardlehane/webarchive | Tools for reading and parsing web archive formats used in digital preservation | 20 |
helgeho/warcpartitioner | Tool for partitioning and merging Web archive files by MIME type and year | 1 |
iipc/jwarc | A Java library for reading and writing WARC files with a typed API | 48 |
archiveteam/wpull | Downloads and crawls web pages for archiving websites | 556 |
commoncrawl/whirlwind-python | A tour of Common Crawl's WARC-format data that demonstrates its structure and contents | 14 |