warcat

Archive tool

Tool and library for handling Web ARChive (WARC) files.

GitHub

150 stars
11 watching
21 forks
Language: Python
Last commit: about 1 month ago
Linked from 1 awesome list

Related projects:

Repository | Description | Stars
internetarchive/warctools | Tools for working with archived web content | 152
webrecorder/warcio | A fast streaming library for working with WARC format web archival data | 385
turicas/crau | A command-line tool for archiving and playing back websites in WARC format | 57
webrecorder/har2warc | Converts HTTP Archive (HAR) format to Web Archive (WARC) format | 46
helgeho/warcpartitioner | Tool for partitioning and merging Web archive files by MIME type and year | 1
internetarchive/warcprox | An HTTP proxy designed to capture and archive web traffic, including encrypted HTTPS connections | 381
archiveteam/grab-site | A web crawler designed to back up websites by recursively crawling and writing WARC files | 1,398
florents-tselai/warcdb | A library for storing and querying web crawl data in a compact, easily sharable format | 394
peterk/warcworker | A web archiving tool that archives websites with high-fidelity preservation capabilities | 55
nla/httrack2warc | Converts HTTrack crawls to WARC files by reconstructing requests and responses from logs | 30
richardlehane/webarchive | Provides tools for reading and parsing web archive formats used in digital preservation | 20
bellingcat/auto-archiver | Automates archiving of online content from various sources into local storage or cloud services | 570
n0tan3rd/node-warc | A tool for parsing and generating Web Archive files in JavaScript using Node.js | 94
ukwa/webarchive-discovery | Tools for indexing and discovering archived web content | 116
ikreymer/webarchive-indexing | Tools for bulk indexing of WARC/ARC files to create a shared URL index | 42