WarcDB

Web archive storage

A library for storing and querying web crawl data in a compact, easily sharable format.

WarcDB: Web crawl data as SQLite databases.

397 stars

10 watching

11 forks

Language: Python

Last commit: about 2 years ago

Topics: cli, crawling, database, sqlite, warc, web-archiving, web-data

Backlinks from these awesome lists:

Repository	Description	Stars
chfoo/warcat	Tool for handling Web Archive files	152
archiveteam/grab-site	A web crawler designed to back up websites by recursively crawling and writing WARC files.	1,406
turicas/crau	A command-line tool for archiving and playing back websites in WARC format	59
internetarchive/warcprox	An HTTP proxy designed to capture and archive web traffic, including encrypted HTTPS connections.	389
internetarchive/warctools	Tools for working with archived web content	153
ikreymer/webarchive-indexing	Tools for bulk indexing of WARC/ARC files to create a shared URL index	43
webrecorder/warcio	A fast streaming library for working with WARC format web archival data	391
nla/outbackcdx	A RocksDB-based server for managing and replicating capture indexes used in web archiving	33
webis-de/wasp	A containerized web archive and search system using Elasticsearch	27
archiveteam/wpull	Downloads and crawls web pages, allowing for the archiving of websites.	556
ukwa/webarchive-discovery	Tools for indexing and discovering archived web content	117
richardlehane/webarchive	Provides tools for reading and parsing web archive formats used in digital preservation.	20
helgeho/warcpartitioner	Tool for partitioning and merging Web archive files by MIME type and year	1
arcalex/warcrefs	Tools to identify and convert duplicate records in archived web content	6
oduwsdl/ipwb	A system for dispersing and replaying archived web content using peer-to-peer technology.	617