WarcDB

Web archive storage

A library for storing and querying web crawl data in a compact, easily sharable format.

WarcDB: Web crawl data as SQLite databases.
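
To illustrate what keeping crawl data in SQLite makes possible, here is a minimal sketch using only Python's standard `sqlite3` module. The `response` table, its columns, and the `successful_html_urls` helper are hypothetical names chosen for this example; they are not taken from WarcDB, whose actual schema and API may differ.

```python
import sqlite3

# Hypothetical schema for illustration only; WarcDB's real layout may differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE response ("
    "url TEXT, status INTEGER, content_type TEXT, payload BLOB)"
)

# A few made-up captured responses standing in for crawl records.
rows = [
    ("https://example.com/", 200, "text/html", b"<html>home</html>"),
    ("https://example.com/missing", 404, "text/html", b"<html>not found</html>"),
    ("https://example.com/logo.png", 200, "image/png", b"\x89PNG"),
]
conn.executemany("INSERT INTO response VALUES (?, ?, ?, ?)", rows)


def successful_html_urls(connection):
    """Return URLs of captured HTML responses that came back with status 200."""
    cursor = connection.execute(
        "SELECT url FROM response "
        "WHERE status = 200 AND content_type = 'text/html' "
        "ORDER BY url"
    )
    return [url for (url,) in cursor]


print(successful_html_urls(conn))
```

Because the archive is a single SQLite file in this model, it can be shared as one file and queried with ordinary SQL, without a separate index server.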

GitHub

Stars: 394
Watching: 10
Forks: 11
Language: Python
Last commit: 4 months ago
Linked from 1 awesome list

cli, crawling, database, sqlite, warc, web-archiving, web-data

Related projects:

Repository | Description | Stars
chfoo/warcat | Tool for handling Web Archive files | 150
archiveteam/grab-site | A web crawler designed to back up websites by recursively crawling and writing WARC files | 1,398
turicas/crau | A command-line tool for archiving and playing back websites in WARC format | 57
internetarchive/warcprox | An HTTP proxy designed to capture and archive web traffic, including encrypted HTTPS connections | 381
internetarchive/warctools | Tools for working with archived web content | 152
ikreymer/webarchive-indexing | Tools for bulk indexing of WARC/ARC files to create a shared URL index | 42
webrecorder/warcio | A fast streaming library for working with WARC format web archival data | 385
nla/outbackcdx | A RocksDB-based server for managing and replicating capture indexes used in web archiving | 32
webis-de/wasp | A containerized web archive and search system using Elasticsearch | 26
archiveteam/wpull | Downloads and crawls web pages, allowing for the archiving of websites | 556
ukwa/webarchive-discovery | Tools for indexing and discovering archived web content | 116
richardlehane/webarchive | Provides tools for reading and parsing web archive formats used in digital preservation | 20
helgeho/warcpartitioner | Tool for partitioning and merging web archive files by MIME type and year | 1
arcalex/warcrefs | Tools to identify and convert duplicate records in archived web content | 6
oduwsdl/ipwb | A system for dispersing and replaying archived web content using peer-to-peer technology | 617