WarcDB
Web archive storage
A library for storing and querying web crawl data in a compact, easily sharable format.
WarcDB: Web crawl data as SQLite databases.
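As a rough illustration of what "web crawl data as SQLite databases" makes possible, the sketch below uses only Python's standard `sqlite3` module. The `response` table and its `target_uri`, `http_status`, and `payload` columns are hypothetical names chosen for this example, not WarcDB's documented schema; inspect a real file's schema before querying it.

```python
import sqlite3


def build_demo_archive() -> sqlite3.Connection:
    """Create an in-memory database standing in for a crawl archive.

    The table and column names are assumptions made for illustration only.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE response ("
        "target_uri TEXT, http_status INTEGER, payload BLOB)"
    )
    conn.executemany(
        "INSERT INTO response VALUES (?, ?, ?)",
        [
            ("https://example.com/", 200, b"<html>home</html>"),
            ("https://example.com/about", 200, b"<html>about</html>"),
            ("https://example.com/missing", 404, b"not found"),
        ],
    )
    conn.commit()
    return conn


def count_by_status(conn: sqlite3.Connection) -> dict:
    """Return a mapping of HTTP status code to number of captured responses."""
    rows = conn.execute(
        "SELECT http_status, COUNT(*) FROM response "
        "GROUP BY http_status ORDER BY http_status"
    ).fetchall()
    return dict(rows)


if __name__ == "__main__":
    conn = build_demo_archive()
    print(count_by_status(conn))
    conn.close()
```

Because the archive is an ordinary SQLite file, the same kind of query could be run from any SQLite client, which is what makes the format easy to share.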
394 stars
10 watching
11 forks
Language: Python
Last commit: 4 months ago
Linked from 1 awesome list
Tags: cli, crawling, database, sqlite, warc, web-archiving, web-data
Related projects:
Repository | Description | Stars |
---|---|---|
chfoo/warcat | Tool for handling Web Archive files | 150 |
archiveteam/grab-site | A web crawler designed to back up websites by recursively crawling and writing WARC files. | 1,398 |
turicas/crau | A command-line tool for archiving and playing back websites in WARC format | 57 |
internetarchive/warcprox | An HTTP proxy designed to capture and archive web traffic, including encrypted HTTPS connections. | 381 |
internetarchive/warctools | Tools for working with archived web content | 152 |
ikreymer/webarchive-indexing | Tools for bulk indexing of WARC/ARC files to create a shared url index | 42 |
webrecorder/warcio | A fast streaming library for working with WARC format web archival data | 385 |
nla/outbackcdx | A RocksDB-based server for managing and replicating capture indexes used in web archiving | 32 |
webis-de/wasp | A containerized web archive and search system using Elasticsearch | 26 |
archiveteam/wpull | Downloads and crawls web pages, allowing for the archiving of websites. | 556 |
ukwa/webarchive-discovery | Tools for indexing and discovering archived web content | 116 |
richardlehane/webarchive | Provides tools for reading and parsing web archive formats used in digital preservation. | 20 |
helgeho/warcpartitioner | Tool for partitioning and merging Web archive files by MIME type and year | 1 |
arcalex/warcrefs | Tools to identify and convert duplicate records in archived web content | 6 |
oduwsdl/ipwb | A system for dispersing and replaying archived web content using peer-to-peer technology. | 617 |