WarcDB
Web archive storage
A library for storing and querying web crawl data in a compact, easily sharable format.
WarcDB: Web crawl data as SQLite databases.
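
Because the archive is described as an ordinary SQLite database, it should be possible to inspect one with any SQLite client. The following is a minimal sketch using only Python's standard `sqlite3` module. The file name `archive.warcdb` and the helper `list_tables` are hypothetical, and no particular table or column names are assumed; the sketch only asks SQLite which tables the file contains.

```python
import sqlite3
import sys
from contextlib import closing


def list_tables(db_path):
    """Return the names of all tables in a SQLite database file, sorted by name."""
    # closing() makes sure the connection is released when the block exits.
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


if __name__ == "__main__":
    # Hypothetical default path; pass a real file as the first argument.
    path = sys.argv[1] if len(sys.argv) > 1 else "archive.warcdb"
    for table in list_tables(path):
        print(table)
```

Listing the tables first is a cautious starting point: it shows what a given file actually contains before any query is written against it.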
397 stars
10 watching
11 forks
Language: Python
last commit: over 1 year ago
Linked from 1 awesome list
Topics: cli, crawling, database, sqlite, warc, web-archiving, web-data
Related projects:
| Repository | Description | Stars |
|---|---|---|
| | Tool for handling Web Archive files | 152 |
| | A web crawler designed to back up websites by recursively crawling and writing WARC files. | 1,406 |
| | A command-line tool for archiving and playing back websites in WARC format | 59 |
| | An HTTP proxy designed to capture and archive web traffic, including encrypted HTTPS connections. | 389 |
| | Tools for working with archived web content | 153 |
| | Tools for bulk indexing of WARC/ARC files to create a shared URL index | 43 |
| | A fast streaming library for working with WARC format web archival data | 391 |
| | A RocksDB-based server for managing and replicating capture indexes used in web archiving | 33 |
| | A containerized web archive and search system using Elasticsearch | 27 |
| | Downloads and crawls web pages, allowing for the archiving of websites. | 556 |
| | Tools for indexing and discovering archived web content | 117 |
| | Provides tools for reading and parsing web archive formats used in digital preservation. | 20 |
| | Tool for partitioning and merging Web archive files by MIME type and year | 1 |
| | Tools to identify and convert duplicate records in archived web content | 6 |
| | A system for dispersing and replaying archived web content using peer-to-peer technology. | 617 |