warcrefs
Duplicate converter
Tools to identify and convert duplicate records in archived web content
Web archive deduplication tools
6 stars
5 watching
1 fork
Language: Java
Last commit: about 6 years ago
Linked from 1 awesome list
Related projects:
Repository | Description | Stars |
---|---|---|
helgeho/warcpartitioner | Tool for partitioning and merging Web archive files by MIME type and year | 1 |
nla/httrack2warc | Converts HTTrack crawls to WARC files by reconstructing requests and responses from logs | 30 |
ukwa/webarchive-discovery | Tools for indexing and discovering archived web content | 116 |
florents-tselai/warcdb | A library for storing and querying web crawl data in a compact, easily sharable format | 394 |
ikreymer/webarchive-indexing | Tools for bulk indexing of WARC/ARC files to create a shared url index | 42 |
internetarchive/warctools | Tools for working with archived web content | 152 |
richardlehane/webarchive | Provides tools for reading and parsing web archive formats used in digital preservation | 20 |
chfoo/warcat | Tool for handling Web Archive files | 150 |
webrecorder/har2warc | Converts HTTP Archive format to Web Archive format | 46 |
derfenix/webarchive | A web-based archive service that allows users to store and manage web pages in various formats | 112 |
steffenfritz/html2warc | Converts offline data into a standard archival format | 18 |
n0tan3rd/node-warc | A tool for parsing and generating Web Archive files in JavaScript using Node.js | 94 |
webis-de/wasp | A containerized web archive and search system using Elasticsearch | 26 |
nla/outbackcdx | A RocksDB-based server for managing and replicating capture indexes used in web archiving | 32 |
webrecorder/warcio | A fast streaming library for working with WARC format web archival data | 385 |