warcrefs

Duplicate converter

Tools to identify and convert duplicate records in archived web content

Web archive deduplication tools

6 stars
5 watching
1 fork
Language: Java
last commit: about 6 years ago
Linked from 1 awesome list


Related projects:

Repository | Description | Stars
helgeho/warcpartitioner | Tool for partitioning and merging Web archive files by MIME type and year | 1
nla/httrack2warc | Converts HTTrack crawls to WARC files by reconstructing requests and responses from logs | 30
ukwa/webarchive-discovery | Tools for indexing and discovering archived web content | 116
florents-tselai/warcdb | A library for storing and querying web crawl data in a compact, easily sharable format | 394
ikreymer/webarchive-indexing | Tools for bulk indexing of WARC/ARC files to create a shared URL index | 42
internetarchive/warctools | Tools for working with archived web content | 152
richardlehane/webarchive | Provides tools for reading and parsing web archive formats used in digital preservation | 20
chfoo/warcat | Tool for handling Web Archive files | 150
webrecorder/har2warc | Converts HTTP Archive format to Web Archive format | 46
derfenix/webarchive | A web-based archive service that allows users to store and manage web pages in various formats | 112
steffenfritz/html2warc | Converts offline data into a standard archival format | 18
n0tan3rd/node-warc | A tool for parsing and generating Web Archive files in JavaScript using Node.js | 94
webis-de/wasp | A containerized web archive and search system using Elasticsearch | 26
nla/outbackcdx | A RocksDB-based server for managing and replicating capture indexes used in web archiving | 32
webrecorder/warcio | A fast streaming library for working with WARC format web archival data | 385