warcrefs

Duplicate converter

Tools to identify and convert duplicate records in archived web content

Web archive deduplication tools

GitHub

6 stars
5 watching
1 fork
Language: Java
Last commit: over 6 years ago
Linked from 1 awesome list


Related projects:

| Repository | Description | Stars |
|---|---|---|
| helgeho/warcpartitioner | Tool for partitioning and merging Web archive files by MIME type and year | 1 |
| nla/httrack2warc | Converts HTTrack crawls to WARC files by reconstructing requests and responses from logs | 32 |
| ukwa/webarchive-discovery | Tools for indexing and discovering archived web content | 117 |
| florents-tselai/warcdb | A library for storing and querying web crawl data in a compact, easily sharable format | 397 |
| ikreymer/webarchive-indexing | Tools for bulk indexing of WARC/ARC files to create a shared URL index | 43 |
| internetarchive/warctools | Tools for working with archived web content | 153 |
| richardlehane/webarchive | Provides tools for reading and parsing web archive formats used in digital preservation | 20 |
| chfoo/warcat | Tool for handling Web Archive files | 152 |
| webrecorder/har2warc | Converts HTTP Archive (HAR) format to Web Archive (WARC) format | 48 |
| derfenix/webarchive | A web-based archive service that allows users to store and manage web pages in various formats | 115 |
| steffenfritz/html2warc | Converts offline data into a standard archival format | 18 |
| n0tan3rd/node-warc | A tool for parsing and generating Web Archive files in JavaScript using Node.js | 95 |
| webis-de/wasp | A containerized web archive and search system using Elasticsearch | 27 |
| nla/outbackcdx | A RocksDB-based server for managing and replicating capture indexes used in web archiving | 33 |
| webrecorder/warcio | A fast streaming library for working with WARC format web archival data | 391 |