webarchive-discovery

Web archive indexer

Tools for indexing and discovering archived web content

WARC and ARC indexing and discovery tools.


117 stars
24 watching
25 forks
Language: Java
last commit: 7 months ago
Linked from 1 awesome list



Related projects:

Repository | Description | Stars
ukwa/shine | A web archive exploration UI built on top of the Solr search engine and the webarchive-discovery indexer. | 43
richardlehane/webarchive | Provides tools for reading and parsing web archive formats used in digital preservation. | 20
ikreymer/webarchive-indexing | Tools for bulk indexing of WARC/ARC files to create a shared URL index. | 43
internetarchive/warctools | Tools for working with archived web content. | 153
netarchivesuite/jwat | A toolkit for analyzing and extracting data from legacy web archives in a structured format suitable for further analysis or reuse. | 3
netarchivesuite/solrwayback | A search interface and archival tool for browsing historical web pages. | 102
webis-de/wasp | A containerized web archive and search system using Elasticsearch. | 27
helgeho/warcpartitioner | Tool for partitioning and merging web archive files by MIME type and year. | 1
peterk/warcworker | A web archiving tool that archives websites with high-fidelity preservation capabilities. | 57
nla/httrack2warc | Converts HTTrack crawls to WARC files by reconstructing requests and responses from logs. | 32
machawk1/wail | A graphical user interface layer for preserving and replaying web pages using multiple archiving tools. | 353
nla/outbackcdx | A RocksDB-based server for managing and replicating capture indexes used in web archiving. | 33
turicas/crau | A command-line tool for archiving and playing back websites in WARC format. | 59
jarofghosts/memento-client | Provides a simple JavaScript interface to access historical web pages via the Wayback Machine. | 14
wabarc/wayback | A tool for capturing and preserving web content and making it accessible in the future. | 1,839