webarchive-discovery

Web archive indexer

Tools for indexing and discovering archived web content

WARC and ARC indexing and discovery tools.

GitHub

116 stars
24 watching
25 forks
Language: Java
last commit: 4 months ago
Linked from 1 awesome list


Related projects:

Repository | Description | Stars
ukwa/shine | A web archive exploration UI built on top of the Solr search engine and the webarchive-discovery indexer. | 43
richardlehane/webarchive | Tools for reading and parsing web archive formats used in digital preservation. | 20
ikreymer/webarchive-indexing | Tools for bulk indexing of WARC/ARC files to create a shared URL index. | 42
internetarchive/warctools | Tools for working with archived web content. | 152
netarchivesuite/jwat | A toolkit for analyzing and extracting data from legacy web archives in a structured format suitable for further analysis or reuse. | 3
netarchivesuite/solrwayback | A web-based search interface and Wayback Machine for browsing archived web pages using an index of WARC files. | 102
webis-de/wasp | A containerized web archive and search system using Elasticsearch. | 26
helgeho/warcpartitioner | A tool for partitioning and merging web archive files by MIME type and year. | 1
peterk/warcworker | A web archiving tool that archives websites with high-fidelity preservation capabilities. | 55
nla/httrack2warc | Converts HTTrack crawls to WARC files by reconstructing requests and responses from logs. | 30
machawk1/wail | A graphical user interface layer for preserving and replaying web pages using multiple archiving tools. | 350
nla/outbackcdx | A RocksDB-based server for managing and replicating capture indexes used in web archiving. | 32
turicas/crau | A command-line tool for archiving and playing back websites in WARC format. | 57
jarofghosts/memento-client | A simple JavaScript interface for accessing historical web pages via the Wayback Machine. | 14
wabarc/wayback | A tool for capturing and preserving web content and making it accessible in the future. | 1,818