chatnoir-resiliparse

Web archiver

A toolkit for processing and analyzing web archive data

A robust web archive analytics toolkit


89 stars
9 watching
14 forks
Language: Cython
Last commit: 12 days ago
Linked from 1 awesome list

bigdata, cpp, cython, extraction, html, parser, python, warc, web, webarchive


Related projects:

Repository | Description | Stars
wabarc/cairn | A tool for archiving web pages as single HTML files | 45
webrecorder/archiveweb.page | A high-fidelity web archiving system for storing and replaying interactive web pages in browsers. | 903
bellingcat/auto-archiver | Automates archiving of online content from various sources into local storage or cloud services | 585
recrm/archivetools | A collection of tools for extracting and analyzing data from web archives | 71
webrecorder/pywb | A toolkit for archiving and replaying web content accurately and efficiently | 1,418
peterk/warcworker | A web archiving tool that archives websites with high-fidelity preservation capabilities. | 57
richardlehane/webarchive | Provides tools for reading and parsing web archive formats used in digital preservation. | 20
turicas/crau | A command-line tool for archiving and playing back websites in WARC format | 59
archiveteam/grab-site | A web crawler designed to backup websites by recursively crawling and writing WARC files. | 1,406
machawk1/wail | A graphical user interface layer for preserving and replaying web pages using multiple archiving tools. | 353
jarofghosts/memento-client | Provides a simple JavaScript interface to access historical web pages via the Wayback Machine | 14
archiveteam/wpull | Downloads and crawls web pages, allowing for the archiving of websites. | 556
n0tan3rd/squidwarc | An archival crawler built on top of Chrome or Chromium to preserve the web in high fidelity and user scriptable manner | 170
wabarc/wayback | A tool for capturing and preserving web content and making it accessible in the future. | 1,839
chfoo/warcat | Tool for handling Web Archive files | 152