Squidwarc

Web archiver

An archival crawler built on top of Chrome or Chromium that preserves the web in a high-fidelity, user-scriptable manner.

Squidwarc is a high-fidelity, user-scriptable archival crawler that drives Chrome or Chromium in either headful or headless mode.

GitHub

170 stars
10 watching
26 forks
Language: JavaScript
Last commit: over 4 years ago
Linked from 2 awesome lists

Topics: browser-automation, chrome, chrome-headless, crawler, crawling, headless-chrome, high-fidelity-preservation, puppeteer, webarchives, webarchiving

Related projects:

Repository | Description | Stars
peterk/warcworker | A web archiving tool that archives websites with high-fidelity preservation capabilities. | 57
archiveteam/grab-site | A web crawler designed to back up websites by recursively crawling and writing WARC files. | 1,406
helgeho/web2warc | A web crawler that creates custom archives in WARC/CDX format. | 25
machawk1/wail | A graphical user interface layer for preserving and replaying web pages using multiple archiving tools. | 353
wabarc/cairn | A tool for archiving web pages as single HTML files. | 45
wabarc/wayback | A tool for capturing and preserving web content and making it accessible in the future. | 1,839
turicas/crau | A command-line tool for archiving and playing back websites in WARC format. | 59
webrecorder/archiveweb.page | A high-fidelity web archiving system for storing and replaying interactive web pages in browsers. | 903
webrecorder/pywb | A toolkit for archiving and replaying web content accurately and efficiently. | 1,418
internetarchive/brozzler | A distributed web crawler that fetches and extracts links from websites using a real browser. | 678
n0tan3rd/node-warc | A tool for parsing and generating Web Archive files in JavaScript using Node.js. | 95
webrecorder/browsertrix-crawler | A containerized browser-based crawler system for capturing web content in a high-fidelity and customizable manner. | 677
nla/httrack2warc | Converts HTTrack crawls to WARC files by reconstructing requests and responses from logs. | 32
webrecorder/har2warc | Converts HTTP Archive (HAR) format to Web Archive (WARC) format. | 48
internetarchive/warcprox | An HTTP proxy designed to capture and archive web traffic, including encrypted HTTPS connections. | 389