Squidwarc
Web archiver
An archival crawler built on top of Chrome or Chromium that preserves the web in a high-fidelity, user-scriptable manner
Squidwarc is a high-fidelity, user-scriptable archival crawler that uses Chrome or Chromium, with or without a head (headful or headless)
170 stars
10 watching
26 forks
Language: JavaScript
Last commit: over 4 years ago
Linked from 2 awesome lists
Tags: browser-automation, chrome, chrome-headless, crawler, crawling, headless-chrome, high-fidelity-preservation, puppeteer, webarchives, webarchiving
Related projects:
Repository | Description | Stars |
---|---|---|
peterk/warcworker | A web archiving tool that archives websites with high-fidelity preservation capabilities. | 57 |
archiveteam/grab-site | A web crawler designed to backup websites by recursively crawling and writing WARC files. | 1,406 |
helgeho/web2warc | A web crawler that creates custom archives in WARC/CDX format. | 25 |
machawk1/wail | A graphical user interface layer for preserving and replaying web pages using multiple archiving tools. | 353 |
wabarc/cairn | A tool for archiving web pages as single HTML files | 45 |
wabarc/wayback | A tool for capturing and preserving web content and making it accessible in the future. | 1,839 |
turicas/crau | A command-line tool for archiving and playing back websites in WARC format | 59 |
webrecorder/archiveweb.page | A high-fidelity web archiving system for storing and replaying interactive web pages in browsers. | 903 |
webrecorder/pywb | A toolkit for archiving and replaying web content accurately and efficiently | 1,418 |
internetarchive/brozzler | A distributed web crawler that fetches and extracts links from websites using a real browser. | 678 |
n0tan3rd/node-warc | A tool for parsing and generating Web Archive files in JavaScript using Node.js | 95 |
webrecorder/browsertrix-crawler | A containerized browser-based crawler system for capturing web content in a high-fidelity and customizable manner. | 677 |
nla/httrack2warc | Converts HTTrack crawls to WARC files by reconstructing requests and responses from logs | 32 |
webrecorder/har2warc | Converts HTTP Archive format to Web Archive format | 48 |
internetarchive/warcprox | An HTTP proxy designed to capture and archive web traffic, including encrypted HTTPS connections. | 389 |