# grab-site

A web crawler designed to back up websites by recursively crawling them and writing WARC files. The archivist's web crawler: WARC output, a dashboard for monitoring all crawls, and dynamic ignore patterns.
1k stars · 41 watching · 136 forks · Language: Python · Last commit: 5 months ago · Linked from 1 awesome list
Tags: archiving, crawl, crawler, spider, warc
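A typical session looks something like the following. This is a sketch based on grab-site's documented CLI: the exact flag names and the dashboard port should be checked against `grab-site --help` and the project README for the installed version.

```shell
# Install grab-site (the README recommends a dedicated virtualenv;
# see the project documentation for the currently supported steps).
pip install grab-site

# Start the dashboard that aggregates all running crawls
# (by default it listens on http://127.0.0.1:29000).
gs-server &

# Recursively crawl a site, writing WARC files to a new directory.
# --igsets applies predefined ignore-pattern sets; --no-video skips
# large video files. Ignore patterns can also be edited while the
# crawl is running ("dynamic ignore patterns").
grab-site https://example.com --igsets=forums --no-video
```

The dashboard and the editable ignore patterns are what distinguish grab-site from a plain recursive downloader: long-running crawls can be watched and pruned without restarting them.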
Related projects:

| Repository | Description | Stars |
|---|---|---|
| archiveteam/wpull | Downloads and crawls web pages, allowing for the archiving of websites. | 556 |
| webrecorder/browsertrix-crawler | A containerized, browser-based crawler system for capturing web content in a high-fidelity and customizable manner. | 677 |
| helgeho/web2warc | A web crawler that creates custom archives in WARC/CDX format. | 25 |
| nla/httrack2warc | Converts HTTrack crawls to WARC files by reconstructing requests and responses from logs. | 32 |
| peterk/warcworker | A web archiving tool that archives websites with high-fidelity preservation capabilities. | 57 |
| n0tan3rd/squidwarc | An archival crawler built on Chrome or Chromium to preserve the web in a high-fidelity, user-scriptable manner. | 170 |
| internetarchive/brozzler | A distributed web crawler that fetches and extracts links from websites using a real browser. | 678 |
| turicas/crau | A command-line tool for archiving and playing back websites in WARC format. | 59 |
| internetarchive/warctools | Tools for working with archived web content. | 153 |
| internetarchive/warcprox | An HTTP proxy designed to capture and archive web traffic, including encrypted HTTPS connections. | 389 |
| vida-nyu/ache | A web crawler designed to efficiently collect and prioritize relevant content from the web. | 459 |
| cocrawler/cocrawler | A versatile web crawler built with modern tooling and concurrency to handle a variety of crawl tasks. | 188 |
| chfoo/warcat | A tool for handling Web ARChive (WARC) files. | 152 |
| a11ywatch/crawler | A high-performance web page crawler. | 51 |
| spider-rs/spider | A web crawler and data-extraction tool written in Rust. | 1,234 |