grab-site

Web crawler

A web crawler designed to back up websites by recursively crawling them and writing WARC files.

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns
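A minimal usage sketch of the features named above (WARC output, dashboard, dynamic ignores), based on grab-site's documented CLI; the ignore-set names and crawl-directory glob are illustrative, so verify flags and file names against the project README:

```shell
# Start the dashboard server (by default it serves all crawls on a local port)
gs-server &

# Crawl a site recursively, writing WARC files into a new crawl directory;
# --igsets applies predefined ignore-pattern sets (set names here are examples)
grab-site --igsets=blogs,forums 'https://example.com/'

# While the crawl is running, ignore patterns can be added dynamically by
# appending regexes to the crawl directory's `ignores` file
echo 'cgi-bin' >> example.com-*/ignores
```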

GitHub

Stars: 1k
Watchers: 41
Forks: 136
Language: Python
Last commit: 5 months ago
Linked from 1 awesome list

Tags: archiving, crawl, crawler, spider, warc

Related projects:

| Repository | Description | Stars |
| --- | --- | --- |
| archiveteam/wpull | Downloads and crawls web pages, allowing for the archiving of websites | 556 |
| webrecorder/browsertrix-crawler | A containerized browser-based crawler system for capturing web content in a high-fidelity, customizable manner | 677 |
| helgeho/web2warc | A web crawler that creates custom archives in WARC/CDX format | 25 |
| nla/httrack2warc | Converts HTTrack crawls to WARC files by reconstructing requests and responses from logs | 32 |
| peterk/warcworker | A web archiving tool that archives websites with high-fidelity preservation capabilities | 57 |
| n0tan3rd/squidwarc | An archival crawler built on top of Chrome or Chromium to preserve the web in a high-fidelity, user-scriptable manner | 170 |
| internetarchive/brozzler | A distributed web crawler that fetches and extracts links from websites using a real browser | 678 |
| turicas/crau | A command-line tool for archiving and playing back websites in WARC format | 59 |
| internetarchive/warctools | Tools for working with archived web content | 153 |
| internetarchive/warcprox | An HTTP proxy designed to capture and archive web traffic, including encrypted HTTPS connections | 389 |
| vida-nyu/ache | A web crawler designed to efficiently collect and prioritize relevant content from the web | 459 |
| cocrawler/cocrawler | A versatile web crawler built with modern tools and concurrency to handle various crawl tasks | 188 |
| chfoo/warcat | Tool for handling Web Archive (WARC) files | 152 |
| a11ywatch/crawler | Performs web page crawling at high performance | 51 |
| spider-rs/spider | A tool for web data extraction and processing written in Rust | 1,234 |