grab-site

Web crawler

A web crawler designed to back up websites by recursively crawling them and writing WARC files.

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns
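A minimal usage sketch of the features named above (WARC output, dashboard, dynamic ignores), based on grab-site's documented CLI; the ignore-set names and crawl-directory glob are illustrative, so verify flags and file names against the project README:

```shell
# Start the dashboard server (by default it serves all crawls on a local port)
gs-server &

# Crawl a site recursively, writing WARC files into a new crawl directory;
# --igsets applies predefined ignore-pattern sets (set names here are examples)
grab-site --igsets=blogs,forums 'https://example.com/'

# While the crawl is running, ignore patterns can be added dynamically by
# appending regexes to the crawl directory's `ignores` file
echo 'cgi-bin' >> example.com-*/ignores
```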

GitHub

Stars: 1k
Watchers: 41
Forks: 136
Language: Python
Last commit: 5 months ago
Linked from 1 awesome list

Tags: archiving, crawl, crawler, spider, warc

Related projects:

| Repository | Description | Stars |
| --- | --- | --- |
| archiveteam/wpull | Downloads and crawls web pages, allowing for the archiving of websites | 556 |
| webrecorder/browsertrix-crawler | A containerized browser-based crawler system for capturing web content in a high-fidelity, customizable manner | 677 |
| helgeho/web2warc | A web crawler that creates custom archives in WARC/CDX format | 25 |
| nla/httrack2warc | Converts HTTrack crawls to WARC files by reconstructing requests and responses from logs | 32 |
| peterk/warcworker | A web archiving tool that archives websites with high-fidelity preservation capabilities | 57 |
| n0tan3rd/squidwarc | An archival crawler built on top of Chrome or Chromium to preserve the web in a high-fidelity, user-scriptable manner | 170 |
| internetarchive/brozzler | A distributed web crawler that fetches and extracts links from websites using a real browser | 678 |
| turicas/crau | A command-line tool for archiving and playing back websites in WARC format | 59 |
| internetarchive/warctools | Tools for working with archived web content | 153 |
| internetarchive/warcprox | An HTTP proxy designed to capture and archive web traffic, including encrypted HTTPS connections | 389 |
| vida-nyu/ache | A web crawler designed to efficiently collect and prioritize relevant content from the web | 459 |
| cocrawler/cocrawler | A versatile web crawler built with modern tools and concurrency to handle various crawl tasks | 188 |
| chfoo/warcat | Tool for handling Web Archive (WARC) files | 152 |
| a11ywatch/crawler | Performs web page crawling at high performance | 51 |
| spider-rs/spider | A tool for web data extraction and processing written in Rust | 1,234 |