Web2Warc

Crawler

A Web crawler that creates custom archives in WARC/CDX format

An easy-to-use, highly customizable crawler for creating your own small Web archives in WARC/CDX format.

GitHub

25 stars

3 watching

4 forks

Language: Scala

Last commit: almost 9 years ago

Linked from 1 awesome list

Backlinks from these awesome lists:

iipc/awesome-web-archiving

Related projects:

Repository	Description	Stars
webrecorder/browsertrix-crawler	A containerized browser-based crawler system for capturing web content in a high-fidelity and customizable manner.	677
archiveteam/grab-site	A web crawler designed to back up websites by recursively crawling and writing WARC files.	1,406
n0tan3rd/squidwarc	An archival crawler built on top of Chrome or Chromium to preserve the web in a high-fidelity, user-scriptable manner.	170
helgeho/warcpartitioner	Tool for partitioning and merging Web archive files by MIME type and year.	1
nla/httrack2warc	Converts HTTrack crawls to WARC files by reconstructing requests and responses from logs.	32
internetarchive/brozzler	A distributed web crawler that fetches and extracts links from websites using a real browser.	678
vida-nyu/ache	A web crawler designed to efficiently collect and prioritize relevant content from the web.	459
hominee/dyer	A fast and flexible web crawling tool with features like asynchronous I/O and event-driven design.	135
cocrawler/cocrawler	A versatile web crawler built with modern tools and concurrency to handle various crawl tasks.	188
stewartmckee/cobweb	A flexible web crawler that can be used to extract data from websites in a scalable and efficient manner.	226
fredwu/crawler	A high-performance web crawling and scraping solution with customizable settings and worker pooling.	945
c-sto/recursebuster	A tool for recursively querying web servers by sending HTTP requests and analyzing responses to discover hidden content.	243
webrecorder/har2warc	Converts HTTP Archive format to Web Archive format.	48
hu17889/go_spider	A modular, concurrent web crawler framework written in Go.	1,827
apache/incubator-stormcrawler	A scalable and versatile web crawling framework based on Apache Storm	895