Web2Warc

Crawler

A Web crawler that creates custom archives in WARC/CDX format

An easy-to-use, highly customizable crawler that lets you create your own small Web archives in WARC/CDX format.

GitHub

25 stars
3 watching
4 forks
Language: Scala
Last commit: about 7 years ago
Linked from 1 awesome list


Related projects:

| Repository | Description | Stars |
| --- | --- | --- |
| webrecorder/browsertrix-crawler | A containerized browser-based crawler system for capturing web content in a high-fidelity and customizable manner. | 677 |
| archiveteam/grab-site | A web crawler designed to back up websites by recursively crawling and writing WARC files. | 1,406 |
| n0tan3rd/squidwarc | An archival crawler built on top of Chrome or Chromium to preserve the web in a high-fidelity and user-scriptable manner. | 170 |
| helgeho/warcpartitioner | A tool for partitioning and merging Web archive files by MIME type and year. | 1 |
| nla/httrack2warc | Converts HTTrack crawls to WARC files by reconstructing requests and responses from logs. | 32 |
| internetarchive/brozzler | A distributed web crawler that fetches and extracts links from websites using a real browser. | 678 |
| vida-nyu/ache | A web crawler designed to efficiently collect and prioritize relevant content from the web. | 459 |
| hominee/dyer | A fast and flexible web crawling tool with features like asynchronous I/O and event-driven design. | 135 |
| cocrawler/cocrawler | A versatile web crawler built with modern tools and concurrency to handle various crawl tasks. | 188 |
| stewartmckee/cobweb | A flexible web crawler that can be used to extract data from websites in a scalable and efficient manner. | 226 |
| fredwu/crawler | A high-performance web crawling and scraping solution with customizable settings and worker pooling. | 945 |
| c-sto/recursebuster | A tool for recursively querying web servers by sending HTTP requests and analyzing responses to discover hidden content. | 243 |
| webrecorder/har2warc | Converts HTTP Archive (HAR) format to Web Archive (WARC) format. | 48 |
| hu17889/go_spider | A modular, concurrent web crawler framework written in Go. | 1,827 |
| apache/incubator-stormcrawler | A scalable and versatile web crawling framework based on Apache Storm. | 895 |