Web2Warc

Crawler

A Web crawler that creates custom archives in WARC/CDX format

An easy-to-use, highly customizable crawler that lets you create your own small Web archives in WARC/CDX format.

GitHub

24 stars
3 watching
4 forks
Language: Scala
Last commit: about 7 years ago
Linked from 1 awesome list


Related projects:

Repository | Description | Stars
webrecorder/browsertrix-crawler | A containerized browser-based crawler system for capturing web content in a high-fidelity and customizable manner. | 652
archiveteam/grab-site | A web crawler designed to back up websites by recursively crawling and writing WARC files. | 1,402
n0tan3rd/squidwarc | An archival crawler built on top of Chrome or Chromium to preserve the web in a high-fidelity, user-scriptable manner. | 169
helgeho/warcpartitioner | A tool for partitioning and merging Web archive files by MIME type and year. | 1
nla/httrack2warc | Converts HTTrack crawls to WARC files by reconstructing requests and responses from logs. | 30
internetarchive/brozzler | A distributed web crawler that fetches and extracts links from websites using a real browser. | 671
vida-nyu/ache | A web crawler designed to efficiently collect and prioritize relevant content from the web. | 454
hominee/dyer | A fast and flexible web crawling tool with features like asynchronous I/O and event-driven design. | 133
cocrawler/cocrawler | A versatile web crawler built with modern tools and concurrency to handle various crawl tasks. | 187
stewartmckee/cobweb | A flexible web crawler that can be used to extract data from websites in a scalable and efficient manner. | 226
fredwu/crawler | A high-performance web crawling and scraping solution with customizable settings and worker pooling. | 945
c-sto/recursebuster | A tool for recursively querying web servers by sending HTTP requests and analyzing responses to discover hidden content. | 242
webrecorder/har2warc | Converts HTTP Archive (HAR) format to Web Archive (WARC) format. | 46
hu17889/go_spider | A modular, concurrent web crawler framework written in Go. | 1,826
apache/incubator-stormcrawler | A collection of resources for building web crawlers on Apache Storm using Java. | 891