heritrix3

Crawler

A web crawler designed to collect and preserve digital artifacts while respecting site policies and load constraints.

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.

GitHub

3k stars
188 watching
762 forks
Language: Java
Last commit: 17 days ago
Linked from 3 awesome lists

heritrix java warc webcrawling

Related projects:

| Repository | Description | Stars |
| --- | --- | --- |
| helgeho/web2warc | A web crawler that creates custom archives in WARC/CDX format. | 24 |
| vida-nyu/ache | A web crawler designed to efficiently collect and prioritize relevant content from the web. | 454 |
| n0tan3rd/squidwarc | An archival crawler built on top of Chrome or Chromium to preserve the web in a high-fidelity, user-scriptable manner. | 169 |
| hisxo/jspector | An extension that crawls JavaScript files in Burp Suite and automatically creates issues with URLs, endpoints, and dangerous methods. | 341 |
| archiveteam/grab-site | A web crawler designed to back up websites by recursively crawling and writing WARC files. | 1,402 |
| unclecode/crawl4ai | A tool for web crawling and data extraction, designed to work with large language models. | 16,180 |
| naufalardhani/domhttpx | A tool to discover and extract information from web pages using HTTP requests and Google search queries. | 68 |
| s0md3v/photon | A fast and flexible web crawler designed to gather information from the internet. | 11,067 |
| internetarchive/brozzler | A distributed web crawler that fetches and extracts links from websites using a real browser. | 671 |
| p3gleg/pwnback | Generates a sitemap of a website using the Wayback Machine. | 225 |
| webrecorder/browsertrix-crawler | A containerized browser-based crawler system for capturing web content in a high-fidelity and customizable manner. | 652 |
| archiveteam/wpull | Downloads and crawls web pages, allowing for the archiving of websites. | 556 |
| howie6879/ruia | An async web scraping micro-framework built with asyncio and aiohttp to simplify URL crawling. | 1,752 |
| svenskaspel/har2locust | Automatically converts browser recordings (.har files) into Locust scripts. | 163 |
| hisxo/gitgraber | An automated tool to monitor GitHub repositories for sensitive data in real time. | 2,034 |