heritrix3
Crawler
A web crawler designed to collect and preserve digital artifacts while respecting site policies and load constraints.
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
3k stars
188 watching
762 forks
Language: Java
Last commit: 17 days ago
Linked from 3 awesome lists
heritrix java warc webcrawling
Related projects:
Repository | Description | Stars |
---|---|---|
helgeho/web2warc | A Web crawler that creates custom archives in WARC/CDX format | 24 |
vida-nyu/ache | A web crawler designed to efficiently collect and prioritize relevant content from the web | 454 |
n0tan3rd/squidwarc | An archival crawler built on top of Chrome or Chromium to preserve the web in a high-fidelity, user-scriptable manner | 169 |
hisxo/jspector | An extension that crawls JavaScript files in Burp Suite and automatically creates issues with URLs, endpoints, and dangerous methods. | 341 |
archiveteam/grab-site | A web crawler designed to back up websites by recursively crawling and writing WARC files. | 1,402 |
unclecode/crawl4ai | A tool for web crawling and data extraction, designed to work with large language models. | 16,180 |
naufalardhani/domhttpx | A tool to discover and extract information from web pages using HTTP requests and Google search queries. | 68 |
s0md3v/photon | A fast and flexible web crawler designed to gather information from the internet | 11,067 |
internetarchive/brozzler | A distributed web crawler that fetches and extracts links from websites using a real browser. | 671 |
p3gleg/pwnback | Generates a sitemap of a website using the Wayback Machine | 225 |
webrecorder/browsertrix-crawler | A containerized browser-based crawler system for capturing web content in a high-fidelity and customizable manner. | 652 |
archiveteam/wpull | Downloads and crawls web pages, allowing for the archiving of websites. | 556 |
howie6879/ruia | An async web scraping micro-framework built with asyncio and aiohttp to simplify URL crawling | 1,752 |
svenskaspel/har2locust | Automatically converts browser recordings (.har files) into Locust scripts. | 163 |
hisxo/gitgraber | Automated tool to monitor GitHub repositories for sensitive data in real-time | 2,034 |