heritrix3
Crawler
A web crawler designed to collect and preserve digital artifacts while respecting site policies and load constraints.
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
3k stars
187 watching
762 forks
Language: Java
last commit: 12 months ago
Linked from 3 awesome lists
heritrix, java, warc, webcrawling
Related projects:
| Repository | Description | Stars |
|---|---|---|
| | A web crawler that creates custom archives in WARC/CDX format | 25 |
| | A web crawler designed to efficiently collect and prioritize relevant content from the web | 459 |
| | An archival crawler built on top of Chrome or Chromium to preserve the web in a high-fidelity, user-scriptable manner | 170 |
| | An extension that crawls JavaScript files in Burp Suite and automatically creates issues with URLs, endpoints, and dangerous methods. | 345 |
| | A web crawler designed to back up websites by recursively crawling and writing WARC files. | 1,406 |
| | A web crawling tool designed to extract structured data from the web for use in AI applications | 18,541 |
| | A tool to discover and extract information from web pages using HTTP requests and Google search queries. | 68 |
| | A fast and flexible web crawler designed to gather information from the internet | 11,122 |
| | A distributed web crawler that fetches and extracts links from websites using a real browser. | 678 |
| | Generates a sitemap of a website using the Wayback Machine | 225 |
| | A containerized browser-based crawler system for capturing web content in a high-fidelity and customizable manner. | 677 |
| | Downloads and crawls web pages, allowing for the archiving of websites. | 556 |
| | An async web scraping micro-framework built with asyncio and aiohttp to simplify URL crawling | 1,753 |
| | Automatically converts browser recordings (.har files) into locust scripts. | 167 |
| | Automated tool to monitor GitHub repositories for sensitive data in real-time | 2,044 |