heritrix3
Crawler
A web crawler designed to collect and preserve digital artifacts while respecting site policies and load constraints.
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
3k stars
187 watching
762 forks
Language: Java
Last commit: 3 months ago
Linked from 3 awesome lists
heritrix java warc webcrawling
Related projects:
Repository | Description | Stars |
---|---|---|
| A web crawler that creates custom archives in WARC/CDX format | 25 |
| A web crawler designed to efficiently collect and prioritize relevant content from the web | 459 |
| An archival crawler built on top of Chrome or Chromium to preserve the web in a high-fidelity, user-scriptable manner | 170 |
| An extension that crawls JavaScript files in Burp Suite and automatically creates issues with URLs, endpoints, and dangerous methods. | 345 |
| A web crawler designed to back up websites by recursively crawling and writing WARC files. | 1,406 |
| A web crawling tool designed to extract structured data from the web for use in AI applications | 18,541 |
| A tool to discover and extract information from web pages using HTTP requests and Google search queries. | 68 |
| A fast and flexible web crawler designed to gather information from the internet | 11,122 |
| A distributed web crawler that fetches and extracts links from websites using a real browser. | 678 |
| Generates a sitemap of a website using the Wayback Machine | 225 |
| A containerized browser-based crawler system for capturing web content in a high-fidelity and customizable manner. | 677 |
| Downloads and crawls web pages, allowing for the archiving of websites. | 556 |
| An async web scraping micro-framework built with asyncio and aiohttp to simplify URL crawling | 1,753 |
| Automatically converts browser recordings (.har files) into Locust scripts. | 167 |
| Automated tool to monitor GitHub repositories for sensitive data in real time | 2,044 |