crawlers
Data gatherer
A suite of tools for gathering and processing data from the web and file systems
Norconex Crawlers (or spiders) are flexible crawlers that collect, parse, and manipulate data from the web or a filesystem and store it in various data repositories, such as search engines.
183 stars
33 watching
67 forks
Language: Java
Last commit: 10 days ago
Linked from 1 awesome list
Tags: collector-fs, collector-http, crawler, crawlers, filesystem-crawler, flexible, java, search-engine, web-crawler
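The fetch-parse-enqueue loop common to the crawlers listed on this page can be sketched as a breadth-first traversal with a visited set and a depth limit. The Java snippet below is an illustrative sketch only (it is not Norconex's actual API): an in-memory link map stands in for real HTTP fetches, and a production crawler would also honor robots.txt, rate limits, and concurrency limits.

```java
import java.util.*;

// Minimal sketch of the breadth-first crawl loop shared by most crawlers here.
// The in-memory "site" map stands in for HTTP fetches; real crawlers would
// fetch pages, extract links from HTML, respect robots.txt, and throttle.
public class MiniCrawler {
    static List<String> crawl(Map<String, List<String>> site, String start, int maxDepth) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String[]> queue = new ArrayDeque<>(); // each entry: {url, depth}
        queue.add(new String[]{start, "0"});
        seen.add(start);
        while (!queue.isEmpty()) {
            String[] item = queue.poll();
            String url = item[0];
            int depth = Integer.parseInt(item[1]);
            visited.add(url); // "process" the page (parse, index, archive, ...)
            if (depth >= maxDepth) continue; // depth limit: do not follow links further
            for (String link : site.getOrDefault(url, List.of())) {
                if (seen.add(link)) { // enqueue each discovered link only once
                    queue.add(new String[]{link, String.valueOf(depth + 1)});
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> site = Map.of(
            "/", List.of("/a", "/b"),
            "/a", List.of("/b", "/c"),
            "/b", List.of("/"));
        System.out.println(crawl(site, "/", 2)); // prints [/, /a, /b, /c]
    }
}
```

The visited set is what keeps the traversal from looping on cyclic links (here, `/b` links back to `/`), and the depth counter is the equivalent of a `maxDepth` setting found in most of the crawlers below.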
Related projects:
| Repository | Description | Stars |
|---|---|---|
| brendonboshell/supercrawler | A web crawler designed to crawl websites while obeying robots.txt rules, rate limits, and concurrency limits, with customizable content handlers for parsing and processing crawled pages. | 378 |
| webrecorder/browsertrix-crawler | A containerized, browser-based crawler system for capturing web content in a high-fidelity and customizable manner. | 652 |
| c-sto/recursebuster | A tool for recursively querying web servers by sending HTTP requests and analyzing responses to discover hidden content. | 242 |
| 3nock/spidersuite | A cross-platform web spider/crawler for analyzing and mapping attack surfaces. | 601 |
| hu17889/go_spider | A modular, concurrent web crawler framework written in Go. | 1,826 |
| cocrawler/cocrawler | A versatile web crawler built with modern tools and concurrency to handle a variety of crawl tasks. | 187 |
| rndinfosecguy/scavenger | An OSINT bot that crawls pastebin sites to search for sensitive data leaks. | 629 |
| fredwu/crawler | A high-performance web crawling and scraping solution with customizable settings and worker pooling. | 945 |
| archiveteam/grab-site | A web crawler designed to back up websites by recursively crawling them and writing WARC files. | 1,398 |
| spider-rs/spider | A web crawler and scraper built in Rust, designed to extract data from the web in a flexible and configurable manner. | 1,140 |
| feng19/spider_man | A high-level web crawling and scraping framework for Elixir. | 23 |
| turnersoftware/infinitycrawler | A web crawling library for .NET that allows customizable crawling and throttling of websites. | 248 |
| postmodern/spidr | A Ruby web crawling library that provides flexible and customizable methods to crawl websites. | 806 |
| helgeho/web2warc | A web crawler that creates custom archives in WARC/CDX format. | 24 |
| elixir-crawly/crawly | A framework for extracting structured data from websites. | 987 |