crawlers

Data gatherer

A suite of tools for gathering and processing data from the web and file systems

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.

GitHub

184 stars

33 watching

67 forks

Language: Java

last commit: 10 months ago

Linked from 1 awesome list

collector-fscollector-httpcrawlercrawlersfilesystem-crawlerflexiblejavasearch-engineweb-crawler

opensource.norconex.com/crawlers

Backlinks from these awesome lists:

brucedone/awesome-crawler

Related projects:

Repository	Description	Stars
brendonboshell/supercrawler	A web crawler designed to crawl websites while obeying robots.txt rules, rate limits and concurrency limits, with customizable content handlers for parsing and processing crawled pages.	380
webrecorder/browsertrix-crawler	A containerized browser-based crawler system for capturing web content in a high-fidelity and customizable manner.	677
c-sto/recursebuster	A tool for recursively querying web servers by sending HTTP requests and analyzing responses to discover hidden content	243
3nock/spidersuite	A cross-platform web spider/crawler tool for analyzing and mapping attack surfaces	614
hu17889/go_spider	A modular, concurrent web crawler framework written in Go.	1,827
cocrawler/cocrawler	A versatile web crawler built with modern tools and concurrency to handle various crawl tasks	188
rndinfosecguy/scavenger	An OSINT bot that crawls pastebin sites to search for sensitive data leaks	634
fredwu/crawler	A high-performance web crawling and scraping solution with customizable settings and worker pooling.	945
archiveteam/grab-site	A web crawler designed to backup websites by recursively crawling and writing WARC files.	1,406
spider-rs/spider	A tool for web data extraction and processing using Rust	1,234
feng19/spider_man	A high-level web crawling and scraping framework for Elixir.	23
turnersoftware/infinitycrawler	A web crawling library for .NET that allows customizable crawling and throttling of websites.	248
postmodern/spidr	A Ruby web crawling library that provides flexible and customizable methods to crawl websites	809
helgeho/web2warc	A Web crawler that creates custom archives in WARC/CDX format	25
elixir-crawly/crawly	A framework for extracting structured data from websites	994