awesome-crawler

crawler frameworks

A collection of reusable web crawling and scraping components in multiple programming languages.

A collection of awesome web crawlers and spiders in different languages

GitHub

7k stars

201 watching

709 forks

last commit: about 2 years ago

Linked from 1 awesome list

awesome, crawler, node-crawler, scraper, spider, web-crawler, web-scraper

Awesome-crawler / Python
Scrapy	53,484	over 1 year ago	A fast high-level screen scraping and web crawling framework
Awesome-crawler / Python / Scrapy
django-dynamic-scraper	1,155	over 4 years ago	Creating Scrapy scrapers via the Django admin interface
Scrapy-Redis	5,548	about 2 years ago	Redis-based components for Scrapy
scrapy-cluster	1,185	over 2 years ago	Uses Redis and Kafka to create a distributed on demand scraping cluster
distribute_crawler	3,245	over 9 years ago	Uses Scrapy, Redis, MongoDB, and Graphite to create a distributed spider
Awesome-crawler / Python
pyspider	16,511	over 2 years ago	A powerful spider system
CoCrawler	188	over 4 years ago	A versatile web crawler built using modern tools and concurrency
cola	1,501	about 4 years ago	A distributed crawling framework
Demiurge	115	over 4 years ago	PyQuery-based scraping micro-framework
Scrapely	1,865	over 4 years ago	A pure-python HTML screen-scraping library
feedparser			Universal feed parser
you-get	54,175	over 1 year ago	Dumb downloader that scrapes the web
MechanicalSoup	4,685	over 1 year ago	A Python library for automating interaction with websites
portia	9,327	about 2 years ago	Visual scraping for Scrapy
crawley	188	over 3 years ago	Pythonic Crawling / Scraping Framework based on Non Blocking I/O operations
RoboBrowser	3,703	almost 6 years ago	A simple, Pythonic library for browsing the web without a standalone web browser
MSpider	348	about 4 years ago	A simple, easy spider using gevent and JS rendering
brownant	159	over 9 years ago	A lightweight web data extracting framework
PSpider	1,828	about 4 years ago	A simple spider frame in Python3
Gain	2,037	about 7 years ago	Web crawling framework based on asyncio for everyone
sukhoi	879	over 5 years ago	Minimalist and powerful Web Crawler
spidy	340	almost 2 years ago	The simple, easy to use command line web crawler
newspaper	14,220	about 2 years ago	News, full-text, and article metadata extraction in Python 3
aspider	1,753	about 3 years ago	An async web scraping micro-framework based on asyncio
Awesome-crawler / Java
ACHE Crawler	459	almost 3 years ago	An easy to use web crawler for domain-specific search
Apache Nutch			Highly extensible, highly scalable web crawler for production environment
Awesome-crawler / Java / Apache Nutch
anthelion	2,841	over 10 years ago	A plugin for Apache Nutch to crawl semantic annotations within HTML pages
Awesome-crawler / Java
Crawler4j	4,563	over 4 years ago	Simple and lightweight web crawler
JSoup			Scrapes, parses, manipulates and cleans HTML
websphinx			Website-Specific Processors for HTML information extraction
Open Search Server			A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything
Gecco	2,504	over 2 years ago	An easy-to-use, lightweight web crawler
WebCollector	3,074	over 2 years ago	Simple interfaces for crawling the Web; you can set up a multi-threaded web crawler in less than 5 minutes
Webmagic	11,456	over 1 year ago	A scalable crawler framework
Spiderman			A scalable, extensible, multi-threaded web crawler
Awesome-crawler / Java / Spiderman
Spiderman2			A distributed web crawler framework with support for JS rendering
Awesome-crawler / Java
Heritrix3	2,857	over 1 year ago	Extensible, web-scale, archival-quality web crawler project
SeimiCrawler	1,980	over 1 year ago	An agile, distributed crawler framework
StormCrawler			An open source collection of resources for building low-latency, scalable web crawlers on Apache Storm
Spark-Crawler	411	over 3 years ago	Evolving Apache Nutch to run on Spark
webBee	189	over 2 years ago	A DFS web spider
spider-flow	9,701	about 3 years ago	A visual spider framework; it's so good that you don't need to write any code to crawl a website
Norconex Web Crawler	184	over 1 year ago	Norconex HTTP Collector is a full-featured web crawler (or spider) that can manipulate and store collected data into a repository of your choice (e.g. a search engine). Can be used as a stand alone application or be embedded into Java applications
Awesome-crawler / C#
ccrawler			Built in C# 3.5. It contains a simple extension of a web content categorizer, which can distinguish between web pages depending on their content
SimpleCrawler			Simple spider based on multithreading and regular expressions
DotnetSpider	4,007	almost 2 years ago	A cross-platform, lightweight spider developed in C#
Abot	2,255	almost 2 years ago	C# web crawler built for speed and flexibility
Hawk	3,163	over 6 years ago	Advanced Crawler and ETL tool written in C#/WPF
SkyScraper	59	almost 10 years ago	An asynchronous web scraper / web crawler using async / await and Reactive Extensions
Infinity Crawler	248	over 2 years ago	A simple but powerful web crawler library in C#
Awesome-crawler / JavaScript
scraperjs	3,714	almost 6 years ago	A complete and versatile web scraper
scrape-it	4,024	over 1 year ago	A Node.js scraper for humans
simplecrawler	2,143	over 5 years ago	Event driven web crawler
node-crawler	6,718	almost 2 years ago	Node-crawler has a clean, simple API
js-crawler	254	about 8 years ago	Web crawler for Node.JS, both HTTP and HTTPS are supported
webster	518	over 1 year ago	A reliable web crawling framework which can scrape ajax and js rendered content in a web page
x-ray	5,883	over 1 year ago	Web scraper with pagination and crawler support
node-osmosis	4,115	over 2 years ago	HTML/XML parser and web scraper for Node.js
web-scraper-chrome-extension	1,318	almost 8 years ago	Web data extraction tool implemented as chrome extension
supercrawler	380	over 3 years ago	Define custom handlers to parse content. Obeys robots.txt, rate limits and concurrency limits
headless-chrome-crawler	5,534	over 3 years ago	Headless Chrome crawls with jQuery support
Squidwarc	170	about 6 years ago	High fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head
crawlee	16,081	over 1 year ago	A web scraping and browser automation library for Node.js that helps you build reliable crawlers. Fast
Awesome-crawler / PHP
Goutte	9,264	over 3 years ago	A screen scraping and web crawling library for PHP
Awesome-crawler / PHP / Goutte
laravel-goutte	453	over 2 years ago	Laravel 5 Facade for Goutte
Awesome-crawler / PHP
dom-crawler	3,974	over 1 year ago	The DomCrawler component eases DOM navigation for HTML and XML documents
QueryList	2,671	over 1 year ago	The progressive PHP crawler framework
pspider	266	almost 11 years ago	Parallel web crawler written in PHP
php-spider	1,336	about 2 years ago	A configurable and extensible PHP web spider
spatie/crawler	2,552	over 1 year ago	An easy to use, powerful crawler implemented in PHP. Can execute JavaScript
crawlzone/crawlzone	78	over 3 years ago	Crawlzone is a fast asynchronous internet crawling framework for PHP
PHPScraper	544	over 2 years ago	PHPScraper is a scraper & crawler built for simplicity
Awesome-crawler / C++
open-source-search-engine	1,546	over 2 years ago	A distributed open source search engine and spider/crawler written in C/C++
Awesome-crawler / C
httrack	3,648	almost 2 years ago	Copy websites to your computer
Awesome-crawler / Ruby
Nokogiri	6,164	over 1 year ago	A Rubygem providing HTML, XML, SAX, and Reader parsers with XPath and CSS selector support
upton	1,612	over 7 years ago	A batteries-included framework for easy web-scraping. Just add CSS (or do more)
wombat	1,315	over 2 years ago	Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages
RubyRetriever	143	over 3 years ago	RubyRetriever is a Web Crawler, Scraper & File Harvester
Spidr	809	over 2 years ago	Spider a site, multiple domains, certain links or infinitely
Cobweb	226	over 3 years ago	Web crawler with very flexible crawling options, standalone or using sidekiq
mechanize	4,396	almost 2 years ago	Automated web interaction & crawling
Awesome-crawler / Rust
spider	1,234	over 1 year ago	The fastest web crawler and indexer
crawler	51	almost 2 years ago	A gRPC web indexer turbo charged for performance
Awesome-crawler / R
rvest	1,495	almost 2 years ago	Simple web scraping for R
Awesome-crawler / Erlang
ebot	330	over 15 years ago	A scalable, distributed and highly configurable web crawler
Awesome-crawler / Perl
web-scraper	104	over 9 years ago	Web Scraping Toolkit using HTML and CSS Selectors or XPath expressions
Awesome-crawler / Go
pholcus	7,578	over 3 years ago	A distributed, high concurrency and powerful web crawler
gocrawl	2,036	about 5 years ago	Polite, slim and concurrent web crawler
fetchbot	787	about 5 years ago	A simple and flexible web crawler that follows the robots.txt policies and crawl delays
go_spider	1,827	over 8 years ago	An awesome Go concurrent crawler (spider) framework
dht	2,741	almost 5 years ago	BitTorrent DHT Protocol && DHT Spider
ants-go	363	over 10 years ago	An open source, distributed, RESTful crawler engine in golang
scrape	1,513	over 9 years ago	A simple, higher level interface for Go web scraping
creeper	780	about 9 years ago	The Next Generation Crawler Framework (Go)
colly	23,444	about 2 years ago	Fast and Elegant Scraping Framework for Gophers
ferret	5,760	over 1 year ago	Declarative web scraping
Dataflow kit	667	over 3 years ago	Extract structured data from web pages. Web site scraping
Hakrawler	4,528	over 2 years ago	Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application
Awesome-crawler / Scala
crawler	149	almost 10 years ago	Scala DSL for web crawling
scrala	113	almost 7 years ago	Scala crawler (spider) framework, inspired by Scrapy
ferrit			Ferrit is a web crawler service written in Scala using Akka, Spray and Cassandra

Backlinks from these awesome lists:

inquest/awesome-yara
