awesome-crawler

crawler frameworks

A collection of reusable web crawling and scraping components in multiple programming languages.

A collection of awesome web crawlers and spiders in different languages.

GitHub: 6k stars, 201 watching, 710 forks, last commit 5 months ago, linked from 1 awesome list

Topics: awesome, crawler, node-crawler, scraper, spider, web-crawler, web-scraper

Awesome-crawler / Python

Scrapy 53,156 8 days ago A fast high-level screen scraping and web crawling framework

Awesome-crawler / Python / Scrapy

django-dynamic-scraper 1,153 almost 3 years ago Creating Scrapy scrapers via the Django admin interface
Scrapy-Redis 5,534 5 months ago Redis-based components for Scrapy
scrapy-cluster 1,182 about 1 year ago Uses Redis and Kafka to create a distributed on demand scraping cluster
distribute_crawler 3,247 over 7 years ago Uses Scrapy, Redis, MongoDB, and Graphite to create a distributed spider

Awesome-crawler / Python

pyspider 16,497 7 months ago A powerful spider system
CoCrawler 187 over 2 years ago A versatile web crawler built using modern tools and concurrency
cola 1,500 over 2 years ago A distributed crawling framework
Demiurge 114 almost 3 years ago PyQuery-based scraping micro-framework
Scrapely 1,863 over 2 years ago A pure-python HTML screen-scraping library
feedparser Universal feed parser
you-get 53,851 26 days ago Dumb downloader that scrapes the web
MechanicalSoup 4,672 9 days ago A Python library for automating interaction with websites
portia 9,301 5 months ago Visual scraping for Scrapy
crawley 186 over 1 year ago Pythonic Crawling / Scraping Framework based on Non Blocking I/O operations
RoboBrowser 3,702 about 4 years ago A simple, Pythonic library for browsing the web without a standalone web browser
MSpider 348 over 2 years ago A simple, easy spider using gevent and JS rendering
brownant 159 over 7 years ago A lightweight web data extracting framework
PSpider 1,827 over 2 years ago A simple spider frame in Python3
Gain 2,035 over 5 years ago Web crawling framework based on asyncio for everyone
sukhoi 881 almost 4 years ago Minimalist and powerful Web Crawler
spidy 340 4 months ago The simple, easy to use command line web crawler
newspaper 14,171 4 months ago News, full-text, and article metadata extraction in Python 3
aspider 1,752 over 1 year ago An async web scraping micro-framework based on asyncio

Awesome-crawler / Java

ACHE Crawler 454 about 1 year ago An easy to use web crawler for domain-specific search
Apache Nutch Highly extensible, highly scalable web crawler for production environments

Awesome-crawler / Java / Apache Nutch

anthelion 2,842 almost 9 years ago A plugin for Apache Nutch to crawl semantic annotations within HTML pages

Awesome-crawler / Java

Crawler4j 4,557 about 3 years ago Simple and lightweight web crawler
JSoup Scrapes, parses, manipulates and cleans HTML
websphinx Website-Specific Processors for HTML information extraction
Open Search Server A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything
Gecco 2,502 9 months ago An easy-to-use, lightweight web crawler
WebCollector 3,068 8 months ago Simple interfaces for crawling the Web; you can set up a multi-threaded web crawler in less than 5 minutes
Webmagic 11,437 29 days ago A scalable crawler framework
Spiderman A scalable, extensible, multi-threaded web crawler

Awesome-crawler / Java / Spiderman

Spiderman2 A distributed web crawler framework that supports JS rendering

Awesome-crawler / Java

Heritrix3 2,833 17 days ago Extensible, web-scale, archival-quality web crawler project
SeimiCrawler 1,980 over 1 year ago An agile, distributed crawler framework
StormCrawler An open source collection of resources for building low-latency, scalable web crawlers on Apache Storm
Spark-Crawler 410 over 1 year ago Evolving Apache Nutch to run on Spark
webBee 189 11 months ago A DFS web spider
spider-flow 9,613 over 1 year ago A visual spider framework; it's so good that you don't need to write any code to crawl a website
Norconex Web Crawler 183 13 days ago Norconex HTTP Collector is a full-featured web crawler (or spider) that can manipulate and store collected data in a repository of your choice (e.g. a search engine). Can be used as a standalone application or embedded into Java applications

Awesome-crawler / C#

ccrawler Built in C# 3.5. It contains a simple extension of a web content categorizer, which can distinguish between web pages depending on their content
SimpleCrawler Simple spider based on multithreading and regular expressions
DotnetSpider 3,989 about 2 months ago A cross-platform, lightweight spider developed in C#
Abot 2,247 3 months ago C# web crawler built for speed and flexibility
Hawk 3,160 almost 5 years ago Advanced Crawler and ETL tool written in C#/WPF
SkyScraper 58 about 8 years ago An asynchronous web scraper / web crawler using async / await and Reactive Extensions
Infinity Crawler 248 11 months ago A simple but powerful web crawler library in C#

Awesome-crawler / JavaScript

scraperjs 3,710 about 4 years ago A complete and versatile web scraper
scrape-it 4,012 10 days ago A Node.js scraper for humans
simplecrawler 2,141 over 3 years ago Event driven web crawler
node-crawler 6,704 4 months ago Node-crawler has a clean, simple API
js-crawler 253 over 6 years ago Web crawler for Node.JS, both HTTP and HTTPS are supported
webster 515 19 days ago A reliable web crawling framework which can scrape ajax and js rendered content in a web page
x-ray 5,878 25 days ago Web scraper with pagination and crawler support
node-osmosis 4,116 12 months ago HTML/XML parser and web scraper for Node.js
web-scraper-chrome-extension 1,314 about 6 years ago Web data extraction tool implemented as chrome extension
supercrawler 378 almost 2 years ago Define custom handlers to parse content. Obeys robots.txt, rate limits and concurrency limits
headless-chrome-crawler 5,527 over 1 year ago Headless Chrome crawls with jQuery support
Squidwarc 169 over 4 years ago High fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head
crawlee 15,740 1 day ago A web scraping and browser automation library for Node.js that helps you build reliable crawlers. Fast

Awesome-crawler / PHP

Goutte 9,261 over 1 year ago A screen scraping and web crawling library for PHP

Awesome-crawler / PHP / Goutte

laravel-goutte 453 10 months ago Laravel 5 Facade for Goutte

Awesome-crawler / PHP

dom-crawler 3,961 10 days ago The DomCrawler component eases DOM navigation for HTML and XML documents
QueryList 2,668 4 months ago The progressive PHP crawler framework
pspider 266 about 9 years ago Parallel web crawler written in PHP
php-spider 1,332 5 months ago A configurable and extensible PHP web spider
spatie/crawler 2,537 4 months ago An easy to use, powerful crawler implemented in PHP. Can execute Javascript
crawlzone/crawlzone 77 over 1 year ago Crawlzone is a fast asynchronous internet crawling framework for PHP
PHPScraper 536 8 months ago PHPScraper is a scraper & crawler built for simplicity

Awesome-crawler / C++

open-source-search-engine 1,540 11 months ago A distributed open source search engine and spider/crawler written in C/C++

Awesome-crawler / C

httrack 3,601 3 months ago Copy websites to your computer

Awesome-crawler / Ruby

Nokogiri 6,153 9 days ago A Rubygem providing HTML, XML, SAX, and Reader parsers with XPath and CSS selector support
upton 1,613 almost 6 years ago A batteries-included framework for easy web scraping. Just add CSS (or do more)
wombat 1,315 10 months ago Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages
RubyRetriever 143 over 1 year ago RubyRetriever is a Web Crawler, Scraper & File Harvester
Spidr 806 10 months ago Spider a site, multiple domains, certain links or infinitely
Cobweb 226 almost 2 years ago Web crawler with very flexible crawling options, standalone or using sidekiq
mechanize 4,391 about 2 months ago Automated web interaction & crawling

Awesome-crawler / Rust

spider 1,140 8 days ago The fastest web crawler and indexer
crawler 49 3 months ago A gRPC web indexer turbo charged for performance

Awesome-crawler / R

rvest 1,492 29 days ago Simple web scraping for R

Awesome-crawler / Erlang

ebot 330 over 13 years ago A scalable, distributed and highly configurable web crawler

Awesome-crawler / Perl

web-scraper 104 over 7 years ago Web Scraping Toolkit using HTML and CSS Selectors or XPath expressions

Awesome-crawler / Go

pholcus 7,570 about 2 years ago A distributed, high concurrency and powerful web crawler
gocrawl 2,038 over 3 years ago Polite, slim and concurrent web crawler
fetchbot 786 over 3 years ago A simple and flexible web crawler that follows the robots.txt policies and crawl delays
go_spider 1,826 about 7 years ago An awesome Go concurrent crawler (spider) framework
dht 2,741 over 3 years ago BitTorrent DHT Protocol && DHT Spider
ants-go 363 over 8 years ago An open source, distributed, RESTful crawler engine in Go
scrape 1,513 almost 8 years ago A simple, higher level interface for Go web scraping
creeper 780 over 7 years ago The Next Generation Crawler Framework (Go)
colly 23,351 4 months ago Fast and Elegant Scraping Framework for Gophers
ferret 5,741 15 days ago Declarative web scraping
Dataflow kit 662 over 1 year ago Extract structured data from web pages. Website scraping
Hakrawler 4,502 10 months ago Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application

Awesome-crawler / Scala

crawler 148 over 8 years ago Scala DSL for web crawling
scrala 113 about 5 years ago Scala crawler (spider) framework, inspired by Scrapy
ferrit Ferrit is a web crawler service written in Scala using Akka, Spray and Cassandra
