awesome-crawler

crawler frameworks

A collection of reusable web crawling and scraping components in multiple programming languages.

A collection of awesome web crawlers and spiders in different languages

GitHub

7k stars
201 watching
709 forks
last commit: about 1 year ago
Linked from 1 awesome list

awesome, crawler, node-crawler, scraper, spider, web-crawler, web-scraper

Awesome-crawler / Python

Scrapy 53,484 7 months ago A fast high-level screen scraping and web crawling framework

Awesome-crawler / Python / Scrapy

django-dynamic-scraper 1,155 over 3 years ago Creating Scrapy scrapers via the Django admin interface
Scrapy-Redis 5,548 about 1 year ago Redis-based components for Scrapy
scrapy-cluster 1,185 over 1 year ago Uses Redis and Kafka to create a distributed on demand scraping cluster
distribute_crawler 3,245 about 8 years ago Uses Scrapy, Redis, MongoDB, and Graphite to create a distributed spider

Awesome-crawler / Python

pyspider 16,511 about 1 year ago A powerful spider system
CoCrawler 188 about 3 years ago A versatile web crawler built using modern tools and concurrency
cola 1,501 almost 3 years ago A distributed crawling framework
Demiurge 115 over 3 years ago PyQuery-based scraping micro-framework
Scrapely 1,865 over 3 years ago A pure-python HTML screen-scraping library
feedparser Universal feed parser
you-get 54,175 7 months ago Dumb downloader that scrapes the web
MechanicalSoup 4,685 8 months ago A Python library for automating interaction with websites
portia 9,327 about 1 year ago Visual scraping for Scrapy
crawley 188 about 2 years ago Pythonic Crawling / Scraping Framework based on Non Blocking I/O operations
RoboBrowser 3,703 almost 5 years ago A simple, Pythonic library for browsing the web without a standalone web browser
MSpider 348 about 3 years ago A simple, easy spider using gevent and JS rendering
brownant 159 over 8 years ago A lightweight web data extracting framework
PSpider 1,828 about 3 years ago A simple spider frame in Python3
Gain 2,037 about 6 years ago Web crawling framework based on asyncio for everyone
sukhoi 879 over 4 years ago Minimalist and powerful Web Crawler
spidy 340 11 months ago The simple, easy to use command line web crawler
newspaper 14,220 12 months ago News, full-text, and article metadata extraction in Python 3
aspider 1,753 about 2 years ago An async web scraping micro-framework based on asyncio

Awesome-crawler / Java

ACHE Crawler 459 almost 2 years ago An easy to use web crawler for domain-specific search
Apache Nutch Highly extensible, highly scalable web crawler for production environments

Awesome-crawler / Java / Apache Nutch

anthelion 2,841 over 9 years ago A plugin for Apache Nutch to crawl semantic annotations within HTML pages

Awesome-crawler / Java

Crawler4j 4,563 over 3 years ago Simple and lightweight web crawler
JSoup Scrapes, parses, manipulates and cleans HTML
websphinx Website-Specific Processors for HTML information extraction
Open Search Server A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything
Gecco 2,504 over 1 year ago An easy-to-use, lightweight web crawler
WebCollector 3,074 over 1 year ago Simple interfaces for crawling the Web; you can set up a multi-threaded web crawler in less than 5 minutes
Webmagic 11,456 7 months ago A scalable crawler framework
Spiderman A scalable, extensible, multi-threaded web crawler

Awesome-crawler / Java / Spiderman

Spiderman2 A distributed web crawler framework that supports JS rendering

Awesome-crawler / Java

Heritrix3 2,857 7 months ago Extensible, web-scale, archival-quality web crawler project
SeimiCrawler 1,980 8 months ago An agile, distributed crawler framework
StormCrawler An open source collection of resources for building low-latency, scalable web crawlers on Apache Storm
Spark-Crawler 411 over 2 years ago Evolving Apache Nutch to run on Spark
webBee 189 over 1 year ago A DFS web spider
spider-flow 9,701 about 2 years ago A visual spider framework; it's so good that you don't need to write any code to crawl a website
Norconex Web Crawler 184 7 months ago Norconex HTTP Collector is a full-featured web crawler (or spider) that can manipulate and store collected data in a repository of your choice (e.g. a search engine). Can be used as a standalone application or embedded in Java applications

Awesome-crawler / C#

ccrawler Built with C# 3.5. It contains a simple web content categorizer extension that can separate web pages according to their content
SimpleCrawler Simple spider based on multithreading and regular expressions
DotnetSpider 4,007 10 months ago A cross-platform, lightweight spider developed in C#
Abot 2,255 10 months ago C# web crawler built for speed and flexibility
Hawk 3,163 over 5 years ago Advanced Crawler and ETL tool written in C#/WPF
SkyScraper 59 almost 9 years ago An asynchronous web scraper / web crawler using async / await and Reactive Extensions
Infinity Crawler 248 over 1 year ago A simple but powerful web crawler library in C#

Awesome-crawler / JavaScript

scraperjs 3,714 over 4 years ago A complete and versatile web scraper
scrape-it 4,024 8 months ago A Node.js scraper for humans
simplecrawler 2,143 over 4 years ago Event driven web crawler
node-crawler 6,718 11 months ago Node-crawler has a clean, simple API
js-crawler 254 about 7 years ago Web crawler for Node.js; both HTTP and HTTPS are supported
webster 518 7 months ago A reliable web crawling framework which can scrape AJAX- and JS-rendered content in a web page
x-ray 5,883 7 months ago Web scraper with pagination and crawler support
node-osmosis 4,115 over 1 year ago HTML/XML parser and web scraper for Node.js
web-scraper-chrome-extension 1,318 over 6 years ago Web data extraction tool implemented as a Chrome extension
supercrawler 380 over 2 years ago Define custom handlers to parse content. Obeys robots.txt, rate limits and concurrency limits
headless-chrome-crawler 5,534 about 2 years ago Headless Chrome crawls with jQuery support
Squidwarc 170 about 5 years ago High fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head
crawlee 16,081 7 months ago A web scraping and browser automation library for Node.js that helps you build reliable crawlers. Fast

Awesome-crawler / PHP

Goutte 9,264 over 2 years ago A screen scraping and web crawling library for PHP

Awesome-crawler / PHP / Goutte

laravel-goutte 453 over 1 year ago Laravel 5 Facade for Goutte

Awesome-crawler / PHP

dom-crawler 3,974 7 months ago The DomCrawler component eases DOM navigation for HTML and XML documents
QueryList 2,671 7 months ago The progressive PHP crawler framework
pspider 266 almost 10 years ago Parallel web crawler written in PHP
php-spider 1,336 about 1 year ago A configurable and extensible PHP web spider
spatie/crawler 2,552 7 months ago An easy-to-use, powerful crawler implemented in PHP. Can execute JavaScript
crawlzone/crawlzone 78 about 2 years ago Crawlzone is a fast asynchronous internet crawling framework for PHP
PHPScraper 544 over 1 year ago PHPScraper is a scraper & crawler built for simplicity

Awesome-crawler / C++

open-source-search-engine 1,546 over 1 year ago A distributed open source search engine and spider/crawler written in C/C++

Awesome-crawler / C

httrack 3,648 11 months ago Copy websites to your computer

Awesome-crawler / Ruby

Nokogiri 6,164 7 months ago A Rubygem providing HTML, XML, SAX, and Reader parsers with XPath and CSS selector support
upton 1,612 over 6 years ago A batteries-included framework for easy web scraping. Just add CSS (or do more)
wombat 1,315 over 1 year ago Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages
RubyRetriever 143 over 2 years ago RubyRetriever is a Web Crawler, Scraper & File Harvester
Spidr 809 over 1 year ago Spider a site, multiple domains, certain links or infinitely
Cobweb 226 over 2 years ago Web crawler with very flexible crawling options, standalone or using sidekiq
mechanize 4,396 9 months ago Automated web interaction & crawling

Awesome-crawler / Rust

spider 1,234 7 months ago The fastest web crawler and indexer
crawler 51 11 months ago A gRPC web indexer turbo charged for performance

Awesome-crawler / R

rvest 1,495 9 months ago Simple web scraping for R

Awesome-crawler / Erlang

ebot 330 over 14 years ago A scalable, distributed, and highly configurable web crawler

Awesome-crawler / Perl

web-scraper 104 about 8 years ago Web Scraping Toolkit using HTML and CSS Selectors or XPath expressions

Awesome-crawler / Go

pholcus 7,578 over 2 years ago A distributed, high concurrency and powerful web crawler
gocrawl 2,036 about 4 years ago Polite, slim and concurrent web crawler
fetchbot 787 about 4 years ago A simple and flexible web crawler that follows the robots.txt policies and crawl delays
go_spider 1,827 over 7 years ago An awesome Go concurrent crawler (spider) framework
dht 2,741 almost 4 years ago BitTorrent DHT Protocol && DHT Spider
ants-go 363 over 9 years ago An open source, distributed, RESTful crawler engine in Go
scrape 1,513 over 8 years ago A simple, higher level interface for Go web scraping
creeper 780 about 8 years ago The Next Generation Crawler Framework (Go)
colly 23,444 12 months ago Fast and Elegant Scraping Framework for Gophers
ferret 5,760 7 months ago Declarative web scraping
Dataflow kit 667 over 2 years ago Extract structured data from web pages. Web site scraping
Hakrawler 4,528 over 1 year ago Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application

Awesome-crawler / Scala

crawler 149 almost 9 years ago Scala DSL for web crawling
scrala 113 almost 6 years ago Scala crawler (spider) framework, inspired by Scrapy
ferrit Ferrit is a web crawler service written in Scala using Akka, Spray and Cassandra
