awesome-crawler

crawler frameworks

A collection of reusable web crawling and scraping components in multiple programming languages.

A collection of awesome web crawlers and spiders in different languages.

GitHub

7k stars
201 watching
709 forks
last commit: 7 months ago
Linked from 1 awesome list

awesome, crawler, node-crawler, scraper, spider, web-crawler, web-scraper

Awesome-crawler / Python

Scrapy 53,484 about 1 month ago A fast high-level screen scraping and web crawling framework

Awesome-crawler / Python / Scrapy

django-dynamic-scraper 1,155 almost 3 years ago Creating Scrapy scrapers via the Django admin interface
Scrapy-Redis 5,548 7 months ago Redis-based components for Scrapy
scrapy-cluster 1,185 about 1 year ago Uses Redis and Kafka to create a distributed on demand scraping cluster
distribute_crawler 3,245 over 7 years ago Uses Scrapy, Redis, MongoDB, and Graphite to create a distributed spider

Awesome-crawler / Python

pyspider 16,511 9 months ago A powerful spider system
CoCrawler 188 over 2 years ago A versatile web crawler built using modern tools and concurrency
cola 1,501 over 2 years ago A distributed crawling framework
Demiurge 115 about 3 years ago PyQuery-based scraping micro-framework
Scrapely 1,865 almost 3 years ago A pure-python HTML screen-scraping library
feedparser Universal feed parser
you-get 54,175 about 1 month ago Dumb downloader that scrapes the web
MechanicalSoup 4,685 2 months ago A Python library for automating interaction with websites
portia 9,327 7 months ago Visual scraping for Scrapy
crawley 188 almost 2 years ago Pythonic crawling/scraping framework based on non-blocking I/O operations
RoboBrowser 3,703 over 4 years ago A simple, Pythonic library for browsing the web without a standalone web browser
MSpider 348 over 2 years ago A simple, easy spider using gevent and JS rendering
brownant 159 almost 8 years ago A lightweight web data extracting framework
PSpider 1,828 over 2 years ago A simple spider frame in Python3
Gain 2,037 over 5 years ago Web crawling framework based on asyncio for everyone
sukhoi 879 about 4 years ago Minimalist and powerful Web Crawler
spidy 340 5 months ago The simple, easy to use command line web crawler
newspaper 14,220 6 months ago News, full-text, and article metadata extraction in Python 3
aspider 1,753 over 1 year ago An async web scraping micro-framework based on asyncio

Awesome-crawler / Java

ACHE Crawler 459 over 1 year ago An easy to use web crawler for domain-specific search
Apache Nutch Highly extensible, highly scalable web crawler for production environments

Awesome-crawler / Java / Apache Nutch

anthelion 2,841 about 9 years ago A plugin for Apache Nutch to crawl semantic annotations within HTML pages

Awesome-crawler / Java

Crawler4j 4,563 about 3 years ago Simple and lightweight web crawler
JSoup Scrapes, parses, manipulates and cleans HTML
websphinx Website-Specific Processors for HTML information extraction
Open Search Server A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything
Gecco 2,504 11 months ago An easy-to-use, lightweight web crawler
WebCollector 3,074 10 months ago Simple interfaces for crawling the Web; you can set up a multi-threaded web crawler in less than 5 minutes
Webmagic 11,456 about 1 month ago A scalable crawler framework
Spiderman A scalable, extensible, multi-threaded web crawler

Awesome-crawler / Java / Spiderman

Spiderman2 A distributed web crawler framework that supports JS rendering

Awesome-crawler / Java

Heritrix3 2,857 about 2 months ago Extensible, web-scale, archival-quality web crawler project
SeimiCrawler 1,980 about 2 months ago An agile, distributed crawler framework
StormCrawler An open source collection of resources for building low-latency, scalable web crawlers on Apache Storm
Spark-Crawler 411 almost 2 years ago Evolving Apache Nutch to run on Spark
webBee 189 about 1 year ago A DFS web spider
spider-flow 9,701 over 1 year ago A visual spider framework; it's so good that you don't need to write any code to crawl a website
Norconex Web Crawler 184 about 1 month ago Norconex HTTP Collector is a full-featured web crawler (or spider) that can manipulate and store collected data in a repository of your choice (e.g. a search engine). Can be used as a standalone application or embedded in Java applications

Awesome-crawler / C#

ccrawler Built in C# 3.5. It includes a simple web content categorizer extension, which can separate web pages depending on their content
SimpleCrawler Simple spider based on multithreading and regular expressions
DotnetSpider 4,007 4 months ago A cross-platform, lightweight spider developed in C#
Abot 2,255 4 months ago C# web crawler built for speed and flexibility
Hawk 3,163 about 5 years ago Advanced Crawler and ETL tool written in C#/WPF
SkyScraper 59 over 8 years ago An asynchronous web scraper / web crawler using async / await and Reactive Extensions
Infinity Crawler 248 about 1 year ago A simple but powerful web crawler library in C#

Awesome-crawler / JavaScript

scraperjs 3,714 about 4 years ago A complete and versatile web scraper
scrape-it 4,024 2 months ago A Node.js scraper for humans
simplecrawler 2,143 almost 4 years ago Event driven web crawler
node-crawler 6,718 6 months ago Node-crawler has a clean, simple API
js-crawler 254 over 6 years ago Web crawler for Node.JS, both HTTP and HTTPS are supported
webster 518 about 1 month ago A reliable web crawling framework which can scrape ajax and js rendered content in a web page
x-ray 5,883 about 1 month ago Web scraper with pagination and crawler support
node-osmosis 4,115 about 1 year ago HTML/XML parser and web scraper for Node.js
web-scraper-chrome-extension 1,318 about 6 years ago Web data extraction tool implemented as chrome extension
supercrawler 380 about 2 years ago Define custom handlers to parse content. Obeys robots.txt, rate limits and concurrency limits
headless-chrome-crawler 5,534 over 1 year ago Headless Chrome crawls with jQuery support
Squidwarc 170 over 4 years ago High fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head
crawlee 16,081 about 1 month ago A web scraping and browser automation library for Node.js that helps you build reliable crawlers. Fast

Awesome-crawler / PHP

Goutte 9,264 almost 2 years ago A screen scraping and web crawling library for PHP

Awesome-crawler / PHP / Goutte

laravel-goutte 453 12 months ago Laravel 5 Facade for Goutte

Awesome-crawler / PHP

dom-crawler 3,974 about 2 months ago The DomCrawler component eases DOM navigation for HTML and XML documents
QueryList 2,671 about 1 month ago The progressive PHP crawler framework
pspider 266 over 9 years ago Parallel web crawler written in PHP
php-spider 1,336 7 months ago A configurable and extensible PHP web spider
spatie/crawler 2,552 about 1 month ago An easy to use, powerful crawler implemented in PHP. Can execute JavaScript
crawlzone/crawlzone 78 over 1 year ago Crawlzone is a fast asynchronous internet crawling framework for PHP
PHPScraper 544 9 months ago PHPScraper is a scraper & crawler built for simplicity

Awesome-crawler / C++

open-source-search-engine 1,546 about 1 year ago A distributed open source search engine and spider/crawler written in C/C++

Awesome-crawler / C

httrack 3,648 5 months ago Copy websites to your computer

Awesome-crawler / Ruby

Nokogiri 6,164 about 1 month ago A Rubygem providing HTML, XML, SAX, and Reader parsers with XPath and CSS selector support
upton 1,612 about 6 years ago A batteries-included framework for easy web-scraping. Just add CSS (or do more)
wombat 1,315 12 months ago Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages
RubyRetriever 143 almost 2 years ago RubyRetriever is a Web Crawler, Scraper & File Harvester
Spidr 809 12 months ago Spider a site, multiple domains, certain links or infinitely
Cobweb 226 about 2 years ago Web crawler with very flexible crawling options, standalone or using sidekiq
mechanize 4,396 4 months ago Automated web interaction & crawling

Awesome-crawler / Rust

spider 1,234 about 1 month ago The fastest web crawler and indexer
crawler 51 5 months ago A gRPC web indexer turbo charged for performance

Awesome-crawler / R

rvest 1,495 3 months ago Simple web scraping for R

Awesome-crawler / Erlang

ebot 330 almost 14 years ago A scalable, distributed and highly configurable web crawler

Awesome-crawler / Perl

web-scraper 104 over 7 years ago Web Scraping Toolkit using HTML and CSS Selectors or XPath expressions

Awesome-crawler / Go

pholcus 7,578 about 2 years ago A distributed, high concurrency and powerful web crawler
gocrawl 2,036 over 3 years ago Polite, slim and concurrent web crawler
fetchbot 787 over 3 years ago A simple and flexible web crawler that follows the robots.txt policies and crawl delays
go_spider 1,827 about 7 years ago An awesome Go concurrent crawler (spider) framework
dht 2,741 over 3 years ago BitTorrent DHT Protocol && DHT Spider
ants-go 363 almost 9 years ago An open source, distributed, RESTful crawler engine in golang
scrape 1,513 about 8 years ago A simple, higher level interface for Go web scraping
creeper 780 over 7 years ago The Next Generation Crawler Framework (Go)
colly 23,444 6 months ago Fast and Elegant Scraping Framework for Gophers
ferret 5,760 about 1 month ago Declarative web scraping
Dataflow kit 667 almost 2 years ago Extract structured data from web pages. Website scraping
Hakrawler 4,528 12 months ago Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application

Awesome-crawler / Scala

crawler 149 over 8 years ago Scala DSL for web crawling
scrala 113 over 5 years ago Scala crawler (spider) framework, inspired by Scrapy
ferrit Ferrit is a web crawler service written in Scala using Akka, Spray and Cassandra
