awesome-crawler
crawler frameworks
A collection of reusable web crawling and scraping components in multiple programming languages.
A collection of awesome web crawlers and spiders in different languages
6k stars
201 watching
710 forks
last commit: 5 months ago
Linked from 1 awesome list
Topics: awesome, crawler, node-crawler, scraper, spider, web-crawler, web-scraper
Awesome-crawler / Python | |||
Scrapy | 53,156 | 6 days ago | A fast high-level screen scraping and web crawling framework |
Awesome-crawler / Python / Scrapy | |||
django-dynamic-scraper | 1,153 | almost 3 years ago | Creating Scrapy scrapers via the Django admin interface |
Scrapy-Redis | 5,534 | 5 months ago | Redis-based components for Scrapy |
scrapy-cluster | 1,182 | about 1 year ago | Uses Redis and Kafka to create a distributed on demand scraping cluster |
distribute_crawler | 3,247 | over 7 years ago | Uses Scrapy, Redis, MongoDB, and Graphite to create a distributed spider |
Awesome-crawler / Python | |||
pyspider | 16,497 | 7 months ago | A powerful spider system |
CoCrawler | 187 | over 2 years ago | A versatile web crawler built using modern tools and concurrency |
cola | 1,500 | over 2 years ago | A distributed crawling framework |
Demiurge | 114 | almost 3 years ago | PyQuery-based scraping micro-framework |
Scrapely | 1,863 | over 2 years ago | A pure-python HTML screen-scraping library |
feedparser | | | Universal feed parser |
you-get | 53,851 | 24 days ago | Dumb downloader that scrapes the web |
MechanicalSoup | 4,672 | 7 days ago | A Python library for automating interaction with websites |
portia | 9,301 | 5 months ago | Visual scraping for Scrapy |
crawley | 186 | over 1 year ago | Pythonic Crawling / Scraping Framework based on Non Blocking I/O operations |
RoboBrowser | 3,702 | about 4 years ago | A simple, Pythonic library for browsing the web without a standalone web browser |
MSpider | 348 | over 2 years ago | A simple, easy spider using gevent and JS rendering |
brownant | 159 | over 7 years ago | A lightweight web data extracting framework |
PSpider | 1,827 | over 2 years ago | A simple spider framework in Python 3 |
Gain | 2,035 | over 5 years ago | Web crawling framework based on asyncio for everyone |
sukhoi | 881 | almost 4 years ago | Minimalist and powerful Web Crawler |
spidy | 340 | 4 months ago | The simple, easy to use command line web crawler |
newspaper | 14,171 | 4 months ago | News, full-text, and article metadata extraction in Python 3 |
aspider | 1,752 | over 1 year ago | An async web scraping micro-framework based on asyncio |
Awesome-crawler / Java | |||
ACHE Crawler | 454 | about 1 year ago | An easy to use web crawler for domain-specific search |
Apache Nutch | | | Highly extensible, highly scalable web crawler for production environments |
Awesome-crawler / Java / Apache Nutch | |||
anthelion | 2,842 | almost 9 years ago | A plugin for Apache Nutch to crawl semantic annotations within HTML pages |
Awesome-crawler / Java | |||
Crawler4j | 4,555 | about 3 years ago | Simple and lightweight web crawler |
JSoup | | | Scrapes, parses, manipulates and cleans HTML |
websphinx | | | Website-Specific Processors for HTML information extraction |
Open Search Server | | | A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything |
Gecco | 2,502 | 9 months ago | An easy-to-use, lightweight web crawler |
WebCollector | 3,068 | 8 months ago | Simple interfaces for crawling the web; you can set up a multi-threaded web crawler in less than 5 minutes |
Webmagic | 11,432 | 27 days ago | A scalable crawler framework |
Spiderman | | | A scalable, extensible, multi-threaded web crawler |
Awesome-crawler / Java / Spiderman | |||
Spiderman2 | | | A distributed web crawler framework that supports JS rendering |
Awesome-crawler / Java | |||
Heritrix3 | 2,833 | 15 days ago | Extensible, web-scale, archival-quality web crawler project |
SeimiCrawler | 1,980 | over 1 year ago | An agile, distributed crawler framework |
StormCrawler | | | An open source collection of resources for building low-latency, scalable web crawlers on Apache Storm |
Spark-Crawler | 410 | over 1 year ago | Evolving Apache Nutch to run on Spark |
webBee | 189 | 11 months ago | A DFS web spider |
spider-flow | 9,613 | over 1 year ago | A visual spider framework that lets you crawl websites without writing any code |
Norconex Web Crawler | 183 | 10 days ago | Norconex HTTP Collector is a full-featured web crawler (or spider) that can manipulate collected data and store it in a repository of your choice (e.g. a search engine). Can be used as a standalone application or embedded into Java applications |
Awesome-crawler / C# | |||
ccrawler | | | Built with C# 3.5. It contains a simple web content categorizer extension, which can separate web pages depending on their content |
SimpleCrawler | | | Simple spider based on multithreading and regular expressions |
DotnetSpider | 3,989 | about 2 months ago | A cross-platform, lightweight spider developed in C# |
Abot | 2,247 | 2 months ago | C# web crawler built for speed and flexibility |
Hawk | 3,160 | almost 5 years ago | Advanced Crawler and ETL tool written in C#/WPF |
SkyScraper | 58 | about 8 years ago | An asynchronous web scraper / web crawler using async / await and Reactive Extensions |
Infinity Crawler | 248 | 11 months ago | A simple but powerful web crawler library in C# |
Awesome-crawler / JavaScript | |||
scraperjs | 3,710 | about 4 years ago | A complete and versatile web scraper |
scrape-it | 4,012 | 7 days ago | A Node.js scraper for humans |
simplecrawler | 2,141 | over 3 years ago | Event driven web crawler |
node-crawler | 6,704 | 4 months ago | Node-crawler has a clean, simple API |
js-crawler | 253 | over 6 years ago | Web crawler for Node.js; both HTTP and HTTPS are supported |
webster | 515 | 16 days ago | A reliable web crawling framework which can scrape AJAX- and JS-rendered content in a web page |
x-ray | 5,878 | 23 days ago | Web scraper with pagination and crawler support |
node-osmosis | 4,116 | 11 months ago | HTML/XML parser and web scraper for Node.js |
web-scraper-chrome-extension | 1,314 | about 6 years ago | Web data extraction tool implemented as chrome extension |
supercrawler | 378 | almost 2 years ago | Define custom handlers to parse content. Obeys robots.txt, rate limits and concurrency limits |
headless-chrome-crawler | 5,527 | over 1 year ago | Headless Chrome crawls with jQuery support |
Squidwarc | 169 | over 4 years ago | High fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head |
crawlee | 15,604 | 8 days ago | A web scraping and browser automation library for Node.js that helps you build reliable crawlers. Fast |
Awesome-crawler / PHP | |||
Goutte | 9,261 | over 1 year ago | A screen scraping and web crawling library for PHP |
Awesome-crawler / PHP / Goutte | |||
laravel-goutte | 453 | 10 months ago | Laravel 5 Facade for Goutte |
Awesome-crawler / PHP | |||
dom-crawler | 3,961 | 8 days ago | The DomCrawler component eases DOM navigation for HTML and XML documents |
QueryList | 2,668 | 4 months ago | The progressive PHP crawler framework |
pspider | 266 | about 9 years ago | Parallel web crawler written in PHP |
php-spider | 1,332 | 5 months ago | A configurable and extensible PHP web spider |
spatie/crawler | 2,537 | 4 months ago | An easy to use, powerful crawler implemented in PHP. Can execute JavaScript |
crawlzone/crawlzone | 77 | over 1 year ago | Crawlzone is a fast asynchronous internet crawling framework for PHP |
PHPScraper | 536 | 8 months ago | PHPScraper is a scraper & crawler built for simplicity |
Awesome-crawler / C++ | |||
open-source-search-engine | 1,540 | 11 months ago | A distributed open source search engine and spider/crawler written in C/C++ |
Awesome-crawler / C | |||
httrack | 3,601 | 3 months ago | Copy websites to your computer |
Awesome-crawler / Ruby | |||
Nokogiri | 6,153 | 7 days ago | A Rubygem providing HTML, XML, SAX, and Reader parsers with XPath and CSS selector support |
upton | 1,613 | almost 6 years ago | A batteries-included framework for easy web scraping. Just add CSS (or do more) |
wombat | 1,315 | 10 months ago | Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages |
RubyRetriever | 143 | over 1 year ago | RubyRetriever is a Web Crawler, Scraper & File Harvester |
Spidr | 806 | 10 months ago | Spider a site, multiple domains, certain links or infinitely |
Cobweb | 226 | almost 2 years ago | Web crawler with very flexible crawling options, standalone or using sidekiq |
mechanize | 4,391 | about 2 months ago | Automated web interaction & crawling |
Awesome-crawler / Rust | |||
spider | 1,140 | 6 days ago | The fastest web crawler and indexer |
crawler | 49 | 3 months ago | A gRPC web indexer turbo charged for performance |
Awesome-crawler / R | |||
rvest | 1,492 | 27 days ago | Simple web scraping for R |
Awesome-crawler / Erlang | |||
ebot | 330 | over 13 years ago | A scalable, distributed and highly configurable web crawler |
Awesome-crawler / Perl | |||
web-scraper | 104 | over 7 years ago | Web Scraping Toolkit using HTML and CSS Selectors or XPath expressions |
Awesome-crawler / Go | |||
pholcus | 7,570 | about 2 years ago | A distributed, high concurrency and powerful web crawler |
gocrawl | 2,038 | over 3 years ago | Polite, slim and concurrent web crawler |
fetchbot | 786 | over 3 years ago | A simple and flexible web crawler that follows the robots.txt policies and crawl delays |
go_spider | 1,826 | about 7 years ago | An awesome Go concurrent crawler (spider) framework |
dht | 2,741 | over 3 years ago | BitTorrent DHT Protocol && DHT Spider |
ants-go | 363 | over 8 years ago | An open source, distributed, RESTful crawler engine in Golang |
scrape | 1,513 | almost 8 years ago | A simple, higher level interface for Go web scraping |
creeper | 780 | over 7 years ago | The Next Generation Crawler Framework (Go) |
colly | 23,317 | 4 months ago | Fast and Elegant Scraping Framework for Gophers |
ferret | 5,741 | 13 days ago | Declarative web scraping |
Dataflow kit | 662 | over 1 year ago | Extract structured data from web pages. Web site scraping |
Hakrawler | 4,502 | 10 months ago | Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application |
Awesome-crawler / Scala | |||
crawler | 148 | over 8 years ago | Scala DSL for web crawling |
scrala | 113 | about 5 years ago | Scala crawler (spider) framework, inspired by Scrapy |
ferrit | | | Ferrit is a web crawler service written in Scala using Akka, Spray and Cassandra |