awesome-crawler
crawler frameworks
A collection of reusable web crawling and scraping components in multiple programming languages.
A collection of awesome web crawlers and spiders in different languages
7k stars
201 watching
709 forks
last commit: over 1 year ago
Linked from 1 awesome list
awesome, crawler, node-crawler, scraper, spider, web-crawler, web-scraper
Awesome-crawler / Python | |||
| Scrapy | 53,484 | 11 months ago | A fast high-level screen scraping and web crawling framework |
Awesome-crawler / Python / Scrapy | |||
| django-dynamic-scraper | 1,155 | over 3 years ago | Creating Scrapy scrapers via the Django admin interface |
| Scrapy-Redis | 5,548 | over 1 year ago | Redis-based components for Scrapy |
| scrapy-cluster | 1,185 | almost 2 years ago | Uses Redis and Kafka to create a distributed on demand scraping cluster |
| distribute_crawler | 3,245 | over 8 years ago | Uses Scrapy, Redis, MongoDB, and Graphite to create a distributed spider |
Awesome-crawler / Python | |||
| pyspider | 16,511 | over 1 year ago | A powerful spider system |
| CoCrawler | 188 | over 3 years ago | A versatile web crawler built using modern tools and concurrency |
| cola | 1,501 | about 3 years ago | A distributed crawling framework |
| Demiurge | 115 | almost 4 years ago | PyQuery-based scraping micro-framework |
| Scrapely | 1,865 | over 3 years ago | A pure-python HTML screen-scraping library |
| feedparser | | | Universal feed parser |
| you-get | 54,175 | 11 months ago | Dumb downloader that scrapes the web |
| MechanicalSoup | 4,685 | 12 months ago | A Python library for automating interaction with websites |
| portia | 9,327 | over 1 year ago | Visual scraping for Scrapy |
| crawley | 188 | over 2 years ago | Pythonic Crawling / Scraping Framework based on Non Blocking I/O operations |
| RoboBrowser | 3,703 | about 5 years ago | A simple, Pythonic library for browsing the web without a standalone web browser |
| MSpider | 348 | over 3 years ago | A simple, easy spider using gevent and JS rendering |
| brownant | 159 | over 8 years ago | A lightweight web data extracting framework |
| PSpider | 1,828 | over 3 years ago | A simple spider frame in Python3 |
| Gain | 2,037 | over 6 years ago | Web crawling framework based on asyncio for everyone |
| sukhoi | 879 | almost 5 years ago | Minimalist and powerful Web Crawler |
| spidy | 340 | about 1 year ago | The simple, easy to use command line web crawler |
| newspaper | 14,220 | over 1 year ago | News, full-text, and article metadata extraction in Python 3 |
| aspider | 1,753 | over 2 years ago | An async web scraping micro-framework based on asyncio |
Awesome-crawler / Java | |||
| ACHE Crawler | 459 | about 2 years ago | An easy to use web crawler for domain-specific search |
| Apache Nutch | | | Highly extensible, highly scalable web crawler for production environments |
Awesome-crawler / Java / Apache Nutch | |||
| anthelion | 2,841 | almost 10 years ago | A plugin for Apache Nutch to crawl semantic annotations within HTML pages |
Awesome-crawler / Java | |||
| Crawler4j | 4,563 | almost 4 years ago | Simple and lightweight web crawler |
| JSoup | | | Scrapes, parses, manipulates and cleans HTML |
| websphinx | | | Website-Specific Processors for HTML information extraction |
| Open Search Server | | | A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything |
| Gecco | 2,504 | over 1 year ago | An easy-to-use, lightweight web crawler |
| WebCollector | 3,074 | over 1 year ago | Simple interfaces for crawling the Web; you can set up a multi-threaded web crawler in less than 5 minutes |
| Webmagic | 11,456 | 11 months ago | A scalable crawler framework |
| Spiderman | | | A scalable, extensible, multi-threaded web crawler |
Awesome-crawler / Java / Spiderman | |||
| Spiderman2 | | | A distributed web crawler framework that supports JS rendering |
Awesome-crawler / Java | |||
| Heritrix3 | 2,857 | 11 months ago | Extensible, web-scale, archival-quality web crawler project |
| SeimiCrawler | 1,980 | 11 months ago | An agile, distributed crawler framework |
| StormCrawler | | | An open source collection of resources for building low-latency, scalable web crawlers on Apache Storm |
| Spark-Crawler | 411 | over 2 years ago | Evolving Apache Nutch to run on Spark |
| webBee | 189 | almost 2 years ago | A DFS web spider |
| spider-flow | 9,701 | over 2 years ago | A visual spider framework; it's so good that you don't need to write any code to crawl a website |
| Norconex Web Crawler | 184 | 11 months ago | Norconex HTTP Collector is a full-featured web crawler (or spider) that can manipulate and store collected data in a repository of your choice (e.g. a search engine). Can be used as a standalone application or embedded into Java applications |
Awesome-crawler / C# | |||
| ccrawler | | | Built in C# 3.5. It contains a simple web content categorizer extension, which can separate web pages depending on their content |
| SimpleCrawler | | | Simple spider based on multithreading and regular expressions |
| DotnetSpider | 4,007 | about 1 year ago | A cross-platform, lightweight spider developed in C# |
| Abot | 2,255 | about 1 year ago | C# web crawler built for speed and flexibility |
| Hawk | 3,163 | almost 6 years ago | Advanced Crawler and ETL tool written in C#/WPF |
| SkyScraper | 59 | about 9 years ago | An asynchronous web scraper / web crawler using async / await and Reactive Extensions |
| Infinity Crawler | 248 | almost 2 years ago | A simple but powerful web crawler library in C# |
Awesome-crawler / JavaScript | |||
| scraperjs | 3,714 | about 5 years ago | A complete and versatile web scraper |
| scrape-it | 4,024 | 12 months ago | A Node.js scraper for humans |
| simplecrawler | 2,143 | over 4 years ago | Event driven web crawler |
| node-crawler | 6,718 | about 1 year ago | Node-crawler has a clean, simple API |
| js-crawler | 254 | over 7 years ago | Web crawler for Node.js; both HTTP and HTTPS are supported |
| webster | 518 | 11 months ago | A reliable web crawling framework which can scrape ajax and js rendered content in a web page |
| x-ray | 5,883 | 11 months ago | Web scraper with pagination and crawler support |
| node-osmosis | 4,115 | almost 2 years ago | HTML/XML parser and web scraper for Node.js |
| web-scraper-chrome-extension | 1,318 | about 7 years ago | Web data extraction tool implemented as a Chrome extension |
| supercrawler | 380 | almost 3 years ago | Define custom handlers to parse content. Obeys robots.txt, rate limits and concurrency limits |
| headless-chrome-crawler | 5,534 | over 2 years ago | Headless Chrome crawls with jQuery support |
| Squidwarc | 170 | over 5 years ago | High fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head |
| crawlee | 16,081 | 11 months ago | A web scraping and browser automation library for Node.js that helps you build reliable crawlers. Fast |
Awesome-crawler / PHP | |||
| Goutte | 9,264 | over 2 years ago | A screen scraping and web crawling library for PHP |
Awesome-crawler / PHP / Goutte | |||
| laravel-goutte | 453 | almost 2 years ago | Laravel 5 Facade for Goutte |
Awesome-crawler / PHP | |||
| dom-crawler | 3,974 | 11 months ago | The DomCrawler component eases DOM navigation for HTML and XML documents |
| QueryList | 2,671 | 11 months ago | The progressive PHP crawler framework |
| pspider | 266 | about 10 years ago | Parallel web crawler written in PHP |
| php-spider | 1,336 | over 1 year ago | A configurable and extensible PHP web spider |
| spatie/crawler | 2,552 | 11 months ago | An easy to use, powerful crawler implemented in PHP. Can execute JavaScript |
| crawlzone/crawlzone | 78 | over 2 years ago | Crawlzone is a fast asynchronous internet crawling framework for PHP |
| PHPScraper | 544 | over 1 year ago | PHPScraper is a scraper & crawler built for simplicity |
Awesome-crawler / C++ | |||
| open-source-search-engine | 1,546 | almost 2 years ago | A distributed open source search engine and spider/crawler written in C/C++ |
Awesome-crawler / C | |||
| httrack | 3,648 | about 1 year ago | Copy websites to your computer |
Awesome-crawler / Ruby | |||
| Nokogiri | 6,164 | 11 months ago | A Rubygem providing HTML, XML, SAX, and Reader parsers with XPath and CSS selector support |
| upton | 1,612 | almost 7 years ago | A batteries-included framework for easy web-scraping. Just add CSS (or do more) |
| wombat | 1,315 | almost 2 years ago | Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages |
| RubyRetriever | 143 | over 2 years ago | RubyRetriever is a Web Crawler, Scraper & File Harvester |
| Spidr | 809 | almost 2 years ago | Spider a site, multiple domains, certain links or infinitely |
| Cobweb | 226 | almost 3 years ago | Web crawler with very flexible crawling options, standalone or using sidekiq |
| mechanize | 4,396 | about 1 year ago | Automated web interaction & crawling |
Awesome-crawler / Rust | |||
| spider | 1,234 | 11 months ago | The fastest web crawler and indexer |
| crawler | 51 | about 1 year ago | A gRPC web indexer turbo charged for performance |
Awesome-crawler / R | |||
| rvest | 1,495 | about 1 year ago | Simple web scraping for R |
Awesome-crawler / Erlang | |||
| ebot | 330 | over 14 years ago | A scalable, distributed and highly configurable web crawler |
Awesome-crawler / Perl | |||
| web-scraper | 104 | over 8 years ago | Web Scraping Toolkit using HTML and CSS Selectors or XPath expressions |
Awesome-crawler / Go | |||
| pholcus | 7,578 | almost 3 years ago | A distributed, high concurrency and powerful web crawler |
| gocrawl | 2,036 | over 4 years ago | Polite, slim and concurrent web crawler |
| fetchbot | 787 | over 4 years ago | A simple and flexible web crawler that follows the robots.txt policies and crawl delays |
| go_spider | 1,827 | almost 8 years ago | An awesome Go concurrent crawler (spider) framework |
| dht | 2,741 | about 4 years ago | BitTorrent DHT Protocol && DHT Spider |
| ants-go | 363 | over 9 years ago | An open source, distributed, RESTful crawler engine in Golang |
| scrape | 1,513 | almost 9 years ago | A simple, higher level interface for Go web scraping |
| creeper | 780 | over 8 years ago | The Next Generation Crawler Framework (Go) |
| colly | 23,444 | about 1 year ago | Fast and Elegant Scraping Framework for Gophers |
| ferret | 5,760 | 11 months ago | Declarative web scraping |
| Dataflow kit | 667 | over 2 years ago | Extract structured data from web pages. Website scraping |
| Hakrawler | 4,528 | almost 2 years ago | Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application |
Awesome-crawler / Scala | |||
| crawler | 149 | about 9 years ago | Scala DSL for web crawling |
| scrala | 113 | about 6 years ago | Scala crawler (spider) framework, inspired by Scrapy |
| ferrit | | | Ferrit is a web crawler service written in Scala using Akka, Spray and Cassandra |