awesome-crawler
crawler frameworks
A collection of reusable web crawling and scraping components in multiple programming languages.
A collection of awesome web crawlers and spiders in different languages
7k stars
201 watching
709 forks
last commit: 7 months ago
awesome, crawler, node-crawler, scraper, spider, web-crawler, web-scraper
Awesome-crawler / Python | |||
Scrapy | 53,484 | about 1 month ago | A fast high-level screen scraping and web crawling framework |
Awesome-crawler / Python / Scrapy | |||
django-dynamic-scraper | 1,155 | almost 3 years ago | Creating Scrapy scrapers via the Django admin interface |
Scrapy-Redis | 5,548 | 7 months ago | Redis-based components for Scrapy |
scrapy-cluster | 1,185 | about 1 year ago | Uses Redis and Kafka to create a distributed on demand scraping cluster |
distribute_crawler | 3,245 | over 7 years ago | Uses Scrapy, Redis, MongoDB, and Graphite to create a distributed spider |
Awesome-crawler / Python | |||
pyspider | 16,511 | 9 months ago | A powerful spider system |
CoCrawler | 188 | over 2 years ago | A versatile web crawler built using modern tools and concurrency |
cola | 1,501 | over 2 years ago | A distributed crawling framework |
Demiurge | 115 | about 3 years ago | PyQuery-based scraping micro-framework |
Scrapely | 1,865 | almost 3 years ago | A pure-python HTML screen-scraping library |
feedparser | Universal feed parser | ||
you-get | 54,175 | about 1 month ago | Dumb downloader that scrapes the web |
MechanicalSoup | 4,685 | 2 months ago | A Python library for automating interaction with websites |
portia | 9,327 | 7 months ago | Visual scraping for Scrapy |
crawley | 188 | almost 2 years ago | Pythonic Crawling / Scraping Framework based on Non Blocking I/O operations |
RoboBrowser | 3,703 | over 4 years ago | A simple, Pythonic library for browsing the web without a standalone web browser |
MSpider | 348 | over 2 years ago | A simple, easy spider using gevent and JS rendering |
brownant | 159 | almost 8 years ago | A lightweight web data extracting framework |
PSpider | 1,828 | over 2 years ago | A simple spider framework in Python 3 |
Gain | 2,037 | over 5 years ago | Web crawling framework based on asyncio for everyone |
sukhoi | 879 | about 4 years ago | Minimalist and powerful Web Crawler |
spidy | 340 | 5 months ago | The simple, easy to use command line web crawler |
newspaper | 14,220 | 6 months ago | News, full-text, and article metadata extraction in Python 3 |
aspider | 1,753 | over 1 year ago | An async web scraping micro-framework based on asyncio |
Awesome-crawler / Java | |||
ACHE Crawler | 459 | over 1 year ago | An easy to use web crawler for domain-specific search |
Apache Nutch | Highly extensible, highly scalable web crawler for production environment | ||
Awesome-crawler / Java / Apache Nutch | |||
anthelion | 2,841 | about 9 years ago | A plugin for Apache Nutch to crawl semantic annotations within HTML pages |
Awesome-crawler / Java | |||
Crawler4j | 4,563 | about 3 years ago | Simple and lightweight web crawler |
JSoup | Scrapes, parses, manipulates and cleans HTML | ||
websphinx | Website-Specific Processors for HTML information extraction | ||
Open Search Server | A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything | ||
Gecco | 2,504 | 11 months ago | An easy-to-use, lightweight web crawler |
WebCollector | 3,074 | 10 months ago | Simple interfaces for crawling the Web; you can set up a multi-threaded web crawler in less than 5 minutes |
Webmagic | 11,456 | about 1 month ago | A scalable crawler framework |
Spiderman | A scalable, extensible, multi-threaded web crawler | ||
Awesome-crawler / Java / Spiderman | |||
Spiderman2 | A distributed web crawler framework that supports JS rendering | ||
Awesome-crawler / Java | |||
Heritrix3 | 2,857 | about 2 months ago | Extensible, web-scale, archival-quality web crawler project |
SeimiCrawler | 1,980 | about 2 months ago | An agile, distributed crawler framework |
StormCrawler | An open source collection of resources for building low-latency, scalable web crawlers on Apache Storm | ||
Spark-Crawler | 411 | almost 2 years ago | Evolving Apache Nutch to run on Spark |
webBee | 189 | about 1 year ago | A DFS web spider |
spider-flow | 9,701 | over 1 year ago | A visual spider framework; it's so good that you don't need to write any code to crawl a website |
Norconex Web Crawler | 184 | about 1 month ago | Norconex HTTP Collector is a full-featured web crawler (or spider) that can manipulate and store collected data in a repository of your choice (e.g. a search engine). Can be used as a standalone application or embedded into Java applications |
Awesome-crawler / C# | |||
ccrawler | Built in C# 3.5. It contains a simple extension of a web content categorizer, which can separate web pages depending on their content | ||
SimpleCrawler | Simple spider based on multithreading and regular expressions | ||
DotnetSpider | 4,007 | 4 months ago | A cross-platform, lightweight spider developed in C# |
Abot | 2,255 | 4 months ago | C# web crawler built for speed and flexibility |
Hawk | 3,163 | about 5 years ago | Advanced Crawler and ETL tool written in C#/WPF |
SkyScraper | 59 | over 8 years ago | An asynchronous web scraper / web crawler using async / await and Reactive Extensions |
Infinity Crawler | 248 | about 1 year ago | A simple but powerful web crawler library in C# |
Awesome-crawler / JavaScript | |||
scraperjs | 3,714 | about 4 years ago | A complete and versatile web scraper |
scrape-it | 4,024 | 2 months ago | A Node.js scraper for humans |
simplecrawler | 2,143 | almost 4 years ago | Event driven web crawler |
node-crawler | 6,718 | 6 months ago | Node-crawler has a clean, simple API |
js-crawler | 254 | over 6 years ago | Web crawler for Node.js; both HTTP and HTTPS are supported |
webster | 518 | about 1 month ago | A reliable web crawling framework which can scrape AJAX- and JS-rendered content in a web page |
x-ray | 5,883 | about 1 month ago | Web scraper with pagination and crawler support |
node-osmosis | 4,115 | about 1 year ago | HTML/XML parser and web scraper for Node.js |
web-scraper-chrome-extension | 1,318 | about 6 years ago | Web data extraction tool implemented as a Chrome extension |
supercrawler | 380 | about 2 years ago | Define custom handlers to parse content. Obeys robots.txt, rate limits and concurrency limits |
headless-chrome-crawler | 5,534 | over 1 year ago | Headless Chrome crawls with jQuery support |
Squidwarc | 170 | over 4 years ago | High fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head |
crawlee | 16,081 | about 1 month ago | A web scraping and browser automation library for Node.js that helps you build reliable crawlers. Fast |
Awesome-crawler / PHP | |||
Goutte | 9,264 | almost 2 years ago | A screen scraping and web crawling library for PHP |
Awesome-crawler / PHP / Goutte | |||
laravel-goutte | 453 | 12 months ago | Laravel 5 Facade for Goutte |
Awesome-crawler / PHP | |||
dom-crawler | 3,974 | about 2 months ago | The DomCrawler component eases DOM navigation for HTML and XML documents |
QueryList | 2,671 | about 1 month ago | The progressive PHP crawler framework |
pspider | 266 | over 9 years ago | Parallel web crawler written in PHP |
php-spider | 1,336 | 7 months ago | A configurable and extensible PHP web spider |
spatie/crawler | 2,552 | about 1 month ago | An easy to use, powerful crawler implemented in PHP. Can execute JavaScript |
crawlzone/crawlzone | 78 | over 1 year ago | Crawlzone is a fast asynchronous internet crawling framework for PHP |
PHPScraper | 544 | 9 months ago | PHPScraper is a scraper & crawler built for simplicity |
Awesome-crawler / C++ | |||
open-source-search-engine | 1,546 | about 1 year ago | A distributed open source search engine and spider/crawler written in C/C++ |
Awesome-crawler / C | |||
httrack | 3,648 | 5 months ago | Copy websites to your computer |
Awesome-crawler / Ruby | |||
Nokogiri | 6,164 | about 1 month ago | A Rubygem providing HTML, XML, SAX, and Reader parsers with XPath and CSS selector support |
upton | 1,612 | about 6 years ago | A batteries-included framework for easy web-scraping. Just add CSS (or do more) |
wombat | 1,315 | 12 months ago | Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages |
RubyRetriever | 143 | almost 2 years ago | RubyRetriever is a Web Crawler, Scraper & File Harvester |
Spidr | 809 | 12 months ago | Spider a site, multiple domains, certain links or infinitely |
Cobweb | 226 | about 2 years ago | Web crawler with very flexible crawling options, standalone or using sidekiq |
mechanize | 4,396 | 4 months ago | Automated web interaction & crawling |
Awesome-crawler / Rust | |||
spider | 1,234 | about 1 month ago | The fastest web crawler and indexer |
crawler | 51 | 5 months ago | A gRPC web indexer turbo charged for performance |
Awesome-crawler / R | |||
rvest | 1,495 | 3 months ago | Simple web scraping for R |
Awesome-crawler / Erlang | |||
ebot | 330 | almost 14 years ago | A scalable, distributed, and highly configurable web crawler |
Awesome-crawler / Perl | |||
web-scraper | 104 | over 7 years ago | Web Scraping Toolkit using HTML and CSS Selectors or XPath expressions |
Awesome-crawler / Go | |||
pholcus | 7,578 | about 2 years ago | A distributed, high concurrency and powerful web crawler |
gocrawl | 2,036 | over 3 years ago | Polite, slim and concurrent web crawler |
fetchbot | 787 | over 3 years ago | A simple and flexible web crawler that follows the robots.txt policies and crawl delays |
go_spider | 1,827 | about 7 years ago | An awesome Go concurrent crawler (spider) framework |
dht | 2,741 | over 3 years ago | BitTorrent DHT Protocol && DHT Spider |
ants-go | 363 | almost 9 years ago | An open source, distributed, RESTful crawler engine in Go |
scrape | 1,513 | about 8 years ago | A simple, higher level interface for Go web scraping |
creeper | 780 | over 7 years ago | The Next Generation Crawler Framework (Go) |
colly | 23,444 | 6 months ago | Fast and Elegant Scraping Framework for Gophers |
ferret | 5,760 | about 1 month ago | Declarative web scraping |
Dataflow kit | 667 | almost 2 years ago | Extract structured data from web pages. Website scraping |
Hakrawler | 4,528 | 12 months ago | Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application |
Awesome-crawler / Scala | |||
crawler | 149 | over 8 years ago | Scala DSL for web crawling |
scrala | 113 | over 5 years ago | Scala crawler (spider) framework, inspired by Scrapy |
ferrit | Ferrit is a web crawler service written in Scala using Akka, Spray and Cassandra |