scrapy-cluster
Crawler cluster
A distributed scraping framework that scales crawling and prioritizes sites, utilizing Redis and Kafka for coordination.
This Scrapy project uses Redis and Kafka to create a distributed, on-demand scraping cluster.
1k stars
108 watching
324 forks
Language: Python
Last commit: about 1 year ago
Linked from 1 awesome list
distributed, kafka, python, redis, scraping, scrapy
Related projects:
Repository | Description | Stars |
---|---|---|
scrapy/scrapely | A pure-python library for extracting structured data from HTML pages. | 1,863 |
dyweb/scrala | A web crawling framework written in Scala that allows users to define the start URL and parse the response from it. | 113 |
pjkelly/robocop | A middleware that adds a meta tag to HTTP responses to instruct search engines on how to crawl the content. | 3 |
cuiweixie/lua-resty-redis-cluster | A client library for managing Redis clusters using Lua scripts in an OpenResty configuration. | 100 |
efremidze/cluster | A map annotation clustering library that efficiently groups and displays geographic pins on an iOS map view. | 1,275 |
postmodern/spidr | A Ruby web crawling library that provides flexible and customizable methods to crawl websites. | 806 |
rndinfosecguy/scavenger | An OSINT bot that crawls pastebin sites to search for sensitive data leaks. | 629 |
rusty1s/pytorch_cluster | A PyTorch extension library providing optimized graph cluster algorithms. | 824 |
elixir-crawly/crawly | A framework for extracting structured data from websites. | 987 |
holgerd77/django-dynamic-scraper | An app that allows you to manage Scrapy spiders through a Django admin interface. | 1,153 |
howie6879/ruia | An async web scraping micro-framework built with asyncio and aiohttp to simplify URL crawling. | 1,752 |
needmorecowbell/giggity | A tool to scrape and store hierarchical data about GitHub organizations, users, or repositories. | 126 |
malfrats/xeuledoc | A tool to fetch information about public Google documents from various services. | 846 |
stewartmckee/cobweb | A flexible web crawler for extracting data from websites scalably and efficiently. | 226 |
tidyverse/rvest | A package for extracting data from web pages using HTML parsing and CSS/XPath selectors. | 1,492 |