scrapy-cluster

Crawler cluster

A distributed scraping framework that scales crawling and prioritizes sites, utilizing Redis and Kafka for coordination.

This Scrapy project uses Redis and Kafka to create a distributed, on-demand scraping cluster.
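As a rough illustration of what "prioritizes sites" can mean in practice, the sketch below models a shared crawl queue that hands out the highest-priority request first. It is a minimal in-memory stand-in built on Python's `heapq`, not the project's actual implementation: the `CrawlQueue` class, the `make_request` helper, and the `url`, `domain`, and `priority` field names are all hypothetical. In a real cluster the queue would presumably live in a shared store such as Redis so that many crawler processes could read from it.

```python
import heapq
import itertools
import json
from urllib.parse import urlparse


class CrawlQueue:
    """In-memory stand-in for a shared priority queue (hypothetical helper)."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: equal priorities stay FIFO

    def push(self, request):
        # heapq is a min-heap, so negate the priority to pop the highest first.
        entry = (-request["priority"], next(self._counter), json.dumps(request))
        heapq.heappush(self._heap, entry)

    def pop(self):
        _, _, payload = heapq.heappop(self._heap)
        return json.loads(payload)

    def __len__(self):
        return len(self._heap)


def make_request(url, priority=1):
    # Field names are illustrative assumptions, not a documented schema.
    return {"url": url, "domain": urlparse(url).netloc, "priority": priority}


queue = CrawlQueue()
queue.push(make_request("http://example.com/a", priority=1))
queue.push(make_request("http://example.org/b", priority=5))
queue.push(make_request("http://example.com/c", priority=5))

# Drain the queue: higher priority first, insertion order within a priority.
order = [queue.pop()["url"] for _ in range(len(queue))]
print(order)
```

Requests are serialized to JSON before being stored, which mirrors how a message would have to be encoded to travel through an external queue or message bus.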

GitHub

1k stars
108 watching
324 forks
Language: Python
last commit: about 1 year ago
Linked from 1 awesome list

Tags: distributed, kafka, python, redis, scraping, scrapy

Related projects:

Repository | Description | Stars
scrapy/scrapely | A pure-python library for extracting structured data from HTML pages. | 1,863
dyweb/scrala | A web crawling framework written in Scala that allows users to define the start URL and parse response from it | 113
pjkelly/robocop | A middleware that adds a meta tag to HTTP responses to instruct search engines on how to crawl the content. | 3
cuiweixie/lua-resty-redis-cluster | A client library for managing Redis clusters using Lua scripts in an OpenResty configuration. | 100
efremidze/cluster | A map annotation clustering library that efficiently groups and displays geographic pins on an iOS map view. | 1,275
postmodern/spidr | A Ruby web crawling library that provides flexible and customizable methods to crawl websites | 806
rndinfosecguy/scavenger | An OSINT bot that crawls pastebin sites to search for sensitive data leaks | 629
rusty1s/pytorch_cluster | A PyTorch extension library providing optimized graph cluster algorithms | 824
elixir-crawly/crawly | A framework for extracting structured data from websites | 987
holgerd77/django-dynamic-scraper | An app that allows you to manage Scrapy spiders through a Django admin interface. | 1,153
howie6879/ruia | An async web scraping micro-framework built with asyncio and aiohttp to simplify URL crawling | 1,752
needmorecowbell/giggity | A tool to scrape and store hierarchical data about GitHub organizations, users, or repositories. | 126
malfrats/xeuledoc | A tool to fetch information about public Google documents from various services | 846
stewartmckee/cobweb | A flexible web crawler that can be used to extract data from websites in a scalable and efficient manner | 226
tidyverse/rvest | A package for extracting data from web pages using HTML parsing and CSS/XPath selectors. | 1,492