Python-crawler-tutorial-starts-from-zero

Crawler tutorial

A comprehensive tutorial on building distributed crawlers from scratch using Python

A Python crawler tutorial that takes you from zero to one, covering JS reverse engineering, Selenium, Tesseract OCR recognition, MongoDB usage, and the Scrapy framework.

GitHub

4k stars

163 watching

763 forks

Language: Python

last commit: over 4 years ago

Related projects:

Repository	Description	Stars
ssssssss-team/spider-flow	A tool for defining and executing web crawlers with a visual workflow, allowing users to configure crawlers without writing code.	9,701
unclecode/crawl4ai	A web crawling tool designed to extract structured data from the web for use in AI applications.	18,541
chenjiandongx/github-spider	A Python-based web crawler for scraping GitHub user and repository data.	264
bda-research/node-crawler	A Node.js-based web crawler and spider that extracts data from websites.	6,718
yujiosaka/headless-chrome-crawler	A distributed crawling framework that leverages Headless Chrome to scrape dynamic websites.	5,534
jmg/crawley	A Pythonic framework for building high-speed web crawlers with flexible data extraction and storage options.	188
apify/crawlee	A tool for building reliable web scraping and browser automation pipelines in Node.js.	16,081
jae-jae/querylist	A PHP framework for building web scrapers and crawlers with a focus on ease of use and extensibility.	2,671
brendonboshell/supercrawler	A web crawler designed to crawl websites while obeying robots.txt rules, rate limits and concurrency limits, with customizable content handlers for parsing and processing crawled pages.	380
spatie/crawler	A powerful web crawler written in PHP that can execute JavaScript and crawl multiple URLs concurrently.	2,552
xtuhcy/gecco	A lightweight web crawler framework that enables easy extraction of web page data using jQuery-like selectors and supports asynchronous requests and distributed crawling.	2,504
elliotgao2/gain	A Python web crawling framework utilizing asyncio and aiohttp for efficient data extraction from websites.	2,037
pjkelly/robocop	A middleware that adds a meta tag to HTTP responses to instruct search engines on how to crawl the content.	3
puerkitobio/fetchbot	A flexible web crawler that follows robots.txt policies and crawl delays.	787
xianhu/pspider	A Python web crawler framework with support for multi-threading and proxy usage.	1,828