Python-crawler-tutorial-starts-from-zero

Crawler tutorial

A comprehensive tutorial on building distributed crawlers from scratch using Python

A Python crawler tutorial that takes you from zero to one, covering JS reverse engineering, Selenium, Tesseract OCR, MongoDB usage, and the Scrapy framework.

GitHub

4k stars
163 watching
761 forks
Language: Python
Last commit: almost 4 years ago

Related projects:

Repository | Description | Stars
ssssssss-team/spider-flow | A tool for defining and executing web crawlers with a visual workflow, allowing users to configure crawlers without writing code. | 9,613
unclecode/crawl4ai | A tool for web crawling and data extraction, designed to work with large language models. | 16,180
chenjiandongx/github-spider | A Python-based web crawler for scraping GitHub user and repository data. | 264
bda-research/node-crawler | A NodeJS-based web crawler and spider that extracts data from websites. | 6,704
yujiosaka/headless-chrome-crawler | A distributed crawling framework that leverages Headless Chrome to scrape dynamic websites. | 5,527
jmg/crawley | A Pythonic framework for building high-speed web crawlers with flexible data extraction and storage options. | 186
apify/crawlee | A tool for building reliable web scraping and browser automation pipelines in Node.js. | 15,604
jae-jae/querylist | A PHP framework for building web scrapers and crawlers with a focus on ease of use and extensibility. | 2,668
brendonboshell/supercrawler | A web crawler that obeys robots.txt rules, rate limits, and concurrency limits, with customizable content handlers for parsing and processing crawled pages. | 378
spatie/crawler | A powerful web crawler written in PHP that can execute JavaScript and crawl multiple URLs concurrently. | 2,537
xtuhcy/gecco | A lightweight web crawler framework that extracts web page data using jQuery-like selectors and supports asynchronous requests and distributed crawling. | 2,502
elliotgao2/gain | A Python web crawling framework that uses asyncio and aiohttp for efficient data extraction from websites. | 2,035
pjkelly/robocop | A middleware that adds a meta tag to HTTP responses to instruct search engines on how to crawl the content. | 3
puerkitobio/fetchbot | A flexible web crawler that follows robots.txt policies and crawl delays. | 786
xianhu/pspider | A Python web crawler framework with support for multi-threading and proxy usage. | 1,827