Python-crawler-tutorial-starts-from-zero
Crawler tutorial
A comprehensive tutorial on building distributed crawlers from scratch using Python
A Python crawler tutorial that takes you from zero to one, covering JS reverse engineering, Selenium, Tesseract OCR, MongoDB usage, and the Scrapy framework
4k stars
163 watching
761 forks
Language: Python
Last commit: almost 4 years ago

Related projects:

Repository | Description | Stars |
---|---|---|
ssssssss-team/spider-flow | A tool for defining and executing web crawlers with a visual workflow, allowing users to configure crawlers without writing code. | 9,613 |
unclecode/crawl4ai | A tool for web crawling and data extraction, designed to work with large language models. | 16,180 |
chenjiandongx/github-spider | A Python-based web crawler for scraping GitHub user and repository data. | 264 |
bda-research/node-crawler | A NodeJS-based web crawler and spider that extracts data from websites. | 6,704 |
yujiosaka/headless-chrome-crawler | A distributed crawling framework that leverages Headless Chrome to scrape dynamic websites. | 5,527 |
jmg/crawley | A Pythonic framework for building high-speed web crawlers with flexible data extraction and storage options. | 186 |
apify/crawlee | A tool for building reliable web scraping and browser automation pipelines in Node.js. | 15,604 |
jae-jae/querylist | A PHP framework for building web scrapers and crawlers with a focus on ease of use and extensibility. | 2,668 |
brendonboshell/supercrawler | A web crawler designed to crawl websites while obeying robots.txt rules, rate limits and concurrency limits, with customizable content handlers for parsing and processing crawled pages. | 378 |
spatie/crawler | A powerful web crawler written in PHP that can execute JavaScript and crawl multiple URLs concurrently. | 2,537 |
xtuhcy/gecco | A lightweight web crawler framework that enables easy extraction of web page data using jQuery-like selectors and supports asynchronous requests and distributed crawling. | 2,502 |
elliotgao2/gain | A Python web crawling framework utilizing asyncio and aiohttp for efficient data extraction from websites. | 2,035 |
pjkelly/robocop | A middleware that adds a meta tag to HTTP responses to instruct search engines on how to crawl the content. | 3 |
puerkitobio/fetchbot | A flexible web crawler that follows robots.txt policies and crawl delays. | 786 |
xianhu/pspider | A Python web crawler framework with support for multi-threading and proxy usage. | 1,827 |