crawlee
Web scraper
A tool for building reliable web scraping and browser automation pipelines in Node.js.
Crawlee is a web scraping and browser automation library for Node.js, used to build reliable crawlers in JavaScript and TypeScript. It extracts data for AI, LLMs, RAG, or GPTs, and downloads HTML, PDF, JPG, PNG, and other files from websites. It works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP, supports both headful and headless modes, and includes proxy rotation.
16k stars
103 watching
668 forks
Language: TypeScript
Last commit: 8 days ago
Linked from 1 awesome list
apify, automation, crawler, crawling, headless, headless-chrome, javascript, nodejs, npm, playwright, puppeteer, scraper, scraping, typescript, web-crawler, web-crawling, web-scraping
Related projects:
Repository | Description | Stars |
---|---|---|
spatie/crawler | A powerful web crawler written in PHP that can execute JavaScript and crawl multiple URLs concurrently. | 2,537 |
bda-research/node-crawler | A NodeJS-based web crawler and spider that extracts data from websites. | 6,704 |
yujiosaka/headless-chrome-crawler | A distributed crawling framework that leverages Headless Chrome to scrape dynamic websites. | 5,527 |
unclecode/crawl4ai | A tool for web crawling and data extraction, designed to work with large language models. | 16,180 |
code4craft/webmagic | A scalable framework for building web crawlers in Java. | 11,432 |
veliovgroup/spiderable-middleware | Intercepts requests from web crawlers and proxies them to a prerendering service that renders the HTML. | 38 |
howie6879/ruia | An async web scraping micro-framework built with asyncio and aiohttp to simplify URL crawling. | 1,752 |
ruipgil/scraperjs | A versatile web scraping module with two scrapers for static and dynamic content extraction. | 3,710 |
matthewmueller/x-ray | A flexible web scraping framework for extracting data from websites with customizable selectors and pagination support. | 5,878 |
elliotgao2/gain | A Python web crawling framework utilizing asyncio and aiohttp for efficient data extraction from websites. | 2,035 |
builderio/gpt-crawler | Automates the process of generating knowledge files to create custom AI models from website content. | 18,860 |
rndinfosecguy/scavenger | An OSINT bot that crawls pastebin sites to search for sensitive data leaks. | 629 |
yasserg/crawler4j | A Java-based web crawler for extracting and processing web page content. | 4,555 |
brendonboshell/supercrawler | A web crawler designed to crawl websites while obeying robots.txt rules, rate limits and concurrency limits, with customizable content handlers for parsing and processing crawled pages. | 378 |
gocolly/colly | A framework for extracting structured data from websites in a fast and elegant way. | 23,317 |