crawlee

Web scraper

A tool for building reliable web scraping and browser automation pipelines in Node.js.

Crawlee is a web scraping and browser automation library for Node.js, used to build reliable crawlers in JavaScript and TypeScript. It can extract data for AI, LLMs, RAG, or GPTs, and download HTML, PDF, JPG, PNG, and other files from websites. It works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP, runs in both headful and headless mode, and supports proxy rotation.

GitHub

16k stars

105 watching

699 forks

Language: TypeScript

Last commit: 8 months ago

Linked from 1 awesome list

Topics: apify, automation, crawler, crawling, headless, headless-chrome, javascript, nodejs, npm, playwright, puppeteer, scraper, scraping, typescript, web-crawler, web-crawling, web-scraping

crawlee.dev

Backlinks from these awesome lists:

brucedone/awesome-crawler

Related projects:

Repository	Description	Stars
spatie/crawler	A powerful web crawler written in PHP that can execute JavaScript and crawl multiple URLs concurrently.	2,552
bda-research/node-crawler	A NodeJS-based web crawler and spider that extracts data from websites.	6,718
yujiosaka/headless-chrome-crawler	A distributed crawling framework that leverages Headless Chrome to scrape dynamic websites.	5,534
unclecode/crawl4ai	A web crawling tool designed to extract structured data from the web for use in AI applications.	18,541
code4craft/webmagic	A framework for building scalable web crawlers in Java.	11,456
veliovgroup/spiderable-middleware	Middleware that intercepts requests from web crawlers and proxies them to a prerendering service for rendering HTML.	39
howie6879/ruia	An async web scraping micro-framework built with asyncio and aiohttp to simplify URL crawling.	1,753
ruipgil/scraperjs	A versatile web scraping module with two scrapers for static and dynamic content extraction.	3,714
matthewmueller/x-ray	A flexible web scraping framework for extracting data from websites with customizable selectors and pagination support.	5,883
elliotgao2/gain	A Python web crawling framework utilizing asyncio and aiohttp for efficient data extraction from websites.	2,037
builderio/gpt-crawler	Automates the process of generating knowledge files to create custom AI models from website content.	19,059
rndinfosecguy/scavenger	An OSINT bot that crawls pastebin sites to search for sensitive data leaks.	634
yasserg/crawler4j	A Java-based web crawler for extracting and processing web page content.	4,563
brendonboshell/supercrawler	A web crawler designed to crawl websites while obeying robots.txt rules, rate limits and concurrency limits, with customizable content handlers for parsing and processing crawled pages.	380
gocolly/colly	A framework for extracting structured data from websites in a fast and elegant way.	23,444