crawlee

Web scraper

A tool for building reliable web scraping and browser automation pipelines in Node.js.

Crawlee is a web scraping and browser automation library for building reliable crawlers in Node.js, using JavaScript or TypeScript. It can extract data for AI, LLMs, RAG, or GPTs, and download HTML, PDF, JPG, PNG, and other files from websites. It works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP, runs in both headful and headless mode, and supports proxy rotation.

GitHub

16k stars
103 watching
668 forks
Language: TypeScript
Last commit: 8 days ago
Linked from 1 awesome list

Topics: apify, automation, crawler, crawling, headless, headless-chrome, javascript, nodejs, npm, playwright, puppeteer, scraper, scraping, typescript, web-crawler, web-crawling, web-scraping

Related projects:

| Repository | Description | Stars |
|---|---|---|
| spatie/crawler | A powerful web crawler written in PHP that can execute JavaScript and crawl multiple URLs concurrently. | 2,537 |
| bda-research/node-crawler | A NodeJS-based web crawler and spider that extracts data from websites. | 6,704 |
| yujiosaka/headless-chrome-crawler | A distributed crawling framework that leverages Headless Chrome to scrape dynamic websites. | 5,527 |
| unclecode/crawl4ai | A tool for web crawling and data extraction, designed to work with large language models. | 16,180 |
| code4craft/webmagic | A scalable framework for building web crawlers in Java. | 11,432 |
| veliovgroup/spiderable-middleware | Intercepts requests from web crawlers and proxies them to a prerendering service for rendering HTML. | 38 |
| howie6879/ruia | An async web scraping micro-framework built with asyncio and aiohttp to simplify URL crawling. | 1,752 |
| ruipgil/scraperjs | A versatile web scraping module with two scrapers for static and dynamic content extraction. | 3,710 |
| matthewmueller/x-ray | A flexible web scraping framework for extracting data from websites with customizable selectors and pagination support. | 5,878 |
| elliotgao2/gain | A Python web crawling framework utilizing asyncio and aiohttp for efficient data extraction from websites. | 2,035 |
| builderio/gpt-crawler | Automates the process of generating knowledge files to create custom AI models from website content. | 18,860 |
| rndinfosecguy/scavenger | An OSINT bot that crawls pastebin sites to search for sensitive data leaks. | 629 |
| yasserg/crawler4j | A Java-based web crawler for extracting and processing web page content. | 4,555 |
| brendonboshell/supercrawler | A web crawler designed to crawl websites while obeying robots.txt rules, rate limits and concurrency limits, with customizable content handlers for parsing and processing crawled pages. | 378 |
| gocolly/colly | A framework for extracting structured data from websites in a fast and elegant way. | 23,317 |