crawlee

Web scraper

A tool for building reliable web scraping and browser automation pipelines in Node.js.

Crawlee is a web scraping and browser automation library for Node.js, used to build reliable crawlers in JavaScript and TypeScript. It extracts data for AI, LLMs, RAG, or GPTs, and downloads HTML, PDF, JPG, PNG, and other files from websites. It works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP, in both headful and headless mode, with proxy rotation.

GitHub

16k stars
105 watching
699 forks
Language: TypeScript
Last commit: about 1 month ago
Linked from 1 awesome list

apify, automation, crawler, crawling, headless, headless-chrome, javascript, nodejs, npm, playwright, puppeteer, scraper, scraping, typescript, web-crawler, web-crawling, web-scraping

Related projects:

Repository | Description | Stars
spatie/crawler | A powerful web crawler written in PHP that can execute JavaScript and crawl multiple URLs concurrently. | 2,552
bda-research/node-crawler | A NodeJS-based web crawler and spider that extracts data from websites. | 6,718
yujiosaka/headless-chrome-crawler | A distributed crawling framework that leverages Headless Chrome to scrape dynamic websites. | 5,534
unclecode/crawl4ai | A web crawling tool designed to extract structured data from the web for use in AI applications. | 18,541
code4craft/webmagic | A framework for building scalable web crawlers in Java. | 11,456
veliovgroup/spiderable-middleware | Intercepts requests from web crawlers and proxies them to a prerendering service for rendering HTML. | 39
howie6879/ruia | An async web scraping micro-framework built with asyncio and aiohttp to simplify URL crawling. | 1,753
ruipgil/scraperjs | A versatile web scraping module with two scrapers for static and dynamic content extraction. | 3,714
matthewmueller/x-ray | A flexible web scraping framework for extracting data from websites with customizable selectors and pagination support. | 5,883
elliotgao2/gain | A Python web crawling framework utilizing asyncio and aiohttp for efficient data extraction from websites. | 2,037
builderio/gpt-crawler | Automates the process of generating knowledge files to create custom AI models from website content. | 19,059
rndinfosecguy/scavenger | An OSINT bot that crawls pastebin sites to search for sensitive data leaks. | 634
yasserg/crawler4j | A Java-based web crawler for extracting and processing web page content. | 4,563
brendonboshell/supercrawler | A web crawler designed to crawl websites while obeying robots.txt rules, rate limits, and concurrency limits, with customizable content handlers for parsing and processing crawled pages. | 380
gocolly/colly | A framework for extracting structured data from websites in a fast and elegant way. | 23,444