crawler4j
Web crawler
A Java-based web crawler for extracting and processing web page content
Open Source Web Crawler for Java
5k stars
306 watching
2k forks
Language: Java
Last commit: about 3 years ago
Linked from 3 awesome lists
Related projects:
Repository | Description | Stars |
---|---|---|
code4craft/webmagic | A framework for building scalable web crawlers in Java. | 11,456 |
unclecode/crawl4ai | A web crawling tool designed to extract structured data from the web for use in AI applications. | 18,541 |
apify/crawlee | A tool for building reliable web scraping and browser automation pipelines in Node.js. | 16,081 |
spatie/crawler | A powerful web crawler written in PHP that can execute JavaScript and crawl multiple URLs concurrently. | 2,552 |
yujiosaka/headless-chrome-crawler | A distributed crawling framework that leverages Headless Chrome to scrape dynamic websites. | 5,534 |
apache/incubator-stormcrawler | A scalable and versatile web crawling framework based on Apache Storm. | 895 |
xtuhcy/gecco | A lightweight web crawler framework that enables easy extraction of web page data using jQuery-like selectors and supports asynchronous requests and distributed crawling. | 2,504 |
hakluke/hakrawler | A tool for automatically discovering and crawling web application endpoints and assets. | 4,528 |
brendonboshell/supercrawler | A web crawler designed to crawl websites while obeying robots.txt rules, rate limits and concurrency limits, with customizable content handlers for parsing and processing crawled pages. | 380 |
iamstoxe/urlgrab | A tool to crawl websites by exploring links recursively with support for JavaScript rendering. | 331 |
cocrawler/cocrawler | A versatile web crawler built with modern tools and concurrency to handle various crawl tasks. | 188 |
stewartmckee/cobweb | A flexible web crawler that can be used to extract data from websites in a scalable and efficient manner. | 226 |
codesofun/web-bee | A Java framework for building web-based crawlers with features like distributed crawling and proxy support. | 189 |
builderio/gpt-crawler | A tool that automates generating knowledge files from website content to create custom AI models. | 19,059 |
twitter4j/twitter4j | A Java library providing access to the Twitter API for sending and retrieving tweets. | 2,782 |