crawler4j

Web crawler

A Java-based web crawler for extracting and processing web page content

Open Source Web Crawler for Java

GitHub

5k stars
306 watching
2k forks
Language: Java
Last commit: about 3 years ago
Linked from 3 awesome lists


Related projects:

| Repository | Description | Stars |
| --- | --- | --- |
| code4craft/webmagic | A framework for building scalable web crawlers in Java. | 11,456 |
| unclecode/crawl4ai | A web crawling tool designed to extract structured data from the web for use in AI applications. | 18,541 |
| apify/crawlee | A tool for building reliable web scraping and browser automation pipelines in Node.js. | 16,081 |
| spatie/crawler | A powerful web crawler written in PHP that can execute JavaScript and crawl multiple URLs concurrently. | 2,552 |
| yujiosaka/headless-chrome-crawler | A distributed crawling framework that leverages Headless Chrome to scrape dynamic websites. | 5,534 |
| apache/incubator-stormcrawler | A scalable and versatile web crawling framework based on Apache Storm. | 895 |
| xtuhcy/gecco | A lightweight web crawler framework that enables easy extraction of web page data using jQuery-like selectors and supports asynchronous requests and distributed crawling. | 2,504 |
| hakluke/hakrawler | A tool for automatically discovering and crawling web application endpoints and assets. | 4,528 |
| brendonboshell/supercrawler | A web crawler designed to crawl websites while obeying robots.txt rules, rate limits, and concurrency limits, with customizable content handlers for parsing and processing crawled pages. | 380 |
| iamstoxe/urlgrab | A tool to crawl websites by exploring links recursively, with support for JavaScript rendering. | 331 |
| cocrawler/cocrawler | A versatile web crawler built with modern tools and concurrency to handle various crawl tasks. | 188 |
| stewartmckee/cobweb | A flexible web crawler that can be used to extract data from websites in a scalable and efficient manner. | 226 |
| codesofun/web-bee | A Java framework for building web-based crawlers with features like distributed crawling and proxy support. | 189 |
| builderio/gpt-crawler | Automates the process of generating knowledge files to create custom AI models from website content. | 19,059 |
| twitter4j/twitter4j | A Java library providing access to the Twitter API for sending and retrieving tweets. | 2,782 |