crawler

Web scraper

A Scala-based DSL for programmatically accessing and interacting with web pages

Scala DSL for web crawling


148 stars
14 watching
40 forks
Language: Scala
Last commit: over 8 years ago
Linked from 1 awesome list



Related projects:

Repository | Description | Stars
ruippeixotog/scala-scraper | A Scala library that provides a domain-specific language (DSL) for parsing and extracting content from HTML pages. | 717
felipecsl/wombat | A Ruby-based web crawler and data extraction tool with an elegant DSL. | 1,315
postmodern/spidr | A Ruby web crawling library that provides flexible and customizable methods to crawl websites. | 806
brendonboshell/supercrawler | A web crawler that obeys robots.txt rules, rate limits, and concurrency limits, with customizable content handlers for parsing and processing crawled pages. | 378
dyweb/scrala | A web crawling framework written in Scala that lets users define a start URL and parse the response from it. | 113
internetarchive/brozzler | A distributed web crawler that fetches and extracts links from websites using a real browser. | 671
benibela/xidel | A tool to extract data from web pages using various query languages and selectors. | 681
fimad/scalpel | A web scraping library providing a declarative interface on top of an HTML parsing library to extract data from HTML pages. | 323
miyagawa/web-scraper | A Perl toolkit for extracting structured data from HTML documents using a DSL-like interface. | 104
webrecorder/browsertrix-crawler | A containerized browser-based crawler system for capturing web content in a high-fidelity and customizable manner. | 652
lambdaworks/scurl-detector | Detects and extracts URLs from text. | 16
apiel/test-crawler | A tool for end-to-end testing of web applications by crawling and comparing screenshots. | 32
stewartmckee/cobweb | A flexible web crawler for extracting data from websites in a scalable and efficient manner. | 226
archiveteam/wpull | Downloads and crawls web pages, allowing for the archiving of websites. | 556
the-markup/blacklight-collector | A tool for scraping website content and analyzing browser behavior. | 202