wombat

Web scraper library

A Ruby-based web crawler and data extraction tool with an elegant DSL.

A lightweight Ruby web crawler/scraper whose elegant DSL extracts structured data from pages.

GitHub

1k stars

51 watching

129 forks

Language: Ruby

Last commit: over 2 years ago

Linked from 3 awesome lists

Tags: crawler, dsl, ruby, scraper

felipecsl.github.io/wombat/

Related projects:

Repository	Description	Stars
bplawler/crawler	A Scala-based DSL for programmatically accessing and interacting with web pages	149
benibela/xidel	A tool to extract data from web pages using various query languages and selectors.	690
postmodern/spidr	A Ruby web crawling library that provides flexible and customizable methods to crawl websites	809
jaimeiniesta/metainspector	A Ruby gem for web scraping and extracting metadata from web pages.	1,038
ruippeixotog/scala-scraper	A Scala library providing a DSL for loading and extracting content from HTML pages	717
archiveteam/wpull	Downloads and crawls web pages, allowing for the archiving of websites.	556
miyagawa/web-scraper	A Perl toolkit for extracting structured data from HTML documents using a DSL-like interface.	104
joseconstela/webparsy	A Node.js library and CLI for scraping websites using Puppeteer and YAML definitions	44
medialab/minet	A command line tool and Python library for extracting data from various web sources.	293
oscarotero/embed	A PHP library to retrieve metadata and embed code from any web page	2,100
slotix/dataflowkit	A framework for extracting structured data from web pages using CSS selectors.	667
spider-rs/spider	A tool for web data extraction and processing using Rust	1,234
jjelosua/doga_scraper	A tool that extracts documents from the Galician Official Journal for a given input year and converts them to different formats.	0
s0rg/crawley	A utility for systematically extracting URLs from web pages and printing them to the console.	268
fimad/scalpel	A web scraping library providing a declarative interface on top of an HTML parsing library to extract data from HTML pages	325