wpull

Website scraper

A Wget-compatible web downloader and crawler that downloads and crawls web pages, allowing entire websites to be archived.
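Because wpull mirrors wget's command-line interface, an archiving run can be driven from a short script. The sketch below is a minimal example that shells out to the wpull executable from Python; it assumes wpull is installed and on PATH, and uses the wget-style --recursive and --warc-file options. Verify the exact flags against `wpull --help` for your installed version.

```python
# Minimal sketch: drive a wpull crawl from Python and write the result to a WARC file.
# Assumes the `wpull` executable is on PATH; flag names follow wget conventions.
import subprocess

def archive_site(url: str, warc_name: str) -> int:
    """Recursively crawl `url` and record the capture as a WARC archive."""
    cmd = [
        "wpull",
        url,
        "--recursive",             # follow links within the site
        "--warc-file", warc_name,  # write output to <warc_name>.warc.gz
    ]
    # Return the process exit code so callers can detect failed crawls.
    return subprocess.run(cmd).returncode

if __name__ == "__main__":
    archive_site("https://example.com/", "example-com")
```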

GitHub

557 stars
23 watching
77 forks
Language: HTML
Last commit: 7 months ago
Linked from 1 awesome list

Related projects:

Repository | Description | Stars
archiveteam/grab-site | A web crawler designed to back up websites by recursively crawling them and writing WARC files. | 1,400
karust/gogetcrawl | A Go tool and package for extracting web archive data from sources such as the Wayback Machine and Common Crawl. | 149
vida-nyu/ache | A web crawler designed to efficiently collect and prioritize relevant content from the web. | 456
felipecsl/wombat | A Ruby-based web crawler and data extraction tool with an elegant DSL. | 1,315
p3gleg/pwnback | Generates a sitemap of a website using the Wayback Machine. | 225
s0rg/crawley | A utility for systematically extracting URLs from web pages and printing them to the console. | 265
machawk1/wail | A graphical user interface for preserving and replaying web pages using multiple archiving tools. | 351
a11ywatch/crawler | A high-performance web page crawler. | 50
internetarchive/brozzler | A distributed web crawler that fetches and extracts links from websites using a real browser. | 673
stevepolitodesign/my_site_archive | A simple Rails application for archiving websites. | 27
internetarchive/warctools | Tools for working with archived web content. | 152
bellingcat/auto-archiver | Automates archiving of online content from various sources into local storage or cloud services. | 583
turicas/crau | A command-line tool for archiving and playing back websites in WARC format. | 58
amoilanen/js-crawler | A Node.js module for crawling websites and scraping their content. | 254
oduwsdl/archivenow | A tool to automate archiving of web resources into public archives. | 409