gogetcrawl

Archive extractor

A tool and package for extracting web archive data from popular sources like Wayback Machine and Common Crawl using the Go programming language.

Extract web archive data using Wayback Machine and Common Crawl

147 stars
5 watching
16 forks
Language: Go
Last commit: 17 days ago
Linked from 1 awesome list

commoncrawl, concurrency, crawler, golang, wayback-machine, webarchive

Related projects:

| Repository | Description | Stars |
|---|---|---|
| s0rg/crawley | A utility for systematically extracting URLs from web pages and printing them to the console. | 263 |
| recrm/archivetools | A collection of tools for extracting and analyzing data from web archives. | 69 |
| dwisiswant0/galer | A tool that extracts URLs from HTML attributes by crawling pages and evaluating JavaScript. | 253 |
| archiveteam/wpull | Downloads and crawls web pages, allowing for the archiving of websites. | 556 |
| go-shiori/obelisk | Archives a web page as a single HTML file with embedded resources. | 263 |
| archiveteam/grab-site | A web crawler designed to back up websites by recursively crawling them and writing WARC files. | 1,398 |
| jiiks/asar.net | A .NET implementation of the Atom Asar archive format, allowing extraction and manipulation of archived files. | 35 |
| afjoseph/rake.go | An algorithm for extracting keywords from text based on word frequency and part-of-speech tagging. | 117 |
| richardlehane/webarchive | Provides tools for reading and parsing web archive formats used in digital preservation. | 20 |
| wabarc/wayback | A tool for capturing and preserving web content and making it accessible in the future. | 1,811 |
| limiu82214/gojmapr | A library to extract specific properties from complex JSON structures into Go structs with minimal code changes. | 22 |
| oduwsdl/archivenow | A tool to automate archiving of web resources into public archives. | 410 |
| thetic/extract | A plugin that allows users to extract files from various archive formats without specifying the extraction command. | 9 |
| gb-archive/salvage | A tool for archiving and preserving online content, allowing users to salvage and store websites, articles, contributions, text, and documentation. | 32 |
| allyshka/pwngitmanager | A tool for extracting specific files from git repositories during penetration testing without downloading the entire repository. | 107 |