gogetcrawl

Archive extractor

A Go tool and package for extracting web archive data from popular sources such as the Wayback Machine and Common Crawl.

Extract web archive data using Wayback Machine and Common Crawl
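As a rough illustration of the kind of lookup such a tool performs, the sketch below builds a query for the Wayback Machine CDX index and decodes its JSON output, in which the first row names the columns and each later row is one capture. It does not use gogetcrawl's own API, which is not shown on this page; the names `Snapshot`, `cdxQueryURL`, and `parseCDX` are hypothetical, and the sample payload in `main` is made up so that the example runs without network access.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
)

// Snapshot holds a few fields of one CDX index row (hypothetical type).
type Snapshot struct {
	Timestamp string // capture time, formatted as YYYYMMDDhhmmss
	Original  string // URL as it was captured
	Status    string // HTTP status code recorded for the capture
}

// cdxQueryURL builds a Wayback Machine CDX query that asks for JSON output.
func cdxQueryURL(target string, limit int) string {
	q := url.Values{}
	q.Set("url", target)
	q.Set("output", "json")
	q.Set("limit", fmt.Sprint(limit))
	return "https://web.archive.org/cdx/search/cdx?" + q.Encode()
}

// parseCDX decodes a CDX JSON response: the first row is the header,
// every following row is one capture.
func parseCDX(body []byte) ([]Snapshot, error) {
	var rows [][]string
	if err := json.Unmarshal(body, &rows); err != nil {
		return nil, fmt.Errorf("decode CDX response: %w", err)
	}
	if len(rows) == 0 {
		return nil, nil // no captures found
	}
	// Map column names to positions so field order does not matter.
	col := make(map[string]int, len(rows[0]))
	for i, name := range rows[0] {
		col[name] = i
	}
	ts, okTS := col["timestamp"]
	orig, okOrig := col["original"]
	st, okSt := col["statuscode"]
	if !okTS || !okOrig || !okSt {
		return nil, fmt.Errorf("CDX header is missing an expected column")
	}
	var out []Snapshot
	for _, r := range rows[1:] {
		if len(r) != len(rows[0]) {
			continue // skip malformed rows
		}
		out = append(out, Snapshot{Timestamp: r[ts], Original: r[orig], Status: r[st]})
	}
	return out, nil
}

func main() {
	fmt.Println(cdxQueryURL("example.com", 5))

	// Made-up payload in the CDX JSON shape; a real run would fetch the URL above.
	sample := []byte(`[["urlkey","timestamp","original","mimetype","statuscode","digest","length"],
["com,example)/","20200101000000","http://example.com/","text/html","200","ABC","1234"]]`)
	snaps, err := parseCDX(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, s := range snaps {
		fmt.Println(s.Timestamp, s.Status, s.Original)
	}
}
```

Keeping URL construction and response parsing as separate pure functions means both can be checked offline; the HTTP fetch, retries, and concurrency that a real extractor needs are deliberately left out.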

GitHub

148 stars
5 watching
17 forks
Language: Go
Last commit: 2 months ago
Linked from 1 awesome list

Tags: commoncrawl, concurrency, crawler, golang, wayback-machine, webarchive

Related projects:

Repository | Description | Stars
s0rg/crawley | A utility for systematically extracting URLs from web pages and printing them to the console. | 268
recrm/archivetools | A collection of tools for extracting and analyzing data from web archives. | 71
dwisiswant0/galer | A tool to extract URLs from HTML attributes by crawling pages and evaluating JavaScript. | 255
archiveteam/wpull | Downloads and crawls web pages, allowing for the archiving of websites. | 556
go-shiori/obelisk | Archives a web page as a single HTML file with embedded resources. | 267
archiveteam/grab-site | A web crawler designed to back up websites by recursively crawling and writing WARC files. | 1,406
jiiks/asar.net | A .NET implementation of the Atom Asar archive format, allowing extraction and manipulation of archived files. | 36
afjoseph/rake.go | An algorithm for extracting keywords from text based on word frequency and part-of-speech tagging. | 117
richardlehane/webarchive | Provides tools for reading and parsing web archive formats used in digital preservation. | 20
wabarc/wayback | A tool for capturing and preserving web content and making it accessible in the future. | 1,839
limiu82214/gojmapr | A library to extract specific properties from complex JSON structures into Go structs with minimal code changes. | 22
oduwsdl/archivenow | A tool to automate archiving of web resources into public archives. | 409
thetic/extract | A plugin that allows users to extract files from various archive formats without specifying the extraction command. | 9
gb-archive/salvage | A tool for archiving and preserving online content, allowing users to salvage and store websites, articles, contributions, text, and documentation. | 33
allyshka/pwngitmanager | A tool for extracting specific files from git repositories during penetration testing without downloading the entire repository. | 107