gogetcrawl
Archive extractor
A Go command-line tool and package for extracting web archive data from popular sources such as the Wayback Machine and Common Crawl.
Extract web archive data using Wayback Machine and Common Crawl
147 stars
5 watching
16 forks
Language: Go
last commit: 17 days ago
Linked from 1 awesome list
Topics: commoncrawl, concurrency, crawler, golang, wayback-machine, webarchive
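To make "extracting web archive data" concrete, the following is a minimal sketch of what querying the Wayback Machine's public CDX index involves, written with only the Go standard library. It is not gogetcrawl's API: the names `Capture`, `cdxQueryURL`, and `parseCDX` are hypothetical and chosen for illustration. The sketch assumes the CDX server's `output=json` format, in which the response is an array of string arrays and the first row holds the field names.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
	"strconv"
)

// Capture is a hypothetical type holding a few fields of one CDX row.
type Capture struct {
	Timestamp string // capture time, e.g. "20200101000000"
	Original  string // URL as originally crawled
	MimeType  string
	Status    string
}

// cdxQueryURL builds a Wayback Machine CDX search URL for target,
// asking for JSON output and at most limit rows.
func cdxQueryURL(target string, limit int) string {
	q := url.Values{}
	q.Set("url", target)
	q.Set("output", "json")
	q.Set("limit", strconv.Itoa(limit))
	return "https://web.archive.org/cdx/search/cdx?" + q.Encode()
}

// parseCDX decodes a CDX JSON response. The first row is treated as the
// header, so fields are looked up by name rather than by fixed position.
func parseCDX(data []byte) ([]Capture, error) {
	var rows [][]string
	if err := json.Unmarshal(data, &rows); err != nil {
		return nil, err
	}
	if len(rows) < 2 {
		return nil, nil // header only, or no results at all
	}
	idx := make(map[string]int, len(rows[0]))
	for i, name := range rows[0] {
		idx[name] = i
	}
	get := func(row []string, name string) string {
		i, ok := idx[name]
		if !ok || i >= len(row) {
			return ""
		}
		return row[i]
	}
	captures := make([]Capture, 0, len(rows)-1)
	for _, row := range rows[1:] {
		captures = append(captures, Capture{
			Timestamp: get(row, "timestamp"),
			Original:  get(row, "original"),
			MimeType:  get(row, "mimetype"),
			Status:    get(row, "statuscode"),
		})
	}
	return captures, nil
}

func main() {
	fmt.Println(cdxQueryURL("example.com", 2))

	// Sample response in the assumed shape; a real program would fetch
	// the URL above with net/http and pass the body to parseCDX.
	sample := []byte(`[
		["urlkey","timestamp","original","mimetype","statuscode","digest","length"],
		["com,example)/","20200101000000","http://example.com/","text/html","200","ABC","1234"]
	]`)
	captures, err := parseCDX(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, c := range captures {
		fmt.Println(c.Timestamp, c.Status, c.Original)
	}
}
```

The sketch deliberately skips the network call so it runs offline. Looking fields up by header name keeps the parser working if the server is asked for a different field list.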
Related projects:
Repository | Description | Stars |
---|---|---|
s0rg/crawley | A utility for systematically extracting URLs from web pages and printing them to the console. | 263 |
recrm/archivetools | A collection of tools for extracting and analyzing data from web archives | 69 |
dwisiswant0/galer | A tool that extracts URLs from HTML attributes by crawling pages and evaluating JavaScript. | 253 |
archiveteam/wpull | Downloads and crawls web pages, allowing for the archiving of websites. | 556 |
go-shiori/obelisk | Archives a web page as a single HTML file with embedded resources. | 263 |
archiveteam/grab-site | A web crawler designed to back up websites by recursively crawling them and writing WARC files. | 1,398 |
jiiks/asar.net | A .NET implementation of the Atom Asar archive format, allowing extraction and manipulation of archived files. | 35 |
afjoseph/rake.go | An algorithm for extracting keywords from text based on word frequency and part-of-speech tagging | 117 |
richardlehane/webarchive | Provides tools for reading and parsing web archive formats used in digital preservation. | 20 |
wabarc/wayback | A tool for capturing and preserving web content and making it accessible in the future. | 1,811 |
limiu82214/gojmapr | A library to extract specific properties from complex JSON structures into Go structs with minimal code changes. | 22 |
oduwsdl/archivenow | A tool to automate archiving of web resources into public archives. | 410 |
thetic/extract | A plugin that allows users to extract files from various archive formats without specifying the extraction command. | 9 |
gb-archive/salvage | A tool for archiving and preserving online content, allowing users to salvage and store websites, articles, contributions, text, and documentation. | 32 |
allyshka/pwngitmanager | A tool for extracting specific files from git repositories during penetration testing without downloading the entire repository. | 107 |