aut
Archive analyzer
The Archives Unleashed Toolkit (aut) is an open-source toolkit for analyzing web archives using Apache Spark.
137 stars
15 watching
33 forks
Language: Scala
Last commit: 9 months ago
Linked from 2 awesome lists
Topics: analysis, apache-spark, big-data, big-data-analytics, dataframe, digital-humanities, hadoop, network-graphing, pyspark, python3, scala, spark, text-extraction, webarchives
Related projects:
Repository | Description | Stars |
---|---|---|
archivesunleashed/twut | An open-source toolkit for analyzing Twitter archives using Apache Spark. | 9 |
archivesunleashed/notebooks | Provides tools and examples for working with web archives using the Archives Unleashed Toolkit | 22 |
netarchivesuite/jwat | A toolkit for analyzing and extracting data from legacy web archives in a structured format suitable for further analysis or reuse | 3 |
oduwsdl/archivenow | A tool to automate archiving of web resources into public archives. | 410 |
archiveteam/wpull | Downloads and crawls web pages, allowing for the archiving of websites. | 556 |
machawk1/wail | A graphical user interface layer for preserving and replaying web pages using multiple archiving tools. | 350 |
jiiks/asar.net | A .NET implementation of the Atom Asar archive format, allowing extraction and manipulation of archived files. | 35 |
internetarchive/arch | A distributed compute analysis system for web archive collections | 15 |
ukwa/webarchive-discovery | Tools for indexing and discovering archived web content | 116 |
chatnoir-eu/chatnoir-resiliparse | A toolkit for processing and analyzing web archive data | 84 |
recrm/archivetools | A collection of tools for extracting and analyzing data from web archives | 69 |
bellingcat/auto-archiver | Automates archiving of online content from various sources into local storage or cloud services | 570 |
le0me55i/zsh-extract | A plugin that automates the extraction of archive files from various formats. | 19 |
karust/gogetcrawl | A tool and package for extracting web archive data from popular sources like Wayback Machine and Common Crawl using the Go programming language. | 147 |
helgeho/archivespark | A framework for efficient data processing and extraction from archival collections, enabling the transformation of raw data into more accessible formats. | 145 |