aut

Archive analyzer

The Archives Unleashed Toolkit (aut) is an open-source toolkit for analyzing web archives using Apache Spark.

GitHub

137 stars
15 watching
33 forks
Language: Scala
Last commit: 9 months ago
Linked from 2 awesome lists

analysis, apache-spark, big-data, big-data-analytics, dataframe, digital-humanities, hadoop, network-graphing, pyspark, python3, scala, spark, text-extraction, webarchives

Related projects:

| Repository | Description | Stars |
| --- | --- | --- |
| archivesunleashed/twut | An open-source toolkit for analyzing Twitter archives using Apache Spark. | 9 |
| archivesunleashed/notebooks | Provides tools and examples for working with web archives using the Archives Unleashed Toolkit | 22 |
| netarchivesuite/jwat | A toolkit for analyzing and extracting data from legacy web archives in a structured format suitable for further analysis or reuse | 3 |
| oduwsdl/archivenow | A tool to automate archiving of web resources into public archives. | 410 |
| archiveteam/wpull | Downloads and crawls web pages, allowing for the archiving of websites. | 556 |
| machawk1/wail | A graphical user interface layer for preserving and replaying web pages using multiple archiving tools. | 350 |
| jiiks/asar.net | A .NET implementation of the Atom Asar archive format, allowing extraction and manipulation of archived files. | 35 |
| internetarchive/arch | A distributed compute analysis system for web archive collections | 15 |
| ukwa/webarchive-discovery | Tools for indexing and discovering archived web content | 116 |
| chatnoir-eu/chatnoir-resiliparse | A toolkit for processing and analyzing web archive data | 84 |
| recrm/archivetools | A collection of tools for extracting and analyzing data from web archives | 69 |
| bellingcat/auto-archiver | Automates archiving of online content from various sources into local storage or cloud services | 570 |
| le0me55i/zsh-extract | A plugin that automates the extraction of archive files from various formats. | 19 |
| karust/gogetcrawl | A tool and package for extracting web archive data from popular sources like Wayback Machine and Common Crawl using the Go programming language. | 147 |
| helgeho/archivespark | A framework for efficient data processing and extraction from archival collections, enabling the transformation of raw data into more accessible formats. | 145 |