ArchiveSpark

Data processor

A framework for efficient data processing and extraction from archival collections, enabling the transformation of raw data into more accessible formats.

An Apache Spark framework for easy data processing, extraction, and derivation for web archives and archival collections, developed at the Internet Archive.

GitHub

145 stars
15 watching
19 forks
Language: Scala
Last commit: 4 months ago
Linked from 1 awesome list

Tags: archivespark, internet-archive, spark, spark-framework, warc, web-archiving, webarchive

Related projects:

Repository | Description | Stars
internetarchive/sparkling | A data processing library built on top of Apache Spark to handle temporal web data | 11
apache/spark | An analytics engine designed to handle large-scale data processing and analysis | 40,170
helgeho/warcpartitioner | Tool for partitioning and merging Web archive files by MIME type and year | 1
internetarchive/arch | A distributed compute analysis system for web archive collections | 15
svenkreiss/pysparkling | A lightweight Python implementation of Spark's RDD and DStream interfaces for improved performance on small datasets | 262
instaclustr/sample-kafkasparkcassandra | An introductory Scala app using Apache Spark Streaming to process data from Kafka and write summaries to Cassandra | 23
apache/pig | Enables data processing and transformation in large files using a high-level language with compile-time optimizations for efficient execution on distributed computing frameworks | 682
helgeho/hadoopconcatgz | Provides a custom input format for handling concatenated GZIP files in distributed processing systems like Hadoop | 9
utdemir/distributed-dataset | A Haskell-based framework for processing and distributing large datasets across multiple nodes in parallel | 116
nathanmarz/cascalog | A library for data processing and querying on large datasets without the need for Hadoop expertise | 1,375
apache/samza | A distributed stream processing framework for handling high-volume data streams with fault tolerance and durability guarantees | 817
archivesunleashed/aut | An open-source toolkit for analyzing web archives using Apache Spark | 138
helgeho/web2warc | A Web crawler that creates custom archives in WARC/CDX format | 25
hepdata/hepdata | A web application for managing and sharing high-energy physics data from experiments | 41
iokasimov/apart | A Haskell library for serializing and deserializing data structures in a persistent manner | 30