ArchiveSpark

Data processor

A framework for efficient data processing and extraction from archival collections, enabling the transformation of raw data into more accessible formats.

An Apache Spark framework for easy data processing, extraction, and derivation for web archives and archival collections, developed at the Internet Archive.

GitHub

145 stars
15 watching
19 forks
Language: Scala
Last commit: 2 months ago
Linked from 1 awesome list

archivespark, internet-archive, spark, spark-framework, warc, web-archiving, webarchive

Related projects:

Repository | Description | Stars
internetarchive/sparkling | A data processing library built on top of Apache Spark to handle temporal web data | 11
apache/spark | An analytics engine designed to handle large-scale data processing and analysis | 40,002
helgeho/warcpartitioner | Tool for partitioning and merging Web archive files by MIME type and year | 1
internetarchive/arch | A distributed compute analysis system for web archive collections | 15
svenkreiss/pysparkling | A lightweight Python implementation of Spark's RDD and DStream interfaces for improved performance on small datasets | 262
instaclustr/sample-kafkasparkcassandra | An introductory Scala app using Apache Spark Streaming to process data from Kafka and write summaries to Cassandra | 23
apache/pig | Enables data processing and transformation in large files using a high-level language with compile-time optimizations for efficient execution on distributed computing frameworks | 681
helgeho/hadoopconcatgz | Provides a custom input format for handling concatenated GZIP files in distributed processing systems like Hadoop | 9
utdemir/distributed-dataset | A Haskell-based framework for processing and distributing large datasets across multiple nodes in parallel | 116
nathanmarz/cascalog | A library for data processing and querying on large datasets without the need for Hadoop expertise | 1,376
apache/samza | A distributed stream processing framework for handling high-volume data streams with fault tolerance and durability guarantees | 819
archivesunleashed/aut | An open-source toolkit for analyzing web archives using Apache Spark | 137
helgeho/web2warc | A Web crawler that creates custom archives in WARC/CDX format | 24
hepdata/hepdata | A web application for managing and sharing high-energy physics data from experiments | 41
iokasimov/apart | A Haskell library for serializing and deserializing data structures in a persistent manner | 30