ArchiveSpark

Data processor

A framework for efficient data processing and extraction from archival collections, enabling the transformation of raw data into more accessible formats.

An Apache Spark framework for easy data processing, extraction, and derivation for web archives and archival collections, developed at the Internet Archive.

GitHub

145 stars

15 watching

19 forks

Language: Scala

last commit: almost 2 years ago

Linked from 1 awesome list

archivespark, internet-archive, spark, spark-framework, warc, web-archiving, webarchive

Backlinks from these awesome lists:

iipc/awesome-web-archiving

Related projects:

Repository	Description	Stars
internetarchive/sparkling	A data processing library built on top of Apache Spark to handle temporal web data	11
apache/spark	An analytics engine designed to handle large-scale data processing and analysis	40,170
helgeho/warcpartitioner	Tool for partitioning and merging Web archive files by MIME type and year	1
internetarchive/arch	A distributed compute analysis system for web archive collections	15
svenkreiss/pysparkling	A lightweight Python implementation of Spark's RDD and DStream interfaces for improved performance on small datasets	262
instaclustr/sample-kafkasparkcassandra	An introductory Scala app using Apache Spark Streaming to process data from Kafka and write summaries to Cassandra.	23
apache/pig	Enables data processing and transformation in large files using a high-level language with compile-time optimizations for efficient execution on distributed computing frameworks.	682
helgeho/hadoopconcatgz	Provides a custom input format for handling concatenated GZIP files in distributed processing systems like Hadoop	9
utdemir/distributed-dataset	A Haskell-based framework for processing and distributing large datasets across multiple nodes in parallel.	116
nathanmarz/cascalog	A library for data processing and querying on large datasets without the need for Hadoop expertise	1,375
apache/samza	A distributed stream processing framework for handling high-volume data streams with fault tolerance and durability guarantees	817
archivesunleashed/aut	An open-source toolkit for analyzing web archives using Apache Spark.	138
helgeho/web2warc	A Web crawler that creates custom archives in WARC/CDX format	25
hepdata/hepdata	A web application for managing and sharing high-energy physics data from experiments	41
iokasimov/apart	A Haskell library for serializing and deserializing data structures in a persistent manner.	30