ArchiveSpark
Data processor
A framework for efficient data processing and extraction from archival collections, enabling the transformation of raw data into more accessible formats.
An Apache Spark framework for easy data processing, extraction, and derivation for web archives and archival collections, developed at the Internet Archive.
145 stars
15 watching
19 forks
Language: Scala
Last commit: 2 months ago
Linked from 1 awesome list
archivespark, internet-archive, spark, spark-framework, warc, web-archiving, webarchive
Related projects:
Repository | Description | Stars |
---|---|---|
internetarchive/sparkling | A data processing library built on top of Apache Spark to handle temporal web data | 11 |
apache/spark | An analytics engine designed to handle large-scale data processing and analysis | 40,002 |
helgeho/warcpartitioner | Tool for partitioning and merging Web archive files by MIME type and year | 1 |
internetarchive/arch | A distributed compute analysis system for web archive collections | 15 |
svenkreiss/pysparkling | A lightweight Python implementation of Spark's RDD and DStream interfaces for improved performance on small datasets | 262 |
instaclustr/sample-kafkasparkcassandra | An introductory Scala app using Apache Spark Streaming to process data from Kafka and write summaries to Cassandra. | 23 |
apache/pig | Enables data processing and transformation in large files using a high-level language with compile-time optimizations for efficient execution on distributed computing frameworks. | 681 |
helgeho/hadoopconcatgz | Provides a custom input format for handling concatenated GZIP files in distributed processing systems like Hadoop | 9 |
utdemir/distributed-dataset | A Haskell-based framework for processing and distributing large datasets across multiple nodes in parallel. | 116 |
nathanmarz/cascalog | A library for data processing and querying on large datasets without the need for Hadoop expertise | 1,376 |
apache/samza | A distributed stream processing framework for handling high-volume data streams with fault tolerance and durability guarantees | 819 |
archivesunleashed/aut | An open-source toolkit for analyzing web archives using Apache Spark. | 137 |
helgeho/web2warc | A Web crawler that creates custom archives in WARC/CDX format | 24 |
hepdata/hepdata | A web application for managing and sharing high-energy physics data from experiments | 41 |
iokasimov/apart | A Haskell library for serializing and deserializing data structures in a persistent manner. | 30 |