awesome-spark

Spark toolkit

A curated collection of packages and resources for working with Apache Spark, an open-source cluster-computing framework.

Awesome Spark / Packages / Language Bindings

Kotlin for Apache Spark 463 5 months ago Kotlin API bindings and extensions
.NET for Apache Spark 2,026 4 months ago .NET bindings
sparklyr 957 26 days ago An alternative R backend, using dplyr
sparkle 447 almost 2 years ago Haskell on Apache Spark
spark-connect-rs 90 22 days ago Rust bindings
spark-connect-go 162 16 days ago Golang bindings
spark-connect-csharp 1 7 months ago C# bindings

Awesome Spark / Packages / Notebooks and IDEs

almond A Scala kernel for Jupyter
Apache Zeppelin Web-based notebook that enables interactive data analytics with pluggable backends, integrated plotting, and extensive Spark support out of the box
Polynote An IDE-inspired polyglot notebook. It supports mixing multiple languages in one notebook and sharing data between them seamlessly. It encourages reproducible notebooks with its immutable data model. Originating from Netflix
sparkmagic 1,331 about 11 hours ago Jupyter magics and kernels for working interactively with remote Spark clusters through Livy, in Jupyter notebooks

Awesome Spark / Packages / General Purpose Libraries

itachi 56 about 1 year ago A library that brings useful functions from modern database management systems to Apache Spark
spark-daria 754 30 days ago A Scala library with essential Spark functions and extensions to make you more productive
quinn 643 about 1 month ago A native PySpark implementation of spark-daria
Apache DataFu 119 about 1 month ago A library of general purpose functions and UDFs
Joblib Apache Spark Backend 242 3 months ago joblib backend for running tasks on Spark clusters

Awesome Spark / Packages / SQL Data Sources

Spark XML 505 3 months ago XML parser and writer
Spark Cassandra Connector 1,942 3 months ago Cassandra support, including a data source, an API, and support for arbitrary queries
Mongo-Spark 712 3 months ago Official MongoDB connector

Awesome Spark / Packages / Storage

Delta Lake 7,621 about 11 hours ago Storage layer with ACID transactions
Apache Hudi 5,450 about 19 hours ago Upserts, Deletes And Incremental Processing on Big Data
Apache Iceberg 6,494 about 16 hours ago Open table format for huge analytic datasets
lakeFS Integration with the lakeFS atomic versioned storage layer

Awesome Spark / Packages / Bioinformatics

ADAM 1,003 about 1 month ago Set of tools designed to analyse genomics data
Hail 984 3 days ago Genetic analysis framework

Awesome Spark / Packages / GIS

Apache Sedona 1,960 2 days ago Cluster computing system for processing large-scale spatial data

Awesome Spark / Packages / Graph Processing

GraphFrames 1,002 6 days ago Data frame based graph API
neo4j-spark-connector 313 9 days ago Bolt protocol based, Neo4j Connector with RDD, DataFrame and GraphX / GraphFrames support

Awesome Spark / Packages / Machine Learning Extension

Apache SystemML Declarative machine learning framework on top of Spark
Mahout Spark Bindings [status unknown] - linear algebra DSL and optimizer with R-like syntax
KeystoneML Type safe machine learning pipelines with RDDs
JPMML-Spark 94 over 2 years ago PMML transformer library for Spark ML
ModelDB A system to manage machine learning models for spark.ml and scikit-learn
Sparkling Water 968 4 days ago H2O interoperability layer
BigDL 6,733 about 23 hours ago Distributed Deep Learning library
MLeap 1,504 11 days ago Execution engine and serialization format which supports deployment of models without dependency on SparkSession
Microsoft ML for Apache Spark 5,068 about 15 hours ago A distributed ML library with support for LightGBM, Vowpal Wabbit, OpenCV, Deep Learning, Cognitive Services, and Model Deployment
MLflow Machine learning orchestration platform

Awesome Spark / Packages / Middleware

Livy 889 11 days ago REST server with extensive language support (Python, R, Scala), ability to maintain interactive sessions and object sharing
spark-jobserver 2,840 5 months ago Simple Spark-as-a-Service that supports object sharing through so-called named objects. JVM only
Apache Toree 740 15 days ago IPython protocol based middleware for interactive applications
Apache Kyuubi 2,106 about 22 hours ago A distributed multi-tenant JDBC server for large-scale data processing and analytics, built on top of Apache Spark

Awesome Spark / Packages / Monitoring

Data Mechanics Delight 342 6 months ago Cross-platform monitoring tool (Spark UI / Spark History Server replacement)

Awesome Spark / Packages / Utilities

sparkly 60 over 1 year ago Helpers & syntactic sugar for PySpark
Flintrock 638 5 months ago A command-line tool for launching Spark clusters on EC2
Optimus 1,481 5 days ago Data Cleansing and Exploration utilities with the goal of simplifying data cleaning

Awesome Spark / Packages / Natural Language Processing

spark-nlp 3,876 1 day ago Natural language processing library built on top of Apache Spark ML

Awesome Spark / Packages / Streaming

Apache Bahir Collection of the streaming connectors excluded from Spark 2.0 (Akka, MQTT, Twitter, ZeroMQ)

Awesome Spark / Packages / Interfaces

Apache Beam Unified data processing engine supporting both batch and streaming applications. Apache Spark is one of the supported execution environments
Koalas 3,339 8 months ago Pandas DataFrame API on top of Apache Spark

Awesome Spark / Packages / Data quality

deequ 3,315 about 1 month ago Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets
python-deequ 730 about 1 month ago Python API for Deequ
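
The "unit tests for data" idea can be illustrated without Spark. What follows is a minimal plain-Python sketch of two common checks (completeness and uniqueness); it does not use the Deequ or python-deequ API, and the names `completeness`, `is_unique`, `run_checks`, and the sample rows are hypothetical.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def completeness(rows: List[Row], column: str) -> float:
    """Fraction of rows whose value in `column` is not None."""
    if not rows:
        return 1.0
    non_null = sum(1 for row in rows if row.get(column) is not None)
    return non_null / len(rows)


def is_unique(rows: List[Row], column: str) -> bool:
    """True if no non-null value in `column` appears more than once."""
    values = [row[column] for row in rows if row.get(column) is not None]
    return len(values) == len(set(values))


def run_checks(
    rows: List[Row], checks: Dict[str, Callable[[List[Row]], bool]]
) -> Dict[str, bool]:
    """Run each named check and return a name -> passed mapping."""
    return {name: bool(check(rows)) for name, check in checks.items()}


# Hypothetical sample data: one missing email and one duplicated id.
rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 2, "email": "c@example.com"},
]

checks = {
    "email_mostly_complete": lambda rs: completeness(rs, "email") >= 0.5,
    "id_is_unique": lambda rs: is_unique(rs, "id"),
}

report = run_checks(rows, checks)
print(report)
```

Each check is a named predicate over the whole dataset, so a failing constraint can be reported by name, much like a failing unit test. On a real cluster the same metrics would be computed as distributed aggregations rather than Python loops.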

Awesome Spark / Packages / Testing

spark-testing-base 1,524 20 days ago Collection of base test classes
spark-fast-tests 436 15 days ago A lightweight and fast testing framework
chispa 620 30 days ago PySpark test helpers with beautiful error messages
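
As a rough illustration of what a test helper with readable error messages does, here is a plain-Python sketch that compares two row collections and reports every mismatch. It is not the API of chispa or the other libraries above; `assert_rows_equal`, `RowsMismatchError`, and the sample rows are hypothetical names chosen for this example.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


class RowsMismatchError(AssertionError):
    """Raised when two row collections differ."""


def assert_rows_equal(actual: List[Row], expected: List[Row]) -> None:
    """Compare rows position by position and report every mismatch."""
    problems = []
    if len(actual) != len(expected):
        problems.append(
            f"row count: actual={len(actual)} expected={len(expected)}"
        )
    for index, (got, want) in enumerate(zip(actual, expected)):
        if got != want:
            problems.append(f"row {index}: actual={got!r} expected={want!r}")
    if problems:
        raise RowsMismatchError("rows differ:\n" + "\n".join(problems))


# Hypothetical data: the second row has a wrong age.
expected = [{"name": "alice", "age": 30}, {"name": "bob", "age": 41}]
actual = [{"name": "alice", "age": 30}, {"name": "bob", "age": 14}]

try:
    assert_rows_equal(actual, expected)
    message = ""
except RowsMismatchError as error:
    message = str(error)

print(message)
```

Collecting all mismatches before raising, instead of stopping at the first one, is the design choice that makes a failing test quicker to diagnose.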

Awesome Spark / Packages / Web Archives

Archives Unleashed Toolkit 137 9 months ago Open-source toolkit for analyzing web archives

Awesome Spark / Packages / Workflow Management

Cromwell 998 about 15 hours ago Workflow management system with a Spark backend

Awesome Spark / Resources / Books

Learning Spark, 2nd Edition Introduction to the Spark API, covering Spark 3.0. A good source of knowledge about basic concepts
Advanced Analytics with Spark Useful collection of Spark processing patterns. An accompanying GitHub repository is available
Mastering Apache Spark Interesting compilation of notes by Jacek Laskowski. Focused on different aspects of Spark internals
Spark in Action Book in Manning's "in Action" family with 400+ pages. Starts gently, proceeds step by step, and covers a large number of topics. A free excerpt shows how to set up Eclipse for Spark application development and how to bootstrap a new application using the provided Maven Archetype. An accompanying GitHub repo is available

Awesome Spark / Resources / Papers

Large-Scale Intelligent Microservices Microsoft paper that presents an Apache Spark-based micro-service orchestration framework that extends database operations to include web service primitives
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing Paper introducing a core distributed memory abstraction
Spark SQL: Relational Data Processing in Spark Paper introducing relational underpinnings, code generation and Catalyst optimizer
Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark Paper describing Structured Streaming, a high-level declarative streaming API based on automatically incrementalizing a static relational query
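
To make "incrementalizing a static relational query" concrete, the plain-Python sketch below writes a word count once as a batch query and once as a stateful computation that folds in one micro-batch at a time. It is hand-written for a single aggregation and does not use Spark; Structured Streaming derives the incremental plan automatically. `batch_word_count`, `IncrementalWordCount`, and the sample input are hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def batch_word_count(lines: Iterable[str]) -> Counter:
    """The 'static' query: count words over the complete input."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


class IncrementalWordCount:
    """Keeps running state and folds in one micro-batch at a time."""

    def __init__(self) -> None:
        self.state: Counter = Counter()

    def process_batch(self, lines: List[str]) -> Counter:
        # Only the new rows are read; earlier input is never rescanned.
        self.state.update(batch_word_count(lines))
        return self.state


# Hypothetical input arriving as three micro-batches.
micro_batches = [["spark streaming", "spark sql"], ["sql"], ["streaming spark"]]

query = IncrementalWordCount()
for batch in micro_batches:
    query.process_batch(batch)

# The incremental result matches the batch query over all input seen so far.
all_lines = [line for batch in micro_batches for line in batch]
assert query.state == batch_word_count(all_lines)
print(dict(query.state))
```

The property worth noting is the final assertion: after every micro-batch, the running state equals what the batch query would return over the whole input so far.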

Awesome Spark / Resources / MOOCS

Data Science and Engineering with Apache Spark (edX XSeries) Series of five courses covering different aspects of software engineering and data science. Python oriented
Big Data Analysis with Scala and Spark (Coursera) Scala oriented introductory course. Part of the Functional Programming in Scala Specialization

Awesome Spark / Resources / Workshops

AMP Camp Periodic training event organized by the UC Berkeley AMPLab. A source of useful exercises and recorded workshops covering different tools from the Berkeley Data Analytics Stack

Awesome Spark / Resources / Projects Using Spark

Oryx 2 1,787 over 3 years ago Lambda architecture platform built on Apache Spark and Apache Kafka, with specialization for real-time large-scale machine learning
Photon ML 792 about 3 years ago A machine learning library supporting classical Generalized Mixed Model and Generalized Additive Mixed Effect Model
PredictionIO Machine Learning server for developers and data scientists to build and deploy predictive applications in a fraction of the time
Crossdata 169 about 5 years ago Data integration platform with extended DataSource API and multi-user environment

Awesome Spark / Resources / Docker Images

apache/spark Apache Spark Official Docker images
jupyter/docker-stacks/pyspark-notebook 8,009 4 days ago PySpark with Jupyter Notebook and Mesos client
sequenceiq/docker-spark 765 over 3 years ago Yarn images from SequenceIQ
datamechanics/spark An easy-to-set-up Docker image for Apache Spark from Data Mechanics

Awesome Spark / Resources / Miscellaneous

Spark with Scala Gitter channel
Apache Spark User List and Apache Spark Developers List Mailing lists dedicated to usage questions and development topics respectively
