awesome-spark

A curated list of awesome Apache Spark packages and resources.

GitHub

2k stars
86 watching
330 forks
Language: Shell
last commit: 2 days ago
Linked from 3 awesome lists

apache-spark awesome pyspark sparkr

Awesome Spark / Packages / Language Bindings

Kotlin for Apache Spark 459 4 months ago Kotlin API bindings and extensions
Mobius 943 8 months ago C# bindings (Deprecated in favor of .NET for Apache Spark)
.NET for Apache Spark 2,020 3 months ago .NET bindings
sparklyr 948 23 days ago An alternative R backend, using
sparkle 447 over 1 year ago Haskell on Apache Spark
spark-connect-rs 76 1 day ago Rust bindings
spark-connect-go 147 1 day ago Golang bindings
spark-connect-rs 1 6 months ago C# bindings

Awesome Spark / Packages / Notebooks and IDEs

almond A Scala kernel for
Apache Zeppelin Web-based notebook that enables interactive data analytics with pluggable backends, integrated plotting, and extensive Spark support out of the box
Polynote An IDE-inspired polyglot notebook. It supports mixing multiple languages in one notebook and sharing data between them seamlessly. It encourages reproducible notebooks with its immutable data model. Originating from
sparkmagic 1,322 2 months ago Magics and kernels for working interactively with remote Spark clusters in Jupyter notebooks

Awesome Spark / Packages / General Purpose Libraries

itachi 53 about 1 year ago A library that brings useful functions from modern database management systems to Apache Spark
spark-daria 750 9 days ago A Scala library with essential Spark functions and extensions to make you more productive
quinn 627 3 days ago A native PySpark implementation of spark-daria
Apache DataFu 115 12 days ago A library of general-purpose functions and UDFs
Joblib Apache Spark Backend 242 about 2 months ago backend for running tasks on Spark clusters

Awesome Spark / Packages / SQL Data Sources

Spark XML 501 about 2 months ago XML parser and writer
Spark Cassandra Connector 1,942 about 1 month ago Cassandra support, including a data source, an API, and support for arbitrary queries
Mongo-Spark 708 about 2 months ago Official MongoDB connector

Awesome Spark / Packages / Storage

Delta Lake 7,487 3 days ago Storage layer with ACID transactions
lakeFS Integration with the lakeFS atomic versioned storage layer

Awesome Spark / Packages / Bioinformatics

ADAM 998 about 1 month ago Set of tools designed to analyse genomics data
Hail 976 3 days ago Genetic analysis framework

Awesome Spark / Packages / GIS

Apache Sedona 1,881 4 days ago Cluster computing system for processing large-scale spatial data

Awesome Spark / Packages / Graph Processing

GraphFrames 996 3 months ago DataFrame-based graph API
neo4j-spark-connector 312 4 days ago Bolt protocol-based Neo4j connector with RDD, DataFrame, and GraphX / GraphFrames support

Awesome Spark / Packages / Machine Learning Extension

Apache SystemML Declarative machine learning framework on top of Spark
Mahout Spark Bindings [status unknown] - linear algebra DSL and optimizer with R-like syntax
KeystoneML Type-safe machine learning pipelines with RDDs
JPMML-Spark 94 over 2 years ago PMML transformer library for Spark ML
ModelDB A system to manage machine learning models for and
Sparkling Water 962 3 days ago interoperability layer
BigDL 6,552 6 days ago Distributed Deep Learning library
MLeap 1,501 3 months ago Execution engine and serialization format which supports deployment of models without dependency on
Microsoft ML for Apache Spark 5,054 4 days ago A distributed ML library with support for LightGBM, Vowpal Wabbit, OpenCV, Deep Learning, Cognitive Services, and Model Deployment
MLflow Machine learning orchestration platform

Awesome Spark / Packages / Middleware

Livy 882 22 days ago REST server with extensive language support (Python, R, Scala) and the ability to maintain interactive sessions and share objects
spark-jobserver 2,843 3 months ago Simple Spark as a Service that supports object sharing using so-called named objects. JVM only
Apache Toree 739 about 1 month ago IPython protocol-based middleware for interactive applications
Apache Kyuubi 2,079 3 days ago A distributed multi-tenant JDBC server for large-scale data processing and analytics, built on top of Apache Spark

Awesome Spark / Packages / Monitoring

Data Mechanics Delight 341 4 months ago Cross-platform monitoring tool (Spark UI / Spark History Server replacement)

Awesome Spark / Packages / Utilities

sparkly 60 over 1 year ago Helpers & syntactic sugar for PySpark
pyspark-stubs 115 about 2 years ago Static type annotations for PySpark (obsolete since Spark 3.1)
Flintrock 637 3 months ago A command-line tool for launching Spark clusters on EC2
Optimus 1,474 19 days ago Data cleansing and exploration utilities that aim to simplify data cleaning

Awesome Spark / Packages / Natural Language Processing

spark-nlp 3,827 1 day ago Natural language processing library built on top of Apache Spark ML

Awesome Spark / Packages / Streaming

Apache Bahir Collection of the streaming connectors excluded from Spark 2.0 (Akka, MQTT, Twitter, ZeroMQ)

Awesome Spark / Packages / Interfaces

Apache Beam Unified data processing engine supporting both batch and streaming applications. Apache Spark is one of the supported execution environments
Koalas 3,330 7 months ago Pandas DataFrame API on top of Apache Spark

Awesome Spark / Packages / Testing

deequ 3,269 4 days ago A library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets
spark-testing-base 1,514 5 days ago Collection of base test classes
spark-fast-tests 431 10 days ago A lightweight and fast testing framework

Awesome Spark / Packages / Web Archives

Archives Unleashed Toolkit 137 7 months ago Open-source toolkit for analyzing web archives

Awesome Spark / Packages / Workflow Management

Cromwell 990 3 days ago Workflow management system with

Awesome Spark / Resources / Books

Learning Spark, 2nd Edition Introduction to Spark API with Spark 3.0 covered. Good source of knowledge about basic concepts
Advanced Analytics with Spark Useful collection of Spark processing patterns. Accompanying GitHub repository:
Mastering Apache Spark Interesting compilation of notes by . Focused on different aspects of Spark internals
Spark in Action New book in Manning's "in action" family with 400+ pages. Starts gently, step by step, and covers a large number of topics. Free excerpt on how to and how to bootstrap a new application using the provided Maven Archetype. You can find the accompanying GitHub repo

Awesome Spark / Resources / Papers

Large-Scale Intelligent Microservices Microsoft paper that presents an Apache Spark-based micro-service orchestration framework that extends database operations to include web service primitives
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing Paper introducing a core distributed memory abstraction
Spark SQL: Relational Data Processing in Spark Paper introducing relational underpinnings, code generation and Catalyst optimizer
Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark Paper presenting Structured Streaming, a high-level declarative streaming API based on automatically incrementalizing a static relational query

Awesome Spark / Resources / MOOCs

Data Science and Engineering with Apache Spark (edX XSeries) Series of five courses covering different aspects of software engineering and data science. Python-oriented
Big Data Analysis with Scala and Spark (Coursera) Scala-oriented introductory course. Part of

Awesome Spark / Resources / Workshops

AMP Camp Periodic training event organized by the . A source of useful exercises and recorded workshops covering different tools from the

Awesome Spark / Resources / Projects Using Spark

Oryx 2 1,786 about 3 years ago platform built on Apache Spark and with specialization for real-time large scale machine learning
Photon ML 793 about 3 years ago A machine learning library supporting classical Generalized Mixed Models and Generalized Additive Mixed Effect Models
PredictionIO Machine Learning server for developers and data scientists to build and deploy predictive applications in a fraction of the time
Crossdata 169 almost 5 years ago Data integration platform with extended DataSource API and multi-user environment

Awesome Spark / Resources / Docker Images

apache/spark Apache Spark Official Docker images
jupyter/docker-stacks/pyspark-notebook 7,940 4 days ago PySpark with Jupyter Notebook and Mesos client
sequenceiq/docker-spark 765 over 3 years ago YARN images from
datamechanics/spark An easy-to-set-up Docker image for Apache Spark from

Awesome Spark / Resources / Miscellaneous

Spark with Scala Gitter channel " " started by
Apache Spark User List and - Mailing lists dedicated to usage questions and development topics respectively
