awesome-spark
Spark toolkit
A curated collection of packages and resources for working with Apache Spark, an open-source cluster-computing framework.
A curated list of awesome Apache Spark packages and resources.
2k stars
85 watching
330 forks
Language: Shell
last commit: about 1 month ago
Linked from 3 awesome lists
apache-spark, awesome, pyspark, sparkr
Awesome Spark / Packages / Language Bindings | |||
Kotlin for Apache Spark | 463 | 5 months ago | Kotlin API bindings and extensions |
.NET for Apache Spark | 2,026 | 4 months ago | .NET bindings |
sparklyr | 957 | 26 days ago | An alternative R backend |
sparkle | 447 | almost 2 years ago | Haskell on Apache Spark |
spark-connect-rs | 90 | 22 days ago | Rust bindings |
spark-connect-go | 162 | 16 days ago | Golang bindings |
spark-connect-csharp | 1 | 7 months ago | C# bindings |
Awesome Spark / Packages / Notebooks and IDEs | |||
almond | A Scala kernel for Jupyter | ||
Apache Zeppelin | Web-based notebook that enables interactive data analytics with pluggable backends, integrated plotting, and extensive Spark support out of the box | ||
Polynote | An IDE-inspired polyglot notebook. It supports mixing multiple languages in one notebook and sharing data between them seamlessly, and it encourages reproducible notebooks with its immutable data model | ||
sparkmagic | 1,331 | about 11 hours ago | Magics and kernels for working interactively with remote Spark clusters in Jupyter notebooks |
Awesome Spark / Packages / General Purpose Libraries | |||
itachi | 56 | about 1 year ago | A library that brings useful functions from modern database management systems to Apache Spark |
spark-daria | 754 | 30 days ago | A Scala library with essential Spark functions and extensions to make you more productive |
quinn | 643 | about 1 month ago | A native PySpark implementation of spark-daria |
Apache DataFu | 119 | about 1 month ago | A library of general-purpose functions and UDFs |
Joblib Apache Spark Backend | 242 | 3 months ago | Joblib backend for running tasks on Spark clusters |
Awesome Spark / Packages / SQL Data Sources | |||
Spark XML | 505 | 3 months ago | XML parser and writer |
Spark Cassandra Connector | 1,942 | 3 months ago | Cassandra support including data source and API and support for arbitrary queries |
Mongo-Spark | 712 | 3 months ago | Official MongoDB connector |
Awesome Spark / Packages / Storage | |||
Delta Lake | 7,621 | about 11 hours ago | Storage layer with ACID transactions |
Apache Hudi | 5,450 | about 19 hours ago | Upserts, Deletes And Incremental Processing on Big Data |
Apache Iceberg | 6,494 | about 16 hours ago | Open table format for large analytic datasets |
lakeFS | Integration with the lakeFS atomic versioned storage layer | ||
Awesome Spark / Packages / Bioinformatics | |||
ADAM | 1,003 | about 1 month ago | Set of tools designed to analyse genomics data |
Hail | 984 | 3 days ago | Genetic analysis framework |
Awesome Spark / Packages / GIS | |||
Apache Sedona | 1,960 | 2 days ago | Cluster computing system for processing large-scale spatial data |
Awesome Spark / Packages / Graph Processing | |||
GraphFrames | 1,002 | 6 days ago | DataFrame-based graph API |
neo4j-spark-connector | 313 | 9 days ago | Bolt-protocol-based Neo4j connector with RDD, DataFrame, and GraphX / GraphFrames support |
Awesome Spark / Packages / Machine Learning Extension | |||
Apache SystemML | Declarative machine learning framework on top of Spark | ||
Mahout Spark Bindings | [status unknown] - linear algebra DSL and optimizer with R-like syntax | ||
KeystoneML | Type safe machine learning pipelines with RDDs | ||
JPMML-Spark | 94 | over 2 years ago | PMML transformer library for Spark ML |
ModelDB | A system to manage machine learning models | ||
Sparkling Water | 968 | 4 days ago | interoperability layer |
BigDL | 6,733 | about 23 hours ago | Distributed Deep Learning library |
MLeap | 1,504 | 11 days ago | Execution engine and serialization format that supports deployment of models without a dependency on Spark |
Microsoft ML for Apache Spark | 5,068 | about 15 hours ago | A distributed ML library with support for LightGBM, Vowpal Wabbit, OpenCV, Deep Learning, Cognitive Services, and Model Deployment |
MLflow | Machine learning orchestration platform | ||
Awesome Spark / Packages / Middleware | |||
Livy | 889 | 11 days ago | REST server with extensive language support (Python, R, Scala), ability to maintain interactive sessions and object sharing |
spark-jobserver | 2,840 | 5 months ago | Simple Spark-as-a-Service that supports object sharing through so-called named objects. JVM only |
Apache Toree | 740 | 15 days ago | IPython protocol based middleware for interactive applications |
Apache Kyuubi | 2,106 | about 22 hours ago | A distributed multi-tenant JDBC server for large-scale data processing and analytics, built on top of Apache Spark |
Awesome Spark / Packages / Monitoring | |||
Data Mechanics Delight | 342 | 6 months ago | Cross-platform monitoring tool (Spark UI / Spark History Server replacement) |
Awesome Spark / Packages / Utilities | |||
sparkly | 60 | over 1 year ago | Helpers & syntactic sugar for PySpark |
Flintrock | 638 | 5 months ago | A command-line tool for launching Spark clusters on EC2 |
Optimus | 1,481 | 5 days ago | Data Cleansing and Exploration utilities with the goal of simplifying data cleaning |
Awesome Spark / Packages / Natural Language Processing | |||
spark-nlp | 3,876 | 1 day ago | Natural language processing library built on top of Apache Spark ML |
Awesome Spark / Packages / Streaming | |||
Apache Bahir | Collection of the streaming connectors excluded from Spark 2.0 (Akka, MQTT, Twitter, ZeroMQ) | ||
Awesome Spark / Packages / Interfaces | |||
Apache Beam | Unified data processing engine supporting both batch and streaming applications. Apache Spark is one of the supported execution environments | ||
Koalas | 3,339 | 8 months ago | Pandas DataFrame API on top of Apache Spark |
Awesome Spark / Packages / Data quality | |||
deequ | 3,315 | about 1 month ago | Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets |
python-deequ | 730 | about 1 month ago | Python API for Deequ |
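
The "unit tests for data" idea described for deequ can be illustrated with a minimal plain-Python sketch. This is not the Deequ or python-deequ API: `completeness`, `is_unique`, `run_checks`, and the sample rows are hypothetical names chosen for illustration, and a real setup would compute such metrics with Spark over a distributed dataset.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def completeness(rows: List[Row], column: str) -> float:
    """Fraction of rows where `column` is present and not None."""
    if not rows:
        return 1.0
    non_null = sum(1 for r in rows if r.get(column) is not None)
    return non_null / len(rows)


def is_unique(rows: List[Row], column: str) -> bool:
    """True when no non-null value of `column` appears more than once."""
    values = [r[column] for r in rows if r.get(column) is not None]
    return len(values) == len(set(values))


def run_checks(rows: List[Row]) -> Dict[str, bool]:
    """Evaluate a small, hypothetical set of data-quality constraints."""
    return {
        "id_complete": completeness(rows, "id") == 1.0,
        "id_unique": is_unique(rows, "id"),
        "email_mostly_complete": completeness(rows, "email") >= 0.5,
    }


# Hypothetical sample data: one row has a missing email.
rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 3, "email": "c@example.com"},
]
results = run_checks(rows)
```

Each constraint reduces to a metric (a ratio or a boolean) compared against a threshold, which is what lets such checks run as automated tests on every new batch of data.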
Awesome Spark / Packages / Testing | |||
spark-testing-base | 1,524 | 20 days ago | Collection of base test classes |
spark-fast-tests | 436 | 15 days ago | A lightweight and fast testing framework |
chispa | 620 | 30 days ago | PySpark test helpers with beautiful error messages |
Awesome Spark / Packages / Web Archives | |||
Archives Unleashed Toolkit | 137 | 9 months ago | Open-source toolkit for analyzing web archives |
Awesome Spark / Packages / Workflow Management | |||
Cromwell | 998 | about 15 hours ago | Workflow management system |
Awesome Spark / Resources / Books | |||
Learning Spark, 2nd Edition | Introduction to Spark API with Spark 3.0 covered. Good source of knowledge about basic concepts | ||
Advanced Analytics with Spark | Useful collection of Spark processing patterns, with an accompanying GitHub repository | ||
Mastering Apache Spark | Interesting compilation of notes focused on different aspects of Spark internals | ||
Spark in Action | New book in Manning's "in Action" series with 400+ pages. Starts gently, step by step, and covers a large number of topics. A free excerpt shows how to bootstrap a new application using the provided Maven Archetype. An accompanying GitHub repository is available | ||
Awesome Spark / Resources / Papers | |||
Large-Scale Intelligent Microservices | Microsoft paper that presents an Apache Spark-based micro-service orchestration framework that extends database operations to include web service primitives | ||
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing | Paper introducing a core distributed memory abstraction | ||
Spark SQL: Relational Data Processing in Spark | Paper introducing relational underpinnings, code generation and Catalyst optimizer | ||
Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark | Paper introducing Structured Streaming, a high-level declarative streaming API based on automatically incrementalizing a static relational query | ||
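
The phrase "incrementalizing a static relational query" can be illustrated with a minimal plain-Python sketch. It is not Spark's implementation or API: `static_count`, `IncrementalCount`, and the sample batches are hypothetical, and the static query is assumed to be a simple `GROUP BY key` count.

```python
from collections import Counter
from typing import Iterable, List


def static_count(events: Iterable[str]) -> Counter:
    """Batch version: count occurrences of each key over the full input."""
    return Counter(events)


class IncrementalCount:
    """Streaming version: keeps running state and updates it per micro-batch."""

    def __init__(self) -> None:
        self.state: Counter = Counter()

    def process_batch(self, batch: List[str]) -> Counter:
        # Only the new records are touched; earlier input is never re-read.
        self.state.update(batch)
        return Counter(self.state)  # snapshot of the current result table


# Hypothetical micro-batches arriving over time.
batches = [["a", "b"], ["a"], ["c", "a"]]

query = IncrementalCount()
latest = Counter()
for batch in batches:
    latest = query.process_batch(batch)

all_events = [event for batch in batches for event in batch]
```

After the last batch, `latest` equals `static_count(all_events)`: the incremental result matches what the static query would return over all data seen so far.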
Awesome Spark / Resources / MOOCS | |||
Data Science and Engineering with Apache Spark (edX XSeries) | Series of five courses covering different aspects of software engineering and data science. Python oriented | ||
Big Data Analysis with Scala and Spark (Coursera) | Scala-oriented introductory course | ||
Awesome Spark / Resources / Workshops | |||
AMP Camp | Periodic training event. A source of useful exercises and recorded workshops covering different tools | ||
Awesome Spark / Resources / Projects Using Spark | |||
Oryx 2 | 1,787 | over 3 years ago | Platform built on Apache Spark, specialized for real-time large-scale machine learning |
Photon ML | 792 | about 3 years ago | A machine learning library supporting classical Generalized Mixed Model and Generalized Additive Mixed Effect Model |
PredictionIO | Machine Learning server for developers and data scientists to build and deploy predictive applications in a fraction of the time | ||
Crossdata | 169 | about 5 years ago | Data integration platform with extended DataSource API and multi-user environment |
Awesome Spark / Resources / Docker Images | |||
apache/spark | Apache Spark Official Docker images | ||
jupyter/docker-stacks/pyspark-notebook | 8,009 | 4 days ago | PySpark with Jupyter Notebook and Mesos client |
sequenceiq/docker-spark | 765 | over 3 years ago | YARN images from SequenceIQ |
datamechanics/spark | An easy-to-set-up Docker image for Apache Spark from Data Mechanics | ||
Awesome Spark / Resources / Miscellaneous | |||
Spark with Scala Gitter channel | Gitter chat channel for Spark with Scala | ||
Apache Spark User List | Mailing lists dedicated to usage questions and development topics |