awesome-spark
A curated list of awesome Apache Spark packages and resources.
2k stars
86 watching
330 forks
Language: Shell
last commit: 2 days ago
Linked from 3 awesome lists
apache-spark, awesome, pyspark, sparkr
Awesome Spark / Packages / Language Bindings | |||
Kotlin for Apache Spark | 459 | 4 months ago | Kotlin API bindings and extensions |
Mobius | 943 | 8 months ago | C# bindings (Deprecated in favor of .NET for Apache Spark) |
.NET for Apache Spark | 2,020 | 3 months ago | .NET bindings |
sparklyr | 948 | 23 days ago | An alternative R backend, using |
sparkle | 447 | over 1 year ago | Haskell on Apache Spark |
spark-connect-rs | 76 | 1 day ago | Rust bindings |
spark-connect-go | 147 | 1 day ago | Golang bindings |
spark-connect-csharp | 1 | 6 months ago | C# bindings |
Awesome Spark / Packages / Notebooks and IDEs | |||
almond | A scala kernel for | ||
Apache Zeppelin | Web-based notebook that enables interactive data analytics with pluggable backends, integrated plotting, and extensive Spark support out-of-the-box | ||
Polynote | Polynote: an IDE-inspired polyglot notebook. It supports mixing multiple languages in one notebook, and sharing data between them seamlessly. It encourages reproducible notebooks with its immutable data model. Originating from | ||
sparkmagic | 1,322 | 2 months ago | magics and kernels for interactively working with remote Spark clusters through , in Jupyter notebooks |
Awesome Spark / Packages / General Purpose Libraries | |||
itachi | 53 | about 1 year ago | A library that brings useful functions from modern database management systems to Apache Spark |
spark-daria | 750 | 9 days ago | A Scala library with essential Spark functions and extensions to make you more productive |
quinn | 627 | 3 days ago | A native PySpark implementation of spark-daria |
Apache DataFu | 115 | 12 days ago | A library of general purpose functions and UDFs |
Joblib Apache Spark Backend | 242 | about 2 months ago | backend for running tasks on Spark clusters |
Awesome Spark / Packages / SQL Data Sources | |||
Spark XML | 501 | about 2 months ago | XML parser and writer |
Spark Cassandra Connector | 1,942 | about 1 month ago | Cassandra support including data source and API and support for arbitrary queries |
Mongo-Spark | 708 | about 2 months ago | Official MongoDB connector |
Awesome Spark / Packages / Storage | |||
Delta Lake | 7,487 | 3 days ago | Storage layer with ACID transactions |
lakeFS | Integration with the lakeFS atomic versioned storage layer | ||
Awesome Spark / Packages / Bioinformatics | |||
ADAM | 998 | about 1 month ago | Set of tools designed to analyse genomics data |
Hail | 976 | 3 days ago | Genetic analysis framework |
Awesome Spark / Packages / GIS | |||
Apache Sedona | 1,881 | 4 days ago | Cluster computing system for processing large-scale spatial data |
Awesome Spark / Packages / Graph Processing | |||
GraphFrames | 996 | 3 months ago | Data frame based graph API |
neo4j-spark-connector | 312 | 4 days ago | Bolt protocol based, Neo4j Connector with RDD, DataFrame and GraphX / GraphFrames support |
Awesome Spark / Packages / Machine Learning Extension | |||
Apache SystemML | Declarative machine learning framework on top of Spark | ||
Mahout Spark Bindings | [status unknown] - linear algebra DSL and optimizer with R-like syntax | ||
KeystoneML | Type safe machine learning pipelines with RDDs | ||
JPMML-Spark | 94 | over 2 years ago | PMML transformer library for Spark ML |
ModelDB | A system to manage machine learning models for and | ||
Sparkling Water | 962 | 3 days ago | interoperability layer |
BigDL | 6,552 | 6 days ago | Distributed Deep Learning library |
MLeap | 1,501 | 3 months ago | Execution engine and serialization format which supports deployment of models without dependency on |
Microsoft ML for Apache Spark | 5,054 | 4 days ago | A distributed ML library with support for LightGBM, Vowpal Wabbit, OpenCV, Deep Learning, Cognitive Services, and Model Deployment |
MLflow | Machine learning orchestration platform | ||
Awesome Spark / Packages / Middleware | |||
Livy | 882 | 22 days ago | REST server with extensive language support (Python, R, Scala), ability to maintain interactive sessions and object sharing |
spark-jobserver | 2,843 | 3 months ago | Simple Spark as a Service which supports object sharing using so-called named objects. JVM only |
Apache Toree | 739 | about 1 month ago | IPython protocol based middleware for interactive applications |
Apache Kyuubi | 2,079 | 3 days ago | A distributed multi-tenant JDBC server for large-scale data processing and analytics, built on top of Apache Spark |
Awesome Spark / Packages / Monitoring | |||
Data Mechanics Delight | 341 | 4 months ago | Cross-platform monitoring tool (Spark UI / Spark History Server replacement) |
Awesome Spark / Packages / Utilities | |||
sparkly | 60 | over 1 year ago | Helpers & syntactic sugar for PySpark |
pyspark-stubs | 115 | about 2 years ago | Static type annotations for PySpark (obsolete since Spark 3.1. See ) |
Flintrock | 637 | 3 months ago | A command-line tool for launching Spark clusters on EC2 |
Optimus | 1,474 | 19 days ago | Data Cleansing and Exploration utilities with the goal of simplifying data cleaning |
Awesome Spark / Packages / Natural Language Processing | |||
spark-nlp | 3,827 | 1 day ago | Natural language processing library built on top of Apache Spark ML |
Awesome Spark / Packages / Streaming | |||
Apache Bahir | Collection of the streaming connectors excluded from Spark 2.0 (Akka, MQTT, Twitter, ZeroMQ) | ||
Awesome Spark / Packages / Interfaces | |||
Apache Beam | Unified data processing engine supporting both batch and streaming applications. Apache Spark is one of the supported execution environments | ||
Koalas | 3,330 | 7 months ago | Pandas DataFrame API on top of Apache Spark |
Awesome Spark / Packages / Testing | |||
deequ | 3,269 | 4 days ago | Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets |
spark-testing-base | 1,514 | 5 days ago | Collection of base test classes |
spark-fast-tests | 431 | 10 days ago | A lightweight and fast testing framework |
Awesome Spark / Packages / Web Archives | |||
Archives Unleashed Toolkit | 137 | 7 months ago | Open-source toolkit for analyzing web archives |
Awesome Spark / Packages / Workflow Management | |||
Cromwell | 990 | 3 days ago | Workflow management system with |
Awesome Spark / Resources / Books | |||
Learning Spark, 2nd Edition | Introduction to Spark API with Spark 3.0 covered. Good source of knowledge about basic concepts | ||
Advanced Analytics with Spark | Useful collection of Spark processing patterns. Accompanying GitHub repository: | ||
Mastering Apache Spark | Interesting compilation of notes by . Focused on different aspects of Spark internals | ||
Spark in Action | New book in Manning's "in action" family with 400+ pages. Starts gently, step by step, and covers a large number of topics. Free excerpt on how to and how to bootstrap a new application using the provided Maven Archetype. You can find the accompanying GitHub repo | ||
Awesome Spark / Resources / Papers | |||
Large-Scale Intelligent Microservices | Microsoft paper that presents an Apache Spark-based micro-service orchestration framework that extends database operations to include web service primitives | ||
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing | Paper introducing a core distributed memory abstraction | ||
Spark SQL: Relational Data Processing in Spark | Paper introducing relational underpinnings, code generation and Catalyst optimizer | ||
Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark | Paper introducing Structured Streaming, a high-level declarative streaming API based on automatically incrementalizing a static relational query | ||
Awesome Spark / Resources / MOOCs | |||
Data Science and Engineering with Apache Spark (edX XSeries) | Series of five courses ( , , , , ) covering different aspects of software engineering and data science. Python oriented | ||
Big Data Analysis with Scala and Spark (Coursera) | Scala oriented introductory course. Part of | ||
Awesome Spark / Resources / Workshops | |||
AMP Camp | Periodic training event organized by the . A source of useful exercises and recorded workshops covering different tools from the | ||
Awesome Spark / Resources / Projects Using Spark | |||
Oryx 2 | 1,786 | about 3 years ago | platform built on Apache Spark and with specialization for real-time large scale machine learning |
Photon ML | 793 | about 3 years ago | A machine learning library supporting classical Generalized Mixed Model and Generalized Additive Mixed Effect Model |
PredictionIO | Machine Learning server for developers and data scientists to build and deploy predictive applications in a fraction of the time | ||
Crossdata | 169 | almost 5 years ago | Data integration platform with extended DataSource API and multi-user environment |
Awesome Spark / Resources / Docker Images | |||
apache/spark | Apache Spark Official Docker images | ||
jupyter/docker-stacks/pyspark-notebook | 7,940 | 4 days ago | PySpark with Jupyter Notebook and Mesos client |
sequenceiq/docker-spark | 765 | over 3 years ago | Yarn images from |
datamechanics/spark | An easy-to-set-up Docker image for Apache Spark from | ||
Awesome Spark / Resources / Miscellaneous | |||
Spark with Scala Gitter channel | " " started by | ||
Apache Spark User List | and - Mailing lists dedicated to usage questions and development topics respectively |