awesome-spark
Spark toolkit
A curated collection of packages and resources for working with Apache Spark, an open-source cluster-computing framework.
apache-spark, awesome, pyspark, sparkr
Awesome Spark / Packages / Language Bindings | |||
| Kotlin for Apache Spark | 463 | over 1 year ago | Kotlin API bindings and extensions |
| .NET for Apache Spark | 2,032 | 11 months ago | .NET bindings |
| sparklyr | 955 | about 1 year ago | An alternative R backend, using |
| sparkle | 447 | over 2 years ago | Haskell on Apache Spark |
| spark-connect-rs | 91 | 12 months ago | Rust bindings |
| spark-connect-go | 168 | 12 months ago | Golang bindings |
| spark-connect-csharp | 1 | over 1 year ago | C# bindings |
Awesome Spark / Packages / Notebooks and IDEs | |||
| almond | A Scala kernel for | ||
| Apache Zeppelin | Web-based notebook that enables interactive data analytics with pluggable backends, integrated plotting, and extensive Spark support out of the box | ||
| Polynote | Polynote: an IDE-inspired polyglot notebook. It supports mixing multiple languages in one notebook, and sharing data between them seamlessly. It encourages reproducible notebooks with its immutable data model. Originating from | ||
| sparkmagic | 1,334 | 11 months ago | Magics and kernels for working interactively with remote Spark clusters in Jupyter notebooks |
Awesome Spark / Packages / General Purpose Libraries | |||
| itachi | 56 | about 2 years ago | A library that brings useful functions from modern database management systems to Apache Spark |
| spark-daria | 754 | about 1 year ago | A Scala library with essential Spark functions and extensions to make you more productive |
| quinn | 651 | 11 months ago | A native PySpark implementation of spark-daria |
| Apache DataFu | 119 | 11 months ago | A library of general-purpose functions and UDFs |
| Joblib Apache Spark Backend | 243 | about 1 year ago | backend for running tasks on Spark clusters |
Awesome Spark / Packages / SQL Data Sources | |||
| Spark XML | 504 | about 1 year ago | XML parser and writer |
| Spark Cassandra Connector | 1,944 | about 1 year ago | Cassandra support including data source and API and support for arbitrary queries |
| Mongo-Spark | 713 | about 1 year ago | Official MongoDB connector |
Awesome Spark / Packages / Storage | |||
| Delta Lake | 7,677 | 11 months ago | Storage layer with ACID transactions |
| Apache Hudi | 5,498 | 11 months ago | Upserts, Deletes And Incremental Processing on Big Data |
| Apache Iceberg | 6,621 | 11 months ago | Open table format for large analytic datasets |
| lakeFS | Integration with the lakeFS atomic versioned storage layer | ||
Awesome Spark / Packages / Bioinformatics | |||
| ADAM | 1,005 | 11 months ago | Set of tools designed to analyse genomics data |
| Hail | 984 | 11 months ago | Genetic analysis framework |
Awesome Spark / Packages / GIS | |||
| Apache Sedona | 1,974 | 11 months ago | Cluster computing system for processing large-scale spatial data |
Awesome Spark / Packages / Graph Processing | |||
| GraphFrames | 1,007 | 11 months ago | Data frame based graph API |
| neo4j-spark-connector | 313 | 11 months ago | Bolt protocol based, Neo4j Connector with RDD, DataFrame and GraphX / GraphFrames support |
Awesome Spark / Packages / Machine Learning Extension | |||
| Apache SystemML | Declarative machine learning framework on top of Spark | ||
| Mahout Spark Bindings | [status unknown] - linear algebra DSL and optimizer with R-like syntax | ||
| KeystoneML | Type safe machine learning pipelines with RDDs | ||
| JPMML-Spark | 94 | over 3 years ago | PMML transformer library for Spark ML |
| ModelDB | A system to manage machine learning models for and | ||
| Sparkling Water | 968 | 11 months ago | interoperability layer |
| BigDL | 6,801 | 11 months ago | Distributed Deep Learning library |
| MLeap | 1,506 | 11 months ago | Execution engine and serialization format which supports deployment of models without dependency on |
| Microsoft ML for Apache Spark | 5,083 | 11 months ago | A distributed ML library with support for LightGBM, Vowpal Wabbit, OpenCV, Deep Learning, Cognitive Services, and Model Deployment |
| MLflow | Machine learning orchestration platform | ||
Awesome Spark / Packages / Middleware | |||
| Livy | 894 | 12 months ago | REST server with extensive language support (Python, R, Scala), ability to maintain interactive sessions and object sharing |
| spark-jobserver | 2,839 | 11 months ago | Simple Spark as a Service that supports object sharing through so-called named objects. JVM only |
| Apache Toree | 740 | 12 months ago | IPython protocol based middleware for interactive applications |
| Apache Kyuubi | 2,116 | 11 months ago | A distributed multi-tenant JDBC server for large-scale data processing and analytics, built on top of Apache Spark |
Awesome Spark / Packages / Monitoring | |||
| Data Mechanics Delight | 344 | over 1 year ago | Cross-platform monitoring tool (Spark UI / Spark History Server replacement) |
Awesome Spark / Packages / Utilities | |||
| sparkly | 61 | over 2 years ago | Helpers & syntactic sugar for PySpark |
| Flintrock | 637 | 11 months ago | A command-line tool for launching Spark clusters on EC2 |
| Optimus | 1,486 | 11 months ago | Data Cleansing and Exploration utilities with the goal of simplifying data cleaning |
Awesome Spark / Packages / Natural Language Processing | |||
| spark-nlp | 3,889 | 11 months ago | Natural language processing library built on top of Apache Spark ML |
Awesome Spark / Packages / Streaming | |||
| Apache Bahir | Collection of the streaming connectors excluded from Spark 2.0 (Akka, MQTT, Twitter, ZeroMQ) | ||
Awesome Spark / Packages / Interfaces | |||
| Apache Beam | Unified data processing engine supporting both batch and streaming applications. Apache Spark is one of the supported execution environments | ||
| Koalas | 3,343 | over 1 year ago | Pandas DataFrame API on top of Apache Spark |
Awesome Spark / Packages / Data quality | |||
| deequ | 3,324 | about 1 year ago | Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets |
| python-deequ | 734 | about 1 year ago | Python API for Deequ |
Awesome Spark / Packages / Testing | |||
| spark-testing-base | 1,525 | 12 months ago | Collection of base test classes |
| spark-fast-tests | 437 | 11 months ago | A lightweight and fast testing framework |
| chispa | 632 | about 1 year ago | PySpark test helpers with beautiful error messages |
Awesome Spark / Packages / Web Archives | |||
| Archives Unleashed Toolkit | 138 | over 1 year ago | Open-source toolkit for analyzing web archives |
Awesome Spark / Packages / Workflow Management | |||
| Cromwell | 1,004 | 11 months ago | Workflow management system with |
Awesome Spark / Resources / Books | |||
| Learning Spark, 2nd Edition | Introduction to the Spark API, covering Spark 3.0. A good source of knowledge about basic concepts | ||
| Advanced Analytics with Spark | Useful collection of Spark processing patterns. Accompanying GitHub repository: | ||
| Mastering Apache Spark | Interesting compilation of notes by . Focused on different aspects of Spark internals | ||
| Spark in Action | New book in Manning's "in action" family with 400+ pages. Starts gently, step by step, and covers a large number of topics. Free excerpt on how to and how to bootstrap a new application using the provided Maven Archetype. You can find the accompanying GitHub repo | ||
Awesome Spark / Resources / Papers | |||
| Large-Scale Intelligent Microservices | Microsoft paper that presents an Apache Spark-based micro-service orchestration framework that extends database operations to include web service primitives | ||
| Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing | Paper introducing a core distributed memory abstraction | ||
| Spark SQL: Relational Data Processing in Spark | Paper introducing relational underpinnings, code generation and Catalyst optimizer | ||
| Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark | Paper introducing Structured Streaming, a high-level declarative streaming API based on automatically incrementalizing a static relational query | ||
Awesome Spark / Resources / MOOCs | |||
| Data Science and Engineering with Apache Spark (edX XSeries) | Series of five courses covering different aspects of software engineering and data science. Python-oriented | ||
| Big Data Analysis with Scala and Spark (Coursera) | Scala-oriented introductory course. Part of | ||
Awesome Spark / Resources / Workshops | |||
| AMP Camp | Periodic training event organized by the . A source of useful exercises and recorded workshops covering different tools from the | ||
Awesome Spark / Resources / Projects Using Spark | |||
| Oryx 2 | 1,787 | about 4 years ago | platform built on Apache Spark and with specialization for real-time large scale machine learning |
| Photon ML | 793 | about 4 years ago | A machine learning library supporting classical Generalized Mixed Model and Generalized Additive Mixed Effect Model |
| PredictionIO | Machine Learning server for developers and data scientists to build and deploy predictive applications in a fraction of the time | ||
| Crossdata | 169 | almost 6 years ago | Data integration platform with extended DataSource API and multi-user environment |
Awesome Spark / Resources / Docker Images | |||
| apache/spark | Apache Spark Official Docker images | ||
| jupyter/docker-stacks/pyspark-notebook | 8,037 | 11 months ago | PySpark with Jupyter Notebook and Mesos client |
| sequenceiq/docker-spark | 765 | over 4 years ago | YARN images from |
| datamechanics/spark | An easy-to-set-up Docker image for Apache Spark from | ||
Awesome Spark / Resources / Miscellaneous | |||
| Spark with Scala Gitter channel | " " started by | ||
| Apache Spark User List | and - Mailing lists dedicated to usage questions and development topics respectively | ||