awesome-streaming

Streaming platform library

A curated list of streaming frameworks and applications for building scalable and fault-tolerant real-time data processing systems

GitHub

3k stars
138 watching
298 forks
last commit: 2 months ago
Linked from 4 awesome lists

awesome, awesome-list, list, stream-processing

Table of Contents / Streaming Engine

Apache Apex 349 over 3 years ago [Java] - unified platform for big data stream and batch processing
Apache Ballista 1,580 about 1 month ago [Rust] - distributed compute platform powered by Apache Arrow
Apache Flink 24,261 about 1 month ago [Java] - system for high-throughput, low-latency data stream processing that supports stateful computation, data-driven windowing semantics and iterative stream processing
Apache Heron (incubating) 3,638 almost 2 years ago [Java] - a realtime, distributed, fault-tolerant stream processing engine from Twitter
Apache Samza 817 about 2 months ago [Scala/Java] - distributed stream processing framework that builds on Kafka (messaging, storage) and YARN (fault tolerance, processor isolation, security and resource management)
Apache Spark Streaming 40,170 about 1 month ago [Scala] - makes it easy to build scalable fault-tolerant streaming applications
Apache Storm 6,603 about 1 month ago [Clojure/Java] - distributed real-time computation system. Storm is to stream processing what Hadoop is to batch processing
AthenaX 1,222 over 4 years ago [Java] - Uber's Stream Analytics Framework used in production
Bytewax 1,585 about 1 month ago [Python] - data parallel, distributed, stateful stream processing framework
Faust 6,751 6 months ago [Python] - stream processing library, porting the ideas from Kafka Streams to Python
Gearpump 762 almost 3 years ago [Scala] - lightweight real-time distributed streaming engine built on Akka
Hazelcast Jet 1,103 11 months ago [Java] - A general purpose distributed data processing engine, built on top of Hazelcast
hailstorm 90 over 10 years ago [Haskell] - distributed stream processing with exactly-once semantics based on Storm
Maki Nage 38 over 2 years ago [Python] - A stream processing framework for data scientists, based on Kafka and ReactiveX
mantis 1,419 about 1 month ago [Java] - Netflix's platform to build an ecosystem of realtime stream processing applications
mupd8 (muppet) 126 over 3 years ago [Scala/Java] - MapReduce-style framework for processing fast/streaming data
Numaflow 1,748 about 1 month ago [Java/Python/Go/Rust] - Kubernetes native stream processing platform with language agnostic framework. Scalable and cost-efficient
Onyx 2,050 over 5 years ago [Clojure] - Distributed, masterless, high performance, fault tolerant data processing
Pathway 7,174 about 1 month ago [Python] - The fastest data processing engine supporting unified workflows for batch, streaming data, and LLM applications
s4 43 about 6 years ago [Java] - general-purpose, distributed, scalable, fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data
SABER 38 about 2 years ago [Java/C] - Window-Based Hybrid CPU/GPU Stream Processing Engine
Scramjet Cloud Platform 68 about 2 months ago [Python/JavaScript/Node.js] - data processing engine for running multiple data processing apps (sequences) written in Python, JavaScript or TypeScript
SPQR 29 almost 9 years ago [Java] - dynamic framework for processing high-volume data streams through pipelines
tigon 284 almost 8 years ago [C++/Java] - high-throughput real-time stream processing framework built on Hadoop and HBase
Teknek 8 about 9 years ago [Java] - Simple elegant stream processing with interactive prototyping shell SOL (Stream Operator Language)
Trill 1,250 about 1 year ago [.NET/C#] - Trill is a high-performance one-pass in-memory streaming analytics engine from Microsoft Research
Wallaroo 1,477 almost 4 years ago [Python] - A fast, stream-processing framework. Wallaroo makes it easy to react to data in real-time. By eliminating infrastructure complexity, going from prototype to production has never been simpler
LightSaber 70 about 3 years ago [C++] - Multi-core Window-Based Stream Processing Engine. LightSaber uses code generation for efficient window aggregation
HStreamDB 713 4 months ago [Haskell] - The streaming database built for IoT data storage and real-time processing
Kuiper 1,505 about 1 month ago [Golang] - A lightweight IoT edge data analytics/streaming software implemented in Golang that can run on all kinds of resource-constrained edge devices
WindFlow [C++] - A C++17 Data Stream Processing Parallel Library for Multicores and GPUs
RisingWave 7,141 about 1 month ago [Rust] - A PostgreSQL-compatible streaming database that is designed to build event-driven applications, real-time ETL pipelines, continuous analytics services, and feature stores for AI applications. It excels in extracting fresh and consistent insights from real-time event streams, database CDC, and time series data within sub-seconds. It unifies streaming and batch processing, enabling users to ingest, join, and analyze both live and historical data at a cloud scale

Table of Contents / Streaming Library

Apache Kafka Streams 29,060 about 1 month ago [Java] - lightweight stream processing library included in Apache Kafka (since 0.10 version)
Streamiz 477 about 1 month ago [C#] - a .Net Stream Processing Library for Apache Kafka
Akka Streams 13,072 about 1 month ago [Scala] - stream processing library on Akka Actors
Daggy 154 4 months ago [C++] - real-time streams aggregation and catching
Benthos 8,165 about 1 month ago [Go] - Benthos is a high performance and resilient message streaming service, able to connect various sources and sinks and perform arbitrary actions, transformations and filters on payloads
FS2 (prev. 'Scalaz-Stream') 2,381 about 1 month ago [Scala] - Compositional, streaming I/O library for Scala
FastStream 3,241 about 1 month ago [Python] - powerful and easy-to-use Python library simplifying the process of writing producers and consumers for message queues, handling all the parsing, networking and documentation generation automatically. Supports multiple protocols such as Apache Kafka, RabbitMQ and the like
monix 1,932 5 months ago [Scala] - high-performance Scala / Scala.js library for composing asynchronous and event-based programs
Quix Streams 1,246 about 1 month ago [Python] - a streaming library originally designed for the McLaren Formula 1 racing team that can process high volumes of time-series data with up to nanosecond precision using Apache Kafka as a message broker
Scramjet Node.js 38 over 2 years ago [Node.js] - functional reactive stream programming framework written on top of Node.js object streams
Scramjet Python 35 about 1 year ago [Python] - functional reactive stream programming framework written from scratch operating on object, string and buffer streams
Scramjet C++ 3 over 2 years ago [C++] - functional reactive stream programming framework written on top of Node.js object streams
Streamline 165 over 1 year ago [Java] - Stream Analytics Framework by Hortonworks, designed as a wrapper around existing streaming solutions like Storm. Aimed to allow users to drag-and-drop streaming components to focus on business logic
StreamAlert 2,864 about 1 year ago [Python] - Airbnb's Real-time Data Analysis and Alerting
Swave 171 over 6 years ago [Scala] - A lightweight Reactive Streams Infrastructure Toolkit for Scala
Streamz 1,247 about 2 months ago [Python] - A lightweight library for building pipelines to manage continuous streams of data; supports complex pipelines that involve branching, joining, flow control, feedback, back pressure, and so on
Stream Ops 48 about 5 years ago [Java] - A fully embeddable data streaming engine and stream processing API for Java
Substation 332 about 1 month ago [Go] - Substation is a cloud native data pipeline and transformation toolkit written in Go
SwimOS 321 about 1 month ago [Rust] - A framework for building real-time streaming data processing applications written in Rust
Tributary 444 about 2 months ago [Python] - A Python library for constructing dataflow graphs. Supports synchronous, reactive data streams built using Python generators that mimic complex event processors, as well as lazily-evaluated acyclic graphs and functional currying streams
YoMo 1,674 about 1 month ago [Go] - An open source Streaming Serverless Framework for building low-latency geo-distributed systems. YoMo is built atop the QUIC transport protocol and a Functional Reactive Programming interface
Mediapipe 27,962 about 1 month ago - Cross-platform, customizable ML solutions for live and streaming media

Table of Contents / Streaming Application

javactrl-kafka 9 about 1 month ago [Java] - An application of a stateful stream processing for workflow as Java code (microservices orchestration, business process automation, and more)
straw 103 almost 9 years ago [Python/Java] - A platform for real-time streaming search
storm-crawler 895 about 1 month ago [Java] - Web crawler SDK based on Apache Storm
Zilla 553 about 1 month ago [Java] - Cross-platform API gateway built for event-driven architectures and streaming that supports standard protocols such as HTTP, SSE, gRPC, MQTT and the native Kafka protocol

Table of Contents / IoT

sensorbee 231 about 5 years ago [Go] - lightweight stream processing engine for IoT
Apache Edgent 218 about 5 years ago [Java] - a programming model and runtime that enables continuous streaming analytics on gateways and edge devices, which can work with centralized systems to provide efficient and timely analytics across the whole IoT ecosystem, from the center to the edge; open sourced by IBM
Apache StreamPipes 614 about 1 month ago [Java] - a self-service (Industrial) IoT toolbox to enable non-technical users to connect, analyze and explore IoT data streams

Table of Contents / DSL

Apache Beam 7,911 about 1 month ago [Java, Python, SQL, Scala, Go] - unified model and set of language-specific SDKs for defining and executing data processing workflows, and also data ingestion and integration flows, supporting Enterprise Integration Patterns (EIPs) and Domain Specific Languages (DSLs), open sourced by Google
coast 60 over 8 years ago [Scala] - a DSL that builds DAGs on top of Samza and provides exactly-once semantics
Esper 842 9 months ago [Java] - component for complex event processing (CEP) and event series analysis
Streamparse 1,494 5 months ago [Python] - lets you run Python code against real-time streams of data via Apache Storm
summingbird 2,135 almost 3 years ago [Scala] - library that lets you write MapReduce programs that look like native Scala or Java collection transformations and execute them on a number of well-known distributed MapReduce platforms, including Storm and Scalding

Table of Contents / Data Pipeline

Apache Kafka 29,060 about 1 month ago [Scala/Java] - distributed, partitioned, replicated commit log service, which provides the functionality of a messaging system, but with a unique design
Apache Pulsar 14,315 about 1 month ago [Java] - distributed pub-sub messaging platform with a very flexible messaging model and an intuitive client API
Apache RocketMQ 21,354 about 1 month ago [Java] - distributed messaging and streaming platform with low latency, high performance and reliability, trillion-level capacity and flexible scalability
AutoMQ 3,918 about 1 month ago [Scala/Java] - cloud-first alternative to Kafka by decoupling durability to S3 and EBS. 100% Kafka compatible. 10x cost-effective. Autoscale in seconds. Single-digit ms latency
brooklin 931 8 months ago [Java] - a distributed system intended for streaming data between various heterogeneous source and destination systems with high reliability and throughput at scale, from LinkedIn (replaced databus)
camus 878 over 4 years ago [Java] - LinkedIn's Kafka -> HDFS pipeline
databus 3,643 over 1 year ago [Java] - LinkedIn's source-agnostic distributed change data capture system
flume 2,541 3 months ago [Java] - distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data
fluvio 3,932 about 1 month ago [Rust/WASM] - Real-time programmable data streaming platform with in-line computation capabilities
Gazette 723 about 1 month ago [golang] - Distributed streaming infrastructure built on cloud storage which makes it easy to mix and match batch and streaming paradigms
LogDevice [C++] - a high-performance distributed system by Facebook for streaming and storing sequential data, using a log structure
metaq 1,334 almost 5 years ago [Java] - Taobao's highly available, high-performance distributed messaging system
NATS streaming 2,510 10 months ago [Go] - fast disk-backed messaging solution
nsq 25,029 2 months ago [Go] - realtime distributed messaging platform designed to operate at scale, handling billions of messages per day
Redpanda 9,780 about 1 month ago [C++] - Redpanda is Kafka compatible, ZooKeeper-free, JVM-free and source available
RudderStack 4,109 about 1 month ago [Go] - an open source customer data infrastructure (segment, mparticle alternative)
suro 794 almost 2 years ago [Java] - data pipeline service for collecting, aggregating, and dispatching large volume of application events including log data
StreamSets Data Collector 90 6 months ago [Java] - continuous big data ingestion infrastructure that reads from and writes to a large number of end-points, including S3, JDBC, Hadoop, Kafka, Cassandra and many others

Table of Contents / Online Machine Learning

Apache Samoa 248 almost 2 years ago [Java] - distributed streaming machine learning (ML) framework that contains a programming abstraction for distributed streaming ML algorithms
DataSketches 899 about 1 month ago [Java] - sketches library from Yahoo!
Numalogic (https://github.com/numaproj/numalogic) 168 3 months ago [Python] - Collection of ML models and libraries for real-time anomaly detection and forecasting on time series data. Built on Numaflow, a K8s native stream processing platform
River 5,121 about 1 month ago [Python] - online machine learning library
streamDM 492 almost 2 years ago [Scala] - mining Big Data streams using Spark Streaming from Huawei
StreamingBandit 80 almost 2 years ago [Python] - Provides a webserver to quickly set up and evaluate possible solutions to contextual multi-armed bandit (cMAB) problems
StormCV 167 about 8 years ago [Java] - enables the use of Apache Storm for video processing by adding computer vision (CV) specific operations and data model
trident-ml 381 about 1 year ago [Java] - realtime online machine learning library based on Trident
yurita 107 over 5 years ago [Scala] - Anomaly detection framework built on Spark Structured Streaming from Paypal

Table of Contents / Streaming SQL

pipelinedb 2,639 almost 3 years ago [C] - An open-source relational database that runs SQL queries continuously on streams, incrementally storing results in tables
squall 270 over 7 years ago [Java] - Squall executes SQL queries on top of Storm for doing online processing
StreamCQL 0 over 7 years ago [Java] - Continuous Query Language on RealTime Computation System
ksqlDB 133 about 1 month ago [Java] - A cloud-native, source-available database purpose-built for stream processing applications
Materialize [Rust] - A source-available streaming SQL engine for maintaining materialized views on data from message brokers and databases
Siddhi 1,526 5 months ago [Java] - A cloud native Streaming and Complex Event Processing engine that understands Streaming SQL queries in order to capture events from diverse data sources, process them, detect complex conditions, and publish output to various endpoints in real time
Proton 1,605 about 1 month ago [C++] - A unified streaming and historical data analytics database in a single binary, powered by ClickHouse

Table of Contents / Benchmark

storm-perf-test 74 almost 2 years ago [Java] - a simple storm performance/stress test
streaming-benchmarks 635 about 1 year ago [Java] - Benchmarks for Low Latency (Streaming) solutions including Apache Storm, Apache Spark, Apache Flink, etc
flotilla 234 almost 9 years ago [Go] - Automated message queue orchestration for scaled-up benchmarking

Table of Contents / Toolkit

akka 13,072 about 1 month ago [Scala] - toolkit and runtime for building highly concurrent, distributed, and resilient message-driven applications on the JVM
Apache Pekko 1,237 about 1 month ago [Scala, Java] - Fork of Akka 2.6.x, prior to the Akka project's adoption of the Business Source License
pulsar 1,861 about 5 years ago [Python] - Actor based event driven concurrent framework for Python
aeron 7,466 about 1 month ago [Java/C++] - efficient reliable unicast and multicast message transport
StreamFlow 253 about 1 month ago [Java] - stream processing tool designed to help build and monitor processing workflows
samza-luwak 99 about 10 years ago [Java] - uses Luwak, a stored-query engine built on Lucene, to implement full-text search on streams
Streamdal [Go/Node.js/Python] - A tool to embed privacy controls in your application code to detect PII as it enters and leaves your systems, preventing it from reaching unintended data streams or pipelines
Turbine 835 almost 2 years ago [Java] - tool for aggregating streams of Server-Sent Event (SSE) JSON data into a single stream
Nussknacker 669 about 1 month ago [Scala] - A visual tool to define and run real-time decision algorithms

Table of Contents / Closed Source

Amazon Kinesis Streams [Java] - real-time, fully managed and scalable data stream engine provided by AWS
Azure Stream Analytics [.NET] - a massively scalable, fully managed, real-time data stream engine provided by Microsoft Azure
Cloud Dataflow [Java, Python, SQL, Scala] - Google's managed stream and batch data processing engine. Supports running Beam pipelines
concord [C++] - a distributed stream processing framework built in C++ on top of Apache Mesos, designed for high-performance data processing jobs that require flexibility & control
IBM Streams [Python/Java/Scala] - platform for distributed processing and real-time analytics. Provides toolkits for advanced analytics like geospatial, time series, etc. out of the box
jubatus [C++] - distributed processing framework and streaming machine learning library
millwheel - framework for building low-latency data-processing applications that is widely used at Google
NVIDIA Deep Stream [Python/C/C++] - a platform for real-time image, video and audio processing, preferably on edge devices or in the cloud

Table of Contents / Readings

In-Stream Big Data Processing
The world beyond batch: Streaming 101 by Tyler Akidau
Real Time Analytics: Algorithms and Systems (VLDB 2015)
Grokking Streaming Systems by Josh Fischer & Ning Wang
Streaming Systems: The What, Where, When, and How of Large-Scale Data Processing by Reuven Lax, Slava Chernyak, and Tyler Akidau
Data Pipelines with Apache Airflow by Bas P. Harenslak and Julian Rutger de Ruiter
