awesome-opensource-data-engineering
Data engineering toolkit
An aggregated list of projects and tools for data engineering tasks in software development.
An Awesome List of Open-Source Data Engineering Projects
2k stars
60 watching
349 forks
last commit: 4 months ago
Linked from 1 awesome list
awesome-listdata-engineering
https://spark.apache.org/[Apache | Spark] - A unified analytics engine for large-scale data processing. Includes APIs in Scala, Java, Python (known as PySpark), and R (SparkR) | ||
https://beam.apache.org/[Apache | Beam] - An open-source implementation of Google DataFlow. Provides capabilites of batch and streaming data processing jobs that run on any execution engine, including Spark, Flink, or its own DirectRunner. Supports multiple APIs in Java, Python, and Go | ||
https://flink.apache.org/[Apache | Flink] - Stateful computations over data streams | ||
https://trino.io/[Trino | (formerly known as PrestoSQL)] - Distributed SQL Query Engine for Big Data | ||
https://superset.incubator.apache.org/[Apache | Superset] - A modern, enterprise-ready business intelligence web application | ||
https://gethue.com/[HUE] | The Hadoop User Interface. Similar to Superset, but interfaces between RDBMS, Hive, Impala, HBase, Spark, HDFS & S3, Oozie, Pig, YARN Job Explorer, and more. Offers an extensible Django environment for custom app integration | ||
https://www.metabase.com/[Metabase] | An easy way for everyone in your company to ask questions and learn from data | ||
https://redash.io/[Redash] | All the tools to unlock your data | ||
https://delta.io/[Delta | Lake] - Open-source storage framework that enables building a lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs for Scala, Java, Rust, Ruby, and Python | ||
https://hudi.apache.org/[Apache | Hudi] - Transactional data lake platform that brings database and data warehouse capabilities to the data lake. Hudi reimagines slow old-school batch data processing with a powerful new incremental processing framework for low latency minute-level analytics | ||
https://iceberg.apache.org/[Apache | Iceberg] - High-performance format for huge analytic tables. Iceberg brings the reliability and simplicity of SQL tables to big data, while making it possible for engines like Spark, Trino, Flink, Presto, Hive and Impala to safely work with the same tables, at the same time | ||
https://debezium.io/[Debezium] | Change data capture for MySQL, Postgres, MongoDB, SQL Server and others | ||
https://github.com/zendesk/maxwell[Maxwell] | Maxwell's daemon, a MySQL-to-JSON Kafka producer | ||
https://calcite.apache.org/[Apache | Calcite] - SQL parser, building blocks for datastores | ||
http://cassandra.apache.org/[Apache | Cassandra] - Open Source distributed wide column store, NoSQL database | ||
https://druid.apache.org/[Apache | Druid] - A high performance real-time analytics database | ||
https://hbase.apache.org/[Apache | HBase] - Open Source non-relational distributed database | ||
https://pinot.apache.org/[Apache | Pinot] - A realtime distributed OLAP datastore | ||
https://clickhouse.tech/[ClickHouse] | Open Source distributed column-oriented DBMS | ||
https://www.influxdata.com/[InfluxDB] | Purpose-Built Open Source Time Series Database | ||
https://min.io/[MinIO] | MinIO is a high performance, distributed object storage system and AWS S3 compatible | ||
https://www.postgresql.org/[Postgres] | The World's Most Advanced Open Source Relational Database | ||
https://questdb.io/[QuestDB] | Open Source Time Series Database with a focus on performance and simplicity | ||
https://github.com/lyft/amundsen[Amundsen] | metadata catalogue | ||
https://atlas.apache.org[Apache | Atlas] - Data governance and metadata framework for Hadoop | ||
https://github.com/linkedin/datahub[DataHub] | A Generalized Metadata Search & Discovery Tool | ||
https://github.com/Netflix/metacat[Metacat] | Unified metadata exploration API service | ||
https://github.com/elementary-data/elementary-lineage[Elementary] | Data reliability solution, starting with plug-and-play data lineage and datasets operational status | ||
https://github.com/monosidev/monosi[Monosi] | Data observability & monitoring platform | ||
https://github.com/open-metadata/OpenMetadata[OpenMetadata] | Generalized metadata, search, and lineage tool | ||
https://drill.apache.org/[Apache | Drill] - Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage | ||
https://github.com/dremio/dremio-oss[Dremio] | A data lake engine. Provides an Apache Arrow-based query and acceleration engine together with the ability to create an IT-governed self-service layer for data scientists and analysts | ||
http://teiid.io/[Teiid] | A relational abstraction of different information sources | ||
https://prestodb.io/[Presto] | Distributed SQL Query Engine for Big Data | ||
https://avro.apache.org/[Apache | Avro] - A data serialization system | ||
https://parquet.apache.org/[Apache | Parquet] - A columnar storage format | ||
https://orc.apache.org/[Apache | ORC] - Another columnar storage format | ||
https://thrift.apache.org/[Apache | Thrift] - Data type and service interface definitions and code generator | ||
https://arrow.apache.org/[Apache | Arrow] - A cross-language development platform for in-memory data. It specifies a standardized, language-independent, columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy IPC and streaming messaging | ||
https://capnproto.org/[Cap’n | Proto] - A data interchange format and capability-based RPC system | ||
https://google.github.io/flatbuffers/[FlatBuffers] | An efficient cross platform serialization library for C++, C#, C, Go, Java, JavaScript, Lobster, Lua, TypeScript, PHP, Python, and Rust | ||
https://msgpack.org/index.html[MessagePack] | An efficient binary serialization format. It lets you exchange data among multiple languages like JSON | ||
https://developers.google.com/protocol-buffers[Protocol | Buffers] - Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data | ||
https://camel.apache.org/[Apache | Camel] - Easily integrate various systems consuming or producing data | ||
https://kafka.apache.org/documentation/#connect[Kafka | Connect] - Reusable framework to handle data int-and-out of Apache Kafka | ||
https://www.elastic.co/logstash[Logstash] | Open Source server-side data processing pipeline | ||
https://github.com/influxdata/telegraf[Telegraf] | a plugin-driven server agent writen in Go (deployed as a single binary with no external dependencies) for collecting and sending metrics and events from databases, systems, and IoT sensors. Offers hundreds of existing plugins | ||
https://activemq.apache.org/[Apache | ActiveMQ] - Flexible & Powerful Multi-Protocol Messaging | ||
https://kafka.apache.org/[Apache | Kafka] - A distributed commit log with messaging capabilities | ||
https://pulsar.apache.org/[Apache | Pulsar] - A distributed pub-sub messaging system | ||
http://github.com/bsideup/liiklus[Liiklus] | An event gateway that provides reactive gRPC/RSocket access to Kafka-like systems | ||
https://nakadi.io/[Nakadi] | A distributed event bus that implements a RESTful API abstraction on top of Kafka-like queues] | ||
https://nats.io/[NATS] | A simple, secure and high performance messaging system | ||
https://www.rabbitmq.com/[RabbitMQ] | A message broker | ||
https://github.com/wepay/waltz[Waltz] | A quorum-based distributed write-ahead log for replicating transactions | ||
https://zeromq.org/[ZeroMQ] | An open-source universal, high-performance messaging library | ||
https://cloudevents.io/[CloudEvents] | A specification for describing event data in a common way | ||
https://kafka.apache.org/documentation/streams/[Apache | Kafka Streams] - A client library for building applications and microservices, where the input and output data are stored in Kafka | ||
http://samza.apache.org/[Apache | Samza] - A distributed stream processing framework | ||
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html[Apache | Spark Structured Streaming] - A scalable and fault-tolerant stream processing engine built on the Spark SQL engine | ||
http://storm.apache.org/[Apache | Storm] - A distributed realtime computation system | ||
https://greatexpectations.io/[Great | expectations] - Helps data teams eliminate pipeline debt, through data testing | ||
https://github.com/DataKitchen/data-observability-installer/[DataKitchen | 95 | about 2 months ago | Data Observability] - A full featured data quality profiling and data testing tool: it automatically generates tests for you |
https://prometheus.io/[prometheus] | An open-source systems monitoring and alerting toolkit | ||
https://grafana.com/[grafana] | An open-source analytics and monitoring platform | ||
https://github.com/DataKitchen/data-observability-installer/[DataKitchen | 95 | about 2 months ago | Data Observability] - A full featured monitoring and alerting software that watches across and down your data estate |
https://github.com/treeverse/lakeFS/[lakeFS] | 4,496 | about 1 month ago | Repeatable, atomic and versioned data lake on top of object storage |
https://github.com/meirwah/awesome-workflow-engines[Awesome | Workflow Engines] - A curated list of awesome open source workflow engines | ||
https://airflow.apache.org/[Apache | Airflow] - A platform created by community to programmatically author, schedule and monitor workflows | ||
https://nifi.apache.org/[Apache | NiFi] - Apache NiFi supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic | ||
https://github.com/knime/[KNIME] | KNIME Analytics Platform offers a WYSIWYG Editor for Spark-based workflows, with over 2000+ integrations. Offers visualization and flow analytics in-place. KNIME Server is a commercially licensed component that adds additional features | ||
https://github.com/PrefectHQ/prefect/[Prefect] | 17,771 | about 1 month ago | A workflow management system designed for modern infrastructure |
https://github.com/dagster-io/dagster/[Dagster] | 12,055 | about 1 month ago | A data orchestrator for machine learning, analytics, and ETL |
https://github.com/kestra-io/kestra[Kestra] | Open source data orchestration and scheduling platform with declarative syntax | ||
https://github.com/mage-ai/mage-ai[Mage] | Open source data orchestration and scheduling platform with a rich interactive UI for workflows | ||
https://www.dataengineeringpodcast.com/[Data | Engineering Podcast] | ||
https://softwareengineeringdaily.com/[Software | Engineering Daily] | ||
https://datastackshow.com/[Data | Stack Show] | ||
https://dataengweekly.substack.com/[Data | Eng Weekly] | ||
https://nosql-database.org/[NOSQL | Database Management Systems] - List of NoSQL database management systems | ||
https://db-engines.com/en/[DB-Engines] | Knowledge base of relational and NoSQL database management systems | ||
https://www.goodreads.com/list/show/146550.Data_Engineering_Group[Books] | and club] - Goodreads list and group about Data Engineering books | ||
https://www.kdnuggets.com/25-free-books-to-master-sql-python-data-science-machine-learning-and-natural-language-processing[25 | Free Data Books] - Collection of 25 free e-books related to SQL, Python, Data Science, Machine Learning, and Natural Language Processing |