awesome-dataops

A curated list of awesome DataOps tools

GitHub

145 stars
8 watching
19 forks
Language: Python
last commit: 5 days ago
Linked from 1 awesome list

Topics: awesome, awesome-list, data-engineer, data-engineering, dataops

Awesome DataOps / Data Catalog

Amundsen Data discovery and metadata engine for improving productivity when interacting with data
Apache Atlas Provides open metadata management and governance capabilities to build a data catalog
CKAN 4,409 4 days ago Open-source DMS (data management system) for powering data hubs and data portals
DataHub 9,727 1 day ago LinkedIn's generalized metadata search & discovery tool
Magda 508 5 days ago A federated, open-source data catalog for all your big data and small data
Metacat 1,607 11 days ago Unified metadata exploration API service for Hive, RDS, Teradata, Redshift, S3 and Cassandra
OpenMetadata A single place to discover, collaborate and get your data right
Unity Catalog Industry’s only universal catalog for data and AI

Awesome DataOps / Data Exploration

Apache Zeppelin Enables data-driven, interactive data analytics and collaborative documents
Jupyter Notebook Web-based notebook environment for interactive computing
JupyterLab The next-generation user interface for Project Jupyter
Jupytext 6,590 28 days ago Jupyter Notebooks as Markdown Documents, Julia, Python or R scripts
Polynote The polyglot notebook with first-class Scala support

Awesome DataOps / Data Ingestion

Amazon Kinesis Easily collect, process, and analyze video and data streams in real time
Apache Gobblin 2,214 9 days ago A framework that simplifies common aspects of big data such as data ingestion
Apache Kafka 28,386 10 days ago Open-source distributed event streaming platform used by thousands of companies
Apache Pulsar 14,141 5 days ago Distributed pub-sub messaging platform with a flexible messaging model and intuitive API
Embulk 1,745 12 days ago A parallel bulk data loader that helps data transfer between various storages
Fluentd 12,832 11 days ago Collects events from various data sources and writes them to files
Google PubSub Ingest events for streaming into BigQuery, data lakes or operational databases
Nakadi 953 6 months ago A distributed event bus that implements a RESTful API abstraction on top of Kafka-like queues
Pravega 1,983 25 days ago An open source distributed storage service implementing Streams
RabbitMQ One of the most popular open source message brokers

Awesome DataOps / Data Workflow

Apache Airflow 36,331 11 days ago A platform to programmatically author, schedule, and monitor workflows
Apache Oozie 708 2 months ago An extensible, scalable and reliable system to manage complex Hadoop workloads
Azkaban 4,450 3 months ago Batch workflow job scheduler created at LinkedIn to run Hadoop jobs
Dagster 11,152 13 days ago An orchestration platform for the development, production, and observation of data assets
Luigi 17,705 21 days ago Python module that helps you build complex pipelines of batch jobs
Prefect A workflow management system, designed for modern infrastructure

Awesome DataOps / Data Processing

Apache Beam 7,780 8 days ago A unified model for defining both batch and streaming data-parallel processing pipelines
Apache Flink 23,852 8 days ago An open source stream processing framework with powerful capabilities
Apache Hadoop MapReduce A framework for writing applications which process vast amounts of data
Apache Nifi 4,759 11 days ago An easy to use, powerful, and reliable system to process and distribute data
Apache Samza 811 about 1 month ago A distributed stream processing framework which uses Apache Kafka and Hadoop YARN
Apache Spark 39,300 12 days ago A unified analytics engine for large-scale data processing
Apache Storm 6,590 21 days ago An open source distributed realtime computation system
Apache Tez 471 13 days ago A generic data-processing pipeline engine envisioned as a low-level engine
Faust 6,722 2 months ago A stream processing library, porting the ideas from Kafka Streams to Python

Awesome DataOps / Data Quality

Cerberus 3,147 about 1 month ago Lightweight, extensible data validation library for Python
Cleanlab 9,428 22 days ago Data-centric AI tool to detect (non-predefined) issues in ML data like label errors or outliers
DataProfiler 1,420 3 months ago A Python library designed to make data analysis, monitoring, and sensitive data detection easy
Deequ 3,252 9 days ago A library built on top of Apache Spark for measuring data quality in large datasets
Great Expectations A Python data validation framework that lets you test your data against datasets
JSON Schema A vocabulary that allows you to annotate and validate JSON documents
SodaSQL 60 almost 2 years ago Data profiling, testing, and monitoring for SQL accessible data

Awesome DataOps / Data Serialization

Apache Avro 2,891 12 days ago A data serialization system which is compact, fast and provides rich data structures
Apache ORC 678 10 days ago A self-describing type-aware columnar file format designed for Hadoop workloads
Apache Parquet 2,559 29 days ago A columnar storage format which provides efficient storage and encoding of data
Kryo 6,171 12 days ago A fast and efficient binary object graph serialization framework for Java
ProtoBuf 65,302 3 days ago Language-neutral, platform-neutral, extensible mechanism for serializing structured data

Awesome DataOps / Data Serialization / Data Compression

Pigz 2,623 11 days ago A parallel implementation of gzip for modern multi-processor, multi-core machines
Snappy 6,102 about 1 month ago Open source compression library that is fast, stable and robust

Awesome DataOps / Data Serialization / Data Table Format

Apache Hudi 5,334 8 days ago Manages the storage of large analytical datasets on DFS
Apache Iceberg 6,178 9 days ago Open table format for huge analytic datasets
Delta Lake 7,448 9 days ago An open source project that enables building a Lakehouse architecture on top of data lakes

Awesome DataOps / Data Visualization

Apache Superset 61,801 11 days ago A modern data exploration and data visualization platform
Count SQL/drag-and-drop querying and visualisation tool based on notebooks
Dash 21,169 16 days ago Analytical Web Apps for Python, R, Julia, and Jupyter
Data Studio Reporting solution for power users who want to go beyond the data and dashboards of GA
HUE 1,159 10 days ago A mature SQL Assistant for querying Databases & Data Warehouses
Lux 5,136 6 months ago Fast and easy data exploration by automating the visualization and data analysis process
Metabase The simplest, fastest way to get business intelligence and analytics to everyone
Redash Connect to any data source, easily visualize, dashboard and share your data
Tableau A powerful, fast-growing data visualization tool used in the business intelligence industry

Awesome DataOps / Data Warehouse

Amazon Redshift Accelerate your time to insights with fast, easy, and secure cloud data warehousing
Apache Hive 5,493 10 days ago Facilitates reading, writing, and managing large datasets residing in distributed storage
Apache Kylin 3,634 14 days ago An open source, distributed analytical data warehouse for big data
Google BigQuery Serverless, highly scalable, and cost-effective multicloud data warehouse

Awesome DataOps / Database / Columnar Database

Apache Cassandra 8,719 5 days ago Open source column-based DBMS designed to handle large amounts of data
Apache Druid 13,416 5 days ago Designed to quickly ingest massive quantities of event data, and provide low-latency queries
Apache HBase 5,197 10 days ago An open-source, distributed, versioned, column-oriented store
Scylla 13,240 8 days ago Designed to be compatible with Cassandra while achieving higher throughputs and lower latencies

Awesome DataOps / Database / Document-Oriented Database

Apache CouchDB 6,180 8 days ago An open-source document-oriented NoSQL database, implemented in Erlang
Elasticsearch 69,622 5 days ago A distributed document-oriented database with a RESTful search engine
MongoDB 26,120 5 days ago A cross-platform document database that uses JSON-like documents with optional schemas
RethinkDB 26,717 6 months ago The first open-source scalable database built for realtime applications

Awesome DataOps / Database / Graph Database

Age 3,028 15 days ago A multi-model database that supports both graph and relational data models
ArangoDB 13,505 9 days ago A scalable open-source multi-model database natively supporting graph, document and search
JanusGraph 5,276 9 days ago Manage large graphs with billions of vertices and edges distributed across a multi-machine cluster
Memgraph 2,341 10 days ago An open source graph database, built for real-time streaming data, compatible with Neo4j
Neo4j 13,141 25 days ago A high performance graph store with all the features expected of a mature and robust database
Titan 5,247 almost 2 years ago A highly scalable graph database optimized for storing and querying large graphs

Awesome DataOps / Database / Key-Value Database

Apache Accumulo 1,060 11 days ago A sorted, distributed key-value store that provides robust and scalable data storage
Dragonfly 25,310 10 days ago A modern in-memory datastore, fully compatible with Redis and Memcached APIs
DynamoDB Fast, flexible NoSQL database service for single-digit millisecond performance at any scale
etcd 47,457 4 days ago Distributed reliable key-value store for the most critical data of a distributed system
EVCache 2,014 25 days ago A distributed in-memory data store for the cloud
Memcached 13,433 21 days ago A high performance multithreaded event-based key/value cache store
Redis 66,394 5 days ago An in-memory key-value database that persists on disk

Awesome DataOps / Database / Relational Database

CockroachDB 29,879 12 days ago A distributed database designed to build, scale, and manage data-intensive apps
Crate 4,052 4 days ago A distributed SQL database that makes it simple to store and analyze massive amounts of data
MariaDB 5,584 3 days ago A replacement for MySQL with more features, new storage engines and better performance
MySQL 10,733 about 2 months ago One of the most popular open source transactional databases
PostgreSQL 15,792 9 days ago An advanced RDBMS that supports an extended subset of the SQL standard
RQLite 15,576 11 days ago A lightweight, distributed relational database, which uses SQLite as its storage engine
SQLite 6,343 11 days ago A popular choice as embedded database software for local/client storage

Awesome DataOps / Database / Time Series Database

Akumuli 837 about 2 years ago Can be used to capture, store and process time-series data in real-time
Atlas 3,431 23 days ago An in-memory dimensional time series database
InfluxDB 28,639 10 days ago Scalable datastore for metrics, events, and real-time analytics
QuestDB 14,323 11 days ago An open source SQL database designed to process time series data, faster
TimescaleDB 17,531 3 days ago Open-source time-series SQL database optimized for fast ingest and complex queries

Awesome DataOps / Database / Vector Database

Milvus 29,481 10 days ago An open source embedding vector similarity search engine powered by Faiss, NMSLIB and Annoy
Pinecone Managed and distributed vector similarity search used with a lightweight SDK
Qdrant 19,842 10 days ago An open source vector similarity search engine with extended filtering support

Awesome DataOps / File System

Alluxio 6,806 16 days ago A virtual distributed storage system
Amazon Simple Storage Service (S3) Object storage built to retrieve any amount of data from anywhere
Apache Hadoop Distributed File System (HDFS) A distributed file system
GlusterFS 4,655 about 2 months ago A software defined distributed storage that can scale to several petabytes
Google Cloud Storage (GCS) Object storage for companies of all sizes, to store any amount of data
LakeFS 4,363 3 days ago Open source tool that transforms your object storage into a Git-like repository
LizardFS 952 about 2 months ago A highly reliable, scalable and efficient distributed file system
MinIO 46,793 10 days ago High Performance, Kubernetes Native Object Storage compatible with Amazon S3 API
SeaweedFS 22,299 13 days ago A fast distributed storage system for blobs, objects, files, and data lake
Swift 2,610 16 days ago A distributed object storage system designed to scale from a single machine to thousands of servers

Awesome DataOps / Logging and Monitoring

Grafana 64,069 4 days ago Visualize metrics, logs, and traces from multiple sources like Prometheus, Loki, InfluxDB and more
Loki 23,418 4 days ago A horizontally-scalable, highly-available, multi-tenant log aggregation system inspired by Prometheus
Prometheus 54,925 10 days ago A monitoring system and time series database
Whylogs 2,635 1 day ago A tool for creating data logs, enabling monitoring for data drift and data quality issues

Awesome DataOps / Metadata Service

Hive Metastore Service that stores metadata related to Apache Hive and other services
Metacat 1,607 11 days ago Provides you information about what data you have, where it resides and how to process it

Awesome DataOps / SQL Query Engine

Apache Drill 1,928 about 1 month ago Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage
Apache Impala 1,124 12 days ago Lightning-fast, distributed SQL queries for petabytes of data
Dremio Power high-performing BI dashboards and interactive analytics directly on the data lake
Presto 15,958 about 18 hours ago A distributed SQL query engine for big data
Trino 10,237 1 day ago A fast distributed SQL query engine for big data analytics

Resources / Books

Data Mesh: Delivering Data-Driven Value at Scale (O'Reilly)
Designing Data-Intensive Applications (O'Reilly)
Fundamentals of Data Engineering (O'Reilly)
Getting Started with Impala (O'Reilly)
Learning and Operating Presto (O'Reilly)
Learning Spark: Lightning-Fast Data Analytics (O'Reilly)
Spark in Action (O'Reilly)
Spark: The Definitive Guide (O'Reilly)

Resources / Other Lists

Awesome Data Engineering 6,593 25 days ago
Awesome MLOps 3,956 about 1 month ago
DataOps Resource 21 about 4 years ago

Resources / Slack

Delta Lake Workspace
Trino Workspace
