awesome-dataops

A curated list of awesome DataOps tools


Topics: awesome, awesome-list, data-engineer, data-engineering, dataops

Awesome DataOps / Data Catalog

Amundsen Data discovery and metadata engine for improving productivity when interacting with data
Apache Atlas Provides open metadata management and governance capabilities to build a data catalog
CKAN 4,409 11 days ago Open-source DMS (data management system) for powering data hubs and data portals
DataHub 9,727 8 days ago LinkedIn's generalized metadata search & discovery tool
Magda 508 12 days ago A federated, open-source data catalog for all your big data and small data
Metacat 1,607 18 days ago Unified metadata exploration API service for Hive, RDS, Teradata, Redshift, S3 and Cassandra
OpenMetadata A single place to discover, collaborate and get your data right
Unity Catalog Industry’s only universal catalog for data and AI

Awesome DataOps / Data Exploration

Apache Zeppelin Enables data-driven, interactive data analytics and collaborative documents
Jupyter Notebook Web-based notebook environment for interactive computing
JupyterLab The next-generation user interface for Project Jupyter
Jupytext 6,605 about 1 month ago Jupyter Notebooks as Markdown Documents, Julia, Python or R scripts
Polynote The polyglot notebook with first-class Scala support

Awesome DataOps / Data Ingestion

Amazon Kinesis Easily collect, process, and analyze video and data streams in real time
Apache Gobblin 2,216 16 days ago A framework that simplifies common aspects of big data such as data ingestion
Apache Kafka 28,494 3 days ago Open-source distributed event streaming platform used by thousands of companies
Apache Pulsar 14,141 12 days ago Distributed pub-sub messaging platform with a flexible messaging model and intuitive API
Embulk 1,748 12 days ago A parallel bulk data loader that helps data transfer between various storages
Fluentd 12,857 3 days ago Collects events from various data sources and writes them to files
Google PubSub Ingest events for streaming into BigQuery, data lakes or operational databases
Nakadi 956 6 months ago A distributed event bus that implements a RESTful API abstraction on top of Kafka-like queues
Pravega 1,982 about 1 month ago An open source distributed storage service implementing Streams
RabbitMQ One of the most popular open source message brokers

Awesome DataOps / Data Workflow

Apache Airflow 36,519 4 days ago A platform to programmatically author, schedule, and monitor workflows
Apache Oozie 708 3 months ago An extensible, scalable and reliable system to manage complex Hadoop workloads
Azkaban 4,458 3 months ago Batch workflow job scheduler created at LinkedIn to run Hadoop jobs
Dagster 11,237 5 days ago An orchestration platform for the development, production, and observation of data assets
Luigi 17,746 11 days ago Python module that helps you build complex pipelines of batch jobs
Prefect A workflow management system, designed for modern infrastructure

Awesome DataOps / Data Processing

Apache Beam 7,805 1 day ago A unified model for defining both batch and streaming data-parallel processing pipelines
Apache Flink 23,889 1 day ago An open source stream processing framework with powerful capabilities
Apache Hadoop MapReduce A framework for writing applications which process vast amounts of data
Apache Nifi 4,798 4 days ago An easy to use, powerful, and reliable system to process and distribute data
Apache Samza 812 4 days ago A distributed stream processing framework which uses Apache Kafka and Hadoop YARN
Apache Spark 39,387 5 days ago A unified analytics engine for large-scale data processing
Apache Storm 6,589 4 days ago An open source distributed realtime computation system
Apache Tez 472 10 days ago A generic data-processing pipeline engine envisioned as a low-level engine
Faust 6,726 2 months ago A stream processing library, porting the ideas from Kafka Streams to Python

Awesome DataOps / Data Quality

Cerberus 3,154 about 2 months ago Lightweight, extensible data validation library for Python
Cleanlab 9,428 29 days ago Data-centric AI tool to detect (non-predefined) issues in ML data like label errors or outliers
DataProfiler 1,420 3 months ago A Python library designed to make data analysis, monitoring, and sensitive data detection easy
Deequ 3,269 4 days ago A library built on top of Apache Spark for measuring data quality in large datasets
Great Expectations A Python data validation framework that allows you to test your data against datasets
JSON Schema A vocabulary that allows you to annotate and validate JSON documents
SodaSQL 61 almost 2 years ago Data profiling, testing, and monitoring for SQL accessible data

Awesome DataOps / Data Serialization

Apache Avro 2,900 5 days ago A data serialization system which is compact, fast and provides rich data structures
Apache ORC 681 5 days ago A self-describing type-aware columnar file format designed for Hadoop workloads
Apache Parquet 2,575 3 days ago A columnar storage format which provides efficient storage and encoding of data
Kryo 6,181 5 days ago A fast and efficient binary object graph serialization framework for Java
ProtoBuf 65,302 10 days ago Language-neutral, platform-neutral, extensible mechanism for serializing structured data

Awesome DataOps / Data Serialization / Data Compression

Pigz 2,635 18 days ago A parallel implementation of gzip for modern multi-processor, multi-core machines
Snappy 6,110 about 2 months ago Open source compression library that is fast, stable and robust

Awesome DataOps / Data Serialization / Data Table Format

Apache Hudi 5,346 1 day ago Manages the storage of large analytical datasets on DFS
Apache Iceberg 6,241 2 days ago Open table format for huge analytic datasets
Delta Lake 7,487 3 days ago An open source project that enables building a Lakehouse architecture on top of data lakes

Awesome DataOps / Data Visualization

Apache Superset 62,043 3 days ago A modern data exploration and data visualization platform
Count SQL/drag-and-drop querying and visualisation tool based on notebooks
Dash 21,250 15 days ago Analytical Web Apps for Python, R, Julia, and Jupyter
Data Studio Reporting solution for power users who want to go beyond the data and dashboards of GA
HUE 1,163 4 days ago A mature SQL Assistant for querying Databases & Data Warehouses
Lux 5,144 7 months ago Fast and easy data exploration by automating the visualization and data analysis process
Metabase The simplest, fastest way to get business intelligence and analytics to everyone
Redash Connect to any data source, easily visualize, dashboard and share your data
Tableau A powerful and fast-growing data visualization tool used in the business intelligence industry

Awesome DataOps / Data Warehouse

Amazon Redshift Accelerate your time to insights with fast, easy, and secure cloud data warehousing
Apache Hive 5,514 3 days ago Facilitates reading, writing, and managing large datasets residing in distributed storage
Apache Kylin 3,636 4 days ago An open source, distributed analytical data warehouse for big data
Google BigQuery Serverless, highly scalable, and cost-effective multicloud data warehouse

Awesome DataOps / Database / Columnar Database

Apache Cassandra 8,719 12 days ago Open source column based DBMS designed to handle large amounts of data
Apache Druid 13,429 5 days ago Designed to quickly ingest massive quantities of event data, and provide low-latency queries
Apache HBase 5,206 3 days ago An open-source, distributed, versioned, column-oriented store
Scylla 13,370 1 day ago Designed to be compatible with Cassandra while achieving higher throughputs and lower latencies

Awesome DataOps / Database / Document-Oriented Database

Apache CouchDB 6,180 15 days ago An open-source document-oriented NoSQL database, implemented in Erlang
Elasticsearch 69,622 12 days ago A distributed document oriented database with a RESTful search engine
MongoDB 26,120 12 days ago A cross-platform document database that uses JSON-like documents with optional schemas
RethinkDB 26,735 7 months ago The first open-source scalable database built for realtime applications

Awesome DataOps / Database / Graph Database

Age 3,028 22 days ago A multi-model database that supports both graph and relational data models
ArangoDB 13,530 2 days ago A scalable open-source multi-model database natively supporting graph, document and search
JanusGraph 5,282 5 days ago Manage large graphs with billions of vertices and edges distributed across a multi-machine cluster
Memgraph 2,360 3 days ago An open source graph database, built for real-time streaming data, compatible with Neo4j
Neo4j 13,200 3 days ago A high performance graph store with all the features expected of a mature and robust database
Titan 5,247 almost 2 years ago A highly scalable graph database optimized for storing and querying large graphs

Awesome DataOps / Database / Key-Value Database

Apache Accumulo 1,062 4 days ago A sorted, distributed key-value store that provides robust and scalable data storage
Dragonfly 25,449 3 days ago A modern in-memory datastore, fully compatible with Redis and Memcached APIs
DynamoDB Fast, flexible NoSQL database service for single-digit millisecond performance at any scale
etcd 47,457 11 days ago Distributed reliable key-value store for the most critical data of a distributed system
EVCache 2,026 about 1 month ago A distributed in-memory data store for the cloud
Memcached 13,433 28 days ago A high performance multithreaded event-based key/value cache store
Redis 66,394 12 days ago An in-memory key-value database that persists on disk

Awesome DataOps / Database / Relational Database

CockroachDB 29,954 5 days ago A distributed database designed to build, scale, and manage data-intensive apps
Crate 4,052 11 days ago A distributed SQL database that makes it simple to store and analyze massive amounts of data
MariaDB 5,584 10 days ago A replacement of MySQL with more features, new storage engines and better performance
MySQL 10,733 about 2 months ago One of the most popular open source transactional databases
PostgreSQL 15,921 1 day ago An advanced RDBMS that supports an extended subset of the SQL standard
RQLite 15,617 5 days ago A lightweight, distributed relational database, which uses SQLite as its storage engine
SQLite 6,408 3 days ago A popular choice as embedded database software for local/client storage

Awesome DataOps / Database / Time Series Database

Akumuli 836 about 2 years ago Can be used to capture, store and process time-series data in real-time
Atlas 3,439 8 days ago An in-memory dimensional time series database
InfluxDB 28,713 3 days ago Scalable datastore for metrics, events, and real-time analytics
QuestDB 14,381 4 days ago An open source SQL database designed to process time series data, faster
TimescaleDB 17,531 10 days ago Open-source time-series SQL database optimized for fast ingest and complex queries

Awesome DataOps / Database / Vector Database

Milvus 29,730 5 days ago An open source embedding vector similarity search engine powered by Faiss, NMSLIB and Annoy
Pinecone Managed and distributed vector similarity search used with a lightweight SDK
Qdrant 19,996 3 days ago An open source vector similarity search engine with extended filtering support

Awesome DataOps / File System

Alluxio 6,816 23 days ago A virtual distributed storage system
Amazon Simple Storage Service (S3) Object storage built to retrieve any amount of data from anywhere
Apache Hadoop Distributed File System (HDFS) A distributed file system
GlusterFS 4,655 about 2 months ago A software defined distributed storage that can scale to several petabytes
Google Cloud Storage (GCS) Object storage for companies of all sizes, to store any amount of data
LakeFS 4,363 10 days ago Open source tool that transforms your object storage into a Git-like repository
LizardFS 953 about 2 months ago A highly reliable, scalable and efficient distributed file system
MinIO 47,067 3 days ago High Performance, Kubernetes Native Object Storage compatible with Amazon S3 API
SeaweedFS 22,426 5 days ago A fast distributed storage system for blobs, objects, files, and data lake
Swift 2,616 3 days ago A distributed object storage system designed to scale from a single machine to thousands of servers

Awesome DataOps / Logging and Monitoring

Grafana 64,069 11 days ago Visualize metrics, logs, and traces from multiple sources like Prometheus, Loki, InfluxDB and more
Loki 23,418 11 days ago A horizontally-scalable, highly-available, multi-tenant log aggregation system inspired by Prometheus
Prometheus 55,095 3 days ago A monitoring system and time series database
Whylogs 2,635 9 days ago A tool for creating data logs, enabling monitoring for data drift and data quality issues

Awesome DataOps / Metadata Service

Hive Metastore Service that stores metadata related to Apache Hive and other services
Metacat 1,607 18 days ago Provides you information about what data you have, where it resides and how to process it

Awesome DataOps / SQL Query Engine

Apache Drill 1,933 9 days ago Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage
Apache Impala 1,131 4 days ago Lightning-fast, distributed SQL queries for petabytes of data
Dremio Power high-performing BI dashboards and interactive analytics directly on data lake
Presto 15,958 8 days ago A distributed SQL query engine for big data
Trino 10,237 8 days ago A fast distributed SQL query engine for big data analytics

Resources / Books

Data Mesh: Delivering Data-Driven Value at Scale (O'Reilly)
Designing Data-Intensive Applications (O'Reilly)
Fundamentals of Data Engineering (O'Reilly)
Getting Started with Impala (O'Reilly)
Learning and Operating Presto (O'Reilly)
Learning Spark: Lightning-Fast Data Analytics (O'Reilly)
Spark in Action (O'Reilly)
Spark: The Definitive Guide (O'Reilly)

Resources / Other Lists

Awesome Data Engineering 6,655 about 1 month ago
Awesome MLOps 3,994 12 days ago
DataOps Resource 21 about 4 years ago

Resources / Slack

Delta Lake Workspace
Trino Workspace
