awesome-dataops

DataOps toolkit

A curated list of tools and technologies for DataOps, covering data cataloging, exploration, ingestion, processing, and more.

A curated list of awesome DataOps tools

GitHub: 163 stars, 9 watching, 20 forks
Language: Python
Last commit: 6 months ago
Linked from 1 awesome list

Tags: awesome, awesome-list, data-engineer, data-engineering, dataops

Awesome DataOps / Data Catalog

Amundsen Data discovery and metadata engine for improving productivity when interacting with data
Apache Atlas Provides open metadata management and governance capabilities to build a data catalog
CKAN 4,509 4 months ago Open-source DMS (data management system) for powering data hubs and data portals
DataHub 10,046 4 months ago LinkedIn's generalized metadata search & discovery tool
Magda 518 4 months ago A federated, open-source data catalog for all your big data and small data
Marquez 1,800 4 months ago Service for the collection, aggregation, and visualization of a data ecosystem's metadata
Metacat 1,616 4 months ago Unified metadata exploration API service for Hive, RDS, Teradata, Redshift, S3 and Cassandra
OpenLineage 1,802 4 months ago Open standard for metadata and lineage collection
OpenMetadata A single place to discover, collaborate and get your data right
Unity Catalog Industry’s only universal catalog for data and AI

Awesome DataOps / Data Exploration

Apache Zeppelin Enables data-driven, interactive data analytics and collaborative documents
Jupyter Notebook Web-based notebook environment for interactive computing
JupyterLab The next-generation user interface for Project Jupyter
Jupytext 6,673 4 months ago Jupyter Notebooks as Markdown Documents, Julia, Python or R scripts
Polynote The polyglot notebook with first-class Scala support

Awesome DataOps / Data Ingestion

Amazon Kinesis Easily collect, process, and analyze video and data streams in real time
Apache Gobblin 2,232 4 months ago A framework that simplifies common aspects of big data such as data ingestion
Apache Kafka 29,060 4 months ago Open-source distributed event streaming platform used by thousands of companies
Apache Pulsar 14,315 4 months ago Distributed pub-sub messaging platform with a flexible messaging model and intuitive API
Embulk 1,758 4 months ago A parallel bulk data loader that helps data transfer between various storages
Fluentd 12,963 4 months ago Collects events from various data sources and writes them to files
Google PubSub Ingest events for streaming into BigQuery, data lakes or operational databases
Nakadi 958 12 months ago A distributed event bus that implements a RESTful API abstraction on top of Kafka-like queues
Pravega 1,983 7 months ago An open source distributed storage service implementing Streams
RabbitMQ One of the most popular open source message brokers

Awesome DataOps / Data Workflow

Apache Airflow 37,580 4 months ago A platform to programmatically author, schedule, and monitor workflows
Apache Oozie 717 9 months ago An extensible, scalable and reliable system to manage complex Hadoop workloads
Azkaban 4,481 9 months ago Batch workflow job scheduler created at LinkedIn to run Hadoop jobs
Dagster 12,055 4 months ago An orchestration platform for the development, production, and observation of data assets
Luigi 17,950 4 months ago Python module that helps you build complex pipelines of batch jobs
Prefect A workflow management system, designed for modern infrastructure

Awesome DataOps / Data Processing

Apache Beam 7,911 4 months ago A unified model for defining both batch and streaming data-parallel processing pipelines
Apache Flink 24,261 4 months ago An open source stream processing framework with powerful capabilities
Apache Hadoop MapReduce A framework for writing applications which process vast amounts of data
Apache Nifi 4,955 4 months ago An easy to use, powerful, and reliable system to process and distribute data
Apache Samza 817 4 months ago A distributed stream processing framework which uses Apache Kafka and Hadoop YARN
Apache Spark 40,170 4 months ago A unified analytics engine for large-scale data processing
Apache Storm 6,603 4 months ago An open source distributed realtime computation system
Apache Tez 482 4 months ago A generic data-processing pipeline engine envisioned as a low-level engine
Faust 6,751 8 months ago A stream processing library, porting the ideas from Kafka Streams to Python

Awesome DataOps / Data Quality

Cerberus 3,179 8 months ago Lightweight, extensible data validation library for Python
Cleanlab 9,820 4 months ago Data-centric AI tool to detect (non-predefined) issues in ML data like label errors or outliers
DataProfiler 1,442 5 months ago A Python library designed to make data analysis, monitoring, and sensitive data detection easy
Deequ 3,324 6 months ago A library built on top of Apache Spark for measuring data quality in large datasets
Great Expectations A Python data validation framework that lets you test your data against datasets
JSON Schema A vocabulary that allows you to annotate and validate JSON documents
SodaSQL 61 over 2 years ago Data profiling, testing, and monitoring for SQL accessible data

Awesome DataOps / Data Serialization

Apache Avro 2,973 4 months ago A data serialization system which is compact, fast and provides rich data structures
Apache ORC 698 4 months ago A self-describing type-aware columnar file format designed for Hadoop workloads
Apache Parquet 2,665 4 months ago A columnar storage format which provides efficient storage and encoding of data
Kryo 6,217 4 months ago A fast and efficient binary object graph serialization framework for Java
ProtoBuf 65,999 4 months ago Language-neutral, platform-neutral, extensible mechanism for serializing structured data

Awesome DataOps / Data Serialization / Data Compression

Pigz 2,669 7 months ago A parallel implementation of gzip for modern multi-processor, multi-core machines
Snappy 6,217 8 months ago Open source compression library that is fast, stable and robust

Awesome DataOps / Data Serialization / Data Table Format

Apache Hudi 5,498 4 months ago Manages the storage of large analytical datasets on DFS
Apache Iceberg 6,621 4 months ago Open table format for huge analytic datasets
Delta Lake 7,677 4 months ago An open source project that enables building a Lakehouse architecture on top of data lakes

Awesome DataOps / Data Visualization

Apache Superset 63,320 4 months ago A modern data exploration and data visualization platform
Count SQL/drag-and-drop querying and visualisation tool based on notebooks
Dash 21,641 4 months ago Analytical Web Apps for Python, R, Julia, and Jupyter
Data Studio Reporting solution for power users who want to go beyond the data and dashboards of Google Analytics (GA)
HUE 1,188 4 months ago A mature SQL Assistant for querying Databases & Data Warehouses
Lux 5,226 about 1 year ago Fast and easy data exploration by automating the visualization and data analysis process
Metabase The simplest, fastest way to get business intelligence and analytics to everyone
Redash Connect to any data source, easily visualize, dashboard and share your data
Tableau Powerful and fastest growing data visualization tool used in the business intelligence industry

Awesome DataOps / Data Warehouse

Amazon Redshift Accelerate your time to insights with fast, easy, and secure cloud data warehousing
Apache Hive 5,577 4 months ago Facilitates reading, writing, and managing large datasets residing in distributed storage
Apache Kylin 3,661 4 months ago An open source, distributed analytical data warehouse for big data
Google BigQuery Serverless, highly scalable, and cost-effective multicloud data warehouse

Awesome DataOps / Database / Columnar Database

Apache Cassandra 8,906 4 months ago Open source column based DBMS designed to handle large amounts of data
Apache Druid 13,548 4 months ago Designed to quickly ingest massive quantities of event data, and provide low-latency queries
Apache HBase 5,246 4 months ago An open-source, distributed, versioned, column-oriented store
Scylla 13,725 4 months ago Designed to be compatible with Cassandra while achieving higher throughputs and lower latencies

Awesome DataOps / Database / Document-Oriented Database

Apache CouchDB 6,298 4 months ago An open-source document-oriented NoSQL database, implemented in Erlang
Elasticsearch 71,007 4 months ago A distributed document oriented database with a RESTful search engine
MongoDB 26,503 4 months ago A cross-platform document database that uses JSON-like documents with optional schemas
RethinkDB 26,806 5 months ago The first open-source scalable database built for realtime applications

Awesome DataOps / Database / Graph Database

Age 3,191 6 months ago A multi-model database that supports both graph and relational data models
ArangoDB 13,613 4 months ago A scalable open-source multi-model database natively supporting graph, document and search
JanusGraph 5,351 4 months ago Manage large graphs with billions of vertices and edges distributed across a multi-machine cluster
Memgraph 2,520 4 months ago An open source graph database, built for real-time streaming data, compatible with Neo4j
Neo4j 13,537 4 months ago A high performance graph store with all the features expected of a mature and robust database
Titan 5,243 over 2 years ago A highly scalable graph database optimized for storing and querying large graphs

Awesome DataOps / Database / Key-Value Database

Apache Accumulo 1,075 4 months ago A sorted, distributed key-value store that provides robust and scalable data storage
Dragonfly 26,326 4 months ago A modern in-memory datastore, fully compatible with Redis and Memcached APIs
DynamoDB Fast, flexible NoSQL database service for single-digit millisecond performance at any scale
etcd 48,056 4 months ago Distributed reliable key-value store for the most critical data of a distributed system
EVCache 2,071 4 months ago A distributed in-memory data store for the cloud
Memcached 13,601 4 months ago A high performance multithreaded event-based key/value cache store
Redis 67,358 4 months ago An in-memory key-value database that persists on disk

Awesome DataOps / Database / Relational Database

CockroachDB 30,270 4 months ago A distributed database designed to build, scale, and manage data-intensive apps
Crate 4,139 4 months ago A distributed SQL database that makes it simple to store and analyze massive amounts of data
MariaDB 5,752 4 months ago A replacement of MySQL with more features, new storage engines and better performance
MySQL 10,964 6 months ago One of the most popular open source transactional databases
PostgreSQL 16,442 4 months ago An advanced RDBMS that supports an extended subset of the SQL standard
RQLite 15,906 4 months ago A lightweight, distributed relational database, which uses SQLite as its storage engine
SQLite 6,902 4 months ago A popular choice as embedded database software for local/client storage

Awesome DataOps / Database / Time Series Database

Akumuli 835 over 2 years ago Can be used to capture, store and process time-series data in real-time
Atlas 3,459 4 months ago An in-memory dimensional time series database
InfluxDB 29,126 4 months ago Scalable datastore for metrics, events, and real-time analytics
QuestDB 14,699 4 months ago An open source SQL database designed to process time series data, faster
TimescaleDB 18,066 4 months ago Open-source time-series SQL database optimized for fast ingest and complex queries

Awesome DataOps / Database / Vector Database

Milvus 31,283 4 months ago An open source embedding vector similarity search engine powered by Faiss, NMSLIB and Annoy
Pinecone Managed and distributed vector similarity search used with a lightweight SDK
Qdrant 21,001 4 months ago An open source vector similarity search engine with extended filtering support

Awesome DataOps / File System

Alluxio 6,880 4 months ago A virtual distributed storage system
Amazon Simple Storage Service (S3) Object storage built to retrieve any amount of data from anywhere
Apache Hadoop Distributed File System (HDFS) A distributed file system
GlusterFS 4,774 4 months ago A software defined distributed storage that can scale to several petabytes
Google Cloud Storage (GCS) Object storage for companies of all sizes, to store any amount of data
LakeFS 4,496 4 months ago Open source tool that transforms your object storage into a Git-like repository
LizardFS 958 8 months ago A highly reliable, scalable and efficient distributed file system
MinIO 48,833 4 months ago High Performance, Kubernetes Native Object Storage compatible with Amazon S3 API
SeaweedFS 23,207 4 months ago A fast distributed storage system for blobs, objects, files, and data lake
Swift 2,639 4 months ago A distributed object storage system designed to scale from a single machine to thousands of servers

Awesome DataOps / Logging and Monitoring

Grafana 65,525 4 months ago Visualize metrics, logs, and traces from multiple sources like Prometheus, Loki, InfluxDB and more
Loki 24,172 4 months ago A horizontally-scalable, highly-available, multi-tenant log aggregation system inspired by Prometheus
Prometheus 56,244 4 months ago A monitoring system and time series database
Whylogs 2,664 4 months ago A tool for creating data logs, enabling monitoring for data drift and data quality issues

Awesome DataOps / Metadata Service

Hive Metastore Service that stores metadata related to Apache Hive and other services
Metacat 1,616 4 months ago Provides you information about what data you have, where it resides and how to process it

Awesome DataOps / SQL Query Engine

Apache Drill 1,949 4 months ago Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage
Apache Impala 1,164 4 months ago Lightning-fast, distributed SQL queries for petabytes of data
Dremio Power high-performing BI dashboards and interactive analytics directly on data lake
Presto 16,114 4 months ago A distributed SQL query engine for big data
Trino 10,601 4 months ago A fast distributed SQL query engine for big data analytics

Resources / Books

Data Mesh: Delivering Data-Driven Value at Scale (O'Reilly)
Designing Data-Intensive Applications (O'Reilly)
Fundamentals of Data Engineering (O'Reilly)
Getting Started with Impala (O'Reilly)
Learning and Operating Presto (O'Reilly)
Learning Spark: Lightning-Fast Data Analytics (O'Reilly)
Spark in Action (O'Reilly)
Spark: The Definitive Guide (O'Reilly)

Resources / Other Lists

Awesome Data Engineering 6,889 5 months ago
Awesome MLOps 4,181 4 months ago
DataOps Resource 24 over 4 years ago

Resources / Slack

Delta Lake Workspace
Trino Workspace
