awesome-dataops

DataOps toolkit

A curated list of tools and technologies for DataOps, covering data cataloging, exploration, ingestion, processing, and more.

Tags: awesome, awesome-list, data-engineer, data-engineering, dataops

Awesome DataOps / Data Catalog

Amundsen Data discovery and metadata engine that improves productivity when interacting with data
Apache Atlas Provides open metadata management and governance capabilities to build a data catalog
CKAN 4,509 1 day ago Open-source DMS (data management system) for powering data hubs and data portals
DataHub 10,046 about 13 hours ago LinkedIn's generalized metadata search & discovery tool
Magda 518 1 day ago A federated, open-source data catalog for all your big data and small data
Marquez 1,800 2 days ago Service for the collection, aggregation, and visualization of a data ecosystem's metadata
Metacat 1,616 about 24 hours ago Unified metadata exploration API service for Hive, RDS, Teradata, Redshift, S3 and Cassandra
OpenLineage 1,802 about 10 hours ago Open standard for metadata and lineage collection
OpenMetadata A single place to discover, collaborate on, and get your data right
Unity Catalog A universal catalog for data and AI

Awesome DataOps / Data Exploration

Apache Zeppelin Enables data-driven, interactive data analytics and collaborative documents
Jupyter Notebook Web-based notebook environment for interactive computing
JupyterLab The next-generation user interface for Project Jupyter
Jupytext 6,673 7 days ago Jupyter Notebooks as Markdown Documents, Julia, Python or R scripts
Polynote The polyglot notebook with first-class Scala support

Awesome DataOps / Data Ingestion

Amazon Kinesis Easily collect, process, and analyze video and data streams in real time
Apache Gobblin 2,232 4 days ago A framework that simplifies common aspects of big data such as data ingestion
Apache Kafka 29,060 4 days ago Open-source distributed event streaming platform used by thousands of companies
Apache Pulsar 14,315 4 days ago Distributed pub-sub messaging platform with a flexible messaging model and intuitive API
Embulk 1,758 14 days ago A parallel bulk data loader that helps data transfer between various storages
Fluentd 12,963 1 day ago Collects events from various data sources and writes them to files, databases, and other storage systems
Google PubSub Ingest events for streaming into BigQuery, data lakes or operational databases
Nakadi 958 8 months ago A distributed event bus that implements a RESTful API abstraction on top of Kafka-like queues
Pravega 1,983 4 months ago An open source distributed storage service implementing Streams
RabbitMQ One of the most popular open source message brokers

Awesome DataOps / Data Workflow

Apache Airflow 37,580 4 days ago A platform to programmatically author, schedule, and monitor workflows
Apache Oozie 717 5 months ago An extensible, scalable and reliable system to manage complex Hadoop workloads
Azkaban 4,481 6 months ago Batch workflow job scheduler created at LinkedIn to run Hadoop jobs
Dagster 12,055 5 days ago An orchestration platform for the development, production, and observation of data assets
Luigi 17,950 6 days ago Python module that helps you build complex pipelines of batch jobs
Prefect A workflow management system, designed for modern infrastructure

Awesome DataOps / Data Processing

Apache Beam 7,911 4 days ago A unified model for defining both batch and streaming data-parallel processing pipelines
Apache Flink 24,261 5 days ago An open source stream processing framework with powerful capabilities
Apache Hadoop MapReduce A framework for writing applications which process vast amounts of data
Apache NiFi 4,955 4 days ago An easy-to-use, powerful, and reliable system to process and distribute data
Apache Samza 817 22 days ago A distributed stream processing framework which uses Apache Kafka and Hadoop YARN
Apache Spark 40,170 5 days ago A unified analytics engine for large-scale data processing
Apache Storm 6,603 9 days ago An open source distributed realtime computation system
Apache Tez 482 5 days ago A generic data-processing pipeline engine envisioned as a low-level engine
Faust 6,751 5 months ago A stream processing library, porting the ideas from Kafka Streams to Python

Awesome DataOps / Data Quality

Cerberus 3,179 4 months ago Lightweight, extensible data validation library for Python
Cleanlab 9,820 6 days ago Data-centric AI tool to detect (non-predefined) issues in ML data like label errors or outliers
DataProfiler 1,442 about 1 month ago A Python library designed to make data analysis, monitoring, and sensitive data detection easy
Deequ 3,324 2 months ago A library built on top of Apache Spark for measuring data quality in large datasets
Great Expectations A Python data validation framework that lets you test your datasets against defined expectations
JSON Schema A vocabulary that allows you to annotate and validate JSON documents
SodaSQL 61 about 2 years ago Data profiling, testing, and monitoring for SQL accessible data

Awesome DataOps / Data Serialization

Apache Avro 2,973 8 days ago A data serialization system which is compact, fast and provides rich data structures
Apache ORC 698 5 days ago A self-describing type-aware columnar file format designed for Hadoop workloads
Apache Parquet 2,665 13 days ago A columnar storage format which provides efficient storage and encoding of data
Kryo 6,217 9 days ago A fast and efficient binary object graph serialization framework for Java
ProtoBuf 65,999 5 days ago Language-neutral, platform-neutral, extensible mechanism for serializing structured data

Awesome DataOps / Data Serialization / Data Compression

Pigz 2,669 3 months ago A parallel implementation of gzip for modern multi-processor, multi-core machines
Snappy 6,217 4 months ago Open source compression library that is fast, stable and robust

Awesome DataOps / Data Serialization / Data Table Format

Apache Hudi 5,498 4 days ago Manages the storage of large analytical datasets on DFS
Apache Iceberg 6,621 5 days ago Open table format for huge analytic datasets
Delta Lake 7,677 4 days ago An open source project that enables building a Lakehouse architecture on top of data lakes

Awesome DataOps / Data Visualization

Apache Superset 63,320 4 days ago A modern data exploration and data visualization platform
Count SQL/drag-and-drop querying and visualisation tool based on notebooks
Dash 21,641 5 days ago Analytical Web Apps for Python, R, Julia, and Jupyter
Data Studio Reporting solution for power users who want to go beyond the data and dashboards of Google Analytics
HUE 1,188 1 day ago A mature SQL Assistant for querying Databases & Data Warehouses
Lux 5,226 9 months ago Fast and easy data exploration by automating the visualization and data analysis process
Metabase The simplest, fastest way to get business intelligence and analytics to everyone
Redash Connect to any data source, easily visualize, dashboard and share your data
Tableau A powerful, fast-growing data visualization tool used in the business intelligence industry

Awesome DataOps / Data Warehouse

Amazon Redshift Accelerate your time to insights with fast, easy, and secure cloud data warehousing
Apache Hive 5,577 5 days ago Facilitates reading, writing, and managing large datasets residing in distributed storage
Apache Kylin 3,661 5 days ago An open source, distributed analytical data warehouse for big data
Google BigQuery Serverless, highly scalable, and cost-effective multicloud data warehouse

Awesome DataOps / Database / Columnar Database

Apache Cassandra 8,906 4 days ago Open-source, column-based DBMS designed to handle large amounts of data
Apache Druid 13,548 4 days ago Designed to quickly ingest massive quantities of event data, and provide low-latency queries
Apache HBase 5,246 4 days ago An open-source, distributed, versioned, column-oriented store
Scylla 13,725 5 days ago Designed to be compatible with Cassandra while achieving higher throughputs and lower latencies

Awesome DataOps / Database / Document-Oriented Database

Apache CouchDB 6,298 5 days ago An open-source document-oriented NoSQL database, implemented in Erlang
Elasticsearch 71,007 4 days ago A distributed, document-oriented database and RESTful search engine
MongoDB 26,503 4 days ago A cross-platform document database that uses JSON-like documents with optional schemas
RethinkDB 26,806 about 1 month ago The first open-source scalable database built for realtime applications

Awesome DataOps / Database / Graph Database

Age 3,191 3 months ago A multi-model database that supports both graph and relational data models
ArangoDB 13,613 4 days ago A scalable open-source multi-model database natively supporting graph, document and search
JanusGraph 5,351 28 days ago Manages large graphs with billions of vertices and edges distributed across a multi-machine cluster
Memgraph 2,520 about 10 hours ago An open source graph database, built for real-time streaming data, compatible with Neo4j
Neo4j 13,537 9 days ago A high performance graph store with all the features expected of a mature and robust database
Titan 5,243 about 2 years ago A highly scalable graph database optimized for storing and querying large graphs

Awesome DataOps / Database / Key-Value Database

Apache Accumulo 1,075 4 days ago A sorted, distributed key-value store that provides robust and scalable data storage
Dragonfly 26,326 1 day ago A modern in-memory datastore, fully compatible with Redis and Memcached APIs
DynamoDB Fast, flexible NoSQL database service for single-digit millisecond performance at any scale
etcd 48,056 1 day ago Distributed reliable key-value store for the most critical data of a distributed system
EVCache 2,071 6 days ago A distributed in-memory data store for the cloud
Memcached 13,601 13 days ago A high performance multithreaded event-based key/value cache store
Redis 67,358 5 days ago An in-memory key-value database that can persist data to disk

Awesome DataOps / Database / Relational Database

CockroachDB 30,270 5 days ago A distributed database designed to build, scale, and manage data-intensive apps
Crate 4,139 4 days ago A distributed SQL database that makes it simple to store and analyze massive amounts of data
MariaDB 5,752 about 19 hours ago A replacement for MySQL with more features, new storage engines and better performance
MySQL 10,964 2 months ago One of the most popular open source transactional databases
PostgreSQL 16,442 5 days ago An advanced RDBMS that supports an extended subset of the SQL standard
RQLite 15,906 5 days ago A lightweight, distributed relational database, which uses SQLite as its storage engine
SQLite 6,902 about 10 hours ago A popular choice as embedded database software for local/client storage

Awesome DataOps / Database / Time Series Database

Akumuli 835 over 2 years ago Can be used to capture, store and process time-series data in real-time
Atlas 3,459 7 days ago An in-memory dimensional time series database
InfluxDB 29,126 4 days ago Scalable datastore for metrics, events, and real-time analytics
QuestDB 14,699 4 days ago An open source SQL database designed to process time series data, faster
TimescaleDB 18,066 1 day ago Open-source time-series SQL database optimized for fast ingest and complex queries

Awesome DataOps / Database / Vector Database

Milvus 31,283 4 days ago An open source embedding vector similarity search engine powered by Faiss, NMSLIB and Annoy
Pinecone Managed and distributed vector similarity search used with a lightweight SDK
Qdrant 21,001 1 day ago An open source vector similarity search engine with extended filtering support

Awesome DataOps / File System

Alluxio 6,880 20 days ago A virtual distributed storage system
Amazon Simple Storage Service (S3) Object storage built to retrieve any amount of data from anywhere
Apache Hadoop Distributed File System (HDFS) A distributed file system
GlusterFS 4,774 14 days ago A software defined distributed storage that can scale to several petabytes
Google Cloud Storage (GCS) Object storage for companies of all sizes, to store any amount of data
LakeFS 4,496 1 day ago Open source tool that transforms your object storage into a Git-like repository
LizardFS 958 4 months ago A highly reliable, scalable and efficient distributed file system
MinIO 48,833 4 days ago High Performance, Kubernetes Native Object Storage compatible with Amazon S3 API
SeaweedFS 23,207 5 days ago A fast distributed storage system for blobs, objects, files, and data lakes
Swift 2,639 about 10 hours ago A distributed object storage system designed to scale from a single machine to thousands of servers

Awesome DataOps / Logging and Monitoring

Grafana 65,525 1 day ago Visualize metrics, logs, and traces from multiple sources like Prometheus, Loki, InfluxDB and more
Loki 24,172 1 day ago A horizontally-scalable, highly-available, multi-tenant log aggregation system inspired by Prometheus
Prometheus 56,244 4 days ago A monitoring system and time series database
Whylogs 2,664 7 days ago A tool for creating data logs, enabling monitoring for data drift and data quality issues

Awesome DataOps / Metadata Service

Hive Metastore Service that stores metadata related to Apache Hive and other services
Metacat 1,616 about 24 hours ago Provides you information about what data you have, where it resides and how to process it

Awesome DataOps / SQL Query Engine

Apache Drill 1,949 30 days ago Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage
Apache Impala 1,164 5 days ago Lightning-fast, distributed SQL queries for petabytes of data
Dremio Power high-performing BI dashboards and interactive analytics directly on data lake
Presto 16,114 1 day ago A distributed SQL query engine for big data
Trino 10,601 about 14 hours ago A fast distributed SQL query engine for big data analytics

Resources / Books

Data Mesh: Delivering Data-Driven Value at Scale (O'Reilly)
Designing Data-Intensive Applications (O'Reilly)
Fundamentals of Data Engineering (O'Reilly)
Getting Started with Impala (O'Reilly)
Learning and Operating Presto (O'Reilly)
Learning Spark: Lightning-Fast Data Analytics (O'Reilly)
Spark in Action (O'Reilly)
Spark: The Definitive Guide (O'Reilly)

Resources / Other Lists

Awesome Data Engineering 6,889 about 2 months ago
Awesome MLOps 4,181 19 days ago
DataOps Resource 24 over 4 years ago

Resources / Slack

Delta Lake Workspace
Trino Workspace
