awesome-dataops

DataOps toolkit

A curated list of tools and technologies for DataOps, covering data cataloging, exploration, ingestion, processing, and more.

A curated list of awesome DataOps tools

GitHub

163 stars
9 watching
20 forks
Language: Python
last commit: 3 months ago
Linked from 1 awesome list

Topics: awesome, awesome-list, data-engineer, data-engineering, dataops

Awesome DataOps / Data Catalog

Amundsen Data discovery and metadata engine that improves productivity when interacting with data
Apache Atlas Provides open metadata management and governance capabilities to build a data catalog
CKAN 4,509 about 1 month ago Open-source DMS (data management system) for powering data hubs and data portals
DataHub 10,046 about 1 month ago LinkedIn's generalized metadata search & discovery tool
Magda 518 about 1 month ago A federated, open-source data catalog for all your big data and small data
Marquez 1,800 about 1 month ago Service for the collection, aggregation, and visualization of a data ecosystem's metadata
Metacat 1,616 about 1 month ago Unified metadata exploration API service for Hive, RDS, Teradata, Redshift, S3 and Cassandra
OpenLineage 1,802 about 1 month ago Open standard for metadata and lineage collection
OpenMetadata A single place to discover, collaborate and get your data right
Unity Catalog Industry’s only universal catalog for data and AI

Awesome DataOps / Data Exploration

Apache Zeppelin Enables data-driven, interactive data analytics and collaborative documents
Jupyter Notebook Web-based notebook environment for interactive computing
JupyterLab The next-generation user interface for Project Jupyter
Jupytext 6,673 about 1 month ago Jupyter Notebooks as Markdown Documents, Julia, Python or R scripts
Polynote The polyglot notebook with first-class Scala support

Awesome DataOps / Data Ingestion

Amazon Kinesis Easily collect, process, and analyze video and data streams in real time
Apache Gobblin 2,232 about 1 month ago A framework that simplifies common aspects of big data such as data ingestion
Apache Kafka 29,060 about 1 month ago Open-source distributed event streaming platform used by thousands of companies
Apache Pulsar 14,315 about 1 month ago Distributed pub-sub messaging platform with a flexible messaging model and intuitive API
Embulk 1,758 about 2 months ago A parallel bulk data loader that helps data transfer between various storages
Fluentd 12,963 about 1 month ago Collects events from various data sources and writes them to files
Google PubSub Ingest events for streaming into BigQuery, data lakes or operational databases
Nakadi 958 9 months ago A distributed event bus that implements a RESTful API abstraction on top of Kafka-like queues
Pravega 1,983 5 months ago An open source distributed storage service implementing Streams
RabbitMQ One of the most popular open source message brokers

Awesome DataOps / Data Workflow

Apache Airflow 37,580 about 1 month ago A platform to programmatically author, schedule, and monitor workflows
Apache Oozie 717 6 months ago An extensible, scalable and reliable system to manage complex Hadoop workloads
Azkaban 4,481 7 months ago Batch workflow job scheduler created at LinkedIn to run Hadoop jobs
Dagster 12,055 about 1 month ago An orchestration platform for the development, production, and observation of data assets
Luigi 17,950 about 1 month ago Python module that helps you build complex pipelines of batch jobs
Prefect A workflow management system, designed for modern infrastructure

Awesome DataOps / Data Processing

Apache Beam 7,911 about 1 month ago A unified model for defining both batch and streaming data-parallel processing pipelines
Apache Flink 24,261 about 1 month ago An open source stream processing framework with powerful capabilities
Apache Hadoop MapReduce A framework for writing applications which process vast amounts of data
Apache Nifi 4,955 about 1 month ago An easy to use, powerful, and reliable system to process and distribute data
Apache Samza 817 about 2 months ago A distributed stream processing framework which uses Apache Kafka and Hadoop YARN
Apache Spark 40,170 about 1 month ago A unified analytics engine for large-scale data processing
Apache Storm 6,603 about 1 month ago An open source distributed realtime computation system
Apache Tez 482 about 1 month ago A generic data-processing pipeline engine envisioned as a low-level engine
Faust 6,751 6 months ago A stream processing library, porting the ideas from Kafka Streams to Python

Awesome DataOps / Data Quality

Cerberus 3,179 5 months ago Lightweight, extensible data validation library for Python
Cleanlab 9,820 about 1 month ago Data-centric AI tool to detect (non-predefined) issues in ML data like label errors or outliers
DataProfiler 1,442 2 months ago A Python library designed to make data analysis, monitoring, and sensitive data detection easy
Deequ 3,324 3 months ago A library built on top of Apache Spark for measuring data quality in large datasets
Great Expectations A Python data validation framework that lets you test your data against datasets
JSON Schema A vocabulary that allows you to annotate and validate JSON documents
SodaSQL 61 about 2 years ago Data profiling, testing, and monitoring for SQL accessible data

Awesome DataOps / Data Serialization

Apache Avro 2,973 about 1 month ago A data serialization system which is compact, fast and provides rich data structures
Apache ORC 698 about 1 month ago A self-describing type-aware columnar file format designed for Hadoop workloads
Apache Parquet 2,665 about 1 month ago A columnar storage format which provides efficient storage and encoding of data
Kryo 6,217 about 1 month ago A fast and efficient binary object graph serialization framework for Java
ProtoBuf 65,999 about 1 month ago Language-neutral, platform-neutral, extensible mechanism for serializing structured data

Awesome DataOps / Data Serialization / Data Compression

Pigz 2,669 4 months ago A parallel implementation of gzip for modern multi-processor, multi-core machines
Snappy 6,217 5 months ago Open source compression library that is fast, stable and robust

Awesome DataOps / Data Serialization / Data Table Format

Apache Hudi 5,498 about 1 month ago Manages the storage of large analytical datasets on DFS
Apache Iceberg 6,621 about 1 month ago Open table format for huge analytic datasets
Delta Lake 7,677 about 1 month ago An open source project that enables building a Lakehouse architecture on top of data lakes

Awesome DataOps / Data Visualization

Apache Superset 63,320 about 1 month ago A modern data exploration and data visualization platform
Count SQL/drag-and-drop querying and visualisation tool based on notebooks
Dash 21,641 about 1 month ago Analytical Web Apps for Python, R, Julia, and Jupyter
Data Studio Reporting solution for power users who want to go beyond the data and dashboards of Google Analytics (GA)
HUE 1,188 about 1 month ago A mature SQL Assistant for querying Databases & Data Warehouses
Lux 5,226 10 months ago Fast and easy data exploration by automating the visualization and data analysis process
Metabase The simplest, fastest way to get business intelligence and analytics to everyone
Redash Connect to any data source, easily visualize, dashboard and share your data
Tableau Powerful and fastest growing data visualization tool used in the business intelligence industry

Awesome DataOps / Data Warehouse

Amazon Redshift Accelerate your time to insights with fast, easy, and secure cloud data warehousing
Apache Hive 5,577 about 1 month ago Facilitates reading, writing, and managing large datasets residing in distributed storage
Apache Kylin 3,661 about 1 month ago An open source, distributed analytical data warehouse for big data
Google BigQuery Serverless, highly scalable, and cost-effective multicloud data warehouse

Awesome DataOps / Database / Columnar Database

Apache Cassandra 8,906 about 1 month ago Open source column-based DBMS designed to handle large amounts of data
Apache Druid 13,548 about 1 month ago Designed to quickly ingest massive quantities of event data, and provide low-latency queries
Apache HBase 5,246 about 1 month ago An open-source, distributed, versioned, column-oriented store
Scylla 13,725 about 1 month ago Designed to be compatible with Cassandra while achieving higher throughputs and lower latencies

Awesome DataOps / Database / Document-Oriented Database

Apache CouchDB 6,298 about 1 month ago An open-source document-oriented NoSQL database, implemented in Erlang
Elasticsearch 71,007 about 1 month ago A distributed document-oriented database with a RESTful search engine
MongoDB 26,503 about 1 month ago A cross-platform document database that uses JSON-like documents with optional schemas
RethinkDB 26,806 2 months ago The first open-source scalable database built for realtime applications

Awesome DataOps / Database / Graph Database

Age 3,191 4 months ago A multi-model database that supports both graph and relational data models
ArangoDB 13,613 about 1 month ago A scalable open-source multi-model database natively supporting graph, document and search
JanusGraph 5,351 about 2 months ago Manages large graphs with billions of vertices and edges distributed across a multi-machine cluster
Memgraph 2,520 about 1 month ago An open source graph database, built for real-time streaming data, compatible with Neo4j
Neo4j 13,537 about 1 month ago A high performance graph store with all the features expected of a mature and robust database
Titan 5,243 about 2 years ago A highly scalable graph database optimized for storing and querying large graphs

Awesome DataOps / Database / Key-Value Database

Apache Accumulo 1,075 about 1 month ago A sorted, distributed key-value store that provides robust and scalable data storage
Dragonfly 26,326 about 1 month ago A modern in-memory datastore, fully compatible with Redis and Memcached APIs
DynamoDB Fast, flexible NoSQL database service for single-digit millisecond performance at any scale
etcd 48,056 about 1 month ago Distributed reliable key-value store for the most critical data of a distributed system
EVCache 2,071 about 1 month ago A distributed in-memory data store for the cloud
Memcached 13,601 about 1 month ago A high performance multithreaded event-based key/value cache store
Redis 67,358 about 1 month ago An in-memory key-value database that persists on disk

Awesome DataOps / Database / Relational Database

CockroachDB 30,270 about 1 month ago A distributed database designed to build, scale, and manage data-intensive apps
Crate 4,139 about 1 month ago A distributed SQL database that makes it simple to store and analyze massive amounts of data
MariaDB 5,752 about 1 month ago A replacement of MySQL with more features, new storage engines and better performance
MySQL 10,964 3 months ago One of the most popular open source transactional databases
PostgreSQL 16,442 about 1 month ago An advanced RDBMS that supports an extended subset of the SQL standard
RQLite 15,906 about 1 month ago A lightweight, distributed relational database, which uses SQLite as its storage engine
SQLite 6,902 about 1 month ago A popular choice as embedded database software for local/client storage

Awesome DataOps / Database / Time Series Database

Akumuli 835 over 2 years ago Can be used to capture, store and process time-series data in real-time
Atlas 3,459 about 1 month ago An in-memory dimensional time series database
InfluxDB 29,126 about 1 month ago Scalable datastore for metrics, events, and real-time analytics
QuestDB 14,699 about 1 month ago An open source SQL database designed to process time series data, faster
TimescaleDB 18,066 about 1 month ago Open-source time-series SQL database optimized for fast ingest and complex queries

Awesome DataOps / Database / Vector Database

Milvus 31,283 about 1 month ago An open source embedding vector similarity search engine powered by Faiss, NMSLIB and Annoy
Pinecone Managed and distributed vector similarity search used with a lightweight SDK
Qdrant 21,001 about 1 month ago An open source vector similarity search engine with extended filtering support

Awesome DataOps / File System

Alluxio 6,880 about 2 months ago A virtual distributed storage system
Amazon Simple Storage Service (S3) Object storage built to retrieve any amount of data from anywhere
Apache Hadoop Distributed File System (HDFS) A distributed file system
GlusterFS 4,774 about 2 months ago A software defined distributed storage that can scale to several petabytes
Google Cloud Storage (GCS) Object storage for companies of all sizes, to store any amount of data
LakeFS 4,496 about 1 month ago Open source tool that transforms your object storage into a Git-like repository
LizardFS 958 5 months ago A highly reliable, scalable and efficient distributed file system
MinIO 48,833 about 1 month ago High Performance, Kubernetes Native Object Storage compatible with Amazon S3 API
SeaweedFS 23,207 about 1 month ago A fast distributed storage system for blobs, objects, files, and data lake
Swift 2,639 about 1 month ago A distributed object storage system designed to scale from a single machine to thousands of servers

Awesome DataOps / Logging and Monitoring

Grafana 65,525 about 1 month ago Visualize metrics, logs, and traces from multiple sources like Prometheus, Loki, InfluxDB and more
Loki 24,172 about 1 month ago A horizontally-scalable, highly-available, multi-tenant log aggregation system inspired by Prometheus
Prometheus 56,244 about 1 month ago A monitoring system and time series database
Whylogs 2,664 about 1 month ago A tool for creating data logs, enabling monitoring for data drift and data quality issues

Awesome DataOps / Metadata Service

Hive Metastore Service that stores metadata related to Apache Hive and other services
Metacat 1,616 about 1 month ago Provides you information about what data you have, where it resides and how to process it

Awesome DataOps / SQL Query Engine

Apache Drill 1,949 2 months ago Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage
Apache Impala 1,164 about 1 month ago Lightning-fast, distributed SQL queries for petabytes of data
Dremio Power high-performing BI dashboards and interactive analytics directly on data lake
Presto 16,114 about 1 month ago A distributed SQL query engine for big data
Trino 10,601 about 1 month ago A fast distributed SQL query engine for big data analytics

Resources / Books

Data Mesh: Delivering Data-Driven Value at Scale (O'Reilly)
Designing Data-Intensive Applications (O'Reilly)
Fundamentals of Data Engineering (O'Reilly)
Getting Started with Impala (O'Reilly)
Learning and Operating Presto (O'Reilly)
Learning Spark: Lightning-Fast Data Analytics (O'Reilly)
Spark in Action (O'Reilly)
Spark: The Definitive Guide (O'Reilly)

Resources / Other Lists

Awesome Data Engineering 6,889 3 months ago
Awesome MLOps 4,181 about 2 months ago
DataOps Resource 24 over 4 years ago

Resources / Slack

Delta Lake Workspace
Trino Workspace
