awesome-dataops
DataOps toolkit
A curated list of tools and technologies for DataOps, covering data cataloging, exploration, ingestion, processing, and more.
163 stars
9 watching
20 forks
Language: Python
last commit: about 1 year ago
Linked from 1 awesome list
awesome, awesome-list, data-engineer, data-engineering, dataops
Awesome DataOps / Data Catalog | |||
| Amundsen | Data discovery and metadata engine for improving productivity when interacting with data | ||
| Apache Atlas | Provides open metadata management and governance capabilities to build a data catalog | ||
| CKAN | 4,509 | 11 months ago | Open-source DMS (data management system) for powering data hubs and data portals |
| DataHub | 10,046 | 11 months ago | LinkedIn's generalized metadata search & discovery tool |
| Magda | 518 | 11 months ago | A federated, open-source data catalog for all your big data and small data |
| Marquez | 1,800 | 11 months ago | Service for the collection, aggregation, and visualization of a data ecosystem's metadata |
| Metacat | 1,616 | 11 months ago | Unified metadata exploration API service for Hive, RDS, Teradata, Redshift, S3 and Cassandra |
| OpenLineage | 1,802 | 11 months ago | Open standard for metadata and lineage collection |
| OpenMetadata | A single place to discover, collaborate and get your data right | ||
| Unity Catalog | Industry’s only universal catalog for data and AI | ||
Awesome DataOps / Data Exploration | |||
| Apache Zeppelin | Enables data-driven, interactive data analytics and collaborative documents | ||
| Jupyter Notebook | Web-based notebook environment for interactive computing | ||
| JupyterLab | The next-generation user interface for Project Jupyter | ||
| Jupytext | 6,673 | 11 months ago | Jupyter Notebooks as Markdown Documents, Julia, Python or R scripts |
| Polynote | The polyglot notebook with first-class Scala support | ||
Awesome DataOps / Data Ingestion | |||
| Amazon Kinesis | Easily collect, process, and analyze video and data streams in real time | ||
| Apache Gobblin | 2,232 | 11 months ago | A framework that simplifies common aspects of big data such as data ingestion |
| Apache Kafka | 29,060 | 11 months ago | Open-source distributed event streaming platform used by thousands of companies |
| Apache Pulsar | 14,315 | 11 months ago | Distributed pub-sub messaging platform with a flexible messaging model and intuitive API |
| Embulk | 1,758 | 11 months ago | A parallel bulk data loader that helps data transfer between various storages |
| Fluentd | 12,963 | 11 months ago | Collects events from various data sources and writes them to files |
| Google PubSub | Ingest events for streaming into BigQuery, data lakes or operational databases | ||
| Nakadi | 958 | over 1 year ago | A distributed event bus that implements a RESTful API abstraction on top of Kafka-like queues |
| Pravega | 1,983 | about 1 year ago | An open source distributed storage service implementing Streams |
| RabbitMQ | One of the most popular open source message brokers | ||
Awesome DataOps / Data Workflow | |||
| Apache Airflow | 37,580 | 11 months ago | A platform to programmatically author, schedule, and monitor workflows |
| Apache Oozie | 717 | over 1 year ago | An extensible, scalable and reliable system to manage complex Hadoop workloads |
| Azkaban | 4,481 | over 1 year ago | Batch workflow job scheduler created at LinkedIn to run Hadoop jobs |
| Dagster | 12,055 | 11 months ago | An orchestration platform for the development, production, and observation of data assets |
| Luigi | 17,950 | 11 months ago | Python module that helps you build complex pipelines of batch jobs |
| Prefect | A workflow management system, designed for modern infrastructure | ||
Awesome DataOps / Data Processing | |||
| Apache Beam | 7,911 | 11 months ago | A unified model for defining both batch and streaming data-parallel processing pipelines |
| Apache Flink | 24,261 | 11 months ago | An open source stream processing framework with powerful capabilities |
| Apache Hadoop MapReduce | A framework for writing applications which process vast amounts of data | ||
| Apache NiFi | 4,955 | 11 months ago | An easy-to-use, powerful, and reliable system to process and distribute data |
| Apache Samza | 817 | 12 months ago | A distributed stream processing framework which uses Apache Kafka and Hadoop YARN |
| Apache Spark | 40,170 | 11 months ago | A unified analytics engine for large-scale data processing |
| Apache Storm | 6,603 | 11 months ago | An open source distributed realtime computation system |
| Apache Tez | 482 | 11 months ago | A generic data-processing pipeline engine envisioned as a low-level engine |
| Faust | 6,751 | over 1 year ago | A stream processing library, porting the ideas from Kafka Streams to Python |
Awesome DataOps / Data Quality | |||
| Cerberus | 3,179 | about 1 year ago | Lightweight, extensible data validation library for Python |
| Cleanlab | 9,820 | 11 months ago | Data-centric AI tool to detect (non-predefined) issues in ML data like label errors or outliers |
| DataProfiler | 1,442 | 12 months ago | A Python library designed to make data analysis, monitoring, and sensitive data detection easy |
| Deequ | 3,324 | about 1 year ago | A library built on top of Apache Spark for measuring data quality in large datasets |
| Great Expectations | A Python data validation framework that lets you test your data against datasets | ||
| JSON Schema | A vocabulary that allows you to annotate and validate JSON documents | ||
| SodaSQL | 61 | almost 3 years ago | Data profiling, testing, and monitoring for SQL accessible data |
Awesome DataOps / Data Serialization | |||
| Apache Avro | 2,973 | 11 months ago | A data serialization system which is compact, fast and provides rich data structures |
| Apache ORC | 698 | 11 months ago | A self-describing type-aware columnar file format designed for Hadoop workloads |
| Apache Parquet | 2,665 | 11 months ago | A columnar storage format which provides efficient storage and encoding of data |
| Kryo | 6,217 | 11 months ago | A fast and efficient binary object graph serialization framework for Java |
| ProtoBuf | 65,999 | 11 months ago | Language-neutral, platform-neutral, extensible mechanism for serializing structured data |
Awesome DataOps / Data Serialization / Data Compression | |||
| Pigz | 2,669 | about 1 year ago | A parallel implementation of gzip for modern multi-processor, multi-core machines |
| Snappy | 6,217 | about 1 year ago | Open source compression library that is fast, stable and robust |
Awesome DataOps / Data Serialization / Data Table Format | |||
| Apache Hudi | 5,498 | 11 months ago | Manages the storage of large analytical datasets on DFS |
| Apache Iceberg | 6,621 | 11 months ago | Open table format for huge analytic datasets |
| Delta Lake | 7,677 | 11 months ago | An open source project that enables building a Lakehouse architecture on top of data lakes |
Awesome DataOps / Data Visualization | |||
| Apache Superset | 63,320 | 11 months ago | A modern data exploration and data visualization platform |
| Count | SQL/drag-and-drop querying and visualisation tool based on notebooks | ||
| Dash | 21,641 | 11 months ago | Analytical Web Apps for Python, R, Julia, and Jupyter |
| Data Studio | Reporting solution for power users who want to go beyond the data and dashboards of GA | ||
| HUE | 1,188 | 11 months ago | A mature SQL Assistant for querying Databases & Data Warehouses |
| Lux | 5,226 | over 1 year ago | Fast and easy data exploration by automating the visualization and data analysis process |
| Metabase | The simplest, fastest way to get business intelligence and analytics to everyone | ||
| Redash | Connect to any data source, easily visualize, dashboard and share your data | ||
| Tableau | Powerful and fastest growing data visualization tool used in the business intelligence industry | ||
Awesome DataOps / Data Warehouse | |||
| Amazon Redshift | Accelerate your time to insights with fast, easy, and secure cloud data warehousing | ||
| Apache Hive | 5,577 | 11 months ago | Facilitates reading, writing, and managing large datasets residing in distributed storage |
| Apache Kylin | 3,661 | 11 months ago | An open source, distributed analytical data warehouse for big data |
| Google BigQuery | Serverless, highly scalable, and cost-effective multicloud data warehouse | ||
Awesome DataOps / Database / Columnar Database | |||
| Apache Cassandra | 8,906 | 11 months ago | Open source column-based DBMS designed to handle large amounts of data |
| Apache Druid | 13,548 | 11 months ago | Designed to quickly ingest massive quantities of event data, and provide low-latency queries |
| Apache HBase | 5,246 | 11 months ago | An open-source, distributed, versioned, column-oriented store |
| Scylla | 13,725 | 11 months ago | Designed to be compatible with Cassandra while achieving higher throughputs and lower latencies |
Awesome DataOps / Database / Document-Oriented Database | |||
| Apache CouchDB | 6,298 | 11 months ago | An open-source document-oriented NoSQL database, implemented in Erlang |
| Elasticsearch | 71,007 | 11 months ago | A distributed document oriented database with a RESTful search engine |
| MongoDB | 26,503 | 11 months ago | A cross-platform document database that uses JSON-like documents with optional schemas |
| RethinkDB | 26,806 | 12 months ago | The first open-source scalable database built for realtime applications |
Awesome DataOps / Database / Graph Database | |||
| Age | 3,191 | about 1 year ago | A multi-model database that supports both graph and relational data models |
| ArangoDB | 13,613 | 11 months ago | A scalable open-source multi-model database natively supporting graph, document and search |
| JanusGraph | 5,351 | 12 months ago | Manage large graphs with billions of vertices and edges distributed across a multi-machine cluster |
| Memgraph | 2,520 | 11 months ago | An open source graph database, built for real-time streaming data, compatible with Neo4j |
| Neo4j | 13,537 | 11 months ago | A high performance graph store with all the features expected of a mature and robust database |
| Titan | 5,243 | about 3 years ago | A highly scalable graph database optimized for storing and querying large graphs |
Awesome DataOps / Database / Key-Value Database | |||
| Apache Accumulo | 1,075 | 11 months ago | A sorted, distributed key-value store that provides robust and scalable data storage |
| Dragonfly | 26,326 | 11 months ago | A modern in-memory datastore, fully compatible with Redis and Memcached APIs |
| DynamoDB | Fast, flexible NoSQL database service for single-digit millisecond performance at any scale | ||
| etcd | 48,056 | 11 months ago | Distributed reliable key-value store for the most critical data of a distributed system |
| EVCache | 2,071 | 11 months ago | A distributed in-memory data store for the cloud |
| Memcached | 13,601 | 11 months ago | A high performance multithreaded event-based key/value cache store |
| Redis | 67,358 | 11 months ago | An in-memory key-value database that persists on disk |
Awesome DataOps / Database / Relational Database | |||
| CockroachDB | 30,270 | 11 months ago | A distributed database designed to build, scale, and manage data-intensive apps |
| Crate | 4,139 | 11 months ago | A distributed SQL database that makes it simple to store and analyze massive amounts of data |
| MariaDB | 5,752 | 11 months ago | A replacement of MySQL with more features, new storage engines and better performance |
| MySQL | 10,964 | about 1 year ago | One of the most popular open source transactional databases |
| PostgreSQL | 16,442 | 11 months ago | An advanced RDBMS that supports an extended subset of the SQL standard |
| RQLite | 15,906 | 11 months ago | A lightweight, distributed relational database, which uses SQLite as its storage engine |
| SQLite | 6,902 | 11 months ago | A popular choice as embedded database software for local/client storage |
Awesome DataOps / Database / Time Series Database | |||
| Akumuli | 835 | over 3 years ago | Can be used to capture, store and process time-series data in real-time |
| Atlas | 3,459 | 11 months ago | An in-memory dimensional time series database |
| InfluxDB | 29,126 | 11 months ago | Scalable datastore for metrics, events, and real-time analytics |
| QuestDB | 14,699 | 11 months ago | An open source SQL database designed for fast processing of time series data |
| TimescaleDB | 18,066 | 11 months ago | Open-source time-series SQL database optimized for fast ingest and complex queries |
Awesome DataOps / Database / Vector Database | |||
| Milvus | 31,283 | 11 months ago | An open source embedding vector similarity search engine powered by Faiss, NMSLIB and Annoy |
| Pinecone | Managed and distributed vector similarity search used with a lightweight SDK | ||
| Qdrant | 21,001 | 11 months ago | An open source vector similarity search engine with extended filtering support |
Awesome DataOps / File System | |||
| Alluxio | 6,880 | 11 months ago | A virtual distributed storage system |
| Amazon Simple Storage Service (S3) | Object storage built to retrieve any amount of data from anywhere | ||
| Apache Hadoop Distributed File System (HDFS) | A distributed file system | ||
| GlusterFS | 4,774 | 11 months ago | A software defined distributed storage that can scale to several petabytes |
| Google Cloud Storage (GCS) | Object storage for companies of all sizes, to store any amount of data | ||
| LakeFS | 4,496 | 11 months ago | Open source tool that transforms your object storage into a Git-like repository |
| LizardFS | 958 | about 1 year ago | A highly reliable, scalable and efficient distributed file system |
| MinIO | 48,833 | 11 months ago | High Performance, Kubernetes Native Object Storage compatible with Amazon S3 API |
| SeaweedFS | 23,207 | 11 months ago | A fast distributed storage system for blobs, objects, files, and data lake |
| Swift | 2,639 | 11 months ago | A distributed object storage system designed to scale from a single machine to thousands of servers |
Awesome DataOps / Logging and Monitoring | |||
| Grafana | 65,525 | 11 months ago | Visualize metrics, logs, and traces from multiple sources like Prometheus, Loki, InfluxDB and more |
| Loki | 24,172 | 11 months ago | A horizontally-scalable, highly-available, multi-tenant log aggregation system inspired by Prometheus |
| Prometheus | 56,244 | 11 months ago | A monitoring system and time series database |
| Whylogs | 2,664 | 11 months ago | A tool for creating data logs, enabling monitoring for data drift and data quality issues |
Awesome DataOps / Metadata Service | |||
| Hive Metastore | Service that stores metadata related to Apache Hive and other services | ||
| Metacat | 1,616 | 11 months ago | Provides you information about what data you have, where it resides and how to process it |
Awesome DataOps / SQL Query Engine | |||
| Apache Drill | 1,949 | 12 months ago | Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage |
| Apache Impala | 1,164 | 11 months ago | Lightning-fast, distributed SQL queries for petabytes of data |
| Dremio | Power high-performing BI dashboards and interactive analytics directly on the data lake | ||
| Presto | 16,114 | 11 months ago | A distributed SQL query engine for big data |
| Trino | 10,601 | 11 months ago | A fast distributed SQL query engine for big data analytics |
Resources / Books | |||
| Data Mesh: Delivering Data-Driven Value at Scale | (O'Reilly) | ||
| Designing Data-Intensive Applications | (O'Reilly) | ||
| Fundamentals of Data Engineering | (O'Reilly) | ||
| Getting Started with Impala | (O'Reilly) | ||
| Learning and Operating Presto | (O'Reilly) | ||
| Learning Spark: Lightning-Fast Data Analytics | (O'Reilly) | ||
| Spark in Action | (O'Reilly) | ||
| Spark: The Definitive Guide | (O'Reilly) | ||
Resources / Other Lists | |||
| Awesome Data Engineering | 6,889 | about 1 year ago | |
| Awesome MLOps | 4,181 | 11 months ago | |
| DataOps Resource | 24 | about 5 years ago | |
Resources / Slack | |||
| Delta Lake Workspace | |||
| Trino Workspace | |||