awesome-dataops
DataOps toolkit
A curated list of tools and technologies for DataOps, covering data cataloging, exploration, ingestion, processing, and more.
163 stars
9 watching
20 forks
Language: Python
last commit: 2 months ago
Linked from 1 awesome list
awesome, awesome-list, data-engineer, data-engineering, dataops
Awesome DataOps / Data Catalog | |||
Amundsen | Data discovery and metadata engine for improving productivity when interacting with data | ||
Apache Atlas | Provides open metadata management and governance capabilities to build a data catalog | ||
CKAN | 4,509 | 1 day ago | Open-source DMS (data management system) for powering data hubs and data portals |
DataHub | 10,046 | about 13 hours ago | LinkedIn's generalized metadata search & discovery tool |
Magda | 518 | 1 day ago | A federated, open-source data catalog for all your big data and small data |
Marquez | 1,800 | 2 days ago | Service for the collection, aggregation, and visualization of a data ecosystem's metadata |
Metacat | 1,616 | about 24 hours ago | Unified metadata exploration API service for Hive, RDS, Teradata, Redshift, S3 and Cassandra |
OpenLineage | 1,802 | about 10 hours ago | Open standard for metadata and lineage collection |
OpenMetadata | A single place to discover, collaborate, and get your data right | ||
Unity Catalog | Industry’s only universal catalog for data and AI | ||
Awesome DataOps / Data Exploration | |||
Apache Zeppelin | Enables data-driven, interactive data analytics and collaborative documents | ||
Jupyter Notebook | Web-based notebook environment for interactive computing | ||
JupyterLab | The next-generation user interface for Project Jupyter | ||
Jupytext | 6,673 | 7 days ago | Jupyter Notebooks as Markdown Documents, Julia, Python or R scripts |
Polynote | The polyglot notebook with first-class Scala support | ||
Awesome DataOps / Data Ingestion | |||
Amazon Kinesis | Easily collect, process, and analyze video and data streams in real time | ||
Apache Gobblin | 2,232 | 4 days ago | A framework that simplifies common aspects of big data such as data ingestion |
Apache Kafka | 29,060 | 4 days ago | Open-source distributed event streaming platform used by thousands of companies |
Apache Pulsar | 14,315 | 4 days ago | Distributed pub-sub messaging platform with a flexible messaging model and intuitive API |
Embulk | 1,758 | 14 days ago | A parallel bulk data loader that helps data transfer between various storages |
Fluentd | 12,963 | 1 day ago | Collects events from various data sources and writes them to files |
Google PubSub | Ingest events for streaming into BigQuery, data lakes or operational databases | ||
Nakadi | 958 | 8 months ago | A distributed event bus that implements a RESTful API abstraction on top of Kafka-like queues |
Pravega | 1,983 | 4 months ago | An open source distributed storage service implementing Streams |
RabbitMQ | One of the most popular open source message brokers | ||
Awesome DataOps / Data Workflow | |||
Apache Airflow | 37,580 | 4 days ago | A platform to programmatically author, schedule, and monitor workflows |
Apache Oozie | 717 | 5 months ago | An extensible, scalable and reliable system to manage complex Hadoop workloads |
Azkaban | 4,481 | 6 months ago | Batch workflow job scheduler created at LinkedIn to run Hadoop jobs |
Dagster | 12,055 | 5 days ago | An orchestration platform for the development, production, and observation of data assets |
Luigi | 17,950 | 6 days ago | Python module that helps you build complex pipelines of batch jobs |
Prefect | A workflow management system, designed for modern infrastructure | ||
Awesome DataOps / Data Processing | |||
Apache Beam | 7,911 | 4 days ago | A unified model for defining both batch and streaming data-parallel processing pipelines |
Apache Flink | 24,261 | 5 days ago | An open source stream processing framework with powerful capabilities |
Apache Hadoop MapReduce | A framework for writing applications which process vast amounts of data | ||
Apache Nifi | 4,955 | 4 days ago | An easy to use, powerful, and reliable system to process and distribute data |
Apache Samza | 817 | 22 days ago | A distributed stream processing framework which uses Apache Kafka and Hadoop YARN |
Apache Spark | 40,170 | 5 days ago | A unified analytics engine for large-scale data processing |
Apache Storm | 6,603 | 9 days ago | An open source distributed realtime computation system |
Apache Tez | 482 | 5 days ago | A generic data-processing pipeline engine envisioned as a low-level engine |
Faust | 6,751 | 5 months ago | A stream processing library, porting the ideas from Kafka Streams to Python |
Awesome DataOps / Data Quality | |||
Cerberus | 3,179 | 4 months ago | Lightweight, extensible data validation library for Python |
Cleanlab | 9,820 | 6 days ago | Data-centric AI tool to detect (non-predefined) issues in ML data like label errors or outliers |
DataProfiler | 1,442 | about 1 month ago | A Python library designed to make data analysis, monitoring, and sensitive data detection easy |
Deequ | 3,324 | 2 months ago | A library built on top of Apache Spark for measuring data quality in large datasets |
Great Expectations | A Python data validation framework that allows you to test your data against datasets | ||
JSON Schema | A vocabulary that allows you to annotate and validate JSON documents | ||
SodaSQL | 61 | about 2 years ago | Data profiling, testing, and monitoring for SQL accessible data |
Awesome DataOps / Data Serialization | |||
Apache Avro | 2,973 | 8 days ago | A data serialization system which is compact, fast and provides rich data structures |
Apache ORC | 698 | 5 days ago | A self-describing type-aware columnar file format designed for Hadoop workloads |
Apache Parquet | 2,665 | 13 days ago | A columnar storage format which provides efficient storage and encoding of data |
Kryo | 6,217 | 9 days ago | A fast and efficient binary object graph serialization framework for Java |
ProtoBuf | 65,999 | 5 days ago | Language-neutral, platform-neutral, extensible mechanism for serializing structured data |
Awesome DataOps / Data Serialization / Data Compression | |||
Pigz | 2,669 | 3 months ago | A parallel implementation of gzip for modern multi-processor, multi-core machines |
Snappy | 6,217 | 4 months ago | Open source compression library that is fast, stable and robust |
Awesome DataOps / Data Serialization / Data Table Format | |||
Apache Hudi | 5,498 | 4 days ago | Manages the storage of large analytical datasets on DFS |
Apache Iceberg | 6,621 | 5 days ago | Open table format for huge analytic datasets |
Delta Lake | 7,677 | 4 days ago | An open source project that enables building a Lakehouse architecture on top of data lakes |
Awesome DataOps / Data Visualization | |||
Apache Superset | 63,320 | 4 days ago | A modern data exploration and data visualization platform |
Count | SQL/drag-and-drop querying and visualisation tool based on notebooks | ||
Dash | 21,641 | 5 days ago | Analytical Web Apps for Python, R, Julia, and Jupyter |
Data Studio | Reporting solution for power users who want to go beyond the data and dashboards of GA | ||
HUE | 1,188 | 1 day ago | A mature SQL Assistant for querying Databases & Data Warehouses |
Lux | 5,226 | 9 months ago | Fast and easy data exploration by automating the visualization and data analysis process |
Metabase | The simplest, fastest way to get business intelligence and analytics to everyone | ||
Redash | Connect to any data source, easily visualize, dashboard and share your data | ||
Tableau | Powerful and fastest growing data visualization tool used in the business intelligence industry | ||
Awesome DataOps / Data Warehouse | |||
Amazon Redshift | Accelerate your time to insights with fast, easy, and secure cloud data warehousing | ||
Apache Hive | 5,577 | 5 days ago | Facilitates reading, writing, and managing large datasets residing in distributed storage |
Apache Kylin | 3,661 | 5 days ago | An open source, distributed analytical data warehouse for big data |
Google BigQuery | Serverless, highly scalable, and cost-effective multicloud data warehouse | ||
Awesome DataOps / Database / Columnar Database | |||
Apache Cassandra | 8,906 | 4 days ago | Open source column based DBMS designed to handle large amounts of data |
Apache Druid | 13,548 | 4 days ago | Designed to quickly ingest massive quantities of event data, and provide low-latency queries |
Apache HBase | 5,246 | 4 days ago | An open-source, distributed, versioned, column-oriented store |
Scylla | 13,725 | 5 days ago | Designed to be compatible with Cassandra while achieving higher throughputs and lower latencies |
Awesome DataOps / Database / Document-Oriented Database | |||
Apache CouchDB | 6,298 | 5 days ago | An open-source document-oriented NoSQL database, implemented in Erlang |
Elasticsearch | 71,007 | 4 days ago | A distributed document oriented database with a RESTful search engine |
MongoDB | 26,503 | 4 days ago | A cross-platform document database that uses JSON-like documents with optional schemas |
RethinkDB | 26,806 | about 1 month ago | The first open-source scalable database built for realtime applications |
Awesome DataOps / Database / Graph Database | |||
Age | 3,191 | 3 months ago | A multi-model database that supports both graph and relational data models |
ArangoDB | 13,613 | 4 days ago | A scalable open-source multi-model database natively supporting graph, document and search |
JanusGraph | 5,351 | 28 days ago | Manage large graphs with billions of vertices and edges distributed across a multi-machine cluster |
Memgraph | 2,520 | about 10 hours ago | An open source graph database, built for real-time streaming data, compatible with Neo4j |
Neo4j | 13,537 | 9 days ago | A high performance graph store with all the features expected of a mature and robust database |
Titan | 5,243 | about 2 years ago | A highly scalable graph database optimized for storing and querying large graphs |
Awesome DataOps / Database / Key-Value Database | |||
Apache Accumulo | 1,075 | 4 days ago | A sorted, distributed key-value store that provides robust and scalable data storage |
Dragonfly | 26,326 | 1 day ago | A modern in-memory datastore, fully compatible with Redis and Memcached APIs |
DynamoDB | Fast, flexible NoSQL database service for single-digit millisecond performance at any scale | ||
etcd | 48,056 | 1 day ago | Distributed reliable key-value store for the most critical data of a distributed system |
EVCache | 2,071 | 6 days ago | A distributed in-memory data store for the cloud |
Memcached | 13,601 | 13 days ago | A high performance multithreaded event-based key/value cache store |
Redis | 67,358 | 5 days ago | An in-memory key-value database that persists on disk |
Awesome DataOps / Database / Relational Database | |||
CockroachDB | 30,270 | 5 days ago | A distributed database designed to build, scale, and manage data-intensive apps |
Crate | 4,139 | 4 days ago | A distributed SQL database that makes it simple to store and analyze massive amounts of data |
MariaDB | 5,752 | about 19 hours ago | A replacement of MySQL with more features, new storage engines and better performance |
MySQL | 10,964 | 2 months ago | One of the most popular open source transactional databases |
PostgreSQL | 16,442 | 5 days ago | An advanced RDBMS that supports an extended subset of the SQL standard |
RQLite | 15,906 | 5 days ago | A lightweight, distributed relational database, which uses SQLite as its storage engine |
SQLite | 6,902 | about 10 hours ago | A popular choice as embedded database software for local/client storage |
Awesome DataOps / Database / Time Series Database | |||
Akumuli | 835 | over 2 years ago | Can be used to capture, store and process time-series data in real-time |
Atlas | 3,459 | 7 days ago | An in-memory dimensional time series database |
InfluxDB | 29,126 | 4 days ago | Scalable datastore for metrics, events, and real-time analytics |
QuestDB | 14,699 | 4 days ago | An open source SQL database designed to process time series data, faster |
TimescaleDB | 18,066 | 1 day ago | Open-source time-series SQL database optimized for fast ingest and complex queries |
Awesome DataOps / Database / Vector Database | |||
Milvus | 31,283 | 4 days ago | An open source embedding vector similarity search engine powered by Faiss, NMSLIB and Annoy |
Pinecone | Managed and distributed vector similarity search used with a lightweight SDK | ||
Qdrant | 21,001 | 1 day ago | An open source vector similarity search engine with extended filtering support |
Awesome DataOps / File System | |||
Alluxio | 6,880 | 20 days ago | A virtual distributed storage system |
Amazon Simple Storage Service (S3) | Object storage built to retrieve any amount of data from anywhere | ||
Apache Hadoop Distributed File System (HDFS) | A distributed file system | ||
GlusterFS | 4,774 | 14 days ago | A software defined distributed storage that can scale to several petabytes |
Google Cloud Storage (GCS) | Object storage for companies of all sizes, to store any amount of data | ||
LakeFS | 4,496 | 1 day ago | Open source tool that transforms your object storage into a Git-like repository |
LizardFS | 958 | 4 months ago | A highly reliable, scalable and efficient distributed file system |
MinIO | 48,833 | 4 days ago | High Performance, Kubernetes Native Object Storage compatible with Amazon S3 API |
SeaweedFS | 23,207 | 5 days ago | A fast distributed storage system for blobs, objects, files, and data lake |
Swift | 2,639 | about 10 hours ago | A distributed object storage system designed to scale from a single machine to thousands of servers |
Awesome DataOps / Logging and Monitoring | |||
Grafana | 65,525 | 1 day ago | Visualize metrics, logs, and traces from multiple sources like Prometheus, Loki, InfluxDB and more |
Loki | 24,172 | 1 day ago | A horizontally-scalable, highly-available, multi-tenant log aggregation system inspired by Prometheus |
Prometheus | 56,244 | 4 days ago | A monitoring system and time series database |
Whylogs | 2,664 | 7 days ago | A tool for creating data logs, enabling monitoring for data drift and data quality issues |
Awesome DataOps / Metadata Service | |||
Hive Metastore | Service that stores metadata related to Apache Hive and other services | ||
Metacat | 1,616 | about 24 hours ago | Provides you information about what data you have, where it resides and how to process it |
Awesome DataOps / SQL Query Engine | |||
Apache Drill | 1,949 | 30 days ago | Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage |
Apache Impala | 1,164 | 5 days ago | Lightning-fast, distributed SQL queries for petabytes of data |
Dremio | Power high-performing BI dashboards and interactive analytics directly on the data lake | ||
Presto | 16,114 | 1 day ago | A distributed SQL query engine for big data |
Trino | 10,601 | about 14 hours ago | A fast distributed SQL query engine for big data analytics |
Resources / Books | |||
Data Mesh: Delivering Data-Driven Value at Scale | (O'Reilly) | ||
Designing Data-Intensive Applications | (O'Reilly) | ||
Fundamentals of Data Engineering | (O'Reilly) | ||
Getting Started with Impala | (O'Reilly) | ||
Learning and Operating Presto | (O'Reilly) | ||
Learning Spark: Lightning-Fast Data Analytics | (O'Reilly) | ||
Spark in Action | (O'Reilly) | ||
Spark: The Definitive Guide | (O'Reilly) | ||
Resources / Other Lists | |||
Awesome Data Engineering | 6,889 | about 2 months ago | |
Awesome MLOps | 4,181 | 19 days ago | |
DataOps Resource | 24 | over 4 years ago | |
Resources / Slack | |||
Delta Lake Workspace | |||
Trino Workspace |