awesome-dataops
A curated list of awesome DataOps tools
147 stars
8 watching
19 forks
Language: Python
last commit: 12 days ago
Linked from 1 awesome list
Tags: awesome, awesome-list, data-engineer, data-engineering, dataops
Awesome DataOps / Data Catalog | |||
Amundsen | Data discovery and metadata engine for improving productivity when interacting with data | ||
Apache Atlas | Provides open metadata management and governance capabilities to build a data catalog | ||
CKAN | 4,409 | 11 days ago | Open-source DMS (data management system) for powering data hubs and data portals |
DataHub | 9,727 | 8 days ago | LinkedIn's generalized metadata search & discovery tool |
Magda | 508 | 12 days ago | A federated, open-source data catalog for all your big data and small data |
Metacat | 1,607 | 18 days ago | Unified metadata exploration API service for Hive, RDS, Teradata, Redshift, S3 and Cassandra |
OpenMetadata | A single place to discover, collaborate and get your data right | ||
Unity Catalog | Industry’s only universal catalog for data and AI | ||
Awesome DataOps / Data Exploration | |||
Apache Zeppelin | Enables data-driven, interactive data analytics and collaborative documents | ||
Jupyter Notebook | Web-based notebook environment for interactive computing | ||
JupyterLab | The next-generation user interface for Project Jupyter | ||
Jupytext | 6,605 | about 1 month ago | Jupyter Notebooks as Markdown Documents, Julia, Python or R scripts |
Polynote | The polyglot notebook with first-class Scala support | ||
Awesome DataOps / Data Ingestion | |||
Amazon Kinesis | Easily collect, process, and analyze video and data streams in real time | ||
Apache Gobblin | 2,216 | 16 days ago | A framework that simplifies common aspects of big data such as data ingestion |
Apache Kafka | 28,494 | 3 days ago | Open-source distributed event streaming platform used by thousands of companies |
Apache Pulsar | 14,141 | 12 days ago | Distributed pub-sub messaging platform with a flexible messaging model and intuitive API |
Embulk | 1,748 | 12 days ago | A parallel bulk data loader that helps data transfer between various storages |
Fluentd | 12,857 | 3 days ago | Collects events from various data sources and writes them to files |
Google PubSub | Ingest events for streaming into BigQuery, data lakes or operational databases | ||
Nakadi | 956 | 6 months ago | A distributed event bus that implements a RESTful API abstraction on top of Kafka-like queues |
Pravega | 1,982 | about 1 month ago | An open source distributed storage service implementing Streams |
RabbitMQ | One of the most popular open source message brokers | ||
Awesome DataOps / Data Workflow | |||
Apache Airflow | 36,519 | 4 days ago | A platform to programmatically author, schedule, and monitor workflows |
Apache Oozie | 708 | 3 months ago | An extensible, scalable and reliable system to manage complex Hadoop workloads |
Azkaban | 4,458 | 3 months ago | Batch workflow job scheduler created at LinkedIn to run Hadoop jobs |
Dagster | 11,237 | 5 days ago | An orchestration platform for the development, production, and observation of data assets |
Luigi | 17,746 | 11 days ago | Python module that helps you build complex pipelines of batch jobs |
Prefect | A workflow management system, designed for modern infrastructure | ||
Awesome DataOps / Data Processing | |||
Apache Beam | 7,805 | 1 day ago | A unified model for defining both batch and streaming data-parallel processing pipelines |
Apache Flink | 23,889 | 1 day ago | An open source stream processing framework with powerful capabilities |
Apache Hadoop MapReduce | A framework for writing applications which process vast amounts of data | ||
Apache Nifi | 4,798 | 4 days ago | An easy to use, powerful, and reliable system to process and distribute data |
Apache Samza | 812 | 4 days ago | A distributed stream processing framework which uses Apache Kafka and Hadoop YARN |
Apache Spark | 39,387 | 5 days ago | A unified analytics engine for large-scale data processing |
Apache Storm | 6,589 | 4 days ago | An open source distributed realtime computation system |
Apache Tez | 472 | 10 days ago | A generic data-processing pipeline engine envisioned as a low-level engine |
Faust | 6,726 | 2 months ago | A stream processing library, porting the ideas from Kafka Streams to Python |
Awesome DataOps / Data Quality | |||
Cerberus | 3,154 | about 2 months ago | Lightweight, extensible data validation library for Python |
Cleanlab | 9,428 | 29 days ago | Data-centric AI tool to detect (non-predefined) issues in ML data like label errors or outliers |
DataProfiler | 1,420 | 3 months ago | A Python library designed to make data analysis, monitoring, and sensitive data detection easy |
Deequ | 3,269 | 4 days ago | A library built on top of Apache Spark for measuring data quality in large datasets |
Great Expectations | A Python data validation framework that allows you to test your data against datasets | ||
JSON Schema | A vocabulary that allows you to annotate and validate JSON documents | ||
SodaSQL | 61 | almost 2 years ago | Data profiling, testing, and monitoring for SQL accessible data |
Awesome DataOps / Data Serialization | |||
Apache Avro | 2,900 | 5 days ago | A data serialization system which is compact, fast and provides rich data structures |
Apache ORC | 681 | 5 days ago | A self-describing type-aware columnar file format designed for Hadoop workloads |
Apache Parquet | 2,575 | 3 days ago | A columnar storage format which provides efficient storage and encoding of data |
Kryo | 6,181 | 5 days ago | A fast and efficient binary object graph serialization framework for Java |
ProtoBuf | 65,302 | 10 days ago | Language-neutral, platform-neutral, extensible mechanism for serializing structured data |
Awesome DataOps / Data Serialization / Data Compression | |||
Pigz | 2,635 | 18 days ago | A parallel implementation of gzip for modern multi-processor, multi-core machines |
Snappy | 6,110 | about 2 months ago | Open source compression library that is fast, stable and robust |
Awesome DataOps / Data Serialization / Data Table Format | |||
Apache Hudi | 5,346 | 1 day ago | Manages the storage of large analytical datasets on DFS |
Apache Iceberg | 6,241 | 2 days ago | Open table format for huge analytic datasets |
Delta Lake | 7,487 | 3 days ago | An open source project that enables building a Lakehouse architecture on top of data lakes |
Awesome DataOps / Data Visualization | |||
Apache Superset | 62,043 | 3 days ago | A modern data exploration and data visualization platform |
Count | SQL/drag-and-drop querying and visualisation tool based on notebooks | ||
Dash | 21,250 | 15 days ago | Analytical Web Apps for Python, R, Julia, and Jupyter |
Data Studio | Reporting solution for power users who want to go beyond the data and dashboards of Google Analytics | ||
HUE | 1,163 | 4 days ago | A mature SQL Assistant for querying Databases & Data Warehouses |
Lux | 5,144 | 7 months ago | Fast and easy data exploration by automating the visualization and data analysis process |
Metabase | The simplest, fastest way to get business intelligence and analytics to everyone | ||
Redash | Connect to any data source, easily visualize, dashboard and share your data | ||
Tableau | Powerful, fast-growing data visualization tool used in the business intelligence industry | ||
Awesome DataOps / Data Warehouse | |||
Amazon Redshift | Accelerate your time to insights with fast, easy, and secure cloud data warehousing | ||
Apache Hive | 5,514 | 3 days ago | Facilitates reading, writing, and managing large datasets residing in distributed storage |
Apache Kylin | 3,636 | 4 days ago | An open source, distributed analytical data warehouse for big data |
Google BigQuery | Serverless, highly scalable, and cost-effective multicloud data warehouse | ||
Awesome DataOps / Database / Columnar Database | |||
Apache Cassandra | 8,719 | 12 days ago | Open source column based DBMS designed to handle large amounts of data |
Apache Druid | 13,429 | 5 days ago | Designed to quickly ingest massive quantities of event data, and provide low-latency queries |
Apache HBase | 5,206 | 3 days ago | An open-source, distributed, versioned, column-oriented store |
Scylla | 13,370 | 1 day ago | Designed to be compatible with Cassandra while achieving higher throughputs and lower latencies |
Awesome DataOps / Database / Document-Oriented Database | |||
Apache CouchDB | 6,180 | 15 days ago | An open-source document-oriented NoSQL database, implemented in Erlang |
Elasticsearch | 69,622 | 12 days ago | A distributed document-oriented database with a RESTful search engine |
MongoDB | 26,120 | 12 days ago | A cross-platform document database that uses JSON-like documents with optional schemas |
RethinkDB | 26,735 | 7 months ago | The first open-source scalable database built for realtime applications |
Awesome DataOps / Database / Graph Database | |||
Age | 3,028 | 22 days ago | A multi-model database that supports both graph and relational data models |
ArangoDB | 13,530 | 2 days ago | A scalable open-source multi-model database natively supporting graph, document and search |
JanusGraph | 5,282 | 5 days ago | Manage large graphs with billions of vertices and edges distributed across a multi-machine cluster |
Memgraph | 2,360 | 3 days ago | An open source graph database, built for real-time streaming data, compatible with Neo4j |
Neo4j | 13,200 | 3 days ago | A high performance graph store with all the features expected of a mature and robust database |
Titan | 5,247 | almost 2 years ago | A highly scalable graph database optimized for storing and querying large graphs |
Awesome DataOps / Database / Key-Value Database | |||
Apache Accumulo | 1,062 | 4 days ago | A sorted, distributed key-value store that provides robust and scalable data storage |
Dragonfly | 25,449 | 3 days ago | A modern in-memory datastore, fully compatible with Redis and Memcached APIs |
DynamoDB | Fast, flexible NoSQL database service for single-digit millisecond performance at any scale | ||
etcd | 47,457 | 11 days ago | Distributed reliable key-value store for the most critical data of a distributed system |
EVCache | 2,026 | about 1 month ago | A distributed in-memory data store for the cloud |
Memcached | 13,433 | 28 days ago | A high performance multithreaded event-based key/value cache store |
Redis | 66,394 | 12 days ago | An in-memory key-value database that persists on disk |
Awesome DataOps / Database / Relational Database | |||
CockroachDB | 29,954 | 5 days ago | A distributed database designed to build, scale, and manage data-intensive apps |
Crate | 4,052 | 11 days ago | A distributed SQL database that makes it simple to store and analyze massive amounts of data |
MariaDB | 5,584 | 10 days ago | A replacement of MySQL with more features, new storage engines and better performance |
MySQL | 10,733 | about 2 months ago | One of the most popular open source transactional databases |
PostgreSQL | 15,921 | 1 day ago | An advanced RDBMS that supports an extended subset of the SQL standard |
RQLite | 15,617 | 5 days ago | A lightweight, distributed relational database, which uses SQLite as its storage engine |
SQLite | 6,408 | 3 days ago | A popular choice as embedded database software for local/client storage |
Awesome DataOps / Database / Time Series Database | |||
Akumuli | 836 | about 2 years ago | Can be used to capture, store and process time-series data in real-time |
Atlas | 3,439 | 8 days ago | An in-memory dimensional time series database |
InfluxDB | 28,713 | 3 days ago | Scalable datastore for metrics, events, and real-time analytics |
QuestDB | 14,381 | 4 days ago | An open source SQL database designed to process time series data, faster |
TimescaleDB | 17,531 | 10 days ago | Open-source time-series SQL database optimized for fast ingest and complex queries |
Awesome DataOps / Database / Vector Database | |||
Milvus | 29,730 | 5 days ago | An open source embedding vector similarity search engine powered by Faiss, NMSLIB and Annoy |
Pinecone | Managed and distributed vector similarity search used with a lightweight SDK | ||
Qdrant | 19,996 | 3 days ago | An open source vector similarity search engine with extended filtering support |
Awesome DataOps / File System | |||
Alluxio | 6,816 | 23 days ago | A virtual distributed storage system |
Amazon Simple Storage Service (S3) | Object storage built to retrieve any amount of data from anywhere | ||
Apache Hadoop Distributed File System (HDFS) | A distributed file system | ||
GlusterFS | 4,655 | about 2 months ago | A software defined distributed storage that can scale to several petabytes |
Google Cloud Storage (GCS) | Object storage for companies of all sizes, to store any amount of data | ||
LakeFS | 4,363 | 10 days ago | Open source tool that transforms your object storage into a Git-like repository |
LizardFS | 953 | about 2 months ago | A highly reliable, scalable and efficient distributed file system |
MinIO | 47,067 | 3 days ago | High Performance, Kubernetes Native Object Storage compatible with Amazon S3 API |
SeaweedFS | 22,426 | 5 days ago | A fast distributed storage system for blobs, objects, files, and data lake |
Swift | 2,616 | 3 days ago | A distributed object storage system designed to scale from a single machine to thousands of servers |
Awesome DataOps / Logging and Monitoring | |||
Grafana | 64,069 | 11 days ago | Visualize metrics, logs, and traces from multiple sources like Prometheus, Loki, InfluxDB and more |
Loki | 23,418 | 11 days ago | A horizontally-scalable, highly-available, multi-tenant log aggregation system inspired by Prometheus |
Prometheus | 55,095 | 3 days ago | A monitoring system and time series database |
Whylogs | 2,635 | 9 days ago | A tool for creating data logs, enabling monitoring for data drift and data quality issues |
Awesome DataOps / Metadata Service | |||
Hive Metastore | Service that stores metadata related to Apache Hive and other services | ||
Metacat | 1,607 | 18 days ago | Provides you information about what data you have, where it resides and how to process it |
Awesome DataOps / SQL Query Engine | |||
Apache Drill | 1,933 | 9 days ago | Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage |
Apache Impala | 1,131 | 4 days ago | Lightning-fast, distributed SQL queries for petabytes of data |
Dremio | Power high-performing BI dashboards and interactive analytics directly on the data lake | ||
Presto | 15,958 | 8 days ago | A distributed SQL query engine for big data |
Trino | 10,237 | 8 days ago | A fast distributed SQL query engine for big data analytics |
Resources / Books | |||
Data Mesh: Delivering Data-Driven Value at Scale | (O'Reilly) | ||
Designing Data-Intensive Applications | (O'Reilly) | ||
Fundamentals of Data Engineering | (O'Reilly) | ||
Getting Started with Impala | (O'Reilly) | ||
Learning and Operating Presto | (O'Reilly) | ||
Learning Spark: Lightning-Fast Data Analytics | (O'Reilly) | ||
Spark in Action | (O'Reilly) | ||
Spark: The Definitive Guide | (O'Reilly) | ||
Resources / Other Lists | |||
Awesome Data Engineering | 6,655 | about 1 month ago | |
Awesome MLOps | 3,994 | 12 days ago | |
DataOps Resource | 21 | about 4 years ago | |
Resources / Slack | |||
Delta Lake Workspace | |||
Trino Workspace |