awesome-dataops

DataOps toolkit

A curated list of tools and technologies for DataOps, covering data cataloging, exploration, ingestion, processing, and more.

163 stars

9 watching

20 forks

Language: Python

last commit: almost 2 years ago

Linked from 1 awesome list

Topics: awesome, awesome-list, data-engineer, data-engineering, dataops

Awesome DataOps / Data Catalog
Amundsen			Data discovery and metadata engine for improving the productivity when interacting with data
Apache Atlas			Provides open metadata management and governance capabilities to build a data catalog
CKAN	4,509	over 1 year ago	Open-source DMS (data management system) for powering data hubs and data portals
DataHub	10,046	over 1 year ago	LinkedIn's generalized metadata search & discovery tool
Magda	518	over 1 year ago	A federated, open-source data catalog for all your big data and small data
Marquez	1,800	over 1 year ago	Service for the collection, aggregation, and visualization of a data ecosystem's metadata
Metacat	1,616	over 1 year ago	Unified metadata exploration API service for Hive, RDS, Teradata, Redshift, S3 and Cassandra
OpenLineage	1,802	over 1 year ago	Open standard for metadata and lineage collection
OpenMetadata			A single place to discover, collaborate, and get your data right
Unity Catalog			Industry’s only universal catalog for data and AI
Awesome DataOps / Data Exploration
Apache Zeppelin			Enables data-driven, interactive data analytics and collaborative documents
Jupyter Notebook			Web-based notebook environment for interactive computing
JupyterLab			The next-generation user interface for Project Jupyter
Jupytext	6,673	over 1 year ago	Jupyter Notebooks as Markdown Documents, Julia, Python or R scripts
Polynote			The polyglot notebook with first-class Scala support
Awesome DataOps / Data Ingestion
Amazon Kinesis			Easily collect, process, and analyze video and data streams in real time
Apache Gobblin	2,232	over 1 year ago	A framework that simplifies common aspects of big data such as data ingestion
Apache Kafka	29,060	over 1 year ago	Open-source distributed event streaming platform used by thousands of companies
Apache Pulsar	14,315	over 1 year ago	Distributed pub-sub messaging platform with a flexible messaging model and intuitive API
Embulk	1,758	over 1 year ago	A parallel bulk data loader that helps data transfer between various storages
Fluentd	12,963	over 1 year ago	Collects events from various data sources and writes them to files
Google PubSub			Ingest events for streaming into BigQuery, data lakes or operational databases
Nakadi	958	over 2 years ago	A distributed event bus that implements a RESTful API abstraction on top of Kafka-like queues
Pravega	1,983	almost 2 years ago	An open source distributed storage service implementing Streams
RabbitMQ			One of the most popular open source message brokers
Awesome DataOps / Data Workflow
Apache Airflow	37,580	over 1 year ago	A platform to programmatically author, schedule, and monitor workflows
Apache Oozie	717	about 2 years ago	An extensible, scalable and reliable system to manage complex Hadoop workloads
Azkaban	4,481	about 2 years ago	Batch workflow job scheduler created at LinkedIn to run Hadoop jobs
Dagster	12,055	over 1 year ago	An orchestration platform for the development, production, and observation of data assets
Luigi	17,950	over 1 year ago	Python module that helps you build complex pipelines of batch jobs
Prefect			A workflow management system, designed for modern infrastructure
Awesome DataOps / Data Processing
Apache Beam	7,911	over 1 year ago	A unified model for defining both batch and streaming data-parallel processing pipelines
Apache Flink	24,261	over 1 year ago	An open source stream processing framework with powerful capabilities
Apache Hadoop MapReduce			A framework for writing applications which process vast amounts of data
Apache Nifi	4,955	over 1 year ago	An easy to use, powerful, and reliable system to process and distribute data
Apache Samza	817	over 1 year ago	A distributed stream processing framework which uses Apache Kafka and Hadoop YARN
Apache Spark	40,170	over 1 year ago	A unified analytics engine for large-scale data processing
Apache Storm	6,603	over 1 year ago	An open source distributed realtime computation system
Apache Tez	482	over 1 year ago	A generic data-processing pipeline engine envisioned as a low-level engine
Faust	6,751	almost 2 years ago	A stream processing library, porting the ideas from Kafka Streams to Python
Awesome DataOps / Data Quality
Cerberus	3,179	almost 2 years ago	Lightweight, extensible data validation library for Python
Cleanlab	9,820	over 1 year ago	Data-centric AI tool to detect (non-predefined) issues in ML data like label errors or outliers
DataProfiler	1,442	over 1 year ago	A Python library designed to make data analysis, monitoring, and sensitive data detection easy
Deequ	3,324	almost 2 years ago	A library built on top of Apache Spark for measuring data quality in large datasets
Great Expectations			A Python data validation framework that allows you to test your data against datasets
JSON Schema			A vocabulary that allows you to annotate and validate JSON documents
SodaSQL	61	over 3 years ago	Data profiling, testing, and monitoring for SQL accessible data
Awesome DataOps / Data Serialization
Apache Avro	2,973	over 1 year ago	A data serialization system which is compact, fast and provides rich data structures
Apache ORC	698	over 1 year ago	A self-describing type-aware columnar file format designed for Hadoop workloads
Apache Parquet	2,665	over 1 year ago	A columnar storage format which provides efficient storage and encoding of data
Kryo	6,217	over 1 year ago	A fast and efficient binary object graph serialization framework for Java
ProtoBuf	65,999	over 1 year ago	Language-neutral, platform-neutral, extensible mechanism for serializing structured data
Awesome DataOps / Data Serialization / Data Compression
Pigz	2,669	almost 2 years ago	A parallel implementation of gzip for modern multi-processor, multi-core machines
Snappy	6,217	almost 2 years ago	Open source compression library that is fast, stable and robust
Awesome DataOps / Data Serialization / Data Table Format
Apache Hudi	5,498	over 1 year ago	Manages the storage of large analytical datasets on DFS
Apache Iceberg	6,621	over 1 year ago	Open table format for huge analytic datasets
Delta Lake	7,677	over 1 year ago	An open source project that enables building a Lakehouse architecture on top of data lakes
Awesome DataOps / Data Visualization
Apache Superset	63,320	over 1 year ago	A modern data exploration and data visualization platform
Count			SQL/drag-and-drop querying and visualisation tool based on notebooks
Dash	21,641	over 1 year ago	Analytical Web Apps for Python, R, Julia, and Jupyter
Data Studio			Reporting solution for power users who want to go beyond the data and dashboards of Google Analytics
HUE	1,188	over 1 year ago	A mature SQL Assistant for querying Databases & Data Warehouses
Lux	5,226	over 2 years ago	Fast and easy data exploration by automating the visualization and data analysis process
Metabase			The simplest, fastest way to get business intelligence and analytics to everyone
Redash			Connect to any data source, easily visualize, dashboard and share your data
Tableau			A powerful, fast-growing data visualization tool used in the business intelligence industry
Awesome DataOps / Data Warehouse
Amazon Redshift			Accelerate your time to insights with fast, easy, and secure cloud data warehousing
Apache Hive	5,577	over 1 year ago	Facilitates reading, writing, and managing large datasets residing in distributed storage
Apache Kylin	3,661	over 1 year ago	An open source, distributed analytical data warehouse for big data
Google BigQuery			Serverless, highly scalable, and cost-effective multicloud data warehouse
Awesome DataOps / Database / Columnar Database
Apache Cassandra	8,906	over 1 year ago	Open source column based DBMS designed to handle large amounts of data
Apache Druid	13,548	over 1 year ago	Designed to quickly ingest massive quantities of event data, and provide low-latency queries
Apache HBase	5,246	over 1 year ago	An open-source, distributed, versioned, column-oriented store
Scylla	13,725	over 1 year ago	Designed to be compatible with Cassandra while achieving higher throughputs and lower latencies
Awesome DataOps / Database / Document-Oriented Database
Apache CouchDB	6,298	over 1 year ago	An open-source document-oriented NoSQL database, implemented in Erlang
Elasticsearch	71,007	over 1 year ago	A distributed document oriented database with a RESTful search engine
MongoDB	26,503	over 1 year ago	A cross-platform document database that uses JSON-like documents with optional schemas
RethinkDB	26,806	over 1 year ago	The first open-source scalable database built for realtime applications
Awesome DataOps / Database / Graph Database
Age	3,191	almost 2 years ago	A multi-model database that supports both graph and relational data models
ArangoDB	13,613	over 1 year ago	A scalable open-source multi-model database natively supporting graph, document and search
JanusGraph	5,351	over 1 year ago	Manages large graphs with billions of vertices and edges distributed across a multi-machine cluster
Memgraph	2,520	over 1 year ago	An open source graph database, built for real-time streaming data, compatible with Neo4j
Neo4j	13,537	over 1 year ago	A high performance graph store with all the features expected of a mature and robust database
Titan	5,243	over 3 years ago	A highly scalable graph database optimized for storing and querying large graphs
Awesome DataOps / Database / Key-Value Database
Apache Accumulo	1,075	over 1 year ago	A sorted, distributed key-value store that provides robust and scalable data storage
Dragonfly	26,326	over 1 year ago	A modern in-memory datastore, fully compatible with Redis and Memcached APIs
DynamoDB			Fast, flexible NoSQL database service for single-digit millisecond performance at any scale
etcd	48,056	over 1 year ago	Distributed reliable key-value store for the most critical data of a distributed system
EVCache	2,071	over 1 year ago	A distributed in-memory data store for the cloud
Memcached	13,601	over 1 year ago	A high performance multithreaded event-based key/value cache store
Redis	67,358	over 1 year ago	An in-memory key-value database that persists on disk
Awesome DataOps / Database / Relational Database
CockroachDB	30,270	over 1 year ago	A distributed database designed to build, scale, and manage data-intensive apps
Crate	4,139	over 1 year ago	A distributed SQL database that makes it simple to store and analyze massive amounts of data
MariaDB	5,752	over 1 year ago	A replacement of MySQL with more features, new storage engines and better performance
MySQL	10,964	almost 2 years ago	One of the most popular open source transactional databases
PostgreSQL	16,442	over 1 year ago	An advanced RDBMS that supports an extended subset of the SQL standard
RQLite	15,906	over 1 year ago	A lightweight, distributed relational database, which uses SQLite as its storage engine
SQLite	6,902	over 1 year ago	A popular choice as embedded database software for local/client storage
Awesome DataOps / Database / Time Series Database
Akumuli	835	almost 4 years ago	Can be used to capture, store and process time-series data in real-time
Atlas	3,459	over 1 year ago	An in-memory dimensional time series database
InfluxDB	29,126	over 1 year ago	Scalable datastore for metrics, events, and real-time analytics
QuestDB	14,699	over 1 year ago	An open source SQL database designed to process time series data faster
TimescaleDB	18,066	over 1 year ago	Open-source time-series SQL database optimized for fast ingest and complex queries
Awesome DataOps / Database / Vector Database
Milvus	31,283	over 1 year ago	An open source embedding vector similarity search engine powered by Faiss, NMSLIB and Annoy
Pinecone			Managed and distributed vector similarity search used with a lightweight SDK
Qdrant	21,001	over 1 year ago	An open source vector similarity search engine with extended filtering support
Awesome DataOps / File System
Alluxio	6,880	over 1 year ago	A virtual distributed storage system
Amazon Simple Storage Service (S3)			Object storage built to retrieve any amount of data from anywhere
Apache Hadoop Distributed File System (HDFS)			A distributed file system
GlusterFS	4,774	over 1 year ago	A software defined distributed storage that can scale to several petabytes
Google Cloud Storage (GCS)			Object storage for companies of all sizes, to store any amount of data
LakeFS	4,496	over 1 year ago	Open source tool that transforms your object storage into a Git-like repository
LizardFS	958	almost 2 years ago	A highly reliable, scalable and efficient distributed file system
MinIO	48,833	over 1 year ago	High Performance, Kubernetes Native Object Storage compatible with Amazon S3 API
SeaweedFS	23,207	over 1 year ago	A fast distributed storage system for blobs, objects, files, and data lake
Swift	2,639	over 1 year ago	A distributed object storage system designed to scale from a single machine to thousands of servers
Awesome DataOps / Logging and Monitoring
Grafana	65,525	over 1 year ago	Visualize metrics, logs, and traces from multiple sources like Prometheus, Loki, InfluxDB and more
Loki	24,172	over 1 year ago	A horizontally-scalable, highly-available, multi-tenant log aggregation system inspired by Prometheus
Prometheus	56,244	over 1 year ago	A monitoring system and time series database
Whylogs	2,664	over 1 year ago	A tool for creating data logs, enabling monitoring for data drift and data quality issues
Awesome DataOps / Metadata Service
Hive Metastore			Service that stores metadata related to Apache Hive and other services
Metacat	1,616	over 1 year ago	Provides you information about what data you have, where it resides and how to process it
Awesome DataOps / SQL Query Engine
Apache Drill	1,949	over 1 year ago	Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage
Apache Impala	1,164	over 1 year ago	Lightning-fast, distributed SQL queries for petabytes of data
Dremio			Power high-performing BI dashboards and interactive analytics directly on the data lake
Presto	16,114	over 1 year ago	A distributed SQL query engine for big data
Trino	10,601	over 1 year ago	A fast distributed SQL query engine for big data analytics
Resources / Books
Data Mesh: Delivering Data-Driven Value at Scale			(O'Reilly)
Designing Data-Intensive Applications			(O'Reilly)
Fundamentals of Data Engineering			(O'Reilly)
Getting Started with Impala			(O'Reilly)
Learning and Operating Presto			(O'Reilly)
Learning Spark: Lightning-Fast Data Analytics			(O'Reilly)
Spark in Action			(O'Reilly)
Spark: The Definitive Guide			(O'Reilly)
Resources / Other Lists
Awesome Data Engineering	6,889	over 1 year ago
Awesome MLOps	4,181	over 1 year ago
DataOps Resource	24	almost 6 years ago
Resources / Slack
Delta Lake Workspace
Trino Workspace

Backlinks from these awesome lists:

kelvins/awesome-mlops
