awesome-data-engineering

Data storage tools

A curated list of tools and databases for designing, implementing, and managing data storage solutions in software applications.

A curated list of data engineering tools for software developers

GitHub

10 stars

2 watching

4 forks

last commit: over 7 years ago

Linked from 1 awesome list

Databases / Relational
RQLite	12	about 3 years ago	Replicated SQLite using the Raft consensus protocol
MySQL			The world's most popular open source database
Databases / Relational / MySQL
TiDB	37,447	over 1 year ago	TiDB is a distributed NewSQL database compatible with MySQL protocol
Percona XtraBackup			Percona XtraBackup is a free, open source, complete online backup solution for all versions of Percona Server, MySQL® and MariaDB®
mysql_utils	883	about 7 years ago	Pinterest MySQL Management Tools
Databases / Relational
MariaDB			An enhanced, drop-in replacement for MySQL
PostgreSQL			The world's most advanced open source database
Amazon RDS			Amazon RDS makes it easy to set up, operate, and scale a relational database in the cloud
Crate.IO			Scalable SQL database with the NoSQL goodies
Databases / Key-Value
Redis			An open source, BSD licensed, advanced key-value cache and store
Riak			A distributed database designed to deliver maximum data availability by distributing data across multiple servers
AWS DynamoDB			A fast and flexible NoSQL database service for all applications that need consistent, single-digit millisecond latency at any scale
HyperDex	1,393	about 2 years ago	HyperDex is a scalable, searchable key-value store
SSDB			A high performance NoSQL database supporting many data structures, an alternative to Redis
Kyoto Tycoon	277	over 2 years ago	Kyoto Tycoon is a lightweight network server on top of the Kyoto Cabinet key-value database, built for high-performance and concurrency
IonDB	592	about 2 years ago	A key-value store for microcontroller and IoT applications
Databases / Column
Cassandra			The right choice when you need scalability and high availability without compromising performance
Databases / Column / Cassandra
Cassandra Calculator			This simple form allows you to try out different values for your Apache Cassandra cluster and see what the impact is for your application
CCM	1,219	almost 2 years ago	A script to easily create and destroy an Apache Cassandra cluster on localhost
ScyllaDB	13,725	over 1 year ago	NoSQL data store using the seastar framework, compatible with Apache Cassandra
Databases / Column
HBase			The Hadoop database, a distributed, scalable, big data store
Infobright			Column oriented, open-source analytic database provides both speed and efficiency
AWS Redshift			A fast, fully managed, petabyte-scale data warehouse that makes it simple and cost-effective to analyze all your data using your existing business intelligence tools
FiloDB	1,429	over 1 year ago	Distributed. Columnar. Versioned. Streaming. SQL (https://github.com/tuplejump/FiloDB)
HPE Vertica			Distributed, MPP columnar database with extensive analytics SQL
Databases / Document
MongoDB			An open-source, document database designed for ease of development and scaling
Databases / Document / MongoDB
Percona Server for MongoDB			Percona Server for MongoDB® is a free, enhanced, fully compatible, open source, drop-in replacement for the MongoDB® Community Edition that includes enterprise-grade features and functionality
MemDB	596	about 8 years ago	Distributed Transactional In-Memory Database (based on MongoDB)
Databases / Document
Elasticsearch			Search & Analyze Data in Real Time
Couchbase			The highest performing NoSQL distributed database
RethinkDB			The open-source database for the realtime web
Databases / Graph
Neo4j			The world’s leading graph database
OrientDB			2nd Generation Distributed Graph Database with the flexibility of Documents in one product with an Open Source commercial friendly license
ArangoDB			A distributed free and open-source database with a flexible data model for documents, graphs, and key-values
Titan			A scalable graph database optimized for storing and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster
FlockDB	3,337	over 9 years ago	A distributed, fault-tolerant graph database by Twitter
Databases / Distributed
Datomic			The fully transactional, cloud-ready, distributed database
Apache Geode			An open source, distributed, in-memory database for scale-out applications
Gaffer	1,774	over 1 year ago	A large-scale graph database
Databases / Timeseries
InfluxDB	29,126	over 1 year ago	Scalable datastore for metrics, events, and real-time analytics
OpenTSDB	5,009	over 1 year ago	A scalable, distributed Time Series Database
kairosdb	1,740	over 1 year ago	Fast scalable time series database
Heroic	848	over 5 years ago	A scalable time series database based on Cassandra and Elasticsearch, by Spotify
Druid	13,548	over 1 year ago	Column oriented distributed data store ideal for powering interactive applications
Riak-TS			Riak TS is the only enterprise-grade NoSQL time series database optimized specifically for IoT and Time Series data
Akumuli	835	almost 4 years ago	Akumuli is a numeric time-series database. It can be used to capture, store and process time-series data in real time. The word "akumuli" can be translated from Esperanto as "accumulate"
Rhombus			A time-series object store for Cassandra that handles all the complexity of building wide row indexes
Dalmatiner DB	694	over 7 years ago	Fast distributed metrics database
Blueflood	595	almost 2 years ago	A distributed system designed to ingest and process time series data
Timely	379	about 2 years ago	Timely is a time series database application that provides secure access to time series data based on Accumulo and Grafana
Databases / Other
Tarantool	3,437	over 1 year ago	Tarantool is an in-memory database and application server
GreenPlum			The Greenplum Database (GPDB) is an advanced, fully featured, open source data warehouse. It provides powerful and rapid analytics on petabyte scale data volumes
cayley	14,868	over 1 year ago	An open-source graph database, originally published under Google's GitHub organization
Snappydata	1,041	over 3 years ago	SnappyData: OLTP + OLAP Database built on Apache Spark
Data Ingestion
Kafka			Publish-subscribe messaging rethought as a distributed commit log
Data Ingestion / Kafka
Camus	878	almost 6 years ago	LinkedIn's Kafka to HDFS pipeline
BottledWater	4	about 3 years ago	Change data capture from PostgreSQL into Kafka
kafkat	504	about 7 years ago	Simplified command-line administration for Kafka brokers
kafkacat	5,468	about 2 years ago	Generic command line non-JVM Apache Kafka producer and consumer
pg-kafka	111	over 11 years ago	A PostgreSQL extension to produce messages to Apache Kafka
librdkafka	332	over 1 year ago	The Apache Kafka C/C++ library
kafka-docker	6,943	about 2 years ago	Kafka in Docker
kafka-manager	11,853	almost 3 years ago	A tool for managing Apache Kafka
kafka-node	2,664	almost 3 years ago	Node.js client for Apache Kafka 0.8
Secor	1,846	over 1 year ago	Pinterest's Kafka to S3 distributed consumer
Kafka-logger	45	almost 8 years ago	Kafka-winston logger for nodejs from uber
Kafka Awesome List	206	over 2 years ago	A super list of resources about Apache Kafka
Data Ingestion
AWS Kinesis			A fully managed, cloud-based service for real-time data processing over large, distributed data streams
RabbitMQ			Robust messaging for applications
FluentD			An open source data collector for unified logging layer
Embulk			An open source bulk data loader that helps data transfer between various databases, storages, file formats, and cloud services
Apache Sqoop			A tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases
Heka	3,389	over 2 years ago	Data Acquisition and Processing Made Easy
Gobblin	2,232	over 1 year ago	Universal data ingestion framework for Hadoop from Linkedin
Nakadi			Nakadi is an open source event messaging platform that provides a REST API on top of Kafka-like queues
Pravega			Pravega provides a new storage abstraction - a stream - for continuous and unbounded data
Apache Pulsar			Apache Pulsar is an open-source distributed pub-sub messaging system
File System
HDFS
File System / HDFS
Snakebite	854	about 4 years ago	A pure python HDFS client
File System
AWS S3
File System / AWS S3
smart_open	3,233	over 1 year ago	Utils for streaming large files (S3, HDFS, gzip, bz2)
File System
Tachyon			Tachyon is a memory-centric distributed storage system enabling reliable data sharing at memory-speed across cluster frameworks, such as Spark and MapReduce
CEPH			Ceph is a unified, distributed storage system designed for excellent performance, reliability and scalability
OrangeFS			Orange File System is a branch of the Parallel Virtual File System
SnackFS	14	about 11 years ago	SnackFS is our bite-sized, lightweight HDFS compatible FileSystem built over Cassandra
GlusterFS			Gluster Filesystem
XtreemFS			Fault-tolerant distributed file system for all storage needs
SeaweedFS	23,207	over 1 year ago	Seaweed-FS is a simple and highly scalable distributed file system with two objectives: to store billions of files and to serve them fast. Instead of supporting full POSIX file system semantics, Seaweed-FS implements only a key-to-file mapping. By analogy with "NoSQL", you can call it "NoFS"
S3QL			S3QL is a file system that stores all its data online using storage services like Google Storage, Amazon S3, or OpenStack
LizardFS			LizardFS Software Defined Storage is a distributed, parallel, scalable, fault-tolerant, Geo-Redundant and highly available file system
Serialization format
Apache Avro			Apache Avro™ is a data serialization system
Apache Parquet			Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language
Serialization format / Apache Parquet
Snappy	6,217	almost 2 years ago	A fast compressor/decompressor. Used with Parquet
PigZ			A parallel implementation of gzip for modern multi-processor, multi-core machines
Serialization format
Apache ORC			The smallest, fastest columnar storage for Hadoop workloads
Apache Thrift			The Apache Thrift software framework, for scalable cross-language services development
ProtoBuf	65,999	over 1 year ago	Protocol Buffers - Google's data interchange format
SequenceFile			SequenceFile is a flat file consisting of binary key/value pairs. It is extensively used in MapReduce as input/output formats
Kryo	6,217	over 1 year ago	Kryo is a fast and efficient object graph serialization framework for Java
Stream Processing
Apache Beam			Apache Beam is a unified programming model that implements both batch and streaming data processing jobs that run on many execution engines
Spark Streaming			Spark Streaming makes it easy to build scalable fault-tolerant streaming applications
Apache Flink			Apache Flink is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams
Apache Storm			Apache Storm is a free and open source distributed realtime computation system
Apache Samza			Apache Samza is a distributed stream processing framework
Apache NiFi			Apache NiFi is an easy to use, powerful, and reliable system to process and distribute data
VoltDB			VoltDB is an ACID-compliant RDBMS which uses a shared-nothing architecture
PipelineDB	2,639	over 4 years ago	The Streaming SQL Database
Spring Cloud Dataflow			Streaming and tasks execution between Spring Boot apps
Bonobo			Bonobo is a data-processing toolkit for python 3.5+
Robinhood's Faust	6,751	almost 2 years ago	Forever scalable event processing & in-memory durable K/V store as a library with asyncio & static typing
Batch Processing
Hadoop MapReduce			Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner
Spark
Batch Processing / Spark
Spark Packages			A community index of packages for Apache Spark
Deep Spark	197	about 10 years ago	Connecting Apache Spark with different data stores
Spark RDD API Examples			by Zhen He
Livy	1,188	over 1 year ago	Livy, the REST Spark Server
Batch Processing
AWS EMR			A web service that makes it easy to quickly and cost-effectively process vast amounts of data
Tez			An application framework which allows for a complex directed-acyclic-graph of tasks for processing data
Bistro	7	almost 8 years ago	Bistro is a light-weight engine for general-purpose data processing including both batch and stream analytics. It is based on a novel data model, which represents data via functions and processes data via column operations, as opposed to having only set operations as in conventional approaches like MapReduce or SQL
Batch Processing / Batch ML
H2O			Fast scalable machine learning API for smarter applications
Mahout			An environment for quickly creating scalable performant machine learning applications
Spark MLlib			Spark’s scalable machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as underlying optimization primitives
Batch Processing / Batch Graph
GraphLab Create			A machine learning platform that enables data scientists and app developers to easily create intelligent apps at scale
Giraph			An iterative graph processing system built for high scalability
Spark GraphX			Apache Spark's API for graphs and graph-parallel computation
Batch Processing / Batch SQL
Presto			A distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources
Hive			Data warehouse software facilitates querying and managing large datasets residing in distributed storage
Batch Processing / Batch SQL / Hive
Hivemall	505	over 9 years ago	Scalable machine learning library for Hive/Hadoop
PyHive	1,676	almost 2 years ago	Python interface to Hive and Presto
Batch Processing / Batch SQL
Drill			Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage
Charts and Dashboards
Highcharts			A charting library written in pure JavaScript, offering an easy way of adding interactive charts to your web site or web application
ZingChart			Fast JavaScript charts for any data set
C3.js			D3-based reusable chart library
D3.js			A JavaScript library for manipulating documents based on data
Charts and Dashboards / D3.js
D3Plus			D3's simpler, easier to use cousin. Mostly predefined templates that you can just plug data into
Charts and Dashboards
SmoothieCharts			A JavaScript Charting Library for Streaming Data
PyXley	2,272	over 8 years ago	Python helpers for building dashboards using Flask and React
Plotly	21,641	over 1 year ago	Flask, JS, and CSS boilerplate for interactive, web-based visualization apps in Python
Apache Superset	63,320	over 1 year ago	Apache Superset (incubating) is a modern, enterprise-ready business intelligence web application
Redash			Make Your Company Data Driven. Connect to any data source, easily visualize and share your data
Metabase	39,103	over 1 year ago	Metabase is the easy, open source way for everyone in your company to ask questions and learn from data
PyQtGraph			PyQtGraph is a pure-python graphics and GUI library built on PyQt4 / PySide and numpy. It is intended for use in mathematics / scientific / engineering applications
Workflow
Luigi	17,950	over 1 year ago	Luigi is a Python module that helps you build complex pipelines of batch jobs
Workflow / Luigi
CronQ			A cron-like system for applications. Used with Luigi
Workflow
Cascading			Java based application development platform
Airflow	37,580	over 1 year ago	Airflow is a system to programmatically author, schedule and monitor data pipelines
Azkaban			Azkaban is a batch workflow job scheduler created at LinkedIn to run Hadoop jobs. Azkaban resolves the ordering through job dependencies and provides an easy to use web user interface to maintain and track your workflows
Oozie			Oozie is a workflow scheduler system to manage Apache Hadoop jobs
Pinball	1,046	over 6 years ago	DAG-based workflow manager. Job flows are defined programmatically in Python. Supports output passing between jobs
ELK Elastic Logstash Kibana
docker-logstash	236	over 10 years ago	A highly configurable logstash (1.4.4) docker image running Elasticsearch (1.7.0) and Kibana (3.1.2)
elasticsearch-jdbc	2,838	over 4 years ago	JDBC importer for Elasticsearch
ZomboDB	4,687	over 1 year ago	Postgres Extension that allows creating an index backed by Elasticsearch
Docker
Gockerize	666	over 8 years ago	Package golang service into minimal docker containers
Flocker	3,390	about 9 years ago	Easily manage Docker containers & their data
Rancher			RancherOS is a 20 MB Linux distro that runs the entire OS as Docker containers
Kontena			Application Containers for Masses
Weave	6,621	almost 2 years ago	Weaving Docker containers into applications
Zodiac	198	over 6 years ago	A lightweight tool for easy deployment and rollback of dockerized applications
cAdvisor	17,304	over 1 year ago	Analyzes resource usage and performance characteristics of running containers
Micro S3 persistence	14	almost 7 years ago	Docker microservice for saving/restoring volume data to S3
Dockup	241	over 9 years ago	Docker image to backup/restore your Docker container volumes to AWS S3
Rocker-compose	406	over 3 years ago	Docker composition tool with idempotency features for deploying apps composed of multiple containers
Nomad	15,029	over 1 year ago	Nomad is a cluster manager, designed for both long lived services and short lived batch processing workloads
ImageLayers			Visualize Docker images and the layers that compose them
Datasets / Realtime
Twitter Realtime			The Streaming APIs give developers low latency access to Twitter’s global stream of Tweet data
Eventsim	508	over 4 years ago	Event data simulator. Generates a stream of pseudo-random events from a set of users, designed to simulate web traffic
Reddit			Real-time data is available including comments, submissions and links posted to reddit
Datasets / Data Dumps
GitHub Archive			GitHub's public timeline since 2011, updated every hour
Common Crawl			Open source repository of web crawl data
Wikipedia			Wikipedia's complete copy of all wikis, in the form of wikitext source and metadata embedded in XML. A number of raw database tables in SQL form are also available
Monitoring / Prometheus
Prometheus.io	56,244	over 1 year ago	An open-source service monitoring system and time series database
HAProxy Exporter	619	over 3 years ago	Simple server that scrapes HAProxy stats and exports them via HTTP for Prometheus consumption
Community / Forums
/r/dataengineering			News, tips and background on Data Engineering
/r/etl			Subreddit focused on ETL
Community / Conferences
DataEngConf			DataEngConf is the first technical conference that bridges the gap between data scientists, data engineers and data analysts
Community / Podcasts
Data Engineering Podcast			The show about modern data infrastructure

Backlinks from these awesome lists:

monksy/awesome-kafka
