awesome-data-engineering

A curated list of data engineering tools for software developers

GitHub

9 stars
2 watching
4 forks
last commit: over 5 years ago
Linked from 1 awesome list


Databases / Relational

RQLite 12 over 1 year ago Replicated SQLite using the Raft consensus protocol
MySQL The world's most popular open source database

Databases / Relational / MySQL

TiDB 36,985 5 days ago TiDB is a distributed NewSQL database compatible with MySQL protocol
Percona XtraBackup Percona XtraBackup is a free, open source, complete online backup solution for all versions of Percona Server, MySQL® and MariaDB®
mysql_utils 883 over 5 years ago Pinterest MySQL Management Tools

Databases / Relational

MariaDB An enhanced, drop-in replacement for MySQL
PostgreSQL The world's most advanced open source database
Amazon RDS Amazon RDS makes it easy to set up, operate, and scale a relational database in the cloud
Crate.IO Scalable SQL database with the NoSQL goodies

Databases / Key-Value

Redis An open source, BSD licensed, advanced key-value cache and store
Riak A distributed database designed to deliver maximum data availability by distributing data across multiple servers
AWS DynamoDB A fast and flexible NoSQL database service for all applications that need consistent, single-digit millisecond latency at any scale
HyperDex 1,394 5 months ago HyperDex is a scalable, searchable key-value store
SSDB A high performance NoSQL database supporting many data structures, an alternative to Redis
Kyoto Tycoon 274 10 months ago Kyoto Tycoon is a lightweight network server on top of the Kyoto Cabinet key-value database, built for high performance and concurrency
IonDB 587 4 months ago A key-value store for microcontroller and IoT applications

Databases / Column

Cassandra The right choice when you need scalability and high availability without compromising performance

Databases / Column / Cassandra

Cassandra Calculator This simple form allows you to try out different values for your Apache Cassandra cluster and see what the impact is for your application
CCM 1,216 4 days ago A script to easily create and destroy an Apache Cassandra cluster on localhost
ScyllaDB 13,370 1 day ago NoSQL data store using the seastar framework, compatible with Apache Cassandra

Databases / Column

HBase The Hadoop database, a distributed, scalable, big data store
Infobright Column oriented, open-source analytic database provides both speed and efficiency
AWS Redshift A fast, fully managed, petabyte-scale data warehouse that makes it simple and cost-effective to analyze all your data using your existing business intelligence tools
FiloDB 1,427 8 days ago Distributed. Columnar. Versioned. Streaming. SQL (https://github.com/tuplejump/FiloDB)
HPE Vertica Distributed, MPP columnar database with extensive analytics SQL

Databases / Document

MongoDB An open-source, document database designed for ease of development and scaling

Databases / Document / MongoDB

Percona Server for MongoDB Percona Server for MongoDB® is a free, enhanced, fully compatible, open source, drop-in replacement for the MongoDB® Community Edition that includes enterprise-grade features and functionality
MemDB 596 over 6 years ago Distributed Transactional In-Memory Database (based on MongoDB)

Databases / Document

Elasticsearch Search & Analyze Data in Real Time
Couchbase The highest performing NoSQL distributed database
RethinkDB The open-source database for the realtime web

Databases / Graph

Neo4j The world’s leading graph database
OrientDB 2nd Generation Distributed Graph Database with the flexibility of Documents in one product with an Open Source commercial friendly license
ArangoDB A distributed free and open-source database with a flexible data model for documents, graphs, and key-values
Titan A scalable graph database optimized for storing and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster
FlockDB 3,337 over 7 years ago A distributed, fault-tolerant graph database by Twitter

Databases / Distributed

Datomic The fully transactional, cloud-ready, distributed database
Apache Geode An open source, distributed, in-memory database for scale-out applications
Gaffer 1,766 3 days ago A large-scale graph database

Databases / Timeseries

InfluxDB 28,713 3 days ago Scalable datastore for metrics, events, and real-time analytics
OpenTSDB 4,996 9 days ago A scalable, distributed Time Series Database
kairosdb 1,738 5 months ago Fast scalable time series database
Heroic 848 over 3 years ago A scalable time series database based on Cassandra and Elasticsearch, by Spotify
Druid 13,429 5 days ago Column oriented distributed data store ideal for powering interactive applications
Riak-TS Riak TS is the only enterprise-grade NoSQL time series database optimized specifically for IoT and Time Series data
Akumuli 836 about 2 years ago Akumuli is a numeric time-series database. It can be used to capture, store and process time-series data in real time. The word "akumuli" can be translated from Esperanto as "accumulate"
Rhombus A time-series object store for Cassandra that handles all the complexity of building wide row indexes
Dalmatiner DB 695 over 5 years ago Fast distributed metrics database
Blueflood 597 about 2 months ago A distributed system designed to ingest and process time series data
Timely 377 3 months ago Timely is a time series database application that provides secure access to time series data based on Accumulo and Grafana

Databases / Other

Tarantool 3,396 8 days ago Tarantool is an in-memory database and application server
Greenplum The Greenplum Database (GPDB) is an advanced, fully featured, open source data warehouse. It provides powerful and rapid analytics on petabyte-scale data volumes
cayley 14,842 3 months ago An open-source graph database (Google)
Snappydata 1,039 almost 2 years ago SnappyData: OLTP + OLAP Database built on Apache Spark

Data Ingestion

Kafka Publish-subscribe messaging rethought as a distributed commit log

Data Ingestion / Kafka

Camus 883 about 4 years ago LinkedIn's Kafka to HDFS pipeline
BottledWater 2 over 1 year ago Change data capture from PostgreSQL into Kafka
kafkat 503 over 5 years ago Simplified command-line administration for Kafka brokers
kafkacat 5,402 3 months ago Generic command line non-JVM Apache Kafka producer and consumer
pg-kafka 112 over 9 years ago A PostgreSQL extension to produce messages to Apache Kafka
librdkafka 217 3 days ago The Apache Kafka C/C++ library
kafka-docker 6,922 5 months ago Kafka in Docker
kafka-manager 11,808 about 1 year ago A tool for managing Apache Kafka
kafka-node 2,663 about 1 year ago Node.js client for Apache Kafka 0.8
Secor 1,845 8 days ago Pinterest's Kafka to S3 distributed consumer
Kafka-logger 45 almost 6 years ago Kafka-winston logger for Node.js from Uber
Kafka Awesome List 204 8 months ago A super list of resources about Apache Kafka

Data Ingestion

AWS Kinesis A fully managed, cloud-based service for real-time data processing over large, distributed data streams
RabbitMQ Robust messaging for applications
FluentD An open source data collector for unified logging layer
Embulk An open source bulk data loader that helps data transfer between various databases, storages, file formats, and cloud services
Apache Sqoop A tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases
Heka 3,390 9 months ago Data Acquisition and Processing Made Easy
Gobblin 2,216 16 days ago Universal data ingestion framework for Hadoop from LinkedIn
Nakadi Nakadi is an open source event messaging platform that provides a REST API on top of Kafka-like queues
Pravega Pravega provides a new storage abstraction - a stream - for continuous and unbounded data
Apache Pulsar Apache Pulsar is an open-source distributed pub-sub messaging system

File System

HDFS

File System / HDFS

Snakebite 857 over 2 years ago A pure Python HDFS client

File System

AWS S3

File System / AWS S3

smart_open 3,176 19 days ago Utils for streaming large files (S3, HDFS, gzip, bz2)

File System

Tachyon Tachyon is a memory-centric distributed storage system enabling reliable data sharing at memory-speed across cluster frameworks, such as Spark and MapReduce
CEPH Ceph is a unified, distributed storage system designed for excellent performance, reliability and scalability
OrangeFS Orange File System is a branch of the Parallel Virtual File System
SnackFS 14 about 9 years ago SnackFS is our bite-sized, lightweight HDFS compatible FileSystem built over Cassandra
GlusterFS Gluster Filesystem
XtreemFS Fault-tolerant distributed file system for all storage needs
SeaweedFS 22,426 5 days ago Seaweed-FS is a simple and highly scalable distributed file system. It has two objectives: to store billions of files and to serve the files fast. Instead of supporting full POSIX file system semantics, Seaweed-FS chooses to implement only a key~file mapping. By analogy with the word "NoSQL", you can call it "NoFS"
S3QL S3QL is a file system that stores all its data online using storage services like Google Storage, Amazon S3, or OpenStack
LizardFS LizardFS Software Defined Storage is a distributed, parallel, scalable, fault-tolerant, Geo-Redundant and highly available file system

Serialization format

Apache Avro Apache Avro™ is a data serialization system
Apache Parquet Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language

Serialization format / Apache Parquet

Snappy 6,110 about 2 months ago A fast compressor/decompressor. Used with Parquet
PigZ A parallel implementation of gzip for modern multi-processor, multi-core machines

Serialization format

Apache ORC The smallest, fastest columnar storage for Hadoop workloads
Apache Thrift The Apache Thrift software framework, for scalable cross-language services development
ProtoBuf 65,302 10 days ago Protocol Buffers - Google's data interchange format
SequenceFile SequenceFile is a flat file consisting of binary key/value pairs. It is extensively used in MapReduce as input/output formats
Kryo 6,181 5 days ago Kryo is a fast and efficient object graph serialization framework for Java

Stream Processing

Apache Beam Apache Beam is a unified programming model that implements both batch and streaming data processing jobs that run on many execution engines
Spark Streaming Spark Streaming makes it easy to build scalable fault-tolerant streaming applications
Apache Flink Apache Flink is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams
Apache Storm Apache Storm is a free and open source distributed realtime computation system
Apache Samza Apache Samza is a distributed stream processing framework
Apache NiFi Apache NiFi is an easy to use, powerful, and reliable system to process and distribute data
VoltDB VoltDB is an ACID-compliant RDBMS which uses a shared-nothing architecture
PipelineDB 2,632 over 2 years ago The Streaming SQL Database
Spring Cloud Dataflow Streaming and tasks execution between Spring Boot apps
Bonobo Bonobo is a data-processing toolkit for Python 3.5+
Robinhood's Faust 6,726 2 months ago Forever scalable event processing & in-memory durable K/V store as a library with asyncio & static typing

Batch Processing

Hadoop MapReduce Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner
Spark

Batch Processing / Spark

Spark Packages A community index of packages for Apache Spark
Deep Spark 197 over 8 years ago Connecting Apache Spark with different data stores
Spark RDD API Examples by Zhen He
Livy 1,163 4 days ago Livy, the REST Spark Server

Batch Processing

AWS EMR A web service that makes it easy to quickly and cost-effectively process vast amounts of data
Tez An application framework which allows for a complex directed-acyclic-graph of tasks for processing data
Bistro 7 about 6 years ago Bistro is a light-weight engine for general-purpose data processing including both batch and stream analytics. It is based on a novel data model, which represents data via functions and processes data via column operations, as opposed to having only set operations in conventional approaches like MapReduce or SQL

Batch Processing / Batch ML

H2O Fast scalable machine learning API for smarter applications
Mahout An environment for quickly creating scalable performant machine learning applications
Spark MLlib Spark’s scalable machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as underlying optimization primitives

Batch Processing / Batch Graph

GraphLab Create A machine learning platform that enables data scientists and app developers to easily create intelligent apps at scale
Giraph An iterative graph processing system built for high scalability
Spark GraphX Apache Spark's API for graphs and graph-parallel computation

Batch Processing / Batch SQL

Presto A distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources
Hive Data warehouse software facilitates querying and managing large datasets residing in distributed storage

Batch Processing / Batch SQL / Hive

Hivemall 503 almost 8 years ago Scalable machine learning library for Hive/Hadoop
PyHive 1,671 about 2 months ago Python interface to Hive and Presto

Batch Processing / Batch SQL

Drill Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage

Charts and Dashboards

Highcharts A charting library written in pure JavaScript, offering an easy way of adding interactive charts to your web site or web application
ZingChart Fast JavaScript charts for any data set
C3.js D3-based reusable chart library
D3.js A JavaScript library for manipulating documents based on data

Charts and Dashboards / D3.js

D3Plus D3's simpler, easier to use cousin. Mostly predefined templates that you can just plug data into

Charts and Dashboards

SmoothieCharts A JavaScript Charting Library for Streaming Data
PyXley 2,270 over 6 years ago Python helpers for building dashboards using Flask and React
Plotly 21,250 15 days ago Flask, JS, and CSS boilerplate for interactive, web-based visualization apps in Python
Apache Superset 62,043 3 days ago Apache Superset (incubating) is a modern, enterprise-ready business intelligence web application
Redash Make Your Company Data Driven. Connect to any data source, easily visualize and share your data
Metabase 38,274 5 days ago Metabase is the easy, open source way for everyone in your company to ask questions and learn from data
PyQtGraph PyQtGraph is a pure-python graphics and GUI library built on PyQt4 / PySide and numpy. It is intended for use in mathematics / scientific / engineering applications

Workflow

Luigi 17,746 11 days ago Luigi is a Python module that helps you build complex pipelines of batch jobs

Workflow / Luigi

CronQ An application cron-like system. Used with Luigi

Workflow

Cascading Java based application development platform
Airflow 36,519 4 days ago Airflow is a system to programmatically author, schedule and monitor data pipelines
Azkaban Azkaban is a batch workflow job scheduler created at LinkedIn to run Hadoop jobs. Azkaban resolves the ordering through job dependencies and provides an easy to use web user interface to maintain and track your workflows
Oozie Oozie is a workflow scheduler system to manage Apache Hadoop jobs
Pinball 1,048 almost 5 years ago DAG based workflow manager. Job flows are defined programmatically in Python. Supports output passing between jobs

ELK Elastic Logstash Kibana

docker-logstash 236 almost 9 years ago A highly configurable logstash (1.4.4) docker image running Elasticsearch (1.7.0) and Kibana (3.1.2)
elasticsearch-jdbc 2,839 almost 3 years ago JDBC importer for Elasticsearch
ZomboDB 4,681 2 months ago Postgres Extension that allows creating an index backed by Elasticsearch

Docker

Gockerize 667 over 6 years ago Package golang service into minimal docker containers
Flocker 3,386 over 7 years ago Easily manage Docker containers & their data
Rancher RancherOS is a 20 MB Linux distro that runs the entire OS as Docker containers
Kontena Application Containers for Masses
Weave 6,616 about 2 months ago Weaving Docker containers into applications
Zodiac 196 over 4 years ago A lightweight tool for easy deployment and rollback of dockerized applications
cAdvisor 17,011 about 2 months ago Analyzes resource usage and performance characteristics of running containers
Micro S3 persistence 13 about 5 years ago Docker microservice for saving/restoring volume data to S3
Dockup 241 over 7 years ago Docker image to backup/restore your Docker container volumes to AWS S3
Rocker-compose 406 over 1 year ago Docker composition tool with idempotency features for deploying apps composed of multiple containers
Nomad 14,812 11 days ago Nomad is a cluster manager, designed for both long lived services and short lived batch processing workloads
ImageLayers Visualize Docker images and the layers that compose them

Datasets / Realtime

Twitter Realtime The Streaming APIs give developers low latency access to Twitter’s global stream of Tweet data
Eventsim 495 over 2 years ago Event data simulator. Generates a stream of pseudo-random events from a set of users, designed to simulate web traffic
Reddit Real-time data is available including comments, submissions and links posted to reddit

Datasets / Data Dumps

GitHub Archive GitHub's public timeline since 2011, updated every hour
Common Crawl Open source repository of web crawl data
Wikipedia Wikipedia's complete copy of all wikis, in the form of wikitext source and metadata embedded in XML. A number of raw database tables in SQL form are also available

Monitoring / Prometheus

Prometheus.io 55,095 3 days ago An open-source service monitoring system and time series database
HAProxy Exporter 617 over 1 year ago Simple server that scrapes HAProxy stats and exports them via HTTP for Prometheus consumption

Community / Forums

/r/dataengineering News, tips and background on Data Engineering
/r/etl Subreddit focused on ETL

Community / Conferences

DataEngConf DataEngConf is the first technical conference that bridges the gap between data scientists, data engineers and data analysts

Community / Podcasts

Data Engineering Podcast The show about modern data infrastructure
