awesome-hadoop

Hadoop toolkit

A curated list of resources and tools for developing and managing Hadoop-based applications and workflows

A curated list of amazingly awesome Hadoop and Hadoop ecosystem resources



Awesome Hadoop / Hadoop

Apache Hadoop Apache Hadoop
Apache Hadoop Ozone An Object Store for Apache Hadoop
Apache Tez A Framework for YARN-based, Data Processing Applications In Hadoop
SpatialHadoop SpatialHadoop is a MapReduce extension to Apache Hadoop designed specially to work with spatial data
GIS Tools for Hadoop Big Data Spatial Analytics for the Hadoop Framework
Elasticsearch Hadoop 9 8 days ago Elasticsearch real-time search and analytics natively integrated with Hadoop. Supports Map/Reduce, Cascading, Apache Hive and Apache Pig
hadoopy 243 almost 9 years ago Python MapReduce library written in Cython
mrjob 2,615 over 1 year ago mrjob is a Python 2.5+ package that helps you write and run Hadoop Streaming jobs
pydoop Pydoop is a package that provides a Python API for Hadoop
hdfs-du 229 about 4 years ago HDFS-DU is an interactive visualization of the Hadoop distributed file system
White Elephant 192 about 11 years ago Hadoop log aggregator and dashboard
Genie 1,716 about 2 months ago Genie provides REST-ful APIs to run Hadoop, Hive and Pig jobs, and to manage multiple Hadoop resources and perform job submissions across them
Apache Kylin Apache Kylin is an open source Distributed Analytics Engine from eBay Inc. that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets
Crunch 214 about 10 years ago Go-based toolkit for ETL and feature extraction on Hadoop
Apache Ignite Distributed in-memory platform

Awesome Hadoop / YARN

Apache Slider Apache Slider is a project in incubation at the Apache Software Foundation with the goal of making it possible and easy to deploy existing applications onto a YARN cluster
Apache Twill Apache Twill is an abstraction over Apache Hadoop® YARN that reduces the complexity of developing distributed applications, allowing developers to focus more on their application logic
mpich2-yarn 114 about 7 years ago Running MPICH2 on YARN

Awesome Hadoop / NoSQL

Apache HBase Apache HBase
Apache Phoenix A SQL skin over HBase supporting secondary indices
happybase 612 4 months ago A developer-friendly Python library to interact with Apache HBase
Hannibal 172 almost 7 years ago Hannibal is a tool to help monitor and maintain HBase clusters that are configured for manual splitting
Haeinsa 158 over 7 years ago Haeinsa is a linearly scalable multi-row, multi-table transaction library for HBase
hindex 591 over 7 years ago Secondary Index for HBase
Apache Accumulo The Apache Accumulo™ sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system
OpenTSDB The Scalable Time Series Database
Apache Cassandra

Awesome Hadoop / SQL on Hadoop

Apache Hive The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL
Apache Phoenix A SQL skin over HBase supporting secondary indices
Apache HAWQ (incubating) Apache HAWQ is a Hadoop native SQL query engine that combines the key technological advantages of MPP database with the scalability and convenience of Hadoop
Lingual SQL interface for Cascading (MR/Tez job generator)
Apache Impala Apache Impala is an open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop. Impala has been described as the open-source equivalent of Google F1, which inspired its development in 2012
Presto Distributed SQL Query Engine for Big Data. Open sourced by Facebook
Apache Tajo Data warehouse system for Apache Hadoop
Apache Drill Schema-free SQL Query Engine
Apache Trafodion

Awesome Hadoop / Data Management

Apache Calcite A Dynamic Data Management Framework
Apache Atlas Metadata tagging & lineage capture supporting complex business data taxonomies
Apache Kudu Kudu provides a combination of fast inserts/updates and efficient columnar scans to enable multiple real-time analytic workloads across a single storage layer, complementing HDFS and Apache HBase
Confluent Schema registry for Kafka 2,225 6 days ago Schema Registry provides a serving layer for your metadata. It provides a RESTful interface for storing and retrieving Avro schemas
Hortonworks Schema Registry 15 5 months ago Schema Registry is a framework to build metadata repositories

Awesome Hadoop / Workflow, Lifecycle and Governance

Apache Oozie Apache Oozie
Azkaban
Apache Falcon Data management and processing platform
Apache NiFi A dataflow system
Apache Airflow 37,120 6 days ago Airflow is a workflow automation and scheduling system that can be used to author and manage data pipelines
Luigi Python package that helps you build complex pipelines of batch jobs

Awesome Hadoop / Data Ingestion and Integration

Apache Flume Apache Flume
Suro 794 over 1 year ago Netflix's distributed Data Pipeline
Apache Sqoop Apache Sqoop
Apache Kafka Apache Kafka
Gobblin from LinkedIn 2,229 7 days ago Universal data ingestion framework for Hadoop

Awesome Hadoop / DSL

Apache Pig Apache Pig
Apache DataFu A collection of libraries for working with large-scale data in Hadoop
varaha 53 almost 11 years ago Machine learning and natural language processing with Apache Pig
packetpig 299 over 6 years ago Open Source Big Data Security Analytics
akela 76 over 10 years ago Mozilla's utility library for Hadoop, HBase, Pig, etc
seqpig Simple and scalable scripting for large sequencing data sets (e.g., bioinformatics) in Hadoop
Lipstick 464 over 1 year ago Pig workflow visualization tool
PigPen 567 over 1 year ago PigPen is map-reduce for Clojure, or distributed Clojure. It compiles to Apache Pig, but you don't need to know much about Pig to use it

Awesome Hadoop / Libraries and Tools

Kite Software Development Kit A set of libraries, tools, examples, and documentation
gohadoop Native Go clients for Apache Hadoop YARN
Hue A Web interface for analyzing data with Apache Hadoop
Apache Zeppelin A web-based notebook that enables interactive data analytics
Apache Thrift
Apache Avro Apache Avro is a data serialization system
Elephant Bird 1,138 over 1 year ago Twitter's collection of LZO and Protocol Buffer-related Hadoop, Pig, Hive, and HBase code
Spring for Apache Hadoop
hdfs - A native Go client for HDFS 1,370 6 months ago
Oozie Eclipse Plugin A graphical editor for editing Apache Oozie workflows inside Eclipse
snakebite A pure Python HDFS client
Apache Parquet Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language
Apache Superset (incubating) Apache Superset (incubating) is a modern, enterprise-ready business intelligence web application
Schema Registry UI 421 9 months ago Web tool for the Confluent Schema Registry in order to create / view / search / evolve / view history & configure Avro schemas of your Kafka cluster

Awesome Hadoop / Realtime Data Processing

Apache Storm
Apache Samza
Apache Spark
Apache Flink Apache Flink is a platform for efficient, distributed, general-purpose data processing. It supports exactly-once stream processing
Apache Pulsar (incubating) Apache Pulsar (incubating) is a highly scalable, low latency messaging platform running on commodity hardware. It provides simple pub-sub semantics over topics, guaranteed at-least-once delivery of messages, automatic cursor management for subscribers, and cross-datacenter replication
Apache Druid (incubating) A high-performance, column-oriented, distributed data store

Awesome Hadoop / Distributed Computing and Programming

Apache Spark
Spark Packages A community index of packages for Apache Spark
SparkHub A community site for Apache Spark
Apache Crunch
Cascading Cascading is the proven application development platform for building data applications on Hadoop
Apache Flink Apache Flink is a platform for efficient, distributed, general-purpose data processing
Apache Apex (incubating) Enterprise-grade unified stream and batch processing engine
Apache Livy (incubating) Apache Livy (incubating) is a web service that exposes a REST interface for managing long-running Apache Spark contexts in your cluster. With Livy, new applications can be built on top of Apache Spark that require fine-grained interaction with many Spark contexts

Awesome Hadoop / Packaging, Provisioning and Monitoring

Apache Bigtop Apache Bigtop: Packaging and tests of the Apache Hadoop ecosystem
Apache Ambari Apache Ambari
Ganglia Monitoring System
ankush 21 over 9 years ago A big data cluster management tool that creates and manages clusters of different technologies
Apache ZooKeeper Apache ZooKeeper
Apache Curator ZooKeeper client wrapper and rich ZooKeeper framework
inviso 204 over 1 year ago Inviso is a lightweight tool that provides the ability to search for Hadoop jobs, visualize the performance, and view cluster utilization
Logit.io Send logs from Hadoop to Elasticsearch for monitoring and alerting
Elasticsearch
Apache Solr Apache Solr is an open source search platform built upon a Java library called Lucene
Banana 668 4 months ago Kibana port for Apache Solr

Awesome Hadoop / Search Engine Framework

Apache Nutch Apache Nutch is a highly extensible and scalable open source web crawler software project

Awesome Hadoop / Security

Apache Ranger Ranger is a framework to enable, monitor and manage comprehensive data security across the Hadoop platform
Apache Sentry An authorization module for Hadoop
Apache Knox Gateway A REST API Gateway for interacting with Hadoop clusters

Awesome Hadoop / Benchmark

Big Data Benchmark
HiBench 1,458 9 months ago
YCSB 4,955 7 days ago The Yahoo! Cloud Serving Benchmark (YCSB) is an open-source specification and program suite for evaluating retrieval and maintenance capabilities of computer programs. It is often used to compare relative performance of NoSQL database management systems

Awesome Hadoop / Machine learning and Big Data analytics

Apache Mahout
Oryx 2 1,788 over 3 years ago Lambda architecture on Spark, Kafka for real-time large scale machine learning
MLlib MLlib is Apache Spark's scalable machine learning library
R R is a free software environment for statistical computing and graphics
RHadoop 763 almost 9 years ago including RHDFS, RHBase, RMR2, plyrmr
Apache Lens
Apache SINGA (incubating) SINGA is a general distributed deep learning platform for training big deep learning models over large datasets
BigDL BigDL is a distributed deep learning library for Apache Spark; with BigDL, users can write their deep learning applications as standard Spark programs, which can directly run on top of existing Spark or Hadoop clusters
Apache Hivemall (incubating) Apache Hivemall is a scalable machine learning library that runs on Apache Hive, Spark and Pig

Awesome Hadoop / Misc. / UDF

https://github.com/edwardcapriolo/hive_cassandra_udfs 11 over 12 years ago
https://github.com/livingsocial/HiveSwarm 101 over 8 years ago
https://github.com/ThinkBigAnalytics/Hive-Extensions-from-Think-Big-Analytics 16 over 10 years ago
https://github.com/twitter/elephant-bird 1,138 over 1 year ago Twitter
https://github.com/lovelysystems/ls-hive 5 almost 12 years ago
https://github.com/klout/brickhouse

Awesome Hadoop / Misc. / Storage Handler

https://github.com/dvasilen/Hive-Cassandra 15 over 8 years ago
https://github.com/yc-huang/Hive-mongo 32 over 1 year ago
https://github.com/balshor/gdata-storagehandler 14 over 13 years ago
https://github.com/chimpler/hive-solr 16 over 10 years ago
https://github.com/bfemiano/accumulo-hive-storage-manager 13 over 1 year ago

Awesome Hadoop / Misc. / Libraries and tools

https://github.com/forward3d/rbhive 98 over 3 years ago
https://github.com/synctree/activerecord-hive-adapter 5 over 13 years ago
https://github.com/hrp/sequel-hive-adapter 5 3 months ago
https://github.com/forward/node-hive 61 over 6 years ago
https://github.com/recruitcojp/WebHive 19 almost 11 years ago
shib 200 almost 8 years ago WebUI for query engines: Hive and Presto
https://github.com/dmorel/Thrift-API-HiveClient2 0 over 1 year ago (Perl - HiveServer2)
PyHive 1,671 4 months ago Python interface to Hive and Presto
https://github.com/recruitcojp/OdbcHive 8 over 13 years ago
HiveRunner 255 6 days ago An open source unit test framework for Hadoop Hive queries based on JUnit4
Beetest 72 almost 8 years ago A super simple utility for testing Apache Hive scripts locally for non-Java developers
Hive_test 64 over 2 years ago Unit test framework for Hive and hive-service

Awesome Hadoop / Misc. / Flume Plugins

Flume MongoDB Sink 71 over 1 year ago
Flume RabbitMQ source and sink 58 11 months ago
Flume UDP Source 8 over 10 years ago
.Net FlumeNG Clients 17 over 10 years ago

Resources / Websites

Hadoop Weekly
The Hadoop Ecosystem Table
Hadoop illuminated Open Source Hadoop Book
AWS BigData Blog
Hadoop360
How to monitor Hadoop metrics

Resources / Presentations

Apache Hadoop In Theory And Practice
Hadoop Operations at LinkedIn
Hadoop Performance at LinkedIn
Docker based Hadoop provisioning

Resources / Books

Hadoop: The Definitive Guide
Hadoop Operations
Apache Hadoop YARN
HBase: The Definitive Guide
Programming Pig
Programming Hive
Hadoop in Practice, Second Edition
Hadoop in Action, Second Edition

Resources / Hadoop and Big Data Events

ApacheCon
Strata + Hadoop World
DataWorks Summit
Spark Summit
