awesome-hadoop

Hadoop toolkit

A curated list of resources and tools for developing and managing Hadoop-based applications and workflows

A curated list of amazingly awesome Hadoop and Hadoop ecosystem resources

GitHub

1k stars

102 watching

260 forks

last commit: about 2 years ago

Linked from 9 awesome lists

youngwookim.github.io/awesome-hadoop

Awesome Hadoop / Hadoop
Apache Hadoop			Apache Hadoop
Apache Hadoop Ozone			An Object Store for Apache Hadoop
Apache Tez			A Framework for YARN-based, Data Processing Applications In Hadoop
SpatialHadoop			SpatialHadoop is a MapReduce extension to Apache Hadoop designed specially to work with spatial data
GIS Tools for Hadoop			Big Data Spatial Analytics for the Hadoop Framework
Elasticsearch Hadoop	1,930	over 1 year ago	Elasticsearch real-time search and analytics natively integrated with Hadoop. Supports Map/Reduce, Cascading, Apache Hive and Apache Pig
hadoopy	243	over 10 years ago	Python MapReduce library written in Cython
mrjob	2,617	over 3 years ago	mrjob is a Python 2.5+ package that helps you write and run Hadoop Streaming jobs
pydoop			Pydoop is a package that provides a Python API for Hadoop
hdfs-du	229	almost 6 years ago	HDFS-DU is an interactive visualization of the Hadoop distributed file system
White Elephant	191	over 12 years ago	Hadoop log aggregator and dashboard
Genie	1,723	over 1 year ago	Genie provides RESTful APIs to run Hadoop, Hive and Pig jobs, and to manage multiple Hadoop resources and perform job submissions across them
Apache Kylin			Apache Kylin is an open source Distributed Analytics Engine from eBay Inc. that provides SQL interface and multi-dimensional analysis (OLAP) on Hadoop supporting extremely large datasets
Crunch	214	over 11 years ago	Go-based toolkit for ETL and feature extraction on Hadoop
Apache Ignite			Distributed in-memory platform
Awesome Hadoop / YARN
Apache Slider			Apache Slider is a project in incubation at the Apache Software Foundation with the goal of making it possible and easy to deploy existing applications onto a YARN cluster
Apache Twill			Apache Twill is an abstraction over Apache Hadoop® YARN that reduces the complexity of developing distributed applications, allowing developers to focus more on their application logic
mpich2-yarn	114	almost 9 years ago	Running MPICH2 on YARN
Awesome Hadoop / NoSQL
Apache HBase			Apache HBase
Apache Phoenix			A SQL skin over HBase supporting secondary indices
happybase	610	almost 2 years ago	A developer-friendly Python library to interact with Apache HBase
Hannibal	172	over 8 years ago	Hannibal is a tool to help monitor and maintain HBase clusters that are configured for manual splitting
Haeinsa	158	over 9 years ago	Haeinsa is a linearly scalable multi-row, multi-table transaction library for HBase
hindex	591	about 9 years ago	Secondary Index for HBase
Apache Accumulo			The Apache Accumulo™ sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system
OpenTSDB			The Scalable Time Series Database
Apache Cassandra
Awesome Hadoop / SQL on Hadoop
Apache Hive			The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL
Apache Phoenix			A SQL skin over HBase supporting secondary indices
Apache HAWQ (incubating)			Apache HAWQ is a Hadoop native SQL query engine that combines the key technological advantages of MPP database with the scalability and convenience of Hadoop
Lingual			SQL interface for Cascading (MR/Tez job generator)
Apache Impala			Apache Impala is an open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop. Impala has been described as the open-source equivalent of Google F1, which inspired its development in 2012
Presto			Distributed SQL Query Engine for Big Data. Open sourced by Facebook
Apache Tajo			Data warehouse system for Apache Hadoop
Apache Drill			Schema-free SQL Query Engine
Apache Trafodion
Awesome Hadoop / Data Management
Apache Calcite			A Dynamic Data Management Framework
Apache Atlas			Metadata tagging & lineage capture supporting complex business data taxonomies
Apache Kudu			Kudu provides a combination of fast inserts/updates and efficient columnar scans to enable multiple real-time analytic workloads across a single storage layer, complementing HDFS and Apache HBase
Confluent Schema registry for Kafka	2,233	over 1 year ago	Schema Registry provides a serving layer for your metadata. It provides a RESTful interface for storing and retrieving Avro schemas
Hortonworks Schema Registry	15	about 2 years ago	Schema Registry is a framework to build metadata repositories
Awesome Hadoop / Workflow, Lifecycle and Governance
Apache Oozie			Apache Oozie
Azkaban
Apache Falcon			Data management and processing platform
Apache NiFi			A dataflow system
Apache Airflow	37,580	over 1 year ago	Airflow is a workflow automation and scheduling system that can be used to author and manage data pipelines
Luigi			Python package that helps you build complex pipelines of batch jobs
Awesome Hadoop / Data Ingestion and Integration
Apache Flume			Apache Flume
Suro	794	over 3 years ago	Netflix's distributed Data Pipeline
Apache Sqoop			Apache Sqoop
Apache Kafka			Apache Kafka
Gobblin from LinkedIn	2,232	over 1 year ago	Universal data ingestion framework for Hadoop
Awesome Hadoop / DSL
Apache Pig			Apache Pig
Apache DataFu			A collection of libraries for working with large-scale data in Hadoop
vahara	53	over 12 years ago	Machine learning and natural language processing with Apache Pig
packetpig	299	about 8 years ago	Open Source Big Data Security Analytics
akela	76	over 12 years ago	Mozilla's utility library for Hadoop, HBase, Pig, etc
seqpig			Simple and scalable scripting for large sequencing data sets (e.g., bioinformatics) in Hadoop
Lipstick	465	over 3 years ago	Pig workflow visualization tool
PigPen	567	over 3 years ago	PigPen is map-reduce for Clojure, or distributed Clojure. It compiles to Apache Pig, but you don't need to know much about Pig to use it
Awesome Hadoop / Libraries and Tools
Kite Software Development Kit			A set of libraries, tools, examples, and documentation
gohadoop			Native Go clients for Apache Hadoop YARN
Hue			A Web interface for analyzing data with Apache Hadoop
Apache Zeppelin			A web-based notebook that enables interactive data analytics
Apache Thrift
Apache Avro			Apache Avro is a data serialization system
Elephant Bird	1,137	over 3 years ago	Twitter's collection of LZO and Protocol Buffer-related Hadoop, Pig, Hive, and HBase code
Spring for Apache Hadoop
hdfs	1,377	over 1 year ago	A native Go client for HDFS
Oozie Eclipse Plugin			A graphical editor for editing Apache Oozie workflows inside Eclipse
snakebite			A pure Python HDFS client
Apache Parquet			Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language
Apache Superset (incubating)			Apache Superset (incubating) is a modern, enterprise-ready business intelligence web application
Schema Registry UI	421	over 2 years ago	Web tool for the Confluent Schema Registry in order to create / view / search / evolve / view history & configure Avro schemas of your Kafka cluster
Awesome Hadoop / Realtime Data Processing
Apache Storm
Apache Samza
Apache Spark
Apache Flink			Apache Flink is a platform for efficient, distributed, general-purpose data processing. It supports exactly-once stream processing
Apache Pulsar (incubating)			Apache Pulsar (incubating) is a highly scalable, low latency messaging platform running on commodity hardware. It provides simple pub-sub semantics over topics, guaranteed at-least-once delivery of messages, automatic cursor management for subscribers, and cross-datacenter replication
Apache Druid (incubating)			A high-performance, column-oriented, distributed data store
Awesome Hadoop / Distributed Computing and Programming
Apache Spark
Spark Packages			A community index of packages for Apache Spark
SparkHub			A community site for Apache Spark
Apache Crunch
Cascading			Cascading is the proven application development platform for building data applications on Hadoop
Apache Flink			Apache Flink is a platform for efficient, distributed, general-purpose data processing
Apache Apex (incubating)			Enterprise-grade unified stream and batch processing engine
Apache Livy (incubating)			Apache Livy (incubating) is a web service that exposes a REST interface for managing long-running Apache Spark contexts in your cluster. With Livy, new applications can be built on top of Apache Spark that require fine-grained interaction with many Spark contexts
Awesome Hadoop / Packaging, Provisioning and Monitoring
Apache Bigtop			Apache Bigtop: Packaging and tests of the Apache Hadoop ecosystem
Apache Ambari			Apache Ambari
Ganglia Monitoring System
ankush	21	over 11 years ago	A big data cluster management tool that creates and manages clusters of different technologies
Apache ZooKeeper			Apache ZooKeeper
Apache Curator			ZooKeeper client wrapper and rich ZooKeeper framework
inviso	204	about 3 years ago	Inviso is a lightweight tool that provides the ability to search for Hadoop jobs, visualize the performance, and view cluster utilization
Logit.io			Send logs from Hadoop to Elasticsearch for monitoring and alerting
Awesome Hadoop / Search
Elasticsearch
Apache Solr			Apache Solr is an open source search platform built upon a Java library called Lucene
Banana	668	almost 2 years ago	Kibana port for Apache Solr
Awesome Hadoop / Search Engine Framework
Apache Nutch			Apache Nutch is a highly extensible and scalable open source web crawler software project
Awesome Hadoop / Security
Apache Ranger			Ranger is a framework to enable, monitor and manage comprehensive data security across the Hadoop platform
Apache Sentry			An authorization module for Hadoop
Apache Knox Gateway			A REST API Gateway for interacting with Hadoop clusters
Awesome Hadoop / Benchmark
Big Data Benchmark
HiBench	1,463	over 1 year ago
YCSB	4,968	over 1 year ago	The Yahoo! Cloud Serving Benchmark (YCSB) is an open-source specification and program suite for evaluating retrieval and maintenance capabilities of computer programs. It is often used to compare relative performance of NoSQL database management systems
Awesome Hadoop / Machine learning and Big Data analytics
Apache Mahout
Oryx 2	1,787	almost 5 years ago	Lambda architecture on Spark, Kafka for real-time large scale machine learning
MLlib			MLlib is Apache Spark's scalable machine learning library
R			R is a free software environment for statistical computing and graphics
RHadoop	763	over 10 years ago	including RHDFS, RHBase, RMR2, plyrmr
Apache Lens
Apache SINGA (incubating)			SINGA is a general distributed deep learning platform for training big deep learning models over large datasets
BigDL			BigDL is a distributed deep learning library for Apache Spark; with BigDL, users can write their deep learning applications as standard Spark programs, which can directly run on top of existing Spark or Hadoop clusters
Apache Hivemall (incubating)			Apache Hivemall is a scalable machine learning library that runs on Apache Hive, Spark and Pig
Awesome Hadoop / Misc. / UDF
https://github.com/edwardcapriolo/hive_cassandra_udfs	11	over 14 years ago
https://github.com/livingsocial/HiveSwarm	101	about 10 years ago
https://github.com/ThinkBigAnalytics/Hive-Extensions-from-Think-Big-Analytics	16	almost 12 years ago
https://github.com/twitter/elephant-bird	1,137	over 3 years ago	Twitter
https://github.com/lovelysystems/ls-hive	5	over 13 years ago
https://github.com/klout/brickhouse
Awesome Hadoop / Misc. / Storage Handler
https://github.com/dvasilen/Hive-Cassandra	15	over 10 years ago
https://github.com/yc-huang/Hive-mongo	32	over 3 years ago
https://github.com/balshor/gdata-storagehandler	14	almost 15 years ago
https://github.com/chimpler/hive-solr	16	over 12 years ago
https://github.com/bfemiano/accumulo-hive-storage-manager	13	over 3 years ago
Awesome Hadoop / Misc. / Libraries and tools
https://github.com/forward3d/rbhive	98	about 5 years ago
https://github.com/synctree/activerecord-hive-adapter	5	about 15 years ago
https://github.com/hrp/sequel-hive-adapter	5	almost 2 years ago
https://github.com/forward/node-hive	61	about 8 years ago
https://github.com/recruitcojp/WebHive	19	over 12 years ago
shib	199	over 9 years ago	WebUI for query engines: Hive and Presto
https://github.com/dmorel/Thrift-API-HiveClient2	0	almost 3 years ago	(Perl - HiveServer2)
PyHive	1,676	almost 2 years ago	Python interface to Hive and Presto
https://github.com/recruitcojp/OdbcHive	8	about 15 years ago
HiveRunner	257	over 1 year ago	An open source unit test framework for Hadoop Hive queries based on JUnit4
Beetest	72	over 9 years ago	A super simple utility for testing Apache Hive scripts locally for non-Java developers
Hive_test	64	about 4 years ago	Unit test framework for hive and hive-service
Awesome Hadoop / Misc. / Flume Plugins
Flume MongoDB Sink	71	over 3 years ago
Flume RabbitMQ source and sink	58	over 2 years ago
Flume UDP Source	8	over 12 years ago
.Net FlumeNG Clients	17	about 12 years ago
Resources / Websites
Hadoop Weekly
The Hadoop Ecosystem Table
Hadoop illuminated			Open Source Hadoop Book
AWS BigData Blog
Hadoop360
How to monitor Hadoop metrics
Resources / Presentations
Apache Hadoop In Theory And Practice
Hadoop Operations at LinkedIn
Hadoop Performance at LinkedIn
Docker based Hadoop provisioning
Resources / Books
Hadoop: The Definitive Guide
Hadoop Operations
Apache Hadoop Yarn
HBase: The Definitive Guide
Programming Pig
Programming Hive
Hadoop in Practice, Second Edition
Hadoop in Action, Second Edition
Resources / Hadoop and Big Data Events
ApacheCon
Strata + Hadoop World
DataWorks Summit
Spark Summit
