awesome-distributed-systems

Distributed system knowledge base

A curated list of resources on distributed systems

architecture, consensus, distributed-systems, lamport, paper, paxos

awesome-distributed-systems / Bootcamp

CAP Theorem, with an accompanying explanation
Fallacies of Distributed Computing, expect things to break
Distributed systems theory for the distributed engineer, most of the papers and books in this blog post reappear later in this list, but it is still a good breadth-first (BFS) approach to distributed systems
FLP Impossibility Result (paper), with an easier-to-follow explanation
An Introduction to Distributed Systems, @aphyr's excellent introduction to distributed systems

awesome-distributed-systems / Books

Distributed Systems for fun and profit [Free]
Distributed Systems Principles and Paradigms, Andrew Tanenbaum [Free with registration]
Scalable Web Architecture and Distributed Systems [Free]
Principles of Distributed Systems [Free] [ETH Zurich University]
Making reliable distributed systems in the presence of software errors [Free], the PhD thesis of Joe Armstrong (author of Erlang)
Designing Data-Intensive Applications [Amazon Link]
Distributed Machine Learning Patterns, Yuan Tang, practical patterns for scaling machine learning from your laptop to a distributed cluster
Distributed Computing, Hagit Attiya and Jennifer Welch
Distributed Algorithms, Nancy Lynch [Amazon Link]
Impossibility Results for Distributed Computing [paywall]
Designing Distributed Systems, Brendan Burns [Free with registration]
Distributed Systems: Concepts and Design, George Coulouris [Amazon Link]
Akka in Action, Second Edition
Systemantics: how systems work and especially how they fail
Think Distributed Systems [Free with subscription]

awesome-distributed-systems / Papers

Time, Clocks, and the Ordering of Events in a Distributed System, Lamport's paper, the quintessential distributed systems primer
Session Guarantees for Weakly Consistent Replicated Data, a '94 paper that proposes session guarantees for eventually consistent systems; many of its terms, such as monotonic reads and read your writes, are standard vocabulary in other distributed systems papers
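
To make the logical clocks from Lamport's paper above concrete, here is a minimal sketch in Python. The class name `LamportClock` and its method names are assumptions made for this illustration; they are not taken from the paper.

```python
class LamportClock:
    """Illustrative logical clock; class and method names are hypothetical."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # A local event advances the counter by one.
        self.time += 1
        return self.time

    def send(self):
        # Sending is itself an event; the returned timestamp travels with the message.
        return self.tick()

    def receive(self, msg_time):
        # On receipt, move past both the local time and the message timestamp.
        self.time = max(self.time, msg_time) + 1
        return self.time


a, b = LamportClock(), LamportClock()
a.tick()            # local event on a
stamp = a.send()    # a sends a message carrying its timestamp
b.receive(stamp)    # b's clock jumps ahead of the timestamp it received
```

The receive rule is what guarantees that a message's receipt is always timestamped later than its send, which is the ordering property the paper builds on.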

awesome-distributed-systems / Papers / Storage & Databases

Dynamo: Amazon's Highly Available Key Value Store. Paraphrasing @fogus: it is very rare for a paper describing an active production system to influence the state of active research in any industry. This is one of those seminal distributed systems papers; it solves the problem of a highly available, fault-tolerant database in an elegant way, and paved the way for systems like Cassandra and many other AP systems that use consistent hashing
Bigtable: A Distributed Storage System for Structured Data
The Google File System
Cassandra: A Decentralized Structured Storage System, heavily inspired by Dynamo, and now open source
CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data, the algorithm underlying the Ceph distributed storage system; the architecture itself is covered in a separate reading
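
The Dynamo entry in this section mentions consistent hashing. As a rough sketch of the idea (not Dynamo's or Cassandra's actual implementation), the hypothetical `HashRing` class below places virtual nodes on a ring and maps each key to the next node clockwise. The class name, the `vnodes` parameter, and the choice of SHA-256 are all assumptions made for this example.

```python
import bisect
import hashlib


class HashRing:
    """Illustrative consistent-hash ring with virtual nodes (hypothetical names)."""

    def __init__(self, nodes, vnodes=64):
        ring = []
        for node in nodes:
            for i in range(vnodes):
                # Each physical node owns several points on the ring.
                ring.append((self._hash(f"{node}#{i}"), node))
        ring.sort()
        self._points = [point for point, _ in ring]
        self._owners = [owner for _, owner in ring]

    @staticmethod
    def _hash(key):
        # Use the first 8 bytes of SHA-256 as a 64-bit ring position.
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def node_for(self, key):
        # First ring point clockwise from the key's position, wrapping at the end.
        idx = bisect.bisect_right(self._points, self._hash(key)) % len(self._points)
        return self._owners[idx]


ring = HashRing(["node-a", "node-b", "node-c"])
bigger = HashRing(["node-a", "node-b", "node-c", "node-d"])
keys = [f"key-{i}" for i in range(1000)]
moved = [k for k in keys if ring.node_for(k) != bigger.node_for(k)]
```

When `node-d` joins, the only keys that change owner are the ones `node-d` takes over; keys never shuffle between the existing nodes. That limited movement is the property that makes the technique attractive for partitioned stores.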

awesome-distributed-systems / Papers / Messaging systems

The Log: What every software engineer should know about real-time data's unifying abstraction, a somewhat long read, but a brilliant treatment of logs, which are at the heart of most distributed systems
Kafka: a Distributed Messaging System for Log Processing

awesome-distributed-systems / Papers / Distributed Consensus and Fault-Tolerance

Practical Byzantine Fault Tolerance
The Byzantine Generals Problem
Impossibility of Distributed Consensus with One Faulty Process
The Part-Time Parliament, Lamport's original Paxos paper; a bit difficult to understand and may require multiple passes
Paxos Made Simple, a terser, more readable Paxos paper by Lamport himself; shorter and easier than the original
The Chubby Lock Service for loosely coupled distributed systems, Google's lock service for loosely coupled distributed systems. Sort of Paxos as a Service for building other distributed systems. The primary inspiration behind service discovery and coordination tools such as ZooKeeper, etcd, and Consul
Paxos Made Live: An Engineering Perspective, Google's lessons from implementing systems atop Paxos. Demonstrates various practical issues encountered while implementing a theoretical concept
Raft Consensus Algorithm, an alternative to Paxos for distributed consensus that is much simpler to understand
Conflict-free Replicated Data Types, presents an approach to Strong Eventual Consistency that has been applied in several projects. Martin Kleppmann has also given a great talk on the subject
Azos.Sky.Server.Locking, speculative algorithms for global state synchronization that use a probability-based QoS (Quality of Service)/trust measure to reach probability-based consensus. The approach avoids distributed state machine/phase synchronization and is very simple to understand and implement
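
To illustrate the Strong Eventual Consistency idea from the Conflict-free Replicated Data Types entry in this section, here is a minimal sketch of a grow-only counter (G-Counter), one of the simplest state-based CRDTs. The class and method names are assumptions made for this example, not an API from any of the linked projects.

```python
class GCounter:
    """Illustrative grow-only counter CRDT (hypothetical names)."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica id -> increments observed from that replica

    def increment(self, amount=1):
        # A replica only ever bumps its own slot.
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Element-wise max is commutative, associative, and idempotent,
        # so replicas converge regardless of merge order or repetition.
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)


a, b = GCounter("a"), GCounter("b")
a.increment(2)
b.increment(3)
a.merge(b)
b.merge(a)
b.merge(a)  # merging the same state again changes nothing
```

After the merges, both replicas hold the same state and report the same value without any coordination, which is the convergence guarantee the paper formalizes.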

awesome-distributed-systems / Papers / Testing, monitoring and tracing

Dapper, Google's large-scale distributed systems tracing infrastructure; it was also the basis for the design of several open source tracing projects

awesome-distributed-systems / Papers / Programming Models

Distributed Programming Model
PSync: a partially synchronous language for fault-tolerant distributed algorithms (with accompanying video)
Programming Models for Distributed Computing
Logic and Lattices for Distributed Programming

awesome-distributed-systems / Papers / Verification of Distributed Systems

Testing distributed systems, a curated list of resources on testing distributed systems; includes links to materials on testing by various companies (Google, Amazon, Netflix, Microsoft, Dropbox, etc.) and research papers
Jepsen, a framework for distributed systems verification, with fault injection. @aphyr has featured enough times in this list already, but Jepsen and the blog posts that go with it are a quintessential addition to any distributed systems reading list
Verdi, a framework for implementing and formally verifying distributed systems

awesome-distributed-systems / Videos

Distributed Deep Dive interview series
Distributed Systems in One Lesson, by Tim Berglund

awesome-distributed-systems / Courses

Reliable Distributed Algorithms, Part 1, KTH Sweden
Reliable Distributed Algorithms, Part 2, KTH Sweden
Cloud Computing Concepts, University of Illinois
CMU: Distributed Systems in Go Programming Language
Software Defined Networking, Georgia Tech
ETH Zurich: Distributed Systems
ETH Zurich: Distributed Systems Part 2, covers distributed control algorithms, communication models, and fault tolerance, among other things. In particular, fault tolerance issues (models, consensus, agreement) and replication issues (2PC, 3PC, Paxos), which are critical to understanding distributed systems, are explained in great detail
Distributed Systems Course, a beginner course on distributed systems by Chris Colohan, a Google employee who contributed to SUIF, MapReduce, TCMalloc, Percolator, Caffeine, Borg, Omega, and Piper
MIT 6.824, MIT's distributed systems lectures; each video discusses papers such as GFS, ZooKeeper, Raft, and Spanner
Distributed Systems, lectures 9 to 16 of the Cambridge University course "Concurrent and Distributed Systems", given by Dr. Martin Kleppmann. An introductory computer science course that covers basic models and algorithms in distributed systems and also discusses CRDTs, collaboration software, and Google's Spanner
Amazon Builder's Library, a collection of Amazon's learnings on distributed systems
How we implemented consistent hashing efficiently
Notes on Distributed Systems for Young Bloods
High Scalability, architectures of several huge internet services
There is No Now, problems with simultaneity in distributed systems
Turing Lecture: The Computer Science of Concurrency: The Early Years, an article by Leslie Lamport on concurrency
The Paper Trail blog, a very readable blog covering various aspects of distributed systems
aphyr, whose series of posts are pretty awesome
All Things Distributed, Werner Vogels' (Amazon CTO) blog on distributed systems
Distributed Systems: Take Responsibility for Failover
The C10K problem
On Designing and Deploying Internet-Scale Services
Files are hard, a blog post on filesystem consistency; important reading if you are into distributed storage or databases
Distributed Systems Testing: The Lost World, a well-researched blog post on why testing distributed systems is hard, with many links to various approaches and other papers
SWIM Protocol explained, a blog post on the popular SWIM failure detector

awesome-distributed-systems / Research

ACM Symposium on Principles of Distributed Computing (PODC) and International Symposium on Distributed Computing (DISC), a list of resources from the PODC–DISC community, including conference series, mailing lists, YouTube, Twitter, etc.
IEEE International Parallel & Distributed Processing Symposium (IPDPS), an international forum for engineers and scientists to present their latest research findings
Springer Distributed Computing Journal, a journal about the theory, design, specification, and implementation of distributed systems

awesome-distributed-systems / Meta Lists

Readings in distributed systems
Distributed Systems meta list
List of required readings for Distributed Systems, part of CMU's Engineering Distributed Systems course
The Distributed Reader
A Distributed Systems Reading List, a collection of material, mostly papers, on distributed systems theory as well as seminal industry papers
Distributed Systems Readings, a comprehensive list of online courses related to distributed systems
Awesome Distributed Consensus, another list of materials on distributed consensus protocols
Beginner's Guide to Distributed Systems, a blog post with some useful getting-started links for distributed systems
