awesome-chaos-engineering
System stress test collection
A curated list of resources and examples on experimenting with distributed systems to improve resilience and reliability
A curated list of Chaos Engineering resources.
6k stars
307 watching
649 forks
last commit: 11 months ago
Linked from 2 awesome lists
awesome, awesome-list, chaos, chaos-community, chaos-engineering, chaos-monkey, chaos-testing, netflix-chaos-monkey, resilience, simian-army, site-reliability-engineering
Awesome Chaos Engineering / Culture | |||
Principles Of Chaos Engineering | |||
Chaos Community | |||
Chaos Engineering | |||
O'Reilly Velocity San Jose 2017: Precision Chaos | |||
The Discipline of Chaos Engineering | |||
Chaos Monkey for Fun and Profit | |||
Fault Injection in Production: Making the case for resilience testing | |||
Lord of Chaos - Becoming a Chaos Engineer | |||
Chaos testing - Preventing failure by instigation | |||
Orchestrated Chaos | |||
Video | Choose your own adventure: Chaos Engineering | ||
AMA Chaos Engineering + DiRT | |||
SRECON17: Principles of Chaos Engineering | |||
Chaos & Intuition Engineering at Netflix | |||
Mastering Chaos - A Netflix Guide to Microservices | |||
Too big to test: Breaking a production brokerage platform without causing financial devastation | |||
Inside Azure Search: Chaos Engineering | |||
Netflix, the Simian Army, and the culture of freedom and responsibility | |||
FIT: Failure Injection Testing | |||
The Netflix Simian Army | |||
Automated Failure Testing | |||
The Verification of a Distributed System by Caitie McCaffrey | |||
The Journey to Chaos Engineering begins with a single step - Bruce Wong and James Burns (Twilio) | |||
Chaos Engineering by Lorin Hochstein | |||
Aaron Rinehart - ChaoSlingr: Introducing Security based Chaos Testing | |||
Chaos Engineering - Casey Rosenthal | |||
Video | The Road to Chaos - Velocity 2017 | ||
How Netflix DDoS’d Itself To Help Protect the Entire Internet | |||
10 Years of Crashing Google | |||
Weathering the Unexpected | |||
SRECON17: Breaking Things on Purpose | |||
PuppetConf 2016: Chaos Patterns - Architecting for Failure in Distributed Systems | |||
Ship More, Sink Less - Changing Chaos Engineering and Distributed Tracing | |||
Cloudcast - Discipline of Chaos Engineering | |||
Software Engineering Daily - Failure Injection with Kolton Andrus podcast | |||
Responding to Failures in Playback Features with Haley Tucker podcast | |||
"Antics, drift, and chaos" by Lorin Hochstein | |||
re:Invent 2017: Nora Jones Describes Why We Need More Chaos - Chaos Engineering, That Is | |||
Failure Friday: Four Years On | |||
Monkeys & Lemurs and Locusts, Oh my! | |||
Practical Chaos Engineering | |||
Chaos Day in the Met Office Cloud | |||
Cloud Native and Chaos Engineering | |||
Chaos Engineering with Kolton Andrus | |||
Chaos Engineering: the history, principles, and practice | |||
Embracing the Chaos of Chaos Engineering | |||
Designing Services for Resilience: Netflix Lessons | |||
Chaos Engineering: A cheat sheet | |||
How to convince your boss and make them say “Yes!” to Chaos Engineering? | |||
Why the World Needs More Resilient Systems | |||
Chaos Architecture | |||
Gremlin’s Tammy Bütow on the Business Side of Chaos Engineering | |||
Kubernetes Chaos Engineering: Lessons Learned | |||
Chaos Engineering: managing complexity by breaking things | |||
Podcast: Database Chaos with Tammy Butow | |||
LinkedOut: A Request-Level Failure Injection Framework | |||
GOTO 2018 - Breaking Things on Purpose - Kolton Andrus | |||
Why should Chaos be part of your Distributed Systems Engineering? | |||
Brian Holt - Chaos Monkeys in Your Browser What Chaos Engineering Means For the Front End | |||
Chaos Engineering: Why the World Needs More Resilient Systems | |||
Video | QCon·Beijing 2017: The Practice of Failure Management and Fault Injection at Alibaba E-Commerce Platforms (Chinese speech) | ||
Orchestrating Chaos using Grab's Experimentation Platform | |||
Breaking to Learn: Chaos Engineering Explained | |||
Chaos Engineering Traps | |||
Chaos Engineering - The Art of Breaking Things Purposefully | |||
Disasterpiece Theater: Slack’s process for approachable Chaos Engineering | |||
Taming chaos: Preparing for your next incident | |||
The Future of Chaos Engineering w/ Conde Nast | |||
Chaos Engineering For People Systems w/ Dave Rensin of Google | |||
Performing chaos engineering in a serverless world (AWS re:Invent 2019 CMY301) | |||
Building Confidence in Healthcare Systems through Chaos Engineering | |||
Break Your App before Someone Else Does | |||
Preparing for Traffic Spikes with Chaos Engineering | |||
Automating Chaos Engineering GameDays with Terraform | |||
Postmortem Culture: Learning from failure | |||
Problem Detection by John Allspaw | |||
New Paradigms for the Next Era of Security | |||
Cloud-Native Chaos Engineering | |||
Building resilient services at Prime Video with chaos engineering | |||
Making Chaos Part of Kubernetes/OpenShift Performance and Scalability Tests | |||
Lucky Lotto, chaos engineering but for teams | |||
Using Fault Injection Testing to Improve DoorDash Reliability | |||
Chaos Engineering At Ant Group | |||
Awesome Chaos Engineering / Books | |||
Chaos Engineering: Building Confidence in System Behavior through Experiment | |||
Site Reliability Engineering: How Google Runs Production Systems | |||
The Practice Of Cloud System Administration: Designing and Operating Large Distributed Systems | |||
Antifragile Systems and Teams | |||
The InfoQ eMag: Chaos Engineering | |||
Learning Chaos Engineering | |||
Chaos Engineering: System Resilience in Practice | |||
Chaos Engineering: Crash test your applications | |||
Security Chaos Engineering: Gaining Confidence in Resilience and Safety at Speed and Scale | |||
Chaos Engineering Observability | |||
Awesome Chaos Engineering / Education | |||
Slides | A Chaos Engineering Bootcamp for O'Reilly Velocity 2017 | ||
Your First Chaos Experiment | |||
Chaos Engineering 101 | |||
A Primer on Automating Chaos | |||
Intro to Chaos Engineering | |||
Learn the basics of the Chaos Toolkit | |||
Build System Confidence with Chaos Engineering | |||
How we break things at Twitter: failure testing | |||
Run Chaos Experiments Without Risking Your Job | |||
A Guide to Your First Chaos Day | |||
Planning Your Own Chaos Day | |||
How To Install Distributed Tensorflow on GCP and Perform Chaos Engineering Experiments | |||
Monitoring Your Chaos Experiments | |||
Increasing the Resilience of APIs with Chaos Engineering | |||
3 key steps for running chaos engineering experiments | |||
Exploring Multi-level Weaknesses using Automated Chaos Experiments | |||
Chaos Monkey Guide for Engineers | |||
Chaos Engineering for Serverless | |||
Network Fire Drills with Chaos Engineering | |||
DevOps Foundations: Chaos Engineering | |||
Resilience Engineering: Short Course | |||
The Chaos Engineering Collection | |||
PenTester Academic | |||
Consul and Chaos Engineering | |||
Awesome Chaos Engineering / Notable Tools | |||
Chaos Monkey | 15,256 | about 2 months ago | A resiliency tool that helps applications tolerate random instance failures |
orchestrator | 5,637 | 4 months ago | MySQL replication topology management and HA |
kube-monkey | 2,981 | 5 months ago | An implementation of Netflix's Chaos Monkey for Kubernetes clusters |
Gremlin Inc. | Failure as a Service | ||
Chaos Toolkit | 1,891 | 4 months ago | A chaos engineering toolkit to help you build confidence in your software system |
steadybit | A Chaos Engineering platform (SaaS or On-Prem) with auto discovery features, different attack types, user management and many more | ||
PowerfulSeal | 1,946 | about 1 year ago | Adds chaos to your Kubernetes clusters, so that you can detect problems in your systems as early as possible. It kills targeted pods and takes VMs up and down |
drax | 42 | over 5 years ago | DC/OS Resilience Automated Xenodiagnosis tool. It helps to test DC/OS deployments by applying a Chaos Monkey-inspired, proactive and invasive testing approach |
Wiremock | API mocking (Service Virtualization) which enables modeling real world faults and delays | ||
MockLab | API mocking (Service Virtualization) as a service which enables modeling real world faults and delays | ||
Pod-Reaper | 201 | 4 months ago | A rules based pod killing container. Pod-Reaper was designed to kill pods that meet specific conditions that can be used for Chaos testing in Kubernetes |
Muxy | 823 | almost 4 years ago | A chaos testing tool for simulating real-world distributed system failures |
Toxiproxy | 10,841 | 12 days ago | A TCP proxy to simulate network and system conditions for chaos and resiliency testing |
Awesome Chaos Engineering / Notable Tools / Chaos engineering for Docker: | |||
Pumba | 2,791 | 3 months ago | Chaos testing and network emulation for Docker containers (and clusters) |
Blockade | 907 | over 3 years ago | Docker-based utility for testing network failures and partitions in distributed applications |
Awesome Chaos Engineering / Notable Tools | |||
chaos-lambda | 163 | 5 months ago | Randomly terminate ASG instances during business hours |
Namazu | 493 | about 6 years ago | Programmable fuzzy scheduler for testing distributed systems |
Chaos Monkey for Spring Boot | Injects latencies, exceptions, and terminations into Spring Boot applications | ||
Byte-Monkey | 225 | about 4 years ago | Bytecode-level fault injection for the JVM. It works by instrumenting application code on the fly to deliberately introduce faults like exceptions and latency |
GomJabbar | 30 | 2 months ago | ChaosMonkey for your private cloud |
Turbulence | 49 | over 5 years ago | Tool focused on BOSH environments capable of stressing VMs, manipulating network traffic, and more. It is very similar to Gremlin |
chaosblade | 5,982 | 15 days ago | An Easy to Use and Powerful Chaos Engineering Toolkit |
KubeInvaders | 1,022 | 27 days ago | Gamified Chaos engineering tool for Kubernetes Clusters |
Cthulhu | 93 | about 5 years ago | Chaos Engineering tool that helps evaluate the resiliency of microservice systems by simulating various disaster scenarios against a target infrastructure in a data-driven manner |
VMware Mangle | Orchestrating Chaos Engineering | ||
Byteman | A Swiss Army Knife for Byte Code Manipulation | ||
Litmus | 4,439 | 8 days ago | Framework for Kubernetes environments that enables users to run test suites, capture logs, generate reports and perform chaos tests |
Perses | 66 | over 3 years ago | A project to cause (controlled) destruction to a JVM application |
ChaosKube | 1,810 | 22 days ago | chaoskube periodically kills random pods in your Kubernetes cluster |
Chaos Mesh | 6,768 | 18 days ago | Chaos Mesh is a cloud-native Chaos Engineering platform that orchestrates chaos on Kubernetes environments |
failure-lambda | 94 | 3 months ago | A small Node module for injecting failure into AWS Lambda using latency, exception, statuscode or diskspace |
aws-chaos-scripts | 92 | about 1 year ago | Collection of python scripts to run failure injection on AWS infrastructure |
chaos-ssm-documents | 267 | over 1 year ago | Collection of AWS SSM Documents to perform Chaos Engineering experiments |
aws-lambda-chaos-injection | 100 | 16 days ago | A library injecting chaos into AWS Lambda. It offers simple python decorators to do delay, exception and statusCode injection and a Class to add delay to any 3rd party dependencies |
chaos-dingo | 11 | about 5 years ago | A tool to mess with Azure services using the Azure NodeJS SDK |
Chaos HTTP Proxy | 145 | 12 months ago | Introduce failures into HTTP requests via a proxy server |
Chaos Lemur | 62 | over 6 years ago | A self-hostable application to randomly destroy virtual machines in a BOSH-managed environment |
Simoorg | 191 | almost 7 years ago | LinkedIn’s very own failure inducer framework |
react-chaos | 593 | almost 2 years ago | A chaos engineering tool for your React apps |
vue-chaos | 2 | about 4 years ago | A chaos engineering tool for your Vue apps |
Chaos Engine | 68 | 9 months ago | A tool designed to intermittently destroy or degrade application resources running in cloud-based infrastructure |
kubedoom | 2,016 | 3 months ago | Kill Kubernetes pods by playing Id's DOOM |
kubethanos | 623 | over 4 years ago | Kills half of your randomly selected Kubernetes pods |
go-fault | 506 | about 1 month ago | Fault injection middleware in Go |
Proofdock's Chaos Engineering Platform | A chaos engineering platform that seamlessly integrates in Azure DevOps and has a focus on the Azure cloud platform | ||
Pystol | Pystol is a fault injection platform allowing users to execute fault injection Actions in cloud-native environments in a controlled and prescribed way | ||
AWSSSMChaosRunner | 249 | about 1 year ago | Amazon's light-weight open-source library for chaos engineering on AWS. It can be used for , and |
Kraken | 288 | 8 days ago | Chaos and resiliency testing tool for Kubernetes and OpenShift |
kube-burner | 502 | 10 days ago | A tool aimed at stressing Kubernetes clusters by creating or deleting a high quantity of objects |
Chaos Experimentation Framework | 1,691 | 6 days ago | An extensible platform for infrastructure management including Chaos Engineering |
NetHavoc | A Chaos Engineering Tool for Linux, K8s, Windows, PCF, Cloud, and Containers for injecting Resource, Infrastructure, Network, and Application failures | ||
gorm-sqlchaos | 5 | about 3 years ago | A runtime SQL manipulator for your Golang applications based on gorm |
Chaos Frontend Toolkit | A set of tools to apply Chaos Engineering to frontend | ||
Mitigant | The Continuous Security Verification Platform enables confidence in cloud security posture by leveraging security chaos engineering | ||
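Several of the tools listed above (for example failure-lambda and aws-lambda-chaos-injection) describe injecting latency and exceptions into function calls. The following is a minimal sketch of that general pattern in Python. It does not reproduce the API of any listed tool: the names `inject_fault` and `InjectedFault` and the parameters `rate`, `delay_s`, `raise_error`, and `rng` are hypothetical and chosen only for illustration.

```python
import functools
import random
import time


class InjectedFault(Exception):
    """Raised instead of running the wrapped call when an exception fault fires."""


def inject_fault(rate=0.1, delay_s=0.0, raise_error=False, rng=None):
    """Hypothetical decorator: with probability `rate`, delay and/or fail a call.

    rate        -- fraction of calls that receive a fault (0.0 to 1.0)
    delay_s     -- seconds of artificial latency added to a faulted call
    raise_error -- if True, a faulted call raises InjectedFault
    rng         -- optional random.Random instance, useful for seeded runs
    """
    rng = rng or random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # rng.random() is in [0.0, 1.0), so rate=1.0 always fires
            # and rate=0.0 never fires.
            if rng.random() < rate:
                if delay_s > 0:
                    time.sleep(delay_s)
                if raise_error:
                    raise InjectedFault(f"injected fault in {func.__name__}")
            return func(*args, **kwargs)

        return wrapper

    return decorator


@inject_fault(rate=1.0, raise_error=True)
def always_fails():
    return "ok"


@inject_fault(rate=0.0, raise_error=True)
def never_fails():
    return "ok"
```

Keeping the fault rate explicit and defaulting it to a small value is one way to limit the blast radius of an experiment; the real tools in this list offer their own configuration and safeguards, so consult their documentation before use.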
Awesome Chaos Engineering / Retired tools | |||
The Simian Army | 7,979 | almost 6 years ago | A suite of tools for keeping your cloud operating in top form |
ChaoSlingr | 66 | over 5 years ago | Introducing Security Chaos Engineering. ChaoSlingr focuses primarily on the experimentation on AWS Infrastructure to proactively instrument system security failure through experimentation |
Awesome Chaos Engineering / Cloud Services | |||
Testing Amazon Aurora Using Fault Injection Queries | |||
Azure Chaos Studio | A managed fault injection service for Azure applications. See also for Azure Service Fabric applications | ||
Security Chaos Engineering for Cloud Services | |||
Awesome Chaos Engineering / Papers | |||
Maelstrom: Mitigating Datacenter-level Disasters by Draining Interdependent Traffic Safely and Efficiently | |||
Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems | |||
Automating Failure Testing Research at Internet Scale | |||
Principles of Antifragile Software | |||
Why is random testing effective for partition tolerance bugs? | |||
Chaos Engineering | |||
A Platform for Automating Chaos Experiments | |||
A Chaos Engineering System for Live Analysis and Falsification of Exception-handling in the JVM | |||
TripleAgent: Monitoring, Perturbation And Failure-obliviousness for Automated Resilience Improvement in Java Applications | |||
Lineage-driven Fault Injection | |||
Antifragility is a Fragile Concept | |||
Chaos Engineering Security | |||
Security Chaos Engineering: A new paradigm for cybersecurity | |||
Security Challenges around Chaos Engineering | |||
CloudStrike: Security Chaos Engineering for Cloud Services | |||
Observability and Chaos Engineering on System Calls for Containerized Applications in Docker | |||
Maximizing Error Injection Realism for Chaos Engineering with System Calls | |||
Chaos Engineering of Ethereum Blockchain Clients | |||
Awesome Chaos Engineering / Gamedays | |||
Target: What is a Gameday? | Chaos Gamedays experience by Target | ||
Codecentric: Chaos Engineering Gamedays | Chaos Gamedays by Codecentric | ||
New Relic: How to run a Gameday? | Chaos Gamedays experience by New Relic | ||
Dius: Gamedays resources | Resources for getting started with GameDay and Chaos Engineering | ||
Gremlin: Gamedays | Resources for getting started with GameDay and Chaos Engineering | ||
Gremlin: What is a Chaos Day? | What is a Gameday according to Gremlin | ||
Gremlin: Why run a Chaos Day? | Reasons to run Gamedays according to Gremlin | ||
Gremlin: How to run a Gameday? | Methodology to run Gamedays according to Gremlin | ||
Gremlin DB: Breaking Dynamo DB | Example of a Gameday with DynamoDB by Gremlin | ||
Gremlin: Introduction to Gameday | What is a Gameday according to Gremlin | ||
Gremlin: Planning your own Chaos Day | Example of a Gameday with DynamoDB by Gremlin | ||
Gremlin: Inside Gremlin 2019 Gremlin Gamedays Roadmap | Chaos Gamedays experience by Gremlin | ||
Gremlin: What I learned running the Chaos Lab with Kafka | Example of a Gameday with Kafka by Gremlin | ||
Chaos Toolkit: Chaos Engineering with Humans in the loop | Article about Chaos Gamedays | ||
GoCardless: All fun and games until you start with Gamedays | Article about Chaos Gamedays | ||
InfoQ: Gamedays - Achieving Resilience through Chaos Engineering | InfoQ Presentation with experiences about Chaos Gamedays | ||
Awesome Chaos Engineering / Blogs & Newsletters | |||
Netflix Technology Blog | Learn more about how Netflix designs, builds, and operates our systems and engineering organizations | ||
Production Ready | A mailing list about building resilient infrastructure and tools | ||
SRE Weekly | Weekly Site Reliability Newsletter | ||
Site Reliability Engineering resources | 11,989 | 6 months ago | A curated list of awesome Site Reliability and Production Engineering resources |
SysAdvent | One article for each day of December, ending on the 25th | ||
Gremlin Blog | Blogs on Chaos Engineering from Gremlin Inc | ||
O’Reilly Systems Engineering and Operations Newsletter | Weekly systems engineering and operations news and insights from industry insiders | ||
LaunchDarkly Blog | Continuous delivery and feature flags blog | ||
Verica | Chaos engineering, security chaos engineering and continuous verification | ||
Proofdock | Reliability, resilience and chaos engineering with a focus on MS Azure | ||
LitmusChaos Blog | Blogs on Chaos Engineering from LitmusChaos | ||
ChaosEngineering.news | Chaos Engineering newsletter. All things chaos engineering, directly to your inbox! | ||
Chaos Mesh Blog | Blogs on Chaos Engineering from Chaos Mesh | ||
Chaos Experimentation Framework | Chaos Experimentation, an open-source framework built on top of Envoy Proxy | ||
Squadcast | Blog on Site Reliability engineering | ||
steadybit Blog | Blogs on Chaos Engineering, Resilience, SRE and OPS from steadybit | ||
Awesome Chaos Engineering / Podcasts | |||
Break Things On Purpose | Monthly podcast about Chaos Engineering presented by Gremlin Inc. Also available on Spotify, Google Play, and Stitcher | ||
Awesome Chaos Engineering / Conferences & Meetups | |||
Chaos Carnival | A global two-day virtual conference for Cloud Native Chaos Engineering | ||
Chaos Conf | A day of Chaos Engineering demos, expert advice, and connecting with peers who are putting chaos into practice at their companies | ||
SRECon Conferences | The official SRE conference | ||
LISA Conferences | Prominent conference about SysAdmin/DevOps/SRE | ||
O'Reilly Velocity Conference | Prominent conference about Systems Engineering/DevOps/SRE | ||
Chaos Engineering Community Meetup Group | Bay Area Meetup group for Chaos Engineers | ||
London Chaos Engineering Community | London Area Meetup group for Chaos Engineers | ||
Stockholm Chaos Engineering Meetup | Stockholm Meetup group for Chaos Engineers | ||
Chaos Engineering Community | A collection of meetups across the globe about Chaos Engineering | ||
Conf42.com: Chaos Engineering | Chaos Engineering for practitioners and adopters - London UK, 23 Jan 2020 | ||
Kubernetes Chaos Engineering Meetup Group India | India Meetup group for Chaos Engineers | ||
Awesome Chaos Engineering / Forums | |||
Chaos Community Google Group | |||
Chaos Engineering LinkedIn Group | |||
Chaos Engineering Slack Community | |||
CNCF Chaos Engineering Working Group | |||
CNCF Chaos Engineering Working Group GitHub | 113 | over 4 years ago | |
Chaos Toolkit Slack Community | |||
Litmus Chaos Engineering Slack Community |