awesome-chaos-engineering
System stress test collection
A curated list of resources and examples on experimenting with distributed systems to improve resilience and reliability
A curated list of Chaos Engineering resources.
Awesome Chaos Engineering / Culture | |||
| Principles Of Chaos Engineering | |||
| Chaos Community | |||
| Chaos Engineering | |||
| O'Reilly Velocity San Jose 2017: Precision Chaos | |||
| The Discipline of Chaos Engineering | |||
| Chaos Monkey for Fun and Profit | |||
| Fault Injection in Production: Making the case for resilience testing | |||
| Lord of Chaos - Becoming a Chaos Engineer | |||
| Chaos testing - Preventing failure by instigation | |||
| Orchestrated Chaos | |||
| Video | Choose your own adventure: Chaos Engineering | ||
| AMA Chaos Engineering + DiRT | |||
| SRECON17: Principles of Chaos Engineering | |||
| Chaos & Intuition Engineering at Netflix | |||
| Mastering Chaos - A Netflix Guide to Microservices | |||
| Too big to test: Breaking a production brokerage platform without causing financial devastation | |||
| Inside Azure Search: Chaos Engineering | |||
| Netflix, the Simian Army, and the culture of freedom and responsibility | |||
| FIT: Failure Injection Testing | |||
| The Netflix Simian Army | |||
| Automated Failure Testing | |||
| The Verification of a Distributed System by Caitie McCaffrey | |||
| The Journey to Chaos Engineering begins with a single step - Bruce Wong and James Burns (Twilio) | |||
| Chaos Engineering by Lorin Hochstein | |||
| Aaron Rinehart - ChaoSlingr: Introducing Security based Chaos Testing | |||
| Chaos Engineering - Casey Rosenthal | |||
| Video | The Road to Chaos - Velocity 2017 | ||
| How Netflix DDoS’d Itself To Help Protect the Entire Internet | |||
| 10 Years of Crashing Google | |||
| Weathering the Unexpected | |||
| SRECON17: Breaking Things on Purpose | |||
| PuppetConf 2016: Chaos Patterns - Architecting for Failure in Distributed Systems | |||
| Ship More, Sink Less - Changing Chaos Engineering and Distributed Tracing | |||
| Cloudcast - Discipline of Chaos Engineering | |||
| Software Engineering Daily - Failure Injection with Kolton Andrus podcast | |||
| Responding to Failures in Playback Features with Haley Tucker podcast | |||
| "Antics, drift, and chaos" by Lorin Hochstein | |||
| re:invent 2017: Nora Jones Describes Why We Need More Chaos - Chaos Engineering, That Is | |||
| Failure Friday: Four Years On | |||
| Monkeys & Lemurs and Locusts, Oh my! | |||
| Practical Chaos Engineering | |||
| Chaos Day in the Met Office Cloud | |||
| Cloud Native and Chaos Engineering | |||
| Chaos Engineering with Kolton Andrus | |||
| Chaos Engineering: the history, principles, and practice | |||
| Embracing the Chaos of Chaos Engineering | |||
| Designing Services for Resilience: Netflix Lessons | |||
| Chaos Engineering: A cheat sheet | |||
| How to convince your boss and make them say “Yes!” to Chaos Engineering? | |||
| Why the World Needs More Resilient Systems | |||
| Chaos Architecture | |||
| Gremlin’s Tammy Bütow on the Business Side of Chaos Engineering | |||
| Kubernetes Chaos Engineering: Lessons Learned | |||
| Chaos Engineering: managing complexity by breaking things | |||
| Podcast: Database Chaos with Tammy Butow | |||
| LinkedOut: A Request-Level Failure Injection Framework | |||
| GOTO 2018 - Breaking Things on Purpose - Kolton Andrus | |||
| Why should Chaos be part of your Distributed Systems Engineering? | |||
| Brian Holt - Chaos Monkeys in Your Browser What Chaos Engineering Means For the Front End | |||
| Chaos Engineering: Why the World Needs More Resilient Systems | |||
| Video | QCon·Beijing 2017: The Practice of Failure Management and Fault Injection at Alibaba E-Commerce Platforms (Chinese speech) | ||
| Orchestrating Chaos using Grab's Experimentation Platform | |||
| Breaking to Learn: Chaos Engineering Explained | |||
| Chaos Engineering Traps | |||
| Chaos Engineering - The Art of Breaking Things Purposefully | |||
| Disasterpiece Theater: Slack’s process for approachable Chaos Engineering | |||
| Taming chaos: Preparing for your next incident | |||
| The Future of Chaos Engineering w/ Conde Nast | |||
| Chaos Engineering For People Systems w/ Dave Rensin of Google | |||
| Performing chaos engineering in a serverless world (AWS re:Invent 2019 CMY301) | |||
| Building Confidence in Healthcare Systems through Chaos Engineering | |||
| Break Your App before Someone Else Does | |||
| Preparing for Traffic Spikes with Chaos Engineering | |||
| Automating Chaos Engineering GameDays with Terraform | |||
| Postmortem Culture: Learning from failure | |||
| Problem Detection by John Allspaw | |||
| New Paradigms for the Next Era of Security | |||
| Cloud-Native Chaos Engineering | |||
| Building resilient services at Prime Video with chaos engineering | |||
| Making Chaos Part of Kubernetes/OpenShift Performance and Scalability Tests | |||
| Lucky Lotto, chaos engineering but for teams | |||
| Using Fault Injection Testing to Improve DoorDash Reliability | |||
| Chaos Engineering At Ant Group | |||
Awesome Chaos Engineering / Books | |||
| Chaos Engineering: Building Confidence in System Behavior through Experiment | |||
| Site Reliability Engineering: How Google Runs Production Systems | |||
| The Practice Of Cloud System Administration: Designing and Operating Large Distributed Systems | |||
| Antifragile Systems and Teams | |||
| The InfoQ eMag: Chaos Engineering | |||
| Learning Chaos Engineering | |||
| Chaos Engineering: System Resilience in Practice | |||
| Chaos Engineering: Crash test your applications | |||
| Security Chaos Engineering: Gaining Confidence in Resilience and Safety at Speed and Scale | |||
| Chaos Engineering Observability | |||
Awesome Chaos Engineering / Education | |||
| Slides | A Chaos Engineering Bootcamp for O'Reilly Velocity 2017 | ||
| Your First Chaos Experiment | |||
| Chaos Engineering 101 | |||
| A Primer on Automating Chaos | |||
| Intro to Chaos Engineering | |||
| Learn the basics of the Chaos Toolkit | |||
| Build System Confidence with Chaos Engineering | |||
| How we break things at Twitter: failure testing | |||
| Run Chaos Experiments Without Risking Your Job | |||
| A Guide to Your First Chaos Day | |||
| Planning Your Own Chaos Day | |||
| How To Install Distributed Tensorflow on GCP and Perform Chaos Engineering Experiments | |||
| Monitoring Your Chaos Experiments | |||
| Increasing the Resilience of APIs with Chaos Engineering | |||
| 3 key steps for running chaos engineering experiments | |||
| Exploring Multi-level Weaknesses using Automated Chaos Experiments | |||
| Chaos Monkey Guide for Engineers | |||
| Chaos Engineering for Serverless | |||
| Network Fire Drills with Chaos Engineering | |||
| DevOps Foundations: Chaos Engineering | |||
| Resilience Engineering: Short Course | |||
| The Chaos Engineering Collection | |||
| PenTester Academic | |||
| Consul and Chaos Engineering | |||
Awesome Chaos Engineering / Notable Tools | |||
| Chaos Monkey | 15,332 | about 1 year ago | A resiliency tool that helps applications tolerate random instance failures |
| orchestrator | 5,645 | over 1 year ago | MySQL replication topology management and HA |
| kube-monkey | 2,985 | over 1 year ago | An implementation of Netflix's Chaos Monkey for Kubernetes clusters |
| Gremlin Inc. | Failure as a Service | ||
| Chaos Toolkit | 1,897 | over 1 year ago | A chaos engineering toolkit to help you build confidence in your software system |
| steadybit | A Chaos Engineering platform (SaaS or On-Prem) with auto-discovery features, different attack types, user management, and more | ||
| PowerfulSeal | 1,949 | about 2 years ago | Adds chaos to your Kubernetes clusters, so that you can detect problems in your systems as early as possible. It kills targeted pods and takes VMs up and down |
| drax | 42 | over 6 years ago | DC/OS Resilience Automated Xenodiagnosis tool. It helps to test DC/OS deployments by applying a Chaos Monkey-inspired, proactive and invasive testing approach |
| Wiremock | API mocking (Service Virtualization) which enables modeling real world faults and delays | ||
| MockLab | API mocking (Service Virtualization) as a service which enables modeling real world faults and delays | ||
| Pod-Reaper | 202 | 12 months ago | A rules based pod killing container. Pod-Reaper was designed to kill pods that meet specific conditions that can be used for Chaos testing in Kubernetes |
| Muxy | 822 | almost 5 years ago | A chaos testing tool for simulating real-world distributed system failures |
| Toxiproxy | 10,918 | 11 months ago | A TCP proxy to simulate network and system conditions for chaos and resiliency testing |
Awesome Chaos Engineering / Notable Tools / Chaos engineering for Docker: | |||
| Pumba | 2,797 | about 1 year ago | Chaos testing and network emulation for Docker containers (and clusters) |
| Blockade | 907 | over 4 years ago | Docker-based utility for testing network failures and partitions in distributed applications |
Awesome Chaos Engineering / Notable Tools | |||
| chaos-lambda | 163 | over 1 year ago | Randomly terminate ASG instances during business hours |
| Namazu | 492 | about 7 years ago | Programmable fuzzy scheduler for testing distributed systems |
| Chaos Monkey for Spring Boot | Injects latencies, exceptions, and terminations into Spring Boot applications | ||
| Byte-Monkey | 225 | about 5 years ago | Bytecode-level fault injection for the JVM. It works by instrumenting application code on the fly to deliberately introduce faults like exceptions and latency |
| GomJabbar | 30 | about 1 year ago | ChaosMonkey for your private cloud |
| Turbulence | 49 | about 6 years ago | Tool focused on BOSH environments capable of stressing VMs, manipulating network traffic, and more. It is very similar to Gremlin |
| chaosblade | 6,015 | 11 months ago | An Easy to Use and Powerful Chaos Engineering Toolkit |
| KubeInvaders | 1,027 | 11 months ago | Gamified Chaos engineering tool for Kubernetes Clusters |
| Cthulhu | 93 | about 6 years ago | Chaos Engineering tool that helps evaluate the resiliency of microservice systems by simulating various disaster scenarios against a target infrastructure in a data-driven manner |
| VMware Mangle | Orchestrating Chaos Engineering | ||
| Byteman | A Swiss Army Knife for Byte Code Manipulation | ||
| Litmus | 4,476 | 11 months ago | Framework for Kubernetes environments that enables users to run test suites, capture logs, generate reports and perform chaos tests |
| Perses | 66 | over 4 years ago | A project to cause (controlled) destruction to a JVM application |
| ChaosKube | 1,817 | 11 months ago | chaoskube periodically kills random pods in your Kubernetes cluster |
| Chaos Mesh | 6,803 | 11 months ago | Chaos Mesh is a cloud-native Chaos Engineering platform that orchestrates chaos on Kubernetes environments |
| failure-lambda | 94 | about 1 year ago | A small Node module for injecting failure into AWS Lambda using latency, exception, statuscode or diskspace |
| aws-chaos-scripts | 92 | about 2 years ago | Collection of Python scripts to run failure injection on AWS infrastructure |
| chaos-ssm-documents | 268 | over 2 years ago | Collection of AWS SSM Documents to perform Chaos Engineering experiments |
| aws-lambda-chaos-injection | 101 | about 1 year ago | A library injecting chaos into AWS Lambda. It offers simple Python decorators to do delay, exception and statusCode injection and a class to add delay to any 3rd party dependencies |
| chaos-dingo | 11 | about 6 years ago | A tool to mess with Azure services using the Azure NodeJS SDK |
| Chaos HTTP Proxy | 144 | almost 2 years ago | Introduce failures into HTTP requests via a proxy server |
| Chaos Lemur | 62 | over 7 years ago | A self-hostable application to randomly destroy virtual machines in a BOSH-managed environment |
| Simoorg | 190 | almost 8 years ago | LinkedIn’s very own failure inducer framework |
| react-chaos | 592 | almost 3 years ago | A chaos engineering tool for your React apps |
| vue-chaos | 2 | about 5 years ago | A chaos engineering tool for your Vue apps |
| Chaos Engine | 68 | over 1 year ago | A tool designed to intermittently destroy or degrade application resources running in cloud-based infrastructure |
| kubedoom | 2,021 | about 1 year ago | Kill Kubernetes pods by playing Id's DOOM |
| kubethanos | 623 | over 5 years ago | Kills half of your Kubernetes pods, selected at random |
| go-fault | 507 | 12 months ago | Fault injection middleware in Go |
| Proofdock's Chaos Engineering Platform | A chaos engineering platform that seamlessly integrates in Azure DevOps and has a focus on the Azure cloud platform | ||
| Pystol | Pystol is a fault injection platform allowing users to execute fault injection Actions in cloud-native environments in a controlled and prescribed way | ||
| AWSSSMChaosRunner | 250 | about 2 years ago | Amazon's light-weight open-source library for chaos engineering on AWS. It can be used for , and |
| Kraken | 295 | 11 months ago | Chaos and resiliency testing tool for Kubernetes and OpenShift |
| kube-burner | 508 | 11 months ago | A tool aimed at stressing Kubernetes clusters by creating or deleting a high quantity of objects |
| Chaos Experimentation Framework | 1,701 | 11 months ago | An extensible platform for infrastructure management including Chaos Engineering |
| NetHavoc | A Chaos Engineering Tool for Linux, K8s, Windows, PCF, Cloud, and Containers for injecting Resource, Infrastructure, Network, and Application failures | ||
| gorm-sqlchaos | 5 | about 4 years ago | A runtime SQL manipulator for your Golang applications based on gorm |
| Chaos Frontend Toolkit | A set of tools to apply Chaos Engineering to frontend | ||
| Mitigant | The Continuous Security Verification Platform enables confidence in cloud security posture by leveraging security chaos engineering | ||
Awesome Chaos Engineering / Retired tools | |||
| The Simian Army | 7,982 | almost 7 years ago | A suite of tools for keeping your cloud operating in top form |
| ChaoSlingr | 66 | over 6 years ago | Introducing Security Chaos Engineering. ChaoSlingr focuses primarily on the experimentation on AWS Infrastructure to proactively instrument system security failure through experimentation |
Awesome Chaos Engineering / Cloud Services | |||
| Testing Amazon Aurora Using Fault Injection Queries | |||
| Azure Chaos Studio | A managed fault injection service for Azure applications. See also for Azure Service Fabric applications | ||
| Security Chaos Engineering for Cloud Services | |||
Awesome Chaos Engineering / Papers | |||
| Maelstrom: Mitigating Datacenter-level Disasters by Draining Interdependent Traffic Safely and Efficiently | |||
| Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems | |||
| Automating Failure Testing Research at Internet Scale | |||
| Principles of Antifragile Software | |||
| Why is random testing effective for partition tolerance bugs? | |||
| Chaos Engineering | |||
| A Platform for Automating Chaos Experiments | |||
| A Chaos Engineering System for Live Analysis and Falsification of Exception-handling in the JVM | |||
| TripleAgent: Monitoring, Perturbation And Failure-obliviousness for Automated Resilience Improvement in Java Applications | |||
| Lineage-driven Fault Injection | |||
| Antifragility is a Fragile Concept | |||
| Chaos Engineering Security | |||
| Security Chaos Engineering: A new paradigm for cybersecurity | |||
| Security Challenges around Chaos Engineering | |||
| CloudStrike: Security Chaos Engineering for Cloud Services | |||
| Observability and Chaos Engineering on System Calls for Containerized Applications in Docker | |||
| Maximizing Error Injection Realism for Chaos Engineering with System Calls | |||
| Chaos Engineering of Ethereum Blockchain Clients | |||
Awesome Chaos Engineering / Gamedays | |||
| Target: What is a Gameday? | Chaos Gamedays experience by Target | ||
| Codecentric: Chaos Engineering Gamedays | Chaos Gamedays by Codecentric | ||
| New Relic: How to run a Gameday? | Chaos Gamedays experience by New Relic | ||
| Dius: Gamedays resources | Resources for getting started with GameDay and Chaos Engineering | ||
| Gremlin: Gamedays | Resources for getting started with GameDay and Chaos Engineering | ||
| Gremlin: What is a Chaos Day? | What is a Gameday according to Gremlin | ||
| Gremlin: Why run a Chaos Day? | Reasons to run Gamedays according to Gremlin | ||
| Gremlin: How to run a Gameday? | Methodology to run Gamedays according to Gremlin | ||
| Gremlin DB: Breaking Dynamo DB | Example of a Gameday with DynamoDB by Gremlin | ||
| Gremlin: Introduction to Gameday | What is a Gameday according to Gremlin | ||
| Gremlin: Planning your own Chaos Day | Example of a Gameday with DynamoDB by Gremlin | ||
| Gremlin: Inside Gremlin 2019 Gremlin Gamedays Roadmap | Chaos Gamedays experience by Gremlin | ||
| Gremlin: What I learned running the Chaos Lab with Kafka | Example of a Gameday with Kafka by Gremlin | ||
| Chaos Toolkit: Chaos Engineering with Humans in the loop | Article about Chaos Gamedays | ||
| GoCardless: All fun and games until you start with Gamedays | Article about Chaos Gamedays | ||
| InfoQ: Gamedays - Achieving Resilience through Chaos Engineering | InfoQ Presentation with experiences about Chaos Gamedays | ||
Awesome Chaos Engineering / Blogs & Newsletters | |||
| Netflix Technology Blog | Learn more about how Netflix designs, builds, and operates our systems and engineering organizations | ||
| Production Ready | A mailing list about building resilient infrastructure and tools | ||
| SRE Weekly | Weekly Site Reliability Newsletter | ||
| Site Reliability Engineering resources | 12,063 | over 1 year ago | A curated list of awesome Site Reliability and Production Engineering resources |
| SysAdvent | One article for each day of December, ending on the 25th | ||
| Gremlin Blog | Blogs on Chaos Engineering from Gremlin Inc | ||
| O’Reilly Systems Engineering and Operations Newsletter | Weekly systems engineering and operations news and insights from industry insiders | ||
| LaunchDarkly Blog | Continuous delivery and feature flags blog | ||
| Verica | Chaos engineering, security chaos engineering and continuous verification | ||
| Proofdock | Reliability, resilience and chaos engineering with a focus on MS Azure | ||
| LitmusChaos Blog | Blogs on Chaos Engineering from LitmusChaos | ||
| ChaosEngineering.news | Chaos Engineering newsletter. All things chaos engineering, directly to your inbox! | ||
| Chaos Mesh Blog | Blogs on Chaos Engineering from Chaos Mesh | ||
| Chaos Experimentation Framework | Chaos Experimentation, an open-source framework built on top of Envoy Proxy | ||
| Squadcast | Blog on Site Reliability engineering | ||
| steadybit Blog | Blogs on Chaos Engineering, Resilience, SRE and OPS from steadybit | ||
Awesome Chaos Engineering / Podcasts | |||
| Break Things On Purpose | Monthly podcast about Chaos Engineering presented by Gremlin Inc. Also available on Spotify, Google Play, and Stitcher | ||
Awesome Chaos Engineering / Conferences & Meetups | |||
| Chaos Carnival | A global two-day virtual conference for Cloud Native Chaos Engineering | ||
| Chaos Conf | A day of Chaos Engineering demos, expert advice, and connecting with peers who are putting chaos into practice at their companies | ||
| SRECon Conferences | The official SRE conference | ||
| LISA Conferences | Prominent conference about SysAdmin/DevOps/SRE | ||
| O'Reilly Velocity Conference | Prominent conference about Systems Engineering/DevOps/SRE | ||
| Chaos Engineering Community Meetup Group | Bay Area Meetup group for Chaos Engineers | ||
| London Chaos Engineering Community | London Area Meetup group for Chaos Engineers | ||
| Stockholm Chaos Engineering Meetup | Stockholm Meetup group for Chaos Engineers | ||
| Chaos Engineering Community | A collection of meetups across the globe about Chaos Engineering | ||
| Conf42.com: Chaos Engineering | Chaos Engineering for practitioners and adopters - London UK, 23 Jan 2020 | ||
| Kubernetes Chaos Engineering Meetup Group India | India Meetup group for Chaos Engineers | ||
Awesome Chaos Engineering / Forums | |||
| Chaos Community Google Group | |||
| Chaos Engineering LinkedIn Group | |||
| Chaos Engineering Slack Community | |||
| CNCF Chaos Engineering Working Group | |||
| CNCF Chaos Engineering Working Group GitHub | 113 | over 5 years ago | |
| Chaos Toolkit Slack Community | |||
| Litmus Chaos Engineering Slack Community | |||