awesome-sre

A curated list of Site Reliability and Production Engineering resources, covering practices and principles.

Awesome Site Reliability Engineering / Culture

What is Site Reliability Engineering?
Keys To SRE by Ben Treynor
Google SRE Resources
Notes from Production Engineering by Pedro Canahuati
PostOps: Recovery from Operations
Love DevOps? Wait 'till you meet SRE
How Google Does Planet-Scale Engineering for Planet-Scale Infra
Site Reliability Engineering at Facebook
A History of Site Reliability Engineering at Uber
Case Study: Adopting SRE Principles at StackOverflow
Site Reliability Engineering at Dropbox
Site Reliability Engineers — Keeping Google up and running 24/7
Site Reliability Engineering at Salesforce
From Sys Admin to Netflix SRE (video)
SRE@Google: Thousands of DevOps Since 2004
Transactional System Administration Is Killing Us and Must be Stopped
A hierarchy of SRE needs
PostOps: A Non-Surgical Tale of Software, Fragility, and Reliability
SRE: An incomplete guide to cultural Narnia
Putting Together Great SRE Teams
Work at Google: Meet our Production Engineers for Site Reliability Hangout on Air
Toil: A Word Every Engineer Should Know
Engineering Reliability into Web Sites: Google SRE
DEVOPS & SRE AMA - Building High Performance Organizations
John Allspaw's AMA on Incident Analysis and Postmortems
Site Reliability Engineering with Paul Newson - Part 1
How SysAdmins Devalue Themselves
The Softer Side of DevOps
SRE, noun. See also: confidence, trust.
Site Reliability Engineering with Stephen Weinberg
We are the Google Site Reliability team. We make Google’s websites work. Ask us Anything!
We are the Google Site Reliability Engineering team. Ask us Anything!
The Ops Identity Crisis
The Irreproducibility Of Bugs In Large-Scale Production Systems
SE-Radio Episode 276: Björn Rabenstein on Site Reliability Engineering
Microservices, DevOps and Production Complexity
Introducing Google Customer Reliability Engineering
Evolution or Rebellion? The rise of Site Reliability Engineers (SRE)
The difference between Site Reliability Engineering, System Administration, and DevOps
SRE in the Small and in the Large
SBSRE Meetup: Different SRE roles and challenges (Netflix)
Panel: Who/What Is SRE?
Hope Is Not a Strategy
Tenets of SRE
Site Reliability Engineering Demystified
Is Site Reliability Engineering the True ‘Ops’ in DevOps?
SRE vs. DevOps vs. Cloud Native: The Server Cage Match
SRE: What’s The Big Idea?
Building the SRE Culture at LinkedIn
Podcast #111 – SRE: Occasionally Maintaining Infrastructure That You Hate
Splicing SRE DNA Sequences in the Biggest Software Company on the Planet
Why should your app get SRE support? - CRE life lessons
How SREs find the landmines in a service - CRE life lessons
Making the most of an SRE service takeover - CRE life lessons
The Cloudcast #301: SRE and Infrastructure Operations (Podcast)
The SRE model
Onboarding New Site Reliability Engineers
Building Blocks for Site Reliability At Google
Beyond Google SRE: What is Site Reliability Engineering like at Medium?
Intelligent Site Reliability Engineering – A Machine Learning Perspective
A crash course in LinkedIn's global site operations
Google’s Site Reliability Engineering with Todd Underwood
What is Site Reliability Engineering? (VMware)
A Gentle Introduction to SRE
Understanding Site Reliability Engineering through Movies and Books
GOTO 2017 • Site Reliability Engineering at Google • Christof Leng
The Makeup of Successful Geographically-Distributed SRE Teams - Part 1
Tech Leadership in SRE
The Azure Podcast: Episode 227 - Azure SRE
The human scalability of "DevOps"
Podcast: Site Reliability Management with Mike Hiraga
How a cat inspired system reliability at Knowlarity
Getting Started with Site Reliability Engineering
"Practical Applications of the Dickerson Pyramid" by Nat Welch
LinkedIn’s Kurt Andersen Uncovers Blindspots in SRE Implementations
Interview with Betsy Beyer, Stephen Thorne of Google
Less Risk Through Greater Humanity - Dave Rensin
Getting Started with SRE - Stephen Thorne, Google
Building Successful SRE in Large Enterprises
Solving Reliability Fears with Site Reliability Engineering
SRE vs. DevOps: competing standards or close friends?
How to Avoid the 5 SRE Implementation Traps that Catch Even the Best Teams
Reliability Engineering – The Essential Discipline for Complex Systems
The Modern Site Reliability Workbench on Top of OCI
SRE in the Third Age
About SRE and how (not) to apply it
Transitioning a typical engineering ops team into an SRE powerhouse
Making a Lion Bulletproof: SRE in Banking
Identifying and tracking toil using SRE principles
From Ops to SRE: Evolution of the OpenShift Dedicated Team
Meeting reliability challenges with SRE principles
A quick introduction to SRE principles
The SRE I Aspire to Be
Taming Operational Load with VMware CRE
SRE Cultural Values
Are we there yet? Thoughts on assessing an SRE team’s maturity
What SREs have to do with project-based services?
Making operational work more visible
SRE vs. DevOps: What’s the Difference Between Them?

Awesome Site Reliability Engineering / Education

Panel: Educating SRE
From Zero to Hero: Recommended Practices for Training your Ever-Evolving SRE Teams
New to an SRE team?
The Systems Engineering Side of Site Reliability Engineering
Graduating from Bootcamp and interested in becoming a Site Reliability Engineer?
So you want to be a Site Reliability Engineer?
Spiraling Ops Debt & the SRE Coding Imperative
So you want to be an SRE?
Career Profiles/Site Reliability Engineer
What is the role of a Site Reliability Engineer?
Lynda.com: DevOps Foundations: Site Reliability Engineering
Incident Management Training: Wheel of Misfortune
Site Un-Reliability Engineering [Video Series]
The Ultimate Guide to Structuring a 90-Day Onboarding Plan
SRE fundamentals: SLIs, SLAs and SLOs
How to Get Into SRE
Do you have an SRE team yet? How to start and assess your journey
How SRE teams are organized, and how to get started
Why SRE Documents Matter
How to get started with site reliability engineering (SRE)
Duties of a Site Reliability Engineering Manager
Designing distributed systems using NALSD flashcards
Training Site Reliability Engineers: What Your Organization Needs to Create a Learning Program
SRE Classroom: Distributed PubSub workshop
School of SRE: Curriculum for onboarding non-traditional hires and new grads

Awesome Site Reliability Engineering / Books

Practical Linux Infrastructure
Site Reliability Engineering: How Google Runs Production Systems
The Site Reliability Workbook: Practical Ways to Implement SRE
Observability Engineering: Achieving Production Excellence
The Practice Of Cloud System Administration: Designing and Operating Large Distributed Systems
Web Operations - Keeping the Data On Time
The Checklist Manifesto: How to Get Things Right
Microservices in Production - Standard Principles and Requirements
Production-Ready Microservices - Building Standardized Systems Across an Engineering Organization
Systems Performance: Enterprise and the Cloud (sample chapter)
Monitoring Distributed Systems: Case Studies from Google's SRE Teams
The Human Side of Postmortems: Managing Stress and Cognitive Biases
Chaos Engineering: Building Confidence in System Behavior through Experiment
Post-Incident Reviews: Learning from Failure for Improved Incident Responses
Antifragile Systems and Teams
How to Monitoring the SRE Golden Signals (E-Book)
Incident Management for Operations
Real-World SRE
Seeking SRE
What is SRE?
Engineering Reliable Mobile Applications: Strategies for Developing Resilient Native Mobile Applications
Building Secure and Reliable Systems
Chaos Engineering: Crash test your applications
97 Things Every SRE Should Know
Four Steps to Creating Effective Game Day Tests
The Linux Programming Interface

Awesome Site Reliability Engineering / Hiring

SRE Hiring
Hiring SREs at LinkedIn
Hiring Site Reliability Engineers
Hiring your first SRE
Growing the Site Reliability Team at LinkedIn: Hiring is Hard
Engineering Manager - Site Reliability Engineering Interview Preparation

Awesome Site Reliability Engineering / Reliability

The Realities of the Job of Delivering Reliability
Fail at Scale by Ben Maurer
Embracing Failure: Fault-Injection and Service Reliability
10 Years of Crashing Google
How we break things at Twitter: failure testing
Reliable Cron across the Planet
Push our limits - reliability testing at Twitter
The Verification of a Distributed System by Caitie McCaffrey
Weathering the Unexpected
SRE Hour: Tech Talks by Box & Yelp
Simplicity: A Prerequisite for Reliability
The Two Sides to Google Infrastructure for Everyone Else
How Embracing Continuous Release Reduced Change Complexity
Making "Push On Green" a Reality
BeyondCorp: A New Approach to Enterprise Security
Brainstorming Failure by Jeff Smith
The Ripple Effect Of Outages And Downtime Cannot Be Underestimated
The infrastructure behind Twitter: efficiency and optimization
Dickerson's Hierarchy of Reliability
The Morning Paper on Operability
Production is all that matters
Using load shedding to survive a success disaster - CRE life lessons
How to avoid a self-inflicted DDoS Attack - CRE life lessons
Don't gamble when it comes to reliability
Resilience Engineering: Learning to Embrace Failure
The Infrastructure Behind Twitter: Scale
Scaling Reliability at Twitter: So You Want to Add a 9
Principles Of Chaos Engineering
Chaos Engineering
Available...or not? That is the question - CRE life lessons
How Google Backs Up The Internet Along With Exabytes Of Other Data
Performance, Scalability, And High Availability: 3 Key Infrastructure Adaptability Requirements
The Production Environment at Google - Part 1
Reliable releases and rollbacks - CRE life lessons
How release canaries can save your bacon - CRE life lessons
Things I Learned Managing Site Reliability for Some of the World’s Busiest Gambling Sites
Every Day Is Monday in Operations
Under the Hood: Ensuring Site Reliability
Designing reliable systems with cloud infrastructure (Google Cloud Next '17)
A Google SRE explores GitHub reliability with BigQuery
Know thy enemy: how to prioritize and communicate risks - CRE life lessons
Chaos Engineering resources
CRE life lessons: What is a dark launch, and what does it do for me?
Why you should pick strong consistency, whenever possible
The Network is Reliable
Are You Load Balancing Wrong?
How production engineers support global events on Facebook
Google: A Collection Of Best Practices For Production Services
Canary Analysis Service
Tips for High Availability
Progressive Service Architecture At Auth0
Google Cloud Production Guideline
production readiness
Trust By Design: The Fusion of Operational Maturity and Risk Modeling
Top Seven Myths of Robust Systems
Taming chaos: Preparing for your next incident
PID Loops and the Art of Keeping Systems Stable
Are you ready for production?
Production Checklist for Web Apps on Kubernetes
Finding a problem at the bottom of the Google stack
Rethinking Task Size in SRE
How maintenance windows affect your error budget
The Production Readiness Spectrum
Generic mitigations
How we’re building a production readiness review process at Grafana Labs
Resiliency Planning for High-Traffic Events
Using Fault Injection Testing to Improve DoorDash Reliability

Awesome Site Reliability Engineering / Monitoring & Observability & Alerting

A Working Theory-of-Monitoring
The Evolution of Monitoring Systems at Google - Tony Rippy
Monitoring without Infrastructure @ Airbnb
Monitoring distributed systems
Observability at Uber Engineering: Past, Present, Future
The 4 Golden Signals of API Health and Performance in Cloud-Native Applications
My Philosophy on Alerting by Rob Ewaschuk
Time To Detect - Netflix
Why Percentiles Don’t Work the Way you Think
Building Twitter’s Next-Gen Alerting System
Instrumentation: Worst case performance matters
Instrumentation: What does 'uptime' mean?
Incidents + Outages at CircleCI: Our Playbook and What We’ve Learned
An introduction to monitoring and alerting with timeseries at scale, with Prometheus
Detecting outliers and anomalies in realtime at Datadog
How to Monitor the SRE Golden Signals
Monitoring in a DevOps World
Monitoring Your Monitoring’s Monitoring
Observability: the new wave or buzzword?
Monitoring Isn't Observability
Monitoring in the time of Cloud Native
Principles of Monitoring Microservices
The Many Ways Your Monitoring Is Lying to You
GitOps Part 3 - Observability
Want to Debug Latency?
Debugging Latency in Go 1.11
Alerting on SLOs like Pros
Applied Alerting Philosophy
Observations on Observability
Deploys: It's Not Actually About Fridays
Site Reliability Engineering Best Practices for Data Pipelines
Elastic Observability in SRE and Incident Response
Error Budget Policy - Part 1 - Adoption at Expedia Group
Error Budget Policy - Part 2 - Practices at Expedia Group

Awesome Site Reliability Engineering / On-Call

Being an On-Call Engineer: A Google SRE Perspective
Inside Atlassian: how our site reliability engineers do incident management
Inside Atlassian: how IT & SRE use ChatOps to run incident management
Incident Response at Heroku
Who's On Call?
SysAdvent - Day 6 - No More On-Call Martyrs
On Being On Call
The On-Call Handbook
Incident management at Google — adventures in SRE-land
Run Book / Operations Manual template
Automating Your Oncall: Open Sourcing Fossor and Ascii Etch
Project STAR*: Streamlining Our On-Call Process
SRE@Xero: Managing Incidents Part I
SRE@Xero: Managing Incidents Part II
How To Establish a High Severity Incident Management Program
How Your Systems Keep Running Day After Day - John Allspaw
On-call doesn’t have to suck
Why, as a Netflix infrastructure manager, am I on call?
Oncall and Sustainable Software Development
On Call Rotations: How Best to Wake Devs Up in the Middle of the Night
Understanding The Role Of The Incident Manager On-Call (IMOC)
3 Ways to Minimize the Impact of High Severity Incidents
Advice to Management Teams While Enrolling Changes to On-Call Systems
Moving Past Shallow Incident Data
Sustainable On-Call
dotScale 2017 - Aish Raj Dahal - Chaos management during a major incident
Incident Management at Netflix Velocity
Incidents, fixes, and the day after
10 Steps to Develop an Incident Response Plan You’ll ACTUALLY Use
Checklists: a stupidly simple but valuable operational gift
How to write a status page update
Atlassian Incident Handbook
PagerDuty Incident Response Handbook
Avoiding Burnout for SREs
Better On-Call the SRE way
Managing Incidents at Monzo
Making On-Call Not Suck
How we (Monzo) respond to incidents
How we’ve evolved on-call at Monzo
Code Yellow: When Operations Isn’t Perfect
MTTR is dead, long live CIRT
Extended Dreyfus Model for Incident Lifecycles
Inhumanity of Root Cause Analysis
Incident insights from NASA, NTSB, and the CDC
How to avoid On-Call Burnout the SRE Way
My week shadowing a GitLab Site Reliability Engineer
How our production team runs the weekly on-call handover
Writing Runbook Documentation When You’re An SRE
Incident response, programs and you(r startup)
An Incident Command Training Handbook
Shrinking the time to mitigate production incidents
Incident writeup as sociological storytelling
Elephant in the Blameless War Room: Accountability
Naming names in incident writeups
Building On-Call Culture at GitHub

Awesome Site Reliability Engineering / Post-Mortem

A collection of post-mortems
Collection of Kubernetes Failure Stories
Blameless PostMortems and a Just Culture
A Tale of Postmortems
Building a Blameless Post-Mortem Culture with Jason Hand
The infinite hows
Failure is Always An Option: How a Blameless Culture Leads to Better Results
SysAdvent - Day 1 - Why You Need a Postmortem Process
Etsy’s Debriefing Facilitation Guide for Blameless Postmortems
Writing Your First Postmortem
How to Write Great Outage Post-Mortems
A collection of postmortem templates
Embracing Feedback
Postmortem Action Items: Plan the Work and Work the Plan
Social Issues In Postmortems
Google Has an Official Process in Place for Learning From Failure--and It's Absolutely Brilliant
Postmortem culture: how you can learn from failure
re:Work - Postmortem discussion template
Post-mortems to the rescue
Why Every Company Can Benefit from a Blameless Culture
"It's dead, Jim": How we write an incident postmortem
Our incident postmortem template
Learn out of mistakes. Postmortems to the rescue.
Improving Postmortem Practices with Veteran Google SRE, Steve McGhee
Inhumanity of Root Cause Analysis

Awesome Site Reliability Engineering / Capacity Planning

Capacity Planning
SouthBay SRE: Cloud Capacity Planning
Intent-based Capacity Planning and Autoscaling with Kubernetes
How do you do Capacity Planning
How Back Market SREs prepared for Black Friday

Awesome Site Reliability Engineering / Service Level Agreement

If It's in the Cloud, Get It on Paper: Cloud Computing Contract Issues
Service Level Agreements in the Cloud: Who cares?
SysAdvent - Day 20 - How to set and monitor SLAs
SLOs, SLIs, SLAs, oh my - CRE life lessons
Service Levels and Error Budgets
(Un)Reliability Budgets - Finding Balance between Innovation and Reliability
The Calculus of Service Availability
Availability Calculator: Calculate how much downtime should be permitted in your SLA
Standardize cloud SLA availability with numerical performance data
Best practices to develop SLAs for cloud computing
A Practical Guide to SLAs
Building good SLOs - CRE life lessons
No Grumpy Humans and Other Site Reliability Engineering Lessons from Google
Consequences of SLO violations — CRE life lessons
Service Level Objectives in Practice
SRE Consensus Building
An example escalation policy — CRE life lessons
Error Budget Calculator
Understanding error budget overspend - part one - CRE life lessons
Good housekeeping for error budgets - part two - CRE life lessons
SRE fundamentals: SLIs, SLAs and SLOs
SLOs & You: A Guide To Service Level Objectives
Earning Our Wings: Stories and Findings From Operating a Large-scale Concourse Deployment
Nines are Not Enough: Meaningful Metrics for Clouds
How many nines is my storage system?
Don't follow the sun.
The Tyranny of the SLA
Backblaze Durability is 99.999999999% — And Why It Doesn’t Matter
DevOpsDays Chicago 2019 - The Art of SLOs
The Art of SLOs Workshop Materials
How to Include Latency in SLO-Based Alerting
Succeeding With Service Level Objectives
Putting customers first with SLIs and SLOs
SRE Leadership: Have Tiered SLAs
How SLOs Enable Fast, Reliable Application Delivery
The Tail at Scale
The Tail at Scale Revisited
Defining SLOs for services with dependencies
Service Level Disagreements
How We Use Sloth to do SLO Monitoring and Alerting with Prometheus
SLI Deep Dive
Measuring Reliability in GCP: Step By Step SLO creation guide using Cloud Operation Sandbox
SLO tracker
SLO Alerting for Mortals
SRE methods and climate change
What made SLOs so messy (and what we can do about it)
SLICK: Adopting SLOs for improved reliability
Calculating composite SLA
Best practices for setting SLOs and SLIs for modern, complex systems

Awesome Site Reliability Engineering / Performance

Performance Checklists for SREs
South Bay SRE Meetup - Netflix Cloud Performance Team
Software Performance Analysis Guided By SLOs
A framework for pragmatic performance engineering

Awesome Site Reliability Engineering / Programming

Go Language for Ops and Site Reliability Engineering
Go for SREs using Python
Operability in Go
Go Reliability and Durability at Dropbox

Awesome Site Reliability Engineering / Misc Articles

What is SRE (Site Reliability Engineering)?
Here’s How Google Makes Sure It (Almost) Never Goes Down
Are site reliability engineers the next data scientists?
Site Reliability Engineers: "solving the most interesting problems"
Site Reliability Engineers: the "world’s most intense pit crew"
Site reliability engineering kicks rote tasks out of IT ops
Notes on Site Reliability Engineering
Adventures in SRE-land: Welcome to Google Mission Control
Book Review: Site Reliability Engineering - How Google Runs Production Systems
Site Reliability Engineers: “We solve cooler problems”
SREcon17: Brave new world of site reliability engineering
Open AWS guide
Commentary on Site Reliability Engineering
Site Reliability Engineering: 4 Things to Know
Looking for SRE Success? Then Find the Intrapreneurs!
What Team Structure is Right for DevOps to Flourish?
Injured on Vacation? Applying Principles from Site Reliability Engineering to a Travel Emergency
Building blameless working environment
SRE Adoption Report
SREs: The Happiest – and Highest Paid – in the Industry
The Role of Site Reliability Engineering, Today and Tomorrow
SRE as a Lifestyle Choice
SRECon EMEA 2019 Recap
Life of an SRE at Google - JC van Winkel
Site Reliability Engineering for Native Mobile Apps - Abhijith Krishnappa Case study: Halodoc adaptation of SRE principles for Native Mobile Apps
SRE Best Practices by InfraCloud

Awesome Site Reliability Engineering / Real-time Messaging

#sre channel at Hangops Slack Discussion of Site Reliability Engineering generally
#incident_response channel at Hangops Slack Discussion about Incident Response
USENIX SREcon Slack

Awesome Site Reliability Engineering / Blogs

Brendan Gregg's Blog Highly Technical Blog Posts About Systems Internals, Performance and SRE
Everything Sysadmin Blog Posts About SysAdmin/DevOps/SRE by Tom Limoncelli
High Scalability Technical Blog Posts About Systems Architecture
rachelbythebay Technical Blog Posts
Susan J. Fowler Various blog posts about SRE, Software Engineering and Microservices
SysAdvent One article for each day of December, ending on the 25th article
Stephen Thorne's Blog Blog Posts About SRE
Increment A digital magazine about how teams build and operate software systems at scale
GopherSRE Blog Posts about Go and SRE
Cindy Sridharan Blog posts about distributed systems and their management
Blameless Blog Blog posts about SRE culture and practices
Resilience Roundup Weekly analysis of Resilience Engineering and Human Factors research designed for software systems
Squadcast Blog Blog posts about SRE best practices, reliability, on-call and incident management
FireHydrant Blog Posts about complex systems, incident response, and SRE best practices
Rootly Blog Incident management best practices and guides
incident.io Blog Guides, advice and resources on incident management and response
Logit.io Blog Resources on log management, SRE and DevOps

Awesome Site Reliability Engineering / Newsletters

DevOpsLinks A weekly newsletter about SRE, SysAdmin and DevOps news, tools, tutorials and opinions
KubeWeekly The weekly newsletter for all things Kubernetes. KubeWeekly is curated by Bob Killen, Chris Short, Craig Box, Kim McMahon and Michael Hausenblas
SRE Weekly Weekly Site Reliability Newsletter
O’Reilly Systems Engineering and Operations Newsletter Weekly systems engineering and operations news and insights from industry insiders
ChaosEngineering.news Chaos Engineering newsletter. All things Chaos Engineering, directly to your inbox!
Monitoring Weekly What's new in monitoring? Curated monitoring articles to your inbox each week
Observability news Updates around observability (o11y) with a special focus on open source

Awesome Site Reliability Engineering / Conferences & Meetups

SRECon Conferences The Official SRE Conference
LISA Conferences Prominent Conference About SysAdmin/DevOps/SRE
SRE Tech Talks SRE Talks Hosted by Google
South Bay Site Reliability Engineering (Sunnyvale, CA) Meetup A Group For Individuals Who Tackle Reliability Challenges For Web-Scale Systems
San Francisco Reliability Engineering A Group Of People Who Are Passionate About Reliable, Performant Software Systems
Site Reliability Engineering Munich, Germany SRE Meetup in the greater area of Oktoberfest city
ADDO - All Day DevOps A 24 hour conference that is completely online and free
Site Reliability Engineering Paris, France SRE Meetup in the city of light
Site Reliability Engineering India SRE Meetup India

Awesome Site Reliability Engineering / Twitter

Google SRE Twitter Account Google's SRE Twitter Account
SREBook The Official Twitter Account of Site Reliability Engineering Book
SREcon SRECon's Official Twitter Account
SREWorkbook The Official Twitter Account of Site Reliability Workbook
The SRE Dev SRE-related Posts from
Twitter SRE The Official Twitter Account of Twitter's SRE team
Twitter SRE Weekly The Official Twitter Account of SRE Weekly Newsletter
USENIX Association The Official USENIX Twitter Account

Awesome Site Reliability Engineering / SRE Tools

Awesome SRE Tools A curated list of Site Reliability and Production Engineering tools
List of Continuous Integration services
SRE cheat sheet A cheat sheet for Site Reliability Engineering principles and numbers

Awesome Site Reliability Engineering / Podcasts

Blameless / Resilience in Action
Google SRE Prodcast
o11y Observability Podcast
On Call Nightmares (retired)
Making of the SRE Omelette
