Awesome Site Reliability Engineering / Culture |
| What is Site Reliability Engineering? | | | |
| Keys To SRE by Ben Treynor | | | |
| Google SRE Resources | | | |
| Notes from Production Engineering by Pedro Canahuati | | | |
| PostOps: Recovery from Operations | | | |
| Love DevOps? Wait 'till you meet SRE | | | |
| How Google Does Planet-Scale Engineering for Planet-Scale Infra | | | |
| Site Reliability Engineering at Facebook | | | |
| A History of Site Reliability Engineering at Uber | | | |
| Case Study: Adopting SRE Principles at StackOverflow | | | |
| Site Reliability Engineering at Dropbox | | | |
| Site Reliability Engineers — Keeping Google up and running 24/7 | | | |
| Site Reliability Engineering at Salesforce | | | |
| From Sys Admin to Netflix SRE (video) | | | |
| SRE@Google: Thousands of DevOps Since 2004 | | | |
| Transactional System Administration Is Killing Us and Must be Stopped | | | |
| A hierarchy of SRE needs | | | |
| PostOps: A Non-Surgical Tale of Software, Fragility, and Reliability | | | |
| SRE: An incomplete guide to cultural Narnia | | | |
| Putting Together Great SRE Teams | | | |
| Work at Google: Meet our Production Engineers for Site Reliability Hangout on Air | | | |
| Toil: A Word Every Engineer Should Know | | | |
| Engineering Reliability into Web Sites: Google SRE | | | |
| DEVOPS & SRE AMA - Building High Performance Organizations | | | |
| John Allspaw's AMA on Incident Analysis and Postmortems | | | |
| Site Reliability Engineering with Paul Newson (Part 1) | | | |
| How SysAdmins Devalue Themselves | | | |
| The Softer Side of DevOps | | | |
| SRE, noun. See also: confidence, trust. | | | |
| Site Reliability Engineering with Stephen Weinberg | | | |
| We are the Google Site Reliability team. We make Google’s websites work. Ask us Anything! | | | |
| We are the Google Site Reliability Engineering team. Ask us Anything! | | | |
| The Ops Identity Crisis | | | |
| The Irreproducibility Of Bugs In Large-Scale Production Systems | | | |
| SE-Radio Episode 276: Björn Rabenstein on Site Reliability Engineering | | | |
| Microservices, DevOps and Production Complexity | | | |
| Introducing Google Customer Reliability Engineering | | | |
| Evolution or Rebellion? The rise of Site Reliability Engineers (SRE) | | | |
| The difference between Site Reliability Engineering, System Administration, and DevOps | | | |
| SRE in the Small and in the Large | | | |
| SBSRE Meetup: Different SRE roles and challenges (Netflix) | | | |
| Panel: Who/What Is SRE? | | | |
| Hope Is Not a Strategy | | | |
| Tenets of SRE | | | |
| Site Reliability Engineering Demystified | | | |
| Is Site Reliability Engineering the True ‘Ops’ in DevOps? | | | |
| SRE vs. DevOps vs. Cloud Native: The Server Cage Match | | | |
| SRE: What’s The Big Idea? | | | |
| Building the SRE Culture at LinkedIn | | | |
| Podcast #111 – SRE: Occasionally Maintaining Infrastructure That You Hate | | | |
| Splicing SRE DNA Sequences in the Biggest Software Company on the Planet | | | |
| Why should your app get SRE support? - CRE life lessons | | | |
| How SREs find the landmines in a service - CRE life lessons | | | |
| Making the most of an SRE service takeover - CRE life lessons | | | |
| The Cloudcast #301: SRE and Infrastructure Operations (Podcast) | | | |
| The SRE model | | | |
| Onboarding New Site Reliability Engineers | | | |
| Building Blocks for Site Reliability At Google | | | |
| Beyond Google SRE: What is Site Reliability Engineering like at Medium? | | | |
| Intelligent Site Reliability Engineering – A Machine Learning Perspective | | | |
| A crash course in LinkedIn's global site operations | | | |
| Google’s Site Reliability Engineering with Todd Underwood | | | |
| What is Site Reliability Engineering? (VMware) | | | |
| A Gentle Introduction to SRE | | | |
| Understanding Site Reliability Engineering through Movies and Books | | | |
| GOTO 2017 • Site Reliability Engineering at Google • Christof Leng | | | |
| The Makeup of Successful Geographically-Distributed SRE Teams (Part 1) | | | |
| Tech Leadership in SRE | | | |
| The Azure Podcast: Episode 227 - Azure SRE | | | |
| The human scalability of "DevOps" | | | |
| Podcast: Site Reliability Management with Mike Hiraga | | | |
| How a cat inspired system reliability at Knowlarity | | | |
| Getting Started with Site Reliability Engineering | 110 | over 7 years ago | |
| "Practical Applications of the Dickerson Pyramid" by Nat Welch | | | |
| LinkedIn’s Kurt Andersen Uncovers Blindspots in SRE Implementations | | | |
| Interview with Betsy Beyer, Stephen Thorne of Google | | | |
| Less Risk Through Greater Humanity - Dave Rensin | | | |
| Getting Started with SRE - Stephen Thorne, Google | | | |
| Building Successful SRE in Large Enterprises | | | |
| Solving Reliability Fears with Site Reliability Engineering | | | |
| SRE vs. DevOps: competing standards or close friends? | | | |
| How to Avoid the 5 SRE Implementation Traps that Catch Even the Best Teams | | | |
| Reliability Engineering – The Essential Discipline for Complex Systems | | | |
| The Modern Site Reliability Workbench on Top of OCI | | | |
| SRE in the Third Age | | | |
| About SRE and how (not) to apply it | | | |
| Transitioning a typical engineering ops team into an SRE powerhouse | | | |
| Making a Lion Bulletproof: SRE in Banking | | | |
| Identifying and tracking toil using SRE principles | | | |
| From Ops to SRE: Evolution of the OpenShift Dedicated Team | | | |
| Meeting reliability challenges with SRE principles | | | |
| A quick introduction to SRE principles | | | |
| The SRE I Aspire to Be | | | |
| Taming Operational Load with VMware CRE | | | |
| SRE Cultural Values | | | |
| Are we there yet? Thoughts on assessing an SRE team’s maturity | | | |
| What SREs have to do with project-based services? | | | |
| Making operational work more visible | | | |
| SRE vs. DevOps: What’s the Difference Between Them? | | | |
Awesome Site Reliability Engineering / Education |
| Panel: Educating SRE | | | |
| From Zero to Hero: Recommended Practices for Training your Ever-Evolving SRE Teams | | | |
| New to an SRE team? | | | |
| The Systems Engineering Side of Site Reliability Engineering | | | |
| Graduating from Bootcamp and interested in becoming a Site Reliability Engineer? | | | |
| So you want to be a Site Reliability Engineer? | | | |
| Spiraling Ops Debt & the SRE Coding Imperative | | | |
| So you want to be an SRE? | | | |
| Career Profiles/Site Reliability Engineer | | | |
| What is the role of a Site Reliability Engineer? | | | |
| Lynda.com: DevOps Foundations: Site Reliability Engineering | | | |
| Incident Management Training: Wheel of Misfortune | | | |
| Site Un-Reliability Engineering [Video Series] | | | |
| The Ultimate Guide to Structuring a 90-Day Onboarding Plan | | | |
| SRE fundamentals: SLIs, SLAs and SLOs | | | |
| How to Get Into SRE | | | |
| Do you have an SRE team yet? How to start and assess your journey | | | |
| How SRE teams are organized, and how to get started | | | |
| Why SRE Documents Matter | | | |
| How to get started with site reliability engineering (SRE) | | | |
| Duties of a Site Reliability Engineering Manager | | | |
| Designing distributed systems using NALSD flashcards | | | |
| Training Site Reliability Engineers: What Your Organization Needs to Create a Learning Program | | | |
| SRE Classroom: Distributed PubSub workshop | | | |
| School of SRE: Curriculum for onboarding non-traditional hires and new grads | | | |
Awesome Site Reliability Engineering / Books |
| Practical Linux Infrastructure | | | |
| Site Reliability Engineering: How Google Runs Production Systems | | | |
| The Site Reliability Workbook: Practical Ways to Implement SRE | | | |
| Observability Engineering: Achieving Production Excellence | | | |
| The Practice Of Cloud System Administration: Designing and Operating Large Distributed Systems | | | |
| Web Operations - Keeping the Data On Time | | | |
| The Checklist Manifesto: How to Get Things Right | | | |
| Microservices in Production - Standard Principles and Requirements | | | |
| Production-Ready Microservices - Building Standardized Systems Across an Engineering Organization | | | |
| Systems Performance: Enterprise and the Cloud | | | [Sample chapter titled |
| Monitoring Distributed Systems: Case Studies from Google's SRE Teams | | | |
| The Human Side of Postmortems: Managing Stress and Cognitive Biases | | | |
| Chaos Engineering: Building Confidence in System Behavior through Experiment | | | |
| Post-Incident Reviews: Learning from Failure for Improved Incident Responses | | | |
| Antifragile Systems and Teams | | | |
| How to Monitor the SRE Golden Signals (E-Book) | | | |
| Incident Management for Operations | | | |
| Real-World SRE | | | |
| Seeking SRE | | | |
| What is SRE? | | | |
| Engineering Reliable Mobile Applications: Strategies for Developing Resilient Native Mobile Applications | | | |
| Building Secure and Reliable Systems | | | |
| Chaos Engineering: Crash test your applications | | | |
| 97 Things Every SRE Should Know | | | |
| Four Steps to Creating Effective Game Day Tests | | | |
| The Linux Programming Interface | | | |
Awesome Site Reliability Engineering / Hiring |
| SRE Hiring | | | |
| Hiring SREs at LinkedIn | | | |
| Hiring Site Reliability Engineers | | | |
| Hiring your first SRE | | | |
| Growing the Site Reliability Team at LinkedIn: Hiring is Hard | | | |
| Engineering Manager - Site Reliability Engineering Interview Preparation | | | |
Awesome Site Reliability Engineering / Reliability |
| The Realities of the Job of Delivering Reliability | | | |
| Fail at Scale by Ben Maurer | | | |
| Embracing Failure: Fault-Injection and Service Reliability | | | |
| 10 Years of Crashing Google | | | |
| How we break things at Twitter: failure testing | | | |
| Reliable Cron across the Planet | | | |
| Push our limits - reliability testing at Twitter | | | |
| The Verification of a Distributed System by Caitie McCaffrey | | | |
| Weathering the Unexpected | | | |
| SRE Hour: Tech Talks by Box & Yelp | | | |
| Simplicity: A Prerequisite for Reliability | | | |
| The Two Sides to Google Infrastructure for Everyone Else | | | |
| How Embracing Continuous Release Reduced Change Complexity | | | |
| Making "Push On Green" a Reality | | | |
| BeyondCorp: A New Approach to Enterprise Security | | | |
| Brainstorming Failure by Jeff Smith | | | |
| The Ripple Effect Of Outages And Downtime Cannot Be Underestimated | | | |
| The infrastructure behind Twitter: efficiency and optimization | | | |
| Dickerson's Hierarchy of Reliability | | | |
| The Morning Paper on Operability | | | |
| Production is all that matters | | | |
| Using load shedding to survive a success disaster - CRE life lessons | | | |
| How to avoid a self-inflicted DDoS Attack - CRE life lessons | | | |
| Don't gamble when it comes to reliability | | | |
| Resilience Engineering: Learning to Embrace Failure | | | |
| The Infrastructure Behind Twitter: Scale | | | |
| Scaling Reliability at Twitter: So You Want to Add a 9 | | | |
| Principles Of Chaos Engineering | | | |
| Chaos Engineering | | | |
| Available...or not? That is the question - CRE life lessons | | | |
| How Google Backs Up The Internet Along With Exabytes Of Other Data | | | |
| Performance, Scalability, And High Availability: 3 Key Infrastructure Adaptability Requirements | | | |
| The Production Environment at Google (Part 1) | | | |
| Reliable releases and rollbacks - CRE life lessons | | | |
| How release canaries can save your bacon - CRE life lessons | | | |
| Things I Learned Managing Site Reliability for Some of the World’s Busiest Gambling Sites | | | |
| Every Day Is Monday in Operations | | | |
| Under the Hood: Ensuring Site Reliability | | | |
| Designing reliable systems with cloud infrastructure (Google Cloud Next '17) | | | |
| A Google SRE explores GitHub reliability with BigQuery | | | |
| Know thy enemy: how to prioritize and communicate risks - CRE life lessons | | | |
| Chaos Engineering resources | 6,025 | almost 2 years ago | |
| CRE life lessons: What is a dark launch, and what does it do for me? | | | |
| Why you should pick strong consistency, whenever possible | | | |
| The Network is Reliable | | | |
| Are You Load Balancing Wrong? | | | |
| How production engineers support global events on Facebook | | | |
| Google: A Collection Of Best Practices For Production Services | | | |
| Canary Analysis Service | | | |
| Tips for High Availability | | | |
| Progressive Service Architecture At Auth0 | | | |
| Google Cloud Production Guideline | | | |
| production readiness | | | |
| Trust By Design: The Fusion of Operational Maturity and Risk Modeling | | | |
| Top Seven Myths of Robust Systems | | | |
| Taming chaos: Preparing for your next incident | | | |
| PID Loops and the Art of Keeping Systems Stable | | | |
| Are you ready for production? | | | |
| Production Checklist for Web Apps on Kubernetes | | | |
| Finding a problem at the bottom of the Google stack | | | |
| Rethinking Task Size in SRE | | | |
| How maintenance windows affect your error budget | | | |
| The Production Readiness Spectrum | | | |
| Generic mitigations | | | |
| How we’re building a production readiness review process at Grafana Labs | | | |
| Resiliency Planning for High-Traffic Events | | | |
| Using Fault Injection Testing to Improve DoorDash Reliability | | | |
Awesome Site Reliability Engineering / Monitoring & Observability & Alerting |
| A Working Theory-of-Monitoring | | | |
| The Evolution of Monitoring Systems at Google - Tony Rippy | | | |
| Monitoring without Infrastructure @ Airbnb | | | |
| Monitoring distributed systems | | | |
| Observability at Uber Engineering: Past, Present, Future | | | |
| The 4 Golden Signals of API Health and Performance in Cloud-Native Applications | | | |
| My Philosophy on Alerting by Rob Ewaschuk | | | |
| Time To Detect - Netflix | | | |
| Why Percentiles Don’t Work the Way you Think | | | |
| Building Twitter’s Next-Gen Alerting System | | | |
| Instrumentation: Worst case performance matters | | | |
| Instrumentation: What does 'uptime' mean? | | | |
| Incidents + Outages at CircleCI: Our Playbook and What We’ve Learned | | | |
| An introduction to monitoring and alerting with timeseries at scale, with Prometheus | | | |
| Detecting outliers and anomalies in realtime at Datadog | | | |
| How to Monitor the SRE Golden Signals | | | |
| Monitoring in a DevOps World | | | |
| Monitoring Your Monitoring’s Monitoring | | | |
| Observability: the new wave or buzzword? | | | |
| Monitoring Isn't Observability | | | |
| Monitoring in the time of Cloud Native | | | |
| Principles of Monitoring Microservices | | | |
| The Many Ways Your Monitoring Is Lying to You | | | |
| GitOps Part 3 - Observability | | | |
| Want to Debug Latency? | | | |
| Debugging Latency in Go 1.11 | | | |
| Alerting on SLOs like Pros | | | |
| Applied Alerting Philosophy | | | |
| Observations on Observability | | | |
| Deploys: It's Not Actually About Fridays | | | |
| Site Reliability Engineering Best Practices for Data Pipelines | | | |
| Elastic Observability in SRE and Incident Response | | | |
| Error Budget Policy - Part 1 - Adoption at Expedia Group | | | |
| Error Budget Policy - Part 2 - Practices at Expedia Group | | | |
Awesome Site Reliability Engineering / On-Call |
| Being an On-Call Engineer: A Google SRE Perspective | | | |
| Inside Atlassian: how our site reliability engineers do incident management | | | |
| Inside Atlassian: how IT & SRE use ChatOps to run incident management | | | |
| Incident Response at Heroku | | | |
| Who's On Call? | | | |
| SysAdvent - Day 6 - No More On-Call Martyrs | | | |
| On Being On Call | | | |
| The On-Call Handbook | 402 | over 5 years ago | |
| Incident management at Google — adventures in SRE-land | | | |
| Run Book / Operations Manual template | 707 | about 6 years ago | |
| Automating Your Oncall: Open Sourcing Fossor and Ascii Etch | | | |
| Project STAR*: Streamlining Our On-Call Process | | | |
| SRE@Xero: Managing Incidents Part I | | | |
| SRE@Xero: Managing Incidents Part II | | | |
| How To Establish a High Severity Incident Management Program | | | |
| How Your Systems Keep Running Day After Day - John Allspaw | | | |
| On-call doesn’t have to suck | | | |
| Why, as a Netflix infrastructure manager, am I on call? | | | |
| Oncall and Sustainable Software Development | | | |
| On Call Rotations: How Best to Wake Devs Up in the Middle of the Night | | | |
| Understanding The Role Of The Incident Manager On-Call (IMOC) | | | |
| 3 Ways to Minimize the Impact of High Severity Incidents | | | |
| Advice to Management Teams While Enrolling Changes to On-Call Systems | | | |
| Moving Past Shallow Incident Data | | | |
| Sustainable On-Call | | | |
| dotScale 2017 - Aish Raj Dahal - Chaos management during a major incident | | | |
| Incident Management at Netflix Velocity | | | |
| Incidents, fixes, and the day after | | | |
| 10 Steps to Develop an Incident Response Plan You’ll ACTUALLY Use | | | |
| Checklists: a stupidly simple but valuable operational gift | | | |
| How to write a status page update | | | |
| Atlassian Incident Handbook | | | |
| PagerDuty Incident Response Handbook | | | |
| Avoiding Burnout for SREs | | | |
| Better On-Call the SRE way | | | |
| Managing Incidents at Monzo | | | |
| Making On-Call Not Suck | | | |
| How we (Monzo) respond to incidents | | | |
| How we’ve evolved on-call at Monzo | | | |
| Code Yellow: When Operations Isn’t Perfect | | | |
| MTTR is dead, long live CIRT | | | |
| Extended Dreyfus Model for Incident Lifecycles | 36 | about 7 years ago | |
| Inhumanity of Root Cause Analysis | | | |
| Incident insights from NASA, NTSB, and the CDC | | | |
| How to avoid On-Call Burnout the SRE Way | | | |
| My week shadowing a GitLab Site Reliability Engineer | | | |
| How our production team runs the weekly on-call handover | | | |
| Writing Runbook Documentation When You’re An SRE | | | |
| Incident response, programs and you(r startup) | | | |
| An Incident Command Training Handbook | | | |
| Shrinking the time to mitigate production incidents | | | |
| Incident writeup as sociological storytelling | | | |
| Elephant in the Blameless War Room: Accountability | | | |
| Naming names in incident writeups | | | |
| Building On-Call Culture at GitHub | | | |
Awesome Site Reliability Engineering / Post-Mortem |
| A collection of post-mortems | 11,336 | over 1 year ago | |
| Collection of Kubernetes Failure Stories | 6,232 | about 5 years ago | |
| Blameless PostMortems and a Just Culture | | | |
| A Tale of Postmortems | | | |
| Building a Blameless Post-Mortem Culture with Jason Hand | | | |
| The infinite hows | | | |
| Failure is Always An Option: How a Blameless Culture Leads to Better Results | | | |
| SysAdvent - Day 1 - Why You Need a Postmortem Process | | | |
| Etsy’s Debriefing Facilitation Guide for Blameless Postmortems | | | |
| Writing Your First Postmortem | | | |
| How to Write Great Outage Post-Mortems | | | |
| A collection of postmortem templates | 1,321 | over 2 years ago | |
| Embracing Feedback | | | |
| Postmortem Action Items: Plan the Work and Work the Plan | | | |
| Social Issues In Postmortems | | | |
| Google Has an Official Process in Place for Learning From Failure--and It's Absolutely Brilliant | | | |
| Postmortem culture: how you can learn from failure | | | |
| re:Work - Postmortem discussion template | | | |
| Post-mortems to the rescue | | | |
| Why Every Company Can Benefit from a Blameless Culture | | | |
| "It's dead, Jim": How we write an incident postmortem | | | |
| Our incident postmortem template | | | |
| Learn out of mistakes. Postmortems to the rescue. | | | |
| Improving Postmortem Practices with Veteran Google SRE, Steve McGhee | | | |
| Inhumanity of Root Cause Analysis | | | |
Awesome Site Reliability Engineering / Capacity Planning |
| Capacity Planning | | | |
| SouthBay SRE: Cloud Capacity Planning | | | |
| Intent-based Capacity Planning and Autoscaling with Kubernetes | | | |
| How do you do Capacity Planning | | | |
| How Back Market SREs prepared for Black Friday | | | |
Awesome Site Reliability Engineering / Service Level Agreement |
| If It's in the Cloud, Get It on Paper: Cloud Computing Contract Issues | | | |
| Service Level Agreements in the Cloud: Who cares? | | | |
| SysAdvent - Day 20 - How to set and monitor SLAs | | | |
| SLOs, SLIs, SLAs, oh my - CRE life lessons | | | |
| Service Levels and Error Budgets | | | |
| (Un)Reliability Budgets - Finding Balance between Innovation and Reliability | | | |
| The Calculus of Service Availability | | | |
| Availability Calculator: Calculate how much downtime should be permitted in your SLA | | | |
| Standardize cloud SLA availability with numerical performance data | | | |
| Best practices to develop SLAs for cloud computing | | | |
| A Practical Guide to SLAs | | | |
| Building good SLOs - CRE life lessons | | | |
| No Grumpy Humans and Other Site Reliability Engineering Lessons from Google | | | |
| Consequences of SLO violations — CRE life lessons | | | |
| Service Level Objectives in Practice | | | |
| SRE Consensus Building | | | |
| An example escalation policy — CRE life lessons | | | |
| Error Budget Calculator | | | |
| Understanding error budget overspend - part one - CRE life lessons | | | |
| Good housekeeping for error budgets - part two - CRE life lessons | | | |
| SRE fundamentals: SLIs, SLAs and SLOs | | | |
| SLOs & You: A Guide To Service Level Objectives | | | |
| Earning Our Wings: Stories and Findings From Operating a Large-scale Concourse Deployment | | | |
| Nines are Not Enough: Meaningful Metrics for Clouds | | | |
| How many nines is my storage system? | | | |
| Don't follow the sun. | | | |
| The Tyranny of the SLA | | | |
| Backblaze Durability is 99.999999999% — And Why It Doesn’t Matter | | | |
| DevOpsDays Chicago 2019 - The Art of SLOs | | | |
| The Art of SLOs Workshop Materials | | | |
| How to Include Latency in SLO-Based Alerting | | | |
| Succeeding With Service Level Objectives | | | |
| Putting customers first with SLIs and SLOs | | | |
| SRE Leadership: Have Tiered SLAs | | | |
| How SLOs Enable Fast, Reliable Application Delivery | | | |
| The Tail at Scale | | | |
| The Tail at Scale Revisited | | | |
| Defining SLOs for services with dependencies | | | |
| Service Level Disagreements | | | |
| How We Use Sloth to do SLO Monitoring and Alerting with Prometheus | | | |
| SLI Deep Dive | | | |
| Measuring Reliability in GCP: Step By Step SLO creation guide using Cloud Operation Sandbox | | | |
| SLO tracker | | | |
| SLO Alerting for Mortals | | | |
| SRE methods and climate change | | | |
| What made SLOs so messy (and what we can do about it) | | | |
| SLICK: Adopting SLOs for improved reliability | | | |
| Calculating composite SLA | | | |
| Best practices for setting SLOs and SLIs for modern, complex systems | | | |
Awesome Site Reliability Engineering / Performance |
| Performance Checklists for SREs | | | |
| South Bay SRE Meetup - Netflix Cloud Performance Team | | | |
| Software Performance Analysis Guided By SLOs | | | |
| A framework for pragmatic performance engineering | | | |
Awesome Site Reliability Engineering / Programming |
| Go Language for Ops and Site Reliability Engineering | | | |
| Go for SREs using Python | | | |
| Operability in Go | | | |
| Go Reliability and Durability at Dropbox | | | |
Awesome Site Reliability Engineering / Misc Articles |
| What is SRE (Site Reliability Engineering)? | | | |
| Here’s How Google Makes Sure It (Almost) Never Goes Down | | | |
| Are site reliability engineers the next data scientists? | | | |
| Site Reliability Engineers: "solving the most interesting problems" | | | |
| Site Reliability Engineers: the "world’s most intense pit crew" | | | |
| Site reliability engineering kicks rote tasks out of IT ops | | | |
| Notes on Site Reliability Engineering | | | |
| Adventures in SRE-land: Welcome to Google Mission Control | | | |
| Book Review: Site Reliability Engineering - How Google Runs Production Systems | | | |
| Site Reliability Engineers: “We solve cooler problems” | | | |
| SREcon17: Brave new world of site reliability engineering | | | |
| Open AWS guide | 35,774 | about 1 year ago | |
| Commentary on Site Reliability Engineering | | | |
| Site Reliability Engineering: 4 Things to Know | | | |
| Looking for SRE Success? Then Find the Intrapreneurs! | | | |
| What Team Structure is Right for DevOps to Flourish? | | | |
| Injured on Vacation? Applying Principles from Site Reliability Engineering to a Travel Emergency | | | |
| Building blameless working environment | | | |
| SRE Adoption Report | | | |
| SREs: The Happiest – and Highest Paid – in the Industry | | | |
| The Role of Site Reliability Engineering, Today and Tomorrow | | | |
| SRE as a Lifestyle Choice | | | |
| SRECon EMEA 2019 Recap | | | |
| Life of an SRE at Google - JC van Winkel | | | |
| Site Reliability Engineering for Native Mobile Apps - Abhijith Krishnappa | | | Case study: Halodoc adaptation of SRE principles for Native Mobile Apps |
| SRE Best Practices by InfraCloud | | | |
Awesome Site Reliability Engineering / Real-time Messaging |
| #sre channel at Hangops Slack | | | Discussion of Site Reliability Engineering generally |
| #incident_response channel at Hangops Slack | | | Discussion about Incident Response |
| USENIX SREcon Slack | | | |
Awesome Site Reliability Engineering / Blogs |
| Brendan Gregg's Blog | | | Highly Technical Blog Posts About Systems Internals, Performance and SRE |
| Everything Sysadmin | | | Blog Posts About SysAdmin/DevOps/SRE by Tom Limoncelli |
| High Scalability | | | Technical Blog Posts About Systems Architecture |
| rachelbythebay | | | Technical Blog Posts |
| Susan J. Fowler | | | Various blog posts about SRE, Software Engineering and Microservices |
| SysAdvent | | | One article for each day of December, ending on the 25th |
| Stephen Thorne's Blog | | | Blog Posts About SRE |
| Increment | | | A digital magazine about how teams build and operate software systems at scale |
| GopherSRE | | | Blog Posts about Go and SRE |
| Cindy Sridharan | | | Blog posts about distributed systems and their management |
| Blameless Blog | | | Blog posts about SRE culture and practices |
| Resilience Roundup | | | Weekly analysis of Resilience Engineering and Human Factors research designed for software systems |
| Squadcast Blog | | | Blog posts about SRE best practices, reliability, on-call and incident management |
| FireHydrant Blog | | | Posts about complex systems, incident response, and SRE best practices |
| Rootly Blog | | | Incident management best practices and guides |
| incident.io Blog | | | Guides, advice and resources on incident management and response |
| Logit.io Blog | | | Resources on log management, SRE and DevOps |
Awesome Site Reliability Engineering / Newsletters |
| DevOpsLinks | | | A weekly newsletter about SRE, SysAdmin and DevOps news, tools, tutorials and opinions |
| KubeWeekly | | | The weekly newsletter for all things Kubernetes. KubeWeekly is curated by Bob Killen, Chris Short, Craig Box, Kim McMahon and Michael Hausenblas |
| SRE Weekly | | | Weekly Site Reliability Newsletter |
| O’Reilly Systems Engineering and Operations Newsletter | | | Weekly systems engineering and operations news and insights from industry insiders |
| ChaosEngineering.news | | | Chaos Engineering newsletter. All things Chaos Engineering, directly to your inbox! |
| Monitoring Weekly | | | What's new in monitoring? Curated monitoring articles to your inbox each week |
| Observability news | | | Updates around observability (o11y) with a special focus on open source |
Awesome Site Reliability Engineering / Conferences & Meetups |
| SRECon Conferences | | | The Official SRE Conference |
| LISA Conferences | | | Prominent Conference About SysAdmin/DevOps/SRE |
| SRE Tech Talks | | | SRE Talks Hosted by Google |
| South Bay Site Reliability Engineering (Sunnyvale, CA) Meetup | | | A Group For Individuals Who Tackle Reliability Challenges For Web-Scale Systems |
| San Francisco Reliability Engineering | | | A Group Of People Who Are Passionate About Reliable, Performant Software Systems |
| Site Reliability Engineering Munich, Germany | | | SRE Meetup in the greater area of Oktoberfest city |
| ADDO - All Day DevOps | | | A 24 hour conference that is completely online and free |
| Site Reliability Engineering Paris, France | | | SRE Meetup in the city of light |
| Site Reliability Engineering India | | | SRE Meetup India |
Awesome Site Reliability Engineering / Twitter |
| Google SRE Twitter Account | | | Google's SRE Twitter Account |
| SREBook | | | The Official Twitter Account of Site Reliability Engineering Book |
| SREcon | | | SRECon's Official Twitter Account |
| SREWorkbook | | | The Official Twitter Account of Site Reliability Workbook |
| The SRE Dev | | | SRE-related Posts from |
| Twitter SRE | | | The Official Twitter Account of Twitter's SRE team |
| Twitter SRE Weekly | | | The Official Twitter Account of SRE Weekly Newsletter |
| USENIX Association | | | The Official USENIX Twitter Account |
Awesome Site Reliability Engineering / Resources |
| Awesome SRE Tools | 1,250 | 12 months ago | A curated list of Site Reliability and Production Engineering tools |
| List of Continuous Integration services | 3,723 | about 1 year ago | |
| SRE cheat sheet | 204 | over 3 years ago | A cheat sheet for Site Reliability Engineering principles and numbers |
Awesome Site Reliability Engineering / Podcasts |
| Blameless / Resilience in Action | | | |
| Google SRE Prodcast | | | |
| o11y Observability Podcast | | | |
| On Call Nightmares (retired) | | | |
| Making of the SRE Omelette | | | |