Awesome Site Reliability Engineering / Culture |
What is Site Reliability Engineering? | | | |
Keys To SRE by Ben Treynor | | | |
Google SRE Resources | | | |
Notes from Production Engineering by Pedro Canahuati | | | |
PostOps: Recovery from Operations | | | |
Love DevOps? Wait 'till you meet SRE | | | |
How Google Does Planet-Scale Engineering for Planet-Scale Infra | | | |
Site Reliability Engineering at Facebook | | | |
A History of Site Reliability Engineering at Uber | | | |
Case Study: Adopting SRE Principles at StackOverflow | | | |
Site Reliability Engineering at Dropbox | | | |
Site Reliability Engineers — Keeping Google up and running 24/7 | | | |
Site Reliability Engineering at Salesforce | | | |
From Sys Admin to Netflix SRE (video) | | | |
SRE@Google: Thousands of DevOps Since 2004 | | | |
Transactional System Administration Is Killing Us and Must be Stopped | | | |
A hierarchy of SRE needs | | | |
PostOps: A Non-Surgical Tale of Software, Fragility, and Reliability | | | |
SRE: An incomplete guide to cultural Narnia | | | |
Putting Together Great SRE Teams | | | |
Work at Google: Meet our Production Engineers for Site Reliability Hangout on Air | | | |
Toil: A Word Every Engineer Should Know | | | |
Engineering Reliability into Web Sites: Google SRE | | | |
DEVOPS & SRE AMA - Building High Performance Organizations | | | |
John Allspaw's AMA on Incident Analysis and Postmortems | | | |
Site Reliability Engineering with Paul Newson (Part 1) | | | |
How SysAdmins Devalue Themselves | | | |
The Softer Side of DevOps | | | |
SRE, noun. See also: confidence, trust. | | | |
Site Reliability Engineering with Stephen Weinberg | | | |
We are the Google Site Reliability team. We make Google’s websites work. Ask us Anything! | | | |
We are the Google Site Reliability Engineering team. Ask us Anything! | | | |
The Ops Identity Crisis | | | |
The Irreproducibility Of Bugs In Large-Scale Production Systems | | | |
SE-Radio Episode 276: Björn Rabenstein on Site Reliability Engineering | | | |
Microservices, DevOps and Production Complexity | | | |
Introducing Google Customer Reliability Engineering | | | |
Evolution or Rebellion? The rise of Site Reliability Engineers (SRE) | | | |
The difference between Site Reliability Engineering, System Administration, and DevOps | | | |
SRE in the Small and in the Large | | | |
SBSRE Meetup: Different SRE roles and challenges (Netflix) | | | |
Panel: Who/What Is SRE? | | | |
Hope Is Not a Strategy | | | |
Tenets of SRE | | | |
Site Reliability Engineering Demystified | | | |
Is Site Reliability Engineering the True ‘Ops’ in DevOps? | | | |
SRE vs. DevOps vs. Cloud Native: The Server Cage Match | | | |
SRE: What’s The Big Idea? | | | |
Building the SRE Culture at LinkedIn | | | |
Podcast #111 – SRE: Occasionally Maintaining Infrastructure That You Hate | | | |
Splicing SRE DNA Sequences in the Biggest Software Company on the Planet | | | |
Why should your app get SRE support? - CRE life lessons | | | |
How SREs find the landmines in a service - CRE life lessons | | | |
Making the most of an SRE service takeover - CRE life lessons | | | |
The Cloudcast #301: SRE and Infrastructure Operations (Podcast) | | | |
The SRE model | | | |
Onboarding New Site Reliability Engineers | | | |
Building Blocks for Site Reliability At Google | | | |
Beyond Google SRE: What is Site Reliability Engineering like at Medium? | | | |
Intelligent Site Reliability Engineering – A Machine Learning Perspective | | | |
A crash course in LinkedIn's global site operations | | | |
Google’s Site Reliability Engineering with Todd Underwood | | | |
What is Site Reliability Engineering? (VMware) | | | |
A Gentle Introduction to SRE | | | |
Understanding Site Reliability Engineering through Movies and Books | | | |
GOTO 2017 • Site Reliability Engineering at Google • Christof Leng | | | |
The Makeup of Successful Geographically-Distributed SRE Teams (Part 1) | | | |
Tech Leadership in SRE | | | |
The Azure Podcast: Episode 227 - Azure SRE | | | |
The human scalability of "DevOps" | | | |
Podcast: Site Reliability Management with Mike Hiraga | | | |
How a cat inspired system reliability at Knowlarity | | | |
Getting Started with Site Reliability Engineering | 110 | over 6 years ago | |
"Practical Applications of the Dickerson Pyramid" by Nat Welch | | | |
LinkedIn’s Kurt Andersen Uncovers Blindspots in SRE Implementations | | | |
Interview with Betsy Beyer, Stephen Thorne of Google | | | |
Less Risk Through Greater Humanity - Dave Rensin | | | |
Getting Started with SRE - Stephen Thorne, Google | | | |
Building Successful SRE in Large Enterprises | | | |
Solving Reliability Fears with Site Reliability Engineering | | | |
SRE vs. DevOps: competing standards or close friends? | | | |
How to Avoid the 5 SRE Implementation Traps that Catch Even the Best Teams | | | |
Reliability Engineering – The Essential Discipline for Complex Systems | | | |
The Modern Site Reliability Workbench on Top of OCI | | | |
SRE in the Third Age | | | |
About SRE and how (not) to apply it | | | |
Transitioning a typical engineering ops team into an SRE powerhouse | | | |
Making a Lion Bulletproof: SRE in Banking | | | |
Identifying and tracking toil using SRE principles | | | |
From Ops to SRE: Evolution of the OpenShift Dedicated Team | | | |
Meeting reliability challenges with SRE principles | | | |
A quick introduction to SRE principles | | | |
The SRE I Aspire to Be | | | |
Taming Operational Load with VMware CRE | | | |
SRE Cultural Values | | | |
Are we there yet? Thoughts on assessing an SRE team’s maturity | | | |
What SREs have to do with project-based services? | | | |
Making operational work more visible | | | |
SRE vs. DevOps: What’s the Difference Between Them? | | | |
Awesome Site Reliability Engineering / Education |
Panel: Educating SRE | | | |
From Zero to Hero: Recommended Practices for Training your Ever-Evolving SRE Teams | | | |
New to an SRE team? | | | |
The Systems Engineering Side of Site Reliability Engineering | | | |
Graduating from Bootcamp and interested in becoming a Site Reliability Engineer? | | | |
So you want to be a Site Reliability Engineer? | | | |
Spiraling Ops Debt & the SRE Coding Imperative | | | |
So you want to be an SRE? | | | |
Career Profiles/Site Reliability Engineer | | | |
What is the role of a Site Reliability Engineer? | | | |
Lynda.com: DevOps Foundations: Site Reliability Engineering | | | |
Incident Management Training: Wheel of Misfortune | | | |
Site Un-Reliability Engineering [Video Series] | | | |
The Ultimate Guide to Structuring a 90-Day Onboarding Plan | | | |
SRE fundamentals: SLIs, SLAs and SLOs | | | |
How to Get Into SRE | | | |
Do you have an SRE team yet? How to start and assess your journey | | | |
How SRE teams are organized, and how to get started | | | |
Why SRE Documents Matter | | | |
How to get started with site reliability engineering (SRE) | | | |
Duties of a Site Reliability Engineering Manager | | | |
Designing distributed systems using NALSD flashcards | | | |
Training Site Reliability Engineers: What Your Organization Needs to Create a Learning Program | | | |
SRE Classroom: Distributed PubSub workshop | | | |
School of SRE: Curriculum for onboarding non-traditional hires and new grads | | | |
Awesome Site Reliability Engineering / Books |
Practical Linux Infrastructure | | | |
Site Reliability Engineering: How Google Runs Production Systems | | | |
The Site Reliability Workbook: Practical Ways to Implement SRE | | | |
Observability Engineering: Achieving Production Excellence | | | |
The Practice Of Cloud System Administration: Designing and Operating Large Distributed Systems | | | |
Web Operations - Keeping the Data On Time | | | |
The Checklist Manifesto: How to Get Things Right | | | |
Microservices in Production - Standard Principles and Requirements | | | |
Production-Ready Microservices - Building Standardized Systems Across an Engineering Organization | | | |
Systems Performance: Enterprise and the Cloud | | | [Sample chapter titled |
Monitoring Distributed Systems: Case Studies from Google's SRE Teams | | | |
The Human Side of Postmortems: Managing Stress and Cognitive Biases | | | |
Chaos Engineering: Building Confidence in System Behavior through Experiment | | | |
Post-Incident Reviews: Learning from Failure for Improved Incident Responses | | | |
Antifragile Systems and Teams | | | |
How to Monitor the SRE Golden Signals (E-Book) | | | |
Incident Management for Operations | | | |
Real-World SRE | | | |
Seeking SRE | | | |
What is SRE? | | | |
Engineering Reliable Mobile Applications: Strategies for Developing Resilient Native Mobile Applications | | | |
Building Secure and Reliable Systems | | | |
Chaos Engineering: Crash test your applications | | | |
97 Things Every SRE Should Know | | | |
Four Steps to Creating Effective Game Day Tests | | | |
The Linux Programming Interface | | | |
Awesome Site Reliability Engineering / Hiring |
SRE Hiring | | | |
Hiring SREs at LinkedIn | | | |
Hiring Site Reliability Engineers | | | |
Hiring your first SRE | | | |
Growing the Site Reliability Team at LinkedIn: Hiring is Hard | | | |
Engineering Manager - Site Reliability Engineering Interview Preparation | | | |
Awesome Site Reliability Engineering / Reliability |
The Realities of the Job of Delivering Reliability | | | |
Fail at Scale by Ben Maurer | | | |
Embracing Failure: Fault-Injection and Service Reliability | | | |
10 Years of Crashing Google | | | |
How we break things at Twitter: failure testing | | | |
Reliable Cron across the Planet | | | |
Push our limits - reliability testing at Twitter | | | |
The Verification of a Distributed System by Caitie McCaffrey | | | |
Weathering the Unexpected | | | |
SRE Hour: Tech Talks by Box & Yelp | | | |
Simplicity: A Prerequisite for Reliability | | | |
The Two Sides to Google Infrastructure for Everyone Else | | | |
How Embracing Continuous Release Reduced Change Complexity | | | |
Making "Push On Green" a Reality | | | |
BeyondCorp: A New Approach to Enterprise Security | | | |
Brainstorming Failure by Jeff Smith | | | |
The Ripple Effect Of Outages And Downtime Cannot Be Underestimated | | | |
The infrastructure behind Twitter: efficiency and optimization | | | |
Dickerson's Hierarchy of Reliability | | | |
The Morning Paper on Operability | | | |
Production is all that matters | | | |
Using load shedding to survive a success disaster - CRE life lessons | | | |
How to avoid a self-inflicted DDoS Attack - CRE life lessons | | | |
Don't gamble when it comes to reliability | | | |
Resilience Engineering: Learning to Embrace Failure | | | |
The Infrastructure Behind Twitter: Scale | | | |
Scaling Reliability at Twitter: So You Want to Add a 9 | | | |
Principles Of Chaos Engineering | | | |
Chaos Engineering | | | |
Available...or not? That is the question - CRE life lessons | | | |
How Google Backs Up The Internet Along With Exabytes Of Other Data | | | |
Performance, Scalability, And High Availability: 3 Key Infrastructure Adaptability Requirements | | | |
The Production Environment at Google (Part 1) | | | |
Reliable releases and rollbacks - CRE life lessons | | | |
How release canaries can save your bacon - CRE life lessons | | | |
Things I Learned Managing Site Reliability for Some of the World’s Busiest Gambling Sites | | | |
Every Day Is Monday in Operations | | | |
Under the Hood: Ensuring Site Reliability | | | |
Designing reliable systems with cloud infrastructure (Google Cloud Next '17) | | | |
A Google SRE explores GitHub reliability with BigQuery | | | |
Know thy enemy: how to prioritize and communicate risks - CRE life lessons | | | |
Chaos Engineering resources | 6,002 | 11 months ago | |
CRE life lessons: What is a dark launch, and what does it do for me? | | | |
Why you should pick strong consistency, whenever possible | | | |
The Network is Reliable | | | |
Are You Load Balancing Wrong? | | | |
How production engineers support global events on Facebook | | | |
Google: A Collection Of Best Practices For Production Services | | | |
Canary Analysis Service | | | |
Tips for High Availability | | | |
Progressive Service Architecture At Auth0 | | | |
Google Cloud Production Guideline | | | |
production readiness | | | |
Trust By Design: The Fusion of Operational Maturity and Risk Modeling | | | |
Top Seven Myths of Robust Systems | | | |
Taming chaos: Preparing for your next incident | | | |
PID Loops and the Art of Keeping Systems Stable | | | |
Are you ready for production? | | | |
Production Checklist for Web Apps on Kubernetes | | | |
Finding a problem at the bottom of the Google stack | | | |
Rethinking Task Size in SRE | | | |
How maintenance windows affect your error budget | | | |
The Production Readiness Spectrum | | | |
Generic mitigations | | | |
How we’re building a production readiness review process at Grafana Labs | | | |
Resiliency Planning for High-Traffic Events | | | |
Using Fault Injection Testing to Improve DoorDash Reliability | | | |
Awesome Site Reliability Engineering / Monitoring & Observability & Alerting |
A Working Theory-of-Monitoring | | | |
The Evolution of Monitoring Systems at Google - Tony Rippy | | | |
Monitoring without Infrastructure @ Airbnb | | | |
Monitoring distributed systems | | | |
Observability at Uber Engineering: Past, Present, Future | | | |
The 4 Golden Signals of API Health and Performance in Cloud-Native Applications | | | |
My Philosophy on Alerting by Rob Ewaschuk | | | |
Time To Detect - Netflix | | | |
Why Percentiles Don’t Work the Way you Think | | | |
Building Twitter’s Next-Gen Alerting System | | | |
Instrumentation: Worst case performance matters | | | |
Instrumentation: What does 'uptime' mean? | | | |
Incidents + Outages at CircleCI: Our Playbook and What We’ve Learned | | | |
An introduction to monitoring and alerting with timeseries at scale, with Prometheus | | | |
Detecting outliers and anomalies in realtime at Datadog | | | |
How to Monitor the SRE Golden Signals | | | |
Monitoring in a DevOps World | | | |
Monitoring Your Monitoring’s Monitoring | | | |
Observability: the new wave or buzzword? | | | |
Monitoring Isn't Observability | | | |
Monitoring in the time of Cloud Native | | | |
Principles of Monitoring Microservices | | | |
The Many Ways Your Monitoring Is Lying to You | | | |
GitOps Part 3 - Observability | | | |
Want to Debug Latency? | | | |
Debugging Latency in Go 1.11 | | | |
Alerting on SLOs like Pros | | | |
Applied Alerting Philosophy | | | |
Observations on Observability | | | |
Deploys: It's Not Actually About Fridays | | | |
Site Reliability Engineering Best Practices for Data Pipelines | | | |
Elastic Observability in SRE and Incident Response | | | |
Error Budget Policy - Part 1 - Adoption at Expedia Group | | | |
Error Budget Policy - Part 2 - Practices at Expedia Group | | | |
Awesome Site Reliability Engineering / On-Call |
Being an On-Call Engineer: A Google SRE Perspective | | | |
Inside Atlassian: how our site reliability engineers do incident management | | | |
Inside Atlassian: how IT & SRE use ChatOps to run incident management | | | |
Incident Response at Heroku | | | |
Who's On Call? | | | |
SysAdvent - Day 6 - No More On-Call Martyrs | | | |
On Being On Call | | | |
The On-Call Handbook | 401 | over 4 years ago | |
Incident management at Google — adventures in SRE-land | | | |
Run Book / Operations Manual template | 705 | over 5 years ago | |
Automating Your Oncall: Open Sourcing Fossor and Ascii Etch | | | |
Project STAR*: Streamlining Our On-Call Process | | | |
SRE@Xero: Managing Incidents Part I | | | |
SRE@Xero: Managing Incidents Part II | | | |
How To Establish a High Severity Incident Management Program | | | |
How Your Systems Keep Running Day After Day - John Allspaw | | | |
On-call doesn’t have to suck | | | |
Why, as a Netflix infrastructure manager, am I on call? | | | |
Oncall and Sustainable Software Development | | | |
On Call Rotations: How Best to Wake Devs Up in the Middle of the Night | | | |
Understanding The Role Of The Incident Manager On-Call (IMOC) | | | |
3 Ways to Minimize the Impact of High Severity Incidents | | | |
Advice to Management Teams While Enrolling Changes to On-Call Systems | | | |
Moving Past Shallow Incident Data | | | |
Sustainable On-Call | | | |
dotScale 2017 - Aish Raj Dahal - Chaos management during a major incident | | | |
Incident Management at Netflix Velocity | | | |
Incidents, fixes, and the day after | | | |
10 Steps to Develop an Incident Response Plan You’ll ACTUALLY Use | | | |
Checklists: a stupidly simple but valuable operational gift | | | |
How to write a status page update | | | |
Atlassian Incident Handbook | | | |
PagerDuty Incident Response Handbook | | | |
Avoiding Burnout for SREs | | | |
Better On-Call the SRE way | | | |
Managing Incidents at Monzo | | | |
Making On-Call Not Suck | | | |
How we (Monzo) respond to incidents | | | |
How we’ve evolved on-call at Monzo | | | |
Code Yellow: When Operations Isn’t Perfect | | | |
MTTR is dead, long live CIRT | | | |
Extended Dreyfus Model for Incident Lifecycles | 36 | about 6 years ago | |
Inhumanity of Root Cause Analysis | | | |
Incident insights from NASA, NTSB, and the CDC | | | |
How to avoid On-Call Burnout the SRE Way | | | |
My week shadowing a GitLab Site Reliability Engineer | | | |
How our production team runs the weekly on-call handover | | | |
Writing Runbook Documentation When You’re An SRE | | | |
Incident response, programs and you(r startup) | | | |
An Incident Command Training Handbook | | | |
Shrinking the time to mitigate production incidents | | | |
Incident writeup as sociological storytelling | | | |
Elephant in the Blameless War Room: Accountability | | | |
Naming names in incident writeups | | | |
Building On-Call Culture at GitHub | | | |
Awesome Site Reliability Engineering / Post-Mortem |
A collection of post-mortems | 11,309 | 4 months ago | |
Collection of Kubernetes Failure Stories | 6,232 | about 4 years ago | |
Blameless PostMortems and a Just Culture | | | |
A Tale of Postmortems | | | |
Building a Blameless Post-Mortem Culture with Jason Hand | | | |
The infinite hows | | | |
Failure is Always An Option: How a Blameless Culture Leads to Better Results | | | |
SysAdvent - Day 1 - Why You Need a Postmortem Process | | | |
Etsy’s Debriefing Facilitation Guide for Blameless Postmortems | | | |
Writing Your First Postmortem | | | |
How to Write Great Outage Post-Mortems | | | |
A collection of postmortem templates | 1,314 | over 1 year ago | |
Embracing Feedback | | | |
Postmortem Action Items: Plan the Work and Work the Plan | | | |
Social Issues In Postmortems | | | |
Google Has an Official Process in Place for Learning From Failure--and It's Absolutely Brilliant | | | |
Postmortem culture: how you can learn from failure | | | |
re:Work - Postmortem discussion template | | | |
Post-mortems to the rescue | | | |
Why Every Company Can Benefit from a Blameless Culture | | | |
"It's dead, Jim": How we write an incident postmortem | | | |
Our incident postmortem template | | | |
Learn out of mistakes. Postmortems to the rescue. | | | |
Improving Postmortem Practices with Veteran Google SRE, Steve McGhee | | | |
Inhumanity of Root Cause Analysis | | | |
Awesome Site Reliability Engineering / Capacity Planning |
Capacity Planning | | | |
SouthBay SRE: Cloud Capacity Planning | | | |
Intent-based Capacity Planning and Autoscaling with Kubernetes | | | |
How do you do Capacity Planning | | | |
How Back Market SREs prepared for Black Friday | | | |
Awesome Site Reliability Engineering / Service Level Agreement |
If It's in the Cloud, Get It on Paper: Cloud Computing Contract Issues | | | |
Service Level Agreements in the Cloud: Who cares? | | | |
SysAdvent - Day 20 - How to set and monitor SLAs | | | |
SLOs, SLIs, SLAs, oh my - CRE life lessons | | | |
Service Levels and Error Budgets | | | |
(Un)Reliability Budgets - Finding Balance between Innovation and Reliability | | | |
The Calculus of Service Availability | | | |
Availability Calculator: Calculate how much downtime should be permitted in your SLA | | | |
Standardize cloud SLA availability with numerical performance data | | | |
Best practices to develop SLAs for cloud computing | | | |
A Practical Guide to SLAs | | | |
Building good SLOs - CRE life lessons | | | |
No Grumpy Humans and Other Site Reliability Engineering Lessons from Google | | | |
Consequences of SLO violations — CRE life lessons | | | |
Service Level Objectives in Practice | | | |
SRE Consensus Building | | | |
An example escalation policy — CRE life lessons | | | |
Error Budget Calculator | | | |
Understanding error budget overspend - part one - CRE life lessons | | | |
Good housekeeping for error budgets - part two - CRE life lessons | | | |
SRE fundamentals: SLIs, SLAs and SLOs | | | |
SLOs & You: A Guide To Service Level Objectives | | | |
Earning Our Wings: Stories and Findings From Operating a Large-scale Concourse Deployment | | | |
Nines are Not Enough: Meaningful Metrics for Clouds | | | |
How many nines is my storage system? | | | |
Don't follow the sun. | | | |
The Tyranny of the SLA | | | |
Backblaze Durability is 99.999999999% — And Why It Doesn’t Matter | | | |
DevOpsDays Chicago 2019 - The Art of SLOs | | | |
The Art of SLOs Workshop Materials | | | |
How to Include Latency in SLO-Based Alerting | | | |
Succeeding With Service Level Objectives | | | |
Putting customers first with SLIs and SLOs | | | |
SRE Leadership: Have Tiered SLAs | | | |
How SLOs Enable Fast, Reliable Application Delivery | | | |
The Tail at Scale | | | |
The Tail at Scale Revisited | | | |
Defining SLOs for services with dependencies | | | |
Service Level Disagreements | | | |
How We Use Sloth to do SLO Monitoring and Alerting with Prometheus | | | |
SLI Deep Dive | | | |
Measuring Reliability in GCP: Step By Step SLO creation guide using Cloud Operation Sandbox | | | |
SLO tracker | | | |
SLO Alerting for Mortals | | | |
SRE methods and climate change | | | |
What made SLOs so messy (and what we can do about it) | | | |
SLICK: Adopting SLOs for improved reliability | | | |
Calculating composite SLA | | | |
Best practices for setting SLOs and SLIs for modern, complex systems | | | |
Awesome Site Reliability Engineering / Performance |
Performance Checklists for SREs | | | |
South Bay SRE Meetup - Netflix Cloud Performance Team | | | |
Software Performance Analysis Guided By SLOs | | | |
A framework for pragmatic performance engineering | | | |
Awesome Site Reliability Engineering / Programming |
Go Language for Ops and Site Reliability Engineering | | | |
Go for SREs using Python | | | |
Operability in Go | | | |
Go Reliability and Durability at Dropbox | | | |
Awesome Site Reliability Engineering / Misc Articles |
What is SRE (Site Reliability Engineering)? | | | |
Here’s How Google Makes Sure It (Almost) Never Goes Down | | | |
Are site reliability engineers the next data scientists? | | | |
Site Reliability Engineers: "solving the most interesting problems" | | | |
Site Reliability Engineers: the "world’s most intense pit crew" | | | |
Site reliability engineering kicks rote tasks out of IT ops | | | |
Notes on Site Reliability Engineering | | | |
Adventures in SRE-land: Welcome to Google Mission Control | | | |
Book Review: Site Reliability Engineering - How Google Runs Production Systems | | | |
Site Reliability Engineers: “We solve cooler problems” | | | |
SREcon17: Brave new world of site reliability engineering | | | |
Open AWS guide | 35,742 | 3 months ago | |
Commentary on Site Reliability Engineering | | | |
Site Reliability Engineering: 4 Things to Know | | | |
Looking for SRE Success? Then Find the Intrapreneurs! | | | |
What Team Structure is Right for DevOps to Flourish? | | | |
Injured on Vacation? Applying Principles from Site Reliability Engineering to a Travel Emergency | | | |
Building blameless working environment | | | |
SRE Adoption Report | | | |
SREs: The Happiest – and Highest Paid – in the Industry | | | |
The Role of Site Reliability Engineering, Today and Tomorrow | | | |
SRE as a Lifestyle Choice | | | |
SRECon EMEA 2019 Recap | | | |
Life of an SRE at Google - JC van Winkel | | | |
Site Reliability Engineering for Native Mobile Apps - Abhijith Krishnappa | | | Case study: Halodoc's adaptation of SRE principles for native mobile apps |
SRE Best Practices by InfraCloud | | | |
Awesome Site Reliability Engineering / Real-time Messaging |
#sre channel at Hangops Slack | | | Discussion of Site Reliability Engineering generally |
#incident_response channel at Hangops Slack | | | Discussion about Incident Response |
USENIX SREcon Slack | | | |
Awesome Site Reliability Engineering / Blogs |
Brendan Gregg's Blog | | | Highly Technical Blog Posts About Systems Internals, Performance and SRE |
Everything Sysadmin | | | Blog Posts About SysAdmin/DevOps/SRE by Tom Limoncelli |
High Scalability | | | Technical Blog Posts About Systems Architecture |
rachelbythebay | | | Technical Blog Posts |
Susan J. Fowler | | | Various blog posts about SRE, Software Engineering and Microservices |
SysAdvent | | | One article for each day of December, ending on the 25th |
Stephen Thorne's Blog | | | Blog Posts About SRE |
Increment | | | A digital magazine about how teams build and operate software systems at scale |
GopherSRE | | | Blog Posts about Go and SRE |
Cindy Sridharan | | | Blog posts about distributed systems and their management |
Blameless Blog | | | Blog posts about SRE culture and practices |
Resilience Roundup | | | Weekly analysis of Resilience Engineering and Human Factors research designed for software systems |
Squadcast Blog | | | Blog posts about SRE best practices, reliability, on-call and incident management |
FireHydrant Blog | | | Posts about complex systems, incident response, and SRE best practices |
Rootly Blog | | | Incident management best practices and guides |
incident.io Blog | | | Guides, advice and resources on incident management and response |
Logit.io Blog | | | Resources on log management, SRE and DevOps |
Awesome Site Reliability Engineering / Newsletters |
DevOpsLinks | | | A weekly newsletter about SRE, SysAdmin and DevOps news, tools, tutorials and opinions |
KubeWeekly | | | The weekly newsletter for all things Kubernetes. KubeWeekly is curated by Bob Killen, Chris Short, Craig Box, Kim McMahon and Michael Hausenblas |
SRE Weekly | | | Weekly Site Reliability Newsletter |
O’Reilly Systems Engineering and Operations Newsletter | | | Weekly systems engineering and operations news and insights from industry insiders |
ChaosEngineering.news | | | Chaos Engineering newsletter. All things Chaos Engineering, directly to your inbox! |
Monitoring Weekly | | | What's new in monitoring? Curated monitoring articles to your inbox each week |
Observability news | | | Updates around observability (o11y) with a special focus on open source |
Awesome Site Reliability Engineering / Conferences & Meetups |
SRECon Conferences | | | The Official SRE Conference |
LISA Conferences | | | Prominent Conference About SysAdmin/DevOps/SRE |
SRE Tech Talks | | | SRE Talks Hosted by Google |
South Bay Site Reliability Engineering (Sunnyvale, CA) Meetup | | | A Group For Individuals Who Tackle Reliability Challenges For Web-Scale Systems |
San Francisco Reliability Engineering | | | A Group Of People Who Are Passionate About Reliable, Performant Software Systems |
Site Reliability Engineering Munich, Germany | | | SRE Meetup in the greater area of Oktoberfest city |
ADDO - All Day DevOps | | | A 24 hour conference that is completely online and free |
Site Reliability Engineering Paris, France | | | SRE Meetup in the city of light |
Site Reliability Engineering India | | | SRE Meetup India |
Awesome Site Reliability Engineering / Twitter |
Google SRE Twitter Account | | | Google's SRE Twitter Account |
SREBook | | | The Official Twitter Account of Site Reliability Engineering Book |
SREcon | | | SREcon's Official Twitter Account |
SREWorkbook | | | The Official Twitter Account of Site Reliability Workbook |
The SRE Dev | | | SRE-related Posts from |
Twitter SRE | | | The Official Twitter Account of Twitter's SRE team |
Twitter SRE Weekly | | | The Official Twitter Account of SRE Weekly Newsletter |
USENIX Association | | | The Official USENIX Twitter Account |
|
Awesome SRE Tools | 1,228 | 5 days ago | A curated list of Site Reliability and Production Engineering tools |
List of Continuous Integration services | 3,691 | about 1 month ago | |
SRE cheat sheet | 203 | over 2 years ago | A cheat sheet for Site Reliability Engineering principles and numbers |
Awesome Site Reliability Engineering / Podcasts |
Blameless / Resilience in Action | | | |
Google SRE Prodcast | | | |
o11y Observability Podcast | | | |
On Call Nightmares (retired) | | | |
Making of the SRE Omelette | | | |