Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon
Real-World SRE
Real-World SRE

Real-World SRE: The Survival Guide for Responding to a System Outage and Maximizing Uptime

eBook
AU$33.99 AU$48.99
Paperback
AU$60.99
Subscription
Free Trial
Renews at AU$24.99p/m

What do you get with a Packt Subscription?

Free for first 7 days. $24.99 p/m after that. Cancel any time!
Product feature icon Unlimited ad-free access to the largest independent learning library in tech. Access this title and thousands more!
Product feature icon 50+ new titles added per month, including many first-to-market concepts and exclusive early access to books as they are being written.
Product feature icon Innovative learning tools, including AI book assistants, code context explainers, and text-to-speech.
Product feature icon Thousands of reference materials covering every tech concept you need to stay up to date.
Subscribe now
View plans & pricing
Table of content icon View table of contents Preview book icon Preview Book

Real-World SRE

Chapter 1. Introduction

As the internet has grown, people have become used to having access to content all of the time, from a variety of devices. This means that the reputation of a brand has slowly become connected with the responsiveness and reliability of its products. People choose Google for searching because it always returns relevant and useful results quickly. People share content on Twitter because their message will be seen in real time by their followers. Netflix's great content selection is useless if it cannot deliver consistently on a variety of network speeds. As this reliability has become more important to businesses, a specialization focused on software reliability has emerged: Site Reliability Engineering (SRE). This chapter will introduce you to the field and also describe what you will learn from this book, helping you to write software to navigate the ever-changing internet landscape.

Before we explain what the field and role of SRE pertains to, let us start with a thought experiment. Imagine that it's early in the morning and you wake up to a screenshot of a blank web page in a text message from a friend with the caption: "I can't load your website."

If your personal website is indeed down, maybe you will message back with an, "I'll check it after breakfast," or an, "Oh yeah, been meaning to look into that." If it is your company's website, or maybe the page hosting your resume that you just sent to 15 possible employers, then a stream of expletives and indecipherable emojis will probably erupt from your mouth and in your text message back. This is because, for many businesses, websites have become the main source of incoming business. For some companies, like Facebook, Amazon, or iFixit, their entire business is a website. For other businesses, like restaurants or advertising agencies, a website acts as a way for people interested in the organization to learn more. It is often part of the marketing flow that helps companies to grow.

Introduction

It is probably impossible to completely remove the adrenaline spike that comes from discovering a website is down if you are responsible for fixing it. However, we can work to set up a framework to limit how often things break. We can create a world where responding to outages is easy, and transition from, "Oh god, everything is on fire, what do I do?!" to "Oh hey, a page isn't loading, so let's check out what's having a rough day."

This chapter is our introduction to the book and the field of SRE. We will cover the following topics in the next few pages:

  • Exploring a brief history of the people who work on information systems
  • Defining what SRE is
  • Describing what is in the book and providing a rough framework for SRE.

A brief history

SRE is a relatively new field, but it is a slightly different take on many existing ideas. In 1958, the term IT was coined in the Harvard Business Review, and eventually became the descriptor for the maintenance of technology used for collecting, storing, and distributing data and information. At that time, computers were transitioning toward having integrated circuits, but they were still the size of a room and were maintained and programmed by a team of people. As computers shrank, that team started focusing on multiple computers. Over time, some people started to specialize in programming those computers, and others focused on keeping them running. "Dumb terminals" would connect to a single computer, which was maintained by a team while programmers and users used the terminals.

Eventually, these maintainers started taking care of both the machines that individuals used, as well as large arrays of machines that provided services. Users would use a word processor on their local machine, and then upload files to a remote machine. Those who maintained the remote machines became known as system engineers, system administrators, and system operators.

As computers became smaller and more commodified, programmers began spending more time interacting with infrastructure, and configuring their software and infrastructure to work together well. On the other end, system admins were writing more and more complex code to maintain infrastructure. The closer these teams became, the more they began working together. In smaller teams, often, people would start focusing on both code for infrastructure and business code. In larger organizations, teams were created that focused on tools for managing infrastructure in reliable ways, so that product teams could quickly and easily manage the infrastructure they needed. These joint teams were often described as SRE or DevOps (developer and operations) teams.

Benjamin Treynor Sloss of Google, often referred to as just Treynor, says in Google's Site Reliability Engineering book, "SRE is what happens when you ask a software engineer to design an operations team." He is often credited with the creation of the idea that operations work is now just a specialization of software engineering. Given Google's success with reliability, the idea has caught on at many companies.

SRE is still a burgeoning field and, like DevOps, is often used to describe roles that include a wide diversity of work. Some companies give the title of SRE to a position, but it is much closer to a traditional system admin role. You can use this book's framework to evaluate a job before you apply for it, however, the goal of this book is to introduce you to the SRE mindset and help you to apply it to an organization, regardless of your past experience in the tech world.

What is SRE?

SRE is an exciting field. As mentioned earlier, it has evolved from a long line of roles and, as it is a relatively new field, its definition is steadily changing. SRE is an extension and evolution of many past concepts and, as such, concepts relevant to SRE apply to many roles, including but not exclusive to, backend engineering, DevOps, systems engineering, systems administration, operations, and so on. Depending on the company, these roles can involve very similar or very different responsibilities. The point is that, no matter what your job title is, you can apply SRE principles to your role.

In an attempt to define the field, we can learn a lot from its full name, Site Reliability Engineering:

Merging these three definitions, we get something like, "The field focused on working artfully to bring about a website that performs consistently well." While this definition could use some brushing up, it suits our needs for now. If you work, or know people who work, in the web development or software engineering world and you ask them what SRE means, then they may ask you, "Isn't that like X?" To someone from that background, X might be "DevOps," "ops," "platform engineering," "infrastructure engineering," "24/7 engineering," "a sysadmin," and so on.

This variation of answers presents the first problem we will see throughout this book: every organization is different. SRE's primary goal is making a website perform consistently according to our previous definition, which is difficult because it is dependent on the organization, the business around that organization, and the website's (or product's) requirements. One of the primary goals of this book is to present a framework that you can apply even if you do not belong to an organization with any of the aforementioned roles. The framework should be effective if you work for yourself, and it should also work if you are employed by some gigantic international multi-headed Hydra organization, and anything in between.

I worked as an SRE in 2016 for Hillary for America. It was the lead organization (but definitely not the only one) working to help to elect Hillary Clinton as President of the United States of America. We were not successful, and while this example immediately dates this book, I found it to have the most concrete separation of concerns between the parts of a website that I have ever worked on. The organization was hyper-focused on one goal (electing Hillary Clinton as president), so it had a very explicit list of goals that made my job a lot easier.

There were many separate parts of the campaign that the technology team worked on, including a mobile application, different websites, data pipelines, and large databases. To keep this simple though, and to explain what I mean by a separation of concerns, let me use three separate websites that we built and maintained as an example:

What is SRE?

Figure 1: Screenshots of different parts of hillaryclinton.com, courtesy of the Hillary for America design team. From left to right: the header on the home page, a page about Nevada, a page about Hillary's policies, Hillary's home page in Spanish, the campaign blog, and the donate page.

The home page was a general landing page. It needed to be available during the hours that people in North America were awake (as our target audience was mostly based in the United States), but very few people visited the home page unless driven there.

The main reason you would go to https://www.hillaryclinton.com/ was if you were sent there, not because it was part of your daily browsing like you would visit Twitter or Reddit. Surrogates speaking at rallies, on the radio, or on television supporting Hillary Clinton would often say things like, "Go to hillaryclinton.com now to sign up," or "hillaryclinton.com has more details on her policies on this topic." A five-minute outage here and there was OK, because of this semipredictable traffic spike, but like many media organizations, there were no guarantees of when a large spike of traffic would occur.

The donate page always needed to be up. According to our product team and senior leadership, the donate page's availability was priority number one. If people could not give money, then the campaign might not be able to pay people's salaries or get the candidate to her speaking engagements. The donation site was not the only way that the campaign made money, but it was a significant source of income.

The voter registration page only needed to be fully available when there was an election coming soon. This was because the page let people say they were going to vote for Hillary Clinton and find their nearest polling location. While the donate page needed to be available for the majority of the campaign (May 2015 through to November 2016), the voter registration page only really needed to be available during the lead up to the primary election (September through to November of 2016). If we had built the voter registration page earlier in the election, it also would have been needed in the days leading up to the primaries, but then only for states that were voting on those days. Primary elections are a precursor to the general election and happen from February to June, with different states voting on different days.

The key here is that different websites and features have different requirements and a different definition of being reliable. Nothing will ever be perfect, nor is 100% uptime achievable on the internet, because things are always breaking. So, all we can do is figure out what sort of failures we might have and optimize our product to be resilient in a way that is useful for us. SRE isn't just the analysis of systems; it is also the architecting and building of systems so that they meet the requirements of the product.

Tip

Software on the internet can never be fully reliable for two reasons. The first reason is that the internet is a distributed system and, often, parts fail, which will affect your service's availability. The second reason is that humans write software, and that software will often have bugs, which will also cause outages.

Often, the job of someone working in SRE is to take in reliability requirements for software, and its infrastructure, and then figure out how to make the infrastructure meet those requirements. Steps toward this often require figuring out if existing infrastructure is meeting those needs, collaborating with teams (or people writing software that will run on the infrastructure), evaluating external tools, or just designing and writing what you need yourself.

As I mentioned at the beginning of the chapter, an SRE role can be very diverse. The requirements of an SRE position at a Fortune 500 company can be very different to those of a 20-person video game company. The role could be different at a bank in the USA from a role at a bank on the other side of the world. This is because the organization is different. For smaller organizations, someone working as an SRE may handle everything in the organization related to infrastructure and reliability. On the other hand, larger organizations may have multiple teams of SREs working with many diverse teams of developers. The role between two different banks could be different because of each bank's needs.

A local bank may only need someone to improve the reliability of tools for people who work for the bank, while a much larger bank in London may need someone who can make sure their bank's systems can make trades at very high speeds with the London Stock Exchange or support millions of individual customers. This book will provide a structure for anyone interested in becoming an SRE. The goal is to empower you, no matter your background or current situation. It will not be a panacea but will provide a knowledge base and a framework for making sites more reliable and moving your career forwards.

What is in the book?

I worked as an SRE at Google for four years, and that is where I started specializing, moving away from being a full stack engineer, and instead considering myself an SRE. Google had lots of internal education courses, and when I left, I found it difficult to continue my education. I also quickly discovered that SRE at Google is a very different beast than SRE at much smaller organizations. I decided to write this book for people interested in starting with SRE or applying it to organizations that are much smaller than Google.

To do this, the book is broken up into two parts. The first eight chapters walk through the hierarchy of reliability. This hierarchy was originally designed by Mikey Dickerson of the United States Digital Service (and– surprise, surprise –Google). The hierarchy says that as you are trying to add reliability to a system, you need to walk through each level before you get to the next one.

The following diagram shows a slightly modified version of Mikey's original pyramid. I have updated it to include the all-encompassing aspect of communication:

What is in the book?

Figure 2: This seven-layer pyramid is encircled with communication. Each layer builds upon and needs the previous layer. It is surrounded by communication because each layer needs communication to succeed.

Let us walk through the layers as a preview of what you can expect in each chapter.

  • Chapter 2, Monitoring: The first level is monitoring, which makes sure that you have insight into a system, tracking health, availability, and what is happening internally in the system. Monitoring is not just tools though, because it also requires communication. Monitoring is a very contentious part of SRE and operations because, depending on implementation, it can either be very useful or very pointless. Figuring out what to monitor, how to monitor it, where to store the monitoring data, who can access historical monitoring data, and how to look at data often takes time. Many people in your engineering organization will have opinions on these points based on past experiences.

    Some engineers will have had bad experiences and will not think monitoring is worth the investment, whereas others will have religious zealotry toward certain tools, and some will just ignore you. This chapter will help you to navigate all of these competing opinions and find and create the implementation that is best for your project and team.

  • Chapter 3, Incident Response: The next level is incident response. If something is broken, how do you alert people and respond? While tools help with this, as they define the rules by which to alert humans, most of incident response is about defining policy and setting up training so humans know what to do when they get alerts. If team members see an automated message in Slack, what should they do? If they get a phone call, how quickly do they need to respond? Will employees be paid extra if they have to work on a Saturday due to an outage? These are all questions we will address in the What is incident response section. Setting up on-call rotations, best practices for working together as a team, and building infrastructure to make incidents as low-stress as possible will also be covered.
  • Chapter 4, Postmortems: The third level is postmortems. Once you have had an outage, how do you make sure the problem does not happen again? Should you have a meeting about your incident? Does there need to be documentation? In this chapter, we will consider how to talk about past incidents and make it an enjoyable process for all involved. Postmortems are the act of recording for history how an incident happened, how the team fixed it, and how the team is working to prevent another similar incident in the future. We want to set up a culture of blameless and transparent postmortems, so people can work together.

    Individuals should not be afraid of incidents, but rather feel confident that if an incident happens, the team will respond and improve the system for the future, instead of focusing on the shame and anger that can come with failure. Incidents are things to learn from, not things to be afraid and ashamed of!

  • Chapter 5, Testing and Releasing: The fourth level is testing and releasing your software. In this chapter, we will be talking about the tooling and strategies that can be used to test and release software. This level in the hierarchy is our first level where instead of focusing on things that have happened, we focus on prevention. Prevention is about trying to limit the number of incidents that happen and also making sure that infrastructure and services stay stable when releasing new code. The chapter will talk about how to focus on all of the different types of testing that exist and make them useful for you and your team. It will also explore releasing software, when to use methodologies like continuous deployment, and some tools you can use.
  • Chapter 6, Capacity Planning: The fifth level is capacity planning. While Chapter 5, Testing and Releasing focused on the current world, this chapter is all about predicting the future and finding the limits of your system. Capacity planning is also about making sure you can grow over time. Once you are monitoring your system, and running a reliable system, you can start thinking about how to grow it over time, and how to find and anticipate bottlenecks and resource limits. In this chapter, we will talk about planning for long-term growth, writing budgets, communicating with outside teams about the future, and things to keep in mind as your service shrinks and grows.
  • Chapter 7, Building Tools: The sixth level is the development of new tools and services. SRE is not only about operations but also about software development. We hope SREs will spend around half of their time developing new tools and services. Some of these tools will exist to automate tasks that an employee has been doing by hand, while others will exist to improve another part of the hierarchy, such as automated load testing, or services to improve performance. In this chapter, we will talk about finding these projects, defining them, planning them, and building them. We will also talk about communicating their usefulness to your fellow engineers.
  • Chapter 8, User Experience: The final tier is user experience, which is about making sure the user has a good experience. We'll talk about measuring performance, working with user researchers, and defining what a good experience means to your team. We will also discuss how the experience of a tool and processes can cause outages. The goal is to make sure that, no matter the tool, or the user, people enjoy using it, understand how to use it, and cannot easily hurt themselves with it.

    Nori Heikkinen, an SRE at Google with many years of experience, adds that "the hierarchy does not include prevention, partly because 100% uptime is impossible, and partly because the bottom three needs in the hierarchy must be addressed within an organization before prevention can be examined." (https://www.infoq.com/news/2015/06/too-big-to-fail)

    The last two chapters of this book are a cheat section and introduction to common useful topics.

  • Chapter 9, Networking Foundations: This is a selection of tools and definitions of important ideas in networking. We discuss network packets, DNS, UDP and TCP, and lots of other things. After this chapter you should feel like you know the basics of networking, and the ability to research more advanced topics.
  • Chapter 10, Linux and Cloud Foundations: This is a selection of tools and important concepts involved in Linux and modern cloud products. We cover what the Linux kernel is, common parts of public clouds, and other topics. After this chapter you should feel like you know the basics of Linux and most public cloud products. Afterwards you should feel comfortable researching specific clouds and more advanced Linux topics.

SRE as a framework for new projects

One way to use this book is as a framework for working on a new project. As each chapter is about a different level of the hierarchy, you can work through the book to figure out where in the hierarchy your project sits. If it is a new project, then often it will be right at the bottom of the hierarchy, with no, or very little, monitoring implemented.

At each level, if there are others on the team, then you should begin a conversation to figure out what exists, and if it meets the team's needs. Each chapter will provide a rough rubric for that discussion, but remember that every team and project is unique. If you are the only person who is thinking about reliability and infrastructure, then you may end up spending a significant amount of time proposing solutions and pushing the project in a certain direction. Just remember that the point is to improve the reliability of the service, help the business, and improve the user's experience of the service.

You may find yourself distracted by each thing that you could fix. It is highly recommended to document the problems that you see first before diving in. Documenting first can be helpful in a few ways. Diving in is very satisfying, but it also may lead you to skip over requirements or spend too much time on a solution that doesn't work for your business (for example, integrating your system with a monitoring service you can't afford, or building a distributed job scheduler when you could have just used a piece of open source software).

So, when joining a new project, or evaluating a new service, here is a set of steps to follow:

  • Figure out the team structure. Who owns what? Who is in charge?
  • Find any documentation the team has for their service or the project.
  • Get someone to draw out the system architecture. Have them show you what connects to which service, what depends on the project, how data flows through the service, and how the project is deployed.
    SRE as a framework for new projects

    Figure 3: An example system architecture diagram. This is a very simple diagram that someone might draw on a whiteboard. Most companies will have something much more complex or detailed than this, but this is often the level of detail you need. Boxes with names and arrows show what talks to what.

    SRE as a framework for new projects

    Figure 4: Second example of an architecture diagram. This system is a classic static site generator model. The admin service creates or modifies things and writes update notifications into a queue. A worker reads data from the queue, does work on the data, and uploads it to a static object store, in this case vendor 2. Then, we put in some sort of CDN or serving system, in this case vendor 1 in front of vendor 2.

    Name

    Role

    Manager

    Things they know/specializations

    Akil

    Junior Full Stack Dev

    Jeff

    Seems pretty new and jumps around a lot.

    Catherine

    Senior Frontend Dev

    Jeff

    Does a lot of initial design prototyping and built most of the frontend originally.

    Kareem

    Senior Mobile Dev

    Melissa

    Wrote both mobile apps.

    Steph

    Senior Backend Dev

    Melissa

    TO DO: Set up a one-on-one to understand mobile backend.

    Suzy

    Full Stack Dev

    Jeff

    Animation wizard who knows the database for CMS better than anyone.

    Tom

    Full Stack Dev

    Jeff

    Frontend architecture, made initial protocol buffers and knows sync queue best.

    Table 1: An example table with notes on people in the project. With this, we have a reference on team structure. If we need to know who to talk to about mobile apps, we can look at our handy chart and see that we need to talk to Kareem or the manager, Melissa.

    Now that you have context for the project, or service, start working through each chapter of the book and ask:

  • Does the service have monitoring?
  • Does the team have plans for incident response?
  • Does the team create postmortems? Are they stored anywhere?
  • How is the service tested? Does the project have a release plan?
  • Has anyone done any capacity planning?
  • What tools could we build to improve the service?
  • Is the current level of reliability providing a positive user experience?

Tip

The trick to note here is that these questions could be asked about a piece of software that has been running for years, as well as one that is just being created.

The service you are investigating could be a large project with many pieces of software (a service-oriented architecture (SOA) for example) or a single monolithic application. If you are working on a project with many services, then work through each service one at a time. The downside of this can be that if you want to build a framework that will fit all of the services you are interacting with, you will not know how best to solve the problems and needs of them until after you have done a bunch of research and work. The upside is that you will not be pulled immediately in many directions and will be able to focus on one specific service's problems.

Your time and energy are limited resources and, because of this, you will always need to work with more people than you have time for, so make sure to take it slow. Going slow will mean that things do not get lost in the cracks. You also do not want to burn out before each service has its base few levels of its hierarchy filled up.

Summary

Alright! We made it through the introduction. We learned what SRE is at a high level, and we talked about the sorts of problems people in the role tend to focus on. We discussed the structure of the book, and also how to apply that structure to a software project.

In the next chapter, we will be diving into the world of monitoring! Monitoring is the foundation of learning about a system. It is how you record historical data about a system and learn about what is actually going on by analyzing the data you collect. By the end of the chapter, you'll know the basics of instrumenting an application, aggregating that data, storing that data, and displaying it.

References

  1. Oxford Living Dictionary, 2017, https://en.oxforddictionaries.com/definition/reliability
  2. Oxford Living Dictionary, 2017, https://en.oxforddictionaries.com/definition/engineering
  3. Site Reliability Engineering: How Google Runs Production Systems; B. Beyer, C.D. Jones, J. Petoff, N. Murphy, 2016, by O'Reilly Media, https://landing.google.com/sre/sre-book/toc/index.html
Left arrow icon Right arrow icon
Download code icon Download Code

Key benefits

  • Proven methods for keeping your website running
  • A survival guide for incident response
  • Written by an ex-Google SRE expert

Description

Real-World SRE is the go-to survival guide for the software developer in the middle of catastrophic website failure. Site Reliability Engineering (SRE) has emerged on the frontline as businesses strive to maximize uptime. This book is a step-by-step framework to follow when your website is down and the countdown is on to fix it. Nat Welch has battle-hardened experience in reliability engineering at some of the biggest outage-sensitive companies on the internet. Arm yourself with his tried-and-tested methods for monitoring modern web services, setting up alerts, and evaluating your incident response. Real-World SRE goes beyond just reacting to disaster—uncover the tools and strategies needed to safely test and release software, plan for long-term growth, and foresee future bottlenecks. Real-World SRE gives you the capability to set up your own robust plan of action to see you through a company-wide website crisis. The final chapter of Real-World SRE is dedicated to acing SRE interviews, either in getting a first job or a valued promotion.

Who is this book for?

Real-World SRE is aimed at software developers facing a website crisis, or who want to improve the reliability of their company's software. Newcomers to Site Reliability Engineering looking to succeed at interview will also find this invaluable.

What you will learn

  • Monitor for approaching catastrophic failure
  • Alert your team to an outage emergency
  • Dissect your incident response strategies
  • Test automation tools and build your own software
  • Predict bottlenecks and fight for user experience
  • Eliminate the competition in an SRE interview

Product Details

Country selected
Publication date, Length, Edition, Language, ISBN-13
Publication date : Aug 31, 2018
Length: 340 pages
Edition : 1st
Language : English
ISBN-13 : 9781788628884
Languages :

What do you get with a Packt Subscription?

Free for first 7 days. $24.99 p/m after that. Cancel any time!
Product feature icon Unlimited ad-free access to the largest independent learning library in tech. Access this title and thousands more!
Product feature icon 50+ new titles added per month, including many first-to-market concepts and exclusive early access to books as they are being written.
Product feature icon Innovative learning tools, including AI book assistants, code context explainers, and text-to-speech.
Product feature icon Thousands of reference materials covering every tech concept you need to stay up to date.
Subscribe now
View plans & pricing

Product Details

Publication date : Aug 31, 2018
Length: 340 pages
Edition : 1st
Language : English
ISBN-13 : 9781788628884
Languages :

Packt Subscriptions

See our plans and pricing
Modal Close icon
AU$24.99 billed monthly
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Simple pricing, no contract
AU$249.99 billed annually
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just AU$5 each
Feature tick icon Exclusive print discounts
AU$349.99 billed in 18 months
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just AU$5 each
Feature tick icon Exclusive print discounts

Frequently bought together


Stars icon
Total AU$ 189.97
Cloud Native Architectures
AU$67.99
Practical Site Reliability Engineering
AU$60.99
Real-World SRE
AU$60.99
Total AU$ 189.97 Stars icon
Banner background image

Table of Contents

12 Chapters
1. Introduction Chevron down icon Chevron up icon
2. Monitoring Chevron down icon Chevron up icon
3. Incident Response Chevron down icon Chevron up icon
4. Postmortems Chevron down icon Chevron up icon
5. Testing and Releasing Chevron down icon Chevron up icon
6. Capacity Planning Chevron down icon Chevron up icon
7. Building Tools Chevron down icon Chevron up icon
8. User Experience Chevron down icon Chevron up icon
9. Networking Foundations Chevron down icon Chevron up icon
10. Linux and Cloud Foundations Chevron down icon Chevron up icon
Other Books You May Enjoy Chevron down icon Chevron up icon
Index Chevron down icon Chevron up icon

Customer reviews

Top Reviews
Rating distribution
Full star icon Full star icon Full star icon Full star icon Half star icon 4.5
(10 Ratings)
5 star 80%
4 star 0%
3 star 10%
2 star 10%
1 star 0%
Filter icon Filter
Top Reviews

Filter reviews by




andrew drozdov Oct 11, 2018
Full star icon Full star icon Full star icon Full star icon Full star icon 5
An easy to follow playbook with some insightful gems. Nat's book should be required reading for devops engineers and benefits anyone that is building or maintaining software. If you want to learn a lot about the headaches of building software and how to handle them (simulating years of experience), then this book is for you.
Amazon Verified review Amazon
Christos Perivolaropoulos Sep 10, 2018
Full star icon Full star icon Full star icon Full star icon Full star icon 5
a great book that covers with details the ground from beginner to competent SRE. This is not only useful for SREs but anyone either working closely with SREs. My favorite chapter would be chapter 2 (monitoring) which goes over the more recent jargon and tools that SREs use that you will not find in a UNIX programming book.I also enjoyed chapter 9 about networking fundamentals. It explains in detail the way the internet works, but not enough detail to overwhelm the reader. Throughout the book, but especially on this chapter, the writer manages to keep the theoretical aspects grounded on practice by providing ways for the reader to test their knowledge with simple tools and techniques.I would recommend this book without hesitation.
Amazon Verified review Amazon
Customer Oct 11, 2018
Full star icon Full star icon Full star icon Full star icon Full star icon 5
This book really succeeds in covering the baselines of what you need to know for applying SRE to your organization, at a reasonable level. That is, Welch explicitly mentions multiple times in the book about scaling efforts to what makes sense for your company -- because in tech publishing, it's easy to act like everyone is Google, but we aren't. I learned something in every chapter, and I really appreciate the inclusion of UX and the "bonus" (to me) chapters on linux fundamentals. A great book for someone who needs to do SRE but doesn't come from an ops background.
Amazon Verified review Amazon
Aaron Sep 19, 2018
Full star icon Full star icon Full star icon Full star icon Full star icon 5
Most of the technical books I’ve read tend to fall into one of two categories. The first is a overview of the technical details that covers everything you’d like to know but is really dry. The other is a glorified autobiography of the author’s experience providing little technical information that can be found beyond a Wikipedia page. This book defies the odds and succeeds at providing a great level of technical detail while being an inherently easy to read. I certainly did not intend to finish half of it in one sitting, ignoring everything else I needed to do, but I did.What I like most about the book is that each major element of SRE isn’t just thrown out there as a fact. Nat introduces each topic follows it up with an explanation as to why each element is important and provides a story that shows why each major element is important. This style of writing is not only easily to read but it helps me retain It as well as having a concrete use case.Full disclosure: I’ve known Nat since high school so I received a review copy for free. But since we had no issues making fun of each other back in school I’d certainly have no problem calling him out if this book was bad.
Amazon Verified review Amazon
Alicja Raszkowska Oct 17, 2018
Full star icon Full star icon Full star icon Full star icon Full star icon 5
This book sets up a framework for SRE supported by real-life examples and experiences. It focuses not only on the positive side, but also emphasizes continuous improvement of SRE tools and processes, as well as engineer skills and happiness. It's full of examples and use cases that present the perspective of companies of all sizes and priorities. I thought I was a complete SRE newbie when I started reading it and I was surprised how many of the outlined strategies followed best practices and common sense I was already familiar with.I especially liked Nat's focus on communication - positive, encouraging, teamwork-driven culture that automates the boring parts of the job and gives space for growing in other areas. Building tools and thinking about user experience, especially for users we might never meet or get feedback from, shows the long-term focus on delivering quality products. Nat never shies away from sharing lessons learned from past experiences.I think this book is a good introduction to SRE both on a detail-oriented level of building a system of alerts, metrics and schedules, as well as understanding the bigger picture and impact it can have on the whole team and company. There are certain parts of it I wouldn't expect to show up in such a technical read, e.g. examples of how to have a growth-mindset as an SRE expert.I throughly enjoyed this book - the only caveat I experienced was that the code samples don't have syntax highlighting, so it was hard to parse them in longer snippets.
Amazon Verified review Amazon
Get free access to Packt library with over 7500+ books and video courses for 7 days!
Start Free Trial

FAQs

What is included in a Packt subscription? Chevron down icon Chevron up icon

A subscription provides you with full access to view all Packt and licnesed content online, this includes exclusive access to Early Access titles. Depending on the tier chosen you can also earn credits and discounts to use for owning content

How can I cancel my subscription? Chevron down icon Chevron up icon

To cancel your subscription with us simply go to the account page - found in the top right of the page or at https://subscription.packtpub.com/my-account/subscription - From here you will see the ‘cancel subscription’ button in the grey box with your subscription information in.

What are credits? Chevron down icon Chevron up icon

Credits can be earned from reading 40 section of any title within the payment cycle - a month starting from the day of subscription payment. You also earn a Credit every month if you subscribe to our annual or 18 month plans. Credits can be used to buy books DRM free, the same way that you would pay for a book. Your credits can be found in the subscription homepage - subscription.packtpub.com - clicking on ‘the my’ library dropdown and selecting ‘credits’.

What happens if an Early Access Course is cancelled? Chevron down icon Chevron up icon

Projects are rarely cancelled, but sometimes it's unavoidable. If an Early Access course is cancelled or excessively delayed, you can exchange your purchase for another course. For further details, please contact us here.

Where can I send feedback about an Early Access title? Chevron down icon Chevron up icon

If you have any feedback about the product you're reading, or Early Access in general, then please fill out a contact form here and we'll make sure the feedback gets to the right team. 

Can I download the code files for Early Access titles? Chevron down icon Chevron up icon

We try to ensure that all books in Early Access have code available to use, download, and fork on GitHub. This helps us be more agile in the development of the book, and helps keep the often changing code base of new versions and new technologies as up to date as possible. Unfortunately, however, there will be rare cases when it is not possible for us to have downloadable code samples available until publication.

When we publish the book, the code files will also be available to download from the Packt website.

How accurate is the publication date? Chevron down icon Chevron up icon

The publication date is as accurate as we can be at any point in the project. Unfortunately, delays can happen. Often those delays are out of our control, such as changes to the technology code base or delays in the tech release. We do our best to give you an accurate estimate of the publication date at any given time, and as more chapters are delivered, the more accurate the delivery date will become.

How will I know when new chapters are ready? Chevron down icon Chevron up icon

We'll let you know every time there has been an update to a course that you've bought in Early Access. You'll get an email to let you know there has been a new chapter, or a change to a previous chapter. The new chapters are automatically added to your account, so you can also check back there any time you're ready and download or read them online.

I am a Packt subscriber, do I get Early Access? Chevron down icon Chevron up icon

Yes, all Early Access content is fully available through your subscription. You will need to have a paid for or active trial subscription in order to access all titles.

How is Early Access delivered? Chevron down icon Chevron up icon

Early Access is currently only available as a PDF or through our online reader. As we make changes or add new chapters, the files in your Packt account will be updated so you can download them again or view them online immediately.

How do I buy Early Access content? Chevron down icon Chevron up icon

Early Access is a way of us getting our content to you quicker, but the method of buying the Early Access course is still the same. Just find the course you want to buy, go through the check-out steps, and you’ll get a confirmation email from us with information and a link to the relevant Early Access courses.

What is Early Access? Chevron down icon Chevron up icon

Keeping up to date with the latest technology is difficult; new versions, new frameworks, new techniques. This feature gives you a head-start to our content, as it's being created. With Early Access you'll receive each chapter as it's written, and get regular updates throughout the product's development, as well as the final course as soon as it's ready.We created Early Access as a means of giving you the information you need, as soon as it's available. As we go through the process of developing a course, 99% of it can be ready but we can't publish until that last 1% falls in to place. Early Access helps to unlock the potential of our content early, to help you start your learning when you need it most. You not only get access to every chapter as it's delivered, edited, and updated, but you'll also get the finalized, DRM-free product to download in any format you want when it's published. As a member of Packt, you'll also be eligible for our exclusive offers, including a free course every day, and discounts on new and popular titles.