
Real-World SRE: The Survival Guide for Responding to a System Outage and Maximizing Uptime


Chapter 1. Introduction

As the internet has grown, people have become used to having access to content all of the time, from a variety of devices. This means that the reputation of a brand has slowly become connected with the responsiveness and reliability of its products. People choose Google for searching because it always returns relevant and useful results quickly. People share content on Twitter because their message will be seen in real time by their followers. Netflix's great content selection is useless if it cannot deliver consistently on a variety of network speeds. As this reliability has become more important to businesses, a specialization focused on software reliability has emerged: Site Reliability Engineering (SRE). This chapter will introduce you to the field and also describe what you will learn from this book, helping you to write software to navigate the ever-changing internet landscape.

Before we explain what the field and role of SRE involve, let us start with a thought experiment. Imagine that it's early in the morning and you wake up to a text message from a friend: a screenshot of a blank web page with the caption, "I can't load your website."

If your personal website is indeed down, maybe you will message back with an, "I'll check it after breakfast," or an, "Oh yeah, been meaning to look into that." If it is your company's website, or maybe the page hosting your resume that you just sent to 15 possible employers, then a stream of expletives and indecipherable emojis will probably erupt from your mouth and in your text message back. This is because, for many businesses, websites have become the main source of incoming business. For some companies, like Facebook, Amazon, or iFixit, their entire business is a website. For other businesses, like restaurants or advertising agencies, a website acts as a way for people interested in the organization to learn more. It is often part of the marketing flow that helps companies to grow.

It is probably impossible to completely remove the adrenaline spike that comes from discovering a website is down if you are responsible for fixing it. However, we can work to set up a framework to limit how often things break. We can create a world where responding to outages is easy, and transition from, "Oh god, everything is on fire, what do I do?!" to "Oh hey, a page isn't loading, so let's check out what's having a rough day."

This chapter is our introduction to the book and the field of SRE. We will cover the following topics in the next few pages:

  • Exploring a brief history of the people who work on information systems
  • Defining what SRE is
  • Describing what is in the book and providing a rough framework for SRE

A brief history

SRE is a relatively new field, but it is a slightly different take on many existing ideas. In 1958, the term IT was coined in the Harvard Business Review, and eventually became the descriptor for the maintenance of technology used for collecting, storing, and distributing data and information. At that time, computers were transitioning toward having integrated circuits, but they were still the size of a room and were maintained and programmed by a team of people. As computers shrank, that team started focusing on multiple computers. Over time, some people started to specialize in programming those computers, and others focused on keeping them running. "Dumb terminals" would connect to a single computer, which was maintained by a team while programmers and users used the terminals.

Eventually, these maintainers started taking care of both the machines that individuals used, as well as large arrays of machines that provided services. Users would use a word processor on their local machine, and then upload files to a remote machine. Those who maintained the remote machines became known as system engineers, system administrators, and system operators.

As computers became smaller and more commodified, programmers began spending more time interacting with infrastructure, and configuring their software and infrastructure to work together well. On the other end, system admins were writing more and more complex code to maintain infrastructure. The closer these teams became, the more they began working together. In smaller teams, often, people would start focusing on both code for infrastructure and business code. In larger organizations, teams were created that focused on tools for managing infrastructure in reliable ways, so that product teams could quickly and easily manage the infrastructure they needed. These joint teams were often described as SRE or DevOps (developer and operations) teams.

Benjamin Treynor Sloss of Google, often referred to as just Treynor, says in Google's Site Reliability Engineering book, "SRE is what happens when you ask a software engineer to design an operations team." He is often credited with the creation of the idea that operations work is now just a specialization of software engineering. Given Google's success with reliability, the idea has caught on at many companies.

SRE is still a burgeoning field and, like DevOps, the term is often used to describe roles that include a wide diversity of work. Some companies give the title of SRE to a position that is much closer to a traditional system admin role. You can use this book's framework to evaluate a job before you apply for it. However, the goal of this book is to introduce you to the SRE mindset and help you to apply it to an organization, regardless of your past experience in the tech world.

What is SRE?

SRE is an exciting field. As mentioned earlier, it has evolved from a long line of roles and, as it is a relatively new field, its definition is steadily changing. SRE is an extension and evolution of many past concepts and, as such, concepts relevant to SRE apply to many roles, including, but not limited to, backend engineering, DevOps, systems engineering, systems administration, and operations. Depending on the company, these roles can involve very similar or very different responsibilities. The point is that, no matter what your job title is, you can apply SRE principles to your role.

In an attempt to define the field, we can learn a lot from its full name, Site Reliability Engineering:

Merging these three definitions, we get something like, "The field focused on working artfully to bring about a website that performs consistently well." While this definition could use some brushing up, it suits our needs for now. If you work, or know people who work, in the web development or software engineering world and you ask them what SRE means, then they may ask you, "Isn't that like X?" To someone from that background, X might be "DevOps," "ops," "platform engineering," "infrastructure engineering," "24/7 engineering," "a sysadmin," and so on.

This variation of answers presents the first problem we will see throughout this book: every organization is different. SRE's primary goal is making a website perform consistently according to our previous definition, which is difficult because it is dependent on the organization, the business around that organization, and the website's (or product's) requirements. One of the primary goals of this book is to present a framework that you can apply even if you do not belong to an organization with any of the aforementioned roles. The framework should be effective if you work for yourself, and it should also work if you are employed by some gigantic international multi-headed Hydra organization, and anything in between.

I worked as an SRE in 2016 for Hillary for America. It was the lead organization (but definitely not the only one) working to help to elect Hillary Clinton as President of the United States of America. We were not successful, and while this example immediately dates this book, I found it to have the most concrete separation of concerns between the parts of a website that I have ever worked on. The organization was hyper-focused on one goal (electing Hillary Clinton as president), so it had a very explicit list of goals that made my job a lot easier.

There were many separate parts of the campaign that the technology team worked on, including a mobile application, different websites, data pipelines, and large databases. To keep this simple though, and to explain what I mean by a separation of concerns, let me use three separate websites that we built and maintained as an example:

Figure 1: Screenshots of different parts of hillaryclinton.com, courtesy of the Hillary for America design team. From left to right: the header on the home page, a page about Nevada, a page about Hillary's policies, Hillary's home page in Spanish, the campaign blog, and the donate page.

The home page was a general landing page. It needed to be available during the hours that people in North America were awake (as our target audience was mostly based in the United States), but very few people visited the home page unless driven there.

The main reason you would go to https://www.hillaryclinton.com/ was if you were sent there, not because it was part of your daily browsing, in the way that you might visit Twitter or Reddit. Surrogates speaking at rallies, on the radio, or on television supporting Hillary Clinton would often say things like, "Go to hillaryclinton.com now to sign up," or "hillaryclinton.com has more details on her policies on this topic." Because traffic arrived in these semi-predictable spikes, a five-minute outage here and there was OK. However, as at many media organizations, there were no guarantees about when a large spike of traffic would occur.

The donate page always needed to be up. According to our product team and senior leadership, the donate page's availability was priority number one. If people could not give money, then the campaign might not be able to pay people's salaries or get the candidate to her speaking engagements. The donation site was not the only way that the campaign made money, but it was a significant source of income.

The voter registration page only needed to be fully available when there was an election coming soon. This was because the page let people say they were going to vote for Hillary Clinton and find their nearest polling location. While the donate page needed to be available for the majority of the campaign (May 2015 through to November 2016), the voter registration page only really needed to be available during the lead-up to the general election (September through to November of 2016). If we had built the voter registration page earlier in the election, it also would have been needed in the days leading up to the primaries, but then only for states that were voting on those days. Primary elections are a precursor to the general election and happen from February to June, with different states voting on different days.

The key here is that different websites and features have different requirements and a different definition of being reliable. Nothing will ever be perfect, nor is 100% uptime achievable on the internet, because things are always breaking. So, all we can do is figure out what sort of failures we might have and optimize our product to be resilient in a way that is useful for us. SRE isn't just the analysis of systems; it is also the architecting and building of systems so that they meet the requirements of the product.

Tip

Software on the internet can never be fully reliable for two reasons. The first reason is that the internet is a distributed system and, often, parts fail, which will affect your service's availability. The second reason is that humans write software, and that software will often have bugs, which will also cause outages.
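Since 100% uptime is out of reach, it can help to turn an availability target into the amount of downtime it permits. The following is a minimal sketch of that arithmetic. The function name `allowed_downtime` and the two targets are assumptions made for illustration; they are not figures from the campaign.

```python
from datetime import timedelta


def allowed_downtime(availability_target: float, period: timedelta) -> timedelta:
    """Return the downtime an availability target permits over a period."""
    if not 0.0 <= availability_target <= 1.0:
        raise ValueError("availability_target must be between 0 and 1")
    return period * (1.0 - availability_target)


# Hypothetical targets, chosen only to illustrate the arithmetic.
ILLUSTRATIVE_TARGETS = {
    "home page": 0.99,
    "donate page": 0.999,
}

if __name__ == "__main__":
    month = timedelta(days=30)
    for page, target in ILLUSTRATIVE_TARGETS.items():
        budget = allowed_downtime(target, month)
        print(f"{page}: {target:.1%} allows {budget} of downtime per 30 days")
```

Under these assumed numbers, a 99.9% target leaves roughly 43 minutes of downtime in a 30-day period, while a 99% target leaves a little over 7 hours. The point is that different requirements translate into very different amounts of room for failure.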

Often, the job of someone working in SRE is to take in reliability requirements for software, and its infrastructure, and then figure out how to make the infrastructure meet those requirements. Steps toward this often require figuring out if existing infrastructure is meeting those needs, collaborating with teams (or people writing software that will run on the infrastructure), evaluating external tools, or just designing and writing what you need yourself.

As I mentioned at the beginning of the chapter, an SRE role can be very diverse. The requirements of an SRE position at a Fortune 500 company can be very different to those at a 20-person video game company. The role at a bank in the USA could be different from the role at a bank on the other side of the world. This is because the organizations are different. For smaller organizations, someone working as an SRE may handle everything in the organization related to infrastructure and reliability. On the other hand, larger organizations may have multiple teams of SREs working with many diverse teams of developers. Even between two banks, the role could differ because each bank has different needs.

A local bank may only need someone to improve the reliability of tools for people who work for the bank, while a much larger bank in London may need someone who can make sure their bank's systems can make trades at very high speeds with the London Stock Exchange or support millions of individual customers. This book will provide a structure for anyone interested in becoming an SRE. The goal is to empower you, no matter your background or current situation. It will not be a panacea but will provide a knowledge base and a framework for making sites more reliable and moving your career forwards.

What is in the book?

I worked as an SRE at Google for four years, and that is where I started specializing, moving away from being a full stack engineer, and instead considering myself an SRE. Google had lots of internal education courses, and when I left, I found it difficult to continue my education. I also quickly discovered that SRE at Google is a very different beast than SRE at much smaller organizations. I decided to write this book for people interested in starting with SRE or applying it to organizations that are much smaller than Google.

To do this, the book is broken up into two parts. The first eight chapters walk through the hierarchy of reliability. This hierarchy was originally designed by Mikey Dickerson of the United States Digital Service (and, surprise, surprise, Google). The hierarchy says that as you are trying to add reliability to a system, you need to walk through each level before you get to the next one.

The following diagram shows a slightly modified version of Mikey's original pyramid. I have updated it to include the all-encompassing aspect of communication:

Figure 2: This seven-layer pyramid is encircled with communication. Each layer builds upon and needs the previous layer. It is surrounded by communication because each layer needs communication to succeed.

Let us walk through the layers as a preview of what you can expect in each chapter.

  • Chapter 2, Monitoring: The first level is monitoring, which makes sure that you have insight into a system, tracking health, availability, and what is happening internally in the system. Monitoring is not just tools though, because it also requires communication. Monitoring is a very contentious part of SRE and operations because, depending on implementation, it can either be very useful or very pointless. Figuring out what to monitor, how to monitor it, where to store the monitoring data, who can access historical monitoring data, and how to look at data often takes time. Many people in your engineering organization will have opinions on these points based on past experiences.

    Some engineers will have had bad experiences and will not think monitoring is worth the investment, whereas others will have religious zealotry toward certain tools, and some will just ignore you. This chapter will help you to navigate all of these competing opinions and find and create the implementation that is best for your project and team.

  • Chapter 3, Incident Response: The next level is incident response. If something is broken, how do you alert people and respond? While tools help with this, as they define the rules by which to alert humans, most of incident response is about defining policy and setting up training so humans know what to do when they get alerts. If team members see an automated message in Slack, what should they do? If they get a phone call, how quickly do they need to respond? Will employees be paid extra if they have to work on a Saturday due to an outage? These are all questions we will address in the What is incident response section. Setting up on-call rotations, best practices for working together as a team, and building infrastructure to make incidents as low-stress as possible will also be covered.
  • Chapter 4, Postmortems: The third level is postmortems. Once you have had an outage, how do you make sure the problem does not happen again? Should you have a meeting about your incident? Does there need to be documentation? In this chapter, we will consider how to talk about past incidents and make it an enjoyable process for all involved. Postmortems are the act of recording for history how an incident happened, how the team fixed it, and how the team is working to prevent another similar incident in the future. We want to set up a culture of blameless and transparent postmortems, so people can work together.

    Individuals should not be afraid of incidents, but rather feel confident that if an incident happens, the team will respond and improve the system for the future, instead of focusing on the shame and anger that can come with failure. Incidents are things to learn from, not things to be afraid and ashamed of!

  • Chapter 5, Testing and Releasing: The fourth level is testing and releasing your software. In this chapter, we will be talking about the tooling and strategies that can be used to test and release software. This level in the hierarchy is our first level where instead of focusing on things that have happened, we focus on prevention. Prevention is about trying to limit the number of incidents that happen and also making sure that infrastructure and services stay stable when releasing new code. The chapter will talk about how to focus on all of the different types of testing that exist and make them useful for you and your team. It will also explore releasing software, when to use methodologies like continuous deployment, and some tools you can use.
  • Chapter 6, Capacity Planning: The fifth level is capacity planning. While Chapter 5, Testing and Releasing focused on the current world, this chapter is all about predicting the future and finding the limits of your system. Capacity planning is also about making sure you can grow over time. Once you are monitoring your system, and running a reliable system, you can start thinking about how to grow it over time, and how to find and anticipate bottlenecks and resource limits. In this chapter, we will talk about planning for long-term growth, writing budgets, communicating with outside teams about the future, and things to keep in mind as your service shrinks and grows.
  • Chapter 7, Building Tools: The sixth level is the development of new tools and services. SRE is not only about operations but also about software development. We hope SREs will spend around half of their time developing new tools and services. Some of these tools will exist to automate tasks that an employee has been doing by hand, while others will exist to improve another part of the hierarchy, such as automated load testing, or services to improve performance. In this chapter, we will talk about finding these projects, defining them, planning them, and building them. We will also talk about communicating their usefulness to your fellow engineers.
  • Chapter 8, User Experience: The final tier is user experience, which is about making sure the user has a good experience. We'll talk about measuring performance, working with user researchers, and defining what a good experience means to your team. We will also discuss how the experience of a tool and processes can cause outages. The goal is to make sure that, no matter the tool, or the user, people enjoy using it, understand how to use it, and cannot easily hurt themselves with it.

    Nori Heikkinen, an SRE at Google with many years of experience, adds that "the hierarchy does not include prevention, partly because 100% uptime is impossible, and partly because the bottom three needs in the hierarchy must be addressed within an organization before prevention can be examined." (https://www.infoq.com/news/2015/06/too-big-to-fail)

    The last two chapters of this book are a cheat section and introduction to common useful topics.

  • Chapter 9, Networking Foundations: This is a selection of tools and definitions of important ideas in networking. We discuss network packets, DNS, UDP and TCP, and lots of other things. After this chapter, you should feel like you know the basics of networking and have the ability to research more advanced topics.
  • Chapter 10, Linux and Cloud Foundations: This is a selection of tools and important concepts involved in Linux and modern cloud products. We cover what the Linux kernel is, common parts of public clouds, and other topics. After this chapter you should feel like you know the basics of Linux and most public cloud products. Afterwards you should feel comfortable researching specific clouds and more advanced Linux topics.

SRE as a framework for new projects

One way to use this book is as a framework for working on a new project. As each chapter is about a different level of the hierarchy, you can work through the book to figure out where in the hierarchy your project sits. If it is a new project, then often it will be right at the bottom of the hierarchy, with no, or very little, monitoring implemented.

At each level, if there are others on the team, then you should begin a conversation to figure out what exists, and if it meets the team's needs. Each chapter will provide a rough rubric for that discussion, but remember that every team and project is unique. If you are the only person who is thinking about reliability and infrastructure, then you may end up spending a significant amount of time proposing solutions and pushing the project in a certain direction. Just remember that the point is to improve the reliability of the service, help the business, and improve the user's experience of the service.

You may find yourself distracted by each thing that you could fix. It is highly recommended to document the problems that you see first before diving in. Documenting first can be helpful in a few ways. Diving in is very satisfying, but it also may lead you to skip over requirements or spend too much time on a solution that doesn't work for your business (for example, integrating your system with a monitoring service you can't afford, or building a distributed job scheduler when you could have just used a piece of open source software).

So, when joining a new project, or evaluating a new service, here is a set of steps to follow:

  • Figure out the team structure. Who owns what? Who is in charge?
  • Find any documentation the team has for their service or the project.
  • Get someone to draw out the system architecture. Have them show you what connects to which service, what depends on the project, how data flows through the service, and how the project is deployed.
    Figure 3: An example system architecture diagram. This is a very simple diagram that someone might draw on a whiteboard. Most companies will have something much more complex or detailed than this, but this is often the level of detail you need. Boxes with names and arrows show what talks to what.

    Figure 4: Second example of an architecture diagram. This system is a classic static site generator model. The admin service creates or modifies things and writes update notifications into a queue. A worker reads data from the queue, does work on the data, and uploads it to a static object store, in this case vendor 2. Then, we put in some sort of CDN or serving system, in this case vendor 1 in front of vendor 2.
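The data flow described for Figure 4 can be sketched in a few lines. This is a toy, in-memory illustration under stated assumptions: a standard library `queue.Queue` stands in for the update queue, a plain dictionary stands in for the static object store (vendor 2), and the names `AdminService`, `render_page`, and `run_worker` are hypothetical. The CDN (vendor 1) is not modeled.

```python
import html
import queue


def render_page(update: dict) -> str:
    """Turn one update notification into a static HTML page."""
    title = html.escape(update["title"])
    body = html.escape(update["body"])
    return f"<h1>{title}</h1><p>{body}</p>"


class AdminService:
    """Creates or modifies content and writes update notifications to a queue."""

    def __init__(self, updates: queue.Queue) -> None:
        self.updates = updates

    def publish(self, slug: str, title: str, body: str) -> None:
        self.updates.put({"slug": slug, "title": title, "body": body})


def run_worker(updates: queue.Queue, object_store: dict) -> int:
    """Drain the queue, render each update, and upload it to the object store."""
    processed = 0
    while True:
        try:
            update = updates.get_nowait()
        except queue.Empty:
            return processed
        object_store[f"{update['slug']}.html"] = render_page(update)
        processed += 1


if __name__ == "__main__":
    updates = queue.Queue()
    store = {}  # stand-in for the static object store
    AdminService(updates).publish("about", "About us", "Hello, world")
    print(run_worker(updates, store), "page(s) written:", sorted(store))
```

Even a sketch this small makes the dependencies visible: the admin service only needs the queue, the worker needs both the queue and the object store, and readers only ever touch what the worker has uploaded.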

    | Name      | Role                  | Manager | Things they know/specializations                                                     |
    | Akil      | Junior Full Stack Dev | Jeff    | Seems pretty new and jumps around a lot.                                             |
    | Catherine | Senior Frontend Dev   | Jeff    | Does a lot of initial design prototyping and built most of the frontend originally. |
    | Kareem    | Senior Mobile Dev     | Melissa | Wrote both mobile apps.                                                              |
    | Steph     | Senior Backend Dev    | Melissa | TO DO: Set up a one-on-one to understand mobile backend.                             |
    | Suzy      | Full Stack Dev        | Jeff    | Animation wizard who knows the database for CMS better than anyone.                  |
    | Tom       | Full Stack Dev        | Jeff    | Frontend architecture, made initial protocol buffers and knows sync queue best.      |

    Table 1: An example table with notes on people in the project. With this, we have a reference on team structure. If we need to know who to talk to about mobile apps, we can look at our handy chart and see that we need to talk to Kareem or the manager, Melissa.

    Now that you have context for the project, or service, start working through each chapter of the book and ask:

  • Does the service have monitoring?
  • Does the team have plans for incident response?
  • Does the team create postmortems? Are they stored anywhere?
  • How is the service tested? Does the project have a release plan?
  • Has anyone done any capacity planning?
  • What tools could we build to improve the service?
  • Is the current level of reliability providing a positive user experience?
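One way to keep track of the answers to these questions is to record them per service and look for the lowest level of the hierarchy that is not yet covered. The following is a small sketch of that idea. The level names follow the chapter titles, while the function name `first_gap` and the example answers are assumptions made for illustration.

```python
from typing import Dict, Optional

# The levels of the hierarchy, from the bottom of the pyramid to the top.
HIERARCHY = [
    "monitoring",
    "incident response",
    "postmortems",
    "testing and releasing",
    "capacity planning",
    "building tools",
    "user experience",
]


def first_gap(answers: Dict[str, bool]) -> Optional[str]:
    """Return the lowest level that is not yet in place, or None if all are."""
    for level in HIERARCHY:
        # A level with no recorded answer is treated as not in place.
        if not answers.get(level, False):
            return level
    return None


if __name__ == "__main__":
    # Hypothetical answers for one service.
    answers = {"monitoring": True, "incident response": True, "postmortems": False}
    print("Start with:", first_gap(answers))
```

Because each layer builds on the one below it, the first unanswered or negative level is usually the place to start the conversation with the team.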

Tip

The trick to note here is that these questions could be asked about a piece of software that has been running for years, as well as one that is just being created.

The service you are investigating could be a large project with many pieces of software (a service-oriented architecture (SOA) for example) or a single monolithic application. If you are working on a project with many services, then work through each service one at a time. The downside of this can be that if you want to build a framework that will fit all of the services you are interacting with, you will not know how best to solve the problems and needs of them until after you have done a bunch of research and work. The upside is that you will not be pulled immediately in many directions and will be able to focus on one specific service's problems.

Your time and energy are limited resources and, because of this, you will always need to work with more people than you have time for, so make sure to take it slow. Going slow will mean that things do not fall through the cracks. You also do not want to burn out before each service has the bottom few levels of its hierarchy filled in.

Summary

Alright! We made it through the introduction. We learned what SRE is at a high level, and we talked about the sorts of problems people in the role tend to focus on. We discussed the structure of the book, and also how to apply that structure to a software project.

In the next chapter, we will be diving into the world of monitoring! Monitoring is the foundation of learning about a system. It is how you record historical data about a system and learn about what is actually going on by analyzing the data you collect. By the end of the chapter, you'll know the basics of instrumenting an application, aggregating that data, storing that data, and displaying it.

References

  1. Oxford Living Dictionary, 2017, https://en.oxforddictionaries.com/definition/reliability
  2. Oxford Living Dictionary, 2017, https://en.oxforddictionaries.com/definition/engineering
  3. Site Reliability Engineering: How Google Runs Production Systems; B. Beyer, C.D. Jones, J. Petoff, N. Murphy, 2016, by O'Reilly Media, https://landing.google.com/sre/sre-book/toc/index.html

Key benefits

  • Proven methods for keeping your website running
  • A survival guide for incident response
  • Written by an ex-Google SRE expert

Description

Real-World SRE is the go-to survival guide for the software developer in the middle of catastrophic website failure. Site Reliability Engineering (SRE) has emerged on the frontline as businesses strive to maximize uptime. This book is a step-by-step framework to follow when your website is down and the countdown is on to fix it. Nat Welch has battle-hardened experience in reliability engineering at some of the biggest outage-sensitive companies on the internet. Arm yourself with his tried-and-tested methods for monitoring modern web services, setting up alerts, and evaluating your incident response. Real-World SRE goes beyond just reacting to disaster—uncover the tools and strategies needed to safely test and release software, plan for long-term growth, and foresee future bottlenecks. Real-World SRE gives you the capability to set up your own robust plan of action to see you through a company-wide website crisis. The final chapter of Real-World SRE is dedicated to acing SRE interviews, either in getting a first job or a valued promotion.

Who is this book for?

Real-World SRE is aimed at software developers facing a website crisis, or who want to improve the reliability of their company's software. Newcomers to Site Reliability Engineering looking to succeed at interview will also find this invaluable.

What you will learn

  • Monitor for approaching catastrophic failure
  • Alert your team to an outage emergency
  • Dissect your incident response strategies
  • Test automation tools and build your own software
  • Predict bottlenecks and fight for user experience
  • Eliminate the competition in an SRE interview
Product Details

Publication date: Aug 31, 2018
Length: 340 pages
Edition: 1st
Language: English
ISBN-13: 9781788628884

Packt Subscriptions

See our plans and pricing

€18.99 billed monthly
  • Unlimited access to Packt's library of 7,000+ practical books and videos
  • Constantly refreshed with 50+ new titles a month
  • Exclusive early access to books as they're written
  • Solve problems while you work with advanced search and reference features
  • Offline reading on the mobile app
  • Simple pricing, no contract

€189.99 billed annually
  • Unlimited access to Packt's library of 7,000+ practical books and videos
  • Constantly refreshed with 50+ new titles a month
  • Exclusive early access to books as they're written
  • Solve problems while you work with advanced search and reference features
  • Offline reading on the mobile app
  • Choose a DRM-free eBook or Video every month to keep
  • PLUS own as many other DRM-free eBooks or Videos as you like for just €5 each
  • Exclusive print discounts

€264.99 billed in 18 months
  • Unlimited access to Packt's library of 7,000+ practical books and videos
  • Constantly refreshed with 50+ new titles a month
  • Exclusive early access to books as they're written
  • Solve problems while you work with advanced search and reference features
  • Offline reading on the mobile app
  • Choose a DRM-free eBook or Video every month to keep
  • PLUS own as many other DRM-free eBooks or Videos as you like for just €5 each
  • Exclusive print discounts

Frequently bought together

Real-World SRE
€32.99
Practical Site Reliability Engineering
€32.99
Cloud Native Architectures
€36.99
Total: €102.97

Table of Contents

12 Chapters
1. Introduction
2. Monitoring
3. Incident Response
4. Postmortems
5. Testing and Releasing
6. Capacity Planning
7. Building Tools
8. User Experience
9. Networking Foundations
10. Linux and Cloud Foundations
Other Books You May Enjoy
Index

Customer reviews

Rating: 4.5 out of 5 (10 ratings)
5 star 80%
4 star 0%
3 star 10%
2 star 10%
1 star 0%

Top Reviews

andrew drozdov Oct 11, 2018
Rating: 5 out of 5
An easy to follow playbook with some insightful gems. Nat's book should be required reading for devops engineers and benefits anyone that is building or maintaining software. If you want to learn a lot about the headaches of building software and how to handle them (simulating years of experience), then this book is for you.
Amazon Verified review
Christos Perivolaropoulos Sep 10, 2018
Rating: 5 out of 5
A great book that covers in detail the ground from beginner to competent SRE. This is useful not only for SREs but also for anyone working closely with SREs. My favorite chapter would be chapter 2 (monitoring), which goes over the more recent jargon and tools that SREs use that you will not find in a UNIX programming book.

I also enjoyed chapter 9 about networking fundamentals. It explains in detail the way the internet works, but not in enough detail to overwhelm the reader. Throughout the book, but especially in this chapter, the writer manages to keep the theoretical aspects grounded in practice by providing ways for the reader to test their knowledge with simple tools and techniques.

I would recommend this book without hesitation.
Amazon Verified review
Customer Oct 11, 2018
Rating: 5 out of 5
This book really succeeds in covering the baselines of what you need to know for applying SRE to your organization, at a reasonable level. That is, Welch explicitly mentions multiple times in the book the need to scale efforts to what makes sense for your company -- because in tech publishing, it's easy to act like everyone is Google, but we aren't. I learned something in every chapter, and I really appreciate the inclusion of UX and the "bonus" (to me) chapters on Linux fundamentals. A great book for someone who needs to do SRE but doesn't come from an ops background.
Amazon Verified review
Aaron Sep 19, 2018
Rating: 5 out of 5
Most of the technical books I’ve read tend to fall into one of two categories. The first is an overview of the technical details that covers everything you’d like to know but is really dry. The other is a glorified autobiography of the author’s experience providing little technical information that can be found beyond a Wikipedia page. This book defies the odds and succeeds at providing a great level of technical detail while being inherently easy to read. I certainly did not intend to finish half of it in one sitting, ignoring everything else I needed to do, but I did.

What I like most about the book is that each major element of SRE isn’t just thrown out there as a fact. Nat introduces each topic, follows it up with an explanation as to why each element is important, and provides a story that shows why each major element is important. This style of writing is not only easy to read but it helps me retain it as well as having a concrete use case.

Full disclosure: I’ve known Nat since high school so I received a review copy for free. But since we had no issues making fun of each other back in school I’d certainly have no problem calling him out if this book was bad.
Amazon Verified review
Alicja Raszkowska Oct 17, 2018
Rating: 5 out of 5
This book sets up a framework for SRE supported by real-life examples and experiences. It focuses not only on the positive side, but also emphasizes continuous improvement of SRE tools and processes, as well as engineer skills and happiness. It's full of examples and use cases that present the perspective of companies of all sizes and priorities. I thought I was a complete SRE newbie when I started reading it and I was surprised how many of the outlined strategies followed best practices and common sense I was already familiar with.

I especially liked Nat's focus on communication - positive, encouraging, teamwork-driven culture that automates the boring parts of the job and gives space for growing in other areas. Building tools and thinking about user experience, especially for users we might never meet or get feedback from, shows the long-term focus on delivering quality products. Nat never shies away from sharing lessons learned from past experiences.

I think this book is a good introduction to SRE both on a detail-oriented level of building a system of alerts, metrics and schedules, as well as understanding the bigger picture and impact it can have on the whole team and company. There are certain parts of it I wouldn't expect to show up in such a technical read, e.g. examples of how to have a growth-mindset as an SRE expert.

I thoroughly enjoyed this book - the only caveat I experienced was that the code samples don't have syntax highlighting, so it was hard to parse them in longer snippets.
Amazon Verified review

FAQs

What is the delivery time and cost of a print book?

Shipping Details

USA:

Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days!
Add one extra business day for deliveries to Northern Ireland and Scottish Highlands and islands

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for Interstate metro
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 days based on the distance to a destination following dispatch.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
Orders received before 5 PM UK time start printing on the next business day, so the estimated delivery times also start from that day. Orders received after 5 PM UK time (in our internal systems) on a business day, or at any time on the weekend, begin printing on the second business day after the order. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-Bissau
  9. Iran
  10. Lebanon
  11. Libyan Arab Jamahiriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela
What is a customs duty/charge?

Customs duties are charges levied on goods when they cross international borders. They are a tax imposed on imported goods. These duties are charged by special authorities and bodies created by local governments and are meant to protect local industries, economies, and businesses.

Do I have to pay customs charges for the print book order?

Orders shipped to the countries listed under EU27 will not bear customs charges. Packt pays them as part of the order.

List of EU27 countries: www.gov.uk/eu-eea

For shipments to countries outside the EU27, a customs duty or localized taxes may apply. These are charged by the recipient country, must be paid by the customer, and are not included in the shipping charges on the order.

How do I know my customs duty charges?

The amount of duty payable varies greatly depending on the imported goods, the country of origin, and several other factors such as the total invoice amount, dimensions such as weight, and other criteria applicable in your country.

For example:

  • If you live in Mexico and the declared value of your ordered items is over $50, you will have to pay an additional import tax of 19% (which will be $9.50) to the courier service in order to receive your package.
  • Whereas if you live in Turkey and the declared value of your ordered items is over €22, you will have to pay an additional import tax of 18% (which will be €3.96) to the courier service in order to receive your package.
How can I cancel my order?

Cancellation Policy for Published Printed Books:

You can cancel any order within 1 hour of placing it. Simply contact [email protected] with your order details or payment transaction ID. If your order has already started the shipment process, we will do our best to stop it. However, if it is already on the way to you, then once you receive it you can contact us at [email protected] using the returns and refund process.

Please understand that Packt Publishing cannot provide refunds or cancel any order except in the cases described in our Return Policy (i.e. Packt Publishing agrees to replace your printed book because it arrived damaged or with a material defect). Otherwise, Packt Publishing will not accept returns.

What is your returns and refunds policy?

Return Policy:

We want you to be happy with your purchase from Packtpub.com. We will not hassle you with returning print books to us. If the print book you receive from us is incorrect, damaged, doesn't work or is unacceptably late, please contact the Customer Relations Team at [email protected] with the order number and issue details as explained below:

  1. If you ordered an item (eBook, Video or Print Book) incorrectly or accidentally, please contact the Customer Relations Team at [email protected] within one hour of placing the order and we will replace the item or refund its cost.
  2. If your eBook or Video file is faulty, or a fault occurs while the eBook or Video is being made available to you (i.e. during download), contact the Customer Relations Team at [email protected] within 14 days of purchase and they will resolve the issue for you.
  3. You will have a choice of replacement or refund of the problem items (damaged, defective or incorrect).
  4. Once the Customer Care Team confirms that you will be refunded, you should receive the refund within 10 to 12 working days.
  5. If you are only requesting a refund of one book from a multiple order, then we will refund you the appropriate single item.
  6. Where the items were shipped under a free shipping offer, there will be no shipping costs to refund.

On the off chance your printed book arrives damaged or with a material defect, contact our Customer Relations Team at [email protected] within 14 days of receipt of the book with appropriate evidence of damage and we will work with you to secure a replacement copy, if necessary. Please note that each printed book you order from us is individually made by Packt's professional book-printing partner on a print-on-demand basis.

What tax is charged?

Currently, no tax is charged on the purchase of any print book (subject to change based on the laws and regulations). A localized VAT fee is charged only to our European and UK customers on eBooks, Video and subscriptions that they buy. GST is charged to Indian customers for eBooks and video purchases.

What payment methods can I use?

You can pay with the following methods:

  1. Visa Debit
  2. Visa Credit
  3. MasterCard
  4. PayPal