Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
PostgreSQL High Availability Cookbook

You're reading from   PostgreSQL High Availability Cookbook Managing a reliable PostgreSQL database

Arrow left icon
Product type Paperback
Published in Feb 2017
Publisher Packt
ISBN-13 9781787125537
Length 536 pages
Edition 2nd Edition
Arrow right icon
Author (1):
Arrow left icon
Shaun Thomas Shaun Thomas
Author Profile Icon Shaun Thomas
Shaun Thomas
Arrow right icon
View More author details
Toc

Table of Contents (12) Chapters Close

Preface 1. Hardware Planning 2. Handling and Avoiding Downtime FREE CHAPTER 3. Pooling Resources 4. Troubleshooting 5. Monitoring 6. Replication 7. Replication Management Tools 8. Simple Stack 9. Advanced Stack 10. Cluster Control 11. Data Distribution

Having enough IOPS

IOPS stands for Input/Output Operations Per Second. Essentially, this describes how many operations a device can perform per second before it should be considered saturated. If a device is saturated, further requests must wait until the device has spare bandwidth. A server overwhelmed with requests can amount to seconds, minutes, or even hours of delayed results.

Depending on application timeout settings and user patience, a device with low IOPS appears as a bottleneck that reduces both system responsiveness and the perception of quality. A database with insufficient IOPS to service queries in a timely manner is unavailable for all intents and purposes. It doesn't matter if PostgreSQL is still available and serving results in this scenario, as its availability has already suffered. We are trying to build a highly-available database. To do so, we need to build a server with enough performance to survive daily operation. In addition, we must overprovision for unexpected surges in popularity, and account for future storage and throughput needs based on monthly increases in storage utilization.

Getting ready

This process is more of a thought experiment. We will present some very rough estimates of IO performance for many different disk types. For each, we should increment entries in our hardware spreadsheet based on perceived need.

The main things we will need for this process are numbers. During development, applications commonly have a goal, expected client count, table count, estimated growth rates, and so on. Even if we have to guess for many of these, they will all contribute to our IOPS requirements. Have these numbers ready, even if they're simply guesses.

If the application already exists on a development or stage environment, try to get the development or QA team to run operational tests. This is a great opportunity to gather statistics before choosing potential production hardware.

How to do it...

We need to figure out how many operations per second we can expect. We can estimate this by using the following steps:

  1. Increment the Count column in our hardware spreadsheet for one or more of the following, and round up:
    • For 3.5" hard drives, divide by 200
    • For 2.5" hard drives, divide by 150
    • For SSD hard drives, divide by 50,000, then add two
  2. Multiply these numbers together, and double the result. Then multiply the total by eight.
  3. Count the amount of tables used in those queries. If this is unavailable, use three.
  4. Obtain the average number of queries per page. If this is unavailable, use 10.
  5. Collect the amount of simultaneous database connections. Start with the expected user count, and divide by 50.
  6. Add 10 percent to any count greater than 0 and then round up.

How it works...

Wow, that's a lot of work! There's a reason for everything, of course.

In the initial three steps, we're trying to figure out how many operations might touch an object on disk. For every user that's actively loading a page, for every query in that page, and for every table in that query, that's a potential disk read or write.

We double that number to account for the fact we're estimating all of this. It's a common engineering trick to double or triple calculations to absorb unexpected capacity, variance in materials, and so on. We can use that same technique here.

Why did we suggest dividing the user count by 50 to get the connection total? Since we do not know the average query runtime, we assume 20 ms for each query. For every query that's executing, a connection is in use. Assuming full utilization, up to 50 queries can be active per second. If you have a production system that can provide a better query runtime average, we suggest using that value instead.

But why do we then multiply by eight? In a worst (or best) case scenario, it's not uncommon for an application to double the amount of users or requests on a yearly basis. Doubled usage means doubled hardware needs. If requirements double in one year, we would need a server three times more powerful (1 + 2) than the original estimates to account for the second year. Another doubling would mean a server seven times better (1 + 2 + 4). CPUs, RAM, and storage are generally available as powers of two. Since it's fairly difficult to obtain storage seven times faster than what we already have, we multiply the total by eight.

That gives a total IOPS value roughly necessary for our database to immediately serve every request for the next three years, straight from the disk device. Several companies buy servers every three or four years as a balance between cost and capacity, so these estimates are based on that assumption.

In the next step, we get a rough estimate of the amount of disks necessary to serve the required IOPS. Our numbers in these steps are based on hard drive performance. A 15,000 RPM hard drive can serve under ideal conditions, roughly 200 operations per second. Likewise, a 10,000 RPM drive can provide about 150 operations per second. Current SSDs at the time of writing commonly reach 200,000-300,000 IOPS, and some even regularly eclipse a cool million. However, because they are so fast, we need far fewer of them, and thus the risk is not as evenly distributed. We artificially increase the amount of these drives because, again, we are erring toward availability.

Finally, we add a few extra devices for spares that will go in a closet somewhere, just in case one or more drives fail. This also insulates us from the rare event that hardware is discontinued or otherwise difficult to obtain.

There's more...

Figuring out the number of IOPS we need and the devices involved is only part of the story.

A working example

Sometimes these large lists of calculations make more sense if we see them in practice. So let's make the assumption that 20,000 users will use our application each second. This is how that would look:

  • 20000 / 50 = 400
  • Default queries per page = 10
  • Default tables per query = 3
  • 400 * 10 * 3 * 2 = 2400
  • 2400 * 8 = 19200
  • 19200 IOPS in drives:
    • 3.5" drives: 19200 / 200 = 96
    • 2.5" drives: 19200 / 150 = 128
    • SSDs: 2 + (19200 / 50000) = 2.38 ~ 3
  • Add 10 percent:
    • 3.5" drives: 96 + 9.6 = 105.6 ~ 106
    • 2.5" drives: 128 + 12.8 = 140.8 ~ 141
    • SSDs: 3 + 0.3 = 3.3 ~ 4

We are not taking space into account either, which would also increase our SSD count. We will be discussing capacity soon.

Making concessions

Our calculations always assume worst-case scenarios. This is both expensive and in many cases, overzealous. We ignore RAM caching of disk blocks, we don't account for application frontend caches, and the PostgreSQL shared buffers are also not included.

Why? Crashes are always a concern. If a database crashes, buffers are forfeit. If the application frontend cache gets emptied or has problems, reads will be served directly from the database. Until caches are rebuilt, query results can be multiple orders of magnitude slower than normal for minutes or hours. We will discuss methods of circumventing these effects, but these IOPS numbers give us a baseline.

The number of necessary IOPS, and hence disk requirements, are subject to risk evaluation and cost benefit analysis. Deciding between 100 percent coverage and an acceptable fraction is a careful balancing act. Feel free to reduce these numbers; just consider the cost of an outage as part of the total. If a delay is considered standard operating procedures, fractions up to 50 percent are relatively low risk. If possible, try to run tests for an ultimate decision before purchase.

lock icon The rest of the chapter is locked
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime
Banner background image