Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon
PostgreSQL Replication, Second Edition
PostgreSQL Replication, Second Edition

PostgreSQL Replication, Second Edition: Leverage the power of PostgreSQL replication to make your databases more robust, secure, scalable, and fast

Arrow left icon
Profile Icon Hans-Jürgen Schönig
Arrow right icon
$27.98 $39.99
Full star icon Full star icon Full star icon Full star icon Half star icon 4.5 (4 Ratings)
eBook Jul 2015 322 pages 1st Edition
eBook
$27.98 $39.99
Paperback
$48.99
Subscription
Free Trial
Renews at $19.99p/m
Arrow left icon
Profile Icon Hans-Jürgen Schönig
Arrow right icon
$27.98 $39.99
Full star icon Full star icon Full star icon Full star icon Half star icon 4.5 (4 Ratings)
eBook Jul 2015 322 pages 1st Edition
eBook
$27.98 $39.99
Paperback
$48.99
Subscription
Free Trial
Renews at $19.99p/m
eBook
$27.98 $39.99
Paperback
$48.99
Subscription
Free Trial
Renews at $19.99p/m

What do you get with eBook?

Product feature icon Instant access to your Digital eBook purchase
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
OR
Modal Close icon
Payment Processing...
tick Completed

Billing Address

Table of content icon View table of contents Preview book icon Preview Book

PostgreSQL Replication, Second Edition

Chapter 1. Understanding the Concepts of Replication

Replication is an important issue and, in order to get started, it is highly important to understand some basic concepts and theoretical ideas related to replication. In this chapter, you will be introduced to various concepts of replication, and learn which kind of replication is most suitable for which kind of practical scenario. By the end of the chapter, you will be able to judge whether a certain concept is feasible under various circumstances or not.

We will cover the following topics in this chapter:

  • The CAP theorem and physical limitations of replication
  • Different types of replication
  • Synchronous and asynchronous replication
  • Sharding and data distribution

The goal of this chapter is to give you a lot of insights into the theoretical concepts. This is truly important in order to judge whether a certain customer requirement is actually technically feasible or not. You will be guided through the fundamental ideas and important concepts related to replication.

The CAP theorem and physical limitations of replication

You might wonder why theory is being covered in such a prominent place in a book that is supposed to be highly practical. Well, there is a very simple reason for that: some nice-looking marketing papers of some commercial database vendors might leave you with the impression that everything is possible and easy to do, without any serious limitation. This is not the case; there are physical limitations that every software vendor has to cope with. There is simply no way around the laws of nature, and shiny marketing cannot help overcome nature. The laws of physics and logic are the same for everybody, regardless of the power of someone's marketing department.

In this section, you will be taught the so-called CAP theorem. Understanding the basic ideas of this theorem is essential to avoid some requirements that cannot be turned into reality.

The CAP theorem was first described by Eric Brewer back in the year 2000. It has quickly developed into one of the most fundamental concepts in the database world. Especially with the rise of NoSQL database systems, Brewer's theorem (as the CAP theorem is often called) has become an important cornerstone of every distributed system.

Understanding the CAP theorem

Before we dig into the details, we need to discuss what CAP actually means. CAP is an abbreviation for the following three concepts:

  • Consistency: This term indicates whether all the nodes in a cluster see the same data at the same time or not. A read-only node has to see all previously completed reads at any time.
  • Availability: Reads and writes have to succeed all the time. In other words a node has to be available for users at any point of time.
  • Partition tolerance: This means that the system will continue to work even if arbitrary messages are lost on the way. A network partition event occurs when a system is no longer accessible (think of a network connection failure). A different way of considering partition tolerance is to think of it as message passing. If an individual system can no longer send or receive messages from other systems, it means that it has been effectively partitioned out of the network. The guaranteed properties are maintained even when network failures prevent some machines from communicating with others.

Why are these three concepts relevant to normal users? Well, the bad news is that a replicated (or distributed) system can provide only two out of these three features at the same time.

Note

Keep in mind that only two out of the three promises can be fulfilled.

It is theoretically impossible to offer consistency, availability, and partition tolerance at the same time. As you will see later in this book, this can have a significant impact on system layouts that are safe and feasible to use. There is simply no such thing as the solution to all replication-related problems. When you are planning a large-scale system, you might have to come up with different concepts, depending on needs that are specific to your requirements.

Note

PostgreSQL, Oracle, DB2, and so on will provide you with CAp ("consistent" and "available"), while NoSQL systems, such as MongoDB and Cassandra, will provide you with cAP ("available" and "partition tolerant"). This is why NoSQL is often referred to as eventually consistent.

Consider a financial application. You really want to be consistent and partition tolerant. Keeping balances in sync is the highest priority.

Or consider an application collecting a log of weather data from some remote locations. If the data is a couple of minutes late, it is really no problem. In this case, you might want to go for cAP. Availability and partition tolerance might really be the most important things in this case.

Depending on the use, people have to decide what is really important and which attributes (consistency, availability, or partition tolerance) are crucial and which can be neglected.

Keep in mind there is no system which can fulfill all those wishes at the same time (neither open source nor paid software).

Understanding the limits of physics

The speed of light is not just a theoretical issue; it really does have an impact on your daily life. And more importantly, it has a serious implication when it comes to finding the right solution for your cluster.

We all know that there is some sort of cosmic speed limit called the speed of light. So why care? Well, let's do a simple mental experiment. Let's assume for a second that our database server is running at 3 GHz clock speed.

How far can light travel within one clock cycle of your CPU? If you do the math, you will figure out that light travels around 10 cm per clock cycle (in pure vacuum). We can safely assume that an electric signal inside a CPU will be very slow compared to pure light in vacuum. The core idea is, "10 cm in one clock cycle? Well, this is not much at all."

For the sake of our mental experiment, let's now consider various distances:

  • Distance from one end of the CPU to the other
  • Distance from your server to some other server next door
  • Distance from your server in Central Europe to a server somewhere in China

Considering the size of a CPU core on a die, you can assume that you can send a signal (even if it is not traveling anywhere close to the speed of light) from one part of the CPU to some other part quite fast. It simply won't take 1 million clock cycles to add up two numbers that are already in your first level cache on your CPU.

But what happens if you have to send a signal from one server to some other server and back? You can safely assume that sending a signal from server A to server B next door takes a lot longer because the cable is simply a lot longer. In addition to that, network switches and other network components will add some latency as well.

Let's talk about the length of the cable here, and not about its bandwidth.

Sending a message (or a transaction) from Europe to China is, of course, many times more time-consuming than sending some data to a server next door. Again, the important thing here is that the amount of data is not as relevant as the so-called latency, consider the following criteria:

  • Long-distance transmission: To explain the concept of latency, let's cover a very simple example. Let's assume you are a European and you are sending a letter to China. You will easily accept the fact that the size of your letter is not the limiting factor here. It makes absolutely no difference whether your letter is two or 20 pages long; the time it takes to reach the destination is basically the same. Also, it makes no difference whether you send one, two, or 10 letters at the same time. Given a reasonable numbers of letter, the size of the aircraft required (that is, the bandwidth) to ship the stuff to China is usually not the problem. However, the so-called round trip might very well be an issue. If you rely on the response to your letter from China to continue your work, you will soon find yourself waiting for a long time.
  • Why latency matters: Latency is an important issue. If you send a chunk of data from Europe to China, you should avoid waiting for the response. But if you send a chunk of data from your server to a server in the same rack, you might be able to wait for the response, because your electronic signal will simply be fast enough to make it back in time.

Note

The basic problems of latency described in this section are not PostgreSQL-specific. The very same concepts and physical limitations apply to all types of databases and systems. As mentioned before, this fact is sometimes silently hidden and neglected in shiny commercial marketing papers. Nevertheless, the laws of physics will stand firm. This applies to both commercial and open source software.

The most important point you have to keep in mind here is that bandwidth is not always the magical fix to a performance problem in a replicated environment. In many setups, latency is at least as important as bandwidth.

Different types of replication

Now that you are fully armed with the basic understanding of physical and theoretical limitations, it is time to learn about different types of replication. It is important to have a clear image of these types to make sure that the right choice can be made and the right tool can be chosen. In this section, synchronous as well as asynchronous replication will be covered.

Synchronous versus asynchronous replication

Let's dig into some important concepts now. The first distinction we can make is whether to replicate synchronously or asynchronously.

What does this mean? Let's assume we have two servers and we want to replicate data from one server (the master) to the second server (the slave). The following diagram illustrates the concept of synchronous and asynchronous replication:

Synchronous versus asynchronous replication

We can use a simple transaction like the one shown in the following:

BEGIN;
INSERT INTO foo VALUES ('bar');
COMMIT;

In the case of asynchronous replication, the data can be replicated after the transaction has been committed on the master. In other words, the slave is never ahead of the master; and in the case of writing, it is usually a little behind the master. This delay is called lag.

Synchronous replication enforces higher rules of consistency. If you decide to replicate synchronously (how this is done practically will be discussed in Chapter 5, Setting Up Synchronous Replication), the system has to ensure that the data written by the transaction will be at least on two servers at the time the transaction commits. This implies that the slave does not lag behind the master and that the data seen by the end users will be identical on both the servers.

Tip

Some systems will also use a quorum server to decide. So, it is not always about just two or more servers. If a quorum is used, more than half of the servers must agree on an action inside the cluster.

Considering performance issues

As you have learned earlier in the section about the speed of light and latency, sending unnecessary messages over the network can be expensive and time-consuming. If a transaction is replicated in a synchronous way, PostgreSQL has to make sure that the data reaches the second node, and this will lead to latency issues.

Synchronous replication can be more expensive than asynchronous replication in many ways, and therefore, people should think twice about whether this overhead is really needed and justified. In the case of synchronous replication, confirmations from a remote server are needed. This, of course, causes some additional overhead. A lot has been done in PostgreSQL to reduce this overhead as much as possible. However, it is still there.

Tip

Use synchronous replication only when it is really needed.

Understanding replication and data loss

When a transaction is replicated from a master to a slave, many things have to be taken into consideration, especially when it comes to things such as data loss.

Let's assume that we are replicating data asynchronously in the following manner:

  1. A transaction is sent to the master.
  2. It commits on the master.
  3. The master dies before the commit is sent to the slave.
  4. The slave will never get this transaction.

In the case of asynchronous replication, there is a window (lag) during which data can essentially be lost. The size of this window might vary, depending on the type of setup. Its size can be very short (maybe as short as a couple of milliseconds) or long (minutes, hours, or days). The important fact is that data can be lost. A small lag will only make data loss less likely, but any lag larger than zero lag is susceptible to data loss. If data can be lost, we are about to sacrifice the consistency part of CAP (if two servers don't have the same data, they are out of sync).

If you want to make sure that data can never be lost, you have to switch to synchronous replication. As you have already seen in this chapter, a synchronous transaction is synchronous because it will be valid only if it commits to at least two servers.

Single-master versus multimaster replication

A second way to classify various replication setups is to distinguish between single-master and multi-master replication.

"Single-master" means that writes can go to exactly one server, which distributes the data to the slaves inside the setup. Slaves may receive only reads, and no writes.

In contrast to single-master replication, multi-master replication allows writes to all the servers inside a cluster. The following diagram shows how things work at a conceptual level:

Single-master versus multimaster replication

The ability to write to any node inside the cluster sounds like an advantage, but it is not necessarily one. The reason for this is that multimaster replication adds a lot of complexity to the system. In the case of only one master, it is totally clear which data is correct and in which direction data will flow, and there are rarely conflicts during replication. Multimaster replication is quite different, as writes can go to many nodes at the same time, and the cluster has to be perfectly aware of conflicts and handle them gracefully. An alterative would be to use locks to solve the problem, but this approach will also have its own challenges.

Tip

Keep in mind that the need to resolve conflicts will cause network traffic, and this can instantly turn into scalability issues caused by latency.

Logical versus physical replication

One more way of classifying replication is to distinguish between logical and physical replication.

The difference is subtle but highly important:

  • Physical replication means that the system will move data as is to the remote box. So, if something is inserted, the remote box will get data in binary format, not via SQL.
  • Logical replication means that a change, which is equivalent to data coming in, is replicated.

Let's look at an example to fully understand the difference:

test=# CREATE TABLE t_test (t date);
CREATE TABLE
test=# INSERT INTO t_test VALUES (now())
RETURNING *;
t
------------
 2013-02-08
(1 row)

INSERT 0 1

We see two transactions going on here. The first transaction creates a table. Once this is done, the second transaction adds a simple date to the table and commits.

In the case of logical replication, the change will be sent to some sort of queue in logical form, so the system does not send plain SQL, but maybe something such as this:

test=# INSERT INTO t_test VALUES ('2013-02-08');
INSERT 0 1

Note that the function call has been replaced with the real value. It would be a total disaster if the slave were to calculate now() once again, because the date on the remote box might be a totally different one.

Tip

Some systems do use statement-based replication as the core technology. MySQL, for instance, uses a so-called bin-log statement to replicate, which is actually not too binary but more like some form of logical replication. Of course, there are also counterparts in the PostgreSQL world, such as pgpool, Londiste, and Bucardo.

Physical replication will work in a totally different way; instead of sending some SQL (or something else) over, which is logically equivalent to the changes made, the system will send binary changes made by PostgreSQL internally.

Here are some of the binary changes our two transactions might have triggered (but by far, this is not a complete list):

  1. Added an 8 K block to pg_class and put a new record there (to indicate that the table is present).
  2. Added rows to pg_attribute to store the column names.
  3. Performed various changes inside the indexes on those tables.
  4. Recorded the commit status, and so on.

The goal of physical replication is to create a copy of your system that is (largely) identical on the physical level. This means that the same data will be in the same place inside your tables on all boxes. In the case of logical replication, the content should be identical, but it makes no difference whether it is in the same place or not.

When to use physical replication

Physical replication is very convenient to use and especially easy to set up. It is widely used when the goal is to have identical replicas of your system (to have a backup or to simply scale up).

In many setups, physical replication is the standard method that exposes the end user to the lowest complexity possible. It is ideal for scaling out the data.

When to use logical replication

Logical replication is usually a little harder to set up, but it offers greater flexibility. It is also especially important when it comes to upgrading an existing database. Physical replication is totally unsuitable for version jumps because you cannot simply rely on the fact that every version of PostgreSQL has the same on-disk layout. The storage format might change over time, and therefore, a binary copy is clearly not feasible for a jump from one version to the next.

Logical replication allows decoupling of the way data is stored from the way it is transported and replicated. Using a neutral protocol, which is not bound to any specific version of PostgreSQL, it is easy to jump from one version to the next.

Since PostgreSQL 9.4, there is something called Logical Decoding. It allows users to extract internal changes sent to the XLOG as SQL again. Logical decoding will be needed for a couple of replication techniques outlined in this book.

Using sharding and data distribution

In this section, you will learn about basic scalability techniques, such as database sharding. Sharding is widely used in high-end systems and offers a simple and reliable way to scale out a setup. In recent years, it has become the standard way to scale up professional systems.

Understanding the purpose of sharding

What happens if your setup grows beyond the capacity of a single machine in a single-master setup? What if you want to run so many transactions that one server is simply not able to keep up with them? Let's assume you have millions of users and tens of thousands among them want to perform a certain task at the same time.

Clearly, at some point, you cannot buy servers that are big enough to handle an infinite load anymore. It is simply impossible to run a Facebook- or Google-like application on a single server. At some point, you have to come up with a scalability strategy that serves your needs. This is when sharding comes into play.

The idea of sharding is simple: What if you could split data in a way that it can reside on different nodes?

Designing a sharded system

To demonstrate the basic concept of sharding, let's assume the following scenario: we want to store information about millions of users. Each user has a unique user ID. Let's further assume that we have only two servers. In this case, we can store even user IDs on server 1 and odd user IDs on server 2.

The following diagram shows how this can be done:

Designing a sharded system

As you can see in our diagram, we have nicely distributed the data. Once this is done, we can send a query to the system, as follows:

SELECT * FROM t_user WHERE id = 4;

The client can easily figure out where to find the data by inspecting the filter in our query. In our example, the query will be sent to the first node because we are dealing with an even number.

As we have distributed the data based on a key (in this case, the user ID), we can search for any person easily if we know the key. In large systems, referring to users through a key is a common practice, and therefore, this approach is suitable. By using this simple approach, we have also easily doubled the number of machines in our setup.

When designing a system, we can easily come up with an arbitrary number of servers; all we have to do is to invent a nice and clever partitioning function to distribute the data inside our server farm. If we want to split the data between 10 servers (not a problem), how about using user ID % 10 as a partitioning function? If you are interested in sharding, consider checking out shard_manager, which is available on PGXN.

When you are trying to break up data and store it on different hosts, always make sure that you are using a proper partitioning function. It can be very beneficial to split data in such a way that each host has more or less the same amount of data.

Splitting users alphabetically might not be a good idea. The reason for that is that not all the letters are equally likely. We cannot simply assume that the letters from A to M occur as often as the letters from N to Z. This can be a major issue if you want to distribute a dataset to a thousand servers instead of just a handful of machines. As stated before, it is essential to have a proper partitioning function, that produces evenly distributed results.

Tip

In many cases, a hash function will provide you with nicely and evenly distributed data. This can especially be useful when working with character fields (such as names, e-mail addresses, and so on).

Querying different fields

In the previous section, you saw how we can easily query a person using their key. Let's take this a little further and see what happens if the following query is used:

SELECT * FROM t_test WHERE name = 'Max';

Remember that we have distributed data using the ID. In our query, however, we are searching for the name. The application will have no idea which partition to use because there is no rule telling us what is where.

As a logical consequence, the application has to ask every partition for the name parameter. This might be acceptable if looking for the name was a real corner case; however, we cannot rely on this fact. Requiring to ask many servers instead of one is clearly a serious deoptimization, and therefore, not acceptable.

We have two ways to approach the problem: coming up with a cleverer partitioning function, or storing the data redundantly.

Coming up with a cleverer partitioning function would surely be the best option, but it is rarely possible if you want to query different fields.

This leaves us with the second option, which is storing data redundantly. Storing a set of data twice, or even more often, is not too uncommon, and it's actually a good way to approach the problem. The following diagram shows how this can be done:

Querying different fields

As you can see, we have two clusters in this scenario. When a query comes in, the system has to decide which data can be found on which node. For cases where the name is queried, we have (for the sake of simplicity) simply split the data into half alphabetically. In the first cluster, however, our data is still split by user ID.

Pros and cons of sharding

One important thing to understand is that sharding is not a simple one-way street. If someone decides to use sharding, it is essential to be aware of the upsides as well as the downsides of the technology. As always, there is no Holy Grail that magically solves all the problems of mankind out of the box without them having to think about it.

Each practical use case is different, and there is no replacement for common sense and deep thinking.

First, let's take a look at the pros of sharding:

  • It has the ability to scale a system beyond one server
  • It is a straightforward approach
  • It is widely supported by various frameworks
  • It can be combined with various other replication approaches
  • It works nicely with PostgreSQL (for example, using PL/Proxy)

Light and shade tend to go together, and therefore sharding also has its downsides:

  • Adding servers on the fly can be far from simple (depending on the type of partitioning function)
  • Your flexibility might be seriously reduced
  • Not all types of queries will be as efficient as they would be on a single server
  • There is an increase in overall complexity of the setup (such as failover, resyncing, maintenance and so on)
  • Backups need more planning
  • You might face redundancy and additional storage requirements
  • Application developers need to be aware of sharding to make sure that efficient queries are written

In Chapter 13, Scaling with PL/Proxy, we will discuss how you can efficiently use sharding along with PostgreSQL, and how to set up PL/Proxy for maximum performance and scalability.

Choosing between sharding and redundancy

Learning how to shard a table is only the first step to designing a scalable system's architecture. In the example we showed in the previous section, we had only one table, which could be distributed easily using a key. But what if we have more than one table? Let's assume we have two tables:

  • A table called t_user for storing the users in our system
  • A table called t_language for storing the languages supported by our system

We might be able to partition the t_user table nicely and split it in such a way that it can reside on a reasonable number of servers. But what about the t_language table? Our system might support as many as 10 languages.

It can make perfect sense to shard and distribute hundreds of millions of users, but splitting up 10 languages? This is clearly useless. In addition to all this, we might need our language table on all the nodes so that we can perform joins.

One solution to the problem is simple: you need a full copy of the language table on all the nodes. This will not cause a storage-consumption-related problem because the table is so small. Of course, there are many different ways to attack the problem.

Tip

Make sure that only large tables are sharded. In the case of small tables, full replicas of the tables might just make much more sense.

Again, every case has to be thought over thoroughly.

Increasing and decreasing the size of a cluster

So far, we have always considered the size of a sharded setup to be constant. We have designed sharding in a way that allowed us to utilize a fixed number of partitions inside our cluster. This limitation might not reflect in everyday requirements. How can you really tell for certain how many nodes will be needed at the time a setup is designed? People might have a rough idea of the hardware requirements, but actually knowing how much load to expect is more art than science.

Tip

To reflect this, you have to design a system in such a way that it can be resized easily.

A commonly made mistake is that people tend to increase the size of their setup in unnecessarily small steps. Somebody might want to move from five to maybe six or seven machines. This can be tricky. Let's assume for a second that we have split data using user id % 5 as the partitioning function. What if we wanted to move to user id % 6? This is not so easy; the problem is that we have to rebalance the data inside our cluster to reflect the new rules.

Remember that we have introduced sharding (that is, partitioning) because we have so much data and so much load that one server cannot handle the requests anymore. Now, if we come up with a strategy that requires rebalancing of data, we are already on the wrong track. You definitely don't want to rebalance 20 TB of data just to add two or three servers to your existing system.

Practically, it is a lot easier to simply double the number of partitions. Doubling your partitions does not require rebalancing of data because you can simply follow the strategy outlined here:

  1. Create a replica of each partition
  2. Delete half of the data on each partition

If your partitioning function was user id % 5 before, it should be user id % 10 afterwards. The advantage of doubling is that data cannot move between partitions. When it comes to doubling, users might argue that the size of your cluster might increase too rapidly. This is true, but if you are running out of capacity, adding 10 percent storage to your resources won't fix the problem of scalability anyway.

Instead of just doubling your cluster (which is fine for most cases), you can also give more thought to writing a more sophisticated partitioning function that leaves the old data in place but handles the more recent data more intelligently. Having time-dependent partitioning functions might cause issues of its own, but it might be worth investigating this path.

Tip

Some NoSQL systems use range partitioning to spread out data. Range partitioning means that each server has a fixed slice of data for a given time frame. This can be beneficial if you want to perform time series analysis or something similar. However, it can be counterproductive if you want to make sure that data is split evenly.

If you expect your cluster to grow, we recommend starting with more partitions than those initially necessary, and packing more than just one partition on a single server. Later on, it will be easy to move single partitions to additional hardware joining the cluster setup. Some cloud services are able to do that, but those aspects are not covered in this book.

To shrink your cluster again, you can simply apply the opposite strategy and move more than just one partition to a single server. This leaves the door for a future increase of servers wide open, and can be done fairly easily.

Consistent hashing is another approach to distributing data. This technique is widely used in NoSQL systems and allows us to extend clusters in a more flexible way. However, the same technique can be used for PostgreSQL, of course.

Let's assume we have three servers (A, B, and C). What happens in consistent hashing is that an array of, say, 1,000 slots can be created. Then each server name is hashed a couple of times and entered in the array. The result might look like this:

 43 B, 153 A, 190 C, 340 A, 450 C, 650 B, 890 A, 930 C, 980 B

Each value shows up multiple times. In the case of insertion, we take the input key and calculate a value. Let's assume hash (some input value) equals 58. The result will go to server A. Why? There is no entry for 58, so the system moves forward in the list, and the first valid entry is 153, which points to A. If the hash value returned 900, the data would end up on C. Again, there is no entry for 900 so the system has to move forward in the list until something is found.

If a new server is added, new values for the server will be added to the array (D might be on 32, 560, 940, or so). The system has to rebalance some data, but of course, not all of the data. It is a major advantage over a simple hashing mechanism, such as a simple key % server_number function. Reducing the amount of data to be resharded is highly important.

The main advantage of consistent hashing is that it scales a lot better than simple approaches.

Combining sharding and replication

Once data has been broken down into useful chunks, which can be handled by one server or a partition, we have to think about how to make the entire setup more reliable and fail-safe.

The more servers you have in your setup, the more likely it will be that one of those servers will be down or not available for some other reason.

Tip

Always avoid single points of failure when designing a highly scalable system.

In order to ensure maximum throughput and availability, we can again turn to redundancy. The design approach can be summed up in a simple formula, which should always be in the back of a system architect's mind:

"One is none and two is one."

One server is never enough to provide us with High Availability. Every system needs a backup system that can take over in the event of a serious emergency. By just splitting a set of data, we definitely do not improve availability. This is because we have more servers, which can fail at this point. To fix this problem, we can add replicas to each of our partitions (shards), just as is shown in the following diagram:

Combining sharding and replication

Each partition is a separate PostgreSQL database instance, and each of those instances can have its own replica (or replicas). Essentially, it is the same concept as you will find in a RAID 1+0 setup on the hardware side.

Keep in mind that you can choose from the entire arsenal of features and options discussed in this book (for example, synchronous and asynchronous replication). All the strategies outlined in this book can be combined flexibly. A single technique is usually not enough, so feel free to combine various technologies in different ways to achieve your goals.

Various sharding solutions

In recent years, sharding has emerged as an industry-standard solution to many scalability-related problems. Thus, many programming languages, frameworks, and products are already providing out-of-the-box support for sharding.

When implementing sharding, you can basically choose between two strategies:

  • Rely on some framework or middleware
  • Rely on PostgreSQL's means to solve the problem

In the following chapters, we will discuss both options briefly. This little overview is not meant to be a comprehensive guide, but rather an overview to get you started with sharding.

PostgreSQL-based sharding

PostgreSQL cannot shard data out of the box, but it has all the interfaces and means required to allow sharding through add-ons. One of these add-ons, which is widely used, is called PL/Proxy. It has been around for many years, and offers superior transparency as well as good scalability. It was originally developed by Skype to scale up their infrastructure.

The idea behind PL/Proxy is basically to use a local virtual table to hide an array of servers making up the table.

PL/Proxy will be discussed in depth in Chapter 13, Scaling with PL/Proxy.

Summary

In this chapter, you learned about basic replication-related concepts as well as the physical limitations. We dealt with the theoretical concepts, which are the groundwork for what is still to come in this book.

In the next chapter, you will be guided through the PostgreSQL transaction log, and we will outline all important aspects of this vital component. You will learn what the transaction log is good for and how it can be applied.

Left arrow icon Right arrow icon

Description

This book is ideal for PostgreSQL administrators who want to set up and understand replication. By the end of the book, you will be able to make your databases more robust and secure by getting to grips with PostgreSQL replication.

Who is this book for?

This book is ideal for PostgreSQL administrators who want to set up and understand replication. By the end of the book, you will be able to make your databases more robust and secure by getting to grips with PostgreSQL replication.

What you will learn

  • Use Pointintime Recovery to perform data recovery as well as replication
  • Set up synchronous as well as asynchronous streaming replication
  • Get familiarized with the transaction log, the core component of most replication setups and its purpose
  • Improve speed and reliability with an understanding of pgpool and PgBouncer
  • Increase your data security and geographically distribute data
  • Make your systems more available and secure with Linux High Availability
  • Scale out with PL/Proxy and PostgresXC
  • Detect, investigate, and solve replicationrelated problems

Product Details

Country selected
Publication date, Length, Edition, Language, ISBN-13
Publication date : Jul 28, 2015
Length: 322 pages
Edition : 1st
Language : English
ISBN-13 : 9781783988549
Category :
Tools :

What do you get with eBook?

Product feature icon Instant access to your Digital eBook purchase
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
OR
Modal Close icon
Payment Processing...
tick Completed

Billing Address

Product Details

Publication date : Jul 28, 2015
Length: 322 pages
Edition : 1st
Language : English
ISBN-13 : 9781783988549
Category :
Tools :

Packt Subscriptions

See our plans and pricing
Modal Close icon
$19.99 billed monthly
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Simple pricing, no contract
$199.99 billed annually
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just $5 each
Feature tick icon Exclusive print discounts
$279.99 billed in 18 months
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just $5 each
Feature tick icon Exclusive print discounts

Frequently bought together


Stars icon
Total $ 142.97
Troubleshooting PostgreSQL
$32.99
PostgreSQL Replication, Second Edition
$48.99
Learning PostgreSQL
$60.99
Total $ 142.97 Stars icon
Banner background image

Table of Contents

16 Chapters
1. Understanding the Concepts of Replication Chevron down icon Chevron up icon
2. Understanding the PostgreSQL Transaction Log Chevron down icon Chevron up icon
3. Understanding Point-in-time Recovery Chevron down icon Chevron up icon
4. Setting Up Asynchronous Replication Chevron down icon Chevron up icon
5. Setting Up Synchronous Replication Chevron down icon Chevron up icon
6. Monitoring Your Setup Chevron down icon Chevron up icon
7. Understanding Linux High Availability Chevron down icon Chevron up icon
8. Working with PgBouncer Chevron down icon Chevron up icon
9. Working with pgpool Chevron down icon Chevron up icon
10. Configuring Slony Chevron down icon Chevron up icon
11. Using SkyTools Chevron down icon Chevron up icon
12. Working with Postgres-XC Chevron down icon Chevron up icon
13. Scaling with PL/Proxy Chevron down icon Chevron up icon
14. Scaling with BDR Chevron down icon Chevron up icon
15. Working with Walbouncer Chevron down icon Chevron up icon
Index Chevron down icon Chevron up icon

Customer reviews

Rating distribution
Full star icon Full star icon Full star icon Full star icon Half star icon 4.5
(4 Ratings)
5 star 50%
4 star 50%
3 star 0%
2 star 0%
1 star 0%
Jeff L Sep 21, 2015
Full star icon Full star icon Full star icon Full star icon Full star icon 5
This book is a great introductory text to anyone needing to get started with data replication on PostgreSQL and is unsure about how to get started. It covers all of the major replication technologies currently available and describes some of the benefits and limitations of each. General optimization strategies to improve replication performance are also covered. Since the scope of coverage is pretty wide across all of the different replication technologies, this book doesn't quiet have enough space to devote more than single chapter to each, so you may still need to consult other manuals and reference materials once you've narrowed down your replication strategy.
Amazon Verified review Amazon
Lance Alligood Nov 12, 2015
Full star icon Full star icon Full star icon Full star icon Full star icon 5
Normally when I am googling for assistance with PostgreSQL, it's all but guaranteed that at least one of the first 3 hits is for the postgresql.org. However, when it came to configuring master-slave replication, that wasn't the case. Most of the hits were instead for various blog posts or forums. Any details were frequently poorly formatted, glossed over--or omitted(!)--important steps, and never supplied adequate background information for the values assigned to parameters in postgresql.conf. To make matters worse, there was little mention of what version of PostgreSQL the instructions were for in those posts. My frustration level was off the charts as I tried to make sense of this poor & inconsistent "documentation"!I was in (desperate!) need of some accurate, explanatory guidance to get replication & point-in-time-recovery (PITR) up & going on my PostgreSQL 9.4 servers. Normally the PostgreSQL project is fantastic in supplying clear & precise documentation for commands & components of the database, but that just isn't the case at all when it came to making sense of replication and/or PITR. However, I must fully acknowledge that the PostgreSQL project has been exerting considerable effort into improving the ease of configuration & use when it comes to replication & PITR. (And it's only a matter of time the website documentation is there to go along with it.) Based on my searches, there have been massive changes in replication & PITR for (at least) 9.1, 9.2, & 9.3.PostgreSQL Replication Second Edition just came out in July 2015 so it is thankfully relevant to 9.4, although the changes from 9.3 -> 9.4 (command-wise at least) appear to be less dramatic than the previous 9.x releases. This was very much the documentation I was looking for! I have no issues with an author nor a publisher putting out a well-written book, but why PostgreSQL doesn't have webpages on its site comparable to this excellent tome is a real head-scratcher. That said, this book definitely lives up to the hype of showing you what it takes to get either/both replication or PITR running on PostgreSQL. I was particularly thankful for the explanation of the main files/directories that make up the database & all of the necessary parameters in postgresql.conf & pg_hba.conf!The project admits that there will certainly be more changes in the future--and I'm quite certain they'll all be for the better!--when it comes to replication & PITR, but for the hear & now in late 2015, this book will get you up to speed on configuring those with the current tools right now, quickly, & with little effort.
Amazon Verified review Amazon
Ángel Mar 28, 2018
Full star icon Full star icon Full star icon Full star icon Empty star icon 4
Tiene base teórica, ejemplos prácticos, bastante completo en sus explicaciones, además de que referencia a documentos en internet (las URLs han cambiado desde la edición del libro y es el único fallo que le encuentro).
Amazon Verified review Amazon
Ajay Thakur Aug 12, 2015
Full star icon Full star icon Full star icon Full star icon Empty star icon 4
This is an amazing book on Replication.Concepts are well explained and things are well described too.RegardsChitij Chauhan
Amazon Verified review Amazon
Get free access to Packt library with over 7500+ books and video courses for 7 days!
Start Free Trial

FAQs

How do I buy and download an eBook? Chevron down icon Chevron up icon

Where there is an eBook version of a title available, you can buy it from the book details for that title. Add either the standalone eBook or the eBook and print book bundle to your shopping cart. Your eBook will show in your cart as a product on its own. After completing checkout and payment in the normal way, you will receive your receipt on the screen containing a link to a personalised PDF download file. This link will remain active for 30 days. You can download backup copies of the file by logging in to your account at any time.

If you already have Adobe reader installed, then clicking on the link will download and open the PDF file directly. If you don't, then save the PDF file on your machine and download the Reader to view it.

Please Note: Packt eBooks are non-returnable and non-refundable.

Packt eBook and Licensing When you buy an eBook from Packt Publishing, completing your purchase means you accept the terms of our licence agreement. Please read the full text of the agreement. In it we have tried to balance the need for the ebook to be usable for you the reader with our needs to protect the rights of us as Publishers and of our authors. In summary, the agreement says:

  • You may make copies of your eBook for your own use onto any machine
  • You may not pass copies of the eBook on to anyone else
How can I make a purchase on your website? Chevron down icon Chevron up icon

If you want to purchase a video course, eBook or Bundle (Print+eBook) please follow below steps:

  1. Register on our website using your email address and the password.
  2. Search for the title by name or ISBN using the search option.
  3. Select the title you want to purchase.
  4. Choose the format you wish to purchase the title in; if you order the Print Book, you get a free eBook copy of the same title. 
  5. Proceed with the checkout process (payment to be made using Credit Card, Debit Cart, or PayPal)
Where can I access support around an eBook? Chevron down icon Chevron up icon
  • If you experience a problem with using or installing Adobe Reader, the contact Adobe directly.
  • To view the errata for the book, see www.packtpub.com/support and view the pages for the title you have.
  • To view your account details or to download a new copy of the book go to www.packtpub.com/account
  • To contact us directly if a problem is not resolved, use www.packtpub.com/contact-us
What eBook formats do Packt support? Chevron down icon Chevron up icon

Our eBooks are currently available in a variety of formats such as PDF and ePubs. In the future, this may well change with trends and development in technology, but please note that our PDFs are not Adobe eBook Reader format, which has greater restrictions on security.

You will need to use Adobe Reader v9 or later in order to read Packt's PDF eBooks.

What are the benefits of eBooks? Chevron down icon Chevron up icon
  • You can get the information you need immediately
  • You can easily take them with you on a laptop
  • You can download them an unlimited number of times
  • You can print them out
  • They are copy-paste enabled
  • They are searchable
  • There is no password protection
  • They are lower price than print
  • They save resources and space
What is an eBook? Chevron down icon Chevron up icon

Packt eBooks are a complete electronic version of the print edition, available in PDF and ePub formats. Every piece of content down to the page numbering is the same. Because we save the costs of printing and shipping the book to you, we are able to offer eBooks at a lower cost than print editions.

When you have purchased an eBook, simply login to your account and click on the link in Your Download Area. We recommend you saving the file to your hard drive before opening it.

For optimal viewing of our eBooks, we recommend you download and install the free Adobe Reader version 9.