Hands-On High Performance Programming with Qt 5

Modern processor architectures

All the classic performance advice and algorithmic foo stems from the times of simple CPU setups, where processor and memory speeds were roughly equal. But then processor speeds exploded, increasing quite faithfully to Moore's law by about 60% per year, while memory access times improved by only around 10% per year and couldn't keep pace. The problem is that main memory, dynamic random-access memory (DRAM), consists of minuscule capacitors keeping an electrical charge to indicate a 1 bit and none to indicate a 0 bit. This results in inexpensive circuitry that doesn't have to be kept under voltage, but it works basically in the analog realm and can't profit that much from advances made in digital components.

The second change that has occurred since then is the demise of Moore's law in its simple form. Up to the early 2000s, CPU manufacturers steadily increased processor frequencies, making CPUs run faster and faster. That was achieved by increasing the number of transistors packed onto chips, and Moore's law predicted that the number of transistors that can be packed onto a chip would double every 18 months. In simple terms, this was understood as doubling the processor speed every two years.

This trend continued until processor manufacturers hit a physical barrier, the so-called power wall: at some point, the densely packed transistors produced so much heat that they couldn't be effectively cooled on consumer machines (high-end water-cooling systems are far too expensive for a laptop or a mobile device), so a different approach to increasing a CPU's performance had to be found.

Caches

The attempts to overcome these problems led to a slew of architectural innovations. First, the impedance mismatch between CPU and memory speeds was fought using a classic optimization technique we already know, namely caching, at the chip level. The on-chip static RAM (SRAM) memory requires six transistors (forming a flip-flop) per bit, and all of them must be kept under voltage. This makes it expensive and drives power consumption up (take care not to hit that power wall!). In exchange, memory access times are lightning fast, as all that's needed is to apply current to the input and read the output.

So, the idea is to add a small caching stage of expensive but fast on-chip memory in front of the big, slow, but inexpensive main memory. Meanwhile, modern CPUs command up to three levels of caches, commonly denoted by the L1, L2, and L3 acronyms, decreasing in density and speed but increasing in size as the cache level goes up. The figure below shows an overview of the memory hierarchy typically found in modern CPUs:

As of the time of this writing, access times for L1 caches are on the order of 3 cycles, for L2 around 12 cycles, for L3 around 38 cycles, and main memory is around 100-300 cycles. The main memory access time is that high because the analog nature of DRAM requires, among other things, periodic charge refreshing, pre-charging of the read line before reading, analog-to-digital conversion, communication through the memory controller unit (MCU), and so on.

Caches are organized in cache lines, which on current Intel architectures are 64 bytes long. Each cache update will hence fetch an entire cache line from main memory, performing a kind of prefetching already at that level. Speaking of prefetching, Intel processors have a special prefetch instruction we can invoke in assembler code for very low-level optimizations.
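
We don't even have to drop down to assembler to try this out, as most compilers expose the instruction as an intrinsic. The following is only a minimal sketch, assuming an x86 target and the <xmmintrin.h> intrinsic header; the function and the prefetch distance of 16 elements are illustrative choices, not taken from this book:

#include <cstddef>
#include <xmmintrin.h>   // x86 intrinsic header providing _mm_prefetch

float sumWithPrefetch(const float* data, std::size_t n)
{
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        // Hint that the data a few iterations ahead will be needed soon.
        // This is only a hint; the hardware prefetcher often does it anyway.
        if (i + 16 < n)
            _mm_prefetch(reinterpret_cast<const char*>(data + i + 16), _MM_HINT_T0);
        sum += data[i];
    }
    return sum;
}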

In addition to data caches, there's also an instruction cache because, in the von Neumann architecture, both data and instructions are kept in common memory. The instruction caches were added to the Intel Pentium Pro (P6) as an experiment and have never been removed since.

Pipelining

Another possibility to increase a processor's speed is instruction-level parallelism (ILP), also known as superscalar computation.

The processing of a CPU instruction can internally be split into several stages, such as instruction fetch, decode, execute, and write-back. Before the Intel 486 processor, each instruction had to be finished before the next one could be started. With pipelining, when the first stage of an instruction is done, that instruction can be forwarded to the next stage, and the next instruction's processing can begin with its first stage. In that manner, several instructions can be in flight in parallel, keeping a processor's resources optimally utilized. The next diagram illustrates this principle graphically, using a hypothetical four-stage pipeline:

The original Intel 486 pipeline was five stages long, but on modern processors, it can be much longer. For example, the current Intel Atom processors command a pipeline of 16 stages.

That's all well and good, but, unfortunately, there are some problems lurking in the corners.

Speculative execution and branch prediction

As long as the pipeline is pumping, everything is OK. But what if we encounter a conditional branch? The pipeline has to wait until the result of the test is known, so the next instruction can be started only when the current one has finished. Welcome back to the pre-80486 world! This is called a pipeline stall and defeats the whole purpose of pipelining. And because every program is practically dotted with if-then-else clauses, something must be done here.

The solution manufacturers came up with was speculative execution: instead of idling, we just start executing one of the branches speculatively. If we're lucky, we've just done the right thing; if not, we discard the speculative work and are no worse off than in the pipeline stall case. If we decide randomly, we'll be right 50% of the time, and we've just seriously increased the throughput of the pipeline!

The only problem is that the branches of an if clause are not equally probable! In most cases, they're even highly unevenly distributed: one of them is the error branch, the other is the normal-case branch. But the processor doesn't know the meaning of the test, so what can we do? The solution to that is branch prediction: the processor learns about the branches in your code and can predict rather well which branch will be taken under a given condition.

This got complicated quickly, didn't it? If you're thinking that, you're not alone. Not so long ago, the programming world was shaken by the disclosure of the Spectre and Meltdown vulnerabilities, which allowed an attacker to see the contents of memory regions they don't have access rights to. The first part of the exploit is to fool the branch predictor into taking the wrong branch for speculative execution. After the processor notices the disallowed access, the speculatively executed instructions will be discarded, but the protected data will still be present in the cache, where it can be recovered with some complicated techniques we won't discuss here. These bugs basically called into question the processor optimizations of the last decade, as fixing them would incur meaningful performance losses.

Considering that, we're all rather curious to see how CPU architectures will evolve next, aren't we?

Out-of-order execution

There's another refinement to the pipeline concept allowing an even higher utilization of a CPU's resources. Namely, as processor manufacturers started to add redundant processing units (the Intel P6 already had two integer and two floating-point execution units), it became possible to execute two instructions in parallel.

Up until the Pentium Pro (P6), instructions were fed into the pipeline in their order of appearance. But if there's a data dependency between two consecutive instructions, they can't be processed in parallel, leaving the additional execution unit idle:

a = b + 1;  // 1
c = a + 5;  // 2: depends on the result of 1
d = e + 10; // 3: independent of 1 and 2
f = d + 15; // 4: depends on the result of 3

The solution to this problem is to take the next independent instruction and execute it before the dependent one. See the next diagram for a visual explanation:

Here, on the left side, we see the traditional execution preserving the instruction order and, on the right, parallel execution with reordering, where instruction 3 will be executed before instruction 2.

Multicore

The problem of the power wall was in the end overcome by freezing or even decreasing CPU frequencies while introducing multiple processor cores working in parallel, adding more general-purpose registers, and adding single instruction, multiple data (SIMD) vector registers and instructions. In a word, manufacturers either duplicated entire processing units or added more elements that don't always have to be under voltage. In that way, the density of the transistors didn't have to be increased.

When all of the CPU cores are placed on one chip, we have a symmetric multiprocessing (SMP) system, because all cores access the common memory at the same cost. The counterpart to that is a non-uniform memory access (NUMA) system, where we have several physically separate CPUs, each with its own internal caches. Access to a CPU's local memory is then much cheaper than access to memory attached to another CPU. Another problem is cache coherence between the CPUs, which requires complicated cache-coherence protocols and can degrade performance. In the context of Qt's application area, we normally encounter SMP machines, so we'll ignore NUMA in this book.

In a multicore chip, the processing resources can be classified as core and uncore: those that are duplicated for each core, and those that aren't and must be shared. For example, the top-level cache (L3 or L2, depending on the processor) is an uncore resource shared among the processor cores.

One often-encountered notion is that of hyperthreading. This is another idea for increasing a CPU's parallelism, and hence its resource utilization. A processor with hyperthreading provides two logical processors per core, each of which keeps its own internal state. The parts of the processor that hold the architectural state (for example, running, interrupted, or halted) are duplicated for each logical core, but the computation resources are shared between them. The intent is to increase the utilization of the processor's resources and to prevent pipeline stalls by borrowing resources from a stalled logical core. The operating system then schedules these two logical cores as if they were physical ones and has to be HTT-aware to use such a system optimally.
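
Whether hyperthreading is enabled shows up directly in the number of logical cores the system reports. As a minimal sketch (not taken from this book), both Qt and the standard library let us query that count, which is what we'd typically use to size a thread pool:

#include <QThread>
#include <thread>
#include <cstdio>

int main()
{
    // Both calls report logical cores, so a 4-core CPU with
    // hyperthreading enabled will typically report 8 here.
    std::printf("QThread::idealThreadCount(): %d\n", QThread::idealThreadCount());
    std::printf("std::thread::hardware_concurrency(): %u\n",
                std::thread::hardware_concurrency());
    return 0;
}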

Strictly speaking, graphical processing units (GPUs) are not a form of multicore, but we'll mention them in this context because processors can ship with integrated GPUs on board. A GPU comprises many very simple processing units that can run massively parallel, although simple, computations. Normally, they're used to accelerate graphics processing, but they can also be used to speed up general computations, such as the training of neural networks in deep-learning applications. In a Qt context, however, we use GPUs only for their graphical capabilities.

Additional instruction sets

As already mentioned, to increase processor performance, chip manufacturers started to add more sophisticated instructions that can either vectorize computations or execute algorithms that hitherto had to be implemented in application code.

SIMD, or vector, instructions can be used to parallelize a scalar computation by executing calculations on several scalar values at once. For that, we have to load several float values into SIMD registers and then apply an operation to all of them in a single instruction (see the sketch after the following list). Intel processors introduced SIMD in a series of extensions, namely the following:

  • Streaming SIMD Extensions (SSE): Uses 128-bit registers and is available in several versions, from SSE through SSE2 to SSE4
  • Advanced Vector Extensions (AVX): Uses 256-bit registers and is available in AVX, AVX2, and AVX-512 (which widens the registers to 512 bits) versions
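
As a minimal sketch of what such vectorization looks like at the source level (using SSE intrinsics; this example isn't taken from this book), the following adds four pairs of floats with a single instruction:

#include <xmmintrin.h>   // SSE intrinsics: __m128, _mm_add_ps, and friends

void addFour(const float* a, const float* b, float* result)
{
    // Load four floats from each input (unaligned loads for simplicity),
    // add all four pairs with one SIMD instruction, and store the results.
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(result, _mm_add_ps(va, vb));
}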

As for the more specialized instructions, we mention the following Intel extensions:

  • Advanced Encryption Standard New Instructions (AES-NI): This implements the cryptographic AES encryption standard.
  • 32-bit cyclic redundancy check (CRC32): This implements computation of the CRC32 checksum (see the sketch after this list).
  • SSE4.2: This SSE extension also implements basic string operations using SIMD registers.
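
As an illustration (again, a sketch not taken from this book), the CRC32 instructions are exposed through SSE4.2 intrinsics; note that they compute the CRC-32C (Castagnoli) variant of the checksum:

#include <cstddef>
#include <cstdint>
#include <nmmintrin.h>   // SSE4.2 intrinsics, including _mm_crc32_u8

std::uint32_t crc32c(const unsigned char* data, std::size_t n)
{
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < n; ++i)
        crc = _mm_crc32_u8(crc, data[i]);   // one byte per hardware instruction
    return ~crc;
}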

Impact on performance

As you can see, a modern CPU doesn't look like your daddy's old-fashioned von Neumann CPU at all. It's a very complex computer system in its own right, crammed onto a very small chip, desperately trying every possible trick to squeeze a little more performance out of your program.

On the other hand, modern CPUs try to appear simple to the programmer, but unfortunately they leak their implementation specifics all of the time along the way. So, what has changed since the days of the classic performance advice we summed up in the preceding sections?

Fortunately, not much. The basic techniques and guidelines still apply, but some additional advice must be heeded. Let's discuss these new points.

Keeping your caches hot

First of all, don't thrash your caches, because main memory is very slow. This means that reading data scattered all over memory (also known as pointer chasing) isn't the best of ideas. Modern processors reward programs that read consecutive chunks of memory in a predictable manner, because this allows them to leverage hardware-level prefetching. The key word here is data locality.

A negative example here, alas, is our old and trusty linked list! Traversing it is a real pointer-chasing feast, because all of the nodes are allocated dynamically and can be placed anywhere in memory. However, we can remedy that by using the already-mentioned techniques of preallocation and custom memory allocators. In that case, all of the nodes of a list will be located in adjacent elements of the preallocated buffer, making it more cache-friendly again. Of course, we could just use an array from the start, but this is an example of how to improve data locality in similar cases.
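
A minimal sketch of that idea, using the standard C++17 polymorphic allocators (one possible choice; the book doesn't prescribe a specific allocator here), keeps all list nodes inside one preallocated buffer:

#include <cstddef>
#include <list>
#include <memory_resource>

int main()
{
    // One contiguous buffer that will hold all the list nodes.
    std::byte buffer[4096];
    std::pmr::monotonic_buffer_resource pool(buffer, sizeof(buffer));

    // A std::list whose nodes are carved out of the buffer above, so
    // consecutive nodes end up close together in memory.
    std::pmr::list<int> values(&pool);
    for (int i = 0; i < 100; ++i)
        values.push_back(i);
}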

The second classic mistake would be placing two often-used integers, x and y, far apart from each other in a data structure, so that when we want to use both, we needlessly have to load two cache lines instead of one. What's used together should stay together; don't break your data structures across a cache-line boundary!
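
As a small illustrative sketch (the field names are made up for this example), keeping the hot fields next to each other is all it takes:

#include <cstdint>

// Bad layout: x and y are separated by a large, rarely used blob,
// so touching both pulls in two different cache lines.
struct ParticleBad {
    std::int32_t x;
    char         debugName[120];  // cold data pushes y onto another cache line
    std::int32_t y;
};

// Better layout: the hot fields share one cache line.
struct ParticleGood {
    std::int32_t x;
    std::int32_t y;
    char         debugName[120];
};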

A further widely known optimization is the replacement of an array of structures with a structure of arrays. This is beneficial when loading data for SIMD instructions and in other techniques where we read data in parallel.
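
Here's a minimal sketch of the two layouts (the Point example is illustrative, not from this book):

#include <vector>

// Array of structures (AoS): each point's x and y are adjacent,
// but all the x values are strided across memory.
struct Point { float x; float y; };
std::vector<Point> pointsAoS;

// Structure of arrays (SoA): all x values are contiguous, which is
// exactly what SIMD loads and linear prefetching like best.
struct Points {
    std::vector<float> x;
    std::vector<float> y;
};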

Another often-cited optimization trick is improving the performance of matrix multiplication by changing the data-access pattern. Instead of a straightforward triple i, j, k loop, we interchange it into an i, k, j loop, so that the innermost loop walks along the rows of the row-major matrices:

for (size_t i = 0; i < N; ++i)
    for (size_t k = 0; k < P; ++k)
        for (size_t j = 0; j < M; ++j)
            res[i][j] += m1[i][k] * m2[k][j];

Here, just switching the order of traversal to be more cache-friendly, so that the elements of all three matrices are accessed row by row in a consecutive manner, will dramatically improve performance (in some measurements, up to 94%)!

What we learn here is that not only the layout and structure of our data but also its access patterns can have a big impact on our program's performance on modern processors!

The second type of cache, that is, the instruction cache, needs some love as well. Jumping back and forth through code is just as bad as it is with your data. This code locality is the next important notion. One way to influence it is to place your normal-case code before the error handling, like this:

if (ok) {
    do_work();
} else {
    printf("ERROR!");
    return;
}

This avoids a jump in the normal case, hence improving code locality. We'll discuss that in more detail in Chapter 3, Deep Dive into C++ and Performance, when we look at the role of the compiler in program optimization.

Don't confuse your branch predictor

To avoid pipeline stalls, preferably there shouldn't be any branches at all. Unfortunately, that can't be achieved in real programs, but the next best thing we can do is to minimize branching. A classic example of branch avoidance is replacing simple conditional expressions with bit manipulations, like this:

const int maxValue = 16;  // must be a power of two for the bit trick
// wrap-around increment with a branch:
++x;
if (x >= maxValue) x = 0;
// is equivalent to the branch-free:
x = (x + 1) & (maxValue - 1);

I think we can agree that these are ugly, low-level tricks, best left to the micro-optimization stage or to the compiler. Another technique of that type is loop unrolling.

A more usable technique is helping the compiler to generate more branch-predictor-friendly code, such as the Linux kernel's practice of using the likely() and unlikely() macros, which internally use the GNU Compiler Collection (GCC) compiler's __builtin_expect() directive. This is supposed to support the branch predictor but, in reality, it allows code reordering by the compiler, which in turn enables different optimizations. Older architectures with static branch predictors used special instruction prefixes indicating whether the branch was likely to be taken or not. As newer architectures use dynamic predictors, these prefixes are ignored and have no effect.
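
For reference, the kernel-style macros boil down to something like the following sketch (written for GCC or Clang; the macro names mirror the kernel's, but this isn't the kernel's actual source):

// Hint to the compiler which way a condition usually goes (GCC/Clang only).
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int process(int fd)
{
    if (unlikely(fd < 0)) {
        return -1;          // rare error path, kept out of the hot path
    }
    // normal-case work continues here...
    return 0;
}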

So, the received wisdom is that branch predictors have meanwhile become pretty good, and that we should try to mess with them only in the rarest cases. With one exception: as a branch predictor's table has a finite (small) size, you shouldn't thrash it! And here we are again: don't use too many branches; keep good code locality!

Parallelizing your application

This is the famous no free lunch conundrum: if we want our program to run faster, it's not enough to buy a faster processor; we have to restructure it to be parallel and to use more processor cores! We'll have a closer look at these techniques in Chapter 5, An In-Depth Guide to Concurrency and Multithreading, when we discuss multithreading.
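
As a tiny foretaste (only a sketch; Chapter 5 covers the real techniques, and this assumes the QtConcurrent module is enabled in the project), Qt can spread a data-parallel loop over all available cores:

#include <QtConcurrent/QtConcurrent>
#include <QVector>

int main()
{
    QVector<int> values(1000);
    for (int i = 0; i < values.size(); ++i)
        values[i] = i;

    // Runs the lambda on chunks of the vector in parallel, using a
    // thread pool sized to the number of logical cores.
    QtConcurrent::blockingMap(values, [](int &x) { x *= 2; });
    return 0;
}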

As for vector processing, we have to note that, for SIMD instructions to be performant, the loaded data usually has to be aligned. Apart from that, as of today, any decent modern compiler will try to vectorize the code, so normally we don't have to bother. However, the compiler will generate code for one specific architecture; what if you'd like to run your program on several processor generations and still use the most advanced instruction set extensions available on each? This is possible, and the techniques for doing that are known as CPU dispatching.
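
A minimal sketch of runtime CPU dispatching, using GCC/Clang's __builtin_cpu_supports() builtin (the transform functions are hypothetical placeholders, not from this book), could look like this:

// Two implementations of the same routine; in a real program they would be
// compiled with different target options (the names here are hypothetical).
static void transformAvx2(float* data, int n)    { (void)data; (void)n; /* AVX2 path */ }
static void transformGeneric(float* data, int n) { (void)data; (void)n; /* portable fallback */ }

void transform(float* data, int n)
{
    // GCC/Clang builtin that inspects the CPUID feature flags at runtime.
    if (__builtin_cpu_supports("avx2"))
        transformAvx2(data, n);
    else
        transformGeneric(data, n);
}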

The second theme related to parallelism that should be mentioned here is out-of-order execution. First, we sometimes encounter the advice to break up overly long dependency chains so as to allow reordering, as was shown in the previous diagram. Arguably, this is a low-level technique, but sometimes every nanosecond counts.
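
A classic sketch of that advice (illustrative, not taken from this book) is summing an array with two independent accumulators, so that one addition chain doesn't have to wait for the other; note that this changes the floating-point summation order, which is exactly why the compiler won't do it for us by default:

#include <cstddef>

double sumTwoAccumulators(const double* data, std::size_t n)
{
    double sum0 = 0.0, sum1 = 0.0;
    std::size_t i = 0;
    // Two independent dependency chains the CPU can execute in parallel.
    for (; i + 1 < n; i += 2) {
        sum0 += data[i];
        sum1 += data[i + 1];
    }
    if (i < n)
        sum0 += data[i];   // leftover element for odd n
    return sum0 + sum1;
}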

Another theme is that there may be times when we'd like to disable instruction reordering. When could that be? Right: when synchronizing between threads, we have to know exactly in which order variables were read and written. On the processor level, this can be enforced with memory barriers, but doing so prevents the possible reordering optimizations. That's the reason synchronization in a multithreaded program is already expensive at the processor level.
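
As a minimal sketch (standard C++ atomics are used here for illustration; Qt offers QAtomicInteger with similar semantics), a release/acquire pair inserts exactly such ordering constraints:

#include <atomic>

std::atomic<bool> ready{false};
int payload = 0;

void producer()
{
    payload = 42;                                  // ordinary write
    ready.store(true, std::memory_order_release);  // writes above may not move below
}

void consumer()
{
    while (!ready.load(std::memory_order_acquire)) // reads below may not move above
        ;
    // At this point, payload is guaranteed to be 42.
}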

You have been reading a chapter from
Hands-On High Performance Programming with Qt 5
Published in: Jan 2019
Publisher: Packt
ISBN-13: 9781789531244