Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon
Hands-On High Performance Programming with Qt 5
Hands-On High Performance Programming with Qt 5

Hands-On High Performance Programming with Qt 5: Build cross-platform applications using concurrency, parallel programming, and memory management

Arrow left icon
Profile Icon Marek Krajewski
Arrow right icon
Mex$1004.99
Full star icon Full star icon Full star icon Full star icon Full star icon 5 (2 Ratings)
Paperback Jan 2019 384 pages 1st Edition
eBook
Mex$561.99 Mex$803.99
Paperback
Mex$1004.99
Subscription
Free Trial
Arrow left icon
Profile Icon Marek Krajewski
Arrow right icon
Mex$1004.99
Full star icon Full star icon Full star icon Full star icon Full star icon 5 (2 Ratings)
Paperback Jan 2019 384 pages 1st Edition
eBook
Mex$561.99 Mex$803.99
Paperback
Mex$1004.99
Subscription
Free Trial
eBook
Mex$561.99 Mex$803.99
Paperback
Mex$1004.99
Subscription
Free Trial

What do you get with Print?

Product feature icon Instant access to your digital eBook copy whilst your Print order is Shipped
Product feature icon Paperback book shipped to your preferred address
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
Product feature icon AI Assistant (beta) to help accelerate your learning
OR
Modal Close icon
Payment Processing...
tick Completed

Shipping Address

Billing Address

Shipping Methods
Table of content icon View table of contents Preview book icon Preview Book

Hands-On High Performance Programming with Qt 5

Understanding Performant Programs

In this introductory chapter, we'll start this book with some general discussions about program performance: why it's important, what are the factors that determine it, and how programmers generally go about performance themes. We'll begin with a broad discussion of performances' relevance in programming, before looking at some traditional performance-related knowledge, and we'll finish this chapter with the impact modern CPU architectures made in this field.

This chapter will therefore cover the following topics:

  • Why performance is important: To motivate ourselves before diving into technicalities
  • Traditional wisdom and basic guidelines: Old and proven performance knowledge
  • Modern processor architectures: At least the performance-relevant parts of it

Why performance is important

Maybe you just started reading this book out of curiosity and you're asking yourself this question: Why is performance important? Isn't that a thing of the past, when we didn't have enough CPU power and memory, and when our networks were grinding to a halt? In today's high-tech world, we have enough resourcesthe computers are so insanely fast!

Well, in principle, you're right to some degree, but consider the following:

  • A faster program runs more quickly, consuming less power along the way. This is good for the planet (if you're running it in a big server farm) and good for your user (if you're running it on a desktop computer).
  • A faster program means that it can serve more requests in the same time than a slower one. This is good for business, as you'll need to buy or lease fewer machines to serve your customers, and, again, it's good for the planet!
  • Faster software in today's business world's cut-throat competition means an advantage in respect to your competitors. This is nowhere more evident than in the world of automated trading (which is, by the way, dominated by C++), but also the fact that sluggishly-loading websites and programs needing an eternity to start won't be used that much!
  • And, lastly, especially on mobile devices, we still have to cope with constrained resources—network speed is finite (light speed) and battery life is finite too—and as faster programs use less resources, they're good for users!

Our quest for performance will hence pursue a three-pronged objective: to save the planet, to strengthen your business, and to make the life of users better—not a small feat I'd say!

The price of performance optimization

So, everything is rosy? Well, not quite, as there's a dark side to performance optimization too. If we'll try to squeeze the last drops of performance from our hardware, we can end up having unreadable, inflexible, unmaintainable, and hence outright ugly, code!

So, be aware that there are caveats and that there's a price to be paid. We must decide whether we want to pay it, and at what point we'll stop the optimization to save the clarity of our code.

Traditional wisdom and basic guidelines

When I started with programming (a long time ago), the pieces of advice about performance optimization traditionally given to a newbie were the following:

  • Don't do it (yet)
  • Premature optimization is the root of all evil
  • First make it run, then make it right, then make it fast

The first advice contained the yet only in its variant for the experts; the second was (and still is) normally misquoted, leaving out the "in say 97% of the cases" part, and the third quote gives you the impression that merely writing a program is already so difficult that fretting about performance is a luxury. It's no wonder then that the normal approach to performance was to fix it later!

But all of the adages nonetheless highlight an important insight—performance isn't distributed evenly through your code. The 80-20, or maybe even the 90-10 rule, applies here, because there are some hotspots where extreme care is needed, but we shouldn't try to optimize every nook and cranny in our code. So, our first guideline will be premature optimization—we should forget about it in, say, 95% of cases.

But what exactly are the 20%, 10%, or 5% of code where we shouldn't forget about it? Another old-age programming wisdom states this—programmers are notoriously bad at guessing performance bottlenecks.

So, we shouldn't try to predict the tight spot and measure the performance of a ready program instead. This does sound a lot like the fix it later cowboy coder's approach. Well, this book takes the stance that though premature optimization should be avoided, nonetheless, premature pessimization should be avoided at all costs, as it's even worse! However, avoiding premature pessimizations requires much detailed knowledge about which language constructs, which framework use cases, and which architectural decisions come with what kind of performance price tags. This book will try to provide this knowledge in the context of the Qt framework.

But, first, let's talk about quite general principles that address the question of what should be avoided, lest the performance degrades. As I see it, we can distill from the traditional performance wisdom from the following basic common-sense advice:

  • Don't do the same thing twice.
  • Don't do slow things often.
  • Don't copy data unnecessarily.

You'll agree that all that can't be good for performance? So, let's discuss these three simple but fundamental insights in some more detail.

Avoiding repeated computation

The techniques falling under the first point are concerned with unneeded repetition of work. The basic counter measure here is caching, that is, saving the results of computation for later use. A more extreme example of avoiding repletion of work is to precompute results even before their first usage. This is normally achieved by hand-coded (or generated by a script) precomputed tables or, if your programming language allows that, with compile-time computation. In the latter case, we sacrifice compilation times for better run-time performance. We'll have a look at C++ compile time techniques in Chapter 3, Deep Dive into C++ and Performance.

Choosing the optimal algorithm and data structure also falls into that realm, as different algorithms and data structures are optimized for different use cases, and you have to make your choice wisely. We'll have a look at some gotchas pertaining Qt's own data structures in Chapter 4, Using Data Structures and Algorithms Efficiently.

The very basic techniques such as pulling code out of a loop, such as the repeated computations or initializations of local variables, fall into that class as well, but I'm convinced you knew about this already.

Avoiding paying the high price

The techniques falling under the second point come into play if there's something we can't avoid doing, but it has a pretty high cost tagged on to it. An example of this is interaction with the operating system or hardware, such as writing data to a file or sending a packet over the network. In this case, we resort to batching, also known in I/O context as buffering—instead of writing or sending a couple of small chunks of data right away, we first gather them and then write or send them together to avoid paying the high cost each time.

On the other hand, we can apply techniques of this type too. In I/O or memory context, this would be the prefetching of data, also known as read-ahead. When reading data from a file, we read more than the user actually requested, hoping that the next portion of data will be needed soon. In the networking context, there are examples of speculative pre-resolving of Domain Name System (DNS) addresses when a user is hovering over a link in browsers or even pre-connecting to such addresses. However, such measures can turn into its counterpart when the prediction fails, and such techniques require very careful tuning!

Related techniques to be mentioned in this context are also avoidance of system calls and avoidance of locking to spare the costs of system call and switching to the kernel context.

We'll see some applications of such techniques in last chapters of the book when we discuss I/O, graphics , and networking.

Another example of when this rule can be used is memory management. General-purpose memory allocators tend to incur rather high costs on single allocations, so the remedy is to preallocate one big buffer at first and then use it for all needs of the program by managing it by ourselves using a custom allocation strategy. If we additionally know how big our objects are going to be, we can just allocate several buffer pools for different object sizes, making the custom allocation strategy rather simple. Preallocating memory at the start used to be a classic measure to improve the performance of memory intensive programs. We'll discuss these technical C++ details in Chapter 3, Deep Dive into C++ and Performance.

Avoiding copying data around

The techniques falling under the third point tend to be somehow of a lower-level nature. The first example is avoiding copying data when passing parameters to a function call. A suitable choice of data structure will avoid copying of data as well—just think about an automatically growing vector. In many cases, we can use preallocation techniques to prevent this (such as the reserve() method of std::vector) or choose a different data structure that will better match the intended use case.

Another common case when the copying of data can be a problem is string processing. Just adding two strings together will, in the naive implementation, allocate a new one and copy the contents of the two strings to be joined. And as much of programming contains some string manipulations, this can be a big problem indeed! The remedy for that could be using static string literals or just choosing a better library implementation for strings.

We'll discuss these themes in Chapter 3, Deep Dive into C++ and Performance, and Chapter 4, Using Data Structures and Algorithms Efficiently.

Another example of this optimization rule is the holy grail of network programming—the zero-copy sending and receiving of data. The idea is that data isn't copied between user buffers and network stack before sending it out. Most modern network hardware supports scatter-gather (also known as vectored I/O), where the data to be sent doesn't have to be provided in a single contiguous buffer but can be made available as a series of separate buffers.

In that way, a user's data doesn't have to be consolidated before sending, sparing us copying of data. The same principle can be applied to software APIs as well; for example, Facebook's recent TSL 1.3 implementation (codename Fizz, open sourced) supports scatter-gather API on library level!

General performance optimization approach

Up to now, we listed the following classic optimization techniques:

  • Optimal algorithms
  • Optimal data structures
  • Caching
  • Precomputed tables
  • Preallocation and custom allocators
  • Buffering and batching
  • Read-ahead
  • Copy avoidance
  • Finding a better library

With our current stand of knowledge, we can formulate the following general-performance optimization procedure:

  1. Write your code, avoiding unnecessary pessimizations where it doesn't cost much, as in the following examples:
    • Pass parameters by reference.
    • Use reasonably good, widely known algorithms and data structures.
    • Avoid copying data and unnecessary allocations.

This alone should give you a pretty decent baseline performance.

  1. Measure the performance, find the tight spots, and use some of the standard techniques listed. Then, measure again and iterate. This step must be done if the performance of our program isn't satisfactory despite our sound programming practices. Unfortunately, we can't know or anticipate everything that will happen in the complex interplay of hardware and software—there can always be surprises waiting for us.
  2. If you still can't achieve good performance, then your hardware is probably too slow. Even with performance optimization techniques, we still can't do magic, sorry!

The preceding advice looks quite reasonable, and you might ask: Are we done? That wasn't that scary! Unfortunately, it's not the whole story. Enter the leaky abstraction of modern processor architectures.

Modern processor architectures

All the classic performance advice and algorithmic foo stems from the times of simple CPU setups, where processor and memory speeds were roughly equal. But then the processor speeds exploded by increasing quite faithfully to the Moore law by 60% per year where memory access times increased by only 10% and couldn't quite hold pace with them. The problem is that the main memory (dynamic random-access memory (DRAM), contains minuscule capacitors keeping an electrical charge to indicate the 1 bit and none to indicate the 0 bit. This results in an inexpensive circuitry that doesn't have to be kept under voltage but is working basically in the analog realm and can't profit that much from advances made in the digital components.

The second change that occurred since then was the demise of Moore's law in its simple form. Up to the early 2000s, CPU manufacturers steadily increased processor frequency rates, making CPUs run faster and faster. That was achieved by increasing the number of transistors packed on chips, and Moore's law predicted that number of transistors that can be packed on a chip will double every 18 months. In simple terms, it was understood as doubling the processor speed every two years.

This trend continued until processor manufacturers hit a physical barrier, the so-called power wall—at some point, the densely packed transistors produced so much heat that they couldn't be effectively cooled on consumer machines (high-end, expensive water-cooling systems are, too expensive for a laptop or a mobile device), so a different approach to increasing a CPU's performance had to be found.

Caches

The attempts to overcome these problems led to a slew of architectural innovations. First, the impedance between CPU and memory speeds was fought using a classic optimization technique we already know, namely, caching, on chip level. The on-chip static RAM (SRAM) memory requires six transistors (forming a flip-flop) per bit, and all of them must be kept under voltage. This means it's expensive and it drives the power consumption up (take care not to hit that power wall!). In exchange, the memory access times are at lightning speed, as all that is needed is to apply the current to the input and read the output.

So, the idea is to add a small caching stage of expensive but fast on-chip memory in front of big, slow but inexpensive main memory. Meanwhile, modern CPUs can command up to three levels of caches, commonly denoted with L1, L2, and L3 acronyms, decreasing in density and speed but increasing in size as the cache level goes up. The figure below shows us an overview of the memory hierarchy typically found in modern CPUs:

As of the time of this writing, access times for L1 caches are in the order of 3 cycles, L2 of 12 cycles, L3 of 38 cycles, and main memory is around 100-300 cycles. The main memory access time is that high because the analog nature of DRAM requires, among other things, periodic charge refreshing, pre-charging of the read line before reading, analog-digital conversion, communication through memory controller unit (MCU), and so on.

Caches are organized in cache lines, which on the current Intel architectures are 64 bytes long. Each cache update will hence fetch the entire cache line from main memory, doing a kind of prefetching already at that level. Speaking about prefetching, Intel processors have a special prefetch instruction we can invoke in assembler code for very low-level optimizations.

In addition to data caches, there's also an instruction cache, because in the von Neumann architecture, both are kept in the common memory. The instruction caches were added to Intel Pentium Pro (P6) as an experiment, but they were never removed since then.

Pipelining

Another possibility to increase a processor's speed is the instruction level parallelism (ILP), also known as superscalar computation.

The processing of a CPU instruction can internally be split into several stages, such as instruction fetch, decode, execute, and write-back. Before the Intel 486 processor, each instruction has to be finished before the next can be started. With pipelining, when the first stage of an instruction is ready, that instruction can be forwarded to the next stage, and the next instruction's processing can begin with its first stage. In that manner, several instructions can be in flight in parallel, keeping a processor's resources optimally utilized. The next screenshot illustrates this principle, graphically, using a hypothetical four-stages pipeline:

The original Intel 486 pipeline was five stages long, but on modern processors, it can be much longer. For example, the current Intel Atom processors command a pipeline of 16 stages.

That's all well and good, but, unfortunately, there are some problems lurking in the corners.

Speculative execution and branch prediction

As long as the pipeline is pumping, everything is OK. But what if we encounter a conditional branch? The pipeline has to wait till the result of the test is known, hence the next instruction can be started only when the current one has finished. Welcome to the pre-80,486 world! This is called a pipeline stall and defeats the whole purpose of pipelining. And because every program is literally dotted with if-then-else clauses, something must be done here.

The solution manufacturers came up with was speculative execution: instead of idling, we just start executing one of the branches speculatively. If we are lucky, we've just done the right thing, but, if not, we discard our speculative work, and we are on even ground with the pipeline stall case. As we decide randomly, we'll be right 50% of the time, and we just seriously increased the throughput of the pipeline!

The only problem is that the branches of the if clause are not equally probable! In most cases, they're even highly unevenly distributed: one of them is the error branch; the other is the normal case branch. But the processor doesn't know the meaning of the test, so what can we do? The solution to that is branch predicting—the processor is learning about branches in your code and can predict which branch will be taken on a given condition rather well.

This got complicated quickly, didn't it? If you're thinking it, you're not alone. Not so long ago, the programming world was shaken by disclosure of the Spectre and Meltdown vulnerabilities, which allowed the attacker to see contents of the memory regions where they don't have access rights. The first part of the exploit's to fool the branch predictor to take the false branch for speculative execution. After the processor sees a disallowed access, the instruction will be retired, but the protected data will be present in the cache, where they can be guessed with some complicated techniques we won't discuss here. These bugs basically put in question the processor optimizations of the last decade, as fixing them would incur meaningful performance losses.

Considering that, we are all rather curious about how CPU architectures evolve next time, aren't we?

Out-of-order execution

There's another refinement to the pipeline concept allowing an even higher utilization of a CPU's resources. Namely, as processor manufacturers started to add redundant processing units (Intel P6 already had two integer and two floating-point execution units), it became possible to execute two instructions in parallel.

Up until Pentium Pro (P6), instructions were fed into the pipeline in their order of appearance. But if there's a data dependency between two consecutive instructions, then they can't be processed in parallel, leaving the additional execution unit idle:

a = b + 1;  // 1
c = a + 5; // 2
d = e + 10; // 3
f = d + 15; // 4

The solution to this problem is to take the next independent instruction and execute it before the dependent one. See the next diagram for a visual explanation:

Here, on the left side, we see the traditional execution preserving the instruction order and, on the right, parallel execution with reordering, where instruction 3 will be executed before instruction 2.

Multicore

The problem of the power wall was in the end overcome by freezing or even decreasing CPU frequencies but introducing parallelly working processor cores, adding more general registers, vector processing single instruction multiple data (SIMD) registers, and instructions. In a word, they either duplicated active processing units or added more elements that don't have to be always under voltage. In that way, the density of the transistors didn't have to be increased.

When all of the CPU cores are placed on one chip, we have a symmetric multiprocessing (SMP) CPU, because all cores can access their respective data within a local chip. The counterpart to that is a non-uniform memory access (NUMA) system, where we have several physically separate CPUs having their own internal caches. The memory access for internal CPUs will then be much cheaper than the memory access to an external CPU. Another problem is the cache coherence between the CPUs, which requires complicated cache-coherence protocols and can take down performance. In the context of Qt's application area, we normally encounter SMP machines, so we'll ignore NUMA in this book.

In a multicore chip, the processing resources can be classified in core and uncore ones—those which are duplicated for each core, and those that aren't and must be shared. For example, the top-level cache (L3 or L2, depending on the processor) is an uncore resource shared among processor cores.

One often-encountered notion is that of hyperthreading. This is another idea for increasing a CPU's parallelism, and hence its resource utilization. A processor with hyperthreading consists of two logical processors per core, each of which keeps its own internal state. The parts of the processor where it holds its architectural state (for example, running, interrupted, and halted) will be duplicated for each core, but the computation resources will be shared among logical cores. The intent of that is to increase the utilization of the processor's resources and prevent pipeline stalls, by borrowing resources from the stalled logical core. The operating system then uses these two logical cores as physical ones and has to be HTT-aware to optimally use such a system.

Strictly speaking, graphical-processing units (GPUs) are not a form of multicore, but we'll mention them in this context, because processors can ship with integrated GPUs on board. A GPU comprises many very simple processing units that can run massively parallel, although simple, computations. Normally, they're used to accelerate graphic processing, but they can be also used to speed up in general computations, such as the training of neural networks in deep-learning applications. In a Qt context, however, we use GPUs only for their graphical capabilities.

Additional instruction sets

As already mentioned, to increase processor performance chip, manufacturers started to add more sophisticated instructions that can either vectorize computations or execute algorithms that hitherto had to be implemented in application code.

The SIMD or vector instructions can be used to parallelize a scalar computation by executing calculations on several scalar values in parallel. For that, we have to load several float values in two sets of SIMD registers and then apply an operation to all of them at once. Intel processors introduced SIMD in a series of extensions named, namely the following:

  • Streaming SIMD Extension (SSE): Uses 128-bit registers and is available in several versions from SSE, over SSE2 to SSE4
  • Advanced Vector Extension (AVX): Uses 265-bit registers and is available in AVX2 and AVX-512 (for 512 bit) versions

As for the more specialized instructions, we mention the following Intel extensions:

  • Advanced Encryption Standard New Instructions (AES-NI): This implements the cryptographic AES encoding standard.
  • 32-bit cyclic redundancy check (CRC32): This implements computation of the CRC32 correction code.
  • SSE4.2: This SSE extension also implements basic string operations using SIMD registers.

Impact on performance

As you see, a modern CPU doesn't look like your daddy's old-fashioned von Neuman CPU at all. It's a very complex computer system on its own, banned on a very small chip, desperately trying out all possible tricks to squeeze out a little bit more performance from your program.

On the other hand, modern CPUs are trying to appear simple to the programmer, unfortunately leaking the implementation specifics all of the time along the way. So, what changed since the days of classic performance advice we summed up in the preceding sections?

Fortunately, not much. The basic techniques and guidelines still apply, but some additional advice must be heeded. Let's discuss these new points.

Keeping your caches hot

First of all, don't thrash your caches because main memory is very slow. This means that reading data all over the memory (also known as pointer chasing) isn't the best of ideas. Modern processors, such as programs that read consecutive chunks of memory in a predictable manner, allow them to leverage hardware-level prefetching. The word here is data locality.

A negative example for that, alas, is our old and trusty linked list! Traversing it's a real pointer-chasing feast, because all of the nodes are allocated dynamically and can be placed anywhere in memory. However, we could remedy it by using the already-mentioned techniques of preallocating and custom memory allocators. In that case, all of the nodes of a list would be located in the adjacent element of the preallocated buffer, making it again more cache friendly. Of course, we could just use an array from the start, but this is an example of how to improve data locality in similar cases.

The second classic mistake would be placing two often used integers, x and y, well apart from each other in a data structure, so that when we want to use both, we need to needlessly load two cache lines instead of one. What's used together should stay together; don't break your data structures on a cache line boundary!

Further optimization that's widely known is the replacement of an array of structures by a structure of arrays. This will be beneficial in the case of loading data for SIMD instructions and other techniques where we read data in parallel.

Another often-cited optimization trick is improving the performance of matrix multiplication by changing the data-access pattern. Instead of a straightforward triple i, j, k loop, we change it to the k, j, i loop:

for (size_t k = 0; k < P; ++k)
for (size_t j = 0; j <M; ++j)
for (size_t i = 0; i < N; ++i)
res[i][[j] += m1[i][[k] * m2[k][[j];

Here, just switching the order of traversal to be more cache-friendly and reading row elements of both matrices in consecutive manner will dramatically improve performance (in some measurements, up to 94%)!

What we learn here is that the outline and structure of our data, but also its access patterns, can have a big impact on our program's performance on modern processors!

The second type of cache, that is, the instruction cache, needs some love as well. Jumping through code back and forth is equally bad, as it's in the case of your data. This code locality is the next important notion. One possibility to influence that is to place your normal case code before the handling error, like this:

if (ok) {
do_work();
} else {
printf(“ERROR!”);
return;
}

This avoids a jump in normal case, hence improving code locality. We'll discuss that in more detail in Chapter 3, Deep Dive into C++ and Performance, when we'll look at the role of the compiler in program optimization.

Don't confuse your branch predictor

To avoid pipeline stalls, preferably, there shouldn't be any branches at all. Unfortunately, that can't be done in programming, but the next best thing we can do is to minimize branching. A classic example of branch avoidance is replacing simple conditional expressions with bit manipulations, like this:

const int maxValue = 16;
if (x >= maxValue) x = 0;
// is equivalent to:
x = (x + 1) & (maxValue - 1);

I think we can agree, these are ugly, low-level tricks, best left to the micro-optimization stage or to the compiler. Another technique of that type is loop unrolling.

A more usable technique could be helping the compiler to generate more branch predictor-friendly code such as the Linux kernel practice of macros: likely() and unlikely(), which internally use the GNU compiler collection (GCC) compiler's __builtin_expect() directive. This is supposed to support the branch-predictor, but, in reality, it allows code reordering by the compiler, which will then enable different optimizations. Older architectures with static branch predictors use special instruction prefixes indicating whether the branch is likely to be taken or not. As newer architectures are using dynamic predictors, these prefixes will be ignored and don't take any effect.

So, the received wisdom is that the branch predictors have meanwhile gotten pretty good, and that we should try to mess with them in the rarest cases only. With the exception of one thing: as a branch predictor's table has a finite (small) size, you shouldn't thrash it! And here we are again: don't use too many branches; keep good code locality!

Parallelizing your application

This is the famous no free lunch conundrum: if we want our program to run faster, it's not enough to buy a faster processor—we have to restructure it to be parallel and to use more processor cores! We'll have a closer look at these techniques in Chapter 5, An In-Depth Guide to Concurrency and Multithreading, when we'll discuss multithreading.

As for vector processing, we have to notice that, for SIMD instructions to be performant, the loaded data usually has to be aligned. Apart from that, as of today, any decent modern compiler will try to vectorize the code, so normally, we don't have to bother. However, the compiler will generate code for a specific architecture, but maybe you'd like to run your program on several processor and look for the more advanced extended instruction sets available? This is possible, and techniques for doing that are known as CPU dispatching.

The second theme related to parallelism to be mentioned here is the out-of-order execution. First, sometimes, we can encounter advice to break too long dependency chains so as to allow reordering, as was shown in a previous diagram. Arguably, this is a low-level technique, but sometimes every nanosecond may count.

Another theme is that there could be a time where we would like to disable instruction reordering. What could it be? Right—when synchronizing among threads, we have to know exactly in which order which variables were read or written. On the processor level, this can be forced with memory barriers, but this will prevent the possible reordering optimizations. That is the reason synchronization in a multithreaded program is already expensive on the processor level.

Summary

I hope you've enjoyed the journey so far. We acquired a basic understanding of the matter that we'll put to good use in future chapters of this book. Admittedly, in the second half, the discussion started to be a little low-level, going down into the inner workings of processors, but I hope you picked up at least a couple of buzzwords along the way.

So, we've arrived at the end of this chapter. Looking back, first we learned about benefits and caveats arising when we optimize performance and the pair, premature optimization and pessimization. Then, we discussed the basic rules for performance optimization, the well-known optimization techniques inferred from those rules, how and why memory access patterns matter, the way processors are trying to parallelize work on the instruction level, and, not to forget, what the common performance lingo's buzzwords mean.

Not bad for an introductory chapter, don't you think?

So, after we learned all of those things that we can respond to, the main question of the chapter, namely what's a performant program? The widely acknowledged response is that a performant program is a program that does the following:

  • Uses the optimal algorithm for the problem
  • Optimizes memory access patterns as to be cache friendly
  • Fully uses the parallelization possibilities of the hardware

In the following chapter, we'll look at techniques that allow us to avoid the dreaded premature optimization trap. We can steer clear of this trap by measuring how our code performs and where the bottlenecks and the tight spots are located, before actually starting to optimize. In the next chapter, we'll learn tools and techniques for doing exactly that.

Questions

Here are some questions so that you can test your understanding of the topics in this chapter. You'll find the responses at the end of this book, if you'd like to look them up:

  1. What is premature optimization?
  2. What did Donald Knuth really say about it?
  3. Why do we need performance optimization techniques?
  4. Which is the worst premature pessimization?
  5. Why is main memory slow?
  6. What is ILP, and what has it to do with thread synchronization?
  7. Is hyperthreading good or bad for performance?
  8. Why do we still use the likely/unlikely macros in the Linux kernel?
  9. Is a 64-bit program performance-wise better than a 32-bit program?

Further reading

If you would like to learn more about basic performance techniques and processor architectures, these are the resources I'd recommend:

  • The short book, The Performance of Open Source Applications, edited by Travish Armstrong (available on http://aosabook.org, 2013) contains case studies showing mostly applications of the basic principles we discussed, for example, in the chapter about the Ninjia build system.
  • Books about C++ performance, such as, this admittedly somewhat older books, Efficient C++ Performance Programming Techniques, by Dov Bulka and David Mayhew, Addison Wesley 1999, (by the way, this was my first computer performance book!) or a more recent Optimized C++ by Kurt Guntheroth, O'Reiily 2013, discuss many of the traditional basic performance techniques in their first chapters.
  • Power and Performance. Software Analysis and Optimization, by Jim Kukunas, Morgam Kaufman 2015, discusses the history and architecture of modern Intel processors, introduces tools for obtaining CPU performance data and explains some low-level performance techniques. However, this book concentrates on Linux operating systems and tools.
  • An even more in-depth discussion of performance gotchas in different processor architectures can be found on Agner's Fog site (http://www.agner.org/optimize/microarchitecture.pdf and others, 2018) and is directed more toward compiler writers. The widely cited article by Ullrich Drepper What every programmer needs to know about memory (available on http://www.akkadia.org/drepper/cpumemory.pdf, 2007) provides many details on RAM memory, caches, virtual memory, and possible low-level memory optimization techniques.
  • Lastly, the most famous quote about performance optimization by Donald Knuth can be looked up in his article Structured programming with go to statements, ACM Computing. Survey., 6(4), 1974.
Left arrow icon Right arrow icon

Key benefits

  • Build efficient and concurrent applications in Qt to create cross-platform applications
  • Identify performance bottlenecks and apply the correct algorithm to improve application performance
  • Delve into parallel programming and memory management to optimize your code

Description

Achieving efficient code through performance tuning is one of the key challenges faced by many programmers. This book looks at Qt programming from a performance perspective. You'll explore the performance problems encountered when using the Qt framework and means and ways to resolve them and optimize performance. The book highlights performance improvements and new features released in Qt 5.9, Qt 5.11, and 5.12 (LTE). You'll master general computer performance best practices and tools, which can help you identify the reasons behind low performance, and the most common performance pitfalls experienced when using the Qt framework. In the following chapters, you’ll explore multithreading and asynchronous programming with C++ and Qt and learn the importance and efficient use of data structures. You'll also get the opportunity to work through techniques such as memory management and design guidelines, which are essential to improve application performance. Comprehensive sections that cover all these concepts will prepare you for gaining hands-on experience of some of Qt's most exciting application fields - the mobile and embedded development domains. By the end of this book, you'll be ready to build Qt applications that are more efficient, concurrent, and performance-oriented in nature

Who is this book for?

This book is designed for Qt developers who wish to build highly performance applications for desktop and embedded devices. Programming Experience with C++ is required.

What you will learn

  • Understand classic performance best practices
  • Get to grips with modern hardware architecture and its performance impact
  • Implement tools and procedures used in performance optimization
  • Grasp Qt-specific work techniques for graphical user interface (GUI) and platform programming
  • Make Transmission Control Protocol (TCP) and Hypertext Transfer Protocol (HTTP) performant and use the relevant Qt classes
  • Discover the improvements Qt 5.9 (and the upcoming versions) holds in store
  • Explore Qt s graphic engine architecture, strengths, and weaknesses
Estimated delivery fee Deliver to Mexico

Standard delivery 10 - 13 business days

Mex$149.95

Premium delivery 3 - 6 business days

Mex$299.95
(Includes tracking information)

Product Details

Country selected
Publication date, Length, Edition, Language, ISBN-13
Publication date : Jan 31, 2019
Length: 384 pages
Edition : 1st
Language : English
ISBN-13 : 9781789531244
Vendor :
Qt
Category :
Languages :
Tools :

What do you get with Print?

Product feature icon Instant access to your digital eBook copy whilst your Print order is Shipped
Product feature icon Paperback book shipped to your preferred address
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
Product feature icon AI Assistant (beta) to help accelerate your learning
OR
Modal Close icon
Payment Processing...
tick Completed

Shipping Address

Billing Address

Shipping Methods
Estimated delivery fee Deliver to Mexico

Standard delivery 10 - 13 business days

Mex$149.95

Premium delivery 3 - 6 business days

Mex$299.95
(Includes tracking information)

Product Details

Publication date : Jan 31, 2019
Length: 384 pages
Edition : 1st
Language : English
ISBN-13 : 9781789531244
Vendor :
Qt
Category :
Languages :
Tools :

Packt Subscriptions

See our plans and pricing
Modal Close icon
$19.99 billed monthly
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Simple pricing, no contract
$199.99 billed annually
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just Mex$85 each
Feature tick icon Exclusive print discounts
$279.99 billed in 18 months
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just Mex$85 each
Feature tick icon Exclusive print discounts

Frequently bought together


Stars icon
Total Mex$ 2,912.97
Mastering Qt  5
Mex$1004.99
Hands-On High Performance Programming with Qt 5
Mex$1004.99
Qt5 C++ GUI Programming Cookbook
Mex$902.99
Total Mex$ 2,912.97 Stars icon
Banner background image

Table of Contents

13 Chapters
Understanding Performant Programs Chevron down icon Chevron up icon
Profiling to Find Bottlenecks Chevron down icon Chevron up icon
Deep Dive into C++ and Performance Chevron down icon Chevron up icon
Using Data Structures and Algorithms Efficiently Chevron down icon Chevron up icon
An In-Depth Guide to Concurrency and Multithreading Chevron down icon Chevron up icon
Performance Failures and How to Overcome Them Chevron down icon Chevron up icon
Understanding I/O Performance and Overcoming Related Problems Chevron down icon Chevron up icon
Optimizing Graphical Performance Chevron down icon Chevron up icon
Optimizing Network Performance Chevron down icon Chevron up icon
Qt Performance on Embedded and Mobile Platforms Chevron down icon Chevron up icon
Testing and Deploying Qt Applications Chevron down icon Chevron up icon
Assessments Chevron down icon Chevron up icon
Other Books You May Enjoy Chevron down icon Chevron up icon

Customer reviews

Rating distribution
Full star icon Full star icon Full star icon Full star icon Full star icon 5
(2 Ratings)
5 star 100%
4 star 0%
3 star 0%
2 star 0%
1 star 0%
sanghyun Dec 12, 2023
Full star icon Full star icon Full star icon Full star icon Full star icon 5
Simply a great book on performance!
Subscriber review Packt
Roman Apr 29, 2019
Full star icon Full star icon Full star icon Full star icon Full star icon 5
The book is virtually the only book book on Qt performance at this point in time, and it does its job exceptionally well, It gives you an insight on existing tools and how to use them, on existing techniques and how to apply them, and on the various aspects of performance from a C++ perspective, as you can't really talk about Qt performance without talking about C++ performance. I consider myself a seasoned Qt and C++ developer, but I had a lot to learn from this book, and I would definitely recommend it to anyone dealing with Qt, the book really covers a lot.Another thing, which is not less important, is that the book is written really well, it is easy to read, and to understand.That being said - I noticed a few typos (not many though), the book is more Windows oriented (but many important Linux specific topics are also in the book), and I skipped almost everything related to QML, but this is not enough to detract a star from the ratings, so here we go.Very highly recommended.
Amazon Verified review Amazon
Get free access to Packt library with over 7500+ books and video courses for 7 days!
Start Free Trial

FAQs

What is the delivery time and cost of print book? Chevron down icon Chevron up icon

Shipping Details

USA:

'

Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days!
Add one extra business day for deliveries to Northern Ireland and Scottish Highlands and islands

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for Interstate metro
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 days based on the distance to a destination following dispatch.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
All orders received before 5 PM U.K time would start printing from the next business day. So the estimated delivery times start from the next day as well. Orders received after 5 PM U.K time (in our internal systems) on a business day or anytime on the weekend will begin printing the second to next business day. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-bissau
  9. Iran
  10. Lebanon
  11. Libiya Arab Jamahriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela
What is custom duty/charge? Chevron down icon Chevron up icon

Customs duty are charges levied on goods when they cross international borders. It is a tax that is imposed on imported goods. These duties are charged by special authorities and bodies created by local governments and are meant to protect local industries, economies, and businesses.

Do I have to pay customs charges for the print book order? Chevron down icon Chevron up icon

The orders shipped to the countries that are listed under EU27 will not bear custom charges. They are paid by Packt as part of the order.

List of EU27 countries: www.gov.uk/eu-eea:

A custom duty or localized taxes may be applicable on the shipment and would be charged by the recipient country outside of the EU27 which should be paid by the customer and these duties are not included in the shipping charges been charged on the order.

How do I know my custom duty charges? Chevron down icon Chevron up icon

The amount of duty payable varies greatly depending on the imported goods, the country of origin and several other factors like the total invoice amount or dimensions like weight, and other such criteria applicable in your country.

For example:

  • If you live in Mexico, and the declared value of your ordered items is over $ 50, for you to receive a package, you will have to pay additional import tax of 19% which will be $ 9.50 to the courier service.
  • Whereas if you live in Turkey, and the declared value of your ordered items is over € 22, for you to receive a package, you will have to pay additional import tax of 18% which will be € 3.96 to the courier service.
How can I cancel my order? Chevron down icon Chevron up icon

Cancellation Policy for Published Printed Books:

You can cancel any order within 1 hour of placing the order. Simply contact [email protected] with your order details or payment transaction id. If your order has already started the shipment process, we will do our best to stop it. However, if it is already on the way to you then when you receive it, you can contact us at [email protected] using the returns and refund process.

Please understand that Packt Publishing cannot provide refunds or cancel any order except for the cases described in our Return Policy (i.e. Packt Publishing agrees to replace your printed book because it arrives damaged or material defect in book), Packt Publishing will not accept returns.

What is your returns and refunds policy? Chevron down icon Chevron up icon

Return Policy:

We want you to be happy with your purchase from Packtpub.com. We will not hassle you with returning print books to us. If the print book you receive from us is incorrect, damaged, doesn't work or is unacceptably late, please contact Customer Relations Team on [email protected] with the order number and issue details as explained below:

  1. If you ordered (eBook, Video or Print Book) incorrectly or accidentally, please contact Customer Relations Team on [email protected] within one hour of placing the order and we will replace/refund you the item cost.
  2. Sadly, if your eBook or Video file is faulty or a fault occurs during the eBook or Video being made available to you, i.e. during download then you should contact Customer Relations Team within 14 days of purchase on [email protected] who will be able to resolve this issue for you.
  3. You will have a choice of replacement or refund of the problem items.(damaged, defective or incorrect)
  4. Once Customer Care Team confirms that you will be refunded, you should receive the refund within 10 to 12 working days.
  5. If you are only requesting a refund of one book from a multiple order, then we will refund you the appropriate single item.
  6. Where the items were shipped under a free shipping offer, there will be no shipping costs to refund.

On the off chance your printed book arrives damaged, with book material defect, contact our Customer Relation Team on [email protected] within 14 days of receipt of the book with appropriate evidence of damage and we will work with you to secure a replacement copy, if necessary. Please note that each printed book you order from us is individually made by Packt's professional book-printing partner which is on a print-on-demand basis.

What tax is charged? Chevron down icon Chevron up icon

Currently, no tax is charged on the purchase of any print book (subject to change based on the laws and regulations). A localized VAT fee is charged only to our European and UK customers on eBooks, Video and subscriptions that they buy. GST is charged to Indian customers for eBooks and video purchases.

What payment methods can I use? Chevron down icon Chevron up icon

You can pay with the following card types:

  1. Visa Debit
  2. Visa Credit
  3. MasterCard
  4. PayPal
What is the delivery time and cost of print books? Chevron down icon Chevron up icon

Shipping Details

USA:

'

Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days!
Add one extra business day for deliveries to Northern Ireland and Scottish Highlands and islands

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for Interstate metro
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 days based on the distance to a destination following dispatch.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
All orders received before 5 PM U.K time would start printing from the next business day. So the estimated delivery times start from the next day as well. Orders received after 5 PM U.K time (in our internal systems) on a business day or anytime on the weekend will begin printing the second to next business day. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-bissau
  9. Iran
  10. Lebanon
  11. Libiya Arab Jamahriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela