Rust High Performance

Common Performance Pitfalls

If you are reading this book, you are probably concerned about the performance of your Rust code. It's known that Rust can offer performance close to that of C/C++ programs, and in some cases, Rust can even win those benchmarks. The main issue, though, is that it's sometimes hard to get that efficiency, especially if you are coming from C/C++. Some concepts don't apply, and some simple efficient approaches in those languages are notably worse in Rust.

In this book, you will learn how to really take advantage of Rust to make it perform at its best while maintaining all the benefits it brings—safety, zero-cost abstractions, and great concurrency. The book can be read from start to finish, and you will probably learn new concepts in every chapter. You can go directly to the chapter that interests you, though, as each chapter contains all the required information for its completion, so that it can be used as a reference.

In this first part of the book, we will start with an introduction on how to improve the performance of your sequential code. You will learn how to avoid common performance pitfalls and how to fix direct translations from other languages. You will then learn how to get better performance from your code, and finally understand memory management in Rust.

In this chapter, we will be looking into:

Configuration of the compilation process with profiles
Translation pitfalls—learning how to avoid performance pitfalls with array/slice indexing and master iterators
New iterator adaptors, both in the standard library and in external crates, and coding any complex behavior at zero cost
How to use the borrow checker to your advantage

Most of the people that start learning Rust, myself included, tend to bring lessons learned in other languages to Rust. This is usually a great thing, as it will enable you to learn the language faster. The main issue with this approach is that some patterns used in other languages can actually be a trade-off in Rust. We will learn about the most common ones and the not-so-common ones, so that anyone trying to get better performance in Rust can learn how to do it.

Asking the Rust compiler about performance

Rust sometimes has interesting and lesser-known features that really make a difference when talking about performance enhancements. When it comes to big improvements with small changes, the first thing that you should understand is the release mode. Rust by default compiles your software in development mode, which is pretty good to check for compiling errors quickly, but if you want to run it, it will run slowly. This is due to the development mode not performing any optimizations. It will create object (machine) code directly related to the Rust code without optimizing it.

Rust makes use of the LLVM backend, which makes it easy to take advantage of its performance optimizations without having to develop all of these by themselves. They will only need to use LLVM intermediate representation. An intermediate language between Rust and assembly code that the LLVM compiler understands. While in development mode, no optimizations get performed by Rust or LLVM; enabling them is as easy as adding the --release flag to the cargo compilation. So, for example, if you were running your software by typing cargo run in the console, just by using cargo run --release it will compile with optimizations and run much, much faster. Usually, the gain is of more than one order of magnitude.

Optimizations

By default, Rust will perform level 3 optimizations in the code. Optimizations get divided into levels depending on how complex they are. Higher-level optimizations, in theory, improve the performance of the code greatly, but they might have bugs that could change the behavior of the program. Usually, level 1 optimizations are totally safe, and level 2 optimizations are the most-used ones in the C/C++ ecosystem. Level 3 optimizations have not been known to cause any issues, but in some critical situations, it might be better to avoid them. This can be configured, but we should first understand how the Rust compiler compiles the code to machine instructions so that we know what different options accomplish.

Rust first starts with parsing your code files. It will get the keywords and the different symbols to create a representation of the code in memory. This parsing will find common errors such as a missing semicolon or an invalid keyword. This memory representation of the code is called High Intermediate Representation (HIR). This representation of the code will be greatly simplified, removing complex flow structures and converting it into Middle Intermediate Representation (MIR).

The MIR representation is then used to check more complex flows of the software, and enables complex variable lifetime checks, along with some other improvements. This is then converted to the LLVM Intermediate Representation and gets passed to the LLVM compiler. When passing this code to LLVM, Rust adds some flags that will modify the way that LLVM optimizes the code. We have already seen that by default one of the flags it passes is the -O0 flag, or do not optimize flag, so it simply translates to machine code. When compiling in release mode, though, a -O3 gets passed so that level 3 optimizations get performed.

This behavior can be configured in the Cargo.toml file of the project and it can be configured for each profile. You can configure how to compile for tests, development, documentation, benchmarks, and release. You will probably want to keep development and documentation optimizations to a minimum, as in those profiles the main idea is to compile quickly. In the case of the development profile, you will want to check if everything compiles properly, and even test the behavior of the program a little bit, but you probably won't be concerned about the performance. When generating the documentation, the performance of the application doesn't matter at all, so the best idea is to just not optimized.

When testing, the optimization level you need will depend on how many tests you want to run and how computationally expensive they are. If it takes a really long time to run the tests, it may make sense to compile them optimized. Also, in some critical situations in which you might not be 100% sure that optimizations get performed in a completely secure way, you might want to optimize the tests the same way you optimize the release, and that way you can check if all unit and integration tests pass properly even after optimizations. If they don't, this is a compiler malfunction, and you should report it to the Rust compiler team. They will be glad to help.

Of course, benchmarks and release profiles should be the most optimized ones. In benchmarks, you will want to know the real optimized performance of the code, while in the release, you will want your users to get the best out of their hardware and your software to make things run as efficiently as possible. In these cases, you will want to optimize up to level 2 at least, and if you are not sending satellites to space or programming a pacemaker, you will probably want to optimize all the way up to level 3.

Build configuration

There is one section in the Cargo.toml file that enables these configurations: the profile section. In this section, you will find one subsection for each of the profiles. Each of them gets declared with the [profile.{profile}] format. So, for example, for the development profile, it would be [profile.dev]. The different profile configuration keywords are the following:

dev for the development profile, used in cargo build or cargo run
release for the release profile, used in cargo build --release or cargo run --release
test for the testing profile, used in cargo test
bench for the benchmarking profile, used in cargo bench
doc for the documentation profile, used in cargo doc

When configuring each profile, you will have many options, and we will check all of them out here.

Optimization level

The first option is the one mentioned before, the optimization level. This configuration option can be set by using the opt-level key in the relevant profile section. By default, optimizations will be level 3 for benchmarking and release, and zero for the rest. For example, to only perform level 2 optimizations in the release profile, you can add this code to your Cargo.toml file:

    [profile.release]
    opt-level = 2

Debug information

The next option is the debug information. This does not directly affect performance, but it's an interesting configuration item. In this case, you can decide if the debug symbol information gets added to the final executable. This is really useful if you are developing, and especially if you are using a debugger such as GDB. Adding debug information to the executable will enable you to get the function names and even the line numbers of each instruction being executed in the processor. This will give you great insight about what is happening in the code.

In any case, debug information is not so useful in final release binaries, as final release binaries are not meant to be used in debugging. And the debug information usually adds a lot of size to the final binary. This has many times been a concern among developers, as the size of Rust binaries is usually much bigger than the ones written in C/C++. This is in part due to this configuration, and in most cases due to the panic behavior, which we will check later. Debug symbols will also show information about the original code, so it might make sense to hide it in closed-source projects.

To avoid extra debug symbols in the final binary, the debug option must be set to false. This can be done for each profile, and by default, it's true only for the development profile. If you'd like to enable it for testing also, for example, you can add this in the Cargo.toml file:

    [profile.test]
    debug = true

You can, of course, combine this with any other profile option:

    [profile.test]
    debug = true
    opt-level = 1

Link-time optimizations

The next configuration option, useful for improving the performance of the application, is link-time optimizations. Usually, when a program gets built, once all the code has been optimized, it gets linked to other libraries and functions that provide the required functionality. This, however, does not always happen in the most efficient way. Sometimes, a function gets linked twice, or a piece of code gets used in many places, and in that case, the compiler might decide to duplicate some code.

The program will work perfectly, but this has two main disadvantages—first of all, duplicating code and links will make the binary bigger, which is probably something you don't want, and secondly, it will reduce the performance. You might ask why. Well, since it's the same code being accessed from different places in the program, it might make sense that if it gets executed once, it gets added to the L1/L2/L3 caches in the processor. This will enable the future reuse of these instructions without requiring the processor to get them from the RAM memory (much slower) or even the disk/SSD (extremely slow) if the memory has been swapped.

The main advantage when performing Link-Time Optimizations, or LTOs in short, is that while Rust compiles the code file by file, LTOs fit the whole, almost final, representation into a big compilation unit that can be optimized in its entirety, enabling better execution paths.

This can be performed, of course, but at a really high compilation time cost. These optimizations are very costly because they sometimes require changing the final representation, the one ready to be written to the binary. Not only that, but this requires checking lots of execution paths and code samples to find similar blocks. And remember, this is done with the object code, not the Rust code, so the compiler doesn't know about libraries or modules; it only sees instructions.

This costly optimization will improve the performance of your software and the size of your binaries, but being so costly (usually taking as much time as the rest of the compilation, or even more), it's not enabled by default in any of the profiles. And you should not enable it on any but release and maybe benchmarking profiles (you don't want to wait for an LTO every time you make a small change in a function and want to test it). The configuration item to change is the lto configuration item:

    [profile.release]
    lto = true

There is a related configuration item, which will be ignored if LTO is turned on for the given profile. I'm talking about the codegen units. This divides the code into multiple smaller code units and compiles each of them separately, enabling parallel compiling, which improves how fast the program compiles. This is one in the case of LTO, but can be modified for the rest. Of course, using separate compilation units avoids some optimizations that can improve the performance of the code, so it might make sense to enable faster compilation in development mode. It will be 16 by default in development mode.

This is as simple as changing the codegen-units configuration option in the development profile:

    [profile.dev]
    codegen-units = 32

An example value could be the number of processors/threads in your computer. But remember that this will make the compiled software slower, so do not use it in the release profile. It will, in any case, be ignored if you activate the link-time optimizations.

Debug assertions

The next interesting configuration item is the one that allows debug assertions to be removed. Debug assertions are like normal assertions, but by default they are only executed in the development profile. They are written in the code by prefixing the assert! macros with debug_, using for example debug_assert! or debug_assert_eq!. This enables you to fill the whole code with assertions that must be true and that take processing cycles to test, while not reducing the performance of a release application. Of course, this means that those assertions won't run in release mode. This is useful for testing internal methods, but is probably not the best for APIs, and certainly not a good idea in unsafe code wrappers.

For example, the indexing function in the standard library Vec object has an assertion that will check each time you get an element of the vector by index if the index is out of bounds. This is great to avoid buffer overflows, but makes the operation of getting an element of the vector slower, and if the index is out of bounds, the program will panic. We will talk about this particular example later, but in general, it shows how useful these assertions are—in this case for release mode also.

On the other hand, if you plan to create a small internal API that will input numbers between 0 and 100 and do some calculations with them, but is not exposed to the public, you could simply add a debug_assert!(num <= 100 && num >= 0) and, in tests and debug mode, it will panic the program if a number outside that range is received by the function, but it will not run the assertion in release mode. This can be a potential error vector, but with thorough unit testing, the odds of not getting the error in testing/development mode and an incorrect number being received in release mode are much, much lower. Of course, once again, this shouldn't be used for security-focused areas or input that would cause unsafe or undefined behavior.

By default, as explained, these assertions run in development, testing, and documentation modes. This last one is useful if you have documentation tests with debug assertions. This can be configured, in any case, easily, by changing the debug-assertions configuration option. For example:

    [profile.doc]
    debug-assertions = false

Panic behavior

The next configuration variable to check is the panic behavior. By default, Rust will unwind in a panic. This means that it will call each destructor of each variable in the stack if something goes terribly wrong and the application panics. There is another option: not calling anything and simply aborting the program (the standard C/C++ behavior).

The main advantage of the unwind is that you will be able to call the destructors, so any cleanup that should be done for your variables in the stack of the program will be done properly. The main advantage of the abort behavior is that it will require much less code to be compiled, since for each potential panic location a new branch gets added to the code where all the destructors are run. It also gives the code much fewer branches, which makes it easier to optimize, but the main advantage is smaller binaries. Of course, you lose the ability to run the destructors, so some complex behavior might not be properly cleaned, for example, if you need to write something to a log upon shutdown.

If you still think that in your use case, using the abort behavior is a good idea, you can enable it by using the panic keyword:

    [profile.doc]
    panic = 'abort'

Runtime library paths

The last configuration option is rpath. This configuration item accepts a Boolean and allows you to ask the Rust compiler to set loader paths when the executable looks OK for libraries at runtime. Even though, most of the time, Rust will link crates and libraries statically, you can ask a specific library to be linked dynamically. In this case, that library will be searched at runtime, not at compile time, and it will therefore use system libraries installed where the program is running.

This configuration option asks cargo to add -C rpath to the rustc compiler invocation. This will add paths to the dynamic library search paths. Nevertheless, this should not be required in most cases, and you should avoid it if it's not necessary by using false as the option value. If you are having issues making your application run in multiple operating systems, you might try it, since it might make the executable look for dynamic libraries in new places.

Translation issues

When translating your C/C++/Java mindset, or directly porting a project to Rust, you might find yourself writing similar code to what you wrote in your native language, but if you have tried it, you might have noticed that it performs poorly, or at least much worse than your old code. This happens, especially with C/C++, since in the case of Java the performance issue is much lower compared to the high memory and computation footprint of a Java application, with its Java Virtual Machine and its garbage collector.

But why does a direct translation hurt the performance? We will see in this section how Rust's guarantees can sometimes create unnecessary boilerplate instructions, and we will learn how to bypass them by using safe and efficient code. Of course, in some performance-critical situations, unsafe scopes might be needed, but in general, that's not the case.

Indexing degradations

Let's start with a simple example. This Rust code will perform poorly:

    let arr = ['a', 'b', 'c', 'd', 'e', 'f'];

    for i in 0..arr.len() {
      println!("{}", arr[i]);
    }

This will, of course, work, and it's perfectly safe. We create an index that goes from 0 to the length of the array (6 in this case), but exclude the last one, so the i binding will take the values 0, 1, 2, 3, 4, and 5. For each of them, it will get the element at that index in the array and print it in a new line. There is one problem with this approach though. In C/C++, an equivalent code will simply add the size of the element to the pointer in the array and get the next element, but that sometimes causes issues. Look at this code:

    let arr = ['a', 'b', 'c', 'd', 'e', 'f'];

    for i in 0..arr.len() + 1 {
      println!("{}", arr[i]);
    }

In this case, we are iterating until the array length + 1, and since ranges are exclusive, the last index will be 6. This means it will try to get the seventh element in the array, but there is no seventh element. In C/C++, this will create a buffer overflow and will get whatever is next in the memory. In the case that this memory is outside the program, you will get a segmentation fault, but if it's part of the program, it will print whatever was in that position, leading to leaks. Of course, that is not possible in Rust, since Rust is a memory-safe language, so what will happen?

Well, the answer is surprising—it will panic the program, unwind the stack (call all the destructors of all the variables in the stack), and exit the program safely without trying to access invalid memory. Depending on your perspective, you might think great, I will no longer have buffer overflows, or you might think oh my God, the whole server will go down to prevent a buffer overflow. Well, the second can be mitigated by stopping the panic and recovering the proper server state, already implemented in most frameworks, so it's mostly a win-win.

But is it? How does Rust know if the index is out of bounds? In this simple example, the compiler could know that the arr variable has only six elements, so trying to access the seventh would violate memory constraints. But what about this more complex program:

    fn print_request(req: Request) {
      for i in 0..req.content_length {
        println!("{}", req.data[i]);
      }
    }

Here I'm receiving an HTTP request (very naively represented) that has at least one content_length attribute and one data attribute. The first should contain the length of the data field, in a number of bytes, while the second will be a vector of bytes. Let's suppose we don't have the len() function in that data field, and that we trust the content_length attribute. What if somebody were to send us an invalid request with a bigger content_length than the actual length of the content? The compiler wouldn't know this in advance because the request originated at runtime from a TCP connection, but again, Rust must always be memory-safe (unless working in an unsafe scope, which is not the case).

Well, what happens is that the index operation has two parts. First, it checks the bounds of the slice, and if the index is fine, it will return the element; if not, it will panic. And yes, it does this for every indexing operation. So in this case, if the request is a valid request with a supposed 1 million bytes (1 MB), it will compare the index to the length of the vector 1 million times. That is at least 2 million extra instructions (the comparison and the branching for each, at least). That becomes much less efficient than the equivalent C/C++ code.

Using iterators

There is a way around this, though, that gives the same effect as the C/C++ code: using iterators. The previous code can be converted into the following:

    let arr = ['a', 'b', 'c', 'd', 'e', 'f'];

    for c in &arr {
      println!("{}", c);
    }

This will compile roughly to the same machine code as the C/C++ variant since it won't check the bounds of the slice more than once, and it will then use the same pointer arithmetic. This is great when iterating through a slice, but in the case of a direct lookup, it can be an issue. Suppose we will receive thousands of 100-element slices, and we are supposed to get the last element of each and print it. In this case, iterating through all 100 elements of each array just to get the last one is a bad idea, as it would be more efficient to bounds check just the last element. There are a couple of ways of doing this.

The first one is straightforward:

    for arr in array_of_arrays {
      let last_index = arr.len() - 1;
      println!("{}", arr[last_index]);
    }

In this concrete case, where we want to get the last element, we can do something like this:

    for arr in array_of_arrays {
      if let Some(elt) = arr.iter().rev().next() {
        println!("{}", elt);
      }
    }

This will reverse the iterator with the call to rev() and then get the next element (the last one). If it exists, it will print it. But if we have to get a number that is not close to the end or to the beginning of the slice, the best way is to use the get() method:

    for arr in array_of_arrays {
      if let Some(elt) = arr.get(125) {
        println!("{}", elt);
      }
    }

This last one has a double bound check, though. It will first check if the index is correct to return a Some(elt) or a None, and then the last check will see if the returned element is Some or None. If we know for sure, and I mean 100% sure, that the index is always inside the slice, we can use get_unchecked() to get the element. This is an exact equivalent to the C/C++ indexing operation, so it will not do bounds checking, allowing for better performance, but it will be unsafe to use. So in the HTTP example before, an attacker would be able to get what was stored in that index even if it was a memory address outside the slice. You will need to use an unsafe scope, of course:

    for arr in array_of_arrays {
      println!("{}", unsafe { arr.get_unchecked(125) });
    }

The get_unchecked() function will always return something or segfault, so no need to check if it's Some or None. Remember also that upon a segfault, this will not panic, and no destructors will be called. It should only be used if a safe alternative would not meet the performance requirements and if the bounds of the slice were previously known.

In most cases, you will want to use an iterator. Iterators allow for precise iteration of elements, even filtering them, skipping some, taking a maximum amount of them, and finally, collecting them into a collection. They can even be extended or joined with other iterators to allow for any kind of solution. Everything gets managed by the std::iter::Iterator trait. You now understand the most-used methods of the trait, and I leave the rest to you to research in the standard library documentation.

It's important to properly use and understand iterators since they will be very useful for doing really fast loops. Iterators are cost-free abstractions that work the same way as indexing, but will not require bounds checking, making them ideal for efficiency improvements.

Iterator adaptors

Let's start with the most simple method. The basic method for the rest to work is the next() method. This function will return either the next element in the iteration or a None if the iterator has been consumed. This can be used to manually get the next element, or to create a for using a while, for example:

let arr = [10u8, 14, 5, 76, 84];
let mut iter = arr.iter();

while let Some(elm) = iter.next() {
    println!("{}", elm);
}

That would be the same as this:

let arr = [10u8, 14, 5, 76, 84];

for elm in &arr {
    println!("{}", elm);
}

Note the & before the array variable in the for. This is because the basic array type does not implement the Iterator trait, but a reference to the array is a slice, and slices implement the IntoIterator trait, which makes it usable as an iterator.

The next two methods you should know about are the skip() and the take() methods. These make it easy to get only the correct members of a known ordered iterator. For example, let's say we want to take from the third to the tenth element of an iterator with unknown length (at least 10 elements). In this case, the best thing would be to skip the first two and then take the next eight. We then collect them in a vector. Note that the iterator will not run until you call collect() method or use it in a loop. Those are the moments in which the next() method gets executed:

let arr = [10u8, 14, 5, 76, 84, 35, 23, 94, 100, 143, 23, 200, 12, 94, 72];

let collection: Vec<_> = arr.iter().cloned().skip(2).take(8).collect();

for elm in collection {
    println!("{}", elm);
}

This will start iterating through the array, and it will first clone each element. That's because by default an iterator will yield references to the elements, and in the case of u8 it's better to copy them than to reference them, as we will see at the end of the chapter. The skip() method will call next() twice and discard what it returns. Then, for each next() operation, it will return the element. Until it calls next() eight times, the take() method will return the element. It will then return None. The collect() method will create an empty vector, and will push elements to it, while the next() method returns Some, then returns the vector.

Note that the collect() method requires a type hint, as it can return any kind of collection—actually, any type that implements the FromIterator trait. We simply tell it that it will be a standard library Vec, and we let the compiler infer the type of element the vector will hold.

There are also a couple of functions that are a generalization of the previous ones, skip_while() and take_while(). These two will skip or take elements, respectively, while the closure they run returns true. Let's see an example:

let arr = [10u8, 14, 5, 76, 84, 35, 23, 94, 100, 143, 23, 200, 12, 94, 72];

let collection: Vec<_> = arr.iter()
    .cloned()
    .skip_while(|&elm| elm < 25)
    .take_while(|&elm| elm <= 100)
    .collect();

for elm in collection {
    println!("{}", elm);
}

In this case, the skip_while() method will run next() until it finds an element bigger than or equal to 25. In this case, this is the fourth element (index 3), number 76. The take_while() method starts then calling next() and returning all elements while they are less than or equal to 100. When it finds 143, it returns None. The collect() method will then include all those elements, from the 76 to the 100, both included in a vector, and return it. Note that the 23 is also added to the final result, since even if it's lower than 25, while the skip method stops skipping, it will never skip again.

To fine-tune the filtering of the elements in the iteration, some other very interesting methods are the filter() method and its companion map(). The first lets you filter elements of an iterator based on a closure, while the second lets you map each element to a different one. Let's explore this by using a simple iterator that yields the odd elements of an iterator and collects them into a vector:

let arr = [10u8, 14, 5, 76, 84, 35, 23, 94, 100, 143, 23, 200, 12, 94, 72];

let collection: Vec<_> = arr.iter()
    .enumerate()
    .filter(|&(i, _)| i % 2 != 0)
    .map(|(_, elm)| elm)
    .collect();

for elm in collection {
    println!("{}", elm);
}

In this case, we enumerate the iterator by calling to enumerate(). That will yield a tuple with the index and the element for each next() call. This will then be filtered by checking the index. If the index is odd, it will be returned in the next() call; if it's not, it will call next() again. This will then be mapped, as the filter will also return the tuple. The map() function will take only the element, discard the index, and return it.

The filter and map functions can be reduced by using the helpful filter_map() function, which combines the two of them:

let arr = [10u8, 14, 5, 76, 84, 35, 23, 94, 100, 143, 23, 200, 12, 94, 72];

let collection: Vec<_> = arr.iter()
    .enumerate()
    .filter_map(|(i, elm)| if i % 2 != 0 { Some(elm) } else { None })
    .collect();

for elm in collection {
    println!("{}", elm);
}

The filter_map() adaptor expects a closure that will return Some(element) when it should return the element, and None when it should retry and call next(). This will avoid some extra code. In this concrete case, you can also use the step_by() method, which only returns one element every n elements. In this case, using a two-step will have the same effect.

When trying to do calculations with iterators, instead of using a for, we can use the great fold() method. This will hold a variable between each call to next() that you will be able to update. That way, you can sum, multiply, and perform any other operation in the iterator. Let's, for example, perform the sum of all the elements of the iterator:

let arr = [10u32, 14, 5, 76, 84, 35, 23, 94, 100, 143, 23, 200, 12, 94, 72];

let sum = arr.iter().fold(0u32, |acc, elm| acc + elm);
println!("{}", sum);

This will print 985, without needing a loop. Of course, this will be implemented with a loop under the hood, but for the programmer, it's a zero-cost abstraction that helps a lot in terms of simplifying the code.

Real-life example

As a real-life example, here is the VSOP87 algorithm's variable function implemented with a fold() method. The VSOP87 algorithm is used to find planets and moons in the sky with really good accuracy, useful for simulators and telescope star finders, for example:

fn calculate_var(t: f64, var: &[(f64, f64, f64)]) -> f64 {
    var.iter()
       .fold(0_f64, |term, &(a, b, c)| term + a * (b + c * t).cos())
}

This is equivalent to this other code:

fn calculate_var(t: f64, var: &[(f64, f64, f64)]) -> f64 {
    let mut term = 0_f64;
    for &(a, b, c) in var {
        term += a * (b + c * t).cos();
    }
    term
}

And in C/C++, this would probably require a structure to hold the tuple. Five lines reduced to one with the same native code. As we talked about, this has no extra cost and will be compiled to the same machine code.

Specialized adaptors

In the case of a summation or a multiplication, there are specialized methods: the sum() and the product() methods. These methods will do the same as the fold() method that is used to add all the numbers in an iterator or to multiply all the items of the iterator. The example we saw before can be reduced to this:

let arr = [10u32, 14, 5, 76, 84, 35, 23, 94, 100, 143, 23, 200, 12, 94, 72];

let sum: u32 = arr.iter().sum();
println!("{}", sum);

Type annotations are required for now, but the code looks much simpler. You can also use the product() function in the same way, and it will be equivalent to this code:

let arr = [10u32, 14, 5, 76, 84, 35, 23, 94, 100, 143, 23, 200, 12, 94, 72];

let prod = arr.iter().fold(0u32, |acc, elm| acc * elm);
println!("{}", prod);

Interaction between adaptors

There are also some functions to control how the iterators interact with other iterators or even themselves. For example, the cycle() function will make the iterator start again from the beginning once it gets to the end of the iterator. This is useful to create an infinite loop with an iterator. There are also a couple of functions that help you deal with multiple iterators at the same time. Let's suppose that you have two slices of the same length and want to generate a new vector with that same length, but with each element being the sum of the elements with the same index in the slices:

let arr1 = [10u32, 14, 5, 76, 84, 35, 23, 94, 100, 143, 23, 200, 12, 94, 72];
let arr2 = [25u32, 12, 73, 2, 98, 122, 213, 22, 39, 300, 144, 163, 127, 3, 56];

let collection: Vec<_> = arr1.iter()
    .zip(arr2.iter())
    .map(|(elm1, elm2)| elm1 + elm2)
    .collect();
println!("{:?}", collection);

In this case, we have used the zip() function that will yield a tuple with each element being the next of each iterator. We can also chain them with the chain() function, which will generate a new iterator that, once the first starts yielding None, will start yielding elements from the second iterator. There are many more iteration functions, but we will leave the standard library here for now and focus on external crates.

Itertools

There is one external crate that can make working with iterators much easier, and gives you superpowers. Remember the idea that these iterators allow you to perform the same operations you would do in C with indexing, but with complete memory safety and zero-cost abstractions? They also make the code much easier to understand. In terms of iterator capabilities, the most important crate is the itertools crate. This crate provides a new trait, the Itertools trait, which gives iterators many new methods and functions that make the life of the developer much easier, while staying true to its core values of performance thanks to zero-cost abstractions. You can add it to your project by adding it to your Cargo.toml file in the [dependencies] section.

Let's explore some of its iterator adapters. We start with a simple one that helps us create batches or chunks of the given iterator, the batching() function. Let's say that we want to use an iterator over one of the previous arrays and we want to make it return elements in groups of three. It's as simple as using that method and creating a closure that directly calls the next() method and returns the required tuple:

// Remember
extern crate itertools;
use itertools::Itertools;

let arr = [10u32, 14, 5, 76, 84, 35, 23, 94, 100, 143, 23, 200, 12, 94, 72];

for tuple in arr.iter().batching(|it| match it.next() {
    None => None,
    Some(x) => {
        match it.next() {
            None => None,
            Some(z) => {
                match it.next() {
                    None => None,
                    Some(y) => Some((x, y, z)),
                }
            }
        }
    }
})
{
    println!("{:?}", tuple);
}

This will print the array in groups of three elements, in order:

(10, 5, 14)
(76, 35, 84)
(23, 100, 94)
(143, 200, 23)
(12, 72, 94)

A similar operation can be accomplished by using the chunks() function. We can say that the batching() adaptor is a generalization of the chunks() adaptor, since it gives you the option to create the internal logic of the function. In the case of chunks(), it will only receive as a parameter the number of elements in a chunk, and it will return slices to those chunks.

A really similar operation will be performed with the tuples() method. As you can see, the batching() method is a complete generalization in terms of how you create batches or chunks of an iterator. Let's see the same example we saw previously using the tuples() method:

// Remember
extern crate itertools;
use itertools::Itertools;

let arr = [10u32, 14, 5, 76, 84, 35, 23, 94, 100, 143, 23, 200, 12, 94, 72];

for tuple in arr.iter().tuples::<(_, _, _)>() {
    println!("{:?}", tuple);
}

Much less boilerplate code, right? In this case, we are required to specify the number of elements in a tuple, but if we used type inference in the for, we could avoid it:

// Remember
extern crate itertools;
use itertools::Itertools;

let arr = [10u32, 14, 5, 76, 84, 35, 23, 94, 100, 143, 23, 200, 12, 94, 72];

for (a, b, c) in arr.iter().tuples() {
    println!("({}, {}, {})", a, b, c);
}

Of course, in this case, we would be pattern-assigning the variables. There is also another interesting function that allows for creating the cartesian product of two iterators.

Unsurprisingly, the name is cartesian_product(). This will create a new iterator with all possible combinations of the previous two:

// Remember
extern crate itertools;
use itertools::Itertools;

let arr1 = [10u32, 14, 5];
let arr2 = [192u32, 73, 44];

for row in arr1.iter().cartesian_product(arr2.iter()) {
    print!("{:?}, ", row);
}

This will print the following:

(10, 192), (10, 73), (10, 44), (14, 192), (14, 73), (14, 44), (5, 192), (5, 73), (5,44),

There are many other methods in the Itertools trait, and I invite you to check the official documentation, since it's very detailed and has many examples. For now, these common methods should help you do any operation you need to perform with slices in a much more efficient way.

Borrowing degradations

Iterations are not the only place where translation degradations occur. There are also a couple of extra points where you can sometimes see that the same code performs much worse in Rust than in C/C++. One of these points is reference handling. Due to borrow checker constraints, you can do three things with variables when passing them to a function: send a reference (borrow), give the new function control of the variable (own), or copy/clone the variable to send it to a function. It seems easy to decide, right? If you do not require the variable anymore, let the function own your variable. If you require it, send the reference, and if you require it and the API only accepts ownership, clone it.

Well, it actually isn't so simple. For example, integers are faster to copy than to reference, and so are small structures. The rule of thumb is, if it's smaller than or equal to usize, copy, always. If it's somewhere between usize and 10 times that size, it's probably better to copy. If it's bigger, it's probably better to reference. If the structure has a heap allocation (such as a Vec or a Box), it's usually better to send a reference.

There are some cases, though, when you cannot decide what happens to the variable. In a macro, for example, the variable is passed as is, and the macro decides what to do with it. For example, the println! macro gets all elements by reference, since it does not require more. The problem is that if you are trying to print an integer, for example, a bottleneck appears. That's what happened some time ago to Robert Grosse, and he wrote an article about it.

Long story short, he had to force the copying of the integer. How did he do that? Well, it's as simple as creating a scope that will return that integer. Since integers implement Copy, the integer will be copied to the scope and then returned, effectively copying it to the macro:

let my_int = 76_u32;
println!("{}", {my_int});

For normal prints, this is not usually necessary, but if you need to quickly print thousands or millions of integers, you will not avoid the I/O interface, but you can at least avoid this bottleneck.

Cyclomatic complexity

Another possible bottleneck is the cyclomatic complexity of functions. While not directly related to the translation of code from other languages, it's true that Rust can sometimes increase the cyclomatic complexity of the code, since it forces you to check for optional (nullable) results, some complex iterators, functional programming, and so on. This is great for code security, but sometimes the compiler has issues properly optimizing the code we write.

The only way to avoid this is to separate the code into smaller code units that will help the compiler optimize better unit by unit. One way of doing that is by creating smaller functions, with no more than 20–25 branches each. A branch is a place where, depending on one variable, the program will run one code or another. The simplest branch is conditional, an if. There are many others, such as loops (especially when the loop contains returns) or the ? operator. This will create two branches, one for each result option. One of them will return the function while the other will assign the variable.

Nested loops and conditionals make this list grow larger, and the branches can be more and more complex, so you will have to try to divide those deeply nested conditionals in new functions. It's even considered a good practice. As you will see in the Tools section, there are tools that will help you find these bottlenecks.

Captain Yossarian May 11, 2020

One of the best Rust book I have ever read. This books is for you if you know Rut syntax and main principles of the language.Before reading this book I thought I'm strong beginner, but on 10th page I have understand that I'm still dont know Rust.

Amazon Verified review

jesion Oct 27, 2020

This book does not deliver on its promise. I was expecting deep-dive analysis of Rust code in specific situations. To my disappointment, performance aspects are covered only in the first 1/3 of whole book, And even there the topics covered are mostly trivial and well known to even non-advanced Rust users, lacking any serious analysis. Plus some basics of performance aspects generic to software development (tools, CPU architecture), not specific to Rust.The rest of the book is a random selection of various topics with very few loose references to performance (macros???).I'm not sure whom this book is for. Beginners will start somewhere else as this book does not introduce the language. Anyone who passed the introduction to Rust and got some minimal practics, will not find much interesting information here, either.

Blackmagic Jun 06, 2023

The first few chapters are ok, cover some rust specific topics like iterators, memory alignment, clippy etc.Then the parts about profiling and benchmarks are so high level they are pointless. The bench marking misses anything to do with issues relating to micro-benchmarking and compiler optimizations changing results. The later sections about threading and async could just be wikipedia articles.I get the feeling the author started with a lot of energy and then just wrote the second half of the book as quickly as possible.

Rust High Performance: Learn to skyrocket the performance of your Rust applications

What do you get with Print?

Rust High Performance

Common Performance Pitfalls

Asking the Rust compiler about performance

Optimizations

Build configuration

Optimization level

Debug information

Link-time optimizations

Debug assertions

Panic behavior

Runtime library paths

Translation issues

Indexing degradations

Using iterators

Iterator adaptors

Real-life example

Specialized adaptors

Interaction between adaptors

Itertools

Borrowing degradations

Cyclomatic complexity

Summary

Page 1 of 4

Key benefits

Description

Who is this book for?

What you will learn

Product Details

What do you get with Print?

Product Details

Frequently bought together

Table of Contents

Recommendations for you

Customer reviews

People who bought this also bought

About the author

FAQs

Rust High Performance: Learn to skyrocket the performance of your Rust applications

What do you get with Print?

Contact Details

Shipping Address

Billing Address

Key benefits

Description

Who is this book for?

What you will learn

Product Details

What do you get with Print?

Contact Details

Shipping Address

Billing Address

Product Details

Packt Subscriptions

Frequently bought together

Table of Contents

Recommendations for you

Customer reviews

People who bought this also bought

About the author

FAQs