R High Performance Programming

Overcome performance difficulties in R with a range of exciting techniques and solutions

Product type Paperback
Published in Jan 2015
ISBN-13 9781783989263
Length 176 pages
Edition 1st Edition
Authors (2):
Tjhi W Chandra
Aloysius Shao Qin Lim
Table of Contents (12 chapters)

Preface
1. Understanding R's Performance – Why Are R Programs Sometimes Slow?
2. Profiling – Measuring Code's Performance
3. Simple Tweaks to Make R Run Faster
4. Using Compiled Code for Greater Speed
5. Using GPUs to Run R Even Faster
6. Simple Tweaks to Use Less RAM
7. Processing Large Datasets with Limited RAM
8. Multiplying Performance with Parallel Computing
9. Offloading Data Processing to Database Systems
10. R and Big Data
Index

Three constraints on computing performance – CPU, RAM, and disk I/O

First, let's see how R programs are executed on a computer. This is a very simplified version of what actually happens, but it suffices for us to understand the performance limitations of R. The following figure illustrates the steps required to execute an R program.

[Figure: Steps to execute an R program]

Take, for example, this simple R program, which loads some data from a CSV file, computes the column sums, and writes the results into another CSV file:

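# Load the data from disk into RAM
data <- read.csv("mydata.csv")
# Compute the sum of each column (colSums() requires numeric columns)
totals <- colSums(data)
# Write the results back to disk
write.csv(totals, "totals.csv")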

The numbered steps below correspond to the numbers in the preceding diagram:

  1. When we load and run an R program, the R code is first loaded into RAM.
  2. The R interpreter then translates the R code into machine code and loads the machine code into the CPU.
  3. The CPU executes the program.
  4. The program loads the data to be processed from the hard disk into RAM (read.csv() in the example).
  5. The data is loaded in small chunks into the CPU for processing.
  6. The CPU processes the data one chunk at a time, and exchanges chunks of data with RAM until all the data has been processed (in the example, the CPU executes the instructions of the colSums() function to compute the column sums on the dataset).
  7. Sometimes, the processed data is stored back onto the hard drive (write.csv() in the example).
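
As a rough illustration of steps 4 to 7, the three stages of the example program can be timed separately with system.time(). This is a minimal sketch, not part of the original example: the temporary file paths and the generated data (input_path, output_path, and three columns of random numbers) are assumptions made so that the code runs on its own.

```r
# Hypothetical setup: build a small CSV in a temporary location so the sketch
# is self-contained (the original example assumes "mydata.csv" already exists).
set.seed(1)
input_path  <- tempfile(fileext = ".csv")
output_path <- tempfile(fileext = ".csv")
write.csv(data.frame(a = runif(1e5), b = runif(1e5), c = runif(1e5)),
          input_path, row.names = FALSE)

# Step 4: load the data from disk into RAM
read_time <- system.time(data <- read.csv(input_path))

# Steps 5 and 6: the CPU processes the data held in RAM
compute_time <- system.time(totals <- colSums(data))

# Step 7: store the results back onto the disk
write_time <- system.time(write.csv(totals, output_path))

# Elapsed (wall-clock) seconds for each stage
timings <- c(read    = read_time[["elapsed"]],
             compute = compute_time[["elapsed"]],
             write   = write_time[["elapsed"]])
print(timings)
```

The elapsed values vary from machine to machine. On a dataset this small they mainly show which stage is being measured where, rather than which stage is the bottleneck.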

From this depiction of the computing process, we can see a few places where performance bottlenecks can occur:

  • The speed and performance of the CPU determine how quickly computing instructions, such as colSums() in the example, are executed. This includes the interpretation of the R code into machine code and the actual execution of the machine code to process the data.
  • The size of RAM available on the computer limits the amount of data that can be processed at any given time. In this example, if the mydata.csv file contains more data than can be held in RAM, the call to read.csv() will fail.
  • The speed of disk input/output (I/O), that is, the speed at which data can be read from or written to the hard disk (read.csv() and write.csv() in the example), affects how quickly the data can be loaded into memory and stored back onto the hard disk.
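
To get a feel for the RAM constraint, the memory footprint of a dataset can be estimated before it is loaded. The sketch below assumes a hypothetical dataset of 100,000 rows and 3 numeric columns (n_rows and n_cols are illustrative values) and relies on the fact that R stores each numeric value as an 8-byte double. It then compares the estimate with the size reported by object.size().

```r
# Hypothetical dataset dimensions, chosen only for illustration
n_rows <- 1e5
n_cols <- 3
data <- data.frame(matrix(runif(n_rows * n_cols), nrow = n_rows))

# Back-of-the-envelope estimate: 8 bytes per double-precision value
estimated_bytes <- n_rows * n_cols * 8

# Size of the object in RAM as reported by R; slightly larger than the
# estimate because of per-column headers and attributes such as names
actual_bytes <- as.numeric(object.size(data))

print(estimated_bytes)
print(format(object.size(data), units = "MB"))
```

The same arithmetic can be applied to the expected dimensions of a file such as mydata.csv to judge, roughly, whether it is likely to fit in the available RAM.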

Sometimes, you might encounter these limiting factors one at a time. For example, when a dataset is small enough to be read quickly from the disk and stored fully in RAM, but the computations performed on it are complex, only the CPU constraint is encountered. At other times, you might find them occurring together in various combinations. For example, when a dataset is very large, it takes a long time to load it from the disk, only a small chunk of it can be loaded into memory at any given time, and it takes a long time to perform any computations on it. In either case, these are symptoms of performance problems. To diagnose the problems and find solutions for them, we need to look at what is happening behind the scenes that might be causing these constraints to occur.
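
When a file is too large to hold in RAM at once, one possible approach is to process it a chunk at a time. The following is a minimal sketch under stated assumptions, not a method taken from the text: it builds a small hypothetical CSV file, then computes the same column sums as colSums() by reading a fixed number of rows per call (chunk_size, an arbitrary value) from an open file connection.

```r
# Hypothetical input file, created here so the sketch is self-contained
set.seed(1)
input_path <- tempfile(fileext = ".csv")
full <- data.frame(a = runif(1000), b = runif(1000), c = runif(1000))
write.csv(full, input_path, row.names = FALSE)

chunk_size <- 300  # rows per chunk; an arbitrary illustrative value

con <- file(input_path, open = "r")

# Read the header line once and strip the quotes added by write.csv()
col_names <- strsplit(readLines(con, n = 1), ",")[[1]]
col_names <- gsub('"', "", col_names)

# Running totals, one per column
totals <- setNames(numeric(length(col_names)), col_names)

repeat {
  # read.csv() continues from the current position of the open connection
  chunk <- tryCatch(
    read.csv(con, header = FALSE, nrows = chunk_size, col.names = col_names),
    error = function(e) NULL  # raised when no lines are left to read
  )
  if (is.null(chunk) || nrow(chunk) == 0) break
  totals <- totals + colSums(chunk)
  if (nrow(chunk) < chunk_size) break  # a short chunk means end of file
}
close(con)

print(totals)
```

Only chunk_size rows are held in memory at any time, so RAM use stays bounded, at the cost of repeated disk reads. This is the trade-off between the RAM and disk I/O constraints described above.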

Let's now take a look at how R is designed and how it works, and see what the implications are for its performance.
