Improving I/O
Although SAS has created many features to improve data warehousing, it is still necessary to improve I/O through the strategic use of SAS code. This section will cover the following:
Features for warehousing that have been developed by SAS
The importance of using the
WHERE
rather than theIF
clause in data processingHow sorting and indexing can be done to improve I/O in SAS
Developing warehouse environments
The 1990s saw people working with SAS and big data to find creative solutions to improve data I/O. In his 1997 SAS white paper (available under Further reading), Ian Robertson describes the benefits of his case study migrating the Wisconsin Department of Transportation Traffic Safety and Record-keeping System (TSRS) from a mainframe SAS setup to one where data was served up to analysts through a local area network (LAN).
By this time, SAS had been reconfigured to run on their LAN's operating system, OS/2, so his team was able to save processing costs by moving the SAS analysis and reporting functions away from the mainframe and onto the PCs of the analysts. One of the innovations that enabled this was SAS/Connect, a component that allowed local PCs to connect to a mainframe storing data elsewhere. Using SAS/Connect, Robertson's team downloaded the data onto their LAN, making a local copy for analysis. Over a 2-day period, they were able to transfer 1.6 GB of TSRS data from the years 1988 through 1995 and 30 MB of source code from the mainframe storage system to the LAN.
In SAS's history of data warehousing mentioned on their website, the differences between a data warehouse, data mart, and data lake are explained:
A data warehouse stores a large amount of enterprise data covering many topics.
A data mart stores a smaller amount of data focused on one topic, usually sourced in a data warehouse.
Data warehouses and data marts consist of raw datasets that have been restructured for the purposes of analysis and reporting.
By contrast, a data lake stores a large amount of raw data that has not been processed.
Robertson's local version of the TSRS could be seen as a data warehouse, in that all the data from the source system had been moved to the LAN. By moving the data from the mainframe to the LAN, the data was now in the same physical place as the application accessing it, and this reduced not only I/O time but CPU cost. However, the group encountered a few difficulties after moving the data that required revising their source code.
Using the WHERE clause
Robertson's team used a few different strategies to improve the efficiency of their source code as an approach to performance tuning. Robertson's team reviewed their SAS processing code and found that they were using a lot of WHERE
clauses to subset data into regions. They began looking into ways to improve the efficiency of their coding with respect to the use of WHERE
clauses.
The purpose of the WHERE
clause is to subset datasets. The WHERE
clause becomes important in data warehousing in two major areas: data management and data reporting. For this reason, WHERE
is used in both data steps and PROCs.
Consider our example dataset, Chap_1_1_Infile
, and imagine we wanted to create a dataset of just the Massachusetts records (_STATE = 25
). Even though the source dataset, Chap_1_1_Infile
, has 38,901 observations, by declaring a WHERE
clause in our data step, we can avoid reading all those records and only process the ones where _STATE = 25
comes into the data step:
DATA Massachusetts; set Chap_1_1_Infile; WHERE _STATE = 25; RUN;
The first line of the log file says this:
NOTE: There were 6669 observations read from the data set WORK.CHAP_1_1_INFILE. WHERE _STATE=25;
This indicates that as SAS was processing the file, it skipped over reading the rows where it saw anything other than 25
in _STATE
. On my computer, the log file said this operation only took 0.01 seconds of CPU time. However, at the time Robertson was writing, the difference between using and not using a WHERE
clause could really impact CPU time, which drove up data warehousing costs.
WHERE
can also speed up the reporting of large datasets. Consider the use of PROC FREQ
on our example dataset, Chap_1_1_Infile
, to get the frequency of the gender variable, SEX1
, in Massachusetts (_STATE=25
):
PROC FREQ data=Chap_1_1_Infile; Where _STATE = 25; Tables(SEX1); RUN;
Looking at the log file, again, we see that this frequency calculation only considered the 6,669 records from Massachusetts in its processing, thus saving processing time by not considering the whole file. On my computer, the log file says that the CPU time used was 0.10 seconds – still a very small number, but 10 times the number seen in the data step processing. This demonstrates the extra processing power needed for the frequency calculations produced by PROC FREQ
.
Using IF compared to WHERE
In 2003, Nancy Croonen and Henri Theuwissen published a white paper providing tips on reducing CPU time through more efficient data step and PROC programming (available under Further reading), as many SAS users were still operating in a mainframe environment. In addition to advocating the use of the WHERE
clause, the authors did studies to compare the CPU time for different ways of accomplishing the same tasks in SAS. They discussed trade-offs between using the WHERE
and IF
clauses.
Earlier, we used WHERE
in a data step to subset the Chap_1_1_Infile
dataset to just the records from Massachusetts and named the dataset Massachusetts
. Let's do the same thing again, but this time, we'll use IF
instead of WHERE
:
DATA Massachusetts; set Chap_1_1_Infile; IF _STATE = 25; RUN;
On my computer, I notice no difference in processing time – it is still 0.01 seconds. However, the first line of the log file is different:
NOTE: There were 38901 observations read from the data set WORK.CHAP_1_1_INFILE.
This indicates that SAS read in all 38,901 records before processing the rest of the code, which was instructions to only keep records in the Massachusetts
dataset IF _STATE = 25
. From this demonstration, it seems like WHERE
is superior to IF
when trying to make efficient code.
However, this is not always true. There is only a meaningful reduction in processing time by using WHERE
instead of IF
in case the variable in the statement has many different values, otherwise known as high cardinality. Croonen and Theuwissen show an example comparing the use of WHERE
and IF
to subset a large dataset by a nominal variable with seven levels, which would have low cardinality. In their example, there was only a small reduction in processing time using WHERE
compared to IF
.
Note:
When the WHERE
clause is used, SAS creates a simple index on that variable and searches the index. When the IF
clause is used, SAS does a line-by-line sequential search of the dataset. Note that IF
must be used when applied to temporary variables not in the dataset. For a deeper discussion on the use of the WHERE
and the IF
clauses, please see the SAS white paper by Sunil Gupta (under Further reading).
Robertson's team managing the TSRS data warehouse had already optimized their code using WHERE
for subsetting in their data processing and reporting. However, they found that after they moved their data to a local LAN, they needed to further improve the efficiency of their code, so they started looking into approaches to indexing.
Sorting in SAS
An index is a separate file from a dataset that is like an address book for SAS to use when looking up records in a large dataset. Unlike structured query language (SQL), SAS does not create indexes automatically in processing. There are ways the programmer can index variables in a dataset as well. Given certain data processing code, placing indexes on particular variables can speed up processing.
As discussed earlier, it is helpful to identify what variables are used in WHERE
clauses that could benefit from indexing, and also, whether they are high or low cardinality. If a certain high-cardinality variable is used repeatedly in a WHERE
clause, it is a good candidate for an index.
A simple index is an index made of one variable (such as _STATE
). A composite index is one made of two or more variables (such as _STATE
plus SEX1
). Because SAS processes records sequentially, the easiest way a programmer can simulate an index on a SAS dataset is by sorting the dataset by that variable. It is not unusual for SAS datasets in a data lake to be stored sorted by a particular high-cardinality variable (such as a unique identification number of the row).
According to the BRFSS 2018 Codebook, in the source dataset, 53 regions are represented under the _STATE
variable. If this variable was used in warehouse processing, then it would be logical to store the dataset sorted by _STATE
, as shown in the following code:
PROC SORT data=Chap_1_1_Infile; by _STATE; RUN;
Sorting itself takes some time; on my computer, it took 0.05 seconds of CPU time. Compared to the 0.01 seconds it took to read in the dataset of about 36,000 rows and the 0.10 seconds it took to do a PROC FREQ
, this shows that if there is to be a policy in the data warehouse that datasets are to be stored sorted by a particular variable, sorting them will take some time to execute.
While sorting is a simple way of placing an index on a SAS variable, it may not be adequate. Using our example dataset, imagine we wanted to know the mean age (_AGE80
) of respondents by gender (SEX1
). We could use PROC MEANS
to do this using the following code:
PROC MEANS data=Chap_1_1_Infile; by SEX1; RUN;
With our dataset, Chap_1_1_Infile
, in the state it is in at the beginning of the PROC, meaning sorted by the _STATE
variable and not the SEX1
variable, the preceding PROC MEANS
code will not run. The code produces the following error:
ERROR: Data set WORK.CHAP_1_1_INFILE is not sorted in ascending sequence. The current BY group has SEX1 = 9 and the next BY group has SEX1 = 1.
As described in the error wording, SAS is expecting the dataset to already be sorted by the BY
variable, which is SEX1
in the PROC MEANS
code. Solving this problem can easily be accomplished by resorting the dataset on the SEX1
variable prior to running the preceding code. But this will cause the dataset to no longer be sorted by _STATE
, and we will lose the benefit of being able to efficiently use _STATE
in a WHERE
clause. It is in these more complex situations that we cannot rely on using sorting for a simple index, and should consider placing an index on certain variables.
Setting indexes on variables
Robertson's team ran into a similar problem, where different variables were used in WHERE
clauses throughout their programming. Therefore, they could not simply sort their datasets by one variable and rely on that indexed variable to speed up their processing.
One of the ways they dealt with this was to deliberately place indexes on certain SAS variables using a method other than sorting. As described earlier with sorting, indexes can help improve SAS's performance when extracting a small subset of data from a larger dataset. According to Michael Raithel, who wrote a SAS white paper about indexing, if the subset comprises up to 20% of the dataset, then an index should improve program performance (white paper available under Further reading). But if it is larger, it may not impact or even worsen performance. We saw this situation earlier when comparing the efficiency of processing between using a WHERE
versus an IF
clause.
SAS continued to release new enterprise versions with upgraded PROCs and data steps and new functionality. Starting in the 1980s, main upgrade versions were released and stated as an integer (for example, version 6), but in reality, these versions were upgraded regularly, with each upgrade designated by two digits after the decimal, (for example, version 6.01). SAS had tried to build indexing features into its version 6 but found that there were performance problems, according to Diane Olson's white paper on the topic (available under Further reading). The version 7 releases, otherwise known as the Nashville releases and first available in 1998, fixed these problems.
Let's create an index on the _STATE
variable using our example dataset, Chap_1_1_Infile
. One way we could have created an index on _STATE
was in the original data step we used to read in the data. Notice the same code we used before follows, but with the addition of the index
command in the first line of the data step:
data Chap_1_1_Infile (index=(_STATE));
This is often the most efficient way to place an index, but datasets that have already been read into WORK
can have indexes set on a variable using various approaches. One way is to use PROC DATASETS
, which is demonstrated here:
PROC DATASETS nolist; modify Chap_1_1_Infile; index create _STATE; RUN;
The nolist
option suppresses the printout of the dataset, and the modify
statement is used to tell SAS to modify the Chap_1_1_Infile
dataset to create an index on the _STATE
variable. In both of these examples, a simple index was created. Imagine we wanted to create a composite index including both _STATE
and SEX1
. We could do that using a data step, or we could do it using PROC DATASETS
.
Using a data step, we could set the composite index by replacing the first line of our data step code shown earlier with this code:
data Chap_1_1_Infile (index=(STATE_SEX = (_STATE SEX1)));
Notice the differences between when we set a simple index on _STATE
in the data step and the preceding code:
Because we are setting a composite index, we have to actually name the index a name that is different than the variables in the dataset. We are using the name
STATE_SEX
for the index.Then, in parentheses, we specify – in order – the two variables in the composite index, which are
_STATE
andSEX1
.Note:
Some analysts prefer to use the
_IDX
suffix when naming indexes to indicate they are indexes.
To create the same index using PROC DATASETS
, we would use the same code as we did for our simple index on _STATE
, only replacing the index create
line with this code:
index create STATE_SEX = (_STATE SEX1);
In the development of the TSRS data warehouse, Robertson's team leveraged indexes in their performance tuning. First, since indexes are used in WHERE
clauses and not IF
clauses, they rewrote their code to strategically switch IF
clauses with WHERE
clauses to improve performance. Then, they set indexes on the variables that were used in WHERE
clauses, and only saw a 6% storage overhead.