You're reading from Mastering SAS Programming for Data Warehousing An advanced programming guide to designing and managing Data Warehouses using SAS

Product type Paperback

Published in Oct 2020

Publisher Packt

ISBN-13 9781789532371

Length 494 pages

Edition 1st Edition

Tools

SAS

Concepts

Application Development

Author (1):

Monika Wahi

View More author details

Table of Contents (18) Chapters

Preface

1. Section 1: Managing Data in a SAS Data Warehouse

2. Chapter 1: Using SAS in a Data Mart, Data Lake, or Data Warehouse FREE CHAPTER

3. Chapter 2: Reading Big Data into SAS

4. Chapter 3: Helpful PROCs for Managing Data

5. Chapter 4: Managing ETL in SAS

6. Chapter 5: Managing Data Reporting in SAS

7. Section 2: Using SAS for Extract-Transform-Load (ETL) Protocols in a Data Warehouse

8. Chapter 6: Standardizing Coding Using SAS Arrays

9. Chapter 7: Designing and Developing ETL Code in SAS

10. Chapter 8: Using Macros to Automate ETL in SAS

11. Chapter 9: Debugging and Troubleshooting in SAS

12. Section 3: Using SAS When Serving Warehouse Data to Users

13. Chapter 10: Considering the User Needs of SAS Data Warehouses

14. Chapter 11: Connecting the SAS Data Warehouse to Other Systems

15. Chapter 12: Using the ODS for Visualization in SAS

16. Assessments

17. Other Books You May Enjoy

Leave a review - let other readers know what you think

Improving I/O

Although SAS has created many features to improve data warehousing, it is still necessary to improve I/O through the strategic use of SAS code. This section will cover the following:

Features for warehousing that have been developed by SAS
The importance of using the WHERE rather than the IF clause in data processing
How sorting and indexing can be done to improve I/O in SAS

Developing warehouse environments

The 1990s saw people working with SAS and big data to find creative solutions to improve data I/O. In his 1997 SAS white paper (available under Further reading), Ian Robertson describes the benefits of his case study migrating the Wisconsin Department of Transportation Traffic Safety and Record-keeping System (TSRS) from a mainframe SAS setup to one where data was served up to analysts through a local area network (LAN).

By this time, SAS had been reconfigured to run on their LAN's operating system, OS/2, so his team was able to save processing costs by moving the SAS analysis and reporting functions away from the mainframe and onto the PCs of the analysts. One of the innovations that enabled this was SAS/Connect, a component that allowed local PCs to connect to a mainframe storing data elsewhere. Using SAS/Connect, Robertson's team downloaded the data onto their LAN, making a local copy for analysis. Over a 2-day period, they were able to transfer 1.6 GB of TSRS data from the years 1988 through 1995 and 30 MB of source code from the mainframe storage system to the LAN.

In SAS's history of data warehousing mentioned on their website, the differences between a data warehouse, data mart, and data lake are explained:

A data warehouse stores a large amount of enterprise data covering many topics.
A data mart stores a smaller amount of data focused on one topic, usually sourced in a data warehouse.
Data warehouses and data marts consist of raw datasets that have been restructured for the purposes of analysis and reporting.
By contrast, a data lake stores a large amount of raw data that has not been processed.

Robertson's local version of the TSRS could be seen as a data warehouse, in that all the data from the source system had been moved to the LAN. By moving the data from the mainframe to the LAN, the data was now in the same physical place as the application accessing it, and this reduced not only I/O time but CPU cost. However, the group encountered a few difficulties after moving the data that required revising their source code.

Using the WHERE clause

Robertson's team used a few different strategies to improve the efficiency of their source code as an approach to performance tuning. Robertson's team reviewed their SAS processing code and found that they were using a lot of WHERE clauses to subset data into regions. They began looking into ways to improve the efficiency of their coding with respect to the use of WHERE clauses.

The purpose of the WHERE clause is to subset datasets. The WHERE clause becomes important in data warehousing in two major areas: data management and data reporting. For this reason, WHERE is used in both data steps and PROCs.

Consider our example dataset, Chap_1_1_Infile, and imagine we wanted to create a dataset of just the Massachusetts records (_STATE = 25). Even though the source dataset, Chap_1_1_Infile, has 38,901 observations, by declaring a WHERE clause in our data step, we can avoid reading all those records and only process the ones where _STATE = 25 comes into the data step:

DATA Massachusetts;
    set Chap_1_1_Infile;
    WHERE _STATE = 25;
RUN;

The first line of the log file says this:

NOTE: There were 6669 observations read from the data set WORK.CHAP_1_1_INFILE.
       WHERE _STATE=25;

This indicates that as SAS was processing the file, it skipped over reading the rows where it saw anything other than 25 in _STATE. On my computer, the log file said this operation only took 0.01 seconds of CPU time. However, at the time Robertson was writing, the difference between using and not using a WHERE clause could really impact CPU time, which drove up data warehousing costs.

WHERE can also speed up the reporting of large datasets. Consider the use of PROC FREQ on our example dataset, Chap_1_1_Infile, to get the frequency of the gender variable, SEX1, in Massachusetts (_STATE=25):

PROC FREQ data=Chap_1_1_Infile;
    Where _STATE = 25;
    Tables(SEX1);
RUN;

Looking at the log file, again, we see that this frequency calculation only considered the 6,669 records from Massachusetts in its processing, thus saving processing time by not considering the whole file. On my computer, the log file says that the CPU time used was 0.10 seconds – still a very small number, but 10 times the number seen in the data step processing. This demonstrates the extra processing power needed for the frequency calculations produced by PROC FREQ.

Using IF compared to WHERE

In 2003, Nancy Croonen and Henri Theuwissen published a white paper providing tips on reducing CPU time through more efficient data step and PROC programming (available under Further reading), as many SAS users were still operating in a mainframe environment. In addition to advocating the use of the WHERE clause, the authors did studies to compare the CPU time for different ways of accomplishing the same tasks in SAS. They discussed trade-offs between using the WHERE and IF clauses.

Earlier, we used WHERE in a data step to subset the Chap_1_1_Infile dataset to just the records from Massachusetts and named the dataset Massachusetts. Let's do the same thing again, but this time, we'll use IF instead of WHERE:

DATA Massachusetts;
    set Chap_1_1_Infile;
    IF _STATE = 25;
RUN;

On my computer, I notice no difference in processing time – it is still 0.01 seconds. However, the first line of the log file is different:

NOTE: There were 38901 observations read from the data set WORK.CHAP_1_1_INFILE.

This indicates that SAS read in all 38,901 records before processing the rest of the code, which was instructions to only keep records in the Massachusetts dataset IF _STATE = 25. From this demonstration, it seems like WHERE is superior to IF when trying to make efficient code.

However, this is not always true. There is only a meaningful reduction in processing time by using WHERE instead of IF in case the variable in the statement has many different values, otherwise known as high cardinality. Croonen and Theuwissen show an example comparing the use of WHERE and IF to subset a large dataset by a nominal variable with seven levels, which would have low cardinality. In their example, there was only a small reduction in processing time using WHERE compared to IF.

Note:

When the WHERE clause is used, SAS creates a simple index on that variable and searches the index. When the IF clause is used, SAS does a line-by-line sequential search of the dataset. Note that IF must be used when applied to temporary variables not in the dataset. For a deeper discussion on the use of the WHERE and the IF clauses, please see the SAS white paper by Sunil Gupta (under Further reading).

Robertson's team managing the TSRS data warehouse had already optimized their code using WHERE for subsetting in their data processing and reporting. However, they found that after they moved their data to a local LAN, they needed to further improve the efficiency of their code, so they started looking into approaches to indexing.

Sorting in SAS

An index is a separate file from a dataset that is like an address book for SAS to use when looking up records in a large dataset. Unlike structured query language (SQL), SAS does not create indexes automatically in processing. There are ways the programmer can index variables in a dataset as well. Given certain data processing code, placing indexes on particular variables can speed up processing.

As discussed earlier, it is helpful to identify what variables are used in WHERE clauses that could benefit from indexing, and also, whether they are high or low cardinality. If a certain high-cardinality variable is used repeatedly in a WHERE clause, it is a good candidate for an index.

A simple index is an index made of one variable (such as _STATE). A composite index is one made of two or more variables (such as _STATE plus SEX1). Because SAS processes records sequentially, the easiest way a programmer can simulate an index on a SAS dataset is by sorting the dataset by that variable. It is not unusual for SAS datasets in a data lake to be stored sorted by a particular high-cardinality variable (such as a unique identification number of the row).

According to the BRFSS 2018 Codebook, in the source dataset, 53 regions are represented under the _STATE variable. If this variable was used in warehouse processing, then it would be logical to store the dataset sorted by _STATE, as shown in the following code:

PROC SORT data=Chap_1_1_Infile;
    by _STATE;
RUN;

Sorting itself takes some time; on my computer, it took 0.05 seconds of CPU time. Compared to the 0.01 seconds it took to read in the dataset of about 36,000 rows and the 0.10 seconds it took to do a PROC FREQ, this shows that if there is to be a policy in the data warehouse that datasets are to be stored sorted by a particular variable, sorting them will take some time to execute.

While sorting is a simple way of placing an index on a SAS variable, it may not be adequate. Using our example dataset, imagine we wanted to know the mean age (_AGE80) of respondents by gender (SEX1). We could use PROC MEANS to do this using the following code:

PROC MEANS data=Chap_1_1_Infile;
    by SEX1;
RUN;

With our dataset, Chap_1_1_Infile, in the state it is in at the beginning of the PROC, meaning sorted by the _STATE variable and not the SEX1 variable, the preceding PROC MEANS code will not run. The code produces the following error:

ERROR: Data set WORK.CHAP_1_1_INFILE is not sorted in ascending sequence. The current BY group has SEX1 = 9 and the next BY group has SEX1 = 1.

As described in the error wording, SAS is expecting the dataset to already be sorted by the BY variable, which is SEX1 in the PROC MEANS code. Solving this problem can easily be accomplished by resorting the dataset on the SEX1 variable prior to running the preceding code. But this will cause the dataset to no longer be sorted by _STATE, and we will lose the benefit of being able to efficiently use _STATE in a WHERE clause. It is in these more complex situations that we cannot rely on using sorting for a simple index, and should consider placing an index on certain variables.

Setting indexes on variables

Robertson's team ran into a similar problem, where different variables were used in WHERE clauses throughout their programming. Therefore, they could not simply sort their datasets by one variable and rely on that indexed variable to speed up their processing.

One of the ways they dealt with this was to deliberately place indexes on certain SAS variables using a method other than sorting. As described earlier with sorting, indexes can help improve SAS's performance when extracting a small subset of data from a larger dataset. According to Michael Raithel, who wrote a SAS white paper about indexing, if the subset comprises up to 20% of the dataset, then an index should improve program performance (white paper available under Further reading). But if it is larger, it may not impact or even worsen performance. We saw this situation earlier when comparing the efficiency of processing between using a WHERE versus an IF clause.

SAS continued to release new enterprise versions with upgraded PROCs and data steps and new functionality. Starting in the 1980s, main upgrade versions were released and stated as an integer (for example, version 6), but in reality, these versions were upgraded regularly, with each upgrade designated by two digits after the decimal, (for example, version 6.01). SAS had tried to build indexing features into its version 6 but found that there were performance problems, according to Diane Olson's white paper on the topic (available under Further reading). The version 7 releases, otherwise known as the Nashville releases and first available in 1998, fixed these problems.

Let's create an index on the _STATE variable using our example dataset, Chap_1_1_Infile. One way we could have created an index on _STATE was in the original data step we used to read in the data. Notice the same code we used before follows, but with the addition of the index command in the first line of the data step:

data Chap_1_1_Infile (index=(_STATE));

This is often the most efficient way to place an index, but datasets that have already been read into WORK can have indexes set on a variable using various approaches. One way is to use PROC DATASETS, which is demonstrated here:

PROC DATASETS nolist;
    modify Chap_1_1_Infile;
        index create _STATE;
RUN;

The nolist option suppresses the printout of the dataset, and the modify statement is used to tell SAS to modify the Chap_1_1_Infile dataset to create an index on the _STATE variable. In both of these examples, a simple index was created. Imagine we wanted to create a composite index including both _STATE and SEX1. We could do that using a data step, or we could do it using PROC DATASETS.

Using a data step, we could set the composite index by replacing the first line of our data step code shown earlier with this code:

data Chap_1_1_Infile (index=(STATE_SEX = (_STATE SEX1)));

Notice the differences between when we set a simple index on _STATE in the data step and the preceding code:

Because we are setting a composite index, we have to actually name the index a name that is different than the variables in the dataset. We are using the name STATE_SEX for the index.
Then, in parentheses, we specify – in order – the two variables in the composite index, which are _STATE and SEX1.
Note:
Some analysts prefer to use the _IDX suffix when naming indexes to indicate they are indexes.

To create the same index using PROC DATASETS, we would use the same code as we did for our simple index on _STATE, only replacing the index create line with this code:

index create STATE_SEX = (_STATE SEX1);

In the development of the TSRS data warehouse, Robertson's team leveraged indexes in their performance tuning. First, since indexes are used in WHERE clauses and not IF clauses, they rewrote their code to strategically switch IF clauses with WHERE clauses to improve performance. Then, they set indexes on the variables that were used in WHERE clauses, and only saw a 6% storage overhead.

You're reading from Mastering SAS Programming for Data Warehousing An advanced programming guide to designing and managing Data Warehouses using SAS

Table of Contents (18) Chapters

Improving I/O

Developing warehouse environments

Using the WHERE clause

Using IF compared to WHERE

Sorting in SAS

Setting indexes on variables

Authors (1)

Other recommended products

Personalised recommendations for you