Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
IBM SPSS Modeler Cookbook

You're reading from   IBM SPSS Modeler Cookbook If you've already had some experience with IBM SPSS Modeler this cookbook will help you delve deeper and exploit the incredible potential of this data mining workbench. The recipes come from some of the best brains in the business.

Arrow left icon
Product type Paperback
Published in Oct 2013
Publisher Packt
ISBN-13 9781849685467
Length 382 pages
Edition 1st Edition
Languages
Concepts
Arrow right icon
Toc

Table of Contents (11) Chapters Close

Preface 1. Data Understanding FREE CHAPTER 2. Data Preparation – Select 3. Data Preparation – Clean 4. Data Preparation – Construct 5. Data Preparation – Integrate and Format 6. Selecting and Building a Model 7. Modeling – Assessment, Evaluation, Deployment, and Monitoring 8. CLEM Scripting A. Business Understanding Index

Using an @NULL multiple Derive to explore missing data

With great regularity the mere presence or absence of data in the input variable tells you a great deal. Dates are a classic example. Suppose LastDateRented_HorrorCategory is NULL. Does that mean that the value is unknown? Perhaps we should replace it with the average date of the horror movie renters? Please don't! Obviously, if the data is complete, the failure to find Jane Renter in the horror movie rental transactions much more likely means that she did not rent a horror movie. This is such a classic scenario you will want a series of simple tricks to deal with this type of missing data efficiently so that when the situation calls for it you can easily create NULL flag variables for dozens (or even all) of your variables.

Getting ready

We will start with the NULL Flags.str stream.

How to do it...

To use an @NULL multiple Derive node to explore missing data, perform the following steps:

  1. Run the Data Audit and examine the resulting Quality tab. Note that a number of variables are complete but many have more than 5 percent NULL. The Filter node on the stream allows only the variables with a substantial number of NULL values to flow downstream.
  2. Add a Derive node, and edit it, by selecting the Multiple option. Include all of the scale variables that are downstream of the Filter node. Use the suffix _null, and select Flag from the Derive as drop-down menu.
    How to do it...
  3. Add another Filter node and set it to allow only the new variables plus TARGET_B to flow downstream.
  4. Add a Type node forcing TARGET_B to be the target. Ensure that it is a flag measurement type.
  5. Add a Data Audit node. Note that some of the new NULL flag variables may be related to the target, but it is not easy to see which variables are the most related.
  6. Add a Feature Selection Modeling node and run it. Edit the resulting generated model. Note that a number of variables are predictive of the target.
    How to do it...

How it works...

There is no substitute for lots of hard work during Data Understanding. Some of the patterns here could be capitalized upon, and others could indicate the need for data cleaning. The Using the Feature Selection node creatively to remove or decapitate perfect predictors recipe in Chapter 2, Data Preparation – Select, shows how circular logic can creep into our analysis.

Note the large number of data and amount-related variables in the Generated model. These variables indicate that the potential donor did not give in those time periods. Failing to give in one time period is predicted with failing to give in another; it makes sense. Is this the best way to get at this? Perhaps a simple count would do the trick, or perhaps the number of recent donations versus total donations.

How it works...

Also note the TIMELAG_null variable. It is the distance between the first and second donation. What would be a common reason that it would be NULL? Obviously the lack of a second donation could cause that problem. Perhaps analyzing new donors and established donors separately could be a good way of tackling this. The Using a full data model/partial data model approach to address missing data recipe in Chapter 3, Data Preparation – Clean, is built around this very idea. Note that neither imputing with the mean, nor filling with zero would be a good idea at all. We have no reason to think that one time and two time donors are similar. We also know for a fact that the time distance is never zero.

Note the Wealth2_null variable. What might cause this variable to be missing, and for the missing status alone to be predictive? Perhaps we need a new donor to be on the mailing list for a substantial time before our list vendor can provide us that information. This too might be tackled with a new donor/established donor approach.

See also

  • The Using the Feature Selection node creatively to remove or decapitate perfect predictors recipe in Chapter 2, Data Preparation – Select
  • The Using CHAID stumps when interviewing an SME recipe in this chapter
  • The Binning scale variables to address missing data recipe in Chapter 3, Data Preparation – Clean
  • The Using a full data model/partial data model approach to address missing data recipe in Chapter 3, Data Preparation – Clean
You have been reading a chapter from
IBM SPSS Modeler Cookbook
Published in: Oct 2013
Publisher: Packt
ISBN-13: 9781849685467
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime
Banner background image