You're reading from Statistical Application Development with R and Python Develop applications using data processing, statistical models, and CART

Product type Paperback

Published in Aug 2017

ISBN-13 9781788621199

Length 432 pages

Edition 2nd Edition

Languages

Python

Concepts

Application Development

Table of Contents (12) Chapters

Preface

1. Data Characteristics

2. Import/Export Data

3. Data Visualization

4. Exploratory Analysis

5. Statistical Inference

6. Linear Regression Analysis

7. Logistic Regression Model

8. Regression Models with Regularization

9. Classification and Regression Trees

10. CART and Beyond

Index

Experiments with uncertainty in computer science

The common man of the previous century was skeptical about chance and randomness. He attributed it to the lack of accurate instruments and to information not being captured in enough variables. In the current era, his skepticism about the need to model randomness continues, as he now feels that instruments are highly accurate and that multi-variable information eliminates uncertainty. However, this is not the case, and we will look here at some examples that drive home this point.

In the previous section, we dealt with data arising from a questionnaire regarding the service level at a car dealer. It is natural to accept that different individuals respond in distinct ways and, further, that a car, being a complex assembly of different components, responds differently under nearly identical conditions. A question then arises as to whether we really have to deal with such situations involving uncertainty in computer science. The answer is certainly affirmative, and we will consider some examples in the context of computer science and engineering.

Suppose that the task is the installation of software, say R itself. A new lab has been set up with 10 new desktops that have the same configuration. That is, the RAM, memory, processor, operating system, and so on are all the same across the 10 machines.

For simplicity, assume that the electricity supply and lab temperature are identical for all the machines. Do you expect that the complete R installation, as per the directions specified in the next section, will take the same time, down to the millisecond, on all 10 machines? The runtime of an operation can be easily recorded, using other software if not manually. The answer is a clear no, as there will be minor variations in the processes active on the different desktops. Thus, we have our first experiment in the domain of computer science that involves uncertainty.
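The point about run-to-run variation can be checked on a single machine. The following is a minimal sketch, not a measurement of an actual R installation: `stand_in_task` is a hypothetical workload that we assume behaves like any other fixed piece of work, and `time_repeats` is an illustrative helper name. It runs the identical task 10 times and records each elapsed time in milliseconds.

```python
import statistics
import time


def stand_in_task(n=200_000):
    """Hypothetical fixed workload standing in for a software installation."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def time_repeats(task, repeats=10):
    """Run the same task several times; return each elapsed time in milliseconds."""
    timings_ms = []
    for _ in range(repeats):
        start = time.perf_counter()
        task()
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return timings_ms


timings_ms = time_repeats(stand_in_task, repeats=10)
for run, elapsed in enumerate(timings_ms, start=1):
    print(f"run {run:2d}: {elapsed:.3f} ms")

# Summarize the spread of the 10 recorded runtimes.
print(f"mean    = {statistics.mean(timings_ms):.3f} ms")
print(f"std dev = {statistics.stdev(timings_ms):.3f} ms")
```

The printed times will typically differ from run to run and from machine to machine, even though the code and its input never change, which is the kind of uncertainty described above.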

Suppose that the lab is now 2 years old. As an administrator, do you expect all 10 machines to be working in the same condition, given that we started with an identical configuration and environment? The question is relevant since, according to general experience, a few machines may have broken down. Despite the warranty and assurance given by the desktop company, the number of machines that have actually broken down will not exactly match the number the company assured. Thus, we again have uncertainty.
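To see why the number of broken machines cannot be known in advance, we can simulate several labs of 10 identical machines. This is only a sketch under loud assumptions: the failure probability `P_FAIL_2Y = 0.3` is a value chosen for illustration, machines are assumed to fail independently of each other, and the helper name `count_failures` is hypothetical.

```python
import random

N_MACHINES = 10
P_FAIL_2Y = 0.3  # assumed 2-year failure probability, for illustration only


def count_failures(rng, n_machines=N_MACHINES, p_fail=P_FAIL_2Y):
    """Count failed machines, treating each machine as an independent trial."""
    return sum(rng.random() < p_fail for _ in range(n_machines))


rng = random.Random(2017)  # fixed seed so the sketch is reproducible
failures_per_lab = [count_failures(rng) for _ in range(8)]

# Each entry is the number of failed machines in one simulated lab.
print(failures_per_lab)
```

Even with the same assumed failure probability for every machine, the count of failures differs from one simulated lab to the next.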

Assume that three machines are not functioning at the end of 2 years. As an administrator, you have called the service vendor to fix the problem. For the sake of simplicity, we assume that the nature of the failure is the same for the three machines, say a motherboard failure. Is it realistic to expect the vendor to fix the three machines in an identical amount of time?

Again, by experience, we know that this is very unlikely. If the reader thinks otherwise, assume that 100 identical machines were running for 2 years and 30 of them now have a motherboard issue. It is now clear that some machines may require a component replacement, while others would start functioning again after a repair.

Let us now summarize the preceding experiments with the following questions:

What is the average installation time for the R software on identically configured computers?
How many machines are likely to break down after a period of 1 year, 2 years, and 3 years?
If a failed machine has issues related to the motherboard, what is the average service time?
What is the fraction of failed machines that have a failed motherboard component?
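Given recorded data, sample-based answers to questions of this kind reduce to simple summaries. The sketch below assumes a small set of records; every number in it is invented purely for illustration and is not data from the lab described above. The variable names are hypothetical as well, and the failure count refers to a single inspection rather than to the 1, 2, and 3 year marks.

```python
from statistics import mean

# Hypothetical records: all values below are invented for illustration only.
install_times_s = [312.4, 309.8, 315.1, 311.0, 310.6,
                   313.9, 308.7, 314.2, 311.5, 312.0]

# One record per failed machine: (failed component, service time in hours).
failure_records = [
    ("motherboard", 5.5),
    ("motherboard", 2.0),
    ("power supply", 1.5),
    ("motherboard", 4.0),
]

# Average installation time across the machines.
avg_install = mean(install_times_s)

# Number of machines found failed at this inspection.
n_failed = len(failure_records)

# Average service time among motherboard failures only.
mb_times = [hours for part, hours in failure_records if part == "motherboard"]
avg_mb_service = mean(mb_times)

# Fraction of failed machines whose failure was the motherboard.
frac_mb = len(mb_times) / n_failed

print(f"average installation time: {avg_install:.2f} s")
print(f"failed machines: {n_failed}")
print(f"average motherboard service time: {avg_mb_service:.2f} h")
print(f"fraction with motherboard failure: {frac_mb:.2f}")
```

These are only point estimates from one sample; a different set of records would give different values.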

Answering questions of this type is the main objective of the subject of Statistics. Certain characteristics of uncertainty are captured by families of probability distributions. Depending on the underlying problem, we have discrete or continuous random variables (RVs). The important and widely used probability distributions form the content of the rest of the chapter. We will begin with the useful discrete distributions.