The Data Science Workshop

You're reading from The Data Science Workshop: A New, Interactive Approach to Learning Data Science

Product type: Paperback
Published in: Jan 2020
ISBN-13: 9781838981266
Length: 818 pages
Edition: 1st Edition

Authors (5): Thomas Joseph, Andrew Worsley, Robert Thas John, Anthony So, Dr. Samuel Asare
Table of Contents (18 chapters)

Preface
1. Introduction to Data Science in Python
2. Regression
3. Binary Classification
4. Multiclass Classification with RandomForest
5. Performing Your First Cluster Analysis
6. How to Assess Performance
7. The Generalization of Machine Learning Models
8. Hyperparameter Tuning
9. Interpreting a Machine Learning Model
10. Analyzing a Dataset
11. Data Preparation
12. Feature Engineering
13. Imbalanced Datasets
14. Dimensionality Reduction
15. Ensemble Learning
16. Machine Learning Pipelines
17. Automated Feature Engineering

Explaining the Results of Regression Analysis

A primary objective of regression analysis is to find a model that explains the variability observed in a dependent variable of interest. It is, therefore, very important to have a quantity that measures how well a regression model explains this variability. The statistic that does this is called R-squared (R2). Sometimes, it is also called the coefficient of determination. To understand what it actually measures, we need to take a look at some other definitions.

The first of these is the Total Sum of Squares (TSS). TSS measures the total variability of the dependent variable around its mean value.

The next quantity is the Regression Sum of Squares (RSS). This measures the amount of variability in the dependent variable that our model explains. If we could create a perfect model with no prediction errors, RSS would equal TSS: the model would account for all the variability we see in the dependent variable around its mean. In practice, this rarely happens. Our models are not perfect, so RSS is less than TSS. The amount by which RSS falls short of TSS is the variability in the dependent variable that our regression model is not able to explain. That quantity is the Error Sum of Squares (ESS), which is essentially the sum of the squared residuals of our model.
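
Written out explicitly (notation added here for illustration: y_i are the observed values of the dependent variable, \hat{y}_i the model's predictions, and \bar{y} the mean of the dependent variable):

\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad
\mathrm{RSS} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2, \qquad
\mathrm{ESS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

and, for an ordinary least squares model that includes an intercept, TSS = RSS + ESS.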

R-squared is the ratio of RSS to TSS. It therefore tells us, as a percentage, how much of the total variability in the dependent variable around its mean our regression model is able to explain. R2 becomes smaller when RSS grows smaller and vice versa. In simple linear regression, where there is only one independent variable, R2 is enough to judge the overall fit of the model to the data.
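
As a minimal sketch of these definitions (not code from the book; it uses NumPy and a made-up toy dataset), the quantities can be computed directly from a fitted line:

import numpy as np

# Toy data: a noisy linear relationship (illustrative only)
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=2.0, size=x.size)

# Fit a simple linear regression with least squares
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

# Sums of squares as defined above
tss = np.sum((y - y.mean()) ** 2)        # total variability around the mean
rss = np.sum((y_hat - y.mean()) ** 2)    # variability explained by the model
ess = np.sum((y - y_hat) ** 2)           # variability left unexplained (residuals)

r_squared = rss / tss
print(f"R-squared: {r_squared:.3f}  (same as 1 - ESS/TSS: {1 - ess / tss:.3f})")

Because the fitted line includes an intercept, RSS / TSS and 1 - ESS / TSS give the same number here.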

There is a problem, however, when it comes to multiple linear regression. R2 is known to be sensitive to the addition of extra independent variables to the model: even if a new independent variable is only slightly correlated with the dependent variable, adding it will increase R2. Relying on R2 alone to choose between models defined for the same dependent variable therefore encourages chasing complex models with many independent variables. This complexity is not helpful in practice; in fact, it can lead to a modeling problem called overfitting.

To overcome this problem, the Adjusted R2 (denoted Adj. R-Squared on the output of statsmodels) is used to select between models defined for the same dependent variable. Adjusted R2 will increase only when the addition of an independent variable to the model contributes to explaining the variability in the dependent variable in a meaningful way.
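
A small illustration of this behavior (again a sketch on synthetic data, not taken from the book) uses statsmodels, whose fitted OLS results expose both statistics as rsquared and rsquared_adj:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)

# Model 1: only the two informative predictors
m1 = sm.OLS(y, sm.add_constant(X)).fit()

# Model 2: the same predictors plus a column of pure noise
X_plus_noise = np.column_stack([X, rng.normal(size=n)])
m2 = sm.OLS(y, sm.add_constant(X_plus_noise)).fit()

print(f"Model 1: R2={m1.rsquared:.4f}  Adj. R2={m1.rsquared_adj:.4f}")
print(f"Model 2: R2={m2.rsquared:.4f}  Adj. R2={m2.rsquared_adj:.4f}")
# R2 never decreases when a variable is added; Adj. R2 only improves if the
# new variable explains enough variability to justify its inclusion.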

In Activity 2.02, our model explained 88 percent of the variability in the transformed dependent variable, which is really good. We started with simple models and worked to improve their fit using different techniques. All the exercises and activities in this chapter have shown that the regression analysis workflow is iterative: you start by plotting the data to get a visual picture, and from there you apply different techniques to improve the model you finally develop. Once a good model has been developed, the next step is to validate it statistically before it can be used to make predictions or to derive insights for decision making. Next, let's discuss what validating the model statistically means.

Regression Analysis Checks and Balances

In the preceding discussions, we used the R-squared and Adjusted R-squared statistics to assess the goodness of fit of our models. While the R-squared statistic provides an estimate of the strength of the relationship between the model and the dependent variable, it does not provide a formal statistical hypothesis test for this relationship.

What do we mean by a formal statistical hypothesis test for a relationship between a dependent variable and some independent variable(s) in a model?

Recall that, for an independent variable to have a relationship with the dependent variable in a model, the coefficient (β) of that independent variable in the regression model must not be zero (0). It is all well and good to conduct a regression analysis on our Boston Housing dataset and find that an independent variable (say, the median value of owner-occupied homes) has a nonzero coefficient (β) in our model.

The question is: would we (or someone else) find a nonzero coefficient (β) for the median value of owner-occupied homes if we repeated this analysis using a different sample of Boston Housing data collected at different locations or times? Is the nonzero coefficient found in our analysis specific to our sample, and zero for any other Boston Housing sample that might be collected? Did we find the nonzero coefficient by chance? These are the questions that hypothesis tests seek to answer. We can never be one hundred percent sure whether the nonzero coefficient (β) of an independent variable arose by chance. But hypothesis testing gives us a framework for calculating the level of confidence with which we can say that the nonzero coefficient (β) found in our analysis is not due to chance. This is how it works.

We first agree on a level of risk (the α-value, α-risk, or Type I error rate) that the nonzero coefficient (β) may have been found by chance. The idea is that we are willing to live with this level of risk of making the mistake of claiming that the coefficient (β) is nonzero when in fact it is zero.

In most practical analyses, the α-value is set at 0.05, which is 5 percent. Subtracting the α-risk from one (1 − α) gives the level of confidence we have that the nonzero coefficient (β) found in our analysis did not come about by chance. So, at a 5% α-value, our confidence level is 95%.

We then calculate a probability value (usually called the p-value), which gives us a measure of the α-risk associated with the coefficient (β) of interest in our model. We compare the p-value to our chosen α-risk: if the p-value is less than the agreed α-risk, we reject the idea that the nonzero coefficient (β) was found by chance, because the risk of mistakenly claiming that the coefficient (β) is nonzero is within the limit we set for ourselves earlier.
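
The decision rule itself is easy to express. The sketch below uses an illustrative, hypothetical p-value for a single coefficient, where the null hypothesis is that the coefficient is zero (H0: β = 0):

alpha = 0.05                  # agreed level of Type I risk
confidence_level = 1 - alpha  # 0.95, i.e. 95%

p_value = 0.012               # hypothetical p-value reported for a coefficient

if p_value < alpha:
    print("Reject the null hypothesis: the coefficient is statistically significant.")
else:
    print("Fail to reject the null hypothesis: the coefficient may be zero.")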

Another way of stating that the nonzero coefficient (β) was NOT found by chance is to say that the coefficient (β) is statistically significant or that we reject the null hypothesis (the null hypothesis being that there is no relationship between the variables being studied). We apply these ideas of statistical significance to our models in two stages:

  1. In stage one, we validate the model as a whole statistically.
  2. In stage two, we validate the independent variables in our model individually for statistical significance.

The F-test

The F-test validates the overall statistical significance of the relationship between the model and the dependent variable. If the p-value for the F-test is less than the chosen α-level (0.05, in our case), we reject the null hypothesis and conclude that the model is statistically significant overall.

When we fit a regression model, we generate an F-value. This value can be used to determine whether the model is statistically significant overall. In general, an increase in R2 increases the F-value. This means that the larger the F-value, the better the chances of the overall statistical significance of a model.
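
The connection between the two statistics can be written down explicitly. For a model with an intercept, p independent variables, and n observations, one common form of the overall F-statistic is (notation added here for illustration):

F = \frac{\mathrm{RSS} / p}{\mathrm{ESS} / (n - p - 1)} = \frac{R^2 / p}{(1 - R^2) / (n - p - 1)}

so, for fixed n and p, a larger R2 always produces a larger F-value.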

A good F-value is expected to be larger than one. The model in Figure 2.19 has an F-statistic of 261.5, which is larger than one, and a p-value (Prob (F-statistic)) of approximately zero. The risk of making a mistake and rejecting the null hypothesis when we should not (known as a Type I error in hypothesis testing) is less than the 5% limit we chose to live with at the beginning of the hypothesis test. Because the p-value is less than 0.05, we reject the null hypothesis for our model in Figure 2.19 and state that the model is statistically significant at the chosen 95% confidence level.
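
In statsmodels, the overall F-statistic and its p-value are available directly on the fitted results object. The following is a sketch on synthetic data, not the book's Boston Housing model:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 150
X = rng.normal(size=(n, 3))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)

results = sm.OLS(y, sm.add_constant(X)).fit()

alpha = 0.05
print(f"F-statistic:        {results.fvalue:.1f}")
print(f"Prob (F-statistic): {results.f_pvalue:.2e}")

if results.f_pvalue < alpha:
    print("The model is statistically significant overall at the 95% confidence level.")
else:
    print("The model is not statistically significant overall.")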

The t-test

Once a model has been determined to be statistically significant globally, we can proceed to examine the significance of the individual independent variables in the model. In Figure 2.19, the p-values (denoted P>|t| in the second section of the summary) for the independent variables are provided. These p-values were calculated from the t-values, which are also given in the summary results. The process is no different from the one just discussed for the global case: we compare each p-value to the 0.05 α-level. If an independent variable has a p-value of less than 0.05, it is statistically significant in our model in explaining the variability in the dependent variable. If the p-value is 0.05 or higher, that particular independent variable (or term) is not statistically significant, meaning that it does not contribute, in a statistical sense, toward explaining the variability in the dependent variable.

A close inspection of Figure 2.19 shows that some of the terms have p-values larger than 0.05. These terms do not contribute in a statistically significant way to explaining the variability in our transformed dependent variable. To improve this model, those terms will have to be dropped and a new model tried. It should be clear by this point that the process of building a regression model is truly iterative.
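
The per-coefficient t-values and p-values can be pulled from the same fitted results object. The sketch below (synthetic data again, not the book's model) flags which terms fail the 0.05 test and would be candidates for dropping:

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 150
data = pd.DataFrame({
    "x1": rng.normal(size=n),   # genuinely related to y
    "x2": rng.normal(size=n),   # pure noise
})
data["y"] = 1.5 * data["x1"] + rng.normal(size=n)

X = sm.add_constant(data[["x1", "x2"]])
results = sm.OLS(data["y"], X).fit()

alpha = 0.05
summary = pd.DataFrame({
    "coef": results.params,
    "t": results.tvalues,
    "P>|t|": results.pvalues,
    "significant": results.pvalues < alpha,
})
print(summary)
# Terms with P>|t| >= 0.05 do not contribute significantly to explaining y
# and would be dropped before the model is refitted.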

You have been reading a chapter from The Data Science Workshop, published in January 2020 (ISBN-13: 9781838981266).