Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
Essential Statistics for Non-STEM Data Analysts

You're reading from   Essential Statistics for Non-STEM Data Analysts Get to grips with the statistics and math knowledge needed to enter the world of data science with Python

Arrow left icon
Product type Paperback
Published in Nov 2020
Publisher Packt
ISBN-13 9781838984847
Length 392 pages
Edition 1st Edition
Languages
Arrow right icon
Author (1):
Arrow left icon
Rongpeng Li Rongpeng Li
Author Profile Icon Rongpeng Li
Rongpeng Li
Arrow right icon
View More author details
Toc

Table of Contents (19) Chapters Close

Preface 1. Section 1: Getting Started with Statistics for Data Science
2. Chapter 1: Fundamentals of Data Collection, Cleaning, and Preprocessing FREE CHAPTER 3. Chapter 2: Essential Statistics for Data Assessment 4. Chapter 3: Visualization with Statistical Graphs 5. Section 2: Essentials of Statistical Analysis
6. Chapter 4: Sampling and Inferential Statistics 7. Chapter 5: Common Probability Distributions 8. Chapter 6: Parametric Estimation 9. Chapter 7: Statistical Hypothesis Testing 10. Section 3: Statistics for Machine Learning
11. Chapter 8: Statistics for Regression 12. Chapter 9: Statistics for Classification 13. Chapter 10: Statistics for Tree-Based Methods 14. Chapter 11: Statistics for Ensemble Methods 15. Section 4: Appendix
16. Chapter 12: A Collection of Best Practices 17. Chapter 13: Exercises and Projects 18. Other Books You May Enjoy

Data standardization – when and how

Data standardization is a common preprocessing step. I use the terms standardization and normalization interchangeably. You may also encounter the concept of rescaling in literature or blogs.

Standardization often means shifting the data to be zero-centered with a standard deviation of 1. The goal is to bring variables with different units/ranges down to the same range. Many machine learning tasks are sensitive to data magnitudes. Standardization is supposed to remove such factors.

Rescaling doesn't necessarily bring the variables to a common range. This is done by means of customized mapping, usually linear, to scale original data to a different range. However, the common approach of min-max scaling does transform different variables into a common range [0, 1].

People may argue about the difference between standardization and normalization. When comparing their differences, normalization will refer to normalizing different variables to the same range [0, 1], and min-max scaling is considered a normalization algorithm. However, there are other normalization algorithms as well. Standardization cares more about the mean and standard deviation.

Standardization also transforms the original distribution closer to a Gaussian distribution. In the event that the original distribution is indeed Gaussian, standardization outputs a standard Gaussian distribution.

When to perform standardization

Perform standardization when your downstream tasks require it. For example, the k-nearest neighbors method is sensitive to variable magnitudes, so you should standardize the data. On the other hand, tree-based methods are not sensitive to different ranges of variables, so standardization is not required.

There are mature libraries to perform standardization. We first calculate the standard deviation and mean of the data, subtract the mean from every entry, and then divide by the standard deviation. Standard deviation describes the level of variety in data that will be discussed more in Chapter 2, Essential Statistics for Data Assessment.

Here is an example involving vanilla Python:

stdChol = np.std(chol)
meanChol = np.mean(chol)
chol2 = chol.apply(lambda x: (x-meanChol)/stdChol)
plt.hist(chol2,bins=range(int(min(chol2)), int(max(chol2))+1, 1));

The output is as follows:

Figure 1.14 – Standardized cholesterol data

Figure 1.14 – Standardized cholesterol data

Note that the standardized distribution looks more like a Gaussian distribution now.

Data standardization is irreversible. Information will be lost in standardization. It is only recommended to do so when no original information, such as magnitudes or original standard deviation, will be required later. In most cases, standardization is a safe choice for most downstream data science tasks.

In the next section, we will use the scikit-learn preprocessing module to demonstrate tasks involving standardization.

You have been reading a chapter from
Essential Statistics for Non-STEM Data Analysts
Published in: Nov 2020
Publisher: Packt
ISBN-13: 9781838984847
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime
Banner background image