You're reading from Principles of Data Science A beginner's guide to essential math and coding skills for data fluency and machine learning

Product type Paperback

Published in Jan 2024

Publisher Packt

ISBN-13 9781837636303

Length 326 pages

Edition 3rd Edition

Languages

Python

Tools

MLOps

Concepts

Data Science

Author (1):

Sinan Ozdemir

View More author details

Table of Contents (18) Chapters

Preface

1. Chapter 1: Data Science Terminology FREE CHAPTER

2. Chapter 2: Types of Data

3. Chapter 3: The Five Steps of Data Science

4. Chapter 4: Basic Mathematics

5. Chapter 5: Impossible or Improbable – A Gentle Introduction to Probability

6. Chapter 6: Advanced Probability

7. Chapter 7: What Are the Chances? An Introduction to Statistics

8. Chapter 8: Advanced Statistics

9. Chapter 9: Communicating Data

10. Chapter 10: How to Tell if Your Toaster is Learning – Machine Learning Essentials

11. Chapter 11: Predictions Don’t Grow on Trees, or Do They?

12. Chapter 12: Introduction to Transfer Learning and Pre-Trained Models

13. Chapter 13: Mitigating Algorithmic Bias and Tackling Model and Data Drift

14. Chapter 14: AI Governance

15. Chapter 15: Navigating Real-World Data Science Case Studies in Action

16. Index

Why subscribe?

17. Other Books You May Enjoy

What is data science?

This is a simple question, but before we go any further, let’s look at some basic definitions that we will use throughout this book. The great/awful thing about the field of data science is that it is young enough that sometimes, even basic definitions and terminology can be debated across publications and people. The basic definition is that data science is the process of acquiring knowledge through data.

It may seem like a small definition for such a big topic, and rightfully so! Data science covers so many things that it would take pages to list them all out. Put another way, data science is all about how we take data, use it to acquire knowledge, and then use that knowledge to do the following:

Make informed decisions
Predict the future
Understand the past/present
Create new industries/products

This book is all about the methods of data science, including how to process data, gather insights, and use those insights to make informed decisions and predictions.

Understanding basic data science terminology

The definitions that follow are general enough to be used in daily conversations and work to serve the purpose of this book, which is an introduction to the principles of data science.

Let’s start by defining what data is. This might seem like a silly first definition to look at, but it is very important. Whenever we use the word “data,” we refer to a collection of information in either a structured or unstructured format. These formats have the following qualities:

Structured data: This refers to data that is sorted into a row/column structure, where every row represents a single observation and the columns represent the characteristics of that observation
Unstructured data: This is the type of data that is in a free form, usually text or raw audio/signals that must be parsed further to become structured

Data is everywhere around us and originates from a multitude of sources, including everyday internet browsing, social media activities, and technological processes such as system logs. This data, when structured, becomes a useful tool for various algorithms and businesses. Consider the data from your online shopping history. Each transaction you make is recorded with details such as the product, price, date and time, and payment method. This structured information, laid out in rows and columns, forms a clear picture of your shopping habits, preferences, and patterns.

Yet not all data comes neatly packaged. Unstructured data, such as comments and reviews on social media or an e-commerce site, don’t follow a set format. They might include text, images, or even videos, making it more challenging to organize and analyze. However, once processed correctly, this free-flowing information offers valuable insights such as sentiment analysis, providing a deeper understanding of customer attitudes and opinions. In essence, the ability to harness both structured and unstructured data is key to unlocking the potential of the vast amounts of information we generate daily.

Opening Excel, or any spreadsheet software, presents you with a blank grid meant for structured data. It’s not ideally suited for handling unstructured data. While our primary focus will be structured data, given its ease of interpretation, we won’t overlook the richness of raw text and other unstructured data types, and the techniques to make them comprehensible.

The crux of data science lies in employing data to unveil insights that would otherwise remain hidden. Consider a healthcare setting, where data science techniques can predict which patients are likely not to attend their appointments. This not only optimizes resource allocation but also ensures other patients can utilize these slots. Understanding data science is more than grasping what it does – it’s about appreciating its importance and recognizing why mastering it is in such high demand.

Why data science?

Data science won’t replace the human brain (at least not for a while), but rather augment and complement it, working alongside it. Data science should not be thought of as an end-all solution to our data woes; it is merely an opinion – a very informed opinion, but still an opinion, nonetheless. It deserves a seat at the table.

In this Data Age, it’s clear that we have a surplus of data. But why should that necessitate an entirely new set of vocabulary? What was wrong with our previous forms of analysis? For one, the sheer volume of data makes it impossible for a human to parse it in a reasonable time frame. Data is collected in various forms and from different sources and often comes in a very unstructured format.

Data can be missing, incomplete, or just flat-out wrong. Oftentimes, we will have data on very different scales, and that makes it tough to compare it. Say we are looking at data concerning pricing used cars. One characteristic of a car is the year it was made, and another might be the number of miles on that car. Once we clean our data (which we will spend a great deal of time looking at in this book), the relationships between the data become more obvious, and the knowledge that was once buried deep in millions of rows of data simply pops out. One of the main goals of data science is to make explicit practices and procedures to discover and apply these relationships in the data.

Let’s take a minute to discuss its role today using a very relevant example.

Example – predicting COVID-19 with machine learning

A large component of this book is about how we can leverage powerful machine learning algorithms, including deep learning, to solve modern and complicated tasks. One such problem is using deep learning to be able to aid in the diagnosis, treatment, and prevention of fatal illnesses, including COVID-19. Since the global pandemic erupted in 2020, numerous organizations around the globe turned to data science to alleviate and solve problems related to COVID-19. For example, the following figure shows a visualization of a process for using machine learning (deep learning, in this case) to screen for COVID-19 that was published in March 2020. By then, the world had only known of COVID-19 for a few months, and yet we were able to apply machine learning techniques to such a novel use case with relative ease:

Figure 1.1 – A visualization of a COVID-19 screening algorithm based on deep learning from 2020

This screening algorithm was one of the first of its kind working to identify COVID-19 and recognize it from known illnesses such as the flu. Algorithms like these suggested that we could turn to data and machine learning to aid when unforeseen catastrophes strike. We will learn how to develop algorithms such as this life-changing system later in this book. Creating such algorithms takes a combination of three distinct skills that, when combined, form the backbone of data science. It requires people who are knowledgeable about COVID-19, people who know how to create statistical models, and people who know how to productionize those models so that people can benefit from them.