Python Machine Learning By Example

Getting Started with Machine Learning and Python

The concept of artificial intelligence (AI) outpacing human knowledge is often referred to as the “technological singularity.” Some predictions in the AI research community and other fields suggest that the singularity could happen within the next 30 years. Regardless of its time horizon, one thing is clear: the rise of AI highlights the growing importance of analytical and machine learning skills. Mastering these disciplines equips us to not only understand and interact with increasingly complex AI systems but also actively participate in shaping their development and application, ensuring they benefit humanity.

In this chapter, we will kick off our machine learning journey with the basic, yet important, concepts of machine learning. We will start with what machine learning is all about, why we need it, and its evolution over a few decades. We will then discuss typical machine learning tasks and explore several essential techniques to work with data and models.

At the end of the chapter, we will set up the software for Python, the most popular language for machine learning and data science, and the libraries and tools that are required for this book.

We will go into detail on the following topics:

An introduction to machine learning
Knowing the prerequisites
Getting started with three types of machine learning
Digging into the core of machine learning
Data preprocessing and feature engineering
Combining models
Installing software and setting up

An introduction to machine learning

In this first section, we will kick off our machine learning journey with a brief introduction to machine learning, why we need it, how it differs from automation, and how it improves our lives.

Machine learning is a term that was coined around 1960, consisting of two words—machine, which corresponds to a computer, robot, or other device, and learning, which refers to an activity intended to acquire or discover event patterns, which we humans are good at. Interesting examples include facial recognition, language translation, responding to emails, making data-driven business decisions, and creating various types of content. You will see many more examples throughout this book.

Understanding why we need machine learning

Why do we need machine learning, and why do we want a machine to learn the same way as a human? We can look at it from three main perspectives: maintenance, risk mitigation, and enhanced performance.

First and foremost, of course, computers and robots can work 24/7 and don’t get tired. Machines cost a lot less in the long run. Also, for sophisticated problems that involve a variety of huge datasets or complex calculations, it’s much more justifiable, not to mention intelligent, to let computers do all the work. Machines driven by algorithms that are designed by humans can learn latent rules and inherent patterns, enabling them to carry out tasks effectively.

Learning machines are better suited than humans for tasks that are routine, repetitive, or tedious. Beyond that, automation by machine learning can mitigate risks caused by fatigue or inattention. Self-driving cars, as shown in Figure 1.1, are a great example: a vehicle is capable of navigating by sensing its environment and making decisions without human input. Another example is the use of robotic arms in production lines, which are capable of causing a significant reduction in injuries and costs.

Figure 1.1: An example of a self-driving car

Let’s assume that humans don’t fatigue or we have the resources to hire enough shift workers; would machine learning still have a place? Of course it would! There are many cases, reported and unreported, where machines perform comparably to, or even better than, domain experts. As algorithms are designed to learn from the ground truth and the best thought-out decisions made by human experts, machines can perform just as well as experts.

In reality, even the best expert makes mistakes. Machines can minimize the chance of making wrong decisions by utilizing collective intelligence from individual experts. A major study that identified that machines are better than doctors at diagnosing certain types of cancer is proof of this philosophy (https://www.nature.com/articles/d41586-020-00847-2). AlphaGo (https://deepmind.com/research/case-studies/alphago-the-story-so-far) is probably the best-known example of machines beating humans—an AI program created by DeepMind defeated Lee Sedol, a world champion Go player, in a five-game Go match.

Also, it’s much more scalable to deploy learning machines than to train individuals to become experts, from the perspective of economic and social barriers. Current diagnostic devices can achieve a level of performance similar to that of qualified doctors. We can distribute thousands of diagnostic devices across the globe within a week, but it’s almost impossible to recruit and assign the same number of qualified doctors within the same period.

You may argue against this: what if we have sufficient resources and the capacity to hire the best domain experts and later aggregate their opinions—would machine learning still have a place? Probably not (at least right now)—learning machines might not perform better than the joint efforts of the most intelligent humans. However, individuals equipped with learning machines can outperform the best group of experts. This is an emerging concept called AI-based assistance or AI plus human intelligence, which advocates for combining the efforts of machines and humans. It provides support, guidance, or solutions to users. And more importantly, it can adapt and learn from user interactions to improve performance over time.

We can summarize the previous statement in the following inequality:

human + machine learning → most intelligent tireless human ≥ machine learning > human

Artificial intelligence-generated content (AIGC) is one of the recent breakthroughs. It uses AI technologies to create or assist in creating various types of content, such as articles, product descriptions, music, images, and videos.

A medical operation involving robots is one great example of human and machine learning synergy. Figure 1.2 shows robotic arms in an operating room alongside a surgeon:

Figure 1.2: AI-assisted surgery

Differentiating between machine learning and automation

So does machine learning simply equate to automation that involves the programming and execution of human-crafted or human-curated rule sets? A popular myth says that machine learning is the same as automation because it performs instructive and repetitive tasks and thinks no further. If the answer to that question is yes, why can’t we just hire many software programmers and continue programming new rules or extending old rules?

One reason is that defining, maintaining, and updating rules becomes increasingly expensive over time. The number of possible patterns for an activity or event could be enormous, so enumerating them all exhaustively isn’t practically feasible. It gets even more challenging when it comes to events that are dynamic, ever-changing, or evolving in real time. It’s much easier and more efficient to develop learning algorithms that command computers to learn, extract patterns, and figure things out themselves from abundant data.

The difference between machine learning and traditional programming can be seen in Figure 1.3:

Figure 1.3: Machine learning versus traditional programming

In traditional programming, the computer follows a set of predefined rules to process the input data and produce the outcome. In machine learning, the computer tries to mimic human thinking. It interacts with the input data, expected output, and the environment, and it derives patterns that are represented by one or more mathematical models. The models are then used to interact with future input data and generate outcomes. Unlike in automation, the computer in a machine learning setting doesn’t receive explicit and instructive coding.
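The contrast can be sketched with a toy spam filter. This is only an illustration under made-up assumptions: the single feature (number of links in an email), the hand-picked threshold of 5, and the six labeled samples are all hypothetical. The first function is a rule written by a programmer; the second derives its rule from input and expected-output examples.

```python
# Traditional programming: a human writes the rule explicitly.
def rule_based_is_spam(num_links):
    # Hypothetical hand-picked threshold chosen by a programmer.
    return num_links > 5


# Machine learning: the rule (here, a single threshold) is derived from examples.
def learn_threshold(samples):
    """Pick the threshold that misclassifies the fewest labeled samples."""
    values = sorted({x for x, _ in samples})
    # Candidate thresholds are midpoints between neighboring observed values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(threshold):
        return sum((x > threshold) != label for x, label in samples)

    return min(candidates, key=errors)


# Toy labeled data: (number of links in the email, is it spam?)
training_data = [(0, False), (1, False), (2, False),
                 (3, True), (4, True), (7, True)]
learned_threshold = learn_threshold(training_data)


def learned_is_spam(num_links):
    return num_links > learned_threshold
```

On this toy data, the learned threshold falls between 2 and 3 links, so an email with 3 links is flagged by the learned rule but not by the hand-written one. If the data changes, only the training step has to be rerun; nobody has to edit the rule by hand.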

The volume of data is growing exponentially. Nowadays, the floods of textual, audio, image, and video data are hard to fathom. The Internet of Things (IoT) is a recent development of a new kind of internet, which interconnects everyday devices. The IoT will bring data from household appliances and autonomous cars to the fore. This trend is likely to continue, and we will have more data that is generated and processed. Besides the quantity, the quality of data available has kept increasing in the past few years, partly due to cheaper storage. This has empowered the evolution of machine learning algorithms and data-driven solutions.

Machine learning applications

Jack Ma, co-founder of the e-commerce company Alibaba, explained in a speech in 2018 that IT was the focus of the past 20 years, but for the next 30 years, we will be in the age of data technology (DT) (https://www.alizila.com/jack-ma-dont-fear-smarter-computers/). During the age of IT, companies grew larger and stronger thanks to computer software and infrastructure. Now that businesses in most industries have already gathered enormous amounts of data, it’s presently the right time to exploit DT to unlock insights, derive patterns, and boost new business growth. Broadly speaking, machine learning technologies enable businesses to better understand customer behavior, engage with customers, and optimize operations management.

As for us individuals, machine learning technologies are already making our lives better every day. One application of machine learning with which we’re all familiar is spam email filtering. Another is online advertising, where adverts are served automatically based on information advertisers have collected about us. Stay tuned for the next few chapters, where you will learn how to develop algorithms to solve these two problems and more.

A search engine is an application of machine learning we can’t imagine living without. It involves information retrieval, which parses what we look for, queries the related top records, and applies contextual ranking and personalized ranking, which sorts pages by topical relevance and user preference. E-commerce and media companies have been at the forefront of employing recommendation systems, which help customers find products, services, and articles faster.

The applications of machine learning are boundless, and we keep hearing new examples every day: credit card fraud detection, presidential election prediction, instant speech translation, robo advisors, AI-generated art, chatbots for customer support, and medical or legal advice provided by generative AI technologies. You name it!

In the 1983 War Games movie, a computer made life-and-death decisions that could have resulted in World War III. As far as we know, technology wasn’t able to pull off such feats at the time. However, in 1997, the Deep Blue supercomputer did manage to beat a world chess champion (https://en.wikipedia.org/wiki/Deep_Blue_(chess_computer)). In 2005, a Stanford self-driving car drove by itself for more than 130 miles in a desert (https://en.wikipedia.org/wiki/DARPA_Grand_Challenge_(2005)). In 2007, the car of another team drove through regular urban traffic for more than 60 miles (https://en.wikipedia.org/wiki/DARPA_Grand_Challenge_(2007)). In 2011, the Watson computer won a quiz against human opponents (https://en.wikipedia.org/wiki/Watson_(computer)). As mentioned earlier, the AlphaGo program beat one of the best Go players in the world in 2016. As of 2023, ChatGPT has been widely used across various industries, such as customer support, content generation, market research, and training and education (https://www.forbes.com/sites/bernardmarr/2023/05/30/10-amazing-real-world-examples-of-how-companies-are-using-chatgpt-in-2023).

If we assume that computer hardware is the limiting factor, then we can try to extrapolate into the future. A famous American inventor and futurist, Ray Kurzweil, did just that, predicting in 2017 that we can expect AI to gain human-level intelligence around 2029 (https://aibusiness.com/responsible-ai/ray-kurzweil-predicts-that-the-singularity-will-take-place-by-2045). What’s next?

Can’t wait to launch your own machine learning journey? Let’s start with the prerequisites and the basic types of machine learning.

Knowing the prerequisites

Machine learning, which mimics human intelligence, is a subfield of AI, a field of computer science concerned with creating intelligent systems. Software engineering is another field in computer science. Generally, we can label Python programming as a type of software engineering. Machine learning is also closely related to linear algebra, probability theory, statistics, and mathematical optimization. We usually build machine learning models based on statistics, probability theory, and linear algebra, and then optimize the models using mathematical optimization.

Most of you reading this book should have a good, or at least sufficient, command of Python programming. Those who aren’t feeling confident about mathematical knowledge might be wondering how much time should be spent learning or brushing up on the aforementioned subjects. Don’t panic; we will get machine learning to work for us without going into any deep mathematical details in this book. It just requires some basic 101 knowledge of probability theory and linear algebra, which helps us to understand the mechanics of machine learning techniques and algorithms. And it gets easier, as we will build models both from scratch and with popular packages in Python, a language we like and are familiar with.

For those who want to learn or brush up on probability theory and linear algebra, feel free to search for basic probability theory and basic linear algebra. There are a lot of resources available online, for example, https://people.ucsc.edu/~abrsvn/intro_prob_1.pdf, the online course Introduction to Probability by Harvard University (https://pll.harvard.edu/course/introduction-probability-edx) regarding probability 101, and the following paper regarding basic linear algebra: http://www.maths.gla.ac.uk/~ajb/dvi-ps/2w-notes.pdf.

Those who want to study machine learning systematically can enroll in computer science, AI, and, more recently, data science and AI master’s programs. There are also various data science boot camps. However, the selection for boot camps is usually stricter, as they’re more job-oriented and the program duration is often short, ranging from 4 to 10 weeks. Another option is free Massive Open Online Courses (MOOCs), such as Andrew Ng’s popular course on machine learning. Last but not least, industry blogs and websites are great resources for us to keep up with the latest developments.

Machine learning is not only a skill but also a bit of a sport. We can compete in several machine learning competitions, such as Kaggle (www.kaggle.com)—sometimes for decent cash prizes, sometimes for joy, but most of the time to play to our strengths. However, to win these competitions, we may need to utilize certain techniques, which are only useful in the context of competitions and not in the context of trying to solve a business problem. That’s right—the no free lunch theorem (https://en.wikipedia.org/wiki/No_free_lunch_theorem) applies here. In the context of machine learning, this theorem suggests that no single algorithm is universally superior across all possible datasets and problem domains.

Next, we’ll take a look at the three types of machine learning.

Getting started with three types of machine learning

A machine learning system is fed with input data—this can be numerical, textual, visual, or audiovisual. The system usually has an output—this can be a floating-point number, for instance, the acceleration of a self-driving car, or an integer representing a category (also called a class), for example, a cat or tiger from image recognition.

The main task of machine learning is to explore and construct algorithms that can learn from historical data and make predictions on new input data. For a data-driven solution, we need to define (or have it defined by an algorithm) an evaluation function called a loss or cost function, which measures how well the models learn. In this setup, we create an optimization problem with the goal of learning most efficiently and effectively.
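To make the idea of a loss function and the resulting optimization problem concrete, here is a minimal sketch. It assumes a one-parameter model y = w * x, mean squared error as the loss, and plain gradient descent as the optimizer; the data points, learning rate, and iteration count are arbitrary choices for illustration, not recommendations.

```python
def mse_loss(w, xs, ys):
    """Mean squared error of the one-parameter model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def mse_gradient(w, xs, ys):
    """Derivative of the mean squared error with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


# Toy historical data, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

w = 0.0                      # initial guess for the parameter
learning_rate = 0.01         # step size (an arbitrary choice)
initial_loss = mse_loss(w, xs, ys)

for _ in range(500):
    # Move w a small step in the direction that reduces the loss.
    w -= learning_rate * mse_gradient(w, xs, ys)

final_loss = mse_loss(w, xs, ys)
```

After the loop, `w` ends up close to 2 and `final_loss` is far smaller than `initial_loss`: the loss tells us how well the model has learned, and the optimizer uses it to improve the model.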

Depending on the nature of the learning data, machine learning tasks can be broadly classified into the following three categories:

Unsupervised learning: When the learning data only contains indicative signals without any description attached (we call this unlabeled data), it’s up to us to find the structure of the data underneath, discover hidden information, or determine how to describe the data. Unsupervised learning can be used to detect anomalies, such as fraud or defective equipment, or group customers with similar online behaviors for a marketing campaign. Data visualization that makes data more digestible, as well as dimensionality reduction that distills relevant information from noisy data, are also in the family of unsupervised learning.
Supervised learning: When learning data comes with a description, targets, or desired output besides indicative signals (we call this labeled data), the learning goal is to find a general rule that maps input to output. The learned rule is then used to label new data with unknown output. The labels are usually provided by event-logging systems or evaluated by human experts. Also, if feasible, they may be produced by human raters, through crowd-sourcing, for instance.

Supervised learning is commonly used in daily applications, such as face and speech recognition, product or movie recommendations, sales forecasting, and spam email detection.

Reinforcement learning: Learning data provides feedback so that a system adapts to dynamic conditions in order to ultimately achieve a certain goal. The system evaluates its performance based on the feedback responses and reacts accordingly. The best-known instances include robotics for industrial automation, self-driving cars, and the Go master AlphaGo. The key difference between reinforcement learning and supervised learning is the interaction with the environment.

The following diagram depicts the types of machine learning tasks:

Figure 1.4: Types of machine learning tasks

As shown in the diagram, we can further subdivide supervised learning into regression and classification. Regression trains on and predicts continuous-valued responses, for example, predicting house prices, while classification attempts to find the appropriate class label, such as analyzing a positive/negative sentiment and predicting a loan default.
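The difference between the two supervised tasks lies in the type of target. The sketch below uses a deliberately simple 1-nearest-neighbor lookup (chosen here only for brevity) on made-up data: the same function returns a continuous value in the regression case and a class label in the classification case.

```python
def nearest_target(query, labeled_points):
    """Return the target of the training point closest to `query`."""
    _, target = min(labeled_points, key=lambda point: abs(point[0] - query))
    return target


# Regression: targets are continuous values.
# Hypothetical data: (house size in square meters, price in thousands).
house_prices = [(50, 150.0), (80, 240.0), (120, 360.0)]
predicted_price = nearest_target(85, house_prices)

# Classification: targets are discrete class labels.
# Hypothetical data: (review score, sentiment).
sentiments = [(1, "negative"), (2, "negative"), (4, "positive"), (5, "positive")]
predicted_sentiment = nearest_target(4.4, sentiments)
```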

If not all learning samples are labeled, but some are, we have semi-supervised learning. This makes use of unlabeled data (typically a large amount) for training, besides a small amount of labeled data. Semi-supervised learning is applied in cases where it is expensive to acquire a fully labeled dataset and more practical to label a small subset. For example, it often requires skilled experts to label hyperspectral remote sensing images, while acquiring unlabeled data is relatively easy.

Feeling a little bit confused by the abstract concepts? Don’t worry. We will encounter many concrete examples of these types of machine learning tasks later in this book. For example, in Chapter 2, Building a Movie Recommendation Engine with Naïve Bayes, we will dive into supervised learning classification and its popular algorithms and applications. Similarly, in Chapter 5, Predicting Stock Prices with Regression Algorithms, we will explore supervised learning regression.

We will focus on unsupervised techniques and algorithms in Chapter 8, Discovering Underlying Topics in the Newsgroups Dataset with Clustering and Topic Modeling. Last but not least, the third machine learning task, reinforcement learning, will be covered in Chapter 15, Making Decisions in Complex Environments with Reinforcement Learning.

Besides categorizing machine learning based on the learning task, we can categorize it chronologically.

A brief history of the development of machine learning algorithms

In fact, we have a whole zoo of machine learning algorithms that have experienced varying popularity over time. We can roughly categorize them into five main approaches: logic-based learning, statistical learning, artificial neural networks, genetic algorithms, and deep learning.

The logic-based systems were the first to be dominant. They used basic rules specified by human experts, and with these rules, systems tried to reason using formal logic, background knowledge, and hypotheses.

Statistical learning theory attempts to find a function to formalize the relationships between variables. In the mid-1980s, artificial neural networks (ANNs) came to the fore. ANNs imitate animal brains and consist of interconnected neurons that are also an imitation of biological neurons. They try to model complex relationships between input and output values and capture patterns in data. ANNs were superseded by statistical learning systems in the 1990s.

Genetic algorithms (GA) were popular in the 1990s. They mimic the biological process of evolution and try to find optimal solutions, using methods such as mutation and crossover.

In the 2000s, ensemble learning methods gained attention, which combined multiple models to improve performance.

We have seen deep learning become a dominant force since the late 2010s. The term deep learning was coined around 2006 and refers to deep neural networks with many layers. The breakthrough in deep learning was the result of the integration and utilization of Graphics Processing Units (GPUs), which massively speed up computation. The availability of large datasets also fuels the deep learning revolution.

GPUs were originally developed to render video games and are very good at parallel matrix and vector algebra. It’s believed that deep learning resembles the way humans learn. Therefore, it may be able to deliver on the promise of sentient machines. Of course, in this book, we will dig deep into deep learning in Chapter 11, Categorizing Images of Clothing with Convolutional Neural Networks, and Chapter 12, Making Predictions with Sequences Using Recurrent Neural Networks, after touching on it in Chapter 6, Predicting Stock Prices with Artificial Neural Networks.

Machine learning algorithms continue to evolve rapidly, with ongoing research in areas including transfer learning, generative models, and reinforcement learning, which are the backbone of AIGC. We will explore the latest developments in Chapter 13, Advancing Language Understanding and Generation with the Transformer Models, and Chapter 14, Building an Image Search Engine Using CLIP: a Multimodal Approach.

Some of us may have heard of Moore’s law—an empirical observation claiming that computer hardware improves exponentially with time. The law was first formulated by Gordon Moore, the co-founder of Intel, in 1965. According to the law, the number of transistors on a chip should double every two years. In the following diagram, you can see that the law holds up nicely (the size of the bubbles corresponds to the average transistor count in GPUs):

Figure 1.5: Transistor counts over the past decades

The consensus seems to be that Moore’s law should continue to be valid for a couple of decades. This gives some credibility to Ray Kurzweil’s predictions of achieving true machine intelligence by 2029.

Digging into the core of machine learning

After discussing the categorization of machine learning algorithms, we are now going to dig into the core of machine learning—generalizing with data, the different levels of generalization, as well as the approaches to attain the right level of generalization.

Generalizing with data

The good thing about data is that there’s a lot of it in the world. The bad thing is that it’s hard to process this data. The challenge stems from the diversity and noisiness of it. We humans usually process data coming into our ears and eyes. These inputs are transformed into electrical or chemical signals. On a very basic level, computers and robots also work with electrical signals.

These electrical signals are then translated into ones and zeros. However, we program in Python in this book, and at that level, we normally represent the data as either numbers or text. Text isn’t very convenient for computation, so we need to transform it into numerical values.
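One simple way to turn text into numbers is to count words, often called a bag-of-words representation. The sketch below is one possible encoding among many; the two example sentences are made up, and splitting on whitespace is a simplifying assumption.

```python
from collections import Counter

documents = ["machine learning is fun", "learning python is fun fun"]

# Build a vocabulary: every distinct word gets a fixed position.
vocabulary = sorted({word for doc in documents for word in doc.split()})


def to_count_vector(text):
    """Represent a text as word counts, one entry per vocabulary word."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]


vectors = [to_count_vector(doc) for doc in documents]
# vocabulary -> ['fun', 'is', 'learning', 'machine', 'python']
# vectors    -> [[1, 1, 1, 1, 0], [2, 1, 1, 0, 1]]
```

Each document becomes a list of numbers of the same length, which is the kind of input that learning algorithms can work with.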

Especially in the context of supervised learning, we have a scenario similar to studying for an exam. We have a set of practice questions and the actual exams. We should be able to answer exam questions without being exposed to identical questions beforehand. This is called generalization: we learn something from our practice questions and, hopefully, can apply this knowledge to other similar questions. In machine learning, these practice questions are called training sets or training samples. This is where the machine learning models derive patterns from. And the actual exams are testing sets or testing samples. They are where the models are eventually applied. Learning effectiveness is measured by how well the learned models perform on the testing samples.

Sometimes, between practice questions and actual exams, we have mock exams to assess how well we will do in actual exams and to aid revision. These mock exams are known as validation sets or validation samples in machine learning. They help us to verify how well the models will perform in a simulated setting, and then we fine-tune the models accordingly in order to achieve greater accuracy.
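The three roles (practice questions, mock exams, and actual exams) can be sketched as a three-way split of a dataset. The 60/20/20 proportions and the fixed seed below are arbitrary assumptions for illustration, and `split_dataset` is a hypothetical helper, not a library function.

```python
import random


def split_dataset(samples, train_frac=0.6, val_frac=0.2, seed=42):
    """Shuffle the samples and cut them into training, validation, and testing sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]       # whatever remains
    return train, validation, test


train_set, validation_set, test_set = split_dataset(range(100))
```

The model is fitted on `train_set`, tuned against `validation_set`, and only evaluated once at the end on `test_set`, so that the final score reflects performance on data the model has never seen.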

An old-fashioned programmer would talk to a business analyst or other expert, and then implement a tax rule that adds a certain value multiplied by another corresponding value, for instance. In a machine learning setting, we can give the computer a bunch of input and output examples; alternatively, if we want to be more ambitious, we can feed the program the actual tax texts. We can let the machine consume the data and figure out the tax rule, just as an autonomous car doesn’t need a lot of explicit human input.

In physics, we have almost the same situation. We want to know how the universe works and formulate laws in a mathematical language. Since we don’t know how it works, all we can do is measure the error produced in our attempt at law formulation and try to minimize it. In supervised learning tasks, we compare our results against the expected values. In unsupervised learning, we measure our success with related metrics. For instance, we want data points to be grouped based on similarities, forming clusters; the metrics could be how similar the data points within one cluster are, or how different the data points from two clusters are. In reinforcement learning, a program evaluates its moves, for example, by using a predefined function in a chess game.
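For the clustering case, the two metrics mentioned above can be sketched as follows. The one-dimensional points and the use of absolute difference as the distance are simplifying assumptions; both function names are made up for this illustration.

```python
from itertools import combinations, product
from statistics import mean


def mean_intra_cluster_distance(cluster):
    """Average distance between pairs of points inside one cluster (lower means tighter)."""
    return mean(abs(a - b) for a, b in combinations(cluster, 2))


def mean_inter_cluster_distance(cluster_a, cluster_b):
    """Average distance between points of two clusters (higher means better separated)."""
    return mean(abs(a - b) for a, b in product(cluster_a, cluster_b))


cluster_1 = [1.0, 1.5, 2.0]
cluster_2 = [8.0, 9.0, 10.0]

intra = mean_intra_cluster_distance(cluster_1)
inter = mean_inter_cluster_distance(cluster_1, cluster_2)
```

A good grouping has a small `intra` value and a large `inter` value, as is the case for these two well-separated toy clusters.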

Aside from correct generalization with data, there are two levels of generalization, overfitting and underfitting, which we will explore in the next section.

Overfitting, underfitting, and the bias-variance trade-off

In this section, let’s take a look at both levels of generalization in detail and explore the bias-variance trade-off.

Overfitting

Reaching a well-fitted model is the goal of a machine learning task. What if the model overfits? Overfitting means a model fits the existing observations too well but fails to predict future new observations. Let’s look at the following analogy.

If we go through many practice questions for an exam, we may start to find ways to answer questions that have nothing to do with the subject material. For instance, given only five practice questions, we might find that if there are two occurrences of potatoes, one of tomato, and three of banana in a multiple-choice question, the answer is always A, and if there is one occurrence of potato, three of tomato, and two of banana in a question, the answer is always B. We could then conclude that this is always true and apply such a theory later, even though the subject or answer may not be relevant to potatoes, tomatoes, or bananas. Or, even worse, we might memorize the answers to each question verbatim. We would then score highly on the practice questions, leading us to hope that the questions in the actual exams would be the same as the practice questions. However, in reality, we would score very low on the exam questions, as it’s rare that the exact same questions occur in exams.

The phenomenon of memorization can cause overfitting. This can occur when we extract too much information from the training sets and make our model work well only with them. Such a model doesn’t generalize to new data or capture the true underlying patterns. As a result, it will perform poorly on datasets that weren’t seen before. We call this situation high variance in machine learning. Let’s quickly recap variance: variance measures the spread of the prediction, which is the variability of the prediction. It can be calculated as follows:

$$\mathrm{Variance}[\hat{y}] = E[\hat{y}^2] - E[\hat{y}]^2$$

Here, ŷ is the prediction, and E[] is the expectation or expected value that represents the average value of a random variable, based on its probability distribution in statistics.
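As a small worked example, variance can be computed as the mean of the squared predictions minus the square of the mean prediction, with the expectation E[] approximated by a plain average. The two lists of predictions are made up; think of them as predictions for the same input from models trained on different samples.

```python
from statistics import mean


def variance(predictions):
    """Variance = E[y_hat^2] - E[y_hat]^2, with E[] taken as a plain average."""
    return mean(p ** 2 for p in predictions) - mean(predictions) ** 2


# Hypothetical predictions for the same input from differently trained models.
stable_predictions = [4.9, 5.0, 5.1, 5.0]
erratic_predictions = [2.0, 8.0, 3.5, 6.5]

low_variance = variance(stable_predictions)     # about 0.005
high_variance = variance(erratic_predictions)   # 5.625
```

Both sets of predictions average to 5, but the second set swings widely around that average, which is what high variance looks like.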

The following example demonstrates what a typical instance of overfitting looks like, where the regression curve tries to flawlessly accommodate all observed samples:

Figure 1.6: Example of overfitting

Overfitting occurs when we try to describe the learning rules based on too many parameters relative to the small number of observations, instead of the underlying relationship, such as the preceding potato, tomato, and banana example, where we deduced three parameters from only five learning samples. Overfitting also takes place when we make the model so excessively complex that it fits every training sample, such as memorizing the answers for all questions, as mentioned previously.
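The memorization scenario can be reproduced with a few lines of code. This is a sketch under simplifying assumptions: the true relation is taken to be y = 2x, the training targets carry alternating noise of +1 and -1, the unseen data is noise-free, and the "memorizer" is a nearest-sample lookup standing in for an overly complex model.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def memorize(xs, ys):
    """An overly flexible model: return the stored answer of the nearest training sample."""
    def predict(x):
        nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x))
        return ys[nearest]
    return predict


def mean_squared_error(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Training data: y = 2x with alternating noise of +1 and -1.
train_x = [0, 1, 2, 3, 4, 5]
train_y = [1, 1, 5, 5, 9, 9]
# Unseen data, assumed noise-free for simplicity.
test_x = [0.2, 1.2, 2.2, 3.2, 4.2]
test_y = [2 * x for x in test_x]

line = fit_line(train_x, train_y)
memorizer = memorize(train_x, train_y)

train_errors = (mean_squared_error(memorizer, train_x, train_y),
                mean_squared_error(line, train_x, train_y))
test_errors = (mean_squared_error(memorizer, test_x, test_y),
               mean_squared_error(line, test_x, test_y))
```

The memorizer scores a perfect zero error on the training samples because it reproduces the noise, yet on the unseen samples its error is much higher than that of the simple line.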

Underfitting

The opposite scenario is underfitting. When a model is underfit, it doesn’t perform well on the training sets and won’t do so on the testing sets, which means it fails to capture the underlying trend of the data. Underfitting may occur if we don’t use enough data to train the model, just like we will fail the exam if we don’t review enough material; this may also happen if we try to fit a wrong model to the data, just as we will score low in any exercises or exams if we take the wrong approach and learn in the wrong way. We describe any of these situations as high bias in machine learning, although its variance is low, as the performance in training and test sets is consistent, in a bad way. If you need a quick recap of bias, here it is: bias is the difference between the average prediction and the true value. It is computed as follows:

$$\mathrm{Bias}[\hat{y}] = E[\hat{y} - y]$$

Here, ŷ is the prediction and y is the ground truth.
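A quick numeric illustration of bias, using made-up numbers: the predictions below are consistent with each other (low variance) but consistently far from the true value, which is the signature of an underfit model. The expectation is approximated by a plain average.

```python
from statistics import mean


def bias(predictions, true_value):
    """Bias = E[y_hat - y], with E[] taken as a plain average over predictions."""
    return mean(p - true_value for p in predictions)


true_value = 5.0
# Hypothetical predictions from an underfit model trained on different samples.
underfit_predictions = [3.0, 3.1, 2.9, 3.0]

underfit_bias = bias(underfit_predictions, true_value)   # about -2.0
```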

The following example shows what typical underfitting looks like, where the regression curve doesn’t fit the data well enough or capture enough of the underlying pattern of the data:

Figure 1.7: Example of underfitting

Now, let’s look at what a well-fitting example should look like:

Figure 1.8: Example of desired fitting

The bias-variance trade-off

Obviously, we want to avoid both overfitting and underfitting. Recall that bias is the error stemming from incorrect assumptions in the learning algorithm; high bias results in underfitting. Variance measures how sensitive the model prediction is to variations in the datasets. Hence, we need to avoid cases where either bias or variance gets high. So, does it mean we should always make both bias and variance as low as possible? The answer is yes, if we can. But, in practice, there is an explicit trade-off between them, where decreasing one increases the other. This is the so-called bias-variance trade-off. Sounds abstract? Let’s look at the next example.

Let’s say we’re asked to build a model to predict the probability of a candidate being the next president of America based on phone poll data. The poll is conducted using zip codes. We randomly choose samples from one zip code, and we estimate there’s a 61% chance the candidate will win. However, it turns out they lost the election. Where did our model go wrong? The first thing we might think of is the small sample drawn from only one zip code. This is a source of high bias, as people in a geographic area tend to share similar demographics, although it results in a low variance of estimates. So can we fix it simply by using samples from a large number of zip codes? Yes, but don’t get happy too soon. This might cause an increased variance of estimates at the same time. We need to find the optimal sample size: the number of zip codes that achieves the lowest overall bias and variance.

Minimizing the total error of a model requires a careful balancing of bias and variance. Given a set of training samples, x₁, x₂, …, xₙ, and their targets, y₁, y₂, …, yₙ, we want to find a regression function ŷ(x) that estimates the true relation y(x) as correctly as possible. We measure the error of estimation, i.e., how good (or bad) the regression model is, in mean squared error (MSE):

$$\mathrm{MSE} = E\left[(y(x) - \hat{y}(x))^2\right]$$

The E denotes the expectation. This error can be decomposed into bias and variance components following the analytical derivation, as shown in the following formula (although it requires a bit of basic probability theory to understand):

$$\mathrm{MSE} = E\left[(y - \hat{y})^2\right] = \left(E[\hat{y} - y]\right)^2 + E\left[(\hat{y} - E[\hat{y}])^2\right] = \mathrm{Bias}[\hat{y}]^2 + \mathrm{Variance}[\hat{y}]$$

The term Bias measures the error of estimations, and the term Variance describes how much the estimation, ŷ, moves around its mean, E[ŷ]. The more complex the learning model ŷ(x) is, and the larger the training set is, the lower the bias will become. However, the model will also make more adjustments to better fit the additional data points. As a result, the variance will increase.
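
The decomposition can be checked numerically. The sketch below is only an illustration under our own assumptions: the true relation is taken to be sin(x), the query point x0, the noise level, and the polynomial degrees are arbitrary, and the ground truth at x0 is treated as a fixed, noise-free value. We refit a model on many fresh training sets and look at how its prediction at x0 scatters.

```python
import numpy as np

rng = np.random.default_rng(42)


def true_fn(x):
    return np.sin(x)  # assumed "true relation", chosen for illustration


x0 = 1.0          # query point where we inspect the prediction
y0 = true_fn(x0)  # noise-free ground truth at x0


def predictions_at_x0(degree, n_rounds=500, n_samples=30):
    """Refit a polynomial on fresh noisy training sets and predict at x0."""
    preds = np.empty(n_rounds)
    for i in range(n_rounds):
        x = rng.uniform(-3, 3, n_samples)
        y = true_fn(x) + rng.normal(scale=0.3, size=n_samples)
        preds[i] = np.polyval(np.polyfit(x, y, degree), x0)
    return preds


results = {}
for degree in (1, 5):
    preds = predictions_at_x0(degree)
    bias_sq = (preds.mean() - y0) ** 2    # squared bias
    variance = preds.var()                # spread around the mean prediction
    mse = np.mean((preds - y0) ** 2)      # total error against the truth
    results[degree] = (bias_sq, variance, mse)
    print(f"degree={degree}: bias^2={bias_sq:.4f}, "
          f"variance={variance:.4f}, mse={mse:.4f}")
```

For each degree, the printed MSE equals the sum of the squared bias and the variance, and the more flexible degree-5 model shows a much smaller bias than the straight line.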

We usually employ the cross-validation technique, as well as regularization and feature reduction, to find the optimal model balancing bias and variance and diminish overfitting. We will discuss these next.

You may ask why we only want to deal with overfitting: how about underfitting? This is because underfitting can be easily recognized: it occurs if a model doesn’t work well on a training set. When this occurs, we need to find a better model or tweak some parameters to better fit the data, which is a must under all circumstances. On the other hand, overfitting is hard to spot. Oftentimes, when we achieve a model that performs well on a training set, we are overly happy and think it is ready for production right away. This can be very dangerous. We should instead take extra steps to ensure that the great performance isn’t due to overfitting and that the great performance applies to data that excludes the training data.

Avoiding overfitting with cross-validation

You will see cross-validation in action multiple times later in this book. So don’t panic if you find this section difficult to understand, as you will become an expert on cross-validation very soon.

Recall that between practice questions and actual exams, there are mock exams where we can assess how well we will perform in actual exams and use that information to conduct the necessary revision. In machine learning, the validation procedure helps to evaluate how models will generalize to independent or unseen datasets in a simulated setting. In a conventional validation setting, the original data is partitioned into three subsets, usually 60% for the training set, 20% for the validation set, and the rest (20%) for the testing set. This setting suffices if we have enough training samples after partitioning and we only need a rough estimate of simulated performance. Otherwise, cross-validation is preferable. Cross-validation helps to reduce variability and, therefore, limit overfitting.
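
As a minimal sketch of the conventional 60/20/20 partition, we can call scikit-learn’s train_test_split twice. The toy arrays and the random seed are placeholders of our own.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)  # 100 toy samples with 2 features
y = np.arange(100) % 2

# First set aside 20% for testing, then take 25% of the remaining 80%
# (that is, 20% of the original data) for validation.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```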

In one round of cross-validation, the original data is divided into two subsets, for training and testing (or validation), respectively. The testing performance is recorded. Similarly, multiple rounds of cross-validation are performed under different partitions. Testing results from all rounds are finally averaged to generate a more reliable estimate of model prediction performance.

When the training size is very large, it’s often sufficient to split it into training, validation, and testing (three subsets) and conduct a performance check on the latter two. Cross-validation is less preferable in this case, since it’s computationally costly to train a model for each single round. But if you can afford it, there’s no reason not to use cross-validation. When the size isn’t so large, cross-validation is definitely a good choice.

There are mainly two cross-validation schemes in use: exhaustive and non-exhaustive. In the exhaustive scheme, we leave out a fixed number of observations in each round as testing (or validation) samples and use the remaining observations as training samples. This process is repeated until all possible different subsets of samples are used for testing once. For instance, we can apply Leave-One-Out-Cross-Validation (LOOCV), which lets each sample be in the testing set once. For a dataset of the size n, LOOCV requires n rounds of cross-validation. This can be slow when n gets large. The following diagram presents the workflow of LOOCV:

Figure 1.9: Workflow of leave-one-out-cross-validation
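
The LOOCV workflow can be sketched with scikit-learn’s LeaveOneOut splitter. The linear toy data and the choice of a linear regression model are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(20, 1))
y = 3 * X[:, 0] + rng.normal(scale=1.0, size=20)

# Each of the 20 samples is held out once while the other 19 train the model.
scores = cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(),
                         scoring="neg_mean_squared_error")
n_rounds = len(scores)
loocv_mse = -scores.mean()  # average the per-round errors
print(f"{n_rounds} rounds, LOOCV MSE = {loocv_mse:.3f}")
```

With n = 20 samples, the model is trained 20 times, which hints at why LOOCV becomes slow for large datasets.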

A non-exhaustive scheme, on the other hand, as the name implies, doesn’t try out all possible partitions. The most widely used type of this scheme is k-fold cross-validation. First, we randomly split the original data into k equal-sized folds. In each trial, one of these folds becomes the testing set, and the rest of the data becomes the training set.

We repeat this process k times, with each fold being the designated testing set once. Finally, we average the k sets of test results for the purpose of evaluation. Common values for k are 3, 5, and 10. The following table illustrates the setup for five-fold cross-validation:

Round	Fold 1	Fold 2	Fold 3	Fold 4	Fold 5
1	Testing	Training	Training	Training	Training
2	Training	Testing	Training	Training	Training
3	Training	Training	Testing	Training	Training
4	Training	Training	Training	Testing	Training
5	Training	Training	Training	Training	Testing

Table 1.1: Setup for five-fold cross-validation
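
The five-fold setup in the preceding table can be reproduced with scikit-learn’s KFold. This is a small sketch on ten placeholder samples; the seed is arbitrary.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(10, 1)  # ten placeholder samples
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

test_counts = np.zeros(len(X), dtype=int)
for round_no, (train_idx, test_idx) in enumerate(kfold.split(X), start=1):
    test_counts[test_idx] += 1
    print(f"Round {round_no}: test={sorted(test_idx.tolist())}, "
          f"train size={len(train_idx)}")
```

After the five rounds, every sample has been in the testing fold exactly once.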

K-fold cross-validation often has a lower variance compared to LOOCV, since we’re using a chunk of samples instead of a single one for validation.

We can also randomly split the data into training and testing sets numerous times. This is formally called the holdout method; when the split is repeated many times, it is also known as repeated random subsampling. The problem with this approach is that some samples may never end up in the testing set, while others may be selected for it multiple times.
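
To see how repeated random splits behave, we can count how often each sample lands in the testing set. The sketch uses scikit-learn’s ShuffleSplit as one way to draw such splits; the sample count and seed are arbitrary.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

n_samples = 20
splitter = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

times_tested = np.zeros(n_samples, dtype=int)
for train_idx, test_idx in splitter.split(np.zeros((n_samples, 1))):
    times_tested[test_idx] += 1

print("never tested:", int((times_tested == 0).sum()))
print("tested more than once:", int((times_tested > 1).sum()))
```

Unlike k-fold, nothing forces the five test sets to be disjoint, so the two counts are usually not zero.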

Last but not least, nested cross-validation is a combination of cross-validations. It consists of the following two phases:

Inner cross-validation: This phase is conducted to find the best fit and can be implemented as a k-fold cross-validation
Outer cross-validation: This phase is used for performance evaluation and statistical analysis
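
The two phases can be sketched by wrapping a grid search (the inner loop) inside cross_val_score (the outer loop). The synthetic dataset, the logistic regression model, and the candidate values of C are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = make_classification(n_samples=200, n_features=10, random_state=42)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=1)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=2)

# Inner cross-validation: find the best-fitting hyperparameter C.
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      {"C": [0.01, 0.1, 1, 10]}, cv=inner_cv)

# Outer cross-validation: evaluate the whole tuning procedure on held-out folds.
outer_scores = cross_val_score(search, X, y, cv=outer_cv)
print(outer_scores.mean())
```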

We will apply cross-validation very intensively throughout this entire book. Before that, let’s look at an analogy that will help us to better understand cross-validation.

A data scientist plans to take his car to work, and his goal is to arrive before 9 a.m. every day. He needs to decide the departure time and the route to take. He tries out different combinations of these two parameters on certain Mondays, Tuesdays, and Wednesdays and records the arrival time for each trial. He then figures out the best schedule and applies it every day. However, it doesn’t work quite as well as expected.

It turns out the scheduling model is overfitted to the data points gathered in the first three days and may not work well on Thursdays and Fridays. A better solution would be to test the best combination of parameters derived from Mondays to Wednesdays on Thursdays and Fridays and similarly repeat this process, based on different sets of learning days and testing days of the week. In this analogy, cross-validation ensures that the selected schedule works for the whole week.

In summary, cross-validation derives a more accurate assessment of model performance by combining measures of prediction performance on different subsets of data. This technique not only reduces variance and avoids overfitting but also gives an insight into how a model will generally perform in practice.

Avoiding overfitting with regularization

Another way of preventing overfitting is regularization. Recall that the unnecessary complexity of a model is a source of overfitting. Regularization adds an extra penalty term to the error function we’re trying to minimize, in order to penalize complex models.

According to the principle of Occam’s razor, simpler methods are to be favored. William of Ockham was a monk and philosopher who, around the year 1320, came up with the idea that the simplest hypothesis that fits data should be preferred. One justification for this is that we can invent fewer simple models than complex models. For instance, intuitively, we know that there are more high-order polynomial models than linear ones. The reason is that a line (y = ax + b) is governed by only two parameters: the intercept, b, and the slope, a. The possible coefficients for a line span a two-dimensional space. A quadratic polynomial adds an extra coefficient for the quadratic term, and we can span a three-dimensional space with the coefficients. Therefore, it is much easier to find a model that perfectly captures all training data points with a high-order polynomial function, as its search space is much larger than that of a linear function. However, these easily obtained models are more prone to overfitting and generalize worse than linear models. Also, of course, simpler models require less computation time. The following diagram displays how we try to fit a linear function and a high-order polynomial function, respectively, to the data:

Figure 1.10: Fitting data with a linear function and a polynomial function

The linear model is preferable, as it may generalize better to more data points drawn from the underlying distribution. We can use regularization to reduce the influence of the high orders of a polynomial by imposing penalties on them. This will discourage complexity, even though a less accurate and less strict rule is learned from the training data.
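
As a rough sketch of how a penalty tames the high-order terms, the following code fits a degree-9 polynomial to data that is actually linear, once without regularization and once with ridge (L2) regularization. Ridge is just one concrete penalty that we swap in for illustration; the data, the degree, and the value alpha=1.0 are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 15).reshape(-1, 1)
y = 2 * x[:, 0] + rng.normal(scale=0.2, size=15)  # truly linear data plus noise

# Expand x into the features x, x^2, ..., x^9.
X_poly = PolynomialFeatures(degree=9, include_bias=False).fit_transform(x)

plain = LinearRegression().fit(X_poly, y)  # no penalty
ridge = Ridge(alpha=1.0).fit(X_poly, y)    # penalizes large coefficients

plain_norm = np.linalg.norm(plain.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
print(f"coefficient norm without penalty: {plain_norm:.2f}")
print(f"coefficient norm with ridge:      {ridge_norm:.2f}")
```

The penalized model ends up with much smaller coefficients, so the high-order terms have less room to chase the noise.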

We will employ regularization quite often in this book, starting from Chapter 4, Predicting Online Ad Click-Through with Logistic Regression. For now, let’s look at an analogy that can help you better understand regularization.

A data scientist wants to equip his robotic guard dog with the ability to identify strangers and his friends. He feeds it with the following learning samples:

Gender	Age	Height	Glasses	Clothing	Label
Male	Young	Tall	With glasses	In grey	Friend
Female	Middle	Average	Without glasses	In black	Stranger
Male	Young	Short	With glasses	In white	Friend
Male	Senior	Short	Without glasses	In black	Stranger
Female	Young	Average	With glasses	In white	Friend
Male	Young	Short	Without glasses	In red	Friend

Table 1.2: Training samples for the robotic guard dog

The robot may quickly learn the following rules:

Any middle-aged female of average height without glasses and dressed in black is a stranger
Any senior short male without glasses and dressed in black is a stranger
Anyone else is his friend

Although these perfectly fit the training data, they seem too complicated and unlikely to generalize well to new visitors. To counter this, the data scientist limits what the robot is allowed to learn. A looser rule that can work well for hundreds of other visitors could be as follows: anyone without glasses dressed in black is a stranger.

Besides penalizing complexity, we can also stop a training procedure early as a technique to prevent overfitting. If we limit the time a model spends learning or set some internal stopping criteria, it’s more likely to produce a simpler model. The model complexity will be controlled in this way; hence, overfitting becomes less probable. This approach is called early stopping in machine learning.
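
Here is a minimal sketch of early stopping, written as plain gradient descent on a polynomial regression. Everything specific in it is an assumption for illustration: the synthetic data, the learning rate of 0.1, and the stopping criterion (stop once the validation loss has not improved for 20 consecutive epochs).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 60)
y = np.sin(3 * x) + rng.normal(scale=0.3, size=60)

X = np.vander(x, 10, increasing=True)  # features 1, x, ..., x^9
X_train, y_train = X[:40], y[:40]
X_val, y_val = X[40:], y[40:]

w = np.zeros(X.shape[1])
best_w, best_val, best_epoch = w.copy(), np.inf, 0
patience, waited = 20, 0

for epoch in range(5000):
    # One gradient descent step on the training MSE.
    grad = 2 * X_train.T @ (X_train @ w - y_train) / len(y_train)
    w -= 0.1 * grad

    # Monitor the loss on data the model is not trained on.
    val_loss = np.mean((X_val @ w - y_val) ** 2)
    if val_loss < best_val:
        best_w, best_val, best_epoch, waited = w.copy(), val_loss, epoch, 0
    else:
        waited += 1
        if waited >= patience:  # no improvement for a while: stop early
            break

print(f"best validation MSE {best_val:.3f} at epoch {best_epoch}")
```

We keep the weights from the best validation epoch (best_w) rather than the last ones, so any extra fitting to the training set after that point is discarded.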

Last but not least, it’s worth noting that regularization should be kept at a moderate level or, to be more precise, fine-tuned to an optimal level. Too small a regularization doesn’t make any impact; too large a regularization will result in underfitting, as it moves the model away from the ground truth. We will explore how to achieve optimal regularization in Chapter 4, Predicting Online Ad Click-Through with Logistic Regression, Chapter 5, Predicting Stock Prices with Regression Algorithms, and Chapter 6, Predicting Stock Prices with Artificial Neural Networks.

Avoiding overfitting with feature selection and dimensionality reduction

We typically represent data as a grid of numbers (a matrix). Each column represents a variable, which we call a feature in machine learning. In supervised learning, one of the variables is actually not a feature but the label that we’re trying to predict. And in supervised learning, each row is an example that we can use for training or testing.

The number of features corresponds to the dimensionality of the data. Our machine learning approach depends on the number of dimensions versus the number of examples. For instance, text and image data are very high dimensional, while sensor data (such as temperature, pressure, or GPS) has relatively fewer dimensions.

Fitting high-dimensional data is computationally expensive and prone to overfitting, due to the high complexity. Higher dimensions are also impossible to visualize, and therefore, we can’t use simple diagnostic methods.

Not all of the features are useful, and they may only add randomness to our results. Therefore, it’s often important to do good feature selection. Feature selection is the process of picking a subset of significant features for use in better model construction. In practice, not every feature in a dataset carries information useful for discriminating samples; some features are either redundant or irrelevant and, hence, can be discarded with little loss.

In principle, feature selection boils down to multiple binary decisions about whether to include a feature. For n features, we get 2ⁿ feature sets, which can be a very large number for a large number of features. For example, for 10 features, we have 1,024 possible feature sets (for instance, if we’re deciding what clothes to wear, the features can be temperature, rain, the weather forecast, and where we’re going). Basically, we have two options: we either start with all of the features and remove features iteratively, or we start with a minimum set of features and add features iteratively. We then take the best feature sets for each iteration and compare them. At a certain point, brute-force evaluation becomes infeasible. Hence, more advanced feature selection algorithms were invented to distill the most useful features/signals. We will discuss in detail how to perform feature selection in Chapter 4, Predicting Online Ad Click-Through with Logistic Regression.
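
The second option, starting from a minimum set and adding features iteratively, can be sketched as a greedy forward selection. The synthetic data (where only features 0 and 3 carry signal), the least-squares model, and the helper name val_mse are our own assumptions; the algorithms used later in the book may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_features = 200, 6
X = rng.normal(size=(n, n_features))
# Only features 0 and 3 influence the target; the rest are noise.
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(scale=0.5, size=n)
X_train, X_val, y_train, y_val = X[:150], X[150:], y[:150], y[150:]

print("candidate feature sets:", 2 ** n_features)  # 64, too many to enjoy


def val_mse(cols):
    """Fit least squares on the chosen columns; return the validation MSE."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train[:, cols]])
    A_val = np.column_stack([np.ones(len(X_val)), X_val[:, cols]])
    w, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return np.mean((A_val @ w - y_val) ** 2)


selected, remaining = [], list(range(n_features))
best_score = np.inf
while remaining:
    # Try adding each remaining feature and keep the most helpful one.
    score, col = min((val_mse(selected + [c]), c) for c in remaining)
    if score >= best_score:  # no candidate improves the validation error
        break
    selected.append(col)
    remaining.remove(col)
    best_score = score

print("selected features in order:", selected)
```

The greedy search evaluates far fewer feature sets than brute force, at the price of possibly missing the best combination.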

Another common approach to reducing dimensionality is to transform high-dimensional data into lower-dimensional space. This is known as dimensionality reduction or feature projection. We will get into this in detail in Chapter 7, Mining the 20 Newsgroups Dataset with Text Analysis Techniques, where we will encode text data into two dimensions, and Chapter 9, Recognizing Faces with Support Vector Machine, where we will talk about projecting high-dimensional image data into low-dimensional space.
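
As a small preview, the sketch below projects 20-dimensional data into two dimensions with principal component analysis (PCA). PCA is only one of several projection techniques and is our choice for this example, and the synthetic data is built so that two hidden factors drive all 20 features.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 2))   # two hidden factors
mixing = rng.normal(size=(2, 20))
X = latent @ mixing + rng.normal(scale=0.05, size=(100, 20))  # 20 features

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)          # 100 samples, now 2 features each
explained = pca.explained_variance_ratio_.sum()
print(X_2d.shape, f"variance retained: {explained:.3f}")
```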

In this section, we talked about how the goal of machine learning is to find the optimal generalization to the data, and how to avoid ill-generalization. In the next two sections, we will explore tricks to get closer to the goal throughout individual phases of machine learning, including data preprocessing and feature engineering in the next section, and modeling in the section after that.

Data preprocessing and feature engineering

Data preprocessing and feature engineering play a crucial and foundational role in machine learning. It’s like laying the groundwork for a building – the stronger and better prepared the foundation, the better the final structure (machine learning model) will be. Here is a breakdown of their relationship:

Preprocessing prepares data for efficient learning: Raw data from various sources often contains inconsistencies, errors, and irrelevant information. Preprocessing cleans, organizes, and transforms the data into a format suitable for the chosen machine learning algorithm. This allows the algorithm to understand the data more easily and efficiently, leading to better model performance.
Preprocessing helps improve model accuracy and generalizability: By handling missing values, outliers, and inconsistencies, preprocessing reduces noise in data. This enables a model to focus on the true patterns and relationships within the data, leading to more accurate predictions and better generalization on unseen data.
Feature engineering provides meaningful input variables: Raw data is transformed and manipulated to create new features or select relevant ones. New features potentially improve model performance and insight generation.

Overall, data preprocessing and feature engineering are essential steps in the machine learning workflow. By dedicating time and effort to proper preprocessing and feature engineering, you lay the foundation to build reliable, accurate, and generalizable machine learning models. We will cover the preprocessing phase first in this section.

Preprocessing and exploration

When we learn, we require high-quality learning material. We can’t learn from gibberish, so we automatically ignore anything that doesn’t make sense. A machine learning system isn’t able to recognize gibberish, so we need to help it by cleaning the input data. It’s often claimed that cleaning the data forms a large part of machine learning. Sometimes, the cleaning is already done for us, but you shouldn’t count on it.

To decide how to clean data, we need to be familiar with it. There are some projects that try to automatically explore the data and do something intelligent, such as producing a report. For now, unfortunately, we don’t have a solid solution in general, so you need to do some work.

We can do two things, which aren’t mutually exclusive: first, scan the data, and second, visualize the data. This also depends on the type of data we’re dealing with—whether we have a grid of numbers, images, audio, text, or something else.

Ultimately, a grid of numbers is the most convenient form, and we will always work toward having numerical features. Let’s pretend that we have a table of numbers in the rest of this section.

We want to know whether features have missing values, how the values are distributed, and what type of features we have. Values can approximately follow a normal distribution, a binomial distribution, a Poisson distribution, or another distribution altogether. Features can be binary: either yes or no, positive or negative, and so on. They can also be categorical: pertaining to a category, such as continents (Africa, Asia, Europe, South America, North America, and so on). Categorical variables can also be ordered, for instance, high, medium, and low. Features can also be quantitative, for example, the temperature in degrees or the price in dollars. Now, let’s dive into how we can cope with each of these situations.

Dealing with missing values

Quite often, we miss values for certain features. This could happen for various reasons. It can be inconvenient, expensive, or even impossible to always have a value. Maybe we weren’t able to measure a certain quantity in the past because we didn’t have the right equipment or just didn’t know that the feature was relevant. However, we’re stuck with missing values from the past.

Sometimes, it’s easy to figure out that we’re missing values, and we can discover this just by scanning the data or counting the number of values we have for a feature and comparing this figure with the number of values we expect, based on the number of rows. Certain systems encode missing values with, for example, values such as 999,999 or -1. This makes sense if the valid values are much smaller than 999,999. If you’re lucky, you’ll have information about the features provided by whoever created the data in the form of a data dictionary or metadata.

Once we know that we’re missing values, the question arises of how to deal with them. The simplest answer is to just ignore them. However, some algorithms can’t deal with missing values, and the program will just refuse to continue. In other circumstances, ignoring missing values will lead to inaccurate results. The second solution is to substitute missing values with a fixed value—this is called imputing. We can impute the arithmetic mean, median, or mode of the valid values of a certain feature. Ideally, we will have some prior knowledge of a variable that is somewhat reliable. For instance, we may know the seasonal averages of temperature for a certain location and be able to impute guesses for missing temperature values, given a date. We will talk about dealing with missing data in detail in Chapter 10, Machine Learning Best Practices. Similarly, techniques in the following sections will be discussed and employed in later chapters, just in case you feel uncertain about how they can be used.
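
The following sketch shows mean and median imputation with NumPy. The temperature readings are invented, and we assume that 999999 is the code the source system uses for a missing value.

```python
import numpy as np

# Hypothetical temperature readings; 999999 marks a missing value.
temps = np.array([21.5, 999999, 19.0, 23.5, 999999, 20.0])

# Turn the sentinel into NaN so it cannot distort any statistics.
values = np.where(temps == 999999, np.nan, temps)
n_missing = int(np.isnan(values).sum())

# Impute with the mean or the median of the valid values.
mean_imputed = np.where(np.isnan(values), np.nanmean(values), values)
median_imputed = np.where(np.isnan(values), np.nanmedian(values), values)

print(n_missing)       # 2
print(mean_imputed)    # missing entries become 21.0
print(median_imputed)  # missing entries become 20.75
```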

Label encoding

Humans are able to deal with various types of values. Machine learning algorithms (with some exceptions) require numerical values. If we offer a string such as Ivan, unless we’re using specialized software, the program won’t know what to do. In this example, we’re dealing with a categorical feature (names, probably). We can consider each unique value to be a label. (In this particular example, we also need to decide what to do with the case: is Ivan the same as ivan?) We can then replace each label with an integer; this is label encoding.

The following example shows how label encoding works:

Label	Encoded Label
Africa	1
Asia	2
Europe	3
South America	4
North America	5
Other	6

Table 1.3: Example of label encoding

This approach can be problematic in some cases because the learner may conclude that there is an order (unless an order is expected, for example, bad=0, ok=1, good=2, and excellent=3). In the preceding mapping table, Asia and North America differ by 3 after encoding, which is a bit counterintuitive, as it’s hard to quantify the difference between continents. One-hot encoding in the next section takes an alternative approach.
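
The mapping in Table 1.3 can be written as a plain dictionary. The sample column of continents is made up for illustration.

```python
labels = ["Africa", "Asia", "Europe", "South America", "North America", "Other"]
label_to_int = {label: i for i, label in enumerate(labels, start=1)}

continents = ["Asia", "Europe", "Africa", "Other", "Asia", "North America"]
encoded = [label_to_int[c] for c in continents]
print(encoded)  # [2, 3, 1, 6, 2, 5]
```

Note that library encoders may assign the integers differently (for example, alphabetically and starting from zero), so it is worth checking the mapping they produce.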

One-hot encoding

The one-of-K, or one-hot encoding, scheme uses dummy variables to encode categorical features. Originally, it was applied to digital circuits. The dummy variables have binary values such as bits, so they take the values zero or one (equivalent to true or false). For instance, if we want to encode continents, we will have dummy variables, such as is_asia, which will be true if the continent is Asia and false otherwise. In general, we need as many dummy variables as there are unique values minus one (or sometimes the exact number of unique values). We can determine one of the labels automatically from the dummy variables because they are exclusive.

If the dummy variables all have a false value, then the correct label is the label for which we don’t have a dummy variable. The following table illustrates the encoding for continents:

Continent	is_africa	is_asia	is_europe	is_sam	is_nam
Africa	1	0	0	0	0
Asia	0	1	0	0	0
Europe	0	0	1	0	0
South America	0	0	0	1	0
North America	0	0	0	0	1
Other	0	0	0	0	0

Table 1.4: Example of one-hot encoding

The encoding produces a matrix (grid of numbers) with lots of zeros (false values) and occasional ones (true values). This type of matrix is called a sparse matrix. The sparse matrix representation is handled well by the scipy package, which we will discuss later in this chapter.
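
The encoding in Table 1.4 can be sketched by hand and then stored as a SciPy sparse matrix. The sample column is invented, and the list name dummy_for is our own.

```python
import numpy as np
from scipy import sparse

# One dummy variable per continent; "Other" is the all-zeros row.
dummy_for = ["Africa", "Asia", "Europe", "South America", "North America"]
continents = ["Asia", "Other", "Africa", "North America"]

dense = np.array([[int(c == d) for d in dummy_for] for c in continents])
print(dense)

# A sparse matrix stores only the non-zero entries.
matrix = sparse.csr_matrix(dense)
print(matrix.nnz, "non-zero values out of", dense.size)
```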

Dense embedding

While one-hot encoding is a simple and sparse representation of categorical features, dense embedding provides a compact, continuous representation that captures semantic relationships based on the co-occurrence patterns in data. For example, using dense embedding, the continent categories might be represented by 3-dimensional continuous vectors like:

Africa: [0.9, -0.2, 0.5]
Asia: [-0.1, 0.8, 0.6]
Europe: [0.6, 0.3, -0.7]
South America: [0.5, 0.2, 0.1]
North America: [0.4, 0.3, 0.2]
Other: [-0.8, -0.5, 0.4]

In this example, you may notice that the vectors of South America and North America are closer together than those of Africa and Asia. Dense embedding can capture such similarities between categories. With different training data, the vectors of Europe and North America might instead end up close together, reflecting cultural similarity.
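
We can verify the closeness claim by computing Euclidean distances between the example vectors. Euclidean distance is our choice of measure here; cosine similarity is another common option.

```python
import numpy as np

embeddings = {
    "Africa": [0.9, -0.2, 0.5],
    "Asia": [-0.1, 0.8, 0.6],
    "Europe": [0.6, 0.3, -0.7],
    "South America": [0.5, 0.2, 0.1],
    "North America": [0.4, 0.3, 0.2],
    "Other": [-0.8, -0.5, 0.4],
}


def distance(a, b):
    """Euclidean distance between the embedding vectors of two categories."""
    return float(np.linalg.norm(np.array(embeddings[a]) - np.array(embeddings[b])))


americas = distance("South America", "North America")
africa_asia = distance("Africa", "Asia")
print(f"South America vs North America: {americas:.3f}")  # 0.173
print(f"Africa vs Asia: {africa_asia:.3f}")                # 1.418
```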

We will explore dense embedding further in Chapter 7, Mining the 20 Newsgroups Dataset with Text Analysis Techniques.

Scaling

Values of different features can differ by orders of magnitude. Sometimes, this can mean that the larger values dominate the smaller values. This depends on the algorithm we use. For certain algorithms to work properly, we’re required to scale data.

There are several common strategies that we can apply:

Standardization subtracts the mean of a feature and divides the result by the standard deviation. If the feature values are normally distributed, we will get a Gaussian, which is centered around zero with a variance of one.
If the feature values aren’t normally distributed, we can subtract the median and divide by the interquartile range. The interquartile range is the range between the first and third quartiles (or the 25th and 75th percentiles).
Min-max scaling rescales a feature to a fixed range; a range between zero and one is a common choice.
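
The three strategies above can be sketched in a few lines of NumPy. The feature values, including the outlier of 100, are made up for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # one feature with an outlier

# Standardization: zero mean and unit standard deviation.
standardized = (x - x.mean()) / x.std()

# Robust scaling: subtract the median, divide by the interquartile range.
q1, median, q3 = np.percentile(x, [25, 50, 75])
robust = (x - median) / (q3 - q1)

# Min-max scaling to the range between zero and one.
min_max = (x - x.min()) / (x.max() - x.min())

print(standardized)
print(robust)   # [-1.  -0.5  0.   0.5 48.5]
print(min_max)
```

With robust scaling, the four ordinary values stay evenly spread between -1 and 0.5, whereas min-max scaling squeezes them all below 0.04 because of the outlier.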

We will use scaling in many projects throughout the book.

An advanced version of data preprocessing is usually called feature engineering. We will cover that next.

Feature engineering

Feature engineering is the process of creating or improving features. Features are often created based on common sense, domain knowledge, or prior experience. There are certain common techniques for feature creation; however, there is no guarantee that creating new features will improve your results. We are sometimes able to use the clusters found by unsupervised learning as extra features. Deep neural networks are often able to derive features automatically.

We will briefly look at some feature engineering techniques: polynomial transformation and binning.

Polynomial transformation

If we have two features, a and b, we can suspect that there is a polynomial relationship, such as a² + ab + b². We can then add a new feature that captures an interaction between a and b, such as the product ab. An interaction doesn’t have to be a product (although this is the most common choice); it can also be a sum, a difference, or a ratio. If we use a ratio, we should add a small constant to the divisor and dividend to avoid dividing by zero.
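
A minimal sketch of these derived features with NumPy follows. The values of a and b and the size of the small constant eps are arbitrary assumptions.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 2.0])

# The three terms of a^2 + ab + b^2 as separate features.
poly_features = np.column_stack([a ** 2, a * b, b ** 2])
print(poly_features)

# A ratio interaction; eps keeps the division defined when b is zero.
eps = 1e-8
ratio = (a + eps) / (b + eps)
print(ratio)
```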

The number of features and the order of the polynomial for a polynomial relationship aren’t limited. However, if we follow the Occam’s razor principle, we should avoid higher-order polynomials and interactions of many features. In practice, complex polynomial relations tend to be more difficult to compute and tend to overfit, but if you really need better results, they may be worth considering. We will see polynomial transformation in action in the Best practice 12 – performing feature engineering without domain expertise section of Chapter 10, Machine Learning Best Practices.

Binning

Sometimes, it’s useful to separate feature values into several bins. For example, we may only be interested in whether it rained on a particular day. Given the precipitation values, we can binarize the values so that we get a true value if the precipitation value isn’t zero, and a false value otherwise. We can also use statistics to divide values into high, low, and medium bins. In marketing, we often care more about the age group, such as 18 to 24, than a specific age, such as 23.

The binning process inevitably leads to a loss of information. However, depending on your goals, this may not be an issue, and it may actually reduce the chance of overfitting. Binning also tends to improve speed and reduce memory or storage requirements and redundancy.
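
Both kinds of binning mentioned above can be sketched with NumPy. The precipitation values, the ages, and the age-group boundaries other than 18 to 24 are invented for illustration.

```python
import numpy as np

# Binarize: did it rain at all on a given day?
precipitation_mm = np.array([0.0, 2.5, 0.0, 12.0])
rained = precipitation_mm > 0
print(rained)  # [False  True False  True]

# Age groups: under 18, 18 to 24, 25 to 34, 35 to 44, 45 to 54, 55 to 64, 65+.
ages = np.array([17, 18, 23, 24, 25, 40, 67])
edges = [18, 25, 35, 45, 55, 65]    # hypothetical lower bounds of the groups
age_bin = np.digitize(ages, edges)  # index of the group each age falls into
print(age_bin)  # [0 1 1 1 2 3 6]
```

Ages 18, 23, and 24 all receive the same bin index, so the model no longer distinguishes between them.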

Any real-world machine learning system should have two modules: a data preprocessing module, which we just covered in this section, and a modeling module, which will be covered next.

Combining models

A model takes in data (usually preprocessed) and produces predictive results. What if we employ multiple models? Will we make better decisions by combining predictions from individual models? We will talk about this in this section.

Let’s start with an analogy. In high school, we sit together with other students and learn together, but we aren’t supposed to work together during the exam. The reason is, of course, that teachers want to know what we’ve learned, and if we just copy exam answers from friends, we may not have learned anything. Later in life, we discover that teamwork is important. For example, this book is the product of a whole team, or possibly a group of teams.

Clearly, a team can produce better results than a single person. However, this goes against Occam’s razor, since a single person can come up with simpler theories compared to what a team will produce. In machine learning, we nevertheless prefer to have our models cooperate with the following model combination schemes:

Voting and averaging
Bagging
Boosting
Stacking

Let’s dive into each of them now.

Voting and averaging

This is probably the most understandable type of model aggregation. It just means the final output will be the majority or average of prediction output values from multiple models. It is also possible to assign different weights to individual models in the ensemble; for example, some models that are more reliable might be given two votes.

Nonetheless, combining the results of models that are highly correlated to each other doesn’t guarantee a spectacular improvement. It is better to somehow diversify the models by using different features or different algorithms. If you find two models are strongly correlated, you may, for example, decide to remove one of them from the ensemble and increase proportionally the weight of the other model.
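
Majority voting and weighted averaging take only a few lines of NumPy. The model outputs below are invented, and giving the first regressor two votes is just an example of weighting.

```python
import numpy as np

# Hypothetical class predictions of three classifiers for four samples.
class_votes = np.array([
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
])
majority = (class_votes.sum(axis=0) > class_votes.shape[0] / 2).astype(int)
print(majority)  # [1 0 1 1]

# Hypothetical outputs of three regressors for the same four samples.
values = np.array([
    [10.0, 20.0, 30.0, 40.0],
    [12.0, 18.0, 33.0, 44.0],
    [11.0, 23.0, 27.0, 38.0],
])
simple_average = values.mean(axis=0)
# The first model is considered more reliable and counts twice.
weighted_average = np.average(values, axis=0, weights=[2, 1, 1])
print(weighted_average)  # [10.75 20.25 30.   40.5 ]
```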

Bagging

Bootstrap aggregating, or bagging, is an algorithm introduced by Leo Breiman, a distinguished statistician at the University of California, Berkeley, in 1994, which applies bootstrapping to machine learning problems. Bootstrapping is a statistical procedure that creates multiple datasets from an existing one by sampling data with replacement. Bootstrapping can be used to measure the properties of a model, such as bias and variance.

In general, a bagging algorithm follows these steps:

We generate new training sets from input training data by sampling with replacement.
For each generated training set, we fit a new model.
We combine the results of the models by averaging or majority voting.
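
The three steps above can be sketched by hand. The synthetic dataset, the use of decision trees as the base model, and the number of models (25) are assumptions for illustration; scikit-learn also offers ready-made bagging estimators.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, y_train, X_test, y_test = X[:200], y[:200], X[200:], y[200:]

rng = np.random.default_rng(0)
models = []
for _ in range(25):
    # Step 1: draw a bootstrap sample (same size, with replacement).
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Step 2: fit a new model on that sample.
    tree = DecisionTreeClassifier(random_state=0)
    models.append(tree.fit(X_train[idx], y_train[idx]))

# Step 3: combine the models by majority voting (25 voters, so no ties).
all_preds = np.array([m.predict(X_test) for m in models])
bagged_pred = (all_preds.mean(axis=0) > 0.5).astype(int)
accuracy = (bagged_pred == y_test).mean()
print(f"bagged accuracy: {accuracy:.2f}")
```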

The following diagram illustrates the steps for bagging, using classification as an example (the circles and crosses represent samples from two classes):

Figure 1.11: Workflow of bagging for classification

As you can imagine, bagging can reduce the chance of overfitting.

We will study bagging in depth in Chapter 3, Predicting Online Ad Click-Through with Tree-Based Algorithms.

Boosting

In the context of supervised learning, we define weak learners as learners who are just a little better than a baseline, such as randomly assigning classes or average values. Much like ants, weak learners are weak individually, but together, they have the power to do amazing things.

It makes sense to take into account the strength of each individual learner using weights. This general idea is called boosting. In boosting, all models are trained in sequence, instead of in parallel as in bagging. Each model is trained on the same dataset, but each data sample carries a different weight that reflects how well the previous model handled it. After a model is trained, the weights are reassigned and then used in the next training round. In general, weights for mispredicted samples are increased to stress their prediction difficulty.
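To make the reweighting idea concrete, here is a minimal sketch of one round using the AdaBoost-style update, which is only one of several possible weighting schemes. The function name reweight and the five-sample example are assumptions made for illustration.

```python
import math

def reweight(weights, correct):
    """Run one AdaBoost-style reweighting round.

    weights: current sample weights, assumed to sum to 1.
    correct: booleans, True where the current model predicted correctly.
    Returns (alpha, new_weights), where alpha is the model's vote strength.
    """
    # Weighted error rate of the current model.
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    # Assumes 0 < error < 1; a real implementation must guard the edge cases.
    alpha = 0.5 * math.log((1 - error) / error)
    # Shrink weights of correct samples, grow weights of mispredicted ones.
    raw = [w * math.exp(-alpha if ok else alpha)
           for w, ok in zip(weights, correct)]
    total = sum(raw)
    return alpha, [w / total for w in raw]

weights = [0.2] * 5                           # start with uniform weights
correct = [True, True, True, True, False]     # the model missed sample 5
alpha, new_weights = reweight(weights, correct)
print(round(alpha, 3))                        # 0.693
print([round(w, 3) for w in new_weights])     # [0.125, 0.125, 0.125, 0.125, 0.5]
```

After the update, the single mispredicted sample holds half of the total weight, so the next model in the sequence is pushed to get it right.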

The following diagram illustrates the steps for boosting, again using classification as an example (the circles and crosses represent samples from two classes, and the size of a circle or cross indicates the weight assigned to it):

Figure 1.12: Workflow of boosting for classification

There are many boosting algorithms; boosting algorithms differ mostly in their weighting scheme. If you’ve studied for an exam, you may have applied a similar technique by identifying the type of practice questions you had trouble with and focusing on the hard problems.

Viola-Jones, a popular face detection framework, leverages the boosting algorithm to efficiently identify faces in images. Detecting faces in images or videos is supervised learning. We give the learner examples of regions containing faces. There’s an imbalance, since we usually have far more regions that don’t have faces than those that do (about 10,000 times more).

A cascade of classifiers progressively filters out these negative image areas stage by stage. In each progressive stage, the classifiers use progressively more features on fewer image windows. The idea is to spend the majority of time on image patches that contain faces. In this context, boosting is used to select features and combine results.

Stacking

Stacking takes the output values of machine learning models and then uses them as input values for another algorithm. You can, of course, feed the output of the higher-level algorithm to another predictor. It’s possible to use any arbitrary topology, but for practical reasons, you should try a simple setup first, as also dictated by Occam’s razor.
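The simplest stacking setup can be sketched as follows: two base models produce outputs, and a higher-level model learns how to combine them. Everything here is a toy assumption: model_a and model_b are hypothetical hand-written predictors, and the higher-level algorithm is a two-weight least-squares fit solved in closed form.

```python
def model_a(x):
    """Hypothetical base model that underestimates the target."""
    return 0.5 * x

def model_b(x):
    """Hypothetical base model that overestimates the target."""
    return x + 2.0

def fit_meta(base_outputs, targets):
    """Fit weights (w1, w2) minimizing sum((w1*p1 + w2*p2 - y)**2).

    Solves the 2x2 normal equations directly and assumes the two base
    outputs are not collinear (non-zero determinant).
    """
    s11 = sum(p1 * p1 for p1, _ in base_outputs)
    s12 = sum(p1 * p2 for p1, p2 in base_outputs)
    s22 = sum(p2 * p2 for _, p2 in base_outputs)
    t1 = sum(p1 * y for (p1, _), y in zip(base_outputs, targets))
    t2 = sum(p2 * y for (_, p2), y in zip(base_outputs, targets))
    det = s11 * s22 - s12 * s12
    return (t1 * s22 - t2 * s12) / det, (t2 * s11 - t1 * s12) / det

# Toy training data where the true relationship is y = x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [x + 1.0 for x in xs]

# Outputs of the base models become the inputs of the meta-model.
base_outputs = [(model_a(x), model_b(x)) for x in xs]
w1, w2 = fit_meta(base_outputs, ys)

def stacked_predict(x):
    return w1 * model_a(x) + w2 * model_b(x)

print(w1, w2)                 # 1.0 0.5
print(stacked_predict(10.0))  # 11.0
```

In practice, the higher-level model is usually trained on base-model predictions for data the base models did not see during their own training; otherwise it tends to overfit.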

A fun fact is that stacking is commonly used in the winning models of Kaggle competitions. For instance, the first place for the Otto Group Product Classification Challenge (www.kaggle.com/c/otto-group-product-classification-challenge) went to a stacking model composed of more than 30 different models.

So far, we have covered the tricks required to more easily reach the right generalization for a machine learning model throughout the data preprocessing and modeling phase. I know you can’t wait to start working on a machine learning project. Let’s get ready by setting up the working environment.

Installing software and setting up

As the book title says, Python is the language we will use to implement all machine learning algorithms and techniques throughout the entire book. We will also exploit many popular Python packages and tools, such as NumPy, SciPy, scikit-learn, TensorFlow, and PyTorch. By the end of this initial chapter, make sure you have set up the tools and working environment properly, even if you are already an expert in Python or familiar with some of the aforementioned tools.

Setting up Python and environments

We will use Python 3 in this book. The Anaconda Python 3 distribution is one of the best options for data science and machine learning practitioners.

Anaconda is a free Python distribution for data analysis and scientific computing. It has its own package manager, conda. The distribution (https://docs.anaconda.com/free/anaconda/) is available for several operating systems and for Python versions 3.7 to 3.11, and it includes around 700 Python packages (as of 2023), which makes it very convenient. For casual users, the Miniconda (https://conda.io/miniconda.html) distribution may be the better choice. Miniconda contains only the conda package manager and Python, so it takes up much less disk space than Anaconda.

The procedures to install Anaconda and Miniconda are similar. You can follow the instructions from https://docs.conda.io/projects/conda/en/latest/user-guide/install/. First, you must download the appropriate installer for your OS and Python version, as follows:

Figure 1.13: Installation entry based on your OS

Follow the steps listed for your OS. You can choose between a GUI and a CLI. I personally find the latter easier.

Anaconda comes with its own Python installation. On my machine, the Anaconda installer created an anaconda directory in my home directory and required about 900 MB. Similarly, the Miniconda installer creates a miniconda directory in your home directory.

Feel free to play around with it after you set it up. One way to verify that you have set up Anaconda properly is by entering the following command line in your terminal on Linux/Mac or Command Prompt on Windows (from now on, we will just mention Terminal):

python

The preceding command line will display your Python running environment, as shown in the following screenshot:

Figure 1.14: Screenshot after running “python” in the terminal

If you don’t see this, please check the system path or the path Python is running from.

To wrap up this section, I want to emphasize the reasons why Python is the most popular language for machine learning and data science. First of all, Python is famous for its high readability and simplicity, which makes it easy to build machine learning models. We spend less time worrying about getting the right syntax and compilation and, as a result, have more time to find the right machine learning solution. Second, we have an extensive selection of Python libraries and frameworks for machine learning:

Tasks	Python libraries
Data analysis	NumPy, SciPy, and pandas
Data visualization	Matplotlib and Seaborn
Modeling	scikit-learn, TensorFlow, Keras, and PyTorch

Table 1.5: Popular Python libraries for machine learning

The next step involves setting up some of the packages that we will use throughout this book.

Installing the main Python packages

For most projects in this book, we will use NumPy (http://www.numpy.org/), SciPy (https://scipy.org/), the pandas library (https://pandas.pydata.org/), scikit-learn (http://scikit-learn.org/stable/), TensorFlow (https://www.tensorflow.org/), and PyTorch (https://pytorch.org/).

In the sections that follow, we will cover the installation of several Python packages that we will mainly use in this book.

Conda environments provide a way to isolate dependencies and packages for different projects, so it is recommended to create and use a separate environment for each new project. Let’s create an environment called “pyml” using the following command:

conda create --name pyml python=3.10

Here, we also specify the Python version, 3.10, which is optional but highly recommended. This is to avoid using the latest Python version by default, which may not be compatible with many Python packages. For example, at the time of writing (late 2023), PyTorch does not support Python 3.11.

To activate the newly created environment, we use the following command:

conda activate pyml

The activated environment is displayed in front of your prompt like this:

(pyml) hayden@haydens-Air ~ %

NumPy

NumPy is the fundamental package for machine learning with Python. It offers powerful tools including the following:

The N-dimensional array (ndarray) class and several subclasses representing matrices and arrays
Various sophisticated array functions
Useful linear algebra capabilities

Installation instructions for NumPy can be found at https://numpy.org/install/. Alternatively, an easier method involves installing it with conda or pip in the command line, as follows:

conda install numpy

pip install numpy

A quick way to verify your installation is to import it in Python, as follows:

>>> import numpy

It is installed correctly if no error message is visible.

SciPy

In machine learning, we mainly use NumPy arrays to store data vectors or matrices composed of feature vectors. SciPy (https://scipy.org/) uses NumPy arrays and offers a variety of scientific and mathematical functions. Installing SciPy in the terminal is similar, again as follows:

conda install scipy

pip install scipy

pandas

We also use the pandas library (https://pandas.pydata.org/) for data wrangling later in this book. The best way to get pandas is via pip or conda, for example:

conda install pandas

scikit-learn

The scikit-learn library is a Python machine learning package optimized for performance, as a lot of its code runs almost as fast as equivalent C code. The same statement is true for NumPy and SciPy. scikit-learn requires both NumPy and SciPy to be installed. As the installation guide in http://scikit-learn.org/stable/install.html states, the easiest way to install scikit-learn is to use pip or conda, as follows:

pip install -U scikit-learn

conda install -c conda-forge scikit-learn

Here, we use the “-c conda-forge” option to tell conda to search for packages in the conda-forge channel, which is a community-driven channel with a wide range of open-source packages.

TensorFlow

TensorFlow is a Python-friendly open-source library invented by the Google Brain team for high-performance numerical computation. It makes machine learning faster and deep learning easier, with the Python-based convenient frontend API and high-performance C++-based backend execution. TensorFlow 2 was largely a redesign of its first mature version, 1.0, and was released at the end of 2019.

TensorFlow has been widely known for its deep learning modules. However, its most powerful point is computation graphs, which algorithms are built on. Basically, a computation graph is used to convey relationships between the input and the output via tensors.

For instance, if we want to evaluate a linear relationship, y = 3 * a + 2 * b, we can represent it in the following computation graph:

Figure 1.15: Computation graph for a y = 3 * a + 2 * b machine

Here, a and b are the input tensors, c and d are the intermediate tensors, and y is the output.

You can think of a computation graph as a network of nodes connected by edges. Each node is a tensor, and each edge is an operation or function that takes its input node and returns a value to its output node. To train a machine learning model, TensorFlow builds the computation graph and computes the gradients accordingly (a gradient is a vector that points in the direction of steepest increase of a function, so stepping against the gradient of the loss moves the model toward an optimal solution). In the upcoming chapters, you will see some examples of training machine learning models using TensorFlow.
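The following plain-Python sketch walks through the graph for y = 3 * a + 2 * b by hand. It deliberately does not use the TensorFlow API. It assumes the intermediate tensors are c = 3 * a and d = 2 * b, and the function name forward_backward is made up for this illustration.

```python
def forward_backward(a, b):
    """Evaluate y = 3 * a + 2 * b and its gradients, one edge at a time."""
    # Forward pass: each assignment follows one edge of the graph.
    c = 3 * a      # intermediate tensor c (assumed to be 3 * a)
    d = 2 * b      # intermediate tensor d (assumed to be 2 * b)
    y = c + d      # output tensor

    # Backward pass: apply the chain rule from the output to the inputs.
    dy_dc = 1.0            # y = c + d, so dy/dc = 1
    dy_dd = 1.0            # and dy/dd = 1
    dy_da = dy_dc * 3      # c = 3 * a, so dc/da = 3
    dy_db = dy_dd * 2      # d = 2 * b, so dd/db = 2
    return y, (dy_da, dy_db)

y, grads = forward_backward(2.0, 5.0)
print(y)      # 16.0
print(grads)  # (3.0, 2.0)
```

A framework automates exactly this bookkeeping: it records the operations during the forward pass and then traverses them in reverse to obtain the gradients.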

We highly recommend you go through https://www.tensorflow.org/guide/data if you are interested in exploring more about TensorFlow and computation graphs.

TensorFlow allows easy deployment of computation across CPUs and GPUs, which empowers expensive and large-scale machine learning. In this book, we will focus on the CPU as our computation platform. Hence, according to https://www.tensorflow.org/install/, installing TensorFlow 2 is done via the following command line:

conda install -c conda-forge tensorflow

pip install tensorflow

You can always verify the installation by importing it in Python.
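If you prefer to check several packages at once, a small script along these lines may help. It is only a sketch: the helper name installed is hypothetical, and the list covers the packages installed so far. Note that scikit-learn is imported under the module name sklearn.

```python
import importlib.util

def installed(module_name):
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None

# Import names of the packages set up so far in this chapter.
for name in ["numpy", "scipy", "pandas", "sklearn", "tensorflow"]:
    status = "ok" if installed(name) else "MISSING"
    print(f"{name:<12}{status}")
```

Run it inside the activated environment; a package reported as MISSING is not visible to that environment's Python interpreter.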

PyTorch

PyTorch is an open-source machine learning library primarily used to develop deep learning models. It provides a flexible and efficient framework to build neural networks and perform computations on GPUs. PyTorch was developed by Facebook’s AI Research lab and is widely used in both research and industry.

Similar to TensorFlow, PyTorch performs its computations based on a directed acyclic graph (DAG). The difference is that PyTorch utilizes a dynamic computational graph, which allows for on-the-fly graph construction during runtime, while TensorFlow has traditionally used a static computational graph, where the graph structure is defined upfront and then executed (TensorFlow 2 runs eagerly by default and builds static graphs through tf.function). This dynamic nature enables greater flexibility in model design and easier debugging, and also facilitates dynamic control flow, making it suitable for a wide range of applications.

PyTorch has become a popular choice among researchers and practitioners in the field of deep learning, due to its flexibility, ease of use, and efficient computational capabilities. Its intuitive interface and strong community support make it a powerful tool for various applications, including computer vision, natural language processing, reinforcement learning, and more.

To install PyTorch, it is recommended to look up the command in the latest instructions on https://pytorch.org/get-started/locally/, based on the system and method.

As an example, we install the latest stable version (2.2.0 as of late 2023) via conda on a Mac using the following command:

conda install pytorch::pytorch torchvision -c pytorch

Best practice

If you encounter issues in installation, please read more about the platform and package-specific recommendations provided on the instructions page. All PyTorch code in this book can be run on your CPU, unless specifically indicated for a GPU only. However, using a GPU is recommended if you want to expedite training neural network models and fully enjoy the benefits of PyTorch. If you have a graphics card, refer to the instructions and set up PyTorch with the appropriate compute platform. For example, I install it on Windows with a GPU using the following command:

conda install pytorch torchvision pytorch-cuda=11.8 -c pytorch -c nvidia

To check if PyTorch with GPU support is installed correctly, run the following Python code:

>>> import torch
>>> torch.cuda.is_available()
True

Alternatively, you can use Google Colab (https://colab.research.google.com/) to train some neural network models using GPUs for free.

There are many other packages we will use intensively, for example, Matplotlib for plotting and visualization, Seaborn for visualization, NLTK for natural language processing tasks, transformers for state-of-the-art models pretrained on large datasets, and OpenAI Gym for reinforcement learning. We will provide installation details for any package when we first encounter it in this book.
