R Machine Learning Projects

Exploring the Machine Learning Landscape

Machine learning (ML) is an amazing subfield of Artificial Intelligence (AI) that tries to mimic the learning behavior of humans. Similar to the way a baby learns by observing the examples it encounters, an ML algorithm learns the outcome or response to a future incident by observing the data points that are provided as input to it.

In this chapter, we will cover the following topics:

ML versus software engineering
Types of ML methods
ML terminology—a quick review
ML project pipeline
Learning paradigm
Datasets

ML versus software engineering

With most people transitioning from traditional software engineering practice to ML, it is important to understand the underlying difference between both areas. Superficially, both of these areas seem to generate some sort of code to perform a particular task. An interesting fact to observe is that, unlike software engineering where a programmer explicitly writes a program with various responses based on several conditions, the ML algorithm infers the rules of the game by observing the input examples. The rules that are learned are further used for better decision making when new input data is fed to the system.

As you can observe in the following diagram, automatically inferring the actions from data without manual intervention is the key differentiator between ML and traditional programming:

Another key differentiator of ML from traditional programming is that the knowledge acquired through ML is able to generalize beyond the training samples by successfully interpreting data that the algorithm has never seen before, while a program coded in traditional programming can only perform the responses that were included as part of the code.

Yet another differentiator is that in software engineering, there are certain specific ways to solve a problem at hand. Given an algorithm developed based on certain assumptions of inputs and the conditions incorporated, you will be able to guarantee the output that will be obtained given an input. In the ML world, it is not possible to provide such assurances on the output obtained from the algorithms. It is also very difficult in the ML world to confirm if a particular technique is better than another without actually trying both the techniques on the dataset for the problem at hand.

ML and software engineering are not the same! ML projects may involve some software engineering in them, but ML cannot be considered to be the same as software engineering.

While there is more than one formal definition that exists for ML, the following mentioned are a few key definitions encountered often:

"Machine learning is the science of getting computers to act without being explicitly programmed."
—Stanford

"Machine learning is based on algorithms that can learn from data without relying on rules-based programming."
—McKinsey and Co.

With the rise of data as the fuel of the future, the terms AI, ML, data mining, data science, and data analytics are used interchangeably by industry practitioners. It is important to understand the key differences between these terms to avoid confusion.

The terms AI, ML, data mining, data science, and data analytics, though used interchangeably, are not the same!

Let's take a look at the following terms:

AI: AI is a paradigm where machines are able to perform tasks in a smart way. It may be observed that in the definition of AI, it is not specified whether the smartness of machines may be achieved manually or automatically. Therefore, it is safe to assume that even a program written with several if...else or switch...case statements that has then been infused with a machine to carry out tasks may be considered to be AI.
ML: ML, on the other hand, is a way for the machine to achieve smartness by learning from the data that is provided as input and, thereby, we have a smart machine performing a task. It may be observed that ML achieves the same objective of AI except that the smartness is achieved automatically. Therefore, it can be concluded that ML is simply a way to achieve AI.
Data mining: Data mining is a specific field that focuses on discovering the unknown properties of the datasets. The primary objective of data mining is to extract rules from large amounts of data provided as input, whereas in ML, an algorithm not only infers rules from the data input, but also uses the rules to perform predictions on any new, incoming data.
Data analytics: Data analytics is a field that encompasses performing fundamental descriptive statistics, data visualization, and data points communication for conclusions. Data analytics may be considered to be a basic level within data science. It is normal for practitioners to perform data analytics on the input data provided for data mining or ML exercises. Such analysis on data is generally termed as exploratory data analysis (EDA).

Data science: Data science is an umbrella term that includes data analytics, data mining, ML, and any specific domain expertise pertaining to the field of work. Data science is a concept that includes several aspects of handling the data such as acquiring the data from one or more sources, data cleansing, data preparation, and creating new data points based on existing data. It includes performing data analytics. It also encompasses using one or more data mining or ML techniques on the data to infer knowledge to create an algorithm that performs a task on unseen data. This concept also includes deploying the algorithm in a way that it is useful to perform the designated tasks in the future.

The following is a Venn diagram which demonstrates the skills required by a professional working in the data science ambit. It has three circles, each of which defines a specific skill that a data science professional should have:

Let's explore the following skills mentioned in the preceding diagram:

Math & Statistic Knowledge: This skill is required to analyze the statistical properties of the data.
Hacking Skills: Programming skills play a key role in order to process the data in a quick manner. The ML algorithm is applied to create an output that will perform the prediction on unseen data.
Substantive Expertise: This skill refers to the domain expertise in the field of the problem at hand. It helps the professional to be able to provide proper inputs to the system from which it can learn and to assess the appropriateness of the inputs and results obtained.

To be a successful data science professional you need to have math, programming skills, as well as knowledge of the business domain.

As we can see, AI, data science, data analytics, data mining, and ML are all interlinked. All of these areas are the most in-demand domains in the industry right now. The right skill sets in combination with real-world experience will lead to a strong career in these areas which are currently trending. As ML forms the core of the leading space, the next section explores the various types of ML methods that may be applied to several real-world problems.

ML is everywhere! Most of the time, we may be using something that is ML-based but don’t realize its existence or the influence that it has on our lives! Let's explore together some very popular devices or applications that we experience on a daily basis, which are powered by ML:

Virtual personal assistants (VPAs) such as Google Allo, Alexa, Google Now, Google Home, Siri, and so on
Smart maps that show you traffic predictions, given your source and destination
Demand-based price surging in Uber or similar transportation services
Automated video surveillance in airports, railway stations, and other public places
Face recognition of individuals in pictures posted on social media sites such as Facebook
Personalized news feeds served to you on Facebook
Advertisements served to you on YouTube
People you may know suggestions on Facebook and other similar sites
Job recommendations on LinkedIn, based on your profile
Automated responses on Google Mail
Chatbots that you converse with in online customer support forums
Search engine results filtering
Email spam filtering

Of course, the list does not end here. The preceding applications mentioned are just a few of the basic ones that illustrate the influence that ML has on our lives today. It is not astonishing to quote that there is no subject area that ML has not touched!

The topics in this section are by no means an exhaustive description of ML, but just a quick touch point to get us started on a journey of exploration. Now that we have a basic understanding of what ML is and where it can be applied, let's delve deeper into other ML-related topics in the next section.

Types of ML methods

Several types of tasks that aim at solving real-world problems can be achieved thanks to ML. An ML method generally means a group of specific types of algorithms that are suitable for solving a particular kind of problem and the method addresses any constraints that the problem brings along with it. For example, a constraint of a particular problem could be the availability of labeled data that can be provided as input to the learning algorithm.

Essentially, the popular ML methods are supervised learning, unsupervised learning, semi-supervised learning, reinforcement learning, and transfer learning. The rest of this section details each of these methods.

Supervised learning

A supervised learning algorithm is applied when one is very clear about the result that needs to be achieved from a problem, however one is unsure about the relationships between the data that affects the output. We would like the ML algorithm that we apply on the data to perceive these relationships between different data elements so as to achieve the desired output.

The concept can be better explained with an example—at a bank, prior to extending a loan, they would like to predict if a loan applicant would pay the loan back. In this case, the problem is very clear. If a loan is extended to a prospective customer X, there are two possibilities: that X would successfully repay the loan or X would not repay the loan. The bank would like to use ML to identify the category into which customer X falls; that is, a successful repayer of the loan or a repayment defaulter.

While the problem definition that is to be solved is clear, please note that the features of a customer that will contribute to successful loan repayment or non-repayment are not clear and this is something we would like the ML algorithm to learn by observing the patterns in the data.

The major challenge here is that we need to provide input data that represents both customers that repaid their loans successfully and also customers that failed to repay. The bank can simply look at the historical data to get the records of customers in both categories and then label each record as paid or unpaid categories as appropriate.

The records, thus labeled, now become input to a supervised learning algorithm so that it can learn the patterns of both categories of customers. The process of learning from the labeled data is called training and the output obtained (algorithm) from the learning process is called a model. Ideally, the bank would keep some part of the labeled data aside from training data so as to be able to test the model created, and this data is termed as test data. It should be no surprise that the labeled data that is used for training the model is called training data.

Once the model has been built, measurements are obtained by testing the model with test data to ensure the model yields a satisfactory level of performance, otherwise model-building iterations are carried out until the desired model performance is obtained. The model that achieved the desired performance on test data can be used by the bank to infer if any new loan applicant will be a future defaulter at all and, if so, make a better decision in terms of extending a loan to that applicant.

In a nutshell, supervised ML algorithms are employed when the objective is very clear and labeled data is available as input for the algorithm to learn the patterns from. The following diagram summarizes the supervised learning process:

Supervised learning can be further divided into two categories, namely classification and regression. The prediction of a bank loan defaulter explained in this section is an example of classification and it aims to predict a label of a nominal type such as yes or no. On the other hand, it is also possible to predict numeric values (continuous values) and this type of prediction is called regression. An example of regression is predicting the monthly rental of a home in a prime location of a city based on features such as the demand for houses in the area, the number of bedrooms, the dimensions of the house, and accessibility to public transportation.

Several supervised learning algorithms exist, and a few popularly known algorithms in this area include classification and regression trees (CART), logistic regression, linear regression, Naive Bayes, neural networks, k-nearest neighbors (KNN), and support vector machine (SVM).

Unsupervised learning

The availability of labeled data is not very common and manually labeling data is also not cheap. This is the situation where unsupervised learning comes into play.

For example, one small boutique firm wants to roll out a promotion to its customers, who are registered on their Facebook page. While the business objective is clear—that a promotion needs to be rolled out to customers—it is unclear as to which customer falls under which group. Unlike the supervised learning method where prior knowledge existed in terms of bad debtors and good debtors, in this case there are no such clues.

When the customer information is given as input to unsupervised learning algorithms, it tries to identify the patterns in the data and thereby groups the data of the customers with similar kinds of attributes.

Birds of the same feather flock together is the principle followed in customer grouping with unsupervised learning.

The reasoning behind the formation of these organic groups from the grouping exercise may not be very intuitive. It may take some research to identify the factors that contributed to the gathering of a set of customers in a group. Most of the time, this research is manual and the data points in each group need verifying. This research may form the basis to determine the groups to which the particular promotion at hand needs to be rolled out. This application of unsupervised learning is called clustering. The following diagram shows the application of unsupervised ML to cluster the data points:

There are a number of clustering algorithms. However, the most popular ones are namely, k-means clustering, k-modes clustering, hierarchical clustering, fuzzy clustering, and so on.

Other forms of unsupervised learning do exist. For example, in retail industry, an unsupervised learning method called association rule mining is applied on customer purchases to identify the goods that are purchased together. In this case, unlike supervised learning, there is no need for labels at all. The task involved only requires the ML algorithm to identify the latent associations between the products that are billed together by customers. Having the information from association rule mining helps retailers place the products that are bought together in proximity. The idea is that customers can be intuitively encouraged to buy the extra products.

A priori, equivalence class transformation (Eclat), and frequency pattern growth (FPG) are popular among the several algorithms that exist to perform association rule mining.

Yet another form of unsupervised learning is anomaly detection or outlier detection. The goal of the exercise is to identify data points that do not belong to the rest of the elements that are given as input to the unsupervised learning algorithm. Similar to association rule mining, due to the nature of the problem at hand, there is no requirement for labels to be made use of by the algorithm to achieve the goal.

Fraud detection is an important application of anomaly detection in the credit cards industry. Credit card transactions are monitored in real time and any spurious transaction patterns are flagged immediately to avoid losses to the credit card user as well as the credit card provider. The unusual pattern that is monitored for could be a huge transaction in a foreign currency rather than that of a normal currency in which the particular customer generally transacts. It could be transactions in physical stores located in two different continents on the same day. The general idea is to be able to flag up a pattern that is a deviation from the norm.

K-means clustering and one-class SVM are two well-known unsupervised ML algorithms that are used to observe abnormalities in the population.

Overall, it may be understood that unsupervised learning is unarguably a very important method, given that labeled data used for training is a scarce resource.

Semi-supervised learning

Semi-supervised learning is a hybrid of both supervised and unsupervised methods. ML requires large amounts of data for training. Most of the time, a directly proportional relationship is observed between the amount of data used for model training and the performance of the model.

In niche domains such as medical imagining, a large amount of image data (MRIs, x-rays, CT scans) is available. However, the time and availability of qualified radiologists to label these images is scarce. In this situation, we might end up getting only a few images labeled by radiologists.

Semi-supervised learning takes advantage of the few labeled images by building an initial model that is used to label the large amount of unlabeled data that exists in the domain. Once the large amount of labeled data is available, a supervised ML algorithm may be used to train and create a final model that is used for prediction tasks on the unseen data. The following diagram illustrates the steps involved in semi-supervised learning:

Speech analysis, protein synthesis, and web content classifications are certain areas where large amounts of unlabeled data and fewer amounts of labeled data are available. Semi-supervised learning is applied in these areas with successful results.

Generative adversarial networks (GANs), semi-supervised support vector machines (S3VMs), graph-based methods, and Markov chain methods are well-known methods among others in the semi-supervised ML area.

Reinforcement learning

Reinforcement learning (RL) is an ML method that is neither supervised learning nor unsupervised learning. In this method, a reward definition is provided as input to this kind of a learning algorithm at the start. As the algorithm is not provided with labeled data for training, this type of learning algorithm cannot be categorized as supervised learning. On the other hand, it is not categorized as unsupervised learning, as the algorithm is fed with information on reward definition that guides the algorithm through taking the steps to solve the problem at hand.

Reinforcement learning aims to improve the strategies used to solve any problem continuously by relying on the feedback received. The goal is to maximize the rewards while taking steps to solve the problem. The rewards obtained are computed by the algorithm itself going by the rewards and penalty definitions. The idea is to achieve optimal steps that maximize the rewards to solve the problem at hand.

The following diagram is an illustration depicting a robot automatically determining the ideal behavior through a reinforcement learning method within the specific context of fire:

A machine outplaying humans in an Atari video game is termed as one of the foremost success stories of reinforcement learning. To achieve this feat, a large number of example games played by humans are fed as input to the algorithm that learned the steps to take to maximize the reward. The reward in this case is the final score. The algorithm, post learning from the example inputs, just simulated the pattern at each step of the game that eventually maximized the score obtained.

Though it might appear that reinforcement learning can be applied to game scenarios only, there are numerous use cases for this method in industry as well. The following examples mentioned are three such use cases:

Dynamic pricing of goods and services based on spontaneous supply and demand targeted at achieving profit maximization is achieved through a variant of reinforcement learning called Q-learning.
Effective use of space in warehouses is a key challenge faced by inventory management professionals. Market demand fluctuations, the large availability of inventory stocks, and delays in refilling the inventory are the key constraints that affect space utilization. Reinforcement learning algorithms are used to optimize the time to procure inventory as well as to reduce the time to retrieve the goods from warehouses, thereby directly impacting the space management issue referred to as a problem in the inventory management area.
Prolonged treatments and differential drug administration is required in medical science to treat diseases such as cancer. The treatments are highly personalized, based on the characteristics of the patient. Treatment often involves variations of the treatment strategy at various stages. This kind of treatment plan is typically referred to as a dynamic treatment regime (DTR). Reinforcement learning helps with processing the clinical trials data to come up with the appropriate personalized DTR for the patient, based on the characteristics of the patient that are fed in as inputs to the reinforcement learning algorithm.

There are four very popular reinforcement learning algorithms, namely Q-learning, state-action-reward-state-action (SARSA), deep Q network (DQN), and deep deterministic policy gradient (DDPG).

Transfer learning

The reusability of code is one of the fundamental concepts of object-oriented programming (OOP) and it is pretty popular in the software-engineering world. Similarly, transfer learning involves reusing a model built to achieve a specific task to solve another related task.

It is understandable that to achieve better performance measurements, ML models need to be trained on large amounts of labeled data. The availability of fewer amounts of data means less training and the result is a model with suboptimal performance.

Transfer learning attempts to solve the problems arising from the availability of fewer amounts of data by reusing the knowledge obtained by a different related model. Having fewer data points available to train a model should not impede building a better model, which is the core concept behind transfer learning. The following diagram is an illustration showing the purpose of transfer learning in an image recognition task that classifies dog and cat images:

In this task, a neural network model is involved with detecting the edges, color blob detection, and so on in the first few layers. Only at the progressive layers (maybe in the last few layers) does the model attempt to identify the facial features of dogs or cats in order to classify them as one of the targets (a dog or a cat).

It may be observed that the tasks of identifying edges and color blobs are not specific to cats' and dogs' images. The knowledge to infer edges or color blobs may be generally inferred even if a model is trained on non-dog or non-cat images. Eventually, if this knowledge is clubbed with knowledge derived from inferring cat faces versus dog faces, even if they are small in number, we will have a better model than the suboptimal model obtained by training on fewer images.

In the case of a dogs-cats classifier, first, a model is trained on a large set of images that are not confined to cats' and dogs' images. The model is then taken and the last few layers are retrained on the dogs' and cats' faces. The model, thus obtained, is then tested and used post evidencing performance measurements that are satisfactory.

The concept of transfer learning is used not just for image-related tasks. Another example of it being used is in natural language processing (NLP) where it can perform sentiment analysis on text data.

Assume a company that launched a new product has a concept that never existed before (say, for now, a flying car). The task is to analyze the tweets related to the new product and identify each of them as being of positive, negative, or neural sentiment. It may be observed that prior, labeled tweets are unavailable in the flying car's domain. In such cases, we can take a model built based on the labeled data of generic product reviews for several products and domains. We can reuse the model by supplementing it with flying-car-domain-specific terminology to avail a new model. This new model will be finally used for testing and deploying to analyze sentiment on the tweets obtained about the newly launched flying cars.

It is possible to achieve transfer learning through the following two ways:

By reusing one's own model
By reusing a pretrained model

Pretrained models are models built by various organizations or individuals as part of their research work or as part of a competition. These models are generally very complex and are trained on large amounts of data. They are also optimized to perform their tasks with high precision. These models may take days or weeks to train on modern hardware. Organizations or individuals often release these models under permissive license for reuse. Such pretrained models can be downloaded and reused through the transfer-learning paradigm. This will effectively make use of the vast existing knowledge that the pretrained models possess, which would otherwise be hard to attain for an individual with limited hardware resources and amounts of data to train.

There are several pretrained models made available by various parties. The following described are some of the popular pretrained models:

Inception-V3 model: This model has been trained on ImageNet as part of a large visual recognition challenge. The competition required the participants to classify a given image into one of 1,000 classes. Some of the classes include the names of animals and object names.

MobileNet: This pretrained model has been built by Google and it is meant to perform object detection using the ImageNet database. The architecture is designed for mobiles.
VCG Face: This is a pretrained model built for face recognition.
VCG 16: This is a pretrained model trained on the MS COCO dataset. This one accomplishes image captioning; that is, given an input image, it generates a caption describing the image's contents.
Google's Word2Vec model and Stanford's GloVe model: These pretrained models take text as input and produce word vectors as output. Distributed word vectors offer one form of representing documents for NLP or ML applications.

Now that we have a basic understanding of various possible ML methods, in the next section, we focus on quickly reviewing the key terminology used in ML.

ML terminology – a quick review

In this section, we take the popular ML terms and review them. This non-exhaustive review will helps us as a quick refresher and enable us to follow the projects covered by this book without any hiccups.

Deep learning

This is a revolutionary trend and has become a super-hot topic in recent times in the ML world. It is a category of ML algorithms that use artificial neural networks (ANNs) with multiple hidden layers of neurons to address problems.

Superior results are obtained by applying deep learning to several real-world problems. Convolutional neural networks (CNNs), recurrent neural networks (RNNs) autoencoders (AEs), generative adversarial networks (GANs), and deep belief networks (DBNs) are some of the popular deep learning methods.

Big data

The term refers to large volumes of data that combine both structured data types (rows and columns similar to a table) and unstructured data types (text documents, voice recordings, image data, and so on). Due to the volume of data, it does not fit into the main memory of the hardware where ML algorithms need to be executed. Separate strategies are needed to work on these large volumes of data. Distributed processing of the data and combining the results (typically called MapReduce) is one strategy. It is also possible to process just enough data sequentially that can fit in a main memory each time and store the results somewhere on a hard drive; we need to repeat this process until the entirety of the data is processed completely. After the data processing, the results need to be combined to avail the final results of all the data that has been processed.

Special technologies such as Hadoop and Spark are required to perform ML on big data. Needless to say, you will need to hone specialized skills in order to apply ML algorithms successfully using these technologies on big data.

Natural language processing

This is an application area of ML that aims for computers to comprehend human languages such as English, French, and Mandarin. NLP applications enable users to interact with computers using spoken languages.

Chatbot, speech synthesis, machine translation, text classification and clustering, text generation, and text summarization are some of the popular applications of NLP.

Computer vision

This field of ML tries to mimic human vision. The aim is to enable computers to see, process, and determine the objects in images or videos. Deep learning and the availability of powerful hardware has led to the rise of very powerful applications in this area of ML.

Autonomous vehicles such as self-driving cars, object recognition, object tracking, motion analysis, and the restoration of images are some of the applications of computer vision.

Cost function

Cost function, loss function, or error function are used interchangeably by practitioners. Each is used to define and measure the error of a model. The objective for the ML algorithm is to minimize the loss from the dataset.

Some of the examples of cost function are square loss that is used in linear regression, hinge loss that is used in support vector machines and 0/1 loss used to measure accuracy in classification algorithms.

Model accuracy

Accuracy is one of the popular metrics used to measure the performance of ML models. The measurement is easy to understand and helps the practitioner to communicate the goodness of a model very easily to its business users.

Generally, this metric is used for classification problems. Accuracy is measured as the number of correct predictions divided by the total number of predictions.

Confusion matrix

This is a table that describes the classification model's performance. It is an n rows, n columns matrix where n represents the number of classes that are predicted by the classification model. It is formed by noting down the number of correct and incorrect predictions by the model when compared to the actual label.

Confusion matrices are better explained with an example—assume that there are 100 images in a dataset where there are 50 dog images and 50 cat images. A model that is built to classify images as cat images or dog images is given this dataset. The output from the model showed that 40 dog images are classified correctly and 20 cat images are predicted correctly. The following table is the confusion matrix construction from the prediction output of the model:

Model predicted labels	Actual labels
		cats	dogs
	cats	20	30
	dogs	10	40

Predictor variables

These variables are otherwise called independent variables or x-values. These are the input variables that help to predict the dependent or target or response variable.

In a house rent prediction use case, the size of the house in square feet, the number of bedrooms, the number of houses available unoccupied in the region, the proximity to public transport, the accessibility to facilities such as hospitals and schools are all some examples of predictor variables that determine the rental cost of the house.

Response variable

Dependent variables or target or y-values are all interchangeably used by practitioners as alternatives for the term response variable. This is the variable the model predicts as output based on the independent variables that are provided as input to the model.

In the house rent prediction use case, the rent predicted is the response variable.

Dimensionality reduction

Feature reduction (or feature selection) or dimensionality reduction is the process of reducing the input set of independent variables to obtain a lesser number of variables that are really required by the model to predict the target.

In certain cases, it is possible to represent multiple dependent variables by combining them together without losing much information. For example, instead of having two independent variables such as the length of a rectangle and the breath of a rectangle, the dimensions can be represented by only one variable called the area that represents both the length and breadth of the rectangle.

The following mentioned are the multiple reasons we need to perform a dimensionality reduction on a given input dataset:

To aid data compression, therefore accommodate the data in a smaller amount of disk space.
The time to process the data is reduced as fewer dimensions are used to represent the data.
It removes redundant features from datasets. Redundant features are typically known as multicollinearity in data.

Reducing the data to fewer dimensions helps visualize the data through graphs and charts.
Dimensionality reduction removes noisy features from the dataset which, in turn, improves the model performance.

There are many ways by which dimensionality reduction can be attained in a dataset. The use of filters, such as information gain filters, and symmetric attribute evaluation filters, is one way. Genetic-algorithm-based selection and principal component analysis (PCA) are other popular techniques used to achieve dimensionality reduction. Hybrid methods do exist to attain feature selection.

Class imbalance problem

Let's assume that one needs to build a classifier that identifies cat and dog images. The problem has two classes namely cat and dog. If one were to train a classification model, training data is required. The training data in this case is based on images of dogs and cats given as input so a supervised learning model can learn the features of dogs versus cats.

It may so happen that if there are 100 images available for training in the dataset and 95 of them are dog pictures, five of them are cat pictures. This kind of unequal representation of different classes in a training dataset is termed as a class imbalance problem.

Most ML techniques work best when the number of examples in each class are roughly equal. One can employ certain techniques to counter class imbalance problems in data. One technique is to reduce the majority class (images of dogs) samples and make them equal to the minority class (images of cats). In this case, there is information loss as a lot of the dog images go unused. Another option is to generate synthetic data similar to the data for the minority class (images of cats) so as to make the number of data samples equal to the majority class. Synthetic minority over-sampling technique (SMOTE) is a very popular technique for generating synthetic data.

It may be noted that accuracy is not a good metric for evaluating the performance of models where the training dataset experiences class imbalance problems. Assume a model built based on a class-imbalanced dataset that predicts a majority class for any test sample that it is asked to predict on. In this case, one gets 95% accuracy as roughly 95% of the images are dog images in the test dataset. But this performance can only be termed as a hoax as the model does not have any discriminative power—it just predicts dog as the class for any image it needs to predict about. In this case, it just happened that every image is predicted as a dog, but still the model got away with a very high accuracy indicating that it is a great model, whether it is in reality or not!

There are several other performance metrics available to use in a situation where a class imbalance is a problem, F1 score and the area under the curve of the receiver operating characteristic (AUCROC) are some of the popular ones.

Model bias and variance

While several ML algorithms are available to build models, model selection can be done on the basis of the bias and variance errors that the models produce.

Bias error occurs when the model has a limited capability to learn the true signals from a dataset provided as input to it. Having a highly biased model essentially means the model is consistent but inaccurate on average.

Variance errors occur when the models are too sensitive to the training datasets with which they are trained. Having high variance in a model essentially means that the trained model will produce high accuracies on any test dataset on average, but their predictions are inconsistent.

Underfitting and overfitting

Underfitting and overfitting are the concepts closely associated with bias and variance. These two are the biggest causes for the poor performance of the models, therefore a practitioner has to pay very close attention to these issues while building ML models.

A situation where the model does not perform well with both training data as well as test data is termed as underfitting. This situation can be detected by observing high training errors and test errors. Having an underfitting problem means that the ML algorithm chosen to fit the model is not suitable to model the features of the training data. Therefore, the only remedy is to try other kinds of ML algorithms to model the data.

Overfitting is a situation where the model learned the features of the training data so well that it fails to generalize on other unseen data. In an overfitting model, noise or random fluctuations in the training data are considered as true signals by the model and it looks for these patterns in unseen data as well, therefore impacting the poor model performance.

Overfitting is more prevalent in non-parametric and non-linear models such as decision trees, and neural networks. Pruning the trees is one remedy to overcome the problem. Another remedial measure is a technique called dropout where some of the features learned from the model are dropped randomly from the model therefore making the model more generalizable to unseen data. Regularization is yet another technique to resolve overfitting problems. This is attained by penalizing the coefficients of the model so that the model generalizes better. L1 penalty and L2 penalty are the types of penalties through which regularization can be performed in regression scenarios.

The goal for a practitioner is to ensure that the model neither overfits nor underfits. To achieve this, it is essential to learn when to stop training the ML data. One could plot the training error and validation error (an error that is measured on a small portion of the training dataset that is kept aside) on a chart and identify the point where the training data keeps decreasing, however the validation error starts to rise.

At times, obtaining performance measurement on training data and expecting a similar measurement to be obtained on unseen data may not work. A more realistic training and test performance estimate is to be obtained from a model by adopting a data-resampling technique called k-fold cross validation. The k in k-fold cross validation refers to a number; examples include 3-fold cross validation, 5-fold cross validation, and 10-fold cross validation. The k-fold cross validation technique involves dividing the training data into k parts and running the training process k + 1 times. In each iteration, the training is performed on k - 1 partitions of the data and the k^th partition is used exclusively for testing. It may be noted that the k^th partition for testing and k - 1 partitions for training are shuffled in each iteration, therefore the training data and testing data do not stay constant in each iteration. This approach enables getting a pessimistic measurement of performance that can be expected from the model on the unseen data in the future.

10-fold cross validation with 10 runs to obtain model performance is considered to be a gold standard estimate for a model's performance among practitioners. Estimating the model's performance in this way is always recommended in industrial setups and for critical ML applications.

Data preprocessing

This is essentially a step that is adopted in the early stages of an ML project pipeline. Data preprocessing involves transforming the raw data in a format that is acceptable as input by ML algorithms.

Feature hashing, missing values imputation, transforming variables from numeric to nominal, and vice versa, are a few data preprocessing steps among the numerous things that can be done to data during preprocessing.

Raw text documents' transformation into word vectors is an example of data preprocessing. The word vectors thus obtained can be fed to an ML algorithm to achieve documents classification or documents clustering.

Holdout sample

While working on a training dataset, a small portion of the data is kept aside for testing the performance of the models. The small portion of data is unseen data (not used in training), therefore one can rely on the measurements obtained for this data. The measurements obtained can be used to tune the parameters of the model or just to report out the performance of the model so as to set expectations in terms of what level of performance can be expected from the model.

It may be noted that the performance measurement reported out on the basis of a holdout sample is not as robust an estimate as that of a k-fold cross validation estimate. This is because there could be some unknown biases that could have crept in during the random split of the holdout set from the original dataset. Also, there are also no guarantees that the holdout dataset has a representation of all the classes involved in the training dataset. If we need representation of all classes in the holdout dataset, then a special technique called a stratified holdout sample needs to be applied. This ensures that there is representation for all classes in the holdout dataset. It is obvious that a performance measurement obtained from a stratified holdout sample is a better estimate of performance than that of the estimate of performance obtained from a nonstratified holdout sample.

70%-30%, 80%-20%, and 90%-10% are generally the sets of training data-holdout data splits observed in ML projects.

Hyperparameter tuning

ML or deep learning algorithms take hyperparameters as input prior to training the model. Each algorithm comes with its own set of hyperparameters and some algorithms may have zero hyperparameters.

Hyperparameter tuning is an important step in model building. Each of the ML algorithms comes with some default hyperparameter values that are generally used to build an initial model, unless the practitioner manually overrides the hyperparameters. Setting the right combination of hyperparameters and the right hyperparameter values for the model greatly improves the performance of the model in most cases. Hence, it is strongly recommended that one does hyperparameter tuning as part of ML model building. Searching through the possible universe of hyperparameter values is a very time-consuming task.

The k in k-means clustering and k-nearest neighbors classification, the number of tress and the depth of tress in random forest, and eta in XGBoost are all examples of hyperparameters.

Grid search and Bayesian optimization-based hyperparameter tuning are two popular methods of hyperparameter tuning among practitioners.

Performance metrics

A model needs to be evaluated on unseen data to assess its goodness. The term goodness may be expressed in several ways and these ways are termed as model performance metrics.

Several metrics exist to report the performance of models. Accuracy, precision, recall, F-score, sensitivity, specificity, AUROC curve, root mean squared error (RMSE), Hamming loss, and mean squared error (MSE) are some of the popular model performance metrics among others.

Feature engineering

Feature engineering is the art of creating new features either from existing data in the dataset or by procuring additional data from an external data source. It is done with the intent that adding additional features improves the model performance. Feature engineering generally requires domain expertise and in-depth business problem understanding.

Let's take a look at an example of feature engineering—for a bank that is working on a loan defaulter prediction project, sourcing and supplementing the training dataset with information on the unemployment trends of the region for the past few months might improve the performance of the model.

Model interpretability

Often, in a business environment when ML models are built, just reporting the performance measurements obtained to confirm the goodness of the model may not be enough. The stakeholders generally are inquisitive to understand the whys of the model, that is, what are the factors contributing to the model's performance? In other words, the stakeholders want to understand the causes of the effects. Essentially, the expectation from the stakeholders is to understand the importance of various features in the model and the direction in which each of the variables impacts the model.

For example, does a feature of time spent on exercising every day in the dataset for a cancer prediction model have any impact on the model predictions at all? If so, does time spent on exercising every day push the prediction in a negative direction or positive direction?

While the example might sound simple to generate an answer for, in real-world ML projects, model interpretability is not so very simple due to the complex relationships between variables. It is seldom that one feature, in its isolation, impacts the prediction in any one direction. It is indeed a combination of features that impact the prediction outcome. Thus, it is even more difficult to explain to what extent the feature is impacting the prediction.

Linear models are generally easier to explain even to business users. This is because we obtain weights for various features as a result of model training with linear algorithms. These weights are direct indicators of how a feature is contributing to model prediction. After all, in a linear model, a prediction is the linear combination of model weights and features passed through a function. It should be noted that interaction between variables in the real world are not essentially linear. So, a linear model trying to model the underlying data that has non-linear relationships may not have good predictive power. So, while linear models' interpretability is great, it comes at the cost of model performance.

On the contrary, non-linear and non-parametric models tend to be very difficult to interpret. In most cases, it may not be apparent even to the person building the models as to what exactly are the factors driving the prediction and in which direction. This is simply because the prediction outcome is a complex non-linear combination of variables. It is also known that non-linear models in general are better performing models when compared to linear models. Therefore, there is a trade-off needed between model interpretability and model performance.

While the goal of model interpretability is difficult to achieve, there is some merit in accomplishing this goal. It helps with the retrospection of a model that is deemed as being a good performing model and confirming that no noise inadvertently existed in the data that is used for model building and testing. It is obvious that models with noise as features fail to generalize on unseen data. Model interpretability helps with making sure that no noise crept into the models as features. Also, it helps build trust with business users that are eventually consumers of the model output. After all, there is no point in building a model whose output is not going to be consumed!

Non-parametric, non-linear models are difficult to interpret, if not impossible. Specialized ML methods are now available to aid black box models interpretability. Partial dependency plot (PDP), Locally interpretable model-agnostic explanations (LIME), and Shapley additive explanations (SHAP) also known as Sharpley's are some of the popular methods used by practitioners to decipher the internals of a black box model.

Now that there is a good understanding of the various fundamental terms of ML, our next journey is to explore the details of the ML project pipeline. This journey discussed in the next section helps us understand the process of building an ML project, deploying it, and obtaining predictions to use in a business.

ML project pipeline

Most of the content available on ML projects, either through books, blogs, or tutorials, explains the mechanics of machine learning in such a way that the dataset available is split into training, validation, and test datasets. Models are built using training datasets, and model improvements through hyperparameter tuning are done iteratively through validation data. Once a model is built and improved upon to a point that is acceptable, it is tested for goodness with unseen test data and the results of testing are reported out. Most of the public content available, ends at this point.

In reality, the ML projects in a business situation go beyond this step. We may observe that if one stops at testing and reporting a built model performance, there is no real use of the model in terms of predicting about data that is coming up in future. We also need to realize that the idea of building a model is to be able to deploy the model in production and have the predictions based on new data so that businesses can take appropriate action.

In a nutshell, the model needs to be saved and reused. This also means that any new data on which predictions need to be made needs to be preprocessed in the same way as training data. This ensures that, the new data has the same number of columns and also the same types of columns as training data. This part of productionalization of the models built in the lab is totally ignored when being taught. This section covers an end-to-end pipeline for the models, right from data preprocessing to building the models in the lab to productionalization of the models.

ML pipelines describe the entire process from raw data acquisition to obtaining post processing of the prediction results on unseen data so as to make it available for some kind of action by business. It is possible that a pipeline may be depicted at a generalized level or described at a very granular level. This current section focuses on describing a generic pipeline that may be applied to any ML project. Figure 1.8 shows the various components of the ML project pipeline otherwise known as the cross-industry standard process for data mining (CRISP-DM).

Business understanding

Once the problem description that is to be solved using ML is clearly articulated, the first step in the ML pipeline is to be able to ascertain if the problem is of relevance to business and if the goals of the project are laid out without any ambiguities. It may also be wise to check if the problem at hand is feasible to be solved as an ML problem. These are the various aspects typically covered during the business understanding step.

Understanding and sourcing the data

The next step is to identify all the sources of data that are relevant to the business problem at hand. Organizations will have one or more systems, such as HR management systems, accounting systems, and inventory management systems. Depending on the nature of the problem at hand, we may need to fetch the data from multiple sources. Also, data that is obtained through the data acquisition step need not always be structured as tabular data; it could be unstructured data, such as emails, recorded voice files, and images.

In corporate organizations of reasonable size, it may not be possible for an ML professional to work all alone on the task of fetching the data from the diverse range of systems. Tight collaboration with other professionals in the organization may be required to complete this step of the pipeline successfully:

Preparing the data

Data preparation enables the creation of input data for ML algorithms to consume. Raw data that we get from data sources is often not very clean. Sometimes, the data cannot be readily fused into an ML algorithm to create a model. We need to ensure that the raw data is cleaned up and it is prepared in a format that is acceptable for the ML algorithm to take as input.

EDA is a substep in the process of creating the input data. It is a process of using visual and quantitative aids to understand the data without getting prejudice about the contents of the data. EDA gives us deeper insights into the data available at hand. It helps us to understand the required data preparation steps. Some of the insights that we could obtain during EDA are the existence of outliers in the data, missing values existence in the data, and the duplication of data. All of these problems are addressed during data cleansing which is another substep in data preparation. Several techniques may be adopted during data cleansing and the following mentioned are some of the popular techniques:

Deleting records that are outliers
Deleting redundant columns and irrelevant columns in data
Missing values imputation—filling missing values with special value NA or a blank or median or mean or mode or with a regressed value
Scaling the data
Removing stop words such as a, and, and how, from unstructured text data
Normalizing words in unstructured text documents with techniques such as stemming, and lemmatization
Eliminating non-dictionary words in text data
Spelling corrections on misspelled words in text documents
Replacing non-recognizable domain-specific acronyms in the text with actual word descriptions
Rotation, scaling, and translation of image data

Representing the unstructured data as vectors, providing labels for the records if the problem at hand needs to be dealt with by supervised learning, handling class imbalance problems in the data, feature engineering, transforming the data through transformation functions such as log transform, min-max transform, square root transform, and cube transform, are all part of the data preparation process.

The output of the data preparation step is tabular data that can be fit readily into an ML algorithm as input in order to create models.

An additional substep that is typically done in data preparation is to divide the dataset into training data, validation data, and test data. These various datasets are used for specific purposes in the model-building step.

Model building and evaluation

Once the data is ready and prior to the creation of the model, we need to pick and select the features from the list of features available. This can be accomplished through several off-the-shelf feature selection techniques. Some ML algorithms (for example, XGBoost) have feature selection inbuilt within the algorithm, therefore we need not explicitly perform feature selection prior to carrying out the modeling activity.

A suite of ML algorithms is available to try and create models on the data. Additionally, models may be created through ensembling techniques as well. One needs to pick the algorithm(s) and create models using training datasets, then tune the hyperparameters of the model using validation datasets. Finally, the model that is created can be tested using the test dataset. Issues, such as selecting the right metric to evaluate model performance, overfitting, underfitting, and acceptable performance thresholds all need to be taken care of in the model-building step itself. It may be noted that if we do not obtain acceptable performance on the model, it is required to go back to previous steps in order to get more data or to create additional features and then repeat the model-building step once again to check if the model performance improves. This may be done as many times as required until the desired level of performance is achieved by the model.

At the end of the modeling step, we might end up with a suite of models each having its own performance measurement on the unseen test data. The model that has the best performance can be selected for use in production.

Model deployment

The next step is to save the final model that can be used for the future. There are several ways to save the model as an object. Once saved, the model can be reloaded any time and it can be used for scoring the new data. Saving the model as an object is a trivial task and a number of libraries are available in Python and R to achieve it. As a result of saving the model, the model object gets persisted to the disk as a .sav file or a .pkl file or a .pmml object depending on the library used. The object can then be loaded into the memory to perform scoring on unseen data.

The final model that is selected for use in production can be deployed to score unseen data in the following two modes:

Batch mode: Batch mode scoring is when one accumulates the unseen data to be scored in a file, then run a batch job (just another executable script) at a predetermined time to perform scoring. The job loads the model object from disk to the memory and runs on each of the records in the file that needs to be scored. The output is written to another file at a specified location as directed in the batch job script. It may be noted that the records to be scored should have the same number of columns as in the training data and the type of columns should also comply with the training data. It should be ensured that the number of levels in factors columns (nominal type data) should also match with that of the training data.
Real-time mode: There are times where the business needs model scoring to happen on the fly. In this case, unlike the batch mode, data is not accumulated and we do not wait until the batch job runs for scoring. The expectation is that each record of the data, as and when it is available for scoring should be scored by the model. The result of the scoring is to be available to business users almost instantaneously. In this case, a model needs to be deployed as a web service that can serve any requests that come in. The record to be scored can be passed to the web service through a simple API call which, in turn, returns the scored result that can be consumed by the downstream applications. Again, the unscored data record that is passed in the API call should comply with the format of the training data records.

Yet another way of achieving near real-time results is by running the model job on micro batches of data several times a day and at very frequent intervals. The data gets accumulated between the intervals until a point where the model job kicks off. The model job scores and outputs the results for the data that is accumulated similar to batch mode. The business user gets to see the scored results as soon as the micro batch job finishes execution. The only difference between the micro batches processing versus the batch is that unlike the batch mode, business users need not wait until the next business day to get the scored results.

Though, the model building pipeline ends with successfully deploying the ML model and making it available for scoring, in real-world business situations, the job does not end here. Of course, the success parties flow in but there is a need to look again at the models post a certain point in time (maybe in a few months post the deployment). A model that is not maintained at regular intervals does not get very well used by businesses.

To avoid the models from perishing and not being used by business users, it is important to collect feedback on the performance of the model over a period of time and capture if any improvements need to be incorporated in the models. The unseen data does not come with labels, therefore comparing the model output with that of the desired output by business is a manual exercise. Collaborating with business users is a strong requirement to get feedback in this situation.

If there is a continued business need for the model and if the performance is not up to the mark on the unseen data that is scored with existing model, it needs to be investigated to identify the root cause(s). It may so happen that several things have changed in the data that is scored over a period of time when compared to the data on which model was initially trained. In which case, there is a strong need to recalibrate the model and it is essentially a jolly good idea to start once again!

Now that the book has covered all the essentials of ML and the project pipeline, the next topic to be covered is the learning paradigm, which will help us learn several ML algorithms.

Learning paradigm

Most learning paradigms that are followed in other books or content about ML follow a bottom-up approach. This approach starts from the bottom and works its way up. The approach first covers the theoretical elements, such as mathematical introductions to the algorithm, the evolution of the algorithm, variations, and parameters that the algorithm takes, and then delves into the application of the ML algorithm specific to a dataset. This may be a good approach; however, it takes longer really to see the results produced by the algorithm. It needs a lot of perseverance on the part of the learner to be patient and wait until the practical application of the algorithm is covered. In most cases, practitioners and certain classes of industry professionals working on ML are really interested in the practical aspects and they want to experience the power of the algorithm. For these people, the focus is not the theoretical foundations of the algorithm, but it is the practical application. The bottom-up approach works counterproductively in this case.

The learning paradigm followed in this book to teach several ML algorithms is opposite to the bottom-up approach. It rather follows a very practical top-down approach. The focus of this approach is learning by coding.

Each chapter of the book will focus on learning a particular class of ML algorithm. To start with, the chapter focuses on how to use the algorithm in various situations and how to obtain results from the algorithm in practice. Once the practical application of the algorithm is demonstrated using code and a dataset, gradually, the rest of the chapter unveils the theoretical details/concepts of the algorithms experienced in the chapter thus far. All theoretical details will be ensured to be covered only in as much detail as is required to understand the code and to apply the algorithm on any new datasets. This ensures that we get to learn the focused application areas of the algorithms rather than unwanted theoretical aspects that are of less importance applied in the ML world.