50 Algorithms Every Programmer Should Know

You're reading from 50 Algorithms Every Programmer Should Know: Tackle computer science challenges with classic to modern algorithms in machine learning, software design, data systems, and cryptography

Product type Paperback
Published in Sep 2023
Publisher Packt
ISBN-13 9781803247762
Length 538 pages
Edition 2nd Edition
Author Imran Ahmad

For classification algorithms, the winner is...

Let’s take a moment to compare the performance metrics of the various algorithms we’ve discussed. However, keep in mind that these metrics are highly dependent on the data we’ve used in these examples, and they can significantly vary for different datasets.

The performance of a model can be influenced by factors such as the nature of the data, the quality of the data, and how well the assumptions of the model align with the data.

Here’s a summary of our observations:

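| Algorithm           | Accuracy | Recall | Precision |
|---------------------|----------|--------|-----------|
| Decision tree       | 0.94     | 0.93   | 0.88      |
| XGBoost             | 0.93     | 0.90   | 0.87      |
| Random Forest       | 0.93     | 0.90   | 0.87      |
| Logistic regression | 0.91     | 0.81   | 0.89      |
| SVM                 | 0.89     | 0.71   | 0.92      |
| Naive Bayes         | 0.92     | 0.81   | 0.92      |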

From the table above, the decision tree classifier exhibits the highest performance in terms of both accuracy and recall in this particular context. For precision, we see a tie between the SVM and Naive Bayes algorithms.

However, remember that these results are data-dependent. For instance, SVM might excel in scenarios where data is linearly separable or can be made so through kernel transformations. Naive Bayes, on the other hand, performs well when the features are independent. Decision trees and Random Forests might be preferred when we have complex non-linear relationships. Logistic regression is a solid choice for binary classification tasks and can serve as a good benchmark model. Lastly, XGBoost, being an ensemble technique, is powerful when dealing with a wide range of data types and often leads in terms of model performance across various tasks.

So, it’s critical to understand your data and the requirements of your task before choosing a model. These results are merely a starting point, and deeper exploration and validation should be performed for each specific use case.
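The three metrics in the table can be computed directly from a model's predictions. The following is a minimal sketch in plain Python, assuming a binary problem in which the positive class is labeled 1. The function name `classification_metrics` and the label lists are hypothetical and are not taken from the book's code.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, recall, precision) for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    # Count the confusion-matrix cells
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)                   # share of all predictions that are correct
    recall = tp / (tp + fn) if (tp + fn) else 0.0       # share of actual positives that were found
    precision = tp / (tp + fp) if (tp + fp) else 0.0    # share of predicted positives that are correct
    return accuracy, recall, precision


# Hypothetical labels, for illustration only
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

accuracy, recall, precision = classification_metrics(y_true, y_pred)
print(accuracy, recall, precision)
```

A model can score well on one metric and poorly on another, as the SVM row in the table shows, so the metric that matters most depends on the cost of false positives versus false negatives in your task.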

Understanding regression algorithms

A supervised machine learning model uses one of the regression algorithms if the label is a continuous variable. In this case, the machine learning model is called a regressor.

To provide a more concrete understanding, let’s take a couple of examples. Suppose we want to predict the temperature for the next week based on historical data, or we aim to forecast sales for a retail store in the coming months.

Both temperatures and sales figures are continuous variables, which means they can take on any value within a specified range, as opposed to categorical variables, which have a fixed number of distinct categories. In such scenarios, we would use a regressor rather than a classifier.
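To make the idea of a regressor concrete, here is a minimal sketch of one of the simplest possible regressors: a straight line fitted by ordinary least squares to a single feature. It is an illustration only and is not one of the algorithms from the challenge that follows. The `fit_line` helper and the sales figures are hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and co-deviation of x and y
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: month number and sales (in thousands)
months = [1, 2, 3, 4]
sales = [10.0, 12.0, 14.0, 16.0]

slope, intercept = fit_line(months, sales)
forecast = slope * 5 + intercept  # predicted sales for month 5
print(slope, intercept, forecast)
```

The prediction is a number on a continuous scale rather than one of a fixed set of classes, which is what distinguishes a regressor from a classifier.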

In this section, we will present various algorithms that can be used to train a supervised machine learning regression model—or, put simply, a regressor. Before we go into the details of the algorithms, let’s first create a challenge for these algorithms to test their performance, abilities, and effectiveness.

Presenting the regressors challenge

Similar to the approach that we used with the classification algorithms, we will first present a problem to be solved as a challenge for all regression algorithms. We will call this common problem the regressors challenge. Then, we will use three different regression algorithms to address the challenge. This approach of using a common challenge for different regression algorithms has two benefits:

  • We can prepare the data once and use the prepared data on all three regression algorithms.
  • We can compare the performance of three regression algorithms in a meaningful way, as we will use them to solve the same problem.

Let’s look at the problem statement of the challenge.

The problem statement of the regressors challenge

Predicting the fuel efficiency of a vehicle is useful because an efficient vehicle is both better for the environment and cheaper to run. The mileage can be estimated from the power of the engine and the characteristics of the vehicle. Let’s create a challenge for regressors to train a model that can predict the miles per gallon (MPG) of a vehicle based on its characteristics.

Let’s look at the historical dataset that we will use to train the regressors.

Exploring the historical dataset

The following are the features of the historical dataset that we have:

| Name         | Type       | Description                                                   |
|--------------|------------|---------------------------------------------------------------|
| NAME         | Category   | Identifies a particular vehicle                               |
| CYLINDERS    | Continuous | The number of cylinders (between four and eight)              |
| DISPLACEMENT | Continuous | The displacement of the engine in cubic inches                |
| HORSEPOWER   | Continuous | The horsepower of the engine                                  |
| ACCELERATION | Continuous | The time it takes to accelerate from 0 to 60 mph (in seconds) |

The label for this problem is a continuous variable, MPG, that gives the miles per gallon achieved by each vehicle.

Let’s first design the data processing pipeline for this problem.

Feature engineering using a data processing pipeline

Let’s see how we can design a reusable processing pipeline to address the regressors challenge. As mentioned, we will prepare the data once and then use it in all the regression algorithms. Let’s follow these steps:

  1. We will start by importing the dataset, as follows:
    import pandas as pd
    dataset = pd.read_csv('https://storage.googleapis.com/neurals/data/data/auto.csv')
    
  2. Let’s now preview the dataset:
    dataset.head(5)
    
  3. This is how the dataset will look:

Figure 7.16: The first five rows of the dataset, as returned by dataset.head(5)

  4. Now, let’s proceed to feature selection. The NAME column only identifies each vehicle, and columns that merely identify rows are not relevant to training the model, so we will drop it.
  5. Let’s drop NAME, convert all of the remaining columns to numeric values, and impute the null values with 0:
    # Drop the identifier column
    dataset = dataset.drop(columns=['NAME'])
    dataset.head(5)
    # Values that cannot be parsed as numbers become NaN
    dataset = dataset.apply(pd.to_numeric, errors='coerce')
    # Replace every NaN with 0
    dataset = dataset.fillna(0)
    

    Imputation improves the quality of the data and prepares it to be used to train the model. Now, let’s see the final step.

  6. Let’s divide the data into testing and training partitions:
    y = dataset['MPG']
    X = dataset.drop(columns=['MPG'])
    # Split the dataset into training (75%) and test (25%) sets
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
    

This has created the following four data structures:

  • X_train: A data structure containing the features of the training data
  • X_test: A data structure containing the features of the testing data
  • y_train: A vector containing the values of the label in the training dataset
  • y_test: A vector containing the values of the label in the testing dataset
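To see what the preparation steps do without downloading the CSV, the following sketch reproduces the same ideas in plain Python on a few hypothetical rows. It mimics the behavior of `errors='coerce'`, `fillna(0)`, and a shuffled 75/25 split, but it is not how pandas or scikit-learn implement them, and it will not select the same rows as `train_test_split`. The helper names, the `'?'` placeholder, and all values are assumptions made for illustration.

```python
import math
import random


def to_number(value):
    """Mimic errors='coerce': values that cannot be parsed become NaN."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return math.nan


def split_rows(rows, test_size=0.25, seed=0):
    """Shuffle the rows reproducibly, then hold out a fraction for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical raw rows; '?' stands in for an unparseable entry
raw_rows = [
    {"MPG": "18", "CYLINDERS": "8", "HORSEPOWER": "130"},
    {"MPG": "25", "CYLINDERS": "4", "HORSEPOWER": "?"},
    {"MPG": "15", "CYLINDERS": "8", "HORSEPOWER": "165"},
    {"MPG": "32", "CYLINDERS": "4", "HORSEPOWER": "70"},
    {"MPG": "22", "CYLINDERS": "6", "HORSEPOWER": "100"},
    {"MPG": "27", "CYLINDERS": "4", "HORSEPOWER": "88"},
    {"MPG": "14", "CYLINDERS": "8", "HORSEPOWER": "215"},
    {"MPG": "29", "CYLINDERS": "4", "HORSEPOWER": "75"},
]

# Coerce every value to a number, then impute NaN with 0
clean_rows = []
for row in raw_rows:
    numeric = {name: to_number(value) for name, value in row.items()}
    clean_rows.append({name: 0.0 if math.isnan(v) else v for name, v in numeric.items()})

# Separate the features from the label in each partition
train_rows, test_rows = split_rows(clean_rows)
X_train = [{k: v for k, v in r.items() if k != "MPG"} for r in train_rows]
y_train = [r["MPG"] for r in train_rows]
X_test = [{k: v for k, v in r.items() if k != "MPG"} for r in test_rows]
y_test = [r["MPG"] for r in test_rows]

print(len(X_train), len(X_test))
```

Fixing the seed plays the same role as `random_state=0` in the original code: every regressor is trained and evaluated on exactly the same partitions, which keeps the comparison fair.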

Now, let’s use the prepared data on three different regressors so that we can compare their performance.
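Comparing regressors requires a single error measure computed on the same test set. This excerpt does not state which measure is used, so the sketch below assumes root mean squared error (RMSE), a common choice for regression. The `rmse` helper and the prediction values are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: lower values mean a better fit."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical MPG labels and the predictions of two regressors
y_test = [20.0, 30.0, 25.0]
predictions_a = [22.0, 28.0, 25.0]
predictions_b = [20.0, 33.0, 21.0]

rmse_a = rmse(y_test, predictions_a)
rmse_b = rmse(y_test, predictions_b)
print(rmse_a, rmse_b)
```

Because RMSE is expressed in the units of the label, a value here reads directly as a typical error in miles per gallon.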
