Machine Learning Fundamentals

Use Python and scikit-learn to get up and running with the hottest developments in machine learning

Author: Hyatt Saleh
Product type: Paperback
Published in: Nov 2018
ISBN-13: 9781789803556
Length: 240 pages
Edition: 1st Edition

Scikit-Learn API

The objective of the scikit-learn API is to provide an efficient and unified syntax to make machine learning accessible to non-machine learning experts, as well as to facilitate and popularize its use among several industries.

How Does It Work?

Although it has many contributors, the scikit-learn API has been built and maintained according to a set of principles that prevent framework code proliferation, where different pieces of code perform similar functions. Instead, it promotes simple conventions and consistency. As a result, the scikit-learn API is consistent across all models, and once its main functionalities have been learned, they can be applied widely.

The scikit-learn API is divided into three complementary interfaces that share a common syntax and logic: the estimator, the predictor, and the transformer. The estimator interface is used for creating models and fitting them to data; the predictor, as the name suggests, is used to make predictions with previously trained models; and finally, the transformer is used for converting data.
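As a rough illustration of how the three interfaces map onto methods, the following sketch checks which of fit(), predict(), and transform() two common scikit-learn objects expose. The variable names model_methods and scaler_methods are made up for this example:

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler

model = GaussianNB()        # a classifier: estimator and predictor
scaler = StandardScaler()   # a preprocessing step: estimator and transformer

# Check which of the three interface methods each object exposes
interface_methods = ("fit", "predict", "transform")
model_methods = {name: hasattr(model, name) for name in interface_methods}
scaler_methods = {name: hasattr(scaler, name) for name in interface_methods}

print(model_methods)
print(scaler_methods)
```

Both objects can be fit to data, but only the classifier predicts and only the scaler transforms.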

Estimator

This is considered to be the core of the entire API, as it is the interface in charge of fitting the models to the input data. It works by initializing the model to be used, and then applying a fit() method that triggers the learning process to build a model based on the data.

The fit() method receives the training data as arguments, in two separate variables: the features matrix and the target values (conventionally called X_train and Y_train). For unsupervised models, the method only takes the first argument (X_train).

This method produces a model trained on the input data, which can later be used for predicting.

Some models take other arguments besides the training data; these are called hyperparameters and are passed when the model is initialized. They are initially set to their default values, but can be tuned to improve the performance of the model, which will be discussed in further sections.
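As a minimal sketch of how hyperparameters are set, the following uses a k-nearest neighbors classifier, chosen here only for illustration because its n_neighbors hyperparameter is easy to read. The variable names are assumptions of this example:

```python
from sklearn.neighbors import KNeighborsClassifier

# With no arguments, every hyperparameter keeps its default value
default_model = KNeighborsClassifier()
default_neighbors = default_model.get_params()["n_neighbors"]
print(default_neighbors)

# A hyperparameter can be overridden when the model is initialized
tuned_model = KNeighborsClassifier(n_neighbors=3)
print(tuned_model.get_params()["n_neighbors"])

# It can also be changed later, before the model is fit again
tuned_model.set_params(n_neighbors=7)
print(tuned_model.get_params()["n_neighbors"])
```

The get_params() and set_params() methods are available on every scikit-learn estimator, so the same pattern applies to other models.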

The following is an example of a model being trained:

from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X_train, Y_train)

First, you must import the type of algorithm to be used from scikit-learn, for example, a Gaussian Naïve Bayes algorithm for classification. It is good practice to import only the algorithm to be used, and not the entire library, as this keeps the code concise and makes it clear which algorithm it depends on.

Note

To find the syntax for importing a different model, consult the scikit-learn documentation. Go to the following link, click on the algorithm that you wish to implement, and you will find the instructions there: http://scikit-learn.org/stable/user_guide.html.

The second line of code initializes the model and stores it in a variable. Lastly, the third line fits the model to the input data.
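To make the difference between supervised and unsupervised fitting concrete, here is a small self-contained sketch. The toy data is made up for illustration, and KMeans is used only as one example of an unsupervised model:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.naive_bayes import GaussianNB

# Toy data (made up for illustration): two well-separated groups of points
X_train = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.3],
                    [5.0, 5.1], [5.2, 4.9], [4.9, 5.3]])
Y_train = np.array([0, 0, 0, 1, 1, 1])

# Supervised: fit() receives the features and the targets
classifier = GaussianNB()
classifier.fit(X_train, Y_train)
print(classifier.classes_)                # class labels seen during training

# Unsupervised: fit() receives only the features
clusterer = KMeans(n_clusters=2, n_init=10, random_state=0)
clusterer.fit(X_train)
print(clusterer.cluster_centers_.shape)   # one center per cluster
```

In both cases, what the model learned is stored on the object in attributes whose names end with an underscore.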

In addition to this, the estimator also offers other complementary tasks, as follows:

  • Feature extraction, which involves transforming input data into numerical features that can be used for machine learning purposes
  • Feature selection, which selects the features in your data that most contribute to the prediction output of the model
  • Dimensionality reduction, which takes higher-dimensional data and converts it into a lower-dimensional representation
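The three tasks above can be sketched as follows. The specific classes (CountVectorizer, SelectKBest, and PCA), the sample sentences, and the Iris dataset are choices made for this illustration rather than ones prescribed by the text:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, f_classif

# Feature extraction: turn raw text into a matrix of word counts
documents = ["red apple", "green apple"]   # made-up sample text
vectorizer = CountVectorizer()
word_counts = vectorizer.fit_transform(documents)
print(word_counts.shape)                   # (documents, distinct words)

X, Y = load_iris(return_X_y=True)          # 150 samples, 4 features

# Feature selection: keep the 2 features most related to the target
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, Y)
print(X_selected.shape)

# Dimensionality reduction: project the 4 features onto 2 components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)
```

Each of these objects is fit to the data first, which is why they are grouped with the estimator here; fit_transform() simply fits and transforms in a single call.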

Predictor

As explained previously, the predictor takes the model created by the estimator and extends it to perform predictions on unseen data. In general terms, for supervised models, it feeds the model a new set of data, usually called X_test, to get a corresponding target or label based on the parameters learned during the training of the model.

Moreover, some unsupervised models can also benefit from the predictor. Although the method does not output a target value in this case, it can be used to assign a new instance to a cluster.
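As a minimal sketch of the unsupervised case, the following fits a KMeans model (picked here as an example of a clustering model that offers predict()) on made-up data and then assigns two new instances to the learned clusters:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (made up for illustration): two well-separated groups of points
X_train = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.3],
                    [5.0, 5.1], [5.2, 4.9], [4.9, 5.3]])

model = KMeans(n_clusters=2, n_init=10, random_state=0)
model.fit(X_train)

# New instances: one near each group
X_new = np.array([[0.1, 0.1], [5.1, 5.0]])
clusters = model.predict(X_new)
print(clusters)   # cluster index assigned to each new instance
```

The cluster indices themselves are arbitrary; what matters is that each new instance receives the same index as the training points it lies closest to.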

Following the preceding example, the implementation of the predictor can be seen as follows:

Y_pred = model.predict(X_test)

We apply the predict() method to the previously trained model, and input the new data as an argument to the method.

In addition to predicting, the predictor can also implement methods that quantify the confidence of the predictions or the performance of the model. These functions vary from model to model, but their main objective is to determine how far the predictions are from reality. This is done by taking an X_test with its corresponding Y_test and comparing Y_test to the predictions made from the same X_test.
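The following sketch puts prediction and evaluation together on the Iris dataset, which is an assumption of this example and not a dataset used in the text. It relies on predict_proba() for per-class probabilities and on score(), which for classifiers returns the mean accuracy:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, Y = load_iris(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=0)

model = GaussianNB()
model.fit(X_train, Y_train)

# Predicted label for each unseen instance
Y_pred = model.predict(X_test)

# Estimated probability of each class for each unseen instance
probabilities = model.predict_proba(X_test)

# Compare the predictions against the true labels
accuracy = accuracy_score(Y_test, Y_pred)

# score() makes the predictions and compares them in a single call
same_accuracy = model.score(X_test, Y_test)
print(accuracy, same_accuracy)
```

Not every model offers predict_proba(), so it is worth checking the documentation of the algorithm you are using.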

Transformer

As we saw previously, data is usually transformed before being fed to a model. Considering this, the API contains a transform() method that allows you to perform some preprocessing techniques.

It can be used both as a starting point, to transform the input data of the model (X_train), and further along, to modify the data that will be fed to the model for predictions. This latter application is crucial to get accurate results, as it ensures that the new data is transformed in the same way as the data used to train the model.

The following is an example of a transformer that standardizes the values of the training data:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)

As you can see, after importing and initializing the transformer, it needs to be fit to the data before it can transform it. The same fitted transformer is then used to transform the test data:

X_test = scaler.transform(X_test)

The advantage of the transformer is that once it has been fit to the training dataset, it stores the values used for transforming the training data; these can then be used to transform the test dataset in the same way.
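To see what the transformer stores, the following sketch fits a StandardScaler on a tiny made-up dataset and inspects its mean_ and scale_ attributes. The variable manual is introduced here only to show that the test data is scaled with the training statistics:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (made up for illustration)
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, 40.0]])

scaler = StandardScaler()
scaler.fit(X_train)
print(scaler.mean_)    # per-feature means of the training data
print(scaler.scale_)   # per-feature standard deviations of the training data

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# The test data is scaled with the statistics learned from the training data
manual = (X_test - scaler.mean_) / scaler.scale_
print(np.allclose(X_test_scaled, manual))
```

Note that the scaler is never fit on X_test; doing so would scale the test data with different statistics from the ones the model was trained on.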

In conclusion, we discussed one of the main benefits of using scikit-learn, which is its API. This API follows a consistent structure that makes it easy for non-experts to apply machine learning algorithms.

To train a model with scikit-learn, the first step is to initialize the model class and fit it to the input data through the estimator interface, which is done by calling the fit() method of the class. Then, once the model has been trained, it is possible to predict new values through the predictor interface by calling the predict() method of the class.

Additionally, scikit-learn has a transformer interface that allows you to transform data as needed. This is useful for applying preprocessing methods to the training data, which can then also be applied to the testing data so that both are transformed in the same way.
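As a closing sketch, the three interfaces can be combined in a few lines. The Wine dataset and the split settings are assumptions made for this example:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler

X, Y = load_wine(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=0)

# Transformer: fit on the training data, then transform both sets
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Estimator: initialize the model and fit it to the training data
model = GaussianNB()
model.fit(X_train, Y_train)

# Predictor: predict labels for unseen data and measure the accuracy
Y_pred = model.predict(X_test)
test_accuracy = model.score(X_test, Y_test)
print(test_accuracy)
```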
