Scikit-Learn API
The objective of the scikit-learn API is to provide an efficient and unified syntax that makes machine learning accessible to non-experts, as well as to facilitate and popularize its use across industries.
How Does It Work?
Although it has many contributors, the scikit-learn API was built, and continues to be updated, around a set of principles that prevent code proliferation, where different pieces of code perform similar functionality. Instead, it promotes simple conventions and consistency. Thanks to this, the scikit-learn API is consistent across all models, and once its main functionality has been learned, it can be used widely.
The scikit-learn API is divided into three complementary interfaces that share a common syntax and logic: the estimator, the predictor, and the transformer. The estimator interface is used for creating models and fitting them to the data; the predictor, as the name suggests, is used to make predictions based on previously trained models; and finally, the transformer is used for converting data.
Estimator
This is considered the core of the entire API, as it is the interface in charge of fitting models to the input data. It works by initializing the model to be used and then applying a fit() method, which triggers the learning process that builds a model based on the data.
The fit() method receives the training data as two separate arguments: the features matrix and the target matrix (conventionally called X_train and Y_train). For unsupervised models, the method only takes in the first argument (X_train).
This call produces a model trained on the input data, which can later be used for making predictions.
Besides the training data, most models also take other arguments, known as hyperparameters. These are set to their default values when the model is initialized, but can be tuned to improve the performance of the model, which will be discussed in further sections.
The following is an example of a model being trained:
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X_train, Y_train)
First, the type of algorithm to be used must be imported from scikit-learn; in this example, a Gaussian Naïve Bayes algorithm for classification. It is good practice to import only the algorithm to be used, rather than the entire library, as this keeps the namespace clean and avoids loading modules that are not needed.
Note
To find out the syntax for importing a different model, use the scikit-learn documentation. Go to the following link, click on the algorithm that you wish to implement, and you will find the instructions there: http://scikit-learn.org/stable/user_guide.html.
The second line of code initializes the model and stores it in a variable. Lastly, the model is fit to the input data.
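If a hyperparameter needs to be adjusted, it is passed when the model is initialized. The following is a minimal sketch using the same model (var_smoothing is a real GaussianNB hyperparameter whose default is 1e-9; the value shown here is purely illustrative, not a recommendation):
from sklearn.naive_bayes import GaussianNB
# Hyperparameters are set at initialization time, not at fitting time
model = GaussianNB(var_smoothing=1e-8)
model.fit(X_train, Y_train)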
In addition to this, the estimator also offers other complementary tasks, as follows:
- Feature extraction, which involves transforming input data into numerical features that can be used for machine learning purposes
- Feature selection, which selects the features in your data that most contribute to the prediction output of the model
- Dimensionality reduction, which takes higher-dimensional data and converts it into a lower dimension (a brief sketch of this task follows the list)
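For instance, dimensionality reduction can be performed with scikit-learn's PCA estimator, which follows the same initialize-then-fit convention. A minimal sketch (reducing to two components is an arbitrary choice for illustration):
from sklearn.decomposition import PCA
# Initialize the estimator with the target dimensionality
pca = PCA(n_components=2)
# Learn the projection from the training data and apply it in one step
X_train_reduced = pca.fit_transform(X_train)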
Predictor
As explained previously, the predictor takes the model created by the estimator and extends it to perform predictions on unseen data. In general terms, for supervised models, it feeds the model a new set of data, usually called X_test, to get a corresponding target or label based on the parameters learned during the training of the model.
Moreover, some unsupervised models can also benefit from the predictor. While the method does not output a specific target value in this case, it can be useful for assigning a new instance to a cluster.
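For example, a k-means model trained with scikit-learn can assign new instances to the clusters it learned. A minimal sketch (the number of clusters here is arbitrary and only for illustration):
from sklearn.cluster import KMeans
# Fit the clustering model to the training data
kmeans = KMeans(n_clusters=3, random_state=0)
kmeans.fit(X_train)
# predict() assigns each new instance to the nearest learned cluster
clusters = kmeans.predict(X_test)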
Following the preceding example, the implementation of the predictor can be seen as follows:
Y_pred = model.predict(X_test)
We apply the predict() method to the previously trained model, and input the new data as an argument to the method.
In addition to predicting, the predictor can also implement methods that quantify the confidence of the prediction, also called the performance of the model. These confidence functions vary from model to model, but their main objective is to determine how far the predictions are from reality. This is done by taking an X_test with its corresponding Y_test and comparing the latter to the predictions made with the same X_test.
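For instance, classifiers such as the Gaussian Naïve Bayes model used earlier provide a score() method for this comparison, and many also expose per-prediction probability estimates. A brief sketch:
# score() compares the predictions for X_test against Y_test and returns
# a single performance value (mean accuracy for classifiers)
accuracy = model.score(X_test, Y_test)
# predict_proba() returns per-class probability estimates, quantifying
# the confidence of each individual prediction
probabilities = model.predict_proba(X_test)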
Transformer
As we saw previously, data is usually transformed before being fed to a model. Considering this, the API contains a transform() method that allows you to perform some preprocessing techniques.
It can be used both as a starting point to transform the input data of the model (X_train), as well as further along to modify the data that will be fed to the model for predictions. This latter application is crucial for getting accurate results, as it ensures that the new data follows the same distribution as the data that was used to train the model.
The following is an example of a transformer that normalizes the values of the training data:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
As you can see, after being imported and initialized, the transformer needs to be fit to the data before it can effectively transform it. The same fitted scaler is then used to transform the test data:
X_test = scaler.transform(X_test)
The advantage of the transformer is that once it has been fit to the training dataset, it stores the values that were used to transform the training data; these can then be used to transform the test dataset to the same distribution.
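In the case of StandardScaler, the stored values are the per-feature mean and standard deviation learned during fit(), which are exposed as attributes of the fitted scaler:
# Statistics learned from the training data during fit()
print(scaler.mean_)   # per-feature mean of the training data
print(scaler.scale_)  # per-feature standard deviation of the training data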
In conclusion, we discussed one of the main benefits of using scikit-learn, which is its API. This API follows a consistent structure that makes it easy for non-experts to apply machine learning algorithms.
To model an algorithm in scikit-learn, the first step is to initialize the model class and fit it to the input data using the estimator, which is done by calling the fit() method of the class. Then, once the model has been trained, it is possible to predict new values using the predictor, by calling the predict() method of the class.
Additionally, scikit-learn also has a transformer interface that allows you to transform data as needed. This is useful for performing preprocessing methods on the training data, which can then also be used to transform the testing data so that it follows the same distribution.
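To tie the three interfaces together, the following is a minimal end-to-end sketch; it uses scikit-learn's bundled Iris dataset purely for illustration:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB

# Load a sample dataset and split it into training and test sets
X, Y = load_iris(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)

# Transformer: learn the scaling from the training data, apply it to both sets
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Estimator: fit the model to the training data
model = GaussianNB()
model.fit(X_train, Y_train)

# Predictor: predict labels for the test set and measure performance
Y_pred = model.predict(X_test)
print(model.score(X_test, Y_test))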