Supervised and Unsupervised Learning
Machine learning is divided into two main categories: supervised and unsupervised learning.
Supervised Learning
Supervised learning consists of understanding the relation between a given set of features and a target value, also known as a label or class. For instance, it can be used for modeling the relationship between a person's demographic information and their ability to pay loans, as shown in the following table:
Figure 1.19: A table depicting the relationship between a person's demographic information and the ability to pay loans
Models trained to foresee these relationships can then be applied to predict labels for new data. As we can see from the preceding example, a bank that builds such a model can then input data from loan applicants to determine if they are likely to pay back the loan.
These models can be further divided into classification and regression tasks, which are explained as follows.
Classification tasks are used to build models out of data with discrete categories as labels; for instance, a classification task can be used to predict whether a person will pay a loan. You can have more than two discrete categories, such as predicting the ranking of a horse in a race, but they must be a finite number.
Most classification tasks output the prediction as the probability of an instance to belong to each output label. The assigned label is the one with the highest probability, as can be seen in the following diagram:
Figure 1.20: An illustration of the working of a classification algorithm
Some of the most common classification algorithms are as follows:
- Decision trees: This algorithm follows a tree-like architecture that simulates the decision process given a previous decision.
- Naïve Bayes classifier: This algorithm relies on a group of probabilistic equations based on Bayes' theorem, which assumes independence among features. It has the ability to consider several attributes.
- Artificial neural networks (ANNs): These replicate the structure and performance of a biological neural network to perform pattern recognition tasks. An ANN consists of interconnected neurons, laid out with a set architecture. They pass information to one another until a result is achieved.
Regression tasks, on the other hand, are used for data with continuous quantities as labels; for example, a regression task can be used for predicting house prices. This means that the value is represented by a quantity and not by a set of possible outputs. Output labels can be of integer or float types:
- The most popular algorithm for regression tasks is linear regression. It consists of only one independent feature (x) whose relation with its dependent feature (y) is linear. Due to its simplicity, it is often overseen, even though it performs very well for simple data problems.
- Other, more complex regression algorithms include regression trees and support vector regression, as well as ANNs once again.
In conclusion, for supervised learning problems, each instance has a correct answer, also known as a label or class. The algorithms under this category aim to understand the data and then predict the class of a given set of features. Depending on the type of class (continuous or discrete), the supervised algorithms can be divided into classification or regression tasks.
Unsupervised Learning
Unsupervised learning consists of modeling the model to the data, without any relationship with an output label, also known as unlabeled data. This means that algorithms under this category search to understand the data and find patterns in it. For instance, unsupervised learning can be used to understand the profile of people belonging to a neighborhood, as shown in the following diagram:
Figure 1.21: An illustration of how unsupervised algorithms can be used to understand the profiles of people
When applying a predictor over these algorithms, no target label is given as output. The prediction, only available for some models, consists of placing the new instance into one of the subgroups of data that has been created.
Unsupervised learning is further divided into different tasks, but the most popular one is clustering, which will be discussed next.
Clustering tasks involve creating groups of data (clusters) and complying with the condition that instances from other groups differ visibly from the instances within the group. The output of any clustering algorithm is a label, which assigns the instance to the cluster of that label:
Figure 1.22: A diagram representing clusters of multiple sizes
The preceding diagram shows a group of clusters, each of a different size, based on the number of instances that belong to each cluster. Considering this, even though clusters do not need to have the same number of instances, it is possible to set the minimum number of instances per cluster to avoid overfitting the data into tiny clusters of very specific data.
Some of the most popular clustering algorithms are as follows:
- k-means: This focuses on separating the instances into n clusters of equal variance by minimizing the sum of the squared distances between two points.
- Mean-shift clustering: This creates clusters by using centroids. Each instance becomes a candidate for centroid to be the mean of the points in that cluster.
- Density-Based Spatial Clustering of Applications with Noise (DBSCAN): This determines clusters as areas with a high density of points, separated by areas with low density.
In conclusion, unsupervised algorithms are designed to understand data when there is no label or class that indicates a correct answer for each set of features. The most common types of unsupervised algorithms are the clustering methods that allow you to classify a population into different groups.