What is ML?
ML and artificial intelligence (AI) have almost become buzzwords in recent years. Everyone must have heard about AI in the news after AlphaGo from Google beat the best Go player in the world. There is no doubt that ML is now one of the most popular and advanced areas of research and applications. So, what exactly is ML?
Let's imagine that we are building an application to automatically recognize rock/paper/scissors based on video inputs from a camera. The hand gesture from the user will be recognized by the computer as one of rock/paper/scissors.
Let's look at the differences between ML and traditional programming in solving this problem.
In traditional programming, the working process usually goes like this:
- Software development: Product managers and software engineers work together to understand business requirements and transform them into detailed business rules. Then, software engineers write the code to transform business rules into computer programs. This stage is shown as process 1 in the following diagram.
- Software usage: Computer software transforms users' input to output. This stage is shown as process 2 in the following diagram:
Let's go back to our rock/paper/scissors application. If we use a traditional programming methodology, it will be very difficult to recognize the position of hands and boundaries of the fingers, not to mention that even the same gesture can evolve into many different representations, including the position of the hand, different sizes and shapes of hands and fingers, different skin colors, and so on. In order to solve all these problems, the source code will be very cumbersome, the logic will become very complicated, and it will become almost impossible to maintain and update the solution. In reality, probably no one can accomplish their target with traditional programming methodology.
On the other hand, in ML, the working process usually follows this pattern:
- Software development: The ML algorithm infers hidden business rules by learning from training data and encodes the business rules into models with lots of weight parameters. Process 1 in the following diagram shows the data flow.
- Software usage: The model transforms users' input to output. In the following diagram, process 2 corresponds to this stage:
There are a few types of ML algorithms: supervised learning (SL), unsupervised learning (UL), and reinforcement learning (RL). In NLP, the most useful and most common algorithms belong to SL, so let's focus on this learning algorithm.
Supervised learning (SL)
An SL algorithm builds a mathematical model of a set of data that contains both the inputs (x) and the expected outputs (y). The algorithm's input data is also known as training data, composed of a set of training examples. The SL algorithm learns a function or a mapping from inputs to outputs of training data. Such a function or mapping is called a model. A model can be used to predict outputs associated with new inputs.
The algorithm used for our rock/paper/scissors application is an SL algorithm. More specifically, this is a classification task. Classification is a task that requires algorithms to learn how to assign (limited) class labels to examples—for example, classifying emails as "spam" or "non-spam" is a classification task. More specifically, it divides data into two categories, so it is a binary classification task. The rock/paper/scissors application in this example divides the picture into three categories, so, to be more specific, it belongs to a multi-class classification task. The opposite of a classification task is a regression task, which predicts a continuous quantity output for each example—for example, predicting future house prices in a certain area is a regression task.
Our application's training data contains the data (the image) and a label (one of rock/paper/scissors), which are the input and output (I/O) of the SL algorithm. The data consists of many pictures. As the example in the following screenshot shows, each picture is simply a big matrix of pixel values for the algorithm to consume, and the label of the picture is rock or paper or scissors for the hand gesture in the picture:
Now we understand what an SL algorithm is, in the next section, we will cover the general process of ML.
Stages of machine learning
There are three basic stages of applying ML algorithms: training, inference, and evaluation. Let's look at these stages in more detail here:
- Training stage: The training stage is when the algorithms learn knowledge or business rules from training data. As shown in process 1 in Figure 1.2, the input of the training stage is training data, and the output of the training stage is the model.
- Inference stage: The inference stage is when we use a model to compute the output label of a new input data. The input of this stage is the new input data without labels, and the output is the most likely label.
- Evaluation stage: In a serious application, we always want to know how good a model is before we use it in production. This is a stage called evaluation. The evaluation stage will measure the model's performance in various ways and can help users to compare models.
In the next section, we will introduce how to measure model performance.
Performance metrics
In NLP, most problems can be viewed as classification problems. A key concept in classification performance is a confusion matrix, on which almost all other performance metrics are based.
A confusion matrix is a table of the model predictions versus the ground-truth labels.
Let me give you a specific example. Assume we are building a binary classification to classify whether an image is a cat image or not. When the image is a cat image, we call it a positive. Remember—we are building an application to detect cats, so a cat image is a positive result for our system, and if it is not a cat image (in our case, it's a dog image), we call it a negative. Our test data has 10 images. The real label of test data is listed as follows, where the cat image represents a cat and the dog image represents a dog:
The prediction result of our model is shown here:
The confusion matrix of our case would look like this:
In this confusion matrix, there are five cat images, and the model predicts that one of them is a dog. This is an error, and we call it a false negative (FN) because the model says it is a negative result, but that is actually incorrect. And in the five dog images, the model predicts that two of these are cats. This is another error, and we call it a false positive (FP) because the model says it is a positive result but it's actually incorrect. All correct predictions belong to one of two cases: cats-to-cats prediction, which we call a true positive (TP), and dogs-to-dogs prediction, which we call a true negative (TN).
So, the preceding confusion matrix can be viewed as an instance of the following abstract confusion matrix:
Many important performance metrics are derived from a confusion matrix. Here, we will introduce some of the most important ones, as follows:
- Accuracy (ACC):
- Recall:
- Precision:
- F1 score:
Among the preceding metrics, the F1 score is the combined advantage of recall and precision, so it is the most commonly used metric for now.
In the next section, we will talk about the root cause of poor performance (the performance metrics being low): overfitting and underfitting.
Overfitting and underfitting
Generally speaking, there are two types of errors found in ML models: overfitting and underfitting.
When a model performs poorly on the training data, we call it underfitting. Common reasons that can lead to underfitting include the following:
- The algorithm is too simple. It does not have enough power to capture the complexity of the training data. For algorithms based on neural networks, there are too few hidden layers.
- The network architecture or features used for training is not suitable for the task—for example, models based on bag-of-words (BoW) are not suitable for complex NLP tasks. In these tasks, the order of words is critical, but a BoW model completely discards this information.
- Training a model for too few epochs (a full training pass over the entire training data so that each example has been seen once) or at too low a learning rate (a scalar used to train a model via gradient descent, which can determine the degree of weight changes).
- Using a too-high regularization rate (a scale used to indicate the penalty degree on a model's complexity; the penalty can reduce the power of fitting) to train a model.
When a model performs very well on the training data but performs poorly on new data that it has never seen before, we call this overfitting. Overfitting means the algorithm has the ability to fit the training data well, but it does not generalize well to samples that are not in the training data. Generalization is the most important key feature of ML. It means that algorithms learn some key concepts from training data rather than just simply remembering them. When overfitting happens, it shows that the model is more likely to remember what it saw in training than learn from it, so it performs very well on the training data, but since it does not see the new data before and does not learn the concept well, it thus performs poorly on the new data. ML scientists have already developed various methods against overfitting, such as adding more training data, regularization, dropout, and stopping early.
In the next section, we will introduce TL, which is very useful when the training data is insufficient (this is a common situation).
Transfer learning (TL)
TL is a method where a model can use knowledge from another model for another task.
TL is popular in the chatbot domain. There are many reasons for this, and some of them are listed here:
- TL needs less training data: In a chatbot domain, there usually is not much training data. When using a traditional ML method to train a model, it usually does not perform well due to a lack of training data. With TL, we can achieve much better performance on the same amount of training data. The less data you have, the more performance increase you can get.
- TL makes training faster: TL only needs a few training epochs to fine-tune a model for a new task. Generally, it is much faster than the traditional ML method and makes the whole development process more efficient.
Now we understand what ML is, in the next section, we will cover the basics of NLP.