While previous chapters focused heavily on removing features that were not helping our machine learning pipelines, this chapter looks at techniques for creating brand-new features and placing them correctly within our dataset. Ideally, these new features will hold new information and expose new patterns that ML pipelines can exploit to increase performance.
These created features can come from many places. Oftentimes, we will create new features out of the existing features given to us, by applying transformations to those features and placing the resulting vectors alongside the originals. We will also look at adding new features from separate, third-party systems. As an example, if we are working with data to cluster people based on shopping behaviors, then we might benefit from adding census data that is separate from the corporation and its purchasing data. However, this presents a few problems:
- If the census is aware of 1,700 John Does and the corporation only knows 13, how do we know which of the 1,700 people match up to the 13? This is called entity matching.
- The census data would be quite large, and entity matching would take a very long time.
These problems and more make for a fairly difficult procedure, but the result is often a very dense and data-rich environment.
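To make the entity matching problem concrete, here is a minimal sketch using only the Python standard library. The records, field names (`census_id`, `customer_id`, `name`, `zip`), and the `match_candidates` helper are all hypothetical and invented for illustration; real entity matching usually needs fuzzier comparisons than exact key equality.

```python
from collections import defaultdict

# Hypothetical records: these datasets and field names are assumptions
# made for illustration, not data from the chapter.
census = [
    {"census_id": 1, "name": "John Doe", "zip": "10001"},
    {"census_id": 2, "name": "John Doe", "zip": "94107"},
    {"census_id": 3, "name": "Jane Roe", "zip": "60614"},
]
customers = [
    {"customer_id": "A", "name": "John Doe", "zip": "94107"},
    {"customer_id": "B", "name": "Jane Roe", "zip": "60614"},
]


def match_candidates(customers, census, keys):
    """Map each customer_id to the census_ids that agree on every key."""
    # Index the (large) census once so each customer lookup is cheap.
    index = defaultdict(list)
    for row in census:
        index[tuple(row[k] for k in keys)].append(row["census_id"])
    return {
        c["customer_id"]: index.get(tuple(c[k] for k in keys), [])
        for c in customers
    }


# Matching on name alone leaves customer "A" with two candidates.
by_name = match_candidates(customers, census, keys=["name"])
# Adding a second key narrows "A" down to a single census record.
by_name_zip = match_candidates(customers, census, keys=["name", "zip"])

print(by_name)
print(by_name_zip)
```

Matching on the name alone is ambiguous for the customer named John Doe, while adding a second attribute resolves it in this toy case. Building the index up front is one way to keep the lookup cost manageable when the census side is very large.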
In this chapter, we will also take some time to talk about the manual creation of features from highly unstructured data. Two big examples are text and images. By themselves, these pieces of data are incomprehensible to machine learning and artificial intelligence pipelines, so it is up to us to manually create features that represent the images or pieces of text. As a simple example, imagine that we are building the basics of a self-driving car and, to start, we want a model that can take in an image of what the car sees in front of it and decide whether or not it should stop. The raw image is not good enough, because a machine learning algorithm would have no idea what to do with it. We have to manually construct features out of it. Given this raw image, we can split it up in a few ways:
- We could treat each pixel as an attribute whose value is that pixel's color intensity:
- For example, if the camera of the car produces images of 2,048 x 1,536 pixels, we would have 3,145,728 columns
- We could treat each row of pixels as an attribute, with the average color of that row as its value:
- In this case, there would only be 1,536 columns
- We could project this image into a space where features represent objects within the image. This is the hardest of the three and would look something like this:
| Stop sign | Cat | Sky | Road | Patches of grass | Submarine |
|---|---|---|---|---|---|
| 1 | 0 | 1 | 1 | 4 | 0 |
Here, each feature is an object that may or may not be within the image, and the value represents the number of times that object appears in the image. If a model were given this information, it would be a fairly good idea to stop!
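The first two options can be sketched with NumPy. This is only an illustration under stated assumptions: the image is synthetic random data, it is treated as single-channel (grayscale) so that each pixel has one intensity value, and the variable names are hypothetical. The third option would require an object detection model and is not shown.

```python
import numpy as np

# Assumption: a synthetic grayscale image stands in for the car's camera
# frame. 2,048 x 1,536 pixels means 1,536 rows of 2,048 pixels each.
rng = np.random.default_rng(0)
height, width = 1536, 2048
image = rng.integers(0, 256, size=(height, width), dtype=np.uint8)

# Option 1: every pixel intensity becomes its own attribute.
pixel_features = image.reshape(-1)

# Option 2: every row of pixels becomes one attribute, whose value is
# the average intensity across that row.
row_features = image.mean(axis=1)

print(pixel_features.shape)  # (3145728,)
print(row_features.shape)    # (1536,)
```

The per-pixel representation keeps all of the raw information at the cost of more than three million columns, while the per-row average is far more compact but discards where along each row the intensity occurs.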