Tokenization
Tokenization is the process of breaking down a sequence of text into smaller units, or tokens, which can be words, subwords, or characters. This process is essential for converting text into a format suitable for computational processing, enabling models to learn patterns at a finer granularity.
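To make this concrete, here is a minimal sketch of word-level and character-level tokenization in plain Python. The sample sentence and the naive whitespace split are purely illustrative; real tokenizers handle punctuation, casing, and subwords far more carefully.

```python
# A minimal sketch of word-level and character-level tokenization.
# (Illustrative only: production tokenizers use trained subword
# algorithms such as BPE or WordPiece rather than a simple split.)

text = "Tokenization breaks text into smaller units."

word_tokens = text.split()   # naive word-level tokenization
char_tokens = list(text)     # character-level tokenization

print(word_tokens)
# ['Tokenization', 'breaks', 'text', 'into', 'smaller', 'units.']
print(char_tokens[:12])
# ['T', 'o', 'k', 'e', 'n', 'i', 'z', 'a', 't', 'i', 'o', 'n']
```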
Some key terms in the tokenization phase are the vocabulary and unique identifiers (IDs). The vocabulary is a fixed set of tokens that a model knows. It can include words, subwords, punctuation, and special tokens (such as [CLS] for classification, [SEP] for separation, etc.). Each token in the vocabulary is assigned an ID, which the model uses to represent the token internally. These IDs are integers and typically range from 0 to the size of the vocabulary minus one.
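The snippet below sketches this token-to-ID mapping with a toy vocabulary. The tokens and IDs are made up for illustration (real vocabularies contain tens of thousands of entries, and the IDs of [CLS], [SEP], etc. vary by model).

```python
# A toy vocabulary mapping tokens to integer IDs, including special
# tokens. The specific entries and IDs are illustrative only.

vocab = {
    "[PAD]": 0,
    "[UNK]": 1,
    "[CLS]": 2,
    "[SEP]": 3,
    "token": 4,
    "##ization": 5,   # a WordPiece-style subword piece
    "is": 6,
    "fun": 7,
}

tokens = ["[CLS]", "token", "##ization", "is", "fun", "[SEP]"]
ids = [vocab[t] for t in tokens]
print(ids)  # [2, 4, 5, 6, 7, 3]
```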
Can all the words in the world fit into a vocabulary? The answer is no! Words that are not present in the model's vocabulary are called out-of-vocabulary (OOV) words.
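Continuing the toy vocabulary above, here is a sketch of what happens when a word is missing from the vocabulary: the lookup falls back to a special unknown token, here [UNK]. (This fallback is an illustrative assumption; subword tokenizers mitigate the OOV problem by splitting unseen words into known pieces rather than discarding them.)

```python
# Looking up tokens against the toy vocabulary above: an
# out-of-vocabulary (OOV) word falls back to the [UNK] ID.

tokens = ["[CLS]", "token", "xylophone", "[SEP]"]
ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]
print(ids)  # [2, 4, 1, 3] -- 'xylophone' maps to [UNK]
```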
Now that we are familiar with the main terms, let's explore...