Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon
Transformers for Natural Language Processing and Computer Vision
Transformers for Natural Language Processing and Computer Vision

Transformers for Natural Language Processing and Computer Vision: Explore Generative AI and Large Language Models with Hugging Face, ChatGPT, GPT-4V, and DALL-E 3 , Third Edition

eBook
$29.99 $43.99
Paperback
$43.98 $54.99
Subscription
Free Trial
Renews at $19.99p/m

What do you get with Print?

Product feature icon Instant access to your digital eBook copy whilst your Print order is Shipped
Product feature icon Paperback book shipped to your preferred address
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
Product feature icon AI Assistant (beta) to help accelerate your learning
OR
Modal Close icon
Payment Processing...
tick Completed

Shipping Address

Billing Address

Shipping Methods
Table of content icon View table of contents Preview book icon Preview Book

Transformers for Natural Language Processing and Computer Vision

Getting Started with the Architecture of the Transformer Model

Language is the essence of human communication. Civilizations would never have been born without the word sequences that form language. We now mostly live in a world of digital representations of language. Our daily lives rely on NLP digitalized language functions: web search engines, emails, social networks, posts, tweets, smartphone texting, translations, web pages, speech-to-text on streaming sites for transcripts, text-to-speech on hotline services, and many more everyday functions.

In December 2017, Google Brain and Google Research published the seminal Vaswani et al., Attention Is All You Need paper. The Transformer was born. The Transformer outperformed the existing state-of-the-art NLP models. The Transformer trained faster than previous architectures and obtained higher evaluation results. As a result, transformers have become a key component of NLP.

Since 2017, transformer models such as OpenAI’s ChatGPT and GPT-4, Google’s PaLM and LaMBDA, and other Large Language Models (LLMs) have emerged. However, this is just the beginning! You need to understand how attention heads work to join this new era of LLM for AI experts.

The idea of the attention head of the Transformer is to do away with recurrent neural network features. In this chapter, we will open the hood of the Original Transformer model described by Vaswani et al. (2017) and examine the main components of its architecture. Then, we will explore the fascinating world of attention and illustrate the key components of the Transformer.

This chapter covers the following topics:

  • The architecture of the Transformer
  • The Transformer’s self-attention model
  • The encoding and decoding stacks
  • Input and output embedding
  • Positional embedding
  • Self-attention
  • Multi-head attention
  • Masked multi-attention
  • Residual connections
  • Normalization
  • Feedforward network
  • Output probabilities

With all the innovations and library updates in this cutting-edge field, packages and models change regularly. Please go to the GitHub repository for the latest installation and code examples: https://github.com/Denis2054/Transformers-for-NLP-and-Computer-Vision-3rd-Edition/tree/main/Chapter02.

You can also post a message in our Discord community (https://www.packt.link/Transformers) if you have any trouble running the code in this or any chapter.

Let’s dive directly into the structure of the original Transformer’s architecture.

The rise of the Transformer: Attention Is All You Need

As mentioned earlier, in December 2017, Vaswani et al. (2017) published their seminal paper, Attention Is All You Need. They performed their work at Google Research and Google Brain. I will refer to the model described in Attention Is All You Need as the “original Transformer model” throughout this chapter and book.

Exploring the architecture of the Transformer is essential for the following reasons:

  • The source code of any machine learning or deep learning model remains a product of classical mathematics.
  • It’s only when the system is put to work that it becomes artificial intelligence.
  • Artificial intelligence is the function of a model built with classical mathematics.
  • The Transformer is no exception. If you look inside the algorithm, you will discover classical mathematics.
  • The magic of AI is not in math but in the behavior of the system when it actually runs and performs tasks.
  • Sometimes, we must get our hands dirty to find ways to optimize the functions implemented in our AI (or any other software).

In this section, we will look at the structure of the Transformer model they built. In the following sections, we will explore what is inside each component of the model.

The Original Transformer model is a stack of six layers. Each layer contains sublayers. The output of layer l is the input of layer l+1 until the final prediction is reached. There is a six-layer encoder stack on the left and a six-layer decoder stack on the right. The following figure shows the breakdown of each encoder (on the left) and decoder layer (on the right):

Figure 2.1: The architecture of the Transformer

On the left, the inputs enter the encoder side of the Transformer through an attention sublayer and a feedforward sublayer. On the right, the target outputs go into the decoder side of the Transformer through two attention sublayers and a feedforward network sublayer. We immediately notice that there is no RNN, LSTM, or CNN. This is because recurrence has been abandoned in this architecture.

Attention has replaced recurrence functions requiring an increase in the number of parameters as the distance between two words increases. The attention mechanism is a “word-to-word” operation. It is actually a token-to-token operation, but we will keep it to the word level to keep the explanation simple. The attention mechanism determines how each word is related to all other words in a sequence, including the word being analyzed itself. For example, let’s examine the following sequence:

The cat sat on the mat.

Attention will run dot products between word vectors and determine the strongest relationships between a given word and all the other words, including itself (“cat” and “cat” ):

Figure 2.2: Attending to all the words

The attention mechanism will provide a deeper relationship between words and produce better results.

For each attention sublayer, the Original Transformer model runs not one but eight attention mechanisms in parallel to speed up the calculations. We will explore this architecture in the following section, The encoder stack. This process is named “multi-head attention,” providing:

  • A broader in-depth analysis of sequences
  • The preclusion of recurrence reducing calculation operations
  • Implementation of parallelization, which reduces training time
  • Each attention mechanism learns different perspectives of the same input sequence

Attention replaced recurrence. However, several other creative aspects of the Transformer are as critical as the attention mechanism, as you will see when we look inside the architecture.

We just looked at the Transformer structure from the outside. Let’s now go into each component of the Transformer. We will start with the encoder.

The encoder stack

The layers of the encoder and decoder of the Original Transformer model are stacks of layers. Each layer of the encoder stack has the following structure:

Figure 2.3: A layer of the encoder stack of the Transformer

The original encoder layer structure remains the same for all N = 6 layers of the Transformer model. Each layer contains two main sublayers: a multi-headed attention mechanism and a fully connected position-wise feedforward network.

Notice that a residual connection surrounds each main sublayer, Sublayer(x), in the Transformer model. These connections transport the unprocessed input x of a sublayer to a layer normalization function. This way, we are certain that key information, such as positional encoding, is not lost on the way. The normalized output of each layer is thus:

LayerNormalization (x + Sublayer(x))

Though the structure of each of the N = 6 layers of the encoder is identical, the content of each layer is not strictly identical to the previous layer through the different weights of each layer.

For example, the embedding sublayer is only present at the bottom level of the stack. The other five layers do not contain an embedding layer, guaranteeing that the encoded input is stable through all the layers.

Also, the multi-head attention mechanisms perform the same functions from layers 1 to 6. However, they do not perform the same tasks. Each layer learns from the previous layer and explores different ways of associating the tokens in the sequence. It looks for various associations of words, just like we look for different associations of letters and words when we solve a crossword puzzle.

The designers of the Transformer introduced a very efficient constraint. The output of every sublayer of the model has a constant dimension, including the embedding layer and the residual connections. This dimension is dmodel and can be set to another value depending on your goals. In the Original Transformer architecture, dmodel = 512.

dmodel has a powerful consequence. Practically all the key operations are dot products. As a result, the dimensions remain stable, which reduces the number of operations to calculate, reduces machine resource consumption, and makes it easier to trace the information as it flows through the model.

This global view of the encoder shows the highly optimized architecture of the Transformer. In the following sections, we will zoom into each of the sublayers and mechanisms.

We will begin with the embedding sublayer.

Input embedding

The input embedding sublayer converts the input tokens to vectors of dimension dmodel = 512 using learned embeddings in the original Transformer model:

Figure 2.4: The input embedding sublayer of the Transformer

The embedding sublayer works like other standard transduction models. A tokenizer will transform a sentence into tokens. Each tokenizer has its methods, such as Byte-Pair Encoding (BPE), word piece, and sentence piece methods. The Transformer initially used BPE, but other models use other methods.

The goals are similar, and the choice depends on the strategy chosen. For example, a tokenizer applied to the sequence the Transformer is an innovative NLP model! will produce the following tokens in one type of model:

['the', 'transform', 'er', 'is', 'an', 'innovative', 'n', 'l', 'p', 'model', '!']

You will notice that this tokenizer normalized the string to lowercase and truncated it into subparts. A tokenizer will generally provide an integer representation that will be used for the embedding process. For example:

text = "The cat slept on the couch.It was too tired to get up."
tokenized text= [1996, 4937, 7771, 2006, 1996, 6411, 1012, 2009, 2001, 2205, 5458, 2000, 2131, 2039, 1012]

There is not enough information in the tokenized text at this point to go further. The tokenized text must be embedded.

The Transformer contains a learned embedding sublayer. As a result, many embedding methods can be applied to the tokenized input.

I chose the skip-gram architecture of the word2vec embedding approach Google made available in 2013 to illustrate the embedding sublayer of the Transformer. A skip-gram will focus on a center word in a window of words and predict context words. For example, if word(i) is the center word in a two-step window, a skip-gram model will analyze word(i-2), word(i-1), word(i+1), and word(i+2). Then, the window will slide and repeat the process. A skip-gram model generally contains an input layer, weights, a hidden layer, and an output containing the word embeddings of the tokenized input words.

Suppose we need to perform embedding for the following sentence:

The blackbrown cat sat on the couch and the  dog slept on the rug.

We will focus on two words, black and brown. The word embedding vectors of these two words should be similar.

Since we must produce a vector of size dmodel = 512 for each word, we will obtain a size 512 vector embedding for each word:

black=[[-0.01206071  0.11632373  0.06206119  0.01403395  0.09541149  0.10695464 0.02560172  0.00185677 -0.04284821  0.06146432  0.09466285  0.04642421 0.08680347  0.05684567 -0.00717266 -0.03163519  0.03292002 -0.11397766 0.01304929  0.01964396  0.01902409  0.02831945  0.05870414  0.03390711 -0.06204525  0.06173197 -0.08613958 -0.04654748  0.02728105 -0.07830904
    …
0.04340003 -0.13192849 -0.00945092 -0.00835463 -0.06487109  0.05862355 -0.03407936 -0.00059001 -0.01640179  0.04123065 
-0.04756588  0.08812257 0.00200338 -0.0931043  -0.03507337  0.02153351 -0.02621627 -0.02492662 -0.05771535 -0.01164199 
-0.03879078 -0.05506947  0.01693138 -0.04124579 -0.03779858 
-0.01950983 -0.05398201  0.07582296  0.00038318 -0.04639162 
-0.06819214  0.01366171  0.01411388  0.00853774  0.02183574 
-0.03016279 -0.03184025 -0.04273562]]

The word black is now represented by 512 dimensions. Other embedding methods could be used, and dmodel could have a higher number of dimensions.

The word embedding of brown is also represented by 512 dimensions:

brown=[[ 1.35794589e-02 -2.18823571e-02  1.34526128e-02  6.74355254e-02
   1.04376070e-01  1.09921647e-02 -5.46298288e-02 -1.18385479e-02
   4.41223830e-02 -1.84863899e-02 -6.84073642e-02  3.21860164e-02
   4.09143828e-02 -2.74433400e-02 -2.47369967e-02  7.74542615e-02
   9.80964210e-03  2.94299088e-02  2.93895267e-02 -3.29437815e-02
…
  7.20389187e-02  1.57317147e-02 -3.10291946e-02 -5.51304631e-02
  -7.03861639e-02  7.40829483e-02  1.04319192e-02 -2.01565702e-03
   2.43322570e-02  1.92969330e-02  2.57341694e-02 -1.13280728e-01
   8.45847875e-02  4.90090018e-03  5.33546880e-02 -2.31553353e-02
   3.87288055e-05  3.31782512e-02 -4.00604047e-02 -1.02028981e-01
   3.49597558e-02 -1.71501152e-02  3.55573371e-02 -1.77437533e-02
  -5.94457164e-02  2.21221056e-02  9.73121971e-02 -4.90022525e-02]]

To verify the word embedding produced for these two words, we can use cosine similarity to see if the word embeddings of the words black and brown are similar.

Cosine similarity uses the Euclidean (L2) norm to create vectors in a unit sphere. The dot product of the vectors we are comparing is the cosine between the points of those two vectors. For more on the theory of cosine similarity, you can consult scikit-learn’s documentation, among many other sources: https://scikit-learn.org/stable/modules/metrics.html#cosine-similarity.

The cosine similarity between the black vector of size dmodel = 512 and the brown vector of size dmodel = 512 in the embedding of the example is:

cosine_similarity(black, brown)= [[0.9998901]]

The skip-gram produced two vectors that are close to each other. It detected that black and brown form a color subset of the dictionary of words.

The Transformer’s subsequent layers do not start empty-handed. Instead, they have learned word embeddings that already provide information on how the words can be associated.

However, a big chunk of information is missing because no additional vector or information indicates a word’s position in a sequence.

The designers of the Transformer came up with yet another innovative feature: positional encoding.

Let’s see how positional encoding works.

Positional encoding

We enter this positional encoding function of the Transformer with no idea of the position of a word in a sequence:

Figure 2.5: Positional encoding

We cannot create independent positional vectors that would have a high cost on the training speed of the Transformer and make attention sublayers overly complex to work with. The idea is to add a positional encoding value to the input embedding instead of having additional vectors to describe the position of a token in a sequence.

The Original Transformer model has only one vector that contains word embedding and position encoding.

The Transformer expects a fixed size dmodel = 512 (or other constant value for the model) for each vector of the output of the positional encoding function.

If we go back to the sentence we used in the word embedding sublayer, we can see that black and brown may be semantically similar, but they are far apart in the sentence:

The blackbrown cat sat on the couch and the  dog slept on the rug.

The word black is in position 2, pos=2, and the word brown is in position 10, pos=10.

Our problem is to find a way to add a value to the word embedding of each word so that it has that information. However, we need to add a value to the dmodel =512 dimensions! For each word embedding vector, we need to find a way to provide information to i in the range(0,512) dimensions of the word embedding vector of black and brown.

There are many ways to achieve positional encoding. This section will focus on the designers’ clever technique of using a unit sphere to represent positional encoding with sine and cosine values that will thus remain small but useful.

Vaswani et al. (2017) provide sine and cosine functions so that we can generate different frequencies for the positional encoding (PE) for each position and each dimension i of the dmodel = 512 of the word embedding vector:

If we start at the beginning of the word embedding vector, we will begin with a constant (512), i=0, and end with i=511. This means that the sine function will be applied to the even numbers and the cosine function to the odd numbers. Some implementations do it differently. In that case, the domain of the sine function can be , and the domain of the cosine function can be . This will produce similar results.

In this section, we will use the functions the way they were described by Vaswani et al. (2017). A literal translation into Python pseudo-code produces the following code for a positional vector pe[0][i] for a position pos:

def positional_encoding(pos,pe):
for i in range(0, 512,2):
         pe[0][i] = math.sin(pos / (10000 ** ((2 * i)/d_model)))
         pe[0][i+1] = math.cos(pos / (10000 ** ((2 * i)/d_model)))
return pe

Google Brain Trax and Hugging Face, among others, provide ready-to-use libraries for the word embedding section and the present positional encoding section. Thus, you don’t need to run the code I shared in this section. However, if you wish to explore the code, you will find it in the positional_encoding.ipynb notebook in the directory of this chapter in the GitHub repository.

Before going further, you might want to see the plot of the sine function, for example, for pos=2. You can Google the following plot:

plot y=sin(2/10000^(2*x/512))

Just enter the plot request:

Figure 2.6: Plotting with Google

You will obtain the following graph:

Figure 2.7: The graph

If we go back to the sentence we are parsing in this section, we can see that black is in position pos=2 and brown is in position pos=10:

The blackbrown cat sat on the couch and the  dog slept on the rug.

If we apply the sine and cosine functions literally for pos=2, we obtain a size=512 positional encoding vector:

PE(2)= 
[[ 9.09297407e-01 -4.16146845e-01  9.58144367e-01 -2.86285430e-01
   9.87046242e-01 -1.60435960e-01  9.99164224e-01 -4.08766568e-02
   9.97479975e-01  7.09482506e-02  9.84703004e-01  1.74241230e-01
   9.63226616e-01  2.68690288e-01  9.35118318e-01  3.54335666e-01
   9.02130723e-01  4.31462824e-01  8.65725577e-01  5.00518918e-01
   8.27103794e-01  5.62049210e-01  7.87237823e-01  6.16649508e-01
   7.46903539e-01  6.64932430e-01  7.06710517e-01  7.07502782e-01
…
   5.47683925e-08  1.00000000e+00  5.09659337e-08  1.00000000e+00
   4.74274735e-08  1.00000000e+00  4.41346799e-08  1.00000000e+00
   4.10704999e-08  1.00000000e+00  3.82190599e-08  1.00000000e+00
   3.55655878e-08  1.00000000e+00  3.30963417e-08  1.00000000e+00
   3.07985317e-08  1.00000000e+00  2.86602511e-08  1.00000000e+00
   2.66704294e-08  1.00000000e+00  2.48187551e-08  1.00000000e+00
   2.30956392e-08  1.00000000e+00  2.14921574e-08  1.00000000e+00]]

We also obtain a size=512 positional encoding vector for position 10, pos=10:

PE(10)= 
[[-5.44021130e-01 -8.39071512e-01  1.18776485e-01 -9.92920995e-01
   6.92634165e-01 -7.21289039e-01  9.79174793e-01 -2.03019097e-01
   9.37632740e-01  3.47627431e-01  6.40478015e-01  7.67976522e-01
   2.09077001e-01  9.77899194e-01 -2.37917677e-01  9.71285343e-01
  -6.12936735e-01  7.90131986e-01 -8.67519796e-01  4.97402608e-01
  -9.87655997e-01  1.56638563e-01 -9.83699203e-01 -1.79821849e-01
…
  2.73841977e-07  1.00000000e+00  2.54829672e-07  1.00000000e+00
   2.37137371e-07  1.00000000e+00  2.20673414e-07  1.00000000e+00
   2.05352507e-07  1.00000000e+00  1.91095296e-07  1.00000000e+00
   1.77827943e-07  1.00000000e+00  1.65481708e-07  1.00000000e+00
   1.53992659e-07  1.00000000e+00  1.43301250e-07  1.00000000e+00
   1.33352145e-07  1.00000000e+00  1.24093773e-07  1.00000000e+00
   1.15478201e-07  1.00000000e+00  1.07460785e-07  1.00000000e+00]]

Having looked at the results obtained with an intuitive literal translation of the Vaswani et al. (2017) functions into Python, we now need to check whether the results are meaningful.

The cosine similarity function used for word embedding comes in handy for getting a better visualization of the proximity of the positions:

cosine_similarity(pos(2), pos(10))= [[0.8600013]]

The similarity between the position of the words black and brown and the lexical field (groups of words that go together) similarity is different:

cosine_similarity(black, brown)= [[0.9998901]]

The encoding of the position shows a lower similarity value than the word embedding similarity.

The positional encoding has taken these words apart. Bear in mind that word embeddings will vary depending on the corpus used to train them. The problem is now how to add the positional encoding to the word embedding vectors.

Adding positional encoding to the embedding vector

The authors of the Transformer found a simple way by merely adding the positional encoding vector to the word embedding vector:

Figure 2.8: Positional encoding

If we go back and take the word embedding of black, for example, and name it y1 = black, we are ready to add it to the positional vector pe(2) we obtained with positional encoding functions. We will obtain the positional encoding pc(black)of the input word black:

pc(black) = y1 + pe(2)

In the following example, we add the positional vector to the embedding vector of the word black, which are both of the same size (512):

for i in range(0, 512,2):
          pe[0][i] = math.sin(pos / (10000 ** ((2 * i)/d_model)))
          pc[0][i] = (y[0][i]*math.sqrt(d_model))+ pe[0][i]
            
          pe[0][i+1] = math.cos(pos / (10000 ** ((2 * i)/d_model)))
          pc[0][i+1] = (y[0][i+1]*math.sqrt(d_model))+ pe[0][i+1]

The result obtained is the final positional encoding vector of dimension dmodel = 512:

pc(black)=
[[ 9.09297407e-01 -4.16146845e-01  9.58144367e-01 -2.86285430e-01
   9.87046242e-01 -1.60435960e-01  9.99164224e-01 -4.08766568e-02
   …
  4.74274735e-08  1.00000000e+00  4.41346799e-08  1.00000000e+00
   4.10704999e-08  1.00000000e+00  3.82190599e-08  1.00000000e+00
   2.66704294e-08  1.00000000e+00  2.48187551e-08  1.00000000e+00
   2.30956392e-08  1.00000000e+00  2.14921574e-08  1.00000000e+00]]

The same operation is applied to the word brown and all of the other words in a sequence.

We can apply the cosine similarity function to the positional encoding vectors of black and brown:

cosine_similarity(pc(black), pc(brown))= [[0.9627094]]

We now have a clear view of the positional encoding process through the three cosine similarity functions we applied to the three states representing the words black and brown:

[[0.99987495]] word similarity
[[0.8600013]] positional encoding vector similarity
[[0.9627094]] final positional encoding similarity

We saw that the initial word similarity of their embeddings was high, with a value of 0.99. Then, we saw that the positional encoding vector of positions 2 and 10 drew these two words apart with a lower similarity value of 0.86.

Finally, we added the word embedding vector of each word to its respective positional encoding vector. We saw that this brought the cosine similarity of the two words to 0.96.

The positional encoding of each word now contains the initial word embedding information and the positional encoding values.

The output of positional encoding leads to the multi-head attention sublayer.

Sublayer 1: Multi-head attention

The multi-head attention sublayer contains eight heads and is followed by post-layer normalization, which will add residual connections to the output of the sublayer and normalize it:

Figure 2.9: Multi-head attention sublayer

This section begins with the architecture of an attention layer. Then, an example of multi-attention is implemented in a small module in Python. Finally, post-layer normalization is described.

Let’s start with the architecture of multi-head attention.

The architecture of multi-head attention

The input of the multi-attention sublayer of the first layer of the encoder stack is a vector that contains the embedding and the positional encoding of each word. The next layers of the stack do not start these operations over.

The dimension of the vector of each word xn of an input sequence is dmodel = 512:

pe(xn)=[d1=9.09297407e-01, d2=-4.16146845e-01, .., d512=1.00000000e+00]

The representation of each word xn has become a vector of dmodel = 512 dimensions.

Each word is mapped to all the other words to determine how it fits in a sequence.

In the following sentence, we can see that it could be related to cat and rug in the sequence:

Sequence=The cat sat on the rug and it was dry-cleaned.

The model will train to determine if it is related to cat or rug. We could run a huge calculation by training the model using the dmodel = 512 dimensions as they are now.

However, we would only get one point of view at a time by analyzing the sequence with one dmodel block. Furthermore, it would take quite some calculation time to find other perspectives.

A better way is for the 8 heads of the model to project the dmodel = 512 dimensions of each word xn in sequence x into dk=64 dimensions.

We then can run the eight “heads” in parallel to speed up the training and obtain eight different representation subspaces of how each word relates to another:

Figure 2.10: Multi-head representations

You can see that there are now 8 heads running in parallel. For example, one head might decide that it fits well with cat, another that it fits well with rug, and another that rug fits well with dry-cleaned.

The output of each head is a matrix Zi with a shape of x * dk. The output of a multi-attention head is Z defined as:

Z = (Z0, Z1, Z2, Z3, Z4, Z5, Z6, Z7)

However, Z must be concatenated so that the output of the multi-head sublayer is not a sequence of dimensions but one line of an xm * dmodel matrix.

Before exiting the multi-head attention sublayer, the elements of Z are concatenated:

MultiHead(output) = Concat(Z0, Z1, Z2, Z3, Z4, Z5, Z6, Z7) = x, dmodel

Notice that each head is concatenated into z, which has a dimension of dmodel = 512. The output of the multi-headed layer respects the constraint of the original Transformer model.

Inside each head hn of the attention mechanism, the “word” matrices have three representations:

  • A query matrix (Q) that has a dimension of dq =64, which seeks all the key-value pairs of the other “word” matrices.
  • A key matrix (K) with a dimension of dk =64.
  • A value matrix (V) with a dimension of dv =64.

Attention is defined as scaled dot-product attention, which is represented in the following equation in which we plug Q, K, and V:

The matrices all have the same dimension, making it relatively simple to use a scaled dot product to obtain the attention values for each head and then concatenate the output Z of the 8 heads.

To obtain Q, K, and V, we must train the model with their respective weight matrices Qw, Kw, and Vw, which have dk = 64 columns and dmodel = 512 rows. For example, Q is obtained by a dot product between x and Qw. Q will have a dimension of dk = 64.

You can modify all the parameters, such as the number of layers, heads, dmodel, dk, and other variables of the Transformer, to fit your model. It is essential to understand the original architecture before modifying it or exploring variants of the original model designed by others.

Google Brain Trax, OpenAI, Google Cloud AI, and Hugging Face, among others, provide ready-to-use libraries that we will be using throughout this book.

However, let’s open the hood of the Transformer model and get our hands dirty in Python to illustrate the architecture we just explored to visualize the model in code and show it with intermediate images.

We will use basic Python code with only numpy and a softmax function in 10 steps to run the main aspects of the attention mechanism.

Remember that an AI specialist will face the challenge of dealing with multiple architectures for the same algorithm.

Let’s now start building Step 1 of our model to represent the input.

Step 1: Represent the input

Save Multi_Head_Attention_Sub_Layer.ipynb to your Google Drive (make sure you have a Gmail account) and then open it in Google Colaboratory. The notebook is in the GitHub repository for this chapter.

We will start by only using minimal Python functions to understand the Transformer at a low level with the inner workings of an attention head. Then, we will explore the inner workings of the multi-head attention sublayer using basic code:

import numpy as np
from scipy.special import softmax

The input of the attention mechanism we are building is scaled down to dmodel = 4 instead of dmodel = 512. This brings the dimensions of the vector of an input x down to dmodel = 4, which is easier to visualize.

x contains 3 inputs with 4 dimensions each instead of 512:

print("Step 1: Input : 3 inputs, d_model=4")
x =np.array([[1.0, 0.0, 1.0, 0.0],   # Input 1
             [0.0, 2.0, 0.0, 2.0],   # Input 2
             [1.0, 1.0, 1.0, 1.0]])  # Input 3
print(x)

The output shows that we have 3 vectors of dmodel = 4:

Step 1: Input : 3 inputs, d_model=4
[[1. 0. 1. 0.]
 [0. 2. 0. 2.]
 [1. 1. 1. 1.]]

The first step of our model is ready:

Figure 2.11: Input of a multi-head attention sublayer

We will now add the weight matrices to our model.

Step 2: Initializing the weight matrices

Each input has three weight matrices:

  • Qw to train the queries
  • Kw to train the keys
  • Vw to train the values

These three weight matrices will be applied to all the inputs in this model.

The weight matrices described by Vaswani et al. (2017) are dK ==64 dimensions. However, let’s scale the matrices down to dK ==3. The dimensions are scaled down to 3*4 weight matrices to visualize the intermediate results more easily and perform dot products with the input x.

The size and shape of the matrices in this educational notebook are arbitrary. The goal is to go through the overall process of an attention mechanism.

The three weight matrices are initialized starting with the query weight matrix:

print("Step 2: weights 3 dimensions x d_model=4")
print("w_query")
w_query =np.array([[1, 0, 1],
                   [1, 0, 0],
                   [0, 0, 1],
                   [0, 1, 1]])
print(w_query)

The output is the w_query weight matrix:

w_query
[[1 0 1]
 [1 0 0]
 [0 0 1]
 [0 1 1]]

We will now initialize the key weight matrix:

print("w_key")
w_key =np.array([[0, 0, 1],
                 [1, 1, 0],
                 [0, 1, 0],
                 [1, 1, 0]])
print(w_key)

The output is the key weight matrix:

w_key
[[0 0 1]
 [1 1 0]
 [0 1 0]
 [1 1 0]]

Finally, we initialize the value weight matrix:

print("w_value")
w_value = np.array([[0, 2, 0],
                    [0, 3, 0],
                    [1, 0, 3],
                    [1, 1, 0]])
print(w_value)

The output is the value weight matrix:

w_value
[[0 2 0]
 [0 3 0]
 [1 0 3]
 [1 1 0]]

The second step of our model is ready:

Figure 2.12: Weight matrices added to the model

We will now multiply the weights by the input vectors to obtain Q, K, and V.

Step 3: Matrix multiplication to obtain Q, K, and V

We will now multiply the input vectors by the weight matrices to obtain a query, key, and value vector for each input.

In this model, we will assume that there is one w_query, w_key, and w_value weight matrix for all inputs. Other approaches are possible.

Let’s first multiply the input vectors by the w_query weight matrix:

print("Step 3: Matrix multiplication to obtain Q,K,V")
print("Query: x * w_query")
Q=np.matmul(x,w_query)
print(Q)

The output is a vector for Q1 ==64= [1, 0, 2], Q2= [2,2, 2], and Q3= [2,1, 3]:

Step 3: Matrix multiplication to obtain Q,K,V
Query: x * w_query
[[1. 0. 2.]
 [2. 2. 2.]
 [2. 1. 3.]]

We now multiply the input vectors by the w_key weight matrix:

print("Key: x * w_key")
K=np.matmul(x,w_key)
print(K)

We obtain a vector for K1= [0, 1, 1], K2= [4, 4, 0], and K3= [2 ,3, 1]:

Key: x * w_key
[[0. 1. 1.]
 [4. 4. 0.]
 [2. 3. 1.]]

Finally, we multiply the input vectors by the w_value weight matrix:

print("Value: x * w_value")
V=np.matmul(x,w_value)
print(V)

We obtain a vector for V1= [1, 2, 3], V2= [2, 8, 0], and V3= [2 ,6, 3]:

Value: x * w_value
[[1. 2. 3.]
 [2. 8. 0.]
 [2. 6. 3.]]

The third step of our model is ready:

Figure 2.13: Q, K, and V are generated

We have the Q, K, and V values we need to calculate the attention scores.

Step 4: Scaled attention scores

The attention head now implements the original Transformer equation:

Step 4 focuses on Q and K:

For this model, we will round = = 1.75 to 1 and plug the values into the Q and K parts of the equation:

print("Step 4: Scaled Attention Scores")
k_d=1   #square root of k_d=3 rounded down to 1 for this example
attention_scores = (Q @ K.transpose())/k_d
print(attention_scores)

The intermediate result is displayed:

Step 4: Scaled Attention Scores
[[ 2.  4.  4.]
 [ 4. 16. 12.]
 [ 4. 12. 10.]]

Step 4 is now complete. For example, the score for x1 is [2,4,4] across the K vectors across the head as displayed:

Figure 2.14: Scaled attention scores for input #1

The attention equation will now apply softmax to the intermediate scores for each vector.

Step 5: Scaled softmax attention scores for each vector

We now apply a softmax function to each intermediate attention score. Instead of doing a matrix multiplication, let’s zoom down to each individual vector:

print("Step 5: Scaled softmax attention_scores for each vector")
attention_scores[0]=softmax(attention_scores[0])
attention_scores[1]=softmax(attention_scores[1])
attention_scores[2]=softmax(attention_scores[2])
print(attention_scores[0])
print(attention_scores[1])
print(attention_scores[2])

We obtain scaled softmax attention scores for each vector:

Step 5: Scaled softmax attention_scores for each vector
[0.06337894 0.46831053 0.46831053]
[6.03366485e-06 9.82007865e-01 1.79861014e-02]
[2.95387223e-04 8.80536902e-01 1.19167711e-01]

Step 5 is now complete. For example, the softmax of the score of x1 for all the keys is:

Figure 2.15: The softmax score of input #1 for all of the keys

We can now calculate the final attention values with the complete equation.

Step 6: The final attention representations

We can now finalize the attention equation by plugging V in:

We will first calculate the attention score of input x1 for Steps 6 and 7. We calculate one attention value for one word vector. When we reach Step 8, we will generalize the attention calculation to the other two input vectors.

To obtain Attention(Q,K,V) for x1 we multiply the intermediate attention score by the three value vectors one by one to zoom in on the inner workings of the equation:

print("Step 6: attention value obtained by score1/k_d * V")
print(V[0])
print(V[1])
print(V[2])
print("Attention 1")
attention1=attention_scores[0].reshape(-1,1)
attention1=attention_scores[0][0]*V[0]
print(attention1)
print("Attention 2")
attention2=attention_scores[0][1]*V[1]
print(attention2)
print("Attention 3")
attention3=attention_scores[0][2]*V[2]
print(attention3)
Step 6: attention value obtained by score1/k_d * V
[1. 2. 3.]
[2. 8. 0.]
[2. 6. 3.]
Attention 1
[0.06337894 0.12675788 0.19013681]
Attention 2
[0.93662106 3.74648425 0.        ]
Attention 3
[0.93662106 2.80986319 1.40493159]

Step 6 is complete and the three attention values for x1 for each input have been calculated:

Figure 2.16: Attention representations

The attention values now need to be summed up.

Step 7: Summing up the results

The three attention values of input #1 obtained will now be summed to obtain the first line of the output matrix:

print("Step 7: summed the results to create the first line of the output matrix")
attention_input1=attention1+attention2+attention3
print(attention_input1)

The output is the first line of the output matrix for input #1:

Step 7: summed the results to create the first line of the output matrix
[1.93662106 6.68310531 1.59506841]]

The second line will be for the output of the next input, input #2, for example.

We can see the summed attention value for x1 in Figure 2.17:

Figure 2.17: Summed results for one input

We have completed the steps for input #1. We now need to add the results of all the inputs to the model.

Step 8: Steps 1 to 7 for all the inputs

The Transformer can now produce the attention values of input #2 and input #3 using the same method described in Steps 1 to 7 for one attention head.

From this step onward, we will assume we have three attention values with learned weights with dmodel = 64. We now want to see what the original dimensions look like when they reach the sublayer’s output.

We have seen the attention representation process in detail with a small model. Let’s go directly to the result and assume we have generated the three attention representations with a dimension of dmodel = 64:

print("Step 8: Step 1 to 7 for inputs 1 to 3")
#We assume we have 3 results with learned weights (they were not trained in this example)
#We assume we are implementing the original Transformer paper. We will have 3 results of 64 dimensions each
attention_head1=np.random.random((3, 64))
print(attention_head1)

The following output displays the simulation of z0, which represents the 3 output vectors of dmodel = 64 dimensions for head 1:

Step 8: Step 1 to 7 for inputs 1 to 3
[[0.31982626 0.99175996…(61 squeezed values)…0.16233212]
 [0.99584327 0.55528662…(61 squeezed values)…0.70160307]
 [0.14811583 0.50875291…(61 squeezed values)…0.83141355]]

The results will vary when you run the notebook because of the stochastic nature of the generation of the vectors.

The Transformer now has the output vectors for the inputs of one head. The next step is to generate the output of the eight heads to create the final output of the attention sublayer.

Step 9: The output of the heads of the attention sublayer

We assume that we have trained the eight heads of the attention sublayer. The Transformer now has three output vectors (of the three input vectors that are words or word pieces) of dmodel = 64 dimensions each:

print("Step 9: We assume we have trained the 8 heads of the attention sublayer")
z0h1=np.random.random((3, 64))
z1h2=np.random.random((3, 64))
z2h3=np.random.random((3, 64))
z3h4=np.random.random((3, 64))
z4h5=np.random.random((3, 64))
z5h6=np.random.random((3, 64))
z6h7=np.random.random((3, 64))
z7h8=np.random.random((3, 64))
print("shape of one head",z0h1.shape,"dimension of 8 heads",64*8)

The output shows the shape of one of the heads:

Step 9: We assume we have trained the 8 heads of the attention sublayer
shape of one head (3, 64) dimension of 8 heads 512

The eight heads have now produced Z:

Z = (Z0, Z1, Z2, Z3, Z4, Z5, Z6, Z7)

The Transformer will now concatenate the eight elements of Z for the final output of the multi-head attention sublayer.

Step 10: Concatenation of the output of the heads

The Transformer concatenates the eight elements of Z:

MultiHead(Output) = Concat = (Z0, Z1, Z2, Z3, Z4, Z5, Z6, Z7) W0 = x, dmodel

Note that Z is multiplied by W0, a weight matrix that is also trained. In this model, we will assume W0 is trained and integrated into the concatenation function.

Z0 to Z7 are concatenated:

print("Step 10: Concantenation of heads 1 to 8 to obtain the original 8x64=512 ouput dimension of the model")
output_attention=np.hstack((z0h1,z1h2,z2h3,z3h4,z4h5,z5h6,z6h7,z7h8))
print(output_attention)

The output is the concatenation of Z:

Step 10: Concatenation of heads 1 to 8 to obtain the original 8x64=512 output dimension of the model
[[0.65218495 0.11961095 0.9555153  ... 0.48399266 0.80186221 0.16486792]
 [0.95510952 0.29918492 0.7010377  ... 0.20682832 0.4123836  0.90879359]
 [0.20211378 0.86541746 0.01557758 ... 0.69449636 0.02458972 0.889699  ]]

The concatenation can be visualized as stacking the elements of Z side by side:

Figure 2.18: Attention sublayer output

The concatenation produced a standard dmodel = 512-dimensional output:

Figure 2.19: Concatenation of the output of the eight heads

Layer normalization will now process the attention sublayer.

Post-layer normalization

Each attention sublayer and each feedforward sublayer of the Transformer is followed by Post-Layer Normalization (Post-LN):

Figure 2.20: Post-layer normalization

The Post-LN contains an add function and a layer normalization process. The add function processes the residual connections that come from the input of the sublayer and are normalized. The goal of the residual connections is to make sure critical information is not lost. The Post-LN or layer normalization can thus be described as follows:

LayerNormalization (x + Sublayer(x))

Sublayer(x) is the sublayer itself. x is the information available at the input step of Sublayer(x).

The input of the LayerNormalization is a vector v resulting from x + Sublayer(x). dmodel = 512 for every input and output of the Transformer, which standardizes all the processes.

Many layerNormalization methods exist, and variations exist from one model to another. The basic concept for v = x + Sublayer(x) can be defined by LayerNormalization (v):

The variables are:

  • is the mean of v of dimension d. As such:
  • is the standard deviation v of dimension d. As such:
  • is a scaling parameter.
  • is a bias vector.

This version of LayerNormalization (v) shows the general idea of the many possible post-LN methods. The next sublayer can now process the output of the post-LN or LayerNormalization (v). In this case, the sublayer is a feedforward network.

Sublayer 2: Feedforward network

The input of the Feedforward Network (FFN) is the dmodel = 512 output of the post-LN of the previous sublayer:

Figure 2.21: The feedforward sublayer

The FFN sublayer can be described as follows:

  • The FFNs in the encoder and decoder are fully connected.
  • The FFN is a position-wise network. Each position is processed separately and in an identical way.
  • The FFN contains two layers and applies a ReLU activation function.
  • The input and output of the FFN layers is dmodel = 512, but the inner layer is larger with dff =2048.
  • The FFN can be viewed as performing two convolutions with size 1 kernels.

Taking this description into account, we can describe the optimized and standardized FFN as follows:

FFN(x) = max (0, xW1 + b1) W2 + b2

The output of the FFN goes to post-LN, as described in the previous section. Then the output is sent to the next layer of the encoder stack and the multi-head attention layer of the decoder stack.

Let’s now explore the decoder stack.

The decoder stack

The layers of the decoder of the Transformer model are stacks of layers like the encoder layers. Each layer of the decoder stack has the following structure:

Figure 2.22: A layer of the decoder stack of the Transformer

The structure of the decoder layer remains the same as the encoder for all the N = 6 layers of the Transformer model. Each layer contains three sublayers: a multi-headed masked attention mechanism, a multi-headed attention mechanism, and a fully connected position-wise feedforward network.

The decoder has a third main sublayer, which is the masked multi-head attention mechanism. In this sublayer output, at a given position, the following words are masked so that the Transformer bases its assumptions on its inferences without seeing the rest of the sequence. That way, in this model, it cannot see future parts of the sequence.

A residual connection, Sublayer(x), surrounds each of the three main sublayers in the Transformer model like in the encoder stack:

LayerNormalization (x + Sublayer(x))

The embedding layer sublayer is only present at the bottom level of the stack, like for the encoder stack. The output of every sublayer of the decoder stack has a constant dimension, dmodel, like in the encoder stack, including the embedding layer and the output of the residual connections.

We can see that the designers worked hard to create symmetrical encoder and decoder stacks.

The structure of each sublayer and function of the decoder is similar to the encoder. In this section, we can refer to the encoder for the same functionality when we need to. We will only focus on the differences between the decoder and the encoder.

Output embedding and position encoding

The structure of the sublayers of the decoder is mostly the same as the sublayers of the encoder. The output embedding layer and position encoding function are the same as in the encoder stack.

In our exploration of Transformer usage, we will work with the model presented by Vaswani (2017). The output is a translation we need to learn. I chose to use a French translation:

Output=Le chat noir était assis sur le canapé et le chien marron dormait sur le tapis

This output is the French translation of the English input sentence:

Input=The blackbrown cat sat on the couch and the  dog slept on the rug.

The output words go through the word embedding layer and then the positional encoding function, like in the first layer of the encoder stack.

Let’s see the specific properties of the multi-head attention layers of the decoder stack.

The attention layers

The Transformer is an auto-regressive model. It uses the previous output sequences as an additional input. The multi-head attention layers of the decoder use the same process as the encoder.

However, the masked multi-head attention sublayer 1 only lets attention apply to the positions up to and including the current position. The future words are hidden from the Transformer, and this forces it to learn how to predict.

A post-layer normalization process follows the masked multi-head attention sublayer 1 as in the encoder.

The multi-head attention sublayer 2 also only attends to the positions up to the current position the Transformer is predicting to avoid seeing the sequence it must predict.

The multi-head attention sublayer 2 draws information from the encoder by taking encoder (K, V) into account during the dot-product attention operations. This sublayer also draws information from the masked multi-head attention sublayer 1 (masked attention) by also taking sublayer 1(Q) into account during the dot-product attention operations. The decoder thus uses the trained information of the encoder. We can define the input of the self-attention multi-head sublayer of a decoder as:

Input_Attention = (Output_decoder_sub_layer -1 (Q), Output_encoder_layer(K, V))

A post-layer normalization process follows the masked multi-head attention sublayer 1 as in the encoder.

The Transformer then goes to the FFN sublayer, followed by a post-LN and the linear layer.

The FFN sublayer, the post-LN, and the linear layer

The FFN sublayer has the same structure as the FFN of the encoder stack. The post-layer normalization of the FFN works as the layer normalization of the encoder stack.

The Transformer produces an output sequence of only one element at a time:

Output sequence = (y1, y2, ... yn)

The linear layer produces an output sequence with a linear function that varies per model but relies on the standard method:

y = w * x + b

w and b are learned parameters.

The linear layer will thus produce the next probable elements of a sequence that a softmax function will convert into a probable element.

The decoder layer, like the encoder layer, will then go from layer l to layer l + 1, up to the top layer of the N = 6 layer transformer stack.

At the top layer of the decoder, the transformer will reach the output layer, which will map the outputs of the model to the size of the vocabulary to produce the raw logits of the prediction.

The raw logits of the output can go through a softmax function, apply the values obtained to the tokens in the vocabulary, and choose the best probable token for the task requested. Or, as shown in the From one token to an AI revolution section of Chapter 1, What Are Transformers?, the pipeline can apply sampling functions that will vary from one API to another.

Let’s now see how the Transformer was trained and the performance it obtained.

Training and performance

The Original Transformer was trained on a 4.5 million sentence-pair English-German dataset and a 36 million sentence-pair English-French dataset.

The datasets come from the ninth Workshop on Machine Translation (WMT), which can be found at the following link if you wish to explore the WMT datasets: http://www.statmt.org/wmt14/.

The training of the Original Transformer base models took 12 hours for 100,000 steps on a machine with 8 NVIDIA P100 GPUs. The big models took 3.5 days for 300,000 steps.

The Original Transformer outperformed all the previous machine translation models with a BLEU score of 41.8. The result was obtained on the WMT English-to-French dataset.

BLEU stands for Bilingual Evaluation Understudy. It is an algorithm that evaluates the quality of the results of machine translations.

The Google Research and Google Brain team applied optimization strategies to improve the performance of the Transformer. For example, the Adam optimizer was used, but the learning rate varied by first going through warmup states with a linear rate and decreasing the rate afterward.

Different types of regularization techniques, such as residual dropout and dropouts, were applied to the sums of embeddings. Also, the Transformer applies label smoothing to avoid overfitting with overconfident one-hot outputs. It introduces less accurate evaluations and forces the model to train more and better.

Several other transformer model variations have led to other models and usages that we will explore in the subsequent chapters.

Before the end of the chapter, let’s get a feel of the simplicity of ready-to-use transformer models in Hugging Face, for example.

Hugging Face transformer models

Everything we have learned in this chapter can be condensed into a ready-to-use Hugging Face transformer model.

With Hugging Face, we can implement machine translation in three lines of code!

Open Multi_Head_Attention_Sub_Layer.ipynb in Google Colaboratory. Save the notebook in your Google Drive (make sure you have a Gmail account). Then, go to the two last cells.

We first ensure that Hugging Face transformers are installed:

!pip -q install transformers

The first cell imports the Hugging Face pipeline that contains several transformer usages:

#@title Retrieve pipeline of modules and choose English to French translation
from transformers import pipeline

We then implement the Hugging Face pipeline, which contains ready-to-use functions. In our case, to illustrate the Transformer model of this chapter, we activate the translator model and enter a sentence to translate from English to French:

translator = pipeline("translation_en_to_fr")
#One line of code!
print(translator("It is easy to translate languages with transformers", max_length=40))

And voilà! The translation is displayed:

[{'translation_text': 'Il est facile de traduire des langues à l'aide de transformateurs.'}]

Hugging Face shows how transformer architectures can be used in ready-to-use models.

Summary

In this chapter, we first started by examining the mind-blowing long-distance dependencies that transformer architectures can uncover. Transformers can perform transductions from written and oral sequences to meaningful representations as never before in the history of Natural Language Understanding (NLU).

These two dimensions, the expansion of transduction and the simplification of implementation, are taking artificial intelligence to a level never seen before.

We explored the bold approach of removing RNNs, LSTMs, and CNNs from transduction problems and sequence modeling to build the Transformer architecture. The symmetrical design of the standardized dimensions of the encoder and decoder makes the flow from one sublayer to another nearly seamless.

We saw that beyond removing recurrent network models, transformers introduce parallelized layers that reduce training time. In addition, we discovered other innovations, such as positional encoding and masked multi-headed attention.

The flexible, Original Transformer architecture provides the basis for many other innovative variations that open the way for yet more powerful transduction problems and language modeling.

We will go more in depth into some aspects of the Transformer’s architecture in the following chapters when describing the many variants of the original model.

The arrival of the Transformer marks the beginning of a new generation of ready-to-use artificial intelligence models. For example, Hugging Face and Google Brain make artificial intelligence easy to implement with a few lines of code.

Before continuing to the next chapter, make sure you capture the details of the paradigm shift constituted by the architecture of the Original Transformer. You will then be able to face any present and future transformer model.

In this chapter, we have dived into the architecture of the Original Transformer. Now, we will see what they can do. In Chapter 3, Emergent vs. Downstream Tasks: The Unseen Depths of Transformers, we will explore the wide range of tasks transformer models can perform.

Questions

  1. NLP transduction can encode and decode text representations. (True/False)
  2. Natural Language Understanding (NLU) is a subset of Natural Language Processing (NLP). (True/False)
  3. Language modeling algorithms generate probable sequences of words based on input sequences. (True/False)
  4. A transformer is a customized LSTM with a CNN layer. (True/False)
  5. A transformer does not contain LSTM or CNN layers. (True/False)
  6. Attention examines all the tokens in a sequence, not just the last one. (True/False)
  7. A transformer uses a positional vector, not positional encoding. (True/False)
  8. A transformer contains a feedforward network. (True/False)
  9. The masked multi-headed attention component of the decoder of a transformer prevents the algorithm parsing a given position from seeing the rest of a sequence that is being processed. (True/False)
  10. Transformers can analyze long-distance dependencies better than LSTMs. (True/False)

References

Further reading

Transformers could not have increased their potential without hardware innovation. NVIDIA, for example, offers interesting insights on transformers (https://blogs.nvidia.com/blog/2022/03/25/what-is-a-transformer-model/) and the related hardware (https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html).

Join our community on Discord

Join our community’s Discord space for discussions with the authors and other readers:

https://www.packt.link/Transformers

Left arrow icon Right arrow icon
Download code icon Download Code

Key benefits

  • Compare and contrast 20+ models (including GPT-4, BERT, and Llama 2) and multiple platforms and libraries to find the right solution for your project
  • Apply RAG with LLMs using customized texts and embeddings
  • Mitigate LLM risks, such as hallucinations, using moderation models and knowledge bases
  • Purchase of the print or Kindle book includes a free eBook in PDF format

Description

Transformers for Natural Language Processing and Computer Vision, Third Edition, explores Large Language Model (LLM) architectures, applications, and various platforms (Hugging Face, OpenAI, and Google Vertex AI) used for Natural Language Processing (NLP) and Computer Vision (CV). The book guides you through different transformer architectures to the latest Foundation Models and Generative AI. You’ll pretrain and fine-tune LLMs and work through different use cases, from summarization to implementing question-answering systems with embedding-based search techniques. You will also learn the risks of LLMs, from hallucinations and memorization to privacy, and how to mitigate such risks using moderation models with rule and knowledge bases. You’ll implement Retrieval Augmented Generation (RAG) with LLMs to improve the accuracy of your models and gain greater control over LLM outputs. Dive into generative vision transformers and multimodal model architectures and build applications, such as image and video-to-text classifiers. Go further by combining different models and platforms and learning about AI agent replication. This book provides you with an understanding of transformer architectures, pretraining, fine-tuning, LLM use cases, and best practices.

Who is this book for?

This book is ideal for NLP and CV engineers, software developers, data scientists, machine learning engineers, and technical leaders looking to advance their LLMs and generative AI skills or explore the latest trends in the field. Knowledge of Python and machine learning concepts is required to fully understand the use cases and code examples. However, with examples using LLM user interfaces, prompt engineering, and no-code model building, this book is great for anyone curious about the AI revolution.

What you will learn

  • Breakdown and understand the architectures of the Original Transformer, BERT, GPT models, T5, PaLM, ViT, CLIP, and DALL-E
  • Fine-tune BERT, GPT, and PaLM 2 models
  • Learn about different tokenizers and the best practices for preprocessing language data
  • Pretrain a RoBERTa model from scratch
  • Implement retrieval augmented generation and rules bases to mitigate hallucinations
  • Visualize transformer model activity for deeper insights using BertViz, LIME, and SHAP
  • Go in-depth into vision transformers with CLIP, DALL-E 2, DALL-E 3, and GPT-4V
Estimated delivery fee Deliver to Chile

Standard delivery 10 - 13 business days

$19.95

Premium delivery 3 - 6 business days

$40.95
(Includes tracking information)

Product Details

Country selected
Publication date, Length, Edition, Language, ISBN-13
Publication date : Feb 29, 2024
Length: 730 pages
Edition : 3rd
Language : English
ISBN-13 : 9781805128724
Vendor :
OpenAI
Category :
Languages :
Tools :

What do you get with Print?

Product feature icon Instant access to your digital eBook copy whilst your Print order is Shipped
Product feature icon Paperback book shipped to your preferred address
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
Product feature icon AI Assistant (beta) to help accelerate your learning
OR
Modal Close icon
Payment Processing...
tick Completed

Shipping Address

Billing Address

Shipping Methods
Estimated delivery fee Deliver to Chile

Standard delivery 10 - 13 business days

$19.95

Premium delivery 3 - 6 business days

$40.95
(Includes tracking information)

Product Details

Publication date : Feb 29, 2024
Length: 730 pages
Edition : 3rd
Language : English
ISBN-13 : 9781805128724
Vendor :
OpenAI
Category :
Languages :
Tools :

Packt Subscriptions

See our plans and pricing
Modal Close icon
$19.99 billed monthly
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Simple pricing, no contract
$199.99 billed annually
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just $5 each
Feature tick icon Exclusive print discounts
$279.99 billed in 18 months
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just $5 each
Feature tick icon Exclusive print discounts

Frequently bought together


Stars icon
Total $87.96 $126.97 $39.01 saved
Building LLM Powered  Applications
$34.98 $49.99
Mastering NLP from Foundations to LLMs
$36.99 $52.99
Transformers for Natural Language Processing and Computer Vision
$43.98 $54.99
Total $87.96$126.97 $39.01 saved Stars icon
Banner background image

Table of Contents

21 Chapters
What Are Transformers? Chevron down icon Chevron up icon
Getting Started with the Architecture of the Transformer Model Chevron down icon Chevron up icon
Emergent vs Downstream Tasks: The Unseen Depths of Transformers Chevron down icon Chevron up icon
Advancements in Translations with Google Trax, Google Translate, and Gemini Chevron down icon Chevron up icon
Diving into Fine-Tuning through BERT Chevron down icon Chevron up icon
Pretraining a Transformer from Scratch through RoBERTa Chevron down icon Chevron up icon
The Generative AI Revolution with ChatGPT Chevron down icon Chevron up icon
Fine-Tuning OpenAI GPT Models Chevron down icon Chevron up icon
Shattering the Black Box with Interpretable Tools Chevron down icon Chevron up icon
Investigating the Role of Tokenizers in Shaping Transformer Models Chevron down icon Chevron up icon
Leveraging LLM Embeddings as an Alternative to Fine-Tuning Chevron down icon Chevron up icon
Toward Syntax-Free Semantic Role Labeling with ChatGPT and GPT-4 Chevron down icon Chevron up icon
Summarization with T5 and ChatGPT Chevron down icon Chevron up icon
Exploring Cutting-Edge LLMs with Vertex AI and PaLM 2 Chevron down icon Chevron up icon
Guarding the Giants: Mitigating Risks in Large Language Models Chevron down icon Chevron up icon
Beyond Text: Vision Transformers in the Dawn of Revolutionary AI Chevron down icon Chevron up icon
Transcending the Image-Text Boundary with Stable Diffusion Chevron down icon Chevron up icon
Hugging Face AutoTrain: Training Vision Models without Coding Chevron down icon Chevron up icon
On the Road to Functional AGI with HuggingGPT and its Peers Chevron down icon Chevron up icon
Beyond Human-Designed Prompts with Generative Ideation Chevron down icon Chevron up icon
Index Chevron down icon Chevron up icon

Customer reviews

Top Reviews
Rating distribution
Full star icon Full star icon Full star icon Full star icon Half star icon 4.2
(37 Ratings)
5 star 67.6%
4 star 8.1%
3 star 8.1%
2 star 5.4%
1 star 10.8%
Filter icon Filter
Top Reviews

Filter reviews by




nicoleta simona Apr 05, 2024
Full star icon Full star icon Full star icon Full star icon Full star icon 5
I adore Denis Rothman's Transformers and the way things are written and explained. A 5-year-old can understand his explanation and a scientist can improve his work. When I see his interviews, I learn, on the one hand, about the ChatGPT and Google Gemini transformers and, on the other hand, how to treat clients, be a proactive human, and get a new perspective on AI. AI becomes fun, easy, and life- and perspective-changing. I read the second edition, and I can not wait to apply what I read in the 3rd edition. I mean, it is written as if it would answer the pain points of becoming a GenAI pro and maximize business and living in any circumstance. I already attended a Packt conference with Denis Rothman as a speaker in October last year. With or without his opponents, GenAI is changing the world and will make an important difference. Thank you, Denis Rothman, for Transformers for Natural Language Processing and Computer Vision. I adore it.
Subscriber review Packt
Paul Burnett Oct 02, 2024
Full star icon Full star icon Full star icon Full star icon Full star icon 5
This book is excellently presented. I find the material and examples stimulating and which has led to a million more questions. It a very complicated area to study but this book covers the topics very well. The code works well. Great book. Many thanks. I am still digesting this fabulous book.
Feefo Verified review Feefo
v Sep 26, 2024
Full star icon Full star icon Full star icon Full star icon Full star icon 5
If you're aspiring to become an expert in NLP or Generative AI, this book is an excellent resource. It provides a clear, step-by-step explanation of NLP models, making complex concepts easy to grasp through practical examples and Python code. . Starting with foundational models, the book introduces the architecture of Transformer, BERT, and RoBERTa, followed by an in-depth exploration of the GPT models which are the Generative AI revolution. The book also delves into image processing and computer vision. Additionally, the questions at the end of each chapter further enhance understanding and engagement with the material.
Amazon Verified review Amazon
Dr. Walter Aigner Mar 15, 2024
Full star icon Full star icon Full star icon Full star icon Full star icon 5
for those who can read, I can definitely say that this new third edition provides a fresh look at both the transformers themselves and the current environment in which they exist.A valuable resource to refresh our knowledge and inspire us to take the next stepsmy personal selection of what I appreciated in this third edition after about ten days of perusing, reading and note-takingthe emergence of new roles:* The role of AI professionals* The future of AI professionals* What resources should we use?* Guidelines for decision making* Chapter 3: Emergent vs. downstream tasks: The Unseen Depths of Transformers* Chapter 7: The Generative AI Revolution with ChatGPT* Chapter 12: Towards Syntax-Free Semantic Role Labelling with ChatGPT and GPT-4* Chapter 16: Beyond Text: Vision Transformers at the Dawn of Revolutionary AIRothman writes that this book is for data analysts, data scientists, and machine learning/AI engineers who want to understand how to process and interrogate the increasing amounts of speech and image data. Most of the programs in the book are Colaboratory notebooks. All you need is a free Google Gmail account and you can run the notebooks on the free Google Colaboratory VM.Context of my interest in this field: Shortly after the public release of ChatGPT in November 2022, Bill Gates described it and other LLMs as "as important as the PC, as important as the Internet". Jensen Huang, CEO of Nvidia, said ChatGPT was "truly one of the greatest things ever done for computing". Geoffrey Hinton, a Turing Laureate, said, "I think it's comparable in scale to the industrial revolution or electricity - or maybe the wheel. Perhaps that is why many of us need a qualified, updated context.I can definitely say that this new third edition gives a qualified context and fresh look at both the transformers themselves and the current environment in which they exist.and yet, the term "Computer Simulation" is far more accurate as an umbrella term than any characterization of machine software("AI," "LLM," "Generative AI," etc.).Rothman's profile shows that he has been designing and developing computer simulation software for decades in various forms: rule-based, expert systems, ML agents, DL agents, the first transformer models, and now trending Generative AI for NLP and Computer Vision. all these algorithms boil down to "computer simulation", no more, no less. They are toolss that are here for us to make "simulations" to enhance our abilities as a scientific calculator does.Who this book is for: Anyone who regularly works with LLMs professionally (e.g. data scientists, machine learning engineers, AI researchers, etc.) or anyone already familiar with natural language processing (NLP) who wants to take a deep dive into transformers.Another reviewer rightly wrote: Who this book is not for: Anyone with little to no knowledge of NLP, machine learning, or Python programming (i.e. the "casual" reader). This book is dense (in the sense of Clifford Geertz‘ thick description that helps us increase our understanding on both on a theoretical and a practical level). I still have a lot to think about.And I have to admit that I have not yet fully grasped all the emerging possibilities and food for thought that the book has triggered or will trigger as I re-read and explore the code provided.
Amazon Verified review Amazon
Didi Aug 01, 2024
Full star icon Full star icon Full star icon Full star icon Full star icon 5
The transformer architecture was introduced by Google in 2017, and almost instantly revolutionized the field of natural language processing (NLP), and to some degree also that of computer vision. This book is a comprehensive and practical guide to the transformer architecture, on which modern LLMs are based, and its applications in NLP and computer vision.The book does a wonderful job in providing detailed and clear descriptions of a wide range of important topics in NLP, such as the fundamentals of the transformer architecture, model pre-training, fine-tuning, tokenization, and embeddings. Notable applications of LLMs are also covered in detail, and include summarization, translation, etc. Modern generative AI methods are also very nicely covered, both in NLP and in computer vision (e.g., ChatGPT, Stable Diffusion, and the like). The accompanying GitHub repo is also very helpful, and greatly assists in reinforcing the concepts presented in the book.This comprehensive and unique guide will benefit any researcher, data scientist, machine learning engineer, or software engineer interested in building and understanding modern NLP and LLMs, as well as modern methods in computer vision. Prior familiarity with machine learning concepts, as well as with the Python programming language, would be helpful to get the most out of this book.Highly recommended!
Amazon Verified review Amazon
Get free access to Packt library with over 7500+ books and video courses for 7 days!
Start Free Trial

FAQs

What is the delivery time and cost of print book? Chevron down icon Chevron up icon

Shipping Details

USA:

'

Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days!
Add one extra business day for deliveries to Northern Ireland and Scottish Highlands and islands

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for Interstate metro
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 days based on the distance to a destination following dispatch.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
All orders received before 5 PM U.K time would start printing from the next business day. So the estimated delivery times start from the next day as well. Orders received after 5 PM U.K time (in our internal systems) on a business day or anytime on the weekend will begin printing the second to next business day. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-bissau
  9. Iran
  10. Lebanon
  11. Libiya Arab Jamahriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela
What is custom duty/charge? Chevron down icon Chevron up icon

Customs duty are charges levied on goods when they cross international borders. It is a tax that is imposed on imported goods. These duties are charged by special authorities and bodies created by local governments and are meant to protect local industries, economies, and businesses.

Do I have to pay customs charges for the print book order? Chevron down icon Chevron up icon

The orders shipped to the countries that are listed under EU27 will not bear custom charges. They are paid by Packt as part of the order.

List of EU27 countries: www.gov.uk/eu-eea:

A custom duty or localized taxes may be applicable on the shipment and would be charged by the recipient country outside of the EU27 which should be paid by the customer and these duties are not included in the shipping charges been charged on the order.

How do I know my custom duty charges? Chevron down icon Chevron up icon

The amount of duty payable varies greatly depending on the imported goods, the country of origin and several other factors like the total invoice amount or dimensions like weight, and other such criteria applicable in your country.

For example:

  • If you live in Mexico, and the declared value of your ordered items is over $ 50, for you to receive a package, you will have to pay additional import tax of 19% which will be $ 9.50 to the courier service.
  • Whereas if you live in Turkey, and the declared value of your ordered items is over € 22, for you to receive a package, you will have to pay additional import tax of 18% which will be € 3.96 to the courier service.
How can I cancel my order? Chevron down icon Chevron up icon

Cancellation Policy for Published Printed Books:

You can cancel any order within 1 hour of placing the order. Simply contact [email protected] with your order details or payment transaction id. If your order has already started the shipment process, we will do our best to stop it. However, if it is already on the way to you then when you receive it, you can contact us at [email protected] using the returns and refund process.

Please understand that Packt Publishing cannot provide refunds or cancel any order except for the cases described in our Return Policy (i.e. Packt Publishing agrees to replace your printed book because it arrives damaged or material defect in book), Packt Publishing will not accept returns.

What is your returns and refunds policy? Chevron down icon Chevron up icon

Return Policy:

We want you to be happy with your purchase from Packtpub.com. We will not hassle you with returning print books to us. If the print book you receive from us is incorrect, damaged, doesn't work or is unacceptably late, please contact Customer Relations Team on [email protected] with the order number and issue details as explained below:

  1. If you ordered (eBook, Video or Print Book) incorrectly or accidentally, please contact Customer Relations Team on [email protected] within one hour of placing the order and we will replace/refund you the item cost.
  2. Sadly, if your eBook or Video file is faulty or a fault occurs during the eBook or Video being made available to you, i.e. during download then you should contact Customer Relations Team within 14 days of purchase on [email protected] who will be able to resolve this issue for you.
  3. You will have a choice of replacement or refund of the problem items.(damaged, defective or incorrect)
  4. Once Customer Care Team confirms that you will be refunded, you should receive the refund within 10 to 12 working days.
  5. If you are only requesting a refund of one book from a multiple order, then we will refund you the appropriate single item.
  6. Where the items were shipped under a free shipping offer, there will be no shipping costs to refund.

On the off chance your printed book arrives damaged, with book material defect, contact our Customer Relation Team on [email protected] within 14 days of receipt of the book with appropriate evidence of damage and we will work with you to secure a replacement copy, if necessary. Please note that each printed book you order from us is individually made by Packt's professional book-printing partner which is on a print-on-demand basis.

What tax is charged? Chevron down icon Chevron up icon

Currently, no tax is charged on the purchase of any print book (subject to change based on the laws and regulations). A localized VAT fee is charged only to our European and UK customers on eBooks, Video and subscriptions that they buy. GST is charged to Indian customers for eBooks and video purchases.

What payment methods can I use? Chevron down icon Chevron up icon

You can pay with the following card types:

  1. Visa Debit
  2. Visa Credit
  3. MasterCard
  4. PayPal
What is the delivery time and cost of print books? Chevron down icon Chevron up icon

Shipping Details

USA:

'

Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days!
Add one extra business day for deliveries to Northern Ireland and Scottish Highlands and islands

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for Interstate metro
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 days based on the distance to a destination following dispatch.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
All orders received before 5 PM U.K time would start printing from the next business day. So the estimated delivery times start from the next day as well. Orders received after 5 PM U.K time (in our internal systems) on a business day or anytime on the weekend will begin printing the second to next business day. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-bissau
  9. Iran
  10. Lebanon
  11. Libiya Arab Jamahriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela