Generative AI with LangChain

What Is Generative AI?

Over the past decade, deep learning has evolved massively to process and generate unstructured data like text, images, and video. These advanced AI models have gained popularity in various industries, and include large language models (LLMs). There is currently a significant amount of fanfare in both the media and the industry surrounding AI, and there’s a fair case to be made that Artificial Intelligence (AI), with these advancements, is about to have a wide-ranging and major impact on businesses, societies, and individuals alike. This is driven by numerous factors, including advancements in technology, high-profile applications, and the potential for transformative impacts across multiple sectors.

In this chapter, we’ll explore generative models and their basics. We’ll provide an overview of the technical concepts and training approaches that power these models’ ability to produce novel content. While we won’t be diving deep into generative models for sound or video, we aim to convey a high-level understanding of how techniques like neural networks, large datasets, and computational scale enable generative models to reach new capabilities in text and image generation. The goal is to demystify the underlying magic that allows these models to generate remarkably human-like content across various domains. With this foundation, readers will be better prepared to consider both the opportunities and challenges posed by this rapidly advancing technology.

We’ll follow this structure:

Introducing generative AI
Understanding LLMs
Model development
What are text-to-image models?
What can AI do in other domains?

Let’s start from the beginning – with the terminology!

Introducing generative AI

In the media, there is substantial coverage of AI-related breakthroughs and their potential implications. These range from advancements in Natural Language Processing (NLP) and computer vision to the development of sophisticated language models like GPT-4. Particularly, generative models have received a lot of attention due to their ability to generate text, images, and other creative content that is often indistinguishable from human-generated content. These same models also provide wide functionality, including semantic search, content manipulation, and classification. This allows cost savings with automation and allows humans to leverage their creativity to an unprecedented level.

Generative AI refers to algorithms that can generate novel content, as opposed to analyzing or acting on existing data like more traditional, predictive machine learning or AI systems.

Benchmarks capturing task performance in different domains have been major drivers of the development of these models. The Massive Multitask Language Understanding (MMLU) benchmark is a comprehensive suite of 57 tasks spanning diverse domains like math, history, computer science, and law. It serves as a standardized way to evaluate the multitask performance and broad capabilities of LLMs in both zero-shot and few-shot settings. The MMLU benchmark’s importance lies in providing a challenging and multifaceted test of a model’s understanding and problem-solving abilities across a wide range of topics. It allows for systematic comparisons between different LLMs and tracks progress in developing models with robust language understanding and reasoning skills beyond narrow domains.

The following graph, inspired by a blog post titled GPT-4 Predictions by Stephen McAleese on LessWrong, shows the improvements of LLMs in the benchmark:

Figure 1.1: Average performance on the MMLU benchmark of LLMs

Please note that results should be taken with a pinch of salt since they are self-reported and are obtained either by 5-shot or 0-shot conditioning. Most benchmark results come from 5-shot (indicated by an “o”). A few, like the GPT-2, PaLM, and PaLM-2 results, refer to zero-shot (“x”).

From the preceding graph, we can see significant improvements in recent years in the MMLU benchmark. Particularly, it highlights the progress of the models provided through a public user interface by OpenAI, especially the improvements between releases, from GTP-2 to GPT-3 and GPT-3.5 to GPT-4.

The graph shows the MMLU performance of models that have either prompted a question directly (zero-shot) or together with examples – typically 5 (few-shot). The added examples result in a 20% boost in the model’s performance according to Measuring Massive Multitask Language Understanding (Hendrycks et al., revised in 2023).

It is difficult to definitively declare the strongest LLM among Claude 3, GPT-4, and Gemini, as their performances appear to be closely matched and vary across different tasks. Ultimately, the choice of the strongest LLM may depend on specific use cases and requirements, including their costs.

There are a few differences between these models and the way they are trained that can account for differences in performance, such as scale, instruction tuning, a tweak to the attention mechanisms, and the choice of training data. First and foremost, the massive scaling up of parameters from 1.5 billion (GPT-2) to 175 billion (GPT-3) to more than a trillion (GPT-4) enables models to learn more complex patterns; however, another major change in early 2022 was the post-training fine-tuning of models based on human instructions, which teaches the model how to perform a task by providing demonstrations and feedback.

Across benchmarks, a few models have recently started to perform better than an average human rater, but generally, they still haven’t reached the performance of a human expert. These achievements of human engineering are impressive; however, it should be noted that the performance of these models depends on the field; most models are still performing poorly on the GSM8K benchmark of grade school math word problems. As AI models like OpenAI’s GPT continue to improve, they could become indispensable assets to teams in need of diverse knowledge and skills.

You could consider strong LLMs like GPT 4 or Claude 3 a polymath that works tirelessly without demanding compensation (beyond subscription or API fees), providing competent assistance in subjects like mathematics and statistics, macroeconomics, biology, and law (the model performs well on the Uniform Bar Exam). As these AI models become more proficient and easily accessible, they are likely to play a significant role in shaping the future of work and learning.

By making knowledge more accessible and adaptable, these models have the potential to level the playing field and create new opportunities for people from all walks of life. These models have shown potential in areas that require high levels of reasoning and understanding, although progress varies depending on the complexity of the tasks involved.

As for generative models with images, they have pushed the boundaries in their capabilities to assist in creating visual content, and their performance in computer vision tasks such as object detection, segmentation, captioning, and much more.

Let’s clear up the terminology a bit and explain in more detail what is meant by generative models, artificial intelligence, deep learning, and machine learning.

What are generative models?

In popular media, the term artificial intelligence is used a lot when referring to these new models. In theoretical and applied research circles, it is often joked that AI is just a fancy word for ML, or AI is ML in a suit, as illustrated in this image:

A cartoon of a person wearing a suit and a mask

Description automatically generated

Figure 1.2: ML in a suit. Generated by a model on replicate.com, Diffusers Stable Diffusion v2.1

It’s worth distinguishing more clearly between the terms generative model, artificial intelligence, machine learning, deep learning, and language model:

Artificial Intelligence (AI) is a broad field of computer science focused on creating intelligent agents that can reason, learn, and act autonomously.
Machine Learning (ML) is a subset of AI focused on developing algorithms that can learn from data.
Deep Learning (DL) uses deep neural networks, which have many layers, as a mechanism for ML algorithms to learn complex patterns from data.
Generative Models are a type of ML model that can generate new data based on patterns learned from input data.
Language Models (LMs) are statistical models used to predict words in a sequence of natural language. Some language models utilize deep learning and are trained on massive datasets, becoming LLMs.

The following class diagram illustrates how LLMs combine deep learning techniques like neural networks with sequence modeling objectives from language modeling at a very large scale:

A diagram of a computer program

Description automatically generated

Figure 1.3: Class diagram of different models. LLMs represent the intersection of deep learning techniques with language modeling objectives

Generative models are a powerful type of AI that can generate new data that resembles the training data. Generative AI models have come a long way, enabling the generation of new examples from scratch using patterns in data. These models can handle different data modalities and are employed across various domains, including text, image, music, and video. Their key distinction is that generative models synthesize new data rather than just making predictions or decisions. This enables applications like generating text, images, music, and video.

Generative models can facilitate the creation of synthetic data to train AI models when real data is scarce or restricted. This type of data generation reduces labeling costs and improves training efficiency. Microsoft Research took this approach (Textbooks Are All You Need, June 2023) when training their phi-1 model; they used GPT-3.5 to create synthetic Python textbooks and exercises.

The rapid progress across diverse domains shows the potential of generative AI. Within the industry, there is a growing sense of excitement around AI’s capabilities and its potential impact on business operations. But there are key challenges such as data availability, compute requirements, bias in data, evaluation difficulties, potential misuse, and other societal impacts that need to be addressed going forward, which we’ll discuss in Chapter 10, The Future of Generative Models.

Generative AI is extensively used in generating 3D images, avatars, videos, graphs, and illustrations for virtual or augmented reality, video games graphic design, logo creation, image editing, or enhancement. The most popular model category here is for text-conditioned image synthesis, specifically text-to-image generation. As mentioned, in this book, we’ll focus on LLMs, since they have the broadest practical application, but we’ll also have a look at image models, which sometimes can be quite useful.

Let’s delve a bit more into this progress and pose the question why is it happening now and what conditions have made this advancement possible?

Why now?

The success of generative AI is due to several factors, including:

Improved algorithms
Considerable advances in computer power and hardware design
The availability of large, labeled datasets
An active and collaborative research community

Additionally, the development of more sophisticated mathematical and computational methods has played a vital role in advancing generative models. An example is the backpropagation algorithm, which was introduced in the 1980s and provides a way to effectively train multi-layer neural networks.

In the 2000s, neural networks began to regain popularity as researchers developed more complex architectures. However, it was the advent of deep learning, a type of neural network with numerous layers, that marked a significant turning point in the performance and capabilities of these models.

Although the concept of deep learning has existed for some time, the development and expansion of generative models correlate with significant advances in hardware, particularly Graphics Processing Units (GPUs), which have been instrumental in the development of deeper models. This is because deep learning models require a lot of computing power to train and run. This concerns all aspects of processing power, memory, and disk space.

The capabilities of LLMs changed dramatically once they became bigger. The more parameters a model has, the higher its capacity to capture knowledge relationships between words and phrases. As a simple example of these higher-order correlations, an LLM could learn that the word “cat” is more likely to be followed by the word “dog” if it is preceded by the word “chase,” even if there are other words in between. Generally, the lower a model’s perplexity, the better it will perform, for example, in terms of answering questions.

Particularly, it seems that in models with between 2 and 7 billion parameters, new capabilities emerge such as the ability to generate different creative text in formats like poems, code, scripts, musical pieces, emails, and letters, and to answer even open-ended and challenging questions in an informative way.

Understanding LLMs

LLMs are deep neural networks that are adept at understanding and generating human language. These models have practical applications in fields like content creation and NLP, where the ultimate goal is to create algorithms capable of understanding and generating natural language text.

The current generation of LLMs such as GPT-4 and others are deep neural network architectures that utilize the transformer model and undergo pre-training using unsupervised learning on extensive text data, enabling the model to learn language patterns and structures. Models have evolved rapidly, enabling the creation of versatile foundational AI models that are suitable for a wide range of downstream tasks and modalities, ultimately driving innovation across various applications and industries.

The notable strength of the latest generation of LLMs as conversational interfaces (chatbots) lies in their ability to generate coherent and contextually appropriate responses, even in open-ended conversations. By generating the next word based on the preceding words repeatedly, the model produces fluent and coherent text that is often indistinguishable from text produced by humans.

At its core, language modeling, and more broadly NLP, relies heavily on the quality of representation learning. A generative language model encodes information about the text that it has been trained on and generates new text based on what it has learned, thereby taking on the task of text generation.

Representation learning is about a model learning its internal representations of raw data to perform a machine learning task, rather than relying only on engineered feature extraction. For example, an image classification model based on representation learning might learn to represent images according to visual features like edges, shapes, and textures. The model isn’t told explicitly what features to look for – it learns representations of the raw pixel data that help it make predictions.

Recently, LLMs have been used in tasks like copywriting, code development, translation, and understanding genetic sequences. More broadly, applications of language models involve multiple areas, such as:

Question answering: AI chatbots and virtual assistants can provide personalized and efficient assistance, reducing response times in customer support and thereby enhancing customer experience. These systems can be used in specific contexts like restaurant reservations and ticket booking.
Automatic summarization: Language models can create concise summaries of articles, research papers, and other content, enabling users to consume and understand information rapidly.
Sentiment analysis: By analyzing opinions and emotions in texts, language models can help businesses understand customer feedback and opinions more efficiently.
Topic modeling: LLMs can discover abstract topics and themes across a corpus of documents. They identify word clusters and latent semantic structures.
Semantic search: LLMs can focus on understanding meaning within individual documents. They use NLP to interpret words and concepts for improved search relevance.
Machine translation: Language models can translate texts from one language into another, supporting businesses in their global expansion efforts. New generative models can perform on par with commercial products (for example, Google Translate).

Despite their remarkable achievements, language models still face limitations when dealing with complex mathematical or logical reasoning tasks. It remains uncertain whether continually increasing the scale of language models will inevitably lead to new reasoning capabilities. Further, LLMs are known to return the most probable answers within the context, which can sometimes yield fabricated information, called hallucinations. This is a feature as well as a bug since it highlights their creative potential.

We’ll talk about hallucinations in Chapter 5, Building a Chatbot Like ChatGPT, but for now, let’s discuss the nitty-gritty details – how do these LLMs work under the hood?

How do GPT models work?

A new deep learning architecture called the Transformer emerged in 2017, introduced in an article by researchers at Google and the University of Toronto (in an article called Attention Is All You Need by Vaswani et al.). It uses self-attention, allowing it to focus on the important parts of a sentence and understand how words relate to each other.

In 2018, researchers took transformers to the next level by creating Generative Pre-trained Transformers (GPTs) (in Improving Language Understanding by Generative Pre-Training; Radford et al.). These models are trained by predicting the next word in a sequence, like a massive guessing game that helps them grasp language patterns. After this pre-training process, GPTs can be further refined for specific tasks like translation or sentiment analysis. This combines unsupervised learning (pre-training) and supervised learning (fine-tuning) for better performance across various tasks. It also reduces the difficulty of training LLMs.

Transformers

Models based on transformers outperformed previous approaches, such as using recurrent neural networks, particularly Long Short-Term Memory (LSTM) networks. These recurrent neural networks such as LSTM, have a limited memory. This can be problematic for long sentences or complex ideas where earlier information is still relevant.

Transformers work differently, which means they take advantage of the full context, and they can keep learning and refining their understanding as they process more words in a sentence. This ability to leverage the entire context throughout the sentence leads to better performance for tasks like translation, summarization, and question-answering. The model can capture the nuances of longer sentences and complex relationships between words. In essence, a key reason for the success of transformers has been their ability to maintain performance across long sequences better than other models, for example, recurrent neural networks.

The transformer model architecture has an encoder-decoder structure, where the encoder maps an input sequence to a sequence of hidden states, and the decoder maps the hidden states to an output sequence. The hidden state representations consider not only the inherent meaning of the words (their semantic value) but also their context in the sequence.

The encoder is made up of identical layers, each with two sub-layers. The input embedding is passed through an attention mechanism, and the second sub-layer is a fully connected feed-forward network. Each sub-layer is followed by a residual connection and layer normalization. The output of each sub-layer is the sum of the input and the output of the sub-layer, which is then normalized.

The decoder uses this encoded information to generate the output sequence one item at a time, using the context of the previously generated items. It also has identical modules, with the same two sub-layers as the encoder. In addition, the decoder has a third sub-layer that performs Multi-Head Attention (MHA) over the output of the encoder stack. The decoder also uses residual connections and layer normalization. The self-attention sub-layer in the decoder is modified to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can only depend on the known outputs at positions less than i. These are indicated in the diagram here (source: Yuening Jia, Wikimedia Commons):

A diagram of a software system

Description automatically generated

Figure 1.4: The Transformer architecture

The architectural features that have contributed to the success of transformers are:

Positional encoding: Since the transformer doesn’t process words sequentially but instead processes all words simultaneously, it lacks any notion of the order of words. To remedy this, information about the position of words in the sequence is injected into the model using positional encodings. These encodings are added to the input embeddings representing each word, thus allowing the model to consider the order of words in a sequence.
Layer normalization: To stabilize the network’s learning, the transformer uses a technique called layer normalization. This technique normalizes the model’s inputs across the features dimension (instead of the batch dimension as in batch normalization), thus improving the overall speed and stability of learning.
MHA: Instead of applying attention once, the transformer applies it multiple times in parallel, improving the model’s ability to focus on different types of information and thus capturing a richer combination of features.

The basic idea behind attention mechanisms is to compute a weighted sum of the values (usually referred to as values or content vectors) associated with each position in the input sequence, based on the similarity between the current position and all other positions. This weighted sum, known as the context vector, is then used as an input to the subsequent layers of the model, enabling the model to selectively attend to relevant parts of the input during the decoding process.

To enhance the expressiveness of the attention mechanism, it is often extended to include multiple so-called heads, where each head has its own set of query, key, and value vectors, allowing the model to capture various aspects of the input representation. The individual context vectors from each head are then concatenated or combined in some way to form the final output.

Early attention mechanisms scaled quadratically with the length of the sequences (context size), rendering them inapplicable to settings with long sequences. Different mechanisms have been tried out to alleviate this. Many LLMs use some form of Multi-Query Attention (MQA), including OpenAI’s GPT-series models, Falcon, SantaCoder, and StarCoder.

MQA is an extension of MHA, where attention computation is replicated multiple times. MQA improves the performance and efficiency of language models for various language tasks. By removing the heads dimension from certain computations and optimizing memory usage, MQA allows for 11 times better throughput and 30% lower latency in inference tasks compared to baseline models without MQA.

Llama 2 and a few other models use Grouped-Query Attention (GQA), which is a practice used in autoregressive decoding to cache the key (K) and value (V) pairs for the previous tokens in the sequence, speeding up attention computation. However, as the context window or batch sizes increase, the memory costs associated with the KV cache size in MHA models also increase significantly. To address this, the key and value projections can be shared across multiple heads without much degradation of performance.

There have been many other proposed approaches to obtain efficiency gains, such as sparse, low-rank self-attention, and latent bottlenecks, to name just a few. Other work has tried to extend sequences beyond the fixed input size; architectures such as transformer-XL reintroduce recursion by storing hidden states of already encoded sentences to leverage them in the subsequent encoding of the next sentences.

The combination of these architectural features allows GPT models to successfully tackle tasks that involve understanding and generating text in human language and other domains. The overwhelming majority of LLMs are transformers, as are many other state-of-the-art models we will encounter in the different sections of this chapter, including models for image, sound, and 3D objects.

As the name suggests, a particularity of GPTs lies in pre-training. Let’s see how these LLMs are trained!

Pre-training

The transformer is trained in two phases using a combination of unsupervised pre-training and discriminative task-specific fine-tuning. The goal during pre-training is to learn a general-purpose representation that transfers to a wide range of tasks.

The unsupervised pre-training can follow different objectives. In Masked Language Modeling (MLM), introduced in BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Devlin and others (2019), the input is masked out, and the model attempts to predict the missing tokens based on the context provided by the non-masked portion. For example, if the input sentence is “The cat [MASK] over the wall,” the model would ideally learn to predict “jumped” for the mask.

In this case, the training objective minimizes the differences between predictions and the masked tokens according to a loss function. Parameters in the models are then iteratively updated according to these comparisons.

Negative Log-Likelihood (NLL) and Perplexity (PPL) are important metrics used in training and evaluating language models. NLL is a loss function used in ML algorithms, aimed at maximizing the probability of correct predictions. A low NLL indicates that the network has successfully learned patterns from the training set, so it will accurately predict the labels of the training samples. It’s important to mention that NLL is a value that’s constrained within a positive interval.

PPL, on the other hand, is an exponentiation of NLL, providing a more intuitive way to understand the model’s performance. Small PPL values indicate a well-trained network that can predict accurately, while high values indicate poor learning performance. Intuitively, we could say that a low PPL means that the model is not surprised by the next word. Therefore, the goal in pre-training is to minimize PPL, which means the model’s predictions align more with the actual outcomes.

In comparing different language models, PPL is often used as a benchmark metric across various tasks. It gives us an idea of how well the language model is performing in that a lower PPL indicates the model is more certain of its predictions. Hence, a model with low PPL would be considered better performing than a model with high PPL.

The first step in training an LLM is tokenization. This process involves building a vocabulary, which maps tokens to unique numerical representations so that they can be processed by the model, given that LLMs are mathematical functions that require numerical inputs and outputs.

Tokenization

Tokenizing a text means splitting it into tokens (words or subwords), which then are converted to IDs through a look-up table mapping words in text to corresponding lists of integers.

Before training the LLM, the tokenizer – more precisely, its dictionary – is typically fitted to the entire training dataset and then frozen. It’s important to note that tokenizers do not produce arbitrary integers. Instead, they output integers within a specific range – from 0 to N, where N represents the vocabulary size of the tokenizer.

Definitions

A token is an instance of a sequence of characters, typically forming a word, punctuation mark, or number. Tokens serve as the base elements for constructing sequences of text.
Tokenization refers to the process of splitting text into tokens. A tokenizer splits on whitespace and punctuation to break text into individual tokens.

Examples

Consider the following text:

“The quick brown fox jumps over the lazy dog!”

This would get split into the following tokens:

[“The”, “quick”, “brown”, “fox”, “jumps”, “over”, “the”, “lazy”, “dog”, “!”]

Each word is an individual token, as is the punctuation mark.

There are a lot of tokenizers that work according to different principles, but common types of tokenizers employed in models are Byte-Pair Encoding (BPE), WordPiece, and SentencePiece. For example, Llama 2’s BPE tokenizer splits numbers into individual digits and uses bytes to decompose unknown UTF-8 characters. The total vocabulary size is 32,000 tokens.

It is necessary to point out that LLMs can only generate outputs based on a sequence of tokens that does not exceed its context window. This context window refers to the length of the longest sequence of tokens that an LLM can use. Typical context window sizes for LLMs can range from about 1,000 to 10,000 tokens.

After pre-training, a major step is how models are prepared for specific tasks either by fine-tuning or prompting. Let’s see what this task conditioning is about!

Conditioning

Conditioning LLMs refers to adapting the model for specific tasks. It includes fine-tuning and prompting:

Fine-tuning involves modifying a pre-trained language model by training it on a specific task using supervised learning. For example, to make a model more amenable to chats with humans, the model is trained on examples of tasks formulated as natural language instructions (instruction tuning). For fine-tuning, pre-trained models are usually trained again using Reinforcement Learning from Human Feedback (RLHF) to be helpful and harmless.
Prompting techniques present problems in text form to generative models. There are a lot of different prompting techniques, from simple questions to detailed instructions. Prompts can include examples of similar problems and their solutions. Zero-shot prompting involves no examples, while few-shot prompting includes a small number of examples of relevant problem and solution pairs.

These conditioning methods continue to evolve, becoming more effective and useful for a wide range of applications. Prompt engineering and fine-tuning methods will be explored further in Chapter 8, Customizing LLMs and Their Output.

How have GPT models evolved?

The development of GPT models has seen considerable progress, with OpenAI’s GPT-n series leading the way in creating foundational AI models. A major driver has been the size of models in terms of their parameters; however, other drivers play a role as well.

A foundation model (sometimes known as a base model) is a large model that was trained on an immense quantity of data at scale so that the model can be adapted to a wide range of downstream tasks. In GPT models, this pre-training is done via self-supervised learning.

There has been a recent shift in focus towards exploring alternative approaches to improve model performance on benchmarks like MMLU, beyond simply scaling up the model size. A critical area of focus has been the curation and quality of the training data. Carefully selecting and filtering the training data to ensure its relevance, diversity, and quality can significantly impact the model’s performance, especially on benchmarks that test for a broad range of knowledge and reasoning abilities.

Another key area of innovation has been in model architectures. For example, the Mixtral and Leeroo models employ a mixture-of-experts approach, where different subsets of the model’s parameters are specialized for different tasks, potentially improving performance and computational efficiency.

By exploring these alternative approaches in conjunction with continued scaling efforts, the field is striving to develop language models with even more robust language understanding and reasoning abilities across diverse domains.

The computational requirements and the cost of the model training have been enormous and will probably increase in the future. The computational cost of LLMs is enough to make your wallet weep. But fear not! Before we explore ways to lighten the load, let’s explore what makes these models so weighty in the first place: their size!

Model size

The size of the training corpus for LLMs has been increasing drastically. GPT-1, introduced by OpenAI in 2018, was trained on BookCorpus, which has 985 million words. BERT, released in the same year, was trained on a combined corpus of BookCorpus and English Wikipedia, totaling 3.3 billion words. Now, training corpora for LLMs have up to trillions of tokens.

OpenAI has been coy about the technical details of their models; however, information has been circulating that, with about 1.8 trillion parameters, GPT-4 is more than 10x the size of GPT-3. Further, OpenAI was able to keep costs reasonable by utilizing a Mixture of Experts (MoE) model consisting of 16 experts within their model, each having about 111 billion parameters.

Apparently, GPT-4 was trained on about 13 trillion tokens. However, these are not unique tokens since they count repeated presentation of the data in each epoch. Training was conducted for two epochs for text-based data and four for code-based data. For fine-tuning, the dataset consisted of millions of rows of instruction fine-tuning data. Another rumor, again to be taken with a pinch of salt, is that OpenAI might be applying speculative decoding on GPT-4’s inference, with the idea that a smaller model (oracle model) could be predicting the large model’s responses, and these predicted responses could help speed up decoding by feeding them into the larger model, thereby skipping tokens. This is a risky strategy because – depending on the threshold of the confidence of the oracle’s responses – the quality could deteriorate.

The increase in the scale of language models has been a major driving force behind their impressive performance gains, with models like Google’s Gemini continuing to push the boundaries of size and capability. This graph illustrates how LLMs have been growing:

Figure 1.5: LLMs from BERT to GPT-4 – size (number of parameters), and licenses. For proprietary models, parameter sizes are often estimates

In examining the historical progression depicted in the graph, it is evident that LLMs have consistently increased in size, as indicated by the growing number of parameters. This trend aligns with a broader pattern observed in machine learning, where enhancing model performance often involves expanding model size. A paper from 2020 from OpenAI by Kaplan et al. (Scaling laws for neural language models, 2020) discussed scaling laws and the choice of parameters.

They identified a power-law relationship indicating that performance improvements in LLMs are proportional to increases in dataset size and model size. Specifically, to enhance performance by a certain factor, the size of the dataset or the model must be exponentially increased by that factor. For optimal results, both elements should be scaled simultaneously, thus preventing potential bottlenecks in model training and performance.

In addition to dataset and model size, it is essential to consider the training budget, which significantly influences the training process’s efficiency and outcomes. The training budget encompasses factors such as computational power and time allocated for model training. This metric serves as an alternative to measuring training in terms of epochs, allowing more flexibility and precision in determining the optimal point to cease training. Given the complexity and extensive training requirements of LLMs, it can be challenging to pinpoint the precise convergence point. Thus, the training budget plays a crucial role in efficiently managing resources while striving for the highest model performance.

Researchers at DeepMind (An empirical analysis of compute-optimal large language model training; Hoffmann et al., 2022) analyzed the training compute and dataset size of LLMs and concluded that LLMs are undertrained in terms of compute budget and dataset size as suggested by scaling laws. They predicted that large models would perform better if they were substantially smaller and trained for much longer, and – in fact – validated their prediction by comparing a 70-billion-parameter Chinchilla model on a benchmark to their Gopher model, which consists of 280 billion parameters.

However, more recently, a team at Microsoft Research has challenged these conclusions and surprised everyone (Textbooks Are All You Need; Gunasekar et al., June 2023), finding that small networks trained on high-quality datasets can give very competitive performance – their model phi-1-small only comprises 350 million parameters! We’ll discuss this model again in Chapter 6, Developing Software with Generative AI, and we’ll discuss the implications of scaling in Chapter 10, The Future of Generative Models.

We could see new scaling laws linking performance with data quality, and it will be instructive to observe whether model sizes for LLMs keep increasing at the same rate as they have. This is an important question since it determines if the development of LLMs will be firmly in the hands of large organizations. It could be that there’s a saturation of performance at a certain size, which only changes in the approach can overcome. We haven’t seen this leveling off yet, though.

The GPT model series

Trained on 300 billion tokens, GPT-3 has 175 billion parameters, an unprecedented size for DL models. GPT-4 is the most recent in the series, though its size and training details have not been published due to competitive and safety concerns. However, different estimates suggest it has between 200 and 500 billion parameters. Sam Altman, the CEO of OpenAI, has stated that the cost of training GPT-4 was more than $100 million.

ChatGPT, launched by OpenAI in November 2022, stands out as a conversational model developed on the foundation of earlier GPT models, notably GPT-3. It is specifically tailored for dialogue, employing a mix of role-playing scenarios by humans and examples to guide the model towards desired behaviors, significantly enhanced by the use of Reinforcement Learning from Human Feedback (RLHF). Instead of learning from a pre-set reward based on task performance, RLHF trains a model using feedback from humans to understand what good (high reward) and bad (low reward) responses look like. RLHF has proven effective in making AI models more aligned with human values and preferences, applied to fields like conversational agents and computer vision.

The introduction of GPT-4 in March 2023 marked a further leap in capabilities. GPT-4 provides superior performance on various evaluation tasks, coupled with significantly better response avoidance to malicious or provocative queries due to six months of iterative alignment during training.

The following diagram shows the timeline of the different model iterations:

https://lh7-us.googleusercontent.com/mXrcA8bmm_m7eoUNYeSmfJ9NTdAUrBCnLlAsZYsfbZA8PGboClOsvx3S_jh_9o9s_uPz0smzwiJuI_RuWPtWml-zwq4eUNeofPkcCtwyWFsTdJCBaCOnJygcLQRnBlbBTfadpiWtu95I-E8Cx-Pn7zM7YA=s2048

Figure 1.6: The development of the OpenAI GPT model series

There’s also a multi-modal version of GPT-4 that incorporates a separate vision encoder, trained on joined image and text data, giving the model the capability to read web pages and transcribe what’s in images and video.

As can be seen in Figure 1.5, there are quite a few both open-source and closed-source models besides OpenAI’s, some of which have come close to OpenAI models in performance, which we will have a look at.

PaLM and Gemini

PaLM 2, released in May 2023, was trained with the aim of improving multilingual and reasoning capabilities while being more compute efficient. Using evaluations at different compute scales, the authors (Anil et al.; PaLM 2 Technical Report) estimated an optimal scaling of training data sizes and parameters. PaLM 2 is small and exhibits fast and efficient inference, allowing broad deployment and fast response times for a natural pace of interaction. Extensive benchmarking across different model sizes has shown that PaLM 2 has significantly improved quality on downstream tasks, including multilingual common sense and mathematical reasoning, coding, and natural language generation, compared to its predecessor PaLM.

PaLM 2 was also tested on various professional language proficiency exams. The exams used were for Chinese (HSK 7-9 Writing and HSK 7-9 Overall), Japanese (J-Test A-C Overall), Italian (PLIDA C2 Writing and PLIDA C2 Overall), French (TCF Overall), and Spanish (DELE C2 Writing and DELE C2 Overall). Across these exams, which were designed to test C2-level proficiency, considered mastery or advanced professional level according to the Common European Framework of Reference for Languages (CEFR), PaLM 2 achieved mostly high-passing grades.

Gemini, released by Google in December 2023, is a family of highly capable multimodal models jointly trained on image, audio, video, and text data. The largest version, Gemini Ultra, sets new state-of-the-art results across 30 benchmarks spanning language, coding, reasoning, and multimodal tasks like MMMU (Massive Multi-discipline Multimodal). It demonstrates impressive crossmodal reasoning capabilities, understanding, and reasoning across different modalities like text, images, and audio.

Llama and Llama 2

The releases of the Llama and Llama series of models, with up to 70 billion parameters, by Meta AI in February and July 2023, respectively, have been highly influential by enabling the community to build on top of them, thereby kicking off a Cambrian explosion of open-source LLMs. Llama triggered the creation of models such as Vicuna, Koala, RedPajama, MPT, Alpaca, and Gorilla. Llama 2, since its recent release, has already inspired several very competitive coding models, such as WizardCoder.

Optimized for dialogue use cases, at their release, these LLMs outperformed other open-source chat models on most benchmarks and seem on par with some closed-source models based on human evaluations. The Llama 2 70B model performs on a par with or better than PaLM (540 billion parameters) on almost all benchmarks, but there is still a large performance gap between Llama 2 70B and GPT-4 and PaLM-2-L.

Llama 2 is an updated version of Llama 1 trained on a new mix of publicly available data. The pre-training corpus size has increased by 40% (2 trillion tokens of data), the context length of the model has doubled, and grouped-query attention has been adopted. Variants of Llama 2 with different parameter sizes (7 billion, 13 billion, 34 billion, and 70 billion) have been released. While Llama was released under a non-commercial license, Llama 2 is open to the general public for research and commercial use.

Llama 2-Chat has undergone safety evaluation results compared to other open-source and closed-source models. Human raters judged the safety violations of model generations across approximately 2,000 adversarial prompts, including both single and multi-turn prompts.

Claude 1–3

Claude, Claude 2, and Claude 3 are AI assistants created by Anthropic. Claude 2 improved upon previous versions in areas like helpfulness, honesty, and reduced bias. Key enhancements include a massively increased context window of up to 200,000 tokens and strong performance on coding, summarization, and long document understanding tasks.

The latest release is Claude 3, a new family of large multimodal models, including the flagship Claude 3 Opus (the most capable), Claude 3 Sonnet (balanced skills and speed), and Claude 3 Haiku (the fastest and least expensive). With vision capabilities, they demonstrate strong performance across benchmarks, including MMLU. Notably, Claude 3 Opus surpassed OpenAI’s GPT-4 on the Chatbot Arena leaderboard, while exhibiting improved multilingual fluency.

Mixture of Experts (MoE)

Recently, MoE models have had success with high performance at low usage of resources. Mixtral 8x7B by Mistral AI is a Sparse MOE model that outperforms or matches Llama 2 70B and GPT-3.5 across various benchmarks, excelling particularly in math, code generation, and multilingual tasks. Its instruction-tuned version, Mixtral 8x7B-Instruct, surpasses several other prominent models, like GPT-3.5 Turbo and Claude-2.1, on human benchmarks.

Grok-1 is a 314-billion-parameter MoE LLM trained from scratch by xAI and released under the Apache 2.0 license. With 25% of its weights active on a given token, this raw model provides a foundation for further fine-tuning and customization by researchers and developers. xAI trained Grok-1 using their custom training stack built on top of JAX and Rust, showcasing their expertise in developing cutting-edge language models at massive scales. The Leeroo Orchestrator by Leeroo proposes an architecture that integrates multiple LLMs to create a new state-of-the-art model. It achieves performance on par with Mixtral at lower computational cost and even exceeds GPT-4’s accuracy on the MMLU benchmark with further cost reductions.

DBRX is Databricks’ open LLM that establishes open LLMs across standard benchmarks. It surpasses GPT-3.5 and is competitive with Gemini 1.0 Pro, excelling as a code model by outperforming specialized models like CodeLLaMA-70B. DBRX advances efficiency with a fine-grained MoE architecture, offering up to 2x faster inference than LLaMA2-70B and a 40% smaller parameter count than Grok-1. Hosted on Mosaic AI Model Serving, it can generate text at up to 150 tokens/sec/user. Training is about 2x more computationally efficient than dense models for the same quality, achieving previous MPT model quality with nearly 4x less compute overall. Other notable models contributing to LLM advancements include DeepMind’s Chinchilla, Meta’s OPT, Google’s Gopher, Hugging Face’s BLOOM, and various models from research groups like EleutherAI’s GPT-NeoX.

How to use LLMs

You can access LLMs by OpenAI, Google, and Anthropic through their website or their API. If you want to try other LLMs on your laptop, open-source LLMs are a good place to get started. There is a whole model zoo out there! You can access these models through Hugging Face or other providers, as we’ll see starting in Chapter 3, Getting Started with LangChain. You can even download these open-source models, fine-tune them, or fully train them. We’ll fine-tune a model in Chapter 8, Customizing LLMs and Their Output.

The different licenses for LLMs significantly impact how they can be used, modified, and further developed for commercial or research purposes. Some code for training, training datasets, and the weights themselves have been made available to the community to run locally, poke into for investigations, further develop, fine-tune, and improve upon. Other models have been kept behind APIs and the secrets behind their performance are a matter of rumors and speculation. Here’s a breakdown of some key license types and their implications.

Open source licenses (for example, Apache 2.0, MIT):

Allow free use, modification, and redistribution for both commercial and non-commercial purposes
Permit the creation of derivative works and the integration of the models into products/services
Research institutions and commercial entities can build upon and extend these models
Examples: BERT, Mistral

Non-commercial licenses (for example, CC-BY-NC-4.0, non-commercial research):

Permit use and modification only for non-commercial research purposes
Commercial entities cannot directly use or integrate these models into products/services
Researchers can study, evaluate, and build upon the models within academic settings
Examples: Galactica, OPT, Llama 60B

Proprietary licenses:

Models are closed-source and cannot be freely used, modified, or redistributed
Commercial entities retain full control and can monetize the models as products/services
Research institutions may have limited access for evaluation/benchmarking purposes
Examples: GPT-4, Claude, Gemini

Licenses like the Databricks Open Model License and Llama 2 Community License:

Allow the use, modification, and creation of derivative works for both commercial and non-commercial purposes
But may place certain conditions on redistribution, indemnification, or usage tracking
Strike a balance between open access and commercial interests

In general, open source licenses promote wide adoption, collaboration, and innovation around the models, benefiting both research and commercial development. Proprietary licenses give companies exclusive control but may limit academic research progress. Non-commercial licenses restrict commercial use while enabling research. New licenses aim to mitigate these trade-offs.

In the next section, we’ll be reviewing state-of-the-art methods for text-conditioned image generation. I’ll highlight the progress made in the field so far, but also discuss existing challenges and potential future directions.

What are text-to-image models?

Text-to-image models are a powerful type of generative AI that creates realistic images from textual descriptions. They have diverse use cases in creative industries and design for generating advertisements, product prototypes, fashion images, and visual effects. The main applications are:

Text-conditioned image generation: Creating original images from text prompts like “a painting of a cat in a field of flowers.” This is used for art, design, prototyping, and visual effects.
Image inpainting: Filling in missing or corrupted parts of an image based on the surrounding context. This can restore damaged images (denoising, dehazing, and deblurring) or edit out unwanted elements.
Image-to-image translation: Converting input images to a different style or domain specified through text, like “make this photo look like a Monet painting.”
Image recognition: Large foundation models can be used to recognize images, including classifying scenes, but also object detection, for example, detecting faces.

Models like Midjourney, DALL-E 2, and Stable Diffusion provide creative and realistic images derived from textual input or other images. These models work by training deep neural networks on large datasets of image-text pairs. The key technique used is diffusion models, which start with random noise and gradually refine it into an image through repeated denoising steps.

Popular models like Stable Diffusion and DALL-E 2 use a text encoder to map input text into an embedding space. This text embedding is fed into a series of conditional diffusion models, which denoise and refine a latent image in successive stages. The final model output is a high-resolution image aligned with the textual description.

Two main classes of models are used: Generative Adversarial Networks (GANs) and diffusion models. GAN models like StyleGAN or GANPaint Studio can produce highly realistic images, but training is unstable and computationally expensive. They consist of two networks that are pitted against each other in a game-like setting – the generator, which generates new images from text embeddings and noise, and the discriminator, which estimates the probability of the new data being real. As these two networks compete, GANs get better at their task, generating realistic images and other types of data.

The setup for training GANs is illustrated in this diagram (taken from A Survey on Text Generation Using Generative Adversarial Networks, G de Rosa and J P. Papa, 2022; https://arxiv.org/pdf/2212.11119.pdf):

A diagram of a sample

Description automatically generated

Figure 1.7: GAN training

Diffusion models have become popular and promising for a wide range of generative tasks, including text-to-image synthesis. These models offer advantages over previous approaches, such as GANs, by reducing computation costs and sequential error accumulation. Diffusion models operate through a process like diffusion in physics. They follow a forward diffusion process by adding noise to an image until it becomes uncharacteristic and noisy. This process is analogous to an ink drop falling into a glass of water and gradually diffusing.

The unique aspect of generative image models is the reverse diffusion process, where the model attempts to recover the original image from a noisy, meaningless image. By iteratively applying noise removal transformations, the model generates images of increasing resolutions that align with the given text input. The final output is an image that has been modified based on the text input. An example of this is the Imagen text-to-image model (Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding by Google Research, May 2022), which incorporates frozen text embeddings from LLMs, pre-trained on text-only corpora. A text encoder first maps the input text to a sequence of embeddings. A cascade of conditional diffusion models takes the text embeddings as input and generates images.

The denoising process is demonstrated in this plot (source: user Benlisquare via Wikimedia Commons):

A collage of a building

Description automatically generated

Figure 1.8: European-style castle in Japan, created using the Stable Diffusion V1-5 AI diffusion model

In the preceding figure, only some steps within the 40-step generation process are shown. You can see the image generation step by step, including the U-Net denoising process using the Denoising Diffusion Implicit Model (DDIM) sampling method, which repeatedly removes Gaussian noise, and then decodes the denoised output into the pixel space.

With diffusion models, you can see a wide variety of outcomes using only minimal changes to the initial setting of the model or – as in this case – numeric solvers and samplers. Although they sometimes produce striking results, the instability and inconsistency are significant obstacles to applying these models more broadly.

Stable Diffusion was developed by the CompVis group at LMU Munich (High-Resolution Image Synthesis with Latent Diffusion Models by Blattmann et al., 2022). The Stable Diffusion model significantly cuts training costs and sampling time compared to previous (pixel-based) diffusion models. The model can be run on consumer hardware equipped with a modest GPU (for example, the GeForce 40 series). By creating high-fidelity images from text on consumer GPUs, the Stable Diffusion model democratizes access. Further, the model’s source code and even the weights have been released under the CreativeML OpenRAIL-M license, which doesn’t impose restrictions on reuse, distribution, commercialization, and adaptation.

Significantly, Stable Diffusion introduced operations in latent (lower-dimensional) space representations, which capture the essential properties of an image, in order to improve computational efficiency. A VAE provides latent space compression (called perceptual compression in the paper), while a U-Net performs iterative denoising.

Stable Diffusion generates images from text prompts through several clear steps:

It starts by producing a random tensor (a random image) in the latent space, which serves as the noise for our initial image.
A noise predictor (U-Net) takes in both the latent noisy image and the provided text prompt and predicts the noise.
The model then subtracts the latent noise from the latent image.
Steps 2 and 3 are repeated for a set number of sampling steps, for instance, 40 times, as shown in the plot.
Finally, the decoder component of the VAE transforms the latent image back into pixel space, providing the final output image.

A VAE is a model that encodes data into a learned, smaller representation (encoding). These representations can then be used to generate new data similar to that used for training (decoding). The VAE is trained first.

A U-Net is a popular type of convolutional neural network (CNN) that has a symmetric encoder-decoder structure. It is commonly used for image segmentation tasks, but in the context of Stable Diffusion, it can help to introduce and remove noise in an image. The U-Net takes a noisy image (seed) as input and processes it through a series of convolutional layers to extract features and learn semantic representations.

These convolutional layers, typically organized in a contracting path, reduce the spatial dimensions while increasing the number of channels. Once the contracting path reaches the bottleneck of the U-Net, it then expands through a symmetric expanding path. In the expanding path, transposed convolutions (also known as upsampling or deconvolutions) are applied to progressively upsample the spatial dimensions while reducing the number of channels.

When training the image generation model in the latent space itself (latent diffusion model), a loss function is used to evaluate the quality of the generated images. One commonly used loss function is the Mean Squared Error (MSE) loss, which quantifies the difference between the generated image and the target image. The model is optimized to minimize this loss, encouraging it to generate images that closely resemble the desired output.

This training was performed on the LAION-5B dataset, derived from Common Crawl data, comprising billions of image-text pairs from sources such as Pinterest, WordPress, Blogspot, Flickr, and DeviantArt.

The following images illustrate text-to-image generation from a text prompt with diffusion (source: Ramesh and others, Hierarchical Text-Conditional Image Generation with CLIP Latents, 2022; https://arxiv.org/abs/2204.06125):

A dog and a person with a mustache

Description automatically generated

Figure 1.9: Image generation from text prompts

Overall, image generation models such as Stable Diffusion and Midjourney process textual prompts into generated images, leveraging the concept of forward and reverse diffusion processes and operating in a lower-dimensional latent space for efficiency. But what about the conditioning for the model in the text-to-image use case?

The conditioning process allows these models to be influenced by specific textual prompts or input types like depth maps or outlines for greater precision to create relevant images. These embeddings are then processed by a text transformer and fed to the noise predictor, steering it to produce an image that aligns with the text prompt.

It’s out of the scope of this book to provide a comprehensive survey of generative AI models for all modalities. However, let’s get a bit of an overview of what models can do in various domains.

What can AI do in other domains?

Generative AI models have demonstrated impressive capabilities across modalities such as sound, music, video, and 3D shapes. In the audio domain, models can synthesize natural speech, generate original music compositions, and even mimic a speaker’s voice and the patterns of rhythm and sound (prosody).

Speech-to-text systems can convert spoken language into text (Automatic Speech Recognition (ASR)). For video, AI systems can create photorealistic footage from text prompts and perform sophisticated editing like object removal. 3D models learned to reconstruct scenes from images and generate intricate objects from textual descriptions.

There are many types of generative models, handling different data modalities across various domains, as shown in the following table:

Model Type	Input	Output	Examples
Text-to-Text	Text	Text	Mixtral, GPT-4, Claude 3, Gemini
Text-to-Image	Text	Images	DALL-E 2, Stable Diffusion, Imagen
Text-to-Audio	Text	Audio	Jukebox, AudioLM, MusicGen
Text-to-Video	Text	Video	Sora
Image-to-Text	Images	Text	CLIP, DALL-E 3
Image-to-Image	Images	Images	Super-resolution, style transfer, inpainting
Text-to-Code	Text	Code	Stable Diffusion, DALL-E 3, AlphaCode, Codex
Video-to-Audio	Video	Audio	Soundify
Text-to-Math	Text	Mathematical Expressions	ChatGPT, Claude
Text-to-Scientific	Text	Scientific Output	Minerva, Galactica
Algorithm Discovery	Text/Data	Algorithms	AlphaTensor
Multimodal Input	Text, Images	Text, Images	GPT-4V

Table 1.1: Models for audio, video, and other domains

There are a lot more combinations of modalities to consider; these are just some that I have come across. We could consider subcategories of text, such as text-to-math, which generates mathematical expressions from text, where some models such as ChatGPT and Claude shine; or text-to-code, which are models that generate programming code from text, such as AlphaCode and Codex. A few models specialize in scientific text, such as Minerva and Galactica, or algorithm discovery, such as AlphaTensor.

A few models work with several modalities for input or output. An example of a model that demonstrates generative capabilities in multimodal input is OpenAI’s GPT-4V model (GPT-4 with vision), released in September 2023, which takes both text and images and comes with better Optical Character Recognition (OCR) than previous versions to read text from images. Images can be translated into descriptive words, then text filters are applied. This mitigates the risk of generating unconstrained image captions.

As the table shows, text is a common input modality that can be converted into various outputs, like image, audio, and video. The outputs can also be converted back into text or kept within the same modality. LLMs have driven rapid progress for text-focused domains. These models enable a diverse range of capabilities via different modalities and domains. The LLM categories are the main focus of this book; however, we’ll also occasionally look at other models, text-to-image in particular. These models typically use a Transformer architecture trained on massive datasets via self-supervised learning.

Underlying many of these innovations are advances in deep generative architectures like GANs, diffusion models, and transformers. Leading AI labs at Google, OpenAI, Meta, and DeepMind are leading the way in innovation.

Filter reviews by

All

Packt verified reviews

Feefo verified reviews

Amazon verified reviews

Josep Oriol Oct 13, 2023

This book is as up-to-date as it can be, with lots of helpful code examples, and covering all aspects of LLM the development pipeline. It's my work companion, way more useful than official LangChain documentation. A must for everyone involved in LLMOps

Subscriber review

Kam F Siu Jan 30, 2024

Feefo Verified review

Andrew McVeigh May 01, 2024

i'm only 100 pages into this book, but boy is it well phrased and extremely clear. i write apps around LLMs, including RAG architectures. perhaps it's just the current state of my learning, but i've found this book to be extremely helpful and very logically organized. I'll revisit this review once i'm through the entire book, but so far 10/10. it's easily the best and most self-contained book i have on the subject.

Amazon Verified review

F. P. Dec 22, 2023

During my learning journey into large language model (LLM) development, I encountered several challenges:- The difficulty of providing precise instructions within specific contexts, which I found to be the most challenging and crucial aspect.- Switching between different LLM models with minimal programming effort.- Selectively saving chat history in memory.- Handling data efficiently, including managing input data of various modalities and making output data accessible.In overcoming these obstacles, I came across LangChain, a robust toolkit designed for LLM application development. The book "Generative AI with LangChain" by Ben Auffarth provides a comprehensive overview, covering the basics of LLM, LangChain, and its key components (chains, agents, memory, tools). The book also explores sample applications such as chatbots, customization of LLM models (conditioning, fine-tuning), and the deployment of LLM apps into production. Unlike theoretical research materials, this book serves as a practical, one-stop resource for understanding the current landscape of LLM applications.Some of the interesting points:- LangChain helps standardize prompts by providing prompt templates (LangChain Expression Language).- LangChain provides extensive integrations to other model APIs including Fake LLM, OpenAI, Hugging Face, GCP, Jina AI, Replicate, etc.- LangChain has "memory" which allows the model to be context-aware.- LangChain supports advanced data facilities such as map-reduce approach and output parser.This book has significantly saved me time, providing consolidated information without the need for extensive online searches or inquiries to ChatGPT. For those unsure about its content, I recommend checking out the free sample on Amazon – it's undoubtedly worth every penny.

hawkinflight Jan 05, 2024

I have not used LangChain before, and I am looking at this book to learn how to create an LLM app. I am really looking forward to trying it out for all three types of apps covered in the book - assistants/chatbot, code generation, and data science. The book is clear and straight to the point, so I expect to be able to try these out fairly quickly. I have gotten through the "setting up the dependencies" section. I cloned the book's github repo, and I tried three methods for variety's sake to create a python environment: pip, conda, and Docker, all on Windows, and I believe I have them all set up. I hit some bumps, but I was able to follow the onscreen error messages and get past them. For pip, I needed to install MSFT Build Tools to get C++. For the conda case, I had to modify the yaml file for two of the packages - ncurses and readline, which have different names for Windows. In Chapter 2 there is a comparison of LangChain with other frameworks, from which you get a feel that choosing LangChain at this moment is the best choice. I am happy to have found this book, and I can't wait to proceed w/the next steps. It's a lot of fun to be able to interact w/LLMs.

Generative AI with LangChain: Build large language model (LLM) apps with Python, ChatGPT, and other LLMs

What do you get with Print?

Contact Details

Shipping Address

Billing Address

Key benefits

Description

Who is this book for?

What you will learn

Product Details

What do you get with Print?

Contact Details

Shipping Address

Billing Address

Product Details

Packt Subscriptions

Frequently bought together

Table of Contents

Recommendations for you

Customer reviews

Filter reviews by

People who bought this also bought

About the author

FAQs