Evaluating GenAI Applications
Large Language Models (LLMs) have demonstrated strong performance on a wide range of Natural Language Processing (NLP) tasks, and even capabilities in commonsense reasoning. When new LLMs are released, they are typically tested on general-purpose datasets, and their results are published on public benchmarks and leaderboards. Still, when building Generative AI (GenAI) applications, we need to evaluate performance on the specific task (or tasks) we're working on. We need this for two reasons: to ensure we meet product requirements and quality expectations, and to compare different architectures or prompting techniques so we can pick the best setup for our use case.
In this chapter, we're going to discuss how you can evaluate a GenAI application, briefly touch on using LangSmith to trace and debug your application, and explore Vertex AI's evaluation capabilities with LangChain.
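As a quick taste of the tracing side, the sketch below shows one common way to turn on LangSmith tracing through environment variables before running any LangChain code. The project name, the placeholder API key, and the Gemini model choice are assumptions for illustration, and it presumes you already have Vertex AI credentials configured; it is a minimal sketch, not the chapter's full setup.

```python
import os

# Tracing is enabled through environment variables; once set, LangChain
# components report their runs to LangSmith automatically.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"   # placeholder key
os.environ["LANGCHAIN_PROJECT"] = "genai-evaluation-demo"      # assumed project name

from langchain_google_vertexai import ChatVertexAI

# Every call from here on is recorded as a run you can inspect in the LangSmith UI.
# The model name is an assumption; any model available to your project works.
llm = ChatVertexAI(model_name="gemini-1.5-flash")
print(llm.invoke("Give one reason to evaluate a GenAI application.").content)
```

With tracing in place, each chain and model invocation shows up as a run with its inputs, outputs, and latency, which is what makes debugging and later evaluation practical.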
Here, we’ll cover...