Evaluating GenAI applications
Evaluating GenAI applications is still an open problem. Unfortunately, many engineering teams ignore this aspect: their evaluation of such applications amounts to either skipping it entirely or manually testing a few hard-coded examples. We believe this should change. Designing an evaluation approach and an internal benchmark should be your first priority once you have demoed the capabilities of GenAI and begin developing a production-ready application; without a proper evaluation procedure, you risk unintended consequences after you deploy to production.
A big challenge is that GenAI applications are non-deterministic by nature and produce rich, natural-language answers to complex tasks. That makes evaluating them difficult; for example, there is no straightforward way to compare two different summaries of the same text...
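To make this concrete, here is a minimal sketch of one way to compare two free-form summaries: scoring their semantic similarity with sentence embeddings. It assumes the sentence-transformers library and the public all-MiniLM-L6-v2 model are available; the example texts are hypothetical. Embedding similarity captures semantic closeness, not factual accuracy, so treat it as one signal rather than a complete evaluation procedure.

# Sketch: compare two summaries by cosine similarity of their embeddings.
# Semantic closeness is only one signal; it does not check factuality.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

summary_a = "The report argues that renewable energy adoption is accelerating."
summary_b = "According to the report, the shift to renewables is speeding up."

# Encode both summaries into dense vectors and compare them.
embeddings = model.encode([summary_a, summary_b])
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()

print(f"Cosine similarity: {similarity:.3f}")  # values near 1.0 suggest similar meaning

Even a rough signal like this is more systematic than eyeballing outputs, and it can seed the internal benchmark described above: run it over a fixed set of inputs and track the scores across application versions.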