Pairwise evaluations
As we’ve already mentioned, pairwise evaluators use an LLM to compare two outputs produced by two differently configured versions of your application. A configuration change might be anything: a different prompt, a different foundation model, a new ingestion or chunking mechanism, or just a change in the temperature argument. You don’t get a score on a specific scale; instead, you get preferences, and you can compute the share of cases in which the output from version A is preferred over the output from version B.
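For instance, once you have collected the judge’s verdicts over a set of test inputs, the preference share is simply the fraction of cases in which version A wins. A minimal sketch (the list of verdicts here is purely illustrative):

```python
# Hypothetical verdicts from a pairwise judge over a test set:
# "A" means version A was preferred, "B" means version B was preferred.
verdicts = ["A", "A", "B", "A", "B", "A", "A", "B", "A", "A"]

share_a = sum(v == "A" for v in verdicts) / len(verdicts)
print(f"Version A preferred in {share_a:.0%} of cases")  # 70% in this toy example
```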
LangChain offers you a few out-of-the-box evaluators:
- The `pairwise_string` and `labeled_pairwise_string` evaluators predict which of the two outputs is preferred; the second one additionally takes the golden answer as input. As usual, an evaluator provided with the expected output will most probably perform more reliably and produce preferences that correlate better with human ones. A usage sketch built around `from langchain.evaluation import load_evaluator` follows below.
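A minimal sketch of loading and calling the labeled pairwise evaluator; the question, the two predictions, and the reference answer are illustrative, and the default judge model assumes an OpenAI API key is configured (you can pass your own model via the `llm=` argument of `load_evaluator`):

```python
from langchain.evaluation import load_evaluator

# The labeled variant also expects a reference (golden) answer.
evaluator = load_evaluator("labeled_pairwise_string")

result = evaluator.evaluate_string_pairs(
    input="What is the capital of France?",
    prediction="Paris is the capital of France.",  # output of version A
    prediction_b="I think it might be Lyon.",      # output of version B
    reference="Paris",                             # golden answer
)

# The result contains the verdict ("A" or "B"), a score, and the judge's reasoning.
print(result["value"], result["score"])
print(result["reasoning"])
```

The unlabeled `pairwise_string` evaluator is called the same way, just without the `reference` argument.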