Why evals should get more attention

  • Unsuccessful AI products almost always share a common root cause: a failure to create robust evaluation systems. Conversely, creating high-quality evals is one of the most impactful things you can do.
  • Evaluating quality (e.g., tests) and debugging issues (e.g., logging and inspecting data) don’t get enough attention, but they are what move your product past the demo stage.
  • Addressing one failure mode led to the emergence of others, resembling a game of whack-a-mole.
  • There was limited visibility into the AI system’s effectiveness across tasks beyond vibe checks.
  • Prompts expanded into long and unwieldy forms, attempting to cover numerous edge cases and examples.

Unit Tests

  • Unlike typical unit tests, you want to organize these assertions for use in places beyond unit tests, such as data cleaning and automatic retries (using the assertion error to course-correct) during model inference.
  • The most effective way to think about unit tests is to break down the scope of your LLM into features and scenarios; a minimal assertion-test sketch follows the example prompt below.
  • For example, here is one such prompt Rechat uses to generate synthetic inputs for a feature that creates and retrieves contacts.
Write 50 different instructions that a real estate agent can give to his assistant to create contacts on his CRM. The contact details can include name, phone, email, partner name, birthday, tags, company, address and job.

For each of the instructions, you need to generate a second instruction which can be used to look up the created contact.

The results should be a JSON code block in which each element pairs a create instruction with its lookup instruction, like the following:

[
  ["Create a contact for John (johndoe@apple.com)",
  "What's the email address of John Smith?"]
]
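
A minimal sketch of what such feature/scenario assertions can look like. It assumes a hypothetical execute_assistant(instruction) helper that runs the contact feature end to end and returns the assistant’s final text plus the contact record written to the CRM; the module name and return shape are assumptions, not Rechat’s actual code.

import json

import pytest

from my_assistant import execute_assistant  # hypothetical helper (see note above)

# Instruction pairs produced by the synthetic-data prompt above: create, then look up.
SYNTHETIC_CASES = json.loads("""
[
  ["Create a contact for John (johndoe@apple.com)",
   "What's the email address of John Smith?"]
]
""")

@pytest.mark.parametrize("create_instruction,lookup_instruction", SYNTHETIC_CASES)
def test_create_and_lookup_contact(create_instruction, lookup_instruction):
    output, contact = execute_assistant(create_instruction)

    # Simple assertions: no unfilled template placeholders, and the contact
    # actually landed in the CRM with the email mentioned in the request.
    assert "{{" not in output and "}}" not in output
    assert contact is not None
    assert contact.get("email") is not None
    assert contact["email"].lower() in create_instruction.lower()

    # The follow-up lookup instruction should surface the same contact.
    lookup_output, _ = execute_assistant(lookup_instruction)
    assert contact["email"] in lookup_output

Because the assertions are plain Python, the same checks can be reused outside unit tests, e.g. for data cleaning or for automatic retries during inference.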

Model Eval

  • A two-stage process: the model first answers the question, then we ask a different model to look at the response and check whether it is correct.
  • A prerequisite to performing human and model-based eval is to log your traces.
  • Researchers now employ LLMs like GPT-4 to evaluate the outputs of similar models, a recursive use of LLMs that reflects the field’s continuous cycle of improvement and refinement. Human and GPT-4 judges can reach above 80% agreement on correctness and readability scores; if a score difference of at most 1 is tolerated, agreement can exceed 95%.
  • Critiques from a good evaluator model can also be used to curate high-quality synthetic data.

Example

  • We ask the model to write a funny joke; the model generates a completion.
  • We then create a new input for the judge model: “Is the following joke funny? First reason step by step, then answer yes or no.”
  • We consider the original completion correct if the judge’s completion ends with “yes” (a minimal sketch follows).
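
A minimal sketch of this two-stage check using the OpenAI Python client; the model name, prompt wording, and end-of-string check are assumptions, and any chat-completion client could be swapped in.

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def complete(prompt: str, model: str = "gpt-4o") -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

# Stage 1: the model under test generates the completion.
joke = complete("Write a funny joke.")

# Stage 2: a judge model critiques it, reasoning first and answering last.
verdict = complete(
    "Is the following joke funny? First reason step by step, "
    f"then answer yes or no.\n\nJoke: {joke}"
)

# The original completion counts as correct if the judge ends with "yes".
is_correct = verdict.strip().lower().rstrip(".").endswith("yes")
print(is_correct)

In practice you would run this over many prompts and report a pass rate rather than a single boolean.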

Methodology

  • The evaluation begins with the creation of a benchmark dataset, which should be as representative as possible of the data the LLM will encounter in a live environment.
    • One way to speed up the process of building eval datasets is to use GPT-4 to generate synthetic data.
  • Once we have our evaluation test set, complete with ground truth and responses generated by our LLM application, the next step is to grade these results. This phase involves a mix of LLM-assisted evaluation prompts and more integrated, hybrid approaches (a grading sketch follows this list).
    • Opt for the most robust model you can afford: Advanced reasoning capabilities are often required to effectively critique outputs. Your deployed system might need to have low latency for user experience. However, a slower and more powerful LLM might be needed for evaluating outputs effectively.
    • The evaluating model might make errors and give you a false sense of confidence in your system.
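
A sketch of the grading phase under a few assumptions: each benchmark row holds a question, a ground-truth answer, and the response produced by the application, and a stronger judge model compares the two. The judge prompt, helper names, and model choice are illustrative, not a prescribed recipe.

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def complete(prompt: str, model: str = "gpt-4o") -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

JUDGE_PROMPT = """You are grading an AI assistant.
Question: {question}
Reference answer: {ground_truth}
Assistant answer: {response}
Does the assistant answer convey the same facts as the reference?
Reason step by step, then answer CORRECT or INCORRECT on the last line."""

def grade(benchmark: list[dict]) -> float:
    """benchmark rows look like {"question", "ground_truth", "response"}."""
    passed = 0
    for row in benchmark:
        verdict = complete(JUDGE_PROMPT.format(**row))
        last_line = verdict.strip().splitlines()[-1].strip().upper()
        if last_line.startswith("CORRECT"):
            passed += 1
    return passed / len(benchmark)

Spot-check a sample of the judge’s verdicts by hand; as noted above, an evaluating model can be wrong and give a false sense of confidence.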

Human Eval

  • I often find that “correctness” is somewhat subjective, and you must align the model judge with a human (see the agreement sketch after this list).
    • A translation could score high on BLEU for having words in technically correct order, but still miss the mark in conveying the right tone, style, or even meaning as intended in the original text.
  • LLMs may exhibit a preference for answers generated by other LLMs over human-authored text, potentially leading to a skewed evaluation favoring machine-generated content.
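
A small sketch of that alignment step: label a sample of outputs with both a human and the model judge, then measure how often they agree before trusting the judge at scale. The labels below are illustrative.

# Measure how often the LLM judge agrees with a human label on the same
# examples, as a sanity check before relying on the judge at scale.
def agreement_rate(human_labels: list[str], judge_labels: list[str]) -> float:
    assert len(human_labels) == len(judge_labels)
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)

human = ["pass", "fail", "pass", "pass"]
judge = ["pass", "fail", "fail", "pass"]
print(f"human/judge agreement: {agreement_rate(human, judge):.0%}")  # 75%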

Best practices

  • You may need to build a custom data-viewing tool that shows:
    • What tool (feature) & scenario was being evaluated.
    • Whether the trace resulted from a synthetic input or a real user input.
    • Filters to navigate between different tools and scenario combinations.
    • Links to the CRM and trace logging system for the current record.
  • Create positive and negative evals: something cannot be logically true and untrue at the same time, so design evals for both directions to increase confidence (see the sketch after this list).
  • We noticed that many failures involved small mistakes in the final output of the LLM (format, content, etc.), so we made the final output editable by a human, letting us curate and fix data for fine-tuning.
  • When starting, examine as much data as possible; I read the traces generated from all test cases and from real users at a minimum. You can never stop looking at data (there is no free lunch), but over time you can read a sample rather than everything, lessening the burden.
  • One signal that you are writing good tests and assertions is when the model struggles to pass them; these failure modes become problems you can solve later with techniques like fine-tuning.
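
A sketch of a positive/negative pair for the contact-creation scenario, reusing the hypothetical execute_assistant helper from the Unit Tests sketch; the helper and its return shape are assumptions.

from my_assistant import execute_assistant  # hypothetical helper

def test_contact_created_when_requested():
    # Positive eval: asking for a contact should create one.
    _, contact = execute_assistant("Create a contact for Jane (jane@example.com)")
    assert contact is not None
    assert contact.get("email") == "jane@example.com"

def test_no_contact_created_when_not_requested():
    # Negative eval: an unrelated request must not create a contact.
    _, contact = execute_assistant("What time is my showing tomorrow?")
    assert contact is None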

References

  • Getting Started with OpenAI Evals
  • All about LLM Evals
  • Your AI Product Needs Evals