Evaluation

Create Custom Evaluators with the Code Execution Metric

The Code Execution evaluation metric lets you run arbitrary Python code to produce scores for LLM outputs.

It is intended as a quick and powerful way to format outputs and write conditionals without the restrictions of Jinja or Regex.

After you select the “Code Execution” evaluation metric in the UI, a code editor appears, prepopulated with a template containing the bare minimum for the evaluator to run: a main() function that returns a score.
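For reference, a minimal evaluator along those lines might look like the following (a sketch only; the exact template in your editor may differ):

def main(
    completion: str,
    target: str,
) -> dict:
    """Produces a dict containing at least a "score" key with a numerical value."""
    # Replace this placeholder with your own scoring logic.
    return {
        "score": 1.0,
    }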

Examples

JSON Comparison

While JSON Validity checks whether the output is JSON and JSON Schema Match checks whether the output conforms to a structure, neither checks for exact key/value matches per test case. Using the following Python code, you can check that the output matches a known JSON object regardless of key order or spacing.

import json

def main(
    completion: str,
    target: str,
) -> dict:
    """Produces a dict containing at least a "score" key with a numerical value."""
    completion_dict = json.loads(completion)
    target_dict = json.loads(target)
    # Comparing the key/value pairs as sets ignores key order; this assumes
    # a flat JSON object whose values are hashable.
    completion_set = set(completion_dict.items())
    target_set = set(target_dict.items())
    is_equal = completion_set == target_set

    return {
        "score": 1.0 if is_equal else 0.0,
    }
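Note that comparing items() as sets only works for flat objects with hashable values. If the target JSON may contain nested objects or arrays, comparing the parsed values directly also ignores key order and recurses into nested structures:

is_equal = completion_dict == target_dict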

Ignore Whitespace

A common problem with exact-match comparison of LLM outputs is that they often contain extra leading or trailing whitespace. We can build an exact-match evaluator that ignores such whitespace in a few short lines of Python.

def main(
    completion: str,
    target: str,
) -> dict:
    """Produces a dict containing at least a "score" key with a numerical value."""
    is_equal = completion.strip() == target.strip()

    return {
        "score": 1.0 if is_equal else 0.0,
    }

Code Packages

You can add pip packages to Python code and npm packages to TypeScript code. You must pin exact package versions and add the imports to your code yourself.

Note that whenever you update your package list, the first execution afterwards may be slow while our system builds and caches the custom runtime.

import * as _ from "lodash"

async function main(variables: {
  completion: string,
  target: string,
}): Promise<{ score: number }> {
  // Calls lodash's floor() purely to demonstrate using an installed npm package.
  return {
    score: variables.target.length + _.floor(7.55),
  }
}
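The same applies to Python. Here is a minimal sketch assuming numpy has been added to the evaluator’s pip packages (the package choice and version pinning are illustrative, not required):

import numpy as np  # assumes numpy was added to the pip packages list

def main(
    completion: str,
    target: str,
) -> dict:
    """Produces a dict containing at least a "score" key with a numerical value."""
    # Calls numpy's floor() purely to demonstrate using an installed pip package.
    return {
        "score": len(target) + float(np.floor(7.55)),
    }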
