Evaluation

Code Execution Evaluation Metric

The Code Execution evaluation metric allows arbitrary Python code execution to be used to produce scores for LLM outputs.

It is intended as a quick and powerful way to format outputs and write conditionals without the restrictions of Jinja or Regex.

After selecting the "Code Execution" evaluation metric in the UI, a code editor is provided, pre-populated with a template containing the bare minimum for the evaluator to run: a main() function that returns a score.
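For reference, a minimal evaluator of this kind, assuming the same completion/target signature used in the examples below, might look like the following sketch:

def main(
    completion: str,
    target: str,
) -> dict:
    """Produces a dict containing at least a "score" key with a numerical value."""
    # Replace this placeholder with your own scoring logic.
    return {
        "score": 1.0,
    }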

Examples

JSON Comparison

While JSON Validity checks whether the output is valid JSON and JSON Schema Match checks whether the output conforms to a structure, neither checks for exact key/value matches per test case. Using the following Python code, it's possible to check that the output matches a known JSON object regardless of key order or spacing.

import json


def main(
    completion: str,
    target: str,
) -> dict:
    """Produces a dict containing at least a "score" key with a numerical value."""
    # Parse both strings so key order and whitespace are ignored.
    completion_dict = json.loads(completion)
    target_dict = json.loads(target)
    # Compare the key/value pairs of the two objects.
    completion_set = set(completion_dict.items())
    target_set = set(target_dict.items())
    is_equal = completion_set == target_set
    return {
        "score": 1.0 if is_equal else 0.0,
    }
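For example, with this evaluator the following pair (hypothetical values, for illustration) scores 1.0 despite differing key order and spacing:

completion = '{"name": "Ada", "score": 42}'
target = '{ "score": 42, "name": "Ada" }'

main(completion, target)  # returns {"score": 1.0}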

Ignore Whitespace

A common problem with exact match comparison of LLM outputs is that they often contain additional leading or trailing whitespace. We can create an exact match evaluator that ignores such whitespace with a few short lines of Python.

def main(
    completion: str,
    target: str,
) -> dict:
    """Produces a dict containing at least a "score" key with a numerical value."""
    # Strip leading and trailing whitespace from both strings before comparing.
    is_equal = completion.strip() == target.strip()
    return {
        "score": 1.0 if is_equal else 0.0,
    }
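For instance, a completion with a trailing newline still matches, while any other difference does not (hypothetical values, for illustration):

main("Paris\n", "Paris")  # returns {"score": 1.0}
main("Paris.", "Paris")   # returns {"score": 0.0}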