Evaluation

Quantitatively Evaluating Outputs

Ensuring model quality is challenging, because prompts need to work effectively and consistently over a wide range of potential inputs. When modifying a prompt, adjusting parameters, or switching models, the likelihood of regression is high. This is where quantitative evaluation comes in.

Vellum's answer to quantitative evaluation is our Test Suites feature. Unlike Comparison Mode and Chat Mode, where the output of each scenario is qualitatively evaluated by visual inspection, Test Suites use scores between 0 and 1 to provide a more objective measure of quality over a wider range of scenarios. This becomes especially important as the number of scenarios you need to cover grows and visual inspection is no longer feasible, which typically happens around 10+ scenarios.

Test Suites cover common use cases in LLM development:

  1. Test Driven Development of Prompts in Playground
  2. Performance testing on large numbers of scenarios
  3. Regression testing before deploying a change to a prompt

If you're looking to improve the velocity or quality of your LLM development, Test Suites could be your answer.

Getting Started

Create your first Test Suite

  1. Visit the "Test Suites" tab in Vellum to open the Test Suites page.
  2. Click the blue "Create Test Suite" button at the top-right of the page to open the Create Test Suite page.
  3. In the "Label" field at the top left, enter "My First Test Suite".
  4. In the "Input Variables" list within the "Execution Interface" section, replace "Variable 1" with the name of the input variable used in your prompt.
  5. If your prompt has additional input variables, you can click "+ Add" below your existing variables to add them. Note that the input variables you configure here must exactly match the input variables of your prompt.
  6. Leave the remaining fields as their default values, and click the blue "Create" button at the bottom right.

Well done! You've now created a test suite that can test any prompt with the input variables you specified.

[Screenshot: Test Suite Details overview]

By default, your test suite will use the Exact Match evaluator to check whether the output is desirable. To learn how to use evaluators other than Exact Match, see Vellum's Evaluation Metrics.

Create your first Test Case

Test Cases are analogous to Scenarios. By default, a Test Case has a column for each input variable in your prompt, plus a target column. When a Test Suite is run, Vellum iterates over every Test Case it contains, feeding each Test Case's inputs into the prompt. The prompt's output is then compared against the target column, and a score is assigned based on the evaluator being used.
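
Conceptually, a Test Suite run boils down to a simple loop. The sketch below is illustrative only; TestCase, run_prompt, and evaluate are hypothetical stand-ins, not Vellum APIs.

from dataclasses import dataclass

@dataclass
class TestCase:
    label: str
    inputs: dict  # one entry per input variable
    target: str   # the expected output

def run_test_suite(test_cases, run_prompt, evaluate):
    results = []
    for case in test_cases:
        output = run_prompt(**case.inputs)     # feed the inputs into the prompt
        score = evaluate(output, case.target)  # compare the output against the target
        results.append((case.label, score))
    return results

# Example with an Exact Match-style evaluator:
cases = [TestCase("greeting", {"name": "Ada"}, "Hello, Ada!")]
scores = run_test_suite(
    cases,
    run_prompt=lambda name: f"Hello, {name}!",  # stand-in for the LLM call
    evaluate=lambda output, target: 1.0 if output == target else 0.0,
)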

With your first Test Suite created, you should be on the "Test Suite Details" page. Let's create your first test case now.

  1. Navigate to the "Test Cases" tab.
  2. Click the blue "+ Add Test Case" button just below the tab to add a new Test Case row.
  3. You can leave the "Label" field alone for now - it's used to help you visually identify your Test Cases.
  4. Enter a value for each of your input variables. For example, if you have an input variable user_age, you may enter "28".
  5. In the target field, enter "This is the expected output of my prompt."
  6. Click outside of your Test Case row to save. You should get a notification that it was successful.

Almost there! Now that you have your first Test Case, you're ready to run it against a prompt.

[Screenshot: Test Suite Details, Test Cases tab]

Run your first Test Suite

Test Suites can be run on any prompt that uses the same input variables you configured earlier. Let's try it.

  1. Navigate to any prompt sandbox in the "Prompts" tab.
  2. Scroll to the "Add Test Suite" section at the bottom of the page.
  3. Attach a Test Suite by clicking the "+ Include" button.
  4. Click the blue "Run" button to execute the Test Suite.

[Screenshot: adding a Test Suite to a Prompt]

[Screenshot: running a Test Suite on a Prompt]

Advanced Usage

Evaluation Metrics

Exact Match

Check that the output is exactly equal to the target.

Returns a score of 1 if the output is an exact match, and 0 otherwise.

Regex Match

Check that the specified regular expression can be found in the output.

Returns a score of 1 if the regular expression matches, and 0 otherwise.

Note that unless the regular expression is explicitly anchored, it can match anywhere in the output.
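
For example, Python's re module shows the difference (a sketch of the behavior, not Vellum's implementation):

import re

output = "The answer is 42."

re.search(r"\d+", output)    # matches: "42" is found somewhere in the output
re.search(r"^\d+$", output)  # no match: the anchors require the entire output to be digits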

Semantic Similarity

Check that the output is semantically similar to the target.

Returns a score between 0 and 1, where 1 is a perfect match.

Uses a cross encoder to compute the similarity.
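
Vellum doesn't specify which cross encoder it uses, but the general idea can be sketched with the sentence-transformers library (the model name here is an illustrative choice, not Vellum's):

from sentence_transformers import CrossEncoder

# A cross encoder reads both texts together and predicts a single similarity score.
model = CrossEncoder("cross-encoder/stsb-roberta-base")
score = model.predict([("I'm very happy", "I am ecstatic")])[0]  # roughly in [0, 1]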

Webhook

Use an external endpoint to evaluate the output.

See Webhook Evaluation Metric for more details.

JSON Validity

Check that the output is valid JSON.

Returns a score of 1 if the output is valid JSON, and 0 otherwise.
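
The check is equivalent to attempting to parse the output, as in this Python sketch:

import json

def json_validity_score(output: str) -> float:
    try:
        json.loads(output)
        return 1.0
    except json.JSONDecodeError:
        return 0.0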

JSON Schema Match

Check that the output matches a specified JSON schema.

Returns a score of 1 if the output matches the schema, and 0 otherwise.
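
A sketch of this check using Python's jsonschema library (the schema shown is just an example):

import json
from jsonschema import ValidationError, validate

def json_schema_match_score(output: str, schema: dict) -> float:
    try:
        validate(instance=json.loads(output), schema=schema)
        return 1.0
    except (json.JSONDecodeError, ValidationError):
        return 0.0

# e.g. require an object with a string "name" field
schema = {"type": "object", "required": ["name"], "properties": {"name": {"type": "string"}}}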

Workflow

Run a Workflow to evaluate the output.

See Workflow Evaluation Metric for more details.

Code

Run custom Python code to evaluate the output.

The code must include a function named main that takes the function arguments specified when creating the metric and returns a dictionary with keys score and optionally normalized_score.

Example
def main(input_1, input_2, target, completion):
    return {
        "score": 10,
        "normalized_score": 0.5,  # Optional; must be in the interval [0, 1] if provided
    }

Multiple Evaluators

It's possible for a Test Suite to run multiple evaluators simultaneously. This is often desirable when the output is complex and must meet multiple criteria. For example, you may want to validate that an output semantically means "I'm very happy" while also containing the word "ecstatic".

Additional evaluators can be configured on an existing Test Suite under the "Evaluation Metrics" section of the "Test Suite Details" page. Any additional columns needed by that evaluator will be editable under the "Test Cases" tab.

When running the Test Suite, a dropdown is available at the top to select the metric for which you want to see the results.
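
To make the "ecstatic" example above concrete, here is a sketch of the two checks side by side (the cross-encoder model is an illustrative choice, as in the Semantic Similarity section):

import re
from sentence_transformers import CrossEncoder

output = "I'm absolutely ecstatic about the results!"

# Regex Match evaluator: the output must contain the word "ecstatic".
regex_score = 1.0 if re.search(r"\becstatic\b", output) else 0.0

# Semantic Similarity evaluator: the output should mean roughly "I'm very happy".
similarity_score = CrossEncoder("cross-encoder/stsb-roberta-base").predict(
    [(output, "I'm very happy")]
)[0]

# Each evaluator reports its own score; use the dropdown to switch between them.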

Upload Test Cases

To help you migrate your Test Cases into Vellum, we provide two methods for bulk Test Case uploads.

Via CSV

Under the "Test Cases" tab on the "Test Suite Details" page, there is a blue "Upload Test Cases" button. Clicking that button will open a modal that allows you to bulk upload test cases via a CSV file.

Via API

To upload test cases via API, please consult a team member at Vellum.

Testing Workflows

Just like Prompts, Workflows can be tested too. The key difference with Workflow Test Suites is that Workflows may have multiple outputs; you may choose to test some of them, or all of them.

Workflow Test Suites can be run from the "Evaluations" tab in the Workflow Builder. This works much like testing Prompts, but you can also create new Test Cases inline.

One common pitfall is attaching a Test Suite whose output variable is named completion to a Workflow whose output variable is named something else. You will receive a warning when this happens.

[Screenshot: interface mismatch warning]

[Screenshot: matching interface]