Evaluation

Maximize LLM Development Quality with Vellum's Evaluations

Ensuring model quality is challenging, because prompts need to work effectively and consistently over a wide range of potential inputs. When modifying a Prompt, adjusting parameters, or switching models, the likelihood of regression is high. This is where quantitative evaluation comes in.

Vellum’s answer to quantitative evaluation is Test Suites and Evaluation Metrics. Unlike Comparison Mode and Chat Mode, where the output of each Scenario is qualitatively evaluated by visual inspection, Test Suites use Evaluation Metrics that return scores between 0 and 1 to provide a more objective measure of quality over a wider range of scenarios. This is especially important when your Prompt’s coverage needs to scale and visual inspection is no longer feasible, which typically happens once you have 10+ Scenarios.

Test Suites cover common use cases in LLM development:

  1. Test Driven Development of Prompts/Workflows in a Sandbox
  2. Performance testing on large numbers of Scenarios
  3. Regression testing before deploying a change to a Prompt or Workflow
  4. Evaluating external entities

If you’re looking to improve the velocity or quality of your LLM development, Test Suites could be your answer.

Getting Started

Create your first Test Suite

You can create standalone Test Suites through the Test Suites page in the Evaluations tab. You can also create Test Suites starting from a Prompt or Workflow Sandbox. Here, we’ll begin by creating a Test Suite from a Prompt Sandbox.

  1. From your Prompt Sandbox, click on the Evaluations sub-tab
  2. Click “Create New Test Suite” if you haven’t set up a Test Suite for this Prompt Sandbox yet or click the gray “Add Test Suite” button on the right of the page
  3. The “label” and “name” fields are autopopulated with the name of your Sandbox
  4. Open the “Interface Configuration” to see the expected inputs/outputs for this Test Suite. This is also autopopulated from your Sandbox.
  5. If you are creating a Test Suite from scratch, you can customize the inputs/outputs to meet your needs. For now, leave this as it is and click next to start setting up your Evaluation Metrics.
  6. Click the “Add Metric” button to select one or more Evaluation Metrics to evaluate your output against
  7. Select “Exact Match” from the list of available Evaluation Metrics. To learn how to create custom metrics that appear in this list, see Vellum’s Evaluation Metrics
  8. Press “Confirm” to start mapping the metric to this Test Suite
  9. First, select “Completion” to map the metric’s input to the output of your Prompt
  10. Next, select “Target” and then “Add New” in the dropdown to create a new expected output variable that’s mapped to this input
  11. Press “Next” in the bottom right

Well done! You’ve now created a Test Suite that uses the Exact Match evaluator to check whether the output matches its expected value.

Test Suite Creation - Metric Configuration

Create your first Test Case

Test Cases are analogous to Scenarios. A Test Case by default will have a column for each input and output variable expected by your Test Suite, as well as columns for any expected output variables required by your selected Evaluation Metric. When a Test Suite is run, Vellum will iterate over every Test Case it contains. For each Test Case, it will feed the provided inputs into the Prompt, Workflow, or entity being tested. The output will be passed to the Evaluation Metric based on your mapping in the previous step, and a score will be provided based on the evaluator being used.
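
Conceptually, a run behaves like the loop sketched below. The names here are illustrative only and are not Vellum’s API; they simply restate the iteration described above.

def run_test_suite(test_cases, entity_under_test, metrics):
    # Illustrative sketch only; these names are not Vellum's API.
    results = []
    for case in test_cases:
        # Feed the Test Case's inputs into the Prompt, Workflow, or external entity.
        output = entity_under_test(**case["inputs"])
        # Score the output with each configured Evaluation Metric, using the
        # input/target mapping defined during metric configuration.
        scores = {
            metric.name: metric.score(output=output, target=case["expected_outputs"])
            for metric in metrics
        }
        results.append({"label": case["label"], "scores": scores})
    return results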

With your first Test Suite created, you should be on a step that asks you how you want to initialize your Test Cases.

Select “Start from Scenarios”, which will autopopulate Test Cases for you from the Scenarios in your Sandbox.

To create additional Test Cases, follow these steps:

  1. Click the blue “+ Add Test Case” button just below the tab to add a new Test Case row
  2. You can leave the “Label” field alone for now - it’s used to help you visually identify your Test Cases
  3. Enter a value for each of your input variables. For example, if you have an input variable user_age, you may enter “28”.
  4. Enter a value for each of your expected output variables
  5. Click outside of your Test Case row to save. You should get a notification that it was successful.
  6. Click “Finish” to exit the Test Suite creation wizard

Almost there! Now that you have your Test Cases added, you’re ready to run the Test Suite against your Prompt.

Test Suite Creation - Test Cases

Run your first Test Suite

This Test Suite is automatically added to the Prompt Sandbox you’re currently in. To evaluate your Prompt, simply click the blue “Run” button.

Any Prompt or Workflow that uses the same input variables you configured earlier can also run this Test Suite. Let’s try it.

  1. Navigate to any Prompt Sandbox in the “Prompts” tab
  2. Click the gray “Add Test Suite” button at the top right of the page
  3. Attach a Test Suite by clicking on “Use Existing Test Suite” and selecting your Test Suite from the dropdown
  4. Press the blue “Run” button to execute the Test Suite

Link Test Suite to Prompt

Run Test Suite on Prompt

Download via API

We also support viewing your Test Suite Run results via the API. Check out our docs on Test Suite Runs to learn how to view existing runs. We also embed ready-to-use code snippets within the app itself for each executable column on the evaluations table.
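
As a rough sketch of what fetching results over HTTP could look like in Python (the endpoint path, auth header, and response shape below are assumptions; treat the in-app snippets and the Test Suite Runs docs as the source of truth):

import requests

# Hypothetical sketch only: the URL path and header name are assumptions,
# so copy the exact snippet shown in the app for your Test Suite Run instead.
VELLUM_API_KEY = "your-api-key"
TEST_SUITE_RUN_ID = "your-test-suite-run-id"

response = requests.get(
    f"https://api.vellum.ai/v1/test-suite-runs/{TEST_SUITE_RUN_ID}/executions",
    headers={"X_API_KEY": VELLUM_API_KEY},
)
response.raise_for_status()
for execution in response.json().get("results", []):
    print(execution)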

Test Suite Runs via API

Evaluation Metrics

Vellum comes with a set of evaluators that you can use right away within your Test Suites. We are continually adding new Evaluation Metrics based on the needs of Vellum users.

Here are the default Evaluation Metrics currently available within Vellum:

Exact Match

Check that the output is exactly equal to the target.

Returns a score of 1 if the output is an exact match, and 0 otherwise.
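
In effect, the check reduces to strict string equality. A minimal sketch:

def exact_match(output: str, target: str) -> int:
    # 1 only if the two strings are identical; casing and whitespace both count.
    return 1 if output == target else 0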

Regex Match

Check that the specified regular expression can be found in the output.

Returns a score of 1 if the regular expression matches, and 0 otherwise.

Note that unless the regular expression is explicitly anchored, it can match anywhere in the output.
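
For example, an unanchored pattern will pass as long as it appears anywhere in the output, while an anchored pattern must account for the entire output. An illustrative sketch using Python’s re module:

import re

output = "The order total is $42.50 including tax."

# Unanchored: passes because the pattern is found somewhere in the output.
print(1 if re.search(r"\$\d+\.\d{2}", output) else 0)    # 1

# Anchored: fails because the whole output must match the pattern.
print(1 if re.search(r"^\$\d+\.\d{2}$", output) else 0)  # 0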

Semantic Similarity

Check that the output is semantically similar to the target.

Returns a score between 0 and 1, where 1 is a perfect match.

Uses a cross encoder to compute the similarity.
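
As an illustration of the idea (the specific model Vellum uses is not documented here), a cross-encoder scores the output and target as a pair rather than embedding them separately. A sketch with the sentence-transformers library:

from sentence_transformers import CrossEncoder

# Illustrative only; this is not necessarily the model Vellum uses.
model = CrossEncoder("cross-encoder/stsb-roberta-base")

# STS cross-encoders return a similarity score roughly in [0, 1],
# where values near 1 indicate the texts are semantically equivalent.
score = model.predict([("I'm very happy", "I am ecstatic")])[0]
print(score)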

JSON Validity

Check that the output is valid JSON.

Returns a score of 1 if the output is valid JSON, and 0 otherwise.
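
Conceptually, this is equivalent to attempting to parse the output, as in this sketch:

import json

def json_validity(output: str) -> int:
    # 1 if the output parses as JSON, 0 otherwise.
    try:
        json.loads(output)
        return 1
    except json.JSONDecodeError:
        return 0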

Advanced Usage

Create Custom Reusable Evaluation Metric

In addition to the default metrics, Vellum makes it easy to define custom Reusable Evaluation Metrics tailored to your specific business logic and use case. This saves you time and ensures standardized evaluation criteria for the Prompts, Workflows, or external entities you’d like to test.

Let’s create your first Reusable Evaluation Metric:

  1. Visit the Evaluations tab in Vellum and open the Metrics page
  2. Click the blue Create Metric button at the top-right of the page to open the Create Metric modal
  3. From the metric type dropdown, select JSON Schema Match. To learn about metric types other than JSON Schema Match, see Vellum’s Available Metric Types.
  4. In the “Label” field at the top left, enter “My First Metric”. The “Name” field should autopopulate. This is a unique name that you can use to programmatically identify this metric.
  5. In the “Description” field, type in “My first metric description”
  6. Click next to configure your metric and define what the expected output should match
  7. Add “name” and “email” properties to the JSON schema
  8. Click Finish to exit the modal and see your newly added metric card on the Metrics page

Congrats! You’ve now created a Reusable Evaluation Metric that will be visible when selecting and configuring Evaluation Metrics within any Test Suite.

Create New Reusable Evaluation Metric

Available Metric Types

JSON Schema Match

Check that the output matches a specified JSON schema.

Returns a score of 1 if the output matches the schema, and 0 otherwise.
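
For example, the metric built in the walkthrough above expects an object with “name” and “email” properties. Conceptually, the check behaves like validating against a schema such as the one below (a sketch using the jsonschema library; the “required” keyword and Vellum’s exact validation options are assumptions here):

import json
from jsonschema import ValidationError, validate

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "email": {"type": "string"},
    },
    "required": ["name", "email"],  # Assumption for this sketch.
}

def json_schema_match(output: str) -> int:
    # 1 if the output is valid JSON and conforms to the schema, 0 otherwise.
    try:
        validate(instance=json.loads(output), schema=schema)
        return 1
    except (json.JSONDecodeError, ValidationError):
        return 0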

Workflow

Run a Workflow to evaluate the output.

See Workflow Evaluation Metric for more details.

Code

Run custom Python code to evaluate the output.

The code must include a function named main that takes the function arguments specified when creating the metric and returns a dictionary with keys score and optionally normalized_score.

Example
def main(input_1, input_2, target, completion):
    return {
        "score": 10,
        "normalized_score": 0.5,  # Optional; must be in the interval [0, 1] if provided
    }
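
As a slightly more realistic sketch, a metric that checks whether the completion contains the target string could look like the following; the argument names must match the function arguments you configure when creating the metric:

def main(target, completion):
    # Score 1 when the target string appears in the completion, 0 otherwise.
    contains_target = target.lower() in completion.lower()
    return {"score": 1 if contains_target else 0}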

Multiple Evaluators

It’s possible for a Test Suite to have multiple evaluators run simultaneously. This is often desirable when the output is complex and must meet multiple criteria. For example, you may want to validate that an output semantically means “I’m very happy” but also contains the word “ecstatic”. In that case, you could pair the Semantic Similarity metric with a Regex Match metric, and each would report its own score.

Additional evaluators can be configured on an existing Test Suite under the “Metric Setup” section of the “Test Suite Details” page. Any additional columns needed by that evaluator will be editable under the “Test Cases” tab.

When running the Test Suite, you’ll see columns displaying the results for each metric that’s been added to the Test Suite.

Uploading Test Cases

To help you migrate your Test Cases into Vellum, we provide two methods for bulk Test Case uploads.

Via CSV

Under the “Test Cases” tab on the “Test Suite Details” page, there is a blue “Upload Test Cases” button. Clicking that button will open a modal that allows you to bulk upload test cases via a CSV file.
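
As an illustration, for a Test Suite with a single input variable user_age and a single expected output column target (the column names here are hypothetical; the upload modal shows the exact columns your Test Suite expects), the CSV could look like this:

label,user_age,target
Young adult,28,You qualify for the standard plan.
Senior,70,You qualify for the senior discount.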

Via API

To upload test cases via API, check out the Test Cases API documentation.

Function Calling

Prompts can work with different modalities, which in Vellum are called input and output types. One increasingly popular output type for models is function/tool calling. You can use Vellum Test Suites to ensure that your model is producing the correct function call based on any given combination of inputs.

First, you will want to edit the Output Variable type of the Test Suite to the “Function Call” type:

Output Function Type

Next, define the Expected Output type to be of type “Function Call” too:

Expected Function Type

You can then use any metric to help ensure that your Prompt is outputting function calls that perform well across your Test Cases! Here’s an example of a Test Suite using Exact Match as the metric:

Function Call Tests
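
The exact shape of a Function Call value is defined by the type configuration in the app, but as a rough illustration only (the function name and arguments below are hypothetical), an expected Function Call output is typically a function name paired with JSON arguments:

{
  "name": "get_weather",
  "arguments": {
    "city": "Boston",
    "unit": "fahrenheit"
  }
}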

Testing Workflows

Just like Prompts, Workflows can be tested too. The key difference with Workflow Test Suites is that Workflows may have multiple outputs. You may choose to test some of them, or all of them.

Workflow Test Suites can be run from the “Evaluations” tab in the Workflow Builder. The process works very similarly to testing Prompts, and you can create new Test Cases inline.

One common pitfall is trying to attach a Test Suite where the output variable is named completion to a Workflow where the output variable is something else. You will receive a warning when this happens.

Interface Mismatch

Interface Match

Viewing a Workflow Test Case’s Execution Details

To help you diagnose issues with Workflows, you can click the “View Workflow Details” button located within a Test Case’s value cell to view the execution details.

Workflow Executions

Testing Functions External to Vellum

Not only can Vellum’s evaluation framework be used to test Prompts & Workflows hosted in Vellum, it can also be used to evaluate the outputs of arbitrary functions hosted outside of Vellum.

For example, you might test a prompt chain that lives in your codebase and is defined using another third-party library. This can be particularly useful if you want to incrementally migrate to Vellum Prompts/Workflows while ensuring that the outputs remain consistent.

For a detailed example of how to use Vellum’s evaluation framework to test external functions, see the Python example here.
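
As a purely conceptual sketch (the function and inputs below are illustrative, and the actual submission of outputs to a Vellum Test Suite is covered in the linked example), external evaluation boils down to running your own function over each Test Case’s inputs and letting the Test Suite’s metrics score the collected outputs:

# Conceptual sketch of the local half of external evaluation; see the linked
# Python example for the real integration with Vellum.
def legacy_prompt_chain(support_ticket: str) -> str:
    # An existing function in your codebase, e.g. built with another library.
    return "Thanks for reaching out! ..."

test_inputs = [
    {"support_ticket": "I was double charged this month."},
    {"support_ticket": "How do I reset my password?"},
]

# Collect one output per set of Test Case inputs, then hand the outputs to
# Vellum so the Test Suite's Evaluation Metrics can score them.
outputs = [legacy_prompt_chain(**inputs) for inputs in test_inputs]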