Reusing Metrics with Different Configurations in Test Suites

When evaluating LLM outputs, you may want to check for different criteria using the same type of Metric. Vellum allows you to add the same Metric multiple times to a test suite, each with different configurations and names, enabling more comprehensive evaluation of your outputs.

Why Reuse Metrics?

There are several scenarios where reusing the same Metric with different configurations is valuable:

  • Evaluating different aspects of the same output - For example, using an LLM-as-Judge Metric to evaluate both factual accuracy and tone in a response
  • Checking for different expected values - Testing if an output contains one of several possible correct answers
  • Applying different thresholds - Using the same Metric with different strictness levels
  • Evaluating different parts of a complex output - Checking different sections of a structured response

Adding Multiple Instances of the Same Metric

To add multiple instances of the same Metric to your Test Suite:

  1. Navigate to your test suite
  2. Add your first Metric as normal through the “Add Metric” button
  3. Configure the Metric with your first set of inputs and parameters
  4. Add the same Metric again through the “Add Metric” button
  5. Give this instance a different name that reflects its specific purpose
  6. Configure it with different inputs or parameters

Renaming Metric Instances

When using the same Metric multiple times, it’s important to rename each instance to clearly indicate its purpose:

  1. When adding or editing a Metric in your test suite, look for the Metric name field at the top of the configuration panel
  2. Change the default name to something descriptive that indicates what this specific instance is evaluating
  3. This custom name will appear in the test suite results, making it clear which aspect of the output is being evaluated

Renaming a Metric instance

Renaming a Metric instance

Example 1: Multiple LLM-as-Judge Metrics

A common use case is using the LLM-as-Judge Metric multiple times to evaluate different aspects of your outputs:

Evaluating a Customer Service Response

You might add three instances of the LLM-as-Judge Metric:

  1. Tone Evaluation - Configured to check if the response is polite and empathetic
  2. Accuracy Evaluation - Configured to verify that the information provided is correct
  3. Completeness Evaluation - Configured to ensure all parts of the customer query were addressed
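
For illustration, the evaluation prompts for the three instances above might look something like the sketch below. The wording and the placeholder variables ({input}, {expected_output}, {completion}) are hypothetical examples, not built-in Vellum templates; in practice you would paste an equivalent prompt into each LLM-as-Judge instance's configuration.

# Hypothetical judge prompts for three instances of the same LLM-as-Judge Metric.
# The Metric type is identical; only the name and prompt differ per instance.
JUDGE_PROMPTS = {
    "Tone Evaluation": (
        "You are reviewing a customer service response. "
        "Score 1 if the response is polite and empathetic, otherwise score 0.\n\n"
        "Response:\n{completion}"
    ),
    "Accuracy Evaluation": (
        "You are reviewing a customer service response. "
        "Score 1 if every factual claim matches the reference answer, otherwise score 0.\n\n"
        "Reference answer:\n{expected_output}\n\nResponse:\n{completion}"
    ),
    "Completeness Evaluation": (
        "You are reviewing a customer service response. "
        "Score 1 if every part of the customer's question is addressed, otherwise score 0.\n\n"
        "Customer question:\n{input}\n\nResponse:\n{completion}"
    ),
}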

Example 2: Evaluating Different Fields in JSON Output

Another powerful use case is using the same Code Execution Metric multiple times to evaluate different fields within a structured JSON output.

Evaluating a Product Recommendation JSON

Imagine your LLM generates a product recommendation in JSON format with multiple nested fields:

{
  "recommendation": {
    "product": {
      "name": "Ultra Comfort Mattress",
      "price": 899.99,
      "features": ["memory foam", "cooling gel", "hypoallergenic"]
    },
    "reasoning": {
      "customer_needs": ["back pain", "overheating at night"],
      "product_benefits": "The cooling gel layer addresses nighttime overheating while the memory foam provides support for back pain."
    },
    "alternatives": [
      {
        "name": "Ergonomic Support Mattress",
        "price": 799.99
      }
    ]
  }
}

You can create a single Code Execution Metric that extracts and evaluates a specific field based on a provided key path:

def main(completion, target, key_path):
    """
    Evaluates a specific field in a JSON output based on a provided key path.

    Args:
        completion: The JSON output from the LLM
        target: The expected value for the specified field
        key_path: Dot-notation path to the field to evaluate (e.g., "recommendation.product.name")

    Returns:
        Dictionary with score (1.0 if match, 0.0 if no match)
    """
    import json

    try:
        # Parse the completion JSON
        data = json.loads(completion)

        # Navigate to the specified field using the key path
        keys = key_path.split('.')
        actual_value = data
        for key in keys:
            if isinstance(actual_value, dict):
                actual_value = actual_value.get(key, None)
            else:
                return {"score": 0.0, "reason": f"Could not navigate to {key} in {actual_value}"}

        # Compare with target value
        if actual_value == target:
            return {"score": 1.0, "reason": f"Field {key_path} matches expected value"}
        else:
            return {"score": 0.0, "reason": f"Field {key_path} value '{actual_value}' does not match expected '{target}'"}

    except Exception as e:
        return {"score": 0.0, "reason": f"Error evaluating JSON: {str(e)}"}

Then, you can add this same Metric multiple times to your test suite with different configurations:

  1. Product Name Validation

    • Rename to: “Product Name Check”
    • Inputs:
      • completion: the Prompt or Workflow output
      • target: the test case’s expected_output, which would resolve to “Ultra Comfort Mattress”
      • key_path: the constant “recommendation.product.name”
  2. Price Validation

    • Rename to: “Price Check”
    • Inputs:
      • completion: the Prompt or Workflow output
      • target: the test case’s expected_output, which would resolve to 899.99
      • key_path: the constant “recommendation.product.price”
  3. Features Validation

    • Rename to: “Features Check”
    • Inputs:
      • completion: the Prompt or Workflow output
      • target: the test case’s expected_output, which would resolve to [“memory foam”, “cooling gel”, “hypoallergenic”]
      • key_path: the constant “recommendation.product.features”
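
To see how a single Metric covers all three checks, here is a quick local simulation that calls the main() function defined above with each configuration against the sample JSON. The sample_completion variable and the loop are illustrative scaffolding for local testing, not part of the Metric itself; in Vellum these values are supplied through the test suite inputs.

import json

# Sample completion matching the relevant portion of the JSON shown earlier (illustrative only).
sample_completion = json.dumps({
    "recommendation": {
        "product": {
            "name": "Ultra Comfort Mattress",
            "price": 899.99,
            "features": ["memory foam", "cooling gel", "hypoallergenic"]
        }
    }
})

# Each tuple mirrors one configured Metric instance: (instance name, target, key_path).
configurations = [
    ("Product Name Check", "Ultra Comfort Mattress", "recommendation.product.name"),
    ("Price Check", 899.99, "recommendation.product.price"),
    ("Features Check", ["memory foam", "cooling gel", "hypoallergenic"], "recommendation.product.features"),
]

for name, target, key_path in configurations:
    result = main(sample_completion, target, key_path)
    print(name, result)
    # Each configuration returns a score of 1.0 with a "matches expected value" reason.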

This approach allows you to reuse the exact same code while evaluating different aspects of your JSON output by simply changing the input parameters. It’s a powerful pattern that reduces duplication and makes your evaluation more maintainable.

Learn more about setting and using Expected Outputs in Quantitative Evaluation.

Best Practices

When reusing Metrics in your test suites:

  • Use clear, descriptive names for each Metric instance
  • Keep configurations focused on specific aspects rather than trying to evaluate too many things at once
  • Review results separately for each Metric instance to understand which specific aspects of your outputs need improvement
  • Design Metrics to be reusable by parameterizing the aspects that will change between instances

Conclusion

Reusing Metrics with different configurations provides a powerful way to perform multi-dimensional evaluation of your LLM outputs. By applying the same Metric type in different ways, you can gain deeper insights into the quality and correctness of your AI-generated content.