Reusing Metrics with Different Configurations in Test Suites
When evaluating LLM outputs, you often want to check several different criteria with the same type of Metric. Vellum lets you add the same Metric to a Test Suite multiple times, each instance with its own name and configuration, enabling more comprehensive evaluation of your outputs.
Why Reuse Metrics?
There are several scenarios where reusing the same Metric with different configurations is valuable:
- Evaluating different aspects of the same output - For example, using an LLM-as-Judge Metric to evaluate both factual accuracy and tone in a response
- Checking for different expected values - Testing if an output contains one of several possible correct answers
- Applying different thresholds - Using the same Metric with different strictness levels
- Evaluating different parts of a complex output - Checking different sections of a structured response
Adding Multiple Instances of the Same Metric
To add multiple instances of the same Metric to your Test Suite:
- Navigate to your Test Suite
- Add your first Metric as normal through the “Add Metric” button
- Configure the Metric with your first set of inputs and parameters
- Add the same Metric again through the “Add Metric” button
- Give this instance a different name that reflects its specific purpose
- Configure it with different inputs or parameters
Renaming Metric Instances
When using the same Metric multiple times, it’s important to rename each instance to clearly indicate its purpose:
- When adding or editing a Metric in your test suite, look for the Metric name field at the top of the configuration panel
- Change the default name to something descriptive that indicates what this specific instance is evaluating
- This custom name will appear in the test suite results, making it clear which aspect of the output is being evaluated
Example 1: Multiple LLM-as-Judge Metrics
A common pattern is applying the LLM-as-Judge Metric multiple times to evaluate different aspects of your outputs:
Evaluating a Customer Service Response
You might add three instances of the LLM-as-Judge Metric (example evaluation prompts follow this list):
- Tone Evaluation - Configured to check if the response is polite and empathetic
- Accuracy Evaluation - Configured to verify that the information provided is correct
- Completeness Evaluation - Configured to ensure all parts of the customer query were addressed
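The evaluation prompt for each instance might look something like this (hypothetical prompts, shown only to illustrate how the three instances differ):

```
Tone Evaluation:
  "Is the response polite and empathetic toward the customer? Answer PASS or FAIL."

Accuracy Evaluation:
  "Is every factual claim in the response correct? Answer PASS or FAIL."

Completeness Evaluation:
  "Does the response address every part of the customer's query? Answer PASS or FAIL."
```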
Example 2: Evaluating Different Fields in JSON Output
Another useful pattern is applying the same Code Execution Metric multiple times to evaluate different fields within a structured JSON output.
Evaluating a Product Recommendation JSON
Imagine your LLM generates a product recommendation in JSON format with multiple nested fields:
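For example, the completion might be a JSON payload like the following (the structure and values shown here mirror the key paths and expected outputs used in the configurations below):

```json
{
  "recommendation": {
    "product": {
      "name": "Ultra Comfort Mattress",
      "price": 899.99,
      "features": ["memory foam", "cooling gel", "hypoallergenic"]
    }
  }
}
```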
You can create a single Code Execution Metric that extracts and evaluates a specific field based on a provided key path:
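Here is a minimal Python sketch of what that Metric's code might look like. The main entry point, its parameter names (completion, target, key_path), and the {"score": ...} return shape are assumptions about how the Code Execution Metric is configured, not a fixed API:

```python
import json


def get_by_key_path(data, key_path):
    """Walk a nested structure using a dot-delimited path like 'recommendation.product.name'."""
    value = data
    for key in key_path.split("."):
        value = value[key]
    return value


def main(completion, target, key_path):
    """Return a score of 1.0 if the field at key_path in the JSON completion equals target."""
    try:
        parsed = json.loads(completion)
        actual = get_by_key_path(parsed, key_path)
    except (json.JSONDecodeError, KeyError, TypeError):
        # Malformed JSON or a missing field counts as a failed check.
        return {"score": 0.0}
    # Note: if target arrives as a JSON-encoded string, it may need json.loads(target) first.
    return {"score": 1.0 if actual == target else 0.0}
```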
Then, you can add this same Metric multiple times to your Test Suite with different configurations:
- Product Name Validation
  - Rename to: “Product Name Check”
  - Inputs:
    - completion: output (the Prompt or Workflow output)
    - target: expected_output (would resolve to “Ultra Comfort Mattress”)
    - key_path: “recommendation.product.name” (a constant)
- Price Validation
  - Rename to: “Price Check”
  - Inputs:
    - completion: output (the Prompt or Workflow output)
    - target: expected_output (would resolve to 899.99)
    - key_path: “recommendation.product.price” (a constant)
- Features Validation
  - Rename to: “Features Check”
  - Inputs:
    - completion: output (the Prompt or Workflow output)
    - target: expected_output (would resolve to [“memory foam”, “cooling gel”, “hypoallergenic”])
    - key_path: “recommendation.product.features” (a constant)
This approach lets you reuse the exact same code to evaluate different parts of your JSON output simply by changing the input parameters, a pattern that reduces duplication and keeps your evaluations maintainable.
Learn more about setting and using Expected Outputs in Quantitative Evaluation.
Best Practices
When reusing Metrics in your Test Suites:
- Use clear, descriptive names for each Metric instance
- Keep configurations focused on specific aspects rather than trying to evaluate too many things at once
- Review results separately for each Metric instance to understand which specific aspects of your outputs need improvement
- Design Metrics to be reusable by parameterizing the aspects that will change between instances
Conclusion
Reusing Metrics with different configurations provides a powerful way to perform multi-dimensional evaluation of your LLM outputs. By applying the same Metric type in different ways, you can gain deeper insights into the quality and correctness of your AI-generated content.