Evaluate your LLM Workflows with Dozens of Premade Vellum Metrics

Metrics

Vellum comes with a set of Metrics that you can use right away within your Test Suites. We are continually adding new Metrics based on the needs of Vellum users.

Here are the default Metrics currently available within Vellum:

Exact Match

Check that the output is exactly equal to the target.

Returns a score of 1 if the output is an exact match, and 0 otherwise.
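As a rough illustration of this scoring rule, the check can be sketched in Python as follows. The function name `exact_match` is hypothetical and not part of Vellum's API, and the sketch assumes a strict string comparison with no trimming or case folding, which the description above does not confirm.

```python
def exact_match(output: str, target: str) -> int:
    """Return 1 if output equals target exactly, else 0.

    Assumption: strict comparison, so case and whitespace differences count.
    """
    return 1 if output == target else 0


print(exact_match("Paris", "Paris"))   # 1
print(exact_match("Paris", "paris"))   # 0 (case differs)
print(exact_match("Paris", "Paris "))  # 0 (trailing space differs)
```

Under this assumption, outputs that differ only in formatting score 0, so a looser Metric may be a better fit when formatting is not important.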

Regex Match

Check that the specified regular expression can be found in the output.

Returns a score of 1 if the regular expression matches, and 0 otherwise.

Note that unless the regular expression is explicitly anchored, it can match anywhere in the output.
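To make the anchoring note concrete, here is a minimal sketch using Python's `re` module. The function name `regex_match` is hypothetical, and the sketch assumes Python regex semantics, which may differ from the regex flavor Vellum uses.

```python
import re


def regex_match(output: str, pattern: str) -> int:
    """Return 1 if the pattern is found anywhere in output, else 0."""
    # re.search scans the whole string, so an unanchored pattern
    # can match in the middle of the output.
    return 1 if re.search(pattern, output) else 0


text = "Order 123 shipped"
print(regex_match(text, r"\d+"))     # 1: digits appear somewhere in the output
print(regex_match(text, r"^\d+$"))   # 0: anchored, output is not digits only
print(regex_match("123", r"^\d+$"))  # 1: the entire output is digits
```

Anchoring with `^` and `$` is the usual way to require that the whole output, rather than a fragment of it, matches the pattern.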

Semantic Similarity

Check that the output is semantically similar to the target.

Returns a score between 0 and 1, where 1 is a perfect match.

Uses a cross encoder to compute the similarity.
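A cross encoder scores a pair of texts jointly rather than embedding each one separately. The sketch below shows one way such a score could be produced and kept within the 0 to 1 range. Everything here is an assumption for illustration: the function names are hypothetical, the model name `cross-encoder/stsb-roberta-base` is just one publicly available cross encoder from the `sentence-transformers` library, and the source does not say which model or clamping behavior Vellum uses.

```python
from typing import Callable

# A scorer takes (output, target) and returns a raw similarity score.
PairScorer = Callable[[str, str], float]


def semantic_similarity(output: str, target: str, score_pair: PairScorer) -> float:
    """Score the pair and clamp the result into [0, 1] (clamping is an assumption)."""
    raw = float(score_pair(output, target))
    return min(1.0, max(0.0, raw))


def load_cross_encoder_scorer(
    model_name: str = "cross-encoder/stsb-roberta-base",  # assumed model choice
) -> PairScorer:
    """Build a scorer backed by a sentence-transformers CrossEncoder."""
    # Imported lazily so the rest of the sketch works without the library.
    from sentence_transformers import CrossEncoder

    model = CrossEncoder(model_name)

    def score_pair(a: str, b: str) -> float:
        # predict takes a list of text pairs and returns one score per pair.
        return float(model.predict([(a, b)])[0])

    return score_pair


# Usage with a stand-in scorer, so no model download is needed:
print(semantic_similarity("The cat sat.", "A cat was sitting.", lambda a, b: 0.83))
```

Passing the scorer as a parameter keeps the scoring rule separate from the model, which makes it easy to swap in a different cross encoder.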

JSON Validity

Check that the output is valid JSON.

Returns a score of 1 if the output is valid JSON, and 0 otherwise.
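A check like this can be sketched with Python's standard `json` module. The function name `json_validity` is hypothetical, and rejecting the non-standard constants `NaN` and `Infinity` is a design choice made for this sketch; the source does not state how Vellum treats them.

```python
import json


def _reject_constant(name: str) -> None:
    # Python's parser accepts NaN/Infinity by default; strict JSON does not.
    raise ValueError(f"non-standard JSON constant: {name}")


def json_validity(output: str) -> int:
    """Return 1 if output parses as JSON, else 0."""
    try:
        json.loads(output, parse_constant=_reject_constant)
    except (ValueError, TypeError):
        # json.JSONDecodeError is a subclass of ValueError.
        return 0
    return 1


print(json_validity('{"name": "Ada", "tags": ["a", "b"]}'))  # 1
print(json_validity("{'name': 'Ada'}"))                      # 0: single quotes
print(json_validity("Here is your JSON: {}"))                # 0: leading prose
```

Note that a bare value such as `42` or `"text"` is also valid JSON under this sketch, so a separate check is needed if the output must be an object.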


The Metrics below are Ragas Metrics designed to evaluate Retrieval-Augmented Generation (RAG) systems. For tips on evaluating your RAG pipeline in Vellum, check out this help center article.

Ragas - Faithfulness

Faithfulness measures the factual consistency of the generated answer against the given context. It is calculated from the answer and the retrieved context. The score is scaled to the (0, 1) range; higher is better.

For details, see: https://docs.ragas.io/en/latest/concepts/metrics/faithfulness.html
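The Ragas documentation defines Faithfulness as the number of claims in the answer that can be inferred from the context, divided by the total number of claims in the answer. The sketch below shows only that final ratio. In Ragas, both claim extraction and claim verification are done by an LLM; here the claims are passed in directly and the verifier is a naive substring check standing in for the LLM. All names are hypothetical, and returning 0.0 for an empty claim list is an arbitrary choice made for this sketch.

```python
from typing import Callable, Sequence


def faithfulness(
    claims: Sequence[str],
    context: str,
    is_supported: Callable[[str, str], bool],
) -> float:
    """Fraction of answer claims that the verifier says the context supports."""
    if not claims:
        return 0.0  # arbitrary choice for this sketch
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)


def naive_verifier(claim: str, context: str) -> bool:
    # Stand-in for an LLM judgment: literal, case-insensitive containment.
    return claim.lower() in context.lower()


context = "Einstein was born in Germany. Einstein was born on 14 March 1879."
claims = [
    "Einstein was born in Germany",         # supported by the context
    "Einstein was born on 20 March 1879",   # contradicts the context
]
print(faithfulness(claims, context, naive_verifier))  # 0.5
```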

Ragas - Answer Relevance

Answer Relevance measures how pertinent the generated answer is to the given prompt. Answers that are incomplete or contain redundant information receive lower scores; higher scores indicate better relevance.

For details, see: https://docs.ragas.io/en/latest/concepts/metrics/answer_relevance.html
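According to the Ragas documentation, this score is the mean cosine similarity between the embedding of the original question and the embeddings of several questions that an LLM generates from the answer. The sketch below covers only the averaging step, with plain lists of floats standing in for embeddings. The function names are hypothetical, and the question generation and embedding steps are assumed to happen elsewhere.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def answer_relevance(
    question_embedding: Sequence[float],
    generated_question_embeddings: Sequence[Sequence[float]],
) -> float:
    """Mean cosine similarity between the question and the generated questions."""
    if not generated_question_embeddings:
        return 0.0  # arbitrary choice for this sketch
    sims = [
        cosine_similarity(question_embedding, g)
        for g in generated_question_embeddings
    ]
    return sum(sims) / len(sims)


# One generated question matches the original exactly, one is unrelated.
print(answer_relevance([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))  # 0.5
```

The intuition is that a complete, focused answer should let an LLM reconstruct something close to the original question, while a vague or padded answer leads to generated questions that drift away from it.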

Ragas - Context Relevancy

This Metric gauges the relevancy of the retrieved context, calculated based on both the question and contexts. The values fall within the range of (0, 1), with higher values indicating better relevancy.

For details, see: https://docs.ragas.io/en/v0.1.5/concepts/metrics/context_relevancy.html
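The linked Ragas v0.1.5 documentation describes this score as the number of sentences in the retrieved context that are relevant to the question, divided by the total number of sentences in the retrieved context. The sketch below shows that ratio only. In Ragas the relevant sentences are selected by an LLM; here a keyword test stands in for that judgment, and all names are hypothetical.

```python
from typing import Callable, Sequence


def context_relevancy(
    sentences: Sequence[str],
    question: str,
    is_relevant: Callable[[str, str], bool],
) -> float:
    """Fraction of context sentences judged relevant to the question."""
    if not sentences:
        return 0.0  # arbitrary choice for this sketch
    relevant = sum(1 for s in sentences if is_relevant(s, question))
    return relevant / len(sentences)


def keyword_judge(sentence: str, question: str) -> bool:
    # Stand-in for an LLM judgment: relevant if the sentence mentions "capital".
    return "capital" in sentence.lower()


sentences = [
    "Paris is the capital of France.",
    "France is known for its cheese.",
    "The capital hosts the Louvre.",
    "Croissants are popular.",
]
print(context_relevancy(sentences, "What is the capital of France?", keyword_judge))  # 0.5
```

A low score under this definition suggests the retriever is returning a lot of text that does not help answer the question.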