Leveraging Online Evaluations for LLM Development with Vellum

Online Evaluations in Vellum provide a way to continuously assess the quality of your deployed LLM applications. This feature lets you monitor and evaluate the performance of your Prompts or Workflows in real time as they’re being used in production.

Getting Started with Online Evaluations

Step 1: Create and Deploy Your LLM Application

  1. Start by creating either a Workflow in the Workflow Sandbox or a Prompt in the Prompt Sandbox.
  2. Once you’re satisfied with what you’ve created, deploy your Workflow or Prompt.

Step 2: Configure Metrics

After deployment, you can configure Metrics to evaluate your LLM application’s performance:

Figure: Configure Metrics for use in Online Evals

  1. Navigate to your Prompt or Workflow Deployment.
  2. Locate the “Metrics” tab in the tab bar.
  3. In the Metrics tab, configure which Metrics you’d like to use to evaluate the performance of your Deployment.
  4. Save your changes. From this point forward, every execution of your Deployment will be automatically evaluated against these Metrics.

Figure: See results of Metrics alongside Execution details

For information on using and defining Metrics in Vellum, see our Metrics page.

Step 3: Understanding Online Evaluations

Online Evaluations offer several key benefits for LLM application development:

  1. Real-time Performance Monitoring: Continuously assess your Deployment’s performance as it handles live requests.
  2. Quality Assurance: Ensure your LLM application maintains high standards even as input patterns shift over time.
  3. Regression Detection: Quickly identify any degradation in performance, allowing for swift corrective action.
  4. Insight-Driven Improvement: Use the gathered data to inform future iterations and improvements of your LLM application.
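To make the regression detection idea concrete, here is a minimal sketch in plain Python. It does not use any Vellum API: it assumes you already have a list of numeric scores for one Metric, ordered oldest first, and the function name `detect_regression`, the window size, and the `max_drop` threshold are all hypothetical choices for illustration.

```python
from statistics import mean


def detect_regression(scores, window=50, max_drop=0.1):
    """Return True if the mean of the latest `window` scores fell more than
    `max_drop` below the mean of the `window` scores just before them."""
    if len(scores) < 2 * window:
        return False  # not enough data to compare two full windows
    baseline = mean(scores[-2 * window:-window])
    recent = mean(scores[-window:])
    return (baseline - recent) > max_drop


# Hypothetical per-execution scores for one Metric, oldest first.
stable = [0.9] * 100
degraded = [0.9] * 50 + [0.6] * 50

print(detect_regression(stable))    # False
print(detect_regression(degraded))  # True
```

Comparing two adjacent windows is only one simple option; the right window size and threshold depend on your traffic volume and on how noisy the Metric is.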

Selecting the Right Metrics

When configuring Metrics for use with Online Evaluations, it’s essential to choose the right ones to align with your specific use case and quality standards. Here are some key considerations to keep in mind:

  1. Start by defining what “good” means for your use case, then decompose that definition into smaller dimensions that are easier to measure individually.
  2. From there, you can select Metrics provided by Vellum that align with these dimensions, or you can define your own.
  3. Note that for now, Metrics are only able to operate on the inputs sent to a Deployment and the outputs generated by it. In the future, Metrics will also be able to operate on Actuals (i.e., end-user feedback sent back to Vellum), such that they can more effectively measure accuracy.
  4. If you’d like advice on which Metrics to use, please feel free to reach out to the Vellum team for guidance!
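The considerations above can be sketched as code. The example below is an illustration in plain Python, not Vellum’s Metric format: it assumes each dimension of “good” is a function that receives only the inputs and the output of one execution, mirroring the constraint described above. The dimension names, the `topic` input key, and `score_execution` are hypothetical.

```python
from typing import Callable, Dict

# A dimension scores one execution using only its inputs and its output.
Dimension = Callable[[Dict[str, str], str], float]


def is_non_empty(inputs: Dict[str, str], output: str) -> float:
    """1.0 if the output contains any non-whitespace text."""
    return 1.0 if output.strip() else 0.0


def addresses_topic(inputs: Dict[str, str], output: str) -> float:
    """1.0 if the output mentions the topic given in the inputs."""
    topic = inputs.get("topic", "")
    return 1.0 if topic and topic.lower() in output.lower() else 0.0


# "Good" decomposed into small, individually measurable dimensions.
DIMENSIONS: Dict[str, Dimension] = {
    "non_empty": is_non_empty,
    "addresses_topic": addresses_topic,
}


def score_execution(inputs: Dict[str, str], output: str) -> Dict[str, float]:
    """Score one execution on every dimension."""
    return {name: fn(inputs, output) for name, fn in DIMENSIONS.items()}


scores = score_execution(
    {"topic": "refunds"}, "Refunds are processed within 5 days."
)
print(scores)  # {'non_empty': 1.0, 'addresses_topic': 1.0}
```

Keeping each dimension small makes a low score easier to interpret than a single combined quality number would be.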

Viewing Evaluation Results

To access your Online Evaluation results:

  1. Go to your Prompt or Workflow’s Deployment details page.
  2. Navigate to the “Executions” tab.
  3. Click on an individual Execution ID to view its details.
  4. On the Execution Details page, you’ll find the evaluation results for your configured Metrics.

You can analyze these results to gain insights into your Deployment’s strengths and areas for improvement.
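As one possible way to analyze results across many Executions, the sketch below averages scores per Metric and lists the weakest Metric first. It assumes you have collected the results yourself into a list of records; the record shape (`execution_id`, `metric`, `score`) and the `summarize` function are hypothetical and do not reflect a Vellum export format.

```python
from collections import defaultdict
from statistics import mean


def summarize(results):
    """Return (metric, average score) pairs, lowest average first."""
    by_metric = defaultdict(list)
    for record in results:
        by_metric[record["metric"]].append(record["score"])
    averages = {metric: mean(scores) for metric, scores in by_metric.items()}
    return sorted(averages.items(), key=lambda item: item[1])


# Hypothetical evaluation records gathered from several Executions.
results = [
    {"execution_id": "a1", "metric": "length", "score": 1.0},
    {"execution_id": "a1", "metric": "tone", "score": 0.0},
    {"execution_id": "b2", "metric": "length", "score": 1.0},
    {"execution_id": "b2", "metric": "tone", "score": 1.0},
]

for metric, average in summarize(results):
    print(f"{metric}: {average:.2f}")
# tone: 0.50
# length: 1.00
```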

Advanced Usage

Multiple Metrics

You can configure multiple Metrics within a single Deployment to evaluate its performance across multiple dimensions. This allows for a more comprehensive assessment of your Deployment’s capabilities. For example, you might configure a Metric to evaluate whether your LLM application produced a response of an appropriate length and another Metric to assess whether it used the proper tone of voice.
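The length and tone example might look like the following in plain Python. This is a rough sketch under stated assumptions, not how Vellum implements its Metrics: the word limits and the `INFORMAL_MARKERS` list are hypothetical, and a keyword check is only a crude stand-in for a real tone assessment.

```python
# Hypothetical words treated as signs of an overly informal tone.
INFORMAL_MARKERS = {"lol", "gonna", "wanna", "dude"}


def length_ok(output, min_words=5, max_words=120):
    """True if the response word count is within the allowed range."""
    word_count = len(output.split())
    return min_words <= word_count <= max_words


def tone_ok(output):
    """True if the response contains none of the informal marker words."""
    words = {word.strip(".,!?").lower() for word in output.split()}
    return words.isdisjoint(INFORMAL_MARKERS)


def evaluate(output):
    """Evaluate one response on both dimensions independently."""
    return {"length": length_ok(output), "tone": tone_ok(output)}


print(evaluate("Thank you for reaching out. Your order ships tomorrow."))
# {'length': True, 'tone': True}
print(evaluate("lol it's gonna ship soon"))
# {'length': True, 'tone': False}
```

Because the two checks are independent, a response can pass one and fail the other, which is the additional signal that configuring multiple Metrics gives you.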

Conclusion

Online Evaluations in Vellum offer a robust, automated way to ensure the ongoing quality and performance of your LLM applications. By providing continuous, metric-based assessments, this feature empowers you to maintain high standards and make data-driven improvements to your Prompts and Workflows.

Remember, the key to leveraging Online Evaluations effectively is thoughtfully configuring your Metrics to align with your specific use case and quality standards. Regularly reviewing and adjusting these Metrics will help you get the most out of this feature.