Prompt Caching is a powerful feature that allows you to reduce both the cost and latency of your LLM calls by caching frequently used portions of your prompts. This page explains how Prompt Caching works, which models support it, and how to implement it effectively in Vellum.
Prompt Caching is a feature provided by model providers (currently Anthropic and OpenAI) that allows for caching frequently used portions of prompts. When enabled:
Prompt Caching is performed by the model providers themselves (Anthropic and OpenAI), not by Vellum. Vellum provides the interface to enable and configure caching for supported models.
Anthropic allows you to explicitly define which parts of your Prompt should be cached using a special syntax. In Vellum, this is made simple with a UI toggle:

OpenAI automatically handles caching without requiring explicit markup. When you use the same Prompt content repeatedly, OpenAI will automatically cache it to improve performance and reduce costs.
For optimal results, cache:
Even if content is technically “dynamic” but will be reused frequently (like a specific document or context), it’s still beneficial to cache it.
A common pattern is to cache document content while keeping questions dynamic:

This approach allows you to ask multiple different questions about the same document without re-processing the document content each time.
Cached tokens usually expire within a few minutes, after which the model will need to process the full Prompt again. You will likely notice a brief increase in latency, and you will be charged for the full prompt.
Cache durations can vary by model provider, so be sure to check the documentation for the specific model you are using.
Vellum provides visibility into your cache performance through the Prompt Deployment Executions table:

These metrics can help you analyze your cache hit rate and optimize your prompting strategy to maximize cost savings.
For a RAG (Retrieval Augmented Generation) application, you can cache the retrieved document chunks while keeping the user query dynamic:
For chatbots with complex instructions, cache the system instructions while keeping the conversation history dynamic:
Prompt Caching is a powerful optimization technique that can significantly reduce both the cost and latency of your LLM calls. By strategically caching static or frequently reused content, you can build more efficient and responsive AI applications while reducing your operational costs.