When building production applications with Vellum Workflows, you may encounter scenarios where Workflows take several minutes or longer to complete. This guide covers best practices for handling these long-running Workflows effectively.
Long-running Workflows can present challenges in production environments:
The most robust approach for long-running Workflows is to execute them asynchronously and use webhooks to receive completion notifications. See our Webhooks documentation for detailed setup instructions.
Set up webhook endpoints in your Vellum organization to receive Workflow execution events:
workflow.execution.initiated
workflow.execution.fulfilled
workflow.execution.rejected
For security, we recommend using HMAC authentication for webhook endpoints. See our HMAC Authentication guide for implementation details.
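As a rough illustration of what HMAC verification involves on the receiving side, here is a minimal sketch using only the Python standard library. It assumes the signature is a hex-encoded HMAC-SHA256 of the raw request body; the actual header name, encoding, and signed content are defined in the HMAC Authentication guide and may differ, so treat `verify_signature` and its parameters as hypothetical.

```python
import hashlib
import hmac


def verify_signature(secret: bytes, raw_body: bytes, received_signature: str) -> bool:
    """Check a hex signature against HMAC-SHA256(secret, raw_body).

    Assumption: the sender signs the exact raw request bytes and sends a
    hex digest. Adjust to the scheme described in the HMAC guide.
    """
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    # compare_digest compares in constant time to avoid timing side channels.
    return hmac.compare_digest(expected, received_signature)
```

Verify against the raw bytes of the request before parsing the JSON, since re-serializing the payload can change whitespace or key order and invalidate the signature.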
Use the async execution endpoint to initiate your Workflow and immediately receive an execution_id for correlation. You can also provide your own external_id (such as a job_id, content_id, document_id, or any entity in your system):
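The correlation side of this pattern can be sketched without any SDK calls. The example below assumes each webhook event is a dictionary with a `type` and a top-level `external_id` field; the real payload shape may differ. `JobTracker` and its state names are hypothetical and stand in for whatever jobs table your system uses.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

# Local state names for this sketch, keyed by the webhook event types
# named in this guide.
EVENT_TO_STATE = {
    "workflow.execution.initiated": "INITIATED",
    "workflow.execution.fulfilled": "FULFILLED",
    "workflow.execution.rejected": "REJECTED",
}
TERMINAL_STATES = {"FULFILLED", "REJECTED"}


@dataclass
class JobTracker:
    """In-memory stand-in for a jobs table keyed by external_id."""

    jobs: Dict[str, str] = field(default_factory=dict)

    def register(self, external_id: str) -> None:
        # Call this right before initiating the Workflow with the same external_id.
        self.jobs[external_id] = "PENDING"

    def handle_event(self, event: Dict[str, Any]) -> Optional[str]:
        """Apply one webhook event; return the job state, or None if unknown."""
        external_id = event.get("external_id")  # assumed field location
        if external_id not in self.jobs:
            return None
        new_state = EVENT_TO_STATE.get(event.get("type", ""))
        current = self.jobs[external_id]
        # Ignore late or duplicate events once a job has reached a terminal state.
        if new_state is not None and current not in TERMINAL_STATES:
            self.jobs[external_id] = new_state
        return self.jobs[external_id]
```

Refusing to leave a terminal state keeps the handler idempotent, which matters if webhook deliveries are retried or arrive out of order.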
You can also use the streaming endpoint (execute_workflow_stream) if you want to receive the execution_id from the first event, but the async endpoint (execute_workflow_async) is simpler and more efficient for webhook-based patterns since it returns the execution_id directly without requiring you to handle a stream.
Use the async execution endpoint to initiate a Workflow and immediately receive an execution_id, then poll for the execution status using the Check Workflow Execution Status endpoint. This is the recommended approach for polling-based async execution.
Async executions automatically queue when you exceed your concurrency limit, making this endpoint ideal for batch jobs where you don’t need everything to complete at once. You can initiate many executions quickly and they’ll process as capacity becomes available:
The status endpoint returns the current execution state (PENDING, FULFILLED, REJECTED, etc.), along with outputs and an execution detail URL once the Workflow completes. This makes it ideal for polling-based async execution patterns.
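A polling loop over that state can be sketched as follows. The `fetch_state` callable is a hypothetical wrapper around the Check Workflow Execution Status endpoint that returns the state string; the delays and timeout are illustrative defaults, not recommended values. Any state other than FULFILLED or REJECTED is treated as still in progress.

```python
import time
from typing import Callable

TERMINAL_STATES = {"FULFILLED", "REJECTED"}


def poll_until_done(
    fetch_state: Callable[[str], str],
    execution_id: str,
    *,
    initial_delay: float = 2.0,
    max_delay: float = 30.0,
    timeout: float = 900.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll with exponential backoff until a terminal state or the timeout."""
    deadline = clock() + timeout
    delay = initial_delay
    while True:
        state = fetch_state(execution_id)
        if state in TERMINAL_STATES:
            return state
        # Stop before sleeping past the deadline.
        if clock() + delay > deadline:
            raise TimeoutError(f"execution {execution_id} still {state} at timeout")
        sleep(delay)
        delay = min(delay * 2, max_delay)  # back off, capped at max_delay
```

The `sleep` and `clock` parameters exist only so the loop can be exercised in tests without waiting; production callers can leave them at their defaults.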
For batch processing scenarios where you need to process many Workflow executions at once, see our Batch Jobs guide for detailed patterns and best practices.
This approach is useful when you want to control the callback URL or the payload format. You can use an API Node at the end of your Workflow to send results directly to your system:
For Workflows where you want to provide real-time progress updates, use the streaming execution endpoint. A very long execution can still outlast the connection, but streaming at least lets you relay updated messaging from your LLMs as they invoke tools:
Streaming connections should still have reasonable timeout limits. For very long Workflows (>10 minutes), webhooks are still the recommended approach.
When using synchronous execution, configure appropriate timeouts:
If using AWS Lambda, be aware of the 15-minute maximum execution time:
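One way to respect that limit when a Lambda function waits on a Workflow is to derive the wait budget from the time the invocation has left. The sketch below relies on `get_remaining_time_in_millis()`, which the Lambda Python runtime provides on the context object; the helper name and the 30-second safety margin are assumptions for illustration.

```python
def polling_budget_seconds(context, safety_margin_seconds: float = 30.0) -> float:
    """Seconds this invocation can spend waiting, leaving time to clean up.

    `context` is the object Lambda passes to the handler. The margin is an
    arbitrary illustrative value for persisting state before the hard stop.
    """
    remaining_seconds = context.get_remaining_time_in_millis() / 1000.0
    return max(0.0, remaining_seconds - safety_margin_seconds)
```

If the budget is too small for the Workflow to finish, the function can record the execution_id and return, leaving completion to a webhook or a later invocation.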
For containerized applications, ensure your containers can handle long-running connections if using streaming:
For Workflows that may fail due to transient issues, implement retry logic:
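A minimal client-side sketch of such retry logic, using exponential backoff with jitter, is shown below. `TransientError` is a hypothetical exception type: in practice you would map whichever errors you consider retryable (timeouts, rate limits, provider errors) onto it. The attempt count and delays are illustrative.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def run_with_retries(
    operation: Callable[[], T],
    *,
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying TransientError with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt; jitter spreads out simultaneous retries.
            backoff = base_delay * 2 ** (attempt - 1)
            sleep(backoff + random.uniform(0, base_delay))
    raise ValueError("max_attempts must be at least 1")
```

Only the exception type you designate is retried, so non-transient failures such as invalid inputs surface immediately instead of being repeated.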
Prompts may experience nondeterministic errors from model providers, and longer-running Workflows are particularly prone to these issues. To mitigate this, implement retry logic with Node Adornments. You can also disable streaming in the Model Settings while editing a Prompt to mitigate other nondeterministic connection issues.
Use Node Adornments to add resilience to individual Workflow nodes:
See our Node Adornments documentation for detailed configuration.
Monitor long-running Workflows through the Monitoring tab of your Workflow Deployment (see our monitoring documentation for details):
Always use webhook-based async execution for Workflows that may take more than a few minutes. This provides the most reliable and scalable approach.
Always include an external_id when executing Workflows to correlate webhook events with your internal processes.
Handle both Workflow-level failures and infrastructure timeouts gracefully. Consider retry strategies for transient failures.
Track Workflow execution times to identify performance bottlenecks and optimize your Workflows.
Use HMAC authentication to secure your webhook endpoints and verify that events are coming from Vellum.
Use Node Adornments (Retry, Try) to make individual Workflow components more resilient to transient failures.