Long Running Workflows
When building production applications with Vellum Workflows, you may encounter scenarios where Workflows take several minutes or longer to complete. This guide covers best practices for handling these long-running Workflows effectively.
Understanding the Challenge
Long-running Workflows can present challenges in production environments:
- Client timeouts: HTTP clients may time out before the Workflow completes
- Resource constraints: Keeping connections open for extended periods ties up client and server resources
- Error handling: Managing failures in long-running processes
- User experience: Providing feedback during extended operations
Recommended Approaches
Option 1: Webhooks (Recommended)
The most robust approach for long-running Workflows is to execute them asynchronously and use webhooks to receive completion notifications. See our Webhooks documentation for detailed setup instructions.
Configure Webhooks
Set up webhook endpoints in your Vellum organization to receive Workflow execution events:
- Navigate to Organization Settings → Webhooks
- Add your webhook endpoint URL
- Configure authentication (API key, Bearer token, or HMAC)
- Select the events you want to receive:
  - workflow.execution.initiated
  - workflow.execution.fulfilled
  - workflow.execution.rejected
For security, we recommend using HMAC authentication for webhook endpoints. See our HMAC Authentication guide for implementation details.
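As an illustration of what HMAC verification involves on the receiving side, here is a minimal sketch. It assumes the signature is the hex-encoded HMAC-SHA256 of the raw request body computed with a shared secret; the actual header name, encoding, and any timestamp handling are defined in the HMAC Authentication guide, and `verify_signature` is a hypothetical helper name.

```python
import hashlib
import hmac


def verify_signature(secret: bytes, body: bytes, received_signature: str) -> bool:
    """Return True if received_signature matches the HMAC-SHA256 of body.

    Assumption: the sender signs the raw request body and hex-encodes the digest.
    """
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest takes time independent of where the strings differ,
    # which avoids leaking information through response timing.
    return hmac.compare_digest(expected, received_signature)
```

Verify against the raw bytes of the request body rather than a re-serialized JSON object, since re-serialization can change whitespace or key order and invalidate the signature.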
Option 2: Polling
Use the streaming endpoint to initiate a Workflow and immediately receive an execution_id (see the API reference), or provide your own external_id, and then poll for the execution results using the Retrieve Workflow Deployment Execution Event endpoint.
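The polling loop can be sketched as follows. This is not Vellum's SDK: `fetch_state` is a hypothetical callable that you would implement by calling the Retrieve Workflow Deployment Execution Event endpoint and returning the execution's current state, and the state names `FULFILLED` and `REJECTED` are assumptions that mirror the webhook event names.

```python
import time
from typing import Callable, Optional

# Assumed terminal state names; adjust to match the actual API response.
TERMINAL_STATES = {"FULFILLED", "REJECTED"}


def poll_until_done(
    fetch_state: Callable[[], str],
    timeout_s: float = 900.0,
    initial_delay_s: float = 1.0,
    max_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll with exponential backoff until a terminal state or the timeout.

    Returns the terminal state, or None if the time budget ran out first.
    """
    deadline = time.monotonic() + timeout_s
    delay = initial_delay_s
    while True:
        state = fetch_state()
        if state in TERMINAL_STATES:
            return state
        # Give up if the next wait would carry us past the deadline.
        if time.monotonic() + delay > deadline:
            return None
        sleep(delay)
        # Double the delay each round, capped so polls never get too sparse.
        delay = min(delay * 2, max_delay_s)
```

Backing off keeps the request volume low for Workflows that run for many minutes while still returning quickly for short ones.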
Option 3: API Node
This approach is useful when you need a particular callback URL or payload format. Place an API Node at the end of your Workflow to send results directly to your system.
Option 4: Streaming Updates
For Workflows where you want to provide real-time progress updates, use the streaming execution endpoint. A stream that stays open for a very long time can still run into connection timeouts, but this approach lets you relay intermediate messages from your LLMs as they invoke tools.
Streaming connections should still have reasonable timeout limits. For very long Workflows (>10 minutes), webhooks are still the recommended approach.
Timeout Management
Client-Side Timeouts
When using synchronous execution, configure appropriate timeouts.
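A minimal sketch of setting an explicit client-side timeout, using only Python's standard library rather than the Vellum SDK. The `post_json` helper, its parameters, and the injectable `opener` are hypothetical; the URL, headers, and payload you pass must come from the API reference.

```python
import json
import urllib.request
from typing import Any, Callable, Dict


def post_json(
    url: str,
    payload: Dict[str, Any],
    headers: Dict[str, str],
    timeout_s: float = 30.0,
    opener: Callable[..., Any] = urllib.request.urlopen,
) -> Any:
    """POST a JSON payload and return the decoded JSON response."""
    body = json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json", **headers},
        method="POST",
    )
    # The timeout applies to each blocking socket operation (connect, read),
    # not to the total duration of the call.
    with opener(request, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))
```

Because the timeout is per socket operation, a response that keeps trickling data can run longer than `timeout_s` overall; pair it with an overall deadline if you need a hard ceiling.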
Infrastructure Considerations
AWS Lambda
If using AWS Lambda, be aware of the 15-minute maximum execution time.
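One way to stay inside that limit is to check the remaining time on the Lambda context object before each wait, as in the sketch below. `context.get_remaining_time_in_millis()` is part of the Lambda Python runtime; `wait_within_budget`, `check_done`, and the safety margin are hypothetical names and values chosen for illustration.

```python
import time
from typing import Any, Callable

# Assumed margin reserved for cleanup and for returning a response.
SAFETY_MARGIN_MS = 10_000


def wait_within_budget(
    context: Any,
    check_done: Callable[[], bool],
    poll_interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll check_done() until it returns True or the Lambda is nearly out of time.

    Returns True if the work finished, False if the time budget was exhausted.
    """
    while True:
        if check_done():
            return True
        remaining_ms = context.get_remaining_time_in_millis()
        # Stop if sleeping once more would eat into the safety margin.
        if remaining_ms - poll_interval_s * 1000 <= SAFETY_MARGIN_MS:
            return False
        sleep(poll_interval_s)
```

When the function returns False, the handler can hand the execution off (for example by recording its ID for a later invocation) instead of being killed mid-request.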
Container Environments
For containerized applications, ensure your containers can handle long-running connections if using streaming.
Error Handling and Retry Strategies
Workflow-Level Retry
For Workflows that may fail due to transient issues, implement retry logic.
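A minimal sketch of Workflow-level retry with exponential backoff and jitter. `run` stands in for whatever function executes your Workflow, and `TransientError` is a hypothetical exception type; in practice you would map the specific errors you consider retryable (timeouts, rate limits, provider errors) onto it.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def run_with_retry(
    run: Callable[[], T],
    max_attempts: int = 3,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call run() up to max_attempts times, backing off between failures."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return run()
        except TransientError:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error to the caller.
            delay = base_delay_s * 2 ** (attempt - 1)
            # Jitter spreads out retries from clients that failed together.
            sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("unreachable")
```

Only retry Workflows that are safe to run more than once; a Workflow with side effects (such as sending an email) may need an idempotency check before it is re-executed.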
Node-Level Resilience
Prompts may experience nondeterministic errors from model providers, and longer-running Workflows are particularly prone to these issues. To mitigate this, implement retry logic with Node Adornments. You can also disable streaming from the Model Settings while editing a Prompt to mitigate other nondeterministic connection issues.
Use Node Adornments to add resilience to individual Workflow nodes:
- Retry Node Adornments: Automatically retry failed nodes
- Try Node Adornments: Gracefully handle node failures with fallback paths
See our Node Adornments documentation for detailed configuration.
Monitoring and Observability
Execution Tracking
Monitor long-running Workflows through the Monitoring tab of your Workflow Deployment (see our monitoring documentation for details):
- Executions Tab: View real-time execution status
- Timeline View: Analyze execution flow and bottlenecks
- Cost Tracking: Monitor resource usage for long Workflows
Best Practices Summary
Use Webhooks for Production
Always use webhook-based async execution for Workflows that may take more than a few minutes. This provides the most reliable and scalable approach.
Include External IDs
Always include an external_id when executing Workflows to correlate webhook events with your internal processes.
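The correlation step can be sketched as a small registry that maps each external_id to an internal job. Everything here is illustrative: `ExecutionRegistry` is a hypothetical class, the in-memory dictionary stands in for a durable store, and the assumption that the event payload exposes a top-level `external_id` field should be checked against the Webhooks documentation.

```python
import uuid
from typing import Any, Dict, Optional


class ExecutionRegistry:
    """Maps external IDs sent with a Workflow execution to internal job IDs."""

    def __init__(self) -> None:
        self._jobs: Dict[str, str] = {}

    def register(self, job_id: str) -> str:
        """Create a unique external_id for job_id and remember the mapping."""
        external_id = f"job-{job_id}-{uuid.uuid4().hex}"
        self._jobs[external_id] = job_id
        return external_id

    def resolve(self, event: Dict[str, Any]) -> Optional[str]:
        """Return the internal job ID for a webhook event, or None if unknown.

        Assumption: the event carries the ID under a top-level "external_id" key.
        """
        return self._jobs.get(event.get("external_id"))
```

Pass the value returned by `register` as the external_id when executing the Workflow, then call `resolve` in your webhook handler to find the job the event belongs to.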
Implement Proper Error Handling
Handle both Workflow-level failures and infrastructure timeouts gracefully. Consider retry strategies for transient failures.
Monitor Execution Times
Track Workflow execution times to identify performance bottlenecks and optimize your Workflows.
Secure Webhook Endpoints
Use HMAC authentication to secure your webhook endpoints and verify that events are coming from Vellum.
Design for Resilience
Use Node Adornments (Retry, Try) to make individual Workflow components more resilient to transient failures.