Long Running Workflows
When building production applications with Vellum Workflows, you may encounter scenarios where Workflows take several minutes or longer to complete. This guide covers best practices for handling these long-running Workflows effectively.
Understanding the Challenge
Long-running Workflows can present challenges in production environments:
- Client timeouts: HTTP clients may time out before the Workflow completes
- Resource constraints: Keeping connections open for extended periods
- Error handling: Managing failures in long-running processes
- User experience: Providing feedback during extended operations
Recommended Approaches
Option 1: Webhooks (Recommended)
The most robust approach for long-running Workflows is to execute them asynchronously and use webhooks to receive completion notifications. See our Webhooks documentation for detailed setup instructions.
Configure Webhooks
Set up webhook endpoints in your Vellum organization to receive Workflow execution events:
- Navigate to Organization Settings → Webhooks
- Add your webhook endpoint URL
- Configure authentication (API key, Bearer token, or HMAC)
- Select the events you want to receive:
  - workflow.execution.initiated
  - workflow.execution.fulfilled
  - workflow.execution.rejected
For security, we recommend using HMAC authentication for webhook endpoints. See our HMAC Authentication guide for implementation details.
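As a rough illustration of what HMAC verification involves on the receiving side, the sketch below recomputes a signature over the raw request body and compares it with the one received. It assumes a hex-encoded HMAC-SHA256 of the body using a shared secret; the actual header name, encoding, and signed content are defined in the HMAC Authentication guide, and `verify_signature` is a hypothetical helper name.

```python
import hashlib
import hmac


def verify_signature(secret: bytes, body: bytes, received_hex: str) -> bool:
    """Return True if received_hex equals HMAC-SHA256(secret, body) in hex.

    Assumption: the sender signs the raw request body with a shared secret
    and sends the hex digest. Check the HMAC guide for the real scheme.
    """
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking where the two values first differ.
    return hmac.compare_digest(expected.encode(), received_hex.encode())
```

Verify against the raw bytes of the request body, before any JSON parsing, since re-serializing the payload can change the bytes that were signed.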
Execute Workflow
Use the async execution endpoint to initiate your Workflow and immediately receive an execution_id for correlation. You can also provide your own external_id (such as a job_id, content_id, document_id, or any other identifier from your system).
The streaming endpoint (execute_workflow_stream) also works if you prefer to read the execution_id from the first event, but the async endpoint (execute_workflow_async) is simpler and more efficient for webhook-based patterns, since it returns the execution_id directly without requiring you to handle a stream.
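On the receiving side, a webhook handler typically looks up the job by the external_id you supplied and updates it according to the event type. The sketch below shows that correlation step only. The payload shape (`type`, plus `data` containing `external_id`, `outputs`, and `error`) is an assumption for illustration, and `jobs` and `handle_event` are hypothetical names; consult the Webhooks documentation for the real event schema.

```python
from typing import Any, Dict

# In-memory stand-in for your own job table, keyed by the external_id
# supplied when the execution was started.
jobs: Dict[str, Dict[str, Any]] = {}

TERMINAL_STATUSES = {"done", "failed"}


def handle_event(event: Dict[str, Any]) -> str:
    """Apply one webhook event to the matching job and return its status.

    Assumed payload shape (hypothetical):
    {"type": "...", "data": {"external_id": "...", "outputs": ..., "error": ...}}
    """
    data = event.get("data", {})
    job = jobs.get(data.get("external_id"))
    if job is None:
        return "unknown"
    # Never move a finished job backwards if a late or repeated event arrives.
    if job["status"] in TERMINAL_STATUSES:
        return job["status"]
    event_type = event.get("type")
    if event_type == "workflow.execution.initiated":
        job["status"] = "running"
    elif event_type == "workflow.execution.fulfilled":
        job["status"] = "done"
        job["outputs"] = data.get("outputs")
    elif event_type == "workflow.execution.rejected":
        job["status"] = "failed"
        job["error"] = data.get("error")
    return job["status"]
```

Guarding terminal statuses keeps the handler safe to call more than once for the same execution.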
Option 2: Async Execution with Status Polling
Use the async execution endpoint to initiate a Workflow and immediately receive an execution_id, then poll for the execution status using the Check Workflow Execution Status endpoint. This is the recommended approach for polling-based async execution.
Async executions automatically queue when you exceed your concurrency limit, making this endpoint ideal for batch jobs where you don’t need everything to complete at once. You can initiate many executions quickly and they’ll process as capacity becomes available.
The status endpoint returns the current execution state (PENDING, FULFILLED, REJECTED, etc.), along with outputs and an execution detail URL once the Workflow completes.
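A polling loop for this pattern might look like the sketch below. It does not call the Vellum API directly: `get_status` is a hypothetical callable you would implement around the Check Workflow Execution Status endpoint, and the `state` field name and the choice of FULFILLED and REJECTED as the only terminal states are assumptions.

```python
import time
from typing import Any, Callable, Dict

# Assumption: these are the states after which polling can stop.
TERMINAL_STATES = {"FULFILLED", "REJECTED"}


def poll_until_done(
    get_status: Callable[[str], Dict[str, Any]],
    execution_id: str,
    timeout_s: float = 1800.0,
    initial_delay_s: float = 2.0,
    max_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll get_status(execution_id) until a terminal state or the deadline."""
    deadline = time.monotonic() + timeout_s
    delay = initial_delay_s
    while True:
        result = get_status(execution_id)
        if result.get("state") in TERMINAL_STATES:
            return result
        # Give up if the next wait would run past the overall deadline.
        if time.monotonic() + delay > deadline:
            raise TimeoutError(f"execution {execution_id} still running")
        sleep(delay)
        # Back off exponentially, capped so long runs are still checked regularly.
        delay = min(delay * 2, max_delay_s)
```

The growing delay keeps request volume low for executions that run many minutes while still returning quickly for short ones.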
For batch processing scenarios where you need to process many Workflow executions at once, see our Batch Jobs guide for detailed patterns and best practices.
Option 3: API Node
This approach is useful when you want control over the callback URL or payload format. You can use an API Node at the end of your Workflow to send results directly to your system.
Option 4: Streaming Updates
For Workflows where you want to provide real-time progress updates, use the streaming execution endpoint. Streaming can still be problematic when the connection has to stay open for a long time, but it lets you relay updated messaging from your LLMs as they invoke tools.
Streaming connections should still have reasonable timeout limits. For very long Workflows (>10 minutes), webhooks are still the recommended approach.
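One way to apply such a limit is to cap the total time spent consuming the stream, as in the sketch below. It works on any iterator of event dictionaries; the event shape and the use of workflow.execution.fulfilled and workflow.execution.rejected as terminal stream events are assumptions, and `consume_stream` is a hypothetical helper.

```python
import time
from typing import Any, Callable, Dict, Iterable, Optional

# Assumption: the stream ends with one of these event types.
TERMINAL_EVENTS = {"workflow.execution.fulfilled", "workflow.execution.rejected"}


def consume_stream(
    events: Iterable[Dict[str, Any]],
    on_update: Callable[[Dict[str, Any]], None],
    max_duration_s: float = 600.0,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[Dict[str, Any]]:
    """Forward progress events to on_update; return the terminal event.

    Returns None if the stream ends or the time budget runs out first.
    """
    deadline = clock() + max_duration_s
    for event in events:
        if event.get("type") in TERMINAL_EVENTS:
            return event
        on_update(event)
        # This check only runs when an event arrives; a silent connection
        # still needs a read timeout on the underlying HTTP client.
        if clock() > deadline:
            return None
    return None
```

When the function returns None, the execution may still be running, so fall back to webhooks or status polling to learn the outcome.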
Timeout Management
Client-Side Timeouts
When using synchronous execution, configure appropriate timeouts:
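Many HTTP clients expose a timeout setting, and that is the first thing to configure. Where a blocking call gives you no such setting, a fallback is to bound how long the caller waits, as in this sketch using the standard library. `call_with_timeout` is a hypothetical helper, and `fn` stands in for whatever synchronous execution call you make.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_timeout(fn: Callable[[], T], timeout_s: float) -> T:
    """Run fn in a worker thread and wait at most timeout_s for its result.

    Raises the built-in TimeoutError if fn has not finished in time.
    The worker thread is not cancelled; it keeps running in the background.
    """
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout as exc:
        raise TimeoutError(f"call exceeded {timeout_s}s") from exc
    finally:
        # Do not block on the worker; let the caller move on.
        executor.shutdown(wait=False)
```

Because the underlying request is not cancelled, the Workflow itself may still complete after the caller gives up, which is another reason to prefer the async patterns above for long runs.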
Infrastructure Considerations
AWS Lambda
If using AWS Lambda, be aware of the 15-minute maximum execution time:
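One way to stay inside that limit is to check the time remaining on each loop iteration and return early with the execution_id, so that a later invocation can pick up where this one stopped. The sketch below uses the Lambda context's `get_remaining_time_in_millis()` method; `make_handler`, `get_status`, the event's `execution_id` field, and the `state` values are hypothetical stand-ins for your own code.

```python
import time
from typing import Any, Callable, Dict


def make_handler(
    get_status: Callable[[str], Dict[str, Any]],
    poll_interval_s: float = 5.0,
    safety_margin_ms: int = 10_000,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[[Dict[str, Any], Any], Dict[str, Any]]:
    """Build a Lambda handler that polls but exits before the time limit."""

    def handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
        execution_id = event["execution_id"]
        while True:
            result = get_status(execution_id)
            # Assumption: these two states mean the execution has finished.
            if result.get("state") in ("FULFILLED", "REJECTED"):
                return {"done": True, "result": result}
            # Leave enough time for one more sleep plus a clean return.
            needed_ms = safety_margin_ms + poll_interval_s * 1000
            if context.get_remaining_time_in_millis() < needed_ms:
                return {"done": False, "execution_id": execution_id}
            sleep(poll_interval_s)

    return handler
```

For Workflows that regularly approach the limit, starting the execution asynchronously and letting a webhook deliver the result avoids holding a function open at all.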
Container Environments
For containerized applications, ensure your containers can keep long-running connections open if you use streaming.
Error Handling and Retry Strategies
Workflow-Level Retry
For Workflows that may fail due to transient issues, implement retry logic:
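A common shape for this is a bounded retry with exponential backoff and jitter, sketched below. `TransientError` and `retry` are hypothetical names; which failures count as transient (timeouts, rate limits, provider errors) is a decision for your own wrapper around the execution call.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def retry(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay_s: float = 1.0,
    max_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    jitter: Callable[[], float] = random.random,
) -> T:
    """Call fn, retrying on TransientError with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 1
    while True:
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last failure.
            delay = min(base_delay_s * 2 ** (attempt - 1), max_delay_s)
            # Scale by 0.5 to 1.0 so parallel clients do not retry in lockstep.
            sleep(delay * (0.5 + jitter() / 2))
            attempt += 1
```

Only retry Workflows that are safe to run more than once, since a failed attempt may already have produced side effects.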
Node-Level Resilience
Prompts may experience nondeterministic errors from model providers, and longer-running Workflows are particularly prone to these issues. To mitigate this, it’s a good idea to implement retry logic with Node Adornments. You can also disable streaming from the Model Settings while editing a Prompt to mitigate other nondeterministic connection issues.
Use Node Adornments to add resilience to individual Workflow nodes:
- Retry Node Adornments: Automatically retry failed nodes
- Try Node Adornments: Gracefully handle node failures with fallback paths
See our Node Adornments documentation for detailed configuration.
Monitoring and Observability
Execution Tracking
Monitor long-running Workflows through the Monitoring tab of your Workflow Deployment (see our monitoring documentation for details):
- Executions Tab: View real-time execution status
- Timeline View: Analyze execution flow and bottlenecks
- Cost Tracking: Monitor resource usage for long Workflows
Best Practices Summary
Use Webhooks for Production
Always use webhook-based async execution for Workflows that may take more than a few minutes. This provides the most reliable and scalable approach.
Include External IDs
Always include an external_id when executing Workflows to correlate webhook events with your internal processes.
Implement Proper Error Handling
Handle both Workflow-level failures and infrastructure timeouts gracefully. Consider retry strategies for transient failures.
Monitor Execution Times
Track Workflow execution times to identify performance bottlenecks and optimize your Workflows.
Secure Webhook Endpoints
Use HMAC authentication to secure your webhook endpoints and verify that events are coming from Vellum.
Design for Resilience
Use Node Adornments (Retry, Try) to make individual Workflow components more resilient to transient failures.
Related Documentation
- Batch Jobs - Process large volumes of Workflow executions efficiently
- Webhooks Configuration
- HMAC Authentication
- Node Adornments
- Workflow Monitoring
- API Reference - Execute Workflow
- API Reference - Execute Workflow Async
- API Reference - Check Workflow Execution Status