Document Data Extraction

In this example, we’ll build a Data Extraction Workflow that fetches documents from a Vellum Document Index, extracts key information from unstructured text, and generates structured JSON data using Pydantic models. This is useful for automating data extraction from PDFs (insurance policies, reports, lesson plans), spreadsheets, and more.

Ultimately, we’ll end up with a Workflow that performs the following steps:

  1. RetrieveData: retrieves document content using the Vellum API
  2. DataExtractionNode: processes the document content and extracts structured data in JSON format
workflow.py
from vellum.workflows import BaseWorkflow
from vellum.workflows.state import BaseState

from .inputs import Inputs
from .nodes.data_extraction import DataExtractionNode
from .nodes.retrieve_data import RetrieveData

class DataExtractionWorkflow(BaseWorkflow[Inputs, BaseState]):
    graph = RetrieveData >> DataExtractionNode

    class Outputs(BaseWorkflow.Outputs):
        extracted_data = DataExtractionNode.Outputs.text

# Running it
workflow = DataExtractionWorkflow()
terminal_event = workflow.run(
    inputs=Inputs(
        document_id="b9442ad1-ee4c-4582-8690-b6375d9b8611"
    )
)

# Output:
print(terminal_event.outputs.extracted_data)

"""
{
  "Example Table": "Description of test results",
  "Disability Category": {
    "blind": {
      "participants": 25,
      "Ballots Completed": 20,
      "Ballots Incomplete/Terminated": 5,
      "accuracy": "95% (n=20)",
      "Time to Complete": "15-20 minutes"
    },
    "Low Vision": {
      "participants": 30,
      "Ballots Completed": 28,
      "Ballots Incomplete/Terminated": 2,
      "accuracy": "92% (n=28)",
      "Time to Complete": "12-18 minutes"
    },
    ...
  }
}
"""

You can get document IDs from the Vellum UI, or assign your own external_id when uploading documents.
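For reference, here’s a hedged sketch of uploading a document with your own external_id using the Python SDK’s client.documents.upload method. The index name, label, file path, and external_id below are placeholder values, and you should confirm the exact parameter names against the current SDK reference:

import os
from vellum import Vellum

client = Vellum(api_key=os.getenv("VELLUM_API_KEY"))

# Hypothetical upload: "my-document-index", the label, the file path, and the
# external_id are placeholders for your own index and identifiers.
with open("policy.pdf", "rb") as f:
    response = client.documents.upload(
        label="Insurance Policy",
        add_to_index_names=["my-document-index"],
        external_id="policy-2024-001",
        contents=f,
    )

print(response.document_id)  # The ID to pass as the Workflow's document_id input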

Setup

Install Vellum

$ pip install vellum-ai
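The node code below reads your API key from the VELLUM_API_KEY environment variable, so be sure to set it in your shell before running the Workflow.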

Create your Project

In this example, we’ll structure our project like this:

document_data_extraction/
├── sandbox.py
├── workflow.py
├── inputs.py
├── __init__.py
└── nodes/
    ├── __init__.py
    ├── retrieve_data.py
    └── data_extraction.py

Folder structure matters! Vellum relies on this structure to convert between UI and code representations of the graph. If you don’t want to use the UI, you can use whatever folder structure you’d like.

Define Workflow Inputs

inputs.py
from vellum.workflows.inputs.base import BaseInputs

class Inputs(BaseInputs):
    document_id: str

The workflow takes a single input: the ID of the document to process. This ID is used to retrieve the document’s content from Vellum’s API.

Build the Nodes

1. Retrieve Data Node

This node handles retrieving document content from Vellum’s API. It polls until the document is ready to be used, in case it has been recently uploaded and is still being indexed. Files are usually available within a few seconds.

nodes/retrieve_data.py
import os

import requests
from vellum import Vellum
from vellum.workflows.errors import WorkflowErrorCode
from vellum.workflows.exceptions import NodeException
from vellum.workflows.nodes import BaseNode, RetryNode

from ..inputs import Inputs

api_key = os.getenv("VELLUM_API_KEY")
if api_key is None:
    raise ValueError("VELLUM_API_KEY environment variable is not set")
client = Vellum(api_key=api_key)

@RetryNode.wrap(max_attempts=20, retry_on_error_code=WorkflowErrorCode.USER_DEFINED_ERROR)
class RetrieveData(BaseNode):
    document_id = Inputs.document_id

    class Outputs(BaseNode.Outputs):
        document_content: str

    def run(self) -> BaseNode.Outputs:
        # Retrieve the document's processing status from Vellum
        try:
            response = client.documents.retrieve(id=self.document_id)
            indexing_state = response.document_to_document_indexes[0].indexing_state
            processing_state = response.processing_state
            text_file_url = response.document_to_document_indexes[0].extracted_text_file_url

            if indexing_state == "INDEXED":
                # The extracted text is served from a downloadable URL
                return self.Outputs(document_content=requests.get(text_file_url).text)
            elif indexing_state == "FAILED" or processing_state == "FAILED":
                # Permanent failure: raise with the default error code so we don't retry
                raise NodeException("Indexing or processing failed")

            # Still indexing: raise with USER_DEFINED_ERROR so the RetryNode
            # adornment above retries this node (up to max_attempts times)
            raise NodeException(
                "Document processing not complete yet.",
                code=WorkflowErrorCode.USER_DEFINED_ERROR,
            )
        except NodeException:
            # Re-raise our own exceptions unchanged so their error codes survive
            raise
        except Exception as e:
            raise NodeException(f"An error occurred: {str(e)}")

This node subclasses BaseNode and implements a custom run() method to:

  • Check the document’s processing status
  • Retrieve the document content once indexing is complete
  • Retry while the document is still being indexed, by raising a NodeException whose error code matches the RetryNode adornment’s retry_on_error_code
  • Raise a non-retried error if document processing fails
2. Data Extraction Node

This node extracts structured data from the processed document content according to a Pydantic model schema.

nodes/data_extraction.py
from pydantic import BaseModel, Field
from vellum import ChatMessagePromptBlock, JinjaPromptBlock, PromptParameters
from vellum.workflows.nodes import InlinePromptNode

from .retrieve_data import RetrieveData

# Define Pydantic models to match the JSON structure
class CategoryMetrics(BaseModel):
    participants: int = Field(..., description="Number of participants in the category.")
    ballots_completed: int = Field(..., alias="Ballots Completed", description="Number of ballots completed.")
    ballots_incomplete_terminated: int = Field(
        ..., alias="Ballots Incomplete/Terminated", description="Number of ballots incomplete or terminated."
    )
    accuracy: str = Field(..., description="Accuracy percentage with sample size.")
    time_to_complete: str = Field(..., alias="Time to Complete", description="Time taken to complete ballots.")

# Per-category metrics; the fields mirror the example output shown earlier
class DisabilityCategory(BaseModel):
    blind: CategoryMetrics = Field(..., description="Metrics for blind participants.")
    low_vision: CategoryMetrics = Field(..., alias="Low Vision", description="Metrics for low-vision participants.")
    dexterity: CategoryMetrics = Field(..., description="Metrics for participants with dexterity impairments.")
    mobility: CategoryMetrics = Field(..., description="Metrics for participants with mobility impairments.")

class ExtractedSchema(BaseModel):
    example_table: str = Field(..., alias="Example Table", description="Description of the table.")
    disability_category: DisabilityCategory = Field(..., alias="Disability Category")

class DataExtractionNode(InlinePromptNode):
    ml_model = "gpt-4o-mini"
    blocks = [
        ChatMessagePromptBlock(
            chat_role="SYSTEM",
            blocks=[
                JinjaPromptBlock(
                    template="""Analyze the following document content and extract all identifiable key-value pairs. Present the extracted information in JSON format, ensuring compliance with the following schema:
<document_content>
{{ document_content }}
</document_content>
Extracted Data (in JSON format):
""",
                ),
            ],
        ),
    ]
    prompt_inputs = {"document_content": RetrieveData.Outputs.document_content}
    parameters = PromptParameters(
        temperature=0,
        max_tokens=1000,
        top_p=1,
        custom_parameters={
            "json_schema": {"name": "data_extraction_schema", "schema": ExtractedSchema.model_json_schema()}
        },
    )

This node uses:

  • Pydantic models to define the expected JSON structure
  • An InlinePromptNode to process the document content with an LLM
  • Custom prompt parameters to ensure consistent, structured output (see the snippet below)
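If you’re curious what schema the model is actually asked to conform to, you can print it. Note that Pydantic v2’s model_json_schema() emits field aliases as property names by default, which is why the output keys match "Disability Category" and "Ballots Completed" rather than the Python attribute names:

import json

from document_data_extraction.nodes.data_extraction import ExtractedSchema

# model_json_schema() defaults to by_alias=True, so the schema's properties
# appear as "Example Table", "Ballots Completed", etc.
print(json.dumps(ExtractedSchema.model_json_schema(), indent=2))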

Running the Workflow

Using the Sandbox Runner

The sandbox runner is ideal for testing and development. It enables you to execute the workflow locally using sample inputs, providing a quick way to validate functionality.

You can run the sandbox runner with the following command: python -m document_data_extraction.sandbox 0 (where 0 is the index of the Scenario you want to run).

sandbox.py
from vellum.workflows.sandbox import WorkflowSandboxRunner

from .inputs import Inputs
from .workflow import DataExtractionWorkflow

if __name__ != "__main__":
    raise Exception("This file is not meant to be imported")

runner = WorkflowSandboxRunner(
    workflow=DataExtractionWorkflow(),
    inputs=[
        Inputs(document_id="b9442ad1-ee4c-4582-8690-b6375d9b8611"),
    ],
)

runner.run()


Integration into Project

1. Instantiate the Workflow

# From any file / function from which you want to reference the Workflow

# Required import (the file imported from depends on your folder structure)
# from .workflow import DataExtractionWorkflow

workflow = DataExtractionWorkflow()
2. Invoke the Workflow and Output the Results

# From any file / function from which you want to run the Workflow

# Required imports (the file imported from depends on your folder structure)
# from .inputs import Inputs

terminal_event = workflow.run(
    inputs=Inputs(
        document_id="b9442ad1-ee4c-4582-8690-b6375d9b8611"
    )
)

# Output:
print(terminal_event.outputs.extracted_data)

"""
{
  "Example Table": "Accessibility Testing Results Summary",
  "Disability Category": {
    "blind": {
      "participants": 25,
      "Ballots Completed": 20,
      "Ballots Incomplete/Terminated": 5,
      "accuracy": "95% (n=20)",
      "Time to Complete": "15-20 minutes"
    },
    "Low Vision": {
      "participants": 30,
      "Ballots Completed": 28,
      "Ballots Incomplete/Terminated": 2,
      "accuracy": "92% (n=28)",
      "Time to Complete": "12-18 minutes"
    },
    "dexterity": {
      "participants": 22,
      "Ballots Completed": 19,
      "Ballots Incomplete/Terminated": 3,
      "accuracy": "89% (n=19)",
      "Time to Complete": "18-25 minutes"
    },
    "mobility": {
      "participants": 28,
      "Ballots Completed": 25,
      "Ballots Incomplete/Terminated": 3,
      "accuracy": "91% (n=25)",
      "Time to Complete": "15-22 minutes"
    }
  }
}
"""

Conclusion

In this tutorial, we’ve built a document processing workflow that can:

  • Retrieve documents from Vellum’s API
  • Extract structured data using LLMs
  • Validate output against a predefined schema

Looking forward, we can:

  • Add validation nodes to verify extracted data (sketched after this list)
  • Implement retry logic for failed extractions
  • Add post-processing nodes for data cleanup
  • Deploy to Vellum for production use
  • Version control the workflow with the rest of our project
  • Continue building the graph in the Vellum UI
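As a sketch of that first follow-up, a validation node could re-parse the prompt’s output with the same Pydantic schema. This is hypothetical code, not part of this tutorial’s project: the ValidateExtraction name and its wiring are assumptions.

from pydantic import ValidationError
from vellum.workflows.exceptions import NodeException
from vellum.workflows.nodes import BaseNode

from .nodes.data_extraction import DataExtractionNode, ExtractedSchema

# Hypothetical node: parse the prompt's text output into the typed schema
# and fail the Workflow with a clear error if it doesn't conform.
class ValidateExtraction(BaseNode):
    raw_output = DataExtractionNode.Outputs.text

    class Outputs(BaseNode.Outputs):
        extracted_data: str

    def run(self) -> BaseNode.Outputs:
        try:
            parsed = ExtractedSchema.model_validate_json(self.raw_output)
        except ValidationError as e:
            raise NodeException(f"Extracted data failed schema validation: {e}")
        # Re-serialize with aliases so keys match the documented output shape
        return self.Outputs(extracted_data=parsed.model_dump_json(by_alias=True))

# The graph would then become:
# graph = RetrieveData >> DataExtractionNode >> ValidateExtraction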