Document Data Extraction

In this example, we’ll build a Data Extraction Workflow that fetches documents from a Vellum Document Index, extracts key information from unstructured text, and generates structured JSON data using Pydantic models. This is useful for automating data extraction from PDFs (insurance policies, reports, lesson plans), spreadsheets, and more.

Ultimately, we’ll end up with a Workflow that performs the following steps:

  1. RetrieveData: retrieves document content using the Vellum API
  2. DataExtractionNode: processes the document content and extracts structured data in JSON format
workflow.py
from vellum.workflows import BaseWorkflow
from vellum.workflows.state import BaseState

from .inputs import Inputs
from .nodes.data_extraction import DataExtractionNode
from .nodes.retrieve_data import RetrieveData

class DataExtractionWorkflow(BaseWorkflow[Inputs, BaseState]):
    graph = RetrieveData >> DataExtractionNode

    class Outputs(BaseWorkflow.Outputs):
        extracted_data = DataExtractionNode.Outputs.text

# Running it
workflow = DataExtractionWorkflow()
terminal_event = workflow.run(
    inputs=Inputs(
        document_id="b9442ad1-ee4c-4582-8690-b6375d9b8611"
    )
)

# Output:
print(terminal_event.outputs.extracted_data)

"""
{
  "Example Table": "Description of test results",
  "Disability Category": {
    "blind": {
      "participants": 25,
      "Ballots Completed": 20,
      "Ballots Incomplete/Terminated": 5,
      "accuracy": "95% (n=20)",
      "Time to Complete": "15-20 minutes"
    },
    "Low Vision": {
      "participants": 30,
      "Ballots Completed": 28,
      "Ballots Incomplete/Terminated": 2,
      "accuracy": "92% (n=28)",
      "Time to Complete": "12-18 minutes"
    },
    ...
  }
}
"""

You can get document IDs from the Vellum UI, or assign your own external_id when uploading documents.
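For reference, here’s a hedged sketch of uploading a document with your own external_id using the Python SDK’s client.documents.upload method. The index name, label, file path, and external_id below are placeholder values, and you should confirm the exact parameter names against the current SDK reference:

import os
from vellum import Vellum

client = Vellum(api_key=os.getenv("VELLUM_API_KEY"))

# Hypothetical upload: "my-document-index", the label, the file path, and the
# external_id are placeholders for your own index and identifiers.
with open("policy.pdf", "rb") as f:
    response = client.documents.upload(
        label="Insurance Policy",
        add_to_index_names=["my-document-index"],
        external_id="policy-2024-001",
        contents=f,
    )

print(response.document_id)  # The ID to pass as the Workflow's document_id input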

Setup

Install Vellum

$ pip install vellum-ai
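The node code below reads your API key from the VELLUM_API_KEY environment variable, so be sure to set it in your shell before running the Workflow.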

Create your Project

In this example, we’ll structure our project like this:

document_data_extraction/
├── sandbox.py
├── workflow.py
├── inputs.py
├── __init__.py
└── nodes/
    ├── __init__.py
    ├── retrieve_data.py
    └── data_extraction.py

Folder structure matters! Vellum relies on this structure to convert between UI and code representations of the graph. If you don’t want to use the UI, you can use whatever folder structure you’d like.

Define Workflow Inputs

inputs.py
from vellum.workflows.inputs.base import BaseInputs

class Inputs(BaseInputs):
    document_id: str

The workflow takes a single input: the ID of the document to process. This ID is used to retrieve the document’s content from Vellum’s API.

Build the Nodes

1. Retrieve Data Node

This node handles retrieving document content from Vellum’s API. It polls until the document is ready to be used, in case it has been recently uploaded and is still being indexed. Files are usually available within a few seconds.

nodes/retrieve_data.py
import os

import requests
from vellum import Vellum
from vellum.workflows.errors import WorkflowErrorCode
from vellum.workflows.exceptions import NodeException
from vellum.workflows.nodes import BaseNode, RetryNode

from ..inputs import Inputs

api_key = os.getenv("VELLUM_API_KEY")
if api_key is None:
    raise ValueError("VELLUM_API_KEY environment variable is not set")
client = Vellum(api_key=api_key)

@RetryNode.wrap(max_attempts=20, retry_on_error_code=WorkflowErrorCode.USER_DEFINED_ERROR)
class RetrieveData(BaseNode):
    document_id = Inputs.document_id

    class Outputs(BaseNode.Outputs):
        document_content: str

    def run(self) -> BaseNode.Outputs:
        # Retrieve the document's processing status from Vellum
        try:
            response = client.documents.retrieve(id=self.document_id)
            indexing_state = response.document_to_document_indexes[0].indexing_state
            processing_state = response.processing_state
            text_file_url = response.document_to_document_indexes[0].extracted_text_file_url

            if indexing_state == "INDEXED":
                # The extracted text is served from a downloadable URL
                return self.Outputs(document_content=requests.get(text_file_url).text)
            elif indexing_state == "FAILED" or processing_state == "FAILED":
                # Permanent failure: raise with the default error code so we don't retry
                raise NodeException("Indexing or processing failed")

            # Still indexing: raise with USER_DEFINED_ERROR so the RetryNode
            # adornment above retries this node (up to max_attempts times)
            raise NodeException(
                "Document processing not complete yet.",
                code=WorkflowErrorCode.USER_DEFINED_ERROR,
            )
        except NodeException:
            # Re-raise our own exceptions unchanged so their error codes survive
            raise
        except Exception as e:
            raise NodeException(f"An error occurred: {str(e)}")

This node subclasses BaseNode and implements a custom run() method to:

  • Check the document’s processing status
  • Retrieve the document content once indexing is complete
  • Retry while the document is still being indexed, by raising a NodeException whose error code matches the RetryNode adornment’s retry_on_error_code
  • Raise a non-retried error if document processing fails
2. Data Extraction Node

This node extracts structured data from the processed document content according to a Pydantic model schema.

nodes/data_extraction.py
from pydantic import BaseModel, Field
from vellum import ChatMessagePromptBlock, JinjaPromptBlock, PromptParameters
from vellum.workflows.nodes import InlinePromptNode

from .retrieve_data import RetrieveData

# Define Pydantic models to match the JSON structure
class CategoryMetrics(BaseModel):
    participants: int = Field(..., description="Number of participants in the category.")
    ballots_completed: int = Field(..., alias="Ballots Completed", description="Number of ballots completed.")
    ballots_incomplete_terminated: int = Field(
        ..., alias="Ballots Incomplete/Terminated", description="Number of ballots incomplete or terminated."
    )
    accuracy: str = Field(..., description="Accuracy percentage with sample size.")
    time_to_complete: str = Field(..., alias="Time to Complete", description="Time taken to complete ballots.")

# Per-category metrics; the fields mirror the example output shown earlier
class DisabilityCategory(BaseModel):
    blind: CategoryMetrics = Field(..., description="Metrics for blind participants.")
    low_vision: CategoryMetrics = Field(..., alias="Low Vision", description="Metrics for low-vision participants.")
    dexterity: CategoryMetrics = Field(..., description="Metrics for participants with dexterity impairments.")
    mobility: CategoryMetrics = Field(..., description="Metrics for participants with mobility impairments.")

class ExtractedSchema(BaseModel):
    example_table: str = Field(..., alias="Example Table", description="Description of the table.")
    disability_category: DisabilityCategory = Field(..., alias="Disability Category")

class DataExtractionNode(InlinePromptNode):
    ml_model = "gpt-4o-mini"
    blocks = [
        ChatMessagePromptBlock(
            chat_role="SYSTEM",
            blocks=[
                JinjaPromptBlock(
                    template="""Analyze the following document content and extract all identifiable key-value pairs. Present the extracted information in JSON format, ensuring compliance with the following schema:
<document_content>
{{ document_content }}
</document_content>
Extracted Data (in JSON format):
""",
                ),
            ],
        ),
    ]
    prompt_inputs = {"document_content": RetrieveData.Outputs.document_content}
    parameters = PromptParameters(
        temperature=0,
        max_tokens=1000,
        top_p=1,
        custom_parameters={
            "json_schema": {"name": "data_extraction_schema", "schema": ExtractedSchema.model_json_schema()}
        },
    )

This node uses:

  • Pydantic models to define the expected JSON structure
  • An InlinePromptNode to process the document content with an LLM
  • Custom prompt parameters to ensure consistent, structured output (see the snippet below)
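If you’re curious what schema the model is actually asked to conform to, you can print it. Note that Pydantic v2’s model_json_schema() emits field aliases as property names by default, which is why the output keys match "Disability Category" and "Ballots Completed" rather than the Python attribute names:

import json

from document_data_extraction.nodes.data_extraction import ExtractedSchema

# model_json_schema() defaults to by_alias=True, so the schema's properties
# appear as "Example Table", "Ballots Completed", etc.
print(json.dumps(ExtractedSchema.model_json_schema(), indent=2))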

Running the Workflow

Using the Sandbox Runner

The sandbox runner is ideal for testing and development. It enables you to execute the workflow locally using sample inputs, providing a quick way to validate functionality.

You can run the sandbox runner with the following command: python -m document_data_extraction.sandbox 0 (where 0 is the index of the Scenario you want to run).

sandbox.py
from vellum.workflows.sandbox import WorkflowSandboxRunner

from .inputs import Inputs
from .workflow import DataExtractionWorkflow

if __name__ != "__main__":
    raise Exception("This file is not meant to be imported")

runner = WorkflowSandboxRunner(
    workflow=DataExtractionWorkflow(),
    inputs=[
        Inputs(document_id="b9442ad1-ee4c-4582-8690-b6375d9b8611"),
    ],
)

runner.run()


Integration into Project

1. Instantiate the Workflow

# From any file / function from which you want to reference the Workflow

# Required import (the file imported from depends on your folder structure)
# from .workflow import DataExtractionWorkflow

workflow = DataExtractionWorkflow()
2. Invoke the Workflow and Output the Results

# From any file / function from which you want to run the Workflow

# Required imports (the file imported from depends on your folder structure)
# from .inputs import Inputs

terminal_event = workflow.run(
    inputs=Inputs(
        document_id="b9442ad1-ee4c-4582-8690-b6375d9b8611"
    )
)

# Output:
print(terminal_event.outputs.extracted_data)

"""
{
  "Example Table": "Accessibility Testing Results Summary",
  "Disability Category": {
    "blind": {
      "participants": 25,
      "Ballots Completed": 20,
      "Ballots Incomplete/Terminated": 5,
      "accuracy": "95% (n=20)",
      "Time to Complete": "15-20 minutes"
    },
    "Low Vision": {
      "participants": 30,
      "Ballots Completed": 28,
      "Ballots Incomplete/Terminated": 2,
      "accuracy": "92% (n=28)",
      "Time to Complete": "12-18 minutes"
    },
    "dexterity": {
      "participants": 22,
      "Ballots Completed": 19,
      "Ballots Incomplete/Terminated": 3,
      "accuracy": "89% (n=19)",
      "Time to Complete": "18-25 minutes"
    },
    "mobility": {
      "participants": 28,
      "Ballots Completed": 25,
      "Ballots Incomplete/Terminated": 3,
      "accuracy": "91% (n=25)",
      "Time to Complete": "15-22 minutes"
    }
  }
}
"""

Conclusion

In this tutorial, we’ve built a document processing workflow that can:

  • Retrieve documents from Vellum’s API
  • Extract structured data using LLMs
  • Validate output against a predefined schema

Looking forward, we can:

  • Add validation nodes to verify extracted data (sketched after this list)
  • Implement retry logic for failed extractions
  • Add post-processing nodes for data cleanup
  • Deploy to Vellum for production use
  • Version control the workflow with the rest of our project
  • Continue building the graph in the Vellum UI
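As a sketch of that first follow-up, a validation node could re-parse the prompt’s output with the same Pydantic schema. This is hypothetical code, not part of this tutorial’s project: the ValidateExtraction name and its wiring are assumptions.

from pydantic import ValidationError
from vellum.workflows.exceptions import NodeException
from vellum.workflows.nodes import BaseNode

from .nodes.data_extraction import DataExtractionNode, ExtractedSchema

# Hypothetical node: parse the prompt's text output into the typed schema
# and fail the Workflow with a clear error if it doesn't conform.
class ValidateExtraction(BaseNode):
    raw_output = DataExtractionNode.Outputs.text

    class Outputs(BaseNode.Outputs):
        extracted_data: str

    def run(self) -> BaseNode.Outputs:
        try:
            parsed = ExtractedSchema.model_validate_json(self.raw_output)
        except ValidationError as e:
            raise NodeException(f"Extracted data failed schema validation: {e}")
        # Re-serialize with aliases so keys match the documented output shape
        return self.Outputs(extracted_data=parsed.model_dump_json(by_alias=True))

# The graph would then become:
# graph = RetrieveData >> DataExtractionNode >> ValidateExtraction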