Document Data Extraction
In this example, we’ll build a Data Extraction Workflow that fetches documents from a Vellum Document Index, extracts key information from unstructured text, and generates structured JSON data using Pydantic models. This is useful for automating data extraction from PDFs (insurance policies, reports, lesson plans), spreadsheets, and more.
Ultimately, we’ll end up with a Workflow that performs the following steps:
- RetrieveData: retrieves document content using the Vellum API
- DataExtractionNode: processes the document content and extracts structured data in JSON format
You can get document IDs from the Vellum UI or via an external_id set when uploading documents.
Setup
Install Vellum
Create your Project
In this example, we’ll structure our project like this:
Folder structure matters! Vellum relies on this structure to convert between UI and code representations of the graph. If you don’t want to use the UI, you can use whatever folder structure you’d like.
Define Workflow Inputs
The workflow takes a single input: the ID of the document to process. This ID is used to retrieve the document’s content from Vellum’s API.
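As a rough sketch of this input, the class below models the single document ID as a plain dataclass. The class name `Inputs`, the field name `document_id`, and the non-empty check are assumptions for illustration; in the Vellum SDK the inputs class would subclass the SDK's own inputs base class rather than a dataclass.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Inputs:
    """Stand-in for the workflow's inputs (hypothetical; not the Vellum SDK class)."""

    document_id: str  # ID of the document to retrieve from Vellum

    def __post_init__(self) -> None:
        # Fail early on an empty ID instead of sending a bad request later.
        if not self.document_id.strip():
            raise ValueError("document_id must be a non-empty string")
```

Validating the ID at construction time keeps a blank value from reaching the retrieval step.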
Build the Nodes
Retrieve Data Node
This node handles retrieving document content from Vellum’s API. It polls until the document is ready to be used, in case it has been recently uploaded and is still being indexed. Files are usually available within a few seconds.
This node subclasses BaseNode and implements a custom run() method to:
- Check document processing status
- Retrieve document content if processing is complete
- Retry while the document is still being indexed
- Raise an error if document processing fails
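The steps above can be sketched as a plain polling loop, independent of the SDK. This is a minimal sketch under stated assumptions: `fetch_document` is a hypothetical callable standing in for the Vellum API call, the `"state"` and `"text"` keys and the `"PROCESSED"` and `"FAILED"` values are illustrative names, and the attempt limit is an addition so the loop cannot run forever.

```python
import time
from typing import Callable, Dict


class DocumentProcessingError(RuntimeError):
    """Raised when a document fails processing or never becomes ready."""


def wait_for_document(
    fetch_document: Callable[[str], Dict[str, str]],  # hypothetical API stand-in
    document_id: str,
    max_attempts: int = 10,
    delay_seconds: float = 1.0,
) -> str:
    """Poll until the document is processed, then return its text."""
    for _ in range(max_attempts):
        document = fetch_document(document_id)
        state = document["state"]  # assumed key and values, for illustration
        if state == "PROCESSED":
            return document["text"]
        if state == "FAILED":
            raise DocumentProcessingError(
                f"Document {document_id} failed processing"
            )
        # Still being indexed: wait briefly before checking again.
        time.sleep(delay_seconds)
    raise DocumentProcessingError(
        f"Document {document_id} was not ready after {max_attempts} attempts"
    )
```

Inside a node, the body of run() would play the role of `wait_for_document`, with the real API client in place of `fetch_document`.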
Data Extraction Node
This node extracts structured data from the processed document content according to a Pydantic model schema.
This node uses:
- Pydantic models to define the expected JSON structure
- An InlinePromptNode to process the document content with an LLM
- Custom prompt parameters to ensure consistent, structured output
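To illustrate the schema side of this node, here is a small sketch of a Pydantic model and a helper that validates the JSON string an LLM might return. The model name `PolicySummary`, its fields, and the helper `parse_extraction` are invented for an insurance-policy example and are not taken from the tutorial's own code; the prompt node itself is not shown.

```python
import json
from typing import List, Optional

from pydantic import BaseModel, ValidationError


class PolicySummary(BaseModel):
    """Hypothetical target structure for an extracted insurance policy."""

    policy_number: str
    insured_name: str
    annual_premium: float
    covered_items: List[str]
    expiry_date: Optional[str] = None  # optional: may be absent in the document


def parse_extraction(raw_json: str) -> PolicySummary:
    """Parse the model's JSON output and validate it against the schema.

    Raises pydantic.ValidationError if required fields are missing or
    have the wrong type, and json.JSONDecodeError if the text is not JSON.
    """
    return PolicySummary(**json.loads(raw_json))
```

Catching `ValidationError` around `parse_extraction` is one way to detect an extraction that did not match the expected structure.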
Running the Workflow
Using the Sandbox Runner
The sandbox runner is ideal for testing and development. It enables you to execute the workflow locally using sample inputs, providing a quick way to validate functionality.
Run the sandbox runner with the following command: python -m basic_rag_chatbot.sandbox 0 (where 0 is the index of the Scenario you want to run).
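To make the Scenario index concrete, the sketch below shows one way a command-line index could select a set of sample inputs. It does not use the SDK's sandbox runner; `SCENARIOS`, `select_scenario`, and the placeholder document IDs are hypothetical names chosen for illustration.

```python
from typing import Dict, List, Sequence

# Hypothetical sample inputs; replace the IDs with real document IDs.
SCENARIOS: List[Dict[str, str]] = [
    {"document_id": "example-document-id-0"},
    {"document_id": "example-document-id-1"},
]


def select_scenario(
    argv: Sequence[str], scenarios: List[Dict[str, str]]
) -> Dict[str, str]:
    """Return the scenario chosen by the first CLI argument (default: 0)."""
    index = int(argv[1]) if len(argv) > 1 else 0
    if not 0 <= index < len(scenarios):
        raise IndexError(
            f"Scenario index {index} is out of range (0..{len(scenarios) - 1})"
        )
    return scenarios[index]
```

Called with `sys.argv`, a trailing `0` on the command line would pick the first entry in the list.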
Integration into Project
Conclusion
In this tutorial, we’ve built a document processing workflow that can:
- Retrieve documents from Vellum’s API
- Extract structured data using LLMs
- Validate output against a predefined schema
Looking forward, we can:
- Add validation nodes to verify extracted data
- Implement retry logic for failed extractions
- Add post-processing nodes for data cleanup
- Deploy to Vellum for production use
- Version control the workflow with the rest of our project
- Continue building the graph in the Vellum UI