In this example, we’ll build a Data Extraction Workflow that fetches documents from a Vellum Document Index, extracts key information from unstructured text, and generates structured JSON data using Pydantic models. This is useful for automating data extraction from PDFs (insurance policies, reports, lesson plans), spreadsheets, and more.
Ultimately, we’ll end up with a Workflow that performs the following steps:
1. RetrieveData: retrieves document content using the Vellum API
2. DataExtractionNode: processes the document content and extracts structured data in JSON format

You can get document IDs from the Vellum UI or via an external_id when uploading documents.
In this example, we’ll structure our project like this:
Folder structure matters! Vellum relies on this structure to convert between UI and code representations of the graph. If you don’t want to use the UI, you can use whatever folder structure you’d like.
The workflow takes a single input: the ID of the document to process. This ID is used to retrieve the document’s content from Vellum’s API.
This node handles retrieving document content from Vellum’s API. It polls until the document is ready to be used, in case it has been recently uploaded and is still being indexed. Files are usually available within a few seconds.
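The polling behavior described above can be sketched with the standard library alone. This is a minimal illustration, not the node's actual implementation: `wait_until_ready`, the `fetch_status` callable, and the `"PROCESSING"` / `"READY"` status strings are all hypothetical stand-ins for whatever the Vellum API call returns.

```python
import time
from typing import Callable


def wait_until_ready(
    fetch_status: Callable[[], str],
    ready_status: str = "READY",  # placeholder value, not a confirmed Vellum status
    interval_seconds: float = 1.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll fetch_status until it reports ready_status.

    Returns the number of attempts it took, or raises TimeoutError
    once max_attempts is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        if fetch_status() == ready_status:
            return attempt
        # Only wait if another attempt is still coming.
        if attempt < max_attempts:
            sleep(interval_seconds)
    raise TimeoutError(f"document not ready after {max_attempts} attempts")


# Usage with a fake status source that becomes ready on the third check.
# The sleep function is swapped out so the example finishes instantly.
statuses = iter(["PROCESSING", "PROCESSING", "READY"])
attempts = wait_until_ready(lambda: next(statuses), sleep=lambda _: None)
print(attempts)
```

Bounding the loop with `max_attempts` keeps the workflow from hanging indefinitely if a document never finishes indexing.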
This node subclasses BaseNode and implements a custom run() method to:
This node extracts structured data from the processed document content according to a Pydantic model schema.
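To make the idea of schema-driven extraction concrete, here is a small sketch of validating an LLM's JSON output against a fixed schema. It deliberately uses a standard-library dataclass as a stand-in for the Pydantic model so that it runs without extra dependencies; with Pydantic v2 the schema would subclass `BaseModel` and `model_validate_json` would do the validation. The `PolicySummary` fields and the `parse_extraction` helper are hypothetical and not part of this tutorial's project.

```python
import json
from dataclasses import dataclass
from typing import get_type_hints


@dataclass
class PolicySummary:
    # Hypothetical schema for illustration only.
    policy_number: str
    holder_name: str
    annual_premium: float


def parse_extraction(raw_json: str) -> PolicySummary:
    """Parse model output and check it against the PolicySummary schema."""
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    kwargs = {}
    for name, expected_type in get_type_hints(PolicySummary).items():
        if name not in data:
            raise ValueError(f"missing field: {name}")
        value = data[name]
        # JSON has no separate float type for whole numbers, so accept ints
        # (but not booleans) where a float is expected.
        if expected_type is float and isinstance(value, int) and not isinstance(value, bool):
            value = float(value)
        if not isinstance(value, expected_type):
            raise ValueError(f"field {name!r} should be {expected_type.__name__}")
        kwargs[name] = value
    return PolicySummary(**kwargs)


# Usage: a well-formed response is converted into a typed object.
raw = '{"policy_number": "P-1001", "holder_name": "Ada Lovelace", "annual_premium": 1200}'
summary = parse_extraction(raw)
print(summary)
```

Failing loudly on a missing or mistyped field is the main benefit of a schema here: malformed model output is caught at the node boundary instead of propagating downstream.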
This node uses:
InlinePromptNode to process the document content with an LLM

The sandbox runner is ideal for testing and development. It enables you to execute the workflow locally using sample inputs, providing a quick way to validate functionality.
To start the sandbox runner, use the following command: python -m basic_rag_chatbot.sandbox 0 (where 0 is the index of the Scenario you want to run).
In this tutorial, we’ve built a document processing workflow that can:
Looking forward, we can: