Field Notes: Extracting Structured Data with Multimodal Prompts using Gemini and LangChain
If you’re a software or data engineer, I’m willing to bet you’ve spent a significant chunk of your career wrangling data. Transforming it from one format to another – SQL to NoSQL, CSV to Parquet, Delta Lake, or Iceberg – is often a rite of passage. There’s a certain satisfaction in seeing a well-oiled data processing workflow hum along.
But what if we could leapfrog traditional ETL/ELT for certain types of data? One of the most exciting opportunities with Large Language Models (LLMs) is their ability to transform not just structured data, but to extract structure from highly unstructured sources. Think of it as OCR on steroids, fused with advanced reasoning capabilities and the ability to comprehend verbose, nuanced language. This isn’t just about recognizing characters; it’s about understanding context, layout, and intent.
The business value here is immense. We’re talking about automating tasks that were previously unfeasible or prohibitively expensive, unlocking insights from datasets that have been gathering digital dust.
The Challenge: Taming the Unstructured Beast
Recently, I worked on a project involving a mountain of scanned legal contracts in PDF format. The client needed to extract key clauses, dates, party names, and other critical information for analysis and compliance. Past attempts to automate this extraction, including significant manual effort and traditional OCR tools, had stalled due to high costs and technological limitations. The accuracy just wasn’t there, and the complexity of legal language proved too challenging for older methods.
This scenario is far from unique. Legal documents, historical archives, handwritten notes, complex diagrams, invoices with varied formats – unstructured data is everywhere, in every company, industry, and application.
Enter Gemini and LangChain: A Powerful Duo
Because this particular client was already leveraging Google Cloud Platform (GCP), Google’s Gemini models were a natural choice. Gemini was designed from the ground up with first-class, native multimodal support, making it exceptionally adept at processing and understanding information from several input types simultaneously (like images and text).
To orchestrate the interactions with Gemini and manage the overall workflow, we turned to LangChain. LangChain simplifies building applications with LLMs by providing a standard interface for models, prompt management capabilities, and tools for chaining sequences of calls.
In this post, I want to share some key insights from this project, demonstrating how you can leverage Gemini’s multimodal capabilities with LangChain to tackle similar data extraction challenges.
What Are Multimodal Prompts?
At its core, a multimodal prompt is one that provides information to an LLM in more than one “mode” or data type. Instead of just sending text, you might send:
- An image and a text question about the image.
- A video and a text request to summarize it.
- An audio clip and a text instruction to transcribe and translate it.
For our use case – extracting data from scanned PDFs – we’re primarily interested in combining images (the scanned pages) with text prompts (our instructions for what to extract). Gemini can “look” at the image of the document page and “read” our textual instructions to identify and pull out the specific pieces of information we need, often in a structured format like JSON.
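Before the model can “look” at a scanned page, that page has to become an image payload. Below is a minimal preprocessing sketch; it assumes the pdf2image library (a wrapper around poppler, not part of LangChain or Gemini) is installed, and the file path is a placeholder.
# Minimal sketch: turn each page of a scanned PDF into a base64-encoded PNG
# that can later be attached to a multimodal prompt.
# Assumes `pdf2image` (and its poppler dependency) is installed; the path is a placeholder.
import base64
import io

from pdf2image import convert_from_path

pages = convert_from_path("path/to/scanned_contract.pdf", dpi=300)

page_images_b64 = []
for page in pages:  # each page is returned as a PIL Image
    buffer = io.BytesIO()
    page.save(buffer, format="PNG")
    page_images_b64.append(base64.b64encode(buffer.getvalue()).decode("utf-8"))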
Setting the Stage: A Demo with the “Bite of Seattle” Menu (1983)
To illustrate this technique without using sensitive contract data, let’s use a fun, publicly available example: a scanned menu from the “Bite of Seattle” food festival in 1983.
(Image of the 1983 Bite of Seattle Menu would be embedded here in a real blog post.)
Imagine a scanned image of an old, slightly faded menu with various food items, descriptions, and prices, possibly with some handwritten notes or varied fonts.
Why this menu?
- Visual Complexity: It’s a scanned image, not pure text. It has layout, different fonts, and potentially some noise.
- Implicit Structure: While unstructured as an image, it contains inherently structured information (e.g., Item Name, Description, Price, Vendor).
- Relatability: Everyone understands a menu!
Our goal will be to instruct Gemini, via LangChain, to extract a list of food items along with their prices and descriptions from this menu image.
The Workflow: Image, Prompt, Extraction
Here’s a high-level overview of how we’d approach this with LangChain and Gemini:
- Load the Image: The scanned menu page (e.g., a PNG or JPEG file) is loaded.
- Craft the Multimodal Prompt: This is where the magic happens. We’ll construct a prompt that includes:
  - The image data itself.
  - Text instructions specifying what we want to extract and the desired output format. For example:
    "From the provided menu image, extract all food items. For each item, provide its name, price, and a brief description if available. Return the information as a JSON list, where each object in the list has the keys 'item_name', 'price', and 'description'."
- Invoke Gemini via LangChain: LangChain provides convenient wrappers for interacting with Gemini’s multimodal models (like gemini-pro-vision or its newer successors). We’ll send our combined image and text prompt.
- Receive and Parse the Response: Gemini will process the image based on our instructions and return a response. Ideally, this will be the structured JSON data we requested. LangChain can also help with output parsing, ensuring the response conforms to a Pydantic model or other defined schema.
# Illustrative sketch using LangChain's Vertex AI integration.
# Model name and file paths are placeholders; adjust them for your environment.
import base64
import json

from langchain_core.messages import HumanMessage
from langchain_google_vertexai import ChatVertexAI


def load_and_prepare_image(path: str) -> str:
    """Read an image file and return its contents as a base64-encoded string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def parse_json_from_response(text: str):
    """Strip any markdown code fences the model may add, then parse the JSON."""
    cleaned = text.strip().removeprefix("```json").removesuffix("```").strip()
    return json.loads(cleaned)


# 1. Prepare the image (load from file and base64 encode)
image_data = load_and_prepare_image("path/to/menu.png")

# 2. Initialize a multimodal Gemini model via LangChain
llm = ChatVertexAI(model_name="gemini-1.5-pro")  # or any other multimodal Gemini model

# 3. Craft the multimodal prompt
prompt_text = """
From the provided menu image, extract all food items.
For each item, identify its name, price, and a brief description if available.
Return the information as a JSON list. Each object in the list should have the keys:
'item_name', 'price', and 'description'.
If a price or description is not clearly available for an item, use null for that field.
"""

message = HumanMessage(
    content=[
        {"type": "text", "text": prompt_text},
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_data}"}},
    ]
)

# 4. Send the prompt and get the response
response = llm.invoke([message])

# 5. Parse the (ideally JSON) response into Python objects
extracted_data = parse_json_from_response(response.content)
print(extracted_data)
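Step 5 can also be made more robust. Instead of hand-rolling the JSON parsing, LangChain’s PydanticOutputParser can inject format instructions into the prompt and validate the response against a schema. Here is a minimal sketch that reuses the llm and image_data objects from the snippet above; the MenuItem and Menu models are illustrative names I’ve chosen, not part of any library.
# A hedged sketch of schema-validated extraction with LangChain's PydanticOutputParser.
# Reuses `llm` and `image_data` from the previous snippet; model/field names are illustrative.
from typing import List, Optional

from langchain_core.messages import HumanMessage
from langchain_core.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field


class MenuItem(BaseModel):
    item_name: str = Field(description="Name of the food item")
    price: Optional[str] = Field(default=None, description="Price as printed on the menu")
    description: Optional[str] = Field(default=None, description="Short description, if available")


class Menu(BaseModel):
    items: List[MenuItem]


parser = PydanticOutputParser(pydantic_object=Menu)

# Fold the parser's format instructions into the text part of the multimodal prompt.
structured_prompt = (
    "From the provided menu image, extract all food items.\n"
    + parser.get_format_instructions()
)

message = HumanMessage(
    content=[
        {"type": "text", "text": structured_prompt},
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_data}"}},
    ]
)

# Parse the response straight into validated Pydantic objects.
menu = parser.parse(llm.invoke([message]).content)
for item in menu.items:
    print(item.item_name, "-", item.price)
The upside of this approach is that malformed or incomplete responses fail loudly at parse time, rather than slipping bad records into downstream analysis.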