Tutorial May 13, 2026

AI Multimodal API Guide 2026

Vision, image understanding, video analysis, and cross-modal AI. OpenAI, Google Gemini, and Claude vision capabilities compared, with implementation patterns.

Multimodal AI (models that understand images, video, audio, and text together) has gone from research novelty to production necessity in 2026. You can now send a screenshot to GPT-5 and ask it to debug the UI, feed a document scan to Claude for data extraction, or analyze video with Gemini. This guide covers the multimodal APIs from OpenAI, Google, and Anthropic: their capabilities, pricing, and how to integrate them into your applications.

What is Multimodal AI?

Multimodal models process multiple types of input (text, images, audio, video) in a single request and generate responses that reason across these modalities. The key capabilities:

| Capability | Input | Output | Example Use Case |
|---|---|---|---|
| Image understanding | Image + text | Text | Describe photo, extract text (OCR) |
| Document analysis | PDF/document + text | Text | Invoice extraction, form reading |
| Video understanding | Video + text | Text | Summarize meeting recording |
| Audio understanding | Audio + text | Text | Transcribe and analyze call |
| Image generation | Text (+ image) | Image | Create graphics, edit photos |
| Cross-modal reasoning | Multiple modalities | Any | Compare images, match audio to text |

OpenAI Vision (GPT-5 / GPT-5.4)

All GPT-5 series models support image inputs natively:

from openai import OpenAI

client = OpenAI()

# Analyze an image from URL
response = client.chat.completions.create(
    model="gpt-5",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image? List every object you see."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/photo.jpg"}
                }
            ]
        }
    ],
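    # Reasoning models such as the GPT-5 series expect max_completion_tokens
    # rather than the older max_tokens parameter
    max_completion_tokens=500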
)

print(response.choices[0].message.content)

Image Input Options

# Option 1: URL (simplest)
url_part = {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}

# Option 2: Base64-encoded data URL (for local files)
import base64

with open("photo.png", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")

# The MIME type in the data URL must match the actual file format
base64_part = {
    "type": "image_url",
    "image_url": {"url": f"data:image/png;base64,{encoded}"}
}

# Option 3: Multiple images in one request
content = [
    {"type": "text", "text": "Compare these two screenshots. What changed?"},
    {"type": "image_url", "image_url": {"url": "https://example.com/before.png"}},
    {"type": "image_url", "image_url": {"url": "https://example.com/after.png"}}
]

Detail Levels

The detail parameter controls how much image resolution the model sees, trading cost against quality:

# Low detail: faster, cheaper, good for high-level descriptions
{"type": "image_url", "image_url": {"url": "...", "detail": "low"}}

# High detail: slower, more expensive, better for OCR and fine details
{"type": "image_url", "image_url": {"url": "...", "detail": "high"}}

# Auto: model decides based on image size (default)
{"type": "image_url", "image_url": {"url": "...", "detail": "auto"}}
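
Rather than hard-coding the detail level at every call site, one option is to pick it from the wording of the question. The sketch below is only a rough keyword heuristic: `choose_detail` and the hint list are hypothetical, not part of any SDK, and a real application would tune them against its own traffic.

```python
import re

# Hypothetical hint words suggesting the answer depends on fine visual detail
FINE_DETAIL_HINTS = {"read", "text", "ocr", "transcribe", "number", "serial", "label", "small"}

def choose_detail(question):
    """Return "high" if the question mentions a fine-detail hint word, else "low"."""
    words = set(re.findall(r"[a-z]+", question.lower()))
    return "high" if words & FINE_DETAIL_HINTS else "low"

detail_for_classification = choose_detail("Is this a photo of a cat or dog?")
detail_for_reading = choose_detail("Read the text on the sign")
```

The result can be passed straight into the "detail" field of an image_url content part.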

Vision Pricing

Image inputs are tokenized based on resolution. Approximate costs:

| Image Size | Tokens (low detail) | Tokens (high detail) |
|---|---|---|
| Thumbnail (<512px) | ~85 | ~170 |
| Standard (512-1024px) | ~85 | ~680 |
| Large (1024-2048px) | ~85 | ~1,105 |

Low detail is a fixed ~85 tokens regardless of image size. Use it for "what's in this picture?" questions. High detail scales with resolution; use it for OCR, reading text, or detecting small details.
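
To make the scaling concrete, here is a minimal sketch of the tile-based formula OpenAI documented for its earlier GPT-4-class vision models. Low detail costs a flat 85 tokens. For high detail, the image is scaled to fit within 2048x2048, the shortest side is then reduced to at most 768px, and the cost is 85 base tokens plus 170 per 512px tile. Whether GPT-5 models use exactly these constants is an assumption, and `estimate_image_tokens` is a hypothetical helper. Its results depend on aspect ratio, so they will not match the rounded table figures exactly.

```python
import math

def estimate_image_tokens(width, height, detail="high"):
    """Rough token estimate using the published GPT-4-class tile formula (assumed)."""
    if detail == "low":
        return 85  # flat cost, independent of image size

    # Step 1: scale down (never up) so the image fits within 2048 x 2048
    scale = min(1.0, 2048 / max(width, height))
    w, h = round(width * scale), round(height * scale)

    # Step 2: scale down (never up) so the shortest side is at most 768px
    scale = min(1.0, 768 / min(w, h))
    w, h = round(w * scale), round(h * scale)

    # Step 3: count the 512px tiles needed to cover the scaled image
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles

square_photo_tokens = estimate_image_tokens(1024, 1024)
```

Under these assumptions a 1024x1024 image costs 765 tokens in high detail, nine times the low-detail cost.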

Google Gemini Multimodal

Gemini was built multimodal from the ground up and has the most comprehensive cross-modal support:

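import os
import time
import urllib.request

import google.generativeai as genai

# Assumes the API key is stored in the GOOGLE_API_KEY environment variable
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel('gemini-2.5-pro')

# Image analysis: this SDK has no URL image type, so download the bytes
# and pass them inline together with their MIME type
with urllib.request.urlopen("https://example.com/photo.jpg") as resp:
    image_bytes = resp.read()

response = model.generate_content([
    "Describe this image in detail",
    {"mime_type": "image/jpeg", "data": image_bytes}
])

print(response.text)

# Video analysis (upload, then poll until server-side processing finishes)
video_file = genai.upload_file(path="meeting.mp4")
while video_file.state.name == "PROCESSING":
    time.sleep(5)
    video_file = genai.get_file(video_file.name)
if video_file.state.name == "FAILED":
    raise RuntimeError("Video processing failed")

response = model.generate_content([
    "Summarize the key points from this meeting",
    video_file
])

# PDF/document analysis
doc_file = genai.upload_file(path="invoice.pdf")
response = model.generate_content([
    "Extract: vendor, total amount, date, line items",
    doc_file
])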
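
The polling loop in the video example runs forever if a file never leaves the PROCESSING state. A minimal sketch of a bounded wait is shown below. `wait_until_processed` is a hypothetical helper that takes any zero-argument callable returning the state name, so it does not depend on a particular SDK, and the timeout and interval defaults are arbitrary.

```python
import time

def wait_until_processed(fetch_state, timeout_s=300.0, interval_s=5.0, sleep=time.sleep):
    """Poll fetch_state() until it returns something other than "PROCESSING".

    Returns the final state name. Raises TimeoutError if the state is still
    "PROCESSING" once timeout_s seconds have elapsed.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        state = fetch_state()
        if state != "PROCESSING":
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still processing after {timeout_s} seconds")
        sleep(interval_s)
```

With the Gemini calls used above, the callable could be `lambda: genai.get_file(video_file.name).state.name`. The caller should still check whether the returned state signals failure.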

Gemini's Unique Multimodal Features

  • Native video understanding: Process up to 1 hour of video natively (no frame extraction needed)
  • Audio understanding: Transcribe and reason about audio content
  • Document understanding: Process PDFs with layout awareness — tables, forms, charts
  • Grounded generation: Generate content based on specific image regions

Claude Vision

Claude's vision capabilities are strong for document analysis and detailed image reasoning:

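import base64

import anthropic

client = anthropic.Anthropic()

# Read a local image and base64-encode it ("document.png" is a placeholder path)
with open("document.png", "rb") as f:
    base64_encoded_image = base64.b64encode(f.read()).decode("utf-8")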

# Image analysis with Claude
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": base64_encoded_image
                    }
                },
                {
                    "type": "text",
                    "text": "Extract all the text from this document and format as a table."
                }
            ]
        }
    ]
)

print(response.content[0].text)

Claude excels at: reading dense documents, extracting structured data from forms, describing visual layouts, and reasoning about diagrams/charts.
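
The example above reads `response.content[0].text`, which assumes the first content block is a text block. A slightly more defensive sketch joins every text block and skips any other block type. `collect_text` is a hypothetical helper, and the stand-in objects below only mimic the shape of `response.content` so that the snippet runs without an API call.

```python
from types import SimpleNamespace

def collect_text(content_blocks):
    """Join the text of every block whose type is "text", skipping other block types."""
    return "".join(
        block.text for block in content_blocks if getattr(block, "type", None) == "text"
    )

# Stand-in objects that mimic the shape of response.content
fake_content = [
    SimpleNamespace(type="thinking", thinking="..."),
    SimpleNamespace(type="text", text="| Item | Price |"),
    SimpleNamespace(type="text", text="\n| Pen | 2.00 |"),
]
table_text = collect_text(fake_content)
```

In real code the call would be `collect_text(response.content)`.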

Multimodal Comparison

| Feature | GPT-5 | Gemini 2.5 Pro | Claude Sonnet 4 |
|---|---|---|---|
| Image understanding | Excellent | Excellent | Very Good |
| Video understanding | Limited | Excellent (native) | No |
| Audio understanding | Via Realtime-2 | Yes (native) | No |
| PDF/document | Good | Excellent | Excellent |
| OCR quality | Good | Very Good | Very Good |
| Multiple images | Yes | Yes | Yes |
| Max images per request | 20 | 3,600 | 20 |
| Detail control | low/high/auto | No | No |
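
One way to act on this comparison is a small routing table that picks a model per input type. The sketch below simply encodes the ratings above together with this guide's recommendations. `pick_model` and the modality names are hypothetical, and the mapping should be replaced by the results of your own evaluation.

```python
# Hypothetical routing table derived from the comparison above
PREFERRED_MODEL = {
    "video": "gemini-2.5-pro",                 # native video understanding
    "audio": "gemini-2.5-pro",                 # native audio understanding
    "document": "claude-sonnet-4-20250514",    # dense documents and forms
    "image": "gpt-5",                          # image understanding with detail control
}

def pick_model(modality, default="gpt-5"):
    """Return the preferred model name for a modality, or the default if unknown."""
    return PREFERRED_MODEL.get(modality.lower(), default)

video_model = pick_model("Video")
```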

Implementation Patterns

1. Document Data Extraction

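from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()
invoice_url = "https://example.com/invoice.png"  # placeholder URL

# Strict structured outputs need a fully specified schema, so line items get
# their own model instead of a free-form dict (field names are illustrative)
class LineItem(BaseModel):
    description: str
    quantity: float
    unit_price: float

class InvoiceData(BaseModel):
    vendor_name: str
    invoice_number: str
    date: str
    total_amount: float
    line_items: list[LineItem]
    due_date: str | None = None

# OpenAI with structured output. The Responses API uses the input_text and
# input_image content types, with image_url given as a plain string.
response = client.responses.parse(
    model="gpt-5",
    input=[
        {"role": "user", "content": [
            {"type": "input_text", "text": "Extract invoice data from this image"},
            {"type": "input_image", "image_url": invoice_url, "detail": "high"}
        ]}
    ],
    text_format=InvoiceData,
)

invoice = response.output_parsed
print(f"Total: ${invoice.total_amount}")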
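
Extracted values can be wrong even when they parse cleanly, so a cheap consistency check is worth running before the data is stored. The sketch below is one possible check, not part of any SDK: `check_invoice` is a hypothetical helper, and it assumes that line item amounts should sum to the total (no separate tax or discount lines) and that dates are expected in ISO format.

```python
from datetime import date
from decimal import Decimal

def check_invoice(total_amount, line_item_amounts, invoice_date, tolerance="0.01"):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []

    # Compare in Decimal to avoid float rounding noise in currency sums
    total = Decimal(str(total_amount))
    items_sum = sum((Decimal(str(a)) for a in line_item_amounts), Decimal("0"))
    if abs(total - items_sum) > Decimal(tolerance):
        problems.append(f"line items sum to {items_sum}, but total is {total}")

    # Reject dates that are not valid ISO dates such as 2026-05-13
    try:
        date.fromisoformat(invoice_date)
    except ValueError:
        problems.append(f"date {invoice_date!r} is not a valid ISO date")

    return problems

clean = check_invoice(30.0, [10.0, 20.0], "2026-05-13")
suspicious = check_invoice(35.0, [10.0, 20.0], "13/05/2026")
```

Any non-empty result is a signal to route the invoice to manual review instead of trusting the extraction.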

2. Visual Q&A for Customer Support

from openai import AsyncOpenAI

async_client = AsyncOpenAI()

async def handle_screenshot_question(image_base64, question):
    """Answer questions about a screenshot (e.g., error message)."""
    # An async function needs the async client and an awaited call;
    # calling the sync client here would block the event loop
    response = await async_client.chat.completions.create(
        model="gpt-5",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": f"""You are a technical support agent.
A user sent a screenshot with this question: "{question}"
Analyze the screenshot and provide a clear, step-by-step solution."""},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/png;base64,{image_base64}",
                    "detail": "high"
                }}
            ]
        }],
        # GPT-5 series models expect max_completion_tokens instead of max_tokens
        max_completion_tokens=1000
    )
    return response.choices[0].message.content
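
The function above hard-codes image/png in the data URL, but user screenshots often arrive as JPEG or WebP. A minimal sketch for building the data URL from raw bytes is shown below. It detects the format from the well-known file signatures of PNG, JPEG, GIF, and WebP. `sniff_image_mime` and `to_data_url` are hypothetical helper names.

```python
import base64

# Leading byte signatures of common image formats
_SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
]

def sniff_image_mime(data):
    """Return the MIME type implied by the leading bytes, or raise ValueError."""
    for signature, mime in _SIGNATURES:
        if data.startswith(signature):
            return mime
    # WebP is a RIFF container: "RIFF", a 4-byte size, then "WEBP"
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    raise ValueError("unrecognized image format")

def to_data_url(data):
    """Build a data URL whose MIME type matches the actual image bytes."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{sniff_image_mime(data)};base64,{encoded}"

jpeg_mime = sniff_image_mime(b"\xff\xd8\xff\xe0 rest of file")
```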

3. Video Content Analysis

import time

# Using Gemini for video summarization
# (reuses `genai` and `model` from the Gemini section above)
def analyze_video(video_path, questions):
    """Analyze video content with Gemini."""
    video_file = genai.upload_file(path=video_path)

    # Wait for processing, and stop if the upload could not be processed
    while video_file.state.name == "PROCESSING":
        time.sleep(5)
        video_file = genai.get_file(video_file.name)
    if video_file.state.name == "FAILED":
        raise RuntimeError(f"Processing failed for {video_path}")

    question_list = "\n".join(f"- {q}" for q in questions)
    prompt = f"""Analyze this video and answer:
{question_list}

Provide timestamps for key moments."""

    response = model.generate_content([prompt, video_file])
    return response.text

# Example
questions = [
    "What are the main topics discussed?",
    "Were any decisions made? If so, what?",
    "Who spoke and for how long?"
]

summary = analyze_video("team_meeting.mp4", questions)
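
Because the prompt asks for timestamps, the returned summary can be post-processed to jump to key moments. The sketch below pulls MM:SS and H:MM:SS timestamps out of free text and converts them to seconds. It assumes the model writes timestamps in one of those two forms, which is not guaranteed, and `extract_timestamps` is a hypothetical helper.

```python
import re

# Optional hours, then minutes and seconds, e.g. 02:15 or 1:05:30
_TIMESTAMP = re.compile(r"\b(?:(\d{1,2}):)?([0-5]?\d):([0-5]\d)\b")

def extract_timestamps(text):
    """Return every MM:SS or H:MM:SS timestamp in text as a number of seconds."""
    seconds = []
    for hours, minutes, secs in _TIMESTAMP.findall(text):
        seconds.append(int(hours or 0) * 3600 + int(minutes) * 60 + int(secs))
    return seconds

moments = extract_timestamps("Budget discussed at 02:15, decision made at 1:05:30.")
```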

4. Multi-Image Comparison

def compare_images(image_urls, question):
    """Compare multiple images."""
    content = [{"type": "text", "text": question}]
    for url in image_urls:
        content.append({"type": "image_url", "image_url": {"url": url}})
    
    response = client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content": content}]
    )
    return response.choices[0].message.content

# Compare UI designs
result = compare_images(
    ["https://example.com/design-v1.png", "https://example.com/design-v2.png"],
    "Compare these two UI designs. Which has better accessibility? List specific differences."
)
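
When a question refers to several images, it can help to label each one in the request so the answer can say "Image 1" or "Image 2" unambiguously. The sketch below builds the same content list as `compare_images`, with a short text label before each image. `build_labeled_content` is a hypothetical helper, and whether labels improve answers for your workload is something to verify.

```python
def build_labeled_content(question, image_urls):
    """Build a chat content list with an "Image N:" text label before each image."""
    content = [{"type": "text", "text": question}]
    for index, url in enumerate(image_urls, start=1):
        content.append({"type": "text", "text": f"Image {index}:"})
        content.append({"type": "image_url", "image_url": {"url": url}})
    return content

labeled = build_labeled_content(
    "Which design has better contrast?",
    ["https://example.com/design-v1.png", "https://example.com/design-v2.png"],
)
```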

Cost Optimization for Multimodal

Multimodal requests cost significantly more than text-only ones because every image is billed as input tokens. Three ways to reduce the bill:

1. Resize Images Before Sending

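from PIL import Image
import io
import base64

def optimize_image(image_path, max_size=1024):
    """Resize and compress image for API calls."""
    img = Image.open(image_path)

    # Resize in place while maintaining aspect ratio (never upscales)
    img.thumbnail((max_size, max_size))

    # JPEG has no alpha channel or palette mode, so normalize to RGB first;
    # saving an RGBA or P image as JPEG would raise an error
    if img.mode != "RGB":
        img = img.convert("RGB")

    # Convert to JPEG for smaller size
    buffer = io.BytesIO()
    img.save(buffer, format="JPEG", quality=85)

    return base64.b64encode(buffer.getvalue()).decode("utf-8")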
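
Base64 encoding inflates the payload by about a third, since every 3 bytes become 4 characters, which matters when a request body has a size limit. The sketch below computes the encoded size up front so an oversized image can be compressed further before sending. The helper names are hypothetical, and the budget is a caller-supplied value, not a documented API limit.

```python
import base64

def base64_length(num_bytes):
    """Length in characters of the padded base64 encoding of num_bytes bytes."""
    return 4 * ((num_bytes + 2) // 3)

def fits_budget(raw, max_encoded_chars):
    """True if the base64 form of raw stays within the given character budget."""
    return base64_length(len(raw)) <= max_encoded_chars

encoded_size = base64_length(300_000)  # a 300 KB JPEG
```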

2. Use Low Detail When Possible

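# Low detail is fine for high-level questions, e.g. "Is this a photo of a cat or dog?"
{"type": "image_url", "image_url": {"url": "...", "detail": "low"}}

# High detail is needed for specific content, e.g. "Read the text on the sign"
{"type": "image_url", "image_url": {"url": "...", "detail": "high"}}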

3. Pre-process with Specialized APIs

# For OCR-heavy workloads, consider dedicated OCR first
# Then send extracted text to LLM (much cheaper)

import pytesseract
from PIL import Image

# Step 1: OCR the image (free, local; requires the Tesseract binary to be installed)
text = pytesseract.image_to_string(Image.open("document.png"))

# Step 2: Send only text to LLM (cheaper than vision)
response = client.chat.completions.create(
    model="gpt-5.4-mini",  # Cheaper model sufficient for text-only
    messages=[{
        "role": "user",
        "content": f"Extract structured data from this text:\n{text}"
    }]
)
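
Local OCR fails quietly on poor scans, so it is worth checking the extracted text before trusting it and falling back to a vision request when it looks like noise. The sketch below is a crude quality gate. `ocr_looks_usable` is a hypothetical helper, and both thresholds are arbitrary starting points.

```python
def ocr_looks_usable(text, min_chars=20, min_alnum_ratio=0.6):
    """Heuristic: enough non-whitespace characters, and most of them alphanumeric."""
    stripped = "".join(text.split())  # drop all whitespace
    if len(stripped) < min_chars:
        return False
    alnum = sum(ch.isalnum() for ch in stripped)
    return alnum / len(stripped) >= min_alnum_ratio

good_scan = ocr_looks_usable("Invoice 1042 from Acme Corp, total due 310.00 USD")
noisy_scan = ocr_looks_usable("~~ |||| .. ;; __ ~~ |||| .. ;; __")
```

When the check returns False, send the original image to a vision model instead of the OCR text.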

Common Pitfalls

  1. Sending full-resolution images: Resize to a maximum of 1024-2048px. The model gains little from 4K resolution, and high-detail token cost grows with image size.
  2. Using vision when text extraction suffices: If you only need OCR, use Tesseract or a dedicated OCR API. It's cheaper and faster.
  3. Not setting the detail level: The default is "auto", which may choose high detail unnecessarily. Be explicit.
  4. Exceeding image limits: GPT-5 allows a maximum of 20 images per request. Gemini allows more, but at higher cost.
  5. Forgetting image token costs: A single high-detail image can consume 1,000+ tokens. Budget accordingly.
  6. Assuming perfect OCR: Models still make OCR errors on handwritten text, low-quality scans, and unusual fonts. Always validate critical extractions.
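
For the image-limit pitfall, a minimal sketch is to split a long list of image URLs into batches that each stay under the per-request cap and send one request per batch. `batch_images` is a hypothetical helper, and the default of 20 is taken from the limit quoted above, so check it against the provider's current documentation.

```python
def batch_images(image_urls, max_per_request=20):
    """Split image_urls into consecutive batches of at most max_per_request items."""
    if max_per_request < 1:
        raise ValueError("max_per_request must be at least 1")
    return [
        image_urls[start:start + max_per_request]
        for start in range(0, len(image_urls), max_per_request)
    ]

batches = batch_images([f"https://example.com/page-{i}.png" for i in range(45)])
batch_sizes = [len(batch) for batch in batches]
```

Each batch can then be passed to a function such as `compare_images`, with the per-batch answers merged afterwards.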

Conclusion

Multimodal AI has transformed what's possible in 2026. The ability to send an image alongside a text prompt and get intelligent, context-aware responses is now table stakes. GPT-5 offers the most flexible image understanding with detail control. Gemini provides the best native video and audio support. Claude excels at document analysis and detailed visual reasoning.

For most developers, the practical starting point is GPT-5 vision with structured outputs — you can extract structured data from screenshots, documents, and photos with a single API call. Add Gemini for video-heavy workloads, and Claude for complex document processing. Optimize costs by resizing images, using low detail when appropriate, and pre-processing with OCR for text-heavy documents.