AI Multimodal API Guide 2026
Vision, image understanding, video analysis, and cross-modal AI. OpenAI, Google Gemini, Claude vision capabilities compared with implementation patterns.
Multimodal AI (models that understand images, video, audio, and text together) has gone from research novelty to production necessity in 2026. You can now send a screenshot to GPT-5 and ask it to debug the UI, feed a document scan to Claude for data extraction, or analyze video frames with Gemini. This guide covers the three major multimodal APIs (OpenAI, Google Gemini, and Claude), their capabilities, their pricing, and how to integrate them into your applications.
What is Multimodal AI?
Multimodal models process multiple types of input (text, images, audio, video) in a single request and generate responses that reason across these modalities. The key capabilities:
| Capability | Input | Output | Example Use Case |
|---|---|---|---|
| Image understanding | Image + text | Text | Describe photo, extract text (OCR) |
| Document analysis | PDF/document + text | Text | Invoice extraction, form reading |
| Video understanding | Video + text | Text | Summarize meeting recording |
| Audio understanding | Audio + text | Text | Transcribe and analyze call |
| Image generation | Text (+ image) | Image | Create graphics, edit photos |
| Cross-modal reasoning | Multiple modalities | Any | Compare images, match audio to text |
OpenAI Vision (GPT-5 / GPT-5.4)
All GPT-5 series models support image inputs natively:
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Analyze an image from URL
response = client.chat.completions.create(
    model="gpt-5",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image? List every object you see."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/photo.jpg"}
                }
            ]
        }
    ],
    # GPT-5 series models expect max_completion_tokens rather than max_tokens
    max_completion_tokens=500
)
print(response.choices[0].message.content)
Image Input Options
# Option 1: URL (simplest)
{"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
# Option 2: Base64 encoded (for local files)
import base64

with open("photo.png", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")

{
    "type": "image_url",
    "image_url": {"url": f"data:image/png;base64,{encoded}"}
}

# Option 3: Multiple images in one request
content = [
    {"type": "text", "text": "Compare these two screenshots. What changed?"},
    {"type": "image_url", "image_url": {"url": "https://example.com/before.png"}},
    {"type": "image_url", "image_url": {"url": "https://example.com/after.png"}}
]
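Option 2 hardcodes `image/png` in the data URL, which is wrong for JPEG or WebP files. The following is a minimal sketch of a helper that guesses the MIME type from the file extension. The names `to_data_url` and `image_part_from_file` are hypothetical and not part of the OpenAI SDK.

```python
import base64
import mimetypes
from pathlib import Path


def to_data_url(image_bytes: bytes, filename: str) -> str:
    """Build a base64 data URL, guessing the MIME type from the extension."""
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Not a recognized image type: {filename}")
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


def image_part_from_file(path: str, detail: str = "auto") -> dict:
    """Read a local file and wrap it as an image_url content part."""
    data = Path(path).read_bytes()
    return {
        "type": "image_url",
        "image_url": {"url": to_data_url(data, path), "detail": detail},
    }
```

The returned dict can be appended to the `content` list in the same way as the URL-based parts above.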
Detail Levels
Control image resolution for cost/quality tradeoff:
# Low detail: faster, cheaper, good for high-level descriptions
{"type": "image_url", "image_url": {"url": "...", "detail": "low"}}
# High detail: slower, more expensive, better for OCR and fine details
{"type": "image_url", "image_url": {"url": "...", "detail": "high"}}
# Auto: model decides based on image size (default)
{"type": "image_url", "image_url": {"url": "...", "detail": "auto"}}
Vision Pricing
Image inputs are tokenized based on resolution. Approximate costs:
| Image Size | Tokens (low detail) | Tokens (high detail) |
|---|---|---|
| Thumbnail (<512px) | ~85 | ~170 |
| Standard (512-1024px) | ~85 | ~680 |
| Large (1024-2048px) | ~85 | ~1,105 |
Low detail is a fixed ~85 tokens regardless of image size. Use it for "what's in this picture?" questions. High detail scales with resolution — use it for OCR, reading text, or detecting small details.
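To budget before sending, it can help to estimate the token count from image dimensions. The sketch below assumes a tile-based scheme: the image is scaled down to fit within 2048x2048, then scaled down so its shortest side is at most 768px, and each 512px tile costs a fixed number of tokens on top of a base cost. The defaults of 85 base tokens and 170 tokens per tile are assumptions chosen to match the ~85 low-detail figure above. The real constants are model-specific, so treat the results as illustrative; they will not match every row of the table.

```python
import math


def estimate_image_tokens(width, height, detail="high",
                          base_tokens=85, tile_tokens=170):
    """Rough token estimate for one image under an assumed tile scheme."""
    if detail == "low":
        return base_tokens  # fixed cost regardless of size

    # Step 1: scale down (never up) to fit inside a 2048x2048 square.
    scale = min(1.0, 2048 / max(width, height))
    w, h = round(width * scale), round(height * scale)

    # Step 2: scale down (never up) so the shortest side is at most 768px.
    scale = min(1.0, 768 / min(w, h))
    w, h = round(w * scale), round(h * scale)

    # Step 3: count the 512px tiles needed to cover the image.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return base_tokens + tile_tokens * tiles


print(estimate_image_tokens(1024, 1024))          # 765
print(estimate_image_tokens(2048, 4096))          # 1105
print(estimate_image_tokens(4000, 3000, "low"))   # 85
```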
Google Gemini Multimodal
Gemini was built multimodal from the ground up and has the most comprehensive cross-modal support:
import time

import google.generativeai as genai
from PIL import Image

model = genai.GenerativeModel('gemini-2.5-pro')

# Image analysis (the SDK accepts PIL images, so load the file locally)
response = model.generate_content([
    "Describe this image in detail",
    Image.open("photo.jpg")
])
print(response.text)

# Video analysis (upload and process)
video_file = genai.upload_file(path="meeting.mp4")
while video_file.state.name == "PROCESSING":
    time.sleep(5)
    video_file = genai.get_file(video_file.name)
response = model.generate_content([
    "Summarize the key points from this meeting",
    video_file
])

# PDF/document analysis
doc_file = genai.upload_file(path="invoice.pdf")
response = model.generate_content([
    "Extract: vendor, total amount, date, line items",
    doc_file
])
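The polling loop above waits forever if processing stalls, and it does not distinguish a failed upload from a finished one. Here is a minimal sketch of a more defensive wait, written against a generic `get_state` callable so it runs without the SDK. The helper name `wait_until_ready` and the timeout defaults are assumptions; the state names follow the `PROCESSING` value used above plus a `FAILED` state.

```python
import time


def wait_until_ready(get_state, timeout_s=300.0, poll_interval_s=5.0,
                     sleep=time.sleep):
    """Poll get_state() until it stops returning "PROCESSING".

    Raises RuntimeError on "FAILED" and TimeoutError once the deadline passes.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        state = get_state()
        if state == "FAILED":
            raise RuntimeError("File processing failed")
        if state != "PROCESSING":
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still processing after {timeout_s} seconds")
        sleep(poll_interval_s)
```

With the Gemini snippet above, the callable could be something like `lambda: genai.get_file(video_file.name).state.name`.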
Gemini's Unique Multimodal Features
- Native video understanding: Process up to 1 hour of video natively (no frame extraction needed)
- Audio understanding: Transcribe and reason about audio content
- Document understanding: Process PDFs with layout awareness — tables, forms, charts
- Grounded generation: Generate content based on specific image regions
Claude Vision
Claude's vision capabilities are strong for document analysis and detailed image reasoning:
import base64

import anthropic

client = anthropic.Anthropic()

# Load a local image and base64-encode it (document.png is an example file)
with open("document.png", "rb") as f:
    base64_encoded_image = base64.b64encode(f.read()).decode("utf-8")

# Image analysis with Claude
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": base64_encoded_image
                    }
                },
                {
                    "type": "text",
                    "text": "Extract all the text from this document and format as a table."
                }
            ]
        }
    ]
)
print(response.content[0].text)
Claude excels at: reading dense documents, extracting structured data from forms, describing visual layouts, and reasoning about diagrams/charts.
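The `media_type` field must match the actual image format, so it is worth deriving it from the file rather than hardcoding it. Below is a small sketch that builds the image block and the user message as plain dicts. The helper names are hypothetical, and the extension map is limited to JPEG, PNG, GIF, and WebP.

```python
import base64
import os

# Extension-to-media-type map used by this sketch.
SUPPORTED_MEDIA_TYPES = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".gif": "image/gif",
    ".webp": "image/webp",
}


def claude_image_block(image_bytes: bytes, filename: str) -> dict:
    """Build a base64 image content block, inferring media_type from the name."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in SUPPORTED_MEDIA_TYPES:
        raise ValueError(f"Unsupported image extension: {ext or filename}")
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": SUPPORTED_MEDIA_TYPES[ext],
            "data": base64.b64encode(image_bytes).decode("utf-8"),
        },
    }


def claude_user_message(image_bytes: bytes, filename: str, prompt: str) -> dict:
    """Pair one image block with a text prompt in a single user message."""
    return {
        "role": "user",
        "content": [
            claude_image_block(image_bytes, filename),
            {"type": "text", "text": prompt},
        ],
    }
```

The resulting dict can be passed in the `messages` list of `client.messages.create`.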
Multimodal Comparison
| Feature | GPT-5 | Gemini 2.5 Pro | Claude Sonnet 4 |
|---|---|---|---|
| Image understanding | Excellent | Excellent | Very Good |
| Video understanding | Limited | Excellent (native) | No |
| Audio understanding | Via Realtime-2 | Yes (native) | No |
| PDF/document | Good | Excellent | Excellent |
| OCR quality | Good | Very Good | Very Good |
| Multiple images | Yes | Yes | Yes |
| Max images per request | 20 | 3,600 | 20 |
| Detail control | low/high/auto | No | No |
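One way to act on this comparison is a small routing function that picks a model from the modalities in a request. The sketch below simply encodes the table's ratings: Gemini for video and audio, Claude for PDFs, GPT-5 otherwise. The function name and the modality labels are assumptions, and the PDF choice between Claude and Gemini is arbitrary since both are rated "Excellent".

```python
def pick_model(modalities: set[str]) -> str:
    """Choose a model name based on the input modalities in a request."""
    # Video and audio: only Gemini is listed with native support.
    if "video" in modalities or "audio" in modalities:
        return "gemini-2.5-pro"
    # Documents: Claude and Gemini are both rated Excellent; Claude chosen here.
    if "pdf" in modalities:
        return "claude-sonnet-4-20250514"
    # Images and plain text: GPT-5 offers detail control.
    return "gpt-5"


print(pick_model({"text", "image"}))   # gpt-5
print(pick_model({"text", "video"}))   # gemini-2.5-pro
```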
Implementation Patterns
1. Document Data Extraction
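from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()
invoice_url = "https://example.com/invoice.png"  # placeholder URL

# Illustrative line-item fields; adjust them to your invoices
class LineItem(BaseModel):
    description: str
    quantity: float
    unit_price: float

class InvoiceData(BaseModel):
    vendor_name: str
    invoice_number: str
    date: str
    total_amount: float
    line_items: list[LineItem]
    due_date: str | None = None

# OpenAI with structured output (the Responses API uses input_text/input_image parts)
response = client.responses.parse(
    model="gpt-5",
    input=[
        {"role": "user", "content": [
            {"type": "input_text", "text": "Extract invoice data from this image"},
            {"type": "input_image", "image_url": invoice_url, "detail": "high"}
        ]}
    ],
    text_format=InvoiceData,
)
invoice = response.output_parsed
print(f"Total: ${invoice.total_amount}")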
2. Visual Q&A for Customer Support
def handle_screenshot_question(image_base64, question):
    """Answer questions about a screenshot (e.g., error message)."""
    response = client.chat.completions.create(
        model="gpt-5",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": f"""You are a technical support agent.
A user sent a screenshot with this question: "{question}"
Analyze the screenshot and provide a clear, step-by-step solution."""},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/png;base64,{image_base64}",
                    "detail": "high"
                }}
            ]
        }],
        # GPT-5 series models expect max_completion_tokens rather than max_tokens
        max_completion_tokens=1000
    )
    return response.choices[0].message.content
3. Video Content Analysis
# Using Gemini for video summarization
import time

import google.generativeai as genai

model = genai.GenerativeModel('gemini-2.5-pro')

def analyze_video(video_path, questions):
    """Analyze video content with Gemini."""
    video_file = genai.upload_file(path=video_path)
    # Wait for processing
    while video_file.state.name == "PROCESSING":
        time.sleep(5)
        video_file = genai.get_file(video_file.name)
    prompt = f"""Analyze this video and answer:
{chr(10).join(f'- {q}' for q in questions)}
Provide timestamps for key moments."""
    response = model.generate_content([prompt, video_file])
    return response.text

# Example
questions = [
    "What are the main topics discussed?",
    "Were any decisions made? If so, what?",
    "Who spoke and for how long?"
]
summary = analyze_video("team_meeting.mp4", questions)
4. Multi-Image Comparison
def compare_images(image_urls, question):
    """Compare multiple images."""
    content = [{"type": "text", "text": question}]
    for url in image_urls:
        content.append({"type": "image_url", "image_url": {"url": url}})
    response = client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content": content}]
    )
    return response.choices[0].message.content

# Compare UI designs
result = compare_images(
    ["https://example.com/design-v1.png", "https://example.com/design-v2.png"],
    "Compare these two UI designs. Which has better accessibility? List specific differences."
)
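When the list of images is longer than a provider's per-request limit, the content has to be split across several requests. The sketch below shows one way to do that, assuming a limit of 20 images per request as listed in the comparison table. The helper name `batch_image_content` is hypothetical. Each batch is a separate request, so the model cannot compare images across batches.

```python
def batch_image_content(image_urls, question, max_images=20):
    """Split image URLs into content lists of at most max_images images each."""
    if max_images < 1:
        raise ValueError("max_images must be at least 1")
    batches = []
    for start in range(0, len(image_urls), max_images):
        chunk = image_urls[start:start + max_images]
        # Every batch repeats the question so each request stands alone.
        content = [{"type": "text", "text": question}]
        content.extend(
            {"type": "image_url", "image_url": {"url": url}} for url in chunk
        )
        batches.append(content)
    return batches
```

Each returned list can be used as the `content` of one user message.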
Cost Optimization for Multimodal
Multimodal requests are significantly more expensive than text-only ones. Three ways to reduce costs:
1. Resize Images Before Sending
from PIL import Image
import io
import base64

def optimize_image(image_path, max_size=1024):
    """Resize and compress image for API calls."""
    img = Image.open(image_path)
    # Resize while maintaining aspect ratio
    img.thumbnail((max_size, max_size))
    # JPEG has no alpha channel, so convert RGBA/palette images to RGB first
    if img.mode != "RGB":
        img = img.convert("RGB")
    # Convert to JPEG for smaller size
    buffer = io.BytesIO()
    img.save(buffer, format="JPEG", quality=85)
    return base64.b64encode(buffer.getvalue()).decode("utf-8")
2. Use Low Detail When Possible
# Low detail is fine for high-level questions
#   "Is this a photo of a cat or dog?" → detail: "low"
# High detail needed for specific content
#   "Read the text on the sign" → detail: "high"
3. Pre-process with Specialized APIs
# For OCR-heavy workloads, consider dedicated OCR first
# Then send extracted text to LLM (much cheaper)
import pytesseract
from PIL import Image

# Step 1: OCR the image (free, local)
text = pytesseract.image_to_string(Image.open("document.png"))

# Step 2: Send only text to LLM (cheaper than vision)
response = client.chat.completions.create(
    model="gpt-5.4-mini",  # Cheaper model sufficient for text-only
    messages=[{
        "role": "user",
        "content": f"Extract structured data from this text:\n{text}"
    }]
)
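Local OCR can return little or no usable text on poor scans, in which case sending the image to a vision model is the safer path. The following is a rough sketch of a gate that decides whether OCR output looks good enough to use. The function name and both thresholds are assumptions to tune on your own documents; this is a heuristic, not an established metric.

```python
def ocr_looks_usable(text: str, min_chars: int = 40,
                     min_alnum_ratio: float = 0.6) -> bool:
    """Heuristic check that OCR output is long enough and mostly alphanumeric."""
    stripped = "".join(text.split())  # drop all whitespace
    if len(stripped) < min_chars:
        return False
    alnum = sum(ch.isalnum() for ch in stripped)
    return alnum / len(stripped) >= min_alnum_ratio
```

If the check fails, fall back to a vision request with `detail: "high"` instead of sending the OCR text.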
Common Pitfalls
- Sending full-resolution images: Resize so the longest side is 1024-2048px. The model gains little from 4K resolution, and high-detail token cost grows with image size
- Using vision when text extraction suffices: If you just need OCR, use Tesseract or a dedicated OCR API. It's cheaper and faster
- Not setting detail level: The default is "auto", which may choose high detail unnecessarily. Be explicit
- Exceeding image limits: GPT-5 allows at most 20 images per request. Gemini allows more, but at higher cost
- Forgetting image token costs: A single high-detail image can consume more than 1,000 tokens. Budget accordingly
- Assuming perfect OCR: Models still make OCR errors on handwritten text, low-quality scans, and unusual fonts. Always validate critical extractions
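Validating critical extractions can be as simple as a few consistency checks on the parsed result. The sketch below works on a plain dict and assumes each line item has `quantity` and `unit_price` fields (hypothetical names), that dates are ISO formatted, and that the total is the plain sum of line items with no tax or discount. Real invoices usually need looser rules.

```python
from datetime import date


def validate_invoice(invoice: dict, tolerance: float = 0.01) -> list[str]:
    """Return a list of problems found in an extracted invoice (empty if none)."""
    problems = []

    if not str(invoice.get("vendor_name", "")).strip():
        problems.append("vendor_name is empty")

    try:
        date.fromisoformat(str(invoice.get("date", "")))
    except ValueError:
        problems.append("date is not ISO formatted (YYYY-MM-DD)")

    # Cross-check: line items should add up to the stated total.
    line_total = sum(
        item["quantity"] * item["unit_price"]
        for item in invoice.get("line_items", [])
    )
    if abs(line_total - invoice.get("total_amount", 0.0)) > tolerance:
        problems.append(
            f"line items sum to {line_total:.2f}, "
            f"but total_amount is {invoice.get('total_amount', 0.0):.2f}"
        )
    return problems
```

A non-empty result is a signal to route the document to manual review or retry with a higher detail level.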
Conclusion
Multimodal AI has transformed what's possible in 2026. The ability to send an image alongside a text prompt and get intelligent, context-aware responses is now table stakes. GPT-5 offers the most flexible image understanding with detail control. Gemini provides the best native video and audio support. Claude excels at document analysis and detailed visual reasoning.
For most developers, the practical starting point is GPT-5 vision with structured outputs — you can extract structured data from screenshots, documents, and photos with a single API call. Add Gemini for video-heavy workloads, and Claude for complex document processing. Optimize costs by resizing images, using low detail when appropriate, and pre-processing with OCR for text-heavy documents.