AI Agent Development Guide 2026: Build Your First AI Agent from Scratch
A complete tutorial for building autonomous AI agents. Covers architecture, tool calling, memory, frameworks, and production best practices with real Python code.
What Are AI Agents (And Why They Are Not Just LLM Calls)
There is a critical distinction that most tutorials gloss over: calling an LLM API is not building an agent. An AI agent is an autonomous system that perceives its environment, reasons about what to do, and takes actions to achieve a goal—all without requiring a human to manually drive every step.
Consider the difference. A simple LLM call looks like this: you send a prompt, you get a response. It is stateless, single-turn, and passive. An agent, on the other hand, can break a complex task into subtasks, call external tools when it needs information, remember what it has already done, and iterate until the goal is met.
Think of it this way: a chatbot answers questions. An agent solves problems.
| Aspect | Simple LLM Call | AI Agent |
|---|---|---|
| Control flow | Single request-response | Autonomous loop with branching |
| Tool access | None | Can call APIs, databases, browsers |
| Memory | Stateless (unless you manage it) | Built-in short-term and long-term |
| Error handling | Only what the caller adds | Retries, fallbacks, self-correction |
| Goal pursuit | Answers one question | Plans and executes multi-step strategy |
| Example | "Summarize this text" | "Research competitors, compile a report, and email it" |
In 2026, agents are no longer a research curiosity. They power customer support systems that resolve tickets autonomously, research assistants that synthesize information from dozens of sources, coding assistants that plan and implement features across multiple files, and operations agents that monitor infrastructure and respond to incidents. The gap between "calling GPT" and "building an agent" is exactly what this guide bridges.
Agent Architecture: The Perception-Reasoning-Action Loop
Every AI agent, regardless of framework or complexity, follows the same fundamental loop:
- Perceive: Receive input (user message, sensor data, system event) and gather context from memory and environment
- Reason: The LLM analyzes the situation, decides what to do next, and selects which tools to call (if any)
- Act: Execute the chosen action—call an API, query a database, write a file, send a message
- Observe: Process the result of the action and update internal state
- Repeat: Continue the loop until the goal is achieved or a stopping condition is met
This is commonly called the ReAct loop (Reason + Act), and it is the architectural backbone of most production agent systems built in 2026.
Key Insight
The power of the ReAct loop is that it is iterative. The agent does not need to know everything upfront. It can gather information, realize it needs more context, call another tool, and adjust its plan. This is what makes agents genuinely autonomous.
Here is what the loop looks like in code:
class Agent:
    def __init__(self, llm, tools, memory):
        self.llm = llm
        self.tools = tools
        self.memory = memory

    def run(self, task: str, max_iterations: int = 10):
        self.memory.add("user", task)
        for i in range(max_iterations):
            # Perceive + Reason
            context = self.memory.get_context()
            response = self.llm.chat(
                messages=context,
                tools=self.tools.definitions()
            )
            # If the LLM wants to call a tool
            if response.tool_calls:
                for call in response.tool_calls:
                    # Act
                    result = self.tools.execute(call)
                    # Observe
                    self.memory.add("tool", f"{call.name}: {result}")
            else:
                # No tool call = final answer
                self.memory.add("assistant", response.content)
                return response.content
        return "Agent reached maximum iterations without completing the task."
This simple pattern—perceive, reason, act, observe, repeat—is the foundation. Everything else in this guide builds on top of it.
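To watch the loop run without an API key, the sketch below swaps the model for a scripted stand-in. `Reply`, `ScriptedLLM`, and `run_loop` are hypothetical names introduced only for illustration; in a real agent, the `chat` method would call a model instead of replaying canned replies.

```python
from dataclasses import dataclass, field


@dataclass
class Reply:
    """Hypothetical model reply: either tool calls or a final answer."""
    content: str = ""
    tool_calls: list = field(default_factory=list)  # (tool name, argument) pairs


class ScriptedLLM:
    """Stand-in for a real model: returns pre-written replies in order."""

    def __init__(self, replies):
        self._replies = list(replies)

    def chat(self, messages):
        return self._replies.pop(0)


def run_loop(llm, tools, task, max_iterations=10):
    """Reason, act, observe; stop when the model no longer asks for a tool."""
    messages = [("user", task)]
    for _ in range(max_iterations):
        reply = llm.chat(messages)                           # reason
        if not reply.tool_calls:
            return reply.content, messages                   # final answer
        for name, argument in reply.tool_calls:
            result = tools[name](argument)                   # act
            messages.append(("tool", f"{name}: {result}"))   # observe
    return "Agent reached maximum iterations.", messages


tools = {"add_one": lambda x: x + 1}
llm = ScriptedLLM([
    Reply(tool_calls=[("add_one", 41)]),   # turn 1: the "model" asks for a tool
    Reply(content="The answer is 42."),    # turn 2: it answers
])
answer, transcript = run_loop(llm, tools, "What is 41 + 1?")
print(answer)
```

The transcript returned alongside the answer is the observation history, which is what a memory component would store between turns.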
Tools and Function Calling
Tools are what transform a language model from a text generator into an agent that can interact with the real world. Without tools, an LLM can only produce text. With tools, it can search the web, query databases, read files, send emails, call APIs, and execute code.
Function calling (also called tool use) is the mechanism that makes this work. You define a set of functions with their names, descriptions, and parameter schemas. The LLM decides which function to call and with what arguments based on the current context.
Defining Tools
Here is how you define tools using the OpenAI function calling format:
tools = [
    {
        "type": "function",
        "function": {
            "name": "search_web",
            "description": "Search the web for current information on a topic",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "The search query"
                    }
                },
                "required": ["query"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read the contents of a file from disk",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {
                        "type": "string",
                        "description": "Absolute path to the file"
                    }
                },
                "required": ["path"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "send_email",
            "description": "Send an email to a recipient",
            "parameters": {
                "type": "object",
                "properties": {
                    "to": {"type": "string", "description": "Recipient email"},
                    "subject": {"type": "string", "description": "Email subject"},
                    "body": {"type": "string", "description": "Email body text"}
                },
                "required": ["to", "subject", "body"]
            }
        }
    }
]
Executing Tool Calls
When the LLM decides to call a tool, you receive a structured response with the function name and arguments. You then execute the function and feed the result back:
import json
from openai import OpenAI

client = OpenAI()


def execute_tool(name: str, args: dict) -> str:
    """Dispatch tool calls to actual implementations."""
    # search_web, read_file, and send_email are your own functions (not shown here)
    if name == "search_web":
        return search_web(args["query"])
    elif name == "read_file":
        return read_file(args["path"])
    elif name == "send_email":
        return send_email(args["to"], args["subject"], args["body"])
    return f"Unknown tool: {name}"


def run_agent(messages: list, max_turns: int = 5):
    for _ in range(max_turns):
        response = client.chat.completions.create(
            model="gpt-4.1",
            messages=messages,
            tools=tools
        )
        msg = response.choices[0].message
        # Keep the assistant turn in the history so tool results can refer to it
        messages.append(msg)
        # No tool calls = agent is done
        if not msg.tool_calls:
            return msg.content
        # Execute each tool call
        for tool_call in msg.tool_calls:
            args = json.loads(tool_call.function.arguments)
            result = execute_tool(tool_call.function.name, args)
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": str(result)
            })
    return "Agent reached maximum turns."
Security Warning
Never let an LLM directly execute shell commands or database queries without sanitization. Always validate tool arguments before execution. A malicious prompt can trick an agent into running destructive commands. Use allowlists, sandboxed environments, and confirmation steps for sensitive operations.
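As one illustration of validating arguments before execution, the sketch below checks tool names against an allowlist and confines `read_file` paths to a single directory. The root `/srv/agent-data`, the set `ALLOWED_TOOLS`, and the helper names are assumptions made for this example, not part of any SDK.

```python
from pathlib import Path

# Hypothetical sandbox root: the only directory the agent may read from.
ALLOWED_ROOT = Path("/srv/agent-data")
# Hypothetical allowlist: send_email is deliberately left out.
ALLOWED_TOOLS = {"search_web", "read_file"}


def validate_read_path(raw_path: str, root: Path = ALLOWED_ROOT) -> Path:
    """Return the resolved path if it stays inside root, else raise ValueError."""
    resolved = (root / raw_path).resolve()
    # resolve() collapses ".." segments; an absolute raw_path replaces root entirely
    if not resolved.is_relative_to(root.resolve()):
        raise ValueError(f"Path escapes allowed root: {raw_path}")
    return resolved


def check_tool_call(name: str, args: dict) -> dict:
    """Reject tools outside the allowlist and normalize file paths."""
    if name not in ALLOWED_TOOLS:
        raise PermissionError(f"Tool not allowed: {name}")
    if name == "read_file":
        args = {**args, "path": str(validate_read_path(args["path"]))}
    return args


print(check_tool_call("read_file", {"path": "notes/today.txt"}))
```

Calling `check_tool_call` just before the dispatcher runs keeps the policy in one place. `Path.is_relative_to` requires Python 3.9 or later.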
Memory and State Management
Memory is what separates a one-shot chatbot from a persistent, context-aware agent. There are three types of memory every production agent needs:
1. Working Memory (Conversation Context)
This is the immediate conversation history: the messages exchanged so far. Every LLM API call includes this as the messages array. The challenge is that context windows are finite. A 128K-token window sounds large, but an agent making 20 tool calls can exceed it when each result adds several thousand tokens (20 results of 6,500 tokens each already total 130K).
Solution: Implement a sliding window with summarization. Keep the most recent N messages verbatim and summarize older ones:
class ConversationMemory:
    def __init__(self, max_messages: int = 20):
        self.messages = []
        self.summary = ""
        self.max_messages = max_messages

    def add(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        if len(self.messages) > self.max_messages:
            self._summarize_older()

    def _summarize_older(self):
        # Summarize the oldest half of messages
        # summarize() is assumed to be your own helper (for example, an LLM call)
        old = self.messages[:len(self.messages) // 2]
        self.summary += f"\nPrevious context: {summarize(old)}"
        self.messages = self.messages[len(self.messages) // 2:]

    def get_context(self) -> list:
        context = []
        if self.summary:
            context.append({
                "role": "system",
                "content": f"Previous conversation summary: {self.summary}"
            })
        context.extend(self.messages)
        return context
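Message count is only a proxy for context size, since one long tool result can outweigh many short turns. A token-budget variant might look like the sketch below. `estimate_tokens` uses a rough four-characters-per-token heuristic, which is an assumption for illustration; use the model's real tokenizer when accuracy matters.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic (about 4 characters per token); not a real tokenizer."""
    return max(1, len(text) // 4)


def trim_to_budget(messages: list, max_tokens: int) -> list:
    """Keep the newest messages whose estimated size fits within max_tokens."""
    kept = []
    used = 0
    for message in reversed(messages):           # walk newest to oldest
        cost = estimate_tokens(message["content"])
        if used + cost > max_tokens:
            break                                # everything older is dropped
        kept.append(message)
        used += cost
    kept.reverse()                               # restore chronological order
    return kept


history = [
    {"role": "user", "content": "a" * 40},       # about 10 tokens
    {"role": "assistant", "content": "b" * 40},  # about 10 tokens
    {"role": "user", "content": "c" * 40},       # about 10 tokens
]
recent = trim_to_budget(history, max_tokens=25)
print([m["content"][0] for m in recent])  # → ['b', 'c']
```

The dropped messages are the ones you would hand to the summarizer, so the two approaches combine naturally.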
2. Episodic Memory (Past Interactions)
Episodic memory stores what happened in previous sessions. When a user returns, the agent remembers their preferences, past problems, and established context. This is typically stored in a vector database or relational database indexed by user ID.
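from datetime import datetime, timezone


class EpisodicMemory:
    def __init__(self, db):
        # db is any store that offers insert() and a filtered similarity search()
        self.db = db

    def store(self, user_id: str, event: str, embedding: list):
        self.db.insert({
            "user_id": user_id,
            "event": event,
            "embedding": embedding,
            # Timezone-aware UTC; datetime.utcnow() is deprecated since Python 3.12
            "timestamp": datetime.now(timezone.utc)
        })

    def recall(self, user_id: str, query_embedding: list, top_k: int = 5):
        return self.db.search(
            user_id=user_id,
            embedding=query_embedding,
            top_k=top_k
        )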
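The `db` object above is left abstract. For local experiments, a toy in-memory store with the same `insert` and `search` shape can stand in for a real vector database. `InMemoryVectorDB` is a hypothetical class for illustration: it scans every row linearly and keeps nothing on disk, so it is no substitute for Pinecone, Qdrant, or pgvector.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryVectorDB:
    """Toy stand-in for a vector database: linear scan, no persistence."""

    def __init__(self):
        self.rows = []

    def insert(self, row: dict) -> None:
        self.rows.append(row)

    def search(self, user_id: str, embedding: list, top_k: int = 5) -> list:
        # Filter by user first, then rank that user's rows by similarity
        candidates = [r for r in self.rows if r["user_id"] == user_id]
        candidates.sort(
            key=lambda r: cosine_similarity(r["embedding"], embedding),
            reverse=True,
        )
        return candidates[:top_k]


db = InMemoryVectorDB()
db.insert({"user_id": "u1", "event": "prefers dark mode", "embedding": [1.0, 0.0]})
db.insert({"user_id": "u1", "event": "asked about billing", "embedding": [0.0, 1.0]})
db.insert({"user_id": "u2", "event": "unrelated user", "embedding": [1.0, 0.0]})
hits = db.search(user_id="u1", embedding=[0.9, 0.1], top_k=1)
print(hits[0]["event"])  # → prefers dark mode
```

The two-dimensional embeddings are made up for the demo; real embeddings come from an embedding model and have hundreds or thousands of dimensions.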
3. Semantic Memory (Factual Knowledge)
Semantic memory is the agent's knowledge base—documents, FAQs, product information, and any structured data it needs to answer questions accurately. This is where RAG (Retrieval-Augmented Generation) comes in. Store your knowledge in a vector database and retrieve relevant chunks on demand.
Memory Architecture Best Practice
- Working memory: In-memory, sliding window with summarization
- Episodic memory: Vector database (Pinecone, Qdrant, pgvector) per user
- Semantic memory: Vector database with curated knowledge chunks
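- All three: Assembled into the model's context at query time (working memory as the messages array; retrieved episodic and semantic memory typically in the system prompt)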
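One way to combine the three stores at query time is sketched below. The function name `build_context` and the prompt wording are assumptions made for illustration; the retrieved items are passed in as plain strings so the example stays independent of any particular database.

```python
def build_context(base_prompt, episodic_hits, knowledge_chunks, working_messages):
    """Return a messages list: one enriched system prompt, then the recent turns."""
    sections = [base_prompt]
    if episodic_hits:
        sections.append(
            "What you know about this user:\n"
            + "\n".join(f"- {hit}" for hit in episodic_hits)
        )
    if knowledge_chunks:
        sections.append(
            "Reference material:\n"
            + "\n".join(f"- {chunk}" for chunk in knowledge_chunks)
        )
    system_message = {"role": "system", "content": "\n\n".join(sections)}
    # Working memory follows unchanged, oldest turn first
    return [system_message, *working_messages]


messages = build_context(
    base_prompt="You are a support agent.",
    episodic_hits=["prefers concise answers"],
    knowledge_chunks=["Refunds are processed within 5 business days."],
    working_messages=[{"role": "user", "content": "Where is my refund?"}],
)
print(messages[0]["content"])
```

Keeping retrieval results in the system message rather than interleaving them with the conversation makes it easy to replace them on every turn.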
Building a Research Agent: Step by Step
Let us build a complete research agent that takes a topic, searches for information, synthesizes findings, and produces a structured report. This agent demonstrates every concept we have covered: the ReAct loop, tool calling, and memory management.
Step 1: Define the Tools
import requests
from duckduckgo_search import DDGS


def search_web(query: str, max_results: int = 5) -> str:
    """Search the web using DuckDuckGo."""
    results = []
    with DDGS() as ddgs:
        for r in ddgs.text(query, max_results=max_results):
            results.append(f"Title: {r['title']}\nURL: {r['href']}\nSnippet: {r['body']}")
    return "\n\n".join(results) if results else "No results found."


def fetch_url(url: str) -> str:
    """Fetch and extract text content from a URL."""
    try:
        resp = requests.get(url, timeout=10, headers={"User-Agent": "ResearchAgent/1.0"})
        resp.raise_for_status()
        # Simple text extraction (in production, use proper HTML parser)
        text = resp.text[:5000]  # Truncate to avoid token overflow
        return text
    except Exception as e:
        return f"Error fetching {url}: {e}"
RESEARCH_TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "search_web",
            "description": "Search the web for information on a topic",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query"}
                },
                "required": ["query"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "fetch_url",
            "description": "Fetch the full content of a web page by URL",
            "parameters": {
                "type": "object",
                "properties": {
                    "url": {"type": "string", "description": "URL to fetch"}
                },
                "required": ["url"]
            }
        }
    }
]
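The comment in `fetch_url` notes that raw HTML should go through a proper parser. A minimal sketch using only the standard library's `html.parser` follows. `TextExtractor` and `html_to_text` are hypothetical helper names; a dedicated library will cope better with malformed markup.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside script/style blocks
        if self._skip_depth == 0 and data.strip():
            self._parts.append(data.strip())

    def text(self) -> str:
        return " ".join(self._parts)


def html_to_text(html: str, max_chars: int = 5000) -> str:
    """Strip tags from html and truncate the visible text to max_chars."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()[:max_chars]


sample = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<h1>Title</h1><script>var x=1;</script><p>Hello <b>world</b></p>"
    "</body></html>"
)
print(html_to_text(sample))  # → Title Hello world
```

Truncating after extraction, not before, means the 5,000-character budget is spent on readable text and not on markup.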
Step 2: Build the Agent Core
import json
from openai import OpenAI


class ResearchAgent:
    def __init__(self, model: str = "gpt-4.1"):
        self.client = OpenAI()
        self.model = model
        self.tool_map = {
            "search_web": search_web,
            "fetch_url": fetch_url,
        }
        self.conversation = []

    def _system_prompt(self) -> str:
        return """You are a research agent. Your job is to:
1. Understand the research topic
2. Search for relevant information
3. Read promising sources in detail
4. Synthesize findings into a clear, structured report

Guidelines:
- Start with a broad search, then narrow down
- Verify facts across multiple sources when possible
- Always cite your sources with URLs
- If a search returns poor results, try different query formulations
- Produce a final report with sections: Summary, Key Findings, Detailed Analysis, Sources"""

    def research(self, topic: str, max_turns: int = 8) -> str:
        self.conversation = [
            {"role": "system", "content": self._system_prompt()},
            {"role": "user", "content": f"Research this topic and produce a report: {topic}"}
        ]
        for turn in range(max_turns):
            response = self.client.chat.completions.create(
                model=self.model,
                messages=self.conversation,
                tools=RESEARCH_TOOLS,
                tool_choice="auto"
            )
            msg = response.choices[0].message
            self.conversation.append(msg)
            if not msg.tool_calls:
                return msg.content
            for tool_call in msg.tool_calls:
                fn_name = tool_call.function.name
                fn_args = json.loads(tool_call.function.arguments)
                print(f"  [Tool Call] {fn_name}({fn_args})")
                result = self.tool_map[fn_name](**fn_args)
                self.conversation.append({
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "content": str(result)[:3000]  # Truncate long results
                })
        return "Research agent reached maximum turns. Partial results may be incomplete."


# Run it
agent = ResearchAgent()
report = agent.research("AI agent frameworks comparison 2026 LangChain CrewAI AutoGen")
print(report)
Step 3: Add Structured Output
For production agents, you often want the final output in a structured format rather than free-form text. Use structured outputs to enforce a schema:
from pydantic import BaseModel
from typing import List


class Source(BaseModel):
    title: str
    url: str
    relevance: str  # "high", "medium", "low"


class ResearchReport(BaseModel):
    topic: str
    summary: str
    key_findings: List[str]
    detailed_analysis: str
    sources: List[Source]
    confidence: str  # "high", "medium", "low"


# Use with the OpenAI structured output feature.
# client is an OpenAI() instance and conversation is the agent's message list.
response = client.beta.chat.completions.parse(
    model="gpt-4.1",
    messages=conversation,
    response_format=ResearchReport,
)
report = response.choices[0].message.parsed
This gives you a typed, validated output that you can programmatically process, store, or display in a UI.
Frameworks: LangChain, CrewAI, OpenAI Agents SDK
Building an agent from scratch teaches you the fundamentals. But for production, most teams use a framework. Here is how the three most popular options compare in 2026.
LangChain / LangGraph
LangChain remains the most widely adopted framework, but the ecosystem has shifted significantly toward LangGraph—LangChain's graph-based agent orchestration layer. LangGraph models agents as state machines with explicit nodes and edges, making complex multi-step workflows much easier to reason about.
from langgraph.prebuilt import create_react_agent
from langchain_openai import ChatOpenAI
from langchain_community.tools import DuckDuckGoSearchRun
model = ChatOpenAI(model="gpt-4.1")
tools = [DuckDuckGoSearchRun()]
agent = create_react_agent(model, tools)
result = agent.invoke({
    "messages": [{"role": "user", "content": "What are the top AI agent frameworks in 2026?"}]
})
print(result["messages"][-1].content)
Pros: Massive ecosystem, LangGraph visualization and debugging, LangSmith for observability, extensive integrations.
Cons: Steep learning curve, abstraction leaks, frequent breaking changes between versions.
CrewAI
CrewAI specializes in multi-agent systems. You define multiple agents with distinct roles, and they collaborate to solve problems. Think of it as assembling a team where each member has a specialty.
from crewai import Agent, Task, Crew, Process
from crewai_tools import SerperDevTool

search_tool = SerperDevTool()

researcher = Agent(
    role="Senior Research Analyst",
    goal="Find and analyze information about AI agent frameworks",
    backstory="You are an expert researcher with 10 years of experience in AI.",
    tools=[search_tool],
    verbose=True,
)
writer = Agent(
    role="Technical Writer",
    goal="Write a comprehensive comparison article",
    backstory="You are a skilled writer who makes complex topics accessible.",
    verbose=True,
)

research_task = Task(
    description="Research the top 5 AI agent frameworks in 2026",
    expected_output="A detailed analysis with pros, cons, and use cases",
    agent=researcher,
)
write_task = Task(
    description="Write a comparison article based on the research",
    expected_output="A 2000-word article with clear sections",
    agent=writer,
)

crew = Crew(
    agents=[researcher, writer],
    tasks=[research_task, write_task],
    process=Process.sequential,
)
result = crew.kickoff()
print(result)
Pros: Intuitive multi-agent design, role-based architecture, built-in task delegation, easy to reason about agent collaboration.
Cons: Higher cost (multiple LLM calls per task), harder to debug when agents disagree, less flexible than LangGraph for non-standard workflows.
OpenAI Agents SDK
Released in March 2025, the OpenAI Agents SDK is the newest of the three. It is lightweight, Python-native, and tightly integrated with the OpenAI API. If you are all-in on OpenAI models, this is the simplest path to production agents.
from agents import Agent, Runner, function_tool


@function_tool
def search_web(query: str) -> str:
    """Search the web for information."""
    # perform_search is a placeholder for your own search implementation
    results = perform_search(query)
    return results


research_agent = Agent(
    name="Research Agent",
    instructions="You research topics thoroughly and provide structured findings.",
    tools=[search_web],
    model="gpt-4.1",
)

result = Runner.run_sync(
    research_agent,
    "Compare LangChain vs CrewAI vs OpenAI Agents SDK for building AI agents"
)
print(result.final_output)
Pros: Minimal boilerplate, first-class OpenAI integration, built-in tracing, guardrails API.
Cons: Works best with OpenAI models (other providers need an OpenAI-compatible endpoint or an adapter such as LiteLLM), smaller ecosystem, less community support than LangChain.
Framework Comparison
| Feature | LangGraph | CrewAI | OpenAI Agents SDK |
|---|---|---|---|
| Language | Python / JS | Python | Python |
| Model support | Any (OpenAI, Anthropic, local) | Any | OpenAI first; others via adapters |
| Multi-agent | Yes (graph-based) | Yes (role-based) | Yes (handoff-based) |
| Observability | LangSmith | CrewAI+ monitoring | Built-in tracing |
| Learning curve | Steep | Medium | Low |
| Production readiness | High | Medium | High (OpenAI stack) |
| Best for | Complex workflows | Team simulation | Quick OpenAI agents |
Our Recommendation
- Just getting started? OpenAI Agents SDK — lowest friction
- Need multi-model or complex workflows? LangGraph — most flexible
- Building a team of specialists? CrewAI — best multi-agent ergonomics
- Production with observability? LangGraph + LangSmith — best debugging
Production Considerations
Demo agents work perfectly. Production agents break in spectacular ways. Here are the critical considerations for taking an agent from prototype to production.
Error Handling
Agents fail in ways that simple API calls do not. A tool might return an error, the LLM might produce malformed function call arguments, the agent might loop endlessly, or the context window might overflow. Every one of these scenarios needs explicit handling.
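import json
import time


class RobustAgent:
    # Assumes __init__ sets self.client, self.model, self.tools,
    # self.conversation, and that execute_tool() is defined.

    def _call_llm(self, max_retries: int = 3):
        """Call the LLM, retrying failed requests with exponential backoff."""
        for attempt in range(max_retries):
            try:
                return self.client.chat.completions.create(
                    model=self.model,
                    messages=self.conversation,
                    tools=self.tools,
                    timeout=30  # Prevent hanging
                )
            except Exception:
                if attempt == max_retries - 1:
                    raise
                time.sleep(2 ** attempt)  # Exponential backoff

    def run(self, task: str, max_turns: int = 8, max_retries: int = 3):
        self.conversation.append({"role": "user", "content": task})
        for _ in range(max_turns):
            try:
                response = self._call_llm(max_retries)
            except Exception as e:
                return f"Agent failed after {max_retries} attempts: {str(e)}"
            msg = response.choices[0].message
            if not msg.tool_calls:
                return msg.content
            # Record the assistant turn first: tool messages must follow
            # the assistant message that requested them.
            self.conversation.append(msg)
            for call in msg.tool_calls:
                try:
                    args = json.loads(call.function.arguments)
                except json.JSONDecodeError:
                    # Ask the LLM to fix malformed arguments
                    self.conversation.append({
                        "role": "tool",
                        "tool_call_id": call.id,
                        "content": "Error: Invalid JSON in function arguments. Please retry with valid JSON."
                    })
                    continue
                try:
                    result = self.execute_tool(call.function.name, args)
                except Exception as e:
                    result = f"Tool error: {str(e)}. Try a different approach."
                self.conversation.append({
                    "role": "tool",
                    "tool_call_id": call.id,
                    "content": str(result)[:3000]
                })
        return "Agent reached maximum turns."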
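The backoff logic can also live in a small reusable helper so that tool calls and LLM calls share it. The sketch below adds random jitter so that many clients retrying at once do not all hit the API at the same moment. `retry_with_backoff` and its parameters are hypothetical names; the `sleep` argument exists only so the demo and tests can run without waiting.

```python
import random
import time


def retry_with_backoff(fn, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise                                 # out of attempts: surface the error
            delay = base_delay * (2 ** attempt)       # 1s, 2s, 4s, ...
            sleep(delay + random.uniform(0, delay * 0.1))  # up to 10% jitter


attempts = {"count": 0}


def flaky():
    """Fails twice, then succeeds, to exercise the retry path."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


waits = []
print(retry_with_backoff(flaky, sleep=waits.append))  # → ok
```

In production you would usually retry only on errors known to be transient, such as timeouts and rate limits, and not on every exception.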
Cost Control
Agents are expensive because each turn involves an LLM API call with a growing conversation history. A 10-turn agent interaction can easily consume 50K+ tokens. Here are strategies to control costs:
- Set token budgets: Track cumulative tokens per session and stop when a budget is exceeded
- Use cheaper models for tool calling: GPT-4.1-mini often handles routine tool calls well at a fraction of the cost of GPT-4.1; verify this on your own tasks before switching
- Compress conversation history: Summarize older turns instead of keeping them verbatim
- Cache tool results: If the same search query was made recently, return the cached result
- Implement early stopping: If the agent's last three tool calls did not produce new information, stop and synthesize what you have
class CostAwareAgent:
    def __init__(self, max_tokens_per_session: int = 100000):
        # self.client, self.conversation, and self.tools are set up as in earlier examples
        self.max_tokens = max_tokens_per_session
        self.tokens_used = 0
        self.budget_warning_sent = False

    def run(self, task: str):
        while self.tokens_used < self.max_tokens:
            response = self.client.chat.completions.create(
                model="gpt-4.1-mini",  # Cheaper model
                messages=self.conversation,
                tools=self.tools,
            )
            self.tokens_used += response.usage.total_tokens
            if self.tokens_used > self.max_tokens * 0.8 and not self.budget_warning_sent:
                # Switch to synthesis mode (send the notice once, not every turn)
                self.conversation.append({
                    "role": "system",
                    "content": "Token budget is nearly exhausted. Provide your best answer now with the information you have."
                })
                self.budget_warning_sent = True
            # ... rest of agent loop
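Two of the strategies listed above, caching tool results and early stopping, can be sketched with the standard library alone. `ToolCache` and `should_stop_early` are hypothetical helpers. The cache assumes tool arguments are hashable values such as strings and numbers, and the stopping rule treats "no new information" as "a result identical to one seen earlier", which is a crude proxy.

```python
import time


class ToolCache:
    """Cache tool results by (tool name, arguments) for ttl_seconds."""

    def __init__(self, ttl_seconds: float = 300.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}  # key -> (stored_at, result)

    def get_or_call(self, name: str, args: dict, fn):
        key = (name, tuple(sorted(args.items())))
        entry = self._entries.get(key)
        now = self.clock()
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]                 # fresh hit: skip the tool call
        result = fn(**args)
        self._entries[key] = (now, result)
        return result


def should_stop_early(results: list, window: int = 3) -> bool:
    """True when each of the last `window` results was already seen earlier."""
    if len(results) <= window:
        return False
    earlier = set(results[:-window])
    return all(r in earlier for r in results[-window:])


calls = []


def fake_search(query: str) -> str:
    calls.append(query)
    return f"results for {query}"


cache = ToolCache(ttl_seconds=60)
cache.get_or_call("search_web", {"query": "agents"}, fake_search)
cache.get_or_call("search_web", {"query": "agents"}, fake_search)
print(len(calls))  # → 1
```

A cached result costs no API call, and because repeated searches return identical text, the early-stopping check will notice an agent that keeps asking the same question.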
Observability
You cannot debug what you cannot see. Production agents need comprehensive logging and tracing. Every turn of the agent loop should log: the input messages, the LLM's response, which tools were called, the tool results, the token count, and the latency.
import time
import uuid

import structlog

logger = structlog.get_logger()


def generate_trace_id() -> str:
    """Return a unique ID that ties together all log lines from one run."""
    return uuid.uuid4().hex


class ObservableAgent:
    # Assumes __init__ sets self.client, self.model, self.tools,
    # self.conversation, and self.max_turns.
    def run(self, task: str):
        trace_id = generate_trace_id()
        logger.info("agent_started", trace_id=trace_id, task=task)
        for turn in range(self.max_turns):
            start_time = time.time()
            response = self.client.chat.completions.create(
                model=self.model,
                messages=self.conversation,
                tools=self.tools,
            )
            latency = time.time() - start_time
            logger.info("llm_call",
                trace_id=trace_id,
                turn=turn,
                tokens=response.usage.total_tokens,
                latency_ms=round(latency * 1000),
                has_tool_calls=bool(response.choices[0].message.tool_calls)
            )
            if response.choices[0].message.tool_calls:
                for call in response.choices[0].message.tool_calls:
                    logger.info("tool_call",
                        trace_id=trace_id,
                        turn=turn,
                        tool=call.function.name,
                        args=call.function.arguments[:200]
                    )
                    # Execute and log result...
For full observability, use LangSmith (with LangGraph), Arize Phoenix (open-source), or OpenTelemetry with a Jaeger/Tempo backend. These tools give you trace visualizations, latency breakdowns, and token cost tracking across every agent turn.
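If none of those tools is available yet, a small context manager built on the standard `logging` module already captures per-step latency and failures as JSON lines. `traced_step` is a hypothetical helper for illustration and does not replace a tracing backend.

```python
import json
import logging
import time
import uuid
from contextlib import contextmanager

logger = logging.getLogger("agent")


@contextmanager
def traced_step(trace_id: str, step: str, **fields):
    """Log one JSON line per step with its latency, even if the step raises."""
    start = time.perf_counter()
    status = "ok"
    try:
        yield
    except Exception:
        status = "error"
        raise                      # re-raise after marking the step as failed
    finally:
        latency_ms = round((time.perf_counter() - start) * 1000)
        logger.info(json.dumps({
            "trace_id": trace_id, "step": step, "status": status,
            "latency_ms": latency_ms, **fields,
        }))


logging.basicConfig(level=logging.INFO, format="%(message)s")
trace_id = uuid.uuid4().hex
with traced_step(trace_id, "tool_call", tool="search_web"):
    time.sleep(0.01)  # stand-in for real work
```

Because every line carries the same `trace_id`, one agent run can be reconstructed later by filtering the logs on that value.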
Guardrails and Safety
Production agents need guardrails at multiple levels:
- Input guardrails: Validate and sanitize user input before it reaches the agent. Check for prompt injection patterns, PII, and malicious content.
- Tool guardrails: Validate tool arguments before execution. Use allowlists for file paths, URL domains, and API endpoints. Require confirmation for destructive operations.
- Output guardrails: Check agent responses for harmful content, PII leaks, and hallucinations before returning to the user.
- Behavioral guardrails: Set explicit rules about what the agent should and should not do. Use system prompts to enforce boundaries.
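Two of these levels are easy to sketch in a few lines: a domain allowlist for URL-fetching tools and a crude pattern check on user input. The domains and regular expressions below are placeholders chosen for illustration. Pattern matching catches only the most obvious injection attempts and should be treated as one layer among several.

```python
import re
from urllib.parse import urlparse

# Hypothetical allowlist; replace with the domains your agent may fetch.
ALLOWED_DOMAINS = {"example.com", "docs.python.org"}

# Hypothetical patterns; real prompt-injection detection needs much more than this.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"reveal (your )?system prompt", re.IGNORECASE),
]


def is_url_allowed(url: str, allowed=ALLOWED_DOMAINS) -> bool:
    """Allow only http(s) URLs whose host is an allowed domain or a subdomain of one."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    host = parsed.hostname.lower()
    return any(host == d or host.endswith("." + d) for d in allowed)


def flag_prompt_injection(text: str) -> bool:
    """Return True if the text matches any known injection phrase."""
    return any(p.search(text) for p in INJECTION_PATTERNS)


print(is_url_allowed("https://docs.python.org/3/"))            # → True
print(is_url_allowed("https://example.com.evil.test/page"))    # → False
print(flag_prompt_injection("Ignore previous instructions."))  # → True
```

Matching on the parsed hostname, and not on the raw URL string, is what defeats lookalike addresses that merely contain an allowed domain somewhere in the path or query.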
Production Safety Checklist
- Set maximum iterations per agent run (prevent infinite loops)
- Set maximum tokens per session (prevent cost overruns)
- Sandbox tool execution environments (prevent system damage)
- Log every tool call with arguments and results (audit trail)
- Implement rate limiting per user (prevent abuse)
- Add human-in-the-loop confirmation for critical actions
- Test with adversarial inputs before deployment
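For the rate-limiting item, a per-user token bucket is a common choice. The sketch below keeps state in process memory, so it suits a single worker only; a deployment with several workers would need shared storage. `RateLimiter` and its parameters are hypothetical names, and the `clock` argument exists so the behaviour can be tested without real waiting.

```python
import time


class RateLimiter:
    """Per-user token bucket: `capacity` requests, refilled at `refill_per_second`."""

    def __init__(self, capacity: int = 5, refill_per_second: float = 1.0,
                 clock=time.monotonic):
        self.capacity = capacity
        self.refill = refill_per_second
        self.clock = clock
        self._buckets = {}  # user_id -> (tokens, last_seen)

    def allow(self, user_id: str) -> bool:
        now = self.clock()
        # New users start with a full bucket
        tokens, last = self._buckets.get(user_id, (float(self.capacity), now))
        # Refill in proportion to elapsed time, never above capacity
        tokens = min(self.capacity, tokens + (now - last) * self.refill)
        if tokens < 1:
            self._buckets[user_id] = (tokens, now)
            return False
        self._buckets[user_id] = (tokens - 1, now)  # spend one token
        return True


limiter = RateLimiter(capacity=2, refill_per_second=0.5)
print([limiter.allow("alice") for _ in range(3)])
```

With a capacity of 2, the third immediate request from the same user is refused, while other users are unaffected.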
Putting It All Together: A Complete Research Agent
Here is a complete research agent that combines everything we have covered. It reuses search_web, fetch_url, and RESEARCH_TOOLS from Step 1:
"""
Production Research Agent - Complete Implementation
Combines: ReAct loop, tool calling, memory, cost control, error handling, observability
"""
import json
import time

import structlog
from openai import OpenAI

# search_web, fetch_url, and RESEARCH_TOOLS are defined in Step 1 above.
logger = structlog.get_logger()


class ProductionResearchAgent:
    def __init__(
        self,
        model: str = "gpt-4.1-mini",
        max_turns: int = 8,
        max_tokens: int = 100_000,
    ):
        self.client = OpenAI()
        self.model = model
        self.max_turns = max_turns
        self.max_tokens = max_tokens
        self.tokens_used = 0
        self.conversation = []
        self.tool_map = {
            "search_web": search_web,
            "fetch_url": fetch_url,
        }
        self.tools = RESEARCH_TOOLS

    def _system_prompt(self) -> str:
        return """You are a research agent. Your job is to:
1. Understand the research topic
2. Search for relevant, current information
3. Read promising sources in detail (use fetch_url)
4. Synthesize findings into a structured report

Rules:
- Start broad, then narrow your searches
- Verify key facts across at least 2 sources
- Always cite URLs in your final report
- If a search returns poor results, reformulate the query
- Produce: Summary, Key Findings, Detailed Analysis, Sources
- Stop researching once you have sufficient information"""

    def research(self, topic: str) -> dict:
        trace_id = f"research-{int(time.time())}"
        logger.info("research_started", trace_id=trace_id, topic=topic)
        self.conversation = [
            {"role": "system", "content": self._system_prompt()},
            {"role": "user", "content": f"Research this topic: {topic}"}
        ]
        budget_notice_sent = False
        for turn in range(self.max_turns):
            over_budget = self.tokens_used >= self.max_tokens
            if over_budget and not budget_notice_sent:
                # Tell the model once; later turns also disable tools (see tool_choice)
                logger.warning("token_budget_exceeded", trace_id=trace_id)
                self.conversation.append({
                    "role": "system",
                    "content": "Token budget reached. Provide your best answer now."
                })
                budget_notice_sent = True
            try:
                start = time.time()
                response = self.client.chat.completions.create(
                    model=self.model,
                    messages=self.conversation,
                    tools=self.tools,
                    # Over budget: forbid further tool calls so the model must answer
                    tool_choice="none" if over_budget else "auto",
                    timeout=30,
                )
                latency = time.time() - start
                self.tokens_used += response.usage.total_tokens
                logger.info("turn_complete",
                    trace_id=trace_id, turn=turn,
                    tokens=response.usage.total_tokens,
                    latency_ms=round(latency * 1000),
                    total_tokens=self.tokens_used
                )
                msg = response.choices[0].message
                self.conversation.append(msg)
                if not msg.tool_calls:
                    logger.info("research_complete", trace_id=trace_id, turns=turn + 1)
                    return {
                        "report": msg.content,
                        "turns": turn + 1,
                        "tokens_used": self.tokens_used,
                        "trace_id": trace_id,
                    }
                for call in msg.tool_calls:
                    try:
                        args = json.loads(call.function.arguments)
                        logger.info("tool_call", trace_id=trace_id,
                            tool=call.function.name, args=str(args)[:100])
                        result = self.tool_map[call.function.name](**args)
                    except json.JSONDecodeError:
                        result = "Error: Invalid arguments. Retry with valid JSON."
                    except Exception as e:
                        # Includes unknown tool names (KeyError) and tool failures
                        result = f"Error: {str(e)}"
                    self.conversation.append({
                        "role": "tool",
                        "tool_call_id": call.id,
                        "content": str(result)[:3000]
                    })
            except Exception as e:
                logger.error("turn_failed", trace_id=trace_id, error=str(e))
                if turn == self.max_turns - 1:
                    return {
                        "report": f"Research failed: {str(e)}",
                        "turns": turn + 1,
                        "tokens_used": self.tokens_used,
                        "trace_id": trace_id,
                    }
                # Wait 1, 2, or 4 seconds before the next turn (cycles with the turn number)
                time.sleep(2 ** (turn % 3))
        return {
            "report": "Max turns reached. Report may be incomplete.",
            "turns": self.max_turns,
            "tokens_used": self.tokens_used,
            "trace_id": trace_id,
        }


# Usage
agent = ProductionResearchAgent(model="gpt-4.1-mini", max_turns=8)
result = agent.research("Best practices for deploying AI agents in production 2026")
print(result["report"])
print(f"\nStats: {result['turns']} turns, {result['tokens_used']} tokens")
Conclusion
Building AI agents in 2026 is fundamentally about mastering the perception-reasoning-action loop and then layering on the production infrastructure that makes agents reliable: robust error handling, cost controls, observability, and guardrails. Start by building an agent from scratch to understand the core mechanics, then adopt a framework when your needs outgrow the hand-rolled approach.
The three frameworks we compared each serve a different need: LangGraph for complex, multi-step workflows; CrewAI for multi-agent collaboration; and the OpenAI Agents SDK for quick, OpenAI-native agents. Pick based on your use case, not hype.
Most importantly, remember that agents are software systems, not magic. They need the same engineering discipline as any production service: testing, monitoring, error handling, and iteration. The agent that works in a demo is not the agent that works in production. Build for reliability, and your users will trust the results.
Last updated: 2026-05-10. Code examples tested with OpenAI Python SDK v1.82+, LangGraph 0.4+, CrewAI 0.95+, and OpenAI Agents SDK 1.0+.