An AI agent is not a chatbot with more context. It is an autonomous system that receives a goal, plans the necessary steps, executes actions through external tools, and adapts when something fails. The difference is operational: a chatbot answers questions, an agent completes tasks.
At Cloudstudio we have been designing and deploying AI agents for over a year for clients who need to automate complex processes. In this article we share the architecture we use, the patterns that work, and the mistakes we have learned to avoid.
The fundamental agent loop
Every agent follows a cycle: observe the current state, decide the next action, execute it, and evaluate the result. This loop repeats until the goal is met or it escalates to a human. The key is that each step is a model call with updated context — the agent does not memorize a fixed plan, but re-evaluates at each iteration.
We use Claude as the brain of our agents. Its native tool use capability allows defining tools as functions the model can invoke directly: querying databases, sending emails, updating CRMs, generating documents. The model decides when and how to use each tool based on the task context.
Here is the core agent loop we use in production. It is deceptively simple — the complexity lives in the tools and the system prompt:
```python
import anthropic
import json
from dataclasses import dataclass, field


@dataclass
class AgentState:
    goal: str
    messages: list = field(default_factory=list)
    steps_taken: int = 0
    total_tokens: int = 0
    total_cost: float = 0.0
    max_steps: int = 25
    max_cost: float = 5.00  # USD budget limit


class Agent:
    def __init__(self, system_prompt: str, tools: list, tool_handlers: dict):
        self.client = anthropic.Anthropic()
        self.system_prompt = system_prompt
        self.tools = tools
        self.tool_handlers = tool_handlers

    def run(self, goal: str) -> AgentState:
        state = AgentState(goal=goal)
        state.messages = [{"role": "user", "content": goal}]

        while state.steps_taken < state.max_steps:
            # Check budget before each step. If exceeded, ask for a final
            # summary and terminate after the next response.
            over_budget = state.total_cost >= state.max_cost
            if over_budget:
                state.messages.append({
                    "role": "user",
                    "content": "Budget limit reached. Summarize what you have accomplished so far."
                })

            response = self.client.messages.create(
                model="claude-sonnet-4-20250514",
                max_tokens=4096,
                system=self.system_prompt,
                tools=self.tools,
                messages=state.messages,
            )

            # Track usage
            state.total_tokens += response.usage.input_tokens + response.usage.output_tokens
            state.total_cost += self._calculate_cost(response.usage)
            state.steps_taken += 1

            # If Claude is done (no more tool calls), or the budget ran out, return
            if response.stop_reason == "end_turn" or over_budget:
                state.messages.append({"role": "assistant", "content": response.content})
                return state

            # Process tool calls
            if response.stop_reason == "tool_use":
                state.messages.append({"role": "assistant", "content": response.content})
                tool_results = self._execute_tools(response.content)
                state.messages.append({"role": "user", "content": tool_results})

        return state  # Max steps reached

    def _execute_tools(self, content) -> list:
        results = []
        for block in content:
            if block.type == "tool_use":
                result = self._safe_execute(block.name, block.input)
                results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": json.dumps(result),
                })
        return results

    def _safe_execute(self, tool_name: str, params: dict) -> dict:
        handler = self.tool_handlers.get(tool_name)
        if not handler:
            return {"error": f"Unknown tool: {tool_name}"}
        try:
            return handler(**params)
        except Exception as e:
            return {"error": f"{type(e).__name__}: {e}"}

    def _calculate_cost(self, usage) -> float:
        # Sonnet pricing per million tokens
        input_cost = (usage.input_tokens / 1_000_000) * 3.0
        output_cost = (usage.output_tokens / 1_000_000) * 15.0
        return input_cost + output_cost
```
The max_steps and max_cost limits are not optional — they are the difference between a controlled agent and a runaway process that burns through your API budget at 3 AM.
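Before wiring the loop to the real API, it helps to exercise its control flow against a stub. The following is a sketch, not production code: `FakeClient`, its canned responses, and `run_stubbed_loop` (a condensed stand-in for `Agent.run`) are all invented for illustration, mimicking only the `stop_reason` handshake the loop depends on:

```python
import json
from types import SimpleNamespace


class FakeClient:
    """Stands in for the API client in tests: the first call requests a
    tool, the second call ends the turn. All names here are illustrative."""

    def __init__(self):
        self.calls = 0

    def create(self, **kwargs):
        self.calls += 1
        usage = SimpleNamespace(input_tokens=100, output_tokens=50)
        if self.calls == 1:
            block = SimpleNamespace(type="tool_use", id="tu_1",
                                    name="lookup_customer",
                                    input={"email": "a@example.com"})
            return SimpleNamespace(stop_reason="tool_use", content=[block], usage=usage)
        block = SimpleNamespace(type="text", text="Done.")
        return SimpleNamespace(stop_reason="end_turn", content=[block], usage=usage)


def run_stubbed_loop(client, handlers, max_steps=5):
    """A condensed version of the agent loop, run against the fake client."""
    messages = [{"role": "user", "content": "look up a@example.com"}]
    for _ in range(max_steps):
        response = client.create(messages=messages)
        if response.stop_reason == "end_turn":
            return messages, "completed"
        for block in response.content:
            if block.type == "tool_use":
                result = handlers[block.name](**block.input)
                messages.append({"role": "user", "content": [{
                    "type": "tool_result", "tool_use_id": block.id,
                    "content": json.dumps(result)}]})
    return messages, "max_steps"


handlers = {"lookup_customer": lambda email: {"email": email, "plan": "pro"}}
client = FakeClient()
messages, status = run_stubbed_loop(client, handlers)
print(status)  # completed, after one tool round-trip
```

A harness like this lets you test termination conditions (end_turn, max_steps, budget) deterministically, without spending tokens.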
Tool definitions: designing for reliability
The quality of your tools determines the quality of your agent. Each tool must have a precise description, a strict input schema, and predictable error behavior. Here is an example set of tools for a customer support agent:
```python
SUPPORT_TOOLS = [
    {
        "name": "search_knowledge_base",
        "description": "Search the internal knowledge base for articles relevant to a customer issue. Returns the top 5 matching articles with titles and content. Use this FIRST before attempting to answer any technical question.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "Natural language search query describing the customer's issue"
                },
                "category": {
                    "type": "string",
                    "enum": ["billing", "technical", "account", "product"],
                    "description": "Category to narrow the search"
                }
            },
            "required": ["query"]
        }
    },
    {
        "name": "lookup_customer",
        "description": "Look up a customer's account details by email. Returns subscription plan, account status, recent tickets, and billing history. Use this to understand the customer's context.",
        "input_schema": {
            "type": "object",
            "properties": {
                "email": {
                    "type": "string",
                    "description": "Customer email address"
                }
            },
            "required": ["email"]
        }
    },
    {
        "name": "create_ticket",
        "description": "Create a support ticket for issues that require human follow-up. Use this when the issue cannot be resolved automatically or requires elevated permissions.",
        "input_schema": {
            "type": "object",
            "properties": {
                "customer_email": {"type": "string"},
                "subject": {"type": "string", "description": "Brief summary of the issue"},
                "body": {"type": "string", "description": "Detailed description including steps taken"},
                "priority": {
                    "type": "string",
                    "enum": ["low", "medium", "high", "urgent"]
                },
                "category": {
                    "type": "string",
                    "enum": ["billing", "technical", "account", "product"]
                }
            },
            "required": ["customer_email", "subject", "body", "priority", "category"]
        }
    },
    {
        "name": "escalate_to_human",
        "description": "Immediately transfer the conversation to a human agent. Use this when: the customer explicitly requests a human, the issue involves refunds over $100, or you are not confident in your resolution.",
        "input_schema": {
            "type": "object",
            "properties": {
                "reason": {"type": "string", "description": "Why the escalation is needed"},
                "summary": {"type": "string", "description": "Summary of the conversation so far"},
                "suggested_team": {
                    "type": "string",
                    "enum": ["billing", "engineering", "account_management"]
                }
            },
            "required": ["reason", "summary"]
        }
    }
]
```
Two design principles we have learned the hard way: first, always include an escalation tool. An agent that cannot escalate will hallucinate solutions when it is stuck. Second, write tool descriptions as directives — "Use this FIRST before..." gives the model a clear decision framework.
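A cheap guardrail on top of these schemas is validating tool inputs before dispatch, so a malformed call comes back to the model as a correctable error instead of crashing a handler. A minimal sketch, hand-rolled rather than using a full JSON Schema library, covering only the `required` and `enum` keywords used in the definitions above:

```python
def validate_tool_input(schema: dict, params: dict) -> list[str]:
    """Check required fields and enum values against a tool's input_schema.
    Covers only the subset of JSON Schema used in the tool definitions."""
    errors = []
    props = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in params:
            errors.append(f"Missing required field: {name}")
    for name, value in params.items():
        spec = props.get(name)
        if spec is None:
            errors.append(f"Unexpected field: {name}")
        elif "enum" in spec and value not in spec["enum"]:
            errors.append(f"Invalid value for {name}: {value!r} (expected one of {spec['enum']})")
    return errors


schema = {
    "type": "object",
    "properties": {
        "query": {"type": "string"},
        "category": {"type": "string", "enum": ["billing", "technical", "account", "product"]},
    },
    "required": ["query"],
}

print(validate_tool_input(schema, {"query": "refund status", "category": "billing"}))  # []
print(validate_tool_input(schema, {"category": "sales"}))  # two errors: missing query, invalid category
```

Returning the error list as a tool result gives the model a chance to repair the call on the next iteration, which in our experience it usually does.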
Patterns that work in production
Human-in-the-loop. No agent should operate without an escalation mechanism. We define confidence thresholds: if the agent is not sure about its decision, it pauses execution and notifies a human. This is critical in processes with financial or legal impact.
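The confidence threshold can be as simple as a gate in front of each consequential action. A sketch of the idea (the action names and threshold values here are illustrative, not our production configuration):

```python
# Per-action thresholds: the higher the stakes, the more confidence required.
CONFIDENCE_THRESHOLDS = {
    "send_reply": 0.80,       # low stakes: respond automatically
    "apply_credit": 0.95,     # financial impact: require high confidence
    "close_account": 1.01,    # never automatic: always goes to a human
}


def gate_action(action: str, confidence: float) -> str:
    """Return 'execute' or 'escalate' based on the action's threshold.
    Unknown actions escalate by default (fail closed)."""
    threshold = CONFIDENCE_THRESHOLDS.get(action, 1.01)
    return "execute" if confidence >= threshold else "escalate"


print(gate_action("send_reply", 0.90))     # execute
print(gate_action("apply_credit", 0.90))   # escalate
print(gate_action("close_account", 0.99))  # escalate
```

Note that the default for unknown actions is to escalate: failing closed is what keeps a misconfigured tool from silently executing high-stakes operations.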
Error recovery. Agents in production encounter errors constantly: failing APIs, unexpected data, timeouts. We design each tool with automatic retries, fallbacks, and circuit breakers. The agent must be able to diagnose the error and try an alternative route.
Here is our error recovery wrapper that wraps every tool handler:
```python
import time
import random
import logging
from functools import wraps

logger = logging.getLogger(__name__)


def resilient_tool(max_retries=3, fallback=None):
    """Decorator that adds retry logic and circuit breaking to tool handlers.

    Timeouts are enforced by the handlers themselves; this wrapper handles
    retries with exponential backoff and fails fast while the circuit is open.
    """
    def decorator(func):
        _failure_count = {"value": 0}
        _circuit_open_until = {"value": 0.0}

        @wraps(func)
        def wrapper(**kwargs):
            # Circuit breaker: if too many recent failures, fail fast
            if time.time() < _circuit_open_until["value"]:
                if fallback:
                    logger.warning(f"Circuit open for {func.__name__}, using fallback")
                    return fallback(**kwargs)
                return {"error": "Service temporarily unavailable", "retry_after": 60}

            for attempt in range(max_retries):
                try:
                    result = func(**kwargs)
                    _failure_count["value"] = 0  # Reset on success
                    return result
                except TimeoutError:
                    if attempt < max_retries - 1:
                        # Exponential backoff with jitter
                        wait = (2 ** attempt) + random.uniform(0, 1)
                        time.sleep(wait)
                    else:
                        _failure_count["value"] += 1
                        if _failure_count["value"] >= 5:
                            _circuit_open_until["value"] = time.time() + 60
                        return {"error": f"Timeout after {max_retries} retries"}
                except Exception as e:
                    logger.error(f"Tool {func.__name__} failed: {e}")
                    _failure_count["value"] += 1
                    if _failure_count["value"] >= 5:
                        _circuit_open_until["value"] = time.time() + 60
                    return {"error": str(e)}
        return wrapper
    return decorator


@resilient_tool(max_retries=3)
def lookup_customer(email: str) -> dict:
    """Look up customer from CRM."""
    response = crm_client.get(f"/customers?email={email}", timeout=10)
    response.raise_for_status()
    return response.json()
```
The circuit breaker pattern is essential. Without it, a downstream service outage causes your agent to spend all its budget retrying a tool that will never succeed. With the circuit breaker, after 5 consecutive failures the tool fails fast for 60 seconds, giving the agent a clear error signal to work with.
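The breaker's behavior is easy to verify in isolation. Here is a minimal sketch of the same state machine with the decorator plumbing stripped away (the class and its parameters are illustrative, not a drop-in replacement for the wrapper above):

```python
import time


class CircuitBreaker:
    """Opens after `threshold` consecutive failures, then fails fast
    for `cooldown` seconds without calling the wrapped function."""

    def __init__(self, threshold: int = 5, cooldown: float = 60.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.open_until = 0.0

    def call(self, func):
        if time.time() < self.open_until:
            return {"error": "circuit open"}  # fail fast, no call made
        try:
            result = func()
            self.failures = 0  # any success closes the circuit
            return result
        except Exception as e:
            self.failures += 1
            if self.failures >= self.threshold:
                self.open_until = time.time() + self.cooldown
            return {"error": str(e)}


def flaky():
    raise TimeoutError("upstream down")


breaker = CircuitBreaker(threshold=5, cooldown=60)
for _ in range(5):
    breaker.call(flaky)        # five consecutive failures trip the breaker
print(breaker.call(flaky))     # {'error': 'circuit open'}: fails fast
```

The key property to test is the last line: once open, the breaker returns immediately without invoking the failing function at all.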
Observability. Every agent action is logged with timestamps, token costs, duration, and result. Without full observability, debugging an agent in production is impossible. We use structured traces that allow reconstructing each agent decision step by step.
Monitoring and observability setup
You cannot operate an agent you cannot observe. We log every step in a structured format that allows us to reconstruct the full decision chain:
```python
import time
import json
import logging
from contextlib import contextmanager
from dataclasses import dataclass, asdict

logger = logging.getLogger("agent.trace")


@dataclass
class AgentTrace:
    agent_id: str
    session_id: str
    step: int
    action: str  # "llm_call", "tool_call", "tool_result", "error", "escalation"
    tool_name: str | None = None
    input_summary: str | None = None
    output_summary: str | None = None
    input_tokens: int = 0
    output_tokens: int = 0
    cost_usd: float = 0.0
    duration_ms: int = 0
    success: bool = True
    error: str | None = None


class AgentTracer:
    def __init__(self, agent_id: str, session_id: str):
        self.agent_id = agent_id
        self.session_id = session_id
        self.step = 0
        self.traces: list[AgentTrace] = []

    @contextmanager
    def trace_step(self, action: str, tool_name: str | None = None):
        self.step += 1
        trace = AgentTrace(
            agent_id=self.agent_id,
            session_id=self.session_id,
            step=self.step,
            action=action,
            tool_name=tool_name,
        )
        start = time.monotonic()
        try:
            yield trace
            trace.success = True
        except Exception as e:
            trace.success = False
            trace.error = str(e)
            raise
        finally:
            trace.duration_ms = int((time.monotonic() - start) * 1000)
            self.traces.append(trace)
            # Emit as structured JSON log
            logger.info(json.dumps(asdict(trace)))

    def summary(self) -> dict:
        return {
            "total_steps": self.step,
            "total_tokens": sum(t.input_tokens + t.output_tokens for t in self.traces),
            "total_cost": sum(t.cost_usd for t in self.traces),
            "total_duration_ms": sum(t.duration_ms for t in self.traces),
            "errors": [t.error for t in self.traces if not t.success],
            "tools_used": [t.tool_name for t in self.traces if t.tool_name],
        }
```
We pipe these JSON logs into our observability stack (Datadog or similar) and build dashboards that show: agent success rate, average steps per task, cost per task, most-used tools, and error frequency by tool. The dashboard has been the single most valuable debugging tool — when an agent starts behaving unexpectedly, the traces tell you exactly where the reasoning went wrong.
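As a sketch of what those dashboard queries compute, here is a small roll-up over the JSONL lines emitted by the tracer (the aggregation function and sample lines are illustrative; the field names match the AgentTrace dataclass above):

```python
import json


def aggregate_traces(jsonl_lines: list[str]) -> dict:
    """Roll per-step trace logs up into session-level dashboard metrics."""
    traces = [json.loads(line) for line in jsonl_lines]
    errors = [t for t in traces if not t["success"]]
    return {
        "steps": len(traces),
        "cost_usd": round(sum(t["cost_usd"] for t in traces), 4),
        "error_rate": len(errors) / len(traces) if traces else 0.0,
        "tools_used": sorted({t["tool_name"] for t in traces if t["tool_name"]}),
    }


# Three sample trace lines, as the tracer would emit them:
lines = [
    '{"success": true, "cost_usd": 0.012, "tool_name": null}',
    '{"success": true, "cost_usd": 0.003, "tool_name": "lookup_customer"}',
    '{"success": false, "cost_usd": 0.0, "tool_name": "create_ticket"}',
]
print(aggregate_traces(lines))
```

The same roll-up, grouped by day or by agent type, is what feeds the success-rate and cost-per-task panels.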
Cost control: budget limits and token tracking
Agents can be expensive because they make multiple LLM calls per task. Without cost controls, a single pathological input can trigger dozens of iterations. Here is how we enforce budgets:
```python
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass
class BudgetConfig:
    max_cost_per_session: float = 2.00   # USD
    max_cost_per_step: float = 0.50      # USD
    max_tokens_per_session: int = 100_000
    max_steps: int = 25
    alert_threshold: float = 0.75        # Alert at 75% of budget


class BudgetManager:
    def __init__(self, config: BudgetConfig):
        self.config = config
        self.total_cost = 0.0
        self.total_tokens = 0
        self.steps = 0

    def check_budget(self, estimated_input_tokens: int = 0) -> dict:
        """Check if we can afford another step. Returns status and remaining budget."""
        if self.steps >= self.config.max_steps:
            return {"allowed": False, "reason": "max_steps_reached"}
        if self.total_cost >= self.config.max_cost_per_session:
            return {"allowed": False, "reason": "cost_limit_reached"}
        if self.total_tokens + estimated_input_tokens > self.config.max_tokens_per_session:
            return {"allowed": False, "reason": "token_limit_reached"}

        remaining = self.config.max_cost_per_session - self.total_cost
        if remaining / self.config.max_cost_per_session < (1 - self.config.alert_threshold):
            logger.warning(f"Budget alert: only ${remaining:.2f} remaining")

        return {
            "allowed": True,
            "remaining_cost": remaining,
            "remaining_steps": self.config.max_steps - self.steps,
        }

    def record_usage(self, input_tokens: int, output_tokens: int, model: str):
        self.steps += 1
        self.total_tokens += input_tokens + output_tokens
        self.total_cost += self._price(input_tokens, output_tokens, model)

    def _price(self, input_tokens: int, output_tokens: int, model: str) -> float:
        # (input_rate, output_rate) in USD per million tokens
        pricing = {
            "claude-haiku-4-20250514": (0.80, 4.0),
            "claude-sonnet-4-20250514": (3.0, 15.0),
            "claude-opus-4-20250514": (15.0, 75.0),
        }
        input_rate, output_rate = pricing.get(model, (3.0, 15.0))
        return (input_tokens / 1_000_000 * input_rate) + (output_tokens / 1_000_000 * output_rate)
```
We set different budget configs for different agent types. A simple classification agent gets $0.50 per session. A complex research agent that needs to search multiple sources gets $5.00. These limits catch runaway agents early and give you predictable unit economics.
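The unit economics follow directly from the pricing table. For instance, at Sonnet rates ($3/M input, $15/M output), a 10-step session averaging 8,000 input and 500 output tokens per step works out to:

```python
# Back-of-envelope session cost at Sonnet rates. The per-step token
# counts below are illustrative averages, not measured values.
input_rate, output_rate = 3.0, 15.0  # USD per million tokens
steps, input_tokens, output_tokens = 10, 8_000, 500

cost = steps * (input_tokens / 1_000_000 * input_rate
                + output_tokens / 1_000_000 * output_rate)
print(f"${cost:.3f} per session")  # $0.315
```

In practice input tokens grow with each step as the conversation context accumulates, so real sessions skew more expensive than this flat estimate, which is why the default session budget sits at $2.00 rather than $0.50.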
What we learned the hard way
The biggest mistake is assuming the agent will always make the right decision. In production, edge cases are the norm. A support agent that classifies tickets will work perfectly 95% of the time — but that remaining 5% can generate incorrect responses to important clients.
The solution is not more prompt engineering. It is designing the system so that failures are detectable, reversible, and easy to escalate; limiting the blast radius of each agent action; and continuously measuring decision quality against an evaluation set.
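Measuring decision quality against an evaluation set can start very small: a list of labeled cases and an accuracy check on every prompt change. A sketch of the shape (the case format, threshold, and the `fake_decide` stand-in for a real agent call are all illustrative):

```python
# Labeled cases: input the agent receives, decision we expect it to make.
EVAL_SET = [
    {"input": "I was charged twice this month", "expected": "billing"},
    {"input": "The export button does nothing", "expected": "technical"},
    {"input": "How do I change my email?", "expected": "account"},
]


def evaluate(decide, eval_set, min_accuracy=0.9):
    """Run a decision function over labeled cases and flag regressions."""
    failures = [c for c in eval_set if decide(c["input"]) != c["expected"]]
    accuracy = 1 - len(failures) / len(eval_set)
    return {"accuracy": accuracy, "passed": accuracy >= min_accuracy, "failures": failures}


# A trivial stand-in for the real agent call, keyed on obvious phrases:
def fake_decide(text: str) -> str:
    if "charged" in text:
        return "billing"
    if "button" in text:
        return "technical"
    return "account"


print(evaluate(fake_decide, EVAL_SET))
```

Swap `fake_decide` for a function that actually calls the agent, grow the set with every production incident, and run it in CI: a dropped accuracy number is a far earlier signal than an angry customer.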
Three more lessons from running agents in production for a year:
Test with adversarial inputs. Users will send your agent things you never imagined — empty messages, messages in the wrong language, screenshots when it expects text, or deliberately misleading instructions. Build a test suite of 50+ adversarial cases and run it on every prompt change.
Keep the system prompt under 2,000 tokens. Longer prompts give the model more information, but they also increase latency and cost per step. The agent loop multiplies this cost by 5-25x. We have found that concise, well-structured system prompts outperform verbose ones.
Log every tool call, not just errors. When something goes wrong on step 15 of an agent run, you need the full trace to understand how it got there. Logging only errors gives you the symptom. Logging every step gives you the diagnosis.