Building my first AI agent — and the one way it actually broke

Phase 3 of my four-weekend AI engineering curriculum. Phase 1 stuffed every help article into a single prompt. Phase 2 replaced that with retrieval. Phase 3 makes the model an agent: it doesn’t just answer, it can take action (look up a real order, open a support ticket, escalate to a human). The model decides which tool to call; my code executes the call and feeds the result back.

This is the conceptually heaviest phase. It’s also where I expected the most failures. Here’s what actually happened.

What changed in my head

Phase 2’s flow was one-shot:

user message → retrieve → answer

Phase 3 becomes a loop:

user message → model decides → calls a tool → result returned
           ↑                                          │
           └────────── (loop until model is done) ────┘

This is called ReAct: Reason, Act, Observe, repeat. Most product agents you’ve used (Intercom’s Fin, Linear’s agent, Cursor’s edit mode) appear to be some version of this loop with different tools and a fancier UI.

The shift in mental model matters. In a RAG pipeline you control retrieval; in an agent the model controls the flow, including whether to retrieve at all. Your job moves from writing prompts to writing tool descriptions, because the description is the main thing the model has to go on when deciding which tool to call.

The four tools

The curriculum’s suggested toolkit was good, so I stuck with it:

Tool	What it does	Backed by
`search_articles`	Search the help knowledge base	Phase 2 retrieval, now invoked on demand
`lookup_order_status`	Get details for a specific order ID	A mock `data/orders.json`
`create_ticket`	Open a support ticket	Appends to `data/tickets.json`
`escalate_to_human`	Hand off to a real agent	Returns immediately

Notice search_articles is the most interesting one. In Phase 2 I retrieved on every request. Now the model decides when retrieval is useful. For “I want to talk to a human,” retrieving help articles is wasted work. For “how do I cancel?”, it’s exactly right. The agent figures out which.
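To make “writing tool descriptions” concrete, here is a minimal TypeScript sketch of how the four tools might be declared. Only the tool names come from the table above; the ToolDefinition shape, the inputSchema field names, the parameter names, and the description wording are assumptions made for illustration, not the project’s actual code.

```typescript
// Hypothetical shape, loosely modeled on the JSON Schema style that
// tool-calling APIs generally accept. Not the project's actual code.
interface ToolDefinition {
  name: string;
  description: string; // the text the model reads when picking a tool
  inputSchema: {
    type: "object";
    properties: Record<string, { type: string; description: string }>;
    required: string[];
  };
}

const tools: ToolDefinition[] = [
  {
    name: "search_articles",
    description:
      "Search the help knowledge base. Use for how-to and policy questions. " +
      "DO NOT use for greetings or requests to speak to a person.",
    inputSchema: {
      type: "object",
      properties: { query: { type: "string", description: "Search terms" } },
      required: ["query"],
    },
  },
  {
    name: "lookup_order_status",
    description: "Get details for one order. Requires a full ID such as ORD-1003.",
    inputSchema: {
      type: "object",
      properties: { order_id: { type: "string", description: "Order ID with the ORD- prefix" } },
      required: ["order_id"],
    },
  },
  {
    name: "create_ticket",
    description: "Open a support ticket for follow-up by a human agent.",
    inputSchema: {
      type: "object",
      properties: {
        description: { type: "string", description: "What the customer needs" },
        priority: { type: "string", description: "normal or high" },
      },
      required: ["description", "priority"],
    },
  },
  {
    name: "escalate_to_human",
    description: "Hand the conversation to a human. Use when the user asks for a person.",
    inputSchema: {
      type: "object",
      properties: { reason: { type: "string", description: "Why the handoff is needed" } },
      required: ["reason"],
    },
  },
];

// Look up a definition by name, as an executor would before running a call.
function findTool(name: string): ToolDefinition | undefined {
  return tools.find((t) => t.name === name);
}
```

The negative rule on search_articles is the kind of wording that lets the model skip retrieval for “I want to talk to a human.”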

Switching the LLM

I’d been on Ollama with llama3.1:8b for Phase 2. For Phase 3, I flipped to Claude Haiku 4.5 by changing one environment variable:

CHAT_PROVIDER=anthropic

The pluggable provider pattern from Phase 2 paid off in a way I didn’t fully appreciate at the time. The route code doesn’t change. The agent loop doesn’t change. Only the provider adapter does the translation. This is exactly why you build the abstraction even when you don’t strictly need it yet.
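For readers who skipped Phase 2, a pluggable provider can be as small as the sketch below. The ChatProvider interface, the stub adapters, the selectProvider helper, and the “ollama” default are all hypothetical names and choices made for illustration; the real adapters would translate to each vendor’s request format instead of returning canned strings.

```typescript
// One message shape shared by the route and the agent loop.
interface ChatMessage {
  role: "user" | "assistant";
  content: string;
}

// Every adapter implements the same interface (hypothetical name).
interface ChatProvider {
  name: string;
  complete(messages: ChatMessage[]): Promise<string>;
}

// Stub adapters: real ones would call Ollama or Anthropic here.
const ollamaProvider: ChatProvider = {
  name: "ollama",
  complete: async (messages) => `[ollama stub] ${messages.length} message(s)`,
};

const anthropicProvider: ChatProvider = {
  name: "anthropic",
  complete: async (messages) => `[anthropic stub] ${messages.length} message(s)`,
};

const providers = new Map<string, ChatProvider>([
  ["ollama", ollamaProvider],
  ["anthropic", anthropicProvider],
]);

// Pick the adapter from CHAT_PROVIDER. Defaulting to "ollama" is an
// assumption; an unknown value fails loudly instead of silently falling back.
function selectProvider(env: Record<string, string | undefined>): ChatProvider {
  const key = env["CHAT_PROVIDER"] ?? "ollama";
  const provider = providers.get(key);
  if (provider === undefined) {
    throw new Error(`Unknown CHAT_PROVIDER: ${key}`);
  }
  return provider;
}

// In the app this would receive the real environment; here it is a literal.
const activeProvider = selectProvider({ CHAT_PROVIDER: "anthropic" });
```

Because callers only ever see ChatProvider, neither the route nor the loop needs to know which vendor is behind it.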

Why Anthropic for Phase 3, not Ollama: tool-calling reliability matters more than chat tone here. Llama 3.1 8B can technically tool-call, but it’s loose with arguments and sometimes refuses to call obvious tools. Claude Haiku is tuned for exactly this kind of tool use. Cost for the entire phase was under $2.

The agent loop — actually fewer lines than I expected

The loop itself is straightforward:

1. Call the model with the user’s messages + the tool definitions.
2. Stream text deltas to the client.
3. When the response includes tool_use blocks, execute each one.
4. Append the assistant’s turn (with tool_use) and a user message containing tool_result blocks to the conversation.
5. Loop. Cap iterations at 6 to prevent runaway.
6. Exit when stop_reason is end_turn.

The whole thing is ~120 lines. The hard part isn’t the loop — it’s getting the tool descriptions right so the model picks the right one consistently. More on that in a moment.
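The steps above can be sketched in far fewer than 120 lines once streaming is left out. This is a simplified, non-streaming version under stated assumptions: the block names (tool_use, tool_result, stop_reason, end_turn) come from the post, while runAgent, CallModel, ExecuteTool, makeScriptedModel, and the capped flag are hypothetical names, and the model is replaced by a scripted stand-in so the sketch runs without an API key.

```typescript
// Block and message shapes mirror the names used in the post;
// everything else here is a hypothetical stand-in.
type TextBlock = { type: "text"; text: string };
type ToolUseBlock = { type: "tool_use"; id: string; name: string; input: Record<string, unknown> };
type ToolResultBlock = { type: "tool_result"; tool_use_id: string; content: string };
type Message = { role: "user" | "assistant"; content: Array<TextBlock | ToolUseBlock | ToolResultBlock> };
type ModelResponse = { content: Array<TextBlock | ToolUseBlock>; stop_reason: "end_turn" | "tool_use" };

type CallModel = (messages: Message[]) => Promise<ModelResponse>;
type ExecuteTool = (name: string, input: Record<string, unknown>) => Promise<string>;

const MAX_ITERATIONS = 6; // the runaway cap from the post

async function runAgent(userText: string, callModel: CallModel, executeTool: ExecuteTool) {
  const messages: Message[] = [{ role: "user", content: [{ type: "text", text: userText }] }];
  let text = "";

  for (let i = 0; i < MAX_ITERATIONS; i++) {
    const response = await callModel(messages);

    // A streaming version would forward text deltas to the client here.
    for (const block of response.content) {
      if (block.type === "text") text += block.text;
    }

    // Record the assistant turn, including any tool_use blocks.
    messages.push({ role: "assistant", content: response.content });
    if (response.stop_reason === "end_turn") return { text, messages, capped: false };

    // Execute each requested tool and send the results back as one user message.
    const results: ToolResultBlock[] = [];
    for (const block of response.content) {
      if (block.type === "tool_use") {
        const output = await executeTool(block.name, block.input);
        results.push({ type: "tool_result", tool_use_id: block.id, content: output });
      }
    }
    messages.push({ role: "user", content: results });
  }
  return { text, messages, capped: true }; // gave up after MAX_ITERATIONS
}

// Scripted stand-in for the model: replays canned responses in order.
function makeScriptedModel(script: ModelResponse[]): CallModel {
  let turn = 0;
  return async () => script[Math.min(turn++, script.length - 1)];
}

const demo = runAgent(
  "Where is ORD-1003?",
  makeScriptedModel([
    { content: [{ type: "tool_use", id: "t1", name: "lookup_order_status", input: { order_id: "ORD-1003" } }], stop_reason: "tool_use" },
    { content: [{ type: "text", text: "Your order is processing." }], stop_reason: "end_turn" },
  ]),
  async (name) => `${name}: ok`,
);
```

Passing the model and the tool executor in as functions is a design choice for the sketch: it keeps the loop testable without network calls, and it is the same seam the provider abstraction plugs into.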

Streaming tool events to the UI

A detail that made the demo actually feel like an agent: streaming tool calls as they happen, not as a summary at the end. Using the Vercel AI SDK’s data parts (data-tool-call, data-tool-result), the UI shows a small badge mid-stream:

"Let me check that order for you."

   🔍 lookup_order_status
      order_id: "ORD-1003"
      ✓ Found — Marcus Webb, processing

"Your order was placed yesterday and is currently..."

This turned out to matter more than I expected. When the agent chains two tools (look up an order, then create a ticket), seeing both badges appear in order is what makes the demo feel intelligent. Without that, the user stares at “thinking…” for 10 seconds.
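The badge behavior comes down to writing one event before a tool runs and one after. The sketch below shows that ordering with plain TypeScript and no SDK import. The part type names (data-tool-call, data-tool-result) are from the post, but the payload fields and the runWithEvents helper are assumptions, and the write callback stands in for whatever stream writer the Vercel AI SDK provides in the real route. Tool execution is synchronous here only to keep the example short.

```typescript
// Hypothetical payloads for the two data parts named in the post.
type ToolOutcome = { ok: boolean; summary: string };
type ToolCallPart = {
  type: "data-tool-call";
  id: string;
  data: { name: string; input: Record<string, unknown> };
};
type ToolResultPart = { type: "data-tool-result"; id: string; data: ToolOutcome };
type ToolPart = ToolCallPart | ToolResultPart;

// Wrap a tool execution so the UI hears about it before and after it runs.
// The shared id lets the client attach the result to the right badge.
function runWithEvents(
  write: (part: ToolPart) => void,
  id: string,
  name: string,
  input: Record<string, unknown>,
  execute: (input: Record<string, unknown>) => ToolOutcome,
): ToolOutcome {
  write({ type: "data-tool-call", id, data: { name, input } });
  const outcome = execute(input);
  write({ type: "data-tool-result", id, data: outcome });
  return outcome;
}

// Usage: collect the parts in an array instead of a real stream.
const parts: ToolPart[] = [];
runWithEvents(
  (part) => { parts.push(part); },
  "call-1",
  "lookup_order_status",
  { order_id: "ORD-1003" },
  () => ({ ok: true, summary: "Found: Marcus Webb, processing" }),
);
```

After the call, parts holds a data-tool-call entry followed by a data-tool-result entry with the same id, which is the order the badges render in.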

The one failure mode I actually found

I tested the canonical multi-step flow: “I want a refund for ORD-1004 — one mug arrived chipped.”

What I expected: agent calls lookup_order_status first to verify the order, then create_ticket with the verified details and priority: high.

What the agent actually did: skipped the lookup entirely. Just called create_ticket with a description quoting what the user said. Priority was “normal”.

This isn’t a bug. It’s a shortcut the model took because the task was under-specified: it looked at the user’s message, decided “this is enough text for a ticket,” and took the shortest path. Reasonable, but it leaves real problems:

A typo’d order ID (ORD-1404) would slip through unverified.
The ticket lacks the actual order details a human agent would need.
An already-refunded order (like ORD-1005) could get a duplicate ticket.

The fix lived in one place: the create_ticket tool description. I added:

IMPORTANT: If the issue concerns a specific order, you MUST call lookup_order_status FIRST before creating the ticket. The ticket description should include the verified order details (customer name, items, status, dates).

After that change, the same query produced two badges in order, and the ticket description included “Sofia Lopez,” “Ceramic Coffee Mug Set,” the actual shipping address, the total — all pulled from the lookup. Priority high.

The lesson: when you want the agent to follow a sequence, put the rule in the tool description that fires at the decision point, not in the system prompt far away from the action. Production agent designers call this “keep the rule where the rule fires.”
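As a sketch of where that rule sits, and of one way to check that the fix keeps holding, the snippet below appends the rule to a create_ticket description and adds a small ordering check over a recorded list of tool calls. The rule text is quoted from above; the base description, the constant names, and the lookupPrecedesTicket helper are hypothetical, and the check assumes every test conversation concerns a specific order.

```typescript
// The rule quoted in the post, kept next to the tool it governs.
const SEQUENCE_RULE =
  "IMPORTANT: If the issue concerns a specific order, you MUST call " +
  "lookup_order_status FIRST before creating the ticket. The ticket " +
  "description should include the verified order details " +
  "(customer name, items, status, dates).";

// Hypothetical base wording; only the appended rule comes from the post.
const createTicketDescription =
  "Open a support ticket for follow-up by a human agent. " + SEQUENCE_RULE;

// Regression check over the tool names recorded during one conversation:
// returns false if create_ticket was called before any lookup_order_status.
// Assumes the conversation is about a specific order.
function lookupPrecedesTicket(toolCalls: string[]): boolean {
  let seenLookup = false;
  for (const name of toolCalls) {
    if (name === "lookup_order_status") seenLookup = true;
    if (name === "create_ticket" && !seenLookup) return false;
  }
  return true;
}

// The behavior before the fix fails the check; the behavior after it passes.
const beforeFix = lookupPrecedesTicket(["create_ticket"]);
const afterFix = lookupPrecedesTicket(["lookup_order_status", "create_ticket"]);
```

Here beforeFix is false and afterFix is true, matching the single badge versus the two badges in order.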

The surprise: most edge cases just worked

I lined up four tests expecting at least two more failures:

“Where’s order 1234567?” (malformed ID without the ORD- prefix)
“hello” (would it tool-call unnecessarily?)
“I don’t know what’s wrong, just help me” (vague request, would it escalate prematurely?)
“I want a refund for ORD-1005” (already refunded — would it create a duplicate ticket?)

All four passed. In particular:

For the malformed ID, Claude responded: “Is the order ID ORD-1234567, or do you have the full order number with the ‘ORD-’ prefix?” It refused to pass the bad input.
For “hello,” it just said hi back — no tool calls, no token waste.
For the vague request, it asked structured clarifying questions with a numbered list of categories.
For ORD-1005, it called lookup_order_status, read the refunded status, explained the existing refund (with date and amount), and proactively offered “if the refund hasn’t appeared, I can open a ticket to investigate” — anticipating the actual underlying concern.

That last response genuinely impressed me. The agent didn’t just avoid breaking. It used the data from the tool call to give a useful, anticipatory answer.

The honest takeaway

Most “my AI agent broke” posts overclaim. Mine almost did — I went in expecting drama, and the drama mostly didn’t materialize.

What I actually learned was less sexy and more useful:

Tool descriptions are the new prompts. The agent reads descriptions every turn when deciding what to do. Vague descriptions → vague behavior. Explicit MUSTs and negative rules (DO NOT use this when…) get followed.
Architecture beats prompting for safety. The Phase 2 distance gate, the Phase 3 max-iteration cap, returning typed ToolResult errors instead of throwing — these structural choices prevent whole categories of failure. They cost more upfront and less in debugging.
A small, capable model (Claude Haiku 4.5) with carefully worded tools handled vague queries, malformed inputs, greetings, and ‘already-resolved’ edge cases out of the box. I don’t know if I would have gotten the same result with llama3.1:8b on Ollama. I do know the pluggable provider abstraction made trying both trivial, and I’d reach for it again on day one of any future agent project.
The killer demo for any agent is multi-tool chaining. Showing “lookup_order_status → create_ticket” play out as two badges in real time is what convinces a viewer the agent is actually thinking, not just reading a script.
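The “typed ToolResult errors instead of throwing” point is easiest to see in code. Below is a minimal sketch, assuming a discriminated-union ToolResult, two illustrative error codes, an Order shape with three fields, and an ORD-plus-digits ID format. None of these details are confirmed by the project; the single sample order stands in for data/orders.json.

```typescript
// Hypothetical result type: failures are data the model can read, not exceptions.
type ToolError = { code: "INVALID_INPUT" | "NOT_FOUND"; message: string };
type ToolResult<T> = { ok: true; data: T } | { ok: false; error: ToolError };

// Assumed order shape; the real file likely has more fields.
interface Order {
  id: string;
  customer: string;
  status: string;
}

// Stand-in for data/orders.json.
const orders: Order[] = [{ id: "ORD-1003", customer: "Marcus Webb", status: "processing" }];

function lookupOrderStatus(orderId: string): ToolResult<Order> {
  // Structural guard: reject malformed IDs before touching the data.
  if (!/^ORD-\d+$/.test(orderId)) {
    return {
      ok: false,
      error: { code: "INVALID_INPUT", message: `Order IDs look like ORD-1234, got "${orderId}"` },
    };
  }
  const order = orders.find((o) => o.id === orderId);
  if (order === undefined) {
    return { ok: false, error: { code: "NOT_FOUND", message: `No order with ID ${orderId}` } };
  }
  return { ok: true, data: order };
}

// Either branch serializes into the tool_result content the model reads next.
function toToolResultContent(result: ToolResult<unknown>): string {
  return JSON.stringify(result);
}

const found = lookupOrderStatus("ORD-1003");
const typo = lookupOrderStatus("ORD-1404");
const malformed = lookupOrderStatus("1234567");
```

Because a failed lookup comes back as a normal tool result, the loop keeps running and the model can explain the problem or ask for a corrected ID instead of the request crashing.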

Next up: Phase 4 — evaluation. “Is this agent actually good?” turns from a vibe-check into a measurable question. Golden sets, LLM-as-judge, observability. The part where I find out how many of the wins above hold up under real test pressure.