Building a customer support chatbot in a weekend

Phase 1 of my AI engineering curriculum. A naive chatbot with every help article stuffed into the system prompt — and what it taught me about streaming, system messages, and where the simple approach starts to break.

I want to teach myself AI engineering by actually building things, not by watching YouTube. So I started a four-weekend curriculum: a customer support agent for a fictional e-commerce company called OrderFlow. By the end I should have a deployed AI agent that answers questions, escalates when it can’t, and is measurably good (or measurably bad).

This post is about Phase 1: the dumbest possible version. No retrieval, no tools, no eval. Just chat in, streamed response out, with the entire help center pasted into the system prompt.

I built it dumb on purpose. Here’s why.

The setup

The constraints I gave myself:

The point was to feel the rawness of LLMs before adding complexity. “Stuff everything in the prompt” is wrong at scale, but I wouldn’t learn why it’s wrong unless I did it first.

The architecture

Three files do all the work.

Browser (page.tsx)  ─── POST /api/chat ──▶  Next.js route.ts  ─── messages.stream() ──▶  Claude
        ▲                                          │
        └────── streamed text ─────────────────────┘

                            reads on every request │

                              content/help-articles/*.md

On every request the server reads every .md file from content/help-articles/, wraps each one in <article filename="..."> tags, and pastes the whole thing into the system prompt with rules: answer only from the articles, escalate when out of scope, don’t invent policies. Then it calls Claude’s streaming API and forwards each text delta to the browser.
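The prompt-building step described above could look roughly like this. It is a minimal sketch, not the actual route code: the helper names `loadArticles` and `buildSystemPrompt`, the exact wording of the rules, and the alphabetical sort are all assumptions made for illustration.

```typescript
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

interface Article {
  filename: string;
  body: string;
}

// Hypothetical helper: read every .md file in a directory.
// Sorting is an assumption; it keeps the prompt identical between requests.
function loadArticles(dir: string): Article[] {
  return readdirSync(dir)
    .filter((name) => name.endsWith(".md"))
    .sort()
    .map((filename) => ({
      filename,
      body: readFileSync(join(dir, filename), "utf8"),
    }));
}

// Hypothetical helper: rules first, then each article wrapped in
// <article filename="..."> tags. The rule wording is illustrative only.
function buildSystemPrompt(articles: Article[]): string {
  const rules = [
    "You are a support assistant for OrderFlow.",
    "Answer only from the articles below.",
    "If a question is out of scope, escalate to a human.",
    "Do not invent policies.",
  ].join("\n");

  const wrapped = articles
    .map((a) => `<article filename="${a.filename}">\n${a.body.trim()}\n</article>`)
    .join("\n\n");

  return `${rules}\n\n${wrapped}`;
}
```

In the route this would be called as something like `buildSystemPrompt(loadArticles("content/help-articles"))` on each request, which is exactly why the prompt grows with the help center.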

The client is useChat from @ai-sdk/react, which handles the streaming state and message history. About 150 lines of UI code total.

What I learned

System messages vs user messages

I had heard of this distinction but never internalized it. System messages tell the model who it is and what knowledge it has — the user never sees them. User and assistant messages are the visible conversation.

Where you put the articles changes how the model treats them. Pasted in as a fake first user message, they read as something the user said. In the system prompt they act as standing instructions and reference material: present on every turn, stable across the whole conversation, and never part of the visible chat.
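As a rough illustration of that split, here is the shape of a request in the style of the Anthropic Messages API, where `system` is a top-level field separate from the `messages` array. The function name `buildChatRequest`, the placeholder model ID, and the `max_tokens` value are assumptions for the sketch, not what the project uses.

```typescript
type Role = "user" | "assistant";

interface ChatMessage {
  role: Role;
  content: string;
}

interface ChatRequest {
  model: string;
  max_tokens: number;
  system: string; // articles and rules live here
  messages: ChatMessage[]; // only the visible conversation
}

// Hypothetical helper: keeps the knowledge base out of the message history.
function buildChatRequest(systemPrompt: string, history: ChatMessage[]): ChatRequest {
  return {
    model: "MODEL_ID_HERE", // placeholder, substitute a real model ID
    max_tokens: 1024, // arbitrary value for the sketch
    system: systemPrompt,
    messages: history,
  };
}

const request = buildChatRequest(
  'Answer only from the articles below.\n\n<article filename="refunds.md">...</article>',
  [{ role: "user", content: "How do I refund an order?" }],
);
console.log(request.messages.length); // the articles are not counted as a message
```

The history array stays exactly what the user and assistant said, so it can be rendered in the UI without filtering anything out.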

Streaming has to be end-to-end

If any layer in the pipeline buffers the full response before forwarding, streaming dies. The Anthropic SDK streams; my route forwards each chunk without waiting; the AI SDK pipes them through to the browser; the browser appends each delta to the visible message. The whole pipeline has to cooperate or the UX falls apart.

Tokens appearing word-by-word feels alive. Tokens appearing all at once after eight seconds feels broken, even when the total latency is identical.
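The "no layer may buffer" rule can be sketched with the standard Web Streams API. This is not the AI SDK's implementation; `toTextStream`, `fakeDeltas`, and `readChunks` are hypothetical names, and the fake generator stands in for the model's text deltas.

```typescript
// Hypothetical helper: forward each delta as soon as it arrives.
// One delta in, one chunk out; nothing is accumulated in between.
function toTextStream(deltas: AsyncIterable<string>): ReadableStream<Uint8Array> {
  const encoder = new TextEncoder();
  const iterator = deltas[Symbol.asyncIterator]();
  return new ReadableStream<Uint8Array>({
    async pull(controller) {
      const { value, done } = await iterator.next();
      if (done) {
        controller.close();
      } else {
        controller.enqueue(encoder.encode(value));
      }
    },
    async cancel() {
      // Stop the upstream source if the browser disconnects.
      await iterator.return?.();
    },
  });
}

// Stand-in for the model: yields three deltas.
async function* fakeDeltas(): AsyncGenerator<string> {
  for (const piece of ["Hello", ", ", "world"]) {
    yield piece;
  }
}

// Stand-in for the browser: collects chunks in the order they arrive.
async function readChunks(stream: ReadableStream<Uint8Array>): Promise<string[]> {
  const reader = stream.getReader();
  const decoder = new TextDecoder();
  const chunks: string[] = [];
  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    chunks.push(decoder.decode(value, { stream: true }));
  }
  return chunks;
}
```

A buffering layer would be the same code with the deltas joined into one string before the first `enqueue`. The bytes delivered are identical, but the reader sees one chunk at the end instead of several along the way.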

The cost grows linearly with the knowledge base

Every request ships the entire system prompt, including all 19 articles, to Claude. At ~25K tokens and ~$1 per million input tokens for Haiku, that’s about $0.025 per question, before counting the conversation history and the reply. Fine at 19 articles. Painful at 200. Impossible at 2000.

This is the part you can’t truly understand from a blog post until you’ve watched the meter tick. Phase 2 fixes it; the whole point of Phase 1 was to feel the problem first.
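The arithmetic behind those numbers, as a small sketch. It assumes input tokens only, a flat ~$1 per million tokens, and that prompt size scales linearly with article count from the 19-article baseline. The function names are made up for the example.

```typescript
// Input-token cost of one request, in dollars.
function inputCostPerRequest(promptTokens: number, dollarsPerMillionTokens: number): number {
  return (promptTokens / 1_000_000) * dollarsPerMillionTokens;
}

// Assumption: every article is about as long as the current average,
// so prompt size grows in proportion to the number of articles.
function projectedPromptTokens(
  articleCount: number,
  baselineTokens = 25_000,
  baselineArticles = 19,
): number {
  return (baselineTokens / baselineArticles) * articleCount;
}

for (const count of [19, 200, 2000]) {
  const tokens = projectedPromptTokens(count);
  const cost = inputCostPerRequest(tokens, 1);
  console.log(`${count} articles: ~${Math.round(tokens)} tokens, ~$${cost.toFixed(3)} per question`);
}
```

Under those assumptions, 200 articles would be roughly 263K prompt tokens (about $0.26 per question) and 2,000 articles roughly 2.6M tokens (about $2.63 per question), and that is for every single message in every conversation.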

The “lost in the middle” problem

LLMs tend to attend more to the start and end of a long context than to the middle. If the article that answers a question is #12 of 19, the model can under-weight it. I noticed this most when asking questions whose answers were buried in the middle of the prompt.

This isn’t theoretical — it’s documented and reproducible. Another reason “stuff everything in” doesn’t scale.

What worked

The streaming is genuinely satisfying. Asking “how do I refund an order?” and watching the answer materialize one word at a time still feels like real magic.

Claude refuses out-of-scope questions politely. Ask it about the weather and it says “I can’t help with that, let me connect you with a human.” That’s the system prompt working.

The whole thing is around 200 lines of code, including the UI. useChat hides a lot of complexity I haven’t had to think about yet.

What I’m taking into Phase 2

The motivation for RAG is now concrete, not abstract. I know exactly why “all articles in prompt” doesn’t scale because I built it that way, watched it work at 19 articles, and saw where the cost and the buried answers come from. I have a baseline to compare against.

The plan for next weekend: replace “stuff articles in prompt” with proper retrieval. Articles get embedded into a vector database; at query time, only the most relevant chunks go into the prompt. Same chatbot, smaller prompt, better quality, lower cost.
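The retrieval step in that plan boils down to "score every chunk against the query, keep the best few". Here is a toy in-memory sketch of that idea using cosine similarity. It is not the Phase 2 implementation: there is no vector database here, the `Chunk` shape and function names are hypothetical, and the two-dimensional embeddings are made up (real ones come from an embedding model and have hundreds of dimensions or more).

```typescript
interface Chunk {
  id: string;
  text: string;
  embedding: number[];
}

// Cosine similarity: 1 means same direction, 0 means unrelated.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Embedding dimensions must match");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k chunks most similar to the query, best first.
function topK(queryEmbedding: number[], chunks: Chunk[], k: number): Chunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(queryEmbedding, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((scored) => scored.chunk);
}

// Toy data: invented vectors, purely for illustration.
const toyChunks: Chunk[] = [
  { id: "refunds", text: "How refunds work", embedding: [1, 0] },
  { id: "shipping", text: "Shipping times", embedding: [0, 1] },
  { id: "returns", text: "Returning an item", embedding: [0.9, 0.1] },
];

const selected = topK([1, 0.05], toyChunks, 2);
console.log(selected.map((c) => c.id));
```

Only the selected chunks would then go into the system prompt, so prompt size depends on `k` rather than on how many articles exist.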

Stack gotchas for anyone following along

If you’re following older tutorials and the imports don’t match — that’s why.

Next up: actual RAG.