The AI Stack Explained: From LLM to Agent Skill

A practical guide that walks you through the entire modern AI stack — how LLMs work, how agents are built, what every buzzword actually means, and how to customize agents for real-world tasks.

Part 1: What Is an LLM, Really?

An LLM is a giant autocomplete engine. It predicts the next word, over and over, until it forms a complete answer. That’s the whole trick.

The Engine Underneath

Everything starts with the Transformer architecture, proposed by Google in 2017 (“Attention Is All You Need”). It uses a mechanism called self-attention to process entire sentences in parallel instead of word-by-word — making it fast enough to train on massive datasets. Google invented it, but OpenAI turned it into a product: the launch of ChatGPT, powered by GPT-3.5, in late 2022 kicked off the revolution. Today every major model (GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro) is built on this same foundation.

Key distinction: Transformer is the architecture. LLM is the product built on top of it.

How Generation Works

An LLM is fundamentally a mathematical function. It receives numbers, does matrix multiplication, and outputs numbers. The “intelligence” emerges from patterns learned during training on billions of text samples.

The loop is simple:

  1. Your text gets split into tokens (the smallest units the model works with)
  2. Each token maps to a number (Token ID)
  3. The model predicts a probability distribution for the next token
  4. The highest-probability token is picked and appended
  5. Repeat until a stop token is generated

Every “intelligent” response is the result of this loop running thousands of times.
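The loop above can be sketched with a toy “model” whose probability tables are hard-coded by hand. A real LLM computes these distributions from billions of learned parameters; the vocabulary and probabilities here are invented purely for illustration:

```python
# Toy generation loop. The "model" is a lookup table mapping a token
# sequence to a probability distribution over the next token.
NEXT_TOKEN_PROBS = {
    ("<start>",): {"The": 0.6, "A": 0.4},
    ("<start>", "The"): {"sky": 0.7, "cat": 0.3},
    ("<start>", "The", "sky"): {"is": 0.9, "was": 0.1},
    ("<start>", "The", "sky", "is"): {"blue": 0.8, "grey": 0.2},
    ("<start>", "The", "sky", "is", "blue"): {"<stop>": 1.0},
}

def generate(max_steps=10):
    tokens = ["<start>"]
    for _ in range(max_steps):
        probs = NEXT_TOKEN_PROBS[tuple(tokens)]   # step 3: probability distribution
        best = max(probs, key=probs.get)          # step 4: pick highest-probability token
        if best == "<stop>":                      # step 5: stop token ends the loop
            break
        tokens.append(best)
    return " ".join(tokens[1:])

print(generate())  # -> The sky is blue
```

Swap the greedy `max` for weighted random sampling and you get the variation (the “temperature”) real models show between runs.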

Tokens Are Not Words

A common misconception. The tokenizer splits text using a learned algorithm (BPE), and the results can be surprising:

| Input | Token Count | Why |
|---|---|---|
| “hello” | 1 | Common word → single token |
| “helpful” | 2 | “help” + “ful” |
| “程序员” (programmer) | 2 | “程序” + “员” |
| A checkmark ✓ | 3 | Rare character → multiple tokens |

Rule of thumb: 1 token ≈ 0.75 English words. A million tokens ≈ 750,000 words — roughly the entire Harry Potter series. This matters because everything in AI is priced and limited by tokens: API costs, context limits, and (as we’ll see later) the MCP vs CLI debate.
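The rule of thumb can be turned into a tiny back-of-the-envelope estimator. Note this is only the 0.75 heuristic, not a real tokenizer — actual counts come from the model's own BPE tokenizer:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate via the '1 token ~ 0.75 English words' rule.
    A heuristic only; real tokenizers split by learned subword units."""
    words = len(text.split())
    return round(words / 0.75)

print(estimate_tokens("The quick brown fox jumps over the lazy dog"))  # 9 words -> 12
```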

The Model Has No Memory

Here’s something most people don’t realize: LLMs have no persistent memory. Every time you send a message, the application re-sends the entire conversation history. What feels like “remembering” is just re-reading.

Context is everything the model can see in a single request — your conversation history, your current question, hidden system instructions, available tool definitions, and its own output so far. All measured in tokens, bounded by the context window:

| Model (2026) | Context Window |
|---|---|
| GPT-5.4 | 1.05M tokens |
| Gemini 3.1 Pro | 1M tokens |
| Claude Opus 4.6 | 1M tokens |

A million tokens sounds huge, but stuffing everything in is wasteful and expensive. That’s why techniques like RAG (retrieve only what’s relevant) and Progressive Disclosure (load information layer by layer) exist.

So how does a chatbot “remember” what you said five messages ago? Before each new question, the application pastes the entire previous conversation into the context alongside your new message. The model reads the whole thing from scratch and responds as if it had been following along. It’s like handing someone a printed chat log every time you talk to them.

The problem: each new turn makes the context longer. Eventually, conversation history alone fills the context window. The solution is memory compression — use the LLM itself to summarize the conversation so far, preserving key points while cutting token count.
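Both ideas can be sketched in a few lines, assuming a hypothetical `call_llm` function standing in for a real chat-completion API (the token estimate and budget are illustrative):

```python
MAX_HISTORY_TOKENS = 3000  # illustrative budget, far below a real context window

def approx_tokens(messages):
    # crude word-based estimate (~4 tokens per 3 words)
    return sum(len(m["content"].split()) for m in messages) * 4 // 3

def chat_turn(history, user_message, call_llm):
    history.append({"role": "user", "content": user_message})
    # Memory compression: when history grows too long, replace it with an
    # LLM-written summary plus the newest message.
    if approx_tokens(history) > MAX_HISTORY_TOKENS:
        summary = call_llm([{"role": "user",
                             "content": "Summarize this conversation:\n" +
                                        "\n".join(m["content"] for m in history[:-1])}])
        history[:] = [{"role": "system", "content": f"Summary so far: {summary}"},
                      history[-1]]
    reply = call_llm(history)  # the model re-reads everything, every single turn
    history.append({"role": "assistant", "content": reply})
    return reply
```

Every turn sends the whole `history` list: that is the entirety of what “memory” means here.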

Judgment: What LLMs Actually Add to Computing

Here’s the deeper insight: the real contribution of LLMs isn’t text generation — it’s judgment. Traditional programs require precise, exhaustive specification. LLMs can make “good enough” decisions in ambiguous contexts, the same way humans do.

  • Traditional functions: deterministic — same input always produces same output
  • Judgment-enabled functions: probabilistic — handle ambiguity, produce contextually appropriate outputs

Consider determining which account an expense should be billed to. No traditional function could handle every edge case. A judgment-enabled one reads the account descriptions and the expense details, then makes a reasonable call.

LLM judgments aren’t as good as a conscientious human’s, but they’re often better than a distracted one. And the kinds of mistakes they make are different from human errors — which means you need to design for unusual failure modes.

Prompts and Context Engineering

A prompt is what you give the model to work with. There are two kinds:

  • User Prompt — what you type. “Write me a poem,” “What is 3+5?”
  • System Prompt — a hidden instruction set by the developer. The user never sees it, but it shapes every response.

Example system prompt:

“You are a patient math teacher. When students ask math questions, guide them step by step instead of giving direct answers.”

With this system prompt, asking “What is 3+5?” produces a guided lesson, not just “8.” This mechanism becomes critical when building agents — the system prompt is literally how you program an agent’s behavior.
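In code, the split is just two entries in a message list. The role-tagged shape below follows the convention most chat APIs share; exact field names vary by vendor:

```python
messages = [
    {"role": "system",   # hidden from the user, set by the developer
     "content": ("You are a patient math teacher. When students ask math "
                 "questions, guide them step by step instead of giving "
                 "direct answers.")},
    {"role": "user",     # what the user actually typed
     "content": "What is 3+5?"},
]

# Swapping only the system message reprograms the behavior entirely:
messages[0]["content"] = "You are a terse calculator. Answer with the number only."
```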

“Prompt engineering” is actually a misleading term — it suggests clever tricks to manipulate the model. In practice, the real skill is context engineering: assembling everything the LLM needs to make good judgments. System instructions, relevant data, examples, documentation, tool outputs. Your intellectual property will increasingly take the form of system instructions and contexts — documentation written for LLMs, not just humans.


Part 2: How Agents Work

An Agent = LLM + Tools + a loop that keeps going until the job is done.

LLMs Can Only Talk

We’ve established that an LLM is a text-prediction engine. It’s incredibly smart, but it has no hands — it can’t check the weather, read your files, run code, or interact with the real world. It can only output text.

Tools solve this. A tool is any external function the model can call — a weather API, a file reader, a database query, a shell command. Tools are the agent’s senses and limbs.

How Tool Calling Actually Works

Here’s the critical insight most people miss: the model never actually calls a tool. It can only output text. When it “calls” a tool, it outputs a structured request (tool name + parameters), and a separate program executes it.

The four roles in every tool call:

| Role | What It Does |
|---|---|
| User | Asks a question |
| LLM | Analyzes the question, decides which tool to call, generates parameters |
| Platform | The intermediary code that intercepts tool requests and executes them |
| Tool | Performs the action, returns results |

Example — “What’s the weather in Tokyo?”

  1. User asks the question
  2. LLM decides to call get_weather(city="Tokyo")
  3. Platform intercepts this, calls the actual weather API
  4. Tool returns {"temp": 22, "condition": "sunny"}
  5. Platform feeds the result back to the LLM
  6. LLM generates: “It’s 22°C and sunny in Tokyo today!”

The model’s two jobs: pick the right tool and summarize the result.

Function Calling is the format agreement that makes this work. It’s an API contract that forces the model to reply in structured, parseable JSON when it wants to invoke a tool:

{ "tool": "get_weather", "parameters": { "city": "Tokyo" } }

Without this, the model might say “Hey, can you check the weather in Tokyo for me?” — good luck parsing that reliably in code.
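Here is a minimal sketch of the platform side, with the weather tool stubbed out. The `TOOLS` registry and `handle_model_output` function are invented names for illustration:

```python
import json

# The "platform" role from the table above: the model's reply is just text;
# if it parses as a tool request, the platform (not the model) executes it.
TOOLS = {
    "get_weather": lambda city: {"temp": 22, "condition": "sunny"},  # stubbed API
}

def handle_model_output(text):
    try:
        request = json.loads(text)
    except json.JSONDecodeError:
        return ("final_answer", text)       # plain prose -> show it to the user
    tool = TOOLS[request["tool"]]
    result = tool(**request["parameters"])  # executed by platform code
    return ("tool_result", result)          # fed back into the model's context

print(handle_model_output('{ "tool": "get_weather", "parameters": { "city": "Tokyo" } }'))
```

The `try/except` is exactly why the JSON contract matters: prose falls through to the user, structured output gets dispatched.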

From Tool Calling to Autonomous Agent

A single tool call is useful but limited. An agent chains multiple tool calls together in a loop, autonomously deciding what to do next until the task is complete.

Example — “What’s the weather here? If it’s raining, find me a nearby umbrella store.”

  1. Think → I need the user’s location → call get_location() → got coordinates
  2. Think → I need the weather → call get_weather() → it’s raining
  3. Think → User wants an umbrella store → call search_stores("umbrella") → found one nearby
  4. Think → All done → generate final answer with the store details

No human intervention between steps. The agent figures out each next action on its own.

A Counterintuitive Insight: The Agent Is the Dumb Part

Here’s a reframe that clarifies things: an agent is composed of all the parts that don’t need the LLM. Fixed routing logic, tool execution, result passing — that’s the agent. The LLM handles the fuzzy decisions; the agent handles the deterministic plumbing.

The LLM is the brain. The agent is the delivery boy. The delivery boy doesn’t decide what to deliver — it just makes sure the package gets from A to B.

ReAct: The Most Common Pattern

ReAct (Reasoning + Acting), introduced in October 2022, is the backbone of most coding agents today.

User Task
    │
    ▼
Thought ──► Action ──► Observation
   ▲                        │
   └────────────────────────┘
   │
   ▼ (when done)
Final Answer

  1. Thought — analyze the situation, decide what to do
  2. Action — call a tool
  3. Observation — examine the result
  4. Repeat until done, then output a Final Answer

The key revelation: ReAct is not special model training. It’s driven entirely by the system prompt. The model follows it like a script. You can build a ReAct agent with any LLM — the magic is in the prompt, not the model.
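A sketch of what such a system prompt might look like — the wording here is invented for illustration; production ReAct prompts are longer and vendor-specific:

```
You run in a loop of Thought, Action, Observation.
Use Thought to reason about the task.
Use Action to call exactly one tool, for example:
  Action: get_weather[Tokyo]
After each Action you will receive an Observation with the result.
When you have enough information, reply with:
  Final Answer: <your answer>
Available tools: get_location, get_weather, search_stores
```

The agent code simply parses each reply: an `Action:` line gets executed and the result is appended as an `Observation:`; a `Final Answer:` line ends the loop.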

Plan-and-Execute: The Alternative

Used by Manus and sometimes Claude Code. Instead of thinking one step at a time, it plans everything upfront, then executes step by step with dynamic replanning. After each step, new information may change the plan, so a re-planning pass refines the remaining steps with specifics learned from execution.

In practice, these patterns often nest — a plan-and-execute agent uses a ReAct agent to execute each step.

Subagents: Context Isolation

What happens when intermediate reasoning from sub-tasks would fill the entire context window? You spawn a subagent — a child agent with its own isolated context. When the sub-task completes, only the compact final result returns to the parent. The parent stays lean — it only sees the results, not the journey. This is how products like Cursor handle large multi-file tasks without choking on context.
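A minimal sketch of the isolation, assuming a hypothetical `run_agent` function that stands in for a full agent loop:

```python
def spawn_subagent(task, run_agent):
    # Isolated context: the child starts with ONLY its task, no parent history.
    sub_context = [{"role": "user", "content": task}]
    final_answer = run_agent(sub_context)  # may balloon to thousands of tokens...
    return final_answer                    # ...but only this compact string returns

def parent_step(parent_context, task, run_agent):
    result = spawn_subagent(task, run_agent)
    # The parent's context grows by one line, not by the subagent's whole journey.
    parent_context.append({"role": "tool", "content": f"Sub-task result: {result}"})
    return parent_context
```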

Proof: An Agent in ~20 Lines of Code

The OpenClaw tutorial builds a fully working agent from scratch in about 20 lines of Python:

  1. LLM API call → single Q&A (a few lines)
  2. Add input() + while True → interactive chat
  3. Append conversation history → memory
  4. Detect command requests + execute → full agent

The agent code has zero intelligence — it blindly executes whatever the model requests. The agent’s power is bounded only by what commands are available and what the model knows about them. This is why skills make or break the agent. When the model doesn’t know a command, it fails or hallucinates. Adding a SKILL.md file with instructions fixes this immediately.
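A sketch in that spirit, with the model stubbed out behind a hypothetical `call_llm` and an invented `RUN:` convention for command requests (a real agent would use Function Calling rather than string matching):

```python
import subprocess

def agent_turn(history, user_input, call_llm):
    history.append({"role": "user", "content": user_input})
    reply = call_llm(history)
    if reply.startswith("RUN:"):  # step 4: detect a command request
        cmd = reply[len("RUN:"):].strip()
        # Zero intelligence here: blindly execute whatever the model asked for.
        # (Which is exactly why unrestricted execution is dangerous.)
        out = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        history.append({"role": "assistant", "content": reply})
        history.append({"role": "tool", "content": out.stdout})
        reply = call_llm(history)  # let the model summarize the result
    history.append({"role": "assistant", "content": reply})
    return reply
```

Wrap `agent_turn` in `while True: agent_turn(history, input("> "), call_llm)` and you have the tutorial's interactive agent.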


Part 3: The Buzzword Decoder

Every AI buzzword is either (a) adding information to context, or (b) reducing human-model interaction via a proxy. That’s it.

The Master Reference Card

| Term | What It Is | What It Is NOT | One-Line Analogy |
|---|---|---|---|
| LLM | A word-prediction engine | A database or search engine | A very sophisticated autocomplete |
| Judgment | Fuzzy, “good enough” decisions | Precise logic or guaranteed correctness | A human’s gut feeling, trained on the internet |
| Context | Everything the model sees in one request | Persistent memory | A printed chat log handed over each time |
| Memory | Pasting conversation history back into context | The model “remembering” anything | Re-reading the chat log from the start |
| RAG | Fetching relevant passages into context | A database query | An open-book exam |
| Tool | An external function the model can invoke | Something the model runs itself | Hands and eyes for the brain |
| Function Calling | The JSON format contract between Agent and LLM | A tool or a protocol | “Reply in this exact JSON format” |
| MCP | A universal protocol for Agent ↔ tool services | Function Calling | USB-C for AI tools |
| Agent | LLM + tools + an autonomous loop | The intelligent part (it’s the plumbing) | A delivery boy following the brain’s orders |
| Agent Skill | A SKILL.md doc that teaches an Agent procedures | A function, a tool, or an agent | A runbook / SOP document |
| Workflow | Low-code drag-and-drop step chaining | An Agent (paths are pre-drawn) | A flowchart you can execute |
| Subagent | A child Agent with isolated context | A function call or thread | A contractor who reports back only the result |

The Three Confusions Everyone Gets Wrong

1. Function Calling ≠ MCP

This is the most common mix-up. They’re different layers of the same workflow:

  • Function Calling (Agent ↔ LLM): The model outputs { "tool": "get_weather", "city": "Tokyo" }. This is just a format — parseable JSON so the agent code can read it.
  • MCP (Agent ↔ Tool Service): The agent routes that request to the right tool server. MCP handles discovery, execution, authentication, and results.

FC outputs the plan. MCP handles execution. Asking “Can MCP replace Function Calling?” is like asking if a restaurant menu can replace the kitchen’s recipe book.

Other key differences: FC formats are vendor-specific (OpenAI, Anthropic, Google each have their own); MCP is universal. FC tool lists are static (defined in the prompt upfront); MCP supports dynamic discovery at runtime. MCP goes beyond tools — it also serves resources (static data) and prompt templates.

2. Agent Skill ≠ Traditional Function

Skills look like functions (input → process → output), but they operate on fundamentally different logic:

| Aspect | Traditional Function | Agent Skill |
|---|---|---|
| Logic | Deterministic — same input, same output | Probabilistic — uses judgment for fuzzy inputs |
| Input | Structured, exact matching | Unstructured, fuzzy/semantic matching |
| Execution | Runs native code | Expands prompts/context — teaches the model |
| Testing | Unit tests with expected outputs | Dynamic validation — does the judgment meet a useful threshold? |

If an agent is a class, a skill is more like a runbook than a method. The agent reads the runbook and uses judgment to follow it — adapting to edge cases the skill author didn’t anticipate.

3. Agent With Skill vs Without Skill — Night and Day

| Aspect | Without Skill | With Skill |
|---|---|---|
| Behavior | Reactive, varies between runs (“drift”) | Specialized, repeatable workflows |
| Context usage | High — full instructions in every prompt | Low — progressive disclosure loads on demand |
| Knowledge | Limited to base model training data | Injects domain knowledge the LLM doesn’t have |

This is why some people’s agents seem brilliant and others’ seem useless — it depends entirely on what skills you’ve given them.

The Rigidity–Flexibility Spectrum

All approaches to multi-step tasks sit on a spectrum:

| Approach | Method | Rigidity | Flexibility |
|---|---|---|---|
| LangChain | Pure code, hardcoded pipeline | ★★★★ | ★ |
| Workflow | Low-code drag-and-drop GUI | ★★★★ | ★★ |
| Agent Skill | LLM-controlled flow with docs/scripts | ★★★ | ★★★ |
| Pure Agent | LLM decides everything on the fly | ★ | ★★★★ |

Agent Skill sits in the sweet spot — structured enough to be predictable, flexible enough that the LLM can handle edge cases with judgment.


Part 4: Teaching Your Agent — The Practical Guide

MCP connects your agent to data; Skills teach it what to do with that data; Scripts let it take action at zero token cost.

Three Mechanisms for Customization

| Mechanism | What It Does | Token Cost |
|---|---|---|
| MCP | Connects the agent to external data/services | High (tool metadata always in context) |
| Agent Skill | Teaches the agent step-by-step procedures | Moderate (loaded on demand) |
| Script | Executes code without the agent reading it | Zero |

MCP: The Universal Connector

MCP (Model Context Protocol) is a standard protocol for connecting tools to any AI model. Think of it as USB-C for AI tools — build a tool once, it works with ChatGPT, Claude, Gemini, and every other platform.

Before MCP, each platform had its own tool format. Same tool, three implementations. MCP unified this.

Strengths: Structured JSON parameters avoid quoting/escaping nightmares. Tools can only do what the designer allows — essential for enterprise environments. Beyond tool calls, MCP also supports resources (static data) and prompt templates.

Cost problem: Every MCP tool’s metadata is sent to the model in every request. A single MCP server like GitHub’s has 44 tools consuming ~14,268 tokens (~$0.30/query) — just for the tool descriptions, before any actual work.

CLI Tools: The Lightweight Alternative

Instead of registering dozens of MCP tools, give the agent one tool: Bash — and let it generate commands using its training knowledge of common CLI programs (git, grep, ffmpeg, gh, etc.).

Token savings: One Bash tool definition (~dozen lines) vs 14,268 tokens for GitHub MCP alone.

Pipeline power: CLI commands compose naturally:

exiftool ... | magick ... && scp ...

This filters photos, adds watermarks, and uploads — all in one command with no model round-trips. With MCP, each step requires a separate call back to the model.

The trade-off: Complex commands are error-prone, and unrestricted execution is dangerous in shared environments.

Agent Skill: The Instruction Manual

An Agent Skill is a Markdown file (SKILL.md) that teaches an agent what to do and how to do it for a specific task. It’s not a tool — it’s a set of instructions.

The key innovation is Progressive Disclosure — a three-layer loading system that keeps token cost low:

Layer 1: Metadata (ALWAYS loaded)
   │  Model scans skill names/descriptions
   │  Cost: minimal

Layer 2: Instructions (loaded ON DEMAND)
   │  Full SKILL.md content loaded only when matched
   │  Cost: moderate

Layer 3: Resources (loaded ON DEMAND within ON DEMAND)
   ├── Reference files → read into context (costs tokens)
   └── Script files → executed only (costs ZERO tokens)

Only the skill that matches your request gets loaded. If you have 50 skills but only one matches, only one skill’s instructions enter the context.
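A sketch of the loading logic, with naive keyword matching standing in for the model's own judgment — the skill names, descriptions, and file paths below are invented for illustration:

```python
SKILLS = {
    "meeting-summary": {
        "description": "summarize meeting transcripts into action items",
        "instructions_path": "skills/meeting-summary/SKILL.md",  # hypothetical path
    },
    "photo-pipeline": {
        "description": "watermark and upload photos",
        "instructions_path": "skills/photo-pipeline/SKILL.md",
    },
}

def build_context(task, read_file):
    # Layer 1: names + descriptions, ALWAYS present, tiny token cost.
    context = [f"{name}: {meta['description']}" for name, meta in SKILLS.items()]
    # Layer 2: full SKILL.md instructions, loaded only for a matching skill.
    # (Naive substring matching; a real agent lets the LLM decide what matches.)
    for name, meta in SKILLS.items():
        if any(word in task.lower() for word in meta["description"].split()):
            context.append(read_file(meta["instructions_path"]))
    return context
```

With 50 registered skills, Layer 1 stays 50 short lines; only the matched skill's full instructions ever enter the context.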

Reference vs Script: The Key Distinction Inside Skills

Reference (Read): A document the agent reads into context when certain conditions are met. A meeting summary skill might have a “Corporate Financial Handbook” reference — loaded only when the meeting involves budgets, zero tokens otherwise.

Script (Execute): Code the agent runs but never reads. The source code never enters context. A skill with upload.py (even 10,000 lines) runs it and only gets back the execution result — zero token cost for the script itself.

Caveat: If your SKILL.md instructions for running a script are unclear, the agent might try to read the script to understand it — defeating the purpose. Always write clear execution instructions.

The Decision Framework

"I need my agent to access external data or services"
   → MCP (structured, secure, cross-platform)

"I need my agent to follow a specific workflow or procedure"
   → Agent Skill (step-by-step rules in SKILL.md)

"I need my agent to run code without bloating context"
   → Script (inside an Agent Skill, zero token cost)

"I need lightweight tool access and I'm working locally"
   → CLI via Bash (token-efficient, composable)

"I need a visual, no-code multi-step pipeline"
   → Workflow (drag-and-drop, but rigid paths)

They’re Complementary, Not Competing

Anthropic’s official position: “MCP connects Claude to data. Skills teach Claude what to do with that data.”

The best setups combine them:

  • MCP supplies raw data (sales records, logistics status, database queries)
  • Agent Skill defines processing rules (summary format, judgment criteria, escalation logic)
  • Scripts handle heavy actions (data upload, report generation, deployment)
  • CLI handles lightweight local tasks (git operations, file processing)

The Complete Stack

┌─────────────────────────────────┐
│   Agent Skill (SKILL.md)        │  ← Teaches specific workflows
│   ├── Reference (read)          │  ← On-demand knowledge
│   └── Script (execute)          │  ← Zero-cost actions
├─────────────────────────────────┤
│   MCP / CLI / Workflow          │  ← External capabilities & orchestration
├─────────────────────────────────┤
│   Subagent                      │  ← Context isolation for sub-tasks
├─────────────────────────────────┤
│   Agent (ReAct / Plan-Exec)     │  ← Autonomous loop
├─────────────────────────────────┤
│   Tools + Function Calling      │  ← Senses + structured format contract
├─────────────────────────────────┤
│   Memory                        │  ← Conversation history injection
├─────────────────────────────────┤
│   Prompt / Context Engineering  │  ← Assemble everything the LLM needs
├─────────────────────────────────┤
│   Context Window                │  ← Bounded capacity
├─────────────────────────────────┤
│   Token / Tokenizer             │  ← The units everything is measured in
├─────────────────────────────────┤
│   LLM + Judgment                │  ← Prediction engine + fuzzy decisions
├─────────────────────────────────┤
│   Transformer                   │  ← The architecture underneath
└─────────────────────────────────┘

The Unifying Truth

After walking through the entire stack, here’s the punchline: every technique in the AI ecosystem reduces to one of two things:

  1. Automatically adding information to context — RAG, Memory, References, MCP resources, Skill instructions
  2. Reducing human-model interaction via a proxy program — Agent loops, tool calling, Scripts, Subagents

That’s the entire field. Once you see this, the buzzwords lose their mystique. The future isn’t about inventing new categories — it’s about building pre-configured agents where users don’t need to understand any of these terms to get value.

The stack diagram above isn’t just a summary. It’s a map. Next time someone throws a buzzword at you, find where it sits on the stack, and you’ll know exactly what it does, what it doesn’t do, and what it actually costs.