The AI Stack Explained: From LLM to Agent Skill

A practical guide that walks you through the entire modern AI stack — how LLMs work, how agents are built, what every buzzword actually means, and how to customize agents for real-world tasks.

Part 1: What Is an LLM, Really?

An LLM is a giant autocomplete engine. It predicts the next word, over and over, until it forms a complete answer. That’s the whole trick.

The Engine Underneath

Everything starts with the Transformer architecture, proposed by Google in 2017 (“Attention Is All You Need”). It uses a mechanism called self-attention to process entire sentences in parallel instead of word-by-word — making it fast enough to train on massive datasets. Google invented it, but OpenAI turned it into a product: the launch of ChatGPT, powered by GPT-3.5, in late 2022 kicked off the revolution. Today every major model (GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro) is built on this same foundation.

Key distinction: Transformer is the architecture. LLM is the product built on top of it.

How Generation Works

An LLM is fundamentally a mathematical function. It receives numbers, does matrix multiplication, and outputs numbers. The “intelligence” emerges from patterns learned during training on billions of text samples.

The loop is simple:

  1. Your text gets split into tokens (the smallest units the model works with)
  2. Each token maps to a number (Token ID)
  3. The model predicts a probability distribution for the next token
  4. The highest-probability token is picked and appended
  5. Repeat until a stop token is generated

Every “intelligent” response is the result of this loop running thousands of times.
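The loop above can be sketched with a toy “model” whose probability tables are hard-coded by hand. A real LLM computes these distributions from billions of learned parameters; the vocabulary and probabilities here are invented purely for illustration:

```python
# Toy generation loop. The "model" is a lookup table mapping a token
# sequence to a probability distribution over the next token.
NEXT_TOKEN_PROBS = {
    ("<start>",): {"The": 0.6, "A": 0.4},
    ("<start>", "The"): {"sky": 0.7, "cat": 0.3},
    ("<start>", "The", "sky"): {"is": 0.9, "was": 0.1},
    ("<start>", "The", "sky", "is"): {"blue": 0.8, "grey": 0.2},
    ("<start>", "The", "sky", "is", "blue"): {"<stop>": 1.0},
}

def generate(max_steps=10):
    tokens = ["<start>"]
    for _ in range(max_steps):
        probs = NEXT_TOKEN_PROBS[tuple(tokens)]   # step 3: probability distribution
        best = max(probs, key=probs.get)          # step 4: pick highest-probability token
        if best == "<stop>":                      # step 5: stop token ends the loop
            break
        tokens.append(best)
    return " ".join(tokens[1:])

print(generate())  # -> The sky is blue
```

Swap the greedy `max` for weighted random sampling and you get the variation (the “temperature”) real models show between runs.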

Tokens Are Not Words

A common misconception. The tokenizer splits text using a learned algorithm (BPE), and the results can be surprising:

| Input | Token Count | Why |
|---|---|---|
| “hello” | 1 | Common word → single token |
| “helpful” | 2 | “help” + “ful” |
| “程序员” (programmer) | 2 | “程序” + “员” |
| A checkmark ✓ | 3 | Rare character → multiple tokens |

Rule of thumb: 1 token ≈ 0.75 English words. A million tokens ≈ 750,000 words — roughly the entire Harry Potter series. This matters because everything in AI is priced and limited by tokens: API costs, context limits, and (as we’ll see later) the MCP vs CLI debate.
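The rule of thumb can be turned into a tiny back-of-the-envelope estimator. Note this is only the 0.75 heuristic, not a real tokenizer — actual counts come from the model's own BPE tokenizer:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate via the '1 token ~ 0.75 English words' rule.
    A heuristic only; real tokenizers split by learned subword units."""
    words = len(text.split())
    return round(words / 0.75)

print(estimate_tokens("The quick brown fox jumps over the lazy dog"))  # 9 words -> 12
```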

The Model Has No Memory

Here’s something most people don’t realize: LLMs have no persistent memory. Every time you send a message, the application re-sends the entire conversation history. What feels like “remembering” is just re-reading.

Context is everything the model can see in a single request — your conversation history, your current question, hidden system instructions, available tool definitions, and its own output so far. All measured in tokens, bounded by the context window:

| Model (2026) | Context Window |
|---|---|
| GPT-5.4 | 1.05M tokens |
| Gemini 3.1 Pro | 1M tokens |
| Claude Opus 4.6 | 1M tokens |

A million tokens sounds huge, but stuffing everything in is wasteful and expensive. That’s why techniques like RAG (retrieve only what’s relevant) and Progressive Disclosure (load information layer by layer) exist.

So how does a chatbot “remember” what you said five messages ago? Before each new question, the application pastes the entire previous conversation into the context alongside your new message. The model reads the whole thing from scratch and responds as if it had been following along. It’s like handing someone a printed chat log every time you talk to them.

The problem: each new turn makes the context longer. Eventually, conversation history alone fills the context window. The solution is memory compression — use the LLM itself to summarize the conversation so far, preserving key points while cutting token count.
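Both ideas can be sketched in a few lines, assuming a hypothetical `call_llm` function standing in for a real chat-completion API (the token estimate and budget are illustrative):

```python
MAX_HISTORY_TOKENS = 3000  # illustrative budget, far below a real context window

def approx_tokens(messages):
    # crude word-based estimate (~4 tokens per 3 words)
    return sum(len(m["content"].split()) for m in messages) * 4 // 3

def chat_turn(history, user_message, call_llm):
    history.append({"role": "user", "content": user_message})
    # Memory compression: when history grows too long, replace it with an
    # LLM-written summary plus the newest message.
    if approx_tokens(history) > MAX_HISTORY_TOKENS:
        summary = call_llm([{"role": "user",
                             "content": "Summarize this conversation:\n" +
                                        "\n".join(m["content"] for m in history[:-1])}])
        history[:] = [{"role": "system", "content": f"Summary so far: {summary}"},
                      history[-1]]
    reply = call_llm(history)  # the model re-reads everything, every single turn
    history.append({"role": "assistant", "content": reply})
    return reply
```

Every turn sends the whole `history` list: that is the entirety of what “memory” means here.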

Judgment: What LLMs Actually Add to Computing

Here’s the deeper insight: the real contribution of LLMs isn’t text generation — it’s judgment. Traditional programs require precise, exhaustive specification. LLMs can make “good enough” decisions in ambiguous contexts, the same way humans do.

  • Traditional functions: deterministic — same input always produces same output
  • Judgment-enabled functions: probabilistic — handle ambiguity, produce contextually appropriate outputs

Consider determining which account an expense should be billed to. No traditional function could handle every edge case. A judgment-enabled one reads the account descriptions and the expense details, then makes a reasonable call.

LLM judgments aren’t as good as a conscientious human’s, but they’re often better than a distracted one. And the kinds of mistakes they make are different from human errors — which means you need to design for unusual failure modes.

Prompts and Context Engineering

A prompt is what you give the model to work with. There are two kinds:

  • User Prompt — what you type. “Write me a poem,” “What is 3+5?”
  • System Prompt — a hidden instruction set by the developer. The user never sees it, but it shapes every response.

Example system prompt:

“You are a patient math teacher. When students ask math questions, guide them step by step instead of giving direct answers.”

With this system prompt, asking “What is 3+5?” produces a guided lesson, not just “8.” This mechanism becomes critical when building agents — the system prompt is literally how you program an agent’s behavior.
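In code, the split is just two entries in a message list. The role-tagged shape below follows the convention most chat APIs share; exact field names vary by vendor:

```python
messages = [
    {"role": "system",   # hidden from the user, set by the developer
     "content": ("You are a patient math teacher. When students ask math "
                 "questions, guide them step by step instead of giving "
                 "direct answers.")},
    {"role": "user",     # what the user actually typed
     "content": "What is 3+5?"},
]

# Swapping only the system message reprograms the behavior entirely:
messages[0]["content"] = "You are a terse calculator. Answer with the number only."
```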

“Prompt engineering” is actually a misleading term — it suggests clever tricks to manipulate the model. In practice, the real skill is context engineering: assembling everything the LLM needs to make good judgments. System instructions, relevant data, examples, documentation, tool outputs. Your intellectual property will increasingly take the form of system instructions and contexts — documentation written for LLMs, not just humans.


Part 2: How Agents Work

An Agent = LLM + Tools + a loop that keeps going until the job is done.

LLMs Can Only Talk

We’ve established that an LLM is a text-prediction engine. It’s incredibly smart, but it has no hands — it can’t check the weather, read your files, run code, or interact with the real world. It can only output text.

Tools solve this. A tool is any external function the model can call — a weather API, a file reader, a database query, a shell command. Tools are the agent’s senses and limbs.

How Tool Calling Actually Works

Here’s the critical insight most people miss: the model never actually calls a tool. It can only output text. When it “calls” a tool, it outputs a structured request (tool name + parameters), and a separate program executes it.

The four roles in every tool call:

| Role | What It Does |
|---|---|
| User | Asks a question |
| LLM | Analyzes the question, decides which tool to call, generates parameters |
| Platform | The intermediary code that intercepts tool requests and executes them |
| Tool | Performs the action, returns results |

Example — “What’s the weather in Tokyo?”

  1. User asks the question
  2. LLM decides to call get_weather(city="Tokyo")
  3. Platform intercepts this, calls the actual weather API
  4. Tool returns {"temp": 22, "condition": "sunny"}
  5. Platform feeds the result back to the LLM
  6. LLM generates: “It’s 22°C and sunny in Tokyo today!”

The model’s two jobs: pick the right tool and summarize the result.

Function Calling is the format agreement that makes this work. It’s an API contract that forces the model to reply in structured, parseable JSON when it wants to invoke a tool:

{ "tool": "get_weather", "parameters": { "city": "Tokyo" } }

Without this, the model might say “Hey, can you check the weather in Tokyo for me?” — good luck parsing that reliably in code.
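Here is a minimal sketch of the platform side, with the weather tool stubbed out. The `TOOLS` registry and `handle_model_output` function are invented names for illustration:

```python
import json

# The "platform" role from the table above: the model's reply is just text;
# if it parses as a tool request, the platform (not the model) executes it.
TOOLS = {
    "get_weather": lambda city: {"temp": 22, "condition": "sunny"},  # stubbed API
}

def handle_model_output(text):
    try:
        request = json.loads(text)
    except json.JSONDecodeError:
        return ("final_answer", text)       # plain prose -> show it to the user
    tool = TOOLS[request["tool"]]
    result = tool(**request["parameters"])  # executed by platform code
    return ("tool_result", result)          # fed back into the model's context

print(handle_model_output('{ "tool": "get_weather", "parameters": { "city": "Tokyo" } }'))
```

The `try/except` is exactly why the JSON contract matters: prose falls through to the user, structured output gets dispatched.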

From Tool Calling to Autonomous Agent

A single tool call is useful but limited. An agent chains multiple tool calls together in a loop, autonomously deciding what to do next until the task is complete.

Example — “What’s the weather here? If it’s raining, find me a nearby umbrella store.”

  1. Think → I need the user’s location → call get_location() → got coordinates
  2. Think → I need the weather → call get_weather() → it’s raining
  3. Think → User wants an umbrella store → call search_stores("umbrella") → found one nearby
  4. Think → All done → generate final answer with the store details

No human intervention between steps. The agent figures out each next action on its own.

A Counterintuitive Insight: The Agent Is the Dumb Part

Here’s a reframe that clarifies things: an agent is composed of all the parts that don’t need the LLM. Fixed routing logic, tool execution, result passing — that’s the agent. The LLM handles the fuzzy decisions; the agent handles the deterministic plumbing.

The LLM is the brain. The agent is the delivery boy. The delivery boy doesn’t decide what to deliver — it just makes sure the package gets from A to B.

ReAct: The Most Common Pattern

ReAct (Reasoning + Acting), introduced in October 2022, is the backbone of most coding agents today.

User Task
    │
    ▼
Thought ──► Action ──► Observation
   ▲                        │
   └────────────────────────┘
   │
   ▼ (when done)
Final Answer

  1. Thought — analyze the situation, decide what to do
  2. Action — call a tool
  3. Observation — examine the result
  4. Repeat until done, then output a Final Answer

The key revelation: ReAct is not special model training. It’s driven entirely by the system prompt. The model follows it like a script. You can build a ReAct agent with any LLM — the magic is in the prompt, not the model.
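A sketch of what such a system prompt might look like — the wording here is invented for illustration; production ReAct prompts are longer and vendor-specific:

```
You run in a loop of Thought, Action, Observation.
Use Thought to reason about the task.
Use Action to call exactly one tool, for example:
  Action: get_weather[Tokyo]
After each Action you will receive an Observation with the result.
When you have enough information, reply with:
  Final Answer: <your answer>
Available tools: get_location, get_weather, search_stores
```

The agent code simply parses each reply: an `Action:` line gets executed and the result is appended as an `Observation:`; a `Final Answer:` line ends the loop.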

Plan-and-Execute: The Alternative

Used by Manus and sometimes Claude Code. Instead of thinking one step at a time, it plans everything upfront, then executes step by step with dynamic replanning. After each step, new information may change the plan, so a re-planning pass refines the remaining steps with specifics learned from execution.

In practice, these patterns often nest — a plan-and-execute agent uses a ReAct agent to execute each step.

Subagents: Context Isolation

What happens when intermediate reasoning from sub-tasks would fill the entire context window? You spawn a subagent — a child agent with its own isolated context. When the sub-task completes, only the compact final result returns to the parent. The parent stays lean — it only sees the results, not the journey. This is how products like Cursor handle large multi-file tasks without choking on context.
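A minimal sketch of the isolation, assuming a hypothetical `run_agent` function that stands in for a full agent loop:

```python
def spawn_subagent(task, run_agent):
    # Isolated context: the child starts with ONLY its task, no parent history.
    sub_context = [{"role": "user", "content": task}]
    final_answer = run_agent(sub_context)  # may balloon to thousands of tokens...
    return final_answer                    # ...but only this compact string returns

def parent_step(parent_context, task, run_agent):
    result = spawn_subagent(task, run_agent)
    # The parent's context grows by one line, not by the subagent's whole journey.
    parent_context.append({"role": "tool", "content": f"Sub-task result: {result}"})
    return parent_context
```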

Proof: An Agent in ~20 Lines of Code

The OpenClaw tutorial builds a fully working agent from scratch in about 20 lines of Python:

  1. LLM API call → single Q&A (a few lines)
  2. Add input() + while True → interactive chat
  3. Append conversation history → memory
  4. Detect command requests + execute → full agent

The agent code has zero intelligence — it blindly executes whatever the model requests. The agent’s power is bounded only by what commands are available and what the model knows about them. This is why skills make or break the agent. When the model doesn’t know a command, it fails or hallucinates. Adding a SKILL.md file with instructions fixes this immediately.
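A sketch in that spirit, with the model stubbed out behind a hypothetical `call_llm` and an invented `RUN:` convention for command requests (a real agent would use Function Calling rather than string matching):

```python
import subprocess

def agent_turn(history, user_input, call_llm):
    history.append({"role": "user", "content": user_input})
    reply = call_llm(history)
    if reply.startswith("RUN:"):  # step 4: detect a command request
        cmd = reply[len("RUN:"):].strip()
        # Zero intelligence here: blindly execute whatever the model asked for.
        # (Which is exactly why unrestricted execution is dangerous.)
        out = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        history.append({"role": "assistant", "content": reply})
        history.append({"role": "tool", "content": out.stdout})
        reply = call_llm(history)  # let the model summarize the result
    history.append({"role": "assistant", "content": reply})
    return reply
```

Wrap `agent_turn` in `while True: agent_turn(history, input("> "), call_llm)` and you have the tutorial's interactive agent.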


Part 3: The Buzzword Decoder

Every AI buzzword is either (a) adding information to context, or (b) reducing human-model interaction via a proxy. That’s it.

The Master Reference Card

| Term | What It Is | What It Is NOT | One-Line Analogy |
|---|---|---|---|
| LLM | A word-prediction engine | A database or search engine | A very sophisticated autocomplete |
| Judgment | Fuzzy, “good enough” decisions | Precise logic or guaranteed correctness | A human’s gut feeling, trained on the internet |
| Context | Everything the model sees in one request | Persistent memory | A printed chat log handed over each time |
| Memory | Pasting conversation history back into context | The model “remembering” anything | Re-reading the chat log from the start |
| RAG | Fetching relevant passages into context | A database query | An open-book exam |
| Tool | An external function the model can invoke | Something the model runs itself | Hands and eyes for the brain |
| Function Calling | The JSON format contract between Agent and LLM | A tool or a protocol | “Reply in this exact JSON format” |
| MCP | A universal protocol for Agent ↔ tool services | Function Calling | USB-C for AI tools |
| Agent | LLM + tools + an autonomous loop | The intelligent part (it’s the plumbing) | A delivery boy following the brain’s orders |
| Agent Skill | A SKILL.md doc that teaches an Agent procedures | A function, a tool, or an agent | A runbook / SOP document |
| Workflow | Low-code drag-and-drop step chaining | An Agent (paths are pre-drawn) | A flowchart you can execute |
| Subagent | A child Agent with isolated context | A function call or thread | A contractor who reports back only the result |

The Three Confusions Everyone Gets Wrong

1. Function Calling ≠ MCP

This is the most common mix-up. They’re different layers of the same workflow:

  • Function Calling (Agent ↔ LLM): The model outputs { "tool": "get_weather", "city": "Tokyo" }. This is just a format — parseable JSON so the agent code can read it.
  • MCP (Agent ↔ Tool Service): The agent routes that request to the right tool server. MCP handles discovery, execution, authentication, and results.

FC outputs the plan. MCP handles execution. Asking “Can MCP replace Function Calling?” is like asking if a restaurant menu can replace the kitchen’s recipe book.

Other key differences: FC formats are vendor-specific (OpenAI, Anthropic, Google each have their own); MCP is universal. FC tool lists are static (defined in the prompt upfront); MCP supports dynamic discovery at runtime. MCP goes beyond tools — it also serves resources (static data) and prompt templates.

2. Agent Skill ≠ Traditional Function

Skills look like functions (input → process → output), but they operate on fundamentally different logic:

| Aspect | Traditional Function | Agent Skill |
|---|---|---|
| Logic | Deterministic — same input, same output | Probabilistic — uses judgment for fuzzy inputs |
| Input | Structured, exact matching | Unstructured, fuzzy/semantic matching |
| Execution | Runs native code | Expands prompts/context — teaches the model |
| Testing | Unit tests with expected outputs | Dynamic validation — does the judgment meet a useful threshold? |

If an agent is a class, a skill is more like a runbook than a method. The agent reads the runbook and uses judgment to follow it — adapting to edge cases the skill author didn’t anticipate.

3. Agent With Skill vs Without Skill — Night and Day

| Aspect | Without Skill | With Skill |
|---|---|---|
| Behavior | Reactive, varies between runs (“drift”) | Specialized, repeatable workflows |
| Context usage | High — full instructions in every prompt | Low — progressive disclosure loads on demand |
| Knowledge | Limited to base model training data | Injects domain knowledge the LLM doesn’t have |

This is why some people’s agents seem brilliant and others’ seem useless — it depends entirely on what skills you’ve given them.

The Rigidity–Flexibility Spectrum

All approaches to multi-step tasks sit on a spectrum:

| Approach | Method | Rigidity | Flexibility |
|---|---|---|---|
| LangChain | Pure code, hardcoded pipeline | ★★★★ | ★ |
| Workflow | Low-code drag-and-drop GUI | ★★★★ | ★★ |
| Agent Skill | LLM-controlled flow with docs/scripts | ★★★ | ★★★ |
| Pure Agent | LLM decides everything on the fly | ★ | ★★★★ |

Agent Skill sits in the sweet spot — structured enough to be predictable, flexible enough that the LLM can handle edge cases with judgment.


Part 4: Teaching Your Agent — The Practical Guide

MCP connects your agent to data; Skills teach it what to do with that data; Scripts let it take action at zero token cost.

Three Mechanisms for Customization

| Mechanism | What It Does | Token Cost |
|---|---|---|
| MCP | Connects the agent to external data/services | High (tool metadata always in context) |
| Agent Skill | Teaches the agent step-by-step procedures | Moderate (loaded on demand) |
| Script | Executes code without the agent reading it | Zero |

MCP: The Universal Connector

MCP (Model Context Protocol) is a standard protocol for connecting tools to any AI model. Think of it as USB-C for AI tools — build a tool once, it works with ChatGPT, Claude, Gemini, and every other platform.

Before MCP, each platform had its own tool format. Same tool, three implementations. MCP unified this.

Strengths: Structured JSON parameters avoid quoting/escaping nightmares. Tools can only do what the designer allows — essential for enterprise environments. Beyond tool calls, MCP also supports resources (static data) and prompt templates.

Cost problem: Every MCP tool’s metadata is sent to the model in every request. A single MCP server like GitHub’s has 44 tools consuming ~14,268 tokens (~$0.30/query) — just for the tool descriptions, before any actual work.

CLI Tools: The Lightweight Alternative

Instead of registering dozens of MCP tools, give the agent one tool: Bash — and let it generate commands using its training knowledge of common CLI programs (git, grep, ffmpeg, gh, etc.).

Token savings: One Bash tool definition (~dozen lines) vs 14,268 tokens for GitHub MCP alone.

Pipeline power: CLI commands compose naturally:

exiftool ... | magick ... && scp ...

This filters photos, adds watermarks, and uploads — all in one command with no model round-trips. With MCP, each step requires a separate call back to the model.

The trade-off: Complex commands are error-prone, and unrestricted execution is dangerous in shared environments.

Agent Skill: The Instruction Manual

An Agent Skill is a Markdown file (SKILL.md) that teaches an agent what to do and how to do it for a specific task. It’s not a tool — it’s a set of instructions.

The key innovation is Progressive Disclosure — a three-layer loading system that keeps token cost low:

Layer 1: Metadata (ALWAYS loaded)
   │  Model scans skill names/descriptions
   │  Cost: minimal

Layer 2: Instructions (loaded ON DEMAND)
   │  Full SKILL.md content loaded only when matched
   │  Cost: moderate

Layer 3: Resources (loaded ON DEMAND within ON DEMAND)
   ├── Reference files → read into context (costs tokens)
   └── Script files → executed only (costs ZERO tokens)

Only the skill that matches your request gets loaded. If you have 50 skills but only one matches, only one skill’s instructions enter the context.
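A sketch of the loading logic, with naive keyword matching standing in for the model's own judgment — the skill names, descriptions, and file paths below are invented for illustration:

```python
SKILLS = {
    "meeting-summary": {
        "description": "summarize meeting transcripts into action items",
        "instructions_path": "skills/meeting-summary/SKILL.md",  # hypothetical path
    },
    "photo-pipeline": {
        "description": "watermark and upload photos",
        "instructions_path": "skills/photo-pipeline/SKILL.md",
    },
}

def build_context(task, read_file):
    # Layer 1: names + descriptions, ALWAYS present, tiny token cost.
    context = [f"{name}: {meta['description']}" for name, meta in SKILLS.items()]
    # Layer 2: full SKILL.md instructions, loaded only for a matching skill.
    # (Naive substring matching; a real agent lets the LLM decide what matches.)
    for name, meta in SKILLS.items():
        if any(word in task.lower() for word in meta["description"].split()):
            context.append(read_file(meta["instructions_path"]))
    return context
```

With 50 registered skills, Layer 1 stays 50 short lines; only the matched skill's full instructions ever enter the context.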

Reference vs Script: The Key Distinction Inside Skills

Reference (Read): A document the agent reads into context when certain conditions are met. A meeting summary skill might have a “Corporate Financial Handbook” reference — loaded only when the meeting involves budgets, zero tokens otherwise.

Script (Execute): Code the agent runs but never reads. The source code never enters context. A skill with upload.py (even 10,000 lines) runs it and only gets back the execution result — zero token cost for the script itself.

Caveat: If your SKILL.md instructions for running a script are unclear, the agent might try to read the script to understand it — defeating the purpose. Always write clear execution instructions.

The Decision Framework

"I need my agent to access external data or services"
   → MCP (structured, secure, cross-platform)

"I need my agent to follow a specific workflow or procedure"
   → Agent Skill (step-by-step rules in SKILL.md)

"I need my agent to run code without bloating context"
   → Script (inside an Agent Skill, zero token cost)

"I need lightweight tool access and I'm working locally"
   → CLI via Bash (token-efficient, composable)

"I need a visual, no-code multi-step pipeline"
   → Workflow (drag-and-drop, but rigid paths)

They’re Complementary, Not Competing

Anthropic’s official position: “MCP connects Claude to data. Skills teach Claude what to do with that data.”

The best setups combine them:

  • MCP supplies raw data (sales records, logistics status, database queries)
  • Agent Skill defines processing rules (summary format, judgment criteria, escalation logic)
  • Scripts handle heavy actions (data upload, report generation, deployment)
  • CLI handles lightweight local tasks (git operations, file processing)

The Complete Stack

┌─────────────────────────────────┐
│   Agent Skill (SKILL.md)        │  ← Teaches specific workflows
│   ├── Reference (read)          │  ← On-demand knowledge
│   └── Script (execute)          │  ← Zero-cost actions
├─────────────────────────────────┤
│   MCP / CLI / Workflow          │  ← External capabilities & orchestration
├─────────────────────────────────┤
│   Subagent                      │  ← Context isolation for sub-tasks
├─────────────────────────────────┤
│   Agent (ReAct / Plan-Exec)     │  ← Autonomous loop
├─────────────────────────────────┤
│   Tools + Function Calling      │  ← Senses + structured format contract
├─────────────────────────────────┤
│   Memory                        │  ← Conversation history injection
├─────────────────────────────────┤
│   Prompt / Context Engineering  │  ← Assemble everything the LLM needs
├─────────────────────────────────┤
│   Context Window                │  ← Bounded capacity
├─────────────────────────────────┤
│   Token / Tokenizer             │  ← The units everything is measured in
├─────────────────────────────────┤
│   LLM + Judgment                │  ← Prediction engine + fuzzy decisions
├─────────────────────────────────┤
│   Transformer                   │  ← The architecture underneath
└─────────────────────────────────┘

The Unifying Truth

After walking through the entire stack, here’s the punchline: every technique in the AI ecosystem reduces to one of two things:

  1. Automatically adding information to context — RAG, Memory, References, MCP resources, Skill instructions
  2. Reducing human-model interaction via a proxy program — Agent loops, tool calling, Scripts, Subagents

That’s the entire field. Once you see this, the buzzwords lose their mystique. The future isn’t about inventing new categories — it’s about building pre-configured agents where users don’t need to understand any of these terms to get value.

The stack diagram above isn’t just a summary. It’s a map. Next time someone throws a buzzword at you, find where it sits on the stack, and you’ll know exactly what it does, what it doesn’t do, and what it actually costs.