How Large Language Models Actually Work

The 60-second model

A large language model (LLM) is a giant function that, given a sequence of tokens, predicts the next token. Repeat that prediction step a few hundred times and you get a paragraph.

That's it. Everything else - chat, tools, reasoning, code - is built on top of "predict the next token."

Tokens, not words

Models read tokens, which are usually 3–4 characters in English.

"prompt" → 1 token
"Engineering" → 1–2 tokens
"prompt-engineering" → 3–4 tokens (hyphens and casing matter)
"私は学生です" → many tokens (Asian languages cost more)

You can inspect any prompt in OpenAI's tokenizer or with the tiktoken library.

How the model was trained (briefly)

Pre-training - predict the next token across trillions of tokens of internet text, code, and books.
Supervised fine-tuning (SFT) - humans write good answers; the model learns to imitate the answer style.
Reinforcement learning from human/AI feedback (RLHF / RLAIF / DPO) - humans rank model answers; the model learns to prefer the kind of answers humans (or constitutional rules) like.
Post-training for reasoning - newer models (o-series, Claude with extended thinking, Gemini Thinking) learn to use a private "scratchpad" before answering.

What that means for you as a prompt engineer

The model has no memory between calls (unless you give it some).
It's a probabilistic function - same prompt, different runs can give different output.
It only knows what was in its training data + your prompt + tools you give it. It doesn't browse, remember, or "look things up" unless you wire that in.
It's optimised to be plausible, not necessarily true. Hallucination is a structural feature, not a bug.

Activity