LLMs in Plain English
- LLMs are AI systems trained to understand and generate text
- They work by predicting the most likely next token (a word or word fragment)
- Trained on hundreds of billions to trillions of words from the internet, books, and code
- They don't look facts up in a database; they reproduce language patterns learned from text
- Parameters are the learned values that encode language understanding
The Core Mechanism — Next Token Prediction
Every LLM, at its core, does one thing: predict the next token (word or word-fragment) given everything that came before. Given "The capital of France is", a well-trained LLM predicts "Paris" because that pattern appeared countless times in training data.
Scale this simple mechanism to hundreds of billions of parameters and hundreds of billions of training examples, and something remarkable emerges: the model doesn't just predict next words — it appears to reason, write coherently, answer questions, and generate code.
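To make the mechanism concrete, here is a minimal sketch of next-word prediction using simple word-pair counts. It is a toy stand-in, not how a real LLM works internally: the corpus, `train_bigram`, and `predict_next` are all made up for illustration, and a real model conditions on the whole preceding text rather than on one word.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Return (most frequent follower, its estimated probability), or None if unseen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    best, n = followers.most_common(1)[0]
    return best, n / sum(followers.values())

# Hypothetical toy corpus; one sentence is deliberately wrong.
toy_corpus = [
    "the capital of france is paris",
    "the capital of france is paris",
    "the capital of france is lyon",
    "the capital of italy is rome",
]
model = train_bigram(toy_corpus)
print(predict_next(model, "is"))  # ('paris', 0.5)
```

"paris" wins only because it follows "is" most often in this corpus. The prediction reflects frequency in the training text, not a check against reality.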
How LLMs Are Trained
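- Data collection — Collect vast amounts of text: web pages, books, academic papers, code, conversations. OpenAI has not published the size of GPT-4's training data. For a sense of scale, the GPT-3 paper reports about 45 terabytes of compressed raw web text before filtering, reduced to roughly 570 GB afterwards. Assuming around 1 MB of plain text per book, 45 terabytes is on the order of tens of millions of books.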
- Pre-training — Train the model on next-token prediction across this data using thousands of specialised AI chips over months. This is the most expensive part.
- Fine-tuning — Further train the model on curated examples of good responses, making it helpful and conversational rather than just a text predictor.
- RLHF — Reinforcement Learning from Human Feedback. Human raters compare model outputs and their preferences are used to further refine behaviour toward helpful, harmless, honest responses.
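The pre-training step above minimises a prediction error on the next token. A common choice for that error is cross-entropy loss, sketched below under simplifying assumptions: a three-word vocabulary, hand-picked scores (logits), and helper names (`softmax`, `next_token_loss`) invented for this example. Fine-tuning and RLHF are not shown.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits, target_index):
    """Cross-entropy: negative log of the probability given to the correct token."""
    probs = softmax(logits)
    return -math.log(probs[target_index])

vocab = ["paris", "london", "banana"]  # hypothetical tiny vocabulary
target = vocab.index("paris")

confident_right = [5.0, 1.0, 0.0]  # model strongly favours "paris"
uniform = [0.0, 0.0, 0.0]          # model has no preference

loss_good = next_token_loss(confident_right, target)
loss_unsure = next_token_loss(uniform, target)
print(loss_good < loss_unsure)  # True
```

Training nudges the parameters so that this loss, averaged over the training text, goes down. The lower loss for `confident_right` is the signal that rewards putting probability on the token that actually came next.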
Parameters — What They Actually Are
The "large" in large language model refers to parameters — numerical values in the neural network that are adjusted during training to minimise prediction errors. A model with more parameters can represent more complex patterns.
An analogy: if a language model were a recipe book, parameters would be the individual ingredient measurements. More parameters = more complex recipes = more nuanced cooking. Modern LLMs range from a few billion to hundreds of billions of these values, and potentially trillions.
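A rough way to get a feel for these numbers is to count the parameters in a couple of fully connected layers and estimate the storage the weights need. The sketch below assumes 16-bit (2-byte) values and a hypothetical 175-billion-parameter model. The layer sizes and function names are illustrative only.

```python
def dense_layer_params(n_in, n_out):
    """A fully connected layer has n_in * n_out weights plus one bias per output."""
    return n_in * n_out + n_out

def memory_gb(n_params, bytes_per_param=2):
    """Approximate storage for the weights alone (1 GB = 10**9 bytes)."""
    return n_params * bytes_per_param / 1e9

# Two stacked layers with made-up sizes: 512 -> 2048 -> 512
small = dense_layer_params(512, 2048) + dense_layer_params(2048, 512)
print(small)                       # 2099712
print(memory_gb(175_000_000_000))  # 350.0
```

Even this two-layer toy has about two million adjustable values. At 2 bytes each, the hypothetical 175-billion-parameter model would need around 350 GB for its weights alone, before counting any memory used while running it.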
Why LLMs Hallucinate
LLMs generate text based on probability, not factual lookup. When asked a question, they produce the most probable response given their training — not the most factually accurate one. If plausible-sounding but incorrect text appeared frequently in training data, the model may reproduce that pattern confidently.
This is why you should always verify specific facts, statistics, citations, and technical claims from LLM outputs. They're excellent at structure, reasoning frameworks, and general knowledge — but unreliable for precise facts, recent events, or narrow technical details.
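The sketch below illustrates the probability-based generation described above. It samples answers to "The capital of Australia is" from made-up scores in which a plausible wrong answer ("Sydney") sits close behind the right one. The scores, the `temperature` parameter, and the function names are assumptions for illustration, not values from any real model.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert scores to probabilities; higher temperature flattens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_counts(tokens, logits, n=10_000, temperature=1.0, seed=0):
    """Draw n tokens according to their probabilities and count each outcome."""
    rng = random.Random(seed)
    probs = softmax(logits, temperature)
    draws = rng.choices(tokens, weights=probs, k=n)
    return {t: draws.count(t) for t in tokens}

tokens = ["Canberra", "Sydney", "Melbourne"]
logits = [2.0, 1.5, 0.2]  # hypothetical scores, not from a real model

probs = softmax(logits)
counts = sample_counts(tokens, logits)
print(dict(zip(tokens, (round(p, 2) for p in probs))))
print(counts)
```

With these scores the correct answer is the single most likely token, yet roughly a third of the sampled outputs are "Sydney". Nothing in the sampling step checks the answer against a source of truth, which is why verification has to happen outside the model.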
Context Windows — The Working Memory of LLMs
An LLM can only "see" a limited amount of text at once. This limit is its context window. Earlier models such as GPT-3.5 were limited to about 4,000 tokens (~3,000 words). More recent models range from 128,000 tokens (GPT-4 Turbo) to 1 million tokens (Gemini 1.5 Pro). Anything outside the context window is invisible to the model: it doesn't remember previous conversations unless they're included in the current context.
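One common way applications cope with this limit is to drop the oldest messages until the conversation fits. The sketch below shows that idea under loud simplifications: `count_tokens` counts whitespace-separated words instead of using a real tokenizer, and `fit_to_context` and the 20-token limit are hypothetical.

```python
def count_tokens(text):
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())

def fit_to_context(messages, max_tokens):
    """Keep the most recent messages whose combined token count fits the window."""
    kept = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break  # this message and everything older is dropped
        kept.append(message)
        used += cost
    return list(reversed(kept))  # restore chronological order

history = [
    "My name is Ada",
    "Nice to meet you Ada",
    "What is a context window",
    "It is the text a model can see at once",
    "What is my name",
]
visible = fit_to_context(history, max_tokens=20)
print(visible)
```

Here only the last three messages fit, so the message containing the user's name is no longer visible when the final question arrives. A model given only `visible` has no way to answer it.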