LLMs in Plain English

  • LLMs are AI systems trained to understand and generate text
  • They work by predicting the most likely next word/token
  • They are trained on hundreds of billions to trillions of words of text from the web, books, and code
  • They don't look facts up in a database; they reproduce language patterns that often, but not always, encode facts
  • Parameters are the learned values that encode language understanding

The Core Mechanism — Next Token Prediction

Every LLM, at its core, does one thing: predict the next token (word or word-fragment) given everything that came before. Given "The capital of France is", a well-trained LLM predicts "Paris" because that pattern appeared countless times in training data.
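
To make the idea concrete, here is a minimal sketch of next-token prediction using simple counting instead of a neural network. The corpus, the function names (train_ngram, predict_next) and the two-token context are all assumptions made for illustration; a real LLM learns these statistics with a neural network over far longer contexts.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus (an assumption for illustration, not real training data).
CORPUS = [
    "the capital of france is paris",
    "the capital of france is paris",
    "the capital of france is beautiful",
    "the capital of italy is rome",
]


def train_ngram(sentences, context_size=2):
    """Count which token follows each context of `context_size` tokens."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(context_size, len(tokens)):
            context = tuple(tokens[i - context_size:i])
            counts[context][tokens[i]] += 1
    return counts


def predict_next(counts, prompt, context_size=2):
    """Return (token, probability) for the most likely next token, or None."""
    context = tuple(prompt.lower().split()[-context_size:])
    followers = counts.get(context)
    if not followers:
        return None  # this context never appeared in the corpus
    token, count = followers.most_common(1)[0]
    return token, count / sum(followers.values())


counts = train_ngram(CORPUS)
prediction = predict_next(counts, "The capital of France is")
```

Here "paris" wins because it followed "france is" in two of the three matching sentences, not because the program knows anything about geography.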

Scale this simple mechanism to hundreds of billions of parameters and trillions of tokens of training text, and something remarkable emerges: the model doesn't just predict next words. It appears to reason, write coherently, answer questions, and generate code.

How LLMs Are Trained

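  1. Data collection — Collect vast amounts of text: web pages, books, academic papers, code, conversations. OpenAI has not published the size of GPT-4's training data. For scale, the GPT-3 paper reports about 45 terabytes of raw Common Crawl text, filtered down to roughly 570 GB before training. At roughly 1 MB of plain text per book, 45 terabytes corresponds to tens of millions of books.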
  2. Pre-training — Train the model on next-token prediction across this data using thousands of specialised AI chips over months. This is the most expensive part.
  3. Fine-tuning — Further train the model on curated examples of good responses, making it helpful and conversational rather than just a text predictor.
  4. RLHF — Reinforcement Learning from Human Feedback. Human raters compare model outputs and their preferences are used to further refine behaviour toward helpful, harmless, honest responses.
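
Steps 2 and 4 above each come down to minimising a number. The sketch below shows two losses commonly used for this: cross-entropy on the true next token for pre-training, and a pairwise preference loss of the kind often used to train a reward model from human comparisons. The function names and the scores are hypothetical, and the text above does not say which exact losses any particular model uses.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_loss(logits, target_index):
    """Pre-training: -log(probability the model gave the true next token)."""
    return -math.log(softmax(logits)[target_index])


def preference_loss(reward_chosen, reward_rejected):
    """Preference training: -log(sigmoid(margin)), small when the response
    the human preferred receives the higher reward score."""
    margin = reward_chosen - reward_rejected
    return math.log(1.0 + math.exp(-margin))


# Made-up scores over a three-token vocabulary; index 0 is the true next token.
good_guess = next_token_loss([2.0, 0.0, 0.0], target_index=0)
bad_guess = next_token_loss([0.0, 2.0, 0.0], target_index=0)
```

In both cases training nudges the parameters in whatever direction lowers the loss, so good_guess being smaller than bad_guess is exactly the signal the model learns from.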

Parameters — What They Actually Are

The "large" in large language model refers to parameters — numerical values in the neural network that are adjusted during training to minimise prediction errors. A model with more parameters can represent more complex patterns.
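
As a rough illustration of what "adjusted during training to minimise prediction errors" means, the sketch below fits a model with a single parameter w by gradient descent. The data, learning rate and function name fit_weight are assumptions chosen for the example; an LLM applies the same idea to billions of parameters at once.

```python
def fit_weight(xs, ys, learning_rate=0.05, steps=200):
    """Fit y ≈ w * x by gradient descent on the mean squared error."""
    w = 0.0  # the single parameter, starting from an arbitrary value
    n = len(xs)
    for _ in range(steps):
        # Gradient of mean((w*x - y)^2) with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad  # step in the direction that reduces error
    return w


# Made-up data generated by the rule y = 3 * x.
xs = [1.0, 2.0, 3.0]
ys = [3.0, 6.0, 9.0]
learned_w = fit_weight(xs, ys)
```

After training, learned_w is very close to 3: the parameter has absorbed the pattern in the data without the rule ever being written into the code.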

An analogy: if a language model were a recipe book, parameters would be the individual ingredient measurements. More parameters = more complex recipes = more nuanced cooking. Modern LLMs have hundreds of billions to potentially trillions of these values.
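
To give a feel for the numbers, this sketch counts the parameters in a small fully connected network and estimates how much memory the weights alone would need. The layer sizes and helper names are hypothetical, and the 2 bytes per parameter assumes 16-bit storage; real LLMs use transformer layers, but the counting principle is the same.

```python
def dense_layer_params(n_in, n_out):
    """One weight per input-output pair, plus one bias per output."""
    return n_in * n_out + n_out


def mlp_params(layer_sizes):
    """Total parameters in a stack of fully connected layers."""
    return sum(
        dense_layer_params(n_in, n_out)
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


def weights_memory_gb(n_params, bytes_per_param=2):
    """Memory for the weights alone, assuming 16-bit (2-byte) values."""
    return n_params * bytes_per_param / 1e9


tiny = mlp_params([4, 8, 2])            # (4*8 + 8) + (8*2 + 2) = 58
large_gb = weights_memory_gb(175e9)     # 175 billion parameters -> 350.0 GB
```

A network with 58 parameters fits in a few hundred bytes, while a 175-billion-parameter model needs about 350 GB just to hold its weights, which is why such models run across many accelerator chips.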

Why LLMs Hallucinate

LLMs generate text based on probability, not factual lookup. When asked a question, they produce the most probable response given their training — not the most factually accurate one. If plausible-sounding but incorrect text appeared frequently in training data, the model may reproduce that pattern confidently.
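
The sketch below shows why probability-based generation can produce confident errors. The candidate answers and their scores are invented for illustration, and sample_token is a hypothetical helper; the point is only that the sampling step has no notion of which candidate is true.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(candidates, logits, temperature=1.0, rng=random):
    """Pick one candidate at random, weighted by its probability."""
    probs = softmax(logits, temperature)
    return rng.choices(candidates, weights=probs, k=1)[0]


# Invented scores for completing "The Eiffel Tower was completed in ...".
# Only "1889" is correct, but nothing in the sampling step checks that.
candidates = ["1889", "1887", "1901"]
logits = [2.0, 1.5, 0.5]

rng = random.Random(0)  # fixed seed so the run is repeatable
draws = [sample_token(candidates, logits, rng=rng) for _ in range(1000)]
wrong_share = sum(d != "1889" for d in draws) / len(draws)
```

With these scores the correct answer is the single most likely one, yet a large share of the samples are wrong years, each stated just as fluently as the right one.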

This is why you should always verify specific facts, statistics, citations, and technical claims from LLM outputs. They're excellent at structure, reasoning frameworks, and general knowledge — but unreliable for precise facts, recent events, or narrow technical details.

Context Windows — The Working Memory of LLMs

An LLM can only "see" a limited amount of text at once: its context window. Early models had context windows of roughly 2,000 to 4,000 tokens (4,000 tokens is about 3,000 words). More recent models range from 128,000 tokens (GPT-4 Turbo) to 1 million tokens (Gemini 1.5 Pro). Anything outside the context window is invisible to the model, so it doesn't remember previous conversations unless they're included in the current context.
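
One common way applications cope with this limit is to drop the oldest messages until the conversation fits. The sketch below assumes a crude one-token-per-word count and a hypothetical helper named fit_to_context; real systems use the model's own tokenizer and often summarise old messages instead of discarding them.

```python
def count_tokens(text):
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def fit_to_context(messages, max_tokens):
    """Keep the most recent messages whose combined size fits in the window."""
    kept = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break  # this message and everything older is dropped
        kept.append(message)
        used += cost
    return list(reversed(kept))  # restore chronological order


history = [
    "my name is Ada",
    "what is an LLM",
    "and what is a token",
]
visible = fit_to_context(history, max_tokens=9)
```

With a 9-token window only the last two messages survive, so a model given visible would have no way to recall the user's name.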

Frequently Asked Questions

What is a large language model in simple terms?
A large language model is an AI system trained on enormous amounts of text to understand and generate human language. The 'large' refers to the billions of parameters (mathematical values) the model uses. Think of it as a very sophisticated autocomplete trained on a large share of the public web.
How are LLMs different from search engines?
Search engines index and retrieve existing web pages. LLMs generate new text based on patterns learned during training. Google finds pages that already exist; ChatGPT writes new responses based on patterns from its training data.
Why do LLMs sometimes give wrong answers?
LLMs generate text based on probability — they predict what word or sentence comes next based on their training data. They don't have a separate fact-checking system. When they generate plausible-sounding but incorrect information, it's called hallucination.
What is the difference between GPT-4 and Claude?
GPT-4 (made by OpenAI) and Claude (made by Anthropic) are both LLMs, built by different companies with different training data and methods. Users often report that Claude produces more nuanced long-form writing and that GPT-4 has a larger ecosystem of tool integrations, but these impressions shift with each model release. Both are among the most capable models available.
How big is a large language model?
Modern LLMs have hundreds of billions to trillions of parameters. GPT-4's exact size isn't public; unconfirmed estimates put it above 1 trillion parameters. These models require enormous computing resources to train: GPT-4's training is reported to have cost more than $100 million.