RAG in 30 Seconds
- RAG (retrieval-augmented generation) gives an AI model access to your specific documents at query time
- The model searches first, then answers using what it finds
- No retraining required — you just add documents to the database
- It is why AI can answer questions about your company's internal data
- Most enterprise AI tools use RAG under the hood
The Problem RAG Solves
Large language models like GPT-5.4 or Claude are trained on enormous amounts of internet text up to a certain date. They know a lot about the world in general — but they know nothing about your company's internal documents, your customer database, last week's meeting notes, or any information that wasn't in their training data.
The obvious solution — retrain the model on your private data — is prohibitively expensive. Training a frontier model costs tens of millions of dollars and takes months. Even fine-tuning a smaller model on your data costs thousands of dollars and requires ML expertise.
RAG solves this problem elegantly. Instead of changing the model, you give it access to a searchable database of your documents at the time of each query. The model doesn't need to have memorised your data — it just needs to be able to find and read the relevant parts on demand.
How RAG Works — Step by Step
- Document ingestion: Your documents (PDFs, Word files, emails, database records, web pages) are processed and split into chunks — typically 200-500 word segments.
- Embedding: Each chunk is converted into a vector — a list of numbers that represents the meaning of that text. Similar chunks get similar vectors. This is done by a separate embedding model.
- Storage: All these vectors are stored in a vector database, alongside the original text they represent.
- Query: When you ask a question, your question is also converted to a vector using the same embedding model.
- Retrieval: The system finds the document chunks whose vectors are closest to your question vector — i.e., the chunks most likely to contain relevant information.
- Generation: The retrieved chunks are included in the prompt sent to the language model, along with your question. The model reads the relevant text and generates an answer based on it.
Simple analogy: Imagine you asked a very smart assistant a question. Instead of answering from memory, they first ran to a filing cabinet, pulled out the most relevant documents, read them quickly, and then answered your question using what they just read. That is RAG.
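The six steps above can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not how any particular product implements RAG: the `embed` function is a simple word-count vector standing in for a real embedding model, the "vector database" is a plain list, and the final call to a language model is left out. All function names and the two sample documents are hypothetical.

```python
import math
import re
from collections import Counter

def chunk_words(text, size=40):
    """Ingestion: split text into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(documents):
    """Ingestion, embedding, storage: keep each vector next to its original text."""
    index = []
    for name, text in documents.items():
        for chunk in chunk_words(text):
            index.append({"source": name, "text": chunk, "vector": embed(chunk)})
    return index

def retrieve(index, question, k=2):
    """Query and retrieval: embed the question, return the k closest chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda e: cosine(q, e["vector"]), reverse=True)
    return ranked[:k]

def build_prompt(question, chunks):
    """Generation (input side): place the retrieved text in the prompt."""
    context = "\n".join(f"[{c['source']}] {c['text']}" for c in chunks)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"

# Hypothetical sample documents, for illustration only.
documents = {
    "returns.txt": "Customers may return items within 30 days for a full refund.",
    "shipping.txt": "Standard shipping takes five business days within the country.",
}
index = build_index(documents)
question = "How many days do I have to get a refund?"
top = retrieve(index, question, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

In a real system the prompt would then be sent to the language model. Swapping the toy `embed` for a real embedding model changes the quality of retrieval but not the overall shape of the pipeline.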
Why Vectors and Not Keywords?
Traditional search uses keywords — it finds documents that contain the same words as your query. This works for exact matches but fails when the concept you're looking for is described differently in the document than in your question.
Vector search finds documents based on meaning. If you ask "what is our refund policy?", it can find the document titled "Customer Returns Procedure" even though that title shares no words with your question. The embedding model places the two texts close together because they are semantically related.
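The contrast can be made concrete with a small sketch. The three-dimensional vectors below are hand-assigned purely for illustration (a real embedding model would produce them, usually with hundreds or thousands of dimensions), and the second document title is a made-up distractor.

```python
import math

def keyword_overlap(query, title):
    """Count the words shared by the query and the title (a crude keyword match)."""
    strip = str.maketrans("", "", "?.,!")
    q = set(query.lower().translate(strip).split())
    t = set(title.lower().translate(strip).split())
    return len(q & t)

def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

query = "what is our refund policy?"
titles = ["Customer Returns Procedure", "Office Parking Policy"]

# Hypothetical vectors, hand-picked so that related texts point the same way.
toy_vectors = {
    "what is our refund policy?": [0.9, 0.1, 0.3],
    "Customer Returns Procedure": [0.8, 0.2, 0.4],
    "Office Parking Policy": [0.1, 0.9, 0.3],
}

by_keyword = max(titles, key=lambda t: keyword_overlap(query, t))
by_vector = max(titles, key=lambda t: cosine(toy_vectors[query], toy_vectors[t]))

print("keyword search picks:", by_keyword)
print("vector search picks: ", by_vector)
```

Keyword matching is drawn to the parking document because it shares the word "policy", while the vector comparison ranks the returns document first.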
RAG vs Fine-Tuning — When to Use Which
These two approaches are often confused. They solve different problems:
| RAG | Fine-Tuning |
|---|---|
| Give model access to specific facts and documents | Change how the model behaves or communicates |
| Easy to update — add or remove documents | Requires retraining when data changes |
| New information is usable as soon as it is indexed | Knowledge is locked at training time |
| Costs cents per query | Costs thousands of dollars to train |
| Model can cite its sources | Model cannot attribute where it learned things |
Most enterprise use cases need RAG, not fine-tuning. Fine-tuning is the right choice when you want to change the model's tone, style, or behaviour — not when you want it to know specific facts about your organisation.
RAG in Practice — Real Examples
Microsoft Copilot for Microsoft 365: When you ask Copilot to summarise your emails or find a document, it uses RAG to search your SharePoint, OneDrive, and email. The language model never sees all your data — it only receives the specific chunks retrieved for your query.
Customer support chatbots: Enterprise support chatbots use RAG to search product documentation, knowledge bases, and previous support tickets. When a customer asks a question, the bot retrieves the relevant documentation sections and generates a specific, accurate answer.
Legal research tools: AI legal research platforms like Harvey use RAG to search millions of case documents, statutes, and legal memos. Lawyers ask questions in natural language and receive answers with citations to the specific documents retrieved.
The Limitations You Should Know
RAG is powerful but not magic. Understanding its failure modes helps you use it more effectively:
- Retrieval failures: If the retrieval step does not find the right document — because the question is phrased very differently from how the answer is written — the model answers from general knowledge and may hallucinate. This is the most common failure mode.
- Chunk boundary problems: If the answer is split across two chunks, retrieval may return only one of them, so the model sees half of what it needs.
- Contradictory documents: If your document set contains conflicting information, the system may retrieve both versions, and the model may blend them into an inconsistent answer or rely on the wrong one.
- Synthesis limitations: RAG is best at retrieving specific facts. It struggles with questions that require synthesising information across many documents or drawing inferences that are not explicit in any single document.
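The chunk boundary problem, and the way overlapping chunks reduce it, can be shown with a minimal sketch. The chunk size here is unrealistically small (5 words) so the effect is visible in one sentence, and the sample sentence and function name are hypothetical.

```python
def chunk_with_overlap(words, size, overlap):
    """Slide a window of `size` words forward by `size - overlap` words each step."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

# Hypothetical sample sentence; the key fact straddles a chunk boundary.
text = "Refunds are issued within 14 days of receiving the returned item at our warehouse"
words = text.split()
fact = "within 14 days"

no_overlap = chunk_with_overlap(words, size=5, overlap=0)
with_overlap = chunk_with_overlap(words, size=5, overlap=2)

print(any(fact in c for c in no_overlap))    # fact is cut in two
print(any(fact in c for c in with_overlap))  # one chunk holds the whole fact
```

Overlap costs extra storage and some duplicated retrieval results, so it is a trade-off rather than a free fix.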
How to Evaluate a RAG System
If you are evaluating an enterprise AI tool that uses RAG, ask these questions:
- Can it cite the specific document and section it retrieved? (If not, you cannot verify its answers.)
- How does it handle queries where the answer is not in the documents? (It should say so clearly, not hallucinate.)
- How frequently are the documents updated? (Stale retrieval databases give outdated answers.)
- What chunk size and overlap does it use? (Larger chunks preserve more context; overlap reduces boundary problems.)
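The first two questions describe behaviour that can be sketched in code: return citations with every answer, and abstain when nothing relevant was retrieved. The version below is one possible approach under stated assumptions, not a description of how any specific tool works. The tuple layout, the `min_score` threshold of 0.5, and the example scores and file names are all hypothetical.

```python
def answer_or_abstain(scored_chunks, min_score=0.5):
    """Return context plus citations, or abstain if no chunk scores high enough.

    `scored_chunks` is a list of (score, source, section, text) tuples,
    assumed to come from the retrieval step. `min_score` is an illustrative
    threshold that a real system would tune on its own data.
    """
    relevant = [c for c in scored_chunks if c[0] >= min_score]
    if not relevant:
        return {"status": "not_found", "citations": [], "context": ""}
    relevant.sort(key=lambda c: c[0], reverse=True)
    return {
        "status": "ok",
        "citations": [f"{source}, {section}" for _, source, section, _ in relevant],
        "context": "\n".join(text for _, _, _, text in relevant),
    }

# Made-up retrieval results, for illustration only.
found = answer_or_abstain([
    (0.82, "returns.pdf", "section 2", "Refunds are issued within 14 days."),
    (0.31, "parking.pdf", "section 1", "Visitor parking is on level 2."),
])
missing = answer_or_abstain([
    (0.22, "parking.pdf", "section 1", "Visitor parking is on level 2."),
])

print(found["status"], found["citations"])
print(missing["status"])
```

A tool that behaves like the `not_found` branch tells the user the documents do not cover the question instead of passing weak context to the model and hoping for the best.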