What Is Retrieval-Augmented Generation (RAG)?
Language models do not know what is in your documents unless you tell them. Pasting an entire library into the prompt is impractical — too slow, too expensive, and it hits context limits fast. RAG solves that.
Retrieval-augmented generation (RAG) is a technique that gives a language model access to a searchable document index at query time, so it can answer based on specific passages from your sources rather than from its training data alone. The name describes the pipeline: first retrieve relevant excerpts, then generate a response using them as grounding.
How it works
RAG has three stages. Indexing happens ahead of time; retrieval and generation then run, in that order, on every query.
1. Indexing (done once, before any query)
Before you can query a document set, each document is:
- Chunked — split into segments, typically 200–500 words each. Whole documents are too long to score against a query reliably; smaller passages match more precisely.
- Embedded — each chunk is converted into an embedding: a vector of numbers that encodes its meaning. Two chunks about the same concept end up close together in vector space, even if they use different words.
- Stored — the embeddings go into a vector database alongside a pointer back to the original text.
This index is what makes fast, meaning-aware search possible. You build it once and add to it as your documents change; a query against it typically takes milliseconds.
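The three indexing steps above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular product builds its index. The names `chunk_words`, `toy_embed`, and `build_index` are hypothetical, the index is a plain list rather than a vector database, and `toy_embed` is a hashed bag-of-words stand-in for a real embedding model, so it captures word overlap only, not meaning.

```python
import hashlib
import math


def chunk_words(text, max_words=200):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def toy_embed(text, dims=64):
    """ASSUMPTION: stand-in for a real embedding model.

    Hashes each word into one of `dims` buckets and L2-normalises the counts.
    It reflects shared words only; a real model would also place paraphrases
    close together.
    """
    vec = [0.0] * dims
    for word in text.lower().split():
        bucket = int(hashlib.sha256(word.encode("utf-8")).hexdigest(), 16) % dims
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def build_index(documents, max_words=200):
    """Chunk and embed every document; keep a pointer back to the source text."""
    index = []
    for doc_id, text in documents.items():
        for position, chunk in enumerate(chunk_words(text, max_words)):
            index.append({
                "doc_id": doc_id,        # which document the chunk came from
                "position": position,    # chunk order within that document
                "text": chunk,           # original text, returned at query time
                "embedding": toy_embed(chunk),
            })
    return index


# Illustrative documents (made up for this sketch).
docs = {
    "note-1": "solid state electrolytes replace the liquid electrolyte " * 50,
    "note-2": "lithium iron phosphate cells trade energy density for cycle life",
}
index = build_index(docs, max_words=200)
print(len(index), "chunks indexed")
```

Storing the text next to each embedding is what lets the retrieval stage hand readable passages, not just vectors, to the model.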
2. Retrieval (runs on every query)
When you ask a question:
- The question is also embedded into a vector.
- The vector database finds the chunks whose embeddings are closest to the question vector — a similarity search.
- The top-N chunks (typically 3–10) are pulled from the database, along with their original text.
This is the retrieval step. The language model hasn't seen the question yet — retrieval is a separate, fast lookup that runs before the model is invoked.
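As a rough sketch of the similarity search described above: score every stored chunk against the question vector with cosine similarity and keep the best N. The three-dimensional vectors, the sample texts, and the names `cosine_similarity` and `retrieve` are made up for illustration. The brute-force scan is also a simplification, since vector databases typically use approximate nearest-neighbour indexes to avoid comparing against every chunk.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vector, index, top_n=3):
    """Return the top_n (score, record) pairs, best match first."""
    scored = [(cosine_similarity(query_vector, rec["embedding"]), rec) for rec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


# ASSUMPTION: tiny hand-written vectors standing in for real embeddings.
index = [
    {"text": "Solid-state electrolytes replace the liquid electrolyte.", "embedding": [0.9, 0.1, 0.0]},
    {"text": "Cathode coatings slow capacity fade.", "embedding": [0.1, 0.9, 0.0]},
    {"text": "Agenda for Tuesday's lab meeting.", "embedding": [0.0, 0.1, 0.9]},
]
query_vector = [0.8, 0.2, 0.0]  # pretend embedding of the user's question

top = retrieve(query_vector, index, top_n=2)
for score, rec in top:
    print(f"{score:.3f}  {rec['text']}")
```

Note that no language model is called anywhere in this step; it is vector arithmetic plus a sort.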
3. Generation (runs on every query)
The retrieved chunks are packaged into a prompt:
Context: [chunk 1] [chunk 2] [chunk 3] Question: [your question]
The language model reads the context and generates a response grounded in those passages. It does not search; that already happened. Its job is synthesis and language.
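Packaging the chunks could look something like the sketch below. The exact wording of the instruction line, the numbered `[1]`, `[2]` labels, and the function names are assumptions made for illustration, and `fake_llm` is a placeholder for whatever model call your stack provides.

```python
def build_prompt(chunks, question):
    """Assemble retrieved chunks and the question into one prompt string."""
    # Number each chunk so the model has something to point at when citing.
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


def generate(prompt, llm):
    """Hand the finished prompt to a model; `llm` is any callable str -> str."""
    return llm(prompt)


def fake_llm(prompt):
    """ASSUMPTION: placeholder so the sketch runs without a real model."""
    return f"(model output for a prompt of {len(prompt)} characters)"


chunks = [
    "Solid-state electrolytes replace the liquid electrolyte.",
    "Cathode coatings slow capacity fade.",
]
prompt = build_prompt(chunks, "What replaces the liquid electrolyte?")
answer = generate(prompt, fake_llm)
print(prompt)
print(answer)
```

Keeping prompt assembly separate from the model call makes it easy to print the prompt and see exactly which passages the model was given.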
A concrete example
You have 50 research notes about battery technology saved in a Space. You ask: "What did the Stanford paper say about solid-state electrolytes?"
Without RAG, the model answers from training data — which may predate the paper, misremember it, or conflate it with other work.
With RAG:
- Your question is embedded.
- The vector database finds the three most relevant chunks from your notes — including the passage you saved from that paper.
- The model reads those chunks and answers from them, with the option to cite the source passage.
The difference is the source of truth. RAG anchors the answer to your documents; a base model anchors it to its training data.
What RAG can and cannot do
| | RAG | Base model (no RAG) | Fine-tuning |
|---|---|---|---|
| Answers from your documents | ✅ Yes | ❌ No | ✅ Yes |
| Updates without retraining | ✅ Add to index | ❌ No | ❌ Retrain required |
| Can cite the source passage | ✅ Possible | ❌ No | ❌ No |
| Relative cost | Medium | Low | High |
| Failure mode | Retrieves the wrong chunk | Hallucination | Overfitting |
RAG does not eliminate hallucination — it shifts the risk. If the retriever returns an irrelevant chunk, the model may still generate a plausible-sounding but wrong answer. The quality ceiling is set by retrieval quality, not just model quality.
Why it matters
RAG is the standard mechanism behind "ask your documents" tools because the alternatives are worse:
- Pasting whole documents into a prompt hits the context window fast (see [What Is a Context Window?](what-is-a-context-window.md)) and is expensive per query.
- Fine-tuning is slow to build, expensive to update, and cannot cite sources — so you can't audit why it said what it said.
- RAG indexes are cheap to build, update in seconds, and can surface exactly which passage drove an answer.
For note-takers and researchers, RAG turns a passive library into a queryable system — without touching the underlying model.
Try this
In JustJot.ai, the AI Chat inside a Space uses RAG over your saved notes. To see it working: write a few notes on one topic, open AI Chat, and ask a specific question about that topic. When it answers, check whether the response corresponds to what you actually wrote — that correspondence is the retrieval step surfacing your own passages, not the model guessing.
If the answer seems off, start the diagnosis with retrieval, not the model. Ask whether your source actually contains the answer, and whether it is phrased in a way the retriever could match. Rewriting the question as a short search phrase often helps more than rewording it as another full sentence.
The decision rule: when an AI "chat with your documents" tool gives a wrong answer, suspect retrieval before you suspect the model. Most often, either the retriever found the wrong chunks or the right chunks were not in the index yet.
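If you run your own pipeline and can see what the retriever returned, that check can be made mechanical. The sketch below is one simple way to do it, assuming you know a key phrase the correct passage should contain. The function name `diagnose` and the sample chunks are hypothetical, and the plain substring match is deliberately crude.

```python
def diagnose(key_phrase, indexed_chunks, retrieved_chunks):
    """Classify a bad answer by where the expected passage went missing."""
    phrase = key_phrase.lower()
    in_index = any(phrase in chunk.lower() for chunk in indexed_chunks)
    in_retrieved = any(phrase in chunk.lower() for chunk in retrieved_chunks)
    if not in_index:
        return "not indexed"                # the source never made it into the index
    if not in_retrieved:
        return "indexed but not retrieved"  # the retriever picked other chunks
    return "retrieved"                      # retrieval worked; look at the prompt or model


# Illustrative data (made up for this sketch).
indexed = [
    "Solid-state electrolytes replace the liquid electrolyte.",
    "Cathode coatings slow capacity fade.",
]
retrieved = ["Cathode coatings slow capacity fade."]

print(diagnose("solid-state electrolytes", indexed, retrieved))
print(diagnose("sodium-ion", indexed, retrieved))
```

Only the third outcome, where the right passage was retrieved and the answer is still wrong, points at the prompt or the model.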