🧵

Crate #11: RAG vs. Long-Context

When retrieval helps, when it hurts, and why bigger context changed the stack

🏗️ Architect⏱ ~16 min

raglong-contextprompt-cachingretrievalarchitecture

📋 Prerequisites

Why RAG Existed in the First Place

RAG stands for Retrieval-Augmented Generation. Translation: instead of asking a model to remember everything, you store your documents somewhere else and fetch the most relevant pieces right before the model answers.

This became popular because older models had tiny context windows. In 2023, many production systems only had room for a few thousand tokens. A long PDF, a policy manual, or a real codebase would overflow the prompt immediately. So engineers had to split documents into chunks, embed them, search the chunks, and feed back only the best bits.

That was not some weird fad. It was a practical response to a real limit. If your model literally cannot fit the corpus, retrieval is not optional. It is the bridge that lets a small-context model pretend it saw more than it really did.

Why People Are Re-thinking It Now

Context windows got much bigger. Some modern models can read hundreds of thousands or even a million tokens in one shot. Add prompt caching and the economics change too: you pay full price for the big context once, then much less for repeated follow-up queries over the same cached material.

That means a lot of systems that used to need a retrieval layer can now just load the whole corpus directly into context. No chunk boundaries. No missed table that got separated from its explanation. No top-k retrieval failure where the answer was sitting in chunk number 11 while your system only fetched the top 10.

So the new question is not "Is RAG dead?" The real question is: do you still need retrieval for THIS workload, or are you carrying around architecture that was only necessary because old models were cramped?

What Long-Context Simplifies — and What It Does Not

Long-context can simplify the stack brutally. In the best case, you only need: a file loader, a clear system prompt, the full corpus, and prompt caching. That is much easier to reason about than chunking, embeddings, vector storage, filtering, reranking, and recall debugging.

But long-context is not magic. Bigger context windows do NOT guarantee equal attention to every token. Models still show primacy, recency, and "lost in the middle" problems. They may technically see the whole corpus but still reason badly over a messy prompt. Also, sending giant private corpora into one prompt may be too expensive, too slow, or legally impossible.

So the upgrade is real, but the hype can overshoot. Long-context kills a lot of unnecessary retrieval scaffolding. It does not kill the need for careful prompt design, evaluation, access control, or good information architecture.

When RAG Still Wins

RAG still makes sense when the corpus is too large to fit comfortably, changes constantly, or must stay tightly partitioned between users. It also matters when latency is critical and you cannot afford to stuff half a million tokens into every interaction.

A good rule of thumb: if the same stable corpus is queried many times in a row, long-context plus prompt caching becomes very attractive. If the corpus is huge, rapidly changing, tenant-isolated, or only a tiny slice is relevant per request, retrieval still earns its keep.

The mature position is not religious. Do not build RAG because it was fashionable in 2024. Do not delete RAG just because a post said it was dead. Build the simplest architecture that matches your corpus size, update frequency, privacy constraints, latency target, and failure tolerance.

🤔 Think About It

If your AI assistant reads a 400-page handbook every day, would you rather pay once to cache the whole thing or repeatedly search tiny fragments? Why?
What kinds of errors become MORE likely with chunked retrieval? What kinds become MORE likely with giant prompts?
If you were designing an internal company assistant, what would decide whether you use RAG, long-context, or a hybrid?

🔬 Try This

Take a long article and split it into 500-word chunks. Then ask a friend to answer a question using only the top 2 chunks. What did they miss that required the full document?
Pick a tool or workflow you use today because of a technical limitation. Ask yourself: if that limitation disappeared, would the tool still be necessary?
Sketch two stacks on paper: one using chunking + embeddings + retrieval, and one using full-context + caching. Which one would be easier for your team to debug at 2am?

📚 Go Deeper

📰Prompt caching — Anthropic docsarticle
📄Lost in the Middle: How Language Models Use Long Contextspaper
📰Building effective agents — Anthropicarticle

🎯 Fun Fact

A lot of AI architecture arguments are really timing arguments in disguise. The same design can look genius in one model generation and bloated in the next, because the underlying constraints moved.

📝 Quick Quiz

1. Why did RAG become popular in the first place?

2. What is one major advantage of long-context over chunked retrieval?

3. Which scenario still strongly favors RAG?

4. What does prompt caching mainly change?

Answer all 4 questions to submit

← Crate #10: AI Agents & The Future All Crates →