2026-06-22
RAG vs Long-Context vs CAG
Part 1 asked why RAG exists. Part 16 asks the harder follow-up: when do you even need retrieval? Context windows reach about a million tokens in 2026, so sometimes you can just stuff everything in, and Cache-Augmented Generation (CAG) preloads a small, stable corpus once and reuses the cached KV state instead of retrieving. This part works out the prompt-caching economics that decide between them and gives you a clear decision matrix: massive, fast-moving, or private corpus to RAG; small and stable to CAG or long-context stuffing; mid-size to long-context.
What you’ll learn
Part 1 opened this series with a single deflating fact: a language model asked something it does not know will confidently make something up, and retrieval is how we fix that. Fifteen parts later we have built the whole machine. This part closes the loop by asking the question Part 1 quietly assumed away: do you even need retrieval? In 2026 the answer is no longer automatic. Context windows have grown to roughly a million tokens, so for a small enough corpus you can skip the index entirely and just put everything in the prompt. And there is a sharper version of that idea, Cache-Augmented Generation (CAG), which preloads a small, stable corpus into the context once, caches the model’s internal state for it, and reuses that state on every query, trading retrieval latency for a bigger prompt that you only pay full price for once. You will learn how those three options (RAG, long-context stuffing, and CAG) actually differ, why prompt-caching economics are the bridge that makes CAG affordable, and how to read a clean decision matrix that tells you which one your corpus wants. No new retrieval technique here. This is the part where you learn when to reach for the ones you already have, and when not to.
Prerequisites
This is a closing chapter, so it leans on the two ends of the series. You need Why RAG Exists (Part 1), because this part is the direct answer to the question that one posed, and because the whole framing of “retrieval as a way to ground a model in facts it does not hold” is the thing we are now deciding when to skip. And you need RAG in Production (Part 12), specifically its cost section, because prompt caching first appeared there as one lever among many and here it becomes the deciding factor. It also helps to have Evaluating RAG (Part 11) in mind, since it already ran the long-context-versus-RAG comparison on cost grounds; this part picks up exactly where that left off and adds the caching dimension it did not cover. Basic Python is enough for the one small cost model.
The question Part 1 left open
Here is the assumption baked into every part so far. We assumed that to ground a model in your documents, you must retrieve: embed the query, search an index, pull back the few most relevant chunks, and paste them into the prompt. That assumption was correct in 2023, when context windows held a few thousand tokens and you physically could not fit a real corpus into a prompt. It is no longer always correct. The frontier models of 2026 carry context windows of about a million tokens, which is enough to hold a few thousand pages of text at once. So a question that was unaskable three years ago is now live: if the whole corpus fits in the window, why bother retrieving from it at all?
The honest answer is that sometimes you should not. If your knowledge base is small and changes rarely, retrieval is machinery you may not need, the same realization Part 15 reached about easy queries, now applied to the corpus as a whole. But “just stuff everything in” is not free, and the reason it is not free is the single most important fact in this part: every token in the prompt is billed on every request. We met this in Part 12 as context inflation, the counterintuitive way RAG can raise your bill rather than lower it. Stuffing the entire corpus into the prompt is that effect taken to its limit. A million-token prompt answered a thousand times is a billion input tokens, paid again and again, even though the corpus never changed. Long context solves the “does it fit” problem and creates a “can you afford it” problem in its place.
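The arithmetic behind that billing claim is short enough to check directly. A minimal sketch: the function name is illustrative, and the numbers are the ones from the paragraph above.

```python
def total_input_tokens(prompt_tokens: int, requests: int) -> int:
    """Input tokens billed when the same prompt is sent on every request."""
    return prompt_tokens * requests


# A million-token stuffed prompt, answered a thousand times.
stuffed = total_input_tokens(1_000_000, 1_000)
print(f"{stuffed:,} input tokens")  # 1,000,000,000 input tokens
```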
That tension, fits-in-the-window versus paid-on-every-request, is what the rest of this part resolves. There are three answers, and which one is right depends entirely on the shape of your corpus.
The three options
Before we price anything, name the three strategies cleanly, because the whole decision is just choosing among them.
The first is RAG, the thing this series built. You keep the corpus in an index, and on each query you retrieve only the handful of chunks you need and send just those. The defining property is that your per-query input is small and roughly constant: it tracks k, the number of chunks you retrieve, not the size of the corpus. A query against a ten-thousand-chunk corpus costs the same as a query against a ten-million-chunk corpus, because you send the same k chunks either way. That is RAG’s superpower and the reason it never goes away: it is the only one of the three whose cost does not grow with the corpus.
The second is long-context stuffing, the naive use of the big window. You put the entire corpus into the prompt on every request and let the model read all of it to answer. No index, no retrieval, no chunking decisions, no embedding model to maintain. It is the simplest thing that could possibly work, and for a small corpus it often does work, with one catch: you pay for the whole corpus, at the full fresh-input rate, on every single query. Simplicity now, a linear cost in corpus size forever.
The third is CAG, Cache-Augmented Generation, and it is the clever middle. The insight is that if the corpus is the same on every request, you should not have to pay to process it from scratch every time. So you preload the corpus into the context once, let the model compute its internal key-value state (its KV cache) over those tokens, and store that state. On every subsequent query you reuse the cached state instead of recomputing it, and only the new query gets processed fresh. The CAG paper (Chan et al., arXiv 2412.15605) frames it precisely as an alternative to RAG for knowledge tasks with a bounded corpus: preload all the relevant documents, cache the runtime state, and answer from it with no retrieval step, eliminating retrieval latency and the document-selection errors that come with it. CAG is long-context stuffing made affordable by caching. It keeps the simplicity (no index) and removes the recurring full-price tax (the corpus is cached, not re-read).
So: RAG sends a small slice each time, long-context sends everything fresh each time, and CAG sends everything once and reuses it. The differences are entirely about what you pay per query, which is why the decision is, at heart, an economics problem.
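To make the contrast concrete, here is a toy sketch of how many tokens each strategy has to process from scratch for one query. The function name, the default k of 5, the 200-token chunk size, and the 50-token query are all assumptions chosen for illustration, not values from any real system.

```python
def fresh_input_tokens(strategy: str, corpus_tokens: int, query_tokens: int,
                       k: int = 5, chunk_tokens: int = 200,
                       first_request: bool = False) -> int:
    """Tokens processed from scratch for one query (toy model, assumed sizes)."""
    if strategy == "rag":
        # Only the k retrieved chunks plus the query, whatever the corpus size.
        return k * chunk_tokens + query_tokens
    if strategy == "long_context":
        # The whole corpus is re-read on every request.
        return corpus_tokens + query_tokens
    if strategy == "cag":
        # The corpus is processed once; later requests reuse the cached state.
        return (corpus_tokens if first_request else 0) + query_tokens
    raise ValueError(f"unknown strategy: {strategy}")


for corpus in (10_000, 1_000_000):
    row = {s: fresh_input_tokens(s, corpus, query_tokens=50)
           for s in ("rag", "long_context", "cag")}
    print(corpus, row)
```

Note that this counts only fresh processing. A CAG request still reads the cached corpus, and that read is billed at the discounted cached rate, which is where the economics come in.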
Cache-Augmented Generation, in a little more detail
CAG deserves a closer look, because it is the newest idea here and the one most likely to be unfamiliar. The mechanism rests on something every transformer already does. When a model processes a prompt, it computes, for every token, a set of key and value vectors that summarize that token in context. This is the KV cache, and it is what lets the model generate the next token without re-reading the whole prompt from scratch. Normally that cache lives only for the duration of one request and is thrown away after. CAG’s move is to keep it. You feed the model your fixed corpus once, capture the KV state it computed over those tokens, and persist it. Then, when a query arrives, you load that saved state and the model continues from it as if it had just read the corpus, even though it has not. The query is processed fresh; the corpus is not processed at all, only reloaded.
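The load-then-continue flow can be sketched with a toy stand-in for the model. Nothing here computes real key and value vectors: ToyModel is a hypothetical class that only counts how many tokens it has processed, which is enough to show where the work is saved.

```python
from dataclasses import dataclass


@dataclass
class ToyModel:
    """Hypothetical stand-in for a transformer; it only counts processed tokens."""
    tokens_processed: int = 0

    def prefill(self, tokens: list[str], state: tuple[str, ...] = ()) -> tuple[str, ...]:
        # "Processing" a token stands in for computing its key/value vectors.
        self.tokens_processed += len(tokens)
        # The returned state is a new tuple, so the saved corpus state is never mutated.
        return state + tuple(tokens)


corpus = "refunds are accepted within 30 days of purchase".split()
queries = [["what", "is", "the", "refund", "window"],
           ["are", "refunds", "accepted"]]

# Without CAG: the corpus is reprocessed for every query.
naive = ToyModel()
for q in queries:
    naive.prefill(corpus + q)

# With CAG: process the corpus once, keep the state, extend it per query.
cag = ToyModel()
corpus_state = cag.prefill(corpus)        # done once, then persisted
for q in queries:
    cag.prefill(q, state=corpus_state)    # only the query is new work

print(naive.tokens_processed, cag.tokens_processed)  # 24 16
```

The immutable tuple is a deliberate choice in this sketch: each query starts from the same saved corpus state rather than from a state that earlier queries have extended. A real implementation needs some equivalent of that reset.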
The payoff is twofold. First, latency: there is no retrieval step, no embedding the query, no vector search, no reranking, so the per-query path is just “load cache, process query, generate.” Second, accuracy on a bounded corpus: because the model genuinely sees the entire corpus rather than a top-k slice, it cannot make the retrieval mistakes Part 11 catalogued, where the right chunk was never fetched, or a comparison needed two chunks and got one. The whole corpus is always present. The CAG paper’s framing is exactly this: for knowledge tasks where the corpus is small and stable enough to preload, caching it sidesteps both retrieval latency and retrieval error.
The honest limits matter as much as the payoff, and they are the same two that bound long-context. The corpus must fit in the context window, which caps how much you can preload (a million tokens is a lot, but it is not a real enterprise knowledge base). And the corpus must be stable, because the moment a document changes you have to recompute and re-store the cached state for the whole thing. CAG is therefore not a RAG replacement in general. It is a RAG replacement for one specific, common shape: a small, fixed body of knowledge answered over and over. Think a product’s documentation, a policy handbook, a single contract, the rules of a game. For those, CAG is often the right tool. For a corpus that is large, or that changes hourly, or whose contents must be access-filtered per user (the multi-tenancy concern from Part 12), retrieval is still the only thing that works.
Prompt-caching economics: the cost bridge
Here is the thing that makes CAG more than a curiosity, and it is the same lever Part 12 introduced under cost optimization: prompt caching. The pricing reality of 2026 frontier models is that a token served from a cache costs roughly an order of magnitude less than a fresh one. Concretely, providers charge about 0.1x the normal input rate for cached input tokens, against a one-time write premium of about 1.25x the first time you lay a cacheable block down (a longer-lived cache costs more to write, around 2x for a one-hour entry, but the shape is the same). So the first time the model sees your corpus you pay a small premium to cache it, and every time after that you pay a tenth of the normal rate to read it. That tenfold discount on reuse is the entire economic engine behind CAG. Without it, “preload the corpus once” would still cost full price on every query and CAG would just be expensive long-context. With it, the corpus is nearly free after the first request.
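A quick worked example shows how fast the write premium amortizes. This sketch prices only the corpus portion of the prompt, in units of one fresh read of the whole corpus, using the ratios quoted above. It assumes the first request pays the write rate and nothing else, and that every later request lands while the cache entry is still alive.

```python
FRESH, CACHE_WRITE, CACHE_READ = 1.0, 1.25, 0.10  # relative rates from the text


def corpus_cost_uncached(queries: int) -> float:
    """Corpus sent fresh on every request."""
    return FRESH * queries


def corpus_cost_cached(queries: int) -> float:
    """One cache write on the first request, cached reads afterwards (assumed billing)."""
    return CACHE_WRITE + CACHE_READ * (queries - 1)


for n in (1, 2, 10, 100):
    print(n, corpus_cost_uncached(n), round(corpus_cost_cached(n), 2))
```

Under these assumptions, caching costs more on the first request (1.25 against 1.0) and is already ahead by the second (1.35 against 2.0). By a hundred requests the cached corpus has cost about 11 units against 100.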
To get this discount you have to structure the prompt correctly, and the rule is worth stating precisely because it is easy to get wrong. Prompt caching is a prefix match. The provider caches a stable prefix of your prompt and can only reuse it if everything up to the cache point is byte-for-byte identical to a previous request. The implication for layout is strict: put the stable content first (your fixed instructions, then the corpus) and the volatile content last (the user’s query). The corpus and instructions form a cacheable prefix; the query, which changes every time, sits after it and is processed fresh. Get the order backwards, with the query at the front, and nothing caches, because the prefix differs on every request.
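The layout rule can be sketched without any provider SDK. Here build_prompt and fingerprint are hypothetical helpers, and the instructions and corpus are placeholder strings. The point is only that the prefix hashes identically across requests while the full prompt does not.

```python
import hashlib

INSTRUCTIONS = "Answer only from the handbook below.\n"
CORPUS = "Section 1: Refunds are accepted within 30 days.\n"  # placeholder corpus


def build_prompt(query: str) -> tuple[str, str]:
    """Return (stable_prefix, volatile_suffix); the query always goes last."""
    prefix = INSTRUCTIONS + CORPUS       # identical on every request
    suffix = f"Question: {query}\n"      # changes on every request
    return prefix, suffix


def fingerprint(text: str) -> str:
    """Short hash used here as a stand-in for the provider's prefix match."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


p1, s1 = build_prompt("What is the refund window?")
p2, s2 = build_prompt("Can I return an opened item?")
print(fingerprint(p1) == fingerprint(p2))            # True: the prefix is reusable
print(fingerprint(p1 + s1) == fingerprint(p2 + s2))  # False: the full prompts differ
```

How you mark the cache point is provider-specific and not shown here. Whatever the mechanism, it belongs at the end of the stable prefix, before the query.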
The single most common way to wreck this is to let something volatile sneak into the prefix. A timestamp interpolated into the system prompt (“Current time: 14:32:07”), a per-request id, a session token, even a tool result that changes between calls: any of these in the prefix changes its bytes and invalidates the cache from that point on. This is the trap to watch with agentic setups in particular: if your prompt mixes the stable corpus with volatile tool results, the tool results break the cache for everything after them, and you quietly pay full price for the corpus you thought you were caching. Keep the volatile parts at the very end, after the last cached block, or out of the cached prefix entirely.
It helps to see the trade with actual numbers, so I wrote a small cost model you can run: prompt_cache_economics.py (pure standard library, no dependencies, no network). It prices the three strategies in relative cost units, using the public shape of 2026 prompt caching (fresh input as the unit, cache write at 1.25x, cache read at 0.1x), and answers the same small, stable corpus many times. The output makes the amortization concrete. At a single query, CAG is the worst option: you paid the write premium and got exactly one read out of it. But a single query is the wrong test. By a hundred queries against that fixed corpus the write has amortized across a hundred cheap reads, and CAG drops below both naive long-context (which never caches and stays roughly 2.9x the bill) and even RAG on the small-corpus workload. The script also sweeps the corpus size and shows the crossover flip: for a small corpus CAG wins, but as the corpus grows the cached reads (still 0.1x of an ever-larger corpus, on every query) eventually lose to RAG’s near-constant top-k cost. That flip is the decision matrix, derived from the pricing rather than asserted.
A caution on those numbers, in the spirit of Part 11: they are relative cost units, not a vendor quote, and the model is deliberately a toy. It ignores output-token differences beyond a flat charge, assumes a perfectly stable prefix, and uses round pricing ratios. Plug in your provider’s real per-token rates and the exact crossover point will move. What does not move is the shape: long-context cost grows linearly with corpus size on every query, CAG grows the same way but about ten times slower (the cached reads), and RAG stays flat in corpus size because it only ever sends k chunks. Trust the shape, verify the digits against your own bill.
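For readers without the script to hand, here is an independent sketch of a model with the same shape. It is not the author's prompt_cache_economics.py. It borrows the names Pricing, Workload, and cost_cag from the text, but the query size, the size of RAG's retrieved slice, and the billing of the first request as a pure cache write are assumptions, and output tokens are ignored entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    fresh: float = 1.0         # fresh input token (the unit)
    cache_write: float = 1.25  # first time a cacheable block is written
    cache_read: float = 0.10   # cached input token on reuse


@dataclass(frozen=True)
class Workload:
    corpus_tokens: int = 4000
    query_tokens: int = 50     # assumed
    top_k_tokens: int = 1000   # assumed size of RAG's retrieved slice


def cost_rag(w: Workload, p: Pricing, n: int) -> float:
    return n * (w.top_k_tokens + w.query_tokens) * p.fresh


def cost_long_context(w: Workload, p: Pricing, n: int) -> float:
    return n * (w.corpus_tokens + w.query_tokens) * p.fresh


def cost_cag(w: Workload, p: Pricing, n: int) -> float:
    if n == 0:
        return 0.0
    # One cache write, then n - 1 cached reads; the query is always fresh.
    corpus = w.corpus_tokens * (p.cache_write + (n - 1) * p.cache_read)
    return corpus + n * w.query_tokens * p.fresh


w, p = Workload(), Pricing()
for n in (1, 100):
    print(n, {"rag": cost_rag(w, p, n),
              "long_context": cost_long_context(w, p, n),
              "cag": round(cost_cag(w, p, n), 1)})
```

With these assumed sizes the ordering matches the text: CAG is the most expensive at one query and the cheapest at a hundred. The specific totals depend on the assumptions and should not be read as the script's output.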
The decision matrix
Now we can answer the question the part opened with, and it comes down to two properties of your corpus plus cost as the tiebreaker. The two axes are how big the corpus is and how often it changes; the cost model above tells you which way to break ties.
Start with the disqualifiers, because they are absolute. If the corpus is massive (bigger than the context window, or large enough that even cached reads of it on every query are expensive), or fast-moving (changing often enough that you would be recomputing a cache constantly), or private in a way that needs per-user access filtering, then you need RAG. Retrieval is the only one of the three whose cost does not grow with the corpus, the only one that handles a corpus too large to fit in any window, and the only one that can filter what each user is allowed to see before it ever reaches the prompt. A genuinely large, living, multi-tenant knowledge base is what this entire series was built for, and none of that changes in 2026.
If the corpus clears those disqualifiers, that is, it is small and stable, then you have the luxury of skipping retrieval, and the choice is between CAG and plain long-context stuffing. Lean toward CAG when the corpus is small-but-not-tiny and you answer many queries against it, because that is exactly when the cache-write premium amortizes and the 0.1x reads pay off (the cost model’s hundred-query case). Lean toward plain long-context stuffing when the corpus is so tiny that the caching machinery is not worth the bother, or when queries are so infrequent that you would never amortize the write. And there is a middle stripe worth naming: a mid-size corpus that fits in the window but is large enough that you would not want to cache it and read it on every call. For that, ordinary long-context (or RAG, if the cost climbs) is the pragmatic choice. The matrix in the figure draws all of this: stable down the left (CAG at the bottom for small, long-context above it for mid-size), and the entire right side, anything fast-moving or massive, belonging to RAG.
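The matrix can also be written as a small function, which makes the order of the checks explicit. This is a sketch only. Every threshold below (what counts as tiny, how large a corpus you are comfortable caching, how many updates a day counts as fast-moving, how much reuse justifies a cache write) is a hypothetical placeholder that you would replace with values derived from your own cost model.

```python
def choose_strategy(corpus_tokens: int, updates_per_day: float,
                    per_user_filtering: bool, queries_per_update: int,
                    *, window_tokens: int = 1_000_000,
                    cache_comfort_tokens: int = 200_000,
                    tiny_tokens: int = 2_000,
                    min_reuse: int = 10) -> str:
    """Pick a strategy from corpus shape. All default thresholds are placeholders."""
    # Disqualifiers: any one of these sends you to RAG.
    if per_user_filtering:
        return "rag"
    if corpus_tokens > window_tokens:
        return "rag"
    if updates_per_day > 1:
        return "rag"
    # From here on the corpus fits the window and is stable.
    if corpus_tokens > cache_comfort_tokens:
        return "long_context"  # mid-size stripe; revisit RAG if the cost climbs
    if corpus_tokens <= tiny_tokens or queries_per_update < min_reuse:
        return "long_context"  # too small or too rarely queried to bother caching
    return "cag"


print(choose_strategy(5_000_000, 0, False, 1_000))  # rag
print(choose_strategy(50_000, 0, False, 1_000))     # cag
print(choose_strategy(1_000, 0, False, 1_000))      # long_context
```

Checking the disqualifiers first mirrors the text: they are absolute, so no amount of favorable traffic should be able to override them.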
The meta-point, and the one that ties this part back to the spine of the whole series, is that the question is no longer “RAG or not” answered once and globally. It is “what does this corpus want,” answered from its size, its volatility, and your traffic. That is the same discipline Part 11 reached on long-context versus RAG and Part 15 reached on per-query routing: stop choosing globally, measure, and spend complexity only where the corpus demands it. Sometimes the corpus demands a full retrieval pipeline. Sometimes it just wants to be cached. Knowing the difference is the last skill this series had to teach.
Try it yourself
The economics are the whole argument, so the best way to feel them is to make the cost model lie to you and watch it stop. Grab prompt_cache_economics.py (standard library only) and run it. You will see the three strategies priced at one query and at a hundred, the crossover points where CAG overtakes naive long-context and then RAG, and a corpus-size sweep where the winner flips from CAG to RAG as the corpus grows. Then try these three experiments, in order.
First, kill the cache discount and watch CAG collapse into long-context. The whole case for CAG rests on the cache-read rate being a fraction of the fresh rate. In the Pricing dataclass, change cache_read from 0.10 to 1.0 (caching now saves nothing) and re-run. The hundred-query CAG total jumps up to roughly the naive long-context total, the crossover where CAG beats RAG vanishes, and CAG is never the winner in the sweep. That is the counterfactual world without prompt caching, and it is exactly why CAG was not a viable idea before cheap cached reads existed. The 0.1x discount is not a detail; it is the load-bearing wall.
Second, move the crossover by changing your traffic, not your pricing. The reason CAG loses at one query and wins at a hundred is amortization of the write premium. Find the first_crossover("cag", "rag", ...) call and note the number of queries it reports. Now make the corpus bigger (raise corpus_tokens in the Workload from 4000 to 40000) and re-run: the crossover moves later or disappears, because a bigger cached corpus costs more to read on every query while RAG’s top-k cost barely moves. Then make the corpus smaller (drop it to 1000) and watch CAG win almost immediately. You are tracing the boundary of the decision matrix by hand: the more you reuse a small stable corpus, the more CAG pays off, and the larger the corpus, the sooner RAG takes back the lead.
Third, simulate the cache-breaking pitfall. The model assumes the corpus sits in a clean, stable prefix that caches perfectly. Real prompts often spoil that by mixing volatile content into the prefix. Add a volatile_tokens field to Workload (say 300 tokens of per-request tool results) and, in cost_cag, bill those tokens at the fresh rate on every call instead of the cached rate, then move them ahead of the corpus in your mental layout so they would invalidate the cache after them. Re-run and watch CAG’s advantage shrink or evaporate. That is the “volatile tool results break the cache” failure made numeric: the moment something changes inside your cached prefix, you are back to paying fresh-input rates for the corpus you thought was nearly free.
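A standalone sketch of that experiment follows. The volatile_in_prefix flag is a hypothetical addition, and a broken cache is billed here at the plain fresh rate. That is an assumption: a provider that still charges the write premium on every failed reuse would make the broken case worse than shown.

```python
def cost_cag(corpus_tokens: int, n: int, volatile_tokens: int = 0,
             volatile_in_prefix: bool = False, query_tokens: int = 50,
             cache_write: float = 1.25, cache_read: float = 0.10) -> float:
    """Relative input cost of n >= 1 queries under CAG (toy model)."""
    per_query_fresh = query_tokens + volatile_tokens
    if volatile_in_prefix and volatile_tokens > 0:
        # The prefix changes on every call, so the corpus is never served from cache.
        # Assumption: it is billed at the fresh rate each time.
        return float(n * (corpus_tokens + per_query_fresh))
    # Stable prefix: one write, then cached reads; volatile tokens ride after it.
    corpus = corpus_tokens * (cache_write + (n - 1) * cache_read)
    return corpus + n * per_query_fresh


baseline = cost_cag(4000, 100)
tail = cost_cag(4000, 100, volatile_tokens=300)
broken = cost_cag(4000, 100, volatile_tokens=300, volatile_in_prefix=True)
print(round(baseline), round(tail), round(broken))
```

Placing the 300 volatile tokens after the cached corpus adds only their own fresh cost. Placing them ahead of it multiplies the hundred-query total several times over, because the corpus itself is re-billed on every call.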
⚠️ Common pitfalls
- Stuffing a corpus that does not fit, or will not stay still. Long-context and CAG both require the whole corpus to fit in the context window and to stay stable. A corpus larger than the window simply cannot be stuffed, and one that changes often forces you to recompute the cache constantly, erasing CAG’s advantage. If either is true, the answer is RAG, not a bigger prompt.
- Putting volatile content in the cached prefix. Caching is a prefix match: a timestamp, a per-request id, a session token, or a changing tool result anywhere in the prefix invalidates the cache from that point on. Keep the stable parts first (instructions, then corpus) and the volatile query and tool results last, after the final cache point, or you will pay fresh-input rates for the corpus you meant to cache.
- Judging CAG on a single query. At one query CAG is the most expensive option, because you paid the write premium and got one read. Its whole value is amortization across many reuses of the same stable corpus. Measuring it on one call, or on a corpus you answer only occasionally, tells you nothing about its steady-state cost.
- Treating “context windows are huge now” as “RAG is dead.” A million-token window does not hold a real enterprise corpus, does not filter per-user access, and does not make cost vanish (every stuffed token is billed on every request). Long-context and CAG win on a narrow shape (small, stable, reused); the large, living, multi-tenant case is still RAG’s.
- Quoting a fixed cost saving. The order-of-magnitude cached-read discount is real, but your actual savings depend on your corpus size, your reuse rate, and your provider’s exact rates. The cost model here is in relative units to show the crossover shape; verify the numbers against your own bill before committing.
Key takeaways
- The 2026 reality is that context windows reach about a million tokens, so for a small enough corpus you can skip retrieval and put everything in the prompt. The catch is that every prompt token is billed on every request, so “just stuff it” trades a fits-in-the-window win for a paid-on-every-query cost.
- There are three options: RAG (retrieve only the k chunks you need; cost tracks k, not corpus size), long-context stuffing (send the whole corpus fresh every query; simplest, but linear in corpus size forever), and CAG (preload the corpus once, cache the KV state, and reuse it; long-context made affordable by caching).
- Cache-Augmented Generation (Chan et al., arXiv 2412.15605) preloads a small, stable corpus into the context, persists the model’s KV state over it, and reuses that state per query. It removes retrieval latency and retrieval error, but only for a corpus that fits the window and stays stable.
- Prompt-caching economics are the bridge. A cached input token costs roughly 0.1x the fresh rate, against a one-time write premium of about 1.25x. Structure the prompt as a stable cacheable prefix (instructions, then corpus) with the dynamic query last; volatile tool results in the prefix break the cache and quietly restore full pricing.
- The decision matrix runs on two axes plus cost: massive, fast-moving, or private corpus to RAG; small and stable to CAG (when you reuse it enough to amortize the write) or plain long-context stuffing (when it is tiny or rarely queried); mid-size to long-context.
- The closing lesson of the series, applied to the corpus rather than the query: stop asking “RAG or not” globally. Ask what this corpus wants, from its size, its volatility, and your traffic, and spend retrieval complexity only where it earns its keep.
References
- Brian J. Chan, Chao-Ting Chen, Jui-Hung Cheng, and Hen-Hsen Huang. “Don’t Do RAG: When Cache-Augmented Generation is All You Need for Knowledge Tasks.” 2024. arXiv:2412.15605. The paper that names and motivates CAG. It proposes preloading all relevant resources into an extended-context model and caching its runtime KV state, so that at inference the model answers from the preloaded state with no retrieval step. The reported advantage is exactly the two-sided one this part describes: eliminating retrieval latency and minimizing the document-selection errors of RAG, for knowledge tasks with a constrained, stable corpus. It is explicitly framed as an alternative to RAG for that shape of problem, not a general replacement, which is why this part files it in one cell of the decision matrix rather than at the top.
- Zhuowan Li, Cheng Li, Mingyang Zhang, Qiaozhu Mei, and Michael Bendersky. “Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach.” EMNLP 2024 (industry track). arXiv:2407.16833. The Self-Route study, also cited in Part 11. Its finding underwrites the matrix here: long-context and RAG each win on different inputs, and the pragmatic move is to route per query rather than choose one globally, which is the same per-corpus judgment this part applies.
- Kuan Li, Liwen Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Shuai Wang, and Minhao Cheng. “LaRA: Benchmarking Retrieval-Augmented Generation and Long-Context LLMs - No Silver Bullet for LC or RAG Routing.” ICML 2025. arXiv:2502.09977. The benchmark whose subtitle states the conclusion this part adopts: there is no universal winner between long-context and RAG, so the answer is “it depends,” on corpus size, query type, and budget.
Glossary
- Long-context stuffing: putting an entire corpus into the prompt on every request and letting the model read all of it to answer, with no retrieval. Simple, but you pay for the whole corpus at the fresh-input rate on every query.
- Cache-Augmented Generation (CAG): preloading a small, stable corpus into the context once, persisting the model’s key-value (KV) state over those tokens, and reusing that cached state on every query so the corpus is never reprocessed. Trades retrieval latency for a large but cacheable prompt.
- KV cache: the key and value vectors a transformer computes for every token as it processes a prompt, which let it generate further tokens without re-reading the prompt. CAG persists this cache across requests instead of discarding it.
- Prompt caching: a provider feature that stores a stable prefix of your prompt and serves it on later requests at a fraction of the normal input rate (about 0.1x), after a one-time write premium (about 1.25x). It is a prefix match: any byte change in the prefix invalidates the cache from that point.
- Cacheable prefix: the stable, leading portion of a prompt (typically fixed instructions followed by the corpus) that can be cached and reused. The dynamic query, and any volatile tool results, must come after it or the cache is broken.
- Context inflation: the effect (from Part 12) that every chunk or document placed in the prompt is billed as input tokens on every request, so stuffing more context raises the per-query cost. Long-context stuffing is this effect at its limit; prompt caching is what tames it.
- Decision matrix: the rule for choosing among RAG, CAG, and long-context, on two axes (corpus size and volatility) with cost as the tiebreaker: massive, fast-moving, or private to RAG; small and stable to CAG or stuffing; mid-size to long-context.
Next up is Part 17. RAG in Production (Part 12) flagged the topic that gets skipped most and hurts most: security. We will give it the full treatment: the threat model, the attack catalog (prompt injection through retrieved documents, data leakage across tenants), and the defenses worked out in depth, building directly on the production foundation of Part 12. Deciding whether to retrieve, which we settled here, is only safe once you know how to retrieve safely.