24 essays · 5 also in Türkçe
RAG from First Principles: runnable code and step-by-step notebooks for the 20-part series. Code on GitHub.
Even a powerful LLM can be confidently wrong about your own documents and yesterday's events. Part 1 of a from-scratch series on Retrieval-Augmented Generation: no code, just the four problems RAG solves and one durable mental model.
RAG retrieves the 'relevant' information, but how does a computer decide what counts as relevant? Part 2 of a from-scratch series on Retrieval-Augmented Generation: how we turn meaning into numbers, why similar meanings land close together, and the quiet geometric trick that makes search by meaning possible.
Relevant just means close in embedding space, but how do you turn 'close' into a single number you can rank by? Part 3 of a from-scratch series on Retrieval-Augmented Generation: Euclidean distance, the dot product, and why cosine similarity, which measures direction and ignores length, is the default for scoring chunks in RAG.
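To make the "direction, not length" point concrete, here is a minimal sketch in NumPy. The vectors are made-up toy values, not output from any embedding model, and the function name `cosine_similarity` is our own.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b: dot product divided by both lengths."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors (assumed for illustration only).
query = np.array([1.0, 2.0, 3.0])
same_direction = 10 * query              # same direction, ten times longer
orthogonal = np.array([3.0, 0.0, -1.0])  # dot product with query is 3 + 0 - 3 = 0

sim_scaled = cosine_similarity(query, same_direction)  # length is ignored
sim_orth = cosine_similarity(query, orthogonal)        # unrelated direction
dot_scaled = float(query @ same_direction)             # raw dot product grows with length
```

Scaling a vector changes its raw dot product with the query but leaves the cosine untouched, which is why cosine is a convenient default when chunk embeddings differ in magnitude.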
You can score one chunk against a query, but doing it for every chunk is exact, brute-force k-NN: perfectly accurate and painfully O(n). Part 4 of a from-scratch series on Retrieval-Augmented Generation: why ordinary database indexes break in high dimensions, the speed-versus-recall trade-off behind approximate nearest-neighbor (ANN) search, the intuition for HNSW and IVF, and what a vector database actually stores and does.
We have a retrieval engine, but it rests on one quiet assumption: that documents arrive as tidy chunks. Part 5 of a from-scratch series on Retrieval-Augmented Generation: getting clean text out of messy formats, why we chunk at all, the too-small versus too-large tension, the main splitting strategies (fixed-size, recursive, structure-aware, semantic), and the two dials that quietly decide retrieval quality, chunk size and overlap. Bad chunks poison everything downstream.
Five parts of theory, now one running program. Part 6 of a from-scratch series on Retrieval-Augmented Generation: build a complete chat-with-your-documents app by hand in Python, no framework hiding the mechanics. Embed with a local model, store vectors in plain NumPy, score by cosine similarity, retrieve top-k, ground the prompt, and generate, then swap in a real vector database. Every line ties back to a concept you already learned.
Our Part 6 app works, but it retrieves naively: pure semantic search with a fixed top-k. Part 7 of a from-scratch series on Retrieval-Augmented Generation: why dense retrieval whiffs on exact codes and names, the sparse (keyword) retrieval that nails them, TF-IDF and BM25 explained by intuition, how hybrid search fuses the two (weighted sum and Reciprocal Rank Fusion), and why top-k is a real knob with a lost-in-the-middle trap. Dense and sparse fail in opposite directions; combine them.
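As a rough illustration of the fusion step, the sketch below implements Reciprocal Rank Fusion over two ranked lists. The chunk ids are hypothetical, and `k=60` is a commonly used constant that we assume here, not necessarily the value used in the series.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of ids: each list contributes 1 / (k + rank) per id."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical results from a dense retriever and a sparse (keyword) retriever.
dense = ["c3", "c1", "c7"]
sparse = ["c9", "c3", "c1"]

fused = reciprocal_rank_fusion([dense, sparse])
```

Because RRF only looks at ranks, it needs no score normalization between the dense and sparse retrievers, which is its main appeal over a weighted sum.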
Part 8 of a from-scratch series on Retrieval-Augmented Generation. First-pass retrieval is fast but only roughly right: the best chunk can sit at rank six. Sharpen it with three levers, in pipeline order. Before retrieval, transform the query (multi-query, HyDE, step-back, decomposition). During retrieval, filter by metadata. After retrieval, rerank a wide candidate set with a cross-encoder and keep the best few. Includes a focused code addition that adds reranking and a metadata filter to the app you built in Part 6.
Eight parts in, your pipeline retrieves well. But it still assumes one unit of text does triple duty: the thing you embed, the thing you search, and the thing you hand the model. Part 9 of a from-scratch series on Retrieval-Augmented Generation breaks that assumption. The big idea is decoupling: the best unit to search on (small, sharp) is rarely the best unit to generate from (large, rich). Four patterns put it to work (parent-document, sentence-window, self-querying, and contextual compression), with one focused code addition on the running app.
The leap from a fixed pipeline that runs the same way every time to a dynamic, decision-making loop that can choose whether to retrieve, judge what came back, and try again. Part 10 of a from-scratch series on Retrieval-Augmented Generation: a guided tour of Agentic RAG, Corrective RAG (CRAG), Self-RAG, GraphRAG, and Multi-Modal RAG, what control flow each one adds, and the sober cost of reaching for any of them.
How to replace vibes with numbers. Part 11 of a from-scratch series on Retrieval-Augmented Generation: the two failure surfaces of a RAG system, the core metrics that probe each one (context precision and recall, faithfulness, answer relevance), LLM-as-a-judge and its biases, the frameworks that automate it, how to build an evaluation set, and the disciplined loop that turns guessing into engineering.
The finale. A RAG system that works in a notebook is about 20 percent of the job; the other 80 percent is making it fast, cheap, reliable, secure, and observable under real traffic. Part 12 of a from-scratch series on Retrieval-Augmented Generation: where latency and cost actually go and how to cut them, caching (including semantic caching), monitoring and tracing, failing gracefully, and the most underrated topic of all, security (prompt injection and data leakage). It closes with a capstone checklist for the whole series and a warm send-off.
Single-vector embeddings throw away token-level signal. Late interaction keeps a vector per token and scores with MaxSim, getting cross-encoder-quality matching at bi-encoder serving cost. Part 13 of a from-scratch series on Retrieval-Augmented Generation, opening the Frontier Track: ColBERT and ColBERTv2, MaxSim by hand in NumPy, the storage tradeoff, and how ColPali extends late interaction to document page images without OCR or chunking.
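A minimal sketch of MaxSim scoring, assuming each text is already represented as a matrix with one row per token. The tiny 2-dimensional vectors are invented for illustration; real ColBERT-style token embeddings come from a trained encoder.

```python
import numpy as np

def maxsim(query_tokens, doc_tokens):
    """Late-interaction score: for each query token, take its best cosine
    match among the document tokens, then sum those maxima."""
    q = query_tokens / np.linalg.norm(query_tokens, axis=1, keepdims=True)
    d = doc_tokens / np.linalg.norm(doc_tokens, axis=1, keepdims=True)
    sims = q @ d.T                 # shape: (num_query_tokens, num_doc_tokens)
    return float(sims.max(axis=1).sum())

# Toy token matrices (assumed values, one row per token).
query = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_a = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # covers both query tokens
doc_b = np.array([[1.0, 0.0], [1.0, 0.0]])              # covers only the first

score_a = maxsim(query, doc_a)
score_b = maxsim(query, doc_b)
```

The document that matches every query token scores higher than the one that matches only part of the query, which is the token-level signal a single pooled vector tends to blur. The cost is storage: one vector per token instead of one per chunk.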
A chunk that reads fine in isolation can be uninterpretable once it leaves its document: 'she' no longer resolves to 'Alice', 'the policy' loses its antecedent. Part 14 of a from-scratch series on Retrieval-Augmented Generation, on the Frontier Track: two training-free fixes, late chunking (pool token spans after the transformer) and Anthropic's Contextual Retrieval (prepend an LLM-written situating sentence before embedding), built by hand and compared.
Not every query needs the same machinery: a greeting needs no retrieval, a fact needs one lookup, a comparison needs several. Part 15 of a from-scratch series on Retrieval-Augmented Generation and the close of the Frontier Track: a small complexity classifier that routes each query to no-retrieval, single-step, or multi-step retrieval, unifying the pipelines built across Parts 6 to 10 into one adaptive system.
Part 1 asked why RAG exists. Part 16 asks the harder follow-up: when do you even need retrieval? Context windows reach about a million tokens in 2026, so sometimes you can just stuff everything in, and Cache-Augmented Generation (CAG) preloads a small, stable corpus once and reuses the cached KV state instead of retrieving. This part works out the prompt-caching economics that decide between them and gives you a clear decision matrix: a massive, fast-moving, or private corpus points to RAG; a small, stable corpus to CAG or long-context; a mid-size corpus to long-context.
RAG widens the attack surface in a way ordinary apps do not: its whole premise is feeding external, often untrusted, content straight into a powerful model's prompt. Part 17 of a from-scratch series on Retrieval-Augmented Generation: the threats unique to RAG (indirect prompt injection through retrieved documents, knowledge-base poisoning, cross-tenant leakage) and the layered defensive pipeline that contains them, from input redaction and provenance scoring to a delimited untrusted-context wall, decline-if-not-grounded, output filtering, and identity-scoped access control.
Most enterprise knowledge does not live in documents; it lives in databases and tables, and dense passage retrieval cannot answer a question whose answer has to be computed. Part 18 of a from-scratch series on Retrieval-Augmented Generation: text-to-SQL with RAG (retrieve the schema, generate SQL, execute, answer), table retrieval and the scaling reality, and routing text-search versus SQL per query.
Part 19 of a from-scratch series on Retrieval-Augmented Generation: take the agentic RAG that Part 10 only toured in prose (the ReAct loop, tool use, routing, multi-hop) and build a real agent by hand, with four tools, a reason/act/observe loop, an honest step budget, and three traces you can read line by line.
Part 20 of a from-scratch series on Retrieval-Augmented Generation: give the one-shot agent a memory. Build multi-turn RAG by hand, where query condensation rewrites a context-dependent follow-up into a standalone question before retrieval, so 'what about damaged items?' finally finds the right chunk.
The EU's aviation regulator just published a 239-page concept paper on making AI safe enough to fly. From artificial narrow intelligence to learning assurance and the W-shape process, the mental model behind trustworthy aviation AI, explained for non-experts.
Explains step by step, through the signal chain and subsystems, how TOYGUN sees, how it recognizes and tracks a target, how it ranges and designates with a laser, and how it does all of this without compromising the aircraft's radar signature.
End-to-end notes from a $55 LoRA build on gpt-oss-20b: the data pipeline, the training run, the evaluation, and the surprise that the biggest win was teaching a reasoning model to stop reasoning.
Chunk size, embeddings, re-rankers: the usual suspects. But the language of your corpus quietly shapes every layer of the pipeline, and reasoning models make it decisive.