RAG FROM FIRST PRINCIPLES · PART 18 OF 20

2026-06-24

Structured and SQL RAG

Most enterprise knowledge does not live in documents, it lives in databases and tables, and dense passage retrieval cannot answer a question whose answer has to be computed. Part 18 of a from-scratch series on Retrieval-Augmented Generation: text-to-SQL with RAG (retrieve the schema, generate SQL, execute, answer), table retrieval and the scaling reality, and routing text-search versus SQL per query.

What you’ll learn

For seventeen parts we have treated knowledge as text: documents chunked into passages, embedded, retrieved by similarity, and read by a model. That worked because the questions we asked had answers sitting in passages, waiting to be found. This part confronts the knowledge that does not work that way: the rows, columns, and tables that hold most of what an enterprise actually knows. You will learn why the dense-retrieval intuition that carried the whole series quietly breaks on structured data, and what replaces it: text-to-SQL with RAG, where you retrieve the relevant schema (not passages), generate a SQL query grounded in it, execute that query against the database, and answer from the result. You will learn why you must retrieve a schema subset rather than paste the whole database (tabular reasoning degrades as tables grow, and real schemas dwarf any prompt budget), and finally how to route a query to the SQL path or the document path, tying straight back to the complexity classifier from Part 15.

Prerequisites

This part stands a little apart from the rest, because it is about a kind of data the series otherwise omits, but it leans on a few earlier pieces. You need Build Your First RAG (Part 6), because the document path we will contrast against is exactly that retrieve-augment-generate loop. You need Making Retrieval Smarter (Part 8) for the idea of retrieving a relevant subset of a larger thing, which is what schema retrieval is. You need Adaptive RAG (Part 15), because the router at the end of this part is the same complexity classifier, extended with one more branch. And the production discipline from RAG in Production (Part 12) is the lens for the caveats, since executing generated SQL against a live database is exactly the kind of consequential action that part taught you to handle carefully. Basic SQL is enough; the companion code is stdlib sqlite3 you can read in a sitting.

Why dense retrieval misses here

Start with the uncomfortable fact that motivates the whole part. Ask “what was our total revenue from shipped orders last quarter?” and there is no passage anywhere that holds the answer. The number does not exist as text until someone computes it. It lives, implicitly, spread across thousands of rows in an orders table and a products table, and the only way to produce it is to join those tables, filter to shipped orders in the quarter, multiply price by quantity, and sum. Dense passage retrieval, the engine of every part before this one, finds text that is similar in meaning to the query. It is very good at that and completely unable to do the thing this question needs, which is arithmetic over a set of rows. You can retrieve the most relevant-looking chunk in the corpus and it will still not contain a number nobody has written down.

This is not an edge case you can wave away. Most of what an organization knows is structured: transactions, inventory, user records, sensor logs, financials, the relational backbone that runs the business. The document-centric series we have built is the right tool for policies, manuals, tickets, and prose. It is the wrong tool for the database, and reaching for it there produces a particular, recognizable failure: the model, handed some vaguely related rows as text, guesses a number that looks plausible and is wrong. The fix is not a better embedder. It is to stop pretending the question is a retrieval problem and admit it is a query problem.

So the move is to change what “retrieval” produces. Instead of retrieving passages to read, we retrieve enough of the database’s schema for a model to write a correct query, generate that query in SQL, run it, and answer from real result rows. The retrieval is still doing its job (narrowing a huge space down to the relevant part) but the relevant part is now table definitions, not text, and the answer comes from execution, not from reading.

Text-to-SQL with RAG

Here is the loop, and it has the same five-word spine as the rest of the series with two words swapped. The document pipeline was embed, retrieve, ground, generate. The structured pipeline is retrieve schema, generate SQL, execute, answer. Walk it once and the shape is clear.

A four-stage left-to-right pipeline. Stage 0, a Question box reads total revenue from shipped orders. Stage 1, RETRIEVE SCHEMA in teal: a faded catalog of tables (customers, invoices, tickets, and more, marked too big for the prompt) above a highlighted retrieved subset of the top-2 tables, orders(id, product_id, quantity, status, date) and products(id, name, category, price). Stage 2, GENERATE SQL in indigo: a dark code card showing SELECT SUM(p.price * o.quantity) FROM orders o JOIN products p ON ... WHERE status='shipped'. Stage 3, EXECUTE in slate: an arrow into a database cylinder labeled DB rows, captioned the result is exact, computed, never hallucinated. Stage 4, ANSWER FROM THE RESULT in emerald: result rows [(436.0,)] turn into the sentence total revenue from shipped orders was 436.0, with a note that no passage holds this number, it must be computed over rows, dense retrieval finds similar text but cannot SUM, and the database does the arithmetic exactly. A footer reads: the schema is retrieved, the answer is executed; enterprise schemas are far too large to paste whole, so you retrieve the relevant tables, then let SQL compute the exact answer.
Fig 1 The structured-RAG loop. A numeric question (total revenue from shipped orders) cannot be answered by reading a passage; the number has to be computed. So instead of retrieving text, you (1) retrieve the relevant subset of the database schema, the orders and products tables out of a much larger catalog; (2) generate a SQL query grounded in that schema, plus a domain glossary and few-shot exemplars; (3) execute the SQL against the live database, which returns exact result rows; and (4) answer from those rows. The arithmetic is done by the database, never guessed by the model.

The interesting part is what you retrieve in step one, because it is more than table names. A schema-retrieval system typically pulls several kinds of context, and each one earns its place:

  • Schema embeddings. Each table (and often each column) gets a short, retrievable description: its name, its columns, and a one-line gloss of what a row means. You embed those cards and, given a question, retrieve the nearest few. This is ordinary dense retrieval again, but pointed at the schema rather than at documents.
  • Column descriptions. A column named tier means nothing to a model until something says it holds 'free', 'pro', or 'enterprise'. Real column documentation is what turns a cryptic schema into one a model can query correctly.
  • Sample rows. A handful of example rows per table shows the model the actual value formats: that status is the literal string 'shipped' and not 'Shipped' or 'SHIP', that dates are ISO strings. This single addition prevents a large class of off-by-a-string errors.
  • A domain glossary. This is the quiet hero. Users ask about “revenue”, “active users”, “churn”, business words that do not appear as columns. The glossary maps each business term to the columns and computation that encode it: “revenue” is price * quantity summed over orders, not a revenue column that does not exist. Without it the model invents columns; with it the model writes the query you meant.
  • Few-shot SQL exemplars. A few (question, SQL) pairs from your own database teach the model your conventions: how you join, how you name, the quirks of your dialect. These are retrieved too, the most similar past examples to the current question.

You assemble that retrieved context, ask a model to write SQL, and then, crucially, you execute it. Execution is the step that has no analog in the document pipeline, and it is what makes the answer trustworthy: the database computes the number exactly, so the model never has to. The result rows come back, and a final generation step phrases them as an answer. The companion file walks this whole loop in stdlib Python: sql_rag.py builds a tiny SQLite database, retrieves the relevant schema cards by keyword overlap, generates SQL with a rule-based stub (no LLM needed for the demo), executes it against real SQLite, and answers from the rows. Running it on “what was our total revenue from shipped orders?” retrieves the orders and products cards, generates the SUM(p.price * o.quantity) join, executes it, and prints 436.0, a number that exists nowhere in the data until the query computes it.

A word on what the demo mocks and what is real, because the distinction is the lesson. The schema retrieval is keyword overlap rather than embeddings, and the SQL generation is a rule-based stub rather than a model call, both so the file runs on the standard library alone. But the control flow is the production one exactly, and the execution step is genuinely real: the generated SQL runs against a real SQLite database and returns real rows. That is the part that transfers. Swap the keyword scorer for an embedder and the stub for a model, and you have a working text-to-SQL RAG system with the same spine.

Why you retrieve the schema instead of pasting it

The obvious objection is: why retrieve any schema at all? Modern models have large context windows. Why not paste the entire database schema into every prompt and let the model pick what it needs? There are two reasons, and both are about scale.

The first is that enterprise schemas are enormous. A real production database is not the three tidy tables in the demo. It is hundreds or thousands of tables, many with dozens of columns, plus views, foreign keys, and the column documentation and sample rows that make any of it usable. Serialized into a prompt, a large schema runs to tens or hundreds of thousands of tokens before you have written a single example. Pasting it whole is expensive on every request (the input-token cost from Part 12, paid over and over) and frequently does not fit at all.

The second reason is subtler and more important: tabular reasoning degrades as the table grows, well before you hit the context limit. This is a measured, repeated finding. Models show a consistent decline in accuracy on table question-answering as table size increases, and the drop sets in inside the nominal context window, which means the bottleneck is reasoning and attention, not raw capacity. The model can technically hold a hundred thousand tokens of schema and still reason worse over it than over the handful of tables the question actually concerns. Older table-QA benchmarks leaned on small tables (often under a few thousand tokens) precisely because that is where models are strong; push to the larger, scattered tables that real enterprises have and accuracy falls. This is the same lost-in-the-middle dynamic you met with documents in Part 7, now wearing a schema. More context past the relevant set does not help and actively hurts.

Put those two facts together and the conclusion is forced: you cannot paste the whole schema, and even if you could, you should not. So you do what RAG always does. You retrieve the relevant subset, the few tables and columns this question needs, and give the model only those. This is schema linking (sometimes called schema retrieval), and for large databases it is the make-or-break step: text-to-SQL accuracy is highly sensitive to whether the right tables made it into the prompt. Retrieval-augmented schema linking, retrieving the relevant schema subset before generating SQL, is exactly how recent systems scale text-to-SQL to massive databases. The series intuition holds: narrow the haystack first, reason over the needles.

💡 From experience

The first text-to-SQL system I helped ship looked great in the demo and fell over the moment it met the real warehouse. In the demo we had wired up about a dozen carefully chosen tables, the model wrote clean SQL, and everyone was happy. Production had something like nine hundred tables, and a lot of them had near-duplicate names left over from three migrations nobody had finished, things like orders, orders_v2, and orders_archive. The model, handed the whole catalog, kept confidently joining the wrong one. The query parsed, it executed, it returned a number, and the number was quietly drawn from a table that had not been written to in two years. Nothing errored. The fix was not a smarter prompt; it was schema retrieval done seriously, with real per-table descriptions that said which orders table was live, plus sample rows so the model could see one was empty. The moment we retrieved a small, well-described subset instead of dumping the catalog, accuracy jumped and the stale-table answers stopped. The lesson I keep: on structured data, most of your wrong answers are schema-linking failures wearing the costume of a model that “can’t write SQL.” It usually can. It just got handed the wrong tables.

Table retrieval, beyond a single database

Text-to-SQL assumes one queryable database with a known schema. A related problem shows up when the structured knowledge is a pile of tables, spreadsheets, CSVs, extracted HTML tables, with no shared schema and no SQL engine in front of them. Here the first job is table retrieval: given a question, find the table (or the few tables) most likely to contain the answer, out of possibly thousands. You embed a compact representation of each table (its caption, headers, and a sample of rows) and retrieve the nearest, the same dense-retrieval move, pointed at tables instead of passages or schema cards.

Once you have the right table, you face the scaling reality from the previous section in its purest form. A retrieved table can be large, and dumping all of its rows into the prompt runs straight into the degradation curve. So the mature pattern is not to read the table at all but to query it: load the retrieved table into a small in-memory database (or a dataframe), generate code or SQL against it, execute, and answer from the result. The table-retrieval step finds the haystack; the generate-and-execute step does the reasoning the model is bad at doing in its head. Either way, the principle that governs the whole part repeats: retrieve the relevant structure, then compute the answer rather than reading it.

Routing: text-search versus SQL

We now have two kinds of pipeline, the document path from Parts 6 through 17 and the SQL path from this one, and a system serving real questions needs to send each query to the right one. This is the same routing problem as Part 15, and it reuses the same machinery: a small, fast classifier that reads cheap signals off the query and picks a path. Part 15 routed by complexity (none, single, multi); here we add a routing axis, structure, and the natural place to put it is the same classifier.

The signal is reliable because the two question types sound different. Aggregational and numeric questions route to SQL: “how many”, “count”, “total”, “sum”, “average”, “top”, “most”, “per region”, “revenue”, anything that asks for a computed quantity over a set of records. These are precisely the questions no passage can answer, the ones from the opening of this part. Everything else, “what is our refund policy?”, “how do I fix error E-4042?”, the explain-and-describe questions, routes to the document path, where a passage really does hold the answer. The classifier keys on those aggregation words the same way Part 15’s keyed on comparison words.

The companion file’s route function is exactly this fork: a query carrying any aggregation signal goes to the SQL loop, everything else falls back to the document path (mocked there as a stub, since you already built the real one). It is deliberately the same shape as Part 15’s classify_complexity, because that is the point: structured routing is not a separate system bolted on, it is one more branch in the router you already have. In a production system both axes live together. First decide the kind of question (structured or unstructured), then, within each kind, decide how much machinery it needs (Part 15’s complexity routing). The two decisions compose, exactly as source routing and complexity routing did in Part 15.

The honest caveat is the same one Part 15 raised about every classifier: it is a failure surface, and the dangerous direction is the silent one. A numeric question misrouted to the document path will retrieve a plausible passage and let the model guess a number, producing a confident, wrong, ungrounded figure, the exact failure this whole part exists to prevent. So when you tune the structured router, bias it toward sending borderline numeric questions to SQL, where a wrong query at least tends to fail loudly (a SQL error, an empty result) rather than quietly fabricating a number from prose.

Try it yourself

The whole loop is small enough to feel by editing one file. Grab sql_rag.py (stdlib sqlite3 only, no model and no network needed) and run it. You will see the three structured demo questions route to sql, retrieve a two-table schema subset, generate real SQL, execute it, and answer (1 enterprise customer, 436.0 total shipped revenue, 1 refunded order), while the policy question routes to the document path and never touches the database. The bottom line prints all SQL answers match the seeded data: True. Now go break it on purpose, which is where the intuition lives.

First, watch a schema-linking failure. The retrieve_schema function keeps the top-2 tables by keyword overlap. Change k=2 to k=1 and re-run. The revenue question now retrieves only orders, the products table (which holds price) never makes it into the prompt, and generate_sql correctly declines rather than guessing, printing “I could not turn that into a query over the known schema.” That is the right failure: the query needed two tables and only got one, so it refused instead of fabricating. This is the production lesson in one line: on structured data, retrieving too little schema is a wrong (or absent) answer, not a smaller one.

Second, break the glossary’s promise. The revenue SQL exists because the schema cards and glossary say revenue is computed from price * quantity. Open SCHEMA_CARDS and delete the sentence “Revenue is computed from price here times order quantity.” from the products card. Re-run: with “revenue” no longer a keyword on the products card, the products table may no longer be retrieved for the revenue question, and the query declines. That is the glossary earning its keep, made visible. Business vocabulary that is not written down somewhere retrievable is vocabulary the model cannot map to your columns.

Third, add a new question type. Pick an aggregation the stub does not handle yet, say “how many orders are still pending?”, and trace what happens: it routes to sql (it has “how many”), retrieves a schema subset, but generate_sql has no matching intent, so it declines cleanly instead of guessing. Now add a small branch to generate_sql that recognizes “pending” and emits SELECT COUNT(*) FROM orders WHERE status = 'pending';, and watch it answer 1. You have just extended a real text-to-SQL system the honest way: one well-described intent at a time, each grounded in the schema you actually retrieved.

⚠️ Common pitfalls

  • Reaching for dense retrieval on a computed question. “Total revenue”, “how many”, “average per region”: these have no answer sitting in any passage, so retrieving the nearest chunk and letting the model read it produces a fluent, fabricated number. If the answer must be computed over records, it belongs on the SQL path, not the document path.
  • Pasting the whole schema into the prompt. Real schemas are hundreds of tables and tens of thousands of tokens, and even when they fit, tabular reasoning degrades as the table set grows, well before the context limit. Retrieve the relevant schema subset (schema linking); do not dump the catalog.
  • Retrieving table names without descriptions, sample rows, or a glossary. A bare schema lets the model invent columns, mismatch value formats (‘shipped’ vs ‘Shipped’), and guess which of three near-duplicate tables is live. Column descriptions, sample rows, and a domain glossary are what turn schema retrieval into correct SQL; most wrong answers on structured data are schema-linking failures, not generation failures.
  • Executing generated SQL with broad privileges. This is the Part 12 least-privilege rule, now load-bearing: a model writing and running SQL against your live database is a powerful, consequential action. Run generated queries through a read-only connection, never one that can write or drop, and treat the model’s SQL as untrusted until it is validated.
  • Misrouting a numeric question to the document path. The dangerous direction silently fabricates a number from prose. Bias the structured router toward sending borderline aggregational questions to SQL, where a wrong query tends to fail loudly (an error or empty result) rather than quietly inventing a figure.

Key takeaways

  • Dense passage retrieval cannot answer a computed question. Most enterprise knowledge lives in databases and tables, and a number like “total revenue” exists nowhere as text until a query computes it. Retrieving the nearest passage produces a fabricated number; the fix is a query, not a better embedder.
  • Text-to-SQL with RAG swaps two words of the series spine: instead of embed, retrieve, ground, generate it is retrieve schema, generate SQL, execute, answer. The database does the arithmetic exactly, so the model never has to guess it.
  • What you retrieve is more than table names: schema embeddings, column descriptions, sample rows, a domain glossary (mapping business words like “revenue” to the columns and computation that encode them), and few-shot SQL exemplars. The glossary and sample rows prevent most invented-column and wrong-format errors.
  • You retrieve a schema subset rather than paste the whole schema for two reasons: enterprise schemas are far too large for the prompt budget, and tabular reasoning degrades as the table set grows, well inside the context window. Schema linking (retrieving the relevant tables) is the make-or-break step for text-to-SQL at scale.
  • Table retrieval generalizes the idea to a pile of schema-less tables: find the right table by dense retrieval, then query it (load and run code or SQL) rather than reading its rows into the prompt.
  • Routing sends aggregational and numeric questions to the SQL path and everything else to the document path, using the same classifier as Part 15 with one more branch. Bias borderline numeric questions toward SQL, because the misroute toward documents fails silently by fabricating a number.

References

  • Xinyu Liu, Shuyu Shen, Boyan Li, Peixian Ma, Runzhi Jiang, Yuxin Zhang, Ju Fan, Guoliang Li, Nan Tang, and Yuyu Luo. “A Survey of Text-to-SQL in the Era of LLMs: Where are we, and where are we going?” 2024. arXiv:2408.05109. A comprehensive survey of LLM-based text-to-SQL covering the full lifecycle: translation models (and the schema-mapping and ambiguity challenges this part calls schema linking), training data, evaluation, and error analysis. The backbone reference for the text-to-SQL framing here.
  • Jinyang Li, Binyuan Hui, Ge Qu, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen Qin, Ruiying Geng, Nan Huo, et al. “Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs (BIRD).” NeurIPS 2023. arXiv:2305.03111. Introduces BIRD, 12,751 question-SQL pairs over 95 large real-world databases (33.4 GB across 37 domains), built specifically because earlier benchmarks (Spider, WikiSQL) used small schemas with few rows and so hid the real-world difficulty. It is the evidence that real databases are large enough to force schema retrieval and value/content understanding.
  • Jeffrey Eben, Aitzaz Ahmad, and Stephen Lau. “RASL: Retrieval Augmented Schema Linking for Massive Database Text-to-SQL.” 2025. arXiv:2507.23104. A worked instance of the central idea of this part: retrieving the relevant schema subset before generating SQL so that text-to-SQL scales to massive databases, rather than pasting the whole catalog into the prompt.
  • Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, and Jong C. Park. “Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity.” NAACL 2024. arXiv:2403.14403. The paper behind Part 15, whose per-query classifier is the router this part extends with a structured-versus-unstructured branch.

Glossary

  • Structured RAG: retrieval-augmented generation over structured data (databases and tables), where the answer is computed by a query rather than read from a retrieved passage.
  • Text-to-SQL: translating a natural-language question into a SQL query that, when executed against a database, produces the answer.
  • Schema retrieval / schema linking: retrieving the relevant subset of a database’s schema (the tables and columns a question needs) before generating SQL, so the model reasons over a small, relevant schema instead of the whole catalog.
  • Schema card: a short, retrievable description of one table: its name, columns, a one-line gloss of what a row means, and often column descriptions and sample rows; the unit you embed and retrieve in schema linking.
  • Domain glossary: a mapping from business vocabulary (revenue, churn, active user) to the columns and computation that encode it in the schema, so the model does not invent columns that do not exist.
  • Few-shot SQL exemplars: example (question, SQL) pairs from your own database, retrieved by similarity to the current question, that teach the model your conventions and dialect.
  • Execution: running the generated SQL against the live database and reading the result rows; the step with no analog in the document pipeline, and the one that makes the answer exact rather than guessed.
  • Table retrieval: given many schema-less tables (spreadsheets, CSVs, extracted tables), finding the table most likely to contain the answer, then querying it rather than reading all its rows.
  • Structured routing: deciding whether a question is aggregational/numeric (route to the SQL path) or descriptive (route to the document path), using the same small classifier as the complexity router from Part 15.

Next up, Part 19. This was the planned close of the series. Across eighteen parts you built RAG from first principles: why it exists, how embeddings turn meaning into geometry, how to retrieve and rerank and ground and generate, how to measure it, how to run it in production, and now how to reach the structured knowledge that lives in databases rather than documents. The throughline never changed: retrieve the relevant thing, then reason over it, and add complexity only where the evidence demands it. Whether the relevant thing is a passage or a schema, the discipline is the same. There is one move the series kept describing and never ran, though: letting the model drive its own retrieval loop, deciding for itself what to fetch and when to stop. Part 19, Building a RAG Agent, picks the series back up and finally runs it. To revisit any piece or see the whole map at once, head back to the RAG from First Principles hub. Now go point all of it at a real corpus, and a real database, and build something that does not make things up.

RAGText-to-SQLStructured DataSchema RetrievalTablesRoutingLLMAI