2026-06-01

Fine-tuning an SLM on UAV combat doctrine

End-to-end notes from a $55 LoRA build on gpt-oss-20b: the data pipeline, the training run, the evaluation, and the surprise that the biggest win was teaching a reasoning model to stop reasoning.

The headline that isn’t the story

I fine-tuned a LoRA adapter on top of OpenAI’s open-weight gpt-oss-20b for UAV combat doctrine Q&A. On a stratified test set of 1,873 questions, a pair-wise GPT‑5.4 judge picked the adapter (“v4”) over the un-adapted base model 82.6% of the time (95% CI 80.9–84.4%). Headline-grade number. End of story.

Except: 91.8% of the baseline’s outputs were empty.

The base model, gpt-oss-20b served through the same renderer the adapter was trained against, burned its entire 300-token budget inside <|channel|>analysis|> and never opened a final channel. 1,721 out of 1,875 test cases came back as the empty string. The judge dutifully scored those at the floor and handed the win to v4.

The real result of this build wasn’t doctrinal accuracy. It was a behavioral edit: I taught a reasoning-tuned model to stop reasoning out loud. Once you control for the empty-answer baseline, v4’s clean doctrine-knowledge advantage shrinks from 82.6% to 77.9%, still a real edge, but no longer the story the headline number tells.

This essay walks the build end-to-end (corpus, chunks, Q&A generation, LoRA, ablation, eval, and the methodology potholes I fell into along the way). The repo is at mftnakrsu/gpt-oss-20b-uav-doctrine; the live demo is on HuggingFace Spaces.

What I trained

Base model: openai/gpt-oss-20b, an open-weight mixture-of-experts model, ~21B parameters total, ~3.6B active per token, 32 experts.
Task: UAV / UAS combat-doctrine Q&A. Single-turn, direct-answer style.
Adapter: LoRA, rank 16, α 32, on all-linear + lm_head, including the per-expert MoE projections.
Training data: 18,730 synthetic Q&A pairs generated by GPT-5.4 from 58 publicly-released, unclassified doctrine documents.
Infra: Everything served through the Tinker managed API, training and sampling both. No local GPU at any point. A laptop drove the API, built datasets, and ran evaluation.
Total spend: ~ $55 single run, ~$ 62 with a four-variant ablation. Azure OpenAI for Q&A generation + judge dominated the bill ( $53.45). Tinker LoRA training was ~$ 1.59 at the ADR-0001 placeholder rate.

Treat it as a research artifact. It is not validated for any operational, targeting, or safety-of-flight purpose. The corpus is intentionally and entirely public: no FOUO, no CUI, no NOFORN. Outputs can omit specifics or paraphrase technical terms; always verify against the cited publication.

From PDF to chunk to pair

The pipeline is mechanical: 58 unclassified PDFs go in, 18,730 Q&A pairs come out.

Pipeline funnel showing data transformations from PDFs to Q&A pairs — Fig 1 The funnel: 58 unclassified PDFs → cleaned text → 4,051 chunks → 3,747 after quality filter → 18,730 synthetic Q&A pairs (5 per chunk, generated by GPT-5.4).

The corpus is tiered. Tier 1 (official doctrine): US joint and service publications (JP series, ADP, ATP, AFDP, FM), NATO AJPs, UK JDPs, civil-aviation references from FAA / EASA / ICAO, plus DoDD 3000.09 and the DoD Dictionary. Tier 2 (academic and war college): NPS theses, DTIC reports, HDIAC. Tier 3 (think tanks): ICRC, CNAS, CRS, RUSI. Tier 4 (modern case studies): RUSI and LIIA Ukraine writeups, US Army Infantry Magazine. 60.8 MB of expansion on top of the original 98 MB baseline, ~2.37M total corpus tokens under the gpt-oss-20b tokenizer.

Cleaning is pypdf extraction plus a header/footer heuristic; retention is 95.2%, i.e., 4.8% of the raw extracted tokens are stripped because the same line repeated across ≥30% of pages (page numbers, classification stamps, document IDs). Chunking targets 1,200 tokens with 150 token overlap; output is 4,051 chunks at a 924-token median.

The quality filter is the most invisible step in the pipeline and the one I’d argue you can’t skip. The first pass through GPT-5.4 to generate Q&A pairs on every chunk produced a fair number of low-signal entries: TOCs, bibliographies, ”© 2023 …” reprint notices, “This page intentionally left blank.” A two-pronged filter (heuristic + tiny LLM classifier) dropped 304 chunks (7.5%), leaving 3,747 (92.5%). The 7.5% you remove are not the ones whose Q&A pairs are wrong; they’re the ones whose pairs are vapid: true, but conveying nothing. Vapid pairs train the model to produce vapid completions.

Q&A generation is 5 pairs per chunk via GPT-5.4 (Azure), targeted across four question types (factual, definitional, procedural, conceptual) at roughly equal proportions (23–27% each). 81 minutes wall, ~ $48 total,$ 0.0026 per pair. One chunk hit the Azure content policy filter and yielded zero pairs; everything else parsed clean.

The committed splits at data/processed/qa/ (train 15,905 / val 935 / test 1,875 / held-out-source 15) are chunk-stratified, not random-stratified, meaning every chunk lives entirely in one split. This prevents the eval from being inflated by near-paraphrases of training questions sitting in test.

The training run

LoRA on Tinker. The renderer is gpt_oss_no_sysprompt, so the system message gets mapped to the gpt-oss developer role as # Instructions. r=16, α=32, all-linear + lm_head, 2 epochs, batch size 32, max sequence length 512, lr 2e-4 with 100-step warmup and cosine decay, AdamW β₂=0.95 (gpt-oss-shaped).

Validation negative-log-likelihood curve descending from 2.25 to 1.47 over 994 training steps — Fig 2 Validation NLL over training (994 steps, 2 epochs). Strong early descent, with a clear knee around the epoch-1/epoch-2 boundary as the cosine LR decays.

994 steps, 2 epochs, 3.98M tokens processed, 1h07m wall. Validation NLL goes from 2.2539 at step 50 to 1.4706 at step 994, a 0.78 nat improvement and no divergence. Cost at the ADR-0001 placeholder rate ( $0.40 per 1M training tokens): **$ 1.59**. No guardrail tripped, no cost cap, no walltime cap, no divergence-streak threshold. Reproducible from scripts/train_phase_a.py with the config at configs/tinker/phase_a_qa/run_v1.yaml.

A note on the MoE LoRA placement. The adapter sits on the per-expert projections of the gpt-oss MoE, plus attention, plus lm_head. Some transformers / peft versions represent gpt-oss experts as a single fused module, and vanilla loading won’t apply the per-expert LoRA cleanly. The model was trained and evaluated through Tinker’s sampling API; the published HuggingFace adapter is for loading on top of the base in environments where the per-expert wiring is correct. The Space demo is the reference, known-good way to run it.

The ablation that didn’t matter

I ran a 4-variant single-axis ablation (each variant moves one hyperparameter off the v1 baseline so deltas are attributable):

variant	δ	min smooth train NLL	final val NLL	test NLL	Δ vs v1
v1	baseline	1.6380	1.4706	2.0037	+0.0000
v2	r 16 → 32	1.6356	1.4515	2.0026	-0.0011
v3	2 → 3 ep	1.3620	1.1418	2.0811	+0.0774
v4	lr → 2e-4	1.5230	1.2808	1.9868	-0.0168

v3 is interesting: the lowest train NLL in the sweep, but the highest test NLL. Classic memorization on a 3rd epoch of 18.7k pairs. The capacity bump in v2 didn’t materially help. v4 won by 0.02 nats on test, picked as the production checkpoint, and then forgotten by the rest of this essay because the variant choice isn’t where the story lives.

Cost for the entire 4-variant sweep including v3’s extra epoch: ~$7.16 in Tinker training. The real bill, and the real signal, were in the evaluation that followed.

The judge run, and the empty answers

Evaluation is a pair-wise GPT-5.4 judge on three axes (factual accuracy, completeness, grounding), 1–5 each, plus a categorical winner (A, B, or tie). Model A is always v4, Model B is always the un-adapted baseline. Position bias is not counter-balanced yet (see §position bias). 1,875 test cases, 5 concurrency, $40 hard cost cap. Settled at **$ 5.64** judge total.

The aggregate output:

v4 wins: 82.6% (95% CI 80.9–84.4%, n=1,873)
baseline wins: 1.3%
ties: 16.1%

Per axis on test:

metric	v4	baseline	delta
factual_accuracy_mean	2.719	1.081	+1.638
completeness_mean	2.340	1.061	+1.279
grounding_mean	2.613	1.054	+1.559

Then I looked at the baseline outputs. The renderer puts gpt-oss in Reasoning: medium. The base model entered the <|channel|>analysis|> block, spent the entire 300-token budget reasoning about the question, and never opened a final channel. 1,721 of 1,875 outputs (91.8%) returned empty after channel stripping. On the 15-case held-out-source split, the baseline produced zero final answers.

Bar comparison: baseline output budget filled with analysis CoT and empty final channel, v4 budget mostly unused with a small direct answer — Fig 3 What each model did with its 300-token output budget on the test set. Baseline burned the entire budget in hidden CoT (mean 297 / 300 tokens, ~92% of cases returned empty). v4 was trained on direct-answer pairs and learned to skip the analysis channel, mean 49 tokens, every one of them in the final channel.

The judge sees Model B = "" and floors it. v4 wins by default. Most of the “82.6%” headline is the gap between answering and not answering, not the gap between right and wrong answers.

What the un-adapted baseline actually does

gpt-oss is a reasoning model. It was trained to think first inside an analysis channel, then commit the user-visible response inside a final channel. At the medium reasoning effort the renderer sets, the model defaults to substantial deliberation before answering, which works fine when the output budget is generous and the question rewards step-by-step thought, and not at all when you cap it at 300 tokens on a domain-knowledge question that wants a one-paragraph answer.

What the SFT on (question, direct-answer) pairs does is teach the model that the answer to a UAV-doctrine question is a paragraph, not a chain of thought. The training data has no analysis channel. The loss only fires on the direct-answer tokens. Over 994 steps, the model learns that the right action after the chat-template prompt is to open final and answer, not to open analysis and deliberate.

The numerical evidence is unambiguous: baseline mean output 297.5 tokens, almost always at the 300 cap; v4 mean output 49.4 tokens. v4’s max output across the full test set is 123 tokens, and it doesn’t get close to the cap because it doesn’t need to.

This isn’t a bug in the base model. It’s the model behaving correctly for what it was trained to do. The interesting observation is that SFT on direct-answer pairs is a remarkably leveraged way to override reasoning-mode behavior, even with a small LoRA and a small budget. You don’t need to add a reasoning-suppression preference dataset, and you don’t need RLHF. A few thousand QA pairs in (prompt, direct answer) shape, two epochs, and the model picks up “answer immediately” as a behavioral prior. The doctrine knowledge gets absorbed alongside it as a side benefit.

That said, this is the bug-shaped version of the story. The fairer version: I never validated whether the baseline could reach the same quality with a larger output budget or a different renderer. A practitioner deploying gpt-oss-20b for a similar task could probably get respectable answers by raising the cap to 800–1000 tokens, parsing the final channel after the model finishes its analysis, and skipping the fine-tune entirely. The win rate I’m quoting is real, but it is bracketed by an engineering choice (300-token cap, single-shot extract) that is itself contestable.

The honest reframe

Take only the 154 test cases where the baseline did produce a final answer. This is the cleanest apples-to-apples doctrine-knowledge comparison the eval supports:

metric	v4	baseline	delta
factual_accuracy_mean	3.227	1.981	+1.247
completeness_mean	2.773	1.740	+1.032
grounding_mean	2.994	1.662	+1.331
v4 win rate (n=154)	77.9%	(baseline 16.2%, ties 5.8%)	n/a

77.9% is still a substantial win. The adapter has absorbed real doctrine knowledge (these aren’t toy deltas) and it answers more accurately, more completely, with better grounding when both models actually produce an answer.

Bar chart comparing v4 win rates across full test set (82.6%), non-empty baseline subset (77.9%), and held-out-source (66.7%) — Fig 4 The headline number decomposed. Full-set v4 win rate (82.6%) is inflated by baseline budget exhaustion; the non-empty-baseline subset (77.9%) is the cleaner doctrine-knowledge signal; held-out-source generalization (66.7%, n=15) is directional only.

The held-out-source split (sources never seen during training: CRS counter-UAS and FAA UAS lost-link) gives v4 a 66.7% win rate at n=15 with a 95% CI of 40–93%. Anecdotal, not statistical, but directionally positive: the adapter generalizes to adjacent doctrine that wasn’t in the training corpus, at least within the genre.

Two more findings worth knowing

These were the methodology potholes, both of them landmines for anyone running a similar SFT experiment.

(1) Training-time validation NLL is not inference-path eval NLL. They are on different scales and computed via different code paths.

Two parallel measurement paths producing different NLLs from the same checkpoint and same data — Fig 5 Two NLLs that look the same and are not. Training-time val_nll runs through the forward/backward path and was 1.28 for v4. The same checkpoint scored through Tinker's inference-path compute_logprobs gave 1.99, a 0.71-nat gap. Picking a checkpoint on training-time val NLL is picking on the wrong measurement.

v4’s training-time val_nll (1.28) sat ~0.71 nats below the same checkpoint’s inference-path test NLL (1.99). Same data, same checkpoint, different path. The training-time value is computed through the optimizer’s forward/backward pass; the inference-time value is computed through the sampling API’s compute_logprobs endpoint. They’re both honest, both measuring something legitimate, but they’re not measuring the same thing, and the constant offset is large enough to flip a checkpoint selection if you happen to be comparing two variants that are close on one path and far on the other. Never pick a checkpoint on training-time val NLL; re-score every candidate via the inference path before declaring a winner.

The corollary, which the ablation table earlier in this essay quietly demonstrates, is that eval NLL doesn’t track judge-perceived quality either. procedural questions had the highest NLL across all four variants (v4 procedural test NLL: 2.04, definitional: 1.77), but procedural was not the lowest-scoring category under the judge. factual was. Loss is not quality.

(2) Position bias was absent under structured-rubric judging. The headline 82.6% used v4 always in position A. I re-judged a 300-example stratified sample with positions randomized per call (4,242 RNG seed, same Prompt #7 verbatim) and recovered:

Position-A preference rate: 48.5% (n=511 non-tie decisions across 600 calls)
Model-agreement rate: 94.7% (284 / 300 examples had original and swap agree on which model won)
Bias-corrected v4 win rate: 80.7% (95% CI 76.3–85.0%)

48.5% is indistinguishable from 50%. The chat-arena literature reports ≈5–10pp toward A on free-form preference judging; this rubric (three axes scored 1–5 plus a strict winner field) appears to be largely position-symmetric. The reproducibility delta on the win rate (82.6 → 80.7%, i.e. -1.9pp) is also within the original CI. Position bias may be a property of free-form preference judging, not strict rubric judging. If you’re judging on a rubric like this one, the symmetry-vs-randomization cost may not be worth paying for every run.

Honest limitations

Specificity drift. v4 reliably gets the shape of an answer right but can drop or invent specifics: wrong CPDLC message codes, missing conditions on a procedural rule, approximate page-cited figures. Treat any specific term, number, or message ID as something to verify against the source doctrine, not a finished fact.
The win rate is inflated by baseline budget exhaustion. Reframed above; the 77.9% number is the one I quote in technical settings.
Held-out-source generalization is anecdotal. 15 unseen-source examples; the wide CI reflects this.
No safety / red-teaming pass. This is a doctrine-QA SFT adapter (milestone M1), not a safety-tuned assistant. The corpus excludes any classified, FOUO, CUI, or NOFORN material, so the model has no inappropriate knowledge to leak, but it also hasn’t been audited for prompt-injection robustness or unsafe-refusal behavior.
English only, and a snapshot of doctrine as of May 2026. Newer editions are not reflected.

Why this matters beyond UAV doctrine

The lesson I’d carry into the next fine-tune isn’t about doctrine. It’s about token economy on reasoning models.

In 2026, the gpt-oss family, and the broader open-weight reasoning tier, defaults to substantial hidden reasoning before answering. That’s a feature for math and code, where the chain of thought is load-bearing and the user is fine paying for it in latency. It is a failure mode for domain-knowledge QA at a tight output budget, where most of the work the model should be doing is recall, not deliberation, and the user expects a direct answer.

SFT on (prompt, direct answer) pairs is the highest-leverage cheap behavioral edit available. A few thousand pairs, a couple of epochs, ~$2 of Tinker training, and the model stops spending its budget on analysis and starts answering. The domain knowledge it picks up along the way is real, but secondary. The token-economy edit is the primary product of the fine-tune.

If you’re building a small-LLM domain assistant on top of a reasoning base in 2026, the most valuable measurement you can take before fine-tuning is the baseline’s empty-answer rate at your target output budget. If it’s substantial, say, anything over 20%, the first thing fine-tuning will buy you is final-channel discipline, and you should size your win-rate expectations and your training data shape around that finding.

Try it

The live HF Space, meftun/gpt-oss-20b-uav-doctrine-demo, runs both models head-to-head through the Tinker API. No setup; the first reply is ~8 s cold, then faster. The adapter is published at meftun/gpt-oss-20b-uav-doctrine; the full reproducible repo, including all 14 build journals and the eval data, is at mftnakrsu/gpt-oss-20b-uav-doctrine.

It is a research artifact. It is not a decision authority, is not validated for any operational, targeting, or safety-of-flight purpose, and must never substitute for the authoritative doctrinal publications or a qualified human in the loop. Verify any specific term, number, or rule against its source before relying on it.

LLMsFine-tuningLoRASLMReasoningMoE