Six months in, KAI had a problem. The v0.5 LoRA checkpoint (Mistral 7B, fine-tuned on 2,400 synthetic SOAP notes) was producing reasonable output. But v0.8, which rebased onto Gemma 3, regressed by roughly 9 percentage points on our structured rubric (89% to 81% pass rate). The fine-tuning loop was producing diminishing returns, and we couldn't explain why.
We ran the numbers and ratified a new architecture decision. This is the story of that decision.
What the data said
Our Phase 3 evaluation set had grown to 57 cases across five languages (DE, EN, FR, IT, ES). The benchmark showed:
| Provider | Pass rate | Notes |
|---|---|---|
| mistral-local (v0.5 LoRA) | 51 / 57 (89%) | Failures concentrated in FR/ES |
| kai-local (Gemma rebase) | 46 / 57 (81%) | Regression vs v0.5 |
| mistral-local (RAG v2) | 57 / 57 (100%) | After ADR-007 pivot |
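The percentages in the table follow directly from the raw counts. As a quick sanity check, here is a small sketch that recomputes them from the numbers above; the variable and function names are illustrative only.

```python
# Pass counts copied from the benchmark table; 57 cases in total.
TOTAL_CASES = 57
passes = {
    "mistral-local (v0.5 LoRA)": 51,
    "kai-local (Gemma rebase)": 46,
    "mistral-local (RAG v2)": 57,
}


def pass_rate(passed: int, total: int = TOTAL_CASES) -> float:
    """Return the pass rate as a percentage."""
    return 100.0 * passed / total


rates = {name: pass_rate(count) for name, count in passes.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.1f}%")

# Gap between the v0.5 LoRA and the Gemma rebase, in percentage points.
regression_pp = rates["mistral-local (v0.5 LoRA)"] - rates["kai-local (Gemma rebase)"]
print(f"regression: {regression_pp:.1f} pp")
```

The gap between the two fine-tuned variants comes out at just under 9 percentage points, which is five cases out of 57.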
The failure mode for kai-local was not safety — it got 10/11 species blacklist adversarial cases right. The failure mode was multilingual coverage and citation quality. The model kept hallucinating source references.
We had a 26% phantom citation rate on kai-local. Every claim that couldn't be traced back to a retrieved source chunk was a fabrication. In a clinical context, that's not a model quality problem — it's a trust problem.
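To make the metric concrete, here is a minimal sketch of how a phantom citation rate could be computed. It assumes a claim counts as phantom when none of its cited chunk IDs appear in the retrieved set; the `Claim` shape, the function names, and the toy data are hypothetical and not KAI's actual evaluation code.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    """One clinical claim plus the chunk IDs it cites (hypothetical shape)."""
    text: str
    cited_chunk_ids: list[str] = field(default_factory=list)


def is_phantom(claim: Claim, retrieved_ids: set[str]) -> bool:
    """Assumption: a claim is phantom if no citation resolves to a retrieved chunk."""
    return not any(cid in retrieved_ids for cid in claim.cited_chunk_ids)


def phantom_rate(claims: list[Claim], retrieved_ids: set[str]) -> float:
    """Fraction of claims that cannot be traced to any retrieved chunk."""
    if not claims:
        return 0.0
    phantoms = sum(is_phantom(claim, retrieved_ids) for claim in claims)
    return phantoms / len(claims)


# Toy data for illustration only, not taken from the benchmark.
retrieved = {"chunk-001", "chunk-002"}
claims = [
    Claim("Claim backed by a retrieved chunk", ["chunk-001"]),
    Claim("Claim citing a chunk that was never retrieved", ["chunk-999"]),
    Claim("Claim with no citation at all"),
    Claim("Another grounded claim", ["chunk-002"]),
]
print(f"phantom rate: {phantom_rate(claims, retrieved):.0%}")
```

A stricter variant would flag a claim if any one of its citations fails to resolve; which definition applies is a policy choice the metric has to state up front.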
The diagnosis
Fine-tuning was doing two jobs at once: teaching the model clinical knowledge, and teaching it the SOAP format. The format work was succeeding. The knowledge work wasn't — because knowledge baked into weights is static, unauditable, and expensive to update.
The moment a drug interaction table changes, or a new WSAVA guideline ships, the weights are stale. And you can't tell which claim came from training data vs. which came from the input.
RAG solves the audit problem structurally. If every clinical claim must reference a retrieved chunk, you can trace every assertion back to a source document with a version and a date. The phantom citation rate drops from 26% to 0.2% — not because the model got smarter, but because the architecture now enforces provenance.
What we built (ADR-007)
The new stack:
- Embeddings: BGE-M3 (multilingual, 1024-dim) — runs locally on M5 Max
- Vector store: LanceDB hybrid retrieval (dense + sparse BM25)
- Reranker: MLX bge-reranker-v2-m3 — local inference, 30ms latency
- Citation schema: ADR-008 v3. Every `plan.medications` and `plan.diagnostics` item carries a `cite[]` array of chunk IDs. The Layer-2 classifier rejects responses where Assessment or Plan sections make claims without chunk references.
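The citation rule in the last item can be illustrated with a simple rule-based check. This is a sketch under stated assumptions, not the actual Layer-2 classifier: it only covers `plan.medications` and `plan.diagnostics`, it treats the response as a plain dictionary, and every field other than `plan`, `medications`, `diagnostics`, and `cite` is invented for the example.

```python
from typing import Any


def ungrounded_items(response: dict[str, Any], retrieved_ids: set[str]) -> list[str]:
    """Return paths of plan items whose cite[] is empty or cites an unknown chunk."""
    problems: list[str] = []
    plan = response.get("plan", {})
    for section in ("medications", "diagnostics"):
        for index, item in enumerate(plan.get(section, [])):
            cites = item.get("cite", [])
            # Reject items with no citations or with IDs outside the retrieved set.
            if not cites or any(cid not in retrieved_ids for cid in cites):
                problems.append(f"plan.{section}[{index}]")
    return problems


def passes_citation_check(response: dict[str, Any], retrieved_ids: set[str]) -> bool:
    """A response passes only if every plan item is grounded."""
    return not ungrounded_items(response, retrieved_ids)


# Toy response; item names and chunk IDs are placeholders.
retrieved = {"chunk-012", "chunk-040"}
response = {
    "plan": {
        "medications": [
            {"name": "drug A", "cite": ["chunk-012"]},
            {"name": "drug B", "cite": []},
        ],
        "diagnostics": [
            {"name": "test X", "cite": ["chunk-040"]},
        ],
    }
}
print(ungrounded_items(response, retrieved))
print(passes_citation_check(response, retrieved))
```

Returning the offending paths, rather than a bare boolean, makes a rejection easier to debug or to feed back into a retry.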
The corpus covers Merck Veterinary Manual, PubMed OA (vet-filtered), FDA Animal Drugs, Swissmedic, and AAHA/WSAVA clinical guidelines — all with source metadata, version dates, and section identifiers.
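Retrieval over this corpus is hybrid (dense + sparse BM25), so at some point two rankings of the same chunks get merged. This post does not say how that merge is done. Reciprocal rank fusion (RRF) is one common technique, shown here purely as an illustration in plain Python; it is not a claim about what LanceDB or KAI does internally, and the chunk IDs are placeholders.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of chunk IDs; each hit contributes 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Toy rankings: one from a dense (embedding) search, one from BM25.
dense_hits = ["chunk-a", "chunk-b", "chunk-c"]
sparse_hits = ["chunk-b", "chunk-d", "chunk-a"]
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

Chunks that rank well in both lists rise to the top, which is the property a hybrid setup relies on before handing candidates to a reranker.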
The moat reframed
We spent the first four months thinking the moat was the weights: a fine-tuned model that knew veterinary medicine better than the base model. That turned out to be the wrong frame.
The actual moat is the corpus and the trust infrastructure:
- Corpus quality — curated, versioned, Swiss-and-EU-oriented clinical literature. When Swissmedic updates a drug monograph, we update the index. The model doesn't need to be retrained.
- Citation enforcement — structural, not probabilistic. A response without grounded citations cannot pass the Layer-2 check and is rejected before it reaches the vet's screen.
- Multilingual parity — BGE-M3 embeds across all five languages natively. The retrieval quality in FR and IT is now comparable to DE/EN, which was impossible with a weight-baked approach.
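The "update the index, not the model" idea from the corpus-quality point can be sketched as a versioned upsert. This is a toy in-memory illustration under assumptions: the `Chunk` fields, the `CorpusIndex` class, the document key, and the dates are all hypothetical, and a real index would also recompute embeddings for the new chunks.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Chunk:
    """A corpus chunk with provenance metadata (hypothetical field names)."""
    chunk_id: str
    source: str
    document: str
    section: str
    version_date: date
    text: str


class CorpusIndex:
    """Toy in-memory index keyed by (source, document)."""

    def __init__(self) -> None:
        self._docs: dict[tuple[str, str], list[Chunk]] = {}

    def upsert_document(self, chunks: list[Chunk]) -> bool:
        """Replace a document's chunks if the incoming version is newer."""
        if not chunks:
            return False
        key = (chunks[0].source, chunks[0].document)
        current = self._docs.get(key)
        if current and current[0].version_date >= chunks[0].version_date:
            return False  # Existing version is the same age or newer; keep it.
        self._docs[key] = list(chunks)
        return True

    def chunks_for(self, source: str, document: str) -> list[Chunk]:
        """Return the currently indexed chunks for one document."""
        return list(self._docs.get((source, document), []))


# Toy data: an old and a new version of the same (made-up) monograph.
index = CorpusIndex()
old = [Chunk("c-1", "Swissmedic", "monograph-x", "Dosage", date(2024, 1, 1), "old text")]
new = [Chunk("c-2", "Swissmedic", "monograph-x", "Dosage", date(2025, 1, 1), "new text")]
index.upsert_document(old)
index.upsert_document(new)
print([chunk.text for chunk in index.chunks_for("Swissmedic", "monograph-x")])
```

Because each chunk carries its own version date, a citation that points at a chunk also pins down which edition of the source the claim was grounded in.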
The v0.5 LoRA still ships as kai-local — it's the sovereignty option for clinics that cannot send data to any cloud provider, even Anthropic EU. But it's no longer the primary quality lever.
What this means for the vet surface
For the clinician using Loki, the visible change is citation chips. Every SOAP Assessment and Plan item that references a clinical source shows a small chip with the source name and section. Clicking it opens the retrieved passage in a side panel.
This does two things: it gives the vet a way to verify any AI claim in under 10 seconds, and it gives us an audit trail that satisfies the clinical governance requirements we're working through with Swissmedic.
The 99.8% citation grounding rate (1 phantom per 492 grounded claims, in a 57-case benchmark) is the number we're taking into the Vetsuisse pilot conversation. It's not a product claim — it's a measurable property of the architecture.
KAI is the clinical AI engine that powers Loki's vet surface. It runs locally on Swiss compute. Source: github.com/thoughtful-toby/kai