Case study · 2026 · AI engineering
Distill.
A document-intelligence engine that turns messy business paperwork into structured, verified data: extraction with per-field confidence, answers grounded in the source, and an agent that drafts only from what was actually found. Built to run on samples today and live against Claude the moment a key is connected.
- Role: Architecture, AI engineering, build
- Stack: Next.js / Claude API / TypeScript
- Surface: Document intelligence
- Status: Working sample, live-ready
Context
Most AI demos are toys. Real products refuse to guess.
Plenty of tools will read a document and hand back a confident answer. The problem is the confidence: a model that always answers will, often enough, answer wrong, and in a contract or an invoice a wrong number is worse than no number at all.
Distill is built around the opposite instinct. It extracts a fixed set of fields from a document, tags each with where it came from and how sure it is, and flags anything it cannot find rather than inventing a plausible value. A human can see exactly what to trust. The engine is a generalization of one I built for a production benefits tool, widened from a single insurance schema to any document type.
I built it as the kind of artifact a buyer can actually evaluate: a working demo that runs instantly on prepared documents, with the live engine wired behind the same interface and ready to switch on.
Approach
Four moves, one rule: never fabricate.
Extract
Claude fills a strict, typed schema from the document. Every value carries the section it came from and a confidence, so the output is auditable, not a wall of prose.
Flag, don't guess
Absent or ambiguous fields come back as not-found with a null value. A missing value surfaces honestly; it is never fabricated to look complete.
Answer, grounded
Questions are answered only from the extracted fields, with the fields they drew on cited back. If the document doesn't contain the answer, the system says so.
Draft, from verified data
A bounded tool-using agent assembles a finished artifact, pulling each value through a read-only lookup so it can only use data that was actually verified.
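The grounding rule for answers can be sketched as a check that runs after the model responds. This is an illustration under assumptions, not Distill's actual code: `GroundedAnswer`, `checkGrounding`, and the refusal text are hypothetical names. An answer is accepted only if every field it cites exists and was actually found; anything else collapses to an explicit "not in the document".

```typescript
type Confidence = "high" | "medium" | "low" | "not_found";

interface FieldValue {
  value: string | null;
  source: string | null;
  confidence: Confidence;
}

// Hypothetical answer shape: the text plus the names of the fields it drew on.
interface GroundedAnswer {
  text: string;
  citedFields: string[];
}

const NO_ANSWER: GroundedAnswer = {
  text: "The document does not contain this information.",
  citedFields: [],
};

// True only for a field that exists in the extraction and carries a real value.
function isFound(fields: Record<string, FieldValue>, name: string): boolean {
  if (!Object.prototype.hasOwnProperty.call(fields, name)) return false;
  const field = fields[name];
  return field.confidence !== "not_found" && field.value !== null;
}

// Accept an answer only when it cites at least one field and every citation is found.
function checkGrounding(
  answer: GroundedAnswer,
  fields: Record<string, FieldValue>,
): GroundedAnswer {
  if (answer.citedFields.length === 0) return NO_ANSWER;
  const grounded = answer.citedFields.every((name) => isFound(fields, name));
  return grounded ? answer : NO_ANSWER;
}
```

The check is deliberately strict: one uncited or unfound field is enough to withhold the answer, which matches the "never fabricate" rule at the cost of occasionally refusing a partly supported answer.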
Under the hood
The engineering is the proof.
Every decision below exists to make the output trustworthy enough to act on. That is the whole job of an AI product: not to sound right, but to be checkable.
Structured output
Claude fills a JSON schema rather than free text, so every value lands in a typed contract the UI can render directly. Numbers never come from open-ended generation.
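One way to get that typed contract is to describe each field as JSON Schema and hand it to the model as a tool's `input_schema`, so the response arrives as structured tool input rather than prose. The sketch below assumes that approach; the source does not say which structured-output mechanism Distill uses, and the tool name `record_extraction` and the field names are hypothetical.

```typescript
// JSON Schema for one extracted field: value, source, and a closed confidence enum.
const fieldSchema = {
  type: "object",
  properties: {
    value: { type: ["string", "null"] },
    source: { type: ["string", "null"] },
    confidence: { type: "string", enum: ["high", "medium", "low", "not_found"] },
  },
  required: ["value", "source", "confidence"],
  additionalProperties: false,
} as const;

// Build a tool definition whose input schema requires every named field.
// The tool name and description are illustrative, not taken from the source.
function buildExtractionTool(fieldNames: string[]) {
  const properties: Record<string, typeof fieldSchema> = {};
  for (const name of fieldNames) {
    properties[name] = fieldSchema;
  }
  return {
    name: "record_extraction",
    description: "Record the fields extracted from the document.",
    input_schema: {
      type: "object" as const,
      properties,
      required: fieldNames.slice(),
      additionalProperties: false,
    },
  };
}

// Hypothetical invoice schema with three fields.
const invoiceTool = buildExtractionTool(["invoice_number", "total_due", "due_date"]);
```

Because every field is listed under `required`, a document that lacks a value still has to return an entry for it, and the only schema-valid way to do that without a value is `null` with `not_found`.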
Confidence over completeness
The model is instructed to return not-found with a null value rather than guess. The discipline lives in the prompt, and the schema's confidence enum gives not-found a typed slot of its own, so an honest gap is a valid response rather than a parse failure.
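A prompt and an enum still leave room for inconsistent pairs, such as a null value marked "high" or a value marked "not_found". A small post-check can close that gap. The following is a sketch under assumptions: `enforceNotFound` is a hypothetical helper, and the source does not state that Distill normalizes output this way.

```typescript
type Confidence = "high" | "medium" | "low" | "not_found";

interface FieldValue {
  value: string | null;
  source: string | null;
  confidence: Confidence;
}

const NOT_FOUND: FieldValue = { value: null, source: null, confidence: "not_found" };

// Collapse any inconsistent pairing to an honest not-found:
// a null or blank value can never carry a confidence, and a
// not_found confidence can never carry a value.
function enforceNotFound(field: FieldValue): FieldValue {
  const blank = field.value === null || field.value.trim() === "";
  if (blank || field.confidence === "not_found") {
    return { ...NOT_FOUND };
  }
  return field;
}
```

The rule always resolves in the conservative direction: when the value and the confidence disagree, the field is treated as missing rather than trusted.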
Native document handling
PDFs and images go to Claude as document and image blocks, so scanned and photographed pages work without bolting on a separate OCR pass.
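As a rough illustration, a file can be mapped to a content block by MIME type before it is sent. The block shapes follow Anthropic's Messages API documentation for base64 sources; the helper names `toContentBlock` and `buildUserMessage` are hypothetical, and this is a sketch rather than the project's actual upload path.

```typescript
type ImageMime = "image/jpeg" | "image/png" | "image/gif" | "image/webp";

interface DocumentBlock {
  type: "document";
  source: { type: "base64"; media_type: "application/pdf"; data: string };
}

interface ImageBlock {
  type: "image";
  source: { type: "base64"; media_type: ImageMime; data: string };
}

interface TextBlock {
  type: "text";
  text: string;
}

const IMAGE_MIMES: ImageMime[] = ["image/jpeg", "image/png", "image/gif", "image/webp"];

function isImageMime(mime: string): mime is ImageMime {
  return (IMAGE_MIMES as string[]).indexOf(mime) !== -1;
}

// Map a file to the matching block; anything else is rejected up front.
function toContentBlock(mime: string, base64Data: string): DocumentBlock | ImageBlock {
  if (mime === "application/pdf") {
    return {
      type: "document",
      source: { type: "base64", media_type: "application/pdf", data: base64Data },
    };
  }
  if (isImageMime(mime)) {
    return { type: "image", source: { type: "base64", media_type: mime, data: base64Data } };
  }
  throw new Error(`Unsupported file type: ${mime}`);
}

// One user turn: the file first, then the extraction instruction.
function buildUserMessage(mime: string, base64Data: string, instruction: string) {
  const content: Array<DocumentBlock | ImageBlock | TextBlock> = [
    toContentBlock(mime, base64Data),
    { type: "text", text: instruction },
  ];
  return { role: "user" as const, content };
}
```

Rejecting unsupported types before the request keeps the failure local and cheap, instead of surfacing as an API error after an upload.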
Prompt caching
The system prompt is byte-identical across requests, so a repeat request made while the cache entry is still alive reads the instructions from Anthropic's prompt cache at a reduced rate instead of paying full input price to re-process them each time.
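A minimal sketch of what this implies in the request body, assuming the system prompt is sent as a text block marked with `cache_control`. The first two prompt lines are invented for illustration; only the closing rule is quoted from this case study. Anthropic documents a minimum cacheable prompt length and a limited cache lifetime, so whether a given request actually hits the cache depends on both.

```typescript
interface SystemBlock {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
}

// A constant with no timestamps or per-request IDs interpolated,
// so the bytes are identical on every call.
const SYSTEM_PROMPT = [
  "You extract fields from business documents into the provided schema.",
  "If a field is absent or ambiguous, return null with confidence not_found.",
  "A missing value is acceptable; a wrong value is not.",
].join("\n");

// Mark the system block as cacheable.
function buildSystem(): SystemBlock[] {
  return [{ type: "text", text: SYSTEM_PROMPT, cache_control: { type: "ephemeral" } }];
}

// Subset of the usage object returned with a response.
interface Usage {
  cache_read_input_tokens?: number | null;
  cache_creation_input_tokens?: number | null;
}

// True when at least part of the prompt was served from the cache.
function readFromCache(usage: Usage): boolean {
  return (usage.cache_read_input_tokens ?? 0) > 0;
}
```

Checking `cache_read_input_tokens` on each response is a cheap way to confirm the cache is doing what the design assumes, rather than taking it on faith.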
Read-only drafting agent
The draft step is a bounded tool-use loop with a single get_field tool and a hard round cap. The agent can read verified values and nothing else, so it can't wander.
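The shape of such a loop can be sketched as follows. To keep it runnable without a network, the model call is injected as a function (`NextStep`) and reduced to two outcomes; that reduction, the names, and the cap of 8 rounds are assumptions for illustration, not details from the source.

```typescript
type Confidence = "high" | "medium" | "low" | "not_found";

interface FieldValue {
  value: string | null;
  source: string | null;
  confidence: Confidence;
}

// One model turn, reduced to the two outcomes the loop cares about.
type AgentStep =
  | { kind: "tool_call"; fieldName: string }
  | { kind: "final"; draft: string };

// Injected stand-in for the model call; it sees the tool results so far.
type NextStep = (toolResults: string[]) => Promise<AgentStep>;

const MAX_ROUNDS = 8; // hypothetical cap

// The only tool: a read-only lookup over verified fields.
function getField(fields: Readonly<Record<string, FieldValue>>, name: string): string {
  const exists = Object.prototype.hasOwnProperty.call(fields, name);
  const field = exists ? fields[name] : undefined;
  if (!field || field.value === null) {
    return JSON.stringify({ name, found: false });
  }
  return JSON.stringify({ name, found: true, value: field.value, source: field.source });
}

// Run until the model returns a draft or the round cap is hit.
async function runDraftAgent(
  fields: Readonly<Record<string, FieldValue>>,
  nextStep: NextStep,
  maxRounds: number = MAX_ROUNDS,
): Promise<string> {
  const toolResults: string[] = [];
  for (let round = 0; round < maxRounds; round++) {
    const step = await nextStep(toolResults.slice());
    if (step.kind === "final") return step.draft;
    toolResults.push(getField(fields, step.fieldName));
  }
  throw new Error(`Draft agent exceeded ${maxRounds} rounds`);
}
```

The agent has no write path and no second tool, so the worst it can do is ask for a field that is not there and be told so; the cap turns a runaway loop into an ordinary error.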
Sample now, live later
The API routes serve cached samples with no key and call Claude when one is present, with no client changes. A failed live call falls back to cache, so the demo never breaks.
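The routing logic can be sketched as a single function that a route handler would call. This is a hedged illustration: `extractWithFallback`, the `mode` tag, and the injected `callLive` function are hypothetical, and in a real route the key would come from server-side configuration such as an environment variable.

```typescript
// Tagging the result lets a caller tell a live answer from a cached one.
interface ExtractResult<T> {
  data: T;
  mode: "sample" | "live" | "fallback";
}

async function extractWithFallback<T>(
  apiKey: string | undefined,
  cachedSample: T,
  callLive: (apiKey: string) => Promise<T>,
): Promise<ExtractResult<T>> {
  // No key: serve the prepared sample without attempting a call.
  if (!apiKey) {
    return { data: cachedSample, mode: "sample" };
  }
  try {
    return { data: await callLive(apiKey), mode: "live" };
  } catch (error) {
    // A failed live call degrades to the cached result instead of an error page.
    console.warn("Live extraction failed, serving cached sample:", error);
    return { data: cachedSample, mode: "fallback" };
  }
}
```

The `mode` field is an addition of this sketch rather than something the source describes; it keeps the fallback honest, since a cached result served after a failure can be labeled as such instead of passing for a live one.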
The contract
Every field knows where it came from.
The whole system rests on one small type. A value is never just a value: it carries its source and its confidence, and a confidence of not-found is a first-class result, not an error.
type Confidence = "high" | "medium" | "low" | "not_found";

interface FieldValue {
  value: string | null; // null when not found, never a guess
  source: string | null; // e.g. "Section 7.2"
  confidence: Confidence;
}

The model is told, in a cached system prompt: a missing value is acceptable; a wrong value is not. That single instruction is what separates a tool you can act on from one you have to double-check.
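To make "first-class result" concrete, here is a small sketch of how a consumer might handle every confidence level without treating not-found as a failure. The `describeField` helper, its wording, and the choice to mark medium and low values for review are assumptions, not the project's actual rendering code.

```typescript
type Confidence = "high" | "medium" | "low" | "not_found";

interface FieldValue {
  value: string | null;
  source: string | null;
  confidence: Confidence;
}

// Every confidence level has its own branch; not_found is rendered, not thrown.
function describeField(name: string, field: FieldValue): string {
  const source = field.source ?? "source unknown";
  switch (field.confidence) {
    case "not_found":
      return `${name}: not found in document`;
    case "low":
    case "medium":
      return `${name}: ${field.value} (${field.confidence} confidence, ${source}) [review]`;
    case "high":
      return `${name}: ${field.value} (${source})`;
  }
}
```

Because the switch covers the whole `Confidence` union, adding a new level later would fail to compile under strict settings until it is handled, which keeps the contract and its consumers in step.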
By design
The decisions, stated plainly.
These are design facts, not performance claims. Accuracy, latency, and cost are worth measuring against real traffic; this build runs on prepared samples, so I am not going to put a number on the wall that I have not earned yet.
Sample documents
3
A vendor contract, an invoice, and a benefits summary, deliberately cross-industry so the engine reads as general-purpose, not a single-domain trick.
Fabricated values
0
Each sample leaves one field genuinely absent from the document, and the engine reports it as not-found rather than inventing a plausible answer.
Tools the agent can call
1 read-only
The drafting agent's entire world is one get_field lookup over verified data. A small surface is a safe surface.
To go live
1 API key
The same routes serve samples now and call Claude when a key is present. Going live is a credential, not a rewrite.
Extra OCR steps
0
Native document handling reads PDFs and images directly. There is no second pipeline to maintain or to drift out of sync.
If a live call fails
Cache
The routes fall back to the prepared result rather than showing an error, so a flaky network never turns into a broken demo in front of someone.
Your documents
Have a pile of paperwork that should be structured data?
Contracts, invoices, statements, forms: if a person is reading them by hand to pull out the same fields every time, that is work an engine like this can take over, with a human checking only what it flags.