Reading 12,000 paper PODs a day with one vision model
Our OCR pipeline turns smudged, curled, sometimes wet paper PODs into structured JSON in 30 seconds. Here's the architecture — and the failure modes nobody warns you about.
17 APR 2026
A POD — proof of delivery — is a paper slip the truck driver gets stamped by the consignee, then photographs and sends over WhatsApp. It carries the consignment note number, the delivery date, the receiver's signature, and (sometimes) hand-scribbled notes about shortage or damage. Across our customer base, 12,000+ of these arrive every day. Until last year, three people per shipper sat at desks copying them into Excel.
Our job was to read every one of them in under 30 seconds with greater than 99% accuracy on the four fields that drive billing. This is the engineering log of how that actually works in production — including the failure modes nobody warns you about.
The architecture, in plain English
We use a three-stage pipeline. Each stage exists because the previous one failed at a specific class of edge cases when we tried to do it in one shot.
Stage 1 — Image normalisation
A small UNet model finds the four corners of the actual POD inside the photo, even when half of it is shadow and the other half is the trucker's thumb. We deskew, dewarp, and crop. About 14% of incoming images need a second pass because the photo is taken from below the receiving table — those go to a perspective-correction step that knows what a "rectangle held at 45° below" looks like.
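For flavour, here is a minimal sketch of that routing. Every helper (detectCorners, estimateTiltDegrees, dewarp, perspectiveCorrect) is a hypothetical stand-in for our internal models, and the tilt threshold is illustrative, not our tuned value:

// Sketch of Stage 1 routing. All helpers are hypothetical stand-ins,
// declared here only to make the shape of the pipeline clear.
declare function detectCorners(photo: Buffer): Promise<Array<[number, number]>>;
declare function estimateTiltDegrees(corners: Array<[number, number]>): number;
declare function dewarp(photo: Buffer, corners: Array<[number, number]>): Buffer;
declare function perspectiveCorrect(photo: Buffer, corners: Array<[number, number]>): Buffer;

async function normalise(photo: Buffer): Promise<Buffer> {
  const corners = await detectCorners(photo); // UNet finds the slip's 4 corners
  if (estimateTiltDegrees(corners) > 30) {
    // illustrative threshold; the real cue is the "rectangle held at 45° below" geometry
    return perspectiveCorrect(photo, corners);
  }
  return dewarp(photo, corners); // deskew, dewarp, crop
}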
Stage 2 — Field extraction
A vision-language model reads the cropped POD and returns a structured object: LR number, date, vehicle number, signatory name, and a free-text notes field. The interesting trick here is the prompt — we don't ask "what's on this POD". We ask "what would a billing clerk type into Excel from this POD". The grounding makes a real difference: the model stops hallucinating fields that aren't there.
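A sketch of what that call looks like. askVisionModel is a hypothetical wrapper around whatever VLM endpoint you use; the billing-clerk framing and the explicit "do not guess" instruction are the point, not the client:

// Sketch of the Stage 2 call. askVisionModel is a hypothetical wrapper,
// not a real library API.
declare function askVisionModel(image: Buffer, prompt: string): Promise<string>;

const PROMPT = `You are a billing clerk. Type exactly what you would copy
into Excel from this proof-of-delivery slip. Return JSON with keys:
lrNumber, deliveredAt, vehicle, signatoryName, notes.
If a field is not on the slip, return null for it. Do not guess.`;

async function extractFields(pod: Buffer) {
  const raw = await askVisionModel(pod, PROMPT);
  return JSON.parse(raw); // validated downstream by the Stage 3 verifier
}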
Stage 3 — Verifier loop
The output goes through a deterministic verifier. The LR number must match the known regex for that carrier. The date must fall within 60 days of dispatch. The vehicle number must validate against the jurisdiction's plate format. If any check fails, the same image is sent back through the model with the failed constraint as additional context. The loop runs at most twice; after that, the image goes to a human reviewer.
// Constraints that must hold before we bill
const checks = [
  matchesCarrierLrFormat(extracted.lrNumber, carrier),
  isWithinWindow(extracted.deliveredAt, dispatch.date, 60),
  isValidVehicleNumber(extracted.vehicle, jurisdiction),
  signaturePresent(extracted.signatureBox),
];

if (checks.every(Boolean)) return { status: 'auto', record: extracted };
if (attempt < 2) return retry(extracted, failingChecks(checks));
return { status: 'human-review', record: extracted, failed: checks };
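And a sketch of the constrained retry, reusing the hypothetical askVisionModel and PROMPT names from the Stage 2 sketch above. The hint wording is illustrative:

// Sketch: failed checks become plain-English hints appended to the prompt.
async function retryWithConstraints(pod: Buffer, failed: string[]) {
  const hints = failed
    .map(f => `Your previous answer violated this rule: ${f}. Re-read that field.`)
    .join('\n');
  const raw = await askVisionModel(pod, `${PROMPT}\n\n${hints}`);
  return JSON.parse(raw);
}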
Production accuracy, broken down
Accuracy is a single number people love to quote and a single number that means nothing. What you actually care about is accuracy per field and accuracy per failure mode. Here's where we land after 18 months in production:
The free-text notes field scores lowest. It's also the one where accuracy matters least for billing — a missed "torn carton" annotation triggers a human review, not a wrong invoice.
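"Accuracy per field" is nothing fancier than comparing auto-extracted records against reviewer-corrected ones, field by field. A minimal sketch, assuming corrected records come out of the review queue:

// Per-field accuracy from (extracted, corrected) pairs; field names
// match the Stage 2 schema sketched earlier.
type PodRecord = {
  lrNumber: string; deliveredAt: string; vehicle: string;
  signatoryName: string; notes: string;
};

function perFieldAccuracy(pairs: Array<[PodRecord, PodRecord]>) {
  const fields = ['lrNumber', 'deliveredAt', 'vehicle', 'signatoryName', 'notes'] as const;
  return Object.fromEntries(
    fields.map(f => [
      f,
      pairs.filter(([got, truth]) => got[f] === truth[f]).length / pairs.length,
    ]),
  );
}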
The failure modes nobody warns you about
Stamps over text
Some consignees stamp the POD directly on top of the LR number. We had to train a separate detector that recognises stamp ink (purple/blue rubber, smudgy edges) and inpaints behind it before the field extractor runs. That single addition recovered ~2.1% of the dataset.
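A sketch of the kind of per-pixel test that feeds the inpainter. The thresholds are illustrative, not our tuned values:

// Sketch: flag pixels that look like rubber-stamp ink (blue/purple
// dominant) so a later inpainting step can fill in behind them.
function stampMask(rgba: Uint8ClampedArray, width: number, height: number): Uint8Array {
  const mask = new Uint8Array(width * height);
  for (let i = 0; i < width * height; i++) {
    const r = rgba[i * 4], g = rgba[i * 4 + 1], b = rgba[i * 4 + 2];
    // crude "blue/purple" test: blue well above both red and green
    if (b > 80 && b > r + 25 && b > g + 25) mask[i] = 1;
  }
  return mask;
}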
Two PODs in one photo
Drivers occasionally photograph two PODs side by side because both were delivered on the same trip. Our original code returned a single record and lost the second POD. Now Stage 1 detects N rectangles and spawns N parallel jobs. Sounds obvious in retrospect.
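The fix is small once you see it. A sketch, with hypothetical helper names:

// Sketch: one photo, N detected POD rectangles, N parallel extraction jobs.
declare function detectPodRectangles(photo: Buffer): Promise<Buffer[]>;
declare function runPipeline(pod: Buffer): Promise<unknown>;

async function handlePhoto(photo: Buffer) {
  const pods = await detectPodRectangles(photo); // Stage 1, now N-aware
  return Promise.all(pods.map(runPipeline));     // one record per POD
}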
The "almost a POD" problem
About 0.3% of incoming images are not PODs. They are: WhatsApp profile photos of the driver, screenshots of weather forecasts, and once — memorably — a photo of a wedding invitation. We have a binary classifier at the very front that flags these into a "not-a-POD" queue. It saved us from generating a lot of confidently wrong records.
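The gate itself is a few lines. A sketch, assuming a hypothetical podScore classifier and an illustrative threshold:

// Sketch of the front-door gate. podScore is a hypothetical binary
// classifier returning P(image is a POD).
declare function podScore(photo: Buffer): Promise<number>;

async function gate(photo: Buffer) {
  if ((await podScore(photo)) < 0.5) {
    return { status: 'not-a-pod' as const }; // parked in the queue, never billed
  }
  return { status: 'pod' as const };
}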
What we'd do differently
If we were starting today, we'd do two things differently. First, we'd skip building our own corner-detection UNet and use one of the off-the-shelf document-segmentation models that didn't exist two years ago. Second, we'd build the human-review UI before the model. Watching reviewers correct outputs is the single highest-leverage activity for an OCR pipeline. We did it third, after the model and the API. We should have done it first.
1. OCR accuracy as a single number is a vanity metric. Per-field accuracy is what drives billing trust.
2. Build the human-review UI first. The corrections it captures are your best training data.
3. The hardest images aren't the smudged ones. They're the ones that are "almost" a POD — wrong document classes pretending to be the right one.