Reading 12,000 paper PODs a day with one vision model
Our OCR pipeline turns smudged, curled, sometimes wet paper PODs into structured JSON in 30 seconds. Here's the architecture — and the failure modes nobody warns you about.
17 APR 2026
A POD — proof of delivery — is a paper slip the truck driver gets stamped by the consignee, then photographs and sends over WhatsApp. It carries the consignment note number, the delivery date, the receiver's signature, and (sometimes) hand-scribbled notes about shortage or damage. Across our customer base, 12,000+ of these arrive every day. Until last year, three people per shipper sat at desks copying them into Excel.
Our job was to read every one of them in under 30 seconds with greater than 99% accuracy on the four fields that drive billing. This is the engineering log of how that actually works in production — including the failure modes nobody warns you about.
The architecture, in plain English
We use a three-stage pipeline. Each stage exists because the previous one failed at a specific class of edge cases when we tried to do it in one shot.
Stage 1 — Image normalisation
A small UNet model finds the four corners of the actual POD inside the photo, even when half of it is shadow and the other half is the trucker's thumb. We deskew, dewarp, and crop. About 14% of incoming images need a second pass because the photo is taken from below the receiving table — those go to a perspective-correction step that knows what a "rectangle held at 45° below" looks like.
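For flavour, here is a minimal sketch of that routing. Every helper (detectCorners, estimateTiltDegrees, dewarp, perspectiveCorrect) is a hypothetical stand-in for our internal models, and the tilt threshold is illustrative, not our tuned value:

// Sketch of Stage 1 routing. All helpers are hypothetical stand-ins,
// declared here only to make the shape of the pipeline clear.
declare function detectCorners(photo: Buffer): Promise<Array<[number, number]>>;
declare function estimateTiltDegrees(corners: Array<[number, number]>): number;
declare function dewarp(photo: Buffer, corners: Array<[number, number]>): Buffer;
declare function perspectiveCorrect(photo: Buffer, corners: Array<[number, number]>): Buffer;

async function normalise(photo: Buffer): Promise<Buffer> {
  const corners = await detectCorners(photo); // UNet finds the slip's 4 corners
  if (estimateTiltDegrees(corners) > 30) {
    // illustrative threshold; the real cue is the "rectangle held at 45° below" geometry
    return perspectiveCorrect(photo, corners);
  }
  return dewarp(photo, corners); // deskew, dewarp, crop
}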
Stage 2 — Field extraction
A vision-language model reads the cropped POD and returns a structured object: LR number, date, vehicle number, signatory name, and a free-text notes field. The interesting trick here is the prompt — we don't ask "what's on this POD". We ask "what would a billing clerk type into Excel from this POD". The grounding makes a real difference: the model stops hallucinating fields that aren't there.
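A sketch of what that call looks like. askVisionModel is a hypothetical wrapper around whatever VLM endpoint you use; the billing-clerk framing and the explicit "do not guess" instruction are the point, not the client:

// Sketch of the Stage 2 call. askVisionModel is a hypothetical wrapper,
// not a real library API.
declare function askVisionModel(image: Buffer, prompt: string): Promise<string>;

const PROMPT = `You are a billing clerk. Type exactly what you would copy
into Excel from this proof-of-delivery slip. Return JSON with keys:
lrNumber, deliveredAt, vehicle, signatoryName, notes.
If a field is not on the slip, return null for it. Do not guess.`;

async function extractFields(pod: Buffer) {
  const raw = await askVisionModel(pod, PROMPT);
  return JSON.parse(raw); // validated downstream by the Stage 3 verifier
}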
Stage 3 — Verifier loop
The output goes through a deterministic verifier. The LR number must match the known regex for that carrier. The date must fall within 60 days of dispatch. The vehicle number must validate against the jurisdiction's plate format. If any check fails, the same image is sent back through the model with the failed constraint as additional context. The loop runs at most twice; after that, the image goes to a human reviewer.
// Constraints that must hold before we bill
const checks = [
  matchesCarrierLrFormat(extracted.lrNumber, carrier),
  isWithinWindow(extracted.deliveredAt, dispatch.date, 60),
  isValidVehicleNumber(extracted.vehicle, jurisdiction),
  signaturePresent(extracted.signatureBox),
];

if (checks.every(Boolean)) return { status: 'auto', record: extracted };
if (attempt < 2) return retry(extracted, failingChecks(checks));
return { status: 'human-review', record: extracted, failed: checks };
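And a sketch of the constrained retry, reusing the hypothetical askVisionModel and PROMPT names from the Stage 2 sketch above. The hint wording is illustrative:

// Sketch: failed checks become plain-English hints appended to the prompt.
async function retryWithConstraints(pod: Buffer, failed: string[]) {
  const hints = failed
    .map(f => `Your previous answer violated this rule: ${f}. Re-read that field.`)
    .join('\n');
  const raw = await askVisionModel(pod, `${PROMPT}\n\n${hints}`);
  return JSON.parse(raw);
}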
Production accuracy, broken down
Accuracy is a single number people love to quote and a single number that means nothing. What you actually care about is accuracy per field and accuracy per failure mode. Here's where we land after 18 months in production:
The free-text notes field scores lowest. It's also the one where accuracy matters least for billing — a missed "torn carton" annotation triggers a human review, not a wrong invoice.
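"Accuracy per field" is nothing fancier than comparing auto-extracted records against reviewer-corrected ones, field by field. A minimal sketch, assuming corrected records come out of the review queue:

// Per-field accuracy from (extracted, corrected) pairs; field names
// match the Stage 2 schema sketched earlier.
type PodRecord = {
  lrNumber: string; deliveredAt: string; vehicle: string;
  signatoryName: string; notes: string;
};

function perFieldAccuracy(pairs: Array<[PodRecord, PodRecord]>) {
  const fields = ['lrNumber', 'deliveredAt', 'vehicle', 'signatoryName', 'notes'] as const;
  return Object.fromEntries(
    fields.map(f => [
      f,
      pairs.filter(([got, truth]) => got[f] === truth[f]).length / pairs.length,
    ]),
  );
}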
The failure modes nobody warns you about
Stamps over text
Some consignees stamp the POD directly on top of the LR number. We had to train a separate detector that recognises stamp ink (purple/blue rubber, smudgy edges) and inpaints behind it before the field extractor runs. That single addition recovered ~2.1% of the dataset.
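A sketch of the kind of per-pixel test that feeds the inpainter. The thresholds are illustrative, not our tuned values:

// Sketch: flag pixels that look like rubber-stamp ink (blue/purple
// dominant) so a later inpainting step can fill in behind them.
function stampMask(rgba: Uint8ClampedArray, width: number, height: number): Uint8Array {
  const mask = new Uint8Array(width * height);
  for (let i = 0; i < width * height; i++) {
    const r = rgba[i * 4], g = rgba[i * 4 + 1], b = rgba[i * 4 + 2];
    // crude "blue/purple" test: blue well above both red and green
    if (b > 80 && b > r + 25 && b > g + 25) mask[i] = 1;
  }
  return mask;
}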
Two PODs in one photo
Drivers occasionally photograph two PODs side by side because both were delivered on the same trip. Our original code returned a single record and lost the second POD. Now Stage 1 detects N rectangles and spawns N parallel jobs. Sounds obvious in retrospect.
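The fix is small once you see it. A sketch, with hypothetical helper names:

// Sketch: one photo, N detected POD rectangles, N parallel extraction jobs.
declare function detectPodRectangles(photo: Buffer): Promise<Buffer[]>;
declare function runPipeline(pod: Buffer): Promise<unknown>;

async function handlePhoto(photo: Buffer) {
  const pods = await detectPodRectangles(photo); // Stage 1, now N-aware
  return Promise.all(pods.map(runPipeline));     // one record per POD
}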
The "almost a POD" problem
About 0.3% of incoming images are not PODs. They are: WhatsApp profile photos of the driver, screenshots of weather forecasts, and once — memorably — a photo of a wedding invitation. We have a binary classifier at the very front that flags these into a "not-a-POD" queue. It saved us from generating a lot of confidently wrong records.
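The gate itself is a few lines. A sketch, assuming a hypothetical podScore classifier and an illustrative threshold:

// Sketch of the front-door gate. podScore is a hypothetical binary
// classifier returning P(image is a POD).
declare function podScore(photo: Buffer): Promise<number>;

async function gate(photo: Buffer) {
  if ((await podScore(photo)) < 0.5) {
    return { status: 'not-a-pod' as const }; // parked in the queue, never billed
  }
  return { status: 'pod' as const };
}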
What we'd do differently
If we were starting today, we'd do two things differently. First, we'd skip building our own corner-detection UNet and use one of the off-the-shelf document-segmentation models that didn't exist two years ago. Second, we'd build the human-review UI before the model. Watching reviewers correct outputs is the single highest-leverage activity for an OCR pipeline. We did it third, after the model and the API. We should have done it first.
1. OCR accuracy as a single number is a vanity metric. Per-field accuracy is what drives billing trust.
2. Build the human-review UI first. The corrections it captures are your best training data.
3. The hardest images aren't the smudged ones. They're the ones that are "almost" a POD — wrong document classes pretending to be the right one.