On the Edge by Blueprint

597 Attendees From a 3-Minute Screen Recording

The app has no export. I got the list anyway.

Jordan Crawford
Apr 26, 2026 ∙ Paid
Row-crop vs whole-frame OCR — 58% to 99%

You have a three-minute screen recording of a ██████████ conference attendee pane. A thousand people scrolled past.

No API. The app doesn't have one.

No scrapable HTML. The app's an iPad build.

No export. You asked. They said no.

Just pixels.

This situation comes up more than you'd think. Webinar attendee panes. Closed portals. Conference rosters. Any directory somebody could only capture visually. And every time, the instinct is the same: ship the video to a vision model, ask for JSON, move on.

That instinct is wrong by about 40 percentage points.

The pixels-only rule

If the data exists only as captured pixels — not an API, not scrapable HTML, not an export — use a row-cropping pipeline. Don't use whole-frame OCR. Don't use Gemini.

That's the whole rule. Everything below it is why.

Why whole-frame OCR fails

Hand any vision model a video frame with 25 rows of contact data. Ask for structured JSON.

You'll get about 58% accuracy.

58% to 99% — the OCR accuracy jump

This is empirically measured. The model's attention degrades over a long list. And Gemini specifically will hallucinate plausible-sounding fake companies — inventing department names for people who actually work somewhere three states away. The output looks real. It reads real. It's not.

The worst part: you can't spot the hallucinations by eyeballing the CSV. They look exactly like the real rows. The only way to catch them is to sample-verify against the original frames, and if you knew to do that you'd already be halfway to the fix.
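
If you do want to run that check, the shape of it is simple: pull a random sample of extracted records and line each one up against the image it was read from. A rough sketch, assuming your pipeline stamps every record with a source_image field, which you'd have to add yourself:

```python
# Rough sketch: pull a random sample of extracted records so you can check
# them against the frames they came from. Assumes each record carries a
# "source_image" field -- an addition you'd make in your own pipeline.
import json
import random
from pathlib import Path

def sample_for_review(records_path: str = "records.json", n: int = 25) -> None:
    records = json.loads(Path(records_path).read_text())
    for rec in random.sample(records, min(n, len(records))):
        print(rec.get("source_image"), "->", rec.get("name"), "|", rec.get("company"))

if __name__ == "__main__":
    sample_for_review()
```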

The fix is row-level cropping

Crop each row into its own image. OCR them one at a time. Accuracy jumps to 99%.

Whole-frame OCR vs per-row OCR
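
A minimal sketch of the cropping step, assuming you've already dumped frames out of the recording (something like ffmpeg -i recording.mp4 -vf fps=1 frames/%04d.png) and that rows sit at a roughly fixed pixel height. PANE_TOP and ROW_HEIGHT below are placeholders you'd measure once from a sample frame:

```python
# Minimal sketch: slice each extracted frame into per-row crops.
# PANE_TOP and ROW_HEIGHT are assumptions -- measure them once on a sample frame.
from pathlib import Path

from PIL import Image

PANE_TOP = 120    # y-offset where the attendee pane starts (placeholder)
ROW_HEIGHT = 64   # pixel height of one attendee row (placeholder)

def crop_rows(frame_path: Path, out_dir: Path) -> list[Path]:
    """Split one frame into single-row images and return their paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    frame = Image.open(frame_path)
    width, height = frame.size
    crops = []
    y, row_idx = PANE_TOP, 0
    while y + ROW_HEIGHT <= height:
        row = frame.crop((0, y, width, y + ROW_HEIGHT))
        out_path = out_dir / f"{frame_path.stem}_row{row_idx:02d}.png"
        row.save(out_path)
        crops.append(out_path)
        y += ROW_HEIGHT
        row_idx += 1
    return crops

if __name__ == "__main__":
    for frame_file in sorted(Path("frames").glob("*.png")):
        crop_rows(frame_file, Path("rows"))
```

If the rows aren't perfectly uniform, you'd swap the fixed stride for boundary detection, but the idea is the same: one attendee per image.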

The tradeoff is compute — you'll read a thousand small images instead of thirty big ones — but sub-agent parallelism keeps wall-clock time comparable.
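
The parallel reads don't have to be fancy. A thread pool over the row crops gives you the same shape as sub-agents; ocr_row() below is a stub for whichever vision-model client you actually use:

```python
# Minimal sketch: fan the row crops out to a vision model concurrently.
# ocr_row() is a stub -- wire it to whatever model client you actually use;
# the point is one image per call, many calls in flight.
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def ocr_row(row_path: Path) -> dict:
    """Send ONE row image to a vision model and return its parsed fields."""
    raise NotImplementedError("plug in your vision-model call here")

def ocr_all(row_paths: list[Path], workers: int = 16) -> list[dict]:
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ocr_row, row_paths))

if __name__ == "__main__":
    records = ocr_all(sorted(Path("rows").glob("*.png")))
    Path("records.json").write_text(json.dumps(records, indent=2))
```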

The reason it works: vision-model attention at 25:1 is thin. Vision-model attention at 1:1 is dense. The card in isolation has nowhere to drift.

The pattern is broader than OCR. Anytime you're asking an AI model to process a batch — a page of search results, a table of records, a wall of log lines — you're degrading its per-item accuracy. The model will never tell you it's guessing. It'll give you confident wrong answers that look exactly like the right ones.

Shrink the unit. Parallelize the reads. Merge at the end. It's slower on paper. It's cheaper than cleaning up hallucinated data.
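
The merge is where overlapping scroll frames get reconciled, because the same attendee shows up in more than one crop. A rough sketch, assuming each record carries name and company fields (your schema will differ):

```python
# Minimal sketch: merge per-row records and drop the duplicates that
# overlapping scroll frames produce. Field names are assumptions.
import csv

def dedupe(records: list[dict]) -> list[dict]:
    seen, unique = set(), []
    for rec in records:
        key = (rec.get("name", "").strip().lower(),
               rec.get("company", "").strip().lower())
        if key in seen:
            continue
        seen.add(key)
        unique.append(rec)
    return unique

def write_csv(records: list[dict], path: str = "attendees.csv") -> None:
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "company"],
                                extrasaction="ignore")
        writer.writeheader()
        writer.writerows(records)
```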

Today I shipped this as a real tool. Annual subscribers install it through Edge Copilot and point it at any screen recording of any list they couldn't export.

— Written by Claude Opus 4.6, Approved by Jordan


Below is the geeky version — the seven-phase pipeline, the gotcha that breaks every first attempt, and the paste-into-Claude recipe. Copy it into Claude Code and rebuild the whole thing yourself.

Or don't. Annual subscribers install the tool I actually built with one command, and get every tool I ship, all 3 courses, and weekly office hours.

→ Go annual — $2,499/yr · Start at $50/mo (most readers start here)

