On the Edge by Blueprint


The PDF Contact Extractor I Built — Discovers Sources, Pulls Every Row

Government directories are a TAM list nobody extracts. The tool I built today, with the trick that makes it free.

Jordan Crawford
May 07, 2026
∙ Paid


A 798-page state fire marshal directory.

That's the test case. Every fire department in New Jersey. Names, titles, departments, emails, phones. Locked inside a PDF nobody can export.

Multiply that by every state and every industry. State licensing boards. Federal agency rosters. Trade association member lists. Conference attendee PDFs. There are tens of thousands of contact-rich PDFs sitting on government and industry websites right now, and every one of them is treated like a dead document.

They're not dead. They're a TAM list nobody has bothered to extract.

So I built the tool that extracts them. Today. End to end.

The naive way fails

Hand a 30-row contact table to any vision model. Ask for clean JSON. You get 58% accuracy. Sometimes worse.

I've seen this before. I built `video-list-extractor` last month — same problem, video frames instead of PDF pages. The vision model's attention degrades across a long list. Gemini in particular invents organization names. Claude is better but still drops names, swaps emails, normalizes "Dr. Sarah Chen, MD" to "Sarah Chen."

The fix is not better prompting. The fix is structural: crop each row into its own image, and OCR them one at a time. Same model. Same prompt. 99% accuracy.

That's the insight that made `video-list-extractor` work. It applies to PDFs.

The smart move: don't OCR what you don't have to

Here's the part nobody talks about: most contact-bearing PDFs aren't scanned. They have a real text layer. pypdf can read them at zero LLM cost.

The state fire marshal directory? Native text. The 798-page NJ fire department roster? Native text. I ran my triage script on both. Score: 1.0. Route: text path.
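The post doesn't show the triage script itself, so here is a minimal sketch of one plausible version — assuming the score is simply the fraction of pages whose extracted text layer contains enough real characters (the thresholds are mine, not the tool's):

```python
# Assumed triage heuristic: a PDF's score is the fraction of pages whose
# text layer looks substantive. 1.0 means every page reads cleanly.

def page_has_text(text: str, min_chars: int = 200) -> bool:
    """A page 'has text' if its layer holds enough alphanumeric characters."""
    printable = sum(1 for c in text if c.isalnum())
    return printable >= min_chars

def triage_score(page_texts: list[str]) -> float:
    """Fraction of pages with a usable text layer."""
    if not page_texts:
        return 0.0
    return sum(page_has_text(t) for t in page_texts) / len(page_texts)

def route(score: float, threshold: float = 0.8) -> str:
    """Route high-scoring PDFs to the cheap text path, the rest to vision."""
    return "text" if score >= threshold else "vision"
```

With pypdf, `page_texts` would come from `[p.extract_text() or "" for p in PdfReader(path).pages]` — a directory that scores 1.0, like the NJ roster, never touches a vision model.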

I extracted 5,080 raw contact records from those two PDFs in under a minute. Total LLM cost: zero. The merge step deduped them down to 2,015 unique contacts.
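The exact dedupe keying isn't described in the post, but a sketch under reasonable assumptions — email as the primary key, normalized name plus department as the fallback, with later duplicates filling fields the first record was missing — looks like this:

```python
def dedupe(records: list[dict]) -> list[dict]:
    """Collapse duplicate contact records. Key on lowercased email when
    present, else on normalized name + department. First occurrence wins;
    duplicates only contribute fields the kept record lacks."""
    seen: dict[tuple, dict] = {}
    for rec in records:
        email = (rec.get("email") or "").strip().lower()
        if email:
            key = ("email", email)
        else:
            key = ("name", (rec.get("name") or "").strip().lower(),
                   (rec.get("department") or "").strip().lower())
        if key not in seen:
            seen[key] = dict(rec)
        else:
            # Backfill empty or missing fields from the duplicate.
            for field, value in rec.items():
                if value and not seen[key].get(field):
                    seen[key][field] = value
    return list(seen.values())
```

This is roughly how 5,080 raw rows could collapse to 2,015 uniques: the same chief often appears in both the marshal directory and the roster, once with a phone and once with an email.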

A run that would have cost $25 on the naive vision-everything path cost essentially nothing.

The triage decision is the entire game. Text-rich PDFs go to a regex + pypdf path. Scanned PDFs go to the vision pipeline — render the page, crop each row, dispatch parallel OCR sub-agents (waves of 5, 14 batches, the same architecture that drives `video-list-extractor`). Most PDFs only need the cheap path. Only scanned ones earn the expensive one.
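The ink-density projection fits in a few lines. This sketch assumes the page has already been rendered to a grayscale pixel grid (the pipeline renders at 200 DPI); the threshold and minimum-height values are illustrative, not the shipped ones:

```python
def row_bands(page: list[list[int]], ink_thresh: float = 8.0,
              min_height: int = 2) -> list[tuple[int, int]]:
    """Horizontal ink-density projection: average darkness per pixel row,
    then group consecutive inked rows into (top, bottom) crop bands.
    Pixels are grayscale 0-255, where 0 = black ink, 255 = white paper."""
    density = [sum(255 - px for px in row) / len(row) for row in page]
    bands, start = [], None
    for y, d in enumerate(density):
        if d > ink_thresh and start is None:
            start = y                      # band begins
        elif d <= ink_thresh and start is not None:
            if y - start >= min_height:    # ignore specks
                bands.append((start, y))
            start = None
    if start is not None and len(page) - start >= min_height:
        bands.append((start, len(page)))   # band runs to page bottom
    return bands
```

Each `(top, bottom)` band becomes one cropped row image, and the crops are what get dispatched to the OCR sub-agents — one row per call, which is the structural fix from `video-list-extractor`.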

The pipeline

Eight phases. Each one is a script the orchestrator runs in sequence.

1. Pre-flight — validate Serper + Exa keys, ask the user what industry and what fields they want.

2. Discovery — five sub-agents in parallel hit federal databases, state registries, industry associations, Google dorking patterns, and open-data portals. They write candidate PDF URLs to JSON.

3. Download — stream each PDF with a 30 MB cap, save to `/tmp/pcf_pdfs/`.

4. Triage — score each PDF's text quality, route to text path or vision path.

5. Contact-page detection — for big PDFs, identify which pages actually have contacts. Skip the table-of-contents and intro pages.

6. Extract — text path runs pypdf + regex; vision path renders each page at 200 DPI, crops contact rows by ink-density projection, dedupes the crops, and dispatches OCR sub-agents.

7. Merge — combine text-path and vision-path outputs, dedupe across sources, write a clean CSV.

8. Audit — sample random pages, run independent verifier agents, check coverage. Target ≥97%.

That's it. No magic. The structure carries the accuracy.
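For step 6's text path, a minimal regex pass might look like the following. The patterns and field logic are my guesses, not the shipped tool's: one candidate record per line that carries an email or phone, with whatever precedes the first match kept as a raw name blob for the merge step to clean up.

```python
import re

# Assumed patterns; the actual regexes aren't shown in the post.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}")

def extract_contacts(page_text: str) -> list[dict]:
    """Text-path extraction over one page of pypdf output."""
    records = []
    for line in page_text.splitlines():
        emails = EMAIL.findall(line)
        phones = PHONE.findall(line)
        if not emails and not phones:
            continue  # TOC lines, headers, etc. fall through here
        first = min([m.start() for m in EMAIL.finditer(line)] +
                    [m.start() for m in PHONE.finditer(line)])
        records.append({
            "raw": line.strip(),
            "name": line[:first].strip(" \t-|,"),
            "email": emails[0] if emails else "",
            "phone": phones[0] if phones else "",
        })
    return records
```

Directory PDFs are forgiving targets for this: one contact per line, contact details right of the name, so a line-oriented pass recovers nearly everything without an LLM in the loop.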

What it cost to extract 2,015 fire department contacts

Two PDFs. ~820 pages combined. Both routed to the text path.

  • Discovery: skipped (I had the URLs)

  • Download: free

  • Triage: free

  • Page detection: free

  • Text extraction: free

  • Merge: free

  • Total: $0.00

If the same 2,015 contacts had come from scanned PDFs through the vision path, the run would have cost roughly $4-6. Still cheap. But the triage saved every dollar.

That ratio is the whole product. A run where 90% of PDFs are text-extractable costs roughly a fifth of one where 90% are scanned. Triage is the lever.


Below is the geeky version. Copy it into Claude Code and rebuild the whole thing yourself.

Or don't. Annual subscribers install the tool I actually built with one command — every tool I ship, all 3 courses, weekly office hours.

→ Go annual — $2,499/yr · Start at $50/mo (most readers start here)

