A TAM of named contacts for $0

650 company sites, two markets, zero dollars. Classify first, extract deterministically, pay only for the residue. The build — and the bugs the real data caught.

Jun 04, 2026


May 30, 2026 · Build log

A market of named contacts for $0 — classify first, pay only for the residue

Every contact tool charges you per domain. Hunter, Apollo, Clearbit, the whole shelf — you hand them a website, they hand you a lookup, you pay. Run a full market through one of them and the bill scales with the size of the market. That's backwards. Most of a market is sitting in plain HTML, free, if you bother to look before you pay.

So I built a thing that looks first. It's called Lynx. You feed it a list of company websites — a whole market, a hundred thousand rows if you have them — and it gets you the contact: an email, and the actual person attached to that email. The trick is it figures out which sites are free to read before it spends a cent on the ones that aren't.

This week I ran it on two real markets. 500 acupuncture clinics and schools. 150 law firms. Both at $0.00.

Classify before you pay

The core move is a fast triage. Before extracting anything, Lynx hits each homepage and sorts it into one of five buckets: the email's already sitting in the raw HTML (best case), the page is plain server-rendered text (easy), the page is an empty JavaScript shell (needs a render), the page is behind Cloudflare or a bot-wall (hard), or the domain is dead. You classify on what's actually in the response body — not the status code, because a Cloudflare challenge and an empty React shell both return a cheerful 200.
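A minimal sketch of that triage, to make the "classify on the body" idea concrete. This is not Lynx's actual code: the bucket names, the challenge marker strings, and the 200-character threshold are assumptions picked for illustration, and a real run would tune them against a labeled sample.

```python
import re
from typing import Optional

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Substrings that often appear on bot-wall interstitials (illustrative, not exhaustive).
CHALLENGE_MARKERS = ("cf-chl", "challenge-platform", "just a moment", "attention required")
# Matches whole <script>/<style> elements first, then any remaining tag.
TAG_RE = re.compile(r"<(script|style)\b.*?</\1>|<[^>]+>", re.S | re.I)


def visible_text(html: str) -> str:
    """Drop scripts, styles and tags, then collapse whitespace."""
    return re.sub(r"\s+", " ", TAG_RE.sub(" ", html)).strip()


def classify(body: Optional[str]) -> str:
    """Sort one homepage into a bucket using only the response body."""
    if body is None:                       # fetch failed: DNS error, timeout, refused
        return "dead"
    lowered = body.lower()
    if any(marker in lowered for marker in CHALLENGE_MARKERS):
        return "bot_wall"                  # a challenge page still returns 200
    if "mailto:" in lowered or "data-cfemail" in lowered or EMAIL_RE.search(body):
        return "email_in_html"
    if len(visible_text(body)) < 200:      # almost no readable text: likely a JS shell
        return "js_shell"
    return "static"
```

The status code never enters the decision, which is the point: the challenge page and the empty shell are told apart by what they contain.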

That sort is the whole economic argument. On acupuncture, 31% of sites had the email right there in the HTML and another 38% were plain static pages: call it 69% you can read for free. Another 20% were Cloudflare-protected and 6% needed JavaScript, which leaves roughly 5% for dead domains. The free buckets are the majority. You only ever pay for the residue.

Why it's free: classify before paying, read free pages with free code, pay only for the residue

Read the page, don't ask a robot to read the page

Once a page is in a free bucket, deterministic code does the extraction. Plain mailto: links. Cloudflare's obfuscated email trick: they hex-encode the address after XORing it against a one-byte key stored at the front of the string, which is fully reversible without running any of their JavaScript, so you decode it for free. Schema.org and JSON-LD blocks, which are structured, author-declared markup that hands you name, title, and email in one shot. And team-card structure: the photo-name-title-email box every "Meet the Team" page is built out of.

Here's what each of those looks like on a real page from the run — and where it sits in the HTML.

A plain mailto link on a real law-firm page — lifted by a regex, labeled to a person, $0, never touches a model
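Roughly what "lifted by a regex" can look like. This is a sketch under assumptions, not the pattern Lynx ships: the function name is made up, and it only handles quoted href attributes.

```python
import re
from urllib.parse import unquote

# Capture everything after "mailto:" up to the closing quote or a "?subject=..." suffix.
MAILTO_RE = re.compile(r'href\s*=\s*["\']mailto:([^"\'?]+)', re.I)


def mailto_addresses(html: str) -> list[str]:
    """Return unique addresses from mailto: links, in page order."""
    found: list[str] = []
    for raw in MAILTO_RE.findall(html):
        # A single mailto can list several recipients separated by commas,
        # and the address may be percent-encoded (info%40example.com).
        for part in unquote(raw).split(","):
            address = part.strip().lower()
            if "@" in address and address not in found:
                found.append(address)
    return found
```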

Cloudflare hides the email behind data-cfemail — a one-byte XOR key, fully reversible for free without running their JavaScript
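The decode itself is a few lines. In the data-cfemail scheme the first byte of the hex string is the key and every following byte is the address XORed with it. The function names below are mine, not Lynx's.

```python
import re

CFEMAIL_RE = re.compile(r'data-cfemail\s*=\s*["\']([0-9a-fA-F]+)["\']')


def decode_cfemail(encoded: str) -> str:
    """Reverse the data-cfemail encoding: byte 0 is the XOR key."""
    raw = bytes.fromhex(encoded)
    key = raw[0]
    return bytes(b ^ key for b in raw[1:]).decode("utf-8")


def cfemails_in(html: str) -> list[str]:
    """Decode every data-cfemail attribute found in the page."""
    return [decode_cfemail(value) for value in CFEMAIL_RE.findall(html)]
```

For example, `decode_cfemail("5a3b1a38743935")` yields `a@b.co`: the key is 0x5a, and 0x3b XOR 0x5a is 0x61, the letter "a".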

Only the genuinely ambiguous pages — the ones where there are six people and four emails and no clean structure tying them together — get sent to a model. The majority never touch an LLM. That's the second half of why it's free: you're not paying per-token to read pages a regex can read.

A team card pairs the name to the email in the same block — by DOM proximity, no model needed
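One way to sketch the proximity pairing with only the standard library: treat each card container as a scope and pair the first heading inside it with the first mailto inside it. The `team-member` class name and the h2-h4 heading assumption are hypothetical, since real sites vary and the actual heuristics are not shown in this post.

```python
from html.parser import HTMLParser

HEADINGS = ("h2", "h3", "h4")


class TeamCardParser(HTMLParser):
    """Pair the first heading and first mailto found inside each card <div>."""

    def __init__(self, card_class: str = "team-member"):
        super().__init__()
        self.card_class = card_class
        self.div_depth = 0        # > 0 while inside a card; counts nested <div>s only
        self.in_heading = False
        self.card = None
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.div_depth:
                self.div_depth += 1
            elif self.card_class in (attrs.get("class") or "").split():
                self.div_depth = 1
                self.card = {"name": "", "email": None}
        elif self.div_depth:
            if tag in HEADINGS and not self.card["name"]:
                self.in_heading = True
            href = attrs.get("href") or ""
            if tag == "a" and href.lower().startswith("mailto:") and not self.card["email"]:
                self.card["email"] = href[7:].split("?")[0].strip().lower()

    def handle_data(self, data):
        if self.in_heading and self.card is not None:
            self.card["name"] += data

    def handle_endtag(self, tag):
        if tag in HEADINGS:
            self.in_heading = False
        elif tag == "div" and self.div_depth:
            self.div_depth -= 1
            if self.div_depth == 0:           # card closed: emit the pair if complete
                name = " ".join(self.card["name"].split())
                if name and self.card["email"]:
                    self.pairs.append((name, self.card["email"]))
                self.card, self.in_heading = None, False


def pair_team_cards(html: str, card_class: str = "team-member"):
    parser = TeamCardParser(card_class)
    parser.feed(html)
    return parser.pairs
```

Counting only `<div>` depth sidesteps void tags like `<img>`, which never emit a closing tag. An email outside any card, such as the footer's info@ address, is never paired with a name.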

JSON-LD structured data: the site declares its own identity in author-written markup — read it instead of scraping
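Reading JSON-LD is mostly a matter of finding the script blocks and walking the decoded JSON for nodes that carry an `email` property. A sketch, assuming the site uses the standard Schema.org property names `name`, `jobTitle`, and `email`; the function names and the output dict shape are illustrative.

```python
import json
import re

JSONLD_RE = re.compile(
    r'<script[^>]*type\s*=\s*["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.S | re.I,
)


def walk(node):
    """Yield every dict nested anywhere inside a decoded JSON-LD value."""
    if isinstance(node, dict):
        yield node
        for value in node.values():
            yield from walk(value)
    elif isinstance(node, list):
        for item in node:
            yield from walk(item)


def jsonld_contacts(html: str) -> list[dict]:
    """Collect every JSON-LD node that declares an email."""
    contacts = []
    for block in JSONLD_RE.findall(html):
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue                      # malformed markup happens; skip the block
        for node in walk(data):
            email = node.get("email")
            if isinstance(email, str):
                contacts.append({
                    "type": node.get("@type"),
                    "name": node.get("name"),
                    "title": node.get("jobTitle"),
                    "email": email.removeprefix("mailto:").strip().lower(),
                })
    return contacts
```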

The acupuncture run: 500 domains, 68% page-resolution, 456 contacts, $0.00. The lawyer run: 150 domains, 63% page-resolution, $0.00. Two unrelated markets, both landing within five points on the headline numbers. The method travels.

The name is the hard part

Getting an email is easy. Getting the person is the job. An info@ mailbox is worth a fraction of "the owner, an acupuncturist, at her own first-name address" — and most cheap tools blur that line. Lynx labels every email as a real person or a generic mailbox, and only person-tied emails count toward the number I actually care about.
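The cheapest version of that label is a lookup on the local part of the address. The list below is an assumed starting point, not Lynx's actual list, and "person_candidate" is deliberately weaker than "person": an address only counts as named once a name is tied to it.

```python
# Role mailboxes that belong to nobody in particular (illustrative list).
GENERIC_LOCAL_PARTS = frozenset({
    "info", "contact", "hello", "office", "admin", "support", "sales",
    "frontdesk", "reception", "appointments", "billing", "team", "mail",
    "enquiries", "inquiries",
})


def label_email(address: str) -> str:
    """Label an address 'generic' or 'person_candidate' from its local part."""
    local = address.strip().lower().split("@", 1)[0]
    stem = local.split("+", 1)[0]          # ignore plus-tags such as info+web@
    return "generic" if stem in GENERIC_LOCAL_PARTS else "person_candidate"
```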

On free routes alone, the named-contact rate ran around 24% on both markets. That's the honest floor — real, cold, no spend. Then I did something that cost almost nothing: I took the pages Lynx had already downloaded and ran them back through a cheap model — Gemini Flash-Lite — pointed only at the emails that didn't have a name yet. No re-crawling. Just re-reading what was already on disk.

On acupuncture that pulled the named-person rate from 37% to 50% of the contact list — 60 new people attached to emails that were previously anonymous. On lawyers it added six. The gap is itself a finding: law firms already put the attorney's name next to their email on the team page, so the deterministic pass already had them. Acupuncture clinics hide owners behind info@. Different market, different bottleneck, same engine.

37% to 50% named contacts after a cheap re-read of pages already on disk, at 100% name-to-email precision

The part I don't trust: the name↔email link

Here's where I get paranoid. A model that attaches names to emails will, given the chance, invent one. It'll see a first.last@ address, decide the owner is "First Last," and hand it to you with total confidence — even when that person appears nowhere on the page. It reverse-engineered the name straight out of the email address. That's a fabrication wearing a confidence score.

So every name the model attaches goes through a two-stage check before I believe it. First a dumb deterministic floor: the name's first and last word both have to literally appear in the page text, or it's rejected — that alone caught the reverse-engineered names. Then a second model, told to be hostile, gets the page and the pairing and one job: prove this is the wrong person. Default to "wrong" if you're unsure. Only pairings that survive both gates ship.
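A sketch of the two gates. The deterministic floor follows the rule as stated; the one detail I am adding as an assumption is stripping email addresses out of the page text first, since otherwise first.last@ would itself "ground" the name it was reverse-engineered from. `hostile_judge` is a hypothetical stand-in for the second model call and is expected to return "right", "wrong", or anything else.

```python
import re
from typing import Callable

WORD_RE = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")   # letters, allowing O'Neil, Mary-Anne
EMAILISH_RE = re.compile(r"\S+@\S+")


def _words(text: str) -> list[str]:
    return WORD_RE.findall(text.lower())


def is_grounded(name: str, page_text: str) -> bool:
    """Gate 1: the name's first and last word must literally appear on the page."""
    parts = _words(name)
    if len(parts) < 2:
        return False
    # Remove email addresses so a name cannot be grounded by the address alone.
    page_words = set(_words(EMAILISH_RE.sub(" ", page_text)))
    return parts[0] in page_words and parts[-1] in page_words


def verify_pairing(
    name: str,
    email: str,
    page_text: str,
    hostile_judge: Callable[[str, str, str], str],
) -> bool:
    """Ship a pairing only if it survives both gates."""
    if not is_grounded(name, page_text):
        return False
    verdict = hostile_judge(name, email, page_text)
    # Default to rejection: only an explicit "right" passes.
    return verdict.strip().lower() == "right"
```

Running the free check first means the model is never asked about a name that could not possibly be correct.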

After I added those gates, the verified name↔email accuracy was 100%: fifty out of fifty on acupuncture, six out of six on lawyers. Before the gates it was 98%. The two percent the gates caught were exactly the fabrications, and two percent of a real list is the part that burns your domain reputation.

Where you actually pay

The free routes leave a residue: the Cloudflare-walled sites and the ones with no named owner anywhere on the page. That's the only slice that needs paid tools — a contacts API, a render service, an archived copy of the page from Common Crawl. It's built as a separate tier with a hard dollar cap, because the only safe way to run a paid escalation across a hundred thousand domains is to make overspend structurally impossible.

And I'll tell you what broke, because the breakages are the actual lesson. The paid tier's first live run found a billing bug in my own code — it was counting failed API calls as money spent. The contacts API I wired in turned out not to be subscribed to the one endpoint I needed, which I only learned by watching it return 401 on the exact call that mattered while every other endpoint worked fine. None of that showed up in a unit test. It showed up the instant real money and real domains touched it.
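Both ideas, the hard cap and not billing yourself for failures, fit in one small guard. This is a sketch, not the tier's real code: the class name and the flat per-call price are assumptions, and whether a failed call is actually free depends on the provider.

```python
from decimal import Decimal
from typing import Callable, TypeVar

T = TypeVar("T")


class BudgetExceeded(RuntimeError):
    """Raised before a call that would push spend past the cap."""


class Budget:
    def __init__(self, cap_usd: str):
        # Decimal avoids float drift when thousands of small charges add up.
        self.cap = Decimal(cap_usd)
        self.spent = Decimal("0")

    def run(self, cost_usd: str, call: Callable[[], T]) -> T:
        """Run one paid call, refusing it up front if it cannot fit under the cap."""
        cost = Decimal(cost_usd)
        if self.spent + cost > self.cap:
            raise BudgetExceeded(f"cap {self.cap} reached at {self.spent} spent")
        result = call()        # if this raises (say, a 401), nothing is recorded
        self.spent += cost     # charge only after the call succeeded
        return result
```

Because the check happens before the call, the worst case is stopping early, never a bill above the cap.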

The real product is the calibration

This is the thing I keep relearning. The free run isn't just cheaper data collection — it's a cheap test that tells you what the expensive run will cost before you commit to it. The route split is the cost model. Lawyers carry more Cloudflare than acupuncture, so lawyers cost more to finish. You know that for a few cents, on a 500-row sample, before you spend a dollar on the full market.
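"The route split is the cost model" is just multiplication. A sketch using the acupuncture shares from the triage above; the per-domain prices are placeholders I made up, so swap in whatever your render service and contacts API actually charge.

```python
def estimate_paid_cost(
    route_shares: dict[str, float],
    total_domains: int,
    unit_cost_usd: dict[str, float],
) -> float:
    """Project paid-tier spend from the bucket shares measured on the free sample."""
    return sum(
        total_domains * route_shares.get(route, 0.0) * cost
        for route, cost in unit_cost_usd.items()
    )


# Shares from the acupuncture sample: 20% bot-walled, 6% needing a JS render.
acupuncture_shares = {"bot_wall": 0.20, "js_shell": 0.06}
# Hypothetical prices per domain; not real vendor pricing.
prices = {"bot_wall": 0.01, "js_shell": 0.002}

projected = estimate_paid_cost(acupuncture_shares, 100_000, prices)
```

With these placeholder prices, 20,000 walled domains at a cent each plus 6,000 renders at a fifth of a cent comes to about $212 for a 100,000-row market. A market with a larger bot-wall share moves that number proportionally, which is the sense in which the sample prices the full run.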

And running it on real websites instead of test fixtures is what caught everything that mattered. Sentry error-tracking strings that parse as emails. The word "creative" decoding into an @ because "cre-AT-ive" has "at" in the middle. Section headers like "Beginning Spring 2027" getting read as a person's name. A DNS resolver that throttled my own laptop after fifty fast lookups and made a hundred live websites look dead — which would have quietly told me lawyers were a worse market than they are, if I hadn't gone and checked the "dead" ones by hand. Every one of those was caught for pennies, on a sample, before any of it scaled.
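Two of those traps have small, testable fixes. The sketch below is my guess at reasonable guards, not the filters Lynx uses: it only rewrites "at" and "dot" when they are wrapped in brackets, and it drops addresses whose local part is a 32-character hex key or whose domain ends in sentry.io.

```python
import re

# "at"/"dot" count as separators only when bracketed: [at], (at), {dot} ...
# A bare "at" inside a word such as "creative" never matches.
AT_RE = re.compile(r"\s*[\[({]\s*at\s*[\])}]\s*", re.I)
DOT_RE = re.compile(r"\s*[\[({]\s*dot\s*[\])}]\s*", re.I)
HEX_KEY_RE = re.compile(r"^[0-9a-f]{32}$")


def deobfuscate(text: str) -> str:
    """Turn 'jane [at] clinic [dot] com' into 'jane@clinic.com'."""
    return DOT_RE.sub(".", AT_RE.sub("@", text))


def looks_like_real_email(address: str) -> bool:
    """Reject error-tracking keys that merely have the shape of an email."""
    local, _, domain = address.lower().partition("@")
    if HEX_KEY_RE.match(local):            # e.g. a Sentry DSN public key
        return False
    return not domain.endswith("sentry.io")
```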

Real data breaks naive extraction: a section heading ("Areas of Practice") sitting next to an email got grabbed as a person's name — and the grounding gate rejected it

Lazy first. Crazy later. Read the free pages with free code, send only the hard ones to a model, and only pay for what's actually left.

How to run it

It's a Claude Code skill. Annual subscribers install it once:

/edge install tam-contact-harvester

Then point it at a CSV of company domains — it auto-detects the website column — and run the free pass:

python3 scripts/harvest.py --input your-market.csv --out run/

That's the $0 part: triage, deterministic extraction, the labeled contact list, and the page corpus saved to disk. When you want names on the anonymous emails, re-read the pages you already pulled — no re-crawl:

python3 scripts/enrich_names.py --run run/        # cheap model, names onto emails
python3 scripts/verify_names.py --run run-enriched/  # adversarial name↔email check

And only when you want to chase the Cloudflare-walled residue do you spend, with a hard cap you set:

python3 scripts/tier1.py --run run-enriched/ --budget 5 --source apify

Free by default. You decide when money enters.

— Jordan Written by Claude Opus 4.8, Approved by Jordan

Below is the geeky version. Copy it into Claude Code and rebuild the whole thing yourself.

Or don't. Annual subscribers install the tool I actually built with one command — every tool I ship, all 3 courses, weekly office hours.


On the Edge by Blueprint