Content Gap Audit — Skill Build & Methodology

00 TL;DR

The question: Surfer SEO scrapes top-ranking pages and tells a writer what entities/topics their draft is missing. Can we do the same with our own stack instead of paying for Surfer?

The answer: Yes — and better. Surfer is content-agnostic; it has no idea Brainzyme® copy has a banned-term gate. We built a pipeline that does the same job and routes risky terms intelligently. It costs ~$0.30–0.50 per article vs Surfer's £59–99/month — breakeven at ~3 articles/month.

What shipped: a new skill, str-content-gap-audit, plus a head-to-head test proving it beats a real Surfer pass on compliance, topic coverage, and article structure.

01 The Journey: what we were thinking

This started as a question and turned into a build. The thinking evolved at each step, and two AI reviewers (Codex 5.5 and Gemini 2.5 Pro) materially changed the design.

1 · The premise

You shared a Surfer SEO before/after comparison: an article improved by feeding it ~80 entities Surfer scraped from top-ranking pages. The hypothesis: “we could do the same with DataForSEO.”

2 · The diagnosis

DataForSEO alone can't. Its keyword endpoints work from Google's ranking data (“what does this URL rank for”), not the page's actual content. Surfer reads the words on the page. So the real recipe is a composite: DataForSEO for the SERP + Firecrawl to scrape the pages + an LLM to extract and cluster the entities.

3 · The prototype

Built it and ran it on the keyword “buying nootropics uk.” Single-shot Gemini extraction over 5 scraped competitor pages produced 35 categorised, clustered entities: fewer items than Surfer's flat 80-item list, but richer in structure.

4 · Codex 5.5 review

Verdict: REFINE. 3 HIGH issues: don't let competitor terms push banned words into the rewrite; add a deterministic compliance pass; reference the connections registry. 6 MEDIUM issues, including: use map-reduce extraction, compute metrics in code, detect mixed-intent SERPs.

5 · The A/B/C test

Three versions of the same article compared: original draft, the real Surfer rewrite, and our pipeline's rewrite. Ours won on every measured axis.

6 · Gemini wildcard observer

Brought in for a genuinely independent third opinion. It caught the blind spot Codex and Claude both missed: stripping every banned term to zero can make the article invisible. If every competitor for a keyword discusses “caffeine,” an article that avoids the word looks less relevant to Google. Editorial mention is not a brand claim.

7 · The redesign

Gemini's fix, adopted: replace the binary banned/clean flag with a four-value handling strategy. Don't delete the term; reframe it.

8 · The v1 skill

Built str-content-gap-audit with every Codex MEDIUM and Gemini's handling-strategy baked in. Tested end-to-end, fixed one regex bug, registered and cross-tied it into the skill system.

02 Key Decisions

Decision	What we chose	Why
Separate skill or merge into `str-seo-audit`?	Separate skill	Different job: `str-seo-audit` is technical/site diagnosis; `str-content-gap-audit` is content/page generation. Bundled by orchestrator, not by skill.
Compliance flag design	4-value `handling_strategy`, not a binary ban	Gemini observer: a binary flag conflates “Brainzyme can't claim X” (true) with “the article can't mention X” (false, SEO-harmful).
Entity extraction	Map-reduce (per page, then merge)	Codex: single-shot loses per-page attribution and lets one long page dominate. Map-reduce gives an auditable competitor count per entity.
Metrics	Computed in code	Codex: never trust an LLM's word count. Word/heading counts are `len(re.findall(...))`.
Cross-tie / discoverability	`Related Skills` blocks + orchestrator phase	You raised that workflows don't always call the right tool. Solved with explicit cross-links in both SKILL.md files + an auto-run phase in `mkt-content-pipeline` — not a new Context-Matrix column (too invasive for a 40-row table).
Stack	DataForSEO + Firecrawl + Gemini 2.5 Pro	~$0.30–0.50/article vs Surfer £59–99/mo. All three already in the connections registry with service contracts.
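The “computed in code” decision can be illustrated with a minimal sketch. The function name `page_metrics` and the specific regexes are assumptions for illustration, not the skill's actual implementation, and the sketch assumes the scraped page is markdown with `#`-style headings.

```python
import re


def page_metrics(markdown: str) -> dict:
    """Count words, headings and images in scraped markdown (illustrative only)."""
    # ATX headings: 1-6 '#' characters at the start of a line, then text.
    headings = re.findall(r"^#{1,6}\s+\S", markdown, flags=re.MULTILINE)
    # Markdown images of the form ![alt](url).
    images = re.findall(r"!\[[^\]]*\]\([^)]*\)", markdown)
    # Drop image syntax and unwrap links so URLs are not counted as words.
    text = re.sub(r"!\[[^\]]*\]\([^)]*\)", " ", markdown)
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)
    # A word is a run of word characters, allowing inner apostrophes/hyphens.
    words = re.findall(r"\w+(?:['-]\w+)*", text)
    return {"words": len(words), "headings": len(headings), "images": len(images)}
```

Because the counts come from `len(re.findall(...))`, two runs over the same scraped page always give the same numbers, which is the property the review asked for.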

The compliance trade-off, resolved. The binary strip was an over-correction. An SEO article can say “many nootropic formulas rely on caffeine” — it just cannot say “Brainzyme® contains caffeine.” The handling strategy is what holds that line: reframe the term, position Brainzyme as the alternative, never imply the brand claim.

03 A/B/C Test: Surfer vs Our Pipeline

One article — “Buying Nootropics UK?” — three versions, same metric battery.

Metric	Tab 1 Original	Tab 2 Surfer	Tab 3 Ours	Winner
Brainzyme-banned terms (lower better)	46	42	0	Ours
Compliant entity coverage (of 21)	8	14	20	Ours
Structural sections present (of 8)	4	5	8	Ours
Cost	manual	£59–99/mo	~$0.30–0.50	Ours

The headline finding. Surfer's rewrite still contained 42 banned-for-Brainzyme terms, barely down from the 46 in the original draft. It cannot know about them: the banned-copy gate and the 4-class lexicon are proprietary infrastructure Surfer has no access to. That is the moat.

Honest caveats. (1) n=1: this is one article, a data point, not a proof. Validate across several before trusting the pipeline as a system. (2) The prototype's binary strip drove banned terms to a literal 0; the v1 skill's handling strategy will instead show a small non-zero count of editorially reframed mentions, which is correct, not a regression. One supporting point rather than a caveat: Gemini's independent review also rated our article the better read (“not even close”) on structure and scannability, not just metrics.
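For anyone repeating the test on more articles, the two text-based rows of the table (banned-term count and entity coverage) can be scored along these lines. This is a sketch under assumptions: the helper names are hypothetical, and the real banned-term lexicon and 21-entity list are not reproduced here.

```python
import re


def count_term_hits(text: str, terms: list[str]) -> int:
    """Total whole-term, case-insensitive occurrences of any listed term."""
    total = 0
    for term in terms:
        # Lookarounds stop 'caffeine' from matching inside 'caffeinated'.
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        total += len(re.findall(pattern, text, flags=re.IGNORECASE))
    return total


def entity_coverage(text: str, entities: list[str]) -> int:
    """Number of distinct entities that appear at least once in the text."""
    return sum(1 for entity in entities if count_term_hits(text, [entity]) > 0)
```

Running both functions over each article version with the same term lists gives comparable numbers per column.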

04 What We Built: the `str-content-gap-audit` skill

A self-contained pipeline. Given a draft article and a target keyword, it returns a gap report: what the top-ranking competitors cover that the draft is missing, and how to add it without breaching the banned-copy gate.

The 8-stage pipeline

#	Stage	Tool	What it does
1	SERP fetch	DataForSEO REST	Top-10 organic + People Also Ask
2	Intent classification	heuristic regex	Labels each result editorial / shop / product / forum. Picks editorial scrape targets. Flags mixed-intent SERPs.
3	Competitor scrape	Firecrawl REST	Top-N pages → clean markdown
4	Metrics	code (regex)	Word / heading / image counts — computed, never asked of an LLM
5	Extraction — MAP	Gemini 2.5 Pro	Per-page entity list with placement + mention count
6	Extraction — REDUCE	Gemini 2.5 Pro	Merge, dedupe, cluster, rank-weight across pages
7	Handling-strategy routing	deterministic map	Each term → include / reframe / define / avoid
8	Gap report	code	JSON + human-readable markdown
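In the skill, stage 6 is a Gemini call. The bookkeeping that makes its output auditable (how many competitors mention each entity, weighted by rank) can be sketched deterministically. The input shape, the function name `merge_page_entities` and the `1 / rank` weighting are all assumptions for illustration, not the skill's actual scheme.

```python
from collections import defaultdict


def merge_page_entities(pages: list[dict]) -> list[dict]:
    """Merge per-page entity counts into one ranked list.

    Each page is assumed to look like:
        {"rank": 1, "entities": {"caffeine": 4, "ginseng": 1}}
    """
    merged = defaultdict(lambda: {"competitors": 0, "mentions": 0, "score": 0.0})
    for page in pages:
        # Normalise names within the page first so 'Caffeine' and 'caffeine'
        # count as one entity and the page is only counted once for it.
        per_page = defaultdict(int)
        for name, mentions in page["entities"].items():
            per_page[name.strip().lower()] += mentions
        for name, mentions in per_page.items():
            row = merged[name]
            row["competitors"] += 1                # pages mentioning the entity
            row["mentions"] += mentions            # total mentions across pages
            row["score"] += 1.0 / page["rank"]     # assumed rank weighting
    result = [{"entity": name, **row} for name, row in merged.items()]
    # Highest score first; ties broken alphabetically for stable output.
    result.sort(key=lambda r: (-r["score"], r["entity"]))
    return result
```

Keeping per-page attribution until this point is what prevents one long page from dominating: a term repeated many times on a single page still counts as one competitor.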

The handling strategy — the part Surfer cannot do

Strategy	Meaning	Example terms
include	Safe, on-brand — add freely	ashwagandha, ginseng, vitamin B12, adaptogens
reframe_as_alternative	Use the term editorially; position Brainzyme as the alternative; never imply the brand contains it	caffeine, lion's mane
define_and_differentiate	Name it to explain legal/category status; do not target it	piracetam, modafinil, smart drugs
avoid	Never appears, even editorially — ASA hard ban	neurodivergent

Grey-listed terms (reframe / define) carry needs_review: true — a human compliance owner signs off the actual wording against the canonical-messaging hybrid before publish. The skill recommends; it does not self-authorise grey-area copy.
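The routing in stage 7 can be pictured as a plain lookup. The sketch below uses only the example terms from the table above; the dictionary name, the function name and the fallback for unknown terms are assumptions, and the real lexicon lives in the skill, not here.

```python
# Example terms from the table above; the production lexicon is larger.
HANDLING_STRATEGY = {
    "ashwagandha": "include",
    "ginseng": "include",
    "vitamin b12": "include",
    "adaptogens": "include",
    "caffeine": "reframe_as_alternative",
    "lion's mane": "reframe_as_alternative",
    "piracetam": "define_and_differentiate",
    "modafinil": "define_and_differentiate",
    "smart drugs": "define_and_differentiate",
    "neurodivergent": "avoid",
}

# Grey-listed strategies need human sign-off before publish.
GREY_STRATEGIES = {"reframe_as_alternative", "define_and_differentiate"}


def route_term(term: str, default: str = "include") -> dict:
    """Map a term to its handling strategy (the default is an assumption)."""
    strategy = HANDLING_STRATEGY.get(term.strip().lower(), default)
    return {
        "term": term,
        "handling_strategy": strategy,
        "needs_review": strategy in GREY_STRATEGIES,
    }
```

For example, `route_term("Caffeine")` yields the `reframe_as_alternative` strategy with `needs_review` set to `True`, while `route_term("neurodivergent")` yields `avoid` with no review flag, since the term never appears at all.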

05 Where The Tools Live

What	Path
The skill folder	`F:/Agentic-OS/.claude/skills/str-content-gap-audit/`
Skill definition (triggers, steps, cross-links)	`…/SKILL.md`
The pipeline script	`…/scripts/content_gap_audit.py`
Methodology + Surfer comparison	`…/references/methodology.md`
API auth setup	`…/references/api-setup.md`
Prototype artifacts — A/B/C report, 3 article versions, gap audits	`F:/Agentic-OS/.tmp-drive-pull/` (scratch — not committed)
Service contracts (auth, failure modes)	`F:/Agentic-OS/reference/services/{dataforseo,firecrawl,gemini}.md`

Registered in: AGENTS.md (Skill Registry + Context Matrix), README.md, context/learnings.md, reference/tool-map.md — tool-map drift validator passes (41 skills in sync).

06 URLs

What	URL
This dashboard	apps.nutritionalproducts.org/content-gap-audit/
Command Centre home	apps.nutritionalproducts.org
On-Page Audit dashboard (has the new “Decisions Needed” tab — Tab 11)	apps.nutritionalproducts.org/onpage-audit/
SEO Master Sheet (session 46)	docs.google.com/…/1y3pgPgdpVxO14

Related work this session — the Decisions tab. The On-Page Audit dashboard gained a Tab 11 — “Decisions Needed”: 12 decision blocks (schema deploy path, P0/P1 compliance fixes, legal-approved copy fields) with a Send to Claude button that pushes your answers straight to F:/Agentic-OS/inbox/onpage-audit/. End-to-end verified.

07 How To Use It

Just ask Claude

The skill triggers on natural phrasing. Any of these route to it:

“content gap audit” · “what is my article missing” · “compare my draft to competitors” · “surfer alternative” · “why is my blog losing to competitors” · “audit this draft against top-ranking pages”

Or run it directly

python F:/Agentic-OS/.claude/skills/str-content-gap-audit/scripts/content_gap_audit.py \
  --keyword "buying nootropics uk" \
  --draft path/to/draft.md \
  --market uk \
  --top-n 5 \
  --out-dir projects/str-content-gap-audit/my-article/

Re-running? Pass --serp-json and/or --scraped-dir to reuse saved artifacts and skip the paid API calls.
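The reuse behaviour amounts to a load-or-fetch pattern. A minimal sketch, assuming a hypothetical helper `load_or_fetch_serp` and a caller-supplied `fetch` callable standing in for the paid DataForSEO request; the script's real internals may differ.

```python
import json
from pathlib import Path
from typing import Callable, Optional


def load_or_fetch_serp(serp_json: Optional[str], fetch: Callable[[], dict]) -> dict:
    """Return the saved SERP if a file path was given, else call the paid API."""
    if serp_json and Path(serp_json).is_file():
        # Saved artifact found: no API call, no cost.
        return json.loads(Path(serp_json).read_text(encoding="utf-8"))
    # No usable saved artifact: fall back to the live (paid) request.
    return fetch()
```

The same pattern would apply to `--scraped-dir`, with one saved markdown file per competitor page instead of a single JSON file.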

What you get back

gap-audit.md (human view) + gap-audit.json (machine view): ranked entities to add, semantic clusters, missing structural sections, People Also Ask questions, a competitor word-count benchmark, and the handling strategy for every risky term.

The workflow

1. Run the audit on your draft + target keyword.
2. Check the mixed-intent warning: if the SERP is mostly shop pages, the signal is noisier, so treat recommendations as directional.
3. Apply the include entities and missing_sections freely.
4. For reframe / define terms, apply the strategy, then have a human sign off the wording (these carry needs_review).
5. Run the rewrite through check_copy_against_canonical.py before publish.
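The mixed-intent warning in the second step comes from the heuristic classifier in stage 2. A rough sketch of how such a classifier might work is below; the URL patterns, the function names and the 50% editorial threshold are assumptions chosen for illustration, not the skill's actual rules.

```python
import re

# Checked in order; the first matching pattern wins. Patterns are illustrative.
INTENT_PATTERNS = [
    ("forum", re.compile(r"reddit\.com|quora\.com|/forums?/|/threads?/", re.I)),
    ("product", re.compile(r"/products?/|/dp/", re.I)),
    ("shop", re.compile(r"/shop\b|/store\b|/collections?/|/category/", re.I)),
]


def classify_result(url: str) -> str:
    """Label a SERP result as forum / product / shop, else editorial."""
    for label, pattern in INTENT_PATTERNS:
        if pattern.search(url):
            return label
    return "editorial"


def is_mixed_intent(labels: list[str], min_editorial_share: float = 0.5) -> bool:
    """Flag the SERP when editorial pages fall below the assumed threshold."""
    if not labels:
        return True
    return labels.count("editorial") / len(labels) < min_editorial_share
```

A SERP labelled `["shop", "shop", "product", "editorial"]` would be flagged, which is the cue to treat the recommendations as directional.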

In the content pipeline. If you run the full mkt-content-pipeline orchestrator, this audit now runs automatically as Phase 3.5 for any SEO article — you don't have to invoke it by hand.

08 What Skills Are

A skill is a self-contained capability folder under .claude/skills/. Each has a SKILL.md — YAML frontmatter (name + trigger phrases) and a body (steps, dependencies, references). When you describe a task, Claude matches it against every skill's trigger phrases and loads the matching one. Skills make Claude do a job the same reliable way every time, instead of improvising.

This skill's category

str- = Strategy. Siblings: str-seo-audit (technical SEO), str-ai-seo, str-programmatic-seo, str-schema-markup, str-campaign-strategy, str-trending-research.

The cross-tie

You flagged that workflows don't always reach for the right tool. Three fixes wired in:

1. Related Skills blocks in both str-seo-audit and str-content-gap-audit — each names the other with a routing rule.
2. Orchestrator phase — mkt-content-pipeline calls the audit automatically (Phase 3.5).
3. Registry rows — AGENTS.md + tool-map keep the index honest.

The routing rule, in one line: site fault (indexing, meta, speed) → str-seo-audit. Thin / shallow page content → str-content-gap-audit. Run both for the full picture.

09 What's Not Done: v1.1 candidates

Item	Why it matters	Priority
Gemini 503 retry-once	One transient API failure in the test run — the pipeline degraded gracefully (skipped the page), but a retry would be cleaner.	low
From-scratch mode (no draft yet)	Currently requires `--draft`. A competitor-analysis-only mode would suit briefing a writer before the first draft exists.	medium
Validate beyond n=1	The head-to-head is one article. Run it on 5–10 more before trusting the pipeline as a system.	medium
Live content score	Surfer scores as you type. This is a batch audit; re-run after each rewrite to see movement.	backlog
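The retry-once item in the first row is small enough to sketch. `TransientAPIError` is a hypothetical stand-in for whatever the Gemini client raises on a 503, and the two-second delay is an arbitrary assumption.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientAPIError(Exception):
    """Hypothetical stand-in for a 503 raised by the API client."""


def call_with_one_retry(fn: Callable[[], T], delay_s: float = 2.0) -> T:
    """Call fn; on a transient failure wait briefly and try exactly once more."""
    try:
        return fn()
    except TransientAPIError:
        time.sleep(delay_s)
        # A second failure propagates, so the caller can still skip the page.
        return fn()
```

Wrapping only the per-page extraction call keeps the existing graceful degradation: if the retry also fails, the page is skipped as it is today.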

Content Gap Audit dashboard · v1 · 2026-05-16 · Session 46 · Skill: str-content-gap-audit · Built with Claude (Opus) + Codex 5.5 review + Gemini 2.5 Pro observer.
