Methodology

The math, the sources, what we don't do.

Most "AI visibility" tools grade their own homework with synthetic data and call it a "score". We sample real answer-engine responses across six engines with transparent statistics. Here's exactly how.

The funnel

Three signals per probe, parsed from each engine's actual response shape:

Retrieved: Your domain appeared in the engine's web-search / RAG retrieval results. This is the engine's "considered" set — pages it could have linked to.
Cited: The engine actually linked your domain in its answer or Sources list. The visible attribution most users will click.
Mentioned: Your brand name appears in the prose — with or without a link. Catches "memorized" mentions from training data.

The gaps between these three numbers are the actionable signals. High Mentioned with low Cited means engines name you in prose but link to third parties rather than your own pages. Low Retrieved means your own pages aren't even being considered.
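
As an illustration of the three signals, here is a minimal Python sketch. It assumes each probe has already been parsed into a list of retrieved URLs, a list of cited URLs, and the answer text. The Probe class, the helper names, and the matching rules (subdomains count, brand match is case-insensitive on whole words) are hypothetical choices for this example, not a description of our production parser.

```python
import re
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class Probe:
    """One parsed engine response (hypothetical shape for this sketch)."""
    retrieved_urls: list[str]   # URLs from the engine's search / RAG step
    cited_urls: list[str]       # URLs linked in the answer or Sources list
    answer_text: str            # the prose the user actually reads


def host_matches(url: str, domain: str) -> bool:
    """True if the URL's host is the domain itself or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    domain = domain.lower()
    return host == domain or host.endswith("." + domain)


def funnel_signals(probe: Probe, domain: str, brand: str) -> dict[str, bool]:
    """Compute the three funnel signals for a single probe."""
    # Whole-word, case-insensitive brand match; lookarounds keep brands that
    # end in punctuation (e.g. "C++") from breaking the word boundary.
    brand_pattern = r"(?<!\w)" + re.escape(brand) + r"(?!\w)"
    return {
        "retrieved": any(host_matches(u, domain) for u in probe.retrieved_urls),
        "cited": any(host_matches(u, domain) for u in probe.cited_urls),
        "mentioned": re.search(brand_pattern, probe.answer_text, re.IGNORECASE) is not None,
    }


probe = Probe(
    retrieved_urls=["https://www.example.com/pricing", "https://other.org/post"],
    cited_urls=["https://review-site.com/best-tools"],
    answer_text="Many teams pick Example for this use case.",
)
signals = funnel_signals(probe, domain="example.com", brand="Example")
print(signals)
```

In this made-up probe the domain is retrieved and the brand is mentioned, but the only link goes to a third-party review site, which is the "mentioned high, cited low" pattern described above.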

Wilson 95% confidence intervals

Every rate we report is a sampled proportion (e.g. 6 of 24 probes cited your domain). We report the Wilson score interval at 95% — a small-sample-friendly binomial confidence interval that correctly handles edge cases at 0% and 100% (where the normal-approximation interval breaks).
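
For reference, the Wilson score interval can be computed in a few lines. The sketch below is a standard textbook formulation in Python. The function name and the clamping to [0, 1] are choices made for this example, not necessarily how our tool implements it.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959963984540054) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z defaults to the two-sided 95% normal quantile.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")

    p = successes / n
    z2 = z * z
    denom = 1 + z2 / n
    center = (p + z2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n)) / denom
    # Clamp to guard against floating-point drift just outside [0, 1].
    return max(0.0, center - half_width), min(1.0, center + half_width)


low, high = wilson_interval(6, 24)
print(f"6 of 24 cited: {low:.1%} to {high:.1%}")

zero_low, zero_high = wilson_interval(0, 24)
print(f"0 of 24 cited: {zero_low:.1%} to {zero_high:.1%}")
```

For the 6-of-24 example the interval is roughly 12% to 45%. For 0 of 24 the lower bound sits at zero while the upper bound stays clearly above it, which is the edge-case behavior the normal approximation gets wrong (it collapses to a zero-width interval).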

When two runs are compared, a delta is flagged significant only when the two Wilson CIs don't overlap. This is a deliberately conservative test: anything short of non-overlap is treated as sampling noise. Increase --runs to tighten the bound.
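
The non-overlap rule can be sketched as follows. This is an illustrative Python example under the same assumptions as the textbook Wilson formula. The function names are hypothetical, and the probe counts are invented to show how a larger sample narrows the intervals.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959963984540054) -> tuple[float, float]:
    """Wilson score interval at the confidence level implied by z (95% by default)."""
    p = successes / n
    z2 = z * z
    denom = 1 + z2 / n
    center = (p + z2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


def delta_is_significant(k_before: int, n_before: int, k_after: int, n_after: int) -> bool:
    """Flag a change only when the two Wilson intervals do not overlap."""
    low_a, high_a = wilson_interval(k_before, n_before)
    low_b, high_b = wilson_interval(k_after, n_after)
    return high_a < low_b or high_b < low_a


# 25% vs ~42% cited with 24 probes per run: intervals overlap, so not flagged.
small_sample = delta_is_significant(6, 24, 10, 24)

# The same two rates with 240 probes per run: intervals separate, so flagged.
large_sample = delta_is_significant(60, 240, 100, 240)

print(small_sample, large_sample)
```

The same observed rates go from "noise" to "significant" purely because the sample grew tenfold, which is why raising the number of runs is the remedy for an inconclusive comparison.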

The proven GEO levers

The only peer-reviewed evidence base for what actually moves AI-answer-engine citation is the KDD'24 paper Generative Engine Optimization (Aggarwal et al., Princeton / Georgia Tech / Allen AI). It found measurable lifts from:

Adding statistics (~31% lift in position-adjusted visibility)
Adding quotations (~41%)
Adding cited sources (~30%)
Authoritative language (~11%)
Keyword stuffing — negative (–9%, actively hurts)

Our generated content briefs use exactly these levers and explicitly avoid the snake-oil ones: no llms.txt claims and no "schema-as-citation-lever" pitches, since neither has causal evidence behind it.

Engines, per-engine

Claude (Sonnet) + WebSearch: Routed through the authenticated claude CLI; stream-JSON output gives us the exact WebSearch tool-call queries and result URLs.
OpenAI (gpt-4o) + Responses web_search: POST to /v1/responses with the built-in web_search tool; citations come back as URL annotations on output_text content.
Gemini (2.0-flash) + google_search grounding: POST to generativelanguage.googleapis.com; grounding_metadata exposes the web URIs the model relied on.
Perplexity Sonar: POST to Sonar /chat/completions; the top-level citations array is the live retrieval/citation set.
Google AI Overviews · via SerpApi: Two-step flow: a regular Google SERP request, then engine=google_ai_overview with the page_token when needed. text_blocks + references map cleanly to our funnel. When no AIO is shown for a query we record an error rather than a probe, so rates are computed only over queries that actually triggered an AIO.
Bing / Copilot · via SerpApi: Defensive extraction across generative_search, copilot_answer, ai_answer, and instant_answer. Same honest framing — when no AI artifact is returned, no fake probe is recorded.
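
To make "parsed from each engine's actual response shape" concrete, here is a hedged Python sketch for two of the engines above. It operates on already-decoded JSON and assumes the shapes described in the list: a top-level citations array of URL strings for Perplexity Sonar, and URL annotations attached to output_text content parts for the OpenAI Responses API. The function names and the sample payloads are invented for illustration, and real responses carry more fields than shown.

```python
from urllib.parse import urlparse


def perplexity_citations(response: dict) -> list[str]:
    """Cited URLs from a Sonar response (assumes a top-level 'citations' list)."""
    return [u for u in response.get("citations", []) if isinstance(u, str)]


def openai_citations(response: dict) -> list[str]:
    """Cited URLs from a Responses API payload (assumes URL annotations
    on 'output_text' content parts inside the 'output' items)."""
    urls = []
    for item in response.get("output", []):
        # Tool-call items have no 'content'; treat that as an empty list.
        for part in item.get("content") or []:
            if part.get("type") != "output_text":
                continue
            for annotation in part.get("annotations") or []:
                url = annotation.get("url")
                if url:
                    urls.append(url)
    return urls


def cited_domains(urls: list[str]) -> set[str]:
    """Normalize URLs to bare hostnames so engines can be compared."""
    domains = set()
    for url in urls:
        host = (urlparse(url).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]
        if host:
            domains.add(host)
    return domains


# Invented sample payloads, trimmed to the fields this sketch reads.
sonar_sample = {"citations": ["https://www.example.com/guide", "https://other.org/a"]}
openai_sample = {
    "output": [
        {"type": "web_search_call"},
        {"type": "message", "content": [
            {"type": "output_text", "text": "See the guide.", "annotations": [
                {"type": "url_citation", "url": "https://example.com/guide"},
            ]},
        ]},
    ]
}

print(cited_domains(perplexity_citations(sonar_sample)))
print(cited_domains(openai_citations(openai_sample)))
```

Normalizing to hostnames before comparison keeps the Cited signal consistent across engines that return different URL forms for the same page.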

What we don't do

No "AI visibility score" that bundles ten things into one opaque number
No synthetic prompts unrelated to real buyer intent
No single-sample claims (every visibility rate has a sample size and a CI)
No promises that llms.txt or schema markup move citation rate: no causal evidence shows they do, and Google has said on the record that they don't
No "rank tracking" framing: LLM outputs are non-deterministic, so we sample rather than rank