The math, the sources, what we don't do.
Most "AI visibility" tools grade their own homework with synthetic data and call it a "score". We sample real answer-engine responses across six engines with transparent statistics. Here's exactly how.
The funnel
Three signals per probe, parsed from each engine's actual response shape:
- Retrieved
- Your domain appears in the engine's web-search / RAG retrieval results. This is the engine's "considered" set: the pages it could have linked to.
- Cited
- The engine actually linked your domain in its answer or Sources list. The visible attribution most users will click.
- Mentioned
- Your brand name appears in the prose — with or without a link. Catches "memorized" mentions from training data.
The gaps between these three numbers are the actionable signals. Mentioned high, cited low = engines vouch for you via third parties. Retrieved low = your own pages aren't even being considered.
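As a rough illustration of how the three rates fall out of a batch of probes, here is a minimal sketch. The Probe class and the funnel_rates function are hypothetical names invented for this example, not our production code.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """One sampled engine response, reduced to the three funnel signals."""
    retrieved: bool  # domain appeared in the retrieval results
    cited: bool      # domain was linked in the answer or Sources list
    mentioned: bool  # brand name appeared in the prose


def funnel_rates(probes):
    """Return each signal as (hits, total, proportion)."""
    total = len(probes)
    rates = {}
    for signal in ("retrieved", "cited", "mentioned"):
        hits = sum(1 for p in probes if getattr(p, signal))
        rates[signal] = (hits, total, hits / total if total else 0.0)
    return rates


# Illustrative data only: four probes for one query set.
probes = [
    Probe(retrieved=True, cited=True, mentioned=True),
    Probe(retrieved=True, cited=False, mentioned=True),
    Probe(retrieved=False, cited=False, mentioned=True),
    Probe(retrieved=False, cited=False, mentioned=False),
]
rates = funnel_rates(probes)
for signal, (hits, total, share) in rates.items():
    print(f"{signal}: {hits}/{total} = {share:.0%}")
```

In this toy batch the mentioned rate sits well above the cited rate, which is the "mentioned high, cited low" gap described above.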
Wilson 95% confidence intervals
Every rate we report is a sampled proportion (e.g. 6 of 24 probes cited your domain). We report the Wilson score interval at 95% — a small-sample-friendly binomial confidence interval that correctly handles edge cases at 0% and 100% (where the normal-approximation interval breaks).
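For readers who want to check the numbers themselves, the Wilson score interval can be sketched in a few lines. The function name wilson_interval and the choice to return (0.0, 1.0) for an empty sample are assumptions made for this example.

```python
import math


def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion.

    z defaults to the two-sided 95% normal quantile.
    Returns (low, high), clamped to [0, 1].
    """
    if n == 0:
        # Assumption for this sketch: no data means no information.
        return (0.0, 1.0)
    p = successes / n
    z2 = z * z
    denom = 1 + z2 / n
    center = (p + z2 / (2 * n)) / denom
    half_width = (z / denom) * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n))
    return (max(0.0, center - half_width), min(1.0, center + half_width))


low, high = wilson_interval(6, 24)
print(f"6 of 24 cited: {low:.1%} to {high:.1%}")

zero_low, zero_high = wilson_interval(0, 24)
print(f"0 of 24 cited: {zero_low:.1%} to {zero_high:.1%}")
```

For 6 of 24 the interval comes out at roughly 12% to 45%. For 0 of 24 the upper bound stays well above zero (about 14%), which is the edge-case behavior the normal approximation gets wrong.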
When two runs are compared, a delta is flagged as significant only when the two Wilson CIs don't overlap. Anything else is treated as sampling noise; increase --runs to tighten the bounds.
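The overlap rule itself is simple enough to sketch. The function names below are hypothetical, and the example intervals are rounded Wilson 95% bounds for 6, 10, and 18 hits out of 24 probes.

```python
def intervals_overlap(a, b):
    """True if closed intervals a=(low, high) and b=(low, high) share a point."""
    return a[0] <= b[1] and b[0] <= a[1]


def delta_is_significant(ci_before, ci_after):
    """Flag a change only when the two confidence intervals are disjoint."""
    return not intervals_overlap(ci_before, ci_after)


# Rounded Wilson 95% intervals (illustrative inputs).
ci_6_of_24 = (0.120, 0.449)
ci_10_of_24 = (0.245, 0.612)
ci_18_of_24 = (0.551, 0.880)

print(delta_is_significant(ci_6_of_24, ci_10_of_24))  # intervals overlap
print(delta_is_significant(ci_6_of_24, ci_18_of_24))  # intervals are disjoint
```

Requiring disjoint intervals is a conservative rule: it will sometimes withhold a flag that a direct two-proportion test would raise, which is the trade-off we accept at small sample sizes.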
The proven GEO levers
The only peer-reviewed evidence base for what actually moves AI-answer-engine citation is the KDD'24 paper Generative Engine Optimization (Aggarwal et al., Princeton / Georgia Tech / Allen AI). It found measurable lifts from:
- Adding statistics (~31% lift in position-adjusted visibility)
- Adding quotations (~41%)
- Adding cited sources (~30%)
- Authoritative language (~11%)
- Keyword stuffing — negative (–9%, actively hurts)
Our generated content briefs use exactly these levers and explicitly avoid the snake-oil ones: no llms.txt claims and no "schema-as-citation-lever" pitches, since neither has causal evidence behind it.
Engines, per-engine
- Claude (Sonnet) + WebSearch
- Routed through the authenticated claude CLI; stream-JSON output gives us the exact WebSearch tool-call queries and result URLs.
- OpenAI (gpt-4o) + Responses web_search
- POST to /v1/responses with the built-in web_search tool; citations come back as URL annotations on output_text content.
- Gemini (2.0-flash) + google_search grounding
- POST to generativelanguage.googleapis.com; grounding_metadata exposes the web URIs the model relied on.
- Perplexity Sonar
- POST to Sonar /chat/completions; the top-level citations array is the live retrieval/citation set.
- Google AI Overviews · via SerpApi
- Two-step flow: regular Google SERP, then engine=google_ai_overview with the page_token when needed. text_blocks + references map cleanly to our funnel; when no AIO is shown for a query we record an error (rates compute over actual triggers).
- Bing / Copilot · via SerpApi
- Defensive extraction across generative_search, copilot_answer, ai_answer, and instant_answer. Same honest framing: when no AI artifact is returned, no fake probe is recorded.
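To make the "cited" check concrete, here is a minimal sketch of matching a domain against two of the response shapes described above. It works on already-decoded JSON and makes no network calls. The function names are hypothetical, and the nesting assumed for the OpenAI payload (an "output" list whose "content" parts carry "annotations" with a "url" field) is an assumption that should be checked against the current API reference.

```python
from urllib.parse import urlparse


def host_matches(url, domain):
    """True if the URL's host is the domain itself or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    domain = domain.lower()
    return host == domain or host.endswith("." + domain)


def perplexity_cited(payload, domain):
    """Check the top-level citations array of a Sonar-style response."""
    return any(host_matches(u, domain) for u in payload.get("citations", []))


def openai_cited(payload, domain):
    """Check URL annotations on output_text parts (assumed payload shape)."""
    for item in payload.get("output", []):
        for part in item.get("content") or []:
            if part.get("type") != "output_text":
                continue
            for annotation in part.get("annotations", []):
                if host_matches(annotation.get("url", ""), domain):
                    return True
    return False


# Illustrative payloads only, trimmed to the fields the sketch reads.
sonar = {"citations": ["https://docs.example.com/guide", "https://other.org/"]}
responses = {
    "output": [
        {"type": "web_search_call"},
        {"type": "message", "content": [{
            "type": "output_text",
            "text": "...",
            "annotations": [{"type": "url_citation",
                             "url": "https://notexample.com/x"}],
        }]},
    ]
}
print(perplexity_cited(sonar, "example.com"))
print(openai_cited(responses, "example.com"))
```

Matching on the parsed hostname rather than a substring keeps look-alike domains such as notexample.com from counting as a citation.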
What we don't do
- No "AI visibility score" that bundles ten things into one opaque number
- No synthetic prompts unrelated to real buyer intent
- No single-sample claims (every visibility rate has a sample size and a CI)
- No promises that llms.txt or schema markup move citation rate; they don't, and Google has said so on the record
- No "rank tracking" framing: LLM outputs are non-deterministic; we sample, not rank