The epistemology of Plumb.
Publishers cannot get inside the LLM. They can only measure what the LLM does to their edge. Every metric in this product carries a confidence badge so you know exactly which lens you’re looking through.
Built for publishers where articles carry real economics.
Plumb’s commissioning workflow generates traceable ROI only when one well-placed article is worth meaningful money in the publisher’s business model. The methodology is built around two publisher archetypes: affiliate publishers (finance, insurance, commerce), where a single well-cited article drives commission revenue, and B2B vertical-trade publishers (SaaS, legal, healthcare), where coverage drives subscriptions and newsletter signups.
For general programmatic and regional news publishers, the per-article unit economics do not support the commissioning workflow. Plumb is a vitamin for those publishers — useful for agentic traffic detection and CPM protection — not a revenue-restoration tool. We are explicit about this distinction rather than pitching universally.
You cannot look inside an LLM.
The conventional analytics stack assumes a funnel you can trace end-to-end: request arrives, referrer is known, user intent is legible from the URL. AI answer surfaces break every link in that chain. Retrieval is opaque. Prompts are private. Attribution leaks into “Direct.” Citations come and go hour-to-hour.
Plumb does not pretend otherwise. It measures what a publisher can measure from inside its own stack, samples what it can sample from the outside, and labels the difference. When you see a number, you can see, at a glance, how it was produced.
What we measure.
100% coverage. These signals come from your own infrastructure and do not depend on any probe of a third-party model.
- Cloudflare bot crawl logs
Every GET request tagged by Cloudflare's bot-verified list — GPTBot, ClaudeBot, PerplexityBot, Google-Extended, Applebot, and the long tail. Plumb joins user-agent strings to a classification table so every crawl is attributable to a named platform. Coverage is 100% of traffic touching the edge.
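In sketch form, the join is a substring match from user agent to named platform. The table below is illustrative; the production mapping is driven by Cloudflare's verified-bot list, not a hand-rolled dictionary.

```python
# Illustrative substring -> platform table; the real join uses
# Cloudflare's verified-bot list, which is much longer.
BOT_PLATFORMS = {
    "GPTBot": "OpenAI",
    "ClaudeBot": "Anthropic",
    "PerplexityBot": "Perplexity",
    "Google-Extended": "Google",
    "Applebot": "Apple",
}

def classify_crawler(user_agent: str) -> str | None:
    """Attribute a crawl to a named platform; None means not a known AI bot."""
    for token, platform in BOT_PLATFORMS.items():
        if token in user_agent:
            return platform
    return None
```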
- GA4 referral sessions
Sessions with source = chatgpt.com / claude.ai / perplexity.ai / gemini.google.com / copilot.microsoft.com / other AI surfaces. The referral classifier is maintained against a living list because user agents and referrer headers from AI surfaces change more often than those of search engines.
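The classifier itself is small; the work is in keeping the list current. A sketch, with example hostnames and a deliberately conservative fallback (unknown hosts go to a review queue, never to a guess):

```python
from urllib.parse import urlparse

# Versioned, living list; the entries here are examples only.
AI_REFERRERS = {
    "chatgpt.com": "ChatGPT",
    "claude.ai": "Claude",
    "perplexity.ai": "Perplexity",
    "gemini.google.com": "Gemini",
    "copilot.microsoft.com": "Copilot",
}

def classify_referral(referrer_url: str) -> str | None:
    """Map a referrer URL to an AI surface, matching subdomains too."""
    host = urlparse(referrer_url).hostname or ""
    for domain, surface in AI_REFERRERS.items():
        if host == domain or host.endswith("." + domain):
            return surface
    return None  # unknown host: review queue, not a guess
```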
- GSC query + impression data with AIO flagging
Google Search Console impressions, clicks, and rank, with an additional flag for queries where an AI Overview (AIO) was rendered. This is how we detect the search-cannibalization signal — impressions holding up while clicks collapse — without needing to scrape SERPs.
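The test itself reduces to a ratio comparison. A pandas sketch over a hypothetical GSC export; the column names and thresholds are illustrative, not the product's tuned values:

```python
import pandas as pd

def aio_cannibalization(df: pd.DataFrame, min_impressions: int = 100) -> pd.DataFrame:
    """Flag AIO-exposed queries where impressions held but clicks collapsed.

    Expects columns: query, period ('before'/'after'), impressions,
    clicks, aio_shown (bool).
    """
    pivot = (df[df["aio_shown"]]
             .pivot_table(index="query", columns="period",
                          values=["impressions", "clicks"], aggfunc="sum"))
    impr_ratio = pivot["impressions"]["after"] / pivot["impressions"]["before"]
    click_ratio = pivot["clicks"]["after"] / pivot["clicks"]["before"]
    flagged = (impr_ratio > 0.9) & (click_ratio < 0.6)
    flagged &= pivot["impressions"]["before"] >= min_impressions
    return pivot[flagged]
```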
- robots.txt + llms.txt diffs
Versioned snapshots of your site-level permission files and week-over-week diffs. A policy change is a news event; we log it and correlate it downstream with changes in crawler behavior to sanity-check whether the bots are listening.
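The mechanics are deliberately boring: hash to detect a change, diff to describe it. A minimal sketch:

```python
import difflib
import hashlib

def permission_file_diff(prev: str, curr: str, path: str = "robots.txt") -> list[str] | None:
    """Return a unified diff if the permission file changed this week, else None."""
    if hashlib.sha256(prev.encode()).digest() == hashlib.sha256(curr.encode()).digest():
        return None  # no policy event
    return list(difflib.unified_diff(
        prev.splitlines(), curr.splitlines(),
        fromfile=f"{path}@last_week", tofile=f"{path}@this_week", lineterm=""))
```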
What we sample.
High variance. These signals come from querying third-party models on a weekly schedule and recording what comes back.
- Weekly citation probe panel
500 queries across topic clusters relevant to affiliate (finance, insurance, commerce) and B2B vertical-trade (SaaS, legal, healthcare) publishers, run against Perplexity, ChatGPT, Gemini, and Claude every Monday at 06:00 UTC. We log which domains are cited, in what order, and with what excerpt. This produces competitive citation share data.
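The loop is simple; the discipline is in the logging schema. In the sketch below, query_model is an injected stand-in for whatever client each platform exposes, assumed to return an ordered list of (domain, excerpt) citations:

```python
import datetime as dt

MODELS = ["perplexity", "chatgpt", "gemini", "claude"]

def run_panel(queries: list[str], query_model) -> list[dict]:
    """One weekly panel run; every citation becomes one durable row."""
    run_at = dt.datetime.now(dt.timezone.utc).isoformat()
    rows = []
    for model in MODELS:
        for q in queries:
            # query_model(model, prompt) -> [(domain, excerpt), ...] (hypothetical)
            for rank, (domain, excerpt) in enumerate(query_model(model, q), start=1):
                rows.append({"run_at": run_at, "model": model, "query": q,
                             "rank": rank, "domain": domain, "excerpt": excerpt})
    return rows
```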
- Known non-determinism
Same prompt, same model, temperature zero, same week — citation sets differ hour-to-hour. Week-to-week swings under 15 percentage points are inside the noise floor. We display trend sparklines, not single-point claims. For affiliate publishers tracking a finance cluster, this means a 5pp citation share swing in one week is noise; a 20pp swing over four weeks is signal.
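The noise-floor rule stated above, as a guard function; thresholds mirror the prose, and the series is citation share in percentage points with the most recent week last:

```python
def classify_swing(share_by_week: list[float]) -> str:
    """Label a citation-share move as noise, signal, or watch-and-wait."""
    month_delta = (abs(share_by_week[-1] - share_by_week[-5])
                   if len(share_by_week) >= 5 else 0.0)
    if month_delta >= 15.0:
        return "signal"   # sustained multi-week move
    if abs(share_by_week[-1] - share_by_week[-2]) < 15.0:
        return "noise"    # inside the weekly noise floor
    return "watch"        # single-week spike; wait for confirmation
```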
- What we use it for
Competitive citation share on topic clusters. Head-to-head matchups: NerdWallet vs. Bankrate vs. Forbes Advisor on a query cluster, or STAT News vs. BioPharma Dive vs. Endpoints on a healthcare vertical cluster. Emerging-query detection — finding clusters where citation voices are thin and the field hasn't filled in. Never for absolute AI visibility claims — only for directional competitive signal on the queries that drive affiliate conversions or B2B newsletter signups.
What we infer.
Model-derived. These signals come from Bayesian bounds, heuristic scoring, or diagnostic classifiers — treat as estimates under assumptions, not observations.
- Scenario Explorer — hidden AI share of Direct
Bayesian posterior over a latent variable: the fraction of 'Direct' sessions that began with an AI conversation. Prior is survey-calibrated (three publisher surveys, Q4 2025). Evidence is the conversion-rate lift of Direct vs. Organic. Output is a credible interval, not a point estimate. Sensitivity is visible in the sweep chart — the number moves with the prior, as it should.
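A grid approximation shows the shape of the computation. Everything numeric here is an assumption standing in for the calibrated values: the Beta(2, 8) prior for the survey calibration, cr_ai for the conversion rate of AI-originated sessions, and a mixture model in which non-AI Direct converts like Organic.

```python
import numpy as np
from scipy import stats

def hidden_ai_share_posterior(direct_sessions: int, direct_conversions: int,
                              cr_organic: float, cr_ai: float = 0.08,
                              prior_a: float = 2.0, prior_b: float = 8.0):
    """Posterior over f, the fraction of 'Direct' that began as an AI chat.

    Mixture assumption: observed Direct CR = f*cr_ai + (1-f)*cr_organic.
    """
    f = np.linspace(0.0, 1.0, 1001)
    prior = stats.beta.pdf(f, prior_a, prior_b)
    cr = f * cr_ai + (1 - f) * cr_organic
    likelihood = stats.binom.pmf(direct_conversions, direct_sessions, cr)
    post = prior * likelihood
    post /= post.sum()                      # discrete normalization over the grid
    cdf = np.cumsum(post)
    lo, hi = f[np.searchsorted(cdf, 0.05)], f[np.searchsorted(cdf, 0.95)]
    return f, post, (lo, hi)                # density plus a 90% credible interval
```

Sweeping prior_a and prior_b is exactly what the sensitivity chart does: the interval moves with the prior, visibly.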
- AI-Readiness heuristic scoring
Composite score over seven factors (webutation, search position, freshness, structure, metadata, expertise, performance). Weights are tuned against a historical panel of intervention outcomes. Useful as a work-order generator; not a ground-truth measure of crawlability.
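The scoring itself is a weighted sum; the value is in the tuning. The weights below are illustrative placeholders, not the panel-tuned values:

```python
# Illustrative weights over the seven factors; the product tunes these
# against a historical panel of intervention outcomes.
WEIGHTS = {"webutation": 0.20, "search_position": 0.20, "freshness": 0.15,
           "structure": 0.15, "metadata": 0.10, "expertise": 0.10,
           "performance": 0.10}

def readiness_score(factors: dict[str, float]) -> float:
    """Composite AI-Readiness score; each factor pre-normalized to [0, 1]."""
    assert set(factors) == set(WEIGHTS), "all seven factors required"
    return sum(WEIGHTS[k] * factors[k] for k in WEIGHTS)
```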
- Ignored-story diagnostics
Classifier that labels why a breakout-candidate story failed to resonate in AI surfaces: schema, timing, paywall, topic, stale, duplicate. Labels are probabilistic guesses from a small decision tree. Use them as hypotheses, not conclusions.
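In sketch form, assuming a fitted scikit-learn tree (training data and feature engineering not shown):

```python
from sklearn.tree import DecisionTreeClassifier

# The six diagnostic labels; y_train (not shown) draws from this set.
LABELS = ["schema", "timing", "paywall", "topic", "stale", "duplicate"]

def diagnose(clf: DecisionTreeClassifier, story_features: list[float]) -> list[tuple[str, float]]:
    """Rank failure hypotheses by class probability; hypotheses, not verdicts."""
    probs = clf.predict_proba([story_features])[0]
    ranked = sorted(zip(clf.classes_, probs), key=lambda pair: -pair[1])
    return [(str(label), round(float(p), 2)) for label, p in ranked]
```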
What we refuse to fabricate.
There are questions the product is structurally unable to answer. Every vendor claiming otherwise is extrapolating.
- LLM internal state
What the model 'thinks' about your content. What weights it assigns. How retrieval scored your domain. These are closed-source artifacts of private infrastructure. No outside-in tool can observe them.
- What users ask
The queries hitting ChatGPT, Claude, Gemini, or Perplexity are not logged to publishers. We can sample a synthetic panel (see §03) but we cannot know the long tail of what real users are typing. Any vendor showing 'real user queries' is either sampling or lying.
- How content is used in training
Whether your article ended up in a training set, a RAG index, or an evaluation suite. Crawl logs tell you a bot visited; they say nothing about downstream use. Statements about training-set inclusion are speculation.
- Retrieval decisions
When a user asks about 'Fed rate cuts,' why did the model cite you rather than Reuters? We can correlate crawl concentration with citation outcomes, but we cannot observe the scoring function. The link is suggestive, not explanatory.
- Answer-surface rendering
Outside-in scraping — using Playwright to render ChatGPT or Perplexity and extract citations — does not work at scale. Playwright is fingerprinted and rate-limited. Terms-of-service exposure is real. Signal decays faster than you can sample. Every serious attempt has failed; pretending otherwise creates false numbers.
How we aggregate across publishers.
When Plumb is deployed across multiple publishers, the instance can compute a cross-tenant aggregate — a first-party panel benchmark — alongside the published research literature.
- Minimum N = 3 tenants
No aggregate is published below three participating publishers. At N=0-2 the dashboard shows the static research-literature figure (Adobe, Moz, GEO Research, Position Digital, Enrichlabs) and labels it as such. At N≥3 the panel median replaces the primary reference and the research figure becomes the anchor in the tooltip.
- Distribution statistics only
Responses carry median, p25, and p75 — no min, no max, no per-tenant rows. Each caller's own tenant contributes to the aggregate but is indistinguishable from the others. Quantiles are the only exposure; the underlying data never leaves the panel.
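Both rules, the N >= 3 gate and quantiles-only exposure, fit in one function. A sketch; the research-figure fallback is a placeholder for the named static sources:

```python
import numpy as np

RESEARCH_FALLBACK = {"source": "research literature", "n": 0}  # static figure, labeled as such

def panel_aggregate(tenant_values: list[float]) -> dict:
    """Cross-tenant benchmark: quantiles only, never published below N=3."""
    if len(tenant_values) < 3:
        return RESEARCH_FALLBACK
    p25, p50, p75 = np.percentile(tenant_values, [25, 50, 75])
    return {"median": p50, "p25": p25, "p75": p75,
            "n": len(tenant_values), "source": "first-party panel"}
```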
- Refresh cadence
Panel aggregates are computed on demand with a five-minute in-process cache. Switching a panel participant in or out takes effect on the next cache expiry. There is no historical panel archive — each read is a live computation.
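A minimal sketch of the cadence: an in-process TTL cache around a live computation, with no archive on either side.

```python
import time

_CACHE: dict[str, tuple[float, dict]] = {}
TTL_SECONDS = 300  # five minutes, per the stated cadence

def cached_aggregate(metric: str, compute) -> dict:
    """On-demand panel read with a five-minute in-process cache."""
    now = time.monotonic()
    hit = _CACHE.get(metric)
    if hit and now - hit[0] < TTL_SECONDS:
        return hit[1]
    value = compute()          # live computation over current panel membership
    _CACHE[metric] = (now, value)
    return value
```

Membership changes take effect on the next expiry because nothing older than the cache exists to consult.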
- What the panel is not
It is not a benchmark purchased from a vendor, not a survey, not a scraped corpus. It is the internal measurement of publishers who have instrumented Plumb. Its reach is the reach of the product. Anywhere we cite a figure from outside this panel, the source is named in the tooltip.
Why Plumb, not your marketplace's dashboard.
The AI-content tooling landscape has sorted itself into three layers, and every other tool in the stack belongs to one of the first two. Plumb is the third: the layer the publisher owns.
A publisher using both TollBit and Cloudflare Pay-Per-Crawl already has two vendor dashboards, and each vendor has a financial interest in framing the picture favorably. Plumb is what verifies either of them.
Honest instrumentation, or nothing.
Plumb refuses to fabricate visibility it cannot actually deliver. Every metric in this product carries a confidence badge (MEASURED for first-party signals, SAMPLED for weekly probes, INFERRED for model outputs) so the reader of a chart always knows which lens they are looking through.
The audience for this dashboard is a revenue leader, a product owner, a finance partner, a board. They need to defend their numbers. An over-claimed visibility metric will be shredded by the first follow-up question. A properly labeled one, even when the uncertainty is large, survives scrutiny.
Decisions are durable too. Every triage call on a bleeding query — accept, dismiss, snooze — is persisted, as is every commission draft generated from one. The chain from measurement to decision to editorial action is a record, not a screenshot.
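The persisted record is small by design. A sketch of its shape; field names are illustrative:

```python
from dataclasses import dataclass
import datetime as dt

@dataclass(frozen=True)
class TriageDecision:
    """One durable row in the decision log."""
    query: str
    action: str                              # "accept" | "dismiss" | "snooze"
    decided_by: str
    decided_at: dt.datetime
    commission_draft_id: str | None = None   # set when a draft was generated
```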
The panel (§05.5) extends the same principle to benchmarking. When three or more publishers run Plumb, the instance can surface a first-party cross-tenant median instead of citing another vendor’s research PDF. Aggregate only, quantiles only, revoked on exit.
Anything in this category that claims to “show you what ChatGPT thinks about your content” is lying. Plumb refuses to be that product.