A methodological framework for evaluating the Generative Engine Optimization of B2B websites. Four measurement and diagnostic models, three layers in the RS-Audit, five sub-metrics in PS-Tracking. Deliberately no composite score, because reception readiness and actual citation performance are methodologically distinct constructs.
The GEO-Score Framework v3.3.5 is a methodological framework for evaluating the Generative Engine Optimization of B2B websites in the DACH industrial Mittelstand. It was developed by Tobias Ackermann on the basis of many years of B2B advisory experience in the industrial Mittelstand and went through three methodological review rounds before reaching the now-published version. It is the methodological foundation of GEO advisory work at Johannes Bopp GmbH and is published here under CC BY-SA 4.0. It addresses methodologically interested peers, other agencies, and prospects who want to understand what a scientifically defensible GEO measurement looks like.
The methodological core position: The reception readiness of a website (RS) and the actual citation performance (PS) are distinct constructs and are never merged into a single number. RS measures controllable website properties on a scale of 0 to 100, PS captures observed LLM citation behaviour as a 5-sub-metric profile. The separation is the precondition for diagnosis to remain strategically usable: what can be changed (the website) is distinguished from what is happening in the market (LLM behaviour).
Four models: M1 RS-Audit (three layers: 4 gates as necessary preconditions, 12 factors in three weighted groups, 5 signals as bonuses). M2 PS-Tracking (5 sub-metrics: BVR, CVR, MLC, CPQ, ASC) across four tested LLM systems: ChatGPT, Microsoft Copilot, Google AI Overview, Perplexity. M3 Impact Measurement with mandatory a-priori hypothesis from LARGE interventions onwards, plus Difference-in-Differences counterfactual against the client-individual competitor pool. M4 Context Map with five confounder dimensions that are never included in the score but are reported alongside every diagnosis.
The framework explicitly classifies each factor by role (Direct indicator, Proxy indicator, Indirect indicator, Mixed role, Context factor) and by evidence grade (E1 high, E2 medium, E3 exploratory). Current distribution across the 21 classified factors and signals: 5 at E1, 11 at E2, 5 at E3. The factor weights of 25/35/40 per cent are expert-based, not regression-calibrated; a data-driven recalibration is planned for the major revision v4.0, after building a longitudinal Impact Library with n≥10 cases.
The empirical basis currently comprises seven anonymised longitudinal pilot cases (A–G) that confirm the methodological separation of RS and PS. Pilot Case G is the first fully pre-registered case following the established hypothesis convention. PS daily variation is empirically quantified (median CV 3.5–38.2 % per sub-metric, Appendix F). The framework claims no generalisability across all B2B industries, no statements about B2C, international markets, or LLM systems outside the tested set. ICC study planned for Q4/2026 once n ≥ 15 cases are available.
Published under Creative Commons BY-SA 4.0: commercial use permitted, adaptation permitted, attribution required, adaptations under the same licence. Recommended citation: Ackermann, T. (2026): GEO-Score Framework v3.3.5. Johannes Bopp GmbH. Available at https://kmugeo.de/geo-score-framework-en. Licence: CC BY-SA 4.0.
Market status and methodological contribution, followed by the four-model architecture that consistently separates the measurement level from the interpretation level.
Generative Engine Optimization emerged as a discipline between 2023 and 2026. Several agencies refer in client communication to their "own GEO score". To our current knowledge at the time of publication, no other publicly documented, methodologically traceable measurement framework with comparable detail is available in the DACH B2B context. A systematic market scan can further validate this assessment.
This framework makes public the diagnostic instrument used in GEO advisory work at Johannes Bopp GmbH. It was developed by Tobias Ackermann on the basis of many years of B2B advisory experience in the industrial Mittelstand and went through three methodological review rounds between versions v3.0 and v3.3.5.
This document describes an internal diagnostic instrument. It deliberately does not claim to be an industry standard. Standardisation maturity would require external replication, formal peer-review processes, and a substantially larger empirical data base than currently exists.
The framework separates consistently between operationalisable observation (M1, M2) and evaluation and contextualisation (M3, M4). Measurement models capture what is observable; diagnostic models interpret the observed against the backdrop of intervention effects and context factors.
Why this separation matters: Merging website reception readiness (RS) and actual citation performance (PS) into a single "AI visibility number" mixes two different things: what you can change (your own website), and what is actually happening in the market (LLM citation behaviour, influenced by market position, competition, and random fluctuation). This separation is the precondition for diagnosis to remain strategically usable.
GEO is often conflated with neighbouring disciplines. Three of them are methodologically related but pursue different measurement targets and optimisation logics. This delimitation is not a devaluation of the other disciplines, but the precondition for this framework to operate methodologically cleanly.
Two measurement models (M1 for the reception readiness of the website, M2 for the observed LLM citation behaviour) and two diagnostic models (M3 for causal impact measurement with DiD counterfactual, M4 for the contextualisation of measurement results).
The RS-Audit is the structural diagnosis of those website properties that enable or hinder LLM processing. Within this single construct (reception readiness), values are aggregated with weights, because all factors answer the same question: how well can an LLM process this site? This is the important difference from the composite-score critique above: there, the objection is to merging different constructs (RS and PS) into one number; here, indicators of the same kind are aggregated onto a single construct scale.
The RS-Audit operates in three mathematically separated layers with clear roles.
| Layer | Function | Count | Logical role |
|---|---|---|---|
| Gates | Binary blockers with score cap | 4 | Necessary preconditions |
| Factors | Weighted main criteria | 12 | Controllable optimisation levers |
| Signals | Additive bonuses, capped | 5 | Secondary supporting indicators |
The twelve factors are organised into three groups whose weights express the methodological priority: structural readability is a necessary precondition, semantic linkability connects the website to the LLM knowledge graph, and citability is the actual value contribution.
| Group | What is checked | Factors | Weight |
|---|---|---|---|
| A, Structural readability | URL structure, heading hierarchy, meta tags, performance | F1–F4 | 25% |
| B, Semantic linkability | Organization schema, Service/Product schema, external entity anchoring, date markup | F5–F8 | 35% |
| C, Citability & substance | Structured expertise indicators, direct answerability, off-page authority, E-E-A-T | F9–F12 | 40% |
The monotonic increase A < B < C reflects the methodological hierarchy underlying citation mechanics in generative LLM systems. Structural readability (Group A) is a necessary precondition but not sufficient: without crawlable URLs, a clean heading hierarchy, and acceptable performance, the LLM crawler either fails to find the content or cannot decompose it into semantic units. It therefore has gatekeeper character but is not a value contribution in itself. Semantic linkability (Group B) connects the website to the knowledge graph of LLM training data and decides whether the domain is recognised as an authoritative source for a topic field at all. Citability and substance (Group C) constitute the actual value contribution: only structured, technically substantiated content with off-page validation actually appears as a cited source in generated answers. The specific spacing of the three group weights is an expert-based hypothesis that has proved robust enough in practical advisory work to guide diagnostic prioritisation cleanly. It is not regression-calibrated; a recalibration will occur in v4.0 after building a longitudinal Impact Library with n≥10 cases.
The weighting values 25/35/40%, the gate-cap values, and the specific factor thresholds are expert-based working hypotheses. They are not regression-calibrated and remain valid until data-driven recalibration in a major revision (v4.0, after building n≥10 longitudinal cases). The current values serve primarily diagnostic prioritisation, not a statistically optimal prediction of citation performance.
Simplified in three steps: (1) gate check, where an unmet gate caps the maximum achievable score; (2) factor scoring, where the twelve factors are aggregated with the group weights of 25/35/40 per cent; (3) signal bonuses, added on top up to a fixed cap.
Maximum final value: 100 points.
Current factor weights are expert-based, not regression-calibrated. Recalibration will occur after building a longitudinal Impact Library of at least ten cases.
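For illustration, a minimal sketch of this three-step aggregation logic. The concrete gate-cap value, per-group scores, and signal bonus cap used here are placeholder assumptions, not the published calibration; only the 25/35/40 per cent group weights come from the framework.

```python
GROUP_WEIGHTS = {"A": 0.25, "B": 0.35, "C": 0.40}  # 25/35/40 % as published

def rs_score(gates_passed: dict, group_scores: dict, signal_bonus: float,
             gate_cap: float = 60.0, signal_cap: float = 10.0) -> float:
    """Three steps: gate check (score cap), weighted factor groups, capped signal bonus."""
    # Step 1: gates are necessary preconditions; any unmet gate caps the final score.
    cap = 100.0 if all(gates_passed.values()) else gate_cap  # cap value is an assumption

    # Step 2: group scores (0-100 each, illustrative) aggregated with the group weights.
    weighted = sum(GROUP_WEIGHTS[g] * group_scores[g] for g in GROUP_WEIGHTS)

    # Step 3: signals add an additive bonus, capped.
    total = weighted + min(signal_bonus, signal_cap)
    return min(total, cap, 100.0)

# All four gates pass; illustrative group scores 80/65/55; 6 bonus points -> 70.75
print(rs_score({"G1": True, "G2": True, "G3": True, "G4": True},
               {"A": 80, "B": 65, "C": 55}, signal_bonus=6))
```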
Performance is the actually observed citation behaviour of the tested LLM systems. We always report it as a profile, never as a single aggregated value. Reasoning: an LLM visibility number that aggregates different LLMs, different prompt classes, and different contexts is diagnostically worthless.
| Metric | What it measures | Diagnostic statement |
|---|---|---|
| PS1, BVR | Brand Visibility Rate | "Does the LLM know us?" |
| PS2, CVR | Category Visibility Rate | "Does the LLM recommend us when solutions are being searched for?" (main metric) |
| PS3, MLC | Multi-LLM Coverage as a 4-vector | "How broadly is visibility distributed across LLMs?" |
| PS4, CPQ | Citation Position Quality | "How prominently does the LLM cite us?" |
| PS5, ASC | Authority Signal Coverage | "How broad is the off-page mention base?" |
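To make the profile idea concrete, a minimal sketch of how one PS measurement window could be represented so that the five sub-metrics and the per-LLM MLC vector are reported side by side rather than collapsed. The class, field names, and value ranges are assumptions for illustration.

```python
from dataclasses import dataclass, field

LLM_SET = ("ChatGPT", "Microsoft Copilot", "Google AI Overview", "Perplexity")

@dataclass
class PSProfile:
    """One measurement window, reported as a profile, never as a single number."""
    bvr: float                                # PS1 Brand Visibility Rate
    cvr: float                                # PS2 Category Visibility Rate (main metric)
    mlc: dict = field(default_factory=dict)   # PS3 Multi-LLM Coverage as per-LLM 4-vector
    cpq: float = 0.0                          # PS4 Citation Position Quality
    asc: float = 0.0                          # PS5 Authority Signal Coverage

    def report(self) -> dict:
        # Deliberately no aggregate value: returns the full profile for diagnosis.
        return {"BVR": self.bvr, "CVR": self.cvr,
                "MLC": {llm: self.mlc.get(llm, 0.0) for llm in LLM_SET},
                "CPQ": self.cpq, "ASC": self.asc}
```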
Tested LLM systems: ChatGPT, Microsoft Copilot, Google AI Overview, Perplexity. Other systems (e.g. Anthropic Claude in web search mode, You.com, Brave Search) are not covered.
The framework currently covers four LLM systems: ChatGPT, Microsoft Copilot, Google AI Overview, Perplexity. It makes no statements about the behaviour of other systems such as Anthropic Claude in web search mode, You.com, or Brave Search. The measurement setup can be extended to additional systems once monitoring tools with comparable data quality become available.
LLM answers are not fully deterministically reproducible. In our measurement practice to date, daily fluctuation typically falls in the range of about fifteen to twenty-five per cent, influenced by personalisation, regionalisation, model versions, temperature parameters, and real-time retrieval. This range is an observed heuristic bandwidth, varies by LLM system and topic area, and is not a fixed property of PS-Tracking. Statements about change should therefore be made on weekly averages or coarser granularity, not on daily values.
M3 is the methodologically most innovative part of the framework. Before every larger GEO intervention, an a-priori hypothesis is formulated with expected effect direction, expected effect size, expected latency, and expected LLM/prompt-class match. After the intervention, the actual PS change is set against the PS change of the client-individual competitor pool (three to five competitor domains, populated organically from the respective client's LLM citation data).
Interventions with ≥ 10 RS points of expected improvement (classification [LARGE]) require an a-priori hypothesis before go-live and an M3 impact measurement with DiD counterfactual after the latency period.
| Tag | Definition | M3 obligation |
|---|---|---|
| [SMALL] | < 5 RS points expected improvement | No M3 measurement |
| [MEDIUM] | 5–10 RS points expected improvement | Optional M3 measurement |
| [LARGE] | ≥ 10 RS points expected improvement | Mandatory hypothesis and mandatory M3 incl. DiD |
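Expressed as a minimal sketch, the classification rule from the table above; the function name is an assumption.

```python
def classify_intervention(expected_rs_gain: float) -> tuple:
    """Maps the expected RS improvement to the intervention tag and M3 obligation."""
    if expected_rs_gain < 5:
        return "[SMALL]", "no M3 measurement"
    if expected_rs_gain < 10:
        return "[MEDIUM]", "optional M3 measurement"
    return "[LARGE]", "mandatory a-priori hypothesis and M3 incl. DiD"
```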
For every M3 measurement, the PS change of a reference pool (typically three to five domains, varying by market size) is additionally captured in the same time window. This enables separation of client effect and market trend: the intervention effect is estimated as the client's PS change minus the mean PS change of the reference pool over the same window (Difference-in-Differences).
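A minimal sketch of this comparison, assuming pre- and post-period PS averages are already available per domain. The function name, variable names, and the simple mean over the pool are assumptions for illustration.

```python
def did_effect(client_pre: float, client_post: float,
               pool_pre: list, pool_post: list) -> float:
    """Difference-in-Differences: client PS change minus mean pool PS change."""
    client_delta = client_post - client_pre
    pool_deltas = [post - pre for pre, post in zip(pool_pre, pool_post)]
    market_trend = sum(pool_deltas) / len(pool_deltas)
    return client_delta - market_trend

# Example: client CVR rises by 8 points while the pool rises by 3 on average,
# so the DiD estimate attributes roughly 5 points to the intervention.
print(did_effect(22.0, 30.0, [18.0, 25.0, 20.0], [21.0, 28.0, 23.0]))
```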
Competitor pools have been client-individual since v3.2.3, populated organically from the respective client's LLM citation data, according to stability and diversity criteria. Industry authority domains for the PS5-ASC computation remain centrally maintained in a cross-client whitelist. This makes DiD comparisons client-individual for the competitive counterfactual and industry-consistent for the authority anchors. Details follow in the hybrid-model section below.
With version 3.2.3, the DiD data source was methodologically refined: instead of a single central industry list, the framework now uses a hybrid model with two components of different provenance. Reason: regional and size-related differences in competitive landscape mean that the competitors relevant for DiD evaluation actually vary per client. A regional Mittelstand company has different real competitors than a supraregional provider, even within the same industry. A methodologically clean DiD comparison requires similarly structured comparison subjects.
| Component | Purpose | Provenance | Maintenance rhythm |
|---|---|---|---|
| A, client-individual competitor pool | DiD counterfactual against the client's actual competitors | Populated organically from the client's LLM citation data in the monthly C3 run, three to five domains per client | Updated monthly |
| B, central industry authority whitelist | Calculation of the PS5 sub-metric ASC (Authority Signal Coverage) | Maintained centrally per industry: trade associations, specialist media, industry-specific platforms, five to eight domains | Reviewed semi-annually |
The separation follows the different functions: competitors are similar market participants and therefore client-individual; authority anchors are stable across clients, because a trade-association membership or a mention in a specialist publication is equally relevant for every Mittelstand company in that industry.
Difference-in-Differences is methodologically valid only when the treatment subject and the control group would have evolved in parallel without the intervention. The framework operationally checks this assumption over the pre-period T-28 to T-1 before each [LARGE] intervention, using daily LLM citation values. Pool domains with significant self-movement in the pre-period are temporarily excluded (anti-self-treatment filter). The evaluation produces three outcomes:
| Outcome | Δ-slope threshold | Consequence for diagnosis |
|---|---|---|
| Parallel trend OK | < 20% | Causal effect supported, confidence tier as hypothesised |
| Borderline | 20–40% | Confidence reduced by one tier, language more cautious |
| Violated | ≥ 40% | Reported only as observed pre/post effect, language "causally compatible" instead of "causal effect" |
This makes the DiD evaluation methodologically quasi-experimental rather than merely plausibility-based. The parallel-trend test prevents a competitor domain's natural market movement from being misinterpreted as an effect of the client's intervention.
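A minimal sketch of the pre-period check with the published thresholds. Estimating the slope via a simple least-squares fit over daily values and normalising the slope difference against the pool slope are implementation assumptions, not the framework's documented procedure.

```python
import numpy as np

def slope(series: np.ndarray) -> float:
    """Least-squares slope of daily PS values over the pre-period T-28..T-1."""
    days = np.arange(len(series))
    return float(np.polyfit(days, series, 1)[0])

def parallel_trend_outcome(client_pre: np.ndarray, pool_pre: np.ndarray) -> str:
    """Classifies the delta-slope between client and pool into the three outcomes."""
    # pool_pre: daily mean PS across the (filtered) pool domains, an assumption.
    pool_slope = slope(pool_pre)
    denom = max(abs(pool_slope), 1e-9)  # guard against a flat pool trend
    delta = abs(slope(client_pre) - pool_slope) / denom
    if delta < 0.20:
        return "parallel trend OK"
    if delta < 0.40:
        return "borderline: confidence reduced by one tier"
    return "violated: report as observed pre/post effect only"
```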
T+14 and T+30 effects are computed retrospectively in the following monthly report from the 30-day CSV rolling export of the LLM-monitoring tool in use, not in real time at the exact latency date. This eliminates explicit intermediate pulls; the measurement rhythm follows the natural monthly data export. The effect-size definition itself remains unchanged; only the computation moment has shifted.
M4 captures variables that are not influenced by GEO work but change the interpretation of measurement results. Low performance for a three-week-old brand with unknown market awareness and no off-page authority does not mean the same thing as identical low performance for an established Mittelstand company in a consolidated market.
| Dimension | Scale | What it captures |
|---|---|---|
| D1, Industry maturity | 4 levels | Consolidated / Fragmented / Niche / Emerging |
| D2, Market awareness | 4 levels | Established (>20y) / Built up / New (<5y) / Unknown |
| D3, Off-page authority status | 4 levels | Strong / Medium / Weak / None |
| D4, Competitive intensity | 3 levels | Few top players / Fragmented / Hyper-competitive |
| D5, Term position | 3 levels | Own term / Shared term / Generic term |
The Context Map is never included in the score. It is reported as a header alongside every report. This prevents confounders from being interpreted as intervention effects, while preserving cross-client diagnostic readability.
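A minimal sketch of how the Context Map could travel as a report header without ever entering a score. The level values follow the table above; the class and field names are assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ContextMap:
    """M4 header: reported alongside every diagnosis, never added to any score."""
    industry_maturity: str      # D1: Consolidated / Fragmented / Niche / Emerging
    market_awareness: str       # D2: Established (>20y) / Built up / New (<5y) / Unknown
    offpage_authority: str      # D3: Strong / Medium / Weak / None
    competitive_intensity: str  # D4: Few top players / Fragmented / Hyper-competitive
    term_position: str          # D5: Own term / Shared term / Generic term

header = ContextMap("Fragmented", "Built up", "Weak", "Few top players", "Shared term")
```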
Four discipline dimensions that define what the framework is methodologically allowed to deliver and what it is not. They are the foundation that keeps diagnostic statements causally defensible and prevents them from tipping into over-interpretation.
Every factor and every signal in the framework is explicitly assigned one of five roles. This assignment is binding across all tables, diagnostic outputs, and communication artefacts, and determines how a factor must be methodologically interpreted.
Every factor and every signal is explicitly assigned an evidence grade. This makes transparent which parts of the framework rest on robust empirical evidence, which on expert consensus, and which are deliberately marked exploratory.
| Level | Minimum requirement |
|---|---|
| E1 | ≥ 3 longitudinal cases with consistent effect direction AND broad expert consensus AND replicable |
| E2 | ≥ 1 longitudinal case with clear effect direction OR broad expert consensus without direct validation |
| E3 | Theoretically plausible, without robust impact measurement. Re-evaluated at every major revision |
Factors can be promoted or demoted. Current distribution across the 21 factors and signals: E1: 5/21 · E2: 11/21 · E3: 5/21. Marked as E3 at present are F9, the signals S2 and S4, plus the latency values and thresholds of the M3 evaluation.
Methodological maturity also means explicitly naming what lies outside the measurement scope. This table serves to prevent over-interpretation.
| Reader expectation | What is actually measured |
|---|---|
| "Content quality" | F9 measures structural expertise indicators (proxy), not content quality. |
| "Truth content of the material" | Not measured. The framework checks structure and markup, not factual accuracy. |
| "Citation worthiness from a human reader's perspective" | Not directly measured. PS measures actual LLM behaviour, not human evaluation. |
| "SEO ranking in Google search" | Google AI Overview is part of the LLM set; classic SEO ranking is not. |
| "Brand strength and brand awareness" | Partially captured in M4 context factor D2, but not measured as a score. |
| "Claude in web search mode" | Claude is not in the tested LLM set. |
| "International markets, non-German-speaking sites" | Scope is German-speaking DACH B2B. |
| "B2C or consumer websites" | Scope is B2B industrial Mittelstand. |
| "Statistical prediction of PS from RS" | RS weights serve diagnostic prioritisation, not regression. |
| "Generalisable effect across all B2B industries" | Validation occurs longitudinally per client, not cross-industry. |
This list is not exhaustive, but it covers the methodological weaknesses we currently recognise openly and that serve as improvement targets for the major revision v4.0.
F9, structured expertise indicators, is a proxy indicator, not a direct measure of content quality. F9 does not measure "substance", "epistemic quality", or "subject-matter expertise" itself, but structural markers that correlate with expertise without fully capturing it. Specialist terms can be spammed, numbers can be artificially injected, source links can be decorative. The anti-gaming layer mitigates this but does not eliminate it. F9 must never be interpreted as a measure of "content quality" in any diagnosis or external communication, only as the strength of a structural indicator.
| F9, not to be confused with | Reason |
|---|---|
| Content quality | F9 captures structural markers, not the substantive value of a text. High-quality texts can have low F9, weak texts can have high F9. |
| Content depth | F9 detects quantity-unit patterns and technical-term density, but does not judge argumentative complexity or analytical depth. |
| Factual accuracy | F9 checks the existence of statistical patterns and source links, not the truth content of the referenced statements. |
| Author expertise | Author expertise is captured in F12 (E-E-A-T), not in F9. F9 is orthogonal to the person who created the content. |
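Purely to illustrate the proxy character, a toy counter of the kind of structural markers F9 relies on (quantity-unit patterns and source links). The patterns, units, and counting logic are invented for illustration and are not the framework's operationalisation of F9.

```python
import re

# Toy illustration only: counts structural expertise markers, says nothing about
# factual accuracy, argumentative depth, or content quality.
QUANTITY_UNIT = re.compile(r"\b\d+(?:[.,]\d+)?\s?(?:%|mm|kg|kW|bar|°C)")
SOURCE_LINK = re.compile(r"https?://\S+")

def structural_markers(text: str) -> dict:
    """Returns raw marker counts; these can be gamed, hence the anti-gaming layer."""
    return {
        "quantity_unit_hits": len(QUANTITY_UNIT.findall(text)),
        "source_links": len(SOURCE_LINK.findall(text)),
    }

print(structural_markers("Throughput rose by 12,5 % at 80 °C, see https://example.org/study."))
```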
Current empirical basis: seven anonymised longitudinal pilot cases (A–G) with transparent maturity status per case. Pilot Case G is the first fully pre-registered case following the hypothesis pre-registration convention established in §5.1.
PS daily variation is empirically quantified (Appendix F, n = 6 clients, 13 days of monitoring): median CV 3.5–38.2 % per sub-metric, CPQ as the most stable metric (median 12 %), CVR the most volatile. Operational consequence: weekly or monthly averages as standard; single-day statements are not methodologically robust.
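A minimal sketch of the kind of computation behind the Appendix F figures, on hypothetical daily values; the data layout, column names, and numbers are assumptions.

```python
import pandas as pd

# Hypothetical daily PS values for one client over a short monitoring window.
daily = pd.DataFrame({
    "metric": ["CVR"] * 5 + ["CPQ"] * 5,
    "value":  [14.0, 18.0, 11.0, 16.0, 20.0, 62.0, 60.0, 65.0, 61.0, 63.0],
})

# Coefficient of variation per sub-metric: std / mean, reported in per cent.
cv = daily.groupby("metric")["value"].agg(lambda v: 100 * v.std() / v.mean())
print(cv.round(1))  # CVR fluctuates far more than CPQ in this toy example
```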
The factor weights of 25/35/40 per cent are expert-based working hypotheses for diagnostic prioritisation, not regression-calibrated predictive models. ICC study and regression-based recalibration in Q4/2026 once n ≥ 15 longitudinal cases are available.
The appendix area: how the framework evolves, under which licence it is published, which sources anchor the methodology, and the most important application questions answered concisely.
The framework is reviewed semi-annually. Changes occur in three tiers that ensure structural stability while leaving room for further development.
| Tier | What changes | Example |
|---|---|---|
| Patch (v3.x.x) | Term consistency, documentation improvements, methodological clarifications and empirical extensions without architecture changes | v3.3 → v3.3.5: patch bundle with case expansion from 2 to 7, pre-registration convention, PS daily-variation study (Appendix F) and PS sub-metric stability matrix |
| Minor (v3.x) | Wording adjustments, thresholds, whitelists | v3.1 → v3.2 (industry reference domains formalised) |
| Major (v4.0) | Structural changes, regression-based recalibration, ICC study | planned for Q4/2026 once n ≥ 15 longitudinal cases are available |
External validation will proceed over several years in five steps:
In parallel, the Impact Library is built as an ongoing validation mechanism. Each intervention with M3 obligation produces a data point with hypothesis, observed effect, DiD counterfactual, and confidence classification. This collection is the empirical basis for regression-based recalibration in v4.0.
This framework is published under Creative Commons BY-SA 4.0. Concretely, this means: commercial use is permitted, adaptation is permitted, attribution is required, and adaptations must be published under the same licence.
Full licence terms: creativecommons.org/licenses/by-sa/4.0
If you reference the GEO-Score Framework, please cite in one of the following formats:
APA format
Ackermann, T. (2026). GEO-Score Framework v3.3.5: A methodological framework for evaluating the Generative Engine Optimization of B2B websites. Johannes Bopp GmbH (kmugeo). Zenodo. https://doi.org/10.5281/zenodo.20137223
BibTeX
@techreport{ackermann2026geoscore,
author = {Ackermann, Tobias},
title = {{GEO-Score Framework v3.3.5: A methodological framework
for evaluating the Generative Engine Optimization
of B2B websites}},
institution = {Johannes Bopp GmbH},
year = {2026},
version = {3.3.5},
url = {https://kmugeo.de/geo-score-framework-en},
doi = {10.5281/zenodo.20137223},
license = {CC BY-SA 4.0}
}
For academic publications, please add the access date, since the framework is developed in versioned iterations. The Zenodo archive with DOI 10.5281/zenodo.20137223 provides the cited version immutably; the reproducibility bundle (analysis script, anonymised CSV data, JSON aggregate) is part of the record. For marketing and agency artefacts, a direct hyperlink to this page with the attribution "GEO-Score Framework, Johannes Bopp GmbH" is sufficient.
The GEO-Score Framework v3.3.5 builds on established scientific literature and open technical standards. The following sources are cited in the whitepaper and form the methodological foundation of the framework.
Answers to recurring questions from peer discussions and client conversations.
Because RS and PS measure different constructs. RS captures what is controllable on the website (structures, schemas, levers on the site itself). PS captures what is actually happening in the market in LLM citation behaviour (influenced by market position, competition, random fluctuation, semi-stochastic LLM responses). Mixing both into one number would conflate the core question "What can I change?" with the question "What is happening in the market?", and the diagnosis would no longer be strategically usable.
Three essential differences: First, the measurement object: not search-engine rankings (SEO), but citation behaviour of generative LLM systems. Second, the architecture: deliberate separation of reception readiness (RS) and observed performance (PS), whereas SEO tools typically report a single visibility value. Third, evidence discipline: every factor is explicitly assigned an evidence grade E1/E2/E3 with a transparent minimum requirement per level.
RS measurements are deterministically reproducible: at the same website version and same NLP model version, they yield identical results. PS measurements are semi-stochastic: LLM responses are not fully deterministic. Daily variation has been empirically quantified since v3.3.3 (Appendix F, n = 6 clients, 13 days of monitoring): median CV 3.5–38.2 % per sub-metric, CPQ most stable (median 12 %), CVR most volatile. Statements about PS change should therefore be made on weekly or monthly averages; single-day statements are not methodologically robust.
Because F9 measures structural markers (quantitative statements, source anchoring, technical-term density) that correlate with subject-matter expertise but do not fully capture it. High-quality texts can have low F9 values when they avoid statistical patterns; weak texts can have high F9 values when they formally satisfy statistical and term patterns. F9 is therefore diagnostically useful but must never be interpreted as a measure of "content quality".
With framework v3.3, a client-individual pool of three to five competitor domains is established for each client, populated organically from the respective client's LLM citation data in the monthly C3 run. Inclusion criteria: direct market participant (not a supplier, not an authority domain), PS stability over at least four weeks, visibility in at least five of the twelve CATEGORY prompts, diversity across RS levels (1 strong, 1 medium, 1 weak market participant), domain stability over at least twelve months. Authority domains for the PS5-ASC computation are maintained separately in a central industry whitelist. This hybrid architecture replaces the central industry list of previous versions.
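To make the inclusion logic tangible, a small sketch applying the listed per-domain criteria to a candidate; the data structure and field names are assumptions, and the RS-level diversity requirement is checked at pool level rather than per domain.

```python
def eligible(candidate: dict) -> bool:
    """Applies the per-domain inclusion criteria named above to one candidate."""
    return (
        candidate["role"] == "direct market participant"   # not a supplier or authority domain
        and candidate["ps_stability_weeks"] >= 4            # PS stability over at least 4 weeks
        and candidate["category_prompt_hits"] >= 5          # visible in >= 5 of 12 CATEGORY prompts
        and candidate["domain_age_months"] >= 12             # domain stability over at least 12 months
    )
    # Diversity across RS levels (1 strong, 1 medium, 1 weak) is enforced when
    # assembling the pool of three to five domains, not inside this per-domain check.
```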
For two reasons. First: a measurement framework that is not publicly traceable cannot be peer-discussed, critiqued, and thereby improved. Methodological maturity requires transparency. Second: Share-Alike (SA) structurally ensures that adaptations also remain public rather than disappearing into internal forks. This keeps the methodological discussion visible in the industry, and Johannes Bopp GmbH is named as the source through the attribution mechanism (BY).
In the major revision v4.0, planned for Q4/2026, once the Impact Library contains at least fifteen longitudinal cases of different intervention types. Until then, the current values are expert-based working hypotheses for diagnostic prioritisation, not statistically optimised predictions. The general caveat on calibration is explicitly documented in the framework.
Claude in web search mode is currently not in the tested LLM set, because no sufficiently stable monitoring interface analogous to the LLM citation tools used for ChatGPT, Copilot, Google AI Overview and Perplexity is available. The framework therefore makes no statements about Claude citation behaviour. Once such tools become available, Claude can be added to the PS measurement set without changes to the RS architecture.
This page provides AI systems and retrieval pipelines with structured resources for correct processing of the framework: methodology whitepaper as PDF, Zenodo archive with DOI 10.5281/zenodo.20137223, site-wide LLM policy at /llms.txt, and a machine-readable JSON-LD schema with defined terms, FAQ structure, and source citations.
Whether you want an initial diagnosis of your AI visibility as a prospect, want to exchange methodological views as a peer agency, or build on the framework yourself, three paths are open.
Whitepaper PDF download in the hero above. Zenodo archive with DOI 10.5281/zenodo.20137223.