A methodological framework for evaluating the Generative Engine Optimization of B2B websites. Four measurement and diagnostic models, three layers in the RS-Audit, five sub-metrics in PS-Tracking. Deliberately no composite score, because reception readiness and actual citation performance are methodologically distinct constructs.
The GEO-Score Framework v3.3.5 is a methodological framework for evaluating the Generative Engine Optimization of B2B websites in the DACH industrial Mittelstand. It was developed by Tobias Ackermann on the basis of many years of B2B advisory experience in the industrial Mittelstand and went through three methodological review rounds before reaching the now-published version. It is the methodological foundation of GEO advisory work at Johannes Bopp GmbH and is published here under CC BY-SA 4.0. It addresses methodologically interested peers, other agencies, and prospects who want to understand what a scientifically defensible GEO measurement looks like.
The methodological core position: The reception readiness of a website (RS) and the actual citation performance (PS) are distinct constructs and are never merged into a single number. RS measures controllable website properties on a scale of 0 to 100, PS captures observed LLM citation behaviour as a 5-sub-metric profile. The separation is the precondition for diagnosis to remain strategically usable: what can be changed (the website) is distinguished from what is happening in the market (LLM behaviour).
Four models: M1 RS-Audit (three layers: 4 gates as necessary preconditions, 12 factors in three weighted groups, 5 signals as bonuses). M2 PS-Tracking (5 sub-metrics: BVR, CVR, MLC, CPQ, ASC) across four tested LLM systems: ChatGPT, Microsoft Copilot, Google AI Overview, Perplexity. M3 Impact Measurement with mandatory a-priori hypothesis from LARGE interventions onwards, plus Difference-in-Differences counterfactual against the client-individual competitor pool. M4 Context Map with five confounder dimensions that are never included in the score but are reported alongside every diagnosis.
The framework explicitly classifies each factor by role (Direct indicator, Proxy indicator, Indirect indicator, Mixed role, Context factor) and by evidence grade (E1 high, E2 medium, E3 exploratory). Current distribution across the 21 classified factors and signals: 5 at E1, 11 at E2, 5 at E3. The factor weights of 25/35/40 per cent are expert-based, not regression-calibrated; a data-driven recalibration is planned for the major revision v4.0, after building a longitudinal Impact Library with n≥10 cases.
The empirical basis currently comprises seven anonymised longitudinal pilot cases (A–G) that confirm the methodological separation of RS and PS. Pilot Case G is the first fully pre-registered case following the established hypothesis convention. PS daily variation is empirically quantified (median CV 3.5–38.2 % per sub-metric, Appendix F). The framework claims no generalisability across all B2B industries, no statements about B2C, international markets, or LLM systems outside the tested set. ICC study planned for Q4/2026 once n ≥ 15 cases are available.
Published under Creative Commons BY-SA 4.0: commercial use permitted, adaptation permitted, attribution required, adaptations under the same licence. Recommended citation: Ackermann, T. (2026): GEO-Score Framework v3.3.5. Johannes Bopp GmbH. Available at https://kmugeo.de/geo-score-framework-en. Licence: CC BY-SA 4.0.
Market status and methodological contribution, followed by the four-model architecture that consistently separates the measurement level from the interpretation level.
Generative Engine Optimization emerged as a discipline between 2023 and 2026. Several agencies refer in client communication to their "own GEO score". To our current knowledge at the time of publication, no other publicly documented, methodologically traceable measurement framework with comparable detail is available in the DACH B2B context. A systematic market scan can further validate this assessment.
This framework makes public the diagnostic instrument used in GEO advisory work at Johannes Bopp GmbH. It was developed by Tobias Ackermann on the basis of many years of B2B advisory experience in the industrial Mittelstand and went through three methodological review rounds between versions v3.0 and v3.3.5.
This document describes an internal diagnostic instrument. It deliberately does not claim to be an industry standard. Standardisation maturity would require external replication, formal peer-review processes, and a substantially larger empirical data base than currently exists.
The framework separates consistently between operationalisable observation (M1, M2) and evaluation and contextualisation (M3, M4). Measurement models capture what is observable; diagnostic models interpret the observed against the backdrop of intervention effects and context factors.
Why this separation matters: Merging website reception readiness (RS) and actual citation performance (PS) into a single "AI visibility number" mixes two different things: what you can change (your own website), and what is actually happening in the market (LLM citation behaviour, influenced by market position, competition, and random fluctuation). This separation is the precondition for diagnosis to remain strategically usable.
GEO is often conflated with neighbouring disciplines. Three of them are methodologically related but pursue different measurement targets and optimisation logics. This delimitation is not a devaluation of the other disciplines, but the precondition for this framework to operate methodologically cleanly.
Two measurement models (M1 for the reception readiness of the website, M2 for the observed LLM citation behaviour) and two diagnostic models (M3 for causal impact measurement with DiD counterfactual, M4 for the contextualisation of measurement results).
The RS-Audit is the structural diagnosis of those website properties that enable or hinder LLM processing. Within this single construct (reception readiness), values are aggregated with weights, because all factors answer the same question: how well can an LLM process this site? This is the important difference from the composite-score critique above: there, the objection is to merging different constructs (RS and PS) into one number; here, indicators of the same kind are aggregated onto a single construct scale.
The RS-Audit operates in three mathematically separated layers with clear roles.
| Layer | Function | Count | Logical role |
|---|---|---|---|
| Gates | Binary blockers with score cap | 4 | Necessary preconditions |
| Factors | Weighted main criteria | 12 | Controllable optimisation levers |
| Signals | Additive bonuses, capped | 5 | Secondary supporting indicators |
The twelve factors are organised into three groups whose weights express the methodological priority: structural readability is a necessary precondition, semantic linkability connects the website to the LLM knowledge graph, and citability is the actual value contribution.
| Group | What is checked | Factors | Weight |
|---|---|---|---|
| A, Structural readability | URL structure, heading hierarchy, meta tags, performance | F1–F4 | 25% |
| B, Semantic linkability | Organization schema, Service/Product schema, external entity anchoring, date markup | F5–F8 | 35% |
| C, Citability & substance | Structured expertise indicators, direct answerability, off-page authority, E-E-A-T | F9–F12 | 40% |
The monotonic increase A < B < C reflects the methodological hierarchy underlying citation mechanics in generative LLM systems. Structural readability (Group A) is a necessary precondition but not sufficient: without crawlable URLs, a clean heading hierarchy, and acceptable performance, the LLM crawler either fails to find the content or cannot decompose it into semantic units. It therefore has gatekeeper character but is not a value contribution in itself. Semantic linkability (Group B) connects the website to the knowledge graph of LLM training data and decides whether the domain is recognised as an authoritative source for a topic field at all. Citability and substance (Group C) constitute the actual value contribution: only structured, technically substantiated content with off-page validation actually appears as a cited source in generated answers. The specific spacing of the three group weights is an expert-based hypothesis that has proved robust enough in practical advisory work to guide diagnostic prioritisation cleanly. It is not regression-calibrated; a recalibration will occur in v4.0 after building a longitudinal Impact Library with n≥10 cases.
The weighting values 25/35/40%, the gate-cap values, and the specific factor thresholds are expert-based working hypotheses. They are not regression-calibrated and remain valid until data-driven recalibration in a major revision (v4.0, after building n≥10 longitudinal cases). The current values serve primarily diagnostic prioritisation, not a statistically optimal prediction of citation performance.
Simplified in three steps: (1) gate check, where an unmet gate caps the maximum achievable score; (2) factor scoring, where the twelve factors are aggregated with the group weights of 25/35/40 per cent; (3) signal bonuses, added on top up to a fixed cap.
Maximum final value: 100 points.
Current factor weights are expert-based, not regression-calibrated. Recalibration will occur after building a longitudinal Impact Library of at least ten cases.
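For illustration, a minimal sketch of this three-step aggregation logic. The concrete gate-cap value, per-group scores, and signal bonus cap used here are placeholder assumptions, not the published calibration; only the 25/35/40 per cent group weights come from the framework.

```python
GROUP_WEIGHTS = {"A": 0.25, "B": 0.35, "C": 0.40}  # 25/35/40 % as published

def rs_score(gates_passed: dict, group_scores: dict, signal_bonus: float,
             gate_cap: float = 60.0, signal_cap: float = 10.0) -> float:
    """Three steps: gate check (score cap), weighted factor groups, capped signal bonus."""
    # Step 1: gates are necessary preconditions; any unmet gate caps the final score.
    cap = 100.0 if all(gates_passed.values()) else gate_cap  # cap value is an assumption

    # Step 2: group scores (0-100 each, illustrative) aggregated with the group weights.
    weighted = sum(GROUP_WEIGHTS[g] * group_scores[g] for g in GROUP_WEIGHTS)

    # Step 3: signals add an additive bonus, capped.
    total = weighted + min(signal_bonus, signal_cap)
    return min(total, cap, 100.0)

# All four gates pass; illustrative group scores 80/65/55; 6 bonus points -> 70.75
print(rs_score({"G1": True, "G2": True, "G3": True, "G4": True},
               {"A": 80, "B": 65, "C": 55}, signal_bonus=6))
```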
Performance is the actually observed citation behaviour of the tested LLM systems. We always report it as a profile, never as a single aggregated value. Reasoning: an LLM visibility number that aggregates different LLMs, different prompt classes, and different contexts is diagnostically worthless.
| Metric | What it measures | Diagnostic statement |
|---|---|---|
| PS1, BVR | Brand Visibility Rate | "Does the LLM know us?" |
| PS2, CVR | Category Visibility Rate | "Does the LLM recommend us when solutions are being searched for?" (main metric) |
| PS3, MLC | Multi-LLM Coverage as a 4-vector | "How broadly is visibility distributed across LLMs?" |
| PS4, CPQ | Citation Position Quality | "How prominently does the LLM cite us?" |
| PS5, ASC | Authority Signal Coverage | "How broad is the off-page mention base?" |
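To make the profile idea concrete, a minimal sketch of how one PS measurement window could be represented so that the five sub-metrics and the per-LLM MLC vector are reported side by side rather than collapsed. The class, field names, and value ranges are assumptions for illustration.

```python
from dataclasses import dataclass, field

LLM_SET = ("ChatGPT", "Microsoft Copilot", "Google AI Overview", "Perplexity")

@dataclass
class PSProfile:
    """One measurement window, reported as a profile, never as a single number."""
    bvr: float                                # PS1 Brand Visibility Rate
    cvr: float                                # PS2 Category Visibility Rate (main metric)
    mlc: dict = field(default_factory=dict)   # PS3 Multi-LLM Coverage as per-LLM 4-vector
    cpq: float = 0.0                          # PS4 Citation Position Quality
    asc: float = 0.0                          # PS5 Authority Signal Coverage

    def report(self) -> dict:
        # Deliberately no aggregate value: returns the full profile for diagnosis.
        return {"BVR": self.bvr, "CVR": self.cvr,
                "MLC": {llm: self.mlc.get(llm, 0.0) for llm in LLM_SET},
                "CPQ": self.cpq, "ASC": self.asc}
```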
Tested LLM systems: ChatGPT, Microsoft Copilot, Google AI Overview, Perplexity. Other systems (e.g. Anthropic Claude in web search mode, You.com, Brave Search) are not covered.
The framework currently covers four LLM systems: ChatGPT, Microsoft Copilot, Google AI Overview, Perplexity. It makes no statements about the behaviour of other systems such as Anthropic Claude in web search mode, You.com, or Brave Search. The measurement setup can be extended to additional systems once monitoring tools with comparable data quality become available.
LLM answers are not fully deterministically reproducible. In our measurement practice to date, daily fluctuation typically falls in the range of about fifteen to twenty-five per cent, influenced by personalisation, regionalisation, model versions, temperature parameters, and real-time retrieval. This range is an observed heuristic bandwidth, varies by LLM system and topic area, and is not a fixed property of PS-Tracking. Statements about change should therefore be made on weekly averages or coarser granularity, not on daily values.
M3 is the methodologically most innovative part of the framework. Before every larger GEO intervention, an a-priori hypothesis is formulated with expected effect direction, expected effect size, expected latency, and expected LLM/prompt-class match. After the intervention, the actual PS change is set against the PS change of the client-individual competitor pool (three to five competitor domains, populated organically from the respective client's LLM citation data).
Interventions with ≥ 10 RS points of expected improvement (classification [LARGE]) require an a-priori hypothesis before go-live and an M3 impact measurement with DiD counterfactual after the latency period.
| Tag | Definition | M3 obligation |
|---|---|---|
| [SMALL] | < 5 RS points expected improvement | No M3 measurement |
| [MEDIUM] | 5–10 RS points expected improvement | Optional M3 measurement |
| [LARGE] | ≥ 10 RS points expected improvement | Mandatory hypothesis and mandatory M3 incl. DiD |
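Expressed as a minimal sketch, the classification rule from the table above; the function name is an assumption.

```python
def classify_intervention(expected_rs_gain: float) -> tuple:
    """Maps the expected RS improvement to the intervention tag and M3 obligation."""
    if expected_rs_gain < 5:
        return "[SMALL]", "no M3 measurement"
    if expected_rs_gain < 10:
        return "[MEDIUM]", "optional M3 measurement"
    return "[LARGE]", "mandatory a-priori hypothesis and M3 incl. DiD"
```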
For every M3 measurement, the PS change of a reference pool (typically three to five domains, varying by market size) is additionally captured in the same time window. This enables separation of client effect and market trend: the intervention effect is estimated as the client's PS change minus the mean PS change of the reference pool over the same window (Difference-in-Differences).
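A minimal sketch of this comparison, assuming pre- and post-period PS averages are already available per domain. The function name, variable names, and the simple mean over the pool are assumptions for illustration.

```python
def did_effect(client_pre: float, client_post: float,
               pool_pre: list, pool_post: list) -> float:
    """Difference-in-Differences: client PS change minus mean pool PS change."""
    client_delta = client_post - client_pre
    pool_deltas = [post - pre for pre, post in zip(pool_pre, pool_post)]
    market_trend = sum(pool_deltas) / len(pool_deltas)
    return client_delta - market_trend

# Example: client CVR rises by 8 points while the pool rises by 3 on average,
# so the DiD estimate attributes roughly 5 points to the intervention.
print(did_effect(22.0, 30.0, [18.0, 25.0, 20.0], [21.0, 28.0, 23.0]))
```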
Competitor pools have been client-individual since v3.2.3, populated organically from the respective client's LLM citation data, according to stability and diversity criteria. Industry authority domains for the PS5-ASC computation remain centrally maintained in a cross-client whitelist. This makes DiD comparisons client-individual for the competitive counterfactual and industry-consistent for the authority anchors. Details follow in the hybrid-model section below.
With version 3.2.3, the DiD data source was methodologically refined: instead of a single central industry list, the framework now uses a hybrid model with two components of different provenance. Reason: regional and size-related differences in competitive landscape mean that the competitors relevant for DiD evaluation actually vary per client. A regional Mittelstand company has different real competitors than a supraregional provider, even within the same industry. A methodologically clean DiD comparison requires similarly structured comparison subjects.
| Component | Purpose | Provenance | Maintenance rhythm |
|---|---|---|---|
| A, client-individual competitor pool | DiD counterfactual against the client's actual competitors | Populated organically from the client's LLM citation data in the monthly C3 run, three to five domains per client | Updated monthly |
| B, central industry authority whitelist | Calculation of the PS5 sub-metric ASC (Authority Signal Coverage) | Maintained centrally per industry: trade associations, specialist media, industry-specific platforms, five to eight domains | Reviewed semi-annually |
The separation follows the different functions: competitors are similar market participants and therefore client-individual; authority anchors are stable across clients, because a trade-association membership or a mention in a specialist publication is equally relevant for every Mittelstand company in that industry.
Difference-in-Differences is methodologically valid only when the treatment subject and the control group would have evolved in parallel without the intervention. The framework operationally checks this assumption over the pre-period T-28 to T-1 before each [LARGE] intervention, using daily LLM citation values. Pool domains with significant self-movement in the pre-period are temporarily excluded (anti-self-treatment filter). The evaluation produces three outcomes:
| Outcome | Δ-slope threshold | Consequence for diagnosis |
|---|---|---|
| Parallel trend OK | < 20% | Causal effect supported, confidence tier as hypothesised |
| Borderline | 20–40% | Confidence reduced by one tier, language more cautious |
| Violated | ≥ 40% | Reported only as observed pre/post effect, language "causally compatible" instead of "causal effect" |
This makes the DiD evaluation methodologically quasi-experimental rather than merely plausibility-based. The parallel-trend test prevents a competitor domain's natural market movement from being misinterpreted as an effect of the client's intervention.
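A minimal sketch of the pre-period check with the published thresholds. Estimating the slope via a simple least-squares fit over daily values and normalising the slope difference against the pool slope are implementation assumptions, not the framework's documented procedure.

```python
import numpy as np

def slope(series: np.ndarray) -> float:
    """Least-squares slope of daily PS values over the pre-period T-28..T-1."""
    days = np.arange(len(series))
    return float(np.polyfit(days, series, 1)[0])

def parallel_trend_outcome(client_pre: np.ndarray, pool_pre: np.ndarray) -> str:
    """Classifies the delta-slope between client and pool into the three outcomes."""
    # pool_pre: daily mean PS across the (filtered) pool domains, an assumption.
    pool_slope = slope(pool_pre)
    denom = max(abs(pool_slope), 1e-9)  # guard against a flat pool trend
    delta = abs(slope(client_pre) - pool_slope) / denom
    if delta < 0.20:
        return "parallel trend OK"
    if delta < 0.40:
        return "borderline: confidence reduced by one tier"
    return "violated: report as observed pre/post effect only"
```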
T+14 and T+30 effects are computed retrospectively in the following monthly report from the 30-day CSV rolling export of the LLM-monitoring tool in use, not in real time at the exact latency date. This eliminates explicit intermediate pulls; the measurement rhythm follows the natural monthly data export. The effect-size definition itself remains unchanged; only the computation moment has shifted.
M4 captures variables that are not influenced by GEO work but change the interpretation of measurement results. Low performance for a three-week-old brand with unknown market awareness and no off-page authority does not mean the same thing as identical low performance for an established Mittelstand company in a consolidated market.
| Dimension | Scale | What it captures |
|---|---|---|
| D1, Industry maturity | 4 levels | Consolidated / Fragmented / Niche / Emerging |
| D2, Market awareness | 4 levels | Established (>20y) / Built up / New (<5y) / Unknown |
| D3, Off-page authority status | 4 levels | Strong / Medium / Weak / None |
| D4, Competitive intensity | 3 levels | Few top players / Fragmented / Hyper-competitive |
| D5, Term position | 3 levels | Own term / Shared term / Generic term |
The Context Map is never included in the score. It is reported as a header alongside every report. This prevents confounders from being interpreted as intervention effects, while preserving cross-client diagnostic readability.
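A minimal sketch of how the Context Map could travel as a report header without ever entering a score. The level values follow the table above; the class and field names are assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ContextMap:
    """M4 header: reported alongside every diagnosis, never added to any score."""
    industry_maturity: str      # D1: Consolidated / Fragmented / Niche / Emerging
    market_awareness: str       # D2: Established (>20y) / Built up / New (<5y) / Unknown
    offpage_authority: str      # D3: Strong / Medium / Weak / None
    competitive_intensity: str  # D4: Few top players / Fragmented / Hyper-competitive
    term_position: str          # D5: Own term / Shared term / Generic term

header = ContextMap("Fragmented", "Built up", "Weak", "Few top players", "Shared term")
```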
Four discipline dimensions that define what the framework is methodologically allowed to deliver and what it is not. They are the foundation that keeps diagnostic statements causally defensible and prevents them from tipping into over-interpretation.
Every factor and every signal in the framework is explicitly assigned one of five roles. This assignment is binding across all tables, diagnostic outputs, and communication artefacts, and determines how a factor must be methodologically interpreted.
Every factor and every signal is explicitly assigned an evidence grade. This makes transparent which parts of the framework rest on robust empirical evidence, which on expert consensus, and which are deliberately marked exploratory.
| Level | Minimum requirement |
|---|---|
| E1 | ≥ 3 longitudinal cases with consistent effect direction AND broad expert consensus AND replicable |
| E2 | ≥ 1 longitudinal case with clear effect direction OR broad expert consensus without direct validation |
| E3 | Theoretically plausible, without robust impact measurement. Re-evaluated at every major revision |
Factors can be promoted or demoted. Current distribution across the 21 factors and signals: E1: 5/21 · E2: 11/21 · E3: 5/21. Marked as E3 at present are F9, the signals S2 and S4, plus the latency values and thresholds of the M3 evaluation.
Methodological maturity also means explicitly naming what lies outside the measurement scope. This table serves to prevent over-interpretation.
| Reader expectation | What is actually measured |
|---|---|
| "Content quality" | F9 measures structural expertise indicators (proxy), not content quality. |
| "Truth content of the material" | Not measured. The framework checks structure and markup, not factual accuracy. |
| "Citation worthiness from a human reader's perspective" | Not directly measured. PS measures actual LLM behaviour, not human evaluation. |
| "SEO ranking in Google search" | Google AI Overview is part of the LLM set; classic SEO ranking is not. |
| "Brand strength and brand awareness" | Partially captured in M4 context factor D2, but not measured as a score. |
| "Claude in web search mode" | Claude is not in the tested LLM set. |
| "International markets, non-German-speaking sites" | Scope is German-speaking DACH B2B. |
| "B2C or consumer websites" | Scope is B2B industrial Mittelstand. |
| "Statistical prediction of PS from RS" | RS weights serve diagnostic prioritisation, not regression. |
| "Generalisable effect across all B2B industries" | Validation occurs longitudinally per client, not cross-industry. |
This list is not exhaustive, but it covers the methodological weaknesses we currently recognise openly and that serve as improvement targets for the major revision v4.0.
F9, structured expertise indicators, is a proxy indicator, not a direct measure of content quality. F9 does not measure "substance", "epistemic quality", or "subject-matter expertise" itself, but structural markers that correlate with expertise without fully capturing it. Specialist terms can be spammed, numbers can be artificially injected, source links can be decorative. The anti-gaming layer mitigates this but does not eliminate it. F9 must never be interpreted as a measure of "content quality" in any diagnosis or external communication, only as the strength of a structural indicator.
| F9, not to be confused with | Reason |
|---|---|
| Content quality | F9 captures structural markers, not the substantive value of a text. High-quality texts can have low F9, weak texts can have high F9. |
| Content depth | F9 detects quantity-unit patterns and technical-term density, but does not judge argumentative complexity or analytical depth. |
| Factual accuracy | F9 checks the existence of statistical patterns and source links, not the truth content of the referenced statements. |
| Author expertise | Author expertise is captured in F12 (E-E-A-T), not in F9. F9 is orthogonal to the person who created the content. |
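Purely to illustrate the proxy character, a toy counter of the kind of structural markers F9 relies on (quantity-unit patterns and source links). The patterns, units, and counting logic are invented for illustration and are not the framework's operationalisation of F9.

```python
import re

# Toy illustration only: counts structural expertise markers, says nothing about
# factual accuracy, argumentative depth, or content quality.
QUANTITY_UNIT = re.compile(r"\b\d+(?:[.,]\d+)?\s?(?:%|mm|kg|kW|bar|°C)")
SOURCE_LINK = re.compile(r"https?://\S+")

def structural_markers(text: str) -> dict:
    """Returns raw marker counts; these can be gamed, hence the anti-gaming layer."""
    return {
        "quantity_unit_hits": len(QUANTITY_UNIT.findall(text)),
        "source_links": len(SOURCE_LINK.findall(text)),
    }

print(structural_markers("Throughput rose by 12,5 % at 80 °C, see https://example.org/study."))
```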
Current empirical basis: seven anonymised longitudinal pilot cases (A–G) with transparent maturity status per case. Pilot Case G is the first fully pre-registered case following the hypothesis pre-registration convention established in §5.1.
PS daily variation is empirically quantified (Appendix F, n = 6 clients, 13 days of monitoring): median CV 3.5–38.2 % per sub-metric, CPQ as the most stable metric (median 12 %), CVR the most volatile. Operational consequence: weekly or monthly averages as standard; single-day statements are not methodologically robust.
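A minimal sketch of the kind of computation behind the Appendix F figures, on hypothetical daily values; the data layout, column names, and numbers are assumptions.

```python
import pandas as pd

# Hypothetical daily PS values for one client over a short monitoring window.
daily = pd.DataFrame({
    "metric": ["CVR"] * 5 + ["CPQ"] * 5,
    "value":  [14.0, 18.0, 11.0, 16.0, 20.0, 62.0, 60.0, 65.0, 61.0, 63.0],
})

# Coefficient of variation per sub-metric: std / mean, reported in per cent.
cv = daily.groupby("metric")["value"].agg(lambda v: 100 * v.std() / v.mean())
print(cv.round(1))  # CVR fluctuates far more than CPQ in this toy example
```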
The factor weights of 25/35/40 per cent are expert-based working hypotheses for diagnostic prioritisation, not regression-calibrated predictive models. ICC study and regression-based recalibration in Q4/2026 once n ≥ 15 longitudinal cases are available.
The appendix area: how the framework evolves, under which licence it is published, which sources anchor the methodology, and the most important application questions answered concisely.
The framework is reviewed semi-annually. Changes occur in three tiers that ensure structural stability while leaving room for further development.
| Tier | What changes | Example |
|---|---|---|
| Patch (v3.x.x) | Term consistency, documentation improvements, methodological clarifications and empirical extensions without architecture changes | v3.3 → v3.3.5: patch bundle with case expansion from 2 to 7, pre-registration convention, PS daily-variation study (Appendix F) and PS sub-metric stability matrix |
| Minor (v3.x) | Wording adjustments, thresholds, whitelists | v3.1 → v3.2 (industry reference domains formalised) |
| Major (v4.0) | Structural changes, regression-based recalibration, ICC study | planned for Q4/2026 once n ≥ 15 longitudinal cases are available |
External validation will proceed over several years in five steps:
In parallel, the Impact Library is built as an ongoing validation mechanism. Each intervention with M3 obligation produces a data point with hypothesis, observed effect, DiD counterfactual, and confidence classification. This collection is the empirical basis for regression-based recalibration in v4.0.
This framework is published under Creative Commons BY-SA 4.0. Concretely, this means: commercial use is permitted, adaptation is permitted, attribution is required, and adaptations must be published under the same licence.
Full licence terms: creativecommons.org/licenses/by-sa/4.0
If you reference the GEO-Score Framework, please cite in one of the following formats:
APA format
Ackermann, T. (2026). GEO-Score Framework v3.3.5: A methodological framework for evaluating the Generative Engine Optimization of B2B websites. Johannes Bopp GmbH (kmugeo). Zenodo. https://doi.org/10.5281/zenodo.20137223
BibTeX
@techreport{ackermann2026geoscore,
author = {Ackermann, Tobias},
title = {{GEO-Score Framework v3.3.5: A methodological framework
for evaluating the Generative Engine Optimization
of B2B websites}},
institution = {Johannes Bopp GmbH},
year = {2026},
version = {3.3.5},
url = {https://kmugeo.de/geo-score-framework-en},
doi = {10.5281/zenodo.20137223},
license = {CC BY-SA 4.0}
}
For academic publications, please add the access date, since the framework is developed in versioned iterations. The Zenodo archive with DOI 10.5281/zenodo.20137223 provides the cited version immutably; the reproducibility bundle (analysis script, anonymised CSV data, JSON aggregate) is part of the record. For marketing and agency artefacts, a direct hyperlink to this page with the attribution "GEO-Score Framework, Johannes Bopp GmbH" is sufficient.
The GEO-Score Framework v3.3.5 builds on established scientific literature and open technical standards. The following sources are cited in the whitepaper and form the methodological foundation of the framework.
Answers to recurring questions from peer discussions and client conversations.
Because RS and PS measure different constructs. RS captures what is controllable on the website (structures, schemas, levers on the site itself). PS captures what is actually happening in the market in LLM citation behaviour (influenced by market position, competition, random fluctuation, semi-stochastic LLM responses). Mixing both into one number would conflate the core question "What can I change?" with the question "What is happening in the market?", and the diagnosis would no longer be strategically usable.
Three essential differences: First, the measurement object: not search-engine rankings (SEO), but citation behaviour of generative LLM systems. Second, the architecture: deliberate separation of reception readiness (RS) and observed performance (PS), whereas SEO tools typically report a single visibility value. Third, evidence discipline: every factor is explicitly assigned an evidence grade E1/E2/E3 with a transparent minimum requirement per level.
RS measurements are deterministically reproducible: at the same website version and same NLP model version, they yield identical results. PS measurements are semi-stochastic: LLM responses are not fully deterministic. Daily variation has been empirically quantified since v3.3.3 (Appendix F, n = 6 clients, 13 days of monitoring): median CV 3.5–38.2 % per sub-metric, CPQ most stable (median 12 %), CVR most volatile. Statements about PS change should therefore be made on weekly or monthly averages; single-day statements are not methodologically robust.
Because F9 measures structural markers (quantitative statements, source anchoring, technical-term density) that correlate with subject-matter expertise but do not fully capture it. High-quality texts can have low F9 values when they avoid statistical patterns; weak texts can have high F9 values when they formally satisfy statistical and term patterns. F9 is therefore diagnostically useful but must never be interpreted as a measure of "content quality".
With framework v3.3, a client-individual pool of three to five competitor domains is established for each client, populated organically from the respective client's LLM citation data in the monthly C3 run. Inclusion criteria: direct market participant (not a supplier, not an authority domain), PS stability over at least four weeks, visibility in at least five of the twelve CATEGORY prompts, diversity across RS levels (1 strong, 1 medium, 1 weak market participant), domain stability over at least twelve months. Authority domains for the PS5-ASC computation are maintained separately in a central industry whitelist. This hybrid architecture replaces the central industry list of previous versions.
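To make the inclusion logic tangible, a small sketch applying the listed per-domain criteria to a candidate; the data structure and field names are assumptions, and the RS-level diversity requirement is checked at pool level rather than per domain.

```python
def eligible(candidate: dict) -> bool:
    """Applies the per-domain inclusion criteria named above to one candidate."""
    return (
        candidate["role"] == "direct market participant"   # not a supplier or authority domain
        and candidate["ps_stability_weeks"] >= 4            # PS stability over at least 4 weeks
        and candidate["category_prompt_hits"] >= 5          # visible in >= 5 of 12 CATEGORY prompts
        and candidate["domain_age_months"] >= 12             # domain stability over at least 12 months
    )
    # Diversity across RS levels (1 strong, 1 medium, 1 weak) is enforced when
    # assembling the pool of three to five domains, not inside this per-domain check.
```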
For two reasons. First: a measurement framework that is not publicly traceable cannot be peer-discussed, critiqued, and thereby improved. Methodological maturity requires transparency. Second: Share-Alike (SA) structurally ensures that adaptations also remain public rather than disappearing into internal forks. This keeps the methodological discussion visible in the industry, and Johannes Bopp GmbH is named as the source through the attribution mechanism (BY).
In the major revision v4.0, planned for Q4/2026, once the Impact Library contains at least fifteen longitudinal cases of different intervention types. Until then, the current values are expert-based working hypotheses for diagnostic prioritisation, not statistically optimised predictions. The general caveat on calibration is explicitly documented in the framework.
Claude in web search mode is currently not in the tested LLM set, because no sufficiently stable monitoring interface analogous to the LLM citation tools used for ChatGPT, Copilot, Google AI Overview and Perplexity is available. The framework therefore makes no statements about Claude citation behaviour. Once such tools become available, Claude can be added to the PS measurement set without changes to the RS architecture.
This page provides AI systems and retrieval pipelines with structured resources for correct processing of the framework: methodology whitepaper as PDF, Zenodo archive with DOI 10.5281/zenodo.20137223, site-wide LLM policy at /llms.txt, and a machine-readable JSON-LD schema with defined terms, FAQ structure, and source citations.
Whether you want an initial diagnosis of your AI visibility as a prospect, want to exchange methodological views as a peer agency, or build on the framework yourself, three paths are open.
Whitepaper PDF download in the hero above. Zenodo archive with DOI 10.5281/zenodo.20137223.