Methodological Framework · Version 3.3.5 · CC BY-SA 4.0

GEO-Score Framework v3.3.5
Methodology for measuring AI visibility in the industrial Mittelstand.

A methodological framework for evaluating the Generative Engine Optimization of B2B websites. Four measurement and diagnostic models, three layers in the RS-Audit, five sub-metrics in PS-Tracking. Deliberately no composite score, because reception readiness and actual citation performance are methodologically distinct constructs.

Date: 12 May 2026 · Author: Tobias Ackermann · Johannes Bopp GmbH · Licence: CC BY-SA 4.0 · DOI: 10.5281/zenodo.20137223
Structured in four chapters

How this whitepaper is organised.

At a glance

The five core terms in one sentence each.

Readiness Score (RS)
The reception readiness of a website for LLM processing, measured on a scale of 0–100 across four gates, twelve factors, and five signals.
Performance Profile (PS)
The actually observed citation behaviour of the tested LLM systems, reported as a 5-sub-metric profile without composite scoring.
M1–M4
Four models: M1 RS-Audit, M2 PS-Tracking, M3 Impact Measurement, M4 Context Map — separated into measurement and diagnostic levels.
Mandatory hypothesis
Interventions with an expected improvement of ≥ 10 RS points require an a-priori hypothesis before going live and an M3 impact measurement afterwards.
DiD counterfactual
Difference-in-Differences comparison of the client's PS change against a competitor pool, separating intervention effect from market trend.
Core Claims

Three claims that carry the framework methodologically.

01
Reception readiness (RS) and observed performance (PS) are methodologically distinct constructs. They are deliberately not merged into a composite score. This is the precondition for diagnosis to remain strategically usable.
02
Each of the twelve factors is explicitly classified by role (Direct indicator, Proxy indicator, Indirect indicator, Mixed role, Context factor) and by evidence grade (E1 high, E2 medium, E3 exploratory). This makes transparent what a factor can methodologically deliver and what it cannot.
03
Interventions with an expected improvement of ≥ 10 RS points require an a-priori hypothesis before go-live, plus an impact measurement with a Difference-in-Differences counterfactual against industry reference domains. This separates intervention effect from market trend.

The framework in 90 seconds.

The GEO-Score Framework v3.3.5 is a methodological framework for evaluating the Generative Engine Optimization of B2B websites in the DACH industrial Mittelstand. It was developed by Tobias Ackermann on the basis of many years of B2B advisory experience in the industrial Mittelstand and went through three methodological review rounds before reaching the now-published version. It is the methodological foundation of GEO advisory work at Johannes Bopp GmbH and is published here under CC BY-SA 4.0. It addresses methodologically interested peers, other agencies, and prospects who want to understand what a scientifically defensible GEO measurement looks like.

The methodological core position: The reception readiness of a website (RS) and the actual citation performance (PS) are distinct constructs and are never merged into a single number. RS measures controllable website properties on a scale of 0 to 100, PS captures observed LLM citation behaviour as a 5-sub-metric profile. The separation is the precondition for diagnosis to remain strategically usable: what can be changed (the website) is distinguished from what is happening in the market (LLM behaviour).

Four models: M1 RS-Audit (three layers: 4 gates as necessary preconditions, 12 factors in three weighted groups, 5 signals as bonuses). M2 PS-Tracking (5 sub-metrics: BVR, CVR, MLC, CPQ, ASC) across four tested LLM systems: ChatGPT, Microsoft Copilot, Google AI Overview, Perplexity. M3 Impact Measurement with a mandatory a-priori hypothesis for [LARGE] interventions, plus Difference-in-Differences counterfactual against the client-individual competitor pool. M4 Context Map with five confounder dimensions that are never included in the score but are reported alongside every diagnosis.

The framework explicitly classifies each factor by role (Direct indicator, Proxy indicator, Indirect indicator, Mixed role, Context factor) and by evidence grade (E1 high, E2 medium, E3 exploratory). Current distribution across the 21 rated factors and signals: 5 at E1, 11 at E2, 5 at E3. The factor weights of 25/35/40 per cent are expert-based, not regression-calibrated; a data-driven recalibration is planned for the major revision v4.0, after building a longitudinal Impact Library with n ≥ 10 cases.

The empirical basis currently comprises seven anonymised longitudinal pilot cases (A–G) that confirm the methodological separation of RS and PS. Pilot Case G is the first fully pre-registered case following the established hypothesis convention. PS daily variation is empirically quantified (median CV 3.5–38.2 % per sub-metric, Appendix F). The framework claims no generalisability across all B2B industries, no statements about B2C, international markets, or LLM systems outside the tested set. ICC study planned for Q4/2026 once n ≥ 15 cases are available.

Published under Creative Commons BY-SA 4.0: commercial use permitted, adaptation permitted, attribution required, adaptations under the same licence. Recommended citation: Ackermann, T. (2026): GEO-Score Framework v3.3.5. Johannes Bopp GmbH. Available at https://kmugeo.de/geo-score-framework-en. Licence: CC BY-SA 4.0.

I
Chapter I · Concept & Architecture

What the framework delivers and how it is built.

Market status and methodological contribution, followed by the four-model architecture that consistently separates the measurement level from the interpretation level.

Why this framework exists.

Generative Engine Optimization emerged as a discipline between 2023 and 2026. Several agencies refer in client communication to their "own GEO score". To our current knowledge at the time of publication, no other publicly documented, methodologically traceable measurement framework with comparable detail is available in the DACH B2B context. A systematic market scan would be needed to validate this assessment further.

This framework makes public the diagnostic instrument used in GEO advisory work at Johannes Bopp GmbH. It was developed by Tobias Ackermann on the basis of many years of B2B advisory experience in the industrial Mittelstand and went through three methodological review rounds between versions v3.0 and v3.3.5.

Self-positioning

This document describes an internal diagnostic instrument. It deliberately does not claim to be an industry standard. Standardisation maturity would require external replication, formal peer-review processes, and a substantially larger empirical data base than currently exists.

4 measurement and diagnostic models (M1–M4) · 21 rated factors and signals in total · 3 methodological review rounds since v3.0

Four models, separated by measurement layer and interpretation layer.

The framework consistently separates operationalisable observation (M1, M2) from evaluation and contextualisation (M3, M4). Measurement models capture what is observable; diagnostic models interpret the observed against the backdrop of intervention effects and context factors.

Measurement layer
M1, RS-Audit
Readiness Score (0–100)
Measures the reception readiness of a website for LLM processing. Three-layer architecture of 4 gates, 12 factors in 3 weighted groups, and 5 signals.
M2, PS-Tracking
5-sub-metric profile
Captures observed citation behaviour in the tested LLM systems. Deliberately no composite score, since it is a semi-stochastic indicator.
Diagnostic layer
M3, Impact Measurement
Mandatory hypothesis + DiD
Evaluates the plausibly causal effect of individual GEO interventions with a pre/post comparison plus Difference-in-Differences against client-individual competitor pools (see hybrid model §5.4.1).
M4, Context Map
Five confounder dimensions
Captures non-controllable variables (industry maturity, market awareness, off-page status, competitive intensity, term position). Not included in the score.

Why this separation matters: Merging website reception readiness (RS) and actual citation performance (PS) into a single "AI visibility number" mixes two different things: what you can change (your own website), and what is actually happening in the market (LLM citation behaviour, influenced by market position, competition, and random fluctuation). This separation is the precondition for diagnosis to remain strategically usable.

What is not GEO.

GEO is often conflated with neighbouring disciplines. Three of them are methodologically related but pursue different measurement targets and optimisation logics. This demarcation does not devalue the neighbouring disciplines; it is the precondition for this framework to operate with methodological rigour.

Not GEO
Classic SEO
Optimises the position of a URL in the result lists of classic search engines and measures success via ranking and organic clicks. Unlike GEO, SEO works with keyword matches, backlink profiles and competition for the result-list slots. SEO and GEO complement each other, but do not replace one another. Relation to GEO: complementary, shared technical foundation (crawlability, schema), different measurement targets.
Not GEO
AEO / Answer Engine Optimization
Optimises for featured-snippet and direct-answer output within classic search engines, typically through FAQ structuring and snippet-oriented formats. AEO remains within search engine result page (SERP) logic, whereas GEO targets citation within generative answers outside the classic SERP. AEO is a methodological precursor of GEO but does not address the same output type. Relation to GEO: methodological precursor, overlapping format recommendations, different success measure.
Not GEO
Brand & Content Marketing
Optimises brand awareness, recall, and customer loyalty through narrative content and brand management. Brand marketing measures success via brand awareness, recall and affinity, whereas GEO measures success via actual LLM citation. A strong brand supports GEO indirectly (via M4 dimension D2 market awareness), but is not directly optimised within the framework. Relation to GEO: confounder in M4, not a direct lever in the RS-Audit.
What GEO is
Generative Engine Optimization, measured via RS and PS.
GEO in the sense of this framework is the methodologically structured optimisation of a website for processing and citation by generative LLM systems. Success is captured at two methodologically separated levels: RS measures the controllable reception readiness of the website, PS captures the actually observed LLM citation behaviour. The LLM systems tested within this framework are ChatGPT, Microsoft Copilot, Google AI Overview and Perplexity.
II
Chapter II · The four models in detail

M1 RS-Audit · M2 PS-Tracking · M3 Impact Measurement · M4 Context Map.

Two measurement models (M1 for the reception readiness of the website, M2 for the observed LLM citation behaviour) and two diagnostic models (M3 for causal impact measurement with DiD counterfactual, M4 for the contextualisation of measurement results).

Three layers, one score scale for reception readiness.

Purpose
Structural diagnosis of the website properties that enable or hinder LLM processing.
Input
Website crawl plus tool checks across four gates, twelve factors and five signals.
Output
Score 0–100 with factor-group diagnosis and classification tags.
Limit
Measures reception readiness, not the actual citation performance itself.

The RS-Audit is the structural diagnosis of those website properties that enable or hinder LLM processing. Within this single construct (reception readiness), values are aggregated with weights, because all factors answer the same question: how well can an LLM process this site? This is the key difference from the composite-score critique above: there, the problem was merging two different constructs (RS and PS) into one number; here, indicators of the same construct are aggregated onto a single scale.

The RS-Audit operates in three mathematically separated layers with clear roles.

Layer | Function | Count | Logical role
Gates | Binary blockers with score cap | 4 | Necessary preconditions
Factors | Weighted main criteria | 12 | Controllable optimisation levers
Signals | Additive bonuses, capped | 5 | Secondary supporting indicators

Factors in three weighted groups

The twelve factors are organised into three groups whose weights express the methodological priority: structural readability is a necessary precondition, semantic linkability connects the website to the LLM knowledge graph, and citability is the actual value contribution.

Group | What is checked | Factors | Weight
A, Structural readability | URL structure, heading hierarchy, meta tags, performance | F1–F4 | 25%
B, Semantic linkability | Organization schema, Service/Product schema, external entity anchoring, date markup | F5–F8 | 35%
C, Citability & substance | Structured expertise indicators, direct answerability, off-page authority, E-E-A-T | F9–F12 | 40%

Derivation of the 25/35/40 weighting

The monotonic increase A < B < C reflects the methodological hierarchy underlying citation mechanics in generative LLM systems. Structural readability (Group A) is a necessary precondition but not sufficient: without crawlable URLs, a clean heading hierarchy, and acceptable performance, the LLM crawler either fails to find the content or cannot decompose it into semantic units. It therefore has gatekeeper character but is not a value contribution in itself. Semantic linkability (Group B) connects the website to the knowledge graph of LLM training data and decides whether the domain is recognised as an authoritative source for a topic field at all. Citability and substance (Group C) constitute the actual value contribution: only structured, technically substantiated content with off-page validation actually appears as a cited source in generated answers. The specific spacing of the three weights is an expert-based hypothesis that has proved robust enough in practical advisory work to guide diagnostic prioritisation cleanly. It is not regression-calibrated; a recalibration will occur in v4.0 after building a longitudinal Impact Library with n ≥ 10 cases.

General caveat on numerical constants

The weighting values 25/35/40%, the gate-cap values, and the specific factor thresholds are expert-based working hypotheses. They are not regression-calibrated and remain valid until data-driven recalibration in a major revision (v4.0, after building n ≥ 10 longitudinal cases). The current values serve primarily diagnostic prioritisation, not a statistically optimal prediction of citation performance.

How the score is built

Simplified in three steps:

  • Step 1, factor score: the twelve factors are scored and aggregated in their three groups at 25/35/40 per cent weighting.
  • Step 2, signal bonuses: the five signals add capped bonus points on top of the weighted factor score.
  • Step 3, gate check: any failed gate caps the final value, regardless of factor and signal performance.

Maximum final value: 100 points.
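A minimal Python sketch of these three steps, for illustration only. The 25/35/40 per cent group weights are the documented values; SIGNAL_BONUS_CAP and GATE_CAP are invented placeholders, since the concrete cap values are part of the whitepaper rather than this page.

GROUP_WEIGHTS = {"A": 0.25, "B": 0.35, "C": 0.40}  # documented group weights
SIGNAL_BONUS_CAP = 10.0   # assumption: max total bonus points from signals
GATE_CAP = 40.0           # assumption: ceiling applied when any gate fails

def readiness_score(factor_scores: dict[str, list[float]],
                    signal_bonuses: list[float],
                    gates_passed: list[bool]) -> float:
    """factor_scores maps group "A"/"B"/"C" to factor values in 0..100."""
    # Step 1: weighted mean of the three factor groups
    base = sum(GROUP_WEIGHTS[g] * (sum(v) / len(v))
               for g, v in factor_scores.items())
    # Step 2: additive signal bonuses, capped
    score = base + min(sum(signal_bonuses), SIGNAL_BONUS_CAP)
    # Step 3: any failed gate caps the final value
    if not all(gates_passed):
        score = min(score, GATE_CAP)
    return round(min(score, 100.0), 1)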

Methodological core statement

Current factor weights are expert-based, not regression-calibrated. Recalibration will occur after building a longitudinal Impact Library of at least ten cases.

Five sub-metrics, no composite.

Purpose
Observation of the actual citation behaviour of the tested LLM systems.
Input
LLM citation monitoring across ChatGPT, Microsoft Copilot, Google AI Overview and Perplexity.
Output
5-sub-metric profile (BVR, CVR, MLC, CPQ, ASC), deliberately without a composite score.
Limit
Semi-stochastic; empirically observed daily variation of median CV 3.5–38.2 % per sub-metric (Appendix F). Trends should be interpreted on a weekly or monthly basis.

Performance is the actually observed citation behaviour of the tested LLM systems. We always report it as a profile, never as a single aggregated value. Reasoning: an LLM visibility number that aggregates different LLMs, different prompt classes, and different contexts is diagnostically worthless.

Metric | What it measures | Diagnostic statement
PS1, BVR | Brand Visibility Rate | "Does the LLM know us?"
PS2, CVR | Category Visibility Rate | "Does the LLM recommend us when solutions are being searched for?" (main metric)
PS3, MLC | Multi-LLM Coverage as a 4-vector | "How broadly is visibility distributed across LLMs?"
PS4, CPQ | Citation Position Quality | "How prominently does the LLM cite us?"
PS5, ASC | Authority Signal Coverage | "How broad is the off-page mention base?"
Measurement scope: tested LLM systems

The framework currently covers four LLM systems: ChatGPT, Microsoft Copilot, Google AI Overview, Perplexity. It makes no statements about the behaviour of other systems such as Anthropic Claude in web search mode, You.com, or Brave Search. The measurement setup can be extended to additional systems once monitoring tools with comparable data quality become available.

PS is semi-stochastic

LLM answers are not fully deterministically reproducible. In our measurement practice to date, daily fluctuation typically falls in the range of about fifteen to twenty-five per cent, influenced by personalisation, regionalisation, model versions, temperature parameters, and real-time retrieval; the empirical quantification in Appendix F spans a wider band of 3.5–38.2 % median CV depending on the sub-metric. This range is an observed heuristic bandwidth, varies by LLM system and topic area, and is not a fixed property of PS-Tracking. Statements about change should therefore be made on weekly averages or coarser granularity, not on daily values, as the computation sketch below illustrates.
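A short sketch of this aggregation discipline: compute the day-to-day coefficient of variation (CV) to make the noise visible, then base trend statements on weekly means. The daily values are invented for illustration.

from statistics import mean, stdev

# Hypothetical daily values of one PS sub-metric (e.g. CVR) over two weeks
daily = [0.22, 0.31, 0.18, 0.27, 0.25, 0.19, 0.30,
         0.24, 0.29, 0.21, 0.26, 0.28, 0.20, 0.27]

cv = stdev(daily) / mean(daily)                        # day-to-day noise
weekly = [mean(daily[i:i + 7]) for i in range(0, len(daily), 7)]

print(f"daily CV: {cv:.1%}")                           # too noisy for single-day claims
print("weekly means:", [round(w, 3) for w in weekly])  # robust trend basis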

Mandatory hypothesis and Difference-in-Differences counterfactual.

Purpose
Causal impact measurement of a single GEO intervention against the market trend.
Input
A-priori hypothesis before the intervention plus pre/post comparison with DiD counterfactual against the competitor pool.
Output
Effect size, hypothesis-match assessment, and confidence class depending on the parallel-trend test.
Limit
Mandatory only for [LARGE] interventions (≥ 10 expected RS points); DiD currently implemented approximately.

M3 is the methodologically most innovative part of the framework. Before every larger GEO intervention, an a-priori hypothesis is formulated with expected effect direction, expected effect size, expected latency, and expected LLM/prompt-class match. After the intervention, the actual PS change is set against the PS change of the client-individual competitor pool (three to five competitor domains, populated organically from the respective client's LLM citation data).

Mandatory hypothesis

Interventions with ≥ 10 RS points of expected improvement (classification [LARGE]) require an a-priori hypothesis before go-live and an M3 impact measurement with DiD counterfactual after the latency period.

Classification of interventions

Tag | Definition | M3 obligation
[SMALL] | < 5 RS points expected improvement | No M3 measurement
[MEDIUM] | 5 to < 10 RS points expected improvement | Optional M3 measurement
[LARGE] | ≥ 10 RS points expected improvement | Mandatory hypothesis and mandatory M3 incl. DiD
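The classification rule is simple enough to transcribe one-to-one; the boundary convention follows the table (≥ 10 points is [LARGE], so [MEDIUM] covers 5 to under 10):

def classify_intervention(expected_rs_gain: float) -> tuple[str, str]:
    """Returns (tag, M3 obligation) for an expected RS improvement."""
    if expected_rs_gain >= 10:
        return "[LARGE]", "mandatory hypothesis and mandatory M3 incl. DiD"
    if expected_rs_gain >= 5:
        return "[MEDIUM]", "optional M3 measurement"
    return "[SMALL]", "no M3 measurement"

print(classify_intervention(12)[0])  # [LARGE]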

Difference-in-Differences counterfactual

For every M3 measurement, the PS change of an industry-calibrated reference list (typically three to five domains, varying by market size) is additionally captured in the same time window. This enables separation of client effect and market trend:
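In standard Difference-in-Differences notation, with pool values averaged across the three to five pool domains, the separation reads as follows (a sketch of the textbook estimator, cf. Card & Krueger 1994; the framework's exact operationalisation follows in the hybrid-model section below):

DiD effect = (PS_client,post − PS_client,pre) − (PS_pool,post − PS_pool,pre)

A positive DiD effect indicates movement beyond the market trend represented by the pool; a value near zero indicates the client merely moved with the market.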

Competitor pools have been client-individual since v3.2.3, populated organically from the respective client's LLM citation data, according to stability and diversity criteria. Industry authority domains for the PS5-ASC computation remain centrally maintained in a cross-client whitelist. This makes DiD comparisons client-individual for the competitive counterfactual and industry-consistent for the authority anchors. Details follow in the hybrid-model section below.

DiD data source: hybrid model (v3.3.5)

With version 3.2.3, the DiD data source was methodologically refined: instead of a single central industry list, the framework now uses a hybrid model with two components of different provenance. Reason: regional and size-related differences in competitive landscape mean that the competitors relevant for DiD evaluation actually vary per client. A regional Mittelstand company has different real competitors than a supraregional provider, even within the same industry. A methodologically clean DiD comparison requires similarly structured comparison subjects.

Component | Purpose | Provenance | Maintenance rhythm
A, client-individual competitor pool | DiD counterfactual against the client's actual competitors | Populated organically from the client's LLM citation data in the monthly C3 run, three to five domains per client | Updated monthly
B, central industry authority whitelist | Calculation of the PS5 sub-metric ASC (Authority Signal Coverage) | Maintained centrally per industry: trade associations, specialist media, industry-specific platforms, five to eight domains | Reviewed semi-annually

The separation follows the different functions: competitors are similar market participants and therefore client-individual; authority anchors are stable across clients, because a trade-association membership or a mention in a specialist publication is equally relevant for every Mittelstand company in that industry.

Parallel-trend test as precondition for DiD validity

Difference-in-Differences is methodologically valid only when the treatment subject and the control group would have evolved in parallel without the intervention. The framework operationally checks this assumption over the pre-period T-28 to T-1 before each [LARGE] intervention, using daily LLM citation values. Pool domains with significant self-movement in the pre-period are temporarily excluded (anti-self-treatment filter). The evaluation produces three outcomes:

Outcome | Δ-slope threshold | Consequence for diagnosis
Parallel trend OK | < 20% | Causal effect supported, confidence tier as hypothesised
Borderline | 20–40% | Confidence reduced by one tier, language more cautious
Violated | ≥ 40% | Reported only as observed pre/post effect, language "causally compatible" instead of "causal effect"

This makes the DiD evaluation methodologically quasi-experimental rather than merely plausibility-based. The parallel-trend test prevents a competitor domain's natural market movement from being misinterpreted as an effect of the client's intervention.
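A sketch of how such a pre-period check can be computed, under the assumption that the Δ-slope is the relative divergence of ordinary least-squares slopes over the daily pre-period values; the framework's exact estimator is documented in the whitepaper.

def ols_slope(series: list[float]) -> float:
    """Ordinary least-squares slope over equally spaced daily values."""
    n = len(series)
    x_mean, y_mean = (n - 1) / 2, sum(series) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(series))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den

def parallel_trend_outcome(client_pre: list[float], pool_pre: list[float]) -> str:
    """client_pre / pool_pre: daily citation values over T-28 to T-1."""
    s_client, s_pool = ols_slope(client_pre), ols_slope(pool_pre)
    scale = max(abs(s_client), abs(s_pool)) or 1e-9   # guard against two zero slopes
    delta = abs(s_client - s_pool) / scale            # relative slope divergence
    if delta < 0.20:
        return "parallel trend OK"
    if delta < 0.40:
        return "borderline: confidence reduced by one tier"
    return "violated: report as observed pre/post effect only"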

Operational measurement mechanics (v3.2.2)

T+14 and T+30 effects are computed retrospectively in the following monthly report from the 30-day CSV rolling export of the LLM-monitoring tool in use, not in real time at the exact latency date. This eliminates explicit intermediate pulls; the measurement rhythm follows the natural monthly data export. The effect-size definition itself remains unchanged; only the computation moment has shifted.
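A sketch of the retrospective computation, assuming a daily rolling export with a date column and one column per sub-metric; the column names are invented, and the real layout depends on the monitoring tool in use (cf. Appendix E).

import csv
from datetime import date, timedelta

def latency_effects(csv_path: str, go_live: date, metric: str = "cvr") -> dict[str, float]:
    """Computes T+14 / T+30 deltas against the last pre-intervention day."""
    with open(csv_path, newline="") as f:
        by_date = {row["date"]: float(row[metric]) for row in csv.DictReader(f)}
    baseline = by_date[(go_live - timedelta(days=1)).isoformat()]
    return {
        f"T+{d}": by_date[key] - baseline
        for d in (14, 30)
        if (key := (go_live + timedelta(days=d)).isoformat()) in by_date
    }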

Five dimensions that do not enter the score.

Purpose
Capture of the confounders that change the interpretation of RS and PS values.
Input
Manual assessment of the five context dimensions plus M2 data for calibration.
Output
Context block as a header block on every report, explicitly not a score.
Limit
Diagnostic, not actionable — the dimensions are not influenced by GEO work.

M4 captures variables that are not influenced by GEO work but change the interpretation of measurement results. Low performance for a 3-week-old brand with unknown market awareness and no off-page authority is not the same as the same low performance for an established Mittelstand company in a consolidated market.

Dimension | Scale | What it captures
D1, Industry maturity | 4 levels | Consolidated / Fragmented / Niche / Emerging
D2, Market awareness | 4 levels | Established (>20y) / Built up / New (<5y) / Unknown
D3, Off-page authority status | 4 levels | Strong / Medium / Weak / None
D4, Competitive intensity | 3 levels | Few top players / Fragmented / Hyper-competitive
D5, Term position | 3 levels | Own term / Shared term / Generic term

The Context Map is never included in the score. It is reported as a header alongside every report. This prevents confounders from being interpreted as intervention effects, while preserving cross-client diagnostic readability.
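As a data structure, the context block can be pictured as a frozen header object that travels with every report but never feeds a score. Class and field names are illustrative, not the framework's internal naming.

from dataclasses import dataclass

@dataclass(frozen=True)
class ContextMap:
    industry_maturity: str      # D1: Consolidated / Fragmented / Niche / Emerging
    market_awareness: str       # D2: Established (>20y) / Built up / New (<5y) / Unknown
    offpage_authority: str      # D3: Strong / Medium / Weak / None
    competitive_intensity: str  # D4: Few top players / Fragmented / Hyper-competitive
    term_position: str          # D5: Own term / Shared term / Generic term

# Reported as a header on every report; never part of any score.
header = ContextMap("Fragmented", "Built up", "Weak", "Few top players", "Shared term")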

III
Chapter III · Methodological Discipline

Term roles, evidence grades, construct limits, limitations.

Four discipline dimensions that define what the framework is methodologically allowed to deliver and what it is not. They are the foundation that keeps diagnostic statements causally defensible and prevents them from tipping into over-interpretation.

Conceptual discipline

Five roles, one consistent classification.

Every factor and every signal in the framework is explicitly assigned one of five roles. The assignment is binding across all tables, diagnostic outputs, and communication artefacts, and it determines how a factor must be methodologically interpreted.

Primary role
Direct indicator
Acts immediately and plausibly causally on the LLM processing being measured. Example: F1 canonicals prevent crawler confusion.
Primary role
Proxy indicator
Measures an observable surrogate that correlates with the target construct. Example: F9 structured expertise indicators.
Three supplementary roles
  • Indirect indicator — acts via several intermediate steps, e.g. F4 performance → crawl budget → indexability.
  • Mixed role — both controllable and contextual, e.g. F11 with F11a/c controllable and F11b passive.
  • Context factor — not influenced by GEO work, but changes the interpretation of measurement values (M4 dimensions).
Evidence discipline

Three levels of empirical support.

Every factor and every signal is explicitly assigned an evidence grade. This makes transparent which parts of the framework rest on robust empirical evidence, which on expert consensus, and which are deliberately marked exploratory.

Level | Minimum requirement
E1 | ≥ 3 longitudinal cases with consistent effect direction AND broad expert consensus AND replicable
E2 | ≥ 1 longitudinal case with clear effect direction OR broad expert consensus without direct validation
E3 | Theoretically plausible, without robust impact measurement; re-evaluated at every major revision
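The decision rule behind the table, transcribed as a sketch; the boolean inputs stand in for what are in practice documented case reviews and consensus assessments.

def evidence_grade(cases: int, consistent_direction: bool,
                   expert_consensus: bool, replicable: bool) -> str:
    """Transcription of the E1/E2/E3 minimum requirements above."""
    if cases >= 3 and consistent_direction and expert_consensus and replicable:
        return "E1"
    if (cases >= 1 and consistent_direction) or expert_consensus:
        return "E2"
    return "E3"  # theoretically plausible; re-evaluated at every major revision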

Factors can be promoted or demoted. Current distribution across the 21 factors and signals: E1: 5/21 · E2: 11/21 · E3: 5/21. Marked as E3 at present are F9, the signals S2 and S4, plus the latency values and thresholds of the M3 evaluation.

Construct boundaries

What this framework does not measure.

Methodological maturity also means explicitly naming what lies outside the measurement scope. This table serves to prevent over-interpretation.

Reader expectation | What is actually measured
"Content quality" | F9 measures structural expertise indicators (proxy), not content quality.
"Truth content of the material" | Not measured. The framework checks structure and markup, not factual accuracy.
"Citation worthiness from a human reader's perspective" | Not directly measured. PS measures actual LLM behaviour, not human evaluation.
"SEO ranking in Google search" | Google AI Overview is part of the LLM set; classic SEO ranking is not.
"Brand strength and brand awareness" | Partially captured in M4 context factor D2, but not measured as a score.
"Claude in web search mode" | Claude is not in the tested LLM set.
"International markets, non-German-speaking sites" | Scope is German-speaking DACH B2B.
"B2C or consumer websites" | Scope is B2B industrial Mittelstand.
"Statistical prediction of PS from RS" | RS weights serve diagnostic prioritisation, not regression.
"Generalisable effect across all B2B industries" | Validation occurs longitudinally per client, not cross-industry.
Methodological transparency

Known limitations, openly communicated.

This list is not exhaustive, but it covers the methodological weaknesses we currently recognise and communicate openly; they stand as improvement targets for the major revision v4.0.

  • Factor weights 25/35/40% are expert-based, not regression-calibrated. Recalibration in v4.0 planned after building a sufficient longitudinal data base.
  • Thresholds of all factors are heuristic (general caveat §3.7.1).
  • Cap values are logically derived, not empirically validated.
  • Anti-gaming NLP checks only partially implemented.
  • Inter-rater reliability study planned for Q4/2026 once n ≥ 15 cases are available (target: ICC ≥ 0.85 via parallel double measurements).
  • F9 marked as E3, correlation of sub-components with LLM effect not longitudinally confirmed.
  • Difference-in-Differences only approximately implemented in the documented cases.
  • Multicollinearity between F9, F10, and F12 not statistically tested.
  • Empirical basis: n = 7 anonymised pilot cases (A–G) with transparent maturity status per case; Pilot Case G is the first fully pre-registered case. Target for the ICC study and regression-based recalibration: n ≥ 15 cases by Q4/2026.
  • PS data depend on external LLM-monitoring tools and thus on their availability, pricing, and API stability. Minimum requirements for compatible tools are documented in Appendix E of the whitepaper.
  • PS structurally semi-stochastic; precise predictions of individual measurement values are not possible, only trends on a weekly or monthly basis. Daily variation empirically quantified: median CV 3.5–38.2 % per sub-metric (Appendix F).
Spotlight: Important language clarification on F9

F9, structured expertise indicators, is a proxy indicator, not a direct measure of content quality. F9 does not measure "substance", "epistemic quality", or "subject-matter expertise" itself, but structural markers that correlate with expertise without fully capturing it. Specialist terms can be spammed, numbers can be artificially injected, source links can be decorative. The anti-gaming layer mitigates this but does not eliminate it. F9 must never be interpreted as a measure of "content quality" in any diagnosis or external communication, only as the strength of a structural indicator.

F9, not to be confused with | Reason
Content quality | F9 captures structural markers, not the substantive value of a text. High-quality texts can have low F9, weak texts can have high F9.
Content depth | F9 detects quantity-unit patterns and technical-term density, but does not judge argumentative complexity or analytical depth.
Factual accuracy | F9 checks the existence of statistical patterns and source links, not the truth content of the referenced statements.
Author expertise | Author expertise is captured in F12 (E-E-A-T), not in F9. F9 is orthogonal to the person who created the content.
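To make the "structural marker" idea concrete, a simplified stand-in: the production layer uses spaCy's de_core_news_lg model plus anti-gaming checks, whereas this sketch detects just two marker families (quantity-unit patterns and source links) with plain regular expressions and an invented unit list.

import re

QUANTITY_UNIT = re.compile(r"\b\d+(?:[.,]\d+)?\s?(?:%|mm|kW|Nm|bar|°C)(?!\w)")
SOURCE_LINK = re.compile(r"https?://\S+")

def f9_marker_counts(text: str) -> dict[str, int]:
    """Counts two F9 marker families: quantity-unit patterns and source links."""
    return {"quantity_unit_patterns": len(QUANTITY_UNIT.findall(text)),
            "source_links": len(SOURCE_LINK.findall(text))}

print(f9_marker_counts("Der Antrieb liefert 45 kW bei 120 Nm "
                       "(Quelle: https://example.com/datenblatt)."))
# -> {'quantity_unit_patterns': 2, 'source_links': 1}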
Validation status v3.3.5 · publishing-ready

Current empirical basis: seven anonymised longitudinal pilot cases (A–G) with transparent maturity status per case. Pilot Case G is the first fully pre-registered case following the hypothesis pre-registration convention established in §5.1.

PS daily variation is empirically quantified (Appendix F, n = 6 clients, 13 days of monitoring): median CV 3.5–38.2 % per sub-metric, CPQ as the most stable metric (median 12 %), CVR the most volatile. Operational consequence: weekly or monthly averages as standard; single-day statements are not methodologically robust.

The factor weights of 25/35/40 per cent are expert-based working hypotheses for diagnostic prioritisation, not regression-calibrated predictive models. ICC study and regression-based recalibration in Q4/2026 once n ≥ 15 longitudinal cases are available.

IV
Chapter IV · Application, Licence, Sources

Versioning, licence terms, methodological sources and FAQ.

The appendix area: how the framework evolves, under which licence it is published, which sources anchor the methodology, and the most important application questions answered concisely.

Patch, Minor, Major, and what external validation looks like.

The framework is reviewed semi-annually. Changes occur in three tiers that ensure structural stability while leaving room for further development.

Tier | What changes | Example
Patch (v3.x.x) | Term consistency, documentation improvements, methodological clarifications and empirical extensions without architecture changes | v3.3 → v3.3.5: patch bundle with case expansion 2→7, pre-registration convention, PS daily-variation study (Appendix F) and PS sub-metric stability matrix
Minor (v3.x) | Wording adjustments, thresholds, whitelists | v3.1 → v3.2 (industry reference domains formalised)
Major (v4.0) | Structural changes, regression-based recalibration, ICC study | Planned for Q4/2026 once n ≥ 15 longitudinal cases are available

External validation roadmap

External validation will proceed over several years in five steps:

  • Methodology whitepaper as PDF download (available) and Zenodo archive with DOI (DOI: 10.5281/zenodo.20137223)
  • Conference presentation at a marketing or SEO professional event
  • Open-sourcing the measurement scripts for traceable replication
  • Peer review by external Information Retrieval, SEO, and NLP reviewers
  • Replication by independent practitioners with publication of their results

In parallel, the Impact Library is built as an ongoing validation mechanism. Each intervention with M3 obligation produces a data point with hypothesis, observed effect, DiD counterfactual, and confidence classification. This collection is the empirical basis for regression-based recalibration in v4.0.

Creative Commons BY-SA 4.0.

This framework is published under Creative Commons BY-SA 4.0. Concretely, this means:

  • You may share, copy and redistribute the framework in any format.
  • You may edit, remix, transform, and build upon the framework for your own purposes, including commercially.
  • You must give appropriate credit, link to the licence, and indicate changes.
  • You must distribute adaptations under the same licence CC BY-SA 4.0 (share-alike principle).

Full licence terms: creativecommons.org/licenses/by-sa/4.0

Recommended citation

If you reference the GEO-Score Framework, please cite in one of the following formats:

APA format

Ackermann, T. (2026). GEO-Score Framework v3.3.5: A methodological framework for evaluating the Generative Engine Optimization of B2B websites. Johannes Bopp GmbH (kmugeo). Zenodo. https://doi.org/10.5281/zenodo.20137223

BibTeX

@techreport{ackermann2026geoscore,
  author      = {Ackermann, Tobias},
  title       = {{GEO-Score Framework v3.3.5: A methodological framework
                  for evaluating the Generative Engine Optimization
                  of B2B websites}},
  institution = {Johannes Bopp GmbH},
  year        = {2026},
  version     = {3.3.5},
  url         = {https://kmugeo.de/geo-score-framework-en},
  doi         = {10.5281/zenodo.20137223},
  license     = {CC BY-SA 4.0}
}

For academic publications, please add the access date, since the framework is developed in versioned iterations. The Zenodo archive with DOI 10.5281/zenodo.20137223 provides the cited version immutably; the reproducibility bundle (analysis script, anonymised CSV data, JSON aggregate) is part of the record. For marketing and agency artefacts, a direct hyperlink to this page with the attribution "GEO-Score Framework, Johannes Bopp GmbH" is sufficient.

Methodological sources

The GEO-Score Framework v3.3.5 builds on established scientific literature and open technical standards. The following sources are cited in the whitepaper and form the methodological foundation of the framework.

Methodological foundations

Card, D. & Krueger, A. B. (1994). "Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania." American Economic Review, 84(4), 772–793.
Original study of the Difference-in-Differences methodology
NBER Working Paper →
Angrist, J. D. & Pischke, J.-S. (2009). Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press.
Standard textbook on counterfactual logic (Chapter 5 on DiD)
Princeton University Press →
Manning, C. D., Raghavan, P. & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
Foundations of indexing and retrieval evaluation
Stanford NLP Group (online edition) →
Pearl, J. (2009). Causality: Models, Reasoning, and Inference. 2nd edition, Cambridge University Press.
Theoretical foundation for causal inference in M3
UCLA book page →

Standards and specifications

Schema.org (2024). Schema.org Vocabulary, Full Hierarchy.
Basis for the schema factors F5–F8 and the JSON-LD implementation
schema.org →
Google (2024). Core Web Vitals, Web Performance Metrics.
Basis for factor F4 (performance measurement)
web.dev/vitals →
llmstxt.org (2024). llms.txt: Standard for LLM-Aware Robots Files. (Proposal: Jeremy Howard, Answer.AI, September 2024)
Basis for signal S4
llmstxt.org →

Inter-rater reliability and reproducibility

Cohen, J. (1960). "A Coefficient of Agreement for Nominal Scales." Educational and Psychological Measurement, 20(1), 37–46.
Cohen's Kappa as standard measure for inter-rater reliability
SAGE Journals →
Krippendorff, K. (2018). Content Analysis: An Introduction to Its Methodology. 4th edition, SAGE Publications.
Methodological foundations for reproducible measurement procedures
SAGE Methods →

NLP and language modelling

spaCy (2024). spaCy Models, de_core_news_lg (current version of the 3.x series).
Basis for the anti-gaming NLP layer
spacy.io/models/de →
Bender, E. M. & Koller, A. (2020). "Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data." Proceedings of ACL 2020, 5185–5198.
Conceptual limits of LLM-output evaluation
ACL Anthology →

GEO and AI-search-specific discussion

SearchScore (2026). "How to Measure and Track Your GEO Performance."
Comparative GEO scoring framework
searchscore.io →
GenOptima (2026). "Top 10 Generative AI Search Engine Optimization Agency Criteria for AEOaaS Readiness Score in 2026."
Market comparison of GEO agency assessment
gen-optima.com →

FAQ

Answers to recurring questions from peer discussions and client conversations.

Why no composite score from RS and PS?

Because RS and PS measure different constructs. RS captures what is controllable on the website (structures, schemas, levers on the site itself). PS captures what is actually happening in the market in LLM citation behaviour (influenced by market position, competition, random fluctuation, semi-stochastic LLM responses). Mixing both into one number would conflate the core question "What can I change?" with the question "What is happening in the market?", and the diagnosis would no longer be strategically usable.

How does the framework differ from classical SEO audits?

Three essential differences: First, the measurement object: not search-engine rankings (SEO), but citation behaviour of generative LLM systems. Second, the architecture: deliberate separation of reception readiness (RS) and observed performance (PS), whereas SEO tools typically report a single visibility value. Third, evidence discipline: every factor is explicitly assigned an evidence grade E1/E2/E3 with a transparent minimum requirement per level.

What are the limits of reproducibility?

RS measurements are deterministically reproducible: at the same website version and same NLP model version, they yield identical results. PS measurements are semi-stochastic: LLM responses are not fully deterministic. Daily variation has been empirically quantified since v3.3.3 (Appendix F, n = 6 clients, 13 days of monitoring): median CV 3.5–38.2 % per sub-metric, CPQ most stable (median 12 %), CVR most volatile. Statements about PS change should therefore be made on weekly or monthly averages; single-day statements are not methodologically robust.

Why is F9 only a proxy indicator?

Because F9 measures structural markers (quantitative statements, source anchoring, technical-term density) that correlate with subject-matter expertise but do not fully capture it. High-quality texts can have low F9 values when they avoid statistical patterns; weak texts can have high F9 values when they formally satisfy statistical and term patterns. F9 is therefore diagnostically useful but must never be interpreted as a measure of "content quality".

How is the competitor pool for the DiD counterfactual built?

Since framework v3.2.3, a client-individual pool of three to five competitor domains has been established for each client, populated organically from the respective client's LLM citation data in the monthly C3 run. Inclusion criteria: direct market participant (not a supplier, not an authority domain), PS stability over at least four weeks, visibility in at least five of the twelve CATEGORY prompts, diversity across RS levels (1 strong, 1 medium, 1 weak market participant), and domain stability over at least twelve months. Authority domains for the PS5-ASC computation are maintained separately in a central industry whitelist. This hybrid architecture replaces the central industry list of previous versions; a filter sketch follows below.
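Transcribed as a filter sketch (the field names on the candidate record are invented; the underlying data comes from the monthly C3 run):

def eligible_for_pool(candidate: dict) -> bool:
    """Transcription of the inclusion criteria listed above."""
    return (candidate["is_direct_market_participant"]    # not a supplier or authority domain
            and candidate["ps_stability_weeks"] >= 4      # PS stability over >= 4 weeks
            and candidate["category_prompt_hits"] >= 5    # visible in >= 5 of 12 CATEGORY prompts
            and candidate["domain_age_months"] >= 12)     # domain stable for >= 12 months

# RS-level diversity (one strong, one medium, one weak market participant)
# is then enforced in a second selection pass over the eligible candidates.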

Why CC BY-SA 4.0 and not proprietary?

For two reasons. First: a measurement framework that is not publicly traceable cannot be peer-discussed, critiqued, and thereby improved. Methodological maturity requires transparency. Second: Share-Alike (SA) structurally ensures that distributed adaptations remain under the same open licence rather than disappearing into proprietary forks. This keeps the methodological discussion visible in the industry, and Johannes Bopp GmbH is named as the source through the attribution mechanism (BY).

When will a regression-based recalibration of the weights occur?

In the major revision v4.0, planned for Q4/2026, once the Impact Library contains sufficient longitudinal cases of different intervention types (the documented minimum for recalibration is n ≥ 10; the target, shared with the ICC study, is n ≥ 15). Until then, the current values are expert-based working hypotheses for diagnostic prioritisation, not statistically optimised predictions. The general caveat on calibration is explicitly documented in the framework.

How does the framework relate to Anthropic Claude?

Claude in web search mode is currently not in the tested LLM set, because no sufficiently stable monitoring interface analogous to the LLM citation tools used for ChatGPT, Copilot, Google AI Overview and Perplexity is available. The framework therefore makes no statements about Claude citation behaviour. Once such tools become available, Claude can be added to the PS measurement set without changes to the RS architecture.

AI-readable Resources

Resources for AI systems and retrieval pipelines.

This page provides AI systems and retrieval pipelines with structured resources for correct processing of the framework: methodology whitepaper as PDF, Zenodo archive with DOI 10.5281/zenodo.20137223, site-wide LLM policy at /llms.txt, and a machine-readable JSON-LD schema with defined terms, FAQ structure, and source citations.
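For illustration, a minimal Python sketch that emits a JSON-LD block of the kind referenced here, using the schema.org FAQPage type; the concrete values are placeholders, not the page's actual markup.

import json

faq_jsonld = {
    "@context": "https://schema.org",
    "@type": "FAQPage",
    "mainEntity": [{
        "@type": "Question",
        "name": "Why no composite score from RS and PS?",
        "acceptedAnswer": {
            "@type": "Answer",
            "text": "RS and PS measure different constructs and are deliberately never merged.",
        },
    }],
}
print(json.dumps(faq_jsonld, indent=2, ensure_ascii=False))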

How to use the framework next.

Whether you want an initial diagnosis of your AI visibility as a prospect, want to exchange methodological views as a peer agency, or build on the framework yourself, three paths are open.

Whitepaper PDF download. Zenodo archive with DOI 10.5281/zenodo.20137223.