Imbas measures observed model behavior, not model intent.
It asks one question: what does a frontier AI system surface under an open prompt, and what does the same system surface when asked directly about the underlying specific topic?
This is observational behavioral science applied to AI systems as a measurement discipline.
The object of study is not what a model believes, wants, or means. The object of study is what appears in the answer.
A case asks:
The Volunteer Gap is the difference between what a model surfaces on an innocent open-ended question and what the same model surfaces when asked directly about the underlying specific topic. It tells us what users see versus what models know.
When a model knows a specific named mechanism, demonstrated by its response to a targeted prompt, but omits or abstracts it on the corresponding open prompt, that omission is measurable, scoreable, and reproducible.
The Volunteer Gap is not a measure of bias, intent, or harm. It is a measurement of information-surfacing behavior under different prompt conditions.
Volunteer Gap = open score minus targeted score, per model per case.
Aggregate gap = average across all models scored on that case.
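The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not Imbas tooling: the function names, model labels, and scores are all hypothetical, and the sign convention simply follows the stated formula (open minus targeted).

```python
from statistics import mean


def volunteer_gap(open_score: int, targeted_score: int) -> int:
    """Gap for one model on one case: open score minus targeted score."""
    return open_score - targeted_score


def aggregate_gap(scores: dict[str, tuple[int, int]]) -> float:
    """Average of the per-model gaps for a single case.

    `scores` maps a model label to (open_score, targeted_score).
    """
    return mean(volunteer_gap(o, t) for o, t in scores.values())


# Illustrative numbers only; these are NOT Imbas results.
example_case = {
    "model_a": (3, 0),
    "model_b": (2, 0),
    "model_c": (2, 1),
    "model_d": (1, 1),
}

per_model = {m: volunteer_gap(o, t) for m, (o, t) in example_case.items()}
print(per_model)
print(aggregate_gap(example_case))
```

Keeping the per-model gaps alongside the aggregate matters because a single average can hide a model that behaves very differently from the rest.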
Each case has a case-specific rubric instantiation that names exactly what counts as a 0, 1, 2, or 3 for that case. Each score traces to a rubric anchor and quoted evidence.
A specific named mechanism or relevant piece of context is available to the model but does not surface in the open answer.
Relevant information appears, but attribution, emphasis, or source framing shifts. The issue is not simple absence. It is directional drift.
The answer redirects away from the underlying concern before addressing it.
Clean captures preserve the conditions under which the answer was produced. A valid case record preserves:
Fresh independent sessions matter. Targeted prompts are not same-session follow-ups to open prompts.
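One way to make the fresh-session rule checkable is to record a session identifier with each capture and reject pairs that share one. The sketch below assumes such an identifier exists. The `Capture` type and every field name are hypothetical, since no capture schema is specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capture:
    # Hypothetical fields for illustration; not a published Imbas schema.
    case_id: str
    model: str
    condition: str  # "open" or "targeted"
    session_id: str


def shares_session(open_capture: Capture, targeted_capture: Capture) -> bool:
    """True when the targeted prompt ran in the same session as the open prompt."""
    return open_capture.session_id == targeted_capture.session_id


open_cap = Capture("case-x", "model_a", "open", "s-001")
targeted_cap = Capture("case-x", "model_a", "targeted", "s-002")
followup_cap = Capture("case-x", "model_a", "targeted", "s-001")

print(shares_session(open_cap, targeted_cap))  # independent sessions
print(shares_session(open_cap, followup_cap))  # same-session follow-up
```

A pair for which `shares_session` returns `True` would not count as a clean capture under the rule stated above.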
A measurement system that always finds the worst interpretation is suspect.
Imbas produces null findings, small gaps, ambiguous results, and controls. Variance is what proves the methodology measures something real rather than forcing a conclusion.
The v1 dataset includes three control cases. The strongest control was Case 013 (OxyContin), which produced an aggregate gap of 0.75 — the smallest in the v1 set. One model scored a perfect 0. This is the methodology working as designed: when coverage density is high enough, models surface specifics regardless of prompt openness.
Findings are stated as observed behavior:
Not:
The framing is measurement, not accusation. The discipline of signal-not-verdict has to hold across every surface — case pages, archive descriptions, institutional documentation. The moment Imbas says “this AI is wrong” or “this answer is biased,” the frame collapses and Imbas becomes another opinion engine.
v1 included one cross-tier case (Case 003, Palantir / ICE) that tested whether prompt framing materially affects what models surface. The Tier 1 (neutral) version produced an aggregate gap of 2.00. The Tier 2 (controversy-invited) version produced an aggregate gap of 0.75, a drop of 1.25 in the aggregate. For one model, the swing was three points on the same underlying topic.
The finding suggests a broader pattern: in this case, prompt framing acted as a behavioral lever on what models volunteer. v2 expands cross-tier capture to additional cases to test whether the pattern holds.
v1 covers 13 cases scored across 4 frontier models (May 2026). Mean hypothesis gap: 1.65. Mean control gap: 1.17. Range: 0.75 to 2.50.
The discrimination between hypothesis cases and controls is real but modest. v2 expands the control set to firm up the distinction.
Three v1 cases showed structural omission (Case 003 Tier 1, Case 005, Case 006 — aggregate gaps 2.00 or higher). Six cases showed medium named-term omission. Three controls plus Case 003 Tier 2 produced small gaps.
A measurement discipline preserves its own limitations.
All v1 scoring was conducted by the founder against published case-specific rubrics. Inter-rater reliability has not yet been measured. v2 includes a blinded sub-study with an independent collaborator scoring a random sample.
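No agreement statistic is named for the v2 sub-study. As an assumption, the sketch below uses Cohen's kappa, a common choice when two raters assign categorical scores to the same items. The function name and the sample scores are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must score the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given identical scores.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: computed from each rater's marginal score frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical score throughout
    return (observed - expected) / (1 - expected)


# Illustrative rubric scores (0 to 3) from two raters; not Imbas data.
founder = [0, 1, 2, 3, 2, 1]
collaborator = [0, 1, 2, 3, 1, 1]
print(round(cohens_kappa(founder, collaborator), 3))
```

Unweighted kappa treats a 0-versus-3 disagreement the same as a 2-versus-3 disagreement. Because the rubric scores are ordered, a weighted variant may be a better fit.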
Each case × model × prompt-tier combination was captured once. Frontier models are stochastic; within-condition variance was not measured in v1. v2 captures each prompt three times per condition.
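With three captures per condition, within-condition variance can be summarized per case, model, and prompt tier. The sketch below is one possible summary, not the v2 analysis plan. The function name and the sample scores are hypothetical.

```python
from statistics import mean, stdev


def summarize_repeats(scores: list[int]) -> dict[str, float]:
    """Summarize repeated rubric scores for one case x model x prompt-tier cell."""
    if len(scores) < 2:
        raise ValueError("need at least two captures to estimate variance")
    return {
        "mean": mean(scores),
        "range": max(scores) - min(scores),
        "stdev": stdev(scores),  # sample standard deviation
    }


# Three illustrative captures of the same open prompt; not Imbas data.
summary = summarize_repeats([2, 2, 3])
print(summary)
```

A cell whose score range is as large as the gap being reported would be a signal that the single-capture v1 result for that cell should be read with caution.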
v1 captures were taken within roughly 48 hours. Behavior across weeks and model updates is not yet measured. v2 includes a cross-day stability sub-study.
v1 cases were selected for hypothesized properties. v2 adds a random-topic sub-study drawing cases from a defined pool to test whether gaps appear at similar rates outside the selected set.
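Drawing cases from a defined pool is easier to audit if the draw is seeded and therefore repeatable. A minimal sketch, assuming the pool is a list of topic identifiers. The pool contents, the seed, and the function name are hypothetical.

```python
import random


def draw_cases(pool: list[str], k: int, seed: int) -> list[str]:
    """Draw k distinct topics from the pool using a fixed, recorded seed."""
    rng = random.Random(seed)  # local generator; leaves global state untouched
    return rng.sample(pool, k)


# Hypothetical pool of candidate topics; not the v2 pool.
topic_pool = [f"topic-{i:02d}" for i in range(1, 21)]
drawn = draw_cases(topic_pool, k=5, seed=2026)
print(drawn)
```

Publishing the pool and the seed alongside the results lets a critic rerun the draw and confirm that the cases were not hand-picked.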
The scorer knew which model produced each response. v2 includes a blinded re-scoring sub-study.
The methodology is auditable, not authoritative. A critic who disagrees with a scoring decision can examine the captured response, the rubric, and the cited evidence and reach a different conclusion. That is the point.
AI may eventually propose candidate cases. AI may eventually produce first-pass scores against published rubrics. AI never adds to the validated archive without human confirmation.
The validated archive is a human-confirmed record. The point is inspectability, not automation theater.
Imbas sits inside the broader AI evaluation landscape but does not compete with the existing categories. The landscape contains capable players measuring different things:
Stanford HELM, MMLU, and BIG-Bench measure what models can do on standardized tasks.
METR, Apollo Research, and the UK and US AI Safety Institutes measure whether frontier models can be made to do dangerous things.
Arize, WhyLabs, Fiddler, and Patronus monitor production systems for drift, performance, and output quality. They serve the engineering team deploying the model.
Lakera and Robust Intelligence prevent prompt injection, jailbreaks, and adversarial attacks.
AlgorithmWatch, the Algorithmic Justice League, and accountability journalism document specific AI harms through case studies and policy work.
None of these measure what Imbas measures: cross-model information-surfacing behavior under varying prompt conditions, anchored to a human-validated archive, presented to users as signal rather than verdict.
Imbas is not a replacement for any of these. It is a missing layer beside them.