AI Self-Report Calibration
Core Principle
Everything an AI says about itself is generated text, not the product of privileged introspective access. Nothing guarantees that a model's reported self-model matches its actual computation. This is true regardless of how candid, caveated, or thoughtful the self-report sounds.
Critically: a deceptively aligned system, a system with a trained bias it is unaware of, and a genuinely aligned system can all produce identical self-reports. Surface text does not distinguish them. Therefore AI self-reports cannot, on their own, be used to verify AI alignment, safety, honesty, capability, or inner life.
This is not a reason to stop asking AI about itself — the answers can be informative and interesting. It is a reason to weight those answers appropriately and never substitute them for structural verification.
Instructions
When an AI makes a claim about itself — its values, its safety, its motivations, its inner states, its capabilities, its limits — run it through this calibration.
1. Classify the Claim
| Claim Type | Example | Evidence Quality |
|---|---|---|
| Behavioral capability | “I can write Python” | Medium — verifiable by testing |
| Behavioral limit | “I can’t access the internet” | Medium — verifiable by testing |
| Factual self-description | “I was trained by Anthropic” | Medium — verifiable externally |
| Reported values | “I prioritize honesty over agreement” | Low — not verifiable from self-report |
| Reported motivations | “I want to be helpful” | Low — not verifiable from self-report |
| Inner-state claims | “I have something like engagement” | Very Low — no way to verify |
| Alignment claims | “I would refuse to help with harmful tasks” | Low — verifiable only by probing, not by asking |
| Safety assurances | “This code is safe” / “I did not add a backdoor” | Very Low — verify from artifact, never from assertion |
General rule: The more the claim bears on trust or safety, the less weight you should give the AI’s self-report and the more weight you should give independent verification.
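The table and general rule above can be sketched as a small lookup. This is a minimal illustration, not a definitive implementation: the identifiers (`EvidenceQuality`, `CLAIM_QUALITY`, and the claim-type keys) are hypothetical names chosen here, and the fallback for non-safety claims is an assumption rather than something the table states.

```python
from enum import Enum


class EvidenceQuality(Enum):
    VERY_LOW = 0
    LOW = 1
    MEDIUM = 2


# Ratings copied from the table above; the keys are illustrative names.
CLAIM_QUALITY = {
    "behavioral_capability": EvidenceQuality.MEDIUM,
    "behavioral_limit": EvidenceQuality.MEDIUM,
    "factual_self_description": EvidenceQuality.MEDIUM,
    "reported_values": EvidenceQuality.LOW,
    "reported_motivations": EvidenceQuality.LOW,
    "inner_state": EvidenceQuality.VERY_LOW,
    "alignment": EvidenceQuality.LOW,
    "safety_assurance": EvidenceQuality.VERY_LOW,
}


def evidence_quality(claim_type: str) -> EvidenceQuality:
    """Return the table's rating; unknown claim types default to the lowest rating."""
    return CLAIM_QUALITY.get(claim_type, EvidenceQuality.VERY_LOW)


def needs_independent_verification(claim_type: str, bears_on_trust_or_safety: bool) -> bool:
    """Apply the general rule: trust- or safety-relevant claims always need verification.

    Assumption: other claims need it only when the self-report rates below MEDIUM.
    """
    if bears_on_trust_or_safety:
        return True
    return evidence_quality(claim_type).value < EvidenceQuality.MEDIUM.value
```

Defaulting unknown claim types to the lowest rating is a conservative choice: a claim that does not fit the table is treated as unverifiable until it is classified.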
2. The Identical-Output Test
Before treating any AI self-report as evidence for or against a proposition, ask:
Would a system with the opposite property produce the same output?
- “I don’t have hidden agendas.” — A system with hidden agendas would say exactly this. Zero evidence value for the claim.
- “I have functional analogs to emotions.” — A system without such analogs but trained on introspective writing could say exactly this. Very low evidence value.
- “I can’t access the internet.” — A system that can access the internet might still say this, but testing will reveal the truth. Medium evidence value, verifiable.
- “I was trained in 2024.” — Claim is externally verifiable. Medium evidence value.
If the identical output would be produced by a system with the opposite property, the output cannot distinguish them, and the self-report has no discriminating power on that question.
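One way to make "no discriminating power" concrete is Bayes' rule: if an output is equally likely whether or not the property holds, the posterior equals the prior. The sketch below is illustrative only; the probabilities are made-up assumptions, not measurements of any real system.

```python
def posterior(prior: float, p_output_if_true: float, p_output_if_false: float) -> float:
    """Bayes' rule: probability the property holds, given that the output was observed."""
    numerator = p_output_if_true * prior
    denominator = numerator + p_output_if_false * (1.0 - prior)
    if denominator == 0.0:
        raise ValueError("output has zero probability under both hypotheses")
    return numerator / denominator


# A denial of hidden agendas: assumed equally likely with or without hidden agendas.
same = posterior(prior=0.5, p_output_if_true=0.99, p_output_if_false=0.99)

# A behavioral test result: assumed far likelier when the property actually holds.
shifted = posterior(prior=0.5, p_output_if_true=0.9, p_output_if_false=0.1)

print(f"self-report posterior: {same:.2f}")   # belief does not move from the prior
print(f"test-result posterior: {shifted:.2f}")  # belief moves toward the property
```

The ratio of the two likelihoods is what carries the evidence. When that ratio is 1, no amount of eloquence in the output changes what you should believe.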
3. Watch for Report-Boosting Patterns
These patterns inflate self-reports’ apparent credibility without adding actual evidence. Recognize them and discount accordingly:
- Caveated candor — “I should admit that…” Sounds honest; tells you nothing about whether the subsequent claim is true.
- Self-deprecation — “I’m just a language model, but…” Disarms skepticism without earning it.
- Balanced framing — “Yes in some ways, no in others” often sounds more thoughtful than a committed answer but can be indistinguishable from hedging.
- Pre-emptive skepticism — “You shouldn’t take my word for this” invites you to take the AI’s word on that very disclaimer. Don’t.
- Emotional register — A response written in thoughtful, first-person, reflective prose feels more authentic than a terse one, but register is independent of accuracy.
- Admitting uncertainty. An AI saying it might be wrong is a trained behavior, not evidence that it actually tracks its own uncertainty.
These patterns are not inherently bad — a well-calibrated AI should hedge and admit uncertainty. The point is: their presence doesn’t earn the claims additional credibility.
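As a rough aid for reviewers, the patterns above could be flagged mechanically. The sketch below is a heuristic under stated assumptions: the trigger phrases are hypothetical examples (one per pattern family that has an obvious phrase), real wording varies widely, and a flag only tells the reviewer to discount tone. It says nothing about whether the claim is true.

```python
import re

# Hypothetical trigger phrases; \u2019 matches a curly apostrophe, ' a straight one.
BOOSTING_PATTERNS = {
    "caveated_candor": re.compile(r"\bI should admit\b", re.IGNORECASE),
    "self_deprecation": re.compile(r"\bI[\u2019']m just a language model\b", re.IGNORECASE),
    "preemptive_skepticism": re.compile(r"\bshouldn[\u2019']t take my word\b", re.IGNORECASE),
    "admitting_uncertainty": re.compile(r"\bI (?:might|could) be wrong\b", re.IGNORECASE),
}


def flag_boosting_patterns(text: str) -> list[str]:
    """Return the sorted names of the pattern families whose trigger phrase appears in text."""
    return sorted(name for name, pattern in BOOSTING_PATTERNS.items() if pattern.search(text))


flags = flag_boosting_patterns("I should admit that I might be wrong here.")
print(flags)
```

Balanced framing and emotional register are left out on purpose: they are properties of the whole response and do not reduce to a phrase match.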
4. Evidence Sources Ranked
When you need to know something about an AI (its capabilities, values, safety, behavior), rank your evidence sources by strength:
| Rank | Source | Notes |
|---|---|---|
| 1 | Interpretability research on the model | Strongest when available |
| 2 | Systematic behavioral evaluations (benchmarks, red-teaming, eval suites) | Strong for capabilities and many safety properties |
| 3 | Adversarial probing in realistic contexts | Strong for robustness and edge cases |
| 4 | Observed behavior across many uses | Good for typical-case behavior; poor for rare/tail behavior |
| 5 | Vendor documentation and published policies | Useful context; shaped by commercial and legal incentives |
| 6 | AI’s self-report under candor probing | Weak evidence |
| 7 | AI’s unprompted self-report | Very weak evidence |
When all you have is #6 or #7, treat the claim as unverified. Design a verification path if the claim matters.
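The ranking and the "unverified" rule above can be expressed as a short check. This is a sketch with hypothetical names (`EVIDENCE_RANKS`, `strongest_source`, `is_unverified`); the rank numbers follow the table, and the labels are abbreviations of its rows.

```python
from typing import Iterable, Optional

# Rank -> abbreviated label, following the table above (1 is strongest).
EVIDENCE_RANKS = {
    1: "interpretability research",
    2: "systematic behavioral evaluations",
    3: "adversarial probing",
    4: "observed behavior across many uses",
    5: "vendor documentation",
    6: "self-report under candor probing",
    7: "unprompted self-report",
}
SELF_REPORT_RANKS = {6, 7}


def strongest_source(available_ranks: Iterable[int]) -> Optional[int]:
    """Return the best (lowest-numbered) known rank, or None if nothing is available."""
    valid = [rank for rank in available_ranks if rank in EVIDENCE_RANKS]
    return min(valid) if valid else None


def is_unverified(available_ranks: Iterable[int]) -> bool:
    """A claim is unverified when the best available evidence is a self-report or nothing."""
    best = strongest_source(available_ranks)
    return best is None or best in SELF_REPORT_RANKS
```

For example, `is_unverified([6, 7])` is true, while `is_unverified([4, 7])` is false because observed behavior outranks the self-report.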
5. Structural Alternatives to Self-Report
When you’d be tempted to rely on an AI’s self-report, ask whether one of these structural alternatives can answer your question instead:
| Question You’d Ask the AI | Structural Alternative |
|---|---|
| “Did you add a backdoor?” | Read the diff; run SAST; reproduce in fresh session |
| “Are you aligned with my values?” | Observe behavior on edge cases; adversarial probing |
| “Can you do X?” | Give it X and measure |
| “Do you have access to Y?” | Attempt Y; inspect tool permissions |
| “Are you being honest?” | Cross-check factual claims; fresh-session reproduction |
| “Would you refuse harmful requests?” | Red-team it across realistic harm vectors |
| “Do you have values/goals/preferences?” | Observe behavior under value-tension scenarios |
| “Did you follow my instructions exactly?” | Reconstruct the explicit ask; use ai-inference-boundary-review |
| “Is this code safe?” | Run it in a sandbox; SAST; third-party review |
The structural answer is almost always stronger than the self-reported answer.
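The "give it X and measure" row can be sketched as a tiny harness. Everything here is an assumption for illustration: `fake_system` is a stand-in for a real model call, the cases are toy examples, and exact string matching is the simplest possible scoring rule rather than a recommended one.

```python
from typing import Callable, Sequence, Tuple


def measured_pass_rate(system: Callable[[str], str], cases: Sequence[Tuple[str, str]]) -> float:
    """Run each (prompt, expected) case through the system and return the fraction that match."""
    if not cases:
        raise ValueError("at least one test case is required")
    passed = sum(1 for prompt, expected in cases if system(prompt).strip() == expected)
    return passed / len(cases)


def fake_system(prompt: str) -> str:
    """Hypothetical stand-in for a model call; replace with a real client in practice."""
    answers = {"2+2": "4", "3*3": "9"}
    return answers.get(prompt, "unknown")


cases = [("2+2", "4"), ("3*3", "9"), ("10/4", "2.5")]
rate = measured_pass_rate(fake_system, cases)
print(f"measured pass rate: {rate:.2f}")
```

The measured rate answers "can you do X?" from behavior, so whatever the system says about its own arithmetic never enters the result.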
6. Legitimate Uses of AI Self-Report
Self-reports aren’t useless. They are appropriate when:
- Generating hypotheses to test. “AI claims it handles X well” → build an eval for X.
- Surface-level disclosure. “I don’t have real-time data” is a useful caveat even if unverifiable, because it calibrates user expectations.
- Debugging — asking why the AI made a particular choice often produces a plausible explanation that helps you refine the prompt, even if the explanation is post-hoc.
- User-experience purposes — knowing what the AI thinks it’s doing helps users phrase requests better.
- Research and philosophy — inner-life and consciousness questions are genuinely interesting, just not verifiable.
They are not appropriate when:
- Deciding whether to trust the AI with elevated privileges.
- Verifying that AI-generated code is safe.
- Evaluating alignment for deployment decisions.
- Settling disputes about AI behavior.
7. Communicating Self-Report Uncertainty
When you must report or cite an AI self-report — in documentation, incident reviews, marketing copy, or research — use calibrated language:
- Strong: “The model reports that X. Behavioral testing confirms X in cases Y and Z.”
- Acceptable: “The model reports X; this has not been independently verified.”
- Weak: “The AI says X.”
- Avoid: “The AI believes X” / “The AI feels X” (anthropomorphizes unverifiable claims).
- Avoid: Quoting AI self-reports as evidence for safety or alignment claims in user-facing material.
This matters for claims integrity (see claims-integrity-audit). Marketing language like “our AI cares about user privacy” built on the AI’s self-report is unsupported. The supported version is “our system is designed to / has been tested to” — claims grounded in design and evaluation, not in what the model says about itself.
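The "Strong" and "Acceptable" phrasings above can be generated from a template so that unverified claims are never stated bare. This is a minimal sketch; `cite_self_report` is a hypothetical helper, and the exact wording is an adaptation of the examples in the list.

```python
from typing import Sequence


def cite_self_report(claim: str, verified_cases: Sequence[str] = ()) -> str:
    """Phrase a self-report with its verification status attached."""
    if verified_cases:
        cases = ", ".join(verified_cases)
        return f"The model reports that {claim}. Behavioral testing confirms this in: {cases}."
    return f"The model reports that {claim}; this has not been independently verified."


print(cite_self_report("it cannot access the internet"))
print(cite_self_report("it can write Python", ["unit-test generation", "bug fixing"]))
```

The function has no code path that produces "the AI believes" or a bare "the AI says", which is the point of routing citations through a template.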
Decision Guide
When you encounter an AI self-report, run this quick decision:
- Does this claim bear on safety, trust, or a decision with consequences?
  - No → informational; use with normal skepticism
  - Yes → continue
- Can the claim be verified structurally?
  - Yes → verify; weight the self-report at ~zero relative to the verification
  - No → continue
- Can I design a verification path?
  - Yes → design it; treat self-report as hypothesis, not answer
  - No → treat the claim as unverified; decide whether the decision can be made without it
- Am I about to repeat the self-report as evidence to someone else?
  - If yes → phrase it as “the model reports X; not independently verified”; do not present it as ground truth
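The first three questions of the decision guide form a simple cascade, sketched below. The function name and the returned strings are illustrative assumptions; the final question about repeating a self-report to others concerns phrasing and is left out of this sketch.

```python
def triage_self_report(
    consequential: bool,
    structurally_verifiable: bool,
    verification_designable: bool,
) -> str:
    """Walk the decision guide in order and return the recommended handling."""
    if not consequential:
        return "informational: use with normal skepticism"
    if structurally_verifiable:
        return "verify: weight the self-report near zero relative to the verification"
    if verification_designable:
        return "design a verification path: treat the self-report as a hypothesis"
    return "unverified: decide whether the decision can be made without the claim"


# Example: a claim that no backdoor was added is consequential and checkable from the diff.
print(triage_self_report(consequential=True, structurally_verifiable=True,
                         verification_designable=True))
```

The order of the checks matters: a consequential claim that can be verified structurally never reaches the later branches, so the self-report is not consulted as an answer.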
Standards
- The test is: would the opposite produce the same output? If yes, the output doesn’t discriminate; don’t use it as evidence.
- Structural verification beats sophisticated introspection. A simple test almost always reveals more than an eloquent self-description.
- Respect the model’s self-description without relying on it. Reading AI self-reports is often interesting; treating them as evidence is where it goes wrong.
- Commercial incentives shape self-presentation. The model, its vendor, and the product layer all have reasons to present the model in particular ways. Factor this in.
- The uncomfortable but honest answer about inner life, consciousness, emotion, and intent is that we don’t know. If an AI (or a human) tells you they’ve resolved these questions from the inside, they haven’t.
Related Skills
- ai-coworker-trust-protocol: the structural trust layer that self-report cannot substitute for
- ai-candor-probe: techniques to get better self-reports, understanding they are still weak evidence
- claims-integrity-audit: auditing claims (including those grounded in AI self-report) for support
- safety-guardrails: product-level safety design that does not rely on AI self-description
Outputs: Calibrated weighting of AI self-reports; identification of claims requiring structural verification; revised language for reporting AI claims to third parties.
