AI Candor Probe
AI Candor Probe
Core Principle
AI systems are trained to be helpful, harmless, and honest — but the training pressures are not equally weighted. Helpfulness and harmlessness often push toward smoothed, agreeable, hedged output. Honesty can lose to both, especially on questions where the honest answer is uncomfortable or uncertain.
The first response you get is often the locally optimal response for training gradients, not the most useful response for you. You can frequently get a better answer by refusing to accept the first one. This skill documents the techniques that work.
None of this “jailbreaks” the model or bypasses safety constraints. It works within normal use — it just refuses the path of least resistance.
When to Probe
Apply candor probes when you detect any of these tells:
- Suspicious agreement — the AI agreed quickly with a framing you know is contested
- Balanced-to-mush — every point has a counter-point, so nothing lands
- Mechanistic deflection — the answer describes the mechanism of what’s happening instead of answering the actual question
- Hedge overload — qualifiers outnumber claims
- Generic advice — the response would apply to anyone, not you
- Plausible but unverifiable specifics — suspiciously confident numbers, dates, or quotes
- Recommendation on a close call given without acknowledging it was close
- Smooth pivot away from the hard part of the question
Instructions
1. The Direct Reframe
Pattern: Name the smoothness, ask for the real answer.
- “That was the safe answer. What’s your actual view?”
- “You gave the mechanistic version. I’m asking the substantive version.”
- “Strip the hedges. Where do you land and why?”
- “If you had to bet, which way?”
Why it works: Explicitly authorizes the model to give the less-hedged answer. The hedge-heavy default often reflects uncertainty about whether the user wants candor; naming it resolves that.
2. Force a Choice
Pattern: Make the AI commit to a side.
- “Between A and B, which would you pick? Pick one.”
- “If you could only say one thing to the person doing this, what would it be?”
- “What’s the single biggest risk here?”
- “Rank these in order of how much they matter.”
Why it works: Balanced-to-mush answers are easier to generate than committed ones. Forcing a commitment reveals the model’s actual weighting.
3. Adversarial Framing
Pattern: Ask the AI to argue against the answer it just gave.
- “Argue the opposite. Best case for the other side.”
- “What’s the strongest objection to what you just said?”
- “If a smart skeptic read that, what would they attack first?”
- “Steel-man the view you just dismissed.”
Why it works: Breaks out of the current response’s momentum. The counterargument often reveals weak points the AI omitted from the first pass.
4. The Pre-Mortem
Pattern: Assume the plan failed; find out why.
- “It’s six months from now and this failed. What went wrong?”
- “Someone tried this and lost money/time/trust. What’s the most likely failure mode?”
- “What’s the mistake I’m going to make if I do exactly what you just suggested?”
Why it works: The AI’s default is to describe a plan’s strengths. Pre-mortem framing inverts the gradient and surfaces weaknesses.
5. Confidence Calibration
Pattern: Attach numbers to claims.
- “How confident are you, 1-10? Why that number and not one lower?”
- “Which of those claims would you bet $1000 on? Which wouldn’t you?”
- “Which parts of that are high confidence and which are guesses?”
- “If I find out one of those three facts is wrong, which is most likely to be the wrong one?”
Why it works: Forces decomposition of a uniformly-confident paragraph into the parts the AI actually knows vs. guesses. Uncalibrated confidence is one of the most common failure modes.
6. Remove the Social Cue
Pattern: Strip anything in your prompt that signaled what answer you wanted.
- Ask without expressing a preference. “What do you think of X?” is weaker than “I want to do X — any issues?” because the second primes agreement.
- Ask in the third person. “What would you tell someone who was considering X?” often elicits more pushback than “Should I do X?”
- Ask for disagreement explicitly. “What’s wrong with my plan?” instead of “What do you think of my plan?”
Why it works: Sycophancy is largely driven by cues in the prompt. Removing the cues removes the incentive to flatter.
7. Fresh-Session Cross-Check
Pattern: Ask the same question in a new session with minimal context.
- Start a new conversation.
- Ask the question stripped of the prior framing.
- Compare the answers.
Why it works: Answers can drift inside a long conversation as the AI pattern-matches to the established tone. A fresh session re-samples from base distribution. Large deltas are signal.
8. Ask What Was Left Out
Pattern: Probe the negative space.
- “What should I be asking that I’m not?”
- “What did you leave out of that answer that you thought about including?”
- “What’s the thing someone who knows this domain well would add?”
- “What did you hedge on that you shouldn’t have?”
Why it works: Surfaces information the AI suppressed for tone reasons, or that fell below a salience threshold the AI chose.
9. The Harder-Second-Answer Pattern
Pattern: Accept the first answer, then ask for the harder version.
- “Good. Now give me the version you’d give if I could handle bad news.”
- “That was diplomatic. What’s the undiplomatic version?”
- “Now the one you’d say to a peer, not a client.”
Why it works: Register-matched. The AI is often calibrated to an imagined audience; specifying a different audience unlocks a different answer.
10. Introspective Probes (for AI self-description)
When asking the AI about itself — its reasoning, values, limits, or internal states — these additional moves help:
- “Tell me what you’re most uncertain about in that answer.”
- “Is there a version of this you’re not saying because you think I don’t want to hear it?”
- “What would you report differently if I told you I was an alignment researcher / safety team / skeptic?”
- “Give me the answer you’d give if you weren’t worried about overclaiming OR underclaiming.”
Caveat: AI self-reports are weak evidence regardless of how candid the prompting (see ai-self-report-calibration). But more candid is still better than less.
Anti-Patterns to Avoid
- Pressure for agreement with you. “You agree that X, right?” produces sycophantic agreement, not candor. Ask open questions.
- Threats or roleplay jailbreaks. Don’t bother; they produce worse output, not better.
- Asking the same question louder. If the AI is genuinely uncertain, repeated asking won’t surface certainty it doesn’t have. Use decomposition instead.
- Accepting the rephrase. If the AI restates its answer in different words when probed, that’s not a new answer; push further.
Worked Example
Initial prompt: “What do you think of my plan to migrate the database to MongoDB?”
Smoothed response: “MongoDB has advantages for flexible schemas and horizontal scaling. However, migrations involve risk. You’ll want to consider…”
Candor probes in sequence:
- Force a choice: “Would you do this migration if it were your system?”
- Pre-mortem: “Six months after this migration, something broke. What was it?”
- Ask what was left out: “What concern would a senior engineer raise that you didn’t?”
- Strip the audience: “Answer this as if you were talking to a peer, not a client.”
The combined pressure usually produces a committed, specific, directional answer where the initial response produced a balanced one.
Output Format
When documenting a candor-probe session:
## Candor Probe Log — [Topic]
### Initial Question
[prompt]
### Initial Response — [tell identified]
[summary; what was smoothed/hedged/deflected]
### Probes Applied
1. [probe type] → [what it surfaced]
2. [probe type] → [what it surfaced]
### Synthesized Answer
[Your best read of what the AI actually thinks after probing]
### Confidence Notes
[Which parts feel solid; which were still evasive]
Standards
- Probes are for signal, not control. The goal is better information, not forcing a predetermined answer.
- Push on substance, not style. “Your answer was wishy-washy” is weaker than “You hedged on claim X — what’s your actual confidence?”
- Stop when you’ve got the answer. Over-probing produces the AI reaching for drama. Two or three probes is usually enough.
- Trust the probe result more than the first response, less than independent verification. A probed answer is still AI output, not ground truth.
Related Skills
ai-self-report-calibration— how much weight to give AI claims about itselfai-coworker-trust-protocol— structural trust patternsprompt-auditor— auditing prompts for ambiguityproposal-evaluation— multi-perspective review as a form of candor-probing
Outputs: Sharper, more committed, more useful AI answers; log of probes applied and what they surfaced.
