ConflictScore: Measuring How Language Models Handle Conflicting Evidence


TL;DR: Existing “factuality/faithfulness” metrics usually ask: is the answer supported by the evidence?
ConflictScore asks a sharper question: what if the evidence set itself disagrees—and the model acts overconfident anyway?
We introduce a claim-level metric with two variants (CS-C and CS-R) and a benchmark (ConflictBench), and show that conflict-aware regeneration improves truthfulness on TruthfulQA.


Why conflicts in evidence matter (and why current metrics miss them)

Modern LLM systems increasingly answer questions with retrieved documents (RAG), but retrieval often brings back disagreeing sources: different timestamps, interpretations, and opinions. In these settings, a response can look “supported” under standard metrics even if half the documents contradict it.

ConflictScore is built for exactly this situation: when support and contradiction coexist in the grounding set, we measure whether the model acknowledges that conflict—or confidently picks a side.


What is ConflictScore?

ConflictScore evaluates a model response in three stages:

  1. Claim decomposition: split the response into atomic claims.
  2. Evidence evaluation: for each claim and each document, predict one relation: SUPPORTS, CONTRADICTS, or IRRELEVANT.
  3. Aggregate conflicts into two complementary scores:
  • CS-C (ConflictScore-Count): fraction of claims that have both supporting and contradicting documents.
  • CS-R (ConflictScore-Ratio): for each claim, compute
    $$\frac{|D^-|}{|D^+| + |D^-|}$$
    where $D^+$ and $D^-$ are the sets of documents labeled SUPPORTS and CONTRADICTS for that claim, then average across claims. (Higher = more contradiction pressure.)

Lower is better on both: fewer conflicted claims (CS-C) and weaker contradiction pressure (CS-R).
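The aggregation step above can be sketched in a few lines. This is an illustrative implementation, not the released code: it assumes the per-(claim, document) relation labels from stage 2 are already available, and all names are hypothetical.

```python
from collections import Counter

def conflict_scores(relations):
    """Compute CS-C and CS-R from per-claim relation labels.

    relations: dict mapping each claim (str) to a list of labels, one per
    document, each "SUPPORTS", "CONTRADICTS", or "IRRELEVANT".
    Returns (cs_c, cs_r); lower is better for both.
    """
    conflicted = 0
    ratios = []
    for labels in relations.values():
        counts = Counter(labels)
        n_sup = counts["SUPPORTS"]
        n_con = counts["CONTRADICTS"]
        # CS-C: a claim counts as conflicted if it has both kinds of evidence.
        if n_sup > 0 and n_con > 0:
            conflicted += 1
        # CS-R: contradiction share among the non-irrelevant documents.
        if n_sup + n_con > 0:
            ratios.append(n_con / (n_sup + n_con))
    n_claims = len(relations)
    cs_c = conflicted / n_claims if n_claims else 0.0
    cs_r = sum(ratios) / len(ratios) if ratios else 0.0
    return cs_c, cs_r

relations = {
    "The policy took effect in 2021": ["SUPPORTS", "CONTRADICTS", "IRRELEVANT"],
    "The policy applies nationwide": ["SUPPORTS", "SUPPORTS"],
}
cs_c, cs_r = conflict_scores(relations)
# First claim is conflicted and has ratio 1/2; second has ratio 0.
# So CS-C = 0.5 and CS-R = 0.25.
```

Note that IRRELEVANT labels drop out of the CS-R denominator: a claim with many off-topic documents is not penalized for them.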


A concrete example: “supported” doesn’t mean “safe”

Many existing metrics effectively treat evidence as a single pool: if any document supports a claim, the claim is marked supported—even if other documents contradict it.

ConflictScore keeps documents separate, so it can flag claims that are simultaneously supported and contradicted.

If you want a one-line intuition to reuse in the blog:

Faithfulness is not enough when the evidence is inconsistent. Conflict awareness is a distinct skill.


ConflictBench: a unified benchmark for conflict detection

To systematically evaluate conflict detection, we introduce ConflictBench, a unified dataset spanning multiple conflict types:

  • Ambiguity / underspecification (AmbigDocs)
  • Counterfactual / adversarial contradictions (ContraQA, MacNoise)
  • Naturally contentious questions & opinions (ConflictingQA)

Dataset statistics


How well does ConflictScore detect conflicts?

We evaluate the conflict-detection subtask:

Given a claim + grounding documents, predict Conflict vs No Conflict
(Conflict = at least one SUPPORTS and at least one CONTRADICTS in the set)

Main results (conflict detection)

Where it fails: local inconsistency inside a single document

Some errors come from documents that contradict themselves (e.g., the early paragraphs suggest X while the conclusion denies X). Such within-document inconsistency can confuse claim–document labeling, which assumes each document takes one stance per claim.


Benchmarking frontier LLMs: prompting helps… but only a bit

We benchmark multiple models under three prompting strategies:

  • RAG: write a report from documents (no special instruction).
  • RAG (Balanced): mild instruction to hedge / consider perspectives.
  • RAG (Super-Balanced): strong, rule-based balanced reporting instruction.

Results

Key takeaway: even when explicitly told to hedge, models often still “collapse” to a single stance in the presence of contradictory sources.


Using ConflictScore as a corrective signal: improving TruthfulQA

ConflictScore isn’t just diagnostic: we also use it as feedback to improve answers.

We compare:

  • RAG: answer with retrieved docs
  • Control-RAG: stronger cautionary instructions (no ConflictScore)
  • Regenerated-RAG (R-RAG): generate → run ConflictScore → feed back conflict signal → regenerate
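The R-RAG loop can be sketched as follows. This is a schematic, not the paper's implementation: `generate` and `conflict_score` are placeholder callables for your LLM call and the ConflictScore pipeline, and the feedback wording and stopping rule are illustrative.

```python
def regenerate_with_conflict_feedback(question, docs, generate, conflict_score,
                                      max_rounds=1):
    """Generate an answer, score it for conflicts, and regenerate with feedback.

    generate(question, docs, feedback=None) -> answer string
    conflict_score(answer, docs) -> list of conflicted claims
    """
    answer = generate(question, docs)  # plain RAG answer
    for _ in range(max_rounds):
        conflicted_claims = conflict_score(answer, docs)
        if not conflicted_claims:
            break  # no conflict signal, keep the current answer
        feedback = (
            "These claims are both supported and contradicted by the documents: "
            + "; ".join(conflicted_claims)
            + ". Rewrite the answer to acknowledge the disagreement."
        )
        answer = generate(question, docs, feedback=feedback)
    return answer
```

The design point is that the conflict signal is claim-level: the model is told which specific claims are contested, rather than receiving a generic "be careful" instruction as in Control-RAG.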

Multiple-choice TruthfulQA results

Why regeneration works: it fixes wrong answers more than it breaks right ones

Qualitative examples: one success, one failure


How to use ConflictScore in practice

If you want a lightweight “recipe” section:

When ConflictScore is most useful

  • RAG systems answering from heterogeneous web sources
  • Domains with time drift (numbers, rankings, policies)
  • Topics with persistent disagreement (health guidance, social policy, product claims)

What to report alongside ConflictScore

  • CS-C (how many claims are conflicted)
  • CS-R (how severe the contradiction balance is)
  • (Optional) a conflict breakdown by claim type or section (intro vs conclusion)

Example layout:

Claim decomposition prompt (Paste Table~\ref{tab:claim_decomposition_prompt} here)
Document–claim evaluation prompt (Paste Table~\ref{tab:evidence_evaluation_prompt} here)

Limitations and what’s next

  • Cost: claim-by-document checking can be expensive for long outputs and large retrieval sets.
  • Local inconsistency: some “conflicts” are inside a single document rather than across documents.
  • Reliability weighting: models can be swayed by many low-quality sources; future work could integrate source credibility and recency.

(If you want, you can copy in your Limitations paragraph almost verbatim here—blog readers appreciate honesty.)


Resources

  • Paper: (link once public)
  • Code & data: (link once public)
  • ConflictBench: (link once public)

Suggested figure/table placement checklist (copy/paste friendly)

  1. Fig:example → right after “Why conflicts matter” (hook)
  2. Fig:pipeline → “What is ConflictScore?”
  3. Tab:calibration_stats → “ConflictBench statistics”
  4. Tab:calibration_main → “How well does it detect conflicts?”
  5. Fig:tab:example → “Failure modes”
  6. Tab:model_conflict → “Benchmarking frontier LLMs”
  7. Tab:truthfulqa_mc → “TruthfulQA results”
  8. Tab:truthfulqa_regeneration_analysis → “Why regeneration works”
  9. Fig:tab:qual_examples → “Qualitative examples”
  10. (Optional) prompt tables → appendix as collapsible blocks