ConflictScore: Measuring How Language Models Handle Conflicting Evidence

TL;DR: Existing “factuality/faithfulness” metrics usually ask: is the answer supported by the evidence?
ConflictScore asks a sharper question: what if the evidence set itself disagrees—and the model acts overconfident anyway?
We introduce a claim-level metric (CS-C, CS-R) and a benchmark (ConflictBench), and show that conflict-aware regeneration improves truthfulness on TruthfulQA.
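To make the core idea concrete, here is a minimal sketch of what a conflict-aware score could look like. The actual CS-C/CS-R definitions are not given in this TL;DR, so the function names, the stance labels, and the formula below are illustrative assumptions, not the paper's method:

```python
# Hypothetical sketch: penalize model confidence when the retrieved
# evidence set disagrees with itself. Names and formula are assumptions.

def evidence_conflict(stances):
    """Disagreement in [0, 1]: 0 = unanimous evidence, 1 = even split."""
    support = sum(1 for s in stances if s == "support")
    refute = sum(1 for s in stances if s == "refute")
    total = support + refute
    if total == 0:
        return 0.0
    return 1.0 - abs(support - refute) / total

def overconfidence_penalty(model_confidence, stances):
    """High confidence on a claim with conflicting evidence scores worst."""
    return model_confidence * evidence_conflict(stances)

# A model 90% confident despite a 2-2 evidence split gets the full penalty;
# the same confidence with unanimous support gets none.
print(overconfidence_penalty(0.9, ["support", "refute", "support", "refute"]))  # 0.9
print(overconfidence_penalty(0.9, ["support", "support", "support"]))           # 0.0
```

The point of such a score is that it is orthogonal to answer-level support: a claim can be fully "supported" by one snippet while the evidence set as a whole disagrees, which is exactly the case answer-only faithfulness metrics miss.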