If you've ever used an AI assistant to answer a question about a PDF or a financial report, you've probably experienced the uncanny valley of correctness: the model nails the answer, but the source it points to doesn't exist or, worse, says something completely different. The industry has been treating this as a minor annoyance, a quirk of generative models that will smooth out with scale. The CiteVQA benchmark from Peking University and Shanghai AI Laboratory shows that this isn't a quirk. It's a fundamental blind spot in how we evaluate and trust AI systems.
The benchmark is elegantly brutal. Instead of just checking whether the answer is right, CiteVQA demands that the model point to the exact paragraph, table, or figure that justifies each claim. A page number won't cut it. The metric, Strict Attributed Accuracy, awards points only when both the answer and the citation are correct. The results are sobering. Gemini-3.1-Pro-Preview, the best performer, scored 76 out of 100. GPT-5.4, which gets the raw answer right 87.1% of the time, plummets to 59 once citations are required. That's a 28-point gap between knowing and showing your work. Open-source models fare even worse: Qwen3-VL-235B-A22B managed just 22.5 points, and smaller models scored below 10. The researchers didn't mince words: these models are "extremely risky" for regulated industries.
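To make the gap between "right answer" and "right answer with the right source" concrete, here is a minimal sketch of a strict, both-must-be-correct metric. It is not the benchmark's actual scoring code: the `Prediction` fields, the element ids, and the rule for what counts as a correct citation are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    answer_correct: bool  # hypothetical: the answer matched the reference
    cited: frozenset      # hypothetical: element ids the model cited
    gold: frozenset       # hypothetical: element ids that support the answer


def citation_correct(p: Prediction) -> bool:
    # Assumption: a citation is correct when the model cites at least one
    # element and every cited element is in the gold evidence set.
    # The paper's real matching rule may differ.
    return bool(p.cited) and p.cited <= p.gold


def answer_accuracy(preds) -> float:
    # Fraction of predictions whose answer is right, citations ignored.
    return sum(p.answer_correct for p in preds) / len(preds)


def strict_attributed_accuracy(preds) -> float:
    # A prediction scores only when the answer AND the citation are right.
    return sum(p.answer_correct and citation_correct(p) for p in preds) / len(preds)


preds = [
    Prediction(True, frozenset({"p4-para2"}), frozenset({"p4-para2"})),   # right answer, right source
    Prediction(True, frozenset({"p9-table1"}), frozenset({"p4-para2"})),  # right answer, wrong source
    Prediction(True, frozenset(), frozenset({"p7-fig3"})),                # right answer, no source
    Prediction(False, frozenset({"p2-para1"}), frozenset({"p2-para1"})),  # wrong answer, right source
]

print(answer_accuracy(preds))             # 0.75
print(strict_attributed_accuracy(preds))  # 0.25
```

On this toy data, three of four answers are right but only one survives the citation check, which is the same shape of drop the benchmark reports at scale.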
This isn't an academic hypothetical. In finance, healthcare, law, and audit, the traceability of a claim is what transforms an AI output from a suggestion into evidence. A correct diagnosis that cites the wrong page of a medical chart is worse than a wrong diagnosis, because it creates false confidence in a fabricated paper trail. The ablation study in the paper makes the mechanism clear: when researchers artificially narrowed the search space to the correct page or document, scores jumped by over 13 points for some models. The bottleneck isn't the model's ability to answer the question; it's the model's inability to find and attribute the evidence.
For builders deploying RAG systems, this should set off alarms. We've been optimizing for answer quality (BLEU, ROUGE, accuracy) while ignoring whether each answer can be traced back to its source. The conventional wisdom has been that more context is better: shove in longer passages, load the top-K chunks, and the model will sort it out. CiteVQA shows that this approach is fundamentally flawed. Models that can't locate the right source also give worse answers, and handing them accurate source information directly improves answer quality. Context engineering, then, isn't just about reducing noise. It's about building a traceable chain of evidence that the model can actually navigate.
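One way a traceable chain of evidence might look in a RAG pipeline is sketched below: every retrieved chunk carries a stable id and its provenance, and any id the model cites is checked against what was actually retrieved. The `Chunk` fields, the id scheme, and the bracketed citation format are assumptions for illustration, not something the benchmark prescribes.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str  # hypothetical id scheme, e.g. "r1-p3-table1"
    doc: str       # source document name
    page: int      # page the element came from
    kind: str      # "paragraph", "table", or "figure"
    text: str


def build_context(chunks) -> str:
    # Prefix every chunk with its id and provenance so the model can cite it.
    return "\n".join(
        f"[{c.chunk_id}] ({c.doc}, p.{c.page}, {c.kind}) {c.text}" for c in chunks
    )


# Assumption: the model is prompted to cite sources as [chunk_id].
CITATION = re.compile(r"\[([^\[\]]+)\]")


def check_citations(answer: str, chunks):
    # Split cited ids into those that were retrieved and those that were not.
    known = {c.chunk_id for c in chunks}
    cited = CITATION.findall(answer)
    valid = [cid for cid in cited if cid in known]
    unknown = [cid for cid in cited if cid not in known]
    return valid, unknown


chunks = [
    Chunk("r1-p3-table1", "report.pdf", 3, "table", "Revenue: 2023 = 4.1B, 2024 = 4.6B"),
    Chunk("r1-p5-para2", "report.pdf", 5, "paragraph", "Operating margin declined in 2024."),
]
answer = "Revenue rose 12% [r1-p3-table1], while margins fell [r1-p9-para4]."

valid, unknown = check_citations(answer, chunks)
print(valid)    # ['r1-p3-table1']
print(unknown)  # ['r1-p9-para4']
```

This check only catches citations to sources that were never retrieved. Confirming that a cited chunk actually supports the claim is the harder problem, and it needs a separate verification step.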
The deeper problem, as OpenAI recently noted, is systemic. Training and evaluation reward confident answers and penalize hedging. The model learns that guessing with authority is safer than saying "I don't know." That same dynamic fuels attribution hallucination: the model would rather fabricate a citation than admit it can't find one. Until our evaluation frameworks punish false attributions as severely as wrong answers, we will keep building models that sound smart but can't be trusted. CiteVQA is a much-needed step toward fixing that, but it's only the beginning. The next generation of AI systems must be measured not just by what they say, but by whether they can prove it.
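The incentive argument can be put in numbers with a small sketch. The scoring scheme here is hypothetical: a correct, correctly attributed answer earns `reward`, a wrong answer or false citation costs `penalty`, and abstaining scores zero.

```python
def expected_score(p_correct: float, reward: float = 1.0, penalty: float = 0.0) -> float:
    # Expected score of answering when the model is right with probability p_correct.
    return p_correct * reward - (1.0 - p_correct) * penalty


def should_guess(p_correct: float, penalty: float, abstain_score: float = 0.0) -> bool:
    # Guessing is the better policy only if it beats the score for abstaining.
    return expected_score(p_correct, penalty=penalty) > abstain_score


# Accuracy-only grading (penalty 0): guessing pays off even at 30% confidence.
print(should_guess(0.3, penalty=0.0))  # True

# Wrong answers and false citations cost a point: guessing at 30% now loses.
print(should_guess(0.3, penalty=1.0))  # False

# A confident model is still better off answering.
print(should_guess(0.7, penalty=1.0))  # True
```

With no penalty, any nonzero chance of being right makes a guess or a made-up citation the winning move. With a penalty equal to the reward, answering only pays off above 50% confidence, so "I can't find a source" becomes the rational output below that line.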