InterpDetect: Interpretable Signals for Detecting Hallucinations in Retrieval-Augmented Generation

Generative AI & LLMs
Published: arXiv:2510.21538v1
Authors

Likun Tan, Kuan-Wei Huang, Joy Shi, Kevin Wu

Abstract

Retrieval-Augmented Generation (RAG) integrates external knowledge to mitigate hallucinations, yet models often generate outputs inconsistent with retrieved content. Accurate hallucination detection requires disentangling the contributions of external context and parametric knowledge, which prior methods typically conflate. We investigate the mechanisms underlying RAG hallucinations and find they arise when later-layer FFN modules disproportionately inject parametric knowledge into the residual stream. To address this, we explore a mechanistic detection approach based on external context scores and parametric knowledge scores. Using Qwen3-0.6b, we compute these scores across layers and attention heads and train regression-based classifiers to predict hallucinations. Our method is evaluated against state-of-the-art LLMs (GPT-5, GPT-4.1) and detection baselines (RAGAS, TruLens, RefChecker). Furthermore, classifiers trained on Qwen3-0.6b signals generalize to GPT-4.1-mini responses, demonstrating the potential of proxy-model evaluation. Our results highlight mechanistic signals as efficient, generalizable predictors for hallucination detection in RAG systems.

Paper Summary

Problem
Large language models (LLMs) are remarkably good at generating fluent text, but they often produce "hallucinations": responses that are factually incorrect or not supported by evidence. This is a serious problem in high-stakes areas like healthcare, finance, and education, where accuracy is crucial. Even when models are grounded in retrieved external knowledge (as in Retrieval-Augmented Generation), they can still generate outputs that are inconsistent with the retrieved content.
Key Innovation
The authors introduce InterpDetect, a method that detects hallucinations by examining the mechanisms that produce them. They find that RAG hallucinations arise when later-layer feed-forward (FFN) modules inject a disproportionate amount of the model's parametric (internal) knowledge into the residual stream, crowding out the retrieved context. To capture this, they compute external context scores (ECS) and parametric knowledge scores (PKS) across the layers and attention heads of a small proxy model (Qwen3-0.6b) and train regression-based classifiers on these signals to predict hallucinations; a sketch of how such signals might be extracted follows.
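The snippet below is a minimal sketch of extracting per-layer/head signals with a small proxy model, not the paper's exact implementation. It assumes ECS can be approximated by the attention mass that answer tokens place on the retrieved-context tokens, and PKS by the size of the update each layer writes into the residual stream on answer tokens (the paper isolates the FFN part of that update); the model name Qwen/Qwen3-0.6B and the prompt template are assumptions.

```python
# Sketch: per-layer/head signals from a small proxy model (approximations of
# ECS/PKS as described in the lead-in, not the paper's exact definitions).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen3-0.6B"  # small proxy model, as in the paper
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    attn_implementation="eager",   # needed so attention weights are returned
    output_attentions=True,
    output_hidden_states=True,
)
model.eval()

def layer_head_signals(context: str, question: str, answer: str):
    """Return (ecs[layers, heads], pks[layers]) proxies for one example."""
    prompt = f"Context: {context}\nQuestion: {question}\nAnswer: {answer}"
    ctx_len = len(tok(f"Context: {context}")["input_ids"])   # rough span lengths
    ans_len = len(tok(answer)["input_ids"])
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)

    # ECS proxy: mean attention mass from answer tokens onto context tokens,
    # one value per (layer, head).
    ecs = torch.stack([
        att[0, :, -ans_len:, :ctx_len].sum(-1).mean(-1)      # -> (heads,)
        for att in out.attentions                             # one tensor per layer
    ])                                                        # -> (layers, heads)

    # PKS proxy: average norm of the update each layer adds to the residual
    # stream on answer tokens (the paper isolates the FFN contribution).
    hs = out.hidden_states                                    # (layers + 1) tensors
    pks = torch.stack([
        (hs[i + 1][0, -ans_len:] - hs[i][0, -ans_len:]).norm(dim=-1).mean()
        for i in range(len(hs) - 1)
    ])                                                        # -> (layers,)
    return ecs, pks
```

Flattening ecs and pks into one feature vector per response gives the input to the regression-based classifier, sketched after the next section.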
Practical Impact
The InterpDetect approach has several practical applications. It can flag hallucinated responses in high-stakes domains, reducing risk and improving user trust. Because the signals come from a small proxy model, classifiers trained on Qwen3-0.6b signals also generalize to responses from larger production models such as GPT-4.1-mini, which makes hallucination detection cheaper at scale; a sketch of this proxy-evaluation step follows. Finally, the same mechanistic signals could be used to steer generation toward more faithful, context-grounded outputs.
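The snippet below is a minimal sketch of the classifier and proxy-evaluation step, assuming the per-layer/head signals above have been flattened into one feature vector per (context, question, answer) triple. Logistic regression stands in for the paper's regression-based classifiers, and the split and metric are illustrative choices, not the paper's protocol.

```python
# Sketch: train a hallucination detector on proxy-model (Qwen3-0.6B) signals,
# then score responses produced by a different generator (e.g. GPT-4.1-mini)
# whose signals were also extracted with the same proxy model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def train_detector(X: np.ndarray, y: np.ndarray) -> LogisticRegression:
    """X: flattened ECS/PKS features per response; y: 1 = hallucinated, 0 = faithful."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    print(f"held-out ROC-AUC: {auc:.3f}")
    return clf

def score_responses(clf: LogisticRegression, X_new: np.ndarray) -> np.ndarray:
    """Hallucination probabilities for another model's responses, using features
    extracted by the same proxy model on those responses."""
    return clf.predict_proba(X_new)[:, 1]
```

The proxy model only reads (context, question, answer) triples, so scoring a production model's outputs this way does not require access to that model's internals; the paper reports that classifiers trained on Qwen3-0.6b signals transfer to GPT-4.1-mini responses.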
Analogy / Intuitive Explanation
Think of an LLM as a researcher answering a question with a stack of reference documents on the desk. The researcher also has a memory full of prior beliefs, some of which may be outdated or wrong. The ECS and PKS scores act like two gauges on where an answer is coming from: ECS measures how much the researcher is actually reading the documents (the external context), while PKS measures how much they are drawing on memory (the model's parametric knowledge). When the memory gauge spikes and the document gauge drops, the answer is likely a hallucination, so combining the two gives a good predictor of whether the answer is grounded in the sources.
Paper Information
Categories: cs.CL
arXiv ID: 2510.21538v1
