Beyond Pointwise Scores: Decomposed Criteria-Based Evaluation of LLM Responses

Explainable & Ethical AI
arXiv: 2509.16093v1
Authors

Fangyi Yu, Nabeel Seedat, Dasha Herrmannova, Frank Schilder, Jonathan Richard Schwarz

Abstract

Evaluating long-form answers in high-stakes domains such as law or medicine remains a fundamental challenge. Standard metrics like BLEU and ROUGE fail to capture semantic correctness, and current LLM-based evaluators often reduce nuanced aspects of answer quality into a single undifferentiated score. We introduce DeCE, a decomposed LLM evaluation framework that separates precision (factual accuracy and relevance) and recall (coverage of required concepts), using instance-specific criteria automatically extracted from gold answer requirements. DeCE is model-agnostic and domain-general, requiring no predefined taxonomies or handcrafted rubrics. We instantiate DeCE to evaluate different LLMs on a real-world legal QA task involving multi-jurisdictional reasoning and citation grounding. DeCE achieves substantially stronger correlation with expert judgments ($r=0.78$), compared to traditional metrics ($r=0.12$), pointwise LLM scoring ($r=0.35$), and modern multidimensional evaluators ($r=0.48$). It also reveals interpretable trade-offs: generalist models favor recall, while specialized models favor precision. Importantly, only 11.95% of LLM-generated criteria required expert revision, underscoring DeCE's scalability. DeCE offers an interpretable and actionable LLM evaluation framework in expert domains.
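To make the decomposition concrete, one plausible formalization (ours, not notation taken from the paper) is the following: let $C$ be the instance-specific criteria extracted from the gold answer and $A$ the atomic claims made in the candidate answer; then

$$\mathrm{Precision} = \frac{|\{\,a \in A : a \text{ is accurate and relevant}\,\}|}{|A|}, \qquad \mathrm{Recall} = \frac{|\{\,c \in C : c \text{ is covered by the candidate answer}\,\}|}{|C|}.$$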

Paper Summary

Problem
Evaluating long-form answers in high-stakes domains such as law, medicine, and finance remains a fundamental challenge. Current evaluation metrics like BLEU and ROUGE fail to capture semantic correctness, and LLM-based evaluators often reduce nuanced aspects of answer quality into a single undifferentiated score. This can lead to inaccurate assessments and real consequences, including tangible harm, legal liability, and erosion of trust in AI systems.
Key Innovation
Researchers introduce DeCE, a decomposed LLM evaluation framework that separates precision (factual accuracy and relevance) and recall (coverage of required concepts). DeCE is model-agnostic and domain-general, requiring no predefined taxonomies or handcrafted rubrics. It automatically extracts instance-specific, domain-aware criteria from gold-standard answers to perform a structured precision-recall decomposition.
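The two-stage idea (extract criteria from the gold answer, then judge precision and recall against them) can be sketched in a few lines of Python. This is a hedged illustration, not the paper's implementation: `call_llm`, the prompts, and the YES/NO parsing are placeholders standing in for whatever judge model and templates DeCE actually uses.

```python
"""Minimal sketch of a DeCE-style decomposed evaluation loop (illustrative only)."""

def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion client; plug in your own."""
    raise NotImplementedError

def extract_criteria(gold_answer: str) -> list[str]:
    """Ask an LLM to decompose the gold answer into atomic, checkable criteria."""
    response = call_llm(
        "List the distinct requirements a correct answer must cover, "
        "one per line, based on this reference answer:\n" + gold_answer
    )
    return [line.strip("- ").strip() for line in response.splitlines() if line.strip()]

def judge_recall(answer: str, criteria: list[str]) -> float:
    """Recall: fraction of the gold criteria that the candidate answer covers."""
    covered = sum(
        call_llm(
            "Does the answer below address this requirement? Reply YES or NO.\n"
            f"Requirement: {criterion}\nAnswer: {answer}"
        ).strip().upper().startswith("YES")
        for criterion in criteria
    )
    return covered / max(len(criteria), 1)

def judge_precision(question: str, answer: str) -> float:
    """Precision: fraction of the answer's own claims judged accurate and relevant."""
    claims = [
        line.strip() for line in call_llm(
            "Split this answer into atomic factual claims, one per line:\n" + answer
        ).splitlines() if line.strip()
    ]
    correct = sum(
        call_llm(
            "Is this claim factually accurate and relevant to the question? "
            f"Reply YES or NO.\nQuestion: {question}\nClaim: {claim}"
        ).strip().upper().startswith("YES")
        for claim in claims
    )
    return correct / max(len(claims), 1)
```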
Practical Impact
DeCE offers an interpretable and actionable LLM evaluation framework in expert domains. It achieves substantially stronger correlation with expert judgments (r=0.78) than traditional metrics (r=0.12), pointwise LLM scoring (r=0.35), and modern multidimensional evaluators (r=0.48). Its scalability is also demonstrated: only 11.95% of LLM-generated criteria required expert revision. The framework can be applied across high-stakes domains such as law, medicine, and finance to improve the accuracy and reliability of LLM evaluation.
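The headline numbers above are correlations between per-answer metric scores and expert ratings. A minimal sketch of that comparison, using scipy's Pearson correlation and made-up scores purely for illustration:

```python
from scipy.stats import pearsonr

# Made-up per-question scores, for illustration only:
# expert ratings (1-5 scale) vs. an automatic metric's scores (0-1 scale).
expert_scores = [4.0, 2.5, 5.0, 3.0, 4.5, 1.5]
metric_scores = [0.82, 0.40, 0.95, 0.55, 0.88, 0.20]

r, p_value = pearsonr(expert_scores, metric_scores)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```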
Analogy / Intuitive Explanation
Imagine evaluating a lawyer's response to a complex legal question. A standard evaluation metric might give a single score, but DeCE breaks down the evaluation into two dimensions: precision (how accurate and relevant is the answer?) and recall (how well does the answer cover the required concepts?). This decomposition provides a more nuanced understanding of the answer's quality and helps identify areas for improvement.
Paper Information
Categories: cs.CL, cs.AI
arXiv ID: 2509.16093v1
