SMILE: A Composite Lexical-Semantic Metric for Question-Answering Evaluation

Generative AI & LLMs
arXiv: 2511.17432v1
Authors

Shrikant Kendre, Austin Xu, Honglu Zhou, Michael Ryoo, Shafiq Joty, Juan Carlos Niebles

Abstract

Traditional evaluation metrics for textual and visual question answering, like ROUGE, METEOR, and Exact Match (EM), focus heavily on n-gram based lexical similarity, often missing the deeper semantic understanding needed for accurate assessment. While measures like BERTScore and MoverScore leverage contextual embeddings to address this limitation, they lack flexibility in balancing sentence-level and keyword-level semantics and ignore lexical similarity, which remains important. Large Language Model (LLM) based evaluators, though powerful, come with drawbacks like high costs, bias, inconsistency, and hallucinations. To address these issues, we introduce SMILE: Semantic Metric Integrating Lexical Exactness, a novel approach that combines sentence-level semantic understanding with keyword-level semantic understanding and easy keyword matching. This composite method balances lexical precision and semantic relevance, offering a comprehensive evaluation. Extensive benchmarks across text, image, and video QA tasks show SMILE is highly correlated with human judgments and computationally lightweight, bridging the gap between lexical and semantic evaluation.

Paper Summary

Problem
The main problem this paper addresses is the need for a more effective and efficient evaluation metric for question-answering (QA) tasks. Existing lexical metrics, such as Exact Match (EM) and ROUGE, are too stringent: they penalize responses that are semantically correct but phrased differently from the reference (see the sketch below). Large language models (LLMs) used as judges avoid this rigidity but have their own limitations, including high costs, biases, and inconsistencies.
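To make the stringency concrete, here is a minimal sketch of standard EM scoring (the whitespace/case normalization is a common simplification, not necessarily the paper's exact procedure). A semantically correct answer phrased differently from the reference scores zero:

    def exact_match(prediction: str, reference: str) -> int:
        """Standard EM: 1 if the normalized strings are identical, else 0."""
        def normalize(s: str) -> str:
            return " ".join(s.lower().split())
        return int(normalize(prediction) == normalize(reference))

    reference = "the Eiffel Tower"
    prediction = "It is the Eiffel Tower in Paris."
    print(exact_match(prediction, reference))  # 0, even though the answer is correct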
Key Innovation
The key innovation is SMILE (Semantic Metric Integrating Lexical Exactness), which combines sentence-level semantic understanding with keyword-level semantic understanding and easy keyword matching. By balancing lexical precision and semantic relevance, SMILE provides a comprehensive evaluation metric that remains lightweight and efficient.
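A rough sketch of how such a composite score could be assembled appears below. The three components mirror the paper's description, but the encoder choice (all-MiniLM-L6-v2 via sentence-transformers), the helper functions, and the mixing weights are illustrative assumptions, not the paper's exact formulation:

    # pip install sentence-transformers
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works here

    def keyword_overlap(prediction: str, keywords: list[str]) -> float:
        # Exact keyword matching: fraction of reference keywords found verbatim.
        pred = prediction.lower()
        return sum(kw.lower() in pred for kw in keywords) / max(len(keywords), 1)

    def keyword_semantic(prediction: str, keywords: list[str]) -> float:
        # Keyword-level semantics: mean cosine similarity of each keyword
        # embedding to the full-prediction embedding.
        pred_emb = model.encode(prediction, convert_to_tensor=True)
        kw_embs = model.encode(keywords, convert_to_tensor=True)
        return util.cos_sim(kw_embs, pred_emb).mean().item()

    def smile_style_score(prediction: str, reference: str, keywords: list[str],
                          w_sent: float = 0.4, w_kw: float = 0.3,
                          w_exact: float = 0.3) -> float:
        # Weighted blend of sentence-level semantics, keyword-level semantics,
        # and exact keyword matching (weights here are illustrative, not the paper's).
        sent_sim = util.cos_sim(
            model.encode(prediction, convert_to_tensor=True),
            model.encode(reference, convert_to_tensor=True),
        ).item()
        return (w_sent * sent_sim
                + w_kw * keyword_semantic(prediction, keywords)
                + w_exact * keyword_overlap(prediction, keywords))

    print(smile_style_score(
        prediction="The capital of France is Paris.",
        reference="Paris is the capital city of France.",
        keywords=["Paris", "France"],
    ))

Because each component is a scalar roughly in [0, 1], the blended score stays interpretable: a low value can be traced back to whichever component dropped.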
Practical Impact
SMILE can serve as a drop-in replacement for costlier LLM-based evaluators, offering a more cost-effective and interpretable solution for QA evaluation across modalities. Its strong correlation with human judgment makes it a promising alternative to both traditional metrics and LLM judges, and its composite design is interpretable by construction: each component score can be inspected separately, which makes model behavior easier to understand and debug.
Analogy / Intuitive Explanation
Think of SMILE as a referee in a game. A referee's job is to make sure the rules are followed while keeping the game fair in spirit. In QA evaluation, SMILE plays the referee: it judges whether a language model's response is both accurate and relevant. Just as a good referee balances the letter of the rules with the spirit of the game, SMILE balances lexical precision (the letter) with semantic relevance (the spirit) to deliver a comprehensive evaluation.
Paper Information
Categories: cs.CL, cs.AI, cs.CV
arXiv ID: 2511.17432v1