Disagreeing Rationales: Rethinking Classification and Explainability Evaluation in Hate Speech Detection

Published: arXiv: 2605.31563v1
Authors

Benedetta Muscato, Beiduo Chen, Gizem Gezici, Barbara Plank, Fosca Giannotti

Abstract

Human disagreement is ubiquitous and well-known in labeling. However, variation in explanations, captured through token-level human rationales, remains far less explored. At the same time, it is unclear how to best evaluate human labels and rationales -- or even how to best aggregate rationales beyond majority vote -- in light of this variation. Yet, rationales may provide additional insights into the richness of human reasoning, that may differ in style, values and interpretations -- especially in subjective NLP tasks like hate speech detection. In this work, we unify diverse models, training strategies, loss functions, and existing evaluation metrics under a single protocol by systematically re-implementing them across different label and rationale representation spaces. Classification metrics are organized around two key properties -- predictive and distributional -- while explainability metrics through three complementary dimensions: plausibility, faithfulness, and complexity. In this unified supervision framework, we evaluate model behavior across classification and explainability metrics, as well as metric sensitivity to the choice of label (hard and soft) and rationale representation space (hard, intermediate and soft). Results show that both hard and soft metrics favor softer representations, highlighting their effectiveness in capturing variation and the need to rethink evaluation in subjective NLP.

Paper Summary

Problem
The paper addresses how to evaluate hate speech detection models when human judgments vary, both in the labels annotators assign and in the token-level rationales they give for those labels. Standard metrics such as accuracy and F1-score score predictions against a single aggregated label, so they overlook ambiguous instances and fail to capture the range of human reasoning.
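To make this limitation concrete, here is a minimal Python sketch that is not taken from the paper. It assumes a two-class task and uses Jensen-Shannon divergence as one example of a distribution-aware measure; the source does not say which distributional metrics the paper uses. The annotator distributions and both models' outputs are made-up illustrative numbers.

```python
import math

def argmax(dist):
    """Index of the most probable class (ties go to the lowest index)."""
    return max(range(len(dist)), key=lambda i: dist[i])

def accuracy(pred_dists, human_dists):
    """Share of items where the model's top class matches the human majority class."""
    hits = sum(argmax(p) == argmax(h) for p, h in zip(pred_dists, human_dists))
    return hits / len(human_dists)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint support)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Terms with a zero numerator contribute nothing to the sum.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def mean_jsd(pred_dists, human_dists):
    """Average divergence between predicted and human label distributions."""
    total = sum(js_divergence(p, h) for p, h in zip(pred_dists, human_dists))
    return total / len(human_dists)

# Illustrative data only: each row is [P(hateful), P(not hateful)].
human = [[0.6, 0.4], [1.0, 0.0]]      # item 1 is contested, item 2 is unanimous
model_a = [[0.9, 0.1], [0.9, 0.1]]    # confident everywhere
model_b = [[0.6, 0.4], [0.95, 0.05]]  # mirrors the human split on item 1

print("accuracy:", accuracy(model_a, human), accuracy(model_b, human))
print("mean JSD:", mean_jsd(model_a, human), mean_jsd(model_b, human))
```

Both hypothetical models pick the majority class on every item, so accuracy cannot tell them apart. The divergence is lower for the second model because it reproduces the annotators' split on the contested item.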
Key Innovation
The paper introduces a unified supervision framework that brings diverse models, training strategies, loss functions, and existing evaluation metrics under a single protocol by re-implementing them across different label and rationale representation spaces. Within this framework it evaluates model behavior on classification metrics (predictive and distributional) and explainability metrics (plausibility, faithfulness, and complexity), and it measures how sensitive each metric is to the choice of label representation (hard and soft) and rationale representation (hard, intermediate, and soft). The paper also proposes new label and rationale representation spaces, including intermediate and soft supervision objectives.
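As a rough illustration of what hard and soft representations can look like, the following Python sketch builds both from raw annotations. It is an assumption-laden example, not the paper's implementation: hard targets use majority vote (the aggregation the abstract mentions), soft targets use the proportion of annotators, and the helper names, label names, and sample annotations are all hypothetical. The intermediate rationale space is left out because this summary does not define it.

```python
from collections import Counter

def hard_label(votes):
    """Majority-vote label (on a tie, the label seen first wins)."""
    return Counter(votes).most_common(1)[0][0]

def soft_label(votes, classes):
    """Fraction of annotators who chose each class, in the order of `classes`."""
    counts = Counter(votes)
    return [counts[c] / len(votes) for c in classes]

def soft_rationale(masks):
    """Per-token fraction of annotators who highlighted that token."""
    n = len(masks)
    return [sum(column) / n for column in zip(*masks)]

def hard_rationale(masks):
    """Per-token majority vote: 1 if more than half of the annotators highlighted it."""
    return [1 if share > 0.5 else 0 for share in soft_rationale(masks)]

# Hypothetical annotations for one four-token post from three annotators.
classes = ["hateful", "normal"]
votes = ["hateful", "hateful", "normal"]
masks = [
    [0, 1, 1, 0],  # annotator 1 highlights tokens 1 and 2
    [0, 1, 0, 0],  # annotator 2 highlights token 1
    [0, 0, 1, 1],  # annotator 3 highlights tokens 2 and 3
]

print(hard_label(votes), soft_label(votes, classes))
print(hard_rationale(masks), soft_rationale(masks))
```

The hard versions discard the minority annotator entirely, while the soft versions keep the disagreement as a graded target that a model could be trained or evaluated against.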
Practical Impact
The findings matter for hate speech detection and for other subjective NLP tasks. The results show that both hard and soft metrics favor softer representations, which suggests that soft labels and rationales capture annotator variation more effectively and that evaluation in subjective NLP needs rethinking. Evaluation that accounts for disagreement could help make hate speech detection models more accurate and fairer, which is relevant to online communities and social media platforms.
Analogy / Intuitive Explanation
Imagine you're trying to decide whether a piece of text is hate speech. A traditional model might say, "Yes, it's hate speech because it contains a certain word." A human looking at the same text might say, "Yes, it's hate speech because its tone and language are derogatory," and a second human might point to different words or reach a different verdict altogether. The paper tries to capture these nuances of human judgment and the varying rationales people give for their decisions, rather than collapsing them into a single answer.
Paper Information
Categories: cs.CL
arXiv ID: 2605.31563v1
