Death of the Novel(ty): Beyond n-Gram Novelty as a Metric for Textual Creativity

Generative AI & LLMs
Published: arXiv: 2509.22641v1
Authors

Arkadiy Saakyan, Najoung Kim, Smaranda Muresan, Tuhin Chakrabarty

Abstract

N-gram novelty is widely used to evaluate language models' ability to generate text outside of their training data. More recently, it has also been adopted as a metric for measuring textual creativity. However, theoretical work on creativity suggests that this approach may be inadequate, as it does not account for creativity's dual nature: novelty (how original the text is) and appropriateness (how sensical and pragmatic it is). We investigate the relationship between this notion of creativity and n-gram novelty through 7542 expert writer annotations (n=26) of novelty, pragmaticality, and sensicality via close reading of human and AI-generated text. We find that while n-gram novelty is positively associated with expert writer-judged creativity, ~91% of top-quartile expressions by n-gram novelty are not judged as creative, cautioning against relying on n-gram novelty alone. Furthermore, unlike human-written text, higher n-gram novelty in open-source LLMs correlates with lower pragmaticality. In an exploratory study with frontier closed-source models, we additionally confirm that they are less likely to produce creative expressions than humans. Using our dataset, we test whether zero-shot, few-shot, and finetuned models are able to identify creative expressions (a positive aspect of writing) and non-pragmatic ones (a negative aspect). Overall, frontier LLMs exhibit performance much higher than random but leave room for improvement, especially struggling to identify non-pragmatic expressions. We further find that LLM-as-a-Judge novelty scores from the best-performing model were predictive of expert writer preferences.

Paper Summary

Problem
The paper addresses the limits of n-gram novelty as a metric for the creativity of text generated by large language models (LLMs). N-gram novelty is widely used, but it captures only how original a text is, not whether the text is sensical and pragmatic (the appropriateness half of creativity). A text can therefore score as highly novel without being creative.
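To make the metric concrete, here is a minimal sketch of one common way to compute n-gram novelty: the fraction of a text's n-grams that never appear in a reference corpus. The function names, whitespace tokenization, and toy corpus are assumptions made for illustration; the paper's exact computation is not described in this summary.

```python
from typing import Iterable, List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams of a token list, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_novelty(text: str, reference_corpus: Iterable[str], n: int = 3) -> float:
    """Fraction of the text's n-grams that never occur in the reference corpus.

    Assumption: lowercased whitespace tokenization, chosen for simplicity.
    """
    seen: Set[Tuple[str, ...]] = set()
    for doc in reference_corpus:
        seen.update(ngrams(doc.lower().split(), n))

    candidate = ngrams(text.lower().split(), n)
    if not candidate:
        return 0.0  # text shorter than n tokens: nothing to score
    unseen = sum(1 for gram in candidate if gram not in seen)
    return unseen / len(candidate)


# Toy reference corpus (hypothetical, for illustration only).
corpus = ["the cat sat on the mat", "the dog sat on the rug"]

fluent = ngram_novelty("the cat sat on the rug", corpus, n=3)
scrambled = ngram_novelty("rug the on sat cat the", corpus, n=3)
print(fluent, scrambled)  # 0.0 1.0
```

The scrambled sentence gets the maximum score even though it is nonsensical, which is the failure mode described above: the metric rewards unseen word sequences regardless of whether they make sense.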
Key Innovation
The key innovation is an operationalization of textual creativity that goes beyond n-gram novelty. In a close reading study of human and AI-generated text, 26 expert writers provided 7542 annotations of novelty, pragmaticality, and sensicality. This allows creativity to be evaluated on both originality and appropriateness.
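As a rough illustration of how such annotations can be set against n-gram novelty, the sketch below treats an expression as creative only when it is judged novel, sensical, and pragmatic, then measures what share of the top quartile by n-gram novelty fails that test. The `Expression` fields, the all-three-labels rule, and the toy data are assumptions for illustration, not the paper's dataset schema or analysis code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Expression:
    ngram_novelty: float  # automatic score, higher means more novel
    novel: bool           # hypothetical expert label
    sensical: bool        # hypothetical expert label
    pragmatic: bool       # hypothetical expert label


def is_creative(e: Expression) -> bool:
    """Assumed rule: creative means novel AND appropriate (sensical and pragmatic)."""
    return e.novel and e.sensical and e.pragmatic


def top_quartile_not_creative(expressions: List[Expression]) -> float:
    """Share of the top quartile by n-gram novelty that is not judged creative."""
    if not expressions:
        raise ValueError("need at least one expression")
    ranked = sorted(expressions, key=lambda e: e.ngram_novelty, reverse=True)
    k = max(1, len(ranked) // 4)  # size of the top quartile
    top = ranked[:k]
    return sum(not is_creative(e) for e in top) / len(top)


# Toy data (made up): eight expressions, so the top quartile holds two.
toy = [
    Expression(0.95, True, False, False),   # novel but nonsensical
    Expression(0.90, True, True, True),     # creative
    Expression(0.70, False, True, True),
    Expression(0.60, False, True, True),
    Expression(0.50, True, True, False),
    Expression(0.40, False, True, True),
    Expression(0.30, False, True, True),
    Expression(0.10, False, True, True),
]

print(top_quartile_not_creative(toy))  # 0.5
```

On real annotations, the same calculation would yield a figure comparable to the ~91% reported in the abstract, assuming the labels are combined in this way.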
Practical Impact
The findings bear on how LLMs are developed and evaluated. A measure of creativity that accounts for appropriateness can expose limitations of current LLMs, such as the finding that higher n-gram novelty in open-source LLMs correlates with lower pragmaticality, and it cautions against relying on n-gram novelty alone. This matters for applications such as writing assistance tools, content generation, and creative writing.
Analogy / Intuitive Explanation
Imagine you're a chef trying to create a new recipe. Coming up with something entirely new and original (novelty) is not enough if the dish is inedible or impractical (a failure of appropriateness). A creative recipe balances novelty with practicality and sensibility. Similarly, a language model should generate text that is not only original but also makes sense and fits its context.
Paper Information
Categories: cs.CL, cs.AI, cs.HC
arXiv ID: 2509.22641v1
