BibTeX Citation Hallucinations in Scientific Publishing Agents: Evaluation and Mitigation

Generative AI & LLMs
Published on arXiv: 2604.03159v1
Authors

Delip Rao, Chris Callison-Burch

Abstract

Large language models with web search are increasingly used in scientific publishing agents, yet they still produce BibTeX entries with pervasive field-level errors. Prior evaluations tested base models without search, which does not reflect current practice. We construct a benchmark of 931 papers across four scientific domains and three citation tiers -- popular, low-citation, and recent post-cutoff -- designed to disentangle parametric memory from search dependence, with version-aware ground truth accounting for multiple citable versions of the same paper. Three search-enabled frontier models (GPT-5, Claude Sonnet-4.6, Gemini-3 Flash) generate BibTeX entries scored on nine fields and a six-way error taxonomy, producing ~23,000 field-level observations. Overall accuracy is 83.6%, but only 50.9% of entries are fully correct; accuracy drops 27.7pp from popular to recent papers, revealing heavy reliance on parametric memory even when search is available. Field-error co-occurrence analysis identifies two failure modes: wholesale entry substitution, in which identity fields fail together, and isolated field errors. We evaluate clibib, an open-source tool for deterministic BibTeX retrieval from the Zotero Translation Server with CrossRef fallback, as a mitigation mechanism. In a two-stage integration where baseline entries are revised against authoritative records, accuracy rises +8.0pp to 91.5%, fully correct entries rise from 50.9% to 78.3%, and the regression rate is only 0.8%. An ablation comparing single-stage and two-stage integration shows that separating search from revision yields larger gains and lower regression (0.8% vs. 4.8%), demonstrating that integration architecture matters independently of model capability. We release the benchmark, error taxonomy, and clibib tool to support evaluation and mitigation of citation hallucinations in LLM-based scientific writing.

Paper Summary

Problem
Scientific publishing agents increasingly rely on large language models with web search to generate BibTeX entries. Even with search enabled, these models produce pervasive field-level errors that propagate into manuscripts, reference lists, and downstream citation analyses.
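The field-level evaluation underlying the paper's numbers can be pictured as a simple per-field comparison against ground truth. The sketch below is illustrative: the nine-field set and the normalization rule are assumptions, not the paper's exact scoring protocol.

```python
# Sketch of field-level BibTeX scoring. The field list below is an
# assumption standing in for the paper's nine scored fields.
FIELDS = ["title", "author", "year", "venue", "volume",
          "number", "pages", "doi", "url"]

def normalize(value):
    """Case-fold and collapse whitespace so cosmetic differences don't count as errors."""
    return " ".join(value.lower().split()) if value else ""

def score_entry(generated, ground_truth):
    """Return per-field correctness and whether the entry is fully correct."""
    per_field = {
        f: normalize(generated.get(f)) == normalize(ground_truth.get(f))
        for f in FIELDS
    }
    return per_field, all(per_field.values())

def corpus_stats(pairs):
    """Aggregate field accuracy and fully-correct rate over (generated, truth) pairs."""
    field_hits, fully_correct = 0, 0
    for gen, truth in pairs:
        per_field, ok = score_entry(gen, truth)
        field_hits += sum(per_field.values())
        fully_correct += ok
    n = len(pairs)
    return field_hits / (n * len(FIELDS)), fully_correct / n
```

This distinction between field accuracy and the fully-correct rate is why the headline numbers diverge: an entry with eight of nine fields right scores well per field yet still fails the stricter entry-level criterion.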
Key Innovation
This paper addresses BibTeX citation hallucinations in scientific publishing agents by constructing a benchmark of 931 papers across four scientific domains and three citation tiers. The authors evaluate three search-enabled frontier models (GPT-5, Claude Sonnet-4.6, Gemini-3 Flash) and find that, even with search available, only about half of generated entries are fully correct, with accuracy falling sharply on recent post-cutoff papers. The paper also introduces clibib, an open-source deterministic retrieval tool that fetches authoritative BibTeX from the Zotero Translation Server with a CrossRef fallback, and demonstrates its effectiveness in mitigating these errors.
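The retrieval pattern that clibib embodies -- try a primary authoritative source, fall back to a secondary one -- can be sketched as a source-agnostic fallback chain. The function below is a hypothetical illustration of that pattern, not clibib's actual API; in the paper's setup the fetchers would correspond to the Zotero Translation Server and CrossRef.

```python
# Sketch of deterministic retrieval with fallback. `fetchers` is an
# ordered list of callables mapping an identifier (e.g. a DOI or URL)
# to a BibTeX string, or None on a miss. Names here are hypothetical.

def fetch_bibtex(identifier, fetchers):
    """Try each fetcher in order; return the first non-empty BibTeX record."""
    for fetch in fetchers:
        try:
            record = fetch(identifier)
        except Exception:
            record = None  # treat network or parse errors as a miss
        if record:
            return record
    raise LookupError(f"no authoritative record found for {identifier!r}")
```

Because each fetcher returns a record copied from an authoritative catalog rather than generated text, the output is deterministic: the same identifier always yields the same entry, which is what makes this usable as a verification step.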
Practical Impact
The benchmark and evaluation framework give practitioners a concrete way to measure the field-level accuracy of BibTeX entries generated by search-enabled models, and the clibib tool can be integrated into publishing workflows as a deterministic verification step. The ablation result is especially actionable: separating retrieval from revision in a two-stage design yields larger accuracy gains and far fewer regressions than single-stage integration, so how a mitigation tool is wired into the workflow matters as much as which model generates the entries. Together these reduce errors in manuscripts, reference lists, and downstream citation analyses.
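The second stage of the two-stage integration -- revising a baseline LLM entry against an authoritative record, while watching for regressions -- can be sketched as below. The field-merge policy and helper names are assumptions for illustration, not the paper's exact procedure.

```python
# Sketch of the revision stage of a two-stage integration: prefer
# authoritative field values, keep baseline values where the record is
# silent, and count "regressions" (fields the baseline had right that
# revision made wrong). Helper names are hypothetical.

def revise_entry(baseline, authoritative):
    """Prefer authoritative values; keep baseline values where the record is silent."""
    revised = dict(baseline)
    for field, value in authoritative.items():
        if value:
            revised[field] = value
    return revised

def count_regressions(baseline, revised, truth):
    """Fields that were correct in the baseline but wrong after revision."""
    return [
        f for f in truth
        if baseline.get(f) == truth[f] and revised.get(f) != truth[f]
    ]
```

The regression counter captures the risk the paper's 0.8% figure quantifies: a revision step only helps if it rarely overwrites fields the model already had right.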
Analogy / Intuitive Explanation
Imagine a researcher asking a librarian for help compiling a reference list. The librarian usually finds the right books but sometimes gets a title, author, or publication date wrong, and the researcher must catch and fix those mistakes before the list can be trusted. BibTeX citation hallucinations work the same way: the language model produces entries with subtle field errors that would otherwise propagate into the manuscript, reference list, and downstream citation analyses. The clibib tool acts as a second librarian who checks the first one's work against the authoritative catalog and corrects any mistakes, ensuring the final list of references is accurate.
Paper Information
Categories: cs.DL cs.CL
arXiv ID: 2604.03159v1