PiCSAR: Probabilistic Confidence Selection And Ranking

Generative AI & LLMs
Published as arXiv preprint 2508.21787v1
Authors

Joshua Ong Jun Leang, Zheng Zhao, Aryo Pradipta Gema, Sohee Yang, Wai-Chung Kwan, Xuanli He, Wenda Li, Pasquale Minervini, Eleonora Giunchiglia, Shay B. Cohen

Abstract

Best-of-n sampling improves the accuracy of large language models (LLMs) and large reasoning models (LRMs) by generating multiple candidate solutions and selecting the one with the highest reward. The key challenge for reasoning tasks is designing a scoring function that can identify correct reasoning chains without access to ground-truth answers. We propose Probabilistic Confidence Selection And Ranking (PiCSAR): a simple, training-free method that scores each candidate generation using the joint log-likelihood of the reasoning and final answer. The joint log-likelihood of the reasoning and final answer naturally decomposes into reasoning confidence and answer confidence. PiCSAR achieves substantial gains across diverse benchmarks (+10.18 on MATH500, +9.81 on AIME2025), outperforming baselines with at least 2x fewer samples in 16 out of 20 comparisons. Our analysis reveals that correct reasoning chains exhibit significantly higher reasoning and answer confidence, justifying the effectiveness of PiCSAR.
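In symbols, writing x for the prompt, r for the sampled reasoning chain, and a for the final answer, the decomposition described above can be sketched as follows (notation is illustrative; the paper may additionally normalize or weight the two terms):

\[
\log p_\theta(r, a \mid x) \;=\; \underbrace{\log p_\theta(r \mid x)}_{\text{reasoning confidence}} \;+\; \underbrace{\log p_\theta(a \mid x, r)}_{\text{answer confidence}}
\]

Each candidate in the best-of-n pool is ranked by this joint score, and the highest-scoring candidate is returned.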

Paper Summary

Problem
Large language models (LLMs) and large reasoning models (LRMs) are powerful tools for solving complex problems, yet single-sample decoding strategies such as greedy decoding often fall short of state-of-the-art accuracy on hard reasoning benchmarks. Best-of-n sampling can close this gap by generating multiple candidate solutions, but it raises the central challenge the paper addresses: designing a scoring function that can identify correct reasoning chains without access to ground-truth answers.
Key Innovation
The researchers propose Probabilistic Confidence Selection And Ranking (PiCSAR), a simple, training-free method that scores each sampled candidate by the joint log-likelihood of its reasoning chain and final answer and selects the highest-scoring one. This score naturally decomposes into a reasoning-confidence term and an answer-confidence term, and it yields substantial gains across diverse benchmarks.
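As a rough illustration of this selection step, the sketch below assumes each candidate already comes with per-token log-probabilities split into a reasoning span and an answer span, averages the log-likelihood over each span, and keeps the candidate with the highest joint score. The Candidate container, the mean-per-token normalization, and the equal weighting of the two terms are assumptions for illustration, not the paper's exact formulation.

from dataclasses import dataclass
from typing import List

@dataclass
class Candidate:
    text: str
    reasoning_logprobs: List[float]  # per-token log p for the reasoning chain
    answer_logprobs: List[float]     # per-token log p for the final answer

def reasoning_confidence(c: Candidate) -> float:
    # Mean per-token log-likelihood of the reasoning chain (illustrative normalization).
    return sum(c.reasoning_logprobs) / max(len(c.reasoning_logprobs), 1)

def answer_confidence(c: Candidate) -> float:
    # Mean per-token log-likelihood of the final answer given the reasoning.
    return sum(c.answer_logprobs) / max(len(c.answer_logprobs), 1)

def joint_confidence(c: Candidate) -> float:
    # Joint score = reasoning confidence + answer confidence.
    return reasoning_confidence(c) + answer_confidence(c)

def select_best(candidates: List[Candidate]) -> Candidate:
    # Best-of-n: return the candidate with the highest joint confidence.
    return max(candidates, key=joint_confidence)

if __name__ == "__main__":
    candidates = [
        Candidate("reasoning A ... answer: 42", [-0.2, -0.4, -0.3], [-0.1, -0.2]),
        Candidate("reasoning B ... answer: 41", [-0.9, -1.1, -0.7], [-0.8, -0.6]),
    ]
    print(select_best(candidates).text)

In practice the per-token log-probabilities would come from the same model that generated the candidates, and the split between reasoning and answer spans would follow the model's output format (for example, a delimiter before the final answer).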
Practical Impact
PiCSAR can improve the accuracy of LLMs and LRMs on complex reasoning tasks without requiring ground-truth answers, and in the paper's experiments it outperforms baselines with at least 2x fewer samples in 16 out of 20 comparisons. Because it is training-free, it can be applied to existing models as a drop-in selection step for best-of-n sampling, with no additional training data or fine-tuning.
Analogy / Intuitive Explanation
Imagine you're trying to solve a complex puzzle and have several candidate solutions in front of you. Simpler selection strategies are like glancing at each final answer and picking the one that looks most plausible on its own. PiCSAR instead examines each complete attempt, asking how confidently the model produced both the working and the answer it leads to. By weighing the reasoning and the final answer together, PiCSAR identifies the correct solution more reliably.
Paper Information
Categories: cs.CL, cs.AI
arXiv ID: 2508.21787v1