Controllable Reasoning Models Are Private Thinkers

Generative AI & LLMs
Published: arXiv:2602.24210v1
Authors

Haritz Puerto, Haonan Li, Xudong Han, Timothy Baldwin, Iryna Gurevych

Abstract

AI agents powered by reasoning models require access to sensitive user data. However, their reasoning traces are difficult to control, which can result in the unintended leakage of private information to external parties. We propose training models to follow instructions not only in the final answer, but also in reasoning traces, potentially under different constraints. We hypothesize that improving their instruction following abilities in the reasoning traces can improve their privacy-preservation skills. To demonstrate this, we fine-tune models on a new instruction-following dataset with explicit restrictions on reasoning traces. We further introduce a generation strategy that decouples reasoning and answer generation using separate LoRA adapters. We evaluate our approach on six models from two model families, ranging from 1.7B to 14B parameters, across two instruction-following benchmarks and two privacy benchmarks. Our method yields substantial improvements, achieving gains of up to 20.9 points in instruction-following performance and up to 51.9 percentage points on privacy benchmarks. These improvements, however, can come at the cost of task utility, due to the trade-off between reasoning performance and instruction-following abilities. Overall, our results show that improving instruction-following behavior in reasoning models can significantly enhance privacy, suggesting a promising direction for the development of future privacy-aware agents. Our code and data are available at https://github.com/UKPLab/arxiv2026-controllable-reasoning-models

Paper Summary

Problem
Large language models (LLMs) are increasingly deployed as agents that solve tasks on a user's behalf, but they have a major weakness: they often reveal sensitive user information in their "reasoning traces", the intermediate steps they generate on the way to an answer. This can happen even when the user intends to share only a small amount of information, making it difficult to keep their data private.
Key Innovation
The researchers propose a new way to train LLMs to follow instructions not only in their final answers, but also in their reasoning traces. They create a new training dataset that includes instructions on how to conduct the reasoning process in a way that preserves user privacy. They also introduce a generation strategy called Staged Decoding, which separates the generation of reasoning traces and final answers using specialized LoRA adapters.
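The two-stage generation described above can be illustrated with a toy sketch: decode the reasoning trace under one "adapter", stop at the end-of-thinking delimiter, then switch to a second "adapter" to produce the final answer. This is a minimal illustration of the control flow only; the stub functions stand in for LoRA-adapted forward passes, and the function names and `</think>` delimiter are assumptions for illustration, not the paper's released implementation.

```python
# Toy sketch of Staged Decoding. Two stub "adapters" stand in for
# next-token generation with different LoRA adapters active on the
# same base model. All names here are illustrative.

def reasoning_adapter(context):
    # Stand-in for decoding with the reasoning-trace LoRA active.
    return ["<think>", "omit", "private", "user", "details", "</think>"]

def answer_adapter(context):
    # Stand-in for decoding with the answer LoRA active.
    return ["The", "answer", "is", "42."]

def staged_decode(prompt):
    # Stage 1: generate the reasoning trace until the
    # end-of-thinking delimiter is emitted.
    trace = []
    for token in reasoning_adapter(prompt):
        trace.append(token)
        if token == "</think>":
            break
    # Stage 2: switch adapters and condition the answer
    # generation on the prompt plus the completed trace.
    answer = answer_adapter(prompt + " " + " ".join(trace))
    return " ".join(trace), " ".join(answer)

trace, answer = staged_decode("What is six times seven?")
```

Decoupling the two stages this way is what lets each adapter specialize: one can be trained to follow privacy constraints inside the trace while the other focuses on answer quality.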
Practical Impact
This research has the potential to significantly enhance the privacy of LLMs, making them more suitable for use in applications where user data needs to be protected. By improving the controllability of reasoning models, we can create more privacy-aware agents that can help prevent the unintended leakage of sensitive information.
Analogy / Intuitive Explanation
Think of a reasoning model as a chef thinking out loud while baking from a secret family recipe. The finished cake (the final answer) gives little away, but a chef who narrates every step (the reasoning trace) might recite the exact ingredients and quantities for anyone in the kitchen to hear. Training the model to follow instructions in its reasoning trace is like teaching the chef to keep those details to themselves and share only what the customer actually needs, such as the type of cake and the number of servings.
Paper Information
Categories:
cs.CL cs.AI
arXiv ID:

2602.24210v1
