PokeeResearch: Effective Deep Research via Reinforcement Learning from AI Feedback and Robust Reasoning Scaffold

Agentic AI
Published on arXiv: 2510.15862v1
Authors

Yi Wan, Jiuqi Wang, Liam Li, Jinsong Liu, Ruihao Zhu, Zheqing Zhu

Abstract

Tool-augmented large language models (LLMs) are emerging as deep research agents, systems that decompose complex queries, retrieve external evidence, and synthesize grounded responses. Yet current agents remain limited by shallow retrieval, weak alignment metrics, and brittle tool-use behavior. We introduce PokeeResearch-7B, a 7B-parameter deep research agent built under a unified reinforcement learning framework for robustness, alignment, and scalability. PokeeResearch-7B is trained by an annotation-free Reinforcement Learning from AI Feedback (RLAIF) framework to optimize policies using LLM-based reward signals that capture factual accuracy, citation faithfulness, and instruction adherence. A chain-of-thought-driven multi-call reasoning scaffold further enhances robustness through self-verification and adaptive recovery from tool failures. Among 10 popular deep research benchmarks, PokeeResearch-7B achieves state-of-the-art performance among 7B-scale deep research agents. This highlights that careful reinforcement learning and reasoning design can produce efficient, resilient, and research-grade AI agents. The model and inference code are open-sourced under the MIT license at https://github.com/Pokee-AI/PokeeResearchOSS.

Paper Summary

Problem
Deep research agents, which are tool-augmented large language models, are being used to help with complex research tasks. However, current agents have several limitations:
* Shallow retrieval of information
* Weak alignment with human judgments of usefulness and factual grounding
* Brittle tool-use behavior, where small errors can cause the entire process to fail
Key Innovation
The researchers introduce PokeeResearch-7B, a 7B-parameter deep research agent that addresses these limitations. PokeeResearch-7B is trained with a reinforcement learning framework that optimizes policies for robustness, alignment, and scalability. The framework uses an external LLM as an impartial judge to assess the semantic correctness of the agent's answers, producing reward signals for factual accuracy, citation faithfulness, and adherence to instructions without any human annotation.
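The LLM-as-judge reward described above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: `judge` stands in for an external LLM call that scores a criterion in [0, 1], and the equal weighting of the three criteria is an assumption.

```python
def rlaif_reward(question, answer, evidence, judge):
    """Combine LLM-judge scores for the three reward signals the paper
    names: factual accuracy, citation faithfulness, instruction adherence.
    `judge` is a hypothetical callable mapping a prompt to a score in [0, 1]."""
    criteria = {
        "factual_accuracy": (
            f"Is this answer factually correct?\nQ: {question}\nA: {answer}"
        ),
        "citation_faithfulness": (
            f"Is every claim supported by the cited evidence?\n"
            f"Evidence: {evidence}\nA: {answer}"
        ),
        "instruction_adherence": (
            f"Does the answer follow the question's instructions?\n"
            f"Q: {question}\nA: {answer}"
        ),
    }
    scores = {name: judge(prompt) for name, prompt in criteria.items()}
    # Equal weighting is an assumption; the summary does not specify weights.
    return sum(scores.values()) / len(scores), scores

# Toy judge that always returns 1.0, just to show the call shape.
reward, breakdown = rlaif_reward("Q?", "A.", "E.", judge=lambda prompt: 1.0)
```

Because the judge reads the answer semantically rather than string-matching a gold label, this style of reward needs no human annotation, which is the "annotation-free" property the framework claims.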
Practical Impact
PokeeResearch-7B has the potential to significantly improve the reliability and effectiveness of deep research agents. By achieving state-of-the-art performance among 7B-scale agents on 10 popular deep research benchmarks, it demonstrates that careful reinforcement learning and reasoning design can produce efficient, resilient, and research-grade AI agents. This could benefit fields such as scientific research, where accurate and reliable information is crucial.
Analogy / Intuitive Explanation
Imagine a researcher trying to answer a complex question, such as "What are the effects of climate change on global food production?" A deep research agent like PokeeResearch-7B would help by:
* Breaking down the question into smaller sub-questions
* Retrieving relevant information from various sources
* Synthesizing the information into a coherent answer
* Verifying the answer to ensure its accuracy and faithfulness to the instructions
In this analogy, PokeeResearch-7B is like a highly skilled research assistant that finds accurate and reliable information, freeing researchers to focus on higher-level tasks.
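The decompose-retrieve-synthesize-verify loop above, including the adaptive recovery from tool failures that the paper highlights, can be sketched as a simple control flow. Everything here is illustrative: `search`, `synthesize`, and `verify` are hypothetical stand-ins for tool and LLM calls, and the single-pass decomposition is a deliberate simplification.

```python
def research(question, search, synthesize, verify, max_retries=2):
    """Toy multi-call research scaffold: decompose, retrieve with retry,
    synthesize, then self-verify with one revision pass."""
    sub_questions = [question]  # decomposition stub: one sub-question
    evidence = []
    for sq in sub_questions:
        for attempt in range(max_retries + 1):
            try:
                evidence.append(search(sq))  # tool call that may fail
                break
            except RuntimeError:
                if attempt == max_retries:
                    evidence.append("")  # recover: proceed without this source
    answer = synthesize(question, evidence)
    if not verify(question, answer, evidence):  # self-verification step
        answer = synthesize(question, evidence)  # single revision pass
    return answer

# Toy run with stub tools, showing the control flow only.
ans = research(
    "Q?",
    search=lambda q: "doc",
    synthesize=lambda q, ev: f"answer from {len(ev)} sources",
    verify=lambda q, a, ev: True,
)
```

The point of the scaffold is that a failed tool call degrades one retrieval step instead of aborting the whole episode, and the verification step gives the agent a chance to catch its own unsupported claims before answering.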
Paper Information
Categories: cs.AI
arXiv ID: 2510.15862v1