ShorterBetter: Guiding Reasoning Models to Find Optimal Inference Length for Efficient Reasoning

Data Science Institute, The University of Chicago
Preprint (In Submission)

Abstract

Recent models such as OpenAI o1 and DeepSeek-R1 have demonstrated strong performance on reasoning-intensive tasks by generating extended Chain-of-Thought (CoT) traces. While longer reasoning helps with thorough exploration of solution paths for complex problems, it also often leads to inefficient and redundant outputs—a phenomenon commonly described as overthinking. In this paper, we propose ShorterBetter, a simple yet effective reinforcement learning method that enables reasoning models to learn their own optimal CoT lengths without manual supervision. We define the Sample Optimal Length (SOL) as the length of the shortest correct response among multiple generations, which serves as a dynamic reward signal to guide the model toward efficient reasoning. Applied to DeepSeek-R1-Distill-Qwen-1.5B/7B as base models, ShorterBetter achieves a 50%-80% reduction in output length on both in-domain and out-of-domain reasoning tasks while maintaining accuracy. Our reasoning-trace analysis shows that ShorterBetter refines the structure of the reasoning traces by reducing unnecessary repetition, excessive self-verification, and over-exploration of alternatives.

Examples of Verbose Repetition and Self-Verification in Intermediate Steps

The following examples demonstrate common patterns of inefficient reasoning, in which models engage in excessive repetition, circular logic, and unnecessary self-verification, producing verbose responses without improving accuracy.
Example 1: Verbose Repetition Without New Insights
Example 2: Early Stopping of Promising Paths
Example 3: Over-Exploration and Extraneous Solution Checks
Example 4: Collapse of Reasoning Coherence Following Very Long Outputs

Overall Methodology

In this work, we aim to guide the reasoning model to think for the right amount of time, which requires knowing its optimal reasoning length (OL) — a prompt-dependent quantity. However, calculating OL in practice is challenging due to the absence of precise measures for both problem difficulty and model capability. To address this, we propose a sample-based approximation of OL (Sample Optimal Length, SOL), yielding a practical length reward. Finally, we integrate this reward design with Group Relative Policy Optimization (GRPO).

Optimal Reasoning Length (OL)

Problem Setup   Let \( \mathcal{M}_\theta \) be our policy model (i.e., a base LRM) with parameters \( \theta \), and \( \mathcal{D} \) be the training dataset. Each problem \((x_i, y^*_i) \in \mathcal{D}\) consists of a prompt \(x_i\) and a reference response \(y^*_i\). Given \(x_i\), for each rollout the model generates a response \(y = (y^1, \ldots, y^t)\) autoregressively from \(p_\theta(\cdot | x_i)\), where \(y^k\) denotes its \(k\)-th token and \(\ell(y) = t\) is its length. For reasoning tasks, \(y\) typically consists of long CoT tokens (often enclosed in <think>...</think> tags) followed by a final answer.

We then define a verifier \(\mathcal{I}\) that assigns a normalized correctness score \(s_\mathcal{I}(y, y^*_i) \in [0, 1]\). For fixed correctness threshold \(c \geq 0\) and probability threshold \(\epsilon \geq 0\), the optimal reasoning length is defined as:

\[ \ell^*_{c,\epsilon}(x_i; \theta) := \min\{\ell(y) : y \in \mathcal{Y}_{c,\epsilon}(x_i; \theta)\}, \quad \text{where} \] \[ \mathcal{Y}_{c,\epsilon}(x_i; \theta) := \{y : p_\theta(y \mid x_i) \geq \epsilon \wedge s_\mathcal{I}(y, y^*_i) \geq c\}. \]

Remark   In our method, we narrow our focus to tasks with verifiable answers, such as mathematical and coding problems. These tasks naturally come with a binary verifier that outputs

\[ \mathbb{I}(y = y^*_i) := s_\mathcal{I}(y, y^*_i) \in \{0, 1\}, \]

and we adopt the trivial threshold choice \(c = 1\), using the shorthand \(\mathbb{I}(y = y^*_i)\) to denote correctness.
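For concreteness, the snippet below is a minimal sketch of such a binary verifier for math answers written as \boxed{...}; it relies on plain string normalization, whereas practical verifiers typically perform symbolic equivalence checking. The helper names are illustrative rather than taken from any released implementation.

```python
import re

def extract_boxed(text: str) -> str | None:
    """Return the argument of the last \\boxed{...} in `text`, or None if absent."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def binary_verifier(response: str, reference_answer: str) -> int:
    """Binary indicator I(y = y*): 1 if the normalized final answers match, else 0."""
    pred = extract_boxed(response)
    if pred is None:
        return 0
    normalize = lambda s: s.replace(" ", "").rstrip(".")
    return int(normalize(pred) == normalize(reference_answer))
```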

Our Method: Sample Optimal Length (SOL)

We introduce the Sample Optimal Length (SOL), a practical, sample-based approximation of OL. Given a prompt \(x_i\), our method generates \(n\) rollouts of candidate responses \(G(x_i) = \{ y_1, y_2, \dots, y_n \}\) from the policy \(p_\theta(\cdot|x_i)\). We then define the SOL for the group \(G(x_i)\) as:

$$ \ell^{SOL}(G(x_i)) = \begin{cases} \displaystyle \min_{y_j \in G(x_i):\,\mathbb{I}(y_j = y^*_i)=1} \ell(y_j), & \text{if at least one response is correct},\\[15pt] \dfrac{1}{n}\sum_{j=1}^{n}\ell(y_j), & \text{otherwise}. \end{cases} $$

Given the SOL, we define the following reward for each response \(y_j \in G(x_i)\):

\begin{equation}\label{eq:sol-reward} r(y_j) = \alpha \cdot \mathbb{I}(y_j = y^*_i) - \beta \cdot \left|\ell(y_j) - \ell^{SOL}(G(x_i))\right|, \end{equation}

where \(\alpha, \beta > 0\) are hyperparameters balancing the correctness reward and the length penalty, respectively.
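To make the reward computation concrete, the sketch below implements the SOL and the reward defined above, together with the standard group-standardized advantages used by GRPO. It is a minimal illustration: the default values of alpha and beta are placeholders rather than our tuned hyperparameters, and the `Rollout` container is introduced here purely for exposition.

```python
from dataclasses import dataclass

@dataclass
class Rollout:
    length: int    # number of generated tokens, ell(y_j)
    correct: bool  # outcome of the binary verifier, I(y_j = y*_i)

def sample_optimal_length(group: list[Rollout]) -> float:
    """SOL: shortest correct length in the group; mean length if no rollout is correct."""
    correct_lengths = [r.length for r in group if r.correct]
    if correct_lengths:
        return float(min(correct_lengths))
    return sum(r.length for r in group) / len(group)

def sol_rewards(group: list[Rollout], alpha: float = 1.0, beta: float = 0.001) -> list[float]:
    """Per-rollout reward r(y_j) = alpha * I(correct) - beta * |ell(y_j) - SOL|."""
    sol = sample_optimal_length(group)
    return [alpha * float(r.correct) - beta * abs(r.length - sol) for r in group]

def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantages: standardize the rewards within the rollout group."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

In training, these advantages take the place of the accuracy-only reward inside the GRPO objective; no external length budget or reference length is required, since the SOL is computed from the rollout group itself.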

Experiment Results

We compare the performance of our trained models against the baselines on a range of in-domain math tasks as well as out-of-domain benchmarks. We then provide an in-depth analysis of how our method reshapes the structure of the reasoning traces.

Baselines   To assess the efficiency of our method, we compare our trained models against four baselines:
  • DeepSeek-R1-Distill-Qwen-1.5B/7B (DeepSeek-AI, 2025): Models obtained by distillation from the Qwen series using larger DeepSeek-R1 models. We refer to these models as Distill-1.5B and Distill-7B in the following sections.
  • Qwen2.5-1.5B/7B-Instruct (Team, 2024): Instruction-tuned models from the Qwen 2.5 family of large language models. We include these non-reasoning models as a contrast, i.e., fast thinking without deliberate reasoning. We denote them as Instruct-1.5B and Instruct-7B hereafter.
  • Training Efficient (Arora and Zanette, 2025): We evaluate the 1.5B/7B open-source models released in this work, which share the same base models as ours.
  • O1-Pruner (Luo et al., 2025a): We evaluate the 7B open-source model proposed in this work, which is fine-tuned from the Marco-o1 base model (Zhao et al., 2024).

Output Length after First Appearance of Correct Answer

(Figure: percentage of output tokens appearing after the first correct answer, ShorterBetter-7B vs. Distill-7B, on the AMC, MATH, Olympiad, and Minerva benchmarks.)
For each model, we calculate what percentage of its total output (in tokens) appears after the first time the correct answer is stated. We observe that the percentage of tokens appearing after the first correct answer decreases by 15%, 20%, 17%, and 19% on the AMC, MATH, Olympiad, and Minerva benchmarks, respectively. This suggests that ShorterBetter-7B reaches conclusions more efficiently, spending fewer tokens on additional reasoning once the correct answer has been found.
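As an illustration of how this metric can be computed, the sketch below scans the decoded output token by token and reports the fraction of tokens emitted after the correct answer first appears. The substring-matching criterion is a simplifying assumption and may differ from the exact matching procedure used in our analysis.

```python
def frac_tokens_after_first_answer(tokens: list[str], answer: str) -> float:
    """Fraction of output tokens generated after the correct answer first appears.

    `tokens` are decoded tokens in generation order; the answer is detected by a
    substring match on the growing prefix of the decoded text.
    """
    prefix = ""
    for i, tok in enumerate(tokens):
        prefix += tok
        if answer in prefix:
            return (len(tokens) - (i + 1)) / len(tokens)
    return 0.0  # the correct answer never appears in the output
```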

Breakdown of Reasoning Traces

(Figure: token-level breakdown of reasoning traces for Distill-7B (avg. 2337 tokens) and ShorterBetter-7B (avg. 804 tokens) across five categories: Pivotal Reasoning, Productive Elaboration & Calculation, Exploring Alternatives, Verification & Self-Correction, and Non-Substantive Statement.)
This analysis reveals how ShorterBetter-7B achieves more efficient reasoning than Distill-7B. Using an LLM-as-a-judge framework, we categorize each sentence in the reasoning traces and compute token-level proportions across five functional categories (a minimal sketch of this token-level accounting is given below). The results demonstrate significant structural improvements in reasoning efficiency.
  • +15.6% Pivotal Reasoning and +6.8% Productive Elaboration & Calculation indicate a denser concentration of meaningful, solution-oriented reasoning in ShorterBetter-7B.
  • -4.8% Non-Substantive Statements suggests the model produces fewer filler or redundant tokens, leading to more concise outputs.
  • Reductions in Verification & Self-Correction (-14.1%) and Exploring Alternatives (-3.9%) imply more decisive reasoning with fewer unnecessary detours.
These structural shifts support the conclusion that ShorterBetter promotes more confident and streamlined reasoning trajectories, potentially mitigating the early stopping of promising paths illustrated in Example 2.
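The sketch below outlines the token-level accounting behind this breakdown. The `judge` and `count_tokens` callables stand in for the LLM-as-a-judge prompt and the model tokenizer; both are assumptions of this illustration rather than the exact pipeline used in the analysis.

```python
from collections import Counter
from typing import Callable

CATEGORIES = [
    "Pivotal Reasoning",
    "Productive Elaboration & Calculation",
    "Exploring Alternatives",
    "Verification & Self-Correction",
    "Non-Substantive Statement",
]

def category_token_shares(
    sentences: list[str],
    judge: Callable[[str], str],         # LLM-as-a-judge: sentence -> category label
    count_tokens: Callable[[str], int],  # tokenizer length function
) -> dict[str, float]:
    """Token-weighted share of each functional category within one reasoning trace."""
    counts: Counter = Counter()
    for sentence in sentences:
        label = judge(sentence)
        if label in CATEGORIES:
            counts[label] += count_tokens(sentence)
    total = sum(counts.values()) or 1
    return {c: counts[c] / total for c in CATEGORIES}
```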

BibTeX

@article{yi2025shorterbetter,
  title={ShorterBetter: Guiding reasoning models to find optimal inference length for efficient reasoning},
  author={Yi, Jingyang and Wang, Jiazheng and Li, Sida},
  journal={arXiv preprint arXiv:2504.21370},
  year={2025}
}