ShorterBetter: Guiding Reasoning Models to Find Optimal Inference Length for Efficient Reasoning

Data Science Institute, The University of Chicago
Preprint (In Submission)

Abstract

Recent models such as OpenAI o1 and DeepSeek-R1 have demonstrated strong performance on reasoning-intensive tasks by generating extended Chain-of-Thought (CoT) traces. While longer reasoning helps with thorough exploration of solution paths for complex problems, it also often leads to inefficient and redundant outputs—a phenomenon commonly described as overthinking. In this paper, we propose ShorterBetter, a simple yet effective reinforcement learning method that enables reasoning models to learn their own optimal CoT lengths without manual supervision. We define the Sample Optimal Length (SOL) as the length of the shortest correct response among multiple generations, which serves as a dynamic reward signal to guide the model toward efficient reasoning. Applied to DeepSeek-R1-Distill-Qwen-1.5B/7B as base models, ShorterBetter achieves a 50%-80% reduction in output length on both in-domain and out-of-domain reasoning tasks while maintaining accuracy. Our reasoning-trace analysis shows that ShorterBetter refines the structure of the reasoning traces by reducing unnecessary repetition, excessive self-verification, and over-exploration of alternatives.

Examples of Verbose Repetition and Self-Verification in Intermediate Steps

The following examples demonstrate common patterns of inefficient reasoning, in which models engage in excessive repetition, circular logic, and unnecessary self-verification, producing verbose responses without improving accuracy.
Example 1: Verbose Repetition Without New Insights
Example 2: Early Stopping of Promising Paths
Example 3: Over-Exploration and Extraneous Solution Checks
Example 4: Collapse of Reasoning Coherence Following Very Long Outputs

Overall Methodology

In this work, we aim to guide the reasoning model to think for the right amount of time, which requires knowing its optimal reasoning length (OL) — a prompt-dependent quantity. However, calculating OL in practice is challenging due to the absence of precise measures for both problem difficulty and model capability. To address this, we propose a sample-based approximation of OL (Sample Optimal Length, SOL), yielding a practical length reward. Finally, we integrate this reward design with Group Relative Policy Optimization (GRPO).

Optimal Reasoning Length (OL)

Problem Setup   Let \( \mathcal{M}_\theta \) be our policy model (i.e., a base LRM) with parameters \( \theta \), and \( \mathcal{D} \) be the training dataset. Each problem \((x_i, y^*_i) \in \mathcal{D}\) consists of a prompt \(x_i\) and a reference response \(y^*_i\). Given \(x_i\), for each rollout the model generates a response \(y = (y^1, \ldots, y^t)\) autoregressively from \(p_\theta(\cdot | x_i)\), where \(y^k\) denotes its \(k\)-th token and \(\ell(y) = t\) is its length. For reasoning tasks, \(y\) typically consists of long CoT tokens (often enclosed in <think>...</think> tags) followed by a final answer.

We then define a verifier \(\mathcal{I}\) that assigns a normalized correctness score \(s_\mathcal{I}(y, y^*_i) \in [0, 1]\). For fixed correctness threshold \(c \geq 0\) and probability threshold \(\epsilon \geq 0\), the optimal reasoning length is defined as:

\[ \ell^*_{c,\epsilon}(x_i; \theta) := \min\{\ell(y) : y \in \mathcal{Y}_{c,\epsilon}(x_i; \theta)\}, \quad \text{where} \] \[ \mathcal{Y}_{c,\epsilon}(x_i; \theta) := \{y : p_\theta(y \mid x_i) \geq \epsilon \wedge s_\mathcal{I}(y, y^*_i) \geq c\}. \]

Remark   In our method, we narrow our focus to tasks with verifiable answers, such as mathematical and coding problems. These tasks naturally come with a binary verifier that outputs

\[ \mathbb{I}(y = y^*_i) := s_\mathcal{I}(y, y^*_i) \in \{0, 1\}, \]

and we adopt the trivial threshold choice \(c = 1\), using the shorthand \(\mathbb{I}(y = y^*_i)\) to denote correctness.
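For concreteness, the snippet below is a minimal sketch of such a binary verifier for math answers written as \boxed{...}; it relies on plain string normalization, whereas practical verifiers typically perform symbolic equivalence checking. The helper names are illustrative rather than taken from any released implementation.

```python
import re

def extract_boxed(text: str) -> str | None:
    """Return the argument of the last \\boxed{...} in `text`, or None if absent."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def binary_verifier(response: str, reference_answer: str) -> int:
    """Binary indicator I(y = y*): 1 if the normalized final answers match, else 0."""
    pred = extract_boxed(response)
    if pred is None:
        return 0
    normalize = lambda s: s.replace(" ", "").rstrip(".")
    return int(normalize(pred) == normalize(reference_answer))
```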

Our Method: Sample Optimal Length (SOL)

We introduce the Sample Optimal Length (SOL), a practical, sample-based approximation of OL. Given a prompt \(x_i\), our method generates \(n\) rollouts of candidate responses \(G(x_i) = \{ y_1, y_2, \dots, y_n \}\) from the policy \(p_\theta(\cdot|x_i)\). We then define the SOL for the group \(G(x_i)\) as:

$$ \ell^{SOL}(G(x_i)) = \begin{cases} \displaystyle \min_{y_j \in G(x_i):\,\mathbb{I}(y_j = y^*_i)=1} \ell(y_j), & \text{if at least one response is correct},\\[15pt] \dfrac{1}{n}\sum_{j=1}^{n}\ell(y_j), & \text{otherwise}. \end{cases} $$

Given the SOL, we define the following reward for each response \(y_j \in G(x_i)\):

\begin{equation}\label{eq:sol-reward} r(y_j) = \alpha \cdot \mathbb{I}(y_j = y^*_i) - \beta \cdot \left|\ell(y_j) - \ell^{SOL}(G(x_i))\right|, \end{equation}

where \(\alpha, \beta > 0\) are hyperparameters balancing the correctness reward and the length penalty, respectively.
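To make the reward computation concrete, the sketch below implements the SOL and the reward defined above, together with the standard group-standardized advantages used by GRPO. It is a minimal illustration: the default values of alpha and beta are placeholders rather than our tuned hyperparameters, and the `Rollout` container is introduced here purely for exposition.

```python
from dataclasses import dataclass

@dataclass
class Rollout:
    length: int    # number of generated tokens, ell(y_j)
    correct: bool  # outcome of the binary verifier, I(y_j = y*_i)

def sample_optimal_length(group: list[Rollout]) -> float:
    """SOL: shortest correct length in the group; mean length if no rollout is correct."""
    correct_lengths = [r.length for r in group if r.correct]
    if correct_lengths:
        return float(min(correct_lengths))
    return sum(r.length for r in group) / len(group)

def sol_rewards(group: list[Rollout], alpha: float = 1.0, beta: float = 0.001) -> list[float]:
    """Per-rollout reward r(y_j) = alpha * I(correct) - beta * |ell(y_j) - SOL|."""
    sol = sample_optimal_length(group)
    return [alpha * float(r.correct) - beta * abs(r.length - sol) for r in group]

def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantages: standardize the rewards within the rollout group."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

In training, these advantages take the place of the accuracy-only reward inside the GRPO objective; no external length budget or reference length is required, since the SOL is computed from the rollout group itself.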

Experiment Results

We compare the performance of our trained models against the baselines on a range of in-domain math tasks as well as out-of-domain benchmarks. We then provide an in-depth analysis of how our method reshapes the structure of the reasoning traces.

Baselines   To assess the efficiency of our method, we compare our trained models against four baselines:
  • DeepSeek-R1-Distill-Qwen-1.5B/7B (DeepSeek-AI, 2025): Models obtained by distillation from the Qwen series using larger DeepSeek-R1 models. We refer to these models as Distill-1.5B and Distill-7B in the following sections.
  • Qwen2.5-1.5B/7B-Instruct (Team, 2024): Instruction-tuned models from the Qwen 2.5 family of large language models. We include these non-reasoning models as a contrast, i.e., fast thinking without deliberate reasoning. We denote them as Instruct-1.5B and Instruct-7B hereafter.
  • Training Efficient (Arora and Zanette, 2025): We evaluate the 1.5B/7B open-source models released in this work, which share the same base models as ours.
  • O1-Pruner (Luo et al., 2025a): We evaluate the 7B open-source model proposed in this work, which is fine-tuned from the Marco-o1 base model (Zhao et al., 2024).

Output Length after First Appearance of Correct Answer

(Figure: percentage of output tokens appearing after the first correct answer, ShorterBetter-7B vs. Distill-7B, on the AMC, MATH, Olympiad, and Minerva benchmarks.)
For each model, we calculate what percentage of its total output (in tokens) appears after the first time the correct answer is stated. We observe that the percentage of tokens appearing after the first correct answer decreases by 15%, 20%, 17%, and 19% on the AMC, MATH, Olympiad, and Minerva benchmarks, respectively. This suggests that ShorterBetter-7B reaches conclusions more efficiently, spending fewer tokens on additional reasoning once the correct answer has been found.
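As an illustration of how this metric can be computed, the sketch below scans the decoded output token by token and reports the fraction of tokens emitted after the correct answer first appears. The substring-matching criterion is a simplifying assumption and may differ from the exact matching procedure used in our analysis.

```python
def frac_tokens_after_first_answer(tokens: list[str], answer: str) -> float:
    """Fraction of output tokens generated after the correct answer first appears.

    `tokens` are decoded tokens in generation order; the answer is detected by a
    substring match on the growing prefix of the decoded text.
    """
    prefix = ""
    for i, tok in enumerate(tokens):
        prefix += tok
        if answer in prefix:
            return (len(tokens) - (i + 1)) / len(tokens)
    return 0.0  # the correct answer never appears in the output
```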

Breakdown of Reasoning Traces

(Figure: token-level breakdown of reasoning traces for Distill-7B (avg. 2337 tokens) and ShorterBetter-7B (avg. 804 tokens) across five categories: Pivotal Reasoning, Productive Elaboration & Calculation, Exploring Alternatives, Verification & Self-Correction, and Non-Substantive Statement.)
This analysis reveals how ShorterBetter-7B achieves more efficient reasoning than Distill-7B. Using an LLM-as-a-judge framework, we categorize each sentence in the reasoning traces and compute token-level proportions across five functional categories (a minimal sketch of this token-level accounting is given below). The results demonstrate significant structural improvements in reasoning efficiency.
  • +15.6% Pivotal Reasoning and +6.8% Productive Elaboration & Calculation indicate a denser concentration of meaningful, solution-oriented reasoning in ShorterBetter-7B.
  • -4.8% Non-Substantive Statements suggests the model produces fewer filler or redundant tokens, leading to more concise outputs.
  • Reductions in Verification & Self-Correction (-14.1%) and Exploring Alternatives (-3.9%) imply more decisive reasoning with fewer unnecessary detours.
These structural shifts support the conclusion that ShorterBetter promotes more confident and streamlined reasoning trajectories, potentially mitigating the early stopping of promising paths illustrated in Example 2.
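The sketch below outlines the token-level accounting behind this breakdown. The `judge` and `count_tokens` callables stand in for the LLM-as-a-judge prompt and the model tokenizer; both are assumptions of this illustration rather than the exact pipeline used in the analysis.

```python
from collections import Counter
from typing import Callable

CATEGORIES = [
    "Pivotal Reasoning",
    "Productive Elaboration & Calculation",
    "Exploring Alternatives",
    "Verification & Self-Correction",
    "Non-Substantive Statement",
]

def category_token_shares(
    sentences: list[str],
    judge: Callable[[str], str],         # LLM-as-a-judge: sentence -> category label
    count_tokens: Callable[[str], int],  # tokenizer length function
) -> dict[str, float]:
    """Token-weighted share of each functional category within one reasoning trace."""
    counts: Counter = Counter()
    for sentence in sentences:
        label = judge(sentence)
        if label in CATEGORIES:
            counts[label] += count_tokens(sentence)
    total = sum(counts.values()) or 1
    return {c: counts[c] / total for c in CATEGORIES}
```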

BibTeX

@article{yi2025shorterbetter,
  title={ShorterBetter: Guiding reasoning models to find optimal inference length for efficient reasoning},
  author={Yi, Jingyang and Wang, Jiazheng and Li, Sida},
  journal={arXiv preprint arXiv:2504.21370},
  year={2025}
}