Overclocking LLM Reasoning: Monitoring and Controlling Thinking Path Lengths in LLMs

Roy Eisenstadt
Tel-Aviv University
Itamar Zimerman
Tel-Aviv University, IBM Research
Lior Wolf
Tel-Aviv University


Abstract

Recently, techniques such as explicit structured reasoning have demonstrated strong test-time scaling behavior by enforcing a separation between the model's internal "thinking" process and the final response. A key factor influencing answer quality in this setting is the length of the thinking stage. When the reasoning is too short, the model may fail to capture the complexity of the task. Conversely, when it is too long, the model may overthink, leading to unnecessary computation and degraded performance. This paper explores and exploits the underlying mechanisms by which LLMs understand and regulate the length of their reasoning during explicit thought processes. First, we show that LLMs encode their progress through the reasoning process and introduce an interactive progress bar visualization, which is then used to reveal insights into the model's planning dynamics. Second, we manipulate the internal progress encoding during inference to reduce unnecessary steps and generate a more concise and decisive chain of thought. Our empirical results demonstrate that this "overclocking" method mitigates overthinking, improves answer accuracy, and reduces inference latency.

Monitoring and Overclocking Reasoning Progress

⏳💭 Reasoning Loading Bar

We demonstrate that the underlying mechanism we uncover can be used to create a loading bar-style visualization of the thinking process in LLM reasoning tasks.
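As a minimal illustration of the idea, the per-token progress predictions $p \in (0, 1]$ can be rendered as a text loading bar. The function below is a hypothetical sketch, not the project's actual visualization code; the bar width and clamping behavior are our own assumptions.

```python
# Hypothetical sketch: render a text "loading bar" from per-token progress
# predictions p in (0, 1] produced by a trained progress regressor.
def render_progress_bar(p: float, width: int = 30) -> str:
    """Return a loading-bar string for a predicted relative position p."""
    p = min(max(p, 0.0), 1.0)          # clamp noisy predictions into [0, 1]
    filled = int(round(p * width))
    return "[" + "#" * filled + "-" * (width - filled) + f"] {p:.0%}"

# Example: predictions for a few tokens along a thinking trajectory
for p in (0.1, 0.5, 0.95):
    print(render_progress_bar(p))
```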

Solve $3(x-2)+4=2(x+1)+x$.

A train leaves a station at 10:00 AM and travels at a speed of 60 km/h. Another train leaves the same station at 10:30 AM and travels at a speed of 90 km/h. At what time will the second train catch up with the first train?

Simplify $\frac{2 - 2\cos^2(\theta)}{\sin(\theta)}$.

🚀🤖 Overclocking LLMs

We present qualitative examples showing how the uncovered mechanism can be leveraged to speed up a reasoning model's thought process.

From 6 men and 4 women, how many ways can a 3-person committee be formed with at least one woman?

A family group went fishing. Among them were two fathers and two sons. They caught three whole fish and gave one entire fish to each person. How many of the fishermen had no children of their own?

Find the intersection of the two lines given by the equations: $2x + 3y = 6$ and $x - y = 1$.

Method

This work investigates how large reasoning models internally track their thinking progress and how such processes can be monitored and controlled. We focus on reasoning models that explicitly segment their computations using <think> and </think> tokens (e.g., DeepSeek-R1), allowing us to study the internal dynamics of the "thinking phase."

1. Monitoring the Thinking Phase

We hypothesize that hidden states encode a token's relative position within the thinking phase. To test this, we collect hidden representations from the final layer of the model for each token in a thinking trajectory $T_k = w_1 w_2 \dots w_{N_k}$. Each token is paired with a normalized position:

$$p_j^{(k)} = j / N_k$$

This creates a dataset $D = \{ (h_j^{(k)}, p_j^{(k)}) \}$, where $h \in \mathbb{R}^d$ is the hidden state and $p \in (0, 1]$ is the relative position. We learn a regression function:

$$\theta^* = \arg\min_\theta \sum_{(h, p) \in D} (f_\theta(h) - p)^2$$
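In the linear case, this least-squares objective has a closed-form solution. The sketch below fits such a regressor on synthetic stand-ins for the hidden states $h$ and positions $p$; the dimensions and noise level are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Sketch of fitting a linear progress regressor (TPV-style): least-squares
# regression from final-layer hidden states (rows of H) to normalized
# positions p. Synthetic data stand in for real model activations.
rng = np.random.default_rng(0)
d, n = 16, 500                                   # hidden dim, token count
theta_true = rng.normal(size=d)                  # unknown "true" direction
H = rng.normal(size=(n, d))                      # stand-in hidden states
p = H @ theta_true + 0.01 * rng.normal(size=n)   # noisy stand-in positions

# theta* = argmin_theta sum (H theta - p)^2, solved in closed form
theta, *_ = np.linalg.lstsq(H, p, rcond=None)
pred = H @ theta                                 # per-token progress estimates
```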

We compare a linear regressor (TPV: Thinking Progress Vector) with a 2-layer FFN and find no improvement from the latter, favoring the simpler TPV model. For improved temporal modeling, we also train a single-layer GRU on full token sequences:

$$D' = \left\{ \left( (h_1^{(k)}, \dots, h_{N_k}^{(k)}),\; (p_1^{(k)}, \dots, p_{N_k}^{(k)}) \right) \right\}_k$$

The GRU outperforms TPV, especially in generalizing from MATH-500 to GSM8K in both fine-tuned and zero-shot setups.
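A sequential predictor of this kind can be sketched in PyTorch as a single-layer GRU over the hidden-state sequence, followed by a linear head producing a scalar progress estimate per token. The hidden sizes and the sigmoid output squashing are our own assumptions for illustration.

```python
import torch
import torch.nn as nn

class ProgressGRU(nn.Module):
    """Sketch of a sequential progress predictor: one GRU layer over the
    hidden-state sequence of a thinking trajectory, plus a linear head
    mapping each GRU state to a progress estimate in (0, 1)."""

    def __init__(self, d_model: int, d_hidden: int = 64):
        super().__init__()
        self.gru = nn.GRU(d_model, d_hidden, num_layers=1, batch_first=True)
        self.head = nn.Linear(d_hidden, 1)

    def forward(self, h_seq: torch.Tensor) -> torch.Tensor:
        # h_seq: (batch, seq_len, d_model) -> (batch, seq_len) progress
        out, _ = self.gru(h_seq)
        return torch.sigmoid(self.head(out)).squeeze(-1)

model = ProgressGRU(d_model=32)
h_seq = torch.randn(2, 10, 32)       # stand-in hidden-state sequences
progress = model(h_seq)              # shape (2, 10), values in (0, 1)
```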

2. Controlling the Thinking Phase

To explore whether TPVs are causally involved in reasoning, we intervene on hidden states during decoding:

$$h^\alpha = h + \alpha\theta \quad \rightarrow \quad \theta^T h^\alpha = \theta^T h + \alpha||\theta||^2$$

We apply this intervention after the attention layers so that its effect is confined to a single token step. When $\alpha > 0$, we refer to the manipulation as "overclocking." Empirically, overclocking results in more concise and decisive reasoning while maintaining correctness.
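The algebra behind the intervention is easy to verify numerically: shifting a hidden state along the TPV direction $\theta$ raises the linear progress readout $\theta^\top h$ by exactly $\alpha\|\theta\|^2$. The sketch below uses random stand-ins for $\theta$ and $h$.

```python
import numpy as np

# Sketch of the intervention h^alpha = h + alpha * theta, where theta is
# the learned TPV direction. The progress readout theta^T h increases by
# exactly alpha * ||theta||^2.
rng = np.random.default_rng(1)
d = 8
theta = rng.normal(size=d)           # stand-in TPV direction
h = rng.normal(size=d)               # stand-in hidden state
alpha = 2.0                          # alpha > 0 -> "overclocking"

h_alpha = h + alpha * theta
shift = theta @ h_alpha - theta @ h  # equals alpha * ||theta||^2
```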

Findings

  • Original trajectories often exhibit repetition and hesitation.
  • Overclocked outputs are shorter and more linear in progress prediction.
  • In some cases, the token count is reduced by up to 6× while yielding the same correct answer.

Conclusion

These findings suggest that models internally track thinking progress and that this representation can be extracted and modified, opening doors for dynamic reasoning control and real-time interpretability.

BibTeX

If you find this project useful, please cite it as follows:

@misc{eisenstadt2025overclockingllmreasoningmonitoring,
      title={Overclocking LLM Reasoning: Monitoring and Controlling Thinking Path Lengths in LLMs}, 
      author={Roy Eisenstadt and Itamar Zimerman and Lior Wolf},
      year={2025},
      eprint={2506.07240},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2506.07240}, 
}

For questions about this work, please contact the authors.