While reasoning-enhanced Large Language Models (LLMs) have demonstrated remarkable advances on complex tasks such as mathematics and coding, their effectiveness across general multimodal scenarios remains uncertain. The trend among leading developers of releasing parallel "Instruct" and "Thinking" models is a resource-intensive workaround, stemming from the lack of a criterion for determining when reasoning is truly beneficial. In this paper, we propose Dual Tuning, a framework for assessing whether reasoning yields positive gains on target tasks for a given base model and dataset. By jointly fine-tuning on paired Chain-of-Thought (CoT) and Direct-Answer (DA) data under controlled prompts, we systematically quantify and compare the gains of both training modes with the proposed metrics, and establish the "Thinking Boundary" to evaluate the suitability of reasoning training across diverse multimodal tasks spanning spatial, mathematical, and multi-disciplinary domains. We further explore how reinforcement training and thinking patterns affect reasoning suitability, and validate whether the "Thinking Boundary" can guide data refinement. Our findings challenge the "reasoning-for-all" paradigm, providing practical guidance for identifying appropriate data and training strategies, and motivating the development of resource-efficient, adaptive auto-think systems.
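For concreteness, the decision rule behind the "Thinking Boundary" can be sketched in the notation of the tables below. The paper's formal metric definitions are not reproduced here, so the accuracy-difference forms are our assumption: we take \( \mathbf{Gain_{CoT}} \) to be the accuracy change of the CoT-tuned model over the base model, and \( \mathbf{GAP_{DT}} \) to be the accuracy difference between the CoT-tuned and DA-tuned models on the same task.

\[
\mathbf{Gain_{CoT}} = \mathrm{Acc}_{\text{CoT-tuned}} - \mathrm{Acc}_{\text{base}}, \qquad
\mathbf{GAP_{DT}} = \mathrm{Acc}_{\text{CoT-tuned}} - \mathrm{Acc}_{\text{DA-tuned}},
\]
\[
\text{reasoning-suitable} \iff \mathbf{Gain_{CoT}} > 0 \;\wedge\; \mathbf{GAP_{DT}} > 0.
\]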
Table 1: Results of preliminary experiments on spatial reasoning. Baseline: Qwen2.5-VL-7B. I: Image Spatial Data. V: Video Spatial Data. S: Direct-Answer. L: CoT.
Table 2: Results of preliminary experiments on disciplinary reasoning. Baseline: Qwen2.5-VL-7B. O: Onethinker Image Data. S: Direct-Answer. L: CoT.
Table 3: Experimental results on spatial tasks. Values in red and green denote negative and positive results, respectively. A task is identified as suitable for reasoning-oriented training only when \( \mathbf{Gain_{CoT}} \) and \( \mathbf{GAP_{DT}} \) are both positive (highlighted in green); this joint condition constitutes the Thinking Boundary.
Table 4: Experimental results on MathVista tasks. Values in red and green denote negative and positive results, respectively. A task is identified as suitable for reasoning-oriented training only when \( \mathbf{Gain_{CoT}} \) and \( \mathbf{GAP_{DT}} \) are both positive (highlighted in green); this joint condition constitutes the Thinking Boundary.
Table 5: Experimental results on MMMU tasks. Values in red and green denote negative and positive results, respectively. A task is identified as suitable for reasoning-oriented training only when \( \mathbf{Gain_{CoT}} \) and \( \mathbf{GAP_{DT}} \) are both positive (highlighted in green); this joint condition constitutes the Thinking Boundary.
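As a minimal illustration of how this tabulated criterion can be applied, the sketch below classifies tasks from their two metric values. The task names and numbers are hypothetical placeholders, not results from the paper:

```python
# Sketch of the Thinking Boundary criterion used in Tables 3-5.
# Metric values below are hypothetical placeholders, not paper results.

def within_thinking_boundary(gain_cot: float, gap_dt: float) -> bool:
    """A task suits reasoning-oriented training only if both metrics are positive."""
    return gain_cot > 0 and gap_dt > 0

# task -> (Gain_CoT, GAP_DT); illustrative values only
tasks = {
    "spatial_relation": (-1.2, -0.5),
    "geometry_problem": (3.4, 2.1),
    "chart_qa": (0.8, -0.3),
}

for name, (gain_cot, gap_dt) in tasks.items():
    verdict = "reasoning-suitable" if within_thinking_boundary(gain_cot, gap_dt) else "not suitable"
    print(f"{name}: {verdict}")
```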
Figure 1: The base model's initial performance differs between CoT and DA inference across tasks. Positive values indicate an advantage for CoT inference.
Table 6: Performance comparison following subsequent RL training on dual-tuned models for spatial tasks. Values in red and green denote negative and positive results, respectively. A task is identified as suitable for reasoning-oriented training only when \( \mathbf{Gain_{CoT}} \) and \( \mathbf{GAP_{DT}} \) are both positive (highlighted in green); this joint condition constitutes the Thinking Boundary.
Table 7: Performance comparison following subsequent RL training on dual-tuned models for MathVista tasks. Values in red and green denote negative and positive results, respectively. A task is identified as suitable for reasoning-oriented training only when \( \mathbf{Gain_{CoT}} \) and \( \mathbf{GAP_{DT}} \) are both positive (highlighted in green); this joint condition constitutes the Thinking Boundary.
Figure 2: We evaluate two different datasets on MMMU, marked by circles (original) and triangles (new). The resulting shift in task distribution highlights how Thinking Patterns dictate reasoning suitability across tasks.
Figure 3: The effectiveness of a thinking pattern depends on its refinement and the exclusion of redundant or invalid reasoning. We compare the \( \mathbf{Gain_{token}} \) for both datasets on MathVista tasks.
Figure 4: We plot each task's \( \mathbf{Gain_{CoT}} \) and \( \mathbf{Gain_{DA}} \) in a two-dimensional coordinate map. Through three distinct regions, we categorize the suitability of different tasks for the two training modes.
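The region assignment implied by Figures 4–7 can be sketched as below. Note the captions mention both three regions (Figure 4) and a lower-left negative region plus three positive ones (Figure 7); the quadrant-based grouping here follows the latter reading and is our assumption, and the function name is hypothetical:

```python
# Assign a task to a region of the (Gain_DA, Gain_CoT) plane, following our
# reading of Figures 4-7: both gains negative -> the lower-left negative
# region; any positive gain -> one of the positive regions.

def gain_region(gain_da: float, gain_cot: float) -> str:
    if gain_da <= 0 and gain_cot <= 0:
        return "negative (lower-left)"  # unsuitable for either training mode
    if gain_cot > 0 and gain_da <= 0:
        return "CoT-favored"            # gains only from reasoning training
    if gain_da > 0 and gain_cot <= 0:
        return "DA-favored"             # gains only from direct-answer training
    return "both-favored"               # positive gains under both modes
```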
Figure 5: We partition tasks into two halves using \( \mathbf{Gain_{DA}} \) from Figure 4 and conduct two separate DA training runs, one on the data belonging to each half. The results show that left-side tasks predominantly yield negative gains after standalone training, while right-side tasks mostly achieve positive gains, confirming the efficacy of the corresponding data.
Figure 6: We partition tasks into two halves using \( \mathbf{Gain_{CoT}} \) from Figure 4 and conduct two separate CoT training runs, one on the data belonging to each half. The results show that data corresponding to negative tasks (lower half) indeed yield negative gains during standalone training, and vice versa, confirming the efficacy of the corresponding data.
Figure 7: We separately train models on data from the lower-left negative region and from the remaining three positive regions. For tasks in the lower-left yellow region, training solely on the corresponding data predominantly yields negative gains. For the green and pink positive regions, training on the corresponding data yields exclusively positive gains.
@article{zheng2026dualtuning,
title={The Thinking Boundary: Quantifying Reasoning Suitability of Multimodal Tasks via Dual Tuning},
author={Zheng, Ruobing and Li, Tianqi and Li, Jianing and Guo, Qingpei and Yuan, Yi and Chen, Jingdong},
year={2026}
}