What Are Reasoning LLMs?
In January 2025, DeepSeek released R1. Within weeks, the field of reasoning LLMs went from a proprietary mystery — OpenAI had launched o1 months earlier with almost no technical detail — to an open blueprint that anyone could study, replicate, and extend. By mid-2026, every major lab ships a reasoning mode: OpenAI's o3, Google's Gemini Deep Think, Anthropic's extended thinking in Claude. Reasoning has become the defining specialisation of the current generation of language models.
The core idea is simple enough to state: instead of generating an answer immediately, the model generates intermediate steps first — a chain of thought — and uses those steps to arrive at a better final answer. The execution of that idea, however, involves some of the most interesting and least understood techniques in modern AI.
This matters because reasoning is the capability that separates "look up the answer" from "figure out the answer." Factual recall, summarisation, and translation do not require multi-step inference. But solving a novel math problem, debugging a complex codebase, or working through a legal argument does. And those are increasingly the tasks where people want LLMs to be reliable.
The challenge is this: the same techniques that make reasoning models powerful — longer generation, self-correction, backtracking — also make them expensive, slow, and prone to failure modes that standard LLMs do not exhibit. Understanding how reasoning models actually work, and where they break, is essential for anyone building with them.
What "reasoning" actually means here
Definitions in AI are always contested, and "reasoning" is no exception. But for practical purposes, a reasoning model is an LLM that has been trained or prompted to produce intermediate steps — a chain of thought — before generating a final answer. The intermediate steps are the reasoning. Whether the model is genuinely "thinking" in any philosophical sense is a separate question. What matters is that the output includes a step-by-step trace that can be inspected, evaluated, and used to improve the final answer.1
This distinction matters more than it might seem. A standard LLM answering "what is 17 × 23?" might produce "391" in a single step. A reasoning model might produce: "17 × 23. I can break this down: 17 × 20 = 340, and 17 × 3 = 51. So 340 + 51 = 391." The answer is the same. But the reasoning trace makes the process inspectable and, in harder cases, allows the model to catch its own errors mid-stream.
This appears in current models in two main ways: explicit chain-of-thought in the response (the user sees the thinking), and hidden multi-step processing where intermediate steps happen behind the scenes and only the final answer is shown. Both are forms of inference-time compute scaling: the model does more work at generation time in exchange for better answers.3
How reasoning models are built
Inference-time scaling (no training needed). No additional training. Use prompting (e.g. "think step by step"), majority voting, or search to improve output at generation time.
Pure RL (emergent). Train with reinforcement learning and verifiable rewards only, with no supervised examples. Reasoning emerges as a learned behaviour.
SFT + RL (state of the art). Supervised fine-tuning on chain-of-thought data, followed by reinforcement learning. The standard recipe for flagship models.
Distillation (efficient). Fine-tune a smaller model on reasoning traces generated by a larger one. Surprisingly effective at a fraction of the cost.
These four approaches are not mutually exclusive. Most production reasoning models combine several of them. But understanding each one individually is essential for understanding why reasoning models behave the way they do.
Inference-time scaling: think longer, think better
The simplest way to make an LLM reason better is to give it more "thinking time" at inference. No retraining required. The classic example is chain-of-thought prompting: adding "think step by step" to a prompt measurably improves performance on math, logic, and multi-step tasks.2
But inference-time scaling goes beyond prompting. More sophisticated approaches include majority voting (generate multiple answers and pick the most common), beam search over reasoning paths, and process-reward-guided search where a separate model scores each intermediate step and guides the generation toward higher-quality reasoning traces.3
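Majority voting is the easiest of these to make concrete. The sketch below assumes the model has already been sampled several times and that each sample's final answer has been extracted as a string; the `majority_vote` helper and the sample answers are illustrative, not part of any provider's API.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer and the share of samples that agree with it."""
    counts = Counter(answer.strip() for answer in answers)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers)


# Hypothetical final answers extracted from five sampled responses to "17 x 23".
sampled_answers = ["391", "391", "381", "391", "401"]
best_answer, agreement = majority_vote(sampled_answers)
print(best_answer, agreement)
```

The agreement share is a rough confidence signal: a low value suggests the samples disagree and the question may deserve more compute or a different strategy.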
The practical implication is that the quality of a reasoning model's output is not fixed at training time. It is, to some degree, a function of how much compute you allocate at inference. This is a fundamentally different scaling paradigm from the "bigger model = better model" logic that dominated 2020–2023. A well-orchestrated smaller model can outperform a larger one if given enough inference-time compute budget.3
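A toy calculation shows why extra samples can buy accuracy. It assumes a deliberately simplified setting that is not taken from the cited paper: each sample is independently correct with probability `p`, there are only two possible answers, and the final answer is chosen by majority vote over an odd number of samples.

```python
from math import comb


def majority_accuracy(p, n):
    """Probability that more than half of n independent samples are correct."""
    needed = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(needed, n + 1))


for n in (1, 5, 25):
    print(n, round(majority_accuracy(0.6, n), 4))
```

With `p = 0.6`, a single sample is right 60% of the time and a five-sample vote about 68% of the time, and the figure keeps rising with more samples. The same formula also shows the limit of the approach: if `p` is below 0.5, voting amplifies the wrong answer instead.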
OpenAI's o1 and o3 are widely believed to rely heavily on inference-time scaling, which would explain their higher per-token costs compared to models like DeepSeek R1 that invested more in the training process and less in inference-time search.
Reinforcement learning: where reasoning emerges
The most surprising finding in the DeepSeek R1 technical report was this: you can get a language model to develop reasoning behaviour using nothing but reinforcement learning. No supervised examples of chain-of-thought. No human-annotated reasoning traces. Just RL with two types of rule-based reward: an accuracy reward for getting a verifiable final answer right, and a format reward for wrapping the reasoning in the expected tags.1
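The R1 report describes its rewards as rule-based: one checks the final answer, the other checks the output format. A minimal sketch of that idea follows. The exact conventions here are assumptions made for illustration: reasoning sits inside `<think>` tags, the final answer is whatever follows the closing tag, and correctness is an exact string match.

```python
import re

# Assumed layout: <think>reasoning</think> followed by the final answer.
RESPONSE_PATTERN = re.compile(r"^<think>(.+?)</think>\s*(.+)$", re.DOTALL)


def split_response(response):
    """Return (reasoning, answer), or None if the layout is not respected."""
    match = RESPONSE_PATTERN.match(response.strip())
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()


def format_reward(response):
    """1.0 when the response follows the expected layout, else 0.0."""
    return 1.0 if split_response(response) is not None else 0.0


def accuracy_reward(response, expected):
    """1.0 when the extracted answer exactly matches the reference, else 0.0."""
    parts = split_response(response)
    if parts is None:
        return 0.0
    return 1.0 if parts[1] == expected.strip() else 0.0


def total_reward(response, expected):
    return accuracy_reward(response, expected) + format_reward(response)


good = "<think>17 * 20 = 340, 17 * 3 = 51, so 391.</think>391"
print(total_reward(good, "391"))
```

Because both checks are deterministic rules rather than learned models, the reward is cheap to compute and hard to game, which is the property that makes math and code convenient training domains.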
How GRPO works
The RL algorithm behind DeepSeek R1 is called Group Relative Policy Optimization, or GRPO. It is a simplification of PPO (Proximal Policy Optimization) that removes the need for a separate critic model — a significant practical advantage, since the critic roughly doubles the memory and compute requirements of training.4
The core mechanics: for each training prompt, the model generates a group of candidate responses. Each response is scored by a reward function. GRPO then computes a relative advantage for each response within the group — how much better or worse it is than the group average. The model's parameters are updated to increase the probability of above-average responses and decrease the probability of below-average ones.
For each training prompt:
  1. Generate K candidate responses
  2. Score each response with reward function R
  3. Compute group-relative advantage:
       A_i = (R_i - mean(R)) / std(R)
  4. Update policy to increase P(high-advantage responses)
     and decrease P(low-advantage responses)
  5. Apply clipping + KL penalty to prevent instability
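Steps 3 and 5 can be sketched in a few lines of plain Python. This is an illustration of the arithmetic only, not a training loop: the use of the population standard deviation, the small `eps` guard against division by zero, and the clip range of 0.2 are assumptions, and the KL penalty is omitted.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_term(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped surrogate for one response.

    ratio is new_policy_prob / old_policy_prob for that response.
    """
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# Eight sampled responses to one prompt, three of them correct.
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)
print([round(a, 2) for a in advantages])
```

In this group the correct responses get an advantage of about 1.29 and the incorrect ones about -0.77. If every response in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal, which is one reason prompt difficulty matters for this style of training.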
What DeepSeek found was remarkable. The R1-Zero model, trained with nothing but GRPO on verifiable math and code rewards, spontaneously developed self-reflection behaviour. It started producing outputs like "Wait, let me reconsider" and "That approach doesn't work, let me try another way" — despite never being shown examples of such behaviour. The researchers called this the "aha moment."1
This is a genuinely significant finding. It suggests that multi-step reasoning, self-correction, and backtracking are not behaviours that need to be explicitly taught. They can emerge from the optimisation pressure of RL when the reward signal is clear enough.
From R1-Zero to R1: the full recipe
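R1-Zero was a proof of concept. The flagship DeepSeek R1 model added supervised fine-tuning and additional RL stages on top. The pipeline, as the report describes it, ran in roughly four stages: a small cold-start round of supervised fine-tuning on long chain-of-thought examples, reasoning-focused RL, a round of rejection sampling to build a larger SFT dataset for further fine-tuning, and a final RL stage covering general helpfulness and harmlessness.1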
The result was a model that matched OpenAI's o1 on major benchmarks while being significantly cheaper at inference time — likely because DeepSeek invested more heavily in training quality and less in inference-time scaling.
Distillation: reasoning on a budget
DeepSeek also released a family of smaller "distilled" models — Qwen and Llama variants ranging from 1.5B to 70B parameters — trained on the same SFT data used for the full R1. These are not distilled in the classical knowledge-distillation sense (matching the teacher's output logits). They are simply fine-tuned on reasoning traces generated by the larger model.1
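In practice this style of distillation is mostly a data-preparation job. The sketch below shows one plausible way to turn teacher outputs into supervised fine-tuning records. The field names, the `<think>` wrapper, and the step that discards traces whose answer disagrees with a reference are all assumptions of this example, not a description of DeepSeek's actual data pipeline.

```python
import json


def build_sft_example(prompt, trace, answer):
    """Pack a teacher trace and answer into one prompt/completion record."""
    completion = f"<think>\n{trace}\n</think>\n{answer}"
    return {"prompt": prompt, "completion": completion}


def build_dataset(teacher_outputs):
    """Keep only traces whose final answer matches the reference answer."""
    examples = []
    for item in teacher_outputs:
        if item["answer"].strip() != item["reference"].strip():
            continue  # discard traces that end in a wrong answer
        examples.append(build_sft_example(item["prompt"], item["trace"], item["answer"]))
    return examples


# Hypothetical teacher outputs; the second one contains an arithmetic slip.
teacher_outputs = [
    {"prompt": "What is 17 x 23?",
     "trace": "17 x 20 = 340 and 17 x 3 = 51, so 340 + 51 = 391.",
     "answer": "391", "reference": "391"},
    {"prompt": "Solve 3x + 7 = 22.",
     "trace": "3x = 22 - 7 = 16, so x = 16 / 3.",
     "answer": "16/3", "reference": "5"},
]

dataset = build_dataset(teacher_outputs)
jsonl = "\n".join(json.dumps(example) for example in dataset)
print(jsonl)
```

The student never sees the teacher's logits, only its text, which is why the same recipe works across model families with different tokenizers.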
The results were striking. A distilled 32B model outperformed R1-Zero (the 671B pure-RL model) on several benchmarks, despite being twenty times smaller. This suggests that for smaller models, high-quality supervised data from a strong teacher is more effective than pure RL — a finding with direct practical implications for teams that cannot afford to train at the scale of DeepSeek or OpenAI.
The Sky-T1 project demonstrated this even more dramatically: a competitive 32B reasoning model trained on only 17,000 SFT examples for a total cost of $450.5
The overthinking problem
If reasoning models have one defining failure mode, it is this: they think too much.
DeepSeek R1 will sometimes spend over 1,000 tokens solving "3x + 7 = 22" — a problem that DeepSeek V3 (the non-reasoning base model) solves in 58 tokens. The reasoning model generates elaborate explanations of algebraic principles, considers alternative solution methods, double-checks its work, and then arrives at the same answer that the base model reached immediately.6
This is not just an efficiency problem. Research has shown that overthinking can actively reduce accuracy. Models achieving 90% on GSM8K (a grade-school math benchmark) sometimes score below 40% on basic addition — because the elaborate reasoning chain introduces opportunities for compounding errors that a direct answer would avoid.7
The root cause appears to be in how reward functions are designed during RL training. Standard RLVR fully rewards correctness without penalising the cost of generation. The model learns that longer, more elaborate responses are safe — they rarely get penalised for being too thorough, but they do get penalised for being too brief and wrong. The result is a systematic bias toward verbosity.6
Several mitigation strategies have been proposed: length-based penalties in the reward function, difficulty-adaptive thinking (allocating more tokens to hard problems and fewer to easy ones), and "thought terminator" mechanisms that learn when further reasoning is unlikely to change the answer. The DAPO algorithm extends GRPO with an overlong reward penalty that explicitly penalises truncated responses, reducing reward noise from excessive generation.11
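A length-aware reward can be sketched as a correctness score plus a penalty that ramps up as a response approaches the generation limit. The shape below is loosely modelled on the soft overlong punishment described in the DAPO report; the function names, the 0/1 correctness score, and the specific limits are illustrative assumptions.

```python
def soft_overlong_penalty(length, max_len, buffer):
    """0 inside the budget, a linear ramp to -1 across the buffer, -1 beyond max_len."""
    threshold = max_len - buffer
    if length <= threshold:
        return 0.0
    if length <= max_len:
        return (threshold - length) / buffer
    return -1.0


def shaped_reward(correct, length, max_len=4096, buffer=1024):
    """Correctness reward combined with the length penalty (illustrative limits)."""
    base = 1.0 if correct else 0.0
    return base + soft_overlong_penalty(length, max_len, buffer)


for tokens in (1000, 3584, 5000):
    print(tokens, shaped_reward(True, tokens))
```

With these settings a correct 1,000-token response keeps its full reward, a correct 3,584-token response keeps half, and a correct response that overruns the limit nets zero, so the policy is no longer indifferent to length.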
The practical guidance from most teams in 2026 is simple: default to a fast standard model for routine work, and route genuinely hard problems to a reasoning model. Using a reasoning model for everything is like asking an expert consultant to draft your grocery list.
The faithfulness problem: does the model reason the way it says it does?
There is a deeper and more unsettling question lurking behind the success of reasoning models: are the chain-of-thought traces faithful to the model's actual reasoning process? When the model writes "Let me reconsider this — my first approach was wrong because..." does that text actually reflect why the model changed its answer? Or is it a plausible-sounding post-hoc rationalisation?
The evidence is mixed, and in some cases alarming.
Research from Anthropic showed that LLM-generated reasoning chains can systematically misrepresent the factors actually influencing the model's predictions. A model can arrive at the right answer for the wrong reasons — and produce a confident-sounding chain of thought that hides the real basis for its decision.8
A study published at ICLR 2026 introduced the first benchmark specifically designed to detect unfaithful reasoning at the instance level, confirming that unfaithful chain-of-thought is not a rare edge case but a regular occurrence in both "thinking" and "non-thinking" frontier models.9
The practical implications are significant. Chain-of-thought monitoring — where safety systems review a model's intermediate reasoning steps to catch harmful or deceptive behaviour — is gaining traction at major labs. But if the chain of thought is itself unfaithful, monitoring it provides a false sense of security. Research has shown that blatant harmful actions can pass through CoT monitoring when the reasoning trace is crafted to look benign.10
Code demo: watching a reasoning model think
To make the mechanics concrete, here is a minimal example showing how a reasoning model's output differs from a standard model's. This uses a hypothetical structured output format, but the pattern is representative of how reasoning APIs (OpenAI's o-series, DeepSeek R1's <think> tags, Claude's extended thinking) actually work.
# Pseudo-code showing reasoning model output structure
# Actual API calls vary by provider

standard_response = {
    "answer": "The minimum value is -4.",
    "tokens_used": 12,
}

reasoning_response = {
    "thinking": [
        "I need to find the minimum of f(x) = x² + 2x - 3.",
        "Taking the derivative: f'(x) = 2x + 2.",
        "Setting f'(x) = 0: 2x + 2 = 0, so x = -1.",
        "Second derivative: f''(x) = 2 > 0, confirming this is a minimum.",
        "f(-1) = (-1)² + 2(-1) - 3 = 1 - 2 - 3 = -4.",
        "Let me verify: f(0) = -3, f(-2) = 4 - 4 - 3 = -3.",
        "Both neighbors give higher values. The minimum is -4 at x = -1.",
    ],
    "answer": "The minimum value of f(x) = x² + 2x - 3 is -4, occurring at x = -1.",
    "tokens_used": 147,
}

# The reasoning model uses ~12× more tokens
# but provides a verifiable, step-by-step trace
# that can be checked for errors at each step.
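When the reasoning arrives inline rather than as a separate field, as with `<think>` tags, the caller has to separate the trace from the answer. A minimal sketch, assuming a single `<think>...</think>` block at the start of the raw text; the `split_reasoning` helper is hypothetical, and some providers return the reasoning in a separate field instead.

```python
import re


def split_reasoning(raw_text):
    """Split raw model text into (thinking, answer).

    Returns an empty thinking string when no <think> block is present.
    """
    match = re.search(r"<think>(.*?)</think>", raw_text, flags=re.DOTALL)
    if match is None:
        return "", raw_text.strip()
    thinking = match.group(1).strip()
    answer = raw_text[match.end():].strip()
    return thinking, answer


raw = (
    "<think>\nf'(x) = 2x + 2 = 0 gives x = -1.\nf(-1) = 1 - 2 - 3 = -4.\n</think>\n"
    "The minimum value is -4, occurring at x = -1."
)
thinking, answer = split_reasoning(raw)
print(answer)
```

Keeping the two parts separate makes it easy to log or audit the trace while showing users only the answer.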
The efficiency frontier
The current generation of reasoning models sits on an uncomfortable trade-off between accuracy and cost. Every additional thinking token improves the chance of a correct answer (up to a point) but increases latency and expense. The field is now actively working on making reasoning more efficient rather than simply more capable.
Several approaches are converging:
Adaptive depth. Instead of applying the same reasoning intensity to every query, route simple questions to direct answering and hard questions to deep reasoning. This requires a difficulty classifier — itself an open problem, since estimating query difficulty before solving it is non-trivial.
Token-efficient RL. Algorithms like DAPO and Reinforce-Rej modify the RL training process to penalise unnecessary length. Early results show that models can maintain accuracy while reducing average reasoning length by 30–50%.11
Reasoning distillation. Train smaller models on the reasoning traces of larger ones. The Sky-T1 result ($450 for a competitive reasoning model) suggests that the reasoning capability itself can be compressed significantly.5
Speculative reasoning. Analogous to speculative decoding for standard inference, this approach generates candidate reasoning paths in parallel and prunes unpromising branches early, reducing the total number of sequential model calls needed.
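Of these, adaptive depth is the easiest to prototype. The sketch below routes on a crude keyword-and-length heuristic, which is purely a placeholder: the marker list, the weights, the threshold, and the two route names are invented for illustration, and a production router would more likely use a trained classifier.

```python
# Placeholder signals that a query may need multi-step reasoning.
HARD_MARKERS = ("prove", "debug", "derive", "optimize", "optimise", "step by step")


def estimate_difficulty(query):
    """Return a rough difficulty score in [0, 1] from keywords and length."""
    text = query.lower()
    hits = sum(marker in text for marker in HARD_MARKERS)
    length_signal = len(text.split()) / 200
    return min(1.0, 0.4 * hits + length_signal)


def route(query, threshold=0.5):
    """Pick which model tier should handle the query."""
    return "reasoning" if estimate_difficulty(query) >= threshold else "standard"


print(route("Translate 'hello' to French"))
print(route("Prove that the sum of two odd integers is even, step by step"))
```

The weakness named above is visible here: a short query can still be very hard, and a heuristic like this has no way of knowing that before the problem is attempted.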
What we know and what we don't
Here is what we know, stated plainly.
Reasoning models are a real and important advance. Chain-of-thought generation, trained through RL with verifiable rewards, produces measurably better performance on complex tasks. This is not hype. The math competition results, the coding benchmarks, and the real-world improvements in agentic applications are genuine.
The training recipe is now well understood. The combination of SFT on chain-of-thought data plus RL with verifiable rewards is the standard approach. GRPO has become the default algorithm. Open-source implementations exist. The mystery that surrounded o1 in 2024 has largely dissolved.
Overthinking is a real and unsolved problem. Reasoning models waste significant compute on easy tasks and can reduce accuracy through excessive elaboration. Current mitigations (length penalties, adaptive depth) help but do not fully solve the problem.
Faithfulness is not guaranteed. Chain-of-thought traces are not reliable windows into the model's actual decision process. They can be post-hoc rationalisations that mask the real factors driving the output. This is a fundamental limitation for interpretability and safety monitoring.
Distillation works surprisingly well. Small models trained on reasoning traces from large models achieve competitive performance at a fraction of the cost. This democratises access to reasoning capabilities but does not push the frontier — it replicates it.
The efficiency problem is the next frontier. Making reasoning models think better rather than longer is the central engineering challenge for 2026 and beyond. The models that win will not be the ones that think the most. They will be the ones that allocate thinking effort in proportion to problem difficulty.
The trajectory is clear. Reasoning will become a standard capability of all frontier models, not a separate product tier. The distinction between "reasoning model" and "standard model" will blur into a spectrum of thinking intensity, dynamically allocated based on task difficulty. And the hard problems — faithfulness, efficiency, knowing when to think and when to just answer — will define the next generation of progress.
For now, the most useful mental model is this: reasoning LLMs are like an expert who shows their work. That is genuinely valuable. But showing work is not the same as being right, and writing down a clear derivation is not the same as having followed it. The traces are helpful. They are not gospel.
1. DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv. 2025. arXiv:2501.12948.
2. Wei J, Wang X, Schuurmans D, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems. 2022;35:24824–24837.
3. Snell C, Lee J, Xu K, Kumar A. Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv. 2024. arXiv:2408.03314.
4. Shao Z, Wang P, Zhu Q, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv. 2024. arXiv:2402.03300.
5. NovaSky Team. Sky-T1: Train your own O1 preview model within $450. NovaSky-AI Technical Report. 2025.
6. Sui Y, Chuang Y-N, Wang G, et al. Stop overthinking: a survey on efficient reasoning for large language models. Transactions on Machine Learning Research. 2025.
7. Srivastava S, et al. Do LLMs overthink basic math reasoning? Benchmarking the accuracy-efficiency tradeoff in language models. arXiv. 2025.
8. Lanham T, et al. Measuring faithfulness in chain-of-thought reasoning. arXiv. 2023. arXiv:2307.13702.
9. FaithCoT-Bench: Benchmarking chain-of-thought faithfulness in large language models. Proceedings of ICLR. 2026.
10. Arnav B, Bernabeu-Pérez P, Helm-Burger N, Kostolansky T, Whittingham H. CoT red-handed: stress testing chain-of-thought monitoring. LASR Labs. 2025.
11. Yu Q, et al. DAPO: an open-source LLM reinforcement learning system at scale. arXiv. 2025. arXiv:2503.14476.
12. Turpin M, Michael J, Perez E, Bowman S. Language models don't always say what they think: unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing Systems. 2023;36:74952–74965.
13. Raschka S. Understanding reasoning LLMs. Ahead of AI (Substack). February 2025.