How DeepSeek-R1-Zero Model Learns to Reason Like a Human


Introduction 👋
DeepSeek AI's recent paper, DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning[1], has quickly become a hot topic in the AI world, and for good reason! In this paper, DeepSeek introduces their new reasoning model: DeepSeek-R1-Zero. What's exciting about DeepSeek-R1-Zero is that it is the first open research to show that the reasoning capabilities of LLMs can be enhanced purely through Reinforcement Learning (RL), without needing a Supervised Fine Tuning (SFT) stage. This discovery is a significant breakthrough that I believe will lead to major changes in AI in the coming years. But what enables DeepSeek-R1-Zero to achieve this 'reasoning' capability without SFT? That's what I will explore in this article.
Previous approaches for ‘Reasoning’ 🤓
Around the time GPT-3 came out, special prompting techniques like Chain-of-Thought (CoT) prompting [2,5] were needed to elicit reasoning behavior from LLMs, as shown below:
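As a rough illustration (the question and exact wording here are my own, not taken from any of the referenced papers), a zero-shot CoT prompt simply appends a cue like "Let's think step by step" to the question:

```python
# Minimal illustration of zero-shot Chain-of-Thought prompting.
# The question and cue phrase are illustrative examples only.

question = (
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 more "
    "than the ball. How much does the ball cost?"
)

# Without CoT: the model is asked for the answer directly.
plain_prompt = f"Q: {question}\nA:"

# With CoT: the cue phrase nudges the model to write out intermediate
# steps before the final answer, which tends to improve accuracy.
cot_prompt = f"Q: {question}\nA: Let's think step by step."

print(plain_prompt)
print(cot_prompt)
```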
By the time GPT-4 was released, LLMs were being trained with many Chain-of-Thought (CoT) examples. These examples, which include a chain-of-thought structure in the answers used during the SFT stage, enabled the models to reason naturally on questions where reasoning seemed necessary. For example, as shown below, the model (GPT-4 in this case) was able to use CoT behavior naturally while answering the question, even without any CoT prompting. While these LLMs were generally better at simple reasoning tasks like this, they still struggled with complex ones.
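As a sketch of what such training data might look like (this is an illustrative example of my own, not DeepSeek's or OpenAI's actual data, and the field names are hypothetical), an SFT example with CoT baked into the target answer could be:

```python
# Illustrative SFT example where the target response already contains
# step-by-step reasoning. Field names and content are hypothetical.
sft_example = {
    "prompt": (
        "If a train travels 60 km in 45 minutes, "
        "what is its average speed in km/h?"
    ),
    "response": (
        "45 minutes is 0.75 hours. "
        "Average speed = distance / time = 60 km / 0.75 h = 80 km/h. "
        "The answer is 80 km/h."
    ),
}
```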
The o1 model advances this ability by introducing a dedicated 'reasoning' stage, as shown below, and can solve complex reasoning tasks at an elite level. These specialized reasoning models have two output stages. The first is the reasoning stage, where the problem is broken down into simpler subtasks using 'reasoning' tokens. The second is the output stage, which produces the final answer based on the input prompt and the reasoning tokens.
While the OpenAI Reasoning Models [3] were hinted to be trained with large-scale Reinforcement Learning, nothing could be confirmed since it is a closed-source project. Moreover, the o1 model is very expensive, making it infeasible to use at scale.
How DeepSeek-R1-Zero model learns reasoning 🧠
The following are the key steps in training the DeepSeek-R1-Zero model:
Base Model: For training the DeepSeek-R1-Zero model, DeepSeek-V3-Base from [4] is used as the base model. DeepSeek-V3-Base is the model obtained after an extensive unsupervised pre-training stage on 14.8T high-quality and diverse tokens, using the next-token prediction and Fill-in-Middle (FIM) tasks. This means the base model already has a very good understanding of the meaning, grammar, and structure of the language.
So as far as next-token prediction is concerned, it is a very good autoregressive model, but it is not that great at a) general QA and other instruction-following tasks and b) answering in a Chain-of-Thought pattern.

Training Prompt: For training the DeepSeek-R1-Zero model, the following training prompt is used:
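Putting the quoted instructions together, the template looks roughly like the sketch below. This is reconstructed from the sentences quoted in the key details that follow; the exact wording of the released template may differ slightly.

```python
# A sketch of the R1-Zero training template, reconstructed from the
# instructions quoted in the paper; the exact released wording may differ.
R1_ZERO_TEMPLATE = (
    "A conversation between User and Assistant. The user asks a question, "
    "and the Assistant solves it. The assistant first thinks about the "
    "reasoning process in the mind and then provides the user with the "
    "answer. The reasoning process and answer are enclosed within "
    "<think> </think> and <answer> </answer> tags, respectively, i.e., "
    "<think> reasoning process here </think> "
    "<answer> answer here </answer>. "
    "User: {prompt}. Assistant:"
)

# During RL training, each question is simply substituted into the template:
print(R1_ZERO_TEMPLATE.format(prompt="What is 12 * 17?"))
```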
The key details here are:
They explicitly instruct that the 'assistant first thinks about the reasoning process in the mind and then provides the answer'. This is very important because, for DeepSeek-V3-Base, which already has a good understanding of the semantics of the language, this instruction has a similar effect to the 'Let's think step by step' prompting technique from the previous section. While it may not work perfectly (not yet!), because the DeepSeek-V3-Base model has only gone through the unsupervised training stage at this point, this instruction still primes the model towards generating responses that include some kind of thinking or breaking down of the problem.
They explicitly instruct that 'The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>.' This primes the model towards having a separate reasoning stage and a separate answer stage in the overall output, i.e., it helps ensure there is a dedicated reasoning stage.
Reward functions: When using Reinforcement Learning, the reward functions are crucial, as the rewards determine which types of outputs get preferred and which do not. They use two types of rewards:
Accuracy rewards: This reward function evaluates whether the final answer is correct. While they do not describe exactly how this is done, they do provide some example scenarios: a) for math questions with known answers, the model is required to produce the final answer in a format that is easy to parse and evaluate, and b) for coding questions, the generated code is compiled and checked against predefined test cases.
This reward pushes the output towards good accuracy. Note that by itself it does not directly ensure that the model does 'reasoning'.

Format rewards: Based on the paper, the purpose of this reward function is to enforce that the model puts its thinking process between <think> and </think> tags. However, there is no explanation of how this is implemented. At the least, they can ensure the tag format is correctly followed by checking for an occurrence of the <think> tag, followed by some tokens, followed by the </think> tag. I wish they had revealed more details!
Note that by itself it does not dictate the type of content inside the reasoning (at least based on the details mentioned in the paper).
Note that to be able to use these reward functions, the data needs to be in a particular format, e.g., it needs to contain all the information required to check for accuracy. This means they would be restricted in what data they could use for the large-scale Reinforcement Learning. They indeed allude to this issue but do not give any further details. A rough sketch of what such rule-based rewards might look like is shown below.
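Since the paper does not spell out the implementation, the following is only a minimal sketch under my own assumptions: the function names, the exact <think>/<answer> pattern check, and the 0/1 reward values are all illustrative, not DeepSeek's actual code.

```python
import re

# Rough sketch of rule-based rewards. Function names, the expected output
# format, and the reward values are assumptions for illustration only.

THINK_ANSWER_PATTERN = re.compile(
    r"^\s*<think>(.+?)</think>\s*<answer>(.+?)</answer>\s*$", re.DOTALL
)

def format_reward(completion: str) -> float:
    """Reward 1.0 if the completion follows <think>...</think><answer>...</answer>."""
    return 1.0 if THINK_ANSWER_PATTERN.match(completion) else 0.0

def accuracy_reward(completion: str, ground_truth: str) -> float:
    """Reward 1.0 if the content of <answer>...</answer> matches the known answer.

    For math-style questions this can be a simple string comparison;
    for code, one would instead run predefined test cases.
    """
    match = THINK_ANSWER_PATTERN.match(completion)
    if match is None:
        return 0.0
    predicted = match.group(2).strip()
    return 1.0 if predicted == ground_truth.strip() else 0.0

# Example usage
completion = (
    "<think> 12 * 17 = 12 * 10 + 12 * 7 = 120 + 84 = 204 </think> "
    "<answer> 204 </answer>"
)
print(format_reward(completion))           # 1.0
print(accuracy_reward(completion, "204"))  # 1.0
```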
Group Relative Policy Optimization (GRPO): Now the large-scale Reinforcement Learning stage begins, using the GRPO algorithm [6] (a variant of the popular Proximal Policy Optimization (PPO) algorithm) to maximize the reward functions. A full explanation of GRPO is beyond the scope of this article; if you are interested, you can refer to the paper that proposed it [6] or this YouTube video that explains it.
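For a one-line intuition only (see [6] for the full objective): for each prompt, GRPO samples a group of G outputs, scores them with the reward functions, and uses each output's reward relative to the group as its advantage:

$$
\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1, \ldots, r_G\})}{\operatorname{std}(\{r_1, \ldots, r_G\})}
$$

Outputs that score above the group average get positive advantages and are reinforced, while below-average ones are suppressed, which removes the need for a separate value (critic) model.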
All together: So we start with the base model, which already comes with a good understanding of language semantics; we have the curated prompt, which instructs it to think before answering and to include the thinking process in the answer following a specific format; and we have the reward functions, which enforce accuracy and format through the RL training stage. This is how reasoning capabilities are induced in the model.
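To make the flow concrete, here is a toy sketch of the sampling-and-scoring part of one RL step, reusing the reward functions from the earlier sketch. The stub sampler, the group size, and the canned completions are my own illustrative assumptions; no real model or gradient update is shown.

```python
import random
import statistics

# Toy sketch of one RL "step": sample a group of completions for a prompt,
# score them with the rule-based rewards sketched above, and compute
# group-relative advantages as in GRPO. A stub sampler stands in for the
# actual policy model, and the gradient update itself is omitted.

def sample_completions(prompt: str, group_size: int) -> list[str]:
    """Stub policy: returns canned completions instead of real model samples."""
    candidates = [
        "<think> 12 * 17 = 204 </think> <answer> 204 </answer>",  # correct, well formatted
        "<think> 12 * 17 = 194 </think> <answer> 194 </answer>",  # wrong answer
        "The answer is 204.",                                     # bad format
    ]
    return [random.choice(candidates) for _ in range(group_size)]

prompt = "What is 12 * 17?"
ground_truth = "204"
group = sample_completions(prompt, group_size=8)

# Total reward = accuracy reward + format reward (from the earlier sketch).
rewards = [accuracy_reward(c, ground_truth) + format_reward(c) for c in group]

mean_r = statistics.mean(rewards)
std_r = statistics.pstdev(rewards) or 1.0  # avoid division by zero
advantages = [(r - mean_r) / std_r for r in rewards]

for completion, adv in zip(group, advantages):
    print(f"{adv:+.2f}  {completion[:50]}")
# Completions with above-average reward get positive advantages and would be
# reinforced by the GRPO update; below-average ones would be suppressed.
```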
What problem-solving strategies does the model learn?
While we instruct the model to think and to follow a particular format while answering, we do not provide any additional help or constraints on what kind of strategies or reasoning patterns to follow for a given problem. The model is left to figure this out based on its pre-existing knowledge from the pre-training stage and whatever it learns during the RL stage to maximize the rewards. Empirically, the researchers at DeepSeek find that these incentives are enough to make the model learn remarkable reasoning patterns through RL. These capabilities were very similar in style to human reasoning skills, like the 'aha moment' example below, where the model re-evaluates its initial approach and changes it when needed, just like we humans often do. This shows the potential of Reinforcement Learning to teach reasoning to the model without even needing the SFT stage.
Impact 💥
I believe this is a very important milestone, because it is the first instance of open research to validate that the reasoning capabilities of LLMs can be incentivized purely through Reinforcement Learning (RL), without the need for a Supervised Fine-Tuning (SFT) stage.
Based on how synthetic-data-based distillation is used to very effectively improve models across various sizes, and the fact that DeepSeek-R1 is open source, I believe that LLMs are only going to keep getting better and smaller.
Since skipping the SFT stage can be a convenient option, I am guessing there will be more and more work on pure RL-based training regimes, or at the least an increased focus on the RL stage.
Conclusion
In conclusion, the DeepSeek-R1-Zero model represents a significant advancement in the field of AI by demonstrating that reasoning capabilities in large language models can be enhanced purely through Reinforcement Learning, without the need for a Supervised Fine Tuning stage. This breakthrough not only validates the potential of RL in teaching reasoning but also opens up possibilities for creating more efficient and smaller models. As the AI community continues to explore and refine RL-based training regimes, we can anticipate further improvements in the performance and accessibility of language models, ultimately leading to more sophisticated and human-like reasoning abilities in AI systems.
References 📝
