Some personal thoughts while implementing GRPO

Disclaimer: I recently started implementing GRPO for a small project: balancing a pendulum. These are fresh thoughts, and I still have a lot to learn on the subject. Hopefully later posts will be more insightful and better referenced.
Reminder
I will not explain how GRPO works here, but here is a reminder of the objective used (don’t worry, I won’t go into technical details):
$$J_{GRPO}(\theta) = \mathbb{E}_{q\sim P(Q),\ \{o_i\}_{i=1}^G\sim \pi_{\theta_{old}}(O|q)}\left[\frac{1}{G}\sum_{i=1}^G\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\left\{A-B\right\}\right]$$
$$A := \min \left[ r \hat{A}_{i,t}, \text{clip}\left(r, 1-\varepsilon, 1+\varepsilon\right)\hat{A}_{i,t}\right]$$
$$B := \beta\mathbb{D}_{KL}[\pi_\theta\|\pi_{ref}]$$
$$r := \frac{\pi_\theta(o_{i,t}|q,o_{i,{<}t})}{\pi_{\theta_{old}}(o_{i,t}|q,o_{i,{<}t})}$$
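To make this less abstract, here is a minimal sketch of how I read the per-token objective, written with PyTorch. This is not the code of my project: the names (grpo_objective, logp_new, and so on) and the default values of eps and beta are my own illustrative choices, and the KL term uses a sample-based estimator of the divergence rather than the exact quantity.

```python
import torch

def grpo_objective(logp_new, logp_old, logp_ref, advantages, mask, eps=0.2, beta=0.04):
    # logp_*: (G, T) log-probabilities of the sampled tokens under each policy
    # advantages: (G, T) per-token advantages A_hat_{i,t}
    # mask: (G, T) 1 for real tokens, 0 for padding
    ratio = torch.exp(logp_new - logp_old)                               # r
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    surrogate = torch.minimum(ratio * advantages, clipped * advantages)  # term A

    # term B / beta: sample-based estimator of KL(pi_theta || pi_ref)
    log_ratio_ref = logp_ref - logp_new
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1.0

    per_token = surrogate - beta * kl
    lengths = mask.sum(dim=1).clamp(min=1.0)
    per_sequence = (per_token * mask).sum(dim=1) / lengths               # 1/|o_i| * sum over t
    return per_sequence.mean()                                           # 1/G * sum over i
```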
The clipping term
What strikes me in this formula is the presence of two terms (B and the clipping term in A) that aim to curb the learning speed of the policy. The clipping term was introduced with PPO as a way to stabilize learning: the idea is that we don’t want πθ and πθ-old to drift too far apart. It is basically telling the policy not to jump to conclusions too quickly, or in other words, not to throw away everything it has learnt so far. To me this term shows a promising direction in RL, and I believe there is room for big improvements. Let’s consider the following situation:
We are training a little robot with arms. It is facing a wooden plank, and it has been given nails, screws and a screwdriver. So far it has discovered that you can use the pointy tip of the screwdriver to drive the screws, and the bulbous side of the handle to "hammer" the nails. Then comes an actual hammer. It quickly discovers that you can hammer the nails far more efficiently with this tool. Unfortunately, out of "excitement", it now also uses the hammer on the screws. Its behavior changed a little too dramatically because of the new discovery…
The clipping term is there to avoid such situations. However, I believe there was some wisdom in the robot’s behavior. When we obtain new information, our model of the world can change drastically in some respects. For example, if tomorrow you saw a man casually flying in the sky without any sort of apparatus, it would challenge your entire conception of reality. Of course, this is an extreme example, but we do experience such shocks on a smaller scale as the world around us evolves. Think of the advent of cars, electricity, the internet, etc. I believe that being able to assess the magnitude of a new discovery’s influence, and to determine its consequences, is an important part of intelligence.
In the case of the little robot, it makes sense to test what the hammer can achieve in different contexts. The difficulty is determining what to conclude. In that regard, the clipping term is a very crude way of dealing with this. I think looking for ways to obtain a finer adaptation is a great direction for improvements (easier said than done though).
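To see just how blunt the clip is, here is a tiny numeric illustration (the numbers are made up, only meant to show the shape of the clipped term for a positive advantage):

```python
import numpy as np

def clipped_term(ratio, advantage, eps=0.2):
    # the A term of the objective, for a single token
    return min(ratio * advantage, float(np.clip(ratio, 1 - eps, 1 + eps)) * advantage)

for ratio in [0.9, 1.0, 1.1, 1.2, 1.5, 3.0]:
    print(f"ratio={ratio:.1f} -> term={clipped_term(ratio, advantage=1.0):.2f}")
# once the ratio exceeds 1 + eps, the term stops growing: however exciting the
# discovery, the objective gives no incentive to move the policy any further
# within a single update
```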
There is already some literature on the subject of clipping, but I have not read any of it yet; I will soon.
Greedy selection
Adapting GRPO to a continuous-control setting such as my pendulum simulation raises a problem: after evaluating a whole group of actions, which one should we actually choose to continue the simulation? A simple choice is to select the one that got the highest reward. However, in this setting, such greedy behavior favors short-term rewards over long-term ones, because the outcomes we observe are too local. In my particular case, I don’t think this will affect the final result, since balancing a simple pendulum is easy enough, but in other cases I cannot say. This problem has surely been studied extensively by now, so I will look out for solutions.
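For reference, here is roughly what that greedy selection looks like. This is a toy sketch, not my actual code: the dynamics, the reward, and the random torques standing in for the policy are all simplified placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def step(state, torque, dt=0.05, g=9.81):
    # toy pendulum dynamics: (angle from upright, angular velocity)
    theta, omega = state
    omega = omega + dt * (g * np.sin(theta) + torque)
    theta = theta + dt * omega
    return np.array([theta, omega])

def immediate_reward(state):
    # purely local score: how upright and how slow we are right now
    theta, omega = state
    return -(theta ** 2 + 0.1 * omega ** 2)

def greedy_next_action(state, group_size=8, torque_scale=2.0):
    # sample a group of candidate torques (a stand-in for the policy),
    # look one step ahead, and keep the best short-term outcome
    actions = rng.normal(0.0, torque_scale, size=group_size)
    rewards = [immediate_reward(step(state, a)) for a in actions]
    return actions[int(np.argmax(rewards))]

state = np.array([np.pi, 0.0])  # start hanging down
for _ in range(200):
    state = step(state, greedy_next_action(state))
```

The greedy part is the argmax on one-step rewards: nothing in it accounts for what happens a few steps later, which is exactly the short-term bias mentioned above.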
Group sampling
I especially like the group sampling behavior of GRPO, because it resembles a very sensible strategy a human might use, namely "let’s see how different actions affect the outcome". I think cognitive science could bring a lot to RL and AI in general by providing cognitive strategies that could then be adapted into algorithms. As vague as it sounds, this is a direction I will definitely explore.
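Concretely, the part I like is the group-relative baseline: each sample is scored against its siblings drawn from the same state, with no learned value function needed. A minimal sketch (my own naming, but the normalization is the one GRPO uses for outcome rewards as far as I understand it):

```python
import numpy as np

def group_advantages(rewards):
    # GRPO-style group baseline: each sample is judged relative to its siblings
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

print(group_advantages([1.0, 0.5, 2.0, 0.1]))
# the best outcome gets a positive advantage, the worst a negative one,
# without training a separate value network to act as the baseline
```

That, to me, is the "let’s try a few things and compare" strategy written down in a couple of lines.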