Were you curious to find out why Google DeepMind's new Gemma models use REINFORCE instead of PPO for RLHF to learn human preferences? 🔎
We have been working on this paper for months -- it takes apart PPO and motivates REINFORCE-style optimization as far simpler and more effective. 🎖 Reinforcement learning from human feedback (RLHF) has been widely adopted to ensure models reflect human preferences. 🌍 Approaches like PPO directly borrow assumptions from traditional RL. Here we ask: is this necessary in the LLM setting?
In the LLM setting, we show that thanks to the strength of the initial policy and prompt conditioning, PPO is both unnecessary and computationally costly. Take a look to understand why we suggest the future of RLHF for LLMs looks different from PPO and requires going back to basics.
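For intuition, here is a minimal, illustrative sketch (in PyTorch, with made-up tensors standing in for a real policy and reward model, and hypothetical hyperparameters) of what a REINFORCE-style update strips away relative to PPO's clipped surrogate. It is an assumption-laden toy, not the paper's implementation.

```python
# Toy sketch: REINFORCE-style update vs. PPO-clip for sequence-level RLHF.
# All tensors below are random placeholders, not a real LLM or reward model.
import torch

vocab, seq_len, batch = 32, 12, 4
logits = torch.randn(batch, seq_len, vocab, requires_grad=True)  # stand-in for policy logits
tokens = torch.randint(vocab, (batch, seq_len))                  # sampled completions
reward = torch.randn(batch)                                      # scalar per-sequence reward (e.g. from an RM)

logp = torch.log_softmax(logits, dim=-1)
logp_seq = logp.gather(-1, tokens.unsqueeze(-1)).squeeze(-1).sum(-1)  # log pi(y|x)

# REINFORCE with a simple mean baseline: one scalar advantage per completion,
# no value network, no clipping, no per-token credit assignment.
advantage = reward - reward.mean()
reinforce_loss = -(advantage * logp_seq).mean()

# PPO-clip, for contrast: needs old log-probs, importance ratios, and clipping
# (and usually a learned value function for advantages) -- extra machinery the
# post argues is unneeded when starting from a strong, prompt-conditioned policy.
old_logp_seq = logp_seq.detach()
ratio = torch.exp(logp_seq - old_logp_seq)
ppo_loss = -torch.min(ratio * advantage, torch.clamp(ratio, 0.8, 1.2) * advantage).mean()

reinforce_loss.backward()  # the simpler estimator is a plain policy-gradient step
```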
I am very proud of this work, led by our Cohere For AI and Cohere teams. A huge congrats to first author Arash Ahmadian and the rest of the authors: Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, and Ahmet Üstün. 🎉 🎊
Learn more: https://lnkd.in/e77W6hec