Discussion about this post

Ani N

On the idea of RLHF eliciting existing capabilities: this is a statement about present models and reasoning pipelines, not about the fundamental techniques. In most domains we simply don't have data at the scale needed to reliably learn new things from RL, but we may in the future.

Existing models are inefficient learners, in the grand scheme of things. A single H100 consumes roughly fourteen thousand kcal of energy per day (at 700 watts), and a single GPU-day buys very little learning.
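
A quick back-of-envelope check of that figure, using only the 700 W number above and the standard joule-to-kcal conversion:

```python
# Back-of-envelope: energy drawn by one H100 running flat out for a day.
WATTS = 700                      # board power cited in the comment above
SECONDS_PER_DAY = 24 * 60 * 60
JOULES_PER_KCAL = 4184           # 1 kcal = 4184 J

joules_per_day = WATTS * SECONDS_PER_DAY          # ~60.5 MJ
kcal_per_day = joules_per_day / JOULES_PER_KCAL   # ~14,500 kcal

print(f"{kcal_per_day:,.0f} kcal/day")            # prints "14,455 kcal/day"
```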

But this power cost pales in comparison to the cost of acquiring human-labeled, gold-standard RL data. Enough data to tune Grok into what it was would cost at least hundreds of thousands of dollars unless it were bootstrapped from an existing LLM.
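
For scale, a minimal sketch of that labeling cost; the per-label price and dataset size below are illustrative assumptions, not figures from the comment:

```python
# Hypothetical back-of-envelope for the human-labeling cost.
# Both numbers are assumptions for illustration, not reported figures.
COST_PER_COMPARISON = 3.0     # assumed dollars per human preference judgment
NUM_COMPARISONS = 150_000     # assumed size of a competitive preference dataset

total_cost = COST_PER_COMPARISON * NUM_COMPARISONS
print(f"${total_cost:,.0f}")  # $450,000 -- comfortably "hundreds of thousands"
```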

A huge amount of work in industry and academia is going into getting as much bang for the buck as possible from synthetic data, and into making fine-tuning and adaptation as efficient as possible. It's possible that, in a few years, models will be able to learn new skills on the fly in a rudimentary fashion. It goes without saying that such an achievement would drastically increase the utility of LLMs, but, to the point you make here, it would also likely allow the grubby RL reward maximizer in the model to choke the life out of the world-spirit language model it was built on top of.

John Quiggin

This is great! I was going to write something vaguely along the lines of your discussion of bias, but instead I will go away and digest this for a while.
