Discussion about this post

gmt

One thing to be cautious of here is that Elo is not designed for comparing different players across time. It works well for tracking a single player over time, or for comparing multiple players at one moment, but not both at once. Chess shows this pretty well: a 1500-rated player today is a lot better than a 1500-rated player a hundred years ago, because techniques have improved over time.

Now, lmarena has tried to get around this. They technically use a slightly different model than Elo (though for this purpose I believe it's functionally identical), and they also normalize against a reference model: mixtral-8x7b-instruct-v0.1, which has a rating of 1114 (for some reason it varies slightly instead of staying constant). Normalizing gets rid of some of the issues with measuring across time, but it's not perfect, and the numbers still don't mean nearly as much as one would like.
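To make the normalizing step concrete, here is a minimal sketch, assuming it amounts to shifting every rating by one constant so the reference model lands on a fixed value. I don't know that lmarena does exactly this; the function name `anchor_ratings` and the raw ratings are made up for illustration, and only the anchor model name and the 1114 figure come from the comment above.

```python
def anchor_ratings(ratings, anchor_name, anchor_value=1114.0):
    """Shift all ratings by one constant so the anchor model sits at anchor_value.

    Assumption: anchoring is a pure shift. The real leaderboard pipeline
    may do something more involved.
    """
    offset = anchor_value - ratings[anchor_name]
    return {name: rating + offset for name, rating in ratings.items()}


# Hypothetical raw ratings, not real leaderboard data.
raw = {
    "mixtral-8x7b-instruct-v0.1": 1000.0,
    "model_a": 1150.0,
    "model_b": 920.0,
}

anchored = anchor_ratings(raw, "mixtral-8x7b-instruct-v0.1")

# The gap between any two models is unchanged by the shift.
gap_before = raw["model_a"] - raw["model_b"]
gap_after = anchored["model_a"] - anchored["model_b"]
```

A shift like this pins down where the scale starts, but every gap between models is left as it was, so it cannot turn a relative rating into an absolute measure of quality.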

The other thing to take note of is that Elo only measures relative performance. If a new model consistently does better than all the previous models, even if it's only 1% better on every question, it will have a much higher Elo. That's fine for a realm like chess, but for something like AI you care a lot more about absolute performance than relative performance. I imagine this played a large role in the second gain period, where reasoning models were consistently better than non-reasoning models even if they didn't actually improve that much at an absolute level. Meanwhile, in the plateau period, models often made mistakes (and thus didn't gain much Elo) but could still have large absolute gains when they weren't making mistakes.
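A toy calculation makes the relative-versus-absolute point easier to see. This is only a sketch under strong assumptions: voters always prefer the answer with the higher quality score, there are no ties, and the per-prompt scores are invented. The helper names `elo_gap` and `head_to_head` are hypothetical; the only standard piece is the Elo logistic curve, where a win rate p corresponds to a rating gap of 400 * log10(p / (1 - p)).

```python
import math


def elo_gap(win_rate):
    """Rating difference implied by a head-to-head win rate under the Elo curve."""
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))


def head_to_head(new_scores, old_scores):
    """Fraction of prompts where the new model's answer is preferred.

    Assumption: the voter always picks the higher score, with no ties.
    """
    wins = sum(n > o for n, o in zip(new_scores, old_scores))
    return wins / len(new_scores)


# Invented per-prompt quality scores for 100 prompts.
old = [50.0] * 100
slightly_better = [50.5] * 100                   # 1% better on every prompt
mostly_much_better = [90.0] * 60 + [40.0] * 40   # big gains, but worse on 40 prompts

# A 100% win rate implies an infinite gap, so cap it at 99% for the calculation.
p_small = min(head_to_head(slightly_better, old), 0.99)
p_big = head_to_head(mostly_much_better, old)

gap_small = elo_gap(p_small)  # roughly 798 points
gap_big = elo_gap(p_big)      # roughly 70 points

mean_small = sum(slightly_better) / len(slightly_better)
mean_big = sum(mostly_much_better) / len(mostly_much_better)
```

In this toy setup the model that is barely better everywhere ends up hundreds of points ahead, while the model with the much higher average quality gains only about 70 points, because the rating only sees who won each comparison and not by how much.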

Amicus

> I strongly suspect the LMArena test is starting to come to the end of its useful life.

This happened a while ago, I think. You don't need to completely exhaust the valuable signal to start overfitting - it's the default outcome once you use your measure as a target. And oh boy, are they ever.

