Sam Altman has recently suggested that there is little more juice to be had in raw scaling:
“I think we’re at the end of the era where it’s gonna be these giant models, and we’ll make them better in other ways,”
I’d like to pre-register a few guesses on this topic.
There is still a little bit of juice in scaling. It’s not 100% done, just 80-90%, and the remaining bit matters.
Nonetheless, Altman has a point about scaling, and he’s right about there being alternatives to scaling too. There are almost too many options for improvement apart from scaling- it’s overwhelming, and it’s easy to generate new ideas. The key to many, but not all, of these ideas is to treat the current models as the core processor at the heart of a more complex system. This includes doubling down on:
Chain of thought and similar strategies: https://arxiv.org/abs/2201.11903
Reflection: https://arxiv.org/abs/2303.11366
Retrieval: https://proceedings.mlr.press/v162/borgeaud22a.html
Selecting between multiple drafts to pick one to show to the user.
Self-criticizing and editing its own output before showing it to the user (see the sketch after this list for a toy combination of this and the previous idea)
Connecting it with various tools: https://arxiv.org/abs/2302.04761
Multiple models voting
More reliance on finetuned models that are still large but domain-specific
Countless more exotic ideas- e.g. something like Monte Carlo tree search over candidate paragraphs. https://www.geeksforgeeks.org/ml-monte-carlo-tree-search-mcts/
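To make the “core processor” framing concrete, here is a minimal sketch combining two of the items above: sample several drafts, have the model critique each one, and show the user only the highest-scoring draft. It assumes only some function `generate` that calls whatever LLM you have available; the prompts and the 1-10 scoring scheme are illustrative assumptions, not drawn from the linked papers.

```python
# Minimal sketch: multiple drafts + self-criticism, model as the core processor.
import re
from typing import Callable

def best_of_n(generate: Callable[[str], str], task: str, n: int = 5) -> str:
    # Sample several candidate drafts from the same underlying model.
    drafts = [generate(f"Answer the following:\n{task}") for _ in range(n)]
    scored = []
    for draft in drafts:
        # Ask the model to criticize its own output with a numeric score.
        critique = generate(
            "Rate the following answer from 1 to 10 for correctness and clarity. "
            f"Reply with a single number.\n\nTask: {task}\n\nAnswer: {draft}"
        )
        match = re.search(r"\d+", critique)
        scored.append((int(match.group()) if match else 0, draft))
    # Only the draft the model itself rated highest reaches the user.
    return max(scored, key=lambda pair: pair[0])[1]
```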
This guy has put it better than anyone else I’ve read so far- https://www.beren.io/2023-04-11-Scaffolded-LLMs-natural-language-computers/
His argument is that what we’re essentially building is a new kind of computer, for which the LLM, however large, is only the CPU- a machine perhaps even partly modelled on the human mind; not something that spits out words one at a time, but something that engages specific ‘faculties’ for metacognition, planning, working memory, long-term memory, self-criticism, imagination, working with external tools (calculators, information storage) and so on.
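A toy sketch of that framing, with the LLM as the CPU and the surrounding program supplying working memory and tools. The step format and tool protocol here are my own illustrative assumptions, not anything prescribed in the linked post.

```python
# Toy sketch of a scaffolded LLM: the model only decides the next step,
# the scaffold supplies memory and tools.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Scaffold:
    generate: Callable[[str], str]            # the LLM "CPU"
    tools: dict[str, Callable[[str], str]]    # e.g. calculator, search, storage
    working_memory: list[str] = field(default_factory=list)

    def run(self, goal: str, max_steps: int = 10) -> str:
        for _ in range(max_steps):
            prompt = (
                f"Goal: {goal}\n"
                "Working memory:\n" + "\n".join(self.working_memory) + "\n"
                f"Tools: {', '.join(self.tools)}\n"
                "Reply with 'TOOL <name> <input>' or 'FINISH <answer>'."
            )
            step = self.generate(prompt).strip()
            if step.startswith("FINISH"):
                return step[len("FINISH"):].strip()
            if step.startswith("TOOL"):
                parts = step.split(" ", 2)
                if len(parts) == 3:
                    _, name, arg = parts
                    result = self.tools.get(name, lambda _: "unknown tool")(arg)
                    # Tool results go into working memory, not back into the weights.
                    self.working_memory.append(f"{name}({arg}) -> {result}")
        return "gave up: step budget exhausted"
```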
A lot of people are going to mistake the end of scaling for the end of the LLM boom. That’s silly. Even if the fundamental technology never advanced significantly further, there’d still be a lot of work to do in rolling out applications.
ENDNOTE
A quick clarification. When I predict scaling is running out of juice, it's not so much because of in-principle limits as because of practical constraints on FLOPs and training data. Scaling isn't going to stop being a factor; it's just going to stop being the most important driver.
I want to point out that many of the things in the list above would let us keep training the model with good old gradient descent and backpropagation, the gift that keeps on giving, without any additional human-generated data. That's built into MCTS in particular, but you could imagine other schemes based on, say, chain of thought: ask the model to invent a problem, solve it with chain of thought, and then train it to predict that final answer directly, without writing down any intermediate reasoning. If the problem were, say, a programming or maths problem, you could also give the model access to an external API to check the answer. Remember, a state-of-the-art language model can already do this quite reliably. Any scheme that improves the output of the model, even a little, is also a scheme for creating new training data! In light of AlphaZero, we might expect this sort of thing to take us very far- strap in, folks.
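A rough sketch of that bootstrapping loop, assuming only an LLM call `generate` and an external checker `verify` (e.g. a code runner or a calculator API); both are stand-ins rather than a real pipeline.

```python
# Rough sketch: the model generates its own training data via chain of thought.
from typing import Callable, Optional, Tuple

def make_training_example(
    generate: Callable[[str], str],
    verify: Callable[[str, str], bool],
) -> Optional[Tuple[str, str]]:
    # 1. The model invents a problem.
    problem = generate("Invent a short, self-contained arithmetic word problem.")
    # 2. The model solves it with chain of thought.
    solution = generate(
        f"Solve this step by step, then finish with 'ANSWER: <number>'.\n{problem}"
    )
    answer = solution.rsplit("ANSWER:", 1)[-1].strip()
    # 3. Check the answer with an external tool and discard failures.
    if not verify(problem, answer):
        return None
    # 4. Train on (problem -> answer): predict the chain-of-thought result
    #    without writing out the intermediate reasoning.
    return (f"{problem}\nAnswer directly:", answer)
```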
The reason Altman is saying that further increases in training and model size are no longer worth the cost is that the transformer architecture itself is running out of juice. Reducing the quadratic growth of attention with context length (as Hyena does) would restart the scaling cycle.
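For a sense of what "quadratic growth" means in practice, here is back-of-the-envelope arithmetic for the attention score matrix alone; the numbers are purely illustrative.

```python
# The attention score matrix is seq_len x seq_len, so doubling the context
# roughly quadruples this part of the cost.
def attention_score_flops(seq_len: int, d_head: int) -> int:
    # QK^T alone: seq_len * seq_len * d_head multiply-adds per head, per layer.
    return seq_len * seq_len * d_head

print(attention_score_flops(4096, 128))   # 2,147,483,648
print(attention_score_flops(8192, 128))   # 8,589,934,592  (4x, not 2x)
```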