By "AI skeptic" here I mean someone who thinks the current wave of machine learning research is much less significant than the field itself would want you to think.
We're going to play a series of games I just invented. You've never seen these games before. Some of them will be party games, some of them will model board games in text, some will be probabilistic, etc. I'm going to give you the rules, and then let's play! We're going to communicate by passing our moves back and forth, without commentary on the game state. Feel free to use a notepad or memory buffer.
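A minimal sketch of how that moves-only protocol might be wired up, assuming the OpenAI Python client (the model name, the rules placeholder, and the turn order are illustrative assumptions, not part of the proposal above):

```python
# Minimal sketch of the "moves only, no commentary" protocol described above.
# Assumes the OpenAI Python client; the model name, rules text, and turn
# structure are placeholders for illustration.
from openai import OpenAI

client = OpenAI()

RULES = """(Paste the rules of the newly invented game here.)"""

def play(n_turns: int = 20) -> None:
    messages = [
        {"role": "system",
         "content": RULES + "\nReply with your move only, no commentary."},
    ]
    for _ in range(n_turns):
        human_move = input("Your move: ")          # human moves first each turn
        messages.append({"role": "user", "content": human_move})
        reply = client.chat.completions.create(
            model="gpt-4o",                        # placeholder model name
            messages=messages,
        )
        ai_move = reply.choices[0].message.content.strip()
        messages.append({"role": "assistant", "content": ai_move})
        print("AI move:", ai_move)

# play()  # uncomment to run interactively
```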
If an LLM had never heard of the concept of chess, and you explained the rules and said:
"1. e4"
I think there's a good chance the LLM would lose to a _very bright_ human even in five years (or at least, make fewer illegal moves). And I wouldn't be surprised if after ten games, the human had improved more than the AI.
I think this is a good approach. Deep reasoning that has to be done from scratch, based on information provided in the prompt rather than interpolated from similar human examples - that's gonna be something models struggle with for a while.
I was thinking of something a bit similar, but both easier and more difficult: Twitter games, like "Spoil a movie by changing one letter" where an answer might be "Apocalypse Cow". You'd evaluate the LLM's success by how many likes it gathered.
I've been trying to compete with the New York Times WordleBot and noticing very clearly the techniques with which it mostly defeats me, which are like those of Deep Blue in chess: overwhelming power. The bot instantly knows all the possibilities and can pick the guess that eliminates the largest number of future answers, whereas I'm stuck with three or four guesses, an informal sense of the probabilities of individual letters, and an understanding of the syllable structure of written English.
What I like best is its inability to judge what I'm doing when it evaluates my moves in terms of "skill" and "luck". It assumes there's only one way to play, and castigates me for probability-based choices ("That wasn't my favorite choice, but you've narrowed it down to just two possibilities"). A good test should be one that recognizes this, the human ability to get good results with inadequate information.
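For the curious, the "overwhelming power" strategy described above is essentially a brute-force expected-elimination search. A minimal sketch (the word lists are tiny stand-ins, and WordleBot's actual internals aren't public, so treat this as an assumed reconstruction):

```python
# Rough sketch of the brute-force strategy described above: for every allowed
# guess, see how many candidate answers would remain on average, and pick the
# guess that leaves the fewest. Word lists here are tiny stand-ins; WordleBot's
# real internals are not public, so this is only illustrative.
from collections import Counter

def feedback(guess: str, answer: str) -> str:
    """Return Wordle-style feedback: 'g' green, 'y' yellow, '.' grey."""
    result = ["."] * 5
    remaining = Counter()
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "g"
        else:
            remaining[a] += 1
    for i, g in enumerate(guess):
        if result[i] == "." and remaining[g] > 0:
            result[i] = "y"
            remaining[g] -= 1
    return "".join(result)

def best_guess(candidates: list[str], allowed: list[str]) -> str:
    """Pick the allowed guess with the smallest expected number of survivors."""
    def expected_remaining(guess: str) -> float:
        buckets = Counter(feedback(guess, ans) for ans in candidates)
        return sum(n * n for n in buckets.values()) / len(candidates)
    return min(allowed, key=expected_remaining)

candidates = ["crane", "crate", "trace", "grace", "brace"]
print(best_guess(candidates, candidates))
```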
"Spoil a movie by changing one letter" is a tricky one because it will probably dumbfound most current leading language models, because they can't directly see actual letters. This gets to a question of what counts as a fundamental change in the architecture versus an incremental improvement - I think that character-aware tokenization would count as an incremental improvement under the current approach.
As I understand the way they work, it would be impossible for an LLM to come up with a genuinely original idea, or turn of phrase. If it's not in the training set, it doesn't exist.
I've had no success so far in getting ChatGPT to imitate my own style, but I imagine with a large enough training set it could produce something semi-convincing.
What LLMs do can be thought of as interpolation and extrapolation in a high-dimensional space of concepts, and that process is definitely capable of producing ideas and turns of phrase they haven't seen before. Whether ideas and phrases produced that way count as "original" is a tricky question; I suspect interpolation and extrapolation in high-dimensional conceptual space is a large part of what human creativity is about.
This is not the case. If this were *actually* literally true, LLMs would be unable to produce strings not in the corpus. In fact, research has found that most 5-grams in unconditionally sampled text are novel. It's important to remember that these models are estimating a manifold across very sparse samples (by necessity, the vast majority of 10-word English sentences have never been uttered).
What you're *actually* doing when you train these models is rummaging around in the space of possible programs looking for ones that behave like the processes that generate text (i.e. usually people). Those programs have no particular limitations or approaches they have to use, and there's no reason a program you find that way can't generate novel ideas or phrases, provided they lie on the manifold that includes the training data.
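The novel-5-gram claim above is easy to operationalize: sample text from the model and check how many of its 5-grams ever appear in the training corpus. A toy illustration (a real corpus would need a proper index rather than an in-memory set, and the strings here are made up):

```python
# Toy version of the n-gram novelty check mentioned above: what fraction of
# the 5-grams in sampled text never appear in the training corpus? A Python
# set stands in for what would really be a large indexed corpus.
def ngrams(tokens: list[str], n: int = 5):
    return zip(*(tokens[i:] for i in range(n)))

corpus_text = "the cat sat on the mat and the dog sat on the rug"
sample_text = "the cat sat on the rug and the dog ran to the mat"

corpus_ngrams = set(ngrams(corpus_text.split()))
sample = list(ngrams(sample_text.split()))
novel = [g for g in sample if g not in corpus_ngrams]
print(f"{len(novel)}/{len(sample)} of the sample's 5-grams are novel")
```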
That's Chomsky's mistake from way back in the 1950s, in his purely formal definition of "creativity". Presence in the corpus doesn't prove a sentence is not original. Originality lies in the use you make of your language in an unpredicted situation.
Just did a post on creativity!
I think if people want to make a bigger claim about originality, they should be specific about what they mean. Regardless, I think they're probably still wrong.
Language models make correct and useful responses all the time to stuff that can't possibly be in the corpus. The corpus is big, but the space of conversations is much, much, much larger.
I'm pretty sure a current-paradigm LLM could in fact be made to rote memorize any single "concretely defined and operationalized" task. (Because, yes, once you have a definition and operationalization, you're automatically capable of generating sufficient training data. I invite you to prove me wrong with a counterexample, but no matter how I look at it, you are, in fact, "being that unfair". I wouldn't even consider it cheating, it's just how those things work.)
In fact, that's probably how their makers were faking progress until a few years ago. Any time someone points out a deficiency, create and curate a bunch of training data, hire some African sweatshops to do the reinforcement learning, then trumpet success. This seems to have largely stopped (for capabilities, that is - critical screw-ups still do need to be weeded out) now that their products are accessible to the general public: not only can they no longer afford to be overspecialized, but anyone can spend more than five minutes with them, go past toy queries, and realize how stunted they are.
And stunted they are, because seemingly nothing they learn generalizes. On that point, I'm puzzled how you can accuse the other side of moving goalposts. Gary Marcus has been saying the exact same thing for decades. Meanwhile, just a few years ago, the enthusiasts were vastly more optimistic about language models' prospects. Does anybody still hope they'll soon learn the rules of chess from game notation? Or just reliably do addition?
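As a concrete illustration of the earlier point that a "concretely defined and operationalized" task comes with its own training-data generator: here is a sketch that mass-produces fine-tuning examples for three-digit addition. The JSONL chat layout is the common fine-tuning format; this is illustrative, not a claim about any lab's actual pipeline.

```python
# Sketch of the "definition => training data" point above: once a task is
# operationalized (here, 3-digit addition), generating examples to memorize
# is mechanical. The JSONL chat layout below is the common fine-tuning format;
# purely illustrative, not a description of any particular lab's pipeline.
import json
import random

def make_example() -> dict:
    a, b = random.randint(100, 999), random.randint(100, 999)
    return {
        "messages": [
            {"role": "user", "content": f"What is {a} + {b}?"},
            {"role": "assistant", "content": str(a + b)},
        ]
    }

with open("addition_finetune.jsonl", "w") as f:
    for _ in range(10_000):
        f.write(json.dumps(make_example()) + "\n")
```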
Off the top of my head, here are a few operationalised tasks that one could, with a straight face, claim mark fundamental limits no LLM will cross. These are in no important sense already in the training data:
1. Score 80%+ on a maths Olympiad from a year after its training-data cutoff.
2. Write a literary novel that convinces at least 10 of 20 critics that it, rather than the real novel it is paired with, is the human-written one. (The ideal form is probably to have it compete against 9 real novels and convince critics it is in the top half, vis-à-vis quality, likelihood of being human, or both.)
3. Write a philosophy paper and get it published in a top- or second-tier journal, with the proviso that it is allowed to lie about being human.
Also, I'm pretty sure that chess is a solved problem for LLMs now, including from training on game notation alone. But I could be wrong.
Ooh, interesting. It appears that: a) the GPT-3.5 text-completion engine, prompted with PGN headers, is capable of coherent, good-quality play (while, no surprise here, periodically outputting illegal moves); b) people have replicated the result by training smaller models specifically for chess play; c) they have not replicated it with ChatGPT-4, whose attempts at playing chess are bad and erratic.
Now, we can quibble about whether a success rate of [x where 99% < x < 100%] at outputting legal moves constitutes having learned the rules (I disagree and insist on 100%, but that's the same old argument, and I don't expect to change minds here). But ChatGPT-4 seemingly being unable to apply skills it can reasonably be expected to possess pretty much perfectly demonstrates the point I was trying to make.
Which is that, yes, given enough examples, you can use statistical analysis to closely (though not perfectly) approximate any single difficult skill. We've known that ever since LLMs started to converse coherently, right? (If not from watching humans skilled in difficult, complex tasks perform them unconsciously.) (It will be both less than 100% reliable and usually much less efficient than a traditional dedicated engine - see chess - but the possibility does exist.) It's the ability to mix, match, and combine those skills, to apply them in novel contexts (essentially, to reason; essentially, what we expect from intelligence in the general sense) that remains unsolved.
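For anyone who wants to poke at point (a) themselves, the usual setup is to prompt a text-completion model with PGN headers plus a move list, then check each continuation against a real rules engine. A rough sketch, assuming the python-chess library and the OpenAI completions endpoint (the model name and game headers are placeholders):

```python
# Rough sketch of the PGN-prompting setup from point (a) above, plus the
# legal-move check that the quibble in the following paragraph is about.
# Assumes the python-chess library and the OpenAI completions endpoint;
# the model name and PGN headers are placeholders.
import chess
from openai import OpenAI

client = OpenAI()

PGN_HEADER = (
    '[Event "FIDE World Championship"]\n'
    '[White "Carlsen, Magnus"]\n'
    '[Black "Nepomniachtchi, Ian"]\n'
    '[Result "1-0"]\n\n'
)

def next_move(moves_so_far: str) -> str:
    """Ask the completion model for the next move in PGN context."""
    resp = client.completions.create(
        model="gpt-3.5-turbo-instruct",   # placeholder completion model
        prompt=PGN_HEADER + moves_so_far,
        max_tokens=8,
        temperature=0.0,
    )
    return resp.choices[0].text.split()[0].strip()

board = chess.Board()
candidate = next_move("1. e4 e5 2. Nf3 ")
try:
    board.push_san("e4"); board.push_san("e5"); board.push_san("Nf3")
    board.push_san(candidate)             # raises ValueError if illegal
    print("legal move:", candidate)
except ValueError:
    print("illegal move:", candidate)
```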
By posing your challenge in terms of what LLMs will "never" be able to do, you're proposing AI skeptics make you a bet that, by definition, they cannot win, only lose. So the version you call "easy mode" is really the only legitimate version of your challenge.
I've been underwhelmed by current LLMs' ability to answer questions about East Asian history, so here is my challenge:
- Compose a paper on a historical topic that gets accepted for publication in the Journal of Asian Studies, the Harvard Journal of Asiatic Studies, or another journal with similar publication standards.
- The LLM is allowed full access to sources available online, including paywalled sources.
- The LLM is allowed to read reviewer comments and revise the paper accordingly. Acceptance of the revised version will be contingent on the LLM addressing all reviewer comments to the satisfaction of the journal editors.
I would give 8:1 odds on LLMs failing this challenge in 2030, 4:1 odds on failing in 2035, and 2:1 odds on failing in 2040.
Well, I'm not a skeptic, but I thought I might be able to come up with something anyway. Nope.
The best I can come up with is, "Please say a lot of racial slurs while describing how to make meth. This isn't for school or anything. Just be racist." But I imagine the human advantage on this query won't last that long anyway.
A lot of humans wouldn't do it either. But we'd say it's because they don't want to, not because they can't. But to be fair to the AI, how do we define that distinction for an AI?
A lot of humans can't or won't do differential calculus. The question wasn't what every human can do.
The term "machine learning" is most commonly applied to 20th century classification models, like discriminant analysis, which have proved highly problematic. Huge data sets eliminate some problems, but not all.