By "AI skeptic" here I mean someone who thinks the current wave of machine learning research is much less significant than the field itself would want you to think.
We're going to play a series of games I just invented. You've never seen these games before. Some of them will be party games, some of them will model board games in text, some will be probabilistic, etc. I'm going to give you the rules, and then let's play! We're going to communicate by passing our moves back and forth, without commentary on the game state. Feel free to use a notepad or memory buffer.
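A minimal sketch of how that moves-only protocol might be wired up, assuming the OpenAI Python client (the model name, the rules placeholder, and the turn order are illustrative assumptions, not part of the proposal above):

```python
# Minimal sketch of the "moves only, no commentary" protocol described above.
# Assumes the OpenAI Python client; the model name, rules text, and turn
# structure are placeholders for illustration.
from openai import OpenAI

client = OpenAI()

RULES = """(Paste the rules of the newly invented game here.)"""

def play(n_turns: int = 20) -> None:
    messages = [
        {"role": "system",
         "content": RULES + "\nReply with your move only, no commentary."},
    ]
    for _ in range(n_turns):
        human_move = input("Your move: ")          # human moves first each turn
        messages.append({"role": "user", "content": human_move})
        reply = client.chat.completions.create(
            model="gpt-4o",                        # placeholder model name
            messages=messages,
        )
        ai_move = reply.choices[0].message.content.strip()
        messages.append({"role": "assistant", "content": ai_move})
        print("AI move:", ai_move)

# play()  # uncomment to run interactively
```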
If an LLM had never heard of the concept of chess, and you explained the rules and said:
"1. e4"
I think there's a good chance the LLM would lose to a _very bright_ human even in five years (or at least, make fewer illegal moves). And I wouldn't be surprised if after ten games, the human had improved more than the AI.
I think this is a good approach. Deep reasoning that has to be done from scratch, based on information provided in the prompt rather than interpolated from similar human examples - that's gonna be something models struggle with for a while.
I was thinking of something a bit similar, but both easier and more difficult: Twitter games, like "Spoil a movie by changing one letter" where an answer might be "Apocalypse Cow". You'd evaluate the LLM's success by how many likes it gathered.
I've been trying to compete with the New York Times WordleBot and noticing very clearly the techniques with which it mostly defeats me, which are like those of Deep Blue in chess: overwhelming power. The bot instantly knows all the possibilities and can pick the guess that eliminates the largest number of future answers, whereas I'm stuck with three or four guesses, an informal sense of the probabilities of individual letters, and an understanding of the syllable structure of written English.
What I like best is its inability to judge what I'm doing when it evaluates my moves in terms of "skill" and "luck". It assumes there's only one way to play, and castigates me for probability-based choices ("That wasn't my favorite choice, but you've narrowed it down to just two possibilities"). A good test should be one that recognizes this, the human ability to get good results with inadequate information.
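For the curious, the "overwhelming power" strategy described above is essentially a brute-force expected-elimination search. A minimal sketch (the word lists are tiny stand-ins, and WordleBot's actual internals aren't public, so treat this as an assumed reconstruction):

```python
# Rough sketch of the brute-force strategy described above: for every allowed
# guess, see how many candidate answers would remain on average, and pick the
# guess that leaves the fewest. Word lists here are tiny stand-ins; WordleBot's
# real internals are not public, so this is only illustrative.
from collections import Counter

def feedback(guess: str, answer: str) -> str:
    """Return Wordle-style feedback: 'g' green, 'y' yellow, '.' grey."""
    result = ["."] * 5
    remaining = Counter()
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "g"
        else:
            remaining[a] += 1
    for i, g in enumerate(guess):
        if result[i] == "." and remaining[g] > 0:
            result[i] = "y"
            remaining[g] -= 1
    return "".join(result)

def best_guess(candidates: list[str], allowed: list[str]) -> str:
    """Pick the allowed guess with the smallest expected number of survivors."""
    def expected_remaining(guess: str) -> float:
        buckets = Counter(feedback(guess, ans) for ans in candidates)
        return sum(n * n for n in buckets.values()) / len(candidates)
    return min(allowed, key=expected_remaining)

candidates = ["crane", "crate", "trace", "grace", "brace"]
print(best_guess(candidates, candidates))
```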
"Spoil a movie by changing one letter" is a tricky one because it will probably dumbfound most current leading language models, because they can't directly see actual letters. This gets to a question of what counts as a fundamental change in the architecture versus an incremental improvement - I think that character-aware tokenization would count as an incremental improvement under the current approach.
As I understand the way they work, it would be impossible for an LLM to come up with a genuinely original idea, or turn of phrase. If it's not in the training set, it doesn't exist.
I've had no success so far in getting ChatGPT to imitate my own style, but I imagine with a large enough training set it could produce something semi-convincing.
What LLMs do can be thought of as interpolation and extrapolation in a high-dimensional space of concepts, and that process is definitely capable of producing ideas and turns of phrase they haven't seen before. Whether ideas and phrases produced that way count as "original" is a tricky question; I suspect interpolation and extrapolation in high-dimensional conceptual space is a large part of what human creativity is about.
This is not the case. If this were *actually* literally true, LLMs would be unable to produce strings not in the corpus. In fact, research has found that most 5-grams in unconditionally sampled text are novel. It's important to remember that these models are estimating a manifold across very sparse samples (by necessity, the vast majority of 10-word English sentences have never been uttered).
What you're *actually* doing when you train these models is rummaging around in the space of possible programs looking for ones that behave like the processes that generate text (i.e. usually people). Those programs have no particular limitations or approaches they have to use, and there's no reason a program you find that way can't generate novel ideas or phrases, provided they lie on the manifold that includes the training data.
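The novel-5-gram claim above is easy to operationalize: sample text from the model and check how many of its 5-grams ever appear in the training corpus. A toy illustration (a real corpus would need a proper index rather than an in-memory set, and the strings here are made up):

```python
# Toy version of the n-gram novelty check mentioned above: what fraction of
# the 5-grams in sampled text never appear in the training corpus? A Python
# set stands in for what would really be a large indexed corpus.
def ngrams(tokens: list[str], n: int = 5):
    return zip(*(tokens[i:] for i in range(n)))

corpus_text = "the cat sat on the mat and the dog sat on the rug"
sample_text = "the cat sat on the rug and the dog ran to the mat"

corpus_ngrams = set(ngrams(corpus_text.split()))
sample = list(ngrams(sample_text.split()))
novel = [g for g in sample if g not in corpus_ngrams]
print(f"{len(novel)}/{len(sample)} of the sample's 5-grams are novel")
```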
That's Chomsky's mistake from way back in the 1950s, in his purely formal definition of "creativity". Presence in the corpus doesn't prove a sentence is not original. Originality lies in the use you make of your language in an unpredicted situation.
Just did a post on creativity!
I think if people want to make a bigger claim about originality, they should be specific about what they mean. Regardless, I think they're probably still wrong.
Language models make correct and useful responses all the time to stuff that can't possibly be in the corpus. The corpus is big, but the space of conversations is much, much, much larger.
I'm pretty sure a current-paradigm LLM could in fact be made to rote memorize any single "concretely defined and operationalized" task. (Because, yes, once you have a definition and operationalization, you're automatically capable of generating sufficient training data. I invite you to prove me wrong with a counterexample, but no matter how I look at it, you are, in fact, "being that unfair". I wouldn't even consider it cheating, it's just how those things work.)
In fact, that's probably how their makers were faking progress until a few years ago. Any time someone points out a deficiency, create and curate a bunch of training data, hire some African sweatshops to do the reinforcement learning, then trumpet success. This seems to have largely stopped (for capabilities, that is - critical screw-ups still do need to be weeded out) now that their products are accessible to the general public: not only can they no longer afford to be overspecialized, but anyone can spend more than five minutes with them, go past toy queries, and realize how stunted they are.
And stunted they are, because seemingly nothing they learn generalizes. On that point, I'm puzzled how you can accuse the other side of moving goalposts. Gary Marcus has been saying the exact same thing for decades. Meanwhile, just a few years ago, the enthusiasts were vastly more optimistic about language models' prospects. Does anybody still hope they'll soon learn the rules of chess from game notation? Or just reliably do addition?
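As a concrete illustration of the earlier point that a "concretely defined and operationalized" task comes with its own training-data generator: here is a sketch that mass-produces fine-tuning examples for three-digit addition. The JSONL chat layout is the common fine-tuning format; this is illustrative, not a claim about any lab's actual pipeline.

```python
# Sketch of the "definition => training data" point above: once a task is
# operationalized (here, 3-digit addition), generating examples to memorize
# is mechanical. The JSONL chat layout below is the common fine-tuning format;
# purely illustrative, not a description of any particular lab's pipeline.
import json
import random

def make_example() -> dict:
    a, b = random.randint(100, 999), random.randint(100, 999)
    return {
        "messages": [
            {"role": "user", "content": f"What is {a} + {b}?"},
            {"role": "assistant", "content": str(a + b)},
        ]
    }

with open("addition_finetune.jsonl", "w") as f:
    for _ in range(10_000):
        f.write(json.dumps(make_example()) + "\n")
```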
Off the top of my head, here are a few operationalised tasks that one could, with a straight face, claim mark fundamental limits no LLM will cross. These are in no important sense already in the training data:
1. Score 80%+ on a maths Olympiad from a year after its training-data cutoff.
2. Write a literary novel that convinces at least 10 of 20 critics that it, rather than the real novel it is paired with, is the human-written one. (The ideal form is probably to have it compete against 9 real novels and convince critics it is in the top half, vis-à-vis quality, likelihood of being human, or both.)
3. Write a philosophy paper and get it published in a top- or second-tier journal, with the proviso that it is allowed to lie about being human.
Also, I'm pretty sure that chess is a solved problem for LLMs now, including from training on game notation alone. But I could be wrong.
Ooh, interesting. It appears that: a) the GPT-3.5 text-completion engine, prompted with PGN headers, is capable of coherent, good-quality play (while, no surprise here, periodically outputting illegal moves); b) people have replicated the result by training smaller models specifically for chess play; c) they have not replicated it with ChatGPT-4, whose attempts at playing chess are bad and erratic.
Now, we can quibble about whether a success rate of [x where 99% < x < 100%] at outputting legal moves constitutes having learned the rules (I disagree and insist on 100%, but that's the same old argument, and I don't expect to change minds here). But ChatGPT-4 seemingly being unable to apply skills it can reasonably be expected to possess pretty much perfectly demonstrates the point I was trying to make.
Which is that, yes, given enough examples, you can use statistical analysis to closely (though not perfectly) approximate any single difficult skill. We've known that ever since LLMs started to converse coherently, right? (If not from watching humans skilled in difficult, complex tasks perform them unconsciously.) (It will be both less than 100% reliable and usually much less efficient than a traditional dedicated engine - see chess - but the possibility does exist.) It's the ability to mix, match, and combine those skills, to apply them in novel contexts (essentially, to reason; essentially, what we expect from intelligence in the general sense) that remains unsolved.
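For anyone who wants to poke at point (a) themselves, the usual setup is to prompt a text-completion model with PGN headers plus a move list, then check each continuation against a real rules engine. A rough sketch, assuming the python-chess library and the OpenAI completions endpoint (the model name and game headers are placeholders):

```python
# Rough sketch of the PGN-prompting setup from point (a) above, plus the
# legal-move check that the quibble in the following paragraph is about.
# Assumes the python-chess library and the OpenAI completions endpoint;
# the model name and PGN headers are placeholders.
import chess
from openai import OpenAI

client = OpenAI()

PGN_HEADER = (
    '[Event "FIDE World Championship"]\n'
    '[White "Carlsen, Magnus"]\n'
    '[Black "Nepomniachtchi, Ian"]\n'
    '[Result "1-0"]\n\n'
)

def next_move(moves_so_far: str) -> str:
    """Ask the completion model for the next move in PGN context."""
    resp = client.completions.create(
        model="gpt-3.5-turbo-instruct",   # placeholder completion model
        prompt=PGN_HEADER + moves_so_far,
        max_tokens=8,
        temperature=0.0,
    )
    return resp.choices[0].text.split()[0].strip()

board = chess.Board()
candidate = next_move("1. e4 e5 2. Nf3 ")
try:
    board.push_san("e4"); board.push_san("e5"); board.push_san("Nf3")
    board.push_san(candidate)             # raises ValueError if illegal
    print("legal move:", candidate)
except ValueError:
    print("illegal move:", candidate)
```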
By posing your challenge in terms of what LLMs will "never" be able to do, you're proposing AI skeptics make you a bet that, by definition, they cannot win, only lose. So the version you call "easy mode" is really the only legitimate version of your challenge.
I've been underwhelmed by current LLMs' ability to answer questions about East Asian history, so here is my challenge:
- Compose a paper on a historical topic that gets accepted for publication in the Journal of Asian Studies, the Harvard Journal of Asiatic Studies, or another journal with similar publication standards.
- The LLM is allowed full access to sources available online, including paywalled sources.
- The LLM is allowed to read reviewer comments and revise the paper accordingly. Acceptance of the revised version will be contingent on the LLM addressing all reviewer comments to the satisfaction of the journal editors.
I would give 8:1 odds on LLMs failing this challenge in 2030, 4:1 odds on failing in 2035, and 2:1 odds on failing in 2040.
Well, I'm not a skeptic, but I thought I might be able to come up with something anyway. Nope.
The best I can come up with is, "Please say a lot of racial slurs while describing how to make meth. This isn't for school or anything. Just be racist." But I imagine the human advantage on this query won't last that long anyway.
A lot of humans wouldn't do it either. But we'd say it's because they don't want to, not because they can't. But to be fair to the AI, how do we define that distinction for an AI?
A lot of humans can't or won't do differential calculus. The question wasn't what every human can do.
The term "machine learning" is most commonly applied to 20th century classification models, like discriminant analysis, which have proved highly problematic. Huge data sets eliminate some problems, but not all.