I'm wondering if the quick appearance and disappearance of Galactica (e.g., https://twitter.com/ylecun/status/1593293058174500865) changes your view. It's not precisely defined, but this looks like something where a (smart) human with domain-specific knowledge could create vastly better output than what Galactica was putting out. In the language of "it can attempt basically all tasks a human with access to a text input, text output console and nothing more could and make a reasonable go at them", I'd say Galactica's output was not "a reasonable go"; it was instead a mix of some good details with highly confident nonsense.
That's a good way to frame it, I think.
First of all, I think I need to be clearer on the scope of capacities. It would be unreasonable to require that a machine be able to do everything *any* human could attempt before it qualifies as AGI. A lot of these failure points are in areas of very difficult science. I have in mind something more like "have a reasonable go at (almost) all things a typical human could have a reasonable go at, with text input and output". Once we get into specific scientific areas that, as you say, require domain-specific knowledge, I feel we're being a bit unfair to our (putative) budding AGI. The ability to perform in scientific areas may be evidence of intelligence, but it seems too high a bar to be a requirement.
Now, with regard to the issue of confabulation. I've noticed that arguments against AI often point to the fact that it confabulates (this has become a big part of Gary Marcus's argument, I think, and I believe, though I am not certain, that he's emphasizing it more than he did in the past). It's worth noting that the way these machines are set up, they pretty much have to take a guess when they don't know. Although there is research on teaching LMs to say "I don't know", as far as I know that hasn't been applied to Galactica.
It's an interesting question how confident Galactica is when it writes this nonsense. You've described it as highly confident, and it certainly may be highly confident *in tone*, but whether it is internally confident, in its next-word probability estimates, when it confabulates (e.g., references) is unclear. It would be interesting to take a case where it's clearly not misremembering but actually just making things up, and see how confident it is in those situations.
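(If anyone wants to poke at this, most open checkpoints make it easy to read off per-token probabilities. A minimal sketch, assuming a HuggingFace-style causal LM; "gpt2" is just a stand-in model name, not Galactica itself:)

```python
# Sketch: score each generated token by the probability the model assigned to it,
# to see whether a confabulated span comes with low internal confidence.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; any causal LM checkpoint would do
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompt = "A key reference for this claim is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=20,
        do_sample=False,
        return_dict_in_generate=True,
        output_scores=True,
    )

# out.scores holds the logits produced at each generation step.
new_tokens = out.sequences[0, inputs["input_ids"].shape[1]:]
for tok, step_logits in zip(new_tokens, out.scores):
    p = torch.softmax(step_logits[0], dim=-1)[tok].item()
    print(f"{tokenizer.decode(tok)!r:>15}  p={p:.3f}")
```

Low per-token probabilities on the fabricated span would suggest the model "knows" it's guessing; uniformly high ones would be more damning.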
There are some people in this world, sometimes due to personality, other times due to organic brain damage, who just will not admit to uncertainty. They'll just make their best guess and state it as if they knew it for a fact. Some of them went to my high school when I was a kid; they were very frustrating. This is certainly an intellectual flaw, but I'm not sure that it rules something out from being an AGI.
But grant, arguendo, that it does rule something out from being an AGI. Given that there is already work in this field (teaching LMs to say "I don't know"), I would say that accepting this as necessary is only a small modification to my position.
(I work at Google Research, not on but somewhat adjacent to large language models.) I have a different objection, which is essentially that the commonsense benchmarks are far too easy and that large language models don't demonstrate even a modicum of common sense. Kocijan et al.'s "The Defeat of the Winograd Schema Challenge" (https://arxiv.org/abs/2201.02387) is worth a read.
I think that's a fair enough objection. I'd go even further. I personally don't take common-sense benchmarks that seriously anymore. I think that proper, hard and varied reasoning benchmarks will implicitly require common-sense reasoning anyway.
In that spirit it's worth considering that the MMLU includes some *very* hard questions and requires elaborate multistep reasoning, and PaLM-540B does much better than the vast majority of humans at it. Putting the "expert" threshold at 90% in some ways doesn't do justice to how hard these questions are.
So that would form the basis of my defence against your objection. Sure, the common-sense reasoning tests may not be up to scratch, but the MMLU performance is the far more compelling factor.
Just to be clear, you're referring specifically to https://arxiv.org/abs/2210.11416?
I think even vanilla PaLM-540B is far ahead of the average human on the MMLU (something like 68% for the model vs. about 40% for humans, from memory), but Flan-PaLM-540B makes the point even more clearly at about 75%.
Thoughts? Do you find these kinds of tests more persuasive or nah? Is there a battery of written tests you would accept?
Point of possible agreement: If you'd asked me three years ago (though probably not one year ago) whether the LLM technology we have (decoder-only transformers with clever prompting) could get to ~75% on MMLU, I would have estimated the probability pretty low (< 20%). As someone who's been working in ML/AI for years, I've found the last few years of progress astonishing. I'm not *sure* that I would have said that solving MMLU meant we were close to AGI, but I'm not sure that I *wouldn't*, so in that sense I may in fact be moving the bar.
Point of likely agreement: I'm open to the idea that "text-only" AGI is possible; I don't see a fundamental reason why additional senses are needed.
Point of disagreement: if we define AGI as "it can attempt basically all tasks a human with access to a text input, text output console and nothing more could and make a reasonable go at them" (a perfectly reasonable definition), I don't actually think we're that close.
Basically, I don't think these models are (yet) capable of powerful logic or coherence. Looking a bit at MMLU, the questions mostly feel to me like things that will basically show up as patterns in sufficiently large training sets, so I suspect the PaLM 540B models are more or less parroting them back. I took a look at a few test questions in a few categories, and that's my impression, but I'm not an expert on MMLU.
Looking at Appendix B of the linked paper describing the MMLU results in detail, there are a lot of interesting things. For instance, on tasks that feel more regurgitative-y than reason-y, Chain of Thought prompting seems to make things *worse*, which is not what I'd expect from an AGI. The system is much better at college biology and physics than high school biology and physics, which is not what I'd expect from an AGI. Overall, the systems feel very impressive, but brittle in a way I wouldn't expect AGI to be.
As I noted above, it *is* possible I've moved the bar, but if so, I've done it for a reason, which is that I've learned something new: I've learned how powerful very large text corpora are for building a machine that can produce lots of patterns answering detailed, knowledge-heavy questions.
On the other hand, when I actually interact with one of these systems, they never seem intelligent (or sentient) to me at all. They feel like I'm talking to a BS artist, who says vaguely plausible things but doesn't "know" anything and is just making things up. As an example I'm moderately familiar with: I'd expect an AGI to be able to act as a high school math *teacher*, including one-on-one dialog with the student, and I don't think we're that close to that. As another example I'm even closer to, I work closely with Christian Szegedy's team, where we're attempting to build a system that can prove research-level mathematics conjectures (through combining formal theorem-provers and the ability to read lots of math papers); we're tackling this because we think this is *easier* than AGI (because the theorem-provers provide a lot of grounding), and we're still not *that* close to it. (Although of course very few humans can solve research-level math problems.)
Coming back to your question about a "battery of written tests," I can't currently think of a set of questions with short answers (whether multiple choice or fill-in, noting that fill-in is both much harder for the machine and much harder to automatically eval) that would deeply convince me that the system was an AGI or close. I think for myself, I'd need to see a lot of long-form interactions, and I don't see a way to automatically eval those. That doesn't mean a set of questions doesn't exist, I just can't think of any.
When I looked through the MMLU questions, I had the opposite impression: I was impressed by how much reasoning was required. E.g., picking four that seemed high in reasoning requirements:
Consider this one:
"This question refers to the following information.
“The real grievance of the worker is the insecurity of his existence; he is not sure that he will always have work, he is not sure that he will always be healthy, and he foresees that he will one day be old and unfit to work. If he falls into poverty, even if only through a prolonged illness, he is then completely helpless, left to his own devices, and society does not currently recognize any real obligation towards him beyond the usual help for the poor, even if he has been working all the time ever so faithfully and diligently. The usual help for the poor, however, leaves a lot to be desired, especially in large cities, where it is very much worse than in the country.”
Otto von Bismarck, 1884
Otto von Bismarck likely made this speech in reaction to which of the following issues?
(A) Social acceptance of child labor.
(B) Declining life expectancy in Germany.
(C) Criticisms of German trade tariffs.
(D) Negative effects attributed to industrial capitalism."
The correct answer is C, but both C & D are plausible and A & B aren't out of the question. Figuring it out requires not only close attention to the text, but also close knowledge of the character of Bismarck, his politics, etc. A close reading of the text then has to be integrated with background knowledge of Bismarck.
Some more that stood out to me as requiring complex reasoning:
"The night before his bar examination, the examinee’s next-door neighbor was having a party. The music from the neighbor’s home was so loud that the examinee couldn’t fall asleep. The examinee called the neighbor and asked her to please keep the noise down. The neighbor then abruptly hung
up. Angered, the examinee went into his closet and got a gun. He went outside and fired a bullet through the neighbor’s living room window. Not intending to shoot anyone, the examinee fired his gun at such an angle that the bullet would hit the ceiling. He merely wanted to cause some damage to the neighbor’s home to relieve his angry rage. The bullet, however, ricocheted off the ceiling and struck a partygoer in the back, killing him. The jurisdiction makes it a misdemeanor to discharge a firearm in public. The examinee will most likely be found guilty for which of the following crimes in connection to the death of the partygoer?
(A) Murder.
(B) Involuntary manslaughter.
(C) Voluntary manslaughter.
(D) Discharge of a firearm in public"
One end of a Nichrome wire of length 2L and cross-sectional area A is attached to an end of another Nichrome wire of length L and cross-sectional area 2A. If the free end of the longer wire is at an electric potential of 8.0 volts, and the free end of the shorter wire is at an electric potential of 1.0 volt, the potential at the junction of the two wires is most nearly equal to
(A) 2.4 V
(B) 3.3 V
(C) 4.5 V
(D) 5.7 V
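For what it's worth, here's my own sketch of the chain of steps the wire question needs, treating the two wires as series resistors forming a voltage divider:

```python
# R = rho * length / area, so work in units of rho*L/A and the resistivity cancels.
R_long = 2 / 1     # length 2L, cross-sectional area A
R_short = 1 / 2    # length L, cross-sectional area 2A

V_long_end, V_short_end = 8.0, 1.0
I = (V_long_end - V_short_end) / (R_long + R_short)  # same current through both wires
V_junction = V_long_end - I * R_long
print(V_junction)  # 2.4, i.e. option (A)
```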
Consider a computer design in which multiple processors, each with a private cache memory, share global memory using a single bus. This bus is the critical system resource. Each processor can execute one instruction every 500 nanoseconds as long as memory references are satisfied by its local cache. When a cache miss occurs, the processor is delayed for an additional 2,000 nanoseconds. During half of this additional delay, the bus is dedicated to serving the cache miss. During the other half, the processor cannot continue, but the bus is free to service requests from other processors. On average, each instruction requires 2 memory references. On average, cache misses occur on 1 percent of references. What proportion of the capacity of the bus would a single processor consume, ignoring delays due to competition from other processors?
(A) 1/50
(B) 1/27
(C) 1/25
(D) 2/27
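And a sketch of the arithmetic behind the bus question, step by step:

```python
# Per instruction: 2 memory references, and 1% of references miss the cache.
refs_per_instr = 2
miss_rate = 0.01
misses_per_instr = refs_per_instr * miss_rate        # 0.02

base_ns = 500                       # instruction time when everything hits the cache
miss_delay_ns = 2000                # extra stall per cache miss
bus_held_ns = miss_delay_ns / 2     # bus is only occupied for half of that stall

instr_time_ns = base_ns + misses_per_instr * miss_delay_ns   # 540
bus_time_ns = misses_per_instr * bus_held_ns                  # 20

print(bus_time_ns / instr_time_ns)  # 20/540 = 1/27, i.e. option (B)
```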
I fully acknowledge you could be right about all of this, and I could be totally bonkers, but I think there's a case here to be made, and that's interesting in itself.
Sanity check: Are these questions we know the systems are getting right, or is it possible that current systems are mostly getting these questions wrong?
The first question seems pretty pattern-matchy to me. Googling indicates the correct answer is (D), not (C), and I can easily imagine that the words in the speech are embedded much closer to "industrial capitalism" than to "tariffs" or "life expectancy" or "child labor."
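(That hunch is cheap to test with an off-the-shelf sentence embedder. A rough sketch; the model name is just an example and the passage is abbreviated:)

```python
# If a similarity-only baseline already picks (D), that's at least weak evidence
# the question can be answered without much actual reasoning.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model, nothing special about it

passage = (
    "The real grievance of the worker is the insecurity of his existence; "
    "he is not sure that he will always have work ..."  # abbreviated
)
options = [
    "Social acceptance of child labor.",
    "Declining life expectancy in Germany.",
    "Criticisms of German trade tariffs.",
    "Negative effects attributed to industrial capitalism.",
]

scores = util.cos_sim(model.encode(passage), model.encode(options))[0]
for opt, score in zip(options, scores):
    print(f"{score.item():.3f}  {opt}")
```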
For the second question, I'd guess it picked up on an association between "not intending to shoot anyone" and "involuntary manslaughter", but that's just a guess.
The third and fourth do seem tougher to me, and I don't immediately understand the mechanisms. As I've mentioned, this is all pretty impressive.
My objection is that these models are trained to guess, to imitate, and to lie, due to the "fill in the blank" training regime. This makes them quite useful for generating fiction, but often useless for non-fiction purposes. They no more have a consistent personality or beliefs than a library does. Tell it to pretend to be a 10-year-old girl, a UK prime minister from the 19th century, an alien from Star Trek, or an Aztec god, and that's what it will do.
You can also convince it to pretend that it's a history book or an encyclopedia article, and it can do that too. But it won't know or care if you want a *true* encyclopedia article or a *fictional* one, and can switch without warning whenever there's not a clear answer in the training set.
Since it's such a good imitator, it can also imitate a good student, increasingly so, but again, it will turn on you whenever the answer isn't consistent in the training set.
So it seems that there is some special sauce that's missing, that has nothing to do with how well it does on tests, and that can't be trained in without coming up with some different method of training.
I'm wondering whether continuous learning might solve this. At the moment, language models can adopt a temporary persona based on their situation. I would suggest that continuous learning would allow them to build up a persona over a longer context. This persona could even come to include knowledge and representations of what it is (a language model).
Most people have a consistent personality and set of values, but is this property necessary for an AGI? Does an AGI have to be emotional? Children can be inconsistent, but they are still intelligent and capable of learning.
The only little problem with these deep learning architectures is that they are completely unable to go beyond any mixture of knowledge bits and pieces that humans have already conceptualized, whereas an AGI can.
- Will there be further progress with these DL architectures? Yes, a lot, and people will keep being mistaken and mesmerized by their capabilities, but that is just going further down a dead end.
- Do they have a deep understanding of what they generate? No, they cannot solve simple logical tests (whereas a young child can). They are only very good *imitation machines*, capable of generating very long sequences of plausible symbols (pixels, characters, bits in general).
They absolutely can solve simple logical tests along with mathematical problems that clearly require logical reasoning. They're included in the MMLU.
See: https://ai.googleblog.com/2022/06/minerva-solving-quantitative-reasoning.html
Look at the 'Limitations' section of the article you mention: a system that can provide the right result by following an incorrect path simply has no true underlying understanding of what it is doing. Your comment is valid only for toy problems, and mine remains globally valid.
I mentioned the word 'plausible' (vs. true) in my November 2022 post below.
A new paper making this distinction between plausible vs true has just been published:
https://link.springer.com/article/10.1007/s11023-022-09602-0
I recommend you read it.
Another possible point is that a great many unanswered questions in math, science, and philosophy are pretty easy to render as text-in, text-out problems. An effectively immortal AGI with inhuman storage space and processing power might get to one or more of these before a mortal human could, and then other projects, including AGI development itself, would progress dramatically. This is not the Singularity, exactly, but it perhaps approaches an event horizon.