Summary:
While everyone is focused on the question of when general artificial intelligence will arrive, there's another milestone that will probably come earlier: verbal parity.
By verbal parity I mean, roughly, the ability of a computer, given a string of text, to respond with another string of text as well as a typical human could (weak parity) or as well as a human from the relevant profession could (strong parity). This comes without any presumption that the computer can do things normally attributed to general AI: perception, motor coordination, playing a simple real-time computer game, or even the simplest visual tasks.
Although in some respects verbal parity can be seen as an extension of the Turing test, thinking of it too much through that framework might be a trap.
There are reasons to think that verbal parity could arrive well before we currently expect AGI. If the current rate of advance is sustained, we can even imagine it happening in a very brief time (5-15 years). The implications of verbal parity would be enormous. These include the automation of a vast swathe of jobs and, just possibly, a singularity.
If verbal parity comes before general artificial intelligence, that changes the strategic landscape around the control problem, although how it changes it is difficult to say.
Verbal parity might be nearer than general artificial intelligence, and that likely makes the control problem worse because it gives us less time to solve it.
Components of verbal parity currently missing from the best models include a capacity for ongoing learning without fine-tuning and a larger context window.
Predictions are included at the end, even though, in general, I hate making predictions; making them here is a kind of publicity effort.
Note on terminology: It’s considered low rent in some circles to use AI as a singular noun as in “an AI”. I find writing goes smoother if you allow it though.
I've become interested in an AI milestone that I don't think gets as much attention as it should: verbal parity.
Have you ever had an idea that, maybe to other people, seems like inside baseball, but to you seems very important? Have you ever thought you've spotted a blind spot that everyone else, including people you very much respect, is missing? This is my such idea, my idée fixe as it were, my hobby horse when it comes to AI futures.
Informally, by verbal parity, I mean the capacity to respond to a string of text with another string of text as well as a human could- whether that answer is a paragraph or a single character.
This excludes skills as diverse as controlling a robot; generating, classifying, or understanding images; and playing a real-time computer game.
Once upon a time, many people would have thought that verbal parity is an AI-complete problem, or at least that it’s not something natural language processing on its own could achieve.
However, rapid progress in natural language processing relative to other areas has thrown this into doubt. We've all seen GPT-3, for example, and though it is not yet at a human level, the range of its verbal skills is staggering. Even if true verbal parity is AI-complete, it seems we can get much closer to it without general artificial intelligence than we thought, and without anything much except more deep learning-based natural language processing. In concrete terms, it increasingly seems like we could have artificial intelligence that can respond to text strings as well as a human, but which couldn't, for example, control a robot to assemble an IKEA cabinet according to instructions. Even more amazing is that the AI that can respond to text strings as well as a human might be trained on nothing but vast databases of text, with no prestructured knowledge and no images or sounds or any of the other things we use to learn about our world.
Let us define two concepts of verbal parity:
Weak verbal parity: The ability to respond to a string of text in a known language as well as an average human could.
Strong verbal parity: The ability to respond to a string of text in a known language as well as an average human could, or, if the correct response to a string of text calls for some professional skill, to respond to it as well as an average professional from that field. For example, if a string of text asked you to write a novel, respond with a novel as well written as the average novelist could produce. Or if asked to write a review paper on the cladistics of eusociality in Hymenoptera, write it as well as the average phylogeneticist or biologist in a related field could write it.
Of course, we need to apply a bit of common sense in our definition of strong verbal parity. What about some tiny, godforsaken profession that no one has ever heard of, undocumented on the internet, practiced only in a small rural Siberian village? We're talking about skills that are well and truly public knowledge.
I'm writing about verbal parity because I think it could change the world, but a lot of thought around AI timelines - at least in the popular press and blogosphere - has become stuck on the concept of general intelligence. It seems to me that thinking about things through the lens of AGI pushes projected world-changing AI timelines out further, while thinking about verbal parity makes it look likely they could be closer. The purpose of this post is to try to make us think about verbal parity as a medium-term possibility with explosive implications.
In this essay, I do two things that I normally try to avoid. I make concrete, unconditionalized, predictions about singular, unrepeatable future events with complex human factors involved, and I deal in very specific reasoning about material that is well outside even the most generous interpretation of my expertise. So let me walk it back a bit by saying I’m offering this as a contribution to the conversation, something to think about that may be important- nothing more.
Is verbal parity just the ability to pass the Turing test?
Yes and no.
I'm not an intellectual historian, but looking at his original paper, my impression is that Turing intended the eponymous test to probe a certain kind of behavioral indistinguishability, in keeping with the behaviorist fashion of the time. Posing challenges like "write a book" or "collate a review paper" would therefore have been fair game, since a computer whose verbal behavior was indistinguishable from a human's could attempt them.
But over time, the common understanding of the Turing test has come to be a sort of intellectual trickery. Many applications of the Turing test are short, conducted within time limits, and with restrictions that make it harder to catch the computer out. The goalposts are hence shifted from true behavioral indistinguishability to deceit.
So perhaps what we are talking about is the ability to pass the Turing test, but only in the fullest sense of that concept.
The limitations of Natural Language Understanding just a few years ago and its progress now
The main reason I am bullish about near-term verbal parity is the rapid progress of the last four years or so. Had you described to someone in 2018 the performance of natural language software on question datasets in 2022, they would never have guessed that you were talking about a year as near as 2022. Further, I suspect they would have inferred that, whatever time you were talking about, that time was probably really close to true AI, or at least verbal parity.
GLUE & SuperGLUE
Consider, for example, GLUE:
The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems.
GLUE came with human benchmark levels on all its tasks. Although the researchers don't come out and say it in their initial paper, my impression (and I apologise if I am wrong) is that they hoped that achieving human-level performance on GLUE would require at least approximately human capability in question answering.
Except the GLUE benchmark was surpassed relatively early, in 2019. So a new benchmark was created, SuperGLUE. Again, there were high hopes for SuperGLUE: that there was plenty of room for growth before machines could reach a truly human level. In the team's initial paper introducing SuperGLUE, they boasted that it had:
• Comprehensive human baselines: We include human performance estimates for all benchmark tasks, which verify that substantial headroom exists between a strong BERT-based baseline and human performance.
But once again, the test fell fairly quickly. DeBERTa, a 2020 Microsoft model, exceeded the human baseline.
There has never been an ultraGLUE. Perhaps the makers of GLUE & SuperGLUE were burnt too many times.
Winograd Schema
Or consider, for example, the Winograd schema. The Winograd schema was once considered an update on the Turing Test- like the Turing test, but quantifiable, repeatable, and not open to tricks. It works as follows. Suppose I give you the following sentence:
The city councilmen refused the demonstrators a permit because they feared violence.
Who does the pronoun “they” refer to?
Now suppose instead the sentence was:
The city councilmen refused the demonstrators a permit because they advocated violence.
Now, who does the word "they" refer to? Essentially, a Winograd Schema problem is one in which the reference of an ambiguous pronoun depends on a single word ("feared" or "advocated" in the above). The test taker has to work out what the pronoun refers to on the basis of reasoning, implicit or explicit, about how the world works. For example, while it is possible, it doesn't make much sense to say "the councilors denied the protestors a permit because the councilors advocated violence", but it does make sense to say "the councilors denied the protestors a permit because the protestors advocated violence". [Perhaps this is not a good example given how many anti-war protestors have been denied permits by pro-war governments :-) ].
There were high hopes for the Winograd schema. Insofar as it was meant to be a replacement for the Turing test, it was intended to be solvable only by real artificial general intelligence, or at least something close to it. Although these problems are simple and easy for humans, solving them requires a deep understanding of how things relate to each other in the world.
And it started out well. Four years ago, NLU was essentially near chance on the Winograd schema, just as we would expect from a stand-in for the Turing test.
Here’s an abstract from around that time:
Commonsense reasoning is a long-standing challenge for deep learning. For example, it is difficult to use neural networks to tackle the Winograd Schema dataset [1]. In this paper, we present a simple method for commonsense reasoning with neural networks, using unsupervised learning. Key to our method is the use of language models, trained on a massive amount of unlabeled data, to score multiple choice questions posed by commonsense reasoning tests. On both Pronoun Disambiguation and Winograd Schema challenges, our models outperform previous state-of-the-art methods by a large margin, without using expensive annotated knowledge bases or hand-engineered features. We train an array of large RNN language models that operate at word or character level on LM-1-Billion, CommonCrawl, SQuAD, Gutenberg Books, and a customized corpus for this task and show that diversity of training data plays an important role in test performance. Further analysis also shows that our system successfully discovers important features of the context that decide the correct answer, indicating a good grasp of commonsense knowledge.
What did this hot new model from Google, exciting enough to be written up, score? 57%! Just 7% better than chance.
Take a moment to guess how far we’ve come in four years.
…
…
…
In 2022, computers can score up to 97.3% on the Winograd schema challenge. It's just another questionnaire on which natural language understanding models can compete. The new frontier is zero-shot, one-shot, and few-shot performance, because the test as it used to be run is too easy.
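As an aside, the scoring trick described in that 2018 abstract is simple enough to sketch: substitute each candidate referent for the pronoun and prefer whichever completed sentence the language model assigns higher probability. The model choice here (an off-the-shelf GPT-2 via Hugging Face) and the helper names are my own illustration, not the setup from the paper.

```python
# Minimal sketch of LM-scoring for Winograd-style problems: resolve the
# ambiguous pronoun by comparing sentence likelihoods under a language model.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_log_prob(sentence: str) -> float:
    """Total log-probability the LM assigns to the sentence."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # loss is the mean negative log-likelihood per predicted token; undo the mean
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.shape[1] - 1)

def resolve(template: str, candidates: list[str]) -> str:
    """Pick the candidate referent that makes the full sentence most likely."""
    return max(candidates, key=lambda c: sentence_log_prob(template.format(c)))

template = ("The city councilmen refused the demonstrators a permit "
            "because {} feared violence.")
print(resolve(template, ["the city councilmen", "the demonstrators"]))
```

The same procedure with bigger models and more diverse training data is, as far as I can tell, a large part of the story behind the jump in scores described above.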
ARC
To pick another example of progress in natural language processing, the ARC (AI2 Reasoning Challenge) is a dataset containing: “7,787 genuine grade-school level, multiple-choice science questions, assembled to encourage research in advanced question-answering. The dataset is partitioned into a Challenge Set and an Easy Set, where the former contains only questions answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm.”
Pretty much any form of scientific reasoning you might need occurs somewhere in the dataset.
Here’s a sample question:
A climax fire ecosystem requires periodic forest fires to maintain stability. Which of these would be the most likely result of preventing natural fires from occurring in this ecosystem? (A) Pine species would reproduce more rapidly. (B) Broadleaf species would replace pine species. (C) Burning areas would be more easily contained. (D) Trees would spread into previously unforested areas.
And another:
A mass of air is at an elevation of 1000 meters in the low pressure center of a Northern Hemisphere storm. Which of the following best describes the motion of air particles in this air mass due to storm conditions and the rotation of Earth as the air mass moves outward? (A) Air particles move up and to the left. (B) Air particles move up and to the right. (C) Air particles move down and to the left. (D) Air particles move down and to the right.
Take a moment to guess our progress in the last four years.
….
….
….
Throughout 2018 the high score gradually rose from 27% (little better than chance) to 53%. Now in 2022 the maximum score on the ARC is 86%.
Clearly, computers have gained linguistic reasoning abilities in the last few years that they previously lacked - a qualitative jump in question-answering ability.
It's worth reiterating. If back in 2018 a genie had come to me, described the progress we have made by now, and then asked me "in that future I have described to you, how much progress do you think they have made towards verbal parity compared to where you are now?", I would have said they were about two-thirds of the way there. Thus I am bullish on the medium-term likelihood of verbal parity. Though, of course, we must acknowledge, the last third may not be as quick. But then again, it could be. Honestly, based on this progress, if I woke up tomorrow and found out that a computer with verbal parity had been created by some secret lab somewhere, I would be shocked, but not that shocked.
If you disagree with me on how close verbal parity is, I urge you to think about why.
Do you not agree with me about how far we’ve come in the last four years?
Do you think this distance is only a tiny fraction of how far we have to go (and don’t you think it’s a bit ad hoc to claim this now when people at the time thought their question datasets required true understanding- what’s your rationale)?
Do you think that regardless of how far we have come, the remaining progress will be slower?
Technical limitations of the best language models
This isn't to say that we're necessarily within striking distance of verbal parity in the sense of being just a few percentage points off on various problem datasets. The most powerful language models available today have technical deficiencies that wouldn't be solved even by getting 100% on these datasets.
Models can't permanently learn "on the fly". If you want to teach a language model something, you have to fine-tune it. Presumably, verbal parity would require the capacity to permanently learn new things.
Also, they have a limited window of material that they can consider simultaneously. For GPT-3 this is about 2,000 tokens, roughly 1,500 words. In addition to an ability to add to long-term memory "on the fly", true verbal parity would presumably require a much larger attention window, perhaps around the length of a long book, although perhaps writeable long-term memory would handle this problem as well.
If these two technical deficits were solved- and in the grand scheme of things, they seem like relatively modest technical deficits- then it’s at least plausible that the remaining barriers to verbal parity would be only barriers of degree. Of course, it is possible that there are other fundamental barriers that we, or at least I, are unaware of. I’d love to hear suggestions in the comments.
Edit: Two more limitations. 1) The ability to understand the idea of generating the best response to something, and not merely the most likely response. The most likely response to a difficult question might be a wrong answer, but that's not what we're looking for. 2) The ability to plan multiple steps ahead before acting by building an internal model of a sequence of actions - e.g. scaffolding a structure before writing an essay.
Brain scale gives us little guide to verbal parity timelines
A common methodology in trying to predict when human-level artificial intelligence is coming is to make comparisons to the complexity and processing power instantiated by the human brain and work out how long it will take us to get there. Such a methodology is unlikely to work for verbal parity. To put it somewhat informally, we don't know how much of the brain is indispensable to language processing and abstract reasoning. A lot of the brain is more concerned with movement, perception, the integration of these, and so on.
Nonetheless, it is interesting to note that, just as I was writing this, a Chinese research group was experimenting with the creation of language models with trillions of parameters - something near brain scale. Unfortunately, the paper doesn't demonstrate the model's capabilities, making it difficult to judge its significance.
Implications of verbal parity and why thinking about it matters
If verbal parity is not an AI-complete problem -or even if near-verbal parity isn't AI-complete- I think that's an important thing to know. It changes the way we look at a whole raft of problems- AI timelines, the control problem, the future of work, and automation, etc. The economic implications of verbal parity alone would be staggering. Any job that consists primarily in sending emails, or could be made to do so, could thereby be automated- a vast swathe of white collar office work.
Verbal parity is also easier to think about in concrete terms than general AI. It may be easier to predict when it will come because it is easier to quantify, or at least pseudo-quantify, how far we are from the goal now, and easier to see how much progress has been made, so I find it a useful mental framework.
The singularity and verbal parity
A singularity is an event in which we reach a point at which artificial intelligence can design better artificial intelligence, which can design better artificial intelligence, and so on. This could lead to a qualitative shift in how quickly computer and artificial intelligence technology can advance, leading to a recursive intelligence explosion. Such an event would surely be the most significant in recorded history. Like the event horizon of a black hole surrounding a gravitational singularity, it is a point beyond which we cannot see, and all predictions break down.
Many public discussions assume that a singularity would be driven by general artificial intelligence, but this isn’t necessarily so. We can imagine, for example, a narrow artificial intelligence focused on designing better computers triggering a singularity.
And what of our topic here? It’s possible to conceive of verbal parity driving a singularity by driving scientific research. Although an AI with strong verbal parity couldn’t conduct a research program on its own, it could create hypotheses, arguments, experimental designs, etc. All this despite lacking the capacity to recognize a picture of a lion, control a robot that can cook toast, or play Red Alert.
AI safety and verbal parity
By verbal superintelligence, I mean the capacity to respond to strings of text significantly better than any human can.
Verbal parity raises a lot of questions about AI safety. On the surface, a machine with verbal superintelligence but lacking the other capacities of AGI seems safer than an ordinary AGI superintelligence. It can't exactly orient itself in the world in the way we can. Yet this appearance may be an illusion: the nature of superintelligence is to be flexible and adaptive in ways we can't anticipate, and honestly, it's not too difficult to think of ways around the limitations of a purely verbal superintelligence - for example, controlling a few loyal humans.
We must also remember that if verbal parity does trigger a singularity, that would likely lead to a more general form of superintelligence- then we’re back to the ordinary AI safety landscape- at least in some respects.
On the whole, then, I have downplayed the differences regarding AI safety. Still, if I am right that verbal parity is coming soon and will be followed by a transformative form of verbal superintelligence, that seems like something AI safety researchers should grapple with.
Perhaps the most important AI safety question raised by the verbal parity framework I've outlined here is how long we have to prepare. When we think about AGI, it seems 20 to 40 years away, but when we think about verbal parity, it seems to me plausible that it's much closer than that. Presumably not long after we hit verbal parity, we will hit verbal superintelligence, and verbal superintelligence could even start a recursive process leading to a singularity. All this means we could have much less time than we think to prepare solutions to the control problem.
I can’t find good predictions that would track when something like verbal parity is coming
This isn’t to say they don’t exist, but it does indicate that discussion over these sorts of timelines is… rarer than I would like.
Metaculus has a question about whether AI will be able to write a New York Times bestseller before 2030, but while verbal parity would certainly make this possible, it doesn't necessitate it. We can also imagine a bestseller written by something without verbal parity - although it might be strange and disjointed, the novelty of an AI author could propel it onto the bestseller list. Also, the predictors might think verbal parity is coming soon, but that its books won't be very popular.
If anyone has read this essay, agrees with me, and also agrees that starting a conversation about this is important, one thing you could do to help is set up a Metaculus question on this topic.
Hazarding some predictions
I don't normally like making predictions, because no other enterprise has the same tendency to blow up in one's face. Making predictions is dangerous, especially, as the old joke goes, predictions about the future.
Nonetheless, I'm going to make some predictions here, because I want to draw attention to the issue, and predictions are a flashy way to do that. There are a lot of communities - from unions concerned with job loss to the AI safety community - which, if I am right, urgently need to take this into account. This is, so to speak, my silly little way of trying to hit the AI fire alarm, except not quite for AGI.
Predictions
Verbal parity does not require general artificial intelligence. It is possible, and will eventually be practically feasible, to build a machine with verbal parity but without the ability to do things like control a robot to make a cup of coffee. There could even be verbal parity, or something near enough to be good enough, without the ability to draw or recognize images. (90% confidence)
The preceding is not just a theoretical possibility. In the real world, verbal parity will be achieved before AGI (70% confidence)
Verbal parity will be achieved in the next twelve years (50% confidence)
Final remark
Let’s have a conversation about this! Even more than usual I’m keen to hear what readers have to say on this topic.
Appendix: A prior essay I wrote on Natural Language Processing: “Recent advances in Natural Language Processing—Some Woolly speculations”
I wrote this essay back in 2019- before GPT-3. Since then I think it has held up very well. I thought I'd re-share it to see what people think has changed since then, in relation to the topics covered in this essay, and see if time has uncovered any new flaws in my reasoning.
Natural Language Processing (NLP) per Wikipedia:
“Is a sub-field of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.”
The field has seen tremendous advances during the recent explosion of progress in machine learning techniques.
Here are some of its more impressive recent achievements:
A) The Winograd Schema is a test of common sense reasoning—easy for humans, but historically almost impossible for computers—which requires the test taker to indicate which noun an ambiguous pronoun stands for. The correct answer hinges on a single word, which is different between two separate versions of the question. For example:
The city councilmen refused the demonstrators a permit because they feared violence.
The city councilmen refused the demonstrators a permit because they advocated violence.
Who does the pronoun “They” refer to in each of the instances?
The Winograd schema test was originally intended to be a more rigorous replacement for the Turing test, because it seems to require deep knowledge of how things fit together in the world, and the ability to reason about that knowledge in a linguistic context. Recent advances in NLP have allowed computers to achieve near-human scores (https://gluebenchmark.com/leaderboard/).
B) The New York Regents science exam is a test requiring both scientific knowledge and reasoning skills, covering an extremely broad range of topics. Some of the questions include:
1. Which equipment will best separate a mixture of iron filings and black pepper? (1) magnet (2) filter paper (3) triple-beam balance (4) voltmeter
2. Which form of energy is produced when a rubber band vibrates? (1) chemical (2) light (3) electrical (4) sound
3. Because copper is a metal, it is (1) liquid at room temperature (2) nonreactive with other substances (3) a poor conductor of electricity (4) a good conductor of heat
4. Which process in an apple tree primarily results from cell division? (1) growth (2) photosynthesis (3) gas exchange (4) waste removal
On the 8th grade, non-diagram based questions of the test, a program was recently able to score 90%. ( https://arxiv.org/pdf/1909.01958.pdf )
C)
It’s not just about answer selection either. Progress in text generation has been impressive. See, for example, some of the text samples created by Megatron: https://arxiv.org/pdf/1909.08053.pdf
2.
Much of this progress has been rapid. Big progress on the Winograd schema, for example, still looked like it might be decades away as recently as (from memory) mid-2018. The computer science is advancing very fast, but it's not clear our concepts have kept up.
I found this relatively sudden progress in NLP surprising. In my head—and maybe this was naive—I had thought that, in order to attempt these sorts of tasks with any facility, it wouldn’t be sufficient to simply feed a computer lots of text. Instead, any “proper” attempt to understand language would have to integrate different modalities of experience and understanding, like visual and auditory, in order to build up a full picture of how things relate to each other in the world. Only on the basis of this extra-linguistic grounding could it deal flexibly with problems involving rich meanings—we might call this the multi-modality thesis. Whether the multi-modality thesis is true for some kinds of problems or not, it’s certainly true for far fewer problems than I, and many others, had suspected.
I think science-fictiony speculations generally backed me up on this (false) hunch. Most people imagined that this kind of high-level language “understanding” would be the capstone of AI research, the thing that comes after the program already has a sophisticated extra-linguistic model of the world. This sort of just seemed obvious—a great example of how assumptions you didn’t even know you were making can ruin attempts to predict the future.
In hindsight it makes a certain sense that reams and reams of text alone can be used to build the capabilities needed to answer questions like these. A lot of people remind us that these programs are really just statistical analyses of the co-occurrence of words, however complex and glorified. However, we should not forget that the statistical relationships between words in a language are isomorphic to the relations between things in the world—that isomorphism is why language works. This is to say the patterns in language use mirror the patterns of how things are(1). Models are transitive—if x models y, and y models z, then x models z. The upshot of these facts is that if you have a really good statistical model of how words relate to each other, that model is also implicitly a model of the world, and so we shouldn't be surprised that such a model grants a kind of "understanding" of how the world works.
It might be instructive to think about what it would take to create a program, not led by NLP, which has a model of eighth-grade science sufficient to understand and answer questions about hundreds of different things like "growth is driven by cell division" and "what can magnets be used for". It would be a nightmare of many different (probably handcrafted) models. Speaking somewhat loosely, language allows intellectual capacities to be greatly compressed; that is why it works. From this point of view, it shouldn't be surprising that some of the first signs of really broad capacity (common sense reasoning, wide-ranging problem solving, etc.) have been found in language-based programs: words and their relationships are just a vastly more efficient way of representing knowledge than the alternatives.
So I find myself wondering if language is not the crown of general intelligence, but a potential shortcut to it.
3.
A couple of weeks ago I finished this essay, read through it, and decided it was not good enough to publish. The point about language being isomorphic to the world, and that therefore any sufficiently good model of language is a model of the world, is important, but it’s kind of abstract, and far from original.
Then today I read this report by Scott Alexander of having trained GPT-2 (a language program) to play chess. I realised this was the perfect example. GPT-2 has no (visual) understanding of things like the arrangement of a chess board. But if you feed it enough sequences of alphanumerically encoded games—1.Kt-f3, d5 and so on—it begins to understand patterns in these strings of characters which are isomorphic to chess itself. Thus, for all intents and purposes, it develops a model of the rules and strategy of chess in terms of the statistical relations between linguistic objects like "d5", "Kt" and so on. In this particular case, the relationship is quite strict and invariant- the "rules" of chess become the "grammar" of chess notation.
Exactly how strong this approach is—whether GPT-2 is capable of some limited analysis, or can only overfit openings—remains to be seen. We might have a better idea as it is optimized — for example, once it is fed board states instead of sequences of moves. Either way though, it illustrates the point about isomorphism.
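For readers who haven't seen the setup, here is roughly what "feeding GPT-2 chess as text" looks like from the outside: a game is just a string of moves, and the model is asked to continue the string. The base GPT-2 checkpoint used in this sketch has not been fine-tuned on game databases, so treat it as an illustration of the interface rather than a reproduction of the experiment.

```python
# Rough illustration of treating chess as text: prompt a language model with
# a move sequence and let it continue the string. Without fine-tuning on games
# (as in the experiment described above), the continuation won't be good chess;
# the point is only that the whole task lives in plain text.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4."
result = generator(prompt, max_new_tokens=8, do_sample=False)
print(result[0]["generated_text"])
```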
Of course everyday language stands in a woollier relation to sheep, pine cones, desire and quarks than the formal notation of chess moves stands in relation to chess itself, and the patterns are far more complex. Modality, uncertainty, vagueness and other complexities enter in, not to mention people asserting false sentences all the time, but the isomorphism between world and language is there, even if inexact.
Postscript—The Chinese Room Argument
After similar arguments are made, someone usually mentions the Chinese room thought experiment. There are, I think, two useful things to say about it:
A) The thought experiment is an argument about understanding in itself, as distinct from the capacity to handle tasks, and understanding in that sense is a difficult thing to quantify or pin down. It's unclear that there is a practical upshot for what AI can actually do.
B) A lot of the power of the thought experiment hinges on the fact that the room answers questions using a lookup table, which stacks the deck. Perhaps we would be more willing to say that the room as a whole understood language if it formed an (implicit) model of how things are, and of the current context, and used those models to answer questions? Even if this doesn't dispel the intuition that the room cannot understand Chinese, I think it takes a bite out of it (Frank Jackson, I believe, has made this argument).
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
(1)—Strictly of course only the patterns in true sentences mirror, or are isomorphic to, the arrangement of the world, but most sentences people utter are at least approximately true.
Since you've framed strong verbal parity as being stronger than the Turing test, I think I can make this argument: verbal parity in this sense is actually AGI. Yeah, I know this is not an uncommon opinion; it's what Turing thought and it has remained popular ever since. But I think it usually comes from a mindset where people assume that an AI trained only on static data could never achieve human-level intelligence, that this would require having a body, or at least interacting with and doing experiments on its environment, learning over time as a child does. That's not really my perspective. Either way though, an AI with verbal parity could actually do this type of stuff automatically. It could consider descriptions of real-world problems, command a person to carry out tasks, devise experiments, and change its behavior over time. It couldn't produce the signals needed to control a robot in real time, but it might be able to design something that could, or at least do the physics calculations required to move the robot across a given room.

If we encountered an alien species that could do everything we could just as well, but could also control swarms of insects with their minds, we wouldn't throw up our hands and declare ourselves inferior, swarming instinct being a now-obvious requisite for true general intelligence. We would wave our hands uneasily and say: yeah, but we can design things for controlling swarms of insects, sort of, and if not that, then at least we could technically do the laborious calculations required to create a nice, well-behaved swarm in a given space. We don't decide to call one-dimensional Turing machines not Turing complete because they're polynomially slower than higher-dimensional ones; that's just gross. We smooth everything out, lump them all into the same class, and reassure ourselves that this is the way to go based on the math becoming beautiful and interesting and useful when you do this. Weird aside: extremely reasonable people, I believe Scott Aaronson is an example, do try to claim that Turing machines which are exponentially slower than regular ones ought not be called complete; see this thing and the resulting fallout: https://blog.wolfram.com/2007/10/24/the-prize-is-won-the-simplest-universal-turing-machine-is-proved/

Anyway, suppose we encounter another naturally evolved intelligent species which is fixed in its environment, with no sense other than decent-but-not-great hearing, which gets by singing special songs to the dumb but physically capable animals around it, compelling them to do things and getting them to respond and give up information about what they can sense. As soon as we try communicating by straightforwardly translating words into alien stump song, they begin to pick up English easily, and after some time we gauge them about as capable as humans in the domain of language. We likewise wouldn't, I would hope, decide these are inferior and not generally intelligent beings.
I'm not sure I believe the idea of general intelligence can be made sound, but if it can, then I think the above is gonna hold. I could see trying to define strong verbal parity so it's not quite strong enough to do the tasks described above (but still stronger than the Turing test!), but I would consider such a thing implausible. Maybe, idk, the context window is large enough to fit a novel, to engage in hours of conversation, but not long enough to fit a lifespan of interaction. Well, then I'd create something that periodically re-summarizes its life in novel form and uses the last 1/3 of the context window for current interactions. That would be a better long-term memory than I have, at least, maybe not top of the line. The intuition is that once you master what a human can do in X minutes, the rest is comparatively easy. I think X=5 has been used before, but here we're of course talking much longer.
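To make that scheme concrete, here's a loose sketch. Everything in it is hypothetical: `lm.generate`, `lm.summarize`, and `lm.count_tokens` stand in for whatever model interface you have, and the token budgets are made up.

```python
# Hypothetical sketch of the rolling-summary idea: keep ~2/3 of the context
# window for a periodically rewritten "life summary" and the last ~1/3 for
# the current interaction. `lm` is an assumed interface, not a real library.
WINDOW = 8192                             # assumed total context budget, in tokens
SUMMARY_BUDGET = 2 * WINDOW // 3          # rolling long-term summary
RECENT_BUDGET = WINDOW - SUMMARY_BUDGET   # current interactions

def respond(lm, summary: str, recent: list[str], user_msg: str):
    """Answer one message, folding old interactions into the summary as needed."""
    recent = recent + [user_msg]
    reply = lm.generate(summary + "\n\n" + "\n".join(recent))   # assumed call
    recent.append(reply)
    # When the recent transcript outgrows its third of the window,
    # compress it into the long-term summary and start a fresh tail.
    if lm.count_tokens("\n".join(recent)) > RECENT_BUDGET:      # assumed call
        summary = lm.summarize(summary + "\n" + "\n".join(recent),
                               max_tokens=SUMMARY_BUDGET)       # assumed call
        recent = []
    return reply, summary, recent
```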
In the definitions of verbal parity, what does it mean to have "the ability to respond [in some manner]"?
When we generate text with an LM, we don't directly observe abilities-to-respond; we just observe specific responses. So all we can do is gather a finite, if possibly large, sample of these responses. How do we get from that to a judgment about whether the LM "has verbal parity"?
It seems like we could make this call in many different ways, depending on how we construe "ability" here.
Suppose we find that LM responds like a human the vast majority of the time, but every once in a while, it spits out inhuman gibberish instead. Does that count?
Suppose we find that the LM "gets it right" (correct answers, high-quality prose, etc.) about as often as our reference human, but when it does "get it wrong," its mistakes don't look like human mistakes. Does that count?
Suppose we find that the LM displays as much total factual knowledge as a typical human, but it isn't distributed in a human way. For example, it might spend a lot of its total "fact budget" on a spotty, surface-level knowledge of an extremely large number of domain areas (too many for any human polymath to cover). Does that count?
In my opinion, LMs are making fewer mistakes on average as they scale up, but the mistakes they do make are not growing more humanlike at the same rate. So, as LMs get better, there will be a larger and larger gap between their average/typical output and their worst output, and whether you judge them as humanlike will come down more and more to how (or whether) you care about their worst output, and in what way.
I discuss this in more detail in this post: https://www.lesswrong.com/posts/pv7Qpu8WSge8NRbpB/larger-language-models-may-disappoint-you-or-an-eternally
Other stuff:
- On the Turing Test comparison, note that "passing the Turing Test" and "displaying the ability to pass the Turing Test" are not the same thing (the latter is not even clearly well-defined). A system might pass some instances of the Test and fail others.
- I worked in NLP during the transition period you mention (and still do), and it really was remarkable to watch.
- The Chinese model you link to is a MoE (Mixture of Experts) model, so the parameter count is not directly comparable to GPT (nor to most other modern NNs). MoEs tend to have enormous parameter counts relative to what is possible with dense models at any given time, but they also do far worse at any given parameter count, so overall it's kind of a wash.
If you aren't familiar with MoEs, you can imagine this 173.9T parameter model as (roughly) containing 96000 independent versions of a single small model, each about the size of GPT-2, though much wider and shallower (3 layers). And, when each layer runs, it dynamically decides which version of the next layer will run after it.
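To illustrate the routing idea in code, here is a toy top-1 Mixture-of-Experts layer. The sizes are made up, and real MoE models add load balancing, expert capacity limits, and sharding that this omits; it's only meant to show why parameter count and per-token compute come apart.

```python
# Toy top-1 routed Mixture-of-Experts layer (illustrative sizes only).
import torch
import torch.nn as nn

class Top1MoELayer(nn.Module):
    def __init__(self, d_model: int, n_experts: int):
        super().__init__()
        # Each "expert" is an independent small feed-forward block.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        # The router decides, per token, which single expert to run.
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). Only one expert's parameters are used per
        # token, so total parameters can be huge while per-token compute stays small.
        gate = self.router(x).softmax(dim=-1)   # (tokens, n_experts)
        weight, choice = gate.max(dim=-1)       # top-1 routing
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = choice == i
            if mask.any():
                out[mask] = weight[mask].unsqueeze(-1) * expert(x[mask])
        return out

layer = Top1MoELayer(d_model=64, n_experts=8)
print(layer(torch.randn(10, 64)).shape)   # torch.Size([10, 64])
```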