Important updates:
THE IN-PERSON EVENT HAS BEEN MOVED TO THE 18th. The new date is the 18th of December, 2022, at the Forest Lodge Hotel, Forest Lodge, Sydney, Australia. More details about the event here:
If you have a body of work- ideally written work- and you want to be interviewed about it, get in contact! No promises, but I’ll have a look.
Verbal parity and forecasting
I want to set up some questions on a prediction market- perhaps Metaculus- designed to give us some idea of when verbal parity might be achieved. If anyone has any connections to assist with that, I'd be grateful. What follows is my attempt to figure out some specific questions.
I have defined verbal parity as the ability of an artificial intelligence program to, given any written input, respond with appropriate written output as well as a human could. I’ve written about this topic several times previously. For the initial discussion where I developed the idea see this.
Now of course verbal parity is a pretty vague concept. As well as any human could for any subject? That seems very ambitious. Furthermore, there are an infinite number of verbal tasks, so how could we practicably verify that the goal had been achieved? Still, it's a somewhat more concrete goal than artificial general intelligence.
Vague or not, the stakes are big. Verbal parity would be transformative if achieved. If nothing else, it would totally change how people think about AI. Indeed, Turing was, to some degree, playing off this intuition when he suggested the Turing test. I bet that verbal parity, when it happens, will cause at least as big a cultural splash as multi-modal artificial general intelligence would. Potential short-term job losses could be very significant. Cultural impacts- in terms of how people see themselves, for example- will be big. Given the current rate of progress in large language models, it could happen well before multi-modal AGI. Personally, I think verbal parity is a kind of AGI in and of itself, but you could think of it as a very special kind of narrow AI. Regardless, it matters.
Despite the stakes, no one seems to be trying specifically to estimate when something like verbal parity will be achieved. Existing prediction markets on AGI focus on multi-modality and so, from the point of view of verbal parity, include a lot of extraneous requirements- from robotics to real-time computer games. Because I'm interested in verbal parity, and in prediction markets, I've set myself the challenge of creating some questions, suitable for inclusion in prediction markets, that would allow us to assess proximity to verbal parity. I'd like feedback and suggestions for further questions.
My first thought is to simply take sub-criteria from other prediction market questions about AI that focus on AI textual competence:
1: "High competency at diverse fields of expertise, as measured by achieving at least 75% accuracy in every task and 90% mean accuracy across all tasks in the Q&A dataset developed by Dan Hendrycks et al.."
From:
https://metaculus.com/questions/5121/date-of-artificial-general-intelligence/
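To make the resolution mechanics of that sub-criterion concrete, here is a minimal sketch of the threshold check it describes. This is my own illustration, not part of any Metaculus resolution tooling; the function name and the example scores are invented.

```python
# Sketch of the sub-criterion above: at least 75% accuracy on every task
# AND at least 90% mean accuracy across all tasks in the benchmark.
def meets_criterion(task_accuracies, floor=0.75, mean_target=0.90):
    """task_accuracies: per-task accuracy scores (0.0 to 1.0), one per task."""
    accs = list(task_accuracies)
    return min(accs) >= floor and sum(accs) / len(accs) >= mean_target

# Made-up per-task scores for illustration:
print(meets_criterion([0.92, 0.88, 0.95, 0.91]))  # every task >= 0.75, mean 0.915
print(meets_criterion([0.92, 0.70, 0.95, 0.91]))  # fails: one task below the 75% floor
```

Note that the per-task floor does real work here: a model with a high mean can still fail because of a single weak subject, which is what makes the criterion a test of breadth rather than just average competence.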
Beyond that, I have two original proposals:
"2: When will AI software write a novel of at least 50,000 words. The novel must be coherent [allowing for small deviations from continuity as is common in human written novels. The novel must describe in the way almost all novels do, complex sequences of actions, behaviors and contexts involving at least three or more characters. The novel can be of any genre, but must not, in the judgment of the assessors, utilize any experimental or unusual literary features that make sustaining coherence easier or make deviations from coherence much harder to spot. The software must be the sole author."
(N.B., this related question: https://www.metaculus.com/questions/5587/ai-ny-times-best-seller-before-2030/ is too tied up in questions of aesthetic quality, sales success etc.)
(Also, in slight contradiction to the above point, I have considered adding a requirement that it score at least 4 stars on Amazon with at least 100 reviews.)
"3. When will AI software write a paper accepted by an academic journal with [some criteria meant to require the journal be notable/not a paper mill.] the paper must be at least 5000 words long, not including the bibliography. It must consist in a sustained and coherent critical argument or review, and may not be merely a summary of previous research. The model must be the sole author [collaboration by multiple language models is also acceptable, so long as there was no human input that rose to the level of meriting attribution as a co-author]. The article must have been peer-reviewed by at least two reviewers. The editors and reviewers must not have been aware that the piece had been created by a language model when they accepted it.
For the purposes of this goal, a paper that was accepted by the editors and then rejected upon discovering its non-human authorship will be acceptable. Papers written by models with additional functions other than language modeling (e.g. generating figures) will also be accepted.
At the sole discretion of the judges, apparent fulfillment of the criteria may be rejected if the paper evades normal academic requirements in some way via its unusual form or subject matter. A paper that contains fabrications (e.g. false claims of having run an experiment) will also not be considered to have met these criteria."
An obvious addition would be an operationalization of the Turing test, but high-quality formal Turing test competitions no longer seem to be running. Any suggestions?
Here is one robust proposal for a Turing test (offered semi-in-jest):
We will use human proxies as intelligence equivalents, selected from five IQ bands: 96~104 (avg 100, as the general public), 105~113 (avg 109, as above-average intelligence), 114~122 (avg 118, as undergraduate equivalents), 123~131 (avg 127, as graduate equivalents), and 132~140 (avg 136, as professionals or PhD equivalents). Four people per round of testing, per demographic, will be needed. For the conversational representatives (the "interviewer" side), a pool of 15 people would be sufficient. Ideally the pool should intellectually match the human proxies. REASON: there are management-theory claims that intellectual homophily fosters communicative clarity, and that leader-follower dynamics break down once the intelligence difference falls outside the 9~27 point range, so large intellectual gaps should be avoided.
There will be five stages of testing, one per intelligence demographic. Each stage will have three rounds of four proxies each, and every proxy should be assigned to a single round only. The task is for the general public (and the conversational representatives) to distinguish the AI within a pool of five candidates, three times in a row. Three conversational representatives will be selected for each stage. REASON: to avoid biases in both representatives and proxies, since personality psychometrics have not yet been factored in by default.
The conversation between human and potential AI candidates should be conducted through chatrooms, for 30 minutes each, and should be monitored. Trolling and the invocation of external information are encouraged (a strong AI should learn to say "I don't know" and to detect irony), but asking for personal details is not. The AI can claim to be neurodivergent; however, intelligence levels for human parity have already been factored in, so discrepancies between claims and simulated behaviors will count against passing this test. REASON: counter-signaling, critical thinking, and uncertainty are innate human features.
After the conversations are done (at this point each proxy has had three conversations, and each representative twelve), the transcripts are released to the general public for surveying. For brevity, each survey package should include either (a) one representative's interviews, with the task of guessing which round contains the AI, to measure diversity of mimicry, or (b) three rounds with four consistent proxies but different representatives, with the task of guessing the non-proxy AI, to measure spoofing consistency.
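For readers trying to follow the arithmetic, the tournament structure above can be tallied as follows. All the numbers come from the proposal; the pairing assumption (each representative interviews every proxy in their stage once) is my reading of the "3 conversations per proxy, 12 per representative" figures, and the band labels are paraphrased.

```python
# Tally of the proposed Turing-test tournament structure.
STAGES = [  # (IQ band, label) - labels paraphrased from the proposal
    ("96~104", "general public"),
    ("105~113", "above-average"),
    ("114~122", "undergraduate"),
    ("123~131", "graduate"),
    ("132~140", "professional/PhD"),
]
ROUNDS_PER_STAGE = 3
PROXIES_PER_ROUND = 4
REPS_PER_STAGE = 3

proxies_per_stage = ROUNDS_PER_STAGE * PROXIES_PER_ROUND  # 12 proxies per stage
total_proxies = len(STAGES) * proxies_per_stage           # 60 proxies overall
rep_pool = len(STAGES) * REPS_PER_STAGE                   # the pool of 15 representatives

# Assumed pairing: each representative interviews every proxy in their stage once,
# so each proxy is interviewed once by each of the stage's three representatives.
conversations_per_rep = proxies_per_stage                 # 12 conversations per representative
conversations_per_proxy = REPS_PER_STAGE                  # 3 conversations per proxy

print(total_proxies, rep_pool, conversations_per_rep, conversations_per_proxy)
```

Under that reading the stated figures are internally consistent: 60 proxies, a representative pool of exactly 15, twelve conversations per representative and three per proxy.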