I've written this one pretty quickly because I'm busy this week, but I wanted to respond to Scott's article: I find it interesting, and it seems like a good way to get the word out about the blog.
Re: smaller models in the CoT paper: You can think of large language models as generally coming in two varieties. The ones you read headlines about - GPT and friends - tend to be decoder-only architectures that are ridiculously large (often 100B or more parameters). An organization needs to be either large, rich, or highly specialized to fine-tune a model like this; the point is that they're supposed to be extremely flexible and expressive even without fine-tuning, useful for zero-shot and few-shot learning. You won't see "fine-tuned GPT-3" in a chart like this because very few people have the resources to fine-tune it, even if OpenAI gives them permission to.
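To make "the task lives in the prompt" concrete, here's a minimal sketch of few-shot use with a decoder-only model. I'm using GPT-2 from Hugging Face purely as a stand-in for the far larger GPT-3-class models (nothing here is from the paper); the point is that no weights get updated - the examples in the prompt define the task.

```python
# Few-shot sentiment labeling by prompting a decoder-only model.
# GPT-2 stands in for a much larger GPT-3-class model; the idea is the
# same: the task is specified by examples in the prompt, not by fine-tuning.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = (
    "Tweet: I love this phone, the battery lasts forever.\n"
    "Sentiment: positive\n"
    "Tweet: The app crashes every time I open it.\n"
    "Sentiment: negative\n"
    "Tweet: Shipping was fast and the screen looks great.\n"
    "Sentiment:"
)

# The model just continues the text; with a large enough model the
# continuation is usually the correct label.
out = generator(prompt, max_new_tokens=3, do_sample=False)
print(out[0]["generated_text"])
```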
The non-GPT models in the chart, including the one you're looking at - Multimodal-CoT - are encoder-only or encoder-decoder architectures, much smaller - usually under 1B parameters - such that it's relatively feasible for an independent researcher or small organization to fine-tune one. They are the unglamorous-but-still-kind-of-cutting-edge workhorses of commercial AI. They're nowhere near as flexible or expressive as GPT and kin, but they can be as good as or better at a single, well-defined task; e.g. classifying whether tweets about a company's product express positive or negative sentiment, or selecting the sentence on a web page most likely to answer a user's query.
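For contrast, here's a rough sketch of what fine-tuning one of these smaller models looks like. The model (DistilBERT, ~66M parameters), the dataset (IMDB), and the hyperparameters are placeholder choices of mine for illustration, not anything from the paper.

```python
# Fine-tuning a small encoder-only model on a binary sentiment task -
# the kind of single, well-defined job these models get specialized for.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

dataset = load_dataset("imdb")  # placeholder sentiment dataset
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=256)

dataset = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

args = TrainingArguments(
    output_dir="sentiment-model",
    num_train_epochs=2,
    per_device_train_batch_size=16,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)

trainer.train()  # updates the model's weights for this one task only
```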
So, fine-tuned versus not-fine-tuned is the lion's share of the explanation here, and it's why it's more-or-less expected to see much smaller models outperform GPT-3 on certain kinds of tasks. Vanilla GPT-3 has the incidental ability to complete a wide variety of tasks, as a side effect of its pre-training on next-word prediction, whereas Multimodal-CoT is lightweight, great at doing one thing, and useless for anything else.
That's all true, and maybe I'm making a fool of myself here, but I've been following language models since back when fine-tuned results on specific datasets got all of the attention, and 95% on the text-only problems of a dataset that (from what I can tell) is quite difficult and varied still seems really phenomenal.
Compare the fine-tuned performance of vastly larger models on ARC-Easy, for example:
https://leaderboard.allenai.org/arc_easy/submission/c03ng2s9f4tt4ugi12q0
Did you use ChatGPT to write your response? I'm guessing not. That seems as strong an argument against your thoughts as any.
Disagree - the tech might (will) get better quickly.