4 Comments

Re: smaller models in the CoT paper: You can think of large language models as generally coming in two varieties. The ones you read headlines about - GPT and friends - tend to be decoder-only architectures that are ridiculously large (often 100B or more parameters). An organization needs to be large, rich, or highly specialized to fine-tune a model like this; the point is that they're supposed to be extremely flexible and expressive even without fine-tuning, which makes them useful for zero-shot and few-shot learning. You won't see "fine-tuned GPT-3" in a chart like this because very few people have the resources to fine-tune it, even if OpenAI gives them permission to.
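
To make "zero-shot and few-shot" concrete, here's a rough sketch - no weights change; the task lives entirely in the prompt. (I'm using gpt2 as a tiny stand-in for a GPT-3-scale model, and the prompts are made up for illustration.)

```python
# Zero-shot vs. few-shot prompting with a decoder-only model.
# "gpt2" is a small stand-in for GPT-3-scale models; nothing is fine-tuned,
# the task is specified entirely in the prompt text.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

zero_shot = (
    "Classify the sentiment of this tweet as positive or negative.\n"
    "Tweet: I love this phone.\n"
    "Sentiment:"
)

few_shot = (
    "Tweet: The battery died after an hour.\nSentiment: negative\n"
    "Tweet: Best purchase I've made all year.\nSentiment: positive\n"
    "Tweet: I love this phone.\nSentiment:"
)

for prompt in (zero_shot, few_shot):
    out = generator(prompt, max_new_tokens=3, do_sample=False)
    # The pipeline returns the prompt plus its continuation; keep the continuation.
    print(out[0]["generated_text"][len(prompt):].strip())
```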

The non-GPT models in the chart, including the one you're looking at - Multimodal-CoT - are encoder-only or encoder-decoder architectures, much smaller - usually fewer than 1B parameters - such that it's relatively feasible for an independent researcher or small organization to fine-tune one. They are the unglamorous-but-still-kind-of-cutting-edge workhorses of commercial AI. They're nowhere near as flexible or expressive as GPT and kin, but they can match or beat it on a single, well-defined task; e.g. classifying whether tweets about a company's product express positive or negative sentiment, or selecting the sentence on a web page most likely to answer a user's query.
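
And for contrast, here's roughly what the fine-tuning side looks like for one of these small models - the key point is that the gradient step at the end is cheap enough for a small team to run on its own hardware. (distilbert-base-uncased and the two example tweets are just illustrative stand-ins; a real run needs a labeled dataset and many more steps.)

```python
# A minimal sketch of fine-tuning a small encoder model for binary sentiment
# classification. distilbert-base-uncased (~66M parameters) is an illustrative
# stand-in; the two "tweets" are invented placeholders, not real data.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)  # 0 = negative, 1 = positive

texts = ["Best purchase I've made all year", "The battery died after an hour"]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
outputs = model(**batch, labels=labels)  # forward pass + cross-entropy loss
outputs.loss.backward()                  # gradients for every weight in the model
optimizer.step()                         # the update step that GPT-3-scale models
                                         # put out of reach for most organizations
```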

So, fine-tuned versus not-fine-tuned is the lion's share of the explanation here, and it's why it's more or less expected that much smaller models will outperform GPT-3 on certain kinds of tasks. Vanilla GPT-3 has the incidental ability to complete a wide variety of tasks, as a side effect of its pre-training on next-word prediction, whereas Multimodal-CoT is lightweight, great at doing one thing, and useless for anything else.


That's all true, and maybe I'm making a fool of myself here, but I've been following language models since back when results from fine-tuning on datasets got all of the attention, and 95% on the text-only problems of a quite difficult and varied dataset (from what I can tell) still seems really phenomenal.

Compare fine-tuned performance on ARC-Easy against vastly larger models, for example:

https://leaderboard.allenai.org/arc_easy/submission/c03ng2s9f4tt4ugi12q0


Did you use ChatGPT to write your response? I'm guessing not. That seems as strong an argument against your thoughts as any.


Disagree—the tech might (will) get better quickly.
