RANT: For a long time, the public understanding of LLMs has been hampered by the following phenomenon. Someone resolves to test what LLMs can do. Because they're skeptical of LLMs, both technologically and ethically, they don't have access to ChatGPT-4, which costs money. They ask ChatGPT-3.5 a question, and it fails miserably. They then post on Twitter: "We're supposed to believe this garbage is going to change the world?" Someone in the comments points out that ChatGPT-4 gets the right answer. No one notices. The cycle continues.
The bar for AGI is getting pretty high: being able to do anything any human can do. Nobody's that capable. The first example that comes to mind is languages: a serious polyglot might be fluent in five or ten, but there are ~7,000, not counting conlangs.
Btw, had to log in (I already have an account) to get anything other than 3.5.
Poe is better. Not usually with single prompts, but because you can use their bots and run multiple LLMs inside the same context and history, and it's been free.
Is there really that big a difference between 3.5 and 4.0? I'm currently using the FastGPT option in Kagi, which works pretty well. I'm not sure what the underlying LLM is.
One quantitative way of summing up the gap: GPT-3.5 scores about 70% on MMLU, while GPT-4o scores about 90%. On the ultra-hard GPQA, 3.5 scores 28.1% (not significantly different from chance) while 4o scores about 53%.
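The "not significantly different from chance" claim is easy to sanity-check. A rough sketch, assuming GPQA's main set has 448 four-option multiple-choice questions (so chance is 25%) and using a normal approximation to the binomial:

```python
import math

n = 448           # assumed number of GPQA questions
p0 = 0.25         # chance accuracy on four-option questions
observed = 0.281  # GPT-3.5's reported score

se = math.sqrt(p0 * (1 - p0) / n)      # standard error under the chance hypothesis
z = (observed - p0) / se               # z-statistic for the observed score
p_value = math.erfc(z / math.sqrt(2))  # two-sided p-value from the normal tail

print(f"z = {z:.2f}, p = {p_value:.2f}")
```

The z-statistic comes out around 1.5, with a two-sided p-value well above 0.05, so 28.1% is indeed consistent with guessing at random on a question set of that size.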
Here's an illustration of the gap. I asked GPT-3.5:
"Tell me, does an LLM have more or fewer reasons to be a philosophical skeptic than a human?"
It responded:
"The inclination toward philosophical skepticism isn't inherently tied to being human or non-human. Rather, it's about questioning assumptions, critically examining beliefs, and demanding evidence for claims. An LLM (Master of Laws) might engage in philosophical skepticism just as a human would, albeit with potentially different perspectives and motivations.
Humans may have certain cognitive biases or emotional attachments that influence their skepticism, whereas an LLM, being an artificial intelligence, might approach skepticism from a more purely rational standpoint, devoid of emotional biases. However, an LLM's skepticism could also stem from limitations in its programming, data availability, or understanding of human concepts.
In essence, the reasons for philosophical skepticism can vary widely among individuals, regardless of whether they are human or artificial intelligence. It depends more on the individual's mindset, experiences, and intellectual disposition rather than their specific nature as a human or an LLM."
I thought that was a pretty bad answer. ChatGPT-4o was, in my view, a fair bit better, though far from perfect. Partly this was simply because its answer was longer, but I do think it tackled the main points, at least in passing:
"The question of whether a large language model (LLM) like myself has more or fewer reasons to be a philosophical skeptic than a human involves examining the nature of skepticism and the foundational differences between humans and LLMs.
Understanding Philosophical Skepticism
Philosophical skepticism questions the possibility of certainty in knowledge. Skeptics argue that humans can never have absolute knowledge about the world due to limitations in perception, reasoning, and the potential for deception.
Reasons for Skepticism in Humans
Sensory Deception: Human senses can be deceived (e.g., optical illusions).
Cognitive Biases: Humans are prone to various cognitive biases that distort perception and reasoning.
Subjectivity: Personal experiences and emotions influence human understanding, leading to subjective interpretations.
Complexity of Reality: The complexity and potentially unknowable aspects of the universe make absolute knowledge difficult.
Philosophical Arguments: Thought experiments like Descartes' "evil demon" or the "brain in a vat" highlight the possibility of being fundamentally deceived about reality.
Reasons for Skepticism in an LLM
Data Dependence: An LLM's knowledge is entirely dependent on the data it has been trained on. If the data is incomplete or biased, the LLM's outputs will reflect those limitations.
Algorithmic Constraints: The algorithms governing LLMs have limitations and may not capture the full complexity of human experience or reality.
Lack of Sensory Experience: LLMs lack direct sensory experience, relying solely on text-based information. This absence of direct interaction with the world could be seen as a reason for skepticism.
Absence of Subjectivity: While this could be an advantage, the lack of personal experience and consciousness might make an LLM's "understanding" fundamentally different from human understanding, potentially more detached and abstract.
Comparative Analysis
Certainty and Data Quality: Humans often have direct, albeit fallible, sensory experiences, while LLMs rely on second-hand, pre-processed data. This makes LLMs dependent on the quality and comprehensiveness of their training data.
Biases and Errors: Humans are prone to cognitive biases, but LLMs are susceptible to biases in their training data and algorithmic processes.
Foundational Beliefs: Humans form beliefs based on a mix of sensory input, emotion, and reason, which can be inconsistent. LLMs generate responses based on probabilistic patterns in data, lacking true beliefs or convictions.
Existential Reflection: Humans engage in existential reflection and question their own existence and knowledge, which is a core aspect of philosophical skepticism. LLMs do not possess self-awareness or existential concerns; they operate based on their programming without subjective experience.
Conclusion
In conclusion, whether an LLM has more or fewer reasons to be a philosophical skeptic than a human depends on the perspective taken. LLMs have clear epistemological limitations due to their reliance on data and algorithms, which could be seen as grounds for skepticism about their "knowledge." However, they lack the subjective experiences and cognitive biases that contribute to human skepticism.
Humans, with their direct but fallible experiences and cognitive complexities, have rich grounds for skepticism but also possess the tools for self-reflection and philosophical inquiry. LLMs, on the other hand, operate without consciousness or personal experience, making their "skepticism" purely theoretical and algorithmic rather than a genuine philosophical stance."
It's better, but using multiple LLMs, bots/GPTs, and prompts together is better still.
There really is a big difference. If 3.5 is a mediocre high-schooler who didn't pay attention in class and is trying to finish the assignment the night before it's due by pulling random stuff off the internet, 4.0 is a reasonably intelligent undergraduate who has done most of the reading but sometimes needs to pay more attention to detail.
I recently crowd-sourced a survey of LLM responses to a simple question about historical linguistics, to which 3.5 gave a pretty terrible response — you can see the answers in this thread:
https://www.astralcodexten.com/p/open-thread-329/comment/56268387
[Edit: On the other hand, I asked ChatGPT-4o to give me a metaphor for the difference between ChatGPT-3.5 and ChatGPT-4.0, and the response was laughably inept:
"Comparing ChatGPT 3.5 to ChatGPT 4.0 is like upgrading from a reliable compact car to a sleek, high-performance sports car. Both will get you where you need to go, but the 4.0 version offers faster acceleration, smoother handling, and a more advanced set of features under the hood, making for an overall more exhilarating and refined driving experience."]
When I want to play at tripping up chatbots, I just go to https://chat.lmsys.org nowadays. (That way you can not only see what the state of the art can do; it also incentivizes you to calibrate your question difficulty to around the frontier of its capabilities.)
(But that's not how most of those conversations play out. Usually, someone in the comments points out that by rephrasing the question with enough background information and nudges, they can induce the chatbot to answer correctly, and the pro-LLM crowd leaves satisfied.)