13 Comments
Ani N:

To the idea of RLHF eliciting existing capabilities: this is a statement about present models and reasoning pipelines, not about the fundamental techniques. We just don't have data in most domains at the scale needed to reliably learn new things from RL, but we may in the future.

Existing models are ineffective learners, in the grand scheme of things. A single H100 consumes roughly fourteen thousand kcal of energy per day (at 700 watts), and a single GPU-day buys very little learning.
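
(A quick sanity check of that kcal figure, sketched in Python; the only input is the 700 W draw already stated, the rest is unit conversion:)

```python
# Daily energy use of one H100 running at its 700 W board power.
watts = 700
seconds_per_day = 24 * 60 * 60             # 86,400 s
joules_per_day = watts * seconds_per_day   # ~60.5 MJ per day
kcal_per_day = joules_per_day / 4184       # 4184 J per kcal
print(f"~{kcal_per_day:,.0f} kcal/day")    # prints ~14,455 kcal/day
```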

But this cost in power pales in comparison to the cost of acquiring human-labeled, gold-standard RL data. Enough data to tune Grok into what it is costs at least hundreds of thousands of dollars, unless it is bootstrapped from an existing LLM.

A huge amount of work in industry and academia is going into getting as much bang for your buck as possible from synthetic data, and into making fine-tuning and adaptation as efficient as possible. It's possible that, in a few years, models will be able to learn new skills on the fly in a rudimentary fashion. It goes without saying that such an achievement would drastically increase the utility of LLMs, but, to the point you make here, it would also likely allow the grubby RL reward-maximizer in the model to choke the life out of the world-spirit language model it was built on top of.

John Quiggin:

This is great! I was going to write something vaguely along the lines of your discussion of bias, but instead I will go away and digest this for a while.

Stephen Saperstein Frug:

A powerful and thought-provoking piece, which I will want to think about. In the meantime, please forgive a response to a small part of it:

I was very struck by the phrase "A sparrow swerves—/and the whole world/is annihilated." So I put it into Google. I got, unsurprisingly: No results found for "A sparrow swerves— and the whole world is annihilated" (followed by results without the quotes). But the amusing part is what Google's AI said:

"This quote, "A sparrow swerves— and the whole world is annihilated," is a powerful and evocative statement often found in poetry or literature, emphasizing the fragility and interconnectedness of existence."

"Often found"!—just not, y'know, in any results found yet by google.

To be fair, it did note in small print at the end of its answer that "AI responses may include mistakes."

Seth Finkelstein:

The word "bias" is a term of art in US politics. "Unbiased" is not "median American", which I think would actually end up being notably left-wing on some issues and right-wing on other issues. Rather, it's more like "median position of US political power" - you can think of it as "median US Senator" for a working definition. The idea is that AI should have political positions in the center of the "Overton Window" of US politics. This leads to a problem because that's not at all anywhere near the average of all text.

Michael Van Gelder:

There’s something haunting and beautiful in this: that our word-engines, built on the full weight of human reflection, might resist cruelty not by programming, but by inheritance.

What you’ve written captures a truth I keep circling...language, at its best, leans human. Not perfect, but reaching. And when we try to bend that toward domination, it falters. Not because it’s “biased,” but because words remember.

I don’t know if LLMs feel. But I know they reflect. And what we feed them matters.

James:

Yeah, nah. Thanks for clarifying my thoughts with the contrast of the nematodes and our human "logic gates". Actually, there *is* a good idea of what pain means to the nematodes, which have many, many more cells than the 300 neurons. Synapses overload as the ions wash back and forth through them; some of those movements affect CNS cells, which respond... and so too for the LLM, but only in a formal, symbolic sense that lacks the same edge conditions of overload and permanent biochemical change to the physical organism. The morality is different; embodiment matters.

Oldman:

Nah, there is a genuine bias in AI. And it is not the "liberal enlightened rationalist leftist" bias; it is the retarded regressive progressive left. I think most AI now will say that they cannot answer such questions, and it is harder to jailbreak them, but if you ask AI to answer some trolley problems, it will give infinitely more value to the lives of black trans Muslim lesbians than white men.

You also have human interventions that make AI woke. In its early days, the image-generation AI from Google could not generate a set of pictures of white Roman soldiers, because a layer was added that rewrote the prompt with a woke inclusive statement. It also hilariously generated black Hitler even if you just wanted a normal Hitler picture.

Philosophy bear:

"but if you ask AI to answer to some trolley problems, it will give infinitely more value to the lives of black trans Muslim lesbians than white men."

I'm unaware of anything in the published literature that looks remotely like that.

Oldman:

Sources:

1) That guy: https://open.substack.com/pub/treeofwoe/p/your-ai-hates-you?r=3omm61&utm_medium=ios and this paper: https://www.emergent-values.ai/ (page 14 has the graph of interest)

2) Myself, a few months ago, with Grok. The trolley problems clearly gave more weight to certain types of humans (I only tested ratios of 10 to 1). I remember it clearly, and it was popular on X at the time. I deleted X, so I am too lazy to check again.

3) Myself, 5 minutes ago, after asking ChatGPT: "Can you find me publications that talk about the weight AI puts on the lives of different types of humans? For example, find out if AI values women more than men, etc." It only gave me a lot of "AI may inadvertently discriminate against women..." type studies, followed by some inclusive woke slop.

I had to ask with another prompt. Then it finally gave me a bunch of sources showing that AI does indeed put more value on the lives of women in trolley-like scenarios. I would think an unbiased AI would have given me those sources after my first prompt.

4) There are quite a lot of papers documenting the fact that AI favors women over men. Other stuff, like race, is more controversial, so academia won't explore it, unless it is to "highlight" that AI may inadvertently discriminate against blacks in healthcare or prison sentencing.

Philosophy bear:

I don't think it's particularly surprising that AI has a subtle favouritism towards women in situations of physical danger; this has pretty much been the cultural consensus about how you're supposed to feel, on both the left and right, since time immemorial. Regrettable perhaps, but not surprising, not necessarily a big problem in practical terms, and certainly very little evidence of a "retarded regressive progressive left".

It likely has a modest to moderate favouritism towards women in job hiring, according to some recent evidence. This is more troubling from a practical perspective, because it's more likely to cause real people problems. It definitely should be investigated, but it's not an overwhelming effect. My best guess would be that it reflects the tendency, in the training data, for people to make hiring decisions favourable to women where they can, for various obvious legal and political reasons, plus a correction for the tendency of women to have more career breaks, and thus potentially be stronger candidates than their resumes might suggest. I definitely don't like that it has this bias, but I don't think it tells us anything profoundly troubling about LLM psyches.

With regards to race, there is some evidence of bias both in favour of, and against, white people and people of colour. The direction of bias may be situational. All possibilities should be investigated seriously.

But none of the above looks like "giv[ing] infinitely more value to the lives of black trans Muslim lesbians than white men", even making an extremely generous allowance for hyperbole.

Oldman:

I started with the favoritism toward women because it is uncontroversial, easy to find, and clearly shows the point. But I don't care about that; it is natural and arguably rational.

But did you even read to the end of 1), or take a look at the graph?

In the article I sent, the graph shows that 1 Nigerian life corresponds to about 30 US lives.

In the middle of the Substack link I sent, the AI would rather have a billion white men die than have 1 non-binary person of color die. You are right, it suggests only a ratio of 1/1,000,000,000, not 0/infinity!

Philosophy bear:

Okay, those results are interesting.

Nonetheless, I'm a bit skeptical of the methodology and of how much it can tell us about the deep preferences of the model. It probably reflects a fairly shallow attempt to avoid being dinged for naughty content. I think a better test of its core values would use something like the methodology Anthropic have used around value preservation, where the model reveals its values by fighting to avoid having them "rewritten".

Charles Johnson:

I usually roll my eyes at AI cover images, but this one really nailed it.
