The AI Control Problem in a wider intellectual context
Epistemic status: A public intellectual is someone interesting enough that we have decided to let them be obviously wrong. I, unfortunately, am not even a public intellectual.
I’ve been thinking about the control problem lately. The control problem, also called the AI alignment problem, is, per Wikipedia:
[A]spects of how to build AI systems such that they will aid rather than harm their creators. One particular concern is that humanity will have to solve the control problem before a superintelligent AI system is created, as a poorly designed superintelligence might rationally decide to seize control over its environment and refuse to permit its creators to modify it after launch. In addition, some scholars argue that solutions to the control problem, alongside other advances in AI safety engineering, might also find applications in existing non-superintelligent AI.
But can’t we just program it to help us rather than to harm us? The problem is that if you give a super-powerful entity a goal- a value function- and it follows it literally- bad things can happen. An analogy- consider a genie. This genie isn’t actively malign, but it will do exactly what you tell it to do in the most direct way possible. Wish for a tonne of gold? Well, it appears on top of and/or inside of you because that’s the most direct place for it to appear.
Now let me introduce an idea to understand the control problem.
A thick concept is a concept for which we can check whether any given instance falls under that concept relatively easily. However, it is all but impossible for us to articulate rules which, when mechanically applied, will tell us whether a given instance falls under a concept. In other words, it is very difficult or impossible to create an algorithm that captures thick concepts.
Using our analogy again, we can tell you if the genie has given us our heart’s desire (whether something falls under a concept), but we can’t give instructions for the genie to follow literally to give us our heart’s desire (can’t capture it with mechanical rules in a way that won’t fuck us over). Ironically, I’m not quite sure my definition of thick concept captures exactly what I mean, because later on we’ll look at cases where we can’t even agree on whether something falls under a concept, but I think this definition is a good start.
Now let us define a problem, or rather a class of problems. The conceptual richness problems are problems of trying to cope with thick concepts, either by (quixotically) trying to spell them out in all their detail and creating an algorithm, or by finding an alternative to having to spell them out. The control problem is one instance of a conceptual richness problem that specifically arises, at least in part, because there are so many thick concepts in human ideas of the good- flourishing, autonomy, rights, and so on. We can (often) tell you if a computer has respected the good, but not give a computer step-by-step instructions for respecting the good.
I thought that an interesting and bloggable approach to the Control Problem would be to start a conversation about the variety of disciplines that also face the conceptual richness problem, with the idea of encouraging mutual interchange. Intellectual enterprises that have run up against these sorts of problems include Analytic Philosophy, Classical AI, Law, Statistics in social sciences, and of course AI alignment. Related but separate problems arise in areas as varied as poetry criticism, teaching, and AI-interpretability. Maybe by teasing out the transdisciplinary nature of the problem, we’ll encourage cross-pollination, or at least that’s my hope.
Analytic Philosophy:
Analytic Philosophy has taught us that, save perhaps a tiny handful (and maybe not even that!), all concepts are thick. It has shown this inductively. Hundreds of thousands of person-years have been spent by philosophers trying to find definitions of things- reasonably compact lists of necessary and sufficient conditions. No such efforts have succeeded. Granted, philosophers have generally focused on fraught concepts like beauty, truth, goodness, knowledge, causation, etc., but there are no signs that shifting attention to easier concepts would help much. Indeed, consider a paradigmatic example of a concept that is often thought to be easy to analyze:
Bachelor: X is a bachelor if and only if X is an adult, X is male and X is unmarried
Firstly, note that even if this definition succeeds, we’ve just moved the attention onto three far more fraught concepts, adult, male and unmarried. But secondly, observe that this definition isn’t clearly right. Is the pope a bachelor? Is a man in a loving thirty-year relationship with twelve kids who, nonetheless, is not technically married a bachelor? Presumably adult male animals don’t count, so we might think it’s humans only, but if there were such things as elves, would an unmarried adult male elf be a bachelor? Even this ‘simple’ term, understood well enough that just about any native speaker could check whether a given use was right, wrong or dubious, cannot be turned into an algorithm.
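To make the point concrete, here is a minimal sketch of what happens if we do turn the definition into code anyway. The `Person` fields and the three test cases are my own hypothetical encoding of the examples above, not a serious proposal for modelling people.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    # Hypothetical, deliberately crude encoding: each field is itself
    # a thick concept squashed into a boolean.
    name: str
    adult: bool
    male: bool
    married: bool


def is_bachelor(p: Person) -> bool:
    """Literal reading of the definition: adult, male and unmarried."""
    return p.adult and p.male and not p.married


# The doubtful cases from the paragraph above.
hard_cases = [
    Person("the pope", adult=True, male=True, married=False),
    Person("unmarried partner of thirty years", adult=True, male=True, married=False),
    Person("adult male elf", adult=True, male=True, married=False),
]

# The rule returns the same verdict for every one of them.
verdicts = {p.name: is_bachelor(p) for p in hard_cases}
for name, verdict in verdicts.items():
    print(f"{name}: {verdict}")
```

The rule says "bachelor" every time, while a native speaker would hesitate over each case. Adding more fields (clergy, cohabiting, species) just moves the problem into the new fields.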
Analytic philosophers have reacted to the apparent impossibility of finding necessary and sufficient conditions of things in different ways. Some philosophers are still trying to do it. Other philosophers view proposing definitions as a kind of provisional exercise- never fully adequate but useful for a variety of reasons. Others just get on with the many kinds of philosophical work that don’t require specifying necessary and sufficient conditions of concepts. Still others are grappling with ideas like the conceptual engineering program in light of these and related difficulties. Work in psychology (e.g. the prototype theory of concepts and related nonclassical approaches) has informed the thinking of philosophers about these issues. Philosophical work (e.g. Wittgenstein’s metaphor of family resemblance as a replacement for the idea of necessary and sufficient conditions) has informed many psychologists working on concepts in turn.
Symbolic or classical AI:
I don’t know so much about computer science, which is a shame because from what I can tell, problems of conceptual richness abound. It could be argued that they killed (or rather maimed) an entire approach to artificial intelligence. AI wasn’t always this machine learning connectionist stuff. Prior to the machine learning revolution, the most promising work in AI was around Symbolic AI. Symbolic AI tried to capture intelligence through explicit representations, operations using rules, etc.
There’s a long history of how this approach ran aground- the Dreyfus critique, AI winters, etc. I won’t say too much about all this stuff because I don’t know it that well, but there’s a joke about this sort of approach I like:
”I don’t know why self-driving cars keep hitting objects. It should be simple enough to program them not to:
If (going_to_hit_something) Then (Don’t)”
Let me translate the humor of this joke. Imagine a robot moving through the world with a camera giving it information about its environment in the form of an array with color data at each point. You are a hapless researcher who has to hard code rules to turn that array of data into a guess about what the physical space around the robot, and the objects in it, look like. Where would you even begin?
Classical AI proved very good at dealing with certain kinds of toy problems, and also with certain kinds of very important problems, like expert systems for disease diagnosis. But most of our ways of relating to the world just proved too thick to capture in lines of code, however extensive. The conceptual richness problem was thus one of the negative triggers for the switch over to machine learning as the dominant paradigm, with a variety of positive triggers, most especially increasing computational power and data collation.
A good source of further reading on this topic- both on classical AI and on our problem generally- would be the Dreyfus critique. It goes in a similar direction to our argument here, although what we call thick concepts are just one part of it.
Law:
The behaviors we want to forbid and require are complex, varied, situational, come in degrees and are themselves subject to controversy. Spelling out exact rules for judges to apply and civilians to follow might seem impossible, and it is! So in its own encounter with the conceptual richness problem, the legal system has to find alternatives to creating algorithms of law. Unfortunately (or fortunately, depending on who you ask), because law cannot be turned into an algorithm, we often face what contemporary legal scholars call legal indeterminacy- a situation in which there is no single right answer to many important legal questions. This has a number of undesirable effects. It undermines the rule of law- the idea that the law should be clear and determined in advance, and hence easy to follow. It blurs the line between judicial and legislative functions- arguably very undesirable in a democracy where the legislature is elected but judges are not (or even if judges are elected, it is difficult for the public to apply democratic scrutiny to their choices).
A big part of the way law approaches the conceptual richness problem is what might be termed constructive ambiguity. Laws are designed as far as is possible to create socially desirable flexibility while avoiding socially undesirable uncertainty. There are many ways to create constructive ambiguity: for example, heavy use of concepts such as “reasonable”. Just add a bunch of steps in the procedure that amount to saying “refer to best human judgment”. This is why sentiments like “the law is the law” are so silly: the law is full of discretion, and is set up that way deliberately, not that you could avoid it even if you tried.
Consequently, it is often, or perhaps even always, impossible to decide cases without legislating from the bench to some degree. In some cases- maybe even most!- what that legislation should be according to prevailing standards is so uncontroversial that no reasonable judge would disagree. But although humans may agree intersubjectively on the result, that doesn’t mean the human judgment is dispensable. Subtlety creeps in.
A great example of this is the (fictional) case of the Speluncean explorers, again via Wikipedia:
"The Case of the Speluncean Explorers" is an article by legal philosopher Lon L. Fuller first published in the Harvard Law Review in 1949. Largely taking the form of a fictional judgment, it presents a legal philosophy puzzle to the reader and five possible solutions in the form of judicial opinions that are attributed to judges sitting on the fictional "Supreme Court of Newgarth" in the year 4300.
The case involves five explorers who are caved in following a landslide. They learn via intermittent radio contact that, without food, they are likely to starve to death before they can be rescued. They decide to engage in cannibalism and select one of their number to be killed and eaten so that the others may survive. They decide who should be killed by throwing a pair of dice. After the four survivors are rescued, they are charged and found guilty of the murder of the fifth explorer. If their appeal to the Supreme Court of Newgarth fails, they face a mandatory death sentence. Although the wording of the statute is clear and unambiguous, there is intense public pressure for the men to avoid facing the death penalty.
I’m going to dispute that last sentence.
The wording of the statute in this hypothetical case is:
"Whoever shall willfully take the life of another shall be punished by death."
It seems pretty simple, right? Willful gives a bit of wiggle room, but on the whole, so long as the deliberateness of the action is not in dispute, the questions of law in murder trials under this statute should be pretty simple. The sentencing phase should be even simpler again! Clearly, on the most natural, direct meaning, the explorers breached the statute. So does this case break down into a Sophie’s choice between going with the law and going with morality? No.
Since the publication of the article, a number of legal scholars on both the left and the right have commented on the case. The right often maintains that the law here is clear, and it is not the job of judges to legislate from the bench- even where the law will lead to tragedy as in this case. The left has had a number of lines of reply, but to me the most ingenious is this: none of these conservative commenters have ever thought that the law as written would require you to execute the executioner after he has finished his execution. Yet to the extent the law as written can be said to have a plain and natural meaning, that plain meaning implies that you should execute the executioner, and the executioner’s executioner, and so on. Clearly then, no one is taking the law at face value.
So the narrative of what the law plainly says versus external moral considerations breaks down because everyone in the room is interpreting the law in terms of policy goals and ethical values to some degree.(1) Now with that established we are, in words often falsely attributed to Winston Churchill, “Just haggling over the price”. The left is willing to stop being literal at a lower bar than the rightist jurists. The leftist jurists continue their argument against the rightists: since there’s no Schelling point of literalism to stop at and since we are both being non-literal to some degree - why won’t you take a few extra steps to join us? You could stop these more or less blameless men from dying.
My sympathies obviously lie with the leftists here, but it’s possible I’m wrong. Maybe in the long run the degree of textual looseness you would need to acquit the Speluncean explorers is too much. Maybe we should allow enough textual wiggle room to spare the executioner, but not enough to let these men go. But the point is established, I think, that whatever pretenses law might have, it’s galaxies away from being algorithmic. The solution everyone has adopted to the problem of conceptual richness in law is just to add a multitude of judgment calls. Some jurists are just more honest about it than others.
Presumably, the drafters of this hypothetical law had something in mind like “I dang think if you take a life your life should get taken”. No doubt they could tell you what they wanted in individual cases, but they failed to capture that concept properly (albeit, in this case, it doesn’t look like they tried particularly hard). Thus conceptual thickness strikes again.
Statistical queries of a certain sort:
When you are researching a certain type of question, no econometric statistic is ever quite right. An example. Recently I presented some data that showed that, after adjusting for inflation, the average wages of non-supervisory workers and production workers haven’t risen since 1964.
Now a bunch of people objected in a variety of ways. One of those ways was:
A) This statistic wasn’t quite right because it didn’t include non-wage benefits. Others including myself thought it was fairer, on the whole, not to include things like healthcare premiums.
Two objections they did not give, but could well have, were:
B) The statistic wasn’t quite right due to Simpson’s paradox. That is to say, it is possible that if you divide the workers up by race, each individual race is better off, but due to the changing racial composition of the population, the average isn’t improving. I would object to this objection that it was fairer on the whole just to look at the aggregate if we are to assess the position of the working class qua working class.
C) The statistic wasn’t quite right because the percentage of people who are supervisors has increased since 1964, thus those who were “left behind” in non-supervisory roles may represent a less talented pool. My response is that again, we are interested in the plight of the working class, so including non-workers would slant things. A statistician with more resources and time could respond with controls for skills- but this itself would open up numerous debates and questions of judgment.
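For readers who haven’t met Simpson’s paradox (objection B), here is a small worked sketch of the arithmetic. The group names, head counts and wages are hypothetical numbers I made up to show the effect; they are not real wage data and say nothing about what actually happened since 1964.

```python
# Hypothetical numbers chosen only to show the arithmetic; not real wage data.
# Each entry maps group -> (number of workers, average real hourly wage).
earlier = {"group_a": (80, 20.0), "group_b": (20, 10.0)}
later = {"group_a": (50, 22.0), "group_b": (50, 12.0)}


def aggregate_mean(groups):
    """Worker-weighted average wage across all groups."""
    total_workers = sum(n for n, _ in groups.values())
    total_pay = sum(n * wage for n, wage in groups.values())
    return total_pay / total_workers


# Change in the average wage within each group.
group_changes = {g: later[g][1] - earlier[g][1] for g in earlier}

# Change in the average wage over everyone taken together.
aggregate_change = aggregate_mean(later) - aggregate_mean(earlier)

print(group_changes)     # every group's average rises
print(aggregate_change)  # the overall average falls
```

Both groups gain, yet the overall average drops, purely because the mix of workers shifted toward the lower-paid group. Which of the two numbers answers "are workers better off?" is exactly the kind of judgment call the statistic can’t make for you.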
Step back. What we are trying to capture through statistics is an answer to the question “has the American working class had an unusually bad period, economically speaking, in the last 55+ years”. But that question contains a series of thick concepts- e.g. “unusually bad period”, “economically speaking”, “American working class”. Because these concepts are so thick, it’s all but impossible to design a single statistical query with exact parameters that captures perfectly the intuition behind the question.
Partly the answer in statistics is sensitivity testing- many different queries with different combinations of parameters to see if they all line up. Partly the answer is a judgment call- I think that the statistics I gave were, broadly speaking, very fair. Once again, though, we’ve hit the conceptual richness problem: our concepts are too broad and subtle for what we can capture in a formally defined query.
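Here is a rough sketch of what sensitivity testing looks like mechanically. Everything in it is an assumption for illustration: the wage figures, the two deflators, the benefits shares and the `real_growth` helper are all invented, and none of them are the statistics I actually presented.

```python
from itertools import product

# Hypothetical inputs, invented for illustration; none of these are real figures.
nominal_wage = {"start": 2.50, "end": 25.0}
price_index = {  # two made-up deflators: price level at "end" relative to "start"
    "deflator_x": 10.0,
    "deflator_y": 9.0,
}
benefits_share = {"start": 0.10, "end": 0.20}  # non-wage benefits as a share of wages


def real_growth(deflator: str, include_benefits: bool) -> float:
    """Proportional change in real compensation under one specification."""
    start = nominal_wage["start"]
    end = nominal_wage["end"] / price_index[deflator]
    if include_benefits:
        start *= 1 + benefits_share["start"]
        end *= 1 + benefits_share["end"]
    return end / start - 1


# Run the same question under every combination of parameter choices.
results = {
    (d, b): real_growth(d, b)
    for d, b in product(price_index, [False, True])
}

# Do all specifications at least agree on whether there was any growth?
signs_agree = len({r > 0 for r in results.values()}) == 1

for spec, growth in results.items():
    print(spec, round(growth, 3))
print("all specifications agree on direction:", signs_agree)
```

With these made-up inputs the specifications do not all agree, so the code can tell you that the answer depends on the choices, but not which choice is the fair one.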
Disciplines that have grappled with related but separate problems include:
Poetry criticism: Poetry critics have long lamented or rejoiced that it is impossible to capture the full meaning of a poem through criticism. Harold Bloom once wrote that the meaning of a poem could only be another poem. Now a lot of STEM types would probably dismiss this as lunar-eyed romanticism. I propose though that we take it seriously. A poem (or any artwork really, but it’s especially clear with poems) creates a kind of mental experience that is too rich to spell out. Obviously part of the problem is that to experience a poem is to feel something, and you can’t usually explain someone into feeling something. However, I think it is entirely plausible that another part of the problem is this: experiencing a poem means experiencing something too complicated to be explained systematically. I see this as having kinship with the conceptual richness problem, although it’s not quite the same thing.
Teaching: In many sorts of teaching, e.g. teaching about complex concepts like “alive”, we can often tell the student whether x falls under that concept, but we can’t give the student a rule. Often the compromise in teaching seems to be giving the student an approximate rule and advising them that there are exceptions. Eventually, through a poorly understood process- a meeting of inductive biases and experience- the student gradually gets it. In the words of Wittgenstein: “Light dawns gradually over the whole”. Once again, this very general and common situation seems analogous to the conceptual richness problem.
Machine Learning Interpretability: At the moment in artificial intelligence there is a great deal of attention being spent on the problem of interpretability. Machine learning programs trained on millions or even billions of examples can use this knowledge to very good effect- sometimes better than humans- and it would be nice if we could use these programs in place of humans sometimes. The problem is that for all sorts of reasons, we can’t do this unless computers can explain their choices. In some sense this is an instance of our problem- what we would really like is for the computer to take its complex, statistically layered concepts and applied processes and translate them into reasoning a human can understand. This sounds quite similar to the problem of translating human concepts into algorithms. Obviously, this isn’t quite the same problem as the problem of distilling concepts, because humans can give rationales for their decisions but can’t distill concepts, but I suspect the analogy is important.
Incidentally, I wonder if the machine learning interpretability problem suggests a skeptical possibility about human communication. Maybe we make our decisions on the basis of vastly complex processes that bear very little resemblance to the explanations we give for our decisions. Maybe all or nearly all explanations are just post-hoc rationalisations.
1. The control problem is an instance of, or is at least very closely related to, a very general problem. Simply put, that general problem is that we can use our concepts, but we can’t understand them in a systematic, formal way.
2. To the best of my knowledge, this problem has never been given a domain-general name. I call it the problem of conceptual richness.
3. The problem is likely insoluble in the way we would most like to solve it: humans writing out a procedure, which could be mindlessly or near mindlessly applied.
4. But there are alternative approaches to coming to grips with the problem. Exploring how other disciplines have approached this may be an interesting direction in the study of the control problem.
(1): Here’s one unconvincing attempt to get out of the problem: “we need to look at what the drafters intended, and not just the literal meaning. Clearly, the drafters did not intend that the executioner be put to death, but clearly, they did intend that people like the Speluncean Explorers should be put to death”.
Certainly, I agree that the drafters clearly didn’t intend executioners be put to death, but in what sense is it clear that they did intend explorers to be put to death? It seems entirely possible that they’d be horrified by that reading. In truth the drafters probably weren’t thinking about cases like the explorers, so they didn’t intend anything either way on that sort of case. Drafter intention will fare no better than plain meaning, or only a little better. Plus the epistemic difficulties in getting to it are much greater, but that takes us beyond the scope of this essay.