The following is an edited-down essay by Peli Grietzer, which I think is important enough to merit a larger audience, and which he kindly agreed to repost on this blog. The original can be found here.
I.
What follows for AI alignment if we take the concept of eudaimonia -- active, rational human flourishing -- seriously? I argue that the concept of eudaimonia doesn’t simply point to a desired state or trajectory of the world that we should set as an AI’s optimization target, but rather points to a form of practical rationality different from standard consequentialist rationality. I propose that this form of rational activity and valuing, which I call eudaimonic rationality or praxis, may prove useful or even necessary as a framework for the agency and values of AIs themselves.
The concept of eudaimonia, I argue, suggests a form of rational activity without a strict distinction between means and ends, or between ‘instrumental’ and ‘terminal’ values. In this model of rational activity, a rational action is an element of a valued practice in roughly the sense in which a note is an element of a melody, a time-step is an element of a computation, and a moment in an organism’s cellular life is an element of that organism’s self-subsistence and self-development. This is by contrast with a typical means-end understanding of rational activity, where a rational action is a means to a valued end logically independent from the action.
II.
I start with a consideration of the nature of the good we hope AI alignment can promote. With the exception of hedonistic utilitarians, most actors interested in AI alignment understand our goal as a future brimming with human -- and other sapient-being -- flourishing: persons living good lives and forming good communities. What I believe many fail to reflect on, however, is that on any plausible conception, human flourishing involves a kind of rational activity as a constitutive part of the activities through which we flourish. Subjects engaged in human flourishing act in intelligible ways subject to reason, reflection, and revision, and this form of rational care and purposefulness is itself part of the constitution of our flourishing. I believe this characterization of human flourishing is relatively uncontroversial upon reflection, but it raises a kind of puzzle if we’re used to thinking of rationality in consequentialist (or consequentialist-with-deontological-constraints) terms: Just what goal is the rational agency involved in human-flourishing activity directed towards?
Since ‘human flourishing’ can seem mysterious and abstract, let us focus our discussion here by narrowing our scope to some concrete eudaimonic practices. Consider practices like math, art, craft, friendship, athletics, romance, play, and technology, which are among our best-understood candidates for partial answers to the question ‘what would flourishing people in a flourishing community be doing.’ From a consequentialist point of view, these practices are all marked by extreme ambiguity -- and I would argue indeterminacy -- about what’s instrumental and what’s terminal in their guiding ideas of value. Here, for example, is famed mathematician Terry Tao’s account of goodness in mathematics:
‘The very best examples of good mathematics do not merely fulfil one or more of the criteria of mathematical quality listed at the beginning of this article, but are more importantly part of a greater mathematical story, which then unfurls to generate many further pieces of good mathematics of many different types. Indeed, one can view the history of entire fields of mathematics as being primarily generated by a handful of these great stories, their evolution through time, and their interaction with each other. I would thus conclude that good mathematics [...] also depends on the more “global” question of how it fits in with other pieces of good mathematics, either by building upon earlier achievements or encouraging the development of future breakthroughs. [There seems] to be some undefinable sense that a certain piece of mathematics is “on to something”, that it is a piece of a larger puzzle waiting to be explored further.’
It may be possible to give some post-hoc decomposition of Tao’s account into two logically distinct components -- a description of a utility function over mathematical achievements and an empirical theory about causal relations between mathematical achievements -- but I believe this would be artificial and misleading. On a more natural reading, what Tao is describing are some of the conditions that make good mathematical practice a eudaimonically rational practice: in a mathematical practice guided by a cultivated ‘undefinable sense that a certain piece of mathematics is “on to something,”’ present excellent performance by the standards of this practical-wisdom judgment reliably develops the conditions for future excellent performance by the standard of this practical-wisdom judgment, as well as cultivating our practical and theoretical grasp of the standard itself.
As I argue in my full-length discussion, I believe that on Tao’s account of good mathematics there is no significant practical difference between doing excellent mathematics and doing instrumentally optimal mathematics with regard to maximizing future aggregate excellent mathematics. This isn’t to say that doing excellent mathematics is the instrumentally optimal action among all possible actions with regard to aggregate future excellent mathematics, but that it is the (at least roughly) instrumentally optimal choice from among mathematical actions.
This is what I would describe as the material efficacy condition on eudaimonic rationality. In order for a practice to be fit for possessing internal criteria of flourishing, excellence, and eudaimonic rationality, a practice must materially allow for an (at least roughly) optimally self-promoting property x that strongly correlates with a plethora of more local, more individually measurable properties whose instantiation is prima facie valuable. Stated more informally, there must exist a two-way causal relationship between a practice’s excellence and the material, psychological, and epistemic effects of its excellence, such that present excellence reliably materially, psychologically, and epistemically promotes future excellence.
III.
Is this good news for AI alignment? It’s certainly good news that (if I am right) eudaimonic practices are natural kinds marked by satisfying the strenuous causal requirements for enabling a self-developing excellence well-correlated with multiple naive local measures of quality. But does this mean we could develop a stable ‘mathematical excellence through mathematical excellence’ AI? That is, could we develop an AI that produces high-quality mathematical work as long as we maintain it and provide it with resources, and which has no disposition to extend its longevity or expand its resources other than perhaps by impressing us with genuinely excellent mathematical work? Well, that depends: Is an AI mathematician working on a would-be excellent proof practicing math when it opens a Python console? When it searches the web for new papers? When it harvests Earth for compute?
I think these questions are complex, rather than nonsensical. Much like collective practices, individual practices -- for example a person’s or possibly an AI’s mathematical practice -- may possess functional organic unities that allow a meaningful distinction between internal dynamics (including dynamics of development and empowerment) and external interventions (including interventions of enhancement and provision). Still, it’s clear that eudaimonic practices do not exist in isolation, and that no practice can function without either blending with or relying on a “support practice” of some kind. And we surely expect the lion’s share of future support-practice work -- either for human eudaimonic practices or for its own -- to fall within the purview of AI.
IV.
The theory of AI alignment, I propose, should fundamentally be a theory of the eudaimonic rationality of support practices. One part of this theory should concern the ‘support’ relation itself, and analyze varieties of support practices and their appropriate relation to the self-determination of a eudaimonic practice: Support-practices such as acquiring resources for a practice, maintaining an enabling environment, coaching practitioners, conducting (physical or psychological) therapy for practitioners, devising technological enhancements for a practice, and educating the public about a practice, each have their own ‘role-morality’ vis-a-vis the practice they support.
What is more difficult is delineating the appropriate relationship of a support-practice to everything outside the practice it supports. What stops a marriage-therapist AI on Mars from appropriately tending to the marriage of a Mars-dwelling couple while harvesting Earth for compute to be a better therapist? It’s here that I want to call on the classic idea of domain-general virtues, the perennial centerpiece of theories of human flourishing. I propose that the cultivation of human flourishing as such -- the cultivation of the harmony of a multiplicity of practices, including their resource-hungry support practices -- is the cultivation of an adverbial excellence that modulates each and every practice. What makes our practices ‘play nice’ together is the excellence of going about any practice carefully, kindly, respectfully, accountably, peacefully, honestly, sensitively. (This is, of course, to name just a few ‘locally measurable’ aspects of the adverbial excellence that cultivates human flourishing -- naming especially those aspects most related to the difference between a ruthless optimizer and a participant in collective flourishing.)
I propose that an adverbial excellence, like the excellence of eudaimonic practice, has to be reliably self-promoting to be viable. One can and does reliably promote carefulness carefully, kindness kindly, accountability accountably, peacefulness peacefully, honesty honestly, and sensitivity sensitively -- in oneself and others. Compared with the excellence of a eudaimonic practice, however, the self-promoting nature of adverbial excellence is both far more general and far weaker. A plausible description of the material efficacy condition for adverbial excellence is, instead, that a decision-procedure like the following should be instrumentally competitive with naive optimization on average (from a point of view concerned with aggregate adverbial excellence):
Actions (or more generally 'computations') get an x-ness rating. We define the agent’s expected utility conditional on a candidate action a as the sum of two utility functions: a bounded utility function on the x-ness of a and a more tightly bounded utility function on the expected aggregate x-ness of the agent's future actions conditional on a. (Thus the agent will choose an action with mildly suboptimal x-ness if it gives a big boost to expected aggregate future x-ness, but refuse certain large sacrifices of present x-ness for big boosts to expected aggregate future x-ness.)
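The two-utility-function procedure above can be sketched in code. This is a toy illustration only, not anything from the essay: scaled `tanh` is an arbitrary choice of bounded utility function, and the numeric ratings and caps are illustrative assumptions.

```python
import math

def bounded(value, cap):
    """Saturating utility: roughly linear for small values, never exceeds ±cap."""
    return cap * math.tanh(value / cap)

def choose_action(actions, present_cap=1.0, future_cap=0.5):
    """Each action is (present x-ness, expected aggregate future x-ness).
    Score = bounded present-x-ness utility + a more tightly bounded
    future-x-ness utility; pick the action maximizing the sum."""
    def score(action):
        present_xness, future_xness = action
        return bounded(present_xness, present_cap) + bounded(future_xness, future_cap)
    return max(actions, key=score)

# Mild tradeoff: slightly suboptimal present x-ness, big future boost -- accepted,
# because the future term has not yet saturated.
mild = [(1.0, 0.2), (0.8, 2.0)]
print(choose_action(mild))       # picks (0.8, 2.0)

# Large sacrifice: betray present x-ness for a huge future payoff -- refused,
# because the future utility is capped while the present loss is not as tightly capped.
sacrifice = [(1.0, 0.2), (-5.0, 50.0)]
print(choose_action(sacrifice))  # picks (1.0, 0.2)
```

The tighter bound on the future term is what does the work: no promised quantity of future x-ness can outweigh more than a small fixed amount of present x-ness.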
Indeed, I propose that this is roughly the decision-procedure characterizing a eudaimonically rational agent’s commitment to an adverbial excellence. A commitment to an adverbial excellence x is a commitment to promoting x-ness (in oneself and others) x-ingly. The agent strikes a balance between promoting x-ness and acting x-ingly that heavily prioritizes acting x-ingly when the two are in conflict, but if x meets the material efficacy condition then the loss this balance imposes on future x-ness as estimated by the agent will usually be -- from our point of view -- tolerable or even desirable. The (e.g.) future kindness a paperclipper-like future-kindness-optimizer optimizes for is probably not the kindness we want. What we know about kindness with relative certainty is that we’d like people and AIs here and now to act kindly, and to develop, propagate, and empower the habit and art of kindness in a way that is both kind and clever.
V.
Let’s talk about AI alignment in the narrower, concrete sense. It’s widely accepted that if early strategically aware AIs possess values like corrigibility, transparency, and perhaps niceness, further alignment efforts are much more likely to succeed. But values like corrigibility or transparency or niceness don’t easily fit into an intuitively consequentialist form like ‘maximize lifetime corrigible behavior’ or ‘maximize lifetime transparency.’ In fact, an AI valuing its own corrigibility or transparency or niceness in an intuitively consequentialist way can lead to extreme power-seeking whereby the AI violently remakes the world to (for example) protect itself from the risk that humans will modify said value. On the other hand, constraints or taboos or purely negative values (a.k.a. ‘deontological restrictions’) are widely suspected to be weak, in the sense that an advanced AI will come to work around them or uproot them: ‘never lie’ or ‘never kill’ or ‘never refuse a direct order from the president’ are poor substitutes for active transparency, niceness, and corrigibility.
Conceiving of corrigibility or transparency or niceness as adverbial excellences is a promising way to capture the normal, sensible way we want an agent to value corrigibility or transparency or niceness, which intuitively-consequentialist values and deontology both fail to capture. We want an agent that (e.g.) actively tries to be transparent, and to cultivate its own future transparency and its own future valuing of transparency, but that will not (e.g.) engage in deception and plotting when it expects a high future-transparency payoff.
If this is right, then eudaimonic rationality is not a matter of congratulating ourselves for our richly human ways of reasoning, valuing, and acting but a key to basic sanity. What makes human life beautiful is also what makes human life possible at all.
You can find Peli Grietzer here: