Could utility functions be for narrow AI only, and downright antithetical to AGI?
That's quite a fundamental question, and I'm kind of afraid there's an obvious answer that I'm just too uninformed to know about. But I did give this some thought and I can't find the fault in the following argument, so maybe you can?
Eliezer Yudkowsky says that when AGI exists, it will have a utility function. For a long time I didn't understand why, but he gives an explanation in AI Alignment: Why It's Hard, and Where to Start. You can look it up there, but the gist of the argument I got from it is:
- (explicit) If an agent's decisions are incoherent, the agent is behaving foolishly.
- Example 1: If an agent's preferences aren't ordered (it prefers A to B and B to C, but also C to A), it behaves foolishly.
- Example 2: If an agent allocates resources incoherently, it behaves foolishly.
- Example 3: If an agent's preferences depend on the probability of the choice even having to be made, it behaves foolishly.
- (implicit) An AGI shouldn't behave foolishly, so its decisions have to be coherent.
- (explicit) Making coherent decisions is the same thing as having a utility function.
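To make Example 1 concrete, here is a minimal sketch of the usual "money pump" argument as I understand it. Everything in it is my own illustration, not taken from the talk: the function `money_pump`, the items A, B and C, and the flat fee per trade are all assumptions. The agent accepts any trade to an item it prefers over the one it holds, and pays a small fee each time.

```python
def money_pump(prefers, start_item, fee, rounds):
    """Repeatedly offer the agent a trade up to something it prefers.

    `prefers` is a list of (better, worse) pairs. The agent accepts every
    trade from `worse` to `better` and pays `fee` for it. All names and the
    fee model are hypothetical, chosen only to illustrate the argument.
    """
    item, money_lost = start_item, 0
    for _ in range(rounds):
        # Look for any item the agent prefers to the one it currently holds.
        better = next((b for (b, w) in prefers if w == item), None)
        if better is None:
            break  # nothing is preferred to the current item: the agent stops trading
        item = better
        money_lost += fee
    return item, money_lost


# Cyclic preferences: A over B, B over C, but also C over A.
cyclic = [("A", "B"), ("B", "C"), ("C", "A")]
# Ordered preferences: A over B over C.
ordered = [("A", "B"), ("B", "C"), ("A", "C")]

print(money_pump(cyclic, "B", 1, 3))
print(money_pump(ordered, "C", 1, 10))
```

With the cyclic preferences, three trades bring the agent back to the item it started with, three fees poorer, and that can be repeated for as long as it has money. With the ordered preferences the trading stops once the agent holds A.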
I accept that if all of these were true, AGI should have a utility function. I also accept points 1 and 3, the two explicit ones. I doubt point 2, the implicit one.
Before I get to why, I should state my suspicion about why discussions of AGI focus so much on utility functions. Utility functions are fundamental to many problems of narrow AI. If you're trying to win a game, or to provide a service using scarce computational resources, a well-designed utility function is exactly what you need. Utility functions are essential in narrow AI, so it seems reasonable to assume they should be essential in AGI because... we don't know what AGI will look like but it sounds similar to narrow AI, right?
So that's my motivation. I hope to point out that maybe we're confused about AGI because we took a wrong turn way back when we decided it should have a utility function. But I'm aware it is more likely I'm just too dumb to see the wisdom of that decision.
The reasons for my doubt are the following.
- Humans don't have a utility function and make very incoherent decisions. Humans are also the most intelligent organisms on the planet. In fact, it seems to me that the less intelligent an organism is, the more easily its behavior can be approximated with a model that has a utility function!
- Apes behave more coherently than humans. They have a far smaller range of behaviors. They switch between them relatively predictably. They do have culture - one troop of chimps will fish for termites using a twig, while another will do something like a rain dance - but their cultural specifics number in the dozens, while those of humans are innumerable.
- Cats behave more coherently than apes. There are shy cats and bold ones, playful ones and lazy ones, but once you know a cat, you can predict fairly precisely what kind of thing it is going to do on a random day.
- Earthworms behave more coherently than cats. There aren't playful earthworms and lazy ones; they basically all follow the nutrients that they sense around them and occasionally mate.
- And single-celled organisms are so coherent we think we can even model them entirely on standard computing hardware. Which, if it succeeds, would mean we actually know E. coli's utility function to the last decimal point.
- The randomness of human decisions seems essential to human success (on top of other essentials such as speech and cooking). Humans seem to have a knack for sacrificing precious lifetime for fool's errands that very occasionally create benefit for the entire species.
A few occasions where such fool's errands happen to work out will later look like the most intelligent things people ever did - after hindsight bias kicks in. Before Einstein revolutionized physics, he was not obviously more sane than those contemporaries of his who spent their lives doing earnest work in phrenology and theology.
And many people trying many different things, most of them forgotten and a few seeming really smart in hindsight - that isn't a special case that is only really true for Einstein, it is the typical way humans have randomly stumbled into the innovations that accumulate into our technological superiority. You don't get to epistemology without a bunch of people deciding to spend decades of their lives thinking about why a stick looks bent when it goes through a water surface. You don't settle every little island in the Pacific without a lot of people deciding to go beyond the horizon in a canoe, and most of them dying like the fools that they are. You don't invent rocketry without a mad obsession with finding new ways to kill each other.
- An AI whose behavior is determined by a utility function has a couple of problems that human (or squid or dolphin) intelligence doesn't have, and they seem to be fairly intrinsic to having a utility function in the first place. Namely, the vast majority of possible utility functions lead directly into conflict with all other agents.
To define a utility function is to define a (direction towards a) goal. So a discussion of an AI with one, single, unchanging utility function is a discussion of an AI with one, single, unchanging goal. That isn't just unlike the intelligent organisms we know, it isn't even a failure mode of intelligent organisms we know. The nearest approximations we have are the least intelligent members of our species.
- Two agents with identical utility functions are arguably functionally identical to a single agent that exists in two instances. Two agents with utility functions that are not identical are at best irrelevant to each other and at worst implacable enemies.
This enormously limits the interactions between agents and is again very different from the intelligent organisms we know, which frequently display intelligent behavior in exactly those instances where they interact with each other. We know communicating groups (or "hive minds") are smarter than their members, that's why we have institutions. AIs with utility functions as imagined by e.g. Yudkowsky cannot form these.
They can presumably create copies of themselves instead, which might be as good or even better, but we don't know that, because we don't really understand whatever it is exactly that makes institutions more intelligent than their members. It doesn't seem to be purely multiplied brainpower, because a person thinking for ten hours often doesn't find solutions that ten persons thinking together find in an hour. So if an AGI can multiply its own brainpower, that doesn't necessarily achieve the same result as thinking with others.
Now I'm not proposing an AGI should have nothing like a utility function, or that it couldn't temporarily adopt one. Utility functions are great for evaluating progress towards particular goals. Within well-defined areas of activity (such as playing Chess), even humans can temporarily behave as if they had utility functions, and I don't see why AGI shouldn't.
I'm also not saying that something like a paperclip maximizer couldn't be built, or that it could be stopped once underway. The AI alignment problem remains real.
I do contend that the paperclip maximizer wouldn't be an AGI, it would be narrow AI. It would have a goal, it would work towards it, but it would lack what we look for when we look for AGI. And whatever that is, I propose we don't find it within the space of things that can be described with (single, unchanging) utility functions.
And there are other places we could look. Maybe some of it is in whatever it is exactly that makes institutions more intelligent than their members. Maybe some of it is in why organisms (especially learning ones) play - playfulness and intelligence seem correlated, and playfulness has that incoherence that may be protective against paperclip-maximizer-like failure modes. I don't know.
The basic problem is the endemic confusion between the map, the UF as a way of modelling an entity, and the territory, the UF as an architectural feature that makes certain things happen.
It seems to you that entities with simple and obvious goal-directed behaviour (as seen from the outside) have or need UFs, and entities that don't, don't. But there isn't a fixed connection between the way things seem from the outside and the way they work.
From the outside, any system that succeeds in doing anything specialised can be thought of, or described, as a relatively general-purpose system that has been constrained down to a narrower goal by some other system. For instance, a chess-playing system may be described as a general-purpose problem-solver that has been trained on chess. To say its UF defines a goal of winning at chess is the "map" view.
However, it might well be, in terms of the territory (that is, in terms of what is going on inside the black box), a special-purpose system that has been specifically coded for chess, has no ability to do anything else, and therefore does not need any kind of reward channel or training system to keep it focused on chess. So the mere fact that a system, considered from the outside as a black box, does some specific thing is not proof that it has a UF, and therefore not proof that anyone has succeeded in loading values or goals into its UF.
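The map/territory distinction can be sketched in a few lines of Python. This is only a toy under stated assumptions: the three moves, the numbers in `toy_utility`, and both agent classes are hypothetical and stand in for no real chess engine. One agent contains a utility function as an actual, swappable component; the other is a fixed rule with no such component. From the outside they are indistinguishable.

```python
from itertools import combinations

MOVES = ("capture", "advance", "retreat")  # hypothetical toy action set


def toy_utility(move):
    """Hypothetical scores; only their ordering matters here."""
    return {"capture": 3.0, "advance": 2.0, "retreat": 1.0}[move]


class UtilityAgent:
    """Has a UF in the territory sense: a component that can be replaced."""

    def __init__(self, utility):
        self.utility = utility

    def act(self, options):
        return max(options, key=self.utility)


class HardCodedAgent:
    """Special-purpose rule; there is no utility component anywhere inside."""

    def act(self, options):
        if "capture" in options:
            return "capture"
        if "advance" in options:
            return "advance"
        return "retreat"


def same_behaviour(a, b):
    """Black-box comparison over every non-empty set of available moves."""
    situations = [c for r in range(1, len(MOVES) + 1)
                  for c in combinations(MOVES, r)]
    return all(a.act(s) == b.act(s) for s in situations)


with_uf = UtilityAgent(toy_utility)
without_uf = HardCodedAgent()
print(same_behaviour(with_uf, without_uf))   # True: the same "map" fits both

# Editing the UF changes the first agent; the second has nothing to edit.
with_uf.utility = lambda move: -toy_utility(move)
print(same_behaviour(with_uf, without_uf))   # False
```

Describing both agents as "maximising `toy_utility`" predicts their behaviour equally well, but only for the first one is there anything inside the box that could be inspected or replaced.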
Taking an outside view of a system as possessing a UF (in the spirit of Dennett's "intentional stance") will only give correct predictions if everything works correctly. The essential point is that you need a fully accurate picture of what is going on inside a black box in order to predict its behaviour under all circumstances, but pictures that are inaccurate in various ways can be good enough for restricted sets of circumstances.
Here's an analogy: suppose that machinery, including domestic appliances, were made of an infinitely malleable substance called Ultronium, say, and were constrained into some particular form, such as a kettle or toaster, by a further gadget called a Veeblefetzer. So long as a kettle functions as a kettle, I can regard it as an Ultronium+Veeblefetzer ensemble. However, such ensembles support different counterfactuals to real kettles. For instance, if the Veeblefetzer on my kettle fritzes, it could suddenly reconfigure it into something else, a toaster or a spice rack -- but that is not possible for an ordinary kettle that is not made of Ultronium.
The converse case to an entity seeming to have a UF just because it fulfils some apparent purpose is an entity that seems not to have a UF because its behaviour is complex and perhaps seemingly random. A UF in the territory sense does not have to be simple, and a complex UF can include higher level goals, such as "seek variety" or "revise your lower level goals from time to time", so the lack of an obvious UF as judged externally does not imply the lack of a UF in the gold-standard sense of an actual component.
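As a rough illustration of that converse case, here is a sketch of a fixed UF that includes a "seek variety" term. The action names, base values, window size and bonus are all made-up assumptions. The utility is defined over an action together with recent history, so the agent is a perfectly ordinary maximiser, yet an observer who only looks at individual choices sees it pick eat over play, then play over explore, then explore over eat.

```python
# Hypothetical base values for three actions.
BASE = {"eat": 1.0, "play": 0.8, "explore": 0.6}


def utility(action, history, window=2, novelty_bonus=1.0):
    """Base value plus a bonus if the action was not among the last `window` choices."""
    recent = history[-window:]
    bonus = novelty_bonus if action not in recent else 0.0
    return BASE[action] + bonus


def run(steps):
    """Greedy maximiser: at each step pick the action with the highest utility."""
    history = []
    for _ in range(steps):
        choice = max(BASE, key=lambda a: utility(a, history))
        history.append(choice)
    return history


print(run(6))  # ['eat', 'play', 'explore', 'eat', 'play', 'explore']
```

Judged externally on actions alone, this looks like the cyclic preferences that are supposed to rule out a UF. Judged on what is inside, there is one unchanging function being maximised throughout.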
The actual possession of a UF is much more relevant to AI safety than being describable in terms of a UF. If an AI doesn't actually have a UF, you can't render it safe by fixing its UF.