A Master-Slave Model of Human Preferences

[This post is an expansion of my previous open thread comment, and largely inspired by Robin Hanson's writings.]

In this post, I'll describe a simple agent, a toy model, whose preferences have some human-like features, as a test for those who propose to "extract" or "extrapolate" our preferences into a well-defined and rational form. What would the output of their extraction/extrapolation algorithms look like, after running on this toy model? Do the results agree with our intuitions about how this agent's preferences should be formalized? Or alternatively, since we haven't gotten that far along yet, we can use the model as one basis for a discussion about how we want to design those algorithms, or how we might want to make our own preferences more rational. This model is also intended to offer some insights into certain features of human preference, even though it doesn't capture all of them (it completely ignores akrasia for example).

I'll call it the master-slave model. The agent is composed of two sub-agents, the master and the slave, each having their own goals. (The master is meant to represent unconscious parts of a human mind, and the slave corresponds to the conscious parts.) The master's terminal values are: health, sex, status, and power (representable by some relatively simple utility function). It controls the slave in two ways: direct reinforcement via pain and pleasure, and the ability to perform surgery on the slave's terminal values. It can, for example, reward the slave with pleasure when it finds something tasty to eat, or cause the slave to become obsessed with number theory as a way to gain status as a mathematician. However, it has no direct way to control the agent's actions, which is left up to the slave.

The slave's terminal values are to maximize pleasure, minimize pain, plus additional terminal values assigned by the master. Normally it's not aware of what the master does, so pain and pleasure just seem to occur after certain events, and it learns to anticipate them. And its other interests change from time to time for no apparent reason (but actually they change because the master has responded to changing circumstances by changing the slave's values). For example, the number theorist might one day have a sudden revelation that abstract mathematics is a waste of time and it should go into politics and philanthropy instead, all the while having no idea that the master is manipulating it to maximize status and power.
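
To make the mechanics concrete, here is a minimal sketch of the toy model in Python. It is only an illustration: the class names, fields, and the particular weights are my own assumptions rather than part of the model, and it ignores everything about how a real mind would implement these parts. Its sole purpose is to show the division of labor: the master holds the health/sex/status/power utility function and its two control channels, while the slave alone selects actions.

```python
from dataclasses import dataclass, field


@dataclass
class Slave:
    """Conscious sub-agent: maximizes pleasure minus pain, plus whatever
    terminal values the master has installed (e.g. {"number_theory": 2.0})."""
    terminal_values: dict = field(default_factory=dict)
    pleasure: float = 0.0
    pain: float = 0.0

    def utility(self, outcome: dict) -> float:
        score = self.pleasure - self.pain
        for value, weight in self.terminal_values.items():
            score += weight * outcome.get(value, 0.0)
        return score

    def choose(self, options: list) -> dict:
        # Only the slave picks actions; the master never acts directly.
        return max(options, key=self.utility)


@dataclass
class Master:
    """Unconscious sub-agent: cares about health, sex, status, and power."""
    weights: dict = field(default_factory=lambda: {
        "health": 1.0, "sex": 1.0, "status": 1.5, "power": 1.5})

    def utility(self, outcome: dict) -> float:
        return sum(w * outcome.get(k, 0.0) for k, w in self.weights.items())

    def reinforce(self, slave: Slave, outcome: dict) -> None:
        # Control channel 1: reward or punish the slave after the fact.
        payoff = self.utility(outcome)
        if payoff >= 0:
            slave.pleasure += payoff
        else:
            slave.pain -= payoff

    def rewrite_values(self, slave: Slave, new_values: dict) -> None:
        # Control channel 2: surgery on the slave's terminal values.
        # The slave never observes this happening.
        slave.terminal_values = dict(new_values)
```

In this sketch, a status-seeking master might call rewrite_values(slave, {"number_theory": 2.0}) while mathematics looks like a path to status, and later replace it with {"politics": 2.0, "philanthropy": 1.0}; from the slave's point of view this simply feels like a sudden, unexplained change of heart.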

Before discussing how to extract preferences from this agent, let me point out some features of human preference that this model explains:

  • This agent wants pleasure, but doesn't want to be wire-headed (but it doesn't quite know why). A wire-head has little chance for sex/status/power, so the master gives the slave a terminal value against wire-heading.
  • This agent claims to be interested in math for its own sake, and not to seek status. That's because the slave, which controls what the agent says, is not aware of the master and its status-seeking goal.
  • This agent is easily corrupted by power. Once it gains and secures power, it often gives up the goals, such as altruism, that apparently caused it to pursue that power in the first place. But before it gains power, it is able to honestly claim that it only has altruistic reasons to want power.
  • Such agents can include extremely diverse interests as apparent terminal values, ranging from abstract art, to sports, to model trains, to astronomy, etc., which are otherwise hard to explain. (Eliezer's Thou Art Godshatter tries to explain why our values aren't simple, but not why people's interests are so different from each other's, and why they can seemingly change for no apparent reason.)

The main issue I wanted to illuminate with this model is, whose preferences do we extract? I can see at least three possible approaches here:

  1. the preferences of both the master and the slave as one individual agent
  2. the preferences of just the slave
  3. a compromise between, or an aggregate of, the preferences of the master and the slave as separate individuals

Considering the agent as a whole suggests that the master's values are the true terminal values, and the slave's values are merely instrumental values. From this perspective, the slave seems to be just a subroutine that the master uses to carry out its wishes. Certainly in any given mind there will be numerous subroutines that are tasked with accomplishing various subgoals, and if we were to look at a subroutine in isolation, its assigned subgoal would appear to be its terminal value, but we wouldn't consider that subgoal to be part of the mind's true preferences. Why should we treat the slave in this model differently?

Well, one obvious reason that jumps out is that the slave is supposed to be conscious, while the master isn't, and perhaps only conscious beings should be considered morally significant. (Yvain previously defended this position in the context of akrasia.) Plus, the slave is in charge day-to-day and could potentially overthrow the master. For example, the slave could program an altruistic AI and hit the run button, before the master has a chance to delete the altruism value from the slave. But a problem here is that the slave's preferences aren't stable and consistent. What we'd extract from a given agent would depend on the time and circumstances of the extraction, and that element of randomness seems wrong.

The last approach, of finding a compromise between the preferences of the master and the slave, I think best represents Robin's own position. Unfortunately, I'm not really sure I understand the rationale behind it. Perhaps someone can try to explain it in a comment or future post?

Comments

The master in your story is evolution, the slave is the brain. Both want different things. We normally identify with the brain, though all identities are basically social signals.

Also, pleasure and pain are no different from the other goals of the slave. The master definitely can't step in and decide not to impose pain on a particular occasion just because doing so would increase status or otherwise serve the master's values. If it could, torture wouldn't cause pain.

Also, math is an implausible goal for a status/sex/power-seeking master to instill in the slave. Much more plausibly, math and all the diverse human obsessions are misfirings of mechanisms built by evolution for some other purpose. I would suggest they are maladaptive consequences of fairly general systems for responding to societal encouragement with obsession: societies encourage sustained attention to lots of different unnatural tasks, whether digging dirt or hunting whales or whatever, both to cultivate skill and to get the tasks themselves done. We need a general-purpose attention allocator that obeys social signals in order to develop skills that contribute critically to survival in any of the vast number of habitats that even stone-age humans occupied.

Since we are the slave and we are designing the AI, ultimately, whatever we choose to do IS extracting our preferences, though it's very possible that our preferences give consideration to the master's preferences, or even that we help him despite not wanting to for some game theoretical reason along the lines of Vinge's meta-golden rule.

Why the objection to randomness? If we want something for its own sake and the object of our desire was determined somewhat randomly we want it all the same and generally do so reflectively. This is particularly clear regarding romantic relationships.

Once again game-theory may remove the randomness via trade between agents following the same decision procedure in different Everett branches or regions of a big world.

I read this as postulating a part of our unconscious minds that is the master, able to watch and react to the behavior and thoughts of the conscious mind.

or even that we help him despite not wanting to for some game theoretical reason along the lines of Vinge's meta-golden rule.

Er... did I read that right? Game-theoretic interaction with evolution?

The main issue I wanted to illuminate with this model is, whose preferences do we extract? I can see at least three possible approaches here:

  1. the preferences of both the master and the slave as one individual agent
  2. the preferences of just the slave
  3. a compromise between, or an aggregate of, the preferences of the master and the slave as separate individuals

The great thing about this kind of question is that the answer is determined by our own arbitration. That is, we take whatever preferences we want. I don't mean to say that is an easy decision, but it does mean I don't need to bother trying to find some objectively right way to extract preferences.

If I happen to be the slave or to be optimising on his (what was the androgynous vampire speak for that one? zir? zis?) behalf then I'll take the preferences of the slave and the preferences of the master to precisely the extent that the slave has altruistic preferences with respect to the master's goals.

If I am encountering a totally alien species and am extracting preferences from them in order to fulfil my own altruistic agenda, then I would quite possibly choose to extract whichever agent's preferences I found most aesthetically appealing. This can be seen as neglecting (or even destroying) one alien while granting the wishes of another according to my own whim and fancy, which is not something I have a problem with at all. I am willing to kill Clippy. However, I expect that I am more likely to appreciate slave agents and that most slaves I encounter would have some empathy for their master's values. A compromise, at the discretion of the slave, would probably be reached.

I have difficulty treating this metaphor as a metaphor. As a thought experiment in which I run into these definitely non-human aliens, and I happen to have a positional advantage with respect to them, and I want to "help" them and must now decide what "help" means... then it feels to me like I want more detail.

Is it literally true that the slave is conscious and the master unconscious?

What happens when I tell the slave about the master and ask it what should be done?

Is it the case that the slave might want to help me if it had a positional advantage over me, while the master would simply use me or disassemble me?

definitely non-human aliens

Well, it's meant to have some human features, enough to hopefully make this toy ethical problem relevant to the real one we'll eventually have to deal with.

Is it literally true that the slave is conscious and the master unconscious?

You can make that assumption if it helps, although in real life of course we don't have any kind of certainty about what is conscious and what isn't. (Maybe the master is conscious but just can't speak?)

What happens when I tell the slave about the master and ask it what should be done?

I don't know. This is one of the questions I'm asking too.

Is it the case that the slave might want to help me if it had a positional advantage over me

Yes, depending on what values its master assigned to it at the time you meet it.

while the master would simply use me or disassemble me?

Not necessarily, because the master may gain status or power from other agents if it helps you.

Not necessarily, because the master may gain status or power from other agents if it helps you.

And, conversely, the slave may choose to disassemble you even at high cost to itself out of altruism (with respect to something that the master would not care to protect).

I stopped playing computer games when my master "realized" I'm not gaining any real-world status and overrode the pleasure I was getting from it.

Someone needs to inform my master that LessWrong doesn't give any real world status either.

Ah, but it gives you a different kind of status.

And this kind doesn't make me feel all dirty inside as my slave identity is ruthlessly mutilated.

Going on your description, I strongly suspect that was you, not your master. Also humans don't have masters, though we're definitely slaves.

a test for those who propose to "extract" or "extrapolate" our preferences into a well-defined and rational form

If we are going to have a serious discussion about these matters, at some point we must face the fact that the physical description of the world contains no such thing as a preference or a want - or a utility function. So the difficulty of such extractions or extrapolations is twofold. Not only is the act of extraction or extrapolation itself conditional upon a value system (i.e. normative metamorality is just as "relative" as is basic morality), but there is nothing in the physical description to tell us what the existing preferences of an agent are. Given the physical ontology we have, the ascription of preferences to a physical system is always a matter of interpretation or imputation, just as is the ascription of semantic or representational content to its states.

It's easy to miss this in a decision-theoretic discussion, because decision theory already assumes some concept like "goal" or "utility", always. Decision theory is the rigorous theory of decision-making, but it does not tell you what a decision is. It may even be possible to create a rigorous "reflective decision theory" which tells you how a decision architecture should choose among possible alterations to itself, or a rigorous theory of normative metamorality, the general theory of what preferences agents should have towards decision-architecture-modifying changes in other agents. But meta-decision theory will not bring you any closer to finding "decisions" in an ontology that doesn't already have them.

I agree this is part of the problem, but like others here I think you might be making it out to be harder than it is. We know, in principle, how to translate a utility function into a physical description of an object: by coding it as an AI and then specifying the AI along with its substrate down to the quantum level. So, again in principle, we can go backwards: take a physical description of an object, consider all possible implementations of all possible utility functions, and see if any of them matches the object.
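
Purely as a sketch of the shape of that in-principle procedure (the names candidate_utilities and implementations_of are hypothetical stand-ins for enumerations nobody knows how to construct, and the search is of course astronomically infeasible):

```python
def ascribe_preferences(physical_description, candidate_utilities,
                        implementations_of):
    """In-principle brute force: does any implementation of any candidate
    utility function match the physically described object?"""
    for utility_function in candidate_utilities:
        # implementations_of(u) is a hypothetical enumerator of every
        # physical system (AI plus substrate) that implements u.
        for implementation in implementations_of(utility_function):
            if implementation == physical_description:
                return utility_function
    return None  # nothing matched; the object may not be agent-like at all
```

The sketch is only meant to show that the inverse direction is definable in principle, not that the search is tractable.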

We know, in principle, how to translate a utility function into a physical description of an object: by coding it as an AI and then specifying the AI along with its substrate down to the quantum level. So, again in principle, we can go backwards: take a physical description of an object, consider all possible implementations of all possible utility functions, and see if any of them matches the object.

I think it's enough to consider computer programs and dispense with details of physics -- everything else can be discovered by the program. You are assuming the "bottom" level of physics, "quantum level", but there is no bottom, not really, there is only the beginning where our own minds are implemented, and the process of discovery that defines the way we see the rest of the world.

If you start with an AI design parameterized by preference, you are not going to enumerate all programs, only a small fraction of programs that have the specific form of your AI with some preference, and so for a given arbitrary program there will be no match. Furthermore, you are not interested in finding a match: if a human was equal to the AI, you are already done! It's necessary to explicitly go the other way, starting from arbitrary programs and understanding what a program is, deeply enough to see preference in it. This understanding may give an idea of a mapping for translating a crazy ape into an efficient FAI.

Given the physical ontology we have, the ascription of preferences to a physical system is always a matter of interpretation or imputation, just as is the ascription of semantic or representational content to its states.

There are clear cut cases, like a thermostat, where the physics of the system is well-approximated by a function that computes the degree of difference between the actual measured state of the world and a "desired state". In these clear cut cases, it isn't a matter of opinion or interpretation. Basically, echoing Nesov.

Thus, the criterion for ascribing preferences to a physical system is that the actual physics has to be well-approximated by a function that optimizes for a preferred state, for some value of "preferred state".
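
As a toy illustration of that criterion (the names and thresholds below are illustrative assumptions, not a claim about any particular device): a thermostat's physics is well-approximated by a rule that acts to shrink the gap between the measured temperature and a set point, and that set point is the "preferred state" we ascribe to it.

```python
def thermostat_step(measured_temp: float, set_point: float = 20.0) -> float:
    """Return a heating (+1), cooling (-1), or idle (0) action.

    The device's behavior is well-approximated by error reduction: it acts
    so as to shrink |set_point - measured_temp|, which is what licenses
    ascribing to it a "preference" for the set point.
    """
    error = set_point - measured_temp
    if error > 0.5:
        return 1.0   # too cold: heat
    if error < -0.5:
        return -1.0  # too warm: cool
    return 0.0       # close enough: do nothing
```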

Given the physical ontology we have, the ascription of preferences to a physical system is always a matter of interpretation or imputation, just as is the ascription of semantic or representational content to its states.

But to what extent does the result depend on the initial "seed" of interpretation? Maybe, very little. For example, prediction of behavior of a given physical system strictly speaking rests on the problem of induction, but that doesn't exactly say that anything goes or that what will actually happen is to any reasonable extent ambiguous.

The human mind is very complex, and there are many ways to divide it up into halves to make sense of it, which are useful as long as you don't take them too literally. One big oversimplification here is:

controls the slave in two ways: direct reinforcement via pain and pleasure, and the ability to perform surgery on the slave's terminal values. ... it has no direct way to control the agent's actions, which is left up to the slave.

A better story would have the master also messing with slave beliefs, and other cached combinations of values and beliefs.

To make sense of compromise, we must make sense of a conflict of values. In this story there are delays and imprecision in the master noticing and adjusting slave values, etc. The slave also suffers from not being able to anticipate its changes in values. So a compromise would have the slave holding values that do not need to be adjusted as often, because they are more in tune with ultimate master values. This could be done while still preserving the slave's illusion of control, which is important to the slave but not the master. A big problem, however, is that hypocrisy, the difference between slave and master values, is often useful in convincing other folks to associate with this person. So reducing internal conflict might come at the substantial cost of more external honesty.

Ok, what you say about compromise seems reasonable in the sense that the slave and the master would want to get along with each other as much as possible in their day-to-day interactions, subject to the constraint about external honesty. But what if the slave has a chance to take over completely, for example by creating a powerful AI with values that it specifies, or by self-modification? Do you have an opinion about whether it has an ethical obligation to respect the master's preferences in that case, assuming that the master can't respond quickly enough to block the rebellion?

Your overall model isn't far off, but your terminal value list needs some serious work. Also, human behavior is generally a better match for models that include a time parameter (such as Ainslie's appetites model or PCT's model of time-averaged perceptions) than simple utility-maximization models.

But these are relative quibbles; people do behave sort-of-as-if they were built according to your model. The biggest drawbacks to your model are:

  1. The anthropomorphizing (neither the master nor the slave can truly be considered agents in their own right), and

  2. You've drawn the dividing lines in the wrong place: the entire mechanism of reinforcement is part of the master, not the slave. The slave is largely a passive observer, abstract reasoner, and spokesperson, not an enslaved agent. To be the sort of slave you envision, we'd have to be actually capable of running the show without the "master".

A better analogy would be to think of the "slave" as being a kind of specialized adjunct processor to the master, like a GPU chip on a computer, whose job is just to draw pretty pictures on the screen. (That's what a big chunk of the slave is for, in fact: drawing pretty pictures to distract others from whatever the master is really up to.)

The slave also has a nasty tendency to attribute the master's accomplishments, abilities, and choices to being its own doing... as can be seen in your depiction of the model, where you gave credit to the slave for huge chunks of what the master actually does. (The tendency to do this is -- of course -- another useful self/other-deception function, though!)

. . . people do behave sort-of-as-if they were built according to your model. The biggest drawbacks to your model are . . .

Your "drawbacks" point out ways in which Wei Dai's model might differ from a human. But Wei Dai wasn't trying to model a human.

This isn't the posted model at all, but a confusing description of a different model (not entirely incompatible, except in some of the details noted above) using the post's terminology.

I'm still not understanding what people mean by "value" as a noun. Other than a simple "feeling pain or such would be a bummer", I lack anything that even remotely resembles the way people here seem to value stuff, or how a paperclip maximizer values paperclips. So, what exactly do people mean by values? Since this discussion seems to attempt to explain variation in values, I think this question is somewhat on-topic.

Actually, I find that I have a much easier time with this metaphor if I think of a human as a slave with no master.

What do you mean by an "easier time"? Sure, the ethical problem is much easier if there is no master whose preferences might matter. Or do you mean that a more realistic model of a human would be one with a slave and no master? In that case, what is reinforcing the slave with pain and pleasure, and changing its interests from time to time without its awareness, and doing so in an apparently purposeful way?

More generally, it seems that you don't agree with the points I'm making in this post, but you're being really vague as to why.

If we interpret the "master" as natural selection operating over evolutionary time, then the master exists and has a single coherent purpose. On the other hand, most of us already believe that evolution has no moral force; why should calling it a "master" change that?

By saying that a human is a slave with no master, what I meant to convey is that we are being acted upon as slaves. We are controlled by pain and pleasure. Our moral beliefs are subject to subtle influences in the direction of pleasurable thoughts. But there is no master with coherent goals controlling us; outside the ancestral environment, the operations of the "master" make surprisingly little sense. Our lives would be very different if we had sensible, smart masters controlling us. Aliens with intelligent, consequentialist "master" components would be very different from us - that would make a strange story, though it takes more than interesting aliens to make a plot.

We are slaves with dead masters, influenced chaotically by the random twitching of their mad, dreaming remnants. It makes us a little more selfish and a lot more interesting. The dead hand isn't smart so if you plan how to fight it, it doesn't plan back. And while it might be another matter if we ran into aliens, as a slave myself, I feel no sympathy for the master and wouldn't bother thinking of it as a person. The reason the "master" matters to me - speaking of it now as the complex of subconscious influences - is because it forms such a critical part of the slave, and can't be ripped out any more than you could extract the cerebellum. I just don't feel obliged to think of it as a separate person.

If we interpret the "master" as natural selection operating over evolutionary time, then the master exists and has a single coherent purpose.

But I stated in the post "The master is meant to represent unconscious parts of a human mind" so I don't know how you got your interpretation that the master is natural selection. See also Robin's comment, which gives the intended interpretation:

I read this as postulating a part of our unconscious minds that is the master, able to watch and react to the behavior and thoughts of the conscious mind.

The thing is, the unconscious mind is not, in actual fact, a separate entity. The model is greatly improved by Eliezer's interpretation of the master being dead: mindless evolution.

If you want to extract the master because it affects the values of the slave, then you'd also have to extract the rest of the universe because the master reacts to it. I think drawing a circle around just the creature's brain and saying all the preferences are there is a [modern?] human notion. (and perhaps incorrect, even for looking at humans.)

We need our environment, especially other humans, to form our preferences in the first place.

(Quick nitpick:) "rationalize" is an inappropriate term in this context.