The Blue-Minimizing Robot

Imagine a robot with a turret-mounted camera and laser. Each moment, it is programmed to move forward a certain distance and perform a sweep with its camera. As it sweeps, the robot continuously analyzes the average RGB value of the pixels in the camera image; if the blue component passes a certain threshold, the robot stops, fires its laser at the part of the world corresponding to the blue area in the camera image, and then continues on its way.
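Since the whole argument turns on the robot's program being this simple, it may help to see it written out. The following is a minimal sketch; the camera, laser, and motor interfaces and the threshold value are invented for illustration:

```python
# Minimal sketch of the robot's entire program, as described above.
# The camera/laser/motor objects and the threshold are hypothetical.

BLUE_THRESHOLD = 200  # illustrative cutoff on a 0-255 scale

def average_rgb(pixels):
    """Mean (R, G, B) over a list of (r, g, b) tuples."""
    n = len(pixels)
    return tuple(sum(channel) / n for channel in zip(*pixels))

def run_robot(camera, laser, motor):
    while True:
        motor.move_forward()
        for frame in camera.sweep():
            _, _, b = average_rgb(frame.pixels)
            if b > BLUE_THRESHOLD:
                laser.fire_at(frame.world_position)
    # That's all there is: no model of the world, no represented goal,
    # nothing about holograms, lenses, or self-preservation.
```

Everything that follows is about the gap between this dozen lines of code and the stories we tell about it.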

Watching the robot's behavior, we would conclude that this is a robot that destroys blue objects. Maybe it is a surgical robot that destroys cancer cells marked by a blue dye; maybe it was built by the Department of Homeland Security to fight a group of terrorists who wear blue uniforms. Whatever. The point is that we would analyze this robot in terms of its goals, and in those terms we would be tempted to call this robot a blue-minimizer: a machine that exists solely to reduce the number of blue objects in the world.

Suppose the robot had human-level intelligence in some side module, but no access to its own source code; that it could learn about itself only through observing its own actions. The robot might come to the same conclusions we did: that it is a blue-minimizer, set upon a holy quest to rid the world of the scourge of blue objects.

But now stick the robot in a room with a hologram projector. The hologram projector (which is itself gray) projects a hologram of a blue object five meters in front of it. The robot's camera detects the projector, but its RGB value is harmless and the robot does not fire. Then the robot's camera detects the blue hologram and zaps it. We arrange for the robot to enter this room several times, and each time it ignores the projector and zaps the hologram, without effect.

Here the robot is failing at its goal of being a blue-minimizer. The right way to reduce the amount of blue in the universe is to destroy the projector; instead its beams flit harmlessly through the hologram.

Again, give the robot human-level intelligence. Teach it exactly what a hologram projector is and how it works. Now what happens? Exactly the same thing - the robot executes its code, which says to scan the room until its camera registers blue, then shoot its laser.

In fact, there are many ways to subvert this robot. What if we put a lens over its camera which inverts the image, so that white appears as black, red as green, blue as yellow, and so on? The robot will not shoot us with its laser to prevent such a violation (unless we happen to be wearing blue clothes when we approach) - its entire program was detailed in the first paragraph, and there's nothing about resisting lens alterations. Nor will the robot correct itself and shoot only at objects that appear yellow - its entire program was detailed in the first paragraph, and there's nothing about correcting its program for new lenses. The robot will continue to zap objects that register a blue RGB value; but now it'll be shooting at anything that is yellow.
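The effect of the inversion lens can be checked directly: inverting each channel maps yellow to blue and blue to yellow, so the robot's unchanged threshold test now fires on yellow objects. A toy calculation (the threshold value is again illustrative):

```python
# Under an inverting lens, each channel c becomes 255 - c, so a yellow
# object arrives at the camera looking blue, and a blue one looking yellow.

BLUE_THRESHOLD = 200

def invert(pixel):
    r, g, b = pixel
    return (255 - r, 255 - g, 255 - b)

def triggers(pixel):
    return pixel[2] > BLUE_THRESHOLD  # the robot's unchanged blue test

yellow, blue = (255, 255, 0), (0, 0, 255)

assert triggers(invert(yellow))      # yellow objects now get zapped
assert not triggers(invert(blue))    # real blue objects are now ignored
```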

The human-level intelligence version of the robot will notice its vision has been inverted. It will know it is shooting yellow objects. It will know it is failing at its original goal of blue-minimization. And maybe if it had previously decided it was on a holy quest to rid the world of blue, it will be deeply horrified and ashamed of its actions. It will wonder why it has suddenly started to deviate from this quest, and why it just can't work up the will to destroy blue objects anymore.

The robot goes to Quirinus Quirrell, who explains that robots don't really care about minimizing the color blue. They only care about status and power, and pretend to care about minimizing blue in order to impress potential allies.

The robot goes to Robin Hanson, who explains that there are really multiple agents within the robot. One of them wants to minimize the color blue, the other wants to minimize the color yellow. Maybe the two of them can make peace, and agree to minimize yellow one day and blue the next?

The robot goes to Anna Salamon, who explains that robots are not automatically strategic, and that if it wants to achieve its goal it will have to learn special techniques to keep focus on it.

I think all of these explanations hold part of the puzzle, but that the most fundamental explanation is that the mistake began as soon as we started calling it a "blue-minimizing robot". This is not because its utility function doesn't exactly correspond to blue-minimization: even if we try to assign it a ponderous function like "minimize the color represented as blue within your current visual system, except in the case of holograms" it will be a case of overfitting a curve. The robot is not maximizing or minimizing anything. It does exactly what it says in its program: find something that appears blue and shoot it with a laser. If its human handlers (or itself) want to interpret that as goal directed behavior, well, that's their problem.

It may be that the robot was created to achieve a specific goal. It may be that the Department of Homeland Security programmed it to attack blue-uniformed terrorists who had no access to hologram projectors or inversion lenses. But to assign the goal of "blue minimization" to the robot is a confusion of levels: this was a goal of the Department of Homeland Security, which became a lost purpose as soon as it was represented in the form of code.

The robot is a behavior-executor, not a utility-maximizer.

In the rest of this sequence, I want to expand upon this idea. I'll start by discussing some of the foundations of behaviorism, one of the earliest theories to treat people as behavior-executors. I'll go into some of the implications for the "easy problem" of consciousness and philosophy of mind. I'll very briefly discuss the philosophical debate around eliminativism and a few eliminativist schools. Then I'll go into why we feel like we have goals and preferences and what to do about them.

Comments


The conclusion I'd draw from this essay is that one can't necessarily derive a "goal" or a "utility function" from all possible behavior patterns. If you ask "What is the robot's goal?", the answer is, "it doesn't have one," because it doesn't assign a total preference ordering to states of the world. At best, you could say that it prefers state [I SEE BLUE AND I SHOOT] to state [I SEE BLUE AND I DON'T SHOOT]. But that's all.

This has some implications for AI, I think. First of all, not every computer program has a goal or a utility function. There is no danger that your TurboTax software will take over the world and destroy all human life, because it doesn't have a general goal of maximizing the number of completed tax forms. Even rather sophisticated algorithms can completely lack goals of this kind - they aren't designed to maximize some variable over all possible states of the universe. The unfriendly-AI scenario seems to be a risk only if an AI has a true goal function, and many useful advances in artificial intelligence (defined in the broad sense) carry no risk of this kind.

Do humans have goals? I don't know; it's plausible that we have goals that are complex and hard to define succinctly, and it's also plausible that we don't have goals at all, just sets of instructions like "SHOOT AT BLUE." The test would seem to be whether a human goal of "PROMOTE VALUE X" continues to imply behaviors in strange and unfamiliar circumstances, or whether we only have rules of behavior in a few common situations. If you can think clearly about ethics (or preferences) in the far future, or the distant past, or regarding unfamiliar kinds of beings, and your opinions have some consistency, then maybe those ethical beliefs or preferences are goals. But probably many kinds of human behavior are more like sets of instructions than goals.

At best, you could say that it prefers state [I SEE BLUE AND I SHOOT] to state [I SEE BLUE AND I DON'T SHOOT]. But that's all.

No; placing a blue-tinted mirror in front of it will have it shoot itself, even though that greatly diminishes its future ability to shoot. Generally, a generic program really can't be assigned any nontrivial utility function.

Also, you misspelled my name - it's Quirinus, not Quirinius.

The robot is not consequentialist, its decisions are not controlled by the dependence of facts about the future on its decisions.

Good point, but the fact that humans are consequentialists (at least partly) doesn't seem to make the problem much easier. Suppose we replace Yvain's blue-minimizer robot with a simple consequentialist robot that has the same behavior (let's say it models the world as a 2D grid of cells that have intrinsic color, it always predicts that any blue cell that it shoots at will turn some other color, and its utility function assigns negative utility to the existence of blue cells). What does this robot "actually want", given that the world is not really a 2D grid of cells that have intrinsic color?
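To make the parent's hypothetical concrete, here is one way such a consequentialist variant could be sketched: it predicts outcomes inside its (mistaken) grid model and picks whichever action its utility function rates highest. All names and the model's details are invented for illustration:

```python
# Sketch of the hypothetical consequentialist variant: a model of the
# world as a 2D grid of intrinsically colored cells, a fixed prediction
# that zapping a blue cell turns it gray, and a utility function
# assigning negative utility to blue cells.

def utility(grid):
    return -sum(row.count("blue") for row in grid)

def predict(grid, action):
    """The robot's belief about what the world looks like after `action`."""
    if action is None:                    # the do-nothing action
        return grid
    r, c = action
    new_grid = [list(row) for row in grid]
    if new_grid[r][c] == "blue":
        new_grid[r][c] = "gray"           # it believes lasers always work
    return new_grid

def choose_action(grid):
    """Pick the action whose predicted outcome has the highest utility."""
    candidates = [None] + [(r, c)
                           for r in range(len(grid))
                           for c in range(len(grid[0]))]
    return max(candidates, key=lambda a: utility(predict(grid, a)))
```

Nothing in this code pins down what the robot "actually wants" once the grid model stops corresponding to anything real, which is exactly the parent's question.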

What does this robot "actually want", given that the world is not really a 2D grid of cells that have intrinsic color?

Who cares about the question what the robot "actually wants"? Certainly not the robot. Humans care about the question what they "actually want", but that's because they have additional structure that this robot lacks. But with humans, you're not limited to just looking at what they do on auto-pilot; instead, you can just ask the aforementioned structure when you run into problems like this. For example, if you asked me what I really wanted under some weird ontology change, I could say, "I have some guesses, but I don't really know; I would like to defer to a smarter version of me". That's how I understand preference extrapolation: not as something that looks at what your behavior suggests that you're trying to do and then does it better, but as something that poses the question of what you want to some system you'd like to answer the question for you.

It looks to me like there's a mistaken tendency among many people here, including some very smart people, to say that I'd be irrational to let my stated preferences deviate from my revealed preferences; that just because I seem to be trying to do something (in some sense like: when my behavior isn't being controlled much by the output of moral philosophy, I can be modeled as a relatively good fit to a robot with some particular utility function), that's a reason for me to do it even if I decide that I don't want to. But rational utility maximizers get to be indifferent to whatever the heck they want, including their own preferences, so it's hard for me to see why the underdeterminedness of the true preferences of robots like this should bother me at all.

Insert standard low confidence about me posting claims on complicated topics that others seem to disagree with.

In other words, our "actual values" come from our being philosophers, not our being consequentialists.

It seems plausible to me, and I'm not sure that "many" others do disagree with you.

That would imply a great diversity of value systems, because philosophical intuitions differ much more from person to person than primitive desires. Some of these value systems (maybe including yours) would be simple, some wouldn't. For example, my "philosophical" values seem to give large weight to my "primitive" values.

preference extrapolation: not as something that looks at what your behavior suggests that you're trying to do and then does it better, but as something that poses the question of what you want to some system you'd like to answer the question for you

That might be a procedure that generates human preference, but it is not a general preference extrapolation procedure. E.g., suppose we replace Wei Dai's simple consequentialist robot with a robot that has similar behavior, but that also responds to the question, "What system do you want to answer the question of what you want for you?" with the answer, "A version of myself better able to answer that question. Maybe it should be smarter and know more things and be nicer to strangers and not have scope insensitivity and be less prone to skipping over invisible moral frameworks and have concepts that are better defined over attribute space and be automatically strategic and super committed and stuff like that? But since I'm not that smart and I pass over moral frameworks and stuff, everything I just said is probably insufficient to specify the right thing. Maybe you can look at my source code and figure out what I mean by right and then do the thing that a person who better understood that would do?" And then goes right back to zapping blue.

Actually, this notion of consequentialism gives a new clue - the only one I know of - about how to infer agent goals, or how to constrain the kinds of considerations that should be counted as goals, as compared to the other stuff that moves your action incidentally, such as psychological drives or laws of physics. I wonder if Eliezer had this insight before, given that he wrote a similar comment in this thread. I wasn't ready to see this idea on my own until a few weeks ago, and this thread is the first time I thought about the question given the new framework and saw the now-obvious construction. This deserves more than a comment, so I'll be working on a two-post sequence to write this up intelligibly. Or maybe it's actually just stupid; I'll try to figure that out.


(A summary from my notes, in case I get run over by a bus; this uses a notion of "dependence" for which a toy example is described in my post on ADT, but which is much more general: )

The idea of consequentialism, of goal-directed control, can be modeled as follows. If a fact A is controlled by (can be explained/predicted based on) a dependence F: A->O, then we say that A is a decision (action) driven by a consequentialist consideration F, which in turn looks at how A controls the morally relevant fact O.

For a given decision A, there could be many different morally relevant facts O such that the dependence A->O has explanatory power about A. The more a dependence A->O can explain about A, the more morally relevant O is. Finding highly relevant facts O essentially captures A's goals.

This model has two good properties. First, logical omniscience (in particular, just knowledge of actual action) renders the construction unusable, since we need to see dependencies A->O as ambient concepts explaining A, so both A and A->O need to remain potentially unknown. (This is the confusing part. It also lends motivation to the study of complete collection of moral arguments and the nature of agent-provable collection of moral arguments.)

Second, the action (decision) itself, and many other facts that control the action but aren't morally relevant, are distinguished by this model from the things that are. For example, A can't be morally relevant, for that would require the trivial identity dependence A->A to explain A, which it can't, since it's too simple. Similarly for other stuff in a simple relationship with A: the relationship between A and a fact must be in tune with A for the fact to be morally relevant; it's not enough for the fact itself to be in tune with A.

This question doesn't require a fixed definition for a goal concept, instead it shows how various concepts can be regarded as goals, and how their suitability for this purpose can be compared. The search for better morally relevant facts is left open-ended.

steven0461's comment notwithstanding, I can take a guess at what the robot actually wants. I think it wants to take the action that will minimize the number of blue cells existing in the world, according to the robot's current model of the world. That rule for choosing actions probably doesn't correspond to any coherent utility function over the real world, but that's not really a surprise.

The interesting question that you probably meant to ask is whether the robot's utility function over its model of the world can be converted to a utility function over the real world. But the robot won't agree to any such upgrade, so the question is kinda moot.

That might sound hopeless for CEV, but fortunately humans aren't consequentialists with a fixed model of the world. Instead they seem to be motivated by pleasure and pain, which you can't disprove out of existence by coming up with a better model. So maybe there's hope in that direction.

I'll be interested to see where you go with this, but it seems to me that saying, "look, this is the program the robot runs, therefore it doesn't really have a goal", is exactly like saying "look, it's made of atoms, therefore it doesn't really have a goal".

Goals are precisely explained (like rainbows), and not explained away (like kobolds), as the controlled variables of control systems. This robot is such a system. The hypothetical goal of its designers at the Department of Homeland Security is also a goal. That does not make the robot's goal not a goal; it just makes it a different goal.

We feel like we have goals and preferences because we do, in fact, have goals and preferences, and we not only have them, but we are also aware of having them. The robot is not aware of having the goal that it has. It merely has it.

First of all, your control theory work was...not exactly what started me thinking along these lines, but what made it click when I realized the lines I had been thinking along were similar to the ones I had read about in one of your introductory posts about performing complex behaviors without representations. So thank you.

Second - When you say the robot has a "different goal", I'm not sure what you mean. What is the robot's goal? To follow the program detailed in the first paragraph?

Let's say Robot-1 genuinely has the goal to kill terrorists. If a hacker were to try to change its programming to "make automobiles" instead, Robot-1 would do anything it could to thwart the hacker; its goal is to kill terrorists, and letting a hacker change its goal would mean more terrorists get left alive. This sort of stability, in which the preference remains a preference regardless of context, is characteristic of my definition of "goal".

This "blue-minimizing robot" won't display that kind of behavior. It doesn't thwart the person who places a color inversion lens on it (even though that thwarts its stated goal of "minimizing blue"), and it wouldn't try to take the color inversion lens off even if it had a manipulator arm. Even if you claim its goal is just to "follow its program", it wouldn't use its laser to stop someone walking up to it and changing its program, which means its program no longer got followed.

This isn't just a reduction of a goal to a program: predicting the robot's goal-based behavior and its program-based behavior give different results.

If goals reduce to a program like the robot's in any way, it's in the way that Einsteinian mechanics "reduce" to Newtonian mechanics - giving good results in most cases but being fundamentally different and making different predictions on border cases. Because there are other programs that goals do reduce to, like the previously mentioned Robot-1, I don't think it's appropriate to call what the blue-minimizer is doing a "goal".

If you still disagree, can you say exactly what goal you think the robot is pursuing, so I can examine your argument in more detail?

When you say the robot has a "different goal", I'm not sure what you mean. What is the robot's goal? To follow the program detailed in the first paragraph?

The robot's goal is not to follow its own program. The program is simply what the robot does. In the environment it is designed to operate in, what it does is destroy blue objects. In the vocabulary of control theory, the controlled variable is the number of blue objects, the reference value is zero, the difference between the two is the error, firing the laser is the action it takes when the error is positive, and the action has the effect of reducing the error. The goal, as with any control system, is to keep the error at zero. It does not have an additional goal of being the best destroyer of blue objects possible. Its designers might have that goal, but if so, that goal is in the designers, not in the system they have designed.

In an environment containing blue objects invulnerable to laser fire, the robot will fail to control the number of blue objects at zero. That does not make it not a control system, just a control system encountering disturbances it is unable to control. To ask whether it is still a control system veers into a purely verbal argument, like asking whether a table is still a table if one leg has broken off and it cannot stand upright.
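This vocabulary maps onto a textbook negative-feedback loop, and the hologram room is just a disturbance the loop can't correct. A toy simulation (the world classes and step limit are invented for illustration):

```python
# Toy negative-feedback controller in the comment's vocabulary:
# controlled variable = perceived number of blue objects, reference = 0,
# error = controlled variable - reference, action = fire the laser.

def control_system(world, reference=0, max_steps=100):
    shots = 0
    for _ in range(max_steps):
        error = world.perceived_blue() - reference
        if error <= 0:
            break                  # error at zero: nothing left to do
        world.zap_one()            # the action normally reduces the error
        shots += 1
    return shots

class OrdinaryWorld:
    """Blue objects vulnerable to laser fire."""
    def __init__(self, blue_objects):
        self.blue_objects = blue_objects
    def perceived_blue(self):
        return self.blue_objects
    def zap_one(self):
        self.blue_objects -= 1     # the shot destroys one blue object

class HologramWorld(OrdinaryWorld):
    def zap_one(self):
        pass                       # the beam flits harmlessly through
```

In an OrdinaryWorld of three blue objects the error is driven to zero in three shots; in a HologramWorld the controller fires until the step limit with the error unchanged - still a control system, just one facing a disturbance it cannot control.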

People are more complex. They have (according to PCT) a large hierarchy of control systems (very broad, but less than a dozen levels deep), in which the reference signal for each controller is set by the output signals of higher level controllers. (At the top, reference signals are presumably hard-wired, and at the bottom, output signals go to organs not made of neurons -- muscles, mainly.) In addition, the hierarchy is subject to reorganisation and other forms of adaptation. The adaptations present to consciousness are the ability to think about our goals, consider whether we are doing the best things to achieve them, and change what we are doing. The robot in the example cannot do this.

You might be thinking of "goal" as meaning this sort of conscious, reflective, adaptive attempt to achieve what we "really" want, but I find that too large and fuzzy a concept. It leads into a morass of talk about our "real" goals vs. the goals we think we have, self-reflexive decision theory, extreme thought experiments, and so on. A real science of living things has to start smaller, with theories and observations that can be demonstrated as surely and reproducibly as the motion of balls rolling down inclined planes.

(ETA: When the neuroscience fails to discover this huge complex thing that never carved reality at the joints in the first place, people respond by saying it doesn't exist, that it went the way of kobolds rather than rainbows.)

Maybe you're also thinking of this robot's program as a plain stimulus-response system, as in the behaviourist view of living systems. But what makes it a control system is the environment it is embedded in, an environment in which shooting at blue objects destroys them.

If goals reduce to a program like the robot's in any way, it's in the way that Einsteinian mechanics "reduce" to Newtonian mechanics - giving good results in most cases but being fundamentally different and making different predictions on border cases.

If I replace "program" by "behaviourism", then I would say that it is behaviourism that is explained away by PCT.

Now I'm very confused. I understand that you think humans are PCT systems and that you have some justifications for that. But unlike humans, we know exactly what motivates this robot (the program in the first paragraph) and it doesn't contain a controlled variable corresponding to the number of blue objects, or anything else that sounds PCT.

So are you saying that any program can be modeled by PCT better than by looking at the program itself, or that although this particular robot isn't PCT, a hypothetical robot that was more reflective of real human behavior would be?

As for goals, if I understand your definition correctly, even a behaviorist system could be said to have goals (if you reinforce it every time it pulls the lever, then its new goal will be to pull a lever). If that's your definition, I agree that this robot has goals, and I would rephrase my thesis as being that those goals are not context-independent and reflective.

What is the robot's goal? To follow the program detailed in the first paragraph?

I suspect Richard would say that the robot's goal is minimizing its perception of blue. That's the PCT perspective on the behavior of biological systems in such scenarios.

However, I'm not sure this description actually applies to the robot, since the program was specified as "scan and shoot", not "notice when there's too much blue and get rid of it". In observed biological systems, goals are typically expressed as perception-based negative feedback loops implemented in hardware, rather than purely rote programs OR high-level software algorithms. But without more details of the robot's design, it's hard to say whether it really meets the PCT criterion for goals.

Of course, from a certain perspective, you could say at a high level that the robot's behavior is as if it had a goal of minimizing its perception of blue. But as your post points out, this idea is in the mind of the beholder, not in the robot. I would go further as to say that all such labeling of things as goals occurs in the minds of observers, regardless of how complex or simple the biological, mechanical, electronic, or other source of behavior is.

I suspect Richard would say that the robot's goal is minimizing its perception of blue. That's the PCT perspective on the behavior of biological systems in such scenarios.

This 'minimization' goal would require a brain that is powerful enough to believe that lasers destroy or discolor what they hit.

If this post were read by blue aliens that thrive on laser energy, they'd wonder why we were so confused as to the purpose of an automatic baby feeder.

This 'minimization' goal would require a brain that is powerful enough to believe that lasers destroy or discolor what they hit.

From the PCT perspective, the goal of an E. coli bacterium swimming away from toxins and towards food is to keep its perceptions within certain ranges; this doesn't require a brain of any sort at all.

What requires a brain is for an outside observer to ascribe goals to a system. For example, we ascribe a thermostat's goal to be to keep the temperature in a certain range. This does not require that the thermostat itself be aware of this goal.

Although I find PCT intriguing, all the examples of it I've found have been about simple motor tasks. I can take a guess at how you might use the Method of Levels to explain larger-level decisions like which candidate to vote for, or whether to take more heroin, but it seems hokey; I haven't seen any reputable studies conducted at this level (except one, which claimed to have found against it), and the theory seems philosophically opposed to conducting them (they claim that "statistical tests are of no use in the study of living control systems", which raises a red flag large enough to cover a small city).

I've found behaviorism much more useful for modeling the things I want to model. I've read the PCT arguments against behaviorism, and they seem ill-founded: for example, they note that animals sometimes auto-learn, and behaviorist methodological insistence on external stimuli shouldn't allow that. But once we relax the methodological restrictions, this seems to be a case of surprise serving the same function as negative reinforcement, something so well understood that neuroscientists can even point to the exact neurons in charge of it.

Richard's PCT-based definition of goal is very different from mine, and although it's easily applicable to things like controlling eye movements, it doesn't have the same properties as the philosophical definition of "goal", the one that's applicable when you're reading all the SIAI work about AI goals and goal-directed behavior and such.

By my definition of goal, if the robot's goal were to minimize its perception of blue, it would shoot the laser exactly once - at its own visual apparatus - then remain immobile until turned off.

By my definition of goal, if the robot's goal were to minimize its perception of blue, it would shoot the laser exactly once - at its own visual apparatus - then remain immobile until turned off.

Ironically, quite a lot of human beings' goals would be more easily met in such a way, and yet we still go around shooting our lasers at blue things, metaphorically speaking.

Or, more to the point, systems need not efficiently work towards their goals' fulfillment.

In any case, your comments just highlight yet again the fact that goals are in the eye of the beholder. The robot is what it is and does what it does, no matter what stories our brains make up to explain it.

(We could then go on to say that our brains have a goal of ascribing goals to things that appear to be operating of their own accord, but this is just doing more of the same thing.)

This robot is not a consequentialist - it doesn't have a model of the world which allows it to extrapolate (models of) outcomes that follow causally from its choices. It doesn't seem to steer the universe any particular place, across changes of context, because it explicitly doesn't contain a future-steering engine.

What exactly is meant by the robot having a human-level intelligence? Does it have two non-interacting programs: shoot blue and think?

This seems to be the key point. Everything interesting about the whole project of human rationality is contained in the interaction between the parts of us that think and the parts of us that do. All of the theories Yvain is criticising are, ultimately, about explaining and modeling the relationship between these two entities.

Ah, excellent. This post comes at a great time. A few weeks ago, I talked with someone who remarked that although decision theory treats preferences and information as separate, trying to apply that to humans is fitting the data to the theory. He was of the opinion that humans don't really have preferences in the decision-theoretic sense of the word. Pondering that claim, I came to the conclusion that he's right, and have started to increasingly suspect that CEV-like plans to figure out the "ultimate" preferences of people are somewhat misguided. Our preferences are probably hopelessly path-, situation- and information-dependent. Which is not to say that CEV would be entirely pointless - even if the vast majority of our "preferences" would never converge, there might be some that did. And of course, CEV would still be worth trying, just to make sure I'm not horribly mistaken on this.

The ease with which I accepted the claim "humans don't have preferences" makes me suspect that I've myself had a subconscious intuition to that effect for a long time, which was probably partially responsible for an unresolved disagreement between me and Vladimir Nesov earlier.

I'll be curious to hear what you have to say.

...CEV-like plans to figure out the "ultimate" preferences of people are somewhat misguided. Our preferences are probably hopelessly path-, situation- and information-dependent.

This is off-topic but since you mentioned it and since I don't think it warrants a new post, here are my latest thoughts on CEV (a convergence of some of my recent comments originally posted as a response to a post by Michael Anissimov):

Consider the difference between a hunter-gatherer, who cares about his hunting success and to become the new clan chief, and a member of lesswrong who wants to determine if a “sufficiently large randomized Conway board could turn out to converge to a barren ‘all off’ state.”

The utility of success in hunting down animals and in proving abstract conjectures about cellular automata is largely determined by factors such as your education, culture and environmental circumstances. The same hunter-gatherer who cared about killing a lot of animals, to get the best ladies in his clan, might under different circumstances have turned out to be a vegetarian mathematician solely caring about his understanding of the nature of reality. Both sets of values are to some extent mutually exclusive, or at least disjoint. Yet both sets of values are what the person wants, given the circumstances. Change the circumstances dramatically and you change the person's values.

You might conclude that what the hunter-gatherer really wants is to solve abstract mathematical problems, and he just doesn't know it. But there is no set of values that a person “really” wants. Humans are largely defined by the circumstances they reside in. If you already knew a movie, you wouldn't watch it; being able to get your meat from the supermarket changes the value of hunting.

If “we knew more, thought faster, were more the people we wished we were, and had grown up closer together,” then we would stop desiring what we had learnt, wish to think faster still, become yet different people, and grow bored of and rise above the people similar to us.

A singleton will inevitably change everything by causing a feedback loop between itself and human values. The singleton won't extrapolate human volition but implement an artificial set of values as a result of abstract, high-order contemplations about rational conduct. Many of our values and goals, of what we want, are culturally induced or the result of our ignorance. Reduce our ignorance and you change our values. One trivial example is our intellectual curiosity: if we don't need to figure out what we want on our own, our curiosity is impaired.

Knowledge changes and introduces terminal goals. The toolkit called ‘rationality’, the rules and heuristics developed to help us achieve our terminal goals, also alters and deletes them. A stone-age hunter-gatherer seems to possess very different values than I do. If he learns about rationality and metaethics, his values will be altered considerably. Rationality was meant to help him achieve his goals, e.g. become a better hunter; it was designed to tell him what he ought to do (instrumental goals) to achieve what he wants to do (terminal goals). Yet what actually happens is that he is told he will learn what he ought to want. If an agent becomes more knowledgeable and smarter, this does not leave its goal-reward system intact unless that system is specifically designed to be stable. An agent who originally wanted to become a better hunter and feed his tribe might end up wanting to eliminate poverty in Obscureistan. The question is: how much of this new “wanting” is the result of using rationality to achieve terminal goals, how much is a side effect of using rationality, and how much is left of the original values versus the values induced by a feedback loop between the toolkit and its user?

Take for example an agent facing the Prisoner’s Dilemma. Such an agent might originally tend to cooperate, and only after learning about game theory decide to defect and gain a greater payoff. Was it rational for the agent to learn about game theory, in the sense that it helped the agent achieve its goal, or in the sense that it deleted one of its goals in exchange for a more “valuable” one?
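The game-theoretic fact the agent "learns" here can be made concrete. Below is a minimal sketch of a one-shot Prisoner's Dilemma; the payoff numbers and function names are illustrative, not from the thread, but any payoffs with the standard PD ordering give the same result:

```python
# Row player's payoff for each (own move, opponent move) pair.
# These specific numbers are illustrative; any PD satisfies T > R > P > S.
PAYOFF = {
    ("C", "C"): 3,  # R: mutual cooperation
    ("C", "D"): 0,  # S: exploited while cooperating
    ("D", "C"): 5,  # T: temptation to defect
    ("D", "D"): 1,  # P: mutual defection
}

def best_response(opponent_move):
    """Return the move maximizing the row player's payoff against a fixed opponent move."""
    return max(("C", "D"), key=lambda move: PAYOFF[(move, opponent_move)])

# Defection is the best response whatever the opponent does -- the
# dominance argument that flips the newly-educated agent from C to D.
print(best_response("C"), best_response("D"))  # D D
```

The point of the comment survives the sketch: nothing in the payoff table says whether switching to "D" counts as better achieving an old goal or as having acquired a new one.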

It seems to me that becoming more knowledgeable and smarter is gradually altering our utility functions. But what is it that we are approaching if the extrapolation of our volition becomes a purpose in and of itself? A living treaty will distort or alter what we really value by installing a new cognitive toolkit designed to achieve an equilibrium between us and other agents with the same toolkit.

Would a singleton be a tool that we can use to get what we want, or would the tool use us to do what it does? Would we be modeled, or would it create models? Would we be extrapolating our volition, or rather following our extrapolations?

Is becoming the best hunter really one of the primitive man's terminal values? I would say his terminal values are more things like "achieving a feeling of happiness, contentment, and pride in oneself and one's relatives". The other things you mention are just effective instrumental goals.

I mostly agree with this.

I think that the idea of desires converging if “we knew more, thought faster, were more the people we wished we were, and had grown up closer together” relies on assumptions of relatively little self-modification. Once we get uploads and the capability for drastic self-modification, all kinds of people and subcultures will want to use it. Given the chance and enough time, we might out-speciate the beetle (to borrow Anders Sandberg's phrase), filling pretty much every corner of posthuman mindspace. There'll be minds so strange that we won't even recognize them as humans, and we'll hardly have convergent preferences with them.

Of course, that's assuming that no AI or mind with a first-mover advantage simply takes over and outcompetes everyone else. Evolutionary pressures might prune the initial diversity a lot, too - if you're so alien that you can't even communicate with ordinary humans, you may have difficulties paying the rent for your server farm.

... the mistake began as soon as we started calling it a "blue-minimizing robot".

Agreed. But what kind of mistake was that?

Is "This robot is a blue-minimizer" a false statement? I think not. I would classify it as more like the unfortunate selection of the wrong Kuhnian paradigm for explaining the robot's behavior. A pragmatic mistake. A mistake which does not bode well for discovering the truth, but not a mistake which involves starting from objectively false beliefs.

Why does the human-level intelligence component of the robot care about blue? It seems to me that it is mistaken in doing so. If my motor cortex was replaced by this robot's program, I would not conclude that I had suddenly started to only care about blue, I would conclude that I had lost control of my motor cortex. I don't see how it makes any difference that the robot always had its actions controlled by the blue-minimizing program. If I were the robot then, upon being informed about my design, I would conclude that I did not really care about blue. My human-level intelligence is the part that is me and therefore contains my preferences, not my motor cortex.

If my motor cortex was replaced by this robot's program, I would not conclude that I had suddenly started to only care about blue, I would conclude that I had lost control of my motor cortex.

I predict this would not happen the way you anticipate, at least for some ways to cash out 'taking control of your motor cortex'. For example, when a neurosurgeon uses a probe to stimulate a part of the motor cortex responsible for moving the arm, and eir patient's arm moves, and the neurosurgeon asks the patient why ey moved eir arm, the patient often replies something like "I had an itch", "it was uncomfortable in that position", or "What, I'm not allowed to move my arm now without getting grilled on it?"

Or consider certain forms of motor cortex damage in which patients can't move their arm: they explain it by saying "I could move my arm right now, I just don't feel like it" or "That's not even my real arm, how could you expect me to move that?".

Although I won't get there for a while, part of my thesis for this sequence is that we infer our opinions from our behaviors, although it's probably more accurate to say that our behaviors feed back into the same processes that generate our opinions and can alter them. If this is true, then there are probably very subtle ways of taking control of your motor cortex that would leave your speech centers making justifications for whatever you did.

I'd be very surprised if this worked on me for more than, say, a day. Even if the intuition that I'm the one in control doesn't go away, I expect to eventually notice that it's actually false and consciously choose to not take it into account, at least in verbal reasoning. Has it been tried (on someone more qualified than a random patient)? If it doesn't work, the effect should be seen as rather more horrible than just overriding one's limb movement.

One of the obvious extensions of this thought experiment is to posit a laser-powered blue goo that absorbs laser energy, and uses it to grow larger.

This thought experiment also reminds me: Omohundro's arguments regarding likely uFAI behavior are based on the AI having goals of some sort - that is, something we would recognize as goals. It's entirely possible that we wouldn't perceive it as having goals at all, merely behavior.

A couple of points here. First, as other people have indicated, there seems to be a problem with saying both that the robot has human-level intelligence/self-reflective insight and that it unreflectively carries out its programming of firing lasers at percepts which appear blue, insofar as the former would seem to entail that the latter would /not/ be done unreflectively. What you have here are two separate and largely unintegrated cognitive systems: on the one hand, a module with human-level intelligence which ascribes functional-intentional properties to things, including the robot itself; on the other, the robot's firing program.

The second point is that there may be a confusion over what your functional ascriptions to the robot are tracking. I want to say that objects have functions only relative to a system in which they play a role, which means, for example, that the robot might have 'the function' of eliminating blue objects within the wider system that is the Department of Homeland Security; however, there is no discoverable fact about the robot which describes its 'function simpliciter'. You can observe what appears to be goal-directed behaviour, of course, but your ascriptions of goals to the robot are only good insofar as they serve to predict its future behaviour (this is a standard descriptivist/projectivist approach to mental content ascription, of the sort Dennett describes in 'The Intentional Stance'). So when you put the hologram in front of the robot's camera, it ceases to exercise the same goal-directed behaviour, or (what amounts to the same thing, expressed differently) your previous goal ascriptions to the object cease to make reliable predictions of the robot's future behaviour and need to be corrected. ((I'm going to ignore the issue of the ontological status of these ascriptions. If this is an interest you happen to have, Dennett discusses his views on the subject in an essay entitled 'Real Patterns', and there is further commentary in a couple of the articles in Ross and Brook's 'Dennett's Philosophy'.))

I realise you are consciously using a naive version of behaviourism as the backdrop of your discussion, so it's possible that I'm just jumping ahead to 'where you're going with this', but it does seem that with subsequent post-behaviourist approaches to mental content ascription, the puzzle you describe of how to correctly describe the robot dissolves. ((N.B. - You might want to look at Millar's Understanding People, which surveys a broad range of approaches to mental state ascription.))

Shouldn't the human intelligence part be considered part of the source code, with its own goals/value functions? Otherwise it's just a "human watching the robot" kind of thing.

Is there a sense in which we can conclude that the robot is a blue-minimizing robot, in which we can't also conclude that it's an object-minimizing robot that happens to be optimized for situations where most of the objects are blue or most of the backgrounds are non-blue? (Perhaps it's one of a set of robots, or perhaps the ability to change its filter is an intentional feature.)

Does that "human level intelligence module" have any ability to actually control the robot's actions, or just to passively observe and then ask "why did I do that?" What're the rules of the game, as such, here?