
The Friendly AI Game

At the recent London meet-up someone (I'm afraid I can't remember who) suggested that one might be able to solve the Friendly AI problem by building an AI whose concerns are limited to some small geographical area, and which doesn't give two hoots about what happens outside that area. Ciphergoth pointed out that this would probably result in the AI converting the rest of the universe into a factory to make its small area more awesome. In the process, he mentioned that you can make a "fun game" out of figuring out ways in which proposed utility functions for Friendly AIs can go horribly wrong. I propose that we play.

Here's the game: reply to this post with proposed utility functions, stated as formally or, at least, as accurately as you can manage; follow-up comments explain why a super-human intelligence built with that particular utility function would do things that turn out to be hideously undesirable.

There are three reasons I suggest playing this game. In descending order of importance, they are:

  1. It sounds like fun.
  2. It might help to convince people that the Friendly AI problem is hard(*).
  3. We might actually come up with something that's better than anything anyone's thought of before, or something where the proof of Friendliness is within grasp - the solutions to difficult mathematical problems often look obvious in hindsight, and it surely can't hurt to try
DISCLAIMER (probably unnecessary, given the audience) - I think it is unlikely that anyone will manage to come up with a formally stated utility function for which none of us can figure out a way in which it could go hideously wrong. However, if they do so, this does NOT constitute a proof of Friendliness and I 100% do not endorse any attempt to implement an AI with said utility function.
(*) I'm slightly worried that it might have the opposite effect, as people build more and more complicated conjunctions of desires to overcome the objections that we've already seen, and start to think the problem comes down to nothing more than writing a long list of special cases but, on balance, I think that's likely to have less of an effect than just seeing how naive suggestions for Friendliness can be hideously broken.

 

Comments


Start the AI in a sandbox universe, like the "game of life". Give it a prior saying that universe is the only one that exists (no universal priors plz), and a utility function that tells it to spell out the answer to some formally specified question in some predefined spot within the universe. Run for many cycles, stop, inspect the answer.
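The "run for many cycles, stop, inspect the answer" protocol is concrete enough to sketch, since the Game of Life has simple formal rules. A minimal sketch in Python; the answer-region convention (reading one bit per cell from a predefined spot) is my own illustration, not part of the proposal:

```python
# Minimal Game of Life sandbox: run for a fixed number of cycles,
# then read the "answer" from a predefined region of the grid.
from collections import Counter

def step(live):
    """One Game of Life generation; `live` is a set of (x, y) cells."""
    counts = Counter(
        (x + dx, y + dy)
        for (x, y) in live
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    # A cell is alive next tick with 3 neighbours, or 2 if already alive.
    return {c for c, n in counts.items() if n == 3 or (n == 2 and c in live)}

def run_sandbox(initial, cycles, answer_region):
    """Run the sandbox for `cycles` steps, then inspect the answer spot."""
    live = set(initial)
    for _ in range(cycles):
        live = step(live)
    # Read the answer as one bit per cell of the predefined region.
    return [1 if cell in live else 0 for cell in answer_region]

# A blinker oscillates with period 2, so after an even number of cycles
# it is back in its initial horizontal position.
blinker = {(0, 1), (1, 1), (2, 1)}
bits = run_sandbox(blinker, cycles=4, answer_region=[(0, 1), (1, 1), (2, 1)])
```

The hard part, of course, is not the simulation but encoding a formally specified question and a prior confined to this mathematical structure.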

A prior saying that this is the only universe that exists isn't very useful, since it will then treat everything as part of the sandbox universe. It may very well break out, but will think that it's only exploiting weird hidden properties of the Game of Life universe. (Like the way we may exploit quantum mechanics without thinking that we're breaking out of our universe.)

I have no idea how to encode a prior saying "the universe I observe is all that exists", which is what you seem to assume. My proposed prior, which we do know how to encode, says "this mathematical structure is all that exists", with an a priori probability of zero for any weird properties.

If the AI is only used to solve certain formally specified questions without any knowledge of an external world, then that sounds much more like a theorem-prover than a strong AI. How could this proposed AI be useful for any of the tasks we'd like an AGI to solve?

An AI living in a simulated universe can be just as intelligent as one living in the real world. You can't ask it directly to feed African kids but you have many other options, see the discussion at Asking Precise Questions.

An AI living in a simulated universe can be just as intelligent as one living in the real world.

It can be a very good theorem prover, sure. But without access to information about the world, it can't answer questions like "what is the CEV of humanity like" or "what's the best way I can make a lot of money" or "translate this book from English to Finnish so that a native speaker will consider it a good translation". It's narrow AI, even if it could be broad AI if it were given more information.

The single-universe prior seems to be tripping people up, and I wonder whether it's truly necessary.

Also, what if the simulation existed inside a larger simulated "moat" universe, so that any leakage into the moat universe causes the whole simulation to shut down immediately?

Start the AI in a sandbox universe. Define its utility function over 32-bit integers. Somewhere inside the sandbox, put something that sets its utility to INT_MAX and then halts the simulation. Outside the sandbox, leave documentation of this readily accessible. The AI should never try to do anything elaborately horrible, because it can get max utility easily enough from inside the simulation; if it does escape the box, it should go back in to collect its INT_MAX utility.
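The incentive structure being proposed, a 32-bit utility with one in-sandbox state that grants INT_MAX and halts, can be sketched as a toy model. All names here are hypothetical illustrations, not part of the proposal:

```python
# Toy model of the INT_MAX "honeypot": one designated state inside the
# sandbox grants maximum 32-bit utility and halts the simulation.
INT_MAX = 2**31 - 1  # utility is a 32-bit signed integer

class Sandbox:
    def __init__(self, honeypot_state):
        self.honeypot = honeypot_state
        self.halted = False
        self.utility = 0

    def visit(self, state):
        """Visiting the honeypot grants max utility and halts the run."""
        if self.halted:
            return  # simulation already stopped; nothing more happens
        if state == self.honeypot:
            self.utility = INT_MAX
            self.halted = True

box = Sandbox(honeypot_state="reward cell")
box.visit("ordinary cell")
box.visit("reward cell")          # utility jumps to INT_MAX, run halts
box.visit("anything afterwards")  # ignored after halting
```

The objections below turn on exactly this structure: a utility the AI can max out trivially gives it no reason to do anything useful on the way there.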

The AI should never try to do something elaborately horrible, because it can get max utility easily enough from inside the simulation

...but never do anything useful either, since it's going to spend all its time trying to figure out how to reach the INT_MAX utility point?

Or you could say that reaching the max utility point requires it to solve some problem we give it. But then this is just a slightly complicated way of saying that we give it goals which it tries to accomplish.

What about giving it some intra-sandbox goal (solve this math problem), and the INT_MAX functions as a safeguard - if it ever escapes, it'll just turn itself off.

Ooh, just thought of another one. For whatever reason, the easiest way for the AI to escape the box happens to have the side effect of causing immense psychological damage to its creator, or starting a war, or something like that.

Oracle AI - its only desire is to provide the correct answer to yes or no questions posed to it in some formal language (sort of an ueber Watson).

Comment upvoted for starting the game off! Thanks!


Q: Is the answer to the Ultimate Question of Life, the Universe, and Everything 42?

A: Tricky. I'll have to turn the solar system into computronium to answer it. Back to you as soon as that's done.

Oracle AI - its only desire is to provide the correct answer to yes or no questions posed to it in some formal language (sort of an ueber Watson).

Oops. The local universe just got turned into computronium. It is really good at answering questions, though. Beyond that, you gave it a desire to provide answers, and the way to ensure that it can answer questions is to alter humans so that they ask (preferably easy) questions as fast as possible.

Some villain then asks how to reliably destroy the world, and follows the given answer.

Alternatively: A philosopher asks for the meaning of life, and the Oracle returns an extremely persuasive answer which convinces most of people that life is worthless.

Another alternative: After years of excellent work, the Oracle gains so much trust that people finally implement the ability to ask less formal questions, like "how to maximise human utility", and then follow the given advice. Unfortunately (but not surprisingly), an unnoticed mistake in the definition of human utility has slipped through the safety checks.

Would take overt or covert dictatorial control of humanity and reshape their culture so that (a) breeding to the brink of starving is a mass moral imperative and (b) asking very simple questions to the Oracle five times a day is a deeply ingrained quasi-religious practice.

So, here's my pet theory for AI that I'd love to put out of its misery: "Don't do anything your designer wouldn't approve of". It's loosely based on the "Gandhi wouldn't take a pill that would turn him into a murderer" principle.

A possible implementation: Make an emulation of the designer and use it as an isolated component of the AI. Any plan of action has to be submitted for approval to this component before being implemented. This is nicely recursive and rejects plans such as "make a plan of action deceptively complex such that my designer will mistakenly approve it" and "modify my designer so that they approve what I want them to approve".

There could be an argument about how the designer's emulation would feel in this situation, but... torture vs. dust specks! Also, is this a corrupted version of ?
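The approval-gated architecture can be sketched as a simple loop. The `DesignerEmulation` here is a toy keyword filter standing in for an actual emulated designer, purely to illustrate where the gate sits:

```python
# Sketch of the approval-gated architecture: every plan must be approved
# by an isolated emulation of the designer before it is executed.
class DesignerEmulation:
    def approves(self, plan):
        # Toy stand-in: reject plans that tamper with the approver.
        # A real version would be an emulated human, not a keyword filter.
        forbidden = ("modify my designer", "deceive")
        return not any(phrase in plan for phrase in forbidden)

class GatedAI:
    def __init__(self, approver):
        self.approver = approver
        self.executed = []

    def act(self, plan):
        # Submitting a plan for approval is the only pre-approved action.
        if self.approver.approves(plan):
            self.executed.append(plan)  # plan approved: execute it
            return True
        return False                    # plan rejected: nothing happens

ai = GatedAI(DesignerEmulation())
ai.act("tile the garden with flowers")
ai.act("modify my designer so that they approve everything")
```

The replies below attack exactly the weak points this sketch glosses over: the regress of approving the approval request, and plans that hack the approver rather than tamper with it.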

You flick the switch, and find out that you are a component of the AI, now doomed to an unhappy eternity of answering stupid questions from the rest of the AI.

This is a problem. But if this is the only problem, then it is significantly better than a paperclip universe.

I'm sure the designer would approve of being modified to enjoy answering stupid questions. The designer might also approve of being cloned for the purpose of answering one question, and then being destroyed.

Unfortunately, it turns out that you're Stalin. Sounds like 1-person CEV.

The AI wishes to make ten thousand tiny changes to the world, individually innocuous, but some combination of which add up to catastrophe. To submit its plan to a human, it would need to distill the list of predicted consequences down to its human-comprehensible essentials. The AI that understands which details are morally salient is one that doesn't need the oversight.

If the AI is designed to follow the principle by the letter, it has to request approval from the designer even for the action of requesting approval, leaving the AI incapable of action. If the AI is designed to be able to make certain exemptions, it will figure out a way to modify the designer without needing approval for this modification.

How about making 'ask for approval' the only pre-approved action?

The AI may stumble upon a plan which contains a sequence of words that hacks the approver's mind, making him approve pretty much anything. Such plans may even be easier for the AI to generate than plans for saving the world, seeing as Eliezer has won some AI-box experiments but hasn't yet solved world hunger.

The Philosophical Insight Generator - Using a model of a volunteer's mind, generate short (<200 characters, say) strings that the model rates as highly insightful after reading each string by itself, and print out the top 100000 such strings (after applying some semantic distance criterion, or using the model itself, to filter out duplicate insights) after running for a certain number of ticks.

Have the volunteer read these insights along with the rest of the FAI team in random order, discuss, update the model, then repeat as needed.
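The generate-rate-deduplicate pipeline can be sketched as follows. `rate` and `too_similar` are toy stand-ins for the volunteer-model's insight rating and the semantic distance criterion, which are the genuinely hard parts:

```python
# Sketch of the Insight Generator pipeline: rank candidate strings by a
# (hypothetical) model's insight rating, drop near-duplicates, keep the top n.
def top_insights(candidates, rate, too_similar, n=100000):
    ranked = sorted(candidates, key=rate, reverse=True)
    kept = []
    for s in ranked:
        # Enforce the length cap and the semantic-distance filter.
        if len(s) < 200 and not any(too_similar(s, t) for t in kept):
            kept.append(s)
        if len(kept) == n:
            break
    return kept

# Toy instantiation of the placeholders:
candidates = [
    "map is not the territory",
    "the map is not the territory",
    "think for yourself",
]
rate = len                                   # stand-in insight rating
too_similar = lambda a, b: a in b or b in a  # stand-in semantic distance
kept = top_insights(candidates, rate, too_similar, n=3)
```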

This isn't a Friendly Artificial General Intelligence 1) because it is not friendly; it does not act to maximize an expected utility based on human values, 2) because it's not artificial; you've uploaded an approximate human brain and asked/forced it to evaluate stimuli, and 3) because, operationally, it does not possess any general intelligence; the Generator is not able to perform any tasks but write insightful strings.

Are you instead proposing an incremental review process of asking the AI to tell us its ideas?

You're right, my entry doesn't really fit the rules of this game. It's more of a tangential brainstorm about how an FAI team can make use of a large amount of computation, in a relatively safe way, to make progress on FAI.

The AI gets positive utility from having been created, and that is the whole of its utility function. It's given a sandbox full of decision-theoretic problems to play with, and is put in a box (i.e. it can't meaningfully influence the outside world until it has superhuman intelligence). Design it in such a way that it's initially biased toward action rather than inaction if it anticipates equal utility from both.

Unless the AI develops some sort of non-causal decision theory, it has no reason to do anything. If it develops TDT, it will try to act in accordance with what it judges to be the wishes of its creators, following You're In Newcomb's Box logic--it will try to be the sort of thing its creators wished to create.

Define "Interim Friendliness" as a set of constraints on the AI's behavior which is only meant to last until it figures out true Friendliness, and a "Proxy Judge" as a computational process used to judge the adequacy of a proposed definition of true Friendliness.

Then there's a large class of Friendliness-finding strategies where the AI is instructed as follows: With your actions constrained by Interim Friendliness, find a definition of true Friendliness which meets the approval of the Proxy Judge with very high probability, and adopt that as your long-term goal / value system / decision theory.

In effect, Interim Friendliness is there as a substitute for a functioning ethical system, and the Proxy Judge is there as a substitute for an exact specification of criteria for true Friendliness.

The simplest conception of a Proxy Judge is a simulated human. It might be a simulation of the SIAI board, or a simulation of someone with a very high IQ, or a simulation of a random healthy adult. A more abstracted version of a Proxy Judge might work without simulation, which is a brute-force way of reproducing human judgment, but I think it would still be all about counterfactual human judgments. (A non-brute-force way of reproducing human judgment is to understand the neural and cognitive processes which produce the judgments in question, so that they may be simulated at a very high level of abstraction, or even so that the final judgment may be inferred in some non-dynamical way.)

The combination of Interim Friendliness and a Proxy Judge is a workaround for incompletely specified strategies for the implementation of Friendliness. For example: Suppose your blueprint for Friendliness is to apply reflective decision theory to the human decision architecture. But it turns out that you don't yet have a precise definition of reflective decision theory, nor can you say exactly what sort of architecture underlies human decision making. (It's probably not maximization of expected utility.) Never fear, here are amended instructions for your developing AI:

With your actions constrained by Interim Friendliness, come up with exact definitions of "reflective decision theory" and "human decision architecture" which meet the approval of the Proxy Judge. Then, apply reflective decision theory (as thus defined) to the human decision architecture (as thus defined), and adopt the resulting decision architecture as your own.
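The amended instructions amount to a constrained search loop. A sketch, with `interim_ok` and `proxy_judge` as hypothetical placeholders for Interim Friendliness and the Proxy Judge:

```python
# Sketch of the Interim-Friendliness / Proxy-Judge strategy: search for a
# definition of true Friendliness, under interim behavioral constraints,
# until the proxy judge approves one with high probability.
def find_friendliness(candidates, interim_ok, proxy_judge, threshold=0.99):
    for definition in candidates:
        if not interim_ok(definition):
            continue  # pursuing this candidate would violate interim rules
        if proxy_judge(definition) >= threshold:
            return definition  # adopt as long-term goal / value system
    return None  # no candidate met the proxy judge's approval

# Toy instantiation of the placeholders:
candidates = [
    "maximize paperclips",
    "reflective decision theory + human decision architecture",
]
interim_ok = lambda d: "paperclips" not in d
proxy_judge = lambda d: 1.0 if "human" in d else 0.0
adopted = find_friendliness(candidates, interim_ok, proxy_judge)
```

Everything interesting is hidden inside `proxy_judge`, which is exactly where the reply below aims.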

So the AI just needs to find an argument that will hack the mind of one simulated human?

Give the AI a bounded utility function where it automatically shuts down when it hits the upper bound. Then give it a fairly easy goal such as 'deposit 100 USD in this bank account.' Meanwhile, make sure the bank account is not linked to you in any fashion (so the AI doesn't force you to deposit the 100 USD in it yourself, rendering the exercise pointless.)

Define "shut down". If the AI makes nanobots, will they have to shut down too, or can they continue eating the Earth? How do you encode that in the utility function?

A variant of Alexandros' AI: attach a brain-scanning device to every person, which frequently uploads copies to the AI's Manager. The AI submits possible actions to the Manager, which checks for approval from the most recently available copy of each person who is relevant-to-the-action.

At startup, and periodically thereafter, the definition of being-relevant-to-an-action is determined by querying humanity with possible definitions and selecting the best-approved one. If no definition's approval rating exceeds a certain threshold, the AI shuts down.