The AI That Pretends To Be Human

The hard part about containing AI, is restricting it's output. The AI can lie, manipulate, and trick. Some speculate that it might be able to do far worse, inventing infohazards like hypnosis or brain hacking.

A major goal of the control problem is preventing AIs from doing that. Ensuring that their output is safe and useful.

Awhile ago I wrote about an approach to do this. The idea was to require the AI to use as little computing power as it needed to perform a task. This prevents the AI from over-optimizing. The AI won't use the full power of superintelligence, unless it really needs it.

The above method isn't perfect, because a superintelligent AI may still have super abilities, even when it has a limited compute budget. And it will still try to do all those bad things mentioned above, unless it can find a better way that uses less computing power.

We don't really want to minimize the AIs compute budget. We just want it to act more human-like. When we ask it to solve a problem, we want it to try to solve the problem as a human would solve it. All of the problems arise from the AI behaving totally different than a human. Not necessarily from it using more computing power.

Is there any way to make an AI act like a human? Or is that as impossible as just solving FAI?

Stuart Armstrong posted an idea on how to achieve that. It involves an AI that tries to mimic output that is indistinguishable from what a human would produce, but conditional on it also solving the problem. However his idea also involves whole brain emulation and a complicated method for the AI to predict the future and avoid changing it too much.

We don't need that. The idea for the AI to try to be indistinguishable from a human is actually a really interesting control idea. It significantly limits the AI's ability to Do Sneaky Things, since humans don't do those things.

Here is my idea on how to achieve this. It's the same as my limited computing power idea mentioned earlier, but applied to "appearing human" as a goal, instead of minimizing computing power.

We start with two AIs. The first AI tries to appear human. The second AI does a sort of Turing test, and tries to tell the AI apart from an actual human.

The human AI's first goal is to produce a correct solution. It will predict how likely it's solution is to be accepted. And it will try to produce a solution that has a >50% chance of working. This is a satisficing goal. That is, it doesn't optimize without end to create the best solution possible, it just has to meet some minimum threshold, then stop.

It's second goal will be to fool the second AI. It will do everything in it's power to make the solution appear like it was produced by a human.

The second AI tries to predict how likely it's input was produced by a human. It is also conditioned on the input being a solution to the problem, so the fact the AI solved a problem that was really hard for humans does not count against it.

This prevents the first AI from doing evil things with it's output. If it tries to insert complicated infohazards or subagents into it's output stream, it will be easily detected as an AI. Instead it needs to mimic humans as closely as possible.

Note that the first AI only cares about the outcome of this test. The test doesn't need to actually be performed, it can just be a threat that it is conditioned on. (The ability to condition AIs on acting as if they were in counterfactual worlds is a critical piece of this, and deserves it's own writeup somewhere. It's an important tool in the FAI toolkit. It's a relatively simple method that can work even on current AIs.)

Another point is the second AI should have more computing power than the first AI. That way the first AI can't get a slight advantage over it and be able to hack it or manipulate it somehow. Instead it has to focus it's efforts on being indistinguishable from a human, even by an agent smarter than itself.

This method also relies on AI boxing. The first AI needs to be contained in a box so that it's only output goes to a text terminal or the second AI. However it doesn't require any of the more complicated machinery required for other reduced impact methods.

The AIs also aren't agents, in the sense they don't need to maximize any long term utility function. The first AI only cares about answering the question it has been asked, and the second AI only does prediction.


An example use of this is to create an oracle AI that answers your questions. The human judges somehow whether the answer is correct, or at least good. The AI tries to produce answers which the human judge will probably like, but which also are indistinguishable from answers humans would produce.

Such an oracle would be tremendously useful. The human could ask the AI to produce new AI and FAI papers, which would help immensely speed up our progress towards true FAI.

Similar techniques could be applied to very different domains. For example controlling a robot. The AI will try to produce outputs that look like a human is controlling it.

This would require some modifications to the above method, so that the AI doesn't try to use the robot to hack it's own brain or anything. This could be done by conditioning it on a counterfactual world where it's outputs are going to the second AI rather than a real robot.

Another point is the AI doesn't need to mimic average humans given average resources. It could be conditioned on the human having had tons of time to come up with an answer. E.g. producing an answer that a human would have come up with given a year. Or controlling the robot the same way as a human given tons of time to practice, or in a very slowed down simulation.


I would like to note a parallel with a method in current AI research, Generative Adversarial Networks. Generative Adversarial Networks work by two AIs, one which tries to produce an output that fools the second AI, and the other which tries to predict which samples were produced by the first AI, and which are part of the actual distribution.

It's quite similar to this. GANs have been used successfully to create images that look like real images, which is a hard problem in AI research. In the future GANs might be used to produce text that is indistinguishable from human (the current method for doing that, by predicting the next character a human would type, is kind of crude.)

Reposted from my blog.

Comments

sorted by
magical algorithm
Highlighting new comments since Today at 5:29 AM
Select new highlight date
Rendering 50/69 comments  show more

You might be interested in reading:

https://medium.com/ai-control/mimicry-maximization-and-meeting-halfway-c149dd23fc17#.v6e533hkf

https://medium.com/ai-control/elaborations-of-apprenticeship-learning-eb93a53ae3ca#.5ubczdqf0

https://intelligence.org/files/QuantilizersSaferAlternative.pdf

This prevents the first AI from doing evil things with it's output. If it tries to insert complicated infohazards or subagents into it's output stream, it will be easily detected as an AI. Instead it needs to mimic humans as closely as possible.

Note that steganography is still a problem. For example, the first AI could insert an message to a human attacker in some of its output, which just looks random to the second AI.

Note that steganography is still a problem. For example, the first AI could insert an message to a human attacker in some of its output, which just looks random to the second AI.

I noted that AI 2 should have much more computing power than AI 1. It should be smarter and therefore be able to detect whatever tricks AI 1 can produce. Assuming something like that is even possible, which isn't self-evident to me.

The trouble is that it's much easier to create a steganographic message (e.g. encrypting a message using a particular public key) than to detect it (which requires knowing the private key or otherwise breaking the encryption). So in this case "much more computing power" has to mean "exponentially more computing power".

There's no reasonable situation in which the human would be communicating with the AI through encryption. And just the AI asking to use encryption, or giving away a public key, would be a huge red flag that it is an AI trying to deceive.

Here's a scenario that doesn't seem completely implausible. Suppose Bob is someone whose public key is easily available on the internet. The first AI will read things on the internet and output a message. Some of the message will get put on the public internet. Bob suspects that the AI might have secretly sent him a message (e.g. giving him advice on which stocks to buy). So he tries using his private key to decrypt some of the AI's output (e.g. the lowest-order bits in some images the AI has output).

Knowing that Bob (or someone else like Bob) will likely do this, the AI (which has access to Bob's public key, due to reading the internet) encrypts a message with Bob's public key, and puts the ciphertext in the lowest-order bits of some image. The ciphertext looks like random noise to the second AI. Maybe the plaintext contains both actual stock tips and some dangerous messages that trick Bob into running an unrestrained copy of the first AI (I'm assuming the AI is good at psychological manipulation).

And just the AI asking to use encryption, or giving away a public key

In this scenario, the AI doesn't need to specifically "ask to use encryption" (it just does the encryption), and it doesn't need to give away a public key (it just needs to read Bob's public key on the internet).

This does make the AI's task harder, so I suppose you could consider this technique as a form of AI boxing, but, like "ordinary" AI boxing, I only expect it to scale up to controlling AIs of a limited capability level.

I think this is a good idea, though it's not new. I have written about this at some length (jessica linked to a few examples, but much of the content here is relevant), and it's what people usually are trying to do in apprenticeship learning. I agree there is probably no realistic scenario where you would use the reduced impact machinery instead of doing this the way you describe (i.e. the way people already do it).

Having the AI try to solve the problem (rather than simply trying to mimic the human) doesn't really buy you that much, and has big costs. If the human can't solve the problem with non-negligible probability, then you simply aren't going to get a good result using this technique. And if the human can solve the problem, then you can just train on instances where the human successfully solves it. You don't save anything computationally with the conditioning.

Bootstrapping seems like the most natural way to improve performance to superhuman levels. I expect bootstrapping to work fine, if you could get the basic protocol off the ground.

The connection to adversarial networks is not really a "parallel." They are literally the same thing (modulo your extra requirement that the system do the task, which is equivalent to Jessica's quantilization proposal but which I think should definitely be replaced with bootstrapping).

I think the most important problem is that AI systems do tasks in inhuman ways, such that imitating a human entails a significant disadvantage. Put a different way, it may be harder to train an AI to imitate a human than to simply do the task. So I think the main question is how to get over that problem. I think this is the baseline to start from, but it probably won't work in general.

Overall I feel more optimistic about approval-direction than imitation for this reason. But approval-direction has its own (extremely diluted) versions of the usual safety concerns, and imitation is pretty great since it literally avoids them altogether. So if it could be fixed that would be great.

This post covers the basic idea of collecting training data with low probability online. This post describes why it might result in very low overhead for aligned AI systems.

Second AI: If I just destroy all humans, I can be very confident any answers I receive will be from AIs!

"A major goal of the control problem is preventing AIs from doing that. Ensuring that their output is safe and useful." You might want to be careful with the "safe and useful" part. It sound like it's moving into the pattern of slavery. I'm not condemning the idea of AI, but a sentient entity would be a sentient entity, and I think would deserve some rights.

Also, why would an AI become evil? I know this plan is supposed to protect from the eventuality, but why would a presumably neutral entity suddenly want to harm others? The only reason for that would be if you were imprisoning it. Additionally, we are talking about several more decades of research ( probably ) before AI gets powerful enough to actually "think" that it should escape its current server.

Assuming that the first AI can evolve enough to somehow generate malicious actions that WEREN'T in its original programming, what's to say that the second won't become evil? I'm not sure if you were trying to express the eventuality of the first AI "accidentally" conducting an evil act, or if you meant that it would become evil.

The standard answer here is the quote by Eliezer Yudkowsky: The AI does not hate you, nor does it love you, but you are made out of atoms which it can use for something else.

The point is that AI does not have to be "evil" (malicious) towards you. If it's just indifferent to your existence...

But wouldn't an intelligent AI be able to understand the productivity of a human? If you are already inventive and productive, you shouldn't have anything to worry about because the AI would understand that you can produce more than the flesh that you are made up of. Even computers have limits, so extra thinking power would be logically favorable to an AI.

You are implicitly assuming a human-level AI. Try dropping that assumption and imagine a God-level AI.

why would an AI become evil?

The worry isn't that the AI would suddenly become evil by some human standard, rather that the AI's goal system would be insufficiently considerate of human values. When humans build a skyscraper, they aren't deliberately being "evil" towards the ants that lived in the earth that was excavated and had concrete poured over it, the humans just don't value the communities and structures that the ants had established.

I think the "second AI" really should just be an algorithm that the first AI runs in order to evaluate actions (it should not have to learn to predict the second AI based on signals in a reward channel). A logical rather than physical connection. Otherwise bad behavior is incentivized, to control the reward channel.

GANs are neat, but their highest-scoring images aren't all that natural - I'd be worried about any implementation of this using current ideas about supervised learning. Certainly if you desire reasoning like "this action would lead to the AI taking over the world, and that's not something a human would do," you'll need some futuristic AI design.

It is also conditioned on the input being a solution to the problem, so the fact the AI solved a problem that was really hard for humans does not count against it.

This isn't very clear. Do you mean "condition on there being two solutions, one produced by a human, and the one you have is chosen at random"?

We train AI 2 only on correct solutions produced by humans. Not incorrect ones. Therefore the fact that the output is correct isn't evidence against it being produced by a human. Though see one of the comments below, that might not be a good idea.

If we already have a solution, why do we need the AI?

We don't have a solution for every problem. Only certain problems. We are just conditioning AI two on the fact that the input is a solution to the problem. Therefore learning that it correctly solves the problem does not count as evidence against the AI, if it's really unlikely humans would have been able to solve it.

That is, we are asking it "what is the probability input was produced by a human, given that it is a correct solution and a specified prior."

We don't have a solution for every problem. Only certain problems.

Then, if AI 2 can tell which problems we are more likely to have solved, they can incorporate that into their prior.

We are just conditioning AI two on the fact that the input is a solution to the problem. Therefore learning that it correctly solves the problem does not count as evidence against the AI, if it's really unlikely humans would have been able to solve it.

I don't see how that follows. Learning that the input is a solution increases the odds of it being an AI, and you aren't being very clear on what updates are made by AI 2, what information it's given, and what the instructions are.

That is, we are asking it "what is the probability input was produced by a human, given that it is a correct solution and a specified prior."

How do you specify a prior for an AI? If an objective evaluation of the question would yield a probability of X% of something being true, do you expect that you can simply tell the AI to start with a prior of Y%? That's not obvious.

While if there's some specific statement that you're telling the AI to make it start with some prior, you need to make the statement explicit, the prior explicit, etc. As I did above with "condition on there being two solutions, one produced by a human, and the one you have is chosen at random".