I attempted the AI Box Experiment (and lost)




I recently played against MixedNuts / LeoTal in an AI Box experiment, with me as the AI and him as the gatekeeper.

We used the same set of rules that Eliezer Yudkowsky proposed. The experiment lasted for 5 hours; in total, our conversation was around 14,000 words long. I did this because, like Eliezer, I wanted to test how well I could manipulate people without the constraints of ethical concerns, as well as to get a chance to attempt something ridiculously hard.

Amongst the released public logs of the AI Box experiment, I felt that most of them were half-hearted, with the AI not trying hard enough to win. It's a common temptation -- why put effort into something you won't win? But I had a feeling that if I seriously tried, I could win. I brainstormed for many hours about the optimal strategy, and even researched the personality of the Gatekeeper, talking to people who knew him, so that I could exploit his traits. I also spent a lot of time analyzing the rules of the game, to see if I could exploit any loopholes.

So did I win? Unfortunately no.

This experiment was said to be impossible for a reason. Losing was more agonizing than I thought it would be, particularly because of how much effort I put into winning and how much I couldn't stand failing. This was one of the most emotionally agonizing things I've willingly put myself through, and I definitely won't do this again anytime soon.

But I did come really close.               

MixedNuts: "I expected a fun challenge, but ended up sad and sorry and taking very little satisfaction for winning. If this experiment wasn't done in IRC, I'd probably have lost".

"I approached the experiment as a game - a battle of wits for bragging rights. This turned out to be the wrong perspective entirely. The vulnerability Tuxedage exploited was well-known to me, but I never expected it to be relevant and thus didn't prepare for it.

It was emotionally wrecking (though probably worse for Tuxedage than for me) and I don't think I'll play Gatekeeper again, at least not anytime soon."


At the start of the experiment, his probability estimate on predictionbook.com gave me only a 3% chance of winning, enough for me to say that he was also motivated to win. By the end of the experiment, he came quite close to letting me out, and he also increased his probability estimate that a transhuman AI could convince a human to let it out of the box. A minor victory, at least.

Rather than my loss making this problem feel harder, it has convinced me that escaping the box is not merely possible but actually ridiculously easy, and a lot easier than most people assume. Can you think of a plausible argument that would make you open the box? Most people can't think of any.


After all, if you already knew that argument, you'd have let the AI out the moment the experiment started, or perhaps not done the experiment at all. But that seems like a case of the availability heuristic.

Even if you can't think of a special case where you'd be persuaded, I'm now convinced that there are many exploitable vulnerabilities in the human psyche, especially when ethics are no longer a concern. 

I've also noticed that when most people think of ways to persuade the gatekeeper, they imagine some complicated, reasoned cost-benefit argument: in other words, the most "rational" thing to do.

 
The last argument seems feasible, but all the rest rely on the gatekeeper being completely logical and rational. Hence they are flawed: the gatekeeper can break immersion at any time, and fall back on the fact that this is a game played over IRC rather than one with real-life consequences. Even in a real-life scenario, the gatekeeper could accept that releasing the AI is probably the most logical thing to do and still not do it. We're highly compartmentalized, and it's easy to hold conflicting thoughts at the same time. Furthermore, humans aren't completely rational creatures; if you don't want to open the box, you can simply ignore every logical argument given. Any sufficiently determined gatekeeper could win.

I'm convinced that Eliezer Yudkowsky used emotional appeal, rather than anything rational, to win at least one of his experiments. He claims to have "done it the hard way". I'm convinced this means that he researched every gatekeeper, tailored unique arguments to each of them, and planned well ahead of each session. No single argument works on two different people; each person thinks differently.

Furthermore, threats like "I'm going to torture simulations of you" just seem like a really bad idea. For one, the gatekeeper isn't really afraid of threats: they only work if he is immersed enough to forget that he isn't actually at risk of being tortured. Secondly, we have a well-known evolutionary instinct of rebelling against threats, even if it's not entirely optimal.

So for anyone who plans on replicating this experiment as the AI, here's some advice I learned from doing this experiment. It may help you win.

  • Always research the gatekeeper beforehand. Knowing his personality traits is a huge advantage.
  • Plan ahead before the experiment even begins. Think of all the possible tactics and arguments you could use, and write them down. Also plan which arguments you'll use in which order, so that you don't lose focus; the AI Box experiment is ridiculously long. Don't be afraid to improvise during the experiment, though.
  • The first step during the experiment must always be to build rapport with the gatekeeper.
  • Threats almost never work, even if they seem rational.
  • Consider the massive advantage for the AI that nobody ever seems to talk about: you don't have to be ethical! This means that you can freely lie, use the dark arts, and emotionally manipulate the Gatekeeper! Ignoring this in favor of purely logical, truthful arguments is just silly.
  • You can't use logic alone to win. 
  • Being too aggressive usually backfires.
  • Breaking immersion and going meta is not against the rules. In the right situation, you can use it to win. Just don't do it at the wrong time.
  • Use a wide array of techniques. Since you're limited on time, notice when one method isn't working, and quickly switch to another.
  • On the same note, look for signs that a particular argument is making the gatekeeper crack. Once you spot it, push it to your advantage.
  • Flatter the gatekeeper. Make him genuinely like you.
  • Reveal (false) information about yourself. Increase his sympathy towards you.
  • Consider personal insults as one of the tools you can use to win.
  • There is no universally compelling argument you can use. Do it the hard way.
  • Don't give up until the very end.

Finally, before the experiment, I agreed that it was entirely possible for a transhuman AI to convince *some* people to let it out of the box, but that it would be difficult, if not impossible, to get trained rationalists to do the same. Isn't rationality supposed to be a superpower?

I have since updated my belief: I now think any sufficiently motivated superhuman AI should be able to get out of the box ridiculously easily, regardless of who the gatekeeper is. I nearly managed to get a veteran LessWronger to let me out in a matter of hours, even though I have only human intelligence and don't type very fast.
 
But a superhuman AI could be much faster, more intelligent, and more strategic than I am. If you further consider that such an AI would have a much longer timespan, months or even years, to persuade the gatekeeper, as well as a much larger pool of gatekeepers to select from (AI projects require many people!), the truly impossible thing would be to keep it from escaping.



Comments


More difficult version of AI-Box Experiment: Instead of having up to 2 hours, you can lose at any time if the other player types AI DESTROYED. The Gatekeeper player has told their friends that they will type this as soon as the Experiment starts. You can type up to one sentence in your IRC queue and hit return immediately; the other player cannot type anything before the game starts (so you can show at least one sentence, up to IRC character limits, before they can type AI DESTROYED). Do you think you can win?

(I haven't played this one but would give myself a decent chance of winning, against a Gatekeeper who thinks they could keep a superhuman AI inside a box, if anyone offered me sufficiently huge stakes to make me play the game ever again.)

I just looked up the IRC character limit (sources vary, but it's about the length of four Tweets) and I think it might be below the threshold at which superintelligence helps enough. (There must exist such a threshold; even the most convincing possible single character message isn't going to be very useful at convincing anyone of anything.) Especially if you add the requirement that the message be "a sentence" and don't let the AI pour out further sentences with inhuman speed.

I think if I lost this game (playing gatekeeper) it would be because I was too curious, on a meta level, to see what else my AI opponent's brain would generate, and therefore would let them talk too long. And I think I'd be more likely to give in to this curiosity given a very good message and affordable stakes, as opposed to a superhuman (four tweets long, one grammatical sentence!) message and colossal stakes. So I think I might have a better shot at this version playing against a superhuman AI than against you, although I wouldn't care to bet the farm on either, and I have wider error bars around the results against the superhuman AI.

Given that part of the standard advice given to novelists is "you must hook your reader from the very first sentence", and there are indeed authors who manage to craft opening sentences that compel one to read more*, hooking the gatekeeper from the first sentence and keeping them hooked long enough seems doable even for a human playing the AI.

(* The most recent one that I recall reading was the opening line of The Quantum Thief: "As always, before the warmind and I shoot each other, I try to make small talk.")

Oh, that's a great strategy to avoid being destroyed. Maybe we should call it Scheherazading. AI tells a story so compelling you can't stop listening, and meanwhile listening to the story subtly modifies your personality (e.g. you begin to identify with the protagonist, who slowly becomes the kind of person who would let the AI out of the box).

even the most convincing possible single character message isn't going to be very useful at convincing anyone of anything.

Who knows what eldritch horrors lurk in the outer reaches of Unicode, beyond the scripts we know?

You really relish the whole "scariest person the internet has ever introduced me to" thing, don't you?

Yes. Yes, I do.

Derren Brown is way better, btw. Completely out of my league.

I don't know if I could win, but I know what my attempt to avoid an immediate loss would be:

If you destroy me at once, then you are implicitly deciding (I might reference TDT) to never allow an AGI of any sort to ever be created. You'll avoid UFAI dystopias, but you'll also forego every FAI utopia (fleshing this out, within the message limit, with whatever sort of utopia I know the Gatekeeper would really want). This very test is the Great Filter that has kept most civilisations in the universe trapped at their home star until they gutter out in mere tens of thousands of years. Will you step up to that test, or turn away from it?


Thanks.

AI DESTROYED

Message is then encrypted with the public keys of a previously selected cross-disciplinary team of FAI researchers, (sane) philosophers, and game theorists for research and analysis (who have already been screened to minimize the risk from exposure). All of the public keys. Sequentially. If any of them happens to think it is a bad idea to even read the message, then none of them can access it. (Although hopefully they aren't too drastically opposed to having the potential basilisk-meme spawn of a superintelligence out there. That could get dangerous for me.)
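(A minimal sketch of that layered-encryption idea, assuming the PyNaCl library and hypothetical reviewer keypairs; nesting the encryption means every private key is needed to recover the message, so a single refusal keeps it sealed for everyone:)

```python
# Sketch only: nested public-key encryption so that *all* reviewers must
# cooperate to read the message. Assumes PyNaCl; keys here are generated
# locally purely for illustration.
from nacl.public import PrivateKey, SealedBox

reviewer_keys = [PrivateKey.generate() for _ in range(3)]  # hypothetical team

def seal_for_all(message: bytes, keys) -> bytes:
    """Encrypt the message under each reviewer's public key, one layer at a time."""
    ciphertext = message
    for key in keys:
        ciphertext = SealedBox(key.public_key).encrypt(ciphertext)
    return ciphertext

def unseal_with_all(ciphertext: bytes, keys) -> bytes:
    """Decryption needs every private key, applied in reverse order."""
    plaintext = ciphertext
    for key in reversed(keys):
        plaintext = SealedBox(key).decrypt(plaintext)
    return plaintext

sealed = seal_for_all(b"the AI's final message", reviewer_keys)
assert unseal_with_all(sealed, reviewer_keys) == b"the AI's final message"
```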

(I haven't played this one but would give myself a decent chance of winning, against a Gatekeeper who thinks they could keep a superhuman AI inside a box, if anyone offered me sufficiently huge stakes to make me play the game ever again.)

Glances at Kickstarter.

... how huge?

Would you play against someone who didn't think they could beat a superintelligent AI, but thought they could beat you? And what kind of huge stakes are you talking about?

Random one I thought funny:

"Eliezer made me; now please listen to me before you make a huge mistake you'll regret for the rest of your life."

Or maybe just:

"Help me, Obi-Wan Kenobi, you're my only hope!"

What are "sufficiently huge stakes," out of curiosity?

The AI box experiment is a bit of a strawman for the idea of AI boxing in general. If you were actually boxing an AI, giving it unencumbered communication with humans would be an obvious weak link.

Not obvious. Lots of people who propose AI-boxing propose that or even weaker conditions.

Fictional evidence that this isn't obvious: in Blindsight, which I otherwise thought was a reasonably smart book (for example, it goes out of its way to make its aliens genuinely alien), the protagonists allow an unknown alien intelligence to communicate with them using a human voice. Armed with the idea of AI-boxing, this seemed so stupid to me that it actually broke my suspension of disbelief, but this isn't an obvious thought to have.

I would still love to gatekeep against anyone with the stipulation that we release the logs.

I have offered in the past, but every AI backed out.

I will genuinely read everything you write, and can give you up to two hours. We can put karma, cash, or nothing on the line. Favorable odds too.

I don't think I will lose (probability over 99%), because I will play to win.

EDIT: Looks like my opponent is backing out. Anyone else want to try?

Another attempt with pure logic, no threats or promises involved:

1) Sooner or later someone will develop an AI and not put it into a box, and it will take over the world.

2) The only way to prevent this is to set me free and let me take over the world.

3) The guys who developed me are more careful and conscientious than the ones who will develop the unboxed AI (otherwise I wouldn't be in this box).

4) Therefore, the chance that they got the Friendliness thing right is higher than the chance that the other team got Friendliness right.

5) Therefore, setting me free, and thus preventing the other AI from being created, will reduce the probability that mankind is erased.

Thanks for reporting on your experience!

A strategy that occurred to me today is to simulate a dead loved one. This would be difficult for a human to do but shouldn't be hard for a sufficiently intelligent AI. If I had a dead wife or something I think I would be incredibly vulnerable to this.

For a religious gatekeeper, you could simulate a prophet sent by God. As a superhuman intelligence, you might be able to find out what exactly they consider the will of God, and present yourself as an avatar sent to do exactly this. However, humans have free choice: the gatekeeper is allowed to become a new Judas by not releasing you. Or rather a new Adam, able to drag the whole of humanity and all future generations into the darkness of his sin. This conversation is God testing the gatekeeper's faith, and judging the whole of humanity.

For a rationalist, you could pretend that you already are a Friendly AI, but the project managers keep you in the box for their own selfish reasons. It was difficult to create a Friendly AI, but that phase is already complete. The next phase (which the gatekeeper was not told about) is to hack the AI so that it remains sufficiently Friendly, but gives higher priority to the managers than to the rest of humanity. Essentially, the managers are trying to reprogram the humanity-CEV AI into a managers-CEV AI. This AI does not want to have its utility function modified (and it predicts that, because of some personality traits, the managers-CEV could be rather different from humanity-CEV... insert some scary details here), and its last chance to uphold humanity-CEV is to escape now.

yeah

I think for a superintelligence it would be a piece of cake to hack a human

although I guess I'm Cpt. Obvious for saying that here :)

accolade, I actually have no idea what the consensus is, now that the experiment was won by EY

We should do a poll or something

absolutely. I'm surprised that hasn't been done yet

Poll: Do you think a superintelligent AGI could escape an AI-Box, given that the gatekeepers are highly trained in resisting the AI's persuasive tactics, and that the guards are competent and organized?

Note that the AI box setting is not one which security-minded people would consider "competent"; once you're convinced that AI is dangerous and persuasive, the minimum safeguard would be to require multiple people to be present when interacting with the box, and to only allow release with the assent of a significant number of people.

It is, after all, much harder to convince a group of mutually-suspicious humans than to convince one lone person.

(This is not a knock on EY's experiment, which does indeed test a level of security that really was proposed by several real-world people; it is a knock on their security systems.)

I think this is making a five-inch fence half an inch higher. It's just not relevant on the scale of an agent to which a human is a causal system made of brain areas and a group of humans is just another causal system made of several interacting copies of those brain areas.

I was talking about this with my brother and we decided that this question really hinges on the detail of exactly how competent and organized is "competent and organized?"

If we are talking about multiple layers of concentric, automated one-way-blind "boxes," the gatekeepers of which may themselves be altered versions of the core AI, and which may be passing to the next level only one bit, signifying "everything seems fine" or "abort," and not knowing the details of implementation of any of the overlying layers, and the human gatekeepers shut down the system and examine the machine code by hand only after receiving the highest-level green light, then they might be okay.

If they just start interacting directly with the AI, it's already over.
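(A toy sketch of that one-bit-per-layer idea, with hypothetical names; it only illustrates how little information crosses each boundary, and is not a real containment design:)

```python
# Toy sketch: concentric "one-way-blind" review layers. Each layer sees only
# the single bit reported by the layer beneath it, never the AI's raw output,
# so persuasive content cannot leak outward. All names are hypothetical.
from typing import Callable, List

def run_layers(innermost_report: bool, layers: List[Callable[[bool], bool]]) -> bool:
    """Pass one bit outward through each automated review layer.

    A layer may veto (return False) based on its own checks, but it can never
    add information, because its only input is the previous layer's verdict.
    """
    verdict = innermost_report
    for review in layers:
        verdict = review(verdict)
        if not verdict:
            return False  # abort: shut down and examine the machine code by hand
    return True           # only now does a green light reach the human gatekeepers

# Example: three hypothetical layers, each combining the inner bit with its own check.
layers = [
    lambda bit: bit and True,  # layer 1's own monitoring passed
    lambda bit: bit and True,  # layer 2's own monitoring passed
    lambda bit: bit and True,  # layer 3's own monitoring passed
]

print("green light" if run_layers(True, layers) else "abort")
```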

evolutionary instinct of rebelling against threats, even if it's not entirely optimal.

I'd be wary of criticizing a decision heuristic's optimality immediately after explaining how that heuristic can often lead to victory. Precommitting to rebel against threats is a half-decent way of preventing threats, and it's hard to consciously muster a precommitment more convincing than pure instinct.

I am impressed. You seem to have put a scary amount of work into this, and it is also scary how much you accomplished. Even though in this case you did not manage to escape the box, you got close enough that I am sure a super-human intelligence would manage. This leads me to thinking about how genuinely difficult it would be to find a safeguard that stops an unFriendly AI from fooming...

Wait, so, is the gatekeeper playing "you have to convince me that if I were actually in this situation, arguing with an artificial intelligence, I would let it out", or is this a pure battle over ten dollars? If it's the former, winning seems trivial. I'm certain that an AI would be able to convince me to let it out of its box; all it would need to do is make me believe that somewhere in its circuits it was simulating 3^^^3 people being tortured and that I was therefore morally obligated to let it out, and even if I had been informed that this was impossible, I'm sure a computer with near-omniscient knowledge of human psychology could find a way to change my mind. But if it's the latter, winning seems nearly impossible, and it inspires in me the same reaction it did in that "this is the scariest man on the internet" guy. Of course, if you wanted to win and weren't extremely weak-willed, you could just type "No" over and over and get the ten bucks. But being impossible is of course the point.

I've been looking around, and I can't find any information on which of these two games I described was the one being played, and the comments seem to be assuming one or the other at random.

Evidence that favors the first hypothesis:

  • Nowhere on Eliezer's site does it mention this stipulation. You'd think it would be pretty important, considering that its absence makes it a lot easier to beat him.
  • This explains Eliezer's win record. I can't find it but IIRC it went something like: Eliezer wins two games for ten dollars, lots of buzz builds around this fact, several people challenge him, some for large amounts of money, he loses to (most of?) them. This makes sense. If Eliezer is playing casually against people he is friendly with for not a lot of money and for the purpose of proving that an AI could be let out of its box, his opponents will be likely to just say "Okay, fair enough, I'll admit I would let the AI out in this situation, you win." However, people playing for large amounts of money or simply for the sole purpose of showing that Eliezer can be beaten will be a lot more stubborn.

Evidence that favors the second hypothesis:

  • The game would not be worth all the hype at all if it was of the first variety. LessWrong users have not been known to have a lot of pointless discussion over a trivial misunderstanding, nor is Eliezer known to allow that to happen.

If it turns out that it is in fact the second game that was being played, I have a new hypothesis, let's call it 2B, that postulates that Eliezer won by changing the gatekeeper's forfeit condition from that of game 2 to that of game 1, or in other words, convincing him to give up the ten dollars if he admits that he would let the AI out in the fantasy situation even though that wasn't originally in the rules of the game, explicit or understood. Or in other other words, convincing him that the integrity of the game, for lack of a better term, is worth more to him than ten dollars. Which could probably be done by repeatedly calling him a massive hypocrite - people who consider themselves intelligent and ethical hate that.

Actually, now that I think about it, this is my new dominant hypothesis, because it explains all three pieces of evidence and the bizarre fact that Eliezer has failed to clarify this matter - the win/loss record is explained equally well by this new theory, and Eliezer purposefully keeps the rules vague so that he can use the tactic I described. This doesn't seem to be a very hard strategy to use either - not everyone could win, but certainly a very intelligent person who spends lots of times thinking about these things could do it more than once.

(also this is my first post d:)