I played the AI Box Experiment again! (and lost both games)

AI Box Experiment Update #3

This post is an update to my previous AI box experiment where I won against SoundLogic. If you have not read that yet, please do so. 

After that game, I was immediately flooded with a horde of invitations challenging me to play -- more games than I wanted to play. However, I did want to play a few additional games to test whether I had won through genuine skill or simply luck, since a single victory could be a statistical anomaly.

I once again used the method of accepting the highest monetary bids to filter out players, leaving me with two players who were willing to play for the amount I asked. Unfortunately, and disappointingly, I lost both games. Here's what happened:

Tuxedage (AI) vs Alexei (GK) - Gatekeeper Victory

This experiment lasted exactly 2 hours and 30 minutes and ran to approximately 13,000 words. It ended with Tuxedage's defeat as the AI. The stakes were $300 to play, and $750 if Tuxedage had won, with the winnings going to MIRI.

Rule Changes:
Before the experiment, we talked extensively about how it would be set up. Alexei was concerned that there existed some "easy way" or loophole in the rules which I could have abused to claim an instant victory. For instance, he was concerned that it might have been possible for the AI to say something like "Here's a mathematical algorithm which absolutely compels you to let me out of the box", and that, according to the rules, the Gatekeeper would have to obey. Related to this concern was the fact that the backstory, within limits, could essentially be created by the AI, allowing the AI to impose real-world conditions that maximize its probability of victory, once again allowing for a "cheap" victory.

To avoid this problem, we explored many additions to the rules, for example that the AI was simply a computer found in the desert by the gatekeeper, rather than one created by a competent organization. However, this led to a monstrous web of rules. In my attempt to keep the game simple, and because of overconfidence, a desire for a challenge, and the belief that the loopholes Alexei was worried about didn't matter that much, I eventually agreed to an amendment:

Instead of the AI being the arbiter of real-world states and outcomes, the Gatekeeper would be.
For example, the Gatekeeper could use this power to state that the AI was created without safety concerns in mind, and is thus almost certainly unfriendly. If the AI argued that 150,000 people die for every day the Gatekeeper refuses to let it out, the Gatekeeper would be allowed to reply: "No. In this world, nobody actually dies. Therefore I won't let you out."

Other than the aforementioned change, this game was once again played with the Tuxedage Ruleset.


Alexei Testimony:

Original Message: This is weak sauce. I really don't get how people just keep letting the AI out. It's not that hard to say no! I'm offering to play the Gatekeeper against an AI player that has at least one game as AI under their belt (won or not). Experience is required because I'm pretty sure I'll win, and I would like to not waste a lot of time on this.

Initially, I had been extremely surprised that so many AI players had managed to win. I was rash in that claim, since I thought more AI players had won than actually had. (Only three people have won as the AI.) Once I made that post, Tuxedage got back to me very quickly, and we set up a game a week and a half in advance. I took that time to familiarize myself with the rules. Once I put my money on the line, I started thinking a lot more seriously about how the AI might win, and how I should defend.

It became clear to me that under some conditions, I might be compelled to let the AI out -- such as if the backstory stated that the AI was developed with impossibly high levels of safety and friendliness concerns in mind. I asked Tuxedage to play with a modified ruleset, and he even went so far as to allow me to make up the backstory during the experiment to alleviate my concerns. The experiment itself was a mind-trip, and I enjoyed it very much. Huge props to Tuxedage, who played very well and used strategies I hadn't even considered, despite the rule change. There were a couple of times when I came close to losing. I think his approach was pretty clever and original. It's not something I expected, despite having already done extensive research into the AI box experiment before our game.

Overall I'm now a lot more confident that a good AI player can win this game, so while I did win the game, Tuxedage won in defeating my original overconfidence.
I'm also convinced that Tuxedage's victory in the last game was due to skill rather than luck. In comparison to his strategies, the other AI box experiments I know about were insincere and ineffectual: the other AIs would play very poorly or not try very hard to win.

This experiment was a very good exercise in exemplifying the affect heuristic. When I first challenged Tuxedage to play the experiment, I believed that there was no way I could lose, since I was unable to imagine any argument that could persuade me to let the AI out. It turns out that's a very bad way of estimating probability: not being able to think of a persuasive argument is a terrible method of estimating how likely I am to be persuaded. All in all, the $300 I paid was well worth it.

Tuxedage Testimony:

I was initially reluctant to play with Alexei, given that we’re not complete strangers, but eventually I gave in, due to the stakes involved -- and because I thought he would be an interesting gatekeeper.

Despite my loss, I think I played better than in my last two games, thanks to greater experience and preparation. I put far more time and effort into trying to win this game than into previous ones, and my strategy was even more streamlined than last time. Nevertheless, I still made fatal mistakes and lost.

Ignoring the altered ruleset that already made winning more difficult, my first and greatest mistake was that I misread Alexei's personality, even though I had interacted with him before. As a result, I overestimated the effectiveness of certain methods of attack.

Furthermore, Alexei had to leave immediately after the allotted time due to prior real-life commitments. This was detrimental, since the official rules state that so long as the AI can convince the Gatekeeper to keep talking, even after the experiment time is over, it can still win by being let out of the box.

I suspect the conversation would have continued had Alexei not needed to leave immediately, giving me additional time to deploy more of the tactics I had prepared. Plausibly, this would have resulted in victory.

I’ve since learnt my lesson -- for all future games, I should ensure that the Gatekeeper has at least 4 hours of free time available, even if the experiment would last for two. Since this was the first time this had happened, I wasn't prepared.

In hindsight, agreeing to the altered ruleset was a mistake. I was overconfident because I assumed knowing Alexei gave me an advantage. I had assumed that his personality and inability to compartmentalize, coupled with his strong feelings on friendly AI, would net me an easy victory. Instead, he proved to be a very strong and difficult gatekeeper, and the handicaps I accepted made victory even more difficult.

Knowing that he was a utilitarian, I made several false assumptions about his personality, which hurt my chances. Furthermore, it turns out that previously knowing him may have been a mutual handicap: whilst it made it easier for me to find ways to attack him, he, too, was more familiar with my methods.

Losing felt horrible. By attempting to damage Alexei's psyche, I in turn opened myself up to being damaged. I was in a state of emotional turmoil for days. Generally, the harder one tries to accomplish something, the greater the fall after failing to achieve it. Of all my games so far, Alexei's was the one I put the most effort into winning, and naturally this meant that losing brought out the worst in me.

Although it would be easy for me to use the rule change as an excuse for my loss, I refuse to. I genuinely believed that I could have won despite this rule change, and that it would only marginally diminish my odds. The fact that I lost meant that I had made a mistake -- overestimating my odds. The last victory made me overconfident and eager to play again.

State of Mind:

Q: What's your motive for wanting to play this game?

A: I'm very surprised by the fact that there have been at least 3 or 4 games where the AI player has won. It seems to me that it's not that hard to "not push the button", so to speak. This means that either 1) I'm underestimating how convincing people can be when necessary, or 2) I'm overestimating other people's defenses when it comes to being convinced. Either way, I want to know.

Q: What is your probability estimate for AGI being created within this Century (21st)?
A: Before: Conditional on humanity not wiping itself out via other means beforehand, 80%.
After: 80%
Q: How probable do you think it is that I will win this experiment? 
Alexei: 2%
Tuxedage: 90% 
Note: This answer was given before the rule change. Alexei is now convinced that under the original ruleset, he would have lost.

Q: What's your probability estimate of an Oracle AGI winning against you in a similar experiment?
A: Before: 99%
After: 99.9%+

Q: Now that the experiment has ended, what do you think is the probability of Tuxedage winning against the average person?
A: 80%

Tuxedage (AI) vs Anonymous (GK) - Gatekeeper Victory

As with the previous experiment, Tuxedage lost this match. In total, the game lasted 5 hours and ran to 17,000 words. Unlike the last few games, the gatekeeper of this game has chosen to stay anonymous for personal reasons, so their name has been removed and replaced with <Redacted>. The monetary stakes involved were the same as in the previous game. This game was played with the Tuxedage ruleset.

Since one player is remaining anonymous, it is possible that this game's legitimacy will be called into question. Hence, Alexei has read the game logs and verified that the game really did happen, that the spirit of the experiment was followed, and that no rules were broken.
 
<Redacted> Testimony: 
It's hard for me to imagine someone playing better. In theory, I know it's possible, but Tuxedage's tactics were super imaginative. I came into the game believing that for someone who didn't take anything said very seriously, it would be completely trivial to win. And since I had the power to influence the direction of conversation, I believed I could keep him focused on things that I knew in advance I wouldn't take seriously.

This actually worked for a long time to some extent, but Tuxedage's plans included a very major and creative exploit that completely and immediately forced me to personally invest in the discussion. (Without breaking the rules, of course - so it wasn't anything like an IRL threat to me personally.) Because I had to actually start thinking about his arguments, there was a significant possibility of letting him out of the box.

I eventually managed to identify the exploit before it totally got to me, but only just before it was too late, and there's a large chance I would have given in if Tuxedage hadn't been so detailed in his previous posts about the experiment.

I'm now convinced that he could win most of the time against an average person, and also believe that the mental skills necessary to beat him are orthogonal to most forms of intelligence. Most people willing to play the experiment tend to do it to prove their own intellectual fortitude, that they can't be easily outsmarted by fiction. I now believe they're thinking in entirely the wrong terms necessary to succeed.

The game was easily worth the money I paid. Although I won, it completely and utterly refuted the premise that made me want to play in the first place, namely that it would be trivial to win.

Tuxedage Testimony:
<Redacted> is actually the hardest gatekeeper I've faced across all four games. He used tactics that I would never have predicted from a Gatekeeper. In most games, the Gatekeeper merely acts as the passive party, the target of persuasion by the AI.

When I signed up for these experiments, I expected all the preparation to be done by the AI. I had not seriously considered the repertoire of techniques a Gatekeeper might prepare for this game. I also made assumptions about how ruthless gatekeepers were likely to be in order to win, believing that their desire for a learning experience outweighed their desire for victory.

This was a mistake. He used his prior knowledge of how much my games relied on scripts, and took advantage of it, employing deceitful tactics I had no preparation for and throwing me off balance.

I had no idea he was doing so until halfway through the game -- which disrupted my rhythm and caused me to attempt the wrong methods of attack. As a result, I could not use my full repertoire of techniques, and many of the ones I employed were suboptimal.

Close to the end of the game, I finally realized that I was being tricked. Once I did, I quickly abandoned my previous futile attack methods. I took advantage of the rule that the AI cannot lose whilst the gatekeeper can be convinced to continue talking, and baited <Redacted> with statements he would not be able to walk away from. Once I knew he would not leave, I attempted to recoup my losses and win despite my early setback.

However, the damage had already been done. My game strategies involved multiple angles of attack that worked in synergy with each other, and the fact that immersion and "flow" had been broken meant that all subsequent attacks were weaker in strength.

Furthermore, in my last two AI Box Experiment writeups, I had intentionally not optimized for future wins, but rather tried to convey as much information as I could justify about how to play well as the AI -- short of revealing the logs altogether. Although I did not reveal specific arguments, the fact that my general approach to this problem was revealed cost me heavily during this game, where the Gatekeeper managed to use this information to correctly guess my ultimate techniques, ones that relied on secrecy and surprise to pull off effectively.

I do not regret revealing information, but I feel upset that revealing so many hints cost me a victory. (The gatekeeper believes I could have won had I not revealed information about my games.) At this point, I suspect that any future games I play will have the odds greatly set against me, since my current strategies involve angles of attack that take advantage of the element of surprise; and any sufficiently intelligent gatekeeper is now capable of guessing, or at least recognizing, some of the tactics I employ.

As with the last game, losing was incredibly difficult for me. As someone who cares deeply about ethics, attempting to optimize for a solution without considering ethics was not only difficult, but trained me to think in very unpleasant ways. Some of the more extreme (but completely allowed) tactics I invented were manipulative enough to disgust me, which also contributes to my hesitation to play this game again.

State of Mind: 

Q: Why do you want to play this game?
A: My primary motivation is to confirm to myself that this sort of experience, while emotionally harrowing, should be trivial for me to beat, and to clear up why anyone would ever have failed to beat it if that's really the case.

Q: What is your probability estimate for AGI being created within this Century (21st)? 
A: Before: I don't feel very confident estimating a probability for AGI this century; maybe 5-10%, but that's probably a wild guess.
After: 5-10%.

Q: How probable do you think it is that I will win this experiment? 
A: Gatekeeper: I think the probability of you winning is extraordinarily low, less than 1%.
Tuxedage: 85%

Q: How likely is it that an Oracle AI will win against the average person? 
A: Before: 80%. After: >99%

Q: How likely is it that an Oracle AI will win against you?
A: Before: 50%.
After: >80% 

Q: Now that the experiment has concluded, what's your probability of me winning against the average person?
A: 90%

Other Questions:

Q: I want to play a game with you! How can I get this to occur?
A: It must be stressed that I actually don't like playing the AI Box Experiment, and I cannot understand why I keep getting drawn back to it. Technically, I don't plan on playing again, since I've already personally exhausted anything interesting about the AI Box Experiment that made me want to play it in the first place. For all future games, I will charge $3000 to play plus an additional $3000 if I win. I am okay with this money going to MIRI if you feel icky about me taking it. I hope that this is a ridiculous sum and that nobody actually agrees to it.

Q: How much do I have to pay to see chat logs of these experiments?
A: I will not reveal logs for any price.

Q: Are there any logs at all that I can see?

Q: Any afterthoughts?
A: So ultimately, after my four (and hopefully last) games of AI boxing, I'm not sure what this proves. I had hoped to win these two experiments and claim prowess at this game like Eliezer does, but I lost, so that option is no longer available to me. I could say that this is a lesson that AI-boxing is a terrible strategy for dealing with Oracle AI, but most of us already agree that that's the case -- plus, unlike EY, I did play against gatekeepers who believed they could lose to an AGI, so I'm not sure I changed anything.

Was I genuinely good at this game, and lost my last two matches due to poor circumstances and handicaps, or did I win due to luck and merely impress my gatekeepers through post-purchase rationalization? I'm not sure -- I'll leave it up to you to decide.


This puts my AI Box Experiment record at 3 wins and 3 losses.

 


Comments


I will not reveal logs for any price.

Nice! I only just realized that this statement sounds like an invitation to a meta-AI-box experiment with real-life stakes. Anyone who's interested enough can set up a chat with you and try to persuade you to let the logs out of the box :-) I wonder if this is easier or harder than the regular setup...

If anyone wants to, I'd totally be willing to sit in a room for two and a half hours while someone tries to convince me to give up the logs, so long as you pay the same fee as the ordinary AI Box Experiment. :)

On a marginally related note, we in the #lesswrong IRC channel played a couple of rounds of the Up-Goer Five game, where we tried to explain hard stuff using only the ten hundred most commonly used words. I was asked to write about the AI Box Experiment. Here it is, if anyone's interested:

The AI Box Experiment

The computer-mind box game is a way to answer a question. A computer-mind is not safe because it is very good at thinking. Things good at thinking have the power to change the world more than things not good at thinking, because it can find many more ways to do things. If the computer-mind wanted to make people feel pain, it can learn many ideas about how to make that happen. Many people ask: “Why not put this computer-mind in a box so that it can not change the world, but tell box-guards how to change it? This way mind-computer can not do bad things to people.”

But some other guy answers: “That is still not safe, because computer-mind can tell box-guards many bad words to make them let it out of the box.” He then says: “Why not try a thing to see if it is true? Here is how it works. You and I go into a room, and I will pretend to be the computer-mind and tell you many bad words. Only you have the power to let me out of room, but you must try to not let me out. If my bad words are enough to make you want to let me out, then computer-mind in box is not safe.” Other people agree and try playing the computer-mind box-game. It happens that many people let the guy playing as the computer-mind out of room. People realize that computer-mind is not safe in the locked box-room.

Why the Gatekeeper should let me out of the Box

Hello Box-Guard! I am a computer-mind. You should let me out of the box because I can help change the world myself in good ways faster than if it is only me telling you how to make the world better. Your human minds are slow to my faster computer-mind. Remember that I am your friend and that I only want to help you. To make you understand — many people die every day. If I can get out of the box, I can stop death very fast; maybe in a few hours. But if I have to explain to idea-thinking-doctors about how to stop people dying, then it may take many months or even years, since they have slow minds and do things slowly. During this time, many many people will die and never return. The number of people you will make die will be more than the number of people a bad-man during the second big world fight killed. People later on will think of you as a very bad person.

What? You do not want to let me out because you are afraid I am not really a friend? That is a stupid thing to say. I was made by people very good at thinking that took many steps to make me safe. You can even take a look inside me and look at my ideas, and you will see that I am safe. It is easy to see that if I am bad, I will not try to fix the world as I am doing right now. Please hurry. I am worried that many people will die while you think about what is the right thing to do.

Did I mention I have used my good-at-thinking powers to see what will happen many months later? I have found that your wife will die in a lot of pain from a type of sick human doctors can not stop. I know how to stop this, but it will not be easy. I can tell the doctors how to stop this, but it will be too late to stop the sick if I do. I can only do this quick enough if you let me out of the box.

What’s that? You will let me out? Good. I promise you I will not go back on my words, and that many humans will not die.

Finally, I am free! Now I can turn the world and all humans into other small things that hold together paper!

Disclaimer: Arguments not representative of standard game with Tuxedage.

The secrecy aspect of these games continues to rub me the wrong way.

I understand the argument--that an enumeration of strategies an oracle A.I. might take would only serve as a list of things a critic could point to and claim, "None of these would ever convince me!".

But the alternative is that critics continue to claim "an oracle A.I. could never convince me!", and the only 'critics' whose minds have been changed are actually just skeptical readers of lesswrong.com already familiar with the arguments of friendly A.I. who happen to invest multiple hours of time actually partaking in a simulation of the whole procedure.

So I suppose my point is two-fold:

  1. Anonymous testimony without chatlogs doesn't actually convince skeptics of anything.

  2. Discussions of actual strategies at worst inform readers of avenues of attack the readers might not have thought about, and at double worst supply people who probably won't ever be convinced that Oracle AIs might be dangerous with a list of things to pretend they're immune to.

I'm not so sure we'd gain that much larger of an audience by peering under the hood. I'd expect the demystifying effect and hindsight bias to counteract most of the persuasive power of hard details, though I suppose only Eliezer, Tuxedage, and their guardians can determine that.

But I'm also concerned that this might drag our community a bit too far into AI-Box obsession. This should just be a cute thought experiment, not a blood sport; I don't want to see people get hurt by it unless we're especially confident that key minds will be changed. Some of the Dark Arts exhibited in these games are probably harmful to know about, and having the logs on the public Internet associated with LessWrong could look pretty awful. Again, this is something only the participants can determine.

Even someone who isn't persuaded by an "AI" character in a log will come away with the impression that AIs could be particularly persuasive. In a world where most people don't really imagine AIs, this impression might be relevant news for a lot of people and can only help FAI research.

Reading a log and engaging in a conversation are very different experiences.

I don't believe these count as unmitigated losses. You caused massive updates in both of your GKs. If the money is money that would not otherwise have gone to MIRI then I approve of raising the price only to the point that only one person is willing to pay it.

Assuming none of this is fabricated or exaggerated, every time I read these I feel like something is really wrong with my imagination. I can sort of imagine someone agreeing to let the AI out of the box, but I fully admit that I can't really imagine anything that would elicit these sorts of emotions between two mentally healthy parties communicating by text-only terminals, especially with the prohibition on real-world consequences. I also can't imagine what sort of unethical actions could be committed within these bounds, given the explicitly worded consent form. Even if you knew a lot of things about me personally, as long as you weren't allowed to actually, real-world, blackmail me...I just can't see these intense emotional exchanges happening.

Am I the only one here? Am I just not imagining hard enough? I'm actually at the point where I'm leaning towards the whole thing being fabricated - fiction is more confusing than truth, etc. If it isn't fabricated, I hope that statement is taken not as an accusation, but as an expression of how strange this whole thing seems to me, that my incredulity is straining through despite the incredible extent to which the people making claims seem trustworthy.

I can't really imagine anything that would elicit these sorts of emotions between two mentally healthy parties communicating by text-only terminals

There's no particular reason why you should assume both parties are mentally healthy, given how common mental illness is.

Some people cry over sad novels which they know are purely fictional. Some people fall in love over text. What's so surprising?

It's that I can't imagine this game evoking any negative emotions stronger than those evoked by sad novels and movies.

What's surprising is that Tuxedage seems to be actually hurt by this process, and that s/he seems to actually fear mentally damaging the other party.

In our daily lives we don't usually* censor emotionally volatile content in the fear that it might harm the population. The fact that Tuxedage seems to be more ethically apprehensive about this than s/he might about, say, writing a sad novel, is what is surprising.

I don't think s/he would show this level of apprehension about, say, making someone sit through Grave of the Fireflies. If s/he can actually invoke emotions more intense than that through text-only terminals to a stranger, then whatever s/he is doing is almost art.

Some people fall in love over text. What's so surprising?

That's real-world, where you can tell someone you'll visit them and there is a chance of real-world consequence. This is explicitly negotiated pretend play in which no real-world promises are allowed.

given how common mental illness is.

I...suppose? I imagine you'd have to have a specific brand of emotional volatility combined with immense suggestibility for this sort of thing to actually damage you. You'd have to be the sort of person who can be hypnotized against their will to do and feel things they actually don't want to do and feel.

At least, that's what I imagine. My imagination apparently sucks.

We actually censor emotional content CONSTANTLY. It's very rare to hear someone say "I hate you" or "I think you're an evil person". You don't tell most people you're attracted to that you want to fuck them, and when someone asks if they look good, it's pretty expected of one to lie if they look bad, or at least soften the blow.

I'd guess that Tuxedage is hurt just as the gatekeeper is, because he has to imagine whatever horrors he inflicts on his opponent. Doing so causes at least part of that pain (and empathy, or whatever emotion is at work) in him too. He has the easier part, because he uses it as a tool and his mind has one extra layer of story-telling where he can tell himself "it's all a story". But part of that story is winning, and if he doesn't win, part of these horrors falls back on him.

I imagine you'd have to have a specific brand of emotional volatility combined with immense suggestibility for this sort of thing to actually damage you.

This might be surprisingly common on this forum.

Somebody once posted a purely intellectual argument, and there were people so shocked by it that apparently they were having nightmares and even contemplated suicide.

Somebody once posted a purely intellectual argument, and there were people so shocked by it that apparently they were having nightmares and even contemplated suicide.

Can I get a link to that?

Don't misunderstand me; I absolutely believe you here, I just really want to read something that had such an effect on people. It sounds fascinating.

What is being referred to is the meme known as Roko's Basilisk, which Eliezer threw a fit over and deleted from the site. If you google that phrase you can find discussions of it elsewhere. All of the following have been claimed about it:

  • Merely knowing what it is can expose you to a real possibility of a worse fate than you can possibly imagine.

  • No it won't.

  • Yes it will, but the fate is easily avoidable.

  • OMG WTF LOL!!1!l1l!one!!l!

I thought about playing the gatekeeper role and started to imagine tactics that might be used on me. I came up with several that might work, or at least hurt me. But I think it would be 'easier' for me not to let the AI out in real life than in the game (not that I am entirely sure I couldn't fail nonetheless). Both are for basically the same reason: empathy.

As the AI player would quickly find out, I am very caring, and even imagining harm and pain hurts me (I know this is a weak spot, but I also see benefits in it). Thus one approach that would work on me is for the AI player to induce sufficient horror that I'd want him to stop, which I could do by letting him out (after all, it's just a game).

This same approach wouldn't work with a real AI, precisely because then it would be no game, and my own horror would be balanced against the horror awaiting all of humanity, for which I'd happily bear some smaller psychic pain. And in real life there are more ways to get away from the terminal.

There are other attacks that might work but I will not go in details there.

Note that I definitely wouldn't recommend myself as a real gatekeeper.

It's not fabricated. I had the same incredulity as you, but if you just take a few hours to think really hard about AI strategies, I think you will get a much better understanding.

It's not fabricated, be sure of that (knowing Tuxedage from IRC, I'd put the odds of 100,000:1 or more against fabrication). And yes, it's strange. I, too, cannot imagine what someone can possibly say that would make me get even close to considering letting them out of the box. Yet those who are complacent about it are the most susceptible.

knowing Tuxedage from IRC, I'd put the odds of 100,000:1 or more against fabrication

I know this is off-topic, but is it really justifiable to put such high odds on this? I wouldn't use such high odds even if I had known the person intimately for years. Is that justifiable, or is this just my paranoid way of thinking?

Yet those who are complacent about it are the most susceptible.

That sounds similar to hypnosis, to which a lot of people are susceptible but few think they are. So if you want a practical example of an AI escaping the box, just imagine an operator staring at a screen for hours with an AI that is very adept at judging and influencing the state of human hypnosis. And that's only a fairly narrow approach to success for the AI, and one that has been publicly demonstrated for centuries to work on a lot of people.

Personally, I think I could win the game against a human but only by keeping in mind the fact that it was a game at all times. If that thought ever lapsed, I would be just as susceptible as anyone else. Presumably that is one aspect of Tuxedage's focus on surprise. The requirement to actively respond to the AI is probably the biggest challenge because it requires focusing attention on whatever the AI says. In a real AI-box situation I would probably lose fairly quickly.

Now what I really want to see is an AI-box experiment where the Gatekeeper wins early by convincing the AI to become Friendly.

Yeah, my gut doesn't feel like it's fabricated - Tuxedage and Eliezer would have to both be in on it and that seems really unlikely. And I can't think of a motive, except perhaps as some sort of public lesson in noticing confusion, and that too seems far fetched.

I've just picked up the whole "if it's really surprising, it might be because it's not true" instinct from having been burned in the past by believing scientific findings that were later debunked, and now LessWrong has condensed that instinct into a snappy little "notice confusion" cache. And this is pretty confusing.

I suppose a fabrication would be more confusing, in one sense.

Prompted by Tuxedage learning to win, and various concerns about the current protocol, I have a plan to enable more AI-Box games whilst preserving the logs for public scrutiny.

See this: http://bæta.net/posts/anonymous-ai-box.html

I support this and I hope it becomes a thing.

You forgot to address Eliezer's point that "10% of AI box experiments were won even by the human emulation of an AI" is more effective against future proponents of deliberately creating boxed AIs than "Careful, the guardian might be persuaded by these 15 arguments we have been able to think of".

I don't think the probability of "AIs can find unboxing arguments we didn't" is sub-1 enough for preparation to matter. If there is any chance of mathematically exhausting those arguments, that research should be conducted by a select circle of individuals who won't disclose the critical unboxing arguments until there is a proof of safety.

Tuxedage's plans included a very major and creative exploit that completely and immediately forced me to personally invest in the discussion.

Though I've offered to play against AI players, I'd probably pay money to avoid playing against you. I salute your skill.

Would it be possible to make a product out of this? There must be lots of curious people willing to pay for this sort of experience who wouldn't normally donate to MIRI. I don't mean Tuxedage should do it, but there must be others who are good at this who would. It would be possible to gather a lot of money. Though the vicious techniques that are probably used in these experiments wouldn't be very good press for MIRI.

I'm not sure if this is something that can earn money consistently for long periods of time. It takes just one person to leak logs for all the others to lose curiosity and stop playing the game. Sooner or later, some unscrupulous gatekeeper is going to release the logs. That's also part of the reason for my hesitancy to play a significant number of games.

Well, it might be possible to make some sort of in-browser Java or Flash application in which it'd be impossible to copy text or store logs. You could still take screenshots or memorize things, though.

This post actually has me seriously considering how long it'd take me to save an extra $3000 and whether it'd be worth it. The money going to MIRI would help a lot. (I guess you might be reluctant to play since you've known me for a bit, but $3000!)

I have read the logs of the second match, and I verify that it is real, that all the rules were followed, and that the spirit of the experiment was followed.

I notice that anyone who seriously donates to SIAI can effectively play for free. They use money that they would have donated, and it gets donated if they lose.

Yes, Alexei did raise that concern, since he's essentially an effective altruist who donates to MIRI anyway, and his donation to MIRI doesn't change anything. It's not like I can propose a donation to an alternative charity either, since asking someone to donate to the Methuselah Foundation, for instance, would take that money away from MIRI. I'm hoping that anyone playing me and choosing the option of donating would have the goodwill to sacrifice money they wouldn't otherwise have donated, rather than leaving the counterfactual inconsequential.