Prisoner's dilemma tournament results

The prisoner's dilemma tournament is over. There were a total of 21 entries. The winner is Margaret Sy, with a total of 39 points. 2nd and 3rd place go to rpglover64 and THE BLACK KNIGHT, with scores of 38 and 36 points respectively. There were some fairly intricate strategies in the tournament, but all three of these top scorers submitted programs that completely ignored the source code of the other player and acted randomly, with the winner having a bias towards defecting.

You can download a chart describing the outcomes here, and the source codes for the entries can be downloaded here.

I represented each submission with a single letter while running the tournament. Here is a directory of the entries, along with their scores. (Some people gave me a term to refer to the player by, while others gave me a term to refer to the program. I went with whatever they gave me, and if they gave me both, I put the player first and then the program.)

A: rpglover64 (38)
B: Watson Ladd (27)
C: THE BLACK KNIGHT (36)
D: skepsci (24)
E: Devin Bayer (30)
F: Billy, Mimic-- (27)
G: itaibn (34)
H: CooperateBot (24)
I: Sean Nolan (28)
J: oaz (26)
K: selbram (34)
L: Alexei (25)
M: LEmma (25)
N: BloodyShrimp (34)
O: caa (32)
P: nshepperd (25)
Q: Margaret Sy (39)
R: So8res, NateBot (33)
S: Quinn (33)
T: HonoreDB (23)
U: SlappedTogetherAtTheLastMinuteBot (20)


Comments


Yay, I wasn't last!

Still, I'm not surprised that laziness did not pay off. I wrote a simple bot, then noticed that it cooperated against defectbot and defected against itself. I thought to myself, "This is not a good sign." Then I didn't bother changing it.

Can I suggest running a second tournament, including all these bots plus new bots written with these bots available for examination? After a few iterations, we might beat random. Or we might develop all sorts of signaling quirks, but that would also be interesting.

I like this plan. I'd be willing to run it, unless AlexMennen wants to.

When I first announced the tournament, people came up with many suggestions for modifications to the rules, most of which I did not take for the sake of simplicity. Maybe we should try to figure out what rules will make it easier to write more effective bots, even if that would disqualify many of the bots already submitted to this tournament.

Maybe we should have a prisoner's dilemma meta-tournament in which we define a tournament-goodness metric and then everyone submits a tournament design that is run using the bot participants in the previous tournament, and then use the rankings of those designs.

Wait: does anyone know a good way to design the meta-tournament?

Am I the only one who sees a problem in that we're turning a non-zero-sum game into a winner-take-all tournament? Perhaps instead of awarding a limited resource like bitcoins to the "winner", each player should be awarded an unlimited resource such as karma or funny cat pictures according to their strategy's performance.

I'll list some suggestions.

  1. Conduct a statistically significant number of trials. With few trials, the highest-ranked agents will be the ones with high variance in possible score, rather than the agents with the highest average score.
  2. Allow no libraries, except for a very small whitelist (e.g. random numbers).
  3. No timing, no threads. They make analysis too hard. The payoff might have to be changed to make it never in your best interest to make your opponent diverge, e.g. (-1, -1).
  4. The rule "This is about Prisoner's dilemma, not the language you're working in. If you violate the spirit of this rule, your agent will be disqualified."
  5. (Possibly) Change the language to a restricted dialect of Python. You can analyze your opponent's bytecode if you wish, or run your opponent as a black box. Programs with side effects (such as writing to global variables) and programs that use Pythonic trickiness (such as inspecting the stack) are verboten.
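To illustrate point 1, here's a minimal Python sketch (the tournament itself used Scheme) of how averaging many trials tames the variance of a random strategy. The payoff values below are assumptions for illustration, not necessarily the tournament's actual matrix.

```python
import random

# Hypothetical PD payoffs (row player's score listed first):
# mutual cooperation 2/2, mutual defection 1/1, defect-vs-cooperate 3/0.
PAYOFF = {("C", "C"): (2, 2), ("D", "D"): (1, 1),
          ("D", "C"): (3, 0), ("C", "D"): (0, 3)}

def random_bot(p_defect=0.8):
    # Random play with an 80% bias toward defection, like the winner.
    return "D" if random.random() < p_defect else "C"

def cooperate_bot():
    return "C"

def score_once():
    # One round: the random bot's score against a cooperator.
    a, b = random_bot(), cooperate_bot()
    return PAYOFF[(a, b)][0]

def mean_score(trials):
    # Averaging over many trials converges to 0.8*3 + 0.2*2 = 2.8,
    # while any single trial is either 3 or 2.
    return sum(score_once() for _ in range(trials)) / trials
```

A single round tells you almost nothing about a random strategy; the mean over thousands of rounds pins down its true expected score.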

I went and ran this another 100 times, so I could see what it would look like without the variance. The mean scores are:

A: 32.03
B: 28.53
C: 32.48
D: 24.94
E: 28.75
F: 29.62
G: 28.42
H: 26.12
I: 26.06
J: 26.10
K: 36.15
L: 27.21
M: 25.14
N: 34.37
O: 31.06
P: 26.55
Q: 34.95
R: 32.93
S: 37.08
T: 26.43
U: 24.24

If you're interested, here's the code for the test (takes a day to run) and the raw output for my run (an inconvenient format, but it shows the stats for the matchups).
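The averaging done here can be sketched as follows (Python for illustration; `runs` would hold one label-to-score dict per full round-robin, and the function names are mine):

```python
def average_scores(runs):
    # runs: list of dicts mapping bot label -> score for one full round-robin.
    totals = {}
    for run in runs:
        for bot, score in run.items():
            totals[bot] = totals.get(bot, 0.0) + score
    return {bot: total / len(runs) for bot, total in totals.items()}

def ranking(avg_scores):
    # Sort labels by mean score, highest first.
    return sorted(avg_scores.items(), key=lambda kv: -kv[1])

# Toy usage with two fake runs:
runs = [{"A": 30.0, "B": 20.0}, {"A": 34.0, "B": 28.0}]
avg = average_scores(runs)
```

With 100 runs per matchup, the per-bot means are stable enough that the sorted ranking reflects strategy quality rather than one run's luck.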

And for the lazy, here these scores are in sorted order (with original scores in parentheses):

S: 37.08 - Quinn (33)
K: 36.15 - selbram (34)
Q: 34.95 - Margaret Sy (39)
N: 34.37 - BloodyShrimp (34)
R: 32.93 - So8res, NateBot (33)
C: 32.48 - THE BLACK KNIGHT (36)
A: 32.03 - rpglover64 (38)
O: 31.06 - caa (32)
F: 29.62 - Billy, Mimic-- (27)
E: 28.75 - Devin Bayer (30)
B: 28.53 - Watson Ladd (27)
G: 28.42 - itaibn (34)
L: 27.21 - Alexei (25)
P: 26.55 - nshepperd (25)
T: 26.43 - HonoreDB (23)
H: 26.12 - CooperateBot (24)
J: 26.1 - oaz (26)
I: 26.06 - Sean Nolan (28)
M: 25.14 - LEmma (25)
D: 24.94 - skepsci (24)
U: 24.24 - SlappedTogetherAtTheLastMinuteBot (20)

Has someone gone through all the programs to figure out what strategy they're supposed to be implementing, then checked that they in fact correctly implement that strategy? Also, given that the bots were allowed to use randomness, it would've been a good idea to average a bunch of rounds together.

A tournament like this would be much more interesting if it involved multiple generations. Here, the results heavily depended upon the pool of submitted strategies, regardless of their actual competitiveness, while a multiple-generations tournament would measure success as performance against other successful strategies.

H, I, and J are identical and received noticeably different scores. Maybe next time, repeat the whole process a few times to even out the RNG?

Also, it would be interesting to see the head-to-head matchups, especially to see if R's complexity actually helped even though it seemed mostly aimed at special cases that never came up.

Hmm. Considering the heavy number of random strategies, it seems like it might be worth iterating the tournament several more times and summing the point totals, particularly since C (one of the randoms) is above the three-way tie for 4th place between G, K, and N by only two points, and A and C also have a 2-point difference despite having functionally identical strategies (50% random Cooperate or Defect). I feel like that implies standard deviations may make analysis of the results premature.

Nonetheless, analyzing is fun, so I'll do a little bit anyway on the fourth placers.

K appears to be functionally identical to a Defect bot. It has more complicated code, but that code never cooperated. (Side note: randomly, in this run, Margaret Sy's winning strategy, random with an 80% bias toward defection, only cooperated with this bot and defected against all the other bots.)

Also, G and N, despite being in 4th place, didn't actually seem to successfully take advantage of any of the 3 pseudo-simplistic cooperate bots in this run (H, I, J: simulate something, then cooperate regardless) by defecting against them, which would seem to be one of the key ways a mind reader could try to score points for a win. I would guess the 'pretend to be thinking' purpose of the simulate call made them look too complicated to exploit in that manner, but I'm not that familiar with the programming language, so perhaps not? (Not that being H, I, or J's type of bot was sufficient to win. Their points are low enough that it seems unlikely that they would surge ahead even after repetition.)

Anyway, thank you for running this! It was a neat idea and as above, I had fun looking at the results.

Edit: I realized a pronoun above was unclear and replaced it with nouns.

I think that kind of deviation was probably a large part of the motivation for those who submitted random-playing bots, since the contest rules specified one round.

I haven't looked too closely at K, nor ran any tests, but I have a slight suspicion it sometimes cooperated in simulation, despite always defecting on its turn.

As for the cooperatebots, there were multiple reasons I didn't write N to exploit them, but not source complexity--they don't even do anything besides cooperate; the middle term is just the function's argument list.

One additional note on a second tournament: I notice bots weren't played against themselves. Should this be done? It seems useful as a sanity check; I feel like a bot that won't even cooperate against itself ought to get fewer points.
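A self-play sanity check like the one proposed could be as simple as this Python sketch (bots are modeled as zero-argument move functions, a toy assumption that ignores source-reading; the function name is mine):

```python
def self_play_check(bots):
    # bots: dict mapping name -> zero-argument move function returning "C" or "D".
    # Flag any bot that fails to reach mutual cooperation against a copy
    # of itself; such a bot arguably deserves fewer points.
    failures = []
    for name, bot in bots.items():
        if (bot(), bot()) != ("C", "C"):
            failures.append(name)
    return failures
```

Note that a random bot would fail this check only sometimes, so in practice the check would itself need repeated trials.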

Huh. I'm not entirely surprised that random behavior was dominant, but it's certainly depressing. Nevertheless, I'm glad you ran this.

This may be more a reflection that programming interesting strategies which run each other's code and fully do what you want them to do is tough. It might be interesting to look at the strategies which did look at the other's code and see if any responded in an obviously wrong fashion in that particular instance.

I wasn't expecting there to be so much randomness. If I had been, I guess I would have spent the time to figure out how to test for it in some kind of loop and defect if the opponent was random. Looping in lisp is unnecessarily complicated though, so I didn't bother...

That suggests that a large part of this behavior may have been due to limited amounts of programming time.

I don't know quine at all and can't easily understand exactly what all these guys are doing but:

This is obviously NOT a stable metagame, in both senses.

If we ran this tournament again, the random-bots would get utterly destroyed (since it's very easy/fast to check for them and defect). If we ran this tournament with multiple rounds, where successful strategies survive and unsuccessful ones die, and there was even one defect-against-randombot entry in the mix, it would win if it didn't die early.

My guess-happy summary of what happened is this: The core problem is very hard, so most people instead did something simple, often very simple (cooperate bot, defect bot, random bot - which won!) with a small number of people trying for real. The people trying for real, however, spent all their energy planning for other people trying for real and trying to fool them, rather than trying to act correctly against all the super-simple programs, because if you put in that much work you think other people will mostly put in at least some work (R has an entire long blog post of metagame and strategic analysis which is COMPLETELY wrong in exactly this way, in real life I do this a lot). That, combined with programming errors since the task involved was actually hard, prevented them from winning.

The lesson here, I think, is more about how people react to such tournaments than it is about the actual problem at hand; if anyone had assumed their opponents would either be simple or hopelessly complex, they could have written a program that defects against programs that don't look at their code or otherwise are clearly safe defections, does something hard to exploit against everyone else that likely is somewhat random, and win easily.

Wait, someone actually submitted CooperateBot? That wasn't just a default entry?

Thoughts on a possible second tournament:

  • I like Luke Somers's suggestion that each matchup should be run several times to smooth out randomness (obviously, bots don't get access to earlier runs, this isn't iterated prisoner's dilemma)
  • Should it include default entries? Perhaps a CooperateBot, a DefectBot, and a RandomBot? Or perhaps just a RandomBot?
  • Should we maybe restrict to a subset of Scheme to make things easier for those of us who want to do static analysis? :) (Yeah, yeah, I never actually submitted anything...) Or should we just rely on pre-announcements of "If you do this, I'm defecting"?

Fantastic work, thank you =)

For anyone else unpacking the zip file, note you'll want to create a new directory to unzip it into, not just unzip it in the middle of your home folder.

:(

I didn't realise that when you said programs should assume eval would work normally, that meant I needed to make my simulated eval take a namespace as an optional argument (and use the proper namespace containing the standard library as a default). So my program crashed whenever it simulated something that uses (eval x). There goes my clever algorithm, lol

What did you intend your program to do? I tried to work it out from the source code but it scared me off.

Well, now the contest's over I can publish an expanded version of my source code with comments. Basically it was a sort-of variation of PrudentBot designed to get cooperation from So8res's stupid pseudo-mimics. The original variation had two conditions: if ((they cooperate with us if we cooperate) and (they defect against defectbot)) then cooperate; else defect. The second condition allows you to exploit stupid bots like cooperatebot, but since doing that causes all justicebots, etc etc to defect, I ended up removing that. (Though, looks like that was a bad idea in the end, since there were three cooperatebots submitted!)

Well, it's not quite a true PrudentBot since PrudentBot tries to prove unconditional cooperation, rather than cooperation-if-I-cooperate.
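The two-condition variant described above can be sketched in Python, with opponents modeled as functions from a strategy to a move (a toy model: the real Scheme bots received source code, and naive mutual simulation can fail to terminate; all function names here are mine):

```python
def defect_bot(opponent):
    # Unconditional defector, used as a probe.
    return "D"

def cooperate_bot(opponent):
    # Unconditional cooperator, used as a probe.
    return "C"

def prudent_variant(opponent, exploit_cooperatebots=True):
    # Condition 1: do they cooperate against a cooperator? (A stand-in
    # for "they cooperate with us if we cooperate".)
    coop_if_we_coop = opponent(cooperate_bot) == "C"
    if not exploit_cooperatebots:
        # The submitted version dropped condition 2 to keep justicebots happy.
        return "C" if coop_if_we_coop else "D"
    # Condition 2: do they defect against DefectBot? If not, they are
    # exploitable, so defect against them.
    punishes_defectbot = opponent(defect_bot) == "D"
    return "C" if (coop_if_we_coop and punishes_defectbot) else "D"
```

Under this toy model, the two-condition version defects against a CooperateBot while still cooperating with a bot that mirrors its opponent's behavior; the submitted one-condition version cooperates with both.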

If only THE BLACK KNIGHT had won the tournament, I would've revealed my identity then. This way, the identity of THE BLACK KNIGHT shalt forevermore remain a mystery. Giddyup!

In more seriousness though, how could this happen, ending up worse than random? Was it just the lack of a clause filtering for the string 'random' (though that could be exploited, too)? Or, alternatively, not simulating the opponent, say, 10 times, then going with the majority behavior if there is one, and treating the opponent as a randombot if there isn't (6-5, 5-5, 4-6)? That won't filter out all randombots, of course. It also won't explain why another entry didn't beat out most of the rest of the pack, so as to become impervious to the few detrimental results against the minority of randombots. In conclusion, what happened Ani?
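The majority-behavior test described here can be sketched in Python (the counts are the comment's; the function name and threshold parameter are mine, and real bots would receive source code rather than a callable):

```python
from itertools import cycle

def classify(opponent, runs=10, threshold=8):
    # Simulate the opponent `runs` times. If one move clearly dominates,
    # treat the opponent as (roughly) deterministic; otherwise treat it
    # as a randombot, against which defection is the safe response.
    moves = [opponent() for _ in range(runs)]
    defects = moves.count("D")
    if defects >= threshold:
        return "defector"
    if runs - defects >= threshold:
        return "cooperator"
    return "random"
```

As the comment notes, a sufficiently lopsided randombot (say, 95% defection) would still slip through this filter as a "defector".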

Given Rice's theorem, there is no source code analysis that is guaranteed to tell you anything interesting about your opponent, unless your opponent has been written in such a way as to facilitate that analysis. This suggests two possible modifications of the competition.

Firstly, the rule could be imposed that source code is unanalysable. A competing program will not be furnished with its opponent's source, but only an opaque blob. The only operation they can perform on this blob is to supply it with another blob and see what answer it gives. Each program is given access to itself (as an opaque blob), and can contain within itself whatever other blobs it wishes to use for probing purposes. Note that if every program begins a contest by running its opponent on some example, then no-one terminates and everyone loses. Therefore a program must sometimes make a decision without having run its opponent on any examples. This looks like it severely limits the usefulness of simulating the opponent to see what it will do, perhaps to the point where it is impossible to gain anything from it, and the competition reduces to an ordinary PD tournament.
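The opaque-blob interface might look something like this Python sketch (class and function names are mine; the real competition would enforce opacity at the infrastructure level rather than by convention):

```python
class Blob:
    """Opaque wrapper: opponents can run it, but not inspect its source."""
    def __init__(self, strategy):
        self._strategy = strategy  # hidden by convention in this sketch

    def run(self, other_blob):
        # The only permitted operation: supply it another blob, see its move.
        return self._strategy(other_blob)

def defect_bot(opponent_blob):
    return "D"

def probe_with_defectbot(opponent_blob):
    # Probe the opponent by running it against a DefectBot blob.
    # NOTE: if every program begins by probing its opponent, no one
    # terminates; a program must sometimes answer without probing,
    # which is exactly the limitation the comment describes.
    return "D" if opponent_blob.run(Blob(defect_bot)) == "D" else "C"
```

Since probing is the only operation, and unconditional probing diverges, strategies in this regime collapse toward deciding blind, as the comment argues.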

Alternatively, source code would be revealed, but in addition each program would be allowed to furnish machine-checkable proofs of assertions about itself. The machine checking would be part of the infrastructure of the competition, so competitors would be able to have complete trust in these assertions.

Or one could combine these: one receives one's opponent as an unanalysable blob together with some set of guaranteed true assertions about it, supplied by the creator of that program and verified by the competition infrastructure. The idea is that skill at analysing someone's source code is secondary to the point of the exercise.

What verifiably true assertions would it be useful for a competitor to make about itself? It would be easy to write code of which it can be proved "I always cooperate", "I always defect", "I play Tit For Tat", or any of the other ordinary PD strategies. What verifiable assertions can a program make about the sort of opponent it will cooperate or defect with? "I will cooperate with anyone who cooperates with me" would not be an admissible assertion (by Rice's theorem). Nor would "I will cooperate with anyone who verifiably asserts that they cooperate with me". But "I will cooperate with anyone who verifiably asserts that they cooperate with anyone verifiably asserting this sentence" might be, if the circularity can be handled.

ETA: But "I will cooperate with anyone who verifiably asserts that they cooperate with anyone verifiably asserting a sentence logically equivalent to this one" won't work. So there would have to be a negotiation phase before the competition, in which people publish the assertions they want to make about themselves, so that people can coordinate to agree on the exact formulation of the mutually referential sentences they want to be seen to be true of them.
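The reason exact formulation matters can be illustrated with a Python sketch (names are mine): since logical equivalence is undecidable in general, the verified-assertion matching has to be by exact string identity, so everyone must agree on one canonical sentence beforehand.

```python
# The sentence agreed on during the hypothetical negotiation phase.
AGREED_SENTENCE = (
    "I will cooperate with anyone who verifiably asserts this sentence."
)

def move(my_assertion, opponent_assertion):
    # Assume the competition infrastructure has already verified that each
    # assertion is true of the program that published it. Matching is by
    # exact string equality: checking logical equivalence of arbitrary
    # sentences is not decidable, so a paraphrase gets no credit.
    if my_assertion == AGREED_SENTENCE and opponent_assertion == AGREED_SENTENCE:
        return "C"
    return "D"
```

Two programs that both publish the canonical sentence cooperate; a program publishing a logically equivalent but differently worded sentence is treated as a stranger.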

Some notes I've made on the entries:

(Note, this is largely from the perspective of "I want to write a bot that works by doing lots of static analysis to the other bots", since I was working on such a bot. In fact I can see now that the idea I had in mind didn't actually make sense, but, oh well, I have an idea for how to fix it for the next contest...)

  • Mutation: Three of the bots -- K, L, and R -- used mutation. N and P used a function "namespace-set-variable-value!"; not sure what that does so I'm not sure if it's really mutating things in context. Nonetheless, point is, I might have to deal with mutation next time around. (Or I can just decide "mutation is defection". That might be appropriate.)
  • Do loops: Nobody used do loops... but L used a for loop, which isn't even standard Scheme, IINM. My program totally would have choked on that. (I had just planned on counting do loops as a defection, meanwhile...)
  • Syntax redefinition: Thankfully, nobody did this. As for whether my (ultimately empty) declaration that I'd consider such a defection had anything to do with this, I don't know.
  • Fine-grained randomness: One program, K, called random with no argument; no program called random with an argument higher than 100. (If I can actually get this thing working for the next contest, btw, I entirely intend to count calling random without an argument as a defection, as well as with an argument more than about 1000 or so.) K didn't actually need randomness that fine, note; it was just generating a 1-in-100 chance.
  • Multithreading: All four of the "big programs" -- K, N, P, and R -- used this. Unsurprising, seeing as it's the only way to do an apply-with-timeout. I'm kind of surprised Racket doesn't natively provide a function for this, seeing as it provides time-apply already, but unfortunately it doesn't, and you can't even use time-apply to build one; you have to use multithreading. I noticed that P built such a function itself, called timeout-exec; I don't know about the others.
  • Other timing things: S slept for 9 seconds, for reasons I don't understand (see wedrifid's and BloodyShrimp's comments about why such a program is a bad idea). It then had some weird conditional involving time, but the conditional could only come out one way, so I'm not sure what's going on there. Just some obfuscation? That seems like a bad idea. (Not related to time, but G also had a weird conditional that could only come out one way. Not sure what's going on there either.)
  • Other threading things: T sleeps, yet doesn't seem to use timing or multithreading anywhere? Is this some subtle "this has an effect if I'm being run in another thread" thing?
  • Quasiquotation: All four of the "big programs" K, N, P, and R used quasiquotation. OK, this is probably not too worthy of note, except writing the quasiquote handler was one of the places I got stuck when writing my program (in retrospect, it was certainly not the biggest problem). I contemplated counting quasiquotes as a defection -- and would have done so had I finished the rest of the program except for that -- but that seemed and now even more so seems unreasonable, especially since my own program was going to use quasiquotes. (Why do quasiquotes exist? Well, OK, I get why they exist. I'd just like to point out that -- if I'm understanding correctly, which I might not be -- quasiquotes are never necessary, and, aside from quoting symbols, neither is quote, because there exists the list function, which is an ordinary function call and I could have handled like an ordinary function call. Oh well. I'll probably take the time to get my quasiquote handler working next time... in fact I think I see how to fix it already...)
  • Delay, force, call/cc: Thankfully no one tried to use any of these! Edit: Similarly with multiple-values stuff.

(I call K, N, P, and R the "big programs" since they were the ones long enough that I didn't take the time to make a detailed analysis of what they do.)
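The apply-with-timeout pattern that all four big programs built out of threads looks roughly like this in Python (an illustrative analogue, not the Racket code; the function name and default are mine):

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def apply_with_timeout(fn, args=(), timeout=1.0, default="D"):
    # Run fn(*args) in a worker thread and wait at most `timeout` seconds.
    # Unlike Racket's kill-thread, Python can't forcibly stop the worker,
    # so this only bounds how long the caller waits for an answer; a
    # non-terminating simulation is simply scored with the default move.
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn, *args)
    try:
        return future.result(timeout=timeout)
    except TimeoutError:
        return default
    finally:
        # Don't block waiting for a possibly-runaway worker.
        executor.shutdown(wait=False)
```

This is why simulating an opponent safely forces multithreading into every serious entry: there is no way to bound a black-box computation's runtime from within a single thread.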