Wow, thanks a lot guys!
I'm probably not the only one who feels this way, so I'll just make a quick PSA: For me, at least, getting comments that engage with what I write and offer a different, interesting perspective can almost be more rewarding than money. So I definitely encourage people to leave comments on entries they read--both as a way to reinforce people for writing entries, and also for the obvious reason of making intellectual progress :)
I definitely wish I had commented more on them in general. I ran into a thing where a) the length and b) the seriousness of the entries made me feel like I had to dedicate a solid chunk of time to sit down and read, and then come up with commentary worth making (as opposed to just perusing them on my lunch break).
I'm not sure if there's a way around that (posting things in smaller chunks in venues where it's easy for people to comment might help, but my guess is that isn't the whole thing).
This is freaking awesome - thank you so much for doing both this one and the new one.
Added: I think this is a really valuable contribution to the intellectual community - successfully incentivising research, and putting in the work on your end (assessing all the contributions and giving the money) to make sure solid ideas are rewarded - so I've curated this post.
Added2: And of course, congratulations to all the winners, I will try to read all of your submissions :-)
The funny thing is that if you look at some old papers, they read a lot more like blog posts than modern papers. One of my favorite examples is the paper where Alan Turing introduced what's now known as the Turing test, whose opening paragraph feels pretty playful:
I propose to consider the question, "Can machines think?" This should begin with definitions of the meaning of the terms "machine" and "think." The definitions might be framed so as to reflect so far as possible the normal use of the words, but this attitude is dangerous. If the meaning of the words "machine" and "think" are to be found by examining how they are commonly used it is difficult to escape the conclusion that the meaning and the answer to the question, "Can machines think?" is to be sought in a statistical survey such as a Gallup poll. But this is absurd.
Datum: The existence of this prize has spurred me to put some actual effort into AI alignment, for reasons I don't fully understand--I'm confident it's not about the money, and even the offer of feedback isn't that strong an incentive, since I think anything worthwhile I posted on LW would get feedback anyway.
My guess is that it sends the message that the Serious Real Researchers actually want input from random amateur LW readers like me.
Also, the first announcement of the prize rules was in one ear and out the other for me. Reading this announcement of the winners is what made it click for me that this is something I should actually do. Possibly because I had previously argued on LW with one of the winners in a way that made my brain file them as my equal (admittedly, the topic of that was kinda bike-sheddy, but system 1 gonna system 1).
Yes.
I was waiting until the last minute to see if I would have a clear winner on what to submit. Unfortunately, I do not, since there are four posts on the Pareto frontier of karma and how much I think they have an important insight. In decreasing order of karma and increasing order of my opinion:
Sources of Intuitions and Data on AGI
Don't Condition on no Catastrophes
Can I have you/other judges decide which post/subset of posts you think is best/want to put more signal towards, and consider that my entry?
Also, why is my opinion anti-correlated with Karma?
Maybe it is a selection effect where I post stuff that is either good content or a good explanation.
Or maybe important insights have a larger inferential gap.
Or maybe I like new insights and the old insights are better because they survived across time, but they are old to me so I don't find them as exciting.
Or maybe it is noise.
I just noticed that the first two posts were curated, and the second two were not, so maybe the only anti-correlation is between me and the Sunshine Regiment, but IIRC, most of the karma was pre-curation, and I posted Robustness to Scale and No Catastrophes at about the same time and was surprised to see a gap in the karma. (I would have predicted the other direction.)
I posted Robustness to Scale and No Catastrophes at about the same time and was surprised to see a gap in the karma
FWIW, I was someone who upvoted Robustness to Scale (and Sources of Intuitions, and Knowledge is Freedom), but did not upvote No Catastrophes.
I think the main reason was that I was skeptical of the advice given in No Catastrophes. People often talk about timelines in vague ways, and I agree that it's often useful to get more specific. But I didn't feel compelled by the case made in No Catastrophes for its preferred version of the question. Neither that one should always substitute a more precise question for the original, nor that if one wants to ask a more precise question, then this is the question to ask.
(Admittedly I didn't think about it very long, and I wouldn't be too surprised if further reflection caused me to change my mind, but at the time I just didn't feel compelled to endorse with an upvote.)
Robustness (along with the other posts) does not give advice, but rather stakes out conceptual ground. That's easier to endorse.
Awesome! I hadn't seen Caspar's idea, and I think it's a neat point on its own that could also lead in some new directions.
Edit: Also, I'm curious if I had any role in Alex's idea about learning the goals of a game-playing agent. I think I was talking about inferring the rules of checkers as a toy value-learning problem about a year and a half ago. It's just interesting to me to imagine what circuitous route the information could have taken, in the case that it's not independent invention.
Thanks, that's very flattering! The thing I'm working on now (looking into prior work on reference, because it seems relevant to what Abram Demski calls model-utility learning) will probably qualify, so I will err on the side of rushing a little (prize working as intended).
Good and Safe use of AI Oracles: https://arxiv.org/abs/1711.05541
An Oracle is a design for potentially high power artificial intelligences (AIs), where the AI is made safe by restricting it to only answer questions. Unfortunately most designs cause the Oracle to be motivated to manipulate humans with the contents of their answers, and Oracles of potentially high intelligence might be very successful at this. Solving that problem, without compromising the accuracy of the answer, is tricky. This paper reduces the issue to a cryptographic-style problem of Alice ensuring that her Oracle answers her questions while not providing key information to an eavesdropping Eve. Two Oracle designs solve this problem, one counterfactual (the Oracle answers as if it expected its answer to never be read) and one on-policy, but limited by the quantity of information it can transmit.
Very interesting work!
You might like my new post that explains why I think only Oracles can resolve the problems of Causal Goodhart-like issues: https://www.lesserwrong.com/posts/iK2F9QDZvwWinsBYB/non-adversarial-goodhart-and-ai-risks
I'm unsure whether the problems addressed in your paper are sufficient for resolving the causal Goodhart concerns, since I need to think much more about the way the reward function is defined, but it seems they might not be. This question is really important for the follow-on work on adversarial Goodhart, and I'm still trying to figure out how to characterize the metrics / reward functions that are and are not susceptible to corruption in this way. Perhaps a cryptographic approach could solve parts of the problem.
I have a paper whose preprint was uploaded in December 2017 but which is expected to be officially published at the beginning of 2018. Is it possible to submit the text to this round of the competition? The text in question is:
"Military AI as a Convergent Goal of Self-Improving AI"
Alexey Turchin & David Denkenberger
In Artificial Intelligence Safety and Security.
Louisville: CRC Press (2018)
https://philpapers.org/rec/TURMAA-6
I worked with Scott to formalize some of his earlier blog post here: https://arxiv.org/abs/1803.04585 - and wrote a bit more about AI-specific concerns relating to the first three forms in this new LessWrong post: https://www.lesserwrong.com/posts/iK2F9QDZvwWinsBYB/non-adversarial-goodhart-and-ai-risks
The blog post discussion was not included in the paper both because agreement on these points proved difficult, and because I wanted the paper to be relevant more widely than only for AI risk. The paper was intended to expand thinking about Goodhart-like phenomena to address what I initially saw as a confusion about causal and adversarial Goodhart, and to allow a further paper on adversarial cases I've been contemplating for a couple years, and am actively working on again. I was hoping to get the second paper, on Adversarial Goodhart and sufficient metrics, done in time for the prize, but since I did not, I'll nominate the arxiv paper and the blog post, and I will try to get the sequel blog post and maybe even the paper done in time for round three, if there is one.
Submitting my blog post on AI Alignment testing https://medium.com/@thelastalias/ai-alignment-testing-bf8f4b6bb261?source=linkShare-182b8243d384-1522555285
How to resolve human values, completely and adequately: https://www.lesswrong.com/posts/Y2LhX3925RodndwpC/resolving-human-values-completely-and-adequately
(this connects with https://www.lesswrong.com/posts/kmLP3bTnBhc22DnqY/beyond-algorithmic-equivalence-self-modelling and https://www.lesswrong.com/posts/pQz97SLCRMwHs6BzF/using-lying-to-detect-human-values).
Impossibility of deducing preferences and rationality from human policy: https://arxiv.org/abs/1712.05812
Inverse reinforcement learning (IRL) attempts to infer human rewards or preferences from observed behavior. Since human planning systematically deviates from rationality, several approaches have been tried to account for specific human shortcomings. However, there has been little analysis of the general problem of inferring the reward of a human of unknown rationality. The observed behavior can, in principle, be decomposed into two components: a reward function and a planning algorithm, both of which have to be inferred from behavior. This paper presents a No Free Lunch theorem, showing that, without making `normative' assumptions beyond the data, nothing about the human reward function can be deduced from human behavior. Unlike most No Free Lunch theorems, this cannot be alleviated by regularising with simplicity assumptions. We show that the simplest hypotheses which explain the data are generally degenerate.
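A toy illustration of the decomposition issue described in the abstract, not the paper's own construction. It assumes a single-step setting where a "planner" maps a reward function to a policy; all names here (rational_planner, anti_rational_planner, fixed_planner, the states, actions and reward values) are hypothetical and chosen only for the example. Three very different (planner, reward) pairs produce exactly the same observed behavior, so behavior alone cannot tell them apart.

```python
from typing import Callable, Dict, List, Tuple

State = str
Action = str
Reward = Dict[Tuple[State, Action], float]
Policy = Dict[State, Action]
Planner = Callable[[Reward, List[State], List[Action]], Policy]

# Hypothetical toy world, used only for illustration.
STATES: List[State] = ["home", "work"]
ACTIONS: List[Action] = ["rest", "code"]


def rational_planner(reward: Reward, states: List[State], actions: List[Action]) -> Policy:
    """Pick the reward-maximizing action in every state."""
    return {s: max(actions, key=lambda a: reward[(s, a)]) for s in states}


def anti_rational_planner(reward: Reward, states: List[State], actions: List[Action]) -> Policy:
    """Pick the reward-minimizing action in every state."""
    return {s: min(actions, key=lambda a: reward[(s, a)]) for s in states}


def fixed_planner(policy: Policy) -> Planner:
    """Build a planner that ignores the reward and always outputs `policy`."""
    def planner(reward: Reward, states: List[State], actions: List[Action]) -> Policy:
        return dict(policy)
    return planner


def negate(reward: Reward) -> Reward:
    """Flip the sign of every reward value."""
    return {key: -value for key, value in reward.items()}


# A made-up "true" reward, and the behavior a rational agent would show under it.
true_reward: Reward = {
    ("home", "rest"): 1.0, ("home", "code"): 0.2,
    ("work", "rest"): 0.1, ("work", "code"): 0.9,
}
observed: Policy = rational_planner(true_reward, STATES, ACTIONS)

zero_reward: Reward = {key: 0.0 for key in true_reward}

# Three incompatible explanations of the same behavior.
candidates = [
    ("rational, reward R", rational_planner, true_reward),
    ("anti-rational, reward -R", anti_rational_planner, negate(true_reward)),
    ("reward-ignoring, reward 0", fixed_planner(observed), zero_reward),
]

for label, planner, reward in candidates:
    matches = planner(reward, STATES, ACTIONS) == observed
    print(f"{label}: reproduces observed policy = {matches}")
```

The second and third pairs are the kind of degenerate decomposition the abstract refers to: they fit the data as well as the first one while attributing the opposite reward, or no reward at all.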
Hello!
I have significantly elaborated and extended my article on self-deception in the last couple of months (before that it was about two pages long).
"Self-deception: Fundamental limits to computation due to fundamental limits to attention-like processes"
https://medium.com/threelaws/definition-of-self-deception-in-the-context-of-robot-safety-721061449f7
I included some examples for the taxonomy, positioned this topic in relation to other similar topics, and compared the applicability of this article to that of other known AI problems.
Additionally, I described or referenced a few ideas for potential partial solutions to the problem (some of the descriptions of solutions are new, some of them I have published before).
One of the motivations for the post is that when we are building an AI that is dangerous in a certain manner, we should at least realise that we are doing that.
I will probably continue updating the post. The history of the post and its state as of March 31 can be seen from the linked Google Doc’s history view (that link is at the top of the article).
When it comes to feedback on posts, I have noticed that people are more likely to get feedback when they ask for it.
I am always very interested in feedback, regardless of whether it is given on my past, current or future posts. So if possible, please send any feedback you have. It would be of great help!
I will post the same message to your e-mail too.
Thank you and regards:
Roland
I would like to submit the following entries:
A typology of Newcomblike problems (philosophy paper, co-authored with Caspar Oesterheld).
A wager against Solomonoff induction (blog post).
Three wagers for multiverse-wide superrationality (blog post).
UDT is “updateless” about its utility function (blog post). (I think this post is hard to understand. Nevertheless, if anyone finds it intelligible, I would be interested in their thoughts.)
For this round I submit the following entries on decision theory:
Robust Program Equilibrium (paper)
The law of effect, randomization and Newcomb’s problem (blog post) (I think James Bell's comment on this post makes an important point.)
A proof that every ex-ante-optimal policy is an EDT+SSA policy in memoryless POMDPs (IAFF comment) (though see my own comment to that comment for a caveat to that result)
Here is my submission, Anatomy of Prediction and Predictive AI
Hopefully I'm early enough this time to get some pre-deadline feedback :)
I wouldn't mind feedback as well if possible, mainly because I only dabble in AGI theory and not AI. So I'm curious to see the difference in thoughts, opinions, and fields, or however you wish to put it. Thanks in advance, and thanks to the contest hosts/judges. I learned a lot more about the (human) critique process than I knew before.
We (Zvi Mowshowitz, Vladimir Slepnev and Paul Christiano) are happy to announce that the AI Alignment Prize is a success. From November 3 to December 31 we received over 40 entries representing an incredible amount of work and insight. That's much more than we dared to hope for, in both quantity and quality.
In this post we name six winners who will receive $15,000 in total, an increase from the originally planned $5,000.
We're also kicking off the next round of the prize, which will run from today until March 31, under the same rules as before.
The winners
First prize of $5,000 goes to Scott Garrabrant (MIRI) for his post Goodhart Taxonomy, an excellent write-up detailing the possible failures that can arise when optimizing for a proxy instead of the actual goal. Goodhart’s Law is simple to understand, impossible to forget once learned, and applies equally to AI alignment and everyday life. While Goodhart’s Law is widely known, breaking it down in this new way seems very valuable.
Five more participants receive $2,000 each:
We'll be contacting each winner by email to arrange transfer of money.
We would also like to thank everyone who participated. Even if you didn't get one of the prizes today, please don't let that discourage you!
The next round
We are now announcing the next round of the AI alignment prize.
As before, we're looking for technical, philosophical and strategic ideas for AI alignment, posted publicly between now and March 31, 2018. You can submit your entries in the comments here or by email to apply@ai-alignment.com. We may give feedback on early entries to allow improvement, though our ability to do this may become limited by the volume of entries.
The minimum prize pool this time will be $10,000, with a minimum first prize of $5,000. If the entries once again surpass our expectations, we will again increase that pool.
Thank you!
(Addendum: I've written a post summarizing the typical feedback we've sent to participants in the previous round.)