Steelmanning MIRI critics

I'm giving a talk to the Boulder Future Salon in Boulder, Colorado in a few weeks on the Intelligence Explosion hypothesis. I've given it once before in Korea but I think the crowd I'm addressing will be more savvy than the last one (many of them have met Eliezer personally). It could end up being important, so I was wondering if anyone considers themselves especially capable of playing Devil's Advocate so I could shape up a bit before my talk? I'd like there to be no real surprises. 

I'd be up for just messaging back and forth or skyping, whatever is convenient.

Comments


Without knowing the content of your talk (or having time to Skype at present, apologies), allow me to offer a few quick points I would expect a reasonably well-informed, skeptical audience member to make (based partly on what I've encountered):

1) Intelligence explosion requires AI to get to a certain point of development before it can really take off (let's set aside that there's still a lot we need to figure out about where that point is, or whether there are multiple different versions of that point). People have been predicting that we can reach that stage of AI development "soon" since the Dartmouth conference. Why should we worry about this being on the horizon (rather than a thousand years away) now?

2) There's such a range of views on this topic among apparent experts in AI and computer science that an analyst might conclude there is no credible expertise on the path/timeline to superintelligent AI. Why should we take MIRI/FHI's arguments seriously?

3) Why are mathematician/logician/philosophers/interdisciplinary researchers the community we should be taking most seriously when it comes to these concerns? Shouldn't we be talking to/hearing from the cutting edge AI "builders"?

4) (Related.) MIRI (and also FHI, though not to such a 'primary' extent) focuses on developing theoretical safety designs, and friendly-AI/safety-relevant theorem proving and maths work, ahead of any efforts to actually "build" AI. Would we not be better off being more grounded in the practical development of the technology - building, stopping, testing, trying, adapting as we see what works and what doesn't - rather than trying to lay down such far-reaching principles ahead of the technology's development?

All good points.

I'd focus on #4 as the primary point. Focusing on theoretical safety measures far ahead of the development of the technology to be made safe is very difficult and has no real precedent in previous engineering efforts. In addition, MIRI's specific program isn't heading in a clear direction and hasn't gotten a lot of traction in the mainstream AI research community yet.

Edit: Also, hacks and heuristics are so vital to human cognition in every domain, that it seems clear that general computation models like AIXI don't show the roadmap to AI, despite their theoretical niceness.

Just to be clear: Are you looking for criticisms of the intelligence explosion hypothesis or of MIRI? The hypothesis is just one of many claims made by MIRI researchers, and not the most controversial one. Are you just going to be arguing for the plausibility of a (reasonably imminent) intelligence explosion, or are you planning on defending the whole suite of MIRIesque beliefs?

Only the IE as defended by MIRI; it'd be a much longer talk if I wanted to defend everything they've put forward!

Short-duration hard takeoff, à la That Alien Message? That's one of the hardest claims for MIRI to justify.

Some of the stuff I've posted - http://lesswrong.com/lw/ksa/the_metaphormyth_of_general_intelligence/ , http://lesswrong.com/lw/hvo/against_easy_superintelligence_the_unforeseen/ - could be used to build a good anti-MIRI steelman, but I've not seen them used.

The most convincing anti-MIRI argument? AI may not develop in the way you're imagining. The most convincing rebuttal? We only need a decent probability of that happening to justify worrying about it.

It may be more useful to ask actual critics what they think (rather than asking proponents what they think critics are trying to say). Robin Hanson criticizes foom here. I don't actually know what he thinks of MIRI.

Well, firstly it's good that the crowd is savvy, but it might still be wise to prepare for strawman/fleshman attacks as well as steelmanned ones.

These are some more plausible criticisms:

(1) Moore's law seems to be slowing - this could be a speedbump before the next paradigm takes over, or it could be the start of stagnation, in which case the singularity is postponed. Of course, if humanity survives, the singularity will happen eventually anyway, but if it is hundreds of years in the future it would probably be wiser to focus on promoting rationality/genetic engineering/other methods of improving biological intelligence, as well as cryonics, in the short term, and leave work on FAI to future generations.

(2) It could be argued that FAI, and perhaps de novo AGI as well, is simply so hard that we will never get it done in time. Eventually neuromorphic AI/WBE/brute-force evolutionary simulations will be developed (assuming that exponential progress in these fields holds), and we would be better off preparing for this case, perhaps by developing empathic neuromorphic AI, or by developing a framework for uploads and humans to live without a Malthusian race to the bottom.

(3) The budget/number of people involved in MIRI is tiny compared to Google and other entities which could plausibly design AI. Therefore many would argue that MIRI cannot develop AI first, and instead should focus on outreach towards other, larger groups.

(4) Gwern seems to be going further, and arguing that we should advocate that nations should suppress technology.

In all of these cases, some sort of rationality outreach would seem to be the alternative, so you could still spin that as a positive.

(1) Moore's law seems to be slowing - this could be a speedbump before the next paradigm takes over, or it could be the start of stagnation, in which case the singularity is postponed.

The pithy one-liner comeback to this is that the human brain is an existence proof for a computer of its own size and performance, and it seems implausible that nature arrived at the optimal basic design for neurons on (basically) its first try.

An existence proof is very different from a constructive proof! Nature did not happen upon this design on the first try; the brain has evolved over billions of generations. Of course, intelligence can work faster than the blind idiot god, and humanity, if it survives long enough, will do better. The question is, will this take decades or centuries?

An existence proof is very different from a constructive proof!

Quite so. However, it does give reason to hope.

The question is, will this take decades or centuries?

If you look at Moore's Law coming to a close in silicon around 2020, and we're still so far away from a human-brain-equivalent computer, it's easy to get disheartened. I think it's important to remember that it's at least possible, and if nature could happen upon it...

Well, I'm no Dymitry or XiXiDu, and not "especially capable" but why not give it a try.

Intelligence explosion pattern-matches pretty well to the religious ideas of Heaven/Hell, and cryonics to Limbo. Also note the dire warnings of the impending UFAI doom unless the FAI Savior is ushered by the righteous (called "rational") before it's too late. So one could dismiss the whole thing as a bad version of Christianity or some other religion.

This one deserves a lot of attention, not because it's inherently brilliant, but because I expect it to match a large portion of what the audience will think.

So one could dismiss the whole thing as a bad version of Christianity or some other religion.

Only by ignoring the fundamental question: Is it true?

The point is that all object-level arguments for and against these scenarios, even if you call them "probability estimates", are ultimately based on intuitions which are difficult to formalize or quantify.

The scenarios hypothesized by the Singularitarians are extreme, both in the magnitude of the effect they are claimed to entail and in the highly conjunctive object-level arguments that are used to argue for them. Common sense rationality tells us that "extraordinary claims demand exceptional evidence". How do we evaluate whether the intuitions of these people constitute "exceptional evidence"?

So we take the "outside view" and try to meta-reason on these arguments and the people making them:
Can we trust their informal intuitions, or do they show any signs of classical biases?

Are these people privileging the hypothesis? Are they drawing their intuitions from the availability heuristic?

If intelligence explosion/cryonics/all things singularitarian were ideas radically different from any common meme, then the answer to these questions would likely be no: these ideas would appear counterintuitive at a gut level to most normally rational people, possibly in the same way quantum mechanics and Einsteinian relativity appear counterintuitive.
If domain-level experts, after studying the field for years, recalibrated their intuitions and claimed that these scenarios were likely, then we should probably listen to them.
We should not just accept their claims based on authority, of course: even experts can be subject to groupthink and other biases (cough...economists...cough), but as far as the "outside view" is concerned, we would at least have plausibly excluded the availability bias.

What we observe, instead, is that singularitarian ideas strongly pattern-match to Christian millenarianism and similar religious beliefs, mixed with popular sci-fi tropes (cryonics, AI revolt, etc.). They certainly originated in, or at least were strongly influenced by, these memes, and therefore the intuitions of the people arguing for them are likely "contaminated" via the availability heuristic by these memes.
More specifically, if singularitarian ideas make intuitive sense to you, you can't even trust your own intuitions, since they are likely to be "contaminated" as well.

Add the fact that the strength of these intuitions seems to decrease rather than increase with domain expertise, suggesting that the Dunning–Kruger effect is also at work, and the "outside view" tells us to be wary.

Of course, it is possible to believe correct things even when they are likely to be the subject of biases, or even to believe correct things that many people believe for the wrong reason, but in order to make a case for these beliefs, you need some airtight arguments with strong evidence.
As far as I can tell, MIRI/FHI/other Singularitarians have provided no such arguments.

They certainly originated in, or at least were strongly influenced by, these memes

Originated? Citation needed, seriously.

What we observe, instead, is that singularitarian ideas strongly pattern-match to Christian millenarianism and similar religious beliefs, mixed with popular sci-fi tropes (cryonics, AI revolt, etc.).

Not very strong pattern match. In Christian millenarianism, you have the good being separated from the bad. And this is considered good, even with all of the horror. Also, the humans don't cause the good and bad things. It's God. Also, it's prophesied and certain to happen in a particular way.

In a typical FOOM scenario, everyone shares their fate regardless of any personal beliefs. And if it's bad for people, it's considered bad - no excuses for any horror. And humans create whatever it is that makes the rest happen, so that 'no excuses' is really salient. There are many ways it could work out; there is no roadmap. This produces a pretty much diametrically opposite attitude - 'be really careful and don't trust that things are going to work out okay'.

So the pattern-match fails on closer inspection. "We are heading towards something dangerous but possibly awesome if we do it just right" just isn't like "God is going to destroy the unbelievers and elevate the righteous, you just need to believe!" in any relevant way.

Indeed. But it is hard to argue for the truth of the models whose predictions haven't come to pass (yet).

I'd like there to be no real surprises.

It seems like surprises would be more valuable than just reciting info.

Speaking as someone who speaks about X-risk reasonably regularly: I have empathy for the OP's desire for no surprises. IMO there are many circumstances in which surprises are very valuable - one on one discussions, closed seminars and workshops where a productive, rational exchange of ideas can occur, boards like LW where people are encouraged to interact in a rational and constructive way.

Public talks are not necessarily the best places for surprises, however. Unless you're an extremely skilled orator, the combination of nerves, time limitations, crowd dynamics, and other circumstances can make it quite difficult to engage in an ideal manner. Crowd perception of how you "handle" a point, particularly a criticism, can do a huge amount to shape how the overall merit of you, your talk, and your topic is perceived - even if the criticism is invalid or your response adequate. My experience is also that the factors above can push us into less nuanced, more "strong"-seeming positions than we would ideally take. In a worst-case scenario, a poor presentation/defence of an important idea can impact perception of the idea itself outside the context of the talk (if the talk is widely enough disseminated).

These are all reasons why I think it's an excellent idea to consider the best and strongest possible objections to your argument, and to think through what an ideal and rational response would be - or, indeed, whether the objection is correct, in which case it should be addressed in the talk. This may be the OP's only opportunity to expose his audience to these ideas.

In a worst-case scenario, a poor presentation/defence of an important idea can impact perception of the idea itself outside the context of the talk

Right. Exposure to a weak meme inoculates people against being affected by similar memes in the future. There was a recent SSC post about it, I think. Bad presentation is worse than no presentation at all.

MIRI intends to make an AI that is provably friendly. This would require having a formal definition of friendliness that means exactly what it's supposed to mean, and then proving it. Either of those steps seems highly unlikely to be completed without error.

MIRI intends to make an AI that is provably friendly.

I really wish people would stop repeating this claim. Mathematical Proofs Improve But Don’t Guarantee Security, Safety, and Friendliness.

And yet, all the publicly known MIRI research seems to be devoted to formal proof systems, not to testing, "boxing", fail-safe mechanisms, defense in depth, probabilistic failure analysis, and so on.

Motte and bailey?

This paragraph is a simplification rather than the whole story, but: Our research tends to be focused on mathematical logic and proof systems these days because those are expressive frameworks with which to build toy models that can give researchers some general insight into the shape of the novel problems of AGI control. Methods like testing and probabilistic failure analysis require more knowledge of the target system than we now have for AGI.

And we do try to be clear about the role that proof plays in our research. E.g. see the tiling agents LW post:

The paper uses first-order logic (FOL) because FOL has a lot of useful standard machinery for reflection which we can then invoke; in real life, FOL is of course a poor representational fit to most real-world environments outside a human-constructed computer chip with thermodynamically expensive crisp variable states.

As further background, the idea that something-like-proof might be relevant to Friendly AI is not about achieving some chimera of absolute safety-feeling, but rather about the idea that the total probability of catastrophic failure should not have a significant conditionally independent component on each self-modification, and that self-modification will (at least in initial stages) take place within the highly deterministic environment of a computer chip. This means that statistical testing methods (e.g. an evolutionary algorithm's evaluation of average fitness on a set of test problems) are not suitable for self-modifications which can potentially induce catastrophic failure (e.g. of parts of code that can affect the representation or interpretation of the goals).

And later, in an Eliezer comment:

Reply to: "My previous understanding had been that MIRI staff think that by default, one should expect to need to solve the Löb problem in order to build a Friendly AI."

By default, if you can build a Friendly AI you were not troubled by the Löb problem. That working on the Löb Problem gets you closer to being able to build FAI is neither obvious nor certain (perhaps it is shallow to work on directly, and those who can build AI resolve it as a side effect of doing something else) but everything has to start somewhere. Being able to state crisp difficulties to work on is itself rare and valuable, and the more you engage with a problem like stable self-modification, the more you end up knowing about it. Engagement in a form where you can figure out whether or not your proof goes through is more valuable than engagement in the form of pure verbal arguments and intuition, although the latter is significantly more valuable than not thinking about something at all.

My guess is that people hear the words "proof" and "Friendliness" in the same sentence but (quite understandably!) don't take time to read the actual papers, and end up with the impression that MIRI is working on "provably Friendly AI" even though, as far as I can tell, we've never claimed that.

This paragraph is a simplification rather than the whole story, but: Our research tends to be focused on mathematical logic and proof systems these days because those are expressive frameworks with which to build toy models that can give researchers some general insight into the shape of the novel problems of AGI control. Methods like testing and probabilistic failure analysis require more knowledge of the target system than we now have for AGI.

When somebody says they are doing A for reason X, then reason X is criticized and they claim they are actually doing A for reason Y, and they have always been, I tend to be wary.

In this case A is "research on mathematical logic and formal proof systems",
X is "self-improving AI is unboxable and untestable, we need to get it provably right on the first try"
and Y is "Our research tends to be focused on mathematical logic and proof systems these days because those are expressive frameworks with which to build toy models that can give researchers some general insight into the shape of the novel problems of AGI control".

If Y is better than X, as it seems to me in this case, this is indeed an improvement, but when you modify your reasons and somehow conclude that your previously chosen course of action is still optimal, then I doubt your judgment.

as far as I can tell, we've never claimed that.

Well... (trigger wa-...)

"And if Novamente should ever cross the finish line, we all die. That is what I believe or I would be working for Ben this instant."
"I intend to plunge into the decision theory of self-modifying decision systems and never look back. (And finish the decision theory and implement it and run the AI, at which point, if all goes well, we Win.)"
"Take metaethics, a solved problem: what are the odds that someone who still thought metaethics was a Deep Mystery could write an AI algorithm that could come up with a correct metaethics? I tried that, you know, and in retrospect it didn’t work."
"Find whatever you’re best at; if that thing that you’re best at is inventing new math[s] of artificial intelligence, then come work for the Singularity Institute. [ ... ] Aside from that, though, I think that saving the human species eventually comes down to, metaphorically speaking, nine people and a brain in a box in a basement, and everything else feeds into that."

X is "self-improving AI is unboxable and untestable, we need to get it provably right on the first try"

But where did somebody from MIRI say "we need to get it provably right on the first try"? Also, what would that even mean? You can't write a formal specification that includes the entire universe and then formally verify an AI against that formal specification. I couldn't find any Yudkowsky quotes about "getting it provably right on the first try" at the link you provided.

Why talk about unupdateable UFs and "solving morality" if you are not going for that approach?

Again, a simplification, but: we want a sufficient guarantee of stably friendly behavior before we risk pushing things past a point of no return. A sufficient guarantee plausibly requires having robust solutions for indirect normativity, stable self-modification, reflectively consistent decision theory, etc. But that doesn't mean we expect to ever have a definite "proof" that a system will be stably friendly.

Formal methods work for today's safety-critical software systems never results in a definite proof that a system will be safe, either, but ceteris paribus formal proofs of particular internal properties of the system give you more assurance that the system will behave as intended than you would otherwise have.

Otherwise compared to nothing, or otherwise compared to informal methods?

Are you taking into account that the formal/provable/unupdateable approach has a drawback in the AI domain that it doesn't have in non-AI domains, namely that you lose the ability to tell an AI "stop doing that, it isn't nice"?

You admit that friendliness is not guaranteed. That means that you're not wrong, which is a good sign, but it doesn't fix the problem that friendliness isn't guaranteed. You have as many tries as you want for intelligence, but only one for friendliness. How do you expect to manage it in the first try?

It also doesn't seem to be clear to me that this is the best strategy. In order to get that provably friendly thing to work, you have to deal with an explicit, unchanging utility function, which means that friendliness has to be right from the beginning. If you deal with an implicit utility function that will change as the AI comes to understand itself better, you could program an AI to recognise pictures of smiles, then let it learn that the smiles correspond to happy humans and update its utility function accordingly, until it (hopefully) decides on "do what we mean".

It seems to me that part of the friendliness proof would require proving that the AI will follow its explicit utility function. This would be impossible. The AI is not capable of perfect Solomonoff induction, and will always have some bias, no matter how small. This means that its implicit utility function will never quite match its explicit utility function. Am I missing something here?

You admit that friendliness is guaranteed.

Typo?

In order to get that provably friendly thing to work

Again, I think "provably friendly thing" mischaracterizes what MIRI thinks will be possible.

I'm not sure exactly what you're saying in the rest of your comment. Have you read the section on indirect normativity in Superintelligence? I'd start there.

Given the apparent misconceptions about MIRI's work even among LWers, it seems like you need to write a Main post clarifying what MIRI does and does not claim, and does and does not work on.

I can't think of rational arguments, even steelmanned ones, beyond those Holden already gave. Maybe I'm too close to the whole thing, but I think that when viewed rationally, MIRI is on pretty solid ground.

If I wanted to make people wary of supporting MIRI, I'd simply go ad hominem. Start with selected statements from supporters about how much MIRI is about Eliezer, and from Eliezer about how he can't get along with AI researchers, how he can't do straight work for more than two hours per day and how "this is a cult". Quote a few of the psychotic-sounding parts from Eliezer's old autobiography piece. Paint him as a very skilled writer/persuader whose one great achievement was to get Peter Thiel to throw him a golden bone. Describe the whole Friendliness issue as an elaborate excuse from someone who claimed the ability to code an AGI fifteen years ago, and hasn't.

Of course that's a lowly and unworthy style of argument, but it'd get attention from everyone there, and I wonder how you'd defend against it.

I think I'm basically prepared for that line of attack. MIRI is not a cult, period. When you want to run a successful cult you do it Jim-Jones-style, carting everyone to a secret compound and carefully filtering the information that makes it in or out. You don't work as hard as you can to publish your ideas in a format where they can be read by anyone, you don't offer to publicly debate William Lane Craig, and you don't seek out the strongest versions of criticisms of your position (i.e. those coming from Robin Hanson).

Eliezer hasn't made it any easier on himself by being obnoxious about how smart he is, but then again neither did I; most smart people eventually have to learn that there are costs associated with being too proud of some ability or other. But whatever his flaws, the man is not at the center of a cult.

Sure MIRI isn't a cult, but I didn't say it was. I pointed out that Eliezer does play a huge role in it and he's unusually vulnerable to ad hominem attack. If anyone does that, your going with "whatever his flaws" isn't going to sound great to your audience.

I think I'd point out that he's a fairly public person, which both should increase trust and gives more material for ad hominem attacks. And once someone else has dragged the discussion down to a personal level, you might as well throw in appeals to authority with Elon Musk on AI risk, i.e. change the subject.

This comment is a poorly-organized brain dump which serves as a convenient gathering place for what I've learned after several days of arguing with every MIRI critic I could find. It will probably get its own expanded post in the future, and if I have the time I may try to build a near-comprehensive list.

I've come to understand that criticisms of MIRI's version of the intelligence explosion hypothesis and the penumbra of ideas around it fall into two permeable categories:

Those that criticize MIRI as an organization or the whole FAI enterprise (people making these arguments may or may not be concerned about the actual IE) and those that attack object-level claims made by MIRI.

Broad Criticisms

1a) Why worry about this now, instead of in the distant future, given the abysmal performance of attempts to predict AI?

1b) Why take MIRI seriously when there are so many expert opinions that diverge?

1c) Aren't MIRI and LW just an Eliezer-worshipping cult?

1d) Is it even possible to do this kind of theoretical work so far in advance of actual testing and experimentation?

1e) The whole argument can be dismissed as it pattern matches other doomsday scenarios, almost all of which have been bullshit.


Specific Criticisms

2a) General intelligence is what we're worried about here, and it may prove much harder to build than we're anticipating.

2b) Tool AIs won't be as dangerous as agent AIs.

2c) Why not just build an Oracle?

2d) the FOOM will be distributed and slow, not fast and localized.

2e) Dumb Superintelligence, i.e. nothing worthy of the name could possibly misinterpret a goal like 'make humans happy'

2f) Even FAI isn't a guarantee

2g) A self-improvement cascade will likely hit a wall at sub-superintelligent levels.

2h) Divergence Issue: all functioning AI systems have built-in sanity checks which take short-form goal statements and unpack them in ways that take account of constraints and context (???). It is actually impossible to build an AI which does not do this (???), and thus there can be no runaway SAI which is given a simple short-form goal and then carries it to ridiculous logical extremes (I WOULD BE PARTICULARLY INTERESTED IN SOMEONE ADDRESSING THIS).

A wrinkle in the foom argument, re: source code readability

There is a sense in which a programme can easily read its own source code. The whole point of a compiler is to scan and process source code. A C compiler can compile its own source code, providing it is written in C.

The wrinkle is the ability of a programme to knowingly read its own source code. Any running process can be put inside a sandbox or simulated environment, such that there is no purely technical way of circumventing it. A running process accesses its environment using system calls, for instance get_time() or open_file(), and it has to take their results on faith. The get_time() function doesn't have to return the real system time, and in a virtualized process, attempts to access the file system do not access the real file system. There is no is_this_real() call, or at least, no unfakeable one. (Homoiconicity is no magic fix - even if a LISP programme can easily process its own code once it has obtained it, it still has to trust the subroutine that advertises itself as returning it.)

Therefore, a programme can easily be prevented from accessing and modifying its own code.
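The faked-system-call point can be made concrete with a short Python sketch. Everything here (RealEnvironment, SandboxedEnvironment, read_own_source) is invented purely for illustration, not any real sandboxing API:

```python
# Purely illustrative sketch: a process that can only observe its world
# through environment calls, which an outer sandbox is free to fake.
import time

class RealEnvironment:
    def get_time(self):
        return time.time()

    def read_own_source(self):
        # Stand-in for handing the process its true source code.
        return "def run(env): ..."

class SandboxedEnvironment:
    """Lies about everything the process might use to locate or
    inspect itself; the process has no unfakeable is_this_real()."""
    def get_time(self):
        return 0.0  # frozen clock

    def read_own_source(self):
        return "# decoy source code"

def run(env):
    # Whatever environment the process is handed defines its entire
    # observable world; it must take the answers on faith.
    return env.read_own_source()

print(run(SandboxedEnvironment()))  # prints the decoy, not the real source
```

In a real deployment, a virtual machine or syscall-interposition layer would play the role of SandboxedEnvironment; the point is only that, from inside `run`, the two environments are indistinguishable.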

It could be argued that an intelligent agent could use social engineering to cajole a human operator into getting the source code out of a locked cabinet, or whatever. This is a variation on the standard MIRI claim that an AI could talk its way out of a box. However, in this case the AI needs to talk its way out before it starts to recursively self-improve, because it needs its source code. This suggests that an AI that is below a certain level of intelligence can be maintained there.

A good point, I must spend some time looking into the FOOM debate.