Evaluating the feasibility of SI's plan

(With Kaj Sotala)

SI's current R&D plan seems to go as follows: 

1. Develop the perfect theory.
2. Implement this as a safe, working Artificial General Intelligence -- and do so before anyone else builds an AGI.

The Singularity Institute is nearly the only group working on Friendliness theory (albeit with very few researchers), so it has the lead on Friendliness. But there is no reason to think it will be ahead of anyone else on implementation.

The few AGI designs we can examine today, like OpenCog, are big, messy systems. They deliberately exploit a variety of cognitive dynamics that might combine in unanticipated ways, and they have assorted human-like drives rather than the supergoal-driven, utility-maximizing goal hierarchies that Eliezer talks about, or that a mathematical abstraction like AIXI employs.

A team which is ready to adopt a variety of imperfect heuristic techniques will have a decisive lead over approaches based on pure theory. Unconstrained by safety, one of them will beat SI in the race to AGI. SI cannot ignore this. Real-world, imperfect safety measures for real-world, imperfect AGIs are needed. These might be mechanisms for avoiding undesirable dynamics in heuristic systems, AI-boxing toolkits usable in the pre-explosion stage, or something else entirely.

SI’s hoped-for theory will include a reflexively consistent decision theory, something like a greatly refined Timeless Decision Theory. It will also describe human value as formally as possible, or at least specify a way to pin it down precisely, something like an improved Coherent Extrapolated Volition.

The hoped-for theory is intended to provide not only safety features but also a description of the implementation: some sort of ideal Bayesian mechanism, a theoretically perfect intelligence.

SIers have said to me that SI's design will have a decisive implementation advantage. The idea is that because strap-on safety can't work, Friendliness research necessarily involves fundamental architectural design decisions -- decisions which also happen to be general AGI design decisions that some other AGI builder could grab, saving themselves a lot of effort. The assumption seems to be that all other designs are based on hopelessly misguided principles; SIers, the idea goes, are so smart that they'll build AGI long before anyone else, and others will succeed only when hardware capabilities allow crude, near-brute-force methods to work.

Yet even if the Friendliness theory provides the basis for intelligence, the nitty-gritty of SI’s implementation will still be far removed from that theory, and will involve real-world heuristics and other compromises.

We can compare SI’s future AI design to AIXI, another mathematically perfect AI formalism (though one with some critical reflexivity issues). Schmidhuber, Hutter, and colleagues think that their AIXI can be scaled down into a feasible implementation, and have built some toy systems on this principle. Similarly, any actual AGI based on SI's future theories will have to stray far from its mathematically perfected origins.
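For reference, AIXI's action selection can be written schematically as follows. (This is a compressed paraphrase of Hutter's standard definition, not anything from SI; the horizon m, programs q, and universal machine U are as in his papers.)

$$a_k := \arg\max_{a_k} \sum_{o_k r_k} \cdots \max_{a_m} \sum_{o_m r_m} \left[ r_k + \cdots + r_m \right] \sum_{q \,:\, U(q,\, a_1 \ldots a_m) \,=\, o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}$$

The inner sum ranges over all programs q for a universal Turing machine U, and is incomputable; any buildable system must replace it with heuristic approximations -- which is exactly the point.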

Moreover, SI's future Friendliness proof may simply be wrong. Eliezer writes a lot about logical uncertainty: the idea that you must treat even purely mathematical claims with the same probabilistic techniques as any ordinary uncertain belief. He pursues this mostly so that his AI can reason about itself, but the same principle applies to Friendliness proofs as well.

Perhaps Eliezer thinks that a heuristic AGI is absolutely doomed to failure; that a hard takeoff soon after the creation of the first AGI is so overwhelmingly likely that a mathematically designed AGI is the only one that could stay Friendly. In that case, we have to work on a pure-theory approach, even if it has a low chance of being finished first; otherwise we'll be dead anyway. If an embryonic AGI will necessarily undergo an intelligence explosion, we have no choice but to "shut up and do the impossible."

I am all in favor of gung-ho, knife-between-the-teeth projects. But when you think that your strategy is impossible, you should also look for a strategy which is possible, if only as a fallback. Thinking about safety theory until drops of blood appear on your forehead (as Eliezer puts it, quoting Gene Fowler) is all well and good. But if there is only a 10% chance of achieving 100% safety (not that there really is any such thing), then I'd rather go for a strategy that promises only 40% safety but has a 40% chance of being achieved. OpenCog and the like are going to be developed regardless, and probably before SI's own provably Friendly AGI, so even an imperfect safety measure is better than nothing.
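To make the arithmetic explicit, here is a toy expected-value calculation using the round numbers above (the numbers themselves are, of course, only illustrative):

```python
# Expected safety = P(strategy can be carried out) * P(safe outcome | carried out).
p_pure_theory = 0.10 * 1.00  # 10% chance of completing a "100% safe" design
p_heuristic   = 0.40 * 0.40  # 40% chance of completing a "40% safe" design

print(p_pure_theory)  # 0.1
print(p_heuristic)    # ~0.16 -- the weaker guarantee wins in expectation
```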

Perhaps heuristic approaches really do have a 99% chance of an immediate unFriendly explosion -- but that estimate might itself be wrong. SI, better than anyone, should know that an intuition-based probability estimate of "99%" really means "70%". Even if other approaches are long shots, we should not put all our eggs in one basket. Theoretical perfection and stopgap safety measures can be developed in parallel.

Given what we know about human overconfidence and the general reliability of predictions, the actual outcome will to a large extent be something that none of us expected or could have predicted. Progress on safety mechanisms for heuristic AGI will improve our chances even if something entirely unexpected happens.

What impossible thing should SI be shutting up and doing? For Eliezer, it's Friendliness theory: to him, safety for heuristic AGI is impossible, so we shouldn't direct our efforts in that direction. But why shouldn't safety for heuristic AGI be another impossible thing to do?

(Two impossible things before breakfast … and maybe a few more? Eliezer seems to be rebuilding logic, set theory, ontology, epistemology, axiology, decision theory, and more, mostly from scratch. That's a lot of impossibles.)

And even if safety for heuristic AGIs is really impossible for us to figure out now, there is some chance of an extended soft takeoff that would allow us to develop heuristic AGIs which could help in figuring out AGI safety, whether because we can use them in our tests, or because they can apply their embryonic general intelligence to the problem. Goertzel and Pitt have urged this approach.

Yet resources are limited. Perhaps the folks who are actually building their own heuristic AGIs are in a better position than SI to develop safety mechanisms for them, while SI is the only organization really working on a formal theory of Friendliness, and so should concentrate on that. It could be better to focus SI's resources on areas where it has a relative advantage, or which have a greater expected impact.

Even if so, SI should evangelize AGI safety to other researchers, not only as a general principle, but also by offering theoretical insights that may help them as they work on their own safety mechanisms.

In summary:

1. AGI development which is unconstrained by a Friendliness requirement is likely to beat a provably Friendly design in the race to implementation, and some effort should be expended on dealing with this scenario.

2. Pursuing a provably Friendly AGI, even if very unlikely to succeed, could still be the right thing to do if it were certain that we'll have a hard takeoff very soon after the creation of the first AGIs. However, we do not know whether this is true.

3. Even the provably Friendly design will face real-world compromises and errors in its implementation, so the implementation will not itself be provably Friendly. Thus, safety protections of the sort required for heuristic designs are needed even for a theoretically Friendly design.

Comments


Lots of strawmanning going on here (could somebody else please point these out? please?) but in case it's not obvious, the problem is that what you call "heuristic safety" is difficult. Now, most people haven't the tiniest idea of what makes anything difficult to do in AI and are living in a verbal-English fantasy world, so of course you're going to get lots of people who think they have brilliant heuristic safety ideas. I have never seen one that would work, and I have seen lots of people come up with ideas that sound to them like they might have a 40% chance of working and which I know perfectly well to have a 0% chance of working.

The real gist of Friendly AI isn't some imaginary 100% perfect safety concept, it's ideas like, "Okay, we need to not have a conditionally independent chance of goal system warping on each self-modification because over the course of a billion modifications any conditionally independent probability will sum to ~1, but since self-modification is initially carried out in the highly deterministic environment of a computer chip it looks possible to use crisp approaches that avert a conditionally independent failure probability for each self-modification." Following this methodology is not 100% safe, but rather, if you fail to do that, your conditionally independent failure probabilities add up to 1 and you're 100% doomed.
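To illustrate the accumulation being described here (the per-step numbers are invented for the example):

```python
# A small, conditionally independent failure chance per self-modification
# compounds toward certainty over enough modifications.
p_step  = 1e-6     # hypothetical chance of goal-system warping per step
n_steps = 10**9    # a billion self-modifications

p_warped = 1 - (1 - p_step) ** n_steps
print(p_warped)    # ~1.0 -- a "tiny" per-step risk becomes near-certain doom
```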

But if you were content with a "heuristic" approach that you thought had a 40% chance of working, you'll never think through the problem in enough detail to realize that your doom probability is not 60% but ~1, because only somebody holding themselves to a higher standard than "heuristic safety" would ever push their thinking far enough to realize that their initial design was flawed.

People at SI are not stupid. We're not trying to achieve lovely perfect safety with a cherry on top because we think we have lots of luxurious time to waste and we're perfectionists. I have an analysis of the problem which says that if I want something to have a failure probability less than 1, I have to do certain things because I haven't yet thought of any way not to have to do them. There are of course lots of people who think that they don't have to solve the same problems, but that's because they're living in a verbal-English fantasy world in which their map is so blurry that they think lots of things "might be possible" that a sharper map would show to be much more difficult than they sound.

I don't know how to take a self-modifying heuristic soup in the process of going FOOM and make it Friendly. You don't know either, but the problem is, you don't know that you don't know. Or to be more precise, you don't share my epistemic reasons to expect that to be really difficult. When you engage in sufficient detail with a problem of FAI, and try to figure out how to solve it given that the rest of the AI was designed to allow that solution, it suddenly looks that much harder to solve under sloppy conditions. Whereas on the "40% safety" approach, it seems like the sort of thing you might be able to do, sure, why not...

If someday I realize that it's actually much easier to do FAI than I thought, given that you use a certain exactly-right approach - so easy, in fact, that you can slap that exactly-right approach on top of an AI system that wasn't specifically designed to permit it, an achievement on par with hacking Google Maps to play chess using its route-search algorithm - then that epiphany will come as the result of considering things that would work and be known to work with respect to some subproblem, not things that seem like they might have a 40% chance of working overall, because only the former approach develops skill.

I'll leave that as my take-home message - if you want to imagine building plug-in FAI approaches, isolate a subproblem and ask yourself how you could solve it and know that you've solved it, don't imagine overall things that have 40% chances of working. If you actually succeed in building knowledge this way I suspect that pretty soon you'll give up on the plug-in business because it will look harder than building the surrounding AI yourself.

Full disclosure: I'm a professional cryptography research assistant. I'm not really interested in AI (yet), but there are obvious similarities when it comes to security.

I have to back Eliezer up on the "Lots of strawmanning" part. No professional cryptographer will ever tell you there's hope in trying to achieve a "perfect level of safety" for anything, and cryptography, unlike AI, is a very well formalized field. As an example, I'll offer a conversation with a student:

  • How secure is this system? (such a question is usually shorthand for: "What's the probability this system won't be broken by methods X, Y and Z")

  • The theorem says it's 1 - 2^-N.

  • What's the probability that the proof of the theorem is correct?

  • ... probably not 1 - 2^-N.

Now, before you go "yeah, right", I'll also say that I've seen this happen once already - a theorem in a major peer-reviewed journal turned out to be wrong (a counter-example was found) after one of the students tried to implement it as part of his thesis - so the probability was indeed not even close to 1 - 2^-N for any serious N. I'd like to point out that this doesn't even include problems with the implementation of the theory.
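To spell out why (a toy calculation; all three numbers here are made up): the theorem's bound only applies if the proof is correct, so even a small chance of a flawed proof swamps the nominal bound.

```python
p_claimed          = 2.0 ** -128  # nominal failure probability, per the theorem
p_proof_wrong      = 0.01         # hypothetical chance the proof has an error
p_fail_given_wrong = 0.5          # guess: chance a flawed proof hides a real break

p_fail = (1 - p_proof_wrong) * p_claimed + p_proof_wrong * p_fail_given_wrong
print(p_fail)  # ~0.005 -- about 2**-7.6, nowhere near 2**-128
```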

It's really difficult to explain how hard this stuff really is to people who have never tried to develop anything like it. That's too bad (and a danger), because the people who do get it are rarely in charge of the money. That's one reason for the CFAR/rationality movement... you need a tool to explain it to other people too, am I right?

Now, before you go "yeah, right", I'll also say that I've seen this happen once already - a theorem in a major peer-reviewed journal turned out to be wrong (a counter-example was found) after one of the students tried to implement it as part of his thesis - so the probability was indeed not even close to 1 - 2^-N for any serious N. I'd like to point out that this doesn't even include problems with the implementation of the theory.

Yup. Usual reference: "Probing the Improbable: Methodological Challenges for Risks with Low Probabilities and High Stakes". (I also have an essay on a similar topic.)

Upvoted for being gwern i.e. having a reference for everything... how do you do that?

Excellent visual memory, great Google & search skills, a thorough archive system, thousands of excerpts stored in Evernote, and essays compiling everything relevant I know of on a topic - that's how.

(If I'd been born decades ago, I'd probably have become a research librarian.)

Would love to read a gwern-essay on your archiving system. I use Evernote, org-mode, Diigo and Pocket and just can't get them streamlined into a nice workflow. If Evernote adopted Diigo-like highlighting and let me seamlessly edit with Emacs/org-mode, that would be perfect... but alas, until then I'm stuck with this mess of a kludge. Teach us, master, please!

I don't know how to take a self-modifying heuristic soup in the process of going FOOM and make it Friendly. You don't know either, but the problem is, you don't know that you don't know. Or to be more precise, you don't share my epistemic reasons to expect that to be really difficult.

But the article didn't claim any different: it explicitly granted that if we presume a FOOM, then yes, trying to do anything with heuristic soups seems useless and just something that will end up killing us all. The disagreement is not on whether it's possible to make a heuristic AGI that FOOMs while remaining Friendly; the disagreement is on whether there will inevitably be a FOOM soon after the creation of the first AGI, and whether there could be a soft takeoff during which some people prevented those powerful-but-not-yet-superintelligent heuristic soups from killing everyone while others put the finishing touches on the AGI that could actually be trusted to remain Friendly when it actually did FOOM.

The disagreement is not on whether it's possible to make a heuristic AGI that FOOMs while remaining Friendly; the disagreement is on whether there will inevitably be a FOOM soon after the creation of the first AGI

Moreover, the very fact that an AGI is "heuristic soup" removes some of the key assumptions in some FOOM arguments that have been popular around here (Omohundro 2008). In particular, I doubt that a heuristic AGI is likely to be a "goal-seeking agent" in the rather precise sense of maximizing a utility function. It may not even approximate such behavior as closely as humans do. On the other hand, if a whole lot of radically different heuristic-based approaches are tried, the odds of at least one of them being "motivated" to seek resources increase dramatically.

Note that Omohundro doesn't assume that the AGI would actually have a utility function: he only assumes that the AGI is capable of understanding the microeconomic argument for why it would be useful for it to act as if it did have one. His earlier 2007 paper is clearer on this point.

People at SI are not stupid.

Understatement :-)

Given that heuristic AGIs have an advantage in development speed over your approach, how do you plan to deal with the existential risk that these other projects will pose?

And given this dev-speed disadvantage for SI, how is it possible that SI's future AI design might not only be safer, but also have a significant implementation advantage over competitors, as I have heard from SIers (if I understood them correctly)?

Given that heuristic AGIs have an advantage in development speed over your approach

Are you asking him to assume this? Because, um, it's possible to doubt that OpenCog or similar projects will produce interesting results. (Do you mean, projects by people who care about understanding intelligence but not Friendliness?) Given the assumption, one obvious tactic involves education about the dangers of AI.

Thank you for the answers. I think that they do not really address the questions in the OP -- and to me this is a sign that the questions are all the more worth pursuing.

Here is a summary of the essential questions, with SI's current (somewhat inadequate) answers as I understand them.

Q1. Why maintain any secrecy for SI's research? Don't we want others to collaborate on and use safety mechanisms? Of course, a safe AGI must be safe from the ground up. But as to implementation, why should we expect that SI's AGI design could possibly have a lead on the others?

A1. ?

Q2. Given that proofs can be wrong, that implementations can have mistakes, and that we can't predict the challenges ahead with certainty, what is SI's layered safety strategy (granted that FAI theory is the most important component)?

A2. There should be a layered safety strategy of some kind, but actual Friendliness theory is what we should be focusing on right now.

Q3. How do we deal with the fact that unsafe AGI projects, without the constraint of safety, will very likely have the lead on SI's project?

A3. We just have to work as hard as possible, and hope that it will be enough.

Q4. Should we evangelize safety ideas to other AGI projects?

A4. No, it's useless. For that to be useful, AGI designers would have to scrap the projects they had already invested in, and restart the projects with Friendliness as the first consideration, and practically nobody is going to be sane enough for that.

Why maintain any secrecy for SI's research? Don't we want others to collaborate on and use safety mechanisms? Of course, a safe AGI must be safe from the ground up. But as to implementation, why should we expect that SI's AGI design could possibly have a lead on the others?

The question of whether to keep research secret must be decided on a case-by-case basis. In fact, next week I have a meeting (with Eliezer and a few others) about whether to publish a particular piece of research progress.

Certainly, there are many questions that can be discussed in public because they are low-risk (in an information hazard sense), and we plan to discuss those in public — e.g. Eliezer is right now working on the posts in his Open Problems in Friendly AI sequence.

Why should we expect that SI's AGI design will have a lead on others? We shouldn't. It probably won't. We can try, though. And we can also try to influence the top AGI people (10-40 years from now) to think with us about FAI and safety mechanisms and so on. We do some of that now, though the people in AGI today probably aren't the people who will end up building the first AGIs. (Eliezer's opinion may differ.)

Given that proofs can be wrong, that implementations can have mistakes, and that we can't predict the challenges ahead with certainty, what is SI's layered safety strategy (granted that FAI theory is the most important component)?

That will become clearer as we learn more. I do think several layers of safety will need to be involved. 100% proofs of Friendliness aren't possible. There are both technical and social layers of safety strategy to implement.

How do we deal with the fact that unsafe AGI projects, without the constraint of safety, will very likely have the lead on SI's project?

As I said above, one strategy is to build strong relationships with top AGI people and work with them on Friendliness research and make it available to them, while also being wary of information hazards.

Should we [spread] safety ideas to other AGI projects?

Eliezer may disagree, but I think the answer is "Yes." There's a great deal of truth in Upton Sinclair's quip that "It is difficult to get a man to understand something, when his salary depends upon his not understanding it," but I don't think it's impossible to reach people, especially if we have stronger arguments, more research progress on Friendliness, and a clearer impending risk from AI than is the case in early 2013.

That said, safety outreach may not be a very good investment now — it may be putting the cart before the horse. We probably need clearer and better-formed arguments, and more obvious progress on Friendliness, before safety outreach will be effective on even 10% of the most intelligent AI researchers.

Pursuing a provably Friendly AGI, even if very unlikely to succeed, could still be the right thing to do if it were certain that we'll have a hard takeoff very soon after the creation of the first AGIs.

One consideration you're missing (and that I expect to be true; Eliezer also points it out) is that even if there is a very slow takeoff, the creation of slow-thinking, poorly understood unFriendly AGIs is not any help in developing a FAI: they can't be "debugged" when you don't have an accurate understanding of what you are aiming for, and they can't be "asked" to solve a problem which you can't accurately state. In this hypothetical, in the long run the unFriendly AGIs (or WBEs whose values have drifted away from original human values) will have control. So in this case it's also necessary (if a little less urgent, which isn't really enough to change the priority of the problem) to work on FAI theory; hard takeoff is not decisively important in this respect.

(Btw, is this point in any of the papers? Do people agree it should be?)

As for my own work for SI, I've been trying to avoid the assumption of there necessarily being a hard takeoff right away, and to somewhat push towards a direction that also considers the possibility of a safe singularity through an initial soft takeoff and more heuristic AGIs. (I do think that there will be a hard takeoff eventually, but an extended softer takeoff before it doesn't seem impossible.) E.g. this is from the most recent draft of the Responses to Catastrophic AGI Risk paper:

As a brief summary of our views, in the medium term, we think that the proposals of AGI confinement (section 4.1.), Oracle AI (section 5.1.), and motivational weaknesses (section 5.6.) would have promise in helping create safer AGIs. These proposals share in common the fact that although they could help a cautious team of researchers create an AGI, they are not solutions to the problem of AGI risk, as they do not prevent others from creating unsafe AGIs, nor are they sufficient in guaranteeing the safety of sufficiently intelligent AGIs. Regulation (section 3.3.) as well as "merge with machines" (section 3.4.) proposals could also help to somewhat reduce AGI risk. In the long run, we will need the ability to guarantee the safety of freely-acting AGIs. For this goal, value learning (section 5.2.5.) would seem like the most reliable approach if it could be made to work, with human-like architectures (section 5.3.4.) a possible alternative which seems less reliable but possibly easier to build. Formal verification (section 5.5.) seems like a very important tool in helping to ensure the safety of our AGIs, regardless of the exact approach that we choose.

Here, "human-like architectures" also covers approaches such as OpenCog. To me, a two-pronged approach, both developing a formal theory of Friendliness, and trying to work with the folks who design heuristic AGIs to make them more safe, would seem like the best bet. Not only would it help to make the heuristic designs safer, it could also give SI folks the kinds of skills that would be useful in actually implementing their formally specified FAI later on.

Part of the problem here is an Angels on Pinheads problem. Which is to say: before deciding exactly how many angels can dance on the head of a pin, you have to make sure the "angel" concept is meaningful enough that questions about angels are meaningful. In the present case, you have a situation where (a) the concept of "friendliness" might not be formalizable enough to make any mathematical proofs about it meaningful, and (b) there is no known path to the construction of an AGI at the moment, so speculating about the properties of AGI systems is tantamount to speculating about the properties of railroads when you haven't invented the wheel yet.

So, should SI be devoting any time at all to proving friendliness? Yes, but only after defining its terms well enough to make the endeavor meaningful. (And, for the record, there are at least some people who believe that the terms cannot be defined in a way that admits of such proofs.)

Yes, but only after defining its terms well enough to make the endeavor meaningful.

That is indeed part of what SI is trying to do at the moment.

So ... SI is addressing the question of whether the "friendliness" concept is actually meaningful enough to be formalizable? SI accepts that "friendliness" might not be formalizable at all, and has discussed the possibility that mathematical proof is not even applicable in this case?

And SI has discussed the possibility that the current paradigm for an AI motivation mechanism is so poorly articulated, and so unproven (there being no such mechanism that has been demonstrated to be even approaching stability), that it may be meaningless to discuss how such motivation mechanisms can be proven to be "friendly"?

I do not believe I have seen any evidence of those debates/discussions coming from SI... do you have pointers?

Well, Luke has asked me to work on a document called "Mitigating Risks from AGI: Key Strategic Questions" which lists a number of questions we'd like to have answers to and attempts to list some preliminary pointers and considerations that would help other researchers actually answer those questions. "Can CEV be formalized?" and "How feasible is it to create Friendly AI along an Eliezer path?" are two of the questions in that document.

I haven't heard explicit discussions about all of your points, but I would expect them to all have been brought up in private discussions (which I have for the most part missed, since my physical location is rather remote from all the other SI folks). Eliezer has said that a Friendly AI in the style that he is thinking of might just be impossible. That said, I do agree with the current general consensus among other SI folk, which is to say that we should act based on the assumption that such a mathematical proof is possible, because humanity's chances of survival look pretty bad if it isn't.

Hmm, the OP isn't arguing for it, but I'm starting to wonder if it might (upon further study) actually be a good idea to build a heuristics-based FAI. Here are some possible answers to common objections/problems of the approach:

  • Heuristics-based AIs can't safely self-modify. A heuristics-based FAI could instead try to build a "cleanly designed" FAI as its successor, just like we can, but possibly do it better if it's smarter.
  • It seems impossible to accurately capture the complexity of humane values in a heuristics-based AI. What if we just give it the value of "be altruistic (in a preference utilitarian sense) towards (some group of) humans"?
  • The design space of "heuristics soup" is much larger than the space of "clean designs", which gives the "cleanly designed" FAI approach a speed advantage. (This is my guess of why someone might think "cleanly designed" FAI will win the race for AGI. Somebody correct me if there are stronger reasons.) The "fitness landscape" of heuristics-based AI may be such that it's not too hard to hit upon a viable design. Also, the only existence proof of AGI (i.e., humans) is heuristics based, so we don't know if a "cleanly designed" human-level-or-above AGI is even a logical possibility.
  • A heuristics-based AI may be very powerful but philosophically incompetent. We humans are heuristics based but at least somewhat philosophically competent. Maybe "philosophical competence" isn't such a difficult target to hit in the space of "heuristic soup" designs?

What if we just give it the value of "be altruistic (in a preference utilitarian sense) towards (some group of) humans"?

Well, then you get the standard "the best thing to do in a preference utilitarian sense would be to reprogram everyone to only prefer things that are maximally easy to satisfy" objection, and once you start trying to avoid that, you get the full complexity of value problem again.

The standard solution to that is to be altruistic to some group of people as they existed at time T, and the standard problem with that is it doesn't allow moral progress, and the standard solution to that is to be altruistic to some idealized or extrapolated group of people. So we just have to make the heuristics-based FAI understand the concept of CEV (or whatever the right notion of "idealized" is), which doesn't seem impossible. What does seem impossible is to achieve high confidence that it understands the notion correctly, but if provably-Friendly AI is just too slow or unfeasible, and we're not trying to achieve 100% safety...