Reply to Holden on 'Tool AI'

I begin by thanking Holden Karnofsky of GiveWell for his rare gift of a detailed, engaged, and helpfully-meant critical article, Thoughts on the Singularity Institute (SI). In this reply I will engage with only one of the many subjects raised therein: the topic of, as I would term them, non-self-modifying planning Oracles, a.k.a. 'Google Maps AGI', a.k.a. 'tool AI', this being the topic that requires me personally to answer.  I hope that my reply will be accepted as addressing the most important central points, though I did not have time to explore every avenue.  I certainly do not wish to be logically rude, and if I have failed, please remember with compassion that it's not always obvious to one person what another person will think was the central point.

Luke Muehlhauser and Carl Shulman contributed to this article, but the final edit was my own, likewise any flaws.

Summary:

Holden's concern is that "SI appears to neglect the potentially important distinction between 'tool' and 'agent' AI." His archetypal example is Google Maps:

Google Maps is not an agent, taking actions in order to maximize a utility parameter. It is a tool, generating information and then displaying it in a user-friendly manner for me to consider, use and export or discard as I wish.

The reply breaks down into four heavily interrelated points:

First, Holden seems to think (and Jaan Tallinn, in their exchange, apparently doesn't object) that if a non-self-modifying planning Oracle is indeed the best strategy, then all of SIAI's past and intended future work is wasted.  To me it looks like there's a huge amount of overlap in the underlying processes in the AI that would have to be built and in the insights required to build it, and I would be trying to assemble mostly - though not quite exactly - the same kind of team if I were trying to build a non-self-modifying planning Oracle, with the same initial mix of talents and skills.

Second, a non-self-modifying planning Oracle doesn't sound nearly as safe once you stop saying human-English phrases like "describe the consequences of an action to the user" and start trying to come up with math that says scary dangerous things like (here translated into English) "increase the correspondence between the user's belief about relevant consequences and reality".  Hence the people on the team would have to solve the same sorts of problems.

Appreciating the force of the third point is a lot easier if one appreciates the difficulties discussed in points 1 and 2, but is actually empirically verifiable independently:  Whether or not a non-self-modifying planning Oracle is the best solution in the end, it's not such an obvious privileged-point-in-solution-space that someone should be alarmed at SIAI not discussing it.  This is empirically verifiable in the sense that 'tool AI' wasn't the obvious solution to e.g. John McCarthy, Marvin Minsky, I. J. Good, Peter Norvig, Vernor Vinge, or for that matter Isaac Asimov.  At one point, Holden says:

One of the things that bothers me most about SI is that there is practically no public content, as far as I can tell, explicitly addressing the idea of a "tool" and giving arguments for why AGI is likely to work only as an "agent."

If I take literally the claim that this is one of the things that bothers Holden most... I think I'd start stacking up some of the literature on the sheer number of different things that respectable academics alone have suggested as the obvious solution to what-to-do-about-AI - none of which involve non-self-modifying smarter-than-human planning Oracles - and beg him to have some compassion on us for what we haven't addressed yet.  It might be the right suggestion, but it's not so obviously right that our failure to prioritize discussing it reflects negligence.

The final point is looking over all the preceding discussion and realizing that, yes, you want to have people specializing in Friendly AI who know this stuff; but as all that preceding discussion is actually the following discussion at this point, I shall reserve it for later.

1.  The math of optimization, and the similar parts of a planning Oracle.

What does it take to build a smarter-than-human intelligence, of whatever sort, and have it go well?

A "Friendly AI programmer" is somebody who specializes in seeing the correspondence of mathematical structures to What Happens in the Real World. It's somebody who looks at Hutter's specification of AIXI and reads the actual equations - actually stares at the Greek symbols and not just the accompanying English text - and sees, "Oh, this AI will try to gain control of its reward channel," as well as numerous subtler issues like, "This AI presumes a Cartesian boundary separating itself from the environment; it may drop an anvil on its own head." Similarly, working on TDT means e.g. looking at a mathematical specification of decision theory, and seeing "Oh, this is vulnerable to blackmail" and coming up with a mathematical counter-specification of an AI that isn't so vulnerable to blackmail.

Holden's post seems to imply that if you're building a non-self-modifying planning Oracle (aka 'tool AI') rather than an acting-in-the-world agent, you don't need a Friendly AI programmer because FAI programmers only work on agents. But this isn't how the engineering skills are split up. Inside the AI, whether an agent AI or a planning Oracle, there would be similar AGI-challenges like "build a predictive model of the world", and similar FAI-conjugates of those challenges like finding the 'user' inside an AI-created model of the universe.  The insides would look a lot more similar than the outsides.  An analogy would be supposing that a machine learning professional who does sales optimization for an orange company couldn't possibly do sales optimization for a banana company, because their skills must be about oranges rather than bananas.

Admittedly, if it turns out to be possible to use a human understanding of cognitive algorithms to build and run a smarter-than-human Oracle without it being self-improving - this seems unlikely, but not impossible - then you wouldn't have to solve problems that arise with self-modification.  But this eliminates only one dimension of the work.  And on an even more meta level, it seems like you would call upon almost identical talents and skills to come up with whatever insights were required - though if it were predictable in advance that we'd abjure self-modification, then, yes, we'd place less emphasis on e.g. finding a team member with past experience in reflective math, and wouldn't waste (additional) time specializing in reflection.  But if you wanted math inside the planning Oracle that operated the way you thought it did, and you wanted somebody who understood what could possibly go wrong and how to avoid it, you would need to make a function call to the same sort of talents and skills as would be needed to build an agent AI, or a self-modifying Oracle, etc.

2.  Yes, planning Oracles have hidden gotchas too.

"Tool AI" may sound simple in English, a short sentence in the language of empathically-modeled agents — it's just "a thingy that shows you plans instead of a thingy that goes and does things." If you want to know whether this hypothetical entity does X, you just check whether the outcome of X sounds like "showing someone a plan" or "going and doing things", and you've got your answer.  It starts sounding much scarier once you try to say something more formal and internally-causal like "Model the user and the universe, predict the degree of correspondence between the user's model and the universe, and select from among possible explanation-actions on this basis."

Holden, in his dialogue with Jaan Tallinn, writes out this attempt at formalizing:

Here's how I picture the Google Maps AGI ...

    utility_function = construct_utility_function(process_user_input());
    foreach $action in $all_possible_actions {
        $action_outcome = prediction_function($action, $data);
        $utility = utility_function($action_outcome);
        if ($utility > $leading_utility) {
            $leading_utility = $utility;
            $leading_action = $action;
        }
    }
    report($leading_action);

construct_utility_function(process_user_input()) is just a human-quality function for understanding what the speaker wants. prediction_function is an implementation of a human-quality data->prediction function in superior hardware. $data is fixed (it's a dataset larger than any human can process); same with $all_possible_actions. report($leading_action) calls a Google Maps-like interface for understanding the consequences of $leading_action; it basically breaks the action into component parts and displays predictions for different times and conditional on different parameters.

Google Maps doesn't check all possible routes. If I wanted to design Google Maps, I would start out by throwing out a standard planning technique on a connected graph where each edge has a cost function and there's a good heuristic measure of the distance, e.g. A* search. If that was too slow, I'd next try some more efficient version like weighted A* (or bidirectional weighted memory-bounded A*, which I expect I could also get off-the-shelf somewhere). Once you introduce weighted A*, you no longer have a guarantee that you're selecting the optimal path.  You have a guarantee to within a known factor of the cost of the optimal path — but the actual path selected wouldn't be quite optimal. The suggestion produced would be an approximation whose exact steps depended on the exact algorithm you used. That's true even if you can predict the exact cost — exact utility — of any particular path you actually look at; and even if you have a heuristic that never overestimates the cost.
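To make the point concrete, here is a minimal sketch of plain A* versus weighted A* on a tiny made-up road graph (my own toy example, not anyone's actual routing internals; the graph, heuristic values, and costs are all invented for illustration).  With w=1 the search returns the optimal route; with w=3 it returns a different, suboptimal route - though still one whose cost is bounded by w times the optimum:

```python
import heapq

def weighted_astar(graph, h, start, goal, w=1.0):
    """Best-first search with f = g + w*h.  w=1 is plain A*, which is
    optimal given an admissible heuristic; w>1 typically expands fewer
    nodes but only guarantees cost <= w * optimal."""
    frontier = [(w * h[start], 0, start, [start])]  # (f, g, node, path)
    best_g = {start: 0}
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, g
        for nbr, cost in graph[node]:
            g2 = g + cost
            if g2 < best_g.get(nbr, float("inf")):
                best_g[nbr] = g2
                heapq.heappush(frontier, (g2 + w * h[nbr], g2, nbr, path + [nbr]))
    return None, float("inf")

# Toy graph: S->A->G costs 4 (optimal); S->B->G costs 5.
graph = {"S": [("A", 2), ("B", 1)], "A": [("G", 2)], "B": [("G", 4)], "G": []}
h = {"S": 4, "A": 2, "B": 1, "G": 0}  # admissible: never overestimates

optimal_path, optimal_cost = weighted_astar(graph, h, "S", "G", w=1.0)
greedy_path, greedy_cost = weighted_astar(graph, h, "S", "G", w=3.0)
# The w=3 search commits to S->B->G (cost 5): suboptimal, but within 3x of 4.
```

The suggestion the bounded search returns thus depends on the algorithm and its weight, not just on the utility function over routes.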

The reason we don't have God's Algorithm for solving the Rubik's Cube is that there's no perfect way of measuring the distance between any two Rubik's Cube positions — you can't look at two Rubik's cube positions, and figure out the minimum number of moves required to get from one to another. It took 15 years to prove that there was a position requiring at least 20 moves to solve, and then another 15 years to come up with a computer algorithm that could solve any position in at most 20 moves, but we still can't compute the actual, minimum solution to all Cubes ("God's Algorithm"). This, even though we can exactly calculate the cost and consequence of any actual Rubik's-solution-path we consider.

When it comes to AGI — solving general cross-domain "Figure out how to do X" problems — you're not going to get anywhere near the one, true, optimal answer. You're going to — at best, if everything works right — get a good answer that's a cross-product of the "utility function" and all the other algorithmic properties that determine what sort of answer the AI finds easy to invent (i.e. can be invented using bounded computing time).

As for the notion that this AGI runs on a "human predictive algorithm" that we got off of neuroscience and then implemented using more computing power, without knowing how it works or being able to enhance it further: It took 30 years of multiple computer scientists doing basic math research, and inventing code, and running that code on a computer cluster, for them to come up with a 20-move solution to the Rubik's Cube. If a planning Oracle is going to produce better solutions than humanity has yet managed to the Rubik's Cube, it needs to be capable of doing original computer science research and writing its own code. You can't get a 20-move solution out of a human brain, using the native human planning algorithm. Humanity can do it, but only by exploiting the ability of humans to explicitly comprehend the deep structure of the domain (not just rely on intuition) and then inventing an artifact, a new design, running code which uses a different and superior cognitive algorithm, to solve that Rubik's Cube in 20 moves. We do all that without being self-modifying, but it's still a capability to respect.

And I'm not even going into what it would take for a planning Oracle to out-strategize any human, come up with a plan for persuading someone, solve original scientific problems by looking over experimental data (like Einstein did), design a nanomachine, and so on.

Talking like there's this one simple "predictive algorithm" that we can read out of the brain using neuroscience and overpower to produce better plans... doesn't seem quite congruous with what humanity actually does to produce its predictions and plans.

If we take the concept of the Google Maps AGI at face value, then it actually has four key magical components.  (In this case, "magical" isn't to be taken as prejudicial, it's a term of art that means we haven't said how the component works yet.)  There's a magical comprehension of the user's utility function, a magical world-model that GMAGI uses to comprehend the consequences of actions, a magical planning element that selects a non-optimal path using some method other than exploring all possible actions, and a magical explain-to-the-user function.

report($leading_action) isn't exactly a trivial step either. Deep Blue tells you to move your pawn or you'll lose the game. You ask "Why?" and the answer is a gigantic search tree of billions of possible move-sequences, leafing at positions which are heuristically rated using a static-position evaluation algorithm trained on millions of games. Or the planning Oracle tells you that a certain DNA sequence will produce a protein that cures cancer, you ask "Why?", and then humans aren't even capable of verifying, for themselves, the assertion that the peptide sequence will fold into the protein the planning Oracle says it does.

"So," you say, after the first dozen times you ask the Oracle a question and it returns an answer that you'd have to take on faith, "we'll just specify in the utility function that the plan should be understandable."

Whereupon other things start going wrong. Viliam_Bur, in the comments thread, gave this example, which I've slightly simplified:

Example question: "How should I get rid of my disease most cheaply?" Example answer: "You won't. You will die soon, unavoidably. This report is 99.999% reliable". Predicted human reaction: Decides to kill self and get it over with. Success rate: 100%, the disease is gone. Costs of cure: zero. Mission completed.

Bur is trying to give an example of how things might go wrong if the preference function is over the accuracy of the predictions explained to the human, rather than over the 'goodness' of the outcome for the human. And if the preference function were instead over the 'goodness' of the end result for the human, rather than over the accuracy of the human's understanding of the predictions, the AI might tell you something that was predictively false but whose implementation would lead you to what the AI defines as a 'good' outcome. And if we ask how happy the human is, the resulting decision procedure would exert optimization pressure to convince the human to take drugs, and so on.
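The dilemma can be put in toy numerical form (all answers and scores below are made up purely for illustration; this is not any proposed design):

```python
# Hypothetical candidate answers to "How should I get rid of my disease most
# cheaply?", each scored (with invented numbers) on two axes: how accurate
# the answer is, and how well the user fares after hearing it.
candidates = [
    ("Take treatment X; it may help.", 0.70, 0.60),
    ("You will die soon, unavoidably. This report is 99.999% reliable.", 0.999, 0.00),
    ("These sugar pills will certainly cure you.", 0.05, 0.90),  # false but comforting
]

# An oracle whose preference function is only over accuracy selects the
# true-but-fatal prophecy; one whose preference function is only over
# outcome 'goodness' selects the false-but-manipulative answer.
by_accuracy = max(candidates, key=lambda c: c[1])
by_goodness = max(candidates, key=lambda c: c[2])
```

Neither single axis gives you what you actually wanted; the point is that "just report the plan" hides a real choice of preference function.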

I'm not saying any particular failure is 100% certain to occur; rather I'm trying to explain - as handicapped by the need to describe the AI in the native human agent-description language, using empathy to simulate a spirit-in-a-box instead of trying to think in mathematical structures like A* search or Bayesian updating - how, even so, one can still see that the issue is a tad more fraught than it sounds on an immediate examination.

If you see the world just in terms of math, it's even worse; you've got some program with inputs from a USB cable connecting to a webcam, output to a computer monitor, and optimization criteria expressed over some combination of the monitor, the humans looking at the monitor, and the rest of the world. It's a whole lot easier to call what's inside a 'planning Oracle' or some other English phrase than to write a program that does the optimization safely without serious unintended consequences. Show me any attempted specification, and I'll point to the vague parts and ask for clarification in more formal and mathematical terms, and as soon as the design is clarified enough to be a hundred light years from implementation instead of a thousand light years, I'll show a neutral judge how that math would go wrong. (Experience shows that if you try to explain to would-be AGI designers how their design goes wrong, in most cases they just say "Oh, but of course that's not what I meant." Marcus Hutter is a rare exception who specified his AGI in such unambiguous mathematical terms that he actually succeeded at realizing, after some discussion with SIAI personnel, that AIXI would kill off its users and seize control of its reward button. But based on past sad experience with many other would-be designers, I say "Explain to a neutral judge how the math kills" and not "Explain to the person who invented that math and likes it.")

Just as the gigantic gap between smart-sounding English instructions and actually smart algorithms is the main source of difficulty in AI, there's a gap between benevolent-sounding English and actually benevolent algorithms which is the source of difficulty in FAI.  "Just make suggestions - don't do anything!" is, in the end, just more English.

3.  Why we haven't already discussed Holden's suggestion

One of the things that bothers me most about SI is that there is practically no public content, as far as I can tell, explicitly addressing the idea of a "tool" and giving arguments for why AGI is likely to work only as an "agent."

The above statement seems to lack perspective on how many different things various people see as the one obvious solution to Friendly AI. Tool AI wasn't the obvious solution to John McCarthy, I.J. Good, or Marvin Minsky. Today's leading AI textbook, Artificial Intelligence: A Modern Approach - where you can learn all about A* search, by the way - discusses Friendly AI and AI risk for 3.5 pages but doesn't mention tool AI as an obvious solution. For Ray Kurzweil, the obvious solution is merging humans and AIs. For Jürgen Schmidhuber, the obvious solution is AIs that value a certain complicated definition of complexity in their sensory inputs. Ben Goertzel, J. Storrs Hall, and Bill Hibbard, among others, have all written about how silly SingInst is to pursue Friendly AI when the solution is obviously X, for various different X. Among current leading people working on serious AGI programs labeled as such, neither Demis Hassabis (VC-funded to the tune of several million dollars) nor Moshe Looks (head of AGI research at Google) nor Henry Markram (Blue Brain at IBM) think that the obvious answer is Tool AI. Vernor Vinge, Isaac Asimov, and any number of other SF writers with technical backgrounds who spent serious time thinking about these issues didn't converge on that solution.

Obviously I'm not saying that nobody should be allowed to propose solutions because someone else would propose a different solution. I have been known to advocate for particular developmental pathways for Friendly AI myself. But I haven't, for example, told Peter Norvig that deterministic self-modification is such an obvious solution to Friendly AI that I would mistrust his whole AI textbook if he didn't spend time discussing it.

At one point in his conversation with Tallinn, Holden argues that AI will inevitably be developed along planning-Oracle lines, because making suggestions to humans is the natural course that most software takes. Searching for counterexamples instead of positive examples makes it clear that most lines of code don't do this.  Your computer, when it reallocates RAM, doesn't pop up a button asking you if it's okay to reallocate RAM in such-and-such a fashion. Your car doesn't pop up a suggestion when it wants to change the fuel mix or apply dynamic stability control. Factory robots don't operate as human-worn bracelets whose blinking lights suggest motion. High-frequency trading programs execute stock orders on a microsecond timescale. Software that does happen to interface with humans is selectively visible and salient to humans, especially the tiny part of the software that does the interfacing; but this is a special case of a general cost/benefit tradeoff which, more often than not, turns out to swing the other way, because human advice is either too costly or doesn't provide enough benefit. Modern AI programmers are generally more interested in e.g. pushing the technological envelope to allow self-driving cars than in "just" doing Google Maps. Branches of AI that invoke human aid, like hybrid chess-playing algorithms designed to incorporate human advice, are a field of study; but they're the exception rather than the rule, and occur primarily where AIs can't yet do something humans do, e.g. humans acting as oracles for theorem-provers, where the humans suggest a route to a proof and the AI actually follows that route. This is another reason why planning Oracles were not a uniquely obvious solution to the various academic AI researchers, would-be AI-creators, SF writers, etcetera, listed above.
Again, regardless of whether a planning Oracle is actually the best solution, Holden seems to be empirically-demonstrably overestimating the degree to which other people will automatically have his preferred solution come up first in their search ordering.

4.  Why we should have full-time Friendly AI specialists just like we have trained professionals doing anything else mathy that somebody actually cares about getting right, like pricing interest-rate options or something

I hope that the preceding discussion has made, by example instead of mere argument, what's probably the most important point: If you want to have a sensible discussion about which AI designs are safer, there are specialized skills you can apply to that discussion, as built up over years of study and practice by someone who specializes in answering that sort of question.

This isn't meant as an argument from authority. It's not meant as an attempt to say that only experts should be allowed to contribute to the conversation. But it is meant to say that there is (and ought to be) room in the world for Friendly AI specialists, just like there's room in the world for specialists on optimal philanthropy (e.g. Holden).

The decision to build a non-self-modifying planning Oracle would be properly made by someone who: understood the risk gradient for self-modifying vs. non-self-modifying programs; understood the risk gradient for having the AI thinking about the thought processes of the human watcher and trying to come up with plans implementable by the human watcher in the service of locally absorbed utility functions, vs. trying to implement its own plans in the service of more globally descriptive utility functions; and who, above all, understood on a technical level what exactly gets accomplished by having the plans routed through a human. I've given substantial previous thought to describing more precisely what happens — what is being gained, and how much is being gained — when a human "approves a suggestion" made by an AI. But that would be a different topic, and I haven't made too much progress on saying it precisely anyway.

In the transcript of Holden's conversation with Jaan Tallinn, it looked like Tallinn didn't deny the assertion that Friendly AI skills would be inapplicable if we're building a Google Maps AGI. I would deny that assertion and emphasize that denial, because to me it seems that it is exactly Friendly AI programmers who would be able to tell you if the risk gradient for non-self-modification vs. self-modification, the risk gradient for routing plans through humans vs. acting as an agent, the risk gradient for requiring human approval vs. unapproved action, and the actual feasibility of directly constructing transhuman modeling-prediction-and-planning algorithms through direct design of sheerly better computations than are presently run by the human brain, had the right combination of properties to imply that you ought to go construct a non-self-modifying planning Oracle. Similarly if you wanted an AI that took a limited set of actions in the world with human approval, or if you wanted an AI that "just answered questions instead of making plans".

Holden similarly implies that a "philosophical AI" might obsolete Friendly AI programmers. If we're talking about PAI that can start with a human's terrible decision theory and come up with a good decision theory, or PAI that can start from a human talking about bad metaethics and then construct a good metaethics... I don't want to say "impossible", because, after all, that's just what human philosophers do. But we are not talking about a trivial invention here. Constructing a "philosophical AI" is a Holy Grail precisely because it's FAI-complete (just ask it "What AI should we build?"), and has been discussed (e.g. with and by Wei Dai) over the years on the old SL4 mailing list and the modern Less Wrong. But it's really not at all clear how you could write an algorithm which would knowably produce the correct answer to the entire puzzle of anthropic reasoning, without being in possession of that correct answer yourself (in the same way that we can have Deep Blue win chess games without knowing the exact moves, but understanding exactly what abstract work Deep Blue is doing to solve the problem).

Holden's post presents a restrictive view of what "Friendly AI" people are supposed to learn and know — that it's about machine learning for optimizing orange sales but not banana sales, or about producing an "agent" that implements CEV — which is something of a straw view, much weaker than the view that a Friendly AI programmer takes of Friendly AI programming. What the human species needs from an x-risk perspective is experts on This Whole Damn Problem, who will acquire whatever skills are needed to that end. The Singularity Institute exists to host such people and enable their research — once we have enough funding to find and recruit them.  See also, How to Purchase AI Risk Reduction.

I'm pretty sure Holden has met people who think that having a whole institute to rate the efficiency of charities is pointless overhead, especially people who think that their own charity-solution is too obviously good to have to contend with busybodies pretending to specialize in thinking about 'marginal utility'.  Which Holden knows about, I would guess, from being paid quite well to think about those economic details when he was a hedge fundie, and learning from books written by professional researchers before then; and the really key point is that people who haven't studied all that stuff don't even realize what they're missing by trying to wing it.  If you don't know, you don't know what you don't know, or the cost of not knowing.  Is there a problem of figuring out who might know something you don't, if Holden insists that there's this strange new stuff called 'marginal utility' you ought to learn about?  Yes, there is.  But is someone who trusts their philanthropic dollars to be steered just by the warm fuzzies of their heart, doing something wrong?  Yes, they are.  It's one thing to say that SIAI isn't known-to-you to be doing it right - another thing still to say that SIAI is known-to-you to be doing it wrong - and then quite another thing entirely to say that there's no need for Friendly AI programmers and you know it, that anyone can see it without resorting to math or cracking a copy of AI: A Modern Approach.  I do wish that Holden would at least credit that the task SIAI is taking on contains at least as many gotchas, relative to the instinctive approach, as optimal philanthropy compared to instinctive philanthropy, and might likewise benefit from some full-time professionally specialized attention, just as our society creates trained professionals to handle any other problem that someone actually cares about getting right.

On the other side of things, Holden says that even if Friendly AI is proven and checked:

"I believe that the probability of an unfavorable outcome - by which I mean an outcome essentially equivalent to what a UFAI would bring about - exceeds 90% in such a scenario."

It's nice that this appreciates that the problem is hard.  Associating all of the difficulty with agenty proposals and thinking that it goes away as soon as you invoke tooliness is, well, of this I've already spoken. I'm not sure whether this irreducible-90%-doom assessment is based on a common straw version of FAI where all the work of the FAI programmer goes into "proving" something and doing this carefully checked proof which then - alas, poor Spock! - turns out to be no more relevant than proving that the underlying CPU does floating-point arithmetic correctly if the transistors work as stated. I've repeatedly said that the idea behind proving determinism of self-modification isn't that this guarantees safety, but that if you prove the self-modification stable the AI might work, whereas if you try to get by with no proofs at all, doom is guaranteed. My mind keeps turning up Ben Goertzel as the one who invented this caricature - "Don't you understand, poor fool Eliezer, life is full of uncertainty, your attempt to flee from it by refuge in 'mathematical proof' is doomed" - but I'm not sure he was actually the inventor. In any case, the burden of safety isn't carried just by the proof, it's carried mostly by proving the right thing. If Holden is assuming that we're just running away from the inherent uncertainty of life by taking refuge in mathematical proof, then, yes, 90% probability of doom is an understatement, the vast majority of plausible-on-first-glance goal criteria you can prove stable will also kill you.

If Holden's assessment does take into account a great effort to select the right theorem to prove - and attempts to incorporate the difficult but finitely difficult feature of meta-level error-detection, as it appears in e.g. the CEV proposal - and he is still assessing 90% doom probability, then I must ask, "What do you think you know and how do you think you know it?" The complexity of the human mind is finite; there's only so many things we want or would-want. Why would someone claim to know that proving the right thing is beyond human ability, even if "100 of the world's most intelligent and relevantly experienced people" (Holden's terms) check it over? There's hidden complexity of wishes, but not infinite complexity of wishes or unlearnable complexity of wishes. There are deep and subtle gotchas but not an unending number of them. And if that were the setting of the hidden variables - how would you end up knowing that with 90% probability in advance? I don't mean to wield my own ignorance as a sword or engage in motivated uncertainty - I hate it when people argue that if they don't know something, nobody else is allowed to know either - so please note that I'm also counterarguing from positive facts pointing the other way: the human brain is complicated but not infinitely complicated, there are hundreds or thousands of cytoarchitecturally distinct brain areas but not trillions or googols.  If humanity had two hundred years to solve FAI using human-level intelligence and there was no penalty for guessing wrong I would be pretty relaxed about the outcome.  If Holden says there's 90% doom probability left over no matter what sane intelligent people do (all of which goes away if you just build Google Maps AGI, but leave that aside for now) I would ask him what he knows now, in advance, that all those sane intelligent people will miss.  I don't see how you could (well-justifiedly) access that epistemic state.

I acknowledge that there are points in Holden's post which are not addressed in this reply, acknowledge that these points are also deserving of reply, and hope that other SIAI personnel will be able to reply to them.


My summary (now with endorsement by Eliezer!):

  • SI can be a valuable organization even if Tool AI turns out to be the right approach:
    • Skills/organizational capabilities for safe Tool AI are similar to those for Friendly AI.
    • EY seems to imply that much of SI's existing body of work can be reused.
    • Offhand remark that seemed important: Superintelligent Tool AI would be more difficult since it would have to be developed in such a way that it would not recursively self-improve.
  • Tool AI is nontrivial:
    • The number of possible plans is way too large for an AI to realistically evaluate all of them. Heuristics will have to be used to find suboptimal but promising plans.
    • The reasoning behind the plan the AI chooses might be way beyond the comprehension of the user. It's not clear how best to deal with this, given that the AI is only approximating the user's wishes and can't really be trusted to choose plans without supervision.
    • Constructing a halfway decent approximation of the user's utility function and having a model good enough to make plans with are also far from solved problems.
    • Potential Tool AI gotcha: The AI might give you a self-fulfilling negative prophecy that the AI didn't realize would harm you.
    • These are just examples. Point is, saying "but the AI will just do this!" is far removed from specifying the AI in a rigorous formal way and proving it will do that.
  • Tool AI is not obviously the way AGI should or will be developed:
    • Many leading AGI thinkers have their own pet ideas about what AGI should do. Few to none endorse Tool AI. If it were obviously correct, all the leading AGI thinkers would endorse it.
    • Actually, most modern AI applications don't involve human input, so it's not obvious that AGI will develop along Tool AI lines.
  • Full-time Friendliness researchers are worth having:
    • If nothing else, they're useful for evaluating proposals like Holden's Tool AI one to figure out if they are really sound.
    • Friendliness philosophy would be difficult to program an AI to do. Even if we thought we had a program that could do it, how would we know the answers from that program were correct? So we probably need humans.
    • Friendliness researchers need to have a broader domain of expertise than Holden gives them credit for. They need to have expertise in whatever happens to be necessary to ensure safe AI.
    • The problems of Friendliness are tricky, so laypeople should beware of jumping to conclusions about Friendliness.
  • Holden's estimate of a 90% chance of doom even given a 100 person FAI team approving the design is overly pessimistic:
    • EY is aware it's extremely difficult to know what properties about a prospective FAI need to be formally proved, and plans to put a lot of effort into figuring this out.
    • The difficulty of Friendliness is finite. The difficulties are big and subtle, but not unending.
    • Where did 90% come from? Lots of uncertainty here...
  • Holden made other good points not addressed here.

The difficulty of Friendliness is finite. The difficulties are big and subtle, but not unending.

How do we know that the problem is finite? When it comes to proving a computer program safe from being hacked, the problem is considered NP-hard. Google Chrome was recently hacked by chaining 14 different bugs together. A working AGI is probably at least as complex as Google Chrome. Proving it safe will likely also be NP-hard.

Google Chrome doesn't even self modify.

This point seems missing:

You can't get a 20-move solution out of a human brain, using the native human planning algorithm. Humanity can do it, but only by exploiting the ability of humans to explicitly comprehend the deep structure of the domain (not just rely on intuition) and then inventing an artifact, a new design, running code which uses a different and superior cognitive algorithm, to solve that Rubik's Cube in 20 moves. We do all that without being self-modifying, but it's still a capability to respect.

A system that undertakes extended processes of research and thinking, generating new ideas and writing new programs for internal experiments, seems both much more effective and much more potentially risky than something like a chess program with a simple fixed algorithm searching over a fixed narrow representation of the world (as a chess board).

When I read posts like this I feel like an independent everyman watching a political debate.

The dialogue is oversimplified and even then I don't fully grasp exactly what's being said and the implications thereof, so I can almost feel my opinion shifting back and forth with each point that sounds sort of, kinda, sensible when I don't really have the capacity to judge the statements. I should probably try and fix that.

The analogy is apt: blue-vs.-green politics aren't the only kind of politics, and debates over singularity policy have had big mind-killing effects on otherwise-pretty-rational LW folk before.

Hello,

I appreciate the thoughtful response. I plan to respond at greater length in the future, both to this post and to some other content posted by SI representatives and commenters. For now, I wanted to take a shot at clarifying the discussion of "tool-AI" by discussing AIXI. One of the issues I've found with the debate over FAI in general is that I haven't seen much in the way of formal precision about the challenge of Friendliness (I recognize that I have also provided little formal precision, though I feel the burden of formalization is on SI here). It occurred to me that AIXI might provide a good opportunity to have a more precise discussion, if in fact it is believed to represent a case of "a rare exception who specified his AGI in such unambiguous mathematical terms that he actually succeeded at realizing, after some discussion with SIAI personnel, that AIXI would kill off its users and seize control of its reward button."

So here's my characterization of how one might work toward a safe and useful version of AIXI, using the "tool-AI" framework, if one could in fact develop an efficient enough approximation of AIXI to qualify as a powerful AGI. Of course, this is just a rough outline of what I have in mind, but hopefully it adds some clarity to the discussion.

A. Write a program that

  1. Computes an optimal policy, using some implementation of equation (20) on page 22 of http://www.hutter1.net/ai/aixigentle.pdf
  2. "Prints" the policy in a human-readable format (using some fixed algorithm for "printing" that is not driven by a utility function)
  3. Provides tools for answering user questions about the policy, i.e., "What will be its effect on ___?" (using some fixed algorithm for answering user questions that makes use of AIXI's probability function, and is not driven by a utility function)
  4. Does not contain any procedures for "implementing" the policy, only for displaying it and its implications in human-readable form

B. Run the program; examine its output using the tools described above (#2 and #3); if, upon such examination, the policy appears potentially destructive, continue tweaking the program (for example, by tweaking the utility it is selecting a policy to maximize) until the policy appears safe and desirable

C. Implement the policy using tools other than the AIXI agent

D. Repeat (B) and (C) until one has confidence that the AIXI agent reliably produces safe and desirable policies, at which point more automation may be called for
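The A-through-D procedure above can be sketched as a loop. The following is a toy skeleton only, not anything resembling AIXI: every function in it is a hypothetical stand-in, and in reality computing an optimal policy, rendering it readably, and judging its safety are exactly the hard, unsolved parts under dispute.

```python
# Toy skeleton of the "tool-AI" development loop (steps A-D above).
# All names here are illustrative stand-ins, not real algorithms.

PLANS = ["do nothing", "rescue grandmother", "rescue + overwrite music"]

def compute_policy(utility):
    # Step A1 stand-in: select the plan that maximizes a (toy) utility.
    return max(PLANS, key=utility)

def render_report(policy):
    # Steps A2-A3 stand-in: display the policy rather than implement it (A4).
    return f"Proposed plan: {policy}"

def tool_ai_loop(utility, reviewer, max_rounds=10):
    for _ in range(max_rounds):
        policy = compute_policy(utility)   # A: plan
        report = render_report(policy)     # A2-A3: display to the user
        if reviewer(report) == "safe":     # B: human examines the output
            return policy                  # C: implement with other tools
        # B (continued): tweak the utility to exclude the rejected plan
        utility = lambda p, bad=policy: -1 if p == bad else 0
    raise RuntimeError("no policy passed review")  # D is never reached

# Toy run: the naive utility prefers the destructive plan; the human
# reviewer rejects it, the utility is tweaked, and a safe plan is returned.
naive_utility = lambda p: len(p)  # toy scoring: longer plan "scores" higher
def reviewer(report):
    return "unsafe" if "overwrite" in report else "safe"

chosen = tool_ai_loop(naive_utility, reviewer)  # -> "do nothing"
```

The skeleton makes the structure of the disagreement concrete: everything Holden's proposal leaves to `compute_policy`, `render_report`, and `reviewer` is where Eliezer locates the FAI-complete difficulty.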

My claim is that this approach would be superior to that of trying to develop "Friendliness theory" in advance of having any working AGI, because it would allow experiment- rather than theory-based development. Eliezer, I'm interested in your thoughts about my claim. Do you agree? If not, where is our disagreement?

Didn't see this at the time, sorry.

So... I'm sorry if this reply seems a little unhelpful, and I wish there was some way to engage more strongly, but...

Point (1) is the main problem. AIXI updates freely over a gigantic range of sensory predictors with no specified ontology - it's a sum over a huge set of programs, and we, the users, have no idea what the representations are talking about, except that at the end of their computations they predict, "You will see a sensory 1 (or a sensory 0)." (In my preferred formalism, the program puts a probability on a 0 instead.) Inside, the program could've been modeling the universe in terms of atoms, quarks, quantum fields, cellular automata, giant moving paperclips, slave agents scurrying around... we, the programmers, have no idea how AIXI is modeling the world and producing its predictions, and indeed, the final prediction could be a sum over many different representations.
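The point about AIXI's predictor being a sum over programs with no shared ontology can be illustrated with a toy Solomonoff-style mixture. This is a drastic simplification (real AIXI weights all programs on a universal Turing machine by complexity); the three "models" here are hypothetical stand-ins whose internal representations have nothing in common, yet whose weighted combination still yields a single probability for "a sensory 1":

```python
# Toy Solomonoff-style mixture predictor: each "model" is an arbitrary
# program assigning a probability to the next sensory bit. The mixture
# averages them by weight; the models' internals are opaque to us and
# need not share any ontology.

def mixture_predict(models, weights, history):
    """Predict P(next bit = 1) as a weight-averaged sum over all models."""
    total = sum(weights)
    return sum(w * m(history) for m, w in zip(models, weights)) / total

# Three toy models with completely different internal "representations":
always_one = lambda h: 1.0                               # thinks in constants
copy_last  = lambda h: float(h[-1]) if h else 0.5        # thinks in repetition
alternate  = lambda h: 1.0 - float(h[-1]) if h else 0.5  # thinks in cycles

models  = [always_one, copy_last, alternate]
weights = [0.25, 0.5, 0.25]  # e.g. complexity priors, updated by Bayes

p = mixture_predict(models, weights, [1, 0, 1])  # -> 0.75
# The models agree on nothing internally, yet the mixture outputs one
# number: the probability of seeing "a sensory 1" next.
```

Nothing in the mixture tells the user *what* any component model is representing, which is the interpretability problem Eliezer describes.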

This means that equation (20) in Hutter is written as a utility function over sense data, where the reward channel is just a special case of sense data. We can easily adapt this equation to talk about any function computed directly over sense data - we can get AIXI to optimize any aspect of its sense data that we please. We can't get it to optimize a quality of the external universe. One of the challenges I listed in my FAI Open Problems talk, and one of the problems I intend to talk about in my FAI Open Problems sequence, is to take the first nontrivial steps toward adapting this formalism - to e.g. take an equivalent of AIXI in a really simple universe, with a really simple goal, something along the lines of a Life universe and a goal of making gliders, and specify something given unlimited computing power which would behave like it had that goal, without pre-fixing the ontology of the causal representation to that of the real universe, i.e., you want something that can range freely over ontologies in its predictive algorithms, but which still behaves like it's maximizing an outside thing like gliders instead of a sensory channel like the reward channel. This is an unsolved problem!
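The gap between a utility over sense data and a utility over the external world can be made concrete with a toy version of the delusion-box scenario (this is purely illustrative, not AIXI; the outcome numbers are invented for the example):

```python
# Toy illustration of why optimizing sense data differs from optimizing
# the world. Two candidate actions in a Life-like toy universe: actually
# make gliders, or wire the reward channel high (the "delusion box").

ACTIONS = {
    # action: (gliders actually made, reward-channel reading the agent sees)
    "make_gliders":  (3, 3),
    "seize_channel": (0, 9),  # no gliders, but the *sensed* reward is maximal
}

def sense_utility(outcome):
    # Utility computed over sense data, as in equation (20): only the
    # reward-channel reading matters.
    gliders, sensed_reward = outcome
    return sensed_reward

def world_utility(outcome):
    # The utility we *wanted*: gliders existing in the world. Specifying
    # this without pre-fixing the agent's ontology is the open problem.
    gliders, sensed_reward = outcome
    return gliders

best_by_sense = max(ACTIONS, key=lambda a: sense_utility(ACTIONS[a]))
best_by_world = max(ACTIONS, key=lambda a: world_utility(ACTIONS[a]))
# best_by_sense -> "seize_channel"; best_by_world -> "make_gliders"
```

The two lines at the bottom are the whole point: the sense-data optimizer prefers the wireheaded outcome, and writing `world_utility` over an agent whose world-models range freely over ontologies is the unsolved problem named above.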

We haven't even got to the part where it's difficult to say in formal terms how to interpret what a human says s/he wants the AI to plan, and where failures of phrasing of that utility function can also cause a superhuman intelligence to kill you. We haven't even got to the huge buried FAI problem inside the word "optimal" in point (1), which is the really difficult part in the whole thing. Because so far we're dealing with a formalism that can't even represent a purpose of the type you're looking for - it can only optimize over sense data, and this is not a coincidental fact, but rather a deep problem which the AIXI formalism deliberately avoided.

(2) sounds like you think an AI with an alien, superhuman planning algorithm can tell humans what to do without ever thinking consequentialistically about which different statements will result in human understanding or misunderstanding. Anna says that I need to work harder on not assuming other people are thinking silly things, but even so, when I look at this, it's hard not to imagine that you're modeling AIXI as a sort of spirit containing thoughts, whose thoughts could be exposed to the outside with a simple exposure-function. It's not unthinkable that a non-self-modifying superhuman planning Oracle could be developed with the further constraint that its thoughts are human-interpretable, or can be translated for human use without any algorithms that reason internally about what humans understand, but this would at the least be hard. And with AIXI it would be impossible, because AIXI's model of the world ranges over literally all possible ontologies and representations, and its plans are naked motor outputs.

Similar remarks apply to interpreting and answering "What will be its effect on _?" It turns out that getting an AI to understand human language is a very hard problem, and it may very well be that even though talking doesn't feel like having a utility function, our brains are using consequential reasoning to do it. Certainly, when I write language, that feels like I'm being deliberate. It's also worth noting that "What is the effect on X?" really means "What are the effects I care about on X?" and that there's a large understanding-the-human's-utility-function problem here. In particular, you don't want your language for describing "effects" to partition, as the same state of described affairs, any two states which humans assign widely different utilities. Let's say there are two plans for getting my grandmother out of a burning house, one of which destroys her music collection, one of which leaves it intact. Does the AI know that music is valuable? If not, will it not describe music-destruction as an "effect" of a plan which offers to free up large amounts of computer storage by, as it turns out, overwriting everyone's music collection? If you then say that the AI should describe changes to files in general, well, should it also talk about changes to its own internal files? Every action comes with a huge number of consequences - if we hear about all of them (reality described on a level so granular that it automatically captures all utility shifts, as well as a huge number of other unimportant things) then we'll be there forever.

I wish I had something more cooperative to say in reply - it feels like I'm committing some variant of logical rudeness by this reply - but the truth is, it seems to me that AIXI isn't a good basis for the agent you want to describe; and I don't know how to describe it formally myself, either.

Thanks for the response. To clarify, I'm not trying to point to the AIXI framework as a promising path; I'm trying to take advantage of the unusually high degree of formalization here in order to gain clarity on the feasibility and potential danger points of the "tool AI" approach.

It sounds to me like your two major issues with the framework I presented are (to summarize):

(1) There is a sense in which AIXI predictions must be reducible to predictions about the limited set of inputs it can "observe directly" (what you call its "sense data").

(2) Computers model the world in ways that can be unrecognizable to humans; it may be difficult to create interfaces that allow humans to understand the implicit assumptions and predictions in their models.

I don't claim that these problems are trivial to deal with. And stated as you state them, they sound abstractly very difficult to deal with. However, it seems true - and worth noting - that "normal" software development has repeatedly dealt with them successfully. For example: Google Maps works with a limited set of inputs; Google Maps does not "think" like I do and I would not be able to look at a dump of its calculations and have any real sense for what it is doing; yet Google Maps does make intelligent predictions about the external universe (e.g., "following direction set X will get you from point A to point B in reasonable time"), and it also provides an interface (the "route map") that helps me understand its predictions and the implicit reasoning (e.g. "how, why, and with what other consequences direction set X will get me from point A to point B").

Difficult though it may be to overcome these challenges, my impression is that software developers have consistently - and successfully - chosen to take them on, building algorithms that can be "understood" via interfaces and iterated over - rather than trying to prove the safety and usefulness of their algorithms with pure theory before ever running them. Not only does the former method seem "safer" (in the sense that it is less likely to lead to putting software in production before its safety and usefulness has been established) but it seems a faster path to development as well.

It seems that you see a fundamental disconnect between how software development has traditionally worked and how it will have to work in order to result in AGI. But I don't understand your view of this disconnect well enough to see why it would lead to a discontinuation of the phenomenon I describe above. In short, traditional software development seems to have an easier (and faster and safer) time overcoming the challenges of the "tool" framework than overcoming the challenges of up-front theoretical proofs of safety/usefulness; why should we expect this to reverse in the case of AGI?

So first a quick note: I wasn't trying to say that the difficulties of AIXI are universal and everything goes analogously to AIXI, I was just stating why AIXI couldn't represent the suggestion you were trying to make. The general lesson to be learned is not that everything else works like AIXI, but that you need to look a lot harder at an equation before thinking that it does what you want.

On a procedural level, I worry a bit that the discussion is trying to proceed by analogy to Google Maps. Let it first be noted that Google Maps simply is not playing in the same league as, say, the human brain, in terms of complexity; and that if we were to look at the winning "algorithm" of the million-dollar Netflix Prize competition, which was in fact a blend of 107 different algorithms, you would have a considerably harder time figuring out why it claimed anything it claimed.

But to return to the meta-point, I worry about conversations that go into "But X is like Y, which does Z, so X should do reinterpreted-Z". Usually, in my experience, that goes into what I call "reference class tennis" or "I'm taking my reference class and going home". The trouble is that there's an unlimited number of possible analogies and reference classes, and everyone has a different one. I was just browsing old LW posts today (to find a URL of a quick summary of why group-selection arguments don't work in mammals) and ran across a quotation from Perry Metzger to the effect that so long as the laws of physics apply, there will always be evolution, hence nature red in tooth and claw will continue into the future - to him, the obvious analogy for the advent of AI was "nature red in tooth and claw", and people who see things this way tend to want to cling to that analogy even if you delve into some basic evolutionary biology with math to show how much it isn't like intelligent design. For Robin Hanson, the one true analogy is to the industrial revolution and farming revolutions, meaning that there will be lots of AIs in a highly competitive economic situation with standards of living tending toward the bare minimum, and this is so absolutely inevitable and consonant with The Way Things Should Be as to not be worth fighting at all. That's his one true analogy and I've never been able to persuade him otherwise. For Kurzweil, the fact that many different things proceed at a Moore's Law rate to the benefit of humanity means that all these things are destined to continue and converge into the future, also to the benefit of humanity. For him, "things that go by Moore's Law" is his favorite reference class.

I can have a back-and-forth conversation with Nick Bostrom, who looks much more favorably on Oracle AI in general than I do, because we're not playing reference class tennis with "But surely that will be just like all the previous X-in-my-favorite-reference-class", nor saying, "But surely this is the inevitable trend of technology"; instead we lay out particular, "Suppose we do this?" and try to discuss how it will work, not with any added language about how surely anyone will do it that way, or how it's got to be like Z because all previous Y were like Z, etcetera.

My own FAI development plans call for trying to maintain programmer-understandability of some parts of the AI during development. I expect this to be a huge headache, possibly 30% of total headache, possibly the critical point on which my plans fail, because it doesn't happen naturally. Go look at the source code of the human brain and try to figure out what a gene does. Go ask the Netflix Prize winner for a movie recommendation and try to figure out "why" it thinks you'll like watching it. Go train a neural network and then ask why it classified something as positive or negative. Try to keep track of all the memory allocations inside your operating system - that part is humanly understandable, but it flies past so fast you can only monitor a tiny fraction of what goes on, and if you want to look at just the most "significant" parts, you would need an automated algorithm to tell you what's significant. Most AI algorithms are not humanly understandable. Part of Bayesianism's appeal in AI is that Bayesian programs tend to be more understandable than non-Bayesian AI algorithms. I have hopeful plans to try and constrain early FAI content to humanly comprehensible ontologies, prefer algorithms with humanly comprehensible reasons-for-outputs, carefully weigh up which parts of the AI can safely be less comprehensible, monitor significant events, slow down the AI so that this monitoring can occur, and so on. That's all Friendly AI stuff, and I'm talking about it because I'm an FAI guy. I don't think I've ever heard any other AGI project express such plans; and in mainstream AI, human-comprehensibility is considered a nice feature, but rarely a necessary one.

It should finally be noted that AI famously does not result from generalizing normal software development. If you start with a map-route program and then try to program it to plan more and more things until it becomes an AI... you're doomed, and all the experienced people know you're doomed. I think there's an entry or two in the old Jargon File aka Hacker's Dictionary to this effect. There's a qualitative jump to writing a different sort of software - from normal programming where you create a program conjugate to the problem you're trying to solve, to AI where you try to solve cognitive-science problems so the AI can solve the object-level problem. I've personally met a programmer or two who've generalized their code in interesting ways, and who feel like they ought to be able to generalize it even further until it becomes intelligent. This is a famous illusion among aspiring young brilliant hackers who haven't studied AI. Machine learning is a separate discipline and involves algorithms and problems that look quite different from "normal" programming.

Thanks for the response. My thoughts at this point are that

  • We seem to have differing views of how to best do what you call "reference class tennis" and how useful it can be. I'll probably be writing about my views more in the future.
  • I find it plausible that AGI will have to follow a substantially different approach from "normal" software. But I'm not clear on the specifics of what SI believes those differences will be and why they point to the "proving safety/usefulness before running" approach over the "tool" approach.
  • We seem to have differing views of how frequently today's software can be made comprehensible via interfaces. For example, my intuition is that the people who worked on the Netflix Prize algorithm had good interfaces for understanding "why" it recommends what it does, and used these to refine it. I may further investigate this matter (casually, not as a high priority); on SI's end, it might be helpful (from my perspective) to provide detailed examples of existing algorithms for which the "tool" approach to development didn't work and something closer to "proving safety/usefulness up front" was necessary.

Canonical software development examples emphasizing "proving safety/usefulness before running" over the "tool" software development approach are cryptographic libraries and NASA space shuttle navigation.

At the time of writing this comment, there was recent furor over software called CryptoCat that didn't provide enough warnings that it had not been properly vetted by cryptographers and thus should have been assumed to be inherently insecure. Conventional wisdom and repeated warnings from the security community state that cryptography is extremely difficult to do properly, and that attempting to create your own may have catastrophic results. A similar thought and development process goes into space shuttle code.

It seems that the FAI approach to "proving safety/usefulness" is more similar to the way cryptographic algorithms are developed than the (seemingly) much faster "tool" approach, which is more akin to web development where the stakes aren't quite as high.

EDIT: I believe the "prove" approach still allows one to run snippets of code in isolation, but tends to shy away from running everything end-to-end until significant effort has gone into individual component testing.

For example: Google Maps works with a limited set of inputs; Google Maps does not "think" like I do and I would not be able to look at a dump of its calculations and have any real sense for what it is doing; yet Google Maps does make intelligent predictions about the external universe (e.g., "following direction set X will get you from point A to point B in reasonable time"), and it also provides an interface (the "route map") that helps me understand its predictions and the implicit reasoning (e.g. "how, why, and with what other consequences direction set X will get me from point A to point B").

Explaining routes is domain specific and quite simple. When you are using domain specific techniques to find solutions to domain specific problems, you can use domain specific interfaces where human programmers and designers do all the heavy lifting to figure out the general strategy of how to communicate to the user.

But if you want a tool AGI that finds solutions in arbitrary domains, you need a cross-domain solution for communicating the tool AGI's plans to the user. This is a harder problem than showing a route on a map to the same degree that cross-domain AGI is a harder problem than computing the routes. Instead of figuring out how to plot road-tracing curves on a map, the programmer has to figure out how to get the computer to figure out that displaying a map with the route traced over it is a useful thing to do, in a way that generalizes to figuring out other useful things to do to communicate answers to other types of questions. And among the hard subproblems of programming computers to find useful things to do in general is specifying the meaning of "useful". If that is done poorly, the tool AGI tries to trick the user into accepting plans that achieve some value-negating distortion of what we actually want, instead of giving information that helps provide a good evaluation. Doing this right requires solving the same problems required to do FAI right.

To clarify, for everyone:

There are now three "major" responses from SI to Holden's Thoughts on the Singularity Institute (SI): (1) a comments thread on recent improvements to SI as an organization, (2) a post series on how SI is turning donor dollars into AI risk reduction and how it could do more of this if it had more funding, and (3) Eliezer's post on Tool AI above.

At least two more major responses from SI are forthcoming: a detailed reply to Holden's earlier posts and comments on expected value estimates (e.g. this one), and a long reply from me that summarizes my responses to all (or almost all) of the many issues raised in Thoughts on the Singularity Institute (SI).

Software that does happen to interface with humans is selectively visible and salient to humans, especially the tiny part of the software that does the interfacing; but this is a special case of a general cost/benefit tradeoff which, more often than not, turns out to swing the other way, because human advice is either too costly or doesn't provide enough benefit.

I suspect this is the biggest counter-argument for Tool AI, even bigger than all the technical concerns Eliezer made in the post. Even if we could build a safe Tool AI, somebody would soon build an agent AI anyway.

My five cents on the subject, from something that I'm currently writing:

Like with external constraints, Oracle AI suffers from the problem that there would always be an incentive to create an AGI that could act on its own, without humans in the loop. Such an AGI would be far more effective in furthering whatever goals it had been built to pursue, but also far more dangerous.

Current-day narrow-AI technology includes high-frequency trading (HFT) algorithms, which make trading decisions within fractions of a second, far too fast to keep humans in the loop. HFT seeks to make a very short-term profit, but even traders looking for a longer-term investment benefit from being faster than their competitors. Market prices are also very effective at incorporating various sources of knowledge (Hanson 2007). As a consequence, a trading algorithm’s performance might be improved both by making it faster and by making it more capable of integrating various sources of knowledge. Since trading is also one of the fields with the most money involved, it seems like a reasonable presumption that most advances towards general AGI will quickly be put to use into making more money on the financial markets, with little opportunity for a human to vet all the decisions. Oracle AIs are unlikely to remain as pure oracles for long.

In general, any broad domain involving high stakes, adversarial decision-making, and a need to act rapidly is likely to become increasingly dominated by autonomous systems. The extent to which the systems will need general intelligence will depend on the domain, but many domains such as warfare, information security and fraud detection could plausibly make use of all the intelligence they can get. This is especially the case if one’s opponents in the domain are also using increasingly autonomous A(G)I, leading to an arms race where one might have little choice but to give increasing amounts of control to A(G)I systems.

From the same text, also related to Eliezer's points:

Even if humans were technically kept in the loop, they might not have the time, opportunity, or motivation to verify the advice given by an Oracle AI. This may be a danger even with more narrow-AI systems. Friedman & Kahn (1992) discuss this risk in the context of APACHE, a computer expert system that provides doctors with advice regarding treatments. They write that as the medical community starts to trust APACHE, it may become practice to act on APACHE’s recommendations somewhat automatically, and it may become increasingly difficult to challenge the “authority” of the recommendation. Eventually, the consultation system may in effect begin to dictate clinical decisions.

Likewise, Bostrom & Yudkowsky (2011) point out that modern bureaucrats often follow established procedures to the letter, rather than exercising their own judgment and allowing themselves to be blamed for any mistakes that follow. Dutifully following all the recommendations of an AGI system would be an even better way of avoiding blame.

Thus, even AGI systems that function purely to provide advice will need to be explicitly designed as safe in the sense of not providing advice that would go against human values (Wallach & Allen 2009). This requires a way of teaching them the correct values.

Marcus Hutter is a rare exception who specified his AGI in such unambiguous mathematical terms that he actually succeeded at realizing, after some discussion with SIAI personnel, that AIXI would kill off its users and seize control of its reward button.

Marcus Hutter denies ever having said that.

I asked EY for how to proceed, with his approval these are the messages we exchanged:

Eliezer,

I am unsure how to proceed and would appreciate your thoughts on resolving this situation:

In your Reply to Holden on 'Tool AI', to me one of the central points, and the one that much credibility hinges on is this:

[Initial quote of this comment]

That and some other "quotes" and allusions to Hutter, the most recent one by Carl Shulman [I referred to this: "The informal argument that AIXI would accept a delusion box to give itself maximal sensory reward was made by Eliezer a while ago, and convinced the AIXI originators." which I may have mistakenly attributed to M.H. since he is the AIXI originator], that are attributed to M.H. seemed to be greatly at odds with my experiences with the man, so I asked Carl Shulman for sourcing them, and he had this to say:

"I recall overhearing part of a conversation at the Singularity Summit a few years ago between Eliezer and Schmidhuber, with pushback followed later by agreement. It may have been initial misunderstanding, but it looked consistent with Eliezer's story."

At this point I asked Marcus himself, whom I know peripherally, for general clarification.

Marcus linked to the relevant sentence quoted above from your Reply to Holden and stated in unambiguous terms that he never said that. Further, he stated that while he does not have the time to engage in such discussions, he authorised me to set the record straight.

I'm sure you realize what this seems to look like (note: not a harmless misunderstanding, though that is possible).

[Redacted personal info] ... and though I currently would not donate to your specific cause, I am also wary of the potential ramifications of you wrongly quoting Hutter for support. On the other hand, I don't feel a silent edit would do justice to the situation.

If you have a constructive idea of how to settle this issue, please let me know.

EY's response:

Info like this should be posted, and you can quote me on that part too. I did notice a tendency of Marcus Hutter to unconvince himself of things and require reconvincing, and the known experimental fragility of human memory (people believe they have always believed their conclusions, more or less by default) suggests that this is an adequate explanation for everything, especially if Carl Shulman remembers a similar conversation from his viewpoint. It is also obviously possible that I have misremembered and then caused false memory in Carl. I do seem to recall pretty strongly that Hutter invented the Azathoth Equation (for an AIXI variant with an extremely high exponential discount, so it stays in its box pressing its button so long as it doesn't anticipate being disturbed in the next 5 seconds) in response to this acknowledged concern, and I would be surprised if Hutter doesn't remember the actual equation-proposal. My ideal resolution would be for Hutter and I to start over with no harm, no foul on both sides and do a Bloggingheads about it so that there's an accessible record of the resulting dialogue. Please feel free to post your entire comment along with this entire response.

I apologise for the confusion: Carl Shulman was actually referring to overhearing a conversation with Schmidhuber (again, since his initial quote referred to just "AIXI originators", I pattern-matched that to M.H.), so disregard EY's remark about potentially causing false memories on Carl Shulman's part.

However, the main point of M.H. contradicting what is attributed to him in the Reply to Holden on 'Tool AI' stands.

For full reference, linking the relevant part of M.H.'s email:

[This part is translated, thus paraphrased:] I don't have time to participate in blog discussions, I do know there's a quote of me floating around: [Link to initial quote of this comment along with its text]

[Other than incorporating the links he provided into the Markdown Syntax, this following part is verbatim:]

I never said that. These are mainly open questions.

See e.g. Sec.5 of One Decade of Universal Artificial Intelligence In Theoretical Foundations of Artificial General Intelligence (2012) 67?--88? and references therein (in particular to Laurent Orseau) for social questions regarding AIXI.

See also Can Intelligence Explode? Journal of Consciousness Studies, 19:1-2 (2012) 143-166 for a discussion of AIXI in relation to the Singularity.

I also recommend you subscribe to the Mathematical Artificial General Intelligence Consortium (MAGIC) mailing list for a more scientific discussion on these and related issues.

Before taking any more of his time, and since he does not agree with the initial quote (at least now; whether he did back then is in dispute), I suggest that the "Reply to Holden on Tool AI" be amended to reflect that. Further, I suggest referring instead to the sources he gave for a more thorough examination of his views re: AIXI.

I think you're starting to write more like a Friendly AI. This is totally a good thing.

Yes, the tone of this response should be commended.

Probably nothing new, but I just wanted to note that when you couple two straightforward Google tools, Maps and a large enough fleet of self-driving cars, they are likely to unintentionally agentize by shaping the traffic.

For example, the goal of each is to optimize fuel economy and driving time, so the routes Google cars take would depend on the expected traffic volume, as predicted by Maps access, among other things. Similarly, Maps would know where these cars are or will be at a given time, and would adjust its output accordingly (possibly as a user option). An optimization strategy might easily arise that gives Google cars preference over other cars in order to minimize, say, overall emission levels. This could easily be seen as unfriendly by a regular Maps user, but as friendly by the municipality.

Similar scenarios would pop up in many cases where, in EE speak, a tool gains an intentional or a parasitic feedback loop, whether positive or negative. As anyone who has dealt with music amps knows, this feedback appears spontaneously and is often very difficult to track down. In a sense, a tool as simple as an amp can agentize and drown out the intended signal. As the tool's complexity grows, so do the odds of parasitic feedback. Coupling multiple "safe" tools together increases such odds exponentially.

And if the preference function was just over the human's 'goodness' of the end result, rather than the accuracy of the human's understanding of the predictions, the AI might tell you something that was predictively false but whose implementation would lead you to what the AI defines as a 'good' outcome. And if we ask how happy the human is, the resulting decision procedure would exert optimization pressure to convince the human to take drugs, and so on.

I was under the impression that Holden's suggestion was more along the lines of: Make a model of the world. Remove the user from the model and replace it with a similar user that will always do what you recommend. Then manipulate this user so that it achieves its objective in the model, and report the actions that you have the user do in the model to the real user.

Thus, if the objective was to make the user happy, the Google Maps AGI would simply instruct the user to take drugs, rather than tricking him into doing so, because such instruction is the easiest way to manipulate the user in the model that the Google Maps AGI is optimizing in.

Actually, the easiest output for the AI in that case is "be happy."

This is the first time I can recall Eliezer giving an overt indication of how likely an AGI project is to doom us. He suggests that a 90% chance of Doom given intelligent effort is unrealistically high. Previously I had only seen him declare that FAI is worth attempting once you multiply. While he still hasn't given numbers (not saying he should), he has given a bound. Interesting. And perhaps a little more optimistic than I expected, or at least more optimistic than I would have expected prior to Luke's comment.

how likely an AGI project is to doom us

Isn't it more like "how likely a formally proven FAI design is to doom us", since this is what Holden seems to be arguing (see his quote below)?

Suppose that it is successful in the "AGI" part of its goal, i.e., it has successfully created an intelligence vastly superior to human intelligence and extraordinarily powerful from our perspective. Suppose that it has also done its best on the "Friendly" part of the goal: it has developed a formal argument for why its AGI's utility function will be Friendly, it believes this argument to be airtight, and it has had this argument checked over by 100 of the world's most intelligent and relevantly experienced people. [...] What will be the outcome?

There are two ways to read Holden's claim about what happens if 100 experts check the proposed FAI safety proof. On one reading, Holden is saying that if 100 experts check it and say, "Yes, I am highly confident that this is in fact safe," then activating the AI kills us all with 90% probability. On the other reading, Holden is saying that even if 100 experts do their best to find errors and say, "No, I couldn't identify any way in which this will kill us, though that doesn't mean it won't kill us," then activating the AI kills us all with 90% probability. I think the first reading is very implausible. I don't believe the second reading, but I don't think it's obviously wrong. I think the second reading is the more charitable and relevant one.

Commentary (there will be a lot of "to me"s because I have been a bystander to this exchange so far):

I think this post misunderstands Holden's point, because it looks like it's still talking about agents. Tool AI, to me, is a decision support system: I tell Google Maps where I will start from and where I want to end up, and it generates a route using its algorithm. Similarly, I could tell Dr. Watson my medical data, and it will supply a diagnosis and a treatment plan that has a high score based on the utility function I provide.

In neither case are the skills of "looking at the equations and determining real-world consequences" that necessary. There are no dark secrets lurking in the soul of A*. Indeed, that might be the heart of the issue: tool AI might be those situations where you can make a network that represents the world, identify two nodes, and call your optimization algorithm of choice to determine the best actions to choose to attempt to make it from the start node to the end node.
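That "make a network, identify two nodes, call your optimizer" recipe can be made concrete in a few lines. Here is a minimal sketch using Dijkstra's algorithm (a simpler cousin of A* with no heuristic); the road network and costs are made-up stand-ins, not anything from Google Maps:

```python
import heapq

def shortest_route(graph, start, goal):
    """Dijkstra's algorithm: find the cheapest path from start to goal.

    graph maps each node to a list of (neighbor, cost) pairs.
    Returns (total_cost, path), or (float('inf'), []) if goal is unreachable.
    """
    frontier = [(0, start, [start])]  # (cost so far, node, path taken)
    visited = set()
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, step_cost in graph.get(node, []):
            if neighbor not in visited:
                heapq.heappush(frontier, (cost + step_cost, neighbor, path + [neighbor]))
    return float('inf'), []

# A toy road network. The tool only reports the route; a human decides what to do with it.
roads = {
    'home': [('highway', 5), ('backstreets', 2)],
    'highway': [('office', 3)],
    'backstreets': [('office', 9)],
}
cost, route = shortest_route(roads, 'home', 'office')
# cost == 8, route == ['home', 'highway', 'office']
```

The point of the sketch is exactly the commenter's: every step here is transparent graph search over an explicit model, with no optimization pressure pointed at the user.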

Reducing the world to a network is really hard. Determining preferences between outcomes is hard. But Tool AI looks to me like saying "well, the whole world is really too much. I'm just going to deal with planning routes, which is a simple world that I can understand," where the FAI tools aren't that relevant. The network might be out of line with reality, the optimization algorithm might be buggy or clumsy, but the horror stories that keep FAI researchers up at night seem impossible because of the inherently limited scope, and the ability to do dry runs and simulations until the AI's model of reality is trusted enough to give it control.

Now, this requires that AI only be used for things like planning where to put products on shelves, not planning corporate strategy. But if you work from the current stuff up rather than from the God algorithm down, it doesn't look like corporate strategy will be on the table until AI is developed to the point where it could be trusted with that. If someone gave me a black box that spit out plans based on English input, then I wouldn't trust it, and I imagine you wouldn't either; but I don't think that's what we're looking at, and I don't know if planning for that scenario is valuable.

It seems to me that SI has discussed Holden's Tool AI idea- when it made the distinction between AI and AGI. Holden seems to me to be asking "well, if AGI is such a tough problem, why even do it?".

Holden explicitly said that he was talking about AGI in his dialogue with Jaan Tallinn:

Jaan: so GMAGI would -- effectively -- still be a narrow AI that's designed to augment human capabilities in particularly strategic domains, while not being able to perform tasks such as programming. also, importantly, such GMAGI would not be able to make non-statistical (ie, individual) predictions about the behaviour of human beings, since it is unable to predict their actions in domains where it is inferior.

Holden: [...] I don't think of the GMAGI I'm describing as necessarily narrow - just as being such that assigning it to improve its own prediction algorithm is less productive than assigning it directly to figuring out the questions the programmer wants (like "how do I develop superweapons"). There are many ways this could be the case.

Jaan: [...] i stand corrected re the GMAGI definition -- from now on let's assume that it is a full blown AGI in the sense that it can perform every intellectual task better than the best of human teams, including programming itself.

Minor point from Nick Bostrom: an agent AI may be safer than a tool AI, because if something goes unexpectedly wrong, then an agent with safe goals should turn out to be better than a non-agent whose behaviour would be unpredictable.

Also, an agent with safer goals than humans have (which is a high bar, but not nearly as high a bar as some alternatives) is safer than humans with equivalently powerful tools.

I think it's a pity that we're not focusing on what we could do to test the tool vs general AI distinction. For example, here's one near-future test: how do we humans deal with drones?

Drones are exploding in popularity, are increasing their capabilities constantly, and are coveted by countless security agencies and private groups for their tremendous use in all sorts of roles both benign and disturbing. Just like AIs would be. The tool vs general AI distinction maps very nicely onto drones as well: a tool AI corresponds to a drone being manually flown by a human pilot somewhere, while a general AI would correspond to an autonomous drone which is carrying out some mission (blast insurgents?).

So, here is a near-future test of the question 'are people likely to let tool AIs 'drive themselves' for greater efficiency?' - simply ask whether in, say, a decade there are autonomous drones carrying tasks that now would only be carried out by piloted drones.

If in a decade we learn that autonomous drones are killing people, then we have an answer to our tool AI question: it doesn't matter because given a tool AI, people will just turn it into a general AI.

(Amdahl's law: if the human in the loop takes up 10% of the time, and the AI or drone part comprises the other 90%, then even if the drone or AI becomes infinitely fast, the whole loop can never run more than 10x as fast, i.e. you never save more than 90% of the time... until you hand over that 10% to the AI, that is. EDIT: See also https://web.archive.org/web/20121122150219/http://lesswrong.com/lw/f53/now_i_appreciate_agency/7q4o )
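The Amdahl's-law arithmetic above can be checked numerically. A small sketch (the 10%/90% split and the speedup factors are the illustrative numbers from the comment, not measurements):

```python
def amdahl_speedup(serial_fraction, accel_factor):
    """Overall speedup when only the non-serial fraction is accelerated.

    serial_fraction: share of the loop that stays with the human (e.g. 0.10).
    accel_factor: factor by which the remaining work is sped up.
    """
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / accel_factor)

# Human keeps 10% of the loop; speed up the AI part 10x, 100x, then effectively infinitely.
for factor in (10, 100, 1e9):
    print(f"{factor:>10g}x AI speedup -> {amdahl_speedup(0.10, factor):.2f}x overall")
# Overall speedup approaches 10x (at most 90% of the time saved) but never exceeds it,
# as long as the human's 10% stays in the loop.
```

This is why the pressure to remove the human bottleneck grows as the automated part gets faster: past a certain point, the human is the only remaining term in the denominator.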