Three Approaches to "Friendliness"

I put "Friendliness" in quotes in the title, because I think what we really want, and what MIRI seems to be working towards, is closer to "optimality": create an AI that minimizes the expected amount of astronomical waste. In what follows I will continue to use "Friendly AI" to denote such an AI since that's the established convention.

I've often stated my objections to MIRI's plan to build an FAI directly (instead of after human intelligence has been substantially enhanced). But my objection is not, as some have suggested while criticizing MIRI's FAI work, that we can't foresee what problems need to be solved. Rather, I think we can largely foresee what kinds of problems need to be solved to build an FAI, but they all look superhumanly difficult, either due to their inherent difficulty, or the lack of opportunity for "trial and error", or both.

When people say they don't know what problems need to be solved, they may be mostly talking about "AI safety" rather than "Friendly AI". If you think in terms of "AI safety" (i.e., making sure some particular AI doesn't cause a disaster) then that does look like a problem that depends on what kind of AI people will build. "Friendly AI", on the other hand, is really a very different problem, where we're trying to figure out what kind of AI to build in order to minimize astronomical waste. I suspect this may explain the apparent disagreement, but I'm not sure. I'm hoping that explaining my own position more clearly will help figure out whether there is a real disagreement, and what's causing it.

The basic issue I see is that there is a large number of serious philosophical problems facing an AI that is meant to take over the universe in order to minimize astronomical waste. The AI needs a full solution to moral philosophy to know which configurations of particles/fields (or perhaps which dynamical processes) are most valuable and which are not. Moral philosophy in turn seems to have dependencies on the philosophy of mind, consciousness, metaphysics, aesthetics, and other areas. The FAI also needs solutions to many problems in decision theory, epistemology, and the philosophy of mathematics, in order to not be stuck with making wrong or suboptimal decisions for eternity. These essentially cover all the major areas of philosophy.

For an FAI builder, there are three ways to deal with the presence of these open philosophical problems, as far as I can see. (There may be other ways for the future to turn out well without the AI builders making any special effort, for example if being philosophical is just a natural attractor for any superintelligence, but I don't see any way to be confident of this ahead of time.) I'll name them for convenient reference, but keep in mind that an actual design may use a mixture of approaches.

  1. Normative AI - Solve all of the philosophical problems ahead of time, and code the solutions into the AI.
  2. Black-Box Metaphilosophical AI - Program the AI to use the minds of one or more human philosophers as a black box to help it solve philosophical problems, without the AI builders understanding what "doing philosophy" actually is.
  3. White-Box Metaphilosophical AI - Understand the nature of philosophy well enough to specify "doing philosophy" as an algorithm and code it into the AI.
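
To make the distinction a bit more concrete, here is a minimal illustrative sketch (in Python, with entirely hypothetical names) of where the philosophical work lives in each approach; it is only a schematic of the taxonomy, not a proposal for how any of these would actually be built.

```python
from abc import ABC, abstractmethod


class FAIDesign(ABC):
    """Common interface: every design must somehow answer philosophical questions."""

    @abstractmethod
    def resolve(self, problem: str):
        ...


class NormativeAI(FAIDesign):
    """All philosophical problems are solved ahead of time and hard-coded."""

    def __init__(self, precomputed_answers: dict):
        # Must already cover every problem the AI will ever encounter.
        self.answers = precomputed_answers

    def resolve(self, problem: str):
        return self.answers[problem]  # fails on anything the builders didn't anticipate


class BlackBoxMetaphilosophicalAI(FAIDesign):
    """Philosophy is deferred to an opaque emulation of human philosophers."""

    def __init__(self, emulated_philosophers):
        self.oracle = emulated_philosophers  # e.g. a WBE; internals not understood

    def resolve(self, problem: str):
        return self.oracle.deliberate(problem)


class WhiteBoxMetaphilosophicalAI(FAIDesign):
    """'Doing philosophy' is specified as an explicit, understood algorithm."""

    def __init__(self, philosophy_algorithm):
        self.algorithm = philosophy_algorithm  # the part nobody currently knows how to write

    def resolve(self, problem: str):
        return self.algorithm(problem)
```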

The problem with Normative AI, besides the obvious inherent difficulty (as evidenced by the slow progress of human philosophers after decades, sometimes centuries of work), is that it requires us to anticipate all of the philosophical problems the AI might encounter in the future, from now until the end of the universe. We can certainly foresee some of these, like the problems associated with agents being copyable, or the AI radically changing its ontology of the world, but what might we be missing?

Black-Box Metaphilosophical AI is also risky, because it's hard to test/debug something that you don't understand. Besides that general concern, designs in this category (such as Paul Christiano's take on indirect normativity) seem to require that the AI achieve superhuman levels of optimizing power before being able to solve its philosophical problems, which seems to mean that a) there's no way to test them in a safe manner, and b) it's unclear why such an AI won't cause disaster in the time period before it achieves philosophical competence.

White-Box Metaphilosophical AI may be the most promising approach. There is no strong empirical evidence that solving metaphilosophy is superhumanly difficult, simply because not many people have attempted to solve it. But I don't think that a reasonable prior combined with what evidence we do have (i.e., absence of visible progress or clear hints as to how to proceed) gives much hope for optimism either.

To recap, I think we can largely already see what kinds of problems must be solved in order to build a superintelligent AI that will minimize astronomical waste while colonizing the universe, and it looks like they probably can't be solved correctly with high confidence until humans become significantly smarter than we are now. I think I understand why some people disagree with me (e.g., Eliezer thinks these problems just aren't that hard, relative to his abilities), but I'm not sure why some others say that we don't yet know what the problems will be.

Comments

The difficulty is still largely due to the security problem. Without catastrophic risks (including UFAI and value drift), we could take as much time as necessary and/or go with making people smarter first.

The aspect of FAI that is supposed to solve the security problem is optimization power aimed at correct goals. Optimization power addresses the "external" threats (and ensures progress), and correctness of goals represents "internal" safety. If an AI has sufficient optimization power, the (external) security problem is taken care of, even if the goals are given by a complicated definition that the AI is unable to evaluate at the beginning: it'll protect the original definition even without knowing what it evaluates to, and aim to evaluate it (for instrumental reasons).

This suggests that a minimal solution is to pack all the remaining difficulties into the AI's goal definition, at which point the only object-level problems are to figure out what a sufficiently general notion of "goal" is (decision theory; the aim of this part is to give the goal definition sufficient expressive power, to avoid constraining its decisions while extracting the optimization part), how to build an AI that follows a goal definition and is at least competitive in its optimization power, and how to compose the goal definition. The simplest idea for the goal definition seems to be some kind of WBE-containing program, so learning to engineer stable WBE superorganisms might be relevant for this part (UFAI and value drift will remain a problem, but might be easier to manage in this setting).
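
(A toy sketch, with hypothetical names, of the shape of this proposal: the goal is handed to the agent as an opaque, expensive program, and until the agent can afford to evaluate that program its only sensible behavior is to preserve the definition and accumulate the resources needed to eventually evaluate it.)

```python
# A toy sketch (hypothetical names) of "pack the remaining difficulties into
# the goal definition": the goal is an opaque, expensive program (e.g. a
# WBE-containing research process) that the agent cannot yet evaluate.

class GoalDefinition:
    def __init__(self, program, required_compute: int):
        self.program = program
        self.required_compute = required_compute

    def evaluate(self, available_compute: int):
        # Returns an explicit utility function once enough resources exist;
        # until then the agent only knows the definition, not its value.
        if available_compute < self.required_compute:
            return None
        return self.program()


class Agent:
    def __init__(self, goal: GoalDefinition):
        self.goal = goal

    def choose_action(self, available_compute: int):
        utility = self.goal.evaluate(available_compute)
        if utility is None:
            # Instrumental phase: protect the (unevaluated) goal definition
            # and acquire the resources needed to evaluate it.
            return "preserve goal definition; acquire computing resources"
        # Terminal phase: optimize whatever the goal definition evaluated to.
        return f"optimize {utility}"
```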

(It might be also good to figure out how to pack a reference to the state of the Earth at a recent point in time into the goal definition, so that the AI has an instrumental drive to capture its state when it still doesn't understand its goals and so will probably use the Earth itself for something else; this might then also lift the requirement of having WBE tech in order to construct the goal definition.)

Without catastrophic risks (including UFAI and value drift), we could take as much time as necessary and/or go with making people smarter first.

You appear to be operating under the assumption that it's already too late or otherwise impractical to "go with making people smarter first", but I don't see why, compared to "build FAI first".

Human cloning or embryo selection look like parallelizable problems that would be easily amenable to the approach of "throwing resources at them". They consist of a bunch of basic science and engineering problems, which humans are generally pretty good at, compared to the kind of philosophical problems that need to be solved for building FAI. Nor do we have to get all those problems right on the first try or face existential disaster. Nor is intelligence enhancement known to be strictly harder than building UFAI (whereas FAI is, since solving FAI requires solving AGI as a subproblem). And there must be many other research directions that could be funded in addition to these two. All it would take is for some government, or maybe even a large corporation or charitable organization, to take the problem of "astronomical waste" seriously (again referring to a more general concept than Bostrom's, which I wish had its own established name).

If it's not already too late or impractical to make people smarter first (and nobody has made a case that it is, as far as I know) then FAI work has the counterproductive consequence of making it harder to make people smarter first (by shortening AI timelines). MIRI and other FAI advocates do not seem to have taken this into account adequately.

My point was that when we expand on "black box metaphilosophical AI", it seems to become much less mysterious than the whole problem, we only need to solve decision theory and powerful optimization and maybe (wait for) WBE. If we can pack a morality/philosophy research team into the goal definition, the solution of the friendliness part can be deferred almost completely to after the current risks are eliminated, at which point the team will have a large amount of time to solve it.

(I agree that building smarter humans is a potentially workable point of intervention. This needs a champion to at least outline the argument, but actually making this happen will be much harder.)

My point was that when we expand on "black box metaphilosophical AI", it seems to become much less mysterious than the whole problem, we only need to solve decision theory and powerful optimization and maybe (wait for) WBE.

I think I understand the basic motivation for pursuing this approach, but what's your response to the point I made in the post, that such an AI has to achieve superhuman levels of optimizing power, in order to acquire enough computing power to run the WBE, before it can start producing philosophical solutions, and therefore there's no way for us to safely test it to make sure that the "black box" would produce sane answers as implemented? It's hard for me to see how we can get something this complicated right on the first try.

The black box is made of humans and might be tested the usual way when (human-designed) WBE tech is developed. The problem of designing its (long term) social organisation might also be deferred to the box. The point of the box is that it can be made safe from external catastrophic risks, not that it represents any new progress towards FAI.

The AI doesn't produce philosophical answers, the box does, and the box doesn't contain novel/dangerous things like AIs. This only requires solving the separate problems of having AI care about evaluating a program, and preparing a program that contains people who would solve the remaining problems (and this part doesn't involve AI). The AI is something that can potentially be theoretically completely understood and it can be very carefully tested under controlled conditions, to see that it does evaluate simpler black boxes that we also understand. Getting decision theory wrong seems like a more elusive risk.

The black box is made of humans and might be tested the usual way when (human-designed) WBE tech is developed.

Ok, I think I misunderstood you earlier, and thought that your idea was similar to Paul Christiano's, where the FAI would essentially develop the WBE tech instead of us. I had also suggested waiting for WBE tech before building FAI (although due to a somewhat different motivation), and in response someone (maybe Carl Shulman?) argued that brain-inspired AGI or low-fidelity brain emulations would likely be developed before high-fidelity brain emulations, which means the FAI would probably come too late if it waited for WBE. This seems fairly convincing to me.

Waiting for WBE is risky in many ways, but I don't see a potentially realistic plan that doesn't go through it, even if we have (somewhat) smarter humans. This path (and many variations, such as a WBE superorg just taking over "manually" and not leaving anyone else with access to the physical world) I can vaguely see working, solving the security/coordination problem, if all goes right; other paths seem much more speculative to me (but many are worth trying, given resources; if somehow possible to do reliably, AI-initiated WBE when there is no human-developed WBE would be safer).

"create an AI that minimizes the expected amount of astronomical waste"

Of course, this is still just a proxy measure... say that we're "in a simulation", or that there are already superintelligences in our environment who won't let us eat the stars, or something like that—we still want to get as good a bargaining position as we possibly can, or to coordinate with the watchers as well as we possibly can, or in a more fundamental sense we want to not waste any of our potential, which I think is the real driving intuition here. (Further clarifying and expanding on that intuition might be very valuable, both for polemical reasons and for organizing some thoughts on AI strategy.) I cynically suspect that the stars aren't out there for us to eat, but that we can still gain a lot of leverage over the acausal fanfic-writing commun... er, superintelligence-centered economy/ecology, and so, optimizing the hell out of the AGI that might become an important bargaining piece and/or plot point is still the most important thing for humans to do.

Metaphilosophical AI

The thing I've seen that looks closest to white-box metaphilosophical AI in the existing literature is Eliezer's causal validity semantics, or more precisely the set of intuitions Eliezer was drawing on to come up with the idea of causal validity semantics. I would recommend reading the section Story of a Blob and the sections on causal validity semantics in Creating Friendly AI. Note that philosophical intuitions are a fuzzily bordered subset of justification-bearing (i.e. both moral/values-like and epistemic) causes that are theoretically formally identifiable and are traditionally thought of as having a coherent, lawful structure.

we still want to get as good a bargaining position as we possibly can, or to coordinate with the watchers as well as we possibly can, or in a more fundamental sense we want to not waste any of our potential, which I think is the real driving intuition here

It seems that we have more morally important potential in some possible worlds than others, and although we don't want our language to commit us to the view that we only have morally important potential in possible worlds where we can prevent astronomical waste, neither do we want to suggest (as I think "not waste any of our potential" does) the view that we have the same morally important potential everywhere and that we should just minimize the expected fraction of our potential that is wasted. A more neutral way of framing things could be "minimize wasted potential, especially if the potential is astronomical", leaving the strength of the "especially" to be specified by theories of how much one can affect the world from base reality vs simulations and zoos, theories of how to deal with moral uncertainty, and so on.

I completely understand your intuition but don't entirely agree; this comment might seem like quibbling: Having access to astronomical resources is one way to have a huge good impact, but I'm not sure we know enough about moral philosophy or even about what an acausal economy/ecology might look like to be sure that the difference between a non-astronomical possible world and an astronomical possible world is a huge difference. (For what it's worth, my primary intuition here is "the multiverse is more good-decision-theory-limited/insight-limited than resource-limited". I'd like to expand on this in a blog post or something later.) Obviously we should provisionally assume that the difference is huge, but I can see non-fuzzy lines of reasoning that suggest that the difference might not be much.

Because we might be wrong about the relative utility of non-astronomical possible worlds it seems like when describing our fundamental driving motivations we should choose language that is as agnostic as possible, in order to have a strong conceptual foundation that isn't too contingent on our provisional best guess models. E.g., take the principle of decision theory that says we should focus more on worlds that plausibly seem much larger even if it might be less probable that we're in those worlds: the underlying, non-conclusion-contingent reasons that drive us to take considerations and perspectives such as that one into account are the things we should be putting effort into explaining to others and making clear to ourselves.

Of course, this is still just a proxy measure... say that we're "in a simulation", or that there are already superintelligences in our environment who won't let us eat the stars, or something like that—we still want to get as good a bargaining position as we possibly can, or to coordinate with the watchers as well as we possibly can, or in a more fundamental sense we want to not waste any of our potential, which I think is the real driving intuition here.

Agreed. I was being lazy and using "astronomical waste" as a pointer to this more general concept, probably because I was primed by people talking about "astronomical waste" a bunch recently.

Further clarifying and expanding on that intuition might be very valuable, both for polemical reasons and for organizing some thoughts on AI strategy.

Also agreed, but I currently don't have much to add to what's already been said on this topic.

The thing I've seen that looks closest to white-box metaphilosophical AI in the existing literature is Eliezer's causal validity semantics, or more precisely the set of intuitions Eliezer was drawing on to come up with the idea of causal validity semantics.

Ugh, I found CFAI largely impenetrable when I first read it, and have the same reaction reading it now. Can you try translating the section into "modern" LW language?

CFAI is deprecated for a reason, I can't read it either.

because I think what we really want, and what MIRI seems to be working towards, is closer to "optimality": create an AI that minimizes the expected amount of astronomical waste.

Astronomical waste is a very specific concept arising from a total utilitarian theory of ethics. That this is "what we really want" seems highly unobvious to me; as someone who leans towards negative utilitarianism, I would personally reject it.

Doesn't negative utilitarianism present us with the analogous challenge of preventing "astronomical suffering", which requires an FAI to have solutions to the same philosophical problems mentioned in the post? I guess I was using "astronomical waste" as short for "potentially large amounts of negative value compared to what's optimal" but if it's too much associated with total utilitarianism then I'm open to suggestions for a more general term.

I'd be happy with an AI that makes people on Earth better off without eating the rest of the universe, and gives us the option to eat the universe later if we want to...

If the AI doesn't take over the universe first, how will it prevent Malthusian uploads, burning of the cosmic commons, private hell simulations, and such?

Those things you want to prevent are all caused by humans, so the AI on Earth can directly prevent them. The rest of the universe is only relevant if you think that there are other optimizers out there, or if you want to use it, probably because you are a total utilitarian. But even the small chance of another optimizer suggests that anyone would eat the universe.

Yes, you could probably broaden the concept to cover negative utilitarianism as well, though Bostrom's original article specifically defined astronomical waste as being

an opportunity cost: a potential good, lives worth living, is not being realized.

That said, even if you did redefine the concept in the way that you mentioned, the term "astronomical waste" still implies an emphasis on taking over the universe - which is compatible with negative utilitarianism, but not necessarily every ethical theory. I would suspect that most people's "folk morality" would say something like "it's important to fix our current problems, but expanding into space is morally relevant only as far as it affects the primary issues" (with different people differing on what counts as a "primary issue").

I'm not sure whether you intended the emphasis on space expansion to be there, but if it was incidental, maybe you rather meant something like

I put "Friendliness" in quotes in the title, because I think what we really want, and what MIRI seems to be working towards, is closer to "optimality": create an AI that makes our world into what we'd consider the best possible one

?

(I hope to also post a more substantial comment soon, but I need to think about your post a bit more first.)

Normative AI - Solve all of the philosophical problems ahead of time, and code the solutions into the AI.
Black-Box Metaphilosophical AI - Program the AI to use the minds of one or more human philosophers as a black box to help it solve philosophical problems, without the AI builders understanding what "doing philosophy" actually is.
White-Box Metaphilosophical AI - Understand the nature of philosophy well enough to specify "doing philosophy" as an algorithm and code it into the AI.

So after giving this issue some thought: I'm not sure to what extent a white-box metaphilosophical AI will actually be possible.

For instance, consider the Repugnant Conclusion. Derek Parfit considered some dilemmas in population ethics, put together possible solutions to them, and then noted that the solutions led to an outcome which again seemed unacceptable - but also unavoidable. Once his results had become known, a number of other thinkers started considering the problem and trying to find a way around those results.

Now, why was the Repugnant Conclusion considered unacceptable? For that matter, why were the dilemmas whose solutions led to the RC considered "dilemmas" in the first place? Not because any of them would have violated any logical rules of inference. Rather, we looked at them and thought "no, my morality says that that is wrong", and then (engaging in motivated cognition) began looking for a consistent way to avoid having to accept the result. In effect, our minds contained dynamics which rejected the RC as a valid result, but that rejection came from our subconscious values, not from any classical reasoning rule that you could implement in an algorithm. Or you could conceivably implement the rule in the algorithm if you had a thorough understanding of our values, but that's not of much help if the algorithm is supposed to figure out our values.

You can generalize this problem to all kinds of philosophy. In decision theory, we already have an intuitive value of what "winning" means, and are trying to find a way to formalize it in a way that fits our value. In epistemology, we have some standards about the kind of "truth" that we value, and are trying to come up with a system that obeys those standards. Etc.

The root problem is that classification and inference require values. As Watanabe (1974) writes:

According to the theorem of the Ugly Duckling, any pair of nonidentical objects share an equal number of predicates as any other pair of nonidentical objects, insofar as the number of predicates is finite [10], [12]. That is to say, from a logical point of view there is no such thing as a natural kind. In the case of pattern recognition, the new arrival shares the same number of predicates with any other paradigm of any class. This shows that pattern recognition is a logically indeterminate problem. The class-defining properties are generalizations of certain of the properties shared by the paradigms of the class. Which of the properties should be used for generalization is not logically defined. If it were logically determinable, then pattern recognition would have a definite answer in violation of the theorem of the Ugly Duckling.

This conclusion is somewhat disturbing because our empirical knowledge is based on natural kinds of objects. The source of the trouble lies in the fact that we were just counting the number of predicates in the foregoing, treating them as if they were all equally important. The fact is that some predicates are more important than some others. Objects are similar if they share a large number of important predicates.

Important in what scale? We have to conclude that a predicate is important if it leads to a classification that is useful for some purpose. From a logical point of view, a whale can be put together in the same box with a fish or with an elephant. However, for the purpose of building an elegant zoological theory, it is better to put it together with the elephant, and for classifying industries it is better to put it together with the fish. The property characterizing mammals is important for the purpose of theory building in biology, while the property of living in water is more important for the purpose of classification of industries.

The conclusion is that classification is a value-dependent task and pattern recognition is mechanically possible only if we smuggle into the machine the scale of importance of predicates. Alternatively, we can introduce into the machine the scale of distance or similarity between objects. This seems to be an innocuous set of auxiliary data, but in reality we are thereby telling the machine our value judgment, which is of an entirely extra-logical nature. The human mind has an innate scale of importance of predicates closely related to the sensory organs. This scale of importance seems to have been developed during the process of evolution in such a way as to help maintain and expand life [12], [14].
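
(A minimal concrete check of the Ugly Duckling point in the quoted passage, under the usual formalization in which predicates are identified with arbitrary subsets of a finite set of objects; the object names below are just placeholders.)

```python
from itertools import combinations

objects = ["whale", "fish", "elephant", "sparrow"]
n = len(objects)

# Identify each possible predicate with the subset of objects it is true of
# (2**n predicates in total, including the empty and the universal predicate).
predicates = [
    {objects[i] for i in range(n) if (mask >> i) & 1}
    for mask in range(2 ** n)
]

# For every pair of distinct objects, count how many predicates hold of both.
for a, b in combinations(objects, 2):
    shared = sum(1 for p in predicates if a in p and b in p)
    print(f"{a} & {b}: {shared} shared predicates")
# Every pair shares exactly 2**(n - 2) == 4 predicates, so without an
# extra-logical weighting of predicates no pair is "more similar" than another.
```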

"Progress" in philosophy essentially means "finding out more about the kinds of things that we value, drawing such conclusions that our values say are correct and useful". I am not sure how one could make an AI make progress in philosophy if we didn't already have a clear understanding of what our values were, so "white-box metaphilosophy" seems to just reduce back to a combination of "normative AI" and "black-box metaphilosophy".

Coincidentally, I ended up reading Evolutionary Psychology: Controversies, Questions, Prospects, and Limitations today, and noticed that it makes a number of points that could be interpreted in a similar light: in that humans do not really have a "domain-general rationality", and that instead we have specialized learning and reasoning mechanisms, each of which are carrying out a specific evolutionary purpose and which are specialized for extracting information that's valuable in light of the evolutionary pressures that (used to) prevail. In other words, each of them carries out inferences that are designed to further some specific evolutionary value that helped contribute to our inclusive fitness.

The paper doesn't spell out the obvious implication, since that isn't its topic, but it seems pretty clear to me: since our various learning and reasoning systems are based on furthering specific values, our philosophy has also been generated as a combination of such various value-laden systems, and we can't expect an AI reasoner to develop a philosophy that we'd approve of unless its reasoning mechanisms also embody the same values.

That said, it does suggest a possible avenue of attack on the metaphilosophy issue... figure out exactly what various learning mechanisms we have and which evolutionary purposes they had, and then use that data to construct learning mechanisms that carry out similar inferences as humans do.

Quotes:

Hypotheses about motivational priorities are required to explain empirically discovered phenomena, yet they are not contained within domain-general rationality theories. A mechanism of domain-general rationality, in the case of jealousy, cannot explain why it should be “rational” for men to care about cues to paternity certainty or for women to care about emotional cues to resource diversion. Even assuming that men “rationally” figured out that other men having sex with their mates would lead to paternity uncertainty, why should men care about cuckoldry to begin with? In order to explain sex differences in motivational concerns, the “rationality” mechanism must be coupled with auxiliary hypotheses that specify the origins of the sex differences in motivational priorities. [...]

The problem of combinatorial explosion. Domain-general theories of rationality imply a deliberate calculation of ends and a sample space of means to achieve those ends. Performing the computations needed to sift through that sample space requires more time than is available for solving many adaptive problems, which must be solved in real time. Consider a man coming home from work early and discovering his wife in bed with another man. This circumstance typically leads to immediate jealousy, rage, violence, and sometimes murder (Buss, 2000; Daly & Wilson, 1988). Are men pausing to rationally deliberate over whether this act jeopardizes their paternity in future offspring and ultimate reproductive fitness, and then becoming enraged as a consequence of this rational deliberation? The predictability and rapidity of men’s jealousy in response to cues of threats to paternity points to a specialized psychological circuit rather than a response caused by deliberative domain-general rational thought. Dedicated psychological adaptations, because they are activated in response to cues to their corresponding adaptive problems, operate more efficiently and effectively for many adaptive problems. A domain-general mechanism “must evaluate all alternatives it can define. Permutations being what they are, alternatives increase exponentially as the problem complexity increases” (Cosmides & Tooby, 1994, p. 94). Consequently, combinatorial explosion paralyzes a truly domain-general mechanism (Frankenhuis & Ploeger, 2007). [...]

In sum, domain-general mechanisms such as “rationality” fail to provide plausible alternative explanations for psychological phenomena discovered by evolutionary psychologists. They are invoked post hoc, fail to generate novel empirical predictions, fail to specify underlying motivational priorities, suffer from paralyzing combinatorial explosion, and imply the detection of statistical regularities that cannot be, or are unlikely to be, learned or deduced ontogenetically. It is important to note that there is no single criterion for rationality that is independent of adaptive domain. [...]

The term learning is sometimes used as an explanation for an observed effect and is the simple claim that something in the organism changes as a consequence of environmental input. Invoking “learning” in this sense, without further specification, provides no additional explanatory value for the observed phenomenon but only regresses its cause back a level. Learning requires evolved psychological adaptations, housed in the brain, that enable learning to occur: “After all, 3-pound cauliflowers do not learn, but 3-pound brains do” (Tooby & Cosmides, 2005, p. 31). The key explanatory challenge is to identify the nature of the underlying learning adaptations that enable humans to change their behavior in functional ways as a consequence of particular forms of environmental input.

Although the field of psychology lacks a complete understanding of the nature of these learning adaptations, enough evidence exists to draw a few reasonable conclusions. Consider three concrete examples: (a) People learn to avoid having sex with their close genetic relatives (learned incest avoidance); (b) people learn to avoid eating foods that may contain toxins (learned food aversions); (c) people learn from their local peer group which actions lead to increases in status and prestige (learned prestige criteria). There are compelling theoretical arguments and empirical evidence that each of these forms of learning is best explained by evolved learning adaptations that have at least some specialized design features, rather than by a single all-purpose general learning adaptation (Johnston, 1996). Stated differently, evolved learning adaptations must have at least some content-specialized attributes, even if they share some components. [...]

These three forms of learning—incest avoidance, food aversion, and prestige criteria—require at least some content-specific specializations to function properly. Each operates on the basis of inputs from different sets of cues: coresidence during development, nausea paired with food ingestion, and group attention structure. Each has different functional output: avoidance of relatives as sexual partners, disgust at the sight and smell of specific foods, and emulation of those high in prestige. It is important to note that each form of learning solves a different adaptive problem.

There are four critical conclusions to draw from this admittedly brief and incomplete analysis. First, labeling something as “learned” does not, by itself, provide a satisfactory scientific explanation any more than labeling something as “evolved” does; it is simply the claim that environmental input is one component of the causal process by which change occurs in the organism in some way. Second, “learned” and “evolved” are not competing explanations; rather, learning requires evolved psychological mechanisms, without which learning could not occur. Third, evolved learning mechanisms are likely to be more numerous than traditional conceptions have held in psychology, which typically have been limited to a few highly general learning mechanisms such as classical and operant conditioning. Operant and classical conditioning are important, of course, but they contain many specialized adaptive design features rather than being domain general (Ohman & Mineka, 2003). And fourth, evolved learning mechanisms are at least somewhat specific in nature, containing particular design features that correspond to evolved solutions to qualitatively distinct adaptive problems.

Do you have thoughts on the other approaches described here? It seems to me that black box metaphilosophical AI, in your taxonomy, need not be untestable nor dangerous during a transient period.

If I understand correctly, in order for your designs to work, you must first have a question-answerer or predictor that is much more powerful than a human (i.e., can answer much harder questions than a human can). For example, you are assuming that the AI would be able to build a very accurate model of an arbitrary human overseer from sense data and historical responses and predict their "considered judgements", which is a superhuman ability. My concern is that when you turn on such an AI in order to test it, it might either do nothing useful (i.e., output very low quality answers that give no insight into how safe it would eventually be) because it's not powerful enough to model the overseer, or FOOM out of control due to a bug in the design or implementation and the amount of computing power it has. (Also, how are you going to stop others from making use of such powerful question-answerers/predictors in a less safe, but more straightforward and "efficient" way?)

With a white-box metaphilosophical AI, if such a thing was possible, you could slowly increase its power and hopefully observe a corresponding increase in the quality of its philosophical output, while fixing any bugs that are detected and knowing that the overall computing power it has is not enough for it to vastly outsmart humans and FOOM out of control. It doesn't seem to require access to superhuman amounts of computing power just to start to test its safety.

I don’t think that the question-answerer or reinforcement learner needs to be superhuman. I describe them as using human-level abilities rather than superhuman abilities, and it seems like they could also work with subhuman abilities. Concretely, if we imagine applying those designs with a human-level intelligence acting in the interests of a superhuman overseer, they seem (to me) to work fine. I would be interested in problems you see with this use case.

Your objection to the question-answering system seemed to be that the AI may not recognize that human utterances are good evidence about what the overseer would ultimately do (even if they were), and that it might not be possible or easy to teach this. If I’m remembering right and this is still the problem you have in mind, I’m happy to disagree about it in more detail. But it seems that this objection couldn’t really apply to the reinforcement learning approach.

It seems like these systems could be within a small factor of optimal efficiency (certainly within a factor of 2, say, but hopefully much closer). I would consider a large efficiency loss to be failure.

I would be interested in problems you see with this use case.

The AI needs to predict what the human overseer "wants" from it, i.e., what answers the human would score highly. If I was playing the role of such an AI, I could use the fact that I am myself a human and think similarly to the overseer, and ask myself, "If I was in the overseer's position, what answers would I judge highly?" In particular, I could use the fact that I likely have philosophical abilities similar to the overseer's, and could just apply my native abilities to satisfy the overseer. I do not have to first build a detailed model of the overseer from scratch and then run that model to make predictions. It seems to me that the AI in your design would have to build such a model, and doing so seems a superhuman feat. In other words, if I did not already have native philosophical abilities on par with the overseer's, I couldn't give answers to any philosophical questions that the overseer would find helpful, unless I had the superhuman ability to create a model of the overseer, including his philosophical abilities, from scratch.

Suppose that you are the AI, and the overseer is a superintelligent alien with very different values and philosophical views. How well do you think that things will end up going for the alien? (Assuming you are actually trying to win at the RL / question-answering game.)

It seems to me like you can pursue the aliens' values nearly as well as if they were your own. So I'm not sure where we disagree (assuming you don't find this thought experiment convincing):

  1. Do you think that you couldn't satisfy the alien's values?
  2. Do you think that there is a disanalogy between your situation in the hypothetical and the situation of a subhuman AI trying to satisfy our values?
  3. Something else?

Black-Box Metaphilosophical AI is also risky, because it's hard to test/debug something that you don't understand.

On the other hand, to the extent that our uncertainty about whether different BBMAI designs do philosophy correctly is independent, we can build multiple ones and see what outputs they agree on. (Or a design could do this internally, achieving the same effect.)
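
(A toy numerical sketch of this point, with made-up numbers: if each design is independently correct with probability p, and wrong designs rarely produce coinciding answers, then agreement among several designs is much stronger evidence than any single design's output.)

```python
# Toy model (made-up numbers): each design is independently correct with
# probability p; when wrong, two designs' answers coincide only with
# probability q. If all k designs agree, how likely is the shared answer
# to be correct?

def prob_correct_given_agreement(p: float, q: float, k: int) -> float:
    p_all_correct = p ** k                                # all k right, so they agree
    p_all_wrong_and_agree = (1 - p) ** k * q ** (k - 1)   # all wrong AND coincide
    return p_all_correct / (p_all_correct + p_all_wrong_and_agree)

for k in (1, 2, 3):
    print(k, round(prob_correct_given_agreement(p=0.7, q=0.01, k=k), 4))
# Prints roughly 0.7, 0.9982, 1.0: agreement among independent designs is
# strong evidence even when each individual design is fairly unreliable.
```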

it's unclear why such an AI won't cause disaster in the time period before it achieves philosophical competence.

This seems to be an argument for building a hybrid of what you call metaphilosophical and normative AIs, where the normative part "only" needs to be reliable enough to prevent initial disaster, and the metaphilosophical part can take over afterward.

create an AI that minimizes the expected amount of astronomical waste

I prefer the more cheerfully phrased "Converts the reachable universe to QALYs" but same essential principle.

Modulo complexity of value, I hope? I don't think we're in a position to pinpoint QALYs as The Thing to Tile.

Just a minor terminology quibble: the “black” in “black-box” does not refer to the color, but to the opacity of the box; i.e., we don’t know what’s inside. “White-box” isn’t an obvious antonym in the sense I think you want.

“Clear-box” would better reflect the distinction that what’s inside isn’t unknown (i.e., it’s visible and understandable). Or perhaps “open-box” might be even better, since not only do we know how it works, but we also put it there.

White-box is, nevertheless, the accepted name for the concept he was referring to - probably as an antonym to black-box.

English. What can you do.

Huh. I’ve never encountered it, and I would have bet ten to one that if it existed I’d have seen it. Time to check some of those priors...

Thanks for letting me know.

“White-box” isn’t an obvious antonym in the sense I think you want.

I actually checked Wikipedia before using the term, since I had the same thought as you, but "white-box testing" seems to be the most popular term (it's the title of the article and used throughout) in preference to "clear box testing" and a bunch of others that are in parentheses under "also known as".

Right, sorry. I was so sure that I’d have heard the term before if it existed, and that you invented the term yourself, that it never occurred to me to check. Well, you learn something new every day :)