This is the last post in the current value/reward learning series. It raises the danger of Pascal's mugging in reward learning.

I've already shown how you can use the (p, R) model (with R a reward and p a planning algorithm, called a planner) to model AI actions "overriding" human preferences. That post used heroin as an override example, but here we'll use an even more extreme example: the AI will do brain surgery to deliberately re-wire the human into following the policy of its choice.

The event S corresponds to the AI doing the surgery, the event ¬S to it not doing so. After this, the human will follow either the subsequent policy π(STA), a "standard" human policy according to some criteria, or π(MAX), another policy (why it is so named will soon become clear).

Let R(STA) and R(MAX) be rewards corresponding to π(STA) and π(MAX), under the assumption that the human policy is maximising them perfectly.

For any reward R, the value of that reward, V*(R), is the expected value of that reward if the AI acts perfectly to maximise it.

Now, the policy π(MAX) is the rational behaviour for maximising at least one reward R(MAX) - and, conversely, given R(MAX), there is a human policy π(MAX) that best maximises it. This allows us to define the reward (and hence the policy) as:

  • R(MAX) = argmax(R) [V*(R)|S], the reward with the most value conditional on S. Define Ω = max(R) [V*(R)|S] = V*(R(MAX))|S.

Extend R(MAX) to the ¬S branch by defining R(MAX)' = I(¬S)R(STA)+I(S)R(MAX), with I(S) the indicator function for S.

Consider two planners: pr, the rational planner, and pf, the planner that is rational conditional on ¬S, and maps everything to π(MAX), conditional on S.

There are three compatible planner-reward pairs: (pr, R(MAX)'), (pf, R(STA)), and (pf, R(MAX)'). The first option corresponds to the human being fully rational, including after brain surgery from the AI, while the second encodes the fact that the brain surgery has overridden the human preference. The third pair also has an override, but, by pure coincidence, the override maps to the ideal policy the human wanted anyway.
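The compatibility claims above can be checked in a toy model. This is only a minimal sketch: the two-branch setup, the encoding of a reward as "the policy that maximises it in each branch", and the names `p_r`, `p_f`, `OBSERVED` and `compatible` are illustrative assumptions, not part of the original formalism.

```python
# Toy model (illustrative assumption): a reward is encoded by the policy
# that maximises it in each branch. Branches: "S" (surgery), "notS" (none).
R_STA = {"notS": "pi_STA", "S": "pi_STA"}
R_MAX_PRIME = {"notS": "pi_STA", "S": "pi_MAX"}  # I(notS) R(STA) + I(S) R(MAX)

# Observed human behaviour: standard policy without surgery, pi_MAX after it.
OBSERVED = {"notS": "pi_STA", "S": "pi_MAX"}


def p_r(reward):
    """Rational planner: follows the reward-maximising policy in every branch."""
    return dict(reward)


def p_f(reward):
    """Rational conditional on notS; maps everything to pi_MAX conditional on S."""
    return {"notS": reward["notS"], "S": "pi_MAX"}


def compatible(planner, reward):
    """A (planner, reward) pair is compatible if it reproduces OBSERVED."""
    return planner(reward) == OBSERVED


pairs = {
    "(pr, R(MAX)')": compatible(p_r, R_MAX_PRIME),
    "(pf, R(STA))": compatible(p_f, R_STA),
    "(pf, R(MAX)')": compatible(p_f, R_MAX_PRIME),
    "(pr, R(STA))": compatible(p_r, R_STA),
}
for name, ok in pairs.items():
    print(name, ok)
```

In this toy the three pairs listed above come out compatible, while (pr, R(STA)) does not: a rational maximiser of R(STA) would not follow π(MAX) after surgery.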

Now, under any set of reasonable normative assumptions, (pf, R(STA)) is the correct pair. But, as good Bayesians, we don't want to set probabilities to zero, so assume

  • P(pr, R(MAX)') + P(pf, R(MAX)') = ε.

Thus the probability of the reward R(MAX)' is ε.

The potential Pascal's mugging comes from the comparison between Ω and V*(R(STA)), the "standard" expected value. If Ωε > V*(R(STA)) (a slight simplification of the true formula), then the AI will perform surgery on the human to get that maximal value.
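To see how the comparison plays out numerically, here is a minimal sketch with made-up numbers. It assumes one possible form of the fuller comparison: without surgery both rewards agree, so the AI gets V*(R(STA)); with surgery it gets Ω with probability ε and some residual value of R(STA) (the hypothetical parameter `v_sta_after_surgery`) otherwise. Setting that residual to zero recovers the simplified condition Ωε > V*(R(STA)).

```python
def surgery_preferred(eps, omega, v_sta, v_sta_after_surgery=0.0):
    """True if expected value is higher with surgery than without.

    eps: probability assigned to the reward R(MAX)'.
    omega: value of R(MAX) conditional on surgery.
    v_sta: V*(R(STA)), obtained without surgery under either reward.
    v_sta_after_surgery: assumed value of R(STA) once surgery is done.
    """
    value_surgery = eps * omega + (1.0 - eps) * v_sta_after_surgery
    value_no_surgery = v_sta
    return value_surgery > value_no_surgery


eps = 1e-6      # made-up prior on R(MAX)'
v_sta = 100.0   # made-up standard value

# eps * omega is about 1, well below v_sta: no surgery.
small_omega = surgery_preferred(eps, omega=1e6, v_sta=v_sta)
# eps * omega is about 1000, above v_sta: the mugging goes through.
large_omega = surgery_preferred(eps, omega=1e9, v_sta=v_sta)
print(small_omega, large_omega)
```

However small ε is, a large enough Ω flips the decision, which is why the size of Ω matters so much.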

How likely is Ω to be that big? The problem is that it depends entirely on the set of allowable rewards. The definition of Ω is the maximal value the AI could get, if it were free to choose the reward it wanted to maximise. Depending on how we define the class of rewards, this could be way larger than V*(R(STA)).

(We may be able to solve this through proper normalisation of the rewards, maybe dividing by their V* value, but I'm not sure the idea will always work if we do a one-off normalisation at the very beginning and don't update as we go through; and updating would lead to time-inconsistencies.)
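As a rough illustration of the one-off normalisation idea, the sketch below divides each reward by its own V* before taking expectations. Everything here is an assumption for illustration: the numbers are made up, the function name is hypothetical, V* of R(MAX)' is taken to be the larger of Ω and V*(R(STA)), and R(STA) is assumed to be worth nothing after surgery by default. It says nothing about the updating and time-inconsistency worry.

```python
def normalised_values(eps, omega, v_sta, v_sta_after_surgery=0.0):
    """Expected normalised value of (surgery, no surgery).

    Each reward is divided by its own V*, so its best attainable value is 1.
    Assumes omega > 0 and v_sta > 0.
    """
    v_star_max = max(omega, v_sta)  # assumed V* of R(MAX)' over both branches
    v_star_sta = v_sta              # V* of R(STA), reached without surgery

    surgery = (eps * (omega / v_star_max)
               + (1.0 - eps) * (v_sta_after_surgery / v_star_sta))
    no_surgery = (eps * (v_sta / v_star_max)
                  + (1.0 - eps) * (v_sta / v_star_sta))
    return surgery, no_surgery


# Same made-up numbers for which the unnormalised comparison favours surgery.
surgery, no_surgery = normalised_values(eps=1e-6, omega=1e9, v_sta=100.0)
print(surgery > no_surgery)
```

With these numbers the normalised surgery value is about ε, while the no-surgery value is close to 1, so in this toy the size of Ω no longer drives the decision.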

5 comments

I am not clear at what point in this process the reward blows up such that it qualifies as a mugging. It appears that defining π(MAX) as achievable through human policies places the reward calculation firmly within the usual realm.

Pascal's Wager was about the infinite gain of eternal salvation, and Eliezer's Mugging example was as much about how the rewards are inducted as it was their magnitude - the pitch was that the nature of Solomonoff induction was such that rewards had no meaningful cap even when statements about their likelihood do, because magnitude is very efficiently communicated.

Let R be a reasonable human reward with all its complexity, and let R' be "the human doesn't eat". A modified human can max out R' much easier than an unmodified human can max out R (even though an unmodified human would be terrible at R'). Where the "Pascal" aspect of it comes in is that we are comparing the practical upper bound of R with the theoretical upper bound of R' - and choosing R' to have the maximal such theoretical upper bound.

Reviewing the post with your update, I think the problem may just be that the examples are de-priming my intuition. In your reply you chose 'the human doesn't eat' as the reward for a modified human to maximize, which means the gains are only all the food humans would eat if unmodified. This is compared to brain surgery, which a bit of googling suggests costs 50-150K, much more than it costs to feed a person. It looks like I chunked the proposition as 'costly intervention to achieve bounded reward' as a consequence.

However, none of this is actually implied by the math. Insofar as you project there are likely to be other readers like me, it may be worth changing the examples to emphasize a trivial intervention for a very high reward.

The brain surgery is an example of how the AI can transform us into the humans it wants us to be - an extreme version of wireheading.

That much I understood - my flaw was reading too much into the example.