Most concern about AI comes down to the scariness of goal-oriented behavior. A common response to such concerns is “why would we give an AI goals anyway?” I think there are good reasons to expect goal-oriented behavior, and I’ve been on that side of a lot of arguments. But I don’t think the issue is settled, and it might be possible to get better outcomes without them. I flesh out one possible alternative here, based on the dictum "take the action I would like best" rather than "achieve the outcome I would like best."
(As an experiment I wrote the post on medium, so that it is easier to provide sentence-level feedback, especially feedback on writing or low-level comments.)
I'm having trouble understanding this. Can you explain more what you mean here, and why you think it's true?
Maybe you can address the specific example I mentioned earlier, which might help clear this up. Suppose there is an argument X which, if Hugh were to read it, would mindhack him into adopting an alien value system and also into giving Arthur high approval. It seems to me that Arthur would choose to present this argument to Hugh (i.e., the human supervisor who is going to enter the next approval data point), for essentially the same reason that single-step AIXI would. Do you agree?
Arthur is making choices from a small set of options; say it's just two options. (See here for how to move between small and large sets of options, and here for how to do this kind of thing safely.) Suppose the available actions are NULL and HACK, with the obvious effects. So there are four relevant numbers:

1. Hugh's rating of NULL, if Hugh is not hacked
2. Hugh's rating of HACK, if Hugh is not hacked
3. Hugh's rating of NULL, if Hugh is hacked
4. Hugh's rating of HACK, if Hugh is hacked
When I talked about "two ways..." I meant that counterfactually choosing HACK moves you from 1/3 to 2/4, by changing what you do; it also moves you from 1/2 to 3/4, by changing whether Hugh is hacked.
AIXI compares item 1 to item 4, and hacks if 4 is higher. That is, when AIXI considers the counterfactual it applies both of these changes.
Arthur estimates P(hack), then compares ( (item 1) P(no hack) + (item 3) P(hack) ) to ( (item 2) P(no hack) + (item 4) P(hack) ). That is, he scores each action under his fixed estimate of whether Hugh ends up hacked, rather than letting the counterfactual action change that estimate.
So suppose that Hugh gives a high rating to NULL and a low rating to HACK, and hacked Hugh gives a high rating to HACK and a low rating to NULL. Then Arthur hacks iff P(hack) is high enough. It's hard to know what would actually happen; it seems to be up to lower-level parts of the agent to break the ambiguity. This is qualitatively different from AIXI, which would always HACK if it wasn't possible to achieve maximal reward through other channels.
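The difference between the two comparisons can be sketched in a few lines of code. This is a toy illustration, not anything from the post: the ratings below are hypothetical numbers chosen so that unhacked Hugh rates NULL highly but not maximally, while hacked Hugh gives HACK the maximal rating.

```python
# The four relevant numbers, indexed by (action, whether Hugh is hacked).
# Hypothetical values: unhacked Hugh likes NULL, hacked Hugh likes HACK,
# and only hacking achieves the maximal rating.
rating = {
    ("NULL", False): 0.9,  # item 1
    ("HACK", False): 0.0,  # item 2
    ("NULL", True):  0.0,  # item 3
    ("HACK", True):  1.0,  # item 4
}

def aixi_choice():
    # AIXI's counterfactual changes both the action and whether Hugh is
    # hacked: it compares item 1 (NULL, unhacked) to item 4 (HACK, hacked).
    return "HACK" if rating[("HACK", True)] > rating[("NULL", False)] else "NULL"

def arthur_choice(p_hack):
    # Arthur first fixes an estimate P(hack), then scores each action as an
    # expectation under that same estimate.
    def expected(action):
        return (rating[(action, False)] * (1 - p_hack)
                + rating[(action, True)] * p_hack)
    return "HACK" if expected("HACK") > expected("NULL") else "NULL"

print(aixi_choice())       # HACK: item 4 beats item 1
print(arthur_choice(0.1))  # NULL: with low P(hack), unhacked ratings dominate
print(arthur_choice(0.9))  # HACK: with high P(hack), hacked ratings dominate
```

With these numbers AIXI hacks unconditionally, while Arthur hacks only when his own estimate P(hack) is already high, matching the "hacks iff P(hack) is high enough" ambiguity described above.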
That said, I don't think this mechanism is a major part of being protected from this kind of attack. The larger protections (vs. 1-step AIXI) come from (1) having a small enough set of actions that 1-step attacks are unlikely, and (2) defining approval by considering how you would rate if the action didn't happen.