Most concern about AI comes down to the scariness of goal-oriented behavior. A common response to such concerns is “why would we give an AI goals anyway?” I think there are good reasons to expect goal-oriented behavior, and I’ve been on that side of a lot of arguments. But I don’t think the issue is settled, and it might be possible to get better outcomes without them. I flesh out one possible alternative here, based on the dictum "take the action I would like best" rather than "achieve the outcome I would like best."
(As an experiment I wrote the post on Medium, so that it is easier to provide sentence-level feedback, especially on writing and other low-level points.)
It seems like AIXI with a time horizon of 1 is a very different beast from AIXI with a longer time horizon. The big difference is that short-sighted AIXI will only try to take over (in the interest of giving itself reward) if it can succeed in a single time step.
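A toy numerical sketch of why the horizon matters (this is not actual AIXI, which is incomputable; the environment, payoffs, and action names are invented purely for illustration): a "takeover" action that pays nothing on the current step but a lot on later steps is only attractive to an agent that plans beyond the current step.

```python
# Hypothetical toy environment: "comply" yields steady approval each step;
# "defect" (attempt takeover) yields nothing now but a large payoff on
# every later step. All numbers are made up for illustration.

def reward(action, step):
    if action == "comply":
        return 1.0                       # steady approval each step
    return 0.0 if step == 0 else 10.0    # takeover: costly now, lucrative later

def plan_value(action, horizon):
    # Assume the agent commits to one action for the whole horizon.
    return sum(reward(action, t) for t in range(horizon))

def best_action(horizon):
    return max(["comply", "defect"], key=lambda a: plan_value(a, horizon))

print(best_action(1))   # horizon 1: takeover can't pay off within one step
print(best_action(10))  # horizon 10: delayed takeover payoff dominates
```

With a horizon of 1 the takeover action never looks worthwhile, matching the observation that short-sighted AIXI only attempts takeover if it can succeed within a single time step.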
I agree that AIXI with a time horizon of 1 still has some undesired behaviors. Those undesired behaviors also afflict the learning-from-examples approval-directed agent.
These problems are particularly troubling if it is possible to retroactively define rewards. In the worst case, Arthur may predict that future-Arthur will escape and define new, escape-conducive values for approval[T]. Anticipating this possibility, Arthur may behave according to the escape-conducive approval[T], thereby fulfilling the prophecy.
This is a much more subtle problem than usual for AIXI, though; the real situation is much more complicated, and there are many possible workarounds. Having a time horizon of 1 seems a lot less scary to me.
I certainly agree that the "learning from examples" case is much weaker than the others.
What "a single time step" means here depends on what model Arthur learns, which may not be what we intend. For example, suppose a is an action which immediately disables the approval input terminal (or the data connection between Arthur and the terminal) via a network attack, and then takes an arbitrarily long time to secure access to the approval input terminal and give itself maximum approval. What is approval[T][a] according to Arthur's model?
Overall, don't you think it's too strong to say "But unlike AIXI, Arthur will make no effort to manipulate these judgments." even if Arthur, like short-sighted AIXI, is safer than standard AIXI? As another example, suppose Arthur discovers some sort of flaw in human psychology which lets it manipulate whoever is going to enter the next approval value into giving it maximum approval. Wouldn't Arthur take advantage of that?