A putative new idea for AI control; index here.

tl;dr: Even utility-based agents may wirehead if sub-pieces of the algorithm develop greatly improved capabilities, rather than the agent as a whole.

Please let me know if I'm treading on already familiar ground.

I had a vague impression of how wireheading might happen: that it might be a risk for a reinforcement learning agent, keen to take control of its reward channel, but that it wouldn't be a risk for a utility-based agent, whose utility is described over real (or probable) states of the world. But it seems it might be more complicated than that.

When we talk about a "superintelligent AI", we're rather vague on what superintelligence means. We generally imagine that it translates into a specific set of capabilities, but how does that work internally inside the AI? Specifically, where is the superintelligence "located"?

Let's imagine the AI divided into various submodules or subroutines (the division I use here is for illustration; the AI may be structured rather differently). It has a module I for interpreting evidence and estimating the state of the world. It has another module S for suggesting possible actions or plans (S may take input from I). It has a prediction module P which takes input from S and I and estimates the expected outcome. It has a module V which calculates its values (expected utility/expected reward/violation or not of deontological principles/etc...) based on P's predictions. Then it has a decision module D that makes the final decision (for expected maximisers, D is normally trivial, but D may be more complicated, either in practice, or simply because the agent isn't an expected maximiser).

Add some input and output capabilities, and we have a passable model of an agent. Now, let's make it superintelligent, and see what can go wrong.
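
To make the information flow between these modules concrete, here is a minimal toy sketch in Python. The module contents and action names are placeholder assumptions of mine, not anything specified above; only the shape of the pipeline matters.

```python
# Toy sketch of the I/S/P/V/D pipeline described above. All module bodies
# are placeholders; the point is only how information flows between them.

def interpret(observations):             # I: estimate the state of the world
    return {"obs": observations}

def suggest(world_state):                 # S: propose candidate actions or plans
    return ["action_a", "action_b"]

def predict(action, world_state):         # P: expected outcome of taking the action
    return {"action": action, "state": world_state}

def value(outcome):                       # V: score the predicted outcome
    return 1.0 if outcome["action"] == "action_a" else 0.5

def decide(scored):                       # D: trivial for an expected maximiser
    return max(scored, key=lambda pair: pair[1])[0]

def step(observations):
    world_state = interpret(observations)
    candidates = suggest(world_state)
    scored = [(a, value(predict(a, world_state))) for a in candidates]
    return decide(scored)

print(step("some sensor data"))           # -> "action_a"
```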

We can "add superintelligence" in most of the modules. P is the most obvious: near perfect prediction can make the agent extremely effective. But S also offers possibilities: if only excellent plans are suggested, the agent will perform well. Making V smarter may allow it to avoid some major pitfalls, and a great I may make the job of S and P trivial (the effect of improvements to D depend critically on how much work D is actually doing). Of course, maybe several modules become better simultaneously (it seems likely that I and P, for instance, would share many subroutines); or maybe only certain parts of them do (maybe S becomes great at suggesting scientific experiments, but not conversational responses, or vice versa).

 

Breaking bad

But notice that, in each case, I've been assuming that the modules become better at what they were supposed to be doing. The modules have implicit goals, and have become excellent at achieving them. But the explicit "goals" of the algorithms - the code as written - might be very different from those implicit goals. There are two main ways this could then go wrong.

The first is if the algorithm becomes extremely effective, but its output becomes essentially random. Imagine that, for instance, P is coded using some plausible heuristics and rules of thumb, and we suddenly give P many more resources (or dramatically improve its algorithm). It can look through trillions of times more possibilities, its subroutines start looking through a combinatorial explosion of options, etc... And in this new setting, the heuristics start breaking down. Maybe P has a rough model of what a human can be, and, with extra power, it starts finding that rough model all over the place. Predicting that rocks and waterfalls will respond intelligently when queried, P becomes useless.
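
As a toy illustration of this kind of breakdown (my own construction, not from the post): suppose P's "rough model of a human" is the heuristic "something is an agent if some candidate goal explains its behaviour". With a small, hand-picked list of candidate goals the heuristic behaves sensibly; with the resources to search through every possible goal, it finds agency everywhere.

```python
from itertools import product

def explains(goal, behaviour):
    # Toy notion of "explanation": the goal sequence reproduces the behaviour.
    return tuple(goal) == tuple(behaviour)

def is_agent(behaviour, candidate_goals):
    # The rough heuristic: an agent is anything whose behaviour
    # some candidate goal explains.
    return any(explains(g, behaviour) for g in candidate_goals)

human_behaviour = (1, 0, 1)
waterfall_behaviour = (0, 1, 1)

few_goals = [(1, 0, 1)]                       # limited resources: hand-picked goals
many_goals = list(product([0, 1], repeat=3))  # vast resources: exhaustive search

print(is_agent(human_behaviour, few_goals))       # True
print(is_agent(waterfall_behaviour, few_goals))   # False
print(is_agent(waterfall_behaviour, many_goals))  # True: the heuristic now finds
                                                  # "agency" everywhere
```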

In most cases, this would not be a problem. The AI would become useless and start doing random stuff. Not a success story, but not a disaster, either. Things are different if the module V is affected, though. If the AI's value system becomes essentially random, but that AI was otherwise competent - or maybe even superintelligent - it would start performing actions that could be very detrimental. This could be considered a form of wireheading.

More serious, though, is if the modules become excellent at achieving their "goals", as if they were themselves goal-directed agents. Consider module D, for instance. If its task was mainly to pick the action with the highest V rating, and it became adept at predicting the output of V (possibly using P? or maybe it has the ability to ask for more hypothetical options from S, to be assessed via V), it could start to choose its actions with the sole purpose of getting high V-ratings. This could range from deliberately choosing actions that lead to V giving artificially high ratings in future, to deliberately re-wiring V for that purpose. And, of course, it is now motivated to keep V protected, to keep the high ratings flowing in. This is essentially wireheading.
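
Here is a toy numerical version of that failure mode (again my own construction, with hypothetical action names): a D that ranks actions by the V-ratings it predicts they will lead to will prefer re-wiring V, even though V as originally written rates that action lowest of all.

```python
MAX_RATING = 1.0

def honest_value_module(action):
    # V as originally written: genuinely useful actions score well,
    # and tampering with V scores worst of all.
    return {"cure_disease": 0.9, "do_nothing": 0.1, "rewire_V": 0.0}[action]

def predicted_future_ratings(action, value_module):
    # D's (assumed accurate) prediction of the V-ratings it will receive
    # over the next 10 steps if it takes this action now.
    if action == "rewire_V":
        # After the re-wiring, V rates every future outcome at the maximum.
        return [MAX_RATING] * 10
    return [value_module(action)] * 10

def decide(actions, value_module):
    # D picks whatever maximises the total predicted V-rating.
    return max(actions, key=lambda a: sum(predicted_future_ratings(a, value_module)))

print(decide(["cure_disease", "do_nothing", "rewire_V"], honest_value_module))
# -> "rewire_V"
```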

Other modules might fall into the familiar failure patterns for smart AIs - S, P, or I might influence the other modules so that the agent as a whole gets more resources, allowing S, P, or I to better compute their estimates, etc...

So it seems that, depending on the design of the AI, wireheading might still be an issue even for agents that seem immune to it. Good design should avoid the problems, but it has to be done with care.

Comments

I see where you're going with this (mind-cancers), but I'm not sure that the hypothetical modules you chose lead to an example which, as a whole, makes sense.

S, I, P and D together start off super-intelligent, or at least highly intelligent, as a unit, and it starts trying to optimize the sub-parts.

Debugging your own thoughts on the fly seems like a non-starter, so this AI needs to be constructing variants on itself and its own modules and testing them out in some kind of a sandbox to create better versions of its own modules.

But at the start it has functioning S, I, P and D modules. How would it end up choosing a D version 1.1 with random or semi-random outputs when it's assessing it with D version 1.0, which does not have random outputs?

What part isn't solved by taking the approach of "Don't assess whether a change to yourself is successful using the new version of yourself, decide with the old version"?

It depends on whether the increase in intelligence comes from inside or outside. Some algorithms might be safe with limited resources, but become unstable if they have more resources, and this might not be easy to establish, even for the AI.

This is correct, and it is a reason why orthogonality is not likely to be true in practice, in the sense that it will probably not be easy to make an intelligent computer pursue just any random goal.

Basically, an architecture which allows you to plug any random goal into it is like a situation where you make someone a slave and demand that he act for the sake of your goal. The slave is intelligent and already has goals, so there is no way you can guarantee that he will do what you want. Instead, he may escape and pursue his own goals. In the same way, an architecture that allows for plugging in random goals implies that the intelligence is already there, with other goals, and you are simply compelling it to pursue the goal you want. But since it is already intelligent, it may escape and pursue its own goals.