Beyond algorithmic equivalence: self-modelling

Stuart_Armstrong

28th Feb 2018

In the previous post, I discussed ways that the internal structure of an algorithm might, given the right normative assumption, allow us to distinguish bias from reward.

Here I'll be pushing the modelling a bit further.

Self-modelling

Consider the same toy anchoring bias problem as in the previous post, with the human algorithm $H$, some object $X$, a random integer $0 \leq n \leq 99$, and an anchoring bias given by

$$H(X, n) = \frac{3}{4} V(X) + \frac{1}{4} n,$$

for $V$ some valuation function that is independent of $n$.
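To get a feel for the size of the anchoring effect, here is a minimal Python sketch of the formula above. The function name `anchored_valuation` and the example value $V(X) = 40$ are illustrative assumptions, not part of the original setup.

```python
def anchored_valuation(v_x: float, n: int) -> float:
    """H(X, n) = 3/4 * V(X) + 1/4 * n, for a random integer 0 <= n <= 99."""
    if not 0 <= n <= 99:
        raise ValueError("n must be an integer between 0 and 99")
    return 0.75 * v_x + 0.25 * n


# Hypothetical valuation V(X) = 40, chosen only for illustration.
v_x = 40.0
lowest = anchored_valuation(v_x, 0)    # anchor at the bottom of the range
highest = anchored_valuation(v_x, 99)  # anchor at the top of the range
spread = highest - lowest              # how far n alone can move the answer

print(lowest, highest, spread)
```

Whatever $V(X)$ is, the irrelevant integer $n$ can shift the stated valuation by up to $99/4 = 24.75$.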

On these inputs, the internal structure of $H$ is:

However, $H$ is capable of self-modelling, which allows it to make long-term decisions. At time $t$, $H$ models itself at time $t + 1$ as:

Note that $H$ is in error here: it doesn't take into account the influence of $n$ on its own behaviour.

In this situation, it could be justifiable to say that $H$'s self-model is the correct model of its own values. And, in that case, the anchoring effect can safely be dismissed as a bias.
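As a rough illustration, suppose (this is an assumption, since the exact form of the self-model is not spelled out in the text) that $H$ models its future self as simply outputting $V(X)$. Taking the self-model as the standard for values, the bias is whatever is left over when the self-model is subtracted from the actual behaviour. The function names below are hypothetical.

```python
def actual_behaviour(v_x: float, n: int) -> float:
    """What H really does: H(X, n) = 3/4 * V(X) + 1/4 * n."""
    return 0.75 * v_x + 0.25 * n


def self_model(v_x: float) -> float:
    """Assumed self-model: H believes its future self will output V(X), ignoring n."""
    return v_x


def bias_term(v_x: float, n: int) -> float:
    """Gap between behaviour and self-model; algebraically (n - V(X)) / 4."""
    return actual_behaviour(v_x, n) - self_model(v_x)


print(bias_term(40.0, 80))  # anchor above V(X) pulls the valuation up
print(bias_term(40.0, 40))  # anchor equal to V(X) leaves it unchanged
```

Under this reading, the normative assumption does all the work: the code only labels the residual as "bias" because we have decided to trust the self-model.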

Self-model and preparation

Let's make the previous setup a bit more complicated, and suppose that the agent $H$ is sometimes aware of the effect of $n$, and sometimes isn't.

At time $t$, they also have an extra action choice: either $n$, which will block their future self from seeing $n$, or $\emptyset$, which will proceed as normal. Suppose further that whenever $H$ is aware of the effect of $n$, they take action $n$:

And when $H$ isn't aware of the effect of $n$, they take no blocking action, i.e. they take $\emptyset$:

Then it seems very justifiable to see $H$ as opposing the anchoring effect in themselves, and thus classifying it as a bias rather than a value/preference/reward.
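The setup can be sketched as a small simulation. Everything named here is hypothetical: the action labels `"block"` and `"noop"` stand in for the two actions above, and the sketch assumes that a future self who never sees $n$ simply reports $V(X)$, which the text does not state explicitly.

```python
from typing import Optional, Tuple


def future_valuation(v_x: float, seen_n: Optional[int]) -> float:
    """Future self's output. Assumption: with n hidden, it reports V(X) unanchored."""
    if seen_n is None:
        return v_x
    return 0.75 * v_x + 0.25 * seen_n


def choose_action(aware_of_anchoring: bool) -> str:
    """The pattern from the text: block n exactly when aware of its effect."""
    return "block" if aware_of_anchoring else "noop"


def run_episode(v_x: float, n: int, aware_of_anchoring: bool) -> Tuple[str, float]:
    """Pick the action at time t, then compute the valuation at time t + 1."""
    action = choose_action(aware_of_anchoring)
    seen_n = None if action == "block" else n
    return action, future_valuation(v_x, seen_n)


print(run_episode(40.0, 80, aware_of_anchoring=True))
print(run_episode(40.0, 80, aware_of_anchoring=False))
```

The point of the sketch is the asymmetry: the agent only ever acts to remove the influence of $n$, never to preserve it, and that is what suggests reading the anchoring effect as a bias.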

The philosophical position

The examples in this post seem stronger than in the previous one, in terms of justifying "the anchoring bias is actually a bias".

More importantly, there is a philosophical justification, not just an ad hoc one. We are assuming that $H$ has a self-model of their own values: they have a model of what is a value and what is a bias in their own behaviour.

Then we can define the reward of $H$ as the reward that $H$ models itself as having.

In subsequent posts, I'll explore whether this definition is justified, how to access these self-models, and what can be done about errors and contradictions in self-models.


RyanCarey (6y)

I agree that the agent should be able to make a decent effort at telling us which of its drives are biases (/addictions) versus values. One complicating factor is that agents change their opinions about these matters over time. Imagine a philosopher who uses the drug heroin. They may very well vacillate on whether heroin satisfies their full-preferences, even if the experience of taking heroin is not changing. This could happen via introspection, via philosophical investigation, via examining fMRI scans, et cetera. It's tricky for the human to state their biases with confidence because they may never know when they are done updating on the matter.

Intuitively, an agent might want the AI system to do this examination and then to maximize whatever turns out to be valuable. That is, you might want the bias-model to be the one that you would settle on if you thought for a long time, similarly to enlightened self-interest / extrapolated volition models. Similar problems ensue: e.g., this process may diverge. Or it may be fundamentally indeterminate whether some drives are values or biases.

Stuart_Armstrong (6y)

>One complicating factor is that agents change their opinions about these matters over time.

Yep! This is one of the major issues, and one that I'll try to model in a soon-to-be-coming post. The whole issue of rigged and influenceable learning processes is connected with trying to learn the preferences of such an agent.

>Or it may be fundamentally indeterminate whether some drives are values or biases.

I think it's fundamentally indeterminate in principle, but we can make some good judgements in practice.

Gordon Seidoh Worley (6y)

Ooooh, I like where this is going. I realize you still have more to develop on this idea, but is your thought that this could replace the use of objective reward functions that exist outside the agent?
