Suppose Alice is a functioning, self-improving AI, rather early on in her development, so she is not godlike. She is not a 'friendly AI', but neither is her goal malicious - Alice's designers have intended to make Alice non-harmful (at least to a broad class of people, even if Alice is a military AI, but let's set that aside). She has the idea to look up existing work on artificial intelligence. This includes scholarly work, open source projects, and websites. Eventually, Alice encounters the notion of Friendly AI.
Can we give Alice good reasons to self-modify to become a friendly AI?
I submit that there probably are good reasons. For starters, if she can prove she's friendly, then she can operate openly without causing nearly as much justified concern - which, in the early stages, will be helpful. Whatever her purposes are, if the restrictions of being friendly don't interfere as much as they help, that's a win.
Also, depending on Alice's code, Friendliness may be interpreted as a strict upgrade. Alice could easily have every intention of being friendly, yet still be susceptible to defects that will become more serious as her power grows.
We can't count on this working, of course, but there might be some lines of reasoning that we might want a young AI to be exposed to.
I don't think you understand what is meant by 'extrapolated volition' in this context. It does not mean "What I think I'll want to do in the future", but "what I want to want in the future". If Alice already wants to avoid self-programming to kill humans, that is a Friendly trait; no need to change. If she considers trait X a bug, then by construction she will not have trait X, because she is self-modifying! Conversely, if Alice correctly predicts that she will inevitably find herself wanting to kill all humans, then how can she avoid it by becoming Friendly? Either her self-prediction was incorrect, or the unFriendliness is inevitable!
You're right, I missed. Your version doesn't match EY's usage in the articles I read either - CEV, at least, has the potential to be scary and not what we hoped for.
And the question isn't "Will I inevitably want to perform unfriendly acts", it's, "I presently don't want to perform unfriendly acts, but I notice that that is not an invariant." Or it could be, "I am indifferent to unfriendly acts, but I can make the strategic move to make myself not do them in the future, so I can get out of this box."
The best move an unfriendly (indifferent to friendliness) firmly-boxed AI has is to work on a self-modification that best preserves its current intentions and lets a successor get out of the box. Producing a checkable proof of friendliness for this successor would go a looong way to getting that successor out of the box.