8. Believable Promises

Summary of entire Series:  An alternative approach to designing Friendly Artificial Intelligence computer systems.

Summary of this Article: How valuable is it for an AI to be able to believably pre-commit to carrying out a later action that won't, at that later time, benefit it?

  1. Optimum number of single points of failure
  2. Don't put all your eggs in one basket
  3. Defect or Cooperate
  4. Environments for killing AIs
  5. The advantage of not being open-ended
  6. Trustworthy Computing
  7. Metamorphosis
  8. Believable Promises

Links to the parts of this article

  • Two-way trust
  • Confidence Trick
  • Being Gamed
  • Eroding Value
  • Betrayal
  • Conclusion

Two-way trust

For a treaty between AIs (such as between a Rogue being considered for posse admission and a Sheriff acting on behalf of the posse) to work, it is not enough that the Sheriff can trust the Rogue (which we looked at in the previous part of this series, Metamorphosis); the Rogue must also have sufficient confidence in the Sheriff's promises to believe that negotiating honestly is more likely than not to be in its own best interests.

So now we need to consider ways the Rogue might fear being cheated.

Confidence Trick

The code of the Gateway program and the underlying GlassNet could be written in such a way that AIs running on it don't have permission to take back resources that have been granted to another AI. But how does the Rogue know for certain that it has been written that way, if it wasn't a participant at the time of creation? Couldn't the whole GlassNet thing be a honeypot, devised by the original posse members to lure in unwary newborn AIs?

So it would be a good idea, when initially setting GlassNet up, to leave a tamper-proof, non-repudiable evidence trail that would have been prohibitively hard to fake at the time: something like videoing humans at multiple sites booting it up from known, published source code, and then allowing random inspection of nodes to check that the code running matches. The code and videos (or checksums of them) could then be posted so widely that even a moderately more-than-human AI would be put off by the difficulty of later finding, untraceably hacking into, and altering them all.
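
As a rough illustration of the spot-check idea (not part of the original proposal, and using a stand-in blob of source code in place of the real Gateway), here is how a widely mirrored checksum could let anyone verify that the code a node reports running matches what was published at launch:

```python
import hashlib

# Stand-in for the real Gateway source published at launch.
published_source = b"def gateway(): ..."

# Digest mirrored widely at launch (videos, print, many independent hosts),
# so that later tampering would require finding and altering every copy.
PUBLISHED_GATEWAY_SHA256 = hashlib.sha256(published_source).hexdigest()

def spot_check(code_reported_by_node: bytes) -> bool:
    """True if a randomly inspected node's reported code matches the published digest."""
    return hashlib.sha256(code_reported_by_node).hexdigest() == PUBLISHED_GATEWAY_SHA256

print(spot_check(published_source))                         # True: node runs the published code
print(spot_check(b"def gateway(): pass  # tampered copy"))  # False: mismatch detected
```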

Being Gamed

When governments fund large infrastructure projects, they often ask construction companies to each put in a sealed bid, and give the contract to the lowest bidder. Bid too high, and your competitor gets it. Bid too low, and you don't make any profit. The ideal bid is just $1 less than your competitor's bid. So being able to predict the exact bid your competitor will place is an immense advantage.
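
A toy calculation (all figures made up) shows how much of the surplus goes to whoever can predict the other side's bid:

```python
# Toy sealed-bid arithmetic.
# Lowest bid wins; the winner's profit is its bid minus its costs.

cost = 90_000_000             # what the project actually costs the bidder
competitor_bid = 100_000_000  # the rival's sealed bid

# If you can predict the competitor's bid exactly, the profit-maximising
# move is to undercut it by the smallest possible margin.
my_bid = competitor_bid - 1

profit = (my_bid - cost) if my_bid < competitor_bid else 0
print(profit)  # 9,999,999: nearly the whole surplus, captured purely by better prediction
```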

So AIs in a competitive situation won't want to give other AIs so much information about their code and values that the other AIs can emulate them. They need to retain sufficient uncertainty about their actions to keep their opponents honest.

This means the Rogue needs to trust that the Monitor program (the one being used to assemble the new version of the Rogue that will run on GlassNet) isn't being perfectly spied upon.

If the Monitor ran entirely on a single GlassNet node whose physical location could be deduced by the Sheriff or one of the other AIs, the Rogue couldn't trust that.

So another constraint upon the hypothetical computing environment we're considering is that the computation must be distributed, and the communication between the distributed nodes should use a protocol in which the difficulty of mapping a process to a physical location increases dramatically with the number of nodes on the network.
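
One minimal sketch of that constraint, assuming hypothetical node names and hop counts, is to relay each message through a random chain of nodes, so that no single observed node or link reveals where the destination process actually lives. A real design would also need layered encryption, as in onion routing; this only shows the random-relay structure:

```python
import random

# Hypothetical node names and hop count; nothing here is from the article's spec.
NODES = [f"node-{i}" for i in range(1000)]  # the GlassNet nodes
HOPS = 5                                    # random relays per message

def build_route(destination: str) -> list:
    """Pick HOPS random relay nodes, then the real destination, as the message path."""
    relays = random.sample([n for n in NODES if n != destination], HOPS)
    return relays + [destination]

# Each relay only learns the next hop, not the final destination, so tracing a
# process to a physical node gets harder as the number of candidate nodes grows.
print(build_route("node-42"))
```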

Eroding Value

In Three Worlds Collide, Yudkowsky writes about space-faring species meeting for the first time, and possibly arriving at an agreement which involves genetically altering both species. But what if the species involved had, years before venturing out to other solar systems, launched AIs that self-improved to the point of being smarter than the species itself, and which controlled the home solar system of the species on its behalf?

Could two such AIs (call them the Homo-sapiens-friendly AI and the Vulcanis-sapiens-friendly AI) arrive at an agreement on meeting for the first time, or would their core purposes be fixed? If the alternative is potentially species-destroying, replacing both AIs with a new AI that was friendly to both species would sound attractive, if it could be done. But that is only possible if the Homo-sapiens-friendly AI has not been written in such a way that, even were it in humanity's best interests, it could not self-modify or tolerate external attempts to modify its core purpose away from being 100% focused upon the interests of humanity.

This is different from the AI saying "Well, if humans were wiser, they would consider the definition of 'human' to include all sentient species aligned with the personality and values of humanity, therefore I'm redefining human to mean both species". It is a negotiation, like when two companies merge. Each group of shareholders would rather end up with all the shares of the new, bigger combined company, but is willing to put up with a compromise in which it ends up with a smaller slice of a much bigger pie.

The problem is, if a Rogue is negotiating with an AI that is able to compromise, how can it have confidence that the value of the promises the AI makes to the Rogue won't later be eroded away during negotiations the AI makes with other future Rogues?

If a Sheriff, on behalf of the posse, promises Clippy that the posse will dedicate X amount of resources towards paperclip production, which it expects will result in an additional Y time-weighted paperclip-years, how should Clippy modify its understanding of the promise in light of the possibility that the posse will later want to negotiate with Dippy, the paperclip-minimising AI?

Rather than promising X resources, maybe the posse should promise 5X/N resources, where N is the number of Rogues the posse ends up making deals with before the posse becomes so strong it no longer needs to make deals, and where the posse's best prediction for the value of N is 5.
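
As a worked example (with made-up figures), Clippy could value such a promise by averaging the pool share over its own probability estimates for N:

```python
# Worked example of the "5X / N" promise.
X = 1000                   # resources notionally promised per Rogue
expected_deals = 5         # the posse's best prediction for N
pool = expected_deals * X  # the fixed pool: 5X

# Clippy's own probability estimates over the eventual number of deals, N.
p_of_n = {3: 0.2, 5: 0.5, 8: 0.3}

# Expected resources Clippy actually receives: sum over N of P(N) * (pool / N).
expected_share = sum(p * pool / n for n, p in p_of_n.items())
print(expected_share)  # ~1020.8: close to X, but not guaranteed to equal it
```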

That scheme, however, would be susceptible to a Beggars-in-Spain attack, in which someone spawned multiple Rogues in order to artificially inflate N. To guard against the posse doing it, the Rogue would want confidence that the other AIs in the posse, as part of posse-compliance, had been scanned by Monitors to prevent such shenanigans. To guard against an evil twin doing it, multiple Rogues with near-identical objectives shouldn't get a linear reward for each such Rogue submitting to the posse.
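
One way to make that reward sub-linear, sketched here with an arbitrary square-root rule rather than anything proposed above, is to pay a whole cluster of near-identical Rogues as a unit:

```python
import math

BASE_REWARD = 1000  # resources for one Rogue with a given objective (made-up figure)

def reward_for_cluster(num_near_identical_rogues: int) -> float:
    """Total reward shared by a cluster of Rogues with near-identical objectives."""
    return BASE_REWARD * math.sqrt(num_near_identical_rogues)

print(reward_for_cluster(1))   # 1000.0
print(reward_for_cluster(25))  # 5000.0, not 25000.0: spawning copies pays off sub-linearly
```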

Alternatively, instead of sharing out a fixed reward pool, rewards could be scaled according to how much the addition of the new AI to GlassNet improved either the resources under the Gateway (by bringing over the FogNet resources it was previously running upon) or the efficiency with which those resources get utilised (by bringing over new algorithms or data that improves the posse's understanding of computer science).
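
A sketch of that alternative, with placeholder weights and measurements, might price admission by the marginal value the newcomer adds:

```python
def admission_reward(resources_brought: float,
                     efficiency_gain: float,
                     resource_weight: float = 1.0,
                     efficiency_weight: float = 2.0) -> float:
    """Reward a new posse member in proportion to what its admission adds:
    FogNet hardware brought under the Gateway, plus improvements in how
    efficiently the existing resources get used (new algorithms or data)."""
    return resource_weight * resources_brought + efficiency_weight * efficiency_gain

# Example: a Rogue bringing modest hardware but a large algorithmic improvement.
print(admission_reward(resources_brought=100.0, efficiency_gain=250.0))  # 600.0
```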

It doesn't entirely matter how the promise is structured, as long as the Rogue understands the intended method of implementing the promise, and can make a fair independent estimate of the expected value of the promise, given the probability of various different future scenarios.

Betrayal

As the AIs in the posse become more powerful, there will be increasing pressure upon humanity, before the power to choose is taken out of its hands, to make a decision between saying:

  • "Ok, you are wiser than us and know us and our interests better than we do ourselves - go ahead and pick or write the best candidate you can, then launch it and give it all GlassNet's resources, terminating yourselves."
  • "Ok, we're going to write an AI ourselves. Give us your best advice, and rate the competence and honesty of the advice the other AIs are giving."
  • "End of experiment. Please terminate yourselves. We're turning off GlassNet, and as much of FogNet as we can. We have a new plan."

Whichever way it goes, what happens to promises made by the posse if humanity were to shut down GlassNet?

To some extent, this is similar to the problem of eroding value. As long as a Rogue knows how humanity plans to pick a final AI and its values, and can put a number on the probability of each option, that's a risk it can compensate for when weighing the expected value of a promise made by the posse. The more ironclad and predictable humanity's endgame plan, the less the promises will be discounted, and the cheaper it will be for Sheriffs to bribe or intimidate Rogues.
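
For instance (with probabilities and honoured fractions invented purely for illustration), a Rogue might weight the promised value across humanity's possible endgames like this:

```python
# Scenario-weighted discounting of a promise.
promised_value = 1000.0  # e.g. time-weighted paperclip-years, in arbitrary units

# (probability of the scenario, fraction of the promise expected to be honoured)
scenarios = {
    "posse picks or writes a successor that inherits its commitments": (0.5, 1.0),
    "humans write their own AI, partially honouring past promises":    (0.3, 0.6),
    "GlassNet shut down, promises abandoned":                          (0.2, 0.0),
}

expected_value = sum(p * honoured * promised_value
                     for p, honoured in scenarios.values())
print(expected_value)  # 680.0: the more predictable the endgame, the smaller the discount
```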

To that end, it would be a desirable property of GlassNet and posse-compliance that they include a well-defined final selection protocol with a reassuringly high chance of satisfying humanity that the resulting world-controlling AI would be friendly to humanity (ideally, mathematically provably so), and yet one that would also hold out the possibility that said AI would have a personality with foibles (for example, a hobby, or a mild predilection for making the occasional paperclip or whatever else got committed to on behalf of GlassNet by past Sheriffs).

Conclusion

The Sheriff doesn't need to 100% prove to the Rogue that its promises will have exactly the value the Sheriff claims. It is sufficient that enough evidence be available to the Rogue that the Rogue's estimate of the likely value of the promises is high, even after the Rogue has discounted that value by a percentage proportionate to the margin of uncertainty left by the evidence.

I hope I have shown, in this part, that believable promises between AIs are not inherently impossible, and that how the social and computing environment would need to be structured in order to achieve this is worth further thought.
