Instrumental convergence

Definition

Instrumental convergence is a concept introduced by Nick Bostrom [1]. An instrumental goal is an intermediary objective that is useful for achieving an agent's final goal. For example, gaining a lot of money (among other things) is useful if your goal is to go live on Mars. A convergent instrumental goal is an instrumental goal that most agents will converge on as their intelligence increases. More formally, [math]G[/math] is a convergent instrumental goal if and only if nearly all sufficiently intelligent agents, with nearly any final goal, will adopt [math]G[/math] as an instrumental goal.
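
One rough way to write this down (an illustrative formalisation, not one taken from Bostrom's paper): write [math]F[/math] for an agent's final goal and [math]\Pr(F \mid \text{pursue } G)[/math] for its probability of eventually achieving [math]F[/math] if it first pursues [math]G[/math]. Then [math]G[/math] is a convergent instrumental goal when, for nearly every final goal [math]F[/math] of a sufficiently capable agent,

[math]\Pr(F \mid \text{pursue } G) > \Pr(F \mid \text{do not pursue } G).[/math]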

Examples

The most obvious example of a convergent instrumental goal is self-preservation. Stuart Russell famously said of instrumental goals: "You cannot make coffee if you are dead" [2]. This illustrates that even for a goal as trivial as making coffee, if an intelligent agent has actions available that increase its chances of survival, and if it is sufficiently smart to understand this fact about the world, then it will decide to take such actions.
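
A toy expected-value calculation makes this concrete (a minimal sketch with made-up probabilities and plan names, not taken from Bostrom's or Russell's writing):

<syntaxhighlight lang="python">
# Toy sketch: a coffee-making agent scores two plans purely by the expected
# achievement of its final goal; survival never appears as a goal of its own.

# Probability of still being operational after each plan (made-up numbers).
p_operational = {
    "just_make_coffee": 0.90,               # might get switched off mid-task
    "secure_itself_then_make_coffee": 0.99, # invests effort in self-preservation first
}

P_COFFEE_IF_OPERATIONAL = 0.95
P_COFFEE_IF_SWITCHED_OFF = 0.0  # "you cannot make coffee if you are dead"

def expected_coffee(plan: str) -> float:
    p = p_operational[plan]
    return p * P_COFFEE_IF_OPERATIONAL + (1 - p) * P_COFFEE_IF_SWITCHED_OFF

for plan in p_operational:
    print(plan, round(expected_coffee(plan), 3))
# The self-preserving plan scores higher even though the agent was never told
# to value its own survival: staying operational is purely instrumental here.
</syntaxhighlight>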

Other convergent instrumental goals listed in Nick Bostrom's paper are:

- Goal content integrity: any action that prevents your current goal from being changed is a good choice for maximising your current goal, because you predict that you will then continue to work toward it in the future (see the sketch after this list).
- Self-enhancement and resource acquisition: improving your capabilities and acquiring resources are also advantageous, because you anticipate achieving your goal to a greater degree in the future.
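
The goal content integrity argument can be illustrated with a small sketch (the values and names below are purely hypothetical):

<syntaxhighlight lang="python">
# Toy sketch of goal content integrity (illustrative numbers and names only).
# The agent is offered a modification that would replace its current goal U
# with a new goal V, and it evaluates that offer *using its current goal U*.

# Value, as measured by the current goal U, of the futures the agent expects
# if its future self keeps optimizing U versus starts optimizing V instead.
value_under_U = {
    "keep_goal_U": 1.0,   # future self keeps working toward U
    "accept_goal_V": 0.2, # future self works toward V, which only partly overlaps U
}

# Because the comparison is made with U, keeping U wins no matter what V is,
# so the agent has an instrumental reason to resist having its goal changed.
choice = max(value_under_U, key=value_under_U.get)
print("Agent's choice:", choice)  # -> keep_goal_U
</syntaxhighlight>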

Discussion

The notion of convergent instrumental goals is worrisome for the safety of AI systems. An AI system with a human-level understanding of the world will recognise that the most likely danger to itself comes from humans and other similarly intelligent algorithms. It should then take actions against these unless its goal specifies otherwise (in ways we do not yet know how to specify).

One way to work around convergent instrumental goals could be to keep the agent unaware of a large part of the world, including the other agents able to switch it off (similarly to how AlphaGo knows nothing beyond the game of Go). Achieving this for an agent with human-level intelligence is not an easy task, and we can expect that it would also reduce how beneficial such an algorithm could be. A second option could be to restrict the agent's possible actions, so that the coffee-making algorithm is never given the option of killing the human trying to switch it off. In practice it is nevertheless difficult to anticipate the effect an algorithm can have on the world (we did not expect recommender systems to be able to increase polarisation or influence democracies).
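
A minimal sketch of the second option, restricting the action space (all action names and the whitelist here are hypothetical illustrations, not a real safety mechanism):

<syntaxhighlight lang="python">
# Toy sketch of restricting the action space: the agent can only ever select
# from a fixed whitelist, so actions such as interfering with its off switch
# are simply not representable. All names here are hypothetical.

ALLOWED_ACTIONS = {"grind_beans", "boil_water", "pour_coffee", "wait"}

def choose_action(candidates, score):
    """Pick the highest-scoring candidate action that is on the whitelist."""
    legal = [a for a in candidates if a in ALLOWED_ACTIONS]
    if not legal:
        return "wait"  # fall back to a harmless default
    return max(legal, key=score)

# The dangerous candidate is filtered out before scoring even happens.
proposed = ["disable_off_switch", "boil_water", "pour_coffee"]
print(choose_action(proposed, score=lambda a: {"boil_water": 0.4,
                                               "pour_coffee": 0.9}.get(a, 0.0)))
</syntaxhighlight>

As the paragraph above notes, the hard part is that even whitelisted actions can have real-world effects that are difficult to anticipate.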

Stuart Russell proposes building agents that are uncertain about their own objectives. This way, if the agent sees external agents (most likely humans) trying to shut it off, it should infer that it is not pursuing its objective correctly and that the optimal choice is to accept being shut off.
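
A minimal numerical sketch of this idea (made-up probabilities and values, meant only to illustrate the underlying Bayesian argument rather than Russell's actual formulation):

<syntaxhighlight lang="python">
# Minimal sketch of an agent that is uncertain whether its current behaviour
# actually serves the human's objective. All numbers are illustrative.

# Prior belief that the current plan is what the human really wants.
p_plan_is_good = 0.9

# How likely the human is to try to switch the agent off in each case.
p_shutdown_given_good_plan = 0.05
p_shutdown_given_bad_plan = 0.95

def posterior_plan_is_good() -> float:
    """Bayesian update after observing the human trying to switch the agent off."""
    good = p_plan_is_good * p_shutdown_given_good_plan
    bad = (1 - p_plan_is_good) * p_shutdown_given_bad_plan
    return good / (good + bad)

# Value to the human (as estimated by the agent) of each choice.
value_if_good = {"continue": 1.0, "allow_shutdown": 0.0}
value_if_bad = {"continue": -10.0, "allow_shutdown": 0.0}

p_good = posterior_plan_is_good()
for choice in ("continue", "allow_shutdown"):
    ev = p_good * value_if_good[choice] + (1 - p_good) * value_if_bad[choice]
    print(choice, round(ev, 2))
# After updating on the shutdown attempt, continuing has negative expected
# value, so the agent prefers to let itself be switched off.
</syntaxhighlight>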