Volition refers to what we would prefer to prefer, as opposed to our direct preferences, which refer to what we actually prefer. It has been argued Yudkowsky04, Tarleton10, Soares15, Hoang19 that AI ethics should rest upon human volitions rather than human preferences.
Yudkowsky04 originally called it coherent extrapolated volition (CEV), a name that has been reused as such since. In terms of communication strategy, however, Lê now feels that this phrasing may make the concept sound more complicated than it is, which may be counter-productive when it comes to convincing the general public and AI researchers of its importance and of the tractability of its implementation.
Different approaches to eliciting and learning human preferences have been proposed, such as (cooperative) inverse reinforcement learning NgRussell00 HRAD16, active learning Burr10, or just good old supervised learning of human clicks (or watch time). After all, learning human preferences is the core business of today's most influential algorithms, namely recommender systems like YouTube's.
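To make the preference-learning baseline concrete, here is a minimal sketch of learning a utility score per item from observed pairwise choices, using a Bradley-Terry-style logistic model fit by gradient ascent. The function name, the toy items and the data are illustrative, not from any particular system.

```python
import math

def fit_bradley_terry(items, comparisons, lr=0.1, epochs=500):
    """Fit one utility score per item from pairwise choices.

    comparisons: list of (winner, loser) pairs. The probability that
    `a` is chosen over `b` is modeled as sigmoid(u[a] - u[b]), and the
    scores are fit by gradient ascent on the log-likelihood.
    """
    u = {item: 0.0 for item in items}
    for _ in range(epochs):
        for winner, loser in comparisons:
            p = 1.0 / (1.0 + math.exp(u[loser] - u[winner]))
            grad = 1.0 - p  # gradient of the log-likelihood w.r.t. u[winner]
            u[winner] += lr * grad
            u[loser] -= lr * grad
    return u

# Toy clicks: the subject picks "documentary" over "clickbait" 9 times out of 10.
clicks = [("documentary", "clickbait")] * 9 + [("clickbait", "documentary")]
scores = fit_bradley_terry(["documentary", "clickbait"], clicks)
```

Note that this learns revealed preferences from behavior; the whole point of the section below is that such scores may diverge from volitions.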
Arguably, the time has come to move beyond preference learning towards volition learning. Unfortunately, to the best of Lê's knowledge, there is no proposed framework for volition learning.
To better understand volition, Hoang19 proposed to distinguish different versions of "me" for any individual. There's current me, but also better me, called me+; worse me, called me- (who's typically in charge when we're angry or hungry); past me; future me; and zombie me (who's typically in charge when we're tired or drunk). Volition would then correspond to the preferences of "maximally better me", called me++.
HoangElmhamdi19FR propose to rest all of moral philosophy on a single axiom: me+'s preferences are "better" than me's preferences. In other words, there is some partial order over the set of preferences, which may be called "moral progress". While still vague, the authors argue that this axiom suffices to imply numerous practical implications, while being minimal enough to have a chance of being consensual enough to be implemented in practice. It is not clear what today's alternatives are.
Assuming that each me+ in turn has its own better version (me+)+, and so on all the way up to me++, the moral axiom above implies that we should (aim to) implement me++'s preferences. These may be called volitions.
It has been argued (in private discussions) that a me++ might be too different from a me to be actually desirable. There are numerous ways to move on from here, such as redefining volition as some more consensual and robustly desirable concept, or agreeing to only implement some me+'s preferences rather than me++'s. Clearly, we are far from any sort of actual consensus on what ought to be programmed. In practice, to make progress, it seems desirable to acknowledge this vagueness, but also perhaps to partially accept it for now. Indeed, what seems much more urgent is to advance our technological means to compute volitions; arguing too harshly over the details of the nature of volitions may hinder researchers' motivation to contribute to this effort. For now especially, it seems more adequate to focus on estimating the preferences of me+'s, that is, of some (slight but robust) improvements of me's. This would already be a huge step forward.
It is important to note that our definitions of volition do not seem to imply its uniqueness. In fact, if we reason in terms of counterfactuals (see below), then even the me++ of a given subject should not be regarded as a fixed, deterministic volition. Rather, we should acknowledge our epistemic uncertainty about our me++'s, especially if we consider a range of counterfactuals. In Bayesian terms, we should probably map any individual to a probability distribution over their probable volitions.
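The Bayesian view above can be sketched very simply: treat each candidate volition as a latent parameter, and update a discrete prior with noisy elicited judgments. The Gaussian noise model, the candidate values and the observations below are all hypothetical placeholders.

```python
import math

def posterior_over_volitions(candidates, prior, observations, noise=1.0):
    """Return P(theta | observations) over a discrete set of candidate
    volitions theta, assuming observations are theta plus Gaussian noise
    (a hypothetical noise model, for illustration only)."""
    weights = []
    for theta, p in zip(candidates, prior):
        likelihood = 1.0
        for x in observations:
            likelihood *= math.exp(-((x - theta) ** 2) / (2 * noise ** 2))
        weights.append(p * likelihood)
    z = sum(weights)
    return [w / z for w in weights]

candidates = [-1.0, 0.0, 1.0]   # three candidate volitions
prior = [1 / 3, 1 / 3, 1 / 3]   # uniform prior over them
obs = [0.9, 1.1, 0.8]           # noisy elicited judgments
post = posterior_over_volitions(candidates, prior, obs)
```

The output is a distribution over volitions rather than a single point estimate, which is exactly the object that the next paragraph argues we should aggregate.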
This all implies that volition learning will likely be insufficient, because it may not converge to a single utility function (or at least we should prepare for the case where it does not). We would then need social choice theory to aggregate potentially incompatible volitions. Fortunately, several recent works tackle social choice for AI ethics!
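One of the simplest social-choice baselines for aggregating incompatible volitions is to normalize each individual's utility function to a common scale and average, i.e. range voting over normalized utilities. This is only one baseline among many; the function and the toy policies below are illustrative.

```python
def aggregate_volitions(utilities):
    """Aggregate several (possibly incompatible) utility functions over
    the same alternatives: normalize each to [0, 1], then average.
    A simple range-voting baseline, not a recommendation."""
    alternatives = list(utilities[0].keys())
    normalized = []
    for u in utilities:
        lo, hi = min(u.values()), max(u.values())
        normalized.append({a: (u[a] - lo) / (hi - lo) for a in alternatives})
    return {a: sum(n[a] for n in normalized) / len(normalized)
            for a in alternatives}

# Two subjects with conflicting volitions over three policies.
u1 = {"A": 10, "B": 0, "C": 5}
u2 = {"A": 0, "B": 1, "C": 2}
agg = aggregate_volitions([u1, u2])
best = max(agg, key=agg.get)
```

Note how the aggregate picks the compromise policy "C", which neither subject ranks first.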
One fruitful path may be to analyze the difference between what is learned through inverse reinforcement learning and what is learned through (well-framed) elicitations, as proposed by Moral Machines ADKSH+18 or WeBuildAI LKKKY+19 (see social choice). Arguably, what people say they prefer in environments that emphasize moral concerns is closer to volition than how people behave in their daily lives.
Another approach could be to pick up on weak signals that indicate whether a given subject is acting under System 1 or System 2 KahnemanBook11. Typically, indications of anger should increase our belief that the subject is acting according to preferences rather than volitions.
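This weak-signal idea amounts to a Bayesian update on "which system is in charge". A minimal sketch, in which the signal names and likelihood values are entirely made up for illustration:

```python
def update_system_belief(p_s1, signals, likelihoods):
    """Sequential Bayesian update on P(subject is in System 1).

    likelihoods[s] = (P(s | System 1), P(s | System 2)); the values
    used below are hypothetical and would have to be estimated from
    data in a real pipeline."""
    for s in signals:
        l1, l2 = likelihoods[s]
        numerator = p_s1 * l1
        p_s1 = numerator / (numerator + (1 - p_s1) * l2)
    return p_s1

# Hypothetical likelihoods: anger is far likelier under System 1.
likelihoods = {"anger": (0.6, 0.1), "deliberate_pause": (0.2, 0.7)}
p = update_system_belief(0.5, ["anger"], likelihoods)
```

A high resulting belief would then flag the observed behavior as evidence about preferences rather than volitions, and downweight it accordingly in volition learning.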
It could also be interesting to model the System 1 / System 2 interaction as a game, and to consider that human actions are Nash equilibria of such a game. This could then be used to infer volitions instead of preferences.
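For small action sets, the equilibria of such a game can be found by brute force. Here is a sketch with System 1 and System 2 as the two players; the payoff matrix is entirely hypothetical, meant only to show the machinery:

```python
import itertools

def pure_nash_equilibria(payoffs):
    """Brute-force pure Nash equilibria of a two-player game.

    payoffs[(a, b)] = (payoff to System 1, payoff to System 2) when
    System 1 plays a and System 2 plays b."""
    actions1 = {a for a, _ in payoffs}
    actions2 = {b for _, b in payoffs}
    equilibria = []
    for a, b in itertools.product(actions1, actions2):
        u1, u2 = payoffs[(a, b)]
        best1 = all(payoffs[(a2, b)][0] <= u1 for a2 in actions1)
        best2 = all(payoffs[(a, b2)][1] <= u2 for b2 in actions2)
        if best1 and best2:
            equilibria.append((a, b))
    return equilibria

# Hypothetical payoffs: each system prefers its own action, but both
# prefer coordinating over deadlock (a battle-of-the-sexes structure).
payoffs = {
    ("impulse", "impulse"): (2, 1),
    ("impulse", "reflect"): (0, 0),
    ("reflect", "impulse"): (0, 0),
    ("reflect", "reflect"): (1, 2),
}
eqs = pure_nash_equilibria(payoffs)
```

Under this toy model, observed behavior consistent with the ("reflect", "reflect") equilibrium would be weighted as closer to volition, while the ("impulse", "impulse") equilibrium would be treated as mere preference.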
Perhaps, in the long run, the most robust method will have to be based on counterfactuals. This is, after all, how Yudkowsky04 first framed volition: assuming the subject has now calmed down, thinks deeply, has learned the relevant data and wants to act mindfully, what would they do? Works based on Bayesian networks might be useful (though Lê is slightly skeptical of this), or it might simply be possible to perform such counterfactual reasoning with GAN-like neural networks. In fact, arguably, transformers already perform some sort of counterfactual reasoning.
From a theoretical side, it could be worthwhile to develop a theory of volition. One potentially interesting question could be: is the volition of my me++ equal to my own volition? In other words, is the map from me to me++ idempotent?