RB Wiki - User contributions [en]

YouTube

2020-01-31T23:31:57Z

El Mahdi El Mhamdi:

YouTube is a video-sharing platform. It is argued by [https://laboutique.edpsciences.fr/produit/1107/9782759824304/Le%20fabuleux%20chantier HoangElmhamdi][https://scholar.google.ch/scholar?hl=en&as_sdt=0%2C5&q=Le+fabuleux+chantier%3A+Rendre+l%27intelligence+artificielle+robustement+b%C3%A9n%C3%A9fique&btnG= 19FR] to be one of today's biggest opinion-makers, and the YouTube algorithm is argued to be one of today's most sophisticated algorithms. This arguably makes YouTube one of the most crucial applications and test cases for AI ethics.

== Key numbers ==

As a website, [https://www.alexa.com/topsites according to Alexa], it is second only to Google.com. Yet according to [https://www.allaccess.com/merge/archive/29580/2019-this-is-what-happens-in-an-internet-minute Merge], there are actually more views on YouTube than searches on Google (and it has been so since 2016). Every minute, in 2019, there were 4.5M YouTube views, as opposed to 3.8M Google searches.

Apart from direct messaging (~250M emails + Messenger + WhatsApp + Texts), this is what's biggest, in contention with Giphy (4.8M). Behind, we find Snapchat (2.4M snaps), Tinder (1.4M swipes), Twitch (1M views) and Facebook (1M logging in). Netflix, Instagram and Twitter follow, with <1M views, scrolls and tweets.

In 2017, YouTube exceeded 1 billion hours of watch-time per day. It self-reported having 2 billion users, with 70% of views on phones [https://www.youtube.com/about/press/ YouTubePress].

Most crucially, YouTube's chief product officer revealed that 70% of views on YouTube result from recommendations of the YouTube algorithm [https://www.cnet.com/news/youtube-ces-2018-neal-mohan/ cnet18]. This algorithm is crtical to YouTube (this is Lê's impression after numerous private conversations with different YouTube employees).

It is hard to grasp the extent of the influence of the YouTube recommender algorithm if one is not familiar with [[cognitive biases]]. In particular, we humans undergo the familiarity bias, which makes us prefer and consider more true things and ideas that are familiar to us [https://www.youtube.com/watch?v=cebFWOlx848 Veritasium16]. The impact of this bias is huge, especially for repeated exposure. In fact, Google research suggests that the impacts of repeated exposure are still increasing months after the first exposure [https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43887.pdf HOT][https://dblp.org/rec/bibtex/conf/kdd/HohnholdOT15 15]. This makes familiarity bias hard to study.

Boosted by the recommender algorithm, YouTube is remarkably successful at making users' sessions last an average of 60 minutes [https://www.thestreet.com/investing/youtube-might-be-worth-over-100-billion-14586599 theStreet18]. This ubiquity of YouTube means that business is occuring on YouTube, with 61% of US companies on the platform [https://www.prnewswire.com/news-releases/youtube-outperforms-linkedin-as-social-media-channel-of-choice-for-over-60-of-enterprises-300297851.html PRNewswire16].

YouTube's recommender algorithm builds upon and is intimately connected to numerous other spectacular services provided by other YouTube algorithms. After all, there are more than 500 hours of new video uploaded per minute [https://www.thestreet.com/investing/youtube-might-be-worth-over-100-billion-14586599 theStreet18], which have to be analyzed for violence, hate speech [https://www.wired.com/story/youtube-content-moderation-inconsistent/ wired18] or paedophilia [https://www.wired.co.uk/article/youtube-pedophile-videos-advertising wiredUK19]. Note that this is in YouTube's incentives, as many advertisers have left YouTube after complaining about being associated with controversial videos.

The volume of videos to be processed is huge. But the quality of the processing is also crucial. Advertisers and content creators both put strong pressure on YouTube to moderate videos adequately. But this is a hard task. YouTube is probably deploying the most sophisticated state-of-the-art deep-learning-based algorithms to do so, especially image and sound processing, speech recognition, natural language processing (over 1 billion videos are automatically captioned [https://youtube.googleblog.com/2017/02/one-billion-captioned-videos.html YouTubeBlog17]), feature learning, user representation, and so on.

Also it's actually a hard task to delegate to humans, because humans are biased too. Moreover, the video moderation task is traumatic, with numerous cases of PTSD after being exposed to horrible videos, like murder videos or child abuse [https://www.youtube.com/watch?v=bDnjiNCtFk4 TheVerge19] [https://www.youtube.com/watch?v=yK86Rzo_iHA BBC18].

[https://www.youtube.com/watch?v=nkWmiNRPU-c LexFridman20] interviews Christos Goodrow, VP of Engineering at Google and head of Search and Discovery at YouTube.

== Algorithms ==

YouTube is unfortunately very private about its algorithms and data. But there are a few publications.

The recommender algorithm definitely relies on deep neural networks [https://dl.acm.org/doi/pdf/10.1145/2959100.2959190?download=true CAS][https://dblp.org/rec/bibtex/conf/recsys/CovingtonAS16 16] to perform video and user embedding. This allows it to perform numerous tasks [https://daiwk.github.io/assets/youtube-multitask.pdf ZHWCN+][https://dblp.org/rec/bibtex/conf/recsys/ZhaoHWCNAKSYC19 19].

Recently, [https://www.ijcai.org/proceedings/2019/0360.pdf IJWNA+][https://dblp.org/rec/bibtex/conf/ijcai/IeJWNAWCCB19 19] proposed a reinforcement learning algorithm for video recommendation, which they actually run on YouTube with notable impacts, including in the long run. This strongly suggests that YouTube are, or will soon be, deploying reinforcement learning algorithms with planning capabilities (probably at the scale of weeks! [https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43887.pdf HOT][https://dblp.org/rec/bibtex/conf/kdd/HohnholdOT15 15]).

But what do these algorithms maximize? Unfortunately, little is shared publicly by YouTube. YouTube managers, engineers and researchers all told Lê in private conversation that they maximized "user engagement", which was very distinct from good old watch-time. It seems likely that this measure includes things like the "appropriateness" of the video, as YouTube announced its will to combat radicalization and conspiracy theories [https://www.theguardian.com/technology/2019/jan/25/youtube-conspiracy-theory-videos-recommendations TheGuardian19]. It seems that YouTube, Twitter and Facebook have all been investing much more ethical concerns [https://www.youtube.com/watch?v=1PGm8LslEb4 SmarterEveryDay19a] [https://www.youtube.com/watch?v=V-1RhQ1uuQ4 SmarterEveryDay19b] [https://www.youtube.com/watch?v=FY_NtO7SIrY SmarterEveryDay19c], probably in large part because it has become in their interests.

However, there is doubt that YouTube is doing enough. What is very clear is that their opaqueness hinders any effort to better judge the impacts of supposed changes of YouTube algorithms, as well as to contribute to making these changes robustly beneficial. YouTube CEO Susan Wojcicki claimed in [https://www.youtube.com/watch?v=3GYxmdBE7Mg 60minutes19] that, because of changes, in 2019, American users consume 70% less contents on controversial topics (!). But to the best of Lê's knowledge, this figure has not been confirmed by any independent researcher.

Lê thinks that YouTube cannot possibly be doing enough; simply because the challenges are so huge that the help of thousands of top researchers is probably necessary to "do enough". In fact, arguably, the ethics of such systems won't be really trustable as long as no [[social choice]] mechanism to determine, say, the hate speech limit, is developed.

== Effects ==

Building upon [https://arxiv.org/pdf/1602.02692 Allgaier][https://dblp.org/rec/bibtex/journals/corr/Allgaier16 16] where he used [https://en.wikipedia.org/wiki/Tor_(anonymity_network) Tor] to better understand how YouTube recommends videos, one after the other, [https://www.frontiersin.org/articles/10.3389/fcomm.2019.00036/pdf Allgaier][https://scholar.google.ch/scholar?hl=en&as_sdt=0%2C5&q=Science+and+Environmental+Communication+via+Online+Video%3A+Strategically+Distorted+Communications+on+Climate+Change+and+Climate+Engineering+on+YouTube&btnG= 19] showed evidence that, while YouTube presents correct recommendations when users search "climate change", it suggests climate denial videos for searches "geoengineering" or "climate manipulation". Disturbingly, this results in roughly as many views defending the climate change consensus as denying them, among the top 200 videos on such topics (16.94M for each side). Similar techniques are proposed by [https://algotransparency.org Algotransparency].

YouTube use has also be connected to radicalization. By analyzing YouTube comments, [https://arxiv.org/pdf/1908.08313 ROWAM][https://dblp.org/rec/bibtex/conf/fat/RibeiroO0AM20 20] showed that a strong signal of users moving from "Alt-lite" to the "Intellectual Dark Web", and then from the "Intellectual Dark Web" to the "Alt-right". This is evidence of a radicalization pipeline suggested by, say, [https://www.marcdcounselling.com/phdi/p1.nsf/imgpages/8049_ZeynepTufekci.pdf/$file/ZeynepTufekci.pdf Tufekci][https://scholar.google.ch/scholar?hl=en&as_sdt=0%2C5&q=YouTube%2C+the+Great+Radicalizer&btnG= 18] or [https://pdfs.semanticscholar.org/ec44/a947dad728e64624f034182d8cc13fc89796.pdf PutraIbrahim][https://scholar.google.ch/scholar?hl=en&as_sdt=0%2C5&q=Online+Radicalization+and+Violent+Extremism+Putra+Ibrahim&btnG= 17]. [https://www.pnas.org/content/pnas/114/48/12714.full.pdf MKNS][https://scholar.google.ch/scholar?hl=en&as_sdt=0%2C5&q=Psychological+targeting+as+an+effective+approach+to+digital+mass+persuasion&btnG= 17] suggest that targeted massive persuasion was very effective. See [[online polarization]] for more on this debated topic.

Concerns have also been raised, typically in [https://laboutique.edpsciences.fr/produit/1107/9782759824304/Le%20fabuleux%20chantier HoangElmhamdi][https://scholar.google.ch/scholar?hl=en&as_sdt=0%2C5&q=Le+fabuleux+chantier%3A+Rendre+l%27intelligence+artificielle+robustement+b%C3%A9n%C3%A9fique&btnG= 19FR], about biases (the YouTube channel Science4All has only 7% of female views; similar channels have similar statistics, including when hosted by a female!), anger virality [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1019.3951&rep=rep1&type=pdf BergerMilkman][https://scholar.google.ch/scholar?hl=en&as_sdt=0%2C5&q=What+Makes+Online+Content+Viral%3F&btnG= 12] [https://www.academia.edu/6445983/What_makes_a_video_go_viral_An_analysis_of_emotional_contagion_and_Internet_memes GRMO][https://scholar.google.ch/scholar?hl=en&as_sdt=0%2C5&q=What+makes+a+video+go+viral%3F+An+analysis+of+emotional+contagion+and+Internet+memes&btnG= 13] [https://www.youtube.com/watch?v=rE3j_RHkqJc CGPGrey15] [https://www.youtube.com/watch?v=Re7fycp7vIk S4A17FR] and mental health [https://onlinelibrary.wiley.com/doi/epdf/10.1002/da.22466 LSSRM+][https://scholar.google.ch/scholar?hl=en&as_sdt=0%2C5&q=ASSOCIATION+BETWEEN+SOCIAL+MEDIA+USE+AND+DEPRESSION+AMONG+U.S.+YOUNG+ADULTS&btnG= 16]. Arguably, misinformation about vaccines promoted by YouTube already caused (hundreds of) thousands of deaths [https://dl.acm.org/doi/pdf/10.1145/3097286.3097303?download=true SongGruzd][https://dblp.org/rec/bibtex/conf/smsociety/SongG17 18] [https://www.researchgate.net/profile/Gabriele_Donzelli/publication/323864601_Misinformation_on_vaccination_A_quantitative_analysis_of_YouTube_videos/links/5b06f5680f7e9b1ed7eaf7e5/Misinformation-on-vaccination-A-quantitative-analysis-of-YouTube-videos.pdf DPFAC+][https://scholar.google.ch/scholar?hl=en&as_sdt=0%2C5&q=Misinformation+on+vaccination%3A+A+quantitative+analysis+of+YouTube+videos&btnG= 18] [https://www.who.int/news-room/feature-stories/detail/measles-fighting-a-global-resurgence WHO19].

Unfortunately, such effects are mostly hard to study, partly because YouTube's data is private, but also because the time-scale of such effects is of the order of weeks, months, and even probably years [https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43887.pdf HOT][https://dblp.org/rec/bibtex/conf/kdd/HohnholdOT15 15]. But a study by Facebook [https://www.pnas.org/content/pnas/111/24/8788.full.pdf KGH][https://scholar.google.ch/scholar?hl=en&as_sdt=0%2C5&q=Experimental+evidence+of+massive-scale+emotional+contagion+through+social+networks&btnG= 14] suggests that repeated exposure to some positive or negative emotion effectively changes users' emotions within merely a week. Ethical concerns have prevented studies on a longer term, but it seems likely that even stronger effects could be observed.

In any case, even if YouTube were not that detrimental today, it is worth pointing out that, by being beneficial, YouTube could be an enormous amount of good. Making YouTube beneficial is not just about fixing problems; it's about making the world a much better place. YouTube could promote social justice causes more effectively, raise awareness of climate change, encourage environmental-friendly habits, discourage polluting consumption, teach mathematics and critical thinking, diagnose and accompany mental depression and loneliness [https://www.pnas.org/content/pnas/115/44/11203.full.pdf ESMUC+][https://scholar.google.ch/scholar?hl=en&as_sdt=0%2C5&q=Facebook+language+predicts+depression+in+medical+records&btnG= 18], provide quality medical advice and promote kindness over hatred.

However, there are numerous potential pitfalls (most of which Lê has surely overlooked). Better understanding the YouTube ecosystem and what is going (or can go) wrong is definitely an important research challenge. In particular, we should not aim to make YouTube mostly or roughly beneficial. We should try to make it robustly beneficial. Robust to the diversity of users' moral preferences (see [[social choice]]), to the inevitable [[distributional shift]] of any change of the algorithm, to [[adversarial attacks]] by malicious users and to unanticipated [[side effects|side effect]]. This is not easy. It needs to be researched.

== Can/Should YouTube be more open? ==

There's clearly some (economical) interest in keeping codes and data private, even for safety/privacy reasons. But such incentives are arguably overstated. On the opposite, it may be in YouTube's best interests to open more their codes and data. Below are a few arguments why, taken from [https://laboutique.edpsciences.fr/produit/1107/9782759824304/Le%20fabuleux%20chantier HoangElmhamdi][https://scholar.google.ch/scholar?hl=en&as_sdt=0%2C5&q=Le+fabuleux+chantier%3A+Rendre+l%27intelligence+artificielle+robustement+b%C3%A9n%C3%A9fique&btnG= 19FR].

The heart of all following arguments lies in the fact that social medias are natural monopolies because of the [[network effect]]. As a result, unlike in other businesses, the main threat to companies like Facebook and YouTube is probably not the competition. Rather, it is some major backlash caused by safety breach or ethical scandals.

In particular, as a result, the secrecy of the software and data of such companies does not seem to be the critical part of the business. Sure, this software must be very performant. But competitors will always be having a hard time, even if they access this software. After all, setting up the huge YouTube infrastructure just can't be done by some businesses, while acquiring the market share that YouTube has requires a lot more than just software and data.

On the other hand, sharing (intelligently) software and data could allow YouTube to gain ever more so the trust of users. This is critical. While today, YouTube's algorithm is not heavily criticized by mainstream medias, lack of efforts to show that YouTube is doing its best to make the algorithm robustly beneficial could turn into a major scandal. YouTube could anticipate such backlashes, by showing that they are collaborating with diverse independent research groups to make the YouTube algorithm beneficial.

Another reason why more open software and data could be fruitful is so as to attract talents. Today's young engineers and researchers are more and more concerned by ethical concerns, which led to recruitment difficulties for Facebook [https://politicalwire.com/2019/05/17/facebook-has-trouble-recruiting-after-election-scandal/ PoliticalWire19]. What's more, many young scholars want to publicize their works, especially if it is research work.

Opening software and data could allow for collaborations with new insights. YouTube's data surely contain formidable social science insights, which could allow scientists to much better understand the society we live in. Open source code is also well-known to allow for numerous patchings that greatly increase the security of software [https://arxiv.org/pdf/0801.3924.pdf HoepmanJacobs][https://dblp.org/rec/bibtex/journals/cacm/HoepmanJ07 07].

Finally, there seems to be people in YouTube who are really motivated to make sure their work is beneficial to mankind. CEO Wojcicki seems to argue she is. Lê had numerous private discussions with YouTube employees that suggest they are. Happier and more motivated employees arguably often produce great work. Insisting on the motivation to do good could thus be a great way to make YouTube employees happy and productive.

It is noteworthy that these thoughts also suggest ways for both YouTube insiders and outsiders to increase YouTube's incentives to be more open, which seems very much desirable for AI ethics. Typically, we probably should demand a lot more from YouTube, and be unsatisfied with supposedly beneficial secret work done within the company. We may want to insist on the fact that the secrecy of YouTube's algorithms does not seem critical at all to YouTube's financial health, while a bad reputation could hinder YouTube adoption, recruitment and collaborations. We should make YouTube's current and future employees concerned by AI ethics, and tell them that it's okay to leave the company if they are not sufficiently convinced that their contributions are sufficiently ethical. And we probably should promote the economical value of openness.

== Can the YouTube algorithm go AGI? ==

Part of the AI ethics community is particularly concerned with very powerful AIs, which are often referred to as artificial general intelligences (AGIs). Unfortunately, experts seem to disagree on the very meaning of AGI. Some experts argue that AGI is even meaningless.

A more interesting question is to determine whether YouTube can perform better-than-human planning to optimize its goal. While YouTube is nowhere near effective long-term capability, it is worth pointing out that the baseline comparison (human-level video recommendation) is extreme low. This is mostly because the speed of video recommendation necessary to satisfy billions of users is way beyond human-level. But also, it involves a huge amount of number crunching, as a user's next most relevant video to recommend depends on his watch history and on the enormous set of videos on the platform. By doing so, in a sense, the YouTube algorithm may have a better overview of what's happening on YouTube — especially given that much of what's going on YouTube is locked into private data. In a sense, YouTube is thus superhuman at many of its core tasks.

However, for effective long-term capability, the YouTube algorithm might need to better understand aspects of human psychology, sociology and economy that it is nowhere near understanding right now. Early research aims to go in this direction [https://www.ijcai.org/proceedings/2019/0360.pdf IJWNA+][https://dblp.org/rec/bibtex/conf/ijcai/IeJWNAWCCB19 19]. But could the YouTube algorithm eventually learn all of this?

It is hard to make predictions, especially about the future. And especially about the future of AI. But recent advances in natural language processing (like GPT2 [https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf RWAAC+][https://scholar.google.ch/scholar?hl=en&as_sdt=0%2C5&q=Language+Models+are+Unsupervised+Multitask+Learners&btnG= 18] [https://www.youtube.com/watch?v=rURRYI66E54 Computerphile19a] [https://www.youtube.com/watch?v=89A4jGvaaKk Computerphile19b]) suggest that today's machine learning models might soon be able to crunch huge amounts of data to infer very sophisticated world models. Given the huge economical and moral incentives to improve YouTube's algorithms, there may be a chance that these algorithms might become the first algorithm with superhuman long-term capability.

The trouble with human-level planning capabilities is that any side effect will likely be amplified on unprecedented scales. Worse, such effects may be irreversible, as can be, say, a huge financial crisis, but on still bigger scales. One may even fear that such side effects may threaten the well-being of mankind; perhaps even its survival if global war gets triggered.

How systems developers can contribute

2020-01-27T13:55:41Z

El Mahdi El Mhamdi:

[[Robustly beneficial]] decision making algorithms require robust systems and software that reliably implement the specifications they were meant for.

== Reliable ML frameworks ==

Machine learning software such as Tensorflow and Pytorch is mostly based on the vanilla versions of ML algorithms which rarely include robustness and safety properties given the effect these properties could (or are believed to) have on performance.

An example is that all the available frameworks use averaging as a way to aggregate gradients, which

== Distributed systems for AI ==

== Sandboxing AI ==

Several proposals (e.g. in Russel2019 and Tegmark2016) have been made to ensure that intelligent decision making software is properly sandboxed in order to confine it and better control its actions. It is however argued that the focus should be put instead on ensuring alignment a priori, since sandboxing could be hopeless given how much distributed modern AI systems are by design (Hoang&Elmhamdi19)

How systems developers can contribute

2020-01-27T13:44:57Z

El Mahdi El Mhamdi: Created page with "Robustly beneficial decision making algorithms require robust system that reliably implement the specifications they were meant for."

[[Robustly beneficial]] decision making algorithms require robust system that reliably implement the specifications they were meant for.

How mathematicians can contribute

2020-01-27T13:42:30Z

El Mahdi El Mhamdi:

[[Robustly beneficial]] decision-making raises numerous technical problems, especially in terms of provable guarantees. Such questions may be of interests to mathematicians.

== Statistics ==

The most relevant mathematics to AI ethics is probably statistics. In particular, it seems critical to better understand [[Goodhart's law]] and [[robust statistics]]. Recent work has yielded remarkable provable guarantees [https://arxiv.org/pdf/1906.03058 DepersinLecué][https://scholar.google.ch/scholar?hl=en&as_sdt=0%2C5&q=Robust+subgaussian+estimation+of+a+mean+vector+in+nearly+linear+time&btnG= 19], while empirical findings suggest a need for a new theory of [[overfitting]] [https://openreview.net/pdf?id=Sy8gdB9xx ZBHRV][https://dblp.org/rec/bibtex/conf/iclr/ZhangBHRV17 17].

The [[reinforcement learning]] framework provides numerous mathematical challenges. One area already widely investigated is (convex) [[online learning]]. The study of [[AIXI]] and variants is also a relevant area of study with lots of nice unsolved mathematical conjectures.

== Neural networks ==

The research on neural networks has attracted significant interest from mathematicians since the early days of connectionism. For instance, functional analysis in the late 1980s and early 1990 has been instrumental to prove the [https://en.wikipedia.org/wiki/Universal_approximation_theorem universal approximation theorems]. Roughly speaking, these theorems state the space of functions that are generated by neural networks are dense in the space of borel-measurable functions and as such, neural networks could approximate any computable function up to arbitrary precision.

Recently, mathematicians studying neural networks have focused on generalisation bounds of neural networks.
Some insightful mathematics studying neural networks has been derived by taking the infinite-width limit [http://papers.nips.cc/paper/8076-neural-tangent-kernel-convergence-and-generalization-in-neural-networks.pdf JHG][https://dblp.org/rec/bibtex/conf/nips/JacotHG18 18]. Remarkably, it was found that the learning of the neural networks could be described as a convex kernel gradient descent in the function space [https://www.youtube.com/playlist?list=PL95IbaKDuVrI_i5u8swlCLcr-4tX2SDxI ZettaBytes18].

== Algebraic topology ==

There may be results to investigate about the application of algebraic topology to distributed computing [https://www.amazon.com/Distributed-Computing-Through-Combinatorial-Topology/dp/0124045782 HKRBook][https://scholar.google.ch/scholar?hl=en&as_sdt=0%2C5&q=Distributed+computing+through+combinatorial+topology&btnG= 13]. This would be relevant as today's algorithms rely more and more on [[distributed learning]].

Related to this, there is interesting questions around the generalization of [[Bayesian agreement]]. For instance, are there upper and lower bound on agreement in Byzantine Bayesian agreement?

== Game theory ==

Contributions from game theory, in particular from [[Social_choice|social choice theory]] could inform better collective decision making that is needed for robustly beneficial algorithms. A classic example is the one of moral preferences aggregation.

== Linear algebra ==

Tensor calculus, SVD, linear-algebraic formulation of ML…

== Optimisation and multi-variate calculus ==

The training mechanisms that are involved in machine learning require the development of new tools in (continuous) optimisation and multi-variate calculus. Examples of developments in theses field that were due to machine learning research include coordinate-descent, 1-bit gradient descent, gradient descent with momentum

Byzantine fault tolerance

2020-01-27T13:32:17Z

El Mahdi El Mhamdi:

In a large distributed systems, it is very likely that some components will fail. Robustness of distributed systems therefore requires that the behaviour of the systems is unaffected by the failure of a fraction of its components.

The hardest type of failure is when the component behaves arbitrarily badly and not simply crashes and stops responding to the other components of the system. This type of failure is referred to as "Byzantine" as introduced by Lamport, Shostak, and Pease in 1982 [[https://web.archive.org/web/20170205142845/http://lamport.azurewebsites.net/pubs/byz.pdf|LSP1982]] as the Byzantine generals thought experiment.

Byzantine fault tolerance

2020-01-27T13:31:27Z

El Mahdi El Mhamdi: Created page with "In a large distributed systems, it is very likely that some components will fail. Robustness of distributed systems therefore requires that the behaviour of the systems is una..."

How mathematicians can contribute

2020-01-27T13:23:38Z

El Mahdi El Mhamdi: /* Neural networks */

[[Robustly beneficial]] decision-making raises numerous technical problems, especially in terms of provable guarantees. Such questions may be of interests to mathematicians.

== Statistics ==

The most relevant mathematics to AI ethics is probably statistics. In particular, it seems critical to better understand [[Goodhart's law]] and [[robust statistics]]. Recent work has yielded remarkable provable guarantees [https://arxiv.org/pdf/1906.03058 DepersinLecué][https://scholar.google.ch/scholar?hl=en&as_sdt=0%2C5&q=Robust+subgaussian+estimation+of+a+mean+vector+in+nearly+linear+time&btnG= 19], while empirical findings suggest a need for a new theory of [[overfitting]] [https://openreview.net/pdf?id=Sy8gdB9xx ZBHRV][https://dblp.org/rec/bibtex/conf/iclr/ZhangBHRV17 17].

The [[reinforcement learning]] framework provides numerous mathematical challenges. One area already widely investigated is (convex) [[online learning]]. The study of [[AIXI]] and variants is also a relevant area of study with lots of nice unsolved mathematical conjectures.

== Neural networks ==

The research on neural networks has attracted significant interest from mathematicians since the early days of connectionism. For instance, functional analysis in the late 1980s and early 1990 has been instrumental to prove the universal approximation theorems. Roughly speaking, these theorems state the space of functions that are generated by neural networks are dense in the space of borel-measurable functions and as such, neural networks could approximate any computable function up to arbitrary precision.

Recently, mathematicians studying neural networks have focused on generalisation bounds of neural networks.
Some insightful mathematics studying neural networks has been derived by taking the infinite-width limit [http://papers.nips.cc/paper/8076-neural-tangent-kernel-convergence-and-generalization-in-neural-networks.pdf JHG][https://dblp.org/rec/bibtex/conf/nips/JacotHG18 18]. Remarkably, it was found that the learning of the neural networks could be described as a convex kernel gradient descent in the function space [https://www.youtube.com/playlist?list=PL95IbaKDuVrI_i5u8swlCLcr-4tX2SDxI ZettaBytes18].

== Algebraic topology ==

There may be results to investigate about the application of algebraic topology to distributed computing [https://www.amazon.com/Distributed-Computing-Through-Combinatorial-Topology/dp/0124045782 HKRBook][https://scholar.google.ch/scholar?hl=en&as_sdt=0%2C5&q=Distributed+computing+through+combinatorial+topology&btnG= 13]. This would be relevant as today's algorithms rely more and more on [[distributed learning]].

Related to this, there is interesting questions around the generalization of [[Bayesian agreement]]. For instance, are there upper and lower bound on agreement in Byzantine Bayesian agreement?

== Game theory ==

Contributions from game theory, in particular from [[Social_choice|social choice theory]] could inform better collective decision making that is needed for robustly beneficial algorithms. A classic example is the one of moral preferences aggregation.

How mathematicians can contribute

2020-01-27T12:43:10Z

El Mahdi El Mhamdi:

How mathematicians can contribute

2020-01-27T12:42:12Z

El Mahdi El Mhamdi:

[[Robustly beneficial]] decision-making raises numerous technical problems, especially in terms of provable guarantees. Such questions may be of interests to mathematicians.

== Statistics ==

The most relevant mathematics to AI ethics is probably statistics. In particular, it seems critical to better understand [[Goodhart's law]] and [[robust statistics]]. Recent work has yielded remarkable provable guarantees [https://arxiv.org/pdf/1906.03058 DepersinLecué][https://scholar.google.ch/scholar?hl=en&as_sdt=0%2C5&q=Robust+subgaussian+estimation+of+a+mean+vector+in+nearly+linear+time&btnG= 19], while empirical findings suggest a need for a new theory of [[overfitting]] [https://openreview.net/pdf?id=Sy8gdB9xx ZBHRV][https://dblp.org/rec/bibtex/conf/iclr/ZhangBHRV17 17].

The [[reinforcement learning]] framework provides numerous mathematical challenges. One area already widely investigated is (convex) [[online learning]]. The study of [[AIXI]] and variants is also a relevant area of study with lots of nice unsolved mathematical conjectures.

== Neural networks ==

Recently, insightful mathematics studying neural networks has been derived by taking the infinite-width limit [http://papers.nips.cc/paper/8076-neural-tangent-kernel-convergence-and-generalization-in-neural-networks.pdf JHG][https://dblp.org/rec/bibtex/conf/nips/JacotHG18 18]. Remarkably, it was found that the learning of the neural networks could be described as a convex kernel gradient descent in the function space [https://www.youtube.com/playlist?list=PL95IbaKDuVrI_i5u8swlCLcr-4tX2SDxI ZettaBytes18].

== Algebraic topology ==

There may be results to investigate about the application of algebraic topology to distributed computing [https://www.amazon.com/Distributed-Computing-Through-Combinatorial-Topology/dp/0124045782 HKRBook][https://scholar.google.ch/scholar?hl=en&as_sdt=0%2C5&q=Distributed+computing+through+combinatorial+topology&btnG= 13]. This would be relevant as today's algorithms rely more and more on [[distributed learning]].

Related to this, there is interesting questions around the generalization of [[Bayesian agreement]]. For instance, are there upper and lower bound on agreement in Byzantine Bayesian agreement?

== Game theory ==

Contributions from game theory, in particular from [ https://robustlybeneficial.org/wiki/index.php?title=Social_choice social choice theory ] could inform better collective decision making that is needed for robustly beneficial algorithms. A classic example is the one of moral preferences aggregation.

How mathematicians can contribute

2020-01-27T12:41:51Z

El Mahdi El Mhamdi:

[[Robustly beneficial]] decision-making raises numerous technical problems, especially in terms of provable guarantees. Such questions may be of interests to mathematicians.

== Statistics ==

The most relevant mathematics to AI ethics is probably statistics. In particular, it seems critical to better understand [[Goodhart's law]] and [[robust statistics]]. Recent work has yielded remarkable provable guarantees [https://arxiv.org/pdf/1906.03058 DepersinLecué][https://scholar.google.ch/scholar?hl=en&as_sdt=0%2C5&q=Robust+subgaussian+estimation+of+a+mean+vector+in+nearly+linear+time&btnG= 19], while empirical findings suggest a need for a new theory of [[overfitting]] [https://openreview.net/pdf?id=Sy8gdB9xx ZBHRV][https://dblp.org/rec/bibtex/conf/iclr/ZhangBHRV17 17].

The [[reinforcement learning]] framework provides numerous mathematical challenges. One area already widely investigated is (convex) [[online learning]]. The study of [[AIXI]] and variants is also a relevant area of study with lots of nice unsolved mathematical conjectures.

== Neural networks ==

Recently, insightful mathematics studying neural networks has been derived by taking the infinite-width limit [http://papers.nips.cc/paper/8076-neural-tangent-kernel-convergence-and-generalization-in-neural-networks.pdf JHG][https://dblp.org/rec/bibtex/conf/nips/JacotHG18 18]. Remarkably, it was found that the learning of the neural networks could be described as a convex kernel gradient descent in the function space [https://www.youtube.com/playlist?list=PL95IbaKDuVrI_i5u8swlCLcr-4tX2SDxI ZettaBytes18].

== Algebraic topology ==

There may be results to investigate about the application of algebraic topology to distributed computing [https://www.amazon.com/Distributed-Computing-Through-Combinatorial-Topology/dp/0124045782 HKRBook][https://scholar.google.ch/scholar?hl=en&as_sdt=0%2C5&q=Distributed+computing+through+combinatorial+topology&btnG= 13]. This would be relevant as today's algorithms rely more and more on [[distributed learning]].

Related to this, there is interesting questions around the generalization of [[Bayesian agreement]]. For instance, are there upper and lower bound on agreement in Byzantine Bayesian agreement?

== Game theory ==

Contributions from game theory, in particular from [[ https://robustlybeneficial.org/wiki/index.php?title=Social_choice social choice theory ]] could inform better collective decision making that is needed for robustly beneficial algorithms. A classic example is the one of moral preferences aggregation.

How mathematicians can contribute

2020-01-27T12:41:00Z

El Mahdi El Mhamdi:

How mathematicians can contribute

2020-01-27T12:39:12Z

El Mahdi El Mhamdi:

Goodhart's law

2020-01-27T10:41:31Z

El Mahdi El Mhamdi:

Goodhart's law asserts that "as soon as a measure becomes a target, it ceases to be a good measure". Introduced in its original formulation as "Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes" in [https://books.google.ch/books?id=OMe6UQxu1KcC&pg=PA111&redir_esc=y#v=onepage&q&f=false Goodhart1981]

== Examples ==

Over-fitting the accuracy of a scientific model to the data that was available during the formulation of that model leads to poor reproducibility on data that was unseen.

In machine learning, instances of this could be over-fitting to the training set, which leads to poor generalisation to an unseen test-set.

== Moral uncertainty ==

== Instrumental goals ==

Robust statistics

2020-01-27T10:31:46Z

El Mahdi El Mhamdi:

Robust statistics is the problem of estimating parameters from unreliable empirical data. Typically, suppose that a fraction of the training dataset is compromised. Can we design algorithms that nevertheless succeed in learning adequately from such a partially compromised dataset?

This question has arguably become crucial, as large-scale algorithms perform learning from users' data. Yet, clearly, if the algorithm is used by thousands, millions or billions of users, many of the data will likely be corrupted, because of bugs [https://www.youtube.com/watch?v=yb2zkxHDfUE standupmaths20], or because some users will maliciously want to exploit or attack the algorithm. This latter case is known as a poisoning attack.

Over the last three years, there have been fascinating recent advances, both for classical statistical tasks [https://arxiv.org/pdf/1911.05911.pdf DiakonikolasKane][https://dblp.org/rec/bibtex/journals/corr/abs-1911-05911 19] [https://arxiv.org/pdf/1906.03058 DepersinLecué][https://scholar.google.ch/scholar?hl=en&as_sdt=0%2C5&q=Robust+subgaussian+estimation+of+a+mean+vector+in+nearly+linear+time&btnG= 19] and [[modern machine learning]] [http://papers.nips.cc/paper/6617-machine-learning-with-adversaries-byzantine-tolerant-gradient-descent.pdf BEGS][https://dblp.org/rec/bibtex/conf/nips/BlanchardMGS17 17] especially in very high dimensional settings such as [[training neural networks]] [http://proceedings.mlr.press/v80/mhamdi18a EGR].

== Example of the median ==

Suppose the data are real numbers, and we want to estimate the mean of the true data (which we shall call inliers). Note that the naive empirical mean estimate would be a bad idea here, as a single malicious user could completely upset the empirical mean estimate. In fact, by choosing its input data adequately (called outliers), the malicious user can make the empirical mean estimate equal whatever the malicious user wants it to be.

It turns out that using the median of the dataset is a robust way to do so. Indeed, even if 45% of the data are outliers, the median will still be a quantile of the inliers, which should not be too far from the actual mean. The median is said to have a 0.5 statistical breakdown point [https://www.researchgate.net/profile/Peter_Rousseeuw/publication/303193534_Robust_Regression_Outlier_Detection_John_Wiley_Sons/links/5d7a06b64585151ee4af67da/Robust-Regression-Outlier-Detection-John-Wiley-Sons.pdf RousseeuwLeroy][https://scholar.google.ch/scholar?hl=en&as_sdt=0%2C5&q=Robust+regression+and+outlier+detection+rousseeuw+leroy&btnG= 05]. No statistical method can achieve a better breakdown point, but other methods also achieve 0.5 statistical breakdown, like trimmed mean (we remove sufficiently many extreme values on both sides and take the mean of the rest).

Another way to quantify robustness is to compute a high-probability upper bound between the empirical median and the mean μ of the true distribution of inliers. Call ε the fraction of outliers. It turns out that, assuming the true distribution is a normal distribution <math>\mathcal N(\mu,1)</math>, given <math>n=\Omega\left( \frac{d+\log(1/\tau)}{\varepsilon^2}\right)</math> data, we can guarantee <math>|median-\mu| = O(\varepsilon)</math> with probability <math>1-\tau</math>. This asymptotic bound is also best possible [https://projecteuclid.org/download/pdf_1/euclid.aoms/1177703732 Huber][https://scholar.google.ch/scholar?hl=en&as_sdt=0%2C5&q=Robust+estimation+of+a+location+parameter+huber&btnG= 92].

== Poisoning models ==

The above model holds for arguably the strongest poisoning model. This is one where an adversary gets to read the full dataset before we can, and is able to erase a fraction <math>\varepsilon</math> of the data, and to replace them by any imaginable input. The dataset is then analyzed by our (robust) statistics algorithm.

A weaker, but still widespread, model is one where a fraction <math>1-\varepsilon</math> comes from the true distribution, while the remaining <math>\varepsilon</math> is chosen by the adversary [http://papers.nips.cc/paper/6617-machine-learning-with-adversaries-byzantine-tolerant-gradient-descent.pdf BMGS][https://dblp.org/rec/bibtex/conf/nips/BlanchardMGS17 17].

Other models include an adversary with only erasing power, or an adversary that must choose its outliers without knowledge of the values of the inliers. Evidently, any guarantee for such weaker poisoning models will also hold for stronger poisoning models [https://arxiv.org/pdf/1911.05911.pdf DiakonikolasKane][https://dblp.org/rec/bibtex/journals/corr/abs-1911-05911 19].

Perhaps the most general form of poisoning attack is the following. Consider a "true dataset" <math>D</math>. However, the attacker gets to distort the dataset using a distort function <math>f \in \mathcal F</math>, thereby yielding <math>f(D)</math>. Suppose we now have a best-possible machine learning algorithm <math>ML</math> that learns from data. It would ideally compute <math>ML(D)</math>. But we can only exploit <math>f(D)</math>, by some hopefully robust machine learning algorithm <math>RML(f(D))</math>. What we would like is to guarantee that <math>d(ML(D),RML(f(D))) < bound</math>, for a suitable distance <math>d</math> and any <math>f \in \mathcal F</math>, where <math>RML</math> is tractable. A further generalization of this could consist in assuming a prior probabilistic belief on the set <mathcal>F</mathcal> of attack models that we need to defend against.

It is noteworthy that robustness to such attacks are useful, even if there are no adversary. Indeed, data may still get corrupted, because of bugs, crashes or misuse by a human operator (see for instance the case of gene mutations caused by Excel that plagued genetic research [https://www.sciencemag.org/news/2016/08/one-five-genetics-papers-contains-errors-thanks-microsoft-excel Boddy16]). Our algorithms need to remain performant despite such issues. An algorithm robust to strong attacks will be robust to such weaker flaws.

== Robustness to additive poisoning ==

Unfortunately, results that hold for small dimensions generalize poorly to high dimensions, either because of weak robustness guarantees or computational slowness. Typically, the coordinate-wise median and the [[geometric median]] both yield <math>\Omega(\varepsilon \sqrt{d})</math>-error, even in the limit of infinite-size datasets and assuming normality for inliers. This is very bad, as today's [[neural networks]] often have <math>d\sim 10^6</math>, if not <math>10^9</math> or <math>10^{12}</math> parameters.

On the other hand, assuming the true distribution is a spherical normal distribution <math>\mathcal N(0,I)</math>, Tukey proposed another approach based on identifying the directions of largest variances, since these are likely to be the "attack line" of the adversary [https://scholar.google.ch/scholar?hl=en&as_sdt=0%2C5&q=Mathematics+and+the+picturing+of+data+tukey&btnG= Tukey75]. Tukey's median yields <math>O(\varepsilon)</math>-error with high probability <math>1-\tau</math> for <math>n=\Omega\left( \frac{d+\log(1/\tau)}{\varepsilon^2}\right)</math> data points. Unfortunately, Tukey's median is NP-hard to compute, and is typically exponential in d.

[https://arxiv.org/pdf/1906.03058 DepersinLecué][https://scholar.google.ch/scholar?hl=en&as_sdt=0%2C5&q=Robust+subgaussian+estimation+of+a+mean+vector+in+nearly+linear+time&btnG= 19] proved that such a bound can be achieved for additive poisoning in quasi-linear-time, even for heavy-tailed distribution with bounded but unknown variance. More precisely, by using an approach akin to median-of-means, for <math>K = \Omega(\varepsilon n)</math>, they designed an algorithm that achieves an error <math>O\left( \sqrt{\frac{Tr(\Sigma)}{n}} + \sqrt{\frac{||\Sigma||_{op} K}{n}} \right)</math> in time <math>\tilde O(nd + uKd)</math>. Here, <math>u</math> is an integer parameter, which is needed to guarantee a subgaussian decay rate of errors, which will be in <math>1-\exp(-\Theta(K+u))</math>. Note that <math>Tr(\Sigma)</math> is essentially the "effective dimension" of the data points.

Their technique relies on partitioning data into <math>K</math> buckets, computing the means for each bucket, and replacing the computation of median of the means by a covering SDP that fits all configuration of bucket poisoning. It turns out that approximations of such an covering SDP can be founded in quasi-linear-time [https://arxiv.org/pdf/1201.5135 PTZ][https://dblp.org/rec/bibtex/conf/spaa/PengT12 12]. It turns out that, rather than being used to directly compute a mean estimator, this is actually used to perform gradient descent, starting from the coordinate-wise median, and then descending along the direction provided by the covering SDP. [https://arxiv.org/pdf/1906.03058 DepersinLecué][https://scholar.google.ch/scholar?hl=en&as_sdt=0%2C5&q=Robust+subgaussian+estimation+of+a+mean+vector+in+nearly+linear+time&btnG= 19] proved that only <math>O(\log d)</math> such steps were needed to guarantee their bound.

== Robustness to strong poisoning ==

Note that all the papers of this section seem to strongly rely on the knowledge of the covariance matrix of inliers by the algorithm.

For robustness to strong poisoning, Tukey's ideas can be turned into an polynomial-time algorithm for robust statistics mean estimate. The trick is to identify worst-case "attack line" by computing the largest eigenvalue of the empirical covariance matrix, and to remove extremal points along such lines to reduce variance. [https://arxiv.org/pdf/1604.06443.pdf DKKLMS][https://dblp.org/rec/bibtex/conf/focs/DiakonikolasKK016 16] [http://proceedings.mlr.press/v70/diakonikolas17a/diakonikolas17a.pdf DKKLMS][https://dblp.org/rec/bibtex/conf/icml/DiakonikolasKK017 17] show that, for <math>n=\Omega(d/\varepsilon^2)</math>, this yields <math>O(\varepsilon \sqrt{\log(1/\varepsilon)})</math>-error with high probability for sub-Gaussian inliers, and <math>O(\sigma \sqrt{\varepsilon})</math> for inliners whose true covariance matrix <math>\Sigma</math> is such that <math>\sigma^2 I - \Sigma</math> is semidefinite positive.

The asymptotical optimal bound <math>O(\varepsilon)</math> has been achieved a more sophisticated filtering polynomial-time algorithm by [https://epubs.siam.org/doi/pdf/10.1137/1.9781611975031.171 DKKLM+][https://dblp.org/rec/bibtex/conf/soda/DiakonikolasKK018 18] for Gaussian distribution in the additive poisoning model, while [https://ieeexplore.ieee.org/document/8104048 DKS][https://dblp.org/rec/bibtex/conf/focs/DiakonikolasKS17 17] showed that no polynomial-time can achieve better than <math>O(\varepsilon \sqrt{\log(1/\varepsilon)})</math> in the Statistical Query Model with strong poisoning.

Quasi-linear-time robust mean estimators have been designed by [https://epubs.siam.org/doi/pdf/10.1137/1.9781611975482.171 CDG][https://dblp.org/rec/bibtex/conf/soda/0002D019 19], i.e. with <math>\tilde O(nd)</math> up to logarithmic factors, based on filtering methods of extremal points on variance-maximizing directions.

Note that all such results can be applied to robust linear regression, by applying robust mean estimator to gradient descent estimator (with the mean taken over data points), assuming that the covariance matrix of the distribution of features <math>x</math> is known, and that the noise is of known mean and variance. Robust covariance matrix estimation can also be addressed by the framework, as it too can be regarded as a robust mean estimation problem (the fourth moment then needs to be assumed to be upper-bounded).

== What if there are more outliers than inliers? ==

Another puzzling question concerns the setting where outliers may be more numerous than inliers. A simple argument shows that methods cited above just won't work. What can be done for such a setting? Can we still learn something from the data, or do we have to throw away the dataset altogether? Let's cite three approaches that may be fruitful.

First, [https://dl.acm.org/doi/pdf/10.1145/1374376.1374474?casa_token=CY0VwFuDnskAAAAA:Q_eNEhY5XxJ8_rll_t1TeRYQiTB6fJeTXQG_OJGwVeghBAcpD6rwFCExUqrmwX5SP-N9iKd8n2CxKQ BBV][https://dblp.org/rec/bibtex/conf/stoc/BalcanBV08 08] introduced list-decodable learning, which consists in returning several hypothesis, and [https://arxiv.org/pdf/1711.07211 DKS][https://dblp.org/rec/bibtex/conf/stoc/DiakonikolasKS18 18] provided polynomial-time for robust list-decodable mean estimation.

Second, one might try to a apply a robust Bayesian inference to the data, which would yield a set of posterior beliefs. This framework has yet to be defined.

Third, we may assume that the data are more or less [[data certification|certified]]. This is a natural setting, as we humans often judge the reliability of raw data depending on its source, and we usually consider a continuum of reliability, rather than a clear cut binary classification. Algorithms should probably do the same at some point, but there does not yet seem to be an algorithmic framework to pose this problem. In particular, it could be interesting to analyze threat models where different degrees or sorts of certification have different levels of liability.

== Robust statistics for neural networks ==

The main application of robust statistics (at least relevant to AI ethics) seems to be the aggregation of [[stochastic gradient descent|stochastic gradients]] for [[neural networks]]. In this setting, even a linear-time algorithm in <math>\Omega(nd)</math> is impractical if we demand <math>n \geq d</math> (which is necessary to have dimension-independent guarantees). In practice, this setting is often carried out with batches whose size <math>n</math> is significantly smaller than <math>d</math>. In fact, despite conventional wisdom and PAC-learning theory, it seems that <math>n \ll d</math> may be a desirable setting to do neural network learning (see [[overfitting]] where we discuss double descent). For <math>n \ll d</math>, is there a gain in using algorithms more complex than coordinate-wise median?

[http://papers.nips.cc/paper/6617-machine-learning-with-adversaries-byzantine-tolerant-gradient-descent.pdf BEGS][https://dblp.org/rec/bibtex/conf/nips/BlanchardMGS17 17] proposed Krum and multi-Krum, aggregation algorithms for this setting that have weaker robustness guarantees but are more efficient. Is it possible to improve upon them?

== Robust statistics for agreement and multi-agent settings ==

Another venue where robust statistics are needed is the increasingly multi-agent setting that modern AI is built on. [https://dl.acm.org/doi/10.1145/2488608.2488657 MH2013]