# Interpretability

As algorithms become more influential and complex, their interpretability becomes a major challenge. Unfortunately, it is not always clear what is meant by "interpretability". Amusingly, the concept of interpretability needs interpretation.

## Contents

## What to expect from interpretability?

Arguably, interpretability should not be regarded as a binary concept. Algorithms are more or less interpretable.

In practice, it seems that the more critical question is to which extent we can predict the behavior of an algorithm in different relevant contexts. In particular, what can we guarantee about the algorithm's behavior? What can we be reasonably confident about? How to identify what ought to be changed about the algorithm? What should we be suspicious of?

To answer these questions as best as possible, it is often useful to arm ourselves with *probing algorithms*, that is, algorithms designed to probe the algorithms that need to be interpreted.

## Provable guarantees

Evidently, the best form of interpretability is proving theoretical guarantees. This is essentially the heart of theoretical computer science. An example is the incentive-compatibility and stable matching of the Gale-Shapley algorithm.

The search for theoretical guarantees can be automated by program verification. Some research aims to apply similar techniques to neural network verification, typically by involving *mixed integer programming* TXT19.

Theoretical guarantees also play an important role in studying the learning of an algorithm. Typically, one could prove theoretical convergence of stochastic gradient descent under the assumption of the convexity of the loss function. Another example is the study of robustness to adversarial attacks BMGS17.

## Black box

Diakopoulos14 RB1 provides two very interesting definitions of a "black box", which we shall call *weak black box*, *strong black box* and *total black box*. A *weak black box* is an algorithm that you can only probe through input/output queries. In the case of a *strong black box*, you cannot choose the input, though you can observe it, as well as the output. Finally, a *total black box* would be an algorithm whose inputs are unknown; only outputs are observable. Evidently, this classification of black boxes is a simplification; rather, algorithms are more or less black, because inputs are more or less controllable and observable.

Importantly, in this sense, a "black box" is not an "unknowable algorithm". Rather, it is an algorithm whose properties can essentially only be probed in an input-output manner. The reason why the distinctions above seem useful is that the "blackness" of an algorithm limits what *probing algorithms* can or should be used.

Lê is conjecturing here that weak black boxes can be "fully probed" by O(K) queries, assuming that the algorithm has Solomonoff-Kolmogorov complexity K. But total black boxes are harder to probe, especially if the black box inputs data whose complexity is not bounded. Exciting research direction!

Interestingly, in this sense, humans can be argued to be mostly black boxes. But not completely. Indeed, you can also probe humans' algorithms by scanning brain activity with MRIs. The same holds for artificial neural networks, for which additional data can be collected by studying neural activity. In fact, artificial neural networks are "less black" than human brains, not only because neural activity can be probed more precisely, but also because neural networks can be probed in other ways, for instance by analyzing the gradients of the neural network for some data (which turns out to also increase vulnerability to adversarial attacks).

Interestingly, Diakopoulos14 shows that, in most cases, algorithms need to be treated as black boxes, either because their code is private or because it is too complex (especially if it involves neural networks). Nevertheless, Diakopoulos shows that by interacting with the black box through queries and analyzing outputs, some partial interpretability can sometimes be achieved. Note that, in the end, this is very similar to how we interpret humans' brains.

ROWAM20 show that even strong black boxes can be probed through indirect means, for instance by analyzing data from users who interacted with the black box. In particular, a probing algorithm may not need to analyze the outputs

## Interpreting the YouTube algorithm

A few papers tried to probe the YouTube recommendation algorithms with probing algorithms. Former Google employee Guillaume Chaslot launched Algotransparency, which creates YouTube accounts run by bots that randomly click on suggested videos. Allgaier16 used similar techniques. Both found worrying pipelines towards conspiracy or climate denialist videos. This was indirectly confirmed by ROWAM20.

## Neural network activations

If we have access to the code of a neural network, additional interpretability can be achieved by analyzing the activations of the neural network for different stimuli, as is done through magnetic resonance imaging for humans.