Looking back a decade (2010–2020)
This blog focuses on developments in the explainability of neural networks. We divide our presentation into a four-part blog series:
- Part 1 talks about the effectiveness of Visualizing Gradients of the image pixels for explaining the pre-softmax class score of CNNs.
- Part 2 talks about some more advanced/modified gradient-based methods like DeConvolution and Guided Backpropagation for explaining CNNs.
- Part 3 talks about some shortcomings of gradient-based approaches and discusses alternative axiomatic approaches like Layer-wise Relevance Propagation, Taylor Decomposition and DeepLiFT.
- Part 4 talks about some recent developments like Integrated Gradients (continuing from Part 3) and recent novelties in CNN architecture, like Class Activation Maps, developed to make the feature maps more interpretable.
Up until now, we have discussed gradient-based methods for understanding decisions made by a neural network. But this approach has a serious drawback. Because of units like ReLU and MaxPooling, the score function can often be locally “flat” with respect to some input pixels, or in other words have zero gradient there. Gradient-based methods therefore often attribute zero contribution to pixels that saturate a ReLU or MaxPool, even when those pixels matter for the prediction. This is counter-intuitive. To address this problem, we need:
- Some formal notion of what we mean by explainability or relevance (beyond vanilla gradients). What properties do we want the “relevance” to satisfy? It would be desirable for relevance to behave like vanilla gradients at linear layers, since gradients are good at explaining linear functions.
- Candidates that satisfy our axioms of “relevance” and are also easy to compute. Ideally, we want to compute them in a single backward pass.
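The flat-gradient problem described above can be seen on a toy example. The sketch below assumes a hypothetical one-pixel "score function" built from a single ReLU (it is not taken from any real network) and compares the gradient at the input with the actual change in score relative to a zero baseline.

```python
def relu(v):
    return max(v, 0.0)

def score(x):
    # Hypothetical toy score: rises linearly with x, then saturates at 1 once x >= 1.
    return 1.0 - relu(1.0 - x)

def numerical_gradient(f, x, h=1e-6):
    # Central finite difference, used here in place of backpropagation.
    return (f(x + h) - f(x - h)) / (2 * h)

baseline, x = 0.0, 2.0

grad = numerical_gradient(score, x)           # gradient in the saturated region
score_change = score(x) - score(baseline)     # what the pixel actually changed

print("gradient at x:", grad)
print("gradient * input:", grad * x)
print("score(x) - score(baseline):", score_change)
```

The pixel moves the score from 0 to 1, yet both the gradient and gradient times input assign it zero contribution, because the ReLU is saturated at x = 2.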
Taylor Decomposition and Layer-wise Relevance Propagation (2015)
Axiomatic relevance was first explored by Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller and Wojciech Samek. They introduced the notion of Layer-wise Relevance Propagation in their work “On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation” (PLOS ONE, 2015).
The authors propose the following axiom that relevance must follow:
- The sum of the relevance of all pixels must equal the class score of the model. We call this axiom “conservation of total relevance” from now on. This has been a popular axiom followed by other authors too.
The authors propose two different ways to distribute the total relevance to individual pixels.
1. Taylor Decomposition
In this approach, the authors propose to choose a reference image X₀, which is to be interpreted as a “Baseline Image” against which the pixels of image X are explained. Ideally, the class score of this baseline image should be as small as possible.
Comparing the input image against a baseline image to highlight the important pixels is a recurring theme in many axiomatic-relevance works. Some good examples of baseline images are:
- Blurred input images: work well for color images
- Blank (dark) image: works well for grayscale/black-and-white images
Given the baseline image X₀, we perform Taylor Decomposition of the class score function to obtain the relevance of individual pixels.
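In symbols, a first-order sketch of this decomposition looks as follows. The notation is assumed here: f is the class score function, x_p is pixel p of X, and x_{0,p} is the same pixel in the baseline X₀.

```latex
f(X) \approx f(X_0) + \sum_{p} \left.\frac{\partial f}{\partial x_p}\right|_{X = X_0} \left(x_p - x_{0,p}\right),
\qquad
R_p = \left.\frac{\partial f}{\partial x_p}\right|_{X = X_0} \left(x_p - x_{0,p}\right)
```

When the baseline score f(X₀) is close to zero and the higher-order terms are small, the pixel relevances R_p sum approximately to the class score f(X), which is why a low-scoring baseline is preferred.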
An extended version of Taylor Decomposition for neural networks was suggested by the authors in another work of theirs, called Deep Taylor Decomposition. Deep Taylor Decomposition forms the theoretical basis of Layer-wise Relevance Propagation, described next.
2. Layer-wise Relevance Propagation
Taylor Decomposition is a general method that works for any class score function. For neural networks, we can design a simpler method called Layer-wise Relevance Propagation.
For a neural network, the authors propose passing the relevance down from the output layer to the contributing neurons.
- Every time relevance is passed down from a neuron to the contributing neurons in the layer below, the total relevance received by those contributing neurons must equal the relevance of the neuron it came from. Hence, in LRP, the total relevance is conserved at every layer.
- All incoming relevances to a neuron from the layer above are collected and summed up before being passed down further. As we do this recursively from one layer to the layer below, we ultimately reach the input image, giving us the relevance of each pixel.
It remains to define how we distribute the relevance of a neuron to its contributing inputs or input neurons. This can be achieved via multiple schemes. Here is one such simple scheme given by the authors:
We note that the above scheme only approximates the conservation of total relevance axiom. To make it conserve the sum exactly, we have to redistribute the bias terms back to the inputs/input neurons in some way.
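The redistribution step and the effect of the bias terms can be sketched in a few lines of NumPy. This is a minimal sketch, assuming the scheme in question is the basic rule from the LRP paper in which input i receives the share x_i·w_ij / z_j of neuron j's relevance, with a small stabiliser in the denominator. The function names `lrp_dense` and `explain`, the toy two-layer ReLU network and its weights are all made up for illustration.

```python
import numpy as np

def lrp_dense(x, W, b, relevance_out, eps=1e-9):
    """Pass relevance from a dense layer's outputs down to its inputs.

    x: inputs (n_in,), W: weights (n_in, n_out), b: biases (n_out,),
    relevance_out: relevance of each output neuron (n_out,).
    """
    z = x[:, None] * W                 # z[i, j] = x_i * w_ij
    z_total = z.sum(axis=0) + b        # pre-activation of output neuron j
    # Small stabiliser so the division stays defined when z_total is near 0.
    denom = z_total + eps * np.where(z_total >= 0, 1.0, -1.0)
    # Input i receives the fraction z[i, j] / z_total[j] of neuron j's relevance.
    return (z / denom * relevance_out).sum(axis=1)

def explain(x, W1, b1, W2, b2, target):
    """Class score and input relevances for a toy two-layer ReLU network."""
    hidden = np.maximum(x @ W1 + b1, 0.0)
    scores = hidden @ W2 + b2
    relevance_out = np.zeros_like(scores)
    relevance_out[target] = scores[target]   # start from the class score
    # Relevance passes through the ReLU unchanged; inactive neurons get none.
    relevance_hidden = lrp_dense(hidden, W2, b2, relevance_out)
    relevance_input = lrp_dense(x, W1, b1, relevance_hidden)
    return scores[target], relevance_input

x = np.array([1.0, 2.0, -1.0])
W1 = np.array([[1.0, -1.0, 0.5, 0.0],
               [0.5, 1.0, -1.0, 1.0],
               [-1.0, 0.5, 1.0, 1.0]])
W2 = np.array([[1.0, -1.0], [2.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
zero_b1, zero_b2 = np.zeros(4), np.zeros(2)

# Without biases, the input relevances add up to the class score.
score, relevance = explain(x, W1, zero_b1, W2, zero_b2, target=0)
print(score, relevance, relevance.sum())

# With a bias, part of the relevance is absorbed by the bias term.
bias1 = np.array([1.0, 0.0, 0.0, 0.0])
score_b, relevance_b = explain(x, W1, bias1, W2, zero_b2, target=0)
print(score_b, relevance_b.sum())
```

In the second run the input relevances no longer sum to the class score; the missing amount is the share of the bias, which would have to be redistributed to the inputs to restore exact conservation.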
Here are some results of LRP on the ImageNet dataset:
Following the work of Sebastian Bach et al. on LRP and Taylor decomposition, Avanti Shrikumar, Peyton Greenside and Anshul Kundaje proposed the DeepLiFT method in their work “Learning Important Features Through Propagating Activation Differences” (ICML 2017). DeepLiFT (Deep Learning Important FeaTures) uses a reference image along with an input image to explain the input pixels (similar to LRP). While LRP followed the conservation axiom, it left open how exactly the net relevance should be distributed among the pixels. DeepLiFT fixes this problem by enforcing an additional axiom on how to propagate the relevance down.
The two axioms followed by DeepLiFT are:
Axiom 1. Conservation of Total Relevance: The sum of the relevance of all inputs must equal the difference between the score of the input image and that of the baseline image, at every neuron. This axiom is the same as the one in LRP.
Axiom 2. Back Propagation/Chain Rule: The relevance per input follows the chain rule, like gradients. This is enough to let us backpropagate the gradient-like relevance per input. This axiom brings DeepLiFT closer to “vanilla” gradient backpropagation.
The authors prove that the two axioms stated above are consistent with one another.
Given these axioms, what are some good candidate solutions for DeepLiFT? The authors suggest splitting relevance into positive and negative parts:
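For reference, here is a sketch of the quantities involved, written in the notation of the DeepLiFT paper. The symbols t, x_i, y_j and the superscript 0 for reference values are introduced here and do not appear in the text above: Δt = t − t⁰ is the difference between a neuron's activation and its reference activation, C is the contribution (relevance) assigned to an input difference, and m is the corresponding multiplier.

```latex
% Axiom 1 (conservation): contributions of the inputs sum to the output difference
\sum_{i} C_{\Delta x_i \Delta t} = \Delta t, \qquad \Delta t = t - t^{0}

% Multiplier: contribution per unit of input difference
m_{\Delta x \Delta t} = \frac{C_{\Delta x \Delta t}}{\Delta x}

% Axiom 2 (chain rule for multipliers)
m_{\Delta x_i \Delta t} = \sum_{j} m_{\Delta x_i \Delta y_j}\, m_{\Delta y_j \Delta t}

% Positive/negative split of a difference and of its contribution
\Delta y = \Delta y^{+} + \Delta y^{-}, \qquad
C_{\Delta y \Delta t} = C_{\Delta y^{+} \Delta t} + C_{\Delta y^{-} \Delta t}
```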
Depending on the function at hand, the authors suggest the following candidate solutions for C() and m():
- Linear Rule for linear functions: This is exactly the same as using the gradients for m(). LRP would do the same as well.
- Rescale Rule for non-linear functions like ReLU and Sigmoid: This is exactly the same as LRP.
Linear and Rescale rules follow LRP pretty closely.
- RevealCancel (Shapley) Rule for non-linear functions like MaxPool: Using the Rescale rule (with a reference input of 0s) for MaxPool would end up attributing all the relevance to the biggest input. Changes along the other inputs would make no difference to the output. The RevealCancel rule fixes this counter-intuitive conclusion, using the idea of Shapley values.
Shapley values have been used in game theory for calculating attributions of input variables. A number of recent works on explainable AI (like SHAP) use ideas inspired by Shapley values.
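The difference between the Rescale and RevealCancel rules can be illustrated on a two-input max. The sketch below rests on several assumptions made only for illustration: max is rewritten through a ReLU as max(i1, i2) = i2 + relu(i1 − i2), the reference input is all zeros, both inputs are positive (so i1 supplies the positive part and i2 the negative part of the ReLU's input difference), and the helper names are hypothetical. The RevealCancel formula follows the DeepLiFT paper, which averages the effect of the positive part with and without the negative part present, and vice versa.

```python
def relu(v):
    return max(v, 0.0)

def rescale_relu(d_pos, d_neg, ref=0.0):
    """Rescale rule: one shared multiplier, delta_out / delta_in."""
    d_in = d_pos + d_neg
    d_out = relu(ref + d_in) - relu(ref)
    m = d_out / d_in if d_in != 0 else 0.0
    # Contributions of the positive and negative input parts to the ReLU output.
    return m * d_pos, m * d_neg

def reveal_cancel_relu(d_pos, d_neg, ref=0.0):
    """RevealCancel rule: treat positive and negative parts separately."""
    out_pos = 0.5 * (relu(ref + d_pos) - relu(ref)) \
            + 0.5 * (relu(ref + d_neg + d_pos) - relu(ref + d_neg))
    out_neg = 0.5 * (relu(ref + d_neg) - relu(ref)) \
            + 0.5 * (relu(ref + d_pos + d_neg) - relu(ref + d_pos))
    return out_pos, out_neg

def attribute_max(i1, i2, relu_rule):
    """Contributions of i1 and i2 to max(i1, i2) = i2 + relu(i1 - i2).

    Assumes i1 > 0, i2 > 0 and an all-zero reference input.
    """
    # Inside the ReLU, i1 contributes +i1 and i2 contributes -i2.
    c1, c2 = relu_rule(i1, -i2)
    # i2 also reaches the output directly through the linear "+ i2" term.
    return c1, i2 + c2

rescale = attribute_max(3.0, 1.0, rescale_relu)
reveal_cancel = attribute_max(3.0, 1.0, reveal_cancel_relu)
print("Rescale:     ", rescale)
print("RevealCancel:", reveal_cancel)
```

Both rules conserve the total (the contributions sum to max(3, 1) = 3), but Rescale gives everything to the larger input, while RevealCancel also credits the smaller one.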
The authors show the results of using DeepLiFT on a CNN trained on the MNIST dataset.
To read about more exciting works on explainability of neural networks, you can catch the next part here: Link to Part 4