In our previous blog, we introduced the concept of SHAP values (4) and their advantages over other saliency mapping methods such as LIME (5). We also proposed that applying this method to genetic data can help explore biology and enable clinical utility of deep models. However, given the reliability issues touched on in the previous blog, as well as those reported in (1, 9), it is paramount that we tread carefully before putting our trust in the interpretations of deep learning results. One way to navigate this safely is through the learned biological relevance of the identified features, as well as benchmarking the results against well-established and trusted traditional bioinformatic methods.
To begin, we needed a model and a target hypothesis. For this purpose, we trained a convolutional neural network (CNN; Fig 1) to predict tissue type using RNA-seq data from the Genotype-Tissue Expression (GTEx) project. This GTEx dataset comprises 16,651 samples with 18,884 features (genes) across 47 classes (tissue types).
We fitted SHAP's GradientExplainer to the softmax layer of our model and calculated SHAP values for the test set (Fig 2), using the library provided in Lundberg's SHAP GitHub repository. Ranking the genes by median SHAP value yielded the top genes for each tissue type.
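As a rough sketch of the ranking step: in the real pipeline the per-class attributions come from `shap.GradientExplainer(model, background).shap_values(X_test)`, which returns one (samples × genes) array per tissue class; the arrays and gene names below are synthetic stand-ins so the example runs on its own.

```python
import numpy as np

# Synthetic stand-in for the real attributions: in practice these come from
# shap.GradientExplainer(model, background).shap_values(X_test),
# one (samples x genes) array per output class (tissue type).
rng = np.random.default_rng(0)
n_samples, n_genes, n_tissues = 200, 50, 3
shap_values = [rng.normal(size=(n_samples, n_genes)) for _ in range(n_tissues)]
gene_names = [f"GENE{i}" for i in range(n_genes)]

def top_genes(per_class_shap, names, k=10):
    """Rank genes by median absolute SHAP value across the test samples."""
    impact = np.median(np.abs(per_class_shap), axis=0)
    order = np.argsort(impact)[::-1]  # descending by impact
    return [names[i] for i in order[:k]]

per_tissue_top = {t: top_genes(sv, gene_names) for t, sv in enumerate(shap_values)}
```

Taking the median of the absolute values is one reasonable way to summarise per-sample attributions into a single per-gene score; the paper's exact aggregation may differ in detail.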
We chose this explainability method because GradientExplainer uses statistical techniques well suited to neural network models. It is also user friendly and works with all the major Python deep learning frameworks, i.e. TensorFlow, Keras, and PyTorch, and, most importantly, addresses many limitations raised in the two “Sanity Checks” publications mentioned earlier (1, 9). GradientExplainer is an amalgamation of Integrated Gradients (8), SHAP (4), and SmoothGrad (7) into a single equation that calculates saliency scores for each feature of the original input to a neural network. Each improvement is described in detail in the original publications; we summarise them only briefly here.
Because propagation through a CNN relies on gradients, most saliency mapping techniques developed for CNNs are gradient-based. However, due to the nature of common activation functions such as ReLU, the sensitivity of these methods is greatly compromised (once a unit is activated, its gradient no longer changes), so even very salient features can be missed (1, 9). The solution came in the form of a simple process: integrating the gradients along the path between a baseline input and the actual input (Equation 1), and linking the result to SHAP values via the Aumann-Shapley extension of Shapley values to infinite (non-atomic) games (Equation 2). Further improvements to the sensitivity maps were made by adopting the SmoothGrad technique, which adds Gaussian noise to copies of the input and averages the resulting gradient maps (7).
Equation 1. Integrated gradient (Equation 1 in Sundararajan et al., 2017)
$$\mathrm{IntegratedGrads}_i(x) ::= (x_i - x_i') \int_0^1 \frac{\partial F\big(x' + \alpha(x - x')\big)}{\partial x_i}\, d\alpha$$
Equation 2. Aumann-Shapley value for infinitesimal player
$$(Sv)(ds) = \int_0^1 \big( v(tI + ds) - v(tI) \big)\, dt$$
Where Sv is the Shapley value for the infinitesimal player ds; tI is a perfect sample of the all-player set I containing a proportion t of all the players; and tI + ds is the coalition obtained after ds joins tI (sourced from Wikipedia).
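To make Equation 1 concrete, here is a minimal numerical sketch. The quadratic F is a toy stand-in for a trained network (in practice the partial derivatives come from backpropagation); the integral is approximated by averaging gradients along the straight-line path from the baseline to the input.

```python
import numpy as np

def F(x):
    """Toy differentiable 'model': F(x) = sum of squares."""
    return np.sum(x ** 2)

def grad_F(x):
    """Analytic gradient of F; a real network would use backpropagation."""
    return 2 * x

def integrated_gradients(x, baseline, steps=1000):
    """Approximate Equation 1 with a Riemann average over the path."""
    alphas = np.linspace(0.0, 1.0, steps)
    avg_grad = np.mean([grad_F(baseline + a * (x - baseline)) for a in alphas], axis=0)
    return (x - baseline) * avg_grad

x = np.array([1.0, 2.0])
baseline = np.zeros(2)
attributions = integrated_gradients(x, baseline)
# Completeness axiom: the attributions sum to F(x) - F(baseline)
```

For this quadratic F the attributions work out to x squared element-wise, and their sum recovers F(x) − F(baseline) exactly, which is the completeness property that makes integrated gradients attractive for attribution.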
The distributional properties of SHAP values have not yet been mathematically established, which prevents the calculation of p-values. Consequently, the selection of the “most important” features remains somewhat arbitrary and can be approached in several ways. For example, SHAP values can be converted to ranks, providing a non-parametric avenue. We instead defined the cut-off point as a function of tissue exclusivity, drawing the line where 50% of the top genes were tissue-exclusive. This resulted in 2,423 SHAP-identified genes, henceforth referred to as “SHAP genes”. Of note, among the top-ranked genes of the 47 tissue types, 44 were unique, indicating that SHAP is indeed identifying features relevant to the question we are addressing.
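The tissue-exclusivity cut-off can be sketched as follows. The ranked gene lists are toy data and `choose_cutoff` is a hypothetical helper, not the exact procedure from the paper: it grows the per-tissue depth k and keeps the largest k at which at least half of the pooled top genes appear in only one tissue.

```python
from collections import Counter

def exclusive_fraction(ranked_lists, k):
    """Fraction of the union of top-k genes that appear in exactly one tissue."""
    counts = Counter(g for genes in ranked_lists for g in genes[:k])
    return sum(1 for c in counts.values() if c == 1) / len(counts)

def choose_cutoff(ranked_lists, threshold=0.5):
    """Largest depth k at which >= `threshold` of top genes are tissue-exclusive."""
    max_k = min(len(genes) for genes in ranked_lists)
    best = 0
    for k in range(1, max_k + 1):
        if exclusive_fraction(ranked_lists, k) >= threshold:
            best = k
    return best

# Toy rankings: each 'tissue' has one exclusive gene, then shared genes
ranked = [["A", "X", "Y", "Z", "W"],
          ["B", "X", "Y", "Z", "W"],
          ["C", "X", "Y", "Z", "W"]]
cutoff = choose_cutoff(ranked)
```

With these toy lists the exclusive fraction stays at or above 50% up to depth 4 and drops below it at depth 5, so the cut-off lands at 4.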
As an illustrative example of SHAP’s utility for sanity checking: in the process of writing our scientific manuscript, we found that if our model was trained on incorrectly transformed or non-normalised data, features such as mitochondrial genes (the mitochondrion is a membrane-bound structure inside most eukaryotic cells, thought to originate from bacteria, that acts as the cell’s chemical energy powerhouse) showed up as the most salient. This ‘saliency’, however, was not driven by biological relevance to tissue differentiation, but was a byproduct of a technical artefact: these genes are always expressed, and their raw expression values are much greater than those of nuclear genes. On the raw scale, any changes in their expression are therefore unfairly exaggerated; on a standardised scale, the relative changes become much smaller, reducing their saliency. Had we been hunting for genes as future therapeutic targets for a specific disorder, such a mistake could have led to catastrophic outcomes, both financially and at a cost to patients’ wellbeing.
For decades, ML techniques have suffered from overfitting to the training data, something that in critical sectors, such as healthcare, can be very dangerous. Although a train/validation split plus techniques such as dropout and batch normalisation can minimise the problem, a generalisability issue can still persist. That is because the sample used for training/validation comes from a single source and can therefore differ from another, independently collected one, due to factors such as collection techniques, the technologies used, or even epidemiological differences between the populations. For that reason, nothing tests the generalisability of a model quite as robustly as an independent replication. We therefore recalculated SHAP values in an out-of-sample dataset completely independent from GTEx: expression profiles sourced from the Human Protein Atlas (HPA) dataset (21 tissue types). Comparing the results from both sets, we found a large overlap between the SHAP genes from GTEx and HPA samples, ranging from 20–60% of our significant genes per tissue. This provides strong evidence that the identified features generalise beyond our training dataset and are therefore more likely to be “true” causal drivers of tissue differentiation.
We also looked at the number of protein-protein interactions per tissue within the SHAP genes using STRING v11. The results were very promising: for every tissue type, the 103 SHAP genes showed significantly more interactions among their proteins than 103 genes selected at random.
On top of this, we explored the biological processes enriched in these gene lists and found that they largely reflected the processes that should matter for those tissues. For example, processes enriched in heart tissue included cardiac muscle fibre development, and pathways for the thyroid gland included thyroid hormone generation (Fig 3).
edgeR (12) is the current standard for analysing differential expression of count data to determine which genes significantly differentiate the phenotype of interest (e.g. tissue types). We used this package to identify 30,352 differentially expressed genes; however, many of these genes appeared more than once across tissues. To compare against our unique SHAP genes, we collapsed the edgeR-identified genes to unique genes, resulting in 7,854 genes (referred to as ‘edgeR genes’). Comparing these to our SHAP genes, we found a 98.6% overlap (Fig 4).
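Collapsing the per-tissue edgeR hits to unique genes and computing the overlap with the SHAP list reduces to simple set operations. The lists below are toy stand-ins for the real per-tissue results:

```python
# Toy stand-ins for per-tissue edgeR hit lists and the SHAP gene set
edger_per_tissue = [["G1", "G2", "G3"], ["G2", "G3", "G4"], ["G3", "G5"]]
shap_genes = {"G1", "G2", "G3", "G4", "G6"}

# Collapse duplicated hits across tissues to a unique gene set
edger_genes = set().union(*edger_per_tissue)

# Percentage of SHAP genes also called by edgeR
overlap_pct = 100 * len(shap_genes & edger_genes) / len(shap_genes)
```

In this toy example four of the five SHAP genes are recovered by edgeR, giving an 80% overlap; the same calculation on the real lists yields the 98.6% figure reported above.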
Interestingly, many of the top SHAP genes did not pass the edgeR logFC filter for individual tissue types. Since edgeR is designed to detect only linear associations with large effect sizes, this suggests that the neural network was able to detect more subtle differentiators of tissue types. One likely reason is that ML models can analyse multi-class data simultaneously (“seeing” all classes at the same time), whereas edgeR is designed for pairwise comparisons only.
If these SHAP genes really were the genes most relevant to each tissue, we would expect them to contain all the information required to differentiate between tissues. To test this hypothesis, we estimated the quality of the clustering our significant features produced using the dimensionality reduction tool UMAP, and compared the results across several control feature sets: all genes, edgeR genes, random subsets of 2,423 genes, and the SHAP genes (Fig 5).
We evaluated the quality of these clusters by applying k-means clustering to the UMAP projections and then calculating the V-measure, which captures not only how well samples with the same label group together, but also how homogeneous each cluster is.
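The evaluation step can be sketched as follows. PCA stands in for UMAP here to keep the example dependency-light (the real pipeline embeds with UMAP before clustering); the scoring logic of embedding, k-means, then V-measure is the same, and the synthetic "tissues" are well separated so the score should be near 1.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import v_measure_score

def cluster_quality(X, labels, n_components=2, seed=0):
    """Embed, cluster with k-means (k = number of labels), score with V-measure."""
    embedding = PCA(n_components=n_components, random_state=seed).fit_transform(X)
    k = len(set(labels))
    pred = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(embedding)
    return v_measure_score(labels, pred)

# Synthetic data: three well-separated 'tissues' in a 20-gene space
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.1, size=(50, 20)) for c in (0, 5, 10)])
labels = [0] * 50 + [1] * 50 + [2] * 50
score = cluster_quality(X, labels)
```

Running this comparison with each feature subset (all genes, edgeR genes, random genes, SHAP genes) and the same scoring function gives directly comparable V-measures per gene list.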
Our results showed that even though the SHAP gene list was smaller than the other gene lists, it produced higher-quality clusters for all tissue types (Fig 6).
In summary, by benchmarking against traditional bioinformatic methods, we showed that Integrated Gradients, from the family of SHAP methods, is not only reliable but also biologically informative and statistically effective when applied to RNA-seq data. This work is a huge step forward in showing how carefully designed and safely implemented explainability models have the potential to facilitate clinical decision making, as well as provide novel biological insights.
For more information, read our full publication online in Nature Scientific Reports.
1. Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I. J., Hardt, M. & Kim, B. (2018). Sanity Checks for Saliency Maps. NeurIPS.
2. Garg, M. (2009). Axiomatic Foundations of Centrality in Networks. SSRN Electronic Journal.
3. Lewis, F., Butler, A. and Gilbert, L. (2010). A unified approach to model selection using the likelihood ratio test. Methods in Ecology and Evolution, 2(2), pp.155–162.
4. Lundberg, S. M. & Lee, S. I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 2017-Decem, 4766–4775 (2017).
5. Ribeiro, M. T., Singh, S. & Guestrin, C. ‘Why should I trust you?’ Explaining the predictions of any classifier. Proc. ACM SIGKDD Int. Conf. Knowl. Discov. Data Min. 13–17-Augu, 1135–1144 (2016).
6. Shrikumar, A., Greenside, P. and Kundaje, A. (2019). Learning Important Features Through Propagating Activation Differences. [online] arXiv.org. Available at: https://arxiv.org/abs/1704.02685 [Accessed 4 Sep. 2019].
7. Smilkov, D., Thorat, N., Kim, B., Viégas, F. and Wattenberg, M. (2019). SmoothGrad: removing noise by adding noise. [online] arXiv.org. Available at: https://arxiv.org/abs/1706.03825 [Accessed 4 Sep. 2019].
8. Sundararajan, M., Taly, A. & Yan, Q. (2017). Axiomatic Attribution for Deep Networks. Proceedings of the 34th International Conference on Machine Learning (ICML), PMLR 70, 3319–3328.
9. Tomsett, R., Harborne, D., Chakraborty, S., Gurram, P., & Preece, A. (2020). Sanity Checks for Saliency Metrics. ArXiv, abs/1912.01451.
10. Wang, Q., Armenia, J., Zhang, C., Penson, A., Reznik, E., Zhang, L., Minet, T., Ochoa, A., Gross, B., Iacobuzio-Donahue, C., Betel, D., Taylor, B., Gao, J. and Schultz, N. (2018). Unifying cancer and normal RNA sequencing data from different sources. Scientific Data, 5(1).
11. Way, G. and Greene, C. (2017). Extracting a biologically relevant latent space from cancer transcriptomes with variational autoencoders. Biocomputing 2018.
12. Robinson, M. D., McCarthy, D. J. & Smyth, G. K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 26, 139–140 (2010).