High-Level Segmentation Based Interpretability:
Human vision differs from computer vision in two main aspects. First, the human brain draws on a vast store of prior knowledge, acquired through diverse sensory organs, experience, and memory; a deep learning model lacks this sort of prior knowledge for vision-related tasks. Second, when we look at a picture, rather than taking in the complete image at once, we focus on (pay attention to) different areas of the image, gather high-level features, and then consolidate those features to decide what the image shows. So if we ask ourselves why an input is an image of the digit 7, we would probably answer that it has a horizontal line joined to a slanting vertical line, and that this matches our prior knowledge of the digit 7; hence the input image actually belongs to class 7.
Can we get this level of interpretation from a CNN model? To find out, I have employed a simple technique: I segmented the input image with the ‘Felzenszwalb’ method from the ‘skimage’ library, and rather than giving the whole image as input to the model, I fed each individual segment to the model and recorded the predicted class along with its score.
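The idea above can be sketched roughly as follows. This is a minimal illustration, not the exact code used in the experiment: `predict_fn` stands in for whatever trained classifier is available (e.g. a Keras model's `predict`), and the Felzenszwalb parameters shown are assumptions that would need tuning for real data.

```python
import numpy as np
from skimage.segmentation import felzenszwalb


def segment_predictions(image, predict_fn, scale=1.0, sigma=0.5, min_size=5):
    """Segment `image` with Felzenszwalb's method and score each segment
    in isolation with `predict_fn` (a hypothetical stand-in for a trained
    CNN that returns class probabilities for a batch of images)."""
    # Partition the image into segments; each pixel gets a segment label.
    segments = felzenszwalb(image, scale=scale, sigma=sigma, min_size=min_size)
    results = []
    for seg_id in np.unique(segments):
        # Zero out everything except the current segment.
        masked = np.where(segments == seg_id, image, 0.0)
        # Add batch and channel dimensions before calling the classifier.
        probs = np.asarray(predict_fn(masked[np.newaxis, ..., np.newaxis]))
        results.append((int(seg_id), int(np.argmax(probs)), float(np.max(probs))))
    return results
```

Each tuple in the returned list pairs a segment with the class the model assigns to that segment alone and the associated confidence score, which is what the experiment below inspects.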
I find the outcome of this experiment unusual, interesting, uncanny, and dangerous all at the same time. If you look at the top three segments, which are nothing but the horizontal line from the actual image of the digit 7, the model predicts them as class 7 with a near-perfect score, even though those segments look nothing like a 7. Whereas for the fourth segment, which does somewhat resemble the digit 7, the prediction score drops to 0.913.
This finding further underscores the question of what the network is actually learning. Is it able to learn any high-level features as we humans do, or does it just find some low-level interaction of different intensity patterns among the pixels and classify images based on the presence or absence of those patterns?