Outstanding problem: Instead of training a student model to imitate only the class output probabilities of a teacher model, which treats each class dimension as independent, training it to imitate the teacher's representations transfers more knowledge from the teacher to the student.
Even though previous methods also transfer the teacher's representations, their loss functions, typically just a dot product between representations, are not designed to capture the correlations and higher-order dependencies in the representational space. Such a loss also forces the student's and teacher's feature vectors to have the same size, ruling out smaller student networks with unconstrained architectures, which is precisely the case that is most desirable.
This work employs a contrastive learning approach based on mutual information to capture these correlations in the representational space. The similarity between representations is estimated by a critic model, trained jointly with the student, that takes a teacher representation and a student representation and outputs a similarity between 0 and 1; because the critic mediates the comparison, the teacher and student feature vectors can have different sizes.
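As a rough illustration only (the summary contains no code), a minimal PyTorch-style sketch of such a critic could look like the following; the names (Critic, student_dim, teacher_dim, embed_dim) and the particular sigmoid-of-dot-product form are assumptions for this sketch, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Critic(nn.Module):
    """Hypothetical critic: projects teacher and student features (possibly of
    different sizes) into a shared embedding space and returns a similarity in (0, 1)."""

    def __init__(self, student_dim: int, teacher_dim: int, embed_dim: int = 128):
        super().__init__()
        self.student_proj = nn.Linear(student_dim, embed_dim)
        self.teacher_proj = nn.Linear(teacher_dim, embed_dim)

    def forward(self, f_student: torch.Tensor, f_teacher: torch.Tensor) -> torch.Tensor:
        # Project both feature vectors into the same space and L2-normalize them.
        s = F.normalize(self.student_proj(f_student), dim=1)
        t = F.normalize(self.teacher_proj(f_teacher), dim=1)
        # Dot product in the shared space, squashed to (0, 1) by a sigmoid.
        return torch.sigmoid((s * t).sum(dim=1))
```

Because each network gets its own projection layer, the student and teacher backbones are free to produce feature vectors of different dimensionality.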
Proposed solution: Instead of merely maximizing the dot product between the teacher's and student's representations, or minimizing the L1 distance between them, this work proposes maximizing the mutual information between the two.
Mutual information is a measure between two random variables that quantifies how much information one variable carries about the other.
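For reference (this formula is the standard definition and is not quoted from the summary), the mutual information between the teacher representation T and the student representation S can be written as

```latex
I(T; S) \;=\; \mathbb{E}_{p(t,\,s)}\!\left[\log \frac{p(t, s)}{p(t)\,p(s)}\right]
        \;=\; \mathrm{KL}\!\big(p(t, s)\,\|\,p(t)\,p(s)\big),
```

so maximizing I(T; S) pushes the joint distribution of the two representations away from the product of their marginals, i.e. it makes the student's representation strongly dependent on the teacher's.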
Adopting a contrastive learning framework, the authors train on positive pairs, i.e. student and teacher representations of the same input, by increasing the mutual information between them, and on negative pairs, i.e. student and teacher representations of different inputs, by decreasing it.
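Again purely as an illustration, a simplified NCE-style training loss built on the critic sketched above might look as follows; the pairing scheme (rolling the batch to form negatives) and the binary cross-entropy form are assumptions made for this sketch, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def contrastive_distillation_loss(critic, f_student, f_teacher):
    # Positive pairs: student and teacher features of the same input (same batch
    # index); the critic's similarity is pushed toward 1.
    pos = critic(f_student, f_teacher)
    pos_loss = F.binary_cross_entropy(pos, torch.ones_like(pos))

    # Negative pairs: each student feature paired with the teacher feature of a
    # different input (here, a simple roll of the batch); similarity pushed toward 0.
    neg = critic(f_student, f_teacher.roll(shifts=1, dims=0))
    neg_loss = F.binary_cross_entropy(neg, torch.zeros_like(neg))

    return pos_loss + neg_loss
```

In practice the gradients of this loss would update both the critic and the student, while the teacher stays frozen.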
Results and Conclusions: The output probabilities of a student network trained with the proposed method are shown to correlate more strongly with the output probabilities of the teacher network.
The proposed method outperforms many recently proposed knowledge distillation methods. It also outperforms all other methods in cases where the teacher and student architectures are very different.