Coming back to our old friend Canonical Correlation Analysis (CCA), let us see how researchers used it to tease apart the training dynamics of generalizing and memorizing CNNs.
CCA and its use in comparing the layers of a neural network have already been discussed here. In this article, we focus on comparing different neural networks with different training dynamics. Researchers at DeepMind and Google Brain built upon SVCCA to develop projection-weighted CCA for comparisons among CNNs: rather than averaging the canonical correlations uniformly, they took a weighted mean, with each weight reflecting how strongly the corresponding CCA vector relates to the underlying representation. A CCA vector that accounted for more of the representation was therefore assigned a higher weight.
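To make the weighting concrete, here is a minimal numpy sketch of this idea. The function names (`cca`, `pwcca`) and the exact weighting scheme are illustrative assumptions, not the authors' released code: the weight for each CCA vector is taken to be proportional to how much of the (centered) representation projects onto it.

```python
import numpy as np

def cca(X, Y):
    """Canonical correlations between activation matrices X (n, d1) and
    Y (n, d2), plus the X-view CCA vectors expressed in sample space."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # Orthonormal bases for each view's column space (whitening step).
    Ux = np.linalg.svd(X, full_matrices=False)[0]
    Uy = np.linalg.svd(Y, full_matrices=False)[0]
    # Canonical correlations are the singular values of Ux^T Uy.
    U, rho, _ = np.linalg.svd(Ux.T @ Uy)
    return np.clip(rho, 0.0, 1.0), Ux @ U

def pwcca(X, Y):
    """Projection-weighted mean of the canonical correlations (sketch)."""
    rho, x_vecs = cca(X, Y)
    Xc = X - X.mean(axis=0)
    # Weight each CCA vector by how much of the representation it accounts for.
    alpha = np.abs(x_vecs.T @ Xc).sum(axis=1)
    alpha = alpha / alpha.sum()
    return float((alpha * rho).sum())
```

Under this sketch, two representations related by an invertible linear map score near 1, while unrelated representations score much lower.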
It must also be remembered that real-world training data contains noise. Since training dynamics are impacted by the training data, how do the dynamics differ between the ‘original signal’ and the accompanying ‘noise’? To answer this, CCA similarity was computed between layer L at each time step t throughout training and the same layer L at the final time step T. It was found that the sorted CCA coefficients ρ continued to change well after the network’s performance had converged. A plausible interpretation is that the un-converged coefficients and their corresponding vectors represented the ‘noise’.
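The per-checkpoint comparison can be sketched as follows. This is a toy stand-in, not the paper's experiment: instead of real CNN checkpoints, the "activations" of layer L at step t are simulated as a mixture that drifts toward its final value, and the mean of the sorted CCA coefficients is tracked against the final step.

```python
import numpy as np

def cca_coefficients(X, Y):
    """Sorted canonical correlations between two activation matrices
    of shape (samples, neurons)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Ux = np.linalg.svd(X, full_matrices=False)[0]
    Uy = np.linalg.svd(Y, full_matrices=False)[0]
    rho = np.linalg.svd(Ux.T @ Uy, compute_uv=False)  # descending order
    return np.clip(rho, 0.0, 1.0)

# Hypothetical checkpoints of layer L: activations that drift toward
# their final value as training progresses (frac = t / T).
rng = np.random.default_rng(0)
final = rng.standard_normal((500, 20))  # layer L at the final step T
similarities = []
for frac in (0.25, 0.5, 1.0):
    acts = frac * final + (1.0 - frac) * rng.standard_normal((500, 20))
    similarities.append(cca_coefficients(acts, final).mean())
```

As expected, the mean CCA similarity to the final representation rises monotonically as the checkpoint approaches T.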
The next question that arose was whether the CCA vectors that stabilized early in training remained stable. To test this, the CCA vectors were computed between layer L at an early time step tₑₐᵣₗᵧ and time step T/2. The top 100 vectors, which had stabilized early, were found to remain similar to the representation at all other training steps; the bottom 100 vectors, which had not stabilized, continued to vary and therefore likely represented noise. These results suggested that task-critical representations are learned by midway through training, while the noise only approaches its final value towards the end.
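The top-versus-bottom comparison can be illustrated with another toy sketch (again an assumption-laden stand-in for the real experiment, with hypothetical helper names): every simulated checkpoint shares a stable ‘signal’ subspace but draws fresh ‘noise’ dimensions, and we check how much of each CCA vector survives in a held-out checkpoint.

```python
import numpy as np

def cca_vectors(X, Y):
    """Canonical correlations and X-view CCA vectors, sorted by correlation."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Ux = np.linalg.svd(X, full_matrices=False)[0]
    Uy = np.linalg.svd(Y, full_matrices=False)[0]
    U, rho, _ = np.linalg.svd(Ux.T @ Uy)
    return np.clip(rho, 0.0, 1.0), Ux @ U

def similarity_with(vecs, Z):
    """How much of each CCA vector lies in the column space of activations Z."""
    Uz = np.linalg.svd(Z - Z.mean(axis=0), full_matrices=False)[0]
    return np.linalg.norm(Uz.T @ vecs, axis=0)  # one value in [0, 1] per vector

# Toy checkpoints: a shared 'signal' subspace plus fresh 'noise' each step.
rng = np.random.default_rng(0)
signal = rng.standard_normal((500, 10))

def checkpoint():
    return np.hstack([signal, rng.standard_normal((500, 10))])

acts_early, acts_mid, acts_other = checkpoint(), checkpoint(), checkpoint()
rho, vecs = cca_vectors(acts_early, acts_mid)
top, bottom = vecs[:, :10], vecs[:, -10:]  # most / least correlated vectors
```

In this setup the top vectors project almost entirely onto any other checkpoint's representation, while the bottom vectors barely overlap with it, mirroring the signal/noise split described above.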