The two techniques are best practices to improve the generalization of deep learning models.
Understanding the characteristics of input datasets is an essential capability of machine learning algorithms. Given a specific input, machine learning models need to infer specific features about the data in order to perform some target actions. Representation learning or feature learning is the subdiscipline of the machine learning space that deals with extracting features or understanding the representation of a dataset.
Representation learning can be illustrated with a very simple example. Take a deep learning algorithm that is trying to identify geometric shapes in an image.
In order to match pixels to geometric shapes, the algorithm first needs to understand some basic features/representations of the data such as the number of corners. That’s the role of representation learning.
Representation learning has been an established discipline in the machine learning space for decades, but its relevance has increased tremendously with the emergence of deep learning. While traditional machine learning techniques such as classification often deal with mathematically well-structured datasets, deep learning models process data such as images or sounds that do not have well-defined features. In that sense, representation learning is a key element of most deep learning architectures.
The central problem of representation learning is to determine an optimal representation for the input data. In the context of deep learning, the quality of a representation is mostly given by how much it facilitates the learning process. In the real world, the learning algorithm and the underlying representation of a model are directly related.
If the knowledge representation of a model is tied to its learning algorithm, then selecting the correct representation should be trivial, right? We simply pick the knowledge representation associated with the learning task, and that should guarantee optimal performance. I wish it were that simple. In the journey to find an optimal representation we quickly meet an old friend: the No Free Lunch Theorem (NFLT).
NFLT is one of those mathematical paradoxes that puzzle even the most pragmatic data scientists and technologists. In a nutshell, NFLT states that, averaged over all possible data-generating distributions, every machine learning algorithm has the same error rate when processing previously unobserved points (read my previous article about NFLT). In other words, no machine learning algorithm is universally better than any other once performance is averaged over every possible task.
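The averaging argument behind NFLT can be checked by brute force on a tiny problem. The sketch below is only an illustration under stated assumptions: the 3-bit input space, the train/test split, and the two learners (`always_zero` and `majority_vote`) are hypothetical and do not come from any library. It enumerates every possible labeling of the inputs and measures each learner's error on the points it never saw.

```python
from itertools import product

# Toy universe: every input is a triple of binary features (8 inputs in total).
inputs = list(product([0, 1], repeat=3))
train_x = inputs[:5]  # points the learner is allowed to see
test_x = inputs[5:]   # previously unobserved points


def always_zero(train_pairs, x):
    """Hypothetical learner A: ignores the data and always predicts 0."""
    return 0


def majority_vote(train_pairs, x):
    """Hypothetical learner B: predicts the most common training label."""
    ones = sum(label for _, label in train_pairs)
    return 1 if 2 * ones > len(train_pairs) else 0


def average_test_error(learner):
    """Error on unseen points, averaged over every possible labeling of the inputs."""
    wrong = 0
    total = 0
    for labels in product([0, 1], repeat=len(inputs)):
        target = dict(zip(inputs, labels))
        train_pairs = [(x, target[x]) for x in train_x]
        for x in test_x:
            wrong += learner(train_pairs, x) != target[x]
            total += 1
    return wrong / total


error_a = average_test_error(always_zero)
error_b = average_test_error(majority_vote)
print(error_a, error_b)  # 0.5 0.5
```

Both learners end up with the same average error on unseen points, because across all labelings the unseen labels are independent of anything the training set reveals. A learner only gains an edge when the set of plausible tasks is restricted.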
In the context of representation learning, NFLT demonstrates that multiple knowledge representations can be applicable to the learning task. If that’s the case, how can we empirically decide on one knowledge representation vs. another? The answer is one of the core, and often ignored, techniques in machine learning and deep learning models: regularization.
A core task of machine learning algorithms is to perform well on new inputs outside the training dataset. Optimizing that task is the role of regularization. Conceptually, regularization covers any modification to a machine learning algorithm that is intended to reduce its test (generalization) error rather than its training error.
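One common form of regularization is an L2 (ridge) penalty on the model weights. The sketch below is a minimal illustration, not a recipe: the data is hand-picked so that noise in the training labels inflates the fitted slope, and the helper names `fit_slope` and `mse` are hypothetical. It compares an unpenalized fit with a penalized one on both the training points and a set of unseen points.

```python
def fit_slope(xs, ys, penalty=0.0):
    """Least-squares slope for y = w * x with an L2 penalty on w (ridge)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + penalty)


def mse(w, xs, ys):
    """Mean squared error of the model y = w * x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hand-picked toy data (an assumption for illustration): the true relation is
# y = 2x, and the training labels carry noise that inflates the slope.
train_x, train_y = [1.0, 2.0, 3.0], [2.5, 4.5, 7.0]
test_x = [4.0, 5.0, 6.0]
test_y = [2.0 * x for x in test_x]

w_plain = fit_slope(train_x, train_y)               # no regularization
w_ridge = fit_slope(train_x, train_y, penalty=1.0)  # L2 penalty of 1.0

# Columns: name, fitted slope, training error, test error.
for name, w in [("plain", w_plain), ("ridge", w_ridge)]:
    print(name, round(w, 3), round(mse(w, train_x, train_y), 3),
          round(mse(w, test_x, test_y), 3))
```

In this toy case the penalized fit has a slightly higher training error but a clearly lower error on the unseen points, which is the trade regularization aims for. On data where the noise happened to shrink the slope, the same penalty would hurt, so the penalty strength is normally chosen on held-out data.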
Let’s now come full circle and see how regularization is related to representation learning. The relationship is crystal clear: the quality of a knowledge representation is fundamentally related to its ability to generalize knowledge efficiently. In other words, the knowledge representation must be able to adapt to new inputs outside the training dataset. In order to perform well with new inputs and reduce the generalization error, any representation of knowledge should work well with regularization techniques. Therefore, the quality of representation learning models is directly influenced by their ability to work with different regularization strategies. The next step is to figure out which regularization strategies are specifically relevant in representation learning. That will be the topic of a future post.
Now that we know that regularization is a mechanism to improve the representation of knowledge, the next step is to evaluate the quality of a given representation. Essentially, we are trying to answer a simple question: what makes a knowledge representation superior to others?
Just to get the terminology straight, by regularization we are referring to the ability of a model to reduce its test error (generalization error) without impacting its training error. Every knowledge representation has certain characteristics that make it more amenable to specific regularization techniques. Artificial intelligence luminaries Ian Goodfellow and Yoshua Bengio have done some remarkable work in the area of regularization. Based on Goodfellow and Bengio’s thesis, there are a few characteristics that make knowledge representations more efficient when it comes to regularization. I’ve summarized five of my favorite regularization patterns below:
1 — Disentangling of Causal Factors
One of the key indicators of a robust knowledge representation is that its features correspond to the underlying causes of the training data. This characteristic helps to identify which features in the representation correspond to specific causes in the input dataset and, consequently, helps to separate some features from others.
2 — Smoothness
Representation smoothness is the assumption that the value of a hypothesis does not change drastically between points in close proximity in the input dataset. Mathematically, smoothness implies that f(x + εd) ≈ f(x) for a unit direction d and a very small ε. This characteristic allows knowledge representations to generalize better across nearby areas in the input dataset.
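The smoothness assumption can be checked numerically. The following is a small sketch with a hypothetical hypothesis `f` (a tanh of a linear combination, chosen only because it is smooth), a unit direction `d`, and a step size `eps`. It measures how much the output moves when the input is nudged by `eps` along `d`.

```python
import math


def f(x):
    """A hypothetical smooth hypothesis on 2-D inputs."""
    return math.tanh(0.5 * x[0] - 0.25 * x[1])


x = (1.0, 2.0)
d = (0.6, 0.8)  # unit direction: 0.6**2 + 0.8**2 == 1

changes = []
for eps in (1e-1, 1e-2, 1e-3):
    shifted = (x[0] + eps * d[0], x[1] + eps * d[1])
    change = abs(f(shifted) - f(x))  # how far the output moved
    changes.append(change)
    print(eps, change)
```

The change in the output shrinks together with the step size, which is what lets a prediction made at one point carry over to its close neighbors.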
3 — Linearity
Linearity is a regularization pattern that is complementary to the smoothness assumption. Conceptually, this characteristic assumes that the relationship between some input variables is linear (f(x) = ax + b), which allows a model to make accurate predictions even for inputs that lie relatively far from the observed data.
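As a rough illustration of why a linear assumption helps far from the observed data, the sketch below fits f(x) = ax + b to three toy points and then predicts at an input well outside their range. The data and the helper `fit_line` are assumptions made for this example. The prediction is only as good as the linearity assumption itself.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of f(x) = a * x + b."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b


# Toy observations generated from f(x) = 2x + 1 (an assumption for illustration).
xs = [0.0, 1.0, 2.0]
ys = [1.0, 3.0, 5.0]

a, b = fit_line(xs, ys)
far_away = 100.0  # well outside the observed range [0, 2]
prediction = a * far_away + b
print(a, b, prediction)  # 2.0 1.0 201.0
```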
4 — Hierarchical Structures
Knowledge representations based on hierarchies are ideal for many regularization techniques. A hierarchy assumes that every step in the network can be explained by previous steps which tremendously helps to better reason through a knowledge representation.
5 — Manifold Representation
Manifold learning is one of the most fascinating, mathematically deep foundations of machine learning. Conceptually, a manifold is a connected region of points that can be described with far fewer dimensions than the space it is embedded in. The manifold assumption states that probability mass tends to concentrate on manifolds in the input data. The great thing about manifolds is that they are relatively easy to reduce from high dimensional structures to lower dimensional representations, which are easier and cheaper to manipulate. Many regularization algorithms are especially efficient at detecting and manipulating manifolds.
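A toy version of this idea: points on a circle live in 2-D space, yet a single number (the angle) is enough to describe each of them. The sketch below uses hand-written `encode` and `decode` helpers, which are hypothetical stand-ins for what a manifold learning algorithm would have to discover from data. It maps each point to one coordinate and back, then measures the reconstruction error.

```python
import math


def encode(point):
    """Map a 2-D point on the unit circle to a single coordinate (its angle)."""
    x, y = point
    return math.atan2(y, x)


def decode(angle):
    """Map the 1-D coordinate back to the original 2-D point."""
    return (math.cos(angle), math.sin(angle))


# Sample points that concentrate on a 1-D manifold (the unit circle) in 2-D space.
points = [(math.cos(t), math.sin(t)) for t in (0.0, 0.5, 1.0, 2.0, 3.0)]

codes = [encode(p) for p in points]
max_error = max(math.dist(p, decode(c)) for p, c in zip(points, codes))
print(codes)      # one number per point instead of two
print(max_error)  # reconstruction error, close to zero
```

Real datasets sit in far higher dimensions and their manifolds are not known in advance, but the payoff is the same: fewer coordinates to store and manipulate with little or no loss of information.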
Representation learning is a key, though not very well-known, discipline in the deep learning space. Understanding the features and representation of the underlying datasets is essential in order to select the best neural network architecture for any given task. The characteristics explained in this article provide a simple framework to think about representation learning in the context of deep learning solutions.