If you are genuinely interested in machine learning, you have probably heard of kaggle.com, the popular data science competitions website. Last month, Stanford hosted an urgent competition there about predicting the degradation of the new mRNA Covid vaccines. To back up a little, these new vaccines rely on messenger RNA molecules instead of DNA, mainly because a Covid vaccine needed to be developed and deployed very rapidly.
The data provided by Stanford was tabular and mainly contained the RNA sequences, their lengths, reactivity measurements and many other features. One of the main challenges was that the competition was much shorter than typical Kaggle competitions: participants had to develop robust solutions in one month, using only about 3,000 RNA sequences. As a beginner in machine learning, I only managed to finish in the top 37%.
Here are the main lessons that I have learned:
1. Feature engineering is key
As an enthusiastic beginner in machine learning, I was very eager to start testing and prototyping quickly with complex Neural Networks. However, I now understand that you need to thoroughly examine, understand and analyze the data you are given if you want to optimize the competition metric. In fact, the top competitors were the ones with the best feature engineering, not the most complicated models.
To rewind for a second here, let’s explain what typically happens in a machine learning project:
- First, you are either given a dataset or you are required to gather and label one.
- Next, you need to carefully investigate which features or attributes in this dataset will best help you reach your goal or target.
- Then, you make an educated decision about the machine learning model best suited to make sense of this data and perform the required analysis, such as regression or classification.
It is very important to take your time with the second step, although it can be tempting to jump straight to the third. For instance, the 1st-place solution in this competition used a fairly standard model, but relied on data augmentation and pseudo-labelling, both of which belong to that second step.
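To make pseudo-labelling concrete, here is a minimal sketch of the idea in scikit-learn. The model choice and the toy arrays are my own illustration, not what the winning team actually used:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Toy data standing in for engineered RNA features (purely illustrative).
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(300, 10)), rng.normal(size=300)
X_unlabelled = rng.normal(size=(1000, 10))

# 1) Fit on the labelled data only.
model = GradientBoostingRegressor().fit(X_train, y_train)

# 2) Predict targets for the unlabelled pool; these predictions
#    become noisy "pseudo-labels".
pseudo_y = model.predict(X_unlabelled)

# 3) Retrain on the labelled and pseudo-labelled rows combined.
X_all = np.vstack([X_train, X_unlabelled])
y_all = np.concatenate([y_train, pseudo_y])
model = GradientBoostingRegressor().fit(X_all, y_all)
```

The idea is that the extra, imperfect labels expose the model to a wider slice of the input space than the labelled sequences alone.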
2. What Recurrent Neural Networks (RNNs) and LSTMs are
The main machine learning task in this competition was text regression, which is quite a new area for me. One of the most standard and effective model families in this subdomain is the RNN.
Although it may not be obvious at first, the selling point of RNNs over ordinary NNs is data persistence. RNNs have memory units, or hidden state cells, that allow them to “remember” previous inputs, which in text analysis happens to be a significant advantage. This is because the parts of a text sequence usually depend on one another, unlike, say, independent batches of cat and dog images. This allows the text to be analyzed within its “original context”. A basic NN only sees the examples within the current batch and treats them independently, whereas an RNN carries information from prior inputs forward while generating the current output.
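Here is a minimal NumPy sketch of a single vanilla-RNN step, just to show where that memory lives; the dimensions and random weights are arbitrary illustrations:

```python
import numpy as np

# One vanilla-RNN step: the new hidden state mixes the current input
# with the previous hidden state, which is where the "memory" lives.
def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

rng = np.random.default_rng(0)
input_dim, hidden_dim, seq_len = 4, 8, 5
W_xh = 0.1 * rng.normal(size=(input_dim, hidden_dim))
W_hh = 0.1 * rng.normal(size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)                    # empty memory at the start
for x_t in rng.normal(size=(seq_len, input_dim)):
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)   # h carries context forward
```

Because h is fed back in at every step, the output at position t depends on everything the network has seen before it.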
Long Short-Term Memory networks, or LSTMs, are an upgrade over basic RNNs. They have special gates that enhance the data persistence capabilities: these gates have additional parameters that the network learns in order to filter the data along the way, essentially discarding useless features and keeping the most relevant ones. This is a bit similar to how our own memory works: when we read a paragraph, we often remember keywords rather than the whole paragraph. In the computer world this is done simply by multiplying the irrelevant word vectors by 0 (to discard them), by 1 (to keep them), or by a value in between.
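To make that gating idea concrete, here is a rough NumPy sketch of one LSTM cell step. The weight shapes and initialization are illustrative assumptions, not a production implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One LSTM cell step. The forget/input/output gates are sigmoids in
# (0, 1): multiplying by ~0 discards information, by ~1 keeps it.
def lstm_step(x, h_prev, c_prev, W, U, b):
    z = x @ W + h_prev @ U + b           # all four gate pre-activations
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)
    c = f * c_prev + i * np.tanh(g)      # gated, filtered cell state
    h = o * np.tanh(c)                   # new hidden state
    return h, c

rng = np.random.default_rng(0)
d_in, d_h = 4, 8
W = 0.1 * rng.normal(size=(d_in, 4 * d_h))
U = 0.1 * rng.normal(size=(d_h, 4 * d_h))
b = np.zeros(4 * d_h)

h = c = np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):
    h, c = lstm_step(x, h, c, W, U, b)
```

The f * c_prev term is exactly the “multiply by 0 to forget, by 1 to keep” behaviour described above, except the gate values are learned.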
3. Autoencoders can be magical sometimes
Autoencoders are neural networks that aid in feature engineering by compressing datasets into a latent representation, effectively performing dimensionality reduction. I had already used Autoencoders heavily in ML projects, but it was great to see them being used in this competition to achieve high scores.
Essentially, many participants first pre-trained an Autoencoder on the data and then passed the compressed representation to an RNN. This allowed the RNNs to perform much better, since they were processing far fewer dimensions.
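Here is a sketch of that pattern in Keras: pre-train an autoencoder, then reuse only its encoder to compress the features for a downstream model. The layer sizes and toy data are assumptions for illustration:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Toy feature matrix standing in for per-sequence features (illustrative).
X = np.random.default_rng(0).normal(size=(3000, 64)).astype("float32")

# Autoencoder: compress 64 features down to a 16-dimensional latent
# code, trained to reconstruct its own input.
inputs = keras.Input(shape=(64,))
latent = layers.Dense(16, activation="relu", name="latent")(inputs)
outputs = layers.Dense(64)(latent)
autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=10, batch_size=64, verbose=0)

# Reuse just the trained encoder as a feature extractor; its output
# would then be fed to a downstream model such as an RNN.
encoder = keras.Model(inputs, latent)
X_compressed = encoder.predict(X)   # shape: (3000, 16)
```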
4. What Graph Neural Networks are
This was one of my favorite things about this competition: so many different approaches being used! I didn't fully understand Graph Neural Networks at the time, but I think I understand them a bit better now.
In computer science, graph theory is about modelling a problem as a set of vertices (nodes) connected by edges; a very common example is the Travelling Salesman Problem. This modelling technique then allows you to use tons of graph algorithms that can be quite useful. In this competition, many successful solutions modelled the secondary structure of the mRNA molecules as graphs and used Neural Networks to extract meaningful features from them.
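As a rough illustration of that modelling step, here is a sketch that turns an RNA secondary structure in dot-bracket notation into an adjacency matrix, the basic input a graph neural network consumes. The function name and the toy structure string are my own:

```python
import numpy as np

# Build an adjacency matrix from an RNA secondary structure in
# dot-bracket notation: '(' and ')' mark paired bases, '.' unpaired.
# Edges: sequential backbone neighbours plus base-pair partners.
def structure_to_adjacency(structure: str) -> np.ndarray:
    n = len(structure)
    adj = np.zeros((n, n), dtype=np.float32)
    stack = []
    for i, ch in enumerate(structure):
        if i > 0:                        # backbone edge to previous base
            adj[i, i - 1] = adj[i - 1, i] = 1.0
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            j = stack.pop()              # index of the matching "("
            adj[i, j] = adj[j, i] = 1.0  # base-pair edge
    return adj

adj = structure_to_adjacency("((..))..")

# A simple GCN-style layer would then mix each base's features with its
# neighbours', e.g. new_features = adj @ features @ weights.
```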