These tips and tricks are more technical. They’re specific techniques you might be able to implement in your own solution, and they’re commonly used in well-performing Kaggle competition submissions.
Engineer the heck out of those features.
If I have learned anything from this competition, it is that feature engineering is king. Feature engineering, simply put, is creating new features from the existing ones. This could be something as simple as multiplying two columns.
It’s too easy to treat neural networks, and machine learning methods in general, as magical all-purpose solutions that can supposedly learn anything from the data. Unfortunately, that’s not the case. Most of the time, for a model to truly learn well from the data, it needs human help.
The model is only as good as the data, so you might as well give it more information to make sense of the original data rather than less.
Two helpful ideas for feature engineering:
- PCA/feature reduction. This is a great feature engineering method, since we’re doing a lot of work for the model by saying, “these are the most important structural elements of the data, here you go”. You can replace the data with this PCA-reduced version, or concatenate the reduced PCA features to the data (probably more successful). Other manifold-learning/feature reduction methods like Locally Linear Embedding should work as well.
- Add statistics. If there are many columns that are on a comparable scale with each other, you may be able to add simple statistics like the mean and variance, but also higher-order ones like the kurtosis or skew. For example, the variance between data points like the number of cars moving in Los Angeles, in Santa Monica, in Beverly Hills, … may give us helpful information about the disparate impact of weather. If the variance is low, then perhaps the weather affects all the cities very similarly. The model could then use this to aid its prediction. (A sketch of both ideas follows this list.)
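Here is a minimal sketch of both ideas, assuming a tabular DataFrame X whose columns are on a comparable scale; the data, column names, and component count are made up for illustration:

```python
import numpy as np
import pandas as pd
from scipy.stats import kurtosis, skew
from sklearn.decomposition import PCA

# Hypothetical feature matrix: 50 columns on a comparable scale.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.random((1000, 50)),
                 columns=[f"feat_{i}" for i in range(50)])

# PCA features: concatenate the reduced representation onto the original data.
pca = PCA(n_components=10)
pca_feats = pd.DataFrame(pca.fit_transform(X),
                         columns=[f"pca_{i}" for i in range(10)],
                         index=X.index)

# Row-wise statistics across the comparable columns.
stats = pd.DataFrame({
    "row_mean": X.mean(axis=1),
    "row_var": X.var(axis=1),
    "row_skew": skew(X, axis=1),
    "row_kurt": kurtosis(X, axis=1),
}, index=X.index)

# The engineered dataset keeps the original columns and adds the new ones.
X_engineered = pd.concat([X, pca_feats, stats], axis=1)
```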
Feature engineering is an art. Most important is to remember to feature engineer with the context of the data in mind. If a feature doesn’t make sense in real life (e.g. multiplying two columns that have nothing to do with each other), it’s likely not going to help the model better understand the data.
Be stringent with your feature selection.
Feature engineering is great, and it’s good practice to go all-out on it. But it’s also important to remember that too much data can overwhelm the model and make it harder to learn what is important. Being precise about which features to leave in or out can do the model a tremendous service.
In general, try to be more conservative with removing columns. Data is precious, so only throw it away if you’re sure it’s not going to be helpful.
- Look closely at the data. Especially if there are many categorical variables, there may be redundant columns. For instance, some competitions include ‘control group’ samples for which the target is always 0; removing these usually helps.
- Information gain. You can calculate how much information each feature provides towards predicting the target, then remove features that barely provide any.
- Variance threshold. The less sexy version of information gain (but sometimes more practical): calculate the variance of each column and remove columns with little variance (after doing the necessary scaling).
- Feature reduction. If you find many highly correlated features, it may be helpful to replace them with a dimensionality-reduced version. In general, aim not to remove less ‘important’ features outright, but to reduce them; that way, you still keep whatever information they hold. (A sketch of the variance-threshold and information-gain methods follows this list.)
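Here is a minimal sketch of those two selection methods with scikit-learn; the data and the two cutoff values are made-up assumptions you would tune for your problem:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import VarianceThreshold, mutual_info_classif

# Made-up data: 50 features, binary target.
rng = np.random.default_rng(0)
X = rng.random((1000, 50))
y = rng.integers(0, 2, size=1000)

# Variance threshold: scale columns to [0, 1] so variances are comparable,
# then drop near-constant columns.
X_scaled = MinMaxScaler().fit_transform(X)
X_reduced = VarianceThreshold(threshold=0.01).fit_transform(X_scaled)

# Information gain: estimate the mutual information between each feature
# and the target, then keep only features above a (tunable) floor.
mi = mutual_info_classif(X, y, random_state=0)
X_selected = X[:, mi > 0.001]
```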
Understand the metric and design the solution with it in mind.
Kaggle evaluates your solution by a specific metric, which determines your ranking on the leaderboard. Sometimes, it’s something like Area Under Curve (AUC), or perhaps log-loss. Kaggle will always provide its formula in the ‘Evaluation’ section of the competition overview.
It’s always worth looking over, because it determines how you should go about constructing your solution. For instance, you may find that using a particular loss function very similar to the evaluation function will improve the model’s performance along that metric.
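For a minimal sketch of this idea, assume a Keras binary classifier (the architecture here is arbitrary). If the competition metric is log-loss, binary cross-entropy is the same quantity, so training on it optimizes the leaderboard metric directly:

```python
import tensorflow as tf

# Hypothetical binary classifier on 20 input features.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Binary cross-entropy is exactly log-loss; track AUC as well
# if that happens to be the leaderboard metric instead.
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])
```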
Let’s take the example of log-loss. Some Internet digging will bring up some helpful information: log-loss significantly penalizes confident but incorrect answers. That is, the more confident a model is in a wrong prediction, the faster the penalty grows. There’s a lot to think about here:
- Let’s say your model is doing rather poorly on log-loss because there is systematic error (i.e. the model is not understanding the data). It may be helpful to make your model more ‘hesitant’: if it’s going to get answers wrong, at least it shouldn’t be very confident in them. You can do this model-wise by augmenting the data (if suitable) or by making the model less confident. If you’re feeling lazy, you can simply use ‘target clipping’: if a prediction is less than 1% or greater than 99%, clip it to 1% or 99%, respectively (see the sketch after this list). This prevents any overconfident answers. (Of course, the other direction to look into here is how to reduce your model’s systematic error so that it better understands the data.)
- On the other hand, perhaps your model understands the data well. Instead of a systematic error, it has more of a precision error: it is often too hesitant. This suggests a new direction: try bagging or another ensemble approach, which are known to make prediction confidences more consistent and more confident.
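Here is a minimal sketch of target clipping, with made-up labels and predictions:

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical true labels and predicted probabilities,
# including some confident-but-wrong answers.
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0.999, 0.002, 0.85, 0.30, 0.97])

# Target clipping: cap predictions at 1% and 99% so a single
# overconfident wrong answer can't blow up the log-loss.
y_clipped = np.clip(y_pred, 0.01, 0.99)

print(log_loss(y_true, y_pred))     # heavily penalized by the confident misses
print(log_loss(y_true, y_clipped))  # same predictions, bounded penalty
```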
This kind of thinking is the creative and fun aspect of data science. It’s being able to translate mathematical knowledge into actual techniques.
Yay… time to model.
Modelling can be repetitive and boring if you treat it as a checklist: build, fine-tune, evaluate, repeat.
It can seem like there’s a pretty finite list of models to try out, especially if you’re not too experienced and are uncomfortable with low-level code.
Luckily, there is a lot of fun and learning to be had in the art of modelling. While this is by no means an exhaustive list, here are some things to try out:
- Pretraining. If you happen to have unsupervised or unscored data (data that is provided in the training set but not in the testing set), you can use it for pretraining by running it through the model first. This isn’t too hard to do. In the same vein, try out some pre-trained and pre-built models from Keras; these can save a lot of work and are not too hard to work with.
- Nonlinear topologies. These neural networks are not sequential; instead, one layer can branch off into several, which can later rejoin at some other point. This is actually really easy to do with Keras’s functional API (see the sketch after this list). For example, you can split image data into two convolutional layers with different filter sizes; they learn representations at different scales and later combine their knowledge.
- Wacky, mad-scientist solutions. The DeepInsight model is a great example of a wacky, mad-scientist solution. This approach was immensely popular and successful in the Mechanisms of Action competition. It used t-SNE, a visual dimensionality reduction method, to convert tabular data into an image, then trained a convolutional neural network on it.
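Here is a minimal sketch of such a branching topology with Keras’s functional API; the input shape, layer sizes, and filter sizes are arbitrary choices for illustration:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Image input branches into two convolutional paths with different
# filter sizes, so each branch sees the image at a different scale.
inputs = tf.keras.Input(shape=(32, 32, 3))
branch_a = layers.Conv2D(32, kernel_size=3, padding="same", activation="relu")(inputs)
branch_b = layers.Conv2D(32, kernel_size=7, padding="same", activation="relu")(inputs)

# The branches rejoin, and the combined representation feeds the classifier head.
merged = layers.Concatenate()([branch_a, branch_b])
pooled = layers.GlobalAveragePooling2D()(merged)
outputs = layers.Dense(10, activation="softmax")(pooled)

model = tf.keras.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy")
```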
And lastly, to cram in a few more ideas to try out: creative ways to combine predictions in ensembles, different activation functions besides ReLU (e.g. Leaky ReLU, Swish), and ‘boosting’ for non-tree models (feed the predictions of one model into another so it can learn the first one’s mistakes).
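As a minimal sketch of that last idea, here is residual ‘boosting’ with two non-tree models on made-up data (the model choices are arbitrary): the second model is trained on the first model’s residuals, i.e. its mistakes.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.neural_network import MLPRegressor

# Made-up regression data.
rng = np.random.default_rng(0)
X = rng.random((500, 10))
y = X[:, 0] * 2 + np.sin(X[:, 1] * 6) + rng.normal(scale=0.1, size=500)

# Model 1 makes a first pass; model 2 learns its residuals (mistakes).
model_1 = Ridge().fit(X, y)
residuals = y - model_1.predict(X)
model_2 = MLPRegressor(hidden_layer_sizes=(32,), max_iter=1000).fit(X, residuals)

# Final prediction = first model's guess + second model's correction.
y_pred = model_1.predict(X) + model_2.predict(X)
```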
The main point is: modelling is for everyone. You don’t need to have written the source code for TensorFlow to develop sophisticated and successful models. All you need is creativity and a willingness to try out ideas.