With Fastai and Kaggle Competitions
Intro
I’ve always been fascinated with the concept of artificial intelligence. I used to be a philosophy student and would armchair-speculate about things like Artificial General Intelligence. But I was secretly envious of my friends in the computer science department who got to learn about artificial intelligence in practice. Now I’m one of those computer science students, and I’m learning how to do the practical AI stuff.
Last semester, I took a class on data mining and did a little bit of classification with my final project using sklearn (UFOs And Elections). It was my first foray into ML and got me really excited to learn other techniques. So I took Jeremy Howard’s free Deep Learning for Coders class online. My goal for the course was to do a couple projects using what I learned about deep learning and fastai, but when I started looking at kaggle competitions, I changed my mind.
I chose to enter these three competitions:
- Titanic: Machine Learning from Disaster
- Natural Language Processing with Disaster Tweets
- Cassava Leaf Disease Classification
I may have spread myself a bit thin given the short four-week timeframe I had to learn the course material, but I think it was a good idea to enter multiple competitions in different domains so that I could get practice across a broad range of deep learning subfields. The benefit of doing kaggle competitions as my course projects was that I got to see exactly how well my work stacked up against others’. People also post awesome notebooks containing their work publicly on kaggle, so you get to see how others approached the problem and gain insight and inspiration from the work of other kagglers.
Titanic Survival
For the Titanic competition we were provided with tabular data about Titanic passengers, e.g. Sex, Age, and Cabin. The goal was to use the tabular data to predict whether each passenger survived.
The first technique I applied was feature engineering, borrowing some tricks I saw others use. One of the most useful was extracting the title prefix from the Name attribute to construct a Title attribute. (Props to this kaggler’s wonderful notebook for the idea: Titanic using Neural Network)
df_test['Title'] = df_test.Name.str.extract(r'([A-Za-z]+)\.', expand=False)
df_train['Title'] = df_train.Name.str.extract(r'([A-Za-z]+)\.', expand=False)
# Collapse uncommon titles into a single 'Rare' category so they match the mapping below.
rare = ['Rev', 'Dr', 'Mme', 'Ms', 'Major', 'Lady', 'Don', 'Sir', 'Mlle',
        'Col', 'Capt', 'Countess', 'Dona', 'Jonkheer']
df_test.Title = df_test.Title.replace(rare, 'Rare')
df_train.Title = df_train.Title.replace(rare, 'Rare')
Then I applied a numerical mapping to the titles:
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
df_train['Title'] = df_train['Title'].map(title_mapping)
df_test['Title'] = df_test['Title'].map(title_mapping)
I then built a random forest classifier as a baseline. From the model’s feature importances, I was able to determine that a couple of the columns were worth dropping.
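The baseline code isn’t shown in the post, but a minimal sketch of the feature-importance check might look like this (X_train and y_train are hypothetical names standing in for the engineered features and the Survived labels):

from sklearn.ensemble import RandomForestClassifier

# X_train / y_train are hypothetical names for the engineered features and labels.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Rank the columns by importance to spot candidates for dropping.
for name, score in sorted(zip(X_train.columns, rf.feature_importances_),
                          key=lambda pair: pair[1], reverse=True):
    print(f'{name}: {score:.3f}')

Based on the importances, Parch and SibSp were the ones I dropped: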
training_set = training_set.drop(['Parch'], axis=1)
test_set = test_set.drop(['Parch'], axis=1)
training_set = training_set.drop(['SibSp'], axis=1)
test_set = test_set.drop(['SibSp'], axis=1)
Classification results after dropping Parch and SibSp:
It’s interesting how building a baseline model can play a role in the preprocessing/feature engineering process.
Finally, I trained a fastai neural network on the dropped and engineered features from the random forest stage.
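The learner construction isn’t shown in the post; a hedged sketch with fastai’s tabular API might look like this (the column lists and layer sizes are assumptions, not the exact notebook code):

from fastai.tabular.all import *

# Hedged sketch: column choices and layer sizes are assumptions, not the notebook's exact code.
dls = TabularDataLoaders.from_df(df_train, y_names='Survived',
                                 cat_names=['Sex', 'Pclass', 'Title', 'Embarked'],
                                 cont_names=['Age', 'Fare'],
                                 procs=[Categorify, FillMissing, Normalize],
                                 y_block=CategoryBlock)
learn = tabular_learner(dls, layers=[200, 100], metrics=accuracy)

With the learner in place, the actual training call from the notebook was: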
learn.fit_one_cycle(10, 2e-2,
                    cbs=EarlyStoppingCallback(monitor='accuracy', min_delta=0.05, patience=3))
The results seemed comparable.
However, the score from the neural network was actually better against the competition test set. So it appears the random forest may have been overfitting. Either way, this model produced a score in the top 14% of entries.
Full Notebook Link: Beginner Random Forest + Fastai Neural Net Top 14%
What I would do differently next time:
I think the main weakness of this project was that, in the rush to get a working model, I didn’t do a sufficiently extensive round of visualization and exploratory data analysis. I may have missed feature engineering opportunities because I didn’t take the time to really parse the relationships between the different features. So I would definitely explore more feature engineering options if I took a second crack at the project.
Natural Language Disaster Tweets
The next competition was a natural language processing task: given a dataset of tweets, build a model that classifies whether or not each tweet is about a real disaster. Again I built an sklearn random forest baseline and then applied fastai neural nets. This time around I was able to harness fastai’s easy-to-apply ULMFiT transfer learning approach.
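The baseline details aren’t shown here; a minimal sketch of such a baseline might pair a bag-of-words style vectorizer with a random forest (the TF-IDF choice and the train_df/valid_df names are assumptions; the text and target columns are the competition’s actual column names):

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# TF-IDF is an assumption here; any bag-of-words vectorizer fits this sketch.
baseline = make_pipeline(TfidfVectorizer(max_features=5000),
                         RandomForestClassifier(n_estimators=100, random_state=42))
baseline.fit(train_df['text'], train_df['target'])
print(baseline.score(valid_df['text'], valid_df['target']))  # validation accuracy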
Sklearn results:
After building this baseline, I used the ULMFiT approach to classify the tweets with fastai’s text_classifier_learner and the pretrained AWD_LSTM architecture.
Steps:
1. Preprocess and tokenize the text to build a language model dataset from the tweets.
2. Train a language model that predicts the next word given the words that come before it.
3. Use the same vocabulary and the pre-trained model to build the text classifier.
I also added another step that gave me a slight ~1% boost in accuracy.
4. Bonus step: train a classifier using ULMFiT on a backwards training set, train another on the regular dataset, and use both to predict as an ensemble.
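To make steps 1–3 concrete, here is a hedged sketch of the ULMFiT pipeline in fastai (the encoder name, epoch counts, and learning rates are illustrative, not the notebook’s exact values):

from fastai.text.all import *

# Steps 1-2: fine-tune a pretrained AWD_LSTM language model on the tweets.
dls_lm = TextDataLoaders.from_df(df, text_col='text', is_lm=True)
lm = language_model_learner(dls_lm, AWD_LSTM, metrics=accuracy)
lm.fine_tune(3, 2e-2)       # epochs and learning rate are illustrative
lm.save_encoder('ft_enc')   # 'ft_enc' is a hypothetical encoder name

# Step 3: build the classifier on the same vocabulary and load the fine-tuned encoder.
dls_clas = TextDataLoaders.from_df(df, text_col='text', label_col='target',
                                   text_vocab=dls_lm.vocab)
clas = text_classifier_learner(dls_clas, AWD_LSTM, metrics=accuracy)
clas.load_encoder('ft_enc')
clas.fine_tune(4, 2e-2)

For the bonus step, the forward and backwards classifiers’ predicted probabilities can simply be averaged to form the ensemble prediction.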
Fastai results:
This got a score in the top 30%.
Full Notebook link: Fastai Backwards-Forwards ULMfit Ensemble
What I would do differently next time:
I used fastai’s TextDataLoaders.from_df method to preprocess and tokenize the text. Fastai does allow more customizable text preprocessing options, so that is something I would want to explore further.
Cassava Leaf Disease Classification
Finally, I wanted to play with a little computer vision. This competition actually has cash prizes.
Techniques:
For data augmentation/preprocessing I used Resize and aug_transforms. I normalized using the ImageNet stats and made the images 128×128 in size.
def get_dls(size):
    dblock = DataBlock(blocks=(ImageBlock, CategoryBlock),
                       get_x=get_x,
                       get_y=get_y,
                       item_tfms=Resize(460),
                       batch_tfms=[*aug_transforms(size=size, min_scale=0.75),
                                   Normalize.from_stats(*imagenet_stats)])
    return dblock.dataloaders(df, bs=128)
dls = get_dls(128)
I then used a pre-trained resnet50 model and fine-tuned it. Next I applied progressive resizing: a technique where you first train the model on smaller, lower-resolution images, then resize the images and train more. The general idea is that the first, lower-resolution round of training helps the model learn general patterns, while the later training on higher-resolution images gives the model (which at this point already knows the general features) a chance to learn more fine-grained features.
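The learner construction isn’t shown in the snippet above; with the fastai version current at the time, it would look something like this (the epoch count is an assumption):

from fastai.vision.all import *

# Hedged sketch: a pre-trained resnet50, fine-tuned on the 128px dataloaders from get_dls(128).
learner = cnn_learner(dls, resnet50, metrics=accuracy)
learner.fine_tune(5)  # epoch count is illustrative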
Below you can see that I call get_dls() with a larger image size and update the learner’s DataLoaders attribute.
learner.dls = get_dls(224) # progressive resizing
Then I trained the model some more. Finally, I applied test time augmentation for a small performance boost.
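Test time augmentation averages predictions over several randomly augmented copies of each image; in fastai it is a one-liner (a minimal sketch, reusing the learner from above):

# tta() returns predictions averaged over augmented versions of the validation set.
preds, targs = learner.tta()
print(accuracy(preds, targs))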
The model was about 88 percent accurate. Not too shabby.
Full Notebook Link: Beginner fastai Progressive Resize
What I would do differently next time:
I would take a lot more time to explore image transformation options. From what I see on a lot of high quality image classification kaggle submissions, a large part of the code is concerned with transforming and augmenting the data.
Next Steps
I am definitely going to continue kaggling and potentially further fine tune the Cassava classifier.
I’m also going to pay more mind to coding style in the future. My competition submissions weren’t the best examples of functional or object-oriented style. Beyond readability, adhering to object-oriented design would also make it easier to reuse classes defined in one project on another, potentially letting me abstract away a lot of the minutiae.
I plan to learn pytorch in depth. Fastai is a wrapper around pytorch, so learning fastai does teach you some pytorch, but only to a certain extent. Moreover, learning how to construct and train neural networks at a lower level of abstraction with pytorch will only help my understanding of deep learning. From what I see in some pretty impressive kaggle entries and notebooks, you can really get the most out of fastai if you know your pytorch. Also, pytorch is a frequently listed skill qualification on ML/data science job postings.
At the moment my learning resources for pytorch are the official docs and the free pytorch udacity course. Let me know if you think there are better resources I should be considering.
I just built my first convolutional neural network with pytorch, and I would like to write a tutorial on CNNs. There were a couple of things concerning the arithmetic involved in CNN architectures that, if explained more clearly, would have saved me a lot of debugging time, so I’d like to help others with that. And as the adage goes, “if you want to learn something, teach it”. I’d also like to build a resnet and pre-train it on CIFAR-10 to see how it does on the Cassava Leaf Disease classification task. The motivation is that the CIFAR-10 dataset, like the Cassava dataset, is pretty low resolution; maybe learning to pick up on patterns in low-quality CIFAR-10 images will “transfer” to the Cassava task.
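As an example of the arithmetic I mean, the output size of a convolutional layer follows a single formula, and getting it wrong is a classic source of shape-mismatch errors:

# Output width/height of a conv layer: out = floor((in + 2*padding - kernel) / stride) + 1
def conv_out(size, kernel, stride=1, padding=0):
    return (size + 2 * padding - kernel) // stride + 1

print(conv_out(32, kernel=3, stride=1, padding=1))  # 32: 'same'-size output on CIFAR-10 images
print(conv_out(32, kernel=3, stride=2, padding=1))  # 16: stride 2 halves the feature map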
Thanks for reading this article!
Let me know what you think. If you like my kaggle notebooks, give them an upvote. Follow me on kaggle or connect on LinkedIn. Questions, suggestions, and constructive criticism are definitely welcome. I hope this is helpful for others getting started on their own machine learning journeys as well.