Check out the AI solution for Kaggle’s $25,000 competition for diagnosing prostate cancer
With more than 1 million new diagnoses reported every year, prostate cancer (PCa) is the second most common cancer among males worldwide and results in more than 350,000 deaths annually. The key to decreasing mortality is developing more precise diagnostics.
Diagnosing prostate cancer through machine learning is an important task and has been a challenge for a while. In this article, I will explain the top solution for Kaggle’s prostate cancer competition. I have a fair amount of experience with machine learning in digital pathology, and I was quite impressed by this solution. I think there are a lot of lessons that can be learned here.
The main challenge here is to predict the ISUP grade from a WSI (whole slide image). Whole slide images are gigantic, high-resolution pathology images. The ISUP grade reflects the severity of prostate cancer: 0 means no cancer, and 5 is the highest-risk grade.
The competition was evaluated by a metric called Quadratic Weighted Kappa.
What is Quadratic Weighted Kappa (QWK)?
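QWK measures the agreement between two sets of ratings, here the actual and the predicted grades. A value of 0 indicates the agreement expected by chance and 1 indicates complete agreement. If QWK is negative, then there is less agreement than expected by chance. It’s calculated through the following equation:

$$\kappa = 1 - \frac{\sum_{i,j} w_{i,j} O_{i,j}}{\sum_{i,j} w_{i,j} E_{i,j}}, \qquad w_{i,j} = \frac{(i - j)^2}{(N - 1)^2}$$

where an N x N histogram matrix O is constructed, such that O_{i,j} corresponds to the number of isup_grades i (actual) that received a predicted value j. An N-by-N matrix of weights, w, is calculated based on the difference between actual and predicted values. E is the N x N matrix of ratings expected under chance agreement: the outer product of the actual and predicted grade histograms, normalized so that E and O have the same sum.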
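To make the metric concrete, here is a minimal NumPy sketch of QWK. It is my own illustration, not code from the competition. The function name `quadratic_weighted_kappa` and the default of 6 classes (ISUP grades 0 to 5) are assumptions, and the expected matrix E is built as the outer product of the actual and predicted grade histograms, scaled to the same total as O.

```python
import numpy as np

def quadratic_weighted_kappa(actual, predicted, n_classes=6):
    """Quadratic weighted kappa between two integer label sequences."""
    actual = np.asarray(actual, dtype=int)
    predicted = np.asarray(predicted, dtype=int)

    # O[i, j]: number of samples with actual grade i and predicted grade j
    O = np.zeros((n_classes, n_classes))
    for a, p in zip(actual, predicted):
        O[a, p] += 1

    # w[i, j] = (i - j)^2 / (N - 1)^2: larger disagreements cost more
    grades = np.arange(n_classes)
    w = (grades[:, None] - grades[None, :]) ** 2 / (n_classes - 1) ** 2

    # E: outer product of the two marginal histograms, scaled to sum like O
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()

    return 1.0 - (w * O).sum() / (w * E).sum()
```

Identical label sequences give a score of 1, while predictions that are systematically far from the true grade push the score below 0. The sketch does not guard against the degenerate case where every sample falls into a single cell of O.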
First step: Data preprocessing (dealing with WSIs):
Processing WSIs is always a tedious task. For my project, I used a library called OpenSlide to break the slides down into tiles; I might write a separate story about this later. Here, however, they used a technique called concatenate tile pooling (CTP). You might be thinking, can’t you just resize the image to a smaller size? You can, but downsampling discards pixels, so you would lose a lot of information.
The result of CTP is the same as with OpenSlide: a collection of tiles that would be equivalent to the image if put together. CTP works this way:
Instead of passing an entire image as input, N tiles are selected from each image based on the number of tissue pixels and passed independently through the convolutional part. The outputs of the convolutional part are concatenated into a single large map for each image, which is followed by pooling and a fully connected head.
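The tile-selection step can be sketched in a few lines of NumPy. This is only an illustration under my own assumptions, not the winning team's code: the function name `select_tiles`, the tile size of 128 and the count of 16 are hypothetical, and "most tissue" is approximated by "darkest tile", since the slide background is close to white.

```python
import numpy as np

def select_tiles(image, tile_size=128, n_tiles=16):
    """Return the n_tiles tiles of an RGB image that contain the most tissue."""
    h, w, c = image.shape
    pad_h = (-h) % tile_size
    pad_w = (-w) % tile_size
    # Pad with white (255) so the padding is treated as background
    image = np.pad(image, ((0, pad_h), (0, pad_w), (0, 0)), constant_values=255)

    rows = image.shape[0] // tile_size
    cols = image.shape[1] // tile_size
    # Cut the padded image into a stack of (tile_size, tile_size, c) tiles
    tiles = (image.reshape(rows, tile_size, cols, tile_size, c)
                  .transpose(0, 2, 1, 3, 4)
                  .reshape(-1, tile_size, tile_size, c))

    # Darker tiles have a lower pixel sum, assumed here to mean more tissue
    darkness = tiles.reshape(len(tiles), -1).sum(axis=1, dtype=np.int64)
    order = np.argsort(darkness)[:n_tiles]
    return tiles[order]
```

Each selected tile would then go through the convolutional backbone on its own, and the resulting feature maps would be concatenated before pooling. If the image yields fewer than `n_tiles` tiles, the sketch simply returns all of them.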
- Imagehash: Python hashing library, used here to remove duplicates
The dataset provided was noisy and had some duplicates. Part of the challenge was to remove duplicates. They used an image hashing library called Imagehash to do so. Image hashing is the process of constructing a hash value based on the visual contents of an image.
Here is the code that was used:
    import cv2
    import imagehash
    import numpy as np
    import torch
    from PIL import Image
    from tqdm.auto import tqdm

    # Different hashing types (assumed: four 8x8 hashes, 4 * 64 = 256 bits,
    # matching reshape(256) below). Use the appropriate hash type.
    funcs = [imagehash.average_hash, imagehash.phash, imagehash.dhash, imagehash.whash]

    hashes = []
    for path in tqdm(paths, total=len(paths)):  # paths: list of image file paths
        image = cv2.imread(path)
        image = Image.fromarray(image)
        hashes.append(np.array([f(image).hash for f in funcs]).reshape(256))
    hashes = torch.tensor(np.array(hashes).astype(int))

    # Calculate similarity scores: fraction of matching hash bits per image pair
    sims = np.array([(hashes[i] == hashes).sum(dim=1).cpu().numpy() / 256 for i in range(hashes.shape[0])])

    # Let's check image pairs with similarity larger than the threshold.
    # You can lower the threshold to find more duplicates (and more false positives).
    threshold = 0.96
    duplicates = np.where(sims > threshold)
    # remove duplicates....
The solution relies on 3 different EfficientNet models and cross-entropy loss. From my experience at Kaggle, EfficientNet is becoming very popular for supervised image classification tasks. It was also used to win the Melanoma competition.
Unlike many other solutions where the developers just use an ensemble of 2 networks, the developer here had an extra network to clean the labels (Step 1). This is because the dataset is noisy (as mentioned above) and this was one of the main challenges of the competition. This also reminds me of a technique called pseudo labeling that was used to win another Kaggle competition.
- Cosine annealing scheduler
A scheduler adjusts the learning rate of the model over the course of training in a systematic fashion. The cosine type is one of the more aggressive schedulers: the learning rate starts high, decays along a cosine curve until it is very close to 0, and, when restarts are used, is then increased again (the cosine wave).
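As a small illustration of the cosine shape, the learning rate for a single decay cycle can be written as a one-line formula. The function name `cosine_annealing_lr` and the default values are my own placeholders, not the settings used by the winning team.

```python
import math

def cosine_annealing_lr(step, total_steps, lr_max=1e-3, lr_min=0.0):
    """Learning rate at `step` on a single cosine decay from lr_max to lr_min."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))
```

The rate equals `lr_max` at step 0, half of it at the midpoint, and `lr_min` at the final step. To get the repeating wave, one option is to pass `step % total_steps` so that the schedule restarts at the start of every cycle.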
Data augmentation is becoming standard for top solutions in Kaggle competitions. This team used cutout and mixup (blending shuffled pairs of samples) to improve generalization.
Cutout is a simple regularization technique for convolutional neural networks that involves removing contiguous sections of input images, effectively augmenting the dataset with partially occluded versions of existing samples.
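A minimal cutout sketch in NumPy looks roughly like this. The function name `cutout`, the patch size of 32 and the choice to fill the patch with zeros are assumptions for illustration, and the team's actual augmentation settings are not given in this article.

```python
import numpy as np

def cutout(image, size=32, rng=None):
    """Return a copy of `image` with one random square patch set to zero."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]

    # Pick a random centre; the patch is clipped at the image borders
    cy = int(rng.integers(0, h))
    cx = int(rng.integers(0, w))
    y0, y1 = max(cy - size // 2, 0), min(cy + size // 2, h)
    x0, x1 = max(cx - size // 2, 0), min(cx + size // 2, w)

    out = image.copy()
    out[y0:y1, x0:x1] = 0
    return out
```

Because the centre may fall near an edge, the occluded region can be smaller than `size` x `size`. The input array is left untouched, so the same image can be augmented differently in every epoch.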
The label-cleaning technique essentially measures the gap between the original labels and the “hold-out” (1st step) prediction results. A large gap means that either the label is wrong or that this is a complex data point. A disadvantage of this technique is that the model will perform well on the majority of the dataset but poorly on the complex data points (since they were removed before training).
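The gap-based filtering can be sketched as follows. This is a simplified illustration and not the winners' exact rule: the function name `keep_clean_samples` and the threshold of 1.5 grades are hypothetical, and the predictions are assumed to be out-of-fold, meaning each one comes from a model that never saw that sample during training.

```python
import numpy as np

def keep_clean_samples(labels, oof_predictions, max_gap=1.5):
    """Boolean mask that is True where the label and the prediction roughly agree."""
    labels = np.asarray(labels, dtype=float)
    oof_predictions = np.asarray(oof_predictions, dtype=float)

    # Gap between the original label and the hold-out prediction
    gap = np.abs(labels - oof_predictions)

    # A large gap suggests a wrong label or a hard sample; drop it either way
    return gap <= max_gap
```

The mask would then be applied to the training set before the final re-training run. Raising `max_gap` keeps more hard samples at the cost of keeping more mislabeled ones.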
Data processing techniques always turn out to be quite important. If you look at other solutions for this competition, you will find out that the data cleaning step is the main step that led this solution to win 1st place.
To summarise, the solution works this way:
- Split the images into folds according to their similarity and remove duplicates
- Train with noisy labels
- Remove noise by prediction and original label gap (out of fold)
- Re-train model without noise
I think there are a lot of lessons to be learned from the top solutions in Kaggle competitions. I always thought the challenge would be choosing the right model; however, as we saw above, there are many more challenges than that.