ARTIFICIAL INTELLIGENCE — Opinion
DeepCOVID-XR reportedly performs on par with experienced thoracic radiologists at detecting COVID-19 from chest radiographs, yet needs far less time to train. How much should we expect from these AI radiologists?
COVID-19 produces characteristic features that can be seen on chest imaging (X-ray and CT), and this has inspired many AI researchers to develop algorithms that detect them.
Recently, researchers from Northwestern University published a new classifier, DeepCOVID-XR, which flags a “COVID-19 positive case” based on chest radiographs. Its performance is reported to be similar to that of experienced thoracic radiologists.
Many studies have been published on similar topics, but the authors claim that DeepCOVID-XR was trained on “the largest clinical dataset of chest radiographs from the COVID-19 era”.
As a data scientist, I would like to briefly go over what we can learn from DeepCOVID-XR about constructing a classifier.
The important reference labels.
For a classification problem, data and algorithms are like food and cutlery for a picky gastronome.
No matter how gorgeous your cutlery is, you cannot satisfy that person without good food.
One of the most important properties of good data is a reliable target value, such as the true labels of the samples used to train the classifier.
The data used to train DeepCOVID-XR were labeled “independently”, with labels drawn from any of the following sources: reverse transcription polymerase chain reaction (RT-PCR) test results, International Classification of Diseases (ICD-10) codes, and electronic health records (EHR).
We know that none of RT-PCR, ICD-10, or EHR is 100% accurate for diagnosing COVID-19. However, the key is that these records are independent of the classifier’s task (diagnosing COVID-19 from radiographs). For example, it would be problematic to train the classifier on labels that come from expert radiologists reading the same radiographs: the classifier would then learn to imitate the radiologists, and comparing its performance against theirs would no longer be meaningful.
In sum, obtaining true labels independently of the task is very important when constructing a classifier.
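To make the circularity problem concrete, here is a minimal simulation sketch. Every number in it is an assumption chosen for illustration (a radiologist who is right 80% of the time, a PCR-style label that is right 95% of the time, 50% prevalence); none comes from the DeepCOVID-XR study. The hypothetical "model" simply copies the radiologist, which is the best a classifier could do if it were trained on radiologist labels.

```python
import random

def simulate(n=10_000, seed=0):
    """Score a radiologist-imitating model against two kinds of labels.

    All rates below are illustrative assumptions, not study figures.
    """
    rng = random.Random(seed)
    circular_hits = 0      # agreement with radiologist-derived labels
    independent_hits = 0   # agreement with independent (PCR-style) labels
    for _ in range(n):
        truth = rng.random() < 0.5                                   # assumed prevalence
        radiologist = truth if rng.random() < 0.80 else not truth   # assumed reader accuracy
        pcr = truth if rng.random() < 0.95 else not truth           # assumed label accuracy
        model = radiologist  # hypothetical model that perfectly imitates the reader
        circular_hits += (model == radiologist)
        independent_hits += (model == pcr)
    return circular_hits / n, independent_hits / n

circular_acc, independent_acc = simulate()
print(f"scored on radiologist labels: {circular_acc:.2f}")
print(f"scored on independent labels: {independent_acc:.2f}")
```

Scored against the labels it was trained to imitate, the model looks perfect. Scored against the independent labels, its agreement is about 0.77 in expectation under these assumptions (0.80 × 0.95 + 0.20 × 0.05), which is the more honest number.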
Use the ensemble of classifiers.
Expanding the structure of the neural network, tuning the hyperparameters, or even changing to a more complex model may help capture more detailed features from the data. However, none of these guarantees better performance on the test set.
One of the most practical ways to improve performance is to use an ensemble of models.
DeepCOVID-XR is a weighted ensemble of six different convolutional neural networks (CNNs): DenseNet-121, ResNet-50, InceptionV3, Inception-ResNetV2, Xception, and EfficientNet-B2.
These CNNs were pre-trained on chest radiographs so that they “understand” the features of chest X-ray images, and then fine-tuned on the COVID-19 dataset.
One of the assumptions behind an ensemble is that each model is good or bad at capturing specific kinds of features. Combining the individual models is meant to compensate for the obvious shortcomings of each one.
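A weighted ensemble can be as simple as a weighted average of each model’s predicted probability. The sketch below shows only that idea. The function name, the probabilities, and the weights are made up for illustration, and how DeepCOVID-XR actually derives its weights is not described here, so this should not be read as the authors’ implementation.

```python
def weighted_ensemble(probs_by_model, weights):
    """Return the weighted average of per-model probabilities for one image."""
    if probs_by_model.keys() != weights.keys():
        raise ValueError("each model needs exactly one weight")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(weights[name] * p for name, p in probs_by_model.items())
    return weighted_sum / total_weight

# Hypothetical outputs of three fine-tuned CNNs for a single radiograph.
probs = {"DenseNet-121": 0.9, "ResNet-50": 0.7, "Xception": 0.2}
# Hypothetical weights, e.g. reflecting each model's validation performance.
weights = {"DenseNet-121": 2.0, "ResNet-50": 1.0, "Xception": 1.0}

score = weighted_ensemble(probs, weights)
print(f"ensemble probability: {score:.3f}")  # (1.8 + 0.7 + 0.2) / 4 = 0.675
```

With equal weights the same three outputs would average to 0.600, so the weights let the ensemble lean on the models that are trusted more.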
Machine learning practitioners may have found that, on tabular data, a random forest often performs better than single models such as logistic regression or SVM. The random forest itself is an ensemble of decision trees.
Each decision tree, with limited depth and a limited number of features (set by the hyperparameters), may handle some features of the data badly or ignore them entirely; an ensemble of a large number of such trees addresses this.
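A back-of-the-envelope calculation shows why many weak voters can beat one. The sketch below assumes each voter is correct independently with the same probability, which is a simplifying assumption: trees in a real random forest are correlated, so the real gain is smaller. The 0.6 accuracy is an arbitrary illustrative value.

```python
from math import comb

def majority_vote_accuracy(n_voters, p_correct):
    """Probability that a strict majority of independent voters is correct.

    Assumes an odd number of voters, each correct with probability p_correct.
    """
    if n_voters % 2 == 0:
        raise ValueError("use an odd number of voters to avoid ties")
    k_min = n_voters // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_voters, k) * p_correct**k * (1 - p_correct) ** (n_voters - k)
        for k in range(k_min, n_voters + 1)
    )

for n in (1, 3, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

Under these assumptions a single voter is right 60% of the time, three voters together 64.8% of the time, and 101 voters well over 95% of the time.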
Similarly, DeepCOVID-XR, as a weighted ensemble of CNNs, was shown to outperform each CNN by itself.
The ceiling of the algorithms’ performance is already defined by the task.
Of course, we chase higher performance numbers when developing algorithms. But we need to keep in mind that the ceiling of the performance is already defined by the task itself.
Chest radiographs alone cannot serve as a definitive diagnostic tool for COVID-19, so we should not expect an AI algorithm to be sophisticated enough to make the diagnosis on its own. Here the task itself, “to diagnose COVID-19 by chest radiographs”, sets the ceiling for any AI program doing the job.
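The ceiling can be made explicit with a toy calculation. Suppose the image reveals only one of two findings, and the same finding can occur with or without the disease. The best any classifier can do is pick the more likely label for each finding. The joint probabilities below are invented for illustration and are not clinical estimates.

```python
def accuracy_ceiling(joint):
    """Best achievable accuracy for any classifier that sees only the finding.

    `joint` maps (finding, label) -> probability and should sum to 1.
    """
    best_per_finding = {}
    for (finding, _label), p in joint.items():
        # For each finding, the optimal rule predicts its most probable label.
        best_per_finding[finding] = max(best_per_finding.get(finding, 0.0), p)
    return sum(best_per_finding.values())

# Hypothetical joint distribution of radiographic finding and true status.
joint = {
    ("opacities", "covid"): 0.30,
    ("opacities", "not_covid"): 0.15,
    ("clear", "covid"): 0.10,
    ("clear", "not_covid"): 0.45,
}

ceiling = accuracy_ceiling(joint)
print(f"accuracy ceiling: {ceiling:.2f}")  # 0.30 + 0.45 = 0.75
```

In this toy setting no model, however large, can exceed 75% accuracy, because a quarter of the cases look identical to cases with the opposite label.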
This type of ceiling is everywhere in applied AI. The intelligence does not really come from a process of thinking, but from the task the system is trained to do and the data it is fed. Therefore, I sometimes place more trust in biologists to create real artificial intelligence in the future.