I chose accuracy as my metric because the majority class sits between 50% and 70% of the data, at 51.7%, a range where accuracy remains a meaningful measure. I didn’t need any resampling or other class-balancing techniques since the two binary classes are roughly even, as seen above in the Target Occurrence visual.
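As a quick sanity check, the baseline can be computed directly from the target vector. A minimal sketch, assuming the target is held in a pandas Series named `y` (the name and placeholder values are mine, not the project’s):

```python
# Minimal sketch: majority-class baseline accuracy.
# `y` stands in for the project's binary target vector ("talent").
import pandas as pd

y = pd.Series([1, 0, 1, 0, 0])  # placeholder values for illustration only

baseline_accuracy = y.value_counts(normalize=True).max()
print(f"Majority-class baseline: {baseline_accuracy:.1%}")  # ~51.7% in this dataset
```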
After cleaning and pre-processing the dataset and creating the feature matrix and target vector, I built three models (a fitting sketch follows the list):
- Logistic Regression
- Random Forest Classifier
- XGBoost Classifier
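A minimal sketch of how the three models can be fit and scored, assuming the train/test split (`X_train`, `X_test`, `y_train`, `y_test`) produced during pre-processing; the settings shown are illustrative, not the project’s exact configuration:

```python
# Fit all three classifiers and report test accuracy.
# X_train, X_test, y_train, y_test are assumed from the pre-processing step.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest Classifier": RandomForestClassifier(random_state=42),
    "XGBoost Classifier": XGBClassifier(random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: {model.score(X_test, y_test):.3f}")  # test accuracy
```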
XGBoost Classifier
This model had 4479 true positives, 3828 true negatives, 2489 false positives, and 2190 false negatives. The XGBoost model performed best overall: it had the highest combined count of true positives and true negatives, and the lowest combined count of false positives and false negatives (the Logistic Regression model had the highest number of true positives alone and the lowest number of false negatives alone).
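These counts come straight out of the confusion matrix. A sketch of how to extract them, assuming `xgb` is the fitted XGBoost Classifier from above:

```python
# Pull the four confusion-matrix counts for the XGBoost model.
# `xgb` and the hold-out split are assumed from the modeling step above.
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_test, xgb.predict(X_test)).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
```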
PDP Interact Plot for features ‘deadlift’ and ‘snatch’ from XGBoost Classifier
The above PDP Interact Plot shows a fairly even expansion of purple along both the x-axis and the y-axis. The athletes with the lowest potential have a very high deadlift but can snatch only a very small amount of weight. This isn’t surprising: for an athlete to realize his or her full strength potential, the weight snatched should be roughly 40% of the deadlift.
It is not uncommon for CrossFit athletes to have ratios that veer wildly from the ideal, which presents a particular challenge in identifying which individuals have potential for competitive Weightlifting.
What does the purple expansion along the horizontal axis show?
It shows that the snatch weight contributes more to determining the target feature ‘talent’ when the deadlift is higher.
What does the purple expansion along the vertical axis show?
It shows that the deadlift contributes more to determining ‘talent’ when the snatch is higher.
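For reference, a plot like the one above can be drawn with the pdpbox library; this sketch assumes pdpbox’s older pdp-module API (0.2.x) and the fitted model `xgb` from above:

```python
# Sketch: PDP interact plot for 'deadlift' vs. 'snatch' (pdpbox 0.2.x API assumed).
from pdpbox import pdp

interact = pdp.pdp_interact(
    model=xgb,
    dataset=X_train,
    model_features=X_train.columns.tolist(),
    features=["deadlift", "snatch"],
)
pdp.pdp_interact_plot(interact, feature_names=["deadlift", "snatch"], plot_type="contour")
```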
Brief Glimpse of Metrics
PDP Interact Plot for features ‘deadlift’ and ‘age’ from XGBoost Classifier
The fact that ‘age’ was the most important feature is not surprising at all. You’re unlikely to see weightlifters in their 40s at the Olympics, and 20-somethings have a much higher probability of reaching that level. However, weightlifting does offer age brackets at every level, and masters events are surprisingly competitive through at least age 50.
The above partial dependence plot shows the effect that the features ‘age’ and ‘snatch’ have on the XGBoost Classifier’s predicted outcome. Unsurprisingly, the lowest ‘talent’ potential occurs for the oldest individuals in the dataset who lift the least weight in the snatch (lower right corner). The youngest individuals with the heaviest snatch have the highest potential (upper left corner).
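The importance ranking itself can be confirmed directly from the fitted model. A sketch, again assuming `xgb` and a DataFrame `X_train`:

```python
# Rank features by the XGBoost model's built-in importance scores.
import pandas as pd

importances = pd.Series(xgb.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head())  # 'age' should rank first
```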
Use Cases for My Machine Learning Models
- Useful for corporate sponsors seeking to identify athletes for sponsorship opportunities. (For a corporate sponsor, I would set a higher classification threshold to increase precision; doing so provides greater assurance that athletes identified as having potential actually do have potential. See the threshold sketch after this list.)
- Useful for coaches to assist with recruitment efforts.
- Useful for college athletic departments to identify or confirm potential Weightlifting scholarship recipients. (Weightlifting scholarships represent a very small portion of the budget for collegiate sports scholarships, but they do exist.)
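Here is a minimal sketch of the threshold adjustment mentioned in the sponsorship use case, assuming the fitted `xgb` model and hold-out split from above; the threshold values are illustrative:

```python
# Raise the decision threshold above the default 0.5 to favor precision.
from sklearn.metrics import precision_score

proba = xgb.predict_proba(X_test)[:, 1]  # probability of the positive class
for threshold in (0.5, 0.7, 0.9):
    preds = (proba >= threshold).astype(int)
    print(f"threshold={threshold}: precision={precision_score(y_test, preds):.3f}")
```

The trade-off is recall: a higher threshold flags fewer athletes, but the ones it does flag are more likely to be genuine prospects.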
Test Accuracy for Initial Models
Test Accuracy for Models after Hyperparameter Tuning
For all three models, both before and after hyperparameter tuning, the ranking by performance was:
- XGBoost Classifier
- Logistic Regression
- Random Forest Classifier
All three models performed better than the original baseline. Hyperparameter tuning provided only a slight improvement for the XGBoost Classifier and Random Forest Classifier models. The Logistic Regression model saw the most improvement from tuning, even though it still finished behind the XGBoost Classifier.
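For context, here is a sketch of the kind of randomized hyperparameter search that can produce these improvements; the search space below is illustrative, not the one actually used in this project:

```python
# Randomized hyperparameter search for the XGBoost model (illustrative space).
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

search = RandomizedSearchCV(
    XGBClassifier(random_state=42),
    param_distributions={
        "n_estimators": randint(100, 500),
        "max_depth": randint(3, 10),
        "learning_rate": uniform(0.01, 0.3),  # uniform over [0.01, 0.31)
    },
    n_iter=20,
    scoring="accuracy",
    cv=5,
    random_state=42,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```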