The Importance of Model Selection in Machine Learning

From avocado toast to interpreting millennial homeownership through classification

Three years ago, I came across a headline that stopped me in my tracks.

It read: “Millionaire to Millennials: Stop Buying Avocado Toast if you Want to Buy a Home.”

At first, I felt ashamed, and I wasn’t alone. My friends and I took turns in the group chat confessing the numerous times we had spent money on avocado toast. Yes, we had perfectly good avocados at home, but was brunch really getting in the way of owning a home? We took a step back. We did the math. We researched how different millennial circumstances were from the generations before us (the economy, the cost of education, the rising real estate costs, etc.) and moved on with our lives.

Kind of. In the back of my mind, I always thought: one day when I have the skills, I’m going to get to the bottom of this atrocious claim, with cold hard data.

The craziness of 2020 has reshaped my view of what is shocking and unprecedented in the news, but I thought back to this headline while brainstorming my second solo project at Metis Data Science Bootcamp. I enrolled in Metis after a few years of working in analytics.

What’s always fulfilled me — my career mission as a whole — is using data to unlock potential within people, products, and processes alike. Any day in which I’m able to use data to empower someone/something to be the best that they can be is a good day for me.

When the pandemic hit in March, the world came to an abrupt standstill, and I fully realized (along with many others, I’m sure) just how short life is. I knew that there was no time to waste to be the best that I could be — to gain a deeper mastery of data science and equip myself with more technical skills to supercharge my mission.

When the supervised machine learning unit rolled around in the bootcamp, I knew that it would be the perfect time for me to build a binary classification model to delineate millennial homeowners from their renter peers and, in the process, understand what it takes to complete this rite of passage for my generation.

Selecting a model for your machine learning project is almost as important as selecting a home. (Photo by: Jessica Bryant | Pexels)

I used the U.S. Bureau of Labor Statistics’ National Longitudinal Survey of Youth 1997 (NLSY97) as my data source. The survey followed a group of early U.S. millennials (born 1980–1984) for 20 years, recording data on 80,000 variables. The majority of my data collection and cleaning process focused on understanding the survey methods and variable definitions. In the end, I extracted 26 variables for my study, taking a snapshot of the subjects at age 30, a key milestone of adulthood.

Model selection is a key step in every data science project, and perhaps the one that demands the most foundational conceptual knowledge.

We’d reviewed a number of supervised machine learning models in class, including Logistic Regression, K-Nearest Neighbors, Naive Bayes, Random Forest, and Gradient Boosting. The first model I eliminated was Naive Bayes: it assumes that all features are independent of each other given the class, which would be irresponsible to assume with demographic data (for instance, factors like gender, race, and age are rarely independent of salary).

That left me with four remaining models.

I’m an advocate for the iterative school of product design, so my goal for every project is to establish a minimum viable product (MVP) as quickly as possible, then loop back and refine it in subsequent iterations.

In this case, my MVP would be a baseline model, which made Random Forest the perfect choice. Tree-based models are very functional right out of the box: I didn’t need to fill in missing values or even decode the longitudinal study variables into something a layperson can understand, so I was able to build off my messy, raw dataset. Random Forest, unlike its debatably more powerful tree-based sibling Gradient Boosting, also has a built-in parameter for balancing class weights in scikit-learn (class_weight). I took full advantage of this parameter, as the millennials who owned permanent homes (houses and apartments) in my data were outnumbered 2:1 by their renter counterparts.
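To make that balancing parameter less of a mystery, here is a minimal sketch in plain Python. It reproduces the formula scikit-learn documents for its "balanced" mode, n_samples / (n_classes * class_count), on a hypothetical 2:1 split. The label counts are made up for illustration and are not taken from the survey data.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical labels: 0 = renter, 1 = homeowner, in a 2:1 ratio.
labels = [0] * 200 + [1] * 100
weights = balanced_class_weights(labels)

# The minority class (homeowners) gets twice the weight of the majority class.
print(weights)
```

In practice there is no need to compute this by hand: passing class_weight="balanced" to RandomForestClassifier applies the same weighting during training.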

For every millennial homeowner, there are more than two millennial renters.

My Random Forest model produced strong scores: a recall of 0.73 and a precision of 0.55, meaning that my model captured 73% of all millennial homeowners in my data, and of the subjects it predicted to be homeowners, 55% actually were. (If you’re newer to classification, you’re probably thinking, “what about accuracy??” Here’s a great article by Will Koehrsen that explains why recall and precision are more effective metrics.) Amazed at Random Forest’s predictive power, I could’ve stopped there, with a week to spare before my deadline, but my goal was interpretability.
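For anyone who wants to see how those two numbers come about, here is a small worked sketch. The confusion-matrix counts below are hypothetical, chosen only so that the arithmetic lands on the reported 0.73 and 0.55. They are not the actual counts from my model.

```python
# Hypothetical counts for the positive class (homeowners).
true_positives = 73   # homeowners the model correctly flagged
false_negatives = 27  # homeowners the model missed
false_positives = 60  # renters the model wrongly flagged as homeowners

# Recall: of all actual homeowners, what share did the model capture?
recall = true_positives / (true_positives + false_negatives)

# Precision: of all predicted homeowners, what share really are homeowners?
precision = true_positives / (true_positives + false_positives)

print(f"recall={recall:.2f}, precision={precision:.2f}")
```

Note that true negatives never enter either formula, which is part of why these metrics are more informative than accuracy when one class heavily outnumbers the other.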

In data science, it can be easy to focus solely on a model’s output metrics as the bar for success, so it’s important to zoom out and think back to the primary question you’re trying to answer. Mine was to interpret the factors that make up a millennial homeowner, so if I couldn’t pinpoint the impact each feature had on my classification output, no matter how predictively powerful my model was, my project would be missing the point. Random Forest, despite its out-of-the-box predictive power, did not lend itself to that kind of interpretation. It was essentially a black box.

I wanted the best of both worlds, predictive power and interpretability, so I set out to build a Logistic Regression model (one of the most interpretable classification models) that could be as powerful as Random Forest.

Logistic Regression is simple and elegant but much more sensitive to noise in the data than tree-based models. What you get is really only as good as what you put in, which meant a much higher bar for input data.

I spent days cleaning my data thoroughly: filling in null values and engineering features. I also spent time on parameter tuning: applying LASSO (L1) regularization with C=0.4 (stronger than scikit-learn’s default of C=1.0, since C is the inverse of regularization strength) and balancing class weights (which worked better than oversampling with RandomOverSampler() from imbalanced-learn). When applied to the holdout test set, my final Logistic Regression model generated a recall of 0.73 and a precision of 0.53, matching the recall of my baseline Random Forest model and coming within 0.02 of its precision.
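A setup along these lines can be sketched in scikit-learn as follows. This is an illustration under assumptions rather than my exact pipeline: the data is synthetic (generated with make_classification to mimic 26 features and a roughly 2:1 class imbalance), and the liblinear solver and the StandardScaler step are choices made for this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the survey data: about two negatives per positive.
X, y = make_classification(
    n_samples=1500, n_features=26, n_informative=8,
    weights=[0.67, 0.33], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# L1 (LASSO) penalty with C=0.4 and balanced class weights.
# liblinear is one of the solvers that supports the L1 penalty.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(
        penalty="l1", solver="liblinear", C=0.4,
        class_weight="balanced", random_state=0,
    ),
)
model.fit(X_train, y_train)

# Evaluate on the holdout set only.
y_pred = model.predict(X_test)
recall = recall_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
print(f"recall={recall:.2f}, precision={precision:.2f}")
```

Scaling the features first matters more for a regularized linear model than for a tree-based one, because the penalty treats all coefficients on the same scale.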

I was able to essentially match the predictive power of my Random Forest baseline with the interpretability of my final Logistic Regression model. I had achieved the best of both worlds!

Furthermore, when looking at my recall vs. precision tradeoff, I didn’t see a need to adjust the classification threshold, because I wanted to emphasize recall in my model. Aside from interpretation, I knew that the business use of my model would most likely be outreach and marketing: anyone who wants to reach the ever-elusive home-owning millennial. Casting a wide net is key for outreach and marketing, so the opportunity cost of my model leaving out a millennial who is a homeowner (false negative) is greater than the cost of reaching a millennial who is not currently a homeowner (false positive).
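The tradeoff itself is easy to see in a toy example. The sketch below uses made-up labels and predicted probabilities (not output from my model) and a hypothetical helper, precision_recall_at, to show how moving the threshold shifts the balance between the two metrics.

```python
def precision_recall_at(threshold, y_true, y_prob):
    """Compute (precision, recall) when predicting positive at p >= threshold."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up data: 4 homeowners (1) and 6 renters (0) with predicted probabilities.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.7, 0.55, 0.35, 0.6, 0.45, 0.4, 0.2, 0.1, 0.05]

results = {t: precision_recall_at(t, y_true, y_prob) for t in (0.3, 0.5, 0.7)}
for threshold, (precision, recall) in results.items():
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

In this toy data, lowering the threshold catches every homeowner at the price of more false positives, while raising it does the opposite. For a wide-net outreach use case, the lower-precision, higher-recall end of that range is the more comfortable place to be.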

Top 5 features impacting millennial homeownership in my final model.

While current financial assets may be one factor in homeownership, they are hardly the key factor. More important were past relocations and marriage, with attainment of a Bachelor’s Degree and race (subjects who identify as Black Americans) rounding out the rest of the Top 5.
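One common way to arrive at a ranking like this from a Logistic Regression model is to sort features by the absolute size of their coefficients and read each coefficient as an odds ratio. The sketch below assumes the features were put on a comparable scale first. The feature names and coefficient values are placeholders invented for illustration, not my model’s actual coefficients.

```python
import math

# Placeholder coefficients (log-odds per unit of each scaled feature).
coefficients = {
    "feature_a": -0.8,
    "feature_b": 0.5,
    "feature_c": 0.0,  # an L1 penalty can shrink a coefficient to exactly zero
    "feature_d": 0.2,
}

# Rank features by the magnitude of their effect, regardless of direction.
ranked = sorted(coefficients.items(), key=lambda item: abs(item[1]), reverse=True)

# exp(coefficient) is the multiplicative change in the odds of the positive class.
odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}

for name, coef in ranked:
    print(f"{name}: coef={coef:+.2f}, odds ratio={odds_ratios[name]:.2f}")
```

An odds ratio below 1 means the feature pushes a subject away from the positive class (homeownership), above 1 pushes toward it, and exactly 1 means the feature has no effect in the model.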

What surprised me the most was that the most impactful feature was relocations between ages 12 and 30. Were these relocations indicative of an inherited nomadic lifestyle which made subjects less likely to purchase permanent homes? Or were they symptoms of systemic poverty leading to involuntary displacement (potentially subjects in the foster care system)? That’s what I want to investigate the most in my next steps.

For the final part of my project, I created a Tableau dashboard to visualize the relocation feature, as well as an in-depth view of each subject in my data, which allows for further exploration.

Interactive visuals for further interpretation and exploration.

I came into this project with one simple question, “What makes a millennial homeowner?”, and wrapped up with even more questions and motivations for next steps. While the longitudinal study I pulled data from conducted its survey in a manner representative of the United States in 1997, 23 years later we know that there needs to be more nuance in representation (i.e., expanding the study’s 4 categories for race to be more inclusive), especially, as we saw above, when race is one of the top factors in homeownership. I would also be interested to see where the subjects are at 35, once more recent data from the study is released.

The possibilities never end once you start asking questions, and one thought begets a dozen more, but that’s the greatest beauty of data science. There’s always the next iteration and the next question to be answered!

(Speaking of questions, for anyone interested in the nitty gritty, check out my GitHub repo. If you’d like to chat more about this project or anything else data related, feel free to reach out on LinkedIn.)
