What happens when 3 non-techies come together to build something
If you’re interested in seeing the code for this model, the notebooks can be found on my GitHub: https://github.com/CharlieX1701
After I graduated university in 2018, I was fortunate enough to land a solid job right out of school. I received my Bachelor’s in Genetics, a program that required me to conduct independent research in a research laboratory. It was this experience that helped me land my job as a Research Specialist (fancy name for a technician/lab manager) in a prestigious quantitative genetics lab at Princeton University. I quickly learned that this lab wasn’t your run-of-the-mill molecular bio lab; our discoveries relied heavily on advanced methods such as quantitative cellular imaging and single-cell sequencing that employed machine learning for analysis. This was a major shift from my previous research experience, which largely involved qualitative data… needless to say, it was a tough transition. But I started falling in love with the power of quantitative methods, and I quickly realized I needed to strengthen my quantitative abilities. That realization led me to where I am now: pursuing a Master of Business and Science in Analytics — Discovery Informatics & Data Sciences. I know, it’s a mouthful.
One of the core classes for this program is Intro to Fundamentals of Analytics. In a nutshell, it’s a class designed to introduce students to common machine learning algorithms using the Scikit-Learn library, as well as some of the theory behind them. The end goal of the course is for students to create their own machine learning model on a topic of their choice. Every aspect of the project, from data collection to hyperparameter tuning, needed to be done from scratch, i.e. no code copying. While this task was certainly daunting at first, I knew I would come out of it a stronger coder and a stronger data scientist, so I went into it head-on and positive. I teamed up with a couple of classmates at a similar skill level to mine, and we started chipping away. It turned out to be a very challenging but rewarding experience, and the following lessons are some of the things I learned throughout the process.
Cool Ideas Aren’t Always the Best Ideas
To begin the project, we first had to come up with our topic of choice. Finding an idea that three unacquainted people are equally interested in and willing to execute was difficult in and of itself. Some of my ideas included analyzing NASA exoplanet data, predicting travel delays, predicting the spread of a wildfire, and predicting a song’s popularity based on its sheet music and musical attributes. Ultimately, we decided on predicting real estate property values and analyzing the relationship between the housing and financial markets, in an attempt to give buyers/sellers a better idea of when to buy/sell based on financial market conditions. It’s a rather cookie-cutter idea if I do say so myself, but there was a plethora of relevant research available online for us to study, and this was needed as none of us had ever implemented an ML model before.
Shortly into the project, it became apparent that some of my earlier ideas were too grandiose. Although they sounded cool, I didn’t fully appreciate the technical competency required to execute them. Computer vision is awesome, but regression and classification provide a more appropriate yet still effective learning opportunity. And tying in the analysis between the different markets added our own flair to the project, satisfying my desire to do things differently. The Boulder County, CO Open Data Portal provided us with our sales data, Zillow gave us historical housing market data, and The Wall Street Journal provided easy access to historical stock data. With the data collected, the hard part was set to begin.
Meticulous Data Cleaning Really is Necessary
At the very beginning of the project discussions, our professors really stressed the role that data cleansing plays in building machine learning models. Garbage in, garbage out is what they would always say. And they gave us a fair warning: at least 80% of your time will be spent cleaning data. My naïveté made me think this was ridiculous; machine learning is about cool algorithms, not data munging! But what do you know, I was wrong.
We started with the standard cleaning techniques: removing nulls, removing outliers, condensing repetitive attribute values, removing erroneous values, etc. We thought this would be sufficient: let’s get to the cool algos, baby! But yet again, we were wrong.
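For anyone curious what those standard cleaning steps look like in practice, here’s a minimal pandas sketch. The column names and values are hypothetical, just meant to resemble the kind of messy sales records we were working with; they aren’t our actual dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sales records with the kinds of problems we ran into:
# missing values, an obviously erroneous $5 sale, and inconsistent labels.
df = pd.DataFrame({
    "sale_price": [350_000, 420_000, np.nan, 5, 9_800_000, 380_000],
    "bedrooms": [3, 4, 3, 2, 3, None],
    "condition": ["Good", "good", "GOOD", "Fair", "Good", "Fair"],
})

# 1. Remove rows with missing values
df = df.dropna()

# 2. Remove obviously erroneous values (e.g. token $5 "sales")
df = df[df["sale_price"] > 10_000]

# 3. Remove outliers more than 3 standard deviations from the mean price
z = (df["sale_price"] - df["sale_price"].mean()) / df["sale_price"].std()
df = df[z.abs() < 3]

# 4. Condense repetitive attribute values ("Good"/"good"/"GOOD" are one label)
df["condition"] = df["condition"].str.title()
```

Each step shrinks or standardizes the frame a little; the order matters too, since outlier statistics computed before dropping erroneous values would be badly distorted by them.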
In the beginning stages of testing out our models, they were total crap. Thankfully, one of my group members had previously worked as a mortgage underwriter, so she had solid domain knowledge of the real estate realm. Once we took a deeper, more knowledgeable look at the data, we realized it was still filled with garbage values. After countless more hours of cleaning and testing, our models started working beautifully. This process was a testament to the value of proper data cleansing, but it also opened my eyes to something else.
The Importance of Normalizing Your Data
This is where the research literature review really proved useful. When reading through published research and online resources, I kept noticing discussions on the importance of scaling your data. I had no idea what data scaling was, so I started looking into it and discovered normalization.
When working with data containing a wide range of values, normalizing the data is essential so that features with large numeric ranges (like sale prices) don’t dominate features with small ones (like bedroom counts) in your regression models. There are many forms of normalization, but for our purposes it involved Scikit-Learn’s MinMaxScaler. This scaled every feature to values between 0 and 1, helping ensure that all of our features would carry roughly equal weight when entered into the regression models.
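Here’s a small sketch of what that scaling step looks like. The feature matrix below is made up for illustration (square footage vs. bedroom count, two features on wildly different scales), not our actual data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical features: square footage and bedroom count live on
# very different numeric scales before normalization.
X = np.array([
    [1_200, 2],
    [2_500, 3],
    [4_000, 5],
], dtype=float)

# MinMaxScaler rescales each column independently to the [0, 1] range:
# x_scaled = (x - col_min) / (col_max - col_min)
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.min(axis=0))  # each column's minimum is now 0
print(X_scaled.max(axis=0))  # each column's maximum is now 1
```

One thing worth remembering: fit the scaler on the training set only, then reuse it to transform the test set, so no information leaks from test data into training.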
With the scaling complete, we next looked at the distributions of our feature values. We quickly realized that most of our features were positively skewed, and this would need to be corrected before implementing methods such as Lasso and SVR, which tend to perform better when features are approximately normally distributed. To fix the skew, we applied a Yeo-Johnson power transform, which reshaped our features toward a Gaussian-like distribution with zero mean and unit variance. All of this normalization gave us the confidence that our data was properly scaled and distributed for clean models.
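The transform step can be sketched with Scikit-Learn’s PowerTransformer. The lognormal sample below is synthetic, standing in for a positively skewed feature like sale price; it’s not our project data.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Synthetic positively skewed feature (e.g. something price-like).
rng = np.random.default_rng(0)
X = rng.lognormal(mean=12.0, sigma=0.5, size=(500, 1))

# method="yeo-johnson" handles zero/negative values too (unlike Box-Cox);
# standardize=True (the default) also rescales the output to
# zero mean and unit variance.
pt = PowerTransformer(method="yeo-johnson")
X_t = pt.fit_transform(X)

print(X_t.mean())  # approximately 0
print(X_t.std())   # approximately 1
```

After fitting, the transformed feature is far less skewed than the raw lognormal input, which is exactly the property the downstream models benefit from.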