- Introduction
- Business Problem Statement
- Isolating Business Metrics
- Algorithm Comparison
- Code Review
- MLOps and DevOps
- Summary
- References
Working professionally in Data Science for a few years now, I have discovered some best practices from a variety of different experiences. I will be highlighting my top five Data Science practices in hopes of helping you in your future endeavors. While there are countless ways to improve your Data Science process, these are key methods to improve not only your everyday work as a Data Scientist, but your work as an employee as well. That being said, some of these practices can be applied to more than just Data Science, including, but not limited to, Data Analytics, Machine Learning, Software Engineering, Data Engineering, and DevOps. Keep reading if you would like to learn more about the top five best Data Science practices.
I like to bring this best practice up in many of my articles because it is incredibly important. As Data Scientists, we can become so engulfed in the technical side of Machine Learning algorithms, like tuning hyperparameters and lowering our error rate. However, we have to keep in mind that the goal of a Data Science project is not to produce the most complicated answer with the highest accuracy. It is actually closer to the opposite: we want to find the simplest solution to a business problem. Data Science can already be quite complicated, and so can the reason we are using it. That leads me to the business problem statement. This is the whole reason you are utilizing a Machine Learning algorithm in the first place. No one will care if you have 98% accuracy for a model that does not answer or solve the specific problem at hand. That is why it is essential to isolate the real meaning of the problem and check in with your stakeholders (the coworkers who manage and often assign the Data Science project, or who raise the business question).
Below, I will be giving a poor and great example of a business problem statement along with a Data Science solution:
Poor:
“Solve our detection problem with some Data Science magic, and see how accurate we can get to save money.”
Great:
“Detecting objects from images is time-consuming and inaccurate when performed by a person.”
Analysis:
Poor — while the poor example might lead you to the correct way of looking at the problem, it is simply too confusing. Much of the time, stakeholders or others in a company think Data Science can solve everything, so it is up to you to tell them whether their issue can be solved by a Machine Learning algorithm or not. Of course, the word "magic" is vague and over the top; surprisingly, though, some people do phrase real-world business problems like that. Next, a solution is already proposed by saying Data Science will solve it, which might not actually be necessary. The statement then presumes that the Data Science magic will be accurate no matter what and will save money. While this is hopefully true, it cannot be asserted in the problem statement. The main goal of the problem statement is simply to isolate a problem. The next step would then be to develop a solution along with expected results and/or a return on investment (ROI).
Great — this problem statement is much better because it first lays out the current process and what is wrong with it. It does not suggest Data Science up front or promise something that may not happen. You can clearly see what is being articulated even if you are not a Data Scientist, which matters because companies generally prefer cross-functional work. Because the statement is clearly defined, a Data Scientist could now isolate the following:
- current process: detecting objects from images by a person
- problem one: time-consuming
- problem two: inaccurate
Now, based on what has been dissected, you could study the current process to assess whether Data Science can be applied. Then, you would have two problems to find solutions to. This granularity allows everyone to be on the same page. If a Data Science model has been chosen as the solution, then the goal of the algorithm is to make the current process both more efficient and more accurate. Now, there are base metrics the model can improve on.
Example:
Say, before, it took 2 hours to classify 500 images with 80% accuracy from a person. Now, a Data Science model can be used to classify those same 500 images in 10 minutes with 98% accuracy. The Data Scientist now knows what to work on and what to strive for.
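The before-and-after numbers above translate directly into improvement metrics. A minimal sketch of that arithmetic, using the hypothetical figures from the example:

```python
# Hypothetical before/after numbers from the example above
manual_minutes = 120       # 2 hours for a person to classify 500 images
model_minutes = 10         # the model classifies the same 500 images
manual_accuracy = 0.80
model_accuracy = 0.98

# The base metrics the model improves on
speedup = manual_minutes / model_minutes
accuracy_gain = model_accuracy - manual_accuracy

print(f"Speedup: {speedup:.0f}x")             # prints "Speedup: 12x"
print(f"Accuracy gain: {accuracy_gain:.0%}")  # prints "Accuracy gain: 18%"
```

Numbers like these give the Data Scientist a concrete target to strive for, and they double as the ROI figures stakeholders will ask about.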
This example leads me to my next best practice.
Once we are confident in choosing a Data Science model as a solution to the aforementioned problem, we will need to specifically isolate the business metrics that need to be improved upon. Examples of metrics in this situation include time spent per employee and accuracy of task. These metrics can be compared between the manual process and the automated Data Science process. Additionally, these metrics can be monitored and visualized on a dashboard for easy-to-understand results.
The example of isolating business metrics would be the following:
Current Process:
- time spent per employee per day
- time spent per employee per week
- accuracy of employee per day
- accuracy of employee per week
Data Science Solution:
- time spent per model per day
- time spent per model per week
- accuracy of the model per day
- accuracy of the model per week
The metrics will highlight not only the improvements made by the model (hopefully), but also how important and beneficial your specific Data Science solution is to the company. This point goes along with proving your worth: can a Data Scientist save the company time and money while also making the process more accurate?
I have mentioned some simple yet specific metrics; however, there are countless more that can be monitored and analyzed at any company, ultimately depending on the situation, process, and product. Some general metrics to look at include the following:
— clicks per user
— clicks per user for a specific age group
— clicks per user for a specific location group
— daily churn
— weekly churn
— week over week improvements of a metric
— etc.
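As a small illustration of monitoring one of these general metrics, here is a sketch of computing week-over-week improvement from a series of weekly values. The numbers and the `weekly_accuracy` name are hypothetical, standing in for whatever metric your dashboard tracks:

```python
import pandas as pd

# Hypothetical weekly values for a tracked metric (e.g., model accuracy per week)
weekly = pd.Series(
    [0.90, 0.92, 0.95, 0.94],
    index=["week 1", "week 2", "week 3", "week 4"],
    name="weekly_accuracy",
)

# Week-over-week change of the metric (first week has no prior value)
wow_change = weekly.pct_change()
print(wow_change.round(4))
```

A dip in the week-over-week series, like the one in the last week here, is exactly the kind of signal a dashboard should surface early.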
We have just discussed the importance of the business side of Data Science, so now let's discuss the importance of algorithm comparison. Comparing several different Machine Learning algorithms is a top best practice, as it gives you a general sense of which algorithm is best for your data and solution. You may want to look at algorithms like Decision Trees, Random Forest, or XGBoost, for example, and compare how they all stack up against one another on the same train and test data. While this practice is slightly more obvious, it is still important to bring up. It took me a while to realize that I should always test at least a few different algorithms to find the best one, because you can get caught up in a specific algorithm and use it for every solution. However, an algorithm that usually performs better can sometimes perform worse on a particular dataset, and vice versa.
An example of algorithm comparison is the following:
Use Case 1: classifying images
Decision Tree Accuracy — 94%
Random Forest Accuracy — 96%
XGBoost Accuracy — 98%
Use Case 2: classifying zoo animals
Decision Tree Accuracy — 95%
Random Forest Accuracy — 98%
XGBoost Accuracy — 97%
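A comparison like the table above can be run by hand with a simple loop over candidate models on the same train/test split. This is a minimal sketch using scikit-learn on synthetic data; `GradientBoostingClassifier` stands in for XGBoost so the example stays self-contained (swap in `xgboost.XGBClassifier` if you have it installed):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for your real dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Candidate algorithms evaluated on the exact same split
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: {scores[name]:.2%}")
```

Because every model sees the same split, the resulting accuracies are directly comparable, which is exactly what the use-case tables above assume.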
In this example, you can see how XGBoost has done better in the past (aka, Use Case 1), but may not necessarily be the best for every single situation in the future. Make sure to look at the size of your data, the number of numeric versus categorical features you have, and so on. There is a library that is incredibly powerful and useful that compares pretty much all the Machine Learning algorithms you have ever heard of and then some.
This library is called PyCaret [5] and I highly recommend using it:
Here is a small example of a Python code snippet showing how you can implement PyCaret. In just one line of code, you can see how several Machine Learning algorithms stack up against one another. You can also isolate certain algorithms when comparing, if you know which ones you want to look at or if you want to save some more time:
# import the library
from pycaret.classification import *

# set up the experiment (the target column name is passed as a string)
model = setup(data = your_data, target = 'image_name')

# compare models (sorted by Accuracy by default)
compare_models()

# sort the comparison by AUC instead
compare_models(sort = 'AUC')

# compare only specific models
compare_models(include = ['rf', 'xgboost'])
This next best practice may seem more obvious or general as well, but I believe Data Scientists neglect this point too often. While Software Engineers certainly abide by code reviews most of the time, Data Scientists are not always reviewing code as they should. Oftentimes, Data Scientists work alone even if there are other Data Scientists on their team (each working on their own specific project).
It is extremely beneficial to have not only another Data Scientist go over your code, but also a Machine Learning Operations Engineer or Software Engineer. In general, it is best to bring in someone other than yourself before pushing your final codebase to a master branch (on GitHub, for example). Take this best practice as a reminder to have your code reviewed. It can be as simple as changing one line of code that ends up saving you and your company hours in the long run, something a skilled Software Engineer might catch more quickly. There are especially some Pandas and NumPy operations that can be enhanced, saving a lot of time and, eventually, money. In general, looking at the same code for days and even weeks can make you too comfortable with it, causing you to miss easy mistakes. Try getting in the routine of code review to ensure your code, project, and business are at the best point they can be.
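As one concrete example of the kind of Pandas enhancement a reviewer might catch: replacing a Python-level `apply` over every row with a single vectorized operation. This is a minimal sketch with made-up data; the 8% markup is purely illustrative:

```python
import numpy as np
import pandas as pd

# Hypothetical column of 100,000 prices
rng = np.random.default_rng(0)
df = pd.DataFrame({"price": rng.random(100_000) * 100})

# Easy to write, but slow: a Python-level function call per row
slow = df["price"].apply(lambda p: p * 1.08)

# What a reviewer might suggest: the same result, vectorized in NumPy
fast = df["price"] * 1.08

# Both produce identical results; only the speed differs
assert slow.equals(fast)
```

One reviewed line like this can turn a minutes-long job into a sub-second one on large datasets, which is exactly the kind of long-run savings described above.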
This last practice is somewhat related to code review, but with a focus on becoming familiar with what happens after your Data Science model is deployed into production. Oftentimes, at a bigger company, you can hand off your model to another person who deploys it, but who could possibly make a mistake with, for example, training timing. There are other consequences that could occur if you are unaware of what your model is doing when it is transferred from local testing to production as well.
Sometimes a Data Scientist is responsible for the whole process, so this practice might not pertain to you. But if you are fortunate enough to work with someone like a Machine Learning Operations (MLOps) Engineer or DevOps Engineer, you can get used to handing off the model without actually seeing what happens in the process afterward. The results may differ from what you were expecting, specifically from testing, so you will want to go over the whole process that has occurred since your handoff (if you have that at your company, of course).
An example of a mistake as a result from not performing this best practice could be:
- train and testing model results in 94% accuracy
- handoff to MLOps or DevOps Engineer (or Software Engineer, etc.)
- check after it has been in production, and the results are at 88%
- you realize that instead of the training data being 100,000 rows of data, it is 20,000 rows of data
- perhaps the other Engineer placed a limit on training time that capped the number of training rows, or a previous restriction was already in place
This example may not happen to you, but just like validating your data in your model testing, you will want to validate the entire Data Science process, especially once your model is in production, because even the smallest changes can lead to the most significant differences.
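The scenario above can be caught with a couple of sanity checks run after handoff. This is a minimal sketch; `validate_training_run`, the row count, and the accuracy threshold are hypothetical names and numbers standing in for whatever your pipeline actually exposes:

```python
# Hypothetical expectations taken from your local testing
EXPECTED_ROWS = 100_000   # rows the model was validated on
MIN_ACCURACY = 0.94       # accuracy seen in train/test

def validate_training_run(train_rows: int, production_accuracy: float) -> None:
    """Raise if the production run drifted from what was tested locally."""
    if train_rows < EXPECTED_ROWS:
        raise ValueError(
            f"Training data has {train_rows} rows, expected {EXPECTED_ROWS}; "
            "check for time or row restrictions added during handoff."
        )
    if production_accuracy < MIN_ACCURACY:
        raise ValueError(
            f"Production accuracy {production_accuracy:.0%} is below the "
            f"{MIN_ACCURACY:.0%} seen in testing."
        )

# The scenario from the example above: 20,000 rows and 88% accuracy
try:
    validate_training_run(train_rows=20_000, production_accuracy=0.88)
except ValueError as err:
    print(f"Handoff check failed: {err}")
```

Checks like these take minutes to write and would have flagged the 100,000-to-20,000-row drop before anyone had to debug a mysterious 6% accuracy loss.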
There are countless ways of improving your Data Science process. I have discussed five best practices that will ultimately improve it. Of course, there are several more, but these address the issues I have found to be particularly prominent, along with the solutions that serve as the best practices.
Here are the summarized best practices for Data Scientists:
- Developing a Concise Business Problem Statement
- Isolating Key Business Metrics
- Algorithm Comparison
- Code Review
- MLOps and DevOps Incorporation and Validation
I hope you enjoyed my article and found it interesting. Please feel free to comment down below on the best practices you follow as a Data Scientist, or perhaps in another similar role. Do you agree with what I have discussed, or do you disagree, and why? Thank you for reading, I appreciate it!
[1] Photo by William Iven on Unsplash, (2015)
[2] Photo by Daria Nepriakhina on Unsplash, (2017)
[3] Photo by Austin Distel on Unsplash, (2019)
[4] Photo by engin akyurt on Unsplash, (2020)
[5] Moez Ali, PyCaret Homepage, (2021)
[6] Photo by heylagostechie on Unsplash, (2018)
[7] Photo by Peter Gombos on Unsplash, (2019)