- Introduction
- Missing Data
- Stakeholder Problem Statement
- Communication Effectiveness
- Engineering Production
- Summary
- References
Data Science roles come with a variety of challenges. I will discuss the ones I have experienced most often, along with their importance, implications, and possible solutions. Some of these obstacles also apply to other roles, such as Data Engineering, Machine Learning Engineering, Software Development, and Data Analytics. It is easy to get used to a certain process in Data Science and then hit a roadblock that throws you off your game. That is why I will be discussing my top four Data Science roadblocks, ranging from data quality and business acumen to production-ready code, so that you can learn about these issues and be well-prepared before they arise.
One of the most common roadblocks a Data Scientist can experience is dealing with missing data. When you are learning Data Science, whether at a university or in an online course, the focus is often on the wide array of Machine Learning algorithms, so you may skip handling missing data in order to study algorithms at a high level first. If you are just starting out, one way to practice is to create a mock dataset with varying amounts of missing data in different columns, in either your independent variables or your dependent variable. You can then test individual solutions, as well as combinations of them, to see which has the best impact on your model.
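As a rough sketch of the mock-dataset idea above, the snippet below builds a small synthetic table and blanks out a different share of each column. The column names, missing fractions, and the `add_missing` helper are all made up for illustration, and it assumes NumPy and pandas are installed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
n_rows = 200

# Hypothetical columns, invented purely for illustration.
df = pd.DataFrame({
    "price": rng.normal(loc=50.0, scale=10.0, size=n_rows),
    "weight": rng.uniform(0.5, 5.0, size=n_rows),
    "category": rng.choice(["a", "b", "c"], size=n_rows),
})
# Dependent variable built from the features plus some noise.
df["target"] = 2.0 * df["price"] + 3.0 * df["weight"] + rng.normal(0.0, 1.0, size=n_rows)


def add_missing(frame, column, fraction, rng):
    """Return a copy of `frame` with roughly `fraction` of `column` set to NaN."""
    out = frame.copy()
    mask = rng.random(len(out)) < fraction
    out.loc[mask, column] = np.nan
    return out


# Assumed missing-data rates per column; vary these to create harder or easier cases.
missing_rates = {"price": 0.10, "weight": 0.30, "category": 0.05, "target": 0.02}

mock = df
for column, fraction in missing_rates.items():
    mock = add_missing(mock, column, fraction, rng)

# Share of missing values per column in the mock dataset.
print(mock.isna().mean())
```

Because `df` keeps the complete values, you can compare any imputed version of `mock` against the truth.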
There are a few ways to fix the issue of missing data, from simply dropping the NA (null, or missing) values in your dataset to imputing them in some way. The right choice ultimately depends on your dataset, how much is missing, and what problem you are trying to solve. If too much data is missing, then instead of imputing values, you may be better off finding a better data source or gathering more features that have only a small fraction of missing data.
Here are some of the ways to handle missing data:
- filling NA continuous values with the min value of that feature
- filling NA continuous values with the max value of that feature
- filling NA continuous values with the mean value of that feature
- dropping rows with missing data
- filling NA categorical variables with a blank space
- predicting the missing value
- appending new, yet important features with less missing data
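Several of the simpler options in this list map onto one-line pandas calls. The sketch below uses a tiny made-up table (the `price` and `category` columns are hypothetical) and assumes pandas and NumPy are available. Predicting the missing value and adding new features are left out because they depend on your data and model.

```python
import numpy as np
import pandas as pd

# Tiny illustrative dataset: one continuous and one categorical column.
df = pd.DataFrame({
    "price": [10.0, np.nan, 30.0, 20.0],
    "category": ["a", "b", None, "a"],
})

# Fill NA continuous values with the min, max, or mean of the feature.
filled_min = df["price"].fillna(df["price"].min())
filled_max = df["price"].fillna(df["price"].max())
filled_mean = df["price"].fillna(df["price"].mean())

# Drop every row that has at least one missing value.
dropped = df.dropna()

# Fill NA categorical values with a blank string.
filled_blank = df["category"].fillna("")

print(filled_mean.tolist())
print(len(dropped))
```

None of these calls modifies `df` in place, so you can try the strategies side by side on the same data.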
As you can see, there are many ways to handle missing values, so it is best to practice these skills before starting a Data Science job. If you are already a Data Scientist, you can still learn and apply them now by studying the various imputation methods in further detail. To really know which method to use, you will have to apply these solutions in a trial-and-error approach, while also understanding the impact of your changes on your overall model and predictions.
As a Data Scientist, you will have to work with other stakeholders, some technical and some not. It can be complicated to reach a simple problem statement that your Data Science model will ultimately solve. If you are the one coming up with the problem statement, you will have to be clear. If the stakeholder hands you a solution to the problem instead, you will have to make sure you both agree on what the root problem is.
An example of a poor problem statement is:
- use Machine Learning to solve why we cannot do this process manually, perhaps regression
An example of a good problem statement is:
- manually classifying products is time-consuming and inaccurate
So what makes the first statement so bad? It tries to prescribe a solution before stating the problem. Sometimes a Data Scientist can write a simpler, hand-made algorithm that speeds up a manual process, but suggesting regression up front can make them focus on that approach first. Thinking at a higher level first is the better way to go, so that you do not limit yourself to a specific solution right away, especially when it could be the incorrect one.
The second example is good because it is simple and straightforward. It covers the current action, which is classifying products, and it says why that is currently an issue: it takes too long and is inaccurate. Now, as a Data Scientist, you can quickly think of one or more classification algorithms, for example, as a direct way to solve both problems.
A roadblock similar to the one above is the general effectiveness of your communication with other stakeholders and coworkers. It is likely that you will be one of only a few Data Scientists in your company, or perhaps the only one. Therefore, you will have to be able to explain nearly everything you do in non-technical terms, or at least in a way that others can understand without relying on statistics, Machine Learning, and other specifics that usually only a Data Scientist would know.
Times when you might encounter communication roadblocks:
- communicating the problem statement with stakeholders
- breaking down your code to Software Engineers, Data Engineers, and/or Machine Learning Engineers (more on this topic later)
- allowing others to interpret your results
Above are some of the situations where communication roadblocks can arise. You will need to master communicating not only with other Data Scientists in the company, but also with Product Managers and Engineers, if you want to be a truly successful, cross-functional Data Scientist.
This issue relates specifically to the relationship between a Data Scientist and the Software Engineer, Data Engineer, or Machine Learning Engineer they work with. Most Data Scientists are not responsible for the entire end-to-end solution, even if the model is the main part (although some, usually at smaller companies, are). As a result, you will have to test your model not only locally, but also in production. The engineers you work with will help make your code more modular and scalable, and help account for possible errors in production, which is often referred to as quality assurance or testing. Sometimes you get great results locally and then find that they are not as good in production, so it is up to you and your close coworkers to work together throughout the process so that there are no surprises come production day.
Here are some issues that can arise when converting your local and development code to production-ready code:
- your code is not in an object-oriented programming (OOP) format
- you did not account for possible errors that could happen eventually
- the version of your local libraries cannot be executed in production, or they will need to be updated periodically in the production environment
- the deployment process can only take a certain amount of training data at a time vs all of it locally
- the production platform you are using may cause some new issues (Docker, etc.)
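To make a few of these points concrete, here is a minimal sketch of notebook-style logic moved into a small class that trains in batches and fails loudly on bad input. `RunningMeanModel` and `batched` are hypothetical names, and the "model" is a deliberately trivial stand-in (it predicts the mean of the values it has seen), not a real production design.

```python
import math
from typing import Iterable, Iterator, List, Optional


def batched(rows: List[Optional[float]], batch_size: int) -> Iterator[List[Optional[float]]]:
    """Yield consecutive slices of `rows`, each with at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


class RunningMeanModel:
    """Toy stand-in for a real model: predicts the mean of the values seen so far."""

    def __init__(self) -> None:
        self._total = 0.0
        self._count = 0

    def partial_fit(self, batch: Iterable[Optional[float]]) -> None:
        """Update the model with one batch, skipping missing values."""
        for value in batch:
            if value is None or math.isnan(value):
                continue  # missing value: ignore rather than crash
            self._total += float(value)
            self._count += 1

    def predict(self) -> float:
        """Return the current prediction, or raise if nothing valid was seen."""
        if self._count == 0:
            raise RuntimeError("model has not seen any valid training data")
        return self._total / self._count


# Training data arrives two rows at a time instead of all at once.
data = [1.0, 2.0, None, 4.0, float("nan"), 5.0]
model = RunningMeanModel()
for batch in batched(data, batch_size=2):
    model.partial_fit(batch)

print(model.predict())  # 3.0
```

Keeping state inside a class and raising explicit errors gives the engineers you work with something they can unit test and monitor.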
As you can see, there are several points in the Data Science process where moving your local code to production-ready code, in a production-ready environment, can go wrong.
I have discussed the top roadblocks that I have experienced in my Data Science career. I bet others have had similar experiences, so hopefully you can learn about these roadblocks before they happen and be well-prepared for these common issues. I have outlined obstacles like missing data, stakeholder problem statements, communication effectiveness, and engineering in production. Some of these can also apply to you if you are in another role, like Data Analytics or Software Engineering.
Once again, here are all of the roadblocks summarized:
- Missing Data
- Stakeholder Problem Statement
- Communication Effectiveness
- Engineering Production
I hope you found my article both interesting and useful. Please feel free to comment down below if you have experienced any of these roadblocks yourself. Do you think knowing about these now will help you in your Data Science career? Do you agree or disagree with my roadblocks, and why?
Please feel free to check out my profile and other articles, as well as reach out to me on LinkedIn.
I also wrote a similar article on mastering Python as a prerequisite for Data Science [6].
[1] Photo by Raúl Nájera on Unsplash, (2017)
[2] Photo by Tyler Callahan on Unsplash, (2018)
[3] Photo by Campaign Creators on Unsplash, (2018)
[4] Photo by Dylan Gillis on Unsplash, (2018)
[5] Photo by Christopher Gower on Unsplash, (2017)
[6] M. Przybyla, You Should Master Python First Before Becoming a Data Scientist, (2021)