

If you are starting out as a data scientist, you will hear things around you like “Why didn’t we get the expected performance? This model is a failure!”, “We must find a way to improve this model” or “We must start over from the beginning.”
If you want to stay out of this kind of situation, you must avoid some beginner mistakes. Here, we discuss some that we have already made ourselves, so that you can avoid them.
1. Understand The Client Need
Understanding the underlying need, and not just what you are being asked to deliver, is essential.
The client or your teammates will often come to you with a specific request, such as a scoring model to identify fraud. But that is not their need: they think scoring is the best way to solve the problem, when the real need may be to identify all customers with outlier behavior.
The idea here is to go beyond the client’s request, understand their needs and the business problem, and then identify the “real question to solve”.
2. Define The Expected Outcome
The next step in your data science project is to understand the client’s expectations.
“What does the client expect as an outcome?” is the most important question to ask yourself at this step, and to keep in mind throughout your project.
You will not handle the same problem the same way if the client asks for a scorecard or for an API, for example.
So take your time and explicitly define the outcome the client expects.
3. Plan Your Project Step By Step
- Make an inventory of available and usable data
- Estimate your project duration
- Define the metrics that will tell you whether your project is a success or not (precision, recall, …) and never change them throughout the project
- Make a first version of your model even if it does not perform as well as you would like (so you have something finished if you have to present it)
- Improve the model’s performance
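To make the "define metrics and never change them" step concrete, here is a minimal sketch in plain Python. The target values, the `Metrics` class and the toy labels are assumptions made up for illustration; in a real project the targets come from the discussion with the client.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    precision: float
    recall: float


def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return Metrics(precision, recall)


# Hypothetical targets, fixed once at the start of the project.
TARGETS = Metrics(precision=0.80, recall=0.60)


def meets_targets(metrics, targets=TARGETS):
    """A model is a success only if it reaches every agreed target."""
    return (metrics.precision >= targets.precision
            and metrics.recall >= targets.recall)


# Toy labels and predictions of a first, imperfect model version.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

baseline = precision_recall(y_true, y_pred)
print(baseline, meets_targets(baseline))
```

Because the targets are frozen in one place, every later model version is judged against the same bar as the first baseline.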
4. Exploratory Analysis
We all agree that this part of the project is boring and all of us want to skip it, but it is the most important step in your data science project.
It’s like baking a cake: you need good ingredients for a good cake, but if you don’t know the ingredients well, there is a good chance the cake will go wrong.
The same goes for a data science project: you need to know your data very well for the project to succeed.
Here you need to identify and treat missing values, outliers, correlations, …
For this part, you can use a few tips to go faster:
- pandas-profiling (since renamed ydata-profiling) for univariate analysis. This Python package generates an HTML report that summarizes each variable in your dataset.
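If you prefer to run the basic checks by hand, the three items mentioned above (missing values, outliers, correlations) can be sketched with plain pandas. The data frame below is toy data invented for illustration, and the 1.5 × IQR rule is just one common convention for flagging outliers, not the only one.

```python
import pandas as pd

# Toy data, made up for illustration only.
df = pd.DataFrame({
    "age": [25, 32, 47, None, 51, 38],
    "income": [32000, 41000, 58000, 39000, 900000, 45000],
    "n_purchases": [3, 5, 8, 4, 9, 6],
})

# 1. Missing values: count per column.
missing = df.isna().sum()


# 2. Outliers: flag values outside 1.5 * IQR (Tukey's rule).
def iqr_outliers(series):
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)


income_outliers = df.loc[iqr_outliers(df["income"]), "income"]

# 3. Correlations between numeric columns (missing values are ignored pairwise).
correlations = df.corr()

print(missing, income_outliers, correlations, sep="\n\n")
```

What you do with the flagged rows (drop, cap, impute, keep) depends on the business problem, which is why this step cannot be fully automated.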
5. Feature Engineering
The next step, which data scientists often skip, is feature engineering. This step is very important too, because this is where you create new variables.
Feature engineering helps improve the performance of machine learning algorithms.
Here again, you can save a lot of time by using the featuretools package, which lets you create new features automatically.
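To show what "creating new variables" looks like in practice, here is a hand-written sketch with pandas. It does not use featuretools; it only illustrates the kind of derived and aggregated features you would otherwise write yourself. The transaction table and all column names are invented for the example.

```python
import pandas as pd

# Toy transaction log, made up for illustration only.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 3],
    "amount": [20.0, 35.0, 500.0, 15.0, 18.0, 60.0],
    "timestamp": pd.to_datetime([
        "2023-01-02 09:15", "2023-01-05 13:40", "2023-01-05 23:55",
        "2023-01-03 10:00", "2023-01-10 11:30", "2023-01-07 18:20",
    ]),
})

# Row-level features derived from an existing column.
transactions["hour"] = transactions["timestamp"].dt.hour
transactions["is_night"] = (transactions["hour"] >= 22) | (transactions["hour"] < 6)

# Customer-level features: aggregate transactions up to one row per customer.
customer_features = transactions.groupby("customer_id").agg(
    n_transactions=("amount", "count"),
    mean_amount=("amount", "mean"),
    max_amount=("amount", "max"),
    night_share=("is_night", "mean"),
)

# Ratio feature: how extreme is the largest transaction for this customer?
customer_features["max_to_mean"] = (
    customer_features["max_amount"] / customer_features["mean_amount"]
)

print(customer_features)
```

Each new column encodes a piece of business knowledge (night activity, unusually large purchases) that the raw table does not expose directly to a model.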
6. Never Put a Model Into Production Without Monitoring
When you put your model into production, you need to be sure that the quality of its output stays the same as on the day you finished it.
For that, you need a monitoring system that tracks model performance every day and notifies you if something goes wrong.
For this, you can use tools like Grafana or build your own monitoring/alerting system.
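If you build your own alerting, the core of it can be very small. The sketch below compares today's metric with the average of a reference period and returns an alert message when the drop exceeds a tolerance. The function name, the tolerance of 0.05 and the daily values are assumptions for illustration; how the message is delivered (email, chat, dashboard) is left out.

```python
from statistics import mean


def check_metric(history, today, tolerance=0.05):
    """Return an alert message if today's metric dropped too far below
    the average of the reference period, otherwise None."""
    reference = mean(history)
    drop = reference - today
    if drop > tolerance:
        return (
            f"ALERT: metric fell from {reference:.3f} (reference) "
            f"to {today:.3f} (drop of {drop:.3f})"
        )
    return None


# Hypothetical daily precision values, measured once true labels are known.
reference_precision = [0.82, 0.81, 0.83, 0.80, 0.82]

print(check_metric(reference_precision, today=0.81))  # small drop, no alert
print(check_metric(reference_precision, today=0.70))  # large drop, alert message
```

Run daily by a scheduler, a check like this catches silent degradation before the client does.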
7. Communicate & Explain How To Use It
A lot of people think that data science is magic, and they don’t really understand how it works.
Our job is not just to build a solution; we also have to explain it in plain terms so that it is accessible to everyone. This is all the more important because these are the people who will use your model.
In return, they can give you insights that you were not aware of and that you can integrate into your model. This is very important too.
8. Data Science & Software Engineering
Never think that a data scientist’s job is just to build a model and let other teams put it into production. You are responsible for your model from beginning to end.
Your solution should be deployable in the client’s environment quickly and without additional effort, which is why you should use tools like Docker, Git and others.
In addition, during your career as a data scientist, you will have to create APIs and applications, integrate scoring into existing software, etc. You must therefore learn at least the basics of these tasks.
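As an idea of what "the basics" of a scoring API can look like, here is a minimal sketch that uses only the Python standard library. The `score` rule is a placeholder invented for the example, standing in for a call to your trained model, and the handler and function names are hypothetical. A real service would more likely use a web framework and add input validation, logging and authentication.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def score(features):
    """Placeholder model: replace with a call to your trained model."""
    # Hypothetical rule, for illustration only.
    return 1.0 if features.get("amount", 0) > 1000 else 0.0


def handle_payload(raw_body):
    """Turn a JSON request body into an (HTTP status, JSON response) pair."""
    try:
        features = json.loads(raw_body)
    except ValueError:  # malformed JSON or undecodable bytes
        return 400, json.dumps({"error": "invalid JSON"})
    if not isinstance(features, dict):
        return 400, json.dumps({"error": "expected a JSON object"})
    return 200, json.dumps({"score": score(features)})


class ScoringHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, body = handle_payload(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body.encode("utf-8"))


def serve(port=8000):
    """Start the service on localhost; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), ScoringHandler).serve_forever()
```

Keeping the request logic in `handle_payload`, separate from the HTTP handler, makes it testable without starting a server; calling `serve()` starts the service.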
Conclusion:
Your data science project is not destined to fail, so stay alert to the issues we covered above.
Now have fun with your projects 🙂