Every data science project can be seen as a sequence of steps, each of which depends on the previous ones.
Before learning more about the data science life cycle, I suggest you first look at what data science is and what its goals are by reading this article: Data Science : The Buzzword.
Data science is a multidisciplinary field that uses methods, algorithms, mathematics, statistics, programming, and logic to extract insights and relevant ideas from structured and unstructured data.
For a better understanding of what data science is, let’s explore its life cycle and understand each stage.
For the first step, we have to clearly define the problem that we should solve.
* Example 1: Suppose Mr. Ahmad has a car agency. His goal is to improve his company’s sales by identifying the drivers of sales.
To accomplish his objective, he needs to answer the following questions:
- How can the car price be estimated?
- How are the in-store promotions working?
- Are the car placements effectively deployed?
His primary aim is to answer these questions, which will surely influence the outcome of the project. So, he appoints you as a data scientist. Let’s solve his problem using the data science process. *
This is the first essential step before starting any data science project. As its name indicates, it is about understanding the application field.
* Suppose that Mr. Ahmad wants to answer the first question: how to estimate the price of a car.
So first, we have to get familiar with the application field, which is car sales. *
For every problem, we should know where the data comes from. Data can come from several kinds of sources:
- Unstructured data, such as videos or images.
- Structured data, such as text files.
- Relational database systems.
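As a minimal sketch of the structured sources listed above, the snippet below reads the same kind of records from CSV text and from a relational database, using only Python's standard library. The `cars` table, its columns, and the sample rows are made up for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical sample: in practice this would be a file on disk.
CSV_TEXT = """brand,year,price
Dacia,2015,75000
Renault,2018,120000
"""

def load_from_csv(text):
    """Read rows from CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def load_from_database(connection):
    """Read rows from a (hypothetical) `cars` table."""
    cursor = connection.execute("SELECT brand, year, price FROM cars")
    columns = [description[0] for description in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]

# Build a throwaway in-memory database so the example runs on its own.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE cars (brand TEXT, year INTEGER, price INTEGER)")
connection.executemany(
    "INSERT INTO cars VALUES (?, ?, ?)",
    [("Dacia", 2015, 75000), ("Renault", 2018, 120000)],
)

csv_rows = load_from_csv(CSV_TEXT)
db_rows = load_from_database(connection)
print(csv_rows[0])
print(db_rows[0])
```

Note that the CSV values arrive as strings, while the database returns typed values. That kind of difference is one reason a preparation step is needed later on.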
* To solve the problem from the previous example, we will use web scraping (extracting content from websites) to collect car features from an e-commerce website (such as www.avito.ma).
Once the data identification step is complete, the data file will look like a table of rows and columns containing car features. *
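To make the scraping idea concrete, here is a minimal sketch that extracts car features from a listing page with Python's built-in `html.parser`. The HTML structure (`div class="car"` with `data-*` attributes) is invented for illustration and is not the markup of any real website. A real scraper would first download the page and should respect the site's terms of use.

```python
from html.parser import HTMLParser

# Hypothetical listing markup, standing in for a downloaded page.
SAMPLE_PAGE = """
<div class="car" data-brand="Dacia" data-year="2015" data-price="75000"></div>
<div class="car" data-brand="Renault" data-year="2018" data-price="120000"></div>
<div class="ad" data-brand="ignored"></div>
"""

class CarListingParser(HTMLParser):
    """Collect the data-* attributes of every <div class="car"> element."""

    def __init__(self):
        super().__init__()
        self.cars = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "div" and attributes.get("class") == "car":
            self.cars.append({
                "brand": attributes.get("data-brand"),
                "year": int(attributes.get("data-year")),
                "price": int(attributes.get("data-price")),
            })

parser = CarListingParser()
parser.feed(SAMPLE_PAGE)
for car in parser.cars:
    print(car)
```

Each dictionary in `parser.cars` corresponds to one row of the resulting data file, and each key to one column.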
So now, we have to explore this dataset:
- The shape of the dataset: size, attributes, data types
- The distribution of key attributes
- The relationships between attributes
- Simple statistical analysis
We can write Python scripts for this exploration, or use platforms such as RapidMiner, Tanagra, or Weka.
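Here is what such an exploration script could look like, as a rough sketch using only the standard library. The four rows and the column names (`brand`, `year`, `mileage_km`, `price`) are hypothetical. On a real dataset, a library such as pandas would usually be more convenient.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical rows; a real dataset would come from the collection step.
cars = [
    {"brand": "Dacia", "year": 2015, "mileage_km": 120000, "price": 75000},
    {"brand": "Renault", "year": 2018, "mileage_km": 60000, "price": 120000},
    {"brand": "Dacia", "year": 2012, "mileage_km": 180000, "price": 50000},
    {"brand": "Peugeot", "year": 2019, "mileage_km": 40000, "price": 140000},
]

def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric lists."""
    mean_x, mean_y = mean(xs), mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return covariance / (spread_x * spread_y)

# Dataset form: size, attributes, data types.
n_rows = len(cars)
attribute_types = {name: type(value).__name__ for name, value in cars[0].items()}

# Distribution of a key attribute.
brand_counts = Counter(car["brand"] for car in cars)

# Simple statistics and a relationship between two attributes.
prices = [car["price"] for car in cars]
mileages = [car["mileage_km"] for car in cars]
price_summary = {"min": min(prices), "max": max(prices),
                 "mean": mean(prices), "median": median(prices)}
mileage_price_correlation = pearson(mileages, prices)

print(n_rows, attribute_types)
print(brand_counts)
print(price_summary)
print(round(mileage_price_correlation, 3))
```

In this made-up sample the correlation between mileage and price is strongly negative, which is the kind of relationship between attributes this step is meant to reveal.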
At this point, we have to assess the data quality and prepare the data in a suitable format by:
- Cleaning it
- Identifying missing values
- Exploring and understanding what patterns and values our dataset has.
To achieve the final stage of preparation, the data must be cleaned, formatted, and transformed into something digestible by analytics tools.
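The preparation described above can be sketched as follows. The raw records, the rule of dropping rows without a price, and the choice of filling missing years with the median are all assumptions made for this example. Other strategies may fit a real dataset better.

```python
from statistics import median_low

# Hypothetical raw records, as they might arrive from scraping.
raw_cars = [
    {"brand": " dacia ", "year": "2015", "price": "75 000 DH"},
    {"brand": "Renault", "year": "2018", "price": ""},
    {"brand": "PEUGEOT", "year": "", "price": "140000"},
    {"brand": "Renault", "year": "2016", "price": "95000"},
]

def parse_int(text):
    """Keep only the digits of a string; return None when there are none."""
    digits = "".join(ch for ch in text if ch.isdigit())
    return int(digits) if digits else None

def clean(records):
    """Normalise formats, drop rows without a price, impute missing years."""
    cleaned = []
    for record in records:
        row = {
            "brand": record["brand"].strip().title(),
            "year": parse_int(record["year"]),
            "price": parse_int(record["price"]),
        }
        # The price is the value to predict, so rows without it are dropped.
        if row["price"] is not None:
            cleaned.append(row)
    # Fill missing years with the (low) median of the known years.
    known_years = [row["year"] for row in cleaned if row["year"] is not None]
    fallback_year = median_low(known_years)
    for row in cleaned:
        if row["year"] is None:
            row["year"] = fallback_year
    return cleaned

# Identify missing values before cleaning.
missing_counts = {
    field: sum(1 for record in raw_cars if not record[field].strip())
    for field in ("brand", "year", "price")
}
clean_cars = clean(raw_cars)
print(missing_counts)
for car in clean_cars:
    print(car)
```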
This step focuses on the mathematical side. It includes identifying, configuring, and testing different algorithms, as well as sequencing them, which together constitute a model. This process is:
- First descriptive: it generates knowledge by explaining why things happened.
- Then predictive: it estimates what is going to happen.
- Finally prescriptive: it allows a future situation to be optimized.
It’s here that we apply statistical, machine learning, or deep learning algorithms.
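As one small illustration of the predictive side, the sketch below fits a one-variable linear regression (ordinary least squares) that estimates a car's price from its mileage. The training pairs are invented, and this is only one possible algorithm. A real project would likely compare several models and use more features, often with a library such as scikit-learn.

```python
# Hypothetical training data: (mileage in km, price).
training_data = [
    (40000, 140000),
    (60000, 120000),
    (120000, 75000),
    (180000, 50000),
]

def fit_linear_regression(pairs):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    variance_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    slope = covariance / variance_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict_price(mileage_km, slope, intercept):
    """Predictive use of the model: estimate the price of an unseen car."""
    return slope * mileage_km + intercept

slope, intercept = fit_linear_regression(training_data)
estimate = predict_price(100000, slope, intercept)
print(f"price = {slope:.3f} * mileage + {intercept:.0f}")
print(f"estimated price at 100,000 km: {estimate:.0f}")
```

The negative slope is the descriptive part (price falls as mileage rises), and `predict_price` is the predictive part.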
The purpose of this step is to test the model(s) or the knowledge obtained and verify whether they meet the objectives formulated at the beginning of the process. Here, the robustness and accuracy of the models are tested.
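A minimal sketch of such a check could look like this. It scores a model on held-out cars with two common metrics, the mean absolute error (MAE) and the coefficient of determination (R2), then compares the result with an objective. The coefficients, the test cars, and the acceptance threshold are all hypothetical.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute gap between real and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, predicted):
    """Share of the variance in `actual` explained by the predictions."""
    mean_actual = sum(actual) / len(actual)
    residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    total = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - residual / total

# Hypothetical coefficients of a model: price = SLOPE * mileage + INTERCEPT.
SLOPE, INTERCEPT = -0.64, 160000

# Hypothetical held-out cars (mileage, price) that were NOT used for fitting.
test_set = [(50000, 130000), (90000, 100000), (150000, 66000)]

actual_prices = [price for _, price in test_set]
predicted_prices = [SLOPE * mileage + INTERCEPT for mileage, _ in test_set]

mae = mean_absolute_error(actual_prices, predicted_prices)
r2 = r_squared(actual_prices, predicted_prices)

# Hypothetical business objective fixed at the start of the project.
MAX_ACCEPTABLE_MAE = 5000
print(f"MAE = {mae:.0f}, R2 = {r2:.3f}")
print("objective met" if mae <= MAX_ACCEPTABLE_MAE else "objective not met")
```

Evaluating on cars the model has never seen is what gives an honest view of its accuracy. Scoring it on the training data alone would tend to look too optimistic.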
This is the final step in the process. It consists of putting the resulting models into production for the end users.
Its objective is to put the knowledge obtained through modeling into a suitable form and integrate it into the decision-making process.
*Mr. Ahmad can now use this solution to predict car prices, in order to influence the outcome of his project.*
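One simple way to picture this step, assuming the model is just a pair of coefficients, is to save them to a file and load them from the application the end user runs. The file name, the JSON layout, and the coefficient values below are hypothetical. Real deployments often expose the model through a web service instead.

```python
import json
import os
import tempfile

def save_model(path, slope, intercept):
    """Persist the model parameters so another program can reuse them."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump({"slope": slope, "intercept": intercept}, handle)

def load_model(path):
    """Load the parameters and return a ready-to-use prediction function."""
    with open(path, encoding="utf-8") as handle:
        parameters = json.load(handle)

    def predict_price(mileage_km):
        return parameters["slope"] * mileage_km + parameters["intercept"]

    return predict_price

# Hypothetical coefficients standing in for a trained model.
model_path = os.path.join(tempfile.mkdtemp(), "car_price_model.json")
save_model(model_path, slope=-0.64, intercept=160000)

# What the end-user application would do: load once, predict many times.
predict_price = load_model(model_path)
print(round(predict_price(100000)))
```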
What I have presented here is an agile method that data scientists follow to develop their projects: each iteration brings additional business knowledge that helps them approach the next iteration better.