
Data Cleaning & Tidying
A must-know concept for Data Scientists.
There’s a popular saying in Data Science that goes like this: “Data Scientists spend up to 80% of their time on data cleaning and only 20% on actual data analysis”. The quote traces back to 2003, to Dasu and Johnson’s book Exploratory Data Mining and Data Cleaning, and it still holds true today.
In a typical Data Science project, from importing your data to communicating your results, tidying your data is crucial to making your workflow more productive and efficient.
Tidying data produces what’s known as tidy data, a standard first formulated by Hadley Wickham in his paper, Tidy Data. So my article will largely be a summary of that paper, extracting its essence if you will.
From the paper, the definition given is:
Tidy datasets provide a standardized way to link the structure of a dataset (its physical layout) with its semantics (its meaning).
To break down this definition, you first have to understand what structure and semantics mean.
> Structure is the form and shape of your data. In statistics, most datasets are rectangular data tables (data frames) made up of rows and columns.
> Semantics is the meaning of the dataset. Datasets are a collection of values, either quantitative or qualitative, and these values are organized in two ways: variables and observations (see the sketch after the list below).
- Variables — all values that measure the same underlying attribute across units
- Observations — all values measured on the same unit across attributes
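To make this concrete, here’s a minimal sketch in Python with pandas, loosely mirroring the treatment table from the paper (the paper’s own examples are in R, and the exact column names here are my assumption):

```python
import pandas as pd

# A "messy" layout, mirroring the paper's first example: each row is a
# person, and each treatment is its own column. The headers treatment_a
# and treatment_b are really *values* of a variable, not variables.
messy = pd.DataFrame({
    "name": ["John Smith", "Jane Doe", "Mary Johnson"],
    "treatment_a": [None, 16, 3],
    "treatment_b": [2, 11, 1],
})

# Melting reshapes it so each variable (name, treatment, result) is a
# column, and each row is one observation: a single measurement taken
# on one person under one treatment.
tidy = messy.melt(id_vars="name", var_name="treatment", value_name="result")
print(tidy)
#            name    treatment  result
# 0    John Smith  treatment_a     NaN
# 1      Jane Doe  treatment_a    16.0
# 2  Mary Johnson  treatment_a     3.0
# 3    John Smith  treatment_b     2.0
# 4      Jane Doe  treatment_b    11.0
# 5  Mary Johnson  treatment_b     1.0
```

Notice how the melted table’s structure now matches its semantics: every column is a variable, and every row is an observation.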
If you didn’t get any of that, I recommend reading the paper mentioned above, as it has examples and tables that illustrate these ideas better.
Nonetheless, the three rules of tidy data help simplify the concept and make it more intuitive.