A Straightforward Guide to Cleaning and Preparing Data in Python

How to Identify and deal with dirty data.

Real-world data is dirty. In fact, around 80% of a data scientist’s time is spent collecting, cleaning and preparing data. These tedious (but necessary) steps make the data suitable for any model we want to build and ensure the high quality of data.

The cleaning and preparation of data might be tricky sometimes, so in this article, I would like to make these processes easier by showing some techniques, methods and functions used to clean and prepare data. To do so, we’ll use a Netflix dataset available on Kaggle that contains information about all the titles on Netflix. I’m using movie datasets because they’re frequently used in tutorials for many data science projects such as sentiment analysis and building a recommendation system. You can also follow this guide with a movie dataset from IMDb, MovieLens or any dataset that you need to clean.

Although the Kaggle dataset might look well organized, it’s not ready to be used, so we’ll identify missing data, outliers, inconsistent data and do text normalization. This is shown in detail in the table below.

Table of Contents
1. Quick Dataset Overview
2. Identify Missing Data
- Create a percentage list with .isnull()
3. Dealing with Missing Data
- Remove a column or row with .drop, .dropna or .isnull
- Replace it by the mean or median or mode
- Replace it by an arbitrary number with .fillna()
4. Identifying Outliers
- Using histograms to identify outliers within numeric data
- Using boxplots to identify outliers within numeric data
- Using bars to identify outliers within categorical data
5. Dealing with Outliers
- Using operators & | to filter out outliers
6. Dealing with Inconsistent Data Before Merging 2 Dataframes
- Dealing with inconsistent column names
- Dealing with inconsistent data type
- Dealing with inconsistent names e.g. "New York" vs "NY"
7. Text Normalization
- Dealing with inconsistent capitalization
- Remove blank spaces with .strip()
- Remove or replace strings with .replace() or .sub()
8. Merging Datasets
- Remove duplicates with .drop_duplicates()

The first thing to do once you downloaded a dataset is to check the data type of each column (the values of a column might contain digits, but they might not be datetime or int type)

After reading the CSV file, type .dtypes to find the data type of each column.

df_netflix_2019 = pd.read_csv(‘netflix_titles.csv’)
df_netflix_2019.dtypes

Once you run that code, you’ll get the following output.

show_id          int64
type            object
title           object
director        object
cast            object
country         object
date_added      object
release_year     int64
rating          object
duration        object
listed_in       object
description     object
dtype: object

This will help you identify whether the columns are numeric or categorical variables, which is important to know before cleaning the data.

Now to find the number of rows and columns, the dataset contains, use the .shape method.

In [1]: df_netflix_2019.shape
Out[1]: (6234, 12) #This dataset contains 6234 rows and 12 columns.

Missing data sometimes occurs when data collection was done improperly, mistakes were made in data entry, or data values were not stored. This happens often, and we should know how to identify it.

Create a percentage list with .isnull()

A simple approach to identifying missing data is to use the .isnull() and .sum() methods

df_netflix_2019.isnull().sum()

This shows us a number of “NaN” values in each column. If the data contains many columns, you can use .sort_values(ascending=False) to place the columns with the highest number of missing values on top.

show_id            0
type               0
title              0
director        1969
cast             570
country          476
date_added        11
release_year       0
rating            10
duration           0
listed_in          0
description        0
dtype: int64

That being said, I usually represent the missing values in percentages, so I have a clearer picture of the missing data. The following code shows the above output in %

Now it’s more evident that a good number of directors were omitted in the dataset.

show_id: 0.0%
type: 0.0%
title: 0.0%
director: 31.58%
cast: 9.14%
country: 7.64%
date_added: 0.18%
release_year: 0.0%
rating: 0.16%
duration: 0.0%
listed_in: 0.0%
description: 0.0%

Now that we identified the missing data, we have to manage it.

There are different ways of dealing with missing data. The correct approach to handling missing data will be highly influenced by the data and goals your project has.

That being said, the following cover 3 simple ways of dealing with missing data.

Remove a column or row with .drop, .dropna or .isnull

If you consider it’s necessary to remove a column because it has too many empty rows, you can use .drop() and add axis=1 as a parameter to indicate that what you want to drop is a column.

However, most of the time is just enough to remove the rows containing those empty values. There are different ways to do so.

The first solution uses .drop with axis=0 to drop a row. The second identifies the empty values and takes the non-empty values by using the negation operator ~ while the third solution uses .dropna to drop empty rows within a column.

If you want to save the output after dropping, use inplace=False as a parameter. In this simple example, we’ll not drop any column or row.

Replace it by the mean or median or mode

Another common approach is to use the mean, median or mode to replace the empty values. The mean and median are used to replace numeric data, while the mode replaces categorical data.

As we’ve seen before, the rating column contains 0.16% of missing data. We could easily complete that tiny portion of data with the mode since the rating is a categorical value.

First, we calculated the mode (TV-MA), and then we filled all the empty values with .fillna.

Replace it by an arbitrary number with .fillna()

If the data is numeric, we can also set an arbitrary number to prevent removing any row without affecting our model’s results.

If the duration column was a numeric value (currently, the format is string e.g. 90 minutes), we could replace the empty values by 0 with the following code.

df_netflix_2019['duration'].fillna(0, inplace=True)

Also, you can use the ffill , bfill to propagate the last valid observation forward and backward, respectively. This is extremely useful for some datasets but also has no use for other datasets like the one we’re using.

An outlier is that data that that differs significantly from other observations. A dataset might contain real outliers or outliers obtained after poor data collection or caused by data entry errors.

Using histograms to identify outliers within numeric data

We’re going to use the duration as a reference that will help us identify outliers in the Netflix catalog. The duration column is not considered a numerical value (minutes) in our dataset because it’s mixed with strings (e.g., 90 min). Also, the duration of TV shows is in seasons.

With the following code, we’ll take only movies from the dataset and then extract the numeric values.

Now the data is ready to be displayed in a histogram. You can make plots with matplotlib, seaborn or pandas in Python. In this case, I’ll do it with matplotlib.

import matplotlib.pyplot as pltfig, ax = plt.subplots(nrows=1, ncols=1)
plt.hist(df_movie[‘minute’])
fig.tight_layout()

The plot below reveals how is distributed the duration of movies. By observing the plot, we can say that movies in the first bar (3’–34′) and the last visible bar (>189′) are probably outliers. They might be short films or long documentaries that don’t fit well in our movie category (again, it still depends on your project goals)

Image by author

Using boxplots to identify outliers within numeric data

Another option to identify outliers is boxplots. I prefer using boxplots because it leaves outliers out of the box’s whiskers. As a result, it’s easier to identify the minimum and maximum values without considering the outliers.

We can easily make boxplots with the following code.

import seaborn as snsfig, ax = plt.subplots(nrows=1, ncols=1)
ax = sns.boxplot(x=df_movie[‘minute’])
fig.tight_layout()

The boxplot shows that values below 43′ and above 158′ are probably outliers.

Image by author

Also, we can identify some elements of the boxplot like lower quartile (Q1), median and upper quartile (Q3) with the.describe() method.

In  [1]: df_movie[‘minute’].describe()
Out [1]: count    4265.000000
mean       99.100821
std        28.074857
min         3.000000
25%        86.000000
50%        98.000000
75%       115.000000
max       312.000000

In addition to that, you can easily display all elements of the boxplot and even make it interactive with Plotly.

import plotly.graph_objects as go
from plotly.offline import iplot, init_notebook_modefig = go.Figure()
fig.add_box(x=df_movie[‘minute’], text=df_movie[‘minute’])
iplot(fig)

Using bars to identify outliers within categorical data

In case the data is categorical, you can identify categories with few observations by plotting bars.

In this case, we’ll use the built-in Pandas visualization to make the bar plot.

fig=df_netflix_2019['rating'].value_counts().plot.bar().get_figure()
fig.tight_layout()

Image by author

In the plot below, we can see that the mode (a value that appears most often in the column) is ‘TV-MA’ while ‘NC-17’ and ‘UR’ are uncommon.

Once we identified the outliers, we can easily filter them out by using Python’s operators.

Using operators & | to filter out outliers

Python operators are simple to memorize. & is the equivalent of and, while| is the equivalent of or.

In this case, we’re going to filter out outliers based on the values revealed by the boxplot.

#outliers
df_movie[(df_movie['minute']<43) | (df_movie['minute']>158)]
#filtering outliers out
df_movie = df_movie[(df_movie['minute']>43) & (df_movie['minute']<158)]

The df_movie created now contains only movies that last between 43′ and 158′.

A common task we often come across is merging dataframes to increase the expand the information of an observation. Unfortunately, most of the time, datasets have many inconsistencies because they come from different sources.

From now on, we’ll use a second dataset df_netflix_originals that contains only Netflix originals, and we’ll merge it with the original dataset df_netflix_2019 to determine original and non original content.

Dealing with inconsistent column names

A common issue we have to deal with is different column names between tables. Column names can be easily changed with the .rename method.

Dealing with inconsistent data type

If you try to merge 2 datasets based on a column that has different data types, Python will throw an error. That’s why you have to make sure the type is the same. If the same column have different types, you can use the .astype method to normalize it.

Dealing with inconsistent names e.g., “New York” vs. “NY”

Usually, the column and data type normalization is enough to merge to datasets; however, sometimes, there are inconsistencies between the data within the same column caused by data entry errors (typos) or disagreements in the way a word is written.

Movies titles don’t usually have these problems. They might have a disagreement in punctuation (we’ll take care of this later), but movies usually have a standard name, so to explain how to deal with this problem, I’ll create a dataset and a list containing states written in different ways.

There are many libraries that can help us solve this issue. In this case, I’ll use the fuzzywuzzy library. This will give a score based on the distance between 2 strings. You can choose the scorer that fits your data better. I will set scorer=fuzz.token_sort_ratio

As we can see in the output, the scorer does a good job matching strings.

states         match       score
CA             California  33       
Hawai          Hawaii      91       
NY             New York    40       
Washington DC  Washington  87

However, keep in mind that it can still make mistakes.

Text normalization is necessary for Natural Language Processing. This involves the following techniques to make text uniform.

Removing whitespace, punctuation and non-alphanumeric characters
Tokenization, Stemming, Lemmatization, removing stop words

To make things simple, this article will only cover the first point.

Dealing with inconsistent capitalization

Before merging 2 frames, we have to make sure most rows will match, and normalizing capitalization helps with it.

There are many ways to lower case text within a frame. Below you can see two options (.apply or .str.lower)

Remove blank spaces with .strip()

Sometimes data has leading or trailing white spaces. We can get rid of them with the .strip method

Remove or replace strings with .replace() or .sub()

Often texts between two datasets will have disagreements in punctuation. You can remove it with .apply and .sub or by using .replace

It’s good to use any of them with regular expressions. For example, the regex[^ws] will help you remove characters other than words (a-z, A-Z, 0–9, _ ) or spaces.

Finally, we can merge the dataset df_netflix_originals and df_netflix_2019. With this, we can identify which movies are Netflix originals and which only belong to the catalog. In this case, we do an outer join to give ‘Catalog’ value to all the rows with empty values in the “Original" column.

Remove duplicates with .drop_duplicates()

One of the pitfalls of outer join with 2 key columns is that we’ll obtain duplicated rows if we consider a column alone. In this case, we merged based on the title and release_year columns, so most likely there are titles duplicated that have different release_year.

You can drop duplicates within a column with the .drop_duplicated method

The data grouped by type and origin is distributed like this.

In[1]: df_netflix[['original', 'type']].value_counts()Out[1]:
original  type   
Catalog   Movie      3763
TV Show    1466
Netflix   TV Show    1009
Movie       504

How to Identify and deal with dirty data.

Create a percentage list with .isnull()

Remove a column or row with .drop, .dropna or .isnull

Replace it by the mean or median or mode

Replace it by an arbitrary number with .fillna()

Using histograms to identify outliers within numeric data

Using boxplots to identify outliers within numeric data

Using bars to identify outliers within categorical data

Using operators & | to filter out outliers

Dealing with inconsistent column names

Dealing with inconsistent data type

Dealing with inconsistent names e.g., “New York” vs. “NY”

Dealing with inconsistent capitalization

Remove blank spaces with .strip()

Remove or replace strings with .replace() or .sub()

Remove duplicates with .drop_duplicates()

Footer