How to Boost Pandas Functions with Python Dictionaries

Explained with examples

Pandas is a highly popular data analysis and manipulation library. Thanks to its simple and intuitive Python syntax, Pandas is usually the first choice for aspiring data scientists. Its powerful and efficient functions lead many experienced data scientists to prefer Pandas as well.

Pandas provides a rich selection of functions that expedite the data analysis process. The default parameter settings do a fine job in most cases, but we can often do better by customizing the parameters.

In addition to a constant value or a list, some parameters accept a dictionary argument. In this article, we will go over several examples to demonstrate how using dictionaries adds value to these functions.

For the examples, we will use a small sample from the Melbourne housing dataset available on Kaggle. We first read the CSV file using the read_csv function.

import numpy as np
import pandas as pd

cols = ['Price', 'Landsize', 'Distance', 'Type', 'Regionname']

melb = pd.read_csv(
    "/content/melb_data.csv",
    usecols=cols,
    dtype={'Price': 'int'},
    na_values={'Landsize': 9999, 'Regionname': '?'}
)

melb.head()

(image by author)

The dtype parameter is used to specify the data types. By using a dictionary, we are able to specify the data type for each column separately.
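
As a minimal sketch of per-column data types, the snippet below reads a few made-up rows from an in-memory string instead of the Kaggle file. The values and the choice of types (including the category type for the Type column) are assumptions for illustration only. Columns that are not listed in the dictionary, such as Distance here, are still inferred by Pandas.

```python
import io

import pandas as pd

# Hypothetical rows standing in for the Melbourne housing file
csv_text = """Price,Landsize,Distance,Type
1480000,202,2.5,h
1035000,156,2.5,h
1465000,134,3.3,u
"""

# Each dictionary key is a column name, each value is the type for that column
df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"Price": "int64", "Landsize": "float64", "Type": "category"},
)

# Distance is not in the dictionary, so its type is inferred
print(df.dtypes)
```

Without the dictionary, Landsize would be read as an integer column and Type as a generic object column, so the dictionary lets us override only the columns we care about.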

Real-life data is usually messy, so we are likely to encounter different representations of missing values. The na_values parameter handles such representations.

Consider a case where the missing values in the land size and region name columns are represented with 9999 and “?”, respectively. We can pass a dictionary to the na_values parameter to handle column-specific missing values.
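
To see that the markers are applied per column, here is a small sketch on made-up data (the rows are hypothetical, not taken from the Kaggle file). The value 9999 appears in both the Landsize and Distance columns, but only Landsize is listed in the dictionary, so only that occurrence should become a missing value. The markers are wrapped in lists here, which also makes it easy to add more than one marker per column.

```python
import io

import pandas as pd

# Hypothetical rows; 9999 and "?" play the role of missing-value markers
csv_text = """Landsize,Regionname,Distance
202,Northern Metropolitan,2.5
9999,?,9999
156,Western Metropolitan,3.1
"""

# Markers are defined per column: 9999 only counts as missing in Landsize
df = pd.read_csv(
    io.StringIO(csv_text),
    na_values={"Landsize": [9999], "Regionname": ["?"]},
)

# Count missing values in each column
print(df.isna().sum())
```

Passing a plain list such as na_values=[9999, "?"] would instead apply both markers to every column, which would also wipe out the 9999 in Distance.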
