Install and import
Pandas is an easy package to install. Open up your terminal (for Mac users) or command prompt (for PC users) and install it using the following command:
pip install pandas
Alternatively, if you’re currently viewing this article in a Jupyter notebook, you can run this cell:
!pip install pandas
The ! at the beginning runs the cell as if it were in a terminal.
To import pandas we usually import it with a shorter name since it’s used so much:
import pandas as pd
Data Structures
Pandas provides two primary data structures for manipulating data:
Series
A Pandas Series is a one-dimensional labeled array capable of holding data of any type (integers, strings, floats, Python objects, and so on). The axis labels are collectively called the index. A Series is essentially a single column of an Excel sheet. Labels need not be unique, but they must be of a hashable type. The object supports both integer- and label-based indexing and provides a host of methods for performing operations involving the index.
import pandas as pd
import numpy as np
data = np.array(['g', 'e', 'e', 'k', 's'])
ser = pd.Series(data)
print(ser)
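Series labels do not have to be the default integers; you can supply your own and then select values either by label or by integer position. Here is a minimal sketch (the labels are just illustrative):

import pandas as pd

# A Series with explicit string labels instead of the default 0..n-1 index
prices = pd.Series([3, 2, 0, 1], index=['mon', 'tue', 'wed', 'thu'])

print(prices['tue'])   # label-based lookup -> 2
print(prices.iloc[1])  # integer-position lookup -> 2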
DataFrame
A Pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). In other words, data is aligned in a tabular fashion in rows and columns. A DataFrame consists of three principal components: the data, the rows, and the columns.
data = {'apples': [3, 2, 0, 1], 'oranges': [0, 3, 7, 2]}
purchases = pd.DataFrame(data)
purchases
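The same ideas apply here: you can pass your own row labels via index, pull a single column out as a Series, or select a row by its label with .loc. A short sketch building on the purchases data above (the customer names are made up for illustration):

# Reuse the apples/oranges dict, but give the rows meaningful labels
purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David'])

purchases['apples']    # a single column comes back as a Series
purchases.loc['June']  # a single row, selected by its label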
How to read in data
It’s quite simple to load data from various file formats into a DataFrame. In the following examples we’ll keep using our apples and oranges data, but this time it will come from files.
Reading data from CSVs
With CSV files, all you need is a single line to load in the data:
df = pd.read_csv('purchases.csv')
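CSV files have no notion of an index, so pandas assigns a default integer index when reading. If the file already stores the index in its first column, you can tell read_csv to use it (the file name here is assumed):

# Treat the first column of the CSV as the DataFrame index
df = pd.read_csv('purchases.csv', index_col=0)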
Reading data from JSON
If you have a JSON file (which is essentially a stored Python dict), pandas can read it just as easily:
df = pd.read_json(‘purchases.json’)
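If you want to see the dict-to-JSON correspondence end to end, you can write the apples and oranges dict out with the standard json module and read it straight back; a quick sketch (purchases.json is just an illustrative file name):

import json

# Write the dict to disk, then let pandas turn the JSON back into a DataFrame
with open('purchases.json', 'w') as f:
    json.dump({'apples': [3, 2, 0, 1], 'oranges': [0, 3, 7, 2]}, f)

df = pd.read_json('purchases.json')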
Reading data from a SQL database
If you’re working with data from a SQL database, you first need to establish a connection using an appropriate Python library, then pass a query to pandas. Here we’ll use SQLite to demonstrate.
Python’s sqlite3 module is part of the standard library, so there is nothing extra to install.
sqlite3 is used to create a connection to a database, which we can then use to generate a DataFrame through a SELECT query.
So first we’ll make a connection to a SQLite database file:
import sqlite3
con = sqlite3.connect('database.db')
In this SQLite database we have a table called purchases, and our index is in a column called "index".
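If you want to follow along without an existing database file, you can create a small purchases table yourself; this setup sketch assumes the schema described above, and the row values are made up:

# One-time setup: create the table and insert a few rows
con.execute('CREATE TABLE IF NOT EXISTS purchases ("index" TEXT, apples INTEGER, oranges INTEGER)')
con.executemany('INSERT INTO purchases VALUES (?, ?, ?)',
                [('June', 3, 0), ('Robert', 2, 3), ('Lily', 0, 7), ('David', 1, 2)])
con.commit()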
By passing a SELECT query and our con, we can read from the purchases table:
df = pd.read_sql_query("SELECT * FROM purchases", con)
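Because the original index was stored in a plain column named "index", you can promote it back to being the DataFrame’s index after reading:

# Restore the stored "index" column as the DataFrame's index
df = df.set_index('index')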
Converting back to a CSV, JSON, or SQL
So after extensive work on cleaning your data, you’re now ready to save it as a file of your choice. Similar to the ways we read in data, pandas provides intuitive commands to save it:
df.to_csv('new_purchases.csv')
df.to_json('new_purchases.json')
df.to_sql('new_purchases', con)
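A couple of details to keep in mind when saving: to_csv writes the index as an unnamed first column by default, and to_sql will raise an error if the target table already exists. A short sketch of how you might handle both:

# Drop the index when writing the CSV (or use index_col=0 when reading it back)
df.to_csv('new_purchases.csv', index=False)

# Overwrite the SQL table if it is already there
df.to_sql('new_purchases', con, if_exists='replace')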
Python Pandas Operations
Using pandas, you can perform a lot of operations on Series and DataFrames, handle missing data, group data, and more. Some of the common operations for data manipulation are listed below: