Install and import
Pandas is an easy package to install. Open up your terminal (for Mac users) or command prompt (for PC users) and install it using the following command:
pip install pandas
Alternatively, if you’re currently viewing this article in a Jupyter notebook, you can run this cell:
!pip install pandas
The ! at the beginning runs the cell as if it were in a terminal.
To import pandas we usually import it with a shorter name since it’s used so much:
import pandas as pd
Data Structures
Pandas provides two primary data structures for manipulating data:
Series
A Pandas Series is a one-dimensional labeled array capable of holding data of any type (integers, strings, floats, Python objects, and so on). The axis labels are collectively called the index. A Series is essentially a single column of an Excel sheet. Labels need not be unique, but they must be of a hashable type. The object supports both integer- and label-based indexing and provides a host of methods for performing operations involving the index.
import pandas as pd
import numpy as np
data = np.array(['g', 'e', 'e', 'k', 's'])
ser = pd.Series(data)
print(ser)
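Series labels do not have to be the default integers; you can supply your own and then select values either by label or by integer position. Here is a minimal sketch (the labels are just illustrative):

import pandas as pd

# A Series with explicit string labels instead of the default 0..n-1 index
prices = pd.Series([3, 2, 0, 1], index=['mon', 'tue', 'wed', 'thu'])

print(prices['tue'])   # label-based lookup -> 2
print(prices.iloc[1])  # integer-position lookup -> 2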
DataFrame
A Pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). In other words, data is aligned in a tabular fashion in rows and columns. A DataFrame consists of three principal components: the data, the rows, and the columns.
data = {'apples': [3, 2, 0, 1], 'oranges': [0, 3, 7, 2]}
purchases = pd.DataFrame(data)
purchases
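The same ideas apply here: you can pass your own row labels via index, pull a single column out as a Series, or select a row by its label with .loc. A short sketch building on the purchases data above (the customer names are made up for illustration):

# Reuse the apples/oranges dict, but give the rows meaningful labels
purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David'])

purchases['apples']    # a single column comes back as a Series
purchases.loc['June']  # a single row, selected by its label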
How to read in data
It’s quite simple to load data from various file formats into a DataFrame. In the following examples we’ll keep using our apples and oranges data, but this time it will come from files.
Reading data from CSVs
With CSV files, all you need is a single line to load in the data:
df = pd.read_csv('purchases.csv')
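CSV files have no notion of an index, so pandas assigns a default integer index when reading. If the file already stores the index in its first column, you can tell read_csv to use it (the file name here is assumed):

# Treat the first column of the CSV as the DataFrame index
df = pd.read_csv('purchases.csv', index_col=0)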
Reading data from JSON
If you have a JSON file (which is essentially a stored Python dict), pandas can read it just as easily:
df = pd.read_json(‘purchases.json’)
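If you want to see the dict-to-JSON correspondence end to end, you can write the apples and oranges dict out with the standard json module and read it straight back; a quick sketch (purchases.json is just an illustrative file name):

import json

# Write the dict to disk, then let pandas turn the JSON back into a DataFrame
with open('purchases.json', 'w') as f:
    json.dump({'apples': [3, 2, 0, 1], 'oranges': [0, 3, 7, 2]}, f)

df = pd.read_json('purchases.json')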
Reading data from a SQL database
If you’re working with data from a SQL database, you first need to establish a connection using an appropriate Python library, then pass a query to pandas. Here we’ll use SQLite to demonstrate.
Python’s sqlite3 module is part of the standard library, so there is nothing extra to install.
sqlite3 is used to create a connection to a database, which we can then use to generate a DataFrame through a SELECT query.
So first we’ll make a connection to a SQLite database file:
import sqlite3
con = sqlite3.connect('database.db')
In this SQLite database we have a table called purchases, and our index is in a column called "index".
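If you want to follow along without an existing database file, you can create a small purchases table yourself; this setup sketch assumes the schema described above, and the row values are made up:

# One-time setup: create the table and insert a few rows
con.execute('CREATE TABLE IF NOT EXISTS purchases ("index" TEXT, apples INTEGER, oranges INTEGER)')
con.executemany('INSERT INTO purchases VALUES (?, ?, ?)',
                [('June', 3, 0), ('Robert', 2, 3), ('Lily', 0, 7), ('David', 1, 2)])
con.commit()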
By passing a SELECT query and our con, we can read from the purchases table:
df = pd.read_sql_query("SELECT * FROM purchases", con)
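Because the original index was stored in a plain column named "index", you can promote it back to being the DataFrame’s index after reading:

# Restore the stored "index" column as the DataFrame's index
df = df.set_index('index')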
Converting back to a CSV, JSON, or SQL
So after extensive work on cleaning your data, you’re now ready to save it as a file of your choice. Similar to the ways we read in data, pandas provides intuitive commands to save it:
df.to_csv('new_purchases.csv')
df.to_json('new_purchases.json')
df.to_sql('new_purchases', con)
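A couple of details to keep in mind when saving: to_csv writes the index as an unnamed first column by default, and to_sql will raise an error if the target table already exists. A short sketch of how you might handle both:

# Drop the index when writing the CSV (or use index_col=0 when reading it back)
df.to_csv('new_purchases.csv', index=False)

# Overwrite the SQL table if it is already there
df.to_sql('new_purchases', con, if_exists='replace')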
Python Pandas Operations
Using pandas, you can perform a lot of operations on Series and DataFrames, handle missing data, group data, and more. Some of the common operations for data manipulation are listed below: