Real-world datasets are messy, so they must be cleaned before they can be analyzed. Data preprocessing is one of the most important stages of data analysis, and it is often the most time-consuming step for data scientists. Pandas is one of the most important Python libraries for this work. In this post, I will talk about the Pandas library.
Pandas is a great library for data preprocessing. It is often used together with libraries such as NumPy and SciPy for numerical computations and Matplotlib for visualizing data. Many Pandas methods resemble their NumPy counterparts. While a NumPy array holds a single data type, a Pandas object can hold columns of different data types.
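To make the difference in data types concrete, here is a minimal sketch. The column names and values are made up for illustration; the point is only that NumPy coerces mixed inputs to one common type, while a DataFrame keeps one type per column.

```python
import numpy as np
import pandas as pd

# NumPy stores one dtype per array, so mixed inputs are coerced to a common type.
arr = np.array([1, "John", 3.5])
print(arr.dtype)  # a Unicode string dtype: every element became text

# A DataFrame keeps a separate dtype for each column.
# The column names and values below are invented for this example.
df = pd.DataFrame({"age": [25, 31], "name": ["John", "Jane"], "score": [3.5, 4.0]})
print(df.dtypes)
```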
Data stored in an Excel sheet or an SQL table can be easily analyzed with Pandas.
Pandas has been an open-source library since 2009 and is constantly updated by developers around the world.
In this post, I will cover:
- How to install Pandas?
- Series data structure
- Working with Series
- DataFrame data structure
Let’s get started.
If you are using a distribution such as Anaconda, the Pandas library comes preinstalled. Otherwise, install Pandas on your computer with:
pip install pandas
This command automatically installs Pandas and the libraries it depends on.
Pandas must be imported before it can be used. Let’s import it under the conventional alias pd.
import pandas as pd
Let’s take a look at the installed version of Pandas.
pd.__version__
#'1.1.3'
Pandas has data structures that make data analysis easy. The most commonly used are Series and DataFrame. A Series is one-dimensional, that is, it consists of a single column. A DataFrame is two-dimensional, i.e. it consists of rows and columns.
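The one-dimensional versus two-dimensional distinction can be checked directly. This is a small sketch with invented column names ("math", "art"); it also shows that a single column taken from a DataFrame is itself a Series.

```python
import pandas as pd

s = pd.Series([90, 80, 85])
# Hypothetical columns, used only to build a small two-dimensional table.
df = pd.DataFrame({"math": [90, 80, 85], "art": [70, 95, 60]})

print(s.ndim, s.shape)    # 1 (3,)
print(df.ndim, df.shape)  # 2 (3, 2)

# Selecting one column of a DataFrame gives back a Series.
col = df["math"]
print(type(col))
```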
Let’s take a look at the Series data structure.
Series is a one-dimensional object and represents a column.
obj=pd.Series([1,"John",3.5,"Hey"])
obj
By default, the index of the object starts from 0.
obj[0]
#1
If we want the contents of a Series as an array, we use the values attribute.
obj.values
#array([1, 'John', 3.5, 'Hey'], dtype=object)
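As a side note, the pandas documentation also offers the to_numpy() method for the same purpose. A minimal sketch, assuming you simply want a NumPy array back:

```python
import numpy as np
import pandas as pd

obj = pd.Series([1, "John", 3.5, "Hey"])
arr = obj.to_numpy()       # mixed values, so the array has dtype object
print(type(arr), arr.dtype)

nums = pd.Series([1, 2, 3])
num_arr = nums.to_numpy()  # all integers, so the array has an integer dtype
print(num_arr.dtype)
```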
We can set our own index labels.
obj2=pd.Series([1,"John",3.5,"Hey"],index=["a","b","c","d"])
obj2
We can access a value by its index label.
obj2["b"]
#'John'
If we want to see the index of obj2, we use the index attribute.
obj2.index
#Index(['a', 'b', 'c', 'd'], dtype='object')
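Once a Series has custom labels, it helps to be explicit about whether you mean a label or a position. The sketch below uses the loc and iloc accessors for that; it is one common way to do it, not the only one.

```python
import pandas as pd

obj2 = pd.Series([1, "John", 3.5, "Hey"], index=["a", "b", "c", "d"])

by_label = obj2.loc["b"]      # look up by index label
by_position = obj2.iloc[1]    # look up by integer position
print(by_label, by_position)

# Label slices include the end label; position slices exclude the end position.
label_slice = obj2.loc["a":"c"]
position_slice = obj2.iloc[0:2]
print(len(label_slice), len(position_slice))
```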
We can convert data structures such as a list, tuple, or dictionary to a Series.
score={"Jane":90, "Bill":80,"Elon":85,"Tom":75,"Tim":95}
names=pd.Series(score) # Convert to Series
names
We can access values using the keys.
names["Tim"]
#95
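The dictionary case is shown above; for completeness, here is a short sketch of the list and tuple cases. With a list or tuple there are no keys, so pandas falls back to a 0-based index unless you pass one.

```python
import pandas as pd

# From a list: the index defaults to 0, 1, 2.
from_list = pd.Series([90, 80, 85])

# From a tuple, with index labels supplied explicitly.
from_tuple = pd.Series((90, 80, 85), index=["Jane", "Bill", "Elon"])

print(from_list)
print(from_tuple)
```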
We can select specific names with a condition.
names[names>=85]
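It may help to see what happens in two steps: the comparison first produces a Series of True/False values (a mask), and indexing with that mask keeps only the True rows. A minimal sketch using the same scores:

```python
import pandas as pd

score = {"Jane": 90, "Bill": 80, "Elon": 85, "Tom": 75, "Tim": 95}
names = pd.Series(score)

mask = names >= 85        # boolean Series, one True/False per name
selected = names[mask]    # keeps only the rows where the mask is True
print(mask)
print(selected)

# Conditions can be combined with & (and) or | (or); each needs parentheses.
middle = names[(names >= 80) & (names < 95)]
print(middle)
```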
Let’s change Tom’s value.
names["Tom"]=60
names
We can change more than one value.
names[names<=80]=83
names
We can check if any value is in the data.
"Tom" in names
#True
"Can" in names
#False
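Note that the in operator checks the index labels of a Series, not its values. The sketch below shows the difference and one way to test for a value instead.

```python
import pandas as pd

score = {"Jane": 90, "Bill": 80, "Elon": 85, "Tom": 75, "Tim": 95}
names = pd.Series(score)

label_found = "Tom" in names             # True: "Tom" is an index label
value_as_label = 95 in names             # False: 95 is a value, not a label
value_found = (names == 95).any()        # True: compare the values instead
print(label_found, value_as_label, value_found)
```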
We can apply mathematical operations to Series.
names/10
We can square each value.
names**2
The isnull() method is used to find missing data in Pandas.
names.isnull()
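The names Series has no missing values, so every result above is False. Here is a small sketch with one missing score added on purpose (the data is invented) to show what isnull() reports and how to count the gaps.

```python
import numpy as np
import pandas as pd

# Bill's score is deliberately missing in this made-up example.
scores = pd.Series({"Jane": 90, "Bill": np.nan, "Elon": 85})

missing = scores.isnull()          # True where a value is missing
n_missing = int(missing.sum())     # True counts as 1, so the sum is the count
print(missing)
print(n_missing)
```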
Now I am going to show how to work with Series.
Let’s import the data set from my working directory. You can download the data set I will use here.
games=pd.read_csv("vgsalesGlobale.csv")
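If you do not have the CSV file at hand, you can still follow along with a tiny in-memory stand-in. The rows below are invented and the Name column is an assumption; only Year, Genre, and Global_Sales are used in this post. The name games_demo is hypothetical.

```python
import io
import pandas as pd

# Invented rows standing in for the real file; not actual sales figures.
csv_text = """Name,Year,Genre,Global_Sales
Game A,2006,Sports,8.25
Game B,1985,Platform,4.02
Game C,2008,Racing,3.58
Game D,2009,Sports,3.30
Game E,,Action,0.50
"""

# read_csv accepts a file-like object as well as a file path.
games_demo = pd.read_csv(io.StringIO(csv_text))
print(games_demo.head())
print(games_demo.dtypes)  # Year becomes float because one value is missing
```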
Let’s take a look at the first 5 rows of the data set.
games.head()
Let’s look at the types of variables in the data set.
games.dtypes
Let’s print the descriptive statistics of the Genre variable.
games.Genre.describe()
Let’s see how many times each category occurs in the variable.
games.Genre.value_counts()
Let’s print the proportion of each value (a fraction between 0 and 1).
games.Genre.value_counts(normalize=True)
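With normalize=True the result is a fraction, not a percentage. A minimal sketch on a made-up list of genres shows the counts, the fractions, and how to turn the fractions into percentages:

```python
import pandas as pd

# Invented sample, used only to keep the numbers easy to check by hand.
genre = pd.Series(["Sports", "Action", "Sports", "Racing", "Sports", "Action"])

counts = genre.value_counts()                # how many times each value occurs
shares = genre.value_counts(normalize=True)  # fractions that sum to 1
percent = (shares * 100).round(1)            # the same shares as percentages
print(counts)
print(percent)
```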
Let’s look at the type of the object returned by value_counts().
type(games.Genre.value_counts())
#pandas.core.series.Series
Since this object is a Series, we can use Series methods on it. For example, let’s use the head method.
games.Genre.value_counts().head()
The unique() method is used to see each distinct value once.
games.Genre.unique()
We can see how many unique values there are.
games.Genre.nunique()
#12
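One detail worth knowing: by default nunique() does not count missing values, while unique() does list them. A small sketch with an invented sample that contains one missing entry:

```python
import pandas as pd

# Made-up sample with one missing genre.
genre = pd.Series(["Sports", None, "Action", "Sports"])

distinct = genre.unique()     # distinct values, including the missing one
n_distinct = genre.nunique()  # missing values are not counted by default
print(distinct)
print(n_distinct)
```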
The crosstab() function is used to show how often each combination of values of two variables occurs, as a table.
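As a minimal sketch of crosstab(), the example below counts genre and year combinations on a few invented rows. Which two columns to cross-tabulate in the real data set is up to you; Genre and Year are used here only as an assumption.

```python
import pandas as pd

# Invented rows; the real data set is much larger.
df = pd.DataFrame({
    "Genre": ["Sports", "Action", "Sports", "Action", "Sports"],
    "Year": [2006, 2006, 2008, 2008, 2008],
})

# Rows are genres, columns are years, cells are the number of matching rows.
table = pd.crosstab(df["Genre"], df["Year"])
print(table)
```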
Now let’s consider Global_Sales, a numerical Series. Let’s look at the descriptive statistics of this variable.
games.Global_Sales.describe()
Let’s see the mean of the Global_Sales variable.
games.Global_Sales.mean()
#0.53744065550074
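The result of describe() is itself a Series, so single statistics can be picked out by label. A short sketch on made-up sales figures shows that its "mean" entry matches mean():

```python
import pandas as pd

# Invented values, chosen so the mean is easy to verify by hand.
sales = pd.Series([8.0, 4.0, 3.0, 1.0])

summary = sales.describe()    # count, mean, std, min, quartiles, max
print(summary["mean"])        # the same number that sales.mean() returns
print(summary["max"] - summary["min"])
```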
We may also want to see how many times each value of this numerical variable occurs.
games.Global_Sales.value_counts()
Let’s look at how to visualize data of the Series type. Let’s draw a histogram of the numeric variable Year.
games.Year.plot(kind="hist")
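A histogram splits the range of values into equal-width bins and counts how many values fall into each one. The sketch below reproduces just that counting step with NumPy on a few invented years; it illustrates the idea and is not meant as a description of how the plot is drawn internally. Missing years are dropped first, and 4 bins are used instead of the plot’s default of 10 to keep the output short.

```python
import numpy as np
import pandas as pd

# Invented years, including one missing value.
years = pd.Series([1985, 2006, 2006, 2008, 2009, np.nan])

# Count how many years fall into each of 4 equal-width bins.
counts, edges = np.histogram(years.dropna(), bins=4)
print(counts)  # number of values in each bin
print(edges)   # bin boundaries, one more than the number of bins
```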
Now consider the Genre variable, which is of the object type. Let’s look at the count of each category of this variable.
games.Genre.value_counts()
Let’s draw a bar chart of the Genre counts.
games.Genre.value_counts().plot(kind="bar")
Another important data structure in Pandas is DataFrame.