A comprehensive practice guide for data analysis.
There are several libraries and packages that provide data scientists, analysts, or anyone interested in data with functions to perform efficient data analysis. Most of them are well documented, so you can easily find out what a function does.
However, the best way to learn such libraries is through practice. It is not enough to know what a function does. We should be able to recall the right function and use it at the right time and place. Thus, I highly recommend learning a package or library by practicing with it.
In this article, we will use R packages to explore and gain insight into the Melbourne housing dataset available on Kaggle.
For data analysis and manipulation, we will be using the data.table package of R. Let’s import it and read the CSV file.
> library(data.table)
> house_prices <- fread("Downloads/melb_data.csv")
> head(house_prices)
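If you do not have the file at hand, the reading step can still be tried out. The following is a minimal sketch that assumes a made-up three-row extract: the column names Regionname and Price come from the dataset, while the Rooms column and all values are hypothetical. It relies on the `text` argument of fread, which takes the CSV content directly instead of a file path.

```r
library(data.table)

# Hypothetical three-row extract; the values are made up for illustration
# and are not taken from the Kaggle file.
csv_lines <- c(
  "Regionname,Rooms,Price",
  "Northern Metropolitan,2,850000",
  "Southern Metropolitan,3,1400000",
  "Western Metropolitan,3,720000"
)

# fread() accepts literal CSV lines through its `text` argument,
# so this sketch runs without the downloaded file.
toy <- fread(text = csv_lines)

str(toy)  # column names and the types fread() detected
dim(toy)  # number of rows and columns
```

Checking the detected column types right after reading is a cheap way to catch a numeric column that was read as text.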
The dataset contains several attributes of the houses in Melbourne along with their prices. Since the focus of this dataset is the price, it is better to get an overview of the price column first.
> house_prices[, summary(Price)]
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  85000  650000  903000 1075684 1330000 9000000
We use the summary function on the Price column to get an overview in terms of basic statistics. The average house price is approximately 1.08 million.
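To see what happens inside the brackets, here is a small sketch on a hypothetical Price column (the five values are made up). The expression after the comma is evaluated with the columns available as variables, so any function that works on a vector can be used there. In this toy data, as in the output above, the mean sits above the median, which usually indicates that a few expensive houses pull the average up.

```r
library(data.table)

# Hypothetical prices, made up for illustration only.
toy <- data.table(Price = c(500000, 700000, 900000, 1100000, 2800000))

# The expression after the comma sees Price as an ordinary vector.
toy[, summary(Price)]

# The same idea with individual statistics, returned as a one-row table.
stats <- toy[, .(avg = mean(Price), med = median(Price))]
stats
```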
We now know the average house price overall. We might also need to compare the house prices in different regions. This is a group-by task and can easily be done by adding the name of the column to be used for grouping.
> house_prices[, mean(Price), by = Regionname]
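The grouping step can be checked on a table small enough to verify by hand. This is a sketch under the assumption of four made-up rows; the region labels "North" and "South" are placeholders, not actual values of Regionname.

```r
library(data.table)

# Hypothetical rows; region labels and prices are made up for illustration.
toy <- data.table(
  Regionname = c("North", "North", "South", "South"),
  Price      = c(600000, 800000, 1000000, 1400000)
)

# mean(Price) is evaluated once per distinct value of Regionname.
by_region <- toy[, mean(Price), by = Regionname]
by_region

# The unnamed aggregate receives the default column name V1.
names(by_region)
```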
The aggregated column is named “V1” by default, which is not very informative. We can assign a name to the aggregated column with a slight change in the syntax. Let’s also calculate the number of houses in each region along with the average house price.
> house_prices[, .(avg_price = mean(Price), number_of_houses = .N), by = Regionname]
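Here is the same pattern on hypothetical data, as a sketch rather than a definitive recipe. The `.()` wrapper builds a list of named results, and `.N` holds the number of rows in the current group. The final sorting step is an addition that is not part of the walkthrough above: a second pair of brackets is chained to order the result by average price.

```r
library(data.table)

# Hypothetical rows; region labels and prices are made up for illustration.
toy <- data.table(
  Regionname = c("North", "North", "South", "South", "South"),
  Price      = c(600000, 800000, 1000000, 1400000, 1200000)
)

# Named aggregates per region, then sorted by average price (descending).
region_summary <- toy[
  , .(avg_price = mean(Price), number_of_houses = .N), by = Regionname
][order(-avg_price)]

region_summary
```

Chaining keeps the aggregation and the sort in one expression, so no intermediate variable is needed.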