Airbnb data exploration, analysis and feature engineering

This article lays a foundation for analysing the data published by Airbnb. It demonstrates how to turn scraped data into features that help a model predict a listing’s price.
Airbnb, Inc. is an American vacation rental online marketplace company. Airbnb maintains and hosts a marketplace, accessible to consumers on its website or via an app.

Here is the GitHub repository that provides the Python notebook and step-by-step documentation for setting up the project:
https://github.com/rafayullah/Airbnb

The figure shows the average price of listings throughout the year, together with a weekly average of the number of ads posted on the platform.

The figure shows the correlation between the number of listings and the average price. As the number of listings rises, the average price of the listings rises proportionally.
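A weekly aggregation like the one behind these two figures could be sketched as follows. The column names `date` and `price`, the helper name `weekly_price_summary` and the toy rows are assumptions for illustration; the notebook's actual schema may differ.

```python
import pandas as pd

def weekly_price_summary(df):
    """Weekly mean price and weekly number of ads.

    Assumes a datetime column `date` and a numeric column `price`
    (hypothetical names).
    """
    return (
        df.set_index("date")["price"]
        .resample("W")  # calendar weeks ending on Sunday
        .agg(["mean", "count"])
        .rename(columns={"mean": "avg_price", "count": "n_listings"})
    )

# Toy data: three ads in the first week, one in the second
toy = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-01-06", "2020-01-07", "2020-01-08", "2020-01-14"]
    ),
    "price": [100.0, 120.0, 140.0, 200.0],
})
summary = weekly_price_summary(toy)

# Pearson correlation between weekly ad count and weekly average price
corr = summary["n_listings"].corr(summary["avg_price"])
```

On real data, `summary` can be plotted directly, and `corr` gives a single number to compare against what the second chart suggests.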

The figure shows the average listing price of entire homes/apartments, private rooms, shared rooms and hotel rooms across the neighbourhoods. On average, entire homes/apartments command the highest asking prices, and many neighbourhoods have no listings for shared or hotel rooms.

Similarly, the chart above shows the average deviation of listing prices within each neighbourhood. Listing prices fluctuate the most around West Roxbury, while Hyde Park has the lowest average deviation among its listings.
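The per-neighbourhood numbers behind these two charts can be approximated with a pivot table and a group-by. This is only a sketch: the column names `neighbourhood`, `room_type` and `price` are assumed, and I take "deviation" to mean the sample standard deviation, which may not be exactly what the notebook computes.

```python
import pandas as pd

def price_stats_by_neighbourhood(df):
    """Mean price per (neighbourhood, room type) and price spread per neighbourhood."""
    mean_by_type = df.pivot_table(
        index="neighbourhood",
        columns="room_type",
        values="price",
        aggfunc="mean",
    )
    # Sample standard deviation of all listing prices in each neighbourhood
    spread = df.groupby("neighbourhood")["price"].std()
    return mean_by_type, spread

# Toy data with placeholder neighbourhood names
toy = pd.DataFrame({
    "neighbourhood": ["A", "A", "A", "B", "B"],
    "room_type": [
        "Entire home/apt", "Entire home/apt", "Private room",
        "Private room", "Private room",
    ],
    "price": [100.0, 200.0, 50.0, 80.0, 80.0],
})
mean_by_type, spread = price_stats_by_neighbourhood(toy)
```

Room types with no listings in a neighbourhood show up as missing values in `mean_by_type`, which matches the empty bars for shared and hotel rooms in the chart.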

Since the data is scraped, it is not ideal and needs relevant preprocessing and feature engineering.

Data preprocessing:
As part of preprocessing, the following actions were performed:

Converting dates to the pandas datetime format
Removing currency symbols from price and converting it to a continuous float type, which later lets the model predict continuous values
Removing the percent symbols from features such as acceptance rate and converting them to integers
Removing outliers, so that abnormalities in the data do not propagate into our statistics and modelling:

The function removes values above the 0.999 quantile and below the 0.001 quantile, which ensures that no boundary cases remain, such as false ads with a $0 listing price or ads with abnormally high prices. This step is performed for every room type.
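The preprocessing steps listed above might look roughly like this in pandas. This is a minimal sketch rather than the notebook's code: the function names `clean_listings` and `remove_price_outliers`, the column names and the toy rows are all assumptions.

```python
import pandas as pd

def clean_listings(df):
    """Fix column types; every column name here is an assumption."""
    out = df.copy()
    out["host_since"] = pd.to_datetime(out["host_since"])
    # "$1,250.00" -> 1250.0
    out["price"] = (
        out["price"].str.replace(r"[$,]", "", regex=True).astype(float)
    )
    # "95%" -> 95; the nullable Int64 type keeps missing rates as <NA>
    rate = out["host_acceptance_rate"].str.rstrip("%")
    out["host_acceptance_rate"] = pd.to_numeric(rate).astype("Int64")
    return out

def remove_price_outliers(df, lower=0.001, upper=0.999):
    """Keep rows whose price lies between the two quantiles of its room type."""
    prices = df.groupby("room_type")["price"]
    low = prices.transform(lambda s: s.quantile(lower))
    high = prices.transform(lambda s: s.quantile(upper))
    return df[(df["price"] >= low) & (df["price"] <= high)]

raw = pd.DataFrame({
    "host_since": ["2015-03-01", "2018-07-15", "2019-11-30"],
    "price": ["$1,250.00", "$0.00", "$80.00"],
    "host_acceptance_rate": ["95%", None, "100%"],
    "room_type": ["Entire home/apt", "Private room", "Private room"],
})
clean = clean_listings(raw)

# Toy prices with one fake $0 ad and one absurdly expensive ad
toy = pd.DataFrame({
    "room_type": "Entire home/apt",
    "price": [0.0] + [float(p) for p in range(100, 1100)] + [1_000_000.0],
})
trimmed = remove_price_outliers(toy)
```

Note that quantile trimming always removes the most extreme rows of each group, so for room types with very few listings it is worth checking how many rows survive.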

Feature engineering:
To further enhance the feature set, some of the columns need to be parsed. For example, the ‘host_verifications’ and ‘amenities’ columns can be processed and parsed into an effective source of information.
As a sample, here are the amenities and host verifications columns containing the raw scraped values:

After the relevant preprocessing with the get_unique_features and get_list_as_features functions, we get the following results:

Amenities parsed as features

Host verifications parsed as features
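To give an idea of how such list-like columns can be turned into 0/1 features, here is a hypothetical reimplementation. Only the names get_unique_features and get_list_as_features come from the project; their bodies, the helper `parse_list_string` and the assumed raw format (a bracketed, comma-separated string) are my own guesses, so check the notebook for the real versions.

```python
import pandas as pd

def parse_list_string(raw):
    """Split a scraped list-like string such as "['email', 'phone']" into tokens."""
    if not isinstance(raw, str):
        return []  # missing values become an empty list
    tokens = (t.strip(" '\"") for t in raw.strip("[]{}").split(","))
    return [t for t in tokens if t]

def get_unique_features(column):
    """Sorted list of every distinct token found in the column."""
    unique = set()
    for raw in column:
        unique.update(parse_list_string(raw))
    return sorted(unique)

def get_list_as_features(column):
    """One 0/1 indicator column per distinct token."""
    parsed = column.apply(parse_list_string)
    data = {
        name: parsed.apply(lambda items, name=name: int(name in items))
        for name in get_unique_features(column)
    }
    return pd.DataFrame(data, index=column.index)

# Toy sample in the assumed raw format
verifications = pd.Series([
    "['email', 'phone', 'facebook']",
    "['email', 'jumio', 'government_id']",
    None,
])
features = get_list_as_features(verifications)
```

The resulting frame can be joined back to the listings with `pd.concat([listings, features], axis=1)`. Tokens that themselves contain commas would be split incorrectly by this simple parser.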

Additionally, like the host verifications feature, the data contains other features that describe the host further. One such field is ‘host_since’, the date when the ad poster joined the platform. By calculating the number of days the host has been on the platform, we can enhance the features. We will see later, during evaluation, whether this affects the model at all.

Host presence in days on the platform
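The day count can be derived with a simple date subtraction. In this sketch the reference date, the output column name `host_days` and the function name `add_host_days` are assumptions; a natural choice for the reference date would be the day the data was scraped.

```python
import pandas as pd

def add_host_days(df, reference_date):
    """Add `host_days`: days between `host_since` and the reference date."""
    out = df.copy()
    host_since = pd.to_datetime(out["host_since"])
    out["host_days"] = (pd.Timestamp(reference_date) - host_since).dt.days
    return out

# Toy data; the reference date is an arbitrary example
toy = pd.DataFrame({"host_since": ["2019-01-01", "2020-01-01"]})
with_days = add_host_days(toy, "2020-01-31")
```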

After preprocessing, feature engineering and encoding the data into the respective formats, we split the data into training and validation sets, as in any other regression problem.
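A typical way to do this split is scikit-learn's `train_test_split`. The article does not say which tool or ratio the notebook uses, so the 80/20 ratio, the random seed and the toy feature columns below are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy feature table standing in for the encoded listings
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "host_days": rng.integers(0, 3000, size=100),
    "accommodates": rng.integers(1, 8, size=100),
    "price": rng.uniform(30, 500, size=100),
})

X = toy.drop(columns="price")  # features
y = toy["price"]               # regression target

# Hold out 20% of the rows for validation; fix the seed for reproducibility
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```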

For this project, we use XGBoost as our model. Please note that the purpose of this experiment is not to achieve the highest accuracy, but to build a pipeline and develop critical thinking about the problem. The model can be swapped for another one or run with different parameters:

Model performance on training set (Actual=Blue, Predicted=Orange)

Model performance on validation set (Actual=Blue, Predicted=Orange)

Feature importance (see the full picture in the GitHub notebook)

Features such as facebook, jumio and government_id, which we derived earlier, can clearly be seen contributing to the model.

Note: The ‘amenities’ feature set that we parsed earlier was not used in this model; however, you can try adding it too.

The Airbnb data also includes reviews of the listings on the platform, and with this project I have included methods to perform sentiment analysis on them. This can be added to further improve model performance. I use spaCy’s TextBlob for this purpose, which can easily be replaced.

spaCy also makes mistakes: in the second record, it assigns a negative polarity because of the word ‘base’. Overall, though, the performance is acceptable.

Thank you for reading the article. Feel free to use the repo provided for your experiments.
