This article lays a foundation for analyzing the data published on Airbnb. It demonstrates how to turn scraped data into features that help a model predict a listing’s price.
Airbnb, Inc. is an American vacation rental online marketplace company. Airbnb maintains and hosts a marketplace, accessible to consumers on its website or via an app.
The GitHub repository below provides the Python notebook and step-by-step documentation for setting up the project:
https://github.com/rafayullah/Airbnb
The figure shows the average price of listings throughout the year, together with a weekly average of the number of ads posted on the platform.
The figure shows the correlation between the number of listings and the average price. As the number of listings rises, the average price of the listings rises along with it.
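The weekly aggregation and the correlation described above can be sketched in pandas as follows. This is a minimal illustration on toy data, not the notebook's code: the column names (date, listing_id, price) and the values are assumptions made for the example.

```python
import pandas as pd

# Toy stand-in for the scraped data: one row per listing per day.
# Column names and values are assumptions for illustration only.
calendar = pd.DataFrame({
    "date": pd.to_datetime([
        "2019-01-01", "2019-01-02",
        "2019-01-08", "2019-01-09", "2019-01-10",
        "2019-01-15", "2019-01-16", "2019-01-17", "2019-01-18",
    ]),
    "listing_id": [1, 2, 1, 2, 3, 1, 2, 3, 4],
    "price": [100.0, 110.0, 115.0, 120.0, 125.0, 130.0, 135.0, 140.0, 145.0],
})

# Aggregate per calendar week: mean price and number of listing rows.
weekly = (
    calendar.set_index("date")
    .resample("W")
    .agg({"price": "mean", "listing_id": "count"})
    .rename(columns={"price": "avg_price", "listing_id": "n_listings"})
)

# Pearson correlation between weekly listing count and weekly average price.
correlation = weekly["avg_price"].corr(weekly["n_listings"])

print(weekly)
print(f"Pearson correlation: {correlation:.3f}")
```

A correlation close to 1 on the real data would support the trend seen in the figure, although it says nothing about cause and effect.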
Since the data is scraped, it is not ideal and needs preprocessing and feature engineering.
Data preprocessing:
As part of preprocessing, the following actions were performed:
- Converting dates to the pandas datetime format
- Removing currency symbols from the price and converting it to a continuous float type, which later lets the model predict continuous values
- Removing the percent signs from features such as the acceptance rate to convert them to integers
- Removing outliers, so that abnormalities in the data do not carry over into our statistics and modelling
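The four steps above can be sketched in pandas roughly as follows. The column names and toy values are assumptions, and the interquartile-range rule used for outliers is one common choice made here for illustration; the notebook may use a different criterion.

```python
import pandas as pd

# Toy rows mimicking raw scraped values; column names are assumptions.
raw = pd.DataFrame({
    "last_scraped": ["2019-03-01", "2019-03-02", "2019-03-03", "2019-03-04",
                     "2019-03-05", "2019-03-06", "2019-03-07", "2019-03-08"],
    "price": ["$85.00", "$120.00", "$95.00", "$110.00",
              "$100.00", "$130.00", "$90.00", "$9,999.00"],
    "host_acceptance_rate": ["95%", "100%", "80%", None,
                             "67%", "90%", "100%", "75%"],
})

df = raw.copy()

# 1. Convert the date column to pandas datetime.
df["last_scraped"] = pd.to_datetime(df["last_scraped"])

# 2. Strip "$" and thousands separators, then cast the price to float.
df["price"] = df["price"].str.replace(r"[$,]", "", regex=True).astype(float)

# 3. Strip "%" and store the rate as a nullable integer (missing values stay missing).
df["host_acceptance_rate"] = pd.to_numeric(
    df["host_acceptance_rate"].str.rstrip("%")
).astype("Int64")

# 4. Drop price outliers with the 1.5 * IQR rule (an assumed criterion).
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
clean = df[df["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

print(clean)
```

The nullable Int64 type is used so that listings without an acceptance rate are kept rather than dropped or filled with an arbitrary value.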
Feature engineering:
To further enhance the feature set, some of the columns need to be parsed. For example, the ‘host_verifications’ and ‘amenities’ columns can be parsed into an effective source of information.
As a sample, consider the ‘amenities’ and ‘host_verifications’ columns, which initially contain the raw scraped values.
After processing them with the get_unique_features and get_list_as_features functions, the individual values (for example facebook, jumio and government_id) become available as separate features.
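To make the idea concrete, here is a minimal sketch of turning a list-like text column into one binary feature per value. The helper names parse_list_cell, unique_values and list_column_to_features are hypothetical stand-ins, not the repository's get_unique_features and get_list_as_features, whose implementation may differ. The sketch also assumes the raw cells look like Python list literals; the ‘amenities’ column may use a different raw format and need its own parser.

```python
import ast

import pandas as pd


def parse_list_cell(cell):
    """Parse a cell such as "['email', 'phone']" into a list (hypothetical helper)."""
    if not isinstance(cell, str):
        return []  # missing value
    value = ast.literal_eval(cell)
    return value if isinstance(value, list) else []


def unique_values(parsed):
    """Collect every distinct item that appears in a Series of lists."""
    return sorted({item for items in parsed for item in items})


def list_column_to_features(df, column):
    """Replace a list-like text column with one 0/1 column per distinct item."""
    parsed = df[column].apply(parse_list_cell)
    out = df.drop(columns=[column]).copy()
    for value in unique_values(parsed):
        out[value] = parsed.apply(lambda items, v=value: int(v in items))
    return out


# Toy data; the raw string format is an assumption.
listings = pd.DataFrame({
    "id": [1, 2, 3],
    "host_verifications": [
        "['email', 'phone', 'facebook']",
        "['email', 'jumio', 'government_id']",
        None,
    ],
})

features = list_column_to_features(listings, "host_verifications")
print(features)
```

Each listing ends up with a 1 in the columns for the verifications its host has and a 0 elsewhere, which is a form most regression models can consume directly.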
Additionally, like the host verifications feature, the data contains other fields that describe the host. One such field is ‘host_since’, the date when the ad poster joined the platform. By calculating the number of days the host has been on the platform, we can enhance the feature set. We will see during evaluation whether this affects the model at all.
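The host-tenure feature can be computed with a simple date subtraction. In this sketch the reference date is taken from a last_scraped column and the new column is called host_days_active; both names are assumptions for illustration.

```python
import pandas as pd

# Toy data; column names other than host_since are assumptions.
listings = pd.DataFrame({
    "host_since": ["2012-05-17", "2016-11-02", None],
    "last_scraped": ["2019-03-01", "2019-03-01", "2019-03-01"],
})

host_since = pd.to_datetime(listings["host_since"])
reference = pd.to_datetime(listings["last_scraped"])

# Number of days between joining the platform and the scrape date.
# Rows with a missing host_since stay missing (NaN).
listings["host_days_active"] = (reference - host_since).dt.days

print(listings)
```

Measuring against the scrape date rather than today's date keeps the feature reproducible when the notebook is rerun later.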
After preprocessing, feature engineering, and encoding the data into the respective formats, we split the data into training and validation sets, as in any other regression problem.
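A split of this kind is commonly done with scikit-learn's train_test_split. The sketch below uses toy data, and the 80/20 ratio and the random seed are assumptions rather than the notebook's settings.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy feature table; the columns and values are illustrative assumptions.
data = pd.DataFrame({
    "accommodates": [2, 4, 1, 6, 3, 2, 5, 4, 2, 8],
    "host_days_active": [300, 1200, 45, 2400, 800, 150, 1900, 600, 90, 2700],
    "price": [80.0, 150.0, 45.0, 260.0, 110.0, 75.0, 210.0, 140.0, 70.0, 320.0],
})

X = data.drop(columns=["price"])  # features
y = data["price"]                 # regression target

# Hold out 20% of the rows for validation; fix the seed for reproducibility.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_valid.shape)
```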
For this project, we use XGBoost as our model. Please note that the purpose of this experiment is not to achieve the highest accuracy but to build a pipeline and encourage critical thinking about the problem. The model can be swapped for a different one or run with different parameters.
The features we derived earlier, such as facebook, jumio and government_id, can clearly be seen contributing to the model.
Note: The ‘amenities’ feature set that we parsed earlier was not used in this model; however, you can try adding it too.
The Airbnb data also includes reviews of the listings on the platform. With this project I have included methods to perform sentiment analysis on them, which can be added as a further feature to try to improve model performance. I use spaCy with TextBlob for this purpose, which can easily be replaced.
Thank you for reading the article. Feel free to use the repo provided for your experiments.