Instead of waking to overlooked “Do not disturb” signs, Airbnb travelers find themselves rising with the birds in a whimsical treehouse, having their morning coffee on the deck of a houseboat, or cooking a shared regional breakfast with their hosts.
New users on Airbnb can book a place to stay in 34,000+ cities across 190+ countries. By accurately predicting where a new user will book their first travel experience, Airbnb can share more personalized content with their community, decrease the average time to first booking, and better forecast demand.
In this Kaggle competition, Airbnb challenges you to predict in which country a new user will make his or her first booking.
- Business Problem
- Use of Machine Learning
- Source of Data
- Existing Approaches
- My Improvements
- Exploratory Data Analysis
- First Cut Solution
- Comparison of Models
- Kaggle Screenshot
- Future Work
- References
- GitHub Repo
- LinkedIn Profile
Business Problem
Airbnb, Inc. is an American vacation rental online marketplace company based in San Francisco, California, United States. Airbnb maintains and hosts a marketplace, accessible to consumers on its website or via an app. Through the service, users can arrange lodging, primarily homestays, and tourism experiences or list their properties for rental. Airbnb does not own any of the listed properties; instead, it profits by receiving commission from each booking.
Use of Machine Learning
We will use machine learning to predict each user's first travel destination, along with the four next most probable destinations, using the provided training dataset.
Model performance is evaluated with the NDCG (Normalized Discounted Cumulative Gain) score computed over the top 5 predictions: https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings/overview/evaluation.
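The metric can be sketched in a few lines for a single user. This is a minimal illustration, assuming one true destination per user and a relevance of 1 for a correct country and 0 otherwise; the function names `dcg_at_k` and `ndcg_at_k` are made up for this example and are not part of any library.

```python
import math

def dcg_at_k(relevances, k=5):
    """Discounted cumulative gain over the first k relevance values."""
    return sum((2 ** rel - 1) / math.log2(i + 2)
               for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(predicted, actual, k=5):
    """NDCG@k for one user with exactly one true destination."""
    relevances = [1 if country == actual else 0 for country in predicted[:k]]
    # With a single relevant item, the ideal ordering puts it first: DCG = 1.
    ideal_dcg = 1.0
    return dcg_at_k(relevances, k) / ideal_dcg

print(ndcg_at_k(["FR"], "FR"))        # correct country ranked first
print(ndcg_at_k(["US", "FR"], "FR"))  # correct country ranked second
```

Placing the correct country second instead of first drops the score from 1 to 1 / log2(3), roughly 0.63, which is why the order of the five predictions matters.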
Source of Data
In this challenge, you are given a list of users along with their demographics, web session records, and some summary statistics. You are asked to predict which country a new user’s first booking destination will be. All the users in this dataset are from the USA.
There are 12 possible outcomes of the destination country: ‘US’, ‘FR’, ‘CA’, ‘GB’, ‘ES’, ‘IT’, ‘PT’, ‘NL’, ‘DE’, ‘AU’, ‘NDF’ (no destination found), and ‘other’. Please note that ‘NDF’ is different from ‘other’: ‘other’ means there was a booking, but to a country not included in the list, while ‘NDF’ means there wasn’t a booking.
The training and test sets are split by dates. In the test set, you will predict all the new users with first activities after 7/1/2014 (note: this was updated on 12/5/15 when the competition restarted). In the sessions dataset, the data only dates back to 1/1/2014, while the users dataset dates back to 2010.
Existing Approaches
- The session dataset holds the history logs of the users and contains data for both train and test users. Only 30% of the users are common to the ‘train_users’ and ‘session’ datasets; the rest of the users belong to the test dataset.
- Because session data is unavailable for most users, some people have trained the model on the ‘train_users’ dataset alone (the user-specific dataset, train_users_2.csv).
- Some have trained the model on all the train users joined with the matching session data. This approach produces a final dataset in which a large number of cell values are null.
- People who used the above approaches got relatively lower scores than those using later approaches.
My Improvements
- Basic feature engineering on the date-related fields, such as extracting the day, date, month and year.
- Bucketing the age of users into young, middle, old and unknown categories.
- The session dataset has multiple records for the same user, so we first convert it into a single row per user by concatenating the ‘action’, ‘action_type’ and ‘Device_type’ fields and summing ‘sec_elapsed’ for each user.
- Calculating the skew, kurtosis, standard deviation, variance, maximum, minimum and average of ‘sec_elapsed’ for each user.
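The per-user statistics on ‘sec_elapsed’ can be sketched with a pandas group-by. This is only an illustration on made-up rows: the column names follow the ones used in this post, and the output column names (`total`, `minimum`, and so on) are hypothetical.

```python
import pandas as pd

# Toy stand-in for the session data; the rows are made up.
sessions = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u1", "u2", "u2"],
    "sec_elapsed": [10.0, 20.0, 30.0, 100.0, 5.0, None],
})

# Treat missing durations as 0 before aggregating.
sessions["sec_elapsed"] = sessions["sec_elapsed"].fillna(0)

# One row of summary statistics per user.
stats = sessions.groupby("user_id")["sec_elapsed"].agg(
    total="sum",
    mean="mean",
    minimum="min",
    maximum="max",
    std="std",
    variance="var",
    skew="skew",
    kurtosis=lambda s: s.kurt(),
).reset_index()

print(stats)
```

Skew and kurtosis come out as NaN for users with too few session rows (fewer than three and four respectively), so these columns may need their own null handling before modelling.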
Exploratory Data Analysis
- As shown in the above plots, the ‘date_first_booking’, ‘age’ and ‘first_affiliate_tracked’ fields of the ‘train_users’ dataset contain null values.
→ Feature ‘date_first_booking’ : ~58% null values
→ Feature ‘age’ : ~41% null values
→ Feature ‘first_affiliate_tracked’ : ~2% null values
- As shown in the above plots, the maximum value of the age feature is 2014 and the minimum is 1, neither of which is a valid age. A value such as 2014 is probably a birth year or the current year entered by mistake.
- The age feature has 41% null values.
- From the above boxplot, ‘ES’ and ‘PT’ are separable to some extent: users aged 30 to 34 prefer these as their destination countries.
- In the above plot, the ‘User signup method’ distribution is almost the same for all users, with ‘Basic’ being the most common for every country.
- From the above plot, we observed that almost 95% of users chose English as their language, since most travelers are from the ‘US’.
- The data is highly imbalanced: the majority of travelers prefer ‘US’, and the least preferred destination is ‘PT’.
- Airbnb grew rapidly in the period from 2010 to 2014.
- The majority of customers used ‘Windows’ or ‘Mac’ as their first device for booking.
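The null percentages and the age range quoted above can be checked with a few pandas calls. The frame below is a made-up stand-in for the users data, so the printed numbers will not match the real ones; the 15 to 95 bounds are the valid age range used later in this post.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_users_2.csv; the values are made up.
users = pd.DataFrame({
    "age": [25.0, np.nan, 2014.0, 1.0, 40.0],
    "date_first_booking": ["2014-01-05", None, None, "2013-07-01", None],
    "first_affiliate_tracked": ["untracked", "omg", None, "linked", "omg"],
})

# Percentage of null values per column, highest first.
null_pct = users.isnull().mean().mul(100).sort_values(ascending=False)
print(null_pct)

# Range of the raw age column, which exposes implausible entries.
print(users["age"].min(), users["age"].max())

# Number of non-null ages that fall outside the plausible range.
invalid_age = users["age"].notna() & ~users["age"].between(15, 95)
print(int(invalid_age.sum()))
```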
Bivariate Analysis :
- From the above plot, users in the ‘Other’ category aged 25 to 30 prefer to visit ‘CA’ and ‘NL’, while those aged 45 to 50 prefer to visit ‘IT’.
- Females aged 25 to 30 prefer to visit ‘PT’ and ‘NL’.
First Cut Solution
- The first challenge is to convert the session data, which has multiple records per user, into one record for each user.
- This can be done by grouping on the user ids and converting the multiple records of the ‘action’, ‘action_type’ and ‘action_details’ fields into text form.
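```python
import math
import re

def conv_to_strings(items):
    # Join all values of one user into a single space-separated string.
    # Missing values arrive as 'nan' after str(); blank out only exact matches
    # so that tokens merely containing 'nan' are left untouched.
    items = [str(i) for i in items]
    items = [re.sub(r'^nan$', '', i) for i in items]
    return ' '.join(items)

def conv_to_strings_unique(items):
    # Same as above, but each distinct value is kept only once.
    items = [str(i) for i in items]
    items = [re.sub(r'^nan$', '', i) for i in items]
    return ' '.join(set(items))

def replace_nan_to_0(items):
    # For the sec_elapsed field: replace null values with 0.
    return [0 if math.isnan(i) else i for i in items]
```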
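How such helpers plug into a group-by, and how the resulting text can be vectorised, might look like the sketch below. It is an illustration under assumptions, not the exact pipeline used here: `join_tokens` is a simplified stand-in for the string helper above, the rows are made up, and the TF-IDF step uses scikit-learn's `TfidfVectorizer` with a whitespace token pattern.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

def join_tokens(values):
    # Skip missing values and join the rest into one space-separated string.
    return " ".join(str(v) for v in values if pd.notna(v))

# Toy stand-in for the session data; the rows are made up.
sessions = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2", "u2"],
    "action": ["search", "show", "search", "lookup", None],
    "action_type": ["click", "view", "click", None, "data"],
    "sec_elapsed": [12.0, None, 30.0, 7.0, 3.0],
})

# One row per user: text columns become strings, durations are summed.
per_user = sessions.groupby("user_id").agg(
    action=("action", join_tokens),
    action_type=("action_type", join_tokens),
    sec_elapsed=("sec_elapsed", lambda s: s.fillna(0).sum()),
).reset_index()

# Treat every whitespace-separated token as one term.
vectorizer = TfidfVectorizer(token_pattern=r"\S+")
action_tfidf = vectorizer.fit_transform(per_user["action"])

print(per_user)
print(action_tfidf.shape)
```

Swapping `TfidfVectorizer` for `CountVectorizer` gives the plain bag-of-words variant with the same interface.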
- Applied the TF-IDF and bag-of-words techniques to the text fields for better feature engineering.
- Extracted the day, date, year and month from the date fields and applied one-hot encoding.
- Only ages in the range 15 to 95 are considered valid; all other values are set to null. Age is then bucketed into young (below 40), middle (40 to 59), old (60 to 95) and unknown (null values) categories.
```python
def set_age_group(x):
    if x < 40:
        return 'Young'
    elif x >= 40 and x < 60:
        return 'Middle'
    elif x >= 60 and x <= 95:
        return 'Old'
    else:
        return 'Unknown_age'
```
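The age cleaning, bucketing and date steps can also be expressed in a vectorised way. The sketch below is one possible variant, not the code used for the submission: it swaps the row-wise function for `pd.cut`, assumes a `date_account_created` column, and the `created_*` feature names are hypothetical.

```python
import pandas as pd

# Toy stand-in for the users data; the rows are made up.
users = pd.DataFrame({
    "date_account_created": ["2014-01-05", "2013-07-21", "2012-11-30"],
    "age": [28.0, 2014.0, 63.0],
})

# Ages outside 15 to 95 are treated as missing.
users["age"] = users["age"].where(users["age"].between(15, 95))

# Left-closed bins: [15, 40) Young, [40, 60) Middle, [60, 96) Old.
age_group = pd.cut(users["age"], bins=[15, 40, 60, 96], right=False,
                   labels=["Young", "Middle", "Old"])
# Missing ages get their own explicit category.
users["age_group"] = (age_group.cat.add_categories("Unknown_age")
                      .fillna("Unknown_age"))

# Split the date into its parts.
created = pd.to_datetime(users["date_account_created"])
users["created_year"] = created.dt.year
users["created_month"] = created.dt.month
users["created_day"] = created.dt.day

# One-hot encode the categorical and date-part columns.
encoded = pd.get_dummies(
    users.drop(columns=["date_account_created", "age"]),
    columns=["age_group", "created_year", "created_month", "created_day"],
)
print(encoded.columns.tolist())
```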
Comparison of Models
- I trained models using Logistic Regression, Random Forest, XGBoost and CatBoost and got the following NDCG scores. The scores below are for the models with the minimum multi-class log loss; to pick the best model, the actual train data was split into train, test and CV sets. For the final submission, the model is trained on the entire train dataset provided on Kaggle:
- Of the above models, CatBoost gave the highest NDCG score.
CatBoost gave the best private NDCG score of 0.88402
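Whichever model is used, the submission needs the five most probable countries per user, in order. A minimal sketch of that step is shown below, assuming a classifier that exposes class probabilities and a matching list of class labels (as scikit-learn style models do with `predict_proba` and `classes_`); `top_k_countries` is a hypothetical helper and the probabilities are made up.

```python
import numpy as np

def top_k_countries(probabilities, classes, k=5):
    """Return, for each row, the k class labels with the highest probability."""
    probabilities = np.asarray(probabilities)
    classes = np.asarray(classes)
    # argsort is ascending, so reverse each row and keep the first k columns.
    order = np.argsort(probabilities, axis=1)[:, ::-1][:, :k]
    return classes[order]

classes = ["AU", "CA", "DE", "ES", "FR", "NDF", "US"]
proba = [
    [0.01, 0.02, 0.01, 0.03, 0.05, 0.58, 0.30],
    [0.02, 0.05, 0.03, 0.10, 0.20, 0.15, 0.45],
]
top5 = top_k_countries(proba, classes)
print(top5)
```

Each row of `top5` is already sorted from most to least probable, which is the order the NDCG metric rewards.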
Future Work
- Using bigrams, trigrams or word2vec features. These produce higher-dimensional data and increase model complexity, but might give better results.
- More hyperparameter tuning might improve the result.
- A neural network could also be used to get better results.