Airbnb is an American online marketplace for vacation rentals, based in San Francisco, California. The company maintains and hosts the marketplace, which consumers reach through its website or app. Hosts can list spare rooms, whole properties, or parts of a property for rent, while travelers search for lodging (primarily homestays) and tourism experiences by neighborhood or location. Airbnb recommends prices for the neighborhood, and users book the deal that suits them best.
Thanks to Kaggle and Udacity, I got the chance to analyze Airbnb listings for the city of Boston. The Boston Airbnb listings dataset has features such as neighborhood, property_type, bedrooms, bathrooms, beds, price, reviews, and ratings. I want to find out which features affect the price in Boston and what conclusions can be drawn from them. I am most interested in training and evaluating a model and seeing how well it predicts Airbnb prices in Boston.
To understand the dataset, we have to explore it. Python, pandas, NumPy, Matplotlib, seaborn, and scikit-learn (sklearn) make these data science tasks much easier. pandas is excellent for loading, cleaning, and transforming datasets. seaborn is a handy package for visualizing the results of pandas transformations; it offers high-level functions for plotting bar charts, histograms, distributions, box plots, and more. I will use all of these packages to explore the data, through the following activities:
- Import packages and read Boston Airbnb datasets
- Data cleaning and transformation
- Numerical features analysis
- Categorical features analysis
NumPy and pandas are imported for linear algebra and data processing, respectively; matplotlib's pyplot and seaborn for plotting; and sklearn modules for training and evaluating a model.
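The imports described above might look like the following sketch. The specific scikit-learn names (train_test_split, LinearRegression, r2_score, mean_squared_error) are assumptions on my part, since the text only says that sklearn is used for training and evaluating a model.

```python
# Linear algebra and data processing
import numpy as np
import pandas as pd

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns

# Model training and evaluation.
# These particular sklearn imports are assumptions; any regressor
# and metric from scikit-learn could take their place.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
```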
After importing the necessary packages, let’s load the Boston Airbnb listings dataset into memory. The pandas read_csv function makes reading CSV files easy: it takes the file path, along with other optional parameters, and returns a DataFrame object.
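As a minimal sketch of the loading step, the snippet below reads a tiny inline CSV so that it runs anywhere. The file name listings.csv, the use of the id column as the index, and the sample rows are all assumptions made for illustration; they are not taken from the real dataset.

```python
import io
import pandas as pd

# Tiny inline stand-in for the real file; the rows are made up.
sample_csv = io.StringIO(
    'id,name,price,bedrooms\n'
    '1,Sunny room,$65.00,1\n'
    '2,Downtown loft,"$1,250.00",2\n'
)

# With the real data this would be something like:
#   listings = pd.read_csv("listings.csv", index_col="id")
# where "listings.csv" is an assumed file name.
listings = pd.read_csv(sample_csv, index_col="id")

print(listings.head())
```

Note that price is read as text because of the currency symbol and thousands separator, which is why it needs converting later.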
Exploring datasets is one of my favorite data science activities. It reveals interesting, sometimes surprising, facts about the features of the dataset and helps identify the features that most affect the target variable. pandas offers some handy tools for this: the shape attribute returns the number of rows and columns, and the info function prints the full list of columns with each column's data type and count of non-null values. These help me understand the nature of the features.
As part of this activity, I will first clean the dataset, then apply simple transformations, and finally analyze the numerical and categorical features.
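The shape attribute and info function mentioned above can be tried on a small frame like the one below. The three-row DataFrame is a made-up stand-in for the listings data; only the column names are borrowed from it.

```python
import pandas as pd

# Made-up stand-in for the listings DataFrame.
df = pd.DataFrame({
    "price": ["$65.00", "$120.00", "$95.00"],
    "bedrooms": [1.0, None, 2.0],
    "square_feet": [None, None, 800.0],
})

# shape is an attribute (no parentheses): (rows, columns)
n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")

# info() prints each column's non-null count and dtype
df.info()

# count() returns the same non-null counts as a Series
non_null = df.count()
```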
Observation:
We can see that the Boston Airbnb listings dataset has 3585 rows and 94 columns. That is a lot of columns. We need to know more about the column types and null value counts so that we can clean the data next.
Int64Index: 3585 entries, 12147973 to 14504422
Data columns (total 94 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 listing_url 3585 non-null object
1 scrape_id 3585 non-null int64
2 last_scraped 3585 non-null object
3 name 3585 non-null object
4 summary 3442 non-null object
5 space 2528 non-null object
6 description 3585 non-null object
7 experiences_offered 3585 non-null object
8 neighborhood_overview 2170 non-null object
9 notes 1610 non-null object
10 transit 2295 non-null object
11 access 2096 non-null object
12 interaction 2031 non-null object
13 house_rules 2393 non-null object
14 thumbnail_url 2986 non-null object
15 medium_url 2986 non-null object
16 picture_url 3585 non-null object
17 xl_picture_url 2986 non-null object
18 host_id 3585 non-null int64
19 host_url 3585 non-null object
20 host_name 3585 non-null object
21 host_since 3585 non-null object
22 host_location 3574 non-null object
23 host_about 2276 non-null object
24 host_response_time 3114 non-null object
25 host_response_rate 3114 non-null object
26 host_acceptance_rate 3114 non-null object
27 host_is_superhost 3585 non-null object
28 host_thumbnail_url 3585 non-null object
29 host_picture_url 3585 non-null object
30 host_neighbourhood 3246 non-null object
31 host_listings_count 3585 non-null int64
32 host_total_listings_count 3585 non-null int64
33 host_verifications 3585 non-null object
34 host_has_profile_pic 3585 non-null object
35 host_identity_verified 3585 non-null object
36 street 3585 non-null object
37 neighbourhood 3042 non-null object
38 neighbourhood_cleansed 3585 non-null object
39 neighbourhood_group_cleansed 0 non-null float64
40 city 3583 non-null object
41 state 3585 non-null object
42 zipcode 3547 non-null object
43 market 3571 non-null object
44 smart_location 3585 non-null object
45 country_code 3585 non-null object
46 country 3585 non-null object
47 latitude 3585 non-null float64
48 longitude 3585 non-null float64
49 is_location_exact 3585 non-null object
50 property_type 3582 non-null object
51 room_type 3585 non-null object
52 accommodates 3585 non-null int64
53 bathrooms 3571 non-null float64
54 bedrooms 3575 non-null float64
55 beds 3576 non-null float64
56 bed_type 3585 non-null object
57 amenities 3585 non-null object
58 square_feet 56 non-null float64
59 price 3585 non-null object
60 weekly_price 892 non-null object
61 monthly_price 888 non-null object
62 security_deposit 1342 non-null object
63 cleaning_fee 2478 non-null object
64 guests_included 3585 non-null int64
65 extra_people 3585 non-null object
66 minimum_nights 3585 non-null int64
67 maximum_nights 3585 non-null int64
68 calendar_updated 3585 non-null object
69 has_availability 0 non-null float64
70 availability_30 3585 non-null int64
71 availability_60 3585 non-null int64
72 availability_90 3585 non-null int64
73 availability_365 3585 non-null int64
74 calendar_last_scraped 3585 non-null object
75 number_of_reviews 3585 non-null int64
76 first_review 2829 non-null object
77 last_review 2829 non-null object
78 review_scores_rating 2772 non-null float64
79 review_scores_accuracy 2762 non-null float64
80 review_scores_cleanliness 2767 non-null float64
81 review_scores_checkin 2765 non-null float64
82 review_scores_communication 2767 non-null float64
83 review_scores_location 2763 non-null float64
84 review_scores_value 2764 non-null float64
85 requires_license 3585 non-null object
86 license 0 non-null float64
87 jurisdiction_names 0 non-null float64
88 instant_bookable 3585 non-null object
89 cancellation_policy 3585 non-null object
90 require_guest_profile_picture 3585 non-null object
91 require_guest_phone_verification 3585 non-null object
92 calculated_host_listings_count 3585 non-null int64
93 reviews_per_month 2829 non-null float64
dtypes: float64(18), int64(14), object(62)
Observations:
- Some columns have very few non-null values. I am going to remove these columns from the dataset.
- Columns such as host_url, medium_url, and picture_url are not useful for the analysis and should be removed.
- Columns such as price, cleaning_fee, security_deposit, and host_response_rate are of type object. These columns can be converted to a numeric type.
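To make the first observation concrete, here is a small sketch that turns some of the non-null counts from the info output above into a share of missing values per column. The 0.7 cutoff is an assumption chosen for illustration, not a value stated in this post.

```python
import pandas as pd

TOTAL_ROWS = 3585  # number of entries reported by info()

# Non-null counts copied from the info() output for a few columns.
non_null_counts = pd.Series({
    "square_feet": 56,
    "weekly_price": 892,
    "monthly_price": 888,
    "security_deposit": 1342,
    "notes": 1610,
    "cleaning_fee": 2478,
    "bedrooms": 3575,
    "license": 0,
})

# Share of rows where each column is missing.
missing_share = 1 - non_null_counts / TOTAL_ROWS

# Assumed cutoff: flag columns missing in more than 70% of rows.
THRESHOLD = 0.7
sparse_columns = missing_share[missing_share > THRESHOLD].sort_values(ascending=False)

print(sparse_columns.round(3))
```

On a loaded DataFrame, the same shares can be computed directly with df.isna().mean().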
Based on the above observations, I am going to write a function that uses pandas high-level functions to drop columns that are not useful, drop columns with too few values, fill NA values, and convert some object-type columns to numeric columns. This will clean the data and make it easier to work with.
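A cleaning function along these lines could look like the sketch below. It is not the exact function used in this analysis: the name clean_listings, its parameters, the 0.7 missing-value cutoff, and the choice of median filling are all assumptions, and the four-row frame is made-up sample data.

```python
import pandas as pd

def clean_listings(df, drop_cols=(), max_missing_share=0.7):
    """Hypothetical helper: drop, convert, and fill columns."""
    # 1. Drop columns that are not useful (URLs and the like).
    cleaned = df.drop(columns=[c for c in drop_cols if c in df.columns])

    # 2. Drop columns whose share of missing values is too high.
    keep = cleaned.isna().mean() <= max_missing_share
    cleaned = cleaned.loc[:, keep].copy()

    # 3. Convert currency and percentage strings to numbers.
    text_numeric = ("price", "cleaning_fee", "security_deposit",
                    "host_response_rate")
    for col in text_numeric:
        if col in cleaned.columns:
            stripped = cleaned[col].str.replace(r"[$,%]", "", regex=True)
            cleaned[col] = pd.to_numeric(stripped, errors="coerce")

    # 4. Fill remaining gaps in numeric columns with the column median
    #    (an assumed strategy; mean or zero are other options).
    numeric_cols = cleaned.select_dtypes(include="number").columns
    medians = cleaned[numeric_cols].median()
    cleaned[numeric_cols] = cleaned[numeric_cols].fillna(medians)
    return cleaned

# Made-up sample rows using column names from the dataset.
raw = pd.DataFrame({
    "host_url": ["u1", "u2", "u3", "u4"],
    "price": ["$65.00", "$1,250.00", "$95.00", "$120.00"],
    "cleaning_fee": ["$10.00", None, "$30.00", None],
    "host_response_rate": ["100%", "90%", None, "80%"],
    "square_feet": [None, None, None, 800.0],
    "bedrooms": [1.0, None, 2.0, 3.0],
})

tidy = clean_listings(raw, drop_cols=["host_url"])
print(tidy)
```

Dropping sparse columns before filling matters here: filling first would hide how little real data a column such as square_feet contains.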