How Much Does Airbnb Cost in Hong Kong?

Building a regression model to predict the prices

Executive Summary

The goal of this project is to predict the total cost for two people staying 4 nights in an Airbnb in Hong Kong. We downloaded the data from insideairbnb.com and followed the IICE method (import data, inspect data, clean data, explore data).

Understanding the Data

After importing the data, we inspected the entries and variables so that we could bring them into the format we needed. We created factor variables and converted some character variables to numeric values. While skimming the data, we found several anomalies, such as extreme outliers and NAs, which we either removed or replaced. We did the following to clean the data:

  • Firstly, we selected only the relevant columns to reduce the size of our dataframe and created our variable of interest, price_4_nights, which is the cost for two people staying in Hong Kong for 4 nights. We also removed listings that accommodate fewer than two people, as two travellers will most likely not look for such Airbnbs.

  • Secondly, we grouped the room types, property types, bed types and cancellation policies into the most popular categories and “Other”.

  • Thirdly, we grouped the different areas into 3 zones of Hong Kong as we believe that certain districts will be more appealing for a tourist than others and could thus lead to higher Airbnb prices.

  • Fourthly, preliminary regression analysis showed us that the neighbourhood, room type, and the number of bedrooms had the most significant effect on price.

  • Lastly, we created new variables, such as the total number of amenities, and looked at the cancellation policy and security deposit, since they could also have an impact on the price.
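The first two cleaning steps above can be sketched roughly as follows. This is a minimal illustration on a made-up toy data frame, not the code used for the report. In particular, the formula for price_4_nights (four nightly rates, plus the cleaning fee, plus a per-night surcharge when only one guest is included), treating a missing cleaning fee as zero, and keeping only the single most common property type are all assumptions made for the example.

```r
library(dplyr)
library(readr)
library(forcats)

# Toy data standing in for the real listings (all values are made up)
toy <- tibble(
  price           = c("$1,395.00", "$783.00", "$500.00", "$250.00", "$600.00"),
  cleaning_fee    = c("$310.00", NA, "$100.00", "$50.00", "$0.00"),
  extra_people    = c("$155.00", "$155.00", "$0.00", "$80.00", "$0.00"),
  guests_included = c(2, 1, 2, 1, 2),
  accommodates    = c(3, 2, 1, 4, 2),
  property_type   = c("Apartment", "Apartment", "Hostel",
                      "Condominium", "Serviced apartment")
)

cleaned <- toy %>%
  # Drop listings that cannot host two people
  filter(accommodates >= 2) %>%
  mutate(
    # Turn strings such as "$1,395.00" into numbers such as 1395
    across(c(price, cleaning_fee, extra_people), parse_number),
    # Assumption: a missing cleaning fee means no fee is charged
    cleaning_fee = coalesce(cleaning_fee, 0),
    # Assumed formula: 4 nights + cleaning fee + surcharge for a second guest
    price_4_nights = 4 * price + cleaning_fee +
      4 * if_else(guests_included < 2, extra_people, 0),
    # Keep the most common property type and lump the rest into "Other"
    prop_type_simplified = fct_lump_n(property_type, n = 1)
  )

cleaned %>% select(price_4_nights, prop_type_simplified)
```

On the real data, the number of categories to keep would be chosen by looking at the frequency table of each variable first.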

After having adapted the dataframe to our needs, we created several boxplots, scatterplots and correlation plots for our variables in order to look at their spread and the relationships among them. We also used ggpairs to look at the correlation of our chosen variables and see whether they would be of use later for our ideal model.

After creating these visualisations, we fitted a first model and then tried different combinations of variables. To increase the adjusted R-squared, we added new variables and removed insignificant ones (absolute t-value below 2, i.e. a p-value above roughly 5%). To compare the models we used huxreg, and we used VIF to confirm that the variables in our model do not suffer from multicollinearity. We finally decided on a model that includes review_scores_communication, review_scores_location, review_scores_value, cancellation_policy, security_deposit, neighbourhood_simplified, latitude, host_response_time, host_is_superhost, bedrooms, accommodates, total_amenities, prop_type_simplified, number_of_reviews_ltm, review_scores_rating, room_type. This model reaches an adjusted R-squared of 39%, meaning it explains about 39% of the variability in price_4_nights.
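The model comparison described above can be illustrated with a small base-R sketch. The data are simulated, so the numbers say nothing about the real listings. In the report we relied on huxreg and the VIF function from car; here the variance inflation factor is computed by hand as 1 / (1 - R²), where R² comes from regressing one predictor on all the others, purely to show what the diagnostic measures.

```r
set.seed(42)

# Simulated stand-in for the cleaned listings (not the real data)
n <- 200
bedrooms <- rpois(n, 1) + 1
sim <- data.frame(
  bedrooms        = bedrooms,
  accommodates    = 2 * bedrooms + rbinom(n, 2, 0.5),  # deliberately collinear
  total_amenities = rpois(n, 15),
  room_type       = factor(sample(c("Entire home/apt", "Private room"),
                                  n, replace = TRUE))
)
sim$price_4_nights <- 1000 + 600 * sim$bedrooms + 20 * sim$total_amenities -
  400 * (sim$room_type == "Private room") + rnorm(n, sd = 500)

# Two candidate models of increasing size
model1 <- lm(price_4_nights ~ bedrooms + room_type, data = sim)
model2 <- lm(price_4_nights ~ bedrooms + room_type + accommodates +
               total_amenities, data = sim)

# Compare adjusted R-squared across models
adj_r2 <- c(model1 = summary(model1)$adj.r.squared,
            model2 = summary(model2)$adj.r.squared)

# t-values and p-values used to decide which terms to keep
coef(summary(model2))

# VIF by hand: regress one predictor on the others, then 1 / (1 - R^2)
vif_by_hand <- function(target, predictors, data) {
  others <- setdiff(predictors, target)
  r2 <- summary(lm(reformulate(others, response = target), data = data))$r.squared
  1 / (1 - r2)
}
predictors <- c("bedrooms", "accommodates", "total_amenities", "room_type")
numeric_predictors <- c("bedrooms", "accommodates", "total_amenities")
vifs <- sapply(numeric_predictors, vif_by_hand,
               predictors = predictors, data = sim)
vifs
```

Because accommodates is constructed from bedrooms in this simulation, both should show a clearly inflated VIF, while total_amenities should stay close to 1.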

As the last step, we predicted the price for 2 people staying for 4 nights in Hong Kong along with the 95% prediction intervals.
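As a rough sketch of this last step, base R's predict() returns a point prediction together with a prediction interval when called on a fitted lm object. The data and the two-predictor model below are simulated placeholders, not our final model.

```r
set.seed(1)

# Simulated stand-in data (not the real listings)
n <- 100
sim <- data.frame(
  bedrooms     = sample(0:3, n, replace = TRUE),
  accommodates = sample(2:6, n, replace = TRUE)
)
sim$price_4_nights <- 1500 + 700 * sim$bedrooms + 150 * sim$accommodates +
  rnorm(n, sd = 400)

fit <- lm(price_4_nights ~ bedrooms + accommodates, data = sim)

# A hypothetical listing for two people with one bedroom
new_listing <- data.frame(bedrooms = 1, accommodates = 2)

# Point prediction plus lower and upper bounds of the 95% prediction interval
pred <- predict(fit, newdata = new_listing,
                interval = "prediction", level = 0.95)
pred
```

The prediction interval is wider than a confidence interval for the mean because it also accounts for the variation of an individual listing around the regression line.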

The first step for our project is loading the essential packages.

library(vroom)
library(dplyr)
library(skimr)
library(tidyverse)
library(mosaic)
library(janitor)
library(broom)
library(lubridate)
library(GGally)
library(leaflet)
library(ggfortify)
library(huxtable)
library(car)
library(kableExtra)
library(data.table)
library(modelr)
library(readr)

Exploratory Data Analysis (EDA)

Loading Data

To begin with, we need to get data for Airbnb listings in Hong Kong. We read the data directly from the URL, using vroom for higher speed, and clean the column names:

listings <- vroom::vroom("http://data.insideairbnb.com/china/hk/hong-kong/2020-06-15/data/listings.csv.gz", na=c("", "NA", "N/A"))%>%
  clean_names() # clean column names

Looking at the raw values

How many variables/columns? How many rows/observations?

Firstly, we look at the raw data to get a basic understanding of it. We have 106 variables and 11,187 observations, and the dataset contains numeric, character, date and logical variables.

glimpse(listings)
## Rows: 11,187
## Columns: 106
## $ id                                           <dbl> 69074, 75083, 103760, 13…
## $ listing_url                                  <chr> "https://www.airbnb.com/…
## $ scrape_id                                    <dbl> 2.02e+13, 2.02e+13, 2.02…
## $ last_scraped                                 <date> 2020-06-17, 2020-06-17,…
## $ name                                         <chr> "Beautiful oasis of plan…
## $ summary                                      <chr> "An ideal Hong location …
## $ space                                        <chr> "Filled with plants and …
## $ description                                  <chr> "An ideal Hong location …
## $ experiences_offered                          <chr> "none", "none", "none", …
## $ neighborhood_overview                        <chr> "In the upper part of tr…
## $ notes                                        <chr> NA, "Once you arrive in …
## $ transit                                      <chr> "Buses pass often along …
## $ access                                       <chr> "All access, except one …
## $ interaction                                  <chr> "If a guest is staying t…
## $ house_rules                                  <chr> "Everything to make your…
## $ thumbnail_url                                <lgl> NA, NA, NA, NA, NA, NA, …
## $ medium_url                                   <lgl> NA, NA, NA, NA, NA, NA, …
## $ picture_url                                  <chr> "https://a0.muscache.com…
## $ xl_picture_url                               <lgl> NA, NA, NA, NA, NA, NA, …
## $ host_id                                      <dbl> 160139, 304876, 304876, …
## $ host_url                                     <chr> "https://www.airbnb.com/…
## $ host_name                                    <chr> "Amy", "Brend", "Brend",…
## $ host_since                                   <date> 2010-07-07, 2010-11-30,…
## $ host_location                                <chr> "Hong Kong", "Hong Kong"…
## $ host_about                                   <chr> "I've been with AirBnB n…
## $ host_response_time                           <chr> "within a few hours", "w…
## $ host_response_rate                           <chr> "86%", "100%", "100%", "…
## $ host_acceptance_rate                         <chr> "60%", "99%", "99%", "99…
## $ host_is_superhost                            <lgl> TRUE, FALSE, FALSE, FALS…
## $ host_thumbnail_url                           <chr> "https://a0.muscache.com…
## $ host_picture_url                             <chr> "https://a0.muscache.com…
## $ host_neighbourhood                           <chr> "Sheung Wan", "Sheung Wa…
## $ host_listings_count                          <dbl> 2, 12, 12, 12, 1, 12, 12…
## $ host_total_listings_count                    <dbl> 2, 12, 12, 12, 1, 12, 12…
## $ host_verifications                           <chr> "['email', 'phone', 'rev…
## $ host_has_profile_pic                         <lgl> TRUE, TRUE, TRUE, TRUE, …
## $ host_identity_verified                       <lgl> TRUE, FALSE, FALSE, FALS…
## $ street                                       <chr> "Sheung Wan, Hong Kong",…
## $ neighbourhood                                <chr> "Central & Western Distr…
## $ neighbourhood_cleansed                       <chr> "Central & Western", "Ce…
## $ neighbourhood_group_cleansed                 <lgl> NA, NA, NA, NA, NA, NA, …
## $ city                                         <chr> "Sheung Wan", "Sheung Wa…
## $ state                                        <chr> NA, NA, NA, NA, "Hong Ko…
## $ zipcode                                      <chr> NA, NA, NA, NA, NA, NA, …
## $ market                                       <chr> "Hong Kong", "Hong Kong"…
## $ smart_location                               <chr> "Sheung Wan, Hong Kong",…
## $ country_code                                 <chr> "HK", "HK", "HK", "HK", …
## $ country                                      <chr> "Hong Kong", "Hong Kong"…
## $ latitude                                     <dbl> 22.3, 22.3, 22.3, 22.3, …
## $ longitude                                    <dbl> 114, 114, 114, 114, 114,…
## $ is_location_exact                            <lgl> TRUE, TRUE, TRUE, FALSE,…
## $ property_type                                <chr> "Apartment", "Apartment"…
## $ room_type                                    <chr> "Entire home/apt", "Enti…
## $ accommodates                                 <dbl> 3, 3, 6, 6, 2, 6, 6, 2, …
## $ bathrooms                                    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, …
## $ bedrooms                                     <dbl> 1, 0, 2, 2, 1, 2, 2, 1, …
## $ beds                                         <dbl> 2, 2, 3, 3, 1, 3, 3, 1, …
## $ bed_type                                     <chr> "Real Bed", "Real Bed", …
## $ amenities                                    <chr> "{\"Cable TV\",Internet,…
## $ square_feet                                  <lgl> NA, NA, NA, NA, NA, NA, …
## $ price                                        <chr> "$1,395.00", "$783.00", …
## $ weekly_price                                 <chr> NA, NA, NA, NA, NA, NA, …
## $ monthly_price                                <chr> "$29,451.00", NA, NA, NA…
## $ security_deposit                             <chr> "$2,325.00", "$775.00", …
## $ cleaning_fee                                 <chr> "$310.00", "$271.00", "$…
## $ guests_included                              <dbl> 2, 2, 2, 3, 1, 2, 2, 1, …
## $ extra_people                                 <chr> "$155.00", "$155.00", "$…
## $ minimum_nights                               <dbl> 3, 14, 2, 2, 2, 2, 2, 1,…
## $ maximum_nights                               <dbl> 365, 365, 365, 365, 60, …
## $ minimum_minimum_nights                       <dbl> 3, 14, 2, 2, 2, 2, 2, 1,…
## $ maximum_minimum_nights                       <dbl> 4, 14, 2, 2, 2, 2, 2, 1,…
## $ minimum_maximum_nights                       <dbl> 365, 365, 365, 365, 60, …
## $ maximum_maximum_nights                       <dbl> 365, 365, 365, 365, 60, …
## $ minimum_nights_avg_ntm                       <dbl> 3.1, 14.0, 2.0, 2.0, 2.0…
## $ maximum_nights_avg_ntm                       <dbl> 365, 365, 365, 365, 60, …
## $ calendar_updated                             <chr> "2 months ago", "7 weeks…
## $ has_availability                             <lgl> TRUE, TRUE, TRUE, TRUE, …
## $ availability_30                              <dbl> 0, 0, 0, 14, 0, 8, 9, 30…
## $ availability_60                              <dbl> 23, 0, 0, 44, 15, 33, 39…
## $ availability_90                              <dbl> 53, 14, 0, 74, 45, 63, 6…
## $ availability_365                             <dbl> 143, 193, 0, 345, 135, 3…
## $ calendar_last_scraped                        <date> 2020-06-17, 2020-06-17,…
## $ number_of_reviews                            <dbl> 134, 229, 271, 305, 27, …
## $ number_of_reviews_ltm                        <dbl> 4, 1, 13, 48, 0, 16, 11,…
## $ first_review                                 <date> 2011-02-14, 2011-03-05,…
## $ last_review                                  <date> 2020-03-24, 2020-04-18,…
## $ review_scores_rating                         <dbl> 97, 89, 89, 93, 97, 86, …
## $ review_scores_accuracy                       <dbl> 10, 8, 9, 10, 10, 9, 9, …
## $ review_scores_cleanliness                    <dbl> 9, 9, 9, 10, 9, 9, 9, 10…
## $ review_scores_checkin                        <dbl> 10, 9, 10, 10, 10, 9, 10…
## $ review_scores_communication                  <dbl> 10, 9, 10, 10, 10, 9, 10…
## $ review_scores_location                       <dbl> 10, 10, 10, 10, 10, 10, …
## $ review_scores_value                          <dbl> 9, 9, 9, 9, 10, 9, 9, 10…
## $ requires_license                             <lgl> FALSE, FALSE, FALSE, FAL…
## $ license                                      <lgl> NA, NA, NA, NA, NA, NA, …
## $ jurisdiction_names                           <lgl> NA, NA, NA, NA, NA, NA, …
## $ instant_bookable                             <lgl> FALSE, FALSE, FALSE, FAL…
## $ is_business_travel_ready                     <lgl> FALSE, FALSE, FALSE, FAL…
## $ cancellation_policy                          <chr> "strict_14_with_grace_pe…
## $ require_guest_profile_picture                <lgl> FALSE, FALSE, FALSE, FAL…
## $ require_guest_phone_verification             <lgl> FALSE, FALSE, FALSE, FAL…
## $ calculated_host_listings_count               <dbl> 1, 13, 13, 13, 1, 13, 13…
## $ calculated_host_listings_count_entire_homes  <dbl> 1, 9, 9, 9, 1, 9, 9, 0, …
## $ calculated_host_listings_count_private_rooms <dbl> 0, 4, 4, 4, 0, 4, 4, 1, …
## $ calculated_host_listings_count_shared_rooms  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ reviews_per_month                            <dbl> 1.18, 2.02, 2.47, 2.81, …
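
The number of rows and columns, and the number of columns of each type, can also be checked programmatically rather than read off the glimpse output. A minimal sketch on a toy data frame (the four columns below are a made-up stand-in for listings):

```r
# Toy data frame standing in for listings (values are made up)
toy <- data.frame(
  id                = c(69074, 75083),
  name              = c("Listing A", "Listing B"),
  host_since        = as.Date(c("2010-07-07", "2010-11-30")),
  host_is_superhost = c(TRUE, FALSE),
  stringsAsFactors  = FALSE
)

# Rows and columns
dim(toy)

# Number of columns per type, using the first class of each column
type_counts <- table(vapply(toy, function(col) class(col)[1], character(1)))
type_counts
```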

Computing summary statistics of the variables of interest

After the first glimpse, it is useful to dive deeper and look at the distribution of the variables. There are a lot of them (46 character, 38 numeric, 5 date and 17 logical), so let’s consider the ones that are particularly interesting:

minimum_nights, accommodates, number_of_reviews, bedrooms, beds, bathrooms.

We first skim the full data set and then display summary statistics for these variables using favstats from the mosaic package.

skim(listings)%>%
  kable()%>%
  kable_styling()
skim_type skim_variable n_missing complete_rate character.min character.max character.empty character.n_unique character.whitespace Date.min Date.max Date.median Date.n_unique logical.mean logical.count numeric.mean numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100 numeric.hist
character listing_url 0 1.000 34 37 0 11187 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character name 8 0.999 1 250 0 10899 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character summary 756 0.932 1 1000 0 7994 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character space 4528 0.595 1 1000 0 4892 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character description 521 0.953 1 1000 0 8953 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character experiences_offered 0 1.000 4 4 0 1 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character neighborhood_overview 5879 0.474 1 1000 0 3573 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character notes 6862 0.387 1 1000 0 2410 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character transit 5598 0.500 1 1000 0 3666 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character access 6790 0.393 1 1000 0 2872 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character interaction 6119 0.453 1 1000 0 2981 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character house_rules 6217 0.444 2 1000 0 3169 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character picture_url 0 1.000 81 146 0 10607 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character host_url 0 1.000 39 43 0 4874 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character host_name 12 0.999 1 33 0 2846 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character host_location 38 0.997 2 133 0 429 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character host_about 4315 0.614 1 3851 0 2456 5 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character host_response_time 4142 0.630 12 18 0 4 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character host_response_rate 4142 0.630 2 4 0 58 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character host_acceptance_rate 3346 0.701 2 4 0 73 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character host_thumbnail_url 12 0.999 55 106 0 4851 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character host_picture_url 12 0.999 57 109 0 4851 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character host_neighbourhood 2525 0.774 2 26 0 163 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character host_verifications 0 1.000 2 156 0 265 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character street 0 1.000 13 82 0 688 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character neighbourhood 1284 0.885 4 26 0 56 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character neighbourhood_cleansed 0 1.000 5 17 0 18 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character city 772 0.931 1 50 0 343 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character state 371 0.967 1 31 0 176 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character zipcode 10480 0.063 1 20 0 120 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character market 9 0.999 6 22 0 12 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character smart_location 0 1.000 9 61 0 387 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character country_code 0 1.000 2 2 0 3 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character country 0 1.000 5 14 0 3 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character property_type 0 1.000 3 22 0 41 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character room_type 0 1.000 10 15 0 4 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character bed_type 0 1.000 5 13 0 5 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character amenities 0 1.000 2 1126 0 8558 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character price 0 1.000 5 10 0 374 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character weekly_price 10601 0.052 6 10 0 268 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character monthly_price 10480 0.063 7 11 0 316 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character security_deposit 5677 0.493 5 10 0 231 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character cleaning_fee 5055 0.548 5 9 0 259 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character extra_people 0 1.000 5 9 0 184 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character calendar_updated 0 1.000 5 13 0 78 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character cancellation_policy 0 1.000 6 27 0 6 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Date last_scraped 0 1.000 NA NA NA NA NA 2020-06-15 2020-06-19 2020-06-17 4 NA NA NA NA NA NA NA NA NA NA
Date host_since 12 0.999 NA NA NA NA NA 2009-08-17 2020-06-10 2015-12-27 2355 NA NA NA NA NA NA NA NA NA NA
Date calendar_last_scraped 0 1.000 NA NA NA NA NA 2020-06-15 2020-06-19 2020-06-17 4 NA NA NA NA NA NA NA NA NA NA
Date first_review 4155 0.629 NA NA NA NA NA 2011-02-14 2020-06-15 2018-02-19 1986 NA NA NA NA NA NA NA NA NA NA
Date last_review 4155 0.629 NA NA NA NA NA 2013-01-02 2020-06-17 2019-06-23 1365 NA NA NA NA NA NA NA NA NA NA
logical thumbnail_url 11187 0.000 NA NA NA NA NA NA NA NA NA NaN : NA NA NA NA NA NA NA NA
logical medium_url 11187 0.000 NA NA NA NA NA NA NA NA NA NaN : NA NA NA NA NA NA NA NA
logical xl_picture_url 11187 0.000 NA NA NA NA NA NA NA NA NA NaN : NA NA NA NA NA NA NA NA
logical host_is_superhost 12 0.999 NA NA NA NA NA NA NA NA NA 0.135 FAL: 9669, TRU: 1506 NA NA NA NA NA NA NA NA
logical host_has_profile_pic 12 0.999 NA NA NA NA NA NA NA NA NA 0.997 TRU: 11141, FAL: 34 NA NA NA NA NA NA NA NA
logical host_identity_verified 12 0.999 NA NA NA NA NA NA NA NA NA 0.268 FAL: 8179, TRU: 2996 NA NA NA NA NA NA NA NA
logical neighbourhood_group_cleansed 11187 0.000 NA NA NA NA NA NA NA NA NA NaN : NA NA NA NA NA NA NA NA
logical is_location_exact 0 1.000 NA NA NA NA NA NA NA NA NA 0.688 TRU: 7698, FAL: 3489 NA NA NA NA NA NA NA NA
logical square_feet 11172 0.001 NA NA NA NA NA NA NA NA NA 0.000 FAL: 15 NA NA NA NA NA NA NA NA
logical has_availability 0 1.000 NA NA NA NA NA NA NA NA NA 1.000 TRU: 11187 NA NA NA NA NA NA NA NA
logical requires_license 0 1.000 NA NA NA NA NA NA NA NA NA 0.000 FAL: 11187 NA NA NA NA NA NA NA NA
logical license 11187 0.000 NA NA NA NA NA NA NA NA NA NaN : NA NA NA NA NA NA NA NA
logical jurisdiction_names 11187 0.000 NA NA NA NA NA NA NA NA NA NaN : NA NA NA NA NA NA NA NA
logical instant_bookable 0 1.000 NA NA NA NA NA NA NA NA NA 0.420 FAL: 6485, TRU: 4702 NA NA NA NA NA NA NA NA
logical is_business_travel_ready 0 1.000 NA NA NA NA NA NA NA NA NA 0.000 FAL: 11187 NA NA NA NA NA NA NA NA
logical require_guest_profile_picture 0 1.000 NA NA NA NA NA NA NA NA NA 0.008 FAL: 11102, TRU: 85 NA NA NA NA NA NA NA NA
logical require_guest_phone_verification 0 1.000 NA NA NA NA NA NA NA NA NA 0.009 FAL: 11086, TRU: 101 NA NA NA NA NA NA NA NA
numeric id 0 1.000 NA NA NA NA NA NA NA NA NA NA NA 2.50e+07 1.17e+07 6.91e+04 1.63e+07 2.63e+07 3.47e+07 4.38e+07 ▃▅▆▇▇
numeric scrape_id 0 1.000 NA NA NA NA NA NA NA NA NA NA NA 2.02e+13 0.00e+00 2.02e+13 2.02e+13 2.02e+13 2.02e+13 2.02e+13 ▁▁▇▁▁
numeric host_id 0 1.000 NA NA NA NA NA NA NA NA NA NA NA 8.84e+07 8.74e+07 3.22e+04 1.69e+07 5.25e+07 1.39e+08 3.49e+08 ▇▃▂▂▁
numeric host_listings_count 12 0.999 NA NA NA NA NA NA NA NA NA NA NA 4.85e+01 1.05e+02 0.00e+00 1.00e+00 5.00e+00 2.20e+01 3.86e+02 ▇▁▁▁▁
numeric host_total_listings_count 12 0.999 NA NA NA NA NA NA NA NA NA NA NA 4.85e+01 1.05e+02 0.00e+00 1.00e+00 5.00e+00 2.20e+01 3.86e+02 ▇▁▁▁▁
numeric latitude 0 1.000 NA NA NA NA NA NA NA NA NA NA NA 2.23e+01 4.80e-02 2.22e+01 2.23e+01 2.23e+01 2.23e+01 2.26e+01 ▁▇▁▁▁
numeric longitude 0 1.000 NA NA NA NA NA NA NA NA NA NA NA 1.14e+02 4.40e-02 1.14e+02 1.14e+02 1.14e+02 1.14e+02 1.14e+02 ▁▁▃▇▁
numeric accommodates 0 1.000 NA NA NA NA NA NA NA NA NA NA NA 2.82e+00 2.18e+00 1.00e+00 2.00e+00 2.00e+00 3.00e+00 1.60e+01 ▇▁▁▁▁
numeric bathrooms 17 0.998 NA NA NA NA NA NA NA NA NA NA NA 1.16e+00 5.73e-01 0.00e+00 1.00e+00 1.00e+00 1.00e+00 1.10e+01 ▇▁▁▁▁
numeric bedrooms 38 0.997 NA NA NA NA NA NA NA NA NA NA NA 1.09e+00 8.54e-01 0.00e+00 1.00e+00 1.00e+00 1.00e+00 1.10e+01 ▇▁▁▁▁
numeric beds 69 0.994 NA NA NA NA NA NA NA NA NA NA NA 1.68e+00 1.44e+00 0.00e+00 1.00e+00 1.00e+00 2.00e+00 2.00e+01 ▇▁▁▁▁
numeric guests_included 0 1.000 NA NA NA NA NA NA NA NA NA NA NA 1.39e+00 1.06e+00 1.00e+00 1.00e+00 1.00e+00 1.00e+00 1.60e+01 ▇▁▁▁▁
numeric minimum_nights 0 1.000 NA NA NA NA NA NA NA NA NA NA NA 9.76e+00 2.83e+01 1.00e+00 1.00e+00 2.00e+00 7.00e+00 1.10e+03 ▇▁▁▁▁
numeric maximum_nights 0 1.000 NA NA NA NA NA NA NA NA NA NA NA 3.86e+05 2.87e+07 1.00e+00 3.65e+02 1.12e+03 1.12e+03 2.15e+09 ▇▁▁▁▁
numeric minimum_minimum_nights 0 1.000 NA NA NA NA NA NA NA NA NA NA NA 9.61e+00 2.80e+01 1.00e+00 1.00e+00 2.00e+00 7.00e+00 1.10e+03 ▇▁▁▁▁
numeric maximum_minimum_nights 0 1.000 NA NA NA NA NA NA NA NA NA NA NA 9.99e+00 2.91e+01 1.00e+00 1.00e+00 2.00e+00 7.00e+00 1.10e+03 ▇▁▁▁▁
numeric minimum_maximum_nights 0 1.000 NA NA NA NA NA NA NA NA NA NA NA 3.86e+05 2.87e+07 1.00e+00 3.65e+02 1.12e+03 1.12e+03 2.15e+09 ▇▁▁▁▁
numeric maximum_maximum_nights 0 1.000 NA NA NA NA NA NA NA NA NA NA NA 3.86e+05 2.87e+07 1.00e+00 3.65e+02 1.12e+03 1.12e+03 2.15e+09 ▇▁▁▁▁
numeric minimum_nights_avg_ntm 0 1.000 NA NA NA NA NA NA NA NA NA NA NA 9.79e+00 2.82e+01 1.00e+00 1.00e+00 2.00e+00 7.00e+00 1.10e+03 ▇▁▁▁▁
numeric maximum_nights_avg_ntm 0 1.000 NA NA NA NA NA NA NA NA NA NA NA 3.86e+05 2.87e+07 1.00e+00 3.65e+02 1.12e+03 1.12e+03 2.15e+09 ▇▁▁▁▁
numeric availability_30 0 1.000 NA NA NA NA NA NA NA NA NA NA NA 1.55e+01 1.40e+01 0.00e+00 0.00e+00 2.00e+01 3.00e+01 3.00e+01 ▇▁▁▁▇
numeric availability_60 0 1.000 NA NA NA NA NA NA NA NA NA NA NA 3.28e+01 2.79e+01 0.00e+00 0.00e+00 4.70e+01 6.00e+01 6.00e+01 ▆▁▁▁▇
numeric availability_90 0 1.000 NA NA NA NA NA NA NA NA NA NA NA 5.06e+01 4.17e+01 0.00e+00 0.00e+00 7.60e+01 9.00e+01 9.00e+01 ▆▁▁▁▇
numeric availability_365 0 1.000 NA NA NA NA NA NA NA NA NA NA NA 1.68e+02 1.57e+02 0.00e+00 0.00e+00 1.08e+02 3.64e+02 3.65e+02 ▇▂▂▁▇
numeric number_of_reviews 0 1.000 NA NA NA NA NA NA NA NA NA NA NA 1.77e+01 4.12e+01 0.00e+00 0.00e+00 2.00e+00 1.40e+01 7.57e+02 ▇▁▁▁▁
numeric number_of_reviews_ltm 0 1.000 NA NA NA NA NA NA NA NA NA NA NA 2.68e+00 7.55e+00 0.00e+00 0.00e+00 0.00e+00 1.00e+00 1.38e+02 ▇▁▁▁▁
numeric review_scores_rating 4355 0.611 NA NA NA NA NA NA NA NA NA NA NA 9.09e+01 1.12e+01 2.00e+01 8.70e+01 9.40e+01 9.90e+01 1.00e+02 ▁▁▁▂▇
numeric review_scores_accuracy 4357 0.611 NA NA NA NA NA NA NA NA NA NA NA 9.34e+00 1.12e+00 2.00e+00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
numeric review_scores_cleanliness 4357 0.611 NA NA NA NA NA NA NA NA NA NA NA 9.10e+00 1.20e+00 2.00e+00 9.00e+00 9.00e+00 1.00e+01 1.00e+01 ▁▁▁▂▇
numeric review_scores_checkin 4356 0.611 NA NA NA NA NA NA NA NA NA NA NA 9.50e+00 1.04e+00 2.00e+00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
numeric review_scores_communication 4357 0.611 NA NA NA NA NA NA NA NA NA NA NA 9.51e+00 1.03e+00 2.00e+00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
numeric review_scores_location 4358 0.610 NA NA NA NA NA NA NA NA NA NA NA 9.61e+00 8.50e-01 2.00e+00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
numeric review_scores_value 4358 0.610 NA NA NA NA NA NA NA NA NA NA NA 9.13e+00 1.13e+00 2.00e+00 9.00e+00 9.00e+00 1.00e+01 1.00e+01 ▁▁▁▂▇
numeric calculated_host_listings_count 0 1.000 NA NA NA NA NA NA NA NA NA NA NA 4.57e+01 1.03e+02 1.00e+00 1.00e+00 4.00e+00 1.90e+01 3.89e+02 ▇▁▁▁▁
numeric calculated_host_listings_count_entire_homes 0 1.000 NA NA NA NA NA NA NA NA NA NA NA 7.80e+00 1.90e+01 0.00e+00 0.00e+00 1.00e+00 4.00e+00 1.08e+02 ▇▁▁▁▁
numeric calculated_host_listings_count_private_rooms 0 1.000 NA NA NA NA NA NA NA NA NA NA NA 3.29e+01 8.23e+01 0.00e+00 0.00e+00 1.00e+00 1.10e+01 3.39e+02 ▇▁▁▁▁
numeric calculated_host_listings_count_shared_rooms 0 1.000 NA NA NA NA NA NA NA NA NA NA NA 4.54e+00 1.57e+01 0.00e+00 0.00e+00 0.00e+00 0.00e+00 8.20e+01 ▇▁▁▁▁
numeric reviews_per_month 4155 0.629 NA NA NA NA NA NA NA NA NA NA NA 8.37e-01 1.18e+00 1.00e-02 1.20e-01 3.50e-01 1.03e+00 1.32e+01 ▇▁▁▁▁
skim(listings)
Table 1: Data summary
Name listings
Number of rows 11187
Number of columns 106
_______________________
Column type frequency:
character 46
Date 5
logical 17
numeric 38
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
listing_url 0 1.00 34 37 0 11187 0
name 8 1.00 1 250 0 10899 0
summary 756 0.93 1 1000 0 7994 0
space 4528 0.60 1 1000 0 4892 0
description 521 0.95 1 1000 0 8953 0
experiences_offered 0 1.00 4 4 0 1 0
neighborhood_overview 5879 0.47 1 1000 0 3573 0
notes 6862 0.39 1 1000 0 2410 0
transit 5598 0.50 1 1000 0 3666 0
access 6790 0.39 1 1000 0 2872 0
interaction 6119 0.45 1 1000 0 2981 0
house_rules 6217 0.44 2 1000 0 3169 0
picture_url 0 1.00 81 146 0 10607 0
host_url 0 1.00 39 43 0 4874 0
host_name 12 1.00 1 33 0 2846 0
host_location 38 1.00 2 133 0 429 0
host_about 4315 0.61 1 3851 0 2456 5
host_response_time 4142 0.63 12 18 0 4 0
host_response_rate 4142 0.63 2 4 0 58 0
host_acceptance_rate 3346 0.70 2 4 0 73 0
host_thumbnail_url 12 1.00 55 106 0 4851 0
host_picture_url 12 1.00 57 109 0 4851 0
host_neighbourhood 2525 0.77 2 26 0 163 0
host_verifications 0 1.00 2 156 0 265 0
street 0 1.00 13 82 0 688 0
neighbourhood 1284 0.89 4 26 0 56 0
neighbourhood_cleansed 0 1.00 5 17 0 18 0
city 772 0.93 1 50 0 343 0
state 371 0.97 1 31 0 176 0
zipcode 10480 0.06 1 20 0 120 0
market 9 1.00 6 22 0 12 0
smart_location 0 1.00 9 61 0 387 0
country_code 0 1.00 2 2 0 3 0
country 0 1.00 5 14 0 3 0
property_type 0 1.00 3 22 0 41 0
room_type 0 1.00 10 15 0 4 0
bed_type 0 1.00 5 13 0 5 0
amenities 0 1.00 2 1126 0 8558 0
price 0 1.00 5 10 0 374 0
weekly_price 10601 0.05 6 10 0 268 0
monthly_price 10480 0.06 7 11 0 316 0
security_deposit 5677 0.49 5 10 0 231 0
cleaning_fee 5055 0.55 5 9 0 259 0
extra_people 0 1.00 5 9 0 184 0
calendar_updated 0 1.00 5 13 0 78 0
cancellation_policy 0 1.00 6 27 0 6 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
last_scraped 0 1.00 2020-06-15 2020-06-19 2020-06-17 4
host_since 12 1.00 2009-08-17 2020-06-10 2015-12-27 2355
calendar_last_scraped 0 1.00 2020-06-15 2020-06-19 2020-06-17 4
first_review 4155 0.63 2011-02-14 2020-06-15 2018-02-19 1986
last_review 4155 0.63 2013-01-02 2020-06-17 2019-06-23 1365

Variable type: logical

skim_variable n_missing complete_rate mean count
thumbnail_url 11187 0 NaN :
medium_url 11187 0 NaN :
xl_picture_url 11187 0 NaN :
host_is_superhost 12 1 0.13 FAL: 9669, TRU: 1506
host_has_profile_pic 12 1 1.00 TRU: 11141, FAL: 34
host_identity_verified 12 1 0.27 FAL: 8179, TRU: 2996
neighbourhood_group_cleansed 11187 0 NaN :
is_location_exact 0 1 0.69 TRU: 7698, FAL: 3489
square_feet 11172 0 0.00 FAL: 15
has_availability 0 1 1.00 TRU: 11187
requires_license 0 1 0.00 FAL: 11187
license 11187 0 NaN :
jurisdiction_names 11187 0 NaN :
instant_bookable 0 1 0.42 FAL: 6485, TRU: 4702
is_business_travel_ready 0 1 0.00 FAL: 11187
require_guest_profile_picture 0 1 0.01 FAL: 11102, TRU: 85
require_guest_phone_verification 0 1 0.01 FAL: 11086, TRU: 101

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1.00 2.50e+07 1.17e+07 6.91e+04 1.63e+07 2.63e+07 3.47e+07 4.38e+07 ▃▅▆▇▇
scrape_id 0 1.00 2.02e+13 0.00e+00 2.02e+13 2.02e+13 2.02e+13 2.02e+13 2.02e+13 ▁▁▇▁▁
host_id 0 1.00 8.84e+07 8.74e+07 3.22e+04 1.69e+07 5.25e+07 1.39e+08 3.49e+08 ▇▃▂▂▁
host_listings_count 12 1.00 4.85e+01 1.05e+02 0.00e+00 1.00e+00 5.00e+00 2.20e+01 3.86e+02 ▇▁▁▁▁
host_total_listings_count 12 1.00 4.85e+01 1.05e+02 0.00e+00 1.00e+00 5.00e+00 2.20e+01 3.86e+02 ▇▁▁▁▁
latitude 0 1.00 2.23e+01 5.00e-02 2.22e+01 2.23e+01 2.23e+01 2.23e+01 2.26e+01 ▁▇▁▁▁
longitude 0 1.00 1.14e+02 4.00e-02 1.14e+02 1.14e+02 1.14e+02 1.14e+02 1.14e+02 ▁▁▃▇▁
accommodates 0 1.00 2.82e+00 2.18e+00 1.00e+00 2.00e+00 2.00e+00 3.00e+00 1.60e+01 ▇▁▁▁▁
bathrooms 17 1.00 1.16e+00 5.70e-01 0.00e+00 1.00e+00 1.00e+00 1.00e+00 1.10e+01 ▇▁▁▁▁
bedrooms 38 1.00 1.09e+00 8.50e-01 0.00e+00 1.00e+00 1.00e+00 1.00e+00 1.10e+01 ▇▁▁▁▁
beds 69 0.99 1.68e+00 1.44e+00 0.00e+00 1.00e+00 1.00e+00 2.00e+00 2.00e+01 ▇▁▁▁▁
guests_included 0 1.00 1.39e+00 1.06e+00 1.00e+00 1.00e+00 1.00e+00 1.00e+00 1.60e+01 ▇▁▁▁▁
minimum_nights 0 1.00 9.76e+00 2.83e+01 1.00e+00 1.00e+00 2.00e+00 7.00e+00 1.10e+03 ▇▁▁▁▁
maximum_nights 0 1.00 3.86e+05 2.87e+07 1.00e+00 3.65e+02 1.12e+03 1.12e+03 2.15e+09 ▇▁▁▁▁
minimum_minimum_nights 0 1.00 9.61e+00 2.80e+01 1.00e+00 1.00e+00 2.00e+00 7.00e+00 1.10e+03 ▇▁▁▁▁
maximum_minimum_nights 0 1.00 1.00e+01 2.91e+01 1.00e+00 1.00e+00 2.00e+00 7.00e+00 1.10e+03 ▇▁▁▁▁
minimum_maximum_nights 0 1.00 3.86e+05 2.87e+07 1.00e+00 3.65e+02 1.12e+03 1.12e+03 2.15e+09 ▇▁▁▁▁
maximum_maximum_nights 0 1.00 3.86e+05 2.87e+07 1.00e+00 3.65e+02 1.12e+03 1.12e+03 2.15e+09 ▇▁▁▁▁
minimum_nights_avg_ntm 0 1.00 9.79e+00 2.82e+01 1.00e+00 1.00e+00 2.00e+00 7.00e+00 1.10e+03 ▇▁▁▁▁
maximum_nights_avg_ntm 0 1.00 3.86e+05 2.87e+07 1.00e+00 3.65e+02 1.12e+03 1.12e+03 2.15e+09 ▇▁▁▁▁
availability_30 0 1.00 1.55e+01 1.40e+01 0.00e+00 0.00e+00 2.00e+01 3.00e+01 3.00e+01 ▇▁▁▁▇
availability_60 0 1.00 3.28e+01 2.79e+01 0.00e+00 0.00e+00 4.70e+01 6.00e+01 6.00e+01 ▆▁▁▁▇
availability_90 0 1.00 5.06e+01 4.17e+01 0.00e+00 0.00e+00 7.60e+01 9.00e+01 9.00e+01 ▆▁▁▁▇
availability_365 0 1.00 1.68e+02 1.57e+02 0.00e+00 0.00e+00 1.08e+02 3.64e+02 3.65e+02 ▇▂▂▁▇
number_of_reviews 0 1.00 1.77e+01 4.12e+01 0.00e+00 0.00e+00 2.00e+00 1.40e+01 7.57e+02 ▇▁▁▁▁
number_of_reviews_ltm 0 1.00 2.68e+00 7.55e+00 0.00e+00 0.00e+00 0.00e+00 1.00e+00 1.38e+02 ▇▁▁▁▁
review_scores_rating 4355 0.61 9.09e+01 1.12e+01 2.00e+01 8.70e+01 9.40e+01 9.90e+01 1.00e+02 ▁▁▁▂▇
review_scores_accuracy 4357 0.61 9.34e+00 1.12e+00 2.00e+00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
review_scores_cleanliness 4357 0.61 9.09e+00 1.20e+00 2.00e+00 9.00e+00 9.00e+00 1.00e+01 1.00e+01 ▁▁▁▂▇
review_scores_checkin 4356 0.61 9.50e+00 1.04e+00 2.00e+00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
review_scores_communication 4357 0.61 9.51e+00 1.03e+00 2.00e+00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
review_scores_location 4358 0.61 9.61e+00 8.50e-01 2.00e+00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
review_scores_value 4358 0.61 9.13e+00 1.13e+00 2.00e+00 9.00e+00 9.00e+00 1.00e+01 1.00e+01 ▁▁▁▂▇
calculated_host_listings_count 0 1.00 4.57e+01 1.03e+02 1.00e+00 1.00e+00 4.00e+00 1.90e+01 3.89e+02 ▇▁▁▁▁
calculated_host_listings_count_entire_homes 0 1.00 7.80e+00 1.90e+01 0.00e+00 0.00e+00 1.00e+00 4.00e+00 1.08e+02 ▇▁▁▁▁
calculated_host_listings_count_private_rooms 0 1.00 3.29e+01 8.22e+01 0.00e+00 0.00e+00 1.00e+00 1.10e+01 3.39e+02 ▇▁▁▁▁
calculated_host_listings_count_shared_rooms 0 1.00 4.54e+00 1.57e+01 0.00e+00 0.00e+00 0.00e+00 0.00e+00 8.20e+01 ▇▁▁▁▁
reviews_per_month 4155 0.63 8.40e-01 1.18e+00 1.00e-02 1.20e-01 3.50e-01 1.03e+00 1.32e+01 ▇▁▁▁▁

Minimum number of nights

The histogram shows that minimum_nights is right skewed with a median of 2, which suggests that the majority of properties target tourists. At the same time, many listings have a long minimum stay; these are properties for long-term rent.

favstats(~minimum_nights, data= listings)
min Q1 median Q3 max mean sd n missing
1 1 2 7 1.1e+03 9.76 28.3 11187 0

Number of accommodates

The situation is similar to the minimum number of nights: the data is slightly right skewed. There are some large listings that accommodate up to 16 people.

favstats(~accommodates, data= listings)

min Q1 median Q3 max mean sd n missing
1 2 2 3 16 2.82 2.18 11187 0

Number of reviews

The number of reviews is heavily right skewed. The median shows that most listings have very few reviews: just 2.

favstats(~number_of_reviews, data= listings)

min Q1 median Q3 max mean sd n missing
0 0 2 14 757 17.7 41.2 11187 0

Bedrooms, beds, bathrooms

favstats(~bedrooms, data= listings)
min Q1 median Q3 max mean sd n missing
0 1 1 1 11 1.09 0.854 11149 38
favstats(~beds, data= listings)
min Q1 median Q3 max mean sd n missing
0 1 1 2 20 1.68 1.44 11118 69
favstats(~bathrooms, data= listings)

min Q1 median Q3 max mean sd n missing
0 1 1 1 11 1.16 0.573 11170 17
The most interesting observation is that there are some really big “mansions” in the data set: houses with 11 bathrooms and 20 beds. All three variables are slightly right skewed.

Data cleaning and transformation

Before proceeding to the analysis, we apply the following transformations to the main data set:

  • Select the variables we found interesting
  • Change the price variables (price, cleaning_fee, extra_people, and security_deposit) to numeric variables, as they are stored as characters
  • Remove listings with minimum_nights > 4 or maximum_nights < 4. These listings are either long-term rentals that are most likely not connected to tourism, or stays that are too short. Since the goal of the project is to predict the price for 4 nights, they are irrelevant
  • Delete observations with a price of 0 (most likely listings where the price has to be discussed by phone)
  • Delete observations that do not allow a second person to stay
  • Delete listings without a host response time, and record whether a listing has a summary in has_summary
  • Set security deposit and cleaning fee to 0 if they are NA, since NA here means that no security deposit is needed or that cleaning is free
  • For property type, we find the 7 most popular types in the data set. We then create a simplified version of the variable by keeping the most popular types and assigning the value “Other” to all remaining observations. Cancellation policy is collapsed into 3 types. Neighbourhood, room type and bed type are checked for whether they have fewer than 6 levels.
  • Create the host_since_calculated variable, which reflects the number of years the host has been a host
  • Convert the amenities variable into a count of the amenities in a property, rather than comma separated values, and store it in total_amenities
  • After talking to a person who lives in Hong Kong, we group the neighbourhoods into 3 larger areas: hong_kong, kowloon and new_territories.
  • Finally, calculate the cost for two people to stay in Hong Kong for 4 nights and store it in the variable price_4_nights. The cost is the sum of 4 times the nightly price, 4 times the charge for the additional guest if one applies (otherwise 0), and the cleaning fee.
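
The conversion of the price columns described in the list above can be sketched as follows. This is a minimal illustration, assuming the raw values are strings with a currency symbol and grouping commas; the example strings are made up and are not taken from the data set.

```r
library(readr)

# Hypothetical raw price strings in the style of a listings file
raw_price <- c("$1,395.00", "$845.00", NA)

# parse_number() drops the currency symbol and the grouping commas
price <- parse_number(raw_price)
price
```

Missing values stay missing after parsing, which is why the cleaning fee and security deposit need a separate step to turn NA into 0.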

In order to run models with our data, we need to modify some variables and create some new variables that we want to examine.

After that we need to filter the data.

interesting_data <- listings %>% # Select variables of interest 
  select(id, 
           listing_url, 
           summary, 
           space, 
           description, 
           host_since, 
           host_response_time, 
           host_response_rate, 
           host_is_superhost, 
           host_listings_count, 
           host_has_profile_pic, 
           host_identity_verified, 
           neighbourhood_cleansed, 
           latitude, longitude, 
           is_location_exact, 
           property_type, 
           room_type, 
           accommodates, 
           bathrooms,
           bedrooms, 
           beds, 
           bed_type, 
           amenities, 
           price, 
           security_deposit, 
           cleaning_fee, guests_included, 
           extra_people, 
           minimum_nights, 
           maximum_nights, 
           has_availability, 
           number_of_reviews, 
           number_of_reviews_ltm, 
           last_review, 
           review_scores_rating, 
           review_scores_accuracy, 
           review_scores_cleanliness, 
           review_scores_checkin, 
           review_scores_communication, 
           review_scores_location, 
           review_scores_value, 
           instant_bookable, 
           cancellation_policy, 
           reviews_per_month)

Data Wrangling

main_data <- interesting_data %>%
  mutate(host_response_rate = parse_number(host_response_rate),
         price = parse_number(price),
         security_deposit = parse_number(security_deposit),
         cleaning_fee = parse_number(cleaning_fee),
         extra_people = parse_number(extra_people),
         has_summary = !is.na(summary))%>%
  filter(minimum_nights <= 4,  
         maximum_nights >= 4,
         price > 0, 
         accommodates >= 2,
         !is.na(host_response_time) 
         )

skim(main_data)
Table 2: Data summary
Name main_data
Number of rows 3829
Number of columns 46
_______________________
Column type frequency:
character 11
Date 2
logical 7
numeric 26
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
listing_url 0 1.00 34 37 0 3829 0
summary 273 0.93 1 1000 0 2486 0
space 1184 0.69 3 1000 0 1843 0
description 179 0.95 1 1000 0 2904 0
host_response_time 0 1.00 12 18 0 4 0
neighbourhood_cleansed 0 1.00 5 17 0 18 0
property_type 0 1.00 3 21 0 33 0
room_type 0 1.00 10 15 0 4 0
bed_type 0 1.00 5 13 0 5 0
amenities 0 1.00 2 1126 0 2962 0
cancellation_policy 0 1.00 8 27 0 5 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
host_since 0 1.00 2010-02-13 2020-06-03 2016-01-17 1026
last_review 798 0.79 2015-01-03 2020-06-17 2019-11-20 718

Variable type: logical

skim_variable n_missing complete_rate mean count
host_is_superhost 0 1 0.20 FAL: 3069, TRU: 760
host_has_profile_pic 0 1 1.00 TRU: 3815, FAL: 14
host_identity_verified 0 1 0.32 FAL: 2597, TRU: 1232
is_location_exact 0 1 0.70 TRU: 2668, FAL: 1161
has_availability 0 1 1.00 TRU: 3829
instant_bookable 0 1 0.57 TRU: 2198, FAL: 1631
has_summary 0 1 0.93 TRU: 3556, FAL: 273

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1.00 2.57e+07 1.22e+07 69074.00 1.65e+07 2.77e+07 3.54e+07 4.38e+07 ▃▅▆▇▇
host_response_rate 0 1.00 8.21e+01 3.20e+01 0.00 8.40e+01 1.00e+02 1.00e+02 1.00e+02 ▁▁▁▁▇
host_listings_count 0 1.00 1.80e+01 3.06e+01 0.00 2.00e+00 8.00e+00 1.80e+01 3.86e+02 ▇▁▁▁▁
latitude 0 1.00 2.23e+01 6.00e-02 22.16 2.23e+01 2.23e+01 2.23e+01 2.26e+01 ▁▇▂▁▁
longitude 0 1.00 1.14e+02 5.00e-02 113.86 1.14e+02 1.14e+02 1.14e+02 1.14e+02 ▁▁▂▇▁
accommodates 0 1.00 3.56e+00 2.40e+00 2.00 2.00e+00 2.00e+00 4.00e+00 1.60e+01 ▇▁▁▁▁
bathrooms 2 1.00 1.20e+00 6.40e-01 0.00 1.00e+00 1.00e+00 1.00e+00 1.00e+01 ▇▁▁▁▁
bedrooms 8 1.00 1.21e+00 9.20e-01 0.00 1.00e+00 1.00e+00 1.00e+00 1.10e+01 ▇▁▁▁▁
beds 19 1.00 2.07e+00 1.72e+00 0.00 1.00e+00 2.00e+00 2.00e+00 2.00e+01 ▇▁▁▁▁
price 0 1.00 8.23e+02 1.72e+03 47.00 3.49e+02 5.04e+02 8.53e+02 5.80e+04 ▇▁▁▁▁
security_deposit 1605 0.58 8.86e+02 2.08e+03 0.00 0.00e+00 0.00e+00 1.00e+03 3.80e+04 ▇▁▁▁▁
cleaning_fee 1339 0.65 1.63e+02 2.37e+02 0.00 0.00e+00 1.00e+02 2.50e+02 4.68e+03 ▇▁▁▁▁
guests_included 0 1.00 1.72e+00 1.41e+00 1.00 1.00e+00 1.00e+00 2.00e+00 1.60e+01 ▇▁▁▁▁
extra_people 0 1.00 8.60e+01 1.86e+02 0.00 0.00e+00 0.00e+00 1.00e+02 2.30e+03 ▇▁▁▁▁
minimum_nights 0 1.00 1.36e+00 6.90e-01 1.00 1.00e+00 1.00e+00 2.00e+00 4.00e+00 ▇▂▁▁▁
maximum_nights 0 1.00 5.64e+05 3.47e+07 4.00 3.65e+02 1.12e+03 1.12e+03 2.15e+09 ▇▁▁▁▁
number_of_reviews 0 1.00 3.32e+01 5.57e+01 0.00 1.00e+00 9.00e+00 4.00e+01 7.57e+02 ▇▁▁▁▁
number_of_reviews_ltm 0 1.00 5.70e+00 1.08e+01 0.00 0.00e+00 1.00e+00 6.00e+00 1.38e+02 ▇▁▁▁▁
review_scores_rating 836 0.78 9.03e+01 1.04e+01 20.00 8.70e+01 9.30e+01 9.70e+01 1.00e+02 ▁▁▁▂▇
review_scores_accuracy 836 0.78 9.26e+00 1.08e+00 2.00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
review_scores_cleanliness 837 0.78 9.09e+00 1.09e+00 2.00 9.00e+00 9.00e+00 1.00e+01 1.00e+01 ▁▁▁▂▇
review_scores_checkin 836 0.78 9.49e+00 9.70e-01 2.00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
review_scores_communication 836 0.78 9.48e+00 9.60e-01 2.00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
review_scores_location 836 0.78 9.60e+00 8.30e-01 2.00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
review_scores_value 836 0.78 9.06e+00 1.05e+00 2.00 9.00e+00 9.00e+00 1.00e+01 1.00e+01 ▁▁▁▂▇
reviews_per_month 798 0.79 1.25e+00 1.41e+00 0.02 2.70e-01 7.30e-01 1.71e+00 1.32e+01 ▇▁▁▁▁

Handling NAs

#Check NAs
main_data %>% 
  select(price, cleaning_fee, extra_people, host_response_rate, security_deposit) %>% 
  skim()%>%
  kable()%>%
  kable_styling()
skim_type skim_variable n_missing complete_rate numeric.mean numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100 numeric.hist
numeric price 0 1.000 823.4 1722 47 349 504 853 58002 ▇▁▁▁▁
numeric cleaning_fee 1339 0.650 163.1 237 0 0 100 250 4682 ▇▁▁▁▁
numeric extra_people 0 1.000 86.0 186 0 0 0 100 2300 ▇▁▁▁▁
numeric host_response_rate 0 1.000 82.1 32 0 84 100 100 100 ▁▁▁▁▇
numeric security_deposit 1605 0.581 886.4 2079 0 0 0 1000 38000 ▇▁▁▁▁
# Set security deposit and cleaning fee to 0 if they are NA
main_data2 <- main_data %>%
  mutate(cleaning_fee = replace_na(cleaning_fee, 0),
         security_deposit = replace_na(security_deposit, 0))

Filtering Variables and Creating Factor Levels

# property_type: find the 7 most popular types (slice(1:7) keeps seven rows)
top6_prop_type <- main_data2 %>%
  group_by(property_type)%>%
  summarise(count =n())%>%
  arrange(desc(count))%>%
  slice(1:7)

main_data3 <- main_data2 %>%
  mutate(prop_type_simplified = case_when(
    property_type %in% c("Apartment", "Condominium", "Hostel", "Serviced apartment", "Guesthouse", "House", "Hotel") ~ property_type, 
    TRUE ~ "Other"
  ))
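
As an alternative to typing the most popular types by hand, the lumping step could be sketched with forcats. This is only an illustration on a made-up vector, assuming forcats 0.5.0 or later (where `fct_lump_n()` is available); it is not the method used in this analysis.

```r
library(forcats)

# Toy property types (hypothetical values, for illustration only)
property_type <- c("Apartment", "Apartment", "Apartment",
                   "Condominium", "Condominium", "Hostel", "Tent")

# Keep the 2 most frequent levels and lump everything else into "Other"
prop_type_simplified <- fct_lump_n(property_type, n = 2, other_level = "Other")

table(prop_type_simplified)
```

With the real data, `n = 7` would correspond to the seven types kept above. The advantage of this form is that the kept levels follow the data if the listings file is refreshed.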

# Neighborhood_cleansed
top_neighbourhood_type <- main_data2 %>%
  group_by(neighbourhood_cleansed)%>%
  summarise(count=n())%>%
  arrange(desc(count))

# Room type
top4_room_type <- main_data2 %>%
  group_by(room_type)%>%
  summarise(count =n())%>%
  arrange(desc(count))%>%
  slice(1:6)

# No adjustment needed because room_type only has 4 values

# bed_type
top6_bed_type <- main_data2 %>%
  group_by(bed_type)%>%
  summarise(count =n())%>%
  arrange(desc(count))%>%
  slice(1:6)

# We won't look into this variable as almost all observations fall into the same category

# cancellation_policy
top6_cancel_type <- main_data2 %>%
  group_by(cancellation_policy)%>%
  summarise(count =n())%>%
  arrange(desc(count))%>%
  slice(1:10)

# we will make one bucket for all strict policies

main_data3 <- main_data3 %>%
  mutate(cancellation_policy = case_when(
    cancellation_policy %in% c("strict_14_with_grace_period", "super_strict_60", "super_strict_30", "strict") ~ "strict", 
    cancellation_policy %in% c("moderate") ~"moderate",
    cancellation_policy %in% c("flexible") ~"flexible"
  ))

# Check if number of levels is less than 6
top6_host_time_type <- main_data2 %>%
  group_by(host_response_time)%>%
  summarise(count =n())%>%
  arrange(desc(count))

# We only have 4 types of response time so we will keep it this way
# Factoring the categories

main_data4 <- main_data3 %>%
  mutate(
    # room_type = factor(room_type, order = TRUE, levels = c("Entire home/apt", "Hotel room", "Private room", "Shared room")),
         host_response_time = factor(host_response_time, ordered = TRUE, levels = c("within an hour", "within a few hours", "within a day","a few days or more"))
  )

# Create a new variable with the number of years the host has been a host (as of 2020-10-19)
main_data4 <- main_data4 %>% 
  mutate(host_since_calculated = as.numeric(ymd("2020-10-19") - host_since)/365)
# Convert the amenities variable to be a count of the number of amenities in a certain property, rather than comma separated values. 
main_data4<- main_data4 %>% 
  mutate(total_amenities=str_count(amenities, ',')+1)
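
One caveat of counting commas and adding 1 is that an empty amenities set such as `{}` is counted as one amenity. A minimal sketch of a count that returns 0 in that case, using made-up amenity strings and assuming no amenity name contains a comma itself:

```r
library(stringr)

# Hypothetical amenities strings in the same brace-delimited style
amenities <- c('{"Cable TV",Internet,Wifi}', '{}', '{Wifi}')

# Strip the surrounding braces so that an empty set becomes an empty string
inner <- str_remove_all(amenities, "[{}]")

# Empty sets get 0; otherwise count the commas and add 1
total_amenities <- ifelse(inner == "", 0L, str_count(inner, ",") + 1L)
total_amenities
```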
# Based on an interview with a local, we then assigned each neighbourhood to an area

hong_kong <-c("Central & Western","Wan Chai","Southern","Eastern")
kowloon <-c("Kowloon City","Yau Tsim Mong","Sham Shui Po","Wong Tai Sin", "Kwun Tong")
new_territories <-c("Tsuen Wan","Sai Kung", "Tai Po", "Tuen Mun","Kwai Tsing","Sha Tin","Yuen Long","North","Islands")
# Assigning neighbourhood simplified

main_data4 <- main_data4 %>% 
  mutate(neighbourhood_simplified = case_when(neighbourhood_cleansed %in% hong_kong ~ "hong_kong",
                                              neighbourhood_cleansed %in% kowloon ~ "kowloon",
                                              neighbourhood_cleansed %in% new_territories ~ "new_territories",
                                              ))
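
Because `case_when()` assigns the first matching zone, a district listed in two zones would be assigned silently, and a district missing from all zones would become NA. A small sanity check for the zone definitions can be sketched in base R; the object names `zones`, `duplicated_districts` and `zone_lookup` are introduced here for illustration only.

```r
# Zone definitions, with each of the 18 districts listed once
zones <- list(
  hong_kong       = c("Central & Western", "Wan Chai", "Southern", "Eastern"),
  kowloon         = c("Kowloon City", "Yau Tsim Mong", "Sham Shui Po",
                      "Wong Tai Sin", "Kwun Tong"),
  new_territories = c("Tsuen Wan", "Sai Kung", "Tai Po", "Tuen Mun", "Kwai Tsing",
                      "Sha Tin", "Yuen Long", "North", "Islands")
)

all_districts <- unlist(zones, use.names = FALSE)

# Districts that appear in more than one zone (should be empty)
duplicated_districts <- all_districts[duplicated(all_districts)]

# Named lookup vector: district -> zone
zone_lookup <- setNames(rep(names(zones), lengths(zones)), all_districts)

duplicated_districts
zone_lookup[c("Wan Chai", "Sha Tin")]
```

Comparing `names(zone_lookup)` with the distinct values of neighbourhood_cleansed would additionally reveal any district that is not covered.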
# Creating variable for the size of the apartment
main_data4 <- main_data4 %>% 
  mutate(is_it_big_sum = str_detect(summary,c("large", "big" , "Big" , "spacious", "bright", "Spacious", "Large", "Bright", "refurbished","Refurbished", "Renovated", "renovated")),
          is_it_big_spa = str_detect(space,c("large", "big" , "Big" , "spacious", "bright", "Spacious", "Large", "Bright", "refurbished","Refurbished", "Renovated", "renovated")),
          is_it_big_desc = str_detect(description,c("large", "big" , "Big" , "spacious", "bright", "Spacious", "Large", "Bright", "refurbished","Refurbished", "Renovated", "renovated")),
          is_it_big = is_it_big_sum | is_it_big_spa | is_it_big_desc
           )

# Check how many apartments are big         
proportions<- main_data4 %>% 
  group_by(is_it_big) %>% 
  summarise(count=n()) %>% 
  mutate(proportion=count/sum(count))
proportions%>%
    kable()%>%
  kable_styling()
is_it_big count proportion
FALSE 2452 0.640
TRUE 149 0.039
NA 1228 0.321
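
Note that `str_detect()` is vectorised over `pattern`, so a vector of keywords is paired with the rows element by element rather than every keyword being tested on every row; depending on the stringr version, this may recycle the keywords across rows or raise a recycling error. The small share of TRUE values above may partly reflect this. A minimal sketch of an alternative, assuming the intent is to flag a text when any keyword appears, on made-up texts:

```r
library(stringr)

# Toy listing texts (hypothetical, for illustration only)
texts <- c("A big and bright flat", "Cosy studio near MTR", NA, "Newly RENOVATED room")

# One case-insensitive pattern that matches any of the keywords
size_pattern <- regex("large|big|spacious|bright|refurbished|renovated",
                      ignore_case = TRUE)

# TRUE if any keyword occurs, FALSE if none does, NA if the text is missing
has_keyword <- str_detect(texts, size_pattern)
has_keyword
```

Missing texts still produce NA, so combining several columns with `|` yields NA whenever no column is TRUE and at least one is missing; wrapping each result in `dplyr::coalesce(x, FALSE)` would be one way to treat missing text as "not mentioned".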

Creating price_4_nights

main_data4 <- main_data4 %>%
  filter(guests_included<=2)%>% 
  mutate(
    #Extra charge is 0 if 2 guests are included, and extra_people otherwise
    extra_charge_pp= case_when(guests_included==2 ~ 0, guests_included ==1 ~ extra_people), 
    #Calculate the price for 4 nights for 2 people
    price_4_nights= price*4 + extra_charge_pp*4 + cleaning_fee
  )
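
The formula above can be checked by hand on a couple of listings. The helper below is a hypothetical function written for this illustration (it is not part of the analysis); it treats two or more included guests as "no extra charge", which matches the code above once listings with more than two included guests have been filtered out. The input values are illustrative.

```r
# Hypothetical helper mirroring the price_4_nights formula
price_for_4_nights <- function(price, cleaning_fee, guests_included, extra_people) {
  # The second guest only costs extra when just one guest is included in the price
  extra_charge_pp <- ifelse(guests_included >= 2, 0, extra_people)
  4 * price + 4 * extra_charge_pp + cleaning_fee
}

# Two guests included: 4 * 1395 + 310
price_for_4_nights(price = 1395, cleaning_fee = 310, guests_included = 2, extra_people = 155)

# One guest included: 4 * 992 + 4 * 116 + 310
price_for_4_nights(price = 992, cleaning_fee = 310, guests_included = 1, extra_people = 116)
```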

Visualizing the distribution of the variables of interest

Visualizing the distribution of price_4_nights

#density plot price_4_nights
ggplot(data=main_data4, aes(x=price_4_nights)) +
  geom_density(fill = "blue", alpha = 0.3) +
  labs(x="Price for Four Nights", y="Density", title = "The price for 4 nights is heavily right skewed") +
  theme_minimal()

This graph is heavily right skewed so we will try log(price_4_nights) next…

#density plot log
ggplot(data=main_data4, aes(x=price_4_nights)) + 
  scale_x_log10()+geom_density(fill = "blue", alpha = 0.3) + 
  labs(title = "The logarithm of the price for 4 nights is closer to a normal distribution...", x="Log of Price for Four Nights", y="Density") +
  theme_minimal()

First we use boxplots to analyse variables contributing to the price of 4 nights.

Room Type

ggplot(main_data4, aes(y= log(price_4_nights), x=room_type))+
  geom_boxplot()+
  coord_flip()+
  labs(
    title = "Median Prices Change by Room Type",
    subtitle = "Price for 4 nights vs Room Type",
    x = "Room Type",
    y = "log(Price for 4 nights)"
  ) +
  theme_minimal()

So we see that Private Room has the lowest median price, followed by Hotel Room and Entire Home/Apartment. Shared room seems to have a quite high median price for 4 nights. This seems odd. Let us double check:

double_check <- main_data4 %>%
  group_by(room_type)%>%
  summarise(median(price_4_nights),
            mean(price_4_nights))

double_check

room_type median(price_4_nights) mean(price_4_nights)
Entire home/apt 3.01e+03 4.17e+03
Hotel room 2.02e+03 3.66e+03
Private room 1.68e+03 2.68e+03
Shared room 4.11e+03 3.1e+03
Indeed, the median price for shared rooms is the highest among all categories, while the mean price is only the third largest. Because the median exceeds the mean, most shared rooms must be priced relatively high, with a smaller number of cheap listings pulling the mean down.

Neighbourhood

ggplot(main_data4, aes(y= log(price_4_nights), x=neighbourhood_simplified))+
  geom_boxplot() +
  coord_flip() +
  labs(
    title = "Different Zones seem to have an impact on the price",
    subtitle = "Price for 4 nights vs Neighbourhood Zone",
    x = "Neighbourhood Zone",
    y = "log(Price for 4 nights)"
  ) +
  theme_minimal()

Big Size

ggplot(main_data4, aes(x= log(price_4_nights), y=is_it_big))+
  geom_boxplot()+
  labs(
    title = "Airbnbs that are big tend to have higher prices",
    subtitle= "Price for 4 nights for Airbnbs that say they are 'big' ",
    y= "Airbnb is big?",
    x="log(Price for 4 nights)"
  )+
  theme_minimal()

From the graph above we see that we should definitely consider the variable is_it_big. Now let us check the variable host_is_superhost:

Superhost

main_data4 %>% 
  filter(!is.na(host_is_superhost)) %>% 
  ggplot(aes(x= log(price_4_nights), y=host_is_superhost))+
  geom_boxplot()+
  labs(
    title = "Super Host Variable shows a big impact on the price...",
    subtitle= "Price for 4 nights for Airbnbs with superhosts compared to regular hosts",
    y= "Host is superhost?",
    x="log(Price for 4 nights)"
  )+
  theme_minimal()

This analysis gives us a good overview of the individual variables. Now we will create a correlation scatterplot matrix to analyse how the variables in the dataset correlate with each other.

Numerical Variables

main_data4%>%
  select(bedrooms, bathrooms, accommodates, review_scores_rating, host_since_calculated, host_listings_count, host_response_rate, number_of_reviews, number_of_reviews_ltm, price_4_nights) %>% 
  ggpairs(aes(alpha=0.4)) +
  theme_minimal()

Overall we see two things. First, there are some correlations between price and our other numeric variables, which helps with modelling; variables that are more strongly correlated with price are the more promising predictors. Bedrooms, bathrooms and accommodates all have a correlation with price greater than 0.1 (10%). Second, some of our variables are also correlated with each other. We must therefore be cautious when adding variables to the model, because strongly correlated predictors cause multicollinearity, which inflates standard errors and makes individual coefficients unreliable. For example, bedrooms, bathrooms and accommodates have pairwise correlations greater than 0.5 (50%).
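
The check for strongly correlated pairs can also be done numerically instead of reading it off the plot. The sketch below uses a small made-up data frame (the values are not from the listings data) and the 0.5 threshold mentioned above.

```r
# Toy data (hypothetical values, for illustration only)
toy <- data.frame(
  bedrooms     = c(1, 2, 2, 3, 1, 4),
  accommodates = c(2, 4, 5, 6, 2, 8),
  reviews      = c(10, 3, 25, 7, 40, 1)
)

# Pairwise correlations, ignoring missing values pair by pair
cors <- cor(toy, use = "pairwise.complete.obs")

# Positions in the upper triangle whose absolute correlation exceeds 0.5
high <- which(abs(cors) > 0.5 & upper.tri(cors), arr.ind = TRUE)

# Readable table of the flagged pairs
flagged <- data.frame(
  var1 = rownames(cors)[high[, "row"]],
  var2 = colnames(cors)[high[, "col"]],
  r    = cors[high]
)
flagged
```

Pairs flagged this way are candidates for keeping only one of the two variables, or for checking the VIF once both are in a model.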

Now we can create scatter plots to visualize the relationship between some numeric variables and price.

Reviews

#Scatterplot of price_4_nights vs number of reviews
ggplot(main_data4, aes(y= log(price_4_nights), x=number_of_reviews))+
  geom_point(alpha=0.3)+
  geom_smooth(method="lm")+
  scale_x_log10()+
labs(
  title = "Prices decrease as number of reviews increases",
  subtitle= "Price for 4 nights vs Number of Reviews",
  x= "Number of Reviews",
  y="log(Price for 4 nights)") +
 theme_minimal()

Now that is definitely an interesting finding. Normally, one would expect that the more reviews a listing has, the higher the price the host can charge. However, that is not the case here.

Let us check if the variable for “Last 12 Months” reviews shows the same…

Reviews Last 12 Months

#Scatterplot of price_4_nights vs number of reviews in the last 12 months
ggplot(main_data4, aes(y= log(price_4_nights), x=number_of_reviews_ltm))+
  geom_point(alpha=0.3)+
  geom_smooth(method="lm")+
  scale_x_log10()+
labs(
  title = "Prices decrease as number of Last 12 months reviews increases",
  subtitle= "Price for 4 nights vs Number of Reviews (Last 12 Months)",
  x= "Number of Reviews (Last 12 Months)",
  y="log(Price for 4 nights)")+
  theme_minimal()

The last 12 months show a similar relationship. This will be interesting for modelling because it does not seem intuitive; there may be a different variable, correlated with the number of reviews, that explains this observation. For example, many new and highly priced apartments may have been built recently, so that these apartments and their hosts do not yet have many reviews. In that case the observation above would just be a coincidence.

Now let us check for the experience that hosts have:

Host experience

ggplot(main_data4, aes(y= log(price_4_nights), x=host_since_calculated))+
  geom_point(alpha=0.3)+
  geom_smooth(method='lm')+
  labs(
    title = "Host experience seems to be correlated with higher prices",
    subtitle= "Price for 4 nights vs Host experience",
    x= "Experience of Host (in Years)",
    y="log(Price for 4 nights)"
  )+
  theme_minimal()

Mapping

leaflet(data = main_data4) %>% 
  addProviderTiles("OpenStreetMap.Mapnik") %>% 
  addCircleMarkers(lng = ~longitude, 
                   lat = ~latitude, 
                   radius = 1, 
                   fillColor = "blue", 
                   fillOpacity = 0.4, 
                   popup = ~listing_url,
                   label = ~property_type)

Regression Analysis

We will be doing a regression analysis with price_4_nights as the target variable or Y. We decided to take the logarithmic transformation of price_4_nights to account for the outliers that were making the distribution right skewed. Without doing a logarithmic transformation, our regression models failed to explain most of the variability in price_4_nights.

Visualizing the distribution of price_4_nights

#density plot price_4_nights
ggplot(data=main_data4, aes(x=price_4_nights)) +
  geom_density(fill = "blue", alpha = 0.3) +
  labs(x="Price for Four Nights") +
  theme_minimal()

This graph doesn't help us as we can't interpret it, so we will try the log.

#density plot log
ggplot(data=main_data4, aes(x=price_4_nights)) + 
  scale_x_log10()+geom_density(fill = "blue", alpha = 0.3) + 
  labs(x="log of Price for Four Nights") +
  theme_minimal()

Model 1: Base Case

In our first model we will run a regression with prop_type_simplified, number_of_reviews and review_scores_rating as the explanatory variables.

# ***** Fit linear regression models: First, just the mean

model0 <- lm(log(price_4_nights) ~ 1, data= main_data4)

model0 %>% 
  broom::tidy(conf.int=TRUE)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 7.73 0.0122 634 0 7.7 7.75
model0 %>% 
  broom::glance()
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0 0 0.693 NA NA NA -3.4e+03 6.8e+03 6.81e+03 1.55e+03 3228 3229
# ***** Fit linear regression models: log(price_4_nights) on 3 explanatory variables.
model1 <- lm(log(price_4_nights) ~ prop_type_simplified + number_of_reviews + review_scores_rating, data= main_data4)

model1 %>% 
  broom::tidy(conf.int=TRUE)%>%
  kable()%>%
  kable_styling()
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 7.076 0.106 66.53 0.000 6.868 7.285
prop_type_simplifiedCondominium -0.186 0.043 -4.28 0.000 -0.271 -0.101
prop_type_simplifiedGuesthouse -0.369 0.063 -5.82 0.000 -0.493 -0.244
prop_type_simplifiedHostel -0.357 0.056 -6.42 0.000 -0.466 -0.248
prop_type_simplifiedHotel -0.139 0.062 -2.26 0.024 -0.260 -0.018
prop_type_simplifiedHouse -0.216 0.065 -3.35 0.001 -0.343 -0.090
prop_type_simplifiedOther -0.083 0.039 -2.13 0.033 -0.159 -0.007
prop_type_simplifiedServiced apartment -0.253 0.057 -4.44 0.000 -0.365 -0.141
number_of_reviews -0.001 0.000 -4.96 0.000 -0.001 -0.001
review_scores_rating 0.009 0.001 7.36 0.000 0.006 0.011
model1 %>% 
  broom::glance()%>%
 kable()%>%
  kable_styling()
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.068 0.065 0.609 19.9 0 9 -2255 4533 4597 904 2440 2450
msummary(model1)
##                                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)                             7.076490   0.106366   66.53  < 2e-16
## prop_type_simplifiedCondominium        -0.185723   0.043417   -4.28  2.0e-05
## prop_type_simplifiedGuesthouse         -0.368573   0.063353   -5.82  6.7e-09
## prop_type_simplifiedHostel             -0.357016   0.055609   -6.42  1.6e-10
## prop_type_simplifiedHotel              -0.139032   0.061578   -2.26  0.02404
## prop_type_simplifiedHouse              -0.216427   0.064640   -3.35  0.00083
## prop_type_simplifiedOther              -0.082629   0.038757   -2.13  0.03311
## prop_type_simplifiedServiced apartment -0.252819   0.056994   -4.44  9.6e-06
## number_of_reviews                      -0.001036   0.000209   -4.96  7.4e-07
## review_scores_rating                    0.008587   0.001167    7.36  2.5e-13
##                                           
## (Intercept)                            ***
## prop_type_simplifiedCondominium        ***
## prop_type_simplifiedGuesthouse         ***
## prop_type_simplifiedHostel             ***
## prop_type_simplifiedHotel              *  
## prop_type_simplifiedHouse              ***
## prop_type_simplifiedOther              *  
## prop_type_simplifiedServiced apartment ***
## number_of_reviews                      ***
## review_scores_rating                   ***
## 
## Residual standard error: 0.609 on 2440 degrees of freedom
##   (779 observations deleted due to missingness)
## Multiple R-squared:  0.0683, Adjusted R-squared:  0.0648 
## F-statistic: 19.9 on 9 and 2440 DF,  p-value: <2e-16
car::vif(model1)
##                      GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.03  7            1.00
## number_of_reviews    1.03  1            1.01
## review_scores_rating 1.04  1            1.02
autoplot(model1)

The coefficient of review_scores_rating indicates that, after controlling for the other variables, Airbnbs with higher ratings have a higher price. All variables used in this model are significant predictors of price, since the absolute values of their t-statistics are greater than 2. However, the adjusted R-squared of our first model is only 6.5%, indicating that these variables alone do not explain much of the variability in price. We computed the Variance Inflation Factor (VIF) for the model and observe that it does not suffer from multicollinearity, as the VIF is less than 5 for all variables.
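
Since the response is log(price_4_nights), a coefficient b corresponds to a multiplicative change of exp(b) in price, or roughly (exp(b) - 1) * 100 percent, holding the other variables fixed. A short sketch using two coefficients copied from the model 1 summary above:

```r
# Coefficients copied from the model 1 summary (log(price_4_nights) scale)
coefs <- c(review_scores_rating       = 0.008587,
           prop_type_simplifiedHostel = -0.357016)

# Approximate percentage change in price for a one-unit change in the predictor
pct_change <- (exp(coefs) - 1) * 100
round(pct_change, 1)
```

Under this reading, one additional rating point goes with a price that is a little under 1% higher, and a hostel is priced roughly 30% below the baseline property type, all else equal.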

Model 2: Adding room_type

We want to determine if room_type is a significant predictor of the cost for 4 nights, given everything else in the model.

# ***** Fit linear regression models: log(price_4_nights) on 4 explanatory variables.
model2 <- lm(log(price_4_nights) ~ prop_type_simplified + number_of_reviews +  review_scores_rating + room_type , data= main_data4)

model2 %>% 
  broom::tidy(conf.int=TRUE)%>%
   kable()%>%
  kable_styling()
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 7.531 0.103 73.246 0.000 7.330 7.733
prop_type_simplifiedCondominium -0.181 0.041 -4.451 0.000 -0.261 -0.101
prop_type_simplifiedGuesthouse -0.176 0.060 -2.921 0.004 -0.294 -0.058
prop_type_simplifiedHostel -0.191 0.055 -3.506 0.000 -0.298 -0.084
prop_type_simplifiedHotel -0.079 0.067 -1.176 0.240 -0.212 0.053
prop_type_simplifiedHouse -0.096 0.061 -1.568 0.117 -0.215 0.024
prop_type_simplifiedOther 0.016 0.038 0.432 0.666 -0.058 0.090
prop_type_simplifiedServiced apartment -0.093 0.054 -1.706 0.088 -0.199 0.014
number_of_reviews -0.001 0.000 -3.467 0.001 -0.001 0.000
review_scores_rating 0.006 0.001 5.307 0.000 0.004 0.008
room_typeHotel room -0.138 0.067 -2.062 0.039 -0.269 -0.007
room_typePrivate room -0.475 0.026 -17.955 0.000 -0.527 -0.423
room_typeShared room -0.165 0.063 -2.620 0.009 -0.288 -0.041
model2 %>% 
  broom::glance()%>%
   kable()%>%
  kable_styling()
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.183 0.179 0.57 45.6 0 12 -2094 4216 4297 793 2437 2450
msummary(model2)
##                                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)                             7.531184   0.102821   73.25  < 2e-16
## prop_type_simplifiedCondominium        -0.181293   0.040731   -4.45  8.9e-06
## prop_type_simplifiedGuesthouse         -0.176148   0.060308   -2.92  0.00352
## prop_type_simplifiedHostel             -0.191084   0.054507   -3.51  0.00046
## prop_type_simplifiedHotel              -0.079297   0.067443   -1.18  0.23981
## prop_type_simplifiedHouse              -0.095567   0.060960   -1.57  0.11708
## prop_type_simplifiedOther               0.016296   0.037764    0.43  0.66614
## prop_type_simplifiedServiced apartment -0.092799   0.054382   -1.71  0.08805
## number_of_reviews                      -0.000683   0.000197   -3.47  0.00053
## review_scores_rating                    0.005858   0.001104    5.31  1.2e-07
## room_typeHotel room                    -0.137632   0.066760   -2.06  0.03935
## room_typePrivate room                  -0.475182   0.026465  -17.96  < 2e-16
## room_typeShared room                   -0.164634   0.062837   -2.62  0.00885
##                                           
## (Intercept)                            ***
## prop_type_simplifiedCondominium        ***
## prop_type_simplifiedGuesthouse         ** 
## prop_type_simplifiedHostel             ***
## prop_type_simplifiedHotel                 
## prop_type_simplifiedHouse                 
## prop_type_simplifiedOther                 
## prop_type_simplifiedServiced apartment .  
## number_of_reviews                      ***
## review_scores_rating                   ***
## room_typeHotel room                    *  
## room_typePrivate room                  ***
## room_typeShared room                   ** 
## 
## Residual standard error: 0.57 on 2437 degrees of freedom
##   (779 observations deleted due to missingness)
## Multiple R-squared:  0.183,  Adjusted R-squared:  0.179 
## F-statistic: 45.6 on 12 and 2437 DF,  p-value: <2e-16
car::vif(model2)
##                      GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.63  7            1.04
## number_of_reviews    1.04  1            1.02
## review_scores_rating 1.06  1            1.03
## room_type            1.65  3            1.09
autoplot(model2)

We observe that room_type has improved our model and is a significant predictor of price. It is not collinear with our other variables, as the VIF is still less than 5. The adjusted R-squared of the new model has increased to 17.9%, indicating that room_type is an important indicator of price.
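
Whether a categorical variable as a whole adds explanatory power, given everything else in the model, can also be tested with a partial F-test on the two nested models. The sketch below uses the built-in mtcars data as a stand-in, so the variables are unrelated to Airbnb; with the models above, the analogous call would compare model1 and model2, which is valid here because both are fitted on the same 2450 observations.

```r
# Built-in mtcars data used as a stand-in for illustration
cars <- mtcars
cars$cyl <- factor(cars$cyl)

# Smaller model, and the same model with one extra categorical predictor
small_model <- lm(log(mpg) ~ wt, data = cars)
large_model <- lm(log(mpg) ~ wt + cyl, data = cars)

# Partial F-test: does adding the factor reduce the residual sum of squares
# by more than chance alone would explain?
comparison <- anova(small_model, large_model)
comparison
```

A small p-value in the second row indicates that the added factor improves the fit jointly across all of its levels, which is a more direct test than inspecting each dummy coefficient separately.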

Now we will explore adding further variables.

Model 3: Improving Model with More Listing Features

We now start our own exploration and first ask whether we can improve model 2.

In the visualisations above we saw that the variables bedrooms, beds, bathrooms, total_amenities and accommodates have a correlation of more than 10% with price. This makes sense, since larger Airbnbs tend to accommodate more guests and charge a higher price. We will look at the effect of these variables on our model. We also add number_of_reviews_ltm instead of number_of_reviews, since we found that it improves R-squared.

glimpse(main_data4) 
## Rows: 3,229
## Columns: 56
## $ id                          <dbl> 69074, 103760, 163664, 163742, 248140, 27…
## $ listing_url                 <chr> "https://www.airbnb.com/rooms/69074", "ht…
## $ summary                     <chr> "An ideal Hong location any visitor--hip …
## $ space                       <chr> "Filled with plants and art, this one bed…
## $ description                 <chr> "An ideal Hong location any visitor--hip …
## $ host_since                  <date> 2010-07-07, 2010-11-30, 2010-11-30, 2010…
## $ host_response_time          <ord> within a few hours, within an hour, withi…
## $ host_response_rate          <dbl> 86, 100, 100, 100, 100, 84, 84, 93, 93, 1…
## $ host_is_superhost           <lgl> TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, F…
## $ host_listings_count         <dbl> 2, 12, 12, 12, 1, 8, 8, 18, 18, 3, 15, 2,…
## $ host_has_profile_pic        <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
## $ host_identity_verified      <lgl> TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, F…
## $ neighbourhood_cleansed      <chr> "Central & Western", "Central & Western",…
## $ latitude                    <dbl> 22.3, 22.3, 22.3, 22.3, 22.3, 22.3, 22.5,…
## $ longitude                   <dbl> 114, 114, 114, 114, 114, 114, 114, 114, 1…
## $ is_location_exact           <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
## $ property_type               <chr> "Apartment", "Apartment", "Apartment", "A…
## $ room_type                   <chr> "Entire home/apt", "Entire home/apt", "En…
## $ accommodates                <dbl> 3, 6, 6, 6, 2, 4, 6, 3, 3, 5, 2, 3, 4, 3,…
## $ bathrooms                   <dbl> 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1,…
## $ bedrooms                    <dbl> 1, 2, 2, 2, 1, 1, 3, 1, 1, 1, 1, 2, 1, 2,…
## $ beds                        <dbl> 2, 3, 3, 3, 1, 1, 3, 1, 2, 3, 1, 2, 2, 2,…
## $ bed_type                    <chr> "Real Bed", "Real Bed", "Real Bed", "Real…
## $ amenities                   <chr> "{\"Cable TV\",Internet,Wifi,\"Air condit…
## $ price                       <dbl> 1395, 845, 690, 767, 643, 891, 930, 992, …
## $ security_deposit            <dbl> 2325, 775, 775, 775, 0, 1008, 930, 1163, …
## $ cleaning_fee                <dbl> 310, 271, 302, 302, 0, 310, 388, 310, 310…
## $ guests_included             <dbl> 2, 2, 2, 2, 1, 2, 2, 1, 1, 2, 1, 1, 2, 2,…
## $ extra_people                <dbl> 155, 194, 194, 194, 0, 155, 155, 116, 116…
## $ minimum_nights              <dbl> 3, 2, 2, 2, 1, 2, 4, 1, 1, 1, 1, 3, 2, 4,…
## $ maximum_nights              <dbl> 365, 365, 365, 365, 60, 365, 365, 14, 14,…
## $ has_availability            <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
## $ number_of_reviews           <dbl> 134, 271, 222, 225, 163, 43, 160, 15, 18,…
## $ number_of_reviews_ltm       <dbl> 4, 13, 16, 11, 12, 2, 49, 0, 0, 6, 2, 1, …
## $ last_review                 <date> 2020-03-24, 2020-05-22, 2020-04-11, 2020…
## $ review_scores_rating        <dbl> 97, 89, 86, 86, 98, 93, 97, 86, 99, 91, 7…
## $ review_scores_accuracy      <dbl> 10, 9, 9, 9, 10, 9, 9, 8, 10, 10, 8, 8, 1…
## $ review_scores_cleanliness   <dbl> 9, 9, 9, 9, 10, 9, 7, 8, 10, 10, 7, 8, 9,…
## $ review_scores_checkin       <dbl> 10, 10, 9, 10, 10, 9, 9, 9, 10, 10, 9, 9,…
## $ review_scores_communication <dbl> 10, 10, 9, 10, 10, 10, 9, 9, 10, 10, 8, 8…
## $ review_scores_location      <dbl> 10, 10, 10, 10, 9, 9, 9, 9, 10, 10, 9, 9,…
## $ review_scores_value         <dbl> 9, 9, 9, 9, 10, 9, 8, 8, 9, 9, 8, 9, 9, 9…
## $ instant_bookable            <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, …
## $ cancellation_policy         <chr> "strict", "strict", "strict", "strict", "…
## $ reviews_per_month           <dbl> 1.18, 2.47, 2.07, 2.09, 1.55, 0.41, 1.61,…
## $ has_summary                 <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
## $ prop_type_simplified        <chr> "Apartment", "Apartment", "Apartment", "A…
## $ host_since_calculated       <dbl> 10.29, 9.89, 9.89, 9.89, 9.01, 8.90, 8.90…
## $ total_amenities             <dbl> 33, 11, 18, 17, 27, 22, 36, 27, 28, 22, 2…
## $ neighbourhood_simplified    <chr> "hong_kong", "hong_kong", "hong_kong", "h…
## $ is_it_big_sum               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ is_it_big_spa               <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, …
## $ is_it_big_desc              <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, …
## $ is_it_big                   <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, …
## $ extra_charge_pp             <dbl> 0, 0, 0, 0, 0, 0, 0, 116, 116, 0, 0, 0, 0…
## $ price_4_nights              <dbl> 5890, 3651, 3062, 3370, 2572, 3874, 4108,…
# Model 3: add size and amenity variables; number_of_reviews_ltm replaces number_of_reviews
model3 <- lm(
  log(price_4_nights) ~ bathrooms + bedrooms + beds + total_amenities +
    accommodates + prop_type_simplified + number_of_reviews_ltm +
    review_scores_rating + room_type,
  data = main_data4
)

model3 %>%
  broom::tidy(conf.int = TRUE) %>%
  kable() %>%
  kable_styling()
term                                      estimate std.error statistic   p.value  conf.low conf.high
(Intercept)                                  6.989     0.102    68.392     0.000     6.789     7.190
bathrooms                                   -0.027     0.029    -0.935     0.350    -0.085     0.030
bedrooms                                     0.090     0.019     4.672     0.000     0.052     0.128
beds                                        -0.016     0.014    -1.155     0.248    -0.043     0.011
total_amenities                              0.013     0.001     9.331     0.000     0.011     0.016
accommodates                                 0.092     0.011     8.447     0.000     0.071     0.114
prop_type_simplifiedCondominium             -0.157     0.038    -4.137     0.000    -0.231    -0.082
prop_type_simplifiedGuesthouse              -0.125     0.056    -2.230     0.026    -0.235    -0.015
prop_type_simplifiedHostel                  -0.152     0.051    -2.994     0.003    -0.251    -0.052
prop_type_simplifiedHotel                   -0.105     0.063    -1.669     0.095    -0.228     0.018
prop_type_simplifiedHouse                   -0.150     0.057    -2.633     0.009    -0.261    -0.038
prop_type_simplifiedOther                    0.019     0.035     0.555     0.579    -0.049     0.088
prop_type_simplifiedServiced apartment      -0.081     0.051    -1.609     0.108    -0.181     0.018
number_of_reviews_ltm                       -0.006     0.001    -5.571     0.000    -0.008    -0.004
review_scores_rating                         0.005     0.001     4.805     0.000     0.003     0.007
room_typeHotel room                         -0.039     0.062    -0.633     0.527    -0.162     0.083
room_typePrivate room                       -0.374     0.026   -14.593     0.000    -0.424    -0.324
room_typeShared room                        -0.414     0.079    -5.205     0.000    -0.570    -0.258
model3 %>%
  broom::glance() %>%
  kable() %>%
  kable_styling()
r.squared  adj.r.squared  sigma  statistic  p.value  df  logLik   AIC   BIC  deviance  df.residual  nobs
      0.3          0.295  0.529       61.1        0  17   -1901  3841  3951       678         2426  2444
msummary(model3)
##                                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                             6.98944    0.10220   68.39  < 2e-16 ***
## bathrooms                              -0.02741    0.02932   -0.93   0.3500    
## bedrooms                                0.09010    0.01928    4.67  3.1e-06 ***
## beds                                   -0.01612    0.01395   -1.16   0.2480    
## total_amenities                         0.01342    0.00144    9.33  < 2e-16 ***
## accommodates                            0.09250    0.01095    8.45  < 2e-16 ***
## prop_type_simplifiedCondominium        -0.15676    0.03790   -4.14  3.6e-05 ***
## prop_type_simplifiedGuesthouse         -0.12527    0.05619   -2.23   0.0259 *  
## prop_type_simplifiedHostel             -0.15182    0.05071   -2.99   0.0028 ** 
## prop_type_simplifiedHotel              -0.10472    0.06274   -1.67   0.0952 .  
## prop_type_simplifiedHouse              -0.14952    0.05680   -2.63   0.0085 ** 
## prop_type_simplifiedOther               0.01942    0.03503    0.55   0.5793    
## prop_type_simplifiedServiced apartment -0.08138    0.05057   -1.61   0.1077    
## number_of_reviews_ltm                  -0.00569    0.00102   -5.57  2.8e-08 ***
## review_scores_rating                    0.00507    0.00105    4.80  1.6e-06 ***
## room_typeHotel room                    -0.03947    0.06234   -0.63   0.5267    
## room_typePrivate room                  -0.37391    0.02562  -14.59  < 2e-16 ***
## room_typeShared room                   -0.41376    0.07949   -5.20  2.1e-07 ***
## 
## Residual standard error: 0.529 on 2426 degrees of freedom
##   (785 observations deleted due to missingness)
## Multiple R-squared:   0.3,   Adjusted R-squared:  0.295 
## F-statistic: 61.1 on 17 and 2426 DF,  p-value: <2e-16
car::vif(model3)
##                       GVIF Df GVIF^(1/(2*Df))
## bathrooms             1.93  1            1.39
## bedrooms              1.74  1            1.32
## beds                  3.62  1            1.90
## total_amenities       1.16  1            1.08
## accommodates          3.64  1            1.91
## prop_type_simplified  1.69  7            1.04
## number_of_reviews_ltm 1.11  1            1.05
## review_scores_rating  1.12  1            1.06
## room_type             3.43  3            1.23
autoplot(model3)

Removing number_of_reviews leaves adjusted R-squared almost unchanged, and it does not alter the collinearity or the significance of the remaining variables. We can therefore drop it from the model and use number_of_reviews_ltm, the number of reviews in the last twelve months, in its place.

Looking at the variables we have just added, we see that although all of their GVIF values are below 5, some of them do not have a significant effect. We therefore keep those with the lowest p-values (equivalently, the largest absolute t-statistics): we drop beds (p = 0.248) and bathrooms (p = 0.350), and keep bedrooms, total_amenities and accommodates. This is sensible: the number of amenities and the number of guests a listing can host each affect the price in their own right, whereas the number of beds and bathrooms is usually correlated with the number of bedrooms and so adds little once bedrooms is in the model.
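Dropping two variables at once can also be checked jointly rather than one p-value at a time. The sketch below runs a partial F-test with anova() on simulated data; the data frame `sim` and its coefficients are invented for illustration and are not taken from main_data4.

```r
# Simulated data (hypothetical): beds and bathrooms add nothing beyond the other two
set.seed(7)
n <- 400
accommodates <- sample(2:8, n, replace = TRUE)
bedrooms     <- pmax(1, round(accommodates / 2 + rnorm(n, sd = 0.5)))
beds         <- pmax(1, round(accommodates / 2 + rnorm(n, sd = 0.7)))
bathrooms    <- pmax(1, round(bedrooms / 2 + rnorm(n, sd = 0.3)))
log_price    <- 7 + 0.09 * accommodates + 0.08 * bedrooms + rnorm(n, sd = 0.5)
sim <- data.frame(log_price, accommodates, bedrooms, beds, bathrooms)

full    <- lm(log_price ~ accommodates + bedrooms + beds + bathrooms, data = sim)
reduced <- lm(log_price ~ accommodates + bedrooms, data = sim)

# Partial F-test: H0 is that the coefficients of beds and bathrooms are both zero
f_test <- anova(reduced, full)
f_test
```

A large p-value in the second row would support removing both variables together. Note that anova() needs both models to be fitted to the same rows, so with real data containing NAs the two fits should use a common complete-case subset.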

Model 4: Taking out the less significant variables

After taking out the insignificant variables, we observe that our R-squared has not changed (0.3, with an adjusted R-squared of 0.295).

# Model 4: Model 3 without beds and bathrooms
model4 <- lm(
  log(price_4_nights) ~ bedrooms + accommodates + total_amenities +
    prop_type_simplified + number_of_reviews_ltm +
    review_scores_rating + room_type,
  data = main_data4
)

model4 %>%
  broom::tidy(conf.int = TRUE) %>%
  kable() %>%
  kable_styling()
term                                      estimate std.error statistic   p.value  conf.low conf.high
(Intercept)                                  6.971     0.099    70.072     0.000     6.776     7.166
bedrooms                                     0.082     0.018     4.585     0.000     0.047     0.117
accommodates                                 0.082     0.008    10.619     0.000     0.067     0.097
total_amenities                              0.013     0.001     9.374     0.000     0.011     0.016
prop_type_simplifiedCondominium             -0.162     0.038    -4.283     0.000    -0.236    -0.088
prop_type_simplifiedGuesthouse              -0.126     0.056    -2.236     0.025    -0.236    -0.015
prop_type_simplifiedHostel                  -0.152     0.051    -3.008     0.003    -0.252    -0.053
prop_type_simplifiedHotel                   -0.109     0.063    -1.735     0.083    -0.231     0.014
prop_type_simplifiedHouse                   -0.144     0.057    -2.546     0.011    -0.255    -0.033
prop_type_simplifiedOther                    0.019     0.035     0.551     0.582    -0.049     0.088
prop_type_simplifiedServiced apartment      -0.078     0.050    -1.553     0.121    -0.177     0.021
number_of_reviews_ltm                       -0.006     0.001    -5.542     0.000    -0.008    -0.004
review_scores_rating                         0.005     0.001     4.862     0.000     0.003     0.007
room_typeHotel room                         -0.041     0.062    -0.660     0.509    -0.163     0.081
room_typePrivate room                       -0.379     0.025   -14.944     0.000    -0.429    -0.330
room_typeShared room                        -0.479     0.065    -7.418     0.000    -0.605    -0.352
model4 %>%
  broom::glance() %>%
  kable() %>%
  kable_styling()
r.squared  adj.r.squared  sigma  statistic  p.value  df  logLik   AIC   BIC  deviance  df.residual  nobs
      0.3          0.295  0.529       69.4        0  15   -1904  3841  3940       679         2431  2447
msummary(model4)
##                                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                             6.97067    0.09948   70.07  < 2e-16 ***
## bedrooms                                0.08178    0.01784    4.58  4.8e-06 ***
## accommodates                            0.08213    0.00773   10.62  < 2e-16 ***
## total_amenities                         0.01344    0.00143    9.37  < 2e-16 ***
## prop_type_simplifiedCondominium        -0.16158    0.03772   -4.28  1.9e-05 ***
## prop_type_simplifiedGuesthouse         -0.12554    0.05616   -2.24   0.0255 *  
## prop_type_simplifiedHostel             -0.15243    0.05068   -3.01   0.0027 ** 
## prop_type_simplifiedHotel              -0.10866    0.06262   -1.74   0.0828 .  
## prop_type_simplifiedHouse              -0.14419    0.05664   -2.55   0.0110 *  
## prop_type_simplifiedOther               0.01928    0.03499    0.55   0.5816    
## prop_type_simplifiedServiced apartment -0.07831    0.05043   -1.55   0.1206    
## number_of_reviews_ltm                  -0.00565    0.00102   -5.54  3.3e-08 ***
## review_scores_rating                    0.00510    0.00105    4.86  1.2e-06 ***
## room_typeHotel room                    -0.04108    0.06227   -0.66   0.5094    
## room_typePrivate room                  -0.37931    0.02538  -14.94  < 2e-16 ***
## room_typeShared room                   -0.47854    0.06451   -7.42  1.6e-13 ***
## 
## Residual standard error: 0.529 on 2431 degrees of freedom
##   (782 observations deleted due to missingness)
## Multiple R-squared:   0.3,   Adjusted R-squared:  0.295 
## F-statistic: 69.4 on 15 and 2431 DF,  p-value: <2e-16
car::vif(model4)
##                       GVIF Df GVIF^(1/(2*Df))
## bedrooms              1.49  1            1.22
## accommodates          1.82  1            1.35
## total_amenities       1.16  1            1.08
## prop_type_simplified  1.66  7            1.04
## number_of_reviews_ltm 1.11  1            1.05
## review_scores_rating  1.11  1            1.06
## room_type             2.25  3            1.14
autoplot(model4)
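One caveat when comparing Model 3 and Model 4: the glance() output shows they were fitted on slightly different samples (2,444 versus 2,447 observations), because lm() drops rows with missing values in whichever variables each model uses. Likelihood-based measures such as AIC and BIC are only like-for-like when both models see the same rows. The sketch below shows one way to do that on simulated data; the data frame `sim` is invented for illustration and merely stands in for main_data4.

```r
# Simulated data with missing values (hypothetical), standing in for main_data4
set.seed(3)
n <- 400
sim <- data.frame(
  accommodates = sample(2:8, n, replace = TRUE),
  beds         = sample(1:5, n, replace = TRUE)
)
sim$log_price <- 7 + 0.09 * sim$accommodates + rnorm(n, sd = 0.5)
sim$beds[sample(n, 20)] <- NA   # beds is missing for some listings

# Restrict to rows that are complete for every variable used in either model
vars_used <- c("log_price", "accommodates", "beds")
common    <- sim[complete.cases(sim[, vars_used]), ]

larger  <- lm(log_price ~ accommodates + beds, data = common)
smaller <- lm(log_price ~ accommodates,        data = common)

# With identical rows, AIC and adjusted R-squared are directly comparable
comparison <- data.frame(
  model  = c("larger", "smaller"),
  nobs   = c(nobs(larger), nobs(smaller)),
  adj_r2 = c(summary(larger)$adj.r.squared, summary(smaller)$adj.r.squared),
  aic    = c(AIC(larger), AIC(smaller))
)
comparison
```

With only three rows of difference the conclusion here is unlikely to change, but the same pattern matters more when an added variable has many missing values.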

Model 5: Adding Host Variables

Next, we explore the effect of host-related variables on price. We add host_is_superhost, host_response_time, host_response_rate, host_has_profile_pic and host_since_calculated to the model.

#glimpse(main_data) 

# Model 5: Model 4 plus host-related variables
model5 <- lm(
  log(price_4_nights) ~ host_response_time + host_response_rate +
    host_is_superhost + host_has_profile_pic + host_since_calculated +
    bedrooms + accommodates + total_amenities + prop_type_simplified +
    number_of_reviews_ltm + review_scores_rating + room_type,
  data = main_data4
)

model5 %>%
  broom::tidy(conf.int = TRUE) %>%
  kable() %>%
  kable_styling()
term                                      estimate std.error statistic   p.value  conf.low conf.high
(Intercept)                                  7.022     0.214    32.889     0.000     6.604     7.441
host_response_time.L                         0.118     0.061     1.938     0.053    -0.001     0.237
host_response_time.Q                        -0.028     0.038    -0.730     0.466    -0.103     0.047
host_response_time.C                        -0.004     0.029    -0.127     0.899    -0.061     0.053
host_response_rate                           0.002     0.001     1.955     0.051     0.000     0.004
host_is_superhostTRUE                        0.147     0.028     5.269     0.000     0.092     0.202
host_has_profile_picTRUE                    -0.135     0.178    -0.759     0.448    -0.483     0.213
host_since_calculated                        0.021     0.006     3.879     0.000     0.011     0.032
bedrooms                                     0.082     0.018     4.601     0.000     0.047     0.116
accommodates                                 0.085     0.008    11.049     0.000     0.070     0.100
total_amenities                              0.012     0.001     8.125     0.000     0.009     0.015
prop_type_simplifiedCondominium             -0.149     0.037    -3.976     0.000    -0.222    -0.075
prop_type_simplifiedGuesthouse              -0.099     0.056    -1.767     0.077    -0.208     0.011
prop_type_simplifiedHostel                  -0.134     0.050    -2.675     0.008    -0.233    -0.036
prop_type_simplifiedHotel                   -0.072     0.064    -1.137     0.255    -0.197     0.052
prop_type_simplifiedHouse                   -0.168     0.057    -2.976     0.003    -0.279    -0.057
prop_type_simplifiedOther                    0.026     0.035     0.758     0.448    -0.042     0.095
prop_type_simplifiedServiced apartment      -0.023     0.051    -0.464     0.643    -0.123     0.076
number_of_reviews_ltm                       -0.006     0.001    -5.642     0.000    -0.008    -0.004
review_scores_rating                         0.003     0.001     3.272     0.001     0.001     0.006
room_typeHotel room                         -0.051     0.062    -0.823     0.411    -0.173     0.071
room_typePrivate room                       -0.385     0.026   -15.076     0.000    -0.435    -0.335
room_typeShared room                        -0.496     0.064    -7.734     0.000    -0.622    -0.371
model5 %>%
  broom::glance() %>%
  kable() %>%
  kable_styling()
r.squared  adj.r.squared  sigma  statistic  p.value  df  logLik   AIC   BIC  deviance  df.residual  nobs
    0.318          0.312  0.522       51.5        0  22   -1871  3790  3929       661         2424  2447
msummary(model5)
##                                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                             7.022437   0.213518   32.89  < 2e-16 ***
## host_response_time.L                    0.117697   0.060747    1.94  0.05280 .  
## host_response_time.Q                   -0.028067   0.038461   -0.73  0.46561    
## host_response_time.C                   -0.003680   0.028997   -0.13  0.89903    
## host_response_rate                      0.001825   0.000933    1.95  0.05071 .  
## host_is_superhostTRUE                   0.147240   0.027943    5.27  1.5e-07 ***
## host_has_profile_picTRUE               -0.134868   0.177586   -0.76  0.44766    
## host_since_calculated                   0.021479   0.005537    3.88  0.00011 ***
## bedrooms                                0.081653   0.017747    4.60  4.4e-06 ***
## accommodates                            0.085023   0.007695   11.05  < 2e-16 ***
## total_amenities                         0.011892   0.001464    8.13  7.0e-16 ***
## prop_type_simplifiedCondominium        -0.148797   0.037427   -3.98  7.2e-05 ***
## prop_type_simplifiedGuesthouse         -0.098551   0.055774   -1.77  0.07736 .  
## prop_type_simplifiedHostel             -0.134343   0.050217   -2.68  0.00752 ** 
## prop_type_simplifiedHotel              -0.072230   0.063500   -1.14  0.25545    
## prop_type_simplifiedHouse              -0.168189   0.056513   -2.98  0.00295 ** 
## prop_type_simplifiedOther               0.026365   0.034765    0.76  0.44830    
## prop_type_simplifiedServiced apartment -0.023470   0.050584   -0.46  0.64271    
## number_of_reviews_ltm                  -0.005864   0.001039   -5.64  1.9e-08 ***
## review_scores_rating                    0.003489   0.001066    3.27  0.00108 ** 
## room_typeHotel room                    -0.051063   0.062075   -0.82  0.41081    
## room_typePrivate room                  -0.384867   0.025528  -15.08  < 2e-16 ***
## room_typeShared room                   -0.496397   0.064180   -7.73  1.5e-14 ***
## 
## Residual standard error: 0.522 on 2424 degrees of freedom
##   (782 observations deleted due to missingness)
## Multiple R-squared:  0.318,  Adjusted R-squared:  0.312 
## F-statistic: 51.5 on 22 and 2424 DF,  p-value: <2e-16
car::vif(model5)
##                       GVIF Df GVIF^(1/(2*Df))
## host_response_time    8.57  3            1.43
## host_response_rate    7.73  1            2.78
## host_is_superhost     1.21  1            1.10
## host_has_profile_pic  1.04  1            1.02
## host_since_calculated 1.11  1            1.05
## bedrooms              1.51  1            1.23
## accommodates          1.85  1            1.36
## total_amenities       1.24  1            1.11
## prop_type_simplified  1.88  7            1.05
## number_of_reviews_ltm 1.18  1            1.09
## review_scores_rating  1.18  1            1.09
## room_type             2.43  3            1.16
autoplot(model5)

We observe that host_response_rate and host_response_time have high GVIF values (7.73 and 8.57, both above 5), which suggests that the two variables are collinear, so we keep only one of them. Their p-values in Model 5 are almost identical (0.051 for host_response_rate and 0.053 for the linear term of host_response_time); we keep host_response_time, which we find gives more meaningful results when explaining the data, and drop host_response_rate. The variable we created earlier for how long the host has been registered on Airbnb, host_since_calculated, has a significant effect on price, so we keep it. The host having a profile picture, however, does not make a statistically significant difference (p = 0.448), so we take out host_has_profile_pic. Lastly, host_is_superhost has a very low p-value as well, meaning that a listing being posted by a Superhost is associated with a higher price.
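A note on reading the host_response_time rows: the variable is stored as an ordered factor (`<ord>` in the glimpse output), and by default lm() codes ordered factors with orthogonal polynomial contrasts, which is why the terms are labelled .L, .Q and .C (linear, quadratic, cubic) rather than by level. The sketch below illustrates this on simulated data; the four level names and their order are assumptions made for the example, and the response is random noise.

```r
# Hypothetical ordered factor with four levels, mimicking host_response_time
levels_assumed <- c("within an hour", "within a few hours",
                    "within a day", "a few days or more")
set.seed(11)
n <- 200
response_time <- factor(sample(levels_assumed, n, replace = TRUE),
                        levels = levels_assumed, ordered = TRUE)
log_price <- 7 + rnorm(n, sd = 0.5)

# Default for ordered factors: orthogonal polynomial contrasts (.L, .Q, .C)
contr.poly(4)
fit_poly <- lm(log_price ~ response_time)
names(coef(fit_poly))

# Treating the factor as unordered gives one dummy per non-reference level,
# each read as a difference from the reference level "within an hour"
fit_dummy <- lm(log_price ~ factor(response_time, ordered = FALSE))
names(coef(fit_dummy))
```

Both codings give the same fitted values; they differ only in how the individual coefficients are read. If level-by-level differences are easier to explain, the unordered coding may be the more convenient choice.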

Model 6: Adjusting Host Variables

After taking out the unnecessary variables, we observe that our adjusted R-squared has not changed (0.312).

#glimpse(main_data) 

# Model 6: Model 5 without host_response_rate and host_has_profile_pic
model6 <- lm(
  log(price_4_nights) ~ host_response_time + host_is_superhost +
    host_since_calculated + bedrooms + accommodates + total_amenities +
    prop_type_simplified + number_of_reviews_ltm +
    review_scores_rating + room_type,
  data = main_data4
)

model6 %>%
  broom::tidy(conf.int = TRUE) %>%
  kable() %>%
  kable_styling()
term                                      estimate std.error statistic   p.value  conf.low conf.high
(Intercept)                                  6.998     0.104    67.127     0.000     6.793     7.202
host_response_time.L                         0.011     0.024     0.439     0.661    -0.037     0.058
host_response_time.Q                        -0.083     0.026    -3.234     0.001    -0.133    -0.033
host_response_time.C                        -0.022     0.027    -0.796     0.426    -0.075     0.032
host_is_superhostTRUE                        0.150     0.028     5.381     0.000     0.095     0.205
host_since_calculated                        0.023     0.006     4.104     0.000     0.012     0.033
bedrooms                                     0.082     0.018     4.614     0.000     0.047     0.117
accommodates                                 0.085     0.008    11.033     0.000     0.070     0.100
total_amenities                              0.012     0.001     8.161     0.000     0.009     0.015
prop_type_simplifiedCondominium             -0.151     0.037    -4.038     0.000    -0.224    -0.078
prop_type_simplifiedGuesthouse              -0.097     0.056    -1.741     0.082    -0.206     0.012
prop_type_simplifiedHostel                  -0.137     0.050    -2.737     0.006    -0.236    -0.039
prop_type_simplifiedHotel                   -0.059     0.063    -0.933     0.351    -0.183     0.065
prop_type_simplifiedHouse                   -0.170     0.057    -3.000     0.003    -0.280    -0.059
prop_type_simplifiedOther                    0.024     0.035     0.683     0.495    -0.044     0.092
prop_type_simplifiedServiced apartment      -0.031     0.050    -0.618     0.537    -0.130     0.068
number_of_reviews_ltm                       -0.006     0.001    -5.561     0.000    -0.008    -0.004
review_scores_rating                         0.004     0.001     3.394     0.001     0.002     0.006
room_typeHotel room                         -0.048     0.062    -0.774     0.439    -0.170     0.074
room_typePrivate room                       -0.383     0.026   -15.025     0.000    -0.434    -0.333
room_typeShared room                        -0.496     0.064    -7.730     0.000    -0.622    -0.370
model6 %>%
  broom::glance() %>%
  kable() %>%
  kable_styling()
r.squared  adj.r.squared  sigma  statistic  p.value  df  logLik   AIC   BIC  deviance  df.residual  nobs
    0.317          0.312  0.522       56.4        0  20   -1873  3790  3917       662         2426  2447
msummary(model6)
##                                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                             6.99781    0.10425   67.13  < 2e-16 ***
## host_response_time.L                    0.01064    0.02423    0.44   0.6606    
## host_response_time.Q                   -0.08263    0.02555   -3.23   0.0012 ** 
## host_response_time.C                   -0.02177    0.02736   -0.80   0.4263    
## host_is_superhostTRUE                   0.15012    0.02790    5.38  8.1e-08 ***
## host_since_calculated                   0.02262    0.00551    4.10  4.2e-05 ***
## bedrooms                                0.08190    0.01775    4.61  4.2e-06 ***
## accommodates                            0.08493    0.00770   11.03  < 2e-16 ***
## total_amenities                         0.01194    0.00146    8.16  5.3e-16 ***
## prop_type_simplifiedCondominium        -0.15111    0.03742   -4.04  5.6e-05 ***
## prop_type_simplifiedGuesthouse         -0.09711    0.05577   -1.74   0.0818 .  
## prop_type_simplifiedHostel             -0.13739    0.05021   -2.74   0.0063 ** 
## prop_type_simplifiedHotel              -0.05891    0.06314   -0.93   0.3509    
## prop_type_simplifiedHouse              -0.16960    0.05653   -3.00   0.0027 ** 
## prop_type_simplifiedOther               0.02373    0.03475    0.68   0.4947    
## prop_type_simplifiedServiced apartment -0.03118    0.05047   -0.62   0.5368    
## number_of_reviews_ltm                  -0.00578    0.00104   -5.56  3.0e-08 ***
## review_scores_rating                    0.00361    0.00106    3.39   0.0007 ***
## room_typeHotel room                    -0.04804    0.06209   -0.77   0.4391    
## room_typePrivate room                  -0.38349    0.02552  -15.03  < 2e-16 ***
## room_typeShared room                   -0.49629    0.06420   -7.73  1.6e-14 ***
## 
## Residual standard error: 0.522 on 2426 degrees of freedom
##   (782 observations deleted due to missingness)
## Multiple R-squared:  0.317,  Adjusted R-squared:  0.312 
## F-statistic: 56.4 on 20 and 2426 DF,  p-value: <2e-16
car::vif(model6)
##                       GVIF Df GVIF^(1/(2*Df))
## host_response_time    1.26  3            1.04
## host_is_superhost     1.21  1            1.10
## host_since_calculated 1.10  1            1.05
## bedrooms              1.51  1            1.23
## accommodates          1.84  1            1.36
## total_amenities       1.24  1            1.11
## prop_type_simplified  1.84  7            1.04
## number_of_reviews_ltm 1.18  1            1.09
## review_scores_rating  1.17  1            1.08
## room_type             2.42  3            1.16
autoplot(model6)

We also check for collinearity at this point and observe that all GVIF values are below 5, so we can keep building on this model.
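Because the response is log(price_4_nights), the Model 6 coefficients are easier to read once converted to percentage changes. The sketch below does this for a few estimates copied from the msummary(model6) output above; the vector `coefs` and the choice of which terms to include are ours, for illustration only.

```r
# Selected Model 6 estimates, copied from the msummary(model6) output
coefs <- c(
  host_is_superhostTRUE   =  0.15012,
  accommodates            =  0.08493,
  "room_typePrivate room" = -0.38349,
  "room_typeShared room"  = -0.49629
)

# For a log response, a coefficient b implies a 100 * (exp(b) - 1) percent change
# in price_4_nights per one-unit increase (or relative to the reference level)
pct_change <- 100 * expm1(coefs)
round(pct_change, 1)
```

Read this way, and holding the other variables fixed, a Superhost listing is roughly 16% more expensive, a private room is about 32% cheaper than an entire home/apt (the reference level of room_type), and a shared room is about 39% cheaper.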

Model 7: Adding Neighbourhood and Location Factors

We believe that location is an important factor for Hong Kong listings. We therefore add three related variables: neighbourhood_simplified, which indicates the part of the city in which the listing is located; is_location_exact, which indicates whether the location is shown exactly on the map; and latitude. We added latitude because we learned from interviewing friends from Hong Kong that prices in the city increase closer to the coast.

glimpse(main_data4) 
## Rows: 3,229
## Columns: 56
## $ id                          <dbl> 69074, 103760, 163664, 163742, 248140, 27…
## $ listing_url                 <chr> "https://www.airbnb.com/rooms/69074", "ht…
## $ summary                     <chr> "An ideal Hong location any visitor--hip …
## $ space                       <chr> "Filled with plants and art, this one bed…
## $ description                 <chr> "An ideal Hong location any visitor--hip …
## $ host_since                  <date> 2010-07-07, 2010-11-30, 2010-11-30, 2010…
## $ host_response_time          <ord> within a few hours, within an hour, withi…
## $ host_response_rate          <dbl> 86, 100, 100, 100, 100, 84, 84, 93, 93, 1…
## $ host_is_superhost           <lgl> TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, F…
## $ host_listings_count         <dbl> 2, 12, 12, 12, 1, 8, 8, 18, 18, 3, 15, 2,…
## $ host_has_profile_pic        <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
## $ host_identity_verified      <lgl> TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, F…
## $ neighbourhood_cleansed      <chr> "Central & Western", "Central & Western",…
## $ latitude                    <dbl> 22.3, 22.3, 22.3, 22.3, 22.3, 22.3, 22.5,…
## $ longitude                   <dbl> 114, 114, 114, 114, 114, 114, 114, 114, 1…
## $ is_location_exact           <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
## $ property_type               <chr> "Apartment", "Apartment", "Apartment", "A…
## $ room_type                   <chr> "Entire home/apt", "Entire home/apt", "En…
## $ accommodates                <dbl> 3, 6, 6, 6, 2, 4, 6, 3, 3, 5, 2, 3, 4, 3,…
## $ bathrooms                   <dbl> 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1,…
## $ bedrooms                    <dbl> 1, 2, 2, 2, 1, 1, 3, 1, 1, 1, 1, 2, 1, 2,…
## $ beds                        <dbl> 2, 3, 3, 3, 1, 1, 3, 1, 2, 3, 1, 2, 2, 2,…
## $ bed_type                    <chr> "Real Bed", "Real Bed", "Real Bed", "Real…
## $ amenities                   <chr> "{\"Cable TV\",Internet,Wifi,\"Air condit…
## $ price                       <dbl> 1395, 845, 690, 767, 643, 891, 930, 992, …
## $ security_deposit            <dbl> 2325, 775, 775, 775, 0, 1008, 930, 1163, …
## $ cleaning_fee                <dbl> 310, 271, 302, 302, 0, 310, 388, 310, 310…
## $ guests_included             <dbl> 2, 2, 2, 2, 1, 2, 2, 1, 1, 2, 1, 1, 2, 2,…
## $ extra_people                <dbl> 155, 194, 194, 194, 0, 155, 155, 116, 116…
## $ minimum_nights              <dbl> 3, 2, 2, 2, 1, 2, 4, 1, 1, 1, 1, 3, 2, 4,…
## $ maximum_nights              <dbl> 365, 365, 365, 365, 60, 365, 365, 14, 14,…
## $ has_availability            <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
## $ number_of_reviews           <dbl> 134, 271, 222, 225, 163, 43, 160, 15, 18,…
## $ number_of_reviews_ltm       <dbl> 4, 13, 16, 11, 12, 2, 49, 0, 0, 6, 2, 1, …
## $ last_review                 <date> 2020-03-24, 2020-05-22, 2020-04-11, 2020…
## $ review_scores_rating        <dbl> 97, 89, 86, 86, 98, 93, 97, 86, 99, 91, 7…
## $ review_scores_accuracy      <dbl> 10, 9, 9, 9, 10, 9, 9, 8, 10, 10, 8, 8, 1…
## $ review_scores_cleanliness   <dbl> 9, 9, 9, 9, 10, 9, 7, 8, 10, 10, 7, 8, 9,…
## $ review_scores_checkin       <dbl> 10, 10, 9, 10, 10, 9, 9, 9, 10, 10, 9, 9,…
## $ review_scores_communication <dbl> 10, 10, 9, 10, 10, 10, 9, 9, 10, 10, 8, 8…
## $ review_scores_location      <dbl> 10, 10, 10, 10, 9, 9, 9, 9, 10, 10, 9, 9,…
## $ review_scores_value         <dbl> 9, 9, 9, 9, 10, 9, 8, 8, 9, 9, 8, 9, 9, 9…
## $ instant_bookable            <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, …
## $ cancellation_policy         <chr> "strict", "strict", "strict", "strict", "…
## $ reviews_per_month           <dbl> 1.18, 2.47, 2.07, 2.09, 1.55, 0.41, 1.61,…
## $ has_summary                 <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
## $ prop_type_simplified        <chr> "Apartment", "Apartment", "Apartment", "A…
## $ host_since_calculated       <dbl> 10.29, 9.89, 9.89, 9.89, 9.01, 8.90, 8.90…
## $ total_amenities             <dbl> 33, 11, 18, 17, 27, 22, 36, 27, 28, 22, 2…
## $ neighbourhood_simplified    <chr> "hong_kong", "hong_kong", "hong_kong", "h…
## $ is_it_big_sum               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ is_it_big_spa               <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, …
## $ is_it_big_desc              <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, …
## $ is_it_big                   <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, …
## $ extra_charge_pp             <dbl> 0, 0, 0, 0, 0, 0, 0, 116, 116, 0, 0, 0, 0…
## $ price_4_nights              <dbl> 5890, 3651, 3062, 3370, 2572, 3874, 4108,…
# Model 7: log price on location, host, listing size and review variables
model7 <- lm(
  log(price_4_nights) ~ neighbourhood_simplified + is_location_exact + latitude +
    host_response_time + host_is_superhost + host_since_calculated +
    bedrooms + accommodates + total_amenities + prop_type_simplified +
    number_of_reviews_ltm + review_scores_rating + room_type,
  data = main_data4
)

model7 %>% 
  broom::tidy(conf.int=TRUE)%>%
  kable()%>%
  kable_styling()
term                                      estimate std.error statistic   p.value  conf.low conf.high
(Intercept)                                 53.200     5.203    10.224     0.000    42.997    63.404
neighbourhood_simplifiedkowloon             -0.196     0.029    -6.771     0.000    -0.253    -0.139
neighbourhood_simplifiednew_territories     -0.155     0.043    -3.600     0.000    -0.239    -0.070
is_location_exactTRUE                        0.015     0.023     0.670     0.503    -0.030     0.061
latitude                                    -2.065     0.233    -8.845     0.000    -2.522    -1.607
host_response_time.L                         0.015     0.023     0.642     0.521    -0.031     0.061
host_response_time.Q                        -0.024     0.025    -0.955     0.339    -0.073     0.025
host_response_time.C                        -0.018     0.026    -0.672     0.502    -0.069     0.034
host_is_superhostTRUE                        0.176     0.027     6.530     0.000     0.123     0.229
host_since_calculated                        0.006     0.005     1.073     0.283    -0.005     0.017
bedrooms                                     0.103     0.017     5.912     0.000     0.069     0.137
accommodates                                 0.084     0.008    11.146     0.000     0.069     0.098
total_amenities                              0.011     0.001     8.005     0.000     0.009     0.014
prop_type_simplifiedCondominium             -0.098     0.036    -2.697     0.007    -0.168    -0.027
prop_type_simplifiedGuesthouse              -0.094     0.054    -1.737     0.083    -0.199     0.012
prop_type_simplifiedHostel                  -0.102     0.049    -2.100     0.036    -0.197    -0.007
prop_type_simplifiedHotel                   -0.006     0.061    -0.099     0.921    -0.126     0.114
prop_type_simplifiedHouse                   -0.097     0.057    -1.712     0.087    -0.209     0.014
prop_type_simplifiedOther                    0.069     0.034     2.047     0.041     0.003     0.135
prop_type_simplifiedServiced apartment       0.031     0.049     0.643     0.521    -0.064     0.127
number_of_reviews_ltm                       -0.006     0.001    -6.085     0.000    -0.008    -0.004
review_scores_rating                         0.004     0.001     3.501     0.000     0.002     0.006
room_typeHotel room                         -0.016     0.061    -0.256     0.798    -0.135     0.103
room_typePrivate room                       -0.347     0.027   -13.086     0.000    -0.399    -0.295
room_typeShared room                        -0.486     0.062    -7.821     0.000    -0.608    -0.364
model7 %>% 
  broom::glance()%>%
  kable()%>%
  kable_styling()
r.squared adj.r.squared sigma statistic p.value df logLik  AIC  BIC deviance df.residual nobs
    0.372         0.366 0.501      59.8       0 24  -1770 3592 3743      609        2422 2447
msummary(model7)
##                                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)                             53.20017    5.20333   10.22  < 2e-16
## neighbourhood_simplifiedkowloon         -0.19621    0.02898   -6.77  1.6e-11
## neighbourhood_simplifiednew_territories -0.15479    0.04300   -3.60  0.00032
## is_location_exactTRUE                    0.01550    0.02315    0.67  0.50318
## latitude                                -2.06456    0.23341   -8.85  < 2e-16
## host_response_time.L                     0.01496    0.02330    0.64  0.52094
## host_response_time.Q                    -0.02396    0.02508   -0.96  0.33943
## host_response_time.C                    -0.01771    0.02635   -0.67  0.50164
## host_is_superhostTRUE                    0.17612    0.02697    6.53  8.0e-11
## host_since_calculated                    0.00584    0.00544    1.07  0.28344
## bedrooms                                 0.10315    0.01745    5.91  3.9e-09
## accommodates                             0.08370    0.00751   11.15  < 2e-16
## total_amenities                          0.01139    0.00142    8.00  1.8e-15
## prop_type_simplifiedCondominium         -0.09755    0.03618   -2.70  0.00705
## prop_type_simplifiedGuesthouse          -0.09362    0.05391   -1.74  0.08258
## prop_type_simplifiedHostel              -0.10199    0.04856   -2.10  0.03582
## prop_type_simplifiedHotel               -0.00608    0.06116   -0.10  0.92076
## prop_type_simplifiedHouse               -0.09749    0.05696   -1.71  0.08710
## prop_type_simplifiedOther                0.06890    0.03366    2.05  0.04077
## prop_type_simplifiedServiced apartment   0.03132    0.04874    0.64  0.52056
## number_of_reviews_ltm                   -0.00617    0.00101   -6.09  1.3e-09
## review_scores_rating                     0.00359    0.00103    3.50  0.00047
## room_typeHotel room                     -0.01555    0.06067   -0.26  0.79771
## room_typePrivate room                   -0.34737    0.02655  -13.09  < 2e-16
## room_typeShared room                    -0.48586    0.06212   -7.82  7.7e-15
##                                            
## (Intercept)                             ***
## neighbourhood_simplifiedkowloon         ***
## neighbourhood_simplifiednew_territories ***
## is_location_exactTRUE                      
## latitude                                ***
## host_response_time.L                       
## host_response_time.Q                       
## host_response_time.C                       
## host_is_superhostTRUE                   ***
## host_since_calculated                      
## bedrooms                                ***
## accommodates                            ***
## total_amenities                         ***
## prop_type_simplifiedCondominium         ** 
## prop_type_simplifiedGuesthouse          .  
## prop_type_simplifiedHostel              *  
## prop_type_simplifiedHotel                  
## prop_type_simplifiedHouse               .  
## prop_type_simplifiedOther               *  
## prop_type_simplifiedServiced apartment     
## number_of_reviews_ltm                   ***
## review_scores_rating                    ***
## room_typeHotel room                        
## room_typePrivate room                   ***
## room_typeShared room                    ***
## 
## Residual standard error: 0.501 on 2422 degrees of freedom
##   (782 observations deleted due to missingness)
## Multiple R-squared:  0.372,  Adjusted R-squared:  0.366 
## F-statistic: 59.8 on 24 and 2422 DF,  p-value: <2e-16
car::vif(model7)
##                          GVIF Df GVIF^(1/(2*Df))
## neighbourhood_simplified 2.61  2            1.27
## is_location_exact        1.11  1            1.05
## latitude                 1.61  1            1.27
## host_response_time       1.33  3            1.05
## host_is_superhost        1.23  1            1.11
## host_since_calculated    1.16  1            1.08
## bedrooms                 1.59  1            1.26
## accommodates             1.91  1            1.38
## total_amenities          1.27  1            1.13
## prop_type_simplified     2.19  7            1.06
## number_of_reviews_ltm    1.22  1            1.10
## review_scores_rating     1.18  1            1.09
## room_type                2.94  3            1.20
autoplot(model7)

The addition of these variables vastly improves our adjusted R squared, to about 37% (0.366). The neighbourhood terms are highly significant, so we can conclude that the neighbourhood of the listing is an important indicator of the price. However, whether the exact location is shown on the map does not make a statistically significant difference (p = 0.503), so we take out is_location_exact. The p value of host_since_calculated is also too large (p = 0.283), so we take it out as well.
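Because the response is log(price_4_nights), a coefficient is easier to read once it is converted to an approximate percentage change in price, 100 * (exp(estimate) - 1). The short sketch below applies this to a few model7 estimates copied from the output above. The names model7_estimates and pct_change are illustrative assumptions, not part of the original analysis; with the fitted object at hand, the same conversion could be applied to coef(model7).

```r
# Estimates copied from the model7 output above (log-price scale)
model7_estimates <- c(
  neighbourhood_simplifiedkowloon = -0.19621,
  host_is_superhostTRUE           =  0.17612,
  bedrooms                        =  0.10315,
  `room_typePrivate room`         = -0.34737
)

# Hypothetical helper: convert a log-scale coefficient
# into a percentage change in the untransformed price
pct_change <- function(beta) 100 * (exp(beta) - 1)

round(pct_change(model7_estimates), 1)
```

Read this way, a Kowloon listing is priced roughly 18% below a comparable listing in the baseline hong_kong zone, and a private room roughly 29% below an entire home, holding the other predictors fixed.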

Model 8: Adding Cancellation Policy and Security Deposit

Next, we would like to see the impact of adding cancellation_policy and security_deposit to the model.

glimpse(main_data4)
## Rows: 3,229
## Columns: 56
# Model 8: model 7 without is_location_exact and host_since_calculated,
# plus cancellation_policy and security_deposit
model8 <- lm(
  log(price_4_nights) ~ cancellation_policy + security_deposit +
    neighbourhood_simplified + latitude + host_response_time + host_is_superhost +
    bedrooms + accommodates + total_amenities + prop_type_simplified +
    number_of_reviews_ltm + review_scores_rating + room_type,
  data = main_data4
)


model8 %>% 
  broom::tidy(conf.int=TRUE)%>%
  kable()%>%
  kable_styling()
term                                      estimate std.error statistic   p.value  conf.low conf.high
(Intercept)                                 50.972     5.112     9.971     0.000    40.948    60.997
cancellation_policymoderate                  0.059     0.039     1.519     0.129    -0.017     0.135
cancellation_policystrict                    0.105     0.031     3.388     0.001     0.044     0.165
security_deposit                             0.000     0.000     4.381     0.000     0.000     0.000
neighbourhood_simplifiedkowloon             -0.187     0.029    -6.370     0.000    -0.245    -0.129
neighbourhood_simplifiednew_territories     -0.124     0.042    -2.918     0.004    -0.207    -0.041
latitude                                    -1.968     0.229    -8.581     0.000    -2.418    -1.518
host_response_time.L                         0.029     0.023     1.246     0.213    -0.017     0.074
host_response_time.Q                        -0.028     0.025    -1.128     0.260    -0.077     0.021
host_response_time.C                        -0.022     0.026    -0.821     0.412    -0.073     0.030
host_is_superhostTRUE                        0.177     0.027     6.617     0.000     0.124     0.229
bedrooms                                     0.109     0.017     6.273     0.000     0.075     0.143
accommodates                                 0.081     0.007    10.862     0.000     0.067     0.096
total_amenities                              0.011     0.001     7.518     0.000     0.008     0.013
prop_type_simplifiedCondominium             -0.098     0.036    -2.718     0.007    -0.168    -0.027
prop_type_simplifiedGuesthouse              -0.097     0.053    -1.821     0.069    -0.201     0.007
prop_type_simplifiedHostel                  -0.094     0.048    -1.932     0.053    -0.189     0.001
prop_type_simplifiedHotel                    0.018     0.061     0.302     0.763    -0.101     0.138
prop_type_simplifiedHouse                   -0.106     0.057    -1.863     0.063    -0.217     0.006
prop_type_simplifiedOther                    0.077     0.033     2.292     0.022     0.011     0.142
prop_type_simplifiedServiced apartment       0.035     0.048     0.724     0.469    -0.060     0.130
number_of_reviews_ltm                       -0.006     0.001    -5.981     0.000    -0.008    -0.004
review_scores_rating                         0.004     0.001     3.717     0.000     0.002     0.006
room_typeHotel room                         -0.030     0.060    -0.498     0.619    -0.149     0.088
room_typePrivate room                       -0.345     0.026   -13.013     0.000    -0.397    -0.293
room_typeShared room                        -0.437     0.062    -7.080     0.000    -0.558    -0.316
model8 %>% 
  broom::glance()%>%
  kable()%>%
  kable_styling()
r.squared adj.r.squared sigma statistic p.value df logLik  AIC  BIC deviance df.residual nobs
    0.381         0.374 0.498      59.5       0 25  -1754 3561 3718      601        2421 2447
msummary(model8)
##                                          Estimate Std. Error t value Pr(>|t|)
## (Intercept)                              5.10e+01   5.11e+00    9.97  < 2e-16
## cancellation_policymoderate              5.91e-02   3.89e-02    1.52  0.12881
## cancellation_policystrict                1.05e-01   3.09e-02    3.39  0.00071
## security_deposit                         3.24e-05   7.38e-06    4.38  1.2e-05
## neighbourhood_simplifiedkowloon         -1.87e-01   2.94e-02   -6.37  2.3e-10
## neighbourhood_simplifiednew_territories -1.24e-01   4.25e-02   -2.92  0.00355
## latitude                                -1.97e+00   2.29e-01   -8.58  < 2e-16
## host_response_time.L                     2.89e-02   2.32e-02    1.25  0.21294
## host_response_time.Q                    -2.81e-02   2.49e-02   -1.13  0.25956
## host_response_time.C                    -2.15e-02   2.62e-02   -0.82  0.41180
## host_is_superhostTRUE                    1.77e-01   2.67e-02    6.62  4.5e-11
## bedrooms                                 1.09e-01   1.74e-02    6.27  4.2e-10
## accommodates                             8.14e-02   7.49e-03   10.86  < 2e-16
## total_amenities                          1.07e-02   1.42e-03    7.52  7.8e-14
## prop_type_simplifiedCondominium         -9.77e-02   3.60e-02   -2.72  0.00661
## prop_type_simplifiedGuesthouse          -9.69e-02   5.32e-02   -1.82  0.06870
## prop_type_simplifiedHostel              -9.37e-02   4.85e-02   -1.93  0.05349
## prop_type_simplifiedHotel                1.84e-02   6.09e-02    0.30  0.76282
## prop_type_simplifiedHouse               -1.06e-01   5.67e-02   -1.86  0.06264
## prop_type_simplifiedOther                7.67e-02   3.35e-02    2.29  0.02196
## prop_type_simplifiedServiced apartment   3.50e-02   4.83e-02    0.72  0.46931
## number_of_reviews_ltm                   -6.00e-03   1.00e-03   -5.98  2.5e-09
## review_scores_rating                     3.79e-03   1.02e-03    3.72  0.00021
## room_typeHotel room                     -3.01e-02   6.04e-02   -0.50  0.61865
## room_typePrivate room                   -3.45e-01   2.65e-02  -13.01  < 2e-16
## room_typeShared room                    -4.37e-01   6.17e-02   -7.08  1.9e-12
##                                            
## (Intercept)                             ***
## cancellation_policymoderate                
## cancellation_policystrict               ***
## security_deposit                        ***
## neighbourhood_simplifiedkowloon         ***
## neighbourhood_simplifiednew_territories ** 
## latitude                                ***
## host_response_time.L                       
## host_response_time.Q                       
## host_response_time.C                       
## host_is_superhostTRUE                   ***
## bedrooms                                ***
## accommodates                            ***
## total_amenities                         ***
## prop_type_simplifiedCondominium         ** 
## prop_type_simplifiedGuesthouse          .  
## prop_type_simplifiedHostel              .  
## prop_type_simplifiedHotel                  
## prop_type_simplifiedHouse               .  
## prop_type_simplifiedOther               *  
## prop_type_simplifiedServiced apartment     
## number_of_reviews_ltm                   ***
## review_scores_rating                    ***
## room_typeHotel room                        
## room_typePrivate room                   ***
## room_typeShared room                    ***
## 
## Residual standard error: 0.498 on 2421 degrees of freedom
##   (782 observations deleted due to missingness)
## Multiple R-squared:  0.381,  Adjusted R-squared:  0.374 
## F-statistic: 59.5 on 25 and 2421 DF,  p-value: <2e-16
car::vif(model8)
##                          GVIF Df GVIF^(1/(2*Df))
## cancellation_policy      1.30  2            1.07
## security_deposit         1.12  1            1.06
## neighbourhood_simplified 2.75  2            1.29
## latitude                 1.57  1            1.25
## host_response_time       1.34  3            1.05
## host_is_superhost        1.22  1            1.10
## bedrooms                 1.60  1            1.26
## accommodates             1.92  1            1.39
## total_amenities          1.29  1            1.13
## prop_type_simplified     2.19  7            1.06
## number_of_reviews_ltm    1.21  1            1.10
## review_scores_rating     1.19  1            1.09
## room_type                2.96  3            1.20
autoplot(model8)

These new variables did not change our R squared much (the adjusted R squared rises from 0.366 to 0.374), but we observe that the strict cancellation policy (p = 0.001) and the security deposit (p < 0.001) are statistically significant. Thus, we decided to keep them.
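One way to back up a decision like this is a partial F-test, which compares a model with and without the candidate variables. The sketch below illustrates the idea on simulated data, since main_data4 is not reproduced here. The data frame sim, its columns and the effect sizes are hypothetical stand-ins, not values from the Airbnb data.

```r
set.seed(42)
n <- 500

# Simulated stand-in data (hypothetical; not the Airbnb listings)
sim <- data.frame(
  bedrooms            = sample(1:3, n, replace = TRUE),
  security_deposit    = runif(n, 0, 3000),
  cancellation_policy = factor(sample(c("flexible", "moderate", "strict"),
                                      n, replace = TRUE))
)
sim$log_price <- 7 + 0.10 * sim$bedrooms +
  3e-05 * sim$security_deposit +
  0.10 * (sim$cancellation_policy == "strict") +
  rnorm(n, sd = 0.5)

# Smaller model, then the same model plus the two candidate variables
small <- lm(log_price ~ bedrooms, data = sim)
large <- update(small, . ~ . + cancellation_policy + security_deposit)

# Partial F-test: do the added terms jointly reduce the residual variance?
comparison <- anova(small, large)
comparison
```

On the real data, the smaller model would be model8 without cancellation_policy and security_deposit. anova() needs both fits to use the same rows, so with missing values it may be necessary to drop incomplete rows before fitting either model.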

Model 9: Adding Review Types

Lastly, we would like to try adding the specific review scores to our model, although we know that they might be correlated with one another and with the overall rating.

glimpse(main_data4) 
## Rows: 3,229
## Columns: 56
# Model 9: model 8 plus the specific review scores
model9 <- lm(
  log(price_4_nights) ~ review_scores_rating + review_scores_cleanliness +
    review_scores_checkin + review_scores_communication +
    review_scores_location + review_scores_value +
    cancellation_policy + security_deposit + neighbourhood_simplified + latitude +
    host_response_time + host_is_superhost + bedrooms + accommodates +
    total_amenities + prop_type_simplified + number_of_reviews_ltm + room_type,
  data = main_data4
)

model9 %>% 
  broom::tidy(conf.int=TRUE)%>%
  kable()%>%
  kable_styling()
term                                      estimate std.error statistic   p.value  conf.low conf.high
(Intercept)                                 49.136     5.095     9.644     0.000    39.145    59.127
review_scores_rating                         0.004     0.002     2.035     0.042     0.000     0.009
review_scores_cleanliness                    0.024     0.016     1.539     0.124    -0.007     0.054
review_scores_checkin                        0.029     0.019     1.534     0.125    -0.008     0.065
review_scores_communication                  0.043     0.019     2.249     0.025     0.006     0.080
review_scores_location                      -0.040     0.018    -2.289     0.022    -0.075    -0.006
review_scores_value                         -0.064     0.017    -3.792     0.000    -0.098    -0.031
cancellation_policymoderate                  0.055     0.039     1.411     0.158    -0.021     0.131
cancellation_policystrict                    0.099     0.031     3.190     0.001     0.038     0.159
security_deposit                             0.000     0.000     4.212     0.000     0.000     0.000
neighbourhood_simplifiedkowloon             -0.178     0.029    -6.069     0.000    -0.236    -0.120
neighbourhood_simplifiednew_territories     -0.144     0.043    -3.359     0.001    -0.227    -0.060
latitude                                    -1.885     0.229    -8.242     0.000    -2.333    -1.436
host_response_time.L                         0.030     0.023     1.291     0.197    -0.016     0.075
host_response_time.Q                        -0.032     0.025    -1.297     0.195    -0.081     0.016
host_response_time.C                        -0.018     0.026    -0.682     0.496    -0.069     0.033
host_is_superhostTRUE                        0.172     0.027     6.475     0.000     0.120     0.225
bedrooms                                     0.115     0.017     6.602     0.000     0.081     0.149
accommodates                                 0.080     0.007    10.711     0.000     0.065     0.095
total_amenities                              0.010     0.001     7.241     0.000     0.008     0.013
prop_type_simplifiedCondominium             -0.094     0.036    -2.621     0.009    -0.164    -0.024
prop_type_simplifiedGuesthouse              -0.089     0.053    -1.677     0.094    -0.193     0.015
prop_type_simplifiedHostel                  -0.097     0.048    -2.014     0.044    -0.192    -0.003
prop_type_simplifiedHotel                    0.014     0.061     0.227     0.821    -0.105     0.133
prop_type_simplifiedHouse                   -0.105     0.057    -1.861     0.063    -0.216     0.006
prop_type_simplifiedOther                    0.074     0.033     2.223     0.026     0.009     0.139
prop_type_simplifiedServiced apartment       0.030     0.048     0.623     0.533    -0.065     0.125
number_of_reviews_ltm                       -0.006     0.001    -5.991     0.000    -0.008    -0.004
room_typeHotel room                         -0.008     0.060    -0.134     0.893    -0.126     0.110
room_typePrivate room                       -0.338     0.026   -12.804     0.000    -0.390    -0.286
room_typeShared room                        -0.431     0.061    -7.010     0.000    -0.552    -0.311
model9 %>% 
  broom::glance()%>%
  kable()%>%
  kable_styling()
r.squared adj.r.squared sigma statistic p.value df logLik  AIC  BIC deviance df.residual nobs
    0.389         0.381 0.495      51.2       0 30  -1737 3538 3724      593        2415 2446
msummary(model9)
##                                          Estimate Std. Error t value Pr(>|t|)
## (Intercept)                              4.91e+01   5.09e+00    9.64  < 2e-16
## review_scores_rating                     4.43e-03   2.18e-03    2.03  0.04198
## review_scores_cleanliness                2.40e-02   1.56e-02    1.54  0.12383
## review_scores_checkin                    2.87e-02   1.87e-02    1.53  0.12509
## review_scores_communication              4.30e-02   1.91e-02    2.25  0.02460
## review_scores_location                  -4.02e-02   1.76e-02   -2.29  0.02216
## review_scores_value                     -6.45e-02   1.70e-02   -3.79  0.00015
## cancellation_policymoderate              5.46e-02   3.87e-02    1.41  0.15826
## cancellation_policystrict                9.85e-02   3.09e-02    3.19  0.00144
## security_deposit                         3.10e-05   7.36e-06    4.21  2.6e-05
## neighbourhood_simplifiedkowloon         -1.78e-01   2.93e-02   -6.07  1.5e-09
## neighbourhood_simplifiednew_territories -1.44e-01   4.28e-02   -3.36  0.00079
## latitude                                -1.88e+00   2.29e-01   -8.24  2.7e-16
## host_response_time.L                     2.99e-02   2.31e-02    1.29  0.19681
## host_response_time.Q                    -3.22e-02   2.48e-02   -1.30  0.19492
## host_response_time.C                    -1.78e-02   2.61e-02   -0.68  0.49557
## host_is_superhostTRUE                    1.72e-01   2.66e-02    6.48  1.1e-10
## bedrooms                                 1.15e-01   1.74e-02    6.60  5.0e-11
## accommodates                             8.00e-02   7.46e-03   10.71  < 2e-16
## total_amenities                          1.03e-02   1.42e-03    7.24  6.0e-13
## prop_type_simplifiedCondominium         -9.39e-02   3.58e-02   -2.62  0.00883
## prop_type_simplifiedGuesthouse          -8.90e-02   5.31e-02   -1.68  0.09368
## prop_type_simplifiedHostel              -9.72e-02   4.83e-02   -2.01  0.04412
## prop_type_simplifiedHotel                1.37e-02   6.06e-02    0.23  0.82070
## prop_type_simplifiedHouse               -1.05e-01   5.65e-02   -1.86  0.06287
## prop_type_simplifiedOther                7.41e-02   3.33e-02    2.22  0.02633
## prop_type_simplifiedServiced apartment   3.01e-02   4.83e-02    0.62  0.53310
## number_of_reviews_ltm                   -5.98e-03   9.98e-04   -5.99  2.4e-09
## room_typeHotel room                     -8.11e-03   6.03e-02   -0.13  0.89311
## room_typePrivate room                   -3.38e-01   2.64e-02  -12.80  < 2e-16
## room_typeShared room                    -4.31e-01   6.15e-02   -7.01  3.1e-12
##                                            
## (Intercept)                             ***
## review_scores_rating                    *  
## review_scores_cleanliness                  
## review_scores_checkin                      
## review_scores_communication             *  
## review_scores_location                  *  
## review_scores_value                     ***
## cancellation_policymoderate                
## cancellation_policystrict               ** 
## security_deposit                        ***
## neighbourhood_simplifiedkowloon         ***
## neighbourhood_simplifiednew_territories ***
## latitude                                ***
## host_response_time.L                       
## host_response_time.Q                       
## host_response_time.C                       
## host_is_superhostTRUE                   ***
## bedrooms                                ***
## accommodates                            ***
## total_amenities                         ***
## prop_type_simplifiedCondominium         ** 
## prop_type_simplifiedGuesthouse          .  
## prop_type_simplifiedHostel              *  
## prop_type_simplifiedHotel                  
## prop_type_simplifiedHouse               .  
## prop_type_simplifiedOther               *  
## prop_type_simplifiedServiced apartment     
## number_of_reviews_ltm                   ***
## room_typeHotel room                        
## room_typePrivate room                   ***
## room_typeShared room                    ***
## 
## Residual standard error: 0.495 on 2415 degrees of freedom
##   (783 observations deleted due to missingness)
## Multiple R-squared:  0.389,  Adjusted R-squared:  0.381 
## F-statistic: 51.2 on 30 and 2415 DF,  p-value: <2e-16
car::vif(model9)
##                             GVIF Df GVIF^(1/(2*Df))
## review_scores_rating        5.36  1            2.32
## review_scores_cleanliness   3.04  1            1.74
## review_scores_checkin       3.48  1            1.86
## review_scores_communication 3.50  1            1.87
## review_scores_location      2.10  1            1.45
## review_scores_value         3.35  1            1.83
## cancellation_policy         1.31  2            1.07
## security_deposit            1.12  1            1.06
## neighbourhood_simplified    2.86  2            1.30
## latitude                    1.58  1            1.26
## host_response_time          1.36  3            1.05
## host_is_superhost           1.22  1            1.11
## bedrooms                    1.61  1            1.27
## accommodates                1.93  1            1.39
## total_amenities             1.29  1            1.14
## prop_type_simplified        2.27  7            1.06
## number_of_reviews_ltm       1.21  1            1.10
## room_type                   3.00  3            1.20
autoplot(model9)

We take out the accuracy, cleanliness and check-in scores because of their high p-values; their collinearity also causes high VIFs. The overall rating stays in the model. We keep the communication, location and value scores because they make the biggest difference among the ones that we have just added. This is meaningful in the sense that it shows a correlation between the price paid for a listing and the communication with the host. We observe that our adjusted R squared increases.

Model 10: Adjusted Final Model

After our final adjustments, we reach an R squared of about 38%. Although this is not high, the model does not suffer from collinearity (every GVIF is below 5), and most of our variables are statistically significant, with only a few coefficients having p-values above 5%. We decided to keep those because they belong to categorical variables, some of whose other levels have p-values below 5%.
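
Keeping a categorical variable whose individual levels are not all significant can be backed by a joint test of the whole factor. The sketch below is only an illustration: it uses the built-in mtcars data rather than our Airbnb listings, and the toy model in it is hypothetical. drop1() with an F test reports one p-value per term instead of one per level.

```r
# Illustration on built-in data (mtcars), not on the Airbnb listings.
cars <- transform(mtcars, cyl = factor(cyl))   # cyl becomes a 3-level factor

# Hypothetical toy model with one numeric and one categorical predictor
fit <- lm(log(mpg) ~ wt + cyl, data = cars)

# One F test per term: the factor is tested as a whole (2 degrees of freedom)
joint <- drop1(fit, test = "F")
print(joint)
```

Applied to our data, the same call on the fitted model would give a single p-value for each of cancellation_policy, host_response_time, prop_type_simplified and room_type.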

model10 <- lm(log(price_4_nights) ~ review_scores_communication + review_scores_location +
                review_scores_value + cancellation_policy + security_deposit +
                neighbourhood_simplified + latitude + host_response_time +
                host_is_superhost + bedrooms + accommodates + total_amenities +
                prop_type_simplified + number_of_reviews_ltm + review_scores_rating +
                room_type,
              data = main_data4)

model10 %>% 
  broom::tidy(conf.int=TRUE)%>%
  kable()%>%
  kable_styling()
term                                       estimate std.error statistic   p.value  conf.low conf.high
(Intercept)                                  49.467     5.096     9.708     0.000    39.475    59.460
review_scores_communication                   0.058     0.017     3.497     0.000     0.026     0.091
review_scores_location                       -0.038     0.017    -2.195     0.028    -0.072    -0.004
review_scores_value                          -0.059     0.017    -3.562     0.000    -0.092    -0.027
cancellation_policymoderate                   0.056     0.039     1.444     0.149    -0.020     0.132
cancellation_policystrict                     0.100     0.031     3.243     0.001     0.040     0.160
security_deposit                              0.000     0.000     4.265     0.000     0.000     0.000
neighbourhood_simplifiedkowloon              -0.179     0.029    -6.091     0.000    -0.236    -0.121
neighbourhood_simplifiednew_territories      -0.142     0.043    -3.329     0.001    -0.226    -0.058
latitude                                     -1.897     0.229    -8.293     0.000    -2.345    -1.448
host_response_time.L                          0.030     0.023     1.288     0.198    -0.016     0.075
host_response_time.Q                         -0.030     0.025    -1.228     0.220    -0.079     0.018
host_response_time.C                         -0.019     0.026    -0.721     0.471    -0.070     0.032
host_is_superhostTRUE                         0.174     0.027     6.544     0.000     0.122     0.226
bedrooms                                      0.113     0.017     6.501     0.000     0.079     0.147
accommodates                                  0.080     0.007    10.755     0.000     0.066     0.095
total_amenities                               0.010     0.001     7.313     0.000     0.008     0.013
prop_type_simplifiedCondominium              -0.097     0.036    -2.711     0.007    -0.167    -0.027
prop_type_simplifiedGuesthouse               -0.091     0.053    -1.717     0.086    -0.195     0.013
prop_type_simplifiedHostel                   -0.095     0.048    -1.971     0.049    -0.190    -0.001
prop_type_simplifiedHotel                     0.014     0.061     0.236     0.813    -0.105     0.133
prop_type_simplifiedHouse                    -0.105     0.057    -1.856     0.064    -0.216     0.006
prop_type_simplifiedOther                     0.073     0.033     2.203     0.028     0.008     0.139
prop_type_simplifiedServiced apartment        0.027     0.048     0.558     0.577    -0.068     0.122
number_of_reviews_ltm                        -0.006     0.001    -6.006     0.000    -0.008    -0.004
review_scores_rating                          0.007     0.002     3.596     0.000     0.003     0.010
room_typeHotel room                          -0.007     0.060    -0.121     0.904    -0.126     0.111
room_typePrivate room                        -0.337     0.026   -12.753     0.000    -0.389    -0.285
room_typeShared room                         -0.430     0.062    -6.995     0.000    -0.551    -0.310
model10 %>% 
  broom::glance()%>%
  kable()%>%
  kable_styling()
r.squared adj.r.squared sigma statistic p.value df logLik  AIC  BIC deviance df.residual nobs
    0.387          0.38 0.496      54.6       0 28  -1740 3540 3714      594        2418 2447
msummary(model10)
##                                          Estimate Std. Error t value Pr(>|t|)
## (Intercept)                              4.95e+01   5.10e+00    9.71  < 2e-16
## review_scores_communication              5.83e-02   1.67e-02    3.50  0.00048
## review_scores_location                  -3.80e-02   1.73e-02   -2.19  0.02829
## review_scores_value                     -5.94e-02   1.67e-02   -3.56  0.00037
## cancellation_policymoderate              5.59e-02   3.87e-02    1.44  0.14881
## cancellation_policystrict                1.00e-01   3.08e-02    3.24  0.00120
## security_deposit                         3.14e-05   7.35e-06    4.27  2.1e-05
## neighbourhood_simplifiedkowloon         -1.79e-01   2.93e-02   -6.09  1.3e-09
## neighbourhood_simplifiednew_territories -1.42e-01   4.27e-02   -3.33  0.00089
## latitude                                -1.90e+00   2.29e-01   -8.29  < 2e-16
## host_response_time.L                     2.98e-02   2.31e-02    1.29  0.19779
## host_response_time.Q                    -3.05e-02   2.48e-02   -1.23  0.21958
## host_response_time.C                    -1.88e-02   2.61e-02   -0.72  0.47107
## host_is_superhostTRUE                    1.74e-01   2.66e-02    6.54  7.3e-11
## bedrooms                                 1.13e-01   1.73e-02    6.50  9.7e-11
## accommodates                             8.03e-02   7.46e-03   10.76  < 2e-16
## total_amenities                          1.04e-02   1.42e-03    7.31  3.5e-13
## prop_type_simplifiedCondominium         -9.71e-02   3.58e-02   -2.71  0.00676
## prop_type_simplifiedGuesthouse          -9.11e-02   5.31e-02   -1.72  0.08613
## prop_type_simplifiedHostel              -9.52e-02   4.83e-02   -1.97  0.04878
## prop_type_simplifiedHotel                1.43e-02   6.06e-02    0.24  0.81326
## prop_type_simplifiedHouse               -1.05e-01   5.66e-02   -1.86  0.06360
## prop_type_simplifiedOther                7.34e-02   3.33e-02    2.20  0.02771
## prop_type_simplifiedServiced apartment   2.69e-02   4.83e-02    0.56  0.57721
## number_of_reviews_ltm                   -6.00e-03   9.99e-04   -6.01  2.2e-09
## review_scores_rating                     6.77e-03   1.88e-03    3.60  0.00033
## room_typeHotel room                     -7.28e-03   6.04e-02   -0.12  0.90403
## room_typePrivate room                   -3.37e-01   2.64e-02  -12.75  < 2e-16
## room_typeShared room                    -4.30e-01   6.15e-02   -6.99  3.4e-12
##                                            
## (Intercept)                             ***
## review_scores_communication             ***
## review_scores_location                  *  
## review_scores_value                     ***
## cancellation_policymoderate                
## cancellation_policystrict               ** 
## security_deposit                        ***
## neighbourhood_simplifiedkowloon         ***
## neighbourhood_simplifiednew_territories ***
## latitude                                ***
## host_response_time.L                       
## host_response_time.Q                       
## host_response_time.C                       
## host_is_superhostTRUE                   ***
## bedrooms                                ***
## accommodates                            ***
## total_amenities                         ***
## prop_type_simplifiedCondominium         ** 
## prop_type_simplifiedGuesthouse          .  
## prop_type_simplifiedHostel              *  
## prop_type_simplifiedHotel                  
## prop_type_simplifiedHouse               .  
## prop_type_simplifiedOther               *  
## prop_type_simplifiedServiced apartment     
## number_of_reviews_ltm                   ***
## review_scores_rating                    ***
## room_typeHotel room                        
## room_typePrivate room                   ***
## room_typeShared room                    ***
## 
## Residual standard error: 0.496 on 2418 degrees of freedom
##   (782 observations deleted due to missingness)
## Multiple R-squared:  0.387,  Adjusted R-squared:  0.38 
## F-statistic: 54.6 on 28 and 2418 DF,  p-value: <2e-16
car::vif(model10)
##                             GVIF Df GVIF^(1/(2*Df))
## review_scores_communication 2.72  1            1.65
## review_scores_location      2.11  1            1.45
## review_scores_value         3.28  1            1.81
## cancellation_policy         1.31  2            1.07
## security_deposit            1.12  1            1.06
## neighbourhood_simplified    2.85  2            1.30
## latitude                    1.58  1            1.26
## host_response_time          1.35  3            1.05
## host_is_superhost           1.22  1            1.10
## bedrooms                    1.60  1            1.27
## accommodates                1.93  1            1.39
## total_amenities             1.29  1            1.14
## prop_type_simplified        2.25  7            1.06
## number_of_reviews_ltm       1.21  1            1.10
## review_scores_rating        4.08  1            2.02
## room_type                   3.00  3            1.20
autoplot(model10)
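
To make the VIF figures above easier to read, the sketch below computes a VIF by hand. It is only an illustration on the built-in mtcars data (not our listings), and the choice of predictors is arbitrary. The VIF of a predictor is 1 / (1 - R squared) from regressing it on the other predictors, and for a term with Df = 1 the last column printed by car::vif() is simply the square root of the GVIF.

```r
# Illustration on built-in data (mtcars), not on the Airbnb listings.
X <- mtcars[, c("wt", "hp", "disp")]

# VIF of one predictor = 1 / (1 - R^2) from regressing it on the others
r2_wt  <- summary(lm(wt ~ hp + disp, data = X))$r.squared
vif_wt <- 1 / (1 - r2_wt)

# Same quantity for all predictors at once: the diagonal of the inverse
# correlation matrix
vif_all <- diag(solve(cor(X)))

print(vif_wt)
print(vif_all)

# GVIF^(1/(2*Df)) with Df = 1 is sqrt(GVIF); here for the GVIF of 4.08
# reported for review_scores_rating in Model 10
print(sqrt(4.08))
```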

Regression Diagnostics

From model 1 to 10, we observe that the residuals vs fitted values become more random. A concentration of the residuals for high fitted values is clearly visible with the first model, while they seem centered around Y = 0 for Model 10. Consequently, the linearity assumption holds for Model 10.

The S-shaped Q-Q plot of residuals for Model 10 indicates heavy tails, or an excess of extreme values relative to the normal distribution. This can be explained by some Airbnbs having extremely high or low prices given their characteristics.

There is no visible trend in the Scale-Location graph, indicating that the variability of residuals in Model 10 is constant. Finally, the Residuals vs Factor levels leverage graph shows that the size of residuals is the same on average for all factor levels. We still observe some extreme values, due to the presence of outliers in Airbnb prices.

In conclusion, the assumptions of the linear regression model hold reasonably well for Model 10, with the heavy tails seen in the Q-Q plot as the main departure. To improve the distribution of residuals, we could include new variables or perform a more detailed classification of the neighbourhoods.
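
The heavy-tail reading of the Q-Q plot can also be checked numerically. The sketch below is a simulated illustration only (these are not our Airbnb residuals, and the data-generating model is made up): it draws errors from a t distribution with 3 degrees of freedom and compares the share of standardized residuals beyond 3 standard deviations with the share expected under normality.

```r
# Simulated illustration only; these are not the Airbnb residuals.
set.seed(42)
n <- 2000
x <- rnorm(n)

# Errors drawn from a t distribution with 3 df: heavier tails than the normal
y   <- 1 + 0.5 * x + 0.5 * rt(n, df = 3)
fit <- lm(y ~ x)

z <- rstandard(fit)                 # standardized residuals

share_extreme <- mean(abs(z) > 3)   # observed share beyond 3 standard deviations
share_normal  <- 2 * pnorm(-3)      # share expected under normality

print(c(observed = share_extreme, normal = share_normal))
```

Running the same two lines on rstandard(model10) would show how far our own residuals depart from the normal benchmark.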

To compare the last 5 models that we have created we will use huxtable to create a summary table.

##summarize these models
huxreg(model6, model7, model8, model9, model10)

                                         (1)          (2)          (3)          (4)          (5)
(Intercept)                              6.998 ***    53.200 ***   50.972 ***   49.136 ***   49.467 ***
                                         (0.104)      (5.203)      (5.112)      (5.095)      (5.096)
host_response_time.L                     0.011        0.015        0.029        0.030        0.030
                                         (0.024)      (0.023)      (0.023)      (0.023)      (0.023)
host_response_time.Q                     -0.083 **    -0.024       -0.028       -0.032       -0.030
                                         (0.026)      (0.025)      (0.025)      (0.025)      (0.025)
host_response_time.C                     -0.022       -0.018       -0.022       -0.018       -0.019
                                         (0.027)      (0.026)      (0.026)      (0.026)      (0.026)
host_is_superhostTRUE                    0.150 ***    0.176 ***    0.177 ***    0.172 ***    0.174 ***
                                         (0.028)      (0.027)      (0.027)      (0.027)      (0.027)
host_since_calculated                    0.023 ***    0.006
                                         (0.006)      (0.005)
bedrooms                                 0.082 ***    0.103 ***    0.109 ***    0.115 ***    0.113 ***
                                         (0.018)      (0.017)      (0.017)      (0.017)      (0.017)
accommodates                             0.085 ***    0.084 ***    0.081 ***    0.080 ***    0.080 ***
                                         (0.008)      (0.008)      (0.007)      (0.007)      (0.007)
total_amenities                          0.012 ***    0.011 ***    0.011 ***    0.010 ***    0.010 ***
                                         (0.001)      (0.001)      (0.001)      (0.001)      (0.001)
prop_type_simplifiedCondominium          -0.151 ***   -0.098 **    -0.098 **    -0.094 **    -0.097 **
                                         (0.037)      (0.036)      (0.036)      (0.036)      (0.036)
prop_type_simplifiedGuesthouse           -0.097       -0.094       -0.097       -0.089       -0.091
                                         (0.056)      (0.054)      (0.053)      (0.053)      (0.053)
prop_type_simplifiedHostel               -0.137 **    -0.102 *     -0.094       -0.097 *     -0.095 *
                                         (0.050)      (0.049)      (0.048)      (0.048)      (0.048)
prop_type_simplifiedHotel                -0.059       -0.006       0.018        0.014        0.014
                                         (0.063)      (0.061)      (0.061)      (0.061)      (0.061)
prop_type_simplifiedHouse                -0.170 **    -0.097       -0.106       -0.105       -0.105
                                         (0.057)      (0.057)      (0.057)      (0.057)      (0.057)
prop_type_simplifiedOther                0.024        0.069 *      0.077 *      0.074 *      0.073 *
                                         (0.035)      (0.034)      (0.033)      (0.033)      (0.033)
prop_type_simplifiedServiced apartment   -0.031       0.031        0.035        0.030        0.027
                                         (0.050)      (0.049)      (0.048)      (0.048)      (0.048)
number_of_reviews_ltm                    -0.006 ***   -0.006 ***   -0.006 ***   -0.006 ***   -0.006 ***
                                         (0.001)      (0.001)      (0.001)      (0.001)      (0.001)
review_scores_rating                     0.004 ***    0.004 ***    0.004 ***    0.004 *      0.007 ***
                                         (0.001)      (0.001)      (0.001)      (0.002)      (0.002)
room_typeHotel room                      -0.048       -0.016       -0.030       -0.008       -0.007
                                         (0.062)      (0.061)      (0.060)      (0.060)      (0.060)
room_typePrivate room                    -0.383 ***   -0.347 ***   -0.345 ***   -0.338 ***   -0.337 ***
                                         (0.026)      (0.027)      (0.026)      (0.026)      (0.026)
room_typeShared room                     -0.496 ***   -0.486 ***   -0.437 ***   -0.431 ***   -0.430 ***
                                         (0.064)      (0.062)      (0.062)      (0.061)      (0.062)
neighbourhood_simplifiedkowloon                       -0.196 ***   -0.187 ***   -0.178 ***   -0.179 ***
                                                      (0.029)      (0.029)      (0.029)      (0.029)
neighbourhood_simplifiednew_territories               -0.155 ***   -0.124 **    -0.144 ***   -0.142 ***
                                                      (0.043)      (0.042)      (0.043)      (0.043)
is_location_exactTRUE                                 0.015
                                                      (0.023)
latitude                                              -2.065 ***   -1.968 ***   -1.885 ***   -1.897 ***
                                                      (0.233)      (0.229)      (0.229)      (0.229)
cancellation_policymoderate                                        0.059        0.055        0.056
                                                                   (0.039)      (0.039)      (0.039)
cancellation_policystrict                                          0.105 ***    0.099 **     0.100 **
                                                                   (0.031)      (0.031)      (0.031)
security_deposit                                                   0.000 ***    0.000 ***    0.000 ***
                                                                   (0.000)      (0.000)      (0.000)
review_scores_cleanliness                                                       0.024
                                                                                (0.016)
review_scores_checkin                                                           0.029
                                                                                (0.019)
review_scores_communication                                                     0.043 *      0.058 ***
                                                                                (0.019)      (0.017)
review_scores_location                                                          -0.040 *     -0.038 *
                                                                                (0.018)      (0.017)
review_scores_value                                                             -0.064 ***   -0.059 ***
                                                                                (0.017)      (0.017)
N                                        2447         2447         2447         2446         2447
R2                                       0.317        0.372        0.381        0.389        0.387
logLik                                   -1872.901    -1770.115    -1753.698    -1736.944    -1740.159
AIC                                      3789.803     3592.231     3561.396     3537.888     3540.318
*** p < 0.001; ** p < 0.01; * p < 0.05.

We developed 10 models to forecast the price of an Airbnb apartment in Hong Kong. Our first, most basic model only included the type of property, number of reviews and rating as predictors and had an adjusted R squared of 6.5%.

By brainstorming on the features we look at when booking an Airbnb, we added several explanatory variables and increased the R squared to 38.7% (adjusted R squared 38.0%) in our final model.
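
The link between R squared and adjusted R squared can be checked from the figures reported for Model 10 (R squared 0.387, 2447 observations, 28 model degrees of freedom). This is a minimal sketch using the rounded values from the output, so the result is approximate.

```r
# Values reported by glance() / msummary() for Model 10
r2 <- 0.387   # multiple R squared (rounded in the output)
n  <- 2447    # observations used in the fit
p  <- 28      # model degrees of freedom (coefficients excluding the intercept)

# Adjusted R squared penalises R squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 3))
```

The result agrees with the adjusted R squared of 0.38 printed in the model summary.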

#residuals check
autoplot(model10)

##check for collinearity
car::vif(model10) 
##                             GVIF Df GVIF^(1/(2*Df))
## review_scores_communication 2.72  1            1.65
## review_scores_location      2.11  1            1.45
## review_scores_value         3.28  1            1.81
## cancellation_policy         1.31  2            1.07
## security_deposit            1.12  1            1.06
## neighbourhood_simplified    2.85  2            1.30
## latitude                    1.58  1            1.26
## host_response_time          1.35  3            1.05
## host_is_superhost           1.22  1            1.10
## bedrooms                    1.60  1            1.27
## accommodates                1.93  1            1.39
## total_amenities             1.29  1            1.14
## prop_type_simplified        2.25  7            1.06
## number_of_reviews_ltm       1.21  1            1.10
## review_scores_rating        4.08  1            2.02
## room_type                   3.00  3            1.20
##renaming the final model
final_model <- model10
## summarize final model 
msummary(final_model)
##                                          Estimate Std. Error t value Pr(>|t|)
## (Intercept)                              4.95e+01   5.10e+00    9.71  < 2e-16
## review_scores_communication              5.83e-02   1.67e-02    3.50  0.00048
## review_scores_location                  -3.80e-02   1.73e-02   -2.19  0.02829
## review_scores_value                     -5.94e-02   1.67e-02   -3.56  0.00037
## cancellation_policymoderate              5.59e-02   3.87e-02    1.44  0.14881
## cancellation_policystrict                1.00e-01   3.08e-02    3.24  0.00120
## security_deposit                         3.14e-05   7.35e-06    4.27  2.1e-05
## neighbourhood_simplifiedkowloon         -1.79e-01   2.93e-02   -6.09  1.3e-09
## neighbourhood_simplifiednew_territories -1.42e-01   4.27e-02   -3.33  0.00089
## latitude                                -1.90e+00   2.29e-01   -8.29  < 2e-16
## host_response_time.L                     2.98e-02   2.31e-02    1.29  0.19779
## host_response_time.Q                    -3.05e-02   2.48e-02   -1.23  0.21958
## host_response_time.C                    -1.88e-02   2.61e-02   -0.72  0.47107
## host_is_superhostTRUE                    1.74e-01   2.66e-02    6.54  7.3e-11
## bedrooms                                 1.13e-01   1.73e-02    6.50  9.7e-11
## accommodates                             8.03e-02   7.46e-03   10.76  < 2e-16
## total_amenities                          1.04e-02   1.42e-03    7.31  3.5e-13
## prop_type_simplifiedCondominium         -9.71e-02   3.58e-02   -2.71  0.00676
## prop_type_simplifiedGuesthouse          -9.11e-02   5.31e-02   -1.72  0.08613
## prop_type_simplifiedHostel              -9.52e-02   4.83e-02   -1.97  0.04878
## prop_type_simplifiedHotel                1.43e-02   6.06e-02    0.24  0.81326
## prop_type_simplifiedHouse               -1.05e-01   5.66e-02   -1.86  0.06360
## prop_type_simplifiedOther                7.34e-02   3.33e-02    2.20  0.02771
## prop_type_simplifiedServiced apartment   2.69e-02   4.83e-02    0.56  0.57721
## number_of_reviews_ltm                   -6.00e-03   9.99e-04   -6.01  2.2e-09
## review_scores_rating                     6.77e-03   1.88e-03    3.60  0.00033
## room_typeHotel room                     -7.28e-03   6.04e-02   -0.12  0.90403
## room_typePrivate room                   -3.37e-01   2.64e-02  -12.75  < 2e-16
## room_typeShared room                    -4.30e-01   6.15e-02   -6.99  3.4e-12
##                                            
## (Intercept)                             ***
## review_scores_communication             ***
## review_scores_location                  *  
## review_scores_value                     ***
## cancellation_policymoderate                
## cancellation_policystrict               ** 
## security_deposit                        ***
## neighbourhood_simplifiedkowloon         ***
## neighbourhood_simplifiednew_territories ***
## latitude                                ***
## host_response_time.L                       
## host_response_time.Q                       
## host_response_time.C                       
## host_is_superhostTRUE                   ***
## bedrooms                                ***
## accommodates                            ***
## total_amenities                         ***
## prop_type_simplifiedCondominium         ** 
## prop_type_simplifiedGuesthouse          .  
## prop_type_simplifiedHostel              *  
## prop_type_simplifiedHotel                  
## prop_type_simplifiedHouse               .  
## prop_type_simplifiedOther               *  
## prop_type_simplifiedServiced apartment     
## number_of_reviews_ltm                   ***
## review_scores_rating                    ***
## room_typeHotel room                        
## room_typePrivate room                   ***
## room_typeShared room                    ***
## 
## Residual standard error: 0.496 on 2418 degrees of freedom
##   (782 observations deleted due to missingness)
## Multiple R-squared:  0.387,  Adjusted R-squared:  0.38 
## F-statistic: 54.6 on 28 and 2418 DF,  p-value: <2e-16
confint(final_model)
##                                             2.5 %    97.5 %
## (Intercept)                              3.95e+01  5.95e+01
## review_scores_communication              2.56e-02  9.09e-02
## review_scores_location                  -7.19e-02 -4.04e-03
## review_scores_value                     -9.21e-02 -2.67e-02
## cancellation_policymoderate             -2.00e-02  1.32e-01
## cancellation_policystrict                3.95e-02  1.60e-01
## security_deposit                         1.69e-05  4.58e-05
## neighbourhood_simplifiedkowloon         -2.36e-01 -1.21e-01
## neighbourhood_simplifiednew_territories -2.26e-01 -5.84e-02
## latitude                                -2.35e+00 -1.45e+00
## host_response_time.L                    -1.55e-02  7.51e-02
## host_response_time.Q                    -7.92e-02  1.82e-02
## host_response_time.C                    -7.00e-02  3.24e-02
## host_is_superhostTRUE                    1.22e-01  2.26e-01
## bedrooms                                 7.87e-02  1.47e-01
## accommodates                             6.57e-02  9.49e-02
## total_amenities                          7.59e-03  1.31e-02
## prop_type_simplifiedCondominium         -1.67e-01 -2.69e-02
## prop_type_simplifiedGuesthouse          -1.95e-01  1.30e-02
## prop_type_simplifiedHostel              -1.90e-01 -5.08e-04
## prop_type_simplifiedHotel               -1.05e-01  1.33e-01
## prop_type_simplifiedHouse               -2.16e-01  5.95e-03
## prop_type_simplifiedOther                8.06e-03  1.39e-01
## prop_type_simplifiedServiced apartment  -6.78e-02  1.22e-01
## number_of_reviews_ltm                   -7.95e-03 -4.04e-03
## review_scores_rating                     3.08e-03  1.05e-02
## room_typeHotel room                     -1.26e-01  1.11e-01
## room_typePrivate room                   -3.89e-01 -2.85e-01
## room_typeShared room                    -5.51e-01 -3.10e-01

Predicting Price

To predict the price for two people staying 4 nights in Hong Kong, we first create a data frame that contains their preferences and then feed it into our model. We have chosen two tourists who want to stay in an entire apartment that has 2 bedrooms, 20 amenities and a strict cancellation policy. The apartment is in the hong_kong zone, has 10 reviews in the last twelve months and a rating of 90, and accommodates 3 people. The remaining predictors are set to a security deposit of 775, review scores of 8 for communication, location and value, a superhost who responds within an hour, and a latitude of 22.5.

## Using our model to predict price 
predictors <- data.frame(review_scores_communication = 8,
                         review_scores_location = 8,
                         review_scores_value = 8,
                         cancellation_policy = "strict",
                         security_deposit = 775 ,
                         neighbourhood_simplified = "hong_kong",
                         latitude = 22.5,
                         host_response_time = "within an hour",
                         host_is_superhost = TRUE,
                         bedrooms = 2,
                         accommodates = 3,
                         total_amenities = 20,
                         prop_type_simplified = "Apartment",
                         number_of_reviews_ltm = 10,
                         review_scores_rating = 90,
                         room_type = "Entire home/apt") 

#Predicting price and prediction intervals
price_predicted = exp(predict(final_model, newdata = predictors, interval = "prediction"))
paste("Price for two people staying four nights in Hong Kong is: ", price_predicted[1])
## [1] "Price for two people staying four nights in Hong Kong is:  2896.75582785523"
paste("Lower PI for price for two people staying four nights in Hong Kong  is: ", price_predicted[2])
## [1] "Lower PI for price for two people staying four nights in Hong Kong  is:  1085.6185400902"
paste("Upper PI for Price for two people staying four nights in Hong Kong is: ", price_predicted[3])
## [1] "Upper PI for Price for two people staying four nights in Hong Kong is:  7729.41324815255"

After taking the anti-log of the model's prediction, we find a price of 2896.76 HKD, with a 95% prediction interval from 1085.62 HKD to 7729.41 HKD. The prediction sits closer to the lower bound because the interval is symmetric on the log scale and becomes right skewed once exponentiated, which is consistent with right-skewed prices. The wide prediction interval reflects our model's adjusted R squared of only 38%, which leaves considerable uncertainty.
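
The shape of this interval can be checked directly from the three numbers printed above. The sketch below assumes the default 95% level of predict() and uses the 2418 residual degrees of freedom reported for the final model; the variable names are ours. Because the interval is built on the log scale, the point prediction is the geometric mean of the two bounds.

```r
# Numbers copied from the prediction output above
fit <- 2896.75582785523
lwr <- 1085.6185400902
upr <- 7729.41324815255

# On the log scale the point prediction is the midpoint of the interval,
# so on the HKD scale it is the geometric mean of the two bounds
geo_mean <- sqrt(lwr * upr)
print(c(fit = fit, geometric_mean = geo_mean))

# Half-width of the log-scale interval divided by the t quantile gives the
# standard error of prediction (assuming the default 95% level of predict())
se_pred <- (log(upr) - log(lwr)) / 2 / qt(0.975, df = 2418)
print(se_pred)
```

The standard error of prediction comes out only slightly above the residual standard error of 0.496, which suggests that the width of the interval is driven mainly by the unexplained variation in log prices rather than by uncertainty in the coefficients.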