How Much Does Airbnb Cost in Hong Kong?
Building a regression model to predict the prices
Executive Summary
The goal of this project is to predict the total cost for two people staying 4 nights in an Airbnb in Hong Kong. We downloaded the data from insideairbnb.com and followed the IICE method (import data, inspect data, clean data, explore data).
Understanding the Data
After importing the data, we looked at the data entries and variables in order to adapt them to our required format. We created factor variables and turned some character variables into numeric values. While skimming through the data, we discovered several anomalies inside the data such as really high outliers and NAs that were either removed or replaced. We did the following to clean the data:
Firstly, we selected only the relevant columns to decrease the size of our dataframe and created our variable of interest, price_4_nights, which calculates the cost for two people staying in Hong Kong for 4 nights. We also removed the listings that accommodate fewer than two people, as two people will most likely not look for such Airbnbs.
Secondly, we grouped the accommodation types, property types, bed types and cancellation policies into the most popular categories and “Other”.
Thirdly, we grouped the different areas into 3 zones of Hong Kong as we believe that certain districts will be more appealing for a tourist than others and could thus lead to higher Airbnb prices.
Fourthly, preliminary regression analysis showed us that the neighbourhood, room type, and the number of bedrooms had the most significant effect on price.
Lastly, we created new variables such as the total number of amenities, and we also looked at the cancellation policy and security deposit since they could have an impact on the price as well.
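The first cleaning step above can be sketched as follows. This is a minimal illustration in base R on a made-up toy data frame, not the exact code used for the project. The helper `parse_money`, the treatment of a missing cleaning fee as zero, and the formula itself (four nights at the nightly price, a per-night surcharge for the second guest when only one guest is included, plus a one-off cleaning fee) are assumptions made for this example.

```r
# Toy listings using the same column names and string formats as the raw data
toy <- data.frame(
  price           = c("$1,395.00", "$783.00", "$400.00"),
  cleaning_fee    = c("$310.00", NA, "$100.00"),
  extra_people    = c("$155.00", "$50.00", "$50.00"),
  guests_included = c(2, 1, 1),
  accommodates    = c(3, 2, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: turn "$1,395.00"-style strings into numbers
parse_money <- function(x) as.numeric(gsub("[$,]", "", x))

toy$price        <- parse_money(toy$price)
toy$cleaning_fee <- parse_money(toy$cleaning_fee)
toy$extra_people <- parse_money(toy$extra_people)

# Assumption: a missing cleaning fee means that no fee is charged
toy$cleaning_fee[is.na(toy$cleaning_fee)] <- 0

# Keep only listings that can host two people
toy <- toy[toy$accommodates >= 2, ]

# Assumed formula: the second guest costs extra only if fewer than two guests are included
second_guest <- ifelse(toy$guests_included < 2, toy$extra_people, 0)
toy$price_4_nights <- 4 * (toy$price + second_guest) + toy$cleaning_fee

print(toy[, c("price", "cleaning_fee", "price_4_nights")])
```

Replacing missing fees with zero keeps those listings in the sample; dropping them instead would remove roughly half of the data, since about 45% of listings have no cleaning fee recorded.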
After having adapted the dataframe to our needs, we created several boxplots, scatterplots and correlation plots for our variables in order to look at their spread and the relationships among them. We also used ggpairs to look at the correlation of our chosen variables and see whether they would be of use later for our ideal model.
After creating informative visualisations, we fitted our first model and then tried different combinations of variables. In order to increase our adjusted R-squared, we added new variables and removed insignificant ones (absolute t-value below 2, and thus a p-value above roughly 5%). To compare all our models we used huxreg, and we used VIF to confirm that the variables in our model do not suffer from multicollinearity. We finally decided on a model that includes review_scores_communication, review_scores_location, review_scores_value, cancellation_policy, security_deposit, neighbourhood_simplified, latitude, host_response_time, host_is_superhost, bedrooms, accommodates, total_amenities, prop_type_simplified, number_of_reviews_ltm, review_scores_rating, room_type. This model reaches an adjusted R-squared of 39% and therefore explains roughly 39% of the variability in price_4_nights.
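To make the VIF check more concrete, here is a small sketch on simulated data. It does not use the project's variables, and the helper `vif_manual` is a hypothetical function written for this illustration. It computes the variance inflation factor by hand for numeric predictors: each predictor is regressed on the others, and the VIF is 1 / (1 - R²) of that auxiliary regression.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)  # almost a copy of x1, so strongly collinear with it
x3 <- rnorm(n)                 # generated independently of x1 and x2
y  <- 1 + 2 * x1 + 0.5 * x3 + rnorm(n)
dat <- data.frame(y, x1, x2, x3)

# Hypothetical helper: VIF of each predictor from its auxiliary regression
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    aux_fit <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(aux_fit)$r.squared)
  })
}

vifs <- vif_manual(dat, c("x1", "x2", "x3"))
print(round(vifs, 1))

# Adjusted R-squared of the full model, the criterion used to compare models
fit <- lm(y ~ x1 + x2 + x3, data = dat)
print(summary(fit)$adj.r.squared)
```

In this toy setup x1 and x2 get very large VIFs while x3 stays close to 1, which is the pattern that would suggest dropping one of the two collinear predictors.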
As the last step, we predicted the price for 2 people staying for 4 nights in Hong Kong along with the 95% prediction intervals.
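The prediction step can be sketched with base R's `predict()`. The data below are simulated and the two-predictor model is only a stand-in for the final model, so the numbers carry no meaning for Hong Kong prices. The point is the difference between a 95% prediction interval (for a single new listing) and a 95% confidence interval (for the average price of such listings).

```r
set.seed(42)
n <- 100

# Simulated listings; the coefficients are made up for this example
toy <- data.frame(
  bedrooms     = sample(0:3, n, replace = TRUE),
  accommodates = sample(2:6, n, replace = TRUE)
)
toy$price_4_nights <- 1500 + 800 * toy$bedrooms + 200 * toy$accommodates +
  rnorm(n, sd = 300)

fit <- lm(price_4_nights ~ bedrooms + accommodates, data = toy)

# A hypothetical listing for two people with one bedroom
new_listing <- data.frame(bedrooms = 1, accommodates = 2)

# Interval for the price of one new listing
pi95 <- predict(fit, newdata = new_listing, interval = "prediction", level = 0.95)

# Interval for the mean price of all listings with these characteristics
ci95 <- predict(fit, newdata = new_listing, interval = "confidence", level = 0.95)

print(pi95)
print(ci95)
```

The prediction interval is always wider than the confidence interval because it also accounts for the residual variation of an individual listing around the fitted mean.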
The first step for our project is loading the essential packages.
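library(vroom)       # fast reading of delimited files
library(dplyr)       # data manipulation
library(skimr)       # skim() summary statistics
library(tidyverse)   # core tidyverse packages (ggplot2, tidyr, ...)
library(mosaic)      # favstats()
library(janitor)     # clean_names()
library(broom)       # tidy model output
library(lubridate)   # date handling
library(GGally)      # ggpairs()
library(leaflet)     # interactive maps
library(ggfortify)   # autoplot() for model diagnostics
library(huxtable)    # huxreg() for comparing models
library(car)         # vif()
library(kableExtra)  # table styling
library(data.table)
library(modelr)      # modelling helpers
library(readr)       # reading and parsing data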
Exploratory Data Analysis (EDA)
Loading Data
To begin with, we need data on Airbnb listings in Hong Kong. We read the data directly from the URL using vroom for higher speed, and clean the column names:
listings <- vroom::vroom("http://data.insideairbnb.com/china/hk/hong-kong/2020-06-15/data/listings.csv.gz", na = c("", "NA", "N/A")) %>%
  clean_names() # clean column names
Looking at the raw values
How many variables/columns? How many rows/observations?
First, we look at the raw data to get a basic understanding of it. We have 106 variables and 11,187 observations, and the dataset contains numeric, character, logical, and date variables.
glimpse(listings)
## Rows: 11,187
## Columns: 106
## $ id <dbl> 69074, 75083, 103760, 13…
## $ listing_url <chr> "https://www.airbnb.com/…
## $ scrape_id <dbl> 2.02e+13, 2.02e+13, 2.02…
## $ last_scraped <date> 2020-06-17, 2020-06-17,…
## $ name <chr> "Beautiful oasis of plan…
## $ summary <chr> "An ideal Hong location …
## $ space <chr> "Filled with plants and …
## $ description <chr> "An ideal Hong location …
## $ experiences_offered <chr> "none", "none", "none", …
## $ neighborhood_overview <chr> "In the upper part of tr…
## $ notes <chr> NA, "Once you arrive in …
## $ transit <chr> "Buses pass often along …
## $ access <chr> "All access, except one …
## $ interaction <chr> "If a guest is staying t…
## $ house_rules <chr> "Everything to make your…
## $ thumbnail_url <lgl> NA, NA, NA, NA, NA, NA, …
## $ medium_url <lgl> NA, NA, NA, NA, NA, NA, …
## $ picture_url <chr> "https://a0.muscache.com…
## $ xl_picture_url <lgl> NA, NA, NA, NA, NA, NA, …
## $ host_id <dbl> 160139, 304876, 304876, …
## $ host_url <chr> "https://www.airbnb.com/…
## $ host_name <chr> "Amy", "Brend", "Brend",…
## $ host_since <date> 2010-07-07, 2010-11-30,…
## $ host_location <chr> "Hong Kong", "Hong Kong"…
## $ host_about <chr> "I've been with AirBnB n…
## $ host_response_time <chr> "within a few hours", "w…
## $ host_response_rate <chr> "86%", "100%", "100%", "…
## $ host_acceptance_rate <chr> "60%", "99%", "99%", "99…
## $ host_is_superhost <lgl> TRUE, FALSE, FALSE, FALS…
## $ host_thumbnail_url <chr> "https://a0.muscache.com…
## $ host_picture_url <chr> "https://a0.muscache.com…
## $ host_neighbourhood <chr> "Sheung Wan", "Sheung Wa…
## $ host_listings_count <dbl> 2, 12, 12, 12, 1, 12, 12…
## $ host_total_listings_count <dbl> 2, 12, 12, 12, 1, 12, 12…
## $ host_verifications <chr> "['email', 'phone', 'rev…
## $ host_has_profile_pic <lgl> TRUE, TRUE, TRUE, TRUE, …
## $ host_identity_verified <lgl> TRUE, FALSE, FALSE, FALS…
## $ street <chr> "Sheung Wan, Hong Kong",…
## $ neighbourhood <chr> "Central & Western Distr…
## $ neighbourhood_cleansed <chr> "Central & Western", "Ce…
## $ neighbourhood_group_cleansed <lgl> NA, NA, NA, NA, NA, NA, …
## $ city <chr> "Sheung Wan", "Sheung Wa…
## $ state <chr> NA, NA, NA, NA, "Hong Ko…
## $ zipcode <chr> NA, NA, NA, NA, NA, NA, …
## $ market <chr> "Hong Kong", "Hong Kong"…
## $ smart_location <chr> "Sheung Wan, Hong Kong",…
## $ country_code <chr> "HK", "HK", "HK", "HK", …
## $ country <chr> "Hong Kong", "Hong Kong"…
## $ latitude <dbl> 22.3, 22.3, 22.3, 22.3, …
## $ longitude <dbl> 114, 114, 114, 114, 114,…
## $ is_location_exact <lgl> TRUE, TRUE, TRUE, FALSE,…
## $ property_type <chr> "Apartment", "Apartment"…
## $ room_type <chr> "Entire home/apt", "Enti…
## $ accommodates <dbl> 3, 3, 6, 6, 2, 6, 6, 2, …
## $ bathrooms <dbl> 1, 1, 1, 1, 1, 1, 1, 1, …
## $ bedrooms <dbl> 1, 0, 2, 2, 1, 2, 2, 1, …
## $ beds <dbl> 2, 2, 3, 3, 1, 3, 3, 1, …
## $ bed_type <chr> "Real Bed", "Real Bed", …
## $ amenities <chr> "{\"Cable TV\",Internet,…
## $ square_feet <lgl> NA, NA, NA, NA, NA, NA, …
## $ price <chr> "$1,395.00", "$783.00", …
## $ weekly_price <chr> NA, NA, NA, NA, NA, NA, …
## $ monthly_price <chr> "$29,451.00", NA, NA, NA…
## $ security_deposit <chr> "$2,325.00", "$775.00", …
## $ cleaning_fee <chr> "$310.00", "$271.00", "$…
## $ guests_included <dbl> 2, 2, 2, 3, 1, 2, 2, 1, …
## $ extra_people <chr> "$155.00", "$155.00", "$…
## $ minimum_nights <dbl> 3, 14, 2, 2, 2, 2, 2, 1,…
## $ maximum_nights <dbl> 365, 365, 365, 365, 60, …
## $ minimum_minimum_nights <dbl> 3, 14, 2, 2, 2, 2, 2, 1,…
## $ maximum_minimum_nights <dbl> 4, 14, 2, 2, 2, 2, 2, 1,…
## $ minimum_maximum_nights <dbl> 365, 365, 365, 365, 60, …
## $ maximum_maximum_nights <dbl> 365, 365, 365, 365, 60, …
## $ minimum_nights_avg_ntm <dbl> 3.1, 14.0, 2.0, 2.0, 2.0…
## $ maximum_nights_avg_ntm <dbl> 365, 365, 365, 365, 60, …
## $ calendar_updated <chr> "2 months ago", "7 weeks…
## $ has_availability <lgl> TRUE, TRUE, TRUE, TRUE, …
## $ availability_30 <dbl> 0, 0, 0, 14, 0, 8, 9, 30…
## $ availability_60 <dbl> 23, 0, 0, 44, 15, 33, 39…
## $ availability_90 <dbl> 53, 14, 0, 74, 45, 63, 6…
## $ availability_365 <dbl> 143, 193, 0, 345, 135, 3…
## $ calendar_last_scraped <date> 2020-06-17, 2020-06-17,…
## $ number_of_reviews <dbl> 134, 229, 271, 305, 27, …
## $ number_of_reviews_ltm <dbl> 4, 1, 13, 48, 0, 16, 11,…
## $ first_review <date> 2011-02-14, 2011-03-05,…
## $ last_review <date> 2020-03-24, 2020-04-18,…
## $ review_scores_rating <dbl> 97, 89, 89, 93, 97, 86, …
## $ review_scores_accuracy <dbl> 10, 8, 9, 10, 10, 9, 9, …
## $ review_scores_cleanliness <dbl> 9, 9, 9, 10, 9, 9, 9, 10…
## $ review_scores_checkin <dbl> 10, 9, 10, 10, 10, 9, 10…
## $ review_scores_communication <dbl> 10, 9, 10, 10, 10, 9, 10…
## $ review_scores_location <dbl> 10, 10, 10, 10, 10, 10, …
## $ review_scores_value <dbl> 9, 9, 9, 9, 10, 9, 9, 10…
## $ requires_license <lgl> FALSE, FALSE, FALSE, FAL…
## $ license <lgl> NA, NA, NA, NA, NA, NA, …
## $ jurisdiction_names <lgl> NA, NA, NA, NA, NA, NA, …
## $ instant_bookable <lgl> FALSE, FALSE, FALSE, FAL…
## $ is_business_travel_ready <lgl> FALSE, FALSE, FALSE, FAL…
## $ cancellation_policy <chr> "strict_14_with_grace_pe…
## $ require_guest_profile_picture <lgl> FALSE, FALSE, FALSE, FAL…
## $ require_guest_phone_verification <lgl> FALSE, FALSE, FALSE, FAL…
## $ calculated_host_listings_count <dbl> 1, 13, 13, 13, 1, 13, 13…
## $ calculated_host_listings_count_entire_homes <dbl> 1, 9, 9, 9, 1, 9, 9, 0, …
## $ calculated_host_listings_count_private_rooms <dbl> 0, 4, 4, 4, 0, 4, 4, 1, …
## $ calculated_host_listings_count_shared_rooms <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ reviews_per_month <dbl> 1.18, 2.02, 2.47, 2.81, …
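The row, column and type counts read off the glimpse output can also be computed directly. The sketch below uses a small made-up data frame rather than listings, so that it runs without downloading the data; the same two calls would apply to listings.

```r
# Toy data frame with one column of each type found in the listings data
toy <- data.frame(
  id                = c(1, 2, 3),
  name              = c("a", "b", "c"),
  host_since        = as.Date(c("2010-07-07", "2010-11-30", "2015-12-27")),
  host_is_superhost = c(TRUE, FALSE, FALSE),
  stringsAsFactors  = FALSE
)

# Number of rows (observations) and columns (variables)
dims <- dim(toy)
print(dims)

# Number of columns per type, using the first class of each column
type_counts <- table(vapply(toy, function(col) class(col)[1], character(1)))
print(type_counts)
```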
Computing summary statistics of the variables of interest
After the first glimpse, it is useful to dive deeper and look at the distribution of the variables. There are a lot of them (46 character, 38 numeric, 5 date and 17 logical), so let’s consider the ones that are particularly interesting:
minimum_nights, accommodates, number_of_reviews, bedrooms, beds, bathrooms.
Summary statistics for individual variables can be displayed with favstats from the mosaic package; here we summarise the whole dataset with skim from the skimr package.
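For the handful of variables named above, a per-variable summary in the spirit of favstats can be sketched in base R. The helper `fav_summary` is a hypothetical function written for this example (it is not part of mosaic), and the toy values are invented; it reports the same kind of statistics: minimum, quartiles, maximum, mean, standard deviation, number of non-missing values and number of missing values.

```r
# Toy values standing in for two of the variables of interest
toy <- data.frame(
  minimum_nights = c(1, 2, 3, 4, NA),
  accommodates   = c(2, 2, 3, 6, 16)
)

# Hypothetical helper returning favstats-style summary statistics
fav_summary <- function(x) {
  q <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE, names = FALSE)
  c(min = q[1], Q1 = q[2], median = q[3], Q3 = q[4], max = q[5],
    mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE),
    n = sum(!is.na(x)), missing = sum(is.na(x)))
}

# One row per variable, one column per statistic
stats <- t(sapply(toy, fav_summary))
print(round(stats, 2))
```

Reporting the number of missing values next to the quantiles makes it easy to spot variables such as the review scores, where a large share of listings has no value at all.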
skim(listings) %>%
  kable() %>%
  kable_styling()
| skim_type | skim_variable | n_missing | complete_rate | character.min | character.max | character.empty | character.n_unique | character.whitespace | Date.min | Date.max | Date.median | Date.n_unique | logical.mean | logical.count | numeric.mean | numeric.sd | numeric.p0 | numeric.p25 | numeric.p50 | numeric.p75 | numeric.p100 | numeric.hist |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| character | listing_url | 0 | 1.000 | 34 | 37 | 0 | 11187 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | name | 8 | 0.999 | 1 | 250 | 0 | 10899 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | summary | 756 | 0.932 | 1 | 1000 | 0 | 7994 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | space | 4528 | 0.595 | 1 | 1000 | 0 | 4892 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | description | 521 | 0.953 | 1 | 1000 | 0 | 8953 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | experiences_offered | 0 | 1.000 | 4 | 4 | 0 | 1 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | neighborhood_overview | 5879 | 0.474 | 1 | 1000 | 0 | 3573 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | notes | 6862 | 0.387 | 1 | 1000 | 0 | 2410 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | transit | 5598 | 0.500 | 1 | 1000 | 0 | 3666 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | access | 6790 | 0.393 | 1 | 1000 | 0 | 2872 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | interaction | 6119 | 0.453 | 1 | 1000 | 0 | 2981 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | house_rules | 6217 | 0.444 | 2 | 1000 | 0 | 3169 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | picture_url | 0 | 1.000 | 81 | 146 | 0 | 10607 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | host_url | 0 | 1.000 | 39 | 43 | 0 | 4874 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | host_name | 12 | 0.999 | 1 | 33 | 0 | 2846 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | host_location | 38 | 0.997 | 2 | 133 | 0 | 429 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | host_about | 4315 | 0.614 | 1 | 3851 | 0 | 2456 | 5 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | host_response_time | 4142 | 0.630 | 12 | 18 | 0 | 4 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | host_response_rate | 4142 | 0.630 | 2 | 4 | 0 | 58 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | host_acceptance_rate | 3346 | 0.701 | 2 | 4 | 0 | 73 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | host_thumbnail_url | 12 | 0.999 | 55 | 106 | 0 | 4851 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | host_picture_url | 12 | 0.999 | 57 | 109 | 0 | 4851 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | host_neighbourhood | 2525 | 0.774 | 2 | 26 | 0 | 163 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | host_verifications | 0 | 1.000 | 2 | 156 | 0 | 265 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | street | 0 | 1.000 | 13 | 82 | 0 | 688 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | neighbourhood | 1284 | 0.885 | 4 | 26 | 0 | 56 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | neighbourhood_cleansed | 0 | 1.000 | 5 | 17 | 0 | 18 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | city | 772 | 0.931 | 1 | 50 | 0 | 343 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | state | 371 | 0.967 | 1 | 31 | 0 | 176 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | zipcode | 10480 | 0.063 | 1 | 20 | 0 | 120 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | market | 9 | 0.999 | 6 | 22 | 0 | 12 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | smart_location | 0 | 1.000 | 9 | 61 | 0 | 387 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | country_code | 0 | 1.000 | 2 | 2 | 0 | 3 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | country | 0 | 1.000 | 5 | 14 | 0 | 3 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | property_type | 0 | 1.000 | 3 | 22 | 0 | 41 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | room_type | 0 | 1.000 | 10 | 15 | 0 | 4 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | bed_type | 0 | 1.000 | 5 | 13 | 0 | 5 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | amenities | 0 | 1.000 | 2 | 1126 | 0 | 8558 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | price | 0 | 1.000 | 5 | 10 | 0 | 374 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | weekly_price | 10601 | 0.052 | 6 | 10 | 0 | 268 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | monthly_price | 10480 | 0.063 | 7 | 11 | 0 | 316 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | security_deposit | 5677 | 0.493 | 5 | 10 | 0 | 231 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | cleaning_fee | 5055 | 0.548 | 5 | 9 | 0 | 259 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | extra_people | 0 | 1.000 | 5 | 9 | 0 | 184 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | calendar_updated | 0 | 1.000 | 5 | 13 | 0 | 78 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | cancellation_policy | 0 | 1.000 | 6 | 27 | 0 | 6 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Date | last_scraped | 0 | 1.000 | NA | NA | NA | NA | NA | 2020-06-15 | 2020-06-19 | 2020-06-17 | 4 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Date | host_since | 12 | 0.999 | NA | NA | NA | NA | NA | 2009-08-17 | 2020-06-10 | 2015-12-27 | 2355 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Date | calendar_last_scraped | 0 | 1.000 | NA | NA | NA | NA | NA | 2020-06-15 | 2020-06-19 | 2020-06-17 | 4 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Date | first_review | 4155 | 0.629 | NA | NA | NA | NA | NA | 2011-02-14 | 2020-06-15 | 2018-02-19 | 1986 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Date | last_review | 4155 | 0.629 | NA | NA | NA | NA | NA | 2013-01-02 | 2020-06-17 | 2019-06-23 | 1365 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| logical | thumbnail_url | 11187 | 0.000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NaN | : | NA | NA | NA | NA | NA | NA | NA | NA |
| logical | medium_url | 11187 | 0.000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NaN | : | NA | NA | NA | NA | NA | NA | NA | NA |
| logical | xl_picture_url | 11187 | 0.000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NaN | : | NA | NA | NA | NA | NA | NA | NA | NA |
| logical | host_is_superhost | 12 | 0.999 | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0.135 | FAL: 9669, TRU: 1506 | NA | NA | NA | NA | NA | NA | NA | NA |
| logical | host_has_profile_pic | 12 | 0.999 | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0.997 | TRU: 11141, FAL: 34 | NA | NA | NA | NA | NA | NA | NA | NA |
| logical | host_identity_verified | 12 | 0.999 | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0.268 | FAL: 8179, TRU: 2996 | NA | NA | NA | NA | NA | NA | NA | NA |
| logical | neighbourhood_group_cleansed | 11187 | 0.000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NaN | : | NA | NA | NA | NA | NA | NA | NA | NA |
| logical | is_location_exact | 0 | 1.000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0.688 | TRU: 7698, FAL: 3489 | NA | NA | NA | NA | NA | NA | NA | NA |
| logical | square_feet | 11172 | 0.001 | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0.000 | FAL: 15 | NA | NA | NA | NA | NA | NA | NA | NA |
| logical | has_availability | 0 | 1.000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1.000 | TRU: 11187 | NA | NA | NA | NA | NA | NA | NA | NA |
| logical | requires_license | 0 | 1.000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0.000 | FAL: 11187 | NA | NA | NA | NA | NA | NA | NA | NA |
| logical | license | 11187 | 0.000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NaN | : | NA | NA | NA | NA | NA | NA | NA | NA |
| logical | jurisdiction_names | 11187 | 0.000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NaN | : | NA | NA | NA | NA | NA | NA | NA | NA |
| logical | instant_bookable | 0 | 1.000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0.420 | FAL: 6485, TRU: 4702 | NA | NA | NA | NA | NA | NA | NA | NA |
| logical | is_business_travel_ready | 0 | 1.000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0.000 | FAL: 11187 | NA | NA | NA | NA | NA | NA | NA | NA |
| logical | require_guest_profile_picture | 0 | 1.000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0.008 | FAL: 11102, TRU: 85 | NA | NA | NA | NA | NA | NA | NA | NA |
| logical | require_guest_phone_verification | 0 | 1.000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0.009 | FAL: 11086, TRU: 101 | NA | NA | NA | NA | NA | NA | NA | NA |
| numeric | id | 0 | 1.000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 2.50e+07 | 1.17e+07 | 6.91e+04 | 1.63e+07 | 2.63e+07 | 3.47e+07 | 4.38e+07 | ▃▅▆▇▇ |
| numeric | scrape_id | 0 | 1.000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 2.02e+13 | 0.00e+00 | 2.02e+13 | 2.02e+13 | 2.02e+13 | 2.02e+13 | 2.02e+13 | ▁▁▇▁▁ |
| numeric | host_id | 0 | 1.000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 8.84e+07 | 8.74e+07 | 3.22e+04 | 1.69e+07 | 5.25e+07 | 1.39e+08 | 3.49e+08 | ▇▃▂▂▁ |
| numeric | host_listings_count | 12 | 0.999 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 4.85e+01 | 1.05e+02 | 0.00e+00 | 1.00e+00 | 5.00e+00 | 2.20e+01 | 3.86e+02 | ▇▁▁▁▁ |
| numeric | host_total_listings_count | 12 | 0.999 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 4.85e+01 | 1.05e+02 | 0.00e+00 | 1.00e+00 | 5.00e+00 | 2.20e+01 | 3.86e+02 | ▇▁▁▁▁ |
| numeric | latitude | 0 | 1.000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 2.23e+01 | 4.80e-02 | 2.22e+01 | 2.23e+01 | 2.23e+01 | 2.23e+01 | 2.26e+01 | ▁▇▁▁▁ |
| numeric | longitude | 0 | 1.000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1.14e+02 | 4.40e-02 | 1.14e+02 | 1.14e+02 | 1.14e+02 | 1.14e+02 | 1.14e+02 | ▁▁▃▇▁ |
| numeric | accommodates | 0 | 1.000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 2.82e+00 | 2.18e+00 | 1.00e+00 | 2.00e+00 | 2.00e+00 | 3.00e+00 | 1.60e+01 | ▇▁▁▁▁ |
| numeric | bathrooms | 17 | 0.998 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1.16e+00 | 5.73e-01 | 0.00e+00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | 1.10e+01 | ▇▁▁▁▁ |
| numeric | bedrooms | 38 | 0.997 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1.09e+00 | 8.54e-01 | 0.00e+00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | 1.10e+01 | ▇▁▁▁▁ |
| numeric | beds | 69 | 0.994 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1.68e+00 | 1.44e+00 | 0.00e+00 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 2.00e+01 | ▇▁▁▁▁ |
| numeric | guests_included | 0 | 1.000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1.39e+00 | 1.06e+00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | 1.60e+01 | ▇▁▁▁▁ |
| numeric | minimum_nights | 0 | 1.000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 9.76e+00 | 2.83e+01 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 7.00e+00 | 1.10e+03 | ▇▁▁▁▁ |
| numeric | maximum_nights | 0 | 1.000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 3.86e+05 | 2.87e+07 | 1.00e+00 | 3.65e+02 | 1.12e+03 | 1.12e+03 | 2.15e+09 | ▇▁▁▁▁ |
| numeric | minimum_minimum_nights | 0 | 1.000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 9.61e+00 | 2.80e+01 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 7.00e+00 | 1.10e+03 | ▇▁▁▁▁ |
| numeric | maximum_minimum_nights | 0 | 1.000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 9.99e+00 | 2.91e+01 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 7.00e+00 | 1.10e+03 | ▇▁▁▁▁ |
| numeric | minimum_maximum_nights | 0 | 1.000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 3.86e+05 | 2.87e+07 | 1.00e+00 | 3.65e+02 | 1.12e+03 | 1.12e+03 | 2.15e+09 | ▇▁▁▁▁ |
| numeric | maximum_maximum_nights | 0 | 1.000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 3.86e+05 | 2.87e+07 | 1.00e+00 | 3.65e+02 | 1.12e+03 | 1.12e+03 | 2.15e+09 | ▇▁▁▁▁ |
| numeric | minimum_nights_avg_ntm | 0 | 1.000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 9.79e+00 | 2.82e+01 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 7.00e+00 | 1.10e+03 | ▇▁▁▁▁ |
| numeric | maximum_nights_avg_ntm | 0 | 1.000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 3.86e+05 | 2.87e+07 | 1.00e+00 | 3.65e+02 | 1.12e+03 | 1.12e+03 | 2.15e+09 | ▇▁▁▁▁ |
| numeric | availability_30 | 0 | 1.000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1.55e+01 | 1.40e+01 | 0.00e+00 | 0.00e+00 | 2.00e+01 | 3.00e+01 | 3.00e+01 | ▇▁▁▁▇ |
| numeric | availability_60 | 0 | 1.000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 3.28e+01 | 2.79e+01 | 0.00e+00 | 0.00e+00 | 4.70e+01 | 6.00e+01 | 6.00e+01 | ▆▁▁▁▇ |
| numeric | availability_90 | 0 | 1.000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 5.06e+01 | 4.17e+01 | 0.00e+00 | 0.00e+00 | 7.60e+01 | 9.00e+01 | 9.00e+01 | ▆▁▁▁▇ |
| numeric | availability_365 | 0 | 1.000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1.68e+02 | 1.57e+02 | 0.00e+00 | 0.00e+00 | 1.08e+02 | 3.64e+02 | 3.65e+02 | ▇▂▂▁▇ |
| numeric | number_of_reviews | 0 | 1.000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1.77e+01 | 4.12e+01 | 0.00e+00 | 0.00e+00 | 2.00e+00 | 1.40e+01 | 7.57e+02 | ▇▁▁▁▁ |
| numeric | number_of_reviews_ltm | 0 | 1.000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 2.68e+00 | 7.55e+00 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 1.00e+00 | 1.38e+02 | ▇▁▁▁▁ |
| numeric | review_scores_rating | 4355 | 0.611 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 9.09e+01 | 1.12e+01 | 2.00e+01 | 8.70e+01 | 9.40e+01 | 9.90e+01 | 1.00e+02 | ▁▁▁▂▇ |
| numeric | review_scores_accuracy | 4357 | 0.611 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 9.34e+00 | 1.12e+00 | 2.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| numeric | review_scores_cleanliness | 4357 | 0.611 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 9.10e+00 | 1.20e+00 | 2.00e+00 | 9.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | ▁▁▁▂▇ |
| numeric | review_scores_checkin | 4356 | 0.611 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 9.50e+00 | 1.04e+00 | 2.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| numeric | review_scores_communication | 4357 | 0.611 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 9.51e+00 | 1.03e+00 | 2.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| numeric | review_scores_location | 4358 | 0.610 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 9.61e+00 | 8.50e-01 | 2.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| numeric | review_scores_value | 4358 | 0.610 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 9.13e+00 | 1.13e+00 | 2.00e+00 | 9.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | ▁▁▁▂▇ |
| numeric | calculated_host_listings_count | 0 | 1.000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 4.57e+01 | 1.03e+02 | 1.00e+00 | 1.00e+00 | 4.00e+00 | 1.90e+01 | 3.89e+02 | ▇▁▁▁▁ |
| numeric | calculated_host_listings_count_entire_homes | 0 | 1.000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 7.80e+00 | 1.90e+01 | 0.00e+00 | 0.00e+00 | 1.00e+00 | 4.00e+00 | 1.08e+02 | ▇▁▁▁▁ |
| numeric | calculated_host_listings_count_private_rooms | 0 | 1.000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 3.29e+01 | 8.23e+01 | 0.00e+00 | 0.00e+00 | 1.00e+00 | 1.10e+01 | 3.39e+02 | ▇▁▁▁▁ |
| numeric | calculated_host_listings_count_shared_rooms | 0 | 1.000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 4.54e+00 | 1.57e+01 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 8.20e+01 | ▇▁▁▁▁ |
| numeric | reviews_per_month | 4155 | 0.629 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 8.37e-01 | 1.18e+00 | 1.00e-02 | 1.20e-01 | 3.50e-01 | 1.03e+00 | 1.32e+01 | ▇▁▁▁▁ |
skim(listings)
| Name | listings |
|---|---|
| Number of rows | 11187 |
| Number of columns | 106 |
| Column type frequency: | |
| character | 46 |
| Date | 5 |
| logical | 17 |
| numeric | 38 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| listing_url | 0 | 1.00 | 34 | 37 | 0 | 11187 | 0 |
| name | 8 | 1.00 | 1 | 250 | 0 | 10899 | 0 |
| summary | 756 | 0.93 | 1 | 1000 | 0 | 7994 | 0 |
| space | 4528 | 0.60 | 1 | 1000 | 0 | 4892 | 0 |
| description | 521 | 0.95 | 1 | 1000 | 0 | 8953 | 0 |
| experiences_offered | 0 | 1.00 | 4 | 4 | 0 | 1 | 0 |
| neighborhood_overview | 5879 | 0.47 | 1 | 1000 | 0 | 3573 | 0 |
| notes | 6862 | 0.39 | 1 | 1000 | 0 | 2410 | 0 |
| transit | 5598 | 0.50 | 1 | 1000 | 0 | 3666 | 0 |
| access | 6790 | 0.39 | 1 | 1000 | 0 | 2872 | 0 |
| interaction | 6119 | 0.45 | 1 | 1000 | 0 | 2981 | 0 |
| house_rules | 6217 | 0.44 | 2 | 1000 | 0 | 3169 | 0 |
| picture_url | 0 | 1.00 | 81 | 146 | 0 | 10607 | 0 |
| host_url | 0 | 1.00 | 39 | 43 | 0 | 4874 | 0 |
| host_name | 12 | 1.00 | 1 | 33 | 0 | 2846 | 0 |
| host_location | 38 | 1.00 | 2 | 133 | 0 | 429 | 0 |
| host_about | 4315 | 0.61 | 1 | 3851 | 0 | 2456 | 5 |
| host_response_time | 4142 | 0.63 | 12 | 18 | 0 | 4 | 0 |
| host_response_rate | 4142 | 0.63 | 2 | 4 | 0 | 58 | 0 |
| host_acceptance_rate | 3346 | 0.70 | 2 | 4 | 0 | 73 | 0 |
| host_thumbnail_url | 12 | 1.00 | 55 | 106 | 0 | 4851 | 0 |
| host_picture_url | 12 | 1.00 | 57 | 109 | 0 | 4851 | 0 |
| host_neighbourhood | 2525 | 0.77 | 2 | 26 | 0 | 163 | 0 |
| host_verifications | 0 | 1.00 | 2 | 156 | 0 | 265 | 0 |
| street | 0 | 1.00 | 13 | 82 | 0 | 688 | 0 |
| neighbourhood | 1284 | 0.89 | 4 | 26 | 0 | 56 | 0 |
| neighbourhood_cleansed | 0 | 1.00 | 5 | 17 | 0 | 18 | 0 |
| city | 772 | 0.93 | 1 | 50 | 0 | 343 | 0 |
| state | 371 | 0.97 | 1 | 31 | 0 | 176 | 0 |
| zipcode | 10480 | 0.06 | 1 | 20 | 0 | 120 | 0 |
| market | 9 | 1.00 | 6 | 22 | 0 | 12 | 0 |
| smart_location | 0 | 1.00 | 9 | 61 | 0 | 387 | 0 |
| country_code | 0 | 1.00 | 2 | 2 | 0 | 3 | 0 |
| country | 0 | 1.00 | 5 | 14 | 0 | 3 | 0 |
| property_type | 0 | 1.00 | 3 | 22 | 0 | 41 | 0 |
| room_type | 0 | 1.00 | 10 | 15 | 0 | 4 | 0 |
| bed_type | 0 | 1.00 | 5 | 13 | 0 | 5 | 0 |
| amenities | 0 | 1.00 | 2 | 1126 | 0 | 8558 | 0 |
| price | 0 | 1.00 | 5 | 10 | 0 | 374 | 0 |
| weekly_price | 10601 | 0.05 | 6 | 10 | 0 | 268 | 0 |
| monthly_price | 10480 | 0.06 | 7 | 11 | 0 | 316 | 0 |
| security_deposit | 5677 | 0.49 | 5 | 10 | 0 | 231 | 0 |
| cleaning_fee | 5055 | 0.55 | 5 | 9 | 0 | 259 | 0 |
| extra_people | 0 | 1.00 | 5 | 9 | 0 | 184 | 0 |
| calendar_updated | 0 | 1.00 | 5 | 13 | 0 | 78 | 0 |
| cancellation_policy | 0 | 1.00 | 6 | 27 | 0 | 6 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| last_scraped | 0 | 1.00 | 2020-06-15 | 2020-06-19 | 2020-06-17 | 4 |
| host_since | 12 | 1.00 | 2009-08-17 | 2020-06-10 | 2015-12-27 | 2355 |
| calendar_last_scraped | 0 | 1.00 | 2020-06-15 | 2020-06-19 | 2020-06-17 | 4 |
| first_review | 4155 | 0.63 | 2011-02-14 | 2020-06-15 | 2018-02-19 | 1986 |
| last_review | 4155 | 0.63 | 2013-01-02 | 2020-06-17 | 2019-06-23 | 1365 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| thumbnail_url | 11187 | 0 | NaN | : |
| medium_url | 11187 | 0 | NaN | : |
| xl_picture_url | 11187 | 0 | NaN | : |
| host_is_superhost | 12 | 1 | 0.13 | FAL: 9669, TRU: 1506 |
| host_has_profile_pic | 12 | 1 | 1.00 | TRU: 11141, FAL: 34 |
| host_identity_verified | 12 | 1 | 0.27 | FAL: 8179, TRU: 2996 |
| neighbourhood_group_cleansed | 11187 | 0 | NaN | : |
| is_location_exact | 0 | 1 | 0.69 | TRU: 7698, FAL: 3489 |
| square_feet | 11172 | 0 | 0.00 | FAL: 15 |
| has_availability | 0 | 1 | 1.00 | TRU: 11187 |
| requires_license | 0 | 1 | 0.00 | FAL: 11187 |
| license | 11187 | 0 | NaN | : |
| jurisdiction_names | 11187 | 0 | NaN | : |
| instant_bookable | 0 | 1 | 0.42 | FAL: 6485, TRU: 4702 |
| is_business_travel_ready | 0 | 1 | 0.00 | FAL: 11187 |
| require_guest_profile_picture | 0 | 1 | 0.01 | FAL: 11102, TRU: 85 |
| require_guest_phone_verification | 0 | 1 | 0.01 | FAL: 11086, TRU: 101 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 2.50e+07 | 1.17e+07 | 6.91e+04 | 1.63e+07 | 2.63e+07 | 3.47e+07 | 4.38e+07 | ▃▅▆▇▇ |
| scrape_id | 0 | 1.00 | 2.02e+13 | 0.00e+00 | 2.02e+13 | 2.02e+13 | 2.02e+13 | 2.02e+13 | 2.02e+13 | ▁▁▇▁▁ |
| host_id | 0 | 1.00 | 8.84e+07 | 8.74e+07 | 3.22e+04 | 1.69e+07 | 5.25e+07 | 1.39e+08 | 3.49e+08 | ▇▃▂▂▁ |
| host_listings_count | 12 | 1.00 | 4.85e+01 | 1.05e+02 | 0.00e+00 | 1.00e+00 | 5.00e+00 | 2.20e+01 | 3.86e+02 | ▇▁▁▁▁ |
| host_total_listings_count | 12 | 1.00 | 4.85e+01 | 1.05e+02 | 0.00e+00 | 1.00e+00 | 5.00e+00 | 2.20e+01 | 3.86e+02 | ▇▁▁▁▁ |
| latitude | 0 | 1.00 | 2.23e+01 | 5.00e-02 | 2.22e+01 | 2.23e+01 | 2.23e+01 | 2.23e+01 | 2.26e+01 | ▁▇▁▁▁ |
| longitude | 0 | 1.00 | 1.14e+02 | 4.00e-02 | 1.14e+02 | 1.14e+02 | 1.14e+02 | 1.14e+02 | 1.14e+02 | ▁▁▃▇▁ |
| accommodates | 0 | 1.00 | 2.82e+00 | 2.18e+00 | 1.00e+00 | 2.00e+00 | 2.00e+00 | 3.00e+00 | 1.60e+01 | ▇▁▁▁▁ |
| bathrooms | 17 | 1.00 | 1.16e+00 | 5.70e-01 | 0.00e+00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | 1.10e+01 | ▇▁▁▁▁ |
| bedrooms | 38 | 1.00 | 1.09e+00 | 8.50e-01 | 0.00e+00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | 1.10e+01 | ▇▁▁▁▁ |
| beds | 69 | 0.99 | 1.68e+00 | 1.44e+00 | 0.00e+00 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 2.00e+01 | ▇▁▁▁▁ |
| guests_included | 0 | 1.00 | 1.39e+00 | 1.06e+00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | 1.60e+01 | ▇▁▁▁▁ |
| minimum_nights | 0 | 1.00 | 9.76e+00 | 2.83e+01 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 7.00e+00 | 1.10e+03 | ▇▁▁▁▁ |
| maximum_nights | 0 | 1.00 | 3.86e+05 | 2.87e+07 | 1.00e+00 | 3.65e+02 | 1.12e+03 | 1.12e+03 | 2.15e+09 | ▇▁▁▁▁ |
| minimum_minimum_nights | 0 | 1.00 | 9.61e+00 | 2.80e+01 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 7.00e+00 | 1.10e+03 | ▇▁▁▁▁ |
| maximum_minimum_nights | 0 | 1.00 | 1.00e+01 | 2.91e+01 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 7.00e+00 | 1.10e+03 | ▇▁▁▁▁ |
| minimum_maximum_nights | 0 | 1.00 | 3.86e+05 | 2.87e+07 | 1.00e+00 | 3.65e+02 | 1.12e+03 | 1.12e+03 | 2.15e+09 | ▇▁▁▁▁ |
| maximum_maximum_nights | 0 | 1.00 | 3.86e+05 | 2.87e+07 | 1.00e+00 | 3.65e+02 | 1.12e+03 | 1.12e+03 | 2.15e+09 | ▇▁▁▁▁ |
| minimum_nights_avg_ntm | 0 | 1.00 | 9.79e+00 | 2.82e+01 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 7.00e+00 | 1.10e+03 | ▇▁▁▁▁ |
| maximum_nights_avg_ntm | 0 | 1.00 | 3.86e+05 | 2.87e+07 | 1.00e+00 | 3.65e+02 | 1.12e+03 | 1.12e+03 | 2.15e+09 | ▇▁▁▁▁ |
| availability_30 | 0 | 1.00 | 1.55e+01 | 1.40e+01 | 0.00e+00 | 0.00e+00 | 2.00e+01 | 3.00e+01 | 3.00e+01 | ▇▁▁▁▇ |
| availability_60 | 0 | 1.00 | 3.28e+01 | 2.79e+01 | 0.00e+00 | 0.00e+00 | 4.70e+01 | 6.00e+01 | 6.00e+01 | ▆▁▁▁▇ |
| availability_90 | 0 | 1.00 | 5.06e+01 | 4.17e+01 | 0.00e+00 | 0.00e+00 | 7.60e+01 | 9.00e+01 | 9.00e+01 | ▆▁▁▁▇ |
| availability_365 | 0 | 1.00 | 1.68e+02 | 1.57e+02 | 0.00e+00 | 0.00e+00 | 1.08e+02 | 3.64e+02 | 3.65e+02 | ▇▂▂▁▇ |
| number_of_reviews | 0 | 1.00 | 1.77e+01 | 4.12e+01 | 0.00e+00 | 0.00e+00 | 2.00e+00 | 1.40e+01 | 7.57e+02 | ▇▁▁▁▁ |
| number_of_reviews_ltm | 0 | 1.00 | 2.68e+00 | 7.55e+00 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 1.00e+00 | 1.38e+02 | ▇▁▁▁▁ |
| review_scores_rating | 4355 | 0.61 | 9.09e+01 | 1.12e+01 | 2.00e+01 | 8.70e+01 | 9.40e+01 | 9.90e+01 | 1.00e+02 | ▁▁▁▂▇ |
| review_scores_accuracy | 4357 | 0.61 | 9.34e+00 | 1.12e+00 | 2.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| review_scores_cleanliness | 4357 | 0.61 | 9.09e+00 | 1.20e+00 | 2.00e+00 | 9.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | ▁▁▁▂▇ |
| review_scores_checkin | 4356 | 0.61 | 9.50e+00 | 1.04e+00 | 2.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| review_scores_communication | 4357 | 0.61 | 9.51e+00 | 1.03e+00 | 2.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| review_scores_location | 4358 | 0.61 | 9.61e+00 | 8.50e-01 | 2.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| review_scores_value | 4358 | 0.61 | 9.13e+00 | 1.13e+00 | 2.00e+00 | 9.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | ▁▁▁▂▇ |
| calculated_host_listings_count | 0 | 1.00 | 4.57e+01 | 1.03e+02 | 1.00e+00 | 1.00e+00 | 4.00e+00 | 1.90e+01 | 3.89e+02 | ▇▁▁▁▁ |
| calculated_host_listings_count_entire_homes | 0 | 1.00 | 7.80e+00 | 1.90e+01 | 0.00e+00 | 0.00e+00 | 1.00e+00 | 4.00e+00 | 1.08e+02 | ▇▁▁▁▁ |
| calculated_host_listings_count_private_rooms | 0 | 1.00 | 3.29e+01 | 8.22e+01 | 0.00e+00 | 0.00e+00 | 1.00e+00 | 1.10e+01 | 3.39e+02 | ▇▁▁▁▁ |
| calculated_host_listings_count_shared_rooms | 0 | 1.00 | 4.54e+00 | 1.57e+01 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 8.20e+01 | ▇▁▁▁▁ |
| reviews_per_month | 4155 | 0.63 | 8.40e-01 | 1.18e+00 | 1.00e-02 | 1.20e-01 | 3.50e-01 | 1.03e+00 | 1.32e+01 | ▇▁▁▁▁ |
Minimum number of nights
The histogram shows that the distribution is right-skewed with a median of 2, meaning that the majority of properties are aimed at tourists. At the same time, there are many observations with a long minimum stay; these are properties for long-term rent.
favstats(~minimum_nights, data= listings)
| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 2 | 7 | 1.1e+03 | 9.76 | 28.3 | 11187 | 0 |
Number of accommodates
The situation is similar to the minimum number of nights: a slightly right-skewed distribution. There are some large listings that accommodate up to 16 people.
favstats(~accommodates, data= listings)
| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 2 | 3 | 16 | 2.82 | 2.18 | 11187 | 0 |
favstats(~number_of_reviews, data= listings)
| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2 | 14 | 757 | 17.7 | 41.2 | 11187 | 0 |
favstats(~bedrooms, data= listings)
| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 11 | 1.09 | 0.854 | 11149 | 38 |
favstats(~beds, data= listings)
| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 2 | 20 | 1.68 | 1.44 | 11118 | 69 |
favstats(~bathrooms, data= listings)
| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 11 | 1.16 | 0.573 | 11170 | 17 |
Data cleaning and transformation
Before proceeding to the analysis, we apply the following transformations to the main data set:
- Select the variables we found interesting
- Change the price variables (price, cleaning_fee, extra_people, and security_deposit) to numeric variables, as they are stored as characters
- Remove listings with minimum_nights > 4 or maximum_nights < 4. These listings are either long-term rentals not connected to tourism (not always, but most likely) or too short for our stay. Since the goal of the project is to predict the price for 4 nights, they are irrelevant
- Delete observations with a price of 0 (most likely listings where the price has to be discussed by phone)
- Delete observations that do not allow a second person to stay
- Delete listings without a host response time, and flag whether a listing has a summary (has_summary)
- Set security deposit and cleaning fee to 0 where they are NA, as NA here means that no security deposit is required or that cleaning is free
- For Property type, find the 7 most popular types in the data set and create a simplified version of the variable that keeps those and assigns the value “Other” to all remaining observations. Cancellation type is reduced to 3 types. Neighbourhood type, Room type and Bed type are checked to see whether they have fewer than 6 levels
- Create the host_since_calculated variable, which reflects the number of years the host has been a host
- Convert the amenities variable to a count of the amenities in a property, rather than comma-separated values, and store it in total_amenities
- After talking to a person who lives in Hong Kong, create 3 broader groups of neighbourhoods: hong_kong, kowloon and new_territories
- Finally, calculate the cost for two people to stay in Hong Kong for 4 nights and store it in the variable price_4_nights. The price is the sum of four nights at the nightly price, the cleaning fee, and four nights of the extra-guest charge if the second guest is not already included (otherwise that charge is 0)
In order to run models with our data, we need to modify some variables and create some new variables that we want to examine.
After that, we need to filter the data.
interesting_data <- listings %>% # Select variables of interest
select(id,
listing_url,
summary,
space,
description,
host_since,
host_response_time,
host_response_rate,
host_is_superhost,
host_listings_count,
host_has_profile_pic,
host_identity_verified,
neighbourhood_cleansed,
latitude, longitude,
is_location_exact,
property_type,
room_type,
accommodates,
bathrooms,
bedrooms,
beds,
bed_type,
amenities,
price,
security_deposit,
cleaning_fee, guests_included,
extra_people,
minimum_nights,
maximum_nights,
has_availability,
number_of_reviews,
number_of_reviews_ltm,
last_review,
review_scores_rating,
review_scores_accuracy,
review_scores_cleanliness,
review_scores_checkin,
review_scores_communication,
review_scores_location,
review_scores_value,
instant_bookable,
cancellation_policy,
reviews_per_month)
Data Wrangling
main_data <- interesting_data %>%
mutate(host_response_rate = parse_number(host_response_rate),
price = parse_number(price),
security_deposit = parse_number(security_deposit),
cleaning_fee = parse_number(cleaning_fee),
extra_people = parse_number(extra_people),
has_summary = !is.na(summary))%>%
filter(minimum_nights <= 4,
maximum_nights >= 4,
price > 0,
accommodates >= 2,
!is.na(host_response_time)
)
skim(main_data)
| Name | main_data |
| Number of rows | 3829 |
| Number of columns | 46 |
| _______________________ | |
| Column type frequency: | |
| character | 11 |
| Date | 2 |
| logical | 7 |
| numeric | 26 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| listing_url | 0 | 1.00 | 34 | 37 | 0 | 3829 | 0 |
| summary | 273 | 0.93 | 1 | 1000 | 0 | 2486 | 0 |
| space | 1184 | 0.69 | 3 | 1000 | 0 | 1843 | 0 |
| description | 179 | 0.95 | 1 | 1000 | 0 | 2904 | 0 |
| host_response_time | 0 | 1.00 | 12 | 18 | 0 | 4 | 0 |
| neighbourhood_cleansed | 0 | 1.00 | 5 | 17 | 0 | 18 | 0 |
| property_type | 0 | 1.00 | 3 | 21 | 0 | 33 | 0 |
| room_type | 0 | 1.00 | 10 | 15 | 0 | 4 | 0 |
| bed_type | 0 | 1.00 | 5 | 13 | 0 | 5 | 0 |
| amenities | 0 | 1.00 | 2 | 1126 | 0 | 2962 | 0 |
| cancellation_policy | 0 | 1.00 | 8 | 27 | 0 | 5 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| host_since | 0 | 1.00 | 2010-02-13 | 2020-06-03 | 2016-01-17 | 1026 |
| last_review | 798 | 0.79 | 2015-01-03 | 2020-06-17 | 2019-11-20 | 718 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| host_is_superhost | 0 | 1 | 0.20 | FAL: 3069, TRU: 760 |
| host_has_profile_pic | 0 | 1 | 1.00 | TRU: 3815, FAL: 14 |
| host_identity_verified | 0 | 1 | 0.32 | FAL: 2597, TRU: 1232 |
| is_location_exact | 0 | 1 | 0.70 | TRU: 2668, FAL: 1161 |
| has_availability | 0 | 1 | 1.00 | TRU: 3829 |
| instant_bookable | 0 | 1 | 0.57 | TRU: 2198, FAL: 1631 |
| has_summary | 0 | 1 | 0.93 | TRU: 3556, FAL: 273 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 2.57e+07 | 1.22e+07 | 69074.00 | 1.65e+07 | 2.77e+07 | 3.54e+07 | 4.38e+07 | ▃▅▆▇▇ |
| host_response_rate | 0 | 1.00 | 8.21e+01 | 3.20e+01 | 0.00 | 8.40e+01 | 1.00e+02 | 1.00e+02 | 1.00e+02 | ▁▁▁▁▇ |
| host_listings_count | 0 | 1.00 | 1.80e+01 | 3.06e+01 | 0.00 | 2.00e+00 | 8.00e+00 | 1.80e+01 | 3.86e+02 | ▇▁▁▁▁ |
| latitude | 0 | 1.00 | 2.23e+01 | 6.00e-02 | 22.16 | 2.23e+01 | 2.23e+01 | 2.23e+01 | 2.26e+01 | ▁▇▂▁▁ |
| longitude | 0 | 1.00 | 1.14e+02 | 5.00e-02 | 113.86 | 1.14e+02 | 1.14e+02 | 1.14e+02 | 1.14e+02 | ▁▁▂▇▁ |
| accommodates | 0 | 1.00 | 3.56e+00 | 2.40e+00 | 2.00 | 2.00e+00 | 2.00e+00 | 4.00e+00 | 1.60e+01 | ▇▁▁▁▁ |
| bathrooms | 2 | 1.00 | 1.20e+00 | 6.40e-01 | 0.00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | 1.00e+01 | ▇▁▁▁▁ |
| bedrooms | 8 | 1.00 | 1.21e+00 | 9.20e-01 | 0.00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | 1.10e+01 | ▇▁▁▁▁ |
| beds | 19 | 1.00 | 2.07e+00 | 1.72e+00 | 0.00 | 1.00e+00 | 2.00e+00 | 2.00e+00 | 2.00e+01 | ▇▁▁▁▁ |
| price | 0 | 1.00 | 8.23e+02 | 1.72e+03 | 47.00 | 3.49e+02 | 5.04e+02 | 8.53e+02 | 5.80e+04 | ▇▁▁▁▁ |
| security_deposit | 1605 | 0.58 | 8.86e+02 | 2.08e+03 | 0.00 | 0.00e+00 | 0.00e+00 | 1.00e+03 | 3.80e+04 | ▇▁▁▁▁ |
| cleaning_fee | 1339 | 0.65 | 1.63e+02 | 2.37e+02 | 0.00 | 0.00e+00 | 1.00e+02 | 2.50e+02 | 4.68e+03 | ▇▁▁▁▁ |
| guests_included | 0 | 1.00 | 1.72e+00 | 1.41e+00 | 1.00 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 1.60e+01 | ▇▁▁▁▁ |
| extra_people | 0 | 1.00 | 8.60e+01 | 1.86e+02 | 0.00 | 0.00e+00 | 0.00e+00 | 1.00e+02 | 2.30e+03 | ▇▁▁▁▁ |
| minimum_nights | 0 | 1.00 | 1.36e+00 | 6.90e-01 | 1.00 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 4.00e+00 | ▇▂▁▁▁ |
| maximum_nights | 0 | 1.00 | 5.64e+05 | 3.47e+07 | 4.00 | 3.65e+02 | 1.12e+03 | 1.12e+03 | 2.15e+09 | ▇▁▁▁▁ |
| number_of_reviews | 0 | 1.00 | 3.32e+01 | 5.57e+01 | 0.00 | 1.00e+00 | 9.00e+00 | 4.00e+01 | 7.57e+02 | ▇▁▁▁▁ |
| number_of_reviews_ltm | 0 | 1.00 | 5.70e+00 | 1.08e+01 | 0.00 | 0.00e+00 | 1.00e+00 | 6.00e+00 | 1.38e+02 | ▇▁▁▁▁ |
| review_scores_rating | 836 | 0.78 | 9.03e+01 | 1.04e+01 | 20.00 | 8.70e+01 | 9.30e+01 | 9.70e+01 | 1.00e+02 | ▁▁▁▂▇ |
| review_scores_accuracy | 836 | 0.78 | 9.26e+00 | 1.08e+00 | 2.00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| review_scores_cleanliness | 837 | 0.78 | 9.09e+00 | 1.09e+00 | 2.00 | 9.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | ▁▁▁▂▇ |
| review_scores_checkin | 836 | 0.78 | 9.49e+00 | 9.70e-01 | 2.00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| review_scores_communication | 836 | 0.78 | 9.48e+00 | 9.60e-01 | 2.00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| review_scores_location | 836 | 0.78 | 9.60e+00 | 8.30e-01 | 2.00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| review_scores_value | 836 | 0.78 | 9.06e+00 | 1.05e+00 | 2.00 | 9.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | ▁▁▁▂▇ |
| reviews_per_month | 798 | 0.79 | 1.25e+00 | 1.41e+00 | 0.02 | 2.70e-01 | 7.30e-01 | 1.71e+00 | 1.32e+01 | ▇▁▁▁▁ |
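The currency and percentage columns converted in the wrangling step above arrive as character strings. As a minimal sketch (the sample strings below are made up for illustration and are not taken from the data), readr's parse_number() drops the currency symbol, the grouping commas and the percent sign:

```r
library(readr)

# Hypothetical raw values in the style of the price and rate columns
raw_fee  <- c("$1,200.00", "$85.00", NA)
raw_rate <- c("95%", "100%")

fee  <- parse_number(raw_fee)   # "$" and "," are ignored; NA stays NA
rate <- parse_number(raw_rate)  # "%" is ignored

fee
rate
```

Missing values survive the conversion as NA, which is why the cleaning fee and security deposit still need separate NA handling afterwards.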
Handling NAs
#Check NAs
main_data %>%
select(price, cleaning_fee, extra_people, host_response_rate, security_deposit) %>%
skim()%>%
kable()%>%
kable_styling()
| skim_type | skim_variable | n_missing | complete_rate | numeric.mean | numeric.sd | numeric.p0 | numeric.p25 | numeric.p50 | numeric.p75 | numeric.p100 | numeric.hist |
|---|---|---|---|---|---|---|---|---|---|---|---|
| numeric | price | 0 | 1.000 | 823.4 | 1722 | 47 | 349 | 504 | 853 | 58002 | ▇▁▁▁▁ |
| numeric | cleaning_fee | 1339 | 0.650 | 163.1 | 237 | 0 | 0 | 100 | 250 | 4682 | ▇▁▁▁▁ |
| numeric | extra_people | 0 | 1.000 | 86.0 | 186 | 0 | 0 | 0 | 100 | 2300 | ▇▁▁▁▁ |
| numeric | host_response_rate | 0 | 1.000 | 82.1 | 32 | 0 | 84 | 100 | 100 | 100 | ▁▁▁▁▇ |
| numeric | security_deposit | 1605 | 0.581 | 886.4 | 2079 | 0 | 0 | 0 | 1000 | 38000 | ▇▁▁▁▁ |
# Making values for security deposit and cleaning fee 0 if they are NA
main_data2 <- main_data %>%
mutate(cleaning_fee = replace_na(cleaning_fee, 0),
security_deposit = replace_na(security_deposit, 0))
Filtering Variables and Creating Factor Levels
top6_prop_type <- main_data2 %>%
group_by(property_type)%>%
summarise(count =n())%>%
arrange(desc(count))%>%
slice(1:7)
main_data3 <- main_data2 %>%
mutate(prop_type_simplified = case_when(
property_type %in% c("Apartment", "Condominium", "Hostel", "Serviced apartment", "Guesthouse", "House", "Hotel") ~ property_type,
TRUE ~ "Other"
))
# Neighborhood_cleansed
top_neighbourhood_type <- main_data2 %>%
group_by(neighbourhood_cleansed)%>%
summarise(count=n())%>%
arrange(desc(count))
# Room type
top4_room_type <- main_data2 %>%
group_by(room_type)%>%
summarise(count =n())%>%
arrange(desc(count))%>%
slice(1:6)
# No adjustment needed because room_type only has 4 values
# bed_type
top6_bed_type <- main_data2 %>%
group_by(bed_type)%>%
summarise(count =n())%>%
arrange(desc(count))%>%
slice(1:6)
# we won't look into this variable as almost all observations are in the same category
# cancellation_policy
top6_cancel_type <- main_data2 %>%
group_by(cancellation_policy)%>%
summarise(count =n())%>%
arrange(desc(count))%>%
slice(1:10)
# we will make one bucket for all strict policies
main_data3 <- main_data3 %>%
mutate(cancellation_policy = case_when(
cancellation_policy %in% c("strict_14_with_grace_period", "super_strict_60", "super_strict_30", "strict") ~ "strict",
cancellation_policy %in% c("moderate") ~"moderate",
cancellation_policy %in% c("flexible") ~"flexible"
))
# Check if number of levels is less than 6
top6_host_time_type <- main_data2 %>%
group_by(host_response_time)%>%
summarise(count =n())%>%
arrange(desc(count))
# We only have 4 types of response time so we will keep it this way
# Factoring the categories
main_data4 <- main_data3 %>%
mutate(
# room_type = factor(room_type, order = TRUE, levels = c("Entire home/apt", "Hotel room", "Private room", "Shared room")),
host_response_time = factor(host_response_time, ordered = TRUE, levels = c("within an hour", "within a few hours", "within a day","a few days or more"))
)
#Creation of a new variable reflecting the number of years a host has been a host
main_data4 <- main_data4 %>%
mutate(host_since_calculated = as.numeric(ymd("2020-10-19") - host_since)/365)
# Convert the amenities variable to be a count of the number of amenities in a certain property, rather than comma separated values.
main_data4<- main_data4 %>%
mutate(total_amenities=str_count(amenities, ',')+1)
# Based on an interview with a local, we then assigned each neighbourhood to an area
hong_kong <-c("Central & Western","Wan Chai","Southern","Eastern")
kowloon <-c("Kowloon City","Yau Tsim Mong","Sham Shui Po","Wong Tai Sin", "Kwun Tong")
new_territories <-c("Tsuen Wan","Sai Kung", "Tai Po", "Tuen Mun","Kwai Tsing","Sha Tin","Yuen Long","North","Islands")
# Assigning neighbourhood simplified
main_data4 <- main_data4 %>%
mutate(neighbourhood_simplified = case_when(neighbourhood_cleansed %in% hong_kong ~ "hong_kong",
neighbourhood_cleansed %in% kowloon ~ "kowloon",
neighbourhood_cleansed %in% new_territories ~ "new_territories",
))
# Creating variable for the size of the apartment
main_data4 <- main_data4 %>%
mutate(is_it_big_sum = str_detect(summary,c("large", "big" , "Big" , "spacious", "bright", "Spacious", "Large", "Bright", "refurbished","Refurbished", "Renovated", "renovated")),
is_it_big_spa = str_detect(space,c("large", "big" , "Big" , "spacious", "bright", "Spacious", "Large", "Bright", "refurbished","Refurbished", "Renovated", "renovated")),
is_it_big_desc = str_detect(description,c("large", "big" , "Big" , "spacious", "bright", "Spacious", "Large", "Bright", "refurbished","Refurbished", "Renovated", "renovated")),
is_it_big = is_it_big_sum | is_it_big_spa | is_it_big_desc
)
# Check how many apartments are big
proportions<- main_data4 %>%
group_by(is_it_big) %>%
summarise(count=n()) %>%
mutate(proportion=count/sum(count))
proportions%>%
kable()%>%
kable_styling()
| is_it_big | count | proportion |
|---|---|---|
| FALSE | 2452 | 0.640 |
| TRUE | 149 | 0.039 |
| NA | 1228 | 0.321 |
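One caveat about the keyword flags above: when str_detect() receives a vector of several patterns, stringr pairs strings and patterns element by element instead of testing each string against every keyword, and recent stringr versions reject vectors of mismatched lengths outright. That may be one reason why so few listings are flagged as big. A minimal sketch of an "any keyword" match, using a single case-insensitive regular expression on made-up descriptions (the object names here are illustrative and not part of the original analysis):

```r
library(stringr)

# Made-up listing texts, for illustration only
texts <- c("Spacious flat near MTR", "Cozy room", "Newly RENOVATED studio", NA)

# One alternation pattern: TRUE if any keyword occurs, ignoring case
big_pattern <- regex("large|big|spacious|bright|refurbished|renovated",
                     ignore_case = TRUE)

is_it_big <- str_detect(texts, big_pattern)
is_it_big
```

Missing text still yields NA. Whether a missing description should count as "not big" is a modelling choice that the sketch leaves open.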
Creating price_4_nights
main_data4 <- main_data4 %>%
filter(guests_included<=2)%>%
mutate(
#Extra charge is 0 if 2 guests are included, and extra_people otherwise
extra_charge_pp= case_when(guests_included==2 ~ 0, guests_included ==1 ~ extra_people),
#Calculate the price for 4 nights for 2 people
price_4_nights= price*4 + extra_charge_pp*4 + cleaning_fee
)
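To make the formula concrete, here is a small worked sketch on two hypothetical listings (all values are invented for illustration). It applies the same rule as the code above: four nights at the nightly price, plus four nights of the extra-guest charge when only one guest is included, plus a one-off cleaning fee.

```r
library(dplyr)

# Two hypothetical listings (values invented for illustration)
toy <- tibble(
  price           = c(500, 400),
  cleaning_fee    = c(100, 0),
  guests_included = c(2, 1),
  extra_people    = c(50, 80)
)

toy <- toy %>%
  mutate(
    # No surcharge if two guests are already included in the nightly price
    extra_charge_pp = case_when(guests_included == 2 ~ 0,
                                guests_included == 1 ~ extra_people),
    price_4_nights  = price * 4 + extra_charge_pp * 4 + cleaning_fee
  )

toy$price_4_nights
```

The first listing costs 500 * 4 + 100 = 2100, and the second costs 400 * 4 + 80 * 4 = 1920.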
Visualizing the distribution of the variables of interest
Visualizing the distribution of price_4_nights
#density plot price_4_nights
ggplot(data=main_data4, aes(x=price_4_nights)) +
geom_density(fill = "blue", alpha = 0.3) +
labs(x="Price for Four Nights", y="Density", title = "The price for 4 nights is heavily right skewed") +
theme_minimal()
This distribution is heavily right-skewed, so we try log(price_4_nights) next…
#density plot log
ggplot(data=main_data4, aes(x=price_4_nights)) +
scale_x_log10()+geom_density(fill = "blue", alpha = 0.3) +
labs(title = "The logarithm of the price for 4 nights is closer to a normal distribution...", x="Log of Price for Four Nights", y="Density") +
theme_minimal()
First we use boxplots to analyse variables contributing to the price of 4 nights.
Room Type
ggplot(main_data4, aes(y= log(price_4_nights), x=room_type))+
geom_boxplot()+
coord_flip()+
labs(
title = "Median Prices Change by Room Type",
subtitle = "Price for 4 nights vs Room Type",
x = "Room Type",
y = "log(Price for 4 nights)"
) +
theme_minimal()
So we see that Private Room has the lowest median price, followed by Hotel Room and Entire Home/Apartment. Shared room seems to have a quite high median price for 4 nights. This seems odd. Let us double check:
double_check <- main_data4 %>%
group_by(room_type)%>%
summarise(median(price_4_nights),
mean(price_4_nights))
double_check
| room_type | median(price_4_nights) | mean(price_4_nights) |
|---|---|---|
| Entire home/apt | 3.01e+03 | 4.17e+03 |
| Hotel room | 2.02e+03 | 3.66e+03 |
| Private room | 1.68e+03 | 2.68e+03 |
| Shared room | 4.11e+03 | 3.1e+03 |
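Because the shared-room median sits above its mean, it may help to look at group sizes and quartiles as well, since a median based on few listings can be misleading. A sketch on invented data (the real check would run on main_data4; the object names are illustrative):

```r
library(dplyr)

# Invented prices, for illustration only
toy <- tibble(
  room_type      = c("Private room", "Private room", "Private room",
                     "Shared room", "Shared room"),
  price_4_nights = c(1500, 1700, 2600, 900, 4200)
)

room_summary <- toy %>%
  group_by(room_type) %>%
  summarise(
    n      = n(),                                         # listings per group
    q25    = quantile(price_4_nights, 0.25, names = FALSE),
    median = median(price_4_nights),
    q75    = quantile(price_4_nights, 0.75, names = FALSE),
    mean   = mean(price_4_nights)
  )

room_summary
```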
Neighbourhood
ggplot(main_data4, aes(y= log(price_4_nights), x=neighbourhood_simplified))+
geom_boxplot() +
coord_flip() +
labs(
title = "Different Zones seem to have an impact on the price",
subtitle = "Price for 4 nights vs Neighbourhood Zone",
x = "Neighbourhood Zone",
y = "log(Price for 4 nights)"
) +
theme_minimal()

Big Size
ggplot(main_data4, aes(x= log(price_4_nights), y=is_it_big))+
geom_boxplot()+
labs(
title = "Airbnbs that are big tend to have higher prices",
subtitle= "Price for 4 nights for Airbnbs that say they are 'big' ",
y= "Airbnb is big?",
x="log(Price for 4 nights)"
)+
theme_minimal()
From the graph above we see that we should definitely consider the variable is_it_big…
Now let us check the variable host_is_superhost:
Superhost
main_data4 %>%
filter(!is.na(host_is_superhost)) %>%
ggplot(aes(x= log(price_4_nights), y=host_is_superhost))+
geom_boxplot()+
labs(
title = "Super Host Variable shows a big impact on the price...",
subtitle= "Price for 4 nights for Airbnbs with superhosts compared to regular hosts",
y= "Host is superhost?",
x="log(Price for 4 nights)"
)+
theme_minimal()
Through this analysis, we now have a good overview of the individual variables.
Next we create a correlation scatterplot matrix to analyse how the variables in the dataset correlate with each other.
Numerical Variables
main_data4%>%
mutate(log4 = log(price_4_nights))%>%
select(bedrooms, bathrooms, accommodates, review_scores_rating, host_since_calculated, host_listings_count, host_response_rate, number_of_reviews, number_of_reviews_ltm, price_4_nights) %>%
ggpairs(aes(alpha=0.4)) +
theme_minimal()
Overall we see two things:
First, there are some correlations between price and our other numeric variables, which helps with modelling. It is important that predictors are clearly correlated with the price. Bedrooms, bathrooms and accommodates all have a correlation with price greater than 0.1.
Secondly, some of our variables are also correlated with each other. We must therefore be cautious when adding variables to the model, because strongly correlated predictors cause multicollinearity, which inflates the standard errors of the coefficients. For example, we need to be cautious with bedrooms, bathrooms and accommodates, as their pairwise correlations are greater than 0.5.
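The pairwise check described above can also be done numerically. The following is a minimal sketch, assuming the built-in mtcars data as a stand-in for the listings and 0.5 as the cut-off mentioned in the text:

```r
# mtcars ships with R and stands in for the listings data here
num_vars <- mtcars[, c("mpg", "disp", "hp", "wt")]

# Pairwise-complete correlations, tolerant of NAs such as missing review scores
cor_mat <- cor(num_vars, use = "pairwise.complete.obs")

# List variable pairs whose absolute correlation exceeds 0.5
high <- which(abs(cor_mat) > 0.5 & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = round(cor_mat[high], 2)
)
high_pairs
```

Pairs that show up in such a list are candidates for dropping one of the two variables or for a later VIF check.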
Now we can create scatter plots to visualize the distribution between some numeric variables and price.
Reviews
#Scatterplot of price_4_nights vs number of reviews
ggplot(main_data4, aes(y= log(price_4_nights), x=number_of_reviews))+
geom_point(alpha=0.3)+
geom_smooth(method="lm")+
scale_x_log10()+
labs(
title = "Prices decrease as number of reviews increases",
subtitle= "Price for 4 nights vs Number of Reviews",
x= "Number of Reviews",
y="log(Price for 4 nights)") +
theme_minimal()
Now that is definitely an interesting finding. Normally, one would expect that the more reviews a host has, the higher the price they can charge. However, that is not the case here.
Let us check if the variable for “Last 12 Months” reviews shows the same…
Reviews Last 12 Months
#Scatterplot of price_4_nights vs number of reviews in the last 12 months
ggplot(main_data4, aes(y= log(price_4_nights), x=number_of_reviews_ltm))+
geom_point(alpha=0.3)+
geom_smooth(method="lm")+
scale_x_log10()+
labs(
title = "Prices decrease as number of Last 12 months reviews increases",
subtitle= "Price for 4 nights vs Number of Reviews (last 12 months)",
x= "Number of Reviews (last 12 months)",
y="log(Price for 4 nights)")+
theme_minimal()
The last 12 months show a similar relationship. This will be interesting for modelling because it does not seem intuitive; there may be another variable correlated with the number of reviews that produces this pattern. For example, if many new, highly priced apartments were listed recently, those apartments and hosts would not yet have many reviews. In that case the observation above would just be a coincidence.
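A practical note on the two review plots: a log10 axis cannot place listings with zero reviews, since log10(0) is -Inf, so zero-review listings are excluded from the fitted line. One possible alternative, sketched here on invented data, is a log1p axis, which maps 0 to 0 (the data frame and its values are illustrative only):

```r
library(ggplot2)

# Invented data for illustration; includes listings with zero reviews
toy <- data.frame(
  number_of_reviews = c(0, 0, 3, 12, 40, 150),
  price_4_nights    = c(4200, 3900, 3100, 2800, 2500, 2300)
)

on_log10 <- log10(toy$number_of_reviews)  # zeros become -Inf
on_log1p <- log1p(toy$number_of_reviews)  # zeros stay at 0

p <- ggplot(toy, aes(x = number_of_reviews, y = log(price_4_nights))) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", formula = y ~ x) +
  scale_x_continuous(trans = "log1p") +  # keeps zero-review listings
  theme_minimal()
```

Whether the negative slope survives once zero-review listings are included is something only the real data can show.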
Now let us check for the experience that hosts have:
Host experience
ggplot(main_data4, aes(y= log(price_4_nights), x=host_since_calculated))+
geom_point(alpha=0.3)+
geom_smooth(method='lm')+
labs(
title = "Host experience seems to be correlated with higher prices",
subtitle= "Price for 4 nights vs Host experience",
x= "Experience of Host (in Years)",
y="log(Price for 4 nights)"
)+
theme_minimal()

Mapping
leaflet(data = main_data4) %>%
addProviderTiles("OpenStreetMap.Mapnik") %>%
addCircleMarkers(lng = ~longitude,
lat = ~latitude,
radius = 1,
fillColor = "blue",
fillOpacity = 0.4,
popup = ~listing_url,
label = ~property_type)
Regression Analysis
We will be doing a regression analysis with price_4_nights as the target variable or Y. We decided to take the logarithmic transformation of price_4_nights to account for the outliers that were making the distribution right skewed. Without doing a logarithmic transformation, our regression models failed to explain most of the variability in price_4_nights.
Visualizing the distribution of price_4_nights
#density plot price_4_nights
ggplot(data=main_data4, aes(x=price_4_nights)) +
geom_density(fill = "blue", alpha = 0.3) +
labs(x="Price for Four Nights") +
theme_minimal()

## This graph doesn't help us as we can't interpret it, so we will try log.
#density plot log
ggplot(data=main_data4, aes(x=price_4_nights)) +
scale_x_log10()+geom_density(fill = "blue", alpha = 0.3) +
labs(x="log of Price for Four Nights") +
theme_minimal()

Model 1: Base Case
In our first model we will run a regression with prop_type_simplified, number_of_reviews and review_scores_rating as the explanatory variables.
# ***** Fit linear regression models: First, just the mean
model0 <- lm(log(price_4_nights) ~ 1, data= main_data4)
model0 %>%
broom::tidy(conf.int=TRUE)
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 7.73 | 0.0122 | 634 | 0 | 7.7 | 7.75 |
model0 %>%
broom::glance()
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.693 | NA | NA | NA | -3.4e+03 | 6.8e+03 | 6.81e+03 | 1.55e+03 | 3228 | 3229 |
# ***** Fit linear regression models: log(price_4_nights) on 3 explanatory variables.
model1 <- lm(log(price_4_nights) ~ prop_type_simplified + number_of_reviews + review_scores_rating, data= main_data4)
model1 %>%
broom::tidy(conf.int=TRUE)%>%
kable()%>%
kable_styling()
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 7.076 | 0.106 | 66.53 | 0.000 | 6.868 | 7.285 |
| prop_type_simplifiedCondominium | -0.186 | 0.043 | -4.28 | 0.000 | -0.271 | -0.101 |
| prop_type_simplifiedGuesthouse | -0.369 | 0.063 | -5.82 | 0.000 | -0.493 | -0.244 |
| prop_type_simplifiedHostel | -0.357 | 0.056 | -6.42 | 0.000 | -0.466 | -0.248 |
| prop_type_simplifiedHotel | -0.139 | 0.062 | -2.26 | 0.024 | -0.260 | -0.018 |
| prop_type_simplifiedHouse | -0.216 | 0.065 | -3.35 | 0.001 | -0.343 | -0.090 |
| prop_type_simplifiedOther | -0.083 | 0.039 | -2.13 | 0.033 | -0.159 | -0.007 |
| prop_type_simplifiedServiced apartment | -0.253 | 0.057 | -4.44 | 0.000 | -0.365 | -0.141 |
| number_of_reviews | -0.001 | 0.000 | -4.96 | 0.000 | -0.001 | -0.001 |
| review_scores_rating | 0.009 | 0.001 | 7.36 | 0.000 | 0.006 | 0.011 |
model1 %>%
broom::glance()%>%
kable()%>%
kable_styling()
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.068 | 0.065 | 0.609 | 19.9 | 0 | 9 | -2255 | 4533 | 4597 | 904 | 2440 | 2450 |
msummary(model1)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.076490 0.106366 66.53 < 2e-16
## prop_type_simplifiedCondominium -0.185723 0.043417 -4.28 2.0e-05
## prop_type_simplifiedGuesthouse -0.368573 0.063353 -5.82 6.7e-09
## prop_type_simplifiedHostel -0.357016 0.055609 -6.42 1.6e-10
## prop_type_simplifiedHotel -0.139032 0.061578 -2.26 0.02404
## prop_type_simplifiedHouse -0.216427 0.064640 -3.35 0.00083
## prop_type_simplifiedOther -0.082629 0.038757 -2.13 0.03311
## prop_type_simplifiedServiced apartment -0.252819 0.056994 -4.44 9.6e-06
## number_of_reviews -0.001036 0.000209 -4.96 7.4e-07
## review_scores_rating 0.008587 0.001167 7.36 2.5e-13
##
## (Intercept) ***
## prop_type_simplifiedCondominium ***
## prop_type_simplifiedGuesthouse ***
## prop_type_simplifiedHostel ***
## prop_type_simplifiedHotel *
## prop_type_simplifiedHouse ***
## prop_type_simplifiedOther *
## prop_type_simplifiedServiced apartment ***
## number_of_reviews ***
## review_scores_rating ***
##
## Residual standard error: 0.609 on 2440 degrees of freedom
## (779 observations deleted due to missingness)
## Multiple R-squared: 0.0683, Adjusted R-squared: 0.0648
## F-statistic: 19.9 on 9 and 2440 DF, p-value: <2e-16
car::vif(model1)
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.03 7 1.00
## number_of_reviews 1.03 1 1.01
## review_scores_rating 1.04 1 1.02
autoplot(model1)

The coefficient of review_scores_rating indicates that, after controlling for the other variables, Airbnbs with higher ratings have a higher price. All variables used in this model are significant predictors of price at the 5% level, since the absolute values of their t-statistics are greater than 2. However, the adjusted R-squared of our first model is only 6.5%, indicating that these variables alone explain little of the variability in price. We computed the Variance Inflation Factor (VIF) for the model and observe that it does not suffer from multicollinearity, as the VIF is less than 5 for all variables.
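Because the response is log(price_4_nights), a coefficient b corresponds to a multiplicative change of exp(b) in price, holding the other variables fixed. A small sketch using two estimates from the Model 1 output (the helper name pct_change is ours, introduced only for illustration):

```r
# Convert a coefficient from a log(price) regression into an
# approximate percentage change in price
pct_change <- function(beta) (exp(beta) - 1) * 100

rating_effect <- pct_change(0.008587)   # one extra point of review_scores_rating
hostel_effect <- pct_change(-0.357016)  # Hostel relative to the baseline type

rating_effect
hostel_effect
```

One additional rating point goes with a price that is roughly 0.9% higher, and a hostel is priced roughly 30% below the baseline property type (Apartment, the omitted level).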
Model 2: Adding room_type
We want to determine if room_type is a significant predictor of the cost for 4 nights, given everything else in the model.
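Since room_type enters the model as several dummy variables, one way to test it as a whole is a nested-model F-test. The following is a minimal sketch, assuming the built-in mtcars data as a stand-in: factor(cyl) plays the role of room_type, and the model names are illustrative.

```r
# mtcars stands in for the listings data; cyl plays the role of room_type
small <- lm(log(mpg) ~ wt, data = mtcars)
large <- lm(log(mpg) ~ wt + factor(cyl), data = mtcars)

# F-test of H0: all coefficients of the added factor are zero
f_test <- anova(small, large)
f_test
```

Both models need to be fitted to the same rows for this comparison to be valid, which matters when the added variable has missing values.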
# ***** Fit linear regression models: log(price_4_nights) on 4 explanatory variables.
model2 <- lm(log(price_4_nights) ~ prop_type_simplified + number_of_reviews + review_scores_rating + room_type , data= main_data4)
model2 %>%
broom::tidy(conf.int=TRUE)%>%
kable()%>%
kable_styling()
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 7.531 | 0.103 | 73.246 | 0.000 | 7.330 | 7.733 |
| prop_type_simplifiedCondominium | -0.181 | 0.041 | -4.451 | 0.000 | -0.261 | -0.101 |
| prop_type_simplifiedGuesthouse | -0.176 | 0.060 | -2.921 | 0.004 | -0.294 | -0.058 |
| prop_type_simplifiedHostel | -0.191 | 0.055 | -3.506 | 0.000 | -0.298 | -0.084 |
| prop_type_simplifiedHotel | -0.079 | 0.067 | -1.176 | 0.240 | -0.212 | 0.053 |
| prop_type_simplifiedHouse | -0.096 | 0.061 | -1.568 | 0.117 | -0.215 | 0.024 |
| prop_type_simplifiedOther | 0.016 | 0.038 | 0.432 | 0.666 | -0.058 | 0.090 |
| prop_type_simplifiedServiced apartment | -0.093 | 0.054 | -1.706 | 0.088 | -0.199 | 0.014 |
| number_of_reviews | -0.001 | 0.000 | -3.467 | 0.001 | -0.001 | 0.000 |
| review_scores_rating | 0.006 | 0.001 | 5.307 | 0.000 | 0.004 | 0.008 |
| room_typeHotel room | -0.138 | 0.067 | -2.062 | 0.039 | -0.269 | -0.007 |
| room_typePrivate room | -0.475 | 0.026 | -17.955 | 0.000 | -0.527 | -0.423 |
| room_typeShared room | -0.165 | 0.063 | -2.620 | 0.009 | -0.288 | -0.041 |
model2 %>%
broom::glance()%>%
kable()%>%
kable_styling()
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.183 | 0.179 | 0.57 | 45.6 | 0 | 12 | -2094 | 4216 | 4297 | 793 | 2437 | 2450 |
msummary(model2)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.531184 0.102821 73.25 < 2e-16 ***
## prop_type_simplifiedCondominium -0.181293 0.040731 -4.45 8.9e-06 ***
## prop_type_simplifiedGuesthouse -0.176148 0.060308 -2.92 0.00352 **
## prop_type_simplifiedHostel -0.191084 0.054507 -3.51 0.00046 ***
## prop_type_simplifiedHotel -0.079297 0.067443 -1.18 0.23981
## prop_type_simplifiedHouse -0.095567 0.060960 -1.57 0.11708
## prop_type_simplifiedOther 0.016296 0.037764 0.43 0.66614
## prop_type_simplifiedServiced apartment -0.092799 0.054382 -1.71 0.08805 .
## number_of_reviews -0.000683 0.000197 -3.47 0.00053 ***
## review_scores_rating 0.005858 0.001104 5.31 1.2e-07 ***
## room_typeHotel room -0.137632 0.066760 -2.06 0.03935 *
## room_typePrivate room -0.475182 0.026465 -17.96 < 2e-16 ***
## room_typeShared room -0.164634 0.062837 -2.62 0.00885 **
##
## Residual standard error: 0.57 on 2437 degrees of freedom
## (779 observations deleted due to missingness)
## Multiple R-squared: 0.183, Adjusted R-squared: 0.179
## F-statistic: 45.6 on 12 and 2437 DF, p-value: <2e-16
car::vif(model2)
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.63 7 1.04
## number_of_reviews 1.04 1 1.02
## review_scores_rating 1.06 1 1.03
## room_type 1.65 3 1.09
autoplot(model2)
We observe that room_type has improved our model and is a significant predictor of price. It is not collinear with our other variables, as the VIF is still below 5. The adjusted R-squared of our new model has increased to 17.9%, indicating that room_type is an important indicator of price.
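Because the response is log(price_4_nights), each coefficient is easier to read once it is converted to an approximate percentage change in price. The sketch below does this for three coefficients copied from the Model 2 output above. It assumes that "Entire home/apt" is the reference level of room_type, which seems likely since it is the only room type without its own coefficient.

```r
# Coefficients copied from the Model 2 msummary() output above.
coefs <- c(
  "room_typePrivate room" = -0.475182,
  "room_typeShared room"  = -0.164634,
  "review_scores_rating"  =  0.005858
)

# In a log-linear model, a coefficient b corresponds to a
# (exp(b) - 1) * 100 percent change in the untransformed response.
pct_change <- (exp(coefs) - 1) * 100
print(round(pct_change, 1))
```

Under that assumption, a private room is roughly 38% cheaper than an entire home or apartment with otherwise identical characteristics, and one additional rating point corresponds to a price that is about 0.6% higher.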
Now we will explore adding further variables.
Model 3: Improving Model with More Listing Features
We now start our own exploration, beginning with the question of whether we can improve Model 2.
In the visualisations above we saw that the variables bedrooms, beds, bathrooms, total_amenities and accommodates each have a correlation with price above 0.1. This makes sense, since larger Airbnbs tend to accommodate more guests and command a higher price. We will look at the effect of these variables on our model. We also replace number_of_reviews with number_of_reviews_ltm, since we found that it improves R-squared.
glimpse(main_data4)
## Rows: 3,229
## Columns: 56
## $ id <dbl> 69074, 103760, 163664, 163742, 248140, 27…
## $ listing_url <chr> "https://www.airbnb.com/rooms/69074", "ht…
## $ summary <chr> "An ideal Hong location any visitor--hip …
## $ space <chr> "Filled with plants and art, this one bed…
## $ description <chr> "An ideal Hong location any visitor--hip …
## $ host_since <date> 2010-07-07, 2010-11-30, 2010-11-30, 2010…
## $ host_response_time <ord> within a few hours, within an hour, withi…
## $ host_response_rate <dbl> 86, 100, 100, 100, 100, 84, 84, 93, 93, 1…
## $ host_is_superhost <lgl> TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, F…
## $ host_listings_count <dbl> 2, 12, 12, 12, 1, 8, 8, 18, 18, 3, 15, 2,…
## $ host_has_profile_pic <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
## $ host_identity_verified <lgl> TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, F…
## $ neighbourhood_cleansed <chr> "Central & Western", "Central & Western",…
## $ latitude <dbl> 22.3, 22.3, 22.3, 22.3, 22.3, 22.3, 22.5,…
## $ longitude <dbl> 114, 114, 114, 114, 114, 114, 114, 114, 1…
## $ is_location_exact <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
## $ property_type <chr> "Apartment", "Apartment", "Apartment", "A…
## $ room_type <chr> "Entire home/apt", "Entire home/apt", "En…
## $ accommodates <dbl> 3, 6, 6, 6, 2, 4, 6, 3, 3, 5, 2, 3, 4, 3,…
## $ bathrooms <dbl> 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1,…
## $ bedrooms <dbl> 1, 2, 2, 2, 1, 1, 3, 1, 1, 1, 1, 2, 1, 2,…
## $ beds <dbl> 2, 3, 3, 3, 1, 1, 3, 1, 2, 3, 1, 2, 2, 2,…
## $ bed_type <chr> "Real Bed", "Real Bed", "Real Bed", "Real…
## $ amenities <chr> "{\"Cable TV\",Internet,Wifi,\"Air condit…
## $ price <dbl> 1395, 845, 690, 767, 643, 891, 930, 992, …
## $ security_deposit <dbl> 2325, 775, 775, 775, 0, 1008, 930, 1163, …
## $ cleaning_fee <dbl> 310, 271, 302, 302, 0, 310, 388, 310, 310…
## $ guests_included <dbl> 2, 2, 2, 2, 1, 2, 2, 1, 1, 2, 1, 1, 2, 2,…
## $ extra_people <dbl> 155, 194, 194, 194, 0, 155, 155, 116, 116…
## $ minimum_nights <dbl> 3, 2, 2, 2, 1, 2, 4, 1, 1, 1, 1, 3, 2, 4,…
## $ maximum_nights <dbl> 365, 365, 365, 365, 60, 365, 365, 14, 14,…
## $ has_availability <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
## $ number_of_reviews <dbl> 134, 271, 222, 225, 163, 43, 160, 15, 18,…
## $ number_of_reviews_ltm <dbl> 4, 13, 16, 11, 12, 2, 49, 0, 0, 6, 2, 1, …
## $ last_review <date> 2020-03-24, 2020-05-22, 2020-04-11, 2020…
## $ review_scores_rating <dbl> 97, 89, 86, 86, 98, 93, 97, 86, 99, 91, 7…
## $ review_scores_accuracy <dbl> 10, 9, 9, 9, 10, 9, 9, 8, 10, 10, 8, 8, 1…
## $ review_scores_cleanliness <dbl> 9, 9, 9, 9, 10, 9, 7, 8, 10, 10, 7, 8, 9,…
## $ review_scores_checkin <dbl> 10, 10, 9, 10, 10, 9, 9, 9, 10, 10, 9, 9,…
## $ review_scores_communication <dbl> 10, 10, 9, 10, 10, 10, 9, 9, 10, 10, 8, 8…
## $ review_scores_location <dbl> 10, 10, 10, 10, 9, 9, 9, 9, 10, 10, 9, 9,…
## $ review_scores_value <dbl> 9, 9, 9, 9, 10, 9, 8, 8, 9, 9, 8, 9, 9, 9…
## $ instant_bookable <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, …
## $ cancellation_policy <chr> "strict", "strict", "strict", "strict", "…
## $ reviews_per_month <dbl> 1.18, 2.47, 2.07, 2.09, 1.55, 0.41, 1.61,…
## $ has_summary <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
## $ prop_type_simplified <chr> "Apartment", "Apartment", "Apartment", "A…
## $ host_since_calculated <dbl> 10.29, 9.89, 9.89, 9.89, 9.01, 8.90, 8.90…
## $ total_amenities <dbl> 33, 11, 18, 17, 27, 22, 36, 27, 28, 22, 2…
## $ neighbourhood_simplified <chr> "hong_kong", "hong_kong", "hong_kong", "h…
## $ is_it_big_sum <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ is_it_big_spa <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, …
## $ is_it_big_desc <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, …
## $ is_it_big <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, …
## $ extra_charge_pp <dbl> 0, 0, 0, 0, 0, 0, 0, 116, 116, 0, 0, 0, 0…
## $ price_4_nights <dbl> 5890, 3651, 3062, 3370, 2572, 3874, 4108,…
model3 <- lm(log(price_4_nights) ~ bathrooms + bedrooms + beds + total_amenities +
accommodates + prop_type_simplified + number_of_reviews_ltm + review_scores_rating + room_type, data= main_data4)
model3 %>%
broom::tidy(conf.int=TRUE)%>%
kable()%>%
kable_styling()
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 6.989 | 0.102 | 68.392 | 0.000 | 6.789 | 7.190 |
| bathrooms | -0.027 | 0.029 | -0.935 | 0.350 | -0.085 | 0.030 |
| bedrooms | 0.090 | 0.019 | 4.672 | 0.000 | 0.052 | 0.128 |
| beds | -0.016 | 0.014 | -1.155 | 0.248 | -0.043 | 0.011 |
| total_amenities | 0.013 | 0.001 | 9.331 | 0.000 | 0.011 | 0.016 |
| accommodates | 0.092 | 0.011 | 8.447 | 0.000 | 0.071 | 0.114 |
| prop_type_simplifiedCondominium | -0.157 | 0.038 | -4.137 | 0.000 | -0.231 | -0.082 |
| prop_type_simplifiedGuesthouse | -0.125 | 0.056 | -2.230 | 0.026 | -0.235 | -0.015 |
| prop_type_simplifiedHostel | -0.152 | 0.051 | -2.994 | 0.003 | -0.251 | -0.052 |
| prop_type_simplifiedHotel | -0.105 | 0.063 | -1.669 | 0.095 | -0.228 | 0.018 |
| prop_type_simplifiedHouse | -0.150 | 0.057 | -2.633 | 0.009 | -0.261 | -0.038 |
| prop_type_simplifiedOther | 0.019 | 0.035 | 0.555 | 0.579 | -0.049 | 0.088 |
| prop_type_simplifiedServiced apartment | -0.081 | 0.051 | -1.609 | 0.108 | -0.181 | 0.018 |
| number_of_reviews_ltm | -0.006 | 0.001 | -5.571 | 0.000 | -0.008 | -0.004 |
| review_scores_rating | 0.005 | 0.001 | 4.805 | 0.000 | 0.003 | 0.007 |
| room_typeHotel room | -0.039 | 0.062 | -0.633 | 0.527 | -0.162 | 0.083 |
| room_typePrivate room | -0.374 | 0.026 | -14.593 | 0.000 | -0.424 | -0.324 |
| room_typeShared room | -0.414 | 0.079 | -5.205 | 0.000 | -0.570 | -0.258 |
model3 %>%
broom::glance()%>%
kable()%>%
kable_styling()
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.3 | 0.295 | 0.529 | 61.1 | 0 | 17 | -1901 | 3841 | 3951 | 678 | 2426 | 2444 |
msummary(model3)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.98944 0.10220 68.39 < 2e-16 ***
## bathrooms -0.02741 0.02932 -0.93 0.3500
## bedrooms 0.09010 0.01928 4.67 3.1e-06 ***
## beds -0.01612 0.01395 -1.16 0.2480
## total_amenities 0.01342 0.00144 9.33 < 2e-16 ***
## accommodates 0.09250 0.01095 8.45 < 2e-16 ***
## prop_type_simplifiedCondominium -0.15676 0.03790 -4.14 3.6e-05 ***
## prop_type_simplifiedGuesthouse -0.12527 0.05619 -2.23 0.0259 *
## prop_type_simplifiedHostel -0.15182 0.05071 -2.99 0.0028 **
## prop_type_simplifiedHotel -0.10472 0.06274 -1.67 0.0952 .
## prop_type_simplifiedHouse -0.14952 0.05680 -2.63 0.0085 **
## prop_type_simplifiedOther 0.01942 0.03503 0.55 0.5793
## prop_type_simplifiedServiced apartment -0.08138 0.05057 -1.61 0.1077
## number_of_reviews_ltm -0.00569 0.00102 -5.57 2.8e-08 ***
## review_scores_rating 0.00507 0.00105 4.80 1.6e-06 ***
## room_typeHotel room -0.03947 0.06234 -0.63 0.5267
## room_typePrivate room -0.37391 0.02562 -14.59 < 2e-16 ***
## room_typeShared room -0.41376 0.07949 -5.20 2.1e-07 ***
##
## Residual standard error: 0.529 on 2426 degrees of freedom
## (785 observations deleted due to missingness)
## Multiple R-squared: 0.3, Adjusted R-squared: 0.295
## F-statistic: 61.1 on 17 and 2426 DF, p-value: <2e-16
car::vif(model3)
## GVIF Df GVIF^(1/(2*Df))
## bathrooms 1.93 1 1.39
## bedrooms 1.74 1 1.32
## beds 3.62 1 1.90
## total_amenities 1.16 1 1.08
## accommodates 3.64 1 1.91
## prop_type_simplified 1.69 7 1.04
## number_of_reviews_ltm 1.11 1 1.05
## review_scores_rating 1.12 1 1.06
## room_type 3.43 3 1.23
autoplot(model3)

After removing number_of_reviews, we find almost no change in adjusted R-squared, and collinearity and significance are also unchanged. We can therefore drop it from the model and use number_of_reviews_ltm, the number of reviews in the last twelve months, instead.
Looking at the variables we have just added, we see that although their VIF values are all below 5, some of them are not statistically significant. We therefore keep the ones with the lowest p-values (equivalently, the largest absolute t statistics): we drop beds (p = 0.25) and bathrooms (p = 0.35) and keep bedrooms, total_amenities and accommodates. This makes sense: the number of amenities and the number of people a listing can host each affect the price, whereas the numbers of beds and bathrooms are usually correlated with the number of bedrooms and add little once bedrooms is in the model.
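Dropping two variables at once can also be checked with a single nested-model F-test rather than by reading individual p-values. The sketch below illustrates the idea on simulated data; the variable names mirror the report, but all values are synthetic and the data frame sim is hypothetical. One caveat that applies to the real data as well: Model 3 and Model 4 are fitted on slightly different numbers of rows because of missing values, so both models should first be refitted on the same complete cases before comparing them.

```r
set.seed(42)
n <- 500

# Simulated stand-in data (not the Airbnb data); names mirror the report.
sim <- data.frame(
  bedrooms        = rpois(n, 1) + 1,
  total_amenities = rpois(n, 20)
)
sim$beds      <- sim$bedrooms + rbinom(n, 1, 0.3)
sim$bathrooms <- 1 + rbinom(n, 1, 0.2)
sim$log_price <- 7 + 0.09 * sim$bedrooms + 0.013 * sim$total_amenities +
  rnorm(n, sd = 0.5)

# Introduce a few missing values, as in the real data.
sim$beds[sample(n, 10)] <- NA

# Keep only rows that are complete for every variable used by either model,
# so that both fits use exactly the same observations.
vars <- c("log_price", "bedrooms", "total_amenities", "beds", "bathrooms")
cc <- sim[complete.cases(sim[, vars]), ]

full    <- lm(log_price ~ bedrooms + total_amenities + beds + bathrooms, data = cc)
reduced <- lm(log_price ~ bedrooms + total_amenities, data = cc)

# Partial F-test: do beds and bathrooms jointly improve the fit?
cmp <- anova(reduced, full)
print(cmp)
```

A large p-value in the second row of the table would support removing both variables together, which is the same conclusion the report reaches from the individual t statistics.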
Model 4: Taking out the less significant variables
After removing beds and bathrooms, we observe that R-squared is unchanged (0.300, with an adjusted R-squared of 0.295).
model4 <- lm(log(price_4_nights) ~ bedrooms + accommodates + total_amenities + prop_type_simplified + number_of_reviews_ltm + review_scores_rating + room_type, data= main_data4)
model4 %>%
broom::tidy(conf.int=TRUE)%>%
kable()%>%
kable_styling()
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 6.971 | 0.099 | 70.072 | 0.000 | 6.776 | 7.166 |
| bedrooms | 0.082 | 0.018 | 4.585 | 0.000 | 0.047 | 0.117 |
| accommodates | 0.082 | 0.008 | 10.619 | 0.000 | 0.067 | 0.097 |
| total_amenities | 0.013 | 0.001 | 9.374 | 0.000 | 0.011 | 0.016 |
| prop_type_simplifiedCondominium | -0.162 | 0.038 | -4.283 | 0.000 | -0.236 | -0.088 |
| prop_type_simplifiedGuesthouse | -0.126 | 0.056 | -2.236 | 0.025 | -0.236 | -0.015 |
| prop_type_simplifiedHostel | -0.152 | 0.051 | -3.008 | 0.003 | -0.252 | -0.053 |
| prop_type_simplifiedHotel | -0.109 | 0.063 | -1.735 | 0.083 | -0.231 | 0.014 |
| prop_type_simplifiedHouse | -0.144 | 0.057 | -2.546 | 0.011 | -0.255 | -0.033 |
| prop_type_simplifiedOther | 0.019 | 0.035 | 0.551 | 0.582 | -0.049 | 0.088 |
| prop_type_simplifiedServiced apartment | -0.078 | 0.050 | -1.553 | 0.121 | -0.177 | 0.021 |
| number_of_reviews_ltm | -0.006 | 0.001 | -5.542 | 0.000 | -0.008 | -0.004 |
| review_scores_rating | 0.005 | 0.001 | 4.862 | 0.000 | 0.003 | 0.007 |
| room_typeHotel room | -0.041 | 0.062 | -0.660 | 0.509 | -0.163 | 0.081 |
| room_typePrivate room | -0.379 | 0.025 | -14.944 | 0.000 | -0.429 | -0.330 |
| room_typeShared room | -0.479 | 0.065 | -7.418 | 0.000 | -0.605 | -0.352 |
model4 %>%
broom::glance()%>%
kable()%>%
kable_styling()
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.3 | 0.295 | 0.529 | 69.4 | 0 | 15 | -1904 | 3841 | 3940 | 679 | 2431 | 2447 |
msummary(model4)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.97067 0.09948 70.07 < 2e-16 ***
## bedrooms 0.08178 0.01784 4.58 4.8e-06 ***
## accommodates 0.08213 0.00773 10.62 < 2e-16 ***
## total_amenities 0.01344 0.00143 9.37 < 2e-16 ***
## prop_type_simplifiedCondominium -0.16158 0.03772 -4.28 1.9e-05 ***
## prop_type_simplifiedGuesthouse -0.12554 0.05616 -2.24 0.0255 *
## prop_type_simplifiedHostel -0.15243 0.05068 -3.01 0.0027 **
## prop_type_simplifiedHotel -0.10866 0.06262 -1.74 0.0828 .
## prop_type_simplifiedHouse -0.14419 0.05664 -2.55 0.0110 *
## prop_type_simplifiedOther 0.01928 0.03499 0.55 0.5816
## prop_type_simplifiedServiced apartment -0.07831 0.05043 -1.55 0.1206
## number_of_reviews_ltm -0.00565 0.00102 -5.54 3.3e-08 ***
## review_scores_rating 0.00510 0.00105 4.86 1.2e-06 ***
## room_typeHotel room -0.04108 0.06227 -0.66 0.5094
## room_typePrivate room -0.37931 0.02538 -14.94 < 2e-16 ***
## room_typeShared room -0.47854 0.06451 -7.42 1.6e-13 ***
##
## Residual standard error: 0.529 on 2431 degrees of freedom
## (782 observations deleted due to missingness)
## Multiple R-squared: 0.3, Adjusted R-squared: 0.295
## F-statistic: 69.4 on 15 and 2431 DF, p-value: <2e-16
car::vif(model4)
## GVIF Df GVIF^(1/(2*Df))
## bedrooms 1.49 1 1.22
## accommodates 1.82 1 1.35
## total_amenities 1.16 1 1.08
## prop_type_simplified 1.66 7 1.04
## number_of_reviews_ltm 1.11 1 1.05
## review_scores_rating 1.11 1 1.06
## room_type 2.25 3 1.14
autoplot(model4)

Model 5: Adding Host
Next, we explore the effect of host-related variables on price. We add host_is_superhost, host_response_time, host_response_rate, host_has_profile_pic, and host_since_calculated to the model.
#glimpse(main_data)
model5 <- lm(log(price_4_nights) ~ host_response_time + host_response_rate + host_is_superhost + host_has_profile_pic + host_since_calculated + bedrooms + accommodates + total_amenities + prop_type_simplified + number_of_reviews_ltm + review_scores_rating + room_type, data= main_data4)
model5 %>%
broom::tidy(conf.int=TRUE)%>%
kable()%>%
kable_styling()
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 7.022 | 0.214 | 32.889 | 0.000 | 6.604 | 7.441 |
| host_response_time.L | 0.118 | 0.061 | 1.938 | 0.053 | -0.001 | 0.237 |
| host_response_time.Q | -0.028 | 0.038 | -0.730 | 0.466 | -0.103 | 0.047 |
| host_response_time.C | -0.004 | 0.029 | -0.127 | 0.899 | -0.061 | 0.053 |
| host_response_rate | 0.002 | 0.001 | 1.955 | 0.051 | 0.000 | 0.004 |
| host_is_superhostTRUE | 0.147 | 0.028 | 5.269 | 0.000 | 0.092 | 0.202 |
| host_has_profile_picTRUE | -0.135 | 0.178 | -0.759 | 0.448 | -0.483 | 0.213 |
| host_since_calculated | 0.021 | 0.006 | 3.879 | 0.000 | 0.011 | 0.032 |
| bedrooms | 0.082 | 0.018 | 4.601 | 0.000 | 0.047 | 0.116 |
| accommodates | 0.085 | 0.008 | 11.049 | 0.000 | 0.070 | 0.100 |
| total_amenities | 0.012 | 0.001 | 8.125 | 0.000 | 0.009 | 0.015 |
| prop_type_simplifiedCondominium | -0.149 | 0.037 | -3.976 | 0.000 | -0.222 | -0.075 |
| prop_type_simplifiedGuesthouse | -0.099 | 0.056 | -1.767 | 0.077 | -0.208 | 0.011 |
| prop_type_simplifiedHostel | -0.134 | 0.050 | -2.675 | 0.008 | -0.233 | -0.036 |
| prop_type_simplifiedHotel | -0.072 | 0.064 | -1.137 | 0.255 | -0.197 | 0.052 |
| prop_type_simplifiedHouse | -0.168 | 0.057 | -2.976 | 0.003 | -0.279 | -0.057 |
| prop_type_simplifiedOther | 0.026 | 0.035 | 0.758 | 0.448 | -0.042 | 0.095 |
| prop_type_simplifiedServiced apartment | -0.023 | 0.051 | -0.464 | 0.643 | -0.123 | 0.076 |
| number_of_reviews_ltm | -0.006 | 0.001 | -5.642 | 0.000 | -0.008 | -0.004 |
| review_scores_rating | 0.003 | 0.001 | 3.272 | 0.001 | 0.001 | 0.006 |
| room_typeHotel room | -0.051 | 0.062 | -0.823 | 0.411 | -0.173 | 0.071 |
| room_typePrivate room | -0.385 | 0.026 | -15.076 | 0.000 | -0.435 | -0.335 |
| room_typeShared room | -0.496 | 0.064 | -7.734 | 0.000 | -0.622 | -0.371 |
model5 %>%
broom::glance()%>%
kable()%>%
kable_styling()
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.318 | 0.312 | 0.522 | 51.5 | 0 | 22 | -1871 | 3790 | 3929 | 661 | 2424 | 2447 |
msummary(model5)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.022437 0.213518 32.89 < 2e-16 ***
## host_response_time.L 0.117697 0.060747 1.94 0.05280 .
## host_response_time.Q -0.028067 0.038461 -0.73 0.46561
## host_response_time.C -0.003680 0.028997 -0.13 0.89903
## host_response_rate 0.001825 0.000933 1.95 0.05071 .
## host_is_superhostTRUE 0.147240 0.027943 5.27 1.5e-07 ***
## host_has_profile_picTRUE -0.134868 0.177586 -0.76 0.44766
## host_since_calculated 0.021479 0.005537 3.88 0.00011 ***
## bedrooms 0.081653 0.017747 4.60 4.4e-06 ***
## accommodates 0.085023 0.007695 11.05 < 2e-16 ***
## total_amenities 0.011892 0.001464 8.13 7.0e-16 ***
## prop_type_simplifiedCondominium -0.148797 0.037427 -3.98 7.2e-05 ***
## prop_type_simplifiedGuesthouse -0.098551 0.055774 -1.77 0.07736 .
## prop_type_simplifiedHostel -0.134343 0.050217 -2.68 0.00752 **
## prop_type_simplifiedHotel -0.072230 0.063500 -1.14 0.25545
## prop_type_simplifiedHouse -0.168189 0.056513 -2.98 0.00295 **
## prop_type_simplifiedOther 0.026365 0.034765 0.76 0.44830
## prop_type_simplifiedServiced apartment -0.023470 0.050584 -0.46 0.64271
## number_of_reviews_ltm -0.005864 0.001039 -5.64 1.9e-08 ***
## review_scores_rating 0.003489 0.001066 3.27 0.00108 **
## room_typeHotel room -0.051063 0.062075 -0.82 0.41081
## room_typePrivate room -0.384867 0.025528 -15.08 < 2e-16 ***
## room_typeShared room -0.496397 0.064180 -7.73 1.5e-14 ***
##
## Residual standard error: 0.522 on 2424 degrees of freedom
## (782 observations deleted due to missingness)
## Multiple R-squared: 0.318, Adjusted R-squared: 0.312
## F-statistic: 51.5 on 22 and 2424 DF, p-value: <2e-16
car::vif(model5)
## GVIF Df GVIF^(1/(2*Df))
## host_response_time 8.57 3 1.43
## host_response_rate 7.73 1 2.78
## host_is_superhost 1.21 1 1.10
## host_has_profile_pic 1.04 1 1.02
## host_since_calculated 1.11 1 1.05
## bedrooms 1.51 1 1.23
## accommodates 1.85 1 1.36
## total_amenities 1.24 1 1.11
## prop_type_simplified 1.88 7 1.05
## number_of_reviews_ltm 1.18 1 1.09
## review_scores_rating 1.18 1 1.09
## room_type 2.43 3 1.16
autoplot(model5)

We observe that host_response_rate and host_response_time both have high GVIF values (7.73 and 8.57), which suggests that the two variables overlap in the information they carry. Their individual terms are also only borderline significant (p ≈ 0.05 for host_response_rate and for the linear term of host_response_time). We keep host_response_time, which we find gives the more meaningful results when explaining the data, and drop host_response_rate. The variable we created earlier for the length of time the host has been registered on Airbnb, host_since_calculated, has a significant effect on price, so we keep it. By contrast, the host having a profile picture makes no statistically significant difference (p = 0.45), so we remove host_has_profile_pic. Lastly, host_is_superhost has a very low p-value, meaning that a listing posted by a Superhost is associated with a higher price.
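The two columns of the car::vif output can be related with a short calculation. For a term with Df degrees of freedom, the last column is GVIF^(1/(2*Df)), which makes terms with different numbers of coefficients comparable; squaring it puts it back on the usual VIF scale. The sketch below reproduces that column for the two host response variables, using the rounded GVIF values printed above.

```r
# GVIF and Df values copied from the car::vif(model5) output above.
gvif <- c(host_response_time = 8.57, host_response_rate = 7.73)
df   <- c(host_response_time = 3,    host_response_rate = 1)

# Adjusted value reported in the last column of car::vif().
adj <- gvif^(1 / (2 * df))
print(round(adj, 2))

# Squaring the adjusted value gives a number comparable to an ordinary VIF.
print(round(adj^2, 2))
```

On this comparable scale host_response_rate stays well above the cut-off of 5 used in this report, while host_response_time falls to roughly 2, which is one way to motivate dropping host_response_rate rather than host_response_time.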
Model 6: Adjusting Host Variables
After taking out the unnecessary variables, we observe that our adjusted R squared has not changed.
#glimpse(main_data)
model6 <- lm(log(price_4_nights) ~ host_response_time + host_is_superhost + host_since_calculated + bedrooms + accommodates + total_amenities + prop_type_simplified + number_of_reviews_ltm + review_scores_rating + room_type, data= main_data4)
model6 %>%
broom::tidy(conf.int=TRUE)%>%
kable()%>%
kable_styling()
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 6.998 | 0.104 | 67.127 | 0.000 | 6.793 | 7.202 |
| host_response_time.L | 0.011 | 0.024 | 0.439 | 0.661 | -0.037 | 0.058 |
| host_response_time.Q | -0.083 | 0.026 | -3.234 | 0.001 | -0.133 | -0.033 |
| host_response_time.C | -0.022 | 0.027 | -0.796 | 0.426 | -0.075 | 0.032 |
| host_is_superhostTRUE | 0.150 | 0.028 | 5.381 | 0.000 | 0.095 | 0.205 |
| host_since_calculated | 0.023 | 0.006 | 4.104 | 0.000 | 0.012 | 0.033 |
| bedrooms | 0.082 | 0.018 | 4.614 | 0.000 | 0.047 | 0.117 |
| accommodates | 0.085 | 0.008 | 11.033 | 0.000 | 0.070 | 0.100 |
| total_amenities | 0.012 | 0.001 | 8.161 | 0.000 | 0.009 | 0.015 |
| prop_type_simplifiedCondominium | -0.151 | 0.037 | -4.038 | 0.000 | -0.224 | -0.078 |
| prop_type_simplifiedGuesthouse | -0.097 | 0.056 | -1.741 | 0.082 | -0.206 | 0.012 |
| prop_type_simplifiedHostel | -0.137 | 0.050 | -2.737 | 0.006 | -0.236 | -0.039 |
| prop_type_simplifiedHotel | -0.059 | 0.063 | -0.933 | 0.351 | -0.183 | 0.065 |
| prop_type_simplifiedHouse | -0.170 | 0.057 | -3.000 | 0.003 | -0.280 | -0.059 |
| prop_type_simplifiedOther | 0.024 | 0.035 | 0.683 | 0.495 | -0.044 | 0.092 |
| prop_type_simplifiedServiced apartment | -0.031 | 0.050 | -0.618 | 0.537 | -0.130 | 0.068 |
| number_of_reviews_ltm | -0.006 | 0.001 | -5.561 | 0.000 | -0.008 | -0.004 |
| review_scores_rating | 0.004 | 0.001 | 3.394 | 0.001 | 0.002 | 0.006 |
| room_typeHotel room | -0.048 | 0.062 | -0.774 | 0.439 | -0.170 | 0.074 |
| room_typePrivate room | -0.383 | 0.026 | -15.025 | 0.000 | -0.434 | -0.333 |
| room_typeShared room | -0.496 | 0.064 | -7.730 | 0.000 | -0.622 | -0.370 |
model6 %>%
broom::glance()%>%
kable()%>%
kable_styling()
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.317 | 0.312 | 0.522 | 56.4 | 0 | 20 | -1873 | 3790 | 3917 | 662 | 2426 | 2447 |
msummary(model6)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.99781 0.10425 67.13 < 2e-16 ***
## host_response_time.L 0.01064 0.02423 0.44 0.6606
## host_response_time.Q -0.08263 0.02555 -3.23 0.0012 **
## host_response_time.C -0.02177 0.02736 -0.80 0.4263
## host_is_superhostTRUE 0.15012 0.02790 5.38 8.1e-08 ***
## host_since_calculated 0.02262 0.00551 4.10 4.2e-05 ***
## bedrooms 0.08190 0.01775 4.61 4.2e-06 ***
## accommodates 0.08493 0.00770 11.03 < 2e-16 ***
## total_amenities 0.01194 0.00146 8.16 5.3e-16 ***
## prop_type_simplifiedCondominium -0.15111 0.03742 -4.04 5.6e-05 ***
## prop_type_simplifiedGuesthouse -0.09711 0.05577 -1.74 0.0818 .
## prop_type_simplifiedHostel -0.13739 0.05021 -2.74 0.0063 **
## prop_type_simplifiedHotel -0.05891 0.06314 -0.93 0.3509
## prop_type_simplifiedHouse -0.16960 0.05653 -3.00 0.0027 **
## prop_type_simplifiedOther 0.02373 0.03475 0.68 0.4947
## prop_type_simplifiedServiced apartment -0.03118 0.05047 -0.62 0.5368
## number_of_reviews_ltm -0.00578 0.00104 -5.56 3.0e-08 ***
## review_scores_rating 0.00361 0.00106 3.39 0.0007 ***
## room_typeHotel room -0.04804 0.06209 -0.77 0.4391
## room_typePrivate room -0.38349 0.02552 -15.03 < 2e-16 ***
## room_typeShared room -0.49629 0.06420 -7.73 1.6e-14 ***
##
## Residual standard error: 0.522 on 2426 degrees of freedom
## (782 observations deleted due to missingness)
## Multiple R-squared: 0.317, Adjusted R-squared: 0.312
## F-statistic: 56.4 on 20 and 2426 DF, p-value: <2e-16
car::vif(model6)
## GVIF Df GVIF^(1/(2*Df))
## host_response_time 1.26 3 1.04
## host_is_superhost 1.21 1 1.10
## host_since_calculated 1.10 1 1.05
## bedrooms 1.51 1 1.23
## accommodates 1.84 1 1.36
## total_amenities 1.24 1 1.11
## prop_type_simplified 1.84 7 1.04
## number_of_reviews_ltm 1.18 1 1.09
## review_scores_rating 1.17 1 1.08
## room_type 2.42 3 1.16
autoplot(model6)

We also check for collinearity at this point, and we observe that all the VIF values are smaller than 5, which means we can keep on building our model.
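The terms host_response_time.L, .Q and .C in the Model 5 and Model 6 tables appear because host_response_time is stored as an ordered factor, for which R uses orthogonal polynomial contrasts by default. The sketch below shows the contrast matrix for a four-level ordered factor. Only "within an hour" and "within a few hours" are visible in the glimpse output; the other two level names, and the ordering, are assumptions made for illustration.

```r
# Level names: the first two appear in the data; the last two are assumed.
levels_rt <- c("within an hour", "within a few hours",
               "within a day", "a few days or more")

# Polynomial contrasts for 4 ordered levels: linear (.L), quadratic (.Q)
# and cubic (.C) trends across the levels.
cm <- contr.poly(length(levels_rt))
rownames(cm) <- levels_rt
print(round(cm, 3))

# The columns are orthonormal, so their cross-product is the identity matrix.
print(round(crossprod(cm), 3))
```

Read this way, the significant .Q term in Model 6 suggests that log price does not change in a straight line across the ordered response-time categories, although the individual trend coefficients are harder to interpret than ordinary dummy variables.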
Model 7: Adding Neighbourhood and Location Factors
We believe that location is an important factor in Hong Kong listings. We therefore add three related variables: neighbourhood_simplified, which shows which part of the city the listing is in; is_location_exact, which shows whether the location is displayed exactly on the map; and latitude. We added latitude because we learned from interviewing friends from Hong Kong that prices in the city increase closer to the coast.
glimpse(main_data4)
## Rows: 3,229
## Columns: 56
## $ id <dbl> 69074, 103760, 163664, 163742, 248140, 27…
## $ listing_url <chr> "https://www.airbnb.com/rooms/69074", "ht…
## $ summary <chr> "An ideal Hong location any visitor--hip …
## $ space <chr> "Filled with plants and art, this one bed…
## $ description <chr> "An ideal Hong location any visitor--hip …
## $ host_since <date> 2010-07-07, 2010-11-30, 2010-11-30, 2010…
## $ host_response_time <ord> within a few hours, within an hour, withi…
## $ host_response_rate <dbl> 86, 100, 100, 100, 100, 84, 84, 93, 93, 1…
## $ host_is_superhost <lgl> TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, F…
## $ host_listings_count <dbl> 2, 12, 12, 12, 1, 8, 8, 18, 18, 3, 15, 2,…
## $ host_has_profile_pic <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
## $ host_identity_verified <lgl> TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, F…
## $ neighbourhood_cleansed <chr> "Central & Western", "Central & Western",…
## $ latitude <dbl> 22.3, 22.3, 22.3, 22.3, 22.3, 22.3, 22.5,…
## $ longitude <dbl> 114, 114, 114, 114, 114, 114, 114, 114, 1…
## $ is_location_exact <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
## $ property_type <chr> "Apartment", "Apartment", "Apartment", "A…
## $ room_type <chr> "Entire home/apt", "Entire home/apt", "En…
## $ accommodates <dbl> 3, 6, 6, 6, 2, 4, 6, 3, 3, 5, 2, 3, 4, 3,…
## $ bathrooms <dbl> 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1,…
## $ bedrooms <dbl> 1, 2, 2, 2, 1, 1, 3, 1, 1, 1, 1, 2, 1, 2,…
## $ beds <dbl> 2, 3, 3, 3, 1, 1, 3, 1, 2, 3, 1, 2, 2, 2,…
## $ bed_type <chr> "Real Bed", "Real Bed", "Real Bed", "Real…
## $ amenities <chr> "{\"Cable TV\",Internet,Wifi,\"Air condit…
## $ price <dbl> 1395, 845, 690, 767, 643, 891, 930, 992, …
## $ security_deposit <dbl> 2325, 775, 775, 775, 0, 1008, 930, 1163, …
## $ cleaning_fee <dbl> 310, 271, 302, 302, 0, 310, 388, 310, 310…
## $ guests_included <dbl> 2, 2, 2, 2, 1, 2, 2, 1, 1, 2, 1, 1, 2, 2,…
## $ extra_people <dbl> 155, 194, 194, 194, 0, 155, 155, 116, 116…
## $ minimum_nights <dbl> 3, 2, 2, 2, 1, 2, 4, 1, 1, 1, 1, 3, 2, 4,…
## $ maximum_nights <dbl> 365, 365, 365, 365, 60, 365, 365, 14, 14,…
## $ has_availability <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
## $ number_of_reviews <dbl> 134, 271, 222, 225, 163, 43, 160, 15, 18,…
## $ number_of_reviews_ltm <dbl> 4, 13, 16, 11, 12, 2, 49, 0, 0, 6, 2, 1, …
## $ last_review <date> 2020-03-24, 2020-05-22, 2020-04-11, 2020…
## $ review_scores_rating <dbl> 97, 89, 86, 86, 98, 93, 97, 86, 99, 91, 7…
## $ review_scores_accuracy <dbl> 10, 9, 9, 9, 10, 9, 9, 8, 10, 10, 8, 8, 1…
## $ review_scores_cleanliness <dbl> 9, 9, 9, 9, 10, 9, 7, 8, 10, 10, 7, 8, 9,…
## $ review_scores_checkin <dbl> 10, 10, 9, 10, 10, 9, 9, 9, 10, 10, 9, 9,…
## $ review_scores_communication <dbl> 10, 10, 9, 10, 10, 10, 9, 9, 10, 10, 8, 8…
## $ review_scores_location <dbl> 10, 10, 10, 10, 9, 9, 9, 9, 10, 10, 9, 9,…
## $ review_scores_value <dbl> 9, 9, 9, 9, 10, 9, 8, 8, 9, 9, 8, 9, 9, 9…
## $ instant_bookable <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, …
## $ cancellation_policy <chr> "strict", "strict", "strict", "strict", "…
## $ reviews_per_month <dbl> 1.18, 2.47, 2.07, 2.09, 1.55, 0.41, 1.61,…
## $ has_summary <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
## $ prop_type_simplified <chr> "Apartment", "Apartment", "Apartment", "A…
## $ host_since_calculated <dbl> 10.29, 9.89, 9.89, 9.89, 9.01, 8.90, 8.90…
## $ total_amenities <dbl> 33, 11, 18, 17, 27, 22, 36, 27, 28, 22, 2…
## $ neighbourhood_simplified <chr> "hong_kong", "hong_kong", "hong_kong", "h…
## $ is_it_big_sum <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ is_it_big_spa <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, …
## $ is_it_big_desc <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, …
## $ is_it_big <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, …
## $ extra_charge_pp <dbl> 0, 0, 0, 0, 0, 0, 0, 116, 116, 0, 0, 0, 0…
## $ price_4_nights <dbl> 5890, 3651, 3062, 3370, 2572, 3874, 4108,…
model7 <- lm(log(price_4_nights) ~ neighbourhood_simplified + is_location_exact + latitude + host_response_time + host_is_superhost + host_since_calculated + bedrooms + accommodates + total_amenities + prop_type_simplified + number_of_reviews_ltm + review_scores_rating + room_type, data= main_data4)
model7 %>%
broom::tidy(conf.int=TRUE)%>%
kable()%>%
kable_styling()
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 53.200 | 5.203 | 10.224 | 0.000 | 42.997 | 63.404 |
| neighbourhood_simplifiedkowloon | -0.196 | 0.029 | -6.771 | 0.000 | -0.253 | -0.139 |
| neighbourhood_simplifiednew_territories | -0.155 | 0.043 | -3.600 | 0.000 | -0.239 | -0.070 |
| is_location_exactTRUE | 0.015 | 0.023 | 0.670 | 0.503 | -0.030 | 0.061 |
| latitude | -2.065 | 0.233 | -8.845 | 0.000 | -2.522 | -1.607 |
| host_response_time.L | 0.015 | 0.023 | 0.642 | 0.521 | -0.031 | 0.061 |
| host_response_time.Q | -0.024 | 0.025 | -0.955 | 0.339 | -0.073 | 0.025 |
| host_response_time.C | -0.018 | 0.026 | -0.672 | 0.502 | -0.069 | 0.034 |
| host_is_superhostTRUE | 0.176 | 0.027 | 6.530 | 0.000 | 0.123 | 0.229 |
| host_since_calculated | 0.006 | 0.005 | 1.073 | 0.283 | -0.005 | 0.017 |
| bedrooms | 0.103 | 0.017 | 5.912 | 0.000 | 0.069 | 0.137 |
| accommodates | 0.084 | 0.008 | 11.146 | 0.000 | 0.069 | 0.098 |
| total_amenities | 0.011 | 0.001 | 8.005 | 0.000 | 0.009 | 0.014 |
| prop_type_simplifiedCondominium | -0.098 | 0.036 | -2.697 | 0.007 | -0.168 | -0.027 |
| prop_type_simplifiedGuesthouse | -0.094 | 0.054 | -1.737 | 0.083 | -0.199 | 0.012 |
| prop_type_simplifiedHostel | -0.102 | 0.049 | -2.100 | 0.036 | -0.197 | -0.007 |
| prop_type_simplifiedHotel | -0.006 | 0.061 | -0.099 | 0.921 | -0.126 | 0.114 |
| prop_type_simplifiedHouse | -0.097 | 0.057 | -1.712 | 0.087 | -0.209 | 0.014 |
| prop_type_simplifiedOther | 0.069 | 0.034 | 2.047 | 0.041 | 0.003 | 0.135 |
| prop_type_simplifiedServiced apartment | 0.031 | 0.049 | 0.643 | 0.521 | -0.064 | 0.127 |
| number_of_reviews_ltm | -0.006 | 0.001 | -6.085 | 0.000 | -0.008 | -0.004 |
| review_scores_rating | 0.004 | 0.001 | 3.501 | 0.000 | 0.002 | 0.006 |
| room_typeHotel room | -0.016 | 0.061 | -0.256 | 0.798 | -0.135 | 0.103 |
| room_typePrivate room | -0.347 | 0.027 | -13.086 | 0.000 | -0.399 | -0.295 |
| room_typeShared room | -0.486 | 0.062 | -7.821 | 0.000 | -0.608 | -0.364 |
model7 %>%
broom::glance()%>%
kable()%>%
kable_styling()
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.372 | 0.366 | 0.501 | 59.8 | 0 | 24 | -1770 | 3592 | 3743 | 609 | 2422 | 2447 |
msummary(model7)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53.20017 5.20333 10.22 < 2e-16
## neighbourhood_simplifiedkowloon -0.19621 0.02898 -6.77 1.6e-11
## neighbourhood_simplifiednew_territories -0.15479 0.04300 -3.60 0.00032
## is_location_exactTRUE 0.01550 0.02315 0.67 0.50318
## latitude -2.06456 0.23341 -8.85 < 2e-16
## host_response_time.L 0.01496 0.02330 0.64 0.52094
## host_response_time.Q -0.02396 0.02508 -0.96 0.33943
## host_response_time.C -0.01771 0.02635 -0.67 0.50164
## host_is_superhostTRUE 0.17612 0.02697 6.53 8.0e-11
## host_since_calculated 0.00584 0.00544 1.07 0.28344
## bedrooms 0.10315 0.01745 5.91 3.9e-09
## accommodates 0.08370 0.00751 11.15 < 2e-16
## total_amenities 0.01139 0.00142 8.00 1.8e-15
## prop_type_simplifiedCondominium -0.09755 0.03618 -2.70 0.00705
## prop_type_simplifiedGuesthouse -0.09362 0.05391 -1.74 0.08258
## prop_type_simplifiedHostel -0.10199 0.04856 -2.10 0.03582
## prop_type_simplifiedHotel -0.00608 0.06116 -0.10 0.92076
## prop_type_simplifiedHouse -0.09749 0.05696 -1.71 0.08710
## prop_type_simplifiedOther 0.06890 0.03366 2.05 0.04077
## prop_type_simplifiedServiced apartment 0.03132 0.04874 0.64 0.52056
## number_of_reviews_ltm -0.00617 0.00101 -6.09 1.3e-09
## review_scores_rating 0.00359 0.00103 3.50 0.00047
## room_typeHotel room -0.01555 0.06067 -0.26 0.79771
## room_typePrivate room -0.34737 0.02655 -13.09 < 2e-16
## room_typeShared room -0.48586 0.06212 -7.82 7.7e-15
##
## (Intercept) ***
## neighbourhood_simplifiedkowloon ***
## neighbourhood_simplifiednew_territories ***
## is_location_exactTRUE
## latitude ***
## host_response_time.L
## host_response_time.Q
## host_response_time.C
## host_is_superhostTRUE ***
## host_since_calculated
## bedrooms ***
## accommodates ***
## total_amenities ***
## prop_type_simplifiedCondominium **
## prop_type_simplifiedGuesthouse .
## prop_type_simplifiedHostel *
## prop_type_simplifiedHotel
## prop_type_simplifiedHouse .
## prop_type_simplifiedOther *
## prop_type_simplifiedServiced apartment
## number_of_reviews_ltm ***
## review_scores_rating ***
## room_typeHotel room
## room_typePrivate room ***
## room_typeShared room ***
##
## Residual standard error: 0.501 on 2422 degrees of freedom
## (782 observations deleted due to missingness)
## Multiple R-squared: 0.372, Adjusted R-squared: 0.366
## F-statistic: 59.8 on 24 and 2422 DF, p-value: <2e-16
car::vif(model7)
## GVIF Df GVIF^(1/(2*Df))
## neighbourhood_simplified 2.61 2 1.27
## is_location_exact 1.11 1 1.05
## latitude 1.61 1 1.27
## host_response_time 1.33 3 1.05
## host_is_superhost 1.23 1 1.11
## host_since_calculated 1.16 1 1.08
## bedrooms 1.59 1 1.26
## accommodates 1.91 1 1.38
## total_amenities 1.27 1 1.13
## prop_type_simplified 2.19 7 1.06
## number_of_reviews_ltm 1.22 1 1.10
## review_scores_rating 1.18 1 1.09
## room_type 2.94 3 1.20
autoplot(model7)

The addition of these variables vastly improves our adjusted R squared value to about 37%. Thus, we can conclude that the neighbourhood of the listing is an important indicator of the price. However, whether the exact location is shown on the map does not make a statistically significant difference (p = 0.503), so we take out is_location_exact. Since its p value is also too large (p = 0.283), we take out host_since_calculated as well.
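Because the response is log(price_4_nights), each coefficient is easier to read as a percentage change in price, holding the other variables fixed. The sketch below converts three estimates copied from the Model 7 table; the short names in the vector are labels chosen here for readability, not variable names in our data.

```r
# Coefficients copied from the Model 7 table (log(price_4_nights) scale)
coefs <- c(
  kowloon      = -0.196,  # vs. the baseline zone hong_kong
  superhost    =  0.176,  # vs. hosts who are not superhosts
  private_room = -0.347   # vs. the baseline room type Entire home/apt
)

# In a log-linear model, a coefficient b implies a relative change of exp(b) - 1
pct_change <- (exp(coefs) - 1) * 100
round(pct_change, 1)
```

Read this way, a listing in Kowloon costs roughly 18% less than a comparable listing in the hong_kong zone, a superhost listing costs roughly 19% more, and a private room costs roughly 29% less than an entire home or apartment.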
Model 8: Adding Cancellation Policy and Security Deposit
Next, we would like to see the impact of adding cancellation_policy and security_deposit to the model.
model8 <- lm(log(price_4_nights) ~ cancellation_policy + security_deposit + neighbourhood_simplified + latitude + host_response_time + host_is_superhost + bedrooms + accommodates + total_amenities + prop_type_simplified + number_of_reviews_ltm + review_scores_rating + room_type, data= main_data4)
model8 %>%
broom::tidy(conf.int=TRUE)%>%
kable()%>%
kable_styling()
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 50.972 | 5.112 | 9.971 | 0.000 | 40.948 | 60.997 |
| cancellation_policymoderate | 0.059 | 0.039 | 1.519 | 0.129 | -0.017 | 0.135 |
| cancellation_policystrict | 0.105 | 0.031 | 3.388 | 0.001 | 0.044 | 0.165 |
| security_deposit | 0.000 | 0.000 | 4.381 | 0.000 | 0.000 | 0.000 |
| neighbourhood_simplifiedkowloon | -0.187 | 0.029 | -6.370 | 0.000 | -0.245 | -0.129 |
| neighbourhood_simplifiednew_territories | -0.124 | 0.042 | -2.918 | 0.004 | -0.207 | -0.041 |
| latitude | -1.968 | 0.229 | -8.581 | 0.000 | -2.418 | -1.518 |
| host_response_time.L | 0.029 | 0.023 | 1.246 | 0.213 | -0.017 | 0.074 |
| host_response_time.Q | -0.028 | 0.025 | -1.128 | 0.260 | -0.077 | 0.021 |
| host_response_time.C | -0.022 | 0.026 | -0.821 | 0.412 | -0.073 | 0.030 |
| host_is_superhostTRUE | 0.177 | 0.027 | 6.617 | 0.000 | 0.124 | 0.229 |
| bedrooms | 0.109 | 0.017 | 6.273 | 0.000 | 0.075 | 0.143 |
| accommodates | 0.081 | 0.007 | 10.862 | 0.000 | 0.067 | 0.096 |
| total_amenities | 0.011 | 0.001 | 7.518 | 0.000 | 0.008 | 0.013 |
| prop_type_simplifiedCondominium | -0.098 | 0.036 | -2.718 | 0.007 | -0.168 | -0.027 |
| prop_type_simplifiedGuesthouse | -0.097 | 0.053 | -1.821 | 0.069 | -0.201 | 0.007 |
| prop_type_simplifiedHostel | -0.094 | 0.048 | -1.932 | 0.053 | -0.189 | 0.001 |
| prop_type_simplifiedHotel | 0.018 | 0.061 | 0.302 | 0.763 | -0.101 | 0.138 |
| prop_type_simplifiedHouse | -0.106 | 0.057 | -1.863 | 0.063 | -0.217 | 0.006 |
| prop_type_simplifiedOther | 0.077 | 0.033 | 2.292 | 0.022 | 0.011 | 0.142 |
| prop_type_simplifiedServiced apartment | 0.035 | 0.048 | 0.724 | 0.469 | -0.060 | 0.130 |
| number_of_reviews_ltm | -0.006 | 0.001 | -5.981 | 0.000 | -0.008 | -0.004 |
| review_scores_rating | 0.004 | 0.001 | 3.717 | 0.000 | 0.002 | 0.006 |
| room_typeHotel room | -0.030 | 0.060 | -0.498 | 0.619 | -0.149 | 0.088 |
| room_typePrivate room | -0.345 | 0.026 | -13.013 | 0.000 | -0.397 | -0.293 |
| room_typeShared room | -0.437 | 0.062 | -7.080 | 0.000 | -0.558 | -0.316 |
model8 %>%
broom::glance()%>%
kable()%>%
kable_styling()
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.381 | 0.374 | 0.498 | 59.5 | 0 | 25 | -1754 | 3561 | 3718 | 601 | 2421 | 2447 |
msummary(model8)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.10e+01 5.11e+00 9.97 < 2e-16
## cancellation_policymoderate 5.91e-02 3.89e-02 1.52 0.12881
## cancellation_policystrict 1.05e-01 3.09e-02 3.39 0.00071
## security_deposit 3.24e-05 7.38e-06 4.38 1.2e-05
## neighbourhood_simplifiedkowloon -1.87e-01 2.94e-02 -6.37 2.3e-10
## neighbourhood_simplifiednew_territories -1.24e-01 4.25e-02 -2.92 0.00355
## latitude -1.97e+00 2.29e-01 -8.58 < 2e-16
## host_response_time.L 2.89e-02 2.32e-02 1.25 0.21294
## host_response_time.Q -2.81e-02 2.49e-02 -1.13 0.25956
## host_response_time.C -2.15e-02 2.62e-02 -0.82 0.41180
## host_is_superhostTRUE 1.77e-01 2.67e-02 6.62 4.5e-11
## bedrooms 1.09e-01 1.74e-02 6.27 4.2e-10
## accommodates 8.14e-02 7.49e-03 10.86 < 2e-16
## total_amenities 1.07e-02 1.42e-03 7.52 7.8e-14
## prop_type_simplifiedCondominium -9.77e-02 3.60e-02 -2.72 0.00661
## prop_type_simplifiedGuesthouse -9.69e-02 5.32e-02 -1.82 0.06870
## prop_type_simplifiedHostel -9.37e-02 4.85e-02 -1.93 0.05349
## prop_type_simplifiedHotel 1.84e-02 6.09e-02 0.30 0.76282
## prop_type_simplifiedHouse -1.06e-01 5.67e-02 -1.86 0.06264
## prop_type_simplifiedOther 7.67e-02 3.35e-02 2.29 0.02196
## prop_type_simplifiedServiced apartment 3.50e-02 4.83e-02 0.72 0.46931
## number_of_reviews_ltm -6.00e-03 1.00e-03 -5.98 2.5e-09
## review_scores_rating 3.79e-03 1.02e-03 3.72 0.00021
## room_typeHotel room -3.01e-02 6.04e-02 -0.50 0.61865
## room_typePrivate room -3.45e-01 2.65e-02 -13.01 < 2e-16
## room_typeShared room -4.37e-01 6.17e-02 -7.08 1.9e-12
##
## (Intercept) ***
## cancellation_policymoderate
## cancellation_policystrict ***
## security_deposit ***
## neighbourhood_simplifiedkowloon ***
## neighbourhood_simplifiednew_territories **
## latitude ***
## host_response_time.L
## host_response_time.Q
## host_response_time.C
## host_is_superhostTRUE ***
## bedrooms ***
## accommodates ***
## total_amenities ***
## prop_type_simplifiedCondominium **
## prop_type_simplifiedGuesthouse .
## prop_type_simplifiedHostel .
## prop_type_simplifiedHotel
## prop_type_simplifiedHouse .
## prop_type_simplifiedOther *
## prop_type_simplifiedServiced apartment
## number_of_reviews_ltm ***
## review_scores_rating ***
## room_typeHotel room
## room_typePrivate room ***
## room_typeShared room ***
##
## Residual standard error: 0.498 on 2421 degrees of freedom
## (782 observations deleted due to missingness)
## Multiple R-squared: 0.381, Adjusted R-squared: 0.374
## F-statistic: 59.5 on 25 and 2421 DF, p-value: <2e-16
car::vif(model8)
## GVIF Df GVIF^(1/(2*Df))
## cancellation_policy 1.30 2 1.07
## security_deposit 1.12 1 1.06
## neighbourhood_simplified 2.75 2 1.29
## latitude 1.57 1 1.25
## host_response_time 1.34 3 1.05
## host_is_superhost 1.22 1 1.10
## bedrooms 1.60 1 1.26
## accommodates 1.92 1 1.39
## total_amenities 1.29 1 1.13
## prop_type_simplified 2.19 7 1.06
## number_of_reviews_ltm 1.21 1 1.10
## review_scores_rating 1.19 1 1.09
## room_type 2.96 3 1.20
autoplot(model8)

These new variables change our adjusted R squared only slightly (from 0.366 to 0.374). However, the strict cancellation policy and the security deposit are statistically significant, so we decided to keep both variables.
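When only some levels of a categorical variable are significant, as with cancellation_policy here, a joint F-test on the whole variable is a useful complement to the per-level p values. The following is a minimal sketch on R's built-in mtcars data, where cyl_f is a hypothetical stand-in for a multi-level factor; it is not a result from our Airbnb models.

```r
# Minimal sketch on built-in mtcars data; cyl_f is a hypothetical stand-in
# for a multi-level factor such as cancellation_policy.
car_data <- transform(mtcars, cyl_f = factor(cyl))

fit_without <- lm(log(mpg) ~ wt + hp, data = car_data)
fit_with    <- lm(log(mpg) ~ wt + hp + cyl_f, data = car_data)

# Partial F-test: are the coefficients of all cyl_f levels jointly zero?
joint_test <- anova(fit_without, fit_with)
joint_test

# The same kind of test for every term of the larger model at once
drop1(fit_with, test = "F")
```

A small p value in the second row of the anova() output would support keeping the factor as a whole, even if individual levels are not significant on their own.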
Model 9: Adding Review Types
Lastly, we would like to try adding specific review scores to our model, although we know that they might be intercorrelated.
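Before adding the individual scores, their intercorrelation can be inspected with a correlation matrix. The sketch below uses simulated scores (the numbers are made up and only the column names mirror our data), so it illustrates the check rather than reporting our actual correlations.

```r
# Minimal sketch with simulated scores; the column names mirror the report,
# but the numbers are made up for illustration.
set.seed(42)
overall <- rnorm(200, mean = 9, sd = 0.8)
scores <- data.frame(
  review_scores_rating        = 10 * overall + rnorm(200, sd = 3),
  review_scores_cleanliness   = overall + rnorm(200, sd = 0.5),
  review_scores_communication = overall + rnorm(200, sd = 0.5)
)
scores$review_scores_cleanliness[c(5, 17)] <- NA  # mimic missing reviews

# Pairwise correlations, ignoring rows where either score is missing
cor_mat <- cor(scores, use = "pairwise.complete.obs")
round(cor_mat, 2)
```

High pairwise correlations in such a matrix are a warning that the VIF values will rise once the scores enter the model together.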
# review_scores_rating was listed twice in the formula; R drops the repeated term, so it is written once here
model9 <- lm(log(price_4_nights) ~ review_scores_rating + review_scores_cleanliness + review_scores_checkin + review_scores_communication + review_scores_location + review_scores_value + cancellation_policy + security_deposit + neighbourhood_simplified + latitude + host_response_time + host_is_superhost + bedrooms + accommodates + total_amenities + prop_type_simplified + number_of_reviews_ltm + room_type, data = main_data4)
model9 %>%
broom::tidy(conf.int=TRUE)%>%
kable()%>%
kable_styling()
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 49.136 | 5.095 | 9.644 | 0.000 | 39.145 | 59.127 |
| review_scores_rating | 0.004 | 0.002 | 2.035 | 0.042 | 0.000 | 0.009 |
| review_scores_cleanliness | 0.024 | 0.016 | 1.539 | 0.124 | -0.007 | 0.054 |
| review_scores_checkin | 0.029 | 0.019 | 1.534 | 0.125 | -0.008 | 0.065 |
| review_scores_communication | 0.043 | 0.019 | 2.249 | 0.025 | 0.006 | 0.080 |
| review_scores_location | -0.040 | 0.018 | -2.289 | 0.022 | -0.075 | -0.006 |
| review_scores_value | -0.064 | 0.017 | -3.792 | 0.000 | -0.098 | -0.031 |
| cancellation_policymoderate | 0.055 | 0.039 | 1.411 | 0.158 | -0.021 | 0.131 |
| cancellation_policystrict | 0.099 | 0.031 | 3.190 | 0.001 | 0.038 | 0.159 |
| security_deposit | 0.000 | 0.000 | 4.212 | 0.000 | 0.000 | 0.000 |
| neighbourhood_simplifiedkowloon | -0.178 | 0.029 | -6.069 | 0.000 | -0.236 | -0.120 |
| neighbourhood_simplifiednew_territories | -0.144 | 0.043 | -3.359 | 0.001 | -0.227 | -0.060 |
| latitude | -1.885 | 0.229 | -8.242 | 0.000 | -2.333 | -1.436 |
| host_response_time.L | 0.030 | 0.023 | 1.291 | 0.197 | -0.016 | 0.075 |
| host_response_time.Q | -0.032 | 0.025 | -1.297 | 0.195 | -0.081 | 0.016 |
| host_response_time.C | -0.018 | 0.026 | -0.682 | 0.496 | -0.069 | 0.033 |
| host_is_superhostTRUE | 0.172 | 0.027 | 6.475 | 0.000 | 0.120 | 0.225 |
| bedrooms | 0.115 | 0.017 | 6.602 | 0.000 | 0.081 | 0.149 |
| accommodates | 0.080 | 0.007 | 10.711 | 0.000 | 0.065 | 0.095 |
| total_amenities | 0.010 | 0.001 | 7.241 | 0.000 | 0.008 | 0.013 |
| prop_type_simplifiedCondominium | -0.094 | 0.036 | -2.621 | 0.009 | -0.164 | -0.024 |
| prop_type_simplifiedGuesthouse | -0.089 | 0.053 | -1.677 | 0.094 | -0.193 | 0.015 |
| prop_type_simplifiedHostel | -0.097 | 0.048 | -2.014 | 0.044 | -0.192 | -0.003 |
| prop_type_simplifiedHotel | 0.014 | 0.061 | 0.227 | 0.821 | -0.105 | 0.133 |
| prop_type_simplifiedHouse | -0.105 | 0.057 | -1.861 | 0.063 | -0.216 | 0.006 |
| prop_type_simplifiedOther | 0.074 | 0.033 | 2.223 | 0.026 | 0.009 | 0.139 |
| prop_type_simplifiedServiced apartment | 0.030 | 0.048 | 0.623 | 0.533 | -0.065 | 0.125 |
| number_of_reviews_ltm | -0.006 | 0.001 | -5.991 | 0.000 | -0.008 | -0.004 |
| room_typeHotel room | -0.008 | 0.060 | -0.134 | 0.893 | -0.126 | 0.110 |
| room_typePrivate room | -0.338 | 0.026 | -12.804 | 0.000 | -0.390 | -0.286 |
| room_typeShared room | -0.431 | 0.061 | -7.010 | 0.000 | -0.552 | -0.311 |
model9 %>%
broom::glance()%>%
kable()%>%
kable_styling()
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.389 | 0.381 | 0.495 | 51.2 | 0 | 30 | -1737 | 3538 | 3724 | 593 | 2415 | 2446 |
msummary(model9)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.91e+01 5.09e+00 9.64 < 2e-16
## review_scores_rating 4.43e-03 2.18e-03 2.03 0.04198
## review_scores_cleanliness 2.40e-02 1.56e-02 1.54 0.12383
## review_scores_checkin 2.87e-02 1.87e-02 1.53 0.12509
## review_scores_communication 4.30e-02 1.91e-02 2.25 0.02460
## review_scores_location -4.02e-02 1.76e-02 -2.29 0.02216
## review_scores_value -6.45e-02 1.70e-02 -3.79 0.00015
## cancellation_policymoderate 5.46e-02 3.87e-02 1.41 0.15826
## cancellation_policystrict 9.85e-02 3.09e-02 3.19 0.00144
## security_deposit 3.10e-05 7.36e-06 4.21 2.6e-05
## neighbourhood_simplifiedkowloon -1.78e-01 2.93e-02 -6.07 1.5e-09
## neighbourhood_simplifiednew_territories -1.44e-01 4.28e-02 -3.36 0.00079
## latitude -1.88e+00 2.29e-01 -8.24 2.7e-16
## host_response_time.L 2.99e-02 2.31e-02 1.29 0.19681
## host_response_time.Q -3.22e-02 2.48e-02 -1.30 0.19492
## host_response_time.C -1.78e-02 2.61e-02 -0.68 0.49557
## host_is_superhostTRUE 1.72e-01 2.66e-02 6.48 1.1e-10
## bedrooms 1.15e-01 1.74e-02 6.60 5.0e-11
## accommodates 8.00e-02 7.46e-03 10.71 < 2e-16
## total_amenities 1.03e-02 1.42e-03 7.24 6.0e-13
## prop_type_simplifiedCondominium -9.39e-02 3.58e-02 -2.62 0.00883
## prop_type_simplifiedGuesthouse -8.90e-02 5.31e-02 -1.68 0.09368
## prop_type_simplifiedHostel -9.72e-02 4.83e-02 -2.01 0.04412
## prop_type_simplifiedHotel 1.37e-02 6.06e-02 0.23 0.82070
## prop_type_simplifiedHouse -1.05e-01 5.65e-02 -1.86 0.06287
## prop_type_simplifiedOther 7.41e-02 3.33e-02 2.22 0.02633
## prop_type_simplifiedServiced apartment 3.01e-02 4.83e-02 0.62 0.53310
## number_of_reviews_ltm -5.98e-03 9.98e-04 -5.99 2.4e-09
## room_typeHotel room -8.11e-03 6.03e-02 -0.13 0.89311
## room_typePrivate room -3.38e-01 2.64e-02 -12.80 < 2e-16
## room_typeShared room -4.31e-01 6.15e-02 -7.01 3.1e-12
##
## (Intercept) ***
## review_scores_rating *
## review_scores_cleanliness
## review_scores_checkin
## review_scores_communication *
## review_scores_location *
## review_scores_value ***
## cancellation_policymoderate
## cancellation_policystrict **
## security_deposit ***
## neighbourhood_simplifiedkowloon ***
## neighbourhood_simplifiednew_territories ***
## latitude ***
## host_response_time.L
## host_response_time.Q
## host_response_time.C
## host_is_superhostTRUE ***
## bedrooms ***
## accommodates ***
## total_amenities ***
## prop_type_simplifiedCondominium **
## prop_type_simplifiedGuesthouse .
## prop_type_simplifiedHostel *
## prop_type_simplifiedHotel
## prop_type_simplifiedHouse .
## prop_type_simplifiedOther *
## prop_type_simplifiedServiced apartment
## number_of_reviews_ltm ***
## room_typeHotel room
## room_typePrivate room ***
## room_typeShared room ***
##
## Residual standard error: 0.495 on 2415 degrees of freedom
## (783 observations deleted due to missingness)
## Multiple R-squared: 0.389, Adjusted R-squared: 0.381
## F-statistic: 51.2 on 30 and 2415 DF, p-value: <2e-16
car::vif(model9)
## GVIF Df GVIF^(1/(2*Df))
## review_scores_rating 5.36 1 2.32
## review_scores_cleanliness 3.04 1 1.74
## review_scores_checkin 3.48 1 1.86
## review_scores_communication 3.50 1 1.87
## review_scores_location 2.10 1 1.45
## review_scores_value 3.35 1 1.83
## cancellation_policy 1.31 2 1.07
## security_deposit 1.12 1 1.06
## neighbourhood_simplified 2.86 2 1.30
## latitude 1.58 1 1.26
## host_response_time 1.36 3 1.05
## host_is_superhost 1.22 1 1.11
## bedrooms 1.61 1 1.27
## accommodates 1.93 1 1.39
## total_amenities 1.29 1 1.14
## prop_type_simplified 2.27 7 1.06
## number_of_reviews_ltm 1.21 1 1.10
## room_type 3.00 3 1.20
autoplot(model9)
We take out the cleanliness and check-in scores because of their high p values (0.124 and 0.125), and we leave out the accuracy score as well. The review scores are collinear with one another, which is why the VIF of review_scores_rating rises to 5.36. However, we keep the communication, location and value scores because they make the biggest difference among the ones that we have just added, and review_scores_rating stays in the model as before. The result is meaningful in the sense that it shows there is a correlation between the price that is paid for the listing and the communication with the host. We observe that our adjusted R squared increases from 0.374 to 0.381.
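One caveat when comparing these fits: lm() drops incomplete rows, and Model 9 was estimated on 2,446 observations while Model 8 used 2,447. A minimal sketch of how two models can be refitted on a common sample before comparing AIC or adjusted R squared is shown below; it uses R's built-in airquality data as a stand-in, not our Airbnb data.

```r
# Minimal sketch on built-in airquality data (it contains missing values).
fit_small <- lm(Ozone ~ Wind, data = airquality)
fit_big   <- lm(Ozone ~ Wind + Solar.R, data = airquality)

# lm() silently drops incomplete rows, so the two fits use different samples
c(small = nobs(fit_small), big = nobs(fit_big))

# Keep only the rows that are complete for every variable used in either model
vars   <- c("Ozone", "Wind", "Solar.R")
common <- airquality[complete.cases(airquality[, vars]), ]

fit_small_c <- lm(Ozone ~ Wind, data = common)
fit_big_c   <- lm(Ozone ~ Wind + Solar.R, data = common)

# AIC is now computed on identical observations for both models
AIC(fit_small_c, fit_big_c)
```

With a difference of a single observation the effect on our comparison is likely negligible, but the same pattern applies whenever added variables bring many new missing values.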
Model 10: Adjusted Final Model
After our final adjustments, we reach an adjusted R squared of 38%. Although this is not high, we can observe that our model does not suffer from collinearity and that most of our variables are statistically significant, with only a few terms having p values above 5%. We decided to keep those terms because they are levels of categorical variables, and for most of these variables other levels have p values below 5%; host_response_time is the exception, since none of its contrasts is significant.
model10 <- lm(log(price_4_nights) ~ review_scores_communication + review_scores_location + review_scores_value + cancellation_policy + security_deposit + neighbourhood_simplified + latitude + host_response_time + host_is_superhost + bedrooms + accommodates + total_amenities + prop_type_simplified + number_of_reviews_ltm + review_scores_rating + room_type, data= main_data4)
model10 %>%
broom::tidy(conf.int=TRUE)%>%
kable()%>%
kable_styling()
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 49.467 | 5.096 | 9.708 | 0.000 | 39.475 | 59.460 |
| review_scores_communication | 0.058 | 0.017 | 3.497 | 0.000 | 0.026 | 0.091 |
| review_scores_location | -0.038 | 0.017 | -2.195 | 0.028 | -0.072 | -0.004 |
| review_scores_value | -0.059 | 0.017 | -3.562 | 0.000 | -0.092 | -0.027 |
| cancellation_policymoderate | 0.056 | 0.039 | 1.444 | 0.149 | -0.020 | 0.132 |
| cancellation_policystrict | 0.100 | 0.031 | 3.243 | 0.001 | 0.040 | 0.160 |
| security_deposit | 0.000 | 0.000 | 4.265 | 0.000 | 0.000 | 0.000 |
| neighbourhood_simplifiedkowloon | -0.179 | 0.029 | -6.091 | 0.000 | -0.236 | -0.121 |
| neighbourhood_simplifiednew_territories | -0.142 | 0.043 | -3.329 | 0.001 | -0.226 | -0.058 |
| latitude | -1.897 | 0.229 | -8.293 | 0.000 | -2.345 | -1.448 |
| host_response_time.L | 0.030 | 0.023 | 1.288 | 0.198 | -0.016 | 0.075 |
| host_response_time.Q | -0.030 | 0.025 | -1.228 | 0.220 | -0.079 | 0.018 |
| host_response_time.C | -0.019 | 0.026 | -0.721 | 0.471 | -0.070 | 0.032 |
| host_is_superhostTRUE | 0.174 | 0.027 | 6.544 | 0.000 | 0.122 | 0.226 |
| bedrooms | 0.113 | 0.017 | 6.501 | 0.000 | 0.079 | 0.147 |
| accommodates | 0.080 | 0.007 | 10.755 | 0.000 | 0.066 | 0.095 |
| total_amenities | 0.010 | 0.001 | 7.313 | 0.000 | 0.008 | 0.013 |
| prop_type_simplifiedCondominium | -0.097 | 0.036 | -2.711 | 0.007 | -0.167 | -0.027 |
| prop_type_simplifiedGuesthouse | -0.091 | 0.053 | -1.717 | 0.086 | -0.195 | 0.013 |
| prop_type_simplifiedHostel | -0.095 | 0.048 | -1.971 | 0.049 | -0.190 | -0.001 |
| prop_type_simplifiedHotel | 0.014 | 0.061 | 0.236 | 0.813 | -0.105 | 0.133 |
| prop_type_simplifiedHouse | -0.105 | 0.057 | -1.856 | 0.064 | -0.216 | 0.006 |
| prop_type_simplifiedOther | 0.073 | 0.033 | 2.203 | 0.028 | 0.008 | 0.139 |
| prop_type_simplifiedServiced apartment | 0.027 | 0.048 | 0.558 | 0.577 | -0.068 | 0.122 |
| number_of_reviews_ltm | -0.006 | 0.001 | -6.006 | 0.000 | -0.008 | -0.004 |
| review_scores_rating | 0.007 | 0.002 | 3.596 | 0.000 | 0.003 | 0.010 |
| room_typeHotel room | -0.007 | 0.060 | -0.121 | 0.904 | -0.126 | 0.111 |
| room_typePrivate room | -0.337 | 0.026 | -12.753 | 0.000 | -0.389 | -0.285 |
| room_typeShared room | -0.430 | 0.062 | -6.995 | 0.000 | -0.551 | -0.310 |
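Because the response is log(price_4_nights), each coefficient is read as an approximate multiplicative effect on price rather than an amount in HKD. The short sketch below converts a few of the estimates from the table above into percentage changes using exp(beta) - 1. The vector `coefs` is typed in by hand from the table purely for illustration; with the fitted object available, `coef(model10)` would serve the same purpose.

```r
# Coefficients copied by hand from the Model 10 table (log-price scale).
# `coefs` is an illustrative vector, not an object from the original analysis.
coefs <- c(
  host_is_superhostTRUE   = 0.174,
  bedrooms                = 0.113,
  "room_typePrivate room" = -0.337,
  "room_typeShared room"  = -0.430
)

# Percentage change in price for a one-unit change (or versus the baseline level)
pct_change <- (exp(coefs) - 1) * 100
round(pct_change, 1)
```

Read this way, a private room is priced roughly 29% below an entire home/apartment, holding the other variables constant, and a superhost listing roughly 19% above a comparable non-superhost listing.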
model10 %>%
broom::glance()%>%
kable()%>%
kable_styling()
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.387 | 0.38 | 0.496 | 54.6 | 0 | 28 | -1740 | 3540 | 3714 | 594 | 2418 | 2447 |
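The gap between R squared and adjusted R squared in the table above follows from the usual penalty for the number of estimated terms. As a quick check, the sketch below recomputes the adjusted value from the reported R squared, the number of observations and the model degrees of freedom; the variable names are illustrative only.

```r
# Values taken from the glance() output above
r2 <- 0.387   # r.squared
n  <- 2447    # nobs
p  <- 28      # df: number of estimated terms, excluding the intercept

# Adjusted R squared penalises R squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 2)
```

With almost 2,500 observations and 28 terms the penalty is small, which is why the two figures differ by less than one percentage point.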
msummary(model10)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.95e+01 5.10e+00 9.71 < 2e-16
## review_scores_communication 5.83e-02 1.67e-02 3.50 0.00048
## review_scores_location -3.80e-02 1.73e-02 -2.19 0.02829
## review_scores_value -5.94e-02 1.67e-02 -3.56 0.00037
## cancellation_policymoderate 5.59e-02 3.87e-02 1.44 0.14881
## cancellation_policystrict 1.00e-01 3.08e-02 3.24 0.00120
## security_deposit 3.14e-05 7.35e-06 4.27 2.1e-05
## neighbourhood_simplifiedkowloon -1.79e-01 2.93e-02 -6.09 1.3e-09
## neighbourhood_simplifiednew_territories -1.42e-01 4.27e-02 -3.33 0.00089
## latitude -1.90e+00 2.29e-01 -8.29 < 2e-16
## host_response_time.L 2.98e-02 2.31e-02 1.29 0.19779
## host_response_time.Q -3.05e-02 2.48e-02 -1.23 0.21958
## host_response_time.C -1.88e-02 2.61e-02 -0.72 0.47107
## host_is_superhostTRUE 1.74e-01 2.66e-02 6.54 7.3e-11
## bedrooms 1.13e-01 1.73e-02 6.50 9.7e-11
## accommodates 8.03e-02 7.46e-03 10.76 < 2e-16
## total_amenities 1.04e-02 1.42e-03 7.31 3.5e-13
## prop_type_simplifiedCondominium -9.71e-02 3.58e-02 -2.71 0.00676
## prop_type_simplifiedGuesthouse -9.11e-02 5.31e-02 -1.72 0.08613
## prop_type_simplifiedHostel -9.52e-02 4.83e-02 -1.97 0.04878
## prop_type_simplifiedHotel 1.43e-02 6.06e-02 0.24 0.81326
## prop_type_simplifiedHouse -1.05e-01 5.66e-02 -1.86 0.06360
## prop_type_simplifiedOther 7.34e-02 3.33e-02 2.20 0.02771
## prop_type_simplifiedServiced apartment 2.69e-02 4.83e-02 0.56 0.57721
## number_of_reviews_ltm -6.00e-03 9.99e-04 -6.01 2.2e-09
## review_scores_rating 6.77e-03 1.88e-03 3.60 0.00033
## room_typeHotel room -7.28e-03 6.04e-02 -0.12 0.90403
## room_typePrivate room -3.37e-01 2.64e-02 -12.75 < 2e-16
## room_typeShared room -4.30e-01 6.15e-02 -6.99 3.4e-12
##
## (Intercept) ***
## review_scores_communication ***
## review_scores_location *
## review_scores_value ***
## cancellation_policymoderate
## cancellation_policystrict **
## security_deposit ***
## neighbourhood_simplifiedkowloon ***
## neighbourhood_simplifiednew_territories ***
## latitude ***
## host_response_time.L
## host_response_time.Q
## host_response_time.C
## host_is_superhostTRUE ***
## bedrooms ***
## accommodates ***
## total_amenities ***
## prop_type_simplifiedCondominium **
## prop_type_simplifiedGuesthouse .
## prop_type_simplifiedHostel *
## prop_type_simplifiedHotel
## prop_type_simplifiedHouse .
## prop_type_simplifiedOther *
## prop_type_simplifiedServiced apartment
## number_of_reviews_ltm ***
## review_scores_rating ***
## room_typeHotel room
## room_typePrivate room ***
## room_typeShared room ***
##
## Residual standard error: 0.496 on 2418 degrees of freedom
## (782 observations deleted due to missingness)
## Multiple R-squared: 0.387, Adjusted R-squared: 0.38
## F-statistic: 54.6 on 28 and 2418 DF, p-value: <2e-16
car::vif(model10)
## GVIF Df GVIF^(1/(2*Df))
## review_scores_communication 2.72 1 1.65
## review_scores_location 2.11 1 1.45
## review_scores_value 3.28 1 1.81
## cancellation_policy 1.31 2 1.07
## security_deposit 1.12 1 1.06
## neighbourhood_simplified 2.85 2 1.30
## latitude 1.58 1 1.26
## host_response_time 1.35 3 1.05
## host_is_superhost 1.22 1 1.10
## bedrooms 1.60 1 1.27
## accommodates 1.93 1 1.39
## total_amenities 1.29 1 1.14
## prop_type_simplified 2.25 7 1.06
## number_of_reviews_ltm 1.21 1 1.10
## review_scores_rating 4.08 1 2.02
## room_type 3.00 3 1.20
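For categorical predictors with more than one degree of freedom, car::vif reports a generalized VIF, and the last column rescales it so that terms with different Df can be compared. The sketch below reproduces that last column for three rows of the output above; squaring it gives a number that can be held against the usual VIF rules of thumb. The data frame `vif_tab` is typed in by hand for illustration.

```r
# Three rows copied from the car::vif(model10) output above
vif_tab <- data.frame(
  term = c("review_scores_rating", "room_type", "prop_type_simplified"),
  GVIF = c(4.08, 3.00, 2.25),
  Df   = c(1, 3, 7)
)

# Df-adjusted GVIF, as shown in the third column of the output
vif_tab$adj <- vif_tab$GVIF^(1 / (2 * vif_tab$Df))

# Squaring the adjusted value puts it on the scale of an ordinary VIF
vif_tab$adj_sq <- vif_tab$adj^2

vif_tab
```

On this scale review_scores_rating is the most collinear term, with a value of about 4, which is still below the commonly used cut-off of 5.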
autoplot(model10)

Regression Diagnostics
From Model 1 to Model 10, the residuals vs fitted values plot becomes more random. A concentration of the residuals at high fitted values is clearly visible for the first model, while the residuals are centered around 0 for Model 10. Consequently, the linearity assumption appears to hold for Model 10.
The S-shaped Q-Q plot of residuals for Model 10 indicates heavy tails, or an excess of extreme values relative to the normal distribution. This can be explained by some Airbnbs having extremely high or low prices given their characteristics.
There is no visible trend in the Scale-Location graph, indicating that the variability of residuals in Model 10 is constant. Finally, the Residuals vs Factor levels leverage graph shows that the size of residuals is the same on average for all factor levels. We still observe some extreme values, due to the presence of outliers in Airbnb prices.
In conclusion, the assumptions of the linear regression model broadly hold for Model 10, with the caveat that the heavy tails in the Q-Q plot show the residuals are not perfectly normal. To improve the distribution of the residuals, we included new variables; a further step could be a more detailed classification of the neighbourhoods.
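One way to put a number on the heavy tails seen in a Q-Q plot is to count how many standardized residuals fall far from zero and compare that share with what a normal distribution would give. The sketch below does this on simulated data with deliberately heavy-tailed errors; the data, the variable names and the cut-off of 3 are assumptions made for illustration, and the same two lines could be applied to `rstandard(model10)`.

```r
# Simulated example only: a linear model with heavy-tailed (t-distributed) errors
set.seed(1)
n <- 500
x <- runif(n)
y <- 1 + 2 * x + 0.5 * rt(n, df = 3)
fit <- lm(y ~ x)

# Share of standardized residuals beyond +/- 3
z <- rstandard(fit)
observed <- mean(abs(z) > 3)

# Share expected if the residuals were exactly normal (about 0.27%)
expected <- 2 * pnorm(-3)

c(observed = observed, expected = expected)
```

An observed share well above the expected one points to the same excess of extreme values that the S-shaped Q-Q plot shows.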
To compare the last 5 models that we have created we will use huxtable to create a summary table.
##summarize these models
huxreg(model6, model7, model8, model9, model10)
| | (1) | (2) | (3) | (4) | (5) |
|---|---|---|---|---|---|
| (Intercept) | 6.998 *** | 53.200 *** | 50.972 *** | 49.136 *** | 49.467 *** |
| | (0.104) | (5.203) | (5.112) | (5.095) | (5.096) |
| host_response_time.L | 0.011 | 0.015 | 0.029 | 0.030 | 0.030 |
| | (0.024) | (0.023) | (0.023) | (0.023) | (0.023) |
| host_response_time.Q | -0.083 ** | -0.024 | -0.028 | -0.032 | -0.030 |
| | (0.026) | (0.025) | (0.025) | (0.025) | (0.025) |
| host_response_time.C | -0.022 | -0.018 | -0.022 | -0.018 | -0.019 |
| | (0.027) | (0.026) | (0.026) | (0.026) | (0.026) |
| host_is_superhostTRUE | 0.150 *** | 0.176 *** | 0.177 *** | 0.172 *** | 0.174 *** |
| | (0.028) | (0.027) | (0.027) | (0.027) | (0.027) |
| host_since_calculated | 0.023 *** | 0.006 | | | |
| | (0.006) | (0.005) | | | |
| bedrooms | 0.082 *** | 0.103 *** | 0.109 *** | 0.115 *** | 0.113 *** |
| | (0.018) | (0.017) | (0.017) | (0.017) | (0.017) |
| accommodates | 0.085 *** | 0.084 *** | 0.081 *** | 0.080 *** | 0.080 *** |
| | (0.008) | (0.008) | (0.007) | (0.007) | (0.007) |
| total_amenities | 0.012 *** | 0.011 *** | 0.011 *** | 0.010 *** | 0.010 *** |
| | (0.001) | (0.001) | (0.001) | (0.001) | (0.001) |
| prop_type_simplifiedCondominium | -0.151 *** | -0.098 ** | -0.098 ** | -0.094 ** | -0.097 ** |
| | (0.037) | (0.036) | (0.036) | (0.036) | (0.036) |
| prop_type_simplifiedGuesthouse | -0.097 | -0.094 | -0.097 | -0.089 | -0.091 |
| | (0.056) | (0.054) | (0.053) | (0.053) | (0.053) |
| prop_type_simplifiedHostel | -0.137 ** | -0.102 * | -0.094 | -0.097 * | -0.095 * |
| | (0.050) | (0.049) | (0.048) | (0.048) | (0.048) |
| prop_type_simplifiedHotel | -0.059 | -0.006 | 0.018 | 0.014 | 0.014 |
| | (0.063) | (0.061) | (0.061) | (0.061) | (0.061) |
| prop_type_simplifiedHouse | -0.170 ** | -0.097 | -0.106 | -0.105 | -0.105 |
| | (0.057) | (0.057) | (0.057) | (0.057) | (0.057) |
| prop_type_simplifiedOther | 0.024 | 0.069 * | 0.077 * | 0.074 * | 0.073 * |
| | (0.035) | (0.034) | (0.033) | (0.033) | (0.033) |
| prop_type_simplifiedServiced apartment | -0.031 | 0.031 | 0.035 | 0.030 | 0.027 |
| | (0.050) | (0.049) | (0.048) | (0.048) | (0.048) |
| number_of_reviews_ltm | -0.006 *** | -0.006 *** | -0.006 *** | -0.006 *** | -0.006 *** |
| | (0.001) | (0.001) | (0.001) | (0.001) | (0.001) |
| review_scores_rating | 0.004 *** | 0.004 *** | 0.004 *** | 0.004 * | 0.007 *** |
| | (0.001) | (0.001) | (0.001) | (0.002) | (0.002) |
| room_typeHotel room | -0.048 | -0.016 | -0.030 | -0.008 | -0.007 |
| | (0.062) | (0.061) | (0.060) | (0.060) | (0.060) |
| room_typePrivate room | -0.383 *** | -0.347 *** | -0.345 *** | -0.338 *** | -0.337 *** |
| | (0.026) | (0.027) | (0.026) | (0.026) | (0.026) |
| room_typeShared room | -0.496 *** | -0.486 *** | -0.437 *** | -0.431 *** | -0.430 *** |
| | (0.064) | (0.062) | (0.062) | (0.061) | (0.062) |
| neighbourhood_simplifiedkowloon | | -0.196 *** | -0.187 *** | -0.178 *** | -0.179 *** |
| | | (0.029) | (0.029) | (0.029) | (0.029) |
| neighbourhood_simplifiednew_territories | | -0.155 *** | -0.124 ** | -0.144 *** | -0.142 *** |
| | | (0.043) | (0.042) | (0.043) | (0.043) |
| is_location_exactTRUE | | 0.015 | | | |
| | | (0.023) | | | |
| latitude | | -2.065 *** | -1.968 *** | -1.885 *** | -1.897 *** |
| | | (0.233) | (0.229) | (0.229) | (0.229) |
| cancellation_policymoderate | | | 0.059 | 0.055 | 0.056 |
| | | | (0.039) | (0.039) | (0.039) |
| cancellation_policystrict | | | 0.105 *** | 0.099 ** | 0.100 ** |
| | | | (0.031) | (0.031) | (0.031) |
| security_deposit | | | 0.000 *** | 0.000 *** | 0.000 *** |
| | | | (0.000) | (0.000) | (0.000) |
| review_scores_cleanliness | | | | 0.024 | |
| | | | | (0.016) | |
| review_scores_checkin | | | | 0.029 | |
| | | | | (0.019) | |
| review_scores_communication | | | | 0.043 * | 0.058 *** |
| | | | | (0.019) | (0.017) |
| review_scores_location | | | | -0.040 * | -0.038 * |
| | | | | (0.018) | (0.017) |
| review_scores_value | | | | -0.064 *** | -0.059 *** |
| | | | | (0.017) | (0.017) |
| N | 2447 | 2447 | 2447 | 2446 | 2447 |
| R2 | 0.317 | 0.372 | 0.381 | 0.389 | 0.387 |
| logLik | -1872.901 | -1770.115 | -1753.698 | -1736.944 | -1740.159 |
| AIC | 3789.803 | 3592.231 | 3561.396 | 3537.888 | 3540.318 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | | | | | |
By brainstorming on the features we look at when booking an Airbnb, we added several explanatory variables and increased the R squared from 31.7% in Model 6 to 38.7% in our final model (adjusted R squared of 38%).
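The fit statistics at the bottom of the summary table can also be compared side by side. The sketch below re-enters them by hand into an illustrative data frame, `fit_stats`, and computes each model's AIC difference from the best value. It also flags which models were fitted on the same number of observations as Model 10, since AIC values are only strictly comparable across models fitted to the same data.

```r
# Fit statistics copied from the huxreg summary table above
fit_stats <- data.frame(
  model = c("model6", "model7", "model8", "model9", "model10"),
  n     = c(2447, 2447, 2447, 2446, 2447),
  r2    = c(0.317, 0.372, 0.381, 0.389, 0.387),
  aic   = c(3789.803, 3592.231, 3561.396, 3537.888, 3540.318)
)

# Difference from the lowest AIC (0 marks the best-scoring model)
fit_stats$delta_aic <- fit_stats$aic - min(fit_stats$aic)

# Flag models fitted on the same sample size as the final model
fit_stats$same_n_as_final <- fit_stats$n == fit_stats$n[fit_stats$model == "model10"]

fit_stats
```

Model 9 has a slightly lower AIC than Model 10, but it was fitted on one observation fewer and includes the review scores that were dropped for their high p-values, so the small difference is not decisive on its own.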
#residuals check
autoplot(model10)

##check for collinearity
car::vif(model10)
## GVIF Df GVIF^(1/(2*Df))
## review_scores_communication 2.72 1 1.65
## review_scores_location 2.11 1 1.45
## review_scores_value 3.28 1 1.81
## cancellation_policy 1.31 2 1.07
## security_deposit 1.12 1 1.06
## neighbourhood_simplified 2.85 2 1.30
## latitude 1.58 1 1.26
## host_response_time 1.35 3 1.05
## host_is_superhost 1.22 1 1.10
## bedrooms 1.60 1 1.27
## accommodates 1.93 1 1.39
## total_amenities 1.29 1 1.14
## prop_type_simplified 2.25 7 1.06
## number_of_reviews_ltm 1.21 1 1.10
## review_scores_rating 4.08 1 2.02
## room_type 3.00 3 1.20
##renaming the final model
final_model <- model10
## summarize final model
msummary(final_model)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.95e+01 5.10e+00 9.71 < 2e-16
## review_scores_communication 5.83e-02 1.67e-02 3.50 0.00048
## review_scores_location -3.80e-02 1.73e-02 -2.19 0.02829
## review_scores_value -5.94e-02 1.67e-02 -3.56 0.00037
## cancellation_policymoderate 5.59e-02 3.87e-02 1.44 0.14881
## cancellation_policystrict 1.00e-01 3.08e-02 3.24 0.00120
## security_deposit 3.14e-05 7.35e-06 4.27 2.1e-05
## neighbourhood_simplifiedkowloon -1.79e-01 2.93e-02 -6.09 1.3e-09
## neighbourhood_simplifiednew_territories -1.42e-01 4.27e-02 -3.33 0.00089
## latitude -1.90e+00 2.29e-01 -8.29 < 2e-16
## host_response_time.L 2.98e-02 2.31e-02 1.29 0.19779
## host_response_time.Q -3.05e-02 2.48e-02 -1.23 0.21958
## host_response_time.C -1.88e-02 2.61e-02 -0.72 0.47107
## host_is_superhostTRUE 1.74e-01 2.66e-02 6.54 7.3e-11
## bedrooms 1.13e-01 1.73e-02 6.50 9.7e-11
## accommodates 8.03e-02 7.46e-03 10.76 < 2e-16
## total_amenities 1.04e-02 1.42e-03 7.31 3.5e-13
## prop_type_simplifiedCondominium -9.71e-02 3.58e-02 -2.71 0.00676
## prop_type_simplifiedGuesthouse -9.11e-02 5.31e-02 -1.72 0.08613
## prop_type_simplifiedHostel -9.52e-02 4.83e-02 -1.97 0.04878
## prop_type_simplifiedHotel 1.43e-02 6.06e-02 0.24 0.81326
## prop_type_simplifiedHouse -1.05e-01 5.66e-02 -1.86 0.06360
## prop_type_simplifiedOther 7.34e-02 3.33e-02 2.20 0.02771
## prop_type_simplifiedServiced apartment 2.69e-02 4.83e-02 0.56 0.57721
## number_of_reviews_ltm -6.00e-03 9.99e-04 -6.01 2.2e-09
## review_scores_rating 6.77e-03 1.88e-03 3.60 0.00033
## room_typeHotel room -7.28e-03 6.04e-02 -0.12 0.90403
## room_typePrivate room -3.37e-01 2.64e-02 -12.75 < 2e-16
## room_typeShared room -4.30e-01 6.15e-02 -6.99 3.4e-12
##
## (Intercept) ***
## review_scores_communication ***
## review_scores_location *
## review_scores_value ***
## cancellation_policymoderate
## cancellation_policystrict **
## security_deposit ***
## neighbourhood_simplifiedkowloon ***
## neighbourhood_simplifiednew_territories ***
## latitude ***
## host_response_time.L
## host_response_time.Q
## host_response_time.C
## host_is_superhostTRUE ***
## bedrooms ***
## accommodates ***
## total_amenities ***
## prop_type_simplifiedCondominium **
## prop_type_simplifiedGuesthouse .
## prop_type_simplifiedHostel *
## prop_type_simplifiedHotel
## prop_type_simplifiedHouse .
## prop_type_simplifiedOther *
## prop_type_simplifiedServiced apartment
## number_of_reviews_ltm ***
## review_scores_rating ***
## room_typeHotel room
## room_typePrivate room ***
## room_typeShared room ***
##
## Residual standard error: 0.496 on 2418 degrees of freedom
## (782 observations deleted due to missingness)
## Multiple R-squared: 0.387, Adjusted R-squared: 0.38
## F-statistic: 54.6 on 28 and 2418 DF, p-value: <2e-16
confint(final_model)
## 2.5 % 97.5 %
## (Intercept) 3.95e+01 5.95e+01
## review_scores_communication 2.56e-02 9.09e-02
## review_scores_location -7.19e-02 -4.04e-03
## review_scores_value -9.21e-02 -2.67e-02
## cancellation_policymoderate -2.00e-02 1.32e-01
## cancellation_policystrict 3.95e-02 1.60e-01
## security_deposit 1.69e-05 4.58e-05
## neighbourhood_simplifiedkowloon -2.36e-01 -1.21e-01
## neighbourhood_simplifiednew_territories -2.26e-01 -5.84e-02
## latitude -2.35e+00 -1.45e+00
## host_response_time.L -1.55e-02 7.51e-02
## host_response_time.Q -7.92e-02 1.82e-02
## host_response_time.C -7.00e-02 3.24e-02
## host_is_superhostTRUE 1.22e-01 2.26e-01
## bedrooms 7.87e-02 1.47e-01
## accommodates 6.57e-02 9.49e-02
## total_amenities 7.59e-03 1.31e-02
## prop_type_simplifiedCondominium -1.67e-01 -2.69e-02
## prop_type_simplifiedGuesthouse -1.95e-01 1.30e-02
## prop_type_simplifiedHostel -1.90e-01 -5.08e-04
## prop_type_simplifiedHotel -1.05e-01 1.33e-01
## prop_type_simplifiedHouse -2.16e-01 5.95e-03
## prop_type_simplifiedOther 8.06e-03 1.39e-01
## prop_type_simplifiedServiced apartment -6.78e-02 1.22e-01
## number_of_reviews_ltm -7.95e-03 -4.04e-03
## review_scores_rating 3.08e-03 1.05e-02
## room_typeHotel room -1.26e-01 1.11e-01
## room_typePrivate room -3.89e-01 -2.85e-01
## room_typeShared room -5.51e-01 -3.10e-01
Predicting Price
To predict the price for two people staying 4 nights in Hong Kong, we first create a data frame that contains their preferences and then feed it into our model. We have chosen two tourists who want to stay in an entire apartment that has 2 bedrooms, 20 amenities and a strict cancellation policy. The apartment is in the hong_kong area, has 10 reviews in the last twelve months and a rating of 90, and accommodates 3 people.
## Using our model to predict price
predictors <- data.frame(review_scores_communication = 8,
review_scores_location = 8,
review_scores_value = 8,
cancellation_policy = "strict",
security_deposit = 775 ,
neighbourhood_simplified = "hong_kong",
latitude = 22.5,
host_response_time = "within an hour",
host_is_superhost = TRUE,
bedrooms = 2,
accommodates = 3,
total_amenities = 20,
prop_type_simplified = "Apartment",
number_of_reviews_ltm = 10,
review_scores_rating = 90,
room_type = "Entire home/apt")
#Predicting price and prediction intervals
price_predicted = exp(predict(final_model, newdata = predictors, interval = "prediction"))
paste("Price for two people staying four nights in Hong Kong is: ", price_predicted[1])
## [1] "Price for two people staying four nights in Hong Kong is: 2896.75582785523"
paste("Lower PI for price for two people staying four nights in Hong Kong is: ", price_predicted[2])
## [1] "Lower PI for price for two people staying four nights in Hong Kong is: 1085.6185400902"
paste("Upper PI for Price for two people staying four nights in Hong Kong is: ", price_predicted[3])
## [1] "Upper PI for Price for two people staying four nights in Hong Kong is: 7729.41324815255"
After taking the anti-log of the value that our model predicts, the predicted price is 2896.76 HKD, with a 95% prediction interval from 1085.62 HKD to 7729.41 HKD. The point prediction lies closer to the lower bound because the interval is symmetric on the log scale and becomes asymmetric once exponentiated, which is consistent with right-skewed prices. The interval is wide because our model explains only about 39% of the variation in log price (adjusted R squared of 38%), which leaves considerable room for uncertainty.
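The asymmetry of the interval can be checked directly from the three numbers printed above. The sketch below takes them back to the log scale, where the model was fitted, and compares the distance from the prediction to each bound. The variable names are illustrative, and the values are copied from the output rather than recomputed from the model.

```r
# Values copied from the prediction output above (in HKD)
fit <- 2896.75582785523
lwr <- 1085.6185400902
upr <- 7729.41324815255

# On the log scale the two gaps are equal: the interval is symmetric there
log_pi <- log(c(lwr = lwr, fit = fit, upr = upr))
gaps <- unname(diff(log_pi))
gaps

# On the price scale the same interval is multiplicative:
# the upper bound is fit * k and the lower bound is fit / k
c(fit / lwr, upr / fit)
```

Both ratios come out at about 2.67, so the interval runs from the prediction divided by 2.67 to the prediction multiplied by 2.67, which is why the upper bound sits much further from the prediction in HKD terms.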