library(tidyverse)
library(tidymodels)
library(readxl)
library(openxlsx)
library(lubridate)
library(here)
library(devtools)
library(recipes)
library(rsample)
library(timetk)
library(glmnet)
library(tidyquant)
library(visdat)
library(janitor)
library(kableExtra)
library(caret)
library(rpart)
library(rpart.plot)
library(pdp)
library(vip)
library(GGally)
library(car)
library(ggcorrplot)
library(ggdensity)
library(scico)
library(paletteer)
library(earth)
library(ranger)
library(h2o)
library(xgboost)
library(modeltime)
library(lares)
library(lmtest)
library(nortest)
library(auditor)
library(DALEXtra)
library(modelStudio)
library(patchwork)
What determines the price of houses in Bilbao? I want to investigate this question using Kaggle’s Spanish Housing Dataset, which was originally web-crawled from Idealista (between March and April 2019).
Knowledge of the housing market is not only valuable for sellers and buyers of real estate but can also provide profound understanding of socio-economic and socio-demographic local variations within a city.
Bilbao is a city in northern Spain located in the province of Bizkaia. It is the largest metropolitan area in the region and an economic and cultural hub.
In this analysis I will explore Bilbao’s housing market in search of the factors that drive real estate prices.
Selecting the Data
First, let’s have a look at the features and the presence of missing data.
load(file = here("Data", "Houses_Bilbao" ,"data.Rda"))
vis_miss(data, cluster = TRUE)
The data has many variables, but we are particularly interested in features such as construction year, district, number of rooms, presence of a lift, and number of bathrooms, which we will regress on our response, price per square meter.
To build our working data set, let’s first drop the features that contain no data.
Bilbao <- data %>%
filter(loc_city == "Bilbao") %>%
select(-c(loc_city, garage, ad_description, ad_last_update, ground_size, kitchen, unfurnished, house_per_city, house_id, floor, loc_zone, loc_street, loc_neigh, obtention_date, orientation, loc_full, m2_useful)) %>%
mutate(price_m2 = price/m2_real) %>%
mutate(loc_district = gsub("Distrito", "", loc_district)) %>%
mutate(bath_num = as.numeric(bath_num)) %>%
filter(price_m2 >= quantile(price_m2, probs = 0.001)) %>%
filter(price_m2 <= quantile(price_m2, probs = 0.999)) %>%
mutate(condition = case_when(condition == "segunda mano/buen estado" ~ "used_ok",
condition == "promoción de obra nueva" ~ "new",
condition == "segunda mano/para reformar" ~ "used_to_reno")) %>%
mutate(heating_type = case_when(heating == "calefacción central" ~ "central",
heating == "calefacción central: gas" ~ "central",
heating == "calefacción central: gas propano/butano" ~ "central",
heating == "calefacción central: gasoil" ~ "central",
heating == "calefacción individual" ~ "individual",
heating == "calefacción individual: bomba de frío/calor" ~ "individual",
heating == "calefacción individual: eléctrica" ~ "individual",
heating == "calefacción individual: gas" ~ "individual",
heating == "calefacción individual: gas natural" ~ "individual",
heating == "calefacción individual: gas propano/butano" ~ "individual",
heating == "calefacción individual: gas propano/butano" ~ "individual",
heating == "no dispone de calefacción" ~ "no_heat"
)) %>%
mutate(id = row_number()) %>%
filter(id != 3239) %>%
select(-c(id, m2_real, heating, price))
# Save File
save(Bilbao, file = here("Data", "Houses_Bilbao", "Bilbao.Rda"))
Summary Statistics
Bilbao %>%
tabyl(loc_district) %>%
ggplot(aes(x = "", y = percent, fill = loc_district)) +
geom_bar(stat = "identity", width =1, color="white") +
coord_polar("y" , start = 0) +
scale_fill_tq() +
theme_minimal() +
ggtitle("Proportion of houses per district")
The graphic shows the proportion of houses per district. In some cases two adjacent districts were grouped together in the data.
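To see the merged labels and the absolute number of listings behind these percentages, a quick count helps (a minimal sketch using dplyr, which is already loaded via the tidyverse):
# Listings per district; merged districts show up as combined labels such as "Abando - Albia".
Bilbao %>%
  count(loc_district, sort = TRUE)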
Square Meter Prices per District
Outliers below the 0.1% and above the 99.9% quantiles of price_m2 were already removed when building the working data set.
Bilbao %>%
group_by(loc_district) %>%
summarise(Min = min(price_m2),
Q25 = quantile(price_m2, probs = .25),
Median = median(price_m2),
Mean = mean(price_m2),
Q75 = quantile(price_m2, probs = .75),
Max = max(price_m2)) %>%
arrange(desc(Median)) %>%
dplyr::rename(District = loc_district) %>%
kable(digits = 0, caption = "Statistic of price_m2 per district") %>%
kable_styling(full_width = T)
| District | Min | Q25 | Median | Mean | Q75 | Max |
|---|---|---|---|---|---|---|
| Abando - Albia | 1882 | 3333 | 4444 | 4515 | 5294 | 8744 |
| Indautxu | 2333 | 3579 | 4091 | 4316 | 4778 | 8800 |
| Deusto | 249 | 2755 | 3184 | 3109 | 3569 | 4750 |
| Casco Viejo | 1478 | 2503 | 2963 | 3116 | 3667 | 5857 |
| Basurto - Zorroza | 314 | 1889 | 2835 | 2804 | 3531 | 6429 |
| Rekalde | 251 | 2224 | 2793 | 2765 | 3172 | 6042 |
| Begoña - Santutxu | 1024 | 2321 | 2757 | 2751 | 3133 | 5857 |
| San Ignacio | 1250 | 2288 | 2707 | 2797 | 3365 | 3931 |
| Uribarri | 45 | 2129 | 2517 | 2676 | 3110 | 8000 |
| Ibaiondo | 696 | 1963 | 2400 | 2543 | 3098 | 5556 |
| Otxarkoaga - Txurdinaga | 290 | 1791 | 2362 | 2385 | 2883 | 4982 |
| San Adrián - La Peña | 915 | 1797 | 2138 | 2223 | 2500 | 4808 |
It is better to visualize the table in a boxplot.
Bilbao %>%
summarise(Min = min(price_m2),
Q25 = quantile(price_m2, probs = .25),
Median = median(price_m2),
Mean = mean(price_m2),
Q75 = quantile(price_m2, probs = .75),
Max = max(price_m2)) %>%
arrange(desc(Median))
  Min      Q25   Median     Mean  Q75  Max
1  45 2321.627 2962.963 3155.546 3750 8800
Bilbao %>%
group_by(loc_district) %>%
ggplot(aes(x = loc_district, y = price_m2, fill = loc_district)) +
geom_boxplot() +
geom_hline(yintercept = 2962.963, linetype="dashed", color = "black") + # general Median
scale_fill_tq() +
theme_minimal() +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank()) +
ggtitle("Boxplots of price_m2 per district") +
scale_y_continuous(labels = scales::dollar_format(prefix = "€"))
The districts Abando/Albia and Indautxu stand out for particularly high square meter prices, which push the city-wide median upwards. The differences between the other districts are comparably small.
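To quantify how much these two districts pull the city-wide figures upwards, we can compare the overall median with the median computed without them (a minimal sketch; the district labels are assumed to keep the leading space left by the gsub() step above):
# City-wide median vs. the median excluding the two most expensive districts.
Bilbao %>%
  summarise(
    median_all  = median(price_m2),
    median_rest = median(price_m2[!loc_district %in% c(" Abando - Albia", " Indautxu")])
  )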
Let’s also have a look at the overall distribution of price_m2.
Bilbao %>%
ggplot(aes(x = price_m2)) +
geom_histogram(bins = 100) +
ggtitle("House price/m2 in Euro") +
scale_x_continuous(labels = scales::dollar_format(prefix = "€")) +
theme_minimal() 
The distribution is roughly normal with a slight skew to the right.
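A rough moment-based skewness estimate confirms this numerically (a minimal sketch in base R, no extra package assumed):
# Sample skewness of price_m2; a positive value indicates a right skew.
x <- Bilbao$price_m2
mean((x - mean(x))^3) / sd(x)^3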
Overview of house features
Bilbao %>%
filter(room_num <= 5) %>%
tabyl(room_num, loc_district) %>%
pivot_longer(
cols = c(2:13),
names_to = "loc_district",
values_to = "Count"
) %>%
group_by(loc_district) %>%
mutate(Sum = sum(Count)) %>%
mutate(Perc = Count/Sum * 100) %>%
ggplot(aes(x = loc_district, y = Perc, fill = loc_district)) +
geom_bar(position="dodge", stat = "identity") +
facet_grid(cols = vars(room_num)) +
scale_fill_tq() +
theme_minimal() +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank()) +
labs(title = "Percentage of Rooms per District")
Most dwellings have 3 rooms. Interestingly, Casco Viejo has overproportionally many one-room apartments, while Indautxu and Abando have the largest share of five-room dwellings.
Bilbao %>%
ggplot(aes(x = heating_type, y = price_m2, fill = loc_district)) +
geom_bar(position="dodge", stat = "identity") +
scale_fill_tq() +
theme_minimal() +
labs(title = "Price_m2 per Heating Type") +
scale_y_continuous(labels = scales::dollar_format(prefix = "€"))
The type of heating doesn’t seem to affect the square meter price very much.
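As a complement to the bar chart, the median price per heating type gives a direct numeric check (a minimal sketch; listings with missing heating information form their own group):
# Median square meter price and listing count per heating type.
Bilbao %>%
  group_by(heating_type) %>%
  summarise(n = n(), median_price_m2 = median(price_m2)) %>%
  arrange(desc(median_price_m2))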
Bilbao %>%
tabyl(condition, loc_district) %>%
pivot_longer(
cols = c(2:13),
names_to = "loc_district",
values_to = "Count"
) %>%
group_by(loc_district) %>%
ggplot(aes(x = loc_district, y = Count, fill = loc_district)) +
geom_bar(position="dodge", stat = "identity") +
facet_grid(cols = vars(condition)) +
scale_fill_tq() +
theme_minimal() +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank()) +
labs(title = "Absolute Count of House Condition per District")
The majority of houses on the market are second-hand apartments in good condition.
Bilbao %>%
ggplot(aes(x = condition, y = price_m2, fill = loc_district)) +
geom_bar(position="dodge", stat = "identity") +
scale_fill_tq() +
theme_minimal() +
labs(title = "Price_m2 per Condition") 
At first sight, it seems counterintuitive that new apartments cost less per square meter than second-hand ones. Because of the hilliness of Bilbao, it might be that new buildings are built in more peripheral areas of a district and therefore have a lower square meter price.
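The impression can be checked directly with the median price per condition (a minimal sketch):
# Median square meter price per condition: a numeric check of the "new is cheaper than used" pattern.
Bilbao %>%
  group_by(condition) %>%
  summarise(n = n(), median_price_m2 = median(price_m2)) %>%
  arrange(desc(median_price_m2))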
Construction Date
scatterplot(price_m2 ~ construct_date, data = Bilbao, grid = FALSE)
We see that there is a visible relation between construction year and square meter price. At the same time, we can observe that a large number of houses were built between 1960 and 1975, corresponding to the period of large-scale migration from other Spanish provinces.
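The building boom can be verified by counting listings per construction decade (a minimal sketch; construct_date is assumed to be a numeric year, as in the scatterplot above):
# Listings per construction decade; the 1960s and 1970s should dominate.
Bilbao %>%
  filter(!is.na(construct_date)) %>%
  count(decade = 10 * floor(construct_date / 10)) %>%
  arrange(desc(n))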
Development of Housing and Price
# High density regions
Bilbao %>%
ggplot(aes(x =construct_date, y = price_m2, fill = loc_district)) +
geom_hdr(probs = c (0.9, 0.5)) +
geom_point(shape = 21, size = 1.5) +
scale_fill_tq() +
theme_tq() +
labs(title = "High density regions (Price_m2 ~ Date)") +
facet_wrap(~ loc_district)
We can see that houses within a district often share similar price ranges and construction years. Some older districts went through a long period of continuous construction, whereas in others construction seems to have been concentrated in a shorter period.
Bilbao %>%
filter(loc_district == " Abando - Albia" | loc_district == " San Ignacio" | loc_district == " Otxarkoaga - Txurdinaga") %>%
ggplot(aes(x =construct_date, y = price_m2, fill = loc_district)) +
geom_hdr(probs = c (0.9, 0.5)) +
geom_point(shape = 21, size = 1.5) +
scale_fill_tq() +
theme_tq() +
labs(title = "High density regions certain Neighborhoods")
The clustering is especially visible when these districts are shown in a single graph.
Looking at Correlations
First, let us look at the correlations between all numeric features and response.
numeric = Bilbao %>%
select(where(is.numeric))
ggcorr(numeric, label = TRUE, label_size = 2, label_color = "white") +
scale_fill_paletteer_c("viridis::plasma") +
labs(title = "Correlations (Numeric Variables)")
Especially important are the correlations with the response variable.
Bilbao %>%
corr_var(price_m2)
Using simple correlations, we can see that being located in the district of Abando/Albia has the highest positive correlation with the square meter price.
Modeling
Now that we have an overview of the data, we can start to analyze the relationships between the features and the response in more depth.
Splitting the Data
First we split the data into a training and a testing set, where 3/4 of the data is reserved for training and the rest for testing. We also use strata = "price_m2" to make sure that the distribution of price_m2 is similar between the training and testing sets.
set.seed(123)
split <- initial_split(Bilbao, prop = 0.75,
strata = "price_m2")
data_train <- training(split)
data_test <- testing(split)
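To verify that the stratified split worked as intended, we can compare a few quantiles of the response in the two sets (a minimal sketch):
# The quantiles of price_m2 should be close between the splits if stratification worked.
quants <- function(x) quantile(x, probs = c(0.25, 0.5, 0.75))
rbind(train = quants(data_train$price_m2),
      test  = quants(data_test$price_m2))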
Missing Data
There is missing data. Because some models (like OLS) cannot handle NAs, we will impute the missing values using a KNN algorithm.
vis_miss(Bilbao, cluster = TRUE)
Create a Recipe
The recipe() function allows us to preprocess the data before modeling, and the resulting recipe can be applied to both the training and the testing data. The advantage of using recipe() is that it avoids data leakage between data sets. Data leakage occurs when data transformation steps are computed on the entire data set before it is subdivided into training and testing splits. E.g., if a min-max transformation were applied to the whole data set before splitting, the individual splits would be biased towards the global minimum and maximum. In a resampling and cross-validation context, recipe() ensures that the data preprocessing is re-estimated within every iteration of data splitting.
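As a small illustration of the idea, a min-max (range) step learns its minimum and maximum from the training split only, and bake() then reuses those training values on the test split (a minimal, hypothetical sketch that is not part of the modeling pipeline below; step_range() is the min-max scaler in recipes):
# Hypothetical illustration: no test-set information leaks into the preprocessing.
illustration_rec <- recipe(price_m2 ~ ., data = data_train) %>%
  step_impute_knn(all_predictors(), neighbors = 10) %>%  # fill NAs first
  step_range(all_numeric_predictors()) %>%               # min/max learned on the training data only
  prep(training = data_train)
bake(illustration_rec, new_data = data_test) %>% head()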
model_rec <- recipe(
price_m2 ~ .,
data= data_train) %>%
step_zv(all_predictors()) %>%
step_dummy(all_nominal()) %>%
step_impute_knn(all_predictors(), neighbors = 10) %>%
prep(training = data_train, retain = TRUE, verbose = TRUE)
oper 1 step zv [training]
oper 2 step dummy [training]
oper 3 step impute knn [training]
The retained training set is ~ 0.97 Mb in memory.
trainSet.prep <- bake(model_rec, new_data = data_train, composition = 'matrix')
trainSet <- as.data.frame(trainSet.prep)
testSet.prep <- bake(model_rec, new_data = data_test, composition = 'matrix')
testSet <- as.data.frame(testSet.prep)
Experimenting with various Models
Let’s first have a look at how well different models perform on the data. For the resampling method we perform a 10-fold cross validation repeated 5 times.
Initialize Models
myControl = trainControl(method = 'repeatedcv',
                         number = 10,
                         repeats = 5,
                         verboseIter = FALSE,
                         savePredictions = TRUE,
                         allowParallel = TRUE)
parallel_start(6)
set.seed(174)
Linear.Model = train(price_m2 ~.,
data = trainSet,
metric = 'RMSE',
method = 'lm',
preProcess = c('center', 'scale'),
trControl = myControl)
set.seed(174)
Glmnet.Model = train(price_m2 ~ .,
data = trainSet ,
metric = 'RMSE',
method = 'glmnet',
preProcess = c('center', 'scale'),
trControl = myControl)
set.seed(174)
Rapid.Ranger = train(price_m2 ~ .,
data = trainSet,
metric = 'RMSE',
method = 'ranger',
preProcess = c('center', 'scale'),
trControl = myControl)
set.seed(174)
Basic.Knn <- train(price_m2 ~ .,
method = "knn",
tuneGrid = expand.grid(k =1:3),
trControl = myControl,
metric= "RMSE",
data = trainSet)
set.seed(174)
Xgb.Super <- train(price_m2~.,
method = "xgbTree",
tuneLength = 4,
trControl = myControl,
metric= "RMSE",
data = trainSet)
parallel_stop()
suite.of.models = list("LINEAR.MODEL" = Linear.Model,
"GLMNET.MODEL" = Glmnet.Model,
"RANGER.QUEST" = Rapid.Ranger,
"KNN.SIMPLE" = Basic.Knn,
"XGB.SUPER"= Xgb.Super)
resamps = resamples(suite.of.models)
dotplot(resamps, metric = 'RMSE')
XGBoost performs best on the training data. Let’s test it on the test data.
Testing Models on the Test Set
Evaluate.Prediction <- function(model, model.label, testData, ytest, grid = NULL) {
  # capture prediction time
  ptm <- proc.time()
  # use test data to make predictions
  pred <- predict(model, testData)
  tm <- proc.time() - ptm
  # compute test-set metrics
  Pred.metric <- postResample(pred = pred, obs = ytest)
  RMSE.test <- Pred.metric[[1]]
  RSquared.test <- Pred.metric[[2]]
  MAE.test <- Pred.metric[[3]]
  # collect the results, appending to an existing grid if one is supplied
  if (is.null(grid)) {
    Summarised.results <- data.frame(predictor = model.label, RMSE = RMSE.test,
                                     RSquared = RSquared.test, MAE = MAE.test,
                                     time = tm[[3]])
  } else {
    .grid <- data.frame(predictor = model.label, RMSE = RMSE.test,
                        RSquared = RSquared.test, MAE = MAE.test,
                        time = tm[[3]])
    Summarised.results <- rbind(grid, .grid)
  }
  Summarised.results
}
METRIC.GRID <- Evaluate.Prediction (Rapid.Ranger, "RAPID.QUEST", testSet, testSet$price_m2, grid=NULL)
METRIC.GRID <- Evaluate.Prediction (Glmnet.Model, "GLMNET.MODEL", testSet, testSet$price_m2, grid=METRIC.GRID)
METRIC.GRID <- Evaluate.Prediction (Basic.Knn, "KNN.SIMPLE", testSet, testSet$price_m2, grid=METRIC.GRID)
METRIC.GRID <- Evaluate.Prediction (Linear.Model, "LINEAR.MODEL", testSet, testSet$price_m2, grid=METRIC.GRID)
METRIC.GRID <- Evaluate.Prediction (Xgb.Super, "XGB.SUPER", testSet, testSet$price_m2, grid=METRIC.GRID)
kable(METRIC.GRID[order(METRIC.GRID$RMSE, decreasing=F),]) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| | predictor | RMSE | RSquared | MAE | time |
|---|---|---|---|---|---|
| 5 | XGB.SUPER | 773.6237 | 0.6066140 | 576.2671 | 0.035 |
| 1 | RAPID.QUEST | 805.4272 | 0.5733575 | 584.3447 | 0.032 |
| 2 | GLMNET.MODEL | 822.0240 | 0.5559522 | 607.2007 | 0.013 |
| 4 | LINEAR.MODEL | 822.3261 | 0.5553493 | 607.3631 | 0.006 |
| 3 | KNN.SIMPLE | 1023.4154 | 0.3434201 | 720.6059 | 0.063 |
Also here, XGBoost has the smallest RMSE and highest R2. Let’s tune this model for further examination.
XGBoost Modelling
Preprocessing
First, we again divide the data into training and testing samples. As before, we use price_m2 as strata to ensure that the distribution of the response is similar between the training and testing sets. Then we create a recipe.
set.seed(123)
split <- initial_split(Bilbao, prop = 0.75,
strata = "price_m2")
data_train <- training(split)
data_test <- testing(split)
XGB_rec <- recipe(
price_m2 ~ .,
data= data_train) %>%
step_zv(all_predictors()) %>%
step_dummy(all_nominal()) %>%
step_impute_knn(all_predictors(), neighbors = 10) %>%
prep()
Apply the preprocessing and randomly divide the training data into cross-validation folds.
set.seed(123)
cv_folds <-recipes::bake(
XGB_rec,
new_data = data_train)%>%
rsample::vfold_cv(v = 5)
train.ready <- juice(XGB_rec)
test.ready <- bake(XGB_rec, new_data = data_test)
Modelling specifications
Define XGBoost modelling specifications and hyper parameters for tuning.
Model.XGB <-
boost_tree(
mode = "regression",
trees = 1000,
min_n = tune(),
tree_depth = tune(),
learn_rate = tune(),
loss_reduction = tune()) %>%
set_engine("xgboost", objective = "reg:squarederror")Specify the model parameters
# grid specification
XGB.aspects <-
dials::parameters(
min_n(),
tree_depth(),
learn_rate(),
loss_reduction())
Grid Space
Set up a grid space which covers the hyper parameters XGB.aspects.
xgboost_grid <-
dials::grid_max_entropy(
XGB.aspects, size = 200)
kable(head(xgboost_grid))
| min_n | tree_depth | learn_rate | loss_reduction |
|---|---|---|---|
| 19 | 4 | 0.0000000 | 0.0024421 |
| 15 | 11 | 0.0502801 | 0.0000204 |
| 38 | 3 | 0.0000000 | 0.0000000 |
| 7 | 10 | 0.0000001 | 0.0002759 |
| 38 | 8 | 0.0000000 | 0.0000001 |
| 37 | 5 | 0.0287556 | 0.0000003 |
Create a workflow
xgboost_wf <-
workflows::workflow() %>%
add_model(Model.XGB) %>%
add_formula(price_m2 ~ .)
Hyper parameter searching
In this step, R searches for the optimal hyper parameters by iteratively evaluating the candidate combinations on the resampled training folds. This step can take a while to compute.
# Long-running tuning step: the result is saved to disk and reloaded below.
parallel_start(6)
TUNE.XGB <- tune::tune_grid(
object = xgboost_wf,
resamples = cv_folds,
grid = xgboost_grid,
metrics = yardstick::metric_set(yardstick::rmse, yardstick::rsq, yardstick::rsq_trad, yardstick::mae),
control = tune::control_grid(verbose = FALSE))
parallel_stop()
saveRDS(TUNE.XGB, file = here("Data", "Houses_Bilbao", "Tune_XGB.RData"))
Finalize optimal tune
Extract parameters with lowest RMSE.
TUNE.XGB <- readRDS(here("Data", "Houses_Bilbao","Tune_XGB.RData"))
param_final <- TUNE.XGB %>% select_best(metric = "rmse")
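It can also be informative to look at the best few candidates rather than only the single winner, which tune provides via show_best() (a minimal sketch):
# Top hyper parameter combinations ranked by cross-validated RMSE.
TUNE.XGB %>%
  show_best(metric = "rmse", n = 5)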
Finalize XGBoost with optimal tune.
xgboost_wf2 <- xgboost_wf %>%
finalize_workflow(param_final)
Fit the final model
Fit final model on the preprocessed training data.
XGB.model <- xgboost_wf2 %>%
fit(train.ready)
Extract important features.
XGB.model %>%
pull_workflow_fit() %>%
vip()
The number of bathrooms, whether the apartment has a lift, and the construction date seem to be the most important determinants of the square meter price.
Predict on Test set.
Let’s evaluate the XGBoost model on the test data by computing the RMSE.
# use the training model fit to predict the test data
XGB_res <- predict(XGB.model, new_data = test.ready %>% select(-price_m2))
XGB_res <- bind_cols(XGB_res, test.ready %>% select(price_m2))
XGB_metrics <- metric_set(yardstick::rmse, yardstick::mae)
kable(XGB_metrics(XGB_res, truth = price_m2, estimate = .pred))
| .metric | .estimator | .estimate |
|---|---|---|
| rmse | standard | 799.5876 |
| mae | standard | 590.3934 |
We can assess the fit of the predictions by plotting them against the actual observations from the testing sample. The model works well up to about 5000 Euro per square meter; for higher prices it tends to underestimate the real values, as the plot and the quick check below show.
ggplot(XGB_res, aes(x = price_m2, y = .pred)) +
# Create a diagonal line:
geom_abline(lty = 2) +
geom_point(alpha = 0.5) +
labs(y = "Predicted Sale Price", x = "SalePrice") +
# Scale and size the x- and y-axis uniformly:
coord_obs_pred() +
theme_minimal()
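To back up the visual impression with numbers, we can compare the mean residual (observed minus predicted) below and above 5000 Euro per square meter (a minimal sketch using the XGB_res tibble created above):
# Positive mean residuals indicate underestimation, expected mainly in the high-price segment.
XGB_res %>%
  mutate(segment = if_else(price_m2 > 5000, "above 5000", "up to 5000")) %>%
  group_by(segment) %>%
  summarise(n = n(), mean_residual = mean(price_m2 - .pred))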
Explainer
Finally, we can explore the model predictions using an explainer passed to modelStudio(), which supplies feature importance metrics, break-down charts, and Shapley values.
explainer<- explain_tidymodels(model = XGB.model,
data = test.ready,
y = test.ready$price_m2,
label = "XGBoost")Preparation of a new explainer is initiated
-> model label : XGBoost
-> data : 1063 rows 39 cols
-> data : tibble converted into a data.frame
-> target variable : 1063 values
-> predict function : yhat.workflow will be used ( default )
-> predicted values : No value for predict function target column. ( default )
-> model_info : package tidymodels , ver. 1.0.0 , task regression ( default )
-> predicted values : numerical, min = 744.7933 , mean = 3149.823 , max = 7328.426
-> residual function : difference between y and yhat ( default )
-> residuals : numerical, min = -2960.2 , mean = 17.77691 , max = 3933.913
A new explainer has been created!
modelStudio(explainer)
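If an interactive dashboard is not needed, the same explainer also works with the static DALEX helpers, for example permutation-based feature importance and a break-down plot for a single listing (a minimal sketch; model_parts() and predict_parts() come from the DALEX package loaded alongside DALEXtra):
# Permutation feature importance on the test data ...
plot(model_parts(explainer))
# ... and a break-down explanation for one observation.
plot(predict_parts(explainer, new_observation = test.ready[1, ]))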