1. Introduction

This project is based on a Kaggle competition whose purpose is to estimate housing prices in Melbourne. Many participants used conventional machine learning methods such as Random Forest and Decision Tree. In this project, I apply Bayesian methods to linear mixed models (LMMs) to estimate the price.

After fitting each model, I'll compare them using \(R^2\). There is a well-known \(R^2\) for LMMs, \[R_{MM}^2 = \frac{\sigma_f^2 + \sigma_r^2}{\sigma_f^2 + \sigma_r^2 + \sigma_{\epsilon}^2}\] where \(\sigma_f^2\) is the fixed-effect variance, \(\sigma_r^2\) the random-effect variance, and \(\sigma_{\epsilon}^2\) the observation-level (residual) variance, but a comparable measure is less widely known in the Bayesian framework. Gelman et al. (2019) introduced an \(R^2\) for Bayesian regression models, so this project will use that measure alongside the frequentist LMM one. Put simply, the Bayesian \(R^2\) can be defined as \[R_B^2 = \frac{\textbf{Var}[\hat y]}{\textbf{Var}[\hat y] + \textbf{Var}[e]}\] where \(\hat y\) is the fitted value and \(e = y - \hat y\) the residual, computed within each posterior draw so that it reflects posterior uncertainty.
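The per-draw construction can be sketched directly in R. The following is a minimal hand-rolled illustration (not rstanarm's built-in `bayes_R2`); `y` and `yhat_draws` are simulated stand-ins for a response and a draws-by-observations matrix of posterior fitted values, not the report's data:

```r
# Sketch: posterior distribution of the Bayesian R^2 (Gelman et al., 2019),
# computed from a (draws x observations) matrix of posterior fitted values.
bayes_R2 <- function(y, yhat_draws) {
  var_fit   <- apply(yhat_draws, 1, var)   # Var[yhat] within each draw
  res_draws <- sweep(yhat_draws, 2, y)     # yhat - y (same variance as e = y - yhat)
  var_res   <- apply(res_draws, 1, var)    # Var[e] within each draw
  var_fit / (var_fit + var_res)            # one R^2 value per posterior draw
}

set.seed(1)
y <- rnorm(100)                                        # toy response
yhat_draws <- matrix(rnorm(10 * 100, mean = rep(y, each = 10)),
                     nrow = 10)                        # 10 toy posterior draws
r2 <- bayes_R2(y, yhat_draws)
summary(r2)                                            # a distribution, not a point value
```

Later in the report a single summary value is obtained by plugging in the posterior-mean fitted values instead; the draw-wise version above is what makes the measure reflect posterior uncertainty.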

2. EDA

rm(list = ls())
library(dplyr)
library(lubridate)
library(lme4)
library(MuMIn)
library(rstan)
library(ggplot2)
#install.packages("gridExtra")
library(gridExtra)
#install.packages("corrplot")
library(corrplot)
melb <- read.csv('melb_data.csv', header = TRUE)

0) data preprocessing

You may notice that there are many missing values in columns such as CouncilArea, BuildingArea, and YearBuilt. One might think these could be imputed by data augmentation, but we do not have enough information for a reliable imputation; for YearBuilt, for instance, there is no solid reason to fill the empty entries with the median or mean year. Also, 6194 rows remain after deleting all rows containing NA along with a few implausible entries (zero building area, blank council area, build years before 1800). So I decided to perform the analysis on this cleaned subset.

nrow(melb)
## [1] 13580
melb_om <- melb %>%
  na.omit() %>%
  filter(BuildingArea != 0,
         CouncilArea != "",
         YearBuilt >= 1800)
nrow(melb_om)
## [1] 6194

1) histogram of price

The following figure shows histograms of the housing price. Since the raw prices are right-skewed, a log transformation is applied so that the distribution resembles a normal one.

h1 <- ggplot(melb_om, aes(x = Price)) + 
  geom_histogram(binwidth = 100000, fill = 'skyblue', color = 'black') + 
  labs(title = "Histogram of Price", x = 'Price', y = 'Frequency') +
  theme(plot.title = element_text(hjust = 0.5))

h2 <- ggplot(melb_om, aes(x = log(Price))) + 
  geom_histogram(binwidth = 0.1, fill = 'green', color = 'black') + 
  labs(title = 'Histogram of log(Price)', x = 'log(Price)', y = 'Frequency') +
  theme(plot.title = element_text(hjust = 0.5))
  
grid.arrange(h1, h2, nrow = 1)

2) by house type

In the Type column, h represents a house, t a townhouse, and u a unit. Setting aside the outliers, houses are the most expensive and units the cheapest.

ggplot(melb_om, aes(x = Type, y = Price, fill = Type)) + 
  geom_boxplot() + 
  labs(title = "Boxplot of Price by Type", x = "Type", y = "Price") + 
  theme(plot.title = element_text(hjust = 0.5))

3) region name

This is a boxplot of housing price by region name. Judging from the medians, prices clearly differ across regions.

ggplot(melb_om, aes(x = Regionname, y = Price, fill = Regionname)) + 
  geom_boxplot() + 
  labs(title = 'Boxplot of Price by Regionname', x = "Region name", y = "Price") +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 90, hjust = 1)) + 
  scale_fill_discrete(name = "Region name")

4) year & price

The year a house was built appears to have little impact on its price.

ggplot(melb_om, aes(x = YearBuilt, y= Price)) + 
  geom_point() + 
  labs(title = "Scatter Plot of YearBuilt and Price", x='Year Built', y='Price')+ 
  theme(plot.title = element_text(hjust = 0.5))

5) room & price

According to this scatter plot, the house price reaches its maximum when the number of rooms is four or five.

ggplot(melb_om, aes(x = Rooms, y = Price)) + 
  geom_point() + 
  labs(title = "Scatter Plot of Rooms and Price", x = 'Rooms', y = 'Price') + 
  theme(plot.title = element_text(hjust = 0.5))

6) distance & price

The scatter plot indicates that house prices peak at around 5 km from the CBD.

ggplot(melb_om, aes(x = Distance, y = Price)) + 
  geom_point() + 
  labs(title = "Scatter Plot of Distance and Price", x = 'Distance', y = 'Price')+
  theme(plot.title = element_text(hjust = 0.5))

7) car & price

According to this scatter plot, house prices reach their maximum when the number of car spaces is 2 or 3.

ggplot(melb_om, aes(x = Car, y = Price)) + 
  geom_point() + 
  labs(title = "Scatter Plot of Car and Price", x = 'Car', y = 'Price')+
  theme(plot.title = element_text(hjust = 0.5)) + 
  scale_x_continuous(breaks = seq(0, 10, by = 1))

8) building area & price

There is a general tendency for house prices to increase with larger area, but the pattern is not very stark.

ggplot(melb_om[melb_om$BuildingArea<=1000 & melb_om$Price <= 5000000,],
       aes(x = BuildingArea, y = Price, color = Type)) + 
  geom_point() + 
  labs(title = "Scatter Plot of BuildingArea and Price", x = 'Building Area', y = 'Price')+
  theme(plot.title = element_text(hjust = 0.5))

9) correlation

The correlation heatmap shows that the variables most strongly correlated with Price are Rooms, Type, Bedroom2, Bathroom, Car, BuildingArea, and YearBuilt.

melb_num <- melb_om %>%
  arrange(Date) %>%
  mutate(Date = row_number()) %>%
  mutate(
    Price = log(Price),
    Landsize = scale(Landsize),
    BuildingArea = scale(BuildingArea),
    Date = scale(Date),
    YearBuilt = scale(YearBuilt),
    Distance = scale(Distance)
  ) %>%
  mutate(Type = as.numeric(factor(Type)),
         Regionname = as.numeric(factor(Regionname)),
         Suburb = as.numeric(factor(Suburb)),
         CouncilArea = as.numeric(factor(CouncilArea)),
         SellerG = as.numeric(factor(SellerG)),
         Method = as.numeric(factor(Method))
  )

melb_cor <- cor(melb_num)

corrplot(melb_cor, method = "color",
         tl.col = "black", tl.srt = 90,
         addCoef.col = "black", 
         number.cex = 0.5, tl.cex = 0.7,
         #title = "Correlation Matrix Heatmap",
         mar = c(0,0,1,0)) 

10) variable selection

I decided to drop the variables "Method", "SellerG", "Postcode", "Bedroom2", "Date", "Landsize", "Lattitude", "Longtitude", and "Propertycount". Most of them have little to no correlation with the Price variable (Bedroom2, although correlated, largely duplicates Rooms), and we are not going to use the latitude and longitude information.

head(melb)
##       Suburb CouncilArea            Regionname Rooms Type   Price Method
## 1 Abbotsford       Yarra Northern Metropolitan     2    h 1480000      S
## 2 Abbotsford       Yarra Northern Metropolitan     2    h 1035000      S
## 3 Abbotsford       Yarra Northern Metropolitan     3    h 1465000     SP
## 4 Abbotsford       Yarra Northern Metropolitan     3    h  850000     PI
## 5 Abbotsford       Yarra Northern Metropolitan     4    h 1600000     VB
## 6 Abbotsford       Yarra Northern Metropolitan     2    h  941000      S
##   SellerG      Date Distance Postcode Bedroom2 Bathroom Car Landsize
## 1  Biggin 3/12/2016      2.5     3067        2        1   1      202
## 2  Biggin  4/2/2016      2.5     3067        2        1   0      156
## 3  Biggin  4/3/2017      2.5     3067        3        2   0      134
## 4  Biggin  4/3/2017      2.5     3067        3        2   1       94
## 5  Nelson  4/6/2016      2.5     3067        3        1   2      120
## 6  Jellis  7/5/2016      2.5     3067        2        1   0      181
##   BuildingArea YearBuilt Lattitude Longtitude Propertycount
## 1           NA        NA  -37.7996   144.9984          4019
## 2           79      1900  -37.8079   144.9934          4019
## 3          150      1900  -37.8093   144.9944          4019
## 4           NA        NA  -37.7969   144.9969          4019
## 5          142      2014  -37.8072   144.9941          4019
## 6           NA        NA  -37.8041   144.9953          4019
columns_to_exclude <- c("Method", "SellerG",  "Postcode", "Bedroom2", "Date", "Landsize", "Lattitude", "Longtitude", "Propertycount")

melb_da <- melb_num %>% select(-one_of(columns_to_exclude))

3. LMM

In this section, we use a linear mixed model. There are two candidate specifications: one lets house prices vary by housing type (Type), the other by residential region (Regionname). The main purpose of this section is to compare which specification is better. For the former, I let the slope of BuildingArea vary by Type; for the latter, the slope of Distance varies by Regionname.

1) basic setup

Data/notation

  • \(y \in \mathbb{R}^{N}\): response
  • \(X \in \mathbb{R}^{N \times p}\) with \(p=N_{\text{fixed}}\): fixed‐effect design
  • \(\beta \in \mathbb{R}^{p}\): fixed effects
  • \(g_i \in \{1,\dots,G\}\) with \(G=N_{\text{groups}}\): group for obs \(i\)
  • \(u_i \in \mathbb{R}\): covariate for random slope

Random‐effects design

  • \((Z_{\text{int}})_{i,g} = \mathbf{1}[g_i = g]\), size \(N \times G\)
  • \((Z_{\text{slope}})_{i,g} = u_i \,\mathbf{1}[g_i = g]\), size \(N \times G\)
  • \(Z = [\,Z_{\text{int}} \;\; Z_{\text{slope}}\,] \in \mathbb{R}^{N \times 2G}\)
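As a sanity check of this construction, the two blocks can be built for a toy grouping with base R (`g` and `u` below are illustrative values, not the report's data):

```r
# Sketch: build Z = [Z_int  Z_slope] for N = 5 observations, G = 3 groups.
g <- factor(c(1, 1, 2, 3, 2))          # group index g_i for each observation
u <- c(0.5, 1.2, -0.3, 0.8, 2.0)       # covariate u_i for the random slope
Z_int   <- model.matrix(~ g - 1)       # N x G indicators: (Z_int)[i, g] = 1[g_i = g]
Z_slope <- Z_int * u                   # N x G: u_i placed in obs i's group column
Z <- cbind(Z_int, Z_slope)             # N x 2G full random-effects design
dim(Z)                                 # 5 x 6
```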

Random effects (stacked)

  • \(b = \begin{bmatrix} b_{\text{int}} \\ b_{\text{slope}} \end{bmatrix}\), with
  • \(b_{\text{int}} = (\text{ranint}_1,\dots,\text{ranint}_G)^\top\),
  • \(b_{\text{slope}} = (\text{ranslope}_1,\dots,\text{ranslope}_G)^\top\)

Hierarchical (conditional) model

  • Likelihood:
    \[ y \mid b \sim \mathcal{N}\!\big(X\beta + Z b,\; \sigma_\varepsilon^2 I_N\big) \]
  • Random effects (independent blocks):
    \[ b_{\text{int}} \sim \mathcal{N}\!\big(0,\; \sigma_{\text{int}}^2 I_G\big),\quad b_{\text{slope}} \sim \mathcal{N}\!\big(0,\; \sigma_{\text{slope}}^2 I_G\big) \] Equivalently, \[ b \sim \mathcal{N}\!\left(0,\; \Sigma_b = \begin{bmatrix} \sigma_{\text{int}}^2 I_G & 0\\ 0 & \sigma_{\text{slope}}^2 I_G \end{bmatrix}\right) \] (writing \(\Sigma_b\) for the covariance, since \(G\) already denotes the number of groups)
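For reference, integrating out \(b\) gives the marginal form that frequentist LMM software such as lme4 maximizes: \[ y \sim \mathcal{N}\!\big(X\beta,\; Z\,\mathrm{Cov}(b)\,Z^\top + \sigma_\varepsilon^2 I_N\big), \] with \(\mathrm{Cov}(b)\) the block-diagonal covariance above. The Bayesian fits below instead sample \(b\) explicitly.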

2) group is 'Type' while 'BuildingArea' is the predictor for the random slope

We define model.lmm.Type, which will be used in the Rstan code. It corresponds to an LMM with Gaussian priors for all random effects and a Gaussian error for \(\text{log(Price)}\).

model.lmm.Type = "
data {
  int<lower=0> Nobs;                       // number of observations
  int<lower=0> Nfixed;                     // number of fixed-effect covariates
  int<lower=0> Ngroups;                    // number of groups (Type), 3
  vector[Nobs] y;                          // response 
  matrix[Nobs,Nfixed] X;                   // fixed-effect design matrix
  vector[Nobs] u;                          // random-effect covariates
  int<lower=1,upper=Ngroups> group[Nobs];  // group (Type) indices
}
parameters {
  vector[Nfixed] beta;                     // fixed effects
  real<lower=0> sigmaint;                  // random intercept s.d.
  real<lower=0> sigmaslope;                // random slope s.d.
  real<lower=0> sigmaeps;                  // error s.d.
  vector[Ngroups] ranint;                  // random intercepts
  vector[Ngroups] ranslope;                // random slope
}
transformed parameters {
  vector[Nobs] yhat;
  for (i in 1:Nobs){
    yhat[i] = X[i]*beta + ranint[group[i]] + ranslope[group[i]]*u[i];
  }
}
model {
  ranint ~ normal(0, sigmaint);
  ranslope ~ normal(0, sigmaslope);
  y ~ normal(yhat, sigmaeps);
}
" 
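One thing worth noting: the model block declares no priors for \(\beta\) or the scale parameters, so Stan implicitly gives them flat (improper) priors, half-flat for the `<lower=0>` scales. A sketch of weakly informative priors that could be added to the model block (the specific scales below are illustrative choices, not part of the original analysis):

```stan
// illustrative weakly informative priors (not in the original model)
beta ~ normal(0, 5);          // fixed effects on the log-price scale
sigmaint ~ cauchy(0, 2.5);    // half-Cauchy, via the <lower=0> constraint
sigmaslope ~ cauchy(0, 2.5);
sigmaeps ~ cauchy(0, 2.5);
```

Priors like these often help with the treedepth and low-ESS warnings reported below.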

The response y is the log housing price, while the matrix X contains an intercept column and the remaining covariates, excluding Price and Type.

y <- as.numeric(melb_da$Price)

Xtemp <- as.data.frame(cbind(1, melb_da))
X <- Xtemp %>% select(-Price, -Type)

Compile the model and run the sampler as follows.

sm_lmm_Type = stan_model(model_code=model.lmm.Type)

fit_lmm_Type = sampling(
  object = sm_lmm_Type,
  data = list(Nobs = nrow(X),
              Nfixed = ncol(X),
              Ngroups = length(unique(Xtemp$Type)),
              y = y,
              X = X,
              u = as.numeric(Xtemp$BuildingArea),
              group = Xtemp$Type),
  seed = 2023311161,
  chains = 4,    
  warmup = 1000,
  iter   = 2000, 
  cores  = 4    
) 
## Warning: There were 1059 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
## https://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
## Warning: Examine the pairs() plot to diagnose sampling problems
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#bulk-ess
## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#tail-ess

i) traceplot

Some fixed effects \(\beta\) have poor convergence.

traceplot(fit_lmm_Type,pars=c("beta"))

The poor convergence occurs in the random effects as well.

traceplot(fit_lmm_Type, pars = c("ranint"))

traceplot(fit_lmm_Type, pars = c("ranslope"))

ii) histogram

Using the above result, this shows how well the LMM with Type as the group fits the original data. The red and blue lines represent the Rstan fit and the frequentist LMM, respectively. Neither model explains houses in the mid-price range well.

posterior_lmm_Type <- extract(fit_lmm_Type)
y_hat_Type <- posterior_lmm_Type$yhat
y_hat_Type_mean <- apply(y_hat_Type, 2, mean)

hist(y, main = 'Histogram of log(Price) and Densities, group = Type',
     xlab = "price", ylab = "Density", cex.main = 1,
     ylim = range(0, max(density(y_hat_Type_mean)$y, density(y)$y )),
     breaks = seq(min(y),max(y), length.out= 50), freq = FALSE )
lines(density(y_hat_Type_mean), col = "red")


lmm_freq_Type <- lmer(Price ~ Suburb + CouncilArea + Regionname + Rooms + Distance + Bathroom + Car + BuildingArea + YearBuilt + (BuildingArea | Type), data = Xtemp)

y_hat_Type_freq <- fitted(lmm_freq_Type)
lines(density(y_hat_Type_freq), col = "blue")
legend("topright", legend = c('Rstan', 'Frequentist'), col = c('red','blue'), lwd = 2, cex = 0.7)

iii) \(R^2\)

The first value is the \(R^2\) under the Bayesian framework,

var(y_hat_Type_mean)/(var(y_hat_Type_mean) + var(y-y_hat_Type_mean))
## [1] 0.6610601

while the second is the conventional \(R^2\) for the LMM: R2m is the marginal \(R^2\) (fixed effects only) and R2c the conditional \(R^2\) (fixed plus random effects).

r2_lmm_freq_Type <- MuMIn::r.squaredGLMM(lmm_freq_Type)
r2_lmm_freq_Type # 0.6456741
##            R2m       R2c
## [1,] 0.5738016 0.6456741

3) group is 'Regionname' while 'Distance' is the predictor for the random slope

Let us shift to the case where Regionname is the group. The following model.lmm.Rn has the same structure as the previous one: an LMM with Gaussian priors for all random effects and a Gaussian error for \(\text{log(Price)}\).

model.lmm.Rn = "
data {
  int<lower=0> Nobs;     // number of observations
  int<lower=0> Nfixed;   // number of fixed-effect covariates
  int<lower=0> Ngroups;  // number of groups (Regionname)
  vector[Nobs] y;        // response 
  matrix[Nobs,Nfixed] X; // fixed-effect design matrix
  vector[Nobs] u;  // random-effect covariates (Distance)
  int<lower=1,upper=Ngroups> group[Nobs]; // group (Regionname) indices
}
parameters {
  vector[Nfixed] beta;      // fixed effects
  real<lower=0> sigmaint;   // random intercept s.d.
  real<lower=0> sigmaslope; // random slope s.d.
  real<lower=0> sigmaeps;   // error s.d.
  vector[Ngroups] ranint;   // random intercepts
  vector[Ngroups] ranslope; // random slope
}
transformed parameters {
  vector[Nobs] yhat;
  for (i in 1:Nobs){
    yhat[i] = X[i]*beta + ranint[group[i]] + ranslope[group[i]]*u[i];
  }
}
model {
  ranint ~ normal(0, sigmaint);
  ranslope ~ normal(0, sigmaslope);
  y ~ normal(yhat, sigmaeps);
}
" 

The response y is the log housing price, while the matrix X contains an intercept column and the remaining covariates, excluding Price and Regionname.

y <- as.numeric(melb_da$Price)

Xtemp <- as.data.frame(cbind(1, melb_da))
X <- Xtemp %>% select(-Price, -Regionname)

Compile the model and run the sampler as follows.

sm_lmm_Rn = stan_model(model_code=model.lmm.Rn)

fit_lmm_Rn = sampling(
  object = sm_lmm_Rn,
  data = list(Nobs = nrow(X),
              Nfixed = ncol(X),
              Ngroups = length(unique(melb_da$Regionname)),
              y = y,
              X = X,
              u = as.numeric(X$Distance),
              group = Xtemp$Regionname),
  seed = 2023311161,
  chains = 4,    
  warmup = 1000, 
  iter   = 2000, 
  cores  = 4     
) 
## Warning: There were 557 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
## https://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
## Warning: Examine the pairs() plot to diagnose sampling problems

i) traceplot

Compared to the Type case, the fixed effect coefficients converged much better.

traceplot(fit_lmm_Rn,pars=c("beta"))

The random effects also converge better.

traceplot(fit_lmm_Rn, pars = c("ranint"))

traceplot(fit_lmm_Rn, pars = c("ranslope"))

ii) histogram

This shows how well the LMM with Regionname as the group fits the original data. The red and blue lines represent the Rstan fit and the frequentist LMM, respectively. Both fits explain the price data better than the previous model.

posterior_lmm_Rn <- extract(fit_lmm_Rn)
y_hat_Rn <- posterior_lmm_Rn$yhat
y_hat_Rn_mean <- apply(y_hat_Rn, 2, mean)

hist(y, main = 'Histogram of log(Price) and Densities, group = Regionname',
     xlab = "price", ylab = "Density", cex.main = 1,
     ylim = range(0, max(density(y_hat_Rn_mean)$y, density(y)$y )),
     breaks = seq(min(y),max(y), length.out= 50), freq = FALSE )
lines(density(y_hat_Rn_mean), col = "red")


lmm_freq_Rn <- lmer(Price ~ Suburb + CouncilArea + Rooms + Type + Distance  + Bathroom + Car + BuildingArea + YearBuilt + (Distance | Regionname), data = Xtemp)

y_hat_Rn_freq <- fitted(lmm_freq_Rn)
lines(density(y_hat_Rn_freq), col = "blue")
legend("topright", legend = c('Rstan', 'Frequentist'), col = c('red','blue'), lwd = 2, cex = 0.7)

iii) \(R^2\)

The \(R^2\) results also indicate that the LMM with Regionname as the group is better: both values are above \(0.75\), while those of the previous model are about \(0.65\).

var(y_hat_Rn_mean)/(var(y_hat_Rn_mean) + var(y-y_hat_Rn_mean)) # 0.7584194
## [1] 0.7584194
r2_lmm_freq_Rn <- MuMIn::r.squaredGLMM(lmm_freq_Rn)
r2_lmm_freq_Rn # 0.7669087
##            R2m       R2c
## [1,] 0.6072761 0.7669087

4) explanation for the difference

  • The boxplots show strong between-region median differences, while type (h/t/u) has broad within-type dispersion.
  • Letting the Distance slope vary by Regionname captures region-specific accessibility/amenity trade-offs. By contrast, the area premium is relatively similar across types once rooms/baths are in the fixed effects.
  • Type has only 3 groups, so the random-effects structure is almost a fixed three-parameter tweak. However, Regionname offers many groups, letting the model learn rich intercept/slope variation.
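The last point can be illustrated with a toy simulation (purely illustrative, no report data involved): estimating a standard deviation from only 3 group-level draws is far noisier than from more, which is roughly the situation of the Type model (3 groups) versus the Regionname model with its larger number of regions.

```r
# Toy: sampling variability of an s.d. estimate based on G group effects.
set.seed(42)
sd_of_sd_hat <- function(G, reps = 2000) sd(replicate(reps, sd(rnorm(G))))
c(G3 = sd_of_sd_hat(3), G8 = sd_of_sd_hat(8))   # markedly noisier with fewer groups
```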