Bayesian Statistics Project 2
2024-06-05
1. Introduction
This project is based on one of the Kaggle competitions, the purpose of which is to estimate housing prices in Melbourne. Many participants made use of conventional machine learning methods such as Random Forest and Decision Tree. In this project, I apply a Bayesian approach to linear mixed models (LMMs) to estimate the price.
After fitting each model, I compare them using \(R^2\). There is a well-known \(R^2\) for LMMs, \[R_{MM}^2 = \frac{\sigma_f^2 + \sigma_r^2}{\sigma_f^2 + \sigma_r^2 + \sigma_{\epsilon}^2},\] where \(\sigma_f^2\) is the fixed-effect variance, \(\sigma_r^2\) the random-effect variance, and \(\sigma_{\epsilon}^2\) the observation-level (residual) variance, but a comparable measure is not widely known in the Bayesian framework. Gelman et al. (2019) introduced an R-squared for Bayesian regression models, so this project uses that measure alongside the frequentist LMM one. Put simply, the Bayesian \(R^2\) can be defined as \[R_B^2 = \frac{\textbf{Var}[\hat y]}{\textbf{Var}[\hat y] + \textbf{Var}[e]},\] where \(\hat y\) is the fitted value and \(e = y - \hat y\) is the residual; computed once per posterior draw, it reflects posterior uncertainty.
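As a minimal sketch of the Gelman et al. (2019) measure (on simulated draws, not this project's data), the Bayesian \(R^2\) is computed once per posterior draw, yielding a distribution of \(R^2\) values rather than a single number:

```r
# Hedged sketch: draw-wise Bayesian R^2 on simulated posterior draws.
# S draws of fitted values for N observations stand in for a real
# posterior (e.g. the yhat draws extracted from a Stan fit).
set.seed(1)
N <- 200; S <- 1000
x <- rnorm(N)
y <- 1 + 2 * x + rnorm(N, sd = 1)

# Fake posterior draws of fitted values: true line plus draw-level
# coefficient noise (rows = draws, columns = observations)
yhat_draws <- t(sapply(seq_len(S),
                       function(s) (1 + rnorm(1, 0, 0.05)) +
                                   (2 + rnorm(1, 0, 0.05)) * x))

# One R^2 per draw: Var(yhat_s) / (Var(yhat_s) + Var(y - yhat_s))
r2_draws <- apply(yhat_draws, 1, function(yhat_s) {
  var(yhat_s) / (var(yhat_s) + var(y - yhat_s))
})
c(mean = mean(r2_draws), quantile(r2_draws, c(0.025, 0.975)))
```

The posterior mean and interval of `r2_draws` summarize fit together with its uncertainty; later in this report a plug-in version based on posterior-mean fitted values is used instead.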
2. EDA
rm(list = ls())
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(lme4)
## Loading required package: Matrix
library(MuMIn)
library(rstan)
## Loading required package: StanHeaders
##
## rstan version 2.32.7 (Stan version 2.32.2)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
## For within-chain threading using `reduce_sum()` or `map_rect()` Stan functions,
## change `threads_per_chain` option:
## rstan_options(threads_per_chain = 1)
## Do not specify '-march=native' in 'LOCAL_CPPFLAGS' or a Makevars file
library(ggplot2)
#install.packages("gridExtra")
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
#install.packages("corrplot")
library(corrplot)
## corrplot 0.95 loaded
melb <- read.csv('melb_data.csv', header = TRUE)
0) data preprocessing
You may notice that there are many missing values in columns such as
CouncilArea, BuildingArea, and
YearBuilt. Those could in principle have been imputed by
data augmentation, but we don't have enough information for the
imputation. In the case of YearBuilt, for instance, there is
no solid reason to impute the median or mean year into the empty cells
of the column. Also, we still have 6194 rows left after deleting all
rows containing NA values and filtering out implausible entries. So I
decided to perform the analysis after deleting all NAs.
nrow(melb)
## [1] 13580
melb_om <- melb %>%
na.omit() %>%
filter(BuildingArea != 0,
CouncilArea != "",
YearBuilt >= 1800)
nrow(melb_om)
## [1] 6194
1) histogram of price
The following figure shows histograms of housing price. Since the raw prices are right-skewed, a log transformation is applied so that the result more closely resembles a normal distribution.
h1 <- ggplot(melb_om, aes(x = Price)) +
geom_histogram(binwidth = 100000, fill = 'skyblue', color = 'black') +
labs(title = "Histogram of Price", x = 'Price', y = 'Frequency') +
theme(plot.title = element_text(hjust = 0.5))
h2 <- ggplot(melb_om, aes(x = log(Price))) +
geom_histogram(binwidth = 0.1, fill = 'green', color = 'black') +
labs(title = 'Histogram of log(Price)', x = 'log(Price)', y = 'Frequency') +
theme(plot.title = element_text(hjust = 0.5))
grid.arrange(h1, h2, nrow = 1)
2) by house type
In the Type column, h represents house,
t townhouse, and u unit. Putting the
outliers aside, houses are the most expensive and
units the cheapest.
ggplot(melb_om, aes(x = Type, y = Price, fill = Type)) +
geom_boxplot() +
labs(title = "Boxplot of Price by Type", x = "Type", y = "Price") +
theme(plot.title = element_text(hjust = 0.5))
3) region name
This is a boxplot of housing price by region name. Given the median of each region, prices clearly differ across regions.
ggplot(melb_om, aes(x = Regionname, y = Price, fill = Regionname)) +
geom_boxplot() +
labs(title = 'Boxplot of Price by Regionname', x = "Region name", y = "Price") +
theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_fill_discrete(name = "Region name")
4) year & price
It seems the year a house was built has little impact on its price.
ggplot(melb_om, aes(x = YearBuilt, y= Price)) +
geom_point() +
labs(title = "Scatter Plot of YearBuilt and Price", x='Year Built', y='Price')+
theme(plot.title = element_text(hjust = 0.5))
5) room & price
According to this scatter plot, the house price reaches its maximum when the number of rooms is four or five.
ggplot(melb_om, aes(x = Rooms, y = Price)) +
geom_point() +
labs(title = "Scatter Plot of Rooms and Price", x = 'Rooms', y = 'Price') +
theme(plot.title = element_text(hjust = 0.5))
6) distance & price
The scatter plot indicates that the house price reaches its maximum at around 5 km from the CBD.
ggplot(melb_om, aes(x = Distance, y = Price)) +
geom_point() +
labs(title = "Scatter Plot of Distance and Price", x = 'Distance', y = 'Price')+
theme(plot.title = element_text(hjust = 0.5))
7) car & price
According to this scatter plot, the house price reaches its maximum when the number of car spaces is 2 or 3.
ggplot(melb_om, aes(x = Car, y = Price)) +
geom_point() +
labs(title = "Scatter Plot of Car and Price", x = 'Car', y = 'Price')+
theme(plot.title = element_text(hjust = 0.5)) +
scale_x_continuous(breaks = seq(0, 10, by = 1))
8) building area & price
There is a general tendency for house prices to increase with larger area, but the pattern is not very stark.
ggplot(melb_om[melb_om$BuildingArea<=1000 & melb_om$Price <= 5000000,],
aes(x = BuildingArea, y = Price, color = Type)) +
geom_point() +
labs(title = "Scatter Plot of BuildingArea and Price", x = 'Building Area', y = 'Price')+
theme(plot.title = element_text(hjust = 0.5))
9) correlation
The correlation heatmap shows that the variables most strongly
correlated with Price are Rooms,
Type, Bedroom2, Bathroom,
Car, BuildingArea, and YearBuilt.
melb_num <- melb_om %>%
arrange(Date) %>%
mutate(Date = row_number()) %>%
mutate(
Price = log(Price),
Landsize = scale(Landsize),
BuildingArea = scale(BuildingArea),
Date = scale(Date),
YearBuilt = scale(YearBuilt),
Distance = scale(Distance)
) %>%
mutate(Type = as.numeric(factor(Type)),
Regionname = as.numeric(factor(Regionname)),
Suburb = as.numeric(factor(Suburb)),
CouncilArea = as.numeric(factor(CouncilArea)),
SellerG = as.numeric(factor(SellerG)),
Method = as.numeric(factor(Method))
)
melb_cor <- cor(melb_num)
corrplot(melb_cor, method = "color",
tl.col = "black", tl.srt = 90,
addCoef.col = "black",
number.cex = 0.5, tl.cex = 0.7,
#title = "Correlation Matrix Heatmap",
mar = c(0,0,1,0))
10) variable selection
I decided to drop the variables Method, SellerG,
Postcode, Bedroom2, Date, Landsize, Lattitude, Longtitude, and
Propertycount, because they have little to no correlation with the
price variable. We also do not use the latitude and longitude
information.
head(melb)
## Suburb CouncilArea Regionname Rooms Type Price Method
## 1 Abbotsford Yarra Northern Metropolitan 2 h 1480000 S
## 2 Abbotsford Yarra Northern Metropolitan 2 h 1035000 S
## 3 Abbotsford Yarra Northern Metropolitan 3 h 1465000 SP
## 4 Abbotsford Yarra Northern Metropolitan 3 h 850000 PI
## 5 Abbotsford Yarra Northern Metropolitan 4 h 1600000 VB
## 6 Abbotsford Yarra Northern Metropolitan 2 h 941000 S
## SellerG Date Distance Postcode Bedroom2 Bathroom Car Landsize
## 1 Biggin 3/12/2016 2.5 3067 2 1 1 202
## 2 Biggin 4/2/2016 2.5 3067 2 1 0 156
## 3 Biggin 4/3/2017 2.5 3067 3 2 0 134
## 4 Biggin 4/3/2017 2.5 3067 3 2 1 94
## 5 Nelson 4/6/2016 2.5 3067 3 1 2 120
## 6 Jellis 7/5/2016 2.5 3067 2 1 0 181
## BuildingArea YearBuilt Lattitude Longtitude Propertycount
## 1 NA NA -37.7996 144.9984 4019
## 2 79 1900 -37.8079 144.9934 4019
## 3 150 1900 -37.8093 144.9944 4019
## 4 NA NA -37.7969 144.9969 4019
## 5 142 2014 -37.8072 144.9941 4019
## 6 NA NA -37.8041 144.9953 4019
columns_to_exclude <- c("Method", "SellerG", "Postcode", "Bedroom2", "Date", "Landsize", "Lattitude", "Longtitude", "Propertycount")
melb_da <- melb_num %>% select(-all_of(columns_to_exclude))
3. LMM
In this section, we use a linear mixed model (LMM). There are two ways
to specify it: one lets house prices vary by housing type
(Type), and the other by residential region
(Regionname). The main purpose of this section is to
compare which specification is better. I assume the random slope
depends on BuildingArea in the former and on
Distance in the latter.
1) basic setup
Data/notation
- \(y \in \mathbb{R}^{N}\): response
- \(X \in \mathbb{R}^{N \times p}\) with \(p=N_{\text{fixed}}\): fixed-effect design
- \(\beta \in \mathbb{R}^{p}\): fixed effects
- \(g_i \in \{1,\dots,G\}\) with \(G=N_{\text{groups}}\): group of observation \(i\)
- \(u_i \in \mathbb{R}\): covariate for the random slope

Random-effects design
- \((Z_{\text{int}})_{i,g} = \mathbf{1}[g_i = g]\), size \(N \times G\)
- \((Z_{\text{slope}})_{i,g} = u_i \,\mathbf{1}[g_i = g]\), size \(N \times G\)
- \(Z = [\,Z_{\text{int}} \;\; Z_{\text{slope}}\,] \in \mathbb{R}^{N \times 2G}\)

Random effects (stacked)
- \(b = \begin{bmatrix} b_{\text{int}} \\ b_{\text{slope}} \end{bmatrix}\), with \(b_{\text{int}} = (\text{ranint}_1,\dots,\text{ranint}_G)^\top\) and \(b_{\text{slope}} = (\text{ranslope}_1,\dots,\text{ranslope}_G)^\top\)

Hierarchical (conditional) model
- Likelihood: \[ y \mid b \sim \mathcal{N}\!\big(X\beta + Z b,\; \sigma_\varepsilon^2 I_N\big) \]
- Random effects (independent blocks): \[ b_{\text{int}} \sim \mathcal{N}\!\big(0,\; \sigma_{\text{int}}^2 I_G\big),\quad b_{\text{slope}} \sim \mathcal{N}\!\big(0,\; \sigma_{\text{slope}}^2 I_G\big) \] Equivalently, \[ b \sim \mathcal{N}\!\left(0,\; \begin{bmatrix} \sigma_{\text{int}}^2 I_G & 0\\ 0 & \sigma_{\text{slope}}^2 I_G \end{bmatrix}\right) \]
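The random-effects design above can be assembled directly in base R. This is an illustrative sketch with a toy group factor and slope covariate (the toy sizes and names are assumptions, not from the analysis below):

```r
# Hedged sketch: build Z_int, Z_slope, and Z = [Z_int  Z_slope] for a toy example
set.seed(1)
N <- 6; G <- 3
g <- factor(c(1, 2, 3, 1, 2, 3))   # group of each observation
u <- rnorm(N)                      # random-slope covariate

Z_int   <- model.matrix(~ g - 1)   # N x G indicator matrix: 1[g_i = g]
Z_slope <- Z_int * u               # row i scaled by u_i (u recycles by row)
Z       <- cbind(Z_int, Z_slope)   # N x 2G

dim(Z)
# Given stacked b = c(b_int, b_slope), the conditional mean is X %*% beta + Z %*% b
```

This mirrors the `ranint[group[i]] + ranslope[group[i]]*u[i]` term in the Stan code below, which avoids materializing \(Z\) by indexing into the group vectors directly.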
2) group is Type while BuildingArea is the predictor for the random slope
We define model.lmm.Type, which will be used in the Rstan
code. This corresponds to the LMM, assuming Gaussian priors for all
random effects and Gaussian error for \(\text{log(Price)}\).
model.lmm.Type = "
data {
int<lower=0> Nobs; // number of observations
int<lower=0> Nfixed; // number of fixed-effect covariates
int<lower=0> Ngroups; // number of groups (Type), 3
vector[Nobs] y; // response
matrix[Nobs,Nfixed] X; // fixed-effect design matrix
vector[Nobs] u; // random-effect covariates
int<lower=1,upper=Ngroups> group[Nobs]; // group (Type) indices
}
parameters {
vector[Nfixed] beta; // fixed effects
real<lower=0> sigmaint; // random-intercept SD
real<lower=0> sigmaslope; // random-slope SD
real<lower=0> sigmaeps; // error SD
vector[Ngroups] ranint; // random intercepts
vector[Ngroups] ranslope; // random slopes
}
transformed parameters {
vector[Nobs] yhat;
for (i in 1:Nobs){
yhat[i] = X[i]*beta + ranint[group[i]] + ranslope[group[i]]*u[i];
}
}
model {
ranint ~ normal(0, sigmaint);
ranslope ~ normal(0, sigmaslope);
y ~ normal(yhat, sigmaeps);
}
"
The response y is the log housing price, while the matrix
X holds an intercept column and the remaining covariates except Type.
y <- as.numeric(melb_da$Price)
Xtemp <- as.data.frame(cbind(1, melb_da))
X <- Xtemp %>% select(-Price, -Type)
Compile the Stan model and fit it as follows.
sm_lmm_Type = stan_model(model_code=model.lmm.Type)
fit_lmm_Type = sampling(
object = sm_lmm_Type,
data = list(Nobs = nrow(X),
Nfixed = ncol(X),
Ngroups = length(unique(Xtemp$Type)),
y = y,
X = X,
u = as.numeric(Xtemp$BuildingArea),
group = Xtemp$Type),
seed = 2023311161,
chains = 4,
warmup = 1000,
iter = 2000,
cores = 4
)
## Warning: There were 1059 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
## https://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
## Warning: Examine the pairs() plot to diagnose sampling problems
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#bulk-ess
## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#tail-ess
i) traceplot
Some fixed effects \(\beta\) have poor convergence.
traceplot(fit_lmm_Type,pars=c("beta"))
The poor convergence occurs in the random effects as well.
traceplot(fit_lmm_Type, pars = c("ranint"))
traceplot(fit_lmm_Type, pars = c("ranslope"))
ii) histogram
Using the posterior draws, this shows how well the LMM with
Type as the group fits the original data. The red and blue
lines represent the Rstan and frequentist LMM fits, respectively. The
result shows that neither model explains houses in the mid-price range
very well.
posterior_lmm_Type <- extract(fit_lmm_Type)
y_hat_Type <- posterior_lmm_Type$yhat
y_hat_Type_mean <- apply(y_hat_Type, 2, mean)
hist(y, main = 'Histogram of log(Price) and Densities, group = Type',
xlab = "price", ylab = "Density", cex.main = 1,
ylim = range(0, max(density(y_hat_Type_mean)$y, density(y)$y )),
breaks = seq(min(y),max(y), length.out= 50), freq = FALSE )
lines(density(y_hat_Type_mean), col = "red")
lmm_freq_Type <- lmer(Price ~ Suburb + CouncilArea + Regionname + Rooms + Distance + Bathroom + Car + BuildingArea + YearBuilt + (BuildingArea | Type), data = Xtemp)
y_hat_Type_freq <- fitted(lmm_freq_Type)
lines(density(y_hat_Type_freq), col = "blue")
legend("topright", legend = c('Rstan', 'Frequentist'), col = c('red','blue'), lwd = 2, cex = 0.7)
iii) \(R^2\)
The first is the \(R^2\) under the Bayesian framework, computed here as a plug-in estimate from the posterior-mean fitted values,
var(y_hat_Type_mean)/(var(y_hat_Type_mean) + var(y-y_hat_Type_mean))
## [1] 0.6610601
while the second is the conventional \(R^2\) for the frequentist LMM; R2m is the marginal \(R^2\) (fixed effects only) and R2c the conditional \(R^2\) (fixed plus random effects), which corresponds to \(R_{MM}^2\).
r2_lmm_freq_Type <- MuMIn::r.squaredGLMM(lmm_freq_Type)
r2_lmm_freq_Type # 0.6456741
## R2m R2c
## [1,] 0.5738016 0.6456741
3) group is Regionname while Distance is the predictor for the random slope
Let us shift to the case where Regionname is the group.
The following model.lmm.Rn is the same as the former model:
it also corresponds to the LMM, assuming Gaussian priors for all
random effects and Gaussian error for \(\text{log(Price)}\).
model.lmm.Rn = "
data {
int<lower=0> Nobs; // number of observations
int<lower=0> Nfixed; // number of fixed-effect covariates
int<lower=0> Ngroups; // number of groups (Regionname)
vector[Nobs] y; // response
matrix[Nobs,Nfixed] X; // fixed-effect design matrix
vector[Nobs] u; // random-effect covariates (Distance)
int<lower=1,upper=Ngroups> group[Nobs]; // group (Regionname) indices
}
parameters {
vector[Nfixed] beta; // fixed effects
real<lower=0> sigmaint; // random-intercept SD
real<lower=0> sigmaslope; // random-slope SD
real<lower=0> sigmaeps; // error SD
vector[Ngroups] ranint; // random intercepts
vector[Ngroups] ranslope; // random slopes
}
transformed parameters {
vector[Nobs] yhat;
for (i in 1:Nobs){
yhat[i] = X[i]*beta + ranint[group[i]] + ranslope[group[i]]*u[i];
}
}
model {
ranint ~ normal(0, sigmaint);
ranslope ~ normal(0, sigmaslope);
y ~ normal(yhat, sigmaeps);
}
"
The response y is the log housing price, while the matrix
X holds an intercept column and the remaining covariates except
Regionname.
y <- as.numeric(melb_da$Price)
Xtemp <- as.data.frame(cbind(1, melb_da))
X <- Xtemp %>% select(-Price, -Regionname)
Compile the Stan model and fit it as follows.
sm_lmm_Rn = stan_model(model_code=model.lmm.Rn)
fit_lmm_Rn = sampling(
object = sm_lmm_Rn,
data = list(Nobs = nrow(X),
Nfixed = ncol(X),
Ngroups = length(unique(melb_da$Regionname)),
y = y,
X = X,
u = as.numeric(X$Distance),
group = Xtemp$Regionname),
seed = 2023311161,
chains = 4,
warmup = 1000,
iter = 2000,
cores = 4
)
## Warning: There were 557 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
## https://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
## Warning: Examine the pairs() plot to diagnose sampling problems
i) traceplot
Compared to the Type case, the fixed-effect coefficients
converge much better.
traceplot(fit_lmm_Rn,pars=c("beta"))
The better convergence holds for the random effects as well.
traceplot(fit_lmm_Rn, pars = c("ranint"))
traceplot(fit_lmm_Rn, pars = c("ranslope"))
ii) histogram
This shows how well the LMM with Regionname as the group
fits the original data. The red and blue lines represent the Rstan and
frequentist LMM fits, respectively. Both models explain the price data
better than the former ones.
posterior_lmm_Rn <- extract(fit_lmm_Rn)
y_hat_Rn <- posterior_lmm_Rn$yhat
y_hat_Rn_mean <- apply(y_hat_Rn, 2, mean)
hist(y, main = 'Histogram of log(Price) and Densities, group = Regionname',
xlab = "price", ylab = "Density", cex.main = 1,
ylim = range(0, max(density(y_hat_Rn_mean)$y, density(y)$y )),
breaks = seq(min(y),max(y), length.out= 50), freq = FALSE )
lines(density(y_hat_Rn_mean), col = "red")
lmm_freq_Rn <- lmer(Price ~ Suburb + CouncilArea + Rooms + Type + Distance + Bathroom + Car + BuildingArea + YearBuilt + (Distance | Regionname), data = Xtemp)
y_hat_Rn_freq <- fitted(lmm_freq_Rn)
lines(density(y_hat_Rn_freq), col = "blue")
legend("topright", legend = c('Rstan', 'Frequentist'), col = c('red','blue'), lwd = 2, cex = 0.7)
iii) \(R^2\)
The \(R^2\) results also indicate that the LMM with
Regionname as the group is better: both values are above
\(0.75\), while those of the former model were about \(0.65\).
var(y_hat_Rn_mean)/(var(y_hat_Rn_mean) + var(y-y_hat_Rn_mean)) # 0.7584194
## [1] 0.7584194
r2_lmm_freq_Rn <- MuMIn::r.squaredGLMM(lmm_freq_Rn)
r2_lmm_freq_Rn # 0.7669087
## R2m R2c
## [1,] 0.6072761 0.7669087
4) explanation for the difference
- The boxplots show strong between-region median differences, while type (h/t/u) has broad within-type dispersion.
- Letting the Distance slope vary by Regionname captures region-specific accessibility/amenity trade-offs. By contrast, the area premium is relatively similar across types once rooms/baths are in the fixed effects.
- Type has only 3 groups, so its random-effects structure is almost a fixed three-parameter tweak. Regionname offers many more groups, letting the model learn rich intercept/slope variation.