Multivariate Analysis Final Project
key words: PCA, MLM & PCM, Canonical Correlation Analysis (CCA)
Using kc-house-data, perform your analysis to explore price and other variables.
(0) Data Preprocess
The kc house data contains 21 columns, including identifiers such as id and date, the response price, and categorical data like waterfront, which indicates whether the house is close to a body of water. Additionally, the sqft_living column represents the total living area (in square feet), sqft_basement indicates the basement area, and sqft_above refers to the living area excluding the basement. Because sqft_living = sqft_above + sqft_basement, keeping all three columns creates an exact linear dependency, making the covariance matrix singular (non-invertible) and the calculations impossible; therefore, sqft_basement was excluded. In summary, the excluded columns are id, date, price, waterfront, sqft_basement, and zipcode.
original_data = read.csv("kc_house_data.csv",header=TRUE)
data <- original_data[,!colnames(original_data) %in% c('id','date','price','waterfront','sqft_basement', 'zipcode')]
price <- original_data$price
waterfront <- original_data$waterfront
original_colname <- colnames(data)
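The singularity issue above can be sketched with synthetic data (the variable names below merely mimic the kc columns and are not read from the file): a column that is the exact sum of two others drives one eigenvalue of the correlation matrix to zero, so the matrix is rank-deficient and effectively non-invertible.

```r
# Synthetic illustration (not the kc file): an exact linear dependency
# among columns makes the correlation matrix rank-deficient.
set.seed(1)
sqft_above    <- runif(100, 800, 3000)
sqft_basement <- runif(100, 0, 1200)
sqft_living   <- sqft_above + sqft_basement   # exact dependency

full <- cbind(sqft_above, sqft_basement, sqft_living)
R <- cor(full)
qr(R)$rank                # 2, not 3: one dimension is redundant
min(eigen(R)$values)      # essentially zero

# dropping one of the three columns restores full rank
qr(cor(full[, -2]))$rank  # 2 out of 2
```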
(1) PCA
According to the left figure, the eigenvalues of the preprocessed data, arranged in descending order, are 5.07, 1.80, 1.61, 1.13, 0.99, 0.92, etc., and their respective cumulative proportions are 33.8%, 45.9%, 56.6%, 64.1%, 70.7%, 76.8%, and so on. Ideally, the selected principal components should have eigenvalues greater than 1 and explain at least 70% of the variance. However, if only the eigenvalue > 1 criterion is applied, only 64% of the variance is explained. To ensure a stable explanation of at least 70% of the variance, six principal components (PCs) were chosen. Since the 5th and 6th eigenvalues are approximately 0.99 and 0.92, which are close to 1, this selection was considered to have minimal error.
In the right figure, the variables with absolute correlation values of 0.8 or higher in the first principal component (PC1) are sqft_above, sqft_living, grade, bathrooms, and sqft_living15, suggesting that PC1 represents house size. The second principal component (PC2) has sqft_lot and sqft_lot15 with correlation values of -0.8 and -0.85, respectively, indicating that PC2 represents lot size. The third principal component (PC3) has yr_built with the highest absolute correlation value (-0.65), meaning PC3 represents the year the house was built. The fourth principal component (PC4) has condition with the highest positive correlation (0.58) and yr_renovated with the most negative correlation (-0.58), so PC4 represents housing condition and renovation history. The fifth principal component (PC5) has lat with the highest absolute correlation (-0.78), indicating that PC5 represents latitude. The sixth principal component (PC6) has view with the highest absolute correlation (0.69); since view in this dataset is a 0-4 rating of how good the view from the property is, PC6 represents view quality.
nr = nrow(data) # 21613
nc = ncol(data) # 15
x <- (data - matrix(apply(data, 2, mean), nr, nc, byrow = T))/matrix(sqrt((nr - 1)*apply(data, 2, var)/nr), nr, nc, byrow = T)
## spectral decomposition
eig = eigen((nr - 1) * cov(x)/nr)
e = eig$values
v = eig$vectors
# corresponding eigenvalues
par(mfrow = c(1, 2))
plot(1:length(e), e, main = 'eigenvalues', pch = 16,
     xlab = 'index', ylab = 'value', col = "black")
perc = e/sum(e) # explained variance
cum = cumsum(e)/sum(e) # cumulated explained percentages
plot(1:length(e), cum, main = 'cumulative % of variance', pch = 15,
     xlab = 'index', ylab = '%', col = "red")
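As a sanity check, the spectral-decomposition route above yields the eigenvalues of the correlation matrix, which should match the PC variances reported by R's built-in prcomp(); a minimal sketch on a made-up matrix X (not the kc data):

```r
# Cross-check sketch on synthetic data: eigenvalues of the correlation
# matrix equal the PC variances reported by prcomp() for scaled data.
set.seed(42)
X <- matrix(rnorm(200 * 4), 200, 4)
X[, 2] <- X[, 1] + 0.5 * X[, 2]              # induce some correlation

e.hand   <- eigen(cor(X))$values             # spectral-decomposition route
e.prcomp <- prcomp(X, scale. = TRUE)$sdev^2  # built-in route

all.equal(e.hand, e.prcomp)                  # TRUE
cumsum(e.hand) / sum(e.hand)                 # cumulative explained variance
```

The eigenvalues also sum to the number of variables, since the trace of a correlation matrix equals its dimension.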
The following figure illustrates the distribution of each principal component (PC) based on house price categories. The first quartile (1Q) of house prices is 321,950, and the third quartile (3Q) is 645,000. Based on these thresholds, houses priced at or below 1Q are marked in red, those between 1Q and 3Q are marked in gray, and expensive houses priced above 3Q are marked in green. The most notable pattern appears in the PC1 graph, where lower PC1 values correspond to more expensive houses, while higher PC1 values indicate cheaper houses. Since PC1 decreases as the number of bathrooms and indoor living space increase, this suggests that more expensive houses tend to have larger interiors and more bathrooms. This interpretation aligns well with the previous analysis of PC1. Meanwhile, the values of other principal components (PCs) are more evenly distributed and do not show a strong correlation with house prices.
z = as.matrix(x) %*% v # principal components (9.21)
col_price <- price
col_price[price <= 321950] <- 'red'
col_price[price > 321950 & price <= 645000] <- 'gray'
col_price[price > 645000] <- 'green'
par(mfrow = c(1,3))
plot(z[,1], z[,2], pch = 0, col = col_price, xlab = "PC1", ylab = "PC2", main = "PC1 vs PC2 wrt house price")
plot(z[,3], z[,4], pch = 1, col = col_price, xlab = "PC3", ylab = "PC4", main = "PC3 vs PC4 wrt house price")
plot(z[,5], z[,6], pch = 2, col = col_price, xlab = "PC5", ylab = "PC6", main = "PC5 vs PC6 wrt house price")
The following figure shows the distribution of each principal component (PC) based on whether a house is close to a body of water. In this figure, blue represents houses near water, while gray represents houses that are not. For PC1, which is the most correlated with house prices, the results show no clear relationship between proximity to water and PC1. This suggests that the distance between a house and a body of water does not have a major impact on house prices. However, for PC3, which represents the year the house was built, houses closer to water tend to have higher PC3 values. Since higher PC3 values correspond to older construction years, this indicates that houses near water are relatively older. This observation aligns with the median construction year: houses near water have a median construction year of 1960, whereas those farther from water have a median of 1975. Regarding PC6, which represents the view rating, houses closer to water tend to have higher PC6 values. This suggests that homes near water enjoy better views, which aligns well with the interpretation of this principal component.
## scatter plot w.r.t waterfront
pch_wf <- waterfront
pch_wf[waterfront == 0] <- 5
pch_wf[waterfront == 1] <- 15
col_wf <- waterfront
col_wf[waterfront == 0] <- 'gray'
col_wf[waterfront == 1] <- 'blue'
par(mfrow = c(1,3))
plot(z[,1], z[,2], pch = pch_wf, col = col_wf, xlab = "PC1", ylab = "PC2", main = "PC1 vs PC2 wrt waterfront")
plot(z[,3], z[,4], pch = pch_wf, col = col_wf, xlab = "PC3", ylab = "PC4", main = "PC3 vs PC4 wrt waterfront")
plot(z[,5], z[,6], pch = pch_wf, col = col_wf, xlab = "PC5", ylab = "PC6", main = "PC5 vs PC6 wrt waterfront")
As a result of the PCA, instead of increasing the number of principal components to six, it would also have been reasonable to interpret the data using only three principal components: there are three eigenvalues that are clearly greater than 1, so selecting three principal components would not have caused significant issues in the analysis.
(2) Factor Analysis
1) MLM
In this factor analysis, the maximum likelihood method (MLM) was used. Since there was no significant difference in interpretation between applying Varimax rotation and not applying it, the explanation focuses on the results without Varimax. As with PCA, increasing the number of factors to six would result in a slightly higher cumulative variance, but the increase is not substantial. Additionally, the SS loadings show that three factors have values greater than 1. Therefore, the number of factors was set to three.
Looking at the loading matrix table below, the variables with the strongest positive loadings on the first factor (F1) are sqft_living, sqft_above, grade, bathrooms, and sqft_living15. Therefore, F1 represents house size. In the second factor (F2), the variables sqft_lot and sqft_lot15 have the highest positive values, indicating that F2 represents lot size (land area). For the third factor (F3), yr_built has the strongest negative loading, suggesting that F3 represents the construction year. The interpretation of these three factors aligns closely with the principal component analysis (PCA) results. Among the 15 variables, those with communalities greater than 0.6 under the three factors are bathrooms, sqft_living, sqft_lot, grade, sqft_above, yr_built, sqft_living15, and sqft_lot15, which are all key variables across the three factors. This indicates that these three factors alone account for roughly 60-95% of the variance in these variables.
mlm.kc <- factanal(scale(data), 3, rotation = "none", covmat = cor(data))
mlm.kc
##
## Call:
## factanal(x = scale(data), factors = 3, covmat = cor(data), rotation = "none")
##
## Uniquenesses:
## bedrooms bathrooms sqft_living sqft_lot floors
## 0.659 0.332 0.043 0.341 0.635
## view condition grade sqft_above yr_built
## 0.886 0.843 0.327 0.168 0.157
## yr_renovated lat long sqft_living15 sqft_lot15
## 0.937 0.954 0.745 0.383 0.224
##
## Loadings:
## Factor1 Factor2 Factor3
## bedrooms 0.559 -0.153
## bathrooms 0.794 -0.194
## sqft_living 0.967 0.122
## sqft_lot 0.202 0.707 0.345
## floors 0.448 -0.404
## view 0.259 0.203
## condition -0.123 -0.136 0.352
## grade 0.811 -0.122
## sqft_above 0.911
## yr_built 0.448 0.311 -0.739
## yr_renovated -0.110 0.223
## lat -0.183 0.105
## long 0.311 0.346 -0.197
## sqft_living15 0.785
## sqft_lot15 0.218 0.773 0.362
##
## Factor1 Factor2 Factor3
## SS loadings 4.651 1.419 1.298
## Proportion Var 0.310 0.095 0.087
## Cumulative Var 0.310 0.405 0.491
##
## The degrees of freedom for the model is 63 and the fit was 0.8359
ld.kc <- mlm.kc$loadings # estimated factor loading
ld.kc <- cbind(ld.kc[,1], ld.kc[,2], ld.kc[,3])
com.kc <- diag(ld.kc %*% t(ld.kc)) # communalities
psi.kc <- diag(cor(data)) - com.kc
tbl.kc <- cbind(ld.kc, com.kc, psi.kc)
colnames(tbl.kc) <- c('q1','q2','q3','com','psi')
tbl.kc
## q1 q2 q3 com psi
## bedrooms 0.55882621 -0.152648390 0.07445041 0.34113113 0.65886887
## bathrooms 0.79375185 -0.027803780 -0.19392318 0.66842125 0.33157875
## sqft_living 0.96650727 -0.091583940 0.12177535 0.95735316 0.04264684
## sqft_lot 0.20237363 0.706798441 0.34455672 0.65923845 0.34076155
## floors 0.44768828 0.041779479 -0.40362341 0.36508218 0.63491782
## view 0.25867505 -0.077897949 0.20287260 0.11413816 0.88586184
## condition -0.12286906 -0.136110916 0.35158324 0.15723376 0.84276624
## grade 0.81086603 -0.019281223 -0.12200600 0.67276095 0.32723905
## sqft_above 0.91148016 0.016808794 -0.03766667 0.83249739 0.16750261
## yr_built 0.44815280 0.311034431 -0.73885343 0.84348775 0.15651225
## yr_renovated 0.02671175 -0.109557841 0.22349524 0.06266656 0.93733344
## lat 0.02986828 -0.183427584 0.10517680 0.04559995 0.95440005
## long 0.31056647 0.346478536 -0.19700754 0.25531088 0.74468912
## sqft_living15 0.78495357 0.001543892 0.02757234 0.61691473 0.38308527
## sqft_lot15 0.21754798 0.773273585 0.36204455 0.77635542 0.22364458
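The com and psi columns above reflect the basic factor-model identity R ~ Q Q' + Psi, so for standardized variables communality and uniqueness sum to (approximately) one. A minimal sketch on synthetic data (the dimensions and true loadings below are made up):

```r
# Sketch on synthetic data: for a factanal() fit, R ~ Q Q' + Psi, so each
# variable's communality and uniqueness sum to (approximately) 1.
set.seed(7)
f <- matrix(rnorm(500 * 2), 500, 2)                 # latent factors
L <- matrix(runif(6 * 2, 0.3, 0.8), 6, 2)           # true loadings (made up)
X <- f %*% t(L) + matrix(rnorm(500 * 6, sd = 0.6), 500, 6)

fit <- factanal(X, factors = 2, rotation = "none")
Q   <- fit$loadings[, 1:2]
com <- rowSums(Q^2)                                 # communalities
max(abs(com + fit$uniquenesses - 1))                # near zero
# how well Q Q' + Psi reproduces the observed correlations:
max(abs(cor(X) - (Q %*% t(Q) + diag(fit$uniquenesses))))
```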
Following figures present factor scores for the three factors, organized by house price. The color scheme in the graph follows the same criteria as in PCA. The most notable trend is seen in F1, which, similar to the first principal component (PC1) in PCA, represents house size. The larger the house, the higher the factor score, while smaller houses have lower factor scores. Looking at the loading matrix for the first factor, there is a strong correlation between larger interior space, more bathrooms, and higher F1 values, which aligns with the interpretation of the factor score graph. For the other factors (F2 and F3), their factor scores do not show a clear relationship with house prices, indicating that these factors are not significantly influenced by price.
### factor score, MLM
kc.scale <- scale(data)
fsc <- t(ld.kc)%*%solve(cor(kc.scale))%*%t(kc.scale)
par(mfrow = c(1,3))
plot(fsc[1,], fsc[2,],pch = 0, col = col_price, xlab = "first factor",ylab = "second factor", main = "factor scores for F1 vs F2, MLM", cex.main = 1)
plot(fsc[2,], fsc[3,],pch = 0, col = col_price, xlab = "second factor",ylab = "third factor", main = "factor scores for F2 vs F3, MLM", cex.main = 1)
plot(fsc[3,], fsc[1,],pch = 1, col = col_price, xlab = "third factor",ylab = "first factor", main = "factor scores for F3 vs F1, MLM", cex.main = 1)
2) PCM Varimax
Since eigenvalues and eigenvectors were already obtained using PCA, it is also possible to use PCM (Principal Component Method), which is one of the factor analysis methods. If PCM is applied directly, it would yield identical results to PCA in (1), so Varimax rotation was applied instead. As mentioned earlier, there are three eigenvalues clearly greater than 1, so the number of factors was also set to three in this analysis.
num.e <- 3
eval.kc <- e[1:num.e]
evec.kc <- v[,1:num.e]
E <- matrix(eval.kc, nrow(cor(data)), ncol = num.e, byrow = TRUE)
Q <- sqrt(E)*evec.kc
row.names(Q) <- colnames(data)
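A quick property check of this construction, on a made-up matrix rather than the kc data: with loadings q_j = sqrt(lambda_j) v_j, keeping all components reconstructs the correlation matrix exactly, while truncating to the first few components gives the PCM approximation.

```r
# Sketch (synthetic data): PCM loadings q_j = sqrt(lambda_j) * v_j; keeping
# every component reconstructs the correlation matrix exactly.
set.seed(11)
Xs <- matrix(rnorm(100 * 4), 100, 4)
R  <- cor(Xs)
eg <- eigen(R)
Qfull <- eg$vectors %*% diag(sqrt(eg$values))

max(abs(Qfull %*% t(Qfull) - R))   # machine-precision zero
# truncated version (first 2 components) only approximates R:
Q2 <- Qfull[, 1:2]
max(abs(Q2 %*% t(Q2) - R))
```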
Looking at the right loading matrix table below, the variables with the strongest negative loadings on the first factor (F1) are sqft_living, sqft_above, grade, bathrooms, and sqft_living15. Therefore, F1 represents house size. In the second factor (F2), the variables sqft_lot and sqft_lot15 have the largest negative values, indicating that F2 represents lot size (land area). For the third factor (F3), yr_built has the strongest negative loading, suggesting that F3 represents the construction year. The interpretation of these three factors in PCM (Principal Component Method) closely aligns with the factor interpretation in MLM, with only the signs reversed. Among the 15 variables, those with communalities greater than 0.6 under the three factors are bathrooms, sqft_living, sqft_lot, grade, sqft_above, yr_built, sqft_living15, and sqft_lot15, which is consistent with the MLM results. This indicates that these three factors alone account for roughly 70-90% of the variance in these variables.
pcm.kc.v <- varimax(Q)
Q.vari <- pcm.kc.v$loadings
Q.vari <- cbind(Q.vari[,1], Q.vari[,2], Q.vari[,3])
com.kc.vari <- diag(Q.vari %*% t(Q.vari))
psi.kc.vari <- diag(cor(data)) - com.kc.vari
tbl.kc.vari <- cbind(Q.vari, com.kc.vari, psi.kc.vari)
colnames(tbl.kc.vari) <- c('q1','q2','q3','com','psi')
tbl.kc.vari
## q1 q2 q3 com psi
## bedrooms -0.61133549 0.031355889 0.11973059 0.3890497 0.6109503
## bathrooms -0.83613662 0.009187141 -0.15678612 0.7237907 0.2762093
## sqft_living -0.92984339 -0.114839993 0.07912478 0.8840577 0.1159423
## sqft_lot -0.09249453 -0.871719464 0.06939529 0.7732658 0.2267342
## floors -0.52303589 0.150482255 -0.46084224 0.5085870 0.4914130
## view -0.35012529 -0.043128992 0.45170683 0.3284869 0.6715131
## condition 0.11667420 -0.051022986 0.55277552 0.3217770 0.6782230
## grade -0.85886789 -0.025464327 -0.08760666 0.7459774 0.2540226
## sqft_above -0.89007278 -0.142886063 -0.14137000 0.8326315 0.1673685
## yr_built -0.42826020 -0.075404980 -0.75518833 0.7594021 0.2405979
## yr_renovated -0.09013497 0.046522849 0.35763053 0.1381883 0.8618117
## lat -0.11470731 0.277753963 0.23686611 0.1464106 0.8535894
## long -0.24514004 -0.447318257 -0.43842270 0.4524017 0.5475983
## sqft_living15 -0.81460079 -0.159467851 0.01333073 0.6891821 0.3108179
## sqft_lot15 -0.10388396 -0.881237277 0.06149197 0.7911523 0.2088477
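Because Varimax only post-multiplies the loading matrix by an orthogonal rotation, the com column above is unaffected by the rotation; a small sketch on a toy loading matrix (the values below are made up):

```r
# Sketch: varimax applies an orthogonal rotation to the loading matrix,
# so the communalities (row sums of squared loadings) are unchanged.
Q0 <- cbind(c(0.8, 0.7, 0.6, 0.2, 0.1),
            c(0.2, 0.3, 0.1, 0.8, 0.7))   # toy loading matrix (made up)
vr <- varimax(Q0)
Q1 <- vr$loadings[, 1:2]

max(abs(t(vr$rotmat) %*% vr$rotmat - diag(2)))  # rotation is orthogonal
max(abs(rowSums(Q1^2) - rowSums(Q0^2)))         # communalities invariant
```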
Following figures present factor scores for the three factors, organized by house price. The color scheme in the graph follows the same criteria as in PCA. The most notable trend is seen in F1, which, similar to the first principal component (PC1) in PCA, represents house size. In this case, larger houses have lower factor scores, while smaller houses have higher factor scores. In PCM, the correlation coefficients for the first factor in the loading matrix have opposite signs compared to those in MLM. Consequently, the interpretation of factor scores must also be reversed. This means that larger houses, which have more bathrooms and larger interior space, correspond to lower factor scores. For the other factors (F2 and F3), their factor scores do not show a clear relationship with house prices, which is consistent with the MLM results.
# factor score
fsc.pcm <- t(Q.vari) %*% solve(cor(kc.scale)) %*% t(kc.scale) # factor scores (regression method)
#dev.new()
par(mfrow = c(1,3))
plot(fsc.pcm[1,], fsc.pcm[2,],pch = 0, col = col_price, xlab = "first factor",ylab = "second factor", main = "factor scores for F1 vs F2, PCM", cex.main = 1)
plot(fsc.pcm[2,], fsc.pcm[3,],pch = 0, col = col_price, xlab = "second factor",ylab = "third factor", main = "factor scores for F2 vs F3, PCM", cex.main = 1)
plot(fsc.pcm[3,], fsc.pcm[1,],pch = 1, col = col_price, xlab = "third factor",ylab = "first factor", main = "factor scores for F3 vs F1, PCM", cex.main = 1)
(3) Canonical Correlation Analysis (CCA)
Canonical Correlation Analysis (CCA) is a multivariate analysis technique aimed at maximizing the correlation between two datasets. In this case, the goal is to analyze the relationships among variables that significantly influence house prices, specifically sqft_living, sqft_above, grade, bathrooms, and sqft_living15.
original_data = read.csv("kc_house_data.csv",header=TRUE)
data <- original_data[,!colnames(original_data) %in% c('id','date','price','waterfront','sqft_basement', 'zipcode')] # same preprocessing as in (0); waterfront kept separately below
price <- original_data$price
waterfront <- original_data$waterfront
# reordering the columns of the matrix
kc.data = cbind(data[, c(2,8)], data[, c(3,9,14)]) # X: bathrooms, grade; Y: sqft_living, sqft_above, sqft_living15
s = cov(kc.data)
sa = s[1:2, 1:2] # sigmaXX
sb = s[3:5, 3:5] # sigmaYY
eiga = eigen(sa)
eigb = eigen(sb)
Let the first dataset \(X\) consist of bathrooms and grade, and let the second dataset \(Y\) consist of the remaining variables: sqft_living, sqft_above, and sqft_living15. The matrix \(K\) is computed using the following formula: \[ K = \Sigma_{XX}^{-1/2} \Sigma_{XY} \Sigma_{YY}^{-1/2}. \] Performing Singular Value Decomposition (SVD) on this matrix \(K\), we obtain: \[K = \Gamma \Lambda \Delta^T\] where: \[\Gamma = [\gamma_1, \gamma_2], \quad \Lambda = \text{diag}(\lambda_1, \lambda_2), \quad \Delta = [\delta_1, \delta_2] \] The canonical correlation vectors \(a\) and \(b\) are computed as: \[a_i = \Sigma_{XX}^{-1/2} \gamma_i, \quad b_i = \Sigma_{YY}^{-1/2} \delta_i\] The canonical variables are then defined as: \[\eta_i = a_i^T X, \quad \psi_i = b_i^T Y\] Finally, the singular value \(\lambda_i\) is the \(i\)-th canonical correlation coefficient, i.e., \(\lambda_i = \operatorname{Corr}(\eta_i, \psi_i)\).
#spectral decomposition of sa and sb
sa2 = eiga$vectors %*% diag(1/sqrt(eiga$values)) %*% t(eiga$vectors) # SXX^-1/2
sb2 = eigb$vectors %*% diag(1/sqrt(eigb$values)) %*% t(eigb$vectors) # SYY^-1/2
k = sa2 %*% s[1:2, 3:5] %*% sb2
si = svd(k) # u: gamma, v: delta
si$d # 0.8477228 0.2736026
## [1] 0.8477228 0.2736026
a = sa2 %*% si$u
b = sb2 %*% si$v
eta1 = as.matrix(kc.data[, 1:2]) %*% a[, 1]
phi1 = as.matrix(kc.data[, 3:5]) %*% b[, 1]
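The hand-rolled computation above can be cross-checked against base R's cancor(), since the canonical correlations are exactly the singular values of \(K\); a sketch on synthetic data (the matrices X and Y below are made up, not the kc variables):

```r
# Cross-check sketch (synthetic data): the singular values of
# K = Sxx^{-1/2} Sxy Syy^{-1/2} should match cancor()'s canonical correlations.
set.seed(3)
X <- matrix(rnorm(300 * 2), 300, 2)
Y <- matrix(rnorm(300 * 3), 300, 3)
Y[, 1] <- X[, 1] + 0.5 * Y[, 1]            # link the two variable sets

S   <- cov(cbind(X, Y))
sxx <- S[1:2, 1:2]; syy <- S[3:5, 3:5]; sxy <- S[1:2, 3:5]
# inverse square root of a symmetric positive definite matrix
isqrt <- function(M) {
  e <- eigen(M)
  e$vectors %*% diag(1 / sqrt(e$values)) %*% t(e$vectors)
}
K <- isqrt(sxx) %*% sxy %*% isqrt(syy)

rho.svd <- svd(K)$d                        # hand-rolled route
rho.cc  <- cancor(X, Y)$cor                # built-in route
all.equal(rho.svd, rho.cc)                 # TRUE up to numerical error
```

The sample-size divisor in the covariance cancels inside \(K\), which is why the two routes agree regardless of whether \(n\) or \(n-1\) is used.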
In this case, the correlation between the first pair of canonical variables is 0.8477, indicating a high correlation. The first pair of canonical variables can be expressed as follows: \[ \eta_1 = -0.5898x_1 - 0.5433x_2 \] \[ \psi_1 = (-6.386y_1 - 3.408y_2 - 2.799y_3) \times 10^{-4} \] It can be observed that all variables receive negative weights in their respective canonical variables. The high correlation between these two canonical variables is also evident in [Figure 3]: the data points lie almost on a straight line, which corresponds to the high canonical correlation coefficient.
col_price <- price
col_price[price <= 321950] <- 'red'
col_price[price > 321950 & price <= 645000] <- 'gray'
col_price[price > 645000] <- 'green'
#dev.new()
par(mfrow = c(1,1))
plot(eta1, phi1, xlab = "Eta 1", ylab = "Phi 1",col=col_price,pch=5,
xlim = c(-10,0),ylim = c(-10,0),main = "Eta 1 vs Phi 1, kc data")