Car_data.csv


In January 1998, 303 MBA students were asked about their evaluations of and preferences for 10 different automobiles. The automobiles, listed in order of presentation in the survey, were the BMW 28i, Ford Explorer, Infiniti J30, Jeep Grand Cherokee, Lexus ES300, Chrysler Town & Country, Mercedes C280, Saab 9000, Porsche Boxster, and Volvo V90. Each student rated all 10 cars. For the purpose of this exercise, one car was selected randomly from each student, resulting in a sample of 303 evaluations. The students rated each car on 16 attributes. The first eight questions asked the students to assess the extent to which each of the following words was descriptive of a particular car (where 5 = “Extremely descriptive” and 1 = “Not at all descriptive”): exciting, dependable, luxurious, outdoorsy, powerful, stylish, comfortable, and rugged. The next eight questions asked the students to rate their level of agreement with each of the following statements about a particular car (where 5 = “Strongly Agree” and 1 = “Strongly Disagree”): fun to drive, safe, high-performance car, family car, versatile, sporty, high-status car, and practical. There are 18 variables in the file, defined as follows:

• Student ID
• Car ID

  1. BMW 28i
  2. Ford Explorer
  3. Infiniti J30
  4. Jeep Grand Cherokee
  5. Lexus ES300
  6. Chrysler Town & Country
  7. Mercedes C280
  8. Saab 9000
  9. Porsche Boxster
  10. Volvo V90

• X1= Exciting
• X2= Dependable
• X3= Luxurious
• X4= Outdoorsy
• X5= Powerful
• X6= Stylish
• X7= Comfortable
• X8= Rugged
• X9= Fun
• X10= Safe
• X11= Performance
• X12= Family
• X13= Versatile
• X14= Sporty
• X15= Status
• X16= Practical


Q1. Conduct a factor analysis of the data set. How many factors would you retain? How do you interpret your results?


Before analyzing the Car data, missing values were imputed: the mean of each column was calculated, and these column means were substituted for that column's missing values.

library(car)
## Warning: package 'car' was built under R version 4.3.2
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.3.2
set.seed(123)

Car_data <- read.csv('Car_data.csv', header = TRUE, na.strings = 'NA')
#str(Car_data)
stu_id <- Car_data$Col1
car_id <- Car_data$CarID
Car_data <- Car_data[,-c(1,2)]
#row.names(Car_data) <- stu_id

colnames(Car_data) <- c('X1','X2','X3','X4','X5','X6','X7','X8','X9','X10',
                 'X11','X12','X13','X14','X15','X16')

mean.Car <- as.vector(colMeans(Car_data, na.rm = TRUE))
for(i in 1:ncol(Car_data)){
  Car_data[i][is.na(Car_data[i])] <- mean.Car[i]
}
sum(is.na(Car_data))
## [1] 0
m    <- matrix(mean.Car, nrow(Car_data), length(mean.Car), byrow = TRUE)
x    <- Car_data - m   # mean-centered data
car  <- scale(x)       # standardized data (scale() also re-centers)

In this factor analysis, the maximum likelihood (ML) method was used, and no substantive difference was found between the factor interpretations with and without varimax rotation. With the number of factors set to 3, the factors account for approximately 66.2% of the variance, which falls short of the 70% threshold. However, even with 4 factors, the fourth factor's contribution is minimal and does not significantly alter the interpretation of the first three factors. Therefore, the number of factors was set to 3 in the ML analysis.
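The choice of three factors can also be cross-checked against the eigenvalues of the correlation matrix: the Kaiser criterion retains factors with eigenvalue greater than 1, and a scree plot shows where the eigenvalues level off. A minimal sketch, using simulated 1-to-5 ratings in place of the actual Car_data.csv (which is not reproduced here):

```r
# A minimal sketch of the Kaiser criterion and a scree plot.
# NOTE: simulated 1-5 ratings stand in for the actual Car_data.csv.
set.seed(123)
sim <- matrix(sample(1:5, 303 * 16, replace = TRUE), nrow = 303)
ev  <- eigen(cor(sim))$values   # eigenvalues of the 16 x 16 correlation matrix
sum(ev > 1)                     # Kaiser criterion: retain factors with eigenvalue > 1
plot(ev, type = "b", xlab = "Component", ylab = "Eigenvalue",
     main = "Scree plot (simulated data)")
abline(h = 1, lty = 2)
```

With the real ratings, the eigenvalues drop sharply after the first three components, which is consistent with retaining three factors.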

mlm.car <- factanal(car, 3, rotation = "none", covmat = cor(car))
mlm.car
## 
## Call:
## factanal(x = car, factors = 3, covmat = cor(car), rotation = "none")
## 
## Uniquenesses:
##    X1    X2    X3    X4    X5    X6    X7    X8    X9   X10   X11   X12   X13 
## 0.217 0.570 0.338 0.229 0.414 0.217 0.443 0.234 0.232 0.421 0.353 0.246 0.479 
##   X14   X15   X16 
## 0.319 0.256 0.444 
## 
## Loadings:
##     Factor1 Factor2 Factor3
## X1   0.869  -0.116   0.124 
## X2   0.113   0.643         
## X3   0.619   0.498  -0.178 
## X4          -0.202   0.852 
## X5   0.692           0.313 
## X6   0.873   0.145         
## X7   0.140   0.717   0.153 
## X8          -0.147   0.859 
## X9   0.867           0.119 
## X10          0.729   0.202 
## X11  0.760   0.216  -0.150 
## X12 -0.665   0.433   0.354 
## X13 -0.220   0.276   0.630 
## X14  0.705  -0.303   0.304 
## X15  0.817   0.255  -0.109 
## X16 -0.430   0.475   0.382 
## 
##                Factor1 Factor2 Factor3
## SS loadings      5.596   2.508   2.487
## Proportion Var   0.350   0.157   0.155
## Cumulative Var   0.350   0.506   0.662
## 
## The degrees of freedom for the model is 75 and the fit was 0.7884
ld.car <- mlm.car$loadings # estimated factor loading
ld.car <- cbind(ld.car[,1], ld.car[,2], ld.car[,3])
#ld.car <- cbind(ld.car[,1], ld.car[,2], ld.car[,3], ld.car[,4], ld.car[,5])
ld.car
##            [,1]        [,2]        [,3]
## X1   0.86870400 -0.11563660  0.12385805
## X2   0.11300347  0.64267644  0.06699365
## X3   0.61851628  0.49798642 -0.17819721
## X4   0.06238811 -0.20158514  0.85226012
## X5   0.69191383  0.09930241  0.31287675
## X6   0.87270722  0.14528420 -0.01444146
## X7   0.13968272  0.71735634  0.15268244
## X8   0.07597636 -0.14717837  0.85943004
## X9   0.86732508 -0.04174078  0.11891197
## X10 -0.08515355  0.72870062  0.20166883
## X11  0.76009481  0.21602653 -0.15003083
## X12 -0.66467153  0.43257142  0.35366029
## X13 -0.21985191  0.27588833  0.62961553
## X14  0.70495953 -0.30326034  0.30409704
## X15  0.81681523  0.25532961 -0.10889289
## X16 -0.43007664  0.47503854  0.38180810
com.car  <- diag(ld.car %*% t(ld.car))  # communalities
psi.car <- diag(cor(car)) - com.car
tbl.car <- cbind(ld.car, com.car, psi.car)
colnames(tbl.car) <- c('q1','q2','q3','com','psi')
#colnames(tbl.car) <- c('q1','q2','q3','q4','q5','com','psi')

tbl.car
##              q1          q2          q3       com       psi
## X1   0.86870400 -0.11563660  0.12385805 0.7833593 0.2166407
## X2   0.11300347  0.64267644  0.06699365 0.4302909 0.5697091
## X3   0.61851628  0.49798642 -0.17819721 0.6623071 0.3376929
## X4   0.06238811 -0.20158514  0.85226012 0.7708762 0.2291238
## X5   0.69191383  0.09930241  0.31287675 0.5864976 0.4135024
## X6   0.87270722  0.14528420 -0.01444146 0.7829339 0.2170661
## X7   0.13968272  0.71735634  0.15268244 0.5574233 0.4425767
## X8   0.07597636 -0.14717837  0.85943004 0.7660539 0.2339461
## X9   0.86732508 -0.04174078  0.11891197 0.7681351 0.2318649
## X10 -0.08515355  0.72870062  0.20166883 0.5789260 0.4210740
## X11  0.76009481  0.21602653 -0.15003083 0.6469208 0.3530792
## X12 -0.66467153  0.43257142  0.35366029 0.7539819 0.2460181
## X13 -0.21985191  0.27588833  0.62961553 0.5208649 0.4791351
## X14  0.70495953 -0.30326034  0.30409704 0.6814098 0.3185902
## X15  0.81681523  0.25532961 -0.10889289 0.7442380 0.2557620
## X16 -0.43007664  0.47503854  0.38180810 0.5564050 0.4435950
mlm.car.vari <- varimax(ld.car)
mlm.car.vari
## $loadings
## 
## Loadings:
##     [,1]   [,2]   [,3]  
## X1   0.862 -0.149  0.133
## X2   0.156  0.634       
## X3   0.645  0.409 -0.282
## X4                 0.874
## X5   0.705  0.111  0.278
## X6   0.880              
## X7   0.190  0.722       
## X8                 0.870
## X9   0.866         0.113
## X10         0.758       
## X11  0.768  0.129 -0.200
## X12 -0.625  0.539  0.270
## X13 -0.185  0.409  0.565
## X14  0.692 -0.285  0.349
## X15  0.828  0.172 -0.168
## X16 -0.388  0.569  0.286
## 
##                 [,1]  [,2]  [,3]
## SS loadings    5.585 2.623 2.383
## Proportion Var 0.349 0.164 0.149
## Cumulative Var 0.349 0.513 0.662
## 
## $rotmat
##            [,1]        [,2]        [,3]
## [1,] 0.99752964 -0.06898079 -0.01327636
## [2,] 0.06499051  0.97798831 -0.19828033
## [3,] 0.02666166  0.19692767  0.98005543
ld.car.vari <- mlm.car.vari$loadings # estimated factor loading
ld.car.vari <- cbind(ld.car.vari[,1], ld.car.vari[,2], ld.car.vari[,3])

com.car.vari  <- diag(ld.car.vari %*% t(ld.car.vari))  # communalities
psi.car.vari <- diag(cor(car)) - com.car.vari
tbl.car.vari <- cbind(ld.car.vari, com.car.vari, psi.car.vari)
colnames(tbl.car.vari) <- c('q1','q2','q3','com','psi')

tbl.car.vari
##              q1          q2           q3       com       psi
## X1   0.86234497 -0.14862405  0.132782996 0.7833593 0.2166407
## X2   0.15627834  0.63392788 -0.063272877 0.4302909 0.5697091
## X3   0.64460168  0.40926719 -0.281595698 0.6623071 0.3376929
## X4   0.07185554 -0.03361789  0.874404241 0.7708762 0.2291238
## X5   0.70500009  0.11100193  0.277760750 0.5864976 0.4135024
## X6   0.87960838  0.07904230 -0.054546801 0.7829339 0.2170661
## X7   0.19002977  0.72199809  0.005545123 0.5574233 0.4425767
## X8   0.08913730  0.02006592  0.870462967 0.7660539 0.2339461
## X9   0.86564011 -0.07723371  0.113301779 0.7681351 0.2318649
## X10 -0.03220774  0.75824882  0.054290165 0.5789260 0.4210740
## X11  0.76825671  0.12929426 -0.199963632 0.6469208 0.3530792
## X12 -0.62548735  0.53854486  0.269660704 0.7539819 0.2460181
## X13 -0.18459208  0.40896984  0.565273722 0.5208649 0.4791351
## X14  0.69161671 -0.28532861  0.348803217 0.6814098 0.3185902
## X15  0.82848814  0.17192079 -0.168192241 0.7442380 0.2557620
## X16 -0.38796156  0.56943774  0.285712158 0.5564050 0.4435950
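As a side note, the `com` and `psi` columns are identical before and after rotation. This is expected: varimax is an orthogonal rotation, and communalities (row sums of squared loadings) are invariant under any orthogonal rotation. A small illustration with a hypothetical loading matrix:

```r
# Hypothetical 4 x 3 loading matrix, for illustration only
L <- matrix(c(0.87, -0.12,  0.12,
              0.11,  0.64,  0.07,
              0.62,  0.50, -0.18,
              0.06, -0.20,  0.85), nrow = 4, byrow = TRUE)
rot  <- varimax(L, normalize = FALSE)
Lrot <- L %*% rot$rotmat                  # rotated loadings
max(abs(rowSums(L^2) - rowSums(Lrot^2))) # communalities unchanged (essentially 0)
```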

The loading matrix and communality table above show that the prominent variables in the first factor are exciting (X1), luxurious (X3), stylish (X6), and status (X15), attributes predominantly associated with luxury cars. Thus, the first factor can be interpreted as representing the luxury features of a car: once its basic elements are in place, attributes such as performance (X11) and status (X15) are what chiefly elevate its value.

In the second factor, variables such as dependable (X2), comfortable (X7), safe (X10), and practical (X16) are pronounced; these are fundamental considerations when purchasing a regular passenger car. Hence, the second factor can be viewed as representing the basic performance of a car.

The third factor highlights outdoorsy (X4), rugged (X8), and versatile (X13), attributes considered when choosing a vehicle for travel. Therefore, the third factor can be seen as representing travel suitability.

The figures below support this interpretation with pairwise scatter plots of the factor loadings. Most variables have communalities between roughly 0.5 and 0.8, indicating that the three factors capture a substantial share of each variable's variance. Hence, there is no need to increase the number of factors or include unnecessary variables.

plot(ld.car[,1], ld.car[,2], type = "n", xlab = "f1", ylab = "f2", main = "Factor 1 vs Factor 2 of car data", cex.main = 1)
text(ld.car[,1], ld.car[,2], colnames(Car_data), cex = 1.1)
abline(h = 0, v = 0)

plot(ld.car[,2], ld.car[,3], type = "n", xlab = "f2", ylab = "f3", main = "Factor 2 vs Factor 3 of car data", cex.main = 1)
text(ld.car[,2], ld.car[,3], colnames(Car_data), cex = 1.1)
abline(h = 0, v = 0)

plot(ld.car[,3], ld.car[,1], type = "n", xlab = "f3", ylab = "f1", main = "Factor 3 vs Factor 1 of car data", cex.main = 1)
text(ld.car[,3], ld.car[,1], colnames(Car_data), cex = 1.1)
abline(h = 0, v = 0)


cf) Factor Scores for Each Factor


The first factor scores show that the luxury sports car (Car ID 9, red) has the highest positive values, while the large vehicles (black) have the most negative values. This aligns with the interpretation of the first factor, emphasizing the luxury performance of sports cars in contrast to the needs served by large vans. The second factor scores indicate that sedans and large cars have higher values, whereas the sports car has lower values, reflecting the basic-performance focus of sedans and the performance trade-offs made in sports cars. The third factor scores show the SUVs with the highest values, consistent with the interpretation that SUVs are versatile vehicles suited to mountain trips and other travel.

col_car <- car_id
col_car[car_id == 9] <- 'red'
col_car[car_id==1|car_id==3|car_id==5|car_id==7|car_id==8] <- 'blue'
col_car[car_id==2|car_id==4] <- 'darkgreen'
col_car[car_id==6|car_id==10] <- 'black'

fsc <- t(ld.car) %*% solve(cor(car)) %*% t(car)  # regression-method factor scores: f = L' R^(-1) z

plot(fsc[1,], fsc[2,],type = "n", xlab = "first factor",ylab = "second factor", main = "factor scores for F1 vs F2", cex.main = 1)
text(fsc[1,], fsc[2,], car_id, col = col_car, cex = 0.8)

plot(fsc[2,], fsc[3,],type = "n", xlab = "second factor",ylab = "third factor", main = "factor scores for F2 vs F3", cex.main = 1)
text(fsc[2,], fsc[3,], car_id, col = col_car, cex = 0.8)

plot(fsc[3,], fsc[1,],type = "n", xlab = "third factor",ylab = "first factor", main = "factor scores for F3 vs F1", cex.main = 1)
text(fsc[3,], fsc[1,], car_id, col = col_car, cex = 0.8)
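The manual computation of `fsc` above corresponds to the regression (Thomson) scoring formula f = L'R⁻¹z. When `factanal` is given raw data, it can return the same scores directly via its `scores = "regression"` argument. A sketch on simulated data with a planted 3-factor structure (the actual car ratings are not reproduced here):

```r
# Sketch: factanal() computes regression scores itself when given raw data.
# NOTE: simulated data with a 3-factor structure stand in for the car ratings.
set.seed(123)
Fmat <- matrix(rnorm(303 * 3), 303, 3)            # latent factors
Lam  <- matrix(runif(16 * 3, -0.7, 0.7), 16, 3)   # true loadings
z    <- scale(Fmat %*% t(Lam) + matrix(rnorm(303 * 16, sd = 0.6), 303, 16))
fa   <- factanal(z, factors = 3, rotation = "none", scores = "regression")
manual <- z %*% solve(cor(z)) %*% fa$loadings     # f = z %*% R^(-1) %*% L
max(abs(manual - fa$scores))                      # agrees up to numerical error
```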


Q2. Perform the Cluster Analysis based on

1) Ward method.


par(mfrow = c(1,1))

eig  = eigen(cov(x))  # spectral decomposition  
eva  = eig$values
eva
##  [1] 8.3639297 4.4446401 3.1865195 0.8390662 0.6454796 0.5770914 0.5307561
##  [8] 0.4547439 0.4201426 0.3788315 0.3542952 0.3285294 0.2827707 0.2688774
## [15] 0.2642647 0.2046885
eve  = eig$vectors
xm   = as.matrix(x)
y    = xm %*% eve
ym   = y[, 1:2]       # first two eigenvectors

Using the Euclidean (L2) distance matrix, the dendrogram is depicted below. Given that there are 10 car types and that the factor analysis suggested at least 3 meaningful groups, analyzing with 4 clusters may provide additional insight.

d <- dist(Car_data, method = "euclidean")  # Euclidean distance matrix
dd <- d^2                                  # squared Euclidean distance matrix
w <- hclust(d, method = "ward.D")          # cluster analysis with Ward's algorithm

plot(w, hang = -0.1, labels = FALSE, frame.plot = TRUE, ann = FALSE)
title(main = "Dendrogram for car data with Ward.D", ylab = "Euclidean Distance", cex.main = 1)

Based on the dendrogram, cutting at a height of 75 yields four groups. Group 1 mainly includes Car IDs 1, 3, 5, 7, and 8, the sedans. Group 2 consists predominantly of Car IDs 6 and 10, the large vehicles. Group 3 is mostly Car ID 9, the sports car. Lastly, Group 4 includes Car IDs 2 and 4, among a few others, representing the SUVs.

h <- 75
groups <- cutree(w,h = h) #-> 4 groups

car_id_den <- cbind(car_id, groups)

merg1 <- ym[groups == 1,]
merg2 <- ym[groups == 2,]
merg3 <- ym[groups == 3,]
merg4 <- ym[groups == 4,]


col_car <- car_id
col_car[car_id == 9] <- 'red'
col_car[car_id==1|car_id==3|car_id==5|car_id==7|car_id==8] <- 'blue'
col_car[car_id==2|car_id==4] <- 'darkgreen'
col_car[car_id==6|car_id==10] <- 'black'



plot(ym, type = "n", xlab = "first PC",ylab = "second PC", main = "CA of Car data based on Ward.D", cex.main = 1)
text(ym[, 1], ym[, 2], car_id, col = col_car, cex = 0.8)
dataEllipse(x = merg1[, 1], y = merg1[, 2], center.pch = 0, col = "blue", plot.points = F, add = T, levels = 0.9)
dataEllipse(x = merg2[, 1], y = merg2[, 2], center.pch = 0, col = "black", plot.points = F, add = T, levels = 0.9) 
dataEllipse(x = merg3[, 1], y = merg3[, 2], center.pch = 0, col = "red", plot.points = F, add = T, levels = 0.9)
dataEllipse(x = merg4[, 1], y = merg4[, 2], center.pch = 0, col = "darkgreen", plot.points = F, add = T, levels = 0.9)
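The group-composition claims above (e.g., Group 1 being mostly sedans) can be checked at a glance with a contingency table of cluster membership against Car ID; with the real data, this is simply `table(groups, car_id)`. A sketch of the same pipeline on simulated data:

```r
# Sketch: cross-tabulating cluster membership against Car ID.
# NOTE: simulated data and random IDs stand in for the car ratings and car_id.
set.seed(123)
sim <- matrix(rnorm(100 * 4), 100, 4)
ids <- sample(1:10, 100, replace = TRUE)       # stand-in for car_id
hc  <- hclust(dist(sim), method = "ward.D")
grp <- cutree(hc, k = 4)
table(grp, ids)                                # rows: clusters, columns: Car IDs
```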


2) K-Means method (k = 3,4,5)


For k = 3, Group 1 comprises Car IDs 2 and 4 plus a few 10s, mostly the SUVs (gray). Group 2 includes Car IDs 6 and 10, the large vehicles (dark green). Group 3 contains the sedans (Car IDs 1, 3, 5, 7) together with the sports car, Car ID 9 (light blue).

set.seed(123)
k3 <- kmeans(x, centers = 3, nstart = 25)

groups.k3 = k3$cluster
car_id_k3 <- cbind(car_id, groups.k3)

merg.k3.1 <- ym[groups.k3 == 1,]
merg.k3.2 <- ym[groups.k3 == 2,]
merg.k3.3 <- ym[groups.k3 == 3,]

col_k3 <- car_id
col_k3[groups.k3==1] <- 'gray'
col_k3[groups.k3==2] <- 'darkgreen'
col_k3[groups.k3==3] <- 'skyblue'

plot(ym, type = "n", xlab = "first PC",ylab = "second PC", main = "K-Means clustering Car data when k = 3", cex.main = 1)
text(ym[, 1], ym[, 2], car_id, col = col_k3,cex = 0.8)
dataEllipse(x = merg.k3.1[, 1], y = merg.k3.1[, 2], center.pch = 0, col = "gray", plot.points = F, add = T, levels = 0.9)
dataEllipse(x = merg.k3.2[, 1], y = merg.k3.2[, 2], center.pch = 0, col = "darkgreen", plot.points = F, add = T, levels = 0.9)
dataEllipse(x = merg.k3.3[, 1], y = merg.k3.3[, 2], center.pch = 0, col = "skyblue", plot.points = F, add = T, levels = 0.9)

For k = 4, Group 1 consists of the sedans, Car IDs 1, 3, 5, 7, and 8 (gray). Group 2 has Car IDs 6 and 10 (dark green). Group 3 is dominated by Car ID 9 (light blue), and Group 4 includes Car IDs 2 and 4 (orange).

set.seed(123)
k4 <- kmeans(x, centers = 4, nstart = 50)

groups.k4 = k4$cluster
car_id_k4 <- cbind(car_id, groups.k4)

merg.k4.1 <- ym[groups.k4 == 1,]
merg.k4.2 <- ym[groups.k4 == 2,]
merg.k4.3 <- ym[groups.k4 == 3,]
merg.k4.4 <- ym[groups.k4 == 4,]

col_k4 <- car_id
col_k4[groups.k4==1] <- 'gray'
col_k4[groups.k4==2] <- 'darkgreen'
col_k4[groups.k4==3] <- 'skyblue'
col_k4[groups.k4==4] <- 'orange'

plot(ym, type = "n", xlab = "first PC",ylab = "second PC", main = "K-Means clustering Car data when k = 4", cex.main = 1)
text(ym[, 1], ym[, 2], car_id, col = col_k4, cex = 0.8)
dataEllipse(x = merg.k4.1[, 1], y = merg.k4.1[, 2], center.pch = 0, col = "gray", plot.points = F, add = T, levels = 0.9)
dataEllipse(x = merg.k4.2[, 1], y = merg.k4.2[, 2], center.pch = 0, col = "darkgreen", plot.points = F, add = T, levels = 0.9)
dataEllipse(x = merg.k4.3[, 1], y = merg.k4.3[, 2], center.pch = 0, col = "skyblue", plot.points = F, add = T, levels = 0.9)
dataEllipse(x = merg.k4.4[, 1], y = merg.k4.4[, 2], center.pch = 0, col = "orange", plot.points = F, add = T, levels = 0.9)

For k = 5, the five clusters are depicted below. Notably, the sedan group from k = 4 splits in two, Groups 1 and 3, with Car ID 7 the most represented in the latter.

set.seed(123)
k5 <- kmeans(x, centers = 5, nstart = 50)

groups.k5 = k5$cluster
car_id_k5 <- cbind(car_id, groups.k5)

merg.k5.1 <- ym[groups.k5 == 1,]
merg.k5.2 <- ym[groups.k5 == 2,]
merg.k5.3 <- ym[groups.k5 == 3,]
merg.k5.4 <- ym[groups.k5 == 4,]
merg.k5.5 <- ym[groups.k5 == 5,]

col_k5 <- car_id
col_k5[groups.k5==1] <- 'gray'
col_k5[groups.k5==2] <- 'darkgreen'
col_k5[groups.k5==3] <- 'skyblue'
col_k5[groups.k5==4] <- 'orange'
col_k5[groups.k5==5] <- 'red'

plot(ym, type = "n", xlab = "first PC",ylab = "second PC", main = "K-Means clustering Car data when k = 5", cex.main = 1)
text(ym[, 1], ym[, 2], car_id, col = col_k5, cex = 0.8)
dataEllipse(x = merg.k5.1[, 1], y = merg.k5.1[, 2], center.pch = 0, col = "gray", plot.points = F, add = T, levels = 0.9)
dataEllipse(x = merg.k5.2[, 1], y = merg.k5.2[, 2], center.pch = 0, col = "darkgreen", plot.points = F, add = T, levels = 0.9)
dataEllipse(x = merg.k5.3[, 1], y = merg.k5.3[, 2], center.pch = 0, col = "skyblue", plot.points = F, add = T, levels = 0.9)
dataEllipse(x = merg.k5.4[, 1], y = merg.k5.4[, 2], center.pch = 0, col = "orange", plot.points = F, add = T, levels = 0.9)
dataEllipse(x = merg.k5.5[, 1], y = merg.k5.5[, 2], center.pch = 0, col = "red", plot.points = F, add = T, levels = 0.9)
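One way to adjudicate between k = 3, 4, and 5 is the elbow heuristic: plot the total within-cluster sum of squares against k and look for the point where further increases in k stop paying off. A sketch on simulated data (the actual ratings are not reproduced here):

```r
# Elbow heuristic: total within-cluster SS as a function of k.
# NOTE: simulated data stand in for the car ratings.
set.seed(123)
sim <- matrix(rnorm(300 * 5), 300, 5)
wss <- sapply(2:8, function(k) kmeans(sim, centers = k, nstart = 25)$tot.withinss)
plot(2:8, wss, type = "b", xlab = "k", ylab = "Total within-cluster SS",
     main = "Elbow plot (simulated data)")
```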


cf) Differences between the Two Methods (k=4)


Hierarchical methods such as Ward's iteratively merge the two closest clusters, allowing flexible cluster shapes. In contrast, K-means requires the number of clusters to be specified in advance and tends to produce spherical clusters. For these data, the hierarchical method appears more appropriate, since the resulting clusters are elongated rather than circular.