MVA HW2

Q1. (Exercise 12.7) Perform a factor analysis on the variables X3 to X9 in the US crime data set (Table 22.10). Would it make sense to use all of the variables for the analysis?

In this data, the Maximum Likelihood Method (MLM) is used for this Factor Analysis (‘FA’ from now on) and compare the difference between the MLM with Varimax and without Varimax. The number of factors for both methods will be 3 to make the degrees of freedom bigger than 0.

uscrime <- read.csv('uscrime.csv', header = TRUE)

state.cr <- uscrime$X
land.area.cr <- uscrime$land.area
popu.cr <- uscrime$popu.1985
reg.cr <- uscrime$reg # 4 regions
div.cr <- uscrime$div # 9 divisions

crime <- uscrime[,-c(1,2,3,11,12)]
crime.colname <- colnames(crime)
colnames(crime) <- c("X3", 'X4',"X5", "X6", "X7", "X8", "X9")
crime <- scale(crime)

cor.crime <- cor(crime)
cor.crime

##            X3        X4        X5        X6        X7         X8        X9
## X3 1.00000000 0.5198681 0.3410585 0.8125571 0.2767242 0.06478257 0.1098294
## X4 0.51986815 1.0000000 0.5514390 0.6959321 0.6801545 0.60060649 0.4407031
## X5 0.34105849 0.5514390 1.0000000 0.5632028 0.6221916 0.43618107 0.6170526
## X6 0.81255711 0.6959321 0.5632028 1.0000000 0.5207205 0.31670002 0.3303804
## X7 0.27672425 0.6801545 0.6221916 0.5207205 1.0000000 0.80110107 0.7001002
## X8 0.06478257 0.6006065 0.4361811 0.3167000 0.8011011 1.00000000 0.5547790
## X9 0.10982939 0.4407031 0.6170526 0.3303804 0.7001002 0.55477902 1.0000000

Based on non-Varimax, the first factor has generally high positive values for all variables. So the first factor can be named as the ‘general crime effect’. The second and third factors can be called as ‘the effect of violent crimes’ and ‘auto theft factor’ for the similar reason above. The first factor of the loading matrix (without rotation) explains general crime, but the first and third factor of the loading matrix (using rotation) split it into two types of crimes. If the Varimax is not used, the two different kinds of crimes can’t be explained in detail. So, the MLM with Varimax is chosen for this FA.

mlm.crime <- factanal(crime, 3, rotation = "none", covmat = cor.crime)
mlm.crime

## 
## Call:
## factanal(x = crime, factors = 3, covmat = cor.crime, rotation = "none")
## 
## Uniquenesses:
##    X3    X4    X5    X6    X7    X8    X9 
## 0.223 0.312 0.386 0.072 0.156 0.072 0.243 
## 
## Loadings:
##    Factor1 Factor2 Factor3
## X3  0.543   0.687         
## X4  0.819                 
## X5  0.686           0.376 
## X6  0.786   0.556         
## X7  0.867  -0.277   0.124 
## X8  0.777  -0.541  -0.179 
## X9  0.639  -0.280   0.520 
## 
##                Factor1 Factor2 Factor3
## SS loadings      3.818   1.240   0.478
## Proportion Var   0.545   0.177   0.068
## Cumulative Var   0.545   0.723   0.791
## 
## The degrees of freedom for the model is 3 and the fit was 0.0231

ld.crime <- mlm.crime$loadings # estimated factor loading
ld.crime <- cbind(ld.crime[,1], ld.crime[,2], ld.crime[,3])

com.crime  <- diag(ld.crime %*% t(ld.crime))  # communalities
psi.crime <- diag(cor.crime) - com.crime
tbl.crime <- cbind(ld.crime, com.crime, psi.crime)
colnames(tbl.crime) <- c('q1','q2','q3','com','psi')

tbl.crime

##           q1          q2          q3       com        psi
## X3 0.5428015  0.68732152 -0.09947105 0.7769388 0.22306116
## X4 0.8193851  0.09296157 -0.08734634 0.6876632 0.31233675
## X5 0.6858865  0.05048156  0.37550868 0.6139955 0.38600450
## X6 0.7862850  0.55584819 -0.03045219 0.9281387 0.07186130
## X7 0.8672775 -0.27727699  0.12364828 0.8443417 0.15565826
## X8 0.7766083 -0.54082114 -0.17905820 0.9276698 0.07233018
## X9 0.6390551 -0.27955393  0.52042074 0.7573795 0.24262047

par(mfrow = c(1,1))

# plot first factor against second
plot(ld.crime[,1], ld.crime[,2], type = "n", xlab = "f1", ylab = "f2", main = "Factor 1 vs Factor 2", cex.main = 1.4, xlim = c(-1,1), ylim = c(-1,1))
text(ld.crime[,1], ld.crime[,2], colnames(crime), cex = 1.1)
abline(h = 0, v = 0)

When Varimax rotation method is used in FA of US crime data, the consecutive 3 factors take up more than 79.1% of total variance. So, using 3 factors is appropriate for Varimax case. According to the factor loading matrix, the most prominent variables of the first factor are larceny (X8) followed by burglary (X7) and rape (X4). So, the first factor can be interpreted as theft-related factor. Considering the theft is the most frequently occurred crime (which is consistent to the original US crime data), how many crimes have occurred is sensitively affected by those kinds of crimes. The highest correlation values of second factor occur at assault (X6), murder (X3), and rape (X4). So, the second factor is named as ‘the effect of violent crimes.’ The most significant value of third factor exists in auto thief (X9) followed by robbery (X5) and burglary (X7). So, the third factor can be viewed as ‘auto theft factor’ since robbery naturally occurs while stealing vehicles.

mlm.crime.vari <- varimax(ld.crime)
mlm.crime.vari

## $loadings
## 
## Loadings:
##    [,1]  [,2]  [,3] 
## X3       0.879      
## X4 0.527 0.563 0.305
## X5 0.241 0.372 0.646
## X6 0.209 0.903 0.262
## X7 0.672 0.259 0.571
## X8 0.910       0.310
## X9 0.332       0.802
## 
##                 [,1]  [,2]  [,3]
## SS loadings    1.770 2.117 1.649
## Proportion Var 0.253 0.302 0.236
## Cumulative Var 0.253 0.555 0.791
## 
## $rotmat
##            [,1]       [,2]       [,3]
## [1,]  0.6576990  0.5781434  0.4828895
## [2,] -0.5799004  0.7977458 -0.1652791
## [3,] -0.4807781 -0.1713239  0.8599421

ld.cr.vari <- mlm.crime.vari$loadings # estimated factor loading
ld.cr.vari <- cbind(ld.cr.vari[,1], ld.cr.vari[,2], ld.cr.vari[,3])
colnames(ld.cr.vari) <- c('q1','q2','q3')
com.cr.vari  <- diag(ld.cr.vari %*% t(ld.cr.vari))  # communalities
psi.cr.vari <- diag(cor.crime) - diag(ld.cr.vari %*% t(ld.cr.vari))
tbl.cr.vari <- cbind(ld.cr.vari, com.cr.vari, psi.cr.vari)
colnames(tbl.cr.vari) <- c('q1','q2','q3','com','psi')

tbl.cr.vari

##             q1         q2         q3       com        psi
## X3 0.006245483 0.87916672 0.06297392 0.7769388 0.22306116
## X4 0.526994524 0.56284631 0.30519512 0.6876632 0.31233675
## X5 0.241296226 0.37247859 0.64577963 0.6139955 0.38600450
## X6 0.209443030 0.90322822 0.26163160 0.9281387 0.07186130
## X7 0.671753180 0.25903029 0.57095772 0.8443417 0.15565826
## X8 0.910484140 0.04823010 0.31042278 0.9276698 0.07233018
## X9 0.332212384 0.05729196 0.80232917 0.7573795 0.24262047

The following figure shows the plot of correlation between each factor. In the first plot (f1 vs f2), the larceny (X8) and burglary (X7) are in the right side of factor 1, while assault (X6) and murder (X3) are in the upper side of factor 2. The rape (X4) lies between the two group, which is natural because the rape, one of the violent crimes, sometimes occurs during burglary. The plot between second and third factors (third plot) shows that the auto theft (X9) is grouped with robbery (X5) and burglary (X7) in the upper side of factor 3. So, it is easily shown that the similar variables are grouped together in the corresponding factors.

# plot first factor against second
plot(ld.cr.vari[,1], ld.cr.vari[,2], type = "n", xlab = "f1", ylab = "f2", main = "Factor 1 vs Factor 2", cex.main = 1.4, xlim = c(-1,1), ylim = c(-1,1))
text(ld.cr.vari[,1], ld.cr.vari[,2], colnames(crime), cex = 1.1)
abline(h = 0, v = 0)

# plot first factor against third
plot(ld.cr.vari[,1], ld.cr.vari[,3], type = "n", xlab = "f1", ylab = "f3", main = "Factor 1 vs Factor 3", cex.main = 1.4, xlim = c(-1,1), ylim = c(-1,1))
text(ld.cr.vari[,1], ld.cr.vari[,3], colnames(crime), cex = 1.1)
abline(h = 0, v = 0)

# plot second factor against third
plot(ld.cr.vari[,2], ld.cr.vari[,3], type = "n", xlab = "f2", ylab = "f3", main = "Factor 2 vs Factor 3", cex.main = 1.4, xlim = c(-1,1), ylim = c(-1,1))
text(ld.cr.vari[,2], ld.cr.vari[,3], colnames(crime), cex = 1.1)
abline(h = 0, v = 0)

The following figure shows how the population density, the population size (X2) divided by the land area (X1), affects each crime. The red marks means high population rate (≥ 1.5the average rate), blue means low rate (<0.5average rate), and gray is between them. Compared to burglary and murder, auto theft (the prominent variable of 3rd factor) shows that the population rate is proportional to auto theft rate. Based on this observation, the main reason for this phenomenon is assumed that the vehicles, which are expensive and (relatively) portable property, can be easily seen in a crowded and compact region, which leads the thieves to steal the cars readily compared to quite and empty regions. / Each variable has considerable correlation with at least one variable and it has significant communality. So, it is appropriate to make use of all of the variables of crime data for Factor Analysis.

pop.rate <- popu.cr/land.area.cr
pch_pop.rate <- pop.rate
mean.rate <- mean(pop.rate)


col_rate <- pop.rate
col_rate[pop.rate < mean.rate/2] <- 'blue'
col_rate[pop.rate >= mean.rate/2 & pop.rate < 3*mean.rate/2] <- 'gray'
col_rate[pop.rate >= 3*mean.rate/2] <- 'red'

plot(crime[,'X7'], crime[,'X3'], pch = 15, col = col_rate, xlab = "burglary", ylab = "murder", main = "burglary vs murder wrt population rate")

plot(crime[,'X7'], crime[,'X9'], pch = 16, col = col_rate, xlab = "burglary", ylab = "auto", main = "burglary vs auto wrt population rate")

plot(crime[,'X3'], crime[,'X9'], pch = 17, col = col_rate, xlab = "murder", ylab = "auto", main = "murder vs auto wrt population rate")

Q2. (Exercise 12.9) Perform a factor analysis on the US health data set (Table 22.16) and estimate the factor scores.

In this data, the Principal Component Method (PCM) is used as a method of FA, opposite to the previous question. The PCM method requires some eigenvalues and eigenvectors to build the factor loading matrix Q. According to Q, the 3 consecutive eigenvalues, which are greater than 1, account for over 75.4% of total variance, so 3 eigenvalues and the corresponding 3 eigenvectors would be enough. The loading matrix after the rotation also shows the 3 consecutive eigenvalues have more than 75% of total variance. So, using 3 factors would be proper as well after Varimax rotation.

ushealth <- read.table('ushealth.txt', header = TRUE)
state.h <- ushealth$state
land.h <- ushealth$land
popu.h <- ushealth$popu
reg.h <- ushealth$r
div.h <- ushealth$d


health <- ushealth[,-c(1,2,3,13,14)]
health.colnames <- colnames(health)
colnames(health) <- c('X3','X4',"X5", "X6", "X7", "X8", "X9","X10","X11")
health <- scale(health)
cor.health <- cor(health)
cor.health

##              X3         X4         X5          X6          X7          X8
## X3   1.00000000 -0.5060140 -0.5780965 -0.09939666 -0.19555969 -0.51726729
## X4  -0.50601403  1.0000000  0.9080732  0.44089424  0.53822343  0.61930117
## X5  -0.57809651  0.9080732  1.0000000  0.43794184  0.35777645  0.70905640
## X6  -0.09939666  0.4408942  0.4379418  1.00000000  0.40044653  0.22696567
## X7  -0.19555969  0.5382234  0.3577765  0.40044653  1.00000000  0.02233657
## X8  -0.51726729  0.6193012  0.7090564  0.22696567  0.02233657  1.00000000
## X9  -0.15530345  0.1359821  0.3628637  0.26338462 -0.09658649  0.14767214
## X10 -0.35651322  0.2104690  0.2640200 -0.10211738  0.07680268  0.02635667
## X11 -0.22506132  0.2208912  0.1953137 -0.09711215  0.08761296 -0.03605098
##              X9         X10         X11
## X3  -0.15530345 -0.35651322 -0.22506132
## X4   0.13598213  0.21046897  0.22089116
## X5   0.36286374  0.26402004  0.19531368
## X6   0.26338462 -0.10211738 -0.09711215
## X7  -0.09658649  0.07680268  0.08761296
## X8   0.14767214  0.02635667 -0.03605098
## X9   1.00000000  0.55533628  0.28690906
## X10  0.55533628  1.00000000  0.85970847
## X11  0.28690906  0.85970847  1.00000000

According to the loading matrix in [figure 2-1], the first factor has high negative value at cancer (X5), cardiovascular (X4), diabetes (X8), and high positive value at accident (X3). So, the first factor can be interpreted as the severe diseases in US. This coincides with the report presented by National Center for Health Statistics (NCHS) in 2018 [ref. 1]. The dominant variable at the second factor is doctors (X10) and hospitals (X11), so the second factor can be viewed as ‘medical factor’. Since the degree of medical service in the state determines the health of the people, it makes sense for the medical factor to be the second highest factor. The pneumonia flu (X7) and pulmonary (X6) are the most prominent variable of factor 3, so the third factor can be named as ‘diseases related to lung’. The interpretation of each factor is almost similar to those made by the loading matrix from Varimax method, so the following explanation is just based on Varimax rotation. Note that PCM without Varimax makes similar results.

e.hth <- eigen(cor.health)
eval <- e.hth$values
perc <- eval/sum(eval)
cumul <- cumsum(perc)
e.table <- rbind(eval,perc,cumul)
e.table # why 3

##            [,1]      [,2]      [,3]      [,4]       [,5]       [,6]       [,7]
## eval  3.6023712 2.0116184 1.1753927 1.0345614 0.49060091 0.37056973 0.21064248
## perc  0.4002635 0.2235132 0.1305992 0.1149513 0.05451121 0.04117441 0.02340472
## cumul 0.4002635 0.6237766 0.7543758 0.8693271 0.92383829 0.96501271 0.98841743
##              [,8]        [,9]
## eval  0.062591112 0.041652045
## perc  0.006954568 0.004628005
## cumul 0.995371995 1.000000000

num.e <- 3
eval.hth <- e.hth$values[1:num.e]
evec.hth <- e.hth$vectors[,1:num.e]
E <- matrix(eval.hth, nrow(cor.health), ncol = num.e, byrow = TRUE)
Q <- sqrt(E)*evec.hth
row.names(Q) <- colnames(health)

com.hth <- diag(Q %*% t(Q))
psi.hth <- diag(cor.health) - com.hth
tbl.hth <- cbind(Q, com.hth, psi.hth)
colnames(tbl.hth) <- c('q1','q2','q3','com','psi')
#colnames(tbl.hth) <- c('q1','q2','q3','q4','com','psi')
tbl.hth

##             q1          q2          q3       com        psi
## X3   0.6925406 -0.04091823  0.32320104 0.5857457 0.41425425
## X4  -0.8888515 -0.27025232  0.08544195 0.8703936 0.12960636
## X5  -0.9291493 -0.18842852 -0.11391810 0.9118011 0.08819890
## X6  -0.4716090 -0.42124446  0.41367072 0.5709854 0.42901457
## X7  -0.4537853 -0.28801582  0.71736435 0.8034858 0.19651421
## X8  -0.6714197 -0.31643770 -0.55732696 0.8615505 0.13844948
## X9  -0.4350007  0.45979019 -0.05306960 0.4034490 0.59655096
## X10 -0.4886176  0.84266160  0.09823056 0.9584749 0.04152505
## X11 -0.4028275  0.78700955  0.20455350 0.8234961 0.17650387

pcm.hth.vari <- varimax(Q)
pcm.hth.vari

## $loadings
## 
## Loadings:
##     [,1]   [,2]   [,3]  
## X3   0.709 -0.282       
## X4  -0.700  0.161  0.596
## X5  -0.827  0.217  0.426
## X6  -0.227         0.714
## X7                 0.894
## X8  -0.924              
## X9  -0.244  0.586       
## X10         0.974       
## X11         0.906       
## 
##                 [,1]  [,2]  [,3]
## SS loadings    2.651 2.286 1.852
## Proportion Var 0.295 0.254 0.206
## Cumulative Var 0.295 0.549 0.754
## 
## $rotmat
##           [,1]       [,2]       [,3]
## [1,] 0.7645823 -0.4336081 -0.4768627
## [2,] 0.2613710  0.8848915 -0.3855543
## [3,] 0.5891512  0.1701499  0.7899050

Q.vari <- pcm.hth.vari$loadings
Q.vari <- cbind(Q.vari[,1], Q.vari[,2], Q.vari[,3])

com.hth.vari <- diag(Q.vari %*% t(Q.vari))
psi.hth.vari <- diag(cor.health) - com.hth.vari
tbl.hth.vari <- cbind(Q.vari, com.hth.vari, psi.hth.vari)
colnames(tbl.hth.vari) <- c('q1','q2','q3','com','psi')
tbl.hth.vari

##                q1          q2           q3       com        psi
## X3   0.7092237407 -0.28150675 -0.059172453 0.5857457 0.41425425
## X4  -0.6998979999  0.16080712  0.595548063 0.8703936 0.12960636
## X5  -0.8267758446  0.21676467  0.425741571 0.9118011 0.08819890
## X6  -0.2269703950 -0.09787614  0.714065916 0.5709854 0.42901457
## X7   0.0004009032  0.06396169  0.894088659 0.8034858 0.19651421
## X8  -0.9244130641 -0.08370922  0.001943559 0.8615505 0.13844948
## X9  -0.2436840456  0.58645449 -0.011758438 0.4034490 0.59655096
## X10 -0.0954683734  0.97424657 -0.014295560 0.9584749 0.04152505
## X11  0.0182196725  0.90589210  0.050236257 0.8234961 0.17650387

The following figure shows the correlation between each factor. In the first plot (f1 vs f2), cancer (X5), cardiovascular (X4), and diabetes (X8) are grouped together in the left part of first factor, doctors (X10) and hospitals (X11) are grouped at the top of the second factor. Note that the accident (X3) lies solely on the right part of the first factor. The pneumonia flu (X7) and pulmonary (X6) are at the top of the third factor. So, it is easily shown that the similar variables are grouped together in the corresponding factors.

# plot first factor against second
plot(Q.vari[,1], Q.vari[,2], type = "n", xlab = "f1", ylab = "f2", main = "Factor 1 vs Factor 2", cex.main = 1.4, xlim = c(-1,1), ylim = c(-1,1))
text(Q.vari[,1], Q.vari[,2], colnames(health), cex = 1.1)
abline(h = 0, v = 0)

# plot first factor against third
plot(Q.vari[,1], Q.vari[,3], type = "n", xlab = "f1", ylab = "f3", main = "Factor 1 vs Factor 3", cex.main = 1.4, xlim = c(-1,1), ylim = c(-1,1))
text(Q.vari[,1], Q.vari[,3], colnames(health), cex = 1.1)
abline(h = 0, v = 0)

# plot second factor against third
plot(Q.vari[,2], Q.vari[,3], type = "n", xlab = "f2", ylab = "f3", main = "Factor 2 vs Factor 3", cex.main = 1.4, xlim = c(-1,1), ylim = c(-1,1))
text(Q.vari[,2], Q.vari[,3], colnames(health), cex = 1.1)
abline(h = 0, v = 0)

The next plots explain how each disease variables are related to land area. The red mark means narrow area (< 0.5the average), blue means large area (<1.5the average), and gray means between them. The first plot shows the diabetes (one of main variables of first factor) occurs more in the place where the land area is narrow (except for some outliers). As the area of the living place is large, it is likely for the inhabitants to have eco-friendly environments and enjoy exercising in the hospitable place. This would naturally lead to less occurrence of the diseases. This interpretation is similar to the case of cardiovascular (one of main variables of first factor) in the second plot.

col_land <- land.h
av_land <- mean(land.h)
col_land[land.h <= av_land/2] <- 'red'
col_land[land.h > av_land/2 & land.h <= 3*av_land/2] <- 'gray'
col_land[land.h > 3*av_land/2] <- 'blue'



plot(health[,'X8'], health[,'X10'], pch = 15, col = col_land, xlab = "diabetes", ylab = "doctors", main = "diabetes vs doctor wrt land area")

plot(health[,'X4'], health[,'X7'], pch = 16, col = col_land, xlab = "cardiovascular", ylab = "pneumonia flu", main = "cardio vs pneu wrt land area")

plot(health[,'X6'], health[,'X11'], pch = 17, col = col_land, xlab = "pulmonary", ylab = "hospitals", main = "pulmonary vs hospitals wrt land area")

Analyzing the correlation matrix and loading matrix, it can be shown that the liver variable (X9) has insignificant effect for any factors of health data: it has low value of correlation among other variables and low communality. So, excluding the liver (X9) is also possible for the Factor analysis.

Multivariate Analysis HW2

Q1. (Exercise 12.7) Perform a factor analysis on the variables X3 to X9 in the US crime data set (Table 22.10). Would it make sense to use all of the variables for the analysis?

Q2. (Exercise 12.9) Perform a factor analysis on the US health data set (Table 22.16) and estimate the factor scores.