MVA HW1

Keywords: PCA, number of PCs from eigenvalues, loading matrix, correlation circle

Q1. (Exercise 11.8)

i) Apply PCA to the US CRIME data set (Table 22.10). Interpret the results. Would it be necessary to look at the third PC? Can you see any difference between the four regions?

Before standardizing the crime data set (‘data’ in R), the categorical columns (state, division, and region) are removed.

data <- read.csv("uscrime.csv")

## data manipulation
state <- data$X
reg <- data$reg # 4 regions
div <- data$div # 9 divisions
data <- data[,-c(1,11,12)]

## data standardization
nr = nrow(data) # 50
nc = ncol(data) # 9

x <- (data - matrix(apply(data, 2, mean), nr, nc, byrow = T))/matrix(sqrt((nr - 1)*apply(data, 2, var)/nr), nr, nc, byrow = T)
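The standardization above divides by the standard deviation computed with denominator n rather than n - 1. As a minimal sketch on a small made-up data frame (`toy` and its columns are hypothetical, not the crime data), the same matrix can be obtained with `scale()`, and the rescaled covariance `(n - 1) * cov(x) / n` used in the next step then turns out to be the correlation matrix of the original columns:

```r
# Hypothetical toy data standing in for the numeric crime columns
set.seed(1)
toy <- data.frame(a = rnorm(20, 10, 3), b = runif(20, 0, 100), c = rexp(20))
n <- nrow(toy)

# Standard deviation with denominator n, as in the formula above
sd_n <- sqrt((n - 1) * apply(toy, 2, var) / n)

# Same standardization written with scale()
x_toy <- scale(toy, center = colMeans(toy), scale = sd_n)

# The formula used in the report, applied to the toy data
x_report <- (toy - matrix(apply(toy, 2, mean), n, 3, byrow = TRUE)) /
  matrix(sd_n, n, 3, byrow = TRUE)
all.equal(as.matrix(x_report), x_toy, check.attributes = FALSE)

# Rescaled covariance of the standardized data vs. correlation matrix
S <- (n - 1) * cov(x_toy) / n
all.equal(S, cor(toy))
```

Because of this equivalence, the eigenvalues computed below sum to the number of variables.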

After obtaining the standardized numerical data (‘x’), its eigenvalues and eigenvectors are computed and denoted by ‘e’ and ‘v’. The eigenvalues are, in decreasing order, 4.46, 1.47, 1.15, 0.73, etc., and the cumulative percentages of explained variance are 49.6, 65.9, 78.7, 86.8%, and so on. Two common rules of thumb are used here: retain the components whose eigenvalue is at least 1 (the Kaiser criterion), and retain enough components to explain at least about 70% of the total variance. In this case, 3 eigenvalues are bigger than 1, and together they account for more than 78% of the total variance, whereas the first two explain only 65.9%. By both criteria, 3 principal components (PCs) are appropriate, so it is necessary to look at the third PC.

## spectral decomposition
eig  = eigen((nr - 1) * cov(x)/nr) 
e    = eig$values
v    = eig$vectors

# corresponding eigenvalues
par(mfrow = c(1, 2))
plot(c(1:length(e)),e, main = 'eigenvalues', pch = 16,
     xlab = 'index', ylab = 'value', col="black")
perc = e/sum(e)            # explained variance
cum  = cumsum(e)/sum(e)    # cumulated explained percentages
plot(c(1:length(e)),cum, main = '% of eigenvalue', pch = 15,
     xlab = 'index', ylab = '%', col="red")

variance_table = rbind(e,perc,cum)
variance_table

##           [,1]      [,2]      [,3]       [,4]       [,5]       [,6]       [,7]
## e    4.4625194 1.4688827 1.1544711 0.72853585 0.45276319 0.27740669 0.22468719
## perc 0.4958355 0.1632092 0.1282746 0.08094843 0.05030702 0.03082297 0.02496524
## cum  0.4958355 0.6590447 0.7873192 0.86826767 0.91857469 0.94939766 0.97436290
##            [,8]       [,9]
## e    0.13331769 0.09741621
## perc 0.01481308 0.01082402
## cum  0.98917598 1.00000000
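The two selection rules can be applied directly to the eigenvalues printed above. The following is a small sketch; the 70% threshold is the rule of thumb used in this report, not a fixed standard, and the names `n_kaiser` and `n_cum` are introduced only for illustration:

```r
# Eigenvalues copied from the output above
e <- c(4.4625194, 1.4688827, 1.1544711, 0.72853585, 0.45276319,
       0.27740669, 0.22468719, 0.13331769, 0.09741621)

cum <- cumsum(e) / sum(e)

# Rule 1: number of eigenvalues that are at least 1
n_kaiser <- sum(e >= 1)

# Rule 2: smallest number of PCs whose cumulative share reaches 70%
n_cum <- which(cum >= 0.70)[1]

c(kaiser = n_kaiser, cumulative = n_cum)
```

Both rules select 3 components for the crime data.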

z = as.matrix(x) %*% v  # principal components (9.21)
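As a cross-check of the manual computation, R's built-in `prcomp()` with `scale. = TRUE` should give the same eigenvalues and, up to the sign of each column, the same scores. The sketch below uses a hypothetical toy data frame rather than the crime data. Note that `prcomp()` standardizes with the n - 1 denominator, so compared with `z` above its scores would differ by the constant factor sqrt((n - 1) / n); the eigenvalues of the correlation matrix are not affected.

```r
# Hypothetical toy data with two correlated columns
set.seed(2)
toy <- data.frame(a = rnorm(30), b = rnorm(30), c = rnorm(30))
toy$b <- toy$a + 0.5 * toy$b

# Manual route: eigen-decomposition of the correlation matrix
eig <- eigen(cor(toy))
z_manual <- scale(toy) %*% eig$vectors

# Built-in route
pca <- prcomp(toy, scale. = TRUE)

# Same eigenvalues
all.equal(pca$sdev^2, eig$values)

# Same scores up to the sign of each column
all.equal(abs(unname(pca$x)), abs(unname(z_manual)))
```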

The correlations between the original variables in ‘x’ and the first PC are all strongly negative (between -0.56 and -0.86) except for land area (-0.26). Because these correlations are negative, states with more crime have lower PC1 scores. The first PC can be viewed as an overall crime level (dominated by burglary and rape) together with the population of each state. The correlations between ‘x’ and the second PC are strongly negative for murder and assault and positive for auto theft and larceny. The second PC can therefore be interpreted as a contrast between two groups of crimes (larceny and auto theft versus murder and assault). The correlations between ‘x’ and the third PC are positive for land area (0.77) and larceny (0.42) and negative for population and robbery. The third PC can be viewed as contrasting land area and larceny with population and robbery.

## Factor Loading Matrix, correlations between variables and pc's
corr <- cor(x,z[,1:3])
corr

##                 [,1]         [,2]        [,3]
## land.area -0.2618305 -0.339856554  0.76888535
## popu.1985 -0.6211691  0.003546976 -0.39882003
## murder    -0.5610239 -0.741660358 -0.11259351
## rape      -0.8514624 -0.172993630  0.22729571
## robbery   -0.7961727  0.145977533 -0.36866260
## assault   -0.7814353 -0.489538026 -0.16372467
## burglary  -0.8590623  0.338391323  0.06777117
## larceny   -0.7017641  0.426918846  0.41507351
## autotheft -0.7024132  0.464387163 -0.01707406
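For standardized data, the correlation between variable j and PC k equals the eigenvector entry v[j, k] multiplied by the square root of the k-th eigenvalue, so the loading matrix above could also be obtained without calling `cor()`. The following sketch checks this identity on hypothetical toy data (the matrix `toy` is made up for illustration):

```r
# Hypothetical toy data with some correlation between columns
set.seed(3)
toy <- matrix(rnorm(200), 50, 4)
toy[, 2] <- toy[, 1] + 0.3 * toy[, 2]

xs  <- scale(toy)
eig <- eigen(cor(toy))
z   <- xs %*% eig$vectors

# Correlation of variable j with PC k equals v[j, k] * sqrt(e[k])
loadings_formula <- eig$vectors %*% diag(sqrt(eig$values))
all.equal(unname(cor(xs, z)), loadings_formula)

# Over all PCs, the squared correlations of each variable sum to 1
rowSums(cor(xs, z)^2)
```

The second check explains why every variable lies inside the unit circle when only two PCs are plotted.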

# Factor Loading Plot
par(mfrow = c(1,1))
ucircle = cbind(cos((0:360)/180 * pi), sin((0:360)/180 * pi))
label = c("X1","X2","X3","X4","X5","X6","X7","X8","X9")


plot(ucircle, type = "l", lty = "dashed", col = "blue", xlab = "First PC", ylab = "Second PC", main = "Crime Data cor, PC1 vs PC2")
abline(h = 0, v = 0)
text(corr[,-3], label)

plot(ucircle, type = "l", lty = "dashed", col = "blue", xlab = "First PC", ylab = "Third PC", main = "Crime Data cor, PC1 vs PC3")
abline(h = 0, v = 0)
text(corr[,-2], label)

plot(ucircle, type = "l", lty = "dashed", col = "blue", xlab = "Second PC", ylab = "Third PC",main = "Crime Data cor, PC2 vs PC3")
abline(h = 0, v = 0)
text(corr[,-1], label)
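In a correlation circle, the squared distance of a variable from the origin is the share of that variable's variance captured by the two plotted PCs, so points near the circle are well represented and points near the center are not. As a rough sketch, this share can be computed from the loading matrix printed above (the names `q12` and `q123` are introduced here only for illustration):

```r
# Correlations with PC1-PC3, copied from the loading matrix above
corr <- matrix(c(
  -0.2618305, -0.339856554,  0.76888535,
  -0.6211691,  0.003546976, -0.39882003,
  -0.5610239, -0.741660358, -0.11259351,
  -0.8514624, -0.172993630,  0.22729571,
  -0.7961727,  0.145977533, -0.36866260,
  -0.7814353, -0.489538026, -0.16372467,
  -0.8590623,  0.338391323,  0.06777117,
  -0.7017641,  0.426918846,  0.41507351,
  -0.7024132,  0.464387163, -0.01707406),
  ncol = 3, byrow = TRUE,
  dimnames = list(c("land.area", "popu.1985", "murder", "rape", "robbery",
                    "assault", "burglary", "larceny", "autotheft"),
                  c("PC1", "PC2", "PC3")))

# Share of each variable's variance captured by PC1 and PC2
q12 <- rowSums(corr[, 1:2]^2)

# Same share when the third PC is included
q123 <- rowSums(corr^2)

round(cbind(q12, q123), 3)
```

Land area has the smallest share in the PC1-PC2 plane and gains the most from adding PC3, which agrees with the decision to keep the third PC.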

The points in the scatter plots are coded by region: a red X for the Northeast, a blue empty diamond for the Midwest, a green filled circle for the South, and an orange filled triangle for the West.

## scatter plot w.r.t four regions
pch_reg <- reg
pch_reg[reg == "Northeast"] <- 4 # X
pch_reg[reg == "Midwest"] <- 5 # empty diamond
pch_reg[reg == "South"] <- 16 # filled circle
pch_reg[reg == "West"] <- 17 # filled triangle
pch_reg <- as.numeric(pch_reg)

col_reg <- reg
col_reg[reg == "Northeast"] <- 'red'
col_reg[reg == "Midwest"] <- 'blue'
col_reg[reg == "South"] <- 'green'
col_reg[reg == "West"] <- 'orange'
col_reg

##  [1] "red"    "red"    "red"    "red"    "red"    "red"    "red"    "red"   
##  [9] "red"    "blue"   "blue"   "blue"   "blue"   "blue"   "blue"   "blue"  
## [17] "blue"   "blue"   "blue"   "blue"   "blue"   "green"  "green"  "green" 
## [25] "green"  "green"  "green"  "green"  "green"  "green"  "green"  "green" 
## [33] "green"  "green"  "green"  "green"  "green"  "orange" "orange" "orange"
## [41] "orange" "orange" "orange" "orange" "orange" "orange" "orange" "orange"
## [49] "orange" "orange"
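The same region-to-symbol mapping could also be written as a lookup in named vectors, which avoids overwriting a copy of `reg` and converting back with `as.numeric()`. This is only a sketch of an alternative; `reg_toy`, `pch_map`, and `col_map` are hypothetical names, and `reg_toy` is a short made-up vector in the same coding as `reg`:

```r
# Hypothetical region labels in the same coding as `reg`
reg_toy <- c("Northeast", "Midwest", "South", "West", "South")

# One lookup table per graphical attribute
pch_map <- c(Northeast = 4, Midwest = 5, South = 16, West = 17)
col_map <- c(Northeast = "red", Midwest = "blue",
             South = "green", West = "orange")

# Index by name; as.character() keeps this safe if the labels are a factor
pch_toy <- unname(pch_map[as.character(reg_toy)])
col_toy <- unname(col_map[as.character(reg_toy)])

data.frame(region = reg_toy, pch = pch_toy, col = col_toy)
```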

In the scatter plot of PC1 and PC2, the points are spread fairly uniformly along PC1, while points of the same color are grouped together along PC2. The red X's and orange triangles lie in the upper part of the plot, the blue diamonds in the middle, and the green circles in the lower part. So, the Northeast and the West have more auto theft, larceny, and burglary, less murder and assault, and a smaller contribution from land area, while the South shows the opposite pattern.

In the scatter plot of PC1 and PC3, most of the orange points lie between 0 and 2 on PC3, most of the red points lie between -2 and 0, and the remaining points (blue and green) are close to 0. This suggests that the West is positively affected by larceny and land area and negatively affected by robbery and population, while the Northeast shows the opposite pattern.

plot(z[,1], z[,2], pch = pch_reg, col = col_reg, xlab = "PC1", ylab = "PC2", main = "PC1 vs PC2 wrt 4 regions")

plot(z[,1], z[,3], pch = pch_reg, col = col_reg, xlab = "PC1", ylab = "PC3", main = "PC1 vs PC3 wrt 4 regions")

plot(z[,2], z[,3], pch = pch_reg, col = col_reg, xlab = "PC2", ylab = "PC3", main = "PC2 vs PC3 wrt 4 regions")
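The regional differences read off the scatter plots could also be summarized numerically, for example by the mean PC score of each region. The sketch below uses made-up scores and labels (`z_toy` and `reg_toy` are hypothetical); with the real objects, `z[, 1:3]` and `reg` would take their place:

```r
# Hypothetical scores and region labels, only to illustrate the summary
set.seed(4)
z_toy   <- cbind(PC1 = rnorm(8),
                 PC2 = c(2, 2.5, 1.5, 2, -2, -1.5, -2.5, -2))
reg_toy <- rep(c("West", "South"), each = 4)

# Mean score of each region on each PC
region_means <- aggregate(as.data.frame(z_toy),
                          by = list(region = reg_toy), FUN = mean)
region_means
```

A clear difference between the regional means on a PC supports the grouping seen in the corresponding plot.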

ii) Redo the analysis excluding the variable ‘area of the state’

Most of the process is identical to the previous one, except that the ‘land area’ column is removed together with the categorical columns.

rm(list = ls())
## data manipulation

data <- read.csv("uscrime.csv")
head(data)

##    X land.area popu.1985 murder rape robbery assault burglary larceny autotheft
## 1 ME     33265      1164    1.5  7.0    12.6      62      562    1055       146
## 2 NH      9279       998    2.0  6.0    12.1      36      566     929       172
## 3 VT      9614       535    1.3 10.3     7.6      55      731     969       124
## 4 MA      8284      5822    3.5 12.0    99.5      88     1134    1531       878
## 5 RI      1212       968    3.2  3.6    78.3     120     1019    2186       859
## 6 CT      5018      3174    3.5  9.1    70.4      87     1084    1751       484
##         reg         div
## 1 Northeast New England
## 2 Northeast New England
## 3 Northeast New England
## 4 Northeast New England
## 5 Northeast New England
## 6 Northeast New England

state <- data$X
area <- data$land.area
reg <- data$reg # 4
div <- data$div # 9
data <- data[,-c(1,2,11,12)]


## standardization
nr = nrow(data) # 50
nc = ncol(data) # 8

x <- (data - matrix(apply(data, 2, mean), nr, nc, byrow = T))/matrix(sqrt((nr - 1)*apply(data, 2, var)/nr), nr, nc, byrow = T)

The eigenvalues are then, in decreasing order, 4.41, 1.43, 0.88, etc., and the cumulative proportions of explained variance are 0.55, 0.73, 0.84, and so on. The first two eigenvalues are the only ones bigger than 1 and already explain 73% of the total variance, so 2 principal components are enough.

## spectral decomposition
eig  = eigen((nr - 1) * cov(x)/nr)
e    = eig$values
v    = eig$vectors


## corresponding eigenvalues
par(mfrow = c(1, 2))
plot(c(1:length(e)),e, main = 'eigenvalues', pch = 16,
     xlab = 'index', ylab = 'eigenvalue', col="black")
perc = e/sum(e)            # explained variance
cum  = cumsum(e)/sum(e)    # cumulated explained percentages
plot(c(1:length(e)),cum, main = '% of eigenvalue', pch = 15,
     xlab = 'index', ylab = '%', col="red")

variance_table = rbind(e,perc,cum)
variance_table

##           [,1]      [,2]      [,3]       [,4]       [,5]      [,6]       [,7]
## e    4.4090091 1.4347944 0.8775517 0.48197976 0.28111566 0.2482464 0.13520207
## perc 0.5511261 0.1793493 0.1096940 0.06024747 0.03513946 0.0310308 0.01690026
## cum  0.5511261 0.7304754 0.8401694 0.90041687 0.93555633 0.9665871 0.98348738
##            [,8]
## e    0.13210094
## perc 0.01651262
## cum  1.00000000

z = as.matrix(x) %*% v  # principal components (9.21)

The correlations between ‘x’ and the first PC are all strongly negative, between -0.55 and -0.87, so the first PC is again seen as an overall crime level (dominated by burglary, rape, and robbery) together with population. The correlations between ‘x’ and the second PC are strongly positive for larceny and auto theft and strongly negative for murder and assault. So the second PC again contrasts larceny and auto theft with murder and assault. Note that this interpretation is similar to the previous one.

## Factor Loading Matrix, correlations between variables and pc's
corr <- cor(x,z[,1:2])
corr

##                 [,1]        [,2]
## popu.1985 -0.6271407 -0.07582602
## murder    -0.5517729 -0.75798486
## rape      -0.8390387 -0.12167903
## robbery   -0.8104215  0.04846703
## assault   -0.7810609 -0.53281321
## burglary  -0.8675270  0.32326193
## larceny   -0.6962629  0.50166158
## autotheft -0.7092533  0.44418058

The four types of points in the scatter plot of the PCs have the same meaning as before.

## scatter plot w.r.t four regions
pch_reg <- reg
pch_reg[reg == "Northeast"] <- 4 # X
pch_reg[reg == "Midwest"] <- 5 # empty diamond
pch_reg[reg == "South"] <- 16 # filled circle
pch_reg[reg == "West"] <- 17 # filled triangle
pch_reg <- as.numeric(pch_reg)

col_reg <- reg
col_reg[reg == "Northeast"] <- 'red'
col_reg[reg == "Midwest"] <- 'blue'
col_reg[reg == "South"] <- 'green'
col_reg[reg == "West"] <- 'orange'
col_reg

##  [1] "red"    "red"    "red"    "red"    "red"    "red"    "red"    "red"   
##  [9] "red"    "blue"   "blue"   "blue"   "blue"   "blue"   "blue"   "blue"  
## [17] "blue"   "blue"   "blue"   "blue"   "blue"   "green"  "green"  "green" 
## [25] "green"  "green"  "green"  "green"  "green"  "green"  "green"  "green" 
## [33] "green"  "green"  "green"  "green"  "green"  "orange" "orange" "orange"
## [41] "orange" "orange" "orange" "orange" "orange" "orange" "orange" "orange"
## [49] "orange" "orange"

Each type of point is located in a similar part of the plot as in the previous scatter plot of PC1 and PC2. This shows that PC1 and PC2 here have a similar meaning to those in the previous question (the Northeast and the West have more auto theft, larceny, and burglary and less murder and assault, while the South shows the opposite).

plot(z[,1], z[,2], pch = pch_reg, col = col_reg, xlab = "PC1", ylab = "PC2", main = "PC1 vs PC2 wrt 4 regions")

Another scatter plot of PC1 and PC2 is drawn with respect to land area, where red marks states whose land area is at or above the average and blue marks states below the average. The points are scattered quite uniformly along PC1. Although the red points are more concentrated near the center than the blue ones along PC2, they are still spread fairly uniformly there as well. So, land area appears to have only a minor effect on the crime data.

## scatter plot w.r.t land area
av <- mean(area)

pch_area <- area
pch_area[area >= av] <- 15 # filled square
pch_area[area < av] <- 0 # empty square

col_area <- area
col_area[area >= av] <- 'red'
col_area[area < av] <- 'blue'

plot(z[,1], z[,2], pch = pch_area, col = col_area, xlab = "PC1", ylab = "PC2", main = "PC1 and PC2 wrt land area")
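One thing to keep in mind with a split at the mean: if the land areas are strongly right-skewed, a few very large states pull the mean up and only a small group ends up in the "above average" class. Splitting at the median is a possible alternative. The sketch below illustrates the difference on a made-up vector (`area_toy` is hypothetical, not the real land areas) and also shows the two-way coding written with `ifelse()`:

```r
# Hypothetical land areas with one very large value
area_toy <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

above_mean   <- area_toy >= mean(area_toy)     # mean is 14.5
above_median <- area_toy >= median(area_toy)   # median is 5.5

# Two-way coding of symbol and color in one step each
pch_toy <- ifelse(above_median, 15, 0)         # filled vs. empty square
col_toy <- ifelse(above_median, "red", "blue")

# Number of observations in the upper class under each split
c(mean_split = sum(above_mean), median_split = sum(above_median))
```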


Q2. (Exercise 11.9) Repeat Q1 using the US HEALTH data set (Table 22.16).

i) Apply PCA to the US HEALTH data set. Interpret the results. Would it be necessary to look at the third PC? Can you see any difference between the four regions?

The data manipulation and standardization are almost the same as in Question 1.

rm(list = ls())
## data manipulation
health  <- read.table("ushealth.txt", header = TRUE)

state <- health$state
reg <- health$r
div <- health$d
data <- health[,-c(1,13,14)]

## standardization
nr <- nrow(data) # 50
nc <- ncol(data) # 11

x <- (data - matrix(apply(data, 2, mean), nr, nc, byrow = T))/matrix(sqrt((nr - 1)*apply(data, 2, var)/nr), nr, nc, byrow = T)

After obtaining the standardized data ‘x’, the eigenvalues and eigenvectors are calculated. The eigenvalues are, in decreasing order, 3.948, 2.920, 1.468, 1.038, etc., and the cumulative percentages of explained variance are 35.89, 62.44, 75.78, 85.22%, and so on. Although the fourth eigenvalue is slightly bigger than 1, the first 3 eigenvalues already explain more than 75% of the total variance, whereas the first two explain only 62%. So, 3 principal components are enough, and the third PC is needed.

## spectral decomposition
eig  = eigen((nr - 1) * cov(x)/nr) 
e    = eig$values
v    = eig$vectors

## corresponding eigenvalues
par(mfrow = c(1, 2))
plot(c(1:length(e)),e, main = 'eigenvalues', pch = 16,
     xlab = 'index', ylab = 'value', col="black")
perc = e/sum(e)            # explained variance
cum  = cumsum(e)/sum(e)    # cumulated explained percentages
plot(c(1:length(e)),cum, main = '% of eigenvalue', pch = 15,
     xlab = 'index', ylab = '%', col="red")

variance_table = rbind(e,perc,cum)
variance_table

##           [,1]      [,2]      [,3]      [,4]       [,5]       [,6]       [,7]
## e    3.9481265 2.9204669 1.4677020 1.0384144 0.65959730 0.46766511 0.25108440
## perc 0.3589206 0.2654970 0.1334275 0.0944013 0.05996339 0.04251501 0.02282585
## cum  0.3589206 0.6244176 0.7578450 0.8522463 0.91220974 0.95472475 0.97755061
##            [,8]        [,9]       [,10]        [,11]
## e    0.15435760 0.047311680 0.038596203 0.0066778621
## perc 0.01403251 0.004301062 0.003508746 0.0006070784
## cum  0.99158311 0.995884176 0.999392922 1.0000000000
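For the health data the two rules of thumb do not agree, which is why the choice of 3 components needs an argument. A short sketch using the eigenvalues printed above (the 70% threshold is again the rule of thumb assumed in this report; `n_kaiser`, `n_cum`, and `share_pc4` are illustrative names):

```r
# Eigenvalues copied from the output above
e <- c(3.9481265, 2.9204669, 1.4677020, 1.0384144, 0.65959730,
       0.46766511, 0.25108440, 0.15435760, 0.047311680,
       0.038596203, 0.0066778621)

cum <- cumsum(e) / sum(e)

n_kaiser  <- sum(e >= 1)              # eigenvalue-at-least-1 rule
n_cum     <- which(cum >= 0.70)[1]    # 70% cumulative variance rule
share_pc4 <- e[4] / sum(e)            # variance added by a fourth PC

c(kaiser = n_kaiser, cumulative = n_cum, share_pc4 = round(share_pc4, 3))
```

The eigenvalue rule would keep 4 components and the cumulative rule 3; the fourth component would add a little over 9% of the total variance.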

z = as.matrix(x) %*% v  # principal components (9.21)

The correlations between ‘x’ and the first PC are sizeable for every variable except land area (-0.02), and accident is the only variable whose correlation is positive. So, PC1 can be read as a general health effect (where cancer and cardiovascular disease are prominent) together with the medical variables and population, contrasted with accidents. The main variables of PC2 are doctors, hospitals, population, and land area, all with negative correlations. So, PC2 can be interpreted as the effect of the medical variables, population, and land area. PC3 has strongly negative correlations with pneumonia flu, pulmonary, and land area. So, PC3 can be viewed as the effect of pneumonia flu and pulmonary together with land area.

## Factor Loading Matrix, correlations between variables and pc's
corr <- cor(x,z[,1:3])
corr

##             [,1]       [,2]        [,3]
## land -0.01552047 -0.6372141 -0.58649453
## popu -0.72070275 -0.6728140  0.09107150
## acc   0.65861428 -0.2623557 -0.28775808
## card -0.76058938  0.5081829 -0.18818343
## canc -0.81234834  0.5064357  0.02643019
## pul  -0.31877910  0.4176085 -0.57629283
## pneu -0.35097896  0.2815092 -0.70269095
## diab -0.52296037  0.5770103  0.31638309
## liv  -0.53308354 -0.1950664  0.21406581
## doc  -0.73813353 -0.6156915  0.12279398
## hosp -0.65496666 -0.6888229 -0.09811779

# Factor Loading Plot
par(mfrow = c(1,1))

ucircle = cbind(cos((0:360)/180 * pi), sin((0:360)/180 * pi))
label = c("X1","X2","X3","X4","X5","X6","X7","X8","X9","X10","X11")

plot(ucircle, type = "l", lty = "dashed", col = "blue", xlab = "First PC", ylab = "Second PC", main = "Health Data cor, PC1 vs PC2")
abline(h = 0, v = 0)
text(corr[,-3], label)

plot(ucircle, type = "l", lty = "dashed", col = "blue", xlab = "First PC", ylab = "Third PC", main = "Health Data cor, PC1 vs PC3")
abline(h = 0, v = 0)
text(corr[,-2], label)

plot(ucircle, type = "l", lty = "dashed", col = "blue", xlab = "Second PC", ylab = "Third PC",main = "Health Data cor, PC2 vs PC3")
abline(h = 0, v = 0)
text(corr[,-1], label)

The four types of points in the scatter plots of the PCs are similar to before: the colors have the same meaning, but the shape of each point is different.

pch_reg <- reg # 1: Northeast, 2: Midwest, 3: South, 4: West
# pch 1: empty circle, 2: empty triangle, 3: plus sign, 4: X

col_reg <- reg
col_reg[reg == 1] <- 'red'
col_reg[reg == 2] <- 'blue'
col_reg[reg == 3] <- 'green'
col_reg[reg == 4] <- 'orange'
col_reg

##  [1] "red"    "red"    "red"    "red"    "red"    "red"    "red"    "red"   
##  [9] "red"    "blue"   "blue"   "blue"   "blue"   "blue"   "blue"   "blue"  
## [17] "blue"   "blue"   "blue"   "blue"   "blue"   "green"  "green"  "green" 
## [25] "green"  "green"  "green"  "green"  "green"  "green"  "green"  "green" 
## [33] "green"  "green"  "green"  "green"  "green"  "orange" "orange" "orange"
## [41] "orange" "orange" "orange" "orange" "orange" "orange" "orange" "orange"
## [49] "orange" "orange"

Most of the red circles (Northeast) and blue triangles (Midwest) lie between -4 and 0 on PC1, and most of the orange X’s (West) and green ‘+’ (South) lie between 0 and 4 (the red and orange points are the most widely spread). So, the West and the South are more affected by accidents and less by cardiovascular disease and cancer, while the Northeast and the Midwest show the opposite.

plot(z[,1], z[,2], pch = pch_reg, col = col_reg, xlab = "PC1", ylab = "PC2", main = "First vs Second PC")

The PC2 values of the red points are mostly between 0 and 2, those of the orange points are mostly between -2 and 0, and those of the remaining points are around 0. Since doctors, hospitals, population, and land area all correlate negatively with PC2, the West is interpreted as being positively affected by the medical variables as well as land area and population, while the Northeast shows the opposite.

plot(z[,1], z[,3], pch = pch_reg, col = col_reg, xlab = "PC1", ylab = "PC3", main = "First vs Third PC")

Along PC3, the red points have generally positive values, and the blue and orange points have generally negative values. Since pneumonia flu, pulmonary, and land area correlate negatively with PC3, the Midwest and the West show a stronger effect of pneumonia flu, pulmonary, and land area, while the Northeast shows the opposite.

plot(z[,2], z[,3], pch = pch_reg, col = col_reg, xlab = "PC2", ylab = "PC3", main = "Second vs Third PC")

ii) Redo the analysis excluding the variable ‘area of the state’

After deleting the land area column together with the categorical columns, the eigenvalues and eigenvectors are calculated from the standardized data ‘x’.

rm(list = ls())
## data manipulation
health  <- read.table("ushealth.txt", header = TRUE)

state <- health$state
land <- health$land
reg <- health$r
div <- health$d
data <- health[,-c(1,2,13,14)]

## standardization
nr <- nrow(data) # 50
nc <- ncol(data) # 10

x <- (data - matrix(apply(data, 2, mean), nr, nc, byrow = T))/matrix(sqrt((nr - 1)*apply(data, 2, var)/nr), nr, nc, byrow = T)

The eigenvalues are, in decreasing order, 3.948, 2.634, 1.180, 1.038, etc., and the cumulative percentages of explained variance are 39.48, 65.82, 77.61, 88.00%, and so on. Although the fourth eigenvalue is slightly bigger than 1, the first 3 eigenvalues already explain more than 77% of the total variance. So, 3 principal components are enough.

## spectral decomposition
eig  = eigen((nr - 1) * cov(x)/nr) 
e    = eig$values
v    = eig$vectors


## corresponding eigenvalues
par(mfrow = c(1, 2))
plot(c(1:length(e)),e, main = 'eigenvalues', pch = 16,
     xlab = 'index', ylab = 'value', col="black")
perc = e/sum(e)            # explained variance
cum  = cumsum(e)/sum(e)    # cumulated explained percentages
plot(c(1:length(e)),cum, main = '% of eigenvalue', pch = 15,
     xlab = 'index', ylab = '%', col="red")

variance_table = rbind(e,perc,cum)
variance_table

##           [,1]      [,2]      [,3]      [,4]       [,5]      [,6]       [,7]
## e    3.9479767 2.6336713 1.1795218 1.0384143 0.49622333 0.3809570 0.21064256
## perc 0.3947977 0.2633671 0.1179522 0.1038414 0.04962233 0.0380957 0.02106426
## cum  0.3947977 0.6581648 0.7761170 0.8799584 0.92958075 0.9676764 0.98874070
##             [,8]        [,9]        [,10]
## e    0.063910552 0.041717284 0.0069651231
## perc 0.006391055 0.004171728 0.0006965123
## cum  0.995131759 0.999303488 1.0000000000

z = as.matrix(x) %*% v  # principal components (9.21)

The correlations have almost the same meaning as before, except that the signs for PC2 and PC3 are reversed. The correlations between ‘x’ and the first PC are sizeable for all variables, and accident is again the only variable whose correlation is positive. So, PC1 can be read as a general health effect (where cancer and cardiovascular disease are prominent) together with the medical variables and population, contrasted with accidents. The dominant variables of PC2 are doctors, hospitals, and population, all with positive correlations. So, PC2 can be interpreted as the effect of the medical variables and population. PC3 has strongly positive correlations with pneumonia flu and pulmonary and a negative correlation with diabetes. So, PC3 can be viewed as contrasting pneumonia flu and pulmonary with diabetes.

## Factor Loading Matrix, correlations between variables and pc's
corr <- cor(x,z[,1:3])
corr

##            [,1]       [,2]        [,3]
## popu -0.7169546  0.6824347  0.04674594
## acc   0.6606971  0.1970303  0.32460680
## card -0.7632240 -0.5275494  0.09564432
## canc -0.8153869 -0.4837969 -0.11423601
## pul  -0.3198647 -0.5374017  0.41758479
## pneu -0.3515092 -0.3908038  0.73148958
## diab -0.5265504 -0.5306942 -0.54560844
## liv  -0.5325528  0.2515986 -0.10120358
## doc  -0.7349403  0.6449218  0.05443819
## hosp -0.6505290  0.6437196  0.17152563

t(corr)

##             popu       acc        card       canc        pul       pneu
## [1,] -0.71695461 0.6606971 -0.76322405 -0.8153869 -0.3198647 -0.3515092
## [2,]  0.68243470 0.1970303 -0.52754939 -0.4837969 -0.5374017 -0.3908038
## [3,]  0.04674594 0.3246068  0.09564432 -0.1142360  0.4175848  0.7314896
##            diab        liv         doc       hosp
## [1,] -0.5265504 -0.5325528 -0.73494032 -0.6505290
## [2,] -0.5306942  0.2515986  0.64492183  0.6437196
## [3,] -0.5456084 -0.1012036  0.05443819  0.1715256
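The sign of an eigenvector is arbitrary, so a PC may come out reversed between two analyses without any change in meaning. When two loading tables are compared, it can help to flip the columns of one table so that they point the same way as the other. The helper `align_signs()` below is a hypothetical sketch, not part of the original analysis; it is applied to three shared variables whose PC1 and PC2 correlations are copied from the two health tables:

```r
# Flip columns of `new` that point in the opposite direction to `ref`
align_signs <- function(ref, new) {
  s <- sign(colSums(ref * new))   # +1 if same direction, -1 if reversed
  s[s == 0] <- 1                  # leave orthogonal columns unchanged
  sweep(new, 2, s, `*`)
}

# PC1-PC2 correlations with land area included (part i)
part1 <- matrix(c(-0.72070275, -0.6728140,
                   0.65861428, -0.2623557,
                  -0.73813353, -0.6156915),
                ncol = 2, byrow = TRUE,
                dimnames = list(c("popu", "acc", "doc"), c("PC1", "PC2")))

# PC1-PC2 correlations with land area excluded (part ii)
part2 <- matrix(c(-0.7169546, 0.6824347,
                   0.6606971, 0.1970303,
                  -0.7349403, 0.6449218),
                ncol = 2, byrow = TRUE, dimnames = dimnames(part1))

aligned <- align_signs(part1, part2)
round(aligned - part1, 3)   # differences after removing the sign flip
```

For these three variables, PC1 is left as it is and only PC2 is flipped.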

# Factor Loading Plot
par(mfrow = c(1,1))
ucircle = cbind(cos((0:360)/180 * pi), sin((0:360)/180 * pi))
label = c("X2","X3","X4","X5","X6","X7","X8","X9","X10","X11")


plot(ucircle, type = "l", lty = "dashed", col = "blue", xlab = "First PC", ylab = "Second PC", main = "Health Data cor, PC1 vs PC2")
abline(h = 0, v = 0)
text(corr[,-3], label)

plot(ucircle, type = "l", lty = "dashed", col = "blue", xlab = "First PC", ylab = "Third PC", main = "Health Data cor, PC1 vs PC3")
abline(h = 0, v = 0)
text(corr[,-2], label)

plot(ucircle, type = "l", lty = "dashed", col = "blue", xlab = "Second PC", ylab = "Third PC",main = "Health Data cor, PC2 vs PC3")
abline(h = 0, v = 0)
text(corr[,-1], label)

The four types of points in the scatter plots of the PCs have the same meaning as in the previous part.

pch_reg <- reg # 1: Northeast, 2: Midwest, 3: South, 4: West
# pch 1: empty circle, 2: empty triangle, 3: plus sign, 4: X

col_reg <- reg
col_reg[reg == 1] <- 'red'
col_reg[reg == 2] <- 'blue'
col_reg[reg == 3] <- 'green'
col_reg[reg == 4] <- 'orange'
col_reg

##  [1] "red"    "red"    "red"    "red"    "red"    "red"    "red"    "red"   
##  [9] "red"    "blue"   "blue"   "blue"   "blue"   "blue"   "blue"   "blue"  
## [17] "blue"   "blue"   "blue"   "blue"   "blue"   "green"  "green"  "green" 
## [25] "green"  "green"  "green"  "green"  "green"  "green"  "green"  "green" 
## [33] "green"  "green"  "green"  "green"  "green"  "orange" "orange" "orange"
## [41] "orange" "orange" "orange" "orange" "orange" "orange" "orange" "orange"
## [49] "orange" "orange"

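In the loading matrix above, the sign of PC1 is the same as in part i) (accident is the only variable that correlates positively with it), so the regions separate along PC1 as before: most of the red and blue points lie between -4 and 0 on PC1, and most of the orange X’s and green ‘+’ lie between 0 and 4. So, the West and the South are more affected by accidents and less by cardiovascular disease and cancer, while the Northeast and the Midwest show the opposite.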

plot(z[,1], z[,2], pch = pch_reg, col = col_reg, xlab = "PC1", ylab = "PC2", main = "First vs Second PC")

The PC2 values of the red points are mostly between -2 and 0, those of the orange points are mostly between 0 and 2, and those of the remaining points are around 0. So, the West is interpreted as being positively affected by the medical variables as well as population, while the Northeast shows the opposite.

plot(z[,1], z[,3], pch = pch_reg, col = col_reg, xlab = "PC1", ylab = "PC3", main = "First vs Third PC")

Along PC3, the red points have generally negative values, and the blue and orange points have generally positive values. So, the Midwest and the West show a stronger effect of pneumonia flu and pulmonary and a weaker effect of diabetes, while the Northeast shows the opposite.

plot(z[,2], z[,3], pch = pch_reg, col = col_reg, xlab = "PC2", ylab = "PC3", main = "Second vs Third PC")

Since the PCs have almost the same meaning as in part i), land area has an insignificant effect on the data.
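One rough way to quantify this conclusion is to compare the absolute correlations of the ten shared variables with PC1 and PC2 in the two analyses. The values below are copied from the two loading matrices of Q2; absolute values are used because the sign of a PC is arbitrary, and the names `d1` and `d2` are introduced only for illustration:

```r
vars <- c("popu", "acc", "card", "canc", "pul",
          "pneu", "diab", "liv", "doc", "hosp")

# Correlations with PC1 and PC2, land area included (part i)
pc1_with <- c(-0.72070275, 0.65861428, -0.76058938, -0.81234834, -0.31877910,
              -0.35097896, -0.52296037, -0.53308354, -0.73813353, -0.65496666)
pc2_with <- c(-0.6728140, -0.2623557, 0.5081829, 0.5064357, 0.4176085,
               0.2815092, 0.5770103, -0.1950664, -0.6156915, -0.6888229)

# Correlations with PC1 and PC2, land area excluded (part ii)
pc1_without <- c(-0.7169546, 0.6606971, -0.7632240, -0.8153869, -0.3198647,
                 -0.3515092, -0.5265504, -0.5325528, -0.7349403, -0.6505290)
pc2_without <- c(0.6824347, 0.1970303, -0.5275494, -0.4837969, -0.5374017,
                 -0.3908038, -0.5306942, 0.2515986, 0.6449218, 0.6437196)

# Change in absolute correlation for each variable
d1 <- setNames(abs(abs(pc1_with) - abs(pc1_without)), vars)
d2 <- setNames(abs(abs(pc2_with) - abs(pc2_without)), vars)

round(rbind(PC1 = d1, PC2 = d2), 3)
```

The PC1 correlations change by less than 0.005 for every variable, while the PC2 correlations change somewhat more, most of all for pulmonary and pneumonia flu.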
