MVA HW4

Q1 Exercise 14.6: Compute Fisher’s linear discrimination function for the 20 bank notes from Example 13.6. Apply it to the entire bank data set. How many observations are misclassiffed?

1) LDA for 20 Random Bank Notes

The results of LDA for 20 random bank notes with a seed number of 127 yielded the following column numbers: “195,” “38,” “137,” “142,” “48,” “17,” “57,” “67,” “47,” “70,” “31,” “55,” “94,” “50,” “153,” “143,” “51,” “76,” “91,” and “196.” These column numbers correspond to the data extracted from the bank notes.

bank2 <- read.table("bank2.dat")
set.seed(127)
x <- bank2[sample(1:nrow(bank2),20),]
rownames(x)

##  [1] "195" "38"  "137" "142" "48"  "17"  "57"  "67"  "47"  "70"  "31"  "55" 
## [13] "94"  "50"  "153" "143" "51"  "76"  "91"  "196"

Using these results, discriminant scores were calculated for genuine notes (column number ≤ 100) and counterfeit notes (column number > 100).

xg  = x[as.numeric(row.names(x))<=100, ]                  
xf  = x[as.numeric(row.names(x)) >100, ]
mg  = colMeans(xg)              # Determine the mean for the seperate groups
mf  = colMeans(xf)

Upon examining the graph, it becomes evident that counterfeit notes all have scores below 0, while genuine notes all have scores above 0, indicating a correct discrimination between the two.

The density plot based on the scores is presented as follows.

ng <- nrow(xg)
nf <- nrow(xf)
nc <- ncol(bank2)

m   = (mg + mf)/2
w   = ng*cov(xg) + nf*cov(xf) # matrix w 
d   = mg - mf                   # Difference in means
a   = solve(w) %*% d            # Determine the factors for linear combinations

yg = as.matrix(xg - matrix(m, nrow = ng, ncol = nc, byrow = T)) %*% a  # Discriminant rule for genuine notes
yf = as.matrix(xf - matrix(m, nrow = nf, ncol = nc, byrow = T)) %*% a  # Discriminant rule for forged notes

# plot
fg = density(yg)   # density of projection of genuine notes
ff = density(yf)   # density of projection of forged notes

plot(ff, lwd = 3, col = "red", xlab = "", ylab = "Densities of Projections", main = "Densities of Projections of 20 Swiss bank notes", xlim = c(-3,4), cex.main = 0.8)
lines(fg, lwd = 3, col = "blue", lty = 2)
text(mean(yf), 0.3, "Forged", col = "red")
text(mean(yg), 0.3, "Genuine", col = "blue")

When applying LDA, the number of genuine notes (true1) falsely classified as counterfeit (true2) and the number of counterfeit notes falsely classified as genuine were both 0. This leads to the confusion matrix shown in the table, with an APER value of 0%.

xgtest = yg
sg = sum(xgtest < 0)            # Number of misclassified genuine notes
sg # 0

## [1] 0

xftest = yf                     # Number of misclassified forged notes
sf = sum(xftest > 0)
sf  # 0

## [1] 0

aper <- matrix(c(ng-sg, sf, sg, nf-sf), ncol = 2, byrow = TRUE)
rownames(aper) <- c('pred1', 'pred2')
colnames(aper) <- c('true1', 'true2')
aper

##       true1 true2
## pred1    14     0
## pred2     0     6

After applying the hold-out procedure, only 1 genuine note was incorrectly classified as counterfeit, and no counterfeit notes were misclassified as genuine. This results in the confusion matrix shown in the left table of Figure 1-2, with an AER value of 5%.

compute_aer_bank_20 <- function(data){
  data <- as.matrix(data)
  i     = 0
  mis1  = 0
  mis2  = 0
  corr1 <- 0
  corr2 <- 0
  n <- nrow(data)
  gf <- rep(1,n)
  gf[as.numeric(row.names(data)) >100] <- 2

  
  while (i < n) {
    i     = i + 1
    xi    = subset(data, 1:n != i)
    treei = subset(gf, 1:n != i)
    t1    = subset(xi, treei == 1)
    t2    = subset(xi, treei == 2)
    m1    = colMeans(t1)                # mean of first cluster
    m2    = colMeans(t2)                # mean of second cluster
    m     = (m1 + m2)/2                 # mean of both clusters
    s     = ((nrow(t1) - 1)*cov(t1) + (nrow(t2) - 1)*cov(t2))/(n-2)
    
    alpha = solve(s) %*% (m1 - m2)
    # actual 1, but classified as 2
    mis1 = as.numeric(mis1+(gf[i] == 1)*((data[i,]-m) %*% alpha < 0)) 
    # actual 2, but classified as 1
    mis2 = as.numeric(mis2+(gf[i] == 2)*((data[i,]-m) %*% alpha > 0))
    
    corr1 = as.numeric(corr1+(gf[i] == 1)*((data[i,]-m) %*% alpha > 0))     
    corr2 = as.numeric(corr2+(gf[i] == 2)*((data[i,]-m) %*% alpha < 0))           
  }
  
  aer = matrix(c(corr1, mis2, mis1, corr2), ncol = 2, byrow = TRUE)
  rownames(aer) <- c('pred1', 'pred2')
  colnames(aer) <- c('true1', 'true2')
  return(aer)
}


aer <- compute_aer_bank_20(x)
aer

##       true1 true2
## pred1    13     0
## pred2     1     6

2) LDA for All Bank Notes

For all data points, discriminant scores were computed, and based on these scores, density plots were created for genuine and counterfeit notes as shown in the right part of Figure 1-1. It is notable that some genuine notes have negative scores, which indicates they were incorrectly classified as counterfeit.

rm(list = ls())
bank2 <- read.table("bank2.dat")

xg  = bank2[1:100, ]                # Group first 100 observations    
xf  = bank2[101:200, ]              # Group second 100 observations
mg  = colMeans(xg)              # Determine the mean for the seperate groups
mf  = colMeans(xf)

ng <- nrow(xg)
nf <- nrow(xf)
nc <- ncol(bank2)

m   = (mg + mf)/2
#w   = ng*cov(xg) + nf*cov(xf) # matrix w 
w   = (ng-1)*cov(xg) + (nf-1)*cov(xf)
d   = mg - mf                   # Difference in means
a   = solve(w) %*% d            # Determine the factors for linear combinations

yg = as.matrix(xg - matrix(m, nrow = ng, ncol = nc, byrow = T)) %*% a  # Discriminant rule for genuine notes
yf = as.matrix(xf - matrix(m, nrow = nf, ncol = nc, byrow = T)) %*% a  # Discriminant rule for forged notes

# plot
fg = density(yg)                # density of projection of genuine notes
ff = density(yf)                # density of projection of forged notes

plot(ff, lwd = 3, col = "red", xlab = "", ylab = "Densities of Projections", main = "Densities of Projections of all Swiss bank notes", xlim = c(-0.2, 0.2), cex.main = 0.8)
lines(fg, lwd = 3, col = "blue", lty = 2)
text(mean(yf), 3.72, "Forged", col = "red")
text(mean(yg), 2.72, "Genuine", col = "blue")

When applying LDA, only 1 genuine note (true1) was incorrectly classified as counterfeit (true2), and there were no cases of counterfeit notes being misclassified as genuine. This leads to the confusion matrix shown in the right table of Figure 1-2, with an APER value of 0.5%.

Even after applying the hold-out procedure, only 1 genuine note was incorrectly classified as counterfeit, and no counterfeit notes were misclassified as genuine. This results in the confusion matrix shown in the table, with an AER value of 0.5%. Both APER and AER values demonstrate excellent classification performance.

# APER

xgtest = yg
sg = sum(xgtest < 0)            # Number of misclassified genuine notes


xftest = yf                     # Number of misclassified forged notes
sf = sum(xftest > 0)

aper <- matrix(c(ng-sg, sf, sg, nf-sf), ncol = 2, byrow = TRUE)
rownames(aper) <- c('pred1', 'pred2')
colnames(aper) <- c('true1', 'true2')
aper

##       true1 true2
## pred1    99     0
## pred2     1   100

# AER

compute_aer_bank_all <- function(data){
  data <- as.matrix(data)
  i     = 0
  mis1  = 0
  mis2  = 0
  corr1 <- 0
  corr2 <- 0
  n <- nrow(data)
  gf <- append(rep(1,100), rep(2,100))
  
  while (i < n) {
    i     = i + 1
    xi    = subset(data, 1:n != i)
    treei = subset(gf, 1:n != i)
    t1    = subset(xi, treei == 1)
    t2    = subset(xi, treei == 2)
    m1    = colMeans(t1)                # mean of first cluster
    m2    = colMeans(t2)                # mean of second cluster
    m     = (m1 + m2)/2                 # mean of both clusters
    s     = ((nrow(t1) - 1)*cov(t1) + (nrow(t2) - 1)*cov(t2))/(n-2)

    alpha = solve(s) %*% (m1 - m2)
    # actual 1, but classified as 2
    mis1 = as.numeric(mis1+(gf[i] == 1)*((data[i,]-m) %*% alpha < 0)) 
    # actual 2, but classified as 1
    mis2 = as.numeric(mis2+(gf[i] == 2)*((data[i,]-m) %*% alpha > 0))
    
    corr1 = as.numeric(corr1+(gf[i] == 1)*((data[i,]-m) %*% alpha > 0))     
    corr2 = as.numeric(corr2+(gf[i] == 2)*((data[i,]-m) %*% alpha < 0))           
  }
  
  aer = matrix(c(corr1, mis2, mis1, corr2), ncol = 2, byrow = TRUE)
  rownames(aer) <- c('pred1', 'pred2')
  colnames(aer) <- c('true1', 'true2')
  return(aer)
}

aer <- compute_aer_bank_all(bank2)
aer

##       true1 true2
## pred1    99     0
## pred2     1   100

It is important to note that in the context of banknotes, the consequences of incorrectly classifying counterfeit notes as genuine are more significant in the market economy than mistakenly identifying genuine notes as counterfeit. The results from the analysis of 20 random data points and all banknotes indicate that LDA discriminant analysis can be applied effectively without major challenges.

The graph representing the discriminant scores for this data is displayed, with the black line emphasizing x = 0. Here as well, it is apparent that all the red points (counterfeit) have scores below 0, while only one blue point (genuine) has a negative score.

alph <- rbind(yg, yf)
gf <- rep(1:2, each = 100)
p = cbind(alph, gf + 0.05 * rnorm(NROW(gf)))

pch_bank <- gf
pch_bank[gf == 1] <- 16
pch_bank[gf == 2] <- 17

col_bank <- gf
col_bank[gf == 1] <- 'blue'
col_bank[gf == 2] <- 'red'

plot(p[, 1], p[, 2], pch = pch_bank, col = col_bank, 
     xaxt = "n", yaxt = "n", xlab = "", ylab = "",bty = "n")
abline(v = 0, lwd = 2) 
title(paste("Discrimination scores for Bank Note"))
text(min(yg)-0.04 ,1, 'genuine', col = 'blue')
text(max(yf)+0.08, 2, 'forge', col = 'red')

WAIS.csv

Q2 Exercise 14.7: Use the Fisher’s linear discrimination function on the WAIS data set (Table 22.12) and evaluate the results by re-substitution the probabilities of misclassiffcation.

The WAIS data quantifies four intelligence indices for individuals and categorizes them into two groups, Group 1 (absent) and Group 2 (present) based on the absence or presence of cognitive impairment. The density plots for Group 1 and Group 2 are presented below.

rm(list = ls())
wais <- read.csv('WAIS.csv', header = TRUE)
group <- wais$group

x1 <- wais[wais$group==1,-c(1,6)]
x2 <- wais[wais$group==2,-c(1,6)]

m1  = colMeans(x1) 
m2  = colMeans(x2)

n1 <- nrow(x1) # 37
n2 <- nrow(x2) # 12
nc <- ncol(wais[,-c(1,6)]) # 4

m   = (m1 + m2)/2
w   = n1*cov(x1) + n2*cov(x2) # matrix w 
d   = m1 - m2                   # Difference in means
a   = solve(w) %*% d            # Determine the factors for linear combinations

y1 = as.matrix(x1 - matrix(m, nrow = n1, ncol = nc, byrow = T)) %*% a  # Discriminant rule for genuine notes
y2 = as.matrix(x2 - matrix(m, nrow = n2, ncol = nc, byrow = T)) %*% a  # Discriminant rule for forged notes

# plot
f1 = density(y1)                # density of projection of genuine notes
f2 = density(y2)                # density of projection of forged notes

plot(f2, lwd = 3, col = "red", xlab = "", ylab = "Densities of Projections", main = "Densities of Projections of WAIS", xlim = c(-0.2, 0.2),
     ylim = c(0,14))
lines(f1, lwd = 3, col = "blue", lty = 2)
text(mean(y2), 4, "group2", col = "red")
text(mean(y1), 8, "group1", col = "blue")

It is evident that a significant number of Group 1 data points, which should have positive values, have negative scores, while many Group 2 data points, which should have negative values, have positive scores. This indicates substantial misclassification of data.

When applying LDA, 7 cases of Group 1 being incorrectly classified as Group 2 and 4 cases of Group 2 being incorrectly classified as Group 1 were observed. This results in the confusion matrix shown in the left table of Figure 2-2, with an APER value of 22.45%. The graph representing the discriminant scores for this data is displayed in Figure 3, with the black line emphasizing x = 0. Here as well, it is apparent that 4 red points (Group 2) have non-negative values, while 7 blue points (Group 1) have negative values (including those crossing the black line).

# aper
s1 = sum(y1 < 0)            # Number of misclassified genuine notes
s1;n1 #7, 37

## [1] 7

## [1] 37

s2 = sum(y2 > 0)
s2;n2  # 4, 12

## [1] 4

## [1] 12

aper <- matrix(c(n1-s1, s2, s1, n2-s2), ncol = 2, byrow = TRUE)
rownames(aper) <- c('pred1', 'pred2')
colnames(aper) <- c('true1', 'true2')
aper

##       true1 true2
## pred1    30     4
## pred2     7     8

(aper[1,2] + aper[2,1])/(n1+n2)

## [1] 0.2244898

Even after applying the hold-out procedure, 11 cases of Group 1 being incorrectly classified as Group 2 and 4 cases of Group 2 being incorrectly classified as Group 1 were observed. This results in the confusion matrix shown in the right table of Figure 2-2, with an AER value of 30.61%.

compute_aer_wais <- function(data){
  gf <- data$group
  data <- as.matrix(data[,-c(1,6)])
  i     = 0
  mis1  = 0
  mis2  = 0
  corr1 <- 0
  corr2 <- 0
  n <- nrow(data)
  
  while (i < n) {
    i     = i + 1
    xi    = subset(data, 1:n != i)
    treei = subset(gf, 1:n != i)
    t1    = subset(xi, treei == 1)
    t2    = subset(xi, treei == 2)
    m1    = colMeans(t1)                # mean of first cluster
    m2    = colMeans(t2)                # mean of second cluster
    m     = (m1 + m2)/2                 # mean of both clusters
    #w <- nrow(t1)*cov(t1) + nrow(t2)*cov(t2)
    s     = ((nrow(t1) - 1)*cov(t1) + (nrow(t2) - 1)*cov(t2))/(n-2)
    alpha = solve(s) %*% (m1 - m2)
    #alpha <- solve(w) %*% (m1 - m2)
    
    # actual 1, but classified as 2
    mis1 = as.numeric(mis1+(gf[i] == 1)*((data[i,]-m) %*% alpha < 0)) 
    # actual 2, but classified as 1
    mis2 = as.numeric(mis2+(gf[i] == 2)*((data[i,]-m) %*% alpha > 0))
    
    corr1 = as.numeric(corr1+(gf[i] == 1)*((data[i,]-m) %*% alpha > 0))     
    corr2 = as.numeric(corr2+(gf[i] == 2)*((data[i,]-m) %*% alpha < 0))           
  }
  
  aer = matrix(c(corr1, mis2, mis1, corr2), ncol = 2, byrow = TRUE)
  rownames(aer) <- c('pred1', 'pred2')
  colnames(aer) <- c('true1', 'true2')
  return(aer)
}

aer <- compute_aer_wais(wais)
aer

##       true1 true2
## pred1    26     4
## pred2    11     8

(aer[1,2] + aer[2,1])/(n1+n2)

## [1] 0.3061224

Discriminant scores were computed for all data points.

# discrimination scores
alph <- rbind(y2, y1)
p = cbind(alph, group + 0.05 * rnorm(NROW(group)))

pch_wais <- group
pch_wais[group == 1] <- 16
pch_wais[group == 2] <- 17

col_wais <- group
col_wais[group == 1] <- 'blue'
col_wais[group == 2] <- 'red'

plot(p[, 1], p[, 2], pch = pch_wais, col = col_wais, 
     xaxt = "n", yaxt = "n", xlab = "", ylab = "",bty = "n")
abline(v = 0, lwd = 2) 
title(paste("Discrimination scores for WAIS"))
text(min(y1)-0.02 ,1, 'group1', col = 'blue')
text(max(y2)+0.02, 2, 'group2', col = 'red')

cf) reference

Compared to the banknote data, LDA for WAIS data demonstrates relatively poor classification performance. Additionally, clustering analysis performed on the WAIS data revealed that some data points in Group 1 were assigned to the second cluster, while the rest were assigned to the first cluster. In Group 2, 8 data points were assigned to the first cluster, and the remaining 29 were assigned to the second cluster. The misclassification observed in LDA aligns to some extent with the clustering results, suggesting that using individual intelligence indices (such as information) as a sole criterion for distinguishing between the presence and absence of cognitive impairment may not be the optimal approach.

# CA for WAIS data

d  = dist(wais[,-c(1,6)], method = "euclidean", p = 2)
w  = hclust(d, method = "ward.D")
plot(w, hang = -0.1, frame.plot = TRUE, ann = FALSE)

clusters = cutree(w, k=2)

c1 <- clusters[13:49]
length(c1[c1 == 1]); length(c1[c1 == 2])

## [1] 8

## [1] 29

c2 <- clusters[1:12]
length(c2[c2 == 1]); length(c2[c2 == 2])

## [1] 7

## [1] 5

Multivariate Analysis HW4

Q1 Exercise 14.6: Compute Fisher’s linear discrimination function for the 20 bank notes from Example 13.6. Apply it to the entire bank data set. How many observations are misclassiffed?

1) LDA for 20 Random Bank Notes

2) LDA for All Bank Notes

Q2 Exercise 14.7: Use the Fisher’s linear discrimination function on the WAIS data set (Table 22.12) and evaluate the results by re-substitution the probabilities of misclassiffcation.

cf) reference