Solution to Series 10

Prof. Dr. M. Maathuis, Multivariate Statistics, SS 0

1. a)
> bumpus <- read.table(" skip=0, nrows=9, col.names=c("id","total","alar","head","humerus","sternum"))
> bumpus <- bumpus[,-1]

b) The assumptions are:
- a simple random sample from each population;
- in each population, the variables are multivariate normal;
- the two populations have the same covariance matrix.

c) $H_0: \mu_1 = \mu_2$ versus $H_A: \mu_1 \neq \mu_2$ (under the assumption that $\Sigma_1 = \Sigma_2$).

d) The $T^2$ statistic is defined as
$$T^2 = \frac{n_1 n_2}{n_1 + n_2} D^2, \qquad n = n_1 + n_2,$$
$$D^2 = (\bar{x}_1 - \bar{x}_2)^T S_u^{-1} (\bar{x}_1 - \bar{x}_2),$$
where
$$S_u = \frac{1}{n-2}\left((n_1 - 1) S_1 + (n_2 - 1) S_2\right)$$
and $S_1$, $S_2$ are the sample covariance matrices of groups 1 and 2. Under $H_0$, $T^2 \sim T^2(p, n-2)$.

The F statistic is derived from $T^2$:
$$F = \frac{n - p - 1}{(n - 2)\,p}\, T^2.$$
Under $H_0$, $F \sim F_{p,\,n-p-1}$.

Computing the test statistics with R:
> bumpus.s <- bumpus[:,]
> bumpus.d <- bumpus[:9,]
> n.s <- nrow(bumpus.s)
> n.d <- nrow(bumpus.d)
> n <- n.s + n.d
> p <- 5
> # sample mean vectors:
> sample.mean.s <- apply(bumpus.s, 2, mean)
> sample.mean.d <- apply(bumpus.d, 2, mean)
> # pooled estimate for the covariance matrix:
> S.u <- ((n.s-1)*var(bumpus.s)+(n.d-1)*var(bumpus.d))/(n-2)
> S.u.inverse <- solve(S.u)
> # sample version of Mahalanobis distance (squared):
> D2 <- t(sample.mean.s-sample.mean.d)%*%S.u.inverse%*%(sample.mean.s-sample.mean.d)
> # T-squared statistic:
> (T2 <- n.s*n.d/n*D2)
[,].8698
> # F-statistic
> (Fstat <- (n-1-p)/((n-2)*p)*T2)

e) The p-value is the probability of observing a test statistic that is at least as extreme (in terms of $H_A$) as the one we saw, given that $H_0$ holds.
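The steps in d), together with the p-value described in e), can be collected in one function. The following is a minimal sketch on simulated data rather than on the Bumpus measurements; the function name `hotelling_two_sample`, the simulated matrices `x1` and `x2`, and the seed are assumptions made for illustration only.

```r
# Sketch: two-sample Hotelling T^2 test under the equal-covariance assumption.
# `hotelling_two_sample` is a hypothetical helper, not part of the exercise.
hotelling_two_sample <- function(x1, x2) {
  x1 <- as.matrix(x1)
  x2 <- as.matrix(x2)
  n1 <- nrow(x1)
  n2 <- nrow(x2)
  n  <- n1 + n2
  p  <- ncol(x1)
  # difference of the sample mean vectors
  d <- colMeans(x1) - colMeans(x2)
  # pooled estimate of the common covariance matrix
  S.u <- ((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n - 2)
  # squared Mahalanobis distance; solve(S.u, d) avoids forming the inverse
  D2 <- drop(crossprod(d, solve(S.u, d)))
  T2 <- n1 * n2 / n * D2
  # F statistic and upper-tail p-value
  Fstat   <- (n - p - 1) / ((n - 2) * p) * T2
  p.value <- pf(Fstat, p, n - p - 1, lower.tail = FALSE)
  list(T2 = T2, F = Fstat, df = c(p, n - p - 1), p.value = p.value)
}

# Simulated data (an assumption): both groups come from the same distribution
set.seed(1)
x1 <- matrix(rnorm(20 * 3), ncol = 3)
x2 <- matrix(rnorm(25 * 3), ncol = 3)
out <- hotelling_two_sample(x1, x2)
out$p.value
```

For a single variable (p = 1) the statistic reduces to the square of the pooled two-sample t statistic, which gives a simple way to sanity-check such a function against `t.test(..., var.equal = TRUE)`.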

> (p.value <- pf(Fstat, p, n-1-p, lower.tail=FALSE))

The p-value is larger than 0.05, so there is not enough evidence in the data to say that $\mu_1 \neq \mu_2$; we do not reject $H_0$.

f)
> library(ICSNP)
> HotellingsT2(bumpus.d, bumpus.s)

Hotelling's two sample T2-test

data: bumpus.d and bumpus.s
T.2 = 0.567, df1 = 5, df2 = , p-value = 0.76
alternative hypothesis: true location difference is not equal to c(0,0,0,0,0)

2. a)
> data(iris)
> # only consider the species 'versicolor' and 'virginica'
> dat <- iris[c(51:150),]
> # re-factorize the last column to get rid of the empty class
> dat[,5] <- factor(dat[,5])

b)
> library(MASS)
> # compute lda:
> res <- lda(Species ~ ., data=dat)
> # show the computed vector "a":
> res$scaling
Sepal.Length
Sepal.Width
Petal.Length .8850
Petal.Width .870

c)
> # predict class of new observation
> newdat <- data.frame(Sepal.Length=6, Sepal.Width=, Petal.Length=, Petal.Width=)
> predict(res, newdata=newdat)$class
[1] versicolor
Levels: versicolor virginica

d)
> nobs <- nrow(dat)
> predictions <- array(NA, nobs)
> for (i in 1:nobs){
+   dat.temp <- dat[-i,]
+   res.temp <- lda(Species~., data=dat.temp, prior=c(0.5,0.5))
+   predictions[i] <- predict(res.temp, newdata=dat[i,c(1:4)])$class
+ }
> table(predictions, dat$Species)
predictions versicolor virginica 8 9
> (mcr <- sum(predictions!=as.numeric(dat[,"Species"]))/nobs)
[1] 0.0
> # or easier
> lda.cv <- lda(Species~., data=dat, prior=c(0.5, 0.5), CV=TRUE)
> res <- data.frame(est = lda.cv$class, dat[, 5])
> tab <- table(res)
> tab ## confusion matrix
est versicolor virginica versicolor 8 virginica 9

> 1 - sum(diag(tab)) / nrow(dat)
[1] 0.0

The estimated misclassification rate is only 0.0%; we thus expect very good predictions.

3. a)
> library(MASS)
> t.d <- d.vegenv[d.vegenv[,"vegetationgroup"]>=3,]
> t.r <- lda(vegetationgroup~sqrt(nardstri)+sqrt(caluvulg)+sqrt(festrubr), data=t.d)
> t.r
Call:
lda(vegetationgroup ~ sqrt(nardstri) + sqrt(caluvulg) + sqrt(festrubr), data = t.d)

Prior probabilities of groups:

Group means:
sqrt(nardstri) sqrt(caluvulg) sqrt(festrubr)

Coefficients of linear discriminants:
sqrt(nardstri)
sqrt(caluvulg)
sqrt(festrubr)

> plot(t.r)
[Figure: output of plot(t.r), one panel per group]

The call to plot shows the projection of the observations onto the discriminant. We can see that one of the two groups attains lower values than the other. The two groups can largely be separated, albeit not perfectly.

b)
> t.r.all <- lda(vegetationgroup~sqrt(nardstri)+sqrt(caluvulg)+sqrt(festrubr), data=d.vegenv)
> t.r.all

Call:
lda(vegetationgroup ~ sqrt(nardstri) + sqrt(caluvulg) + sqrt(festrubr), data = d.vegenv)

Prior probabilities of groups:

Group means:
sqrt(nardstri) sqrt(caluvulg) sqrt(festrubr)

Coefficients of linear discriminants:
sqrt(nardstri)
sqrt(caluvulg)
sqrt(festrubr)

Proportion of trace:

> plot(t.r.all)
[Figure: output of plot(t.r.all), observations plotted on the linear discriminants]

As we can see, groups 2, 3 and 4 can be separated relatively well by the first two discriminants. The first group is comparatively small, and it is difficult to distinguish from the other three. The third discriminant does not seem to aid the classification into these groups.

c)
> # only groups 3 and 4
> t.pr <- predict(t.r)
> tab <- table(t.pr$class, t.d$vegetationgroup)
> tab
8

5
> 1-sum(diag(tab))/nrow(t.d)
[1]
> # all four groups
> t.pr.all <- predict(t.r.all)
> tab <- table(t.pr.all$class, d.vegenv$vegetationgroup)
> tab
> 1-sum(diag(tab))/nrow(d.vegenv)
[1] 0.95

The first table compares the predicted and true group membership for groups 3 and 4. These two groups are easily separated in this manner, with about 9% of all observations correctly classified (misclassification rate 7.%). The second table shows the four-group classification. While groups 2, 3 and 4 can be easily separated, observations from group 1 are not so frequently recognized as such. This difficulty in classifying group 1 is one we already saw in the plots of the previous part of this exercise. The misclassification rate here is .%.

d)
> t.r.cv <- lda(vegetationgroup~sqrt(nardstri)+sqrt(caluvulg)+sqrt(festrubr), data=t.d, CV=TRUE)
> res <- data.frame(est=t.r.cv$class, t.d$vegetationgroup)
> tab <- table(res)
> tab
est 8
> 1-sum(diag(tab))/nrow(t.d)
[1]
> t.r.all.cv <- lda(vegetationgroup~sqrt(nardstri)+sqrt(caluvulg)+sqrt(festrubr), data=d.vegenv, CV=TRUE)
> res <- data.frame(est=t.r.all.cv$class, d.vegenv$vegetationgroup)
> tab <- table(res)
> tab
est
> 1-sum(diag(tab))/nrow(d.vegenv)
[1] 0.65

The estimated misclassification rates for the two groups 3 and 4 alone are exactly the same with CV and with the plug-in method. For all four groups, the misclassification rate estimated with CV (.6%) is higher than the one estimated by the plug-in method (.%). The plug-in estimate is in most cases optimistic, because the same observations are used both to fit the classification rule and to evaluate it; the CV estimate should therefore be preferred.

4. No solution.
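The comparison between the plug-in and the cross-validated misclassification rate from the vegetation exercise can be tried on data that ships with R. The sketch below uses the two iris species from exercise 2, since `d.vegenv` is not available here; the helper name `mcr` and the object names are assumptions for illustration, and the resulting rates need not show the same gap as in the vegetation data.

```r
library(MASS)  # provides lda()

# iris is built in; keep the two species used in exercise 2
dat <- droplevels(iris[51:150, ])

# hypothetical helper: share of observations whose predicted label is wrong
mcr <- function(pred, truth) mean(pred != truth)

# plug-in (resubstitution) estimate: fit and predict on the same observations
fit <- lda(Species ~ ., data = dat, prior = c(0.5, 0.5))
mcr.plugin <- mcr(predict(fit)$class, dat$Species)

# leave-one-out cross-validation estimate, computed by lda() when CV = TRUE
fit.cv <- lda(Species ~ ., data = dat, prior = c(0.5, 0.5), CV = TRUE)
mcr.cv <- mcr(fit.cv$class, dat$Species)

c(plug.in = mcr.plugin, cv = mcr.cv)
```

With `CV = TRUE`, `lda()` returns the leave-one-out class predictions directly in the `class` component, so no explicit loop over the observations is needed.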
