CSE555: Introduction to Pattern Recognition — Midterm Exam Solution (100 points, closed book/notes). There are 5 questions in this exam. The last page is the Appendix, which contains some useful formulas.

1. (15pts) Bayes Decision Theory.

(a) (5pts) Assume there are c classes ω₁, ..., ω_c and one feature vector x; give the Bayes rule for classification in terms of the a priori probabilities of the classes and the class-conditional probability densities of x.

Bayes rule for classification is

    Decide ω_i if p(x|ω_i)P(ω_i) > p(x|ω_j)P(ω_j) for all j ≠ i,   i, j = 1, ..., c.

(b) (10pts) Suppose we have a two-class problem (A, ¬A) with a single binary-valued feature (values x₁ and x₂). Assume the prior probability P(A) = 0.33. Given the distribution of the samples shown in the following table, use Bayes rule to compute the posterior probabilities of the classes.

              A      ¬A
    x₁       248    167
    x₂        82    503

By Bayes formula, we have

    P(A|x₁) = p(x₁|A)P(A) / p(x₁),   where   p(x₁) = p(x₁|A)P(A) + p(x₁|¬A)P(¬A).

We also know that

    p(x₁|A)  = 248 / (248 + 82)  ≈ 0.7515
    p(x₁|¬A) = 167 / (167 + 503) ≈ 0.2493
    P(A)  = 0.33
    P(¬A) = 1 − P(A) = 0.67

thus

    P(A|x₁) = (0.7515 × 0.33) / (0.7515 × 0.33 + 0.2493 × 0.67) ≈ 0.5976.

Similarly, we have

    P(¬A|x₁) ≈ 0.4024,   P(A|x₂) ≈ 0.1402,   P(¬A|x₂) ≈ 0.8598.
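As a quick numerical check of part (b), the sketch below recomputes the posteriors directly from the count table; the code and variable names are ours, not part of the exam.

    # Minimal check of Q1(b): posteriors from the count table and P(A) = 0.33.
    counts = {"A": {"x1": 248, "x2": 82}, "notA": {"x1": 167, "x2": 503}}
    P_A, P_notA = 0.33, 0.67

    for x in ("x1", "x2"):
        # Class-conditional likelihoods p(x|A) and p(x|notA).
        p_x_A = counts["A"][x] / (counts["A"]["x1"] + counts["A"]["x2"])
        p_x_notA = counts["notA"][x] / (counts["notA"]["x1"] + counts["notA"]["x2"])
        evidence = p_x_A * P_A + p_x_notA * P_notA          # p(x)
        print(x, "P(A|x) =", round(p_x_A * P_A / evidence, 4),
              "P(notA|x) =", round(p_x_notA * P_notA / evidence, 4))
    # Prints approximately 0.5976 / 0.4024 for x1 and 0.1402 / 0.8598 for x2.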
2. (25pts) Fisher Linear Discriminant.

(a) (5pts) What is the Fisher linear discriminant method?

The Fisher linear discriminant finds a good subspace in which the categories are best separated in a least-squares sense; other, general classification techniques can then be applied in that subspace.

(b) Given the 2-d data for the two classes

    ω₁ = {(1,1), (1,2), (1,4), (2,1), (3,1), (3,3)}   and   ω₂ = {(2,2), (3,2), (3,4), (5,1), (5,4), (5,5)},

as shown in the figure:

[Figure: scatter plot of the two classes, axes running from 0 to 5.]

i. (10pts) Determine the optimal projection line in a single dimension.

Let w be the direction of the projection line. The Fisher linear discriminant method finds the w for which the criterion function

    J(w) = (wᵗ S_B w) / (wᵗ S_W w)

is maximum, namely

    w = S_W⁻¹ (m₁ − m₂),

where

    S_W = S₁ + S₂   and   S_i = Σ_{x ∈ D_i} (x − m_i)(x − m_i)ᵗ,   i = 1, 2.
Thus, we first compute the sample means for each class and get

    m₁ = (11/6, 2)ᵗ,   m₂ = (23/6, 3)ᵗ.

Then we subtract the sample mean from each sample and get

    x − m₁ : (−5/6, −1), (−5/6, 0), (−5/6, 2), (1/6, −1), (7/6, −1), (7/6, 1)
    x − m₂ : (−11/6, −1), (−5/6, −1), (−5/6, 1), (7/6, −2), (7/6, 1), (7/6, 2)

therefore

    S₁ = [ (25+25+25+1+49+49)/36     (5+0−10−1−7+7)/6  ]   = [ 29/6   −1 ]
         [ (5+0−10−1−7+7)/6          1+0+4+1+1+1       ]     [  −1     8 ]

    S₂ = [ (121+25+25+49+49+49)/36   (11+5−5−14+7+14)/6 ]  = [ 53/6    3 ]
         [ (11+5−5−14+7+14)/6        1+1+1+4+1+4        ]    [   3    12 ]

and then

    S_W = S₁ + S₂ = [ 41/3    2 ]
                    [   2    20 ]

    S_W⁻¹ = (3/808) [ 20    −2   ]  = [ 15/202   −3/404 ]
                    [ −2   41/3  ]    [ −3/404   41/808 ]

Finally, with m₁ − m₂ = (−2, −1)ᵗ, we have

    w = S_W⁻¹ (m₁ − m₂) = [ 15/202   −3/404 ] [ −2 ]  = [ −57/404 ]  ≈ [ −0.1411 ]
                          [ −3/404   41/808 ] [ −1 ]    [ −29/808 ]    [ −0.0359 ]

Since only the direction of the projection line matters, we may equivalently take w ≈ (0.1411, 0.0359)ᵗ, which is the form used below.
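The computation above can be checked numerically; the short sketch below (our own helper, not part of the exam) recomputes S_W and w with NumPy, using the scatter-matrix definitions from the Appendix.

    import numpy as np

    # Data for Q2(b).
    w1 = np.array([(1, 1), (1, 2), (1, 4), (2, 1), (3, 1), (3, 3)], dtype=float)
    w2 = np.array([(2, 2), (3, 2), (3, 4), (5, 1), (5, 4), (5, 5)], dtype=float)

    m1, m2 = w1.mean(axis=0), w2.mean(axis=0)
    # Scatter matrices S_i = sum over the class of (x - m_i)(x - m_i)^t.
    S1 = (w1 - m1).T @ (w1 - m1)
    S2 = (w2 - m2).T @ (w2 - m2)
    Sw = S1 + S2

    w = np.linalg.solve(Sw, m1 - m2)      # w = S_W^{-1} (m1 - m2)
    print(w)                              # approx [-0.1411, -0.0359]
    # Projections onto the sign-flipped direction, used in part ii below:
    print(w1 @ -w)   # approx [0.1770, 0.2129, 0.2847, 0.3181, 0.4592, 0.5309]
    print(w2 @ -w)   # approx [0.3540, 0.4950, 0.5668, 0.7413, 0.8490, 0.8849]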
ii. (10pts) Show the mapping of the points to the line as well as the Bayes discriminant, assuming a suitable distribution.

The samples are mapped onto the line by x' = wᵗx, which gives

    ω₁ : 0.1770, 0.2129, 0.2847, 0.3181, 0.4592, 0.5309
    ω₂ : 0.3540, 0.4950, 0.5668, 0.7413, 0.8490, 0.8849

and we compute the mean and the standard deviation of each class as

    µ₁ = 0.3304,  σ₁ = 0.1388,   µ₂ = 0.6485,  σ₂ = 0.2106.

If we assume both p(x'|ω₁) and p(x'|ω₂) have a Gaussian distribution, then the Bayes decision rule is

    Decide ω₁ if p(x'|ω₁)P(ω₁) > p(x'|ω₂)P(ω₂); otherwise decide ω₂,

where

    p(x'|ω_i) = (1/(√(2π) σ_i)) exp(−(x' − µ_i)² / (2σ_i²)).

If we assume the prior probabilities are equal, i.e. P(ω₁) = P(ω₂) = 0.5, then the threshold is about 0.4933. That is, we decide ω₂ if wᵗx > 0.4933, and otherwise decide ω₁.

3. (20pts) Suppose p(x|ω₁) and p(x|ω₂) are defined as follows:

    p(x|ω₁) = (1/√(2π)) e^(−x²/2),   −∞ < x < ∞
    p(x|ω₂) = 1/4,                   −2 < x < 2  (and 0 otherwise)

(a) (7pts) Find the minimum-error classification rule g(x) for this two-class problem, assuming P(ω₁) = P(ω₂) = 0.5.

(i) In the case −2 < x < 2, because P(ω₁) = P(ω₂) = 0.5 we have the discriminant function

    g(x) = ln [ p(x|ω₁) / p(x|ω₂) ] = ln(4/√(2π)) − x²/2.

The Bayes rule for classification is

    Decide ω₁ if g(x) > 0; otherwise decide ω₂,

or, equivalently,

    Decide ω₁ if −0.9668 < x < 0.9668; otherwise decide ω₂.

(ii) In the case x ≤ −2 or x ≥ 2, we always decide ω₁, since p(x|ω₂) = 0 there.

(b) (10pts) There is a value π₁ of the prior probability of class 1 such that if P(ω₁) > π₁, the minimum-error classification rule is to always decide ω₁ regardless of x. Find π₁.

According to the question, π₁ satisfies

    p(x|ω₁) π₁ = p(x|ω₂)(1 − π₁)   when x = 2 or x = −2.

Therefore we have

    (1/√(2π)) e^(−2) π₁ = (1/4)(1 − π₁),   so   π₁ ≈ 0.8224.

(c) (3pts) There is no π₂ such that if P(ω₂) > π₂, we would always decide ω₂. Why not?

Because p(x|ω₂) is nonzero only for −2 < x < 2, we always decide ω₁ for x ≤ −2 or x ≥ 2, no matter what the prior probability P(ω₂) is.
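To verify the two numerical answers in question 3, a small sketch (ours, using only the densities given above):

    import math

    # Q3(a): boundary where the two densities are equal inside (-2, 2):
    # (1/sqrt(2*pi)) * exp(-x^2/2) = 1/4  =>  x^2 = 2*ln(4/sqrt(2*pi)).
    x_star = math.sqrt(2 * math.log(4 / math.sqrt(2 * math.pi)))
    print(round(x_star, 4))          # ~0.9668, so decide w1 for |x| < 0.9668

    # Q3(b): prior at which the decision at x = +/-2 just tips to w1:
    # (1/sqrt(2*pi)) * exp(-2) * pi1 = (1/4) * (1 - pi1).
    g2 = math.exp(-2) / math.sqrt(2 * math.pi)   # p(2|w1)
    pi1 = 0.25 / (0.25 + g2)
    print(round(pi1, 4))             # ~0.8224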
4. (20pts) Let samples be drawn by successive, independent selections of a state of nature ω_i with unknown probability P(ω_i). Let z_ik = 1 if the state of nature for the kth sample is ω_i, and z_ik = 0 otherwise.

(a) (7pts) Show that

    P(z_i1, ..., z_in | P(ω_i)) = ∏_{k=1}^{n} P(ω_i)^{z_ik} (1 − P(ω_i))^{1 − z_ik}.

We are given that

    z_ik = 1 if the state of nature for the kth sample is ω_i, and z_ik = 0 otherwise.

The samples are drawn by successive independent selections of a state of nature ω_i with probability P(ω_i). We then have

    Pr[z_ik = 1 | P(ω_i)] = P(ω_i)   and   Pr[z_ik = 0 | P(ω_i)] = 1 − P(ω_i).

These two equations can be unified as

    P(z_ik | P(ω_i)) = P(ω_i)^{z_ik} (1 − P(ω_i))^{1 − z_ik}.

By the independence of the successive selections, we have

    P(z_i1, ..., z_in | P(ω_i)) = ∏_{k=1}^{n} P(z_ik | P(ω_i)) = ∏_{k=1}^{n} P(ω_i)^{z_ik} (1 − P(ω_i))^{1 − z_ik}.

(b) (10pts) Given the equation above, show that the maximum likelihood estimate for P(ω_i) is

    P̂(ω_i) = (1/n) ∑_{k=1}^{n} z_ik.

The log-likelihood as a function of P(ω_i) is

    l(P(ω_i)) = ln P(z_i1, ..., z_in | P(ω_i))
              = ∑_{k=1}^{n} [ z_ik ln P(ω_i) + (1 − z_ik) ln(1 − P(ω_i)) ].

Therefore, the maximum-likelihood value of P(ω_i) must satisfy

    ∂l/∂P(ω_i) = ∑_{k=1}^{n} z_ik / P(ω_i) − ∑_{k=1}^{n} (1 − z_ik) / (1 − P(ω_i)) = 0.

We solve this equation and find

    (1 − P̂(ω_i)) ∑_{k=1}^{n} z_ik = P̂(ω_i) ∑_{k=1}^{n} (1 − z_ik),

which can be rewritten as

    ∑_{k=1}^{n} z_ik = P̂(ω_i) ∑_{k=1}^{n} z_ik + n P̂(ω_i) − P̂(ω_i) ∑_{k=1}^{n} z_ik = n P̂(ω_i).

The final solution is then

    P̂(ω_i) = (1/n) ∑_{k=1}^{n} z_ik.

(c) (3pts) Interpret the meaning of your result in words.

In this question we apply the maximum-likelihood method to estimate a prior probability. From the result in part (b), it can be observed that the estimate of the probability of category ω_i is merely the relative frequency of its indicator value in the training data, just as we would expect.
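A brief simulation (our own, not part of the exam) illustrating part (b): the ML estimate is simply the sample frequency of the indicator variables.

    import numpy as np

    rng = np.random.default_rng(0)
    true_prior = 0.3                      # assumed value for the demonstration
    n = 10000
    z = rng.random(n) < true_prior        # indicator variables z_ik for category w_i
    print(z.mean())                       # ML estimate (1/n) * sum_k z_ik, close to 0.3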
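The model in question 5 can be written down directly as two row-stochastic matrices; the snippet below (our own encoding, not part of the exam) records them and checks the absorber property.

    import numpy as np

    # Transition and emission matrices as given (columns of B correspond to v0, v1, v2).
    A = np.array([[1.0, 0.0, 0.0],
                  [0.2, 0.3, 0.5],
                  [0.4, 0.5, 0.1]])
    B = np.array([[1.0, 0.0, 0.0],
                  [0.0, 0.7, 0.3],
                  [0.0, 0.4, 0.6]])

    print(A.sum(axis=1), B.sum(axis=1))   # each row sums to 1
    print(A[0], B[0])                     # w0 is absorbing and emits only the null symbol v0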
(b) (10pts) Suppose the initial hidden state at t = 0 is ω₁. Starting from t = 1, what is the probability that it generates the particular sequence V³ = {v₂, v₁, v₀}?

The probability of observing the sequence V³ is about 0.03678. See the trellis below; explicitly, using the forward recursion with α₁(0) = 1:

    t = 1 (v₂): α₁(1) = 0.3 × 0.3 = 0.09,   α₂(1) = 0.5 × 0.6 = 0.30,   α₀(1) = 0
    t = 2 (v₁): α₁(2) = (0.09 × 0.3 + 0.30 × 0.5) × 0.7 = 0.1239
                α₂(2) = (0.09 × 0.5 + 0.30 × 0.1) × 0.4 = 0.03
    t = 3 (v₀): α₀(3) = (0.1239 × 0.2 + 0.03 × 0.4) × 1 = 0.03678

[Figure: trellis over states ω₀, ω₁, ω₂ for t = 0, 1, 2, 3, annotated with the a_ij · b_jk factors and the α values above.]

(c) (3pts) Given the above sequence V³, what is the most probable sequence of hidden states?

From the trellis above and by using the decoding algorithm, one can observe that the most probable sequence of hidden states is {ω₁, ω₂, ω₁, ω₀}.
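As a cross-check of parts (b) and (c), a short forward/Viterbi sketch (our own code, with the matrices as given above):

    import numpy as np

    A = np.array([[1.0, 0.0, 0.0],      # transition probabilities a_ij
                  [0.2, 0.3, 0.5],
                  [0.4, 0.5, 0.1]])
    B = np.array([[1.0, 0.0, 0.0],      # emission probabilities b_jk, columns v0, v1, v2
                  [0.0, 0.7, 0.3],
                  [0.0, 0.4, 0.6]])
    obs = [2, 1, 0]                     # V^3 = {v2, v1, v0}

    alpha = np.array([0.0, 1.0, 0.0])   # start in w1 at t = 0
    delta = alpha.copy()                # Viterbi scores
    psi = []                            # back-pointers
    for v in obs:
        alpha = (alpha @ A) * B[:, v]                 # forward recursion
        scores = delta[:, None] * A * B[None, :, v]   # Viterbi recursion
        psi.append(scores.argmax(axis=0))
        delta = scores.max(axis=0)

    print(alpha.sum())                  # P(V^3) ~ 0.03678 (all mass ends in w0)
    # Backtrack the most probable state sequence, which ends in the absorber w0.
    path = [0]
    for bp in reversed(psi):
        path.append(bp[path[-1]])
    print(path[::-1])                   # [1, 2, 1, 0] -> {w1, w2, w1, w0}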
Appendix: Useful formulas.

For a 2 × 2 matrix

    A = [ a  b ]
        [ c  d ]

the matrix inverse is

    A⁻¹ = (1/|A|) [  d  −b ]  =  (1/(ad − bc)) [  d  −b ]
                  [ −c   a ]                   [ −c   a ]

The scatter matrices S_i are defined as

    S_i = Σ_{x ∈ D_i} (x − m_i)(x − m_i)ᵗ,

where m_i is the d-dimensional sample mean. The within-class scatter matrix is defined as

    S_W = S₁ + S₂.

The between-class scatter matrix is defined as

    S_B = (m₁ − m₂)(m₁ − m₂)ᵗ.

The solution for the w that optimizes

    J(w) = (wᵗ S_B w) / (wᵗ S_W w)

is

    w = S_W⁻¹ (m₁ − m₂).
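For completeness, a tiny check (ours) of the 2 × 2 inverse formula against NumPy, using the S_W from question 2:

    import numpy as np

    Sw = np.array([[41/3, 2.0],
                   [2.0, 20.0]])
    a, b, c, d = Sw[0, 0], Sw[0, 1], Sw[1, 0], Sw[1, 1]
    inv_formula = np.array([[d, -b], [-c, a]]) / (a * d - b * c)   # adjugate over determinant
    print(np.allclose(inv_formula, np.linalg.inv(Sw)))             # True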