CSE555: Introduction to Pattern Recognition Midterm Exam Solution (100 points, Closed book/notes)

There are 5 questions in this exam. The last page is the Appendix, which contains some useful formulas.

1. (15pts) Bayes Decision Theory.

(a) (5pts) Assume there are c classes ω_1, ..., ω_c and one feature vector x. Give the Bayes rule for classification in terms of the a priori probabilities of the classes and the class-conditional probability densities of x.

The Bayes rule for classification is: decide ω_i if p(x|ω_i)P(ω_i) > p(x|ω_j)P(ω_j) for all j ≠ i, with i, j = 1, ..., c.

(b) (10pts) Suppose we have a two-class problem (A, ¬A) with a single binary-valued feature that takes the values x_1 or x_2. Assume the prior probability P(A) = 0.33. Given the distribution of the samples shown in the following table, use the Bayes rule to compute the values of the posterior probabilities of the classes.

          A     ¬A
  x_1    248    167
  x_2     82    503

By the Bayes formula we have

  P(A|x_1) = p(x_1|A)P(A) / p(x_1),  where  p(x_1) = p(x_1|A)P(A) + p(x_1|¬A)P(¬A).

We also know that

  p(x_1|A) = 248/(248 + 82) ≈ 0.7515
  p(x_1|¬A) = 167/(167 + 503) ≈ 0.2493
  P(A) = 0.33,  P(¬A) = 1 - P(A) = 0.67,

and thus

  P(A|x_1) = (0.7515)(0.33) / [(0.7515)(0.33) + (0.2493)(0.67)] ≈ 0.5976.

Similarly, we have

  P(¬A|x_1) ≈ 0.4024,  P(A|x_2) ≈ 0.1402,  P(¬A|x_2) ≈ 0.8598.
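As a quick numerical check of the arithmetic in 1(b), the posteriors can be recomputed with a short Python sketch (an editorial addition, not part of the original solution; the variable names are illustrative):

# Posteriors for question 1(b), computed from the count table and the given prior.
counts = {"A": {"x1": 248, "x2": 82}, "notA": {"x1": 167, "x2": 503}}
priors = {"A": 0.33, "notA": 0.67}

def posterior(x, cls):
    # Class-conditional p(x | class) estimated from each class's column of counts.
    likelihood = {c: counts[c][x] / (counts[c]["x1"] + counts[c]["x2"]) for c in counts}
    evidence = sum(likelihood[c] * priors[c] for c in counts)
    return likelihood[cls] * priors[cls] / evidence

print(posterior("x1", "A"), posterior("x1", "notA"))  # ~0.5976, ~0.4024
print(posterior("x2", "A"), posterior("x2", "notA"))  # ~0.1402, ~0.8598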

2. (25pts) Fisher Linear Discriminant.

(a) (5pts) What is the Fisher linear discriminant method?

The Fisher linear discriminant finds a good subspace in which the categories are best separated in a least-squares sense; other, general classification techniques can then be applied in that subspace.

(b) Given the 2-d data for two classes:

  ω_1 = {(1,1), (1,2), (1,4), (2,1), (3,1), (3,3)}
  ω_2 = {(2,2), (3,2), (3,4), (5,1), (5,4), (5,5)}

as shown in the figure:

[Figure: scatter plot of the two classes, with both axes running from 0 to 5.]

i. (10pts) Determine the optimal projection line in a single dimension.

Let w be the direction of the projection line. The Fisher linear discriminant method finds the best w as the one for which the criterion function

  J(w) = (w^t S_B w) / (w^t S_W w)

is maximum, namely

  w = S_W^{-1}(m_1 - m_2),

where

  S_W = S_1 + S_2  and  S_i = Σ_{x in D_i} (x - m_i)(x - m_i)^t,  i = 1, 2.
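This recipe is easy to check numerically. The following numpy sketch (an editorial addition, not part of the exam solution) computes w for the data above and should reproduce the direction derived next:

import numpy as np

# Class samples from 2(b), one row per 2-d point.
D1 = np.array([[1, 1], [1, 2], [1, 4], [2, 1], [3, 1], [3, 3]], dtype=float)
D2 = np.array([[2, 2], [3, 2], [3, 4], [5, 1], [5, 4], [5, 5]], dtype=float)

m1, m2 = D1.mean(axis=0), D2.mean(axis=0)
S1 = (D1 - m1).T @ (D1 - m1)      # scatter matrix of class 1
S2 = (D2 - m2).T @ (D2 - m2)      # scatter matrix of class 2
Sw = S1 + S2                      # within-class scatter matrix
w = np.linalg.solve(Sw, m1 - m2)  # w = Sw^{-1} (m1 - m2)
print(w)                          # approx [-0.1411, -0.0359]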

Thus, we first compute the sample means for each class and get

  m_1 = (11/6, 2)^t,  m_2 = (23/6, 3)^t.

Then we subtract the sample mean from each sample and get the deviations

  x - m_1:  (-5/6, -1), (-5/6, 0), (-5/6, 2), (1/6, -1), (7/6, -1), (7/6, 1)
  x - m_2:  (-11/6, -1), (-5/6, -1), (-5/6, 1), (7/6, -2), (7/6, 1), (7/6, 2),

therefore

  S_1(1,1) = (25 + 25 + 25 + 1 + 49 + 49)/36 = 29/6,
  S_1(1,2) = S_1(2,1) = (5 + 0 - 10 - 1 - 7 + 7)/6 = -1,
  S_1(2,2) = 1 + 0 + 4 + 1 + 1 + 1 = 8,

  S_2(1,1) = (121 + 25 + 25 + 49 + 49 + 49)/36 = 53/6,
  S_2(1,2) = S_2(2,1) = (11 + 5 - 5 - 14 + 7 + 14)/6 = 3,
  S_2(2,2) = 1 + 1 + 1 + 4 + 1 + 4 = 12,

i.e.

  S_1 = [ 29/6  -1 ]      S_2 = [ 53/6   3 ]
        [  -1    8 ]            [   3   12 ]

and then

  S_W = S_1 + S_2 = [ 41/3   2 ]
                    [   2   20 ]

Using the 2x2 inverse formula in the Appendix, |S_W| = 808/3 and

  S_W^{-1} = [ 15/202  -3/404 ]
             [ -3/404  41/808 ]

Finally, with m_1 - m_2 = (-2, -1)^t, we have

  w = S_W^{-1}(m_1 - m_2) = (-57/404, -29/808)^t ≈ (-0.1411, -0.0359)^t.

ii. (10pts) Show the mapping of the points to the line as well as the Bayes discriminant, assuming a suitable distribution.

The samples are mapped by y = w^t x and we get

  ω_1:  -0.1770, -0.2129, -0.2847, -0.3181, -0.4592, -0.5309
  ω_2:  -0.3540, -0.4950, -0.5668, -0.7413, -0.8490, -0.8849,

and we compute the mean and the standard deviation of the projected samples as

  µ_1 = -0.3304,  σ_1 = 0.1388,  µ_2 = -0.6485,  σ_2 = 0.2106.

If we assume both p(y|ω_1) and p(y|ω_2) have a Gaussian distribution, then the Bayes decision rule is

  decide ω_1 if p(y|ω_1)P(ω_1) > p(y|ω_2)P(ω_2); otherwise decide ω_2,

where

  p(y|ω_i) = (1/(√(2π) σ_i)) exp( -(y - µ_i)² / (2σ_i²) ).

If we assume the prior probabilities are equal, i.e. P(ω_1) = P(ω_2) = 0.5, then the threshold is about -0.4933. That is, we decide ω_1 if w^t x > -0.4933 and otherwise decide ω_2.
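The projected statistics and the equal-prior threshold can be double-checked with the following sketch (again an editorial addition; it refits the two 1-d Gaussians and solves for their crossing point):

import numpy as np

D1 = np.array([[1, 1], [1, 2], [1, 4], [2, 1], [3, 1], [3, 3]], dtype=float)
D2 = np.array([[2, 2], [3, 2], [3, 4], [5, 1], [5, 4], [5, 5]], dtype=float)
w = np.array([-57/404, -29/808])

y1, y2 = D1 @ w, D2 @ w                 # projections onto the Fisher direction
mu1, s1 = y1.mean(), y1.std(ddof=1)
mu2, s2 = y2.mean(), y2.std(ddof=1)

# Equal priors: p(y|w1) = p(y|w2) for the two fitted Gaussians is a quadratic in y.
a = 1/(2*s2**2) - 1/(2*s1**2)
b = mu1/s1**2 - mu2/s2**2
c = mu2**2/(2*s2**2) - mu1**2/(2*s1**2) + np.log(s2/s1)
print(mu1, s1, mu2, s2)     # ~ -0.3304 0.1388 -0.6485 0.2106
print(np.roots([a, b, c]))  # the root between the two means is ~ -0.4933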

3. (20pts) Suppose p(x|ω_1) and p(x|ω_2) are defined as follows:

  p(x|ω_1) = (1/√(2π)) e^{-x²/2},  -∞ < x < ∞
  p(x|ω_2) = 1/4,  -2 < x < 2  (and 0 otherwise).

(a) (7pts) Find the minimum-error classification rule g(x) for this two-class problem, assuming P(ω_1) = P(ω_2) = 0.5.

(i) In the case -2 < x < 2, because P(ω_1) = P(ω_2) = 0.5, we have the discriminant function

  g(x) = ln[ p(x|ω_1) / p(x|ω_2) ] = ln( 4/√(2π) ) - x²/2.

The Bayes rule for classification is

  decide ω_1 if g(x) > 0; otherwise decide ω_2,

or, equivalently,

  decide ω_1 if -0.9668 < x < 0.9668; otherwise decide ω_2.

(ii) In the case x ≤ -2 or x ≥ 2, we always decide ω_1, since p(x|ω_2) = 0 there.

(b) (10pts) There is a prior probability of class 1, designated π*, such that if P(ω_1) > π*, the minimum-error classification rule is to always decide ω_1 regardless of x. Find π*.

According to the question, π* satisfies

  p(x|ω_1) π* = p(x|ω_2)(1 - π*)  when x = -2 or x = 2.

Therefore we have

  (1/√(2π)) e^{-4/2} π* = (1/4)(1 - π*),

which gives π* ≈ 0.8224.

(c) (3pts) There is no π* such that if P(ω_2) > π*, we would always decide ω_2. Why not?

Because p(x|ω_2) is nonzero only for -2 < x < 2, we always decide ω_1 for x ≤ -2 or x ≥ 2, no matter what the prior probability P(ω_2) is.
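Both constants quoted above, the boundary 0.9668 in (a) and π* ≈ 0.8224 in (b), can be recomputed with a few lines of Python (an editorial addition, for checking only):

import math

# (a) g(x) = ln(4/sqrt(2*pi)) - x^2/2 > 0  holds exactly when |x| < x0.
x0 = math.sqrt(2 * math.log(4 / math.sqrt(2 * math.pi)))
print(x0)              # ~0.9668

# (b) smallest prior of w1 for which "always decide w1" is the minimum-error rule:
# the worst case inside (-2, 2) is at x = +/-2, where p(x|w1) is smallest.
p1_at_2 = math.exp(-2) / math.sqrt(2 * math.pi)
pi_star = 0.25 / (0.25 + p1_at_2)
print(pi_star)         # ~0.8224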

4. (20pts) Let samples be drawn by successive, independent selections of a state of nature ω_i with unknown probability P(ω_i). Let z_ik = 1 if the state of nature for the k-th sample is ω_i, and z_ik = 0 otherwise.

(a) (7pts) Show that

  P(z_i1, ..., z_in | P(ω_i)) = Π_{k=1}^{n} P(ω_i)^{z_ik} (1 - P(ω_i))^{1 - z_ik}.

We are given that

  z_ik = 1 if the state of nature for the k-th sample is ω_i, and z_ik = 0 otherwise.

The samples are drawn by successive independent selections of a state of nature ω_i with probability P(ω_i). We then have

  Pr[z_ik = 1 | P(ω_i)] = P(ω_i)  and  Pr[z_ik = 0 | P(ω_i)] = 1 - P(ω_i).

These two equations can be unified as

  P(z_ik | P(ω_i)) = P(ω_i)^{z_ik} (1 - P(ω_i))^{1 - z_ik}.

By the independence of the successive selections, we have

  P(z_i1, ..., z_in | P(ω_i)) = Π_{k=1}^{n} P(z_ik | P(ω_i)) = Π_{k=1}^{n} P(ω_i)^{z_ik} (1 - P(ω_i))^{1 - z_ik}.

(b) (10pts) Given the equation above, show that the maximum-likelihood estimate for P(ω_i) is

  P̂(ω_i) = (1/n) Σ_{k=1}^{n} z_ik.

The log-likelihood as a function of P(ω_i) is

  l(P(ω_i)) = ln P(z_i1, ..., z_in | P(ω_i))
            = Σ_{k=1}^{n} ln[ P(ω_i)^{z_ik} (1 - P(ω_i))^{1 - z_ik} ]
            = Σ_{k=1}^{n} [ z_ik ln P(ω_i) + (1 - z_ik) ln(1 - P(ω_i)) ].

Therefore, the maximum-likelihood value of P(ω_i) must satisfy

  ∂l(P(ω_i))/∂P(ω_i) = Σ_{k=1}^{n} z_ik / P(ω_i) - Σ_{k=1}^{n} (1 - z_ik) / (1 - P(ω_i)) = 0.

Solving this equation, we find

  (1 - P̂(ω_i)) Σ_k z_ik = P̂(ω_i) Σ_k (1 - z_ik),

which can be rewritten as

  Σ_k z_ik = P̂(ω_i) Σ_k z_ik + n P̂(ω_i) - P̂(ω_i) Σ_k z_ik = n P̂(ω_i).

The final solution is then

  P̂(ω_i) = (1/n) Σ_{k=1}^{n} z_ik.
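The result in (b) says that the maximum-likelihood estimate is simply the empirical frequency of ω_i; the following small simulation (an editorial addition, with an arbitrary true prior of 0.3) illustrates this:

import random

random.seed(0)
true_prior, n = 0.3, 10000
z = [1 if random.random() < true_prior else 0 for _ in range(n)]  # indicators z_ik
p_hat = sum(z) / n   # maximum-likelihood estimate from part (b)
print(p_hat)         # close to 0.3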

(c) (3pts) Interpret the meaning of your result in words.

In this question we apply the maximum-likelihood method to estimate a prior probability. From the result in part (b) it can be observed that the estimate of the probability of category ω_i is merely the relative frequency with which its indicator value z_ik = 1 occurs in the training data, i.e., the fraction of training samples that come from ω_i, just as we would expect.

5. (20pts) Consider an HMM with an explicit absorber state ω_0 and a unique null visible symbol v_0, with the following transition probabilities a_ij and symbol probabilities b_jk (where the matrix indexes begin at 0):

  a_ij = [ 1    0    0   ]        b_jk = [ 1   0    0   ]
         [ 0.2  0.3  0.5 ]               [ 0   0.7  0.3 ]
         [ 0.4  0.5  0.1 ]               [ 0   0.4  0.6 ]

(a) (7pts) Give a graph representation of this Hidden Markov Model.

[Figure: three-state graph. ω_0 has a self-transition of probability 1 and emits only v_0. ω_1 has a self-transition of 0.3, transitions to ω_2 with probability 0.5 and to ω_0 with probability 0.2, and emits v_1 and v_2 with probabilities 0.7 and 0.3. ω_2 has a self-transition of 0.1, transitions to ω_1 with probability 0.5 and to ω_0 with probability 0.4, and emits v_1 and v_2 with probabilities 0.4 and 0.6.]

(b) (10pts) Suppose the initial hidden state at t = 0 is ω_1. Starting from t = 1, what is the probability that it generates the particular sequence V³ = {v_2, v_1, v_0}?

The probability of observing the sequence V³ is about 0.0368. The forward recursion α_j(t) = [Σ_i α_i(t-1) a_ij] b_j(v(t)), started from α(0) = (0, 1, 0), gives:

  t = 1 (v_2):  α_0 = 0,  α_1 = 0.3·0.3 = 0.09,  α_2 = 0.5·0.6 = 0.30
  t = 2 (v_1):  α_0 = 0,  α_1 = (0.09·0.3 + 0.30·0.5)·0.7 = 0.1239,  α_2 = (0.09·0.5 + 0.30·0.1)·0.4 = 0.03
  t = 3 (v_0):  α_0 = 0.1239·0.2 + 0.03·0.4 ≈ 0.0368,  α_1 = 0,  α_2 = 0

so P(V³) ≈ 0.0368.

(c) (3pts) Given the above sequence V³, what is the most probable sequence of hidden states?

From the trellis above and by using the decoding (Viterbi) algorithm, one can observe that the most probable sequence of hidden states is {ω_1, ω_2, ω_1, ω_0}.
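Both answers in 5(b) and 5(c) can be reproduced with a short forward/Viterbi sketch (an editorial addition; states and symbols are indexed 0, 1, 2 for ω_0, ω_1, ω_2 and v_0, v_1, v_2):

import numpy as np

A = np.array([[1.0, 0.0, 0.0],
              [0.2, 0.3, 0.5],
              [0.4, 0.5, 0.1]])     # transition probabilities a_ij
B = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.4, 0.6]])     # symbol probabilities b_jk
obs = [2, 1, 0]                     # V^3 = {v2, v1, v0}

alpha = np.array([0.0, 1.0, 0.0])   # start in w1 at t = 0
delta = alpha.copy()
for o in obs:
    alpha = (alpha @ A) * B[:, o]                        # forward recursion
    delta = (delta[:, None] * A).max(axis=0) * B[:, o]   # Viterbi recursion
print(alpha.sum())   # ~0.0368 = P(V^3)
print(delta)         # nonzero only for w0; backtracking gives w1 -> w2 -> w1 -> w0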

Appendix: Useful formulas

1. For a matrix

     A = [ a  b ]
         [ c  d ]

   the matrix inverse is

     A^{-1} = (1/|A|) [  d  -b ]  =  (1/(ad - bc)) [  d  -b ]
                      [ -c   a ]                   [ -c   a ]

2. The scatter matrices S_i are defined as

     S_i = Σ_{x in D_i} (x - m_i)(x - m_i)^t,

   where m_i is the d-dimensional sample mean. The within-class scatter matrix is defined as

     S_W = S_1 + S_2

   and the between-class scatter matrix is defined as

     S_B = (m_1 - m_2)(m_1 - m_2)^t.

   The solution for the w that optimizes

     J(w) = (w^t S_B w) / (w^t S_W w)

   is

     w = S_W^{-1}(m_1 - m_2).
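For completeness, the 2x2 inverse formula can be sanity-checked against numpy, here on the S_W matrix from question 2 (an editorial addition):

import numpy as np

A = np.array([[41/3, 2.0], [2.0, 20.0]])     # S_W from question 2
det = A[0, 0] * A[1, 1] - A[0, 1] * A[1, 0]
A_inv = np.array([[A[1, 1], -A[0, 1]],
                  [-A[1, 0], A[0, 0]]]) / det
print(np.allclose(A_inv, np.linalg.inv(A)))  # True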