ECE 8527: Introduction to Machine Learning and Pattern Recognition
Midterm # 1
Vaishali Amin
Fall, 2015
tue39624@temple.edu

Problem No. 1: Consider a two-class discrete distribution problem:
ω1: {[0,0], [2,0], [2,2], [0,2]}
ω2: {[1,1], [2,1], [1,2], [3,3]}
(a) Compute the minimum achievable error rate by a linear machine (hint: draw a picture of the data). Assume the classes are equiprobable.
(b) Assume the priors for each class are: P(ω1) = α and P(ω2) = 1-α. Sketch P(E) as a function of α for a maximum likelihood classifier based on the assumption that each class is drawn from a multivariate Gaussian distribution. Compare and contrast your answer with your answer to (a). Be very specific in your sketch and label all critical points. Unlabeled plots will receive no partial credit.
(c) Assume you are not constrained to a linear machine. What is the minimum achievable error rate that can be achieved for this data? Is this value different than (a)? If so, why? How might you achieve such a solution? Compare and contrast this solution to (a).

Solution:
(a) Let's assume a Gaussian model for the data in each class; without additional knowledge, we adopt the simplest possible model.
Mean of class-1 data: μ1 = (1/n) Σ_{i=1}^{n} x_i = [1, 1], where n = number of samples = 4
Mean of class-2 data: μ2 = (1/n) Σ_{j=1}^{n} x_j = [1.75, 1.75], where n = number of samples = 4
Covariance of class-1: Σ1 = [1.33, 0; 0, 1.33]
Covariance of class-2: Σ2 = [0.92, 0.58; 0.58, 0.92]
P(E1) = probability of error for class-1
P(E2) = probability of error for class-2
P(E) = average probability of error for the given two-class classification problem
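The class statistics quoted above can be verified with a short sketch (my own Python helpers, not part of the original solution; the covariance uses the unbiased n-1 normalization, matching the values shown):

```python
def mean_vec(points):
    """Component-wise mean of a list of 2-D points."""
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]

def sample_cov(points):
    """2x2 unbiased sample covariance matrix (divide by n - 1)."""
    n = len(points)
    mx, my = mean_vec(points)
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return [[sxx, sxy], [sxy, syy]]

w1 = [[0, 0], [2, 0], [2, 2], [0, 2]]
w2 = [[1, 1], [2, 1], [1, 2], [3, 3]]

print(mean_vec(w1))    # [1.0, 1.0]
print(mean_vec(w2))    # [1.75, 1.75]
print(sample_cov(w1))  # approx [[1.33, 0], [0, 1.33]]
print(sample_cov(w2))  # approx [[0.92, 0.58], [0.58, 0.92]]
```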
Approach-1: One intuitive approach to this problem is to plot the given data. By inspection, we can then determine a linear threshold that achieves the minimum error rate. From μ1 and μ2 we obtain the equation of the line joining the two means: y = x, with slope m = 1. Intuition suggests that a line perpendicular to this line, passing through a point on the segment joining the two means, may serve as such a linear threshold. The slope of this perpendicular line is m = -1.
Threshold-1: a line drawn at -45° from the x-axis, y + x = α, where 2 ≤ α < 3.
Decision: choose ω1 for x + y ≤ α, choose ω2 for x + y > α.
One such threshold, y + x = 2, is shown in Figure-1 in green.
Error for class-1: X ∈ ω1 with x + y > α. Error for class-2: X ∈ ω2 with x + y ≤ α.
P(E1) = (misclassified class-1 samples)/(total class-1 samples) ...Eq-1.1
     = 1/4 = 0.25
P(E2) = (misclassified class-2 samples)/(total class-2 samples) ...Eq-1.2
     = 1/4 = 0.25
P(E) = [P(E1) + P(E2)]/2 = (0.25 + 0.25)/2 ...Eq-1.3
Minimum probability of error using a linear machine: P(E) = 0.25. Answer
Threshold-2 (shown in Figure-1 in pink): a line drawn at -45° from the x-axis, y + x = 3.
Decision: choose ω1 for x + y < 3, choose ω2 for x + y ≥ 3.
Error for class-1: X ∈ ω1 with x + y ≥ 3. Error for class-2: X ∈ ω2 with x + y < 3.
Using Eq-1.1, 1.2 and 1.3: P(E1) = 0.25, P(E2) = 0.25 and P(E) = 0.25.
Minimum probability of error using a linear machine: P(E) = 0.25. Answer
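The error count for Threshold-1 can be checked directly on the eight points. The following is a small Python sketch (my own, not part of the original solution) of the rule "decide ω1 when x + y ≤ α":

```python
w1 = [[0, 0], [2, 0], [2, 2], [0, 2]]  # class omega-1
w2 = [[1, 1], [2, 1], [1, 2], [3, 3]]  # class omega-2

def error_rate(alpha):
    # A class-1 point is missed when x + y > alpha; a class-2 point when x + y <= alpha.
    e1 = sum(1 for x, y in w1 if x + y > alpha) / len(w1)
    e2 = sum(1 for x, y in w2 if x + y <= alpha) / len(w2)
    return (e1 + e2) / 2  # average under equal priors (Eq-1.3)

print(error_rate(2))  # 0.25: only [2,2] from w1 and [1,1] from w2 are misclassified
```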
Figure-1
Generalized approach: It is not possible to find a linear discriminant function (threshold) for a large amount of data just by inspecting it. With knowledge of parameters such as the mean and covariance of the data, a discriminant function for the two-class minimum-error-rate problem can be obtained from the decision boundary
g_i(X) - g_j(X) = X^t A X + b^t X + c = 0.
(b) Prior for class-1: P(ω1) = α, and prior for class-2: P(ω2) = 1 - α, where 0 ≤ α ≤ 1.
The conditional probability of error is given by
P(error|x) = P(ω2|x) if we decide ω1; P(ω1|x) if we decide ω2.
For a maximum likelihood classifier, we use the following Bayes decision rule to minimize the probability of error:
decide ω1 if P(ω1|x) > P(ω2|x); otherwise decide ω2.
Bayes formula:
P(ωj|x) = p(x|ωj) P(ωj) / p(x),
where the evidence p(x) is a scale factor (taken as 1 here); it ensures that P(ω1|x) + P(ω2|x) = 1.
Hence, the decision rule for the maximum likelihood classifier is:
decide ω1 if p(x|ω1) P(ω1) > p(x|ω2) P(ω2); otherwise decide ω2.
Hence, P(error|x) = min[P(ω1|x), P(ω2|x)] ...Eq-1.4
Assume p(x|ω1) = p(x|ω2), i.e., both classes are equally likely given the measurement. The measurement x then gives no useful information, and the decision is based entirely on prior information. In this case, the decision rule for the maximum likelihood classifier is:
decide ω1 if P(ω1) > P(ω2); otherwise decide ω2.
Under this assumption, P(E) = min[P(ω1), P(ω2)] = min(α, 1 - α) ...Eq-1.5
Figure-2 shows the plot of the probability of error, P(E), calculated using Eq-1.5, as a function of α.
Critical points: The maximum probability of error occurs at α = 0.5, when P(ω1) = P(ω2) = 0.5. With equal class probabilities, the uncertainty associated with classification is maximum, and hence so is the error: P(E) = 0.5.
For α = 1: P(ω1) = 1 and P(ω2) = 0. For α = 0: P(ω1) = 0 and P(ω2) = 1. For these values of α, the classification is completely predictable, as data from only one class exists at a time. Hence P(E) = 0 at α = 0 and α = 1.
Comparison/contrast with the solution in part (a): Part (a) assumes equal priors and uses the measurement through a linear decision surface, whereas in part (b) the class-conditional likelihoods are assumed equal, so the decision rests on the priors alone. Hence, for P(ω1) = P(ω2) = 0.5, a better error rate is achieved by the solution in part (a) (0.25 versus 0.5). For the solution in part (a), if the classes are not equiprobable, the minimum-error decision surface shifts away from the more likely class and toward the less likely class. Both of the above solutions assume a zero-one loss function.
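The critical points claimed for Figure-2 follow from Eq-1.5 and can be confirmed numerically with a trivial Python sketch (my own, not in the original):

```python
# P(E) from Eq-1.5 as a function of the prior alpha.
def p_error(alpha):
    return min(alpha, 1 - alpha)

grid = [i / 100 for i in range(101)]
peak = max(grid, key=p_error)
print(peak, p_error(peak))          # 0.5 0.5 -> maximum error at equal priors
print(p_error(0.0), p_error(1.0))   # 0.0 0.0 -> error vanishes at the endpoints
```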
If both classes are equiprobable but the cost associated with misclassifying one class is higher than that of the other, the decision surface shifts away from the class with the higher misclassification cost, toward the other class.
Figure-2
(c) Minimum achievable error rate: A minimum error rate of P(E) = 0 can be achieved using a highly non-linear surface, as shown in Figure-3. For the given data, this type of surface can be obtained using a Support Vector Machine (SVM).
Comparison/contrast with the solution in part (a): Highly non-linear, complicated decision surfaces such as the one shown in Figure-3 may classify the training samples perfectly, giving a 0% error rate, but may perform poorly on future patterns because they lack generalization; the linear decision surface shown in Figure-1 of part (a) provides good generalization of the data under classification. The solution in part (c) demands complex models tuned to the particular training samples, whereas the classifier from part (a) uses a simple model based on underlying characteristics of the data.
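To make the "perfect training classification" point concrete: any rule that memorizes the training set drives the training error to zero. The sketch below uses a 1-nearest-neighbour classifier (my substitution for illustration; the text itself proposes an SVM) in plain Python:

```python
# Training data: point -> class label (1 or 2).
train = [((0, 0), 1), ((2, 0), 1), ((2, 2), 1), ((0, 2), 1),
         ((1, 1), 2), ((2, 1), 2), ((1, 2), 2), ((3, 3), 2)]

def nn_classify(p):
    """Label of the training point nearest to p (squared Euclidean distance)."""
    return min(train, key=lambda t: (t[0][0] - p[0]) ** 2 + (t[0][1] - p[1]) ** 2)[1]

# Every training point is its own nearest neighbour, so the training error is 0,
# exactly the overfitting caveat discussed above.
errors = sum(1 for x, label in train if nn_classify(x) != label)
print(errors)  # 0
```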
Figure-3
Problem No. 2: Suppose we have a random sample X1, X2, ..., Xn where: Xi = 0 if a randomly selected student does not own a laptop, and Xi = 1 if a randomly selected student does own a laptop. Assuming that the Xi are independent Bernoulli random variables with unknown parameter p:
p(x_i; p) = p^{x_i} (1-p)^{1-x_i}
where x_i = 0 or 1 and 0 < p < 1. Find the maximum likelihood estimator of p, the proportion of students who own a laptop.
Solution: The probability mass function for a Bernoulli random variable is f(x_i; p) = p^{x_i} (1-p)^{1-x_i}. Let's define the set D of n training samples drawn independently from the probability density p(x; p), to estimate the unknown parameter p:
D = {x1, x2, ..., xn}, where the values of x1, x2, ..., xn are known.
Because these samples are drawn independently, the likelihood function of the unknown parameter p is
f(D|p) = Π_{i=1}^{n} f(x_i; p)
       = f(x1; p) f(x2; p) f(x3; p) ... f(xn; p)
       = p^{x1}(1-p)^{1-x1} · p^{x2}(1-p)^{1-x2} · ... · p^{xn}(1-p)^{1-xn}
f(D|p) = p^{Σ_{i=1}^{n} x_i} (1-p)^{n - Σ_{i=1}^{n} x_i} ...Eq-2.1
Now, the natural logarithm is an increasing function of x, i.e., for x1 > x2, ln(x1) > ln(x2). Hence, the value of p which maximizes the natural logarithm of the likelihood function, ln f(D|p), is also the value of p which maximizes the likelihood function f(D|p). Taking the natural log of both sides of Eq-2.1:
ln f(D|p) = (Σ_{i=1}^{n} x_i) ln(p) + (n - Σ_{i=1}^{n} x_i) ln(1-p) ...Eq-2.2
To find the maximum of ln f(D|p), we differentiate Eq-2.2 with respect to the unknown parameter p and set the derivative to 0:
d[ln f(D|p)]/dp = (Σ x_i)/p - (n - Σ x_i)/(1-p) = 0
(1-p) Σ x_i - p (n - Σ x_i) = 0
Σ x_i - n p = 0
p̂ = (1/n) Σ_{i=1}^{n} x_i
The maximum likelihood estimator of p, the proportion of students who own a laptop, is
p̂ = (1/n) Σ_{i=1}^{n} X_i. Answer
However, to confirm that the solution p̂ is a true global maximum, we need to take the second derivative of Eq-2.2 with respect to p. The derivative must be negative for our estimate p̂ to be a global maximum.
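The closed form p̂ = (1/n) Σ x_i can be sanity-checked against a brute-force grid maximization of the log-likelihood in Eq-2.2. The sample below is a made-up toy example (my own numbers, not from the problem statement):

```python
import math

x = [1, 0, 1, 1, 0, 1, 1, 0]   # assumed toy sample: 5 laptop owners out of 8
n, s = len(x), sum(x)
p_hat = s / n                  # closed-form MLE: sample proportion

def log_lik(p):
    # Eq-2.2: (sum x_i) ln p + (n - sum x_i) ln(1 - p)
    return s * math.log(p) + (n - s) * math.log(1 - p)

# Brute-force maximization over a fine grid of the open interval (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
p_grid = max(grid, key=log_lik)
print(p_hat, p_grid)  # 0.625 0.625 -> the grid search lands on the closed form
```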
d²[ln f(D|p)]/dp² = -(Σ_{i=1}^{n} x_i)/p² - (n - Σ_{i=1}^{n} x_i)/(1-p)² < 0 ...Eq-2.3
Eq-2.3 confirms that the estimate p̂ maximizes the likelihood of the proportion of students who own a laptop.
Problem No. 3: Let's assume you have a 2D Gaussian source which generates random vectors of the form [x1, x2]. You observe the following data: [1, 1], [2, 2], [3, 3]. You were told the mean of this source was 0 and the standard deviation was 1.
(a) Using Bayesian estimation techniques, what is your best estimate of the mean based on these observations?
(b) Now, suppose you observe a 4th value: [0, 0]. How does this impact your estimate of the mean? Explain, being as specific as possible. Support your explanation with calculations and equations.
Solution:
(a) Derivation of the Bayesian estimate of the unknown mean, based on n observations: D = {x1, x2, ..., xn} is a set of independent samples. Let's assume only μ is unknown. For the given bivariate case, p(x|μ) ~ N(μ, Σ), with known prior density p(μ) ~ N(μ0, Σ0), where Σ, μ0 and Σ0 are assumed to be known.
Applying Bayes' formula to find the posterior density p(μ|D):
p(μ|D) = p(D|μ) p(μ) / ∫ p(D|μ) p(μ) dμ
       = α Π_{k=1}^{n} p(x_k|μ) p(μ)
       = α exp( -1/2 [ (μ - μ0)^t Σ0^{-1} (μ - μ0) + Σ_{k=1}^{n} (x_k - μ)^t Σ^{-1} (x_k - μ) ] )
which has the form
p(μ|D) = α' exp( -1/2 (μ - μ_n)^t Σ_n^{-1} (μ - μ_n) ).
Here, α and α' are normalization factors which depend on D but are independent of μ. Hence p(μ|D) ~ N(μ_n, Σ_n). Equating the coefficients between the two Gaussians, we obtain the following:
Σ_n^{-1} = n Σ^{-1} + Σ0^{-1}
Σ_n^{-1} μ_n = n Σ^{-1} μ̂_n + Σ0^{-1} μ0 ...Eq-3.1
where μ̂_n is the sample mean. Using matrix identities and a little manipulation, the solution of Eq-3.1 is given by
μ_n = Σ0 (Σ0 + (1/n)Σ)^{-1} μ̂_n + (1/n) Σ (Σ0 + (1/n)Σ)^{-1} μ0
Σ_n = Σ0 (Σ0 + (1/n)Σ)^{-1} (1/n)Σ ...Eq-3.2
In Eq-3.2, μ_n represents our best guess for μ after observing n samples, and Σ_n measures our uncertainty about this guess. Eq-3.2 shows how prior information is combined with the empirical information in the samples to obtain the posterior density p(μ|D) using Bayesian parameter estimation techniques.
For the given bivariate problem: x1 = [1, 1], x2 = [2, 2], x3 = [3, 3], n = 3, Σ0 = [1, 0; 0, 1], μ0 = [0, 0].
Covariance matrix of the observed samples: Σ = [1, 1; 1, 1]
Sample mean: μ̂_3 = (1/3) Σ_{k=1}^{3} x_k = ([1, 1] + [2, 2] + [3, 3])/3 = [2, 2]
Using Eq-3.2, our best guess for μ after observing n = 3 samples is μ_3 = [1.2, 1.2]. Answer
Uncertainty about this guess: Σ_3 = [0.20, 0.20; 0.20, 0.20]
The above was calculated using the following MATLAB code:
% Part (a)
n_1 = 3;
Sigma_Sample = cov([1 1; 2 2; 3 3]);
Sigma_0 = [1, 0; 0, 1];
Mean_0 = [0; 0];
Mean_sample_1 = [2; 2];
Mean = (Sigma_0 * inv(Sigma_0 + Sigma_Sample/n_1)) * Mean_sample_1 + (Sigma_Sample * inv(Sigma_0 + Sigma_Sample/n_1) * Mean_0) / n_1;
Covariance = Sigma_0 * inv(Sigma_0 + Sigma_Sample/n_1) * (Sigma_Sample/n_1);
display(Mean);
display(Covariance);
(b) Impact on the estimate of the mean of adding another observation point, x4 = [0, 0]:
Sample mean: μ̂_4 = (1/4) Σ_{k=1}^{4} x_k = ([1, 1] + [2, 2] + [3, 3] + [0, 0])/4 = [1.5, 1.5]
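The MATLAB result for part (a) can be reproduced with a small pure-Python version of Eq-3.2 (my own 2x2 helpers, shown only for verification):

```python
def inv2(A):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def mm(A, B):
    """Product of two 2x2 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def mv(A, v):
    """2x2 matrix times 2-vector."""
    return [sum(A[i][k] * v[k] for k in range(2)) for i in range(2)]

n = 3
Sigma0 = [[1.0, 0.0], [0.0, 1.0]]
Sigma = [[1.0, 1.0], [1.0, 1.0]]   # sample covariance of the three points
mu0 = [0.0, 0.0]
mu_bar = [2.0, 2.0]                # sample mean

Sn = [[Sigma[i][j] / n for j in range(2)] for i in range(2)]              # Sigma/n
M = inv2([[Sigma0[i][j] + Sn[i][j] for j in range(2)] for i in range(2)])  # (Sigma0 + Sigma/n)^-1
# Eq-3.2: mu_n = Sigma0 M mu_bar + (Sigma/n) M mu0 ; Sigma_n = Sigma0 M (Sigma/n)
term1 = mv(mm(Sigma0, M), mu_bar)
term2 = mv(mm(Sn, M), mu0)
mu_n = [term1[i] + term2[i] for i in range(2)]
Sigma_n = mm(mm(Sigma0, M), Sn)

print(mu_n)     # approx [1.2, 1.2]
print(Sigma_n)  # approx [[0.2, 0.2], [0.2, 0.2]]
```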
MATLAB code:
n_2 = 4;
Mean_sample_2 = [1.5; 1.5];
Mean_2 = (Sigma_0 * inv(Sigma_0 + Sigma_Sample/n_2)) * Mean_sample_2 + (Sigma_Sample * inv(Sigma_0 + Sigma_Sample/n_2) * Mean_0) / n_2;
Covariance_2 = Sigma_0 * inv(Sigma_0 + Sigma_Sample/n_2) * (Sigma_Sample/n_2);
display(Mean_2);
display(Covariance_2);
(Note that this code reuses Sigma_Sample, the sample covariance computed from the first three observations.)
Using Eq-3.2 and the above MATLAB code, our best guess for μ after observing n = 4 samples is μ_4 = [1.0, 1.0]. Answer
Uncertainty about this guess: Σ_4 = [0.17, 0.17; 0.17, 0.17]
Observations: Comparing the answers of parts (a) and (b), we can see that by increasing the number of observations (data points), the best estimate of the mean comes closer to the sample mean. Also, the uncertainty associated with this estimate decreases as more data points are observed, i.e., the best estimate of the mean converges to the true mean.
Proof: From Eq-3.2, it is clear that Σ_n decreases monotonically with n; hence, each additional observation decreases our uncertainty about the true value of μ. As n approaches infinity: Σ_n → 0, i.e., the uncertainty associated with our best estimate of the mean approaches zero. In the equation for μ_n, the term Σ0(Σ0 + (1/n)Σ)^{-1} μ̂_n approaches μ̂_n, and the term (1/n)Σ(Σ0 + (1/n)Σ)^{-1} μ0 approaches 0. Hence, our best estimate of the mean, μ_n, approaches the sample mean μ̂_n with vanishing uncertainty, and its reliance on the prior information μ0 decreases, provided Σ0 is not the zero matrix (which is the case for most problems). The posterior density p(μ|D) becomes more and more sharply peaked, approaching a Dirac delta function as n approaches infinity.
Special cases: If Σ0 = 0 (a rare possibility), we have a degenerate case in which our prior certainty that μ = μ0 is so strong that no number of observations can change our opinion. If Σ0 >> Σ, we are extremely uncertain about our prior guess, and we would take μ_n = μ̂_n, using only the samples to estimate μ.
Generalization: If the ratio Σ/Σ0 is not infinite, then after observing a sufficiently large number of samples, the prior information μ0 and Σ0 becomes unimportant, and μ_n converges to the sample mean μ̂_n.
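The asymptotic behaviour above can be illustrated with a one-dimensional reduction of Eq-3.2 (my own simplification; sigma0 and sigma here denote the prior and source variances, both set to 1, with mu0 = 0):

```python
def posterior(n, mu_bar, mu0=0.0, s0=1.0, s=1.0):
    """Scalar Bayesian mean update: returns (mu_n, var_n).

    s0 = prior variance, s = source variance, mu_bar = sample mean of n points.
    """
    w = n * s0 / (n * s0 + s)          # weight placed on the data
    mu_n = w * mu_bar + (1 - w) * mu0  # shrinks toward mu0 for small n
    var_n = s0 * s / (n * s0 + s)      # posterior variance, -> 0 as n grows
    return mu_n, var_n

for n in (1, 10, 1000):
    print(n, posterior(n, mu_bar=1.5))
# As n grows, mu_n approaches the sample mean 1.5 and var_n shrinks toward 0,
# matching the convergence argument in the text.
```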