ECE 8527: Introduction to Machine Learning and Pattern Recognition, Midterm #1. Vaishali Amin, Fall 2015. tue39624@temple.edu


Problem No. 1: Consider a two-class discrete distribution problem:

ω1: {[0,0], [2,0], [2,2], [0,2]}
ω2: {[1,1], [2,1], [1,2], [3,3]}

(a) Compute the minimum achievable error rate by a linear machine (hint: draw a picture of the data). Assume the classes are equiprobable.

(b) Assume the priors for each class are: P(ω1) = α and P(ω2) = 1 - α. Sketch P(E) as a function of α for a maximum likelihood classifier based on the assumption that each class is drawn from a multivariate Gaussian distribution. Compare and contrast your answer with your answer to (a). Be very specific in your sketch and label all critical points. Unlabeled plots will receive no partial credit.

(c) Assume you are not constrained to a linear machine. What is the minimum achievable error rate that can be achieved for this data? Is this value different than (a)? If so, why? How might you achieve such a solution? Compare and contrast this solution to (a).

Solution:

(a) Let us assume a Gaussian model for the data in each class; without additional knowledge, we adopt the simplest possible model.

Mean of class-1 data: μ1 = (1/n) Σ_{i=1}^{n} Xi = [1, 1], where n = number of samples = 4
Mean of class-2 data: μ2 = (1/n) Σ_{j=1}^{n} Xj = [1.75, 1.75], where n = 4
Covariance of class-1: Σ1 = [1.33, 0; 0, 1.33]
Covariance of class-2: Σ2 = [0.92, 0.58; 0.58, 0.92]

P(E1) = probability of error for class-1
P(E2) = probability of error for class-2
P(E) = average probability of error for the given two-class classification
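As a quick sanity check, the class statistics above can be reproduced in a few lines of Matlab (a minimal sketch; the variable names w1, w2, mu1, mu2, S1, S2 are mine, not from the original solution):

% Sketch: reproduce the class means and covariances from the raw data
w1 = [0 0; 2 0; 2 2; 0 2];   % class-1 samples
w2 = [1 1; 2 1; 1 2; 3 3];   % class-2 samples
mu1 = mean(w1)               % -> [1.00 1.00]
mu2 = mean(w2)               % -> [1.75 1.75]
S1  = cov(w1)                % -> [1.33 0; 0 1.33]
S2  = cov(w2)                % -> [0.92 0.58; 0.58 0.92]

Note that cov uses the unbiased (n-1) normalization, which matches the values quoted above.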

Approach-1: One intuitive approach to this problem is to plot the given data. By inspection, we can then determine a linear threshold that achieves the minimum error rate. From μ1 and μ2, the equation of the line joining the two means is x = y, with slope m = 1. Intuition suggests that a line perpendicular to the line joining the two means, passing through a point on the segment between them, is a good candidate for such a threshold. The slope m of this perpendicular line is -1.

Threshold-1: a line drawn at -45° from the x-axis, y + x = α, where 2 ≤ α < 3.
Decision: choose ω1 for x1 + x2 ≤ α; choose ω2 for x1 + x2 > α.
One such threshold, y + x = 2, is shown in Figure-1 in green.

A class-1 sample is misclassified when x1 + x2 > α; a class-2 sample is misclassified when x1 + x2 ≤ α.

P(E1) = number of misclassified class-1 samples / total class-1 samples ...Eq-1.1
      = 1/4 = 0.25
P(E2) = number of misclassified class-2 samples / total class-2 samples ...Eq-1.2
      = 1/4 = 0.25
P(E) = [P(E1) + P(E2)] / 2 = (0.25 + 0.25)/2 ...Eq-1.3

Minimum probability of error using a linear machine: P(E) = 0.25. Answer

Threshold-2 (shown in Figure-1 in pink): a line drawn at -45° from the x-axis, y + x = 3.
Decision: choose ω1 for x1 + x2 < 3; choose ω2 for x1 + x2 ≥ 3.
A class-1 sample is misclassified when x1 + x2 ≥ 3; a class-2 sample is misclassified when x1 + x2 < 3.
Using Eq-1.1, 1.2 and 1.3: P(E1) = 0.25, P(E2) = 0.25 and P(E) = 0.25.

Minimum probability of error using a linear machine: P(E) = 0.25. Answer
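The error count for Threshold-1 can be verified with a short Matlab sketch (assuming the w1 and w2 matrices from the sketch above; this is an illustration, not part of the original solution):

% Sketch: count errors for the threshold x1 + x2 = alpha, deciding w1 when
% x1 + x2 <= alpha and w2 otherwise
alpha = 2;
err1 = sum(sum(w1, 2) >  alpha);                 % class-1 samples called w2 -> 1
err2 = sum(sum(w2, 2) <= alpha);                 % class-2 samples called w1 -> 1
PE   = 0.5*(err1/size(w1,1) + err2/size(w2,1))   % -> 0.25, matching Eq-1.3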

[Figure-1: scatter plot of the two classes with the candidate thresholds y + x = 2 (green) and y + x = 3 (pink)]

Generalized approach: For a large number of data points, it is not practical to find a linear discriminant function (threshold) by inspection alone. Given parameters such as the mean and covariance of the data, a minimum-error-rate discriminant function for the two-class problem can be obtained from the decision boundary

g_i(X) - g_j(X) = X^t A X + b^t X + c = 0,

which is linear when A = 0 (for example, when the class covariances are equal).

(b) Prior for class-1, P(ω1) = α, and prior for class-2, P(ω2) = 1 - α, where 0 ≤ α ≤ 1.

The conditional probability of error is given by:

P(error | x) = P(ω2 | x) if we decide ω1; P(ω1 | x) if we decide ω2
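For reference, one way to realize the g_i(X) - g_j(X) boundary for Gaussian class models is the standard quadratic-discriminant form, sketched below using the statistics computed earlier (the coefficient names A, b, c follow the equation above; the priors P1, P2 are set to the equiprobable values assumed in part (a)):

% Sketch: coefficients of g1(x) - g2(x) = x'*A*x + b'*x + c for two Gaussian
% classes; A = 0 (a linear boundary) exactly when the covariances are equal
P1 = 0.5;  P2 = 0.5;
A = -0.5*(inv(S1) - inv(S2));
b = inv(S1)*mu1' - inv(S2)*mu2';
c = -0.5*(mu1*inv(S1)*mu1' - mu2*inv(S2)*mu2') ...
    - 0.5*log(det(S1)/det(S2)) + log(P1/P2);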

For a maximum likelihood classifier, we use the following Bayes decision rule to minimize the probability of error:

decide ω1 if P(ω1 | x) > P(ω2 | x); otherwise decide ω2

Bayes formula: P(ωj | x) = p(x | ωj) P(ωj) / p(x)

The evidence p(x) is a scale factor (taken as 1 here); it ensures that P(ω1 | x) + P(ω2 | x) = 1. Hence, the decision rule for the maximum likelihood classifier is:

decide ω1 if p(x | ω1) P(ω1) > p(x | ω2) P(ω2); otherwise decide ω2

Hence, P(error | x) = min[ P(ω1 | x), P(ω2 | x) ] ...Eq-1.4

Assume p(x | ω1) = p(x | ω2), i.e., the two class-conditional likelihoods are equal. The measurement x then gives no useful information, and the decision is based entirely on the prior information. In this case, the decision rule for the maximum likelihood classifier is:

decide ω1 if P(ω1) > P(ω2); otherwise decide ω2

For this assumption, P(E) = min[ P(ω1), P(ω2) ] = min[ α, 1 - α ] ...Eq-1.5

Figure-2 shows the plot of the probability of error, P(E), calculated using Eq-1.5, as a function of α.

Critical points: The maximum probability of error occurs at α = 0.5, when P(ω1) = P(ω2) = 0.5. With equal class priors, the uncertainty associated with the classification is maximum, and hence so is the error: P(E) = 0.5. For α = 1, P(ω1) = 1 and P(ω2) = 0; for α = 0, P(ω1) = 0 and P(ω2) = 1. For these values of α the classification is completely predictable, since only one class's data exists at a time; hence P(E) = 0 at α = 0 and α = 1.

Comparison/contrast with the solution in part (a): The solution in part (a) assumes equiprobable classes and uses the measurements, whereas part (b) assumes the measurements carry no information and decides from the priors alone. Hence, for P(ω1) = P(ω2) = 0.5, a better error rate is achieved by the solution in part (a) (0.25 versus 0.5). For the solution in part (a), if the classes are not equiprobable, the minimum-error decision surface shifts away from the more likely class and toward the less likely class. Both solutions assume a zero-one loss function. If the classes are equiprobable but the cost of misclassifying one class is higher than that of the other, the decision surface shifts away from the class with the higher misclassification cost, toward the other class.
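Figure-2 can be regenerated with a short Matlab sketch of Eq-1.5 (illustrative only):

% Sketch: P(E) = min(alpha, 1 - alpha), peaking at 0.5 when alpha = 0.5
a  = linspace(0, 1, 101);
PE = min(a, 1 - a);
plot(a, PE); grid on;
xlabel('\alpha = P(\omega_1)'); ylabel('P(E)');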

[Figure-2: P(E) = min(α, 1 - α) versus α, rising linearly from 0 at α = 0 to a maximum of 0.5 at α = 0.5 and falling back to 0 at α = 1]

(c) Minimum achievable error rate: A minimum error rate of P(E) = 0 can be achieved using a highly non-linear decision surface, as shown in Figure-3. For the given data, this type of surface can be obtained using a Support Vector Machine (SVM) with a non-linear kernel.

Comparison/contrast with the solution in part (a): Highly non-linear, complicated decision surfaces such as the one in Figure-3 may classify the training samples perfectly, giving a 0% error rate, but may perform poorly on future patterns because they lack generalization; the linear decision surface of Figure-1 in part (a) generalizes better to the data under classification. The solution in part (c) demands complex models tuned to the particular training samples, whereas the classifier in part (a) uses a simple model based on underlying characteristics of the data.
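As a rough illustration of the SVM route (this is my sketch, not the classifier behind Figure-3; it assumes the Statistics and Machine Learning Toolbox function fitcsvm, and the kernel settings are arbitrary choices):

% Sketch: a kernel SVM can drive the training error on this set to zero,
% though, as noted above, such a fit may generalize poorly
X = [w1; w2];
y = [ones(4,1); 2*ones(4,1)];                 % labels: 1 for w1, 2 for w2
model = fitcsvm(X, y, 'KernelFunction', 'rbf', 'BoxConstraint', 1e3);
trainErr = mean(predict(model, X) ~= y)       % expected to be 0 here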

[Figure-3: a highly non-linear decision surface that separates the two classes perfectly]

Problem No. 2: Suppose we have a random sample X1, X2, ..., Xn where: Xi = 0 if a randomly selected student does not own a laptop, and Xi = 1 if a randomly selected student does own a laptop. Assuming that the Xi are independent Bernoulli random variables with unknown parameter p:

p(xi; p) = p^xi (1 - p)^(1 - xi)

where xi = 0 or 1 and 0 < p < 1, find the maximum likelihood estimator of p, the proportion of students who own a laptop.

Solution: The probability mass function for a Bernoulli random variable is f(xi; p) = p^xi (1 - p)^(1 - xi). Let us define a set D of training samples drawn independently from this distribution to estimate the unknown parameter p:

D = {x1, x2, ..., xn}, where the values of x1, x2, ..., xn are known.

Because these samples are drawn independently, the likelihood function of the unknown parameter p is

f(D | p) = Π_{i=1}^{n} f(xi; p) = f(x1; p) f(x2; p) f(x3; p) ... f(xn; p)
         = p^(x1) (1-p)^(1-x1) · p^(x2) (1-p)^(1-x2) · ... · p^(xn) (1-p)^(1-xn)

f(D | p) = p^(Σ_{i=1}^{n} xi) (1 - p)^(n - Σ_{i=1}^{n} xi) ...Eq-2.1

The natural logarithm is an increasing function of x, i.e., for x1 > x2, ln(x1) > ln(x2). Hence the value of p that maximizes the log-likelihood, ln f(D | p), is also the value of p that maximizes the likelihood, f(D | p). Taking the natural log of both sides of Eq-2.1:

ln f(D | p) = (Σ_{i=1}^{n} xi) ln(p) + (n - Σ_{i=1}^{n} xi) ln(1 - p) ...Eq-2.2

To find the maximum of ln f(D | p), we differentiate Eq-2.2 with respect to the unknown parameter p and set the derivative to 0:

d[ln f(D | p)]/dp = (Σ xi)/p - (n - Σ xi)/(1 - p) = 0
(1 - p) Σ xi - p (n - Σ xi) = 0
Σ xi - n p = 0
p̂ = (1/n) Σ xi

The maximum likelihood estimator of p, the proportion of students who own a laptop, is

p̂ = (1/n) Σ_{i=1}^{n} Xi. Answer
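A quick numerical cross-check of this result on a made-up data vector (the sample x below is hypothetical, chosen only to exercise Eq-2.2):

% Sketch: the grid maximizer of the log-likelihood of Eq-2.2 agrees with the
% closed-form MLE, the sample proportion, up to the grid resolution
x = [1 0 1 1 0 1 1 0];                        % hypothetical 0/1 laptop data
n = numel(x);
p = linspace(0.01, 0.99, 99);
LL = sum(x)*log(p) + (n - sum(x))*log(1 - p); % Eq-2.2 on a grid
[~, k] = max(LL);
[p(k), mean(x)]                               % -> approximately [0.63 0.625]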

To confirm that the solution p̂ is a true global maximum, we take the second derivative of Eq-2.2 with respect to p; it must be negative at p̂ for a maximum:

d²[ln f(D | p)]/dp² = -(Σ xi)/p² - (n - Σ xi)/(1 - p)² < 0 ...Eq-2.3

Equation 2.3 is negative for all 0 < p < 1, confirming that the estimate p̂ maximizes the likelihood of the proportion of students who own a laptop.

Problem No. 3: Let us assume you have a 2D Gaussian source which generates random vectors of the form [x1, x2]. You observe the following data: [1, 1], [2, 2], [3, 3]. You were told the mean of this source was 0 and the standard deviation was 1.

(a) Using Bayesian estimation techniques, what is your best estimate of the mean based on these observations?
(b) Now, suppose you observe a 4th value: [0, 0]. How does this impact your estimate of the mean? Explain, being as specific as possible. Support your explanation with calculations and equations.

Solution:

(a) Derivation of the Bayesian estimate for the unknown mean, based on n observations: D = {x1, x2, ..., xn} is a set of independent samples. Let us assume only μ is unknown. For the given bivariate case, p(x | μ) ~ N(μ, Σ), with known prior density p(μ) ~ N(μ0, Σ0), where Σ, μ0 and Σ0 are assumed known.

Applying Bayes formula to find the posterior density p(μ | D):

p(μ | D) = p(D | μ) p(μ) / ∫ p(D | μ) p(μ) dμ = α Π_{k=1}^{n} p(xk | μ) p(μ)
         = α' exp( -(1/2) [ (μ - μ0)^t Σ0^(-1) (μ - μ0) + Σ_{k=1}^{n} (xk - μ)^t Σ^(-1) (xk - μ) ] )

which has the form

p(μ | D) = α'' exp( -(1/2) (μ - μn)^t Σn^(-1) (μ - μn) )

Here the α factors are normalization constants that depend on D but are independent of μ. Hence p(μ | D) ~ N(μn, Σn). Equating coefficients between the two Gaussian forms, we obtain:

Σn^(-1) = Σ0^(-1) + n Σ^(-1)
Σn^(-1) μn = Σ0^(-1) μ0 + n Σ^(-1) μ̂n ...Eq-3.1

where μ̂n = (1/n) Σ_{k=1}^{n} xk is the sample mean. Using standard matrix identities and a little manipulation, the solution of Eq-3.1 is given by:

μn = Σ0 (Σ0 + (1/n) Σ)^(-1) μ̂n + (1/n) Σ (Σ0 + (1/n) Σ)^(-1) μ0
Σn = Σ0 (Σ0 + (1/n) Σ)^(-1) (1/n) Σ ...Eq-3.2

In Eq-3.2, μn represents our best guess for μ after observing n samples, and Σn measures our uncertainty about this guess. Eq-3.2 shows how the prior information is combined with the empirical information in the samples to obtain the posterior density, p(μ | D), using Bayesian parameter estimation techniques.

For the given bivariate problem: x1 = [1, 1], x2 = [2, 2], x3 = [3, 3], n = 3, Σ0 = [1, 0; 0, 1], μ0 = [0, 0].

Covariance matrix of the observed samples: Σ = [1, 1; 1, 1]
Sample mean: μ̂3 = (1/3) Σ_{k=1}^{3} xk = ([1, 1] + [2, 2] + [3, 3]) / 3 = [2, 2]

Using Eq-3.2, our best guess for μ after observing n = 3 samples is μ3 = [1.2, 1.2]. Answer
Uncertainty about this guess: Σ3 = [0.20, 0.20; 0.20, 0.20]

The above was calculated using the following Matlab code:

% Part (a)
n_1 = 3;
Sigma_Sample = cov([1 1; 2 2; 3 3]);   % sample covariance of the 3 observations
Sigma_0 = [1, 0; 0, 1];                % prior covariance
Mean_0 = [0; 0];                       % prior mean
Mean_sample_1 = [2; 2];                % sample mean
Mean = (Sigma_0*inv(Sigma_0 + Sigma_Sample/n_1))*Mean_sample_1 + ...
       (Sigma_Sample*inv(Sigma_0 + Sigma_Sample/n_1)*Mean_0)/n_1;
Covariance = Sigma_0*inv(Sigma_0 + Sigma_Sample/n_1)*(Sigma_Sample/n_1);
display(Mean);
display(Covariance);

(b) Impact on the estimate of the mean, μ, of adding another observation point, x4 = [0, 0]:

Sample mean: μ̂4 = (1/4) Σ_{k=1}^{4} xk = ([1, 1] + [2, 2] + [3, 3] + [0, 0]) / 4 = [1.5, 1.5]

Matlab code:

% Part (b)
n_2 = 4;
Mean_sample_2 = [1.5; 1.5];            % sample mean of the 4 observations
Mean_2 = (Sigma_0*inv(Sigma_0 + Sigma_Sample/n_2))*Mean_sample_2 + ...
         (Sigma_Sample*inv(Sigma_0 + Sigma_Sample/n_2)*Mean_0)/n_2;
Covariance_2 = Sigma_0*inv(Sigma_0 + Sigma_Sample/n_2)*(Sigma_Sample/n_2);
display(Mean_2);
display(Covariance_2);

Using Eq-3.2 and the above Matlab code, our best guess for μ after observing n = 4 samples is μ4 = [1.0, 1.0]. Answer
Uncertainty about this guess: Σ4 = [0.17, 0.17; 0.17, 0.17]

Observations: Comparing the answers of parts (a) and (b), we see that as the number of observations (data points) increases, the best estimate of the mean moves closer to the sample mean. The uncertainty associated with the estimate also decreases as more data points are observed; i.e., the best estimate of the mean converges toward the true mean.

Proof: From Eq-3.2 it is clear that Σn decreases monotonically with n, so each additional observation decreases our uncertainty about the true value of μ. As n approaches infinity, Σn → 0; i.e., the uncertainty associated with our best estimate of the mean approaches zero. In the equation for μn, the term Σ0 (Σ0 + (1/n) Σ)^(-1) μ̂n approaches μ̂n, and the term (1/n) Σ (Σ0 + (1/n) Σ)^(-1) μ0 approaches 0. Hence our best estimate of the mean, μn, approaches the sample mean, μ̂n, with vanishing uncertainty, and its reliance on the prior information, μ0, decreases, provided Σ0 is not the zero matrix (which is the case for most problems). The posterior density p(μ | D) becomes more and more sharply peaked, approaching a Dirac delta function as n approaches infinity.

Special cases: If Σ0 = 0 (a rare possibility), we have a degenerate case in which our prior certainty that μ = μ0 is so strong that no number of observations can change our opinion. If Σ0 >> Σ, we are so uncertain about our prior guess that we take μn = μ̂n, using only the samples to estimate μ.

Generalization: As long as the ratio Σ/Σ0 is not infinite, after observing a sufficiently large number of samples the prior information μ0 and Σ0 becomes unimportant, and μn converges to the sample mean μ̂n.
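Finally, a sketch of the convergence argument above, reusing Sigma_0, Sigma_Sample, Mean_0 and Mean_sample_2 from the earlier code (for illustration, the sample mean is held fixed at [1.5; 1.5] while n grows, which is hypothetical; with real data the sample mean would change with every new point):

% Sketch: in Eq-3.2, mu_n -> sample mean and Sigma_n -> 0 as n grows
for n = [3 4 10 100 1000]
    G    = inv(Sigma_0 + Sigma_Sample/n);           % shared factor in Eq-3.2
    mu_n = Sigma_0*G*Mean_sample_2 + (Sigma_Sample*G*Mean_0)/n;
    S_n  = Sigma_0*G*(Sigma_Sample/n);
    fprintf('n = %4d: mu_n = [%.3f %.3f], trace(Sigma_n) = %.4f\n', ...
            n, mu_n(1), mu_n(2), trace(S_n));
end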