CS 195-5: Machine Learning Problem Set 1


Douglas Lanman
dlanman@brown.edu
7 September

1 Regression

Problem 1

Show that the prediction errors y − f(x; ŵ) are necessarily uncorrelated with any linear function of the training inputs. That is, show that for any a ∈ R^{d+1}, σ̂(e, aᵀx) = 0, where e_i = y_i − ŵᵀx_i is the prediction error for the i-th training example.

First, recall that the least squares estimate of the linear regression parameters is given by

\hat{w} = \arg\min_w \sum_{i=1}^n \Big( y_i - w_0 - \sum_{j=1}^d w_j x_j^{(i)} \Big)^2 = \arg\min_w \sum_{i=1}^n (y_i - w^T x_i)^2 = \arg\min_w \sum_{i=1}^n e_i^2,    (1)

where we have augmented the inputs x ∈ R^d by adding 1 as the zeroth dimension such that x_0 = 1. Recall that the minimum (or maximum) of the argument of Equation 1 will be achieved where the partial derivative with respect to every regression parameter w_i equals zero. Differentiating with respect to w_0 and equating with zero gives

\frac{\partial}{\partial w_0} \sum_{i=1}^n \Big( y_i - w_0 - \sum_{j=1}^d w_j x_j^{(i)} \Big)^2 = -2 \sum_{i=1}^n \Big( y_i - w_0 - \sum_{j=1}^d w_j x_j^{(i)} \Big) = 0 \;\Rightarrow\; \sum_{i=1}^n e_i = 0 \;\Rightarrow\; \bar{e} = 0.    (2)

In other words, a necessary condition for ŵ is that the prediction errors have zero mean. Similarly, differentiating with respect to any w_j (for j ∈ {1, ..., d}) and equating with zero gives

\frac{\partial}{\partial w_j} \sum_{i=1}^n \Big( y_i - w_0 - \sum_{k=1}^d w_k x_k^{(i)} \Big)^2 = -2 \sum_{i=1}^n \Big( y_i - w_0 - \sum_{k=1}^d w_k x_k^{(i)} \Big) x_j^{(i)} = 0 \;\Rightarrow\; \sum_{i=1}^n e_i x_j^{(i)} = 0.    (3)

As in class, this result implies that the prediction errors are uncorrelated with the training data. We now turn our attention to proving the claim that the prediction errors are necessarily uncorrelated with any linear function of the training inputs. Recall that the correlation is written

\hat{\sigma}(e, a^T x) = \frac{1}{n} \sum_{i=1}^n (e_i - \bar{e})\big( a^T x_i - \overline{a^T x} \big) = \frac{1}{n} \sum_{i=1}^n e_i \big( a^T x_i - \overline{a^T x} \big),    (4)

where we have applied Equation 2 to conclude that ē = 0. Now let us examine the term \overline{a^T x}:

\overline{a^T x} = \frac{1}{n} \sum_{i=1}^n a^T x_i = a^T \Big( \frac{1}{n} \sum_{i=1}^n x_i \Big) = a^T \bar{x}.

Note that we are allowed to bring aᵀ out of the summation by linearity. Also note that aᵀx̄ is a scalar quantity and, as a result, we can write Equation 4 as

\hat{\sigma}(e, a^T x) = \frac{1}{n} \Big( a^T \sum_{i=1}^n e_i x_i - a^T \bar{x} \sum_{i=1}^n e_i \Big) = 0,

where, by Equations 2 and 3, we know that both summations are equal to zero. As a result, we have proven the desired result. (QED)

\hat{\sigma}(e, a^T x) = 0
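As a quick numerical sanity check (not part of the original solution), the following minimal Matlab sketch fits a least-squares model to synthetic placeholder data and verifies that the empirical covariance between the residuals and an arbitrary linear function aᵀx of the augmented inputs vanishes up to round-off.

    % Sanity check for Problem 1: residuals are uncorrelated with a'*x.
    % The data and the vector a below are synthetic placeholders.
    n = 100; d = 3;
    X = [ones(n,1), randn(n,d)];        % augmented design matrix (x_0 = 1)
    y = randn(n,1);
    w_hat = (X'*X) \ (X'*y);            % least squares estimate
    e = y - X*w_hat;                    % prediction errors e_i
    a = randn(d+1,1);                   % arbitrary a in R^(d+1)
    z = X*a;                            % a'*x_i for every training example
    cov_hat = mean((e - mean(e)).*(z - mean(z)));
    disp(cov_hat)                       % ~0 up to numerical precision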

Problem 2

Suppose that the data in a regression problem are scaled by multiplying the j-th dimension of the input by a non-zero number c_j. Let x̃ = [1, c_1 x_1, ..., c_d x_d]ᵀ denote a single scaled data point and let X̃ represent the corresponding design matrix. Similarly, let ŵ be the maximum likelihood (ML) estimate of the regression parameters from the unscaled X, and let w̃ be the solution obtained from the scaled X̃. Show that scaling does not change optimality, in the sense that ŵᵀx = w̃ᵀx̃.

First, note that scaling can be represented as a linear operator C composed of the scaling factors along its main diagonal:

C = diag(1, c_1, ..., c_d).

Using the scaling operator C, we can express the scaled inputs x̃ and design matrix X̃ as functions of x and X, respectively:

\tilde{x} = C x, \qquad \tilde{X} = X C.    (5)

As was demonstrated in class, under the Gaussian noise model the ML estimate of the regression parameters is given by ŵ = (XᵀX)⁻¹Xᵀy. Using the expressions in Equation 5, we find

\tilde{w} = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T y = \big( (XC)^T XC \big)^{-1} (XC)^T y = \big( C^T X^T X C \big)^{-1} C^T X^T y.

At this point, we can apply the following matrix identity: if the individual inverses A⁻¹ and B⁻¹ exist, then (AB)⁻¹ = B⁻¹A⁻¹. Since C is a real, symmetric square matrix with non-zero diagonal entries, C⁻¹ must exist. Similarly, provided XᵀX is invertible (as required for the least squares solution), (XᵀXC)⁻¹ must also exist. As a result, we can apply the matrix identity to the previous expression:

\tilde{w} = \big( C^T (X^T X C) \big)^{-1} C^T X^T y = (X^T X C)^{-1} (C^T)^{-1} C^T X^T y = C^{-1} (X^T X)^{-1} X^T y = C^{-1} \hat{w}.    (6)

To prove that the scaled solution is optimal, we apply Equations 5 and 6 as follows:

\tilde{w}^T \tilde{x} = (C^{-1} \hat{w})^T C x = \hat{w}^T (C^{-1})^T C x.

Recall from [4] that (A⁻¹)ᵀ = (Aᵀ)⁻¹. As a result, (C⁻¹)ᵀC = (Cᵀ)⁻¹C. Since C is a real, symmetric matrix, C = Cᵀ and, as a result, (C⁻¹)ᵀC = I. In conclusion, we have proven the desired result. (QED)

\hat{w}^T x = \tilde{w}^T \tilde{x}
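A minimal Matlab sketch (with synthetic placeholder data and arbitrary scale factors) illustrating the result: scaling the columns of the design matrix by C changes the estimate to C⁻¹ŵ but leaves the fitted values unchanged.

    % Sketch for Problem 2: column scaling does not change the fit.
    % Data and scale factors are synthetic placeholders.
    n = 50; d = 2;
    X = [ones(n,1), randn(n,d)];
    y = randn(n,1);
    C = diag([1; 2.5; -0.3]);            % non-zero scale factors (c_0 = 1)
    Xs = X*C;                            % scaled design matrix
    w_hat   = (X'*X)   \ (X'*y);
    w_tilde = (Xs'*Xs) \ (Xs'*y);
    disp(norm(w_tilde - C\w_hat))        % ~0: w_tilde = C^(-1)*w_hat
    disp(norm(Xs*w_tilde - X*w_hat))     % ~0: identical predictions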

Problem 3

Derive the maximum likelihood (ML) estimate of σ² under the Gaussian noise model.

Recall that the likelihood P of the noise variance σ², given the observations X = [x_1, ..., x_N]ᵀ and Y = [y_1, ..., y_N]ᵀ, is defined to be P(Y; w, σ) ≜ p(Y | X, w, σ). Also recall that, under the Gaussian noise model, the label y is a random variable with the following distribution:

y = f(x; w) + ν, \qquad ν \sim N(ν; 0, σ^2),

p(y | x, w, σ) = N(y; f(x; w), σ^2) = \frac{1}{σ\sqrt{2π}} \exp\Big( -\frac{(y - f(x; w))^2}{2σ^2} \Big).

Assuming that the observations are independent and identically distributed (i.i.d.), the likelihood can be expressed as the following product

P(Y; w, σ) = \prod_{i=1}^N p(y_i | x_i, w, σ) = \prod_{i=1}^N \frac{1}{σ\sqrt{2π}} \exp\Big( -\frac{(y_i - f(x_i; w))^2}{2σ^2} \Big),

with the corresponding ML estimator for σ given by

\hat{σ}_{ML} = \arg\max_σ \prod_{i=1}^N \frac{1}{σ\sqrt{2π}} \exp\Big( -\frac{(y_i - f(x_i; w))^2}{2σ^2} \Big).

As was done in class, we consider the log-likelihood l, which converts this product into a summation as follows:

l(Y; w, σ) ≜ \log P(Y; w, σ) = -\frac{1}{2σ^2} \sum_{i=1}^N (y_i - f(x_i; w))^2 - N \log(σ\sqrt{2π}).    (7)

Since the logarithm is a monotonically-increasing function, the ML estimator becomes

\hat{σ}_{ML} = \arg\max_σ \Big\{ -\frac{1}{2σ^2} \sum_{i=1}^N (y_i - f(x_i; w))^2 - N \log(σ\sqrt{2π}) \Big\}.

The maximum (or minimum) of this function will necessarily be obtained where the derivative with respect to σ equals zero:

\frac{\partial}{\partial σ} \Big\{ -\frac{1}{2σ^2} \sum_{i=1}^N (y_i - f(x_i; w))^2 - N \log(σ\sqrt{2π}) \Big\} = \frac{1}{σ^3} \sum_{i=1}^N (y_i - f(x_i; w))^2 - \frac{N}{σ} = 0.

In conclusion, the ML estimate of σ² is given by the following expression:

\hat{σ}^2_{ML} = \frac{1}{N} \sum_{i=1}^N (y_i - f(x_i; w))^2.    (8)

Note that this expression maximizes the likelihood of the observations, assuming w is known. If this is not the case, then the ML estimate ŵ_ML should be substituted for w in Equation 8. As was shown in class, the ML estimate ŵ_ML is independent of σ and, as a result, can be estimated independently. In other words, if the ML estimates of both w and σ² are required, then the following expressions should be used:

\hat{w}_{ML} = \arg\min_w \sum_{i=1}^N (y_i - f(x_i; w))^2,

\hat{σ}^2_{ML} = \frac{1}{N} \sum_{i=1}^N (y_i - f(x_i; \hat{w}_{ML}))^2.
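In code, the ML estimate of σ² is simply the mean squared residual of the fitted model. A minimal Matlab sketch (synthetic data; f(x; w) is taken to be linear for illustration):

    % Sketch for Problem 3: ML estimate of sigma^2 under Gaussian noise.
    % Synthetic data; the true parameters below are placeholders.
    n = 200;
    X = [ones(n,1), randn(n,1)];
    y = X*[1; 2] + 0.5*randn(n,1);       % true sigma = 0.5
    w_ml = (X'*X) \ (X'*y);              % ML (least squares) estimate of w
    sigma2_ml = mean((y - X*w_ml).^2);   % Equation 8: (1/N) * sum of squared residuals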

Problem 4

Part 1: Consider the noise model y = f(x; w) + ν in which ν is drawn from the following distribution:

p(ν) = C \exp(-ν^4), \qquad \text{for } C = \Big( \int_{-\infty}^{\infty} \exp(-x^4)\, dx \Big)^{-1}.

Derive the conditions on the maximum likelihood estimate ŵ_ML. What is the corresponding loss function? Compare this loss function with squared loss on the interval [−3, 3]. How do you expect these differences to affect regression?

As in Problem 3, let's begin by defining the distribution of the label y, given the input x:

p(y | x, w) = C \exp\big( -(y - f(x; w))^4 \big).

Once again, we assume that the observations are i.i.d. such that the likelihood P is given by

P(Y; w) = \prod_{i=1}^N p(y_i | x_i, w) = \prod_{i=1}^N C \exp\big( -(y_i - f(x_i; w))^4 \big).

The log-likelihood l is then given by

l(Y; w) = \log P(Y; w) = N \log C - \sum_{i=1}^N (y_i - f(x_i; w))^4,

with the corresponding ML estimate for w given by

\hat{w}_{ML} = \arg\max_w l(Y; w) = \arg\min_w \sum_{i=1}^N (y_i - f(x_i; w))^4 = \arg\min_w \sum_{i=1}^N L_4(y_i, f(x_i; w)),

where L_4 is the quadrupled loss function defined as follows:

L_4(y, ŷ) = (y - ŷ)^4.

The squared loss L(y, ŷ) = (y − ŷ)² and the quadrupled loss were plotted (as a function of y − ŷ) in Figure 1(a) using the Matlab script prob4.m. From this plot it is apparent that L_4 assigns a greater penalty to outliers than L. That is, as the prediction error y − ŷ increases, regression with the quadrupled loss is biased towards outliers more strongly than regression with the squared loss.

Part 2: Now consider the data set X = [ , , ]ᵀ and Y = [ , , ]ᵀ. Find the numerical solution for ŵ_ML using both linear least squares (i.e., squared loss) and quadrupled loss and report the empirical losses. Plot the corresponding functions and explain how what is seen results from the differences in loss functions.

Using prob4.m, the optimal values of (w_0, w_1) were determined using the provided testws.m function for both the squared and quadrupled losses. The results are tabulated below.

    Loss Function                   ŵ_0    ŵ_1    Empirical Loss
    Squared loss (LSQ): L(y, ŷ)     .33    .      .
    Quadrupled loss: L_4(y, ŷ)      .      .      .3

[Figure 1: ML estimation using exhaustive search under varying loss functions. (a) Comparison of loss functions. (b) Comparison of ML-estimated models. (c) Squared loss surface. (d) Quadrupled loss surface.]

Note that the empirical losses reported for both noise models, L_N and L_N^4, are given by the same average sum of squares:

L_N(w) = L_N^4(w) = \frac{1}{N} \sum_{i=1}^N (y_i - f(x_i; w))^2.    (9)

Using the values in the table, the corresponding regression models were plotted in Figure 1(b), with the associated loss surfaces shown in Figures 1(c) and 1(d). Note that the two noise models resulted in lines with identical slopes (i.e., equal values of w_1); however, the y-intercepts (given by w_0) differed. This is consistent with our previous observation that the quadrupled-loss model is biased towards outliers. For this example, the estimated line moves towards the second data point, since this point can be viewed as an outlier.
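For reference, the exhaustive search over (w_0, w_1) can be sketched in Matlab as follows; the three data points and the grid ranges are placeholders, not the assignment's values (the provided testws.m was used for the actual results).

    % Sketch for Problem 4: grid search minimizing squared vs. quadrupled loss.
    % The data points and grid ranges below are placeholders.
    x = [0; 1; 2];  y = [0; 2; 1];
    [W0, W1] = meshgrid(-2:0.01:2, -2:0.01:2);
    L2 = zeros(size(W0));  L4 = zeros(size(W0));
    for i = 1:numel(x)
        e = y(i) - (W0 + W1*x(i));       % prediction error at every grid point
        L2 = L2 + e.^2;                  % squared loss surface
        L4 = L4 + e.^4;                  % quadrupled loss surface
    end
    [~, k2] = min(L2(:));  [~, k4] = min(L4(:));
    w_sq   = [W0(k2), W1(k2)];           % squared-loss minimizer
    w_quad = [W0(k4), W1(k4)];           % quadrupled-loss minimizer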

Problem 5

Apply linear and quadratic polynomial regression to the data in meteodata.mat using linear least squares. Use k-fold cross-validation to select the best model and plot the results. Report the empirical loss and log-likelihood. Based on the plot, comment on the Gaussian noise model.

Recall from class that we can solve for the least squares polynomial regression coefficients ŵ by using the extended design matrix X̃ such that

\hat{w} = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T y, \qquad \tilde{X} = \begin{bmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^d \\ 1 & x_2 & x_2^2 & \cdots & x_2^d \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_N & x_N^2 & \cdots & x_N^d \end{bmatrix},

where d is the degree of the polynomial and {x_1, ..., x_N} and y are the observed points and their associated labels, respectively. To prevent numerical round-off errors, this method was applied to the column-normalized design matrix X̃ using degexpand.m within the main script prob5.m (as discussed in Problem 2). The resulting linear and quadratic polynomials are shown in Figure 2(a). The fitting parameters obtained using all data points and up to a fourth-order polynomial are tabulated below.

    Polynomial Degree   Empirical Loss   Log-likelihood   Cross-Validation Score
    Linear (d = 1)      .57              -59.5            .3
    Quadratic (d = 2)   .93              -5.              .93
    Cubic (d = 3)       .93              -5.              .97
    Quartic (d = 4)     .95              -55.             .95

Note that the empirical loss L_N is defined to be the average sum of squared errors as given by Equation 9. The log-likelihood of the data (under a Gaussian model) was derived in class and is given by Equation 7 as

l(Y; w, σ) = -\frac{1}{2σ^2} \sum_{i=1}^N (y_i - f(x_i; w))^2 - N \log(σ\sqrt{2π}),

where σ ≜ σ̂_ML is the maximum likelihood estimate of σ under a Gaussian noise model (as given by Equation 8 in Problem 3).

At this point, we turn our attention to the model-order selection task (i.e., deciding what degree of polynomial best represents the data). As discussed in class, we use k-fold cross-validation to select the best model. First, we partition the data into k roughly equal parts (see prob5.m). Next, we perform k sequential trials where we train on all but the i-th fold of the data and then measure the empirical error on the remaining samples. In general, we formulate the k-fold cross-validation score as

\hat{L}_k = \frac{1}{N} \sum_{i=1}^k \sum_{j \in \text{fold } i} \big( y_j - f(x_j; \hat{w}_{-i}) \big)^2,

where ŵ_{-i} is fit to all samples except those in the i-th fold. The resulting cross-validation scores are tabulated above (for up to a fourth-order polynomial). Since the lowest cross-validation score is achieved for the quadratic polynomial, we select this as the best model.
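The cross-validation procedure can be sketched in Matlab as follows; the fold count K, the polynomial degree, and the variable names x and y are assumptions (the actual implementation uses degexpand.m inside prob5.m).

    % Sketch for Problem 5: K-fold cross-validation for polynomial regression.
    % Assumes column vectors x (inputs) and y (labels) are in the workspace.
    K = 10;  d = 2;                       % fold count and degree are placeholders
    N = numel(x);
    P = repmat(x(:), 1, d+1) .^ repmat(0:d, N, 1);   % N x (d+1) polynomial design matrix
    folds = mod(randperm(N), K) + 1;      % random assignment of samples to folds
    cv_err = 0;
    for k = 1:K
        tr = (folds ~= k);  te = ~tr;
        w = (P(tr,:)'*P(tr,:)) \ (P(tr,:)'*y(tr));   % fit on all folds but k
        cv_err = cv_err + sum((y(te) - P(te,:)*w).^2);
    end
    cv_score = cv_err / N;                % average held-out squared error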

[Figure 2: Comparison of linear and quadratic polynomials estimated using linear least squares. (a) Least squares polynomial regression results. (b) Quadratic polynomial (selected by cross-validation). Both panels plot reported temperature (Kelvin) against the level of seismic activity (Richters).]

In conclusion, we find that the quadratic polynomial has the lowest cross-validation score. However, as shown in Figure 2(b), it is immediately apparent that the Gaussian noise model does not accurately represent the underlying distribution; more specifically, if the underlying distribution were Gaussian, we would expect half of the data points to lie above the model prediction (and the other half below). This is clearly not the case for this example, motivating the alternate noise model we analyze in Problem 6.

Problem 6

Part 1: Consider the noise model y = f(x; w) + ν in which ν is drawn from p(ν) as follows:

p(ν) = \begin{cases} e^{-ν} & \text{if } ν > 0, \\ 0 & \text{otherwise.} \end{cases}

Perform k-fold cross-validation for linear and quadratic regression under this noise model, using exhaustive numerical search similar to that used in Problem 4. Plot the selected model and report the empirical loss and the log-likelihood under the estimated exponential noise model.

As in Problems 3 and 4, let's begin by defining the distribution of the label y, given the input x:

p(y | x, w) = \begin{cases} \exp\big( -(y - f(x; w)) \big) & \text{if } y > f(x; w), \\ 0 & \text{otherwise.} \end{cases}

Once again, we assume that the observations are i.i.d. such that the likelihood P is given as follows:

P(Y; w) = \prod_{i=1}^N p(y_i | x_i, w) = \begin{cases} \prod_{i=1}^N \exp\big( f(x_i; w) - y_i \big) & \text{if } y_i > f(x_i; w) \ \forall i, \\ 0 & \text{otherwise.} \end{cases}    (10)

The log-likelihood l is then given by

l(Y; w) = \log P(Y; w) = \begin{cases} \sum_{i=1}^N \big( f(x_i; w) - y_i \big) & \text{if } y_i > f(x_i; w) \ \forall i, \\ -\infty & \text{otherwise,} \end{cases}    (11)

since the logarithm is a monotonic function and lim_{x→0⁺} log x = −∞. The corresponding ML estimate for w is given by the following expression:

\hat{w}_{ML} = \arg\max_w l(Y; w) = \arg\min_w \begin{cases} \sum_{i=1}^N \big( y_i - f(x_i; w) \big) & \text{if } y_i > f(x_i; w) \ \forall i, \\ \infty & \text{otherwise.} \end{cases}    (12)

Note that Equations 11 and 12 prevent any prediction f(x_i; w) from being above the corresponding label y_i. As a result, the exponential noise distribution effectively leads to a model corresponding to the lower envelope of the training data. Using the maximum likelihood formulation in Equation 12, we can solve for the optimal regression parameters using an exhaustive search (similar to what was done in Problem 4). This approach was applied to all the data points in prob6.m. The resulting best-fit linear and quadratic polynomial models are shown in Figure 3(a). The fitting parameters obtained using all the data points and up to a second-order polynomial are tabulated below.

    Polynomial Degree   Empirical Loss   Log-likelihood   Cross-Validation Score
    Linear (d = 1)      .37              -7.5             .37
    Quadratic (d = 2)   .9               -359.            .9

Note that the empirical loss L_N was calculated using Equation 9. The log-likelihood of the data (under the exponential noise model) was determined using Equation 11. Model selection was performed using k-fold cross-validation as in Problem 5. The resulting scores are tabulated above. Since the lowest cross-validation score was achieved for the quadratic polynomial, we select this as the best model and plot the result in Figure 3(b).
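A Matlab sketch of the exhaustive search implied by Equation 12 is given below for a linear model; the grid ranges are placeholders, and the vectors x and y are assumed to be in the workspace.

    % Sketch for Problem 6: exhaustive search under the exponential noise model.
    % Minimizes sum(y_i - f(x_i;w)) subject to f(x_i;w) < y_i for all i.
    [W0, W1] = meshgrid(-5:0.01:5, -5:0.01:5);   % placeholder grid
    obj = zeros(size(W0));
    for i = 1:numel(x)
        r = y(i) - (W0 + W1*x(i));       % residual y_i - f(x_i; w)
        r(r <= 0) = Inf;                 % a prediction above its label has zero likelihood
        obj = obj + r;
    end
    [~, k] = min(obj(:));
    w_exp = [W0(k), W1(k)];              % ML estimate under the exponential noise model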

[Figure 3: ML estimation using exhaustive search under an exponential noise model. (a) Polynomial regression results. (b) Quadratic polynomial (selected by cross-validation). Both panels plot reported temperature (Kelvin) against the level of seismic activity (Richters).]

Comparing the model in Figure 3(b) with that in Figure 2(b), we conclude that the quadratic polynomial under the exponential noise model better approximates the underlying distribution; more specifically, the quadratic polynomial (under an exponential noise model) achieves a log-likelihood of -359., whereas it only achieves a log-likelihood of -5. under the Gaussian noise model. It is also important to note that the Gaussian noise model leads to a lower squared loss (i.e., empirical loss); however, this is by construction of the least squares estimator used in Problem 5. This highlights the important observation that a lower empirical loss does not necessarily indicate a better model; this holds only when the chosen noise model appropriately reflects the actual distribution.

Part 2: Now evaluate the polynomials selected under the Gaussian and exponential noise models for the 2005 data. Report which performs better in terms of likelihood and empirical loss.

In both Problems 5 and 6.1, the quadratic polynomial was selected as the best model by k-fold cross-validation. In order to gauge the generalization capabilities of these models, the model parameters were used to predict the 2005 samples. The results for each noise model are tabulated below.

    Noise Model           Empirical Loss   Log-likelihood
    Gaussian (d = 2)      .3               -5.
    Exponential (d = 2)   .                -35.5

In conclusion, we find that the Gaussian model achieves a lower empirical loss on the 2005 samples. This is expected, since the Gaussian model is equivalent to the least squares estimator, which minimizes empirical loss. The exponential model, however, achieves a significantly higher log-likelihood, indicating that it models the underlying data more effectively than the Gaussian noise model. As a result, we reiterate the point made previously: low empirical loss can be achieved even if the noise model does not accurately represent the underlying noise distribution.

2 Multivariate Gaussian Distributions

Problem 7

Recall that the probability density function (pdf) of a Gaussian distribution in R^d is given by

p(x) = \frac{1}{(2π)^{d/2} |Σ|^{1/2}} \exp\Big( -\frac{1}{2} (x - µ)^T Σ^{-1} (x - µ) \Big).

Show that a contour corresponding to a fixed value of the pdf is an ellipse in the 2D x_1, x_2 space.

An iso-contour of the pdf along p(x) = p_0 is given by

\frac{1}{(2π)^{d/2} |Σ|^{1/2}} \exp\Big( -\frac{1}{2} (x - µ)^T Σ^{-1} (x - µ) \Big) = p_0.

Taking the logarithm of each side and rearranging terms gives

(x - µ)^T Σ^{-1} (x - µ) = -2 \log\big( p_0 (2π)^{d/2} |Σ|^{1/2} \big) = C_1,

where C_1 ∈ R is a constant. At this point it is most convenient to write this expression explicitly as a function of x_1 and x_2, with x = [x_1, x_2]^T and µ = [\bar{x}_1, \bar{x}_2]^T:

\begin{pmatrix} x_1 - \bar{x}_1 & x_2 - \bar{x}_2 \end{pmatrix} \begin{pmatrix} σ_1^2 & σ_{12} \\ σ_{12} & σ_2^2 \end{pmatrix}^{-1} \begin{pmatrix} x_1 - \bar{x}_1 \\ x_2 - \bar{x}_2 \end{pmatrix} = C_1

\begin{pmatrix} x_1 - \bar{x}_1 & x_2 - \bar{x}_2 \end{pmatrix} \begin{pmatrix} σ_2^2 & -σ_{12} \\ -σ_{12} & σ_1^2 \end{pmatrix} \begin{pmatrix} x_1 - \bar{x}_1 \\ x_2 - \bar{x}_2 \end{pmatrix} = (σ_1^2 σ_2^2 - σ_{12}^2) C_1 = C_2.

Multiplying the terms in this expression, we obtain

σ_2^2 (x_1 - \bar{x}_1)^2 - 2 σ_{12} (x_1 - \bar{x}_1)(x_2 - \bar{x}_2) + σ_1^2 (x_2 - \bar{x}_2)^2 = C_2.

Simplifying, we obtain a solution for the 2D iso-contours given by

\frac{(x_1 - \bar{x}_1)^2}{σ_1^2} - \frac{2 σ_{12}}{σ_1^2 σ_2^2} (x_1 - \bar{x}_1)(x_2 - \bar{x}_2) + \frac{(x_2 - \bar{x}_2)^2}{σ_2^2} = C_3, \qquad C_3 ≜ -2\, \frac{σ_1^2 σ_2^2 - σ_{12}^2}{σ_1^2 σ_2^2} \log\big( p_0 (2π)^{d/2} |Σ|^{1/2} \big).    (13)

Recall from [5] that a general quadratic curve can be written as

a x_1^2 + b x_1 x_2 + c x_2^2 + d x_1 + f x_2 + g = 0,

with an associated discriminant given by b² − 4ac. Also recall that if the discriminant satisfies b² − 4ac < 0, then the quadratic curve represents either an ellipse, a circle, or a point (with the latter two representing certain degenerate configurations of an ellipse) [6]. Since Equation 13 has the form of a quadratic curve, it has a discriminant given by

b^2 - 4ac = \Big( \frac{2 σ_{12}}{σ_1^2 σ_2^2} \Big)^2 - \frac{4}{σ_1^2 σ_2^2} = \frac{4}{σ_1^2 σ_2^2} \Big( \frac{σ_{12}^2}{σ_1^2 σ_2^2} - 1 \Big) \overset{?}{<} 0.

Recall that the variances must be non-negative, so σ_1^2, σ_2^2, and σ_1^2 σ_2^2 are all non-negative. As a result, 4/(σ_1^2 σ_2^2) is positive and we only need to prove that σ_{12}^2 < σ_1^2 σ_2^2 in order to show that Equation 13 represents an ellipse. Recall from class that the definition of the cross-correlation coefficient ρ is

ρ = \frac{σ_{12}}{σ_1 σ_2} \;\Rightarrow\; σ_{12}^2 = ρ^2 σ_1^2 σ_2^2,

where |ρ| < 1 for a non-degenerate covariance matrix. As a result, ρ² < 1, which implies σ_{12}^2 < σ_1^2 σ_2^2, and the discriminant is that of an ellipse. In conclusion, Equation 13 has the associated discriminant

b^2 - 4ac = \frac{4}{σ_1^2 σ_2^2} \Big( \frac{σ_{12}^2}{σ_1^2 σ_2^2} - 1 \Big) < 0,

which by [6] defines an ellipse in the 2D x_1, x_2 space. (QED)
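As a small numerical illustration (the covariance values are arbitrary but positive definite), the sign of the discriminant of Equation 13 can be checked in Matlab:

    % Numerical check for Problem 7: the iso-contour discriminant is negative.
    % The covariance entries below are arbitrary placeholders.
    s1 = 2.0;  s2 = 1.0;  s12 = 0.8;     % sigma_1^2, sigma_2^2, sigma_12
    a = 1/s1;  b = -2*s12/(s1*s2);  c = 1/s2;
    disc = b^2 - 4*a*c;                  % < 0, so the contour is an ellipse
    disp(disc)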

Problem 8

Suppose we have a sample x_1, ..., x_n drawn from N(x; µ, Σ). Show that the ML estimate of the mean vector µ is

\hat{µ}_{ML} = \frac{1}{n} \sum_{i=1}^n x_i

and the ML estimate of the covariance matrix Σ is

\hat{Σ}_{ML} = \frac{1}{n} \sum_{i=1}^n (x_i - \hat{µ}_{ML})(x_i - \hat{µ}_{ML})^T.

Qualitatively, we will repeat the derivation used for the univariate Gaussian in Problem 3. Let's begin by defining the multivariate Gaussian distribution in R^d:

p(x; µ, Σ) = \frac{1}{(2π)^{d/2} |Σ|^{1/2}} \exp\Big( -\frac{1}{2} (x - µ)^T Σ^{-1} (x - µ) \Big).

Once again, we assume that the samples are i.i.d. such that the likelihood P is given by

P(X; µ, Σ) = \prod_{i=1}^n p(x_i | µ, Σ).

The log-likelihood l is then given by

l(X; µ, Σ) = \log P(X; µ, Σ) = \sum_{i=1}^n \log p(x_i | µ, Σ) = -n \log\big( (2π)^{d/2} |Σ|^{1/2} \big) - \frac{1}{2} \sum_{i=1}^n (x_i - µ)^T Σ^{-1} (x_i - µ),    (14)

with the corresponding ML estimate for µ given by

\hat{µ}_{ML} = \arg\max_µ l(X; µ, Σ) = \arg\min_µ \sum_{i=1}^n (x_i - µ)^T Σ^{-1} (x_i - µ),

since the first term in Equation 14 is independent of µ. The minimum (or maximum) of this function will occur where the derivative with respect to µ equals zero. Let's proceed by making the substitution Σ⁻¹ ≜ A. Equating the first partial derivative with zero, we find

\frac{\partial}{\partial µ} \sum_{i=1}^n (x_i - µ)^T A (x_i - µ) = \sum_{i=1}^n \frac{\partial}{\partial µ} \big[ (x_i - µ)^T A (x_i - µ) \big] = \sum_{i=1}^n \frac{\partial}{\partial µ} \big[ x_i^T A x_i - x_i^T A µ - µ^T A x_i + µ^T A µ \big] = 0.

Note that the first term is independent of µ and can be eliminated:

\sum_{i=1}^n \Big\{ -\frac{\partial (x_i^T A µ)}{\partial µ} - \frac{\partial (µ^T A x_i)}{\partial µ} + \frac{\partial (µ^T A µ)}{\partial µ} \Big\} = 0.

We can now apply the following identities for the derivatives of scalar and matrix/vector forms [4]:

\frac{\partial (x^T a)}{\partial x} = \frac{\partial (a^T x)}{\partial x} = a, \qquad \frac{\partial (x^T A x)}{\partial x} = (A + A^T) x.

Applying these expressions to the previous equation gives

\sum_{i=1}^n \big\{ -A^T x_i - A x_i + (A + A^T) µ \big\} = 0.

Recall that Σ is a real, symmetric matrix such that Σᵀ = Σ. In addition, we have Aᵀ = A since A = Σ⁻¹ is also a real, symmetric matrix. Applying these identities to the previous equation, we find

2 A \sum_{i=1}^n (x_i - µ) = 0 \;\Rightarrow\; n µ = \sum_{i=1}^n x_i.

Dividing by n, we obtain the desired result:

\hat{µ}_{ML} = \frac{1}{n} \sum_{i=1}^n x_i.

Similar to the derivation of µ̂_ML, the ML estimate Σ̂_ML is given by

\hat{Σ}_{ML} = \arg\max_Σ l(X; µ, Σ) = \arg\min_Σ \Big\{ \frac{1}{2} \sum_{i=1}^n (x_i - µ)^T Σ^{-1} (x_i - µ) + n \log\big( (2π)^{d/2} |Σ|^{1/2} \big) \Big\} = \arg\min_Σ \Big\{ \frac{1}{2} \sum_{i=1}^n (x_i - µ)^T Σ^{-1} (x_i - µ) + \frac{n}{2} \log |Σ| \Big\}.    (15)

Once again, we make the substitution Σ⁻¹ ≜ A. As a result, the minimum (or maximum) of Equation 15 will occur where the derivative with respect to A equals zero:

\frac{\partial}{\partial A} \Big\{ \frac{1}{2} \sum_{i=1}^n (x_i - µ)^T A (x_i - µ) - \frac{n}{2} \log |A| \Big\} = 0,

where we have substituted |A| = 1/|Σ| so that log|Σ| = −log|A|. Rearranging, this requires

\frac{n}{2} \frac{\partial}{\partial A} \log |A| = \frac{1}{2} \sum_{i=1}^n \frac{\partial}{\partial A} \big[ (x_i - µ)^T A (x_i - µ) \big].

We can now apply the following identities for the derivatives of scalar and matrix/vector forms [4]:

\frac{\partial}{\partial A} \big[ (x - µ)^T A (x - µ) \big] = (x - µ)(x - µ)^T, \qquad \frac{\partial}{\partial A} \log |A| = A^{-1} = Σ.

Applying these identities to the previous equation, we find the desired result. (QED)

\hat{Σ}_{ML} = \frac{1}{n} \sum_{i=1}^n (x_i - \hat{µ}_{ML})(x_i - \hat{µ}_{ML})^T
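In Matlab, the two ML estimates reduce to the sample mean and the 1/n-normalized scatter matrix; a minimal sketch, assuming the samples are stored as the rows of a matrix X:

    % Sketch for Problem 8: ML estimates of the mean and covariance.
    % Assumes X is an n x d matrix whose rows are the samples.
    [n, d] = size(X);
    mu_ml = mean(X, 1)';                 % d x 1 ML estimate of the mean
    Xc = X - repmat(mu_ml', n, 1);       % centered samples
    Sigma_ml = (Xc' * Xc) / n;           % note the 1/n (not 1/(n-1)) normalization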

3 Linear Discriminant Analysis

Problem 9

This problem examines an extension of linear discriminant analysis (LDA) to multiple classes (assuming equal covariance matrices such that Σ_c = Σ for every class c). Show that the optimal decision rule is based on calculating a set of C linear discriminant functions

δ_c(x) = x^T Σ^{-1} µ_c - \frac{1}{2} µ_c^T Σ^{-1} µ_c

and selecting C^* = \arg\max_c δ_c(x). Assume that the prior probability is uniform for each class such that P_c = p(y = c) = 1/C. Apply this method to the data in apple lda.mat and plot the resulting linear decision boundaries or, alternatively, the decision regions. Report the classification error.

Recall from class and [3] that the Bayes classifier minimizes the conditional risk, for a given class-conditional density p_c(x) = p(x | y = c) and prior probability P_c = p(y = c), such that

C^* = \arg\max_c δ_c(x), \qquad δ_c(x) ≜ \log p_c(x) + \log P_c.

For this problem we can eliminate the class prior term to obtain δ_c(x) = log p_c(x), since the class prior is identical for all classes. In addition, we assume that the class-conditionals are multivariate Gaussians with identical covariance matrices such that

p_c(x) = \frac{1}{(2π)^{d/2} |Σ|^{1/2}} \exp\Big( -\frac{1}{2} (x - µ_c)^T Σ^{-1} (x - µ_c) \Big).

Substituting into the previous expression, we find

δ_c(x) = \log p_c(x) = -\frac{1}{2} (x - µ_c)^T Σ^{-1} (x - µ_c) - \frac{d}{2} \log 2π - \frac{1}{2} \log |Σ| ≡ -\frac{1}{2} x^T Σ^{-1} x + \frac{1}{2} x^T Σ^{-1} µ_c + \frac{1}{2} µ_c^T Σ^{-1} x - \frac{1}{2} µ_c^T Σ^{-1} µ_c ≡ \frac{1}{2} x^T Σ^{-1} µ_c + \frac{1}{2} µ_c^T Σ^{-1} x - \frac{1}{2} µ_c^T Σ^{-1} µ_c,    (16)

where terms that are independent of c have been eliminated. To complete the proof, note that

µ_c^T Σ^{-1} x = \big( µ_c^T Σ^{-1} x \big)^T = x^T (Σ^{-1})^T µ_c = x^T Σ^{-1} µ_c,

since Σ and Σ⁻¹ are symmetric matrices and, as a result, Σ⁻¹ = (Σ⁻¹)ᵀ. Applying this observation to Equation 16, we prove the desired result:

δ_c(x) = x^T Σ^{-1} µ_c - \frac{1}{2} µ_c^T Σ^{-1} µ_c.

Using prob9.m, this method was applied to the data in apple lda.mat. The means and covariance were obtained using the ML estimators derived in Problem 8; for a given class c, the mean was calculated as

\hat{µ}_c = \frac{1}{n_c} \sum_{i=1}^{n_c} x_{ci}.

[Figure 4: Optimal linear decision boundaries/regions for apple lda.mat, showing the three classes in the x_1, x_2 plane.]

Similarly, the covariance matrix Σ was calculated by pooling the centered samples over all classes,

\hat{Σ} = \frac{1}{n} \sum_{c=1}^{C} \sum_{i=1}^{n_c} (x_{ci} - \hat{µ}_c)(x_{ci} - \hat{µ}_c)^T.

The resulting decision boundaries are shown in Figure 4. Note that δ_c(x) is linear in x, so we only obtain linear decision boundaries. The classification error (measured as the percent of incorrect classifications on the training data) was .7%.
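A minimal Matlab sketch of the multi-class LDA rule (assuming an n × d matrix X of inputs and a label vector y with values in {1, ..., C}; these variable names are assumptions, not those of prob9.m):

    % Sketch for Problem 9: multi-class LDA with a pooled covariance estimate.
    % Assumes X (n x d) and y (n x 1, labels in 1..C) are in the workspace.
    C = max(y);  [n, d] = size(X);
    mu = zeros(d, C);  Sigma = zeros(d, d);
    for c = 1:C
        Xc = X(y == c, :);
        mu(:, c) = mean(Xc, 1)';
        Xc0 = Xc - repmat(mu(:,c)', size(Xc,1), 1);
        Sigma = Sigma + Xc0' * Xc0;      % accumulate per-class scatter
    end
    Sigma = Sigma / n;                   % pooled ML covariance estimate
    delta = X * (Sigma \ mu) - 0.5 * repmat(diag(mu' * (Sigma \ mu))', n, 1);
    [~, y_hat] = max(delta, [], 2);      % C* = argmax_c delta_c(x)
    err = mean(y_hat ~= y);              % training classification error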

Problem 10

Assume that the covariances Σ_c are no longer required to be equal. Derive the discriminant function and apply the resulting decision rule to apple lda.mat. Plot the decision boundaries/regions and report the classification error. Compare the performance of the two classifiers.

As in Problem 9, we begin by writing the Bayes classifier

C^* = \arg\max_c δ_c(x), \qquad δ_c(x) = \log p_c(x) + \log P_c ≡ \log p_c(x),

since P_c = 1/C is independent of c. The class-conditionals are multivariate Gaussians given by

p_c(x) = \frac{1}{(2π)^{d/2} |Σ_c|^{1/2}} \exp\Big( -\frac{1}{2} (x - µ_c)^T Σ_c^{-1} (x - µ_c) \Big).

Substituting into the previous expression, we find

δ_c(x) = \log p_c(x) = -\frac{1}{2} (x - µ_c)^T Σ_c^{-1} (x - µ_c) - \frac{d}{2} \log 2π - \frac{1}{2} \log |Σ_c| ≡ -\frac{1}{2} x^T Σ_c^{-1} x + \frac{1}{2} x^T Σ_c^{-1} µ_c + \frac{1}{2} µ_c^T Σ_c^{-1} x - \frac{1}{2} µ_c^T Σ_c^{-1} µ_c - \frac{1}{2} \log |Σ_c|,

where terms that are independent of c have been eliminated.

[Figure 5: Non-linear decision boundaries/regions for apple lda.mat, showing the three classes in the x_1, x_2 plane.]

Once again, we can simplify this expression since µ_c^T Σ_c^{-1} x = x^T Σ_c^{-1} µ_c. In conclusion, the discriminant function is given as follows:

δ_c(x) = -\frac{1}{2} x^T Σ_c^{-1} x + x^T Σ_c^{-1} µ_c - \frac{1}{2} µ_c^T Σ_c^{-1} µ_c - \frac{1}{2} \log |Σ_c|.

Using prob10.m, this discriminant function was applied to the data in apple lda.mat. The means and covariances were obtained using the ML estimators derived in Problem 8. The resulting decision boundaries are shown in Figure 5. The classification error (measured as the percent of incorrect classifications on the training data) was 7%. Note that the decision functions δ_c(x) are no longer linear in x, so we can now obtain non-linear decision boundaries. In fact, the leading-order term -\frac{1}{2} x^T Σ_c^{-1} x is a quadratic form which, together with the lower-order terms, can be used to express any second-degree curve (i.e., a conic section in two dimensions) of the form

a x_1^2 + b x_1 x_2 + c x_2^2 + d x_1 + f x_2 + g = 0

as the decision boundary separating classes. In contrast to the linear decision boundaries in Problem 9, by allowing different covariance matrices for each class we can now achieve a variety of decision boundaries such as lines, parabolas, hyperbolas, and ellipses [5]. As shown in Figure 5, the resulting decision boundaries are curvilinear and, as a result, can achieve a lower classification error due to the increase in their degrees of freedom.
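The corresponding Matlab sketch for the class-specific-covariance (quadratic) discriminant, under the same assumed X and y as in the Problem 9 sketch:

    % Sketch for Problem 10: quadratic discriminant with per-class covariances.
    % Assumes X (n x d) and y (n x 1, labels in 1..C) are in the workspace.
    C = max(y);  [n, d] = size(X);
    delta = zeros(n, C);
    for c = 1:C
        Xc = X(y == c, :);  nc = size(Xc, 1);
        mu_c = mean(Xc, 1)';
        Xc0 = Xc - repmat(mu_c', nc, 1);
        Sigma_c = (Xc0' * Xc0) / nc;     % per-class ML covariance estimate
        D = X - repmat(mu_c', n, 1);     % x - mu_c for every sample
        delta(:, c) = -0.5*sum((D / Sigma_c).*D, 2) - 0.5*log(det(Sigma_c));
    end
    [~, y_hat] = max(delta, [], 2);      % select the class with the largest discriminant
    err = mean(y_hat ~= y);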

Problem 11

An alternative approach to learning quadratic decision boundaries is to map the inputs into an extended polynomial feature space. Implement this method and compare the resulting decision boundaries to those in Problem 10.

As discussed in class, we can utilize the linear regression framework to implement a naïve classifier as follows. First, we form an indicator matrix Y such that

Y_{ic} = \begin{cases} 1 & \text{if } y_i = c, \\ 0 & \text{otherwise,} \end{cases}

where the possible classes are labeled as 1, ..., C. Recall from Problem 10 that any quadratic curve can be written as

a x_1^2 + b x_1 x_2 + c x_2^2 + d x_1 + f x_2 + g = 0,

where {a, b, c, d, f, g} are unknown regression coefficients. As a result, we can define a design matrix X with extended polynomial features, whose j-th row is

X_j = \big[ 1, \; x_1(j), \; x_2(j), \; x_1^2(j), \; x_1(j)\, x_2(j), \; x_2^2(j) \big], \qquad j = 1, \ldots, N,

where x_i(j) denotes the i-th coordinate of the j-th training sample. (Note that inclusion of the cross term x_1 x_2 had little effect on the decision regions.)

[Figure 6: Quadratic decision boundaries/regions using extended polynomial features, showing the three classes in the x_1, x_2 plane.]

[Figure 7: Comparison of classification results in Problems 10 and 11. (a) Decision boundaries using extended LDA. (b) Decision boundaries using extended features.]

Together, X and Y define C independent linear regression problems with least squares solutions given by

\hat{W} = (X^T X)^{-1} X^T Y,

where the regression coefficients for each class correspond to the columns of Ŵ. For a general set of samples, we form the extended design matrix X̃ in the same manner, with the corresponding linear predictions given by

\hat{Y} = \tilde{X} (X^T X)^{-1} X^T Y = \tilde{X} \hat{W},

as shown in class. Finally, we assign a class label Ĉ by choosing the column of Ŷ with the largest value (independently for each row/sample).

Using prob11.m, this classification procedure was applied to the data in apple lda.mat. The resulting decision boundaries are shown in Figure 6. The classification error (measured as the percent of incorrect classifications on the training data) was 9.3%. Since the regression features were extended to allow quadratic decision boundaries, we find that the resulting regions are delineated by curvilinear borders.

While the classification error is lower than in Problem 9, we find that the resulting decision regions possess some odd/undesirable properties. Compare the decision boundaries/regions in Figures 7(a) and 7(b). Using extended linear discriminant analysis (LDA) as in Problem 10, we find that the decision boundaries are quadratic and, assuming underlying Gaussian distributions, seem to make reasonable generalizations. More specifically, examine the decision region for the second class (green) in Figure 7(a). Not surprisingly, it continues on the other side of Class 3. This is reasonable since Class 3 varies in the opposite direction of Class 2. Unfortunately, this generalization behavior is not reflected in the decision regions found in this problem. Surprisingly, we find that Class 3 becomes the most likely class label in the bottom left corner of Figure 7(b). This is inconsistent with the covariance matrix of the underlying distribution (assuming it is truly Gaussian). In conclusion, we find that the results in Problems 10 and 11 illustrate the benefits of generative models over naïve classification schemes (when there is some knowledge of the underlying distributions).
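A minimal Matlab sketch of the indicator-matrix regression classifier with the extended features above (2-D inputs X and labels y in {1, ..., C} are assumed; this is not the code of prob11.m):

    % Sketch for Problem 11: indicator regression on extended polynomial features.
    % Assumes X (n x 2) and y (n x 1, labels in 1..C) are in the workspace.
    [n, ~] = size(X);  C = max(y);
    Phi = [ones(n,1), X(:,1), X(:,2), X(:,1).^2, X(:,1).*X(:,2), X(:,2).^2];
    Y = zeros(n, C);
    Y(sub2ind([n, C], (1:n)', y)) = 1;   % indicator matrix of class labels
    W = (Phi' * Phi) \ (Phi' * Y);       % C independent least-squares fits
    [~, y_hat] = max(Phi * W, [], 2);    % assign the class with the largest prediction
    err = mean(y_hat ~= y);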

References

[1] Christopher M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.

[2] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag, 2006.

[3] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification (Second Edition). Wiley-Interscience, 2001.

[4] Sam Roweis. Matrix identities. http://www.cs.toronto.edu/~roweis/notes/matrixid.pdf.

[5] Eric W. Weisstein. Quadratic curve. http://mathworld.wolfram.com/QuadraticCurve.html.

[6] Eric W. Weisstein. Quadratic curve discriminant. http://mathworld.wolfram.com/QuadraticCurveDiscriminant.html.