CS 195-5: Machine Learning Problem Set 1


Douglas Lanman
7 September

1 Regression

Problem 1

Show that the prediction errors y − f(x; ŵ) are necessarily uncorrelated with any linear function of the training inputs. That is, show that for any a ∈ R^{d+1}, σ̂(e, aᵀx) = 0, where e_i = y_i − ŵᵀx_i is the prediction error for the i-th training example.

First, recall that the least squares estimate of the linear regression parameters is given by

    \hat{w} = \arg\min_w \sum_{i=1}^n \Big( y_i - w_0 - \sum_{j=1}^d w_j x_j^{(i)} \Big)^2
            = \arg\min_w \sum_{i=1}^n (y_i - w^T x_i)^2
            = \arg\min_w \sum_{i=1}^n e_i^2,                                              (1)

where we have augmented the inputs x ∈ R^{d+1} by adding 1 as the zeroth dimension, such that x_0 = 1. The minimum (or maximum) of the argument of Equation 1 is achieved where the partial derivative with respect to every regression parameter equals zero. Differentiating with respect to w_0 and equating with zero gives

    \frac{\partial}{\partial w_0} \sum_{i=1}^n \Big( y_i - w_0 - \sum_{j=1}^d w_j x_j^{(i)} \Big)^2
      = \sum_{i=1}^n 2 \Big( y_i - w_0 - \sum_{j=1}^d w_j x_j^{(i)} \Big)(-1) = 0
      \;\Rightarrow\; \sum_{i=1}^n e_i = 0 \;\Rightarrow\; \bar{e} = 0.                   (2)

In other words, a necessary condition for ŵ is that the prediction errors have zero mean. Similarly, differentiating with respect to any w_j (for j ∈ {1, ..., d}) and equating with zero gives

    \frac{\partial}{\partial w_j} \sum_{i=1}^n \Big( y_i - w_0 - \sum_{k=1}^d w_k x_k^{(i)} \Big)^2
      = \sum_{i=1}^n 2 \Big( y_i - w_0 - \sum_{k=1}^d w_k x_k^{(i)} \Big)\big(-x_j^{(i)}\big) = 0
      \;\Rightarrow\; \sum_{i=1}^n e_i x_i = 0.                                           (3)

As in class, this result implies that the prediction errors are uncorrelated with the training data. We now turn our attention to proving the claim that the prediction errors are necessarily uncorrelated with any linear function of the training inputs. Recall that the correlation is written

    \hat{\sigma}(e, a^T x) = \frac{1}{n} \sum_{i=1}^n (e_i - \bar{e}) \big( a^T x_i - \overline{a^T x} \big)
                           = \frac{1}{n} \sum_{i=1}^n e_i \big( a^T x_i - \overline{a^T x} \big),         (4)

where we have applied Equation 2 to conclude ē = 0. Now let us examine the term \overline{a^T x}:

    \overline{a^T x} = \frac{1}{n} \sum_{i=1}^n a^T x_i = a^T \Big( \frac{1}{n} \sum_{i=1}^n x_i \Big) = a^T \bar{x}.

Note that we are allowed to bring aᵀ out of the summation by linearity. Also note that \overline{a^T x} is a scalar quantity and, as a result, we can write Equation 4 as

    \hat{\sigma}(e, a^T x) = \frac{1}{n} \Big( a^T \sum_{i=1}^n e_i x_i \Big) - \frac{1}{n} \Big( a^T \bar{x} \sum_{i=1}^n e_i \Big) = 0,

where, by Equations 2 and 3, we know that both summations are equal to zero. As a result, we have proven the desired result. (QED)

    \hat{\sigma}(e, a^T x) = 0
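For concreteness, the two orthogonality conditions used above are easy to verify numerically. The assignment's solutions use MATLAB scripts; the following NumPy sketch (with made-up data and variable names, not taken from the assignment) fits a least squares model and confirms that the residuals have zero mean and zero empirical covariance with aᵀx for an arbitrary a.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])  # augment with x_0 = 1
y = X @ rng.normal(size=d + 1) + rng.normal(scale=0.5, size=n)

w_hat = np.linalg.lstsq(X, y, rcond=None)[0]   # least squares estimate
e = y - X @ w_hat                              # prediction errors

a = rng.normal(size=d + 1)                     # arbitrary linear functional
ax = X @ a
cov_e_ax = np.mean(e * (ax - ax.mean()))       # empirical covariance (Equation 4)

print(e.mean())      # ~0, Equation 2
print(X.T @ e)       # ~0 vector, Equations 2 and 3
print(cov_e_ax)      # ~0, the claim of Problem 1
```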

Problem 2

Suppose that the data in a regression problem are scaled by multiplying the j-th dimension of the input by a non-zero number c_j. Let x̃ ≜ [1, c_1 x_1, ..., c_d x_d]ᵀ denote a single scaled data point and let X̃ represent the corresponding design matrix. Similarly, let ŵ be the maximum likelihood (ML) estimate of the regression parameters from the unscaled X, and let w̃ be the solution obtained from the scaled X̃. Show that scaling does not change optimality, in the sense that ŵᵀx = w̃ᵀx̃.

First, note that scaling can be represented as a linear operator C composed of the scaling factors along its main diagonal:

    C = \begin{bmatrix} 1 & & & \\ & c_1 & & \\ & & \ddots & \\ & & & c_d \end{bmatrix}.

Using the scaling operator C, we can express the scaled inputs x̃ and design matrix X̃ as functions of x and X, respectively:

    \tilde{x} = C x, \qquad \tilde{X} = X C.        (5)

As was demonstrated in class, under the Gaussian noise model, the ML estimate of the regression parameters is given by ŵ = (XᵀX)⁻¹Xᵀy. Using the expressions in Equation 5, we find

    \tilde{w} = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T y = \big( (XC)^T XC \big)^{-1} (XC)^T y = \big( C^T X^T X C \big)^{-1} C^T X^T y.

At this point, we can apply the following matrix identity: if the individual inverses A⁻¹ and B⁻¹ exist, then (AB)⁻¹ = B⁻¹A⁻¹. Since C is a real, symmetric square matrix with non-zero diagonal entries, C⁻¹ must exist. Similarly, XᵀX is a real, symmetric square matrix, so (XᵀXC)⁻¹ must also exist. As a result, we can apply the matrix identity to the previous expression:

    \tilde{w} = \big( (C^T)(X^T X C) \big)^{-1} C^T X^T y = (X^T X C)^{-1} (C^T)^{-1} C^T X^T y = C^{-1} (X^T X)^{-1} X^T y = C^{-1} \hat{w}.        (6)

To prove that the scaled solution is optimal, we apply Equations 5 and 6 as follows:

    \tilde{w}^T \tilde{x} = (C^{-1} \hat{w})^T C x = \hat{w}^T (C^{-1})^T C x.

Recall from [4] that (A⁻¹)ᵀ = (Aᵀ)⁻¹. As a result, (C⁻¹)ᵀC = (Cᵀ)⁻¹C. Since C is a real, symmetric matrix, C = Cᵀ and, as a result, (C⁻¹)ᵀC = I. In conclusion, we have proven the desired result. (QED)

    \hat{w}^T x = \tilde{w}^T \tilde{x}
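As a sanity check on Equation 6, the following NumPy sketch (synthetic data; not part of the original MATLAB solution) verifies both that the scaled estimate equals C⁻¹ŵ and that the fitted values agree.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 4
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
y = rng.normal(size=n)

c = rng.uniform(0.5, 2.0, size=d)              # non-zero scale factors c_1..c_d
C = np.diag(np.concatenate(([1.0], c)))        # scaling operator, Equation 5
X_tilde = X @ C

w_hat = np.linalg.solve(X.T @ X, X.T @ y)                       # unscaled ML estimate
w_tilde = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ y)   # scaled estimate

print(np.allclose(w_tilde, np.linalg.inv(C) @ w_hat))  # Equation 6: w_tilde = C^{-1} w_hat
print(np.allclose(X @ w_hat, X_tilde @ w_tilde))       # predictions agree: w^T x = w~^T x~
```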

Problem 3

Derive the maximum likelihood (ML) estimate of σ² under the Gaussian noise model.

Recall that the likelihood P of the noise variance σ², given the observations X = [x_1, ..., x_N]ᵀ and Y = [y_1, ..., y_N]ᵀ, is defined to be P(Y; w, σ) ≜ p(Y | X, w, σ). Also recall that, under the Gaussian noise model, the label y is a random variable with the following distribution:

    y = f(x; w) + \nu, \qquad \nu \sim N(\nu; 0, \sigma^2),

    p(y \mid x, w, \sigma) = N\big( y; f(x; w), \sigma^2 \big) = \frac{1}{\sigma \sqrt{2\pi}} \exp\Big( -\frac{(y - f(x; w))^2}{2\sigma^2} \Big).

Assuming that the observations are independent and identically distributed (i.i.d.), the likelihood can be expressed as the following product:

    P(Y; w, \sigma) = \prod_{i=1}^N p(y_i \mid x_i, w, \sigma) = \prod_{i=1}^N \frac{1}{\sigma \sqrt{2\pi}} \exp\Big( -\frac{(y_i - f(x_i; w))^2}{2\sigma^2} \Big),

with the corresponding ML estimator for σ² given by

    \hat{\sigma}^2_{ML} = \arg\max_{\sigma^2} \prod_{i=1}^N \frac{1}{\sigma \sqrt{2\pi}} \exp\Big( -\frac{(y_i - f(x_i; w))^2}{2\sigma^2} \Big).

As was done in class, we consider the log-likelihood ℓ, which converts this product into a summation as follows:

    \ell(Y; w, \sigma) \triangleq \log P(Y; w, \sigma) = -\frac{1}{2\sigma^2} \sum_{i=1}^N (y_i - f(x_i; w))^2 - N \log(\sigma \sqrt{2\pi}).        (7)

Since the logarithm is a monotonically increasing function, the ML estimator becomes

    \hat{\sigma}^2_{ML} = \arg\max_{\sigma^2} \Big\{ -\frac{1}{2\sigma^2} \sum_{i=1}^N (y_i - f(x_i; w))^2 - N \log(\sigma \sqrt{2\pi}) \Big\}.

The maximum (or minimum) of this function will necessarily be obtained where the derivative with respect to σ equals zero:

    \frac{\partial}{\partial \sigma} \Big\{ -\frac{1}{2\sigma^2} \sum_{i=1}^N (y_i - f(x_i; w))^2 - N \log(\sigma \sqrt{2\pi}) \Big\}
      = \frac{1}{\sigma^3} \sum_{i=1}^N (y_i - f(x_i; w))^2 - \frac{N}{\sigma} = 0.

In conclusion, the ML estimate of σ² is given by the following expression:

    \hat{\sigma}^2_{ML} = \frac{1}{N} \sum_{i=1}^N (y_i - f(x_i; w))^2.        (8)

Note that this expression maximizes the likelihood of the observations, assuming w is known. If this is not the case, then the ML estimate ŵ_ML should be substituted for w in Equation 8. As was shown in class, the ML estimate ŵ_ML is independent of σ² and, as a result, can be estimated independently. In other words, if the ML estimates of both w and σ² are required, then the following expressions should be used:

    \hat{w}_{ML} = \arg\min_w \sum_{i=1}^N (y_i - f(x_i; w))^2,
    \qquad
    \hat{\sigma}^2_{ML} = \frac{1}{N} \sum_{i=1}^N (y_i - f(x_i; \hat{w}_{ML}))^2.
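A minimal numerical illustration of these two estimates, assuming a linear model f(x; w) = wᵀx and synthetic data (the code below is a NumPy sketch, not part of the original MATLAB solution):

```python
import numpy as np

def gaussian_ml_fit(X, y):
    """Joint ML estimates under the Gaussian noise model:
    w_ML by least squares, then sigma^2_ML as the mean squared residual."""
    w_ml = np.linalg.lstsq(X, y, rcond=None)[0]
    residuals = y - X @ w_ml
    sigma2_ml = np.mean(residuals ** 2)        # Equation 8 with w = w_ML
    return w_ml, sigma2_ml

rng = np.random.default_rng(2)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=200)
w_ml, sigma2_ml = gaussian_ml_fit(X, y)
print(w_ml, sigma2_ml)   # sigma2_ml should be close to 0.3**2
```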

Problem 4

Part 1: Consider the noise model y = f(x; w) + ν in which ν is drawn from the following distribution:

    p(\nu) = C \exp(-\nu^4), \qquad \text{for} \quad C = \Big( \int_{-\infty}^{\infty} \exp(-x^4)\, dx \Big)^{-1}.

Derive the conditions on the maximum likelihood estimate ŵ_ML. What is the corresponding loss function? Compare this loss function with squared loss on the interval [−3, 3]. How do you expect these differences to affect regression?

As in Problem 3, let's begin by defining the distribution of the label y, given the input x:

    p(y \mid x, w) = C \exp\big( -(y - f(x; w))^4 \big).

Once again, we assume that the observations are i.i.d. such that the likelihood P is given by

    P(Y; w) = \prod_{i=1}^N p(y_i \mid x_i, w) = \prod_{i=1}^N C \exp\big( -(y_i - f(x_i; w))^4 \big).

The log-likelihood ℓ is then given by

    \ell(Y; w) = \log P(Y; w) = N \log C - \sum_{i=1}^N (y_i - f(x_i; w))^4,

with the corresponding ML estimate for w given by

    \hat{w}_{ML} = \arg\max_w \ell(Y; w) = \arg\min_w \sum_{i=1}^N (y_i - f(x_i; w))^4 = \arg\min_w \sum_{i=1}^N L_4\big( y_i, f(x_i; w) \big),

where L_4 is the quadrupled loss function defined as follows:

    L_4(y, \hat{y}) = (y - \hat{y})^4.

The squared loss L(y, ŷ) = (y − ŷ)² and the quadrupled loss were plotted (as a function of y − ŷ) in Figure 1(a) using the Matlab script prob4.m. From this plot it is apparent that L_4 assigns a greater penalty to outliers than L. That is, as the prediction error y − ŷ increases, regression with the quadrupled loss will be biased towards outliers more than regression with the squared loss.

Part 2: Now consider the three-point data set X = [x_1, x_2, x_3]ᵀ and Y = [y_1, y_2, y_3]ᵀ given in the problem statement. Find the numerical solution for ŵ_ML using both linear least squares (i.e., squared loss) and quadrupled loss and report the empirical losses. Plot the corresponding functions and explain how what is seen results from the differences in loss functions.

Using prob4.m, the optimal values of (w_0, w_1) were determined using the provided testws.m function for both the squared and quadrupled losses. The results are tabulated below.

    Loss Function                       ŵ_0      ŵ_1      Empirical Loss
    Squared loss (LSQ), L(y, ŷ)
    Quadrupled loss, L_4(y, ŷ)

Figure 1: ML estimation using exhaustive search under varying loss functions. (a) Comparison of loss functions (squared vs. quadrupled loss, plotted against y − ŷ). (b) Comparison of ML-estimated models (sample data, linear least squares, and quadrupled loss). (c) Squared loss surface over (w_0, w_1). (d) Quadrupled loss surface over (w_0, w_1).

Note that the empirical losses \bar{L}_N and \bar{L}^4_N reported above are both given by the average sum of squared errors:

    \bar{L}_N(w) = \bar{L}^4_N(w) = \frac{1}{N} \sum_{i=1}^N (y_i - f(x_i; w))^2.        (9)

Using the values in the table, the corresponding regression models were plotted in Figure 1(b), with the associated loss surfaces shown in Figures 1(c) and 1(d). Note that the two noise models resulted in lines with identical slopes (i.e., equal values of w_1); however, the y-intercepts (given by w_0) differed. This is consistent with our previous observation that the quadrupled-loss model is biased towards outliers: for this example, the estimated line moves towards the second data point, since this point can be viewed as an outlier.
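The exhaustive search itself is simple to reproduce. The assignment provided testws.m for this purpose; the sketch below is an independent NumPy stand-in, with an illustrative three-point data set and grid limits chosen arbitrarily (the original values are not reproduced here).

```python
import numpy as np

def grid_search_fit(x, y, loss, w0_grid, w1_grid):
    """Exhaustively search a (w0, w1) grid for the line minimizing the given loss."""
    best = (None, None, np.inf)
    for w0 in w0_grid:
        for w1 in w1_grid:
            err = y - (w0 + w1 * x)
            total = loss(err).sum()
            if total < best[2]:
                best = (w0, w1, total)
    return best

squared = lambda e: e ** 2      # L(y, yhat) = (y - yhat)^2
quadrupled = lambda e: e ** 4   # L4(y, yhat) = (y - yhat)^4

# Illustrative three-point data set (not the values from the problem statement).
x = np.array([0.0, 1.0, 2.0])
y = np.array([0.0, 2.0, 1.0])
grid = np.linspace(-2.0, 2.0, 201)
print(grid_search_fit(x, y, squared, grid, grid))     # least squares line
print(grid_search_fit(x, y, quadrupled, grid, grid))  # quadrupled-loss line
```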

Problem 5

Apply linear and quadratic polynomial regression to the data in meteodata.mat using linear least squares. Use 10-fold cross-validation to select the best model and plot the results. Report the empirical loss and log-likelihood. Based on the plot, comment on the Gaussian noise model.

Recall from class that we can solve for the least squares polynomial regression coefficients ŵ by using the extended design matrix X̃ such that

    \hat{w} = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T y,
    \qquad \text{with} \qquad
    \tilde{X} = \begin{bmatrix}
        1 & x_1 & x_1^2 & \cdots & x_1^d \\
        1 & x_2 & x_2^2 & \cdots & x_2^d \\
        \vdots & & & & \vdots \\
        1 & x_N & x_N^2 & \cdots & x_N^d
    \end{bmatrix},

where d is the degree of the polynomial and {x_1, ..., x_N} and y are the observed points and their associated labels, respectively. To prevent numerical round-off errors, this method was applied to the column-normalized design matrix X̃ using degexpand.m within the main script prob5.m (as discussed in Problem 2). The resulting linear and quadratic polynomials are shown in Figure 2(a). The fitting parameters obtained using all data points and up to a fourth-order polynomial are tabulated below.

    Polynomial Degree    Empirical Loss    Log-likelihood    10-fold Cross-Validation Score
    Linear (d = 1)
    Quadratic (d = 2)
    Cubic (d = 3)
    Quartic (d = 4)

Note that the empirical loss \bar{L}_N is defined to be the average sum of squared errors as given by Equation 9. The log-likelihood of the data (under a Gaussian model) was derived in class and is given by Equation 7 as

    \ell(Y; w, \sigma) = -\frac{1}{2\sigma^2} \sum_{i=1}^N (y_i - f(x_i; w))^2 - N \log(\sigma \sqrt{2\pi}),

where σ² ≜ σ̂²_ML is the maximum likelihood estimate of σ² under a Gaussian noise model (as given by Equation 8 in Problem 3).

At this point, we turn our attention to the model-order selection task (i.e., deciding what degree of polynomial best represents the data). As discussed in class, we will use 10-fold cross-validation to select the best model. First, we partition the data into 10 roughly equal parts (see prob5.m). Next, we perform 10 sequential trials in which we train on all but the i-th fold of the data and then measure the empirical error on the remaining samples. In general, we formulate the k-fold cross-validation score as

    \hat{L}_k = \frac{1}{N} \sum_{i=1}^{k} \sum_{j \in \text{fold } i} \big( y_j - f(x_j; \hat{w}^{(-i)}) \big)^2,

where ŵ^{(-i)} is fit to all samples except those in the i-th fold. The resulting 10-fold cross-validation scores are tabulated above (for up to a fourth-order polynomial). Since the lowest cross-validation score is achieved for the quadratic polynomial, we select this as the best model.

Figure 2: Comparison of linear and quadratic polynomials estimated using linear least squares. (a) Least squares polynomial regression results. (b) Quadratic polynomial (selected by cross-validation). Both panels plot the reported temperature (Kelvin) against the level of seismic activity (Richters).

In conclusion, we find that the quadratic polynomial has the lowest cross-validation score. However, as shown in Figure 2(b), it is immediately apparent that the Gaussian noise model does not accurately represent the underlying distribution; more specifically, if the underlying distribution were Gaussian, we'd expect half of the data points to be above the model prediction (and the other half below). This is clearly not the case for this example, motivating the alternate noise model we'll analyze in Problem 6.
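The pipeline above (extended design matrix, column normalization, least squares fit, 10-fold cross-validation) was implemented in the MATLAB scripts degexpand.m and prob5.m; the following NumPy sketch shows an equivalent computation on synthetic data, with all names and values illustrative.

```python
import numpy as np

def poly_design(x, degree):
    """Extended design matrix [1, x, x^2, ..., x^d], column-normalized
    to reduce round-off error (as degexpand.m does in the assignment)."""
    X = np.vander(x, degree + 1, increasing=True)
    norms = np.linalg.norm(X, axis=0)
    return X / norms, norms

def cv_score(x, y, degree, k=10, seed=0):
    """Average squared prediction error over k folds (the cross-validation score)."""
    idx = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(idx, k)
    sse = 0.0
    for fold in folds:
        train = np.setdiff1d(idx, fold)
        Xtr, norms = poly_design(x[train], degree)
        w = np.linalg.lstsq(Xtr, y[train], rcond=None)[0]
        Xte = np.vander(x[fold], degree + 1, increasing=True) / norms
        sse += np.sum((y[fold] - Xte @ w) ** 2)
    return sse / len(x)

rng = np.random.default_rng(3)
x = rng.uniform(0, 5, size=120)
y = 2.0 + 0.5 * x - 0.3 * x ** 2 + rng.normal(scale=0.4, size=x.size)
for d in (1, 2, 3, 4):
    print(d, cv_score(x, y, d))   # the quadratic typically scores best here
```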

Problem 6

Part 1: Consider the noise model y = f(x; w) + ν in which ν is drawn from p(ν) as follows:

    p(\nu) = \begin{cases} e^{-\nu} & \text{if } \nu > 0, \\ 0 & \text{otherwise}. \end{cases}

Perform 10-fold cross-validation for linear and quadratic regression under this noise model, using an exhaustive numerical search similar to that used in Problem 4. Plot the selected model and report the empirical loss and the log-likelihood under the estimated exponential noise model.

As in Problems 3 and 4, let's begin by defining the distribution of the label y, given the input x:

    p(y \mid x, w) = \begin{cases} \exp\big( -(y - f(x; w)) \big) & \text{if } y > f(x; w), \\ 0 & \text{otherwise}. \end{cases}

Once again, we assume that the observations are i.i.d. such that the likelihood P is given as follows:

    P(Y; w) = \prod_{i=1}^N p(y_i \mid x_i, w)
            = \prod_{i=1}^N \begin{cases} \exp\big( f(x_i; w) - y_i \big) & \text{if } y_i > f(x_i; w), \\ 0 & \text{otherwise}. \end{cases}        (10)

The log-likelihood ℓ is then given by

    \ell(Y; w) = \log P(Y; w) = \sum_{i=1}^N \begin{cases} f(x_i; w) - y_i & \text{if } y_i > f(x_i; w), \\ -\infty & \text{otherwise}, \end{cases}        (11)

since the logarithm is a monotonic function and lim_{x→0⁺} log x = −∞. The corresponding ML estimate for w is given by the following expression:

    \hat{w}_{ML} = \arg\max_w \ell(Y; w) = \arg\min_w \sum_{i=1}^N \begin{cases} y_i - f(x_i; w) & \text{if } y_i > f(x_i; w), \\ \infty & \text{otherwise}. \end{cases}        (12)

Note that Equations 11 and 12 prevent any prediction f(x_i; w) from being above the corresponding label y_i. As a result, the exponential noise distribution will effectively lead to a model corresponding to the lower envelope of the training data.

Using the maximum likelihood formulation in Equation 12, we can solve for the optimal regression parameters using an exhaustive search (similar to what was done in Problem 4). This approach was applied to all the data points in prob6.m. The resulting best-fit linear and quadratic polynomial models are shown in Figure 3(a). The fitting parameters obtained using all the data points and up to a second-order polynomial are tabulated below.

    Polynomial Degree    Empirical Loss    Log-likelihood    10-fold Cross-Validation Score
    Linear (d = 1)
    Quadratic (d = 2)

Note that the empirical loss \bar{L}_N was calculated using Equation 9. The log-likelihood of the data (under the exponential noise model) was determined using Equation 11. Model selection was performed using 10-fold cross-validation as in Problem 5. The resulting scores are tabulated above. Since the lowest cross-validation score was achieved for the quadratic polynomial, we select this as the best model and plot the result in Figure 3(b).

Figure 3: ML estimation using exhaustive search under an exponential noise model. (a) Polynomial regression results. (b) Quadratic polynomial (selected by cross-validation). Both panels plot the reported temperature (Kelvin) against the level of seismic activity (Richters).

Comparing the model in Figure 3(b) with that in Figure 2(b), we conclude that the quadratic polynomial under the exponential noise model better approximates the underlying distribution; more specifically, the quadratic polynomial achieves a substantially higher log-likelihood under the exponential noise model than under the Gaussian noise model. It is also important to note that the Gaussian noise model leads to a lower squared loss (i.e., empirical loss); however, this is by construction of the least squares estimator used in Problem 5. This highlights the important observation that a lower empirical loss does not necessarily indicate a better model; that conclusion only holds when the choice of noise model appropriately represents the actual distribution.

Part 2: Now evaluate the polynomials selected under the Gaussian and exponential noise models on the 2005 data. Report which performs better in terms of likelihood and empirical loss.

In both Problems 5 and 6.1, the quadratic polynomial was selected as the best model by 10-fold cross-validation. In order to gauge the generalization capabilities of these models, the fitted model parameters were used to predict the 2005 samples. The results for each noise model are tabulated below.

    Noise Model            Empirical Loss    Log-likelihood
    Gaussian (d = 2)
    Exponential (d = 2)

In conclusion, we find that the Gaussian model achieves a lower empirical loss on the 2005 samples. This is expected, since the Gaussian model is equivalent to the least squares estimator, which minimizes the empirical loss. The exponential model, however, achieves a significantly higher log-likelihood, indicating that it models the underlying data more effectively than the Gaussian noise model. As a result, we reiterate the point made previously: a low empirical loss can be achieved even if the noise model does not accurately represent the underlying noise distribution.
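A compact way to reproduce the lower-envelope fit of Equation 12 is a grid search that rejects any candidate whose predictions exceed a training label. The sketch below is a NumPy illustration with synthetic data (the assignment's meteodata values and the prob6.m grid are not reproduced here).

```python
import numpy as np

def exp_neg_log_likelihood(w, X, y):
    """Negative log-likelihood under the exponential noise model (Equation 11):
    sum of (y_i - f(x_i; w)) when all residuals are positive, +inf otherwise."""
    r = y - X @ w
    return np.sum(r) if np.all(r > 0) else np.inf

def exhaustive_fit(X, y, grids):
    """Grid search over the regression coefficients, keeping only candidates
    that stay below every training label (the lower envelope)."""
    best_w, best_nll = None, np.inf
    for w in np.stack(np.meshgrid(*grids), axis=-1).reshape(-1, len(grids)):
        nll = exp_neg_log_likelihood(w, X, y)
        if nll < best_nll:
            best_w, best_nll = w, nll
    return best_w, best_nll

# Illustrative data; the assignment's meteodata.mat values are not reproduced here.
rng = np.random.default_rng(4)
x = rng.uniform(0, 4, 100)
y = 1.0 + 0.8 * x + rng.exponential(scale=1.0, size=x.size)   # positive noise
X = np.column_stack([np.ones_like(x), x])
grid = np.linspace(-1, 3, 81)
w_hat, nll = exhaustive_fit(X, y, [grid, grid])
print(w_hat, -nll)   # fitted lower-envelope line and its log-likelihood
```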

2 Multivariate Gaussian Distributions

Problem 7

Recall that the probability density function (pdf) of a Gaussian distribution in R^d is given by

    p(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\Big( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \Big).

Show that a contour corresponding to a fixed value of the pdf is an ellipse in the 2D x_1, x_2 space.

An iso-contour of the pdf along p(x) = p_0 is given by

    \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\Big( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \Big) = p_0.

Taking the logarithm of each side and rearranging terms gives

    (x - \mu)^T \Sigma^{-1} (x - \mu) = -2 \log\big( p_0 (2\pi)^{d/2} |\Sigma|^{1/2} \big) \triangleq C_1,

where C_1 ∈ R is a constant. At this point it is most convenient to write this expression explicitly as a function of x_1 and x_2, such that x = [x_1, x_2]ᵀ and µ = [x̄_1, x̄_2]ᵀ:

    \begin{pmatrix} x_1 - \bar{x}_1 & x_2 - \bar{x}_2 \end{pmatrix}
    \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix}^{-1}
    \begin{pmatrix} x_1 - \bar{x}_1 \\ x_2 - \bar{x}_2 \end{pmatrix} = C_1

    \Rightarrow \quad
    \begin{pmatrix} x_1 - \bar{x}_1 & x_2 - \bar{x}_2 \end{pmatrix}
    \begin{pmatrix} \sigma_2^2 & -\sigma_{12} \\ -\sigma_{12} & \sigma_1^2 \end{pmatrix}
    \begin{pmatrix} x_1 - \bar{x}_1 \\ x_2 - \bar{x}_2 \end{pmatrix} = (\sigma_1^2 \sigma_2^2 - \sigma_{12}^2)\, C_1 \triangleq C_2.

Multiplying the terms in this expression, we obtain

    \sigma_2^2 (x_1 - \bar{x}_1)^2 - 2\sigma_{12} (x_1 - \bar{x}_1)(x_2 - \bar{x}_2) + \sigma_1^2 (x_2 - \bar{x}_2)^2 = C_2.

Simplifying, we obtain a solution for the 2D iso-contours given by

    \frac{(x_1 - \bar{x}_1)^2}{\sigma_1^2} - \frac{2\sigma_{12}}{\sigma_1^2 \sigma_2^2} (x_1 - \bar{x}_1)(x_2 - \bar{x}_2) + \frac{(x_2 - \bar{x}_2)^2}{\sigma_2^2} = C_3,
    \qquad
    C_3 \triangleq -2\, \frac{\sigma_1^2 \sigma_2^2 - \sigma_{12}^2}{\sigma_1^2 \sigma_2^2} \log\big( p_0 (2\pi)^{d/2} |\Sigma|^{1/2} \big).        (13)

Recall from [5] that a general quadratic curve can be written as

    a x_1^2 + b x_1 x_2 + c x_2^2 + d x_1 + f x_2 + g = 0,

with an associated discriminant given by b² − 4ac. Also recall that if the discriminant satisfies b² − 4ac < 0, then the quadratic curve represents either an ellipse, a circle, or a point (with the latter two representing certain degenerate configurations of an ellipse) [6]. Since Equation 13 has the form of a quadratic curve, it has a discriminant given by

    b^2 - 4ac = \Big( \frac{2\sigma_{12}}{\sigma_1^2 \sigma_2^2} \Big)^2 - \frac{4}{\sigma_1^2 \sigma_2^2}
              = \frac{4}{(\sigma_1^2 \sigma_2^2)^2} \big( \sigma_{12}^2 - \sigma_1^2 \sigma_2^2 \big),

which we must show is negative.

Recall that the variances must be non-negative, so σ_1², σ_2² ≥ 0. As a result, the factor 4/(σ_1² σ_2²)² is positive, and we only need to prove that σ_{12}² < σ_1² σ_2² in order to show that Equation 13 represents an ellipse. Recall from class that the cross-correlation coefficient ρ is defined by

    \rho = \frac{\sigma_{12}}{\sigma_1 \sigma_2} \quad \Rightarrow \quad \sigma_{12}^2 = \rho^2 \sigma_1^2 \sigma_2^2,

where |ρ| ≤ 1 (and |ρ| < 1 whenever Σ is non-singular). As a result, ρ² < 1, which implies σ_{12}² < σ_1² σ_2², so the discriminant is that of an ellipse. In conclusion, Equation 13 has the associated discriminant

    b^2 - 4ac = \frac{4}{(\sigma_1^2 \sigma_2^2)^2} \big( \sigma_{12}^2 - \sigma_1^2 \sigma_2^2 \big) < 0,

which by [6] defines an ellipse in the 2D x_1, x_2 space. (QED)
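As a quick numerical check of the sign argument above, one can evaluate the discriminant of Equation 13 for an arbitrary valid covariance matrix (illustrative NumPy sketch; not part of the original solution):

```python
import numpy as np

# Check the sign of the conic discriminant b^2 - 4ac of Equation 13
# for a randomly generated (valid) 2x2 covariance matrix.
rng = np.random.default_rng(5)
A = rng.normal(size=(2, 2))
Sigma = A @ A.T + 0.1 * np.eye(2)          # positive definite covariance
s1, s2, s12 = Sigma[0, 0], Sigma[1, 1], Sigma[0, 1]   # s1, s2 are the variances

a, b, c = 1.0 / s1, -2.0 * s12 / (s1 * s2), 1.0 / s2
print(b ** 2 - 4 * a * c)                          # negative: the iso-contour is an ellipse
print(4 * (s12 ** 2 - s1 * s2) / (s1 * s2) ** 2)   # same quantity in closed form
```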

Problem 8

Suppose we have a sample x_1, ..., x_n drawn from N(x; µ, Σ). Show that the ML estimate of the mean vector µ is

    \hat{\mu}_{ML} = \frac{1}{n} \sum_{i=1}^n x_i

and the ML estimate of the covariance matrix Σ is

    \hat{\Sigma}_{ML} = \frac{1}{n} \sum_{i=1}^n (x_i - \hat{\mu}_{ML})(x_i - \hat{\mu}_{ML})^T.

Qualitatively, we will repeat the derivation used for the univariate Gaussian in Problem 3. Let's begin by defining the multivariate Gaussian distribution in R^d:

    p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\Big( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \Big).

Once again, we assume that the samples are i.i.d. such that the likelihood P is given by

    P(X; \mu, \Sigma) = \prod_{i=1}^n p(x_i \mid \mu, \Sigma).

The log-likelihood ℓ is then given by

    \ell(X; \mu, \Sigma) = \log P(X; \mu, \Sigma) = \sum_{i=1}^n \log p(x_i \mid \mu, \Sigma)
      = -n \log\big( (2\pi)^{d/2} |\Sigma|^{1/2} \big) - \frac{1}{2} \sum_{i=1}^n (x_i - \mu)^T \Sigma^{-1} (x_i - \mu),        (14)

with the corresponding ML estimate for µ given by

    \hat{\mu}_{ML} = \arg\max_\mu \ell(X; \mu, \Sigma) = \arg\min_\mu \sum_{i=1}^n (x_i - \mu)^T \Sigma^{-1} (x_i - \mu),

since the first term in Equation 14 is independent of µ. The minimum (or maximum) of this function will occur where the derivative with respect to µ equals zero. Let's proceed by making the substitution Σ⁻¹ ≜ A. Equating the first partial derivative with zero, we find

    \frac{\partial}{\partial \mu} \sum_{i=1}^n (x_i - \mu)^T A (x_i - \mu)
      = \sum_{i=1}^n \frac{\partial}{\partial \mu} \big[ (x_i - \mu)^T A (x_i - \mu) \big]
      = \sum_{i=1}^n \frac{\partial}{\partial \mu} \big[ x_i^T A x_i - x_i^T A \mu - \mu^T A x_i + \mu^T A \mu \big] = 0.

Note that the first term inside the brackets is independent of µ and can be eliminated:

    \sum_{i=1}^n \Big\{ -\frac{\partial}{\partial \mu} (x_i^T A \mu) - \frac{\partial}{\partial \mu} (\mu^T A x_i) + \frac{\partial}{\partial \mu} (\mu^T A \mu) \Big\} = 0.

We can now apply the following identities for the derivatives of scalar and matrix/vector forms [4]:

    \frac{\partial (Ax)}{\partial x} = \frac{\partial (x^T A)}{\partial x} = A,
    \qquad
    \frac{\partial (x^T A x)}{\partial x} = (A + A^T) x.

Applying these expressions to the previous equation gives

    \sum_{i=1}^n \big\{ -x_i^T A - A x_i + A \mu + A^T \mu \big\} = 0.

Recall that Σ is a real, symmetric matrix such that Σᵀ = Σ [4]. In addition, we have Aᵀ = A, since A = Σ⁻¹ is also a real, symmetric matrix. Finally, since A is symmetric, we have x_iᵀA = (A x_i)ᵀ, which we may treat as the column vector A x_i. Applying these identities to the previous equation, we find

    \sum_{i=1}^n A (x_i - \mu) = 0 \quad \Rightarrow \quad n\mu = \sum_{i=1}^n x_i.

Dividing by n, we obtain the desired result:

    \hat{\mu}_{ML} = \frac{1}{n} \sum_{i=1}^n x_i.

Similar to the derivation of µ̂_ML, the ML estimate Σ̂_ML is given by

    \hat{\Sigma}_{ML} = \arg\max_\Sigma \ell(X; \mu, \Sigma)
      = \arg\min_\Sigma \Big\{ \frac{1}{2} \sum_{i=1}^n (x_i - \mu)^T \Sigma^{-1} (x_i - \mu) + n \log\big( (2\pi)^{d/2} |\Sigma|^{1/2} \big) \Big\}
      = \arg\min_\Sigma \Big\{ \sum_{i=1}^n (x_i - \mu)^T \Sigma^{-1} (x_i - \mu) + n \log |\Sigma| \Big\}.        (15)

Once again, we make the substitution Σ⁻¹ ≜ A. As a result, the minimum (or maximum) of Equation 15 will occur where the derivative with respect to A equals zero:

    \frac{\partial}{\partial A} \Big\{ \sum_{i=1}^n (x_i - \mu)^T A (x_i - \mu) - n \log |A| \Big\} = 0,

where we have substituted |A| = |Σ⁻¹| = 1/|Σ|. Rearranging, this condition becomes

    n\, \frac{\partial}{\partial A} \log |A| = \sum_{i=1}^n \frac{\partial}{\partial A} \big[ (x_i - \mu)^T A (x_i - \mu) \big].

We can now apply the following identities for the derivatives of scalar and matrix/vector forms [4]:

    \frac{\partial}{\partial A} \big[ (x - \mu)^T A (x - \mu) \big] = (x - \mu)(x - \mu)^T,
    \qquad
    \frac{\partial}{\partial A} \log |A| = A^{-1} = \Sigma.

Applying these identities to the previous equation (with µ = µ̂_ML), we find the desired result. (QED)

    \hat{\Sigma}_{ML} = \frac{1}{n} \sum_{i=1}^n (x_i - \hat{\mu}_{ML})(x_i - \hat{\mu}_{ML})^T
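The two estimators are one line each in code. The sketch below (synthetic data, NumPy rather than the original MATLAB) computes µ̂_ML and the biased 1/n covariance of Problem 8 and compares them against the generating parameters.

```python
import numpy as np

def gaussian_ml_estimates(X):
    """ML estimates for a multivariate Gaussian: sample mean and the
    (biased, 1/n) sample covariance derived in Problem 8."""
    n = X.shape[0]
    mu = X.mean(axis=0)
    centered = X - mu
    Sigma = centered.T @ centered / n
    return mu, Sigma

rng = np.random.default_rng(6)
true_mu = np.array([1.0, -2.0])
true_Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=5000)
mu_hat, Sigma_hat = gaussian_ml_estimates(X)
print(mu_hat, Sigma_hat, sep="\n")   # close to the true parameters
# Note: np.cov(X.T) uses the unbiased 1/(n-1) normalization instead.
```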

3 Linear Discriminant Analysis

Problem 9

This problem will examine an extension of linear discriminant analysis (LDA) to multiple classes (assuming equal covariance matrices, such that Σ_c = Σ for every class c). Show that the optimal decision rule is based on calculating a set of C linear discriminant functions

    \delta_c(x) = x^T \Sigma^{-1} \mu_c - \frac{1}{2} \mu_c^T \Sigma^{-1} \mu_c

and selecting the class c* = argmax_c δ_c(x). Assume that the prior probability is uniform for each class, such that P_c = p(y = c) = 1/C. Apply this method to the data in apple_lda.mat and plot the resulting linear decision boundaries or, alternatively, the decision regions. Report the classification error.

Recall from class and [3] that the Bayes classifier minimizes the conditional risk, for a given class-conditional density p_c(x) = p(x | y = c) and prior probability P_c = p(y = c), such that

    c^* = \arg\max_c \delta_c(x), \qquad \text{for} \quad \delta_c(x) \triangleq \log p_c(x) + \log P_c.

For this problem we can eliminate the class prior term to obtain δ_c(x) = log p_c(x), since the class prior is identical for all classes. In addition, we'll assume that the class-conditionals are multivariate Gaussians with identical covariance matrices, such that

    p_c(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\Big( -\frac{1}{2} (x - \mu_c)^T \Sigma^{-1} (x - \mu_c) \Big).

Substituting into the previous expression, we find

    \delta_c(x) = \log p_c(x)
      = -\frac{1}{2} (x - \mu_c)^T \Sigma^{-1} (x - \mu_c) - \frac{d}{2} \log 2\pi - \frac{1}{2} \log |\Sigma|
      = -\frac{1}{2} (x - \mu_c)^T \Sigma^{-1} (x - \mu_c)
      = -\frac{1}{2} x^T \Sigma^{-1} x + \frac{1}{2} x^T \Sigma^{-1} \mu_c + \frac{1}{2} \mu_c^T \Sigma^{-1} x - \frac{1}{2} \mu_c^T \Sigma^{-1} \mu_c
      = \frac{1}{2} x^T \Sigma^{-1} \mu_c + \frac{1}{2} \mu_c^T \Sigma^{-1} x - \frac{1}{2} \mu_c^T \Sigma^{-1} \mu_c,        (16)

where terms that are independent of c have been eliminated at each step. To complete our proof, note that

    \mu_c^T \Sigma^{-1} x = (\mu_c^T \Sigma^{-1} x)^T = x^T (\Sigma^{-1})^T \mu_c = x^T \Sigma^{-1} \mu_c,

since Σ and Σ⁻¹ are symmetric matrices and, as a result, Σ⁻¹ = (Σ⁻¹)ᵀ. Applying this observation to Equation 16, we prove the desired result:

    \delta_c(x) = x^T \Sigma^{-1} \mu_c - \frac{1}{2} \mu_c^T \Sigma^{-1} \mu_c.

Using prob9.m, this method was applied to the data in apple_lda.mat. The means and covariance were obtained using the ML estimators derived in Problem 8; for a given class c, the mean was calculated as

    \hat{\mu}_c = \frac{1}{n_c} \sum_{i=1}^{n_c} x_{ci}.

Figure 4: Optimal linear decision boundaries/regions for apple_lda.mat (Classes 1-3 shown in the x_1, x_2 plane).

Similarly, the shared covariance matrix Σ was calculated by pooling over all classes:

    \hat{\Sigma} = \frac{1}{\sum_c n_c} \sum_{c=1}^{C} \sum_{i=1}^{n_c} (x_{ci} - \hat{\mu}_c)(x_{ci} - \hat{\mu}_c)^T.

The resulting decision boundaries are shown in Figure 4. Note that δ_c(x) is linear in x, so we only obtain linear decision boundaries. The classification error (measured as the percentage of incorrect classifications on the training data) was .7%.
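A compact implementation of this classifier (per-class means, a pooled covariance, and the linear discriminants δ_c(x)) is sketched below in NumPy; the data are synthetic stand-ins for apple_lda.mat and all function names are illustrative.

```python
import numpy as np

def fit_lda(X, y):
    """Per-class means and a pooled covariance (equal-covariance assumption),
    estimated with the ML formulas from Problem 8."""
    classes = np.unique(y)
    mus = {c: X[y == c].mean(axis=0) for c in classes}
    Sigma = sum(((X[y == c] - mus[c]).T @ (X[y == c] - mus[c])) for c in classes)
    Sigma /= len(y)
    return classes, mus, Sigma

def lda_predict(X, classes, mus, Sigma):
    """delta_c(x) = x^T Sigma^{-1} mu_c - 0.5 mu_c^T Sigma^{-1} mu_c; pick the argmax."""
    Sinv = np.linalg.inv(Sigma)
    deltas = np.column_stack([
        X @ Sinv @ mus[c] - 0.5 * mus[c] @ Sinv @ mus[c] for c in classes
    ])
    return classes[np.argmax(deltas, axis=1)]

# Illustrative synthetic data standing in for apple_lda.mat.
rng = np.random.default_rng(7)
means = [np.array([0, 0]), np.array([3, 1]), np.array([1, 3])]
cov = np.array([[1.0, 0.3], [0.3, 0.8]])
X = np.vstack([rng.multivariate_normal(m, cov, 100) for m in means])
y = np.repeat([1, 2, 3], 100)
classes, mus, Sigma = fit_lda(X, y)
print(np.mean(lda_predict(X, classes, mus, Sigma) != y))   # training error rate
```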

Problem 10

Assume that the covariances Σ_c are no longer required to be equal. Derive the discriminant function and apply the resulting decision rule to apple_lda.mat. Plot the decision boundaries/regions and report the classification error. Compare the performance of the two classifiers.

As in Problem 9, we begin by writing the Bayes classifier

    c^* = \arg\max_c \delta_c(x), \qquad \text{for} \quad \delta_c(x) = \log p_c(x) + \log P_c = \log p_c(x),

since P_c = 1/C is independent of c. The class-conditionals are multivariate Gaussians given by

    p_c(x) = \frac{1}{(2\pi)^{d/2} |\Sigma_c|^{1/2}} \exp\Big( -\frac{1}{2} (x - \mu_c)^T \Sigma_c^{-1} (x - \mu_c) \Big).

Substituting into the previous expression, we find

    \delta_c(x) = \log p_c(x)
      = -\frac{1}{2} (x - \mu_c)^T \Sigma_c^{-1} (x - \mu_c) - \frac{d}{2} \log 2\pi - \frac{1}{2} \log |\Sigma_c|
      = -\frac{1}{2} (x - \mu_c)^T \Sigma_c^{-1} (x - \mu_c) - \frac{1}{2} \log |\Sigma_c|
      = -\frac{1}{2} x^T \Sigma_c^{-1} x + \frac{1}{2} x^T \Sigma_c^{-1} \mu_c + \frac{1}{2} \mu_c^T \Sigma_c^{-1} x - \frac{1}{2} \mu_c^T \Sigma_c^{-1} \mu_c - \frac{1}{2} \log |\Sigma_c|,

where terms that are independent of c have been eliminated.

Figure 5: Non-linear decision boundaries/regions for apple_lda.mat (Classes 1-3 shown in the x_1, x_2 plane).

Once again, we can simplify this expression, since µ_cᵀ Σ_c⁻¹ x = xᵀ Σ_c⁻¹ µ_c. In conclusion, the discriminant function is given as follows:

    \delta_c(x) = -\frac{1}{2} x^T \Sigma_c^{-1} x + x^T \Sigma_c^{-1} \mu_c - \frac{1}{2} \mu_c^T \Sigma_c^{-1} \mu_c - \frac{1}{2} \log |\Sigma_c|.

Using prob10.m, this discriminant function was applied to the data in apple_lda.mat. The means and covariances were obtained using the ML estimators derived in Problem 8. The resulting decision boundaries are shown in Figure 5. The classification error (measured as the percentage of incorrect classifications on the training data) was 7%. Note that the decision functions δ_c(x) are no longer linear in x, so now we can obtain non-linear decision boundaries. In fact, the leading-order term −½ xᵀΣ_c⁻¹x is a quadratic form which, together with the lower-order terms, can be used to express any second-degree curve (i.e., a conic section in two dimensions) of the form

    a x_1^2 + b x_1 x_2 + c x_2^2 + d x_1 + f x_2 + g = 0

as the decision boundary separating classes. In contrast to the linear decision boundaries in Problem 9, by allowing different covariance matrices for each class, we can now achieve a variety of decision boundaries such as lines, parabolas, hyperbolas, and ellipses [5]. As shown in Figure 5, the resulting decision boundaries are curvilinear and, as a result, can achieve a lower classification error due to the increase in their degrees of freedom.
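The quadratic discriminant is equally direct to evaluate; the NumPy sketch below computes δ_c(x) for a batch of points given per-class parameters (the parameters shown are made up, not the ML estimates from apple_lda.mat).

```python
import numpy as np

def qda_discriminants(X, mus, Sigmas):
    """Quadratic discriminants from Problem 10:
    delta_c(x) = -0.5 x^T S_c^{-1} x + x^T S_c^{-1} mu_c
                 - 0.5 mu_c^T S_c^{-1} mu_c - 0.5 log|S_c|."""
    cols = []
    for mu, S in zip(mus, Sigmas):
        Sinv = np.linalg.inv(S)
        quad = -0.5 * np.einsum('ni,ij,nj->n', X, Sinv, X)
        lin = X @ Sinv @ mu
        const = -0.5 * mu @ Sinv @ mu - 0.5 * np.linalg.slogdet(S)[1]
        cols.append(quad + lin + const)
    return np.column_stack(cols)

# mus/Sigmas would be the per-class ML estimates; here they are made up.
mus = [np.zeros(2), np.array([3.0, 1.0])]
Sigmas = [np.eye(2), np.array([[2.0, -0.5], [-0.5, 0.5]])]
X = np.array([[0.2, 0.1], [2.8, 1.2], [1.5, 0.5]])
labels = 1 + np.argmax(qda_discriminants(X, mus, Sigmas), axis=1)
print(labels)
```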

Problem 11

An alternative approach to learning quadratic decision boundaries is to map the inputs into an extended polynomial feature space. Implement this method and compare the resulting decision boundaries to those in Problem 10.

As discussed in class, we can utilize the linear regression framework to implement a naïve classifier as follows. First, we form an indicator matrix Y such that

    Y_{ic} = \begin{cases} 1 & \text{if } y_i = c, \\ 0 & \text{otherwise}, \end{cases}

where the possible classes are labeled 1, ..., C. Recall from Problem 10 that any quadratic curve can be written as

    a x_1^2 + b x_1 x_2 + c x_2^2 + d x_1 + f x_2 + g = 0,

where {a, b, c, d, f, g} are unknown regression coefficients. As a result, we can define a design matrix with extended polynomial features as follows:

    X = \begin{bmatrix}
        1 & x_1(1) & x_2(1) & x_1(1)\, x_2(1) & x_1^2(1) & x_2^2(1) \\
        1 & x_1(2) & x_2(2) & x_1(2)\, x_2(2) & x_1^2(2) & x_2^2(2) \\
        \vdots & & & & & \vdots \\
        1 & x_1(N) & x_2(N) & x_1(N)\, x_2(N) & x_1^2(N) & x_2^2(N)
        \end{bmatrix},

where x_i(j) denotes the i-th coordinate of the j-th training sample. (Note that inclusion of the cross term x_1 x_2 had little effect on the decision regions.)

Figure 6: Quadratic decision boundaries/regions using extended polynomial features (Classes 1-3 shown in the x_1, x_2 plane).

Figure 7: Comparison of classification results in Problems 10 and 11. (a) Decision boundaries using extended LDA. (b) Decision boundaries using extended features.

Together, X and Y define C independent linear regression problems with least squares solutions given by

    \hat{W} = (X^T X)^{-1} X^T Y,

where the regression coefficients for each class correspond to the columns of Ŵ. For a general set of samples x, we form the extended design matrix X̃, with the corresponding linear predictions given by

    \hat{Y} = \tilde{X} (X^T X)^{-1} X^T Y = \tilde{X} \hat{W},

as shown in class. Finally, we assign a class label Ĉ by choosing the column of Ŷ with the largest value (independently for each row/sample).

Using prob11.m, this classification procedure was applied to the data in apple_lda.mat. The resulting decision boundaries are shown in Figure 6. The classification error (measured as the percentage of incorrect classifications on the training data) was 9.3%. Since the regression features were extended to allow quadratic decision boundaries, we find that the resulting regions are delineated by curvilinear borders.

While the classification error is lower than in Problem 9, we find that the resulting decision regions possess some odd and undesirable properties. Compare the decision boundaries/regions in Figures 7(a) and 7(b). Using the extended linear discriminant analysis (LDA) of Problem 10, we find that the decision boundaries are quadratic and, assuming underlying Gaussian distributions, seem to make reasonable generalizations. More specifically, examine the decision region for the second class (green) in Figure 7(a): not surprisingly, it continues on the other side of Class 3. This is reasonable, since Class 3 varies in the opposite direction of Class 2. Unfortunately, this generalization behavior is not reflected in the decision regions found in this problem. Surprisingly, we find that Class 3 becomes the most likely class label in the bottom-left corner of Figure 7(b). This is inconsistent with the covariance matrix of the underlying distribution (assuming it is truly Gaussian). In conclusion, we find that the results in Problems 10 and 11 illustrate the benefits of generative models over naïve classification schemes (when there is some knowledge of the underlying distributions).
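The indicator-matrix regression described above reduces to a single least squares solve once the extended features are built. The following NumPy sketch (synthetic three-class data standing in for apple_lda.mat; all names illustrative) implements the feature map, the fit of Ŵ, and the argmax prediction rule.

```python
import numpy as np

def quad_features(X):
    """Extended quadratic feature map [1, x1, x2, x1*x2, x1^2, x2^2] per sample."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1 * x2, x1 ** 2, x2 ** 2])

def fit_indicator_regression(X, y, n_classes):
    """Least squares regression onto a one-hot indicator matrix Y."""
    Phi = quad_features(X)
    Y = (y[:, None] == np.arange(1, n_classes + 1)[None, :]).astype(float)
    W_hat = np.linalg.lstsq(Phi, Y, rcond=None)[0]   # columns = per-class coefficients
    return W_hat

def predict_indicator_regression(X, W_hat):
    """Pick the class whose regression output is largest for each sample."""
    return 1 + np.argmax(quad_features(X) @ W_hat, axis=1)

# Illustrative data standing in for apple_lda.mat (three classes labeled 1..3).
rng = np.random.default_rng(8)
means = [np.array([0, 0]), np.array([3, 1]), np.array([1, 3])]
X = np.vstack([rng.multivariate_normal(m, np.eye(2) * 0.5, 100) for m in means])
y = np.repeat([1, 2, 3], 100)
W_hat = fit_indicator_regression(X, y, n_classes=3)
print(np.mean(predict_indicator_regression(X, W_hat) != y))   # training error rate
```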

References

[1] Christopher M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.

[2] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag, 2006.

[3] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification (Second Edition). Wiley-Interscience, 2001.

[4] Sam Roweis. Matrix identities. roweis/notes/matrixid.pdf.

[5] Eric W. Weisstein. Quadratic curve.

[6] Eric W. Weisstein. Quadratic curve discriminant. QuadraticCurveDiscriminant.html.


More information

Chap 2. Linear Classifiers (FTH, ) Yongdai Kim Seoul National University

Chap 2. Linear Classifiers (FTH, ) Yongdai Kim Seoul National University Chap 2. Linear Classifiers (FTH, 4.1-4.4) Yongdai Kim Seoul National University Linear methods for classification 1. Linear classifiers For simplicity, we only consider two-class classification problems

More information

10-701/ Machine Learning - Midterm Exam, Fall 2010

10-701/ Machine Learning - Midterm Exam, Fall 2010 10-701/15-781 Machine Learning - Midterm Exam, Fall 2010 Aarti Singh Carnegie Mellon University 1. Personal info: Name: Andrew account: E-mail address: 2. There should be 15 numbered pages in this exam

More information

Classification 2: Linear discriminant analysis (continued); logistic regression

Classification 2: Linear discriminant analysis (continued); logistic regression Classification 2: Linear discriminant analysis (continued); logistic regression Ryan Tibshirani Data Mining: 36-462/36-662 April 4 2013 Optional reading: ISL 4.4, ESL 4.3; ISL 4.3, ESL 4.4 1 Reminder:

More information

Discriminant analysis and supervised classification

Discriminant analysis and supervised classification Discriminant analysis and supervised classification Angela Montanari 1 Linear discriminant analysis Linear discriminant analysis (LDA) also known as Fisher s linear discriminant analysis or as Canonical

More information

A Study of Relative Efficiency and Robustness of Classification Methods

A Study of Relative Efficiency and Robustness of Classification Methods A Study of Relative Efficiency and Robustness of Classification Methods Yoonkyung Lee* Department of Statistics The Ohio State University *joint work with Rui Wang April 28, 2011 Department of Statistics

More information

COMP 551 Applied Machine Learning Lecture 3: Linear regression (cont d)

COMP 551 Applied Machine Learning Lecture 3: Linear regression (cont d) COMP 551 Applied Machine Learning Lecture 3: Linear regression (cont d) Instructor: Herke van Hoof (herke.vanhoof@mail.mcgill.ca) Slides mostly by: Class web page: www.cs.mcgill.ca/~hvanho2/comp551 Unless

More information

Naive Bayes and Gaussian Bayes Classifier

Naive Bayes and Gaussian Bayes Classifier Naive Bayes and Gaussian Bayes Classifier Ladislav Rampasek slides by Mengye Ren and others February 22, 2016 Naive Bayes and Gaussian Bayes Classifier February 22, 2016 1 / 21 Naive Bayes Bayes Rule:

More information

Notes on Discriminant Functions and Optimal Classification

Notes on Discriminant Functions and Optimal Classification Notes on Discriminant Functions and Optimal Classification Padhraic Smyth, Department of Computer Science University of California, Irvine c 2017 1 Discriminant Functions Consider a classification problem

More information

Motivating the Covariance Matrix

Motivating the Covariance Matrix Motivating the Covariance Matrix Raúl Rojas Computer Science Department Freie Universität Berlin January 2009 Abstract This note reviews some interesting properties of the covariance matrix and its role

More information

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING DATE AND TIME: August 30, 2018, 14.00 19.00 RESPONSIBLE TEACHER: Niklas Wahlström NUMBER OF PROBLEMS: 5 AIDING MATERIAL: Calculator, mathematical

More information

Minimum Error Rate Classification

Minimum Error Rate Classification Minimum Error Rate Classification Dr. K.Vijayarekha Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur-613 401 Table of Contents 1.Minimum Error Rate Classification...

More information

Machine Learning and Pattern Recognition Density Estimation: Gaussians

Machine Learning and Pattern Recognition Density Estimation: Gaussians Machine Learning and Pattern Recognition Density Estimation: Gaussians Course Lecturer:Amos J Storkey Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh 10 Crichton

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Bayesian Classification Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB CSE 474/574

More information

CS534 Machine Learning - Spring Final Exam

CS534 Machine Learning - Spring Final Exam CS534 Machine Learning - Spring 2013 Final Exam Name: You have 110 minutes. There are 6 questions (8 pages including cover page). If you get stuck on one question, move on to others and come back to the

More information