CS 195-5: Machine Learning Problem Set 1
Douglas Lanman, 7 September

1 Regression

Problem 1

Show that the prediction errors $y - f(x; \hat{w})$ are necessarily uncorrelated with any linear function of the training inputs. That is, show that for any $a \in \mathbb{R}^{d+1}$, $\hat{\sigma}(e, a^T x) = 0$, where $e_i = y_i - \hat{w}^T x_i$ is the prediction error for the $i$-th training example.

First, recall that the least squares estimate of the linear regression parameters is given by

$$\hat{w} = \operatorname{argmin}_w \sum_{i=1}^n \Big( y_i - w_0 - \sum_{j=1}^d w_j x_j^{(i)} \Big)^2 = \operatorname{argmin}_w \sum_{i=1}^n (y_i - w^T x_i)^2 = \operatorname{argmin}_w \sum_{i=1}^n e_i^2, \quad (1)$$

where we have augmented the inputs $x \in \mathbb{R}^d$ by adding 1 as the zeroth dimension such that $x_0 = 1$. Recall that the minimum (or maximum) of the argument of Equation 1 is achieved where the partial derivative with respect to every regression parameter equals zero. Differentiating with respect to $w_0$ and equating with zero gives

$$\frac{\partial}{\partial w_0} \sum_{i=1}^n \Big( y_i - w_0 - \sum_{j=1}^d w_j x_j^{(i)} \Big)^2 = -2 \sum_{i=1}^n \Big( y_i - w_0 - \sum_{j=1}^d w_j x_j^{(i)} \Big) = -2 \sum_{i=1}^n e_i = 0 \;\Rightarrow\; \bar{e} = 0. \quad (2)$$

In other words, a necessary condition for $\hat{w}$ is that the prediction errors have zero mean. Similarly, differentiating with respect to any $w_j$ (for $j \in \{1, \ldots, d\}$) and equating with zero gives

$$\frac{\partial}{\partial w_j} \sum_{i=1}^n \Big( y_i - w_0 - \sum_{k=1}^d w_k x_k^{(i)} \Big)^2 = -2 \sum_{i=1}^n e_i \, x_j^{(i)} = 0 \;\Rightarrow\; \sum_{i=1}^n e_i \, x_i = 0. \quad (3)$$

As in class, this result implies that the prediction errors are uncorrelated with the training data. We now turn our attention to proving the claim that the prediction errors are necessarily uncorrelated with any linear function of the training inputs. Recall that the sample covariance is written

$$\hat{\sigma}(e, a^T x) = \frac{1}{n} \sum_{i=1}^n (e_i - \bar{e})\big(a^T x_i - \overline{a^T x}\big) = \frac{1}{n} \sum_{i=1}^n e_i \big(a^T x_i - \overline{a^T x}\big), \quad (4)$$
where we have applied Equation 2 to conclude $\bar{e} = 0$. Now let us examine the term $\overline{a^T x}$:

$$\overline{a^T x} = \frac{1}{n} \sum_{i=1}^n a^T x_i = a^T \Big( \frac{1}{n} \sum_{i=1}^n x_i \Big) = a^T \bar{x}.$$

Note that we are allowed to bring $a^T$ out of the summation by linearity. Also note that $\overline{a^T x}$ is a scalar quantity and, as a result, we can write Equation 4 as

$$\hat{\sigma}(e, a^T x) = \frac{1}{n} \, a^T \Big( \sum_{i=1}^n e_i x_i \Big) - \frac{1}{n} \, a^T \bar{x} \Big( \sum_{i=1}^n e_i \Big) = 0,$$

where, by Equations 2 and 3, both summations equal zero. As a result, we have proven the desired result: $\hat{\sigma}(e, a^T x) = 0$. (QED)
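The orthogonality conditions above are easy to confirm numerically. The following Python sketch (on made-up data; the original problem set used MATLAB) fits least squares with an intercept and checks that the residuals have zero mean and zero sample covariance with an arbitrary linear function $a^T x$ of the inputs:

```python
import numpy as np

# Simulated regression problem with an augmented x_0 = 1 column.
rng = np.random.default_rng(0)
n, d = 50, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ w_hat                      # prediction errors e_i

a = rng.normal(size=d + 1)             # an arbitrary a in R^{d+1}
z = X @ a                              # the linear function a^T x_i
cov = np.mean(e * (z - z.mean()))      # sample covariance sigma_hat(e, a^T x)
print(e.mean(), cov)                   # both numerically zero
```

Both printed quantities are zero to machine precision, matching Equations 2 and 3.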
Problem 2

Suppose that the data in a regression problem are scaled by multiplying the $j$-th dimension of the input by a non-zero number $c_j$. Let $\tilde{x} = [1, c_1 x_1, \ldots, c_d x_d]^T$ denote a single scaled data point and let $\tilde{X}$ represent the corresponding design matrix. Similarly, let $\hat{w}$ be the maximum likelihood (ML) estimate of the regression parameters from the unscaled $X$, and let $\tilde{w}$ be the solution obtained from the scaled $\tilde{X}$. Show that scaling does not change optimality, in the sense that $\hat{w}^T x = \tilde{w}^T \tilde{x}$.

First, note that scaling can be represented as a linear operator $C$ composed of the scaling factors along its main diagonal,

$$C = \operatorname{diag}(1, c_1, \ldots, c_d).$$

Using the scaling operator $C$, we can express the scaled inputs $\tilde{x}$ and design matrix $\tilde{X}$ as functions of $x$ and $X$, respectively:

$$\tilde{x} = C x, \qquad \tilde{X} = X C. \quad (5)$$

As was demonstrated in class, under the Gaussian noise model the ML estimate of the regression parameters is given by $\hat{w} = (X^T X)^{-1} X^T y$. Using the expressions in Equation 5, we find

$$\tilde{w} = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T y = \big( (XC)^T XC \big)^{-1} (XC)^T y = \big( C^T X^T X C \big)^{-1} C^T X^T y.$$

At this point, we can apply the following matrix identity: if the individual inverses $A^{-1}$ and $B^{-1}$ exist, then $(AB)^{-1} = B^{-1} A^{-1}$. Since $C$ is diagonal with non-zero entries, $C^{-1}$ exists; since $X^T X$ is assumed invertible, $(X^T X C)^{-1}$ must also exist. Applying the identity twice,

$$\tilde{w} = \big( C^T (X^T X C) \big)^{-1} C^T X^T y = (X^T X C)^{-1} (C^T)^{-1} C^T X^T y = C^{-1} (X^T X)^{-1} X^T y = C^{-1} \hat{w}. \quad (6)$$

To prove that the scaled solution is optimal, we apply Equations 5 and 6 as follows:

$$\tilde{w}^T \tilde{x} = (C^{-1} \hat{w})^T C x = \hat{w}^T (C^{-1})^T C x.$$

Recall from [4] that $(A^{-1})^T = (A^T)^{-1}$. As a result, $(C^{-1})^T C = (C^T)^{-1} C$. Since $C$ is diagonal (hence symmetric), $C = C^T$ and therefore $(C^{-1})^T C = I$. In conclusion, we have proven the desired result: $\hat{w}^T x = \tilde{w}^T \tilde{x}$. (QED)
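A quick numerical check of Equation 6 (a Python sketch with arbitrary, made-up scale factors $c_j$): refitting on the column-scaled design matrix $XC$ yields $\tilde{w} = C^{-1}\hat{w}$, so the fitted values are unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 40, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
y = rng.normal(size=n)

C = np.diag([1.0, 2.0, -0.5, 3.0])     # c_0 = 1 preserves the bias column
w_hat = np.linalg.lstsq(X, y, rcond=None)[0]
w_tilde = np.linalg.lstsq(X @ C, y, rcond=None)[0]

print(np.allclose(w_tilde, np.linalg.inv(C) @ w_hat))   # w_tilde = C^{-1} w_hat
print(np.allclose(X @ w_hat, (X @ C) @ w_tilde))        # identical predictions
```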
Problem 3

Derive the maximum likelihood (ML) estimate of $\sigma^2$ under the Gaussian noise model.

Recall that the likelihood $\mathcal{P}$ of the noise variance $\sigma^2$, given the observations $X = [x_1, \ldots, x_N]^T$ and $Y = [y_1, \ldots, y_N]^T$, is defined to be $\mathcal{P}(Y; w, \sigma) \triangleq p(Y \mid X, w, \sigma)$. Also recall that, under the Gaussian noise model, the label $y$ is a random variable with the following distribution:

$$y = f(x; w) + \nu, \quad \nu \sim \mathcal{N}(0, \sigma^2), \qquad p(y \mid x, w, \sigma) = \mathcal{N}(y; f(x; w), \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \exp\Big( -\frac{(y - f(x; w))^2}{2\sigma^2} \Big).$$

Assuming that the observations are independent and identically distributed (i.i.d.), the likelihood can be expressed as the product

$$\mathcal{P}(Y; w, \sigma) = \prod_{i=1}^N p(y_i \mid x_i, w, \sigma) = \prod_{i=1}^N \frac{1}{\sigma\sqrt{2\pi}} \exp\Big( -\frac{(y_i - f(x_i; w))^2}{2\sigma^2} \Big),$$

with the corresponding ML estimator for $\sigma$ given by the maximizer of this product. As was done in class, we consider the log-likelihood $\ell$, which converts this product into a summation:

$$\ell(Y; w, \sigma) \triangleq \log \mathcal{P}(Y; w, \sigma) = -\frac{1}{2\sigma^2} \sum_{i=1}^N (y_i - f(x_i; w))^2 - N \log(\sigma\sqrt{2\pi}). \quad (7)$$

Since the logarithm is a monotonically-increasing function, the ML estimator becomes

$$\hat{\sigma}_{ML} = \operatorname{argmax}_\sigma \Big\{ -\frac{1}{2\sigma^2} \sum_{i=1}^N (y_i - f(x_i; w))^2 - N \log(\sigma\sqrt{2\pi}) \Big\}.$$

The maximum (or minimum) of this function will necessarily be obtained where the derivative with respect to $\sigma$ equals zero:

$$\frac{\partial}{\partial\sigma} \Big\{ -\frac{1}{2\sigma^2} \sum_{i=1}^N (y_i - f(x_i; w))^2 - N \log(\sigma\sqrt{2\pi}) \Big\} = \frac{1}{\sigma^3} \sum_{i=1}^N (y_i - f(x_i; w))^2 - \frac{N}{\sigma} = 0.$$
In conclusion, the ML estimate of $\sigma^2$ is given by the following expression:

$$\hat{\sigma}^2_{ML} = \frac{1}{N} \sum_{i=1}^N (y_i - f(x_i; w))^2. \quad (8)$$

Note that this expression maximizes the likelihood of the observations assuming $w$ is known. If this is not the case, then the ML estimate $\hat{w}_{ML}$ should be substituted for $w$ in Equation 8. As was shown in class, $\hat{w}_{ML}$ is independent of $\sigma$ and, as a result, can be estimated independently. In other words, if the ML estimates of both $w$ and $\sigma^2$ are required, then the following expressions should be used:

$$\hat{w}_{ML} = \operatorname{argmin}_w \sum_{i=1}^N (y_i - f(x_i; w))^2, \qquad \hat{\sigma}^2_{ML} = \frac{1}{N} \sum_{i=1}^N (y_i - f(x_i; \hat{w}_{ML}))^2.$$
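As a sanity check of Equation 8, the following Python sketch (on simulated data with a known noise variance of 0.25) computes the ML variance estimate as the mean squared residual after a least squares fit:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 2000
x = rng.normal(size=N)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=N)   # true sigma^2 = 0.25

X = np.column_stack([np.ones(N), x])
w_ml = np.linalg.lstsq(X, y, rcond=None)[0]          # w_hat_ML first
sigma2_ml = np.mean((y - X @ w_ml) ** 2)             # (1/N) sum of squared errors
print(sigma2_ml)                                     # close to 0.25
```

Note the $1/N$ normalization: the ML estimate is the biased variance estimator, which converges to the true noise variance as $N$ grows.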
Problem 4

Part 1: Consider the noise model $y = f(x; w) + \nu$ in which $\nu$ is drawn from the following distribution:

$$p(\nu) = C \exp(-\nu^4), \qquad C = \Big( \int_{-\infty}^{\infty} \exp(-x^4)\, dx \Big)^{-1}.$$

Derive the conditions on the maximum likelihood estimate $\hat{w}_{ML}$. What is the corresponding loss function? Compare this loss function with squared loss on the interval $[-3, 3]$. How do you expect these differences to affect regression?

As in Problem 3, let's begin by defining the distribution of the label $y$, given the input $x$:

$$p(y \mid x, w) = C \exp\big( -(y - f(x; w))^4 \big).$$

Once again, we assume that the observations are i.i.d. such that the likelihood $\mathcal{P}$ is given by

$$\mathcal{P}(Y; w) = \prod_{i=1}^N p(y_i \mid x_i, w) = \prod_{i=1}^N C \exp\big( -(y_i - f(x_i; w))^4 \big).$$

The log-likelihood $\ell$ is then given by

$$\ell(Y; w) = \log \mathcal{P}(Y; w) = N \log C - \sum_{i=1}^N (y_i - f(x_i; w))^4,$$

with the corresponding ML estimate for $w$ given by

$$\hat{w}_{ML} = \operatorname{argmax}_w \ell(Y; w) = \operatorname{argmin}_w \sum_{i=1}^N (y_i - f(x_i; w))^4 = \operatorname{argmin}_w \sum_{i=1}^N L_4(y_i, f(x_i; w)),$$

where $L_4$ is the quadrupled loss function defined as $L_4(y, \hat{y}) = (y - \hat{y})^4$.

The squared loss $L(y, \hat{y}) = (y - \hat{y})^2$ and the quadrupled loss were plotted (as functions of $y - \hat{y}$) in Figure 1(a) using the Matlab script prob4.m. From this plot it is apparent that $L_4$ imposes a greater penalty on outliers than $L$. That is, as the prediction error $y - \hat{y}$ increases, regression with quadrupled loss will bias toward outliers more than regression with squared loss.

Part 2: Now consider the three-point data set $X = [\;,\;,\;]^T$ and $Y = [\;,\;,\;]^T$ given in the assignment. Find the numerical solution for $\hat{w}_{ML}$ using both linear least squares (i.e., squared loss) and quadrupled loss, and report the empirical losses. Plot the corresponding functions and explain how what is seen results from the differences in loss functions.

Using prob4.m, the optimal values of $(w_0, w_1)$ were determined using the provided testws.m function for both the squared and quadrupled losses. The results are tabulated below:

Loss Function | $\hat{w}_0$ | $\hat{w}_1$ | Empirical Loss
Squared loss (LSQ): $L(y, \hat{y})$ | .33 | . | .
Quadrupled loss: $L_4(y, \hat{y})$ | . | . | .3
Figure 1: ML estimation using exhaustive search under varying loss functions. (a) Comparison of loss functions. (b) Comparison of ML-estimated models. (c) Squared loss surface. (d) Quadrupled loss surface.

Note that the empirical losses $L_N$ and $L_N^4$ are given by the average sum of squared (respectively, quadrupled) errors, e.g.,

$$L_N(w) = \frac{1}{N} \sum_{i=1}^N (y_i - f(x_i; w))^2. \quad (9)$$

Using the values in the table, the corresponding regression models were plotted in Figure 1(b), with the associated loss surfaces shown in Figures 1(c) and 1(d). Note that the two noise models resulted in lines with identical slopes (i.e., equal values of $w_1$); however, the y-intercepts (given by $w_0$) differed. This is consistent with our previous observation that the quadrupled loss model biases toward outliers. For this example, the estimated line moves toward the second data point, since this point can be viewed as an outlier.
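The exhaustive search of prob4.m and testws.m can be sketched in Python. The three data points below are a hypothetical stand-in (the original values were not preserved here), chosen so that one point acts as an outlier; both losses are minimized over the same grid of $(w_0, w_1)$ candidates:

```python
import numpy as np

# Hypothetical three-point data set with an outlier at (1, 2).
x = np.array([0.0, 1.0, 2.0])
y = np.array([0.0, 2.0, 2.0])

# Exhaustive grid search over (w0, w1), vectorized via broadcasting.
w0, w1 = np.meshgrid(np.linspace(-2, 3, 251), np.linspace(-2, 3, 251))
pred = w0[..., None] + w1[..., None] * x      # predictions over the whole grid
resid = np.abs(y - pred)

def grid_argmin(loss):
    i = np.unravel_index(loss.argmin(), loss.shape)
    return w0[i], w1[i]

w_sq = grid_argmin(np.sum(resid ** 2, axis=-1))   # squared loss minimizer
w_q4 = grid_argmin(np.sum(resid ** 4, axis=-1))   # quadrupled loss minimizer
print(w_sq, w_q4)
```

On such data the quadrupled-loss fit sits closer to the outlier, mirroring the behavior described above.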
Problem 5

Apply linear and quadratic polynomial regression to the data in meteodata.mat using linear least squares. Use k-fold cross-validation to select the best model and plot the results. Report the empirical loss and log-likelihood. Based on the plot, comment on the Gaussian noise model.

Recall from class that we can solve for the least squares polynomial regression coefficients $\hat{w}$ by using the extended design matrix $\tilde{X}$ such that

$$\hat{w} = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T y, \qquad \tilde{X} = \begin{bmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^d \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_N & x_N^2 & \cdots & x_N^d \end{bmatrix},$$

where $d$ is the degree of the polynomial and $\{x_1, \ldots, x_N\}$ and $y$ are the observed points and their associated labels, respectively. To prevent numerical round-off errors, this method was applied to the column-normalized design matrix using degexpand.m within the main script prob5.m. The resulting linear and quadratic polynomials are shown in Figure 2(a). The fitting parameters obtained using all data points and up to a fourth-order polynomial are tabulated below:

Polynomial Degree | Empirical Loss | Log-likelihood | Cross-Validation Score
Linear (d = 1) | … | … | …
Quadratic (d = 2) | … | … | …
Cubic (d = 3) | … | … | …
Quartic (d = 4) | … | … | …

Note that the empirical loss $L_N$ is defined to be the average sum of squared errors as given by Equation 9. The log-likelihood of the data (under a Gaussian model) was derived in class and is given by Equation 7 as

$$\ell(Y; w, \sigma) = -\frac{1}{2\sigma^2} \sum_{i=1}^N (y_i - f(x_i; w))^2 - N \log(\sigma\sqrt{2\pi}),$$

where $\sigma^2 = \hat{\sigma}^2_{ML}$ is the maximum likelihood estimate of $\sigma^2$ under a Gaussian noise model (as given by Equation 8 in Problem 3).

At this point, we turn our attention to the model-order selection task (i.e., deciding what degree of polynomial best represents the data). As discussed in class, we will use k-fold cross-validation to select the best model. First, we partition the data into k roughly equal parts (see lines 5-7 of prob5.m).
Next, we perform k sequential trials in which we train on all but the $i$-th fold of the data and then measure the empirical error on the held-out samples. In general, we formulate the k-fold cross-validation score as

$$\hat{L}_k = \frac{1}{N} \sum_{i=1}^{k} \sum_{j \in \text{fold } i} \big( y_j - f(x_j; \hat{w}_{-i}) \big)^2,$$

where $\hat{w}_{-i}$ is fit to all samples except those in the $i$-th fold. The resulting cross-validation scores are tabulated above (for up to a fourth-order polynomial). Since the lowest cross-validation score is achieved by the quadratic polynomial, we select it as the best model.
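The cross-validation score can be sketched in Python (the original prob5.m is MATLAB, and its exact fold partitioning may differ). On synthetic quadratic data, the degree-2 model wins, mirroring the selection above:

```python
import numpy as np

def kfold_cv_score(x, y, degree, k, seed=0):
    """Average held-out squared error for a polynomial fit of a given degree."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    err = 0.0
    for i in range(k):
        test = folds[i]
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        w = np.polyfit(x[train], y[train], degree)      # fit on k-1 folds
        err += np.sum((y[test] - np.polyval(w, x[test])) ** 2)
    return err / len(x)

rng = np.random.default_rng(3)
x = rng.uniform(-2, 2, size=100)
y = x ** 2 - x + 0.1 * rng.normal(size=100)             # quadratic ground truth
scores = {d: kfold_cv_score(x, y, d, k=10) for d in (1, 2, 3)}
print(scores)
```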
Figure 2: Comparison of linear and quadratic polynomials estimated using linear least squares. Both panels plot reported temperature (Kelvin) against level of seismic activity (Richters). (a) Least squares polynomial regression results. (b) Quadratic polynomial (selected by cross-validation).

In conclusion, we find that the quadratic polynomial has the lowest cross-validation score. However, as shown in Figure 2(b), it is immediately apparent that the Gaussian noise model does not accurately represent the underlying distribution; more specifically, if the underlying distribution were Gaussian, we would expect half of the data points to lie above the model prediction (and the other half below). This is clearly not the case for this example, motivating the alternate noise model we analyze in Problem 6.
Problem 6

Part 1: Consider the noise model $y = f(x; w) + \nu$ in which $\nu$ is drawn from $p(\nu)$ as follows:

$$p(\nu) = \begin{cases} e^{-\nu} & \text{if } \nu > 0, \\ 0 & \text{otherwise.} \end{cases}$$

Perform k-fold cross-validation for linear and quadratic regression under this noise model, using exhaustive numerical search similar to that used in Problem 4. Plot the selected model and report the empirical loss and the log-likelihood under the estimated exponential noise model.

As in Problems 3 and 4, let's begin by defining the distribution of the label $y$, given the input $x$:

$$p(y \mid x, w) = \begin{cases} \exp\big( -(y - f(x; w)) \big) & \text{if } y > f(x; w), \\ 0 & \text{otherwise.} \end{cases} \quad (10)$$

Once again, we assume that the observations are i.i.d. such that the likelihood $\mathcal{P}$ is given by

$$\mathcal{P}(Y; w) = \prod_{i=1}^N p(y_i \mid x_i, w) = \prod_{i=1}^N \begin{cases} \exp\big( f(x_i; w) - y_i \big) & \text{if } y_i > f(x_i; w), \\ 0 & \text{otherwise.} \end{cases}$$

The log-likelihood $\ell$ is then given by

$$\ell(Y; w) = \log \mathcal{P}(Y; w) = \begin{cases} \sum_{i=1}^N \big( f(x_i; w) - y_i \big) & \text{if } y_i > f(x_i; w) \text{ for all } i, \\ -\infty & \text{otherwise,} \end{cases} \quad (11)$$

since the logarithm is a monotonic function and $\lim_{x \to 0^+} \log x = -\infty$. The corresponding ML estimate for $w$ is given by

$$\hat{w}_{ML} = \operatorname{argmax}_w \ell(Y; w) = \operatorname{argmin}_w \sum_{i=1}^N \big( y_i - f(x_i; w) \big) \;\;\text{subject to}\;\; y_i > f(x_i; w) \text{ for all } i. \quad (12)$$

Note that Equations 11 and 12 prevent any prediction $f(x_i; w)$ from lying above the corresponding label $y_i$. As a result, the exponential noise distribution effectively leads to a model tracing the lower envelope of the training data. Using the maximum likelihood formulation in Equation 12, we can solve for the optimal regression parameters using an exhaustive search (similar to what was done in Problem 4). This approach was applied to all the data points in prob6.m. The resulting best-fit linear and quadratic polynomial models are shown in Figure 3(a). The fitting parameters obtained using all the data points and up to a second-order polynomial are tabulated below.
Polynomial Degree | Empirical Loss | Log-likelihood | Cross-Validation Score
Linear (d = 1) | … | … | …
Quadratic (d = 2) | … | … | …

Note that the empirical loss $L_N$ was calculated using Equation 9, and the log-likelihood of the data (under the exponential noise model) was determined using Equation 11. Model selection was performed using k-fold cross-validation as in Problem 5; the resulting scores are tabulated above. Since the lowest cross-validation score was achieved by the quadratic polynomial, we select it as the best model and plot the result in Figure 3(b). Comparing the
Figure 3: ML estimation using exhaustive search under an exponential noise model. Both panels plot reported temperature (Kelvin) against level of seismic activity (Richters). (a) Polynomial regression results. (b) Quadratic polynomial (selected by cross-validation).

model in Figure 3(b) with that in Figure 2(b), we conclude that the quadratic polynomial under the exponential noise model better approximates the underlying distribution; more specifically, the quadratic polynomial (under an exponential noise model) achieves a log-likelihood of -359., whereas it only achieves a log-likelihood of -5. under the Gaussian noise model. It is also important to note that the Gaussian noise model leads to a lower squared loss (i.e., empirical loss); however, this is by construction of the least squares estimator used in Problem 5. This highlights the important observation that a lower empirical loss does not necessarily indicate a better model: the comparison only applies when the choice of noise model appropriately matches the actual distribution.

Part 2: Now evaluate the polynomials selected under the Gaussian and exponential noise models for the data in Problem 5. Report which performs better in terms of likelihood and empirical loss.

In both Problems 5 and 6, the quadratic polynomial was selected as the best model by k-fold cross-validation. In order to gauge the generalization capabilities of these models, the model parameters were used to predict the Problem 5 samples. The results for each noise model are tabulated below.

Noise Model | Empirical Loss | Log-likelihood
Gaussian (d = 2) | .3 | -5.
Exponential (d = 2) | … | …

In conclusion, we find that the Gaussian model achieves a lower empirical loss on these samples. This is expected, since the Gaussian model is equivalent to the least squares estimator, which minimizes empirical loss.
The exponential model, however, achieves a significantly higher log-likelihood, indicating that it models the underlying data more effectively than the Gaussian noise model. As a result, we reiterate the point made previously: low empirical loss can be achieved even if the noise model does not accurately represent the underlying noise distribution.
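The lower-envelope behavior of Equation 12 is easy to demonstrate. The following Python sketch (on made-up data; the original search lives in prob6.m) performs an exhaustive grid search for a line under the exponential noise model, where the likelihood is zero unless every prediction lies below its label:

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0.0, 1.0, 60)
y = 1.0 + 2.0 * x + rng.exponential(scale=0.5, size=60)   # strictly positive noise

# Grid search over (w0, w1): minimize sum(y - f) over the feasible region.
W0, W1 = np.meshgrid(np.linspace(-1, 3, 201), np.linspace(0, 4, 201))
pred = W0[..., None] + W1[..., None] * x
feasible = np.all(pred < y, axis=-1)                       # y_i > f(x_i; w) for all i
obj = np.where(feasible, np.sum(y - pred, axis=-1), np.inf)
i = np.unravel_index(obj.argmin(), obj.shape)
w0_ml, w1_ml = W0[i], W1[i]
print(w0_ml, w1_ml)        # the fitted line hugs the lower envelope of the data
```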
2 Multivariate Gaussian Distributions

Problem 7

Recall that the probability density function (pdf) of a Gaussian distribution in $\mathbb{R}^d$ is given by

$$p(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\Big( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \Big).$$

Show that a contour corresponding to a fixed value of the pdf is an ellipse in the 2D $(x_1, x_2)$ space.

An iso-contour of the pdf along $p(x) = p_0$ is given by

$$\frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\Big( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \Big) = p_0.$$

Taking the logarithm of each side and rearranging terms gives

$$(x - \mu)^T \Sigma^{-1} (x - \mu) = -2 \log\big( p_0 \, (2\pi)^{d/2} |\Sigma|^{1/2} \big) = C_0,$$

where $C_0 \in \mathbb{R}$ is a constant. At this point it is most convenient to write this expression explicitly as a function of $x_1$ and $x_2$, with $x = [x_1, x_2]^T$, $\mu = [\mu_1, \mu_2]^T$, and

$$\Sigma = \begin{bmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{bmatrix}, \qquad \Sigma^{-1} = \frac{1}{\sigma_1^2 \sigma_2^2 - \sigma_{12}^2} \begin{bmatrix} \sigma_2^2 & -\sigma_{12} \\ -\sigma_{12} & \sigma_1^2 \end{bmatrix}.$$

Multiplying out the quadratic form, we obtain the 2D iso-contours

$$\sigma_2^2 (x_1 - \mu_1)^2 - 2\sigma_{12} (x_1 - \mu_1)(x_2 - \mu_2) + \sigma_1^2 (x_2 - \mu_2)^2 = C_1, \quad (13)$$

where $C_1 = (\sigma_1^2 \sigma_2^2 - \sigma_{12}^2)\, C_0$. Recall from [5] that a general quadratic curve can be written as

$$a x_1^2 + b x_1 x_2 + c x_2^2 + d x_1 + f x_2 + g = 0,$$

with an associated discriminant $b^2 - 4ac$. Also recall that if $b^2 - 4ac < 0$, then the quadratic curve represents either an ellipse, a circle, or a point (with the latter two being degenerate configurations of an ellipse) [6]. Since Equation 13 has the form of a quadratic curve, it has the discriminant

$$b^2 - 4ac = 4\sigma_{12}^2 - 4\sigma_1^2 \sigma_2^2 = 4\big( \sigma_{12}^2 - \sigma_1^2 \sigma_2^2 \big) \stackrel{?}{<} 0.$$
Recall that the variances of a non-degenerate Gaussian are positive, so $\sigma_1^2, \sigma_2^2 > 0$. As a result, we only need to prove that $\sigma_{12}^2 < \sigma_1^2 \sigma_2^2$ in order to show that Equation 13 represents an ellipse. Recall from class the definition of the correlation coefficient $\rho$,

$$\rho = \frac{\sigma_{12}}{\sigma_1 \sigma_2} \;\Rightarrow\; \sigma_{12}^2 = \rho^2 \sigma_1^2 \sigma_2^2,$$

where $|\rho| < 1$ for any non-degenerate covariance matrix. As a result, $\rho^2 < 1$, which implies $\sigma_{12}^2 < \sigma_1^2 \sigma_2^2$. In conclusion, Equation 13 has the associated discriminant

$$b^2 - 4ac = 4\big( \sigma_{12}^2 - \sigma_1^2 \sigma_2^2 \big) < 0,$$

which by [6] defines an ellipse in the 2D $(x_1, x_2)$ space. (QED)
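A quick numeric check of the discriminant argument (with an arbitrary, made-up 2D covariance): expanding the contour equation with $\Sigma^{-1} = P$ gives $a = P_{11}$, $b = 2P_{12}$, $c = P_{22}$, so $b^2 - 4ac = -4\det(P) < 0$, i.e. an ellipse.

```python
import numpy as np

Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])          # an arbitrary valid covariance
P = np.linalg.inv(Sigma)
a, b, c = P[0, 0], 2.0 * P[0, 1], P[1, 1]
disc = b ** 2 - 4.0 * a * c
print(disc, -4.0 * np.linalg.det(P))    # equal, and negative
```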
Problem 8

Suppose we have a sample $x_1, \ldots, x_n$ drawn from $\mathcal{N}(x; \mu, \Sigma)$. Show that the ML estimate of the mean vector $\mu$ is

$$\hat{\mu}_{ML} = \frac{1}{n} \sum_{i=1}^n x_i$$

and the ML estimate of the covariance matrix $\Sigma$ is

$$\hat{\Sigma}_{ML} = \frac{1}{n} \sum_{i=1}^n (x_i - \hat{\mu}_{ML})(x_i - \hat{\mu}_{ML})^T.$$

Qualitatively, we will repeat the derivation used for the univariate Gaussian in Problem 3. Let's begin by defining the multivariate Gaussian distribution in $\mathbb{R}^d$:

$$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\Big( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \Big).$$

Once again, we assume that the samples are i.i.d. such that the likelihood $\mathcal{P}$ is given by

$$\mathcal{P}(X; \mu, \Sigma) = \prod_{i=1}^n p(x_i \mid \mu, \Sigma).$$

The log-likelihood $\ell$ is then given by

$$\ell(X; \mu, \Sigma) = \log \mathcal{P}(X; \mu, \Sigma) = -n \log\big( (2\pi)^{d/2} |\Sigma|^{1/2} \big) - \frac{1}{2} \sum_{i=1}^n (x_i - \mu)^T \Sigma^{-1} (x_i - \mu), \quad (14)$$

with the corresponding ML estimate for $\mu$ given by

$$\hat{\mu}_{ML} = \operatorname{argmax}_\mu \ell(X; \mu, \Sigma) = \operatorname{argmin}_\mu \sum_{i=1}^n (x_i - \mu)^T \Sigma^{-1} (x_i - \mu),$$

since the first term in Equation 14 is independent of $\mu$. The minimum (or maximum) of this function will occur where the derivative with respect to $\mu$ equals zero. Let's proceed by making the substitution $A = \Sigma^{-1}$. Equating the first partial derivative with zero, we find

$$\frac{\partial}{\partial \mu} \sum_{i=1}^n (x_i - \mu)^T A (x_i - \mu) = \sum_{i=1}^n \frac{\partial}{\partial \mu} \big[ x_i^T A x_i - x_i^T A \mu - \mu^T A x_i + \mu^T A \mu \big] = 0.$$

Note that the first term is independent of $\mu$ and can be eliminated:

$$\sum_{i=1}^n \Big\{ -\frac{\partial (x_i^T A \mu)}{\partial \mu} - \frac{\partial (\mu^T A x_i)}{\partial \mu} + \frac{\partial (\mu^T A \mu)}{\partial \mu} \Big\} = 0.$$
We can now apply the following identities for the derivatives of scalar and matrix/vector forms [4]:

$$\frac{\partial (Ax)}{\partial x} = \frac{\partial (x^T A)}{\partial x} = A, \qquad \frac{\partial (x^T A x)}{\partial x} = (A + A^T) x.$$

Applying these expressions to the previous equation gives

$$\sum_{i=1}^n \big\{ -A^T x_i - A x_i + (A + A^T)\mu \big\} = 0.$$

Recall that $\Sigma$ is a real, symmetric matrix such that $\Sigma^T = \Sigma$ [4]. In addition, we have $A^T = A$ since $A = \Sigma^{-1}$ is also real and symmetric. Applying these identities to the previous equation, we find

$$2A \sum_{i=1}^n (\mu - x_i) = 0 \;\Rightarrow\; n\mu = \sum_{i=1}^n x_i.$$

Dividing by $n$, we obtain the desired result:

$$\hat{\mu}_{ML} = \frac{1}{n} \sum_{i=1}^n x_i.$$

Similar to the derivation of $\hat{\mu}_{ML}$, the ML estimate for $\Sigma$ is given by

$$\hat{\Sigma}_{ML} = \operatorname{argmax}_\Sigma \ell(X; \mu, \Sigma) = \operatorname{argmin}_\Sigma \Big\{ \frac{1}{2} \sum_{i=1}^n (x_i - \mu)^T \Sigma^{-1} (x_i - \mu) + \frac{n}{2} \log |\Sigma| \Big\}. \quad (15)$$

Once again, we make the substitution $A = \Sigma^{-1}$; note that $\log|\Sigma| = -\log|A|$. As a result, the minimum (or maximum) of Equation 15 will occur where the derivative with respect to $A$ equals zero:

$$\frac{\partial}{\partial A} \Big\{ \frac{1}{2} \sum_{i=1}^n (x_i - \mu)^T A (x_i - \mu) - \frac{n}{2} \log |A| \Big\} = 0.$$

We can now apply the following identities for the derivatives of scalar and matrix/vector forms [4]:

$$\frac{\partial}{\partial A} \big[ (x - \mu)^T A (x - \mu) \big] = (x - \mu)(x - \mu)^T, \qquad \frac{\partial}{\partial A} \log |A| = A^{-1} = \Sigma,$$

where the second identity uses the symmetry of $A$. Applying these identities to the previous equation, we find the desired result:

$$\frac{1}{2} \sum_{i=1}^n (x_i - \mu)(x_i - \mu)^T - \frac{n}{2}\, \Sigma = 0 \;\Rightarrow\; \hat{\Sigma}_{ML} = \frac{1}{n} \sum_{i=1}^n (x_i - \hat{\mu}_{ML})(x_i - \hat{\mu}_{ML})^T. \quad \text{(QED)}$$
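Both estimators are straightforward to verify on simulated data. In the Python sketch below, note the $1/n$ normalization: the ML covariance estimate is the biased estimator, matching NumPy's `np.cov` with `bias=True`.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.multivariate_normal([0.0, 1.0], [[1.0, 0.3], [0.3, 2.0]], size=500)

mu_ml = X.mean(axis=0)                 # (1/n) sum x_i
D = X - mu_ml
Sigma_ml = (D.T @ D) / len(X)          # (1/n) sum (x_i - mu)(x_i - mu)^T

print(np.allclose(Sigma_ml, np.cov(X.T, bias=True)))   # True
```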
3 Linear Discriminant Analysis

Problem 9

This problem will examine an extension of linear discriminant analysis (LDA) to multiple classes (assuming equal covariance matrices such that $\Sigma_c = \Sigma$ for every class $c$). Show that the optimal decision rule is based on calculating a set of $C$ linear discriminant functions

$$\delta_c(x) = x^T \Sigma^{-1} \mu_c - \frac{1}{2} \mu_c^T \Sigma^{-1} \mu_c$$

and selecting $\hat{C} = \operatorname{argmax}_c \delta_c(x)$. Assume that the prior probability is uniform for each class, such that $P_c = p(y = c) = 1/C$. Apply this method to the data in apple_lda.mat and plot the resulting linear decision boundaries or, alternatively, the decision regions. Report the classification error.

Recall from class and [3] that the Bayes classifier minimizes the conditional risk, for a given class-conditional density $p_c(x) = p(x \mid y = c)$ and prior probability $P_c = p(y = c)$, such that

$$\hat{C} = \operatorname{argmax}_c \delta_c(x), \qquad \delta_c(x) \triangleq \log p_c(x) + \log P_c.$$

For this problem we can eliminate the class prior term to obtain $\delta_c(x) = \log p_c(x)$, since the class prior is identical for all classes. In addition, we assume the class-conditionals are multivariate Gaussians with identical covariance matrices such that

$$p_c(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\Big( -\frac{1}{2} (x - \mu_c)^T \Sigma^{-1} (x - \mu_c) \Big).$$

Substituting into the previous expression, we find

$$\delta_c(x) = \log p_c(x) = -\frac{1}{2} (x - \mu_c)^T \Sigma^{-1} (x - \mu_c) - \frac{d}{2} \log 2\pi - \frac{1}{2} \log |\Sigma|$$
$$= -\frac{1}{2} x^T \Sigma^{-1} x + \frac{1}{2} x^T \Sigma^{-1} \mu_c + \frac{1}{2} \mu_c^T \Sigma^{-1} x - \frac{1}{2} \mu_c^T \Sigma^{-1} \mu_c$$
$$= \frac{1}{2} x^T \Sigma^{-1} \mu_c + \frac{1}{2} \mu_c^T \Sigma^{-1} x - \frac{1}{2} \mu_c^T \Sigma^{-1} \mu_c, \quad (16)$$

where terms that are independent of $c$ have been eliminated. To complete our proof, note that

$$\mu_c^T \Sigma^{-1} x = \big( \mu_c^T \Sigma^{-1} x \big)^T = x^T (\Sigma^{-1})^T \mu_c = x^T \Sigma^{-1} \mu_c,$$

since $\Sigma$ and $\Sigma^{-1}$ are symmetric matrices and, as a result, $\Sigma^{-1} = (\Sigma^{-1})^T$. Applying this observation to Equation 16, we prove the desired result:

$$\delta_c(x) = x^T \Sigma^{-1} \mu_c - \frac{1}{2} \mu_c^T \Sigma^{-1} \mu_c.$$

Using prob9.m, this method was applied to the data in apple_lda.mat.
The means and covariance were obtained using the ML estimators derived in Problem 8; for a given class $c$, the mean was calculated as

$$\hat{\mu}_c = \frac{1}{n_c} \sum_{i=1}^{n_c} x_{ci}.$$
Figure 4: Optimal linear decision boundaries/regions for apple_lda.mat (axes $x_1$ versus $x_2$; Classes 1, 2, and 3).

Similarly, the shared covariance matrix $\Sigma$ was estimated by pooling the class-centered samples:

$$\hat{\Sigma} = \frac{1}{n} \sum_{c} \sum_{i=1}^{n_c} (x_{ci} - \hat{\mu}_c)(x_{ci} - \hat{\mu}_c)^T.$$

The resulting decision boundaries are shown in Figure 4. Note that $\delta_c(x)$ is linear in $x$, so we only obtain linear decision boundaries. The classification error (measured as the percent of incorrect classifications on the training data) was .7%.
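A compact Python sketch of the classifier above (the original prob9.m is MATLAB, and the data here is simulated rather than apple_lda.mat): ML class means, a pooled ML covariance, and the linear discriminants $\delta_c(x) = x^T \Sigma^{-1} \mu_c - \frac{1}{2}\mu_c^T \Sigma^{-1} \mu_c$.

```python
import numpy as np

def lda_fit(X, y):
    classes = np.unique(y)
    mus = np.array([X[y == c].mean(axis=0) for c in classes])
    # Pooled ML covariance from class-centered samples.
    D = np.vstack([X[y == c] - mus[i] for i, c in enumerate(classes)])
    P = np.linalg.inv(D.T @ D / len(X))        # Sigma^{-1}
    return classes, mus, P

def lda_predict(X, classes, mus, P):
    # delta_c(x) = x^T P mu_c - 0.5 mu_c^T P mu_c, for every class at once.
    scores = X @ P @ mus.T - 0.5 * np.sum((mus @ P) * mus, axis=1)
    return classes[np.argmax(scores, axis=1)]

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(3.0, 1.0, (100, 2))])
y = np.repeat([0, 1], 100)
classes, mus, P = lda_fit(X, y)
err = np.mean(lda_predict(X, classes, mus, P) != y)
print(err)    # small training error on well-separated classes
```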
Problem 10

Assume that the covariances $\Sigma_c$ are no longer required to be equal. Derive the discriminant function and apply the resulting decision rule to apple_lda.mat. Plot the decision boundaries/regions and report the classification error. Compare the performance of the two classifiers.

As in Problem 9, we begin by writing the Bayes classifier

$$\hat{C} = \operatorname{argmax}_c \delta_c(x), \qquad \delta_c(x) = \log p_c(x) + \log P_c = \log p_c(x),$$

since $P_c = 1/C$ is independent of $c$. The class-conditionals are multivariate Gaussians given by

$$p_c(x) = \frac{1}{(2\pi)^{d/2} |\Sigma_c|^{1/2}} \exp\Big( -\frac{1}{2} (x - \mu_c)^T \Sigma_c^{-1} (x - \mu_c) \Big).$$

Substituting into the previous expression, we find

$$\delta_c(x) = \log p_c(x) = -\frac{1}{2} (x - \mu_c)^T \Sigma_c^{-1} (x - \mu_c) - \frac{d}{2} \log 2\pi - \frac{1}{2} \log |\Sigma_c|$$
$$= -\frac{1}{2} x^T \Sigma_c^{-1} x + \frac{1}{2} x^T \Sigma_c^{-1} \mu_c + \frac{1}{2} \mu_c^T \Sigma_c^{-1} x - \frac{1}{2} \mu_c^T \Sigma_c^{-1} \mu_c - \frac{1}{2} \log |\Sigma_c|,$$

where terms that are independent of $c$ have been eliminated.

Figure 5: Non-linear decision boundaries/regions for apple_lda.mat (axes $x_1$ versus $x_2$; Classes 1, 2, and 3).

Once again, we can simplify this
expression since $\mu_c^T \Sigma_c^{-1} x = x^T \Sigma_c^{-1} \mu_c$. In conclusion, the discriminant function is given as follows:

$$\delta_c(x) = -\frac{1}{2} x^T \Sigma_c^{-1} x + x^T \Sigma_c^{-1} \mu_c - \frac{1}{2} \mu_c^T \Sigma_c^{-1} \mu_c - \frac{1}{2} \log |\Sigma_c|.$$

Using prob10.m, this discriminant function was applied to the data in apple_lda.mat. The means and covariances were obtained using the ML estimators derived in Problem 8. The resulting decision boundaries are shown in Figure 5. The classification error (measured as the percent of incorrect classifications on the training data) was 7%. Note that the decision functions $\delta_c(x)$ are no longer linear in $x$, so we can now obtain non-linear decision boundaries. In fact, the leading-order term $-\frac{1}{2} x^T \Sigma_c^{-1} x$ is a quadratic form which, together with the lower-order terms, can express any second-degree curve (i.e., conic section in two dimensions) of the form

$$a x_1^2 + b x_1 x_2 + c x_2^2 + d x_1 + f x_2 + g = 0$$

as the decision boundary separating classes. In contrast to the linear decision boundaries in Problem 9, by allowing different covariance matrices for each class we can achieve a variety of decision boundaries such as lines, parabolas, hyperbolas, and ellipses [5]. As shown in Figure 5, the resulting decision boundaries are curvilinear and, as a result, can achieve a lower classification error due to the increase in their degrees of freedom.
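The quadratic rule above can be sketched in Python (simulated data standing in for apple_lda.mat; the equivalent form $\delta_c(x) = -\frac{1}{2}(x-\mu_c)^T \Sigma_c^{-1}(x-\mu_c) - \frac{1}{2}\log|\Sigma_c|$ is used, which differs from the expanded form only by terms independent of $c$):

```python
import numpy as np

def qda_fit(X, y):
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        S = np.cov(Xc.T, bias=True)                      # per-class ML covariance
        params[c] = (Xc.mean(axis=0), np.linalg.inv(S), np.linalg.slogdet(S)[1])
    return params

def qda_predict(X, params):
    scores = []
    for mu, P, logdet in params.values():
        D = X - mu
        # -0.5 (x-mu)^T P (x-mu) - 0.5 log|Sigma_c|, for every sample at once
        scores.append(-0.5 * np.einsum('ij,jk,ik->i', D, P, D) - 0.5 * logdet)
    return np.array(list(params))[np.argmax(scores, axis=0)]

# Two classes with equal means but different covariances: only a quadratic
# (here, elliptical) boundary can separate them.
rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0.0, 0.5, (150, 2)), rng.normal(0.0, 2.5, (150, 2))])
y = np.repeat([0, 1], 150)
err = np.mean(qda_predict(X, qda_fit(X, y)) != y)
print(err)
```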
Problem 11

An alternative approach to learning quadratic decision boundaries is to map the inputs into an extended polynomial feature space. Implement this method and compare the resulting decision boundaries to those in Problem 10.

As discussed in class, we can utilize the linear regression framework to implement a naive classifier as follows. First, we form an indicator matrix $Y$ such that

$$Y_{ij} = \begin{cases} 1 & \text{if } y_i = j, \\ 0 & \text{otherwise,} \end{cases}$$

where the possible classes are labeled $1, \ldots, C$. Recall from Problem 10 that any quadratic curve can be written as $a x_1^2 + b x_1 x_2 + c x_2^2 + d x_1 + f x_2 + g = 0$, where $\{a, b, c, d, f, g\}$ are unknown regression coefficients. As a result, we can define a design matrix $\tilde{X}$ with extended polynomial features whose $j$-th row is

$$\big[\, 1, \;\; x_{1(j)}, \;\; x_{2(j)}, \;\; x_{1(j)}^2, \;\; x_{2(j)}^2 \,\big],$$

where $x_{i(j)}$ denotes the $i$-th coordinate of the $j$-th training sample.

Figure 6: Quadratic decision boundaries/regions using extended polynomial features (axes $x_1$ versus $x_2$; Classes 1, 2, and 3).

(Note that inclusion of the cross
Figure 7: Comparison of classification results in Problems 10 and 11. (a) Decision boundaries using extended LDA. (b) Decision boundaries using extended features.

term $x_1 x_2$ had little effect on the decision regions.) Together, $\tilde{X}$ and $Y$ define $C$ independent linear regression problems with least squares solutions given by

$$\hat{W} = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T Y,$$

where the regression coefficients for each class correspond to the columns of $\hat{W}$. For a general set of samples, we form the extended design matrix $\tilde{X}_*$ with the corresponding linear predictions given by

$$\hat{Y} = \tilde{X}_* (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T Y = \tilde{X}_* \hat{W},$$

as shown in class. Finally, we assign a class label $\hat{C}$ by choosing the column of $\hat{Y}$ with the largest value (independently for each row/sample).

Using prob11.m, this classification procedure was applied to the data in apple_lda.mat. The resulting decision boundaries are shown in Figure 6. The classification error (measured as the percent of incorrect classifications on the training data) was 9.3%. Since the regression features were extended to allow quadratic decision boundaries, we find that the resulting regions are delineated by curvilinear borders. While the classification error is lower than in Problem 9, we find that the resulting decision regions possess some odd, undesirable properties.

Compare the decision boundaries/regions in Figures 7(a) and 7(b). Using extended linear discriminant analysis (LDA) as in Problem 10, we find that the decision boundaries are quadratic and, assuming underlying Gaussian distributions, seem to make reasonable generalizations. More specifically, examine the decision region for the second class (green) in Figure 7(a): not surprisingly, it continues on the other side of Class 3. This is reasonable, since Class 3 varies in the opposite direction. Unfortunately, this generalization behavior is not reflected in the decision regions found in this problem.
Surprisingly, we find that Class 3 becomes the most likely class label in the bottom-left corner of Figure 7(b). This is inconsistent with the covariance matrix of the underlying distribution (assuming it is truly Gaussian). In conclusion, we find that the results in Problems 10 and 11 illustrate the benefits of generative models over naive classification schemes (when there is some knowledge of the underlying distributions).
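The indicator-regression procedure above can be sketched in Python (the original prob11.m is MATLAB; the concentric two-class data below is made up to show that the quadratic features make a circular boundary learnable by plain least squares):

```python
import numpy as np

def phi(X):
    """Extended polynomial features [1, x1, x2, x1^2, x2^2] (cross term omitted)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x2**2])

def indicator_fit(X, y):
    classes = np.unique(y)
    Y = (y[:, None] == classes).astype(float)            # one-hot indicator matrix
    W = np.linalg.lstsq(phi(X), Y, rcond=None)[0]        # C independent regressions
    return classes, W

def indicator_predict(X, classes, W):
    return classes[np.argmax(phi(X) @ W, axis=1)]        # largest column wins

rng = np.random.default_rng(8)
inner = rng.normal(0.0, 0.5, (150, 2))                   # blob at the origin
theta = rng.uniform(0.0, 2 * np.pi, 150)                 # surrounding ring
outer = (4.0 + 0.3 * rng.normal(size=150))[:, None] * \
        np.column_stack([np.cos(theta), np.sin(theta)])
X = np.vstack([inner, outer])
y = np.repeat([0, 1], 150)

classes, W = indicator_fit(X, y)
err = np.mean(indicator_predict(X, classes, W) != y)
print(err)    # near zero: the boundary is (approximately) a circle
```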
References

[1] Christopher M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
[2] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[3] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification (Second Edition). Wiley-Interscience, 2001.
[4] Sam Roweis. Matrix Identities. roweis/notes/matrixid.pdf.
[5] Eric W. Weisstein. Quadratic Curve. MathWorld.
[6] Eric W. Weisstein. Quadratic Curve Discriminant. MathWorld, QuadraticCurveDiscriminant.html.
More informationBayesian decision theory Introduction to Pattern Recognition. Lectures 4 and 5: Bayesian decision theory
Bayesian decision theory 8001652 Introduction to Pattern Recognition. Lectures 4 and 5: Bayesian decision theory Jussi Tohka jussi.tohka@tut.fi Institute of Signal Processing Tampere University of Technology
More informationUniversity of Cambridge Engineering Part IIB Module 4F10: Statistical Pattern Processing Handout 2: Multivariate Gaussians
Engineering Part IIB: Module F Statistical Pattern Processing University of Cambridge Engineering Part IIB Module F: Statistical Pattern Processing Handout : Multivariate Gaussians. Generative Model Decision
More informationUniversity of Cambridge Engineering Part IIB Module 4F10: Statistical Pattern Processing Handout 2: Multivariate Gaussians
University of Cambridge Engineering Part IIB Module 4F: Statistical Pattern Processing Handout 2: Multivariate Gaussians.2.5..5 8 6 4 2 2 4 6 8 Mark Gales mjfg@eng.cam.ac.uk Michaelmas 2 2 Engineering
More informationMachine Learning Linear Classification. Prof. Matteo Matteucci
Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)
More informationUniversity of Cambridge Engineering Part IIB Module 3F3: Signal and Pattern Processing Handout 2:. The Multivariate Gaussian & Decision Boundaries
University of Cambridge Engineering Part IIB Module 3F3: Signal and Pattern Processing Handout :. The Multivariate Gaussian & Decision Boundaries..15.1.5 1 8 6 6 8 1 Mark Gales mjfg@eng.cam.ac.uk Lent
More informationLecture Notes on the Gaussian Distribution
Lecture Notes on the Gaussian Distribution Hairong Qi The Gaussian distribution is also referred to as the normal distribution or the bell curve distribution for its bell-shaped density curve. There s
More informationMachine Learning Basics Lecture 2: Linear Classification. Princeton University COS 495 Instructor: Yingyu Liang
Machine Learning Basics Lecture 2: Linear Classification Princeton University COS 495 Instructor: Yingyu Liang Review: machine learning basics Math formulation Given training data x i, y i : 1 i n i.i.d.
More informationMultivariate statistical methods and data mining in particle physics
Multivariate statistical methods and data mining in particle physics RHUL Physics www.pp.rhul.ac.uk/~cowan Academic Training Lectures CERN 16 19 June, 2008 1 Outline Statement of the problem Some general
More informationStatistical and Learning Techniques in Computer Vision Lecture 2: Maximum Likelihood and Bayesian Estimation Jens Rittscher and Chuck Stewart
Statistical and Learning Techniques in Computer Vision Lecture 2: Maximum Likelihood and Bayesian Estimation Jens Rittscher and Chuck Stewart 1 Motivation and Problem In Lecture 1 we briefly saw how histograms
More informationStatistical Data Mining and Machine Learning Hilary Term 2016
Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes
More informationMachine Learning Practice Page 2 of 2 10/28/13
Machine Learning 10-701 Practice Page 2 of 2 10/28/13 1. True or False Please give an explanation for your answer, this is worth 1 pt/question. (a) (2 points) No classifier can do better than a naive Bayes
More informationECE521 week 3: 23/26 January 2017
ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear
More information6.867 Machine Learning
6.867 Machine Learning Problem set 1 Solutions Thursday, September 19 What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.
More informationUnsupervised Learning with Permuted Data
Unsupervised Learning with Permuted Data Sergey Kirshner skirshne@ics.uci.edu Sridevi Parise sparise@ics.uci.edu Padhraic Smyth smyth@ics.uci.edu School of Information and Computer Science, University
More informationLecture 2 Machine Learning Review
Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things
More information6.867 Machine Learning
6.867 Machine Learning Problem Set 2 Due date: Wednesday October 6 Please address all questions and comments about this problem set to 6867-staff@csail.mit.edu. You will need to use MATLAB for some of
More informationParametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012
Parametric Models Dr. Shuang LIANG School of Software Engineering TongJi University Fall, 2012 Today s Topics Maximum Likelihood Estimation Bayesian Density Estimation Today s Topics Maximum Likelihood
More informationCS 340 Lec. 18: Multivariate Gaussian Distributions and Linear Discriminant Analysis
CS 3 Lec. 18: Multivariate Gaussian Distributions and Linear Discriminant Analysis AD March 11 AD ( March 11 1 / 17 Multivariate Gaussian Consider data { x i } N i=1 where xi R D and we assume they are
More informationIntroduction to Machine Learning
1, DATA11002 Introduction to Machine Learning Lecturer: Teemu Roos TAs: Ville Hyvönen and Janne Leppä-aho Department of Computer Science University of Helsinki (based in part on material by Patrik Hoyer
More informationComputer Vision Group Prof. Daniel Cremers. 2. Regression (cont.)
Prof. Daniel Cremers 2. Regression (cont.) Regression with MLE (Rep.) Assume that y is affected by Gaussian noise : t = f(x, w)+ where Thus, we have p(t x, w, )=N (t; f(x, w), 2 ) 2 Maximum A-Posteriori
More information5. Discriminant analysis
5. Discriminant analysis We continue from Bayes s rule presented in Section 3 on p. 85 (5.1) where c i is a class, x isap-dimensional vector (data case) and we use class conditional probability (density
More informationp(x ω i 0.4 ω 2 ω
p( ω i ). ω.3.. 9 3 FIGURE.. Hypothetical class-conditional probability density functions show the probability density of measuring a particular feature value given the pattern is in category ω i.if represents
More informationEXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING
EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING DATE AND TIME: June 9, 2018, 09.00 14.00 RESPONSIBLE TEACHER: Andreas Svensson NUMBER OF PROBLEMS: 5 AIDING MATERIAL: Calculator, mathematical
More informationGenerative classifiers: The Gaussian classifier. Ata Kaban School of Computer Science University of Birmingham
Generative classifiers: The Gaussian classifier Ata Kaban School of Computer Science University of Birmingham Outline We have already seen how Bayes rule can be turned into a classifier In all our examples
More informationClassification Methods II: Linear and Quadratic Discrimminant Analysis
Classification Methods II: Linear and Quadratic Discrimminant Analysis Rebecca C. Steorts, Duke University STA 325, Chapter 4 ISL Agenda Linear Discrimminant Analysis (LDA) Classification Recall that linear
More informationIntroduction to Machine Learning
Outline Introduction to Machine Learning Bayesian Classification Varun Chandola March 8, 017 1. {circular,large,light,smooth,thick}, malignant. {circular,large,light,irregular,thick}, malignant 3. {oval,large,dark,smooth,thin},
More informationMaximum Likelihood, Logistic Regression, and Stochastic Gradient Training
Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Charles Elkan elkan@cs.ucsd.edu January 17, 2013 1 Principle of maximum likelihood Consider a family of probability distributions
More informationContents Lecture 4. Lecture 4 Linear Discriminant Analysis. Summary of Lecture 3 (II/II) Summary of Lecture 3 (I/II)
Contents Lecture Lecture Linear Discriminant Analysis Fredrik Lindsten Division of Systems and Control Department of Information Technology Uppsala University Email: fredriklindsten@ituuse Summary of lecture
More informationMachine Learning (CS 567) Lecture 5
Machine Learning (CS 567) Lecture 5 Time: T-Th 5:00pm - 6:20pm Location: GFS 118 Instructor: Sofus A. Macskassy (macskass@usc.edu) Office: SAL 216 Office hours: by appointment Teaching assistant: Cheol
More informationCOM336: Neural Computing
COM336: Neural Computing http://www.dcs.shef.ac.uk/ sjr/com336/ Lecture 2: Density Estimation Steve Renals Department of Computer Science University of Sheffield Sheffield S1 4DP UK email: s.renals@dcs.shef.ac.uk
More informationMachine Learning 1. Linear Classifiers. Marius Kloft. Humboldt University of Berlin Summer Term Machine Learning 1 Linear Classifiers 1
Machine Learning 1 Linear Classifiers Marius Kloft Humboldt University of Berlin Summer Term 2014 Machine Learning 1 Linear Classifiers 1 Recap Past lectures: Machine Learning 1 Linear Classifiers 2 Recap
More informationMachine Learning Lecture 2
Machine Perceptual Learning and Sensory Summer Augmented 15 Computing Many slides adapted from B. Schiele Machine Learning Lecture 2 Probability Density Estimation 16.04.2015 Bastian Leibe RWTH Aachen
More informationInstance-based Learning CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016
Instance-based Learning CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Outline Non-parametric approach Unsupervised: Non-parametric density estimation Parzen Windows Kn-Nearest
More informationLinear Classifiers as Pattern Detectors
Intelligent Systems: Reasoning and Recognition James L. Crowley ENSIMAG 2 / MoSIG M1 Second Semester 2014/2015 Lesson 16 8 April 2015 Contents Linear Classifiers as Pattern Detectors Notation...2 Linear
More informationIntroduction to Machine Learning
1, DATA11002 Introduction to Machine Learning Lecturer: Antti Ukkonen TAs: Saska Dönges and Janne Leppä-aho Department of Computer Science University of Helsinki (based in part on material by Patrik Hoyer,
More informationToday. Probability and Statistics. Linear Algebra. Calculus. Naïve Bayes Classification. Matrix Multiplication Matrix Inversion
Today Probability and Statistics Naïve Bayes Classification Linear Algebra Matrix Multiplication Matrix Inversion Calculus Vector Calculus Optimization Lagrange Multipliers 1 Classical Artificial Intelligence
More informationIntroduction to machine learning and pattern recognition Lecture 2 Coryn Bailer-Jones
Introduction to machine learning and pattern recognition Lecture 2 Coryn Bailer-Jones http://www.mpia.de/homes/calj/mlpr_mpia2008.html 1 1 Last week... supervised and unsupervised methods need adaptive
More informationThe Bayes classifier
The Bayes classifier Consider where is a random vector in is a random variable (depending on ) Let be a classifier with probability of error/risk given by The Bayes classifier (denoted ) is the optimal
More informationAdvanced statistical methods for data analysis Lecture 1
Advanced statistical methods for data analysis Lecture 1 RHUL Physics www.pp.rhul.ac.uk/~cowan Universität Mainz Klausurtagung des GK Eichtheorien exp. Tests... Bullay/Mosel 15 17 September, 2008 1 Outline
More informationMachine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io
Machine Learning Lecture 4: Regularization and Bayesian Statistics Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 207 Overfitting Problem
More information6.867 Machine Learning
6.867 Machine Learning Problem set 1 Due Thursday, September 19, in class What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.
More informationLDA, QDA, Naive Bayes
LDA, QDA, Naive Bayes Generative Classification Models Marek Petrik 2/16/2017 Last Class Logistic Regression Maximum Likelihood Principle Logistic Regression Predict probability of a class: p(x) Example:
More informationSupport Vector Machines
Support Vector Machines Le Song Machine Learning I CSE 6740, Fall 2013 Naïve Bayes classifier Still use Bayes decision rule for classification P y x = P x y P y P x But assume p x y = 1 is fully factorized
More informationAssociation studies and regression
Association studies and regression CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar Association studies and regression 1 / 104 Administration
More informationLinear Regression Linear Regression with Shrinkage
Linear Regression Linear Regression ith Shrinkage Introduction Regression means predicting a continuous (usually scalar) output y from a vector of continuous inputs (features) x. Example: Predicting vehicle
More informationModeling Data with Linear Combinations of Basis Functions. Read Chapter 3 in the text by Bishop
Modeling Data with Linear Combinations of Basis Functions Read Chapter 3 in the text by Bishop A Type of Supervised Learning Problem We want to model data (x 1, t 1 ),..., (x N, t N ), where x i is a vector
More informationChapter 5 continued. Chapter 5 sections
Chapter 5 sections Discrete univariate distributions: 5.2 Bernoulli and Binomial distributions Just skim 5.3 Hypergeometric distributions 5.4 Poisson distributions Just skim 5.5 Negative Binomial distributions
More informationMachine Learning Lecture 2
Machine Perceptual Learning and Sensory Summer Augmented 6 Computing Announcements Machine Learning Lecture 2 Course webpage http://www.vision.rwth-aachen.de/teaching/ Slides will be made available on
More informationMachine Learning. Regression-Based Classification & Gaussian Discriminant Analysis. Manfred Huber
Machine Learning Regression-Based Classification & Gaussian Discriminant Analysis Manfred Huber 2015 1 Logistic Regression Linear regression provides a nice representation and an efficient solution to
More informationStatistical and Learning Techniques in Computer Vision Lecture 1: Random Variables Jens Rittscher and Chuck Stewart
Statistical and Learning Techniques in Computer Vision Lecture 1: Random Variables Jens Rittscher and Chuck Stewart 1 Motivation Imaging is a stochastic process: If we take all the different sources of
More informationx. Figure 1: Examples of univariate Gaussian pdfs N (x; µ, σ 2 ).
.8.6 µ =, σ = 1 µ = 1, σ = 1 / µ =, σ =.. 3 1 1 3 x Figure 1: Examples of univariate Gaussian pdfs N (x; µ, σ ). The Gaussian distribution Probably the most-important distribution in all of statistics
More informationB4 Estimation and Inference
B4 Estimation and Inference 6 Lectures Hilary Term 27 2 Tutorial Sheets A. Zisserman Overview Lectures 1 & 2: Introduction sensors, and basics of probability density functions for representing sensor error
More informationGaussian Process Regression
Gaussian Process Regression 4F1 Pattern Recognition, 21 Carl Edward Rasmussen Department of Engineering, University of Cambridge November 11th - 16th, 21 Rasmussen (Engineering, Cambridge) Gaussian Process
More informationBayes Decision Theory
Bayes Decision Theory Minimum-Error-Rate Classification Classifiers, Discriminant Functions and Decision Surfaces The Normal Density 0 Minimum-Error-Rate Classification Actions are decisions on classes
More informationClassification. Chapter Introduction. 6.2 The Bayes classifier
Chapter 6 Classification 6.1 Introduction Often encountered in applications is the situation where the response variable Y takes values in a finite set of labels. For example, the response Y could encode
More informationComputer Vision Group Prof. Daniel Cremers. 4. Gaussian Processes - Regression
Group Prof. Daniel Cremers 4. Gaussian Processes - Regression Definition (Rep.) Definition: A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.
More informationLeast Squares Regression
CIS 50: Machine Learning Spring 08: Lecture 4 Least Squares Regression Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may not cover all the
More informationLecture 3. Linear Regression II Bastian Leibe RWTH Aachen
Advanced Machine Learning Lecture 3 Linear Regression II 02.11.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ leibe@vision.rwth-aachen.de This Lecture: Advanced Machine Learning Regression
More informationMachine learning - HT Maximum Likelihood
Machine learning - HT 2016 3. Maximum Likelihood Varun Kanade University of Oxford January 27, 2016 Outline Probabilistic Framework Formulate linear regression in the language of probability Introduce
More informationPattern Classification
Pattern Classification Introduction Parametric classifiers Semi-parametric classifiers Dimensionality reduction Significance testing 6345 Automatic Speech Recognition Semi-Parametric Classifiers 1 Semi-Parametric
More informationp(x ω i 0.4 ω 2 ω
p(x ω i ).4 ω.3.. 9 3 4 5 x FIGURE.. Hypothetical class-conditional probability density functions show the probability density of measuring a particular feature value x given the pattern is in category
More informationStatistical Machine Learning Hilary Term 2018
Statistical Machine Learning Hilary Term 2018 Pier Francesco Palamara Department of Statistics University of Oxford Slide credits and other course material can be found at: http://www.stats.ox.ac.uk/~palamara/sml18.html
More informationMachine Learning - MT Classification: Generative Models
Machine Learning - MT 2016 7. Classification: Generative Models Varun Kanade University of Oxford October 31, 2016 Announcements Practical 1 Submission Try to get signed off during session itself Otherwise,
More informationECE662: Pattern Recognition and Decision Making Processes: HW TWO
ECE662: Pattern Recognition and Decision Making Processes: HW TWO Purdue University Department of Electrical and Computer Engineering West Lafayette, INDIANA, USA Abstract. In this report experiments are
More informationClassification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012
Classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Topics Discriminant functions Logistic regression Perceptron Generative models Generative vs. discriminative
More informationClassification. Sandro Cumani. Politecnico di Torino
Politecnico di Torino Outline Generative model: Gaussian classifier (Linear) discriminative model: logistic regression (Non linear) discriminative model: neural networks Gaussian Classifier We want to
More informationCS534: Machine Learning. Thomas G. Dietterich 221C Dearborn Hall
CS534: Machine Learning Thomas G. Dietterich 221C Dearborn Hall tgd@cs.orst.edu http://www.cs.orst.edu/~tgd/classes/534 1 Course Overview Introduction: Basic problems and questions in machine learning.
More informationComputer Vision Group Prof. Daniel Cremers. 9. Gaussian Processes - Regression
Group Prof. Daniel Cremers 9. Gaussian Processes - Regression Repetition: Regularized Regression Before, we solved for w using the pseudoinverse. But: we can kernelize this problem as well! First step:
More informationEEL 851: Biometrics. An Overview of Statistical Pattern Recognition EEL 851 1
EEL 851: Biometrics An Overview of Statistical Pattern Recognition EEL 851 1 Outline Introduction Pattern Feature Noise Example Problem Analysis Segmentation Feature Extraction Classification Design Cycle
More informationGWAS V: Gaussian processes
GWAS V: Gaussian processes Dr. Oliver Stegle Christoh Lippert Prof. Dr. Karsten Borgwardt Max-Planck-Institutes Tübingen, Germany Tübingen Summer 2011 Oliver Stegle GWAS V: Gaussian processes Summer 2011
More informationCh 4. Linear Models for Classification
Ch 4. Linear Models for Classification Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Department of Computer Science and Engineering Pohang University of Science and echnology 77 Cheongam-ro,
More informationNaive Bayes and Gaussian Bayes Classifier
Naive Bayes and Gaussian Bayes Classifier Mengye Ren mren@cs.toronto.edu October 18, 2015 Mengye Ren Naive Bayes and Gaussian Bayes Classifier October 18, 2015 1 / 21 Naive Bayes Bayes Rules: Naive Bayes
More informationClustering VS Classification
MCQ Clustering VS Classification 1. What is the relation between the distance between clusters and the corresponding class discriminability? a. proportional b. inversely-proportional c. no-relation Ans:
More informationMinimum Error-Rate Discriminant
Discriminants Minimum Error-Rate Discriminant In the case of zero-one loss function, the Bayes Discriminant can be further simplified: g i (x) =P (ω i x). (29) J. Corso (SUNY at Buffalo) Bayesian Decision
More informationChap 2. Linear Classifiers (FTH, ) Yongdai Kim Seoul National University
Chap 2. Linear Classifiers (FTH, 4.1-4.4) Yongdai Kim Seoul National University Linear methods for classification 1. Linear classifiers For simplicity, we only consider two-class classification problems
More information10-701/ Machine Learning - Midterm Exam, Fall 2010
10-701/15-781 Machine Learning - Midterm Exam, Fall 2010 Aarti Singh Carnegie Mellon University 1. Personal info: Name: Andrew account: E-mail address: 2. There should be 15 numbered pages in this exam
More informationClassification 2: Linear discriminant analysis (continued); logistic regression
Classification 2: Linear discriminant analysis (continued); logistic regression Ryan Tibshirani Data Mining: 36-462/36-662 April 4 2013 Optional reading: ISL 4.4, ESL 4.3; ISL 4.3, ESL 4.4 1 Reminder:
More informationDiscriminant analysis and supervised classification
Discriminant analysis and supervised classification Angela Montanari 1 Linear discriminant analysis Linear discriminant analysis (LDA) also known as Fisher s linear discriminant analysis or as Canonical
More informationA Study of Relative Efficiency and Robustness of Classification Methods
A Study of Relative Efficiency and Robustness of Classification Methods Yoonkyung Lee* Department of Statistics The Ohio State University *joint work with Rui Wang April 28, 2011 Department of Statistics
More informationCOMP 551 Applied Machine Learning Lecture 3: Linear regression (cont d)
COMP 551 Applied Machine Learning Lecture 3: Linear regression (cont d) Instructor: Herke van Hoof (herke.vanhoof@mail.mcgill.ca) Slides mostly by: Class web page: www.cs.mcgill.ca/~hvanho2/comp551 Unless
More informationNaive Bayes and Gaussian Bayes Classifier
Naive Bayes and Gaussian Bayes Classifier Ladislav Rampasek slides by Mengye Ren and others February 22, 2016 Naive Bayes and Gaussian Bayes Classifier February 22, 2016 1 / 21 Naive Bayes Bayes Rule:
More informationNotes on Discriminant Functions and Optimal Classification
Notes on Discriminant Functions and Optimal Classification Padhraic Smyth, Department of Computer Science University of California, Irvine c 2017 1 Discriminant Functions Consider a classification problem
More informationMotivating the Covariance Matrix
Motivating the Covariance Matrix Raúl Rojas Computer Science Department Freie Universität Berlin January 2009 Abstract This note reviews some interesting properties of the covariance matrix and its role
More informationEXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING
EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING DATE AND TIME: August 30, 2018, 14.00 19.00 RESPONSIBLE TEACHER: Niklas Wahlström NUMBER OF PROBLEMS: 5 AIDING MATERIAL: Calculator, mathematical
More informationMinimum Error Rate Classification
Minimum Error Rate Classification Dr. K.Vijayarekha Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur-613 401 Table of Contents 1.Minimum Error Rate Classification...
More informationMachine Learning and Pattern Recognition Density Estimation: Gaussians
Machine Learning and Pattern Recognition Density Estimation: Gaussians Course Lecturer:Amos J Storkey Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh 10 Crichton
More informationIntroduction to Machine Learning
Introduction to Machine Learning Bayesian Classification Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB CSE 474/574
More informationCS534 Machine Learning - Spring Final Exam
CS534 Machine Learning - Spring 2013 Final Exam Name: You have 110 minutes. There are 6 questions (8 pages including cover page). If you get stuck on one question, move on to others and come back to the
More information