Math for Machine Learning
Open Doors to Data Science and Artificial Intelligence

Richard Han

Copyright 2015 Richard Han. All rights reserved.

CONTENTS

PREFACE
1 INTRODUCTION
2 LINEAR REGRESSION
  LINEAR REGRESSION
  THE LEAST SQUARES METHOD
  LINEAR ALGEBRA SOLUTION TO LEAST SQUARES PROBLEM
  EXAMPLE: LINEAR REGRESSION
  SUMMARY: LINEAR REGRESSION
  PROBLEM SET: LINEAR REGRESSION
  SOLUTION SET: LINEAR REGRESSION
3 LINEAR DISCRIMINANT ANALYSIS
  CLASSIFICATION
  LINEAR DISCRIMINANT ANALYSIS
  THE POSTERIOR PROBABILITY FUNCTIONS
  MODELLING THE POSTERIOR PROBABILITY FUNCTIONS
  LINEAR DISCRIMINANT FUNCTIONS
  ESTIMATING THE LINEAR DISCRIMINANT FUNCTIONS
  CLASSIFYING DATA POINTS USING LINEAR DISCRIMINANT FUNCTIONS
  LDA EXAMPLE 1
  LDA EXAMPLE 2
  SUMMARY: LINEAR DISCRIMINANT ANALYSIS
  PROBLEM SET: LINEAR DISCRIMINANT ANALYSIS
  SOLUTION SET: LINEAR DISCRIMINANT ANALYSIS
4 LOGISTIC REGRESSION
  LOGISTIC REGRESSION
  LOGISTIC REGRESSION MODEL OF THE POSTERIOR PROBABILITY FUNCTION
  ESTIMATING THE POSTERIOR PROBABILITY FUNCTION
  THE MULTIVARIATE NEWTON-RAPHSON METHOD
  MAXIMIZING THE LOG-LIKELIHOOD FUNCTION
  EXAMPLE: LOGISTIC REGRESSION
  SUMMARY: LOGISTIC REGRESSION
  PROBLEM SET: LOGISTIC REGRESSION
  SOLUTION SET: LOGISTIC REGRESSION
5 ARTIFICIAL NEURAL NETWORKS
  ARTIFICIAL NEURAL NETWORKS
  NEURAL NETWORK MODEL OF THE OUTPUT FUNCTIONS
  FORWARD PROPAGATION
  CHOOSING ACTIVATION FUNCTIONS
  ESTIMATING THE OUTPUT FUNCTIONS
  ERROR FUNCTION FOR REGRESSION
  ERROR FUNCTION FOR BINARY CLASSIFICATION
  ERROR FUNCTION FOR MULTI-CLASS CLASSIFICATION
  MINIMIZING THE ERROR FUNCTION USING GRADIENT DESCENT
  BACKPROPAGATION EQUATIONS
  SUMMARY OF BACKPROPAGATION
  SUMMARY: ARTIFICIAL NEURAL NETWORKS
  PROBLEM SET: ARTIFICIAL NEURAL NETWORKS
  SOLUTION SET: ARTIFICIAL NEURAL NETWORKS
6 MAXIMAL MARGIN CLASSIFIER
  MAXIMAL MARGIN CLASSIFIER
  DEFINITIONS OF SEPARATING HYPERPLANE AND MARGIN
  MAXIMIZING THE MARGIN
  DEFINITION OF MAXIMAL MARGIN CLASSIFIER
  REFORMULATING THE OPTIMIZATION PROBLEM
  SOLVING THE CONVEX OPTIMIZATION PROBLEM
  KKT CONDITIONS
  PRIMAL AND DUAL PROBLEMS
  SOLVING THE DUAL PROBLEM
  THE COEFFICIENTS FOR THE MAXIMAL MARGIN HYPERPLANE
  THE SUPPORT VECTORS
  CLASSIFYING TEST POINTS
  MAXIMAL MARGIN CLASSIFIER EXAMPLE 1
  MAXIMAL MARGIN CLASSIFIER EXAMPLE 2
  SUMMARY: MAXIMAL MARGIN CLASSIFIER
  PROBLEM SET: MAXIMAL MARGIN CLASSIFIER
  SOLUTION SET: MAXIMAL MARGIN CLASSIFIER
7 SUPPORT VECTOR CLASSIFIER
  SUPPORT VECTOR CLASSIFIER
  SLACK VARIABLES: POINTS ON CORRECT SIDE OF HYPERPLANE
  SLACK VARIABLES: POINTS ON WRONG SIDE OF HYPERPLANE
  FORMULATING THE OPTIMIZATION PROBLEM
  DEFINITION OF SUPPORT VECTOR CLASSIFIER
  A CONVEX OPTIMIZATION PROBLEM
  SOLVING THE CONVEX OPTIMIZATION PROBLEM (SOFT MARGIN)
  THE COEFFICIENTS FOR THE SOFT MARGIN HYPERPLANE
  THE SUPPORT VECTORS (SOFT MARGIN)
  CLASSIFYING TEST POINTS (SOFT MARGIN)
  SUPPORT VECTOR CLASSIFIER EXAMPLE 1
  SUPPORT VECTOR CLASSIFIER EXAMPLE 2
  SUMMARY: SUPPORT VECTOR CLASSIFIER
  PROBLEM SET: SUPPORT VECTOR CLASSIFIER
  SOLUTION SET: SUPPORT VECTOR CLASSIFIER
8 SUPPORT VECTOR MACHINE CLASSIFIER
  SUPPORT VECTOR MACHINE CLASSIFIER
  ENLARGING THE FEATURE SPACE
  THE KERNEL TRICK
  SUPPORT VECTOR MACHINE CLASSIFIER EXAMPLE 1
  SUPPORT VECTOR MACHINE CLASSIFIER EXAMPLE 2
  SUMMARY: SUPPORT VECTOR MACHINE CLASSIFIER
  PROBLEM SET: SUPPORT VECTOR MACHINE CLASSIFIER
  SOLUTION SET: SUPPORT VECTOR MACHINE CLASSIFIER
CONCLUSION
APPENDIX 1
APPENDIX 2
APPENDIX 3
APPENDIX 4
APPENDIX 5
INDEX


PREFACE

Welcome to Math for Machine Learning: Open Doors to Data Science and Artificial Intelligence. This is a first textbook in math for machine learning.

Be sure to get the companion online course Math for Machine Learning here: Math for Machine Learning Online Course. The online course can be very helpful in conjunction with this book.

The prerequisites for this book and the online course are Linear Algebra, Multivariable Calculus, and Probability. You can find my online course on Linear Algebra here: Linear Algebra Course.

We will not do any programming in this book.

This book will get you started in machine learning in a smooth and natural way, preparing you for more advanced topics and dispelling the belief that machine learning is complicated, difficult, and intimidating.

I want you to succeed and prosper in your career, life, and future endeavors. I am here for you. Visit me at: Online Math Training

1 INTRODUCTION

Welcome to Math for Machine Learning: Open Doors to Data Science and Artificial Intelligence! My name is Richard Han. This is a first textbook in math for machine learning.

Ideal student: If you're a working professional needing a refresher on machine learning or a complete beginner who needs to learn machine learning for the first time, this book is for you. If your busy schedule doesn't allow you to go back to a traditional school, this book allows you to study on your own schedule and further your career goals without being left behind. If you plan on taking machine learning in college, this is a great way to get ahead. If you're currently struggling with machine learning or have struggled with it in the past, now is the time to master it.

Benefits of studying this book: After reading this book, you will have refreshed your knowledge of machine learning for your career so that you can earn a higher salary. You will have a required prerequisite for lucrative career fields such as Data Science and Artificial Intelligence. You will be in a better position to pursue a master's or PhD degree in machine learning and data science.

Why machine learning is important: Famous uses of machine learning include:

o Linear discriminant analysis. Linear discriminant analysis can be used to solve classification problems such as spam filtering and classifying patient illnesses.

o Logistic regression. Logistic regression can be used to solve binary classification problems such as determining whether a patient has a certain form of cancer or not.

o Artificial neural networks. Artificial neural networks can be used for applications such as self-driving cars, recommender systems, online marketing, reading medical images, and speech and face recognition.

o Support vector machines. Real-world applications of SVMs include classification of proteins and classification of images.

What my book offers: In this book, I cover core topics such as:

Linear Regression
Linear Discriminant Analysis
Logistic Regression
Artificial Neural Networks
Support Vector Machines

I explain each definition and go through each example step by step so that you understand each topic clearly. Throughout the book, there are practice problems for you to try. Detailed solutions are provided after each problem set.

I hope you benefit from the book.

Best regards,

Richard Han

2 LINEAR REGRESSION

LINEAR REGRESSION

Suppose we have a set of data $(x_1, y_1), \ldots, (x_N, y_N)$. This is called the training data. Each $x_i$ is a vector $[x_{i1} \; x_{i2} \; \cdots \; x_{ip}]^T$ of measurements, where $x_{i1}$ is an instance of the first input variable $X_1$, $x_{i2}$ is an instance of the second input variable $X_2$, etc. $X_1, \ldots, X_p$ are called features or predictors. $y_1, \ldots, y_N$ are instances of the output variable $Y$, which is called the response.

In linear regression, we assume that the response depends on the input variables in a linear fashion:

$$y = f(X) + \varepsilon, \quad \text{where } f(X) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p.$$

Here, $\varepsilon$ is called the error term and $\beta_0, \ldots, \beta_p$ are called parameters. We don't know the values of $\beta_0, \ldots, \beta_p$. However, we can use the training data to approximate the values of $\beta_0, \ldots, \beta_p$.

What we'll do is look at the amount by which the predicted value $f(x_i)$ differs from the actual $y_i$ for each of the pairs $(x_1, y_1), \ldots, (x_N, y_N)$ from the training data. So we have $y_i - f(x_i)$ as the difference. We then square this and take the sum for $i = 1, \ldots, N$:

$$\sum_{i=1}^{N} (y_i - f(x_i))^2.$$

This is called the residual sum of squares and denoted $\mathrm{RSS}(\beta)$, where $\beta = [\beta_0 \; \beta_1 \; \cdots \; \beta_p]^T$.

We want the residual sum of squares to be as small as possible. Essentially, this means that we want our predicted value $f(x_i)$ to be as close to the actual value $y_i$ as possible, for each of the pairs $(x_i, y_i)$. Doing this will give us a linear function of the input variables that best fits the given training data. In the case of only one input variable, we get the best fit line. In the case of two input variables, we get the best fit plane. And so on, for higher dimensions.

THE LEAST SQUARES METHOD

By minimizing $\mathrm{RSS}(\beta)$, we can obtain estimates $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p$ of the parameters $\beta_0, \ldots, \beta_p$. This method is called the least squares method.

Let

$$X = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{N1} & x_{N2} & \cdots & x_{Np} \end{bmatrix} \quad \text{and} \quad y = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix}.$$

Then

$$y - X\beta = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix} - \begin{bmatrix} \beta_0 + \beta_1 x_{11} + \cdots + \beta_p x_{1p} \\ \vdots \\ \beta_0 + \beta_1 x_{N1} + \cdots + \beta_p x_{Np} \end{bmatrix} = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix} - \begin{bmatrix} f(x_1) \\ \vdots \\ f(x_N) \end{bmatrix} = \begin{bmatrix} y_1 - f(x_1) \\ \vdots \\ y_N - f(x_N) \end{bmatrix}.$$

So $(y - X\beta)^T (y - X\beta) = \sum_{i=1}^{N} (y_i - f(x_i))^2 = \mathrm{RSS}(\beta)$.

Consider the vector of partial derivatives of $\mathrm{RSS}(\beta)$:

$$\begin{bmatrix} \partial \mathrm{RSS}(\beta)/\partial \beta_0 \\ \partial \mathrm{RSS}(\beta)/\partial \beta_1 \\ \vdots \\ \partial \mathrm{RSS}(\beta)/\partial \beta_p \end{bmatrix}.$$

Now,

$$\mathrm{RSS}(\beta) = (y_1 - (\beta_0 + \beta_1 x_{11} + \cdots + \beta_p x_{1p}))^2 + \cdots + (y_N - (\beta_0 + \beta_1 x_{N1} + \cdots + \beta_p x_{Np}))^2.$$

Let's take the partial derivative with respect to $\beta_0$:

$$\frac{\partial \mathrm{RSS}(\beta)}{\partial \beta_0} = 2(y_1 - (\beta_0 + \beta_1 x_{11} + \cdots + \beta_p x_{1p}))(-1) + \cdots + 2(y_N - (\beta_0 + \beta_1 x_{N1} + \cdots + \beta_p x_{Np}))(-1) = -2\,[1 \; \cdots \; 1]\,(y - X\beta).$$

Next, take the partial derivative with respect to $\beta_1$:

$$\frac{\partial \mathrm{RSS}(\beta)}{\partial \beta_1} = 2(y_1 - (\beta_0 + \beta_1 x_{11} + \cdots + \beta_p x_{1p}))(-x_{11}) + \cdots + 2(y_N - (\beta_0 + \beta_1 x_{N1} + \cdots + \beta_p x_{Np}))(-x_{N1}) = -2\,[x_{11} \; \cdots \; x_{N1}]\,(y - X\beta).$$

In general,

$$\frac{\partial \mathrm{RSS}(\beta)}{\partial \beta_k} = -2\,[x_{1k} \; \cdots \; x_{Nk}]\,(y - X\beta).$$

So

$$\begin{bmatrix} \partial \mathrm{RSS}(\beta)/\partial \beta_0 \\ \partial \mathrm{RSS}(\beta)/\partial \beta_1 \\ \vdots \\ \partial \mathrm{RSS}(\beta)/\partial \beta_p \end{bmatrix} = \begin{bmatrix} -2\,[1 \; \cdots \; 1]\,(y - X\beta) \\ -2\,[x_{11} \; \cdots \; x_{N1}]\,(y - X\beta) \\ \vdots \\ -2\,[x_{1p} \; \cdots \; x_{Np}]\,(y - X\beta) \end{bmatrix} = -2 \begin{bmatrix} 1 & \cdots & 1 \\ x_{11} & \cdots & x_{N1} \\ \vdots & & \vdots \\ x_{1p} & \cdots & x_{Np} \end{bmatrix} (y - X\beta) = -2\,X^T (y - X\beta).$$

If we take the second derivative of $\mathrm{RSS}(\beta)$, say $\dfrac{\partial^2 \mathrm{RSS}(\beta)}{\partial \beta_k \, \partial \beta_j}$, we get

$$\frac{\partial}{\partial \beta_j}\Big( 2(y_1 - (\beta_0 + \beta_1 x_{11} + \cdots + \beta_p x_{1p}))(-x_{1k}) + \cdots + 2(y_N - (\beta_0 + \beta_1 x_{N1} + \cdots + \beta_p x_{Np}))(-x_{Nk}) \Big) = 2 x_{1j} x_{1k} + \cdots + 2 x_{Nj} x_{Nk} = 2(x_{1j} x_{1k} + \cdots + x_{Nj} x_{Nk}).$$

Note that, writing the intercept column as $x_{i0} = 1$,

$$X = \begin{bmatrix} x_{10} & x_{11} & x_{12} & \cdots & x_{1p} \\ x_{20} & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & & & & \vdots \\ x_{N0} & x_{N1} & x_{N2} & \cdots & x_{Np} \end{bmatrix}, \qquad X^T X = \begin{bmatrix} x_{10} & x_{20} & \cdots & x_{N0} \\ x_{11} & x_{21} & \cdots & x_{N1} \\ \vdots & & & \vdots \\ x_{1p} & x_{2p} & \cdots & x_{Np} \end{bmatrix} \begin{bmatrix} x_{10} & x_{11} & \cdots & x_{1p} \\ x_{20} & x_{21} & \cdots & x_{2p} \\ \vdots & & & \vdots \\ x_{N0} & x_{N1} & \cdots & x_{Np} \end{bmatrix} = (a_{jk}), \quad \text{where } a_{jk} = x_{1j} x_{1k} + \cdots + x_{Nj} x_{Nk}.$$

So $\dfrac{\partial^2 \mathrm{RSS}(\beta)}{\partial \beta_k \, \partial \beta_j} = 2 a_{jk}$. The matrix of second derivatives of $\mathrm{RSS}(\beta)$ is $2 X^T X$. This matrix is called the Hessian. By the second derivative test, if the Hessian of $\mathrm{RSS}(\beta)$ at a critical point is positive definite, then $\mathrm{RSS}(\beta)$ has a local minimum there.

If we set our vector of derivatives to 0, we get

$$-2 X^T (y - X\beta) = 0$$
$$-2 X^T y + 2 X^T X \beta = 0$$
$$2 X^T X \beta = 2 X^T y$$
$$X^T X \beta = X^T y$$
$$\beta = (X^T X)^{-1} X^T y.$$

Thus, we solved for the vector of parameters $[\beta_0 \; \beta_1 \; \cdots \; \beta_p]^T$ which minimizes the residual sum of squares $\mathrm{RSS}(\beta)$. So we let

$$\hat{\beta} = \begin{bmatrix} \hat{\beta}_0 \\ \vdots \\ \hat{\beta}_p \end{bmatrix} = (X^T X)^{-1} X^T y.$$

LINEAR ALGEBRA SOLUTION TO LEAST SQUARES PROBLEM

We can arrive at the same solution for the least squares problem by using linear algebra. Let

$$X = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ \vdots & & & & \vdots \\ 1 & x_{N1} & x_{N2} & \cdots & x_{Np} \end{bmatrix} \quad \text{and} \quad y = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix}$$

as before, from our training data. We want a vector $\hat{\beta}$ such that $X\hat{\beta}$ is close to $y$. In other words, we want a vector $\hat{\beta}$ such that the distance $\lVert X\hat{\beta} - y \rVert$ between $X\hat{\beta}$ and $y$ is minimized. A vector $\hat{\beta}$ that minimizes $\lVert X\hat{\beta} - y \rVert$ is called a least-squares solution of $X\beta = y$.

$X$ is an $N \times (p+1)$ matrix. We want a $\hat{\beta}$ in $\mathbb{R}^{p+1}$ such that $X\hat{\beta}$ is closest to $y$. Note that $X\hat{\beta}$ is a linear combination of the columns of $X$. So $X\hat{\beta}$ lies in the span of the columns of $X$, which is a subspace of $\mathbb{R}^N$ denoted $\operatorname{Col} X$. So we want the vector in $\operatorname{Col} X$ that is closest to $y$. The projection of $y$ onto the subspace $\operatorname{Col} X$ is that vector:

$$\operatorname{proj}_{\operatorname{Col} X}\, y = X\hat{\beta} \quad \text{for some } \hat{\beta} \in \mathbb{R}^{p+1}.$$

Consider $y - X\hat{\beta}$. Note that $y = X\hat{\beta} + (y - X\hat{\beta})$. $\mathbb{R}^N$ can be broken into two subspaces $\operatorname{Col} X$ and $(\operatorname{Col} X)^{\perp}$, where $(\operatorname{Col} X)^{\perp}$ is the subspace of $\mathbb{R}^N$ consisting of all vectors that are orthogonal to the vectors in $\operatorname{Col} X$. Any vector in $\mathbb{R}^N$ can be written uniquely as $z + w$ where $z \in \operatorname{Col} X$ and $w \in (\operatorname{Col} X)^{\perp}$. Since $y \in \mathbb{R}^N$ and $y = X\hat{\beta} + (y - X\hat{\beta})$, with $X\hat{\beta} \in \operatorname{Col} X$, the second vector $y - X\hat{\beta}$ must lie in $(\operatorname{Col} X)^{\perp}$. So $y - X\hat{\beta}$ is orthogonal to the columns of $X$. Hence

$$X^T (y - X\hat{\beta}) = 0$$
$$X^T y - X^T X \hat{\beta} = 0$$
$$X^T X \hat{\beta} = X^T y.$$

Thus, it turns out that the set of least-squares solutions of $X\beta = y$ consists of all and only the solutions to the matrix equation $X^T X \beta = X^T y$.

If $X^T X$ is positive definite, then the eigenvalues of $X^T X$ are all positive. So 0 is not an eigenvalue of $X^T X$. It follows that $X^T X$ is invertible. Then, we can solve the equation $X^T X \beta = X^T y$ for $\beta$ to get $\hat{\beta} = (X^T X)^{-1} X^T y$, which is the same result we got earlier using multivariable calculus.

EXAMPLE: LINEAR REGRESSION

Suppose we have the following training data: $(x_1, y_1) = (1, 1)$, $(x_2, y_2) = (2, 4)$, $(x_3, y_3) = (3, 4)$. Find the best fit line using the least squares method. Find the predicted value for $x = 4$.

Solution: Form

$$X = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix} \quad \text{and} \quad y = \begin{bmatrix} 1 \\ 4 \\ 4 \end{bmatrix}.$$

The coefficients $\hat{\beta}_0, \hat{\beta}_1$ for the best fit line $f(x) = \hat{\beta}_0 + \hat{\beta}_1 x$ are given by $\begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{bmatrix} = (X^T X)^{-1} X^T y$.

$$X^T = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \end{bmatrix}, \qquad X^T X = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix} = \begin{bmatrix} 3 & 6 \\ 6 & 14 \end{bmatrix}$$

$$(X^T X)^{-1} = \begin{bmatrix} 7/3 & -1 \\ -1 & 1/2 \end{bmatrix}$$

$$(X^T X)^{-1} X^T y = \begin{bmatrix} 7/3 & -1 \\ -1 & 1/2 \end{bmatrix} \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \end{bmatrix} \begin{bmatrix} 1 \\ 4 \\ 4 \end{bmatrix} = \begin{bmatrix} 0 \\ 3/2 \end{bmatrix}$$

So $\hat{\beta}_0 = 0$ and $\hat{\beta}_1 = 3/2$. Thus, the best fit line is given by $f(x) = \left(\tfrac{3}{2}\right) x$. The predicted value for $x = 4$ is $f(4) = \left(\tfrac{3}{2}\right) \cdot 4 = 6$.
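
This book does not require any programming, but the closed-form solution above is easy to check numerically if you wish. The following minimal Python/NumPy sketch simply re-enters the training data of this example and evaluates $\hat{\beta} = (X^T X)^{-1} X^T y$; the variable names are arbitrary.

```python
import numpy as np

# Training data (1, 1), (2, 4), (3, 4): design matrix with an intercept column, and responses.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 4.0, 4.0])

# Normal equations: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.inv(X.T @ X) @ (X.T @ y)
print(beta_hat)                         # [0.  1.5]  -> f(x) = (3/2) x
print(np.array([1.0, 4.0]) @ beta_hat)  # predicted value at x = 4 -> 6.0
```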

SUMMARY: LINEAR REGRESSION

In the least squares method, we seek a linear function of the input variables that best fits the given training data. We do this by minimizing the residual sum of squares.

To minimize the residual sum of squares, we apply the second derivative test from multivariable calculus.

We can arrive at the same solution to the least squares problem using linear algebra.
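
As a further optional check, a generic least-squares routine can be written in a few lines. The sketch below (Python/NumPy again, purely illustrative; the helper name fit_least_squares is made up here) builds the design matrix with an intercept column and solves the least-squares problem with a library solver, which agrees with the normal-equations formula whenever $X^T X$ is invertible.

```python
import numpy as np

def fit_least_squares(X_raw, y):
    """Prepend an intercept column and return the least-squares coefficients."""
    X = np.column_stack([np.ones(len(X_raw)), X_raw])
    # np.linalg.lstsq minimizes ||X beta - y|| directly (the linear-algebra route);
    # it matches (X^T X)^{-1} X^T y when X^T X is invertible.
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta_hat

# Same data as the worked example above.
print(fit_least_squares(np.array([[1.0], [2.0], [3.0]]), np.array([1.0, 4.0, 4.0])))
```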

PROBLEM SET: LINEAR REGRESSION

1. Suppose we have the following training data: $(x_1, y_1) = (0, 2)$, $(x_2, y_2) = (1, 1)$, $(x_3, y_3) = (2, 4)$, $(x_4, y_4) = (3, 4)$. Find the best fit line using the least squares method. Find the predicted value for $x = 4$.

2. Suppose we have the following training data: $(x_1, y_1), (x_2, y_2), (x_3, y_3), (x_4, y_4)$ where

$$x_1 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \quad x_2 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \quad x_3 = \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \quad x_4 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$$

and $y_1 = 1$, $y_2 = 0$, $y_3 = 0$, $y_4 = 2$. Find the best fit plane using the least squares method. Find the predicted value for $x = \begin{bmatrix} 1 \\ 2 \end{bmatrix}$.

SOLUTION SET: LINEAR REGRESSION

1. Form

$$X = \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix} \quad \text{and} \quad y = \begin{bmatrix} 2 \\ 1 \\ 4 \\ 4 \end{bmatrix}.$$

The coefficients $\hat{\beta}_0, \hat{\beta}_1$ for the best fit line $f(x) = \hat{\beta}_0 + \hat{\beta}_1 x$ are given by $\begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{bmatrix} = (X^T X)^{-1} X^T y$.

$$X^T = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 0 & 1 & 2 & 3 \end{bmatrix}, \qquad X^T X = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 0 & 1 & 2 & 3 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix} = \begin{bmatrix} 4 & 6 \\ 6 & 14 \end{bmatrix}$$

$$(X^T X)^{-1} = \begin{bmatrix} 7/10 & -3/10 \\ -3/10 & 1/5 \end{bmatrix}$$

$$(X^T X)^{-1} X^T y = \begin{bmatrix} 7/10 & -3/10 \\ -3/10 & 1/5 \end{bmatrix} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 0 & 1 & 2 & 3 \end{bmatrix} \begin{bmatrix} 2 \\ 1 \\ 4 \\ 4 \end{bmatrix} = \begin{bmatrix} 14/10 \\ 9/10 \end{bmatrix}$$

So $\hat{\beta}_0 = \tfrac{14}{10}$ and $\hat{\beta}_1 = \tfrac{9}{10}$. Thus, the best fit line is given by $f(x) = \tfrac{14}{10} + \tfrac{9}{10}\,x$. The predicted value for $x = 4$ is $f(4) = \tfrac{14}{10} + \tfrac{9}{10} \cdot 4 = 5$.

2. Form

$$X = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 1 \end{bmatrix} \quad \text{and} \quad y = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 2 \end{bmatrix}.$$

The coefficients $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2$ for the best fit plane $f(x_1, x_2) = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2$ are given by $\begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \hat{\beta}_2 \end{bmatrix} = (X^T X)^{-1} X^T y$.

$$X^T = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 \end{bmatrix}, \qquad X^T X = \begin{bmatrix} 4 & 2 & 2 \\ 2 & 2 & 1 \\ 2 & 1 & 2 \end{bmatrix}$$

$$(X^T X)^{-1} = \frac{1}{4}\begin{bmatrix} 3 & -2 & -2 \\ -2 & 4 & 0 \\ -2 & 0 & 4 \end{bmatrix}$$

$$(X^T X)^{-1} X^T y = \frac{1}{4}\begin{bmatrix} 3 & -2 & -2 \\ -2 & 4 & 0 \\ -2 & 0 & 4 \end{bmatrix} \begin{bmatrix} 3 \\ 2 \\ 2 \end{bmatrix} = \frac{1}{4}\begin{bmatrix} 1 \\ 2 \\ 2 \end{bmatrix}$$

So $\hat{\beta}_0 = \tfrac{1}{4}$, $\hat{\beta}_1 = \tfrac{1}{2}$, $\hat{\beta}_2 = \tfrac{1}{2}$. Thus, the best fit plane is given by $f(x_1, x_2) = \tfrac{1}{4} + \tfrac{1}{2} x_1 + \tfrac{1}{2} x_2$. The predicted value for $x = \begin{bmatrix} 1 \\ 2 \end{bmatrix}$ is $f(1, 2) = \tfrac{7}{4}$.

3 LINEAR DISCRIMINANT ANALYSIS

CLASSIFICATION

In the problem of regression, we had a set of data $(x_1, y_1), \ldots, (x_N, y_N)$ and we wanted to predict the values of the response variable $Y$ for new data points. The values that $Y$ took were numerical, quantitative values. In certain problems, the values for the response variable $Y$ that we want to predict are not quantitative but qualitative. So the values for $Y$ will take on values from a finite set of classes or categories. Problems of this sort are called classification problems. Some examples of a classification problem are classifying an email as spam or not spam and classifying a patient's illness as one among a finite number of diseases.

LINEAR DISCRIMINANT ANALYSIS

One method for solving a classification problem is called linear discriminant analysis. What we'll do is estimate $\Pr(Y = k \mid X = x)$, the probability that $Y$ is the class $k$ given that the input variable $X$ is $x$. Once we have all of these probabilities for a fixed $x$, we pick the class $k$ for which the probability $\Pr(Y = k \mid X = x)$ is largest. We then classify $x$ as that class $k$.

THE POSTERIOR PROBABILITY FUNCTIONS

In this section, we'll build a formula for the posterior probability $\Pr(Y = k \mid X = x)$. Let $\pi_k = \Pr(Y = k)$, the prior probability that $Y = k$. Let $f_k(x) = \Pr(X = x \mid Y = k)$, the probability that $X = x$, given that $Y = k$. By Bayes' rule,

$$\Pr(Y = k \mid X = x) = \frac{\Pr(X = x \mid Y = k)\,\Pr(Y = k)}{\sum_{l=1}^{K} \Pr(X = x \mid Y = l)\,\Pr(Y = l)} = \frac{f_k(x)\,\pi_k}{\sum_{l=1}^{K} f_l(x)\,\pi_l} = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}.$$

Here we assume that $k$ can take on the values $1, \ldots, K$.

We can think of $\Pr(Y = k \mid X = x)$ as a function of $x$ and denote it as $p_k(x)$. So

$$p_k(x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}.$$

Recall that $p_k(x)$ is the posterior probability that $Y = k$ given that $X = x$.
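
As a quick numerical illustration of Bayes' rule in this form, the short Python/NumPy sketch below computes the posteriors $p_k(x)$ from prior probabilities and class-conditional density values at a fixed $x$; the particular numbers are made up purely for illustration.

```python
import numpy as np

# Illustrative values only: priors pi_k and class-conditional densities f_k(x)
# evaluated at one fixed point x, for K = 3 classes.
priors = np.array([0.3, 0.5, 0.2])          # pi_1, pi_2, pi_3
densities = np.array([0.05, 0.02, 0.09])    # f_1(x), f_2(x), f_3(x)

posteriors = priors * densities / np.sum(priors * densities)   # p_k(x) by Bayes' rule
print(posteriors)                  # the posteriors sum to 1
print(np.argmax(posteriors) + 1)   # the class k with the largest posterior
```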

MODELLING THE POSTERIOR PROBABILITY FUNCTIONS

Remember that we wanted to estimate $\Pr(Y = k \mid X = x)$ for any given $x$. That is, we want an estimate for $p_k(x)$. If we can get estimates for $\pi_k$, $f_k(x)$, $\pi_l$, and $f_l(x)$ for each $l = 1, \ldots, K$, then we would have an estimate for $p_k(x)$.

Let's say that $X = (X_1, X_2, \ldots, X_p)$ where $X_1, \ldots, X_p$ are the input variables. So the values of $X$ will be vectors of $p$ elements. We will assume that the conditional distribution of $X$ given $Y = k$ is the multivariate Gaussian distribution $N(\mu_k, \Sigma)$, where $\mu_k$ is a class-specific mean vector and $\Sigma$ is the covariance of $X$.

The class-specific mean vector $\mu_k$ is given by the vector of class-specific means $[\mu_{k1} \; \cdots \; \mu_{kp}]^T$, where $\mu_{kj}$ is the class-specific mean of $X_j$. So $\mu_{kj} = \sum_{i: y_i = k} x_{ij} \Pr(X_j = x_{ij})$. Recall that $x_i = [x_{i1} \; \cdots \; x_{ip}]^T$. (For all those $x_i$ for which $y_i = k$, we're taking the mean of their $j$th components.)

$\Sigma$, the covariance matrix of $X$, is given by the matrix of covariances of $X_i$ and $X_j$. So $\Sigma = (a_{ij})$, where $a_{ij} = \operatorname{Cov}(X_i, X_j) = E[(X_i - \mu_{X_i})(X_j - \mu_{X_j})]$.

The multivariate Gaussian density for the multivariate Gaussian distribution $N(\mu, \Sigma)$ is given by

$$f(x) = \frac{1}{(2\pi)^{p/2} \, \lvert\Sigma\rvert^{1/2}} \, e^{-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)}.$$

Since we're assuming that the conditional distribution of $X$ given $Y = k$ is the multivariate Gaussian distribution $N(\mu_k, \Sigma)$, we have that

$$\Pr(X = x \mid Y = k) = \frac{1}{(2\pi)^{p/2} \, \lvert\Sigma\rvert^{1/2}} \, e^{-\frac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k)}.$$

Recall that $f_k(x) = \Pr(X = x \mid Y = k)$. So

$$f_k(x) = \frac{1}{(2\pi)^{p/2} \, \lvert\Sigma\rvert^{1/2}} \, e^{-\frac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k)}.$$

Recall that

$$p_k(x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}.$$

Plugging in what we have for $f_k(x)$, we get

$$p_k(x) = \frac{\pi_k \, \dfrac{1}{(2\pi)^{p/2} \lvert\Sigma\rvert^{1/2}} \, e^{-\frac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k)}}{\sum_{l=1}^{K} \pi_l \, \dfrac{1}{(2\pi)^{p/2} \lvert\Sigma\rvert^{1/2}} \, e^{-\frac{1}{2}(x - \mu_l)^T \Sigma^{-1} (x - \mu_l)}} = \frac{\pi_k \, e^{-\frac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k)}}{\sum_{l=1}^{K} \pi_l \, e^{-\frac{1}{2}(x - \mu_l)^T \Sigma^{-1} (x - \mu_l)}}.$$

Note that the denominator is $(2\pi)^{p/2} \lvert\Sigma\rvert^{1/2} \sum_{l=1}^{K} \pi_l f_l(x)$ and that

$$\sum_{l=1}^{K} \pi_l f_l(x) = \sum_{l=1}^{K} f_l(x)\,\pi_l = \sum_{l=1}^{K} \Pr(X = x \mid Y = l)\,\Pr(Y = l) = \Pr(X = x).$$

So the denominator is just $(2\pi)^{p/2} \lvert\Sigma\rvert^{1/2} \Pr(X = x)$. Hence,

$$p_k(x) = \frac{\pi_k \, e^{-\frac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k)}}{(2\pi)^{p/2} \lvert\Sigma\rvert^{1/2} \Pr(X = x)}.$$
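
To make the density formula concrete, here is a minimal Python/NumPy sketch that evaluates the multivariate Gaussian density used above; the values chosen for $\mu$ and $\Sigma$ are illustrative only.

```python
import numpy as np

def gaussian_density(x, mu, Sigma):
    """Multivariate Gaussian density N(mu, Sigma) evaluated at x."""
    p = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.inv(Sigma) @ diff
    norm_const = (2 * np.pi) ** (p / 2) * np.linalg.det(Sigma) ** 0.5
    return np.exp(-0.5 * quad) / norm_const

# Illustrative values only.
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])
print(gaussian_density(np.array([0.5, -0.5]), mu, Sigma))
```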

LINEAR DISCRIMINANT FUNCTIONS

Recall that we want to choose the class $k$ for which the posterior probability $p_k(x)$ is largest. Since the logarithm function is order-preserving, maximizing $p_k(x)$ is the same as maximizing $\log p_k(x)$. Taking $\log p_k(x)$ gives

$$\log \frac{\pi_k \, e^{-\frac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k)}}{(2\pi)^{p/2} \lvert\Sigma\rvert^{1/2} \Pr(X = x)}$$
$$= \log \pi_k + \left(-\tfrac{1}{2}\right)(x - \mu_k)^T \Sigma^{-1} (x - \mu_k) - \log\!\left((2\pi)^{p/2} \lvert\Sigma\rvert^{1/2} \Pr(X = x)\right)$$
$$= \log \pi_k + \left(-\tfrac{1}{2}\right)(x - \mu_k)^T \Sigma^{-1} (x - \mu_k) - \log C, \quad \text{where } C = (2\pi)^{p/2} \lvert\Sigma\rvert^{1/2} \Pr(X = x)$$
$$= \log \pi_k - \tfrac{1}{2}\,(x^T \Sigma^{-1} - \mu_k^T \Sigma^{-1})(x - \mu_k) - \log C$$
$$= \log \pi_k - \tfrac{1}{2}\left[x^T \Sigma^{-1} x - x^T \Sigma^{-1} \mu_k - \mu_k^T \Sigma^{-1} x + \mu_k^T \Sigma^{-1} \mu_k\right] - \log C$$
$$= \log \pi_k - \tfrac{1}{2}\left[x^T \Sigma^{-1} x - 2\,x^T \Sigma^{-1} \mu_k + \mu_k^T \Sigma^{-1} \mu_k\right] - \log C, \quad \text{because } x^T \Sigma^{-1} \mu_k = \mu_k^T \Sigma^{-1} x.$$

(Proof: $x^T \Sigma^{-1} \mu_k$ is a scalar, so $x^T \Sigma^{-1} \mu_k = (x^T \Sigma^{-1} \mu_k)^T = \mu_k^T (\Sigma^{-1})^T x = \mu_k^T (\Sigma^T)^{-1} x = \mu_k^T \Sigma^{-1} x$, because $\Sigma$ is symmetric.)

$$= \log \pi_k - \tfrac{1}{2}\,x^T \Sigma^{-1} x + x^T \Sigma^{-1} \mu_k - \tfrac{1}{2}\,\mu_k^T \Sigma^{-1} \mu_k - \log C$$
$$= x^T \Sigma^{-1} \mu_k - \tfrac{1}{2}\,\mu_k^T \Sigma^{-1} \mu_k + \log \pi_k - \tfrac{1}{2}\,x^T \Sigma^{-1} x - \log C.$$

Let $\delta_k(x) = x^T \Sigma^{-1} \mu_k - \tfrac{1}{2}\,\mu_k^T \Sigma^{-1} \mu_k + \log \pi_k$. Then $\log p_k(x) = \delta_k(x) - \tfrac{1}{2}\,x^T \Sigma^{-1} x - \log C$. $\delta_k(x)$ is called a linear discriminant function. Maximizing $\log p_k(x)$ is the same as maximizing $\delta_k(x)$, since $-\tfrac{1}{2}\,x^T \Sigma^{-1} x - \log C$ does not depend on $k$.

ESTIMATING THE LINEAR DISCRIMINANT FUNCTIONS

Now, if we can find estimates for $\pi_k$, $\mu_k$, and $\Sigma$, then we would have an estimate for $p_k(x)$ and hence for $\log p_k(x)$ and $\delta_k(x)$. In an attempt to maximize $p_k(x)$, we instead maximize the estimate of $p_k(x)$, which is the same as maximizing the estimate of $\delta_k(x)$.

$\pi_k$ can be estimated as $\hat{\pi}_k = \frac{N_k}{N}$, where $N_k$ is the number of training data points in class $k$ and $N$ is the total number of training data points. Remember $\pi_k = \Pr(Y = k)$. We're estimating this by just taking the proportion of data points in class $k$.

The class-specific mean vector is $\mu_k = [\mu_{k1} \; \cdots \; \mu_{kp}]^T$, where $\mu_{kj} = \sum_{i: y_i = k} x_{ij} \Pr(X_j = x_{ij})$. We can estimate $\mu_{kj}$ as $\frac{1}{N_k} \sum_{i: y_i = k} x_{ij}$. So we can estimate $\mu_k$ as

$$\hat{\mu}_k = \begin{bmatrix} \frac{1}{N_k} \sum_{i: y_i = k} x_{i1} \\ \vdots \\ \frac{1}{N_k} \sum_{i: y_i = k} x_{ip} \end{bmatrix} = \frac{1}{N_k} \begin{bmatrix} \sum_{i: y_i = k} x_{i1} \\ \vdots \\ \sum_{i: y_i = k} x_{ip} \end{bmatrix} = \frac{1}{N_k} \sum_{i: y_i = k} x_i.$$

In other words, $\hat{\mu}_k = \frac{1}{N_k} \sum_{i: y_i = k} x_i$. We estimate the class-specific mean vector by the vector of averages of each component over all $x_i$ in class $k$.

Finally, the covariance matrix $\Sigma$ is estimated as

$$\hat{\Sigma} = \frac{1}{N - K} \sum_{k=1}^{K} \sum_{i: y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T.$$

Recall that $\delta_k(x) = x^T \Sigma^{-1} \mu_k - \tfrac{1}{2}\,\mu_k^T \Sigma^{-1} \mu_k + \log \pi_k$. So

$$\hat{\delta}_k(x) = x^T \hat{\Sigma}^{-1} \hat{\mu}_k - \tfrac{1}{2}\,\hat{\mu}_k^T \hat{\Sigma}^{-1} \hat{\mu}_k + \log \hat{\pi}_k.$$

Note that $\hat{\Sigma}$, $\hat{\mu}_k$, and $\hat{\pi}_k$ only depend on the training data and not on $x$. Note that $x$ is a vector and $x^T \hat{\Sigma}^{-1} \hat{\mu}_k$ is a linear combination of the components of $x$. Hence, $\hat{\delta}_k(x)$ is a linear combination of the components of $x$. This is why it's called a linear discriminant function.

CLASSIFYING DATA POINTS USING LINEAR DISCRIMINANT FUNCTIONS

If $(k_1, k_2)$ is a pair of classes, we can consider whether $\hat{\delta}_{k_1}(x) > \hat{\delta}_{k_2}(x)$. If so, we know $x$ is not in class $k_2$. Then, we can compare $\hat{\delta}_{k_1}(x)$ and $\hat{\delta}_{k_3}(x)$ and rule out another class. Once we've exhausted all the classes, we'll know which class $x$ should be assigned to.

Setting $\hat{\delta}_{k_1}(x) = \hat{\delta}_{k_2}(x)$, we get

$$x^T \hat{\Sigma}^{-1} \hat{\mu}_{k_1} - \tfrac{1}{2}\,\hat{\mu}_{k_1}^T \hat{\Sigma}^{-1} \hat{\mu}_{k_1} + \log \hat{\pi}_{k_1} = x^T \hat{\Sigma}^{-1} \hat{\mu}_{k_2} - \tfrac{1}{2}\,\hat{\mu}_{k_2}^T \hat{\Sigma}^{-1} \hat{\mu}_{k_2} + \log \hat{\pi}_{k_2}.$$

This gives us a hyperplane in $\mathbb{R}^p$ which separates class $k_1$ from class $k_2$. If we find the separating hyperplane for each pair of classes, we get a partition of the space into class regions. [Figure: separating hyperplanes for each pair of classes; in this example, $p = 2$ and $K = 3$.]

LDA EXAMPLE 1

Suppose we have a set of data $(x_1, y_1), \ldots, (x_6, y_6)$ as follows: $x_1 = (1, 3)$, $x_2 = (2, 3)$, $x_3 = (2, 4)$, $x_4 = (3, 1)$, $x_5 = (3, 2)$, $x_6 = (4, 2)$, with $y_1 = y_2 = y_3 = k_1 = 1$ and $y_4 = y_5 = y_6 = k_2 = 2$. Apply linear discriminant analysis by doing the following:

a) Find estimates for the linear discriminant functions $\delta_1(x)$ and $\delta_2(x)$.
b) Find the line that decides between the two classes.
c) Classify the new point $x = (5, 0)$.

Solution: [Figure: a graph of the six data points.]

The number of features $p$ is 2, the number of classes $K$ is 2, the total number of data points $N$ is 6, the number $N_1$ of data points in class $k_1$ is 3, and the number $N_2$ of data points in class $k_2$ is 3.

First, we will find estimates for $\pi_1$ and $\pi_2$, the prior probabilities that $Y = k_1$ and $Y = k_2$, respectively. Then, we will find estimates for $\mu_1$ and $\mu_2$, the class-specific mean vectors. We can then calculate the estimate for the covariance matrix $\Sigma$. Finally, using the estimates $\hat{\pi}_1, \hat{\pi}_2, \hat{\mu}_1, \hat{\mu}_2, \hat{\Sigma}$, we can find the estimates for the linear discriminant functions $\delta_1(x)$ and $\delta_2(x)$.

$$\hat{\pi}_1 = \frac{N_1}{N} = \frac{3}{6} = \frac{1}{2}, \qquad \hat{\pi}_2 = \frac{N_2}{N} = \frac{3}{6} = \frac{1}{2}$$

$$\hat{\mu}_1 = \frac{1}{N_1} \sum_{i: y_i = 1} x_i = \frac{1}{3}\,[x_1 + x_2 + x_3] = \begin{bmatrix} 5/3 \\ 10/3 \end{bmatrix}, \qquad \hat{\mu}_2 = \frac{1}{N_2} \sum_{i: y_i = 2} x_i = \frac{1}{3}\,[x_4 + x_5 + x_6] = \begin{bmatrix} 10/3 \\ 5/3 \end{bmatrix}$$

$$\hat{\Sigma} = \frac{1}{N - K} \sum_{k=1}^{K} \sum_{i: y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T = \frac{1}{6 - 2} \sum_{k=1}^{2} \sum_{i: y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T$$

Plugging in what we got for $\hat{\mu}_1$ and $\hat{\mu}_2$, we get

$$\hat{\Sigma} = \frac{1}{4} \begin{bmatrix} 4/3 & 2/3 \\ 2/3 & 4/3 \end{bmatrix} = \begin{bmatrix} 1/3 & 1/6 \\ 1/6 & 1/3 \end{bmatrix}, \qquad \hat{\Sigma}^{-1} = \begin{bmatrix} 4 & -2 \\ -2 & 4 \end{bmatrix}$$

$$\hat{\delta}_1(x) = x^T \hat{\Sigma}^{-1} \hat{\mu}_1 - \tfrac{1}{2}\,\hat{\mu}_1^T \hat{\Sigma}^{-1} \hat{\mu}_1 + \log \hat{\pi}_1 = x^T \begin{bmatrix} 0 \\ 10 \end{bmatrix} - \tfrac{1}{2}\left(\tfrac{100}{3}\right) + \log \tfrac{1}{2} = 10 X_2 - \tfrac{50}{3} + \log \tfrac{1}{2}$$

$$\hat{\delta}_2(x) = x^T \hat{\Sigma}^{-1} \hat{\mu}_2 - \tfrac{1}{2}\,\hat{\mu}_2^T \hat{\Sigma}^{-1} \hat{\mu}_2 + \log \hat{\pi}_2 = x^T \begin{bmatrix} 10 \\ 0 \end{bmatrix} - \tfrac{1}{2}\left(\tfrac{100}{3}\right) + \log \tfrac{1}{2} = 10 X_1 - \tfrac{50}{3} + \log \tfrac{1}{2}$$

Setting $\hat{\delta}_1(x) = \hat{\delta}_2(x)$:

$$10 X_2 - \tfrac{50}{3} + \log \tfrac{1}{2} = 10 X_1 - \tfrac{50}{3} + \log \tfrac{1}{2}$$
$$10 X_2 = 10 X_1$$
$$X_2 = X_1.$$

So, the line that decides between the two classes is given by $X_2 = X_1$. [Figure: the data points and the deciding line $X_2 = X_1$.]

If $\hat{\delta}_1(x) > \hat{\delta}_2(x)$, then we classify $x$ as of class $k_1$. So if $x$ is above the line $X_2 = X_1$, then we classify $x$ as of class $k_1$. Conversely, if $\hat{\delta}_1(x) < \hat{\delta}_2(x)$, then we classify $x$ as of class $k_2$. This corresponds to $x$ being below the line $X_2 = X_1$. The point $(5, 0)$ is below the line; so we classify it as of class $k_2$.

LDA EXAMPLE 2

Suppose we have a set of data $(x_1, y_1), \ldots, (x_6, y_6)$ as follows: $x_1 = (0, 2)$, $x_2 = (1, 2)$, $x_3 = (2, 0)$, $x_4 = (2, 1)$, $x_5 = (3, 3)$, $x_6 = (4, 4)$, with $y_1 = y_2 = k_1 = 1$, $y_3 = y_4 = k_2 = 2$, and $y_5 = y_6 = k_3 = 3$. Apply linear discriminant analysis by doing the following:

a) Find estimates for the linear discriminant functions $\delta_1(x)$, $\delta_2(x)$, and $\delta_3(x)$.
b) Find the lines that decide between each pair of classes.
c) Classify the new point $x = (1, 3)$.

Solution: [Figure: a graph of the six data points.]

The number of features $p$ is 2, the number of classes $K$ is 3, the total number of data points $N$ is 6, the number $N_1$ of data points in class $k_1$ is 2, the number $N_2$ of data points in class $k_2$ is 2, and the number $N_3$ of data points in class $k_3$ is 2.

First, we will find estimates for $\pi_1, \pi_2, \pi_3$, the prior probabilities that $Y = k_1$, $Y = k_2$, $Y = k_3$, respectively. Then, we will find estimates for $\mu_1, \mu_2, \mu_3$, the class-specific mean vectors. We can then calculate the estimate for the covariance matrix $\Sigma$. Finally, using the estimates $\hat{\pi}_1, \hat{\pi}_2, \hat{\pi}_3, \hat{\mu}_1, \hat{\mu}_2, \hat{\mu}_3, \hat{\Sigma}$, we can find the estimates for the linear discriminant functions $\delta_1(x)$, $\delta_2(x)$, and $\delta_3(x)$.

$$\hat{\pi}_1 = \frac{N_1}{N} = \frac{2}{6} = \frac{1}{3}, \qquad \hat{\pi}_2 = \frac{N_2}{N} = \frac{2}{6} = \frac{1}{3}, \qquad \hat{\pi}_3 = \frac{N_3}{N} = \frac{2}{6} = \frac{1}{3}$$

$$\hat{\mu}_1 = \frac{1}{N_1} \sum_{i: y_i = 1} x_i = \frac{1}{2}\,[x_1 + x_2] = \begin{bmatrix} 1/2 \\ 2 \end{bmatrix}, \qquad \hat{\mu}_2 = \frac{1}{N_2} \sum_{i: y_i = 2} x_i = \frac{1}{2}\,[x_3 + x_4] = \begin{bmatrix} 2 \\ 1/2 \end{bmatrix}$$

$$\hat{\mu}_3 = \frac{1}{N_3} \sum_{i: y_i = 3} x_i = \frac{1}{2}\,[x_5 + x_6] = \begin{bmatrix} 7/2 \\ 7/2 \end{bmatrix}$$

$$\hat{\Sigma} = \frac{1}{N - K} \sum_{k=1}^{K} \sum_{i: y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T = \frac{1}{6 - 3} \begin{bmatrix} 1 & 1/2 \\ 1/2 & 1 \end{bmatrix} = \begin{bmatrix} 1/3 & 1/6 \\ 1/6 & 1/3 \end{bmatrix}, \qquad \hat{\Sigma}^{-1} = \begin{bmatrix} 4 & -2 \\ -2 & 4 \end{bmatrix}$$

$$\hat{\delta}_1(x) = x^T \hat{\Sigma}^{-1} \hat{\mu}_1 - \tfrac{1}{2}\,\hat{\mu}_1^T \hat{\Sigma}^{-1} \hat{\mu}_1 + \log \hat{\pi}_1 = x^T \begin{bmatrix} -2 \\ 7 \end{bmatrix} - \tfrac{1}{2}(13) + \log \tfrac{1}{3} = -2 X_1 + 7 X_2 - \tfrac{13}{2} + \log \tfrac{1}{3}$$

$$\hat{\delta}_2(x) = x^T \hat{\Sigma}^{-1} \hat{\mu}_2 - \tfrac{1}{2}\,\hat{\mu}_2^T \hat{\Sigma}^{-1} \hat{\mu}_2 + \log \hat{\pi}_2 = x^T \begin{bmatrix} 7 \\ -2 \end{bmatrix} - \tfrac{1}{2}(13) + \log \tfrac{1}{3} = 7 X_1 - 2 X_2 - \tfrac{13}{2} + \log \tfrac{1}{3}$$

$$\hat{\delta}_3(x) = x^T \hat{\Sigma}^{-1} \hat{\mu}_3 - \tfrac{1}{2}\,\hat{\mu}_3^T \hat{\Sigma}^{-1} \hat{\mu}_3 + \log \hat{\pi}_3 = x^T \begin{bmatrix} 7 \\ 7 \end{bmatrix} - \tfrac{1}{2}(49) + \log \tfrac{1}{3} = 7 X_1 + 7 X_2 - \tfrac{49}{2} + \log \tfrac{1}{3}$$

Setting $\hat{\delta}_1(x) = \hat{\delta}_2(x)$:

$$-2 X_1 + 7 X_2 - \tfrac{13}{2} + \log \tfrac{1}{3} = 7 X_1 - 2 X_2 - \tfrac{13}{2} + \log \tfrac{1}{3}$$
$$-2 X_1 + 7 X_2 = 7 X_1 - 2 X_2$$
$$9 X_2 = 9 X_1$$

$$X_2 = X_1.$$

So, the line that decides between classes $k_1$ and $k_2$ is given by $X_2 = X_1$.

Setting $\hat{\delta}_1(x) = \hat{\delta}_3(x)$:

$$-2 X_1 + 7 X_2 - \tfrac{13}{2} + \log \tfrac{1}{3} = 7 X_1 + 7 X_2 - \tfrac{49}{2} + \log \tfrac{1}{3}$$
$$18 = 9 X_1$$
$$X_1 = 2.$$

So, the line that decides between classes $k_1$ and $k_3$ is given by $X_1 = 2$.

Setting $\hat{\delta}_2(x) = \hat{\delta}_3(x)$:

$$7 X_1 - 2 X_2 - \tfrac{13}{2} + \log \tfrac{1}{3} = 7 X_1 + 7 X_2 - \tfrac{49}{2} + \log \tfrac{1}{3}$$
$$18 = 9 X_2$$
$$X_2 = 2.$$

So, the line that decides between classes $k_2$ and $k_3$ is given by $X_2 = 2$.

[Figure: the data points and the three deciding lines, which divide the plane into 3 regions.]

The lines divide the plane into 3 regions. $\hat{\delta}_1(x) > \hat{\delta}_2(x)$ corresponds to the region above the line $X_2 = X_1$. Conversely, $\hat{\delta}_1(x) < \hat{\delta}_2(x)$ corresponds to the region below the line $X_2 = X_1$. $\hat{\delta}_1(x) > \hat{\delta}_3(x)$ corresponds to the region to the left of the line $X_1 = 2$. Conversely, $\hat{\delta}_1(x) < \hat{\delta}_3(x)$ corresponds to the region to the right of $X_1 = 2$.

$\hat{\delta}_2(x) > \hat{\delta}_3(x)$ corresponds to the region below the line $X_2 = 2$. Conversely, $\hat{\delta}_2(x) < \hat{\delta}_3(x)$ corresponds to the region above the line $X_2 = 2$.

If $\hat{\delta}_1(x) > \hat{\delta}_2(x)$ and $\hat{\delta}_1(x) > \hat{\delta}_3(x)$, then we classify $x$ as of class $k_1$. So if $x$ is in region I, then we classify $x$ as of class $k_1$. Conversely, if $x$ is in region II, then we classify $x$ as of class $k_2$; and if $x$ is in region III, we classify $x$ as of class $k_3$. The point $(1, 3)$ is in region I; so we classify it as of class $k_1$.
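
If you would like to double-check these computations numerically, the following optional Python/NumPy sketch re-enters the data of this example, estimates $\hat{\pi}_k$, $\hat{\mu}_k$, and the pooled $\hat{\Sigma}$ exactly as above, and evaluates the estimated discriminant functions at the new point.

```python
import numpy as np

# Training data from LDA Example 2, grouped by class.
classes = {
    1: np.array([[0.0, 2.0], [1.0, 2.0]]),
    2: np.array([[2.0, 0.0], [2.0, 1.0]]),
    3: np.array([[3.0, 3.0], [4.0, 4.0]]),
}
N = sum(len(pts) for pts in classes.values())
K = len(classes)

pi_hat = {k: len(pts) / N for k, pts in classes.items()}
mu_hat = {k: pts.mean(axis=0) for k, pts in classes.items()}
# Pooled covariance: 1/(N-K) * sum over classes of sum_i (x_i - mu_k)(x_i - mu_k)^T
Sigma_hat = sum((pts - mu_hat[k]).T @ (pts - mu_hat[k]) for k, pts in classes.items()) / (N - K)
Sigma_inv = np.linalg.inv(Sigma_hat)

def delta_hat(k, x):
    """Estimated linear discriminant function for class k."""
    return x @ Sigma_inv @ mu_hat[k] - 0.5 * mu_hat[k] @ Sigma_inv @ mu_hat[k] + np.log(pi_hat[k])

x_new = np.array([1.0, 3.0])
print(Sigma_inv)                                        # [[ 4. -2.] [-2.  4.]]
print({k: round(delta_hat(k, x_new), 3) for k in classes})
print(max(classes, key=lambda k: delta_hat(k, x_new)))  # -> 1, i.e., class k_1
```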

SUMMARY: LINEAR DISCRIMINANT ANALYSIS

In linear discriminant analysis, we find estimates $\hat{p}_k(x)$ of the posterior probability $p_k(x)$ that $Y = k$ given that $X = x$. We classify $x$ according to the class $k$ that gives the highest estimated posterior probability $\hat{p}_k(x)$.

Maximizing the estimated posterior probability $\hat{p}_k(x)$ is equivalent to maximizing the log of $\hat{p}_k(x)$, which, in turn, is equivalent to maximizing the estimated linear discriminant function $\hat{\delta}_k(x)$.

We find estimates of the prior probability $\pi_k$ that $Y = k$, of the class-specific mean vectors $\mu_k$, and of the covariance matrix $\Sigma$ in order to estimate the linear discriminant functions $\delta_k(x)$.

By setting $\hat{\delta}_{k_1}(x) = \hat{\delta}_{k_2}(x)$ for each pair $(k_1, k_2)$ of classes, we get hyperplanes in $\mathbb{R}^p$ that, together, divide $\mathbb{R}^p$ into regions corresponding to the distinct classes. We classify $x$ according to the class $k$ for which $\hat{\delta}_k(x)$ is largest.
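
The whole recipe in this summary fits in a few lines of code. The optional sketch below (Python/NumPy; the helper names lda_fit and lda_predict are made up here) estimates the quantities above from a data matrix and a label vector and classifies a new point by the largest estimated discriminant; applied to LDA Example 1, it reproduces the classification of $(5, 0)$ as class $k_2$.

```python
import numpy as np

def lda_fit(X, y):
    """Estimate pi_k, mu_k, and the pooled covariance from training data."""
    labels = np.unique(y)
    N, K = len(y), len(labels)
    pi = {k: np.mean(y == k) for k in labels}
    mu = {k: X[y == k].mean(axis=0) for k in labels}
    Sigma = sum((X[y == k] - mu[k]).T @ (X[y == k] - mu[k]) for k in labels) / (N - K)
    return pi, mu, np.linalg.inv(Sigma)

def lda_predict(x, pi, mu, Sigma_inv):
    """Assign x to the class whose estimated discriminant function is largest."""
    def delta(k):
        return x @ Sigma_inv @ mu[k] - 0.5 * mu[k] @ Sigma_inv @ mu[k] + np.log(pi[k])
    return max(pi, key=delta)

# LDA Example 1: class 1 = (1,3), (2,3), (2,4); class 2 = (3,1), (3,2), (4,2).
X = np.array([[1, 3], [2, 3], [2, 4], [3, 1], [3, 2], [4, 2]], dtype=float)
y = np.array([1, 1, 1, 2, 2, 2])
pi, mu, Sigma_inv = lda_fit(X, y)
print(lda_predict(np.array([5.0, 0.0]), pi, mu, Sigma_inv))  # -> 2
```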

PROBLEM SET: LINEAR DISCRIMINANT ANALYSIS

1. Suppose we have a set of data $(x_1, y_1), \ldots, (x_6, y_6)$ as follows: $x_1 = (1, 2)$, $x_2 = (2, 1)$, $x_3 = (2, 2)$, $x_4 = (3, 3)$, $x_5 = (3, 4)$, $x_6 = (4, 3)$, with $y_1 = y_2 = y_3 = k_1 = 1$ and $y_4 = y_5 = y_6 = k_2 = 2$. Apply linear discriminant analysis by doing the following:

a) Find estimates for the linear discriminant functions $\delta_1(x)$ and $\delta_2(x)$.
b) Find the line that decides between the two classes.
c) Classify the new point $x = (4, 5)$.

2. Suppose we have a set of data $(x_1, y_1), \ldots, (x_6, y_6)$ as follows: $x_1 = (0, 0)$, $x_2 = (1, 1)$, $x_3 = (2, 3)$, $x_4 = (2, 4)$, $x_5 = (3, 2)$, $x_6 = (4, 2)$, with $y_1 = y_2 = k_1 = 1$, $y_3 = y_4 = k_2 = 2$, and $y_5 = y_6 = k_3 = 3$. Apply linear discriminant analysis by doing the following:

a) Find estimates for the linear discriminant functions $\delta_1(x)$, $\delta_2(x)$, and $\delta_3(x)$.
b) Find the lines that decide between each pair of classes.
c) Classify the new point $x = (3, 0)$.

SOLUTION SET: LINEAR DISCRIMINANT ANALYSIS

1. [Figure: a graph of the six data points.]

The number of features $p$ is 2, the number of classes $K$ is 2, the total number of data points $N$ is 6, the number $N_1$ of data points in class $k_1$ is 3, and the number $N_2$ of data points in class $k_2$ is 3.

First, we will find estimates for $\pi_1$ and $\pi_2$, the prior probabilities that $Y = k_1$ and $Y = k_2$, respectively. Then, we will find estimates for $\mu_1$ and $\mu_2$, the class-specific mean vectors. We can then calculate the estimate for the covariance matrix $\Sigma$. Finally, using the estimates $\hat{\pi}_1, \hat{\pi}_2, \hat{\mu}_1, \hat{\mu}_2, \hat{\Sigma}$, we can find the estimates for the linear discriminant functions $\delta_1(x)$ and $\delta_2(x)$.

$$\hat{\pi}_1 = \frac{N_1}{N} = \frac{3}{6} = \frac{1}{2}, \qquad \hat{\pi}_2 = \frac{N_2}{N} = \frac{3}{6} = \frac{1}{2}$$

$$\hat{\mu}_1 = \frac{1}{N_1} \sum_{i: y_i = 1} x_i = \frac{1}{3}\,[x_1 + x_2 + x_3] = \begin{bmatrix} 5/3 \\ 5/3 \end{bmatrix}$$

$$\hat{\mu}_2 = \frac{1}{N_2} \sum_{i: y_i = 2} x_i = \frac{1}{3}\,[x_4 + x_5 + x_6] = \begin{bmatrix} 10/3 \\ 10/3 \end{bmatrix}$$

$$\hat{\Sigma} = \frac{1}{N - K} \sum_{k=1}^{K} \sum_{i: y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T = \frac{1}{6 - 2} \begin{bmatrix} 4/3 & -2/3 \\ -2/3 & 4/3 \end{bmatrix} = \begin{bmatrix} 1/3 & -1/6 \\ -1/6 & 1/3 \end{bmatrix}, \qquad \hat{\Sigma}^{-1} = \begin{bmatrix} 4 & 2 \\ 2 & 4 \end{bmatrix}$$

$$\hat{\delta}_1(x) = x^T \hat{\Sigma}^{-1} \hat{\mu}_1 - \tfrac{1}{2}\,\hat{\mu}_1^T \hat{\Sigma}^{-1} \hat{\mu}_1 + \log \hat{\pi}_1 = x^T \begin{bmatrix} 10 \\ 10 \end{bmatrix} - \tfrac{1}{2}\left(\tfrac{100}{3}\right) + \log \tfrac{1}{2} = 10 X_1 + 10 X_2 - \tfrac{50}{3} + \log \tfrac{1}{2}$$

$$\hat{\delta}_2(x) = x^T \hat{\Sigma}^{-1} \hat{\mu}_2 - \tfrac{1}{2}\,\hat{\mu}_2^T \hat{\Sigma}^{-1} \hat{\mu}_2 + \log \hat{\pi}_2 = x^T \begin{bmatrix} 20 \\ 20 \end{bmatrix} - \tfrac{1}{2}\left(\tfrac{400}{3}\right) + \log \tfrac{1}{2} = 20 X_1 + 20 X_2 - \tfrac{200}{3} + \log \tfrac{1}{2}$$

Setting $\hat{\delta}_1(x) = \hat{\delta}_2(x)$:

$$10 X_1 + 10 X_2 - \tfrac{50}{3} + \log \tfrac{1}{2} = 20 X_1 + 20 X_2 - \tfrac{200}{3} + \log \tfrac{1}{2}$$
$$\tfrac{150}{3} = 10 X_1 + 10 X_2$$
$$50 = 10 X_1 + 10 X_2$$
$$5 = X_1 + X_2$$
$$X_2 = -X_1 + 5.$$

So, the line that decides between the two classes is given by $X_2 = -X_1 + 5$. [Figure: the data points and the decision line $X_2 = -X_1 + 5$.]

If $\hat{\delta}_1(x) > \hat{\delta}_2(x)$, then we classify $x$ as of class $k_1$. So if $x$ is below the line $X_2 = -X_1 + 5$, then we classify $x$ as of class $k_1$. Conversely, if $\hat{\delta}_1(x) < \hat{\delta}_2(x)$, then we classify $x$ as of class $k_2$. This corresponds to $x$ being above the line $X_2 = -X_1 + 5$. The point $(4, 5)$ is above the line; so we classify it as of class $k_2$.

2. [Figure: a graph of the six data points.]

The number of features $p$ is 2, the number of classes $K$ is 3, the total number of data points $N$ is 6, the number $N_1$ of data points in class $k_1$ is 2, the number $N_2$ of data points in class $k_2$ is 2, and the number $N_3$ of data points in class $k_3$ is 2.

First, we will find estimates for $\pi_1, \pi_2, \pi_3$, the prior probabilities that $Y = k_1$, $Y = k_2$, $Y = k_3$, respectively. Then, we will find estimates for $\mu_1, \mu_2, \mu_3$, the class-specific mean vectors. We can then calculate the estimate for the covariance matrix $\Sigma$. Finally, using the estimates $\hat{\pi}_1, \hat{\pi}_2, \hat{\pi}_3, \hat{\mu}_1, \hat{\mu}_2, \hat{\mu}_3, \hat{\Sigma}$, we can find the estimates for the linear discriminant functions $\delta_1(x)$, $\delta_2(x)$, and $\delta_3(x)$.

$$\hat{\pi}_1 = \frac{N_1}{N} = \frac{2}{6} = \frac{1}{3}, \qquad \hat{\pi}_2 = \frac{N_2}{N} = \frac{2}{6} = \frac{1}{3}, \qquad \hat{\pi}_3 = \frac{N_3}{N} = \frac{2}{6} = \frac{1}{3}$$

$$\hat{\mu}_1 = \frac{1}{N_1} \sum_{i: y_i = 1} x_i = \frac{1}{2}\,[x_1 + x_2] = \begin{bmatrix} 1/2 \\ 1/2 \end{bmatrix}, \qquad \hat{\mu}_2 = \frac{1}{N_2} \sum_{i: y_i = 2} x_i = \frac{1}{2}\,[x_3 + x_4] = \begin{bmatrix} 2 \\ 7/2 \end{bmatrix}, \qquad \hat{\mu}_3 = \frac{1}{N_3} \sum_{i: y_i = 3} x_i = \frac{1}{2}\,[x_5 + x_6] = \begin{bmatrix} 7/2 \\ 2 \end{bmatrix}$$

$$\hat{\Sigma} = \frac{1}{N - K} \sum_{k=1}^{K} \sum_{i: y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T = \frac{1}{6 - 3} \begin{bmatrix} 1 & 1/2 \\ 1/2 & 1 \end{bmatrix} = \begin{bmatrix} 1/3 & 1/6 \\ 1/6 & 1/3 \end{bmatrix}, \qquad \hat{\Sigma}^{-1} = \begin{bmatrix} 4 & -2 \\ -2 & 4 \end{bmatrix}$$

$$\hat{\delta}_1(x) = x^T \hat{\Sigma}^{-1} \hat{\mu}_1 - \tfrac{1}{2}\,\hat{\mu}_1^T \hat{\Sigma}^{-1} \hat{\mu}_1 + \log \hat{\pi}_1 = x^T \begin{bmatrix} 1 \\ 1 \end{bmatrix} - \tfrac{1}{2}(1) + \log \tfrac{1}{3} = X_1 + X_2 - \tfrac{1}{2} + \log \tfrac{1}{3}$$

$$\hat{\delta}_2(x) = x^T \hat{\Sigma}^{-1} \hat{\mu}_2 - \tfrac{1}{2}\,\hat{\mu}_2^T \hat{\Sigma}^{-1} \hat{\mu}_2 + \log \hat{\pi}_2 = x^T \begin{bmatrix} 1 \\ 10 \end{bmatrix} - \tfrac{1}{2}(37) + \log \tfrac{1}{3} = X_1 + 10 X_2 - \tfrac{37}{2} + \log \tfrac{1}{3}$$

$$\hat{\delta}_3(x) = x^T \hat{\Sigma}^{-1} \hat{\mu}_3 - \tfrac{1}{2}\,\hat{\mu}_3^T \hat{\Sigma}^{-1} \hat{\mu}_3 + \log \hat{\pi}_3 = x^T \begin{bmatrix} 10 \\ 1 \end{bmatrix} - \tfrac{1}{2}(37) + \log \tfrac{1}{3} = 10 X_1 + X_2 - \tfrac{37}{2} + \log \tfrac{1}{3}$$

Setting $\hat{\delta}_1(x) = \hat{\delta}_2(x)$:

$$X_1 + X_2 - \tfrac{1}{2} + \log \tfrac{1}{3} = X_1 + 10 X_2 - \tfrac{37}{2} + \log \tfrac{1}{3}$$
$$18 = 9 X_2$$
$$X_2 = 2.$$

So, the line that decides between classes $k_1$ and $k_2$ is given by $X_2 = 2$.

Setting $\hat{\delta}_1(x) = \hat{\delta}_3(x)$:

$$X_1 + X_2 - \tfrac{1}{2} + \log \tfrac{1}{3} = 10 X_1 + X_2 - \tfrac{37}{2} + \log \tfrac{1}{3}$$
$$18 = 9 X_1$$
$$X_1 = 2.$$

So, the line that decides between classes $k_1$ and $k_3$ is given by $X_1 = 2$.

Setting $\hat{\delta}_2(x) = \hat{\delta}_3(x)$:

$$X_1 + 10 X_2 - \tfrac{37}{2} + \log \tfrac{1}{3} = 10 X_1 + X_2 - \tfrac{37}{2} + \log \tfrac{1}{3}$$
$$9 X_2 = 9 X_1$$
$$X_2 = X_1.$$

So, the line that decides between classes $k_2$ and $k_3$ is given by $X_2 = X_1$.

[Figure: the data points and the three decision lines, which divide the plane into 3 regions.]

The lines divide the plane into 3 regions. If $x$ is in region I, then we classify $x$ as of class $k_1$. Similarly, points in region II get classified as of class $k_2$, and points in region III get classified as of class $k_3$. The point $(3, 0)$ is in region III; so we classify it as of class $k_3$.

43 MATH FOR MACHINE LEARNING 35

Linear Algebra for Beginners Open Doors to Great Careers. Richard Han

Linear Algebra for Beginners Open Doors to Great Careers. Richard Han Linear Algebra for Beginners Open Doors to Great Careers Richard Han Copyright 2018 Richard Han All rights reserved. CONTENTS PREFACE... 7 1 - INTRODUCTION... 8 2 SOLVING SYSTEMS OF LINEAR EQUATIONS...

More information

Machine Learning Support Vector Machines. Prof. Matteo Matteucci

Machine Learning Support Vector Machines. Prof. Matteo Matteucci Machine Learning Support Vector Machines Prof. Matteo Matteucci Discriminative vs. Generative Approaches 2 o Generative approach: we derived the classifier from some generative hypothesis about the way

More information

Support Vector Machines (SVM) in bioinformatics. Day 1: Introduction to SVM

Support Vector Machines (SVM) in bioinformatics. Day 1: Introduction to SVM 1 Support Vector Machines (SVM) in bioinformatics Day 1: Introduction to SVM Jean-Philippe Vert Bioinformatics Center, Kyoto University, Japan Jean-Philippe.Vert@mines.org Human Genome Center, University

More information

Contents Lecture 4. Lecture 4 Linear Discriminant Analysis. Summary of Lecture 3 (II/II) Summary of Lecture 3 (I/II)

Contents Lecture 4. Lecture 4 Linear Discriminant Analysis. Summary of Lecture 3 (II/II) Summary of Lecture 3 (I/II) Contents Lecture Lecture Linear Discriminant Analysis Fredrik Lindsten Division of Systems and Control Department of Information Technology Uppsala University Email: fredriklindsten@ituuse Summary of lecture

More information

Final Overview. Introduction to ML. Marek Petrik 4/25/2017

Final Overview. Introduction to ML. Marek Petrik 4/25/2017 Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Topics we covered. Machine Learning. Statistics. Optimization. Systems! Basics of probability Tail bounds Density Estimation Exponential Families

Topics we covered. Machine Learning. Statistics. Optimization. Systems! Basics of probability Tail bounds Density Estimation Exponential Families Midterm Review Topics we covered Machine Learning Optimization Basics of optimization Convexity Unconstrained: GD, SGD Constrained: Lagrange, KKT Duality Linear Methods Perceptrons Support Vector Machines

More information

The classifier. Theorem. where the min is over all possible classifiers. To calculate the Bayes classifier/bayes risk, we need to know

The classifier. Theorem. where the min is over all possible classifiers. To calculate the Bayes classifier/bayes risk, we need to know The Bayes classifier Theorem The classifier satisfies where the min is over all possible classifiers. To calculate the Bayes classifier/bayes risk, we need to know Alternatively, since the maximum it is

More information

The classifier. Linear discriminant analysis (LDA) Example. Challenges for LDA

The classifier. Linear discriminant analysis (LDA) Example. Challenges for LDA The Bayes classifier Linear discriminant analysis (LDA) Theorem The classifier satisfies In linear discriminant analysis (LDA), we make the (strong) assumption that where the min is over all possible classifiers.

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

Gaussian discriminant analysis Naive Bayes

Gaussian discriminant analysis Naive Bayes DM825 Introduction to Machine Learning Lecture 7 Gaussian discriminant analysis Marco Chiarandini Department of Mathematics & Computer Science University of Southern Denmark Outline 1. is 2. Multi-variate

More information

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function

More information

Machine Learning Lecture 5

Machine Learning Lecture 5 Machine Learning Lecture 5 Linear Discriminant Functions 26.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory

More information

Natural Language Processing. Classification. Features. Some Definitions. Classification. Feature Vectors. Classification I. Dan Klein UC Berkeley

Natural Language Processing. Classification. Features. Some Definitions. Classification. Feature Vectors. Classification I. Dan Klein UC Berkeley Natural Language Processing Classification Classification I Dan Klein UC Berkeley Classification Automatically make a decision about inputs Example: document category Example: image of digit digit Example:

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Support vector machines (SVMs) are one of the central concepts in all of machine learning. They are simply a combination of two ideas: linear classification via maximum (or optimal

More information

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012 Classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Topics Discriminant functions Logistic regression Perceptron Generative models Generative vs. discriminative

More information

Lecture 4 Discriminant Analysis, k-nearest Neighbors

Lecture 4 Discriminant Analysis, k-nearest Neighbors Lecture 4 Discriminant Analysis, k-nearest Neighbors Fredrik Lindsten Division of Systems and Control Department of Information Technology Uppsala University. Email: fredrik.lindsten@it.uu.se fredrik.lindsten@it.uu.se

More information

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation. CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Le Song Machine Learning I CSE 6740, Fall 2013 Naïve Bayes classifier Still use Bayes decision rule for classification P y x = P x y P y P x But assume p x y = 1 is fully factorized

More information

Midterm. Introduction to Machine Learning. CS 189 Spring Please do not open the exam before you are instructed to do so.

Midterm. Introduction to Machine Learning. CS 189 Spring Please do not open the exam before you are instructed to do so. CS 89 Spring 07 Introduction to Machine Learning Midterm Please do not open the exam before you are instructed to do so. The exam is closed book, closed notes except your one-page cheat sheet. Electronic

More information

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017 CPSC 340: Machine Learning and Data Mining MLE and MAP Fall 2017 Assignment 3: Admin 1 late day to hand in tonight, 2 late days for Wednesday. Assignment 4: Due Friday of next week. Last Time: Multi-Class

More information

Chap 2. Linear Classifiers (FTH, ) Yongdai Kim Seoul National University

Chap 2. Linear Classifiers (FTH, ) Yongdai Kim Seoul National University Chap 2. Linear Classifiers (FTH, 4.1-4.4) Yongdai Kim Seoul National University Linear methods for classification 1. Linear classifiers For simplicity, we only consider two-class classification problems

More information

Last updated: Oct 22, 2012 LINEAR CLASSIFIERS. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

Last updated: Oct 22, 2012 LINEAR CLASSIFIERS. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition Last updated: Oct 22, 2012 LINEAR CLASSIFIERS Problems 2 Please do Problem 8.3 in the textbook. We will discuss this in class. Classification: Problem Statement 3 In regression, we are modeling the relationship

More information

ML (cont.): SUPPORT VECTOR MACHINES

ML (cont.): SUPPORT VECTOR MACHINES ML (cont.): SUPPORT VECTOR MACHINES CS540 Bryan R Gibson University of Wisconsin-Madison Slides adapted from those used by Prof. Jerry Zhu, CS540-1 1 / 40 Support Vector Machines (SVMs) The No-Math Version

More information

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Linear Classifiers. Blaine Nelson, Tobias Scheffer

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Linear Classifiers. Blaine Nelson, Tobias Scheffer Universität Potsdam Institut für Informatik Lehrstuhl Linear Classifiers Blaine Nelson, Tobias Scheffer Contents Classification Problem Bayesian Classifier Decision Linear Classifiers, MAP Models Logistic

More information

Linear Decision Boundaries

Linear Decision Boundaries Linear Decision Boundaries A basic approach to classification is to find a decision boundary in the space of the predictor variables. The decision boundary is often a curve formed by a regression model:

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table

More information

Machine Learning Practice Page 2 of 2 10/28/13

Machine Learning Practice Page 2 of 2 10/28/13 Machine Learning 10-701 Practice Page 2 of 2 10/28/13 1. True or False Please give an explanation for your answer, this is worth 1 pt/question. (a) (2 points) No classifier can do better than a naive Bayes

More information

Machine Learning Linear Classification. Prof. Matteo Matteucci

Machine Learning Linear Classification. Prof. Matteo Matteucci Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)

More information

Non-Bayesian Classifiers Part II: Linear Discriminants and Support Vector Machines

Non-Bayesian Classifiers Part II: Linear Discriminants and Support Vector Machines Non-Bayesian Classifiers Part II: Linear Discriminants and Support Vector Machines Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2018 CS 551, Fall

More information

Support Vector Machines

Support Vector Machines Wien, June, 2010 Paul Hofmarcher, Stefan Theussl, WU Wien Hofmarcher/Theussl SVM 1/21 Linear Separable Separating Hyperplanes Non-Linear Separable Soft-Margin Hyperplanes Hofmarcher/Theussl SVM 2/21 (SVM)

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table

More information

Midterm. Introduction to Machine Learning. CS 189 Spring You have 1 hour 20 minutes for the exam.

Midterm. Introduction to Machine Learning. CS 189 Spring You have 1 hour 20 minutes for the exam. CS 189 Spring 2013 Introduction to Machine Learning Midterm You have 1 hour 20 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. Please use non-programmable calculators

More information

Introduction to SVM and RVM

Introduction to SVM and RVM Introduction to SVM and RVM Machine Learning Seminar HUS HVL UIB Yushu Li, UIB Overview Support vector machine SVM First introduced by Vapnik, et al. 1992 Several literature and wide applications Relevance

More information

Introduction to Machine Learning

Introduction to Machine Learning 1, DATA11002 Introduction to Machine Learning Lecturer: Teemu Roos TAs: Ville Hyvönen and Janne Leppä-aho Department of Computer Science University of Helsinki (based in part on material by Patrik Hoyer

More information

Lecture 9: Large Margin Classifiers. Linear Support Vector Machines

Lecture 9: Large Margin Classifiers. Linear Support Vector Machines Lecture 9: Large Margin Classifiers. Linear Support Vector Machines Perceptrons Definition Perceptron learning rule Convergence Margin & max margin classifiers (Linear) support vector machines Formulation

More information

Announcements - Homework

Announcements - Homework Announcements - Homework Homework 1 is graded, please collect at end of lecture Homework 2 due today Homework 3 out soon (watch email) Ques 1 midterm review HW1 score distribution 40 HW1 total score 35

More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and

More information

LINEAR CLASSIFIERS. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

LINEAR CLASSIFIERS. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition LINEAR CLASSIFIERS Classification: Problem Statement 2 In regression, we are modeling the relationship between a continuous input variable x and a continuous target variable t. In classification, the input

More information

Max Margin-Classifier

Max Margin-Classifier Max Margin-Classifier Oliver Schulte - CMPT 726 Bishop PRML Ch. 7 Outline Maximum Margin Criterion Math Maximizing the Margin Non-Separable Data Kernels and Non-linear Mappings Where does the maximization

More information

Neural Networks and Deep Learning

Neural Networks and Deep Learning Neural Networks and Deep Learning Professor Ameet Talwalkar November 12, 2015 Professor Ameet Talwalkar Neural Networks and Deep Learning November 12, 2015 1 / 16 Outline 1 Review of last lecture AdaBoost

More information

CSC 411 Lecture 17: Support Vector Machine

CSC 411 Lecture 17: Support Vector Machine CSC 411 Lecture 17: Support Vector Machine Ethan Fetaya, James Lucas and Emad Andrews University of Toronto CSC411 Lec17 1 / 1 Today Max-margin classification SVM Hard SVM Duality Soft SVM CSC411 Lec17

More information

18.9 SUPPORT VECTOR MACHINES

18.9 SUPPORT VECTOR MACHINES 744 Chapter 8. Learning from Examples is the fact that each regression problem will be easier to solve, because it involves only the examples with nonzero weight the examples whose kernels overlap the

More information

Jeff Howbert Introduction to Machine Learning Winter

Jeff Howbert Introduction to Machine Learning Winter Classification / Regression Support Vector Machines Jeff Howbert Introduction to Machine Learning Winter 2012 1 Topics SVM classifiers for linearly separable classes SVM classifiers for non-linearly separable

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

Indirect Rule Learning: Support Vector Machines. Donglin Zeng, Department of Biostatistics, University of North Carolina

Indirect Rule Learning: Support Vector Machines. Donglin Zeng, Department of Biostatistics, University of North Carolina Indirect Rule Learning: Support Vector Machines Indirect learning: loss optimization It doesn t estimate the prediction rule f (x) directly, since most loss functions do not have explicit optimizers. Indirection

More information

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted

More information

Support Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012

Support Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012 Support Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Linear classifier Which classifier? x 2 x 1 2 Linear classifier Margin concept x 2

More information

Lecture 9: Classification, LDA

Lecture 9: Classification, LDA Lecture 9: Classification, LDA Reading: Chapter 4 STATS 202: Data mining and analysis October 13, 2017 1 / 21 Review: Main strategy in Chapter 4 Find an estimate ˆP (Y X). Then, given an input x 0, we

More information

CS 6375 Machine Learning

CS 6375 Machine Learning CS 6375 Machine Learning Nicholas Ruozzi University of Texas at Dallas Slides adapted from David Sontag and Vibhav Gogate Course Info. Instructor: Nicholas Ruozzi Office: ECSS 3.409 Office hours: Tues.

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Here we approach the two-class classification problem in a direct way: We try and find a plane that separates the classes in feature space. If we cannot, we get creative in two

More information

Lecture 9: Classification, LDA

Lecture 9: Classification, LDA Lecture 9: Classification, LDA Reading: Chapter 4 STATS 202: Data mining and analysis October 13, 2017 1 / 21 Review: Main strategy in Chapter 4 Find an estimate ˆP (Y X). Then, given an input x 0, we

More information

Support Vector Machines for Classification: A Statistical Portrait

Support Vector Machines for Classification: A Statistical Portrait Support Vector Machines for Classification: A Statistical Portrait Yoonkyung Lee Department of Statistics The Ohio State University May 27, 2011 The Spring Conference of Korean Statistical Society KAIST,

More information

Warm up: risk prediction with logistic regression

Warm up: risk prediction with logistic regression Warm up: risk prediction with logistic regression Boss gives you a bunch of data on loans defaulting or not: {(x i,y i )} n i= x i 2 R d, y i 2 {, } You model the data as: P (Y = y x, w) = + exp( yw T

More information

Machine Learning and Data Mining. Support Vector Machines. Kalev Kask

Machine Learning and Data Mining. Support Vector Machines. Kalev Kask Machine Learning and Data Mining Support Vector Machines Kalev Kask Linear classifiers Which decision boundary is better? Both have zero training error (perfect training accuracy) But, one of them seems

More information

CS534: Machine Learning. Thomas G. Dietterich 221C Dearborn Hall

CS534: Machine Learning. Thomas G. Dietterich 221C Dearborn Hall CS534: Machine Learning Thomas G. Dietterich 221C Dearborn Hall tgd@cs.orst.edu http://www.cs.orst.edu/~tgd/classes/534 1 Course Overview Introduction: Basic problems and questions in machine learning.

More information

Midterm exam CS 189/289, Fall 2015

Midterm exam CS 189/289, Fall 2015 Midterm exam CS 189/289, Fall 2015 You have 80 minutes for the exam. Total 100 points: 1. True/False: 36 points (18 questions, 2 points each). 2. Multiple-choice questions: 24 points (8 questions, 3 points

More information

Support Vector Machines. Introduction to Data Mining, 2 nd Edition by Tan, Steinbach, Karpatne, Kumar

Support Vector Machines. Introduction to Data Mining, 2 nd Edition by Tan, Steinbach, Karpatne, Kumar Data Mining Support Vector Machines Introduction to Data Mining, 2 nd Edition by Tan, Steinbach, Karpatne, Kumar 02/03/2018 Introduction to Data Mining 1 Support Vector Machines Find a linear hyperplane

More information

17 Neural Networks NEURAL NETWORKS. x XOR 1. x Jonathan Richard Shewchuk

17 Neural Networks NEURAL NETWORKS. x XOR 1. x Jonathan Richard Shewchuk 94 Jonathan Richard Shewchuk 7 Neural Networks NEURAL NETWORKS Can do both classification & regression. [They tie together several ideas from the course: perceptrons, logistic regression, ensembles of

More information

Lecture 16: Modern Classification (I): Separating Hyperplanes. Outline: separating hyperplane; binary SVM for the separable case; Bayes rule for binary problems. Consider the simplest case, in which the two classes are ...

Probabilistic Classification. CE-717: Machine Learning, Sharif University of Technology. M. Soleymani, Fall 2016. Topics: the probabilistic approach, Bayes decision theory, generative models, the Gaussian Bayes classifier.

Kernel Methods. Machine Learning A 708.063, 07W VO. Outline: 1. dual representation; 2. the kernel concept; 3. properties of kernels; 4. examples of kernel machines (kernel PCA, support vector regression, relevance ...).

Lecture 4: Training a Classifier. Roger Grosse. Introduction: now that we have defined what binary classification is, let's actually train a classifier. We'll approach this problem in much the same way as ...

Support Vector Machines. www.cs.wisc.edu/~dpage. Goals for the lecture: you should understand the following concepts: the margin, slack variables, the linear support vector machine, nonlinear SVMs, the kernel ...

Linear Classification. CSE 6363 Machine Learning, Vassilis Athitsos, Computer Science and Engineering Department, University of Texas at Arlington. Example of linear classification: red points are patterns belonging ...

COMS 4721: Machine Learning for Data Science, Lecture 1, 1/17/2017. Prof. John Paisley, Department of Electrical Engineering & Data Science Institute, Columbia University. Overview: this class will cover model-based ...

UNIVERSITY of PENNSYLVANIA, CIS 520: Machine Learning, Final, Fall 2014. Exam policy: this exam allows two one-page, two-sided cheat sheets (i.e. 4 sides); no other materials. Time: 2 hours. Be sure to write ...

Linear Classification and SVM. Dr. Xin Zhang, eexinzhang@scut.edu.cn. What is linear classification? Classification is intrinsically non-linear: it puts non-identical things in the same class, so a ...

Support Vector Machine (SVM) and Kernel Methods. CE-717: Machine Learning, Sharif University of Technology, Fall 2014, Soleymani. Outline: the margin concept, hard-margin SVM, soft-margin SVM, dual problems of hard-margin ...

Machine Learning, Fall 2012, Homework 2. Instructors: Tom Mitchell, Ziv Bar-Joseph. TA in charge: Selen Uguroglu, sugurogl@cs.cmu.edu. Solutions. Naive Bayes, 20 points; Problem 1: basic concepts ...

COMS 4771 Regression. Nakul Verma. Last time: support vector machines, the maximum margin formulation, constrained optimization, Lagrange duality theory, convex optimization, the SVM dual and its interpretation ...

COMP 652: Machine Learning, Lecture 12. Today: perceptrons (definition, the perceptron learning rule, convergence); (linear) support vector machines (margin and the max-margin classifier, formulation) ...

CS446: Machine Learning, Fall 2016, Final Exam, December 6th, 2016. This is a closed-book exam; everything you need in order to solve the problems is supplied in the body of the exam.

Clustering. Professor Ameet Talwalkar. CS260 Machine Learning Algorithms, March 8, 2017. Outline: administration; review of the last lecture; clustering.

Linear Regression (9/11/13). STA561: Probabilistic Machine Learning. Lecturer: Barbara Engelhardt. Scribes: Zachary Abzug, Mike Gloudemans, Zhuosheng Gu, Zhao Song. Why use linear regression?

Linear Smoother; Online Learning: LMS and Perceptrons. A linear smoother produces fits of the form ŷ = S y, where the weights s_ij = s_ij(x) are functions of the inputs (e.g. built from local weights l_i(x)). Online learning (LMS and perceptrons): partially adapted from slides by Ryan Gabbard and Mitch Marcus (and original slides by ...).

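A linear smoother returns fitted values ŷ = S y, with each row of S holding the weights that one fitted value places on the observed responses. A minimal sketch using a k-nearest-neighbour smoother on toy one-dimensional data; the data and the particular choice of smoother are illustrative, not taken from the source:

import numpy as np

def knn_smoother_matrix(x, k):
    # row i of S puts weight 1/k on the k training points closest to x[i]
    n = len(x)
    S = np.zeros((n, n))
    for i in range(n):
        nearest = np.argsort(np.abs(x - x[i]))[:k]
        S[i, nearest] = 1.0 / k
    return S

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.0, 1.5, 3.0, 2.5])
S = knn_smoother_matrix(x, k=3)
y_hat = S @ y                     # fitted values: y_hat = S y
print(y_hat)
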
Linear Models for Classification. Oliver Schulte, CMPT 726; Bishop, PRML, Ch. 4. Classification example: hand-written digit recognition, with class targets encoded as indicator vectors such as t_i = (0, 0, 0, 1, 0, 0, ...).

Homework 3, Convex Optimization 10-725/36-725. Due Friday, October 14, at 5:30pm, submitted to Christoph Dann in Gates 8013. (Remember to submit a separate writeup for each problem, with your name at the top.)

Statistical NLP for the Web: Neural Networks, Deep Belief Networks. Sameer Maskey. Week 8, October 24, 2012 (some slides from Andrew Rosenberg). Announcements: please ask HW2-related questions in Courseworks ...

Discriminative Models. Hui Jiang, Department of Electrical Engineering and Computer Science, Lassonde School of Engineering, York University, Toronto, Canada. Outline: generative vs. discriminative models ...

CS145: Introduction to Data Mining. 5: Vector Data: Support Vector Machine. Instructor: Yizhou Sun, yzsun@cs.ucla.edu. October 18, 2017. Announcements: Homework 1 is due at the end of the day this Thursday (11:59pm) ...

BANA 7046 Data Mining I, Lecture 6: Other Data Mining Algorithms 1. Shaobo Li, University of Cincinnati. Partially based on Hastie et al. (2009), ESL, and James et al. (2013), ISLR.

Nonparametric Bayesian Methods (Gaussian Processes). [70240413 Statistical Machine Learning, Spring 2015]. Jun Zhu, dcszj@mail.tsinghua.edu.cn, http://bigml.cs.tsinghua.edu.cn/~jun. State Key Lab of Intelligent ...

Lecture 10: Support Vector Machine and Large Margin Classifier. Applied Multivariate Analysis, Math 570, Fall 2014. Xingye Qiao, Department of Mathematical Sciences, Binghamton University, qiao@math.binghamton.edu.

Basic Concepts: SVM and Kernels; SVM Primal/Dual Problems. Chih-Jen Lin (National Taiwan University). Outline: basic concepts (SVM and kernels); SVM primal and dual problems.

Classification. Chapter 6: 6.1 Introduction; 6.2 The Bayes classifier. Often encountered in applications is the situation where the response variable Y takes values in a finite set of labels. For example, the response Y could encode ...

Statistical Methods for SVM. Support vector machines approach the two-class classification problem in a direct way: we try to find a plane that separates the classes in feature space; if we cannot, ...

Lecture 13: Simple Linear Regression in Matrix Format. 1. Expectations and variances with vectors and matrices. To move beyond simple regression we need to use matrix algebra. We'll start by re-expressing simple linear regression in matrix form. Linear algebra is ...

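Written in matrix form, simple linear regression stacks a column of ones beside the predictor and solves the normal equations beta = (X^T X)^{-1} X^T y, the same least-squares solution used earlier in this book. A minimal sketch on made-up data:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

# design matrix with an intercept column
X = np.column_stack([np.ones_like(x), x])

# least-squares coefficients from the normal equations: beta = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)                   # roughly [0.0, 2.03]: intercept and slope

For larger or ill-conditioned problems, np.linalg.lstsq is the numerically safer route to the same coefficients.
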
Homework 4, Convex Optimization 10-725/36-725. Due Friday, November 4, at 5:30pm, submitted to Christoph Dann in Gates 8013. (Remember to submit a separate writeup for each problem, with your name at the top.)

CPSC 540: Machine Learning: Undirected Graphical Models. Mark Schmidt, University of British Columbia, Winter 2016. Admin: Assignment 3 (2 late days to hand it in today; Thursday is the final day); Assignment 4: ...

Machine Learning: Support Vector Machines. Manfred Huber, 2015. Both logistic regression and linear discriminant analysis learn a linear discriminant function to separate the data ...

Reading Group on Deep Learning, Session 1. Stephane Lathuiliere & Pablo Mesejo, 2 June 2016. Contents: an introduction to artificial neural networks, to understand, and to be able to efficiently use, the popular ...

CPSC 340: Machine Learning and Data Mining: MLE and MAP. Original version of these slides by Mark Schmidt, with modifications by Mike Gelbart. Admin: Assignment 4 is due tonight; Assignment 5 will be released ...

EE613 Machine Learning for Engineers: Kernel Methods, Support Vector Machines. Jean-Marc Odobez, 2015. Overview: kernel methods, introduction and main elements, defining kernels, kernelization of k-NN, k-means, ...

Classification 1: Linear Regression of Indicators, Linear Discriminant Analysis. Ryan Tibshirani, Data Mining: 36-462/36-662, April 2, 2013. Optional reading: ISL 4.1, 4.2, 4.4; ESL 4.1-4.3.

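Linear discriminant analysis, treated earlier in this book, assigns a point to the class with the largest discriminant delta_k(x) = x^T Sigma^{-1} mu_k - (1/2) mu_k^T Sigma^{-1} mu_k + log pi_k. A minimal sketch with invented class means, a shared covariance matrix, and equal priors:

import numpy as np

def lda_discriminants(x, means, cov, priors):
    # delta_k(x) = x^T Sigma^{-1} mu_k - 0.5 * mu_k^T Sigma^{-1} mu_k + log(pi_k)
    cov_inv = np.linalg.inv(cov)
    return [x @ cov_inv @ m - 0.5 * m @ cov_inv @ m + np.log(p)
            for m, p in zip(means, priors)]

means = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]   # hypothetical class means
cov = np.eye(2)                                        # shared covariance matrix
priors = [0.5, 0.5]                                    # equal class priors

x = np.array([1.5, 1.2])
scores = lda_discriminants(x, means, cov, priors)
print(int(np.argmax(scores)))                          # 1: x is assigned to the second class
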
CMSC858P Supervised Learning Methods. Hector Corrada Bravo, March 2010. Introduction: today we discuss the classification setting in detail. Our setting is that we observe, for each subject i, a set of p predictors ...

Support Vector Machines and Bayes Regression. Statistical Techniques in Robotics (16-831, F11), Lecture 14 (Monday, October 31st). Lecturer: Drew Bagnell. Scribe: Carl Doersch. Linear SVMs: we begin by considering ...

Machine Learning (CS 567), Lecture 2. Time: T-Th 5:00pm-6:20pm. Location: GFS118. Instructor: Sofus A. Macskassy, macskass@usc.edu. Office: SAL 216; office hours by appointment. Teaching assistant: Cheol ...

Logistic Regression. Machine Learning, Fall 2018. Where are we? We have seen the following ideas: linear models, learning as loss minimization, Bayesian learning criteria (MAP and MLE estimation), the Naïve Bayes ...