Math for Machine Learning Open Doors to Data Science and Artificial Intelligence Richard Han
Copyright 05 Richard Han All rights reserved.
CONTENTS PREFACE... - INTRODUCTION... LINEAR REGRESSION... 4 LINEAR REGRESSION... 4 THE LEAST SQUARES METHOD... 5 LINEAR ALGEBRA SOLUTION TO LEAST SQUARES PROBLEM... 7 EXAMPLE: LINEAR REGRESSION... 9 SUMMARY: LINEAR REGRESSION... 0 PROBLEM SET: LINEAR REGRESSION... SOLUTION SET: LINEAR REGRESSION... 3 LINEAR DISCRIMINANT ANALYSIS... 4 CLASSIFICATION... 4 LINEAR DISCRIMINANT ANALYSIS... 4 THE POSTERIOR PROBABILITY FUNCTIONS... 4 MODELLING THE POSTERIOR PROBABILITY FUNCTIONS... 5 LINEAR DISCRIMINANT FUNCTIONS... 7 ESTIMATING THE LINEAR DISCRIMINANT FUNCTIONS... 7 CLASSIFYING DATA POINTS USING LINEAR DISCRIMINANT FUNCTIONS... 8 LDA EXAMPLE... 9 LDA EXAMPLE... SUMMARY: LINEAR DISCRIMINANT ANALYSIS... 7 PROBLEM SET: LINEAR DISCRIMINANT ANALYSIS... 8 SOLUTION SET: LINEAR DISCRIMINANT ANALYSIS... 9 4 LOGISTIC REGRESSION... Error! Bookmark not defined. LOGISTIC REGRESSION... Error! Bookmark not defined. LOGISTIC REGRESSION MODEL OF THE POSTERIOR PROBABILITY FUNCTION.. Error! Bookmark not defined. ESTIMATING THE POSTERIOR PROBABILITY FUNCTION... Error! Bookmark not defined.
RICHARD HAN THE MULTIVARIATE NEWTON-RAPHSON METHOD... Error! Bookmark not defined. MAXIMIZING THE LOG-LIKELIHOOD FUNCTION... Error! Bookmark not defined. EXAMPLE: LOGISTIC REGRESSION... Error! Bookmark not defined. SUMMARY: LOGISTIC REGRESSION... Error! Bookmark not defined. PROBLEM SET: LOGISTIC REGRESSION... Error! Bookmark not defined. SOLUTION SET: LOGISTIC REGRESSION... Error! Bookmark not defined. 5 ARTIFICIAL NEURAL NETWORKSError! Bookmark not defined. ARTIFICIAL NEURAL NETWORKS... Error! Bookmark not defined. NEURAL NETWORK MODEL OF THE OUTPUT FUNCTIONS... Error! Bookmark not defined. FORWARD PROPAGATION... Error! Bookmark not defined. CHOOSING ACTIVATION FUNCTIONS... Error! Bookmark not defined. ESTIMATING THE OUTPUT FUNCTIONS... Error! Bookmark not defined. ERROR FUNCTION FOR REGRESSION... Error! Bookmark not defined. ERROR FUNCTION FOR BINARY CLASSIFICATION... Error! Bookmark not defined. ERROR FUNCTION FOR MULTI-CLASS CLASSIFICATION... Error! Bookmark not defined. MINIMIZING THE ERROR FUNCTION USING GRADIENT DESCENT... Error! Bookmark not defined. BACKPROPAGATION EQUATIONS... Error! Bookmark not defined. SUMMARY OF BACKPROPAGATION... Error! Bookmark not defined. SUMMARY: ARTIFICIAL NEURAL NETWORKS... Error! Bookmark not defined. PROBLEM SET: ARTIFICIAL NEURAL NETWORKS... Error! Bookmark not defined. SOLUTION SET: ARTIFICIAL NEURAL NETWORKS... Error! Bookmark not defined. 6 MAXIMAL MARGIN CLASSIFIERError! Bookmark not defined. MAXIMAL MARGIN CLASSIFIER... Error! Bookmark not defined. DEFINITIONS OF SEPARATING HYPERPLANE AND MARGIN... Error! Bookmark not defined. MAXIMIZING THE MARGIN... Error! Bookmark not defined. DEFINITION OF MAXIMAL MARGIN CLASSIFIER... Error! Bookmark not defined. REFORMULATING THE OPTIMIZATION PROBLEM... Error! Bookmark not defined. SOLVING THE CONVEX OPTIMIZATION PROBLEM... Error! Bookmark not defined. KKT CONDITIONS... Error! Bookmark not defined. iv
MATH FOR MACHINE LEARNING PRIMAL AND DUAL PROBLEMS... Error! Bookmark not defined. SOLVING THE DUAL PROBLEM... Error! Bookmark not defined. THE COEFFICIENTS FOR THE MAXIMAL MARGIN HYPERPLANE... Error! Bookmark not defined. THE SUPPORT VECTORS... Error! Bookmark not defined. CLASSIFYING TEST POINTS... Error! Bookmark not defined. MAXIMAL MARGIN CLASSIFIER EXAMPLE... Error! Bookmark not defined. MAXIMAL MARGIN CLASSIFIER EXAMPLE... Error! Bookmark not defined. SUMMARY: MAXIMAL MARGIN CLASSIFIER... Error! Bookmark not defined. PROBLEM SET: MAXIMAL MARGIN CLASSIFIER... Error! Bookmark not defined. SOLUTION SET: MAXIMAL MARGIN CLASSIFIER... Error! Bookmark not defined. 7 SUPPORT VECTOR CLASSIFIERError! Bookmark not defined. SUPPORT VECTOR CLASSIFIER... Error! Bookmark not defined. SLACK VARIABLES: POINTS ON CORRECT SIDE OF HYPERPLANE... Error! Bookmark not defined. SLACK VARIABLES: POINTS ON WRONG SIDE OF HYPERPLANE... Error! Bookmark not defined. FORMULATING THE OPTIMIZATION PROBLEM... Error! Bookmark not defined. DEFINITION OF SUPPORT VECTOR CLASSIFIER... Error! Bookmark not defined. A CONVEX OPTIMIZATION PROBLEM... Error! Bookmark not defined. SOLVING THE CONVEX OPTIMIZATION PROBLEM (SOFT MARGIN).. Error! Bookmark not defined. THE COEFFICIENTS FOR THE SOFT MARGIN HYPERPLANE... Error! Bookmark not defined. THE SUPPORT VECTORS (SOFT MARGIN)... Error! Bookmark not defined. CLASSIFYING TEST POINTS (SOFT MARGIN)... Error! Bookmark not defined. SUPPORT VECTOR CLASSIFIER EXAMPLE... Error! Bookmark not defined. SUPPORT VECTOR CLASSIFIER EXAMPLE... Error! Bookmark not defined. SUMMARY: SUPPORT VECTOR CLASSIFIER... Error! Bookmark not defined. PROBLEM SET: SUPPORT VECTOR CLASSIFIER... Error! Bookmark not defined. SOLUTION SET: SUPPORT VECTOR CLASSIFIER... Error! Bookmark not defined. 8 SUPPORT VECTOR MACHINE CLASSIFIERError! Bookmark not defined. SUPPORT VECTOR MACHINE CLASSIFIER... Error! Bookmark not defined. ENLARGING THE FEATURE SPACE... Error! Bookmark not defined. v
RICHARD HAN THE KERNEL TRICK... Error! Bookmark not defined. SUPPORT VECTOR MACHINE CLASSIFIER EXAMPLE... Error! Bookmark not defined. SUPPORT VECTOR MACHINE CLASSIFIER EXAMPLE... Error! Bookmark not defined. SUMMARY: SUPPORT VECTOR MACHINE CLASSIFIER... Error! Bookmark not defined. PROBLEM SET: SUPPORT VECTOR MACHINE CLASSIFIER... Error! Bookmark not defined. SOLUTION SET: SUPPORT VECTOR MACHINE CLASSIFIER... Error! Bookmark not defined. CONCLUSION... Error! Bookmark not defined. APPENDIX... Error! Bookmark not defined. APPENDIX... Error! Bookmark not defined. APPENDIX 3... Error! Bookmark not defined. APPENDIX 4... Error! Bookmark not defined. APPENDIX 5... Error! Bookmark not defined. INDEX... Error! Bookmark not defined. vi
PREFACE Welcome to Math for Machine Learning: Open Doors to Data Science and Artificial Intelligence. This is a first textbook in math for machine learning. Be sure to get the companion online course Math for Machine Learning here: Math for Machine Learning Online Course. The online course can be very helpful in conjunction with this book. The prerequisites for this book and the online course are Linear Algebra, Multivariable Calculus, and Probability. You can find my online course on Linear Algebra here: Linear Algebra Course. We will not do any programming in this book. This book will get you started in machine learning in a smooth and natural way, preparing you for more advanced topics and dispelling the belief that machine learning is complicated, difficult, and intimidating. I want you to succeed and prosper in your career, life, and future endeavors. I am here for you. Visit me at: Online Math Training
RICHARD HAN - INTRODUCTION Welcome to Math for Machine Learning: Open Doors to Data Science and Artificial Intelligence! My name is Richard Han. This is a first textbook in math for machine learning. Ideal student: If you're a working professional needing a refresher on machine learning or a complete beginner who needs to learn Machine Learning for the first time, this book is for you. If your busy schedule doesn t allow you to go back to a traditional school, this book allows you to study on your own schedule and further your career goals without being left behind. If you plan on taking machine learning in college, this is a great way to get ahead. If you re currently struggling with machine learning or have struggled with it in the past, now is the time to master it. Benefits of studying this book: After reading this book, you will have refreshed your knowledge of machine learning for your career so that you can earn a higher salary. You will have a required prerequisite for lucrative career fields such as Data Science and Artificial Intelligence. You will be in a better position to pursue a masters or PhD degree in machine learning and data science. Why Machine Learning is important: Famous uses of machine learning include: o Linear discriminant analysis. Linear discriminant analysis can be used to solve classification problems such as spam filtering and classifying patient illnesses.
MATH FOR MACHINE LEARNING o Logistic regression. Logistic regression can be used to solve binary classification problems such as determining whether a patient has a certain form of cancer or not. o Artificial neural networks. Artificial neural networks can be used for applications such as self-driving cars, recommender systems, online marketing, reading medical images, speech and face recognition o Support Vector machines. Real world applications of SVM s include classification of proteins and classification of images. What my book offers: In this book, I cover core topics such as: Linear Regression Linear Discriminant Analysis Logistic Regression Artificial Neural Networks Support Vector Machines I explain each definition and go through each example step by step so that you understand each topic clearly. Throughout the book, there are practice problems for you to try. Detailed solutions are provided after each problem set. I hope you benefit from the book. Best regards, Richard Han 3
LINEAR REGRESSION LINEAR REGRESSION Suppose we have a set of data (x, y ),, (x N, y N ). This is called the training data. x i x i Each x i is a vector [ ] of measurements, where x i is an instance of the first input variable X, x i x ip is an instance of the second input variable X, etc. X,, X p are called features or predictors. y,, y N are instances of the output variable Y, which is called the response. In linear regression, we assume that the response depends on the input variables in a linear fashion: y = f(x) + ε, where f(x) = β 0 + β X + + β p X p. Here, ε is called the error term and β 0,, β p are called parameters. We don t know the values of β 0,, β p. However, we can use the training data to approximate the values of β 0,, β p. What we ll do is look at the amount by which the predicted value f(x i ) differs from the actual y i for each of the pairs (x, y ),, (x N, y N ) from the training data. So we have y i f(x i ) as the difference. We then square this and take the sum for i =,, N: N (y i f(x i )) This is called the residual sum of squares and denoted RSS(β) where β = [ ]. β p i= We want the residual sum of squares to be as small as possible. Essentially, this means that we want our predicted value f(x i ) to be as close to the actual value y i as possible, for each of the pairs (x i, y i ). Doing this will give us a linear function of the input variables that best fits the given training data. In β 0 β 4
MATH FOR MACHINE LEARNING the case of only one input variable, we get the best fit line. In the case of two input variables, we get the best fit plane. And so on, for higher dimensions. THE LEAST SQUARES METHOD By minimizing RSS(β), we can obtain estimates β 0, β,, β p of the parameters β 0,, β p. This method is called the least squares method. x x x p x x x p Let X = [ x N x N x Np] y Then y Xβ = [ ] y N y and y = [ ]. y N x x x p x x x p [ [ x N x N x Np] β y 0 + β x + + β p x p β = [ ] 0 + β x + + β p x p y N [ β 0 + β x N + + β p x Np] f(x y ) = [ f(x ] [ ) ] y N f(x N ) = [ y f(x ) ] y N f(x N ) So (y Xβ) T (y Xβ) = N i= β 0 β ] β p (y i f(x i )) = RSS(β) RSS(β) = (y Xβ) T (y Xβ). Consider the vector of partial derivatives of RSS(β): RSS(β) β 0 RSS(β) β RSS(β) [ β p ] 5
RICHARD HAN RSS(β) = (y (β 0 + β x + + β p x p )) + + (y N (β 0 + β x N + + β p x Np )) Let s take the partial derivative with respect to β 0. RSS(β) β 0 = (y (β 0 + β x + + β p x p )) ( ) + + (y N (β 0 + β x N + + β p x Np )) ( ) = [ ](y Xβ) Next, take the partial derivative with respect to β. RSS(β) β = (y (β 0 + β x + + β p x p )) ( x ) + + (y N (β 0 + β x N + + β p x Np )) ( x N ) = [x x N ] (y Xβ) In general, RSS(β) β k = [x k x Nk ] (y Xβ) So, [ RSS(β) β 0 RSS(β) β RSS(β) β p ] [ ](y Xβ) [x = [ x N ](y Xβ) ] [x p x Np](y Xβ) x x N = [ ] (y Xβ) x p x Np = X T (y Xβ) If we take the second derivative of RSS(β), say RSS(β) β k β j, we get β j ( (y (β 0 + β x + + β p x p )) ( x k ) + + (y N (β 0 + β x N + + β p x Np )) ( x Nk )) = x j x k + + x Nj x Nk = (x j x k + + x Nj x Nk ) 6
MATH FOR MACHINE LEARNING x 0 x x x p x 0 x x x p Note X = [ ] x N0 x N x N x Np x 0 x 0 x N0 x 0 x x p x X T x x N x 0 x x p X = [ ] [ ] x p x p x Np x N0 x N x Np = (a jk ) where a jk = x j x k + + x Nj x Nk So RSS(β) β k β j = a jk The matrix of second derivatives of RSS(β) is X T X. This matrix is called the Hessian. By the second derivative test, if the Hessian of RSS(β) at a critical point is positive definite, then RSS(β) has a local minimum there. If we set our vector of derivatives to 0, we get X T (y Xβ) = 0 X T y + X T Xβ = 0 X T Xβ = X T y X T Xβ = X T y β = (X T X) X T y. β 0 β Thus, we solved for the vector of parameters [ ] which minimizes the residual sum of squares RSS(β). β p β 0 So we let β = (X T X) X T y. [ ] β p LINEAR ALGEBRA SOLUTION TO LEAST SQUARES PROBLEM We can arrive at the same solution for the least squares problem by using linear algebra. x x x p y x x x p Let X = and y = [ ] as before, from our training data. We want a y N [ x N x N x Np] vector β such that Xβ is close to y. In other words, we want a vector β such that the distance Xβ y between Xβ and y is minimized. A vector β that minimizes Xβ y is called a least-squares solution 7
RICHARD HAN of Xβ = y. X is an N by (p + ) matrix. We want a β in R p+ such that Xβ is closest to y. Note that Xβ is a linear combination of the columns of X. So Xβ lies in the span of the columns of X, which is a subspace of R N denoted Col X. So we want the vector in Col X that is closest to y. The projection of y onto the subspace Col X is that vector. proj Col X y = Xβ for some β R p+. Consider y Xβ. Note that y = Xβ + (y Xβ ). R N can be broken into two subspaces Col X and (Col X), where (Col X) is the subspace of R N consisting of all vectors that are orthogonal to the vectors in Col X. Any vector in R N can be written uniquely as z + w where z Col X and w (Col X). Since y R N, and y = Xβ + (y Xβ ), with Xβ Col X, the second vector y Xβ must lie in (Col X). y Xβ is orthogonal to the columns of X. X T (y Xβ ) = 0 X T y X T Xβ = 0. X T Xβ = X T y. Thus, it turns out that the set of least-squares solutions of Xβ = y consists of all and only the solutions to the matrix equation X T Xβ = X T y. If X T X is positive definite, then the eigenvalues of X T X are all positive. So 0 is not an eigenvalue of X T X. It follows that X T X is invertible. Then, we can solve the equation X T Xβ = X T y for β to get β = 8
MATH FOR MACHINE LEARNING (X T X) X T y, which is the same result we got earlier using multi-variable calculus. EXAMPLE: LINEAR REGRESSION Suppose we have the following training data: (x, y ) = (, ), (x, y ) = (, 4), (x 3, y 3 ) = (3, 4). Find the best fit line using the least squares method. Find the predicted value for x = 4. Solution: Form X = [ ] and y = [ 4]. 3 4 The coefficients β 0, β for the best fit line f(x) = β 0 + β x are given by [ β 0 ] = (X β T X) X T y. X T = [ 3 ] X T X = [ 3 ] [ ] = [ 3 6 6 4 ] 3 (X T X) 7/3 = [ / ] (X T X) X T 7/3 y = [ / ] [ 3 ] [ 4] 4 = [ 0 3/ ] β 0 = 0 and β = 3/. Thus, the best fit line is given by f(x) = ( 3 ) x. The predicted value for x = 4 is f(4) = ( 3 ) 4 = 6. 9
SUMMARY: LINEAR REGRESSION In the least squares method, we seek a linear function of the input variables that best fits the given training data. We do this by minimizing the residual sum of squares. To minimize the residual sum of squares, we apply the second derivative test from multi-variable calculus. We can arrive at the same solution to the least squares problem using linear algebra. 0
PROBLEM SET: LINEAR REGRESSION MATH FOR MACHINE LEARNING. Suppose we have the following training data: (x, y ) = (0, ), (x, y ) = (, ), (x 3, y 3 ) = (, 4), (x 4, y 4 ) = (3, 4). Find the best fit line using the least squares method. Find the predicted value for x = 4.. Suppose we have the following training data: (x, y ), (x, y ), (x 3, y 3 ) where x = [ 0 0 ], x = [ 0 ], x 3 = [ 0 ], x 4 = [ ] and y =, y = 0, y 3 = 0, y 4 =. Find the best fit plane using the least squares method. Find the predicted value for x = [ ].
SOLUTION SET: LINEAR REGRESSION 0. Form X = [ ] and y = [ ]. 4 3 4 RICHARD HAN The coefficients β 0, β for the best fit line f(x) = β 0 + β x are given by [ β 0 β ] = (X T X) X T y. 0 X T = [ 0 3 ] XT X = [ 0 3 ] [ ] = [ 4 6 6 4 ] 3 (X T X) = [ 7 0 3 0 3 0 5 ] (X T X) X T y = [ 7 0 3 0 3 0 5 ] [ 0 3 ] [ ] 4 4 = [ 4 0 9 0 ] β 0 = 4 0 and β = 9 0. Thus, the best fit line is given by f(x) = 4 0 + 9 0 x The predicted value for x = 4 is f(4) = 4 0 + 9 0 4 = 5. 0 0 0 0. Form X = [ ] and y = [ ]. 0 0 The coefficients β 0, β, β for the best fit line f(x, x ) = β 0 + β x + β x are given by [ ] = β (X T X) X T y. 0 0 4 X T = [ 0 0 ] X T 0 X = [ 0 0 ] [ ] = [ ] 0 0 0 0 0 β 0 β
MATH FOR MACHINE LEARNING (X T X) = [ 3 4 0 0 ] (X T X) X T y = [ 3 4 0 0 [ 0 0 ] [ ] 0 0 0 0 ] = [ 3 4 4 4 4 ] 0 [ ] 0 = 4 [ ] β 0 = 4, β =, β = Thus, the best fit plane is given by f(x, x ) = 4 + x + x The predicted value for x = [ ] is f(, ) = 4. 3
RICHARD HAN 3 LINEAR DISCRIMINANT ANALYSIS CLASSIFICATION In the problem of regression, we had a set of data (x, y ),, (x N, y N ) and we wanted to predict the values for the response variable Y for new data points. The values that Y took were numerical, quantitative, values. In certain problems, the values for the response variable Y that we want to predict are not quantitative but qualitative. So the values for Y will take on values from a finite set of classes or categories. Problems of this sort are called classification problems. Some examples of a classification problem are classifying an email as spam or not spam and classifying a patient s illness as one among a finite number of diseases. LINEAR DISCRIMINANT ANALYSIS One method for solving a classification problem is called linear discriminant analysis. What we ll do is estimate Pr (Y = k X = x), the probability that Y is the class k given that the input variable X is x. Once we have all of these probabilities for a fixed x, we pick the class k for which the probability Pr (Y = k X = x) is largest. We then classify x as that class k. THE POSTERIOR PROBABILITY FUNCTIONS In this section, we ll build a formula for the posterior probability Pr (Y = k X = x). Let π k = Pr (Y = k), the prior probability that Y = k. Let f k (x) = Pr (X = x Y = k), the probability that X = x, given that Y = k. By Bayes rule, Pr (X = x Y = k) Pr (Y = k) Pr(Y = k X = x) = K Pr(X = x Y = l) Pr (Y = l) l= Here we assume that k can take on the values,, K. = f k(x) π k K f l (x) π l = π k f k (x) K l= π l f l (x) We can think of Pr(Y = k X = x) as a function of x and denote it as p k (x). So p k (x) = π k f k (x) K l= π l f l (x) l=. Recall that p k (x) is the posterior probability that Y = k given that X = x. 4
MATH FOR MACHINE LEARNING MODELLING THE POSTERIOR PROBABILITY FUNCTIONS Remember that we wanted to estimate Pr(Y = k X = x) for any given x. That is, we want an estimate for p k (x). If we can get estimates for π k, f k (x), π l and f l (x) for each l =,, K, then we would have an estimate for p k (x). Let s say that X = (X, X,, X p ) where X,, X p are the input variables. So the values of X will be vectors of p elements. We will assume that the conditional distribution of X given Y = k is the multivariate Gaussian distribution N(μ k, Σ), where μ k is a class-specific mean vector and Σ is the covariance of X. μ k The class-specific mean vector μ k is given by the vector of class-specific means [ ], where μ kj is μ kp the class-specific mean of X j. So μ kj = i:y i =k x ij Pr (X j = x ij ). Recall that x i = [ ]. (For all those x i for which y i = k, we re x ip taking the mean of their jth components.) Σ, the covariance matrix of X, is given by the matrix of covariances of X i and X j. So Σ = (a ij ), where a ij = Cov(X i, X j ) E[(X i μ Xi )(X j μ Xj )]. x i The multivariate Gaussian density is given by f(x) = (π) p Σ 5 e (x μ)t Σ (x μ) for the multivariate Gaussian distribution N(μ, Σ). Since we re assuming that the conditional distribution of X given Y = k is the multivariate Gaussian distribution N(μ k, Σ), we have that Pr(X = x Y = k) = (π) p Σ e (x μ k )T Σ (x μ k ). Recall that f k (x) = Pr(X = x Y = k). So f k (x) = (π) p Σ Recall that p k (x) = e (x μ k )T Σ (x μ k ). π k f k (x) K l= π l f l (x).
RICHARD HAN Plugging in what we have for f k (x), we get π k (π) p Σ p k (x) = K l= π l (π) p Σ e (x μ k )T Σ (x μ k ) e (x μ l )T Σ (x μ l ) = π k e (x μ k )T Σ (x μ k ) K π l e (x μ l )T Σ (x μ l ) l=. Note that the denominator is (π) p Σ l= π l f l (x) and that K K l= π l f l (x) = K l= f l (x)π l K = Pr(X = x Y = l) Pr (Y = l) l= = Pr(X = x). So the denominator is just (π) p Σ Pr(X = x). Hence, p k (x) = π k e (x μ k )T Σ (x μ k ) (π) p Σ Pr(X=x). 6
LINEAR DISCRIMINANT FUNCTIONS Recall that we want to choose the class k for which the posterior probability p k (x) is largest. Since the logarithm function is order-preserving, maximizing p k (x) is the same as maximizing log p k (x). Taking log p k (x) gives log π k e (x μ k )T Σ (x μ k ) (π) p Σ Pr(X=x) = log π k + ( ) (x μ k) T Σ (x μ k ) log ((π) p Σ Pr (X = x)) = log π k + ( ) (x μ k) T Σ (x μ k ) log C where C = (π) p Σ P r(x = x). = log π k (xt Σ μ k T Σ )(x μ k ) log C = log π k [xt Σ x x T Σ μ k μ k T Σ x + μ k T Σ μ k ] log C = log π k [xt Σ x x T Σ μ k + μ k T Σ μ k ] log C, because x T Σ μ k = μ k T Σ x Proof: x T Σ μ k = μ k (Σ ) T x = μ k T (Σ T ) x = μ k T Σ x because Σ is symmetric. = log π k xt Σ x + x T Σ μ k μ k T Σ μ k log C = x T Σ μ k μ k T Σ μ k + log π k xt Σ x log C Let δ k (x) = x T Σ μ k μ k T Σ μ k + log π k. Then log p k (x) = δ k (x) xt Σ x log C. δ k (x) is called a linear discriminant function. Maximizing log p k (x) is the same as maximizing δ k (x) since xt Σ x log C does not depend on k. ESTIMATING THE LINEAR DISCRIMINANT FUNCTIONS Now, if we can find estimates for π k, μ k, and Σ, then we would have an estimate for p k (x) and hence for log p k (x) and δ k (x). In an attempt to maximize p k (x), we instead maximize the estimate of p k (x), which is the same as 7
RICHARD HAN maximizing the estimate of δ k (x). π k can be estimated as π k = N k where N N k is the number of training data points in class k and N is the total number of training data points. Remember π k = Pr (Y = k). We re estimating this by just taking the proportion of data points in class k. μ k The class-specific mean vector μ k = [ ], where μ kj = i:y i =k x ij Pr(X j = x ij ). μ kp We can estimate μ kj as N k So we can estimate μ k as μ k = [ i:yi =k x ij. N i:yi =k x i k N i:yi =k x ip k ] = [ N k i:yi =k x i i:yi =k x ip = x i [ ] N k i:y i =k x ip ] In other words, μ k = N k i:yi =k x i averages of each component over all x i in class k. = N k x i i:y i =k. We estimate the class-specific mean vector by the vector of Finally, the covariance matrix Σ is estimated as Σ = K (x N K k= i:y i =k i μ ) k (x i μ ) T k. Recall that δ k (x) = x T Σ μ k μ k T Σ μ k + log π k. So, δ (x) k = x TΣ μ k (μ ) k TΣ μ k + log π. k Note that Σ, μ, k and π k only depend on the training data and not on x. Note that x is a vector and x TΣ μ k is a linear combination of the components of x. Hence, δ (x) k is a linear combination of the components of x. This is why it s called a linear discriminant function. CLASSIFYING DATA POINTS USING LINEAR DISCRIMINANT FUNCTIONS If (k, k ) is a pair of classes, we can consider whether δ k (x) > δ k (x). If so, we know x is not in class k. Then, we can compare δ k (x) > δ k3 (x) and rule out another class. Once we ve exhausted all 8
MATH FOR MACHINE LEARNING the classes, we ll know which class x should be assigned to. Setting δ k (x) = δ k (x), we get x TΣ μ k (μ k ) T Σ μ k + log π k = x TΣ μ k (μ k ) T Σ μ k + log π k. This gives us a hyperplane in R p which separates class k from class k. If we find the separating hyperplane for each pair of classes, we get something like this: In this example, p = and K = 3. LDA EXAMPLE Suppose we have a set of data (x, y ),, (x 6, y 6 ) as follows: x = (, 3), x = (, 3), x 3 = (, 4), x 4 = (3, ), x 5 = (3, ), x 6 = (4, ), with y = y = y 3 = k = and y 4 = y 5 = y 6 = k =. Apply linear discriminant analysis by doing the following: a) Find estimates for the linear discriminant functions δ (x) and δ (x). b) Find the line that decides between the two classes. c) Classify the new point x = (5, 0). Solution: Here is a graph of the data points: 9
RICHARD HAN The number of features p is, the number of classes K is, the total number of data points N is 6, the number N of data points in class k is 3, and the number N of data points in class k is 3. First, we will find estimates for π and π, the prior probabilities that Y = k and Y = k, respectively. Then, we will find estimates for μ and μ, the class-specific mean vectors. We can then calculate the estimate for the covariance matrix Σ. Finally, using the estimates π, π, μ, μ, Σ, we can find the estimates for the linear discriminant functions δ (x) and δ (x). π = N N = 3 6 = π = N N = 3 6 = μ = N x i i:y i = = 3 [x + x + x 3 ] = [ 5/3 0/3 ] μ = N x i i:y i = = 3 [x 4 + x 5 + x 6 ] = [ 0/3 5/3 ] K Σ = N K (x i μ k k= i:y i =k )(x i μ ) T k 0
MATH FOR MACHINE LEARNING = 6 (x i μ k k= i:y i =k )(x i μ ) T k Plugging in what we got for μ and μ, we get Σ = /3 /6 [4/3 ] = [/3 4 /3 4/3 /6 /3 ] Σ = [ 4 4 ] δ (x) = x TΣ μ (μ ) TΣ μ + log π. = x T [ 0 0 ] (00 3 ) + log = 0X 50 3 + log δ (x) = x TΣ μ (μ ) TΣ μ + log π. = x T [ 0 0 ] (00 3 ) + log = 0X 50 3 + log Setting δ (x) = δ (x) 0X 50 + log = 0X 3 50 0X = 0X X = X. 3 + log So, the line that decides between the two classes is given by X = X. Here is a graph of the deciding line:
RICHARD HAN If δ (x) > δ (x), then we classify x as of class k. So if x is above the line X = X, then we classify x as of class k. Conversely, if δ (x) < δ (x), then we classify x as of class k. This corresponds to x being below the line X = X. The point (5, 0) is below the line; so we classify it as of class k. LDA EXAMPLE Suppose we have a set of data (x, y ),, (x 6, y 6 ) as follows: x = (0, ), x = (, ), x 3 = (, 0), x 4 = (, ), x 5 = (3, 3), x 6 = (4, 4), with y = y = k =, y 3 = y 4 = k =, and y 5 = y 6 = k 3 = 3. Apply linear discriminant analysis by doing the following: a) Find estimates for the linear discriminant functions δ (x), δ (x), and δ 3 (x). b) Find the lines that decide between each pair of classes. c) Classify the new point x = (, 3). Solution: Here is a graph of the data points:
MATH FOR MACHINE LEARNING The number of features p is, the number of classes K is 3, the total number of data points N is 6, the number N of data points in class k is, the number N of data points in class k is, and the number N 3 of data points in class k 3 is. First, we will find estimates for π, π, π 3, the prior probabilities that Y = k, Y = k, Y = k 3, respectively. Then, we will find estimates for μ, μ, μ 3, the class-specific mean vectors. We can then calculate the estimate for the covariance matrix Σ. Finally, using the estimates π, π, π, 3 μ, μ, μ, 3 Σ, we can find the estimates for the linear discriminant functions δ (x), δ (x), and δ 3 (x). π = N N = 6 = 3 π = N N = 6 = 3 π 3 = N 3 N = 6 = 3 μ = N x i i:y i = = [x + x ] = [ / ] μ = N x i i:y i = = [x 3 + x 4 ] = [ / ] 3
RICHARD HAN μ 3 = N 3 x i i:y i =3 = [x 5 + x 6 ] = [ 7/ 7/ ] K Σ = N K (x i μ k k= i:y i =k )(x i μ ) T k = 6 3 [ / /6 ] = [/3 / /6 /3 ] Σ = [ 4 4 ] δ (x) = x TΣ μ (μ ) TΣ μ + log π. = x T [ 7 ] (3 ) + log 3 = X + 7X 3 + log 3 δ (x) = x TΣ μ (μ ) TΣ μ + log π. = x T [ 7 ] (3 ) + log 3 = 7X X 3 + log 3 δ (x) 3 = x TΣ μ 3 (μ ) 3 TΣ μ 3 + log π. 3 = x T [ 7 7 ] (49 ) + log 3 = 7X + 7X 49 + log 3 Setting δ (x) = δ (x) X + 7X 3 + log = 7X 3 X 3 X + 7X = 7X X + log 3 9X = 9X 4
MATH FOR MACHINE LEARNING X = X. So, the line that decides between classes k and k is given by X = X. Setting δ (x) = δ (x) 3 X + 7X 3 + log = 7X 3 + 7X 49 8 = 9X X = + log 3 So, the line that decides between classes k and k 3 is given by X =. Setting δ (x) = δ (x) 3 7X X 3 + log = 7X 3 + 7X 49 + log 3 8 = 9X X = So, the line that decides between classes k and k 3 is given by X =. Here is a graph of the deciding lines: The lines divide the plane into 3 regions. δ (x) > δ (x) corresponds to the region above the line X = X. Conversely, δ (x) < δ (x) corresponds to the region below the line X = X. δ (x) > δ (x) 3 corresponds to the region to the left of the line X =. Conversely, δ (x) < δ (x) 3 5
RICHARD HAN corresponds to the region to the right of X =. δ (x) > δ (x) 3 corresponds to the region below the line X =. Conversely, δ (x) < δ (x) 3 corresponds to the region above the line X =. If δ (x) > δ (x) and δ (x) > δ (x), 3 then we classify x as of class k. So if x is in region I, then we classify x as of class k. Conversely, if x is in region II, then we classify x as of class k ; and if x is in region III, we classify x as of class k 3. The point (, 3) is in region I; so we classify it as of class k. 6
MATH FOR MACHINE LEARNING SUMMARY: LINEAR DISCRIMINANT ANALYSIS In linear discriminant analysis, we find estimates p (x) k for the posterior probability p k (x) that Y = k given that X = x. We classify x according to the class k that gives the highest estimated posterior probability p (x). k Maximizing the estimated posterior probability p (x) k is equivalent to maximizing the log of p (x), k which, in turn, is equivalent to maximizing the estimated linear discriminant function (x). δ k We find estimates of the prior probability π k that Y = k, of the class-specific mean vectors μ k, and of the covariance matrix Σ in order to estimate the linear discriminant functions δ k (x). By setting δ (x) k = δ k (x) for each pair (k, k ) of classes, we get hyperplanes in R p that, together, divide R p into regions corresponding to the distinct classes. We classify x according to the class k for which δ (x) k is largest. 7
RICHARD HAN PROBLEM SET: LINEAR DISCRIMINANT ANALYSIS. Suppose we have a set of data (x, y ),, (x 6, y 6 ) as follows: x = (, ), x = (, ), x 3 = (, ), x 4 = (3, 3), x 5 = (3, 4), x 6 = (4, 3) with y = y = y 3 = k = and y 4 = y 5 = y 6 = k =. Apply linear discriminant analysis by doing the following: a) Find estimates for the linear discriminant functions δ (x) and δ (x). b) Find the line that decides between the two classes. c) Classify the new point x = (4, 5).. Suppose we have a set of data (x, y ),, (x 6, y 6 ) as follows: x = (0, 0), x = (, ), x 3 = (, 3), x 4 = (, 4), x 5 = (3, ), x 6 = (4, ) with y = y = k =, y 3 = y 4 = k = and y 5 = y 6 = k 3 = 3. Apply linear discriminant analysis by doing the following: a) Find estimates for the linear discriminant functions δ (x), δ (x) and δ 3 (x). b) Find the lines that decide between each pair of classes. c) Classify the new point x = (3, 0). 8
MATH FOR MACHINE LEARNING SOLUTION SET: LINEAR DISCRIMINANT ANALYSIS. Here is a graph of the data points: The number of features p is, the number of classes K is, the total number of data points N is 6, the number N of data points in class k is 3, and the number N of data points in class k is 3. First, we will find estimates for π and π, the prior probabilities that Y = k and Y = k, respectively. Then, we will find estimates for μ and μ, the class-specific mean vectors. We can then calculate the estimate for the covariance matrix Σ. Finally, using the estimates π, π, μ, μ, Σ, we can find the estimates for the linear discriminant functions δ (x) and δ (x). π = N N = 3 6 = π = N N = 3 6 = μ = N 5 x i = 3 [x + x + x 3 ] = [ 3 ] 5 3 i:y i = 9
μ = N 0 x i = 3 [x 4 + x 5 + x 6 ] = [ 3 ] 0 3 i:y i = K Σ = N K (x i μ k = 6 k= i:y i =k )(x i μ ) T k 6/9 /3 /6 [/9 ] = [ 6/9 /9 /6 /3 ] RICHARD HAN Σ = [ 4 4 ] δ (x) = x TΣ μ μ TΣ μ + log π = x T [ 0 0 ] (00) + log 3 = 0X + 0X 50 + log 3 δ (x) = x TΣ μ μ TΣ μ + log π = x T [ 0 0 ] (400) + log 3 = 0X + 0X 00 + log 3 Setting δ (x) = δ (x) 0X + 0X 50 + log = 0X 3 + 0X 00 + log 3 50 = 0X 3 + 0X 50 = 0X + 0X 5 = X + X X + 5 = X So, the line that decides between the two classes is given by X = X + 5. Here is a graph of the decision line: 30
MATH FOR MACHINE LEARNING If δ (x) > δ (x), then we classify x as of class k. So if x is below the line X = X + 5, then we classify x as of class k. Conversely, if δ (x) < δ (x), then we classify x as of class k. This corresponds to x being above the line X = X + 5. The point (4, 5) is above the line; so we classify it as of class k. 3
RICHARD HAN. Here is a graph of the data points: The number of features p is, the number of classes K is 3, the total number of data points N is 6, the number N of data points in class k is, the number N of data points in class k is, and the number N 3 of data points in class k 3 is. First, we will find estimates for π, π, π 3, the prior probabilities that Y = k, Y = k, Y = k 3, respectively. Then, we will find estimates for μ, μ, μ 3, the class-specific mean vectors. We can then calculate the estimate for the covariance matrix Σ. Finally, using the estimates π, π, π, 3 μ, μ, μ, 3 Σ, we can find the estimates for the linear discriminant functions δ (x), δ (x), and δ 3 (x). π = N N = 6 = 3 π = N N = 6 = 3 π 3 = N 3 N = 6 = 3 3
μ = N μ = N μ 3 = N 3 x i i:y i = x i i:y i = x i i:y i =3 MATH FOR MACHINE LEARNING = [x + x ] = [ / / ] = [x 3 + x 4 ] = [ 7/ ] = [x 5 + x 6 ] = [ 7/ ] K Σ = N K (x i μ k = k= i:y i =k )(x i μ ) T k [ / /6 ] = [/3 6 3 / /6 /3 ] Σ = [ 4 4 ] δ (x) = x TΣ μ μ TΣ μ + log π = x T [ ] () + log 3 = X + X + log 3 δ (x) = x TΣ μ μ TΣ μ + log π = x T [ 0 ] (37) + log 3 = X + 0X 37 + log 3 δ (x) 3 = x TΣ μ 3 μ TΣ 3 μ 3 + log π 3 = x T [ 0 ] (37) + log 3 = 0X + X 37 + log 3 Setting δ (x) = δ (x) X + X + log = X 3 + 0X 37 8 = 9X = X + log 3 So, the line that decides between classes k and k is given by X =. 33
RICHARD HAN Setting δ (x) = δ (x) 3 X + X + log = 0X 3 + X 37 8 = 9X = X + log 3 So, the line that decides between classes k and k 3 is given by X =. Setting δ (x) = δ (x) 3 X + 0X 37 + log = 0X 3 + X 37 + log 3 9X = 9X X = X So, the line that decides between classes k and k 3 is given by X = X. Here is a graph of the decision lines: The lines divide the plane into 3 regions. If x is in region I, then we classify x as of class k. Similarly, points in region II get classified as of k, and points in region III get classified as of k 3. The point (3, 0) is in region III; so we classify it as of class k 3. 34
MATH FOR MACHINE LEARNING 35