Math for Machine Learning: Open Doors to Data Science and Artificial Intelligence
Richard Han
Copyright © 2015 Richard Han. All rights reserved.
CONTENTS

PREFACE
1 INTRODUCTION
2 LINEAR REGRESSION
   LINEAR REGRESSION
   THE LEAST SQUARES METHOD
   LINEAR ALGEBRA SOLUTION TO LEAST SQUARES PROBLEM
   EXAMPLE: LINEAR REGRESSION
   SUMMARY: LINEAR REGRESSION
   PROBLEM SET: LINEAR REGRESSION
   SOLUTION SET: LINEAR REGRESSION
3 LINEAR DISCRIMINANT ANALYSIS
   CLASSIFICATION
   LINEAR DISCRIMINANT ANALYSIS
   THE POSTERIOR PROBABILITY FUNCTIONS
   MODELLING THE POSTERIOR PROBABILITY FUNCTIONS
   LINEAR DISCRIMINANT FUNCTIONS
   ESTIMATING THE LINEAR DISCRIMINANT FUNCTIONS
   CLASSIFYING DATA POINTS USING LINEAR DISCRIMINANT FUNCTIONS
   LDA EXAMPLE 1
   LDA EXAMPLE 2
   SUMMARY: LINEAR DISCRIMINANT ANALYSIS
   PROBLEM SET: LINEAR DISCRIMINANT ANALYSIS
   SOLUTION SET: LINEAR DISCRIMINANT ANALYSIS
4 LOGISTIC REGRESSION
   LOGISTIC REGRESSION
   LOGISTIC REGRESSION MODEL OF THE POSTERIOR PROBABILITY FUNCTION
   ESTIMATING THE POSTERIOR PROBABILITY FUNCTION
   THE MULTIVARIATE NEWTON-RAPHSON METHOD
   MAXIMIZING THE LOG-LIKELIHOOD FUNCTION
   EXAMPLE: LOGISTIC REGRESSION
   SUMMARY: LOGISTIC REGRESSION
   PROBLEM SET: LOGISTIC REGRESSION
   SOLUTION SET: LOGISTIC REGRESSION
5 ARTIFICIAL NEURAL NETWORKS
   ARTIFICIAL NEURAL NETWORKS
   NEURAL NETWORK MODEL OF THE OUTPUT FUNCTIONS
   FORWARD PROPAGATION
   CHOOSING ACTIVATION FUNCTIONS
   ESTIMATING THE OUTPUT FUNCTIONS
   ERROR FUNCTION FOR REGRESSION
   ERROR FUNCTION FOR BINARY CLASSIFICATION
   ERROR FUNCTION FOR MULTI-CLASS CLASSIFICATION
   MINIMIZING THE ERROR FUNCTION USING GRADIENT DESCENT
   BACKPROPAGATION EQUATIONS
   SUMMARY OF BACKPROPAGATION
   SUMMARY: ARTIFICIAL NEURAL NETWORKS
   PROBLEM SET: ARTIFICIAL NEURAL NETWORKS
   SOLUTION SET: ARTIFICIAL NEURAL NETWORKS
6 MAXIMAL MARGIN CLASSIFIER
   MAXIMAL MARGIN CLASSIFIER
   DEFINITIONS OF SEPARATING HYPERPLANE AND MARGIN
   MAXIMIZING THE MARGIN
   DEFINITION OF MAXIMAL MARGIN CLASSIFIER
   REFORMULATING THE OPTIMIZATION PROBLEM
   SOLVING THE CONVEX OPTIMIZATION PROBLEM
   KKT CONDITIONS
   PRIMAL AND DUAL PROBLEMS
   SOLVING THE DUAL PROBLEM
   THE COEFFICIENTS FOR THE MAXIMAL MARGIN HYPERPLANE
   THE SUPPORT VECTORS
   CLASSIFYING TEST POINTS
   MAXIMAL MARGIN CLASSIFIER EXAMPLE 1
   MAXIMAL MARGIN CLASSIFIER EXAMPLE 2
   SUMMARY: MAXIMAL MARGIN CLASSIFIER
   PROBLEM SET: MAXIMAL MARGIN CLASSIFIER
   SOLUTION SET: MAXIMAL MARGIN CLASSIFIER
7 SUPPORT VECTOR CLASSIFIER
   SUPPORT VECTOR CLASSIFIER
   SLACK VARIABLES: POINTS ON CORRECT SIDE OF HYPERPLANE
   SLACK VARIABLES: POINTS ON WRONG SIDE OF HYPERPLANE
   FORMULATING THE OPTIMIZATION PROBLEM
   DEFINITION OF SUPPORT VECTOR CLASSIFIER
   A CONVEX OPTIMIZATION PROBLEM
   SOLVING THE CONVEX OPTIMIZATION PROBLEM (SOFT MARGIN)
   THE COEFFICIENTS FOR THE SOFT MARGIN HYPERPLANE
   THE SUPPORT VECTORS (SOFT MARGIN)
   CLASSIFYING TEST POINTS (SOFT MARGIN)
   SUPPORT VECTOR CLASSIFIER EXAMPLE 1
   SUPPORT VECTOR CLASSIFIER EXAMPLE 2
   SUMMARY: SUPPORT VECTOR CLASSIFIER
   PROBLEM SET: SUPPORT VECTOR CLASSIFIER
   SOLUTION SET: SUPPORT VECTOR CLASSIFIER
8 SUPPORT VECTOR MACHINE CLASSIFIER
   SUPPORT VECTOR MACHINE CLASSIFIER
   ENLARGING THE FEATURE SPACE
   THE KERNEL TRICK
   SUPPORT VECTOR MACHINE CLASSIFIER EXAMPLE 1
   SUPPORT VECTOR MACHINE CLASSIFIER EXAMPLE 2
   SUMMARY: SUPPORT VECTOR MACHINE CLASSIFIER
   PROBLEM SET: SUPPORT VECTOR MACHINE CLASSIFIER
   SOLUTION SET: SUPPORT VECTOR MACHINE CLASSIFIER
CONCLUSION
APPENDIX 1
APPENDIX 2
APPENDIX 3
APPENDIX 4
APPENDIX 5
INDEX
PREFACE

Welcome to Math for Machine Learning: Open Doors to Data Science and Artificial Intelligence. This is a first textbook in math for machine learning. Be sure to get the companion online course, Math for Machine Learning, here: Math for Machine Learning Online Course. The online course can be very helpful in conjunction with this book.

The prerequisites for this book and the online course are Linear Algebra, Multivariable Calculus, and Probability. You can find my online course on Linear Algebra here: Linear Algebra Course. We will not do any programming in this book.

This book will get you started in machine learning in a smooth and natural way, preparing you for more advanced topics and dispelling the belief that machine learning is complicated, difficult, and intimidating. I want you to succeed and prosper in your career, life, and future endeavors. I am here for you. Visit me at: Online Math Training
1 INTRODUCTION

Welcome to Math for Machine Learning: Open Doors to Data Science and Artificial Intelligence! My name is Richard Han. This is a first textbook in math for machine learning.

Ideal student: If you're a working professional needing a refresher on machine learning, or a complete beginner who needs to learn machine learning for the first time, this book is for you. If your busy schedule doesn't allow you to go back to a traditional school, this book allows you to study on your own schedule and further your career goals without being left behind. If you plan on taking machine learning in college, this is a great way to get ahead. If you're currently struggling with machine learning or have struggled with it in the past, now is the time to master it.

Benefits of studying this book: After reading this book, you will have refreshed your knowledge of machine learning for your career so that you can earn a higher salary. You will have a required prerequisite for lucrative career fields such as Data Science and Artificial Intelligence. You will be in a better position to pursue a master's or PhD degree in machine learning and data science.

Why Machine Learning is important: Famous uses of machine learning include:

o Linear discriminant analysis. Linear discriminant analysis can be used to solve classification problems such as spam filtering and classifying patient illnesses.
o Logistic regression. Logistic regression can be used to solve binary classification problems such as determining whether a patient has a certain form of cancer or not.

o Artificial neural networks. Artificial neural networks can be used for applications such as self-driving cars, recommender systems, online marketing, reading medical images, and speech and face recognition.

o Support vector machines. Real-world applications of SVMs include classification of proteins and classification of images.

What my book offers: In this book, I cover core topics such as:

Linear Regression
Linear Discriminant Analysis
Logistic Regression
Artificial Neural Networks
Support Vector Machines

I explain each definition and go through each example step by step so that you understand each topic clearly. Throughout the book, there are practice problems for you to try. Detailed solutions are provided after each problem set. I hope you benefit from the book.

Best regards,
Richard Han
2 LINEAR REGRESSION

LINEAR REGRESSION

Suppose we have a set of data $(x_1, y_1), \ldots, (x_N, y_N)$. This is called the training data. Each $x_i$ is a vector $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})^T$ of measurements, where $x_{i1}$ is an instance of the first input variable $X_1$, $x_{i2}$ is an instance of the second input variable $X_2$, etc. $X_1, \ldots, X_p$ are called features or predictors. $y_1, \ldots, y_N$ are instances of the output variable $Y$, which is called the response.

In linear regression, we assume that the response depends on the input variables in a linear fashion: $y = f(X) + \epsilon$, where $f(X) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$. Here, $\epsilon$ is called the error term and $\beta_0, \ldots, \beta_p$ are called parameters. We don't know the values of $\beta_0, \ldots, \beta_p$. However, we can use the training data to approximate them.

What we'll do is look at the amount by which the predicted value $f(x_i)$ differs from the actual $y_i$ for each of the pairs $(x_1, y_1), \ldots, (x_N, y_N)$ from the training data. So we have $y_i - f(x_i)$ as the difference. We then square this and take the sum for $i = 1, \ldots, N$:

$$\mathrm{RSS}(\beta) = \sum_{i=1}^{N} \big(y_i - f(x_i)\big)^2, \qquad \text{where } \beta = (\beta_0, \beta_1, \ldots, \beta_p)^T.$$

This is called the residual sum of squares and denoted $\mathrm{RSS}(\beta)$. We want the residual sum of squares to be as small as possible. Essentially, this means that we want our predicted value $f(x_i)$ to be as close to the actual value $y_i$ as possible, for each of the pairs $(x_i, y_i)$. Doing this will give us a linear function of the input variables that best fits the given training data. In
the case of only one input variable, we get the best fit line. In the case of two input variables, we get the best fit plane. And so on, for higher dimensions.

THE LEAST SQUARES METHOD

By minimizing $\mathrm{RSS}(\beta)$, we can obtain estimates $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p$ of the parameters $\beta_0, \ldots, \beta_p$. This method is called the least squares method.

Let

$$X = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & & \vdots \\ 1 & x_{N1} & x_{N2} & \cdots & x_{Np} \end{bmatrix} \qquad \text{and} \qquad y = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix}.$$

Then

$$y - X\beta = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix} - \begin{bmatrix} \beta_0 + \beta_1 x_{11} + \cdots + \beta_p x_{1p} \\ \vdots \\ \beta_0 + \beta_1 x_{N1} + \cdots + \beta_p x_{Np} \end{bmatrix} = \begin{bmatrix} y_1 - f(x_1) \\ \vdots \\ y_N - f(x_N) \end{bmatrix}.$$

So $(y - X\beta)^T (y - X\beta) = \sum_{i=1}^{N} (y_i - f(x_i))^2 = \mathrm{RSS}(\beta)$; that is, $\mathrm{RSS}(\beta) = (y - X\beta)^T (y - X\beta)$.

Consider the vector of partial derivatives of $\mathrm{RSS}(\beta)$:

$$\begin{bmatrix} \partial \mathrm{RSS}(\beta)/\partial \beta_0 \\ \partial \mathrm{RSS}(\beta)/\partial \beta_1 \\ \vdots \\ \partial \mathrm{RSS}(\beta)/\partial \beta_p \end{bmatrix}$$
$$\mathrm{RSS}(\beta) = \big(y_1 - (\beta_0 + \beta_1 x_{11} + \cdots + \beta_p x_{1p})\big)^2 + \cdots + \big(y_N - (\beta_0 + \beta_1 x_{N1} + \cdots + \beta_p x_{Np})\big)^2$$

Let's take the partial derivative with respect to $\beta_0$:

$$\frac{\partial \mathrm{RSS}(\beta)}{\partial \beta_0} = 2\big(y_1 - (\beta_0 + \cdots + \beta_p x_{1p})\big)(-1) + \cdots + 2\big(y_N - (\beta_0 + \cdots + \beta_p x_{Np})\big)(-1) = -2\,[1 \;\; 1 \;\; \cdots \;\; 1]\,(y - X\beta).$$

Next, take the partial derivative with respect to $\beta_1$:

$$\frac{\partial \mathrm{RSS}(\beta)}{\partial \beta_1} = 2\big(y_1 - (\cdots)\big)(-x_{11}) + \cdots + 2\big(y_N - (\cdots)\big)(-x_{N1}) = -2\,[x_{11} \;\; \cdots \;\; x_{N1}]\,(y - X\beta).$$

In general,

$$\frac{\partial \mathrm{RSS}(\beta)}{\partial \beta_k} = -2\,[x_{1k} \;\; \cdots \;\; x_{Nk}]\,(y - X\beta).$$

So

$$\begin{bmatrix} \partial \mathrm{RSS}(\beta)/\partial \beta_0 \\ \partial \mathrm{RSS}(\beta)/\partial \beta_1 \\ \vdots \\ \partial \mathrm{RSS}(\beta)/\partial \beta_p \end{bmatrix} = -2 \begin{bmatrix} 1 & \cdots & 1 \\ x_{11} & \cdots & x_{N1} \\ \vdots & & \vdots \\ x_{1p} & \cdots & x_{Np} \end{bmatrix} (y - X\beta) = -2\,X^T (y - X\beta).$$

If we take the second derivative of $\mathrm{RSS}(\beta)$, say $\dfrac{\partial^2 \mathrm{RSS}(\beta)}{\partial \beta_k\, \partial \beta_j}$, we get

$$\frac{\partial}{\partial \beta_j}\Big( 2\big(y_1 - (\cdots)\big)(-x_{1k}) + \cdots + 2\big(y_N - (\cdots)\big)(-x_{Nk}) \Big) = 2x_{1j}x_{1k} + \cdots + 2x_{Nj}x_{Nk} = 2(x_{1j}x_{1k} + \cdots + x_{Nj}x_{Nk}).$$
Note that, writing $x_{i0} = 1$ for each $i$,

$$X = \begin{bmatrix} x_{10} & x_{11} & x_{12} & \cdots & x_{1p} \\ \vdots & \vdots & \vdots & & \vdots \\ x_{N0} & x_{N1} & x_{N2} & \cdots & x_{Np} \end{bmatrix}, \qquad X^T X = \begin{bmatrix} x_{10} & \cdots & x_{N0} \\ \vdots & & \vdots \\ x_{1p} & \cdots & x_{Np} \end{bmatrix} \begin{bmatrix} x_{10} & \cdots & x_{1p} \\ \vdots & & \vdots \\ x_{N0} & \cdots & x_{Np} \end{bmatrix} = (a_{jk}),$$

where $a_{jk} = x_{1j}x_{1k} + \cdots + x_{Nj}x_{Nk}$. So $\dfrac{\partial^2 \mathrm{RSS}(\beta)}{\partial \beta_k\, \partial \beta_j} = 2a_{jk}$.

The matrix of second derivatives of $\mathrm{RSS}(\beta)$ is $2X^T X$. This matrix is called the Hessian. By the second derivative test, if the Hessian of $\mathrm{RSS}(\beta)$ at a critical point is positive definite, then $\mathrm{RSS}(\beta)$ has a local minimum there.

If we set our vector of derivatives to 0, we get

$$-2X^T(y - X\beta) = 0 \;\Rightarrow\; -X^T y + X^T X\beta = 0 \;\Rightarrow\; X^T X\beta = X^T y \;\Rightarrow\; \beta = (X^T X)^{-1} X^T y.$$

Thus, we solved for the vector of parameters $\beta = (\beta_0, \beta_1, \ldots, \beta_p)^T$ which minimizes the residual sum of squares $\mathrm{RSS}(\beta)$. So we let $\hat{\beta} = (X^T X)^{-1} X^T y$.

LINEAR ALGEBRA SOLUTION TO LEAST SQUARES PROBLEM

We can arrive at the same solution for the least squares problem by using linear algebra. Let $X$ and $y$ be as before, from our training data. We want a vector $\beta$ such that $X\beta$ is close to $y$. In other words, we want a vector $\beta$ such that the distance $\lVert X\beta - y \rVert$ between $X\beta$ and $y$ is minimized. A vector $\hat{\beta}$ that minimizes $\lVert X\beta - y \rVert$ is called a least-squares solution
of $X\beta = y$. $X$ is an $N \times (p+1)$ matrix. We want a $\hat{\beta}$ in $\mathbb{R}^{p+1}$ such that $X\hat{\beta}$ is closest to $y$. Note that $X\hat{\beta}$ is a linear combination of the columns of $X$. So $X\hat{\beta}$ lies in the span of the columns of $X$, which is a subspace of $\mathbb{R}^N$ denoted $\mathrm{Col}\,X$. So we want the vector in $\mathrm{Col}\,X$ that is closest to $y$. The projection of $y$ onto the subspace $\mathrm{Col}\,X$ is that vector:

$$\mathrm{proj}_{\mathrm{Col}\,X}\, y = X\hat{\beta} \quad \text{for some } \hat{\beta} \in \mathbb{R}^{p+1}.$$

Consider $y - X\hat{\beta}$. Note that $y = X\hat{\beta} + (y - X\hat{\beta})$. $\mathbb{R}^N$ can be broken into two subspaces, $\mathrm{Col}\,X$ and $(\mathrm{Col}\,X)^{\perp}$, where $(\mathrm{Col}\,X)^{\perp}$ is the subspace of $\mathbb{R}^N$ consisting of all vectors that are orthogonal to the vectors in $\mathrm{Col}\,X$. Any vector in $\mathbb{R}^N$ can be written uniquely as $z + w$ where $z \in \mathrm{Col}\,X$ and $w \in (\mathrm{Col}\,X)^{\perp}$. Since $y \in \mathbb{R}^N$ and $y = X\hat{\beta} + (y - X\hat{\beta})$, with $X\hat{\beta} \in \mathrm{Col}\,X$, the second vector $y - X\hat{\beta}$ must lie in $(\mathrm{Col}\,X)^{\perp}$. Hence $y - X\hat{\beta}$ is orthogonal to the columns of $X$:

$$X^T (y - X\hat{\beta}) = 0 \;\Rightarrow\; X^T y - X^T X\hat{\beta} = 0 \;\Rightarrow\; X^T X\hat{\beta} = X^T y.$$

Thus, it turns out that the set of least-squares solutions of $X\beta = y$ consists of all and only the solutions to the matrix equation $X^T X\beta = X^T y$.

If $X^T X$ is positive definite, then the eigenvalues of $X^T X$ are all positive. So 0 is not an eigenvalue of $X^T X$, and it follows that $X^T X$ is invertible. Then, we can solve the equation $X^T X\beta = X^T y$ for $\beta$ to get $\hat{\beta} =$
$(X^T X)^{-1} X^T y$, which is the same result we got earlier using multi-variable calculus.

EXAMPLE: LINEAR REGRESSION

Suppose we have the following training data: $(x_1, y_1) = (1, 1)$, $(x_2, y_2) = (2, 4)$, $(x_3, y_3) = (3, 4)$. Find the best fit line using the least squares method. Find the predicted value for $x = 4$.

Solution: Form

$$X = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix} \qquad \text{and} \qquad y = \begin{bmatrix} 1 \\ 4 \\ 4 \end{bmatrix}.$$

The coefficients $\hat{\beta}_0, \hat{\beta}_1$ for the best fit line $f(x) = \hat{\beta}_0 + \hat{\beta}_1 x$ are given by $\begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{bmatrix} = (X^T X)^{-1} X^T y$.

$$X^T = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \end{bmatrix}, \qquad X^T X = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix} = \begin{bmatrix} 3 & 6 \\ 6 & 14 \end{bmatrix}, \qquad (X^T X)^{-1} = \begin{bmatrix} 7/3 & -1 \\ -1 & 1/2 \end{bmatrix}$$

$$(X^T X)^{-1} X^T y = \begin{bmatrix} 7/3 & -1 \\ -1 & 1/2 \end{bmatrix} \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \end{bmatrix} \begin{bmatrix} 1 \\ 4 \\ 4 \end{bmatrix} = \begin{bmatrix} 0 \\ 3/2 \end{bmatrix}$$

So $\hat{\beta}_0 = 0$ and $\hat{\beta}_1 = 3/2$. Thus, the best fit line is given by $f(x) = \frac{3}{2}x$. The predicted value for $x = 4$ is $f(4) = \frac{3}{2}\cdot 4 = 6$.
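The book does no programming, but the arithmetic in the example above is easy to check numerically. Here is a minimal NumPy sketch (ours, not the book's) that solves the normal equations $X^T X\beta = X^T y$ for the same training data:

```python
import numpy as np

# Worked example: training points (1, 1), (2, 4), (3, 4).
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])  # design matrix with a leading column of ones
y = np.array([1.0, 4.0, 4.0])

# Normal equations: solve (X^T X) beta = X^T y
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)                   # [0.  1.5] -> best fit line f(x) = (3/2) x
pred = beta[0] + beta[1] * 4
print(pred)                   # 6.0, the predicted value at x = 4
```

Using `np.linalg.solve` rather than explicitly forming $(X^T X)^{-1}$ is the numerically preferred way to apply this formula.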
SUMMARY: LINEAR REGRESSION

In the least squares method, we seek a linear function of the input variables that best fits the given training data. We do this by minimizing the residual sum of squares. To minimize the residual sum of squares, we apply the second derivative test from multi-variable calculus. We can arrive at the same solution to the least squares problem using linear algebra.
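The two routes in this summary — the calculus route via the normal equations and the linear algebra route via projection onto $\mathrm{Col}\,X$ — can be compared numerically. A small sketch (ours, using the worked example's data), in which `np.linalg.lstsq` plays the role of the projection argument:

```python
import numpy as np

# Same toy data as the worked example above.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 4.0, 4.0])

# Linear algebra route: least-squares solution of X beta = y.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
# Calculus route: solve the normal equations X^T X beta = X^T y.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

print(np.allclose(beta_lstsq, beta_normal))  # True: both routes agree

# The residual y - X beta is orthogonal to Col X, i.e. X^T (y - X beta) = 0.
print(np.allclose(X.T @ (y - X @ beta_lstsq), 0.0))  # True
```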
PROBLEM SET: LINEAR REGRESSION

1. Suppose we have the following training data: $(x_1, y_1) = (0, 2)$, $(x_2, y_2) = (1, 1)$, $(x_3, y_3) = (2, 4)$, $(x_4, y_4) = (3, 4)$. Find the best fit line using the least squares method. Find the predicted value for $x = 4$.

2. Suppose we have the following training data: $(x_1, y_1), \ldots, (x_4, y_4)$, where $x_1 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$, $x_2 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}$, $x_3 = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$, $x_4 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$ and $y_1 = 1$, $y_2 = 0$, $y_3 = 0$, $y_4 = 2$. Find the best fit plane using the least squares method. Find the predicted value for $x = \begin{bmatrix} 2 \\ 2 \end{bmatrix}$.
SOLUTION SET: LINEAR REGRESSION

1. Form

$$X = \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix} \qquad \text{and} \qquad y = \begin{bmatrix} 2 \\ 1 \\ 4 \\ 4 \end{bmatrix}.$$

The coefficients $\hat{\beta}_0, \hat{\beta}_1$ for the best fit line $f(x) = \hat{\beta}_0 + \hat{\beta}_1 x$ are given by $\begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{bmatrix} = (X^T X)^{-1} X^T y$.

$$X^T = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 0 & 1 & 2 & 3 \end{bmatrix}, \qquad X^T X = \begin{bmatrix} 4 & 6 \\ 6 & 14 \end{bmatrix}, \qquad (X^T X)^{-1} = \frac{1}{20}\begin{bmatrix} 14 & -6 \\ -6 & 4 \end{bmatrix}$$

$$(X^T X)^{-1} X^T y = \frac{1}{20}\begin{bmatrix} 14 & -6 \\ -6 & 4 \end{bmatrix} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 0 & 1 & 2 & 3 \end{bmatrix} \begin{bmatrix} 2 \\ 1 \\ 4 \\ 4 \end{bmatrix} = \begin{bmatrix} 14/10 \\ 9/10 \end{bmatrix}$$

So $\hat{\beta}_0 = \frac{14}{10}$ and $\hat{\beta}_1 = \frac{9}{10}$. Thus, the best fit line is given by $f(x) = \frac{14}{10} + \frac{9}{10}x$. The predicted value for $x = 4$ is $f(4) = \frac{14}{10} + \frac{9}{10}\cdot 4 = 5$.

2. Form

$$X = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 1 \end{bmatrix} \qquad \text{and} \qquad y = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 2 \end{bmatrix}.$$

The coefficients $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2$ for the best fit plane $f(x_1, x_2) = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2$ are given by $\hat{\beta} = (X^T X)^{-1} X^T y$.

$$X^T = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 \end{bmatrix}, \qquad X^T X = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 1 \end{bmatrix} = \begin{bmatrix} 4 & 2 & 2 \\ 2 & 2 & 1 \\ 2 & 1 & 2 \end{bmatrix}$$
$$(X^T X)^{-1} = \frac{1}{4}\begin{bmatrix} 3 & -2 & -2 \\ -2 & 4 & 0 \\ -2 & 0 & 4 \end{bmatrix}$$

$$(X^T X)^{-1} X^T y = \frac{1}{4}\begin{bmatrix} 3 & -2 & -2 \\ -2 & 4 & 0 \\ -2 & 0 & 4 \end{bmatrix} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 \end{bmatrix} \begin{bmatrix} 1 \\ 0 \\ 0 \\ 2 \end{bmatrix} = \frac{1}{4}\begin{bmatrix} 1 \\ 2 \\ 2 \end{bmatrix}$$

So $\hat{\beta}_0 = \frac{1}{4}$, $\hat{\beta}_1 = \frac{1}{2}$, $\hat{\beta}_2 = \frac{1}{2}$. Thus, the best fit plane is given by $f(x_1, x_2) = \frac{1}{4} + \frac{1}{2}x_1 + \frac{1}{2}x_2$. The predicted value for $x = \begin{bmatrix} 2 \\ 2 \end{bmatrix}$ is $f(2, 2) = \frac{9}{4}$.
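Both solutions can be double-checked numerically. A hedged NumPy sketch (ours; the data values are those given in the problem statements as reconstructed above):

```python
import numpy as np

# Problem 1: inputs x = 0, 1, 2, 3 with responses y = 2, 1, 4, 4.
X1 = np.array([[1, 0], [1, 1], [1, 2], [1, 3]], dtype=float)
y1 = np.array([2.0, 1.0, 4.0, 4.0])
beta1 = np.linalg.solve(X1.T @ X1, X1.T @ y1)
print(beta1)                    # [1.4 0.9]
print(beta1[0] + beta1[1] * 4)  # 5.0, the prediction at x = 4

# Problem 2: inputs at the corners of the unit square, y = 1, 0, 0, 2.
X2 = np.array([[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1]], dtype=float)
y2 = np.array([1.0, 0.0, 0.0, 2.0])
beta2 = np.linalg.solve(X2.T @ X2, X2.T @ y2)
print(beta2)                          # [0.25 0.5  0.5 ]
print(beta2 @ [1.0, 2.0, 2.0])        # 2.25, the prediction at (2, 2)
```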
3 LINEAR DISCRIMINANT ANALYSIS

CLASSIFICATION

In the problem of regression, we had a set of data $(x_1, y_1), \ldots, (x_N, y_N)$ and we wanted to predict the values of the response variable $Y$ for new data points. The values that $Y$ took were numerical, quantitative values. In certain problems, the values for the response variable $Y$ that we want to predict are not quantitative but qualitative. So the values of $Y$ will come from a finite set of classes or categories. Problems of this sort are called classification problems. Some examples of classification problems are classifying an email as spam or not spam and classifying a patient's illness as one among a finite number of diseases.

LINEAR DISCRIMINANT ANALYSIS

One method for solving a classification problem is called linear discriminant analysis. What we'll do is estimate $\Pr(Y = k \mid X = x)$, the probability that $Y$ is the class $k$ given that the input variable $X$ is $x$. Once we have all of these probabilities for a fixed $x$, we pick the class $k$ for which the probability $\Pr(Y = k \mid X = x)$ is largest. We then classify $x$ as that class $k$.

THE POSTERIOR PROBABILITY FUNCTIONS

In this section, we'll build a formula for the posterior probability $\Pr(Y = k \mid X = x)$. Let $\pi_k = \Pr(Y = k)$, the prior probability that $Y = k$. Let $f_k(x) = \Pr(X = x \mid Y = k)$, the probability that $X = x$, given that $Y = k$. By Bayes' rule,

$$\Pr(Y = k \mid X = x) = \frac{\Pr(X = x \mid Y = k)\,\Pr(Y = k)}{\sum_{l=1}^{K} \Pr(X = x \mid Y = l)\,\Pr(Y = l)} = \frac{f_k(x)\,\pi_k}{\sum_{l=1}^{K} f_l(x)\,\pi_l} = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}.$$

Here we assume that $k$ can take on the values $1, \ldots, K$. We can think of $\Pr(Y = k \mid X = x)$ as a function of $x$ and denote it as $p_k(x)$. So

$$p_k(x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}.$$

Recall that $p_k(x)$ is the posterior probability that $Y = k$ given that $X = x$.
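A quick numeric illustration of Bayes' rule as used here, with hypothetical priors and likelihoods (the numbers below are made up for illustration, not from the book):

```python
import numpy as np

# Hypothetical priors pi_k and likelihoods f_k(x) for K = 3 classes at a fixed x.
priors = np.array([0.5, 0.3, 0.2])       # pi_k = Pr(Y = k)
likelihoods = np.array([0.1, 0.4, 0.4])  # f_k(x) = Pr(X = x | Y = k)

# Bayes' rule: p_k(x) = pi_k f_k(x) / sum_l pi_l f_l(x)
posterior = priors * likelihoods / np.sum(priors * likelihoods)
print(posterior)             # [0.2  0.48 0.32] -- sums to 1
print(np.argmax(posterior))  # 1: the class with the largest posterior (0-indexed)
```

Note that the class with the largest prior (class 0 here) need not win once the likelihoods are taken into account.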
MODELLING THE POSTERIOR PROBABILITY FUNCTIONS

Remember that we wanted to estimate $\Pr(Y = k \mid X = x)$ for any given $x$. That is, we want an estimate for $p_k(x)$. If we can get estimates for $\pi_k$, $f_k(x)$, and for $\pi_l$, $f_l(x)$ for each $l = 1, \ldots, K$, then we would have an estimate for $p_k(x)$.

Let's say that $X = (X_1, X_2, \ldots, X_p)$, where $X_1, \ldots, X_p$ are the input variables. So the values of $X$ will be vectors of $p$ elements. We will assume that the conditional distribution of $X$ given $Y = k$ is the multivariate Gaussian distribution $N(\mu_k, \Sigma)$, where $\mu_k$ is a class-specific mean vector and $\Sigma$ is the covariance of $X$.

The class-specific mean vector $\mu_k$ is given by the vector of class-specific means $(\mu_{k1}, \ldots, \mu_{kp})^T$, where $\mu_{kj}$ is the class-specific mean of $X_j$. So $\mu_{kj} = \sum_{i : y_i = k} x_{ij}\Pr(X_j = x_{ij})$. Recall that $x_i = (x_{i1}, \ldots, x_{ip})^T$. (For all those $x_i$ for which $y_i = k$, we're taking the mean of their $j$th components.)

$\Sigma$, the covariance matrix of $X$, is given by the matrix of covariances of $X_i$ and $X_j$. So $\Sigma = (a_{ij})$, where $a_{ij} = \mathrm{Cov}(X_i, X_j) = E[(X_i - \mu_{X_i})(X_j - \mu_{X_j})]$.

The multivariate Gaussian density for the multivariate Gaussian distribution $N(\mu, \Sigma)$ is given by

$$f(x) = \frac{1}{(2\pi)^{p/2}\,|\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)}.$$

Since we're assuming that the conditional distribution of $X$ given $Y = k$ is the multivariate Gaussian distribution $N(\mu_k, \Sigma)$, we have

$$f_k(x) = \Pr(X = x \mid Y = k) = \frac{1}{(2\pi)^{p/2}\,|\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x-\mu_k)^T \Sigma^{-1} (x-\mu_k)}.$$

Recall that $p_k(x) = \dfrac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}$.
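The multivariate Gaussian density above can be written as a short function; this is an illustrative sketch (the helper name `gaussian_density` is ours, not the book's):

```python
import numpy as np

def gaussian_density(x, mu, sigma):
    """Multivariate Gaussian density N(mu, Sigma) evaluated at x."""
    p = len(mu)
    diff = x - mu
    # Normalizing constant 1 / ((2 pi)^{p/2} |Sigma|^{1/2})
    norm = 1.0 / np.sqrt((2 * np.pi) ** p * np.linalg.det(sigma))
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), via a linear solve
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(sigma, diff))

# Sanity check: at the mean of a standard bivariate Gaussian the density
# is 1 / (2 pi), since the exponential term equals 1 there.
val = gaussian_density(np.zeros(2), np.zeros(2), np.eye(2))
print(val)  # ~0.15915
```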
Plugging in what we have for $f_k(x)$, we get

$$p_k(x) = \frac{\pi_k \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x-\mu_k)^T \Sigma^{-1}(x-\mu_k)}}{\sum_{l=1}^{K} \pi_l \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x-\mu_l)^T \Sigma^{-1}(x-\mu_l)}} = \frac{\pi_k\, e^{-\frac{1}{2}(x-\mu_k)^T \Sigma^{-1}(x-\mu_k)}}{\sum_{l=1}^{K} \pi_l\, e^{-\frac{1}{2}(x-\mu_l)^T \Sigma^{-1}(x-\mu_l)}}.$$

Note that the denominator of the second expression equals $(2\pi)^{p/2}|\Sigma|^{1/2} \sum_{l=1}^{K} \pi_l f_l(x)$ and that

$$\sum_{l=1}^{K} \pi_l f_l(x) = \sum_{l=1}^{K} f_l(x)\,\pi_l = \sum_{l=1}^{K} \Pr(X = x \mid Y = l)\Pr(Y = l) = \Pr(X = x).$$

So the denominator is just $(2\pi)^{p/2}|\Sigma|^{1/2}\,\Pr(X = x)$. Hence,

$$p_k(x) = \frac{\pi_k\, e^{-\frac{1}{2}(x-\mu_k)^T \Sigma^{-1}(x-\mu_k)}}{(2\pi)^{p/2}|\Sigma|^{1/2}\,\Pr(X = x)}.$$
LINEAR DISCRIMINANT FUNCTIONS

Recall that we want to choose the class $k$ for which the posterior probability $p_k(x)$ is largest. Since the logarithm function is order-preserving, maximizing $p_k(x)$ is the same as maximizing $\log p_k(x)$. Taking $\log p_k(x)$ gives

$$\log p_k(x) = \log \pi_k - \tfrac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k) - \log C, \qquad \text{where } C = (2\pi)^{p/2}|\Sigma|^{1/2}\Pr(X = x).$$

Expanding the quadratic term:

$$\begin{aligned}
\log p_k(x) &= \log \pi_k - \tfrac{1}{2}(x^T \Sigma^{-1} - \mu_k^T \Sigma^{-1})(x - \mu_k) - \log C \\
&= \log \pi_k - \tfrac{1}{2}\big[x^T \Sigma^{-1} x - x^T \Sigma^{-1}\mu_k - \mu_k^T \Sigma^{-1} x + \mu_k^T \Sigma^{-1}\mu_k\big] - \log C \\
&= \log \pi_k - \tfrac{1}{2}\big[x^T \Sigma^{-1} x - 2\,x^T \Sigma^{-1}\mu_k + \mu_k^T \Sigma^{-1}\mu_k\big] - \log C,
\end{aligned}$$

because $x^T \Sigma^{-1}\mu_k = \mu_k^T \Sigma^{-1} x$. Proof: since $x^T \Sigma^{-1}\mu_k$ is a scalar, it equals its own transpose, so $x^T \Sigma^{-1}\mu_k = \mu_k^T (\Sigma^{-1})^T x = \mu_k^T (\Sigma^T)^{-1} x = \mu_k^T \Sigma^{-1} x$, because $\Sigma$ is symmetric.

Continuing,

$$\log p_k(x) = x^T \Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k^T \Sigma^{-1}\mu_k + \log \pi_k - \tfrac{1}{2}x^T \Sigma^{-1} x - \log C.$$

Let

$$\delta_k(x) = x^T \Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k^T \Sigma^{-1}\mu_k + \log \pi_k.$$

Then $\log p_k(x) = \delta_k(x) - \tfrac{1}{2}x^T \Sigma^{-1}x - \log C$. $\delta_k(x)$ is called a linear discriminant function. Maximizing $\log p_k(x)$ is the same as maximizing $\delta_k(x)$, since $-\tfrac{1}{2}x^T \Sigma^{-1}x - \log C$ does not depend on $k$.

ESTIMATING THE LINEAR DISCRIMINANT FUNCTIONS

Now, if we can find estimates for $\pi_k$, $\mu_k$, and $\Sigma$, then we would have an estimate for $p_k(x)$ and hence for $\log p_k(x)$ and $\delta_k(x)$. In an attempt to maximize $p_k(x)$, we instead maximize the estimate of $p_k(x)$, which is the same as
maximizing the estimate of $\delta_k(x)$.

$\pi_k$ can be estimated as $\hat{\pi}_k = \frac{N_k}{N}$, where $N_k$ is the number of training data points in class $k$ and $N$ is the total number of training data points. Remember $\pi_k = \Pr(Y = k)$. We're estimating this by just taking the proportion of data points in class $k$.

The class-specific mean vector is $\mu_k = (\mu_{k1}, \ldots, \mu_{kp})^T$, where $\mu_{kj} = \sum_{i : y_i = k} x_{ij}\Pr(X_j = x_{ij})$. We can estimate $\mu_{kj}$ as $\frac{1}{N_k}\sum_{i : y_i = k} x_{ij}$. So we can estimate $\mu_k$ as

$$\hat{\mu}_k = \begin{bmatrix} \frac{1}{N_k}\sum_{i : y_i = k} x_{i1} \\ \vdots \\ \frac{1}{N_k}\sum_{i : y_i = k} x_{ip} \end{bmatrix} = \frac{1}{N_k}\sum_{i : y_i = k} x_i.$$

In other words, we estimate the class-specific mean vector by the vector of averages of each component over all $x_i$ in class $k$.

Finally, the covariance matrix $\Sigma$ is estimated as

$$\hat{\Sigma} = \frac{1}{N - K}\sum_{k=1}^{K}\sum_{i : y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T.$$

Recall that $\delta_k(x) = x^T \Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k^T \Sigma^{-1}\mu_k + \log \pi_k$. So

$$\hat{\delta}_k(x) = x^T \hat{\Sigma}^{-1}\hat{\mu}_k - \tfrac{1}{2}\hat{\mu}_k^T \hat{\Sigma}^{-1}\hat{\mu}_k + \log \hat{\pi}_k.$$

Note that $\hat{\Sigma}$, $\hat{\mu}_k$, and $\hat{\pi}_k$ depend only on the training data and not on $x$. Note also that $x$ is a vector and $x^T \hat{\Sigma}^{-1}\hat{\mu}_k$ is a linear combination of the components of $x$. Hence, $\hat{\delta}_k(x)$ is a linear combination of the components of $x$. This is why it's called a linear discriminant function.

CLASSIFYING DATA POINTS USING LINEAR DISCRIMINANT FUNCTIONS

If $(k_1, k_2)$ is a pair of classes, we can consider whether $\hat{\delta}_{k_1}(x) > \hat{\delta}_{k_2}(x)$. If so, we know $x$ is not in class $k_2$. Then, we can compare $\hat{\delta}_{k_1}(x)$ with $\hat{\delta}_{k_3}(x)$ and rule out another class. Once we've exhausted all
the classes, we'll know which class $x$ should be assigned to.

Setting $\hat{\delta}_{k_1}(x) = \hat{\delta}_{k_2}(x)$, we get

$$x^T \hat{\Sigma}^{-1}\hat{\mu}_{k_1} - \tfrac{1}{2}\hat{\mu}_{k_1}^T \hat{\Sigma}^{-1}\hat{\mu}_{k_1} + \log \hat{\pi}_{k_1} = x^T \hat{\Sigma}^{-1}\hat{\mu}_{k_2} - \tfrac{1}{2}\hat{\mu}_{k_2}^T \hat{\Sigma}^{-1}\hat{\mu}_{k_2} + \log \hat{\pi}_{k_2}.$$

This gives us a hyperplane in $\mathbb{R}^p$ which separates class $k_1$ from class $k_2$. If we find the separating hyperplane for each pair of classes, we get something like this: [Figure: separating lines for each pair of classes dividing the plane into class regions. In this example, $p = 2$ and $K = 3$.]

LDA EXAMPLE 1

Suppose we have a set of data $(x_1, y_1), \ldots, (x_6, y_6)$ as follows: $x_1 = (1, 3)$, $x_2 = (2, 3)$, $x_3 = (2, 4)$, $x_4 = (3, 1)$, $x_5 = (3, 2)$, $x_6 = (4, 2)$, with $y_1 = y_2 = y_3 = k_1 = 1$ and $y_4 = y_5 = y_6 = k_2 = 2$. Apply linear discriminant analysis by doing the following:

a) Find estimates for the linear discriminant functions $\hat{\delta}_1(x)$ and $\hat{\delta}_2(x)$.
b) Find the line that decides between the two classes.
c) Classify the new point $x = (5, 0)$.

Solution: [Figure: a graph of the data points.]
The number of features $p$ is 2, the number of classes $K$ is 2, the total number of data points $N$ is 6, the number $N_1$ of data points in class $k_1$ is 3, and the number $N_2$ of data points in class $k_2$ is 3.

First, we will find estimates for $\pi_1$ and $\pi_2$, the prior probabilities that $Y = k_1$ and $Y = k_2$, respectively. Then, we will find estimates for $\mu_1$ and $\mu_2$, the class-specific mean vectors. We can then calculate the estimate for the covariance matrix $\Sigma$. Finally, using the estimates $\hat{\pi}_1, \hat{\pi}_2, \hat{\mu}_1, \hat{\mu}_2, \hat{\Sigma}$, we can find the estimates for the linear discriminant functions $\hat{\delta}_1(x)$ and $\hat{\delta}_2(x)$.

$$\hat{\pi}_1 = \frac{N_1}{N} = \frac{3}{6} = \frac{1}{2}, \qquad \hat{\pi}_2 = \frac{N_2}{N} = \frac{3}{6} = \frac{1}{2}$$

$$\hat{\mu}_1 = \frac{1}{N_1}\sum_{i : y_i = 1} x_i = \frac{1}{3}[x_1 + x_2 + x_3] = \begin{bmatrix} 5/3 \\ 10/3 \end{bmatrix}, \qquad \hat{\mu}_2 = \frac{1}{N_2}\sum_{i : y_i = 2} x_i = \frac{1}{3}[x_4 + x_5 + x_6] = \begin{bmatrix} 10/3 \\ 5/3 \end{bmatrix}$$

$$\hat{\Sigma} = \frac{1}{N - K}\sum_{k=1}^{K}\sum_{i : y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T$$
$$= \frac{1}{6 - 2}\sum_{k=1}^{2}\sum_{i : y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T.$$

Plugging in what we got for $\hat{\mu}_1$ and $\hat{\mu}_2$, we get

$$\hat{\Sigma} = \frac{1}{4}\begin{bmatrix} 4/3 & 2/3 \\ 2/3 & 4/3 \end{bmatrix} = \begin{bmatrix} 1/3 & 1/6 \\ 1/6 & 1/3 \end{bmatrix}, \qquad \hat{\Sigma}^{-1} = \begin{bmatrix} 4 & -2 \\ -2 & 4 \end{bmatrix}$$

$$\hat{\delta}_1(x) = x^T \hat{\Sigma}^{-1}\hat{\mu}_1 - \tfrac{1}{2}\hat{\mu}_1^T \hat{\Sigma}^{-1}\hat{\mu}_1 + \log \hat{\pi}_1 = x^T \begin{bmatrix} 0 \\ 10 \end{bmatrix} - \tfrac{1}{2}\Big(\tfrac{100}{3}\Big) + \log \tfrac{1}{2} = 10X_2 - \tfrac{50}{3} + \log \tfrac{1}{2}$$

$$\hat{\delta}_2(x) = x^T \hat{\Sigma}^{-1}\hat{\mu}_2 - \tfrac{1}{2}\hat{\mu}_2^T \hat{\Sigma}^{-1}\hat{\mu}_2 + \log \hat{\pi}_2 = x^T \begin{bmatrix} 10 \\ 0 \end{bmatrix} - \tfrac{1}{2}\Big(\tfrac{100}{3}\Big) + \log \tfrac{1}{2} = 10X_1 - \tfrac{50}{3} + \log \tfrac{1}{2}$$

Setting $\hat{\delta}_1(x) = \hat{\delta}_2(x)$:

$$10X_2 - \tfrac{50}{3} + \log \tfrac{1}{2} = 10X_1 - \tfrac{50}{3} + \log \tfrac{1}{2} \;\Rightarrow\; 10X_2 = 10X_1 \;\Rightarrow\; X_2 = X_1.$$

So, the line that decides between the two classes is given by $X_2 = X_1$. [Figure: a graph of the deciding line.]
If $\hat{\delta}_1(x) > \hat{\delta}_2(x)$, then we classify $x$ as of class $k_1$. So if $x$ is above the line $X_2 = X_1$, then we classify $x$ as of class $k_1$. Conversely, if $\hat{\delta}_1(x) < \hat{\delta}_2(x)$, then we classify $x$ as of class $k_2$. This corresponds to $x$ being below the line $X_2 = X_1$. The point $(5, 0)$ is below the line; so we classify it as of class $k_2$.

LDA EXAMPLE 2

Suppose we have a set of data $(x_1, y_1), \ldots, (x_6, y_6)$ as follows: $x_1 = (0, 2)$, $x_2 = (1, 2)$, $x_3 = (2, 0)$, $x_4 = (2, 1)$, $x_5 = (3, 3)$, $x_6 = (4, 4)$, with $y_1 = y_2 = k_1 = 1$, $y_3 = y_4 = k_2 = 2$, and $y_5 = y_6 = k_3 = 3$. Apply linear discriminant analysis by doing the following:

a) Find estimates for the linear discriminant functions $\hat{\delta}_1(x)$, $\hat{\delta}_2(x)$, and $\hat{\delta}_3(x)$.
b) Find the lines that decide between each pair of classes.
c) Classify the new point $x = (1, 3)$.

Solution: [Figure: a graph of the data points.]
The number of features $p$ is 2, the number of classes $K$ is 3, the total number of data points $N$ is 6, the number $N_1$ of data points in class $k_1$ is 2, the number $N_2$ of data points in class $k_2$ is 2, and the number $N_3$ of data points in class $k_3$ is 2.

First, we will find estimates for $\pi_1, \pi_2, \pi_3$, the prior probabilities that $Y = k_1$, $Y = k_2$, $Y = k_3$, respectively. Then, we will find estimates for $\mu_1, \mu_2, \mu_3$, the class-specific mean vectors. We can then calculate the estimate for the covariance matrix $\Sigma$. Finally, using the estimates $\hat{\pi}_1, \hat{\pi}_2, \hat{\pi}_3, \hat{\mu}_1, \hat{\mu}_2, \hat{\mu}_3, \hat{\Sigma}$, we can find the estimates for the linear discriminant functions $\hat{\delta}_1(x)$, $\hat{\delta}_2(x)$, and $\hat{\delta}_3(x)$.

$$\hat{\pi}_1 = \frac{N_1}{N} = \frac{2}{6} = \frac{1}{3}, \qquad \hat{\pi}_2 = \frac{N_2}{N} = \frac{2}{6} = \frac{1}{3}, \qquad \hat{\pi}_3 = \frac{N_3}{N} = \frac{2}{6} = \frac{1}{3}$$

$$\hat{\mu}_1 = \frac{1}{N_1}\sum_{i : y_i = 1} x_i = \frac{1}{2}[x_1 + x_2] = \begin{bmatrix} 1/2 \\ 2 \end{bmatrix}, \qquad \hat{\mu}_2 = \frac{1}{N_2}\sum_{i : y_i = 2} x_i = \frac{1}{2}[x_3 + x_4] = \begin{bmatrix} 2 \\ 1/2 \end{bmatrix}$$
$$\hat{\mu}_3 = \frac{1}{N_3}\sum_{i : y_i = 3} x_i = \frac{1}{2}[x_5 + x_6] = \begin{bmatrix} 7/2 \\ 7/2 \end{bmatrix}$$

$$\hat{\Sigma} = \frac{1}{N - K}\sum_{k=1}^{K}\sum_{i : y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T = \frac{1}{6-3}\begin{bmatrix} 1 & 1/2 \\ 1/2 & 1 \end{bmatrix} = \begin{bmatrix} 1/3 & 1/6 \\ 1/6 & 1/3 \end{bmatrix}, \qquad \hat{\Sigma}^{-1} = \begin{bmatrix} 4 & -2 \\ -2 & 4 \end{bmatrix}$$

$$\hat{\delta}_1(x) = x^T \hat{\Sigma}^{-1}\hat{\mu}_1 - \tfrac{1}{2}\hat{\mu}_1^T \hat{\Sigma}^{-1}\hat{\mu}_1 + \log \hat{\pi}_1 = x^T \begin{bmatrix} -2 \\ 7 \end{bmatrix} - \tfrac{1}{2}(13) + \log \tfrac{1}{3} = -2X_1 + 7X_2 - \tfrac{13}{2} + \log \tfrac{1}{3}$$

$$\hat{\delta}_2(x) = x^T \hat{\Sigma}^{-1}\hat{\mu}_2 - \tfrac{1}{2}\hat{\mu}_2^T \hat{\Sigma}^{-1}\hat{\mu}_2 + \log \hat{\pi}_2 = x^T \begin{bmatrix} 7 \\ -2 \end{bmatrix} - \tfrac{1}{2}(13) + \log \tfrac{1}{3} = 7X_1 - 2X_2 - \tfrac{13}{2} + \log \tfrac{1}{3}$$

$$\hat{\delta}_3(x) = x^T \hat{\Sigma}^{-1}\hat{\mu}_3 - \tfrac{1}{2}\hat{\mu}_3^T \hat{\Sigma}^{-1}\hat{\mu}_3 + \log \hat{\pi}_3 = x^T \begin{bmatrix} 7 \\ 7 \end{bmatrix} - \tfrac{1}{2}(49) + \log \tfrac{1}{3} = 7X_1 + 7X_2 - \tfrac{49}{2} + \log \tfrac{1}{3}$$

Setting $\hat{\delta}_1(x) = \hat{\delta}_2(x)$:

$$-2X_1 + 7X_2 - \tfrac{13}{2} + \log \tfrac{1}{3} = 7X_1 - 2X_2 - \tfrac{13}{2} + \log \tfrac{1}{3} \;\Rightarrow\; 9X_2 = 9X_1$$
$\Rightarrow X_2 = X_1$. So, the line that decides between classes $k_1$ and $k_2$ is given by $X_2 = X_1$.

Setting $\hat{\delta}_1(x) = \hat{\delta}_3(x)$:

$$-2X_1 + 7X_2 - \tfrac{13}{2} + \log \tfrac{1}{3} = 7X_1 + 7X_2 - \tfrac{49}{2} + \log \tfrac{1}{3} \;\Rightarrow\; 18 = 9X_1 \;\Rightarrow\; X_1 = 2.$$

So, the line that decides between classes $k_1$ and $k_3$ is given by $X_1 = 2$.

Setting $\hat{\delta}_2(x) = \hat{\delta}_3(x)$:

$$7X_1 - 2X_2 - \tfrac{13}{2} + \log \tfrac{1}{3} = 7X_1 + 7X_2 - \tfrac{49}{2} + \log \tfrac{1}{3} \;\Rightarrow\; 18 = 9X_2 \;\Rightarrow\; X_2 = 2.$$

So, the line that decides between classes $k_2$ and $k_3$ is given by $X_2 = 2$.

[Figure: a graph of the deciding lines.]

The lines divide the plane into 3 regions. $\hat{\delta}_1(x) > \hat{\delta}_2(x)$ corresponds to the region above the line $X_2 = X_1$. Conversely, $\hat{\delta}_1(x) < \hat{\delta}_2(x)$ corresponds to the region below the line $X_2 = X_1$. $\hat{\delta}_1(x) > \hat{\delta}_3(x)$ corresponds to the region to the left of the line $X_1 = 2$. Conversely, $\hat{\delta}_1(x) < \hat{\delta}_3(x)$
corresponds to the region to the right of $X_1 = 2$. $\hat{\delta}_2(x) > \hat{\delta}_3(x)$ corresponds to the region below the line $X_2 = 2$. Conversely, $\hat{\delta}_2(x) < \hat{\delta}_3(x)$ corresponds to the region above the line $X_2 = 2$.

If $\hat{\delta}_1(x) > \hat{\delta}_2(x)$ and $\hat{\delta}_1(x) > \hat{\delta}_3(x)$, then we classify $x$ as of class $k_1$. So if $x$ is in region I, then we classify $x$ as of class $k_1$. Conversely, if $x$ is in region II, then we classify $x$ as of class $k_2$; and if $x$ is in region III, we classify $x$ as of class $k_3$. The point $(1, 3)$ is in region I; so we classify it as of class $k_1$.
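The estimates in this example can be reproduced mechanically. The sketch below (our own code, not the book's) recomputes $\hat{\pi}_k$, $\hat{\mu}_k$, the pooled $\hat{\Sigma}$, and the discriminant scores, then classifies the new point:

```python
import numpy as np

# Data from the second LDA example: two points per class.
X = np.array([[0, 2], [1, 2],    # class 1
              [2, 0], [2, 1],    # class 2
              [3, 3], [4, 4]],   # class 3
             dtype=float)
y = np.array([1, 1, 2, 2, 3, 3])
classes = [1, 2, 3]
N, K = len(y), len(classes)

pi = {k: np.mean(y == k) for k in classes}          # prior estimates N_k / N
mu = {k: X[y == k].mean(axis=0) for k in classes}   # class-specific means

# Pooled covariance estimate: sum of outer products of deviations, over N - K.
Sigma = sum(np.outer(d, d) for k in classes for d in X[y == k] - mu[k]) / (N - K)
Sigma_inv = np.linalg.inv(Sigma)

def delta(k, x):
    # delta_k(x) = x^T Sigma^{-1} mu_k - (1/2) mu_k^T Sigma^{-1} mu_k + log pi_k
    return x @ Sigma_inv @ mu[k] - 0.5 * mu[k] @ Sigma_inv @ mu[k] + np.log(pi[k])

print(Sigma_inv)     # [[ 4. -2.], [-2.  4.]]
x_new = np.array([1.0, 3.0])
best = max(classes, key=lambda k: delta(k, x_new))
print(best)          # 1: the point (1, 3) lands in class k_1 (region I)
```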
SUMMARY: LINEAR DISCRIMINANT ANALYSIS

In linear discriminant analysis, we find estimates $\hat{p}_k(x)$ of the posterior probability $p_k(x)$ that $Y = k$ given that $X = x$. We classify $x$ according to the class $k$ that gives the highest estimated posterior probability $\hat{p}_k(x)$. Maximizing the estimated posterior probability $\hat{p}_k(x)$ is equivalent to maximizing the log of $\hat{p}_k(x)$, which, in turn, is equivalent to maximizing the estimated linear discriminant function $\hat{\delta}_k(x)$. We find estimates of the prior probability $\pi_k$ that $Y = k$, of the class-specific mean vectors $\mu_k$, and of the covariance matrix $\Sigma$ in order to estimate the linear discriminant functions $\delta_k(x)$. By setting $\hat{\delta}_{k_1}(x) = \hat{\delta}_{k_2}(x)$ for each pair $(k_1, k_2)$ of classes, we get hyperplanes in $\mathbb{R}^p$ that, together, divide $\mathbb{R}^p$ into regions corresponding to the distinct classes. We classify $x$ according to the class $k$ for which $\hat{\delta}_k(x)$ is largest.
PROBLEM SET: LINEAR DISCRIMINANT ANALYSIS

1. Suppose we have a set of data $(x_1, y_1), \ldots, (x_6, y_6)$ as follows: $x_1 = (1, 2)$, $x_2 = (2, 1)$, $x_3 = (2, 2)$, $x_4 = (3, 3)$, $x_5 = (3, 4)$, $x_6 = (4, 3)$, with $y_1 = y_2 = y_3 = k_1 = 1$ and $y_4 = y_5 = y_6 = k_2 = 2$. Apply linear discriminant analysis by doing the following:
a) Find estimates for the linear discriminant functions $\hat{\delta}_1(x)$ and $\hat{\delta}_2(x)$.
b) Find the line that decides between the two classes.
c) Classify the new point $x = (4, 5)$.

2. Suppose we have a set of data $(x_1, y_1), \ldots, (x_6, y_6)$ as follows: $x_1 = (0, 0)$, $x_2 = (1, 1)$, $x_3 = (2, 3)$, $x_4 = (2, 4)$, $x_5 = (3, 2)$, $x_6 = (4, 2)$, with $y_1 = y_2 = k_1 = 1$, $y_3 = y_4 = k_2 = 2$, and $y_5 = y_6 = k_3 = 3$. Apply linear discriminant analysis by doing the following:
a) Find estimates for the linear discriminant functions $\hat{\delta}_1(x)$, $\hat{\delta}_2(x)$, and $\hat{\delta}_3(x)$.
b) Find the lines that decide between each pair of classes.
c) Classify the new point $x = (3, 0)$.
SOLUTION SET: LINEAR DISCRIMINANT ANALYSIS

1. [Figure: a graph of the data points.]

The number of features $p$ is 2, the number of classes $K$ is 2, the total number of data points $N$ is 6, the number $N_1$ of data points in class $k_1$ is 3, and the number $N_2$ of data points in class $k_2$ is 3.

First, we will find estimates for $\pi_1$ and $\pi_2$, the prior probabilities that $Y = k_1$ and $Y = k_2$, respectively. Then, we will find estimates for $\mu_1$ and $\mu_2$, the class-specific mean vectors. We can then calculate the estimate for the covariance matrix $\Sigma$. Finally, using the estimates $\hat{\pi}_1, \hat{\pi}_2, \hat{\mu}_1, \hat{\mu}_2, \hat{\Sigma}$, we can find the estimates for the linear discriminant functions $\hat{\delta}_1(x)$ and $\hat{\delta}_2(x)$.

$$\hat{\pi}_1 = \frac{N_1}{N} = \frac{3}{6} = \frac{1}{2}, \qquad \hat{\pi}_2 = \frac{N_2}{N} = \frac{3}{6} = \frac{1}{2}$$

$$\hat{\mu}_1 = \frac{1}{N_1}\sum_{i : y_i = 1} x_i = \frac{1}{3}[x_1 + x_2 + x_3] = \begin{bmatrix} 5/3 \\ 5/3 \end{bmatrix}$$
$$\hat{\mu}_2 = \frac{1}{N_2}\sum_{i : y_i = 2} x_i = \frac{1}{3}[x_4 + x_5 + x_6] = \begin{bmatrix} 10/3 \\ 10/3 \end{bmatrix}$$

$$\hat{\Sigma} = \frac{1}{N - K}\sum_{k=1}^{K}\sum_{i : y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T = \frac{1}{4}\begin{bmatrix} 12/9 & -6/9 \\ -6/9 & 12/9 \end{bmatrix} = \begin{bmatrix} 1/3 & -1/6 \\ -1/6 & 1/3 \end{bmatrix}, \qquad \hat{\Sigma}^{-1} = \begin{bmatrix} 4 & 2 \\ 2 & 4 \end{bmatrix}$$

$$\hat{\delta}_1(x) = x^T \hat{\Sigma}^{-1}\hat{\mu}_1 - \tfrac{1}{2}\hat{\mu}_1^T \hat{\Sigma}^{-1}\hat{\mu}_1 + \log \hat{\pi}_1 = x^T \begin{bmatrix} 10 \\ 10 \end{bmatrix} - \tfrac{1}{2}\Big(\tfrac{100}{3}\Big) + \log \tfrac{1}{2} = 10X_1 + 10X_2 - \tfrac{50}{3} + \log \tfrac{1}{2}$$

$$\hat{\delta}_2(x) = x^T \hat{\Sigma}^{-1}\hat{\mu}_2 - \tfrac{1}{2}\hat{\mu}_2^T \hat{\Sigma}^{-1}\hat{\mu}_2 + \log \hat{\pi}_2 = x^T \begin{bmatrix} 20 \\ 20 \end{bmatrix} - \tfrac{1}{2}\Big(\tfrac{400}{3}\Big) + \log \tfrac{1}{2} = 20X_1 + 20X_2 - \tfrac{200}{3} + \log \tfrac{1}{2}$$

Setting $\hat{\delta}_1(x) = \hat{\delta}_2(x)$:

$$10X_1 + 10X_2 - \tfrac{50}{3} + \log \tfrac{1}{2} = 20X_1 + 20X_2 - \tfrac{200}{3} + \log \tfrac{1}{2} \;\Rightarrow\; 50 = 10X_1 + 10X_2 \;\Rightarrow\; X_1 + X_2 = 5 \;\Rightarrow\; X_2 = -X_1 + 5.$$

So, the line that decides between the two classes is given by $X_2 = -X_1 + 5$. [Figure: a graph of the decision line.]
If δ̂_1(x) > δ̂_2(x), then we classify x as of class k_1. So if x is below the line X_2 = −X_1 + 5, then we classify x as of class k_1. Conversely, if δ̂_1(x) < δ̂_2(x), then we classify x as of class k_2. This corresponds to x being above the line X_2 = −X_1 + 5. The point (4, 5) is above the line; so we classify it as of class k_2.
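The hand computation above can be checked numerically. The following sketch (a NumPy implementation is my own addition, not part of the book) rebuilds the prior, mean, and pooled-covariance estimates for Problem 1 and classifies the new point (4, 5) by the larger discriminant score:

```python
import numpy as np

# Data from Problem 1: classes k1 = 1 (first three points), k2 = 2 (last three)
X = np.array([[1., 2.], [2., 1.], [2., 2.], [3., 3.], [3., 4.], [4., 3.]])
y = np.array([1, 1, 1, 2, 2, 2])

classes = np.unique(y)
N, K = len(y), len(classes)

pi = {k: np.mean(y == k) for k in classes}         # prior estimates, 1/2 each
mu = {k: X[y == k].mean(axis=0) for k in classes}  # class means (5/3,5/3), (10/3,10/3)

# Pooled covariance: within-class scatter summed over classes, divided by N - K
Sigma = sum((X[y == k] - mu[k]).T @ (X[y == k] - mu[k]) for k in classes) / (N - K)
Sigma_inv = np.linalg.inv(Sigma)                   # approximately [[4, 2], [2, 4]]

def delta(x, k):
    """Linear discriminant: x^T Sigma^-1 mu_k - (1/2) mu_k^T Sigma^-1 mu_k + log pi_k."""
    return x @ Sigma_inv @ mu[k] - 0.5 * mu[k] @ Sigma_inv @ mu[k] + np.log(pi[k])

x_new = np.array([4., 5.])
scores = {k: delta(x_new, k) for k in classes}
print(max(scores, key=scores.get))                 # 2, i.e. class k2, as derived above
```

Any point on the decision line X_2 = −X_1 + 5, such as (2, 3), scores equally under both discriminants, which is a quick way to confirm the boundary.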
2. Here is a graph of the data points:

The number of features p is 2, the number of classes K is 3, the total number of data points N is 6, and the number of data points in each class is N_1 = N_2 = N_3 = 2.

First, we will find estimates for π_1, π_2, π_3, the prior probabilities that Y = k_1, Y = k_2, Y = k_3, respectively. Then, we will find estimates for μ_1, μ_2, μ_3, the class-specific mean vectors. We can then calculate the estimate for the covariance matrix Σ. Finally, using the estimates π̂_1, π̂_2, π̂_3, μ̂_1, μ̂_2, μ̂_3, Σ̂, we can find the estimates for the linear discriminant functions δ̂_1(x), δ̂_2(x), and δ̂_3(x).

π̂_1 = N_1/N = 2/6 = 1/3
π̂_2 = N_2/N = 2/6 = 1/3
π̂_3 = N_3/N = 2/6 = 1/3
μ̂_1 = (1/N_1) Σ_{i: y_i = k_1} x_i = (1/2)[x_1 + x_2] = (1/2, 1/2)
μ̂_2 = (1/N_2) Σ_{i: y_i = k_2} x_i = (1/2)[x_3 + x_4] = (2, 7/2)
μ̂_3 = (1/N_3) Σ_{i: y_i = k_3} x_i = (1/2)[x_5 + x_6] = (7/2, 2)

Σ̂ = (1/(N − K)) Σ_{k=1}^{K} Σ_{i: y_i = k} (x_i − μ̂_k)(x_i − μ̂_k)^T = (1/3)[1, 1/2; 1/2, 1] = [1/3, 1/6; 1/6, 1/3]

Σ̂⁻¹ = [4, −2; −2, 4]

δ̂_1(x) = x^T Σ̂⁻¹ μ̂_1 − (1/2) μ̂_1^T Σ̂⁻¹ μ̂_1 + log π̂_1 = x^T (1, 1)^T − (1/2)(1) + log(1/3) = X_1 + X_2 − 1/2 + log(1/3)

δ̂_2(x) = x^T Σ̂⁻¹ μ̂_2 − (1/2) μ̂_2^T Σ̂⁻¹ μ̂_2 + log π̂_2 = x^T (1, 10)^T − (1/2)(37) + log(1/3) = X_1 + 10X_2 − 37/2 + log(1/3)

δ̂_3(x) = x^T Σ̂⁻¹ μ̂_3 − (1/2) μ̂_3^T Σ̂⁻¹ μ̂_3 + log π̂_3 = x^T (10, 1)^T − (1/2)(37) + log(1/3) = 10X_1 + X_2 − 37/2 + log(1/3)

Setting δ̂_1(x) = δ̂_2(x):

X_1 + X_2 − 1/2 + log(1/3) = X_1 + 10X_2 − 37/2 + log(1/3)
18 = 9X_2
2 = X_2

So, the line that decides between classes k_1 and k_2 is given by X_2 = 2.
Setting δ̂_1(x) = δ̂_3(x):

X_1 + X_2 − 1/2 + log(1/3) = 10X_1 + X_2 − 37/2 + log(1/3)
18 = 9X_1
2 = X_1

So, the line that decides between classes k_1 and k_3 is given by X_1 = 2.

Setting δ̂_2(x) = δ̂_3(x):

X_1 + 10X_2 − 37/2 + log(1/3) = 10X_1 + X_2 − 37/2 + log(1/3)
9X_2 = 9X_1
X_2 = X_1

So, the line that decides between classes k_2 and k_3 is given by X_2 = X_1. Here is a graph of the decision lines:

The lines divide the plane into 3 regions. If x is in region I, then we classify x as of class k_1. Similarly, points in region II get classified as of k_2, and points in region III get classified as of k_3. The point (3, 0) is in region III; so we classify it as of class k_3.
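As with Problem 1, the three-class solution admits a numeric check. This sketch (again a NumPy addition of mine, not from the book) recomputes the three discriminants, confirms that all three agree at the triple point (2, 2) where the lines X_1 = 2, X_2 = 2, and X_2 = X_1 intersect, and classifies (3, 0):

```python
import numpy as np

# Data from Problem 2: two points per class (k1, k2, k3)
X = np.array([[0., 0.], [1., 1.], [2., 3.], [2., 4.], [3., 2.], [4., 2.]])
y = np.array([1, 1, 2, 2, 3, 3])

classes = np.unique(y)
N, K = len(y), len(classes)
pi = {k: np.mean(y == k) for k in classes}         # priors, 1/3 each
mu = {k: X[y == k].mean(axis=0) for k in classes}  # (1/2,1/2), (2,7/2), (7/2,2)

# Pooled covariance estimate with divisor N - K = 3
Sigma = sum((X[y == k] - mu[k]).T @ (X[y == k] - mu[k]) for k in classes) / (N - K)
Sigma_inv = np.linalg.inv(Sigma)                   # approximately [[4, -2], [-2, 4]]

def delta(x, k):
    """Linear discriminant function for class k."""
    return x @ Sigma_inv @ mu[k] - 0.5 * mu[k] @ Sigma_inv @ mu[k] + np.log(pi[k])

# All three discriminants take the same value at the triple point (2, 2)
triple = np.array([2., 2.])
print([delta(triple, k) for k in classes])

scores = {k: delta(np.array([3., 0.]), k) for k in classes}
print(max(scores, key=scores.get))                 # 3, i.e. class k3 (region III)
```

Evaluating the discriminants directly like this is a useful sanity check on the region diagram: whichever δ̂_k is largest at a point names the region the point falls in.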