
Math for Machine Learning Open Doors to Data Science and Artificial Intelligence Richard Han

Copyright 2015 Richard Han. All rights reserved.

CONTENTS

PREFACE
1 INTRODUCTION
2 LINEAR REGRESSION
   LINEAR REGRESSION
   THE LEAST SQUARES METHOD
   LINEAR ALGEBRA SOLUTION TO LEAST SQUARES PROBLEM
   EXAMPLE: LINEAR REGRESSION
   SUMMARY: LINEAR REGRESSION
   PROBLEM SET: LINEAR REGRESSION
   SOLUTION SET: LINEAR REGRESSION
3 LINEAR DISCRIMINANT ANALYSIS
   CLASSIFICATION
   LINEAR DISCRIMINANT ANALYSIS
   THE POSTERIOR PROBABILITY FUNCTIONS
   MODELLING THE POSTERIOR PROBABILITY FUNCTIONS
   LINEAR DISCRIMINANT FUNCTIONS
   ESTIMATING THE LINEAR DISCRIMINANT FUNCTIONS
   CLASSIFYING DATA POINTS USING LINEAR DISCRIMINANT FUNCTIONS
   LDA EXAMPLE 1
   LDA EXAMPLE 2
   SUMMARY: LINEAR DISCRIMINANT ANALYSIS
   PROBLEM SET: LINEAR DISCRIMINANT ANALYSIS
   SOLUTION SET: LINEAR DISCRIMINANT ANALYSIS
4 LOGISTIC REGRESSION
   LOGISTIC REGRESSION
   LOGISTIC REGRESSION MODEL OF THE POSTERIOR PROBABILITY FUNCTION
   ESTIMATING THE POSTERIOR PROBABILITY FUNCTION
   THE MULTIVARIATE NEWTON-RAPHSON METHOD
   MAXIMIZING THE LOG-LIKELIHOOD FUNCTION
   EXAMPLE: LOGISTIC REGRESSION
   SUMMARY: LOGISTIC REGRESSION
   PROBLEM SET: LOGISTIC REGRESSION
   SOLUTION SET: LOGISTIC REGRESSION
5 ARTIFICIAL NEURAL NETWORKS
   ARTIFICIAL NEURAL NETWORKS
   NEURAL NETWORK MODEL OF THE OUTPUT FUNCTIONS
   FORWARD PROPAGATION
   CHOOSING ACTIVATION FUNCTIONS
   ESTIMATING THE OUTPUT FUNCTIONS
   ERROR FUNCTION FOR REGRESSION
   ERROR FUNCTION FOR BINARY CLASSIFICATION
   ERROR FUNCTION FOR MULTI-CLASS CLASSIFICATION
   MINIMIZING THE ERROR FUNCTION USING GRADIENT DESCENT
   BACKPROPAGATION EQUATIONS
   SUMMARY OF BACKPROPAGATION
   SUMMARY: ARTIFICIAL NEURAL NETWORKS
   PROBLEM SET: ARTIFICIAL NEURAL NETWORKS
   SOLUTION SET: ARTIFICIAL NEURAL NETWORKS
6 MAXIMAL MARGIN CLASSIFIER
   MAXIMAL MARGIN CLASSIFIER
   DEFINITIONS OF SEPARATING HYPERPLANE AND MARGIN
   MAXIMIZING THE MARGIN
   DEFINITION OF MAXIMAL MARGIN CLASSIFIER
   REFORMULATING THE OPTIMIZATION PROBLEM
   SOLVING THE CONVEX OPTIMIZATION PROBLEM
   KKT CONDITIONS
   PRIMAL AND DUAL PROBLEMS
   SOLVING THE DUAL PROBLEM
   THE COEFFICIENTS FOR THE MAXIMAL MARGIN HYPERPLANE
   THE SUPPORT VECTORS
   CLASSIFYING TEST POINTS
   MAXIMAL MARGIN CLASSIFIER EXAMPLE 1
   MAXIMAL MARGIN CLASSIFIER EXAMPLE 2
   SUMMARY: MAXIMAL MARGIN CLASSIFIER
   PROBLEM SET: MAXIMAL MARGIN CLASSIFIER
   SOLUTION SET: MAXIMAL MARGIN CLASSIFIER
7 SUPPORT VECTOR CLASSIFIER
   SUPPORT VECTOR CLASSIFIER
   SLACK VARIABLES: POINTS ON CORRECT SIDE OF HYPERPLANE
   SLACK VARIABLES: POINTS ON WRONG SIDE OF HYPERPLANE
   FORMULATING THE OPTIMIZATION PROBLEM
   DEFINITION OF SUPPORT VECTOR CLASSIFIER
   A CONVEX OPTIMIZATION PROBLEM
   SOLVING THE CONVEX OPTIMIZATION PROBLEM (SOFT MARGIN)
   THE COEFFICIENTS FOR THE SOFT MARGIN HYPERPLANE
   THE SUPPORT VECTORS (SOFT MARGIN)
   CLASSIFYING TEST POINTS (SOFT MARGIN)
   SUPPORT VECTOR CLASSIFIER EXAMPLE 1
   SUPPORT VECTOR CLASSIFIER EXAMPLE 2
   SUMMARY: SUPPORT VECTOR CLASSIFIER
   PROBLEM SET: SUPPORT VECTOR CLASSIFIER
   SOLUTION SET: SUPPORT VECTOR CLASSIFIER
8 SUPPORT VECTOR MACHINE CLASSIFIER
   SUPPORT VECTOR MACHINE CLASSIFIER
   ENLARGING THE FEATURE SPACE
   THE KERNEL TRICK
   SUPPORT VECTOR MACHINE CLASSIFIER EXAMPLE 1
   SUPPORT VECTOR MACHINE CLASSIFIER EXAMPLE 2
   SUMMARY: SUPPORT VECTOR MACHINE CLASSIFIER
   PROBLEM SET: SUPPORT VECTOR MACHINE CLASSIFIER
   SOLUTION SET: SUPPORT VECTOR MACHINE CLASSIFIER
CONCLUSION
APPENDIX 1
APPENDIX 2
APPENDIX 3
APPENDIX 4
APPENDIX 5
INDEX

PREFACE Welcome to Math for Machine Learning: Open Doors to Data Science and Artificial Intelligence. This is a first textbook in math for machine learning. Be sure to get the companion online course Math for Machine Learning here: Math for Machine Learning Online Course. The online course can be very helpful in conjunction with this book. The prerequisites for this book and the online course are Linear Algebra, Multivariable Calculus, and Probability. You can find my online course on Linear Algebra here: Linear Algebra Course. We will not do any programming in this book. This book will get you started in machine learning in a smooth and natural way, preparing you for more advanced topics and dispelling the belief that machine learning is complicated, difficult, and intimidating. I want you to succeed and prosper in your career, life, and future endeavors. I am here for you. Visit me at: Online Math Training

1 INTRODUCTION

Welcome to Math for Machine Learning: Open Doors to Data Science and Artificial Intelligence! My name is Richard Han. This is a first textbook in math for machine learning.

Ideal student: If you're a working professional needing a refresher on machine learning, or a complete beginner who needs to learn machine learning for the first time, this book is for you. If your busy schedule doesn't allow you to go back to a traditional school, this book allows you to study on your own schedule and further your career goals without being left behind. If you plan on taking machine learning in college, this is a great way to get ahead. If you're currently struggling with machine learning or have struggled with it in the past, now is the time to master it.

Benefits of studying this book: After reading this book, you will have refreshed your knowledge of machine learning for your career so that you can earn a higher salary. You will have a required prerequisite for lucrative career fields such as Data Science and Artificial Intelligence. You will be in a better position to pursue a master's or PhD degree in machine learning and data science.

Why Machine Learning is important: Famous uses of machine learning include:

o Linear discriminant analysis. Linear discriminant analysis can be used to solve classification problems such as spam filtering and classifying patient illnesses.

o Logistic regression. Logistic regression can be used to solve binary classification problems such as determining whether a patient has a certain form of cancer or not.

o Artificial neural networks. Artificial neural networks can be used for applications such as self-driving cars, recommender systems, online marketing, reading medical images, and speech and face recognition.

o Support vector machines. Real-world applications of SVMs include classification of proteins and classification of images.

What my book offers: In this book, I cover core topics such as:

Linear Regression
Linear Discriminant Analysis
Logistic Regression
Artificial Neural Networks
Support Vector Machines

I explain each definition and go through each example step by step so that you understand each topic clearly. Throughout the book, there are practice problems for you to try. Detailed solutions are provided after each problem set.

I hope you benefit from the book.

Best regards,

Richard Han

2 LINEAR REGRESSION

LINEAR REGRESSION

Suppose we have a set of data (x_1, y_1), …, (x_N, y_N). This is called the training data. Each x_i is a vector [x_i1; x_i2; …; x_ip] of measurements, where x_i1 is an instance of the first input variable X_1, x_i2 is an instance of the second input variable X_2, etc. X_1, …, X_p are called features or predictors. y_1, …, y_N are instances of the output variable Y, which is called the response.

In linear regression, we assume that the response depends on the input variables in a linear fashion:

y = f(X) + ε, where f(X) = β_0 + β_1 X_1 + … + β_p X_p.

Here, ε is called the error term and β_0, …, β_p are called parameters. We don't know the values of β_0, …, β_p. However, we can use the training data to approximate the values of β_0, …, β_p.

What we'll do is look at the amount by which the predicted value f(x_i) differs from the actual y_i for each of the pairs (x_1, y_1), …, (x_N, y_N) from the training data. So we have y_i − f(x_i) as the difference. We then square this and take the sum for i = 1, …, N:

Σ_{i=1}^{N} (y_i − f(x_i))²

This is called the residual sum of squares and denoted RSS(β), where β = [β_0; β_1; …; β_p].

We want the residual sum of squares to be as small as possible. Essentially, this means that we want our predicted value f(x_i) to be as close to the actual value y_i as possible, for each of the pairs (x_i, y_i). Doing this will give us a linear function of the input variables that best fits the given training data.

In the case of only one input variable, we get the best fit line. In the case of two input variables, we get the best fit plane. And so on, for higher dimensions.

THE LEAST SQUARES METHOD

By minimizing RSS(β), we can obtain estimates β̂_0, β̂_1, …, β̂_p of the parameters β_0, …, β_p. This method is called the least squares method.

Let

X = [1 x_11 x_12 … x_1p; 1 x_21 x_22 … x_2p; … ; 1 x_N1 x_N2 … x_Np]  and  y = [y_1; y_2; …; y_N].

Then

y − Xβ = [y_1; …; y_N] − [β_0 + β_1 x_11 + … + β_p x_1p; …; β_0 + β_1 x_N1 + … + β_p x_Np]
       = [y_1; …; y_N] − [f(x_1); …; f(x_N)]
       = [y_1 − f(x_1); …; y_N − f(x_N)].

So (y − Xβ)^T (y − Xβ) = Σ_{i=1}^{N} (y_i − f(x_i))² = RSS(β), i.e.

RSS(β) = (y − Xβ)^T (y − Xβ).

Consider the vector of partial derivatives of RSS(β):

[∂RSS(β)/∂β_0; ∂RSS(β)/∂β_1; …; ∂RSS(β)/∂β_p]

RSS(β) = (y_1 − (β_0 + β_1 x_11 + … + β_p x_1p))² + … + (y_N − (β_0 + β_1 x_N1 + … + β_p x_Np))²

Let's take the partial derivative with respect to β_0.

∂RSS(β)/∂β_0 = 2(y_1 − (β_0 + β_1 x_11 + … + β_p x_1p))(−1) + … + 2(y_N − (β_0 + β_1 x_N1 + … + β_p x_Np))(−1)
             = −2 [1 1 … 1](y − Xβ)

Next, take the partial derivative with respect to β_1.

∂RSS(β)/∂β_1 = 2(y_1 − (β_0 + β_1 x_11 + … + β_p x_1p))(−x_11) + … + 2(y_N − (β_0 + β_1 x_N1 + … + β_p x_Np))(−x_N1)
             = −2 [x_11 x_21 … x_N1](y − Xβ)

In general,

∂RSS(β)/∂β_k = −2 [x_1k x_2k … x_Nk](y − Xβ)

So,

[∂RSS(β)/∂β_0; ∂RSS(β)/∂β_1; …; ∂RSS(β)/∂β_p] = [−2 [1 … 1](y − Xβ); −2 [x_11 … x_N1](y − Xβ); …; −2 [x_1p … x_Np](y − Xβ)]
                                                = −2 X^T (y − Xβ)
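Although we do not do any programming in this book, the gradient formula above is easy to check numerically. Here is a short Python/numpy sketch (the data and variable names below are made up purely for illustration) that compares −2 Xᵀ(y − Xβ) with a finite-difference approximation of the partial derivatives:

```python
import numpy as np

# Minimal sketch: check dRSS/dbeta = -2 X^T (y - X beta) against finite differences.
rng = np.random.default_rng(0)
N, p = 6, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # first column of 1s
y = rng.normal(size=N)
beta = rng.normal(size=p + 1)

def rss(b):
    r = y - X @ b
    return r @ r

grad_formula = -2 * X.T @ (y - X @ beta)

eps = 1e-6
grad_numeric = np.array([
    (rss(beta + eps * e) - rss(beta - eps * e)) / (2 * eps)
    for e in np.eye(p + 1)
])

print(np.allclose(grad_formula, grad_numeric, atol=1e-4))  # True
```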

If we take the second derivative of RSS(β), say ∂²RSS(β)/(∂β_k ∂β_j), we get

∂/∂β_j [2(y_1 − (β_0 + β_1 x_11 + … + β_p x_1p))(−x_1k) + … + 2(y_N − (β_0 + β_1 x_N1 + … + β_p x_Np))(−x_Nk)]
  = 2 x_1j x_1k + … + 2 x_Nj x_Nk
  = 2 (x_1j x_1k + … + x_Nj x_Nk)

Note that X can be written as X = [x_10 x_11 … x_1p; x_20 x_21 … x_2p; … ; x_N0 x_N1 … x_Np], where x_i0 = 1 for each i. So

X^T X = [x_10 x_20 … x_N0; x_11 x_21 … x_N1; … ; x_1p x_2p … x_Np][x_10 x_11 … x_1p; x_20 x_21 … x_2p; … ; x_N0 x_N1 … x_Np] = (a_jk), where a_jk = x_1j x_1k + … + x_Nj x_Nk.

So ∂²RSS(β)/(∂β_k ∂β_j) = 2 a_jk.

The matrix of second derivatives of RSS(β) is therefore 2 X^T X. This matrix is called the Hessian. By the second derivative test, if the Hessian of RSS(β) at a critical point is positive definite, then RSS(β) has a local minimum there.

If we set our vector of derivatives to 0, we get

−2 X^T (y − Xβ) = 0
−2 X^T y + 2 X^T X β = 0
2 X^T X β = 2 X^T y
X^T X β = X^T y
β = (X^T X)^{−1} X^T y.

Thus, we have solved for the vector of parameters [β_0; β_1; …; β_p] which minimizes the residual sum of squares RSS(β). So we let β̂ = [β̂_0; β̂_1; …; β̂_p] = (X^T X)^{−1} X^T y.
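As a quick illustration, the normal-equations solution β̂ = (XᵀX)⁻¹Xᵀy can be computed in a few lines of Python with numpy. The function name and the small data set below are made up for the example; in practice one solves the linear system rather than forming the inverse explicitly.

```python
import numpy as np

def fit_least_squares(X_raw, y):
    """Sketch of the normal-equations solution beta_hat = (X^T X)^(-1) X^T y.

    X_raw holds the raw feature vectors x_i as rows; a column of 1s is
    prepended so that beta_hat[0] is the intercept beta_0.
    """
    X = np.column_stack([np.ones(len(X_raw)), X_raw])
    # Solving X^T X beta = X^T y is numerically preferable to forming
    # the inverse explicitly, and gives the same result.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Example usage with made-up data points.
X_raw = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 2.0, 2.0, 4.0])
beta_hat = fit_least_squares(X_raw, y)
print(beta_hat)  # [beta_0, beta_1]
```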

LINEAR ALGEBRA SOLUTION TO LEAST SQUARES PROBLEM

We can arrive at the same solution for the least squares problem by using linear algebra. Let X = [1 x_11 … x_1p; … ; 1 x_N1 … x_Np] and y = [y_1; …; y_N] as before, from our training data. We want a vector β̂ such that Xβ̂ is close to y. In other words, we want a vector β̂ such that the distance ‖Xβ̂ − y‖ between Xβ̂ and y is minimized. A vector β̂ that minimizes ‖Xβ̂ − y‖ is called a least-squares solution of Xβ = y.

X is an N by (p + 1) matrix. We want a β̂ in R^{p+1} such that Xβ̂ is closest to y. Note that Xβ̂ is a linear combination of the columns of X. So Xβ̂ lies in the span of the columns of X, which is a subspace of R^N denoted Col X. So we want the vector in Col X that is closest to y. The projection of y onto the subspace Col X is that vector:

proj_{Col X} y = Xβ̂ for some β̂ in R^{p+1}.

Consider y − Xβ̂. Note that y = Xβ̂ + (y − Xβ̂). R^N can be broken into two subspaces Col X and (Col X)^⊥, where (Col X)^⊥ is the subspace of R^N consisting of all vectors that are orthogonal to the vectors in Col X. Any vector in R^N can be written uniquely as z + w where z is in Col X and w is in (Col X)^⊥. Since y is in R^N, and y = Xβ̂ + (y − Xβ̂), with Xβ̂ in Col X, the second vector y − Xβ̂ must lie in (Col X)^⊥. So y − Xβ̂ is orthogonal to the columns of X. Hence

X^T (y − Xβ̂) = 0
X^T y − X^T X β̂ = 0
X^T X β̂ = X^T y.

Thus, it turns out that the set of least-squares solutions of Xβ = y consists of all and only the solutions to the matrix equation X^T X β = X^T y. If X^T X is positive definite, then the eigenvalues of X^T X are all positive. So 0 is not an eigenvalue of X^T X. It follows that X^T X is invertible. Then, we can solve the equation X^T X β̂ = X^T y for β̂ to get β̂ = (X^T X)^{−1} X^T y, which is the same result we got earlier using multi-variable calculus.
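The same least-squares solution can also be obtained from numpy's built-in solver, which minimizes ‖Xβ − y‖ directly. The small data set below is illustrative only; the point is simply that the two routes agree.

```python
import numpy as np

# Sketch: compare the normal-equations solution with numpy's least-squares solver.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])   # first column of 1s for the intercept
y = np.array([1.0, 2.0, 2.0, 4.0])

beta_normal_eq = np.linalg.solve(X.T @ X, X.T @ y)
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_normal_eq, beta_lstsq))  # True
```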

EXAMPLE: LINEAR REGRESSION

Suppose we have the following training data: (x_1, y_1) = (1, 1), (x_2, y_2) = (2, 4), (x_3, y_3) = (3, 4). Find the best fit line using the least squares method. Find the predicted value for x = 4.

Solution: Form X = [1 1; 1 2; 1 3] and y = [1; 4; 4].

The coefficients β̂_0, β̂_1 for the best fit line f(x) = β̂_0 + β̂_1 x are given by [β̂_0; β̂_1] = (X^T X)^{−1} X^T y.

X^T = [1 1 1; 1 2 3]

X^T X = [1 1 1; 1 2 3][1 1; 1 2; 1 3] = [3 6; 6 14]

(X^T X)^{−1} = [7/3 −1; −1 1/2]

(X^T X)^{−1} X^T y = [7/3 −1; −1 1/2][1 1 1; 1 2 3][1; 4; 4] = [0; 3/2]

So β̂_0 = 0 and β̂_1 = 3/2. Thus, the best fit line is given by f(x) = (3/2) x. The predicted value for x = 4 is f(4) = (3/2) · 4 = 6.
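For readers who want to verify the arithmetic, here is a short numpy check of this example; it simply re-does the computation above.

```python
import numpy as np

# Quick numerical check of the worked example.
X = np.array([[1, 1], [1, 2], [1, 3]], dtype=float)
y = np.array([1, 4, 4], dtype=float)

beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(beta_hat)                     # approximately [0.0, 1.5]
print(beta_hat @ np.array([1, 4]))  # prediction at x = 4 -> 6.0
```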

SUMMARY: LINEAR REGRESSION

In the least squares method, we seek a linear function of the input variables that best fits the given training data. We do this by minimizing the residual sum of squares.

To minimize the residual sum of squares, we apply the second derivative test from multi-variable calculus.

We can arrive at the same solution to the least squares problem using linear algebra.

PROBLEM SET: LINEAR REGRESSION

1. Suppose we have the following training data: (x_1, y_1) = (0, 2), (x_2, y_2) = (1, 1), (x_3, y_3) = (2, 4), (x_4, y_4) = (3, 4). Find the best fit line using the least squares method. Find the predicted value for x = 4.

2. Suppose we have the following training data: (x_1, y_1), (x_2, y_2), (x_3, y_3), (x_4, y_4) where x_1 = [0; 0], x_2 = [1; 0], x_3 = [0; 1], x_4 = [1; 1] and y_1 = 1, y_2 = 0, y_3 = 0, y_4 = 2. Find the best fit plane using the least squares method. Find the predicted value for x = [2; 2].

SOLUTION SET: LINEAR REGRESSION

1. Form X = [1 0; 1 1; 1 2; 1 3] and y = [2; 1; 4; 4].

The coefficients β̂_0, β̂_1 for the best fit line f(x) = β̂_0 + β̂_1 x are given by [β̂_0; β̂_1] = (X^T X)^{−1} X^T y.

X^T = [1 1 1 1; 0 1 2 3]

X^T X = [1 1 1 1; 0 1 2 3][1 0; 1 1; 1 2; 1 3] = [4 6; 6 14]

(X^T X)^{−1} = [7/10 −3/10; −3/10 1/5]

(X^T X)^{−1} X^T y = [7/10 −3/10; −3/10 1/5][1 1 1 1; 0 1 2 3][2; 1; 4; 4] = [14/10; 9/10]

So β̂_0 = 14/10 and β̂_1 = 9/10. Thus, the best fit line is given by f(x) = 14/10 + (9/10) x. The predicted value for x = 4 is f(4) = 14/10 + (9/10) · 4 = 5.

2. Form X = [1 0 0; 1 1 0; 1 0 1; 1 1 1] and y = [1; 0; 0; 2].

The coefficients β̂_0, β̂_1, β̂_2 for the best fit plane f(x_1, x_2) = β̂_0 + β̂_1 x_1 + β̂_2 x_2 are given by [β̂_0; β̂_1; β̂_2] = (X^T X)^{−1} X^T y.

X^T = [1 1 1 1; 0 1 0 1; 0 0 1 1]

X^T X = [1 1 1 1; 0 1 0 1; 0 0 1 1][1 0 0; 1 1 0; 1 0 1; 1 1 1] = [4 2 2; 2 2 1; 2 1 2]

(X^T X)^{−1} = [3/4 −1/2 −1/2; −1/2 1 0; −1/2 0 1]

(X^T X)^{−1} X^T y = [3/4 −1/2 −1/2; −1/2 1 0; −1/2 0 1][1 1 1 1; 0 1 0 1; 0 0 1 1][1; 0; 0; 2]
                   = [3/4 −1/2 −1/2; −1/2 1 0; −1/2 0 1][3; 2; 2]
                   = [1/4; 1/2; 1/2]

So β̂_0 = 1/4, β̂_1 = 1/2, β̂_2 = 1/2. Thus, the best fit plane is given by f(x_1, x_2) = 1/4 + (1/2) x_1 + (1/2) x_2. The predicted value for x = [2; 2] is f(2, 2) = 1/4 + 1 + 1 = 9/4.
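Here is a short numpy sketch re-checking both problem-set answers numerically, using the data given in the problems above.

```python
import numpy as np

# Problem 1: best fit line through (0,2), (1,1), (2,4), (3,4).
X1 = np.array([[1, 0], [1, 1], [1, 2], [1, 3]], dtype=float)
y1 = np.array([2, 1, 4, 4], dtype=float)
b1 = np.linalg.solve(X1.T @ X1, X1.T @ y1)
print(b1, b1 @ np.array([1, 4]))       # approx [1.4, 0.9], prediction 5.0

# Problem 2: best fit plane through the four points in the unit square.
X2 = np.array([[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1]], dtype=float)
y2 = np.array([1, 0, 0, 2], dtype=float)
b2 = np.linalg.solve(X2.T @ X2, X2.T @ y2)
print(b2, b2 @ np.array([1, 2, 2]))    # approx [0.25, 0.5, 0.5], prediction 2.25
```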

3 LINEAR DISCRIMINANT ANALYSIS

CLASSIFICATION

In the problem of regression, we had a set of data (x_1, y_1), …, (x_N, y_N) and we wanted to predict the values for the response variable Y for new data points. The values that Y took were numerical, quantitative values. In certain problems, the values for the response variable Y that we want to predict are not quantitative but qualitative. So the values for Y will take on values from a finite set of classes or categories. Problems of this sort are called classification problems. Some examples of a classification problem are classifying an email as spam or not spam and classifying a patient's illness as one among a finite number of diseases.

LINEAR DISCRIMINANT ANALYSIS

One method for solving a classification problem is called linear discriminant analysis. What we'll do is estimate Pr(Y = k | X = x), the probability that Y is the class k given that the input variable X is x. Once we have all of these probabilities for a fixed x, we pick the class k for which the probability Pr(Y = k | X = x) is largest. We then classify x as that class k.

THE POSTERIOR PROBABILITY FUNCTIONS

In this section, we'll build a formula for the posterior probability Pr(Y = k | X = x). Let π_k = Pr(Y = k), the prior probability that Y = k. Let f_k(x) = Pr(X = x | Y = k), the probability that X = x, given that Y = k. By Bayes' rule,

Pr(Y = k | X = x) = Pr(X = x | Y = k) Pr(Y = k) / Σ_{l=1}^{K} Pr(X = x | Y = l) Pr(Y = l)
                  = f_k(x) π_k / Σ_{l=1}^{K} f_l(x) π_l
                  = π_k f_k(x) / Σ_{l=1}^{K} π_l f_l(x)

Here we assume that k can take on the values 1, …, K.

We can think of Pr(Y = k | X = x) as a function of x and denote it as p_k(x). So

p_k(x) = π_k f_k(x) / Σ_{l=1}^{K} π_l f_l(x).

Recall that p_k(x) is the posterior probability that Y = k given that X = x.

MODELLING THE POSTERIOR PROBABILITY FUNCTIONS

Remember that we wanted to estimate Pr(Y = k | X = x) for any given x. That is, we want an estimate for p_k(x). If we can get estimates for π_k, f_k(x), π_l and f_l(x) for each l = 1, …, K, then we would have an estimate for p_k(x).

Let's say that X = (X_1, X_2, …, X_p) where X_1, …, X_p are the input variables. So the values of X will be vectors of p elements. We will assume that the conditional distribution of X given Y = k is the multivariate Gaussian distribution N(μ_k, Σ), where μ_k is a class-specific mean vector and Σ is the covariance of X.

The class-specific mean vector μ_k is given by the vector of class-specific means [μ_k1; …; μ_kp], where μ_kj is the class-specific mean of X_j. So μ_kj = Σ_{i: y_i = k} x_ij Pr(X_j = x_ij). Recall that x_i = [x_i1; …; x_ip]. (For all those x_i for which y_i = k, we're taking the mean of their jth components.)

Σ, the covariance matrix of X, is given by the matrix of covariances of X_i and X_j. So Σ = (a_ij), where a_ij = Cov(X_i, X_j) = E[(X_i − μ_{X_i})(X_j − μ_{X_j})].

The multivariate Gaussian density for the multivariate Gaussian distribution N(μ, Σ) is given by

f(x) = 1/((2π)^{p/2} |Σ|^{1/2}) exp(−(1/2)(x − μ)^T Σ^{−1} (x − μ)).

Since we're assuming that the conditional distribution of X given Y = k is the multivariate Gaussian distribution N(μ_k, Σ), we have that

Pr(X = x | Y = k) = 1/((2π)^{p/2} |Σ|^{1/2}) exp(−(1/2)(x − μ_k)^T Σ^{−1} (x − μ_k)).

Recall that f_k(x) = Pr(X = x | Y = k). So

f_k(x) = 1/((2π)^{p/2} |Σ|^{1/2}) exp(−(1/2)(x − μ_k)^T Σ^{−1} (x − μ_k)).

Recall that p_k(x) = π_k f_k(x) / Σ_{l=1}^{K} π_l f_l(x).

Plugging in what we have for f_k(x), we get

p_k(x) = π_k (2π)^{−p/2} |Σ|^{−1/2} exp(−(1/2)(x − μ_k)^T Σ^{−1} (x − μ_k)) / Σ_{l=1}^{K} π_l (2π)^{−p/2} |Σ|^{−1/2} exp(−(1/2)(x − μ_l)^T Σ^{−1} (x − μ_l))
       = π_k exp(−(1/2)(x − μ_k)^T Σ^{−1} (x − μ_k)) / Σ_{l=1}^{K} π_l exp(−(1/2)(x − μ_l)^T Σ^{−1} (x − μ_l)).

Note that the denominator is (2π)^{p/2} |Σ|^{1/2} Σ_{l=1}^{K} π_l f_l(x), and that

Σ_{l=1}^{K} π_l f_l(x) = Σ_{l=1}^{K} f_l(x) π_l = Σ_{l=1}^{K} Pr(X = x | Y = l) Pr(Y = l) = Pr(X = x).

So the denominator is just (2π)^{p/2} |Σ|^{1/2} Pr(X = x). Hence,

p_k(x) = π_k exp(−(1/2)(x − μ_k)^T Σ^{−1} (x − μ_k)) / ((2π)^{p/2} |Σ|^{1/2} Pr(X = x)).
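As an aside, the posterior probabilities p_k(x) are straightforward to compute numerically once π_k, μ_k, and Σ are known. The following Python sketch does exactly that; the priors, means, and covariance below are made-up illustrative values, not taken from any example in this book.

```python
import numpy as np

def gaussian_density(x, mu, sigma):
    """Multivariate Gaussian density f(x) for N(mu, sigma)."""
    p = len(mu)
    diff = x - mu
    norm_const = (2 * np.pi) ** (p / 2) * np.linalg.det(sigma) ** 0.5
    return np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff) / norm_const

def posteriors(x, priors, means, sigma):
    """p_k(x) = pi_k f_k(x) / sum_l pi_l f_l(x) for each class k."""
    weighted = np.array([pi * gaussian_density(x, mu, sigma)
                         for pi, mu in zip(priors, means)])
    return weighted / weighted.sum()

# Illustrative values (assumptions for the sketch only).
priors = [0.5, 0.5]
means = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
sigma = np.array([[1.0, 0.0], [0.0, 1.0]])
print(posteriors(np.array([1.5, 1.5]), priors, means, sigma))
```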

LINEAR DISCRIMINANT FUNCTIONS

Recall that we want to choose the class k for which the posterior probability p_k(x) is largest. Since the logarithm function is order-preserving, maximizing p_k(x) is the same as maximizing log p_k(x). Taking log p_k(x) gives

log [ π_k exp(−(1/2)(x − μ_k)^T Σ^{−1} (x − μ_k)) / ((2π)^{p/2} |Σ|^{1/2} Pr(X = x)) ]
  = log π_k + (−1/2)(x − μ_k)^T Σ^{−1} (x − μ_k) − log((2π)^{p/2} |Σ|^{1/2} Pr(X = x))
  = log π_k + (−1/2)(x − μ_k)^T Σ^{−1} (x − μ_k) − log C,   where C = (2π)^{p/2} |Σ|^{1/2} Pr(X = x)
  = log π_k − (1/2)(x^T Σ^{−1} − μ_k^T Σ^{−1})(x − μ_k) − log C
  = log π_k − (1/2)[x^T Σ^{−1} x − x^T Σ^{−1} μ_k − μ_k^T Σ^{−1} x + μ_k^T Σ^{−1} μ_k] − log C
  = log π_k − (1/2)[x^T Σ^{−1} x − 2 x^T Σ^{−1} μ_k + μ_k^T Σ^{−1} μ_k] − log C, because x^T Σ^{−1} μ_k = μ_k^T Σ^{−1} x.

(Proof: x^T Σ^{−1} μ_k is a scalar, so it equals its own transpose: x^T Σ^{−1} μ_k = (x^T Σ^{−1} μ_k)^T = μ_k^T (Σ^{−1})^T x = μ_k^T (Σ^T)^{−1} x = μ_k^T Σ^{−1} x, because Σ is symmetric.)

  = log π_k − (1/2) x^T Σ^{−1} x + x^T Σ^{−1} μ_k − (1/2) μ_k^T Σ^{−1} μ_k − log C
  = x^T Σ^{−1} μ_k − (1/2) μ_k^T Σ^{−1} μ_k + log π_k − (1/2) x^T Σ^{−1} x − log C

Let δ_k(x) = x^T Σ^{−1} μ_k − (1/2) μ_k^T Σ^{−1} μ_k + log π_k. Then

log p_k(x) = δ_k(x) − (1/2) x^T Σ^{−1} x − log C.

δ_k(x) is called a linear discriminant function. Maximizing log p_k(x) is the same as maximizing δ_k(x), since −(1/2) x^T Σ^{−1} x − log C does not depend on k.

ESTIMATING THE LINEAR DISCRIMINANT FUNCTIONS

Now, if we can find estimates for π_k, μ_k, and Σ, then we would have an estimate for p_k(x) and hence for log p_k(x) and δ_k(x). In an attempt to maximize p_k(x), we instead maximize the estimate of p_k(x), which is the same as

maximizing the estimate of δ_k(x).

π_k can be estimated as π̂_k = N_k / N, where N_k is the number of training data points in class k and N is the total number of training data points. Remember π_k = Pr(Y = k). We're estimating this by just taking the proportion of data points in class k.

The class-specific mean vector is μ_k = [μ_k1; …; μ_kp], where μ_kj = Σ_{i: y_i = k} x_ij Pr(X_j = x_ij). We can estimate μ_kj as (1/N_k) Σ_{i: y_i = k} x_ij. So we can estimate μ_k as

μ̂_k = [(1/N_k) Σ_{i: y_i = k} x_i1; …; (1/N_k) Σ_{i: y_i = k} x_ip] = (1/N_k) Σ_{i: y_i = k} x_i.

In other words, μ̂_k = (1/N_k) Σ_{i: y_i = k} x_i. We estimate the class-specific mean vector by the vector of averages of each component over all x_i in class k.

Finally, the covariance matrix Σ is estimated as

Σ̂ = 1/(N − K) Σ_{k=1}^{K} Σ_{i: y_i = k} (x_i − μ̂_k)(x_i − μ̂_k)^T.

Recall that δ_k(x) = x^T Σ^{−1} μ_k − (1/2) μ_k^T Σ^{−1} μ_k + log π_k. So,

δ̂_k(x) = x^T Σ̂^{−1} μ̂_k − (1/2) μ̂_k^T Σ̂^{−1} μ̂_k + log π̂_k.

Note that Σ̂, μ̂_k, and π̂_k only depend on the training data and not on x. Note that x is a vector and x^T Σ̂^{−1} μ̂_k is a linear combination of the components of x. Hence, δ̂_k(x) is a linear combination of the components of x. This is why it's called a linear discriminant function.
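To make these estimation formulas concrete, here is a short Python sketch that computes π̂_k, μ̂_k, Σ̂, and the estimated discriminant functions δ̂_k(x). The function names and the small training set are illustrative assumptions, not part of this book's examples.

```python
import numpy as np

def fit_lda(X, y):
    """Estimate the LDA quantities above: priors, class means, pooled covariance."""
    classes = np.unique(y)
    N, K = len(y), len(classes)
    priors = np.array([np.mean(y == k) for k in classes])
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    sigma = sum((X[y == k] - m).T @ (X[y == k] - m)
                for k, m in zip(classes, means)) / (N - K)
    return classes, priors, means, sigma

def discriminants(x, priors, means, sigma):
    """Evaluate each estimated linear discriminant function at the point x."""
    sigma_inv = np.linalg.inv(sigma)
    return np.array([x @ sigma_inv @ m - 0.5 * m @ sigma_inv @ m + np.log(pi)
                     for pi, m in zip(priors, means)])

# Illustrative usage on a small made-up training set.
X = np.array([[0.5, 1.0], [1.0, 1.5], [1.5, 1.0],
              [3.0, 3.5], [3.5, 3.0], [4.0, 4.0]])
y = np.array([1, 1, 1, 2, 2, 2])
classes, priors, means, sigma = fit_lda(X, y)
d = discriminants(np.array([2.0, 2.0]), priors, means, sigma)
print(classes[np.argmax(d)])   # the class with the largest discriminant wins
```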

CLASSIFYING DATA POINTS USING LINEAR DISCRIMINANT FUNCTIONS

If (k_1, k_2) is a pair of classes, we can consider whether δ̂_{k_1}(x) > δ̂_{k_2}(x). If so, we know x is not in class k_2. Then, we can compare δ̂_{k_1}(x) with δ̂_{k_3}(x) and rule out another class. Once we've exhausted all the classes, we'll know which class x should be assigned to.

Setting δ̂_{k_1}(x) = δ̂_{k_2}(x), we get

x^T Σ̂^{−1} μ̂_{k_1} − (1/2) μ̂_{k_1}^T Σ̂^{−1} μ̂_{k_1} + log π̂_{k_1} = x^T Σ̂^{−1} μ̂_{k_2} − (1/2) μ̂_{k_2}^T Σ̂^{−1} μ̂_{k_2} + log π̂_{k_2}.

This gives us a hyperplane in R^p which separates class k_1 from class k_2. If we find the separating hyperplane for each pair of classes, we get something like this:

In this example, p = 2 and K = 3.

LDA EXAMPLE 1

Suppose we have a set of data (x_1, y_1), …, (x_6, y_6) as follows: x_1 = (1, 3), x_2 = (2, 3), x_3 = (2, 4), x_4 = (3, 1), x_5 = (3, 2), x_6 = (4, 2), with y_1 = y_2 = y_3 = k_1 = 1 and y_4 = y_5 = y_6 = k_2 = 2. Apply linear discriminant analysis by doing the following:

a) Find estimates for the linear discriminant functions δ_1(x) and δ_2(x).
b) Find the line that decides between the two classes.
c) Classify the new point x = (5, 0).

Solution: Here is a graph of the data points:

The number of features p is 2, the number of classes K is 2, the total number of data points N is 6, the number N_1 of data points in class k_1 is 3, and the number N_2 of data points in class k_2 is 3.

First, we will find estimates for π_1 and π_2, the prior probabilities that Y = k_1 and Y = k_2, respectively. Then, we will find estimates for μ_1 and μ_2, the class-specific mean vectors. We can then calculate the estimate for the covariance matrix Σ. Finally, using the estimates π̂_1, π̂_2, μ̂_1, μ̂_2, Σ̂, we can find the estimates for the linear discriminant functions δ̂_1(x) and δ̂_2(x).

π̂_1 = N_1/N = 3/6 = 1/2
π̂_2 = N_2/N = 3/6 = 1/2
μ̂_1 = (1/N_1) Σ_{i: y_i = 1} x_i = (1/3)[x_1 + x_2 + x_3] = [5/3; 10/3]
μ̂_2 = (1/N_2) Σ_{i: y_i = 2} x_i = (1/3)[x_4 + x_5 + x_6] = [10/3; 5/3]
Σ̂ = 1/(N − K) Σ_{k=1}^{K} Σ_{i: y_i = k} (x_i − μ̂_k)(x_i − μ̂_k)^T

= 1/(6 − 2) Σ_{k=1}^{2} Σ_{i: y_i = k} (x_i − μ̂_k)(x_i − μ̂_k)^T

Plugging in what we got for μ̂_1 and μ̂_2, we get

Σ̂ = (1/4)[4/3 2/3; 2/3 4/3] = [1/3 1/6; 1/6 1/3]

Σ̂^{−1} = [4 −2; −2 4]

δ̂_1(x) = x^T Σ̂^{−1} μ̂_1 − (1/2) μ̂_1^T Σ̂^{−1} μ̂_1 + log π̂_1
        = x^T [0; 10] − (1/2)(100/3) + log(1/2)
        = 10 X_2 − 50/3 + log(1/2)

δ̂_2(x) = x^T Σ̂^{−1} μ̂_2 − (1/2) μ̂_2^T Σ̂^{−1} μ̂_2 + log π̂_2
        = x^T [10; 0] − (1/2)(100/3) + log(1/2)
        = 10 X_1 − 50/3 + log(1/2)

Setting δ̂_1(x) = δ̂_2(x):

10 X_2 − 50/3 + log(1/2) = 10 X_1 − 50/3 + log(1/2)
10 X_2 = 10 X_1
X_2 = X_1.

So, the line that decides between the two classes is given by X_2 = X_1. Here is a graph of the deciding line:

If δ̂_1(x) > δ̂_2(x), then we classify x as of class k_1. So if x is above the line X_2 = X_1, then we classify x as of class k_1. Conversely, if δ̂_1(x) < δ̂_2(x), then we classify x as of class k_2. This corresponds to x being below the line X_2 = X_1. The point (5, 0) is below the line; so we classify it as of class k_2.
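As a quick numerical check of this example, the following Python sketch re-computes Σ̂⁻¹, the discriminant coefficients, and the classification of the point (5, 0).

```python
import numpy as np

# Re-check the numbers in LDA Example 1 with numpy.
X = np.array([[1, 3], [2, 3], [2, 4], [3, 1], [3, 2], [4, 2]], dtype=float)
y = np.array([1, 1, 1, 2, 2, 2])

mu1, mu2 = X[y == 1].mean(axis=0), X[y == 2].mean(axis=0)
S = ((X[y == 1] - mu1).T @ (X[y == 1] - mu1)
     + (X[y == 2] - mu2).T @ (X[y == 2] - mu2)) / (6 - 2)
S_inv = np.linalg.inv(S)

print(S_inv)                        # approximately [[4, -2], [-2, 4]]
print(S_inv @ mu1, S_inv @ mu2)     # approx [0, 10] and [10, 0]

# Discriminants at the new point (5, 0): the larger one wins.
x_new = np.array([5.0, 0.0])
d1 = x_new @ S_inv @ mu1 - 0.5 * mu1 @ S_inv @ mu1 + np.log(0.5)
d2 = x_new @ S_inv @ mu2 - 0.5 * mu2 @ S_inv @ mu2 + np.log(0.5)
print(d1 < d2)                      # True, so (5, 0) is classified as class k_2
```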

LDA EXAMPLE 2

Suppose we have a set of data (x_1, y_1), …, (x_6, y_6) as follows: x_1 = (0, 2), x_2 = (1, 2), x_3 = (2, 0), x_4 = (2, 1), x_5 = (3, 3), x_6 = (4, 4), with y_1 = y_2 = k_1 = 1, y_3 = y_4 = k_2 = 2, and y_5 = y_6 = k_3 = 3. Apply linear discriminant analysis by doing the following:

a) Find estimates for the linear discriminant functions δ_1(x), δ_2(x), and δ_3(x).
b) Find the lines that decide between each pair of classes.
c) Classify the new point x = (1, 3).

Solution: Here is a graph of the data points:

The number of features p is 2, the number of classes K is 3, the total number of data points N is 6, the number N_1 of data points in class k_1 is 2, the number N_2 of data points in class k_2 is 2, and the number N_3 of data points in class k_3 is 2.

First, we will find estimates for π_1, π_2, π_3, the prior probabilities that Y = k_1, Y = k_2, Y = k_3, respectively. Then, we will find estimates for μ_1, μ_2, μ_3, the class-specific mean vectors. We can then calculate the estimate for the covariance matrix Σ. Finally, using the estimates π̂_1, π̂_2, π̂_3, μ̂_1, μ̂_2, μ̂_3, Σ̂, we can find the estimates for the linear discriminant functions δ̂_1(x), δ̂_2(x), and δ̂_3(x).

π̂_1 = N_1/N = 2/6 = 1/3
π̂_2 = N_2/N = 2/6 = 1/3
π̂_3 = N_3/N = 2/6 = 1/3
μ̂_1 = (1/N_1) Σ_{i: y_i = 1} x_i = (1/2)[x_1 + x_2] = [1/2; 2]
μ̂_2 = (1/N_2) Σ_{i: y_i = 2} x_i = (1/2)[x_3 + x_4] = [2; 1/2]

μ̂_3 = (1/N_3) Σ_{i: y_i = 3} x_i = (1/2)[x_5 + x_6] = [7/2; 7/2]

Σ̂ = 1/(N − K) Σ_{k=1}^{K} Σ_{i: y_i = k} (x_i − μ̂_k)(x_i − μ̂_k)^T = 1/(6 − 3) [1 1/2; 1/2 1] = [1/3 1/6; 1/6 1/3]

Σ̂^{−1} = [4 −2; −2 4]

δ̂_1(x) = x^T Σ̂^{−1} μ̂_1 − (1/2) μ̂_1^T Σ̂^{−1} μ̂_1 + log π̂_1
        = x^T [−2; 7] − (1/2)(13) + log(1/3)
        = −2 X_1 + 7 X_2 − 13/2 + log(1/3)

δ̂_2(x) = x^T Σ̂^{−1} μ̂_2 − (1/2) μ̂_2^T Σ̂^{−1} μ̂_2 + log π̂_2
        = x^T [7; −2] − (1/2)(13) + log(1/3)
        = 7 X_1 − 2 X_2 − 13/2 + log(1/3)

δ̂_3(x) = x^T Σ̂^{−1} μ̂_3 − (1/2) μ̂_3^T Σ̂^{−1} μ̂_3 + log π̂_3
        = x^T [7; 7] − (1/2)(49) + log(1/3)
        = 7 X_1 + 7 X_2 − 49/2 + log(1/3)

Setting δ̂_1(x) = δ̂_2(x):

−2 X_1 + 7 X_2 − 13/2 + log(1/3) = 7 X_1 − 2 X_2 − 13/2 + log(1/3)
−2 X_1 + 7 X_2 = 7 X_1 − 2 X_2
9 X_2 = 9 X_1

X_2 = X_1.

So, the line that decides between classes k_1 and k_2 is given by X_2 = X_1.

Setting δ̂_1(x) = δ̂_3(x):

−2 X_1 + 7 X_2 − 13/2 + log(1/3) = 7 X_1 + 7 X_2 − 49/2 + log(1/3)
18 = 9 X_1
X_1 = 2

So, the line that decides between classes k_1 and k_3 is given by X_1 = 2.

Setting δ̂_2(x) = δ̂_3(x):

7 X_1 − 2 X_2 − 13/2 + log(1/3) = 7 X_1 + 7 X_2 − 49/2 + log(1/3)
18 = 9 X_2
X_2 = 2

So, the line that decides between classes k_2 and k_3 is given by X_2 = 2.

Here is a graph of the deciding lines:

The lines divide the plane into 3 regions. δ̂_1(x) > δ̂_2(x) corresponds to the region above the line X_2 = X_1. Conversely, δ̂_1(x) < δ̂_2(x) corresponds to the region below the line X_2 = X_1. δ̂_1(x) > δ̂_3(x) corresponds to the region to the left of the line X_1 = 2. Conversely, δ̂_1(x) < δ̂_3(x)

corresponds to the region to the right of X_1 = 2. δ̂_2(x) > δ̂_3(x) corresponds to the region below the line X_2 = 2. Conversely, δ̂_2(x) < δ̂_3(x) corresponds to the region above the line X_2 = 2.

If δ̂_1(x) > δ̂_2(x) and δ̂_1(x) > δ̂_3(x), then we classify x as of class k_1. So if x is in region I, then we classify x as of class k_1. Conversely, if x is in region II, then we classify x as of class k_2; and if x is in region III, we classify x as of class k_3. The point (1, 3) is in region I; so we classify it as of class k_1.

SUMMARY: LINEAR DISCRIMINANT ANALYSIS

In linear discriminant analysis, we find estimates p̂_k(x) for the posterior probability p_k(x) that Y = k given that X = x. We classify x according to the class k that gives the highest estimated posterior probability p̂_k(x).

Maximizing the estimated posterior probability p̂_k(x) is equivalent to maximizing the log of p̂_k(x), which, in turn, is equivalent to maximizing the estimated linear discriminant function δ̂_k(x).

We find estimates of the prior probability π_k that Y = k, of the class-specific mean vectors μ_k, and of the covariance matrix Σ in order to estimate the linear discriminant functions δ_k(x).

By setting δ̂_{k_1}(x) = δ̂_{k_2}(x) for each pair (k_1, k_2) of classes, we get hyperplanes in R^p that, together, divide R^p into regions corresponding to the distinct classes. We classify x according to the class k for which δ̂_k(x) is largest.

PROBLEM SET: LINEAR DISCRIMINANT ANALYSIS

1. Suppose we have a set of data (x_1, y_1), …, (x_6, y_6) as follows: x_1 = (1, 2), x_2 = (2, 1), x_3 = (2, 2), x_4 = (3, 3), x_5 = (3, 4), x_6 = (4, 3) with y_1 = y_2 = y_3 = k_1 = 1 and y_4 = y_5 = y_6 = k_2 = 2. Apply linear discriminant analysis by doing the following:

a) Find estimates for the linear discriminant functions δ_1(x) and δ_2(x).
b) Find the line that decides between the two classes.
c) Classify the new point x = (4, 5).

2. Suppose we have a set of data (x_1, y_1), …, (x_6, y_6) as follows: x_1 = (0, 0), x_2 = (1, 1), x_3 = (2, 3), x_4 = (2, 4), x_5 = (3, 2), x_6 = (4, 2) with y_1 = y_2 = k_1 = 1, y_3 = y_4 = k_2 = 2 and y_5 = y_6 = k_3 = 3. Apply linear discriminant analysis by doing the following:

a) Find estimates for the linear discriminant functions δ_1(x), δ_2(x) and δ_3(x).
b) Find the lines that decide between each pair of classes.
c) Classify the new point x = (3, 0).

SOLUTION SET: LINEAR DISCRIMINANT ANALYSIS

1. Here is a graph of the data points:

The number of features p is 2, the number of classes K is 2, the total number of data points N is 6, the number N_1 of data points in class k_1 is 3, and the number N_2 of data points in class k_2 is 3.

First, we will find estimates for π_1 and π_2, the prior probabilities that Y = k_1 and Y = k_2, respectively. Then, we will find estimates for μ_1 and μ_2, the class-specific mean vectors. We can then calculate the estimate for the covariance matrix Σ. Finally, using the estimates π̂_1, π̂_2, μ̂_1, μ̂_2, Σ̂, we can find the estimates for the linear discriminant functions δ̂_1(x) and δ̂_2(x).

π̂_1 = N_1/N = 3/6 = 1/2
π̂_2 = N_2/N = 3/6 = 1/2
μ̂_1 = (1/N_1) Σ_{i: y_i = 1} x_i = (1/3)[x_1 + x_2 + x_3] = [5/3; 5/3]

μ̂_2 = (1/N_2) Σ_{i: y_i = 2} x_i = (1/3)[x_4 + x_5 + x_6] = [10/3; 10/3]

Σ̂ = 1/(N − K) Σ_{k=1}^{K} Σ_{i: y_i = k} (x_i − μ̂_k)(x_i − μ̂_k)^T = 1/(6 − 2) [4/3 −2/3; −2/3 4/3] = [1/3 −1/6; −1/6 1/3]

Σ̂^{−1} = [4 2; 2 4]

δ̂_1(x) = x^T Σ̂^{−1} μ̂_1 − (1/2) μ̂_1^T Σ̂^{−1} μ̂_1 + log π̂_1
        = x^T [10; 10] − (1/2)(100/3) + log(1/2)
        = 10 X_1 + 10 X_2 − 50/3 + log(1/2)

δ̂_2(x) = x^T Σ̂^{−1} μ̂_2 − (1/2) μ̂_2^T Σ̂^{−1} μ̂_2 + log π̂_2
        = x^T [20; 20] − (1/2)(400/3) + log(1/2)
        = 20 X_1 + 20 X_2 − 200/3 + log(1/2)

Setting δ̂_1(x) = δ̂_2(x):

10 X_1 + 10 X_2 − 50/3 + log(1/2) = 20 X_1 + 20 X_2 − 200/3 + log(1/2)
150/3 = 10 X_1 + 10 X_2
50 = 10 X_1 + 10 X_2
5 = X_1 + X_2
X_2 = −X_1 + 5

So, the line that decides between the two classes is given by X_2 = −X_1 + 5. Here is a graph of the decision line:

If δ̂_1(x) > δ̂_2(x), then we classify x as of class k_1. So if x is below the line X_2 = −X_1 + 5, then we classify x as of class k_1. Conversely, if δ̂_1(x) < δ̂_2(x), then we classify x as of class k_2. This corresponds to x being above the line X_2 = −X_1 + 5. The point (4, 5) is above the line; so we classify it as of class k_2.

RICHARD HAN. Here is a graph of the data points: The number of features p is, the number of classes K is 3, the total number of data points N is 6, the number N of data points in class k is, the number N of data points in class k is, and the number N 3 of data points in class k 3 is. First, we will find estimates for π, π, π 3, the prior probabilities that Y = k, Y = k, Y = k 3, respectively. Then, we will find estimates for μ, μ, μ 3, the class-specific mean vectors. We can then calculate the estimate for the covariance matrix Σ. Finally, using the estimates π, π, π, 3 μ, μ, μ, 3 Σ, we can find the estimates for the linear discriminant functions δ (x), δ (x), and δ 3 (x). π = N N = 6 = 3 π = N N = 6 = 3 π 3 = N 3 N = 6 = 3 3

μ̂_1 = (1/N_1) Σ_{i: y_i = 1} x_i = (1/2)[x_1 + x_2] = [1/2; 1/2]
μ̂_2 = (1/N_2) Σ_{i: y_i = 2} x_i = (1/2)[x_3 + x_4] = [2; 7/2]
μ̂_3 = (1/N_3) Σ_{i: y_i = 3} x_i = (1/2)[x_5 + x_6] = [7/2; 2]

Σ̂ = 1/(N − K) Σ_{k=1}^{K} Σ_{i: y_i = k} (x_i − μ̂_k)(x_i − μ̂_k)^T = 1/(6 − 3) [1 1/2; 1/2 1] = [1/3 1/6; 1/6 1/3]

Σ̂^{−1} = [4 −2; −2 4]

δ̂_1(x) = x^T Σ̂^{−1} μ̂_1 − (1/2) μ̂_1^T Σ̂^{−1} μ̂_1 + log π̂_1
        = x^T [1; 1] − (1/2)(1) + log(1/3)
        = X_1 + X_2 − 1/2 + log(1/3)

δ̂_2(x) = x^T Σ̂^{−1} μ̂_2 − (1/2) μ̂_2^T Σ̂^{−1} μ̂_2 + log π̂_2
        = x^T [1; 10] − (1/2)(37) + log(1/3)
        = X_1 + 10 X_2 − 37/2 + log(1/3)

δ̂_3(x) = x^T Σ̂^{−1} μ̂_3 − (1/2) μ̂_3^T Σ̂^{−1} μ̂_3 + log π̂_3
        = x^T [10; 1] − (1/2)(37) + log(1/3)
        = 10 X_1 + X_2 − 37/2 + log(1/3)

Setting δ̂_1(x) = δ̂_2(x):

X_1 + X_2 − 1/2 + log(1/3) = X_1 + 10 X_2 − 37/2 + log(1/3)
18 = 9 X_2
X_2 = 2

So, the line that decides between classes k_1 and k_2 is given by X_2 = 2.

Setting δ̂_1(x) = δ̂_3(x):

X_1 + X_2 − 1/2 + log(1/3) = 10 X_1 + X_2 − 37/2 + log(1/3)
18 = 9 X_1
X_1 = 2

So, the line that decides between classes k_1 and k_3 is given by X_1 = 2.

Setting δ̂_2(x) = δ̂_3(x):

X_1 + 10 X_2 − 37/2 + log(1/3) = 10 X_1 + X_2 − 37/2 + log(1/3)
9 X_2 = 9 X_1
X_2 = X_1

So, the line that decides between classes k_2 and k_3 is given by X_2 = X_1.

Here is a graph of the decision lines:

The lines divide the plane into 3 regions. If x is in region I, then we classify x as of class k_1. Similarly, points in region II get classified as of class k_2, and points in region III get classified as of class k_3. The point (3, 0) is in region III; so we classify it as of class k_3.
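The two problem-set answers can likewise be re-checked numerically with a short Python sketch (illustrative only).

```python
import numpy as np

# Re-check the two LDA problem-set answers above.
def lda_predict(X, y, x_new):
    classes = np.unique(y)
    N, K = len(y), len(classes)
    means = {k: X[y == k].mean(axis=0) for k in classes}
    sigma = sum((X[y == k] - means[k]).T @ (X[y == k] - means[k])
                for k in classes) / (N - K)
    sigma_inv = np.linalg.inv(sigma)
    deltas = [x_new @ sigma_inv @ means[k]
              - 0.5 * means[k] @ sigma_inv @ means[k]
              + np.log(np.mean(y == k)) for k in classes]
    return classes[int(np.argmax(deltas))]

# Problem 1: expect class 2 for the point (4, 5).
X1 = np.array([[1, 2], [2, 1], [2, 2], [3, 3], [3, 4], [4, 3]], dtype=float)
y1 = np.array([1, 1, 1, 2, 2, 2])
print(lda_predict(X1, y1, np.array([4.0, 5.0])))

# Problem 2: expect class 3 for the point (3, 0).
X2 = np.array([[0, 0], [1, 1], [2, 3], [2, 4], [3, 2], [4, 2]], dtype=float)
y2 = np.array([1, 1, 2, 2, 3, 3])
print(lda_predict(X2, y2, np.array([3.0, 0.0])))
```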
