Gaussian Discriminant Analysis, Naive Bayes


DM825 Introduction to Machine Learning. Lecture 7: Gaussian Discriminant Analysis. Marco Chiarandini, Department of Mathematics & Computer Science, University of Southern Denmark

Outline
1. Gaussian Discriminant Analysis
2. Naive Bayes: Multi-variate Bernoulli Event Model; Multinomial Event Model
3. Support Vector Machines

The discriminative approach learns $p(y \mid x)$ directly; the generative approach learns $p(x \mid y)$ and $p(y)$.

Generative Method
1. Model $p(y)$ and $p(x \mid y)$.
2. Learn the parameters of the models by maximizing the joint likelihood $p(x, y) = p(x \mid y)\,p(y)$.
3. Express the posterior via Bayes' rule:
$$p(y \mid x) = \frac{p(x \mid y)\,p(y)}{p(x)} = \frac{p(x \mid y)\,p(y)}{\sum_{y' \in Y} p(x \mid y')\,p(y')}$$
4. Predict:
$$\arg\max_y p(y \mid x) = \arg\max_y \frac{p(x \mid y)\,p(y)}{p(x)} = \arg\max_y p(x \mid y)\,p(y)$$
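As a toy illustration of steps 3 and 4, here is a minimal Python sketch (the function name, the dictionaries and the 1-D Gaussian class-conditionals are made up for illustration): since $p(x)$ is shared by all classes, prediction reduces to maximizing $\log p(x \mid y) + \log p(y)$.

```python
import numpy as np
from scipy.stats import norm

def generative_predict(x, priors, conditionals):
    """Step 4: arg max_y p(x|y) p(y), computed in log space (p(x) is a shared constant)."""
    scores = {y: np.log(conditionals[y](x)) + np.log(priors[y]) for y in priors}
    return max(scores, key=scores.get)

# toy usage: two 1-D Gaussian classes with equal priors
priors = {0: 0.5, 1: 0.5}
conditionals = {0: norm(-1.0, 1.0).pdf, 1: norm(+1.0, 1.0).pdf}
print(generative_predict(0.3, priors, conditionals))   # -> 1 (0.3 is closer to the class-1 mean)
```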

Outline
1. Gaussian Discriminant Analysis
2. Naive Bayes: Multi-variate Bernoulli Event Model; Multinomial Event Model
3. Support Vector Machines

Gaussian Discriminant Analysis
Let $\vec{x}$ be a vector of continuous variables. We will assume that $p(\vec{x} \mid y)$ is a multivariate Gaussian distribution:
$$p(\vec{x}; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (\vec{x} - \mu)^T \Sigma^{-1} (\vec{x} - \mu)\right)$$
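For concreteness, a small numpy sketch evaluating this density (the helper name is mine; the comparison against scipy is just a sanity check on made-up numbers):

```python
import numpy as np
from scipy.stats import multivariate_normal

def multivariate_gaussian_pdf(x, mu, Sigma):
    """Density of N(mu, Sigma) at x, following the formula above."""
    n = x.shape[0]
    diff = x - mu
    norm_const = 1.0 / (np.power(2 * np.pi, n / 2) * np.sqrt(np.linalg.det(Sigma)))
    quad = diff @ np.linalg.solve(Sigma, diff)   # (x - mu)^T Sigma^{-1} (x - mu)
    return norm_const * np.exp(-0.5 * quad)

# quick check on a toy 2-D example
x = np.array([0.5, -0.2]); mu = np.zeros(2); Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
print(multivariate_gaussian_pdf(x, mu, Sigma))
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))   # should match
```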

Gaussian Discriminant Analysis
Step 1: we model the probabilities:
$$y \sim \text{Bernoulli}(\phi), \qquad \vec{x} \mid y = 0 \sim N(\mu_0, \Sigma), \qquad \vec{x} \mid y = 1 \sim N(\mu_1, \Sigma)$$
that is:
$$p(y) = \phi^y (1 - \phi)^{1-y}$$
$$p(\vec{x} \mid y = 0) = N(\mu_0, \Sigma) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(\vec{x} - \mu_0)^T \Sigma^{-1} (\vec{x} - \mu_0)\right)$$
$$p(\vec{x} \mid y = 1) = N(\mu_1, \Sigma) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(\vec{x} - \mu_1)^T \Sigma^{-1} (\vec{x} - \mu_1)\right)$$
Step 2: we express the joint likelihood of a set of data $i = 1, \ldots, m$:
$$l(\phi, \mu_0, \mu_1, \Sigma) = \prod_{i=1}^m p(x^i, y^i) = \prod_{i=1}^m p(x^i \mid y^i)\,p(y^i)$$
We substitute the model assumptions above and maximize $\log l(\phi, \mu_0, \mu_1, \Sigma)$ in $\phi, \mu_0, \mu_1, \Sigma$.

Solutions:
$$\phi = \frac{\sum_{i=1}^m y^i}{m} = \frac{\sum_i I\{y^i = 1\}}{m}, \qquad \mu_0 = \frac{\sum_i I\{y^i = 0\}\, x^i}{\sum_i I\{y^i = 0\}}, \qquad \mu_1 = \frac{\sum_i I\{y^i = 1\}\, x^i}{\sum_i I\{y^i = 1\}}, \qquad \Sigma = \ldots$$
Compare with logistic regression, where we maximized the conditional likelihood instead!
Steps 3 and 4:
$$\arg\max_y p(y \mid x) = \arg\max_y \frac{p(x \mid y)\,p(y)}{p(x)} = \arg\max_y p(x \mid y)\,p(y)$$
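A minimal numpy sketch of these maximum-likelihood estimates (function and variable names are mine; for the elided $\Sigma$ I use the standard pooled estimate $\frac{1}{m}\sum_i (x^i - \mu_{y^i})(x^i - \mu_{y^i})^T$):

```python
import numpy as np

def fit_gda(X, y):
    """MLE for GDA with shared covariance.
    X: (m, n) array of features, y: (m,) array of 0/1 labels."""
    m, n = X.shape
    phi = np.mean(y == 1)
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    # pooled covariance: average outer product of residuals from each example's class mean
    resid = X - np.where((y == 1)[:, None], mu1, mu0)
    Sigma = resid.T @ resid / m
    return phi, mu0, mu1, Sigma
```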

Comments
In GDA: $x \mid y$ Gaussian $\Rightarrow$ logistic posterior for $p(y = 1 \mid x)$ (see sec. 4.2 of B1). In logistic regression we model $p(y \mid x)$ as logistic directly; the logistic posterior also arises from other class-conditional distributions, e.g.:
$$x \mid y = 0 \sim \text{Poisson}(\lambda_0), \quad x \mid y = 1 \sim \text{Poisson}(\lambda_1)$$
$$x \mid y = 0 \sim \text{ExpFam}(\eta_0), \quad x \mid y = 1 \sim \text{ExpFam}(\eta_1)$$
Hence we make stronger assumptions in GDA. If we do not know where the data come from, logistic regression is more robust; if we do know, GDA may perform better.

When $\Sigma$ is the same for all class-conditional densities, the decision boundaries are linear: linear discriminant analysis (LDA). When the class-conditional densities do not share $\Sigma$, the boundaries are quadratic: quadratic discriminant analysis (QDA).
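A quick sketch of the contrast, assuming scikit-learn is available (the data below is synthetic and purely illustrative): LDA pools a single $\Sigma$ and yields a linear boundary, QDA fits one $\Sigma$ per class and yields a quadratic one.

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(0)
X0 = rng.multivariate_normal([0, 0], [[1, 0], [0, 1]], size=100)
X1 = rng.multivariate_normal([2, 2], [[1, 0.5], [0.5, 2]], size=100)  # different Sigma
X = np.vstack([X0, X1]); y = np.array([0] * 100 + [1] * 100)

lda = LinearDiscriminantAnalysis().fit(X, y)     # shared Sigma -> linear boundary
qda = QuadraticDiscriminantAnalysis().fit(X, y)  # per-class Sigma -> quadratic boundary
print(lda.score(X, y), qda.score(X, y))
```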

Outline
1. Gaussian Discriminant Analysis
2. Naive Bayes: Multi-variate Bernoulli Event Model; Multinomial Event Model
3. Support Vector Machines

Chain Rule of Probability
The chain rule permits the calculation of the joint distribution of a set of random variables using only conditional probabilities. Consider the set of events $A_1, A_2, \ldots, A_n$. To find the value of the joint distribution, we can apply the definition of conditional probability to obtain:
$$\Pr(A_n, A_{n-1}, \ldots, A_1) = \Pr(A_n \mid A_{n-1}, \ldots, A_1)\,\Pr(A_{n-1}, A_{n-2}, \ldots, A_1)$$
Repeating the process on each final term:
$$\Pr\left(\bigcap_{k=1}^n A_k\right) = \prod_{k=1}^n \Pr\left(A_k \,\Big|\, \bigcap_{j=1}^{k-1} A_j\right)$$
For example:
$$\Pr(A_4, A_3, A_2, A_1) = \Pr(A_4 \mid A_3, A_2, A_1)\,\Pr(A_3 \mid A_2, A_1)\,\Pr(A_2 \mid A_1)\,\Pr(A_1)$$
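A small numeric sanity check of the chain rule on a made-up joint table over three binary events:

```python
import numpy as np

rng = np.random.default_rng(1)
joint = rng.random((2, 2, 2))          # indices: (a1, a2, a3)
joint /= joint.sum()                   # normalize into a valid joint distribution

a1, a2, a3 = 1, 0, 1
p_a1 = joint[a1].sum()
p_a2_given_a1 = joint[a1, a2].sum() / p_a1
p_a3_given_a1a2 = joint[a1, a2, a3] / joint[a1, a2].sum()

# chain rule: Pr(A3, A2, A1) = Pr(A3 | A2, A1) Pr(A2 | A1) Pr(A1)
assert np.isclose(joint[a1, a2, a3], p_a3_given_a1a2 * p_a2_given_a1 * p_a1)
```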

Multi-variate Bernoulli Event Model
We want to decide whether an email is spam, $y \in \{0, 1\}$, given some discrete features $x$. How do we represent emails by a set of features? As a binary array: each element corresponds to a word in the vocabulary, and the bit indicates whether that word is present in the email or not.
$$x = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \\ 1 \\ \vdots \\ 0 \\ 0 \end{bmatrix}, \qquad x \in \{0, 1\}^n, \qquad n = 50000 \text{ (large number)}$$
There are $2^{50000}$ possible bit vectors, hence $2^{50000} - 1$ parameters to learn if we modeled the distribution over bit vectors explicitly. Instead, we collect examples, look at those that are spam ($y = 1$) and learn $p(x \mid y = 1)$, then at those with $y = 0$ and learn $p(x \mid y = 0)$.
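A tiny sketch of this representation (the five-word vocabulary and the email text are made up and stand in for the $n = 50000$ case):

```python
import numpy as np

vocabulary = ["buy", "cheap", "meeting", "now", "project"]   # toy stand-in for 50000 words
word_index = {w: j for j, w in enumerate(vocabulary)}

def email_to_bits(email_text):
    """Return x in {0,1}^n with x_j = 1 iff vocabulary word j occurs in the email."""
    x = np.zeros(len(vocabulary), dtype=int)
    for word in email_text.lower().split():
        if word in word_index:
            x[word_index[word]] = 1
    return x

print(email_to_bits("Buy cheap now now now"))   # -> [1 1 0 1 0]
```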

Step 1: For a given example $i$ we treat each $x^i_j$ independently:
$$p(y = 1) = \phi_y, \qquad \forall j: \; p(x_j = 1 \mid y = 0) = \phi_{j|y=0}, \qquad \forall j: \; p(x_j = 1 \mid y = 1) = \phi_{j|y=1}$$
Step 2: Maximize the joint likelihood. Assume the $x_j$'s are conditionally independent given $y$. By the chain rule:
$$p(x_1, \ldots, x_{50000} \mid y) = p(x_1 \mid y)\,p(x_2 \mid y, x_1)\,p(x_3 \mid y, x_1, x_2)\cdots$$
$$= p(x_1 \mid y)\,p(x_2 \mid y)\,p(x_3 \mid y)\cdots \quad \text{(cond. indep.)}$$
$$= \prod_{j=1}^{n} p(x_j \mid y)$$

$$l(\phi_y, \phi_{j|y=0}, \phi_{j|y=1}) = \prod_{i=1}^m p(\vec{x}^i, y^i) = \prod_{i=1}^m \left(\prod_{j=1}^n p(x^i_j \mid y^i)\right) p(y^i)$$
Solution:
$$\phi_y = \frac{\sum_i I\{y^i = 1\}}{m}, \qquad \phi_{j|y=1} = \frac{\sum_i I\{y^i = 1, x^i_j = 1\}}{\sum_i I\{y^i = 1\}}, \qquad \phi_{j|y=0} = \frac{\sum_i I\{y^i = 0, x^i_j = 1\}}{\sum_i I\{y^i = 0\}}$$
Steps 3 and 4: prediction as usual, but remember to use logarithms.
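A minimal numpy sketch of these estimates and of the log-space prediction (names are mine, not from the slides). Note that a maximum-likelihood estimate equal to exactly 0 or 1 makes the logarithm blow up, which motivates the Laplace smoothing discussed next.

```python
import numpy as np

def fit_bernoulli_nb(X, y):
    """X: (m, n) binary matrix, y: (m,) 0/1 labels."""
    phi_y = np.mean(y == 1)
    phi_j_y1 = X[y == 1].mean(axis=0)   # sum_i I{y=1, x_j=1} / sum_i I{y=1}
    phi_j_y0 = X[y == 0].mean(axis=0)
    return phi_y, phi_j_y1, phi_j_y0

def predict(x, phi_y, phi_j_y1, phi_j_y0):
    """Steps 3 and 4: arg max_y of log p(x|y) + log p(y)."""
    log_p1 = np.log(phi_y) + np.sum(x * np.log(phi_j_y1) + (1 - x) * np.log(1 - phi_j_y1))
    log_p0 = np.log(1 - phi_y) + np.sum(x * np.log(phi_j_y0) + (1 - x) * np.log(1 - phi_j_y0))
    return int(log_p1 > log_p0)
```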

Laplace Smoothing
What if $p(x_j \mid y = 1) = 0$ and $p(x_j \mid y = 0) = 0$ for some word $j$, because we do not have any observation in the training set with that word? Then
$$p(y = 1 \mid x) = \frac{p(x \mid y = 1)\,p(y = 1)}{p(x \mid y = 1)\,p(y = 1) + p(x \mid y = 0)\,p(y = 0)} = \frac{\prod_{j=1}^{50000} p(x_j \mid y = 1)\,p(y = 1)}{\prod_{j=1}^{50000} p(x_j \mid y = 1)\,p(y = 1) + \prod_{j=1}^{50000} p(x_j \mid y = 0)\,p(y = 0)} = \frac{0}{0 + 0}$$
Laplace smoothing: pretend we have seen some extra observations:
$$p(x \mid y) = \frac{c(x, y) + k}{c(y) + k\,|\mathcal{X}|}$$
where $|\mathcal{X}|$ is the number of values $x$ can take. Hence:
$$\phi_y = \frac{\sum_i I\{y^i = 1\} + 1}{m + K}, \qquad \phi_{j|y=1} = \frac{\sum_i I\{y^i = 1, x^i_j = 1\} + 1}{\sum_i I\{y^i = 1\} + 2}, \qquad \phi_{j|y=0} = \frac{\sum_i I\{y^i = 0, x^i_j = 1\} + 1}{\sum_i I\{y^i = 0\} + 2}$$
with $K$ the number of classes (here $K = 2$).
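The smoothed counterpart of the earlier fitting sketch (again with my own naming); with the +1/+2 counts no estimate can be exactly 0 or 1:

```python
import numpy as np

def fit_bernoulli_nb_smoothed(X, y):
    """Laplace-smoothed estimates for the multi-variate Bernoulli model."""
    m = len(y)
    K = 2                                # number of classes
    phi_y = (np.sum(y == 1) + 1) / (m + K)
    phi_j_y1 = (X[y == 1].sum(axis=0) + 1) / (np.sum(y == 1) + 2)
    phi_j_y0 = (X[y == 0].sum(axis=0) + 1) / (np.sum(y == 0) + 2)
    return phi_y, phi_j_y1, phi_j_y0
```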

Multinomial Event Model
We look at a generalization of the previous model that also takes into account the number of times a word appears, as well as its position. Let $x_i \in \{1, 2, \ldots, K\}$; for example, a continuous variable discretized into buckets. Let $[x^i_1, x^i_2, \ldots, x^i_{n_i}]$, with $x^i_j \in \{1, 2, \ldots, K\}$, represent the word in position $j$, where $n_i$ is the number of words in the $i$th email.
Step 1:
$$p(y = 1) = \phi_y, \qquad \forall j: \; p(x_j = k \mid y = 0) = \phi_{k|y=0}, \qquad \forall j: \; p(x_j = k \mid y = 1) = \phi_{k|y=1}$$
where it is assumed that $p(x_j = k \mid y)$ is the same for all positions $j$. The $\phi_{k|y=1}, \phi_{k|y=0}$ are parameters of multinomial distributions.

Step 2: Joint likelihood:
$$l(\phi_y, \phi_{k|y=0}, \phi_{k|y=1}) = \prod_{i=1}^m p(\vec{x}^i, y^i) = \prod_{i=1}^m \left(\prod_{j=1}^{n_i} p(x^i_j \mid y^i)\right) p(y^i)$$
Solution:
$$\phi_y = \frac{\sum_i I\{y^i = 1\}}{m}, \qquad \phi_{k|y=1} = \frac{\sum_i \sum_{j=1}^{n_i} I\{y^i = 1, x^i_j = k\}}{\sum_i I\{y^i = 1\}\, n_i}, \qquad \phi_{k|y=0} = \frac{\sum_i \sum_{j=1}^{n_i} I\{y^i = 0, x^i_j = k\}}{\sum_i I\{y^i = 0\}\, n_i}$$
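A sketch of these estimates in Python (my own naming; `emails` is a hypothetical list of word-index lists): $\phi_{k|y}$ is the fraction of word positions, across the emails of class $y$, that hold word $k$.

```python
import numpy as np

def fit_multinomial_nb(emails, y, K):
    """emails: list of lists of word indices in {0,...,K-1}; y: (m,) array of 0/1 labels."""
    phi_y = np.mean(y == 1)
    counts = np.zeros((2, K))            # counts[c, k] = occurrences of word k in class c
    positions = np.zeros(2)              # total number of word positions per class
    for email, label in zip(emails, y):
        for k in email:
            counts[label, k] += 1
        positions[label] += len(email)
    phi_k_y0 = counts[0] / positions[0]
    phi_k_y1 = counts[1] / positions[1]
    return phi_y, phi_k_y0, phi_k_y1
```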

Laplace Smoothing
As in the multi-variate Bernoulli case, add 1 to each count in the numerators and $K$ (the number of possible word values) to the denominators of $\phi_{k|y=0}$ and $\phi_{k|y=1}$, so that no estimate is exactly zero.

Outline
1. Gaussian Discriminant Analysis
2. Naive Bayes: Multi-variate Bernoulli Event Model; Multinomial Event Model
3. Support Vector Machines

Support Vector Machines (Intro)
Support vector machines are a discriminative approach that implements a non-linear decision on the basis of linear classifiers:
lift points to a space where they are linearly separable, then find a linear separator.
Let's focus first on how to find a linear separator. Desiderata:
predict 1 iff $\vec{\theta}^T x \geq 0$; predict 0 iff $\vec{\theta}^T x < 0$.
Also wanted: if $\vec{\theta}^T x \gg 0$, be very confident that $y = 1$; if $\vec{\theta}^T x \ll 0$, be very confident that $y = 0$.
Hence it would be nice if:
$$\forall i: y^i = 1 \text{ we have } \vec{\theta}^T x^i \gg 0, \qquad \forall i: y^i = 0 \text{ we have } \vec{\theta}^T x^i \ll 0$$

Notation
Assume the training set is linearly separable. Let's change notation: $y \in \{-1, +1\}$ (instead of $\{0, 1\}$ as in GLMs). Let $h$ output values in $\{-1, +1\}$:
$$f(z) = \text{sign}(z) = \begin{cases} +1 & \text{if } z \geq 0 \\ -1 & \text{if } z < 0 \end{cases}$$
(hence no probabilities, unlike in logistic regression)
$$h(\vec{\theta}, \vec{x}) = f(\vec{\theta} \cdot \vec{x}), \qquad \vec{x} \in \mathbb{R}^{n+1}, \; \vec{\theta} \in \mathbb{R}^{n+1}$$
$$h(\vec{\theta}, \vec{x}) = f(\vec{\theta} \cdot \vec{x} + \theta_0), \qquad \vec{x} \in \mathbb{R}^n, \; \vec{\theta} \in \mathbb{R}^n, \; \theta_0 \in \mathbb{R}$$
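A tiny sketch of this $\pm 1$ classifier (the particular $\theta$, $\theta_0$, and points below are arbitrary illustrative values):

```python
import numpy as np

def h(theta, theta0, x):
    """sign(theta^T x + theta_0) with outputs in {-1, +1}."""
    return 1 if theta @ x + theta0 >= 0 else -1

theta = np.array([2.0, -1.0]); theta0 = 0.5
print(h(theta, theta0, np.array([1.0, 1.0])))   # theta^T x + theta_0 =  1.5 -> +1
print(h(theta, theta0, np.array([-1.0, 2.0])))  # theta^T x + theta_0 = -3.5 -> -1
```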

Functional Margin
Def.: The functional margin of a hyperplane $(\vec{\theta}, \theta_0)$ w.r.t. a specific example $(x^i, y^i)$ is:
$$\hat{\gamma}^i = y^i (\vec{\theta}^T x^i + \theta_0)$$
For the decision boundary $\vec{\theta}^T x + \theta_0 = 0$ that defines the linear boundary:
we want $\vec{\theta}^T x^i \gg 0$ if $y^i = +1$, and $\vec{\theta}^T x^i \ll 0$ if $y^i = -1$.
If $y^i (\vec{\theta}^T x^i + \theta_0) > 0$ then example $i$ is classified correctly. Hence, we want to maximize $y^i (\vec{\theta}^T x^i + \theta_0)$. This can be achieved by maximizing the worst case over the training set:
$$\hat{\gamma} = \min_i \hat{\gamma}^i$$
Note: the scaling $\vec{\theta} \to 2\vec{\theta}$, $\theta_0 \to 2\theta_0$ would make $\hat{\gamma}$ arbitrarily large. Hence we impose $\|\vec{\theta}\| = 1$.
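A small sketch computing functional margins on a made-up training set; it also shows that doubling $(\vec{\theta}, \theta_0)$ doubles every functional margin, which is why a normalization is needed:

```python
import numpy as np

X = np.array([[2.0, 1.0], [1.0, -1.0], [-1.0, -2.0]])
y = np.array([+1, +1, -1])
theta = np.array([1.0, 1.0]); theta0 = 0.5

gamma_hat_i = y * (X @ theta + theta0)
print(gamma_hat_i)            # per-example functional margins
print(gamma_hat_i.min())      # hat{gamma} = worst case over the training set
print(2 * gamma_hat_i)        # doubling (theta, theta_0) doubles every functional margin
```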

Hyperplanes
A hyperplane is a set of the form $\{x \mid a^T x = b\}$ (with $a \neq 0$); $a$ is the normal vector. Hyperplanes are affine and convex sets.

Geometric Margin
Def.: the geometric margin $\gamma^i$ is the distance of example $i$ from the linear separator (the segment A-B in the figure). $\frac{\vec{\theta}}{\|\vec{\theta}\|}$ is the unit vector orthogonal to the separating hyperplane, so the point B is:
$$x^i - \gamma^i \frac{\vec{\theta}}{\|\vec{\theta}\|}$$
Since B is on the linear separator, substituting the expression above into $\vec{\theta}^T x + \theta_0 = 0$:
$$\vec{\theta}^T \left(x^i - \gamma^i \frac{\vec{\theta}}{\|\vec{\theta}\|}\right) + \theta_0 = 0$$

Solving for $\gamma^i$:
$$\vec{\theta}^T x^i + \theta_0 = \gamma^i \frac{\vec{\theta}^T \vec{\theta}}{\|\vec{\theta}\|} = \gamma^i \|\vec{\theta}\| \qquad \Rightarrow \qquad \gamma^i = \frac{\vec{\theta}^T x^i + \theta_0}{\|\vec{\theta}\|}$$
This was for the positive examples. We can derive the same for the negatives, but with a negative sign. Hence the quantity:
$$\gamma^i = y^i \left(\frac{\vec{\theta}^T}{\|\vec{\theta}\|} x^i + \frac{\theta_0}{\|\vec{\theta}\|}\right)$$
will always be positive. To maximize the distance of the line from all points we maximize the worst case, that is:
$$\gamma = \min_i \gamma^i$$

$$\gamma = \frac{\hat{\gamma}}{\|\vec{\theta}\|}$$
Note that if $\|\vec{\theta}\| = 1$ then $\hat{\gamma}^i = \gamma^i$: the two margins coincide. The geometric margin is invariant to the scaling $\vec{\theta} \to 2\vec{\theta}$, $\theta_0 \to 2\theta_0$.
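A quick numeric check of $\gamma = \hat{\gamma}/\|\vec{\theta}\|$ and of the scale invariance, reusing the toy data from the previous sketch:

```python
import numpy as np

X = np.array([[2.0, 1.0], [1.0, -1.0], [-1.0, -2.0]])
y = np.array([+1, +1, -1])
theta = np.array([1.0, 1.0]); theta0 = 0.5

geom = y * (X @ theta + theta0) / np.linalg.norm(theta)   # geometric margins gamma^i
func = y * (X @ theta + theta0)                           # functional margins hat{gamma}^i
print(np.isclose(geom.min(), func.min() / np.linalg.norm(theta)))   # True

geom_scaled = y * (X @ (2 * theta) + 2 * theta0) / np.linalg.norm(2 * theta)
print(np.allclose(geom, geom_scaled))   # True: invariant to scaling (theta, theta_0)
```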

Optimal margin classifier
$$\max_{\gamma, \vec{\theta}, \theta_0} \; \gamma \qquad (1)$$
$$\text{s.t.} \quad y^i (\vec{\theta}^T x^i + \theta_0) \geq \gamma, \quad i = 1, \ldots, m \qquad (2)$$
$$\|\vec{\theta}\| = 1 \qquad (3)$$
Constraint (2) implements $\gamma = \min_i \gamma^i$. Constraint (3) is a nonconvex constraint, thanks to which the two margins coincide.