Machine Learning - MT 2016. 7. Classification: Generative Models. Varun Kanade, University of Oxford. October 31, 2016.

Announcements. Practical 1 submission: try to get signed off during the session itself; otherwise, do it in the next session. Exception: Practical 4 (firm deadline: Friday of Week 8 at noon). Sheet 2 is due this Friday at 12pm.

Recap: Supervised Learning - Regression. Discriminative model: linear model with Gaussian noise, $y = \mathbf{w}^T \mathbf{x} + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2)$, i.e. $p(y \mid \mathbf{w}, \mathbf{x}) = \mathcal{N}(y \mid \mathbf{w}^T \mathbf{x}, \sigma^2)$. Other noise models are possible, e.g., Laplace. Non-linearities via basis expansion. Regularisation to avoid overfitting: Ridge, Lasso. (Cross-)validation to choose hyperparameters. Optimisation algorithms for model fitting: least squares, Ridge, Lasso. [Figure: timeline 1800-2016, Gauss and Legendre.]

Supervised Learning - Classification. In classification problems, the target/output $y$ is a category, $y \in \{1, 2, \ldots, C\}$. The input is $\mathbf{x} = (x_1, \ldots, x_D)$, where each feature is either categorical, $x_i \in \{1, \ldots, K\}$, or real-valued, $x_i \in \mathbb{R}$. Discriminative model: only model the conditional distribution $p(y \mid \mathbf{x}, \theta)$. Generative model: model the full joint distribution $p(\mathbf{x}, y \mid \theta)$.

Prediction Using Generative Models. Suppose we have a model $p(\mathbf{x}, y \mid \theta)$ of the joint distribution over inputs and outputs. Given a new input $\mathbf{x}_{\mathrm{new}}$, we can write the conditional distribution for $y$: for $c \in \{1, \ldots, C\}$,
$$p(y = c \mid \mathbf{x}_{\mathrm{new}}, \theta) = \frac{p(y = c \mid \theta)\, p(\mathbf{x}_{\mathrm{new}} \mid y = c, \theta)}{\sum_{c'=1}^{C} p(y = c' \mid \theta)\, p(\mathbf{x}_{\mathrm{new}} \mid y = c', \theta)}$$
The numerator is simply the joint probability $p(\mathbf{x}_{\mathrm{new}}, c \mid \theta)$ and the denominator is the marginal probability $p(\mathbf{x}_{\mathrm{new}} \mid \theta)$. We can pick $\hat{y} = \operatorname{argmax}_c\, p(y = c \mid \mathbf{x}_{\mathrm{new}}, \theta)$.
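To make the prediction rule concrete, here is a minimal sketch in Python (not from the lecture) that turns class priors and class-conditional likelihoods into a posterior and an argmax prediction; the three-class numbers are made up for illustration.

```python
import numpy as np

def predict_posterior(log_priors, log_likelihoods):
    """Bayes rule in log space: combine log p(y=c) and log p(x_new | y=c), then normalise."""
    log_joint = log_priors + log_likelihoods      # log p(x_new, y=c | theta)
    log_joint = log_joint - log_joint.max()       # subtract the max for numerical stability
    joint = np.exp(log_joint)
    return joint / joint.sum()                    # p(y=c | x_new, theta)

# Hypothetical numbers for a 3-class problem (priors and log-likelihoods are made up).
log_priors = np.log(np.array([0.5, 0.4, 0.1]))    # log p(y=c | theta)
log_likelihoods = np.array([-3.2, -2.1, -5.0])    # log p(x_new | y=c, theta) from some fitted model
posterior = predict_posterior(log_priors, log_likelihoods)
print(posterior, "prediction:", posterior.argmax())
```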

Toy Example. Predict voter preference in US elections.
Voted in 2012? | Annual Income | State | Candidate Choice
Y | 50K   | OK | Clinton
N | 173K  | CA | Clinton
Y | 80K   | NJ | Trump
Y | 150K  | WA | Clinton
N | 25K   | WV | Johnson
Y | 85K   | IL | Clinton
... | ... | ... | ...
Y | 1050K | NY | Trump
N | 35K   | CA | Trump
N | 100K  | NY | ?

Classification: Generative Model. In order to fit a generative model, we'll express the joint distribution as $p(\mathbf{x}, y \mid \theta, \pi) = p(y \mid \pi)\, p(\mathbf{x} \mid y, \theta)$. To model $p(y \mid \pi)$, we'll use parameters $\pi_c$ such that $\sum_c \pi_c = 1$ and $p(y = c \mid \pi) = \pi_c$. For the class-conditional densities, for each class $c = 1, \ldots, C$, we will have a model $p(\mathbf{x} \mid y = c, \theta_c)$.

Classification: Generative Model. So in our example, $p(y = \text{clinton} \mid \pi) = \pi_{\text{clinton}}$, $p(y = \text{trump} \mid \pi) = \pi_{\text{trump}}$, and $p(y = \text{johnson} \mid \pi) = \pi_{\text{johnson}}$. Given that a voter supports Trump, $p(\mathbf{x} \mid y = \text{trump}, \theta_{\text{trump}})$ models the distribution over $\mathbf{x}$; similarly, we have $p(\mathbf{x} \mid y = \text{clinton}, \theta_{\text{clinton}})$ and $p(\mathbf{x} \mid y = \text{johnson}, \theta_{\text{johnson}})$. We need to pick a model for $p(\mathbf{x} \mid y = c, \theta_c)$ and estimate the parameters $\pi_c, \theta_c$ for $c = 1, \ldots, C$.

Naïve Bayes Classifier (NBC). Assume that the features are conditionally independent given the class label:
$$p(\mathbf{x} \mid y = c, \theta_c) = \prod_{j=1}^{D} p(x_j \mid y = c, \theta_{jc})$$
So, for example, we are modelling that, conditioned on being a Trump supporter, the state, previous voting behaviour and annual income are conditionally independent. Clearly, this assumption is naïve and essentially never satisfied, but model fitting becomes very easy. Although the generative model is clearly inadequate, it actually works quite well: the goal is predicting the class, not modelling the data!
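In log space the naïve Bayes factorisation turns the product over features into a sum. A minimal sketch, assuming binary features and made-up per-feature Bernoulli parameters for a single class $c$:

```python
import numpy as np

def log_p_x_given_c(x, theta_c):
    """Naive Bayes: log p(x | y=c) = sum_j log p(x_j | y=c), with Bernoulli p(x_j=1 | y=c) = theta_c[j]."""
    return np.sum(x * np.log(theta_c) + (1 - x) * np.log(1 - theta_c))

x = np.array([1, 0, 1, 1])                 # a D=4 binary feature vector
theta_c = np.array([0.7, 0.2, 0.9, 0.5])   # hypothetical parameters theta_jc for class c
print(log_p_x_given_c(x, theta_c))
```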

Naïve Bayes Classifier (NBC). Real-valued features: $x_j$ is real-valued, e.g., annual income. Example: use a Gaussian model, so $\theta_{jc} = (\mu_{jc}, \sigma^2_{jc})$. Other distributions can be used, e.g., age is probably not Gaussian! Categorical features: $x_j$ is categorical with values in $\{1, \ldots, K\}$. Use the multinoulli distribution, i.e., $x_j = i$ with probability $\mu_{jc,i}$, where $\sum_{i=1}^{K} \mu_{jc,i} = 1$. In the special case where $x_j \in \{0, 1\}$, use a single parameter $\theta_{jc} \in [0, 1]$.
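Different features can use different class-conditional models. Below is a sketch in the spirit of the voter example, with made-up parameters: a Bernoulli for "voted in 2012", a Gaussian for annual income (in thousands), and a multinoulli over a hypothetical four-state set.

```python
import numpy as np
from scipy.stats import norm

def log_p_x_given_c(voted, income, state_idx, theta_c):
    """Sum of per-feature log-probabilities for one class, mixing feature types."""
    log_p = 0.0
    # Bernoulli feature: p(x=1) = theta, p(x=0) = 1 - theta
    log_p += np.log(theta_c["voted"] if voted == 1 else 1.0 - theta_c["voted"])
    # Gaussian feature
    log_p += norm.logpdf(income, loc=theta_c["income_mu"], scale=theta_c["income_sigma"])
    # Multinoulli feature: a probability vector over K categories
    log_p += np.log(theta_c["state_probs"][state_idx])
    return log_p

theta_clinton = {
    "voted": 0.6,                                   # hypothetical Bernoulli parameter
    "income_mu": 90.0, "income_sigma": 50.0,        # hypothetical Gaussian parameters (income in thousands)
    "state_probs": np.array([0.3, 0.3, 0.2, 0.2]),  # hypothetical multinoulli over 4 states
}
print(log_p_x_given_c(voted=1, income=50.0, state_idx=0, theta_c=theta_clinton))
```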

Naïve Bayes Classifier (NBC). Assume that all the features are binary, i.e., every $x_j \in \{0, 1\}$. If we have $C$ classes, overall we have only $O(CD)$ parameters: one $\theta_{jc}$ for each $j = 1, \ldots, D$ and $c = 1, \ldots, C$. Without the conditional independence assumption, we would have to assign a probability to each of the $2^D$ combinations of feature values, and thus we would have $O(C\, 2^D)$ parameters! The naïve assumption breaks the curse of dimensionality and avoids overfitting.
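A quick arithmetic illustration of the two parameter counts, with $C$ and $D$ chosen arbitrarily:

```python
# Parameter counts for binary features (illustrative arithmetic; C and D are made up).
C, D = 3, 20
naive_bayes_params = C * D        # one Bernoulli parameter per (feature, class) pair
full_joint_params = C * 2 ** D    # a probability for each of the 2^D feature combinations, per class
print(naive_bayes_params, full_joint_params)   # 60 vs 3145728
```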

Maximum Likelihood for the NBC. Let us suppose we have data $(\mathbf{x}_i, y_i)_{i=1}^{N}$ drawn i.i.d. from some joint distribution $p(\mathbf{x}, y)$. The probability of a single datapoint is given by:
$$p(\mathbf{x}_i, y_i \mid \theta, \pi) = p(y_i \mid \pi)\, p(\mathbf{x}_i \mid \theta, y_i) = \prod_{c=1}^{C} \pi_c^{\mathbb{I}(y_i = c)} \prod_{c=1}^{C} \prod_{j=1}^{D} p(x_{ij} \mid \theta_{jc})^{\mathbb{I}(y_i = c)}$$
Let $N_c$ be the number of datapoints with $y_i = c$, so that $\sum_{c=1}^{C} N_c = N$. We write the log-likelihood of the data as:
$$\log p(\mathcal{D} \mid \theta, \pi) = \sum_{c=1}^{C} N_c \log \pi_c + \sum_{c=1}^{C} \sum_{j=1}^{D} \sum_{i : y_i = c} \log p(x_{ij} \mid \theta_{jc})$$
The log-likelihood is easily separated into sums involving different parameters!

Maximum Likelihood for the NBC. We have the log-likelihood for the NBC:
$$\log p(\mathcal{D} \mid \theta, \pi) = \sum_{c=1}^{C} N_c \log \pi_c + \sum_{c=1}^{C} \sum_{j=1}^{D} \sum_{i : y_i = c} \log p(x_{ij} \mid \theta_{jc})$$
Let us obtain estimates for $\pi$. We get the following optimisation problem: maximise $\sum_{c=1}^{C} N_c \log \pi_c$ subject to $\sum_{c=1}^{C} \pi_c = 1$. This constrained optimisation problem can be solved using the method of Lagrange multipliers.

Constrained Optimisation Problem. Suppose $f(\mathbf{z})$ is some function that we want to maximise subject to $g(\mathbf{z}) = 0$. Constrained objective: $\operatorname{argmax}_{\mathbf{z}} f(\mathbf{z})$ subject to $g(\mathbf{z}) = 0$. Lagrangian (dual) form: $\Lambda(\mathbf{z}, \lambda) = f(\mathbf{z}) + \lambda g(\mathbf{z})$. Any optimal solution to the constrained problem is a stationary point of $\Lambda(\mathbf{z}, \lambda)$.

Constrained Optimisation Problem. Any optimal solution to the constrained problem is a stationary point of $\Lambda(\mathbf{z}, \lambda) = f(\mathbf{z}) + \lambda g(\mathbf{z})$:
$$\nabla_{\mathbf{z}} \Lambda(\mathbf{z}, \lambda) = 0 \;\Rightarrow\; \nabla_{\mathbf{z}} f = -\lambda \nabla_{\mathbf{z}} g$$
$$\frac{\partial \Lambda(\mathbf{z}, \lambda)}{\partial \lambda} = 0 \;\Rightarrow\; g(\mathbf{z}) = 0$$

Maximum Likelihood for NBC. Recall that we want to solve: maximise $\sum_{c=1}^{C} N_c \log \pi_c$ subject to $\sum_{c=1}^{C} \pi_c - 1 = 0$. We can write the Lagrangian form:
$$\Lambda(\pi, \lambda) = \sum_{c=1}^{C} N_c \log \pi_c + \lambda \left( \sum_{c=1}^{C} \pi_c - 1 \right)$$
We can write the partial derivatives and set them to 0:
$$\frac{\partial \Lambda(\pi, \lambda)}{\partial \pi_c} = \frac{N_c}{\pi_c} + \lambda = 0, \qquad \frac{\partial \Lambda(\pi, \lambda)}{\partial \lambda} = \sum_{c=1}^{C} \pi_c - 1 = 0$$

Maximum Likelihood for NBC. The solution is obtained by setting $\frac{N_c}{\pi_c} + \lambda = 0$, and so $\pi_c = -\frac{N_c}{\lambda}$. Using the second condition, $\sum_{c=1}^{C} \pi_c - 1 = -\sum_{c=1}^{C} \frac{N_c}{\lambda} - 1 = 0$, and hence $\lambda = -\sum_{c=1}^{C} N_c = -N$. Thus, we get the estimates $\pi_c = \frac{N_c}{N}$.

Maximum Likelihood for the NBC. We have the log-likelihood for the NBC:
$$\log p(\mathcal{D} \mid \theta, \pi) = \sum_{c=1}^{C} N_c \log \pi_c + \sum_{c=1}^{C} \sum_{j=1}^{D} \sum_{i : y_i = c} \log p(x_{ij} \mid \theta_{jc})$$
We obtained the estimates $\pi_c = \frac{N_c}{N}$. We can estimate $\theta_{jc}$ by taking a similar approach: to estimate $\theta_{jc}$ we only need to use the $j$th feature of the examples with $y_i = c$. The estimates depend on the model, e.g., Gaussian, Bernoulli, Multinoulli, etc. Fitting the NBC is very fast!
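Putting the pieces together, here is a minimal sketch (not the lecture's code) of fitting an NBC with binary features by maximum likelihood: the class priors are the empirical frequencies $N_c/N$, and each $\theta_{jc}$ is the fraction of class-$c$ examples whose $j$th feature equals 1, which is the standard Bernoulli MLE (not spelled out on the slide). The data are synthetic and purely illustrative.

```python
import numpy as np

def fit_bernoulli_nbc(X, y, C):
    """X: (N, D) binary matrix, y: (N,) labels in {0,...,C-1}. Returns MLE priors and per-feature thetas."""
    priors = np.array([(y == c).mean() for c in range(C)])          # pi_c = N_c / N
    thetas = np.array([X[y == c].mean(axis=0) for c in range(C)])   # theta_jc = fraction of ones
    return priors, thetas

def predict(x, priors, thetas):
    """Argmax of log p(y=c) + sum_j log p(x_j | y=c) for a single binary feature vector x."""
    log_lik = (np.log(thetas) * x + np.log(1 - thetas) * (1 - x)).sum(axis=1)
    return np.argmax(np.log(priors) + log_lik)

# Tiny synthetic dataset (made up for illustration).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 5))
y = rng.integers(0, 3, size=100)
priors, thetas = fit_bernoulli_nbc(X, y, C=3)
print(priors, "prediction for first example:", predict(X[0], priors, thetas))
```

In practice one adds Laplace smoothing to the counts so that no estimated $\theta_{jc}$ is exactly 0 or 1, which would make a log-likelihood term $-\infty$; the formulas above are the unsmoothed maximum likelihood estimates.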

Summary: Naïve Bayes Classifier. Generative model: fit the joint distribution $p(\mathbf{x}, y \mid \theta)$. Make the naïve, and obviously untrue, assumption that features are conditionally independent given the class:
$$p(\mathbf{x} \mid y = c, \theta_c) = \prod_{j=1}^{D} p(x_j \mid y = c, \theta_{jc})$$
Despite this, these classifiers often work quite well in practice. The conditional independence assumption reduces the number of parameters and avoids overfitting. Fitting the model is very straightforward, and it is easy to mix and match different models for different features.

Outline. Generative Models for Classification: Naïve Bayes Model; Gaussian Discriminant Analysis.

Generative Model: Gaussian Discriminant Analysis. Recall the form of the joint distribution in a generative model: $p(\mathbf{x}, y \mid \theta, \pi) = p(y \mid \pi)\, p(\mathbf{x} \mid y, \theta)$. For the class prior, we use parameters $\pi_c$ such that $\sum_c \pi_c = 1$ and $p(y = c \mid \pi) = \pi_c$. Suppose $\mathbf{x} \in \mathbb{R}^D$; we model the class-conditional density for class $c = 1, \ldots, C$ as a multivariate normal distribution with mean $\mu_c$ and covariance matrix $\Sigma_c$:
$$p(\mathbf{x} \mid y = c, \theta_c) = \mathcal{N}(\mathbf{x} \mid \mu_c, \Sigma_c)$$
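A minimal sketch of the GDA posterior, assuming made-up priors, means and covariances for a two-class problem in two dimensions, with scipy's multivariate normal as the class-conditional density:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-class, 2D parameters (made up for illustration).
priors = np.array([0.6, 0.4])
means = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
covs = [np.eye(2), np.array([[1.0, 0.5], [0.5, 2.0]])]

def posterior(x):
    """p(y=c | x) via Bayes rule with Gaussian class-conditionals."""
    log_joint = np.array([
        np.log(priors[c]) + multivariate_normal.logpdf(x, mean=means[c], cov=covs[c])
        for c in range(2)
    ])
    log_joint -= log_joint.max()          # numerical stability
    p = np.exp(log_joint)
    return p / p.sum()

print(posterior(np.array([1.0, 1.0])))
```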

Quadratic Discriminant Analysis (QDA). Let's first see what the prediction rule for this model is:
$$p(y = c \mid \mathbf{x}_{\mathrm{new}}, \theta) = \frac{p(y = c \mid \theta)\, p(\mathbf{x}_{\mathrm{new}} \mid y = c, \theta)}{\sum_{c'=1}^{C} p(y = c' \mid \theta)\, p(\mathbf{x}_{\mathrm{new}} \mid y = c', \theta)}$$
When the densities $p(\mathbf{x} \mid y = c, \theta_c)$ are multivariate normal, we get
$$p(y = c \mid \mathbf{x}, \theta) = \frac{\pi_c\, |2\pi \Sigma_c|^{-\frac{1}{2}} \exp\left(-\tfrac{1}{2}(\mathbf{x} - \mu_c)^T \Sigma_c^{-1} (\mathbf{x} - \mu_c)\right)}{\sum_{c'=1}^{C} \pi_{c'}\, |2\pi \Sigma_{c'}|^{-\frac{1}{2}} \exp\left(-\tfrac{1}{2}(\mathbf{x} - \mu_{c'})^T \Sigma_{c'}^{-1} (\mathbf{x} - \mu_{c'})\right)}$$
The denominator is the same for all classes, so the boundary between classes $c$ and $c'$ is given by
$$\frac{\pi_c\, |2\pi \Sigma_c|^{-\frac{1}{2}} \exp\left(-\tfrac{1}{2}(\mathbf{x} - \mu_c)^T \Sigma_c^{-1} (\mathbf{x} - \mu_c)\right)}{\pi_{c'}\, |2\pi \Sigma_{c'}|^{-\frac{1}{2}} \exp\left(-\tfrac{1}{2}(\mathbf{x} - \mu_{c'})^T \Sigma_{c'}^{-1} (\mathbf{x} - \mu_{c'})\right)} = 1$$
Thus the boundaries are quadratic surfaces, hence the method is called quadratic discriminant analysis.
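The prediction only needs the per-class discriminant $\log \pi_c + \log \mathcal{N}(\mathbf{x} \mid \mu_c, \Sigma_c)$, which is quadratic in $\mathbf{x}$. A sketch with made-up parameters for two classes:

```python
import numpy as np

def qda_discriminant(x, pi_c, mu_c, Sigma_c):
    """log pi_c + log N(x | mu_c, Sigma_c), dropping the class-independent -D/2 * log(2*pi) term."""
    diff = x - mu_c
    _, logdet = np.linalg.slogdet(Sigma_c)
    quad = diff @ np.linalg.solve(Sigma_c, diff)     # (x - mu_c)^T Sigma_c^{-1} (x - mu_c)
    return np.log(pi_c) - 0.5 * logdet - 0.5 * quad

# Two hypothetical classes; the decision boundary is where the two discriminants are equal,
# which is quadratic in x because the quadratic terms do not cancel when the covariances differ.
x = np.array([1.0, 0.5])
d1 = qda_discriminant(x, 0.5, np.zeros(2), np.eye(2))
d2 = qda_discriminant(x, 0.5, np.ones(2), np.diag([2.0, 0.5]))
print(d1, d2, "predict class 1" if d1 > d2 else "predict class 2")
```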

Quadratic Discriminant Analysis (QDA). [Figure]

Linear Discriminant Analysis. A special case is when the covariance matrices are shared or tied across the different classes, $\Sigma_c = \Sigma$. We can write
$$p(y = c \mid \mathbf{x}, \theta) \propto \pi_c \exp\left(-\tfrac{1}{2}(\mathbf{x} - \mu_c)^T \Sigma^{-1} (\mathbf{x} - \mu_c)\right) = \exp\left(\mu_c^T \Sigma^{-1} \mathbf{x} - \tfrac{1}{2} \mu_c^T \Sigma^{-1} \mu_c + \log \pi_c\right) \exp\left(-\tfrac{1}{2} \mathbf{x}^T \Sigma^{-1} \mathbf{x}\right)$$
Let us set
$$\gamma_c = -\tfrac{1}{2} \mu_c^T \Sigma^{-1} \mu_c + \log \pi_c, \qquad \beta_c = \Sigma^{-1} \mu_c$$
and so
$$p(y = c \mid \mathbf{x}, \theta) \propto \exp\left(\beta_c^T \mathbf{x} + \gamma_c\right)$$
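A small numerical check, with made-up parameters, that the linear scores $\beta_c^T \mathbf{x} + \gamma_c$ and the direct Gaussian log-joint differ only by a class-independent constant, so they yield the same posterior and the same prediction:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical shared covariance, class means and priors for a 2-class, 2D problem.
Sigma = np.array([[1.0, 0.3], [0.3, 1.5]])
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
pis = [0.7, 0.3]

Sigma_inv = np.linalg.inv(Sigma)
betas = [Sigma_inv @ mu for mu in mus]
gammas = [-0.5 * mu @ Sigma_inv @ mu + np.log(pi) for mu, pi in zip(mus, pis)]

x = np.array([1.0, -0.5])
linear_scores = np.array([b @ x + g for b, g in zip(betas, gammas)])
direct_scores = np.array([
    np.log(pi) + multivariate_normal.logpdf(x, mean=mu, cov=Sigma) for mu, pi in zip(mus, pis)
])
# The two score vectors differ by the same constant for every class, so softmax/argmax agree.
print(linear_scores - direct_scores)
```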

Linear Discriminant Analysis (LDA) & Softmax. Recall that we wrote $p(y = c \mid \mathbf{x}, \theta) \propto \exp\left(\beta_c^T \mathbf{x} + \gamma_c\right)$, and so
$$p(y = c \mid \mathbf{x}, \theta) = \frac{\exp\left(\beta_c^T \mathbf{x} + \gamma_c\right)}{\sum_{c'} \exp\left(\beta_{c'}^T \mathbf{x} + \gamma_{c'}\right)} = \operatorname{softmax}(\eta)_c$$
where $\eta = [\beta_1^T \mathbf{x} + \gamma_1, \ldots, \beta_C^T \mathbf{x} + \gamma_C]$. Softmax: the softmax function maps a set of numbers to a probability distribution with its mode at the maximum, e.g., $\operatorname{softmax}([1, 2, 3]) \approx [0.090, 0.245, 0.665]$ and $\operatorname{softmax}([10, 20, 30]) \approx [2 \times 10^{-9}, 4 \times 10^{-5}, 1]$.
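A minimal softmax implementation that reproduces the two examples above (the max-subtraction is a standard trick for numerical stability, not something the slide specifies):

```python
import numpy as np

def softmax(eta):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    z = np.exp(eta - np.max(eta))
    return z / z.sum()

print(softmax(np.array([1.0, 2.0, 3.0])))     # ~[0.090, 0.245, 0.665]
print(softmax(np.array([10.0, 20.0, 30.0])))  # ~[2e-09, 4.5e-05, 1.0]
```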

QDA and LDA. [Figure]

Two-Class LDA. When we have only 2 classes, say 0 and 1,
$$p(y = 1 \mid \mathbf{x}, \theta) = \frac{\exp\left(\beta_1^T \mathbf{x} + \gamma_1\right)}{\exp\left(\beta_1^T \mathbf{x} + \gamma_1\right) + \exp\left(\beta_0^T \mathbf{x} + \gamma_0\right)} = \frac{1}{1 + \exp\left(-\left((\beta_1 - \beta_0)^T \mathbf{x} + (\gamma_1 - \gamma_0)\right)\right)} = \operatorname{sigmoid}\left((\beta_1 - \beta_0)^T \mathbf{x} + (\gamma_1 - \gamma_0)\right)$$
Sigmoid function: the sigmoid function is defined as
$$\operatorname{sigmoid}(t) = \frac{1}{1 + e^{-t}}$$
[Figure: plot of the sigmoid function.]
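A quick numerical check, with made-up class scores, that the two-class softmax reduces to a sigmoid of the score difference, as derived above:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Hypothetical scores a1 = beta_1^T x + gamma_1 and a0 = beta_0^T x + gamma_0.
a1, a0 = 1.3, -0.4
two_class_softmax = np.exp(a1) / (np.exp(a1) + np.exp(a0))
print(two_class_softmax, sigmoid(a1 - a0))   # the two values agree
```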

MLE for QDA (or LDA). We can write the log-likelihood given data $\mathcal{D} = (\mathbf{x}_i, y_i)_{i=1}^{N}$ as:
$$\log p(\mathcal{D} \mid \theta) = \sum_{c=1}^{C} N_c \log \pi_c + \sum_{c=1}^{C} \sum_{i : y_i = c} \log \mathcal{N}(\mathbf{x}_i \mid \mu_c, \Sigma_c)$$
As in the case of Naïve Bayes, we get $\pi_c = \frac{N_c}{N}$. For the other parameters, it is possible to show that
$$\mu_c = \frac{1}{N_c} \sum_{i : y_i = c} \mathbf{x}_i, \qquad \Sigma_c = \frac{1}{N_c} \sum_{i : y_i = c} (\mathbf{x}_i - \mu_c)(\mathbf{x}_i - \mu_c)^T$$
(See Chapter 4.1 of Murphy for details.)
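A minimal sketch of these MLE formulas on made-up two-dimensional data (no regularisation; every class is assumed to appear in the labels):

```python
import numpy as np

def fit_qda(X, y, C):
    """MLE for QDA: priors N_c/N, per-class means, per-class covariances."""
    priors = np.array([(y == c).mean() for c in range(C)])
    means = np.array([X[y == c].mean(axis=0) for c in range(C)])
    covs = []
    for c in range(C):
        diff = X[y == c] - means[c]
        covs.append(diff.T @ diff / diff.shape[0])   # (1/N_c) * sum_i (x_i - mu_c)(x_i - mu_c)^T
    return priors, means, covs

# Made-up 2D data from two classes, for illustration only.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(3.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
priors, means, covs = fit_qda(X, y, C=2)
print(priors, means[1], covs[1])
```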

How to Prevent Overfitting. The number of parameters in the model is roughly $C \cdot D^2$, and in high dimensions this can lead to overfitting. Options: use diagonal covariance matrices (basically Naïve Bayes); use weight tying, a.k.a. parameter sharing (LDA vs QDA); use Bayesian approaches; use a discriminative classifier (and regularise if needed).
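Two of these options amount to simple modifications of the fitted covariances; a sketch with made-up per-class covariances and class counts:

```python
import numpy as np

# Hypothetical per-class covariances and class counts.
Sigma_1 = np.array([[2.0, 0.8], [0.8, 1.0]])
Sigma_2 = np.array([[1.0, -0.3], [-0.3, 3.0]])
N1, N2 = 40, 60

diag_1 = np.diag(np.diag(Sigma_1))                  # diagonal covariance: drop off-diagonal terms (naive-Bayes-style)
tied = (N1 * Sigma_1 + N2 * Sigma_2) / (N1 + N2)    # tied covariance: pooled estimate shared by both classes (LDA)
print(diag_1)
print(tied)
```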