The generative approach to classification. Generative models. CSE 250B

Generative modeling of data. Instructor: Taylor Berg-Kirkpatrick. Slides: Sanjoy Dasgupta.


The generative approach to classification

The learning process: fit a probability distribution to each class, individually. To classify a new point: ask which of these distributions it was most likely to have come from.

A classification problem

You have a bottle of wine whose label is missing. Which winery is it from, 1, 2, or 3? Solve this problem using visual and chemical features of the wine.

Generative models

Example: data space X = R, classes/labels Y = {1, 2, 3}, with class weights π_1 = 10%, π_2 = 50%, π_3 = 40% and class-conditional densities P_1(x), P_2(x), P_3(x). For each class j, we have the probability of that class, π_j = Pr(y = j), and the distribution of data in that class, P_j(x).

Overall joint distribution: Pr(x, y) = Pr(y) Pr(x | y) = π_y P_y(x). To classify a new x: pick the label y with largest Pr(x, y).
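To make the decision rule concrete, here is a minimal Python sketch. The class weights come from the example above; the three densities are hypothetical stand-ins for the P_j(x) sketched in the slide's figure, so only the rule itself, pick the y maximizing π_y P_y(x), is from the text.

```python
from scipy.stats import norm

pi = {1: 0.10, 2: 0.50, 3: 0.40}                                 # class weights from the example
P = {1: norm(-2.0, 1.0), 2: norm(0.0, 1.0), 3: norm(3.0, 1.5)}   # assumed stand-in densities P_j

def classify(x):
    # pick the label y with the largest joint probability Pr(x, y) = pi_y * P_y(x)
    return max(pi, key=lambda y: pi[y] * P[y].pdf(x))

print(classify(0.5))   # -> 2
```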

The data set

Training set obtained from 130 bottles: winery 1, 43 bottles; winery 2, 51 bottles; winery 3, 36 bottles. For each bottle, 13 features: Alcohol, Malic acid, Ash, Alcalinity of ash, Magnesium, Total phenols, Flavanoids, Nonflavanoid phenols, Proanthocyanins, Color intensity, Hue, OD280/OD315 of diluted wines, Proline. Also, a separate test set of 48 labeled points.

Recall: the generative approach

For any data point x ∈ X and any candidate label j,

Pr(y = j | x) = Pr(y = j) Pr(x | y = j) / Pr(x) = π_j P_j(x) / Pr(x).

Optimal prediction: the class j with largest π_j P_j(x).

Fitting a generative model

Class weights: π_1 = 43/130 ≈ 0.33, π_2 = 51/130 ≈ 0.39, π_3 = 36/130 ≈ 0.28. We also need distributions P_1, P_2, P_3, one per class. Base these on a single feature: Alcohol.

The univariate Gaussian

The Gaussian N(µ, σ²) has mean µ, variance σ², and density function

p(x) = 1/√(2πσ²) · exp( −(x − µ)² / (2σ²) ).
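A minimal sketch of this fitting step, assuming hypothetical arrays x (each training bottle's Alcohol value) and y (its winery label); the maximum-likelihood estimates are simply the per-class empirical means and variances.

```python
import numpy as np

def fit_univariate_model(x, y):
    """Fit a class weight pi_j and a univariate Gaussian N(mu_j, sigma_j^2) to each class j."""
    model = {}
    for j in np.unique(y):
        xj = x[y == j]
        model[j] = (len(xj) / len(x),   # class weight pi_j
                    xj.mean(),          # mu_j
                    xj.var())           # sigma_j^2 (maximum likelihood, divides by n)
    return model

def classify(x_new, model):
    """Predict the class j with the largest pi_j * P_j(x_new)."""
    def score(j):
        pi, mu, var = model[j]
        return pi * np.exp(-(x_new - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return max(model, key=score)
```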

The distribution for winery 1

Single feature: Alcohol. Mean µ = 13.72, standard deviation σ = 0.44 (variance 0.20).

All three wineries

π_1 = 0.33 with P_1 = N(13.7, 0.20); π_2 = 0.39 with P_2 = N(12.3, 0.28); π_3 = 0.28 with P_3 = N(13.2, 0.27). To classify x: pick the j with highest π_j P_j(x). Test error: 14/48 ≈ 29%.

What if we use two features?

Why it helps to add features: better separation between the classes! The error rate drops from 29% to 8%.

The bivariate Gaussian

Model class 1 by a bivariate Gaussian, parametrized by mean µ = (13.7, 3.0) and covariance matrix

Σ = [ 0.20  0.06
      0.06  0.12 ].
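A rough sketch of the one-feature vs two-feature comparison. These are assumptions not stated in the slides: scikit-learn's bundled copy of this wine dataset, a random 130/48 split (the lecture's exact split is unspecified), and flavanoids as the second feature; the numbers will therefore only be in the same ballpark as those quoted above.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
cols = [0, 6]                                   # column 0: alcohol, column 6: flavanoids (assumed)
Xtr, Xte, ytr, yte = train_test_split(X[:, cols], y, test_size=48, random_state=0)

def test_error(features):
    # Fit pi_j, mu_j, Sigma_j per class on the chosen feature subset, then classify.
    model = {}
    for j in np.unique(ytr):
        Z = Xtr[ytr == j][:, features]
        model[j] = (len(Z) / len(Xtr), Z.mean(axis=0),
                    np.atleast_2d(np.cov(Z, rowvar=False)))
    pred = [max(model, key=lambda j: model[j][0] *
                multivariate_normal.pdf(x[features], model[j][1], model[j][2]))
            for x in Xte]
    return np.mean(np.array(pred) != yte)

print("alcohol only:      ", test_error([0]))
print("alcohol+flavanoids:", test_error([0, 1]))
```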

Dependence between two random variables

Suppose X_1 has mean µ_1 and X_2 has mean µ_2. We can measure the dependence between them by their covariance:

cov(X_1, X_2) = E[(X_1 − µ_1)(X_2 − µ_2)] = E[X_1 X_2] − µ_1 µ_2.

It is maximized when X_1 = X_2, in which case it equals var(X_1). It is at most std(X_1) std(X_2).

The bivariate (2-d) Gaussian

A distribution over (x_1, x_2) ∈ R², parametrized by a mean (µ_1, µ_2) ∈ R², where µ_1 = E(X_1) and µ_2 = E(X_2), and a covariance matrix

Σ = [ Σ_11  Σ_12
      Σ_21  Σ_22 ]

where Σ_11 = var(X_1), Σ_22 = var(X_2), and Σ_12 = Σ_21 = cov(X_1, X_2). The density is highest at the mean and falls off in ellipsoidal contours.

Density of the bivariate Gaussian

p(x_1, x_2) = 1/(2π |Σ|^(1/2)) · exp( −(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ) ),  where x − µ = (x_1 − µ_1, x_2 − µ_2).

Bivariate Gaussian: examples

In either case, the mean is (1, 1). The first example has Σ = [ 4 0 ; 0 1 ], giving axis-aligned contours; the second has Σ = [ 4 1.5 ; 1.5 1 ], where the nonzero covariance tilts the contours.
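A quick numerical check of these covariance identities, on hypothetical simulated data (not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=100_000)
x2 = 0.6 * x1 + 0.8 * rng.normal(size=100_000)        # correlated with x1

cov = np.mean(x1 * x2) - x1.mean() * x2.mean()        # E[X1 X2] - mu1 mu2
print(cov)                                            # close to 0.6
print(abs(cov) <= x1.std() * x2.std())                # |cov| <= std(X1) std(X2): True
```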

The decision boundary

Going from 1 to 2 features, the error rate drops from 29% to 8%. What kind of function is this decision boundary? And can we use more features?

The multivariate Gaussian

N(µ, Σ): a Gaussian in R^d, with mean µ ∈ R^d and covariance a d × d matrix Σ. It generates points X = (X_1, X_2, ..., X_d).

µ is the vector of coordinate-wise means: µ_1 = E X_1, µ_2 = E X_2, ..., µ_d = E X_d. Σ is a matrix containing all pairwise covariances: Σ_ij = Σ_ji = cov(X_i, X_j) for i ≠ j, and Σ_ii = var(X_i).

Density: p(x) = 1/((2π)^(d/2) |Σ|^(1/2)) · exp( −(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ) ).

Special case: independent features

Suppose the X_i are independent, with var(X_i) = σ_i². What is the covariance matrix Σ, and what is its inverse Σ⁻¹?

Diagonal Gaussian

Diagonal Gaussian: the X_i are independent, with variances σ_i². Then Σ = diag(σ_1², ..., σ_d²) (off-diagonal elements zero), and each X_i is an independent one-dimensional Gaussian N(µ_i, σ_i²):

Pr(x) = Pr(x_1) Pr(x_2) ··· Pr(x_d) = 1/((2π)^(d/2) σ_1 ··· σ_d) · exp( −∑_{i=1}^{d} (x_i − µ_i)² / (2σ_i²) ).

Contours of equal density: axis-aligned ellipsoids, centered at µ.
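A small sketch using only the formulas above: evaluate the multivariate density directly, and check that for a diagonal Σ it factorizes into a product of one-dimensional Gaussians.

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Density of N(mu, Sigma) at x, straight from the formula above."""
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.inv(Sigma) @ diff        # (x - mu)^T Sigma^{-1} (x - mu)
    return np.exp(-0.5 * quad) / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))

# For a diagonal Sigma, the density factorizes into d one-dimensional Gaussians.
mu, Sigma = np.array([1.0, 1.0]), np.diag([4.0, 1.0])
x = np.array([2.0, 0.5])
product = np.prod([np.exp(-(x[i] - mu[i]) ** 2 / (2 * Sigma[i, i]))
                   / np.sqrt(2 * np.pi * Sigma[i, i]) for i in range(2)])
print(np.isclose(mvn_pdf(x, mu, Sigma), product))    # True
```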

Even more special case: spherical Gaussian

The X_i are independent and all have the same variance σ², so Σ = σ² I_d = diag(σ², σ², ..., σ²) (diagonal elements σ², rest zero). Each X_i is an independent univariate Gaussian N(µ_i, σ²):

Pr(x) = Pr(x_1) Pr(x_2) ··· Pr(x_d) = 1/((2π)^(d/2) σ^d) · exp( −‖x − µ‖² / (2σ²) ).

The density at a point depends only on its distance from µ.

How to fit a Gaussian to data

Fit a Gaussian to data points x^(1), ..., x^(m) ∈ R^d. The empirical mean is

µ = (1/m) (x^(1) + ··· + x^(m)),

and the empirical covariance matrix has (i, j) entry

Σ_ij = (1/m) ∑_{k=1}^{m} x_i^(k) x_j^(k) − µ_i µ_j.

Back to the winery data

Recall the multivariate Gaussian N(µ, Σ): mean µ ∈ R^d, covariance a d × d matrix Σ, and density

p(x) = 1/((2π)^(d/2) |Σ|^(1/2)) · exp( −(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ) ).

If we write S = Σ⁻¹, then S is a d × d matrix and

(x − µ)ᵀ Σ⁻¹ (x − µ) = ∑_{i,j} S_ij (x_i − µ_i)(x_j − µ_j),

a quadratic function of x. Going from 1 to 2 features, the test error drops from 29% to 8%. With all 13 features, the test error rate goes to zero.
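A minimal sketch of this fitting formula, assuming the data points are the rows of an m × d array X; the comment at the end notes that numpy's built-in biased covariance gives the same answer.

```python
import numpy as np

def fit_gaussian(X):
    """Maximum-likelihood Gaussian fit to the rows of an (m, d) array X."""
    mu = X.mean(axis=0)
    # Sigma_ij = (1/m) sum_k x_i^(k) x_j^(k) - mu_i mu_j
    Sigma = (X.T @ X) / len(X) - np.outer(mu, mu)
    return mu, Sigma

# Equivalent: np.cov(X, rowvar=False, bias=True), which also divides by m rather than m - 1.
```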

Binary classification with a Gaussian generative model

Estimate the class probabilities π_1, π_2 and fit a Gaussian to each class: P_1 = N(µ_1, Σ_1), P_2 = N(µ_2, Σ_2). Given a new point x, predict class 1 if π_1 P_1(x) > π_2 P_2(x). This gives a linear or quadratic decision boundary.

Common covariance: Σ_1 = Σ_2 = Σ

Linear decision boundary: choose class 1 if

x · Σ⁻¹(µ_1 − µ_2) ≥ θ,   that is, w · x ≥ θ with w = Σ⁻¹(µ_1 − µ_2),

where θ is a threshold depending on the various parameters.

Example 1: spherical Gaussians with Σ = I_d and π_1 = π_2. The boundary is the bisector of the line joining the means, perpendicular to w = µ_1 − µ_2.

Example 2: again spherical, but now π_1 > π_2; the boundary shifts toward µ_2.

Example 3: non-spherical Σ, so w = Σ⁻¹(µ_1 − µ_2) need not point along µ_1 − µ_2.

Classification rule: w · x ≥ θ with w chosen as above. Common practice: fit θ to minimize training or validation error.
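A sketch of the common-covariance case, assuming X1 and X2 are hypothetical arrays holding the training points of each class; the threshold below comes from the exact log-likelihood ratio, though as noted above it is often re-tuned on validation data instead.

```python
import numpy as np

def fit_common_covariance(X1, X2):
    """Class weights, means, pooled covariance, and the linear rule w . x >= theta."""
    n1, n2 = len(X1), len(X2)
    pi1, pi2 = n1 / (n1 + n2), n2 / (n1 + n2)
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    Sigma = ((X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)) / (n1 + n2)
    w = np.linalg.solve(Sigma, mu1 - mu2)                 # w = Sigma^{-1} (mu1 - mu2)
    theta = 0.5 * (mu1 + mu2) @ w - np.log(pi1 / pi2)     # from the log-likelihood ratio
    return w, theta

def predict(x, w, theta):
    """Choose class 1 if w . x >= theta, otherwise class 2."""
    return 1 if x @ w >= theta else 2
```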

Different covariances: Σ_1 ≠ Σ_2

Quadratic boundary: choose class 1 if

xᵀ M x + wᵀ x ≥ θ,   where M = (1/2)(Σ_2⁻¹ − Σ_1⁻¹) and w = Σ_1⁻¹ µ_1 − Σ_2⁻¹ µ_2,

and θ is a threshold depending on the various parameters.

Example 1: Σ_1 = σ_1² I_d and Σ_2 = σ_2² I_d with σ_1 > σ_2.
Example 2: the same thing in d = 1, X = R: two univariate classes with means µ_1, µ_2.
Example 3: a parabolic boundary.

Multiclass discriminant analysis

k classes: weights π_1, ..., π_k and class-conditional densities P_j = N(µ_j, Σ_j). Each class has an associated quadratic function

f_j(x) = log( π_j P_j(x) ).

To classify a point x, pick arg max_j f_j(x). If Σ_1 = ··· = Σ_k, the boundaries are linear.
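A compact sketch of this multiclass rule, with per-class maximum-likelihood fits; the constant −(d/2) log 2π is dropped from f_j since it does not affect the arg max. The array names X and y are hypothetical training data.

```python
import numpy as np

def fit_gda(X, y):
    """Per-class weight pi_j, mean mu_j, and covariance Sigma_j (maximum likelihood)."""
    model = {}
    for j in np.unique(y):
        Xj = X[y == j]
        mu = Xj.mean(axis=0)
        model[j] = (len(Xj) / len(X), mu, (Xj - mu).T @ (Xj - mu) / len(Xj))
    return model

def classify(x, model):
    """Pick argmax_j f_j(x) with f_j(x) = log(pi_j P_j(x)), a quadratic function of x."""
    def f(j):
        pi, mu, Sigma = model[j]
        diff = x - mu
        _, logdet = np.linalg.slogdet(Sigma)
        return np.log(pi) - 0.5 * logdet - 0.5 * diff @ np.linalg.solve(Sigma, diff)
    return max(model, key=f)
```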