CS340 Machine learning Gaussian classifiers


Correlated features
Height and weight are not independent.

Multivariate Gaussian
The multivariate normal (MVN) density is
$$\mathcal{N}(x \mid \mu, \Sigma) \stackrel{\mathrm{def}}{=} \frac{1}{(2\pi)^{p/2}\,|\Sigma|^{1/2}} \exp\!\left[-\tfrac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)\right]$$
The exponent is the Mahalanobis distance between $x$ and $\mu$: $(x-\mu)^T \Sigma^{-1}(x-\mu)$.
$\Sigma$ is the covariance matrix, which is positive definite: $x^T \Sigma x > 0$ for all $x \neq 0$.
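
As a minimal sketch (not from the slides; all variable values are illustrative), the density can be evaluated directly from this formula in MATLAB:

% Evaluate the MVN density at a point, straight from the formula.
mu = [0; 0];                     % illustrative mean
Sigma = [2 0.8; 0.8 1];          % illustrative covariance (positive definite)
x = [1; -0.5];                   % query point
p = length(mu);
maha = (x - mu)' * (Sigma \ (x - mu));                        % squared Mahalanobis distance
dens = exp(-0.5 * maha) / ((2*pi)^(p/2) * sqrt(det(Sigma)));  % N(x | mu, Sigma)
% With the Statistics Toolbox, this matches mvnpdf(x', mu', Sigma).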

Bivariate Gaussian
The covariance matrix is
$$\Sigma = \begin{pmatrix} \sigma_x^2 & \rho\sigma_x\sigma_y \\ \rho\sigma_x\sigma_y & \sigma_y^2 \end{pmatrix}$$
where the correlation coefficient is
$$\rho = \frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$$
and satisfies $-1 \le \rho \le 1$. The density (zero-mean case) is
$$p(x,y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}} \exp\!\left(-\frac{1}{2(1-\rho^2)}\left(\frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} - \frac{2\rho xy}{\sigma_x\sigma_y}\right)\right)$$
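
To connect the two parameterizations, here is a small sketch (values are illustrative) that builds $\Sigma$ from $\sigma_x$, $\sigma_y$, $\rho$ and checks that the scalar formula agrees with the general MVN density:

% Bivariate Gaussian: scalar formula vs. general MVN formula (zero mean).
sx = 2; sy = 1; rho = 0.5;                    % illustrative values
Sigma = [sx^2 rho*sx*sy; rho*sx*sy sy^2];
x = 0.7; y = -0.3;
p1 = 1/(2*pi*sx*sy*sqrt(1-rho^2)) * ...
     exp(-1/(2*(1-rho^2)) * (x^2/sx^2 + y^2/sy^2 - 2*rho*x*y/(sx*sy)));
v = [x; y];
p2 = exp(-0.5 * v' * (Sigma \ v)) / (2*pi*sqrt(det(Sigma)));  % same value as p1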

Spherical, diagonal, full covariance
$$\Sigma = \begin{pmatrix} \sigma^2 & 0 \\ 0 & \sigma^2 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \sigma_x^2 & 0 \\ 0 & \sigma_y^2 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \sigma_x^2 & \rho\sigma_x\sigma_y \\ \rho\sigma_x\sigma_y & \sigma_y^2 \end{pmatrix}$$

Surface plots (figure)

Generative classifier
A generative classifier is one that defines a class-conditional density $p(x \mid y=c)$ and combines this with a class prior $p(y=c)$ to compute the class posterior
$$p(y=c \mid x) = \frac{p(x \mid y=c)\,p(y=c)}{\sum_{c'} p(x \mid y=c')\,p(y=c')}$$
Examples:
Naïve Bayes: $p(x \mid y=c) = \prod_{j=1}^{d} p(x_j \mid y=c)$
Gaussian classifiers: $p(x \mid y=c) = \mathcal{N}(x \mid \mu_c, \Sigma_c)$
The alternative is a discriminative classifier, which estimates $p(y=c \mid x)$ directly.
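
A minimal sketch of the generative recipe (the likelihood and prior values are made up for illustration):

% Generative classifier: combine class-conditional likelihoods with priors.
lik   = [0.02; 0.10; 0.01];      % p(x | y=c) at one test point, classes c = 1..3
prior = [0.5; 0.3; 0.2];         % p(y=c)
joint = lik .* prior;            % numerator of Bayes' rule
post  = joint / sum(joint);      % class posterior p(y=c | x)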

Naïve Bayes with Bernoulli features
Consider this class-conditional density:
$$p(x \mid y=c) = \prod_{i=1}^{d} \theta_{ic}^{I(x_i=1)} (1-\theta_{ic})^{I(x_i=0)}$$
The resulting class posterior (using the plug-in rule) has the form
$$p(y=c \mid x) = \frac{p(y=c)\,p(x \mid y=c)}{p(x)} = \frac{\pi_c \prod_{i=1}^{d} \theta_{ic}^{I(x_i=1)} (1-\theta_{ic})^{I(x_i=0)}}{p(x)}$$
This can be rewritten as
$$p(y=c \mid x, \theta, \pi) = \frac{p(x \mid y=c)\,p(y=c)}{\sum_{c'} p(x \mid y=c')\,p(y=c')} = \frac{\exp[\log p(x \mid y=c) + \log p(y=c)]}{\sum_{c'} \exp[\log p(x \mid y=c') + \log p(y=c')]}$$
$$= \frac{\exp\!\left[\log \pi_c + \sum_i I(x_i=1)\log\theta_{ic} + I(x_i=0)\log(1-\theta_{ic})\right]}{\sum_{c'} \exp\!\left[\log \pi_{c'} + \sum_i I(x_i=1)\log\theta_{ic'} + I(x_i=0)\log(1-\theta_{ic'})\right]}$$
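
A sketch of this posterior computation in log space (the parameter values are illustrative, not estimated from data):

% Bernoulli naive Bayes class posterior, computed in log space.
theta = [0.8 0.1; 0.5 0.5; 0.2 0.9];   % theta(i,c) = p(x_i = 1 | y = c), d = 3, C = 2
ppi   = [0.6 0.4];                     % class priors pi_c
x     = [1 0 1];                       % one binary feature vector
logpost = zeros(1, 2);
for c = 1:2
    logpost(c) = log(ppi(c)) + sum(x' .* log(theta(:,c)) + (1 - x') .* log(1 - theta(:,c)));
end
post = exp(logpost - max(logpost));    % exponentiate after subtracting the max
post = post / sum(post);               % p(y = c | x)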

Form of the class posterior
From the previous slide,
$$p(y=c \mid x, \theta, \pi) \propto \exp\!\left[\log \pi_c + \sum_i I(x_i=1)\log\theta_{ic} + I(x_i=0)\log(1-\theta_{ic})\right]$$
Define
$$x' = [1,\, I(x_1=1),\, I(x_1=0),\, \ldots,\, I(x_d=1),\, I(x_d=0)]$$
$$\beta_c = [\log\pi_c,\, \log\theta_{1c},\, \log(1-\theta_{1c}),\, \ldots,\, \log\theta_{dc},\, \log(1-\theta_{dc})]$$
Then the posterior is given by the softmax function
$$p(y=c \mid x, \beta) = \frac{\exp[\beta_c^T x']}{\sum_{c'} \exp[\beta_{c'}^T x']}$$
This is called softmax because it acts like the max function as the magnitudes of the $\beta_c$ grow large:
$$p(y=c \mid x) \to \begin{cases} 1.0 & \text{if } c = \arg\max_{c'} \beta_{c'}^T x' \\ 0.0 & \text{otherwise} \end{cases}$$
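
For concreteness, a small sketch of the softmax (the weights and input are illustrative):

% Softmax over linear scores beta_c' * x'.
B  = [0.2 -1.0 0.5; 1.1 0.3 -0.7];     % B(:,c) = beta_c, here 2 features, 3 classes
xp = [1; 2];                           % augmented input x'
s  = B' * xp;                          % one score per class
post = exp(s - max(s)) / sum(exp(s - max(s)));   % softmax, computed stably
% Softmax of a scaled score vector, e.g. 10*s, is close to a one-hot argmax.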

Two-class case
From the previous slide,
$$p(y=c \mid x, \beta) = \frac{\exp[\beta_c^T x']}{\sum_{c'} \exp[\beta_{c'}^T x']}$$
In the binary case, $Y \in \{0,1\}$, the softmax becomes the logistic (sigmoid) function $\sigma(u) = 1/(1+e^{-u})$:
$$p(y=1 \mid x, \theta) = \frac{e^{\beta_1^T x'}}{e^{\beta_1^T x'} + e^{\beta_0^T x'}} = \frac{1}{1 + e^{(\beta_0 - \beta_1)^T x'}} = \frac{1}{1 + e^{-w^T x'}} = \sigma(w^T x')$$
where $w = \beta_1 - \beta_0$.
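
A quick numerical check of this reduction (illustrative parameters):

% Two-class softmax equals a sigmoid of the difference of the weights.
b1 = [0.4; 1.0]; b0 = [-0.2; 0.3];     % beta_1, beta_0
xp = [1; 2];                           % augmented input x'
p_softmax = exp(b1'*xp) / (exp(b1'*xp) + exp(b0'*xp));
w = b1 - b0;
p_sigmoid = 1 / (1 + exp(-w'*xp));     % sigma(w' * x'), equal to p_softmax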

Sigmoid function
$\sigma(ax + b)$: $a$ controls the steepness, $b$ is the threshold. For small $a$ and $x$ near the threshold, the function is roughly linear.
(figure: sigmoid curves for $a = 0.3, 1.0, 3.0$ and $b = -30, 0, +30$)

Sigmoid function in 2D
$\sigma(w_1 x_1 + w_2 x_2) = \sigma(w^T x)$: $w$ is perpendicular to the decision boundary. (MacKay, Fig. 39.3)

Logit function
Let $p = p(y=1)$ and let $\eta$ be the log odds:
$$\eta = \mathrm{logit}(p) = \log\frac{p}{1-p}$$
Then $p = \sigma(\eta)$:
$$\sigma(\eta) = \frac{e^{\eta}}{1+e^{\eta}} = \frac{\frac{p}{1-p}}{\frac{p}{1-p}+1} = \frac{p}{p + (1-p)} = p$$
$\eta$ is the natural parameter of the Bernoulli distribution, and $p = E[y]$ is the moment parameter.
If $\eta = w^T x$, then $w_i$ is how much the log odds increase when $x_i$ increases by one unit.
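
A one-line check that logit and sigmoid are inverses (illustrative value of p):

% logit maps a probability to log odds; sigmoid maps back.
p   = 0.8;
eta = log(p / (1 - p));            % logit(p)
pr  = 1 / (1 + exp(-eta));         % sigma(eta) recovers p
% If eta = w'*x, increasing x_i by one unit adds w_i to the log odds.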

Gaussian classifiers
Class posterior (using the plug-in rule):
$$p(y=c \mid x) = \frac{p(x \mid y=c)\,p(y=c)}{\sum_{c'=1}^{C} p(x \mid y=c')\,p(y=c')} = \frac{\pi_c\,|2\pi\Sigma_c|^{-1/2} \exp\!\left[-\tfrac{1}{2}(x-\mu_c)^T \Sigma_c^{-1}(x-\mu_c)\right]}{\sum_{c'} \pi_{c'}\,|2\pi\Sigma_{c'}|^{-1/2} \exp\!\left[-\tfrac{1}{2}(x-\mu_{c'})^T \Sigma_{c'}^{-1}(x-\mu_{c'})\right]}$$
We will consider the form of this equation for various special cases:
$\Sigma_1 = \Sigma_0$
$\Sigma_c$ tied, many classes
General case
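
A sketch of this plug-in posterior for a few made-up Gaussian classes (all parameters are illustrative):

% Gaussian class posterior by the plug-in rule, computed in log space.
mus    = {[0;0], [0;5], [5;5]};                 % class means mu_c
Sigmas = {eye(2), [2 .5; .5 1], 3*eye(2)};      % class covariances Sigma_c
ppi    = [1/3 1/3 1/3];                         % class priors pi_c
x      = [1; 2];                                % test point
logjoint = zeros(1, 3);
for c = 1:3
    d = x - mus{c};
    logjoint(c) = log(ppi(c)) - 0.5*log(det(2*pi*Sigmas{c})) - 0.5 * d' * (Sigmas{c} \ d);
end
post = exp(logjoint - max(logjoint));
post = post / sum(post);                        % p(y = c | x)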

$\Sigma_1 = \Sigma_0 = \Sigma$
The class posterior simplifies to
$$p(y=1 \mid x) = \frac{p(x \mid y=1)\,p(y=1)}{p(x \mid y=1)\,p(y=1) + p(x \mid y=0)\,p(y=0)}$$
$$= \frac{\pi_1 \exp\!\left[-\tfrac{1}{2}(x-\mu_1)^T \Sigma^{-1}(x-\mu_1)\right]}{\pi_1 \exp\!\left[-\tfrac{1}{2}(x-\mu_1)^T \Sigma^{-1}(x-\mu_1)\right] + \pi_0 \exp\!\left[-\tfrac{1}{2}(x-\mu_0)^T \Sigma^{-1}(x-\mu_0)\right]}$$
$$= \frac{\pi_1 e^{a_1}}{\pi_1 e^{a_1} + \pi_0 e^{a_0}} = \frac{1}{1 + \frac{\pi_0}{\pi_1} e^{a_0 - a_1}}, \qquad a_c \stackrel{\mathrm{def}}{=} -\tfrac{1}{2}(x-\mu_c)^T \Sigma^{-1}(x-\mu_c)$$

$\Sigma_1 = \Sigma_0 = \Sigma$ (continued)
The class posterior simplifies to
$$p(y=1 \mid x) = \frac{1}{1 + \exp\!\left[\log\frac{\pi_0}{\pi_1} + a_0 - a_1\right]}$$
$$a_0 - a_1 = -\tfrac{1}{2}(x-\mu_0)^T\Sigma^{-1}(x-\mu_0) + \tfrac{1}{2}(x-\mu_1)^T\Sigma^{-1}(x-\mu_1) = -(\mu_1-\mu_0)^T\Sigma^{-1}x + \tfrac{1}{2}(\mu_1-\mu_0)^T\Sigma^{-1}(\mu_1+\mu_0)$$
so
$$p(y=1 \mid x) = \frac{1}{1 + \exp[-\beta^T x - \gamma]} = \sigma(\beta^T x + \gamma)$$
$$\beta \stackrel{\mathrm{def}}{=} \Sigma^{-1}(\mu_1 - \mu_0), \qquad \gamma \stackrel{\mathrm{def}}{=} -\tfrac{1}{2}(\mu_1-\mu_0)^T\Sigma^{-1}(\mu_1+\mu_0) + \log\frac{\pi_1}{\pi_0}, \qquad \sigma(\eta) \stackrel{\mathrm{def}}{=} \frac{1}{1+e^{-\eta}} = \frac{e^{\eta}}{e^{\eta}+1}$$
This is a linear function of $x$.

Decision boundary
Rewrite the class posterior as
$$p(y=1 \mid x) = \sigma(\beta^T x + \gamma) = \sigma(w^T(x - x_0))$$
$$w = \beta = \Sigma^{-1}(\mu_1 - \mu_0), \qquad x_0 = \tfrac{1}{2}(\mu_1 + \mu_0) - \frac{\log(\pi_1/\pi_0)}{(\mu_1-\mu_0)^T\Sigma^{-1}(\mu_1-\mu_0)}\,(\mu_1 - \mu_0)$$
If $\Sigma = I$, then $w = \mu_1 - \mu_0$ is in the direction of $\mu_1 - \mu_0$, so the hyperplane is orthogonal to the line between the two means and intersects it at $x_0$.
If $\pi_1 = \pi_0$, then $x_0 = \tfrac{1}{2}(\mu_1 + \mu_0)$ is midway between the two means.
If $\pi_1$ increases, $x_0$ decreases, so the boundary shifts toward $\mu_0$ (so more space gets mapped to class 1).
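
The two forms can be checked numerically; a sketch with illustrative parameters:

% Shared-covariance case: compute beta, gamma, w, x0 and compare the two forms.
mu1 = [2; 1]; mu0 = [-1; 0];
Sigma = [1 .3; .3 2];
pi1 = 0.6; pi0 = 0.4;
beta  = Sigma \ (mu1 - mu0);
gamma = -0.5 * (mu1 - mu0)' * (Sigma \ (mu1 + mu0)) + log(pi1/pi0);
w  = beta;
x0 = 0.5*(mu1 + mu0) - (log(pi1/pi0) / ((mu1 - mu0)' * (Sigma \ (mu1 - mu0)))) * (mu1 - mu0);
x  = [0.5; 0.7];                               % test point
p1 = 1 / (1 + exp(-(beta'*x + gamma)));        % sigma(beta' x + gamma)
p2 = 1 / (1 + exp(-(w'*(x - x0))));            % sigma(w' (x - x0)), equal to p1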

Decision boundary in 1d
Discontinuous decision region (figure)

Decision boundary in 2d (figure)

Tied $\Sigma$, many classes
Similarly to before,
$$p(y=c \mid x) = \frac{\pi_c \exp\!\left[-\tfrac{1}{2}(x-\mu_c)^T\Sigma^{-1}(x-\mu_c)\right]}{\sum_{c'} \pi_{c'} \exp\!\left[-\tfrac{1}{2}(x-\mu_{c'})^T\Sigma^{-1}(x-\mu_{c'})\right]} = \frac{\exp\!\left[\mu_c^T\Sigma^{-1}x - \tfrac{1}{2}\mu_c^T\Sigma^{-1}\mu_c + \log\pi_c\right]}{\sum_{c'} \exp\!\left[\mu_{c'}^T\Sigma^{-1}x - \tfrac{1}{2}\mu_{c'}^T\Sigma^{-1}\mu_{c'} + \log\pi_{c'}\right]}$$
(the $x^T\Sigma^{-1}x$ term is common to all classes and cancels). Defining
$$\gamma_c \stackrel{\mathrm{def}}{=} -\tfrac{1}{2}\mu_c^T\Sigma^{-1}\mu_c + \log\pi_c, \qquad \beta_c \stackrel{\mathrm{def}}{=} \Sigma^{-1}\mu_c, \qquad \theta_c \stackrel{\mathrm{def}}{=} (\gamma_c, \beta_c),$$
we get
$$p(y=c \mid x) = \frac{e^{\theta_c^T x'}}{\sum_{c'} e^{\theta_{c'}^T x'}} = \frac{e^{\beta_c^T x + \gamma_c}}{\sum_{c'} e^{\beta_{c'}^T x + \gamma_{c'}}}$$
where $x' = (1, x)$. This is the multinomial logit or softmax function.
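
A sketch of the tied-covariance posterior in this softmax form (parameters are illustrative):

% Tied-covariance, multi-class posterior via beta_c' x + gamma_c.
mus   = [0 0; 0 5; 5 5]';        % columns are mu_c for C = 3 classes
Sigma = [1 .2; .2 1];
ppi   = [1/3 1/3 1/3];
x     = [1; 3];
scores = zeros(1, 3);
for c = 1:3
    betac  = Sigma \ mus(:,c);
    gammac = -0.5 * mus(:,c)' * (Sigma \ mus(:,c)) + log(ppi(c));
    scores(c) = betac' * x + gammac;
end
post = exp(scores - max(scores));
post = post / sum(post);         % multinomial logit / softmax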

Discriminant function (tied $\Sigma$, many classes)
$$g_c(x) = -\tfrac{1}{2}(x-\mu_c)^T\Sigma^{-1}(x-\mu_c) + \log p(y=c) = \beta_c^T x + \beta_{c0} \;\;(\text{up to a term that is the same for all classes})$$
$$\beta_c = \Sigma^{-1}\mu_c, \qquad \beta_{c0} = -\tfrac{1}{2}\mu_c^T\Sigma^{-1}\mu_c + \log\pi_c$$
The decision boundary is again linear, since the $x^T\Sigma^{-1}x$ terms cancel.
If $\Sigma = I$, the decision boundaries are orthogonal to $\mu_i - \mu_j$; otherwise they are skewed.

Decision boundaries
% Sketch of the plotting code (the elided lines compute g2, g3 analogously):
[x, y] = meshgrid(linspace(-10,10,100), linspace(-10,10,100));
[m, n] = size(x);
g1 = reshape(mvnpdf([x(:) y(:)], mu1(:)', S1), [m n]);
...
contour(x, y, g2*p2 - max(g1*p1, g3*p3), [0 0], '-k');

$\Sigma_0$, $\Sigma_1$ arbitrary
If the $\Sigma_c$ are unconstrained, we end up with cross-product terms, leading to quadratic decision boundaries.
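
To see why, one can compare the log-discriminants of two classes with different covariances; a sketch with illustrative parameters:

% With unequal Sigma_c the quadratic terms no longer cancel.
mu0 = [0; 0]; S0 = eye(2);
mu1 = [2; 1]; S1 = [3 1; 1 2];
pi0 = 0.5; pi1 = 0.5;
g = @(x, mu, S, p) -0.5*log(det(2*pi*S)) - 0.5*(x - mu)'*(S \ (x - mu)) + log(p);
x = [1; 1];
delta = g(x, mu1, S1, pi1) - g(x, mu0, S0, pi0);
% delta is quadratic in x because S0 ~= S1, so the decision boundary
% {x : delta = 0} is a conic section rather than a straight line.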

General case
$\mu_1 = (0,0)$, $\mu_2 = (0,5)$, $\mu_3 = (5,5)$, $\pi = (1/3, 1/3, 1/3)$ (figure)