CS340 Machine Learning
Gaussian classifiers
Correlated features
Height and weight are not independent.
Multivariate Gaussian
The multivariate normal (MVN) density is

$$N(\mathbf{x} \mid \boldsymbol\mu, \Sigma) \;\triangleq\; \frac{1}{(2\pi)^{p/2}\,|\Sigma|^{1/2}} \exp\!\left[-\tfrac{1}{2}(\mathbf{x}-\boldsymbol\mu)^T \Sigma^{-1} (\mathbf{x}-\boldsymbol\mu)\right]$$

The exponent is the Mahalanobis distance between x and µ: $(\mathbf{x}-\boldsymbol\mu)^T \Sigma^{-1} (\mathbf{x}-\boldsymbol\mu)$.
Σ is the covariance matrix, which is positive definite: $\mathbf{x}^T \Sigma \mathbf{x} > 0$ for all $\mathbf{x} \neq \mathbf{0}$.
Covariance matrix (bivariate Gaussian)
The covariance matrix is

$$\Sigma = \begin{pmatrix} \sigma_x^2 & \rho\sigma_x\sigma_y \\ \rho\sigma_x\sigma_y & \sigma_y^2 \end{pmatrix}$$

where the correlation coefficient

$$\rho = \frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$$

satisfies $-1 \le \rho \le 1$. The density (for zero means) is

$$p(x,y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}} \exp\!\left(-\frac{1}{2(1-\rho^2)}\left(\frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} - \frac{2\rho x y}{\sigma_x\sigma_y}\right)\right)$$
Spherical, diagonal, full covariance

$$\Sigma_{\text{spherical}} = \begin{pmatrix} \sigma^2 & 0 \\ 0 & \sigma^2 \end{pmatrix} \qquad \Sigma_{\text{diagonal}} = \begin{pmatrix} \sigma_x^2 & 0 \\ 0 & \sigma_y^2 \end{pmatrix} \qquad \Sigma_{\text{full}} = \begin{pmatrix} \sigma_x^2 & \rho\sigma_x\sigma_y \\ \rho\sigma_x\sigma_y & \sigma_y^2 \end{pmatrix}$$
Surface plots
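To make the surface plots concrete, here is a small MATLAB sketch (not from the original slides) that evaluates a zero-mean bivariate Gaussian on a grid for the three covariance types; the values of σx, σy, ρ are assumed, and mvnpdf requires the Statistics Toolbox.

```matlab
% Zero-mean bivariate Gaussian density for spherical, diagonal, and
% full covariances (sx, sy, rho are assumed example values).
sx = 1; sy = 2; rho = 0.8;
Sigmas = {sx^2 * eye(2), ...                      % spherical
          [sx^2 0; 0 sy^2], ...                   % diagonal
          [sx^2 rho*sx*sy; rho*sx*sy sy^2]};      % full
[x, y] = meshgrid(linspace(-5,5,100), linspace(-5,5,100));
for i = 1:3
    p = reshape(mvnpdf([x(:) y(:)], [0 0], Sigmas{i}), size(x));
    subplot(1, 3, i); surf(x, y, p); shading interp;
end
```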
Generative classifier
A generative classifier defines a class-conditional density p(x | y = c) and combines it with a class prior p(y = c) to compute the class posterior:

$$p(y=c \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid y=c)\, p(y=c)}{\sum_{c'} p(\mathbf{x} \mid y=c')\, p(y=c')}$$

Examples:
Naïve Bayes: $p(\mathbf{x} \mid y=c) = \prod_{j=1}^d p(x_j \mid y=c)$
Gaussian classifiers: $p(\mathbf{x} \mid y=c) = N(\mathbf{x} \mid \boldsymbol\mu_c, \Sigma_c)$

The alternative is a discriminative classifier, which estimates p(y = c | x) directly.
Naïve Bayes with Bernoulli features
Consider the class-conditional density

$$p(\mathbf{x} \mid y=c) = \prod_{i=1}^d \theta_{ic}^{I(x_i=1)}\,(1-\theta_{ic})^{I(x_i=0)}$$

The resulting class posterior (using the plug-in rule) has the form

$$p(y=c \mid \mathbf{x}, \theta, \pi) = \frac{p(y=c)\, p(\mathbf{x} \mid y=c)}{p(\mathbf{x})} = \frac{\pi_c \prod_{i=1}^d \theta_{ic}^{I(x_i=1)}\,(1-\theta_{ic})^{I(x_i=0)}}{p(\mathbf{x})}$$

This can be rewritten as

$$p(y=c \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid y=c)\, p(y=c)}{\sum_{c'} p(\mathbf{x} \mid y=c')\, p(y=c')} = \frac{\exp[\log p(\mathbf{x} \mid y=c) + \log p(y=c)]}{\sum_{c'} \exp[\log p(\mathbf{x} \mid y=c') + \log p(y=c')]}$$

$$= \frac{\exp\!\left[\log\pi_c + \sum_i I(x_i=1)\log\theta_{ic} + I(x_i=0)\log(1-\theta_{ic})\right]}{\sum_{c'} \exp\!\left[\log\pi_{c'} + \sum_i I(x_i=1)\log\theta_{ic'} + I(x_i=0)\log(1-\theta_{ic'})\right]}$$
Form of the class posterior
From the previous slide,

$$p(y=c \mid \mathbf{x}, \theta, \pi) \propto \exp\!\left[\log\pi_c + \sum_i I(x_i=1)\log\theta_{ic} + I(x_i=0)\log(1-\theta_{ic})\right]$$

Define

$$\mathbf{x}' = [1,\, I(x_1=1),\, I(x_1=0),\, \ldots,\, I(x_d=1),\, I(x_d=0)]$$
$$\boldsymbol\beta_c = [\log\pi_c,\, \log\theta_{1c},\, \log(1-\theta_{1c}),\, \ldots,\, \log\theta_{dc},\, \log(1-\theta_{dc})]$$

Then the posterior is given by the softmax function

$$p(y=c \mid \mathbf{x}, \beta) = \frac{\exp[\boldsymbol\beta_c^T \mathbf{x}']}{\sum_{c'} \exp[\boldsymbol\beta_{c'}^T \mathbf{x}']}$$

This is called softmax because it acts like the max function: as the $\boldsymbol\beta_c$ are scaled up,

$$p(y=c \mid \mathbf{x}) \to \begin{cases} 1.0 & \text{if } c = \arg\max_{c'} \boldsymbol\beta_{c'}^T \mathbf{x}' \\ 0.0 & \text{otherwise} \end{cases}$$
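A minimal MATLAB sketch of this rewriting, with assumed values for θ, π, and x; it builds the expanded vector x′ and the weight vectors β_c, then evaluates the softmax.

```matlab
% Bernoulli naive Bayes posterior written as a softmax (assumed theta, prior, x).
theta = [0.8 0.3; 0.5 0.9; 0.1 0.7];   % theta(i,c) = p(x_i = 1 | y = c): d = 3 features, C = 2 classes
prior = [0.6 0.4];                      % class prior pi_c
x     = [1 0 1];                        % a binary test vector

xprime = [1, reshape([x == 1; x == 0], 1, [])];   % x' = [1, I(x1=1), I(x1=0), ..., I(xd=1), I(xd=0)]
beta   = zeros(numel(xprime), numel(prior));
for c = 1:numel(prior)                            % beta_c = [log pi_c, log th_1c, log(1-th_1c), ...]
    beta(:,c) = [log(prior(c)); reshape([log(theta(:,c)) log(1-theta(:,c))]', [], 1)];
end
a    = xprime * beta;                             % beta_c' * x' for each class
post = exp(a - max(a)) ./ sum(exp(a - max(a)));   % softmax, subtracting max(a) for numerical stability
```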
Two-class case
From the previous slide,

$$p(y=c \mid \mathbf{x}, \beta) = \frac{\exp[\boldsymbol\beta_c^T \mathbf{x}']}{\sum_{c'} \exp[\boldsymbol\beta_{c'}^T \mathbf{x}']}$$

In the binary case, $y \in \{0,1\}$, the softmax becomes the logistic (sigmoid) function $\sigma(u) = 1/(1+e^{-u})$:

$$p(y=1 \mid \mathbf{x}, \theta) = \frac{e^{\boldsymbol\beta_1^T \mathbf{x}'}}{e^{\boldsymbol\beta_1^T \mathbf{x}'} + e^{\boldsymbol\beta_0^T \mathbf{x}'}} = \frac{1}{1 + e^{(\boldsymbol\beta_0 - \boldsymbol\beta_1)^T \mathbf{x}'}} = \frac{1}{1 + e^{-\mathbf{w}^T \mathbf{x}'}} = \sigma(\mathbf{w}^T \mathbf{x}')$$

where $\mathbf{w} = \boldsymbol\beta_1 - \boldsymbol\beta_0$.
Sigmoid function
σ(ax + b): a controls the steepness and b shifts the threshold. For small a, the function is roughly linear near its midpoint.
[Figure: sigmoid curves for a = 0.3, 1.0, 3.0 and b = −30, 0, +30]
Sigmoid function in 2D
$\sigma(w_1 x_1 + w_2 x_2) = \sigma(\mathbf{w}^T\mathbf{x})$: w is perpendicular to the decision boundary. [MacKay, Fig. 39.3]
Logit function
Let p = p(y = 1) and let η be the log odds. Then p = σ(η) and η = logit(p):

$$\eta = \log\frac{p}{1-p} \qquad \sigma(\eta) = \frac{e^\eta}{1+e^\eta} = \frac{1}{1+e^{-\eta}}$$

These are inverses: substituting $e^\eta = \frac{p}{1-p}$ gives

$$\sigma(\eta) = \frac{p/(1-p)}{p/(1-p) + 1} = \frac{p}{p + (1-p)} = p$$

η is the natural parameter of the Bernoulli distribution, and p = E[y] is the moment parameter.
If $\eta = \mathbf{w}^T\mathbf{x}$, then $w_i$ is how much the log-odds increases if we increase $x_i$ by one unit.
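A small illustrative snippet (assumed values) showing that logit and sigmoid are inverses, and that adding 1 to the log-odds multiplies the odds by e.

```matlab
% logit and sigmoid are inverses: p = sigma(eta), eta = log(p / (1 - p)).
sigma = @(eta) 1 ./ (1 + exp(-eta));
logit = @(p) log(p ./ (1 - p));

p   = 0.2;                                % assumed example probability
eta = logit(p);                           % log-odds, about -1.386
sigma(eta)                                % recovers p = 0.2
(sigma(eta + 1) / (1 - sigma(eta + 1))) / (p / (1 - p))   % equals exp(1): odds scale by e per unit of log-odds
```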
Gaussian classifiers
Class posterior (using the plug-in rule):

$$p(y=c \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid y=c)\, p(y=c)}{\sum_{c'=1}^C p(\mathbf{x} \mid y=c')\, p(y=c')} = \frac{\pi_c\, |2\pi\Sigma_c|^{-1/2} \exp\!\left[-\tfrac{1}{2}(\mathbf{x}-\boldsymbol\mu_c)^T \Sigma_c^{-1} (\mathbf{x}-\boldsymbol\mu_c)\right]}{\sum_{c'} \pi_{c'}\, |2\pi\Sigma_{c'}|^{-1/2} \exp\!\left[-\tfrac{1}{2}(\mathbf{x}-\boldsymbol\mu_{c'})^T \Sigma_{c'}^{-1} (\mathbf{x}-\boldsymbol\mu_{c'})\right]}$$

We will consider this equation for various special cases:
Σ₁ = Σ₀ (two classes, tied covariance)
Σ_c tied, many classes
The general case
Σ₁ = Σ₀
The class posterior simplifies to

$$p(y=1 \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid y=1)\, p(y=1)}{p(\mathbf{x} \mid y=1)\, p(y=1) + p(\mathbf{x} \mid y=0)\, p(y=0)}$$
$$= \frac{\pi_1 \exp\!\left[-\tfrac{1}{2}(\mathbf{x}-\boldsymbol\mu_1)^T \Sigma^{-1} (\mathbf{x}-\boldsymbol\mu_1)\right]}{\pi_1 \exp\!\left[-\tfrac{1}{2}(\mathbf{x}-\boldsymbol\mu_1)^T \Sigma^{-1} (\mathbf{x}-\boldsymbol\mu_1)\right] + \pi_0 \exp\!\left[-\tfrac{1}{2}(\mathbf{x}-\boldsymbol\mu_0)^T \Sigma^{-1} (\mathbf{x}-\boldsymbol\mu_0)\right]}$$
$$= \frac{\pi_1 e^{a_1}}{\pi_1 e^{a_1} + \pi_0 e^{a_0}} = \frac{1}{1 + \frac{\pi_0}{\pi_1} e^{a_0 - a_1}} \qquad \text{where } a_c \triangleq -\tfrac{1}{2}(\mathbf{x}-\boldsymbol\mu_c)^T \Sigma^{-1} (\mathbf{x}-\boldsymbol\mu_c)$$
Σ₁ = Σ₀ (continued)
The class posterior simplifies to

$$p(y=1 \mid \mathbf{x}) = \frac{1}{1 + \exp\!\left[\log\frac{\pi_0}{\pi_1} + a_0 - a_1\right]}$$

with

$$a_0 - a_1 = -\tfrac{1}{2}(\mathbf{x}-\boldsymbol\mu_0)^T \Sigma^{-1}(\mathbf{x}-\boldsymbol\mu_0) + \tfrac{1}{2}(\mathbf{x}-\boldsymbol\mu_1)^T \Sigma^{-1}(\mathbf{x}-\boldsymbol\mu_1) = -(\boldsymbol\mu_1-\boldsymbol\mu_0)^T \Sigma^{-1}\mathbf{x} + \tfrac{1}{2}(\boldsymbol\mu_1-\boldsymbol\mu_0)^T \Sigma^{-1}(\boldsymbol\mu_1+\boldsymbol\mu_0)$$

so

$$p(y=1 \mid \mathbf{x}) = \frac{1}{1 + \exp[-\boldsymbol\beta^T\mathbf{x} - \gamma]} = \sigma(\boldsymbol\beta^T\mathbf{x} + \gamma)$$

where

$$\boldsymbol\beta \triangleq \Sigma^{-1}(\boldsymbol\mu_1 - \boldsymbol\mu_0) \qquad \gamma \triangleq -\tfrac{1}{2}(\boldsymbol\mu_1-\boldsymbol\mu_0)^T \Sigma^{-1}(\boldsymbol\mu_1+\boldsymbol\mu_0) + \log\frac{\pi_1}{\pi_0} \qquad \sigma(\eta) \triangleq \frac{e^\eta}{1+e^\eta}$$

This is a linear function of x.
Decision boundary
Rewrite the class posterior as

$$p(y=1 \mid \mathbf{x}) = \sigma(\boldsymbol\beta^T\mathbf{x} + \gamma) = \sigma(\mathbf{w}^T(\mathbf{x} - \mathbf{x}_0))$$

where

$$\mathbf{w} = \boldsymbol\beta = \Sigma^{-1}(\boldsymbol\mu_1 - \boldsymbol\mu_0) \qquad \mathbf{x}_0 = \tfrac{1}{2}(\boldsymbol\mu_1+\boldsymbol\mu_0) - \frac{\log(\pi_1/\pi_0)}{(\boldsymbol\mu_1-\boldsymbol\mu_0)^T \Sigma^{-1}(\boldsymbol\mu_1-\boldsymbol\mu_0)}\,(\boldsymbol\mu_1-\boldsymbol\mu_0)$$

If Σ = I, then w = µ₁ − µ₀ points in the direction of µ₁ − µ₀, so the separating hyperplane is orthogonal to the line between the two means and intersects it at x₀.
If π₁ = π₀, then x₀ = ½(µ₁ + µ₀) is midway between the two means.
If π₁ increases, x₀ moves toward µ₀, so the boundary shifts toward µ₀ (more space gets mapped to class 1).
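As a sketch of these formulas, the following MATLAB fragment (with assumed means, covariance, priors, and test point) computes β, γ, and the equivalent (w, x₀) parameterization.

```matlab
% Two-class Gaussian classifier with a shared covariance (assumed example parameters).
mu1 = [1; 1]; mu0 = [-1; -1];            % class means
Sigma = [2 0.5; 0.5 1];                  % tied covariance
pi1 = 0.7; pi0 = 0.3;                    % class priors

beta  = Sigma \ (mu1 - mu0);                                         % beta = Sigma^{-1} (mu1 - mu0)
gamma = -0.5 * (mu1 - mu0)' * (Sigma \ (mu1 + mu0)) + log(pi1/pi0);  % bias term
x     = [0.3; -0.2];                                                 % a test point
p1    = 1 / (1 + exp(-(beta' * x + gamma)));                         % p(y=1 | x) = sigma(beta' x + gamma)

% Equivalent form sigma(w'(x - x0)): w = beta, and x0 lies on the boundary.
w  = beta;
x0 = 0.5*(mu1 + mu0) - (mu1 - mu0) * log(pi1/pi0) / ((mu1 - mu0)' * (Sigma \ (mu1 - mu0)));
```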
Decision boundary in 1d (discontinuous decision region)
Decision boundary in 2d
Tied Σ, many classes
Similarly to before,

$$p(y=c \mid \mathbf{x}) = \frac{\pi_c \exp\!\left[-\tfrac{1}{2}(\mathbf{x}-\boldsymbol\mu_c)^T \Sigma^{-1} (\mathbf{x}-\boldsymbol\mu_c)\right]}{\sum_{c'} \pi_{c'} \exp\!\left[-\tfrac{1}{2}(\mathbf{x}-\boldsymbol\mu_{c'})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol\mu_{c'})\right]} = \frac{\exp\!\left[\boldsymbol\mu_c^T\Sigma^{-1}\mathbf{x} - \tfrac{1}{2}\boldsymbol\mu_c^T\Sigma^{-1}\boldsymbol\mu_c + \log\pi_c\right]}{\sum_{c'} \exp\!\left[\boldsymbol\mu_{c'}^T\Sigma^{-1}\mathbf{x} - \tfrac{1}{2}\boldsymbol\mu_{c'}^T\Sigma^{-1}\boldsymbol\mu_{c'} + \log\pi_{c'}\right]} = \frac{e^{\boldsymbol\beta_c^T\mathbf{x} + \gamma_c}}{\sum_{c'} e^{\boldsymbol\beta_{c'}^T\mathbf{x} + \gamma_{c'}}}$$

where

$$\boldsymbol\beta_c \triangleq \Sigma^{-1}\boldsymbol\mu_c \qquad \gamma_c \triangleq -\tfrac{1}{2}\boldsymbol\mu_c^T\Sigma^{-1}\boldsymbol\mu_c + \log\pi_c$$

This is the multinomial logit or softmax function.
Discriminant function (tied Σ, many classes)
The discriminant function is

$$g_c(\mathbf{x}) = -\tfrac{1}{2}(\mathbf{x}-\boldsymbol\mu_c)^T \Sigma^{-1} (\mathbf{x}-\boldsymbol\mu_c) + \log p(y=c) = \boldsymbol\beta_c^T\mathbf{x} + \beta_{c0}$$

(dropping the class-independent $-\tfrac{1}{2}\mathbf{x}^T\Sigma^{-1}\mathbf{x}$ term), where

$$\boldsymbol\beta_c = \Sigma^{-1}\boldsymbol\mu_c \qquad \beta_{c0} = -\tfrac{1}{2}\boldsymbol\mu_c^T\Sigma^{-1}\boldsymbol\mu_c + \log\pi_c$$

The decision boundary is again linear, since the $\mathbf{x}^T\Sigma^{-1}\mathbf{x}$ terms cancel. If Σ = I, the decision boundaries are orthogonal to µ_i − µ_j; otherwise they are skewed.
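A minimal sketch of these linear discriminants, assuming an identity covariance, uniform priors, and three example means.

```matlab
% Linear discriminants g_c(x) = beta_c' x + beta_c0 for a tied covariance (assumed parameters).
mus   = [0 0; 0 5; 5 5]';                % one column per class mean
Sigma = eye(2);                          % shared covariance
prior = [1/3 1/3 1/3];
x     = [2; 3];                          % a test point

C = numel(prior);
g = zeros(1, C);
for c = 1:C
    betac  = Sigma \ mus(:,c);                            % beta_c  = Sigma^{-1} mu_c
    betac0 = -0.5 * mus(:,c)' * betac + log(prior(c));    % beta_c0 = -1/2 mu_c' Sigma^{-1} mu_c + log pi_c
    g(c)   = betac' * x + betac0;
end
post = exp(g - max(g)) ./ sum(exp(g - max(g)));           % softmax over the discriminants
[~, yhat] = max(g);                                       % predicted class
```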
Decision boundaries

% Evaluate each class-conditional density on a grid and plot the
% boundary of class 2's decision region (mu1, S1, p1, m, n, etc. defined earlier).
[x, y] = meshgrid(linspace(-10,10,100), linspace(-10,10,100));
g1 = reshape(mvnpdf([x(:) y(:)], mu1(:)', S1), [m n]); ...
contour(x, y, g2*p2 - max(g1*p1, g3*p3), [0 0], '-k');
Σ₀, Σ₁ arbitrary
If the Σ_c are unconstrained, the quadratic terms no longer cancel, so we end up with cross-product terms and quadratic decision boundaries.
General case
Example: µ₁ = (0,0), µ₂ = (0,5), µ₃ = (5,5), π = (1/3, 1/3, 1/3)
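A sketch of the general (quadratic) case in the style of the earlier plotting code: the means and priors are taken from this slide, while the class covariances are assumed purely for illustration.

```matlab
% General case: class-specific covariances give quadratic decision boundaries.
mu1 = [0 0]; mu2 = [0 5]; mu3 = [5 5];             % means from this slide
S1  = [4 0; 0 1]; S2 = [1 0.5; 0.5 2]; S3 = eye(2); % assumed covariances
p1  = 1/3; p2 = 1/3; p3 = 1/3;                     % priors from this slide

[x, y] = meshgrid(linspace(-10,10,100), linspace(-10,10,100));
g1 = reshape(mvnpdf([x(:) y(:)], mu1, S1), size(x));
g2 = reshape(mvnpdf([x(:) y(:)], mu2, S2), size(x));
g3 = reshape(mvnpdf([x(:) y(:)], mu3, S3), size(x));
contour(x, y, g2*p2 - max(g1*p1, g3*p3), [0 0], 'k-');   % boundary of class 2's region
hold on;
contour(x, y, g1*p1 - max(g2*p2, g3*p3), [0 0], 'b-');   % boundary of class 1's region
contour(x, y, g3*p3 - max(g1*p1, g2*p2), [0 0], 'r-');   % boundary of class 3's region
```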