The generative approach to classification
CSE 250B

The learning process: fit a probability distribution to each class, individually.
To classify a new point: which of these distributions was it most likely to have come from?

A classification problem
You have a bottle of wine whose label is missing. Which winery is it from: 1, 2, or 3? Solve this problem using visual and chemical features of the wine.

Generative models
Example: data space $\mathcal{X} = \mathbb{R}$, classes/labels $\mathcal{Y} = \{1, 2, 3\}$, and class weights $\pi_1 = 10\%$, $\pi_2 = 50\%$, $\pi_3 = 40\%$.
For each class $j$, we have:
- the probability of that class, $\pi_j = \Pr(y = j)$
- the distribution of data in that class, $P_j(x)$
Overall joint distribution: $\Pr(x, y) = \Pr(y) \Pr(x \mid y) = \pi_y P_y(x)$.
To classify a new $x$: pick the label $y$ with largest $\Pr(x, y)$.
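A minimal sketch of this decision rule in Python (NumPy and SciPy assumed); the three one-dimensional class-conditional distributions are hypothetical, chosen only to go with the toy class weights above:

```python
import numpy as np
from scipy.stats import norm

pi = np.array([0.10, 0.50, 0.40])  # class weights pi_1, pi_2, pi_3
# Hypothetical class-conditional densities P_j; note that scipy's norm
# takes the standard deviation, not the variance.
P = [norm(0.0, 1.0), norm(3.0, 1.0), norm(6.0, 2.0)]

def classify(x):
    """Pick the label y with largest Pr(x, y) = pi_y * P_y(x)."""
    joint = [pi[j] * P[j].pdf(x) for j in range(3)]
    return int(np.argmax(joint)) + 1  # labels are 1, 2, 3

print(classify(2.5))  # -> 2 under these made-up parameters
```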
The data set
Training set obtained from 130 bottles:
- Winery 1: 43 bottles
- Winery 2: 51 bottles
- Winery 3: 36 bottles
For each bottle, 13 features: Alcohol, Malic acid, Ash, Alcalinity of ash, Magnesium, Total phenols, Flavanoids, Nonflavanoid phenols, Proanthocyanins, Color intensity, Hue, OD280/OD315 of diluted wines, Proline.
Also, a separate test set of 48 labeled points.

Recall: the generative approach
For any data point $x \in \mathcal{X}$ and any candidate label $j$,
$$\Pr(y = j \mid x) = \frac{\Pr(y = j) \Pr(x \mid y = j)}{\Pr(x)} = \frac{\pi_j P_j(x)}{\Pr(x)}.$$
Optimal prediction: the class $j$ with largest $\pi_j P_j(x)$.

Fitting a generative model
Class weights: $\pi_1 = 43/130 \approx 0.33$, $\pi_2 = 51/130 \approx 0.39$, $\pi_3 = 36/130 \approx 0.28$.
We also need distributions $P_1, P_2, P_3$, one per class. Base these on a single feature: Alcohol.

The univariate Gaussian
The Gaussian $N(\mu, \sigma^2)$ has mean $\mu$, variance $\sigma^2$, and density function
$$p(x) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right).$$
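This density translates directly into code; a minimal sketch, using NumPy only:

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    """Density of the univariate Gaussian N(mu, var) at x."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

print(gaussian_pdf(13.7, mu=13.7, var=0.20))  # density at the mean
```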
The distribution for winery 1
Single feature: Alcohol.
Mean $\mu = 13.72$, standard deviation $\sigma = 0.44$ (variance $\sigma^2 = 0.20$).

All three wineries
$\pi_1 = 0.33$, $P_1 = N(13.7, 0.20)$
$\pi_2 = 0.39$, $P_2 = N(12.3, 0.28)$
$\pi_3 = 0.28$, $P_3 = N(13.2, 0.27)$
To classify $x$: pick the $j$ with highest $\pi_j P_j(x)$ (see the sketch after this slide).
Test error: 14/48, about 29%. What if we use two features?

Why it helps to add features
Better separation between the classes! The test error rate drops from 29% to 8%.

The bivariate Gaussian
Model class 1 by a bivariate Gaussian, parametrized by:
$$\text{mean } \mu = \begin{pmatrix} 13.7 \\ 3.0 \end{pmatrix} \quad \text{and covariance matrix } \Sigma = \begin{pmatrix} 0.20 & 0.06 \\ 0.06 & 0.12 \end{pmatrix}.$$
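Returning to the single-feature rule: a sketch of the classifier with the fitted parameters quoted above (the test input 13.8 is arbitrary):

```python
import numpy as np

# Fitted parameters from the slides (single feature: Alcohol)
pi  = np.array([0.33, 0.39, 0.28])  # class weights
mu  = np.array([13.7, 12.3, 13.2])  # per-class means
var = np.array([0.20, 0.28, 0.27])  # per-class variances

def predict_winery(x):
    """Return the winery j with highest pi_j * P_j(x)."""
    dens = np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return int(np.argmax(pi * dens)) + 1

print(predict_winery(13.8))  # most likely winery for alcohol level 13.8
```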
Dependence between two random variables
Suppose $X_1$ has mean $\mu_1$ and $X_2$ has mean $\mu_2$. We can measure the dependence between them by their covariance:
$$\text{cov}(X_1, X_2) = E[(X_1 - \mu_1)(X_2 - \mu_2)] = E[X_1 X_2] - \mu_1 \mu_2.$$
It is maximized when $X_1 = X_2$, in which case it equals $\text{var}(X_1)$. In general, it is at most $\text{std}(X_1)\,\text{std}(X_2)$.

The bivariate (2-d) Gaussian
A distribution over $(x_1, x_2) \in \mathbb{R}^2$, parametrized by:
- mean $(\mu_1, \mu_2) \in \mathbb{R}^2$, where $\mu_1 = E(X_1)$ and $\mu_2 = E(X_2)$
- covariance matrix $\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$, where $\Sigma_{11} = \text{var}(X_1)$, $\Sigma_{22} = \text{var}(X_2)$, and $\Sigma_{12} = \Sigma_{21} = \text{cov}(X_1, X_2)$
The density is highest at the mean, and falls off in ellipsoidal contours.

Density of the bivariate Gaussian
$$p(x_1, x_2) = \frac{1}{2\pi |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} \begin{pmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{pmatrix}^T \Sigma^{-1} \begin{pmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{pmatrix} \right)$$

Bivariate Gaussian: examples
$$\Sigma = \begin{pmatrix} 4 & 0 \\ 0 & 1 \end{pmatrix} \qquad \Sigma = \begin{pmatrix} 4 & 1.5 \\ 1.5 & 1 \end{pmatrix}$$
In either case, the mean is $(1, 1)$.
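A sketch that evaluates this density for the second example above (mean $(1, 1)$, the correlated covariance matrix):

```python
import numpy as np

mu = np.array([1.0, 1.0])
Sigma = np.array([[4.0, 1.5],
                  [1.5, 1.0]])  # second example covariance above

def bivariate_gaussian_pdf(x, mu, Sigma):
    """Density of N(mu, Sigma) on R^2, straight from the formula."""
    diff = x - mu
    quad = diff @ np.linalg.inv(Sigma) @ diff
    return np.exp(-0.5 * quad) / (2 * np.pi * np.sqrt(np.linalg.det(Sigma)))

print(bivariate_gaussian_pdf(np.array([0.0, 0.0]), mu, Sigma))
```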
The decision boundary
Going from 1 to 2 features brought the error rate down from 29% to 8%. What kind of function is this decision boundary? And can we use more features?

The multivariate Gaussian
$N(\mu, \Sigma)$: Gaussian in $\mathbb{R}^d$
- mean: $\mu \in \mathbb{R}^d$
- covariance: $d \times d$ matrix $\Sigma$
Generates points $X = (X_1, X_2, \ldots, X_d)$.
$\mu$ is the vector of coordinatewise means: $\mu_1 = EX_1$, $\mu_2 = EX_2$, ..., $\mu_d = EX_d$.
$\Sigma$ is a matrix containing all pairwise covariances:
$$\Sigma_{ij} = \Sigma_{ji} = \text{cov}(X_i, X_j) \text{ if } i \neq j, \qquad \Sigma_{ii} = \text{var}(X_i).$$
Density:
$$p(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)$$

Special case: independent features
Suppose the $X_i$ are independent, and $\text{var}(X_i) = \sigma_i^2$. What is the covariance matrix $\Sigma$, and what is its inverse $\Sigma^{-1}$?

Diagonal Gaussian
The $X_i$ are independent, with variances $\sigma_i^2$:
$$\Sigma = \text{diag}(\sigma_1^2, \ldots, \sigma_d^2) \quad \text{(off-diagonal elements zero)}$$
Each $X_i$ is an independent one-dimensional Gaussian $N(\mu_i, \sigma_i^2)$:
$$\Pr(x) = \Pr(x_1) \Pr(x_2) \cdots \Pr(x_d) = \frac{1}{(2\pi)^{d/2} \sigma_1 \cdots \sigma_d} \exp\left( -\sum_{i=1}^d \frac{(x_i - \mu_i)^2}{2\sigma_i^2} \right)$$
Contours of equal density: axis-aligned ellipsoids, centered at $\mu$.
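Answering the question in the independent-features slide: $\Sigma$ is the diagonal matrix of variances, and its inverse is the diagonal matrix of their reciprocals. A quick numerical check, with hypothetical variances:

```python
import numpy as np

sigma2 = np.array([1.0, 4.0, 0.25])  # hypothetical variances sigma_i^2
Sigma = np.diag(sigma2)              # covariance: diagonal matrix
Sigma_inv = np.diag(1.0 / sigma2)    # its inverse: also diagonal

assert np.allclose(Sigma @ Sigma_inv, np.eye(3))
```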
Even more special case: spherical Gaussian
The $X_i$ are independent and all have the same variance $\sigma^2$:
$$\Sigma = \sigma^2 I_d = \text{diag}(\sigma^2, \sigma^2, \ldots, \sigma^2) \quad \text{(diagonal elements $\sigma^2$, rest zero)}$$
Each $X_i$ is an independent univariate Gaussian $N(\mu_i, \sigma^2)$, and the density at a point depends only on its distance from $\mu$:
$$\Pr(x) = \Pr(x_1) \Pr(x_2) \cdots \Pr(x_d) = \frac{1}{(2\pi)^{d/2} \sigma^d} \exp\left( -\frac{\|x - \mu\|^2}{2\sigma^2} \right)$$

How to fit a Gaussian to data
Fit a Gaussian to data points $x^{(1)}, \ldots, x^{(m)} \in \mathbb{R}^d$ (a code sketch appears after these slides).
Empirical mean: $\mu = \frac{1}{m} \left( x^{(1)} + \cdots + x^{(m)} \right)$
Empirical covariance matrix has $(i, j)$ entry:
$$\Sigma_{ij} = \left( \frac{1}{m} \sum_{k=1}^m x_i^{(k)} x_j^{(k)} \right) - \mu_i \mu_j$$

Back to the winery data
Going from 1 to 2 features brought the test error down from 29% to 8%. With all 13 features, the test error rate goes to zero.

The multivariate Gaussian, again
$$p(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)$$
If we write $S = \Sigma^{-1}$, then $S$ is a $d \times d$ matrix and
$$(x - \mu)^T \Sigma^{-1} (x - \mu) = \sum_{i,j} S_{ij} (x_i - \mu_i)(x_j - \mu_j),$$
a quadratic function of $x$.
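A sketch of the fitting step above; `Xtrain` and `ytrain` are assumed to hold the training features and labels:

```python
import numpy as np

def fit_gaussian(X):
    """Empirical mean and covariance of an m x d data matrix X.
    Uses the 1/m convention above (np.cov defaults to 1/(m-1))."""
    mu = X.mean(axis=0)
    centered = X - mu
    Sigma = centered.T @ centered / len(X)
    return mu, Sigma

# e.g., fit one Gaussian per winery:
# mu1, Sigma1 = fit_gaussian(Xtrain[ytrain == 1])
```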
Binary classification with a Gaussian generative model
- Estimate class probabilities $\pi_1, \pi_2$.
- Fit a Gaussian to each class: $P_1 = N(\mu_1, \Sigma_1)$, $P_2 = N(\mu_2, \Sigma_2)$.
- Given a new point $x$, predict class 1 if $\pi_1 P_1(x) > \pi_2 P_2(x)$.
The resulting decision boundary is linear or quadratic.

Common covariance: $\Sigma_1 = \Sigma_2 = \Sigma$
Linear decision boundary: choose class 1 if
$$x \cdot \underbrace{\Sigma^{-1} (\mu_1 - \mu_2)}_{w} \geq \theta,$$
where $\theta$ is a threshold depending on the various parameters.
Example 1: Spherical Gaussians with $\Sigma = I_d$ and $\pi_1 = \pi_2$. The boundary is the perpendicular bisector of the line segment joining the means $\mu_1$ and $\mu_2$.
Example 2: Again spherical, but now $\pi_1 > \pi_2$. The boundary shifts toward $\mu_2$, enlarging the region assigned to class 1.
Example 3: Non-spherical. The boundary is still a hyperplane, with normal vector $w = \Sigma^{-1}(\mu_1 - \mu_2)$, but it is no longer perpendicular to the segment joining the means.

Classification rule: $w \cdot x \geq \theta$
- Choose $w$ as above.
- Common practice: fit $\theta$ to minimize training or validation error.
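A sketch of this rule; `Xval` and `yval` are assumed validation data with labels 1 and 2, and scanning the projected validation points is one simple way to choose $\theta$:

```python
import numpy as np

def fit_linear_rule(mu1, mu2, Sigma):
    """w = Sigma^{-1} (mu1 - mu2); predict class 1 when w . x >= theta."""
    return np.linalg.solve(Sigma, mu1 - mu2)

def fit_theta(w, Xval, yval):
    """Pick theta minimizing validation error, trying each projected
    validation point as a candidate threshold."""
    proj = Xval @ w
    return min(proj, key=lambda t: np.mean((proj >= t) != (yval == 1)))
```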
Different covariances: $\Sigma_1 \neq \Sigma_2$
Quadratic boundary: choose class 1 if
$$x^T M x + 2 w^T x \geq \theta,$$
where
$$M = \frac{1}{2} \left( \Sigma_2^{-1} - \Sigma_1^{-1} \right), \qquad w = \frac{1}{2} \left( \Sigma_1^{-1} \mu_1 - \Sigma_2^{-1} \mu_2 \right),$$
and $\theta$ is a threshold depending on the various parameters.
Example 1: $\Sigma_1 = \sigma_1^2 I_d$ and $\Sigma_2 = \sigma_2^2 I_d$ with $\sigma_1 > \sigma_2$: the points assigned to class 2 form a ball around $\mu_2$.
Example 2: The same thing in $d = 1$, with $\mathcal{X} = \mathbb{R}$ and classes 1 and 2 centered at $\mu_1$ and $\mu_2$: class 2 is predicted on an interval around $\mu_2$.
Example 3: A parabolic boundary.

Multiclass discriminant analysis
$k$ classes: weights $\pi_j$, class-conditional densities $P_j = N(\mu_j, \Sigma_j)$.
Each class has an associated quadratic function $f_j(x) = \log(\pi_j P_j(x))$.
To classify point $x$, pick $\arg\max_j f_j(x)$.
If $\Sigma_1 = \cdots = \Sigma_k$, the boundaries are linear.
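A sketch of the multiclass rule using SciPy's multivariate normal; working with log densities avoids numerical underflow in high dimension. `pis`, `mus`, and `Sigmas` are assumed to hold the fitted parameters for the $k$ classes:

```python
import numpy as np
from scipy.stats import multivariate_normal

def predict(x, pis, mus, Sigmas):
    """Pick arg max_j f_j(x), with f_j(x) = log pi_j + log P_j(x)."""
    scores = [np.log(p) + multivariate_normal(m, S).logpdf(x)
              for p, m, S in zip(pis, mus, Sigmas)]
    return int(np.argmax(scores)) + 1  # labels 1, ..., k
```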