Outline. CSCI-567: Machine Learning (Spring 2019). Prof. Victor Adamchik. Mar. 26, 2019


CSCI-567: Machine Learning (Spring 2019). Gaussian mixture models. Prof. Victor Adamchik, U. of Southern California. March 26, 2019.

Outline
1 Gaussian mixture models: Motivation and Model; EM algorithm; EM applied to GMMs
2 Density estimation
3 Naive Bayes Revisited

Gaussian mixture models

GMM is a probabilistic approach to clustering. We want to come up with a probabilistic model $p$ to explain how the data is generated. We will model each region with a Gaussian distribution. To generate a point, we first randomly pick one of the Gaussian components, then draw a point according to this Gaussian.
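This two-stage generative story is easy to state in code. Below is a minimal sampling sketch in NumPy; the mixture parameters are illustrative values chosen for this example, not numbers from the lecture.

```python
import numpy as np

def sample_gmm(weights, means, covs, n, seed=0):
    """Sample n points from a GMM: first pick a component k with probability
    weights[k], then draw from N(means[k], covs[k])."""
    rng = np.random.default_rng(seed)
    ks = rng.choice(len(weights), size=n, p=weights)   # latent component per point
    X = np.array([rng.multivariate_normal(means[k], covs[k]) for k in ks])
    return X, ks

# three illustrative 2-D components
weights = [0.5, 0.3, 0.2]
means = [np.zeros(2), np.array([4.0, 4.0]), np.array([-4.0, 4.0])]
covs = [np.eye(2), 0.5 * np.eye(2), np.diag([1.0, 2.0])]
X, z = sample_gmm(weights, means, covs, n=500)
```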

GMM: formal definition

A GMM has the following density function:
$$p(x) = \sum_{k=1}^{K} \omega_k N(x \mid \mu_k, \Sigma_k) = \sum_{k=1}^{K} \frac{\omega_k}{\sqrt{(2\pi)^D |\Sigma_k|}} e^{-\frac{1}{2}(x-\mu_k)^{\mathrm T} \Sigma_k^{-1}(x-\mu_k)}$$
where
- $K$: the number of Gaussian components (the same as the number of clusters we want)
- $\mu_k$ and $\Sigma_k$: mean and covariance matrix of the $k$-th Gaussian
- $\omega_1, \ldots, \omega_K$: mixture weights; they represent how much each component contributes to the final distribution and satisfy two properties: $\omega_k > 0$ for all $k$, and $\sum_k \omega_k = 1$.

An example

The conditional distributions are
$$p(x \mid z = \text{red}) = N(x \mid \mu_1, \Sigma_1), \quad p(x \mid z = \text{blue}) = N(x \mid \mu_2, \Sigma_2), \quad p(x \mid z = \text{green}) = N(x \mid \mu_3, \Sigma_3)$$
Here $z$ is the hidden (latent) variable. The marginal distribution is
$$p(x) = p(\text{red}) N(x \mid \mu_1, \Sigma_1) + p(\text{blue}) N(x \mid \mu_2, \Sigma_2) + p(\text{green}) N(x \mid \mu_3, \Sigma_3)$$

Learning GMMs

Learning a GMM means finding all the parameters $\theta = \{\omega_k, \mu_k, \Sigma_k\}_{k=1}^K$. How do we learn these parameters? An obvious attempt is maximum-likelihood estimation (MLE): find
$$\operatorname*{argmax}_\theta \ln \prod_{n=1}^N p(x_n; \theta) = \operatorname*{argmax}_\theta \sum_{n=1}^N \ln p(x_n; \theta) =: \operatorname*{argmax}_\theta P(\theta)$$
The problem is intractable in general (a non-concave problem, and there are latent variables). One solution is to still apply GD/SGD, but a much more effective approach is the Expectation-Maximization (EM) algorithm.

Preview of EM for learning GMMs

Step 0: Initialize $\omega_k, \mu_k, \Sigma_k$ for each $k \in [K]$.
Step 1 (E-Step): update the soft assignments, fixing the parameters:
$$\gamma_{nk} = p(z_n = k \mid x_n) \propto \omega_k N(x_n \mid \mu_k, \Sigma_k)$$
Step 2 (M-Step): update the model parameters, fixing the assignments:
$$\omega_k = \frac{\sum_n \gamma_{nk}}{N}, \quad \mu_k = \frac{\sum_n \gamma_{nk} x_n}{\sum_n \gamma_{nk}}, \quad \Sigma_k = \frac{1}{\sum_n \gamma_{nk}} \sum_n \gamma_{nk} (x_n - \mu_k)(x_n - \mu_k)^{\mathrm T}$$
Step 3: return to Step 1 if not converged.
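To make the density formula concrete, here is a small sketch that evaluates $p(x)$ for a mixture, assuming NumPy and SciPy are available; the parameter values are again illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, weights, means, covs):
    """Evaluate p(x) = sum_k w_k N(x | mu_k, Sigma_k) at a single point x."""
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=S)
               for w, m, S in zip(weights, means, covs))

weights = [0.5, 0.3, 0.2]
means = [np.zeros(2), np.array([4.0, 4.0]), np.array([-4.0, 4.0])]
covs = [np.eye(2), 0.5 * np.eye(2), np.diag([1.0, 2.0])]
print(gmm_density(np.array([0.0, 0.0]), weights, means, covs))
```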

EM algorithm

In general, EM is a heuristic for solving MLE with latent variables (not just for GMMs), i.e. for finding the maximizer of
$$P(\theta) = \sum_{n=1}^N \ln p(x_n; \theta)$$
where $\theta$ is the parameters of a general probabilistic model, the $x_n$'s are observed random variables, and the $z_n$'s are latent variables. Again, directly maximizing this objective is intractable.

EM is a general algorithm for dealing with hidden data:
- EM is an optimization strategy for objective functions that can be interpreted as likelihoods in the presence of missing data.
- EM is much simpler than gradient methods: there is no need to choose a step size.
- EM is an iterative algorithm with two steps. E-step: fill in the hidden values using inference. M-step: apply the standard MLE method to the completed data.
We will prove that EM always converges to a local optimum of the likelihood.

High level idea

Keep maximizing a lower bound of $P(\theta)$ that is more manageable.

Derivation of EM

Finding the lower bound of $P(\theta)$:
$$\ln p(x_n; \theta) = \ln \frac{p(x_n, z_n; \theta)}{p(z_n \mid x_n; \theta)} \quad \text{(true for any } z_n\text{)}$$
$$= \mathbb{E}_{z_n \sim q_n}\left[\ln \frac{p(x_n, z_n; \theta)}{p(z_n \mid x_n; \theta)}\right] \quad \text{(true for any distribution } q_n\text{)}$$
Let us recall the definition of expectation, $\mathbb{E}_{z \sim q}[f(z)] = \sum_z q(z) f(z)$, and entropy, $H(q) = -\mathbb{E}_{z \sim q}[\ln q(z)] = -\sum_z q(z) \ln q(z)$.
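For readers who want the two definitions above in executable form, here is a tiny NumPy helper (my own illustrative example, not part of the slides) computing $\mathbb{E}_{z\sim q}[f(z)]$ and $H(q)$ for a discrete distribution.

```python
import numpy as np

def expectation(q, f):
    """E_{z~q}[f(z)] = sum_z q(z) f(z) for a discrete q."""
    return float(np.dot(q, f))

def entropy(q):
    """H(q) = -sum_z q(z) ln q(z), using the convention 0 * ln 0 = 0."""
    q = np.asarray(q, dtype=float)
    nz = q > 0
    return float(-np.sum(q[nz] * np.log(q[nz])))

q = np.array([0.5, 0.3, 0.2])    # a distribution over three latent values
f = np.array([1.0, 2.0, 4.0])    # any function of z
print(expectation(q, f), entropy(q))
```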

Derivation of EM

Continuing to find the lower bound of $P(\theta)$:
$$\ln p(x_n; \theta) = \mathbb{E}_{z_n \sim q_n}\left[\ln \frac{p(x_n, z_n; \theta)}{p(z_n \mid x_n; \theta)}\right]$$
$$= \mathbb{E}_{z_n \sim q_n}[\ln p(x_n, z_n; \theta)] - \mathbb{E}_{z_n \sim q_n}[\ln q_n(z_n)] - \mathbb{E}_{z_n \sim q_n}\left[\ln \frac{p(z_n \mid x_n; \theta)}{q_n(z_n)}\right]$$
$$= \mathbb{E}_{z_n \sim q_n}[\ln p(x_n, z_n; \theta)] + H(q_n) - \mathbb{E}_{z_n \sim q_n}\left[\ln \frac{p(z_n \mid x_n; \theta)}{q_n(z_n)}\right] \quad (H \text{ is entropy})$$
$$\geq \mathbb{E}_{z_n \sim q_n}[\ln p(x_n, z_n; \theta)] + H(q_n) - \ln \mathbb{E}_{z_n \sim q_n}\left[\frac{p(z_n \mid x_n; \theta)}{q_n(z_n)}\right] \quad \text{(Jensen's inequality)}$$

Jensen's inequality

Claim: $\mathbb{E}[\ln X] \leq \ln \mathbb{E}[X]$.
Proof (for a uniform distribution over the values $x_1, \ldots, x_N$). By the definition $\mathbb{E}[X] = \frac{1}{N}(x_1 + x_2 + \cdots + x_N)$, it follows that
$$\mathbb{E}[\ln X] = \frac{1}{N}(\ln x_1 + \cdots + \ln x_N) = \ln \left(\prod_{n=1}^N x_n\right)^{1/N} \leq \ln \left(\frac{1}{N}\sum_{n=1}^N x_n\right) = \ln \mathbb{E}[X]$$
This is the AM-GM inequality. For $N = 2$, it is just $(\sqrt{x_1} - \sqrt{x_2})^2 \geq 0$.

Derivation of EM: alternately maximize the lower bound

After applying Jensen's inequality, we obtain
$$\ln p(x_n; \theta) \geq \mathbb{E}_{z_n \sim q_n}[\ln p(x_n, z_n; \theta)] + H(q_n) - \ln \mathbb{E}_{z_n \sim q_n}\left[\frac{p(z_n \mid x_n; \theta)}{q_n(z_n)}\right]$$
Next, we observe that
$$\mathbb{E}_{z_n \sim q_n}\left[\frac{p(z_n \mid x_n; \theta)}{q_n(z_n)}\right] = \sum_{z_n} q_n(z_n) \frac{p(z_n \mid x_n; \theta)}{q_n(z_n)} = \sum_{z_n} p(z_n \mid x_n; \theta) = 1$$
It follows that
$$\ln p(x_n; \theta) \geq \mathbb{E}_{z_n \sim q_n}[\ln p(x_n, z_n; \theta)] + H(q_n)$$
We have found a lower bound for the log-likelihood function:
$$P(\theta) = \sum_{n=1}^N \ln p(x_n; \theta) \geq \sum_{n=1}^N \left(\mathbb{E}_{z_n \sim q_n}[\ln p(x_n, z_n; \theta)] + H(q_n)\right) =: F(\theta, \{q_n\})$$
This holds for any $\{q_n\}$, so how do we choose? Naturally, the one that maximizes the lower bound, i.e. the tightest lower bound! This is similar to K-means: we will alternately maximize $F$ over $\{q_n\}$ and $\theta$.
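A quick numerical sanity check of the claim $\mathbb{E}[\ln X] \leq \ln \mathbb{E}[X]$ under the empirical (uniform) distribution used in the proof; the data here are arbitrary positive samples generated for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.1, 5.0, size=1000)   # positive values so the logarithm is defined

lhs = np.mean(np.log(x))   # E[ln X] under the uniform distribution over the samples
rhs = np.log(np.mean(x))   # ln E[X]
print(lhs, rhs, lhs <= rhs)  # Jensen's inequality says this prints True
```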

Pictorial explanation

$P(\theta)$ is non-concave, but $F(\theta, \{q_n^t\})$ often is concave and easy to maximize.

Maximizing over $\{q_n\}$

Fix $\theta^t$ and maximize $F$ over $\{q_n\}$:
$$\operatorname*{argmax}_{q_n} \; \mathbb{E}_{z_n \sim q_n}[\ln p(x_n, z_n; \theta^t)] + H(q_n) = \operatorname*{argmax}_{q_n} \sum_{k=1}^K q_n(k) \ln p(x_n, z_n = k; \theta^t) - \sum_{k=1}^K q_n(k) \ln q_n(k)$$
subject to the conditions $q_n(k) \geq 0$ and $\sum_k q_n(k) = 1$. Next, write down the Lagrangian and then apply the KKT conditions.

The solution to
$$\operatorname*{argmax}_{q_n} \; \mathbb{E}_{z_n \sim q_n}[\ln p(x_n, z_n; \theta^t)] + H(q_n)$$
is (you should verify it yourself)
$$q_n^t(z_n = k) = p(z_n = k \mid x_n; \theta^t),$$
i.e., the posterior distribution of $z_n$ given $x_n$ and $\theta^t$. So at $\theta^t$ we have found the tightest lower bound $F(\theta, \{q_n^t\})$: $F(\theta, \{q_n^t\}) \leq P(\theta)$ for all $\theta$, and $F(\theta^t, \{q_n^t\}) = P(\theta^t)$.

Maximizing over $\theta$

Fix $\{q_n^t\}$ and maximize $F$ over $\theta$ (note that $H(q_n^t)$ is independent of $\theta$):
$$\operatorname*{argmax}_\theta F(\theta, \{q_n^t\}) = \operatorname*{argmax}_\theta \sum_{n=1}^N \mathbb{E}_{z_n \sim q_n^t}[\ln p(x_n, z_n; \theta)] =: \operatorname*{argmax}_\theta Q(\theta; \theta^t)$$
where the $\{q_n^t\}$ are computed via $\theta^t$. $Q$ is called the (expected) complete likelihood and is usually more tractable, since the $z_n$ are not latent variables anymore.

General EM algorithm

Step 0: Initialize $\theta^1$, set $t = 1$.
Step 1 (E-Step): update the posterior of the latent variables, $q_n^t = p(\cdot \mid x_n; \theta^t)$, and obtain the expectation of the complete likelihood
$$Q(\theta; \theta^t) = \sum_{n=1}^N \mathbb{E}_{z_n \sim q_n^t}[\ln p(x_n, z_n; \theta)]$$
Step 2 (M-Step): update the model parameters via maximization:
$$\theta^{t+1} = \operatorname*{argmax}_\theta Q(\theta; \theta^t)$$
Step 3: $t \leftarrow t + 1$ and return to Step 1 if not converged.

Pictorial explanation

$P(\theta)$ is non-concave, but $Q(\theta; \theta^t)$ often is concave and easy to maximize. Moreover,
$$P(\theta^{t+1}) \geq F(\theta^{t+1}; \{q_n^t\}) \geq F(\theta^t; \{q_n^t\}) = P(\theta^t)$$
so EM always increases the objective value and will converge to some local maximum (similar to K-means).

Apply EM to learn GMMs

E-Step:
$$q_n^t(z_n = k) = p(z_n = k \mid x_n; \theta^t) \propto p(z_n = k; \theta^t)\, p(x_n \mid z_n = k; \theta^t) = \omega_k^t N(x_n \mid \mu_k^t, \Sigma_k^t)$$
This computes the soft assignment $\gamma_{nk} = q_n^t(z_n = k)$, i.e. the conditional probability of $x_n$ belonging to cluster $k$.

M-Step:
$$Q(\theta, \theta^t) = \sum_{n=1}^N \mathbb{E}_{z_n \sim q_n^t}[\ln p(x_n, z_n; \theta)] = \sum_{n=1}^N \mathbb{E}_{z_n \sim q_n^t}[\ln p(z_n; \theta) + \ln p(x_n \mid z_n; \theta)] = \sum_{n=1}^N \sum_{k=1}^K \gamma_{nk}\left(\ln \omega_k + \ln N(x_n \mid \mu_k, \Sigma_k)\right)$$
To find $\omega_1, \ldots, \omega_K$, solve
$$\operatorname*{argmax}_{\omega} \sum_{n=1}^N \sum_{k=1}^K \gamma_{nk} \ln \omega_k$$
To find each $\mu_k, \Sigma_k$, solve
$$\operatorname*{argmax}_{\mu_k, \Sigma_k} \sum_{n=1}^N \gamma_{nk} \ln N(x_n \mid \mu_k, \Sigma_k)$$
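Putting the E-step and M-step together, here is a minimal NumPy/SciPy sketch of EM for a GMM. It follows the updates above but omits things a real implementation would need (log-space responsibilities, covariance regularization beyond a small jitter, and a convergence test).

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    """Minimal EM loop for a GMM with full covariances (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = np.full(K, 1.0 / K)                          # mixture weights
    mu = X[rng.choice(N, K, replace=False)]          # initialize means at data points
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    for _ in range(n_iter):
        # E-step: gamma[n, k] proportional to w_k * N(x_n | mu_k, Sigma_k)
        gamma = np.stack([w[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                          for k in range(K)], axis=1)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: closed-form updates from the weighted sufficient statistics
        Nk = gamma.sum(axis=0)
        w = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
    return w, mu, Sigma, gamma
```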

M-Step continued

The solutions to the previous two problems are very natural (compare to the earlier preview of EM for GMMs): for each $k$,
$$\omega_k = \frac{\sum_n \gamma_{nk}}{N}, \quad \text{i.e. the weighted fraction of examples belonging to cluster } k$$
$$\mu_k = \frac{\sum_n \gamma_{nk} x_n}{\sum_n \gamma_{nk}}, \quad \text{i.e. the weighted average of examples belonging to cluster } k$$
$$\Sigma_k = \frac{1}{\sum_n \gamma_{nk}} \sum_n \gamma_{nk}(x_n - \mu_k)(x_n - \mu_k)^{\mathrm T}, \quad \text{i.e. the weighted covariance of examples belonging to cluster } k$$

GMM: putting it together

EM for clustering:
Step 0: Initialize $\omega_k, \mu_k, \Sigma_k$ for each $k \in [K]$.
Step 1 (E-Step): update the soft assignments, fixing the parameters: $\gamma_{nk} = p(z_n = k \mid x_n) \propto \omega_k N(x_n \mid \mu_k, \Sigma_k)$.
Step 2 (M-Step): update the model parameters, fixing the assignments: $\omega_k = \frac{\sum_n \gamma_{nk}}{N}$, $\mu_k = \frac{\sum_n \gamma_{nk} x_n}{\sum_n \gamma_{nk}}$, $\Sigma_k = \frac{1}{\sum_n \gamma_{nk}} \sum_n \gamma_{nk}(x_n - \mu_k)(x_n - \mu_k)^{\mathrm T}$.
Step 3: return to Step 1 if not converged.

Connection to K-means

K-means is in fact a special case of EM for a simplified GMM: let $\Sigma_k = \sigma^2 I$ for some fixed $\sigma$, so only the $\omega_k$ and $\mu_k$ are parameters. The likelihood is
$$\prod_{n=1}^N p(x_n; \theta) = \prod_{n=1}^N \sum_{k=1}^K p(z_n = k)\, N(x_n \mid \mu_k, \sigma^2 I)$$
If we assume hard assignments, $p(z_n = k) = 1$ if $k = C(n)$ (and $0$ otherwise), then
$$\prod_{n=1}^N p(x_n; \theta) = \prod_{n=1}^N N(x_n \mid \mu_{C(n)}, \sigma^2 I) \propto \prod_{n=1}^N \exp\left(-\frac{1}{2\sigma^2}\|x_n - \mu_{C(n)}\|_2^2\right)$$
and maximizing it is equivalent to
$$\operatorname*{argmin}_{\mu, C} \sum_{n=1}^N \|x_n - \mu_{C(n)}\|_2^2,$$
the K-means objective. GMM is a soft version of K-means, and it provides a probabilistic interpretation of the data. (A hard-assignment code sketch follows the outline below.)

Outline
1 Gaussian mixture models
2 Density estimation: Parametric models; Nonparametric models
3 Naive Bayes Revisited
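To make the connection concrete, here is K-means written as hard-assignment EM with $\Sigma_k = \sigma^2 I$; this is an illustrative sketch, not code from the lecture.

```python
import numpy as np

def kmeans_as_hard_em(X, K, n_iter=50, seed=0):
    """K-means viewed as EM with hard assignments and shared spherical covariance."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)]
    for _ in range(n_iter):
        # "E-step": hard-assign each point to the nearest mean
        C = np.argmin(((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1), axis=1)
        # "M-step": each mean becomes the average of its assigned points
        mu = np.array([X[C == k].mean(axis=0) if np.any(C == k) else mu[k]
                       for k in range(K)])
    return mu, C
```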

Density estimation

Observe that what we have done indirectly for clustering with GMMs is: given a training set $x_1, \ldots, x_N$, estimate a density function $p$ that could have generated this dataset via $x_n \overset{\text{i.i.d.}}{\sim} p$. This is exactly the problem of density estimation, another important unsupervised learning problem. It is useful for many downstream applications: we have seen clustering already and will see more applications today; these applications also provide a way to measure the quality of the density estimator.

Parametric generative models

Parametric estimation assumes a generative model parametrized by $\theta$: $p(x) = p(x; \theta)$. Examples:
- GMM: $p(x; \theta) = \sum_{k=1}^K \omega_k N(x \mid \mu_k, \Sigma_k)$, where $\theta = \{\omega_k, \mu_k, \Sigma_k\}$
- Multinomial (for 1-D examples with $K$ possible values): $p(x = k; \theta) = \theta_k$, where $\theta$ is a distribution over $K$ elements.
The size of $\theta$ is independent of the training set size, so the model is parametric.

Parametric methods

Again, we apply MLE to learn the parameters $\theta$:
$$\theta^* = \operatorname*{argmax}_\theta \sum_{n=1}^N \ln p(x_n; \theta)$$
For some cases this is intractable and we can use EM to approximately solve the MLE (e.g. GMMs). For some other cases it admits a simple closed-form solution (e.g. the multinomial).

MLE for the multinomial

$$\operatorname*{argmax}_\theta \sum_{n=1}^N \ln p(x = x_n; \theta) = \operatorname*{argmax}_\theta \sum_{k=1}^K \sum_{n: x_n = k} \ln \theta_k = \operatorname*{argmax}_\theta \sum_{n=1}^N \ln \theta_{x_n} = \operatorname*{argmax}_\theta \sum_{k=1}^K z_k \ln \theta_k$$
where $z_k = |\{n : x_n = k\}|$ is the number of examples with value $k$. The solution is simply
$$\theta_k = \frac{z_k}{\sum_{k'} z_{k'}} = \frac{z_k}{N},$$
i.e. the fraction of examples with value $k$.
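The multinomial closed form is essentially one line of code. A small sketch of the counting estimator, on toy data chosen for illustration:

```python
import numpy as np

def multinomial_mle(x, K):
    """Closed-form MLE for the multinomial: theta_k = z_k / N, with z_k the count of value k."""
    counts = np.bincount(np.asarray(x), minlength=K)
    return counts / len(x)

x = [0, 2, 2, 1, 0, 2, 2, 1, 0, 2]     # toy data with K = 3 possible values
print(multinomial_mle(x, K=3))         # -> [0.3 0.2 0.5]
```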

Nonparametric models

Can we estimate $p$ without assuming a fixed generative model? The high level idea is to construct something similar to a histogram: for each data point, create a "hump" via a kernel, and sum up all the humps; more data near a point means a higher hump. [Picture from Wikipedia.]

Kernel density estimation (KDE) is a common approach to nonparametric density estimation. Here "kernel" means something different from what we have seen for kernel functions. We focus on the 1-D continuous case.

Kernel

KDE with a kernel $K(x): \mathbb{R} \to \mathbb{R}$ centered at the data points $x_n$:
$$p(x) = \frac{1}{N} \sum_{n=1}^N K(x - x_n)$$
Properties of a kernel:
- symmetry: $K(x) = K(-x)$
- $\int K(x)\,dx = 1$; this ensures that $p$ is a density function.

Different kernels

There are many choices for $K$, for example:
- $K(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$, the standard Gaussian density
- $K(x) = \frac{1}{2} \mathbb{I}[|x| \leq 1]$ (the boxcar kernel)
- $K(x) = \frac{3}{4} \max\{1 - x^2, 0\}$ (the Epanechnikov kernel)
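A minimal 1-D KDE sketch with the Gaussian kernel (it already includes the bandwidth $h$ discussed on the next slides; set $h = 1$ to recover the plain formula above). The data are synthetic, for illustration only.

```python
import numpy as np

def kde(x_query, data, h=0.3):
    """1-D KDE with a Gaussian kernel: p(x) = (1 / (N h)) * sum_n K((x - x_n) / h)."""
    u = (np.asarray(x_query)[:, None] - np.asarray(data)[None, :]) / h
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return K.mean(axis=1) / h

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 0.5, 100), rng.normal(1, 1.0, 200)])
print(kde(np.linspace(-4, 4, 9), data, h=0.3))
```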

Bandwidth

If $K(x)$ is a kernel, then for any $h > 0$,
$$K_h(u) = \frac{1}{h} K\!\left(\frac{u}{h}\right) \quad \text{(stretching the kernel)}$$
can be used as a kernel too (verify the two properties yourself). A larger $h$ will smooth the density; a small $h$ will yield a density that is spiky and very hard to interpret. So a general KDE is determined by both the kernel $K$ and the bandwidth $h$:
$$p(x) = \frac{1}{N} \sum_{n=1}^N K_h(x - x_n) = \frac{1}{Nh} \sum_{n=1}^N K\!\left(\frac{x - x_n}{h}\right)$$
$x_n$ controls the center of each hump; $h$ controls the width/variance of the humps.

Effect of bandwidth

[Picture from Wikipedia: KDE with a Gaussian kernel; the gray curve is the ground truth; red: $h = 0.05$; black: $h = 0.337$; green: $h = 2$.]

Bandwidth selection

Selecting $h$ is a deep topic:
- one can do cross-validation based on downstream applications (a simple cross-validation sketch follows the outline below);
- there are theoretically-motivated approaches: find a value of $h$ that minimizes the error between the estimated density and the true density,
$$\mathbb{E}\left[(p_{\text{KDE}}(x) - p(x))^2\right] = \left(\mathbb{E}[p_{\text{KDE}}(x)] - p(x)\right)^2 + \mathrm{Var}[p_{\text{KDE}}(x)]$$
This expression is an example of the bias-variance tradeoff, which we saw in an earlier lecture.

Outline
1 Gaussian mixture models
2 Density estimation
3 Naive Bayes Revisited: Setup and assumption; Connection to logistic regression; Generative and Discriminative Models
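As referenced above, one simple data-driven way to pick $h$ (not necessarily the theoretically-motivated rules the slide alludes to) is to maximize the leave-one-out log-likelihood of the KDE. A sketch with synthetic data:

```python
import numpy as np

def loo_log_likelihood(data, h):
    """Leave-one-out log-likelihood of a Gaussian-kernel KDE with bandwidth h."""
    data = np.asarray(data)
    N = len(data)
    u = (data[:, None] - data[None, :]) / h
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    np.fill_diagonal(K, 0.0)                     # drop each point's own hump
    p_loo = K.sum(axis=1) / ((N - 1) * h)
    return float(np.sum(np.log(p_loo + 1e-300)))

rng = np.random.default_rng(0)
data = rng.normal(size=200)
candidates = [0.05, 0.1, 0.2, 0.337, 0.5, 1.0, 2.0]
print(max(candidates, key=lambda h: loo_log_likelihood(data, h)))
```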

Bayes optimal classifier

Suppose the data $(x, y)$ is drawn from a joint distribution $p(x, y)$. Then the Bayes optimal classifier is
$$f^*(x) = \operatorname*{argmax}_{c \in [C]} p(c \mid x),$$
i.e. predict the class with the largest conditional probability. $p(x, y)$ is of course unknown, but we can estimate it, which is exactly a density estimation problem! Observe that $p(x, y) = p(y)\,p(x \mid y)$. To estimate $p(x \mid y = c)$ for some $c \in [C]$, we are doing density estimation using the data with label $y = c$.

Discrete features

For a label $c \in [C]$,
$$p(y = c) = \frac{|\{n : y_n = c\}|}{N}$$
For each possible value $k$ of a discrete feature $d$,
$$p(x_d = k \mid y = c) = \frac{|\{n : x_{nd} = k, y_n = c\}|}{|\{n : y_n = c\}|}$$

Continuous features

If a feature is continuous, we can do parametric estimation, e.g. via a Gaussian,
$$p(x_d = x \mid y = c) = \frac{1}{\sqrt{2\pi}\,\sigma_{cd}} \exp\left(-\frac{(x - \mu_{cd})^2}{2\sigma_{cd}^2}\right)$$
where $\mu_{cd}$ and $\sigma_{cd}^2$ are the empirical mean and variance of feature $d$ among all examples with label $c$; or nonparametric estimation, e.g. via a kernel $K$ and bandwidth $h$:
$$p(x_d = x \mid y = c) = \frac{1}{|\{n : y_n = c\}|} \sum_{n : y_n = c} K_h(x - x_{nd})$$

How to predict?

Using the Naive Bayes assumption
$$p(x \mid y = c) = \prod_{d=1}^D p(x_d \mid y = c),$$
the prediction for a new example $x$ is
$$\operatorname*{argmax}_{c \in [C]} p(y = c \mid x) = \operatorname*{argmax}_c \frac{p(x \mid y = c)\,p(y = c)}{p(x)} = \operatorname*{argmax}_c p(y = c) \prod_{d=1}^D p(x_d \mid y = c) = \operatorname*{argmax}_c \left[\ln p(y = c) + \sum_{d=1}^D \ln p(x_d \mid y = c)\right]$$
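For the continuous/Gaussian case, the estimation and prediction steps fit in a few lines. A minimal Gaussian Naive Bayes sketch (class labels assumed to be 0, ..., C-1; the small variance floor is an addition of this sketch to avoid division by zero):

```python
import numpy as np

def fit_gaussian_nb(X, y, C):
    """Per-class priors plus per-class, per-feature Gaussian mean and variance."""
    priors = np.array([np.mean(y == c) for c in range(C)])
    mu = np.array([X[y == c].mean(axis=0) for c in range(C)])          # (C, D)
    var = np.array([X[y == c].var(axis=0) + 1e-9 for c in range(C)])   # variance floor
    return priors, mu, var

def predict_gaussian_nb(X, priors, mu, var):
    """argmax_c [ log p(y=c) + sum_d log N(x_d | mu_cd, var_cd) ]."""
    log_lik = -0.5 * (np.log(2 * np.pi * var)[None, :, :]
                      + (X[:, None, :] - mu[None, :, :]) ** 2 / var[None, :, :])
    scores = np.log(priors)[None, :] + log_lik.sum(axis=2)             # (N, C)
    return scores.argmax(axis=1)
```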

Naive Bayes

For discrete features, plugging in the previous MLE estimates gives
$$\operatorname*{argmax}_c p(y = c \mid x) = \operatorname*{argmax}_c \left[\ln p(y = c) + \sum_{d=1}^D \ln p(x_d \mid y = c)\right] = \operatorname*{argmax}_c \left[\ln |\{n : y_n = c\}| + \sum_{d=1}^D \ln \frac{|\{n : x_{nd} = x_d, y_n = c\}|}{|\{n : y_n = c\}|}\right]$$
For continuous features with a Gaussian model,
$$\operatorname*{argmax}_c p(y = c \mid x) = \operatorname*{argmax}_c \left[\ln |\{n : y_n = c\}| + \sum_{d=1}^D \ln \left(\frac{1}{\sqrt{2\pi}\,\sigma_{cd}} \exp\left(-\frac{(x_d - \mu_{cd})^2}{2\sigma_{cd}^2}\right)\right)\right] = \operatorname*{argmax}_c \left[\ln |\{n : y_n = c\}| - \sum_{d=1}^D \left(\ln \sigma_{cd} + \frac{(x_d - \mu_{cd})^2}{2\sigma_{cd}^2}\right)\right]$$

Connection to logistic regression

Let us fix the variance of each feature to be $\sigma$ (i.e. it is no longer a parameter of the model); then, dropping class-independent terms, the prediction becomes
$$\operatorname*{argmax}_c p(y = c \mid x) = \operatorname*{argmax}_c \left[\ln |\{n : y_n = c\}| - \sum_{d=1}^D \left(\ln \sigma + \frac{(x_d - \mu_{cd})^2}{2\sigma^2}\right)\right] = \operatorname*{argmax}_c \left[\ln |\{n : y_n = c\}| - \frac{\|x\|_2^2}{2\sigma^2} - \sum_{d=1}^D \frac{\mu_{cd}^2}{2\sigma^2} + \sum_{d=1}^D \frac{\mu_{cd}}{\sigma^2} x_d\right]$$
$$= \operatorname*{argmax}_c \left[w_{c0} + \sum_{d=1}^D w_{cd} x_d\right] = \operatorname*{argmax}_c w_c^{\mathrm T} x \quad \text{(a linear classifier!)}$$
where we denote
$$w_{c0} = \ln |\{n : y_n = c\}| - \sum_{d=1}^D \frac{\mu_{cd}^2}{2\sigma^2} \quad \text{and} \quad w_{cd} = \frac{\mu_{cd}}{\sigma^2}.$$
You can verify that $p(y = c \mid x) \propto e^{w_c^{\mathrm T} x}$. This is exactly the softmax function, the same model we used for a probabilistic interpretation of logistic regression!

So what is different then? They learn the parameters in different ways:
- both via MLE, but one on $p(y = c \mid x)$ and the other on $p(x, y)$;
- the solutions are different: logistic regression has no closed form, while Naive Bayes admits a simple closed form.
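The derivation above turns shared-variance Gaussian Naive Bayes into an explicit set of linear weights. A small sketch computing $w_{c0}$ and $w_{cd}$ from data (labels assumed to be 0, ..., C-1; sigma is treated as a fixed constant, as in the slide):

```python
import numpy as np

def nb_to_linear_weights(X, y, C, sigma=1.0):
    """Shared-variance Gaussian NB as a linear classifier:
    w_cd = mu_cd / sigma^2,  w_c0 = ln |{n: y_n = c}| - sum_d mu_cd^2 / (2 sigma^2)."""
    counts = np.array([np.sum(y == c) for c in range(C)])
    mu = np.array([X[y == c].mean(axis=0) for c in range(C)])   # (C, D)
    W = mu / sigma ** 2
    b = np.log(counts) - (mu ** 2).sum(axis=1) / (2 * sigma ** 2)
    return W, b

def predict_linear(X, W, b):
    # the class-independent ||x||^2 term is dropped, as in the derivation
    return (X @ W.T + b).argmax(axis=1)
```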

Two different modeling paradigms

Suppose the training data comes from an unknown joint probabilistic model $p(x, y)$. There are two kinds of classification models in machine learning: generative models and discriminative models. The difference lies in the models they assume for the data:
- the generative approach requires specifying a model for the joint distribution (such as Naive Bayes), and thus maximizes the joint likelihood $\sum_n \log p(x_n, y_n)$;
- the discriminative approach requires only specifying a model for the conditional distribution (such as logistic regression), and thus maximizes the conditional likelihood $\sum_n \log p(y_n \mid x_n)$.
Sometimes modeling by the discriminative approach is easier; sometimes parameter estimation by the generative approach is easier.

Generative model vs. discriminative model

             Discriminative model          Generative model
Example      logistic regression           naive Bayes
Model        conditional p(y|x)            joint p(x,y) (might induce the same p(y|x))
Learning     MLE                           MLE
Accuracy     usually better for large N    usually better for small N
Remark       more flexible                 can generate data after learning

Example: determining sex (man or woman) based on measurements

Generative approach: propose a model of the joint distribution of $x =$ height and $y =$ sex.

[Figure: scatter plots of weight vs. height for our data, red = female, blue = male, together with a small table of (sex, height) measurements.]

Intuition: we will model how heights vary according to a Gaussian within each sub-population (male and female). Note: this is similar to Naive Bayes for detecting spam emails.

Model of the joint distribution

$$p(x, y) = p(y)\,p(x \mid y) = \begin{cases} p_1 \cdot \frac{1}{\sqrt{2\pi\sigma_1^2}}\, e^{-\frac{(x - \mu_1)^2}{2\sigma_1^2}} & \text{if } y = 1 \\[4pt] p_2 \cdot \frac{1}{\sqrt{2\pi\sigma_2^2}}\, e^{-\frac{(x - \mu_2)^2}{2\sigma_2^2}} & \text{if } y = 2 \end{cases}$$
where $p_1 + p_2 = 1$ are the two prior probabilities that $x$ is given the label 1 or 2, respectively, and $p(x \mid y)$ is assumed to be Gaussian. [Figure: scatter plot of weight vs. height, red = female, blue = male.]

Parameter estimation

The likelihood of the training data $D = \{(x_n, y_n)\}_{n=1}^N$ with $y_n \in \{1, 2\}$ is
$$\log P(D) = \sum_n \log p(x_n, y_n) = \sum_{n: y_n = 1} \log\left(p_1 \frac{1}{\sqrt{2\pi\sigma_1^2}}\, e^{-\frac{(x_n - \mu_1)^2}{2\sigma_1^2}}\right) + \sum_{n: y_n = 2} \log\left(p_2 \frac{1}{\sqrt{2\pi\sigma_2^2}}\, e^{-\frac{(x_n - \mu_2)^2}{2\sigma_2^2}}\right)$$
Maximize the likelihood function:
$$(p_1, p_2, \mu_1, \mu_2, \sigma_1, \sigma_2) = \operatorname*{argmax} \log P(D)$$

Decision boundary

The decision boundary between the two classes is defined by
$$p(y = 1 \mid x) \geq p(y = 2 \mid x),$$
which is equivalent to
$$p(x \mid y = 1)\,p(y = 1) \geq p(x \mid y = 2)\,p(y = 2).$$
Namely,
$$-\frac{(x - \mu_1)^2}{2\sigma_1^2} - \log(\sqrt{2\pi}\,\sigma_1) + \log p_1 \geq -\frac{(x - \mu_2)^2}{2\sigma_2^2} - \log(\sqrt{2\pi}\,\sigma_2) + \log p_2$$
This is quadratic in $x$: it follows, for some $a$, $b$ and $c$, that
$$a x^2 + b x + c \geq 0.$$
The decision boundary is not linear!

Example of nonlinear decision boundary

[Figure: "Parabolic Boundary".] Note: the boundary is characterized by a quadratic function, giving rise to the shape of a parabolic curve.
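The MLE for this two-class, one-feature model is closed form (per-class counts, means, and variances), and the class scores below differ by a quadratic in $x$, which is where the parabolic boundary comes from. The data in the usage snippet are synthetic, for illustration only.

```python
import numpy as np

def fit_two_gaussians(x, y):
    """Closed-form MLE: class prior p_c and per-class (mu_c, sigma_c), c in {1, 2}."""
    params = {}
    for c in (1, 2):
        xc = x[y == c]
        params[c] = dict(p=len(xc) / len(x), mu=xc.mean(), sigma=xc.std())
    return params

def class_score(x, p, mu, sigma):
    """log p(x, y=c) up to class-independent constants; quadratic in x."""
    return -(x - mu) ** 2 / (2 * sigma ** 2) - np.log(sigma) + np.log(p)

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(64, 2.5, 100), rng.normal(70, 3.0, 100)])
y = np.array([1] * 100 + [2] * 100)
theta = fit_two_gaussians(x, y)
pred = np.where(class_score(x, **theta[1]) >= class_score(x, **theta[2]), 1, 2)
```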

A special case

What if we assume the two Gaussians have the same variance? We will get a linear decision boundary. From the previous slide,
$$-\frac{(x - \mu_1)^2}{2\sigma_1^2} - \log(\sqrt{2\pi}\,\sigma_1) + \log p_1 \geq -\frac{(x - \mu_2)^2}{2\sigma_2^2} - \log(\sqrt{2\pi}\,\sigma_2) + \log p_2$$
Setting $\sigma_1 = \sigma_2$, the quadratic terms cancel and we obtain
$$b x + c \geq 0.$$
Note: equal variances across two different categories could be a very strong assumption. For example, the plot suggests that the male population has a slightly bigger variance (i.e., a bigger ellipse) than the female population.