Probabilistic Classification: Bayes Classifiers. Lecture 6:

Size: px

Start display at page:

Download "Probabilistic Classification: Bayes Classifiers. Lecture 6:"

Ethelbert Henderson
5 years ago
Views:

1 Probablstc Classfcaton: Bayes Classfers Lecture : Classfcaton Models Sam Rowes January, Generatve model: p(x, y) = p(y)p(x y). p(y) are called class prors. p(x y) are called class condtonal feature dstrbutons. For the pror we use a Bernoull or multnomal: p(y = k π) = π k wth k π k =. Classfcaton rules: ML: argmax y p(x y) (can behave badly f skewed prors) MAP: argmax y p(y x) = argmax y log p(x y) + log p(y) (safer) Fttng: maxmze n log p(xn, y n ) = n log p(xn y n ) + log p(y n ) ) Sort data nto batches by class label. ) Estmate p(y) by countng sze of batches (plus regularzaton). ) Estmate p(x y) separately wthn each batch usng ML. (also wth regularzaton). Remnder: Classfcaton Gven examples of a dscrete class label y and some features x. Goal: compute label (y) for new nputs x. Two approaches: Generatve: model p(x, y) = p(y)p(x y); use Bayes rule to nfer condtonal p(y x). Dscrmnatve: model dscrmnants f(y x) drectly and take max. Generatve approach s related to condtonal densty estmaton whle dscrmnatve approach s closer to regresson. Three Key Regularzaton Ideas To avod overfttng, we can put prors on the parameters of the class and class condtonal feature dstrbutons. We can also te some parameters together so that fewer of them are estmated usng more data. Fnally, we can make factorzaton or ndependence assumptons about the dstrbutons. In partcular, for the class condtonal dstrbutons we can assume the features are fully dependent, partly dependent, or ndependent (!). x X X X m X X X m X x (a) (b) (c)

2 Gaussan Class-Condtonal Dstrbutons If all features are contnuous, a popular choce s a Gaussan class-condtonal. { p(x y = k, θ) = πσ / exp } (x µ k)σ (x µ k ) Fttng: use the followng amazng and useful fact. The maxmum lkelhood ft of a Gaussan to some data s the Gaussan whose mean s equal to the data mean and whose covarance s equal to the sample covarance. [Try to prove ths as an exercse n understandng lkelhood, algebra, and calculus all at once!] Seems easy. And works amazngly well. But we can do even better wth some smple regularzaton... Gaussan Bayes Classfer Maxmum lkelhood estmates for parameters: prors π k : use observed frequences of classes (plus smoothng) means µ k : use class means covarance Σ: use data from sngle class or pooled data (x m µ y m) to estmate full/dagonal covarances Compute the posteror va Bayes rule: p(x y = k, θ)p(y = k π) p(y = k x, θ) = j p(x y = j, θ)p(y = j π) = exp{µ k Σ x µ k Σ µ k / + log π k } j exp{µ j Σ x µ j Σ µ j / + log π j } = e β k x / j eβ j x = exp{β k x}/z where β k = [Σ µ k ; (µ k Σ µ k + log π k )] and we have augmented x wth a constant component always equal to (bas term). Regularzed Gaussans Idea : assume all the covarances are the same (te parameters). Ths s exactly Fsher s lnear dscrmnant analyss. x x (a) x x Idea : Make ndependence assumptons to get dagonal or dentty-multple covarances. (Or sparse nverse covarances.) More on ths n a few mnutes... Idea : add a bt of the dentty matrx to each sample covarance. Ths fattens t up n drectons where there wasn t enough data. Equvalent to usng a Wshart pror on the covarance matrx. (b) Softmax/Logt The squashng functon s known as the softmax or logt: φ k (z) ez k g(η) = j ez j + e η It s nvertble (up to a constant): z k = log φ k + c η = log(g/ g) Dervatve s easy: φ k z j = φ k (δ kj φ j ) dg = g( g) dη φ( z ).... z

3 Lnear Geometry Takng the rato of any two posterors (the odds ) shows that the contours of equal parwse probablty are lnear surfaces n the feature space: p(y = k x, θ) p(y = j x, θ) = exp { (β k β j ) x } The parwse dscrmnaton contours p(y k ) = p(y j ) are orthogonal to the dfferences of the means n feature space when Σ = σi. For general Σ shared b/w all classes the same s true n the transformed feature space w = Σ x. The prors do not change the geometry, they only shft the operatng pont on the logt by the log-odds log(π k /π j ). Thus, for equal class-covarances, we obtan a lnear classfer. If we use dfference covarances, the decson surfaces are conc sectons and we have a quadratc classfer. Dscrete Bayesan Classfer If the nputs are dscrete (categorcal), what should we do? The smplest class condtonal model s a jont multnomal (table): p(x = a, x = b,... y = c) = η c ab... Ths s conceptually correct, but there s a bg practcal problem. Fttng: ML params are observed counts: ηab... c = n [y n = c][x = a][x = b][...][...] n [y n = c] Consder the x dgts at gray levels. How many entres n the table? How many wll be zero? What happens at test tme? Doh! We obvously need some regularlzaton. Smoothng wll not help much here. Unless we know about the relatonshps between nputs beforehand, sharng parameters s hard also. But what about ndependence? Exponental Famly Class-Condtonals Bayes Classfer has the same softmax form whenever the class-condtonal denstes are any exponental famly densty: p(x y = k, η k ) = h(x) exp{ηk x a(η k)} p(y = k x, η) = p(x y = k, η k)p(y = k π) j p(x y = j, η j)p(y = j π) = exp{η k x a(η k)} j exp{η j x a(η j)} = eβ k x j eβ j x where β k = [η k ; a(η k )] and we have augmented x wth a constant component always equal to (bas term). Resultng classfer s lnear n the suffcent statstcs. Nave (Idot s) Bayes Classfer Assumpton: condtoned on class, attrbutes are ndependent. p(x y) = p(x y) Sounds crazy rght? Rght! But t works. Algorthm: sort data cases nto bns accordng to y n. Compute margnal probabltes p(y = c) usng frequences. For each class, estmate dstrbuton of th varable: p(x y = c). At test tme, compute argmax c p(c x) usng c(x) = argmax c p(c x) = argmax c [log p(x c) + log p(c)] = argmax c [log p(c) + log p(x c)]

4 Dscrete (Multnomal) Nave Bayes Dscrete features x, assumed ndependent gven the class label y. p(x = j y = k) = η jk p(x y = k, η) = η [x =j] jk j Classfcaton rule: p(y = k x, η) = π k j η[x =j] jk q π q j η[x =j] jq = eβ k x q eβ q x β k = log[η k... η jk... η jk... log π k ] x = [x =; x =;... ; x =j;... ; ] X X (a) X m Gaussan Nave Bayes Ths s just a Gaussan Bayes Classfer wth a separate dagonal covarance matrx for each class. Equvalent to fttng a one-dmensonal Gaussan to each nput for each possble class. Decson surfaces are quadratcs, not lnear... Fttng Dscrete Nave Bayes ML parameters are class-condtonal frequency counts: ηjk = m [x m = j][y m = k] m [ym = k] How do we know? Wrte down the lkelhood: l(θ; D) = log p(y m π) + log p(x m y m, η) m m and optmze t by settng ts dervatve to zero (careful! enforce normalzaton wth Lagrange multplers): l(η; D) = [x m = j][y m = k] log η jk + λ k ( j η jk) m jk k l = m [x m = j][y m = k] λ η jk η k jk l = λ η k = [y m = k] ηjk = above jk m Dscrmnatve Models Parametrze p(y x) drectly, forget p(x, y) and Bayes rule. As long as p(y x) or dscrmnants f(y x) are lnear functons of x (or monotone transforms), decson surfaces wll be pecewse lnear. Don t need to model the densty of the features. Some densty models have lots of parameters. Many denstes gve same lnear classfer. But we cannot generate new labeled data. Optmze the same cost functon we use at test tme. x x

5 Logstc/Softmax Regresson Model: y s a multnomal random varable whose posteror s the softmax of lnear functons of any feature vector. p(y = k x, θ) = eθ k x j eθ j x Fttng: now we optmze the condtonal lkelhood: Jont vs. Condtonal Models Many of the methods we have seen so far have lnear or pecewse lnear decson surfaces n some space x: LDA, perceptron, Gaussan Bayes, Nave Bayes, KNN,... But the crtera used to fnd ths hyperplane s dfferent: Nave Bayes s a jont model; t optmzes p(x, y) = p(x)p(y x). Logstc Regresson s condtonal: optmzes p(y x) drectly. l(θ; D) = mk [y m = k] log p(y = k x m, θ) = mk y m k log pm k l θ = mk l m k p m k p m k z m z m θ.. = mk y m k p m k p m k (δ k p m )xm. y.. = m (y m k pm k )xm..... x More on Logstc Regresson Hardest Part: pckng the feature vector x. Amazng fact: the condtonal lkelhood s (almost) convex n the parameters θ. Stll no local mnma! Gradent s easy to compute; so easy (f slow) to optmze usng gradent descent or Newton-Raphson / IRLS. Why almost? Consder what happens f there are two features wth dentcal classfcaton patterns n our tranng data. Logstc Regresson can only see the sum of the correspondng weghts. Soluton? Weght decay: add ɛ θ to the cost functon, whch subtracts ɛθ from each gradent. Why s ths method called logstc regresson? It should really be called softmax lnear regresson. Log odds (logt) between any two classes s lnear n parameters. Nosy-OR Classfer Many probablstc models can be obtaned as nosy versons of formulas from propostonal logc. Nosy-OR: each nput x actvates output y w/some probablty. p(y = x, α) = α x = exp x log α Lettng θ = log α we get yet another lnear classfer: p(y = x, θ) = e θ x

6 Other Models Not Covered Non-parametrc (e.g. K-nearest-neghbour). Semparametrc (e.g. kernel classfers, support vector machnes, Gaussan processes). Probt regresson. Complementary log-log. Generalzed lnear models. Some return a value for y wthout a dstrbuton. Classfcaton va Regresson Bnary case: p(y = x) s also the condtonal expectaton. So we could forget that y was a dscrete (categorcal) random varable and just attempt to model p(y x) usng regresson. One dea: do regresson to an ndcator matrx. For two classes, ths s equvalent to LDA. For or more, dsaster... Very bad dea! Nose models (e.g. Gaussan) for regresson are totally napproprate, and fts are oversenstve to outlers. Furthermore, gves unreasonable predctons < and >... y y x x

Probabilistic Classification: Bayes Classifiers 2

Probabilistic Classification: Bayes Classifiers 2 CSC Machne Learnng Lecture : Classfcaton II September, Sam Rowes Probablstc Classfcaton: Baes Classfers Generatve model: p(, ) = p()p( ). p() are called class prors. p( ) are called class-condtonal feature