Introduction to classifiers: Bayesian classification
CISC 5800, Professor Daniel Leeds

Train / Test
- Goal: learn a function C to maximize correct labels (Y) based on features (X)
- Example: word-count features for a document. A document with jungle words (lion: 6, monkey: 4, ...) maps through C to the class "jungle"; a document with finance words (broker: 4, ...) maps through C to the class "wallstreet"; C(x) = y

Giraffe detector
- Feature X: height; class Y: True or False (is giraffe or is not giraffe)
- Learn optimal classification parameter(s); parameter: x_thresh
- Example function: C(x) = True if x > x_thresh, False otherwise

Learning our classifier parameter(s)
- Adjust parameter(s) based on observed data
- Training set: contains features and corresponding labels (table: example heights X with labels Y = True/False)

The testing set
- Does the classifier correctly label new data?
- The testing set should be distinct from the training set!

Be careful with your training set
- What if we train with only baby giraffes and ants?
- What if we train with only T-rexes and adult giraffes?
- (plot: heights of baby giraffe, cat, lion, T-rex, adult giraffe)
- Example of good performance: 90% correct labels
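Below is a minimal Python sketch of this threshold classifier. The training/testing heights and the fitting rule (scan candidate thresholds for the best training accuracy) are illustrative assumptions, not the slides' exact numbers:

```python
# Toy giraffe detector: classify True if height > x_thresh.
# Heights and labels are illustrative stand-ins for the slide's training table.
train = [(2.5, True), (2.1, True), (2.8, True), (1.1, False), (0.9, False)]

def classify(x, x_thresh):
    return x > x_thresh

def accuracy(data, x_thresh):
    return sum(classify(x, x_thresh) == y for x, y in data) / len(data)

# Fit x_thresh by scanning candidate thresholds (midpoints between sorted heights)
heights = sorted(x for x, _ in train)
candidates = [(a + b) / 2 for a, b in zip(heights, heights[1:])]
x_thresh = max(candidates, key=lambda t: accuracy(train, t))

# Evaluate on held-out examples, distinct from the training set
test = [(2.4, True), (1.3, False)]
print(x_thresh, accuracy(test, x_thresh))
```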
Training vs. testing
- Training: learn parameters from a set of data in each class
- Testing: measure how often the classifier correctly identifies new data
- More training reduces classifier error ε
- Too much training causes worse testing error: overfitting
- (plot: error ε vs. size of training set)

Quick probability review
- (table: joint probabilities P(G,H) for G in {A, B, C, D} and H in {False, True})
- Compute: P(G=C, H=True); P(H=True); P(H=True | G=C)
- P(G=C | H=True) = P(G=C, H=True) / P(H=True)

Bayes rule
- P(A|B) = P(B|A) P(A) / P(B)
- Typically: P(θ|D) = P(D|θ) P(θ) / P(D), where D is the observed data and θ are the parameters to describe that data
- Our job is to find the most likely parameters θ for given data
- Posterior probability: probability of parameters θ for data D: P(θ|D)
- Likelihood: probability of data D given it is from parameters θ: P(D|θ)
- Prior: probability of observing parameters θ: P(θ)
- Parameters may be treated as analogous to class

Typical classification approaches
- MAP (Maximum A Posteriori): determine the parameters/class with maximum probability: argmax_θ P(θ|D)
- MLE (Maximum Likelihood): determine the parameters/class which maximize the probability of the data: argmax_θ P(D|θ)

Likelihood: P(D|θ)
- Each parameter setting has its own distribution of possible data
- The distribution is described by parameter(s) θ
- Example: classes {Horse, Dog}; feature: RunningSpeed; model each class as a Gaussian with fixed σ and class means μ_horse, μ_dog
- (plot: Gaussian likelihoods for each class over RunningSpeed)

The prior: P(θ)
- Certain parameters/classes are more common than others
- Classes {Horse, Dog}: P(Horse) = .05, P(Dog) = .95
- High likelihood may not mean high posterior. Which is higher: P(Horse | D=9) or P(Dog | D=9)?
- P(θ|D) ∝ P(D|θ) P(θ)
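The Horse/Dog example can be made concrete in a few lines of Python. The means, σ, and test speed below are illustrative assumptions (only the .05/.95 priors come from the slide); the sketch shows how one observation can have the higher likelihood under Horse yet the higher posterior under Dog:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a Gaussian N(mu, sigma^2) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Class models: means and sigma are illustrative assumptions; priors are the slide's
classes = {"Horse": {"mu": 10.0, "prior": 0.05},
           "Dog":   {"mu": 5.0,  "prior": 0.95}}
sigma = 2.0
x = 9.0  # observed RunningSpeed D

for name, c in classes.items():
    likelihood = gaussian_pdf(x, c["mu"], sigma)   # P(D | class)
    posterior_score = likelihood * c["prior"]      # proportional to P(class | D); P(D) omitted
    print(f"{name}: likelihood {likelihood:.4f}, likelihood x prior {posterior_score:.4f}")

# MLE picks Horse (higher likelihood); MAP picks Dog (higher likelihood x prior)
```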
Review
- Classify by finding the class with max posterior or max likelihood
- argmax_θ P(θ|D) ∝ P(D|θ) P(θ): Posterior ∝ Likelihood × Prior ("∝" means proportional)
- We ignore the P(D) denominator because D stays the same while comparing different classes (θ)

Learning probabilities
- We have a coin biased to favor one side; how can we calculate the bias?
- Data (D): {HHTH, TTHH, TTTT, HTTT}
- Bias (θ): probability of H
- P(D|θ) = θ^H (1-θ)^T, where H = # heads and T = # tails

Optimization: finding the maximum likelihood
- argmax_θ P(D|θ) = argmax_θ θ^H (1-θ)^T
- Equivalently, maximize log P(D|θ): argmax_θ [H log θ + T log(1-θ)], θ = probability of Head

The properties of logarithms
- (plot: log(x) and exp(x))
- e^a = b  <=>  log b = a
- a < b  <=>  log a < log b
- log(ab) = log a + log b
- log(a^n) = n log a
- Convenient when dealing with small probabilities, whose products underflow; in log space the exponents simply add (e.g., 10^-10 × 10^-7 = 10^-17  ->  -10 + -7 = -17)

Optimization: finding zero slope
- The location of the maximum has slope 0
- d/dθ [H log θ + T log(1-θ)] = 0  ->  H/θ - T/(1-θ) = 0
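A short sketch computing the MLE bias for the slide's coin data, plus a grid search confirming that the log-likelihood peaks at the same θ:

```python
import math

# Coin-flip data from the slides
D = ["HHTH", "TTHH", "TTTT", "HTTT"]
H = sum(s.count("H") for s in D)   # 6 heads
T = sum(s.count("T") for s in D)   # 10 tails

theta_mle = H / (H + T)            # closed-form MLE: H / (H + T)
print(theta_mle)                   # 0.375

# Sanity check: H*log(theta) + T*log(1-theta) peaks at the same value
def loglik(theta):
    return H * math.log(theta) + T * math.log(1 - theta)

grid = [i / 1000 for i in range(1, 1000)]
print(max(grid, key=loglik))       # ~0.375
```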
Intuition of the MLE result
- θ = H / (H + T): the probability of getting heads is # heads divided by # total flips

Finding the maximum a posteriori
- P(θ|D) ∝ P(D|θ) P(θ)
- Incorporating the Beta prior: P(θ) = θ^α (1-θ)^β / B(α, β)
- argmax_θ P(D|θ) P(θ) = argmax_θ [log P(D|θ) + log P(θ)]

MAP: estimating θ
- argmax_θ [log P(D|θ) + log P(θ)] = argmax_θ [H log θ + T log(1-θ) + α log θ + β log(1-θ) - log B(α, β)]
- Set the derivative to 0: (H + α)/θ - (T + β)/(1-θ) = 0  ->  H + α = θ (H + T + α + β)

Intuition of the MAP result
- θ = (H + α) / (H + α + T + β)
- The prior has strong influence when H and T are small, and weak influence when H and T are large (see the numeric sketch at the end of this section)

Multiple features
- Dr. Lyons' lecture: position coordinates (x, y, angle); pictures (pixels, sonar)
- Sometimes multiple features provide new information. Robot localization: (2,4) is different from (2,2) and from (4,4)
- Sometimes multiple features are redundant. Super-hero fan: Watch Batman? Watch Superman?

Assuming independence
- Is there a storm? P(storm | lightning, wind): P(S|L,W)
- P(S|L,W) = P(L,W|S) P(S) / P(L,W) ∝ P(L,W|S) P(S)
- Let's assume L and W are independent given S: P(L,W|S) = ?
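A numeric sketch of the MAP result for the coin: α = β = 5 is an arbitrary illustrative prior, and the counts reuse H = 6, T = 10 from the slide's data:

```python
def map_coin(H, T, alpha, beta):
    """MAP estimate under the slide's Beta prior: (H + alpha) / (H + alpha + T + beta)."""
    return (H + alpha) / (H + alpha + T + beta)

# Small counts: the prior (alpha = beta = 5, favoring theta = 0.5) pulls the estimate up
print(map_coin(6, 10, 5, 5))      # 11/26 ~ 0.423, vs. MLE 6/16 = 0.375

# Large counts with the same heads fraction: the prior barely matters
print(map_coin(600, 1000, 5, 5))  # 605/1610 ~ 0.376
```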
Estimating P(Lightning | Storm): MLE = counting data points
- Is there lightning? Yes or No (a binary variable, like Heads or Tails)
- P(L=yes | S=yes): probability of lightning given there is a storm; P(L=no | S=yes) = ?
- What is the MLE of P(L=yes | S=yes)?
- P(A=a_i | C=c_j) = #D{A=a_i AND C=c_j} / #D{C=c_j}
- P(A=a_i, B=b_k | C=c_j) = #D{A=a_i AND B=b_k AND C=c_j} / #D{C=c_j}
- Note: both A and C can take on multiple values (binary and beyond)
- What is the MLE of P(L=yes | S=no)?

P(L,W|S) and P(A_1,...,A_n|C)
- With conditional independence: P(L,W|S) = P(L|S) P(W|S)
- Non-independent: estimate P(L=yes,W=yes|S=yes), P(L=yes,W=no|S=yes), P(L=no,W=yes|S=yes); deduce P(L=no,W=no|S=yes) as 1 minus the sum of the others; repeat for S=no.
  Number of parameters to estimate: 2^n - 1 for each class, 2(2^n - 1) in total
- Independent: estimate P(L=yes|S=yes), deduce P(L=no|S=yes) = 1 - P(L=yes|S=yes); estimate P(W=yes|S=yes), deduce P(W=no|S=yes) = 1 - P(W=yes|S=yes); repeat for S=no.
  Number of parameters to estimate: n for each class, 2n in total
- Note: in this slide, all variables are binary

Naïve Bayes: classification + learning
- Want to know P(Y | X_1, X_2, ..., X_n)
- Compute P(X_1, X_2, ..., X_n | Y) and P(Y), with P(X_1, X_2, ..., X_n | Y) = Π_i P(X_i | Y)
- Learning: estimate each P(X_i | Y) through MLE:
  P(X_i = x_k | Y = y_j) = #D(X_i = x_k AND Y = y_j) / #D(Y = y_j)
- Estimate P(Y): P(Y = y_j) = #D(Y = y_j) / |D|
- Note: both X and Y can take on multiple values (binary and beyond)

Shortcoming of MLE
- What if X_i = x_k AND Y = y_j is very rare, but possible?
- Example, classifying articles: X_i: does word i appear in the article? Y = {jungle, wallstreet}
- X_i = broker is very unlikely in jungle, so MLE gives P(X_i=broker | Y=jungle) = 0
- Then P(X_1=x_1, ..., X_n=x_n | Y=y_j) = Π_i P(X_i=x_i | Y=y_j) = 0: a single zero factor wipes out the whole product, no matter how jungle-like the other words are (e.g., lion: 6, monkey: 4)
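A minimal counting sketch of these MLE estimates on an invented lightning/wind/storm log (the records are made up for illustration). Note how the last lines also exhibit the zero-count problem described above:

```python
# Hypothetical records (Lightning, Wind, Storm); values invented for illustration
data = [("yes", "yes", "yes"), ("yes", "no", "yes"), ("no", "yes", "yes"),
        ("no", "no", "no"), ("no", "yes", "no"), ("no", "no", "no")]

def p_cond(a_idx, a_val, s_val):
    """MLE by counting: P(A=a | S=s) = #D{A=a AND S=s} / #D{S=s}."""
    joint = sum(1 for r in data if r[a_idx] == a_val and r[2] == s_val)
    return joint / sum(1 for r in data if r[2] == s_val)

def p_class(s_val):
    """MLE class prior: P(S=s) = #D{S=s} / |D|."""
    return sum(1 for r in data if r[2] == s_val) / len(data)

print(p_cond(0, "yes", "yes"))  # P(L=yes | S=yes) = 2/3

# Naive Bayes score for a new observation (L=yes, W=no):
for s in ("yes", "no"):
    score = p_cond(0, "yes", s) * p_cond(1, "no", s) * p_class(s)
    print(s, score)  # pick the class with the larger score
# L=yes never co-occurs with S=no here, so the S=no score collapses to 0:
# exactly the MLE shortcoming described above
```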
Estimate each P(X_i|Y) through MAP
- Incorporating a prior for each class, β_j:
  P(X_i = x_k | Y = y_j) = (#D(X_i = x_k AND Y = y_j) + β_j) / (#D(Y = y_j) + Σ_m β_m)
  P(Y = y_j) = (#D(Y = y_j) + β_j) / (|D| + Σ_m β_m)
- Extra note: β_j is the prior frequency of class j; Σ_m β_m sums the frequencies of all classes
- Note: both X and Y can take on multiple values (binary and beyond)

Benefits of Naïve Bayes
- Very fast learning and classifying: 2n+1 parameters, not 2(2^n - 1)+1 parameters
- Often works even if features are NOT independent

Classification strategy: generative vs. discriminative
- Generative, e.g., Bayes/Naïve Bayes: identify a probability distribution for each class; determine the class with maximum probability for a data example
- Discriminative, e.g., Logistic Regression: identify the boundary between classes; determine which side of the boundary a new data example is on

Linear algebra: data features
- Vector: a list of numbers; each number describes a data feature
- Matrix: a list of lists of numbers; features for each data point
- (table: # of word occurrences of wolf, lion, monkey, broker, analyst, dividend in Document 1, Document 2, Document 3)

Feature space
- Each data feature defines a dimension in space
- (plot: documents doc1, doc2, doc3 as points in word-count space)

The dot product
- The dot product compares two vectors: a · b = Σ_{i=1..n} a_i b_i = a^T b
- Example: a = [5, 1, 0], b = [1, 0, 2] gives a · b = 5·1 + 1·0 + 0·2 = 5
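A one-function sketch of the MAP-smoothed estimate above, applied to the broker-in-jungle example; the pseudo-counts and corpus size are illustrative assumptions:

```python
# MAP-smoothed counting estimate: pseudo-counts keep rare-but-possible
# events from getting probability 0 (the MLE shortcoming shown earlier).
def p_cond_map(count_joint, count_class, beta_j, beta_sum):
    """P(X=x | Y=y) = (#D{X=x AND Y=y} + beta_j) / (#D{Y=y} + sum_m beta_m)."""
    return (count_joint + beta_j) / (count_class + beta_sum)

# "broker" never appears in the 50 jungle articles of a hypothetical corpus:
print(p_cond_map(0, 50, 1, 2))   # MLE would give 0/50 = 0; MAP gives 1/52 ~ 0.019
```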
The dot product, continued
- The squared magnitude of a vector is the sum of the squares of its elements: ‖a‖² = a · a = Σ_i a_i², so ‖a‖ = sqrt(a · a)
- If a has unit magnitude, a · b = Σ_{i=1..n} a_i b_i is the projection of b onto a
- Example: a = [0.6, 0.8] (unit magnitude), b = [2, 1]: a · b = 0.6·2 + 0.8·1 = 2.0

Separating boundary, defined by w
- A separating hyperplane splits class 0 and class 1
- The plane is defined by the vector w perpendicular to the plane
- Is data point x in class 0 or class 1? w^T x > 0 -> class 1; w^T x < 0 -> class 0

From real-number projection to 0/1 label
- Binary classification: 0 is class A, 1 is class B
- The sigmoid function stands in for p(x|y): g(h) = 1/(1 + e^(-h)), where h = w^T x = Σ_j w_j x_j
- p(x|y=1; w) = g(w^T x) = 1/(1 + e^(-w^T x))
- p(x|y=0; w) = 1 - g(w^T x) = e^(-w^T x)/(1 + e^(-w^T x))
- (plot: sigmoid g(h) rising from 0 to 1, with g(0) = 0.5)

Learning parameters for classification
- Similar to MLE for the Bayes classifier
- Likelihood for data points y^1, ..., y^n (really framed as the posterior p(y|x)):
  if y^i is in class A (y^i = 0), multiply in (1 - g(x^i; w)); if y^i is in class B (y^i = 1), multiply in g(x^i; w)
- L(y|x; w) = Π_i g(x^i; w)^(y^i) (1 - g(x^i; w))^(1-y^i)
- log L = Σ_i [ y^i log g(x^i; w) + (1 - y^i) log(1 - g(x^i; w)) ]
- Substituting g(h) = 1/(1 + e^(-h)) and 1 - g(h) = e^(-h)/(1 + e^(-h)) simplifies this to
  log L = Σ_i [ y^i w^T x^i - w^T x^i + log g(w^T x^i) ]
- Differentiating with respect to each weight w_j:
  ∂ log L / ∂w_j = Σ_i [ y^i x_j^i - x_j^i + x_j^i e^(-w^T x^i)/(1 + e^(-w^T x^i)) ] = Σ_i x_j^i (y^i - g(w^T x^i))
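A sketch that checks this gradient derivation numerically: it compares the analytic Σ_i x_j^i (y^i - g(w^T x^i)) against a finite-difference derivative of the log-likelihood, on a small made-up dataset:

```python
import math

def g(h):
    """Sigmoid g(h) = 1 / (1 + e^(-h))."""
    return 1.0 / (1.0 + math.exp(-h))

def log_likelihood(w, X, Y):
    """Sum_i [ y_i log g(w.x_i) + (1 - y_i) log(1 - g(w.x_i)) ]."""
    total = 0.0
    for x, y in zip(X, Y):
        h = sum(wj * xj for wj, xj in zip(w, x))
        total += y * math.log(g(h)) + (1 - y) * math.log(1 - g(h))
    return total

# Tiny made-up dataset: two features per point, binary labels
X = [[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -0.5]]
Y = [1, 1, 0, 0]
w = [0.3, -0.2]

# Analytic gradient for w_0: sum_i x_0^i (y^i - g(w.x^i))
grad0 = sum(x[0] * (y - g(sum(wj * xj for wj, xj in zip(w, x))))
            for x, y in zip(X, Y))

# Finite-difference estimate of the same derivative
eps = 1e-6
num0 = (log_likelihood([w[0] + eps, w[1]], X, Y) -
        log_likelihood([w[0] - eps, w[1]], X, Y)) / (2 * eps)
print(grad0, num0)  # the two values should agree closely
```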
Iterative gradient descent
- y^i: true data label; g(w^T x^i): computed data label
- Begin with initial guessed weights w
- For each data point (y^i, x^i), update each weight w_j: w_j <- w_j + ε x_j^i (y^i - g(w^T x^i))
- Choose ε so the change is not too big or too small
- Intuition for x_j^i (y^i - g(w^T x^i)):
  if y^i = 1 and g(w^T x^i) = 0, and x_j^i > 0, make w_j larger and push w^T x^i to be larger;
  if y^i = 0 and g(w^T x^i) = 1, and x_j^i > 0, make w_j smaller and push w^T x^i to be smaller

MAP for a discriminative classifier
- MLE: P(x|y=1; w) ~ g(w^T x)
- MAP: P(y=1|x) = P(x|y=1; w) P(w) ~ g(w^T x) × ???
- P(w) priors: L2 regularization minimizes the magnitude of all weights; L1 regularization minimizes the number of non-zero weights

MAP with L2 regularization
- P(y=1|x, w) = P(x|y=1; w) P(w), with a Gaussian-style prior on each weight: P(w_j) ∝ e^(-λ w_j²)
- L(y|x; w) = [ Π_i g(x^i; w)^(y^i) (1 - g(x^i; w))^(1-y^i) ] × Π_j e^(-λ w_j²)
- Taking logs: Σ_i [ y^i w^T x^i - w^T x^i + log g(w^T x^i) ] - λ Σ_j w_j²
- ∂/∂w_j: Σ_i x_j^i (y^i - g(w^T x^i)) - 2λ w_j
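A compact sketch of the full training loop, combining the per-point update rule with the -2λw_j term from the L2 prior; the data, learning rate ε, penalty λ, and epoch count are all illustrative choices:

```python
import math

def g(h):
    return 1.0 / (1.0 + math.exp(-h))

# Toy data (two features, binary labels) and hyperparameters; all illustrative
X = [[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -0.5]]
Y = [1, 1, 0, 0]
eps, lam = 0.1, 0.01   # learning rate epsilon and L2 penalty lambda
w = [0.0, 0.0]

for epoch in range(100):
    for x, y in zip(X, Y):
        h = sum(wj * xj for wj, xj in zip(w, x))
        for j in range(len(w)):
            # w_j <- w_j + eps * [ x_j (y - g(w^T x)) - 2 lambda w_j ]
            w[j] += eps * (x[j] * (y - g(h)) - 2 * lam * w[j])

print(w)
# Predict by thresholding the sigmoid at 0.5 (equivalently, w^T x at 0)
print([g(sum(wj * xj for wj, xj in zip(w, x))) > 0.5 for x in X])
```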