Generative vs. Discriminative Classifiers

Generative vs. Discriminative Classifiers

Goal: wish to learn f: X → Y, e.g., P(Y|X)

Generative classifiers (e.g., Naïve Bayes):
- Assume some functional form for P(X|Y), P(Y). This is a generative model of the data!
- Estimate parameters of P(X|Y), P(Y) directly from training data
- Use Bayes rule to calculate P(Y|X)

Discriminative classifiers:
- Directly assume some functional form for P(Y|X). This is a discriminative model of the data!
- Estimate parameters of P(Y|X) directly from training data

Naïve Bayes vs. Logistic Regression

Consider Y boolean, X continuous, X = <X_1 ... X_m>. Functional forms and number of parameters to estimate:
- NB (Gaussian Naïve Bayes):
  P(y = k | x) = π_k exp( Σ_j [ −(x_j − μ_{k,j})² / (2σ_{k,j}²) − log σ_{k,j} ] ) / Σ_{k'} π_{k'} exp( Σ_j [ −(x_j − μ_{k',j})² / (2σ_{k',j}²) − log σ_{k',j} ] )
- LR:
  P(y = 1 | x) = 1 / ( 1 + exp( −(θ_0 + Σ_j θ_j x_j) ) )

Estimation method:
- NB parameter estimates are uncoupled
- LR parameter estimates are coupled
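The contrast between the two estimation styles is easy to see in code. Below is a minimal sketch, assuming scikit-learn and a small synthetic two-class dataset (the data and variable names are illustrative, not from the lecture), that fits a Gaussian Naïve Bayes model and a logistic regression model to the same examples:

    import numpy as np
    from sklearn.naive_bayes import GaussianNB
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    # Synthetic data: two Gaussian classes in m = 2 dimensions (illustrative only).
    X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
    y = np.array([0] * 100 + [1] * 100)

    # Generative: estimate P(X|Y) and P(Y), then apply Bayes rule to get P(Y|X).
    nb = GaussianNB().fit(X, y)          # per-class means/variances + class priors
    # Discriminative: parameterize P(Y|X) directly and fit the weights jointly.
    lr = LogisticRegression().fit(X, y)  # coupled weight estimates

    print(nb.theta_, nb.var_)            # uncoupled per-class, per-feature estimates
    print(lr.coef_, lr.intercept_)       # a single weight vector for P(Y=1|x)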

Naïve Bayes vs. Logistic Regression

Asymptotic comparison (# training examples → infinity):
- when model assumptions are correct: NB and LR produce identical classifiers
- when model assumptions are incorrect: LR is less biased (it does not assume conditional independence) and is therefore expected to outperform NB

Naïve Bayes vs. Logistic Regression

Non-asymptotic analysis (see [Ng & Jordan, 2002]):
- convergence rate of parameter estimates: how many training examples are needed to assure good estimates?
  NB: order log m (where m = # of attributes in X)
  LR: order m
- NB converges more quickly to its (perhaps less helpful) asymptotic estimates

Rate of convergence: logistic regression

Let h_{Dis,n} be logistic regression trained on n examples in m dimensions. Then with high probability its error exceeds that of the asymptotic classifier h_{Dis,∞} by at most O( sqrt( (m/n) log(n/m) ) ).

Implication: if we want this gap to be at most some small constant ε_0, it suffices to pick order m examples.
- Convergence to its asymptotic classifier in order m examples
- The result follows from Vapnik's structural risk bound, plus the fact that the VC dimension of m-dimensional linear separators is m

Rate of convergence: naïve Bayes parameters

Let any ε, δ > 0 be fixed, and assume that for some fixed ρ_0 > 0 the class prior is bounded below by ρ_0. Then, after a number of training examples of order log m, with probability at least 1 − δ:
1. for discrete inputs, for all attributes X_i and class labels b, the estimated P(X_i | Y = b) are within ε of their asymptotic values;
2. for continuous inputs, for all X_i and b, the estimated class-conditional means and variances are within ε of their asymptotic values.

Some experiments from UCI data sets

Summary
- Naïve Bayes classifier: What is the assumption? Why do we use it? How do we learn it?
- Logistic regression: the functional form follows from the Naïve Bayes assumptions
  - for Gaussian Naïve Bayes assuming class-independent variance
  - for discrete-valued Naïve Bayes too
  - but the training procedure picks parameters without the conditional independence assumption
- Gradient ascent/descent: a general approach when closed-form solutions are unavailable
- Generative vs. Discriminative classifiers
- Bias vs. variance tradeoff

Machine Learning 10-701/15-781, Fall 2011
Linear Regression and Sparsity
Eric Xing
Lecture 4, September 2011
Reading:

Machine learning for apartment hunting

Now you've moved to Pittsburgh!! And you want to find the most reasonably priced apartment satisfying your needs: square footage, # of bedrooms, distance to campus ...

Living area (ft²)   # bedroom   Rent ($)
230                 ...         600
506                 ...         1000
433                 ...         1100
109                 ...         500
150                 ...         ?
270                 1.5         ?

The learning problem

- Features: living area, distance to campus, # bedroom, ...
  Denote as x = [x_1, x_2, ..., x_k]
- Target: rent
  Denoted as y
- Training set:
  X = [x_1, x_2, ..., x_n]^T  (each row: living area, location, # bedroom, ...)
  y = [y_1, y_2, ..., y_n]^T  (rent)

Linear Regression

- Assume that Y (target) is a linear function of X (features):
  e.g.: ŷ = θ_0 + θ_1 x_1 + θ_2 x_2
- Let's assume a vacuous "feature" x_0 = 1 (this is the intercept term, why?), and define the feature vector to be x = [x_0, x_1, ..., x_k];
  then we have the following general representation of the linear function:
  ŷ_i = Σ_j θ_j x_{i,j} = θ^T x_i
- Our goal is to pick the optimal θ. How?
  We seek θ that minimizes the following cost function:
  J(θ) = (1/2) Σ_{i=1}^n ( ŷ(x_i) − y_i )²
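As a concrete anchor for the notation, here is a minimal numpy sketch of the cost J(θ) with the intercept handled by a constant x_0 = 1 feature; the variable names and the toy data values are illustrative only:

    import numpy as np

    def add_intercept(X):
        """Prepend the vacuous feature x_0 = 1 to every example."""
        return np.hstack([np.ones((X.shape[0], 1)), X])

    def cost(theta, X, y):
        """J(theta) = 1/2 * sum_i (theta^T x_i - y_i)^2."""
        residuals = X @ theta - y
        return 0.5 * residuals @ residuals

    # Tiny example with made-up apartment-style data (illustrative values only).
    X = add_intercept(np.array([[230.0], [506.0], [433.0], [109.0]]))
    y = np.array([600.0, 1000.0, 1100.0, 500.0])
    print(cost(np.zeros(X.shape[1]), X, y))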

The Least-Mean-Square (LMS) method

- The cost function:
  J(θ) = (1/2) Σ_{i=1}^n ( x_i^T θ − y_i )²
- Consider a gradient descent algorithm:
  θ_j^{t+1} = θ_j^t − α ∂J(θ)/∂θ_j |_{θ^t}

The Least-Mean-Square (LMS) method

- Now we have the following descent rule:
  θ_j^{t+1} = θ_j^t + α Σ_{i=1}^n ( y_i − x_i^T θ^t ) x_{i,j}
- For a single training point, we have:
  θ_j^{t+1} = θ_j^t + α ( y_i − x_i^T θ^t ) x_{i,j}
- This is known as the LMS update rule, or the Widrow-Hoff learning rule
- This is actually a "stochastic", "coordinate" descent algorithm
- This can be used as an on-line algorithm
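A minimal sketch of the LMS (Widrow-Hoff) update, processing one training point at a time as an on-line algorithm; the learning rate and epoch count are arbitrary illustrative choices and would need tuning (and scaled features) in practice:

    import numpy as np

    def lms(X, y, alpha=1e-2, epochs=50):
        """Stochastic per-example LMS: theta_j <- theta_j + alpha*(y_i - theta^T x_i)*x_ij."""
        theta = np.zeros(X.shape[1])
        for _ in range(epochs):
            for x_i, y_i in zip(X, y):
                theta += alpha * (y_i - x_i @ theta) * x_i
        return theta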

Geometry and convergence of LMS

- (figure: the LMS fit after N = 1, 2, 3 updates)
- Claim: when the step size α satisfies certain conditions, and when certain other technical conditions are satisfied, LMS will converge to an "optimal region".
  θ^{t+1} = θ^t + α ( y_i − x_i^T θ^t ) x_i

Steepest Descent and LMS

- Steepest descent: note that
  ∇_θ J = [ ∂J/∂θ_1, ..., ∂J/∂θ_k ]^T = − Σ_{i=1}^n ( y_i − x_i^T θ ) x_i
  so the update is
  θ^{t+1} = θ^t + α Σ_{i=1}^n ( y_i − x_i^T θ^t ) x_i
- This is a batch gradient descent algorithm
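For comparison, the batch (steepest-descent) version uses the full gradient −Σ_i (y_i − θ^T x_i) x_i at every step; a sketch under the same illustrative conventions as above:

    import numpy as np

    def batch_gd(X, y, alpha=1e-4, iters=1000):
        """Batch gradient descent on J(theta): theta <- theta + alpha * X^T (y - X theta)."""
        theta = np.zeros(X.shape[1])
        for _ in range(iters):
            theta += alpha * X.T @ (y - X @ theta)
        return theta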

The normal equations

- Write the cost function in matrix form:
  J(θ) = (1/2) Σ_i ( x_i^T θ − y_i )²
       = (1/2) ( Xθ − y )^T ( Xθ − y )
       = (1/2) ( θ^T X^T X θ − θ^T X^T y − y^T X θ + y^T y )
- To minimize J(θ), take the derivative and set it to zero:
  ∇_θ J = (1/2) ∇_θ tr( θ^T X^T X θ − θ^T X^T y − y^T X θ + y^T y ) = X^T X θ − X^T y = 0
  The normal equations:  X^T X θ = X^T y
  ⇒ θ* = ( X^T X )^{-1} X^T y

Some matrix derivatives

- For f: R^{m×n} → R, define:
  ∇_A f(A) = [ ∂f/∂A_{ij} ]  (the m×n matrix of partial derivatives)
- Trace:
  tr A = Σ_i A_{ii},   tr a = a (for a scalar a),   tr ABC = tr CAB = tr BCA
- Some facts of matrix derivatives (without proof):
  ∇_A tr AB = B^T,   ∇_A tr ABA^T C = CAB + C^T A B^T,   ∇_A |A| = |A| ( A^{-1} )^T
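The normal equations give the same answer in one step. A minimal sketch, using a least-squares solver rather than forming the explicit inverse (which is the numerically safer route and handles rank deficiency more gracefully):

    import numpy as np

    def normal_equation(X, y):
        """Solve the normal equations X^T X theta = X^T y.
        lstsq minimizes ||X theta - y||, which is equivalent when X has full column rank."""
        theta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return theta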

Comments on the normal equation

- In most situations of practical interest, the number of data points N is larger than the dimensionality k of the input space and the matrix X is of full column rank. If this condition holds, then it is easy to verify that X^T X is necessarily invertible.
- The assumption that X^T X is invertible implies that it is positive definite, thus the critical point we have found is a minimum.
- What if X has less than full column rank? → regularization (later).

Direct and iterative methods

- Direct methods: we can achieve the solution in a single step by solving the normal equation
  - Using Gaussian elimination or QR decomposition, we converge in a finite number of steps
  - It can be infeasible when data are streaming in in real time, or are of very large amount
- Iterative methods: stochastic or steepest gradient
  - Converging in a limiting sense
  - But more attractive in large practical problems
  - Caution is needed for deciding the learning rate α

Convergence rate

- Theorem: the steepest descent algorithm converges to the minimum of the cost characterized by the normal equations, provided the step size α is chosen appropriately.
- A formal analysis of LMS needs more mathematical muscle; in practice, one can use a small α, or gradually decrease α.

A summary

LMS update rule
  θ_j^{t+1} = θ_j^t + α ( y_i − x_i^T θ^t ) x_{i,j}
- Pros: on-line, low per-step cost, fast convergence and perhaps less prone to local optima
- Cons: convergence to the optimum not always guaranteed

Steepest descent
  θ^{t+1} = θ^t + α Σ_i ( y_i − x_i^T θ^t ) x_i
- Pros: easy to implement, conceptually clean, guaranteed convergence
- Cons: batch, often slow to converge

Normal equations
  θ* = ( X^T X )^{-1} X^T y
- Pros: a single-shot algorithm! Easiest to implement.
- Cons: need to compute the pseudo-inverse ( X^T X )^{-1}, which is expensive and raises numerical issues (e.g., the matrix may be singular), although there are ways to get around this

Geometric interpretation of LMS

- The predictions on the training data are:
  ŷ = X θ* = X ( X^T X )^{-1} X^T y
- Note that
  ŷ − y = ( X ( X^T X )^{-1} X^T − I ) y
  and
  X^T ( ŷ − y ) = X^T ( X ( X^T X )^{-1} X^T − I ) y = ( X^T X ( X^T X )^{-1} X^T − X^T ) y = 0 !!
- So ŷ is the orthogonal projection of y onto the space spanned by the columns of X.

Probabilistic interpretation of LMS

- Let us assume that the target variable and the inputs are related by the equation:
  y_i = θ^T x_i + ε_i
  where ε is an error term of unmodeled effects or random noise
- Now assume that ε follows a Gaussian N(0, σ²); then we have:
  p( y_i | x_i ; θ ) = ( 1 / ( √(2π) σ ) ) exp( −( y_i − θ^T x_i )² / ( 2σ² ) )
- By the independence assumption:
  L(θ) = Π_i p( y_i | x_i ; θ ) = ( 1 / ( √(2π) σ ) )^n exp( −Σ_i ( y_i − θ^T x_i )² / ( 2σ² ) )

Probabilistic interpretation of LMS, cont.

- Hence the log-likelihood is:
  ℓ(θ) = n log( 1 / ( √(2π) σ ) ) − ( 1 / σ² ) · ( 1/2 ) Σ_i ( y_i − θ^T x_i )²
- Do you recognize the last term? Yes, it is:
  J(θ) = (1/2) Σ_i ( x_i^T θ − y_i )²
- Thus, under the independence assumption, LMS is equivalent to MLE of θ!

Case study: predicting gene expression

- The genetic picture
  (figure: a DNA sequence, e.g. CGCACGACAA..., with causal SNPs highlighted, and a univariate phenotype, i.e., the expression intensity of a gene)

Association mapping as regression

(figure: genotype strings for individuals 1 through N, each with a phenotype (BMI) value, e.g. 2.5, 4.8, ..., 4.7; most SNPs are benign, one is the causal SNP)

Association mapping as regression

(figure: the same individuals with genotypes coded numerically per SNP)

- Regress the phenotype on the genotype:
  y_i = Σ_j x_{i,j} β_j
- SNPs with large β_j are relevant

Experimental setup

- Asthma dataset
  - 543 individuals, genotyped at 34 SNPs
  - Diploid data was transformed to 0/1 (for homozygotes) or 2 (for heterozygotes)
  - X = 543 × 34 matrix
  - Y = phenotype variable (continuous)
  - A single phenotype was used for regression
- Implementation details
  - Iterative methods: batch update and online update implemented
  - For both methods, the step size α is chosen to be a small fixed value (10⁻⁶). This choice is based on the data used for the experiments.
  - Both methods are only run to a maximum number of epochs or until the change in training MSE is less than 10⁻⁴

Convergence curves

- For the batch method, the training MSE is initially large due to the uninformed initialization
- In the online update, the N per-example updates in every epoch reduce the MSE to a much smaller value

The learned coefficients

(figure: the learned regression coefficients)

Multivariate regression for trait association analysis

(figure: a trait value, e.g. 2.1, modeled as a genotype row G A A C C A G A A G A ... times an unknown vector of association strengths β, i.e. y = Xβ)

Multivariate regression for trait association analysis

(figure: trait, genotype, association strength)
- Many non-zero associations: which SNPs are truly significant?

Sparsity

- One common assumption to make: sparsity.
- Makes biological sense: each phenotype is likely to be associated with a small number of SNPs, rather than all the SNPs.
- Makes statistical sense: learning is now feasible in high dimensions with a small sample size.

Sparsity: in a mathematical sense

- Consider the least squares linear regression problem:
  min_β Σ_i ( y_i − x_i^T β )²
- Sparsity means most of the β_j are zero
  (figure: a coefficient vector β_1, β_2, β_3, ... with most entries equal to zero)
- But enforcing this directly is not convex!!! Many local optima, computationally intractable.

L1 regularization (LASSO) (Tibshirani, 1996)

- A convex relaxation.
- Constrained form:
  min_β Σ_i ( y_i − x_i^T β )²   s.t.   Σ_j |β_j| ≤ t
- Lagrangian form:
  min_β Σ_i ( y_i − x_i^T β )² + λ Σ_j |β_j|
- Still enforces sparsity!
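In practice the Lagrangian form is what off-the-shelf solvers optimize. A hedged sketch using scikit-learn's Lasso on synthetic SNP-style data (the regularization strength, sizes, and true coefficients are placeholders; note that scikit-learn scales the squared-error term by 1/(2n)):

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 30))            # e.g., 100 individuals, 30 SNPs (illustrative)
    true_beta = np.zeros(30)
    true_beta[[3, 17]] = [1.5, -2.0]          # only two truly relevant SNPs
    y = X @ true_beta + 0.1 * rng.normal(size=100)

    lasso = Lasso(alpha=0.1).fit(X, y)        # alpha plays the role of lambda
    print(np.flatnonzero(lasso.coef_))        # most coefficients are driven exactly to zero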

Lasso for reducing false positives

(figure: trait, genotype, association strength)
- Lasso penalty for sparsity:
  J(β) = Σ_i ( y_i − x_i^T β )² + λ Σ_j |β_j|
- Many zero associations (sparse results), but what if there are multiple related traits?

Ridge regression vs. Lasso

- Ridge regression: squared-error loss with an ℓ2 penalty on β
- Lasso: squared-error loss with an ℓ1 penalty on β — HOT!
  (figure: level sets of J(β), i.e. βs with constant J(β), intersecting the ℓ2 ball (βs with constant ℓ2 norm) and the ℓ1 ball (βs with constant ℓ1 norm))
- Lasso (ℓ1 penalty) results in sparse solutions — a vector with more zero coordinates
- Good for high-dimensional problems — don't have to store all coordinates!

Bayesian interpretation

- Treat the distribution parameters θ also as a random variable
- The a posteriori distribution of θ after seeing the data is given by Bayes' rule:
  p( θ | D ) = p( D | θ ) p( θ ) / p( D ),   where   p( D ) = ∫ p( D | θ ) p( θ ) dθ
  posterior = likelihood × prior / marginal likelihood
- The prior p(θ) encodes our prior knowledge about the domain

Regularized least squares and MAP

- What if ( X^T X ) is not invertible?
- Maximize log likelihood + log prior
- I) Gaussian prior with zero mean  ⇒  ridge regression
  Closed form: HW
- Prior belief that β is Gaussian with zero mean biases the solution towards small β
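The ridge (Gaussian-prior MAP) estimate has the well-known closed form θ = ( X^T X + λ I )^{-1} X^T y, which the slide leaves as homework; a minimal numpy sketch, with λ and the data shapes as illustrative choices:

    import numpy as np

    def ridge(X, y, lam=1.0):
        """MAP estimate under a zero-mean Gaussian prior:
        theta = (X^T X + lambda * I)^{-1} X^T y."""
        k = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)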

Regularized least squares and MAP

- What if ( X^T X ) is not invertible?
- Maximize log likelihood + log prior
- II) Laplace prior with zero mean  ⇒  Lasso
  Closed form: HW
- Prior belief that β is Laplace with zero mean biases the solution towards small β

Beyond basic LR

- LR with non-linear basis functions
- Locally weighted linear regression
- Regression trees and multilinear interpolation

Non-linear functions

(figure: examples of non-linear relationships between x and y)

LR with non-linear basis functions

- LR does not mean we can only deal with linear relationships
- We are free to design (non-linear) features under LR:
  y = θ_0 + Σ_{j=1}^m θ_j φ_j(x) = θ^T φ(x)
  where the φ_j(x) are fixed basis functions (and we define φ_0(x) = 1).
- Example: polynomial regression:
  φ(x) := [ 1, x, x², x³ ]
- We will be concerned with estimating (distributions over) the weights θ and choosing the model order M.

Basis functions

There are many basis functions, e.g.:
- Polynomial:  φ_j(x) = x^{j−1}
- Radial basis functions:  φ_j(x) = exp( −( x − μ_j )² / ( 2 s² ) )
- Sigmoidal:  φ_j(x) = σ( ( x − μ_j ) / s )
- Splines, Fourier, wavelets, etc.

1D and 2D RBFs

(figures: a 1D RBF basis and a 2D RBF basis, and the fits obtained after fitting to data)
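Since a basis expansion just replaces x with φ(x), non-linear regression reduces to ordinary linear regression on the transformed features. A small sketch of polynomial and Gaussian RBF feature maps for 1-D inputs (the centers, width, degree, and toy data are arbitrary illustrative choices):

    import numpy as np

    def poly_features(x, degree=3):
        """phi(x) = [1, x, x^2, ..., x^degree] for 1-D inputs."""
        return np.vstack([x**j for j in range(degree + 1)]).T

    def rbf_features(x, centers, s=1.0):
        """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)) for 1-D inputs."""
        return np.exp(-(x[:, None] - centers[None, :])**2 / (2 * s**2))

    x = np.linspace(0, 10, 50)
    y = np.sin(x) + 0.1 * np.random.default_rng(0).normal(size=50)
    Phi = rbf_features(x, centers=np.linspace(0, 10, 8))
    theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # plain least squares on phi(x)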

Good and bad RBFs

(figures: a good 2D RBF, and two bad 2D RBFs)

Overfitting and underfitting

(figures: fits of increasing model order to the same data)
  y = θ_0 + θ_1 x        y = θ_0 + θ_1 x + θ_2 x²        y = Σ_{j=0}^5 θ_j x^j

Bias and variance

- We define the bias of a model to be the expected generalization error even if we were to fit it to a very (say, infinitely) large training set.
- By fitting "spurious" patterns in the training set, we might again obtain a model with large generalization error. In this case, we say the model has large variance.

Locally weighted linear regression

- The algorithm: instead of minimizing
  J(θ) = (1/2) Σ_i ( x_i^T θ − y_i )²
  we now fit θ to minimize
  J(θ) = (1/2) Σ_i w_i ( x_i^T θ − y_i )²
- Where do the w_i's come from?
  w_i = exp( −( x_i − x )² / ( 2τ² ) )
  where x is the query point for which we'd like to know its corresponding y
- Essentially we put higher weights on (errors on) training examples that are close to the query point (than on those that are further away from the query)
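Because the weights depend on the query point, locally weighted regression solves a separate weighted least-squares problem for each query. A minimal sketch; the bandwidth τ is an illustrative choice, and the weights here are computed on the full feature vector (including any intercept feature):

    import numpy as np

    def lwr_predict(x_query, X, y, tau=1.0):
        """Fit theta minimizing sum_i w_i (theta^T x_i - y_i)^2 for one query point,
        with w_i = exp(-||x_i - x_query||^2 / (2 tau^2)), then predict at the query."""
        w = np.exp(-np.sum((X - x_query)**2, axis=1) / (2 * tau**2))
        W = np.diag(w)
        theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
        return x_query @ theta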

Parametric vs. non-parametric

- Locally weighted linear regression is the second example we are running into of a non-parametric algorithm. (What is the first?)
- The (unweighted) linear regression algorithm that we saw earlier is known as a parametric learning algorithm, because it has a fixed, finite number of parameters (the θ), which are fit to the data; once we've fit the θ and stored them away, we no longer need to keep the training data around to make future predictions.
- In contrast, to make predictions using locally weighted linear regression, we need to keep the entire training set around.
- The term "non-parametric" (roughly) refers to the fact that the amount of stuff we need to keep in order to represent the hypothesis grows linearly with the size of the training set.

Robust regression

(figures: the best fit from a quadratic regression, and an alternative fit that is probably better)
- How can we do this?

LOESS-based robust regression

- Remember what we do in "locally weighted linear regression"? We "score" each point for its importance.
- Now we score each point according to its "fitness".
  (Courtesy of Andrew Moore)

Robust regression

- For k = 1 to R:
  - Let ( x_k, y_k ) be the kth datapoint
  - Let y_k^est be the predicted value of y_k
  - Let w_k be a weight for data point k that is large if the data point fits well and small if it fits badly:
    w_k = φ( ( y_k − y_k^est )² )
- Then redo the regression using the weighted data points.
- Repeat the whole thing until converged!
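A minimal sketch of this reweight-and-refit loop. The choice of fitness function φ, the number of rounds, and the scale σ are illustrative assumptions (a Gaussian-shaped score of the residual is a common pick), not the lecture's specific recipe:

    import numpy as np

    def robust_fit(X, y, rounds=10, sigma=1.0):
        """Iteratively reweighted least squares: refit with weights that shrink for
        points with large residuals, w_k = exp(-(y_k - y_k_est)^2 / (2 sigma^2))."""
        w = np.ones(len(y))
        theta = np.zeros(X.shape[1])
        for _ in range(rounds):
            W = np.diag(w)
            theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)   # weighted least squares
            residuals = y - X @ theta
            w = np.exp(-residuals**2 / (2 * sigma**2))          # downweight poorly fitting points
        return theta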

Robust regression: probabilistic interpretation

- What regular regression does:
  Assume y_k was originally generated using the following recipe:
  y_k = θ^T x_k + N(0, σ²)
  The computational task is to find the maximum likelihood estimate of θ.

Robust regression: probabilistic interpretation

- What LOESS-based robust regression does:
  Assume y_k was originally generated using the following recipe:
  with probability p:  y_k = θ^T x_k + N(0, σ²)
  but otherwise:  y_k ~ N( μ, σ²_huge )
  The computational task is to find the maximum likelihood estimates of θ, p, μ and σ_huge.
- The algorithm you saw, with iterative reweighting/refitting, does this computation for us. Later you will find that it is an instance of the famous E.M. algorithm.

Regression tree

- Decision tree for regression

  Gender   Rich?   Num. Children   # travel per yr.   Age
  F        No      ...             5                  38
  M        No      0               ...                ...
  M        Yes     ...             0                  ...
  :        :       :               :                  :

  Gender?
    Female → predicted age = 39
    Male   → predicted age = 36

A conceptual picture

- Assuming regular regression trees, can you sketch a graph of the fitted function y*(x) over this diagram?
  (figure)

How about this one?
  (figure)

Multilinear interpolation

- We wanted to create a continuous and piecewise linear fit to the data
  (figure)

Take home message

- Gradient descent: on-line and batch
- Normal equations
- Equivalence of LMS and MLE
- LR does not mean fitting linear relations, but a linear combination of basis functions (that can be non-linear)
- Weighting points by importance versus by fitness

How about ths oe? ultlear Iterpolato We wated to create a cotuous ad pecewse lear ft to the data 59 ake home message Gradet descet O-le Batch Normal equatos Equalece of LS ad LE LR does ot mea fttg lear relatos, but lear Wdows arketplace combato or bass fuctos (that ca be olear) Weghtg pots b mportace ersus b ftess 60 30