Generative vs. Discriminative Classifiers

Generative vs. Discriminative Classifiers

Goal: wish to learn f: X → Y, e.g., P(Y|X)

Generative classifiers (e.g., Naïve Bayes):
- Assume some functional form for P(X|Y), P(Y). This is a generative model of the data!
- Estimate parameters of P(X|Y), P(Y) directly from training data
- Use Bayes rule to calculate P(Y|X)

Discriminative classifiers:
- Directly assume some functional form for P(Y|X). This is a discriminative model of the data!
- Estimate parameters of P(Y|X) directly from training data

Naïve Bayes vs. Logistic Regression

Consider Y boolean, X continuous, X = <X_1 ... X_m>. Functional forms and number of parameters to estimate:
- NB (Gaussian Naïve Bayes):
  P(y = k | x) = π_k exp( Σ_j [ −(x_j − μ_{k,j})² / (2σ_{k,j}²) − log σ_{k,j} ] ) / Σ_{k'} π_{k'} exp( Σ_j [ −(x_j − μ_{k',j})² / (2σ_{k',j}²) − log σ_{k',j} ] )
- LR:
  P(y = 1 | x) = 1 / ( 1 + exp( −(θ_0 + Σ_j θ_j x_j) ) )

Estimation method:
- NB parameter estimates are uncoupled
- LR parameter estimates are coupled
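The contrast between the two estimation styles is easy to see in code. Below is a minimal sketch, assuming scikit-learn and a small synthetic two-class dataset (the data and variable names are illustrative, not from the lecture), that fits a Gaussian Naïve Bayes model and a logistic regression model to the same examples:

    import numpy as np
    from sklearn.naive_bayes import GaussianNB
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    # Synthetic data: two Gaussian classes in m = 2 dimensions (illustrative only).
    X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
    y = np.array([0] * 100 + [1] * 100)

    # Generative: estimate P(X|Y) and P(Y), then apply Bayes rule to get P(Y|X).
    nb = GaussianNB().fit(X, y)          # per-class means/variances + class priors
    # Discriminative: parameterize P(Y|X) directly and fit the weights jointly.
    lr = LogisticRegression().fit(X, y)  # coupled weight estimates

    print(nb.theta_, nb.var_)            # uncoupled per-class, per-feature estimates
    print(lr.coef_, lr.intercept_)       # a single weight vector for P(Y=1|x)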

Naïve Bayes vs. Logistic Regression

Asymptotic comparison (# training examples → infinity):
- when model assumptions are correct: NB and LR produce identical classifiers
- when model assumptions are incorrect: LR is less biased (it does not assume conditional independence) and is therefore expected to outperform NB

Naïve Bayes vs. Logistic Regression

Non-asymptotic analysis (see [Ng & Jordan, 2002]):
- convergence rate of parameter estimates: how many training examples are needed to assure good estimates?
  NB: order log m (where m = # of attributes in X)
  LR: order m
- NB converges more quickly to its (perhaps less helpful) asymptotic estimates

Rate of convergence: logistic regression

Let h_{Dis,n} be logistic regression trained on n examples in m dimensions. Then with high probability its error exceeds that of the asymptotic classifier h_{Dis,∞} by at most O( sqrt( (m/n) log(n/m) ) ).

Implication: if we want this gap to be at most some small constant ε_0, it suffices to pick order m examples.
- Convergence to its asymptotic classifier in order m examples
- The result follows from Vapnik's structural risk bound, plus the fact that the VC dimension of m-dimensional linear separators is m

Rate of convergence: naïve Bayes parameters

Let any ε, δ > 0 be fixed, and assume that for some fixed ρ_0 > 0 the class prior is bounded below by ρ_0. Then, after a number of training examples of order log m, with probability at least 1 − δ:
1. for discrete inputs, for all attributes X_i and class labels b, the estimated P(X_i | Y = b) are within ε of their asymptotic values;
2. for continuous inputs, for all X_i and b, the estimated class-conditional means and variances are within ε of their asymptotic values.

Some experiments from UCI data sets

Summary
- Naïve Bayes classifier: What is the assumption? Why do we use it? How do we learn it?
- Logistic regression: the functional form follows from the Naïve Bayes assumptions
  - for Gaussian Naïve Bayes assuming class-independent variance
  - for discrete-valued Naïve Bayes too
  - but the training procedure picks parameters without the conditional independence assumption
- Gradient ascent/descent: a general approach when closed-form solutions are unavailable
- Generative vs. Discriminative classifiers
- Bias vs. variance tradeoff

Machine Learning 10-701/15-781, Fall 2011
Linear Regression and Sparsity
Eric Xing
Lecture 4, September 2011
Reading:

Machine learning for apartment hunting

Now you've moved to Pittsburgh!! And you want to find the most reasonably priced apartment satisfying your needs: square footage, # of bedrooms, distance to campus ...

Living area (ft²)   # bedroom   Rent ($)
230                 ...         600
506                 ...         1000
433                 ...         1100
109                 ...         500
150                 ...         ?
270                 1.5         ?

The learning problem

- Features: living area, distance to campus, # bedroom, ...
  Denote as x = [x_1, x_2, ..., x_k]
- Target: rent
  Denoted as y
- Training set:
  X = [x_1, x_2, ..., x_n]^T  (each row: living area, location, # bedroom, ...)
  y = [y_1, y_2, ..., y_n]^T  (rent)

Linear Regression

- Assume that Y (target) is a linear function of X (features):
  e.g.: ŷ = θ_0 + θ_1 x_1 + θ_2 x_2
- Let's assume a vacuous "feature" x_0 = 1 (this is the intercept term, why?), and define the feature vector to be x = [x_0, x_1, ..., x_k];
  then we have the following general representation of the linear function:
  ŷ_i = Σ_j θ_j x_{i,j} = θ^T x_i
- Our goal is to pick the optimal θ. How?
  We seek θ that minimizes the following cost function:
  J(θ) = (1/2) Σ_{i=1}^n ( ŷ(x_i) − y_i )²
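As a concrete anchor for the notation, here is a minimal numpy sketch of the cost J(θ) with the intercept handled by a constant x_0 = 1 feature; the variable names and the toy data values are illustrative only:

    import numpy as np

    def add_intercept(X):
        """Prepend the vacuous feature x_0 = 1 to every example."""
        return np.hstack([np.ones((X.shape[0], 1)), X])

    def cost(theta, X, y):
        """J(theta) = 1/2 * sum_i (theta^T x_i - y_i)^2."""
        residuals = X @ theta - y
        return 0.5 * residuals @ residuals

    # Tiny example with made-up apartment-style data (illustrative values only).
    X = add_intercept(np.array([[230.0], [506.0], [433.0], [109.0]]))
    y = np.array([600.0, 1000.0, 1100.0, 500.0])
    print(cost(np.zeros(X.shape[1]), X, y))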

The Least-Mean-Square (LMS) method

- The cost function:
  J(θ) = (1/2) Σ_{i=1}^n ( x_i^T θ − y_i )²
- Consider a gradient descent algorithm:
  θ_j^{t+1} = θ_j^t − α ∂J(θ)/∂θ_j |_{θ^t}

The Least-Mean-Square (LMS) method

- Now we have the following descent rule:
  θ_j^{t+1} = θ_j^t + α Σ_{i=1}^n ( y_i − x_i^T θ^t ) x_{i,j}
- For a single training point, we have:
  θ_j^{t+1} = θ_j^t + α ( y_i − x_i^T θ^t ) x_{i,j}
- This is known as the LMS update rule, or the Widrow-Hoff learning rule
- This is actually a "stochastic", "coordinate" descent algorithm
- This can be used as an on-line algorithm
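A minimal sketch of the LMS (Widrow-Hoff) update, processing one training point at a time as an on-line algorithm; the learning rate and epoch count are arbitrary illustrative choices and would need tuning (and scaled features) in practice:

    import numpy as np

    def lms(X, y, alpha=1e-2, epochs=50):
        """Stochastic per-example LMS: theta_j <- theta_j + alpha*(y_i - theta^T x_i)*x_ij."""
        theta = np.zeros(X.shape[1])
        for _ in range(epochs):
            for x_i, y_i in zip(X, y):
                theta += alpha * (y_i - x_i @ theta) * x_i
        return theta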

Geometry and convergence of LMS

- (figure: the LMS fit after N = 1, 2, 3 updates)
- Claim: when the step size α satisfies certain conditions, and when certain other technical conditions are satisfied, LMS will converge to an "optimal region".
  θ^{t+1} = θ^t + α ( y_i − x_i^T θ^t ) x_i

Steepest Descent and LMS

- Steepest descent: note that
  ∇_θ J = [ ∂J/∂θ_1, ..., ∂J/∂θ_k ]^T = − Σ_{i=1}^n ( y_i − x_i^T θ ) x_i
  so the update is
  θ^{t+1} = θ^t + α Σ_{i=1}^n ( y_i − x_i^T θ^t ) x_i
- This is a batch gradient descent algorithm
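For comparison, the batch (steepest-descent) version uses the full gradient −Σ_i (y_i − θ^T x_i) x_i at every step; a sketch under the same illustrative conventions as above:

    import numpy as np

    def batch_gd(X, y, alpha=1e-4, iters=1000):
        """Batch gradient descent on J(theta): theta <- theta + alpha * X^T (y - X theta)."""
        theta = np.zeros(X.shape[1])
        for _ in range(iters):
            theta += alpha * X.T @ (y - X @ theta)
        return theta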

The normal equations

- Write the cost function in matrix form:
  J(θ) = (1/2) Σ_i ( x_i^T θ − y_i )²
       = (1/2) ( Xθ − y )^T ( Xθ − y )
       = (1/2) ( θ^T X^T X θ − θ^T X^T y − y^T X θ + y^T y )
- To minimize J(θ), take the derivative and set it to zero:
  ∇_θ J = (1/2) ∇_θ tr( θ^T X^T X θ − θ^T X^T y − y^T X θ + y^T y ) = X^T X θ − X^T y = 0
  The normal equations:  X^T X θ = X^T y
  ⇒ θ* = ( X^T X )^{-1} X^T y

Some matrix derivatives

- For f: R^{m×n} → R, define:
  ∇_A f(A) = [ ∂f/∂A_{ij} ]  (the m×n matrix of partial derivatives)
- Trace:
  tr A = Σ_i A_{ii},   tr a = a (for a scalar a),   tr ABC = tr CAB = tr BCA
- Some facts of matrix derivatives (without proof):
  ∇_A tr AB = B^T,   ∇_A tr ABA^T C = CAB + C^T A B^T,   ∇_A |A| = |A| ( A^{-1} )^T
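The normal equations give the same answer in one step. A minimal sketch, using a least-squares solver rather than forming the explicit inverse (which is the numerically safer route and handles rank deficiency more gracefully):

    import numpy as np

    def normal_equation(X, y):
        """Solve the normal equations X^T X theta = X^T y.
        lstsq minimizes ||X theta - y||, which is equivalent when X has full column rank."""
        theta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return theta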

Comments on the normal equation

- In most situations of practical interest, the number of data points N is larger than the dimensionality k of the input space and the matrix X is of full column rank. If this condition holds, then it is easy to verify that X^T X is necessarily invertible.
- The assumption that X^T X is invertible implies that it is positive definite, thus the critical point we have found is a minimum.
- What if X has less than full column rank? → regularization (later).

Direct and iterative methods

- Direct methods: we can achieve the solution in a single step by solving the normal equation
  - Using Gaussian elimination or QR decomposition, we converge in a finite number of steps
  - It can be infeasible when data are streaming in in real time, or are of very large amount
- Iterative methods: stochastic or steepest gradient
  - Converging in a limiting sense
  - But more attractive in large practical problems
  - Caution is needed for deciding the learning rate α

Convergence rate

- Theorem: the steepest descent algorithm converges to the minimum of the cost characterized by the normal equations, provided the step size α is chosen appropriately.
- A formal analysis of LMS needs more mathematical muscle; in practice, one can use a small α, or gradually decrease α.

A summary

LMS update rule
  θ_j^{t+1} = θ_j^t + α ( y_i − x_i^T θ^t ) x_{i,j}
- Pros: on-line, low per-step cost, fast convergence and perhaps less prone to local optima
- Cons: convergence to the optimum not always guaranteed

Steepest descent
  θ^{t+1} = θ^t + α Σ_i ( y_i − x_i^T θ^t ) x_i
- Pros: easy to implement, conceptually clean, guaranteed convergence
- Cons: batch, often slow to converge

Normal equations
  θ* = ( X^T X )^{-1} X^T y
- Pros: a single-shot algorithm! Easiest to implement.
- Cons: need to compute the pseudo-inverse ( X^T X )^{-1}, which is expensive and raises numerical issues (e.g., the matrix may be singular), although there are ways to get around this

Geometric interpretation of LMS

- The predictions on the training data are:
  ŷ = X θ* = X ( X^T X )^{-1} X^T y
- Note that
  ŷ − y = ( X ( X^T X )^{-1} X^T − I ) y
  and
  X^T ( ŷ − y ) = X^T ( X ( X^T X )^{-1} X^T − I ) y = ( X^T X ( X^T X )^{-1} X^T − X^T ) y = 0 !!
- So ŷ is the orthogonal projection of y onto the space spanned by the columns of X.

Probabilistic interpretation of LMS

- Let us assume that the target variable and the inputs are related by the equation:
  y_i = θ^T x_i + ε_i
  where ε is an error term of unmodeled effects or random noise
- Now assume that ε follows a Gaussian N(0, σ²); then we have:
  p( y_i | x_i ; θ ) = ( 1 / ( √(2π) σ ) ) exp( −( y_i − θ^T x_i )² / ( 2σ² ) )
- By the independence assumption:
  L(θ) = Π_i p( y_i | x_i ; θ ) = ( 1 / ( √(2π) σ ) )^n exp( −Σ_i ( y_i − θ^T x_i )² / ( 2σ² ) )

Probabilistic interpretation of LMS, cont.

- Hence the log-likelihood is:
  ℓ(θ) = n log( 1 / ( √(2π) σ ) ) − ( 1 / σ² ) · ( 1/2 ) Σ_i ( y_i − θ^T x_i )²
- Do you recognize the last term? Yes, it is:
  J(θ) = (1/2) Σ_i ( x_i^T θ − y_i )²
- Thus, under the independence assumption, LMS is equivalent to MLE of θ!

Case study: predicting gene expression

- The genetic picture
  (figure: a DNA sequence, e.g. CGCACGACAA..., with causal SNPs highlighted, and a univariate phenotype, i.e., the expression intensity of a gene)

Association mapping as regression

(figure: genotype strings for individuals 1 through N, each with a phenotype (BMI) value, e.g. 2.5, 4.8, ..., 4.7; most SNPs are benign, one is the causal SNP)

Association mapping as regression

(figure: the same individuals with genotypes coded numerically per SNP)

- Regress the phenotype on the genotype:
  y_i = Σ_j x_{i,j} β_j
- SNPs with large β_j are relevant

Experimental setup

- Asthma dataset
  - 543 individuals, genotyped at 34 SNPs
  - Diploid data was transformed to 0/1 (for homozygotes) or 2 (for heterozygotes)
  - X = 543 × 34 matrix
  - Y = phenotype variable (continuous)
  - A single phenotype was used for regression
- Implementation details
  - Iterative methods: batch update and online update implemented
  - For both methods, the step size α is chosen to be a small fixed value (10⁻⁶). This choice is based on the data used for the experiments.
  - Both methods are only run to a maximum number of epochs or until the change in training MSE is less than 10⁻⁴

Convergence curves

- For the batch method, the training MSE is initially large due to the uninformed initialization
- In the online update, the N per-example updates in every epoch reduce the MSE to a much smaller value

The learned coefficients

(figure: the learned regression coefficients)

Multivariate regression for trait association analysis

(figure: a trait value, e.g. 2.1, modeled as a genotype row G A A C C A G A A G A ... times an unknown vector of association strengths β, i.e. y = Xβ)

Multivariate regression for trait association analysis

(figure: trait, genotype, association strength)
- Many non-zero associations: which SNPs are truly significant?

Sparsity

- One common assumption to make: sparsity.
- Makes biological sense: each phenotype is likely to be associated with a small number of SNPs, rather than all the SNPs.
- Makes statistical sense: learning is now feasible in high dimensions with a small sample size.

Sparsity: in a mathematical sense

- Consider the least squares linear regression problem:
  min_β Σ_i ( y_i − x_i^T β )²
- Sparsity means most of the β_j are zero
  (figure: a coefficient vector β_1, β_2, β_3, ... with most entries equal to zero)
- But enforcing this directly is not convex!!! Many local optima, computationally intractable.

L1 regularization (LASSO) (Tibshirani, 1996)

- A convex relaxation.
- Constrained form:
  min_β Σ_i ( y_i − x_i^T β )²   s.t.   Σ_j |β_j| ≤ t
- Lagrangian form:
  min_β Σ_i ( y_i − x_i^T β )² + λ Σ_j |β_j|
- Still enforces sparsity!
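In practice the Lagrangian form is what off-the-shelf solvers optimize. A hedged sketch using scikit-learn's Lasso on synthetic SNP-style data (the regularization strength, sizes, and true coefficients are placeholders; note that scikit-learn scales the squared-error term by 1/(2n)):

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 30))            # e.g., 100 individuals, 30 SNPs (illustrative)
    true_beta = np.zeros(30)
    true_beta[[3, 17]] = [1.5, -2.0]          # only two truly relevant SNPs
    y = X @ true_beta + 0.1 * rng.normal(size=100)

    lasso = Lasso(alpha=0.1).fit(X, y)        # alpha plays the role of lambda
    print(np.flatnonzero(lasso.coef_))        # most coefficients are driven exactly to zero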

Lasso for reducing false positives

(figure: trait, genotype, association strength)
- Lasso penalty for sparsity:
  J(β) = Σ_i ( y_i − x_i^T β )² + λ Σ_j |β_j|
- Many zero associations (sparse results), but what if there are multiple related traits?

Ridge regression vs. Lasso

- Ridge regression: squared-error loss with an ℓ2 penalty on β
- Lasso: squared-error loss with an ℓ1 penalty on β — HOT!
  (figure: level sets of J(β), i.e. βs with constant J(β), intersecting the ℓ2 ball (βs with constant ℓ2 norm) and the ℓ1 ball (βs with constant ℓ1 norm))
- Lasso (ℓ1 penalty) results in sparse solutions — a vector with more zero coordinates
- Good for high-dimensional problems — don't have to store all coordinates!

Bayesian interpretation

- Treat the distribution parameters θ also as a random variable
- The a posteriori distribution of θ after seeing the data is given by Bayes' rule:
  p( θ | D ) = p( D | θ ) p( θ ) / p( D ),   where   p( D ) = ∫ p( D | θ ) p( θ ) dθ
  posterior = likelihood × prior / marginal likelihood
- The prior p(θ) encodes our prior knowledge about the domain

Regularized least squares and MAP

- What if ( X^T X ) is not invertible?
- Maximize log likelihood + log prior
- I) Gaussian prior with zero mean  ⇒  ridge regression
  Closed form: HW
- Prior belief that β is Gaussian with zero mean biases the solution towards small β
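The ridge (Gaussian-prior MAP) estimate has the well-known closed form θ = ( X^T X + λ I )^{-1} X^T y, which the slide leaves as homework; a minimal numpy sketch, with λ and the data shapes as illustrative choices:

    import numpy as np

    def ridge(X, y, lam=1.0):
        """MAP estimate under a zero-mean Gaussian prior:
        theta = (X^T X + lambda * I)^{-1} X^T y."""
        k = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)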

Regularized least squares and MAP

- What if ( X^T X ) is not invertible?
- Maximize log likelihood + log prior
- II) Laplace prior with zero mean  ⇒  Lasso
  Closed form: HW
- Prior belief that β is Laplace with zero mean biases the solution towards small β

Beyond basic LR

- LR with non-linear basis functions
- Locally weighted linear regression
- Regression trees and multilinear interpolation

Non-linear functions

(figure: examples of non-linear relationships between x and y)

LR with non-linear basis functions

- LR does not mean we can only deal with linear relationships
- We are free to design (non-linear) features under LR:
  y = θ_0 + Σ_{j=1}^m θ_j φ_j(x) = θ^T φ(x)
  where the φ_j(x) are fixed basis functions (and we define φ_0(x) = 1).
- Example: polynomial regression:
  φ(x) := [ 1, x, x², x³ ]
- We will be concerned with estimating (distributions over) the weights θ and choosing the model order M.

Basis functions

There are many basis functions, e.g.:
- Polynomial:  φ_j(x) = x^{j−1}
- Radial basis functions:  φ_j(x) = exp( −( x − μ_j )² / ( 2 s² ) )
- Sigmoidal:  φ_j(x) = σ( ( x − μ_j ) / s )
- Splines, Fourier, wavelets, etc.

1D and 2D RBFs

(figures: a 1D RBF basis and a 2D RBF basis, and the fits obtained after fitting to data)
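Since a basis expansion just replaces x with φ(x), non-linear regression reduces to ordinary linear regression on the transformed features. A small sketch of polynomial and Gaussian RBF feature maps for 1-D inputs (the centers, width, degree, and toy data are arbitrary illustrative choices):

    import numpy as np

    def poly_features(x, degree=3):
        """phi(x) = [1, x, x^2, ..., x^degree] for 1-D inputs."""
        return np.vstack([x**j for j in range(degree + 1)]).T

    def rbf_features(x, centers, s=1.0):
        """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)) for 1-D inputs."""
        return np.exp(-(x[:, None] - centers[None, :])**2 / (2 * s**2))

    x = np.linspace(0, 10, 50)
    y = np.sin(x) + 0.1 * np.random.default_rng(0).normal(size=50)
    Phi = rbf_features(x, centers=np.linspace(0, 10, 8))
    theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # plain least squares on phi(x)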

Good and bad RBFs

(figures: a good 2D RBF, and two bad 2D RBFs)

Overfitting and underfitting

(figures: fits of increasing model order to the same data)
  y = θ_0 + θ_1 x        y = θ_0 + θ_1 x + θ_2 x²        y = Σ_{j=0}^5 θ_j x^j

Bias and variance

- We define the bias of a model to be the expected generalization error even if we were to fit it to a very (say, infinitely) large training set.
- By fitting "spurious" patterns in the training set, we might again obtain a model with large generalization error. In this case, we say the model has large variance.

Locally weighted linear regression

- The algorithm: instead of minimizing
  J(θ) = (1/2) Σ_i ( x_i^T θ − y_i )²
  we now fit θ to minimize
  J(θ) = (1/2) Σ_i w_i ( x_i^T θ − y_i )²
- Where do the w_i's come from?
  w_i = exp( −( x_i − x )² / ( 2τ² ) )
  where x is the query point for which we'd like to know its corresponding y
- Essentially we put higher weights on (errors on) training examples that are close to the query point (than on those that are further away from the query)
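Because the weights depend on the query point, locally weighted regression solves a separate weighted least-squares problem for each query. A minimal sketch; the bandwidth τ is an illustrative choice, and the weights here are computed on the full feature vector (including any intercept feature):

    import numpy as np

    def lwr_predict(x_query, X, y, tau=1.0):
        """Fit theta minimizing sum_i w_i (theta^T x_i - y_i)^2 for one query point,
        with w_i = exp(-||x_i - x_query||^2 / (2 tau^2)), then predict at the query."""
        w = np.exp(-np.sum((X - x_query)**2, axis=1) / (2 * tau**2))
        W = np.diag(w)
        theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
        return x_query @ theta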

Parametric vs. non-parametric

- Locally weighted linear regression is the second example we are running into of a non-parametric algorithm. (What is the first?)
- The (unweighted) linear regression algorithm that we saw earlier is known as a parametric learning algorithm, because it has a fixed, finite number of parameters (the θ), which are fit to the data; once we've fit the θ and stored them away, we no longer need to keep the training data around to make future predictions.
- In contrast, to make predictions using locally weighted linear regression, we need to keep the entire training set around.
- The term "non-parametric" (roughly) refers to the fact that the amount of stuff we need to keep in order to represent the hypothesis grows linearly with the size of the training set.

Robust regression

(figures: the best fit from a quadratic regression, and an alternative fit that is probably better)
- How can we do this?

LOESS-based robust regression

- Remember what we do in "locally weighted linear regression"? We "score" each point for its importance.
- Now we score each point according to its "fitness".
  (Courtesy of Andrew Moore)

Robust regression

- For k = 1 to R:
  - Let ( x_k, y_k ) be the kth datapoint
  - Let y_k^est be the predicted value of y_k
  - Let w_k be a weight for data point k that is large if the data point fits well and small if it fits badly:
    w_k = φ( ( y_k − y_k^est )² )
- Then redo the regression using the weighted data points.
- Repeat the whole thing until converged!
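A minimal sketch of this reweight-and-refit loop. The choice of fitness function φ, the number of rounds, and the scale σ are illustrative assumptions (a Gaussian-shaped score of the residual is a common pick), not the lecture's specific recipe:

    import numpy as np

    def robust_fit(X, y, rounds=10, sigma=1.0):
        """Iteratively reweighted least squares: refit with weights that shrink for
        points with large residuals, w_k = exp(-(y_k - y_k_est)^2 / (2 sigma^2))."""
        w = np.ones(len(y))
        theta = np.zeros(X.shape[1])
        for _ in range(rounds):
            W = np.diag(w)
            theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)   # weighted least squares
            residuals = y - X @ theta
            w = np.exp(-residuals**2 / (2 * sigma**2))          # downweight poorly fitting points
        return theta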

Robust regression: probabilistic interpretation

- What regular regression does:
  Assume y_k was originally generated using the following recipe:
  y_k = θ^T x_k + N(0, σ²)
  The computational task is to find the maximum likelihood estimate of θ.

Robust regression: probabilistic interpretation

- What LOESS-based robust regression does:
  Assume y_k was originally generated using the following recipe:
  with probability p:  y_k = θ^T x_k + N(0, σ²)
  but otherwise:  y_k ~ N( μ, σ²_huge )
  The computational task is to find the maximum likelihood estimates of θ, p, μ and σ_huge.
- The algorithm you saw, with iterative reweighting/refitting, does this computation for us. Later you will find that it is an instance of the famous E.M. algorithm.

Regression tree

- Decision tree for regression

  Gender   Rich?   Num. Children   # travel per yr.   Age
  F        No      ...             5                  38
  M        No      0               ...                ...
  M        Yes     ...             0                  ...
  :        :       :               :                  :

  Gender?
    Female → predicted age = 39
    Male   → predicted age = 36

A conceptual picture

- Assuming regular regression trees, can you sketch a graph of the fitted function y*(x) over this diagram?
  (figure)

How about this one?
  (figure)

Multilinear interpolation

- We wanted to create a continuous and piecewise linear fit to the data
  (figure)

Take home message

- Gradient descent: on-line and batch
- Normal equations
- Equivalence of LMS and MLE
- LR does not mean fitting linear relations, but a linear combination of basis functions (that can be non-linear)
- Weighting points by importance versus by fitness

How about ths oe? ultlear Iterpolato We wated to create a cotuous ad pecewse lear ft to the data 59 ake home message Gradet descet O-le Batch Normal equatos Equalece of LS ad LE LR does ot mea fttg lear relatos, but lear Wdows arketplace combato or bass fuctos (that ca be olear) Weghtg pots b mportace ersus b ftess 60 30