Probabilistic Classification: Bayes Classifiers


CSC Machine Learning, Lecture: Classification II (September). Sam Roweis.

Probabilistic Classification: Bayes Classifiers
Generative model: p(x, y) = p(y) p(x|y). p(y) are called class priors. p(x|y) are called class-conditional feature distributions. For the prior we use a Bernoulli or multinomial: p(y = k | π) = π_k with Σ_k π_k = 1 and π_k > 0. What classification rule should we use? Pick the class that best models the data, i.e. argmax_y p(x|y)? No! This behaves very badly if the class priors are skewed. MAP is best:

  argmax_y p(y|x) = argmax_y p(x, y) = argmax_y [log p(x|y) + log p(y)]

How should we fit the model parameters? Maximum joint likelihood:

  Φ = Σ_n log p(x_n, y_n) = Σ_n [log p(x_n | y_n) + log p(y_n)]

1) Sort the data into batches by class label. 2) Estimate p(y) by counting the sizes of the batches (plus regularization). 3) Estimate p(x|y) separately within each batch using ML on the class-conditional model (also with regularization).

Review: Classification
Given examples of a discrete class label y and some features x, the goal is to compute y for a new x. Last class: compute a discriminant f(x) and then take the max. This class: probabilistic classifiers. Two approaches. Generative: model p(x, y) = p(y) p(x|y), then use Bayes' rule to infer the conditional p(y|x). Discriminative: model the posterior p(y|x) directly. The generative approach is related to joint density estimation, while the discriminative approach is closer to regression.

Three Key Regularization Ideas
To avoid overfitting, we can put priors on the parameters of the class and class-conditional feature distributions. We can also tie some parameters together so that fewer of them are estimated using more data. Finally, we can make factorization or independence assumptions about the distributions. In particular, for the class-conditional distributions we can assume the features are fully dependent, partly dependent, or independent (!).
[Figure: graphical models of the class-conditional distribution with (a) fully dependent, (b) partly dependent, and (c) independent features X_1, ..., X_m.]
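To make the three-step fitting recipe above concrete, here is a minimal sketch in Python/NumPy: batch the data by class, estimate smoothed priors by counting, fit a class-conditional Gaussian to each batch by ML, and classify with the MAP rule. The function names and the small ridge added to each covariance are illustrative choices, not part of the lecture.

```python
import numpy as np

def fit_generative_gaussian(X, y, alpha=1.0, ridge=1e-3):
    """1) batch by class, 2) smoothed priors by counting, 3) ML Gaussian per batch."""
    classes = np.unique(y)
    N, D = X.shape
    C = len(classes)
    model = {}
    for k in classes:
        Xk = X[y == k]
        prior = (len(Xk) + alpha) / (N + C * alpha)          # smoothed class prior
        mu = Xk.mean(axis=0)                                 # ML mean
        cov = np.cov(Xk, rowvar=False) + ridge * np.eye(D)   # ML covariance, regularized
        model[k] = (prior, mu, cov)
    return model

def predict_map(model, x):
    """argmax_y [log p(x|y) + log p(y)]  (MAP rule, not argmax_y p(x|y))."""
    best, best_score = None, -np.inf
    for k, (prior, mu, cov) in model.items():
        d = x - mu
        _, logdet = np.linalg.slogdet(cov)
        loglik = -0.5 * (d @ np.linalg.solve(cov, d) + logdet + len(x) * np.log(2 * np.pi))
        score = loglik + np.log(prior)
        if score > best_score:
            best, best_score = k, score
    return best
```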

Class Priors and Multinomial Smoothing
Let's say you were trying to estimate the bias of a coin. You flip it K times; what is your estimate of the probability z of heads? One answer: maximum likelihood, z = #h/K. What if you flip it twice and you get heads both times? Do you think that z = 1? Would you be infinitely surprised to see a tail? ML is almost always a bad idea. We need to incorporate a prior belief to modulate the results of small numbers of trials. We do this with a technique called smoothing:

  z = (#h + α) / (K + 2α)

where α is the number of pseudo-counts you use for your prior. The same situation occurs when estimating class priors from data:

  p(c) = (#c + α) / (N + Cα)

A very common setting is α = 1, which is called Laplace smoothing.

Regularized Gaussians
Idea 1: assume all the covariances are the same (tie parameters). This is exactly Fisher's linear discriminant analysis. Idea 2: use a Wishart prior on the covariance matrix (smoothing!). This fattens up the posteriors by making the MAP estimates the sample covariances plus a bit of the identity matrix. Idea 3: make independence assumptions to get diagonal or identity-multiple covariances (i.e. sparse inverse covariances). More on this in a few minutes...

Gaussian Class-Conditional Distributions
If all the input features are continuous, a popular choice is a Gaussian class-conditional model:

  p(x | y = k, θ) = |2πΣ|^{-1/2} exp{ -(1/2) (x - µ_k)^T Σ^{-1} (x - µ_k) }

Fitting: use the following simple but useful fact. The maximum likelihood fit of a Gaussian to some data is the Gaussian whose mean is equal to the data mean and whose covariance is equal to the sample covariance. [Try to prove this as an exercise in understanding likelihood, algebra, and calculus all at once!] One very nice feature of this model is that the maximum likelihood parameters can be found in closed form, so we don't have to worry about numerical optimization, local minima, search, etc. Seems easy. And it works surprisingly well. But we can do even better with some simple regularization...

Gaussian Bayes Classifier
Maximum likelihood estimates for the parameters: priors π_k: use observed frequencies of classes (plus smoothing); means µ_k: use class means; covariance Σ: use data from a single class or pooled data (x_n - µ_{y_n}) to estimate (full/diagonal) covariances. Compute the posterior via Bayes' rule. For equal covariances:

  p(y = k | x, θ) = p(x | y = k, θ) p(y = k | π) / Σ_j p(x | y = j, θ) p(y = j | π)
                  = exp{µ_k^T Σ^{-1} x - µ_k^T Σ^{-1} µ_k / 2 + log π_k} / Σ_j exp{µ_j^T Σ^{-1} x - µ_j^T Σ^{-1} µ_j / 2 + log π_j}
                  = e^{β_k^T x} / Σ_j e^{β_j^T x} = exp{β_k^T x} / Z

where β_k = [Σ^{-1} µ_k ; -µ_k^T Σ^{-1} µ_k / 2 + log π_k] (the last term is the bias).
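As a companion to the Gaussian Bayes classifier slide, here is a minimal sketch (NumPy; the function names are hypothetical) of the tied-covariance case: Laplace-smoothed priors, class means, a pooled covariance with a small ridge standing in for the Wishart-prior idea, and the posterior computed as a softmax of the linear scores β_k.

```python
import numpy as np

def fit_lda(X, y, alpha=1.0, shrink=1e-3):
    """Gaussian Bayes classifier with a tied (pooled) covariance, i.e. LDA.
    Priors are Laplace-smoothed; the covariance gets a small ridge."""
    classes = np.unique(y)                              # assumes sortable labels
    N, D = X.shape
    C = len(classes)
    priors = np.array([(np.sum(y == k) + alpha) / (N + C * alpha) for k in classes])
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    resid = X - means[np.searchsorted(classes, y)]      # x_n - mu_{y_n}
    Sigma = resid.T @ resid / N + shrink * np.eye(D)    # pooled covariance
    Sinv = np.linalg.inv(Sigma)
    # beta_k = [Sigma^{-1} mu_k ; -mu_k' Sigma^{-1} mu_k / 2 + log pi_k]
    W = means @ Sinv                                    # one row per class
    b = -0.5 * np.sum(W * means, axis=1) + np.log(priors)
    return classes, W, b

def posterior(W, b, x):
    """Softmax of the linear scores beta_k' [x; 1]."""
    scores = W @ x + b
    scores = scores - scores.max()                      # numerical stability
    e = np.exp(scores)
    return e / e.sum()
```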

Linear Geometry
Taking the ratio of any two posteriors (the "odds") shows that the contours of equal pairwise probability are linear surfaces in the feature space if the covariances of all classes are equal:

  p(y = k | x, θ) / p(y = j | x, θ) = exp{ (β_k - β_j)^T x }

The pairwise discrimination contours p(y_k | x) = p(y_j | x) are orthogonal to the differences of the means in feature space when Σ = σ²I. For a general Σ (shared between all classes) the same is true in the transformed feature space u = Σ^{-1/2} x. The priors do not change the geometry; they only shift the operating point on the logit by the log-odds log(π_k / π_j). Summary: for equal class covariances, we obtain a linear classifier. If we use different covariances for each class, we have a quadratic classifier with conic-section decision surfaces.

Discriminative Models
Observation: if p(y|x) are linear functions of x (or monotone transforms thereof), the decision surfaces will be piecewise linear. Idea: parametrize p(y|x) directly; forget p(x, y) and Bayes' rule. Advantages: we don't need to model the density of the features p(x), which often takes lots of parameters and seems redundant, since many densities give the same linear classifier. Disadvantages: we cannot detect outliers, compare models using likelihoods, or generate new labelled data. What should our objective function be? We'll try to use one that is closer to the one we care about at test time (i.e. the error rate).

Exponential Family Class-Conditionals
The Bayes classifier has the same form whenever the class-conditional densities are any exponential family density:

  p(x | y = k, η_k) = h(x) exp{ η_k^T x - a(η_k) }

  p(y = k | x, η) = p(x | y = k, η_k) p(y = k | π) / Σ_j p(x | y = j, η_j) p(y = j | π)
                  = exp{ η_k^T x - a(η_k) + log π_k } / Σ_j exp{ η_j^T x - a(η_j) + log π_j }
                  = e^{β_k^T x} / Σ_j e^{β_j^T x}

where β_k = [η_k ; log π_k - a(η_k)] and we have augmented x with a constant component always equal to 1 (bias term). The resulting classifier is linear in the sufficient statistics.

Logistic/Softmax Regression
Model: y is a multinomial random variable whose posterior is the softmax of linear functions of the feature vector:

  p(y = k | x, θ) = e^{θ_k^T x} / Σ_j e^{θ_j^T x}

Fitting: now we optimize the conditional log-likelihood:

  l(θ; D) = Σ_n Σ_k [y_n = k] log p(y = k | x_n, θ) = Σ_{n,k} y_k^n log p_k^n

Writing z_k^n = θ_k^T x_n and p_k^n for the softmax output, and using ∂p_j^n/∂z_k^n = p_j^n (δ_jk - p_k^n), the chain rule gives the gradient

  ∂l/∂θ_k = Σ_n Σ_j (∂l/∂p_j^n)(∂p_j^n/∂z_k^n)(∂z_k^n/∂θ_k) = Σ_n (y_k^n - p_k^n) x_n
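The gradient Σ_n (y_k^n - p_k^n) x_n suggests a simple fitting loop. Below is a small sketch of batch gradient ascent on the conditional log-likelihood with weight decay (NumPy; the function names, learning rate and iteration count are illustrative choices, not from the lecture).

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)      # numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def fit_softmax_regression(X, Y, lr=0.1, weight_decay=1e-3, iters=500):
    """Batch gradient ascent on the conditional log-likelihood.
    X: (N, D) features (append a column of 1s for the bias term),
    Y: (N, C) one-hot class labels."""
    N, D = X.shape
    C = Y.shape[1]
    Theta = np.zeros((D, C))
    for _ in range(iters):
        P = softmax(X @ Theta)                # p_k^n for every case n and class k
        grad = X.T @ (Y - P) / N              # (1/N) sum_n (y_k^n - p_k^n) x_n
        Theta += lr * (grad - weight_decay * Theta)   # weight decay = Gaussian prior on theta
    return Theta
```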

Softmax/Logit
The squashing function is known as the softmax or logit:

  φ_k(z) = e^{z_k} / Σ_j e^{z_j}        g(η) = 1 / (1 + e^{-η})

It is invertible (up to a constant): z_k = log φ_k + c and η = log(g / (1 - g)). The derivative is easy:

  ∂φ_k/∂z_j = φ_k (δ_kj - φ_j)        dg/dη = g(1 - g)

[Figure: plot of the logistic squashing function φ(z) against z.]

Artificial Neural Networks
Historically motivated by relations to biology, but for our purposes ANNs are just nonlinear classification machines of the form:

  p(y = k | x, θ) = e^{θ_k^T h(x)} / Σ_j e^{θ_j^T h(x)}        h_j = σ(b_j^T x)

where the h_j = σ(b_j^T x) are known as the hidden unit activations; the y_k are the output units and the x are the input units. The nonlinear scalar function σ is called an activation function. We usually use invertible and differentiable activation functions. If the activation function is linear, the whole network reduces to a linear network, equivalent to logistic regression. [Only if there are at least as many hiddens as inputs and outputs.] It is often a good idea to add skip weights directly connecting inputs to outputs to take care of this linear component directly.

More on Logistic Regression
Hardest part: picking the feature vector x (see the next slide!). Amazing fact: the conditional likelihood is convex in the parameters θ (assuming regularization). Still no local minima! But the optimal parameters cannot be computed in closed form. However, the gradient is easy to compute, so it is easy to optimize. Slow: gradient descent, IIS. Fast: BFGS, Newton-Raphson, IRLS. Regularization? Gaussian prior on θ (weight decay): add ε θ^T θ to the cost function, which subtracts 2εθ from each gradient. Logistic regression could really be called softmax linear regression. The log odds (logit) between any two classes is linear in the parameters. Consider what happens if there are two features with identical classification patterns in our training data. Logistic regression can only see the sum of the corresponding weights. Luckily, weight decay will solve this. Moral: always regularize!

Common Activation Functions
Two common activation functions: the sigmoid and the hyperbolic tangent:

  sigmoid(z) = 1 / (1 + exp(-z))        tanh(z) = (exp(z) - exp(-z)) / (exp(z) + exp(-z))

[Figure: plots of sigmoid(z) and tanh(z) against z.]
For small weights, these functions will be operating near zero and their behaviour will be almost linear. Thus, for small weights, the network behaves essentially linearly. But for larger weights, we are effectively learning the input feature functions for a nonlinear version of logistic regression. In general we want a saturating activation function (why?).
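A small sketch of the ANN model above (NumPy; the `forward` helper and the particular sizes are illustrative): softmax outputs over tanh hidden units, with optional skip weights, plus a quick check that with small weights everything operates near zero, so the scores are tiny and the outputs sit near uniform.

```python
import numpy as np

def forward(x, B, Theta, S=None):
    """p(y=k | x) = softmax_k(theta_k' h + s_k' x), with h_j = tanh(b_j' x).
    S holds the optional input-to-output skip weights."""
    h = np.tanh(B @ x)                    # hidden unit activations h_j = sigma(b_j' x)
    z = Theta @ h
    if S is not None:
        z = z + S @ x                     # skip weights handle the linear component
    z = z - z.max()                       # numerical stability
    e = np.exp(z)
    return e / e.sum()

# With small weights, tanh operates near zero and the whole map is nearly linear;
# here the class scores are all close to zero, so the outputs are near uniform.
rng = np.random.default_rng(0)
x = rng.normal(size=5)
B = 0.01 * rng.normal(size=(10, 5))       # input-to-hidden weights
Theta = 0.01 * rng.normal(size=(3, 10))   # hidden-to-output weights
print(forward(x, B, Theta))
```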

Geometry of ANNs
ANNs can be thought of as generalized linear models, where the basis functions (hidden units) are sigmoidal "cliffs". The cliff direction is determined by the input-to-hidden weights, and the cliffs are combined by the hidden-to-output weights. We include bias units of course, and these set where the cliff is positioned relative to the origin.
[Figure: a sigmoidal cliff basis function h_j = σ(b_j^T x) over the input space.]

Discrete Bayesian Classifier
If the inputs are discrete (categorical), what should we do? The simplest class-conditional model is a joint multinomial (table):

  p(x_1 = a, x_2 = b, ... | y = c) = η^c_{ab...}

This is conceptually correct, but there is a big practical problem. Fitting: the ML parameters are the observed counts:

  η^c_{ab...} = Σ_n [y_n = c][x_1^n = a][x_2^n = b][...] / Σ_n [y_n = c]

Consider digit images with many gray levels per pixel. How many entries are in the table? How many will be zero? What happens at test time? Doh! We obviously need some regularization. Smoothing will not help much here. Unless we know about the relationships between inputs beforehand, sharing parameters is hard as well (what to share?). But what about independence?

Neural Networks for Classification
Neural nets with one hidden layer trained for classification are doing nonlinear logistic regression:

  p(y = k | x) = softmax[θ_k^T σ(Bx)]

where B and θ are the first- and second-layer weights and σ() is a squashing function (e.g. tanh) that operates componentwise. The gradient of the conditional likelihood is still easily computable, using the efficient backpropagation algorithm, which we'll see later. But we lose the convexity property: local minima problems.
[Figure: network diagram with inputs x, first-layer weights B, hidden units σ, second-layer weights θ, and softmax outputs.]

Naive (Idiot's) Bayes Classifier
Assumption: conditioned on the class, the attributes are independent.

  p(x | y) = Π_i p(x_i | y)

Sounds crazy, right? Right! But it works. Algorithm: sort the data cases into bins according to y_n. Compute the marginal probabilities p(y = c) using frequencies. For each class, estimate the distribution of the i-th variable: p(x_i | y = c). At test time, compute argmax_c p(c | x) using

  c(x) = argmax_c p(c | x) = argmax_c [log p(x | c) + log p(c)] = argmax_c [log p(c) + Σ_i log p(x_i | c)]
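To see why the full joint table is hopeless while the independence assumption is not, here is a quick back-of-the-envelope count, assuming (for illustration only) m binary features and C = 10 classes; these numbers are not from the slides.

```python
# Parameter counts: full joint class-conditional table vs. Naive Bayes,
# assuming m binary features and C = 10 classes.
def full_table_entries(m, C=10):
    return C * (2 ** m)        # one eta^c_{ab...} per joint configuration of the inputs

def naive_bayes_entries(m, C=10):
    return C * m               # one Bernoulli parameter per feature per class

for m in (10, 64, 256):
    print(m, full_table_entries(m), naive_bayes_entries(m))
```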

Discrete (Multinomial) Naive Bayes
Discrete features x_i, assumed independent given the class label y.

  p(x_i = j | y = k) = η_{ijk}
  p(x | y = k, η) = Π_i Π_j η_{ijk}^{[x_i = j]}

Classification rule:

  p(y = k | x, η) = π_k Π_{ij} η_{ijk}^{[x_i = j]} / Σ_q π_q Π_{ij} η_{ijq}^{[x_i = j]} = e^{β_k^T x̃} / Σ_q e^{β_q^T x̃}

where β_k = [log η_{11k}; ... ; log η_{ijk}; ... ; log π_k] and x̃ = [[x_1 = 1]; ... ; [x_i = j]; ... ; 1] is the indicator encoding of x.
[Figure: graphical model of Naive Bayes with class y and independent features X_1, ..., X_m.]

Gaussian Naive Bayes
This is just a Gaussian Bayes classifier with a separate but diagonal covariance matrix for each class. Equivalent to fitting a 1-D Gaussian to each input for each class. NB: the decision surfaces are quadratics, not linear... Even better idea: blend between full, diagonal and identity covariances.

Fitting Discrete Naive Bayes
The ML parameters are class-conditional frequency counts:

  η*_{ijk} = Σ_n [x_i^n = j][y_n = k] / Σ_n [y_n = k]

How do we know? Write down the likelihood:

  l(θ; D) = Σ_n log p(y_n | π) + Σ_{n,i} log p(x_i^n | y_n, η)

and optimize it by setting its derivative to zero (careful! enforce normalization with Lagrange multipliers):

  l(η; D) = Σ_{n,i,j,k} [x_i^n = j][y_n = k] log η_{ijk} + Σ_{i,k} λ_{ik} (1 - Σ_j η_{ijk})
  ∂l/∂η_{ijk} = Σ_n [x_i^n = j][y_n = k] / η_{ijk} - λ_{ik} = 0
  λ_{ik} = Σ_n [y_n = k]   ⇒   η*_{ijk} = Σ_n [x_i^n = j][y_n = k] / Σ_n [y_n = k]

Noisy-OR Classifier
Many probabilistic models can be obtained as noisy versions of formulas from propositional logic. Noisy-OR: each input activates the output with some probability.

  p(y = 1 | x, α) = 1 - Π_i α_i^{x_i} = 1 - exp{ Σ_i x_i log α_i }

Letting θ_i = -log α_i we get yet another linear classifier:

  p(y = 1 | x, θ) = 1 - e^{-θ^T x}
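A minimal sketch of fitting and applying discrete Naive Bayes with Laplace-smoothed counts, specialized (as an illustrative assumption) to binary features so that the per-class frequency of x_i = 1 plays the role of η_{i1k} (NumPy; function names are hypothetical).

```python
import numpy as np

def fit_discrete_nb(X, y, alpha=1.0):
    """Laplace-smoothed counts; eta[k, i] = p(x_i = 1 | y = k) for binary features."""
    classes = np.unique(y)
    N, D = X.shape
    C = len(classes)
    log_prior = np.log([(np.sum(y == k) + alpha) / (N + C * alpha) for k in classes])
    eta = np.array([(X[y == k].sum(axis=0) + alpha) / (np.sum(y == k) + 2 * alpha)
                    for k in classes])                  # shape (C, D)
    return classes, log_prior, np.log(eta), np.log1p(-eta)

def predict(classes, log_prior, log_eta, log_not_eta, x):
    """argmax_c [log p(c) + sum_i log p(x_i | c)] for a binary feature vector x."""
    scores = log_prior + log_eta @ x + log_not_eta @ (1 - x)
    return classes[np.argmax(scores)]
```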

Joint vs. Conditional Models
Many of the methods we have seen so far have linear or piecewise-linear decision surfaces in some space: LDA, the perceptron, Gaussian Bayes, Naive Bayes, Noisy-OR, KNN, ... But the criteria used to find this hyperplane are different: KNN and the perceptron optimize training-set classification error. Gaussian and Naive Bayes are joint models; they optimize p(x, y) = p(y) p(x|y). Logistic regression and neural networks are conditional: they optimize p(y|x) directly. Very important point: in general there is a large difference between the architecture used for classification and the objective function used to optimize the parameters of that architecture. See the reading...

Further points...
Last class: non-parametric classifiers (e.g. K-nearest-neighbour). Those classifiers return a single guess for y without a distribution. This class: probabilistic generative models p(x, y) (e.g. Gaussian class-conditional, Naive Bayes) and discriminative (conditional) models p(y|x) (e.g. logistic regression, ANNs, noisy-OR). (Plus many more we didn't talk about, e.g. probit regression, complementary log-log, generalized linear models, ...) Advanced topic: kernel machine classifiers (e.g. the kernel voted perceptron, support vector machines, Gaussian processes). Advanced topic: combining multiple weak classifiers into a single stronger one using boosting, bagging, stacking... Readings: Hastie et al.; Duda & Hart.

Classification via Regression?
We could forget that y was a discrete (categorical) random variable and just attempt to model p(y|x) using regression. Idea: do regression to an indicator matrix. (In the binary case p(y = 1 | x) is also the conditional expectation.) For two classes, this is equivalent to LDA. For more than two classes, disaster... Very bad idea! Noise models (e.g. Gaussian) for regression are totally inappropriate, and the fits are oversensitive to outliers. Furthermore, it gives unreasonable predictions below 0 and above 1.
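A tiny numerical illustration (synthetic data, not from the lecture) of the last point: least-squares regression onto 0/1 indicator targets happily produces "probabilities" below 0 and above 1 away from the class boundary.

```python
import numpy as np

# Synthetic 1-D example: two well-separated classes, least-squares fit to 0/1 targets.
rng = np.random.default_rng(1)
x = np.r_[rng.normal(-2.0, 1.0, 50), rng.normal(2.0, 1.0, 50)]
t = np.r_[np.zeros(50), np.ones(50)]       # indicator targets
X = np.c_[np.ones_like(x), x]              # add a bias column

w, *_ = np.linalg.lstsq(X, t, rcond=None)  # "classification via regression"
for xq in (-6.0, 0.0, 6.0):
    print(xq, w[0] + w[1] * xq)            # predictions drop below 0 and exceed 1
```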