COMP 551 Applied Machine Learning Lecture 5: Generative models for linear classification

COMP 551 Applied Machine Learning Lecture 5: Generative models for linear classification Instructor: Herke van Hoof (herke.vanhoof@mail.mcgill.ca) Slides mostly by: Joelle Pineau Class web page: www.cs.mcgill.ca/~hvanho2/comp551 Unless otherwise noted, all material posted for this course is copyright of the instructors, and cannot be reused or reposted without the instructor's written permission.

Modeling for binary classification Two probabilistic approaches: 1. Generative learning: Separately model P(x|y) and P(y). Use Bayes rule to estimate P(y|x): P(y=1|x) = P(x|y=1)P(y=1) / P(x). 2. Discriminative learning: Directly estimate P(y|x). 2 Joelle Pineau

How about other types of data? Last lecture, we saw one generative approach (LDA). LDA works with continuous data. What about other types of data? 3 Joelle Pineau

How about other types of data? LDA only works with continuous input data. Let's look at an approach for handling other types of data (mainly: binary). 4 Joelle Pineau

Generative learning with binary input data Generative learning: Estimate P(x|y), P(y). Then calculate P(y|x). Simple principle: for every class y, estimate the conditional probability P(x|y) of every input pattern x. What happens if the number of input variables m is large? 5 Joelle Pineau

Generative learning with binary input data Generative learning: Estimate P(x|y), P(y). Then calculate P(y|x). Simple principle: for every class y, estimate the conditional probability P(x|y) of every input pattern x. What happens if the number of input variables m is large? O(2^m) parameters are necessary to describe the model! We need an additional assumption on the structure of the input to keep this manageable! 6 Joelle Pineau

Naïve Bayes assumption Generative learning: Estimate P(x|y), P(y). Then calculate P(y|x). Naïve Bayes: Assume the x_j are conditionally independent given y. In other words: P(x_j|y) = P(x_j|y, x_k), for all j, k. 7 Joelle Pineau

Naïve Bayes assumption Generative learning: Estimate P(x|y), P(y). Then calculate P(y|x). Naïve Bayes: Assume the x_j are conditionally independent given y. In other words: P(x_j|y) = P(x_j|y, x_k), for all j, k. Generative model structure:
P(x|y) = P(x_1, x_2, ..., x_m|y)
= P(x_1|y) P(x_2|y, x_1) P(x_3|y, x_1, x_2) ... P(x_m|y, x_1, x_2, ..., x_{m-1}) (from general rules of probabilities)
= P(x_1|y) P(x_2|y) P(x_3|y) ... P(x_m|y) (from the Naïve Bayes assumption above)
8 Joelle Pineau
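
To make the factorization concrete, here is a minimal sketch (not from the slides; the probability table theta and its values are illustrative assumptions) of how P(x|y) decomposes into a product of per-feature Bernoulli terms under the Naïve Bayes assumption:

import numpy as np

# Hypothetical conditional probability tables: theta[c][j] = P(x_j = 1 | y = c)
theta = {0: np.array([0.10, 0.10, 0.30]),   # class y = 0
         1: np.array([0.50, 0.50, 0.20])}   # class y = 1

def naive_bayes_likelihood(x, c):
    """P(x | y = c) as a product of per-feature Bernoulli terms."""
    p_j = np.where(x == 1, theta[c], 1.0 - theta[c])  # P(x_j | y = c) for each feature j
    return np.prod(p_j)

x = np.array([1, 0, 1])
print(naive_bayes_likelihood(x, 1))   # 0.5 * 0.5 * 0.2 = 0.05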

Conditional independence example: "offer" and "opportunity" might both occur often in spam e-mails. Let's say we get 50% spam and 50% regular e-mail. Let's say spam e-mails contain "offer" and "opportunity" each 50% of the time, independently of each other, and that either word appears in 10% of regular e-mails. Out of 200 e-mails:

                           Spam   Regular   Together    %    Expected % (if independent)
Contains only offer         25       9         34      17         21
Contains only opportunity   25       9         34      17         21
Contains neither            25      81        106      53         49
Contains both               25       1         26      13          9

"Offer" and "opportunity" are not independent over all e-mails! We say they are conditionally independent given the class. 9 Joelle Pineau
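
As a quick numeric check of the table (a sketch using the counts above, out of 200 e-mails), the two words are dependent overall but independent within each class:

# Overall: P(offer) = P(opportunity) = (34 + 26) / 200 = 0.30
p_offer = p_opp = 60 / 200
p_both = 26 / 200                       # observed fraction containing both words
print(p_both, p_offer * p_opp)          # 0.13 vs 0.09 -> not independent overall

# Within spam: P(offer | spam) = P(opportunity | spam) = 0.5, both = 25/100
print(25 / 100, 0.5 * 0.5)              # 0.25 vs 0.25 -> independent given the class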

Naïve Bayes graphical model: class variable y with arrows to features x_1, x_2, x_3, ..., x_m. How many parameters to estimate? Assume m binary features. 10 Joelle Pineau

Naïve Bayes graphical model: class variable y with arrows to features x_1, x_2, x_3, ..., x_m. How many parameters to estimate? Assume m binary features. Without the Naïve Bayes assumption: O(2^m) numbers to describe the model. With the Naïve Bayes assumption: O(m) numbers to describe the model. Useful when the number of features is high. 11 Joelle Pineau

Training a Naïve Bayes classifier Assume x, y are binary variables, m=1. Estimate the parameters P(x|y) and P(y) from data. Define: Θ_1 = Pr(y=1), Θ_j,1 = Pr(x_j=1|y=1), Θ_j,0 = Pr(x_j=1|y=0). [Graphical model: y with an arrow to x.] 12 Joelle Pineau

Training a Naïve Bayes classifier Assume x, y are binary variables, m=1. Estimate the parameters P(x|y) and P(y) from data. Define: Θ_1 = Pr(y=1), Θ_j,1 = Pr(x_j=1|y=1), Θ_j,0 = Pr(x_j=1|y=0). [Graphical model: y with an arrow to x.] Evaluation criterion: Find the parameters that maximize the log-likelihood function. Likelihood: Pr(y|x) ∝ Pr(y)Pr(x|y) = Π_{i=1:n} ( P(y_i) Π_{j=1:m} P(x_i,j|y_i) ). Samples i are independent, so we take the product over n. Input features are independent (conditioned on y), so we take the product over m. 13 Joelle Pineau

Training a Naïve Bayes classifier Likelihood for a binary output variable: L(Θ_1|y) = Θ_1^y (1-Θ_1)^(1-y). Log-likelihood for all parameters (like before):
log L(Θ_1, Θ_j,1, Θ_j,0 | D) = Σ_{i=1:n} [ log P(y_i) + Σ_{j=1:m} log P(x_i,j|y_i) ]
14 Joelle Pineau

Training a Naïve Bayes classifier Likelihood for a binary output variable: L(Θ_1|y) = Θ_1^y (1-Θ_1)^(1-y). Log-likelihood for all parameters (like before):
log L(Θ_1, Θ_j,1, Θ_j,0 | D) = Σ_{i=1:n} [ log P(y_i) + Σ_{j=1:m} log P(x_i,j|y_i) ]
= Σ_{i=1:n} [ y_i log Θ_1 + (1-y_i) log(1-Θ_1)
+ Σ_{j=1:m} y_i ( x_i,j log Θ_j,1 + (1-x_i,j) log(1-Θ_j,1) )
+ Σ_{j=1:m} (1-y_i) ( x_i,j log Θ_j,0 + (1-x_i,j) log(1-Θ_j,0) ) ]
15 Joelle Pineau

Training a Naïve Bayes classifier Likelihood for a binary output variable: L(Θ_1|y) = Θ_1^y (1-Θ_1)^(1-y). Log-likelihood for all parameters (like before):
log L(Θ_1, Θ_j,1, Θ_j,0 | D) = Σ_{i=1:n} [ log P(y_i) + Σ_{j=1:m} log P(x_i,j|y_i) ]
= Σ_{i=1:n} [ y_i log Θ_1 + (1-y_i) log(1-Θ_1)
+ Σ_{j=1:m} y_i ( x_i,j log Θ_j,1 + (1-x_i,j) log(1-Θ_j,1) )
+ Σ_{j=1:m} (1-y_i) ( x_i,j log Θ_j,0 + (1-x_i,j) log(1-Θ_j,0) ) ]
(This will have another form if the parameters P(x|y) have another form, e.g. Gaussian.) 16 Joelle Pineau

Training a Naïve Bayes classifier Likelihood for a binary output variable: L(Θ_1|y) = Θ_1^y (1-Θ_1)^(1-y). Log-likelihood for all parameters (like before):
log L(Θ_1, Θ_j,1, Θ_j,0 | D) = Σ_{i=1:n} [ log P(y_i) + Σ_{j=1:m} log P(x_i,j|y_i) ]
= Σ_{i=1:n} [ y_i log Θ_1 + (1-y_i) log(1-Θ_1)
+ Σ_{j=1:m} y_i ( x_i,j log Θ_j,1 + (1-x_i,j) log(1-Θ_j,1) )
+ Σ_{j=1:m} (1-y_i) ( x_i,j log Θ_j,0 + (1-x_i,j) log(1-Θ_j,0) ) ]
(This will have another form if the parameters P(x|y) have another form, e.g. Gaussian.)
Maximize to estimate Θ_1: take the derivative of log L and set it to 0:
∂L/∂Θ_1 = Σ_{i=1:n} ( y_i/Θ_1 - (1-y_i)/(1-Θ_1) ) = 0
17 Joelle Pineau

Training a Naïve Bayes classifier Solving for Θ_1 we get: Θ_1 = (1/n) Σ_{i=1:n} y_i = (number of examples where y=1) / (number of examples). Similarly, we get: Θ_j,1 = (number of examples where x_j=1 and y=1) / (number of examples where y=1), and Θ_j,0 = (number of examples where x_j=1 and y=0) / (number of examples where y=0). 18 Joelle Pineau
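
These counting-based estimates are simple to implement. A minimal sketch (not from the slides; it assumes X is an n×m binary NumPy array, y a length-n binary vector, and the names are my own):

import numpy as np

def fit_naive_bayes(X, y):
    """Maximum likelihood estimates for a Bernoulli Naive Bayes model."""
    theta_1 = y.mean()                       # Theta_1   = P(y = 1)
    theta_j1 = X[y == 1].mean(axis=0)        # Theta_j,1 = P(x_j = 1 | y = 1)
    theta_j0 = X[y == 0].mean(axis=0)        # Theta_j,0 = P(x_j = 1 | y = 0)
    return theta_1, theta_j1, theta_j0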

Naïve Bayes decision boundary The decision boundary is where the probabilities of the classes are equal: log-odds ratio = 0.
log [ Pr(y=1|x) / Pr(y=0|x) ] = log [ Pr(x|y=1)P(y=1) / (Pr(x|y=0)P(y=0)) ]
= log [ P(y=1)/P(y=0) ] + log Π_{j=1:m} [ P(x_j|y=1) / P(x_j|y=0) ]
= log [ P(y=1)/P(y=0) ] + Σ_{j=1:m} log [ P(x_j|y=1) / P(x_j|y=0) ]
19 Joelle Pineau

Naïve Bayes decision boundary Consider the case where features are binary: x_j ∈ {0, 1}. Define:
w_j,0 = log [ P(x_j=0|y=1) / P(x_j=0|y=0) ],  w_j,1 = log [ P(x_j=1|y=1) / P(x_j=1|y=0) ]
Now we have:
log [ Pr(y=1|x) / Pr(y=0|x) ] = log [ P(y=1)/P(y=0) ] + Σ_{j=1:m} log [ P(x_j|y=1) / P(x_j|y=0) ]
= log [ P(y=1)/P(y=0) ] + Σ_{j=1:m} ( w_j,0 (1-x_j) + w_j,1 x_j )
= log [ P(y=1)/P(y=0) ] + Σ_{j=1:m} w_j,0 + Σ_{j=1:m} (w_j,1 - w_j,0) x_j
This is a linear decision boundary! Constant + linear in x. 20 Joelle Pineau
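
Continuing the earlier sketch, the fitted parameters can be folded into an explicit linear rule (bias + weights), mirroring the derivation on this slide; the function name and layout are my own, not from the slides:

import numpy as np

def nb_to_linear(theta_1, theta_j1, theta_j0):
    """Convert Bernoulli Naive Bayes parameters into a linear log-odds model."""
    w_j1 = np.log(theta_j1) - np.log(theta_j0)            # log P(x_j=1|y=1) / P(x_j=1|y=0)
    w_j0 = np.log(1 - theta_j1) - np.log(1 - theta_j0)    # log P(x_j=0|y=1) / P(x_j=0|y=0)
    bias = np.log(theta_1 / (1 - theta_1)) + w_j0.sum()   # log P(y=1)/P(y=0) + sum_j w_j,0
    weights = w_j1 - w_j0
    return bias, weights                                  # predict y=1 if bias + weights @ x > 0

Note that zero counts make the logs blow up here, which is exactly the problem Laplace smoothing (a few slides below) addresses.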

Text classification example Using Naïve Bayes, we can compute probabilities for all the words which appear in the document collection. P(y=c) is the probability of class c. P(x_j|y=c) is the probability of seeing word j in documents of class c. [Graphical model: class c with arrows to word 1, word 2, word 3, ..., word m.] 21 Joelle Pineau

Text classification example Using Naïve Bayes, we can compute probabilities for all the words which appear in the document collection. P(y=c) is the probability of class c. P(x_j|y=c) is the probability of seeing word j in documents of class c. The set of classes depends on the application, e.g. topic modeling: each class corresponds to documents on a given topic, e.g. {Politics, Finance, Sports, Arts}. What happens when a word is not observed in the training data? 22 Joelle Pineau

Laplace smoothing Replace the maximum likelihood estimator: Pr(x_j=1|y=1) = (number of instances with x_j=1 and y=1) / (number of examples with y=1). 23 Joelle Pineau

Laplace smoothing Replace the maximum likelihood estimator: Pr(x_j=1|y=1) = (number of instances with x_j=1 and y=1) / (number of examples with y=1). With the following: Pr(x_j=1|y=1) = (number of instances with x_j=1 and y=1, plus 1) / (number of examples with y=1, plus 2). 24 Joelle Pineau

Laplace smoothing Replace the maximum likelihood estimator: Pr(x_j=1|y=1) = (number of instances with x_j=1 and y=1) / (number of examples with y=1). With the following: Pr(x_j=1|y=1) = (number of instances with x_j=1 and y=1, plus 1) / (number of examples with y=1, plus 2). If there are no examples from that class, the estimate reduces to a prior probability of Pr = 1/2. If all examples have x_j=1, then Pr(x_j=0|y) = 1 / (#examples + 2). If a word appears frequently, the new estimate is only slightly biased. This is a form of regularization (it decreases variance at the cost of bias). 25 Joelle Pineau
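
The same counting estimator with the +1 / +2 correction, as a sketch (not from the slides; X and y as in the earlier sketch):

import numpy as np

def fit_naive_bayes_smoothed(X, y):
    """Laplace-smoothed estimates of P(x_j = 1 | y = c) for c in {0, 1}."""
    theta_j1 = (X[y == 1].sum(axis=0) + 1) / ((y == 1).sum() + 2)
    theta_j0 = (X[y == 0].sum(axis=0) + 1) / ((y == 0).sum() + 2)
    return theta_j1, theta_j0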

Example: 20 newsgroups Given 1000 training documents from each group, learn to classify new documents according to which newsgroup they came from: comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x, alt.atheism, soc.religion.christian, talk.religion.misc, talk.politics.mideast, talk.politics.misc, misc.forsale, rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey, sci.space, sci.crypt, sci.electronics, sci.med, talk.politics.guns. Naïve Bayes: 89% classification accuracy (comparable to other state-of-the-art methods). 26 Joelle Pineau

Gaussian Naïve Bayes Extending Naïve Bayes to continuous inputs: P(y) is still assumed to be a binomial distribution. P(x|y) is assumed to be a multivariate Gaussian (normal) distribution with mean μ ∈ R^n and covariance matrix Σ ∈ R^{n×n}. 27 Joelle Pineau

Gaussian Naïve Bayes Extending Naïve Bayes to continuous inputs: P(y) is still assumed to be a binomial distribution. P(x|y) is assumed to be a multivariate Gaussian (normal) distribution with mean μ ∈ R^n and covariance matrix Σ ∈ R^{n×n}. If we assume the same Σ for all classes: linear discriminant analysis. If Σ is distinct between classes: quadratic discriminant analysis. If Σ is diagonal (i.e. features are independent): Gaussian Naïve Bayes (linear if the same Σ is shared by all classes). 28 Joelle Pineau

Gaussian Naïve Bayes Extending Naïve Bayes to continuous inputs: P(y) is still assumed to be a binomial distribution. P(x|y) is assumed to be a multivariate Gaussian (normal) distribution with mean μ ∈ R^n and covariance matrix Σ ∈ R^{n×n}. If we assume the same Σ for all classes: linear discriminant analysis. If Σ is distinct between classes: quadratic discriminant analysis. If Σ is diagonal (i.e. features are independent): Gaussian Naïve Bayes (linear if the same Σ is shared by all classes). How do we estimate the parameters? Derive the maximum likelihood estimators for μ and Σ. 29 Joelle Pineau
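
For the diagonal-covariance case (Gaussian Naïve Bayes) the maximum likelihood estimators reduce to per-class, per-feature means and variances. A minimal sketch (not from the slides; it assumes continuous features in an n×m array X and binary labels y):

import numpy as np

def fit_gaussian_nb(X, y):
    """Per-class mean and diagonal covariance (Gaussian Naive Bayes)."""
    params = {}
    for c in (0, 1):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0), Xc.var(axis=0))   # mu_c, diag(Sigma_c)
    return params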

Modeling for binary classification Two probabilistic approaches: 1. Generative learning: Separately model P(x|y) and P(y). Use Bayes rule to estimate P(y|x): P(y=1|x) = P(x|y=1)P(y=1) / P(x). 2. Discriminative learning: Directly estimate P(y|x). 30 Joelle Pineau

Discriminative learning We have seen that under several assumptions, we get linear decision boundaries: p(x|y) are Gaussian with shared covariance (LDA); p(x|y) are independent Bernoulli distributions (Naïve Bayes). Do we really need to estimate p(x|y) and p(y)? Can we directly find the parameters of the best decision boundary? E.g. the covariance matrix requires estimating O(m^2) parameters, but the decision boundary only requires O(m) parameters. 31 Joelle Pineau

Probabilistic view of discriminative learning Suppose we have 2 classes: y ∈ {0, 1}. What is the probability of a given input x having class y=1? Consider Bayes rule:
P(y=1|x) = P(x, y=1) / P(x) = P(x|y=1)P(y=1) / [ P(x|y=1)P(y=1) + P(x|y=0)P(y=0) ]
32 Joelle Pineau

Probabilistic view of discriminative learning Suppose we have 2 classes: y ∈ {0, 1}. What is the probability of a given input x having class y=1? Consider Bayes rule:
P(y=1|x) = P(x, y=1) / P(x) = P(x|y=1)P(y=1) / [ P(x|y=1)P(y=1) + P(x|y=0)P(y=0) ]
= 1 / [ 1 + P(x|y=0)P(y=0) / (P(x|y=1)P(y=1)) ]
= 1 / [ 1 + exp( ln [ P(x|y=0)P(y=0) / (P(x|y=1)P(y=1)) ] ) ]
= 1 / (1 + exp(-a)) = σ(a)
33 Joelle Pineau

Probabilistic view of discriminative learning Suppose we have 2 classes: y ∈ {0, 1}. What is the probability of a given input x having class y=1? Consider Bayes rule:
P(y=1|x) = P(x, y=1) / P(x) = P(x|y=1)P(y=1) / [ P(x|y=1)P(y=1) + P(x|y=0)P(y=0) ]
= 1 / [ 1 + P(x|y=0)P(y=0) / (P(x|y=1)P(y=1)) ]
= 1 / [ 1 + exp( ln [ P(x|y=0)P(y=0) / (P(x|y=1)P(y=1)) ] ) ]
= 1 / (1 + exp(-a)) = σ(a)
where a = ln [ P(x|y=1)P(y=1) / (P(x|y=0)P(y=0)) ] = ln [ P(y=1|x) / P(y=0|x) ] (by Bayes rule; P(x) on top and bottom cancels out).
34 Joelle Pineau

Probabilistic view of discriminative learning Suppose we have 2 classes: y ∈ {0, 1}. What is the probability of a given input x having class y=1? Consider Bayes rule:
P(y=1|x) = P(x, y=1) / P(x) = P(x|y=1)P(y=1) / [ P(x|y=1)P(y=1) + P(x|y=0)P(y=0) ]
= 1 / [ 1 + P(x|y=0)P(y=0) / (P(x|y=1)P(y=1)) ]
= 1 / [ 1 + exp( ln [ P(x|y=0)P(y=0) / (P(x|y=1)P(y=1)) ] ) ]
= 1 / (1 + exp(-a)) = σ(a)
where a = ln [ P(x|y=1)P(y=1) / (P(x|y=0)P(y=0)) ] = ln [ P(y=1|x) / P(y=0|x) ] (by Bayes rule; P(x) on top and bottom cancels out).
Here σ has a special form, called the logistic function, and a is the log-odds ratio of the data being class 1 vs. class 0.
35 Joelle Pineau

Discriminative learning: Logistic regression The logistic function (= sigmoid curve): σ(w^T x) = 1 / (1 + e^{-w^T x}). Transforms the learned function s.t. it can be interpreted as a probability. 36 Joelle Pineau

Discriminative learning: Logistic regression The logistic function (= sigmoid curve): σ(w^T x) = 1 / (1 + e^{-w^T x}). Transforms the learned function s.t. it can be interpreted as a probability. The decision boundary is the set of points for which a=0. Idea: Directly model the log-odds with a linear function:
a = ln [ P(x|y=1)P(y=1) / (P(x|y=0)P(y=0)) ] = w_0 + w_1 x_1 + ... + w_m x_m
37 Joelle Pineau

Discriminative learning: Logistic regression The logistic function (= sigmoid curve): σ(w^T x) = 1 / (1 + e^{-w^T x}). Transforms the learned function s.t. it can be interpreted as a probability. The decision boundary is the set of points for which a=0. Idea: Directly model the log-odds with a linear function:
a = ln [ P(x|y=1)P(y=1) / (P(x|y=0)P(y=0)) ] = w_0 + w_1 x_1 + ... + w_m x_m
How do we find the weights? Need an objective function! 38 Joelle Pineau
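
A sketch of the model itself (not from the slides): the predicted probability is the logistic function applied to a linear score; here the bias w_0 is assumed to be absorbed into w via a constant first column of X, which is my convention, not the slides'.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predict_proba(w, X):
    """P(y = 1 | x) = sigma(w^T x) for each row of X (first column of X assumed to be all ones)."""
    return sigmoid(X @ w)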

Fitting the weights Recall: σ(w^T x_i) is the probability that y_i=1 (given x_i), and 1-σ(w^T x_i) is the probability that y_i=0. For y ∈ {0, 1}, the likelihood function, Pr(x_1, y_1, ..., x_n, y_n | w), is:
Π_{i=1:n} σ(w^T x_i)^{y_i} (1-σ(w^T x_i))^{1-y_i}   (samples are i.i.d.)
39 Joelle Pineau

Fitting the weights Recall: σ(w^T x_i) is the probability that y_i=1 (given x_i), and 1-σ(w^T x_i) is the probability that y_i=0. For y ∈ {0, 1}, the likelihood function, Pr(x_1, y_1, ..., x_n, y_n | w), is:
Π_{i=1:n} σ(w^T x_i)^{y_i} (1-σ(w^T x_i))^{1-y_i}   (samples are i.i.d.)
Goal: Minimize the negative log-likelihood (also called the cross-entropy error function):
- Σ_{i=1:n} [ y_i log(σ(w^T x_i)) + (1-y_i) log(1-σ(w^T x_i)) ]
40 Joelle Pineau
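
The negative log-likelihood above, written out as a sketch (not from the slides; the small epsilon for numerical safety is my addition):

import numpy as np

def cross_entropy(w, X, y, eps=1e-12):
    """Negative log-likelihood (cross-entropy error) of logistic regression."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))          # sigma(w^T x_i) for each sample
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))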

Gradient descent for logistic regression Error fn: Err(w) = - Σ_{i=1:n} [ y_i log(σ(w^T x_i)) + (1-y_i) log(1-σ(w^T x_i)) ]. Take the derivative: 41 Joelle Pineau

Gradient descent for logistic regression Error fn: Err(w) = - Σ_{i=1:n} [ y_i log(σ(w^T x_i)) + (1-y_i) log(1-σ(w^T x_i)) ]. Take the derivative (using d log(σ)/dσ = 1/σ):
∂Err(w)/∂w = - [ Σ_{i=1:n} y_i (1/σ(w^T x_i)) (1-σ(w^T x_i)) σ(w^T x_i) x_i +
42 Joelle Pineau

Gradient descent for logistic regression Error fn: Err(w) = - Σ_{i=1:n} [ y_i log(σ(w^T x_i)) + (1-y_i) log(1-σ(w^T x_i)) ]. Take the derivative (using dσ(a)/da = σ(a)(1-σ(a))):
∂Err(w)/∂w = - [ Σ_{i=1:n} y_i (1/σ(w^T x_i)) (1-σ(w^T x_i)) σ(w^T x_i) x_i +
43 Joelle Pineau

Gradient descent for logistic regression Error fn: Err(w) = - Σ_{i=1:n} [ y_i log(σ(w^T x_i)) + (1-y_i) log(1-σ(w^T x_i)) ]. Take the derivative (using ∂(w^T x)/∂w = x):
∂Err(w)/∂w = - [ Σ_{i=1:n} y_i (1/σ(w^T x_i)) (1-σ(w^T x_i)) σ(w^T x_i) x_i +
44 Joelle Pineau

Gradient descent for logistic regression Error fn: Err(w) = - Σ_{i=1:n} [ y_i log(σ(w^T x_i)) + (1-y_i) log(1-σ(w^T x_i)) ]. Take the derivative (using d(1-σ(a))/da = -σ(a)(1-σ(a))):
∂Err(w)/∂w = - [ Σ_{i=1:n} y_i (1/σ(w^T x_i)) (1-σ(w^T x_i)) σ(w^T x_i) x_i + (1-y_i) (1/(1-σ(w^T x_i))) (1-σ(w^T x_i)) σ(w^T x_i) (-1) x_i ]
45 Joelle Pineau

Gradient descent for logistic regression Error fn: Err(w) = - Σ_{i=1:n} [ y_i log(σ(w^T x_i)) + (1-y_i) log(1-σ(w^T x_i)) ]. Take the derivative:
∂Err(w)/∂w = - [ Σ_{i=1:n} y_i (1/σ(w^T x_i)) (1-σ(w^T x_i)) σ(w^T x_i) x_i + (1-y_i) (1/(1-σ(w^T x_i))) (1-σ(w^T x_i)) σ(w^T x_i) (-1) x_i ]
= - Σ_{i=1:n} x_i ( y_i (1-σ(w^T x_i)) - (1-y_i) σ(w^T x_i) )
= - Σ_{i=1:n} x_i ( y_i - σ(w^T x_i) )
46 Joelle Pineau

Gradient descent for logistic regression Error fn: Err(w) = - Σ_{i=1:n} [ y_i log(σ(w^T x_i)) + (1-y_i) log(1-σ(w^T x_i)) ]. Take the derivative:
∂Err(w)/∂w = - [ Σ_{i=1:n} y_i (1/σ(w^T x_i)) (1-σ(w^T x_i)) σ(w^T x_i) x_i + (1-y_i) (1/(1-σ(w^T x_i))) (1-σ(w^T x_i)) σ(w^T x_i) (-1) x_i ]
= - Σ_{i=1:n} x_i ( y_i (1-σ(w^T x_i)) - (1-y_i) σ(w^T x_i) )
= - Σ_{i=1:n} x_i ( y_i - σ(w^T x_i) )
Now apply iteratively: w_{k+1} = w_k + α_k Σ_{i=1:n} x_i (y_i - σ(w_k^T x_i))
47 Joelle Pineau

Gradient descent for logistic regression Error fn: Err(w) = - Σ_{i=1:n} [ y_i log(σ(w^T x_i)) + (1-y_i) log(1-σ(w^T x_i)) ]. Take the derivative:
∂Err(w)/∂w = - [ Σ_{i=1:n} y_i (1/σ(w^T x_i)) (1-σ(w^T x_i)) σ(w^T x_i) x_i + (1-y_i) (1/(1-σ(w^T x_i))) (1-σ(w^T x_i)) σ(w^T x_i) (-1) x_i ]
= - Σ_{i=1:n} x_i ( y_i (1-σ(w^T x_i)) - (1-y_i) σ(w^T x_i) )
= - Σ_{i=1:n} x_i ( y_i - σ(w^T x_i) )
Now apply iteratively: w_{k+1} = w_k + α_k Σ_{i=1:n} x_i (y_i - σ(w_k^T x_i))
Can also apply other iterative methods, e.g. Newton's method, coordinate descent, L-BFGS, etc. 48 Joelle Pineau
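
Putting the gradient and the update rule together, a minimal batch gradient descent sketch (not from the slides; the fixed step size alpha, the iteration count, and the function name are illustrative assumptions):

import numpy as np

def fit_logistic_regression(X, y, alpha=0.01, n_iters=1000):
    """Batch gradient descent on the cross-entropy error."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))      # sigma(w^T x_i) for each sample
        grad = -X.T @ (y - p)                   # dErr/dw = -sum_i x_i (y_i - sigma(w^T x_i))
        w = w - alpha * grad                    # same as w + alpha * sum_i x_i (y_i - p_i)
    return w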

Multi-class classification Generally two options: 1. Learn a single classifier that can produce 20 distinct output values. 2. Learn 20 different 1-vs-all binary classifiers. 49 Joelle Pineau

Multi-class classification Generally two options: 1. Learn a single classifier that can produce 20 distinct output values. 2. Learn 20 different 1-vs-all binary classifiers. Option 1 assumes you have a multi-class version of the classifier. For Naïve Bayes, compute P(y|x) for each class, and select the class with the highest probability. 50 Joelle Pineau

Multi-class classification Generally two options: 1. Learn a single classifier that can produce 20 distinct output values. 2. Learn 20 different 1-vs-all binary classifiers. Option 1 assumes you have a multi-class version of the classifier. For Naïve Bayes, compute P(y|x) for each class, and select the class with the highest probability. Option 2 applies to all binary classifiers, so it is more flexible. But: it is often slower (need to learn many classifiers); it creates a class imbalance problem (say, 5% vs 95% for 20 classes); and what if two classifiers claim an example belongs to their class? Or none do? 51 Joelle Pineau
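
A sketch of option 2 (one-vs-all), reusing fit_logistic_regression from the earlier sketch; breaking ties by taking the class with the largest score is one common way to handle the ambiguity mentioned above, not something prescribed by the slides:

import numpy as np

def fit_one_vs_all(X, y, n_classes):
    """Train one binary logistic regression per class (class c vs. the rest)."""
    return [fit_logistic_regression(X, (y == c).astype(float)) for c in range(n_classes)]

def predict_one_vs_all(models, X):
    scores = np.column_stack([X @ w for w in models])   # higher score = higher P(y = c | x)
    return scores.argmax(axis=1)                        # pick the most confident classifier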

Comparing linear classification methods

Technique                         Error rate (training)   Error rate (test)
Linear regression                        0.48                   0.67
Linear discriminant analysis             0.32                   0.56
Quadratic discriminant analysis          0.01                   0.53
Logistic regression                      0.22                   0.51

[Figure 4.4: A two-dimensional plot of the vowel training data (Coordinate 1 vs. Coordinate 2 for the training data). There are eleven classes with X ∈ R^10, and this is the best view in terms of an LDA model (Section 4.3.3). The heavy circles are the projected mean vectors for each class. The class overlap is considerable.]
52 Joelle Pineau

Discriminative vs generative Discriminative classifiers often have fewer parameters to estimate. Discriminative classifiers often do better, but: a generative model might give us more insight into the data; it can tell us when all classes are bad (low probability). With many classes, discriminative models need to find the decision boundary between every pair. 53 Joelle Pineau

What you should know Naïve Bayes assumption. Log-odds ratio decision boundary. How to estimate parameters for Naïve Bayes. Laplace smoothing. Relation between Naïve Bayes, LDA, QDA, Gaussian Naïve Bayes. Derivation of logistic regression. Worth reading further: relation between logistic regression and LDA (Hastie et al., 4.4.5). 54 Joelle Pineau

What you should know 55 Joelle Pineau