Bayes (Naïve or not) Classifiers: Generative Approach

Logistic regression

Bayes (Naïve or not) Classifiers: Generative Approach

What do we mean by a generative approach: learn $p(y)$ and $p(x \mid y)$, and then apply Bayes' rule to compute $p(y \mid x)$ for making predictions. This is essentially making the assumption that the data is generated according to a generative process governed by $p(y)$ and $p(x \mid y)$. In particular, to generate a data point:
1. Sample from $p(y)$ to determine its class label.
2. Sample its feature vector $x$ from $p(x \mid y)$.

Bayes classifier: $y \to x$, with $p(y)$ and $p(x \mid y)$. To generate one example in our training or test set:
1. Generate a $y$ according to $p(y)$.
2. Generate an $x$ vector according to $p(x \mid y)$.

Naïve Bayes classifier: $y \to x_1, \ldots, x_m$, with $p(y)$ and a separate $p(x_i \mid y)$ for each feature. To generate one example in our training or test set:
1. Generate a $y$ according to $p(y)$.
2. For $i = 1, \ldots, m$, generate a value for feature $x_i$ according to $p(x_i \mid y)$.
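As a concrete illustration of this generative process (not from the original slides), here is a minimal Python sketch, assuming a Bernoulli class prior p(y) and, for the Naïve Bayes case, an independent Gaussian p(x_i | y) per feature; the prior, means, and standard deviations below are made-up values.

    import numpy as np

    rng = np.random.default_rng(0)

    # Assumed illustrative parameters: prior p(y) and per-feature Gaussian p(x_i | y).
    prior_y1 = 0.3                              # p(y = 1)
    means = {0: np.array([0.0, 0.0]),           # mean of each feature given y
             1: np.array([2.0, 1.0])}
    stds  = {0: np.array([1.0, 1.0]),           # std of each feature given y
             1: np.array([1.0, 0.5])}

    def sample_example():
        """Step 1: sample y from p(y); step 2: sample each feature x_i from p(x_i | y)."""
        y = int(rng.random() < prior_y1)
        x = rng.normal(means[y], stds[y])       # features drawn independently given y
        return x, y

    examples = [sample_example() for _ in range(5)]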

The generative approach is just one type of learning approach used in machine learning. Learning a correct generative model $p(x \mid y)$ can be difficult: density estimation is a challenging problem in its own right, and sometimes unnecessary. In contrast, perceptron, KNN and DT are what we call discriminative methods. They are not concerned with any probabilistic model for generating the observations; they only care about finding a good discriminative function. Perceptron, KNN and DT learn deterministic functions, not probabilistic ones. One can also take a probabilistic approach to learning discriminative functions, i.e., learn $p(y \mid x)$ directly without assuming $x$ is generated from some particular distribution given $y$ (i.e., without modeling $p(x \mid y)$). Logistic regression is one such approach.

Logistic regression

Recall the problem of regression: it learns a mapping from an input vector $x$ to a continuous output $y$. Logistic regression extends traditional regression to handle a binary output $y$. In particular, we assume that

$P(y = 1 \mid x) = g(w \cdot x) = 1 / (1 + e^{-(w_0 + w_1 x_1 + \cdots + w_m x_m)})$,

where $g(t) = 1 / (1 + e^{-t})$ is the sigmoid function.
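A short, hedged sketch of the sigmoid and the resulting conditional probability (the function names and the separate intercept argument are illustrative choices, not from the slides):

    import numpy as np

    def sigmoid(t):
        """g(t) = 1 / (1 + e^{-t}): squashes any real number into (0, 1)."""
        return 1.0 / (1.0 + np.exp(-t))

    def p_y1_given_x(x, w, w0):
        """P(y = 1 | x) = g(w0 + w . x) for feature vector x and weights (w0, w)."""
        return sigmoid(w0 + np.dot(w, x))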

Logistic Regression

Equivalently, we have the following:

$\log \frac{P(y = 1 \mid x)}{P(y = 0 \mid x)} = w_0 + w_1 x_1 + \cdots + w_m x_m$,

where $\frac{P(y = 1 \mid x)}{P(y = 0 \mid x)}$ is the odds of $y = 1$.

Side note: the odds in favor of an event are the quantity $p / (1 - p)$, where $p$ is the probability of the event. If I toss a fair die, what are the odds that I will get a six? Here $p = 1/6$, so the odds are $(1/6)/(5/6) = 1/5$.

In other words, LR assumes that the log odds of $y = 1$ is a linear function of the input features.

Logistic Regression learns a linear decision boundary

We predict $y = 1$ if $p(y = 1 \mid x) > p(y = 0 \mid x)$.
Equivalently, predict $y = 1$ if $\frac{p(y = 1 \mid x)}{p(y = 0 \mid x)} > 1$.
Equivalently, predict $y = 1$ if $\log \frac{p(y = 1 \mid x)}{p(y = 0 \mid x)} = w_0 + w_1 x_1 + \cdots + w_m x_m > 0$.
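These three rules are the same test written three ways; a tiny check with assumed example weights makes the equivalence concrete:

    import numpy as np

    w0, w = -1.0, np.array([2.0, -0.5])     # assumed illustrative weights
    x = np.array([1.0, 0.4])

    score = w0 + np.dot(w, x)               # log odds of y = 1
    p1 = 1.0 / (1.0 + np.exp(-score))       # P(y = 1 | x)

    # Probability comparison, odds ratio, and sign of the log odds all agree.
    assert (p1 > 1 - p1) == (p1 / (1 - p1) > 1) == (score > 0)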

Learning w for logistic regression

Given a set of training data points, we would like to find a weight vector $w$ such that

$P(y = 1 \mid x) = 1 / (1 + e^{-(w_0 + w_1 x_1 + \cdots + w_m x_m)})$

is large (approaching 1) for positive training examples, and small (approaching 0) for negative examples. Maximum Likelihood estimation allows us to capture this precisely.

Maximum Likelihood Estimation

Goal: estimate the parameters of the distribution given data, assuming the examples are independently and identically distributed (i.i.d.) according to a distribution with parameter $\theta$. Let $D = \{d_1, d_2, \ldots, d_n\}$ be the observed data. We define the likelihood function of $\theta$ as

$L(\theta) = P(D; \theta) = \prod_{i=1}^{n} P(d_i; \theta)$.

It measures how likely it is to observe the data $D$ given that the parameter is $\theta$. The Maximum Likelihood Estimator (MLE) is

$\theta_{MLE} = \arg\max_\theta L(\theta)$.

Example

A coin toss (denoted by $x$) follows a binary distribution: $P(x) = \theta^x (1 - \theta)^{1 - x}$, where $\theta = P(x = 1)$ is the parameter we want to estimate. Observed data: $n$ i.i.d. coin tosses $D = \{x_1, x_2, \ldots, x_n\}$. Previously we mentioned a counting recipe: $\theta = \frac{1}{n} \sum_{i=1}^{n} x_i$. This is actually the Maximum Likelihood Estimate of $\theta$.

What is the likelihood function?

$L(\theta) = P(D; \theta) = \prod_{i=1}^{n} P(x_i; \theta) = \prod_{i=1}^{n} \theta^{x_i} (1 - \theta)^{1 - x_i} = \theta^{\sum_{i=1}^{n} x_i} (1 - \theta)^{n - \sum_{i=1}^{n} x_i}$.

MLE estimate for the binary distribution

$\theta_{MLE} = \arg\max_\theta L(\theta) = \arg\max_\theta \log L(\theta) = \arg\max_\theta \left[ \left(\sum_{i=1}^{n} x_i\right) \log\theta + \left(n - \sum_{i=1}^{n} x_i\right) \log(1 - \theta) \right]$.

Write $n_1 = \sum_{i=1}^{n} x_i$ for the number of 1s and $n_0 = n - n_1$ for the number of 0s. Setting the derivative to zero,

$\frac{d \log L(\theta)}{d\theta} = \frac{n_1}{\theta} - \frac{n_0}{1 - \theta} = 0 \;\Rightarrow\; n_1 (1 - \theta) = n_0 \theta \;\Rightarrow\; \theta_{MLE} = \frac{n_1}{n_0 + n_1} = \frac{1}{n} \sum_{i=1}^{n} x_i$.
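A quick numerical sanity check of this closed form (the tosses below are made-up data): the log-likelihood over a grid of θ values peaks at the sample mean.

    import numpy as np

    tosses = np.array([1, 0, 1, 1, 0, 1, 0, 1])          # illustrative i.i.d. coin tosses
    n1 = tosses.sum()                                    # number of 1s
    n0 = len(tosses) - n1                                # number of 0s

    thetas = np.linspace(0.01, 0.99, 99)
    log_lik = n1 * np.log(thetas) + n0 * np.log(1 - thetas)

    print(thetas[np.argmax(log_lik)])    # grid maximizer, near 0.625
    print(tosses.mean())                 # 0.625, the closed-form MLE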

MLE for logistic regression

$w_{MLE} = \arg\max_w l(w) = \arg\max_w \sum_{i=1}^{n} \log P(y_i \mid x_i; w)$
$\;\;= \arg\max_w \sum_{i=1}^{n} \left[ y_i \log P(y_i = 1 \mid x_i; w) + (1 - y_i) \log(1 - P(y_i = 1 \mid x_i; w)) \right]$
$\;\;= \arg\max_w \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i(w) + (1 - y_i) \log(1 - \hat{y}_i(w)) \right]$.

As such, given a set of training data points, we would like to find a weight vector $w$ such that $p(y = 1 \mid x; w)$ is large (e.g., close to 1) for positive training examples, and small (e.g., close to 0) otherwise, the same as our intuition.
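A minimal sketch of this conditional log-likelihood for 0/1 labels, assuming the intercept w_0 has been folded into w via a constant feature of 1 (an assumption of this sketch, not stated on the slide):

    import numpy as np

    def log_likelihood(w, X, y):
        """l(w) = sum_i [ y_i log yhat_i + (1 - y_i) log(1 - yhat_i) ], yhat_i = sigmoid(w . x_i)."""
        yhat = 1.0 / (1.0 + np.exp(-X @ w))
        return np.sum(y * np.log(yhat) + (1 - y) * np.log(1 - yhat))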

Optimizing l(w)

Unfortunately this does not have a closed-form solution: you can take the derivative and set it to zero, but no closed-form solution exists. Instead, we iteratively search for the optimal $w$: start with a random $w$ and iteratively improve it (similar to Perceptron) by following the gradient.

Gradient of l(w)

$l(w) = \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i(w) + (1 - y_i) \log(1 - \hat{y}_i(w)) \right]$.

Useful fact: $\nabla_w \hat{y}_i(w) = \hat{y}_i(w) (1 - \hat{y}_i(w)) \, x_i$.

It follows that $\nabla_w l(w) = \sum_{i=1}^{n} (y_i - \hat{y}_i(w)) \, x_i$.
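The gradient formula can be checked numerically against a finite-difference approximation of l(w); the data and weights below are made up purely for the check.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(20, 3))                 # illustrative data (no intercept, for brevity)
    y = (rng.random(20) < 0.5).astype(float)     # illustrative 0/1 labels
    w = rng.normal(size=3)

    def ll(w):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

    yhat = 1.0 / (1.0 + np.exp(-X @ w))
    grad = (y - yhat) @ X                        # sum_i (y_i - yhat_i) x_i

    eps = 1e-6
    e0 = np.array([eps, 0.0, 0.0])
    print(grad[0], (ll(w + e0) - ll(w - e0)) / (2 * eps))   # the two numbers should closely agree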

Batch Learning for Logistic Regression (note: $y$ takes values 0/1 here, not 1/-1)

Given: training examples $(x_i, y_i)$, $i = 1, \ldots, N$
Let $w \leftarrow (0, 0, 0, \ldots, 0)$
Repeat until convergence:
    $d \leftarrow (0, 0, 0, \ldots, 0)$
    For $i = 1$ to $N$ do:
        $\hat{y}_i \leftarrow 1 / (1 + e^{-w \cdot x_i})$
        $error \leftarrow y_i - \hat{y}_i$
        $d \leftarrow d + error \cdot x_i$
    $w \leftarrow w + d$

Note the striking similarity between LR and perceptron: both learn a linear decision boundary, and the iterative algorithm takes a very similar form.
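A runnable version of the batch update above, in Python; the learning rate eta, the convergence tolerance, and the synthetic data are assumptions added for illustration (the pseudocode itself just writes w ← w + d).

    import numpy as np

    def train_logistic_regression(X, y, eta=0.01, tol=1e-6, max_iters=10000):
        """Batch gradient ascent on l(w). X includes a leading column of 1s for w_0; y is 0/1."""
        w = np.zeros(X.shape[1])                  # w <- (0, 0, ..., 0)
        for _ in range(max_iters):
            yhat = 1.0 / (1.0 + np.exp(-X @ w))   # yhat_i = 1 / (1 + e^{-w . x_i})
            d = (y - yhat) @ X                    # d = sum_i (y_i - yhat_i) x_i
            w = w + eta * d                       # ascent step (eta is an added assumption)
            if np.linalg.norm(eta * d) < tol:     # crude convergence test (also an assumption)
                break
        return w

    # Tiny usage example on made-up two-class data.
    rng = np.random.default_rng(0)
    X_raw = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
    y = np.concatenate([np.zeros(50), np.ones(50)])
    X = np.hstack([np.ones((100, 1)), X_raw])     # prepend the constant feature for w_0
    w = train_logistic_regression(X, y)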

Connection to Naïve Bayes

If we use Naïve Bayes and assume a Gaussian distribution for $P(x \mid y)$, we can show that $P(y = 1 \mid x)$ takes the exact same functional form as Logistic Regression. So, are they equivalent? No: logistic regression does not assume a Gaussian distribution or conditional independence, and thus makes weaker modeling assumptions. A strong modeling assumption can make a significant difference in practice.

Comparatively

Naïve Bayes, a generative model: $p(x \mid y)$ makes strong assumptions about the distribution of the data attributes. When the assumptions are reasonable, Naïve Bayes can estimate a good model with even a small number of examples.

Logistic regression, a discriminative model: it directly learns $P(y \mid x)$. There are fewer parameters to estimate, but they are tied together, which makes learning harder. It makes weaker assumptions and may need a larger number of training examples.

Bottom line: if the Naïve Bayes assumption holds and the probabilistic models are accurate (i.e., $x$ is Gaussian given $y$, etc.), NB would be a good choice; otherwise, logistic regression often works better.

Summary

We introduced the concept of generative vs. discriminative methods. Given a method that we discussed in class, you need to know which category it belongs to.

Logistic regression:
- Assumes that the log odds of $y = 1$ is a linear function of $x$ (i.e., $w \cdot x$).
- The learning goal is to learn a weight vector $w$ such that examples with $y = 1$ are predicted to have high $P(y = 1 \mid x)$ and vice versa. Maximum likelihood estimation is an approach that achieves this.
- An iterative algorithm learns $w$ using MLE.
- Note the similarities and differences between LR and Perceptrons.
- Logistic regression learns a linear decision boundary. By introducing nonlinear features (e.g., $x_1^2$, $x_2^2$, $x_1 x_2$, ...), it can be extended to a nonlinear boundary.