Classification as a Regression Problem

Target variable y ∈ {C_1, C_2, C_3, ..., C_K}.

To treat classification as a regression problem we should transform the target y into numerical values. The choice of numerical class representation is quite arbitrary; careful numerical class representation is a critical step.

Binary Classification

Let us represent the classes C_1 and C_2 with numerical values 1 and 0, i.e., if y ∈ C_1 then y = 1, and if y ∈ C_2 then y = 0. Since we have assigned numeric values to the classes, binary classification can be considered to be a regression problem. The optimal predictor for regression is

f*(x) = E[y | x] = 1 · P(y ∈ C_1 | x) + 0 · P(y ∈ C_2 | x) = P(y ∈ C_1 | x)

Therefore, the optimal predictor outputs the posterior probability of class C_1. By applying the Bayes classification rule we can obtain the Bayes classifier!

Important conclusion: With appropriate class representation, the optimal classification is equivalent to optimal regression.

Multi-class Classification (K classes)

Could the previous result be generalized to multi-class classification?

Example. 3 classes: "white", "black", and "blue". Let us examine the representation y = -1 (class is "white"), y = 0 (class is "black"), and y = 1 (class is "blue").

Discussion: This representation is inappropriate since it enforces an order; it implies that "white" and "blue" are further apart than, say, "black" and "blue". What if it is evident that an example could be either "white" or "blue", but definitely not "black"? The proposed representation would probably lead to a completely misleading answer: "black"!

Solution: Decompose the multi-class problem into K binary classification problems:

Problem j (j = 1, ..., K): if y ∈ C_j then y = 1; if y ∉ C_j then y = 0.

The regression function E[y | x] on Problem j will equal the posterior P(y ∈ C_j | x). By repeating this for all K classes, all posteriors will be available, which is sufficient to construct the Bayes classifier!
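As an illustration of the decomposition above, here is a minimal Python sketch (the function name and the use of ordinary least squares are my own choices, not part of the lecture): each class gets its own 0/1 regression problem, the fitted regression output approximates the posterior P(y ∈ C_j | x), and picking the class with the largest estimated posterior mimics the Bayes classification rule.

    import numpy as np

    def one_vs_rest_posteriors(X, y, X_new):
        # One regression problem per class: targets t_i = 1 if y_i is class k, else 0.
        # Each fitted linear regression approximates the posterior P(y = k | x).
        Xb = np.column_stack([np.ones(len(X)), X])          # add dummy attribute x0 = 1
        Xn = np.column_stack([np.ones(len(X_new)), X_new])
        classes = np.unique(y)
        posteriors = []
        for k in classes:
            t = (y == k).astype(float)                      # 0/1 class representation
            theta, *_ = np.linalg.lstsq(Xb, t, rcond=None)  # fit by least squares
            posteriors.append(Xn @ theta)
        P = np.column_stack(posteriors)
        return classes[np.argmax(P, axis=1)], P             # Bayes-style decision + scores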

Approaches to Minimizing MSE from a Finite Dataset

Goal: Given a dataset D, find a mapping f that minimizes MSE. Two extreme approaches:

1. Nearest neighbor algorithm (non-parametric approach)
2. Linear regression (parametric approach)

The non-parametric approach assumes that data points close in the attribute space are similar to each other. It does not assume any functional form. The parametric approach assumes a functional form: e.g., the output is a linear function of the inputs.

Nearest Neighbor Algorithms

k-nearest neighbor (k-NN): f(x) = (1/k) Σ_{x_i ∈ N_k(x)} y_i, where N_k(x) are the k nearest neighbors of x.

Regression: the prediction is the average y among the k nearest neighbors.
Classification: the prediction is the majority class among the k nearest neighbors.

k-NN is also known as a lazy learning algorithm because it does not perform any calculations prior to seeing a data point; it has to analyze the whole dataset for the nearest neighbors every time a new data point appears. Note: parametric learning algorithms learn a function from the dataset; they are much faster in giving predictions but need to spend some time on training beforehand.

Theorem: If N → ∞, k → ∞, and k/N → 0 (a large number of neighbors in a very tight neighborhood), k-NN is an optimal predictor. Practically, if we have only a few dimensions (attributes), k-NN is a good try; if we have more attributes, we may run out of data points (in practice, data size is always limited).
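A minimal k-NN sketch in Python, matching the prediction rule above (the function name and the choice of Euclidean distance are illustrative assumptions):

    import numpy as np

    def knn_predict(X_train, y_train, x_query, k=5, classification=False):
        # Lazy learner: no work before prediction; scan the whole dataset per query.
        d = np.linalg.norm(X_train - x_query, axis=1)   # distances to every training point
        nn = np.argsort(d)[:k]                          # indices of the k nearest neighbors
        if classification:
            values, counts = np.unique(y_train[nn], return_counts=True)
            return values[np.argmax(counts)]            # majority class among neighbors
        return y_train[nn].mean()                       # average y among neighbors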

Example: Generate an M-dimensional dataset such that all attributes X_j, j = 1, ..., M, are uniformly distributed in the interval [0, 1]. What is the edge length r of a hypercube that contains 10% of all data points (i.e., 10% of the unit hypercube volume)? Answer: r = 0.1^{1/M}. For several dimensions M and several volume fractions:

M       r = 0.1^{1/M}   r = 0.01^{1/M}   r = 0.001^{1/M}
1       0.1             0.01             0.001
2       0.32            0.1              0.032
5       0.63            0.4              0.25
10      0.8             0.63             0.5
100     0.98            0.95             0.93
1000    0.998           0.995            0.993

In high dimensions, all neighboring points are far away and cannot be used to accurately estimate values with k-NN!

Linear Regression

Assumes a functional form

f(x, θ) = θ_0 + θ_1 x_1 + θ_2 x_2 + ... + θ_M x_M    (Eq. 1)

where x = (x_1, x_2, ..., x_M) are the attributes and θ = (θ_0, θ_1, ..., θ_M) are the function parameters. More generally,

f(x, θ) = θ_0 + θ_1 φ_1(x) + θ_2 φ_2(x) + ... + θ_M φ_M(x)

where the φ_j are the so-called basis functions.

Example: f(x, θ) = θ_0 + θ_1 x_1 + θ_2 x_2 + θ_3 x_1^4, where x = (x_1, x_2) are the attributes and θ = (θ_0, θ_1, θ_2, θ_3) are the function parameters. Note that the function f(x, θ) from the example is linear in the parameters. We can easily transform it into a function of the form (Eq. 1) by introducing new attributes x_0 = 1, x_1 = x_1, x_2 = x_2, and x_3 = x_1^4.

Linear regression is suitable for problems where the functional form f(x, θ) is known with sufficient certainty.

Learning goal: Find θ that minimizes MSE. MSE is a function of the parameters θ, so the problem of minimizing MSE can be solved by standard methods of unconstrained optimization.

Illustration of the regression: f(x, θ) = θ_0 + θ_1 x (a line fitted to the data points).
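To make the basis-function example concrete, here is a short sketch (the synthetic data and coefficient values are made up for illustration): the model θ_0 + θ_1 x_1 + θ_2 x_2 + θ_3 x_1^4 is fitted as an ordinary linear regression after expanding the attributes.

    import numpy as np

    rng = np.random.default_rng(0)
    x1, x2 = rng.uniform(-1, 1, 200), rng.uniform(-1, 1, 200)
    y = 2.0 + 3.0 * x1 - 1.0 * x2 + 0.5 * x1**4 + rng.normal(0, 0.1, 200)  # synthetic DGP

    Phi = np.column_stack([np.ones_like(x1), x1, x2, x1**4])  # basis functions as new attributes
    theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)           # still linear in the parameters
    print(theta)                                              # roughly [2.0, 3.0, -1.0, 0.5]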

Linear Regression

Linear regression can be represented by the functional form

f(x; θ) = θ_0 x_0 + θ_1 x_1 + ... + θ_M x_M = Σ_{j=0}^{M} θ_j x_j

Note: x_0 is a dummy attribute whose value is a constant equal to 1.

Linear regression can also be represented in graphic form, as a single unit with inputs x_0, x_1, ..., x_M, weights θ_0, θ_1, ..., θ_M, and output Σ_j θ_j x_j.

Goal: minimize the Mean Square Error (MSE):

MSE = (1/N) Σ_{i=1}^{N} (y_i − f(x_i; θ))^2

MSE is a quadratic function in the parameters θ. It is a convex function. There is only one minimum, and it is the global minimum.

Solution: A sufficient condition is ∂MSE/∂θ_j = 0 for j = 0, 1, ..., M. Therefore, find θ such that

∂MSE/∂θ_j = −(2/N) Σ_{i=1}^{N} (y_i − Σ_{k=0}^{M} θ_k x_ik) x_ij = 0,   j = 0, 1, ..., M.

There are M+1 linear equations with M+1 unknown variables, so we can get a closed-form solution. Special case: if some attribute is a linear combination of others, there is no unique solution.

In matrix form, the equations read X^T Y = X^T X θ, where
X [N × (M+1)] = {x_ij}, i = 1:N, j = 1:(M+1)   (x_ij is the j-th attribute of the i-th data point),
Y [N × 1] = {y_i}, i = 1:N,
θ [(M+1) × 1] = {θ_j}, j = 1:(M+1).

Note: D = [X Y], i.e., [X Y] is what we defined previously as the data set. The optimal parameter choice is then

θ = (X^T X)^{−1} X^T Y,

which is a closed-form solution. Note: the above solution exists if X^T X is invertible, i.e., if its rank equals M+1, i.e., no attribute is a linear combination of the others (in Matlab, use the function rank).

Note: using matrix derivatives we can do the optimization in a more elegant way by defining

MSE = (1/N) (Y − Xθ)^T (Y − Xθ)
∇_θ MSE = −(2/N) X^T (Y − Xθ) = 0   [(M+1) × 1]
θ = (X^T X)^{−1} X^T Y

Statistical results:

Assumption: the true data generating process (DGP) is y = Σ_{j=0}^{M} β_j x_j + e, where e is noise with E(e) = 0 and Var(e) = σ^2. Note: this is a big assumption!

Question: How close is the estimate θ to the true value β?

Answer 1: E[θ] = E[(X^T X)^{−1} X^T Y] = (X^T X)^{−1} X^T E[Y]   (remember, Y = Xβ + e, an [N × 1] vector)
E[θ] = (X^T X)^{−1} X^T X β + (X^T X)^{−1} X^T E[e]
E[θ] = β + 0 = β

Conclusion: if we repeat linear regression on different data sets sampled according to the true DGP, the average θ will equal β (i.e., E[θ] = β), which are the true parameters. Therefore, linear regression is an unbiased predictor.

Answer 2: The variance of the parameter estimate θ is Var[θ] = (after some calculation) = (X^T X)^{−1} σ^2.

Conclusion: Var[θ] is a measure of how different the estimate θ is from the true parameters β, i.e., how successful the linear regression is. Therefore, the quality of linear regression depends on the noise level (i.e., σ^2) and on the data size. The variance increases linearly with σ^2 and decreases as 1/N with the size of the dataset.

More stringent assumption: the true DGP is y = Σ_{j=0}^{M} β_j x_j + e, and e ~ N(0, σ^2) (i.e., e is Gaussian additive noise). If this assumption is valid, the estimate θ can be considered a multi-dimensional Gaussian variable with θ ~ N(β, (X^T X)^{−1} σ^2). Therefore, we could do some nice things such as test the hypothesis that β_j = 0 (i.e., that attribute j is not influencing the target y).
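A small numerical sketch of the closed-form solution and of the statistical claims above (the synthetic β, σ, and data sizes are arbitrary choices for illustration):

    import numpy as np

    rng = np.random.default_rng(1)
    N, M, sigma = 500, 3, 0.5
    beta = np.array([1.0, -2.0, 0.5, 3.0])                      # "true" parameters, incl. intercept
    X = np.column_stack([np.ones(N), rng.normal(size=(N, M))])  # x0 = 1 dummy attribute
    Y = X @ beta + rng.normal(0, sigma, N)                      # DGP: Y = X beta + e

    theta = np.linalg.solve(X.T @ X, X.T @ Y)  # theta = (X^T X)^{-1} X^T Y (solve beats explicit inverse)
    print(theta)                               # close to beta (unbiasedness, large N)
    print(sigma**2 * np.linalg.inv(X.T @ X))   # Var[theta] = (X^T X)^{-1} sigma^2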

Nonlinear Regression

Question: What if we know that f(x; θ) is a non-linear parametric function? For example, f(x; θ) = θ_0 + θ_1 x_1 x_2^{θ_2} is a function nonlinear in the parameters.

Solution: Minimize MSE = (1/N) Σ_{i=1}^{N} (y_i − f(x_i; θ))^2. Start from the necessary condition for a minimum:

∂MSE/∂θ_j = −(2/N) Σ_{i=1}^{N} (y_i − f(x_i; θ)) ∂f(x_i; θ)/∂θ_j = 0

Again, we have to solve a system of nonlinear equations, one per unknown parameter. But this time a closed-form solution is not easy to derive.

Math Background: Unconstrained Optimization

Problem: Given f(x), find its minimum. Popular solution: use the gradient descent algorithm. Idea: the gradient of f(x) at the minimum is the zero vector. So:

1. start from an initial guess x_0;
2. calculate the gradient ∇f(x_0);
3. move in the direction opposite to the gradient, i.e., generate a new guess x_1 as x_1 = x_0 − α ∇f(x_0), where α is a properly selected constant;
4. repeat this process until convergence to the minimum.
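A minimal gradient descent sketch following steps 1-4 above (the step size, tolerance, and example quadratic are illustrative assumptions):

    import numpy as np

    def gradient_descent(grad, x0, alpha=0.1, tol=1e-8, max_iter=10000):
        # x_{k+1} = x_k - alpha * grad(x_k), repeated until the update is negligible
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            step = alpha * grad(x)
            x = x - step
            if np.linalg.norm(step) < tol:
                break
        return x

    # Example: minimize f(x) = (x0 - 3)^2 + 2*(x1 + 1)^2; gradient is [2(x0-3), 4(x1+1)]
    print(gradient_descent(lambda x: np.array([2 * (x[0] - 3), 4 * (x[1] + 1)]), [0.0, 0.0]))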

Two problems with the gradient descent algorithm:

1. It accepts convergence to a local minimum. The simplest way to avoid a local minimum is to repeat the procedure starting from multiple initial guesses x_0.
2. Possibly slow convergence to a minimum. There are a number of algorithms providing faster convergence (e.g., conjugate gradient; second-order methods such as Newton or quasi-Newton; non-derivative methods).

Back to solving nonlinear regression using the gradient descent procedure:

Step 0: Start from an initial guess for the parameters, θ_0.
Step k: Update the parameters as θ_{k+1} = θ_k − α ∇MSE(θ_k).

Special case: For linear prediction the update step would be θ_{k+1} = θ_k + α X^T (Y − X θ_k).
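The linear special case above can be coded directly; this sketch (step size and iteration count are arbitrary) converges to the same θ as the closed-form solution when α is small enough:

    import numpy as np

    def linear_regression_gd(X, Y, alpha=1e-3, n_iter=5000):
        # theta_{k+1} = theta_k + alpha * X^T (Y - X theta_k);
        # alpha must be below 2 / (largest eigenvalue of X^T X) for convergence.
        theta = np.zeros(X.shape[1])
        for _ in range(n_iter):
            theta = theta + alpha * X.T @ (Y - X @ theta)
        return theta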

Logistic Regression

Classification by MSE Minimization

Remember: classification can be solved by MSE minimization methods (E[y | x] can be used to derive the posteriors P(y ∈ C_1 | x)).

Question: What functional form f(x; θ) is an appropriate choice for representing posterior class probabilities?

Option 1: What about the linear model f(x; θ) = Σ_{j=0}^{M} θ_j x_j? The range of this function goes beyond [0, 1], so it is not a good choice.

Option 2: We can use a sigmoid function to squeeze the output of a linear model into the range between 0 and 1: f(x; θ) = g(Σ_{j=0}^{M} θ_j x_j). If g(z) = e^z / (1 + e^z), optimizing f(x; θ) is called logistic regression.

Solution: Logistic regression can be solved by minimizing MSE. The derivative ∂MSE/∂θ_j is

∂MSE/∂θ_j = −(2/N) Σ_{i=1}^{N} (y_i − f(x_i; θ)) g'(Σ_{k=0}^{M} θ_k x_ik) x_ij

Note: Solving ∇_θ MSE = 0 results in (M+1) nonlinear equations with (M+1) unknowns; the optimization can be done using the gradient descent algorithm.
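A sketch of logistic regression fitted by MSE minimization with gradient descent, as described above (function names and step size are illustrative; X is assumed to already contain the dummy column x_0 = 1):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))        # g(z) = e^z / (1 + e^z)

    def logistic_mse_gd(X, y, alpha=0.5, n_iter=5000):
        # dMSE/dtheta_j = -(2/N) * sum_i (y_i - f_i) * g'(eta_i) * x_ij, with g' = f (1 - f)
        N = len(y)
        theta = np.zeros(X.shape[1])
        for _ in range(n_iter):
            f = sigmoid(X @ theta)
            grad = -(2.0 / N) * X.T @ ((y - f) * f * (1 - f))
            theta = theta - alpha * grad
        return theta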

Maximum Likelihood (ML) Algorithm

Basic idea: Given a data set D and a parametric model with parameters θ that describes the data generating process, the best solution θ* is the one that maximizes P(D | θ), i.e.,

θ* = arg max_θ P(D | θ)

P(D | θ) is called the likelihood, so the algorithm that finds the optimal solution θ* is called the maximum likelihood algorithm. This idea can be applied to both unsupervised and supervised learning problems.

ML for Unsupervised Learning: Density Estimation

Given D = {x_i, i = 1, ..., N}, and assuming the functional form p(x | θ) of the data generating process, the goal is to estimate the optimal parameters θ that maximize the likelihood P(D | θ):

P(D | θ) = P(x_1, x_2, ..., x_N | θ)

By assuming that the data points x_i are independent and identically distributed (iid),

P(D | θ) = Π_{i=1}^{N} p(x_i | θ)   (p is the probability density function.)

Since log(x) is a monotonically increasing function of x, maximization of P(D | θ) is equivalent to maximization of l = log(P(D | θ)). l is called the log-likelihood. So,

l = log Π_{i=1}^{N} p(x_i | θ) = Σ_{i=1}^{N} log p(x_i | θ)

Example: A data set D = {x_i, i = 1, ..., N} is drawn from a Gaussian distribution with mean μ and standard deviation σ, i.e., X ~ N(μ, σ^2). Therefore,

p(x | μ, σ^2) = (1/√(2πσ^2)) exp(−(x − μ)^2 / (2σ^2))

l = Σ_{i=1}^{N} [ −log √(2πσ^2) − (x_i − μ)^2 / (2σ^2) ]

The values μ and σ^2 that maximize the log-likelihood satisfy the necessary conditions for a local optimum, ∂l/∂μ = 0 and ∂l/∂σ^2 = 0, which give

μ̂ = (1/N) Σ_{i=1}^{N} x_i,   σ̂^2 = (1/N) Σ_{i=1}^{N} (x_i − μ̂)^2

ML for Supervised Learning

Given D = {(x_i, y_i), i = 1, ..., N}, and assuming the functional form p(y | x, θ) of the data generating process, the goal is to estimate the optimal parameters θ that maximize the likelihood P(D | θ):

P(D | θ) = P(y_1, y_2, ..., y_N | x_1, x_2, ..., x_N, θ) = (if the data is iid) = Π_{i=1}^{N} p(y_i | x_i, θ)
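Before moving on, a quick numerical check of the Gaussian density-estimation example above (the sample and its true parameters are synthetic, chosen only for illustration):

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.normal(loc=5.0, scale=2.0, size=10000)   # sample from N(mu=5, sigma^2=4)

    mu_hat = x.mean()                                # ML estimate of mu
    sigma2_hat = ((x - mu_hat) ** 2).mean()          # ML estimate of sigma^2 (divides by N, not N-1)
    print(mu_hat, sigma2_hat)                        # close to 5.0 and 4.0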

ML for Regression

Assume the data generating process corresponds to y = f(x, θ) + e, where e ~ N(0, σ^2). Note: this is a relatively strong assumption! Then

y ~ N(f(x, θ), σ^2)

p(y | x, θ) = (1/√(2πσ^2)) exp(−(y − f(x, θ))^2 / (2σ^2))

l = log P(D | θ) = Σ_{i=1}^{N} [ −log √(2πσ^2) − (y_i − f(x_i, θ))^2 / (2σ^2) ]

Since σ^2 is a constant, maximization of l is equivalent to minimization of Σ_{i=1}^{N} (y_i − f(x_i, θ))^2.

Important conclusion: Regression using ML under the assumption of a DGP with additive Gaussian noise is equivalent to regression using MSE minimization!

ML for Classification

There are two main approaches to classification involving ML: the Bayesian estimation approach, and logistic regression.

Bayesian Estimation

Idea: given a dataset D, decompose D into datasets {D_j}, j = 1, ..., k (k = number of classes), where ∪_j D_j = D and D_i ∩ D_j = ∅ for all i ≠ j. For each D_j, we can estimate the pdf p(x | y ∈ C_j) (the class-conditional density). These densities can be estimated using the ML methods described in Lecture 3, provided we make a (strong) assumption about the functional form of the density (e.g., Gaussian). We also note that this approach is useful theoretically and when the input dimension is low, but density estimation is generally not practical in high dimensions.

In order to obtain a classifier, we want to be able to estimate the probabilities p(y ∈ C_j | x) (the posterior class probabilities). A new input will be assigned to the class with the highest estimated posterior class probability. We can estimate these probabilities by applying Bayes' theorem:

Bayes' theorem: if A and B are events, then P(A | B) = P(B | A) P(A) / P(B).

So we see that

p(y ∈ C_j | x) = p(x | y ∈ C_j) p(y ∈ C_j) / p(x)

where p(y ∈ C_j) (the prior class probability) may be estimated without reference to the inputs as the relative frequency of C_j in the dataset D. For the purpose of classification, it is not necessary to compute p(x), since we are interested only in the relative sizes of the posterior probabilities. Finally, we may define the Bayesian classifier by

ŷ = arg max_{j = 1, ..., k} p(y ∈ C_j | x) = arg max_{j = 1, ..., k} p(x | y ∈ C_j) p(y ∈ C_j)

We reiterate that this method is only practically applicable in low dimensions, and requires strong assumptions about the functional form of the class distributions.

Logistic Regression

The assumptions involved in logistic regression are similar to those involved with linear regression, namely the existence of a linear relationship between the inputs and the output. In the case of logistic regression, this assumption takes a somewhat different form: we assume that the posterior class probabilities can be estimated as a linear function of the inputs passed through a sigmoidal function. Parameter estimates (the coefficients of the inputs) can then be calculated either by minimizing MSE, as in the previous section, or, as here, by maximizing the likelihood.

For simplicity, assume we are doing binary classification and that y ∈ {0, 1}. Then the logistic regression model is

μ_i = p(y_i ∈ C_1 | x_i) = e^{η_i} / (1 + e^{η_i}),   where η_i = Σ_{j=0}^{M} θ_j x_ij

The likelihood function of the data D is given by

p(D | Θ) = Π_{i=1}^{N} p(y_i | x_i, Θ) = Π_{i=1}^{N} μ_i^{y_i} (1 − μ_i)^{1 − y_i}

Note that the term μ_i^{y_i} (1 − μ_i)^{1 − y_i} reduces to the posterior probability of class C_1 when y_i = 1, and to the posterior probability of the other class otherwise, so this expression makes sense. In order to find the ML estimators of the parameters, we form the log-likelihood

l = log p(D | Θ) = Σ_{i=1}^{N} [ y_i log μ_i + (1 − y_i) log(1 − μ_i) ]

The ML estimators require us to solve ∇_Θ l = 0, which is a non-linear system of (M+1) equations in (M+1) unknowns, so we do not expect a closed-form solution. Hence we would, for instance, apply gradient descent to the negative log-likelihood (equivalently, gradient ascent on l) to get the parameter estimates for the classifier.
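A sketch of the ML fit for logistic regression using the gradient of the log-likelihood (the gradient with respect to Θ works out to X^T (y − μ); the step size, iteration count, and per-example normalization are arbitrary choices, and X is assumed to contain the dummy column x_0 = 1):

    import numpy as np

    def logistic_ml_fit(X, y, alpha=0.1, n_iter=5000):
        # Gradient ascent on l = sum_i [ y_i log mu_i + (1 - y_i) log(1 - mu_i) ];
        # the gradient with respect to theta is X^T (y - mu).
        theta = np.zeros(X.shape[1])
        for _ in range(n_iter):
            mu = 1.0 / (1.0 + np.exp(-(X @ theta)))   # mu_i = e^eta_i / (1 + e^eta_i)
            theta = theta + alpha * X.T @ (y - mu) / len(y)
        return theta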