Classification as a Regression Problem

Target variable y ∈ {C_1, C_2, C_3, ..., C_K}.

To treat classification as a regression problem we should transform the target y into numerical values. The choice of numerical class representation is quite arbitrary; careful numerical class representation is a critical step.

Binary Classification

Let us represent the classes C_1 and C_2 with numerical values 1 and 0, i.e., if y ∈ C_1 then y = 1, and if y ∈ C_2 then y = 0. Since we have assigned numeric values to the classes, binary classification can be considered to be a regression problem. The optimal predictor for regression is

f*(x) = E[y | x] = 1 · P(y ∈ C_1 | x) + 0 · P(y ∈ C_2 | x) = P(y ∈ C_1 | x)

Therefore, the optimal predictor outputs the posterior probability of class C_1. By applying the Bayes classification rule we can obtain the Bayes classifier!

Important conclusion: With appropriate class representation, the optimal classification is equivalent to optimal regression.

Multi-class Classification (K classes)

Could the previous result be generalized to multi-class classification?

Example. 3 classes: "white", "black", and "blue". Let us examine the representation y = -1 (class is "white"), y = 0 (class is "black"), and y = 1 (class is "blue").

Discussion: This representation is inappropriate since it enforces an order; it implies that "white" and "blue" are further apart than, say, "black" and "blue". What if it is evident that an example could be either "white" or "blue", but definitely not "black"? The proposed representation would probably lead to a completely misleading answer: "black"!

Solution: Decompose the multi-class problem into K binary classification problems:

Problem j (j = 1, ..., K): if y ∈ C_j then y = 1; if y ∉ C_j then y = 0.

The regression function E[y | x] on Problem j will equal the posterior P(y ∈ C_j | x). By repeating this for all K classes, all posteriors will be available, which is sufficient to construct the Bayes classifier!
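As an illustration of the decomposition above, here is a minimal Python sketch (the function name and the use of ordinary least squares are my own choices, not part of the lecture): each class gets its own 0/1 regression problem, the fitted regression output approximates the posterior P(y ∈ C_j | x), and picking the class with the largest estimated posterior mimics the Bayes classification rule.

    import numpy as np

    def one_vs_rest_posteriors(X, y, X_new):
        # One regression problem per class: targets t_i = 1 if y_i is class k, else 0.
        # Each fitted linear regression approximates the posterior P(y = k | x).
        Xb = np.column_stack([np.ones(len(X)), X])          # add dummy attribute x0 = 1
        Xn = np.column_stack([np.ones(len(X_new)), X_new])
        classes = np.unique(y)
        posteriors = []
        for k in classes:
            t = (y == k).astype(float)                      # 0/1 class representation
            theta, *_ = np.linalg.lstsq(Xb, t, rcond=None)  # fit by least squares
            posteriors.append(Xn @ theta)
        P = np.column_stack(posteriors)
        return classes[np.argmax(P, axis=1)], P             # Bayes-style decision + scores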

Approaches to Minimizing MSE from a Finite Dataset

Goal: Given a dataset D, find a mapping f that minimizes MSE. Two extreme approaches:

1. Nearest neighbor algorithm (non-parametric approach)
2. Linear regression (parametric approach)

The non-parametric approach assumes that data points close in the attribute space are similar to each other. It does not assume any functional form. The parametric approach assumes a functional form: e.g., the output is a linear function of the inputs.

Nearest Neighbor Algorithms

k-nearest neighbor (k-NN): f(x) = (1/k) Σ_{x_i ∈ N_k(x)} y_i, where N_k(x) are the k nearest neighbors of x.

Regression: the prediction is the average y among the k nearest neighbors.
Classification: the prediction is the majority class among the k nearest neighbors.

k-NN is also known as a lazy learning algorithm because it does not perform any calculations prior to seeing a data point; it has to analyze the whole dataset for the nearest neighbors every time a new data point appears. Note: parametric learning algorithms learn a function from the dataset; they are much faster in giving predictions but need to spend some time on training beforehand.

Theorem: If N → ∞, k → ∞, and k/N → 0 (a large number of neighbors in a very tight neighborhood), k-NN is an optimal predictor. Practically, if we have only a few dimensions (attributes), k-NN is a good try; if we have more attributes, we may run out of data points (in practice, data size is always limited).
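A minimal k-NN sketch in Python, matching the prediction rule above (the function name and the choice of Euclidean distance are illustrative assumptions):

    import numpy as np

    def knn_predict(X_train, y_train, x_query, k=5, classification=False):
        # Lazy learner: no work before prediction; scan the whole dataset per query.
        d = np.linalg.norm(X_train - x_query, axis=1)   # distances to every training point
        nn = np.argsort(d)[:k]                          # indices of the k nearest neighbors
        if classification:
            values, counts = np.unique(y_train[nn], return_counts=True)
            return values[np.argmax(counts)]            # majority class among neighbors
        return y_train[nn].mean()                       # average y among neighbors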

Example: Generate an M-dimensional dataset such that all attributes X_j, j = 1, ..., M, are uniformly distributed in the interval [0, 1]. What is the edge length r of a hypercube that contains 10% of all data points (i.e., 10% of the unit hypercube volume)? Answer: r = 0.1^{1/M}. For several dimensions M and several volume fractions:

M       r = 0.1^{1/M}   r = 0.01^{1/M}   r = 0.001^{1/M}
1       0.1             0.01             0.001
2       0.32            0.1              0.032
5       0.63            0.4              0.25
10      0.8             0.63             0.5
100     0.98            0.95             0.93
1000    0.998           0.995            0.993

In high dimensions, all neighboring points are far away and cannot be used to accurately estimate values with k-NN!

Linear Regression

Assumes a functional form

f(x, θ) = θ_0 + θ_1 x_1 + θ_2 x_2 + ... + θ_M x_M    (Eq. 1)

where x = (x_1, x_2, ..., x_M) are the attributes and θ = (θ_0, θ_1, ..., θ_M) are the function parameters. More generally,

f(x, θ) = θ_0 + θ_1 φ_1(x) + θ_2 φ_2(x) + ... + θ_M φ_M(x)

where the φ_j are the so-called basis functions.

Example: f(x, θ) = θ_0 + θ_1 x_1 + θ_2 x_2 + θ_3 x_1^4, where x = (x_1, x_2) are the attributes and θ = (θ_0, θ_1, θ_2, θ_3) are the function parameters. Note that the function f(x, θ) from the example is linear in the parameters. We can easily transform it into a function of the form (Eq. 1) by introducing new attributes x_0 = 1, x_1 = x_1, x_2 = x_2, and x_3 = x_1^4.

Linear regression is suitable for problems where the functional form f(x, θ) is known with sufficient certainty.

Learning goal: Find θ that minimizes MSE. MSE is a function of the parameters θ, so the problem of minimizing MSE can be solved by standard methods of unconstrained optimization.

Illustration of the regression: f(x, θ) = θ_0 + θ_1 x (a line fitted to the data points).
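To make the basis-function example concrete, here is a short sketch (the synthetic data and coefficient values are made up for illustration): the model θ_0 + θ_1 x_1 + θ_2 x_2 + θ_3 x_1^4 is fitted as an ordinary linear regression after expanding the attributes.

    import numpy as np

    rng = np.random.default_rng(0)
    x1, x2 = rng.uniform(-1, 1, 200), rng.uniform(-1, 1, 200)
    y = 2.0 + 3.0 * x1 - 1.0 * x2 + 0.5 * x1**4 + rng.normal(0, 0.1, 200)  # synthetic DGP

    Phi = np.column_stack([np.ones_like(x1), x1, x2, x1**4])  # basis functions as new attributes
    theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)           # still linear in the parameters
    print(theta)                                              # roughly [2.0, 3.0, -1.0, 0.5]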

Linear Regression

Linear regression can be represented by the functional form

f(x; θ) = θ_0 x_0 + θ_1 x_1 + ... + θ_M x_M = Σ_{j=0}^{M} θ_j x_j

Note: x_0 is a dummy attribute whose value is a constant equal to 1.

Linear regression can also be represented in graphic form, as a single unit with inputs x_0, x_1, ..., x_M, weights θ_0, θ_1, ..., θ_M, and output Σ_j θ_j x_j.

Goal: minimize the Mean Square Error (MSE):

MSE = (1/N) Σ_{i=1}^{N} (y_i − f(x_i; θ))^2

MSE is a quadratic function in the parameters θ. It is a convex function. There is only one minimum, and it is the global minimum.

Solution: A sufficient condition is ∂MSE/∂θ_j = 0 for j = 0, 1, ..., M. Therefore, find θ such that

∂MSE/∂θ_j = −(2/N) Σ_{i=1}^{N} (y_i − Σ_{k=0}^{M} θ_k x_ik) x_ij = 0,   j = 0, 1, ..., M.

There are M+1 linear equations with M+1 unknown variables, so we can get a closed-form solution. Special case: if some attribute is a linear combination of others, there is no unique solution.

In matrix form, the equations read X^T Y = X^T X θ, where
X [N × (M+1)] = {x_ij}, i = 1:N, j = 1:(M+1)   (x_ij is the j-th attribute of the i-th data point),
Y [N × 1] = {y_i}, i = 1:N,
θ [(M+1) × 1] = {θ_j}, j = 1:(M+1).

Note: D = [X Y], i.e., [X Y] is what we defined previously as the data set. The optimal parameter choice is then

θ = (X^T X)^{−1} X^T Y,

which is a closed-form solution. Note: the above solution exists if X^T X is invertible, i.e., if its rank equals M+1, i.e., no attribute is a linear combination of the others (in Matlab, use the function rank).

Note: using matrix derivatives we can do the optimization in a more elegant way by defining

MSE = (1/N) (Y − Xθ)^T (Y − Xθ)
∇_θ MSE = −(2/N) X^T (Y − Xθ) = 0   [(M+1) × 1]
θ = (X^T X)^{−1} X^T Y

Statistical results:

Assumption: the true data generating process (DGP) is y = Σ_{j=0}^{M} β_j x_j + e, where e is noise with E(e) = 0 and Var(e) = σ^2. Note: this is a big assumption!

Question: How close is the estimate θ to the true value β?

Answer 1: E[θ] = E[(X^T X)^{−1} X^T Y] = (X^T X)^{−1} X^T E[Y]   (remember, Y = Xβ + e, an [N × 1] vector)
E[θ] = (X^T X)^{−1} X^T X β + (X^T X)^{−1} X^T E[e]
E[θ] = β + 0 = β

Conclusion: if we repeat linear regression on different data sets sampled according to the true DGP, the average θ will equal β (i.e., E[θ] = β), which are the true parameters. Therefore, linear regression is an unbiased predictor.

Answer 2: The variance of the parameter estimate θ is Var[θ] = (after some calculation) = (X^T X)^{−1} σ^2.

Conclusion: Var[θ] is a measure of how different the estimate θ is from the true parameters β, i.e., how successful the linear regression is. Therefore, the quality of linear regression depends on the noise level (i.e., σ^2) and on the data size. The variance increases linearly with σ^2 and decreases as 1/N with the size of the dataset.

More stringent assumption: the true DGP is y = Σ_{j=0}^{M} β_j x_j + e, and e ~ N(0, σ^2) (i.e., e is Gaussian additive noise). If this assumption is valid, the estimate θ can be considered a multi-dimensional Gaussian variable with θ ~ N(β, (X^T X)^{−1} σ^2). Therefore, we could do some nice things such as test the hypothesis that β_j = 0 (i.e., that attribute j is not influencing the target y).
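A small numerical sketch of the closed-form solution and of the statistical claims above (the synthetic β, σ, and data sizes are arbitrary choices for illustration):

    import numpy as np

    rng = np.random.default_rng(1)
    N, M, sigma = 500, 3, 0.5
    beta = np.array([1.0, -2.0, 0.5, 3.0])                      # "true" parameters, incl. intercept
    X = np.column_stack([np.ones(N), rng.normal(size=(N, M))])  # x0 = 1 dummy attribute
    Y = X @ beta + rng.normal(0, sigma, N)                      # DGP: Y = X beta + e

    theta = np.linalg.solve(X.T @ X, X.T @ Y)  # theta = (X^T X)^{-1} X^T Y (solve beats explicit inverse)
    print(theta)                               # close to beta (unbiasedness, large N)
    print(sigma**2 * np.linalg.inv(X.T @ X))   # Var[theta] = (X^T X)^{-1} sigma^2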

Nonlinear Regression

Question: What if we know that f(x; θ) is a non-linear parametric function? For example, f(x; θ) = θ_0 + θ_1 x_1 x_2^{θ_2} is a function nonlinear in the parameters.

Solution: Minimize MSE = (1/N) Σ_{i=1}^{N} (y_i − f(x_i; θ))^2. Start from the necessary condition for a minimum:

∂MSE/∂θ_j = −(2/N) Σ_{i=1}^{N} (y_i − f(x_i; θ)) ∂f(x_i; θ)/∂θ_j = 0

Again, we have to solve a system of nonlinear equations, one per unknown parameter. But this time a closed-form solution is not easy to derive.

Math Background: Unconstrained Optimization

Problem: Given f(x), find its minimum. Popular solution: use the gradient descent algorithm. Idea: the gradient of f(x) at the minimum is the zero vector. So:

1. start from an initial guess x_0;
2. calculate the gradient ∇f(x_0);
3. move in the direction opposite to the gradient, i.e., generate a new guess x_1 as x_1 = x_0 − α ∇f(x_0), where α is a properly selected constant;
4. repeat this process until convergence to the minimum.
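A minimal gradient descent sketch following steps 1-4 above (the step size, tolerance, and example quadratic are illustrative assumptions):

    import numpy as np

    def gradient_descent(grad, x0, alpha=0.1, tol=1e-8, max_iter=10000):
        # x_{k+1} = x_k - alpha * grad(x_k), repeated until the update is negligible
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            step = alpha * grad(x)
            x = x - step
            if np.linalg.norm(step) < tol:
                break
        return x

    # Example: minimize f(x) = (x0 - 3)^2 + 2*(x1 + 1)^2; gradient is [2(x0-3), 4(x1+1)]
    print(gradient_descent(lambda x: np.array([2 * (x[0] - 3), 4 * (x[1] + 1)]), [0.0, 0.0]))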

Two problems with the gradient descent algorithm:

1. It accepts convergence to a local minimum. The simplest way to avoid a local minimum is to repeat the procedure starting from multiple initial guesses x_0.
2. Possibly slow convergence to a minimum. There are a number of algorithms providing faster convergence (e.g., conjugate gradient; second-order methods such as Newton or quasi-Newton; non-derivative methods).

Back to solving nonlinear regression using the gradient descent procedure:

Step 0: Start from an initial guess for the parameters, θ_0.
Step k: Update the parameters as θ_{k+1} = θ_k − α ∇MSE(θ_k).

Special case: For linear prediction the update step would be θ_{k+1} = θ_k + α X^T (Y − X θ_k).
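The linear special case above can be coded directly; this sketch (step size and iteration count are arbitrary) converges to the same θ as the closed-form solution when α is small enough:

    import numpy as np

    def linear_regression_gd(X, Y, alpha=1e-3, n_iter=5000):
        # theta_{k+1} = theta_k + alpha * X^T (Y - X theta_k);
        # alpha must be below 2 / (largest eigenvalue of X^T X) for convergence.
        theta = np.zeros(X.shape[1])
        for _ in range(n_iter):
            theta = theta + alpha * X.T @ (Y - X @ theta)
        return theta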

Logistic Regression

Classification by MSE Minimization

Remember: classification can be solved by MSE minimization methods (E[y | x] can be used to derive the posteriors P(y ∈ C_1 | x)).

Question: What functional form f(x; θ) is an appropriate choice for representing posterior class probabilities?

Option 1: What about the linear model f(x; θ) = Σ_{j=0}^{M} θ_j x_j? The range of this function goes beyond [0, 1], so it is not a good choice.

Option 2: We can use a sigmoid function to squeeze the output of a linear model into the range between 0 and 1: f(x; θ) = g(Σ_{j=0}^{M} θ_j x_j). If g(z) = e^z / (1 + e^z), optimizing f(x; θ) is called logistic regression.

Solution: Logistic regression can be solved by minimizing MSE. The derivative ∂MSE/∂θ_j is

∂MSE/∂θ_j = −(2/N) Σ_{i=1}^{N} (y_i − f(x_i; θ)) g'(Σ_{k=0}^{M} θ_k x_ik) x_ij

Note: Solving ∇_θ MSE = 0 results in (M+1) nonlinear equations with (M+1) unknowns; the optimization can be done using the gradient descent algorithm.
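A sketch of logistic regression fitted by MSE minimization with gradient descent, as described above (function names and step size are illustrative; X is assumed to already contain the dummy column x_0 = 1):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))        # g(z) = e^z / (1 + e^z)

    def logistic_mse_gd(X, y, alpha=0.5, n_iter=5000):
        # dMSE/dtheta_j = -(2/N) * sum_i (y_i - f_i) * g'(eta_i) * x_ij, with g' = f (1 - f)
        N = len(y)
        theta = np.zeros(X.shape[1])
        for _ in range(n_iter):
            f = sigmoid(X @ theta)
            grad = -(2.0 / N) * X.T @ ((y - f) * f * (1 - f))
            theta = theta - alpha * grad
        return theta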

Maximum Likelihood (ML) Algorithm

Basic idea: Given a data set D and a parametric model with parameters θ that describes the data generating process, the best solution θ* is the one that maximizes P(D | θ), i.e.,

θ* = arg max_θ P(D | θ)

P(D | θ) is called the likelihood, so the algorithm that finds the optimal solution θ* is called the maximum likelihood algorithm. This idea can be applied to both unsupervised and supervised learning problems.

ML for Unsupervised Learning: Density Estimation

Given D = {x_i, i = 1, ..., N}, and assuming the functional form p(x | θ) of the data generating process, the goal is to estimate the optimal parameters θ that maximize the likelihood P(D | θ):

P(D | θ) = P(x_1, x_2, ..., x_N | θ)

By assuming that the data points x_i are independent and identically distributed (iid),

P(D | θ) = Π_{i=1}^{N} p(x_i | θ)   (p is the probability density function.)

Since log(x) is a monotonically increasing function of x, maximization of P(D | θ) is equivalent to maximization of l = log(P(D | θ)). l is called the log-likelihood. So,

l = log Π_{i=1}^{N} p(x_i | θ) = Σ_{i=1}^{N} log p(x_i | θ)

Example: A data set D = {x_i, i = 1, ..., N} is drawn from a Gaussian distribution with mean μ and standard deviation σ, i.e., X ~ N(μ, σ^2). Therefore,

p(x | μ, σ^2) = (1/√(2πσ^2)) exp(−(x − μ)^2 / (2σ^2))

l = Σ_{i=1}^{N} [ −log √(2πσ^2) − (x_i − μ)^2 / (2σ^2) ]

The values μ and σ^2 that maximize the log-likelihood satisfy the necessary conditions for a local optimum, ∂l/∂μ = 0 and ∂l/∂σ^2 = 0, which give

μ̂ = (1/N) Σ_{i=1}^{N} x_i,   σ̂^2 = (1/N) Σ_{i=1}^{N} (x_i − μ̂)^2

ML for Supervised Learning

Given D = {(x_i, y_i), i = 1, ..., N}, and assuming the functional form p(y | x, θ) of the data generating process, the goal is to estimate the optimal parameters θ that maximize the likelihood P(D | θ):

P(D | θ) = P(y_1, y_2, ..., y_N | x_1, x_2, ..., x_N, θ) = (if the data is iid) = Π_{i=1}^{N} p(y_i | x_i, θ)
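Before moving on, a quick numerical check of the Gaussian density-estimation example above (the sample and its true parameters are synthetic, chosen only for illustration):

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.normal(loc=5.0, scale=2.0, size=10000)   # sample from N(mu=5, sigma^2=4)

    mu_hat = x.mean()                                # ML estimate of mu
    sigma2_hat = ((x - mu_hat) ** 2).mean()          # ML estimate of sigma^2 (divides by N, not N-1)
    print(mu_hat, sigma2_hat)                        # close to 5.0 and 4.0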

ML for Regression

Assume the data generating process corresponds to y = f(x, θ) + e, where e ~ N(0, σ^2). Note: this is a relatively strong assumption! Then

y ~ N(f(x, θ), σ^2)

p(y | x, θ) = (1/√(2πσ^2)) exp(−(y − f(x, θ))^2 / (2σ^2))

l = log P(D | θ) = Σ_{i=1}^{N} [ −log √(2πσ^2) − (y_i − f(x_i, θ))^2 / (2σ^2) ]

Since σ^2 is a constant, maximization of l is equivalent to minimization of Σ_{i=1}^{N} (y_i − f(x_i, θ))^2.

Important conclusion: Regression using ML under the assumption of a DGP with additive Gaussian noise is equivalent to regression using MSE minimization!

ML for Classification

There are two main approaches to classification involving ML: the Bayesian estimation approach, and logistic regression.

Bayesian Estimation

Idea: given a dataset D, decompose D into datasets {D_j}, j = 1, ..., k (k = number of classes), where ∪_j D_j = D and D_i ∩ D_j = ∅ for all i ≠ j. For each D_j, we can estimate the pdf p(x | y ∈ C_j) (the class-conditional density). These densities can be estimated using the ML methods described in Lecture 3, provided we make a (strong) assumption about the functional form of the density (e.g., Gaussian). We also note that this approach is useful theoretically and when the input dimension is low, but density estimation is generally not practical in high dimensions.

In order to obtain a classifier, we want to be able to estimate the probabilities p(y ∈ C_j | x) (the posterior class probabilities). A new input will be assigned to the class with the highest estimated posterior class probability. We can estimate these probabilities by applying Bayes' theorem:

Bayes' theorem: if A and B are events, then P(A | B) = P(B | A) P(A) / P(B).

So we see that

p(y ∈ C_j | x) = p(x | y ∈ C_j) p(y ∈ C_j) / p(x)

where p(y ∈ C_j) (the prior class probability) may be estimated without reference to the inputs as the relative frequency of C_j in the dataset D. For the purpose of classification, it is not necessary to compute p(x), since we are interested only in the relative sizes of the posterior probabilities. Finally, we may define the Bayesian classifier by

ŷ = arg max_{j = 1, ..., k} p(y ∈ C_j | x) = arg max_{j = 1, ..., k} p(x | y ∈ C_j) p(y ∈ C_j)

We reiterate that this method is only practically applicable in low dimensions, and requires strong assumptions about the functional form of the class distributions.

Logistic Regression

The assumptions involved in logistic regression are similar to those involved with linear regression, namely the existence of a linear relationship between the inputs and the output. In the case of logistic regression, this assumption takes a somewhat different form: we assume that the posterior class probabilities can be estimated as a linear function of the inputs passed through a sigmoidal function. Parameter estimates (the coefficients of the inputs) can then be calculated either by minimizing MSE, as in the previous section, or, as here, by maximizing the likelihood.

For simplicity, assume we are doing binary classification and that y ∈ {0, 1}. Then the logistic regression model is

μ_i = p(y_i ∈ C_1 | x_i) = e^{η_i} / (1 + e^{η_i}),   where η_i = Σ_{j=0}^{M} θ_j x_ij

The likelihood function of the data D is given by

p(D | Θ) = Π_{i=1}^{N} p(y_i | x_i, Θ) = Π_{i=1}^{N} μ_i^{y_i} (1 − μ_i)^{1 − y_i}

Note that the term μ_i^{y_i} (1 − μ_i)^{1 − y_i} reduces to the posterior probability of class C_1 when y_i = 1, and to the posterior probability of the other class otherwise, so this expression makes sense. In order to find the ML estimators of the parameters, we form the log-likelihood

l = log p(D | Θ) = Σ_{i=1}^{N} [ y_i log μ_i + (1 − y_i) log(1 − μ_i) ]

The ML estimators require us to solve ∇_Θ l = 0, which is a non-linear system of (M+1) equations in (M+1) unknowns, so we do not expect a closed-form solution. Hence we would, for instance, apply gradient descent to the negative log-likelihood (equivalently, gradient ascent on l) to get the parameter estimates for the classifier.
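A sketch of the ML fit for logistic regression using the gradient of the log-likelihood (the gradient with respect to Θ works out to X^T (y − μ); the step size, iteration count, and per-example normalization are arbitrary choices, and X is assumed to contain the dummy column x_0 = 1):

    import numpy as np

    def logistic_ml_fit(X, y, alpha=0.1, n_iter=5000):
        # Gradient ascent on l = sum_i [ y_i log mu_i + (1 - y_i) log(1 - mu_i) ];
        # the gradient with respect to theta is X^T (y - mu).
        theta = np.zeros(X.shape[1])
        for _ in range(n_iter):
            mu = 1.0 / (1.0 + np.exp(-(X @ theta)))   # mu_i = e^eta_i / (1 + e^eta_i)
            theta = theta + alpha * X.T @ (y - mu) / len(y)
        return theta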