CIS526: Machine Learning, Lecture 3 (Sept 6, 2003)
Preparation help: Xiaoying Huang

Linear Regression

Linear regression can be represented by a functional form:
    f(x; θ) = θ_0 x_0 + θ_1 x_1 + ... + θ_M x_M = \sum_{j=0}^{M} θ_j x_j
Note: x_0 is a dummy attribute and its value is a constant equal to 1.

Linear regression can also be represented in graphical form:
[Figure: a single linear unit with inputs x_0, x_1, ..., x_M, weights θ_0, θ_1, ..., θ_M, and a summation node producing the output.]

Goal: Minimize the Mean Squared Error (MSE):
    MSE = (1/N) \sum_{i=1}^{N} (y_i - f(x_i; θ))^2
MSE is a quadratic function in the parameters θ.
It is a convex function.
There is only one minimum, and it is the global minimum.

MSE Solution: A sufficient condition is ∂MSE/∂θ_j = 0, j = 0, 1, ..., M. Therefore, find θ such that
    ∂MSE/∂θ_j = -(2/N) \sum_{i=1}^{N} (y_i - \sum_{k=0}^{M} θ_k x_{ik}) x_{ij} = 0.
There are M+1 linear equations with M+1 unknown variables, so we can get a closed-form solution.
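To make the functional form and the MSE concrete, here is a minimal numpy sketch (not part of the original notes): it evaluates f(x; θ) = \sum_j θ_j x_j with the dummy attribute x_0 = 1 and computes the MSE on a small made-up data set. All arrays and parameter values are illustrative.

```python
import numpy as np

def predict(X, theta):
    """Linear predictions f(x; theta); X already contains the dummy column of ones."""
    return X @ theta

def mse(X, Y, theta):
    """MSE = (1/N) * sum_i (y_i - f(x_i; theta))^2."""
    residuals = Y - predict(X, theta)
    return np.mean(residuals ** 2)

# Toy data: N = 4 points, M = 2 real attributes plus the dummy x_0 = 1.
X = np.array([[1.0, 0.5, 2.0],
              [1.0, 1.5, 0.3],
              [1.0, 2.2, 1.1],
              [1.0, 0.1, 0.9]])
Y = np.array([2.1, 1.0, 2.9, 1.4])
theta = np.array([0.2, 0.8, 0.5])   # an arbitrary parameter guess

print(mse(X, Y, theta))
```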
Special Case: If some attribute is a linear combination of others, there is no unique solution.

Setting ∂MSE/∂θ_j = \sum_{i=1}^{N} (y_i - \sum_{k=0}^{M} θ_k x_{ik}) x_{ij} = 0 for all j gives, in matrix form, X^T Y = X^T X θ, where:
    X [N × (M+1)] = {x_{ij}}, i = 1:N, j = 1:(M+1)   (x_{ij} is the j-th attribute of the i-th data point),
    Y [N × 1] = {y_i}, i = 1:N,
    θ [(M+1) × 1] = {θ_j}, j = 1:(M+1).
Note: D = [X Y], i.e., [X Y] is what we defined previously as the data set.

The optimal parameter choice is then:
    θ = (X^T X)^{-1} X^T Y,
which is a closed-form solution.
Note: the above solution exists if X^T X is invertible, i.e. if its rank equals M+1, i.e. no attribute is a linear combination of others (in Matlab, use function rank).
Note: using matrix derivatives we can do the optimization in a more elegant way by defining
    MSE = (1/N) (Y - Xθ)^T (Y - Xθ),
    ∇_θ MSE = -(2/N) X^T (Y - Xθ) = 0   [an (M+1) × 1 vector of equations],
    θ = (X^T X)^{-1} X^T Y.

Statistical results:
Assumption: the true data generating process (DGP) is y = \sum_{j=0}^{M} β_j x_j + e, where e is noise with E[e] = 0 and Var(e) = σ^2.
Note: This is a big assumption!
Question: How close is the estimate θ to the true value β?
Answer 1:
    E[θ] = E[(X^T X)^{-1} X^T Y] = (X^T X)^{-1} X^T E[Y]   (remember, Y = Xβ + e, an [N × 1] vector)
    E[θ] = (X^T X)^{-1} X^T X β + (X^T X)^{-1} X^T E[e]
    E[θ] = β + 0 = β
Conclusion: if we repeat linear regression on different data sets sampled according to the true DGP, the average θ will equal β (i.e., E[θ] = β), the true parameters. Therefore, linear regression is an unbiased estimator.
Answer 2: The variance of the parameter estimate θ is
    Var[θ] = (after some calculation) = (X^T X)^{-1} σ^2.
Conclusion: Var[θ] is a measure of how different the estimate θ is from the true parameters β, i.e. how successful the linear regression is. Therefore, the quality of linear regression depends on the noise level (i.e. σ^2) and on the data size N. The variance increases linearly with σ^2 and decreases as 1/N with the size of the dataset.
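A small numpy sketch of the closed-form solution and of the statistical results above, assuming X already contains the dummy column of ones: refitting on many data sets drawn from the same DGP, the average estimate approaches β (unbiasedness) and its covariance approaches (X^T X)^{-1} σ^2. The design matrix, β, σ, and the number of repetitions are made up for illustration; numpy's matrix_rank plays the role of Matlab's rank.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, sigma = 100, 2, 0.3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, M))])   # N x (M+1), with x_0 = 1
beta = np.array([1.0, 2.0, -0.5])                           # "true" parameters

# The closed form exists only if X^T X is invertible, i.e. rank(X) = M + 1.
assert np.linalg.matrix_rank(X) == M + 1

estimates = []
for _ in range(2000):                                       # repeated data sets from the DGP
    Y = X @ beta + rng.normal(scale=sigma, size=N)          # y = X beta + e
    theta = np.linalg.solve(X.T @ X, X.T @ Y)               # theta = (X^T X)^-1 X^T Y
    estimates.append(theta)
estimates = np.array(estimates)

print(estimates.mean(axis=0))                # close to beta       (E[theta] = beta)
print(np.cov(estimates.T))                   # close to (X^T X)^-1 sigma^2
print(np.linalg.inv(X.T @ X) * sigma ** 2)
```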
More stringent assumption: the true DGP is y = \sum_{j=0}^{M} β_j x_j + e, and e ~ N(0, σ^2) (i.e., e is Gaussian additive noise).
If the assumption is valid, the estimate θ can be considered a multidimensional Gaussian variable with θ ~ N(β, (X^T X)^{-1} σ^2). Therefore, we could do some nice things such as test the hypothesis that β_j = 0 (i.e. that attribute x_j is not influencing the target y).

Nonlinear Regression

Question: What if we know that f(x; θ) is a nonlinear parametric function? For example, f(x; θ) = θ_0 + θ_1 x^{θ_2} is a function nonlinear in the parameters.
Solution: Minimize
    MSE = (1/N) \sum_{i=1}^{N} (y_i - f(x_i; θ))^2.
Start from the necessary condition for a minimum:
    ∂MSE/∂θ_j = -(2/N) \sum_{i=1}^{N} (y_i - f(x_i; θ)) ∂f(x_i; θ)/∂θ_j = 0.
Again, we have to solve M+1 nonlinear equations with M+1 unknowns. But this time a closed-form solution is not easy to derive.

Math Background: Unconstrained Optimization

Problem: Given f(x), find its minimum.
Popular Solution: Use the gradient descent algorithm.
Idea: The gradient of f(x) at the minimum is the zero vector. So, as sketched below:
1. start from an initial guess x_0;
2. calculate the gradient ∇f(x_0);
3. move in the direction opposite to the gradient, i.e., generate a new guess as x_1 = x_0 - α ∇f(x_0), where α is a properly selected constant;
4. repeat this process until convergence to the minimum.
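A minimal gradient descent sketch following the four steps above. The objective f, its gradient, the step size α, and the stopping rule are illustrative choices, not prescribed by the notes.

```python
import numpy as np

def f(x):
    """A simple convex objective used only for illustration."""
    return (x[0] - 1.0) ** 2 + 2.0 * (x[1] + 0.5) ** 2

def grad_f(x):
    """Gradient of f."""
    return np.array([2.0 * (x[0] - 1.0), 4.0 * (x[1] + 0.5)])

x = np.array([5.0, 5.0])       # step 1: initial guess x_0
alpha = 0.1                    # a properly selected constant step size
for _ in range(1000):          # step 4: repeat until convergence
    g = grad_f(x)              # step 2: gradient at the current guess
    if np.linalg.norm(g) < 1e-8:
        break
    x = x - alpha * g          # step 3: move opposite to the gradient
print(x)                       # approaches the minimizer (1.0, -0.5)
```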
Two problems with the gradient descent algorithm:
1. It accepts convergence to a local minimum. The simplest way to avoid a local minimum is to repeat the procedure starting from multiple initial guesses x_0.
2. Possibly slow convergence to a minimum. There are a number of algorithms providing faster convergence (e.g. conjugate gradient; second-order methods such as Newton or quasi-Newton; non-derivative methods).

Back to solving nonlinear regression using the gradient descent procedure:
Step 1: Start from an initial guess for the parameters, θ^0.
Step k: Update the parameters as θ^{k+1} = θ^k - α ∇_θ MSE(θ^k).
Special Case: For linear prediction the update step would be θ^{k+1} = θ^k + α X^T (Y - X θ^k) (constant factors absorbed into α).

Logistic Regression by MSE Minimization

Remember: Classification can be solved by MSE minimization methods (E[y|x] can be used to derive the posteriors P(y = C_j | x)).
Question: What functional form f(x; θ) is an appropriate choice for representing posterior class probabilities?
Option 1: What about the linear model f(x; θ) = \sum_{j=0}^{M} θ_j x_j? The range of this function goes beyond [0, 1], so it is not a good choice.
Option 2: We can use a sigmoid function to squeeze the output of a linear model to the range between 0 and 1: f(x; θ) = g(\sum_{j=0}^{M} θ_j x_j). If g(z) = e^z / (1 + e^z), optimizing f(x; θ) is called logistic regression.
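To illustrate why Option 2 is preferred, a short sketch (with made-up data and parameters) compares the unbounded linear score with its sigmoid-squashed version, which always lies in (0, 1) and can therefore be read as a posterior class probability.

```python
import numpy as np

def sigmoid(z):
    """g(z) = e^z / (1 + e^z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
X = np.hstack([np.ones((5, 1)), rng.normal(size=(5, 2))])  # dummy attribute x_0 = 1
theta = np.array([0.5, 3.0, -2.0])                          # arbitrary parameters

linear_score = X @ theta           # Option 1: unbounded, may fall outside [0, 1]
probability = sigmoid(X @ theta)   # Option 2: squeezed into (0, 1)
print(linear_score)
print(probability)
```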
Solution: Logistic regression can be solved by minimizing MSE. The derivative ∂MSE/∂θ_j is
    ∂MSE/∂θ_j = -(2/N) \sum_{i=1}^{N} (y_i - f(x_i; θ)) g'(\sum_{k=0}^{M} θ_k x_{ik}) x_{ij}.
Note: Solving ∇_θ MSE = 0 results in (M+1) nonlinear equations with (M+1) unknowns; the optimization can be done by using the gradient descent algorithm.

Maximum Likelihood (ML) Algorithm

Basic Idea: Given a data set D and a parametric model with parameters θ that describes the data generating process, the best solution θ* is the one that maximizes P(D|θ), i.e.
    θ* = argmax_θ P(D|θ).
P(D|θ) is called the likelihood, so the algorithm that finds the optimal solution θ* is called the maximum likelihood algorithm. This idea can be applied to both unsupervised and supervised learning problems.

ML for Unsupervised Learning: Density Estimation

Given D = {x_i, i = 1, ..., N}, and assuming the functional form p(x|θ) of the data generating process, the goal is to estimate the optimal parameters θ that maximize the likelihood P(D|θ):
    P(D|θ) = P(x_1, x_2, ..., x_N | θ).
By assuming that the data points are independent and identically distributed (iid),
    P(D|θ) = \prod_{i=1}^{N} p(x_i | θ)
(p is the probability density function).
Since log(x) is a monotonically increasing function of x, maximization of P(D|θ) is equivalent to maximization of l = log(P(D|θ)). l is called the log-likelihood. So,
    l = \sum_{i=1}^{N} log p(x_i | θ).

Example: Data set D = {x_i, i = 1, ..., N} is drawn from a Gaussian distribution with mean µ and standard deviation σ, i.e., X ~ N(µ, σ^2). Therefore,
    p(x | µ, σ^2) = (1 / \sqrt{2πσ^2}) exp(-(x - µ)^2 / (2σ^2)),
    l = \sum_{i=1}^{N} [ -(1/2) log(2πσ^2) - (x_i - µ)^2 / (2σ^2) ].
The values µ and σ^2 that maximize the log-likelihood satisfy the necessary condition for a local optimum:
    ∂l/∂µ = 0  ⇒  \hat{µ} = (1/N) \sum_{i=1}^{N} x_i,
    ∂l/∂σ^2 = 0  ⇒  \hat{σ}^2 = (1/N) \sum_{i=1}^{N} (x_i - \hat{µ})^2.
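A quick numerical check of the Gaussian maximum likelihood example above: the ML estimates are simply the sample mean and the 1/N sample variance. The generating mean, standard deviation, and sample size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
mu_true, sigma_true = 3.0, 1.5
x = rng.normal(mu_true, sigma_true, size=1000)   # D = {x_i, i = 1..N}

mu_hat = np.mean(x)                    # mu_hat = (1/N) * sum_i x_i
var_hat = np.mean((x - mu_hat) ** 2)   # sigma_hat^2 = (1/N) * sum_i (x_i - mu_hat)^2

print(mu_hat, np.sqrt(var_hat))        # should be close to (3.0, 1.5)
```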
ML for Supervised Learning

Given D = {(x_i, y_i), i = 1, ..., N}, and assuming the functional form p(y|x, θ) of the data generating process, the goal is to estimate the optimal parameters θ that maximize the likelihood P(D|θ):
    P(D|θ) = P(y_1, y_2, ..., y_N | x_1, x_2, ..., x_N, θ) = (if the data are iid) = \prod_{i=1}^{N} p(y_i | x_i, θ).

ML for Regression

Assume the data generating process corresponds to:
    y = f(x, θ) + e, where e ~ N(0, σ^2).
Note: this is a relatively strong assumption!
Then y ~ N(f(x, θ), σ^2), so
    p(y | x, θ) = (1 / \sqrt{2πσ^2}) exp(-(y - f(x, θ))^2 / (2σ^2)),
    l = log P(D|θ) = -(N/2) log(2πσ^2) - (1 / (2σ^2)) \sum_{i=1}^{N} (y_i - f(x_i, θ))^2.
Since σ^2 is a constant, maximization of l is equivalent to minimization of \sum_{i=1}^{N} (y_i - f(x_i, θ))^2.
Important conclusion: Regression using ML under the assumption of a DGP with additive Gaussian noise is equivalent to regression using MSE minimization!!
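A small sketch of the conclusion above: for a fixed σ^2, the Gaussian log-likelihood is a constant minus a scaled sum of squared errors, so any two parameter vectors are ranked the same way by both criteria. The data, the assumed σ^2, and the candidate parameter vectors are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 50
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 1))])
Y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=N)
sigma2 = 0.25                                    # assumed noise variance

def sse(theta):
    """Sum of squared errors sum_i (y_i - f(x_i, theta))^2 for the linear model."""
    return np.sum((Y - X @ theta) ** 2)

def log_likelihood(theta):
    """Gaussian log-likelihood: constant minus SSE / (2 sigma^2)."""
    return -0.5 * N * np.log(2 * np.pi * sigma2) - sse(theta) / (2 * sigma2)

theta_a = np.array([1.0, 2.0])
theta_b = np.array([0.0, 0.0])
# The parameter vector with smaller SSE always has the larger log-likelihood.
print(sse(theta_a) < sse(theta_b), log_likelihood(theta_a) > log_likelihood(theta_b))
```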