6.867 Machine learning: lecture 3
Tommi S. Jaakkola
MIT CSAIL
tommi@csail.mit.edu

Topics
Beyond linear regression models
- additive regression models, examples
- generalization and cross-validation
- population minimizer
Statistical regression models
- model formulation, motivation
- maximum likelihood estimation

Linear regression
Linear functions, f(x; w) = w_0 + w_1 x, or f(x; w) = w_0 + w_1 x_1 + ... + w_d x_d, combined with the squared loss, are convenient because
- the least squares solution is available in closed form, ŵ = (X^T X)^{-1} X^T y, where, for example, y = [y_1, ..., y_n]^T
- the resulting prediction errors ε_i = y_i − f(x_i; ŵ) are uncorrelated with any linear function of the inputs
- we can easily extend these to non-linear functions of the inputs while still keeping them linear in the parameters
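As a concrete sketch of the closed-form solution (the data and function name here are illustrative, not from the lecture): for the simple linear model f(x; w) = w_0 + w_1 x, the normal equations (X^T X)w = X^T y are a 2×2 system that can be solved directly by Cramer's rule.

```python
# Ordinary least squares for f(x; w) = w0 + w1*x, solving the 2x2 normal
# equations (X^T X) w = X^T y in closed form (illustrative sketch).
def fit_line(xs, ys):
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Normal equations: [[n, sx], [sx, sxx]] @ [w0, w1] = [sy, sxy].
    det = n * sxx - sx * sx
    w0 = (sy * sxx - sx * sxy) / det
    w1 = (n * sxy - sx * sy) / det
    return w0, w1

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # exactly y = 1 + 2x
w0, w1 = fit_line(xs, ys)   # recovers w0 = 1, w1 = 2
```

With noise-free data on a line, the fit recovers the generating coefficients exactly, which is a quick sanity check of the closed-form solution.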
Beyond linear regression
Example extension: m-th order polynomial regression where x ∈ R and f(x; w) is given by
f(x; w) = w_0 + w_1 x + ... + w_{m-1} x^{m-1} + w_m x^m

Polynomial regression
- linear in the parameters, non-linear in the inputs
- solution as before, ŵ = (X^T X)^{-1} X^T y, where ŵ = [ŵ_0, ŵ_1, ..., ŵ_m]^T and the i-th row of X is [1, x_i, x_i^2, ..., x_i^m]

[Figure: polynomial fits of four increasing degrees (including degree 3 and degree 7) to the same training set]

Complexity and overfitting
With limited training examples our polynomial regression model may achieve zero training error but nevertheless has a large test (generalization) error

training error: (1/n) Σ_{t=1}^n (y_t − f(x_t; ŵ))^2
test error: E_{(x,y)∼P} (y − f(x; ŵ))^2

We suffer from over-fitting when the training error no longer bears any relation to the generalization error

[Figure: a high-degree polynomial fit with near-zero training error but poor generalization]

Avoiding over-fitting: cross-validation
Cross-validation allows us to estimate the generalization error based on training examples alone. Leave-one-out cross-validation treats each training example in turn as a test example:

CV = (1/n) Σ_{i=1}^n (y_i − f(x_i; ŵ^{-i}))^2

where ŵ^{-i} are the least squares estimates of the parameters without the i-th training example.

Polynomial regression: example cont'd
[Figure: the polynomial fits of four increasing degrees, each annotated with its leave-one-out CV score]
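The leave-one-out procedure can be sketched directly from the definition (the helper names and test data below are my own, not from the lecture): fit the polynomial by solving the normal equations, then refit n times with one example held out and average the squared test errors.

```python
# Sketch of m-th order polynomial regression and leave-one-out cross-validation.
def fit_poly(xs, ys, m):
    # Normal equations A w = b with A = X^T X, b = X^T y,
    # where row i of X is (1, x_i, ..., x_i^m).
    n = m + 1
    A = [[sum(x ** (j + k) for x in xs) for k in range(n)] for j in range(n)]
    b = [sum(y * x ** j for x, y in zip(xs, ys)) for j in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (b[r] - sum(A[r][c] * w[c] for c in range(r + 1, n))) / A[r][r]
    return w

def predict(w, x):
    return sum(wj * x ** j for j, wj in enumerate(w))

def loo_cv(xs, ys, m):
    # CV = (1/n) sum_i (y_i - f(x_i; w_hat^{-i}))^2
    n = len(xs)
    total = 0.0
    for i in range(n):
        w = fit_poly(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:], m)
        total += (ys[i] - predict(w, xs[i])) ** 2
    return total / n
```

Comparing `loo_cv(xs, ys, m)` across degrees m mimics the figure above: the CV score, unlike the training error, penalizes degrees that over-fit.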
Additive models
More generally, predictions can be based on a linear combination of a set of basis functions (or features) {φ_1(x), ..., φ_m(x)}, where each φ_i(x) : R^d → R, and
f(x; w) = w_0 + w_1 φ_1(x) + ... + w_m φ_m(x)

Examples:
- If φ_i(x) = x^i, i = 1, ..., m, then
f(x; w) = w_0 + w_1 x + ... + w_{m-1} x^{m-1} + w_m x^m
- If m = d, φ_i(x) = x_i, i = 1, ..., d, then
f(x; w) = w_0 + w_1 x_1 + ... + w_d x_d

The basis functions can capture various (e.g., qualitative) properties of the inputs. For example: we can try to rate companies based on text descriptions
x = text document (collection of words)
φ_i(x) = 1 if word i appears in the document, 0 otherwise
f(x; w) = w_0 + Σ_{i ∈ words} w_i φ_i(x)

We can also make predictions by gauging the similarity of examples to prototypes. For example, our additive regression function could be
f(x; w) = w_0 + w_1 φ_1(x) + ... + w_m φ_m(x)
where the basis functions are radial basis functions
φ_k(x) = exp{ −(1/(2σ^2)) ‖x − μ_k‖^2 }
measuring the similarity to the prototypes μ_k; σ^2 controls how quickly the basis function vanishes as a function of the distance to the prototype.
(training examples themselves could serve as prototypes)

We can view the additive models graphically in terms of simple units and weights
[Figure: network diagram with units φ_1(x), ..., φ_m(x), weights w_1, ..., w_m, and output f(x; w)]
In neural networks the basis functions themselves have adjustable parameters (cf. prototypes)

Squared loss and population minimizer
What do we get if we have unlimited training examples (the whole population) and no constraints on the regression function?
minimize E_{(x,y)∼P} (y − f(x))^2 with respect to an unconstrained function f(x)
[Figure: population scatter with the optimal regression function]

Squared loss and population minimizer
To minimize
E_{(x,y)∼P} (y − f(x))^2 = E_{x∼P_x} [ E_{y∼P_{y|x}} ( (y − f(x))^2 | x ) ]
we can focus on each x separately since f(x) can be chosen independently for each different x. For a particular x we can set the derivative to zero:
(∂/∂f(x)) E_{y∼P_{y|x}} ( (y − f(x))^2 | x ) = −2 E_{y∼P_{y|x}} ( y − f(x) | x ) = −2 ( E{y|x} − f(x) ) = 0
Thus the function we are trying to approximate is the conditional expectation
f*(x) = E{y|x}
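A radial-basis additive model is easy to sketch in code. The prototypes, weights, and σ below are illustrative choices of my own, not values from the lecture; the point is only the structure f(x; w) = w_0 + Σ_k w_k φ_k(x) with φ_k measuring similarity to prototype μ_k.

```python
import math

# Radial basis features phi_k(x) = exp(-(x - mu_k)^2 / (2 sigma^2)) for
# scalar inputs, and the additive model f(x; w) = w0 + sum_k w_k phi_k(x).
def rbf_features(x, prototypes, sigma):
    return [math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) for mu in prototypes]

def additive_predict(x, w, prototypes, sigma):
    # w[0] is the bias w0; w[1:] weight the basis functions.
    phis = rbf_features(x, prototypes, sigma)
    return w[0] + sum(wk * p for wk, p in zip(w[1:], phis))

prototypes = [0.0, 1.0, 2.0]          # could be training examples themselves
w = [0.5, 1.0, -1.0, 2.0]             # illustrative weights
y = additive_predict(1.0, w, prototypes, sigma=1.0)
```

Each feature equals 1 at its own prototype and decays with distance at a rate set by σ, so the model interpolates between prototype-local behaviors.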
Topics
Beyond linear regression models
- additive regression models, examples
- generalization and cross-validation
- population minimizer
Statistical regression models
- model formulation, motivation
- maximum likelihood estimation

Statistical view of linear regression
In a statistical regression model we model both the function and noise
Observed output = function + noise
y = f(x; w) + ε
where, e.g., ε ∼ N(0, σ^2). Whatever we cannot capture with our chosen family of functions will be interpreted as noise
[Figure: data scattered around a fitted regression function]

Statistical view of linear regression
f(x; w) is trying to capture the mean of the observations y given the input x:
E{y|x} = E{ f(x; w) + ε | x } = f(x; w)
where E{y|x} is the conditional expectation of y given x, evaluated according to the model (not according to the underlying distribution P)

Statistical view of linear regression
According to our statistical model
y = f(x; w) + ε,  ε ∼ N(0, σ^2)
the outputs y given x are normally distributed with mean f(x; w) and variance σ^2:
p(y | x, w, σ^2) = (1/√(2πσ^2)) exp{ −(1/(2σ^2)) (y − f(x; w))^2 }
(we model the uncertainty in the predictions, not just the mean)
Loss function? Estimation?

Maximum likelihood estimation
Given observations D = {(x_1, y_1), ..., (x_n, y_n)} we find the parameters w that maximize the (conditional) likelihood of the outputs
L(D; w, σ^2) = ∏_{i=1}^n p(y_i | x_i, w, σ^2)
Example: linear function
p(y | x, w, σ^2) = (1/√(2πσ^2)) exp{ −(1/(2σ^2)) (y − w_0 − w_1 x)^2 }
[Figure: a poor linear fit to the data]
(why is this a bad fit according to the likelihood criterion?)

Maximum likelihood estimation cont'd
Likelihood of the observed outputs:
L(D; w, σ^2) = ∏_{i=1}^n P(y_i | x_i, w, σ^2)
It is often easier (but equivalent) to try to maximize the log-likelihood:
l(D; w, σ^2) = log L(D; w, σ^2) = Σ_{i=1}^n log P(y_i | x_i, w, σ^2)
= Σ_{i=1}^n ( −(1/(2σ^2)) (y_i − f(x_i; w))^2 − log √(2πσ^2) )
= −(1/(2σ^2)) Σ_{i=1}^n (y_i − f(x_i; w))^2 + ...
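A small numeric sketch makes the likelihood criterion concrete (the data and parameter values are illustrative, not from the lecture). Note that for a fixed σ^2 only the squared-error term depends on w, so the parameters maximizing the log-likelihood are exactly the least squares parameters.

```python
import math

# Conditional log-likelihood of the outputs under the Gaussian noise model
# y = f(x; w) + eps, eps ~ N(0, sigma2), with the linear f(x; w) = w0 + w1*x:
# log L = sum_i [ -(y_i - f(x_i; w))^2 / (2 sigma2) - 0.5 log(2 pi sigma2) ]
def log_likelihood(xs, ys, w0, w1, sigma2):
    ll = 0.0
    for x, y in zip(xs, ys):
        resid = y - (w0 + w1 * x)
        ll += -resid ** 2 / (2.0 * sigma2) - 0.5 * math.log(2.0 * math.pi * sigma2)
    return ll

xs, ys = [0.0, 1.0, 2.0], [1.0, 3.0, 5.0]   # exactly y = 1 + 2x
good = log_likelihood(xs, ys, 1.0, 2.0, 1.0)
bad = log_likelihood(xs, ys, 0.0, 0.0, 1.0)  # a poor fit: much lower likelihood
```

The poor fit is "bad according to the likelihood criterion" precisely because its large residuals are exponentially penalized by the Gaussian density.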
Maximum likelihood estimation cont'd
Maximizing log-likelihood is equivalent to minimizing empirical loss when the loss is defined according to
Loss(y_i, f(x_i; w)) = −log P(y_i | x_i, w, σ^2)
Loss defined as the negative log-probability is known as the log-loss.

Maximum likelihood estimation cont'd
The log-likelihood of observations
log L(D; w, σ^2) = Σ_{i=1}^n log P(y_i | x_i, w, σ^2)
is a generic fitting criterion and can be used to estimate the noise variance σ^2 as well. Let ŵ be the maximum likelihood (here least squares) setting of the parameters. What is the maximum likelihood estimate of σ^2, obtained by solving
(∂/∂σ^2) log L(D; w, σ^2) = 0 ?
The maximum likelihood estimate of the noise variance σ^2 is
σ̂^2 = (1/n) Σ_{i=1}^n (y_i − f(x_i; ŵ))^2
i.e., the mean squared prediction error.

Tommi Jaakkola, MIT CSAIL
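The variance estimate is just the mean squared prediction error of the fitted model, which is a one-liner to compute (the function name and data are illustrative):

```python
# Maximum likelihood estimate of the noise variance:
# sigma_hat^2 = (1/n) sum_i (y_i - f(x_i; w_hat))^2,
# given the observed outputs and the fitted model's predictions.
def sigma2_mle(ys, preds):
    n = len(ys)
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / n

# e.g. one residual of 1.0 among three examples gives sigma_hat^2 = 1/3
s2 = sigma2_mle([1.0, 2.0, 3.0], [1.0, 1.0, 3.0])
```

Because ŵ minimizes the same squared errors, plugging it in before estimating σ^2 matches the two-step procedure in the slide: first least squares for w, then the mean squared residual for σ^2.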