6.867 Machine learning, lecture 2. Topics: review of the learning problem; hypotheses and estimation; estimation criterion; regression.


6.867 Machine learning: lecture 2
Tommi S. Jaakkola, MIT CSAIL (tommi@csail.mit.edu)

Topics
- The learning problem: hypothesis class, estimation algorithm; loss and estimation criterion; sampling, empirical and expected losses
- Regression, example
- Linear regression: estimation, errors, analysis

Review: the learning problem
Recall the image (face) recognition problem (face images of Indyk, Barzilay, Collins, Jaakkola).
- Hypothesis class: we consider some restricted set $F$ of mappings $f : X \to L$ from images to labels.
- Estimation: on the basis of a training set of examples and labels, $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, we find an estimate $\hat f \in F$.
- Evaluation: we measure how well $\hat f$ generalizes to yet unseen examples, i.e., whether $\hat f(x_{new})$ agrees with $y_{new}$.

Hypotheses and estimation
We used a simple linear classifier, a parameterized mapping $f(x; \theta)$ from images $X$ to labels $L$, to solve a binary image classification problem (2's vs 3's):

    $\hat y = f(x; \theta) = \mathrm{sign}(\theta^T x)$

where $x$ is a pixel image and $\hat y \in \{-1, 1\}$. The parameters $\theta$ were adjusted on the basis of the training examples and labels according to a simple mistake-driven update rule (written here in vector form):

    $\theta \leftarrow \theta + y_i x_i$ whenever $y_i \ne \mathrm{sign}(\theta^T x_i)$

The update rule attempts to minimize the number of errors that the classifier makes on the training examples.

Estimation criterion
We can formulate the estimation problem more explicitly by defining a zero-one loss

    $\mathrm{Loss}(y, \hat y) = \begin{cases} 0, & y = \hat y \\ 1, & y \ne \hat y \end{cases}$

so that $\frac{1}{n} \sum_{i=1}^{n} \mathrm{Loss}(y_i, \hat y_i)$ gives the fraction of prediction errors on the training set. This is a function of the parameters $\theta$ and we can try to minimize it directly.
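To make the zero-one loss and the mistake-driven update above concrete, here is a minimal numerical sketch in Python/numpy; the synthetic two-class data and the number of passes over the training set are illustrative assumptions, not part of the lecture.

    import numpy as np

    def zero_one_empirical_loss(theta, X, y):
        # Fraction of training examples with sign(theta^T x_i) != y_i.
        return np.mean(np.sign(X @ theta) != y)

    def mistake_driven_train(X, y, passes=10):
        # theta <- theta + y_i x_i whenever y_i != sign(theta^T x_i).
        theta = np.zeros(X.shape[1])
        for _ in range(passes):
            for x_i, y_i in zip(X, y):
                if y_i != np.sign(theta @ x_i):
                    theta = theta + y_i * x_i
        return theta

    # Illustrative data: two Gaussian classes with labels in {-1, +1}.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(+1.0, 1.0, size=(50, 2)),
                   rng.normal(-1.0, 1.0, size=(50, 2))])
    y = np.concatenate([np.ones(50), -np.ones(50)])

    theta_hat = mistake_driven_train(X, y)
    print("empirical zero-one loss:", zero_one_empirical_loss(theta_hat, X, y))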

Estimation criterion cont'd
We have reduced the estimation problem to a minimization problem: find $\theta$ that minimizes the empirical loss

    $\frac{1}{n} \sum_{i=1}^{n} \mathrm{Loss}(y_i, f(x_i; \theta))$

This formulation is
- valid for any parameterized class of mappings from examples to predictions
- valid when the predictions are discrete labels, real valued, or other, provided that the loss is defined appropriately
- possibly ill-posed (under-constrained) as stated.
But why is it sensible to minimize the empirical loss in the first place, since we are only interested in the performance on new examples?

Training and test performance: sampling
We assume that each training and test example-label pair, $(x, y)$, is drawn independently at random from the same but unknown population of examples and labels. We can represent this population as a joint probability distribution $P(x, y)$ so that each training/test example is a sample from this distribution, $(x_i, y_i) \sim P$.

    Empirical (training) loss $= \frac{1}{n} \sum_{i=1}^{n} \mathrm{Loss}(y_i, f(x_i; \theta))$
    Expected (test) loss $= E_{(x,y) \sim P} \{ \mathrm{Loss}(y, f(x; \theta)) \}$

The training loss, based on a few sampled examples and labels, serves as a proxy for the test performance measured over the whole population.
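The sampling view above can be illustrated with a small Monte Carlo sketch; the population $P(x, y)$ below (a Gaussian input with noisily thresholded labels) and the fixed classifier are assumptions made only for illustration, with a very large sample standing in for the expectation over $P$.

    import numpy as np

    rng = np.random.default_rng(1)

    def sample_P(n):
        # Draw (x_i, y_i) ~ P i.i.d. from an assumed toy population.
        x = rng.normal(size=n)
        y = np.sign(x + 0.5 * rng.normal(size=n))   # labels in {-1, +1}
        return x, y

    def zero_one_loss(y, y_hat):
        return np.mean(y != y_hat)

    theta = 1.0                       # a fixed classifier f(x; theta) = sign(theta * x)
    x_tr, y_tr = sample_P(20)         # small training sample
    x_te, y_te = sample_P(200_000)    # large sample approximating E_{(x,y)~P}

    print("empirical (training) loss:", zero_one_loss(y_tr, np.sign(theta * x_tr)))
    print("expected (test) loss ~   :", zero_one_loss(y_te, np.sign(theta * x_te)))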

Regression
The goal is to make quantitative (real valued) predictions on the basis of a (vector of) features or attributes. Example: predicting vehicle fuel efficiency (mpg) from 8 attributes.

    (example data table: columns y = mpg, cyls, disp, hp, weight, ...)

We need to
- specify the class of functions (e.g., linear)
- select how to measure prediction loss
- solve the resulting minimization problem.

Linear regression
We begin by considering linear regression (easy to extend to more complex predictions later on):

    $f : R \to R$: $f(x; w) = w_0 + w_1 x$
    $f : R^d \to R$: $f(x; w) = w_0 + w_1 x_1 + \ldots + w_d x_d$

where $w = [w_0, w_1, \ldots, w_d]^T$ are parameters we need to set.

Linear regression: squared loss
We can measure the prediction loss in terms of squared error, $\mathrm{Loss}(y, \hat y) = (y - \hat y)^2$, so that the empirical loss on $n$ training samples becomes the mean squared error

    $J_n(w) = \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i; w))^2$

Linear regression: estimation
We have to minimize the empirical squared loss

    $J_n(w) = \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i; w))^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - w_0 - w_1 x_i)^2$ (1-dim case)

By setting the derivatives with respect to $w_1$ and $w_0$ to zero, we get necessary conditions for the optimal parameter values:

    $\frac{\partial}{\partial w_1} J_n(w) = 0$, $\frac{\partial}{\partial w_0} J_n(w) = 0$
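As a sketch of the one-dimensional estimation problem, the snippet below fits $w_0$ and $w_1$ by the closed form that the two optimality conditions yield (derived on the next slides); the synthetic data are an assumption for illustration.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 50
    x = rng.uniform(-3, 3, size=n)                       # assumed inputs
    y = 1.5 + 0.8 * x + rng.normal(scale=0.5, size=n)    # assumed noisy linear outputs

    # Solving dJ/dw1 = 0 and dJ/dw0 = 0 gives the usual slope/intercept formulas.
    w1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    w0_hat = y.mean() - w1_hat * x.mean()

    J_n = np.mean((y - w0_hat - w1_hat * x) ** 2)         # empirical squared loss at the optimum
    print("w0_hat, w1_hat:", w0_hat, w1_hat)
    print("J_n(w_hat):", J_n)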

Optimality conditions: derivation

    $\frac{\partial}{\partial w_1} J_n(w) = \frac{\partial}{\partial w_1} \frac{1}{n} \sum_{i=1}^{n} (y_i - w_0 - w_1 x_i)^2$
    $= \frac{1}{n} \sum_{i=1}^{n} \frac{\partial}{\partial w_1} (y_i - w_0 - w_1 x_i)^2$
    $= \frac{1}{n} \sum_{i=1}^{n} 2\,(y_i - w_0 - w_1 x_i)\, \frac{\partial}{\partial w_1} (y_i - w_0 - w_1 x_i)$
    $= \frac{2}{n} \sum_{i=1}^{n} (y_i - w_0 - w_1 x_i)(-x_i) = 0$

Similarly,

    $\frac{\partial}{\partial w_0} J_n(w) = \frac{2}{n} \sum_{i=1}^{n} (y_i - w_0 - w_1 x_i)(-1) = 0$

Interpretation
If we denote the prediction error as $\epsilon_i = (y_i - w_0 - w_1 x_i)$, then the optimality conditions can be written as

    $\sum_{i=1}^{n} \epsilon_i x_i = 0$, $\sum_{i=1}^{n} \epsilon_i = 0$

Thus the prediction error is uncorrelated with any linear function of the inputs, but not with a quadratic function of the inputs: $\sum_{i=1}^{n} \epsilon_i x_i^2 \ne 0$ (in general).

    (figure: fitted line with residuals)

Linear regression: matrix notation
We can express the solution a bit more generally by resorting to matrix notation

    $y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}$, $X = \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}$, $w = \begin{bmatrix} w_0 \\ w_1 \end{bmatrix}$

so that

    $\frac{1}{n} \sum_{t=1}^{n} (y_t - w_0 - w_1 x_t)^2 = \frac{1}{n} \| y - Xw \|^2$
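A short numerical check of the interpretation above, on assumed synthetic data with a nonlinear target (so a line cannot fit exactly): at the least-squares optimum the residuals sum to zero and are orthogonal to the inputs, but not, in general, to a quadratic function of the inputs.

    import numpy as np

    rng = np.random.default_rng(3)
    n = 200
    x = rng.uniform(-2, 2, size=n)
    y = np.sin(x) + rng.normal(scale=0.1, size=n)     # assumed nonlinear target

    # Matrix notation: X has a column of ones for w0 and a column for x.
    X = np.column_stack([np.ones(n), x])
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)     # least-squares fit

    eps = y - X @ w_hat                               # prediction errors epsilon_i
    print("sum eps_i        :", eps.sum())            # ~ 0
    print("sum eps_i * x_i  :", (eps * x).sum())      # ~ 0
    print("sum eps_i * x_i^2:", (eps * x ** 2).sum()) # generally nonzero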

Linear regression: solution
By setting the derivatives of $\|y - Xw\|^2 / n$ to zero, we get the same optimality conditions as before, now expressed in matrix form:

    $\frac{\partial}{\partial w} \frac{1}{n} \|y - Xw\|^2 = \frac{\partial}{\partial w} \frac{1}{n} (y - Xw)^T (y - Xw) = -\frac{2}{n} X^T (y - Xw) = 0$

which gives

    $X^T (y - Xw) = 0 \;\Rightarrow\; X^T y - X^T X w = 0 \;\Rightarrow\; \hat w = (X^T X)^{-1} X^T y$

The solution is a linear function of the outputs $y$.

Linear regression: generalization
As the number of training examples increases, our solution gets better. We'd like to understand the error a bit better.

    (figure: fitted lines for increasing training-set sizes, and test mean squared error as a function of the number of training examples)

Linear regression: types of errors
Structural error measures the error introduced by the limited function class (infinite training data):

    $\min_{w_0, w_1} E_{(x,y) \sim P} (y - w_0 - w_1 x)^2 = E_{(x,y) \sim P} (y - w_0^* - w_1^* x)^2$

where $(w_0^*, w_1^*)$ are the optimal linear regression parameters.

Approximation error measures how close we can get to the optimal linear predictions with limited training data:

    $E_{(x,y) \sim P} (w_0^* + w_1^* x - \hat w_0 - \hat w_1 x)^2$

where $(\hat w_0, \hat w_1)$ are the parameter estimates based on a small training set (and therefore themselves random variables).
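The closed-form solution and the generalization behaviour can be sketched together; the population below (a truly linear relationship with additive noise, so the structural error is essentially the noise variance) is an assumption for illustration, and the test mean squared error should shrink toward it as $n$ grows.

    import numpy as np

    rng = np.random.default_rng(4)
    w_true = np.array([1.0, -2.0])                    # assumed population parameters [w0, w1]

    def sample(n):
        x = rng.uniform(-3, 3, size=n)
        X = np.column_stack([np.ones(n), x])
        y = X @ w_true + rng.normal(scale=1.0, size=n)
        return X, y

    X_test, y_test = sample(100_000)                  # large test set approximating P

    for n in [5, 10, 20, 50, 100, 500]:
        X_tr, y_tr = sample(n)
        w_hat = np.linalg.solve(X_tr.T @ X_tr, X_tr.T @ y_tr)   # (X^T X)^{-1} X^T y
        test_mse = np.mean((y_test - X_test @ w_hat) ** 2)
        print(f"n = {n:4d}  w_hat = {w_hat}  test MSE = {test_mse:.3f}")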

Linear regression: error decomposition
The expected error of our linear regression function decomposes into the sum of structural and approximation errors:

    $E_{(x,y) \sim P} (y - \hat w_0 - \hat w_1 x)^2 = E_{(x,y) \sim P} (y - w_0^* - w_1^* x)^2 + E_{(x,y) \sim P} (w_0^* + w_1^* x - \hat w_0 - \hat w_1 x)^2$

    (figure: mean squared error as a function of the number of training examples)

Error decomposition: derivation

    $E_{(x,y) \sim P} (y - \hat w_0 - \hat w_1 x)^2$
    $= E_{(x,y) \sim P} \big( (y - w_0^* - w_1^* x) + (w_0^* + w_1^* x - \hat w_0 - \hat w_1 x) \big)^2$
    $= E_{(x,y) \sim P} (y - w_0^* - w_1^* x)^2$
    $\;\; + 2\, E_{(x,y) \sim P} (y - w_0^* - w_1^* x)(w_0^* + w_1^* x - \hat w_0 - \hat w_1 x)$
    $\;\; + E_{(x,y) \sim P} (w_0^* + w_1^* x - \hat w_0 - \hat w_1 x)^2$

The second (cross) term has to be zero since the error $(y - w_0^* - w_1^* x)$ of the best linear predictor is necessarily uncorrelated with any linear function of the input, including $(w_0^* + w_1^* x - \hat w_0 - \hat w_1 x)$.
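Finally, the decomposition can be verified numerically under an assumed nonlinear population (so the structural error is nonzero); the best linear predictor is approximated on a very large sample standing in for infinite data, and all three expectations are estimated on a fresh large evaluation sample.

    import numpy as np

    rng = np.random.default_rng(5)

    def sample(n):
        x = rng.uniform(-3, 3, size=n)
        y = np.sin(x) + rng.normal(scale=0.2, size=n)   # assumed population P(x, y)
        return x, y

    def design(x):
        return np.column_stack([np.ones(len(x)), x])

    # Best linear predictor (w0*, w1*), approximated on a very large sample ("infinite data").
    x_big, y_big = sample(500_000)
    w_star = np.linalg.lstsq(design(x_big), y_big, rcond=None)[0]

    # Parameter estimates (w0_hat, w1_hat) from one small training set.
    x_tr, y_tr = sample(15)
    w_hat = np.linalg.lstsq(design(x_tr), y_tr, rcond=None)[0]

    # Estimate the three expectations on a fresh large sample.
    x_ev, y_ev = sample(200_000)
    X_ev = design(x_ev)
    total = np.mean((y_ev - X_ev @ w_hat) ** 2)
    structural = np.mean((y_ev - X_ev @ w_star) ** 2)
    approx = np.mean((X_ev @ w_star - X_ev @ w_hat) ** 2)
    print(f"total ~ {total:.4f}")
    print(f"structural + approximation ~ {structural + approx:.4f}")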