Statistics and Data Analysis in MATLAB
Kendrick Kay, kendrick.kay@wustl.edu
February 28, 2014

Lecture 4: Model fitting

1. The basics

- Suppose that we have a set of data and that we have selected the type of model to apply to the data (see Lecture 3). Our task now is to fit the model to the data, that is, adjust the free parameters of the model such that the model describes the data as well as possible.
- Before proceeding, we must first establish a metric for the goodness-of-fit of the model. A common metric is squared error, that is, the sum of the squares of the differences between the data and the model fit:

  squared error = Σ_{i=1}^{n} (d_i − m_i)^2

where n is the number of data points, d_i is the ith data point, and m_i is the model fit for the ith data point. We will see the motivation for squared error later in this lecture.
- Given the metric of squared error, our job is to determine the specific set of parameter values that minimize squared error. The solution to this problem depends on the type of model that we are trying to fit.

2. The case of linear models

- Model fitting in the case of linear (and linearized) models can be given a nice geometric interpretation. We can view the data as a single point in n-dimensional space, and we can view the regressors as vectors in this space emanating from the origin. Potential model fits are given by points in the subspace spanned by the vectors. (The subspace consists of all points that can be expressed as a weighted sum of the regressors.) The model fit that minimizes squared error is the point that lies closest in a Euclidean sense to the data. (This is because the Euclidean distance between two points is a simple monotonic transformation, the square root, of the sum of the squares of the differences in the two points' coordinates.) The residuals of the model fit can be viewed as a vector that starts at the model fit and ends at the data.
- With these geometric insights in mind, we can now derive the solution to our fitting problem.
Recall that linear models can be expressed as

  y = Xw + ε

where y is a set of data points (n × 1), X is a set of regressors (n × p), w is a set of weights (p × 1), and ε is a set of residuals (n × 1). Let ŵ_OLS denote the set of weights that provide the optimal model fit; these weights are called the ordinary least-squares (OLS) estimate. At the optimal model fit, the residuals must be orthogonal to each of the regressors. (If the residuals were correlated with a given regressor, then a better model fit could be obtained by moving in the direction of that regressor.) This orthogonality condition implies that the dot product between each regressor and the residuals must equal zero:

  X^T (y − X ŵ_OLS) = 0

where 0 is a vector of zeros (p × 1). Expanding and solving, we obtain:

  X^T y − X^T X ŵ_OLS = 0

and
  ŵ_OLS = (X^T X)^{−1} X^T y

where A^{−1} indicates the matrix inverse of A. Thus, we see that the set of weights that minimizes squared error can be computed using an analytic expression involving simple matrix operations.
- Note that if there are more regressors than data points (p > n), then the inversion of the correlation matrix X^T X is ill-defined, and there is no unique OLS solution. Intuitively, the idea is that if there are more regressors than data points, then there are infinitely many solutions, all of which achieve zero error.

3. The case of nonlinear models

- Given that nonlinear models encompass a broad diversity of models, there is little hope of writing down a single expression that will solve the fitting problem for an arbitrary nonlinear model.
- To fit nonlinear models, we cast the problem as a search problem: we have a parameter space and a cost function, and our job is to search through the space to find the point that minimizes the cost function. Since we are using the cost function of squared error, we can think of our job as trying to find the minimum point on an error surface. If there is only one parameter, the error surface is a function defined on one dimension (i.e. a curvy line); if there are two parameters, the error surface is a function defined on two dimensions (i.e. a bumpy sheet); etc.
- To search through the parameter space, the usual approach is to use local, iterative optimization algorithms that start at some point in the space (the initial seed), look at the error surface in a small neighborhood around that point, move in some direction in an attempt to reduce the error, and then repeat this process until improvements are sufficiently small (e.g. until the improvement is less than some small number). There are a variety of optimization algorithms, and they basically vary with respect to how exactly they make use of first-order derivative information (gradients) and second-order derivative information (curvature). The Levenberg-Marquardt algorithm is a popular and effective algorithm and is implemented in the MATLAB Optimization Toolbox.
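Before turning to the gradient-based view, the analytic OLS expression above can be checked numerically. The sketch below uses Python/NumPy with made-up data (the lecture's examples are in MATLAB, where the equivalents are inv(X'*X)*X'*y or, preferably, the backslash operator X\y); all variable names and data here are illustrative assumptions.

```python
import numpy as np

# Hypothetical small dataset: n = 50 data points, p = 2 regressors.
rng = np.random.default_rng(0)
n, p = 50, 2
X = rng.standard_normal((n, p))                 # regressors (n x p)
w_true = np.array([2.0, -1.0])                  # "true" weights (p x 1)
y = X @ w_true + 0.1 * rng.standard_normal(n)   # data = model + noise

# OLS estimate via the analytic expression: w = (X'X)^-1 X'y.
# (In MATLAB: w_ols = inv(X'*X)*X'*y, or better, X\y.)
w_ols = np.linalg.inv(X.T @ X) @ X.T @ y

# A numerically preferable route solves the same least-squares problem
# without explicitly forming the inverse.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_ols, w_lstsq))  # the two routes agree
```

Note that explicitly inverting X^T X is fine for exposition, but dedicated least-squares solvers (MATLAB's backslash, NumPy's lstsq) are more numerically stable in practice.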
- As a simple example of an optimization method, let us consider how to perform gradient descent for a linear model (reusing the earlier example y = Xw + ε). Assuming the error metric is squared error, the derivative of the error surface with respect to the jth weight is

  ∂error/∂w_j = ∂/∂w_j [ (y − Xw)^T (y − Xw) ] = ∂/∂w_j Σ_{i=1}^{n} (y_i − X_i w)^2 = Σ_{i=1}^{n} −2 (y_i − X_i w) X_{i,j}

where w_j is the jth element of w, y_i is the ith element of y, X_i is the ith row of X, and X_{i,j} is the (i,j)th element of X. Collecting the derivatives for the different weights into a vector, we obtain the gradient of the error surface (p × 1):

  ∇error = −2 X^T (y − Xw)

So, what this tells us is that given a set of weights w, we know how to compute the amount by which the error will change if we were to tweak any of the weights (e.g., if we were to increment the first weight by 0.01, then the error will increase by approximately 0.01 times the first element of the gradient vector). This suggests a simple algorithm: first, set all the weights to some initial value (e.g. all zeros); then, compute the gradient and update the weights by subtracting some small fraction of the gradient; and repeat the weight-updating process until the error stops decreasing. We will see that this algorithm can be implemented in MATLAB quite easily.
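The simple algorithm just described can be sketched as follows. This is an illustrative Python/NumPy translation with made-up data and an assumed step size (the MATLAB version uses the same vectorized operations); it is not the lecture's official implementation.

```python
import numpy as np

# Gradient descent for the linear model y = Xw + noise, minimizing squared error.
rng = np.random.default_rng(1)
n, p = 100, 3
X = rng.standard_normal((n, p))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.standard_normal(n)

w = np.zeros(p)          # initial seed: all weights set to zero
step = 0.001             # small fraction of the gradient (assumed value)
prev_err = np.inf
for _ in range(10000):
    grad = -2 * X.T @ (y - X @ w)     # gradient of squared error (p x 1)
    w = w - step * grad               # move downhill along the gradient
    err = np.sum((y - X @ w) ** 2)
    if prev_err - err < 1e-12:        # stop when the improvement is tiny
        break
    prev_err = err

# Because the error surface of a linear model is bowl-shaped (no local
# minima), gradient descent converges to the OLS solution.
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(w, w_ols, atol=1e-4))
```

The step size matters: too large and the iterates overshoot and diverge; too small and convergence is needlessly slow. Practical optimizers choose and adapt the step automatically.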
- A potential problem with local, iterative optimization is local minima, that is, locations on the error surface that are the minimum within some local range but which are not the absolute minimum that can be achieved (which is known as the global minimum).
- For linear models, the error surface is shaped like a bowl and there are no local minima. The lack of local minima makes parameter search easy: as long as an algorithm can adjust parameters to reduce the error, we will eventually get to the optimal solution. Geometrically, there is a nice intuition for why linear models have no local minima: assuming there are at least as many data points as regressors, there is exactly one point in the subspace spanned by the regressors that is closest to the data, and as parameter values deviate from this point, the distance from the data grows monotonically.
- For nonlinear models, the error surface may be "bumpy" with local minima. Because of local minima, the solution found by an algorithm may not be the best possible solution.
- The severity of the problem of local minima depends on the nature of the data, the nature of the model, and the specific optimization algorithm used, so it is difficult to make any general statements. Strategies for dealing with local minima include (1) starting with different initial seeds and selecting the best resulting model, (2) exhaustively sampling the parameter space, and (3) using alternative optimization techniques such as genetic algorithms.

4. The motivation for squared error

- Given a probability distribution, we can quantify the probability, or likelihood, of a set of data. Moreover, for different probability distributions, we can ask which probability distribution maximizes the likelihood of the data. This procedure is known as maximum likelihood estimation and provides a means for choosing from amongst different models. For example, given a set of data, out of all of the possible Gaussian distributions, the one that maximizes the likelihood of the data has a mean and standard deviation equal to the mean and standard deviation of the data.
- Let us apply maximum likelihood estimation to the case of regression models.
Suppose that the true underlying probability distribution for each data point is a Gaussian whose mean is equal to the model prediction and whose standard deviation is some fixed value. In other words, suppose that the data are generated by the model plus independent, identically distributed (i.i.d.) Gaussian noise. Then, the likelihood of a given set of data can be written as follows:

  likelihood(d | m) = ∏_{i=1}^{n} p(d_i) = ∏_{i=1}^{n} 1/(σ√(2π)) e^{−(d_i − m_i)^2 / (2σ^2)}

where d represents the data, m represents the model, n is the number of data points, d_i is the ith data point, σ is the standard deviation of the Gaussian noise, and m_i is the model prediction for the ith data point. The model estimate, m̂, that we want is the one that maximizes the likelihood of the data:

  m̂ = arg max_m ( likelihood(d | m) )

Because the logarithm is a monotonic function, the desired model estimate is also given by

  m̂ = arg max_m ( log-likelihood(d | m) )

which in turn is equivalent to

  m̂ = arg min_m ( negative-log-likelihood(d | m) )

Now, let's substitute in the likelihood expression:
  m̂ = arg min_m ( −log ∏_{i=1}^{n} 1/(σ√(2π)) e^{−(d_i − m_i)^2 / (2σ^2)} )

Simplifying, we obtain

  m̂ = arg min_m ( Σ_{i=1}^{n} [ log(σ√(2π)) + (d_i − m_i)^2 / (2σ^2) ] )

We can drop the first term since it has no dependence on m:

  m̂ = arg min_m ( Σ_{i=1}^{n} (d_i − m_i)^2 / (2σ^2) )

We can drop the denominator since it has no dependence on m:

  m̂ = arg min_m ( Σ_{i=1}^{n} (d_i − m_i)^2 )

Finally, we can rewrite this more simply:

  m̂ = arg min_m ( squared error )

Thus, we see that to maximize the likelihood of the data, we should choose the model that minimizes squared error. This shows that there is good motivation to use squared error, namely, that assuming i.i.d. Gaussian noise, the maximum likelihood estimate of the parameters of a model is the set of parameters that minimizes squared error.

5. An alternative error metric: absolute error

- The assumption that the noise is Gaussian may be inaccurate for two reasons. One, the measurement noise may be non-Gaussian. Two, the model being applied may not be the correct model (e.g., fitting a linear model when the true effect is quadratic). This has the consequence that unmodeled effects may be subsumed in the noise and may cause the noise to be non-Gaussian. Thus, although squared error has a nice motivation, we might desire to use a different error metric.
- One alternative metric is the sum of the absolute values of the differences between the data and the model fit:

  absolute error = Σ_{i=1}^{n} |d_i − m_i|

where n is the number of data points, d_i is the ith data point, and m_i is the model fit for the ith data point. This choice of error metric is useful as it reduces the impact of outliers (which can be roughly defined as unreasonably extreme data points). For a theoretical motivation, it can be shown that the parameter estimate that minimizes absolute error is the maximum likelihood estimate under the assumption of Laplacian noise. (The Laplace distribution is just like the Gaussian distribution except that the exponential is taken of the absolute difference from the mean instead of the squared difference from the mean.)
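The outlier-resistance of absolute error can be seen in a small experiment. The sketch below uses Python with SciPy's general-purpose Nelder-Mead optimizer (since absolute error has no analytic solution and is not differentiable everywhere); the data, true slope of 3, and injected outlier are all made-up assumptions, and in MATLAB one would use fminsearch in the same way.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical data: a line with slope 3 plus small noise, and one gross outlier.
rng = np.random.default_rng(2)
x = np.linspace(0, 1, 40)
y = 3.0 * x + 0.05 * rng.standard_normal(40)
y[-1] += 10.0                    # an unreasonably extreme data point

def squared_error(w):            # w = [slope, intercept]
    return np.sum((y - w[0] * x - w[1]) ** 2)

def absolute_error(w):
    return np.sum(np.abs(y - w[0] * x - w[1]))

# Nelder-Mead is derivative-free, so it copes with the kink in |.|.
opts = {"maxiter": 5000, "maxfev": 5000}
w_sq = minimize(squared_error, x0=[0.0, 0.0], method="Nelder-Mead", options=opts).x
w_ab = minimize(absolute_error, x0=[0.0, 0.0], method="Nelder-Mead", options=opts).x

# The absolute-error slope stays near the true value of 3, whereas the
# squared-error slope is pulled toward the outlier.
print(abs(w_ab[0] - 3.0) < abs(w_sq[0] - 3.0))
```

Because the outlier's contribution grows only linearly (not quadratically) with its distance from the fit, the absolute-error solution largely ignores it.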
- The difference between squared error and absolute error can be framed in terms of the difference between the mean and the median: the mean of a set of data points is the number that minimizes squared error (that is, the sum of the squares of the differences between the number and each of the data points), whereas the median of a set of data points is the number that minimizes absolute error (that is, the sum of the absolute differences between the number and each of the data points). Thus, we can view absolute error as an error metric that is potentially more robust than squared error. Note, however, that there are some disadvantages of absolute error: first, there is no analytic solution for linear models, and second, error surfaces quantifying absolute error may be less well-behaved (e.g. more local minima) than error surfaces quantifying squared error.
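The mean/median framing above can be verified with a brute-force search over candidate "constant models". This is an illustrative Python/NumPy sketch with made-up data (in MATLAB, mean and median play the same roles); the candidate grid and data values are assumptions.

```python
import numpy as np

# Hypothetical data with one outlier.
data = np.array([1.0, 2.0, 2.5, 3.0, 100.0])

# Which single number best summarizes the data? Evaluate both error
# metrics over a fine grid of candidate numbers.
candidates = np.linspace(0, 100, 100001)
sq_err = ((data[:, None] - candidates[None, :]) ** 2).sum(axis=0)
ab_err = np.abs(data[:, None] - candidates[None, :]).sum(axis=0)

best_sq = candidates[np.argmin(sq_err)]   # minimizer of squared error
best_ab = candidates[np.argmin(ab_err)]   # minimizer of absolute error

print(best_sq, np.mean(data))    # squared-error minimizer matches the mean
print(best_ab, np.median(data))  # absolute-error minimizer matches the median
```

Note how the outlier drags the squared-error minimizer (the mean) far from the bulk of the data, while the absolute-error minimizer (the median) stays put, which is exactly the robustness property discussed above.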