Linear Methods for Regression


3.1 Introduction

A linear regression model assumes that the regression function E(Y | X) is linear in the inputs X_1, ..., X_p. Linear models were largely developed in the precomputer age of statistics, but even in today's computer era there are still good reasons to study and use them. They are simple and often provide an adequate and interpretable description of how the inputs affect the output. For prediction purposes they can sometimes outperform fancier nonlinear models, especially in situations with small numbers of training cases, low signal-to-noise ratio or sparse data. Finally, linear methods can be applied to transformations of the inputs and this considerably expands their scope. These generalizations are sometimes called basis-function methods, and are discussed in Chapter 5.

In this chapter we describe linear methods for regression, while in the next chapter we discuss linear methods for classification. On some topics we go into considerable detail, as it is our firm belief that an understanding of linear methods is essential for understanding nonlinear ones. In fact, many nonlinear techniques are direct generalizations of the linear methods discussed here.

3.2 Linear Regression Models and Least Squares

As introduced in Chapter 2, we have an input vector X^T = (X_1, X_2, ..., X_p), and want to predict a real-valued output Y. The linear regression model has the form

f(X) = β_0 + Σ_{j=1}^p X_j β_j.   (3.1)

The linear model either assumes that the regression function E(Y | X) is linear, or that the linear model is a reasonable approximation. Here the β_j's are unknown parameters or coefficients, and the variables X_j can come from different sources:

- quantitative inputs;
- transformations of quantitative inputs, such as log, square-root or square;
- basis expansions, such as X_2 = X_1², X_3 = X_1³, leading to a polynomial representation;
- numeric or dummy coding of the levels of qualitative inputs. For example, if G is a five-level factor input, we might create X_j, j = 1, ..., 5, such that X_j = I(G = j). Together this group of X_j represents the effect of G by a set of level-dependent constants, since in Σ_{j=1}^5 X_j β_j, one of the X_j's is one, and the others are zero;
- interactions between variables, for example, X_3 = X_1 · X_2.

No matter the source of the X_j, the model is linear in the parameters.

Typically we have a set of training data (x_1, y_1), ..., (x_N, y_N) from which to estimate the parameters β. Each x_i = (x_{i1}, x_{i2}, ..., x_{ip})^T is a vector of feature measurements for the ith case. The most popular estimation method is least squares, in which we pick the coefficients β = (β_0, β_1, ..., β_p)^T to minimize the residual sum of squares

RSS(β) = Σ_{i=1}^N (y_i − f(x_i))² = Σ_{i=1}^N ( y_i − β_0 − Σ_{j=1}^p x_{ij} β_j )².   (3.2)

From a statistical point of view, this criterion is reasonable if the training observations (x_i, y_i) represent independent random draws from their population. Even if the x_i's were not drawn randomly, the criterion is still valid if the y_i's are conditionally independent given the inputs x_i. Figure 3.1 illustrates the geometry of least-squares fitting in the IR^{p+1}-dimensional space occupied by the pairs (X, Y).

FIGURE 3.1. Linear least squares fitting with X ∈ IR². We seek the linear function of X that minimizes the sum of squared residuals from Y.

Note that (3.2) makes no assumptions about the validity of model (3.1); it simply finds the best linear fit to the data. Least squares fitting is intuitively satisfying no matter how the data arise; the criterion measures the average lack of fit.

How do we minimize (3.2)? Denote by X the N × (p+1) matrix with each row an input vector (with a 1 in the first position), and similarly let y be the N-vector of outputs in the training set. Then we can write the residual sum-of-squares as

RSS(β) = (y − Xβ)^T (y − Xβ).   (3.3)

This is a quadratic function in the p + 1 parameters. Differentiating with respect to β we obtain

∂RSS/∂β = −2 X^T (y − Xβ),
∂²RSS/∂β ∂β^T = 2 X^T X.   (3.4)

Assuming (for the moment) that X has full column rank, and hence X^T X is positive definite, we set the first derivative to zero,

X^T (y − Xβ) = 0,   (3.5)

to obtain the unique solution

β̂ = (X^T X)^{−1} X^T y.   (3.6)
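A minimal numerical sketch of (3.2)-(3.6), using NumPy and a synthetic data matrix (the variable names and toy data are illustrative, not from the text); in practice one solves the least squares problem via a QR or SVD routine rather than forming (X^T X)^{-1} explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3
# N x (p+1) input matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
beta_true = np.array([1.0, 2.0, 0.0, -1.5])
y = X @ beta_true + rng.normal(scale=0.5, size=N)

# Residual sum of squares, as in (3.2)-(3.3)
def rss(beta):
    r = y - X @ beta
    return r @ r

# Normal equations (3.5)-(3.6): beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent, and numerically preferable: lstsq solves the same problem via an SVD
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)
```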

FIGURE 3.2. The N-dimensional geometry of least squares regression with two predictors. The outcome vector y is orthogonally projected onto the hyperplane spanned by the input vectors x_1 and x_2. The projection ŷ represents the vector of the least squares predictions.

The predicted values at an input vector x_0 are given by f̂(x_0) = (1 : x_0)^T β̂; the fitted values at the training inputs are

ŷ = X β̂ = X (X^T X)^{−1} X^T y,   (3.7)

where ŷ_i = f̂(x_i). The matrix H = X (X^T X)^{−1} X^T appearing in equation (3.7) is sometimes called the "hat" matrix because it puts the hat on y.

Figure 3.2 shows a different geometrical representation of the least squares estimate, this time in IR^N. We denote the column vectors of X by x_0, x_1, ..., x_p, with x_0 ≡ 1. For much of what follows, this first column is treated like any other. These vectors span a subspace of IR^N, also referred to as the column space of X. We minimize RSS(β) = ||y − Xβ||² by choosing β̂ so that the residual vector y − ŷ is orthogonal to this subspace. This orthogonality is expressed in (3.5), and the resulting estimate ŷ is hence the orthogonal projection of y onto this subspace. The hat matrix H computes the orthogonal projection, and hence it is also known as a projection matrix.

It might happen that the columns of X are not linearly independent, so that X is not of full rank. This would occur, for example, if two of the inputs were perfectly correlated (e.g., x_2 = 3x_1). Then X^T X is singular and the least squares coefficients β̂ are not uniquely defined. However, the fitted values ŷ = X β̂ are still the projection of y onto the column space of X; there is just more than one way to express that projection in terms of the column vectors of X. The non-full-rank case occurs most often when one or more qualitative inputs are coded in a redundant fashion. There is usually a natural way to resolve the non-unique representation, by recoding and/or dropping redundant columns in X. Most regression software packages detect these redundancies and automatically implement some strategy for removing them.

Rank deficiencies can also occur in signal and image analysis, where the number of inputs p can exceed the number of training cases N. In this case, the features are typically reduced by filtering, or else the fitting is controlled by regularization (Chapter 18).

Up to now we have made minimal assumptions about the true distribution of the data. In order to pin down the sampling properties of β̂, we now assume that the observations y_i are uncorrelated and have constant variance σ², and that the x_i are fixed (non-random). The variance-covariance matrix of the least squares parameter estimates is easily derived from (3.6) and is given by

Var(β̂) = (X^T X)^{−1} σ².   (3.8)

Typically one estimates the variance σ² by

σ̂² = (1/(N − p − 1)) Σ_{i=1}^N (y_i − ŷ_i)².

The N − p − 1 rather than N in the denominator makes σ̂² an unbiased estimate of σ²: E(σ̂²) = σ².

To draw inferences about the parameters and the model, additional assumptions are needed. We now assume that (3.1) is the correct model for the mean; that is, the conditional expectation of Y is linear in X_1, ..., X_p. We also assume that the deviations of Y around its expectation are additive and Gaussian. Hence

Y = E(Y | X_1, ..., X_p) + ε = β_0 + Σ_{j=1}^p X_j β_j + ε,   (3.9)

where the error ε is a Gaussian random variable with expectation zero and variance σ², written ε ∼ N(0, σ²). Under (3.9), it is easy to show that

β̂ ∼ N(β, (X^T X)^{−1} σ²).   (3.10)

This is a multivariate normal distribution with mean vector and variance-covariance matrix as shown. Also

(N − p − 1) σ̂² ∼ σ² χ²_{N−p−1},   (3.11)

a chi-squared distribution with N − p − 1 degrees of freedom. In addition β̂ and σ̂² are statistically independent. We use these distributional properties to form tests of hypothesis and confidence intervals for the parameters β_j.

FIGURE 3.3. The tail probabilities Pr(|Z| > z) for three distributions: t_30, t_100 and standard normal. Shown are the appropriate quantiles for testing significance at the p = 0.05 and 0.01 levels. The difference between t and the standard normal becomes negligible for N bigger than about 100.

To test the hypothesis that a particular coefficient β_j = 0, we form the standardized coefficient or Z-score

z_j = β̂_j / (σ̂ √v_j),   (3.12)

where v_j is the jth diagonal element of (X^T X)^{−1}. Under the null hypothesis that β_j = 0, z_j is distributed as t_{N−p−1} (a t distribution with N − p − 1 degrees of freedom), and hence a large (absolute) value of z_j will lead to rejection of this null hypothesis. If σ̂ is replaced by a known value σ, then z_j would have a standard normal distribution. The difference between the tail quantiles of a t-distribution and a standard normal becomes negligible as the sample size increases, and so we typically use the normal quantiles (see Figure 3.3).

Often we need to test for the significance of groups of coefficients simultaneously. For example, to test if a categorical variable with k levels can be excluded from a model, we need to test whether the coefficients of the dummy variables used to represent the levels can all be set to zero. Here we use the F statistic,

F = [(RSS_0 − RSS_1)/(p_1 − p_0)] / [RSS_1/(N − p_1 − 1)],   (3.13)

where RSS_1 is the residual sum-of-squares for the least squares fit of the bigger model with p_1 + 1 parameters, and RSS_0 the same for the nested smaller model with p_0 + 1 parameters, having p_1 − p_0 parameters constrained to be zero.
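Continuing the numerical sketch above (with X, y and beta_hat as before), the quantities σ̂², v_j and the Z-scores of (3.12) can be computed directly; the variable names are illustrative assumptions, not from the text.

```python
import numpy as np
from scipy import stats  # for t-distribution tail probabilities

N, pp1 = X.shape                                   # pp1 = p + 1 columns, including the intercept
y_hat = X @ beta_hat
sigma2_hat = np.sum((y - y_hat) ** 2) / (N - pp1)  # unbiased estimate of sigma^2
v = np.diag(np.linalg.inv(X.T @ X))                # v_j: diagonal of (X^T X)^{-1}
se = np.sqrt(sigma2_hat * v)                       # standard errors of the beta_hat_j
z = beta_hat / se                                  # Z-scores, as in (3.12)
p_values = 2 * stats.t.sf(np.abs(z), df=N - pp1)   # two-sided t_{N-p-1} tail probabilities
```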

The F statistic measures the change in residual sum-of-squares per additional parameter in the bigger model, and it is normalized by an estimate of σ². Under the Gaussian assumptions, and the null hypothesis that the smaller model is correct, the F statistic will have an F_{p_1−p_0, N−p_1−1} distribution. It can be shown (Exercise 3.1) that the z_j in (3.12) are equivalent to the F statistic for dropping the single coefficient β_j from the model. For large N, the quantiles of F_{p_1−p_0, N−p_1−1} approach those of χ²_{p_1−p_0}/(p_1 − p_0).

Similarly, we can isolate β_j in (3.10) to obtain a 1 − 2α confidence interval for β_j:

( β̂_j − z^{(1−α)} v_j^{1/2} σ̂,  β̂_j + z^{(1−α)} v_j^{1/2} σ̂ ).   (3.14)

Here z^{(1−α)} is the 1 − α percentile of the normal distribution: z^{(1−0.025)} = 1.96, z^{(1−0.05)} = 1.645, etc. Hence the standard practice of reporting β̂ ± 2·se(β̂) amounts to an approximate 95% confidence interval. Even if the Gaussian error assumption does not hold, this interval will be approximately correct, with its coverage approaching 1 − 2α as the sample size N → ∞.

In a similar fashion we can obtain an approximate confidence set for the entire parameter vector β, namely

C_β = { β | (β̂ − β)^T X^T X (β̂ − β) ≤ σ̂² χ²_{p+1}^{(1−α)} },   (3.15)

where χ²_l^{(1−α)} is the 1 − α percentile of the chi-squared distribution on l degrees of freedom: for example, χ²_5^{(1−0.05)} = 11.1, χ²_5^{(1−0.1)} = 9.2. This confidence set for β generates a corresponding confidence set for the true function f(x) = x^T β, namely {x^T β | β ∈ C_β} (Exercise 3.2; see also Figure 5.4 for examples of confidence bands for functions).

3.2.1 Example: Prostate Cancer

The data for this example come from a study by Stamey et al. (1989). They examined the correlation between the level of prostate-specific antigen and a number of clinical measures in men who were about to receive a radical prostatectomy. The variables are log cancer volume (lcavol), log prostate weight (lweight), age, log of the amount of benign prostatic hyperplasia (lbph), seminal vesicle invasion (svi), log of capsular penetration (lcp), Gleason score (gleason), and percent of Gleason scores 4 or 5 (pgg45). The correlation matrix of the predictors given in Table 3.1 shows many strong correlations. Figure 1.1 (page 3) of Chapter 1 is a scatterplot matrix showing every pairwise plot between the variables. We see that svi is a binary variable, and gleason is an ordered categorical variable.

TABLE 3.1. Correlations of predictors in the prostate cancer data (rows: lweight, age, lbph, svi, lcp, gleason, pgg45; columns: lcavol, lweight, age, lbph, svi, lcp, gleason).

TABLE 3.2. Linear model fit to the prostate cancer data. The Z score is the coefficient divided by its standard error (3.12). Roughly a Z score larger than two in absolute value is significantly nonzero at the p = 0.05 level. (Columns: Term, Coefficient, Std. Error, Z Score; rows: Intercept, lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45.)

We see, for example, that both lcavol and lcp show a strong relationship with the response lpsa, and with each other. We need to fit the effects jointly to untangle the relationships between the predictors and the response.

We fit a linear model to the log of prostate-specific antigen, lpsa, after first standardizing the predictors to have unit variance. We randomly split the dataset into a training set of size 67 and a test set of size 30. We applied least squares estimation to the training set, producing the estimates, standard errors and Z-scores shown in Table 3.2. The Z-scores are defined in (3.12), and measure the effect of dropping that variable from the model. A Z-score greater than 2 in absolute value is approximately significant at the 5% level. (For our example, we have nine parameters, and the tail quantiles of the t_{67−9} distribution are ±2.002!)

The predictor lcavol shows the strongest effect, with lweight and svi also strong. Notice that lcp is not significant, once lcavol is in the model (when used in a model without lcavol, lcp is strongly significant). We can also test for the exclusion of a number of terms at once, using the F-statistic (3.13). For example, we consider dropping all the non-significant terms in Table 3.2, namely age, lcp, gleason, and pgg45.

We get

F = [(RSS_0 − 29.43)/(9 − 5)] / [29.43/(67 − 9)] = 1.67,   (3.16)

which has a p-value of 0.17 (Pr(F_{4,58} > 1.67) = 0.17), and hence is not significant.

The mean prediction error on the test data is roughly half that of the base error rate: prediction using the mean training value of lpsa has a test error of 1.057, which is called the base error rate. Hence the linear model reduces the base error rate by about 50%. We will return to this example later to compare various selection and shrinkage methods.

3.2.2 The Gauss-Markov Theorem

One of the most famous results in statistics asserts that the least squares estimates of the parameters β have the smallest variance among all linear unbiased estimates. We will make this precise here, and also make clear that the restriction to unbiased estimates is not necessarily a wise one. This observation will lead us to consider biased estimates such as ridge regression later in the chapter. We focus on estimation of any linear combination of the parameters θ = a^T β; for example, predictions f(x_0) = x_0^T β are of this form. The least squares estimate of a^T β is

θ̂ = a^T β̂ = a^T (X^T X)^{−1} X^T y.   (3.17)

Considering X to be fixed, this is a linear function c_0^T y of the response vector y. If we assume that the linear model is correct, a^T β̂ is unbiased since

E(a^T β̂) = E(a^T (X^T X)^{−1} X^T y) = a^T (X^T X)^{−1} X^T X β = a^T β.   (3.18)

The Gauss-Markov theorem states that if we have any other linear estimator θ̃ = c^T y that is unbiased for a^T β, that is, E(c^T y) = a^T β, then

Var(a^T β̂) ≤ Var(c^T y).   (3.19)

The proof (Exercise 3.3) uses the triangle inequality. For simplicity we have stated the result in terms of estimation of a single parameter a^T β, but with a few more definitions one can state it in terms of the entire parameter vector β (Exercise 3.3). Consider the mean squared error of an estimator θ̃ in estimating θ:

MSE(θ̃) = E(θ̃ − θ)² = Var(θ̃) + [E(θ̃) − θ]².   (3.20)

The first term is the variance, while the second term is the squared bias. The Gauss-Markov theorem implies that the least squares estimator has the smallest mean squared error of all linear estimators with no bias. However, there may well exist a biased estimator with smaller mean squared error. Such an estimator would trade a little bias for a larger reduction in variance. Biased estimates are commonly used. Any method that shrinks or sets to zero some of the least squares coefficients may result in a biased estimate. We discuss many examples, including variable subset selection and ridge regression, later in this chapter. From a more pragmatic point of view, most models are distortions of the truth, and hence are biased; picking the right model amounts to creating the right balance between bias and variance. We go into these issues in more detail in Chapter 7.

Mean squared error is intimately related to prediction accuracy, as discussed in Chapter 2. Consider the prediction of the new response at input x_0,

Y_0 = f(x_0) + ε_0.   (3.21)

Then the expected prediction error of an estimate f̃(x_0) = x_0^T β̃ is

E(Y_0 − f̃(x_0))² = σ² + E(x_0^T β̃ − f(x_0))² = σ² + MSE(f̃(x_0)).   (3.22)

Therefore, expected prediction error and mean squared error differ only by the constant σ², representing the variance of the new observation y_0.

3.2.3 Multiple Regression from Simple Univariate Regression

The linear model (3.1) with p > 1 inputs is called the multiple linear regression model. The least squares estimates (3.6) for this model are best understood in terms of the estimates for the univariate (p = 1) linear model, as we indicate in this section. Suppose first that we have a univariate model with no intercept, that is,

Y = Xβ + ε.   (3.23)

The least squares estimate and residuals are

β̂ = Σ_{i=1}^N x_i y_i / Σ_{i=1}^N x_i²,
r_i = y_i − x_i β̂.   (3.24)

In convenient vector notation, we let y = (y_1, ..., y_N)^T, x = (x_1, ..., x_N)^T and define

⟨x, y⟩ = Σ_{i=1}^N x_i y_i = x^T y,   (3.25)

the inner product between x and y.¹ Then we can write

β̂ = ⟨x, y⟩ / ⟨x, x⟩,
r = y − x β̂.   (3.26)

As we will see, this simple univariate regression provides the building block for multiple linear regression. Suppose next that the inputs x_1, x_2, ..., x_p (the columns of the data matrix X) are orthogonal; that is, ⟨x_j, x_k⟩ = 0 for all j ≠ k. Then it is easy to check that the multiple least squares estimates β̂_j are equal to ⟨x_j, y⟩/⟨x_j, x_j⟩, the univariate estimates. In other words, when the inputs are orthogonal, they have no effect on each other's parameter estimates in the model.

Orthogonal inputs occur most often with balanced, designed experiments (where orthogonality is enforced), but almost never with observational data. Hence we will have to orthogonalize them in order to carry this idea further. Suppose next that we have an intercept and a single input x. Then the least squares coefficient of x has the form

β̂_1 = ⟨x − x̄1, y⟩ / ⟨x − x̄1, x − x̄1⟩,   (3.27)

where x̄ = Σ_i x_i / N, and 1 = x_0, the vector of N ones. We can view the estimate (3.27) as the result of two applications of the simple regression (3.26). The steps are:

1. regress x on 1 to produce the residual z = x − x̄1;
2. regress y on the residual z to give the coefficient β̂_1.

In this procedure, "regress b on a" means a simple univariate regression of b on a with no intercept, producing coefficient γ̂ = ⟨a, b⟩/⟨a, a⟩ and residual vector b − γ̂ a. We say that b is adjusted for a, or is "orthogonalized" with respect to a.

Step 1 orthogonalizes x with respect to x_0 = 1. Step 2 is just a simple univariate regression, using the orthogonal predictors 1 and z. Figure 3.4 shows this process for two general inputs x_1 and x_2. The orthogonalization does not change the subspace spanned by x_1 and x_2, it simply produces an orthogonal basis for representing it.

This recipe generalizes to the case of p inputs, as shown in Algorithm 3.1. Note that the inputs z_0, ..., z_{j−1} in step 2 are orthogonal, hence the simple regression coefficients computed there are in fact also the multiple regression coefficients.

¹ The inner-product notation is suggestive of generalizations of linear regression to different metric spaces, as well as to probability spaces.

FIGURE 3.4. Least squares regression by orthogonalization of the inputs. The vector x_2 is regressed on the vector x_1, leaving the residual vector z. The regression of y on z gives the multiple regression coefficient of x_2. Adding together the projections of y on each of x_1 and z gives the least squares fit ŷ.

Algorithm 3.1 Regression by Successive Orthogonalization.
1. Initialize z_0 = x_0 = 1.
2. For j = 1, 2, ..., p:
   Regress x_j on z_0, z_1, ..., z_{j−1} to produce coefficients γ̂_{lj} = ⟨z_l, x_j⟩/⟨z_l, z_l⟩, l = 0, ..., j−1, and residual vector z_j = x_j − Σ_{k=0}^{j−1} γ̂_{kj} z_k.
3. Regress y on the residual z_p to give the estimate β̂_p.

The result of this algorithm is

β̂_p = ⟨z_p, y⟩ / ⟨z_p, z_p⟩.   (3.28)

Re-arranging the residual in step 2, we can see that each of the x_j is a linear combination of the z_k, k ≤ j. Since the z_j are all orthogonal, they form a basis for the column space of X, and hence the least squares projection onto this subspace is ŷ. Since z_p alone involves x_p (with coefficient 1), we see that the coefficient (3.28) is indeed the multiple regression coefficient of y on x_p. This key result exposes the effect of correlated inputs in multiple regression.
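A small sketch of Algorithm 3.1 in NumPy, assuming X already contains a leading column of ones; it returns the last coefficient β̂_p via (3.28). The function name is a placeholder chosen for illustration.

```python
import numpy as np

def last_coef_by_orthogonalization(X, y):
    """Algorithm 3.1: regression by successive orthogonalization.
    X is N x (p+1) with first column all ones; returns beta_hat_p as in (3.28)."""
    Z = [X[:, 0]]                                # z_0 = x_0 = 1
    for j in range(1, X.shape[1]):
        xj = X[:, j]
        resid = xj.copy()
        for zl in Z:                             # regress x_j on z_0, ..., z_{j-1}
            gamma = (zl @ xj) / (zl @ zl)        # gamma_hat_{lj} = <z_l, x_j>/<z_l, z_l>
            resid = resid - gamma * zl
        Z.append(resid)                          # the residual becomes z_j
    zp = Z[-1]
    return (zp @ y) / (zp @ zp)                  # beta_hat_p = <z_p, y>/<z_p, z_p>

# The result agrees with the last coordinate of the full least squares fit:
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.isclose(last_coef_by_orthogonalization(X, y), beta_full[-1])
```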

Note also that by rearranging the x_j, any one of them could be in the last position, and a similar result holds. Hence stated more generally, we have shown that the jth multiple regression coefficient is the univariate regression coefficient of y on x_{j·012...(j−1)(j+1)...p}, the residual after regressing x_j on x_0, x_1, ..., x_{j−1}, x_{j+1}, ..., x_p:

The multiple regression coefficient β̂_j represents the additional contribution of x_j on y, after x_j has been adjusted for x_0, x_1, ..., x_{j−1}, x_{j+1}, ..., x_p.

If x_p is highly correlated with some of the other x_k's, the residual vector z_p will be close to zero, and from (3.28) the coefficient β̂_p will be very unstable. This will be true for all the variables in the correlated set. In such situations, we might have all the Z-scores (as in Table 3.2) be small: any one of the set can be deleted, yet we cannot delete them all. From (3.28) we also obtain an alternate formula for the variance estimates (3.8),

Var(β̂_p) = σ²/⟨z_p, z_p⟩ = σ²/||z_p||².   (3.29)

In other words, the precision with which we can estimate β̂_p depends on the length of the residual vector z_p; this represents how much of x_p is unexplained by the other x_k's.

Algorithm 3.1 is known as the Gram-Schmidt procedure for multiple regression, and is also a useful numerical strategy for computing the estimates. We can obtain from it not just β̂_p, but also the entire multiple least squares fit, as shown in Exercise 3.4.

We can represent step 2 of Algorithm 3.1 in matrix form:

X = ZΓ,   (3.30)

where Z has as columns the z_j (in order), and Γ is the upper triangular matrix with entries γ̂_{kj}. Introducing the diagonal matrix D with jth diagonal entry D_{jj} = ||z_j||, we get

X = Z D^{−1} D Γ = QR,   (3.31)

the so-called QR decomposition of X. Here Q is an N × (p+1) orthogonal matrix, Q^T Q = I, and R is a (p+1) × (p+1) upper triangular matrix. The QR decomposition represents a convenient orthogonal basis for the column space of X. It is easy to see, for example, that the least squares solution is given by

β̂ = R^{−1} Q^T y,   (3.32)
ŷ = Q Q^T y.   (3.33)

Equation (3.32) is easy to solve because R is upper triangular (Exercise 3.4).
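The QR route of (3.31)-(3.33) can be sketched with NumPy's reduced QR factorization (a hypothetical small example, not the book's data):

```python
import numpy as np

Q, R = np.linalg.qr(X)                    # reduced QR: Q is N x (p+1), R is (p+1) x (p+1)
beta_hat = np.linalg.solve(R, Q.T @ y)    # (3.32); cheap because R is upper triangular
# scipy.linalg.solve_triangular(R, Q.T @ y) would exploit the triangular structure explicitly
y_hat = Q @ (Q.T @ y)                     # (3.33): projection of y onto the column space of X
```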

3.2.4 Multiple Outputs

Suppose we have multiple outputs Y_1, Y_2, ..., Y_K that we wish to predict from our inputs X_0, X_1, X_2, ..., X_p. We assume a linear model for each output

Y_k = β_{0k} + Σ_{j=1}^p X_j β_{jk} + ε_k   (3.34)
    = f_k(X) + ε_k.   (3.35)

With N training cases we can write the model in matrix notation

Y = XB + E.   (3.36)

Here Y is the N × K response matrix, with ik entry y_{ik}, X is the N × (p+1) input matrix, B is the (p+1) × K matrix of parameters and E is the N × K matrix of errors. A straightforward generalization of the univariate loss function (3.2) is

RSS(B) = Σ_{k=1}^K Σ_{i=1}^N (y_{ik} − f_k(x_i))²   (3.37)
       = tr[(Y − XB)^T (Y − XB)].   (3.38)

The least squares estimates have exactly the same form as before,

B̂ = (X^T X)^{−1} X^T Y.   (3.39)

Hence the coefficients for the kth outcome are just the least squares estimates in the regression of y_k on x_0, x_1, ..., x_p. Multiple outputs do not affect one another's least squares estimates.

If the errors ε = (ε_1, ..., ε_K) in (3.34) are correlated, then it might seem appropriate to modify (3.37) in favor of a multivariate version. Specifically, suppose Cov(ε) = Σ; then the multivariate weighted criterion

RSS(B; Σ) = Σ_{i=1}^N (y_i − f(x_i))^T Σ^{−1} (y_i − f(x_i))   (3.40)

arises naturally from multivariate Gaussian theory. Here f(x) is the vector function (f_1(x), ..., f_K(x)), and y_i the vector of K responses for observation i. However, it can be shown that again the solution is given by (3.39): K separate regressions that ignore the correlations (Exercise 3.11). If the Σ_i vary among observations, then this is no longer the case, and the solution for B no longer decouples.

In Section 3.7 we pursue the multiple outcome problem, and consider situations where it does pay to combine the regressions.

3.3 Subset Selection

There are two reasons why we are often not satisfied with the least squares estimates (3.6).

- The first is prediction accuracy: the least squares estimates often have low bias but large variance. Prediction accuracy can sometimes be improved by shrinking or setting some coefficients to zero. By doing so we sacrifice a little bit of bias to reduce the variance of the predicted values, and hence may improve the overall prediction accuracy.
- The second reason is interpretation. With a large number of predictors, we often would like to determine a smaller subset that exhibits the strongest effects. In order to get the "big picture," we are willing to sacrifice some of the small details.

In this section we describe a number of approaches to variable subset selection with linear regression. In later sections we discuss shrinkage and hybrid approaches for controlling variance, as well as other dimension-reduction strategies. These all fall under the general heading model selection. Model selection is not restricted to linear models; Chapter 7 covers this topic in some detail.

With subset selection we retain only a subset of the variables, and eliminate the rest from the model. Least squares regression is used to estimate the coefficients of the inputs that are retained. There are a number of different strategies for choosing the subset.

3.3.1 Best-Subset Selection

Best subset regression finds for each k ∈ {0, 1, 2, ..., p} the subset of size k that gives smallest residual sum of squares (3.2). An efficient algorithm, the leaps and bounds procedure (Furnival and Wilson, 1974), makes this feasible for p as large as 30 or 40. Figure 3.5 shows all the subset models for the prostate cancer example. The lower boundary represents the models that are eligible for selection by the best-subsets approach. Note that the best subset of size 2, for example, need not include the variable that was in the best subset of size 1 (for this example all the subsets are nested). The best-subset curve (red lower boundary in Figure 3.5) is necessarily decreasing, so cannot be used to select the subset size k. The question of how to choose k involves the tradeoff between bias and variance, along with the more subjective desire for parsimony. There are a number of criteria that one may use; typically we choose the smallest model that minimizes an estimate of the expected prediction error.

Many of the other approaches that we discuss in this chapter are similar, in that they use the training data to produce a sequence of models varying in complexity and indexed by a single parameter. In the next section we use cross-validation to estimate prediction error and select k; the AIC criterion is a popular alternative. We defer more detailed discussion of these and other approaches to Chapter 7.

FIGURE 3.5. All possible subset models for the prostate cancer example. At each subset size k is shown the residual sum-of-squares for each model of that size.

3.3.2 Forward- and Backward-Stepwise Selection

Rather than search through all possible subsets (which becomes infeasible for p much larger than 40), we can seek a good path through them. Forward-stepwise selection starts with the intercept, and then sequentially adds into the model the predictor that most improves the fit. With many candidate predictors, this might seem like a lot of computation; however, clever updating algorithms can exploit the QR decomposition for the current fit to rapidly establish the next candidate (Exercise 3.9). Like best-subset regression, forward stepwise produces a sequence of models indexed by k, the subset size, which must be determined.

Forward-stepwise selection is a greedy algorithm, producing a nested sequence of models. In this sense it might seem sub-optimal compared to best-subset selection. However, there are several reasons why it might be preferred:

- Computational: for large p we cannot compute the best subset sequence, but we can always compute the forward stepwise sequence (even when p ≫ N).
- Statistical: a price is paid in variance for selecting the best subset of each size; forward stepwise is a more constrained search, and will have lower variance, but perhaps more bias.

FIGURE 3.6. Comparison of four subset-selection techniques (best subset, forward stepwise, backward stepwise, forward stagewise) on a simulated linear regression problem Y = X^T β + ε. There are N = 300 observations on p = 31 standard Gaussian variables, with all pairwise correlations equal. For 10 of the variables, the coefficients are drawn at random from a N(0, 0.4) distribution; the rest are zero. The noise is ε ∼ N(0, 6.25). Results are averaged over 50 simulations. Shown is the mean-squared error of the estimated coefficient β̂(k) at each step from the true β, as a function of the subset size k.

Backward-stepwise selection starts with the full model, and sequentially deletes the predictor that has the least impact on the fit. The candidate for dropping is the variable with the smallest Z-score (Exercise 3.10). Backward selection can only be used when N > p, while forward stepwise can always be used.

Figure 3.6 shows the results of a small simulation study to compare best-subset regression with the simpler alternatives forward and backward selection. Their performance is very similar, as is often the case. Included in the figure is forward stagewise regression (next section), which takes longer to reach minimum error.
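A plain (unoptimized) sketch of forward-stepwise selection, choosing at each step the candidate predictor that most reduces the residual sum of squares; a serious implementation would update a QR decomposition rather than refitting from scratch. The function and variable names are illustrative assumptions.

```python
import numpy as np

def forward_stepwise(X, y, max_terms):
    """Greedy forward selection. X is N x p (no intercept column); an intercept is
    always included in each candidate fit. Returns the selected column indices in
    the order they entered the model."""
    N, p = X.shape
    active = []
    for _ in range(min(max_terms, p)):
        best_rss, best_j = np.inf, None
        for j in range(p):
            if j in active:
                continue
            cols = np.column_stack([np.ones(N)] + [X[:, k] for k in active + [j]])
            beta, *_ = np.linalg.lstsq(cols, y, rcond=None)
            rss = np.sum((y - cols @ beta) ** 2)
            if rss < best_rss:
                best_rss, best_j = rss, j
        active.append(best_j)
    return active
```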

On the prostate cancer example, best-subset, forward and backward selection all gave exactly the same sequence of terms.

Some software packages implement hybrid stepwise-selection strategies that consider both forward and backward moves at each step, and select the "best" of the two. For example, in R the step function uses the AIC criterion for weighing the choices, which takes proper account of the number of parameters fit; at each step an add or drop will be performed that minimizes the AIC score. Other more traditional packages base the selection on F-statistics, adding "significant" terms, and dropping "non-significant" terms. These are out of fashion, since they do not take proper account of the multiple testing issues. It is also tempting after a model search to print out a summary of the chosen model, such as in Table 3.2; however, the standard errors are not valid, since they do not account for the search process. The bootstrap (Section 8.2) can be useful in such settings.

Finally, we note that often variables come in groups (such as the dummy variables that code a multi-level categorical predictor). Smart stepwise procedures (such as step in R) will add or drop whole groups at a time, taking proper account of their degrees-of-freedom.

3.3.3 Forward-Stagewise Regression

Forward-stagewise regression (FS) is even more constrained than forward-stepwise regression. It starts like forward-stepwise regression, with an intercept equal to ȳ, and centered predictors with coefficients initially all 0. At each step the algorithm identifies the variable most correlated with the current residual. It then computes the simple linear regression coefficient of the residual on this chosen variable, and adds it to the current coefficient for that variable. This is continued until none of the variables have correlation with the residuals, i.e. the least-squares fit when N > p.

Unlike forward-stepwise regression, none of the other variables are adjusted when a term is added to the model. As a consequence, forward stagewise can take many more than p steps to reach the least squares fit, and historically has been dismissed as being inefficient. It turns out that this slow fitting can pay dividends in high-dimensional problems. We see in Section 3.8.1 that both forward stagewise and a variant which is slowed down even further are quite competitive, especially in very high-dimensional problems.

Forward-stagewise regression is included in Figure 3.6. In this example it takes over 1000 steps to get all the correlations below a small threshold. For subset size k, we plotted the error for the last step for which there were k nonzero coefficients. Although it catches up with the best fit, it takes longer to do so.
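A minimal sketch of forward-stagewise regression as just described, assuming the response has been centered and the predictors centered and standardized; the tolerance tol decides when remaining correlations are treated as zero. Names are illustrative, not from the text.

```python
import numpy as np

def forward_stagewise(X, y, tol=1e-4, max_steps=100000):
    """Forward-stagewise regression (FS). X is N x p with centered, standardized
    columns; y is centered (so the intercept ybar is handled separately)."""
    N, p = X.shape
    beta = np.zeros(p)
    r = y.copy()                                   # current residual
    for _ in range(max_steps):
        corr = X.T @ r / N                         # proportional to correlations for standardized X
        j = int(np.argmax(np.abs(corr)))           # variable most correlated with the residual
        if np.abs(corr[j]) < tol:
            break
        delta = (X[:, j] @ r) / (X[:, j] @ X[:, j])  # simple regression coefficient of r on x_j
        beta[j] += delta                           # add it to the current coefficient
        r = r - delta * X[:, j]                    # other coefficients are not adjusted
    return beta
```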

3.3.4 Prostate Cancer Data Example (Continued)

Table 3.3 shows the coefficients from a number of different selection and shrinkage methods. They are best-subset selection using an all-subsets search, ridge regression, the lasso, principal components regression and partial least squares. Each method has a complexity parameter, and this was chosen to minimize an estimate of prediction error based on tenfold cross-validation; full details are given in Section 7.10. Briefly, cross-validation works by dividing the training data randomly into ten equal parts. The learning method is fit, for a range of values of the complexity parameter, to nine-tenths of the data, and the prediction error is computed on the remaining one-tenth. This is done in turn for each one-tenth of the data, and the ten prediction error estimates are averaged. From this we obtain an estimated prediction error curve as a function of the complexity parameter (a code sketch of this recipe appears below).

Note that we have already divided these data into a training set of size 67 and a test set of size 30. Cross-validation is applied to the training set, since selecting the shrinkage parameter is part of the training process. The test set is there to judge the performance of the selected model.

The estimated prediction error curves are shown in Figure 3.7. Many of the curves are very flat over large ranges near their minimum. Included are estimated standard error bands for each estimated error rate, based on the ten error estimates computed by cross-validation. We have used the "one-standard-error" rule: we pick the most parsimonious model within one standard error of the minimum (Section 7.10, page 244). Such a rule acknowledges the fact that the tradeoff curve is estimated with error, and hence takes a conservative approach.

Best-subset selection chose to use the two predictors lcavol and lweight. The last two lines of the table give the average prediction error (and its estimated standard error) over the test set.

3.4 Shrinkage Methods

By retaining a subset of the predictors and discarding the rest, subset selection produces a model that is interpretable and has possibly lower prediction error than the full model. However, because it is a discrete process (variables are either retained or discarded) it often exhibits high variance, and so doesn't reduce the prediction error of the full model. Shrinkage methods are more continuous, and don't suffer as much from high variability.
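As a concrete illustration of the tenfold cross-validation recipe described above, here is a generic sketch that returns an estimated prediction error curve and its standard error over a grid of complexity-parameter values. The fit and predict callables are placeholders for whichever method is being tuned (for instance the ridge helpers sketched later in this section).

```python
import numpy as np

def cv_error_curve(X, y, lambdas, fit, predict, K=10, seed=0):
    """K-fold cross-validation. `fit(X, y, lam)` returns fitted parameters;
    `predict(params, X)` returns predictions. Returns the mean squared prediction
    error for each lambda, plus a crude standard error over the K folds."""
    N = X.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(N), K)
    errs = np.zeros((K, len(lambdas)))
    for k, test_idx in enumerate(folds):
        train_idx = np.setdiff1d(np.arange(N), test_idx)
        for i, lam in enumerate(lambdas):
            params = fit(X[train_idx], y[train_idx], lam)
            pred = predict(params, X[test_idx])
            errs[k, i] = np.mean((y[test_idx] - pred) ** 2)
    return errs.mean(axis=0), errs.std(axis=0) / np.sqrt(K)
```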

FIGURE 3.7. Estimated prediction error curves and their standard errors for the various selection and shrinkage methods: all subsets (versus subset size), ridge regression (versus degrees of freedom), lasso (versus shrinkage factor s), principal components regression (versus number of directions) and partial least squares (versus number of directions). Each curve is plotted as a function of the corresponding complexity parameter for that method. The horizontal axis has been chosen so that the model complexity increases as we move from left to right. The estimates of prediction error and their standard errors were obtained by tenfold cross-validation; full details are given in Section 7.10. The least complex model within one standard error of the best is chosen, indicated by the purple vertical broken lines.

TABLE 3.3. Estimated coefficients and test error results, for different subset and shrinkage methods applied to the prostate data. The blank entries correspond to variables omitted. (Columns: Term, LS, Best Subset, Ridge, Lasso, PCR, PLS; rows: Intercept, lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45, Test Error, Std Error.)

3.4.1 Ridge Regression

Ridge regression shrinks the regression coefficients by imposing a penalty on their size. The ridge coefficients minimize a penalized residual sum of squares,

β̂^{ridge} = argmin_β { Σ_{i=1}^N ( y_i − β_0 − Σ_{j=1}^p x_{ij} β_j )² + λ Σ_{j=1}^p β_j² }.   (3.41)

Here λ ≥ 0 is a complexity parameter that controls the amount of shrinkage: the larger the value of λ, the greater the amount of shrinkage. The coefficients are shrunk toward zero (and each other). The idea of penalizing by the sum-of-squares of the parameters is also used in neural networks, where it is known as weight decay (Chapter 11).

An equivalent way to write the ridge problem is

β̂^{ridge} = argmin_β Σ_{i=1}^N ( y_i − β_0 − Σ_{j=1}^p x_{ij} β_j )²   subject to   Σ_{j=1}^p β_j² ≤ t,   (3.42)

which makes explicit the size constraint on the parameters. There is a one-to-one correspondence between the parameters λ in (3.41) and t in (3.42). When there are many correlated variables in a linear regression model, their coefficients can become poorly determined and exhibit high variance. A wildly large positive coefficient on one variable can be canceled by a similarly large negative coefficient on its correlated cousin. By imposing a size constraint on the coefficients, as in (3.42), this problem is alleviated.

The ridge solutions are not equivariant under scaling of the inputs, and so one normally standardizes the inputs before solving (3.41).

In addition, notice that the intercept β_0 has been left out of the penalty term. Penalization of the intercept would make the procedure depend on the origin chosen for Y; that is, adding a constant c to each of the targets y_i would not simply result in a shift of the predictions by the same amount c. It can be shown (Exercise 3.5) that the solution to (3.41) can be separated into two parts, after reparametrization using centered inputs: each x_{ij} gets replaced by x_{ij} − x̄_j. We estimate β_0 by ȳ = (1/N) Σ_{i=1}^N y_i. The remaining coefficients get estimated by a ridge regression without intercept, using the centered x_{ij}. Henceforth we assume that this centering has been done, so that the input matrix X has p (rather than p + 1) columns.

Writing the criterion in (3.41) in matrix form,

RSS(λ) = (y − Xβ)^T (y − Xβ) + λ β^T β,   (3.43)

the ridge regression solutions are easily seen to be

β̂^{ridge} = (X^T X + λI)^{−1} X^T y,   (3.44)

where I is the p × p identity matrix. Notice that with the choice of quadratic penalty β^T β, the ridge regression solution is again a linear function of y. The solution adds a positive constant to the diagonal of X^T X before inversion. This makes the problem nonsingular, even if X^T X is not of full rank, and was the main motivation for ridge regression when it was first introduced in statistics (Hoerl and Kennard, 1970). Traditional descriptions of ridge regression start with definition (3.44). We choose to motivate it via (3.41) and (3.42), as these provide insight into how it works.

Figure 3.8 shows the ridge coefficient estimates for the prostate cancer example, plotted as functions of df(λ), the effective degrees of freedom implied by the penalty λ (defined in (3.50) on page 68). In the case of orthonormal inputs, the ridge estimates are just a scaled version of the least squares estimates, that is, β̂^{ridge} = β̂/(1 + λ).

Ridge regression can also be derived as the mean or mode of a posterior distribution, with a suitably chosen prior distribution. In detail, suppose y_i ∼ N(β_0 + x_i^T β, σ²), and the parameters β_j are each distributed as N(0, τ²), independently of one another. Then the (negative) log-posterior density of β, with τ² and σ² assumed known, is equal to the expression in curly braces in (3.41), with λ = σ²/τ² (Exercise 3.6). Thus the ridge estimate is the mode of the posterior distribution; since the distribution is Gaussian, it is also the posterior mean.

The singular value decomposition (SVD) of the centered input matrix X gives us some additional insight into the nature of ridge regression. This decomposition is extremely useful in the analysis of many statistical methods. The SVD of the N × p matrix X has the form

X = U D V^T.   (3.45)
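A direct sketch of the ridge solution (3.44), assuming the inputs have been centered (and typically standardized) and the response centered, so no intercept column appears in X. These two helpers match the interface of the cross-validation sketch given earlier; the names are placeholders.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Ridge coefficients (3.44): (X^T X + lam * I)^{-1} X^T y, for centered X and y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def predict_ridge(beta, X):
    return X @ beta
```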

FIGURE 3.8. Profiles of ridge coefficients (lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45) for the prostate cancer example, as the tuning parameter λ is varied. Coefficients are plotted versus df(λ), the effective degrees of freedom. A vertical line is drawn at df = 5.0, the value chosen by cross-validation.

Here U and V are N × p and p × p orthogonal matrices, with the columns of U spanning the column space of X, and the columns of V spanning the row space. D is a p × p diagonal matrix, with diagonal entries d_1 ≥ d_2 ≥ ... ≥ d_p ≥ 0 called the singular values of X. If one or more values d_j = 0, X is singular.

Using the singular value decomposition we can write the least squares fitted vector as

X β̂^{ls} = X (X^T X)^{−1} X^T y = U U^T y,   (3.46)

after some simplification. Note that U^T y are the coordinates of y with respect to the orthonormal basis U. Note also the similarity with (3.33); Q and U are generally different orthogonal bases for the column space of X (Exercise 3.8).

Now the ridge solutions are

X β̂^{ridge} = X (X^T X + λI)^{−1} X^T y
            = U D (D² + λI)^{−1} D U^T y
            = Σ_{j=1}^p u_j [d_j²/(d_j² + λ)] u_j^T y,   (3.47)

where the u_j are the columns of U. Note that since λ ≥ 0, we have d_j²/(d_j² + λ) ≤ 1. Like linear regression, ridge regression computes the coordinates of y with respect to the orthonormal basis U. It then shrinks these coordinates by the factors d_j²/(d_j² + λ). This means that a greater amount of shrinkage is applied to the coordinates of basis vectors with smaller d_j².

What does a small value of d_j² mean? The SVD of the centered matrix X is another way of expressing the principal components of the variables in X. The sample covariance matrix is given by S = X^T X/N, and from (3.45) we have

X^T X = V D² V^T,   (3.48)

which is the eigen decomposition of X^T X (and of S, up to a factor N). The eigenvectors v_j (columns of V) are also called the principal components (or Karhunen-Loeve) directions of X. The first principal component direction v_1 has the property that z_1 = Xv_1 has the largest sample variance amongst all normalized linear combinations of the columns of X. This sample variance is easily seen to be

Var(z_1) = Var(Xv_1) = d_1²/N,   (3.49)

and in fact z_1 = Xv_1 = u_1 d_1. The derived variable z_1 is called the first principal component of X, and hence u_1 is the normalized first principal component.
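The shrinkage view in (3.47) can be checked numerically: the ridge fit equals the sum of the projections onto the u_j, each shrunk by d_j²/(d_j² + λ). A brief sketch, assuming centered X and y and an illustrative value of λ:

```python
import numpy as np

lam = 10.0
U, d, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(d) V^T, the SVD of (3.45)
shrink = d**2 / (d**2 + lam)                       # factors d_j^2 / (d_j^2 + lam) in (3.47)
fit_svd = U @ (shrink * (U.T @ y))                 # ridge fitted vector via shrunken coordinates
fit_direct = X @ np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
assert np.allclose(fit_svd, fit_direct)            # agrees with the closed form (3.44)
```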

FIGURE 3.9. Principal components of some input data points (plotted as X_2 versus X_1, with the largest and smallest principal component directions indicated). The largest principal component is the direction that maximizes the variance of the projected data, and the smallest principal component minimizes that variance. Ridge regression projects y onto these components, and then shrinks the coefficients of the low-variance components more than the high-variance components.

Subsequent principal components z_j have maximum variance d_j²/N, subject to being orthogonal to the earlier ones. Conversely the last principal component has minimum variance. Hence the small singular values d_j correspond to directions in the column space of X having small variance, and ridge regression shrinks these directions the most.

Figure 3.9 illustrates the principal components of some data points in two dimensions. If we consider fitting a linear surface over this domain (the Y-axis is sticking out of the page), the configuration of the data allows us to determine its gradient more accurately in the long direction than the short. Ridge regression protects against the potentially high variance of gradients estimated in the short directions. The implicit assumption is that the response will tend to vary most in the directions of high variance of the inputs. This is often a reasonable assumption, since predictors are often chosen for study because they vary with the response variable, but need not hold in general.

In Figure 3.7 we have plotted the estimated prediction error versus the quantity

df(λ) = tr[X (X^T X + λI)^{−1} X^T] = tr(H_λ) = Σ_{j=1}^p d_j²/(d_j² + λ).   (3.50)

This monotone decreasing function of λ is the effective degrees of freedom of the ridge regression fit. Usually in a linear-regression fit with p variables, the degrees-of-freedom of the fit is p, the number of free parameters. The idea is that although all p coefficients in a ridge fit will be non-zero, they are fit in a restricted fashion controlled by λ. Note that df(λ) = p when λ = 0 (no regularization) and df(λ) → 0 as λ → ∞. Of course there is always an additional one degree of freedom for the intercept, which was removed a priori. This definition is motivated in more detail in Section 3.4.4 and Sections 7.4-7.6. In Figure 3.7 the minimum occurs at df(λ) = 5.0. Table 3.3 shows that ridge regression reduces the test error of the full least squares estimates by a small amount.

3.4.2 The Lasso

The lasso is a shrinkage method like ridge, with subtle but important differences. The lasso estimate is defined by

β̂^{lasso} = argmin_β Σ_{i=1}^N ( y_i − β_0 − Σ_{j=1}^p x_{ij} β_j )²   subject to   Σ_{j=1}^p |β_j| ≤ t.   (3.51)

Just as in ridge regression, we can re-parametrize the constant β_0 by standardizing the predictors; the solution for β̂_0 is ȳ, and thereafter we fit a model without an intercept (Exercise 3.5). In the signal processing literature, the lasso is also known as basis pursuit (Chen et al., 1998).

We can also write the lasso problem in the equivalent Lagrangian form

β̂^{lasso} = argmin_β { (1/2) Σ_{i=1}^N ( y_i − β_0 − Σ_{j=1}^p x_{ij} β_j )² + λ Σ_{j=1}^p |β_j| }.   (3.52)

Notice the similarity to the ridge regression problem (3.42) or (3.41): the L_2 ridge penalty Σ_1^p β_j² is replaced by the L_1 lasso penalty Σ_1^p |β_j|. This latter constraint makes the solutions nonlinear in the y_i, and there is no closed form expression as in ridge regression. Computing the lasso solution

is a quadratic programming problem, although we see in Section 3.4.4 that efficient algorithms are available for computing the entire path of solutions as λ is varied, with the same computational cost as for ridge regression. Because of the nature of the constraint, making t sufficiently small will cause some of the coefficients to be exactly zero. Thus the lasso does a kind of continuous subset selection. If t is chosen larger than t_0 = Σ_1^p |β̂_j| (where β̂_j = β̂_j^{ls}, the least squares estimates), then the lasso estimates are the β̂_j's. On the other hand, for t = t_0/2 say, the least squares coefficients are shrunk by about 50% on average. However, the nature of the shrinkage is not obvious, and we investigate it further below. Like the subset size in variable subset selection, or the penalty parameter in ridge regression, t should be adaptively chosen to minimize an estimate of expected prediction error.

In Figure 3.7, for ease of interpretation, we have plotted the lasso prediction error estimates versus the standardized parameter s = t/Σ_1^p |β̂_j|. A value ŝ ≈ 0.36 was chosen by 10-fold cross-validation; this caused four coefficients to be set to zero (fifth column of Table 3.3). The resulting model has the second lowest test error, slightly lower than the full least squares model, but the standard errors of the test error estimates (last line of Table 3.3) are fairly large.

Figure 3.10 shows the lasso coefficients as the standardized tuning parameter s = t/Σ_1^p |β̂_j| is varied. At s = 1.0 these are the least squares estimates; they decrease to 0 as s → 0. This decrease is not always strictly monotonic, although it is in this example. A vertical line is drawn at s = 0.36, the value chosen by cross-validation.

3.4.3 Discussion: Subset Selection, Ridge Regression and the Lasso

In this section we discuss and compare the three approaches discussed so far for restricting the linear regression model: subset selection, ridge regression and the lasso.

In the case of an orthonormal input matrix X the three procedures have explicit solutions. Each method applies a simple transformation to the least squares estimate β̂_j, as detailed in Table 3.4. Ridge regression does a proportional shrinkage. Lasso translates each coefficient by a constant factor λ, truncating at zero. This is called "soft thresholding," and is used in the context of wavelet-based smoothing in Section 5.9. Best-subset selection drops all variables with coefficients smaller than the Mth largest; this is a form of "hard-thresholding."

Back to the nonorthogonal case; some pictures help understand their relationship. Figure 3.11 depicts the lasso (left) and ridge regression (right) when there are only two parameters. The residual sum of squares has elliptical contours, centered at the full least squares estimate.

FIGURE 3.10. Profiles of lasso coefficients (lcavol, svi, lweight, pgg45, lbph, gleason, age, lcp), as the tuning parameter t is varied. Coefficients are plotted versus s = t/Σ_1^p |β̂_j|. A vertical line is drawn at s = 0.36, the value chosen by cross-validation. Compare Figure 3.8 on page 65; the lasso profiles hit zero, while those for ridge do not. The profiles are piece-wise linear, and so are computed only at the points displayed; see Section 3.4.4 for details.

TABLE 3.4. Estimators of β_j in the case of orthonormal columns of X. M and λ are constants chosen by the corresponding techniques; sign denotes the sign of its argument (±1), and x_+ denotes the "positive part" of x. Below the table, the estimators are shown by broken red lines; the 45° line in gray shows the unrestricted estimate for reference.

  Estimator               Formula
  Best subset (size M)    β̂_j · I(|β̂_j| ≥ |β̂_(M)|)
  Ridge                   β̂_j / (1 + λ)
  Lasso                   sign(β̂_j) (|β̂_j| − λ)_+

FIGURE 3.11. Estimation picture for the lasso (left) and ridge regression (right). Shown are contours of the error and constraint functions. The solid blue areas are the constraint regions |β_1| + |β_2| ≤ t and β_1² + β_2² ≤ t², respectively, while the red ellipses are the contours of the least squares error function.
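The three orthonormal-case transformations of Table 3.4, written as small functions; beta is the vector of least squares estimates, and M and lam play the roles described in the table.

```python
import numpy as np

def best_subset(beta, M):
    """Hard thresholding: keep the M largest coefficients in absolute value, zero the rest."""
    threshold = np.sort(np.abs(beta))[-M]
    return np.where(np.abs(beta) >= threshold, beta, 0.0)

def ridge_shrink(beta, lam):
    """Proportional shrinkage toward zero."""
    return beta / (1.0 + lam)

def lasso_soft_threshold(beta, lam):
    """Soft thresholding: sign(beta) * (|beta| - lam)_+."""
    return np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)
```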

The constraint region for ridge regression is the disk β_1² + β_2² ≤ t², while that for lasso is the diamond |β_1| + |β_2| ≤ t. Both methods find the first point where the elliptical contours hit the constraint region. Unlike the disk, the diamond has corners; if the solution occurs at a corner, then it has one parameter β_j equal to zero. When p > 2, the diamond becomes a rhomboid, and has many corners, flat edges and faces; there are many more opportunities for the estimated parameters to be zero.

We can generalize ridge regression and the lasso, and view them as Bayes estimates. Consider the criterion

β̃ = argmin_β { Σ_{i=1}^N ( y_i − β_0 − Σ_{j=1}^p x_{ij} β_j )² + λ Σ_{j=1}^p |β_j|^q }   (3.53)

for q ≥ 0. The contours of constant value of Σ_j |β_j|^q are shown in Figure 3.12, for the case of two inputs.

Thinking of |β_j|^q as the log-prior density for β_j, these are also the equi-contours of the prior distribution of the parameters. The value q = 0 corresponds to variable subset selection, as the penalty simply counts the number of nonzero parameters; q = 1 corresponds to the lasso, while q = 2 to ridge regression. Notice that for q ≤ 1, the prior is not uniform in direction, but concentrates more mass in the coordinate directions. The prior corresponding to the q = 1 case is an independent double exponential (or Laplace) distribution for each input, with density (1/2τ) exp(−|β|/τ) and τ = 1/λ. The case q = 1 (lasso) is the smallest q such that the constraint region is convex; non-convex constraint regions make the optimization problem more difficult.

In this view, the lasso, ridge regression and best subset selection are Bayes estimates with different priors. Note, however, that they are derived as posterior modes, that is, maximizers of the posterior. It is more common to use the mean of the posterior as the Bayes estimate. Ridge regression is also the posterior mean, but the lasso and best subset selection are not.

Looking again at the criterion (3.53), we might try using other values of q besides 0, 1, or 2. Although one might consider estimating q from the data, our experience is that it is not worth the effort for the extra variance incurred. Values of q ∈ (1, 2) suggest a compromise between the lasso and ridge regression. Although this is the case, with q > 1, |β_j|^q is differentiable at 0, and so does not share the ability of lasso (q = 1) for setting coefficients exactly to zero.

FIGURE 3.12. Contours of constant value of Σ_j |β_j|^q for given values of q (q = 4, 2, 1, 0.5, 0.1).

FIGURE 3.13. Contours of constant value of Σ_j |β_j|^q for q = 1.2 (left plot), and the elastic-net penalty Σ_j (α β_j² + (1 − α)|β_j|) for α = 0.2 (right plot). Although visually very similar, the elastic-net has sharp (non-differentiable) corners, while the q = 1.2 penalty does not.

Partly for this reason as well as for computational tractability, Zou and Hastie (2005) introduced the elastic-net penalty

λ Σ_{j=1}^p ( α β_j² + (1 − α) |β_j| ),   (3.54)

a different compromise between ridge and lasso. Figure 3.13 compares the L_q penalty with q = 1.2 and the elastic-net penalty with α = 0.2; it is hard to detect the difference by eye. The elastic-net selects variables like the lasso, and shrinks together the coefficients of correlated predictors like ridge. It also has considerable computational advantages over the L_q penalties. We discuss the elastic-net further in Chapter 18.

3.4.4 Least Angle Regression

Least angle regression (LAR) is a relative newcomer (Efron et al., 2004), and can be viewed as a kind of "democratic" version of forward stepwise regression (Section 3.3.2). As we will see, LAR is intimately connected with the lasso, and in fact provides an extremely efficient algorithm for computing the entire lasso path as in Figure 3.10.

Forward stepwise regression builds a model sequentially, adding one variable at a time. At each step, it identifies the best variable to include in the active set, and then updates the least squares fit to include all the active variables. Least angle regression uses a similar strategy, but only enters "as much" of a predictor as it deserves. At the first step it identifies the variable most correlated with the response. Rather than fit this variable completely, LAR moves the coefficient of this variable continuously toward its least-squares value (causing its correlation with the evolving residual to decrease in absolute value). As soon as another variable "catches up" in terms of correlation with the residual, the process is paused. The second variable then joins the active set, and their coefficients are moved together in a way that keeps their correlations tied and decreasing. This process is continued

until all the variables are in the model, and ends at the full least-squares fit. Algorithm 3.2 provides the details. The termination condition in step 5 requires some explanation. If p > N − 1, the LAR algorithm reaches a zero residual solution after N − 1 steps (the −1 is because we have centered the data).

Algorithm 3.2 Least Angle Regression.
1. Standardize the predictors to have mean zero and unit norm. Start with the residual r = y − ȳ, and β_1, β_2, ..., β_p = 0.
2. Find the predictor x_j most correlated with r.
3. Move β_j from 0 towards its least-squares coefficient ⟨x_j, r⟩, until some other competitor x_k has as much correlation with the current residual as does x_j.
4. Move β_j and β_k in the direction defined by their joint least squares coefficient of the current residual on (x_j, x_k), until some other competitor x_l has as much correlation with the current residual.
5. Continue in this way until all p predictors have been entered. After min(N − 1, p) steps, we arrive at the full least-squares solution.

Suppose A_k is the active set of variables at the beginning of the kth step, and let β_{A_k} be the coefficient vector for these variables at this step; there will be k − 1 nonzero values, and the one just entered will be zero. If r_k = y − X_{A_k} β_{A_k} is the current residual, then the direction for this step is

δ_k = (X_{A_k}^T X_{A_k})^{−1} X_{A_k}^T r_k.   (3.55)

The coefficient profile then evolves as β_{A_k}(α) = β_{A_k} + α·δ_k. Exercise 3.23 verifies that the directions chosen in this fashion do what is claimed: keep the correlations tied and decreasing. If the fit vector at the beginning of this step is f̂_k, then it evolves as f̂_k(α) = f̂_k + α·u_k, where u_k = X_{A_k} δ_k is the new fit direction. The name "least angle" arises from a geometrical interpretation of this process; u_k makes the smallest (and equal) angle with each of the predictors in A_k (Exercise 3.24). Figure 3.14 shows the absolute correlations decreasing and joining ranks with each step of the LAR algorithm, using simulated data.

By construction the coefficients in LAR change in a piecewise linear fashion. Figure 3.15 [left panel] shows the LAR coefficient profile evolving as a function of its L_1 arc length.²

² The L_1 arc-length of a differentiable curve β(s) for s ∈ [0, S] is given by TV(β, S) = ∫_0^S ||β̇(s)||_1 ds, where β̇(s) = ∂β(s)/∂s. For the piecewise-linear LAR coefficient profile, this amounts to summing the L_1 norms of the changes in coefficients from step to step.
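In practice the entire LAR or lasso path can be obtained with standard software; for example, scikit-learn's lars_path follows Algorithm 3.2 and its lasso modification. A brief sketch, with the caveat that the exact call shown reflects current scikit-learn and is an assumption about the reader's environment, not something from the text:

```python
import numpy as np
from sklearn.linear_model import lars_path

# X: N x p standardized predictors, y: centered response (step 1 of Algorithm 3.2)
alphas, active, coefs = lars_path(X, y, method="lar")            # pure least angle regression path
alphas_l, active_l, coefs_l = lars_path(X, y, method="lasso")    # lasso modification (Algorithm 3.2a)

# coefs has shape (p, number_of_knots): each column is the coefficient vector at one
# breakpoint, and the profile is piecewise linear between consecutive columns.
```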

FIGURE 3.14. Progression of the absolute correlations during each step of the LAR procedure, using a simulated data set with six predictors (labelled v1-v6). The labels at the top of the plot indicate which variables enter the active set at each step. The step lengths are measured in units of L_1 arc length.

FIGURE 3.15. Left panel shows the LAR coefficient profiles on the simulated data, as a function of the L_1 arc length. The right panel shows the lasso profile. They are identical until the dark-blue coefficient crosses zero at an arc length of about 18.

Note that we do not need to take small steps and recheck the correlations in step 3; using knowledge of the covariance of the predictors and the piecewise linearity of the algorithm, we can work out the exact step length at the beginning of each step (Exercise 3.25).

The right panel of Figure 3.15 shows the lasso coefficient profiles on the same data. They are almost identical to those in the left panel, and differ for the first time when the blue coefficient passes back through zero. For the prostate data, the LAR coefficient profile turns out to be identical to the lasso profile in Figure 3.10, which never crosses zero. These observations lead to a simple modification of the LAR algorithm that gives the entire lasso path, which is also piecewise-linear.

Algorithm 3.2a Least Angle Regression: Lasso Modification.
4a. If a non-zero coefficient hits zero, drop its variable from the active set of variables and recompute the current joint least squares direction.

The LAR(lasso) algorithm is extremely efficient, requiring the same order of computation as that of a single least squares fit using the p predictors. Least angle regression always takes p steps to get to the full least squares estimates. The lasso path can have more than p steps, although the two are often quite similar. Algorithm 3.2 with the lasso modification 3.2a is an efficient way of computing the solution to any lasso problem, especially when p ≫ N. Osborne et al. (2000a) also discovered a piecewise-linear path for computing the lasso, which they called a homotopy algorithm.

We now give a heuristic argument for why these procedures are so similar. Although the LAR algorithm is stated in terms of correlations, if the input features are standardized, it is equivalent and easier to work with inner-products. Suppose A is the active set of variables at some stage in the algorithm, tied in their absolute inner-product with the current residuals y − Xβ. We can express this as

x_j^T (y − Xβ) = γ · s_j,  ∀ j ∈ A,   (3.56)

where s_j ∈ {−1, 1} indicates the sign of the inner-product, and γ is the common value. Also |x_k^T (y − Xβ)| ≤ γ for k ∉ A. Now consider the lasso criterion (3.52), which we write in vector form

R(β) = (1/2) ||y − Xβ||²_2 + λ ||β||_1.   (3.57)

Let B be the active set of variables in the solution for a given value of λ. For these variables R(β) is differentiable, and the stationarity conditions give

x_j^T (y − Xβ) = λ · sign(β_j),  ∀ j ∈ B.   (3.58)

Comparing (3.58) with (3.56), we see that they are identical only if the sign of β_j matches the sign of the inner product. That is why the LAR

That is why the LAR algorithm and lasso start to differ when an active coefficient passes through zero; condition (3.58) is violated for that variable, and it is kicked out of the active set B. Exercise 3.23 shows that these equations imply a piecewise-linear coefficient profile as λ decreases. The stationarity conditions for the non-active variables require that

|x_k^T (y - Xβ)| ≤ λ, ∀ k ∉ B,  (3.59)

which again agrees with the LAR algorithm.

Figure 3.16 compares LAR and lasso to forward stepwise and stagewise regression. The setup is the same as in Figure 3.6 on page 59, except N = 100 here rather than 300, so the problem is more difficult. We see that the more aggressive forward stepwise starts to overfit quite early (well before the 10 true variables can enter the model), and ultimately performs worse than the slower forward stagewise regression. The behavior of LAR and lasso is similar to that of forward stagewise regression. Incremental forward stagewise is similar to LAR and lasso, and is described in Section 3.8.1.

Degrees-of-Freedom Formula for LAR and Lasso

Suppose that we fit a linear model via the least angle regression procedure, stopping at some number of steps k < p, or equivalently using a lasso bound t that produces a constrained version of the full least squares fit. How many parameters, or "degrees of freedom," have we used? Consider first a linear regression using a subset of k features. If this subset is prespecified in advance without reference to the training data, then the degrees of freedom used in the fitted model is defined to be k. Indeed, in classical statistics, the number of linearly independent parameters is what is meant by "degrees of freedom." Alternatively, suppose that we carry out a best subset selection to determine the "optimal" set of k predictors. Then the resulting model has k parameters, but in some sense we have used up more than k degrees of freedom.

We need a more general definition for the effective degrees of freedom of an adaptively fitted model. We define the degrees of freedom of the fitted vector ŷ = (ŷ_1, ŷ_2, ..., ŷ_N) as

df(ŷ) = (1/σ²) Σ_{i=1}^{N} Cov(ŷ_i, y_i).  (3.60)

Here Cov(ŷ_i, y_i) refers to the sampling covariance between the predicted value ŷ_i and its corresponding outcome value y_i. This makes intuitive sense: the harder that we fit to the data, the larger this covariance and hence df(ŷ). Expression (3.60) is a useful notion of degrees of freedom, one that can be applied to any model prediction ŷ.

FIGURE 3.16. Comparison of LAR and lasso with forward stepwise, forward stagewise (FS) and incremental forward stagewise (FS_0) regression. The setup is the same as in Figure 3.6, except N = 100 here rather than 300. Here the slower FS regression ultimately outperforms forward stepwise. LAR and lasso show similar behavior to FS and FS_0. Since the procedures take different numbers of steps (across simulation replicates and methods), we plot the MSE as a function of the fraction of total L_1 arc-length toward the least-squares fit.

This includes models that are adaptively fitted to the training data. This definition is motivated and discussed further in Chapter 7.

Now for a linear regression with k fixed predictors, it is easy to show that df(ŷ) = k. Likewise for ridge regression, this definition leads to the closed-form expression (3.50) on page 68: df(ŷ) = tr(S_λ). In both these cases, (3.60) is simple to evaluate because the fit ŷ = H_λ y is linear in y. If we think about definition (3.60) in the context of a best subset selection of size k, it seems clear that df(ŷ) will be larger than k, and this can be verified by estimating Cov(ŷ_i, y_i)/σ² directly by simulation. However there is no closed-form method for estimating df(ŷ) for best subset selection. For LAR and lasso, something magical happens. These techniques are adaptive in a smoother way than best subset selection, and hence estimation of degrees of freedom is more tractable. Specifically it can be shown that after the kth step of the LAR procedure, the effective degrees of freedom of the fit vector is exactly k.
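The covariance in (3.60) can be estimated directly by simulation, which is how the claim about best subset selection can be verified. Below is a minimal sketch under assumed Gaussian errors with known σ: it redraws y many times from a fixed design, fits both a prespecified k-variable model and a best subset of size k, and estimates Σ_i Cov(ŷ_i, y_i)/σ² for each. All names, sample sizes and coefficients are illustrative.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
N, p, k, sigma, reps = 30, 6, 2, 1.0, 2000
X = rng.standard_normal((N, p))
beta_true = np.array([1.0, 0.8, 0.6, 0.0, 0.0, 0.0])
mu = X @ beta_true                                   # fixed design, fixed mean

def fit(cols, y):
    """Least-squares fitted values from regressing y on the given columns of X."""
    Xs = X[:, list(cols)]
    b, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    return Xs @ b

Y = mu + sigma * rng.standard_normal((reps, N))      # repeated draws of y
yhat_fixed = np.empty((reps, N))                     # prespecified subset: first k columns
yhat_best = np.empty((reps, N))                      # best subset of size k, chosen per draw
for r in range(reps):
    y = Y[r]
    yhat_fixed[r] = fit(range(k), y)
    best = min(combinations(range(p), k),
               key=lambda S: np.sum((y - fit(S, y)) ** 2))
    yhat_best[r] = fit(best, y)

def df_hat(yhat):
    """Monte Carlo version of (3.60): sum_i Cov(yhat_i, y_i) / sigma^2."""
    return np.sum(np.mean((yhat - yhat.mean(0)) * (Y - mu), axis=0)) / sigma ** 2

print(df_hat(yhat_fixed))    # close to k = 2
print(df_hat(yhat_best))     # noticeably larger than k
```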

Now for the lasso, the (modified) LAR procedure often takes more than p steps, since predictors can drop out. Hence the definition is a little different; for the lasso, at any stage df(ŷ) approximately equals the number of predictors in the model. While this approximation works reasonably well anywhere in the lasso path, for each k it works best at the last model in the sequence that contains k predictors. A detailed study of the degrees of freedom for the lasso may be found in Zou et al. (2007).

3.5 Methods Using Derived Input Directions

In many situations we have a large number of inputs, often very correlated. The methods in this section produce a small number of linear combinations Z_m, m = 1, ..., M, of the original inputs X_j, and the Z_m are then used in place of the X_j as inputs in the regression. The methods differ in how the linear combinations are constructed.

3.5.1 Principal Components Regression

In this approach the linear combinations Z_m used are the principal components as defined in Section 3.4.1 above. Principal component regression forms the derived input columns z_m = X v_m, and then regresses y on z_1, z_2, ..., z_M for some M ≤ p. Since the z_m are orthogonal, this regression is just a sum of univariate regressions:

ŷ^pcr(M) = ȳ1 + Σ_{m=1}^{M} θ̂_m z_m,  (3.61)

where θ̂_m = ⟨z_m, y⟩ / ⟨z_m, z_m⟩. Since the z_m are each linear combinations of the original x_j, we can express the solution (3.61) in terms of coefficients of the x_j (Exercise 3.13):

β̂^pcr(M) = Σ_{m=1}^{M} θ̂_m v_m.  (3.62)

As with ridge regression, principal components depend on the scaling of the inputs, so typically we first standardize them. Note that if M = p, we would just get back the usual least squares estimates, since the columns of Z = UD span the column space of X. For M < p we get a reduced regression. We see that principal components regression is very similar to ridge regression: both operate via the principal components of the input matrix. Ridge regression shrinks the coefficients of the principal components (Figure 3.17), shrinking more depending on the size of the corresponding eigenvalue; principal components regression discards the p - M smallest eigenvalue components. Figure 3.17 illustrates this.
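Equations (3.61) and (3.62) amount to an SVD of the standardized inputs followed by univariate regressions on the leading components. Here is a minimal numpy sketch (illustrative data and names only); the last line also prints the ridge shrinkage factors d_j²/(d_j² + λ) referred to in Figure 3.17, to contrast smooth shrinkage with the truncation performed by PCR.

```python
import numpy as np

rng = np.random.default_rng(3)
N, p, M, lam = 80, 5, 2, 10.0

# Correlated inputs, then standardized as the text assumes
X = rng.standard_normal((N, p)) @ rng.standard_normal((p, p))
X = (X - X.mean(0)) / X.std(0)
y = X @ np.array([1.0, -1.0, 0.5, 0.0, 0.0]) + rng.standard_normal(N)
yc = y - y.mean()

# Principal components via the SVD X = U D V^T
U, d, Vt = np.linalg.svd(X, full_matrices=False)
Z = X @ Vt.T                                     # z_m = X v_m (orthogonal columns)

# (3.61): univariate regressions of y on the first M components
theta = (Z[:, :M].T @ yc) / np.sum(Z[:, :M] ** 2, axis=0)
yhat_pcr = y.mean() + Z[:, :M] @ theta

# (3.62): the same fit expressed as coefficients on the original x_j
beta_pcr = Vt.T[:, :M] @ theta
print(np.allclose(yhat_pcr, y.mean() + X @ beta_pcr))    # True

# Ridge regression, by contrast, shrinks every component by d_j^2/(d_j^2 + lambda)
print(d ** 2 / (d ** 2 + lam))
```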

FIGURE 3.17. Ridge regression shrinks the regression coefficients of the principal components, using shrinkage factors d_j²/(d_j² + λ) as in (3.47). Principal component regression truncates them. Shown are the shrinkage and truncation patterns corresponding to Figure 3.7, as a function of the principal component index.

In Figure 3.7 we see that cross-validation suggests seven terms; the resulting model has the lowest test error in Table 3.3.

3.5.2 Partial Least Squares

This technique also constructs a set of linear combinations of the inputs for regression, but unlike principal components regression it uses y (in addition to X) for this construction. Like principal component regression, partial least squares (PLS) is not scale invariant, so we assume that each x_j is standardized to have mean 0 and variance 1. PLS begins by computing φ̂_1j = ⟨x_j, y⟩ for each j. From this we construct the derived input z_1 = Σ_j φ̂_1j x_j, which is the first partial least squares direction. Hence in the construction of each z_m, the inputs are weighted by the strength of their univariate effect on y.³ The outcome y is regressed on z_1 giving coefficient θ̂_1, and then we orthogonalize x_1, ..., x_p with respect to z_1. We continue this process, until M ≤ p directions have been obtained. In this manner, partial least squares produces a sequence of derived, orthogonal inputs or directions z_1, z_2, ..., z_M. As with principal-component regression, if we were to construct all M = p directions, we would get back a solution equivalent to the usual least squares estimates; using M < p directions produces a reduced regression. The procedure is described fully in Algorithm 3.3.

³ Since the x_j are standardized, the first directions φ̂_1j are the univariate regression coefficients (up to an irrelevant constant); this is not the case for subsequent directions.
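A quick numerical check of the first direction and of the footnote's claim, using numpy with illustrative data: for standardized inputs, the weights φ̂_1j = ⟨x_j, y⟩ are proportional to the univariate least-squares coefficients of y on each x_j.

```python
import numpy as np

rng = np.random.default_rng(4)
N, p = 60, 4
X = rng.standard_normal((N, p))
X = (X - X.mean(0)) / X.std(0)          # standardized inputs
y = X @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.standard_normal(N)
y = y - y.mean()

phi1 = X.T @ y                           # phi_hat_{1j} = <x_j, y>
z1 = X @ phi1                            # first PLS direction

# Univariate LS coefficients are <x_j, y>/<x_j, x_j>; with unit-variance
# columns, <x_j, x_j> = N for every j, so phi1 is proportional to them.
uni = (X.T @ y) / np.sum(X ** 2, axis=0)
print(np.allclose(phi1, N * uni))        # True
```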

Algorithm 3.3 Partial Least Squares.
1. Standardize each x_j to have mean zero and variance one. Set ŷ^(0) = ȳ1, and x_j^(0) = x_j, j = 1, ..., p.
2. For m = 1, 2, ..., p
(a) z_m = Σ_{j=1}^{p} φ̂_mj x_j^(m-1), where φ̂_mj = ⟨x_j^(m-1), y⟩.
(b) θ̂_m = ⟨z_m, y⟩ / ⟨z_m, z_m⟩.
(c) ŷ^(m) = ŷ^(m-1) + θ̂_m z_m.
(d) Orthogonalize each x_j^(m-1) with respect to z_m: x_j^(m) = x_j^(m-1) - [⟨z_m, x_j^(m-1)⟩ / ⟨z_m, z_m⟩] z_m, j = 1, 2, ..., p.
3. Output the sequence of fitted vectors {ŷ^(m)}_1^p. Since the {z_l}_1^m are linear in the original x_j, so is ŷ^(m) = X β̂^pls(m). These linear coefficients can be recovered from the sequence of PLS transformations.

In the prostate cancer example, cross-validation chose M = 2 PLS directions in Figure 3.7. This produced the model given in the rightmost column of Table 3.3.

What optimization problem is partial least squares solving? Since it uses the response y to construct its directions, its solution path is a nonlinear function of y. It can be shown (Exercise 3.15) that partial least squares seeks directions that have high variance and have high correlation with the response, in contrast to principal components regression which keys only on high variance (Stone and Brooks, 1990; Frank and Friedman, 1993). In particular, the mth principal component direction v_m solves:

max_α Var(Xα)  (3.63)
subject to ‖α‖ = 1, α^T S v_l = 0, l = 1, ..., m-1,

where S is the sample covariance matrix of the x_j. The conditions α^T S v_l = 0 ensure that z_m = Xα is uncorrelated with all the previous linear combinations z_l = X v_l. The mth PLS direction φ̂_m solves:

max_α Corr²(y, Xα) Var(Xα)  (3.64)
subject to ‖α‖ = 1, α^T S φ̂_l = 0, l = 1, ..., m-1.

Further analysis reveals that the variance aspect tends to dominate, and so partial least squares behaves much like ridge regression and principal components regression. We discuss this further in the next section. If the input matrix X is orthogonal, then partial least squares finds the least squares estimates after m = 1 steps. Subsequent steps have no effect since the φ̂_mj are zero for m > 1 (Exercise 3.14). It can also be shown that the sequence of PLS coefficients for m = 1, 2, ..., p represents the conjugate gradient sequence for computing the least squares solutions (Exercise 3.18).
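Algorithm 3.3 translates almost line for line into numpy. The sketch below is a minimal, unoptimized illustration rather than a reference implementation (the function name and data are made up); the final check confirms that with M = p directions PLS reproduces the ordinary least squares fit, as stated above.

```python
import numpy as np

def pls(X, y, M):
    """Partial least squares fit following Algorithm 3.3; returns yhat^(M).

    X is assumed standardized (columns with mean 0, variance 1).
    """
    N, p = X.shape
    yhat = np.full(N, y.mean())          # yhat^(0) = ybar * 1
    Xm = X.copy()                        # the x_j^(m-1)
    for m in range(M):
        phi = Xm.T @ y                   # phi_hat_{mj} = <x_j^(m-1), y>
        z = Xm @ phi                     # z_m
        zz = z @ z
        theta = (z @ y) / zz             # theta_hat_m
        yhat = yhat + theta * z          # step 2(c)
        # step 2(d): orthogonalize each x_j^(m-1) with respect to z_m
        Xm = Xm - np.outer(z, (z @ Xm) / zz)
    return yhat

rng = np.random.default_rng(5)
N, p = 50, 6
X = rng.standard_normal((N, p))
X = (X - X.mean(0)) / X.std(0)
y = X @ rng.standard_normal(p) + rng.standard_normal(N)

# With M = p directions, PLS reproduces the least squares fit
# (slopes on the centered response, plus the intercept ybar).
b_ls, *_ = np.linalg.lstsq(X, y - y.mean(), rcond=None)
print(np.allclose(pls(X, y, p), y.mean() + X @ b_ls))   # True
```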

3.6 Discussion: A Comparison of the Selection and Shrinkage Methods

There are some simple settings where we can understand better the relationship between the different methods described above. Consider an example with two correlated inputs X_1 and X_2, with correlation ρ. We assume that the true regression coefficients are β_1 = 4 and β_2 = 2. Figure 3.18 shows the coefficient profiles for the different methods, as their tuning parameters are varied. The top panel has ρ = 0.5, the bottom panel ρ = -0.5. The tuning parameters for ridge and lasso vary over a continuous range, while best subset, PLS and PCR take just two discrete steps to the least squares solution. In the top panel, starting at the origin, ridge regression shrinks the coefficients together until it finally converges to least squares. PLS and PCR show similar behavior to ridge, although are discrete and more extreme. Best subset overshoots the solution and then backtracks. The behavior of the lasso is intermediate to the other methods. When the correlation is negative (lower panel), again PLS and PCR roughly track the ridge path, while all of the methods are more similar to one another.

It is interesting to compare the shrinkage behavior of these different methods. Recall that ridge regression shrinks all directions, but shrinks low-variance directions more. Principal components regression leaves M high-variance directions alone, and discards the rest. Interestingly, it can be shown that partial least squares also tends to shrink the low-variance directions, but can actually inflate some of the higher variance directions. This can make PLS a little unstable, and cause it to have slightly higher prediction error compared to ridge regression. A full study is given in Frank and Friedman (1993). These authors conclude that for minimizing prediction error, ridge regression is generally preferable to variable subset selection, principal components regression and partial least squares. However the improvement over the latter two methods was only slight.

To summarize, PLS, PCR and ridge regression tend to behave similarly. Ridge regression may be preferred because it shrinks smoothly, rather than in discrete steps. Lasso falls somewhere between ridge regression and best subset regression, and enjoys some of the properties of each.

FIGURE 3.18. Coefficient profiles from different methods for a simple problem: two inputs with correlation ±0.5, and the true regression coefficients β = (4, 2). (Top panel: ρ = 0.5; bottom panel: ρ = -0.5; axes are β_1 and β_2; the methods shown are PCR, Lasso, PLS, Ridge, Least Squares and Best Subset.)

3.7 Multiple Outcome Shrinkage and Selection

As noted in Section 3.2.4, the least squares estimates in a multiple-output linear model are simply the individual least squares estimates for each of the outputs. To apply selection and shrinkage methods in the multiple output case, one could apply a univariate technique individually to each outcome or simultaneously to all outcomes. With ridge regression, for example, we could apply formula (3.44) to each of the K columns of the outcome matrix Y, using possibly different parameters λ, or apply it to all columns using the same value of λ. The former strategy would allow different amounts of regularization to be applied to different outcomes but require estimation of k separate regularization parameters λ_1, ..., λ_k, while the latter would permit all k outputs to be used in estimating the sole regularization parameter λ.

Other more sophisticated shrinkage and selection strategies that exploit correlations in the different responses can be helpful in the multiple output case. Suppose for example that among the outputs we have

Y_k = f(X) + ε_k  (3.65)
Y_l = f(X) + ε_l;  (3.66)

i.e., (3.65) and (3.66) share the same structural part f(X) in their models. It is clear in this case that we should pool our observations on Y_k and Y_l to estimate the common f.

Combining responses is at the heart of canonical correlation analysis (CCA), a data reduction technique developed for the multiple output case. Similar to PCA, CCA finds a sequence of uncorrelated linear combinations Xv_m, m = 1, ..., M of the x_j, and a corresponding sequence of uncorrelated linear combinations Yu_m of the responses y_k, such that the correlations

Corr²(Yu_m, Xv_m)  (3.67)

are successively maximized. Note that at most M = min(k, p) directions can be found. The leading canonical response variates are those linear combinations (derived responses) best predicted by the x_j; in contrast, the trailing canonical variates can be poorly predicted by the x_j, and are candidates for being dropped. The CCA solution is computed using a generalized SVD of the sample cross-covariance matrix Y^T X/N (assuming Y and X are centered; Exercise 3.20). Reduced-rank regression (Izenman, 1975; van der Merwe and Zidek, 1980) formalizes this approach in terms of a regression model that explicitly pools information. Given an error covariance Cov(ε) = Σ, we solve the following
