Linear Methods for Regression


3.1 Introduction

A linear regression model assumes that the regression function E(Y | X) is linear in the inputs X_1, ..., X_p. Linear models were largely developed in the precomputer age of statistics, but even in today's computer era there are still good reasons to study and use them. They are simple and often provide an adequate and interpretable description of how the inputs affect the output. For prediction purposes they can sometimes outperform fancier nonlinear models, especially in situations with small numbers of training cases, low signal-to-noise ratio or sparse data. Finally, linear methods can be applied to transformations of the inputs and this considerably expands their scope. These generalizations are sometimes called basis-function methods, and are discussed in Chapter 5.

In this chapter we describe linear methods for regression, while in the next chapter we discuss linear methods for classification. On some topics we go into considerable detail, as it is our firm belief that an understanding of linear methods is essential for understanding nonlinear ones. In fact, many nonlinear techniques are direct generalizations of the linear methods discussed here.

3.2 Linear Regression Models and Least Squares

As introduced in Chapter 2, we have an input vector X^T = (X_1, X_2, ..., X_p), and want to predict a real-valued output Y. The linear regression model has the form

f(X) = β_0 + Σ_{j=1}^p X_j β_j.   (3.1)

The linear model either assumes that the regression function E(Y | X) is linear, or that the linear model is a reasonable approximation. Here the β_j's are unknown parameters or coefficients, and the variables X_j can come from different sources:

- quantitative inputs;
- transformations of quantitative inputs, such as log, square-root or square;
- basis expansions, such as X_2 = X_1², X_3 = X_1³, leading to a polynomial representation;
- numeric or dummy coding of the levels of qualitative inputs. For example, if G is a five-level factor input, we might create X_j, j = 1, ..., 5, such that X_j = I(G = j). Together this group of X_j represents the effect of G by a set of level-dependent constants, since in Σ_{j=1}^5 X_j β_j, one of the X_j's is one, and the others are zero;
- interactions between variables, for example, X_3 = X_1 · X_2.

No matter the source of the X_j, the model is linear in the parameters.

Typically we have a set of training data (x_1, y_1), ..., (x_N, y_N) from which to estimate the parameters β. Each x_i = (x_{i1}, x_{i2}, ..., x_{ip})^T is a vector of feature measurements for the ith case. The most popular estimation method is least squares, in which we pick the coefficients β = (β_0, β_1, ..., β_p)^T to minimize the residual sum of squares

RSS(β) = Σ_{i=1}^N (y_i − f(x_i))² = Σ_{i=1}^N ( y_i − β_0 − Σ_{j=1}^p x_{ij} β_j )².   (3.2)

From a statistical point of view, this criterion is reasonable if the training observations (x_i, y_i) represent independent random draws from their population. Even if the x_i's were not drawn randomly, the criterion is still valid if the y_i's are conditionally independent given the inputs x_i. Figure 3.1 illustrates the geometry of least-squares fitting in the IR^{p+1}-dimensional space occupied by the pairs (X, Y).

FIGURE 3.1. Linear least squares fitting with X ∈ IR². We seek the linear function of X that minimizes the sum of squared residuals from Y.

Note that (3.2) makes no assumptions about the validity of model (3.1); it simply finds the best linear fit to the data. Least squares fitting is intuitively satisfying no matter how the data arise; the criterion measures the average lack of fit.

How do we minimize (3.2)? Denote by X the N × (p+1) matrix with each row an input vector (with a 1 in the first position), and similarly let y be the N-vector of outputs in the training set. Then we can write the residual sum-of-squares as

RSS(β) = (y − Xβ)^T (y − Xβ).   (3.3)

This is a quadratic function in the p + 1 parameters. Differentiating with respect to β we obtain

∂RSS/∂β = −2 X^T (y − Xβ),
∂²RSS/∂β ∂β^T = 2 X^T X.   (3.4)

Assuming (for the moment) that X has full column rank, and hence X^T X is positive definite, we set the first derivative to zero,

X^T (y − Xβ) = 0,   (3.5)

to obtain the unique solution

β̂ = (X^T X)^{−1} X^T y.   (3.6)
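A minimal numerical sketch of (3.2)-(3.6), using NumPy and a synthetic data matrix (the variable names and toy data are illustrative, not from the text); in practice one solves the least squares problem via a QR or SVD routine rather than forming (X^T X)^{-1} explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3
# N x (p+1) input matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
beta_true = np.array([1.0, 2.0, 0.0, -1.5])
y = X @ beta_true + rng.normal(scale=0.5, size=N)

# Residual sum of squares, as in (3.2)-(3.3)
def rss(beta):
    r = y - X @ beta
    return r @ r

# Normal equations (3.5)-(3.6): beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent, and numerically preferable: lstsq solves the same problem via an SVD
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)
```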

FIGURE 3.2. The N-dimensional geometry of least squares regression with two predictors. The outcome vector y is orthogonally projected onto the hyperplane spanned by the input vectors x_1 and x_2. The projection ŷ represents the vector of the least squares predictions.

The predicted values at an input vector x_0 are given by f̂(x_0) = (1 : x_0)^T β̂; the fitted values at the training inputs are

ŷ = X β̂ = X (X^T X)^{−1} X^T y,   (3.7)

where ŷ_i = f̂(x_i). The matrix H = X (X^T X)^{−1} X^T appearing in equation (3.7) is sometimes called the "hat" matrix because it puts the hat on y.

Figure 3.2 shows a different geometrical representation of the least squares estimate, this time in IR^N. We denote the column vectors of X by x_0, x_1, ..., x_p, with x_0 ≡ 1. For much of what follows, this first column is treated like any other. These vectors span a subspace of IR^N, also referred to as the column space of X. We minimize RSS(β) = ||y − Xβ||² by choosing β̂ so that the residual vector y − ŷ is orthogonal to this subspace. This orthogonality is expressed in (3.5), and the resulting estimate ŷ is hence the orthogonal projection of y onto this subspace. The hat matrix H computes the orthogonal projection, and hence it is also known as a projection matrix.

It might happen that the columns of X are not linearly independent, so that X is not of full rank. This would occur, for example, if two of the inputs were perfectly correlated (e.g., x_2 = 3x_1). Then X^T X is singular and the least squares coefficients β̂ are not uniquely defined. However, the fitted values ŷ = X β̂ are still the projection of y onto the column space of X; there is just more than one way to express that projection in terms of the column vectors of X. The non-full-rank case occurs most often when one or more qualitative inputs are coded in a redundant fashion. There is usually a natural way to resolve the non-unique representation, by recoding and/or dropping redundant columns in X. Most regression software packages detect these redundancies and automatically implement some strategy for removing them.

Rank deficiencies can also occur in signal and image analysis, where the number of inputs p can exceed the number of training cases N. In this case, the features are typically reduced by filtering, or else the fitting is controlled by regularization (Chapter 18).

Up to now we have made minimal assumptions about the true distribution of the data. In order to pin down the sampling properties of β̂, we now assume that the observations y_i are uncorrelated and have constant variance σ², and that the x_i are fixed (non-random). The variance-covariance matrix of the least squares parameter estimates is easily derived from (3.6) and is given by

Var(β̂) = (X^T X)^{−1} σ².   (3.8)

Typically one estimates the variance σ² by

σ̂² = (1/(N − p − 1)) Σ_{i=1}^N (y_i − ŷ_i)².

The N − p − 1 rather than N in the denominator makes σ̂² an unbiased estimate of σ²: E(σ̂²) = σ².

To draw inferences about the parameters and the model, additional assumptions are needed. We now assume that (3.1) is the correct model for the mean; that is, the conditional expectation of Y is linear in X_1, ..., X_p. We also assume that the deviations of Y around its expectation are additive and Gaussian. Hence

Y = E(Y | X_1, ..., X_p) + ε = β_0 + Σ_{j=1}^p X_j β_j + ε,   (3.9)

where the error ε is a Gaussian random variable with expectation zero and variance σ², written ε ∼ N(0, σ²). Under (3.9), it is easy to show that

β̂ ∼ N(β, (X^T X)^{−1} σ²).   (3.10)

This is a multivariate normal distribution with mean vector and variance-covariance matrix as shown. Also

(N − p − 1) σ̂² ∼ σ² χ²_{N−p−1},   (3.11)

a chi-squared distribution with N − p − 1 degrees of freedom. In addition β̂ and σ̂² are statistically independent. We use these distributional properties to form tests of hypothesis and confidence intervals for the parameters β_j.

FIGURE 3.3. The tail probabilities Pr(|Z| > z) for three distributions: t_30, t_100 and standard normal. Shown are the appropriate quantiles for testing significance at the p = 0.05 and 0.01 levels. The difference between t and the standard normal becomes negligible for N bigger than about 100.

To test the hypothesis that a particular coefficient β_j = 0, we form the standardized coefficient or Z-score

z_j = β̂_j / (σ̂ √v_j),   (3.12)

where v_j is the jth diagonal element of (X^T X)^{−1}. Under the null hypothesis that β_j = 0, z_j is distributed as t_{N−p−1} (a t distribution with N − p − 1 degrees of freedom), and hence a large (absolute) value of z_j will lead to rejection of this null hypothesis. If σ̂ is replaced by a known value σ, then z_j would have a standard normal distribution. The difference between the tail quantiles of a t-distribution and a standard normal becomes negligible as the sample size increases, and so we typically use the normal quantiles (see Figure 3.3).

Often we need to test for the significance of groups of coefficients simultaneously. For example, to test if a categorical variable with k levels can be excluded from a model, we need to test whether the coefficients of the dummy variables used to represent the levels can all be set to zero. Here we use the F statistic,

F = [(RSS_0 − RSS_1)/(p_1 − p_0)] / [RSS_1/(N − p_1 − 1)],   (3.13)

where RSS_1 is the residual sum-of-squares for the least squares fit of the bigger model with p_1 + 1 parameters, and RSS_0 the same for the nested smaller model with p_0 + 1 parameters, having p_1 − p_0 parameters constrained to be zero.
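Continuing the numerical sketch above (with X, y and beta_hat as before), the quantities σ̂², v_j and the Z-scores of (3.12) can be computed directly; the variable names are illustrative assumptions, not from the text.

```python
import numpy as np
from scipy import stats  # for t-distribution tail probabilities

N, pp1 = X.shape                                   # pp1 = p + 1 columns, including the intercept
y_hat = X @ beta_hat
sigma2_hat = np.sum((y - y_hat) ** 2) / (N - pp1)  # unbiased estimate of sigma^2
v = np.diag(np.linalg.inv(X.T @ X))                # v_j: diagonal of (X^T X)^{-1}
se = np.sqrt(sigma2_hat * v)                       # standard errors of the beta_hat_j
z = beta_hat / se                                  # Z-scores, as in (3.12)
p_values = 2 * stats.t.sf(np.abs(z), df=N - pp1)   # two-sided t_{N-p-1} tail probabilities
```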

The F statistic measures the change in residual sum-of-squares per additional parameter in the bigger model, and it is normalized by an estimate of σ². Under the Gaussian assumptions, and the null hypothesis that the smaller model is correct, the F statistic will have an F_{p_1−p_0, N−p_1−1} distribution. It can be shown (Exercise 3.1) that the z_j in (3.12) are equivalent to the F statistic for dropping the single coefficient β_j from the model. For large N, the quantiles of F_{p_1−p_0, N−p_1−1} approach those of χ²_{p_1−p_0}/(p_1 − p_0).

Similarly, we can isolate β_j in (3.10) to obtain a 1 − 2α confidence interval for β_j:

( β̂_j − z^{(1−α)} v_j^{1/2} σ̂,  β̂_j + z^{(1−α)} v_j^{1/2} σ̂ ).   (3.14)

Here z^{(1−α)} is the 1 − α percentile of the normal distribution: z^{(1−0.025)} = 1.96, z^{(1−0.05)} = 1.645, etc. Hence the standard practice of reporting β̂ ± 2·se(β̂) amounts to an approximate 95% confidence interval. Even if the Gaussian error assumption does not hold, this interval will be approximately correct, with its coverage approaching 1 − 2α as the sample size N → ∞.

In a similar fashion we can obtain an approximate confidence set for the entire parameter vector β, namely

C_β = { β | (β̂ − β)^T X^T X (β̂ − β) ≤ σ̂² χ²_{p+1}^{(1−α)} },   (3.15)

where χ²_l^{(1−α)} is the 1 − α percentile of the chi-squared distribution on l degrees of freedom: for example, χ²_5^{(1−0.05)} = 11.1, χ²_5^{(1−0.1)} = 9.2. This confidence set for β generates a corresponding confidence set for the true function f(x) = x^T β, namely {x^T β | β ∈ C_β} (Exercise 3.2; see also Figure 5.4 for examples of confidence bands for functions).

3.2.1 Example: Prostate Cancer

The data for this example come from a study by Stamey et al. (1989). They examined the correlation between the level of prostate-specific antigen and a number of clinical measures in men who were about to receive a radical prostatectomy. The variables are log cancer volume (lcavol), log prostate weight (lweight), age, log of the amount of benign prostatic hyperplasia (lbph), seminal vesicle invasion (svi), log of capsular penetration (lcp), Gleason score (gleason), and percent of Gleason scores 4 or 5 (pgg45). The correlation matrix of the predictors given in Table 3.1 shows many strong correlations. Figure 1.1 (page 3) of Chapter 1 is a scatterplot matrix showing every pairwise plot between the variables. We see that svi is a binary variable, and gleason is an ordered categorical variable.

TABLE 3.1. Correlations of predictors in the prostate cancer data (rows: lweight, age, lbph, svi, lcp, gleason, pgg45; columns: lcavol, lweight, age, lbph, svi, lcp, gleason).

TABLE 3.2. Linear model fit to the prostate cancer data. The Z score is the coefficient divided by its standard error (3.12). Roughly a Z score larger than two in absolute value is significantly nonzero at the p = 0.05 level. (Columns: Term, Coefficient, Std. Error, Z Score; rows: Intercept, lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45.)

We see, for example, that both lcavol and lcp show a strong relationship with the response lpsa, and with each other. We need to fit the effects jointly to untangle the relationships between the predictors and the response.

We fit a linear model to the log of prostate-specific antigen, lpsa, after first standardizing the predictors to have unit variance. We randomly split the dataset into a training set of size 67 and a test set of size 30. We applied least squares estimation to the training set, producing the estimates, standard errors and Z-scores shown in Table 3.2. The Z-scores are defined in (3.12), and measure the effect of dropping that variable from the model. A Z-score greater than 2 in absolute value is approximately significant at the 5% level. (For our example, we have nine parameters, and the tail quantiles of the t_{67−9} distribution are ±2.002!)

The predictor lcavol shows the strongest effect, with lweight and svi also strong. Notice that lcp is not significant, once lcavol is in the model (when used in a model without lcavol, lcp is strongly significant). We can also test for the exclusion of a number of terms at once, using the F-statistic (3.13). For example, we consider dropping all the non-significant terms in Table 3.2, namely age, lcp, gleason, and pgg45.

We get

F = [(RSS_0 − 29.43)/(9 − 5)] / [29.43/(67 − 9)] = 1.67,   (3.16)

which has a p-value of 0.17 (Pr(F_{4,58} > 1.67) = 0.17), and hence is not significant.

The mean prediction error on the test data is roughly half that of the base error rate: prediction using the mean training value of lpsa has a test error of 1.057, which is called the base error rate. Hence the linear model reduces the base error rate by about 50%. We will return to this example later to compare various selection and shrinkage methods.

3.2.2 The Gauss-Markov Theorem

One of the most famous results in statistics asserts that the least squares estimates of the parameters β have the smallest variance among all linear unbiased estimates. We will make this precise here, and also make clear that the restriction to unbiased estimates is not necessarily a wise one. This observation will lead us to consider biased estimates such as ridge regression later in the chapter. We focus on estimation of any linear combination of the parameters θ = a^T β; for example, predictions f(x_0) = x_0^T β are of this form. The least squares estimate of a^T β is

θ̂ = a^T β̂ = a^T (X^T X)^{−1} X^T y.   (3.17)

Considering X to be fixed, this is a linear function c_0^T y of the response vector y. If we assume that the linear model is correct, a^T β̂ is unbiased since

E(a^T β̂) = E(a^T (X^T X)^{−1} X^T y) = a^T (X^T X)^{−1} X^T X β = a^T β.   (3.18)

The Gauss-Markov theorem states that if we have any other linear estimator θ̃ = c^T y that is unbiased for a^T β, that is, E(c^T y) = a^T β, then

Var(a^T β̂) ≤ Var(c^T y).   (3.19)

The proof (Exercise 3.3) uses the triangle inequality. For simplicity we have stated the result in terms of estimation of a single parameter a^T β, but with a few more definitions one can state it in terms of the entire parameter vector β (Exercise 3.3). Consider the mean squared error of an estimator θ̃ in estimating θ:

MSE(θ̃) = E(θ̃ − θ)² = Var(θ̃) + [E(θ̃) − θ]².   (3.20)

The first term is the variance, while the second term is the squared bias. The Gauss-Markov theorem implies that the least squares estimator has the smallest mean squared error of all linear estimators with no bias. However, there may well exist a biased estimator with smaller mean squared error. Such an estimator would trade a little bias for a larger reduction in variance. Biased estimates are commonly used. Any method that shrinks or sets to zero some of the least squares coefficients may result in a biased estimate. We discuss many examples, including variable subset selection and ridge regression, later in this chapter. From a more pragmatic point of view, most models are distortions of the truth, and hence are biased; picking the right model amounts to creating the right balance between bias and variance. We go into these issues in more detail in Chapter 7.

Mean squared error is intimately related to prediction accuracy, as discussed in Chapter 2. Consider the prediction of the new response at input x_0,

Y_0 = f(x_0) + ε_0.   (3.21)

Then the expected prediction error of an estimate f̃(x_0) = x_0^T β̃ is

E(Y_0 − f̃(x_0))² = σ² + E(x_0^T β̃ − f(x_0))² = σ² + MSE(f̃(x_0)).   (3.22)

Therefore, expected prediction error and mean squared error differ only by the constant σ², representing the variance of the new observation y_0.

3.2.3 Multiple Regression from Simple Univariate Regression

The linear model (3.1) with p > 1 inputs is called the multiple linear regression model. The least squares estimates (3.6) for this model are best understood in terms of the estimates for the univariate (p = 1) linear model, as we indicate in this section. Suppose first that we have a univariate model with no intercept, that is,

Y = Xβ + ε.   (3.23)

The least squares estimate and residuals are

β̂ = Σ_{i=1}^N x_i y_i / Σ_{i=1}^N x_i²,
r_i = y_i − x_i β̂.   (3.24)

In convenient vector notation, we let y = (y_1, ..., y_N)^T, x = (x_1, ..., x_N)^T and define

⟨x, y⟩ = Σ_{i=1}^N x_i y_i = x^T y,   (3.25)

the inner product between x and y.¹ Then we can write

β̂ = ⟨x, y⟩ / ⟨x, x⟩,
r = y − x β̂.   (3.26)

As we will see, this simple univariate regression provides the building block for multiple linear regression. Suppose next that the inputs x_1, x_2, ..., x_p (the columns of the data matrix X) are orthogonal; that is, ⟨x_j, x_k⟩ = 0 for all j ≠ k. Then it is easy to check that the multiple least squares estimates β̂_j are equal to ⟨x_j, y⟩/⟨x_j, x_j⟩, the univariate estimates. In other words, when the inputs are orthogonal, they have no effect on each other's parameter estimates in the model.

Orthogonal inputs occur most often with balanced, designed experiments (where orthogonality is enforced), but almost never with observational data. Hence we will have to orthogonalize them in order to carry this idea further. Suppose next that we have an intercept and a single input x. Then the least squares coefficient of x has the form

β̂_1 = ⟨x − x̄1, y⟩ / ⟨x − x̄1, x − x̄1⟩,   (3.27)

where x̄ = Σ_i x_i / N, and 1 = x_0, the vector of N ones. We can view the estimate (3.27) as the result of two applications of the simple regression (3.26). The steps are:

1. regress x on 1 to produce the residual z = x − x̄1;
2. regress y on the residual z to give the coefficient β̂_1.

In this procedure, "regress b on a" means a simple univariate regression of b on a with no intercept, producing coefficient γ̂ = ⟨a, b⟩/⟨a, a⟩ and residual vector b − γ̂ a. We say that b is adjusted for a, or is "orthogonalized" with respect to a.

Step 1 orthogonalizes x with respect to x_0 = 1. Step 2 is just a simple univariate regression, using the orthogonal predictors 1 and z. Figure 3.4 shows this process for two general inputs x_1 and x_2. The orthogonalization does not change the subspace spanned by x_1 and x_2, it simply produces an orthogonal basis for representing it.

This recipe generalizes to the case of p inputs, as shown in Algorithm 3.1. Note that the inputs z_0, ..., z_{j−1} in step 2 are orthogonal, hence the simple regression coefficients computed there are in fact also the multiple regression coefficients.

¹ The inner-product notation is suggestive of generalizations of linear regression to different metric spaces, as well as to probability spaces.

FIGURE 3.4. Least squares regression by orthogonalization of the inputs. The vector x_2 is regressed on the vector x_1, leaving the residual vector z. The regression of y on z gives the multiple regression coefficient of x_2. Adding together the projections of y on each of x_1 and z gives the least squares fit ŷ.

Algorithm 3.1 Regression by Successive Orthogonalization.
1. Initialize z_0 = x_0 = 1.
2. For j = 1, 2, ..., p:
   Regress x_j on z_0, z_1, ..., z_{j−1} to produce coefficients γ̂_{lj} = ⟨z_l, x_j⟩/⟨z_l, z_l⟩, l = 0, ..., j−1, and residual vector z_j = x_j − Σ_{k=0}^{j−1} γ̂_{kj} z_k.
3. Regress y on the residual z_p to give the estimate β̂_p.

The result of this algorithm is

β̂_p = ⟨z_p, y⟩ / ⟨z_p, z_p⟩.   (3.28)

Re-arranging the residual in step 2, we can see that each of the x_j is a linear combination of the z_k, k ≤ j. Since the z_j are all orthogonal, they form a basis for the column space of X, and hence the least squares projection onto this subspace is ŷ. Since z_p alone involves x_p (with coefficient 1), we see that the coefficient (3.28) is indeed the multiple regression coefficient of y on x_p. This key result exposes the effect of correlated inputs in multiple regression.
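A small sketch of Algorithm 3.1 in NumPy, assuming X already contains a leading column of ones; it returns the last coefficient β̂_p via (3.28). The function name is a placeholder chosen for illustration.

```python
import numpy as np

def last_coef_by_orthogonalization(X, y):
    """Algorithm 3.1: regression by successive orthogonalization.
    X is N x (p+1) with first column all ones; returns beta_hat_p as in (3.28)."""
    Z = [X[:, 0]]                                # z_0 = x_0 = 1
    for j in range(1, X.shape[1]):
        xj = X[:, j]
        resid = xj.copy()
        for zl in Z:                             # regress x_j on z_0, ..., z_{j-1}
            gamma = (zl @ xj) / (zl @ zl)        # gamma_hat_{lj} = <z_l, x_j>/<z_l, z_l>
            resid = resid - gamma * zl
        Z.append(resid)                          # the residual becomes z_j
    zp = Z[-1]
    return (zp @ y) / (zp @ zp)                  # beta_hat_p = <z_p, y>/<z_p, z_p>

# The result agrees with the last coordinate of the full least squares fit:
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.isclose(last_coef_by_orthogonalization(X, y), beta_full[-1])
```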

Note also that by rearranging the x_j, any one of them could be in the last position, and a similar result holds. Hence stated more generally, we have shown that the jth multiple regression coefficient is the univariate regression coefficient of y on x_{j·012...(j−1)(j+1)...p}, the residual after regressing x_j on x_0, x_1, ..., x_{j−1}, x_{j+1}, ..., x_p:

The multiple regression coefficient β̂_j represents the additional contribution of x_j on y, after x_j has been adjusted for x_0, x_1, ..., x_{j−1}, x_{j+1}, ..., x_p.

If x_p is highly correlated with some of the other x_k's, the residual vector z_p will be close to zero, and from (3.28) the coefficient β̂_p will be very unstable. This will be true for all the variables in the correlated set. In such situations, we might have all the Z-scores (as in Table 3.2) be small: any one of the set can be deleted, yet we cannot delete them all. From (3.28) we also obtain an alternate formula for the variance estimates (3.8),

Var(β̂_p) = σ²/⟨z_p, z_p⟩ = σ²/||z_p||².   (3.29)

In other words, the precision with which we can estimate β̂_p depends on the length of the residual vector z_p; this represents how much of x_p is unexplained by the other x_k's.

Algorithm 3.1 is known as the Gram-Schmidt procedure for multiple regression, and is also a useful numerical strategy for computing the estimates. We can obtain from it not just β̂_p, but also the entire multiple least squares fit, as shown in Exercise 3.4.

We can represent step 2 of Algorithm 3.1 in matrix form:

X = ZΓ,   (3.30)

where Z has as columns the z_j (in order), and Γ is the upper triangular matrix with entries γ̂_{kj}. Introducing the diagonal matrix D with jth diagonal entry D_{jj} = ||z_j||, we get

X = Z D^{−1} D Γ = QR,   (3.31)

the so-called QR decomposition of X. Here Q is an N × (p+1) orthogonal matrix, Q^T Q = I, and R is a (p+1) × (p+1) upper triangular matrix. The QR decomposition represents a convenient orthogonal basis for the column space of X. It is easy to see, for example, that the least squares solution is given by

β̂ = R^{−1} Q^T y,   (3.32)
ŷ = Q Q^T y.   (3.33)

Equation (3.32) is easy to solve because R is upper triangular (Exercise 3.4).
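The QR route of (3.31)-(3.33) can be sketched with NumPy's reduced QR factorization (a hypothetical small example, not the book's data):

```python
import numpy as np

Q, R = np.linalg.qr(X)                    # reduced QR: Q is N x (p+1), R is (p+1) x (p+1)
beta_hat = np.linalg.solve(R, Q.T @ y)    # (3.32); cheap because R is upper triangular
# scipy.linalg.solve_triangular(R, Q.T @ y) would exploit the triangular structure explicitly
y_hat = Q @ (Q.T @ y)                     # (3.33): projection of y onto the column space of X
```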

3.2.4 Multiple Outputs

Suppose we have multiple outputs Y_1, Y_2, ..., Y_K that we wish to predict from our inputs X_0, X_1, X_2, ..., X_p. We assume a linear model for each output

Y_k = β_{0k} + Σ_{j=1}^p X_j β_{jk} + ε_k   (3.34)
    = f_k(X) + ε_k.   (3.35)

With N training cases we can write the model in matrix notation

Y = XB + E.   (3.36)

Here Y is the N × K response matrix, with ik entry y_{ik}, X is the N × (p+1) input matrix, B is the (p+1) × K matrix of parameters and E is the N × K matrix of errors. A straightforward generalization of the univariate loss function (3.2) is

RSS(B) = Σ_{k=1}^K Σ_{i=1}^N (y_{ik} − f_k(x_i))²   (3.37)
       = tr[(Y − XB)^T (Y − XB)].   (3.38)

The least squares estimates have exactly the same form as before,

B̂ = (X^T X)^{−1} X^T Y.   (3.39)

Hence the coefficients for the kth outcome are just the least squares estimates in the regression of y_k on x_0, x_1, ..., x_p. Multiple outputs do not affect one another's least squares estimates.

If the errors ε = (ε_1, ..., ε_K) in (3.34) are correlated, then it might seem appropriate to modify (3.37) in favor of a multivariate version. Specifically, suppose Cov(ε) = Σ; then the multivariate weighted criterion

RSS(B; Σ) = Σ_{i=1}^N (y_i − f(x_i))^T Σ^{−1} (y_i − f(x_i))   (3.40)

arises naturally from multivariate Gaussian theory. Here f(x) is the vector function (f_1(x), ..., f_K(x)), and y_i the vector of K responses for observation i. However, it can be shown that again the solution is given by (3.39): K separate regressions that ignore the correlations (Exercise 3.11). If the Σ_i vary among observations, then this is no longer the case, and the solution for B no longer decouples.

In Section 3.7 we pursue the multiple outcome problem, and consider situations where it does pay to combine the regressions.

3.3 Subset Selection

There are two reasons why we are often not satisfied with the least squares estimates (3.6).

- The first is prediction accuracy: the least squares estimates often have low bias but large variance. Prediction accuracy can sometimes be improved by shrinking or setting some coefficients to zero. By doing so we sacrifice a little bit of bias to reduce the variance of the predicted values, and hence may improve the overall prediction accuracy.
- The second reason is interpretation. With a large number of predictors, we often would like to determine a smaller subset that exhibits the strongest effects. In order to get the "big picture," we are willing to sacrifice some of the small details.

In this section we describe a number of approaches to variable subset selection with linear regression. In later sections we discuss shrinkage and hybrid approaches for controlling variance, as well as other dimension-reduction strategies. These all fall under the general heading model selection. Model selection is not restricted to linear models; Chapter 7 covers this topic in some detail.

With subset selection we retain only a subset of the variables, and eliminate the rest from the model. Least squares regression is used to estimate the coefficients of the inputs that are retained. There are a number of different strategies for choosing the subset.

3.3.1 Best-Subset Selection

Best subset regression finds for each k ∈ {0, 1, 2, ..., p} the subset of size k that gives smallest residual sum of squares (3.2). An efficient algorithm, the leaps and bounds procedure (Furnival and Wilson, 1974), makes this feasible for p as large as 30 or 40. Figure 3.5 shows all the subset models for the prostate cancer example. The lower boundary represents the models that are eligible for selection by the best-subsets approach. Note that the best subset of size 2, for example, need not include the variable that was in the best subset of size 1 (for this example all the subsets are nested). The best-subset curve (red lower boundary in Figure 3.5) is necessarily decreasing, so cannot be used to select the subset size k. The question of how to choose k involves the tradeoff between bias and variance, along with the more subjective desire for parsimony. There are a number of criteria that one may use; typically we choose the smallest model that minimizes an estimate of the expected prediction error.

Many of the other approaches that we discuss in this chapter are similar, in that they use the training data to produce a sequence of models varying in complexity and indexed by a single parameter. In the next section we use cross-validation to estimate prediction error and select k; the AIC criterion is a popular alternative. We defer more detailed discussion of these and other approaches to Chapter 7.

FIGURE 3.5. All possible subset models for the prostate cancer example. At each subset size k is shown the residual sum-of-squares for each model of that size.

3.3.2 Forward- and Backward-Stepwise Selection

Rather than search through all possible subsets (which becomes infeasible for p much larger than 40), we can seek a good path through them. Forward-stepwise selection starts with the intercept, and then sequentially adds into the model the predictor that most improves the fit. With many candidate predictors, this might seem like a lot of computation; however, clever updating algorithms can exploit the QR decomposition for the current fit to rapidly establish the next candidate (Exercise 3.9). Like best-subset regression, forward stepwise produces a sequence of models indexed by k, the subset size, which must be determined.

Forward-stepwise selection is a greedy algorithm, producing a nested sequence of models. In this sense it might seem sub-optimal compared to best-subset selection. However, there are several reasons why it might be preferred:

- Computational: for large p we cannot compute the best subset sequence, but we can always compute the forward stepwise sequence (even when p ≫ N).
- Statistical: a price is paid in variance for selecting the best subset of each size; forward stepwise is a more constrained search, and will have lower variance, but perhaps more bias.

FIGURE 3.6. Comparison of four subset-selection techniques (best subset, forward stepwise, backward stepwise, forward stagewise) on a simulated linear regression problem Y = X^T β + ε. There are N = 300 observations on p = 31 standard Gaussian variables, with all pairwise correlations equal. For 10 of the variables, the coefficients are drawn at random from a N(0, 0.4) distribution; the rest are zero. The noise is ε ∼ N(0, 6.25). Results are averaged over 50 simulations. Shown is the mean-squared error of the estimated coefficient β̂(k) at each step from the true β, as a function of the subset size k.

Backward-stepwise selection starts with the full model, and sequentially deletes the predictor that has the least impact on the fit. The candidate for dropping is the variable with the smallest Z-score (Exercise 3.10). Backward selection can only be used when N > p, while forward stepwise can always be used.

Figure 3.6 shows the results of a small simulation study to compare best-subset regression with the simpler alternatives forward and backward selection. Their performance is very similar, as is often the case. Included in the figure is forward stagewise regression (next section), which takes longer to reach minimum error.
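A plain (unoptimized) sketch of forward-stepwise selection, choosing at each step the candidate predictor that most reduces the residual sum of squares; a serious implementation would update a QR decomposition rather than refitting from scratch. The function and variable names are illustrative assumptions.

```python
import numpy as np

def forward_stepwise(X, y, max_terms):
    """Greedy forward selection. X is N x p (no intercept column); an intercept is
    always included in each candidate fit. Returns the selected column indices in
    the order they entered the model."""
    N, p = X.shape
    active = []
    for _ in range(min(max_terms, p)):
        best_rss, best_j = np.inf, None
        for j in range(p):
            if j in active:
                continue
            cols = np.column_stack([np.ones(N)] + [X[:, k] for k in active + [j]])
            beta, *_ = np.linalg.lstsq(cols, y, rcond=None)
            rss = np.sum((y - cols @ beta) ** 2)
            if rss < best_rss:
                best_rss, best_j = rss, j
        active.append(best_j)
    return active
```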

On the prostate cancer example, best-subset, forward and backward selection all gave exactly the same sequence of terms.

Some software packages implement hybrid stepwise-selection strategies that consider both forward and backward moves at each step, and select the "best" of the two. For example, in R the step function uses the AIC criterion for weighing the choices, which takes proper account of the number of parameters fit; at each step an add or drop will be performed that minimizes the AIC score. Other more traditional packages base the selection on F-statistics, adding "significant" terms, and dropping "non-significant" terms. These are out of fashion, since they do not take proper account of the multiple testing issues. It is also tempting after a model search to print out a summary of the chosen model, such as in Table 3.2; however, the standard errors are not valid, since they do not account for the search process. The bootstrap (Section 8.2) can be useful in such settings.

Finally, we note that often variables come in groups (such as the dummy variables that code a multi-level categorical predictor). Smart stepwise procedures (such as step in R) will add or drop whole groups at a time, taking proper account of their degrees-of-freedom.

3.3.3 Forward-Stagewise Regression

Forward-stagewise regression (FS) is even more constrained than forward-stepwise regression. It starts like forward-stepwise regression, with an intercept equal to ȳ, and centered predictors with coefficients initially all 0. At each step the algorithm identifies the variable most correlated with the current residual. It then computes the simple linear regression coefficient of the residual on this chosen variable, and adds it to the current coefficient for that variable. This is continued until none of the variables have correlation with the residuals, i.e. the least-squares fit when N > p.

Unlike forward-stepwise regression, none of the other variables are adjusted when a term is added to the model. As a consequence, forward stagewise can take many more than p steps to reach the least squares fit, and historically has been dismissed as being inefficient. It turns out that this slow fitting can pay dividends in high-dimensional problems. We see in Section 3.8.1 that both forward stagewise and a variant which is slowed down even further are quite competitive, especially in very high-dimensional problems.

Forward-stagewise regression is included in Figure 3.6. In this example it takes over 1000 steps to get all the correlations below a small threshold. For subset size k, we plotted the error for the last step for which there were k nonzero coefficients. Although it catches up with the best fit, it takes longer to do so.
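A minimal sketch of forward-stagewise regression as just described, assuming the response has been centered and the predictors centered and standardized; the tolerance tol decides when remaining correlations are treated as zero. Names are illustrative, not from the text.

```python
import numpy as np

def forward_stagewise(X, y, tol=1e-4, max_steps=100000):
    """Forward-stagewise regression (FS). X is N x p with centered, standardized
    columns; y is centered (so the intercept ybar is handled separately)."""
    N, p = X.shape
    beta = np.zeros(p)
    r = y.copy()                                   # current residual
    for _ in range(max_steps):
        corr = X.T @ r / N                         # proportional to correlations for standardized X
        j = int(np.argmax(np.abs(corr)))           # variable most correlated with the residual
        if np.abs(corr[j]) < tol:
            break
        delta = (X[:, j] @ r) / (X[:, j] @ X[:, j])  # simple regression coefficient of r on x_j
        beta[j] += delta                           # add it to the current coefficient
        r = r - delta * X[:, j]                    # other coefficients are not adjusted
    return beta
```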

3.3.4 Prostate Cancer Data Example (Continued)

Table 3.3 shows the coefficients from a number of different selection and shrinkage methods. They are best-subset selection using an all-subsets search, ridge regression, the lasso, principal components regression and partial least squares. Each method has a complexity parameter, and this was chosen to minimize an estimate of prediction error based on tenfold cross-validation; full details are given in Section 7.10. Briefly, cross-validation works by dividing the training data randomly into ten equal parts. The learning method is fit, for a range of values of the complexity parameter, to nine-tenths of the data, and the prediction error is computed on the remaining one-tenth. This is done in turn for each one-tenth of the data, and the ten prediction error estimates are averaged. From this we obtain an estimated prediction error curve as a function of the complexity parameter (a code sketch of this recipe appears below).

Note that we have already divided these data into a training set of size 67 and a test set of size 30. Cross-validation is applied to the training set, since selecting the shrinkage parameter is part of the training process. The test set is there to judge the performance of the selected model.

The estimated prediction error curves are shown in Figure 3.7. Many of the curves are very flat over large ranges near their minimum. Included are estimated standard error bands for each estimated error rate, based on the ten error estimates computed by cross-validation. We have used the "one-standard-error" rule: we pick the most parsimonious model within one standard error of the minimum (Section 7.10, page 244). Such a rule acknowledges the fact that the tradeoff curve is estimated with error, and hence takes a conservative approach.

Best-subset selection chose to use the two predictors lcavol and lweight. The last two lines of the table give the average prediction error (and its estimated standard error) over the test set.

3.4 Shrinkage Methods

By retaining a subset of the predictors and discarding the rest, subset selection produces a model that is interpretable and has possibly lower prediction error than the full model. However, because it is a discrete process (variables are either retained or discarded) it often exhibits high variance, and so doesn't reduce the prediction error of the full model. Shrinkage methods are more continuous, and don't suffer as much from high variability.
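As a concrete illustration of the tenfold cross-validation recipe described above, here is a generic sketch that returns an estimated prediction error curve and its standard error over a grid of complexity-parameter values. The fit and predict callables are placeholders for whichever method is being tuned (for instance the ridge helpers sketched later in this section).

```python
import numpy as np

def cv_error_curve(X, y, lambdas, fit, predict, K=10, seed=0):
    """K-fold cross-validation. `fit(X, y, lam)` returns fitted parameters;
    `predict(params, X)` returns predictions. Returns the mean squared prediction
    error for each lambda, plus a crude standard error over the K folds."""
    N = X.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(N), K)
    errs = np.zeros((K, len(lambdas)))
    for k, test_idx in enumerate(folds):
        train_idx = np.setdiff1d(np.arange(N), test_idx)
        for i, lam in enumerate(lambdas):
            params = fit(X[train_idx], y[train_idx], lam)
            pred = predict(params, X[test_idx])
            errs[k, i] = np.mean((y[test_idx] - pred) ** 2)
    return errs.mean(axis=0), errs.std(axis=0) / np.sqrt(K)
```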

FIGURE 3.7. Estimated prediction error curves and their standard errors for the various selection and shrinkage methods: all subsets (versus subset size), ridge regression (versus degrees of freedom), lasso (versus shrinkage factor s), principal components regression (versus number of directions) and partial least squares (versus number of directions). Each curve is plotted as a function of the corresponding complexity parameter for that method. The horizontal axis has been chosen so that the model complexity increases as we move from left to right. The estimates of prediction error and their standard errors were obtained by tenfold cross-validation; full details are given in Section 7.10. The least complex model within one standard error of the best is chosen, indicated by the purple vertical broken lines.

TABLE 3.3. Estimated coefficients and test error results, for different subset and shrinkage methods applied to the prostate data. The blank entries correspond to variables omitted. (Columns: Term, LS, Best Subset, Ridge, Lasso, PCR, PLS; rows: Intercept, lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45, Test Error, Std Error.)

3.4.1 Ridge Regression

Ridge regression shrinks the regression coefficients by imposing a penalty on their size. The ridge coefficients minimize a penalized residual sum of squares,

β̂^{ridge} = argmin_β { Σ_{i=1}^N ( y_i − β_0 − Σ_{j=1}^p x_{ij} β_j )² + λ Σ_{j=1}^p β_j² }.   (3.41)

Here λ ≥ 0 is a complexity parameter that controls the amount of shrinkage: the larger the value of λ, the greater the amount of shrinkage. The coefficients are shrunk toward zero (and each other). The idea of penalizing by the sum-of-squares of the parameters is also used in neural networks, where it is known as weight decay (Chapter 11).

An equivalent way to write the ridge problem is

β̂^{ridge} = argmin_β Σ_{i=1}^N ( y_i − β_0 − Σ_{j=1}^p x_{ij} β_j )²   subject to   Σ_{j=1}^p β_j² ≤ t,   (3.42)

which makes explicit the size constraint on the parameters. There is a one-to-one correspondence between the parameters λ in (3.41) and t in (3.42). When there are many correlated variables in a linear regression model, their coefficients can become poorly determined and exhibit high variance. A wildly large positive coefficient on one variable can be canceled by a similarly large negative coefficient on its correlated cousin. By imposing a size constraint on the coefficients, as in (3.42), this problem is alleviated.

The ridge solutions are not equivariant under scaling of the inputs, and so one normally standardizes the inputs before solving (3.41).

In addition, notice that the intercept β_0 has been left out of the penalty term. Penalization of the intercept would make the procedure depend on the origin chosen for Y; that is, adding a constant c to each of the targets y_i would not simply result in a shift of the predictions by the same amount c. It can be shown (Exercise 3.5) that the solution to (3.41) can be separated into two parts, after reparametrization using centered inputs: each x_{ij} gets replaced by x_{ij} − x̄_j. We estimate β_0 by ȳ = (1/N) Σ_{i=1}^N y_i. The remaining coefficients get estimated by a ridge regression without intercept, using the centered x_{ij}. Henceforth we assume that this centering has been done, so that the input matrix X has p (rather than p + 1) columns.

Writing the criterion in (3.41) in matrix form,

RSS(λ) = (y − Xβ)^T (y − Xβ) + λ β^T β,   (3.43)

the ridge regression solutions are easily seen to be

β̂^{ridge} = (X^T X + λI)^{−1} X^T y,   (3.44)

where I is the p × p identity matrix. Notice that with the choice of quadratic penalty β^T β, the ridge regression solution is again a linear function of y. The solution adds a positive constant to the diagonal of X^T X before inversion. This makes the problem nonsingular, even if X^T X is not of full rank, and was the main motivation for ridge regression when it was first introduced in statistics (Hoerl and Kennard, 1970). Traditional descriptions of ridge regression start with definition (3.44). We choose to motivate it via (3.41) and (3.42), as these provide insight into how it works.

Figure 3.8 shows the ridge coefficient estimates for the prostate cancer example, plotted as functions of df(λ), the effective degrees of freedom implied by the penalty λ (defined in (3.50) on page 68). In the case of orthonormal inputs, the ridge estimates are just a scaled version of the least squares estimates, that is, β̂^{ridge} = β̂/(1 + λ).

Ridge regression can also be derived as the mean or mode of a posterior distribution, with a suitably chosen prior distribution. In detail, suppose y_i ∼ N(β_0 + x_i^T β, σ²), and the parameters β_j are each distributed as N(0, τ²), independently of one another. Then the (negative) log-posterior density of β, with τ² and σ² assumed known, is equal to the expression in curly braces in (3.41), with λ = σ²/τ² (Exercise 3.6). Thus the ridge estimate is the mode of the posterior distribution; since the distribution is Gaussian, it is also the posterior mean.

The singular value decomposition (SVD) of the centered input matrix X gives us some additional insight into the nature of ridge regression. This decomposition is extremely useful in the analysis of many statistical methods. The SVD of the N × p matrix X has the form

X = U D V^T.   (3.45)
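A direct sketch of the ridge solution (3.44), assuming the inputs have been centered (and typically standardized) and the response centered, so no intercept column appears in X. These two helpers match the interface of the cross-validation sketch given earlier; the names are placeholders.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Ridge coefficients (3.44): (X^T X + lam * I)^{-1} X^T y, for centered X and y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def predict_ridge(beta, X):
    return X @ beta
```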

FIGURE 3.8. Profiles of ridge coefficients (lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45) for the prostate cancer example, as the tuning parameter λ is varied. Coefficients are plotted versus df(λ), the effective degrees of freedom. A vertical line is drawn at df = 5.0, the value chosen by cross-validation.

Here U and V are N × p and p × p orthogonal matrices, with the columns of U spanning the column space of X, and the columns of V spanning the row space. D is a p × p diagonal matrix, with diagonal entries d_1 ≥ d_2 ≥ ... ≥ d_p ≥ 0 called the singular values of X. If one or more values d_j = 0, X is singular.

Using the singular value decomposition we can write the least squares fitted vector as

X β̂^{ls} = X (X^T X)^{−1} X^T y = U U^T y,   (3.46)

after some simplification. Note that U^T y are the coordinates of y with respect to the orthonormal basis U. Note also the similarity with (3.33); Q and U are generally different orthogonal bases for the column space of X (Exercise 3.8).

Now the ridge solutions are

X β̂^{ridge} = X (X^T X + λI)^{−1} X^T y
            = U D (D² + λI)^{−1} D U^T y
            = Σ_{j=1}^p u_j [d_j²/(d_j² + λ)] u_j^T y,   (3.47)

where the u_j are the columns of U. Note that since λ ≥ 0, we have d_j²/(d_j² + λ) ≤ 1. Like linear regression, ridge regression computes the coordinates of y with respect to the orthonormal basis U. It then shrinks these coordinates by the factors d_j²/(d_j² + λ). This means that a greater amount of shrinkage is applied to the coordinates of basis vectors with smaller d_j².

What does a small value of d_j² mean? The SVD of the centered matrix X is another way of expressing the principal components of the variables in X. The sample covariance matrix is given by S = X^T X/N, and from (3.45) we have

X^T X = V D² V^T,   (3.48)

which is the eigen decomposition of X^T X (and of S, up to a factor N). The eigenvectors v_j (columns of V) are also called the principal components (or Karhunen-Loeve) directions of X. The first principal component direction v_1 has the property that z_1 = Xv_1 has the largest sample variance amongst all normalized linear combinations of the columns of X. This sample variance is easily seen to be

Var(z_1) = Var(Xv_1) = d_1²/N,   (3.49)

and in fact z_1 = Xv_1 = u_1 d_1. The derived variable z_1 is called the first principal component of X, and hence u_1 is the normalized first principal component.
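The shrinkage view in (3.47) can be checked numerically: the ridge fit equals the sum of the projections onto the u_j, each shrunk by d_j²/(d_j² + λ). A brief sketch, assuming centered X and y and an illustrative value of λ:

```python
import numpy as np

lam = 10.0
U, d, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(d) V^T, the SVD of (3.45)
shrink = d**2 / (d**2 + lam)                       # factors d_j^2 / (d_j^2 + lam) in (3.47)
fit_svd = U @ (shrink * (U.T @ y))                 # ridge fitted vector via shrunken coordinates
fit_direct = X @ np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
assert np.allclose(fit_svd, fit_direct)            # agrees with the closed form (3.44)
```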

FIGURE 3.9. Principal components of some input data points (plotted as X_2 versus X_1, with the largest and smallest principal component directions indicated). The largest principal component is the direction that maximizes the variance of the projected data, and the smallest principal component minimizes that variance. Ridge regression projects y onto these components, and then shrinks the coefficients of the low-variance components more than the high-variance components.

Subsequent principal components z_j have maximum variance d_j²/N, subject to being orthogonal to the earlier ones. Conversely the last principal component has minimum variance. Hence the small singular values d_j correspond to directions in the column space of X having small variance, and ridge regression shrinks these directions the most.

Figure 3.9 illustrates the principal components of some data points in two dimensions. If we consider fitting a linear surface over this domain (the Y-axis is sticking out of the page), the configuration of the data allows us to determine its gradient more accurately in the long direction than the short. Ridge regression protects against the potentially high variance of gradients estimated in the short directions. The implicit assumption is that the response will tend to vary most in the directions of high variance of the inputs. This is often a reasonable assumption, since predictors are often chosen for study because they vary with the response variable, but need not hold in general.

In Figure 3.7 we have plotted the estimated prediction error versus the quantity

df(λ) = tr[X (X^T X + λI)^{−1} X^T] = tr(H_λ) = Σ_{j=1}^p d_j²/(d_j² + λ).   (3.50)

This monotone decreasing function of λ is the effective degrees of freedom of the ridge regression fit. Usually in a linear-regression fit with p variables, the degrees-of-freedom of the fit is p, the number of free parameters. The idea is that although all p coefficients in a ridge fit will be non-zero, they are fit in a restricted fashion controlled by λ. Note that df(λ) = p when λ = 0 (no regularization) and df(λ) → 0 as λ → ∞. Of course there is always an additional one degree of freedom for the intercept, which was removed a priori. This definition is motivated in more detail in Section 3.4.4 and Sections 7.4-7.6. In Figure 3.7 the minimum occurs at df(λ) = 5.0. Table 3.3 shows that ridge regression reduces the test error of the full least squares estimates by a small amount.

3.4.2 The Lasso

The lasso is a shrinkage method like ridge, with subtle but important differences. The lasso estimate is defined by

β̂^{lasso} = argmin_β Σ_{i=1}^N ( y_i − β_0 − Σ_{j=1}^p x_{ij} β_j )²   subject to   Σ_{j=1}^p |β_j| ≤ t.   (3.51)

Just as in ridge regression, we can re-parametrize the constant β_0 by standardizing the predictors; the solution for β̂_0 is ȳ, and thereafter we fit a model without an intercept (Exercise 3.5). In the signal processing literature, the lasso is also known as basis pursuit (Chen et al., 1998).

We can also write the lasso problem in the equivalent Lagrangian form

β̂^{lasso} = argmin_β { (1/2) Σ_{i=1}^N ( y_i − β_0 − Σ_{j=1}^p x_{ij} β_j )² + λ Σ_{j=1}^p |β_j| }.   (3.52)

Notice the similarity to the ridge regression problem (3.42) or (3.41): the L_2 ridge penalty Σ_1^p β_j² is replaced by the L_1 lasso penalty Σ_1^p |β_j|. This latter constraint makes the solutions nonlinear in the y_i, and there is no closed form expression as in ridge regression. Computing the lasso solution

is a quadratic programming problem, although we see in Section 3.4.4 that efficient algorithms are available for computing the entire path of solutions as λ is varied, with the same computational cost as for ridge regression. Because of the nature of the constraint, making t sufficiently small will cause some of the coefficients to be exactly zero. Thus the lasso does a kind of continuous subset selection. If t is chosen larger than t_0 = Σ_1^p |β̂_j| (where β̂_j = β̂_j^{ls}, the least squares estimates), then the lasso estimates are the β̂_j's. On the other hand, for t = t_0/2 say, the least squares coefficients are shrunk by about 50% on average. However, the nature of the shrinkage is not obvious, and we investigate it further below. Like the subset size in variable subset selection, or the penalty parameter in ridge regression, t should be adaptively chosen to minimize an estimate of expected prediction error.

In Figure 3.7, for ease of interpretation, we have plotted the lasso prediction error estimates versus the standardized parameter s = t/Σ_1^p |β̂_j|. A value ŝ ≈ 0.36 was chosen by 10-fold cross-validation; this caused four coefficients to be set to zero (fifth column of Table 3.3). The resulting model has the second lowest test error, slightly lower than the full least squares model, but the standard errors of the test error estimates (last line of Table 3.3) are fairly large.

Figure 3.10 shows the lasso coefficients as the standardized tuning parameter s = t/Σ_1^p |β̂_j| is varied. At s = 1.0 these are the least squares estimates; they decrease to 0 as s → 0. This decrease is not always strictly monotonic, although it is in this example. A vertical line is drawn at s = 0.36, the value chosen by cross-validation.

3.4.3 Discussion: Subset Selection, Ridge Regression and the Lasso

In this section we discuss and compare the three approaches discussed so far for restricting the linear regression model: subset selection, ridge regression and the lasso.

In the case of an orthonormal input matrix X the three procedures have explicit solutions. Each method applies a simple transformation to the least squares estimate β̂_j, as detailed in Table 3.4. Ridge regression does a proportional shrinkage. Lasso translates each coefficient by a constant factor λ, truncating at zero. This is called "soft thresholding," and is used in the context of wavelet-based smoothing in Section 5.9. Best-subset selection drops all variables with coefficients smaller than the Mth largest; this is a form of "hard-thresholding."

Back to the nonorthogonal case; some pictures help understand their relationship. Figure 3.11 depicts the lasso (left) and ridge regression (right) when there are only two parameters. The residual sum of squares has elliptical contours, centered at the full least squares estimate.

FIGURE 3.10. Profiles of lasso coefficients (lcavol, svi, lweight, pgg45, lbph, gleason, age, lcp), as the tuning parameter t is varied. Coefficients are plotted versus s = t/Σ_1^p |β̂_j|. A vertical line is drawn at s = 0.36, the value chosen by cross-validation. Compare Figure 3.8 on page 65; the lasso profiles hit zero, while those for ridge do not. The profiles are piece-wise linear, and so are computed only at the points displayed; see Section 3.4.4 for details.

TABLE 3.4. Estimators of β_j in the case of orthonormal columns of X. M and λ are constants chosen by the corresponding techniques; sign denotes the sign of its argument (±1), and x_+ denotes the "positive part" of x. Below the table, the estimators are shown by broken red lines; the 45° line in gray shows the unrestricted estimate for reference.

  Estimator               Formula
  Best subset (size M)    β̂_j · I(|β̂_j| ≥ |β̂_(M)|)
  Ridge                   β̂_j / (1 + λ)
  Lasso                   sign(β̂_j) (|β̂_j| − λ)_+

FIGURE 3.11. Estimation picture for the lasso (left) and ridge regression (right). Shown are contours of the error and constraint functions. The solid blue areas are the constraint regions |β_1| + |β_2| ≤ t and β_1² + β_2² ≤ t², respectively, while the red ellipses are the contours of the least squares error function.
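The three orthonormal-case transformations of Table 3.4, written as small functions; beta is the vector of least squares estimates, and M and lam play the roles described in the table.

```python
import numpy as np

def best_subset(beta, M):
    """Hard thresholding: keep the M largest coefficients in absolute value, zero the rest."""
    threshold = np.sort(np.abs(beta))[-M]
    return np.where(np.abs(beta) >= threshold, beta, 0.0)

def ridge_shrink(beta, lam):
    """Proportional shrinkage toward zero."""
    return beta / (1.0 + lam)

def lasso_soft_threshold(beta, lam):
    """Soft thresholding: sign(beta) * (|beta| - lam)_+."""
    return np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)
```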

The constraint region for ridge regression is the disk β_1² + β_2² ≤ t², while that for lasso is the diamond |β_1| + |β_2| ≤ t. Both methods find the first point where the elliptical contours hit the constraint region. Unlike the disk, the diamond has corners; if the solution occurs at a corner, then it has one parameter β_j equal to zero. When p > 2, the diamond becomes a rhomboid, and has many corners, flat edges and faces; there are many more opportunities for the estimated parameters to be zero.

We can generalize ridge regression and the lasso, and view them as Bayes estimates. Consider the criterion

β̃ = argmin_β { Σ_{i=1}^N ( y_i − β_0 − Σ_{j=1}^p x_{ij} β_j )² + λ Σ_{j=1}^p |β_j|^q }   (3.53)

for q ≥ 0. The contours of constant value of Σ_j |β_j|^q are shown in Figure 3.12, for the case of two inputs.

Thinking of |β_j|^q as the log-prior density for β_j, these are also the equi-contours of the prior distribution of the parameters. The value q = 0 corresponds to variable subset selection, as the penalty simply counts the number of nonzero parameters; q = 1 corresponds to the lasso, while q = 2 to ridge regression. Notice that for q ≤ 1, the prior is not uniform in direction, but concentrates more mass in the coordinate directions. The prior corresponding to the q = 1 case is an independent double exponential (or Laplace) distribution for each input, with density (1/2τ) exp(−|β|/τ) and τ = 1/λ. The case q = 1 (lasso) is the smallest q such that the constraint region is convex; non-convex constraint regions make the optimization problem more difficult.

In this view, the lasso, ridge regression and best subset selection are Bayes estimates with different priors. Note, however, that they are derived as posterior modes, that is, maximizers of the posterior. It is more common to use the mean of the posterior as the Bayes estimate. Ridge regression is also the posterior mean, but the lasso and best subset selection are not.

Looking again at the criterion (3.53), we might try using other values of q besides 0, 1, or 2. Although one might consider estimating q from the data, our experience is that it is not worth the effort for the extra variance incurred. Values of q ∈ (1, 2) suggest a compromise between the lasso and ridge regression. Although this is the case, with q > 1, |β_j|^q is differentiable at 0, and so does not share the ability of lasso (q = 1) for setting coefficients exactly to zero.

FIGURE 3.12. Contours of constant value of Σ_j |β_j|^q for given values of q (q = 4, 2, 1, 0.5, 0.1).

FIGURE 3.13. Contours of constant value of Σ_j |β_j|^q for q = 1.2 (left plot), and the elastic-net penalty Σ_j (α β_j² + (1 − α)|β_j|) for α = 0.2 (right plot). Although visually very similar, the elastic-net has sharp (non-differentiable) corners, while the q = 1.2 penalty does not.

Partly for this reason as well as for computational tractability, Zou and Hastie (2005) introduced the elastic-net penalty

λ Σ_{j=1}^p ( α β_j² + (1 − α) |β_j| ),   (3.54)

a different compromise between ridge and lasso. Figure 3.13 compares the L_q penalty with q = 1.2 and the elastic-net penalty with α = 0.2; it is hard to detect the difference by eye. The elastic-net selects variables like the lasso, and shrinks together the coefficients of correlated predictors like ridge. It also has considerable computational advantages over the L_q penalties. We discuss the elastic-net further in Chapter 18.

3.4.4 Least Angle Regression

Least angle regression (LAR) is a relative newcomer (Efron et al., 2004), and can be viewed as a kind of "democratic" version of forward stepwise regression (Section 3.3.2). As we will see, LAR is intimately connected with the lasso, and in fact provides an extremely efficient algorithm for computing the entire lasso path as in Figure 3.10.

Forward stepwise regression builds a model sequentially, adding one variable at a time. At each step, it identifies the best variable to include in the active set, and then updates the least squares fit to include all the active variables. Least angle regression uses a similar strategy, but only enters "as much" of a predictor as it deserves. At the first step it identifies the variable most correlated with the response. Rather than fit this variable completely, LAR moves the coefficient of this variable continuously toward its least-squares value (causing its correlation with the evolving residual to decrease in absolute value). As soon as another variable "catches up" in terms of correlation with the residual, the process is paused. The second variable then joins the active set, and their coefficients are moved together in a way that keeps their correlations tied and decreasing. This process is continued

until all the variables are in the model, and ends at the full least-squares fit. Algorithm 3.2 provides the details. The termination condition in step 5 requires some explanation. If p > N − 1, the LAR algorithm reaches a zero residual solution after N − 1 steps (the −1 is because we have centered the data).

Algorithm 3.2 Least Angle Regression.
1. Standardize the predictors to have mean zero and unit norm. Start with the residual r = y − ȳ, and β_1, β_2, ..., β_p = 0.
2. Find the predictor x_j most correlated with r.
3. Move β_j from 0 towards its least-squares coefficient ⟨x_j, r⟩, until some other competitor x_k has as much correlation with the current residual as does x_j.
4. Move β_j and β_k in the direction defined by their joint least squares coefficient of the current residual on (x_j, x_k), until some other competitor x_l has as much correlation with the current residual.
5. Continue in this way until all p predictors have been entered. After min(N − 1, p) steps, we arrive at the full least-squares solution.

Suppose A_k is the active set of variables at the beginning of the kth step, and let β_{A_k} be the coefficient vector for these variables at this step; there will be k − 1 nonzero values, and the one just entered will be zero. If r_k = y − X_{A_k} β_{A_k} is the current residual, then the direction for this step is

δ_k = (X_{A_k}^T X_{A_k})^{−1} X_{A_k}^T r_k.   (3.55)

The coefficient profile then evolves as β_{A_k}(α) = β_{A_k} + α·δ_k. Exercise 3.23 verifies that the directions chosen in this fashion do what is claimed: keep the correlations tied and decreasing. If the fit vector at the beginning of this step is f̂_k, then it evolves as f̂_k(α) = f̂_k + α·u_k, where u_k = X_{A_k} δ_k is the new fit direction. The name "least angle" arises from a geometrical interpretation of this process; u_k makes the smallest (and equal) angle with each of the predictors in A_k (Exercise 3.24). Figure 3.14 shows the absolute correlations decreasing and joining ranks with each step of the LAR algorithm, using simulated data.

By construction the coefficients in LAR change in a piecewise linear fashion. Figure 3.15 [left panel] shows the LAR coefficient profile evolving as a function of its L_1 arc length.²

² The L_1 arc-length of a differentiable curve β(s) for s ∈ [0, S] is given by TV(β, S) = ∫_0^S ||β̇(s)||_1 ds, where β̇(s) = ∂β(s)/∂s. For the piecewise-linear LAR coefficient profile, this amounts to summing the L_1 norms of the changes in coefficients from step to step.
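In practice the entire LAR or lasso path can be obtained with standard software; for example, scikit-learn's lars_path follows Algorithm 3.2 and its lasso modification. A brief sketch, with the caveat that the exact call shown reflects current scikit-learn and is an assumption about the reader's environment, not something from the text:

```python
import numpy as np
from sklearn.linear_model import lars_path

# X: N x p standardized predictors, y: centered response (step 1 of Algorithm 3.2)
alphas, active, coefs = lars_path(X, y, method="lar")            # pure least angle regression path
alphas_l, active_l, coefs_l = lars_path(X, y, method="lasso")    # lasso modification (Algorithm 3.2a)

# coefs has shape (p, number_of_knots): each column is the coefficient vector at one
# breakpoint, and the profile is piecewise linear between consecutive columns.
```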

FIGURE 3.14. Progression of the absolute correlations during each step of the LAR procedure, using a simulated data set with six predictors (labelled v1-v6). The labels at the top of the plot indicate which variables enter the active set at each step. The step lengths are measured in units of L_1 arc length.

FIGURE 3.15. Left panel shows the LAR coefficient profiles on the simulated data, as a function of the L_1 arc length. The right panel shows the lasso profile. They are identical until the dark-blue coefficient crosses zero at an arc length of about 18.

Note that we do not need to take small steps and recheck the correlations in step 3; using knowledge of the covariance of the predictors and the piecewise linearity of the algorithm, we can work out the exact step length at the beginning of each step (Exercise 3.25).

The right panel of Figure 3.15 shows the lasso coefficient profiles on the same data. They are almost identical to those in the left panel, and differ for the first time when the blue coefficient passes back through zero. For the prostate data, the LAR coefficient profile turns out to be identical to the lasso profile in Figure 3.10, which never crosses zero. These observations lead to a simple modification of the LAR algorithm that gives the entire lasso path, which is also piecewise-linear.

Algorithm 3.2a Least Angle Regression: Lasso Modification.
4a. If a non-zero coefficient hits zero, drop its variable from the active set of variables and recompute the current joint least squares direction.

The LAR(lasso) algorithm is extremely efficient, requiring the same order of computation as that of a single least squares fit using the p predictors. Least angle regression always takes p steps to get to the full least squares estimates. The lasso path can have more than p steps, although the two are often quite similar. Algorithm 3.2 with the lasso modification 3.2a is an efficient way of computing the solution to any lasso problem, especially when p ≫ N. Osborne et al. (2000a) also discovered a piecewise-linear path for computing the lasso, which they called a homotopy algorithm.

We now give a heuristic argument for why these procedures are so similar. Although the LAR algorithm is stated in terms of correlations, if the input features are standardized, it is equivalent and easier to work with inner-products. Suppose A is the active set of variables at some stage in the algorithm, tied in their absolute inner-product with the current residuals y − Xβ. We can express this as

x_j^T (y − Xβ) = γ · s_j,  ∀ j ∈ A,   (3.56)

where s_j ∈ {−1, 1} indicates the sign of the inner-product, and γ is the common value. Also |x_k^T (y − Xβ)| ≤ γ for k ∉ A. Now consider the lasso criterion (3.52), which we write in vector form

R(β) = (1/2) ||y − Xβ||²_2 + λ ||β||_1.   (3.57)

Let B be the active set of variables in the solution for a given value of λ. For these variables R(β) is differentiable, and the stationarity conditions give

x_j^T (y − Xβ) = λ · sign(β_j),  ∀ j ∈ B.   (3.58)

Comparing (3.58) with (3.56), we see that they are identical only if the sign of β_j matches the sign of the inner product. That is why the LAR

That is why the LAR algorithm and lasso start to differ when an active coefficient passes through zero; condition (3.58) is violated for that variable, and it is kicked out of the active set B. Exercise 3.23 shows that these equations imply a piecewise-linear coefficient profile as λ decreases. The stationarity conditions for the non-active variables require that

|x_k^T (y - Xβ)| ≤ λ, ∀ k ∉ B,  (3.59)

which again agrees with the LAR algorithm.

Figure 3.16 compares LAR and lasso to forward stepwise and stagewise regression. The setup is the same as in Figure 3.6 on page 59, except N = 100 here rather than 300, so the problem is more difficult. We see that the more aggressive forward stepwise starts to overfit quite early (well before the 10 true variables can enter the model), and ultimately performs worse than the slower forward stagewise regression. The behavior of LAR and lasso is similar to that of forward stagewise regression. Incremental forward stagewise is similar to LAR and lasso, and is described in Section 3.8.1.

Degrees-of-Freedom Formula for LAR and Lasso

Suppose that we fit a linear model via the least angle regression procedure, stopping at some number of steps k < p, or equivalently using a lasso bound t that produces a constrained version of the full least squares fit. How many parameters, or "degrees of freedom," have we used? Consider first a linear regression using a subset of k features. If this subset is prespecified in advance without reference to the training data, then the degrees of freedom used in the fitted model is defined to be k. Indeed, in classical statistics, the number of linearly independent parameters is what is meant by "degrees of freedom." Alternatively, suppose that we carry out a best subset selection to determine the "optimal" set of k predictors. Then the resulting model has k parameters, but in some sense we have used up more than k degrees of freedom.

We need a more general definition for the effective degrees of freedom of an adaptively fitted model. We define the degrees of freedom of the fitted vector ŷ = (ŷ_1, ŷ_2, ..., ŷ_N) as

df(ŷ) = (1/σ²) Σ_{i=1}^{N} Cov(ŷ_i, y_i).  (3.60)

Here Cov(ŷ_i, y_i) refers to the sampling covariance between the predicted value ŷ_i and its corresponding outcome value y_i. This makes intuitive sense: the harder that we fit to the data, the larger this covariance and hence df(ŷ). Expression (3.60) is a useful notion of degrees of freedom, one that can be applied to any model prediction ŷ.

FIGURE 3.16. Comparison of LAR and lasso with forward stepwise, forward stagewise (FS) and incremental forward stagewise (FS_0) regression. The setup is the same as in Figure 3.6, except N = 100 here rather than 300. Here the slower FS regression ultimately outperforms forward stepwise. LAR and lasso show similar behavior to FS and FS_0. Since the procedures take different numbers of steps (across simulation replicates and methods), we plot the MSE as a function of the fraction of total L_1 arc-length toward the least-squares fit.

This includes models that are adaptively fitted to the training data. This definition is motivated and discussed further in Chapter 7.

Now for a linear regression with k fixed predictors, it is easy to show that df(ŷ) = k. Likewise for ridge regression, this definition leads to the closed-form expression (3.50) on page 68: df(ŷ) = tr(S_λ). In both these cases, (3.60) is simple to evaluate because the fit ŷ = H_λ y is linear in y. If we think about definition (3.60) in the context of a best subset selection of size k, it seems clear that df(ŷ) will be larger than k, and this can be verified by estimating Cov(ŷ_i, y_i)/σ² directly by simulation. However there is no closed-form method for estimating df(ŷ) for best subset selection. For LAR and lasso, something magical happens. These techniques are adaptive in a smoother way than best subset selection, and hence estimation of degrees of freedom is more tractable. Specifically it can be shown that after the kth step of the LAR procedure, the effective degrees of freedom of the fit vector is exactly k.
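The covariance in (3.60) can be estimated directly by simulation, which is how the claim about best subset selection can be verified. Below is a minimal sketch under assumed Gaussian errors with known σ: it redraws y many times from a fixed design, fits both a prespecified k-variable model and a best subset of size k, and estimates Σ_i Cov(ŷ_i, y_i)/σ² for each. All names, sample sizes and coefficients are illustrative.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
N, p, k, sigma, reps = 30, 6, 2, 1.0, 2000
X = rng.standard_normal((N, p))
beta_true = np.array([1.0, 0.8, 0.6, 0.0, 0.0, 0.0])
mu = X @ beta_true                                   # fixed design, fixed mean

def fit(cols, y):
    """Least-squares fitted values from regressing y on the given columns of X."""
    Xs = X[:, list(cols)]
    b, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    return Xs @ b

Y = mu + sigma * rng.standard_normal((reps, N))      # repeated draws of y
yhat_fixed = np.empty((reps, N))                     # prespecified subset: first k columns
yhat_best = np.empty((reps, N))                      # best subset of size k, chosen per draw
for r in range(reps):
    y = Y[r]
    yhat_fixed[r] = fit(range(k), y)
    best = min(combinations(range(p), k),
               key=lambda S: np.sum((y - fit(S, y)) ** 2))
    yhat_best[r] = fit(best, y)

def df_hat(yhat):
    """Monte Carlo version of (3.60): sum_i Cov(yhat_i, y_i) / sigma^2."""
    return np.sum(np.mean((yhat - yhat.mean(0)) * (Y - mu), axis=0)) / sigma ** 2

print(df_hat(yhat_fixed))    # close to k = 2
print(df_hat(yhat_best))     # noticeably larger than k
```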

Now for the lasso, the (modified) LAR procedure often takes more than p steps, since predictors can drop out. Hence the definition is a little different; for the lasso, at any stage df(ŷ) approximately equals the number of predictors in the model. While this approximation works reasonably well anywhere in the lasso path, for each k it works best at the last model in the sequence that contains k predictors. A detailed study of the degrees of freedom for the lasso may be found in Zou et al. (2007).

3.5 Methods Using Derived Input Directions

In many situations we have a large number of inputs, often very correlated. The methods in this section produce a small number of linear combinations Z_m, m = 1, ..., M, of the original inputs X_j, and the Z_m are then used in place of the X_j as inputs in the regression. The methods differ in how the linear combinations are constructed.

3.5.1 Principal Components Regression

In this approach the linear combinations Z_m used are the principal components as defined in Section 3.4.1 above. Principal component regression forms the derived input columns z_m = X v_m, and then regresses y on z_1, z_2, ..., z_M for some M ≤ p. Since the z_m are orthogonal, this regression is just a sum of univariate regressions:

ŷ^pcr(M) = ȳ1 + Σ_{m=1}^{M} θ̂_m z_m,  (3.61)

where θ̂_m = ⟨z_m, y⟩ / ⟨z_m, z_m⟩. Since the z_m are each linear combinations of the original x_j, we can express the solution (3.61) in terms of coefficients of the x_j (Exercise 3.13):

β̂^pcr(M) = Σ_{m=1}^{M} θ̂_m v_m.  (3.62)

As with ridge regression, principal components depend on the scaling of the inputs, so typically we first standardize them. Note that if M = p, we would just get back the usual least squares estimates, since the columns of Z = UD span the column space of X. For M < p we get a reduced regression. We see that principal components regression is very similar to ridge regression: both operate via the principal components of the input matrix. Ridge regression shrinks the coefficients of the principal components (Figure 3.17), shrinking more depending on the size of the corresponding eigenvalue; principal components regression discards the p - M smallest eigenvalue components. Figure 3.17 illustrates this.
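Equations (3.61) and (3.62) amount to an SVD of the standardized inputs followed by univariate regressions on the leading components. Here is a minimal numpy sketch (illustrative data and names only); the last line also prints the ridge shrinkage factors d_j²/(d_j² + λ) referred to in Figure 3.17, to contrast smooth shrinkage with the truncation performed by PCR.

```python
import numpy as np

rng = np.random.default_rng(3)
N, p, M, lam = 80, 5, 2, 10.0

# Correlated inputs, then standardized as the text assumes
X = rng.standard_normal((N, p)) @ rng.standard_normal((p, p))
X = (X - X.mean(0)) / X.std(0)
y = X @ np.array([1.0, -1.0, 0.5, 0.0, 0.0]) + rng.standard_normal(N)
yc = y - y.mean()

# Principal components via the SVD X = U D V^T
U, d, Vt = np.linalg.svd(X, full_matrices=False)
Z = X @ Vt.T                                     # z_m = X v_m (orthogonal columns)

# (3.61): univariate regressions of y on the first M components
theta = (Z[:, :M].T @ yc) / np.sum(Z[:, :M] ** 2, axis=0)
yhat_pcr = y.mean() + Z[:, :M] @ theta

# (3.62): the same fit expressed as coefficients on the original x_j
beta_pcr = Vt.T[:, :M] @ theta
print(np.allclose(yhat_pcr, y.mean() + X @ beta_pcr))    # True

# Ridge regression, by contrast, shrinks every component by d_j^2/(d_j^2 + lambda)
print(d ** 2 / (d ** 2 + lam))
```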

FIGURE 3.17. Ridge regression shrinks the regression coefficients of the principal components, using shrinkage factors d_j²/(d_j² + λ) as in (3.47). Principal component regression truncates them. Shown are the shrinkage and truncation patterns corresponding to Figure 3.7, as a function of the principal component index.

In Figure 3.7 we see that cross-validation suggests seven terms; the resulting model has the lowest test error in Table 3.3.

3.5.2 Partial Least Squares

This technique also constructs a set of linear combinations of the inputs for regression, but unlike principal components regression it uses y (in addition to X) for this construction. Like principal component regression, partial least squares (PLS) is not scale invariant, so we assume that each x_j is standardized to have mean 0 and variance 1. PLS begins by computing φ̂_1j = ⟨x_j, y⟩ for each j. From this we construct the derived input z_1 = Σ_j φ̂_1j x_j, which is the first partial least squares direction. Hence in the construction of each z_m, the inputs are weighted by the strength of their univariate effect on y.³ The outcome y is regressed on z_1 giving coefficient θ̂_1, and then we orthogonalize x_1, ..., x_p with respect to z_1. We continue this process, until M ≤ p directions have been obtained. In this manner, partial least squares produces a sequence of derived, orthogonal inputs or directions z_1, z_2, ..., z_M. As with principal-component regression, if we were to construct all M = p directions, we would get back a solution equivalent to the usual least squares estimates; using M < p directions produces a reduced regression. The procedure is described fully in Algorithm 3.3.

³ Since the x_j are standardized, the first directions φ̂_1j are the univariate regression coefficients (up to an irrelevant constant); this is not the case for subsequent directions.
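A quick numerical check of the first direction and of the footnote's claim, using numpy with illustrative data: for standardized inputs, the weights φ̂_1j = ⟨x_j, y⟩ are proportional to the univariate least-squares coefficients of y on each x_j.

```python
import numpy as np

rng = np.random.default_rng(4)
N, p = 60, 4
X = rng.standard_normal((N, p))
X = (X - X.mean(0)) / X.std(0)          # standardized inputs
y = X @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.standard_normal(N)
y = y - y.mean()

phi1 = X.T @ y                           # phi_hat_{1j} = <x_j, y>
z1 = X @ phi1                            # first PLS direction

# Univariate LS coefficients are <x_j, y>/<x_j, x_j>; with unit-variance
# columns, <x_j, x_j> = N for every j, so phi1 is proportional to them.
uni = (X.T @ y) / np.sum(X ** 2, axis=0)
print(np.allclose(phi1, N * uni))        # True
```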

Algorithm 3.3 Partial Least Squares.
1. Standardize each x_j to have mean zero and variance one. Set ŷ^(0) = ȳ1, and x_j^(0) = x_j, j = 1, ..., p.
2. For m = 1, 2, ..., p
(a) z_m = Σ_{j=1}^{p} φ̂_mj x_j^(m-1), where φ̂_mj = ⟨x_j^(m-1), y⟩.
(b) θ̂_m = ⟨z_m, y⟩ / ⟨z_m, z_m⟩.
(c) ŷ^(m) = ŷ^(m-1) + θ̂_m z_m.
(d) Orthogonalize each x_j^(m-1) with respect to z_m: x_j^(m) = x_j^(m-1) - [⟨z_m, x_j^(m-1)⟩ / ⟨z_m, z_m⟩] z_m, j = 1, 2, ..., p.
3. Output the sequence of fitted vectors {ŷ^(m)}_1^p. Since the {z_l}_1^m are linear in the original x_j, so is ŷ^(m) = X β̂^pls(m). These linear coefficients can be recovered from the sequence of PLS transformations.

In the prostate cancer example, cross-validation chose M = 2 PLS directions in Figure 3.7. This produced the model given in the rightmost column of Table 3.3.

What optimization problem is partial least squares solving? Since it uses the response y to construct its directions, its solution path is a nonlinear function of y. It can be shown (Exercise 3.15) that partial least squares seeks directions that have high variance and have high correlation with the response, in contrast to principal components regression which keys only on high variance (Stone and Brooks, 1990; Frank and Friedman, 1993). In particular, the mth principal component direction v_m solves:

max_α Var(Xα)  (3.63)
subject to ‖α‖ = 1, α^T S v_l = 0, l = 1, ..., m-1,

where S is the sample covariance matrix of the x_j. The conditions α^T S v_l = 0 ensure that z_m = Xα is uncorrelated with all the previous linear combinations z_l = X v_l. The mth PLS direction φ̂_m solves:

max_α Corr²(y, Xα) Var(Xα)  (3.64)
subject to ‖α‖ = 1, α^T S φ̂_l = 0, l = 1, ..., m-1.

Further analysis reveals that the variance aspect tends to dominate, and so partial least squares behaves much like ridge regression and principal components regression. We discuss this further in the next section. If the input matrix X is orthogonal, then partial least squares finds the least squares estimates after m = 1 steps. Subsequent steps have no effect since the φ̂_mj are zero for m > 1 (Exercise 3.14). It can also be shown that the sequence of PLS coefficients for m = 1, 2, ..., p represents the conjugate gradient sequence for computing the least squares solutions (Exercise 3.18).
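Algorithm 3.3 translates almost line for line into numpy. The sketch below is a minimal, unoptimized illustration rather than a reference implementation (the function name and data are made up); the final check confirms that with M = p directions PLS reproduces the ordinary least squares fit, as stated above.

```python
import numpy as np

def pls(X, y, M):
    """Partial least squares fit following Algorithm 3.3; returns yhat^(M).

    X is assumed standardized (columns with mean 0, variance 1).
    """
    N, p = X.shape
    yhat = np.full(N, y.mean())          # yhat^(0) = ybar * 1
    Xm = X.copy()                        # the x_j^(m-1)
    for m in range(M):
        phi = Xm.T @ y                   # phi_hat_{mj} = <x_j^(m-1), y>
        z = Xm @ phi                     # z_m
        zz = z @ z
        theta = (z @ y) / zz             # theta_hat_m
        yhat = yhat + theta * z          # step 2(c)
        # step 2(d): orthogonalize each x_j^(m-1) with respect to z_m
        Xm = Xm - np.outer(z, (z @ Xm) / zz)
    return yhat

rng = np.random.default_rng(5)
N, p = 50, 6
X = rng.standard_normal((N, p))
X = (X - X.mean(0)) / X.std(0)
y = X @ rng.standard_normal(p) + rng.standard_normal(N)

# With M = p directions, PLS reproduces the least squares fit
# (slopes on the centered response, plus the intercept ybar).
b_ls, *_ = np.linalg.lstsq(X, y - y.mean(), rcond=None)
print(np.allclose(pls(X, y, p), y.mean() + X @ b_ls))   # True
```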

3.6 Discussion: A Comparison of the Selection and Shrinkage Methods

There are some simple settings where we can understand better the relationship between the different methods described above. Consider an example with two correlated inputs X_1 and X_2, with correlation ρ. We assume that the true regression coefficients are β_1 = 4 and β_2 = 2. Figure 3.18 shows the coefficient profiles for the different methods, as their tuning parameters are varied. The top panel has ρ = 0.5, the bottom panel ρ = -0.5. The tuning parameters for ridge and lasso vary over a continuous range, while best subset, PLS and PCR take just two discrete steps to the least squares solution. In the top panel, starting at the origin, ridge regression shrinks the coefficients together until it finally converges to least squares. PLS and PCR show similar behavior to ridge, although are discrete and more extreme. Best subset overshoots the solution and then backtracks. The behavior of the lasso is intermediate to the other methods. When the correlation is negative (lower panel), again PLS and PCR roughly track the ridge path, while all of the methods are more similar to one another.

It is interesting to compare the shrinkage behavior of these different methods. Recall that ridge regression shrinks all directions, but shrinks low-variance directions more. Principal components regression leaves M high-variance directions alone, and discards the rest. Interestingly, it can be shown that partial least squares also tends to shrink the low-variance directions, but can actually inflate some of the higher variance directions. This can make PLS a little unstable, and cause it to have slightly higher prediction error compared to ridge regression. A full study is given in Frank and Friedman (1993). These authors conclude that for minimizing prediction error, ridge regression is generally preferable to variable subset selection, principal components regression and partial least squares. However the improvement over the latter two methods was only slight.

To summarize, PLS, PCR and ridge regression tend to behave similarly. Ridge regression may be preferred because it shrinks smoothly, rather than in discrete steps. Lasso falls somewhere between ridge regression and best subset regression, and enjoys some of the properties of each.

FIGURE 3.18. Coefficient profiles from different methods for a simple problem: two inputs with correlation ±0.5, and the true regression coefficients β = (4, 2). (Top panel: ρ = 0.5; bottom panel: ρ = -0.5; axes are β_1 and β_2; the methods shown are PCR, Lasso, PLS, Ridge, Least Squares and Best Subset.)

3.7 Multiple Outcome Shrinkage and Selection

As noted in Section 3.2.4, the least squares estimates in a multiple-output linear model are simply the individual least squares estimates for each of the outputs. To apply selection and shrinkage methods in the multiple output case, one could apply a univariate technique individually to each outcome or simultaneously to all outcomes. With ridge regression, for example, we could apply formula (3.44) to each of the K columns of the outcome matrix Y, using possibly different parameters λ, or apply it to all columns using the same value of λ. The former strategy would allow different amounts of regularization to be applied to different outcomes but require estimation of k separate regularization parameters λ_1, ..., λ_k, while the latter would permit all k outputs to be used in estimating the sole regularization parameter λ.

Other more sophisticated shrinkage and selection strategies that exploit correlations in the different responses can be helpful in the multiple output case. Suppose for example that among the outputs we have

Y_k = f(X) + ε_k  (3.65)
Y_l = f(X) + ε_l;  (3.66)

i.e., (3.65) and (3.66) share the same structural part f(X) in their models. It is clear in this case that we should pool our observations on Y_k and Y_l to estimate the common f.

Combining responses is at the heart of canonical correlation analysis (CCA), a data reduction technique developed for the multiple output case. Similar to PCA, CCA finds a sequence of uncorrelated linear combinations Xv_m, m = 1, ..., M of the x_j, and a corresponding sequence of uncorrelated linear combinations Yu_m of the responses y_k, such that the correlations

Corr²(Yu_m, Xv_m)  (3.67)

are successively maximized. Note that at most M = min(k, p) directions can be found. The leading canonical response variates are those linear combinations (derived responses) best predicted by the x_j; in contrast, the trailing canonical variates can be poorly predicted by the x_j, and are candidates for being dropped. The CCA solution is computed using a generalized SVD of the sample cross-covariance matrix Y^T X/N (assuming Y and X are centered; Exercise 3.20). Reduced-rank regression (Izenman, 1975; van der Merwe and Zidek, 1980) formalizes this approach in terms of a regression model that explicitly pools information. Given an error covariance Cov(ε) = Σ, we solve the following
