Bayes Estimators & Ridge Regression
Readings: Christensen, Chapter 14
Merlise Clyde
September 29, 2015
How Good are Estimators?

Quadratic loss for estimating β using estimator a:

$L(\beta, a) = (\beta - a)^T (\beta - a)$

Consider our expected loss (before we see the data) of taking an action a.

Under OLS or the reference prior, the expected mean squared error is

$E_Y[(\beta - \hat\beta)^T (\beta - \hat\beta)] = \sigma^2 \operatorname{tr}[(X^T X)^{-1}] = \sigma^2 \sum_{j=1}^p \lambda_j^{-1}$

where the $\lambda_j$ are the eigenvalues of $X^T X$. If the smallest $\lambda_j \to 0$, then the MSE $\to \infty$.
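To see the blow-up numerically, here is a minimal R sketch with simulated, nearly collinear predictors (all names and values here are made up for illustration):

set.seed(42)
n  <- 50
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01)         # nearly collinear with x1
Xbad <- cbind(x1, x2)
lambda <- eigen(crossprod(Xbad))$values # eigenvalues of X^T X
sigma2 <- 1
sigma2 * sum(1 / lambda)                # expected MSE of OLS: huge,
                                        # driven by the tiny eigenvalue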
Is the g-prior any better?

Under the g-prior the posterior mean is $\frac{g}{1+g}\hat\beta$, with expected loss

$E\!\left[L\!\left(\beta, \tfrac{g}{1+g}\hat\beta\right)\right] = E_Y\!\left[\left(\beta - \tfrac{g}{1+g}\hat\beta\right)^T\left(\beta - \tfrac{g}{1+g}\hat\beta\right)\right] = \left(\frac{g}{1+g}\right)^2 \sigma^2 \operatorname{tr}[(X^T X)^{-1}] + \frac{\beta^T\beta}{(1+g)^2} = \frac{1}{(1+g)^2}\left(\sigma^2 g^2 \sum_j \lambda_j^{-1} + \beta^T\beta\right)$

Aside: the g-prior is better than the reference prior if $g > \frac{1}{2}\left(\frac{\beta^T\beta}{\sigma^2 \sum_j \lambda_j^{-1}} - 1\right)$.

But the risk still goes to infinity as $\lambda_j \to 0$.
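The crossover condition follows from comparing the two risks directly (a derivation sketch; the slide's exact expression is garbled in the transcription):

$\frac{1}{(1+g)^2}\left(\sigma^2 g^2 \sum_j \lambda_j^{-1} + \beta^T\beta\right) < \sigma^2 \sum_j \lambda_j^{-1} \;\Longleftrightarrow\; \beta^T\beta < (1 + 2g)\,\sigma^2 \sum_j \lambda_j^{-1} \;\Longleftrightarrow\; g > \frac{1}{2}\left(\frac{\beta^T\beta}{\sigma^2 \sum_j \lambda_j^{-1}} - 1\right)$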
Canonical Orthogonal Representation & Ridge Regression

Assume that X has been centered and standardized so that $X^T X = \operatorname{corr}(X)$ (use the scale function in R).

Write $X = U_p L V^T$ (singular value decomposition), where $U_p^T U_p = I_p$, $V$ is a $p \times p$ orthogonal matrix, and $L$ is diagonal with singular values $l_1, \ldots, l_p$.

$Y = \mathbf{1}\alpha + U_p L V^T \beta + \epsilon$

Let $U = [\mathbf{1}/\sqrt{n} \;\; U_p \;\; U_{n-p-1}]$, an $n \times n$ orthogonal matrix, and rotate by $U^T$:

$U^T Y = U^T \mathbf{1}\alpha + U^T U_p L V^T \beta + U^T \epsilon$

$\tilde Y \equiv U^T Y = \begin{pmatrix} \sqrt{n} & 0_p^T \\ 0_p & L \\ 0_{n-p-1} & 0_{(n-p-1) \times p} \end{pmatrix} \begin{pmatrix} \alpha \\ \gamma \end{pmatrix} + \tilde\epsilon$

where $\gamma = V^T \beta$.
Orthogonal Regression

In the rotated model $\tilde Y = U^T Y$ the estimates are:

$\hat\alpha = \bar y$

$\hat\gamma = (L^T L)^{-1} L^T U_p^T Y$, or componentwise $\hat\gamma_i = \tilde y_i / l_i$ for $i = 1, \ldots, p$, where $\tilde y_i$ is the $i$th element of $U_p^T Y$

$\operatorname{Var}(\hat\gamma_i) = \sigma^2 / l_i^2$

Directions $U_j$ in X-space with small singular values $l_j$ have the largest variances: these are the unstable directions.
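A minimal R sketch of the canonical coordinates (simulated data; the names are illustrative):

set.seed(1)
n <- 100; p <- 3
X <- scale(matrix(rnorm(n * p), n, p))   # centered and standardized
Y <- 2 + X %*% c(1, 0.5, 0) + rnorm(n)
s <- svd(X)                              # X = U_p L V^T
Up <- s$u; L <- s$d; V <- s$v
alpha.hat <- mean(Y)
gamma.hat <- crossprod(Up, Y) / L        # gamma-hat_i = u_i^T Y / l_i
beta.hat  <- V %*% gamma.hat             # beta-hat = V gamma-hat
cbind(beta.hat, coef(lm(Y ~ X))[-1])     # agrees with OLS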
Ridge Regression & Independent Prior

(Another) normal conjugate prior distribution on γ:

$\gamma \mid \phi \sim N\!\left(0_p, \tfrac{1}{\phi k} I_p\right)$

Posterior mean:

$\tilde\gamma = (L^T L + kI)^{-1} L^T U_p^T Y = (L^T L + kI)^{-1} L^T L \hat\gamma$

$\tilde\gamma_i = \frac{l_i^2}{l_i^2 + k}\,\hat\gamma_i = \frac{\lambda_i}{\lambda_i + k}\,\hat\gamma_i$

When $\lambda_i \to 0$, $\tilde\gamma_i \to 0$. When $k \to 0$ we get OLS back, but if k gets too big the posterior mean goes to zero.
Transform

Transform back: $\tilde\beta = V\tilde\gamma$, i.e.

$\tilde\beta = (X^T X + kI)^{-1} X^T X \hat\beta$

Note the importance of standardizing.

Is there a value of k for which ridge is better in terms of expected MSE than OLS?

Choice of k? Hoerl & Kennard (1970).
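A sketch checking that the canonical and direct formulas agree (continuing the simulated X, Y, and SVD objects from the sketch above; k is arbitrary):

k <- 0.1
gamma.ridge <- (L^2 / (L^2 + k)) * gamma.hat   # shrink each component
beta.ridge  <- V %*% gamma.ridge               # transform back
beta.direct <- solve(crossprod(X) + k * diag(p), crossprod(X) %*% beta.hat)
cbind(beta.ridge, beta.direct)                 # identical columns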
MSE

Can show (HW) that $E[(\beta - \tilde\beta)^T(\beta - \tilde\beta)] = E[(\gamma - \tilde\gamma)^T(\gamma - \tilde\gamma)]$

$\operatorname{Var}(\tilde\gamma_i) = \sigma^2 l_i^2 / (l_i^2 + k)^2$

The bias of $\tilde\gamma_i$ is $-k\gamma_i/(l_i^2 + k)$

$\text{MSE}(k) = \sigma^2 \sum_i \frac{l_i^2}{(l_i^2 + k)^2} + k^2 \sum_i \frac{\gamma_i^2}{(l_i^2 + k)^2}$

The derivative with respect to k is negative at k = 0, hence the function is decreasing there. Since k = 0 is OLS, this means that there is a value of k > 0 that will always be better than OLS.
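A sketch of this curve with hypothetical values (the numbers are made up; the point is the dip below the OLS risk at k = 0):

l2     <- c(25, 4, 0.01)      # squared singular values, one near zero
gamma  <- c(1, 1, 1)
sigma2 <- 1
mse <- function(k)
  sigma2 * sum(l2 / (l2 + k)^2) + k^2 * sum(gamma^2 / (l2 + k)^2)
mse(0)                        # OLS risk: sigma2 * sum(1 / l2)
ks <- seq(0, 1, length.out = 200)
plot(ks, sapply(ks, mse), type = "l", xlab = "k", ylab = "MSE(k)")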
Alternative Motivation

If $\hat\beta$ is unconstrained, expect high variance with nearly singular X.

Let $Y_c = (I - P_1)Y$ and let $X_c$ be the centered and standardized X matrix.

Control how large the coefficients may grow:

$\min_\beta (Y_c - X_c\beta)^T (Y_c - X_c\beta) \quad \text{subject to } \sum_j \beta_j^2 \le t$

Equivalent quadratic programming problem (penalized likelihood):

$\min_\beta \|Y_c - X_c\beta\|^2 + k\|\beta\|^2$
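A numeric check that the penalized problem has the ridge solution (a sketch reusing the simulated X and Y; optim versus the closed form):

Yc  <- Y - mean(Y)
k   <- 0.1
obj <- function(b) sum((Yc - X %*% b)^2) + k * sum(b^2)
b.optim  <- optim(rep(0, p), obj, method = "BFGS")$par
b.closed <- solve(crossprod(X) + k * diag(p), crossprod(X, Yc))
cbind(b.optim, b.closed)      # agree up to optimizer tolerance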
Longley Data

[Pairs plot of the Longley variables: Employed, GNP.deflator, GNP, Unemployed, Armed.Forces, Population, Year]
OLS

> longley.lm = lm(Employed ~ ., data=longley)
> summary(longley.lm)
Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)  -3.482e+03  8.904e+02  -3.911 0.003560 **
GNP.deflator  1.506e-02  8.492e-02   0.177 0.863141
GNP          -3.582e-02  3.349e-02  -1.070 0.312681
Unemployed   -2.020e-02  4.884e-03  -4.136 0.002535 **
Armed.Forces -1.033e-02  2.143e-03  -4.822 0.000944 ***
Population   -5.110e-02  2.261e-01  -0.226 0.826212
Year          1.829e+00  4.555e-01   4.016 0.003037 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3049 on 9 degrees of freedom
Multiple R-squared: 0.9955, Adjusted R-squared: 0.9925
F-statistic: 330.3 on 6 and 9 DF, p-value: 4.984e-10
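The near-singularity that destabilizes these estimates is easy to inspect (standard R calls on the built-in longley data; XL is just a local name for the predictor matrix):

XL <- scale(as.matrix(longley[, -7]))  # predictors only, Employed dropped
round(cor(XL), 2)                      # several correlations near 1
svd(XL)$d                              # smallest singular value near 0
kappa(XL, exact = TRUE)                # large condition number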
Ridge Trace

[Plot of the ridge coefficient paths, t(x$coef) against x$lambda]
Generalized Cross-validation

> select(lm.ridge(Employed ~ ., data=longley, lambda=seq(0, 0.1, 0.0001)))
modified HKB estimator is [value lost in transcription]
modified L-W estimator is [value lost in transcription]
smallest value of GCV at 0.0028
> longley.rreg = lm.ridge(Employed ~ ., data=longley, lambda=0.0028)
> coef(longley.rreg)
[most estimates lost in transcription; Population is of order 1e-01 in magnitude, and Year is 1.557e+00, shrunk from the OLS value 1.829e+00]
Testimators

Goldstein & Smith (1974) have shown that if $0 \le h_i \le 1$ and $\tilde\gamma_i = h_i \hat\gamma_i$, then $\tilde\gamma_i$ has smaller MSE than $\hat\gamma_i$ whenever

$\frac{\gamma_i^2}{\operatorname{Var}(\hat\gamma_i)} < \frac{1 + h_i}{1 - h_i}$

Case: if $\gamma_i^2 < \operatorname{Var}(\hat\gamma_i) = \sigma^2/l_i^2$, then $h_i = 0$ and $\tilde\gamma_i = 0$ is better.

Apply: estimate $\sigma^2$ with SSE/(n - p - 1) and $\gamma_i$ with $\hat\gamma_i$. Set $h_i = 0$ if the t-statistic is less than 1 in absolute value.

"testimator" - see also Sclove (JASA 1968) and Copas (JRSS B 1983)
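A sketch of the testimator rule in the canonical coordinates (continuing the simulated objects above; sigma^2 is estimated by the residual mean square):

fit <- lm(Y ~ X)
s2  <- summary(fit)$sigma^2                # SSE / (n - p - 1)
t.stat <- gamma.hat / sqrt(s2 / L^2)       # t_i = gamma-hat_i / se_i
gamma.test <- ifelse(abs(t.stat) < 1, 0, gamma.hat)  # h_i = 0 if |t_i| < 1
beta.test  <- V %*% gamma.test             # back to the beta scale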
Generalized Ridge

Instead of $\gamma_j \overset{iid}{\sim} N(0, \sigma^2/k)$, take $\gamma_j \overset{ind}{\sim} N(0, \sigma^2/k_j)$.

Then the condition of Goldstein & Smith becomes

$\gamma_i^2 < \sigma^2\left[\frac{2}{k_i} + \frac{1}{l_i^2}\right]$

If $l_i$ is small, almost any $k_i$ will improve over OLS; if $l_i^2$ is large, then only very small values of $k_i$ will give an improvement.

Prior on $k_i$? Induced prior on $\beta$?

$\gamma_j \overset{ind}{\sim} N(0, \sigma^2/k_j) \;\Longrightarrow\; \beta \sim N(0, \sigma^2 V K^{-1} V^T)$

which is not diagonal. Loss of invariance.
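The displayed condition follows from the Goldstein & Smith bound with the generalized-ridge shrinkage factor (a derivation sketch): here $h_i = l_i^2/(l_i^2 + k_i)$, so

$\frac{1 + h_i}{1 - h_i} = \frac{2 l_i^2 + k_i}{k_i}, \qquad \gamma_i^2 < \frac{\sigma^2}{l_i^2} \cdot \frac{2 l_i^2 + k_i}{k_i} = \sigma^2\left[\frac{2}{k_i} + \frac{1}{l_i^2}\right]$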
Related: Regression on PCA

Principal components of X may be obtained via the singular value decomposition $X = U_p L V^T$; the $l_i^2 = \lambda_i$ are the eigenvalues of $X^T X$.

$Y = \mathbf{1}\alpha + U_p L V^T \beta + \epsilon = \mathbf{1}\alpha + F\gamma + \epsilon$

The columns $F_i = l_i U_i$ of $F = U_p L$ are the principal components of the multivariate data $X_1, \ldots, X_p$.

If the direction $F_i$ is ill-defined ($l_i = 0$, or $\lambda_i < \epsilon$), then we may decide to not use $F_i$ in the model. This is equivalent to setting

$\tilde\gamma_i = \hat\gamma_i \text{ if } l_i \ge \epsilon, \qquad \tilde\gamma_i = 0 \text{ if } l_i < \epsilon$
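A principal-components-regression sketch (continuing the simulated objects; the cutoff eps is an arbitrary choice):

eps  <- 0.1
keep <- L >= eps                     # retain well-determined directions
gamma.pcr <- gamma.hat * keep        # zero out components with small l_i
beta.pcr  <- V %*% gamma.pcr
Fmat <- Up %*% diag(L)               # principal components F = U_p L
coef(lm(Y ~ Fmat[, keep]))           # recovers the retained gamma-hat_i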
Summary

OLS can clearly be dominated by other estimators for estimating β.

This leads to Bayes-like estimators:
- choice of penalties or prior hyperparameters
- hierarchical model with prior on k_i
- shrinkage, dimension reduction & variable selection
- what loss function? Estimation versus prediction? Copas (1983)