Shrinkage Estimation in High Dimensions
1 Shrinkage Estimation in High Dimensions
Pavan Srinath and Ramji Venkataramanan, University of Cambridge, ITA 2016
2 The Estimation Problem
$\theta \in \mathbb{R}^n$ is a vector of parameters, to be estimated from an observation $y$:
$y = \theta + w$, i.e., $y_i = \theta_i + w_i$ with $w_i$ i.i.d. $N(0, 1)$
Loss function of estimator $\hat\theta(y)$ is $\|\theta - \hat\theta(y)\|^2$
The normalized risk of the estimator is $R(\theta, \hat\theta) = \frac{1}{n} E\left[\|\hat\theta(y) - \theta\|^2\right]$
The expectation is calculated with the density $p_\theta(y) = (2\pi)^{-n/2} e^{-\|y - \theta\|^2/2}$
3 Maximum Likelihood Estimator
$y = \theta + w$, with $w_i$ i.i.d. $N(0, 1)$
The obvious estimator $\hat\theta(y) = y$ is also the ML estimator. But...
$R(\theta, \hat\theta_{ML}) = 1$ for all $\theta \in \mathbb{R}^n$

4 Maximum Likelihood Estimator (cont.)
For $n > 2$, there are estimators that do strictly better for all $\theta$ (James-Stein 1961)
5 James-Stein Estimator
$\hat\theta_{JS} = \left[1 - \frac{n-2}{\|y\|^2}\right] y$
$\hat\theta_{JS}$ shrinks each $y_i$ towards 0. Its risk is
$R\left(\theta, \hat\theta_{JS}\right) = 1 - \frac{(n-2)^2}{n} E\left[\frac{1}{\|y\|^2}\right] < 1$
[Plot: $R(\theta, \hat\theta)/n$ versus $\|\theta\|$ for the ML and regular JS estimators; $n = 10$, $\theta_i = c$ for $i = 1, \ldots, 5$ and $\theta_i = -c$ for $i = 6, \ldots, 10$]
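As a concrete illustration, here is a minimal NumPy sketch of the James-Stein estimator (the variable names and the experiment setup are mine, not from the talk):

```python
import numpy as np

def james_stein(y):
    """James-Stein estimate: shrink y towards the all-zeros vector."""
    n = len(y)
    return (1.0 - (n - 2) / np.sum(y ** 2)) * y

rng = np.random.default_rng(0)
n = 100
theta = np.zeros(n)                      # true parameter sits at the attractor 0
y = theta + rng.standard_normal(n)       # y = theta + w, w_i ~ N(0, 1)
loss_js = np.mean((james_stein(y) - theta) ** 2)
loss_ml = np.mean((y - theta) ** 2)      # ML loss, close to 1 on average
```

With $\theta$ at the attractor, the JS loss is typically a small fraction of the ML loss; the gain fades as $\|\theta\|$ grows, which is the point of the plots that follow.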
6 Intuition
$y = \theta + w$, with $w_i$ i.i.d. $N(0, 1)$; $\hat\theta_{JS} = \left[1 - \frac{n-2}{\|y\|^2}\right] y$
Why should the estimate of $\theta_1$ depend on all the $y_i$'s?

7 Intuition (cont.)
Note that the loss function is $\frac{1}{n}\left((\theta_1 - \hat\theta_1)^2 + \cdots + (\theta_n - \hat\theta_n)^2\right)$

8 Intuition (cont.)
Empirical Bayes: If we assume $\theta_i$ i.i.d. $N(0, A)$, then $\hat\theta_{MMSE} = \left(1 - \frac{1}{1+A}\right) y$
If we don't know $A$, estimate $\frac{1}{1+A}$ by $\frac{n-2}{\|y\|^2}$, since $E\left[\frac{n-2}{\|y\|^2}\right] = \frac{1}{1+A}$
This gives $\hat\theta_{JS}$, but the surprise is that it beats ML for any $\theta$!
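The empirical-Bayes identity $E\left[\frac{n-2}{\|y\|^2}\right] = \frac{1}{1+A}$ is easy to check numerically; a quick Monte Carlo sketch (constants chosen by me):

```python
import numpy as np

rng = np.random.default_rng(1)
n, A = 1000, 3.0
# Under theta_i ~ N(0, A) and y = theta + w, marginally y ~ N(0, (1 + A) I_n)
vals = []
for _ in range(500):
    y = np.sqrt(1 + A) * rng.standard_normal(n)
    vals.append((n - 2) / np.sum(y ** 2))
estimate = np.mean(vals)   # should be close to 1 / (1 + A) = 0.25
```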
9 The Attracting Vector
$\hat\theta_{JS} = \left[1 - \frac{n-2}{\|y\|^2}\right] y$
$\hat\theta_{JS}$ shrinks $y$ towards the all-zeros vector $\mathbf{0}$
Risk is smaller when $\theta$ is closer to $\mathbf{0}$, i.e., when $\|\theta\|$ is small
[Plot: $R(\theta, \hat\theta)/n$ versus $\|\theta\|$ for the ML and regular JS estimators]
10 The Attracting Vector
In general, can shrink towards any vector $\alpha$:
$\hat\theta = \alpha + \left[1 - \frac{n-2}{\|y - \alpha\|^2}\right](y - \alpha)$
Risk of $\hat\theta$ decreases as $\|\theta - \alpha\|$ gets smaller

11 The Attracting Vector (cont.)
Lindley's estimator: based on the assumption that the $\theta_i$'s lie close to their average $\bar\theta$ ($\approx \bar y$):
$\hat\theta_L = \bar y \mathbf{1} + \left[1 - \frac{n-3}{\|y - \bar y \mathbf{1}\|^2}\right](y - \bar y \mathbf{1})$, where $\bar y = \frac{1}{n}\sum_i y_i$
$\hat\theta_L$ has been applied to baseball data, disease incidence data, ... [Efron-Morris '75]
Risk reduction of a JS-like estimator over ML is greatest when $\theta$ is close to the attracting vector
12 Positive-Part Version
$\hat\theta_L = \bar y \mathbf{1} + \left[1 - \frac{n-3}{\|y - \bar y \mathbf{1}\|^2}\right]_+ (y - \bar y \mathbf{1})$, where $\bar y = \frac{1}{n}\sum_i y_i$
When the shrinkage factor becomes negative, replace it by 0

13 Positive-Part Version (cont.)
[Plot: $R(\theta, \hat\theta)/n$ versus $\|\theta\|$ for the ML estimator and the positive-part JS and Lindley estimators; $n = 10$, $\theta_i = c$ for all $i$]
$\hat\theta_L$ does best when the $\theta_i$'s are close to their empirical mean $\bar\theta$
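A sketch of the positive-part Lindley estimator in the same NumPy style (the experiment setup is mine):

```python
import numpy as np

def lindley_plus(y):
    """Positive-part Lindley estimator: shrink y towards its empirical mean."""
    n = len(y)
    resid = y - y.mean()
    factor = max(0.0, 1.0 - (n - 3) / np.sum(resid ** 2))  # clip negative factor to 0
    return y.mean() + factor * resid

rng = np.random.default_rng(2)
theta = np.full(200, 3.0)                # all theta_i equal: ideal case for Lindley
y = theta + rng.standard_normal(200)
loss_lindley = np.mean((lindley_plus(y) - theta) ** 2)
loss_ml = np.mean((y - theta) ** 2)
```

When the components share a common mean, the residual $y - \bar y \mathbf{1}$ is almost pure noise, the shrinkage factor is near zero, and the estimate collapses towards $\bar y \mathbf{1}$, which is exactly the regime where Lindley's estimator wins.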
14 Another example
[Plot: $R(\theta, \hat\theta)/n$ versus $\|\theta\|$ for the ML estimator and the positive-part JS and Lindley estimators; $n = 10$, $\theta_i = c$ for $i \le 5$, and $\theta_i = -c$ for $6 \le i \le 10$]
What would be a good attracting vector for this example?
Shrink the $y_i$'s that are $> \bar y$ towards $c$, the rest towards $-c$.
But we don't know anything about $\theta$!
15 [Plots: components of $\theta$ in 2 clusters; components of $\theta$ in 1 cluster]
In the absence of prior information about $\theta$, how do we choose an attracting vector that is close to $\theta$?
Can we use the data to pick a good attracting vector tailored to the underlying $\theta$?

16 Idea:
1) Design a good estimator $\hat\theta_2$ for $\theta$'s whose components are roughly separable into two clusters
2) Then, from $y$, try to infer which is better: $\hat\theta_2$ or $\hat\theta_L$
17 A Two-Cluster Estimator
Define two clusters $C_1 := \{y_i : y_i > \bar y\}$, $C_2 := \{y_i : y_i \le \bar y\}$.
Shrink the $y_i$'s in $C_1$ towards $a_1$, the rest towards $a_2$.

18 A Two-Cluster Estimator (cont.)
Attracting vector: $\nu_2 = a_1 \left[1\{y_1 > \bar y\}, \ldots, 1\{y_n > \bar y\}\right]^T + a_2 \left[1\{y_1 \le \bar y\}, \ldots, 1\{y_n \le \bar y\}\right]^T$
The estimator is $\hat\theta_2 = \nu_2 + \left[1 - \frac{n}{\|y - \nu_2\|^2}\right]_+ (y - \nu_2)$
How to choose $a_1$ and $a_2$?
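A sketch of the two-cluster estimator with the attractor levels $a_1$, $a_2$ passed in by the caller (the next slides estimate them from $y$; here they are simply arguments, and the test setup is mine):

```python
import numpy as np

def two_cluster(y, a1, a2):
    """Shrink towards nu2: level a1 on coordinates above the mean, a2 on the rest."""
    n = len(y)
    nu2 = np.where(y > y.mean(), a1, a2)             # attracting vector
    resid = y - nu2
    factor = max(0.0, 1.0 - n / np.sum(resid ** 2))  # positive-part shrinkage
    return nu2 + factor * resid

rng = np.random.default_rng(3)
c = 4.0
theta = np.concatenate([np.full(100, c), np.full(100, -c)])  # two well-separated clusters
y = theta + rng.standard_normal(200)
loss_2 = np.mean((two_cluster(y, c, -c) - theta) ** 2)       # ideal attractors
loss_ml = np.mean((y - theta) ** 2)
```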
19 The Ideal Attractors
Attracting vector $\nu_2$ lies in a 2-d subspace spanned by $\left[1\{y_1 > \bar y\}, \ldots, 1\{y_n > \bar y\}\right]^T$ and $\left[1\{y_1 \le \bar y\}, \ldots, 1\{y_n \le \bar y\}\right]^T$
Key Idea: the best attractor is the vector in this subspace that is closest to $\theta$

20 The Ideal Attractors (cont.)
Projection of $\theta$ onto this subspace:
$\nu_2^* = a_1^* \left[1\{y_1 > \bar y\}, \ldots, 1\{y_n > \bar y\}\right]^T + a_2^* \left[1\{y_1 \le \bar y\}, \ldots, 1\{y_n \le \bar y\}\right]^T$
$a_1^* = \frac{\sum_{i=1}^n \theta_i 1\{y_i > \bar y\}}{\sum_{i=1}^n 1\{y_i > \bar y\}}$, $a_2^* = \frac{\sum_{i=1}^n \theta_i 1\{y_i \le \bar y\}}{\sum_{i=1}^n 1\{y_i \le \bar y\}}$
$a_1^*$, $a_2^*$ cannot be computed, but can we estimate them from $y$?
21 Estimating the Attractors
$\hat a_1 = \frac{\sum_{i=1}^n y_i 1\{y_i > \bar y\} - \frac{1}{2\delta}\sum_{i=1}^n 1\{|y_i - \bar y| \le \delta\}}{\sum_{i=1}^n 1\{y_i > \bar y\}}$, $\hat a_2 = \frac{\sum_{i=1}^n y_i 1\{y_i \le \bar y\} + \frac{1}{2\delta}\sum_{i=1}^n 1\{|y_i - \bar y| \le \delta\}}{\sum_{i=1}^n 1\{y_i \le \bar y\}}$
$\delta$ is a constant that should be chosen small but $\gg 1/n$
Concentration of $\hat a_1$, $\hat a_2$: for any $0 < \epsilon < 1$,
$P\left(\left|\hat a_1 - \left(a_1^* + \kappa_1 \delta + o(\delta)\right)\right| \ge \epsilon\right) \le K e^{-nk\epsilon^2}$
$P\left(\left|\hat a_2 - \left(a_2^* + \kappa_2 \delta + o(\delta)\right)\right| \ge \epsilon\right) \le K' e^{-nk'\epsilon^2}$
22 Risk of Two-Cluster Estimator
$\hat\theta_2 = \nu_2 + \left[1 - \frac{n}{\|y - \nu_2\|^2}\right]_+ (y - \nu_2)$
Theorem: The loss function of the two-cluster estimator $\hat\theta_2$ satisfies:
(1) For any $0 < \epsilon < 1$,
$P\left(\left|\frac{1}{n}\|\theta - \hat\theta_2\|^2 - \left[\min\left(\beta_n, \frac{\beta_n}{\alpha_n + 1}\right) + \kappa_n \delta + o(\delta)\right]\right| \ge \epsilon\right) \le K e^{-nk\epsilon^2}$
where $\alpha_n$, $\beta_n$ are explicit constants that depend on $\theta$

23 Risk of Two-Cluster Estimator (cont.)
(2) For a sequence of $\theta$ with increasing dimension $n$, if $\limsup_n \|\theta\|^2/n < \infty$, we have
$\lim_{n\to\infty} \left(R\left(\theta, \hat\theta_2\right) - \left[\min\left(\beta_n, \frac{\beta_n}{\alpha_n + 1}\right) + \kappa_n \delta + o(\delta)\right]\right) = 0$
24 Risk of Lindley's Estimator
Note that $\hat\theta_L$ is a one-cluster estimator: the best attractor in the 1-d subspace spanned by $\mathbf{1}$ is $\bar\theta \mathbf{1}$; $\hat\theta_L$ approximates it by $\bar y \mathbf{1}$
Corollary: The loss function of Lindley's estimator satisfies:
(1) For any $0 < \epsilon < 1$,
$P\left(\left|\frac{\|\theta - \hat\theta_L\|^2}{n} - \frac{\rho_n}{\rho_n + 1}\right| \ge \epsilon\right) \le K e^{-nk\epsilon^2}$, where $\rho_n = \frac{\|\theta - \bar\theta \mathbf{1}\|^2}{n}$.
(2) If $\limsup_n \|\theta\|^2/n < \infty$, we have $\lim_{n\to\infty} \left(R\left(\theta, \hat\theta_L\right) - \frac{\rho_n}{\rho_n + 1}\right) = 0$.
25 Picking the Better Estimator
Depending on $\theta$, either $\hat\theta_2$ or $\hat\theta_L$ may have lower risk
[Plots: components of $\theta$ in 2 clusters (expect $\hat\theta_2$ to be better); components of $\theta$ in 1 cluster (expect $\hat\theta_L$ to be better)]
For $\theta$ whose components are roughly separable into two clusters, expect $\hat\theta_2$ to have lower risk than $\hat\theta_L$
For $\theta$ whose components are clustered around one value, expect $\hat\theta_L$ to have lower risk
How to pick the better estimator?

26 IDEA: Estimate the loss of each estimator from the data
27 Hybrid Estimator
Loss Estimates:
$\hat L(\theta, \hat\theta_L) = 1 - \frac{1}{g\left(\|y - \bar y \mathbf{1}\|^2 / n\right)}$
$\hat L(\theta, \hat\theta_2) = 1 - \frac{1 - \frac{1}{n\delta}\sum_{i=1}^n 1\{|y_i - \bar y| \le \delta\}\,(\hat a_1 - \hat a_2)}{g\left(\|y - \nu_2\|^2 / n\right)}$
where $g(x) = \max\{x, 1\}$
Hybrid Estimator: choose
$\hat\theta_{hyb} = \hat\theta_L$ if $\hat L(\theta, \hat\theta_L) \le \hat L(\theta, \hat\theta_2)$, and $\hat\theta_{hyb} = \hat\theta_2$ otherwise
Different from the approach in [George '86]: convex combinations of multiple shrinkage estimators with fixed attracting subspaces
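A simplified sketch of the hybrid selection rule. The slide's loss estimate for $\hat\theta_2$ carries a $\delta$-dependent correction term; this sketch drops it and uses the plain plug-in estimate $1 - 1/g(\|y - \nu\|^2/n)$ for both candidates, and it sets the cluster levels to the naive within-cluster means rather than the corrected estimates $\hat a_1$, $\hat a_2$:

```python
import numpy as np

def shrink(y, nu, c):
    """Positive-part shrinkage of y towards attractor nu, with numerator c."""
    factor = max(0.0, 1.0 - c / np.sum((y - nu) ** 2))
    return nu + factor * (y - nu)

def loss_estimate(y, nu):
    """Plug-in loss estimate 1 - 1/g(||y - nu||^2 / n), with g(x) = max(x, 1)."""
    n = len(y)
    return 1.0 - 1.0 / max(np.sum((y - nu) ** 2) / n, 1.0)

def hybrid(y):
    """Pick between the Lindley and two-cluster estimates via estimated losses."""
    n = len(y)
    ybar = y.mean()
    nu_L = np.full(n, ybar)                  # one-cluster attractor
    upper = y > ybar                         # naive cluster split at the mean
    nu_2 = np.where(upper, y[upper].mean(), y[~upper].mean())
    if loss_estimate(y, nu_L) <= loss_estimate(y, nu_2):
        return shrink(y, nu_L, n - 3)
    return shrink(y, nu_2, n)

rng = np.random.default_rng(4)
theta = np.concatenate([np.full(100, 4.0), np.full(100, -4.0)])
y = theta + rng.standard_normal(200)
loss_hyb = np.mean((hybrid(y) - theta) ** 2)
loss_ml = np.mean((y - theta) ** 2)
```

On a clearly two-clustered $\theta$ the estimated loss of the two-cluster attractor is much smaller, so the rule selects $\hat\theta_2$; on a single cluster the comparison flips.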
28 Performance of Hybrid Estimator
Theorem: The loss function of the hybrid JS-estimator satisfies:
(1) For any $0 < \epsilon < 1$,
$P\left(\frac{1}{n}\|\theta - \hat\theta_{hyb}\|^2 \ge \min\left\{\frac{1}{n}\|\theta - \hat\theta_L\|^2, \frac{1}{n}\|\theta - \hat\theta_2\|^2\right\} + \epsilon\right) \le K e^{-nk\epsilon^2}$
where $K$ and $k$ are positive constants.
(2) For a sequence of $\theta$ with increasing dimension $n$, if $\limsup_n \|\theta\|^2/n < \infty$, we have
$\lim_{n\to\infty} \left(R(\theta, \hat\theta_{hyb}) - \min\left\{R(\theta, \hat\theta_L), R(\theta, \hat\theta_2)\right\}\right) = 0$
Proof involves showing $\hat L(\theta, \hat\theta_L)$, $\hat L(\theta, \hat\theta_2)$ concentrate around $\frac{1}{n}\|\theta - \hat\theta_L\|^2$, $\frac{1}{n}\|\theta - \hat\theta_2\|^2$, respectively.
29 Simulation Results ($n = 50$)
[Plot: $R(\theta, \hat\theta)/n$ versus $\tau$ for the regular JS, Lindley, two-cluster JS, hybrid JS, and ML estimators]
$\theta_i$'s arranged in 2 clusters, one centered at $\tau$, the other at $-\tau$
Each cluster has width $\tau/2$; placement of points within a cluster is uniformly random

30 Simulation Results ($n = 1000$)
[Plot: same estimators and setup as above, with $n = 1000$]
$\theta_i$'s arranged in 2 clusters, one centered at $\tau$, the other at $-\tau$
Each cluster has width $\tau/2$; placement of points within a cluster is uniformly random

31 Simulation Results ($n = 1000$)
[Plot: $R(\theta, \hat\theta)/n$ versus $\tau$ for the regular JS, Lindley, two-cluster JS, hybrid JS, and ML estimators]
$\theta_i$'s are uniformly placed between $-\tau$ and $\tau$
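The qualitative behaviour in these plots, with the regular JS estimator giving almost no gain once the clusters move away from 0, can be reproduced with a small Monte Carlo (setup simplified to zero-width clusters; all constants are mine):

```python
import numpy as np

rng = np.random.default_rng(5)

def risk(estimator, theta, trials=200):
    """Monte Carlo estimate of the normalized risk E||est(y) - theta||^2 / n."""
    losses = [
        np.mean((estimator(theta + rng.standard_normal(len(theta))) - theta) ** 2)
        for _ in range(trials)
    ]
    return float(np.mean(losses))

def james_stein(y):
    return (1.0 - (len(y) - 2) / np.sum(y ** 2)) * y

tau, n = 5.0, 100
theta = np.concatenate([np.full(n // 2, tau), np.full(n // 2, -tau)])
r_js = risk(james_stein, theta)   # close to the ML risk of 1: theta is far from 0
```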
32 Summary
$\hat\theta = \nu + \left[1 - \frac{n}{\|y - \nu\|^2}\right]_+ (y - \nu)$
To achieve significant risk reduction over ML, a shrinkage estimator needs an attracting vector $\nu$ that is close to $\theta$. Can we find a good $\nu(y)$, without any knowledge about $\theta$?
The proposed estimator infers the clustering structure of $\theta$ and picks a good target subspace for the attractor.
Can test for up to $L$ clusters for any integer $L \ge 2$, and pick the best one based on the loss estimate.
Provided concentration results for the loss function and convergence results for the risk.
Future Work: test performance on real data sets; more general multi-dimensional target subspaces than cluster-based ones.
Paper: Cluster-Seeking James-Stein Estimators
MGMT 69000: Topics in High-dimensional Data Analysis Falll 2016 Lecture 14: Information Theoretic Methods Lecturer: Jiaming Xu Scribe: Hilda Ibriga, Adarsh Barik, December 02, 2016 Outline f-divergence
More informationSequential Implementation of Monte Carlo Tests with Uniformly Bounded Resampling Risk
Sequential Implementation of Monte Carlo Tests with Uniformly Bounded Resampling Risk Axel Gandy Department of Mathematics Imperial College London a.gandy@imperial.ac.uk user! 2009, Rennes July 8-10, 2009
More informationSupplementary Material to Clustering in Complex Latent Variable Models with Many Covariates
Statistica Sinica: Supplement Supplementary Material to Clustering in Complex Latent Variable Models with Many Covariates Ya Su 1, Jill Reedy 2 and Raymond J. Carroll 1,3 1 Department of Statistics, Texas
More informationRandom Locations of Periodic Stationary Processes
Random Locations of Periodic Stationary Processes Yi Shen Department of Statistics and Actuarial Science University of Waterloo Joint work with Jie Shen and Ruodu Wang May 3, 2016 The Fields Institute
More informationLecture 8: Information Theory and Statistics
Lecture 8: Information Theory and Statistics Part II: Hypothesis Testing and I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 23, 2015 1 / 50 I-Hsiang
More informationL p Functions. Given a measure space (X, µ) and a real number p [1, ), recall that the L p -norm of a measurable function f : X R is defined by
L p Functions Given a measure space (, µ) and a real number p [, ), recall that the L p -norm of a measurable function f : R is defined by f p = ( ) /p f p dµ Note that the L p -norm of a function f may
More informationComputational and Statistical Tradeoffs via Convex Relaxation
Computational and Statistical Tradeoffs via Convex Relaxation Venkat Chandrasekaran Caltech Joint work with Michael Jordan Time-constrained Inference o Require decision after a fixed (usually small) amount
More informationAnswer Key for STAT 200B HW No. 7
Answer Key for STAT 200B HW No. 7 May 5, 2007 Problem 2.2 p. 649 Assuming binomial 2-sample model ˆπ =.75, ˆπ 2 =.6. a ˆτ = ˆπ 2 ˆπ =.5. From Ex. 2.5a on page 644: ˆπ ˆπ + ˆπ 2 ˆπ 2.75.25.6.4 = + =.087;
More informationA D VA N C E D P R O B A B I L - I T Y
A N D R E W T U L L O C H A D VA N C E D P R O B A B I L - I T Y T R I N I T Y C O L L E G E T H E U N I V E R S I T Y O F C A M B R I D G E Contents 1 Conditional Expectation 5 1.1 Discrete Case 6 1.2
More informationLeast Squares Regression
E0 70 Machine Learning Lecture 4 Jan 7, 03) Least Squares Regression Lecturer: Shivani Agarwal Disclaimer: These notes are a brief summary of the topics covered in the lecture. They are not a substitute
More informationAsymmetric least squares estimation and testing
Asymmetric least squares estimation and testing Whitney Newey and James Powell Princeton University and University of Wisconsin-Madison January 27, 2012 Outline ALS estimators Large sample properties Asymptotic
More informationIEOR E4703: Monte-Carlo Simulation
IEOR E4703: Monte-Carlo Simulation Output Analysis for Monte-Carlo Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com Output Analysis
More informationPhase Transitions for Dilute Particle Systems with Lennard-Jones Potential
Weierstrass Institute for Applied Analysis and Stochastics Phase Transitions for Dilute Particle Systems with Lennard-Jones Potential Wolfgang König TU Berlin and WIAS joint with Andrea Collevecchio (Venice),
More informationMachine Learning. Gaussian Mixture Models. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall
Machine Learning Gaussian Mixture Models Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 1 The Generative Model POV We think of the data as being generated from some process. We assume
More informationGraduate Econometrics I: Asymptotic Theory
Graduate Econometrics I: Asymptotic Theory Yves Dominicy Université libre de Bruxelles Solvay Brussels School of Economics and Management ECARES Yves Dominicy Graduate Econometrics I: Asymptotic Theory
More information