Peter Hoff
Minimax estimation
October 31


Contents

1 Motivation and definition
2 Least favorable prior
3 Least favorable prior sequence
4 Nonparametric problems
5 Minimax and admissibility
6 Superefficiency and sparsity

Most of this material comes from Lehmann and Casella [1998], Section 5.1, and some comes from Ferguson [1967].

1 Motivation and definition

Let $X \sim \text{binomial}(n, \theta)$, and let $\bar X = X/n$. Via admissibility of unique Bayes estimators,
$$\delta_{w,\theta_0}(X) = w \bar X + (1 - w)\theta_0$$
is admissible under squared error loss for all $w \in (0, 1)$ and $\theta_0 \in (0, 1)$. Its risk is
$$R(\theta, \delta_{w,\theta_0}) = \mathrm{Var}[\delta_{w,\theta_0}] + \mathrm{Bias}^2[\delta_{w,\theta_0}] = w^2 \mathrm{Var}[\bar X] + (1-w)^2(\theta - \theta_0)^2 = w^2\,\theta(1-\theta)/n + (1-w)^2(\theta - \theta_0)^2.$$
Risk functions for three such estimators are shown in Figure 1. Requiring admissibility is not enough to identify a unique procedure.
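A minimal Python sketch of this risk calculation (illustrative only; the grid of $\theta$ values is an arbitrary choice), evaluating the three curves plotted in Figure 1:

```python
import numpy as np

def risk_linear(theta, w, theta0, n):
    """Risk of delta(X) = w*Xbar + (1 - w)*theta0 under squared error loss."""
    return w**2 * theta * (1 - theta) / n + (1 - w)**2 * (theta - theta0)**2

n, w = 10, 0.8
theta = np.linspace(0, 1, 5)          # coarse grid, just to print a few values
for theta0 in (0.25, 0.50, 0.75):
    vals = risk_linear(theta, w, theta0, n)
    print(f"theta0 = {theta0}: risk at theta = {theta.round(2)} is {vals.round(4)}")
```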

[Figure 1: Three risk functions for estimating a binomial proportion when $n = 10$, $w = 0.8$ and $\theta_0 \in \{1/4, 2/4, 3/4\}$.]

This is good; we can require more of our estimator. This is bad: what should we require?

One idea is to avoid a "worst case scenario": evaluate procedures by their maximum risk.

Definition 1 (minimax risk, minimax estimator). The minimax risk is defined as
$$R_m(\Theta) = \inf_{\delta} \sup_{\theta \in \Theta} R(\theta, \delta).$$
An estimator $\delta_m$ is a minimax estimator of $\theta$ if
$$\sup_{\theta \in \Theta} R(\theta, \delta_m) = \inf_{\delta} \sup_{\theta \in \Theta} R(\theta, \delta) = R_m(\Theta),$$
i.e. $\sup_\Theta R(\theta, \delta_m) \le \sup_\Theta R(\theta, \delta)$ for all $\delta \in D$.
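A short sketch of the maximum-risk comparison (illustrative; the supremum is approximated by a grid maximum): within the three-estimator family of Figure 1, the minimax criterion prefers whichever has the smallest maximum.

```python
import numpy as np

n, w = 10, 0.8
theta = np.linspace(0, 1, 2001)

def sup_risk(w, theta0):
    """Approximate sup-risk of w*Xbar + (1 - w)*theta0 by maximizing over a theta grid."""
    R = w**2 * theta * (1 - theta) / n + (1 - w)**2 * (theta - theta0)**2
    return R.max()

maxima = {theta0: sup_risk(w, theta0) for theta0 in (0.25, 0.50, 0.75)}
for theta0, m in maxima.items():
    print(f"theta0 = {theta0}: sup-risk ~= {m:.4f}")
print("smallest maximum risk among these three: theta0 =", min(maxima, key=maxima.get))
```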

2 Least favorable prior

Identifying a minimax estimator seems difficult: one would need to minimize the supremum risk over all estimators. However, in many cases we can identify a minimax estimator using some intuition:

- Suppose some values of $\theta$ are harder to estimate than others. Example: for the binomial model $\mathrm{Var}[X/n] = \theta(1-\theta)/n$, so $\theta = 1/2$ is hard to estimate well.
- Consider a prior $\pi(\theta)$ that heavily weights the difficult values of $\theta$. The Bayes estimator $\delta_\pi$ will do well for these difficult values, where the supremum risk of most estimators is likely to occur:
$$R(\pi, \delta_\pi) = \int R(\theta, \delta_\pi)\,\pi(d\theta) \le \int R(\theta, \delta)\,\pi(d\theta) = R(\pi, \delta).$$
This means that $R(\theta, \delta_\pi) \le R(\theta, \delta)$ in places of high $\pi$-probability, i.e. the difficult parts of $\Theta$. Since $\delta_\pi$ does well in the difficult region, maybe it is minimax.

Definition 2 (least favorable prior). A prior distribution $\pi$ is least favorable if
$$R(\pi, \delta_\pi) \ge R(\tilde\pi, \delta_{\tilde\pi}) \quad \text{for all priors } \tilde\pi \text{ on } \Theta.$$

Intuition: $\delta_\pi$ is the best you can do under the worst prior.

Analogy to competitive games: you get to choose an estimator $\delta$, your adversary gets to choose a prior $\pi$. Your adversary wants you to have high loss, you want to have low loss. For any $\pi$, your best strategy is $\delta_\pi$. Your adversary's best strategy is then for $\pi$ to be least favorable.
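To see the adversary's side of the game numerically, the following sketch (illustrative; it integrates the frequentist risk of the posterior mean against the prior on a truncated grid) computes the Bayes risk $R(\pi, \delta_\pi)$ for a few symmetric beta($a$, $a$) priors with $n = 10$; larger Bayes risk corresponds to a harder prior. The value $a = \sqrt n/2$ is included because it turns out later in these notes to be least favorable.

```python
import numpy as np
from scipy.stats import beta

n = 10
theta = np.linspace(0.001, 0.999, 1999)   # interior grid keeps the beta(1/2, 1/2) density finite

def bayes_risk(a):
    """Bayes risk of the posterior-mean estimator (X + a)/(n + 2a) under a beta(a, a) prior."""
    # frequentist risk (variance + squared bias) of the posterior mean under squared error loss
    R = (n * theta * (1 - theta) + a**2 * (1 - 2 * theta)**2) / (n + 2 * a)**2
    return np.trapz(R * beta.pdf(theta, a, a), theta)   # integrate R(theta) against the prior

for a in (0.5, 1.0, np.sqrt(n) / 2, 5.0):
    print(f"a = {a:.3f}: Bayes risk ~= {bayes_risk(a):.5f}")
```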

Bounding the minimax risk: The least favorable prior provides a lower bound on the minimax risk $R_m(\Theta)$. For any prior $\pi$ over $\Theta$ and any estimator $\delta$,
$$R(\pi, \delta) = \int R(\theta, \delta)\,\pi(d\theta) \le \int \Big[\sup_{\theta'} R(\theta', \delta)\Big]\,\pi(d\theta) = \sup_{\theta} R(\theta, \delta).$$
Minimizing over all estimators $\delta \in D$, we have
$$R(\pi, \delta_\pi) = \inf_\delta R(\pi, \delta) \le \inf_\delta \sup_\theta R(\theta, \delta) = R_m(\Theta).$$
The Bayes risk of any prior gives a lower bound for the minimax risk. Maximizing over $\pi$ gives the sharpest lower bound:
$$\sup_\pi R(\pi, \delta_\pi) \le R_m(\Theta).$$

Finding the LFP and minimax estimator: For any prior $\pi$, we have shown
$$R(\pi, \delta_\pi) \le \inf_\delta \sup_\theta R(\theta, \delta).$$
On the other hand, for any estimator $\delta_\pi$, we have
$$\inf_\delta \sup_\theta R(\theta, \delta) \le \sup_\theta R(\theta, \delta_\pi).$$
Putting these together gives
$$R(\pi, \delta_\pi) \le \inf_\delta \sup_\theta R(\theta, \delta) \le \sup_\theta R(\theta, \delta_\pi).$$
Therefore, $\delta_\pi$ achieves the minimax risk if $R(\pi, \delta_\pi) = \sup_\theta R(\theta, \delta_\pi)$.
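A quick numerical check of this sandwich for one particular prior (illustrative sketch; uniform prior, $n = 10$): the Bayes risk sits strictly below the supremum risk of the corresponding Bayes estimator, so the certificate $R(\pi, \delta_\pi) = \sup_\theta R(\theta, \delta_\pi)$ fails for this prior.

```python
import numpy as np

n, a = 10, 1.0                          # uniform prior corresponds to beta(1, 1)
theta = np.linspace(0.0, 1.0, 2001)

# frequentist risk of the posterior-mean estimator (X + a)/(n + 2a) under squared error loss
R = (n * theta * (1 - theta) + a**2 * (1 - 2 * theta)**2) / (n + 2 * a)**2
bayes_risk = np.trapz(R, theta)         # uniform prior density is 1 on (0, 1)
print(f"R(pi, delta_pi)       ~= {bayes_risk:.5f}")
print(f"sup_theta R(theta, .) ~= {R.max():.5f}   (strictly larger)")
```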

Theorem 1 (LC 5.1.4). Let $\delta_\pi$ be (unique) Bayes for $\pi$, and suppose $R(\pi, \delta_\pi) = \sup_\theta R(\theta, \delta_\pi)$. Then
1. $\delta_\pi$ is (unique) minimax;
2. $\pi$ is least favorable.

Proof.
1. For any other estimator $\delta$,
$$\sup_\theta R(\theta, \delta) \ge \int R(\theta, \delta)\,\pi(d\theta) \ge \int R(\theta, \delta_\pi)\,\pi(d\theta) = \sup_\theta R(\theta, \delta_\pi).$$
If $\delta_\pi$ is unique Bayes under $\pi$, then the second inequality is strict and $\delta_\pi$ is unique minimax.
2. Let $\tilde\pi$ be a prior over $\Theta$. Then
$$R(\tilde\pi, \delta_{\tilde\pi}) = \int R(\theta, \delta_{\tilde\pi})\,\tilde\pi(d\theta) \le \int R(\theta, \delta_\pi)\,\tilde\pi(d\theta) \le \sup_\theta R(\theta, \delta_\pi) = R(\pi, \delta_\pi).$$

Notes: regarding the condition $R(\pi, \delta_\pi) = \sup_\theta R(\theta, \delta_\pi)$,
- the condition is sufficient but not necessary for $\delta_\pi$ to be minimax;
- the condition is very restrictive - it is met only if
$$\pi\big(\{\theta : R(\theta, \delta_\pi) = \sup_{\theta'} R(\theta', \delta_\pi)\}\big) = 1.$$

Bayes estimators that satisfy this condition are sometimes called equalizer rules:

Definition 3. An estimator $\delta$ is an equalizer rule with respect to $\pi$ if $\delta$ is Bayes with respect to $\pi$ and
$$R(\theta, \delta) = \sup_{\theta'} R(\theta', \delta) \quad \text{a.e. } \pi.$$

The Theorem implies that equalizer rules are minimax:

Corollary 1 (LC cor 5.1.6). An equalizer rule is minimax.

In the definition of an equalizer rule, it is not enough that $R(\theta, \delta)$ is constant a.e. $\pi$. For example, if the supremum risk of an estimator occurs on a set of $\pi$-measure zero, then generally $R(\pi, \delta) < \sup_\theta R(\theta, \delta)$ and the conditions of the theorem do not hold.

To summarize: we need $R(\pi, \delta_\pi) = \sup_\theta R(\theta, \delta_\pi)$.
- This condition will be met if $\delta_\pi$ has constant risk.
- This condition will not be met if $\delta_\pi$ has constant risk a.e. $\pi$, but higher risk on a set of $\pi$-measure zero.
- This condition will be met if $\delta_\pi$ achieves its maximum risk on a set of $\pi$-probability 1.

Example (binomial proportion): $X \sim \text{binomial}(n, \theta)$, estimate $\theta$ with squared error loss. Let's try to find a Bayes estimator with constant risk. Can this be done with conjugate priors? Under a beta($a$, $b$) prior, we have
$$\delta_{ab}(X) = \frac{a + X}{a + b + n},$$
$$R(\theta, \delta_{ab}) = \mathrm{Var}[\delta_{ab}] + \mathrm{Bias}^2[\delta_{ab}] = \frac{n\theta(1-\theta)}{(a+b+n)^2} + \frac{(a - \theta(a+b))^2}{(a+b+n)^2}.$$
Can we make this constant as a function of $\theta$? The numerator is
$$n\theta - n\theta^2 + (a+b)^2\theta^2 - 2a(a+b)\theta + c(a, b, n).$$

This will be constant in $\theta$ if
$$n = 2a(a+b) \quad \text{and} \quad n = (a+b)^2;$$
solving for $a$ and $b$ gives $a = b = \sqrt n/2$. Therefore, the estimator
$$\delta_{\sqrt n/2}(X) = \frac{X + \sqrt n/2}{n + \sqrt n} = \frac{\sqrt n}{\sqrt n + 1}\,\bar X + \frac{1}{\sqrt n + 1}\cdot\frac12$$
has constant risk and is Bayes, therefore an equalizer rule, and therefore minimax. beta($\sqrt n/2$, $\sqrt n/2$) is a least favorable prior.

Risk comparison:
$$R(\theta, \delta_{\sqrt n/2}(X)) = \frac{1}{(2(1 + \sqrt n))^2}, \qquad R(\theta, \bar X) = \frac{\theta(1-\theta)}{n}.$$
At $\theta = 1/2$ (a difficult value of $\theta$), we have
$$R(1/2, \bar X) = \frac{1}{4n} > \frac{1}{4(n + 2\sqrt n + 1)} = R(1/2, \delta_{\sqrt n/2}(X)).$$
Draw the picture. Notes:
- The region in which $R(\theta, \delta_{\sqrt n/2}(X)) < R(\theta, \bar X)$ decreases in size with $n$. Draw the picture.
- The least favorable prior is not unique. To see this, note that under any prior $\pi$,
$$\delta_\pi(x) = E[\theta \mid X = x] = \frac{\int \theta^{x+1}(1-\theta)^{n-x}\,\pi(d\theta)}{\int \theta^{x}(1-\theta)^{n-x}\,\pi(d\theta)},$$
and so the estimator only depends on the first $n + 1$ moments of $\theta$ under $\pi$.
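In place of the picture, a small numerical sketch (illustrative; grid and sample sizes are arbitrary) confirming the constant risk of $\delta_{\sqrt n/2}$ and locating the interval of $\theta$ on which it beats $\bar X$; the interval shrinks with $n$.

```python
import numpy as np

theta = np.linspace(0.001, 0.999, 1999)
for n in (10, 100):
    rn = np.sqrt(n)
    R_minimax = 1.0 / (2.0 * (1.0 + rn))**2          # constant risk of the equalizer rule
    R_xbar = theta * (1 - theta) / n                  # risk of Xbar
    better = theta[R_minimax < R_xbar]                # where the minimax rule has smaller risk
    print(f"n = {n:>3}: constant risk = {R_minimax:.5f}, "
          f"beats Xbar roughly for theta in ({better.min():.3f}, {better.max():.3f})")
```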

The minimax estimator is sensitive to the loss function: under $L(\theta, d) = (\theta - d)^2/[\theta(1-\theta)]$, $X/n$ is minimax (it is an equalizer rule under beta(1, 1)).

Example (difference in proportions): $X \sim \text{binomial}(n, \theta_x)$, $Y \sim \text{binomial}(n, \theta_y)$, estimate $\theta_y - \theta_x$. Can we proceed from the one-sample case? Consider
$$\frac{Y + \sqrt n/2}{n + \sqrt n} - \frac{X + \sqrt n/2}{n + \sqrt n} = \frac{Y - X}{n + \sqrt n}.$$
It turns out that this is not minimax. Consider estimators of the form $\delta(X, Y) = c\,(Y - X)$ for some $c \in (0, 1/n)$.

Strategy: For each $c$, where does $c(Y - X)$ have maximum risk? Is $c(Y - X)$ Bayes w.r.t. a prior that assigns probability 1 to points of maximum risk?
$$R(\theta, \delta_c) = \mathrm{Var}[\delta_c] + \big(E[c(Y - X)] - (\theta_y - \theta_x)\big)^2 = c^2 n[\theta_x(1-\theta_x) + \theta_y(1-\theta_y)] + (cn - 1)^2(\theta_y - \theta_x)^2.$$
Taking derivatives, the maximum risk is attained at $(\theta_x, \theta_y)$ satisfying
$$[2(cn-1)^2 - 2c^2 n]\,\theta_x - 2(cn-1)^2\,\theta_y = -c^2 n$$
$$[2(cn-1)^2 - 2c^2 n]\,\theta_y - 2(cn-1)^2\,\theta_x = -c^2 n.$$
We have two equations and two unknowns, and so for a given $c$, maximum risk is generally attained at some point $\theta_c = (\theta_{x,c}, \theta_{y,c})$. The theorem suggests trying a prior

$\pi_c(\theta) = 1(\theta = \theta_c)$. But then the Bayes estimator $\delta_{\pi_c}$ is the constant $\theta_{y,c} - \theta_{x,c}$, not $\delta_c$. Furthermore, $\delta_{\pi_c}$ has its minimum risk at $\theta_c$.

However, consider $\Theta_0 = \{\theta : \theta_x + \theta_y = 1\} \subset \Theta$. On this set, $\theta_y(1-\theta_y) = \theta_x(1-\theta_x)$ and $\theta_y - \theta_x = 1 - 2\theta_x$, and the risk function is
$$R(\theta, \delta_c) = c^2 n\big(2\theta_x(1-\theta_x)\big) + (cn-1)^2(1 - 2\theta_x)^2.$$
Is there a value of $c$ that gives this estimator constant risk (on $\Theta_0$)?
$$\frac{dR(\theta_x, c(Y - X))}{d\theta_x} = \frac{d}{d\theta_x}\Big[c^2 n\, 2(\theta_x - \theta_x^2) + (cn-1)^2(1 - 4\theta_x + 4\theta_x^2)\Big] = c^2 n\, 2(1 - 2\theta_x) - (cn-1)^2\, 4(1 - 2\theta_x).$$
Setting this to zero for all $\theta_x$ and solving for $c$, we have
$$c^2 n = 2(cn - 1)^2 \;\Longrightarrow\; \pm\sqrt n\, c = \sqrt 2\,(cn - 1) \;\Longrightarrow\; \pm\sqrt{n/2} = n - 1/c \;\Longrightarrow\; c = \frac{1}{n \mp \sqrt{n/2}} = \frac{1}{n}\cdot\frac{\sqrt{2n}}{\sqrt{2n} \mp 1}.$$
Using the "$-$" solution (the one with $\sqrt{2n} - 1$ in the denominator) could give estimators outside the parameter space, so consider the "$+$" solution:
$$c_e = \frac{1}{n}\cdot\frac{\sqrt{2n}}{\sqrt{2n} + 1}, \qquad \delta_e(X, Y) = \frac{\sqrt{2n}}{\sqrt{2n} + 1}\,(Y - X)/n,$$
i.e. $\delta_e$ is a shrunken version of the UMVUE. By construction, this estimator has constant risk on $\{\theta : \theta_x + \theta_y = 1\}$. Does it achieve its maximum risk on this set? Recall the risk function is
$$R(\theta, \delta_c) = c^2 n[\theta_x(1-\theta_x) + \theta_y(1-\theta_y)] + (cn-1)^2(\theta_y - \theta_x)^2 = c^2 n\Big(\theta_x(1-\theta_x) + \theta_y(1-\theta_y) + (\theta_y - \theta_x)^2/2\Big),$$

where we are using the fact that $2(cn-1)^2 = c^2 n$ for our risk-equalizing value of $c$. Taking derivatives, we have that the maximum risk occurs when
$$c^2 n\big[(1 - 2\theta_x) + (\theta_x - \theta_y)\big] = c^2 n(1 - \theta_x - \theta_y) = 0$$
$$c^2 n\big[(1 - 2\theta_y) + (\theta_y - \theta_x)\big] = c^2 n(1 - \theta_x - \theta_y) = 0,$$
which are both satisfied when $\theta_x + \theta_y = 1$. So far, we have shown:
- $\delta_e$ has constant risk on $\Theta_0 = \{\theta : \theta_x + \theta_y = 1\}$;
- $\delta_e$ achieves its maximum risk on this set.
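Both claims are easy to check numerically; the sketch below (illustrative, with $n = 10$ and an arbitrary grid) evaluates $R(\theta, \delta_e)$ over $(\theta_x, \theta_y)$ and confirms that the risk is constant along $\theta_x + \theta_y = 1$ and that the overall maximum is attained there.

```python
import numpy as np

n = 10
c = np.sqrt(2 * n) / (n * (np.sqrt(2 * n) + 1))     # the risk-equalizing choice c_e

tx, ty = np.meshgrid(np.linspace(0, 1, 401), np.linspace(0, 1, 401))
R = c**2 * n * (tx * (1 - tx) + ty * (1 - ty)) + (c * n - 1)**2 * (ty - tx)**2

on_line = np.isclose(tx + ty, 1.0)                  # grid points with theta_x + theta_y = 1
print("risk range on Theta_0:", R[on_line].min(), "to", R[on_line].max())   # essentially constant
print("overall maximum risk :", R.max())            # equals the constant value on Theta_0
```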

To show that it is minimax, what remains is to show that it is Bayes with respect to a prior $\pi$ on $\Theta_0$. If it is, then the condition $R(\pi, \delta_e) = \sup_{\theta \in \Theta} R(\theta, \delta_e)$ will be met and $\delta_e$ will therefore be minimax.

Consider estimation of $\theta_y = 1 - \theta_x$ from $X$ and $Y$. For what prior on $\theta_y$ will $\delta_e$ be Bayes? Note that on $\Theta_0$, $n + Y - X \sim \text{binomial}(2n, \theta_y)$. Recall that beta($\sqrt n/2$, $\sqrt n/2$) is least favorable for estimating a binomial proportion with $n$ samples. Here our sample size is $2n$, so let's see if beta($\sqrt{2n}/2$, $\sqrt{2n}/2$) is least favorable:
$$\pi(\theta_y \mid n + Y - X) = \text{dbeta}\big(n + Y - X + \sqrt{2n}/2,\; n + X - Y + \sqrt{2n}/2\big)$$
$$E[\theta_y \mid n + Y - X] = \frac{n + Y - X + \sqrt{2n}/2}{2n + \sqrt{2n}}.$$
Now recall we are trying to estimate $\theta_y - \theta_x$, which on $\Theta_0$ is equal to $2\theta_y - 1$.

Under squared error loss, the Bayes estimator of $2\theta_y - 1$ is $2E[\theta_y \mid n + Y - X] - 1$, which is
$$2E[\theta_y \mid n + Y - X] - 1 = \frac{2(n + Y - X) + \sqrt{2n} - (2n + \sqrt{2n})}{2n + \sqrt{2n}} = \frac{2(Y - X)}{2n + \sqrt{2n}} = \frac{\sqrt{2n}}{\sqrt{2n} + 1}\,(Y - X)/n = \delta_e.$$
So we have shown:
- $\delta_e$ is Bayes with respect to a prior $\pi$ on $\Theta_0 \subset \Theta$;
- $\delta_e$ has constant risk a.e. $\pi$;
- $\sup_\Theta R(\theta, \delta_e)$ is achieved on $\Theta_0$.
Therefore, $\delta_e$ is an equalizer rule, and hence minimax.

3 Least favorable prior sequence

In many problems there are no least favorable priors, and the main theorem from above is of no help.

Example (normal mean, known variance): $X_1, \ldots, X_n \sim$ i.i.d. normal($\theta$, $\sigma^2$), $\sigma^2$ known. The estimator $\bar X$ has constant risk $R(\theta, \bar X) = \sigma^2/n$, so it seems potentially minimax. But no prior $\pi$ over $\Theta$ gives $\delta_\pi(X) = \bar X$. You can also see that there is no least favorable prior: under any prior $\pi$,
$$R(\pi, \delta_\pi) < R(\pi, \bar X) = \sigma^2/n,$$

but you can find priors whose Bayes risk is arbitrarily close to $\sigma^2/n$ (i.e., the set of Bayes risks is not closed). However, $\bar X$ is a limit of Bayes estimators: under the prior $\theta \sim$ normal($0$, $\tau^2$), as $\tau^2 \to \infty$,
$$\delta_{\tau^2} = \frac{n/\sigma^2}{n/\sigma^2 + 1/\tau^2}\,\bar X \to \bar X, \qquad R(\pi_{\tau^2}, \delta_{\tau^2}) = \frac{1}{n/\sigma^2 + 1/\tau^2} \to \sigma^2/n = R(\theta, \bar X).$$

More generally, consider
- $(\pi_k, \delta_k)$, a sequence of prior distributions and corresponding Bayes estimators such that $R(\pi_k, \delta_k) \to R$;
- $\delta$, an estimator such that $\sup_\theta R(\theta, \delta) = R$.

For each $k$, we have $R(\pi_k, \delta_k) \le R_m$. However, we always have $R_m \le \sup_\theta R(\theta, \delta)$. Putting these together gives
$$R(\pi_k, \delta_k) \le R_m \le \sup_\theta R(\theta, \delta).$$
If $R(\pi_k, \delta_k) \to \sup_\theta R(\theta, \delta)$, then we must have $R_m = \sup_\theta R(\theta, \delta)$, and so $\delta$ must be minimax.

Definition 4. A sequence $\{\pi_k\}$ of priors is least favorable if for every prior $\pi$,
$$R(\pi, \delta_\pi) \le \lim_{k\to\infty} R(\pi_k, \delta_k).$$
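A tiny numerical illustration of the least favorable sequence in the normal-mean example (the values of $n$, $\sigma^2$ and $\tau^2$ are arbitrary choices): the Bayes risks $1/(n/\sigma^2 + 1/\tau^2)$ increase toward $\sigma^2/n$, the constant risk of $\bar X$.

```python
n, sigma2 = 25, 1.0
for tau2 in (0.1, 1.0, 10.0, 100.0, 10_000.0):
    bayes_risk = 1.0 / (n / sigma2 + 1.0 / tau2)    # Bayes risk under a normal(0, tau2) prior
    print(f"tau^2 = {tau2:>9}: Bayes risk = {bayes_risk:.6f}")
print(f"constant risk of Xbar: sigma^2 / n = {sigma2 / n:.6f}")
```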

Theorem 2 (LC thm 5.1.12). Let $\{\pi_k\}$ be a sequence of prior distributions and $\delta$ an estimator such that $R(\pi_k, \delta_k) \to \sup_\theta R(\theta, \delta)$. Then
1. $\delta$ is minimax, and
2. $\{\pi_k\}$ is least favorable.

Proof.
1. For any estimator $\delta'$ and every $k$,
$$\sup_\theta R(\theta, \delta') \ge R(\pi_k, \delta') \ge R(\pi_k, \delta_k),$$
so $\sup_\theta R(\theta, \delta') \ge \lim_k R(\pi_k, \delta_k) = \sup_\theta R(\theta, \delta)$.
2. For any prior $\pi$,
$$R(\pi, \delta_\pi) \le R(\pi, \delta) \le \sup_\theta R(\theta, \delta) = \lim_k R(\pi_k, \delta_k).$$

Exercise: Prove the following (weaker) variant of the above theorem: any constant risk generalized Bayes estimator is minimax.

Example (normal mean, known variance): Let $X_1, \ldots, X_n \sim$ i.i.d. $N_p(\theta, \sigma^2 I)$, $\sigma^2$ known. Let $\pi_k(\theta)$ be such that $\theta \sim N_p(0, \tau_k^2 I)$ with $\tau_k^2 \to \infty$, and let
$$\delta_k = \frac{n/\sigma^2}{n/\sigma^2 + 1/\tau_k^2}\,\bar X.$$
Under average squared-error loss,
$$R(\pi_k, \delta_k) = \frac{1}{n/\sigma^2 + 1/\tau_k^2} \to \sigma^2/n = R(\theta, \bar X).$$
Thus $\bar X$ is minimax.

Alert! This result should seem a bit odd to you - $\bar X$ is inadmissible for $p \ge 3$. How can it be dominated if it is minimax? We even have the following theorem:

Theorem 3 (LC lemma). Any unique minimax estimator is admissible.

Proof. If $\delta$ is unique minimax and $\delta'$ is any other estimator, then
$$\sup_\theta R(\theta, \delta) < \sup_\theta R(\theta, \delta') \;\Rightarrow\; \exists\,\theta_0 : R(\theta_0, \delta) < R(\theta_0, \delta'),$$
so $\delta$ can't be dominated.

So how can $\bar X$ be minimax and not admissible? The only possibility left by the theorem is that $\bar X$ is not unique minimax. In fact, for $p \ge 3$, estimators of the form
$$\delta(x) = \left(1 - \frac{c(\|x\|)\,\sigma^2 (p-2)}{\|x\|^2}\right) x$$
are minimax as long as $c : \mathbb{R}_+ \to [0, 2]$ is nondecreasing (LC theorem 5.5.5). For these estimators, the supremum risk occurs in the limit as $\|\theta\| \to \infty$.

Example (normal mean, unknown variance): Let $X_1, \ldots, X_n \sim$ i.i.d. $N_p(\theta, \sigma^2 I)$, $\sigma^2 \in \mathbb{R}_+$ unknown. Then $\sup_{\theta, \sigma^2} R((\theta, \sigma^2), \bar X) = \infty$. In fact, we can prove that every estimator has infinite maximum risk:
$$\sup_{\theta, \sigma^2} R((\theta, \sigma^2), \delta) = \sup_{\sigma^2} \sup_\theta R(\theta, \delta(X) \mid \sigma^2) \ge \sup_{\sigma^2} \sigma^2/n = \infty,$$
where the inequality holds because $\bar X$ is minimax in the known-variance case. Therefore, every estimator is trivially minimax, with maximum risk of infinity.

Here is a not-entirely-satisfying solution to this problem: assume $\sigma^2 \le M$, $M$ known. Applying exactly the same argument as above, we have
$$\sup_{\theta, \sigma^2} R((\theta, \sigma^2), \delta) = \sup_{\sigma^2 \le M} \sup_\theta R(\theta, \delta(X) \mid \sigma^2) \ge \sup_{\sigma^2 \le M} \sigma^2/n = M/n,$$
and so $\bar X$ is minimax with maximum risk $M/n$.

Note that the value of $M$ doesn't affect the estimator - $\bar X$ is minimax no matter what $M$ is. Does this mean that $\bar X$ is the minimax estimator for $\sigma^2 \in \mathbb{R}_+$? Here are some thoughts:

- For $p = 1$, $\bar X$ is unique minimax for $\theta \in \mathbb{R}$, $\sigma^2 \in (0, M]$, for all $M$. For $\sigma^2 \in \mathbb{R}_+$ it is still minimax, although not unique.
- For $p > 2$, $\bar X$ is minimax, but not unique minimax, for $\sigma^2 \in (0, M]$ or even known $\sigma^2$.

A better solution to this problem is to change the loss function to $L((\theta, \sigma^2), d) = (\theta - d)^2/\sigma^2$, standardized squared-error loss.

Exercise: Show that $\bar X$ is minimax under standardized squared-error loss.

4 Nonparametric problems

The normal mean, unknown variance problem above gave some indication that we can deal with nuisance parameters (such as the variance $\sigma^2$) when obtaining a minimax estimator for a parameter of interest (such as the mean $\theta$). What about more general nuisance parameters?

Let $X \sim P \in \mathcal{P}$. Suppose we want to estimate $\theta = g(P)$, some functional of $P$. Suppose $\delta$ is minimax for $\theta = g(P)$ when $P \in \mathcal{P}_0 \subset \mathcal{P}$. When will it also be minimax for $P \in \mathcal{P}$?

Theorem (LC). If $\delta$ is minimax for $\theta$ when $P \in \mathcal{P}_0 \subset \mathcal{P}$ and
$$\sup_{P \in \mathcal{P}_0} R(P, \delta) = \sup_{P \in \mathcal{P}} R(P, \delta),$$
then $\delta$ is minimax for $\theta$ when $P \in \mathcal{P}$.

Proof. For any other estimator $\delta'$,
$$\sup_{P \in \mathcal{P}} R(P, \delta') \ge \sup_{P \in \mathcal{P}_0} R(P, \delta') \ge \sup_{P \in \mathcal{P}_0} R(P, \delta) = \sup_{P \in \mathcal{P}} R(P, \delta).$$

Example (difference in binomial proportions):
$$\mathcal{P} = \{\text{dbinom}(x, n, \theta_x)\,\text{dbinom}(y, n, \theta_y) : \theta_x, \theta_y \in [0,1]\}, \qquad \mathcal{P}_0 = \{\text{dbinom}(x, n, \theta_x)\,\text{dbinom}(y, n, \theta_y) : \theta_x = 1 - \theta_y\}.$$
- For estimation of $\theta_y - \theta_x$ on $\mathcal{P}_0$, we showed that $c\,(Y - X)/n$, with $c = \sqrt{2n}/(\sqrt{2n} + 1)$, has constant risk on $\mathcal{P}_0$.
- $c\,(Y - X)/n$ is Bayes w.r.t. a beta($\sqrt{2n}/2$, $\sqrt{2n}/2$) prior on $\theta_y$, and so $c\,(Y - X)/n$ is minimax on $\mathcal{P}_0$.
- We then showed that $c\,(Y - X)/n$ achieves its maximum risk on $\mathcal{P}_0$, and so by the theorem it is minimax.

Example (population mean, bounded variance): Let $\mathcal{P} = \{P : \mathrm{Var}[X \mid P] \le M\}$, $M$ known. Is $\bar X$ minimax for $\theta = E[X \mid P]$? Let $\mathcal{P}_0 = \{\text{dnorm}(x, \theta, \sigma) : \theta \in \mathbb{R}, \sigma^2 \le M\}$. Then
1. $\bar X$ is minimax for $\mathcal{P}_0$;
2. $\sup_{P \in \mathcal{P}_0} R(P, \bar X) = \sup_{P \in \mathcal{P}} R(P, \bar X) = M/n$.
Thus $\bar X$ is minimax (and unique in the univariate case, as shown in LC 5.2).

Example (population mean, bounded range): Let $X_1, \ldots, X_n \sim$ i.i.d. $P \in \mathcal{P}$, where $P([0, 1]) = 1$ for all $P \in \mathcal{P}$.

Least favorable perspective: our previous experience tells us to find a minimax estimator that is best under the worst conditions. What are the worst conditions? Guess: the most difficult situation is where the mass of each $P$ is as spread out as possible. Let $\mathcal{P}_0 = \{\text{binary}(x, \theta) : \theta \in [0, 1]\}$. As we've already derived, the minimax estimator of $E[X_i] = \Pr(X_i = 1 \mid \theta) = \theta$ based on $X_1, \ldots, X_n \sim$ i.i.d. binary($\theta$) is
$$\delta(X) = \frac{n\bar X + \sqrt n/2}{n + \sqrt n} = \frac{\sqrt n}{\sqrt n + 1}\,\bar X + \frac{1}{\sqrt n + 1}\cdot\frac12.$$

To use the theorem above to show this is minimax for $\mathcal{P}$, we need to show
$$\sup_{P \in \mathcal{P}} R(P, \delta) = \sup_{P \in \mathcal{P}_0} R(P, \delta).$$
Let's calculate the risk of $\delta$ for $P \in \mathcal{P}$, writing $\theta = E[X \mid P]$:
$$\mathrm{Var}[\delta(X)] = \frac{n}{(\sqrt n + 1)^2}\,\mathrm{Var}[X \mid P]/n = \frac{\mathrm{Var}[X \mid P]}{(\sqrt n + 1)^2}$$
$$E[\delta(X)] = \frac{\sqrt n}{\sqrt n + 1}\,\theta + \frac{1}{2(\sqrt n + 1)} = \frac{\sqrt n\,\theta + 1/2}{\sqrt n + 1}$$
$$\mathrm{Bias}^2(\delta \mid P) = \frac{(\theta - 1/2)^2}{(\sqrt n + 1)^2},$$
and so
$$R(P, \delta) = \frac{1}{(\sqrt n + 1)^2}\big[\mathrm{Var}[X \mid P] + (\theta - 1/2)^2\big].$$
Where does this risk achieve its maximum value? Note that
$$\sup_{P \in \mathcal{P}} R(P, \delta) = \sup_{\theta \in [0,1]} \sup_{P \in \mathcal{P}_\theta} R(P, \delta),$$
where $\mathcal{P}_\theta = \{P \in \mathcal{P} : E[X \mid P] = \theta\}$. To do the inner supremum, note that for any $P \in \mathcal{P}_\theta$, since $X$ takes values in $[0, 1]$,
$$\mathrm{Var}[X \mid P] = E[X^2] - E[X]^2 \le E[X] - E[X]^2 = E[X](1 - E[X]) = \theta(1 - \theta),$$
with equality only if $P$ is the binary($\theta$) measure. Therefore, for each $\theta$ the supremum risk is attained at the binary($\theta$) distribution, and so the maximum risk of $\delta$ is attained on $\mathcal{P}_0$.
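A quick Monte Carlo sanity check of the "binary is worst" step (an illustrative sketch; the beta(3, 7) comparison distribution is an arbitrary mean-matched choice): for $\theta = 0.3$ and $n = 20$, the simulated risk of $\delta$ is larger under binary sampling than under a mean-matched beta distribution on $[0, 1]$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, theta, reps = 20, 0.3, 200_000
w = np.sqrt(n) / (np.sqrt(n) + 1)

def mc_risk(sample):
    """Monte Carlo risk of delta = w*Xbar + (1 - w)/2 from a (reps x n) array of samples."""
    est = w * sample.mean(axis=1) + (1 - w) * 0.5
    return np.mean((est - theta) ** 2)

risk_binary = mc_risk(rng.binomial(1, theta, size=(reps, n)).astype(float))
risk_beta = mc_risk(rng.beta(3.0, 7.0, size=(reps, n)))       # also has mean 0.3, smaller variance
print(f"risk under binary(0.3): {risk_binary:.5f}")
print(f"risk under beta(3, 7) : {risk_beta:.5f}   (smaller, as the argument predicts)")
```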

To recap,
$$\sup_{P \in \mathcal{P}} R(P, \delta) = \sup_{\theta \in [0,1]} \sup_{P \in \mathcal{P}_\theta} R(P, \delta) = \sup_{\theta \in [0,1]} \sup_{P \in \mathcal{P}_\theta \cap \mathcal{P}_0} R(P, \delta) = \sup_{P \in \mathcal{P}_0} R(P, \delta).$$
Thus $\delta$ is minimax on $\mathcal{P}_0$, achieves its maximum risk there, and so is minimax over all of $\mathcal{P}$.

5 Minimax and admissibility

We already showed a unique minimax estimator can't be dominated:

Theorem (LC lemma). Any unique minimax estimator is admissible.

This is reminiscent of our theorem about admissibility of unique Bayes estimators, which has a similar proof: if $\delta'$ were to dominate $\delta$, then $\delta'$ would be at least as good as $\delta$ in terms of both Bayes risk and maximum risk, and so such a $\delta$ can't be either unique Bayes or unique minimax.

What about the other direction? We have shown in some important cases that admissibility of an estimator implies it is Bayes, or close to Bayes. Does admissibility imply minimaxity? The answer is yes for constant risk estimators:

Theorem (LC lemma). If $\delta$ is constant risk and admissible, then it is minimax.

Proof. Let $\delta'$ be another estimator. Since $\delta$ is not dominated, there is a $\theta_0$ s.t. $R(\theta_0, \delta) \le R(\theta_0, \delta')$. But since $\delta$ has constant risk,
$$\sup_\theta R(\theta, \delta) = R(\theta_0, \delta) \le R(\theta_0, \delta') \le \sup_\theta R(\theta, \delta').$$

Below is a diagram summarizing some of the relationships between admissible, Bayes and minimax estimators we've covered so far:

[Diagram not reproduced in this transcription.]

6 Superefficiency and sparsity

Let $X_1, \ldots, X_n \sim$ i.i.d. $N(\theta, 1)$, where $\theta \in \mathbb{R}$. A version of Hodges' estimator of $\theta$ is given by
$$\delta_H(x) = \begin{cases} \bar x & \text{if } |\bar x| > 1/n^{1/4} \\ 0 & \text{if } |\bar x| < 1/n^{1/4}, \end{cases}$$
or more compactly, $\delta_H(x) = \bar x\, 1(|\bar x| > 1/n^{1/4})$.

Exercise: Show that the asymptotic distribution of $\delta_H(X)$ is given by
$$\sqrt n\,(\delta_H(X) - \theta) \to N(0, v(\theta)), \qquad \text{where } v(\theta) = \begin{cases} 1 & \text{if } \theta \ne 0 \\ 0 & \text{if } \theta = 0. \end{cases}$$

The asymptotic variance of $\sqrt n\,(\delta_H(X) - \theta)$ makes $\delta_H$ seem useful in many situations: often we are trying to estimate a parameter that could potentially be zero, or very close to zero. For such parameters, $\delta_H$ seems to be as good as $\bar X$ asymptotically for all $\theta$, and much better than $\bar X$ at the special value $\theta = 0$. In particular, if $\theta = 0$, then $\Pr(\delta_H(X) = 0) \to 1$. This seems much better than simple consistency at $\theta = 0$, which says
$$\Pr(|\delta_H(X)| < \epsilon) \to 1$$
for every $\epsilon > 0$. An estimator consistent at $\theta = 0$ never actually has to be zero, whereas the Hodges estimator is more and more likely to actually be zero as $n \to \infty$.
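A small simulation sketch of the two behaviors just described (illustrative; the values of $n$, $\theta$ and the seed are arbitrary): at $\theta = 0$ the estimator is exactly zero with probability approaching one, while for fixed $\theta \ne 0$ the scaled MSE settles near 1, with a visible bump at moderate $n$.

```python
import numpy as np

rng = np.random.default_rng(1)
reps = 100_000

def hodges(xbar, n):
    """Hodges' estimator: xbar, thresholded to zero when |xbar| <= n**(-1/4)."""
    return np.where(np.abs(xbar) > n ** (-0.25), xbar, 0.0)

for n in (25, 100, 400):
    for theta in (0.0, 0.5):
        xbar = rng.normal(theta, 1.0 / np.sqrt(n), size=reps)   # sampling distribution of Xbar
        est = hodges(xbar, n)
        print(f"n = {n:>4}, theta = {theta}: Pr(delta_H = 0) ~= {np.mean(est == 0.0):.3f}, "
              f"n * MSE ~= {n * np.mean((est - theta) ** 2):.2f}")
```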

However, there is a price to pay. Let's compare the maximum of the risk function for $\delta_H$ to that of $\bar X$, under squared error loss. For Hodges' estimator we have, for any $\theta \in \mathbb{R}$,
$$\sup_{\theta'} E_{\theta'}[(\delta_H(X) - \theta')^2] \ge E_\theta[(\delta_H(X) - \theta)^2] \ge E_\theta[(\delta_H(X) - \theta)^2\, 1(|\bar X| < 1/n^{1/4})]$$
$$= E_\theta[\theta^2\, 1(|\bar X| < 1/n^{1/4})] \qquad (\text{since } \delta_H(X) = 0 \text{ on this event})$$
$$= \theta^2 \Pr(-n^{-1/4} < \bar X < n^{-1/4})$$
$$= \theta^2 \Pr\big(\sqrt n\,(-n^{-1/4} - \theta) < \sqrt n\,(\bar X - \theta) < \sqrt n\,(n^{-1/4} - \theta)\big)$$
$$= \theta^2\big[\Phi(-\sqrt n\,\theta + n^{1/4}) - \Phi(-\sqrt n\,\theta - n^{1/4})\big].$$
Letting $\theta = \theta_0/\sqrt n$, we have
$$\sup_\theta R(\theta, \delta_H) \ge \frac{\theta_0^2}{n}\big[\Phi(-\theta_0 + n^{1/4}) - \Phi(-\theta_0 - n^{1/4})\big].$$
Note that this holds for any $\theta_0 \in \mathbb{R}$. On the other hand, the risk of $\bar X$ is constant, $1/n$ for all $\theta$. Therefore,
$$\frac{\sup_\theta R(\theta, \delta_H)}{\sup_\theta R(\theta, \bar X)} \ge \theta_0^2\big[\Phi(-\theta_0 + n^{1/4}) - \Phi(-\theta_0 - n^{1/4})\big].$$
You can play around with various values of $\theta_0$ to see how big you can make this bound for a given $n$. The thing to keep in mind is that the inequality holds for all $n$ and $\theta_0$. As $n \to \infty$, the normal CDF term goes to one, so
$$\lim_{n\to\infty} \frac{\sup_\theta R(\theta, \delta_H)}{\sup_\theta R(\theta, \bar X)} \ge \theta_0^2.$$
As this holds for all $\theta_0$, we have
$$\lim_{n\to\infty} \frac{\sup_\theta R(\theta, \delta_H)}{\sup_\theta R(\theta, \bar X)} = \infty.$$
Yikes! So the Hodges estimator becomes infinitely worse than $\bar X$ as $n \to \infty$. Is this asymptotic result relevant for finite-sample comparisons of estimators? A semi-closed form expression for the risk of $\delta_H$ is given by Lehmann and Casella [1998, page 442]. A plot of the finite-sample risk of $\delta_H$ is shown in Figure 2.
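Playing around with $\theta_0$ is easy to do in code; this sketch (illustrative; grids are arbitrary) maximizes the displayed lower bound over a grid of $\theta_0$ for several $n$, showing how the bound grows.

```python
import numpy as np
from scipy.stats import norm

def ratio_bound(theta0, n):
    """Lower bound on sup-risk(delta_H) / sup-risk(Xbar) derived above."""
    return theta0**2 * (norm.cdf(-theta0 + n**0.25) - norm.cdf(-theta0 - n**0.25))

for n in (25, 100, 2500, 10_000):
    theta0_grid = np.linspace(0.1, n**0.25, 500)
    best = ratio_bound(theta0_grid, n).max()
    print(f"n = {n:>6}: largest bound found on the grid ~= {best:.1f}")
```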

[Figure 2: Risk functions of $\delta_H$ for $n \in \{25, 100, 250\}$, with the risk bound in blue.]

Related to this calculation is something of more modern interest, the problem of variable selection in regression. Consider the following model:
$$X_1, \ldots, X_n \sim \text{i.i.d. } P_X, \qquad \epsilon_1, \ldots, \epsilon_n \sim \text{i.i.d. } N(0, 1), \qquad y_i = \beta^T X_i + \epsilon_i.$$
You may have attended one or more seminars where someone presented an estimator $\hat\beta$ of $\beta$ with the following property:
$$\Pr(\hat\beta_j = 0 \mid \beta) \to 1 \text{ as } n \to \infty \qquad \forall\,\beta : \beta_j = 0.$$
Again, such an estimator $\hat\beta_j$ is not just consistent at $\beta_j = 0$, it actually equals 0 with increasingly high probability as $n \to \infty$. This property has been coined "sparsistency" by people studying asymptotics of model selection procedures.

This special consistency at $\beta_j = 0$ seems similar to the properties of Hodges' estimator when $\theta = 0$. What does the behavior at 0 imply about the risk function elsewhere?
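As an illustration only (this is a made-up "Hodges-style" procedure, not any particular published sparsistent estimator): hard-thresholding the OLS coefficients at level $n^{-1/4}$ already produces this kind of behavior, with the truly-zero coefficient set exactly to zero with probability approaching one.

```python
import numpy as np

rng = np.random.default_rng(2)
p, beta = 3, np.array([0.0, 0.5, 1.0])       # the first coefficient is truly zero

for n in (50, 200, 800):
    reps, zeros = 2000, 0
    for _ in range(reps):
        X = rng.normal(size=(n, p))
        y = X @ beta + rng.normal(size=n)
        b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
        b_thr = np.where(np.abs(b_ols) > n ** (-0.25), b_ols, 0.0)   # Hodges-style thresholding
        zeros += (b_thr[0] == 0.0)
    print(f"n = {n:>4}: Pr(beta_hat_1 = 0 | beta_1 = 0) ~= {zeros / reps:.3f}")
```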

It is difficult to say anything about the entire risk function for all such estimators, but we can gain some general insight by looking at the supremum risk. Let $\beta_n = \beta_0/\sqrt n$, where $\beta_0$ is arbitrary. Then
$$\sup_\beta E_\beta[\|\hat\beta - \beta\|^2] \ge E_{\beta_n}[\|\hat\beta - \beta_n\|^2] \ge E_{\beta_n}[\|\hat\beta - \beta_n\|^2\, 1(\hat\beta = 0)] = \|\beta_n\|^2 \Pr(\hat\beta = 0 \mid \beta_n) = \frac{1}{n}\|\beta_0\|^2 \Pr(\hat\beta = 0 \mid \beta_n).$$
Now consider the risk of the OLS estimator:
$$E_\beta[\|\hat\beta_{\text{ols}} - \beta\|^2] = E_{P_X}[\mathrm{tr}((X^T X)^{-1})] \equiv v_n.$$
So we have
$$\frac{\sup_\beta R(\beta, \hat\beta)}{\sup_\beta R(\beta, \hat\beta_{\text{ols}})} \ge \frac{\|\beta_0\|^2 \Pr(\hat\beta = 0 \mid \beta_n)}{n v_n},$$
which holds for all $n$ and all $\beta_0$.

Now you should believe that $n v_n$ converges to some number $v > 0$. What about $\Pr(\hat\beta = 0 \mid \beta_n)$? Now $\beta_n \to 0$ as $n \to \infty$, and so it seems that, since $\hat\beta$ has the sparsistency property, eventually $\hat\beta$ will equal 0 with high probability. Next quarter you will learn about contiguity of sequences of distributions and will be able to show that
$$\lim_{n\to\infty} \Pr(\hat\beta = 0 \mid \beta = \beta_n) = \lim_{n\to\infty} \Pr(\hat\beta = 0 \mid \beta = 0) = 1.$$
The first equality is due to contiguity, the second to the sparsistency property of $\hat\beta$. This result gives
$$\lim_{n\to\infty} \frac{\sup_\beta R(\beta, \hat\beta)}{\sup_\beta R(\beta, \hat\beta_{\text{ols}})} \ge \|\beta_0\|^2/v.$$
Since this result holds for all $\beta_0$, the limit is in fact infinite. The result is that, in terms of supremum risk (under squared error loss), a sparsistent estimator becomes infinitely worse than the OLS estimator. This result is due to Leeb and Pötscher [2008] (see also Pötscher and Leeb [2009]).

Discuss:
- Is supremum risk an appropriate comparison?

- Is squared error an appropriate loss?
- When would you use a sparsistent estimator?

References

Thomas S. Ferguson. Mathematical Statistics: A Decision Theoretic Approach. Probability and Mathematical Statistics, Vol. 1. Academic Press, New York, 1967.

Hannes Leeb and Benedikt M. Pötscher. Sparse estimators and the oracle property, or the return of Hodges' estimator. Journal of Econometrics, 142(1), 2008.

E. L. Lehmann and George Casella. Theory of Point Estimation. Springer Texts in Statistics. Springer-Verlag, New York, second edition, 1998.

Benedikt M. Pötscher and Hannes Leeb. On the distribution of penalized maximum likelihood estimators: the LASSO, SCAD, and thresholding. Journal of Multivariate Analysis, 100(9), 2009.
