STATS 300A: Theory of Statistics                                       Fall 2015

Lecture 11: October 27
Lecturer: Lester Mackey        Scribes: Viswajith Venugopal, Vivek Bagaria, Steve Yadlowsky

Warning: These notes may contain factual and/or typographic errors.

11.1 Summary

In this lecture, we will discuss the identification of minimax estimators via submodels, the admissibility of minimax estimators, and simultaneous estimation and the James-Stein estimator. This will conclude our discussion of estimation; in the future we will be focusing on the decision problem of hypothesis testing.

11.2 Minimax Estimators and Submodels

Recall that an estimator δ_M is minimax if its maximum risk is minimal:

    inf_δ sup_{θ ∈ Ω} R(θ, δ) = sup_{θ ∈ Ω} R(θ, δ_M).

We saw how to derive minimax estimators using least favourable priors in Lecture 10. In this lecture we will consider a different approach, based on the following lemma:

Lemma 1 (TPE 5.1.15). Suppose that δ is minimax for a submodel θ ∈ Ω₀ ⊆ Ω and

    sup_{θ ∈ Ω₀} R(θ, δ) = sup_{θ ∈ Ω} R(θ, δ).

Then δ is minimax for the full model, θ ∈ Ω.

This lemma allows us to find a minimax estimator for a particular tractable submodel and then show that the worst-case risk for the full model is equal to that of the submodel (that is, the worst-case risk does not rise as we pass to the full model). In that case, using the lemma, we can argue that the estimator we found is also minimax for the full model. This is similar to how we justified minimaxity of the estimator of a Normal mean with bounded variance last lecture. Here is a fairly simple example:

Example 1. Let X₁, ..., X_n be i.i.d. N(µ, σ²), where both µ and σ² are unknown. Thus, our parameter vector is θ = (µ, σ²) and our parameter space is Ω = ℝ × ℝ⁺. Our task now is to estimate µ. Our loss function is the relative squared error loss, given by

    L((µ, σ²), d) = (d − µ)² / σ².
We consider this loss function to make the question of minimaxity more interesting: regular squared error loss is unbounded over the full model, since the risk of X̄ is proportional to the variance, which is unbounded. We consider the submodel with σ² = 1. That is, Ω₀ = ℝ × {1}, and our loss function simplifies to the usual squared error loss: L((µ, 1), d) = (d − µ)². We saw in Example 1 of Lecture 10 that under this loss X̄ is minimax for Ω₀. Moreover,

    R((µ, σ²), X̄) = E[(X̄ − µ)²] / σ² = (σ²/n) / σ² = 1/n   for all (µ, σ²) ∈ Ω.

Thus, the risk does not depend on σ². Since R((µ, 1), X̄) = R((µ, σ²), X̄), the maximum risks over Ω₀ and Ω are equal; that is, sup_{θ ∈ Ω₀} R(θ, X̄) = sup_{θ ∈ Ω} R(θ, X̄). Therefore, it follows from Lemma 1 that X̄ is minimax on Ω.

Note that, thanks to our new loss function, we do not need to impose boundedness on the variance (as we did in our previous lecture) to establish minimaxity in a meaningful way.

This example is parametric, like most of the examples we have seen so far. Assuming that we know the form of the distribution of the variables, and that the variables are i.i.d., are both strong assumptions. Now, we consider a more ambitious example, which is in a non-parametric setting, and hence more general.

Example 2 (TPE Example 5.1.16). Suppose X₁, X₂, ..., X_n are i.i.d. with common CDF F, with mean µ(F) < ∞ and variance σ²(F) < ∞. Our goal is to find a minimax estimate of µ(F) under squared error loss. Without further restriction on F, the worst-case risk is unbounded for every estimator, so every estimator is minimax. We will impose further constraints, restricting our family to have finite worst-case risk, to ensure that meaningful minimax estimators can be obtained.

Constraint (a). Assume σ²(F) ≤ B. Now, we have seen in the previous lecture that X̄ is minimax for the Gaussian submodel in this case. So a natural guess for us to make is that X̄ is minimax. We verify this by an application of Lemma 1. First, we compute the risk under the full model:

    R(F, X̄) = (1/n²) Σ_{i=1}^n E(X_i − µ(F))² = σ²(F)/n.

Since σ²(F) ∈ [0, B] by assumption, we get:
    sup_F R(F, X̄) = B/n.

Now, we saw in Lecture 10 that for the submodel F₀ = {N(µ, σ²) : σ² ≤ B}, X̄ is minimax. Further, the supremum risk in this case is identical to that of the full model:

    sup_{F ∈ F₀} R(F, X̄) = B/n.

Thus, using Lemma 1, we conclude that X̄ is minimax for the full model (that is, the non-parametric model still constrained to have σ²(F) ≤ B).
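As a quick numerical sanity check (not part of the original notes; a Python sketch with an illustrative helper `risk_of_mean` of my own naming), we can verify the identity R(F, X̄) = σ²(F)/n for two non-Gaussian choices of F, each with variance at most B = 1:

```python
import random
import statistics

random.seed(0)

def risk_of_mean(sampler, mu, n=25, reps=20000):
    """Monte Carlo estimate of E[(Xbar - mu(F))^2] for i.i.d. draws from F."""
    errs = []
    for _ in range(reps):
        xbar = statistics.fmean(sampler() for _ in range(n))
        errs.append((xbar - mu) ** 2)
    return statistics.fmean(errs)

n = 25
# Uniform(0, 1): mu(F) = 1/2, sigma^2(F) = 1/12
r_unif = risk_of_mean(lambda: random.random(), 0.5, n=n)
# Exponential(1): mu(F) = 1, sigma^2(F) = 1
r_exp = risk_of_mean(lambda: random.expovariate(1.0), 1.0, n=n)

print(r_unif, (1 / 12) / n)  # both close to sigma^2(F)/n
print(r_exp, 1 / n)
```

Both estimates land near σ²(F)/n ≤ B/n, matching the bound above.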
Constraint (b). Assume F ∈ F₁, where F₁ is the set of all CDFs with support contained in [0, 1]. Is X̄ minimax for this model? We have reason to believe that it is not, based on the minimax estimator we derived in Lecture 9 for the Binomial submodel. And in fact, it turns out that X̄ is not minimax. To show this, first consider the submodel F₀ = {Ber(θ)}_{θ ∈ (0,1)}. Let Y = Σ_i X_i, so that Y ~ Bin(n, θ) and X̄ = Y/n. Recall from Lecture 9 that the minimax estimator for µ(F) = θ, in the Binomial case, is

    δ(X) = (√n/(1 + √n)) X̄ + (1/(1 + √n)) (1/2),

which has supremum risk 1/(4(1 + √n)²). So

    sup_θ R(θ, X̄) = sup_θ θ(1 − θ)/n = 1/(4n) > 1/(4(1 + √n)²) = sup_θ R(θ, δ(X)).

Thus, X̄ has a higher worst-case risk than δ(X) as defined above, and hence, we have shown that X̄ is not minimax.

Now, let's get more ambitious, and try to see if we can find the minimax estimator under the full model. We know that this can't be X̄, but it's possible that it could be δ(X). To examine this possibility, we conjecture that δ(X) is also minimax under the full model. If we are to establish this via Lemma 1, we need to show that the supremum risk of δ(X) under the full model is no more than 1/(4(1 + √n)²) (which is the supremum risk for the Binomial submodel). Let us compute:

    E_F[(δ(X) − µ(F))²]
        = E_F[( (√n/(1 + √n)) (X̄ − µ(F)) + (1/(1 + √n)) (1/2 − µ(F)) )²]
        = (√n/(1 + √n))² Var(X̄) + (1/(1 + √n))² (1/2 − µ(F))²
        = (1/(1 + √n))² [ n Var(X̄) + 1/4 − µ(F) + µ(F)² ]
        = (1/(1 + √n))² [ E[X₁²] − µ(F)² + 1/4 − µ(F) + µ(F)² ]
        = (1/(1 + √n))² [ E[X₁²] + 1/4 − µ(F) ],

where we have used the fact that n Var(X̄) = Var(X₁) = E[X₁²] − (E[X₁])² = E[X₁²] − µ(F)².

By assumption X₁ ∈ [0, 1], so X₁² ≤ X₁, and since E[X₁] = µ(F) we can bound the risk:

    E_F[(δ(X) − µ(F))²] ≤ (1/(1 + √n))² [ E[X₁] + 1/4 − µ(F) ] = (1/(1 + √n))² (1/4) = 1/(4(1 + √n)²).

So, δ(X) is minimax for the Binomial submodel, and its worst-case risk is the same for the full model and for the Binomial submodel. Therefore, applying Lemma 1, we conclude that δ(X) is minimax. Thus, we have found a minimax estimator.
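Since the Binomial model is finite, both worst-case risks above can be computed exactly. The following sketch (Python; the helper names are my own, not from the notes) sums over the Binomial pmf to confirm that δ has constant risk 1/(4(1 + √n)²) while X̄ attains the larger worst-case value 1/(4n) at θ = 1/2:

```python
from math import comb, sqrt

def binom_pmf(n, theta, y):
    return comb(n, y) * theta**y * (1 - theta)**(n - y)

def risk(est, n, theta):
    """Exact risk E[(est(Y) - theta)^2] under Y ~ Bin(n, theta)."""
    return sum(binom_pmf(n, theta, y) * (est(y, n) - theta) ** 2
               for y in range(n + 1))

xbar = lambda y, n: y / n
# Minimax estimator from Lecture 9, rewritten in terms of Y = n * Xbar:
# (sqrt(n)/(1+sqrt(n))) Xbar + 1/(2(1+sqrt(n))) = (Y + sqrt(n)/2) / (n + sqrt(n))
delta = lambda y, n: (y + sqrt(n) / 2) / (n + sqrt(n))

n = 16
const = 1 / (4 * (1 + sqrt(n)) ** 2)
risks_delta = [risk(delta, n, th) for th in (0.1, 0.3, 0.5, 0.9)]
sup_xbar = risk(xbar, n, 0.5)  # theta(1-theta)/n is maximized at theta = 1/2

print(risks_delta)      # all equal to const
print(sup_xbar, const)  # 1/(4n) exceeds const
```

The constant risk of δ and the gap 1/(4n) > 1/(4(1 + √n)²) both appear exactly, with no Monte Carlo error.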
11.3 Admissibility of minimax estimators

Let us now turn to the question of admissibility of minimax estimators. We begin by noting that the question of admissibility is particularly important for minimax estimators. When we worked with unbiased estimators, we sometimes found dominating estimators, but the dominating estimators were biased, so we lost the property (unbiasedness) that we were interested in. In contrast, if you find an estimator that dominates a minimax estimator, it will still be minimax!

Also, an aside: admissibility can give rise to minimaxity. If δ is admissible with constant risk, then δ is also minimax. (This is not hard to show. Let the constant risk of δ be r. Then r is also the worst-case risk of δ, since the risk is constant. Now, suppose δ is not minimax. Then there exists a different estimator, say δ′, whose worst-case risk, say r′, is strictly smaller than r. But since r′ is the worst-case risk of δ′, the risk of δ′ is less than r everywhere, and thus δ′ dominates δ. However, we assumed that δ was admissible, so this is a contradiction. Therefore δ is minimax.)

Note that minimaxity does not guarantee admissibility; it only ensures that the worst-case risk is optimal. We need to check for admissibility separately. The following example illustrates several standard ways of doing so.

Example 3. Let X₁, X₂, ..., X_n be i.i.d. N(θ, σ²), where σ² is known and θ is the estimand. Then the minimax estimator under squared error loss is X̄, and we would like to determine whether X̄ is admissible. Instead of answering this directly, we answer a more general question: when is aX̄ + b, for a, b ∈ ℝ (basically, any affine function of X̄), admissible?

Case 1: 0 < a < 1. In this case aX̄ + b is a convex combination of X̄ and the constant b/(1 − a). By results we saw in the previous lecture, it is a Bayes estimator with respect to some Gaussian prior on θ. Further, since we are using squared error loss, which is strictly convex, this Bayes estimator is unique.
So, by Theorem 5.2.4 (which basically tells us that a unique Bayes estimator will always be admissible), aX̄ + b is admissible.

Case 2: a = 0. In this case b is also a unique Bayes estimator, with respect to the degenerate prior distribution placing unit mass at θ = b. So by Theorem 5.2.4, b is admissible.

Case 3: a = 1, b ≠ 0. In this case X̄ + b is not admissible because it is dominated by X̄. To see this, note that X̄ has the same variance as X̄ + b, but strictly smaller bias.

The next few cases use the following result. In general, the risk of aX̄ + b is

    E[(aX̄ + b − θ)²] = E[( a(X̄ − θ) + (b + θ(a − 1)) )²] = a²σ²/n + (b + θ(a − 1))²,

where, in the first step, we added and subtracted aθ inside the square.

Case 4: a > 1. If we apply the result for the general risk, we have:

    E[(aX̄ + b − θ)²] ≥ a²σ²/n > σ²/n = R(θ, X̄).
The first inequality follows because the second summand in the expression for the general risk is always nonnegative. X̄ dominates aX̄ + b whenever a > 1, and so in this case aX̄ + b is inadmissible.

Case 5: a < 0. Here

    E[(aX̄ + b − θ)²] ≥ (b + θ(a − 1))² = (a − 1)² (θ − b/(1 − a))² > (θ − b/(1 − a))²,

and the right-hand side is the risk of predicting the constant b/(1 − a). So, b/(1 − a) dominates aX̄ + b, and therefore aX̄ + b is again inadmissible.

Now, we have considered every case except a = 1, b = 0, i.e., the estimator X̄ itself. It turns out that X̄ is admissible. The argument in this case is more involved, and proceeds by contradiction.

Case 6: a = 1, b = 0. Here, we use a limiting Bayes argument. Suppose X̄ is inadmissible. Then, assuming w.l.o.g. that σ² = 1, we have

    R(θ, X̄) = 1/n.

By our hypothesis, there must exist an estimator δ′ such that R(θ, δ′) ≤ 1/n for all θ and R(θ*, δ′) < 1/n for at least one θ* ∈ Ω. Because R(θ, δ′) is continuous in θ, there must exist ε > 0 and an interval (θ₀, θ₁) containing θ* so that

    R(θ, δ′) < 1/n − ε   for all θ ∈ (θ₀, θ₁).                          (11.1)

Let r_τ be the average risk of δ′ with respect to the prior distribution N(0, τ²) on θ. (Note that this is the exact same prior family we used to prove that X̄ is a limit of Bayes estimators, and hence minimax. We did this by letting τ → ∞, and thereby letting our prior tend to the improper prior π(θ) = 1 for all θ.) Let r′_τ be the average risk of the Bayes estimator δ_τ under the same prior. Note that δ_τ ≠ δ′, because R(θ, δ_τ) → ∞ as |θ| → ∞, which is not consistent with R(θ, δ′) ≤ 1/n for all θ ∈ ℝ. So r′_τ < r_τ, because the Bayes estimator is unique almost surely with respect to the marginal distribution of θ.

We will look at the following ratio, which is selected to simplify our algebra later. This ratio, we will show, becomes arbitrarily large, which we will use to form a contradiction with r′_τ < r_τ. Using the form of the Bayes risk r′_τ = τ²/(1 + nτ²) computed in a previous lecture (see TPE Example 5.1.14), we can write:

    (1/n − r_τ) / (1/n − r′_τ) = n(1 + nτ²) ∫ [ 1/n − R(θ, δ′) ] (1/(√(2π) τ)) exp(−θ²/(2τ²)) dθ.
Applying (11.1), we find:

    (1/n − r_τ) / (1/n − r′_τ) ≥ n(1 + nτ²) ε ∫_{θ₀}^{θ₁} (1/(√(2π) τ)) exp(−θ²/(2τ²)) dθ
                               = (n(1 + nτ²) ε / (√(2π) τ)) ∫_{θ₀}^{θ₁} exp(−θ²/(2τ²)) dθ.

As τ → ∞, the first factor n(1 + nτ²)ε/(√(2π) τ) → ∞, and since the integrand converges monotonically to 1, Lebesgue's monotone convergence theorem ensures that the integral approaches the positive quantity θ₁ − θ₀. So, for sufficiently large τ, we must have

    (1/n − r_τ) / (1/n − r′_τ) > 1.

This means that r_τ < r′_τ. However, this is a contradiction, because r′_τ is the optimal average risk (since it is the Bayes risk). So our assumption that there was a dominating estimator was false, and in this case aX̄ + b = X̄ is admissible.

11.4 Simultaneous estimation

Up to this point, we have considered only situations where a single real-valued parameter is of interest. However, in practice, we often care about several parameters, and wish to estimate them all at once. In this section we consider the admissibility of estimators of several parameters, that is, of simultaneous estimation.

Example 4. Let X₁, X₂, ..., X_p be independent with X_i ~ N(θ_i, σ²) for 1 ≤ i ≤ p. For the sake of simplicity, say σ² = 1. Now our goal is to estimate θ = (θ₁, θ₂, ..., θ_p) under the loss function

    L(θ, d) = Σ_{i=1}^p (d_i − θ_i)².

A natural estimator for θ is X = (X₁, X₂, ..., X_p). It can be shown that X is the UMRUE, the maximum likelihood estimator, a generalized Bayes estimator, and a minimax estimator for θ. So, it would be natural to think that X is admissible. However, counterintuitively, it turns out that this is not the case when p ≥ 3. When p ≥ 3, X is dominated by the James-Stein estimator (and that too, strictly dominated):

    δ(X) = (δ₁(X), δ₂(X), ..., δ_p(X)),  where  δ_i(X) = (1 − (p − 2)/‖X‖₂²) X_i.

Here ‖·‖₂ is the 2-norm, so ‖X‖₂² = Σ_{j=1}^p X_j².
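To see the domination claim in action, here is a small Monte Carlo sketch (Python; my own illustration, with an arbitrary choice of θ, not part of the notes) comparing the average total squared error of X and of the James-Stein estimator for p = 10:

```python
import random

random.seed(1)
p, reps = 10, 4000
theta = [1.0] * p  # an arbitrary fixed theta; domination holds uniformly

se_x, se_js = 0.0, 0.0
for _ in range(reps):
    x = [random.gauss(t, 1.0) for t in theta]
    norm2 = sum(xi * xi for xi in x)
    shrink = 1 - (p - 2) / norm2          # James-Stein shrinkage factor
    js = [shrink * xi for xi in x]
    se_x += sum((xi - t) ** 2 for xi, t in zip(x, theta))
    se_js += sum((ji - t) ** 2 for ji, t in zip(js, theta))

risk_x, risk_js = se_x / reps, se_js / reps
print(risk_x, risk_js)  # risk_x is near p = 10; risk_js is strictly smaller
```

Other choices of θ give the same qualitative picture, consistent with uniform domination.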
The J-S estimator makes use of the entire data vector when estimating each θ_i, so it is surprising that this is beneficial given the assumption of independence amongst the components of X. An example of the James-Stein estimator being used to estimate batting averages is available at http://www-stat.stanford.edu/~ckirby/brad/other/article1977.pdf.

It turns out that the James-Stein estimator is not itself admissible, because it is dominated by the positive-part James-Stein estimator (TPE Theorem 5.5.4):

    δ_i⁺(X) = max( 1 − (p − 2)/‖X‖₂², 0 ) X_i.

To add insult to injury, even this estimator can be shown to be inadmissible, although that proof is non-constructive.

11.4.1 Motivation for the J-S estimator

To motivate the J-S estimator, we consider how it can arise in an empirical Bayes framework. The empirical Bayes approach (which builds on principles of Bayesian estimation, but is not strictly Bayesian) is a two-step process:

1. Introduce a prior family indexed by a hyperparameter (this is the Bayesian aspect).
2. Estimate the hyperparameter from the data (this is the empirical aspect).

So, applying this procedure to the problem at hand:

1. Suppose θ_i ~ i.i.d. N(0, A). Then the Bayes estimator for θ_i is

       δ_{A,i}(X) = (A/(A + 1)) X_i = (1 − 1/(A + 1)) X_i.

2. In this step we must choose A. Marginalizing over θ, we see that the X_i are marginally i.i.d. with

       X_i ~ N(0, A + 1)

   (Exercise: Verify this.) We will use X and the knowledge of this marginal distribution to find an estimate of 1/(A + 1). One could, in principle, use any estimate of A, and it is common to use a maximum likelihood estimate, but here we will use an unbiased estimate. It can be shown that

       E[ 1/‖X‖₂² ] = 1/((p − 2)(A + 1))

   (Exercise: Verify this. Hint: ‖X‖₂²/(A + 1) follows a χ²_p distribution.) So (p − 2)/‖X‖₂² must be UMVU for 1/(A + 1).

If we plug this estimate into our Bayes estimator, we obtain the J-S estimator:

    δ_i(X) = (1 − (p − 2)/‖X‖₂²) X_i.
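The unbiasedness claim in step 2 is easy to check by simulation. In the sketch below (Python; my own illustration, not from the notes), we draw X directly from its marginal N(0, A + 1) distribution and compare the average of (p − 2)/‖X‖₂² to 1/(A + 1):

```python
import random

random.seed(2)
p, A, reps = 10, 3.0, 20000

total = 0.0
for _ in range(reps):
    # theta_i ~ N(0, A) and X_i | theta_i ~ N(theta_i, 1),
    # so marginally X_i ~ N(0, A + 1); sample that marginal directly.
    x = [random.gauss(0.0, (A + 1) ** 0.5) for _ in range(p)]
    total += (p - 2) / sum(xi * xi for xi in x)

estimate = total / reps
print(estimate, 1 / (A + 1))  # approximately equal
```

The sample mean of (p − 2)/‖X‖₂² lands near 1/(A + 1) = 0.25, as the unbiasedness claim predicts.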
11.4.2 James-Stein domination

Intuitively, the problem with the estimate X is that ‖X‖₂² is typically much larger than ‖θ‖₂²:

    E[‖X‖₂²] = E[ Σ_{j=1}^p X_j² ] = p + Σ_{i=1}^p θ_i² = p + ‖θ‖₂²

(the first term is actually σ²p, which equals p in this case since σ² = 1). So, we may view the J-S estimator as a method for correcting the bias in the size of X. It achieves this by shrinking each coordinate of X toward 0. The uniform superiority of the J-S estimator over X can be formalised (see Keener 11.2):

Theorem 1 (TPE Theorem 5.5.1). The James-Stein estimator δ has uniformly smaller risk than X if p ≥ 3.

The proof, given on p. 355 of TPE, compares the risk of the J-S estimator directly to that of X.
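The size bias described above is also easy to check numerically. In this closing sketch (Python; the particular θ is an arbitrary illustrative choice of mine), the average of ‖X‖₂² concentrates near p + ‖θ‖₂², showing that the raw estimator systematically overshoots in norm:

```python
import random

random.seed(3)
p, reps = 10, 20000
theta = [0.5 * i for i in range(p)]      # arbitrary fixed mean vector
norm_theta2 = sum(t * t for t in theta)  # ||theta||_2^2

avg = 0.0
for _ in range(reps):
    x = [random.gauss(t, 1.0) for t in theta]
    avg += sum(xi * xi for xi in x)
avg /= reps

print(avg, p + norm_theta2)  # E||X||_2^2 = p + ||theta||_2^2
```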