STATS 300A: Theory of Statistics, Fall 2015
Lecture 10: October 22
Lecturer: Lester Mackey; Scribes: Bryan He, Rahul Makhijani
Warning: These notes may contain factual and/or typographic errors.

10.1 Minimaxity and least favorable prior sequences

In this lecture, we will extend our tools for deriving minimax estimators. Last time, we discovered that minimax estimators can arise from Bayes estimators under least favorable priors. However, it turns out that minimax estimators need not be Bayes estimators. Consider the following example, where our old approach fails.

Example 1 (Minimax for i.i.d. Normal random variables with unknown mean θ). Let X_1, ..., X_n be i.i.d. N(θ, σ²), with σ² known. Our goal is to estimate θ under squared-error loss. For our first guess, pick the natural estimator X̄. Note that it has constant risk σ²/n, which suggests minimaxity, because we know that Bayes estimators with constant risk are also minimax estimators. However, X̄ is not Bayes for any prior: under squared-error loss, unbiased estimators are Bayes estimators only in the degenerate situation of zero risk (TPE Theorem 4.2.3), and X̄ is unbiased. Thus, we cannot conclude from our previous results (e.g., TPE Corollary 5.1.5) that X̄ is minimax.

We might try to consider the wider class of estimators δ_{a,µ0}(X) = aX̄ + (1 − a)µ0 for a ∈ (0, 1) and µ0 ∈ R, because many of the Bayes estimators we have encountered are convex combinations of a prior mean and the data mean. Note however that the worst-case risk of these estimators is infinite:

sup_θ E_θ[δ_{a,µ0}(X) − θ]² = sup_θ { a² Var(X̄) + (1 − a)² (θ − µ0)² } = a²σ²/n + (1 − a)² sup_θ (θ − µ0)² = +∞.

Since these estimators have poorer worst-case risk than X̄, they certainly cannot be minimax. We could keep trying to find Bayes estimators with better worst-case performance than X̄, but we would fail: it turns out that X̄ is in fact minimax. To establish this, we will extend our minimax results to limits of Bayes estimators, rather than restricting attention to Bayes estimators only.

Definition 1 (Least Favorable Sequence of Priors).
Let {Λ_m} be a sequence of priors with minimal average risk r_{Λ_m} = inf_δ ∫ R(θ, δ) dΛ_m(θ). Then {Λ_m} is a least favorable sequence of priors if there is a real number r such that r_{Λ_m} → r < ∞ and r ≥ r_Λ for every prior Λ.
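As a numerical sanity check on the worst-case risk computation in Example 1, the short Python sketch below (the values of n, σ², a, and µ0 are arbitrary illustration choices, not from the notes) evaluates the exact risk formulas and confirms that the shrinkage estimator's risk is unbounded in θ while X̄'s risk stays constant at σ²/n.

```python
# Exact risks from Example 1 (squared-error loss, sigma^2 known).
# n, sigma2, a, mu0 are hypothetical illustration values.
n, sigma2, a, mu0 = 10, 4.0, 0.8, 0.0

def risk_shrink(theta):
    # R(theta, a*Xbar + (1-a)*mu0) = a^2 sigma^2/n + (1-a)^2 (theta - mu0)^2
    return a**2 * sigma2 / n + (1 - a)**2 * (theta - mu0)**2

def risk_xbar(theta):
    # R(theta, Xbar) = sigma^2 / n, constant in theta
    return sigma2 / n

for theta in [0.0, 10.0, 100.0, 1000.0]:
    print(theta, risk_xbar(theta), risk_shrink(theta))
# The shrinkage risk grows without bound in |theta - mu0|,
# so its supremum over theta is +infinity.
```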
The reason for studying limits of priors is that they may help us establish minimaxity. Since there need not exist a prior Λ whose associated Bayes estimator has average risk r, this definition is less restrictive than that of a least favorable prior. We can prove an analogue of TPE Theorem 5.1.4 in this new setting.

Theorem 1 (TPE 5.1.12). Suppose there is a real number r such that {Λ_m} is a sequence of priors with r_{Λ_m} → r < ∞. Let δ be any estimator such that sup_θ R(θ, δ) = r. Then

1. δ is minimax, and
2. {Λ_m} is least favorable.

Proof.
1. Let δ' be any other estimator. Then, for any m,

sup_θ R(θ, δ') ≥ ∫ R(θ, δ') dΛ_m(θ) ≥ r_{Λ_m},

so sending m → ∞ yields

sup_θ R(θ, δ') ≥ r = sup_θ R(θ, δ),

which means that δ is minimax.

2. Let Λ be any prior. Then

r_Λ = ∫ R(θ, δ_Λ) dΛ(θ) ≤ ∫ R(θ, δ) dΛ(θ) ≤ sup_θ R(θ, δ) = r,

which means that {Λ_m} is least favorable.

Unlike Theorem 5.1.4, this result does not guarantee uniqueness, even if the Bayes estimators δ_{Λ_m} are unique, because the limiting step in the proof of part 1 turns strict inequalities into non-strict ones. However, this result lets us certify a much wider class of estimators: to show that a candidate estimator is indeed minimax, we need only find a sequence of Bayes risks converging to the maximum risk of the candidate.

Example 2 (Minimax for i.i.d. Normal random variables, continued). We now have the tools to confirm our suspicion that X̄ is minimax. By Theorem 1 above, it suffices to find a sequence {Λ_m} such that r_{Λ_m} → σ²/n =: r. A conjugate prior is a good starting point, so we let {Λ_m} be the conjugate priors {N(0, m²)} with variance tending to ∞, so that Λ_m tends to the improper uniform prior on R with π(θ) = 1, θ ∈ R. By TPE Example 4.2.2, the posterior for θ associated with each Λ_m is

θ | X_1, ..., X_n ~ N( (nX̄/σ²) / (n/σ² + 1/m²), 1 / (n/σ² + 1/m²) ).

In particular, the posterior variance does not depend on X_1, ..., X_n, so Lemma 1 below automatically yields the Bayes risk

r_{Λ_m} = 1 / (n/σ² + 1/m²) → σ²/n = sup_θ R(θ, X̄) as m → ∞.

It follows from Theorem 1 that X̄ is minimax and {Λ_m} is least favorable.
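The convergence of Bayes risks in Example 2 can be checked numerically from the closed form r_{Λ_m} = 1/(n/σ² + 1/m²). The sketch below uses arbitrary illustration values of n and σ²:

```python
# Bayes risks r_{Lambda_m} for the conjugate priors N(0, m^2) in Example 2.
# n and sigma2 are arbitrary illustration values.
n, sigma2 = 25, 2.0

def bayes_risk(m):
    # Posterior variance (constant in the data), which by Lemma 1
    # equals the Bayes risk: 1 / (n/sigma^2 + 1/m^2).
    return 1.0 / (n / sigma2 + 1.0 / m**2)

minimax_risk = sigma2 / n  # constant risk of Xbar
for m in [1, 10, 100, 1000]:
    print(m, bayes_risk(m))
# The Bayes risks increase monotonically toward sigma^2 / n.
```

Since the Bayes risks approach but never reach σ²/n, no single prior in the sequence is least favorable on its own, which is exactly why Theorem 1 is needed.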
Lemma 1 (TPE 5.1.13). If the posterior variance Var_{Θ|X}(g(Θ) | X = x) is constant in x, then under squared-error loss, r_Λ = Var_{Θ|X}(g(Θ) | X = x).

We know that the posterior mean minimizes the Bayes risk, so this result follows by plugging the posterior mean of g(Θ) into the average risk.

10.2 Minimaxity via submodel restriction

The following example illustrates the technique of deriving a minimax estimator for a general family of models by restricting attention to a subset of that family. The idea comes from a simple observation: if an estimator is minimax in a submodel and its worst-case risk does not change when we pass to a larger model, then the estimator is minimax in the larger class as well.

Example 3 (Minimax for i.i.d. Normal random variables, unknown mean and variance). Reconsider Example 1 in the case that the variance is unknown. That is, let X_1, ..., X_n be i.i.d. N(θ, σ²), with both θ and σ² unknown. Note that

sup_{θ,σ²} R((θ, σ²), X̄) = sup_{σ²} σ²/n = ∞,

and in fact the maximum risk of any estimator in this setting is infinite, so the question of minimaxity is uninteresting. Therefore, we restrict attention to the family parameterized by Ω = {(θ, σ²) : θ ∈ R, σ² ≤ B}, where B is a known constant. Let δ be any other estimator. Calculating the risk of X̄ within this family, we find

sup_{θ∈R, σ²≤B} R((θ, σ²), X̄) = B/n = sup_{θ∈R, σ²=B} R((θ, σ²), X̄)
  ≤ sup_{θ∈R, σ²=B} R((θ, σ²), δ)    [submodel minimaxity]
  ≤ sup_{θ∈R, σ²≤B} R((θ, σ²), δ),

where the first inequality follows from the fact that X̄ is minimax for i.i.d. normals with known σ², and the second follows because the supremum is taken over a larger set. Hence, we are able to show that X̄ is minimax over Ω by focusing on the case where σ² is known. Notice further that the form of the estimator does not depend on the upper bound B, though the bound is necessary for minimaxity to be worth investigating.

10.3 Dependence on the Loss Function

In general, minimax estimators can vary depending on the loss being considered.
Below, we provide an example of minimax estimation under weighted squared-error loss.
Example 4 (Minimax for binomial random variables, weighted squared-error loss). Let X ~ Bin(n, θ) with the loss function L(θ, d) = (d − θ)² / (θ(1 − θ)). This is a simple weighted squared-error loss with weight w(θ) = 1/(θ(1 − θ)), but it is arguably more realistic than the usual squared error in this situation because it penalizes errors near 0 and 1 more strongly than errors near 1/2. Note that for any θ, R(θ, X/n) = 1/n; that is, the risk is constant in θ, suggesting that X/n is minimax. We will show that this is indeed the case.

We should be careful, since TPE Theorem 4.2.3 is only valid under squared-error loss. Since our loss function is different, an unbiased estimator can be Bayes, and in this example that is indeed the case. Recall from TPE Corollary 4.1.2 that the Bayes estimator associated with the loss L(θ, d) = w(θ)(d − θ)² is given by E_{Θ|X}[Θ w(Θ) | X] / E_{Θ|X}[w(Θ) | X]. Invoking this result, we find that the Bayes estimator has the form

δ_Λ(X) = E_{Θ|X}[1/(1 − Θ) | X] / E_{Θ|X}[1/(Θ(1 − Θ)) | X].    (10.1)

This is true for arbitrary priors Λ, but to calculate a closed-form Bayes estimator, we use a prior conjugate to the binomial likelihood: Θ ~ Λ_{a,b} = Beta(a, b), for some a, b > 0. Suppose we observe X = x. If a + x > 1 and b + n − x > 1, then substituting the result of Remark 1 below into equation (10.1) shows that the estimator

δ_{a,b}(x) = (a + x − 1) / (a + b + n − 2)

minimizes the posterior risk. In particular, the estimator δ_{1,1}(x) = x/n minimizes the posterior risk with respect to the uniform prior after observing 0 < x < n. If we can verify that this form remains unchanged when x ∈ {0, n}, then the estimator δ_{1,1}(X) = X/n is Bayes with constant risk, and hence minimax. To see that this is the case, note that the posterior risk under the prior Λ_{1,1}, after observing X = x and deciding δ(x) = d, is

∫₀¹ [(d − θ)² / (θ(1 − θ))] · [Γ(n + 2) / (Γ(x + 1) Γ(n − x + 1))] θ^x (1 − θ)^{n−x} dθ,

which, in the case X = 0, simplifies (up to a positive constant) to

∫₀¹ (d − θ)² (1 − θ)^{n−1} / θ dθ.

This integral converges for d = 0 and diverges otherwise, so the posterior risk is minimized by choosing δ(0) = 0.
Similarly, in the case X = n, the posterior risk is minimized by choosing δ(n) = 1 = n/n. This confirms that δ_{1,1}(X) = X/n minimizes the posterior risk for every outcome X and is indeed Bayes. Since, as we noted above, this estimator has constant risk, we conclude that X/n is minimax. Notice that the form of the minimax estimator here depends on the type of loss being used: X/n has constant risk for the weighted squared-error loss considered here, but not for ordinary squared error.
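The constant-risk claim R(θ, X/n) = 1/n in Example 4 can be verified exactly by summing the weighted loss against the binomial pmf. The sketch below does this for an arbitrary illustration value of n and a few values of θ:

```python
# Exact risk of X/n under L(theta, d) = (d - theta)^2 / (theta (1 - theta))
# from Example 4, computed by summing over the Bin(n, theta) pmf.
# n and the theta values are arbitrary illustration choices.
from math import comb

n = 12

def weighted_risk(theta):
    total = 0.0
    for x in range(n + 1):
        pmf = comb(n, x) * theta**x * (1 - theta)**(n - x)
        total += pmf * (x / n - theta) ** 2 / (theta * (1 - theta))
    return total

for theta in [0.1, 0.3, 0.5, 0.9]:
    print(theta, weighted_risk(theta))
# Each value equals 1/n, since E[(X/n - theta)^2] = theta(1-theta)/n
# and the weight cancels the theta(1-theta) factor exactly.
```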
Remark 1. Recall that the Beta function can be evaluated as

∫₀¹ x^{k₁ − 1} (1 − x)^{k₂ − 1} dx = Γ(k₁) Γ(k₂) / Γ(k₁ + k₂),    (10.2)

whenever k₁, k₂ > 0. Therefore, if Y ~ Beta(a, b), where a, b > 0, we can explicitly evaluate the expectation

E[1/(1 − Y)] = ∫₀¹ [1/(1 − y)] · [Γ(a + b) / (Γ(a) Γ(b))] y^{a−1} (1 − y)^{b−1} dy
  = [Γ(a + b) / (Γ(a) Γ(b))] ∫₀¹ y^{a−1} (1 − y)^{b−2} dy
  = Γ(a + b) Γ(a) Γ(b − 1) / (Γ(a) Γ(b) Γ(a + b − 1))
  = (a + b − 1) / (b − 1),

where in the second step we require b > 1 in order to apply the relation (10.2). A similar argument yields

E[1/(Y(1 − Y))] = (a + b − 2)(a + b − 1) / ((a − 1)(b − 1)),

whenever a > 1 as well. Combining these identities, we have that, whenever a, b > 1,

E[1/(1 − Y)] / E[1/(Y(1 − Y))] = (a − 1) / (a + b − 2).

10.4 Randomized Minimax Estimators

So far, we have had little occasion to consider randomized estimators, that is, functions δ(X, U) of both the data and an independent source of randomness U ~ Unif(0, 1). Randomized estimators played little role in our exploration of average risk optimality, since non-randomized estimators of equal or better average risk are always available. However, they turn out to play a role when we consider the minimax criterion. Notice that when working with convex losses, we can still dispense with randomized estimators, because we can always find a deterministic estimator with the same or better performance. Indeed, the data X is always sufficient, so by the Rao-Blackwell theorem, the non-randomized estimator δ'(X) = E[δ(X, U) | X] is no worse than δ(X, U). However, there are non-convex losses for which no deterministic minimax estimator exists, as the following example demonstrates.
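The Rao-Blackwell step above can be illustrated with a small Monte Carlo sketch. The randomized estimator δ(X, U) = X̄ + (U − 1/2) below is a hypothetical choice for illustration (it is not from the notes); its conditional expectation given the data is δ'(X) = X̄, and under squared-error (convex) loss the randomization only adds the variance of U, namely 1/12:

```python
# Rao-Blackwellizing a randomized estimator under squared-error loss.
# Hypothetical setup: delta(X, U) = Xbar + (U - 1/2), U ~ Unif(0, 1),
# so E[delta(X, U) | X] = Xbar. Exact risks:
# R(theta, delta) = sigma^2/n + 1/12  >=  sigma^2/n = R(theta, Xbar).
import random

random.seed(0)
n, theta, sigma = 10, 1.0, 1.0
reps = 200_000

se_rand, se_rb = 0.0, 0.0
for _ in range(reps):
    xbar = sum(random.gauss(theta, sigma) for _ in range(n)) / n
    u = random.random()
    se_rand += (xbar + (u - 0.5) - theta) ** 2  # randomized estimator
    se_rb += (xbar - theta) ** 2                # Rao-Blackwellized version

print(se_rand / reps)  # approximately sigma^2/n + 1/12
print(se_rb / reps)    # approximately sigma^2/n
```

The Monte Carlo estimates should sit near the exact risks, with the randomized estimator strictly worse, exactly as the convexity argument predicts.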
Figure 10.1: By choosing α small enough, we can ensure that any choice of n + 1 values for the non-randomized estimator δ leaves some θ₀ ∈ [0, 1] at distance at least α from every one of the values δ(0), ..., δ(n).

Example 5 (Randomized minimax estimator). Let X ~ Bin(n, θ), where θ ∈ [0, 1], and consider estimation of θ under the 0-1 loss

L(θ, d) = 0 if |d − θ| < α, and 1 otherwise.

First consider an arbitrary non-randomized estimator δ. Since X can take on only the n + 1 values {0, 1, ..., n}, the estimator δ(X) can take on only the n + 1 values {δ(0), δ(1), ..., δ(n)}. If α < 1/(2(n + 1)), then we can always find θ₀ such that |δ(x) − θ₀| ≥ α for every x ∈ {0, ..., n}; see Figure 10.1. Hence, R(θ₀, δ(X)) = 1 is the maximum risk of any non-randomized δ.

Consider instead the estimator δ*(X, U) = U, which is completely random and independent of the data X. Then, for any θ ∈ [0, 1],

R(θ, δ*) = E_θ[L(θ, δ*(X, U))] = P(|θ − U| ≥ α) = 1 − P(θ − α < U < θ + α) ≤ 1 − α < 1,

and since α > 0, the maximum risk of δ* is smaller than the maximum risk of any non-randomized δ. Hence, in this setting, there can be no deterministic minimax estimator.
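Example 5 can be made concrete with a small exact computation. The sketch below uses arbitrary illustration choices (n = 4, the deterministic estimator δ(x) = x/n, and an α below the 1/(2(n + 1)) threshold) to exhibit a θ₀ where the deterministic estimator incurs risk 1, and to evaluate the risk of δ*(X, U) = U in closed form:

```python
# Worst-case risks in Example 5 under L(theta, d) = 1{|d - theta| >= alpha}.
# n, delta(x) = x/n, and alpha are illustration choices; alpha is taken
# strictly below the 1/(2(n+1)) threshold from the example.
n = 4
alpha = 1.0 / (2 * (n + 1)) / 2

# Deterministic delta(x) = x/n takes only n+1 values; a theta0 in a gap,
# e.g. the midpoint between 0 and 1/n, is at distance >= alpha from all:
values = [x / n for x in range(n + 1)]
theta0 = 1.0 / (2 * n)
assert all(abs(v - theta0) >= alpha for v in values)
# So L(theta0, delta(x)) = 1 for every x, and R(theta0, delta) = 1.

def risk_uniform(theta):
    # delta*(X, U) = U ~ Unif(0,1): the risk is one minus the length of
    # (theta - alpha, theta + alpha) intersected with [0, 1].
    return 1.0 - (min(theta + alpha, 1.0) - max(theta - alpha, 0.0))

worst_uniform = max(risk_uniform(t / 1000) for t in range(1001))
print(worst_uniform)  # attained at theta = 0 or 1; equals 1 - alpha < 1
```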