STATS 300A: Theory of Statistics (Fall 2015)
Lecture 11: October 27
Lecturer: Lester Mackey
Scribe: Viswajith Venugopal, Vivek Bagaria, Steve Yadlowsky

Warning: These notes may contain factual and/or typographic errors.

11.1 Summary

In this lecture, we will discuss the identification of minimax estimators via submodels, the admissibility of minimax estimators, and simultaneous estimation and the James-Stein estimator. This will conclude our discussion of estimation; in the future we will be focusing on the decision problem of hypothesis testing.

11.2 Minimax Estimators and Submodels

Recall that an estimator $\delta_M$ is minimax if its maximum risk is minimal:
$$\inf_\delta \sup_{\theta \in \Omega} R(\theta, \delta) = \sup_{\theta \in \Omega} R(\theta, \delta_M).$$
We saw how to derive the minimax estimator using least favourable priors in Lecture 10. In this lecture we will consider a different approach, based on the following lemma.

Lemma 1 (TPE 5.1.15). Suppose that $\delta$ is minimax for a submodel $\theta \in \Omega_0 \subseteq \Omega$ and
$$\sup_{\theta \in \Omega_0} R(\theta, \delta) = \sup_{\theta \in \Omega} R(\theta, \delta).$$
Then $\delta$ is minimax for the full model $\theta \in \Omega$.

This lemma allows us to find a minimax estimator for a particular tractable submodel, and then show that the worst-case risk for the full model is equal to that of the submodel (that is, the worst-case risk doesn't rise as you go to the full model). In this case, using the Lemma, we can argue that the estimator we found is also minimax for the full model. This was similar to how we justified minimaxity of the estimator of a Normal mean with bounded variance last lecture. Here's a fairly simple example.

Example 1. Let $X_1, \dots, X_n$ be i.i.d. $N(\mu, \sigma^2)$, where both $\mu$ and $\sigma^2$ are unknown. Thus, our parameter vector is $\theta = (\mu, \sigma^2)$ and our parameter space is $\Omega = \mathbb{R} \times \mathbb{R}_+$. Our task now is to estimate $\mu$. Our loss function is the relative squared error loss, given by
$$L((\mu, \sigma^2), d) = \frac{(d - \mu)^2}{\sigma^2}.$$

We consider this loss function to make the question of minimaxity more interesting: regular squared error loss is unbounded for the full model, since it is proportional to the variance, which is unbounded. We consider the submodel where $\sigma^2 = 1$. That is, $\Omega_0 = \mathbb{R} \times \{1\}$, and our loss function simplifies to our usual squared error loss: $L((\mu, 1), d) = (d - \mu)^2$. We saw in Example 1 of Lecture 10 that under this loss $\bar{X}$ is minimax for $\Omega_0$. Moreover,
$$R((\mu, \sigma^2), \bar{X}) = \frac{E(\bar{X} - \mu)^2}{\sigma^2} = \frac{1}{n} \quad \text{for all } (\mu, \sigma^2) \in \Omega.$$
Thus, the risk does not depend on $\sigma^2$. Since $R((\mu, 1), \bar{X}) = R((\mu, \sigma^2), \bar{X})$, we have that the maximum risks are equal. That is, $\sup_{\theta \in \Omega_0} R(\theta, \bar{X}) = \sup_{\theta \in \Omega} R(\theta, \bar{X})$. Therefore, it follows from Lemma 1 that $\bar{X}$ is minimax on $\Omega$.

Note that, thanks to our new loss function, we don't need to impose boundedness on our variance (like we did in our previous lecture) to establish minimaxity in a meaningful way.

This example is parametric, like a lot of the examples we've seen so far. Assuming we know the form of the distribution of the variables, and that the variables are i.i.d., are both strong assumptions. Now, we consider a more ambitious example, which is in a non-parametric setting, and hence more general.

Example 2 (TPE Example 5.1.16). Suppose $X_1, X_2, \dots, X_n$ are i.i.d. with common CDF $F$, with mean $\mu(F) < \infty$ and variance $\sigma^2(F) < \infty$. Our goal is to find a minimax estimate of $\mu(F)$ under squared error loss. Without further restriction on $F$, the worst-case risk is unbounded for every estimator, so every estimator is minimax. We will impose further constraints, restricting our family so that it has finite worst-case risk, to ensure that meaningful minimax estimators can be obtained.

Constraint (a). Assume $\sigma^2(F) \le B$. Now, we've seen in the previous lecture that $\bar{X}$ is minimax for the Gaussian submodel in this case. So a natural guess for us to make is that $\bar{X}$ is minimax. We verify this by application of Lemma 1. First, we compute the supremum risk for the full model:
$$R(F, \bar{X}) = \frac{1}{n^2} \sum_{i=1}^n E(X_i - \mu(F))^2 = \frac{\sigma^2(F)}{n}.$$
Since $\sigma^2(F) \in [0, B]$ by assumption, we get
$$\sup_F R(F, \bar{X}) = \frac{B}{n}.$$
Now, we saw in Lecture 10 that for the submodel $\mathcal{F}_0 = \{N(\mu, \sigma^2) : \sigma^2 \le B\}$, $\bar{X}$ is minimax. Further, the supremum risk in this case is identical to that of the full model:
$$\sup_{F \in \mathcal{F}_0} R(F, \bar{X}) = \frac{B}{n}.$$
Thus, using Lemma 1, we conclude that $\bar{X}$ is minimax for the full model. (That is, the non-parametric model still constrained to have $\sigma^2(F) \le B$.)
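
As a quick numerical sanity check on Constraint (a) (an addition to these notes, not part of the original lecture), the following Python sketch estimates $R(F, \bar{X}) = \sigma^2(F)/n$ by Monte Carlo for a few arbitrary choices of $F$ with $\sigma^2(F) \le B$, and compares each against the worst-case value $B/n$, which is attained within the Gaussian submodel at $\sigma^2 = B$. The sample size, bound, and particular distributions are illustrative assumptions only.

    import numpy as np

    rng = np.random.default_rng(0)
    n, B, reps = 25, 4.0, 200_000

    # Each entry: (name, sampler of shape (reps, n), true mean, true variance <= B).
    models = [
        ("N(1, B)",        lambda: rng.normal(1.0, np.sqrt(B), (reps, n)), 1.0, B),
        ("Uniform(0, 6)",  lambda: rng.uniform(0.0, 6.0, (reps, n)),       3.0, 3.0),
        ("Exponential(1)", lambda: rng.exponential(1.0, (reps, n)),        1.0, 1.0),
    ]

    for name, sample, mu, var in models:
        xbar = sample().mean(axis=1)              # sample mean for each replication
        risk = np.mean((xbar - mu) ** 2)          # Monte Carlo estimate of R(F, X-bar)
        print(f"{name:15s}  risk ~ {risk:.4f}   sigma^2(F)/n = {var / n:.4f}")

    print(f"worst-case bound B/n = {B / n:.4f}")  # attained by the N(mu, B) member

In each case the estimated risk should sit near $\sigma^2(F)/n$, with the Gaussian choice at the variance bound achieving $B/n$.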

Constraint (b). Assume $F \in \mathcal{F}$, where $\mathcal{F}$ is the set of all CDFs with support contained in $[0, 1]$. Is $\bar{X}$ minimax for this model? We have reason to believe that it is not, based on the minimax estimator we derived in Lecture 9 for the Binomial submodel. And in fact, it turns out that $\bar{X}$ isn't minimax. To show this, first consider the submodel $\mathcal{F}_0 = \{\mathrm{Bern}(\theta)\}_{\theta \in (0,1)}$. Let $Y = \sum_i X_i$ so that $Y \sim \mathrm{Bin}(n, \theta)$ and $\bar{X} = Y/n$. Recall from Lecture 9 that the minimax estimator for $\mu(F) = \theta$, in the Binomial case, is
$$\delta(X) = \frac{\sqrt{n}}{1 + \sqrt{n}} \bar{X} + \frac{1}{1 + \sqrt{n}} \cdot \frac{1}{2},$$
which has supremum risk $\frac{1}{4(1+\sqrt{n})^2}$. So
$$\sup_\theta R(\theta, \bar{X}) = \frac{1}{4n} > \frac{1}{4(1+\sqrt{n})^2} = \sup_\theta R(\theta, \delta).$$
Thus, $\bar{X}$ has a higher worst-case risk than $\delta(X)$ as defined above, and hence we have shown that $\bar{X}$ is not minimax.

Now, let's get more ambitious, and try to see if we can find the minimax estimator under the full model. We know that this can't be $\bar{X}$, but it's possible that it could be $\delta(X)$. To examine this possibility, we conjecture that $\delta(X)$ is also minimax under the full model. If we are to establish this using Lemma 1, we need to show that the supremum risk of $\delta(X)$ under the full model is no more than $\frac{1}{4(1+\sqrt{n})^2}$ (which is the supremum risk for the Binomial submodel). Let us compute:
$$\begin{aligned}
E_F[(\delta(X) - \mu(F))^2] &= E_F\left[\left(\frac{\sqrt{n}}{1+\sqrt{n}}(\bar{X} - \mu(F)) + \frac{1}{1+\sqrt{n}}\left(\frac{1}{2} - \mu(F)\right)\right)^2\right] \\
&= \left(\frac{1}{1+\sqrt{n}}\right)^2 \left[n \operatorname{Var}(\bar{X}) + \left(\frac{1}{2} - \mu(F)\right)^2\right] \\
&= \left(\frac{1}{1+\sqrt{n}}\right)^2 \left[E[X_1^2] - \mu(F)^2 + \frac{1}{4} - \mu(F) + \mu(F)^2\right] \\
&= \left(\frac{1}{1+\sqrt{n}}\right)^2 \left[E[X_1^2] + \frac{1}{4} - \mu(F)\right],
\end{aligned}$$
where the third step follows from the fact that $n \operatorname{Var}(\bar{X}) = \operatorname{Var}(X_1) = E[X_1^2] - (E[X_1])^2 = E[X_1^2] - \mu(F)^2$. By assumption $X_1 \in [0, 1]$, so $X_1^2 \le X_1$ and we can bound the risk:
$$E_F[(\delta(X) - \mu(F))^2] \le \left(\frac{1}{1+\sqrt{n}}\right)^2 \left[E[X_1] + \frac{1}{4} - \mu(F)\right] = \frac{1}{4(1+\sqrt{n})^2}.$$
So, $\delta(X)$ is minimax for the Binomial submodel, and its worst-case risk is the same for the full model and for the Binomial submodel. Therefore, applying Lemma 1, we conclude that $\delta(X)$ is minimax. Thus, we have found a minimax estimator.
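
To see the gap in Constraint (b) concretely, here is a short sketch (an addition, not part of the lecture) that evaluates the exact Bernoulli risks: $R(\theta, \bar{X}) = \theta(1-\theta)/n$, maximized at $1/(4n)$, versus the constant risk $1/(4(1+\sqrt{n})^2)$ of $\delta$, together with a small Monte Carlo check that $\delta$'s risk really is flat in $\theta$. The choice of $n$ and of the $\theta$ values is arbitrary.

    import numpy as np

    n = 25
    root_n = np.sqrt(n)
    theta = np.linspace(0.01, 0.99, 99)

    risk_xbar = theta * (1 - theta) / n               # R(theta, X-bar), exact
    risk_delta = 1 / (4 * (1 + root_n) ** 2)          # constant risk of the minimax rule

    print("sup_theta R(theta, X-bar) =", risk_xbar.max())   # equals 1/(4n) at theta = 1/2
    print("1/(4n)                    =", 1 / (4 * n))
    print("sup_theta R(theta, delta) =", risk_delta)

    # Monte Carlo check that delta's risk does not depend on theta.
    rng = np.random.default_rng(1)
    for t in (0.1, 0.5, 0.9):
        xbar = rng.binomial(n, t, size=200_000) / n
        delta = root_n / (1 + root_n) * xbar + 1 / (2 * (1 + root_n))
        print(f"theta = {t}: MC risk of delta ~ {np.mean((delta - t) ** 2):.5f}")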

11.3 Admissibility of minimax estimators

Let us now turn to the question of admissibility of minimax estimators. We begin by noting that the question of admissibility is particularly important for minimax estimators. This is because, although we found dominating estimators even when we were working with unbiased estimators, the dominating estimators were biased, so we lost the property (unbiasedness) that we were interested in; however, if you find an estimator that dominates a minimax estimator, it will still be minimax!

(Also, an aside: admissibility can give rise to minimaxity. If $\delta$ is admissible with constant risk, then $\delta$ is also minimax. This is not hard to show. Let the constant risk of $\delta$ be $r$. Then $r$ is also the worst-case risk of $\delta$, since the risk is constant. Now, if we assume $\delta$ is not minimax, there exists a different estimator, say $\delta'$, whose worst-case risk, say $r'$, satisfies $r' < r$. But since $r'$ is the worst-case risk of $\delta'$, that would mean that the risk of $\delta'$ is lower than $r$ throughout, and thus $\delta'$ dominates $\delta$. However, we assumed that $\delta$ was admissible, so this is a contradiction. Thus, our assumption led to a contradiction, and therefore $\delta$ is minimax.)

Note that minimaxity does not guarantee admissibility; it only ensures the worst-case risk is optimal. We need to check for admissibility. The following example illustrates several standard ways of doing so.

Example 3. Let $X_1, X_2, \dots, X_n \overset{iid}{\sim} N(\theta, \sigma^2)$, where $\sigma^2$ is known and $\theta$ is the estimand. Then the minimax estimator under squared error loss is $\bar{X}$, and we would like to determine whether $\bar{X}$ is admissible. Instead of answering this directly, we answer a more general question: when is $a\bar{X} + b$, $a, b \in \mathbb{R}$ (basically, any affine function of $\bar{X}$), admissible?

Case 1: $0 < a < 1$. In this case $a\bar{X} + b$ is a convex combination of $\bar{X}$ and $b$. By results we saw in the previous lecture, it is a Bayes estimator with respect to some Gaussian prior on $\theta$. Further, since we are using squared error loss, which is strictly convex, this Bayes estimator is unique. So, by Theorem 5.2.4 (which basically tells us that a unique Bayes estimator will always be admissible), $a\bar{X} + b$ is admissible.

Case 2: $a = 0$. In this case $b$ is also a unique Bayes estimator, with respect to a degenerate prior distribution placing unit mass at $\theta = b$. So by Theorem 5.2.4, $b$ is admissible.

Case 3: $a = 1$, $b \neq 0$. In this case $\bar{X} + b$ is not admissible because it is dominated by $\bar{X}$. To see this, note that $\bar{X}$ has the same variance as $\bar{X} + b$, but strictly smaller bias.

The next few cases use the following result. In general, the risk of $a\bar{X} + b$ is
$$E[(a\bar{X} + b - \theta)^2] = E[(a(\bar{X} - \theta) + (b + \theta(a - 1)))^2] = \frac{a^2 \sigma^2}{n} + (b + \theta(a - 1))^2,$$
where, in the first step, we added and subtracted $a\theta$ inside.
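
The closed form above is easy to verify numerically. The following sketch (an addition to the notes) compares a Monte Carlo estimate of $E[(a\bar{X} + b - \theta)^2]$ with the formula $a^2\sigma^2/n + (b + \theta(a-1))^2$ for a few arbitrary choices of $(a, b, \theta)$; the values of $n$ and $\sigma^2$ are also arbitrary.

    import numpy as np

    rng = np.random.default_rng(2)
    n, sigma2, reps = 20, 2.0, 400_000

    def risk_mc(a, b, theta):
        """Monte Carlo estimate of E[(a*Xbar + b - theta)^2]."""
        xbar = rng.normal(theta, np.sqrt(sigma2 / n), size=reps)  # Xbar ~ N(theta, sigma^2/n)
        return np.mean((a * xbar + b - theta) ** 2)

    def risk_formula(a, b, theta):
        return a ** 2 * sigma2 / n + (b + theta * (a - 1)) ** 2

    for a, b, theta in [(0.7, 0.3, 1.0), (1.0, 0.5, -2.0), (1.5, 0.0, 0.8), (-0.5, 1.0, 2.0)]:
        print(f"a={a:4.1f} b={b:3.1f} theta={theta:4.1f}   "
              f"MC {risk_mc(a, b, theta):.4f}   formula {risk_formula(a, b, theta):.4f}")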

Case 4: $a > 1$. If we apply the result for the general risk, we have
$$E[(a\bar{X} + b - \theta)^2] \ge \frac{a^2 \sigma^2}{n} > \frac{\sigma^2}{n} = R(\theta, \bar{X}).$$
The first inequality follows because the second summand in the expression for the general risk is always nonnegative. $\bar{X}$ dominates $a\bar{X} + b$ when $a > 1$, and so in this case $a\bar{X} + b$ is inadmissible.

Case 5: $a < 0$. Here
$$E[(a\bar{X} + b - \theta)^2] > (b + \theta(a - 1))^2 = (a - 1)^2 \left(\theta + \frac{b}{a - 1}\right)^2 \ge \left(\theta + \frac{b}{a - 1}\right)^2,$$
and the right-hand side is the risk of predicting the constant $-b/(a - 1)$. So, $-b/(a - 1)$ dominates $a\bar{X} + b$, and therefore $a\bar{X} + b$ is again inadmissible.

Now, we have considered every case except for the estimator $\bar{X}$ itself. It turns out that $\bar{X}$ is admissible. The argument in this case is more involved, and proceeds by contradiction.

Case 6: $a = 1$, $b = 0$. Here, we use a limiting Bayes argument. Suppose $\bar{X}$ is inadmissible. Then, assuming w.l.o.g. that $\sigma^2 = 1$, we have
$$R(\theta, \bar{X}) = \frac{1}{n}.$$
By our hypothesis, there must exist an estimator $\delta'$ such that $R(\theta, \delta') \le 1/n$ for all $\theta$ and $R(\theta, \delta') < 1/n$ for at least one $\theta \in \Omega$. Because $R(\theta, \delta')$ is continuous in $\theta$, there must exist $\varepsilon > 0$ and an interval $(\theta_0, \theta_1)$ containing that $\theta$ so that
$$R(\theta, \delta') < \frac{1}{n} - \varepsilon \quad \text{for all } \theta \in (\theta_0, \theta_1). \tag{11.1}$$
Let $r'_\tau$ be the average risk of $\delta'$ with respect to the prior distribution $N(0, \tau^2)$ on $\theta$. (Note that this is the exact same prior we used to prove that $\bar{X}$ was the limit of Bayes estimators, and hence minimax. We did this by letting $\tau \to \infty$, and therefore letting our prior tend to the improper prior $\pi(\theta) = 1$ for all $\theta$.) Let $r_\tau$ be the average risk of the Bayes estimator $\delta_\tau$ under the same prior. Note that $\delta_\tau \neq \delta'$, because $R(\theta, \delta_\tau) \to \infty$ as $\theta \to \infty$, which is not consistent with $R(\theta, \delta') \le 1/n$ for all $\theta \in \mathbb{R}$. So $r_\tau < r'_\tau$, because the Bayes estimator is unique almost surely (under the marginal distribution of the data).

We will look at the following ratio, which is selected to simplify our algebra later. This ratio, we will show, becomes arbitrarily large, which we will use to form a contradiction with $r_\tau < r'_\tau$. Using the form of the Bayes risk $r_\tau$ computed in a previous lecture (see TPE Example 5.1.14), namely $r_\tau = \tau^2 / (1 + n\tau^2)$, we can write
$$\frac{1/n - r'_\tau}{1/n - r_\tau} = n(1 + n\tau^2) \int \left[\frac{1}{n} - R(\theta, \delta')\right] \frac{1}{\sqrt{2\pi}\, \tau} \exp\left(-\frac{\theta^2}{2\tau^2}\right) d\theta.$$
Applying (11.1), we find
$$\frac{1/n - r'_\tau}{1/n - r_\tau} \ge n(1 + n\tau^2) \int_{\theta_0}^{\theta_1} \varepsilon\, \frac{1}{\sqrt{2\pi}\, \tau}\, e^{-\theta^2 / (2\tau^2)}\, d\theta = \frac{n(1 + n\tau^2)\, \varepsilon}{\sqrt{2\pi}\, \tau} \int_{\theta_0}^{\theta_1} e^{-\theta^2 / (2\tau^2)}\, d\theta.$$
As $\tau \to \infty$, the first factor $n(1 + n\tau^2)\varepsilon / (\sqrt{2\pi}\, \tau) \to \infty$, and since the integrand converges monotonically to 1, Lebesgue's monotone convergence theorem ensures that the integral approaches the positive quantity $\theta_1 - \theta_0$. So, for sufficiently large $\tau$, we must have
$$\frac{1/n - r'_\tau}{1/n - r_\tau} > 1.$$
This means that $r'_\tau < r_\tau$. However, this is a contradiction, because $r_\tau$ is the optimal average risk (since it is the Bayes risk). So our assumption that there was a dominating estimator was false, and in this case, $a\bar{X} + b = \bar{X}$ is admissible.
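
The divergence used in Case 6 can also be seen numerically. The sketch below (an addition; $n$, $\varepsilon$, $\theta_0$, $\theta_1$ are placeholder values, since the argument only needs their existence) evaluates the exact gap $1/n - r_\tau = 1/(n(1 + n\tau^2))$ and the lower bound $\varepsilon \int_{\theta_0}^{\theta_1} \varphi_\tau(\theta)\, d\theta$ on $1/n - r'_\tau$ implied by (11.1), and shows that their ratio grows without bound as $\tau \to \infty$, even though Bayes optimality of $\delta_\tau$ forces the true ratio to stay at most 1.

    import math

    n = 10
    eps, theta0, theta1 = 0.01, 0.0, 1.0          # hypothetical values for (11.1)

    def normal_cdf(x, tau):
        """CDF of N(0, tau^2) at x."""
        return 0.5 * (1 + math.erf(x / (tau * math.sqrt(2))))

    for tau in (1, 10, 100, 1000):
        gap_bayes = 1 / (n * (1 + n * tau ** 2))               # 1/n - r_tau, exact
        prior_mass = normal_cdf(theta1, tau) - normal_cdf(theta0, tau)
        gap_delta_lb = eps * prior_mass                        # lower bound on 1/n - r'_tau
        print(f"tau = {tau:5d}   ratio lower bound = {gap_delta_lb / gap_bayes:10.3f}")

    # The lower bound exceeds 1 for large tau, yet r'_tau >= r_tau would force the
    # true ratio (1/n - r'_tau)/(1/n - r_tau) to be at most 1: the contradiction.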

11.4 Simultaneous estimation

Up to this point, we have considered only situations where a single real-valued parameter is of interest. However, in practice, we often care about several parameters, and wish to estimate them all at once. In this section we consider the admissibility of estimators of several parameters, that is, of simultaneous estimation.

Example 4. Let $X_1, X_2, \dots, X_p$ be independent with $X_i \sim N(\theta_i, \sigma^2)$ for $1 \le i \le p$. For the sake of simplicity, say $\sigma^2 = 1$. Now our goal is to estimate $\theta = (\theta_1, \theta_2, \dots, \theta_p)$ under the loss function
$$L(\theta, d) = \sum_{i=1}^p (d_i - \theta_i)^2.$$
A natural estimator for $\theta$ is $X = (X_1, X_2, \dots, X_p)$. It can be shown that $X$ is the UMRUE, the maximum likelihood estimator, a generalized Bayes estimator, and a minimax estimator for $\theta$. So, it would be natural to think that $X$ is admissible. However, counterintuitively, it turns out that this is not the case when $p \ge 3$. When $p \ge 3$, $X$ is dominated by the James-Stein estimator (and that too, strictly dominated):
$$\delta(X) = (\delta_1(X), \delta_2(X), \dots, \delta_p(X)) \quad \text{where} \quad \delta_i(X) = \left(1 - \frac{p - 2}{\|X\|_2^2}\right) X_i.$$
Here $\|\cdot\|_2$ is the 2-norm, so $\|X\|_2^2 = \sum_{j=1}^p X_j^2$.
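
A quick Monte Carlo sketch of this domination claim (an addition to the notes; the dimension, mean vectors, and replication count are arbitrary choices): for $p \ge 3$ the James-Stein estimator shows a smaller total squared-error risk than $X$ at every $\theta$ tried, with the largest gains when $\|\theta\|_2$ is small.

    import numpy as np

    rng = np.random.default_rng(3)
    p, reps = 10, 100_000

    def mc_risks(theta):
        X = rng.normal(theta, 1.0, size=(reps, p))            # X_i ~ N(theta_i, 1)
        norm2 = np.sum(X ** 2, axis=1, keepdims=True)         # ||X||_2^2 per replication
        js = (1 - (p - 2) / norm2) * X                        # James-Stein estimator
        risk_x = np.mean(np.sum((X - theta) ** 2, axis=1))    # risk of X, should be ~ p
        risk_js = np.mean(np.sum((js - theta) ** 2, axis=1))
        return risk_x, risk_js

    for scale in (0.0, 1.0, 3.0):
        theta = scale * np.ones(p)                            # arbitrary mean vectors
        rx, rjs = mc_risks(theta)
        print(f"||theta||_2 = {np.linalg.norm(theta):5.2f}   "
              f"risk(X) ~ {rx:.2f}   risk(JS) ~ {rjs:.2f}")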

The J-S estimator makes use of the entire data vector when estimating each $\theta_i$, so it is surprising that this is beneficial given the assumption of independence amongst the components of $X$. An example of the James-Stein estimator being used to estimate batting averages is available at http://www-stat.stanford.edu/~ckirby/brad/other/article1977.pdf.

It turns out that the James-Stein estimator is not itself admissible, because it is dominated by the positive-part James-Stein estimator (TPE Theorem 5.5.4):
$$\delta_i(X) = \max\left(1 - \frac{p - 2}{\|X\|_2^2},\, 0\right) X_i.$$
To add insult to injury, even this estimator can be shown to be inadmissible, although that proof is non-constructive.

11.4.1 Motivation for the J-S estimator

To motivate the J-S estimator, we consider how it can arise in an empirical Bayes framework. The empirical Bayes approach (which builds on principles of Bayesian estimation, but is not strictly Bayesian) is a two-step process:

1. Introduce a prior family indexed by a hyperparameter (this is the Bayesian aspect).
2. Estimate the hyperparameter from the data (this is the empirical aspect).

So, applying this procedure to the problem at hand:

1. Suppose $\theta_i \overset{iid}{\sim} N(0, A)$; then the Bayes estimator for $\theta_i$ is
$$\delta_{A,i}(X) = \left(1 - \frac{1}{A + 1}\right) X_i = \frac{A}{A + 1} X_i.$$

2. In this step we must choose $A$. Marginalizing over $\theta$, we see that the coordinates of $X$ have the distribution
$$X_i \overset{iid}{\sim} N(0, A + 1).$$
(Exercise: Verify this.) We will use $X$ and the knowledge of this marginal distribution to find an estimate of $\frac{1}{A+1}$. One could, in principle, use any estimate of $A$, and it is common to use a maximum likelihood estimate, but here we will use an unbiased estimate. It can then be shown that
$$E\left[\frac{p - 2}{\|X\|_2^2}\right] = \frac{1}{A + 1}$$
(Exercise: Verify this. Hint: $\|X\|_2^2 / (A + 1)$ follows a $\chi^2_p$ distribution.) So $\frac{p - 2}{\|X\|_2^2}$ must be UMVU for $\frac{1}{A+1}$.

If we plug this estimator into our Bayes estimator, we obtain the J-S estimator:
$$\delta_i(X) = \left(1 - \frac{p - 2}{\|X\|_2^2}\right) X_i.$$
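
The unbiasedness claim in step 2 (one of the exercises above) is simple to check by simulation; the sketch below is an addition to the notes, with arbitrary choices of $p$, $A$, and the number of replications.

    import numpy as np

    rng = np.random.default_rng(4)
    p, A, reps = 8, 3.0, 500_000

    # Marginally, X_i ~ N(0, A + 1), independent across coordinates.
    X = rng.normal(0.0, np.sqrt(A + 1), size=(reps, p))
    norm2 = np.sum(X ** 2, axis=1)                  # ||X||_2^2 ~ (A + 1) * chi^2_p

    estimate = np.mean((p - 2) / norm2)             # Monte Carlo E[(p - 2) / ||X||_2^2]
    print("E[(p-2)/||X||^2] ~", round(estimate, 5), "   1/(A+1) =", 1 / (A + 1))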

11.4.2 James-Stein domination

Intuitively, the problem with the estimate $X$ is that $\|X\|_2^2$ is typically much larger than $\|\theta\|_2^2$:
$$E[\|X\|_2^2] = E\left[\sum_{j=1}^p X_j^2\right] = p + \sum_{i=1}^p \theta_i^2 = p + \|\theta\|_2^2,$$
where $p$ is actually $\sigma^2 p = p$ in this case. So, we may view the J-S estimator as a method for correcting the bias in the size of $X$. It achieves this by shrinking each coordinate of $X$ toward 0. The uniform superiority of the J-S estimator to $X$ can be formalised (see Keener).

Theorem (TPE Theorem 5.5.1). The James-Stein estimator $\delta$ has uniformly smaller risk than $X$ if $p \ge 3$.

The proof, given on p. 355 of TPE, compares the risk of the J-S estimator directly to that of $X$.
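
To close the loop on the claims above (an addition to the notes; $p$, the mean vectors, and the replication count are arbitrary), the sketch below checks the norm inflation $E\|X\|_2^2 = p + \|\theta\|_2^2$ and compares the Monte Carlo risks of $X$, the James-Stein estimator, and its positive-part version. Consistent with the theorem and with TPE Theorem 5.5.4, James-Stein beats $X$, and the positive-part estimator never does worse than plain James-Stein.

    import numpy as np

    rng = np.random.default_rng(5)
    p, reps = 5, 200_000

    def total_risk(est, theta):
        return np.mean(np.sum((est - theta) ** 2, axis=1))

    for scale in (0.0, 0.5, 2.0):
        theta = scale * np.arange(1, p + 1) / p               # arbitrary mean vector
        X = rng.normal(theta, 1.0, size=(reps, p))
        norm2 = np.sum(X ** 2, axis=1, keepdims=True)
        shrink = 1 - (p - 2) / norm2
        js = shrink * X                                       # James-Stein
        js_plus = np.maximum(shrink, 0.0) * X                 # positive-part James-Stein
        print(f"||theta||_2 = {np.linalg.norm(theta):4.2f}   "
              f"E||X||^2 ~ {np.mean(np.sum(X ** 2, axis=1)):.2f} "
              f"(p + ||theta||^2 = {p + theta @ theta:.2f})   "
              f"risk: X {total_risk(X, theta):.3f}, "
              f"JS {total_risk(js, theta):.3f}, "
              f"JS+ {total_risk(js_plus, theta):.3f}")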