Lecture 10, October 22: Minimaxity and least favorable prior sequences


STATS 300A: Theory of Statistics                                        Fall 2015

Lecture 10: October 22
Lecturer: Lester Mackey                        Scribe: Bryan He, Rahul Makhijani

Warning: These notes may contain factual and/or typographic errors.

10.1 Minimaxity and least favorable prior sequences

In this lecture, we will extend our tools for deriving minimax estimators. Last time, we discovered that minimax estimators can arise from Bayes estimators under least favorable priors. However, it turns out that minimax estimators may not be Bayes estimators. Consider the following example, where our old approach fails.

Example 1 (Minimax for i.i.d. normal random variables with unknown mean). Let $X_1, \dots, X_n \stackrel{iid}{\sim} N(\theta, \sigma^2)$, with $\sigma^2$ known. Our goal is to estimate $\theta$ under squared-error loss. For our first guess, pick the natural estimator $\bar{X}$. Note that it has constant risk $\sigma^2/n$, which suggests minimaxity, because we know that Bayes estimators with constant risk are also minimax estimators. However, $\bar{X}$ is not Bayes for any prior, because under squared-error loss unbiased estimators are Bayes estimators only in the degenerate situation of zero risk (TPE Theorem 4.2.3), and $\bar{X}$ is unbiased with positive risk. Thus, we cannot conclude by our previous results (e.g., TPE Corollary 5.1.5) that $\bar{X}$ is minimax.

We might try to consider the wider class of estimators $\delta_{a,\mu_0}(X) = a\bar{X} + (1-a)\mu_0$ for $a \in (0,1)$ and $\mu_0 \in \mathbb{R}$, because many of the Bayes estimators we have encountered are convex combinations of a prior mean and a data mean. Note, however, that the worst-case risk of these estimators is infinite:
\[
\sup_\theta E_\theta[\delta_{a,\mu_0}(X) - \theta]^2
= \sup_\theta \bigl\{ a^2 \operatorname{Var}_\theta(\bar{X}) + (1-a)^2 (\theta - \mu_0)^2 \bigr\}
= \frac{a^2 \sigma^2}{n} + (1-a)^2 \sup_\theta (\theta - \mu_0)^2 = +\infty.
\]
Since these estimators have poorer worst-case risk than $\bar{X}$, they certainly cannot be minimax. We could keep trying to find Bayes estimators with better worst-case performance than $\bar{X}$, but we would fail: it turns out that $\bar{X}$ is in fact minimax. To establish this, we will extend our minimax results to the limits of Bayes estimators, rather than restricting attention to Bayes estimators only.

Definition 1 (Least favorable sequence of priors). Let $\{\Lambda_m\}$ be a sequence of priors with minimal average risks $r_{\Lambda_m} = \inf_\delta \int R(\theta, \delta)\, d\Lambda_m(\theta)$. Then, $\{\Lambda_m\}$ is a least favorable sequence of priors if there is a real number $r$ such that $r_{\Lambda_m} \to r < \infty$ and $r \geq r_\Lambda$ for every prior $\Lambda$.
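As a quick numerical sanity check (illustrative only, not part of the original development; the grid and parameter choices are arbitrary), the following minimal Python sketch evaluates the exact risk formula above and shows the worst-case risk of $\delta_{a,\mu_0}$ growing without bound while $\bar{X}$ stays flat at $\sigma^2/n$:

```python
import numpy as np

# Exact risk of delta_{a,mu0}(X) = a * Xbar + (1 - a) * mu0 under squared error:
#   R(theta, delta) = a^2 * sigma^2 / n + (1 - a)^2 * (theta - mu0)^2.
def risk_shrinkage(theta, a, mu0, sigma2=1.0, n=10):
    return a**2 * sigma2 / n + (1 - a) ** 2 * (theta - mu0) ** 2

sigma2, n = 1.0, 10
thetas = np.linspace(-50, 50, 1001)
for a in (0.5, 0.9, 0.99):
    worst = max(risk_shrinkage(t, a, mu0=0.0, sigma2=sigma2, n=n) for t in thetas)
    print(f"a = {a}: worst-case risk over [-50, 50] is {worst:.2f}")
print(f"Xbar: constant risk sigma^2/n = {sigma2 / n}")  # flat in theta
```

Widening the $\theta$ grid makes the worst-case risks of the shrinkage rules arbitrarily large, matching the supremum calculation above.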

The reason for studying limits of priors is that they may help us establish minimaxity. Since there need not exist a single prior $\Lambda$ such that the associated Bayes estimator has average risk $r$, this definition is less restrictive than that of a least favorable prior. We can prove an analogue of TPE Theorem 5.1.4 in this new setting.

Theorem 1 (TPE 5.1.12). Suppose there is a real number $r$ such that $\{\Lambda_m\}$ is a sequence of priors with $r_{\Lambda_m} \to r < \infty$. Let $\delta$ be any estimator such that $\sup_\theta R(\theta, \delta) = r$. Then,

1. $\delta$ is minimax, and
2. $\{\Lambda_m\}$ is least favorable.

Proof. 1. Let $\delta'$ be any other estimator. Then, for any $m$,
\[
\sup_\theta R(\theta, \delta') \geq \int R(\theta, \delta')\, d\Lambda_m(\theta) \geq r_{\Lambda_m},
\]
so that sending $m \to \infty$ yields
\[
\sup_\theta R(\theta, \delta') \geq r = \sup_\theta R(\theta, \delta),
\]
which means that $\delta$ is minimax.

2. Let $\Lambda$ be any prior. Then
\[
r_\Lambda = \int R(\theta, \delta_\Lambda)\, d\Lambda(\theta) \leq \int R(\theta, \delta)\, d\Lambda(\theta) \leq \sup_\theta R(\theta, \delta) = r,
\]
which means that $\{\Lambda_m\}$ is least favorable.

Unlike Theorem 5.1.4, this result does not guarantee uniqueness, even if the Bayes estimators $\delta_{\Lambda_m}$ are unique, because the limiting step in the proof of (1) turns any strict inequality into a nonstrict one. However, this result lets us certify a much wider class of estimators: to check that a candidate is indeed minimax, we need only find a sequence of Bayes risks converging to the maximum risk of our candidate.

Example 2 (Minimax for i.i.d. normal random variables, continued). We now have the tools to confirm our suspicion that $\bar{X}$ is minimax. By Theorem 1 above, it suffices to find a sequence $\{\Lambda_m\}$ such that $r_{\Lambda_m} \to \sigma^2/n =: r$. Using the conjugate prior is a good starting point, so we let $\{\Lambda_m\}$ be the conjugate priors $N(0, m^2)$ with variance tending to infinity, so that $\Lambda_m$ tends to the improper uniform prior on $\mathbb{R}$ (with $\pi(\theta) = 1$ for $\theta \in \mathbb{R}$). By TPE Example 4.2.2, the posterior for $\theta$ associated with each $\Lambda_m$ is
\[
\theta \mid X_1, \dots, X_n \sim N\!\left( \frac{\frac{n}{\sigma^2}\,\bar{X}}{\frac{n}{\sigma^2} + \frac{1}{m^2}},\ \frac{1}{\frac{n}{\sigma^2} + \frac{1}{m^2}} \right).
\]
In particular, the posterior variance does not depend on $X_1, \dots, X_n$, so Lemma 1 below automatically yields the Bayes risk
\[
r_{\Lambda_m} = \frac{1}{\frac{n}{\sigma^2} + \frac{1}{m^2}} \xrightarrow{\,m \to \infty\,} \frac{\sigma^2}{n} = \sup_\theta R(\theta, \bar{X}).
\]
It follows from Theorem 1 that $\bar{X}$ is minimax and $\{\Lambda_m\}$ is least favorable.
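The convergence of the Bayes risks is easy to verify numerically. The short sketch below (an illustration under arbitrary parameter choices; the variable names are ours) evaluates $r_{\Lambda_m}$ from the closed form above for increasing prior variances $m^2$:

```python
# Bayes risk of the posterior mean under the conjugate prior N(0, m^2):
#   r_m = 1 / (n / sigma^2 + 1 / m^2),
# which increases to sigma^2 / n, the (constant) risk of Xbar.
sigma2, n = 2.0, 5
for m in (1, 3, 10, 100, 1000):
    r_m = 1.0 / (n / sigma2 + 1.0 / m**2)
    print(f"m = {m:5d}: r_Lambda_m = {r_m:.6f}")
print(f"limit sigma^2/n = {sigma2 / n}")
```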

Lemma 1 (TPE 5.1.13). If the posterior variance $\operatorname{Var}_{\Theta \mid X}(g(\Theta) \mid X = x)$ is constant in $x$, then under squared-error loss,
\[
r_\Lambda = \operatorname{Var}_{\Theta \mid X}(g(\Theta) \mid X = x).
\]

We know that the posterior mean minimizes the posterior risk, so this result can be obtained by plugging the posterior mean of $g(\Theta)$ into the average risk.

10.2 Minimaxity via submodel restriction

The following example illustrates the technique of deriving a minimax estimator for a general family of models by restricting attention to a subset of that family. The idea comes from a simple observation: if an estimator is minimax in a submodel and its worst-case risk does not change when we pass to the larger model, then the estimator is minimax in the larger class as well.

Example 3 (Minimax for i.i.d. normal random variables, unknown mean and variance). Reconsider Example 1 in the case that the variance is unknown. That is, let $X_1, \dots, X_n \stackrel{iid}{\sim} N(\theta, \sigma^2)$, with both $\theta$ and $\sigma^2$ unknown. Note that
\[
\sup_{\theta, \sigma^2} R\bigl((\theta, \sigma^2), \bar{X}\bigr) = \sup_{\theta, \sigma^2} \frac{\sigma^2}{n} = \infty,
\]
and in fact the maximum risk of any estimator in this setting is infinite, so the question of minimaxity is uninteresting. Therefore, we restrict attention to the family parameterized by $\Omega = \{(\theta, \sigma^2) : \theta \in \mathbb{R},\ \sigma^2 \leq B\}$, where $B$ is a known constant. Let $\delta$ be any other estimator. Calculating the risk of $\bar{X}$ within this family, we find
\[
\begin{aligned}
\sup_{\theta \in \mathbb{R},\, \sigma^2 \leq B} R\bigl((\theta, \sigma^2), \bar{X}\bigr)
&= \frac{B}{n} = \sup_{\theta \in \mathbb{R},\, \sigma^2 = B} R\bigl((\theta, \sigma^2), \bar{X}\bigr) \\
&\leq \sup_{\theta \in \mathbb{R},\, \sigma^2 = B} R\bigl((\theta, \sigma^2), \delta\bigr) && \text{[submodel minimax]} \\
&\leq \sup_{\theta \in \mathbb{R},\, \sigma^2 \leq B} R\bigl((\theta, \sigma^2), \delta\bigr),
\end{aligned}
\]
where the first inequality follows from the fact that $\bar{X}$ is minimax for i.i.d. normals with known $\sigma^2$, and the second inequality follows from the fact that we are taking the supremum over a larger set. Hence, we are able to show that $\bar{X}$ is minimax over $\Omega$ by focusing on the case where $\sigma^2$ is known. Notice further that the form of the estimator does not depend on the upper bound $B$, though the bound is necessary for minimaxity to be worth investigating.
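As a numerical companion to this argument (again an illustration with arbitrary parameter choices), the sketch below confirms by Monte Carlo that the risk of $\bar{X}$ is $\sigma^2/n$ regardless of $\theta$, so its worst case over $\Omega$ is attained on the boundary $\sigma^2 = B$:

```python
import numpy as np

# MC check: risk of Xbar is sigma^2 / n, free of theta, so its worst case
# over Omega = {(theta, sigma^2) : sigma^2 <= B} is B / n.
rng = np.random.default_rng(0)
n, B, reps = 25, 4.0, 200_000
for sigma2 in (0.5, 2.0, B):
    theta = rng.uniform(-10, 10)  # the risk should not depend on theta
    xbar = rng.normal(theta, np.sqrt(sigma2), size=(reps, n)).mean(axis=1)
    mc = np.mean((xbar - theta) ** 2)
    print(f"sigma^2 = {sigma2}: MC risk {mc:.4f} vs formula {sigma2 / n:.4f}")
print(f"worst case over Omega: B/n = {B / n}")
```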

10.3 Dependence on the Loss Function

In general, minimax estimators can vary depending on the loss being considered. Below, we provide an example of minimax estimation under weighted squared-error loss.

Example 4 (Minimax for binomial random variables, weighted squared-error loss). Let $X \sim \operatorname{Bin}(n, \theta)$ with the loss function
\[
L(\theta, d) = \frac{(d - \theta)^2}{\theta(1 - \theta)}.
\]
This is a simple weighted squared-error loss with weights $w(\theta) = \frac{1}{\theta(1-\theta)}$, but it is arguably more realistic than the usual squared error in this situation because it penalizes errors near $0$ and $1$ more strongly than errors near $\frac{1}{2}$. Note that for any $\theta$, $R(\theta, X/n) = \frac{1}{n}$; that is, the risk is constant in $\theta$, suggesting that $X/n$ is minimax. We will show that this is indeed the case. We should be careful, since TPE Theorem 4.2.3 is only valid under squared-error loss. Since our loss function is different, an unbiased estimator can be Bayes, and in this example that is indeed the case. Recall from TPE Corollary 4.1.2 that the Bayes estimator associated with the loss $L(\theta, d) = w(\theta)(d - \theta)^2$ is given by $\frac{E_{\Theta \mid X}[\Theta w(\Theta) \mid X]}{E_{\Theta \mid X}[w(\Theta) \mid X]}$. Invoking this result, we find that the Bayes estimator has the form
\[
\delta_\Lambda(X) = \frac{E_{\Theta \mid X}\!\left[\frac{1}{1 - \Theta} \,\middle|\, X\right]}{E_{\Theta \mid X}\!\left[\frac{1}{\Theta(1 - \Theta)} \,\middle|\, X\right]}. \tag{10.1}
\]
This is true for arbitrary priors $\Lambda$, but to calculate a closed-form Bayes estimator, we use a prior conjugate to the binomial likelihood: $\Theta \sim \Lambda_{a,b} = \operatorname{Beta}(a, b)$, for some $a, b > 0$. Suppose we observe $X = x$. If $a + x > 1$ and $b + n - x > 1$, then substituting the result of Remark 1 below into equation (10.1) proves that the estimator
\[
\delta_{a,b}(x) = \frac{a + x - 1}{a + b + n - 2}
\]
minimizes the posterior risk. In particular, the estimator $\delta_{1,1}(x) = \frac{x}{n}$ minimizes the posterior risk with respect to the uniform prior after observing $0 < x < n$. If we can verify that this form remains unchanged when $x \in \{0, n\}$, then the estimator $\delta_{1,1}(X) = \frac{X}{n}$ is Bayes with constant risk, and hence minimax. To see that this is the case, note that the posterior risk under the prior $\Lambda_{1,1}$, after observing $X = x$ and deciding $\delta(x) = d$, is
\[
\int_0^1 \frac{(d - \theta)^2}{\theta(1 - \theta)} \cdot \frac{\Gamma(n + 2)}{\Gamma(x + 1)\, \Gamma(n - x + 1)}\, \theta^x (1 - \theta)^{n - x}\, d\theta,
\]
which, in the case $X = 0$, simplifies to
\[
(n + 1) \int_0^1 \frac{(d - \theta)^2}{\theta}\, (1 - \theta)^{n - 1}\, d\theta.
\]
This integral converges for $d = 0$ and diverges otherwise, so the posterior risk is minimized by choosing $\delta(0) = 0$. Similarly, in the case $X = n$, the posterior risk is minimized by choosing $\delta(n) = \frac{n}{n} = 1$. This confirms that $\delta_{1,1}(X) = \frac{X}{n}$ minimizes the posterior risk for every outcome $X$ and is indeed Bayes. Since, as we mentioned before, this estimator has constant risk, we can conclude that $X/n$ is minimax.

Notice that the form of the minimax estimator here depends on the type of loss being used: $X/n$ has constant risk for the type of weighted squared-error loss considered here.
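To make the constant-risk claim tangible, here is a small Monte Carlo sketch (an illustration; the sample sizes are arbitrary) estimating $R(\theta, X/n)$ under the weighted loss at several values of $\theta$:

```python
import numpy as np

# MC check that R(theta, X/n) = E[(X/n - theta)^2 / (theta * (1 - theta))]
# equals 1/n for every theta, i.e. X/n has constant weighted risk.
rng = np.random.default_rng(1)
n, reps = 20, 500_000
for theta in (0.05, 0.3, 0.5, 0.9):
    x = rng.binomial(n, theta, size=reps)
    risk = np.mean((x / n - theta) ** 2 / (theta * (1 - theta)))
    print(f"theta = {theta}: weighted risk ~ {risk:.4f} (1/n = {1 / n:.4f})")
```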

Remark 1. Recall that the Beta function can be evaluated as
\[
\int_0^1 x^{k_1 - 1} (1 - x)^{k_2 - 1}\, dx = \frac{\Gamma(k_1)\, \Gamma(k_2)}{\Gamma(k_1 + k_2)}, \tag{10.2}
\]
whenever $k_1, k_2 > 0$. Therefore, if $Y \sim \operatorname{Beta}(a, b)$, where $a, b > 0$, we can explicitly evaluate the expectation
\[
\begin{aligned}
E\left[\frac{1}{1 - Y}\right]
&= \int_0^1 \frac{1}{1 - y} \cdot \frac{\Gamma(a + b)}{\Gamma(a)\, \Gamma(b)}\, y^{a - 1} (1 - y)^{b - 1}\, dy \\
&= \frac{\Gamma(a + b)}{\Gamma(a)\, \Gamma(b)} \int_0^1 y^{a - 1} (1 - y)^{b - 2}\, dy \\
&= \frac{\Gamma(a + b)}{\Gamma(a)\, \Gamma(b)} \cdot \frac{\Gamma(a)\, \Gamma(b - 1)}{\Gamma(a + b - 1)}
 = \frac{\Gamma(a + b)}{\Gamma(a + b - 1)} \cdot \frac{\Gamma(b - 1)}{\Gamma(b)} = \frac{a + b - 1}{b - 1},
\end{aligned}
\]
where in the second step we require $b > 1$ in order to apply relation (10.2). A similar argument yields
\[
E\left[\frac{1}{Y(1 - Y)}\right] = \frac{(a + b - 2)(a + b - 1)}{(a - 1)(b - 1)},
\]
whenever $a > 1$ as well. Combining these identities, we have that, whenever $a, b > 1$,
\[
\frac{E\left[\frac{1}{1 - Y}\right]}{E\left[\frac{1}{Y(1 - Y)}\right]} = \frac{a - 1}{a + b - 2}.
\]

10.4 Randomized Minimax Estimators

So far, we have had little occasion to consider randomized estimators, that is, functions $\delta(X, U)$ of both the data and an independent source of randomness $U \sim \operatorname{Unif}(0, 1)$. Randomized estimators played little role in our exploration of average risk optimality, since non-randomized estimators of equal or better average risk are always available. However, they turn out to play a role when we consider the minimax criterion. Notice that when working with convex losses, we can dispense with randomized estimators, because we can find a deterministic estimator with the same or better performance. Indeed, the data $X$ is always sufficient, so by the Rao-Blackwell theorem, the non-random estimator $\delta'(X) = E[\delta(X, U) \mid X]$ is no worse than $\delta(X, U)$. However, there are non-convex losses for which no deterministic minimax estimator exists, as the following example demonstrates.
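These identities are easy to sanity-check by simulation. The following sketch (an illustration with arbitrary $a, b$) draws from a Beta distribution and compares Monte Carlo averages with the closed forms:

```python
import numpy as np

# MC check of the Beta identities for Y ~ Beta(a, b) with a, b > 1:
#   E[1/(1-Y)]     = (a + b - 1) / (b - 1)
#   E[1/(Y*(1-Y))] = (a + b - 2)(a + b - 1) / ((a - 1)(b - 1))
rng = np.random.default_rng(2)
a, b = 3.0, 5.0
y = rng.beta(a, b, size=2_000_000)
print(np.mean(1 / (1 - y)), (a + b - 1) / (b - 1))
print(np.mean(1 / (y * (1 - y))), (a + b - 2) * (a + b - 1) / ((a - 1) * (b - 1)))
# Their ratio recovers the Bayes estimator's building block (a - 1)/(a + b - 2).
print(np.mean(1 / (1 - y)) / np.mean(1 / (y * (1 - y))), (a - 1) / (a + b - 2))
```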

[Figure 10.1: the interval from $0$ to $1$ with the values $\delta(i_1), \delta(i_2), \dots$ marked, and one gap between consecutive values labeled "length > α".]

Figure 10.1. By choosing $\alpha$ small enough, we can ensure that any choice of $n + 1$ values for the non-random estimator $\delta$ will leave some $\theta_0$ a distance at least $\alpha$ away from all of the $\delta$s.

Example 5 (Randomized minimax estimator). Let $X \sim \operatorname{Bin}(n, \theta)$, where $\theta \in [0, 1]$, and consider estimation of $\theta$ under the 0-1 loss
\[
L(\theta, d) = \begin{cases} 0 & \text{if } |d - \theta| < \alpha, \\ 1 & \text{otherwise}. \end{cases}
\]
First consider an arbitrary non-random estimator $\delta$. Since $X$ can take on only the $n + 1$ values $\{0, 1, \dots, n\}$, the estimator $\delta(X)$ can take on only $n + 1$ values, $\{\delta(0), \delta(1), \dots, \delta(n)\}$. If $\alpha < \frac{1}{2(n + 1)}$, then we can always find $\theta_0$ such that $|\delta(x) - \theta_0| \geq \alpha$ for every $x \in \{0, \dots, n\}$; see Figure 10.1. Hence, $R(\theta_0, \delta(X)) = 1$ is the maximum risk of any non-random $\delta$.

Consider instead the estimator $\delta^*(X, U) = U$, which is completely random and independent of the data $X$. Then, for any $\theta \in [0, 1]$,
\[
R(\theta, \delta^*) = E[L(\theta, \delta^*(X, U))] = P(|\theta - U| \geq \alpha) = 1 - P(\theta - \alpha < U < \theta + \alpha) \leq 1 - \alpha < 1,
\]
and since $\alpha > 0$, the maximum risk of $\delta^*$ is smaller than the maximum risk of any non-random $\delta$. Hence, in this setting, there can be no deterministic minimax estimator.
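A simulation makes the gap concrete. The sketch below (an illustration; the choices $n = 10$ and $\alpha = 0.01$ are arbitrary but satisfy $\alpha < \frac{1}{2(n+1)}$) compares the worst-case risk of the deterministic rule $X/n$, which hits $1$ at any $\theta$ at distance at least $\alpha$ from every multiple of $1/n$, with that of the data-free randomized rule $\delta^* = U$:

```python
import numpy as np

# 0-1 loss L(theta, d) = 1{|d - theta| >= alpha}.
# Deterministic delta(X) = X/n vs randomized delta* = U ~ Unif(0, 1).
rng = np.random.default_rng(3)
n, alpha, reps = 10, 0.01, 100_000
thetas = np.linspace(0.0, 1.0, 201)

det_risk, rand_risk = [], []
for theta in thetas:
    x = rng.binomial(n, theta, size=reps)
    det_risk.append(np.mean(np.abs(x / n - theta) >= alpha))
    # Exact risk of U: 1 minus the length of (theta - alpha, theta + alpha) in [0, 1].
    covered = min(theta + alpha, 1.0) - max(theta - alpha, 0.0)
    rand_risk.append(1.0 - covered)

print("worst-case risk, deterministic X/n:", max(det_risk))   # reaches 1
print("worst-case risk, randomized U:     ", max(rand_risk))  # 1 - alpha < 1
```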