Theoretical Statistics. Lecture 1. Peter Bartlett.
1. Organizational issues. 2. Overview. 3. Stochastic convergence.
Organizational Issues
Lectures: Tue/Thu 11am-12:30pm, 332 Evans. Peter Bartlett. bartlett@stat. Office hours: Tue 1-2pm, Wed 1:30-2:30pm (Evans 399).
GSI: Siqi Wu. siqi@stat. Office hours: Mon 3:30-4:30pm, Tue 3:30-4:30pm (Evans 307).
http://www.stat.berkeley.edu/~bartlett/courses/210b-spring2013/ Check it for announcements, homework assignments, ...
Texts: Asymptotic Statistics, Aad van der Vaart. Cambridge, 1998. Convergence of Stochastic Processes, David Pollard. Springer, 1984. Available on-line at http://www.stat.yale.edu/~pollard/1984book/
Organizational Issues
Assessment: Homework Assignments (60%): posted on the website. Final Exam (40%): scheduled for Thursday, 5/16/13, 8-11am.
Required background: Stat 210A, and either Stat 205A or Stat 204.
Asymptotics: Why?
Example: We have a sample of size n from a density p_θ. Some estimator gives θ̂_n.
Consistent? i.e., θ̂_n → θ? (Stochastic convergence.)
Rate? Is it optimal? Often there are no finite-sample optimality results. Asymptotically optimal?
Variance of estimate? Optimal? Asymptotically?
Distribution of estimate? Confidence region. Asymptotically?
Asymptotics: Approximate confidence regions
Example: We have a sample of size n from a density p_θ. The maximum likelihood estimator gives θ̂_n.
Under mild conditions, √n (θ̂_n − θ) is asymptotically N(0, I_θ^{-1}). Thus √n I_θ^{1/2} (θ̂_n − θ) ⇝ N(0, I), and n (θ̂_n − θ)^T I_θ (θ̂_n − θ) ⇝ χ²_k.
So we have an approximate 1−α confidence region for θ:
{ θ : (θ − θ̂_n)^T I_{θ̂_n} (θ − θ̂_n) ≤ χ²_{k,α} / n }.
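As a concrete (one-dimensional, hypothetical) illustration of this Wald-type construction, a minimal simulation for a Bernoulli parameter, where the MLE is the sample mean and the Fisher information is I_θ = 1/(θ(1−θ)); the true parameter, sample size, and seed are arbitrary choices:

```python
import numpy as np

# Hypothetical illustration (not from the lecture): an approximate 95% Wald
# confidence interval for a Bernoulli parameter theta, built from the
# asymptotic normality of the MLE. For Bernoulli(theta) the MLE is the
# sample mean and the Fisher information is I_theta = 1/(theta(1-theta)).
rng = np.random.default_rng(0)
theta_true, n = 0.3, 1000
x = rng.binomial(1, theta_true, size=n)

theta_hat = x.mean()                               # MLE
info_hat = 1.0 / (theta_hat * (1.0 - theta_hat))   # plug-in Fisher information
half_width = 1.96 / np.sqrt(n * info_hat)          # chi^2_{1,0.05} = 1.96^2
ci = (theta_hat - half_width, theta_hat + half_width)
```

With k = 1, the region {θ : n I_{θ̂_n} (θ − θ̂_n)² ≤ χ²_{1,α}} is exactly the interval θ̂_n ± z_{α/2} / √(n I_{θ̂_n}) computed above.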
Overview of the Course
1. Tools for consistency, rates, asymptotic distributions: Stochastic convergence. Concentration inequalities. Projections. U-statistics. Delta method.
2. Tools for richer settings (e.g., function spaces vs. ℝ^k): Uniform laws of large numbers. Empirical process theory. Metric entropy. Functional delta method.
3. Tools for asymptotics of likelihood ratios: Contiguity. Local asymptotic normality.
4. Asymptotic optimality: Efficiency of estimators. Efficiency of tests.
5. Applications: Nonparametric regression. Nonparametric density estimation. M-estimators. Bootstrap estimators.
Convergence in Distribution
X_1, X_2, ..., X are random vectors.
Definition: X_n converges in distribution (or weakly converges) to X (written X_n ⇝ X) means that their distribution functions satisfy F_n(x) → F(x) at all continuity points x of F.
Review: Other Types of Convergence
d is a distance on ℝ^k (for which the Borel σ-algebra is the usual one).
Definition: X_n converges almost surely to X (written X_n →_as X) means that d(X_n, X) → 0 a.s.
Definition: X_n converges in probability to X (written X_n →_P X) means that, for all ε > 0, P(d(X_n, X) > ε) → 0.
Review: Other Types of Convergence
Theorem:
X_n →_as X ⇒ X_n →_P X ⇒ X_n ⇝ X,
X_n →_P c ⇔ X_n ⇝ c.
NB: For X_n →_as X and X_n →_P X, X_n and X must be functions on the sample space of the same probability space. This is not required for convergence in distribution.
Convergence in Distribution: Equivalent Definitions
Theorem: [Portmanteau] The following are equivalent:
1. P(X_n ≤ x) → P(X ≤ x) for all continuity points x of x ↦ P(X ≤ x).
2. Ef(X_n) → Ef(X) for all bounded, continuous f.
3. Ef(X_n) → Ef(X) for all bounded, Lipschitz f.
4. E e^{i t^T X_n} → E e^{i t^T X} for all t ∈ ℝ^k. (Lévy's Continuity Theorem)
5. For all t ∈ ℝ^k, t^T X_n ⇝ t^T X. (Cramér-Wold Device)
6. lim inf Ef(X_n) ≥ Ef(X) for all nonnegative, continuous f.
7. lim inf P(X_n ∈ U) ≥ P(X ∈ U) for all open U.
8. lim sup P(X_n ∈ F) ≤ P(X ∈ F) for all closed F.
9. P(X_n ∈ B) → P(X ∈ B) for all continuity sets B (i.e., P(X ∈ ∂B) = 0).
Convergence in Distribution: Equivalent Definitions
Example: [Why do we need continuity?] Consider f(x) = 1[x > 0] and X_n = 1/n. Then X_n ⇝ 0 and f(X_n) = 1 → 1, but f(0) = 0.
[Why do we need boundedness?] Consider f(x) = x and
X_n = n w.p. 1/n, X_n = 0 w.p. 1 − 1/n.
Then X_n ⇝ 0 and Ef(X_n) = 1 → 1, but Ef(0) = 0.
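Portmanteau condition 2 can be checked numerically. A minimal sketch (all distributions, sample sizes, and the choice f = arctan are illustrative): X_n is the standardized mean of n Bernoulli(0.2) variables, X ~ N(0,1), and f = arctan is bounded and continuous, so E f(X_n) should approach E f(X) as n grows.

```python
import numpy as np

# Monte Carlo check of portmanteau condition 2: for X_n the standardized
# mean of n Bernoulli(0.2) variables and X ~ N(0,1), E f(X_n) -> E f(X)
# for the bounded continuous f = arctan. The skewness of the small-n
# distribution makes E f(X_n) visibly different from E f(X) at n = 5.
rng = np.random.default_rng(1)
reps, p = 100_000, 0.2
sd = np.sqrt(p * (1 - p))

def mean_f(n):
    xbar = (rng.random((reps, n)) < p).mean(axis=1)   # Bernoulli(p) means
    z = (xbar - p) * np.sqrt(n) / sd                  # standardized mean
    return np.arctan(z).mean()

target = np.arctan(rng.standard_normal(reps)).mean()  # ~ E arctan(X)
gap_small = abs(mean_f(5) - target)
gap_large = abs(mean_f(100) - target)
```

The gap at n = 100 should be an order of magnitude smaller than at n = 5, up to Monte Carlo noise.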
Relating Convergence Properties
Theorem:
X_n ⇝ X and d(X_n, Y_n) →_P 0 ⇒ Y_n ⇝ X,
X_n ⇝ X and Y_n →_P c ⇒ (X_n, Y_n) ⇝ (X, c),
X_n →_P X and Y_n →_P Y ⇒ (X_n, Y_n) →_P (X, Y).
Relating Convergence Properties
NB: It is NOT true that X_n ⇝ X and Y_n ⇝ Y ⇒ (X_n, Y_n) ⇝ (X, Y). (Joint convergence versus marginal convergence in distribution.)
Example: Consider X, Y independent N(0,1), X_n ~ N(0,1), Y_n = −X_n. Then X_n ⇝ X and Y_n ⇝ Y, but (X_n, Y_n) ⇝ (X, −X), which has a very different distribution from that of (X, Y).
Relating Convergence Properties: Continuous Mapping
Suppose f : ℝ^k → ℝ^m is almost surely continuous (i.e., for some S with P(X ∈ S) = 1, f is continuous on S).
Theorem: [Continuous mapping]
X_n ⇝ X ⇒ f(X_n) ⇝ f(X).
X_n →_P X ⇒ f(X_n) →_P f(X).
X_n →_as X ⇒ f(X_n) →_as f(X).
Relating Convergence Properties: Continuous Mapping
Example: For X_1, ..., X_n i.i.d. with mean µ and variance σ², we have (√n/σ)(X̄_n − µ) ⇝ N(0,1). So (n/σ²)(X̄_n − µ)² ⇝ (N(0,1))² = χ²_1.
Example: We also have X̄_n − µ →_P 0, hence (X̄_n − µ)² →_P 0. Consider f(x) = 1[x > 0]. Then f((X̄_n − µ)²) = 1, which does not converge to f(0) = 0. (The problem is that f is not continuous at 0, and P(X = 0) > 0 for the X satisfying (X̄_n − µ)² ⇝ X, namely X = 0.)
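The first example above can be checked by simulation: squaring the standardized mean should give approximately a χ²_1 variable, whose mean is 1 and variance is 2. A minimal sketch, using exponential(1) data (so µ = σ = 1) as an illustrative choice:

```python
import numpy as np

# Continuous mapping illustration: (n/sigma^2)(Xbar - mu)^2 is approximately
# chi^2_1 for large n. Exponential(1) data gives mu = sigma = 1, so the
# statistic is just n * (Xbar - 1)^2. chi^2_1 has mean 1 and variance 2.
rng = np.random.default_rng(2)
n, reps = 400, 20_000
xbar = rng.exponential(1.0, size=(reps, n)).mean(axis=1)
q = n * (xbar - 1.0) ** 2   # approx chi^2_1

q_mean, q_var = q.mean(), q.var()
```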
Relating Convergence Properties: Slutsky's Lemma
Theorem: X_n ⇝ X and Y_n ⇝ c imply
X_n + Y_n ⇝ X + c,
Y_n X_n ⇝ cX,
Y_n^{-1} X_n ⇝ c^{-1} X (provided c is invertible).
(Why does X_n ⇝ X and Y_n ⇝ Y not imply X_n + Y_n ⇝ X + Y?)
Relating Convergence Properties: Examples
Theorem: For i.i.d. Y_i with EY_1 = µ and Var(Y_1) = σ² < ∞,
√n (Ȳ_n − µ) / S_n ⇝ N(0,1),
where
Ȳ_n = n^{-1} Σ_{i=1}^n Y_i,
S_n² = (n−1)^{-1} Σ_{i=1}^n (Y_i − Ȳ_n)².
Proof:
S_n² = (n/(n−1)) ( n^{-1} Σ_{i=1}^n Y_i² − Ȳ_n² ),
where n/(n−1) → 1, n^{-1} Σ Y_i² →_P EY_1², and Ȳ_n →_P EY_1 (weak law of large numbers). Hence
S_n² →_P EY_1² − (EY_1)² = σ²
(continuous mapping theorem, Slutsky's Lemma).
Also,
√n (Ȳ_n − µ) · (1/S_n) ⇝ N(0, σ²) · (1/σ) = N(0,1),
since √n (Ȳ_n − µ) ⇝ N(0, σ²) (central limit theorem) and 1/S_n →_P 1/σ (continuous mapping theorem, Slutsky's Lemma).
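The theorem above can be checked by simulation: the self-normalized statistic √n(Ȳ_n − µ)/S_n should be approximately N(0,1) even for data far from normal. A minimal sketch with uniform(0,1) data (µ = 1/2), an illustrative choice:

```python
import numpy as np

# The t-statistic sqrt(n)(Ybar - mu)/S_n is asymptotically N(0,1) for any
# i.i.d. data with finite variance; here uniform(0,1) data with mu = 0.5.
# We check that the statistic's empirical mean is near 0 and variance near 1.
rng = np.random.default_rng(3)
n, reps = 500, 20_000
y = rng.uniform(0.0, 1.0, size=(reps, n))
t = np.sqrt(n) * (y.mean(axis=1) - 0.5) / y.std(axis=1, ddof=1)

t_mean, t_var = t.mean(), t.var()
```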
Showing Convergence in Distribution
Recall that the characteristic function characterizes weak convergence:
X_n ⇝ X ⇔ E e^{i t^T X_n} → E e^{i t^T X} for all t ∈ ℝ^k.
Theorem: [Lévy's Continuity Theorem] If E e^{i t^T X_n} → φ(t) for all t in ℝ^k, and φ : ℝ^k → ℂ is continuous at 0, then X_n ⇝ X, where E e^{i t^T X} = φ(t).
Special case: X_n = Y for all n. So X and Y have the same distribution iff φ_X = φ_Y.
Showing Convergence in Distribution
Theorem: [Weak law of large numbers] Suppose X_1, ..., X_n are i.i.d. Then X̄_n →_P µ iff φ'_{X_1}(0) = iµ.
Proof: We'll show that φ'_{X_1}(0) = iµ implies X̄_n →_P µ. Indeed,
E e^{i t X̄_n} = φ^n(t/n) = (1 + itµ/n + o(1/n))^n → e^{itµ} = φ_µ(t),
the characteristic function of the constant µ. Lévy's Theorem implies X̄_n ⇝ µ, hence X̄_n →_P µ.
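The conclusion X̄_n →_P µ is easy to see in simulation: the probability of a fixed-size deviation shrinks as n grows. A minimal sketch with exponential(1) data (µ = 1); the sample sizes and ε are illustrative:

```python
import numpy as np

# Weak law in action: for exponential(1) data, P(|Xbar_n - 1| > eps)
# shrinks as n grows. Estimated by Monte Carlo over many replications.
rng = np.random.default_rng(5)
reps, eps = 5_000, 0.1

def miss_prob(n):
    xbar = rng.exponential(1.0, size=(reps, n)).mean(axis=1)
    return np.mean(np.abs(xbar - 1.0) > eps)

p_small, p_large = miss_prob(20), miss_prob(2_000)
```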
Showing Convergence in Distribution
e.g., X ~ N(µ, Σ) has characteristic function
φ_X(t) = E e^{i t^T X} = e^{i t^T µ − t^T Σ t / 2}.
Theorem: [Central limit theorem] Suppose X_1, ..., X_n are i.i.d., EX_1 = 0, EX_1² = 1. Then √n X̄_n ⇝ N(0,1).
Proof: φ_{X_1}(0) = 1, φ'_{X_1}(0) = iEX_1 = 0, φ''_{X_1}(0) = i² EX_1² = −1. So
E e^{i t √n X̄_n} = φ^n(t/√n) = (1 + 0 − t² EX_1² / (2n) + o(1/n))^n → e^{−t²/2} = φ_{N(0,1)}(t).
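The limit φ^n(t/√n) → e^{−t²/2} in this proof can be checked numerically for a concrete distribution. An illustrative choice: X_1 Rademacher (±1, each w.p. 1/2), so EX_1 = 0, EX_1² = 1, and φ(t) = cos(t).

```python
import math

# Numeric check of the characteristic-function limit in the CLT proof:
# for Rademacher X_1, phi(t) = cos(t), and phi(t/sqrt(n))^n should
# approach exp(-t^2/2) as n grows. We track the error at a fixed t.
t = 1.5
target = math.exp(-t * t / 2.0)
errors = [abs(math.cos(t / math.sqrt(n)) ** n - target)
          for n in (10, 100, 10_000)]
```

The errors shrink roughly like 1/n, matching the o(1/n) remainder in the expansion.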
Uniformly Tight
Definition: X is tight means that for all ε > 0 there is an M for which P(‖X‖ > M) < ε.
{X_n} is uniformly tight (or bounded in probability) means that for all ε > 0 there is an M for which sup_n P(‖X_n‖ > M) < ε.
(So there is a compact set that contains each X_n with high probability.)
Uniformly Tight
Theorem: [Prohorov's Theorem]
1. X_n ⇝ X implies {X_n} is uniformly tight.
2. {X_n} uniformly tight implies that for some X and some subsequence, X_{n_j} ⇝ X.
Notation for rates: o_P, O_P
Definition:
X_n = o_P(1) means X_n →_P 0.
X_n = o_P(R_n) means X_n = Y_n R_n with Y_n = o_P(1).
X_n = O_P(1) means {X_n} is uniformly tight.
X_n = O_P(R_n) means X_n = Y_n R_n with Y_n = O_P(1).
(i.e., o_P and O_P specify rates of growth of a sequence: o_P means strictly slower (the sequence Y_n converges in probability to zero); O_P means within some constant (the sequence Y_n lies in a ball with high probability).)
Relations between rates
o_P(1) + o_P(1) = o_P(1).
o_P(1) + O_P(1) = O_P(1).
o_P(1) O_P(1) = o_P(1).
(1 + o_P(1))^{-1} = O_P(1).
o_P(O_P(1)) = o_P(1).
X_n →_P 0 and R(h) = o(‖h‖^p) imply R(X_n) = o_P(‖X_n‖^p).
X_n →_P 0 and R(h) = O(‖h‖^p) imply R(X_n) = O_P(‖X_n‖^p).
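The distinction between o_P(1) and O_P(1) is visible in simulation: for i.i.d. mean-zero data, X̄_n = o_P(1) (a high quantile of |X̄_n| shrinks with n), while √n X̄_n = O_P(1) (the rescaled quantile stays bounded). A minimal sketch; the sample sizes, the 99th percentile, and the bound 4.0 are illustrative choices:

```python
import numpy as np

# o_P vs O_P: for i.i.d. N(0,1) data, the 99th percentile of |Xbar_n|
# shrinks like 1/sqrt(n) (so Xbar_n = o_P(1)), while the 99th percentile
# of sqrt(n)|Xbar_n| is roughly constant (so sqrt(n) Xbar_n = O_P(1)).
rng = np.random.default_rng(4)
reps = 5_000

def q99(n, scale):
    xbar = rng.standard_normal((reps, n)).mean(axis=1)
    return np.quantile(np.abs(xbar) * scale, 0.99)

raw_small = q99(100, 1.0)                  # ~ 2.58 / 10
raw_big = q99(2_500, 1.0)                  # ~ 2.58 / 50: shrinks
scaled_big = q99(2_500, np.sqrt(2_500))    # ~ 2.58: stays bounded
```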