SDS: Theoretical Statistics
1 SDS: Theoretical Statistics. Lecture 1: Introduction. Purnamrita Sarkar, Department of Statistics and Data Science, The University of Texas at Austin
2 Managerial Stuff. Instructor: Purnamrita Sarkar. Course material and homeworks will be posted on the course website. Office hours: Tuesdays 2-3 pm, GDC. TA: Jack Liu. Homeworks are due biweekly after class on Wednesdays. Grading: 5 homeworks (70%), class participation (5%), final exam (25%). Books: Asymptotic Statistics, Aad van der Vaart, Cambridge; Martin Wainwright's High-Dimensional Statistics: A Non-Asymptotic Viewpoint.
3 Why do theory? Say you have estimated θ̂_n from data X_1, ..., X_n. How do we know we have a good estimation method? Does θ̂_n → θ? This brings us to stochastic convergence. How do I know if one estimation method is better than another? Does the estimate from one converge faster than the other? Does one algorithm work under broader parameter regimes, or weaker assumptions? What is the optimal rate for a given estimation problem?
4 This class. Your instructor hopes to cover: consistency of parameter estimates; stochastic convergence; concentration inequalities; asymptotic normality of estimators; empirical processes, VC classes, covering numbers; asymptotic testing; examples of network clustering with a bit of random matrix theory; bootstrap, nonparametric regression and density estimation.
5 Stochastic Convergence. Assume that X_n, n ≥ 1, and X are elements of a separable metric space (S, d). Definition (Weak Convergence): A sequence of random variables converges in law (or in distribution) to a random variable X, written X_n →d X, if P(X_n ≤ x) → P(X ≤ x) for all x at which x ↦ P(X ≤ x) is continuous. Definition (Convergence in Probability): A sequence of random variables converges in probability to a random variable X, written X_n →P X, if for all ε > 0, P(d(X_n, X) ≥ ε) → 0.
6 Stochastic Convergence. Assume that X_n, n ≥ 1, and X are elements of a separable metric space (S, d). Definition (Almost Sure Convergence): A sequence of random variables converges almost surely to a random variable X, written X_n →a.s. X, if P(lim_{n→∞} d(X_n, X) = 0) = 1. If you think about a (scalar) random variable as a function that maps events to a real number, almost sure convergence means P(ω ∈ Ω : lim_{n→∞} X_n(ω) = X(ω)) = 1. Definition (Convergence in Quadratic Mean): A sequence of random variables converges in quadratic mean to a random variable X, written X_n →q.m. X, if E[d(X_n, X)²] → 0.
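These modes can be probed numerically. The following simulation sketch is my addition, not from the slides, and assumes NumPy is available: it estimates P(|X̄_n − µ| ≥ ε) for Bernoulli(1/2) samples and watches it shrink with n, which is exactly convergence in probability of the sample mean to µ (the weak law of large numbers).

```python
import numpy as np

# Monte Carlo check of convergence in probability: the sample mean of
# Bernoulli(1/2) variables satisfies X̄_n ->P mu = 1/2.
rng = np.random.default_rng(0)
mu, eps, reps = 0.5, 0.1, 2000

def prob_far(n):
    # Estimate P(|X̄_n - mu| >= eps) over `reps` independent samples.
    means = rng.binomial(n, mu, size=reps) / n
    return np.mean(np.abs(means - mu) >= eps)

p10, p1000 = prob_far(10), prob_far(1000)
print(p10, p1000)  # the probability of a large deviation shrinks with n
```

At n = 10 a deviation of 0.1 is well within one standard deviation of X̄_n, so the probability is large; at n = 1000 it is about six standard deviations out and essentially vanishes.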
7 Stochastic Convergence. Theorem: X_n →a.s. X ⇒ X_n →P X, and X_n →q.m. X ⇒ X_n →P X. Further, X_n →P X ⇒ X_n →d X, and X_n →d c for a constant c ⇒ X_n →P c.
8 Converses: X_n →d X does not imply X_n →P X. Convergence in law needs no knowledge of the joint distribution of X_n and the limiting random variable X; convergence in probability does. Example: Consider X ~ N(0, 1) and X_n = −X. Then X_n →d X. But how about X_n →P X? P(|X_n − X| ≥ ε) = P(2|X| ≥ ε) ↛ 0 for ε > 0. So X_n does not converge in probability to X.
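A quick numerical illustration of this slide (my addition): X and −X have identical marginal distributions, yet the gap |X_n − X| = 2|X| never shrinks, no matter how large n is.

```python
import numpy as np

# X ~ N(0,1) and X_n = -X: identical marginal distributions, but the
# distance |X_n - X| = 2|X| does not go to zero for any n.
rng = np.random.default_rng(1)
x = rng.standard_normal(100_000)
xn = -x  # the same for every n

# Marginals agree (both standard normal): compare a few quantiles.
print(np.quantile(x, [0.25, 0.5, 0.75]), np.quantile(xn, [0.25, 0.5, 0.75]))

# But P(|X_n - X| >= 0.5) = P(2|X| >= 0.5), a constant near 0.80.
frac_far = np.mean(np.abs(xn - x) >= 0.5)
print(frac_far)
```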
10 Example. Let Z ~ U(0, 1) and, for n = 2^k + m with k ≥ 0 and 0 ≤ m < 2^k, define X_n = 1(Z ∈ [m 2^{−k}, (m + 1) 2^{−k})), i.e. X_1 = 1, X_2 = 1(Z ∈ [0, 1/2)), X_3 = 1(Z ∈ [1/2, 1)), X_4 = 1(Z ∈ [0, 1/4)), X_5 = 1(Z ∈ [1/4, 1/2)), and so on. For any Z ∈ (0, 1), the sequence {X_n(Z)} does not converge, since it equals 1 infinitely often and 0 infinitely often. So X_n does not converge almost surely to 0. The X_n are a sequence of Bernoullis with success probabilities p_n = 1/2^k where k = ⌊log₂ n⌋, so p_n → 0. So X_n →P 0 and X_n →q.m. 0.
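This "moving indicator" sequence is easy to tabulate. A small sketch of my own (the helper function name is mine, not the lecture's): for any fixed z, X_n(z) equals 1 exactly once in every dyadic block [2^k, 2^{k+1}), so it takes the value 1 infinitely often, even though P(X_n = 1) = 2^{−k} → 0.

```python
import math

def x_n(n, z):
    # X_n = 1(Z in [m 2^{-k}, (m+1) 2^{-k})) where n = 2^k + m, 0 <= m < 2^k.
    k = int(math.floor(math.log2(n)))
    m = n - 2**k
    return 1 if m * 2**-k <= z < (m + 1) * 2**-k else 0

z = 0.3  # any fixed point of (0, 1)
for k in range(6):
    block = [x_n(n, z) for n in range(2**k, 2**(k + 1))]
    # Exactly one interval in each dyadic block contains z.
    print(k, sum(block))
```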
14 Example. Let Z ~ U([0, 1]) and X_n = 2^n 1(Z ∈ [0, 1/n)). Does X_n converge to 0 in quadratic mean, almost surely, or in probability? P(lim_{n→∞} X_n = 0) = P(Z > 0) = 1, so X_n →a.s. 0. E[X_n²] = 2^{2n}/n → ∞, so X_n does not converge in quadratic mean to 0. P(|X_n| ≥ ε) = P(X_n = 2^n) = P(Z ∈ [0, 1/n)) = 1/n → 0, so X_n →P 0.
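The three answers can be checked by direct computation (a sketch of mine, not from the slides): the event {X_n ≠ 0} becomes rare, the second moment explodes, and for any fixed realization z > 0 the sequence is eventually zero.

```python
# X_n = 2^n * 1(Z in [0, 1/n)):
# P(X_n != 0) = 1/n -> 0        (convergence in probability),
# E[X_n^2]   = 4^n / n -> inf   (no convergence in quadratic mean).
for n in [1, 5, 10, 20]:
    p_nonzero = 1 / n
    second_moment = 4**n / n
    print(n, p_nonzero, second_moment)

# Almost sure convergence: for a fixed realization Z = z > 0,
# X_n(z) = 0 for every n with 1/n <= z, i.e. for all n beyond 1/z.
z = 0.013
nonzero_ns = [n for n in range(1, 200) if z < 1 / n]
print(nonzero_ns[-1])  # the last index where X_n(z) is nonzero
```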
18 Continuous Mapping Theorem. Theorem: Let g be continuous on a set C where P(X ∈ C) = 1. Then X_n →d X ⇒ g(X_n) →d g(X); X_n →P X ⇒ g(X_n) →P g(X); X_n →a.s. X ⇒ g(X_n) →a.s. g(X).
19 Example. Let X_n →d X where X ~ N(0, 1). Then X_n² →d ? Use g(x) = x², and note X² ~ χ²₁. So X_n² →d χ²₁.
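A simulation sketch of mine (it uses a standardized mean of uniforms as a stand-in for a sequence with X_n →d N(0, 1)): squaring pushes the distribution toward χ²₁, whose mean is 1 and variance is 2.

```python
import numpy as np

# X_n = sqrt(n)(mean of n uniforms - 1/2)/sd is approximately N(0,1) by the
# CLT; the continuous mapping theorem with g(x) = x^2 gives X_n^2 ->d chi^2_1.
rng = np.random.default_rng(2)
n, reps = 100, 20_000
u = rng.random((reps, n))
xn = np.sqrt(n) * (u.mean(axis=1) - 0.5) / np.sqrt(1 / 12)
sq = xn**2

print(sq.mean(), sq.var())  # chi^2_1 has mean 1 and variance 2
```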
23 Example: continuity points. Let X_1, ..., X_n be i.i.d. with mean µ and variance σ². We have X̄_n − µ →d 0. Consider g(x) = 1(x > 0). Then g((X̄_n − µ)²) →d ? Using the continuous mapping theorem, (X̄_n − µ)² →d 0. Can we use the continuous mapping theorem to claim that g((X̄_n − µ)²) →d 0? NO. Because the limit 0 is a random variable whose mass is at 0, where g is discontinuous.
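The failure is visible numerically (a sketch of my own): with continuous data, (X̄_n − µ)² is strictly positive with probability one, so g((X̄_n − µ)²) = 1 essentially always, nowhere near g(0) = 0.

```python
import numpy as np

# g(x) = 1(x > 0) is discontinuous at 0, exactly where the limit puts all
# its mass, so the continuous mapping theorem does not apply to g.
rng = np.random.default_rng(3)
n, reps = 1000, 5000
means = rng.standard_normal((reps, n)).mean(axis=1)  # mu = 0, so err = X̄_n

g_vals = (means**2 > 0).astype(float)
print(g_vals.mean())  # stays at 1, even though (X̄_n - mu)^2 ->d 0
```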
27 How about convergence in q.m.? If X_n →q.m. X, then is it true that for continuous f (discontinuous only on a set of measure zero), f(X_n) →q.m. f(X)? Consider an L-Lipschitz function f: |f(x) − f(y)| ≤ L|x − y|. Then E[|f(X_n) − f(X)|²] ≤ L² E[|X_n − X|²] → 0. So for Lipschitz functions quadratic mean convergence goes through. Can you come up with a non-Lipschitz function f and a sequence {X_n} with X_n →q.m. 0 where f(X_n) does not converge in quadratic mean to f(0)?
31 Portmanteau Theorem. Theorem: The following are equivalent. X_n →d X. E[f(X_n)] → E[f(X)] for all bounded and continuous f. E[f(X_n)] → E[f(X)] for all bounded and Lipschitz f. E[e^{i t^T X_n}] → E[e^{i t^T X}] for all t ∈ R^k (Lévy's continuity theorem). t^T X_n →d t^T X for all t ∈ R^k (Cramér-Wold device). lim inf_n E[f(X_n)] ≥ E[f(X)] for all non-negative continuous f. lim sup_n P(X_n ∈ F) ≤ P(X ∈ F) for all closed F. lim inf_n P(X_n ∈ G) ≥ P(X ∈ G) for all open G. P(X_n ∈ B) → P(X ∈ B) for all continuity sets B, i.e. sets with P(X ∈ ∂B) = 0.
32 Example: bounded. Consider f(x) = x and X_n = n with probability 1/n, X_n = 0 with probability 1 − 1/n. Then X_n →d 0, but E[X_n]? E[X_n] = n · (1/n) = 1 for every n, not 0. What went wrong? f(x) = x is not bounded.
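The bookkeeping is short enough to verify directly (my sketch): with the unbounded f(x) = x the expectation sticks at 1, while a bounded version such as f(x) = min(x, 1) behaves as the portmanteau theorem promises. The truncation min(x, 1) is my choice of bounded continuous function, not the slide's.

```python
# X_n = n w.p. 1/n and 0 w.p. 1 - 1/n, so X_n ->d 0.
for n in [2, 10, 100, 1000]:
    e_x = n * (1 / n)          # E[X_n] = 1 for every n: f(x) = x is unbounded
    e_bounded = min(n, 1) / n  # E[min(X_n, 1)] = 1/n -> 0 = min(0, 1)
    print(n, e_x, e_bounded)
```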
36 Putting everything together. Theorem: (1) X_n →d X and d(X_n, Y_n) →P 0 imply Y_n →d X. (2) X_n →d X and Y_n →d c imply (X_n, Y_n) →d (X, c). (3) X_n →P X and Y_n →P Y imply (X_n, Y_n) →P (X, Y). Eq. (3) does not hold if we replace convergence in probability by convergence in distribution. Example: X_n ~ N(0, 1), Y_n = X_n, with X and Y independent standard normal random variables. Then X_n →d X and Y_n →d Y. But (X_n, Y_n) →d (X, X), not (X, Y).
40 Putting everything together. Theorem (Slutsky's theorem): X_n →d X and Y_n →d c imply that X_n + Y_n →d X + c, X_n Y_n →d cX, and X_n/Y_n →d X/c (for c ≠ 0). Does X_n →d X and Y_n →d Y imply X_n + Y_n →d X + Y? No. Take Y_n = −X_n, and X, Y independent standard normal random variables. Then X_n →d X and Y_n →d Y, but X_n + Y_n →d 0, not X + Y.
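A simulation sketch (my addition): with Y_n = −X_n the sum is identically zero, while a sum of genuinely independent standard normals has variance 2, so marginal convergence alone cannot pin down the law of X_n + Y_n.

```python
import numpy as np

rng = np.random.default_rng(4)
reps = 50_000

xn = rng.standard_normal(reps)
yn = -xn                       # same marginal as xn, but fully dependent
print(np.var(xn + yn))         # exactly 0: X_n + Y_n ->d 0

x = rng.standard_normal(reps)  # genuinely independent X and Y
y = rng.standard_normal(reps)
print(np.var(x + y))           # close to 2 = Var(X + Y)
```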
43 Using all this. If X_1, ..., X_n are i.i.d. random variables with mean µ and variance σ², prove that √n (X̄_n − µ)/S_n →d N(0, 1). First note that S_n² = (1/(n−1)) Σ_i (X_i − X̄_n)² = (n/(n−1)) ((1/n) Σ_i X_i² − X̄_n²). The law of large numbers gives (1/n) Σ_i X_i² →P E[X²] and X̄_n →P µ. So ((1/n) Σ_i X_i², X̄_n) →P (E[X²], µ), and now using the continuous mapping theorem, S_n² →P σ². Next, √n (X̄_n − µ) →d N(0, σ²) using the CLT. Finally, using Slutsky's theorem, √n (X̄_n − µ)/S_n →d N(0, 1).
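The whole chain (LLN + continuous mapping + CLT + Slutsky) can be stress-tested on skewed data. A sketch of mine, using Exponential(1) samples so that normality of the data is clearly not assumed:

```python
import numpy as np

# Studentized mean sqrt(n)(X̄_n - mu)/S_n for Exponential(1) data (mu = 1).
rng = np.random.default_rng(5)
n, reps = 200, 20_000
x = rng.exponential(1.0, size=(reps, n))
t = np.sqrt(n) * (x.mean(axis=1) - 1.0) / x.std(axis=1, ddof=1)

print(t.mean(), t.var())  # approximately N(0,1): mean near 0, variance near 1
```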
49 Uniformly tight. Definition: X is defined to be tight if for all ε > 0 there exists M for which P(|X| > M) < ε. {X_n} is defined to be uniformly tight if for all ε > 0 there exists M for which sup_n P(|X_n| > M) < ε.
50 Prohorov's theorem. Theorem: X_n →d X implies {X_n} is uniformly tight (UT). Conversely, {X_n} UT implies that there exists a subsequence {n_j} and a random variable X such that X_{n_j} →d X.
51 Notation for rates: small oh-pee and big oh-pee. Definition: The small o_P: X_n = o_P(1) means X_n →P 0; X_n = o_P(R_n) means X_n = Y_n R_n with Y_n = o_P(1), i.e. X_n is vanishing in probability relative to R_n. The big O_P: X_n = O_P(1) means {X_n} is uniformly tight; X_n = O_P(R_n) means X_n = Y_n R_n with Y_n = O_P(1), i.e. X_n/R_n lies within a ball of finite radius with high probability.
52 How do they interact. o_P(a_n) + o_P(b_n) = o_P(max(a_n, b_n)). o_P(a_n) + O_P(b_n) = O_P(b_n) (when a_n = O(b_n)). O_P(a_n) o_P(b_n) = o_P(a_n b_n). 1 + O_P(1) = O_P(1). (1 + o_P(1))^{−1} = 1 + o_P(1). o_P(O_P(1)) = o_P(1). X_n = o_P(a_n) implies |X_n|^r = o_P(a_n^r). Be careful: e^{o_P(1)} is not o_P(1). And O_P(1) + O_P(1) can actually be o_P(1) because of cancellation.
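A sketch of mine showing what O_P(1/√n) means in practice: the raw error X̄_n − µ collapses, so it is o_P(1), but rescaled by √n its spread stabilizes at σ, so √n(X̄_n − µ) is O_P(1) and not o_P(1).

```python
import numpy as np

# Sample means of n Bernoulli(1/2) variables, so mu = 1/2 and sigma = 1/2.
rng = np.random.default_rng(6)
reps = 20_000
for n in [100, 10_000]:
    err = rng.binomial(n, 0.5, size=reps) / n - 0.5  # X̄_n - mu
    # err.std() ~ sigma/sqrt(n): X̄_n - mu = O_P(n^{-1/2}) = o_P(1),
    # while sqrt(n)*err has spread ~ sigma = 1/2 at every n: O_P(1).
    print(n, err.std(), (np.sqrt(n) * err).std())
```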
More informationThe Multivariate Gaussian Distribution [DRAFT]
The Multivariate Gaussian Distribution DRAFT David S. Rosenberg Abstract This is a collection of a few key and standard results about multivariate Gaussian distributions. I have not included many proofs,
More informationExercise 1. Let f be a nonnegative measurable function. Show that. where ϕ is taken over all simple functions with ϕ f. k 1.
Real Variables, Fall 2014 Problem set 3 Solution suggestions xercise 1. Let f be a nonnegative measurable function. Show that f = sup ϕ, where ϕ is taken over all simple functions with ϕ f. For each n
More informationUses of Asymptotic Distributions: In order to get distribution theory, we need to norm the random variable; we usually look at n 1=2 ( X n ).
1 Economics 620, Lecture 8a: Asymptotics II Uses of Asymptotic Distributions: Suppose X n! 0 in probability. (What can be said about the distribution of X n?) In order to get distribution theory, we need
More informationMath 320: Real Analysis MWF 1pm, Campion Hall 302 Homework 8 Solutions Please write neatly, and in complete sentences when possible.
Math 320: Real Analysis MWF pm, Campion Hall 302 Homework 8 Solutions Please write neatly, and in complete sentences when possible. Do the following problems from the book: 4.3.5, 4.3.7, 4.3.8, 4.3.9,
More informationMcGill University Math 354: Honors Analysis 3
Practice problems McGill University Math 354: Honors Analysis 3 not for credit Problem 1. Determine whether the family of F = {f n } functions f n (x) = x n is uniformly equicontinuous. 1st Solution: The
More informationStat 710: Mathematical Statistics Lecture 31
Stat 710: Mathematical Statistics Lecture 31 Jun Shao Department of Statistics University of Wisconsin Madison, WI 53706, USA Jun Shao (UW-Madison) Stat 710, Lecture 31 April 13, 2009 1 / 13 Lecture 31:
More information17. Convergence of Random Variables
7. Convergence of Random Variables In elementary mathematics courses (such as Calculus) one speaks of the convergence of functions: f n : R R, then lim f n = f if lim f n (x) = f(x) for all x in R. This
More informationExample continued. Math 425 Intro to Probability Lecture 37. Example continued. Example
continued : Coin tossing Math 425 Intro to Probability Lecture 37 Kenneth Harris kaharri@umich.edu Department of Mathematics University of Michigan April 8, 2009 Consider a Bernoulli trials process with
More informationδ -method and M-estimation
Econ 2110, fall 2016, Part IVb Asymptotic Theory: δ -method and M-estimation Maximilian Kasy Department of Economics, Harvard University 1 / 40 Example Suppose we estimate the average effect of class size
More informationBennett-type Generalization Bounds: Large-deviation Case and Faster Rate of Convergence
Bennett-type Generalization Bounds: Large-deviation Case and Faster Rate of Convergence Chao Zhang The Biodesign Institute Arizona State University Tempe, AZ 8587, USA Abstract In this paper, we present
More informationAustin Mohr Math 704 Homework 6
Austin Mohr Math 704 Homework 6 Problem 1 Integrability of f on R does not necessarily imply the convergence of f(x) to 0 as x. a. There exists a positive continuous function f on R so that f is integrable
More informationOn the convergence of sequences of random variables: A primer
BCAM May 2012 1 On the convergence of sequences of random variables: A primer Armand M. Makowski ECE & ISR/HyNet University of Maryland at College Park armand@isr.umd.edu BCAM May 2012 2 A sequence a :
More informationlim F n(x) = F(x) will not use either of these. In particular, I m keeping reserved for implies. ) Note:
APPM/MATH 4/5520, Fall 2013 Notes 9: Convergence in Distribution and the Central Limit Theorem Definition: Let {X n } be a sequence of random variables with cdfs F n (x) = P(X n x). Let X be a random variable
More informationSOME CONVERSE LIMIT THEOREMS FOR EXCHANGEABLE BOOTSTRAPS
SOME CONVERSE LIMIT THEOREMS OR EXCHANGEABLE BOOTSTRAPS Jon A. Wellner University of Washington The bootstrap Glivenko-Cantelli and bootstrap Donsker theorems of Giné and Zinn (990) contain both necessary
More informationMidterm #1. Lecture 10: Joint Distributions and the Law of Large Numbers. Joint Distributions - Example, cont. Joint Distributions - Example
Midterm #1 Midterm 1 Lecture 10: and the Law of Large Numbers Statistics 104 Colin Rundel February 0, 01 Exam will be passed back at the end of class Exam was hard, on the whole the class did well: Mean:
More informationECO Class 6 Nonparametric Econometrics
ECO 523 - Class 6 Nonparametric Econometrics Carolina Caetano Contents 1 Nonparametric instrumental variable regression 1 2 Nonparametric Estimation of Average Treatment Effects 3 2.1 Asymptotic results................................
More informationBahadur representations for bootstrap quantiles 1
Bahadur representations for bootstrap quantiles 1 Yijun Zuo Department of Statistics and Probability, Michigan State University East Lansing, MI 48824, USA zuo@msu.edu 1 Research partially supported by
More informationSTAT 331. Martingale Central Limit Theorem and Related Results
STAT 331 Martingale Central Limit Theorem and Related Results In this unit we discuss a version of the martingale central limit theorem, which states that under certain conditions, a sum of orthogonal
More informationSDS 321: Introduction to Probability and Statistics
SDS 321: Introduction to Probability and Statistics Lecture 10: Expectation and Variance Purnamrita Sarkar Department of Statistics and Data Science The University of Texas at Austin www.cs.cmu.edu/ psarkar/teaching
More informationMidterm 1. Every element of the set of functions is continuous
Econ 200 Mathematics for Economists Midterm Question.- Consider the set of functions F C(0, ) dened by { } F = f C(0, ) f(x) = ax b, a A R and b B R That is, F is a subset of the set of continuous functions
More informationThus f is continuous at x 0. Matthew Straughn Math 402 Homework 6
Matthew Straughn Math 402 Homework 6 Homework 6 (p. 452) 14.3.3, 14.3.4, 14.3.5, 14.3.8 (p. 455) 14.4.3* (p. 458) 14.5.3 (p. 460) 14.6.1 (p. 472) 14.7.2* Lemma 1. If (f (n) ) converges uniformly to some
More informationLecture I: Asymptotics for large GUE random matrices
Lecture I: Asymptotics for large GUE random matrices Steen Thorbjørnsen, University of Aarhus andom Matrices Definition. Let (Ω, F, P) be a probability space and let n be a positive integer. Then a random
More informationLecture 1: August 28
36-705: Intermediate Statistics Fall 2017 Lecturer: Siva Balakrishnan Lecture 1: August 28 Our broad goal for the first few lectures is to try to understand the behaviour of sums of independent random
More informationMachine Learning. Instructor: Pranjal Awasthi
Machine Learning Instructor: Pranjal Awasthi Course Info Requested an SPN and emailed me Wait for Carol Difrancesco to give them out. Not registered and need SPN Email me after class No promises It s a
More informationn! (k 1)!(n k)! = F (X) U(0, 1). (x, y) = n(n 1) ( F (y) F (x) ) n 2
Order statistics Ex. 4.1 (*. Let independent variables X 1,..., X n have U(0, 1 distribution. Show that for every x (0, 1, we have P ( X (1 < x 1 and P ( X (n > x 1 as n. Ex. 4.2 (**. By using induction
More information