Information Measure Estimation and Applications: Boosting the Effective Sample Size from n to n ln n

1 Information Measure Estimation and Applications: Boosting the Effective Sample Size from n to n ln n. Jiantao Jiao (Stanford EE), joint work with Kartik Venkat (Stanford EE), Yanjun Han (Tsinghua EE), and Tsachy Weissman (Stanford EE). Information Theory, Learning, and Big Data Workshop, March 17th, 2015.

2 Problem Setting. Observe $n$ samples from a discrete distribution $P$ with support size $S$. We want to estimate $H(P) = -\sum_{i=1}^{S} p_i \ln p_i$ (Shannon 48) and $F_\alpha(P) = \sum_{i=1}^{S} p_i^\alpha$, $\alpha > 0$ (diversity, Simpson index, Rényi entropy, etc.).

3 What was known? Start with entropy: $H(P) = -\sum_{i=1}^{S} p_i \ln p_i$. Question: What is the optimal estimator for $H(P)$ given $n$ samples?

4 What was known? Start with entropy: $H(P) = -\sum_{i=1}^{S} p_i \ln p_i$. Question: What is the optimal estimator for $H(P)$ given $n$ samples? No unbiased estimator exists, and the minimax estimator cannot be computed... (Lehmann and Casella 98)

5 What was known? Start with entropy: $H(P) = -\sum_{i=1}^{S} p_i \ln p_i$. Question: What is the optimal estimator for $H(P)$ given $n$ samples? No unbiased estimator exists, and the minimax estimator cannot be computed... (Lehmann and Casella 98) Classical asymptotics: What is the optimal estimator for $H(P)$ as $n \to \infty$?

6 Classical problem. Empirical entropy: $H(P_n) = -\sum_{i=1}^{S} \hat{p}_i \ln \hat{p}_i$, where $\hat{p}_i$ is the empirical frequency of symbol $i$. $H(P_n)$ is the Maximum Likelihood Estimator (MLE).

7 Classical problem. Empirical entropy: $H(P_n) = -\sum_{i=1}^{S} \hat{p}_i \ln \hat{p}_i$, where $\hat{p}_i$ is the empirical frequency of symbol $i$. $H(P_n)$ is the Maximum Likelihood Estimator (MLE). Theorem: The MLE $H(P_n)$ is asymptotically efficient (Hájek–Le Cam theory).
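
As a concrete reference point (not from the slides), here is a minimal Python sketch of the plug-in estimator $H(P_n)$ above, with a quick illustration of its downward bias on a uniform distribution when $n$ is comparable to $S$; the particular values of $S$ and $n$ are arbitrary.

```python
import numpy as np

def plugin_entropy(samples):
    """Plug-in (MLE) entropy estimate H(P_n) = -sum_i p_hat_i * ln(p_hat_i), in nats."""
    _, counts = np.unique(samples, return_counts=True)
    p_hat = counts / counts.sum()
    return -np.sum(p_hat * np.log(p_hat))

# Illustration: with n comparable to S, the plug-in estimate of a uniform
# distribution on S symbols is biased downward by roughly S/(2n) nats.
rng = np.random.default_rng(0)
S, n = 1000, 1000
estimates = [plugin_entropy(rng.integers(S, size=n)) for _ in range(50)]
print(f"true H = {np.log(S):.3f}, mean plug-in = {np.mean(estimates):.3f}")
```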

8 Is there something missing? The optimal estimator for finite samples is unknown :( Question: How about using the MLE when $n$ is finite?

9 Is there something missing? The optimal estimator for finite samples is unknown :( Question: How about using the MLE when $n$ is finite? We will show it is a very bad idea in general.

10 Our evaluation criterion: minimax decision theory. Denote by $\mathcal{M}_S$ the set of distributions with support size $S$.

11 Our evaluation criterion: minimax decision theory. Denote by $\mathcal{M}_S$ the set of distributions with support size $S$. Risk: $E_P[(F(P) - \hat{F})^2]$.

12 Our evaluation criterion: minimax decision theory. Denote by $\mathcal{M}_S$ the set of distributions with support size $S$. Maximum risk: $R_n^{\max}(F, \hat{F}) = \sup_{P \in \mathcal{M}_S} E_P[(F(P) - \hat{F})^2]$.

13 Our evaluation criterion: minimax decision theory. Denote by $\mathcal{M}_S$ the set of distributions with support size $S$. Maximum risk: $R_n^{\max}(F, \hat{F}) = \sup_{P \in \mathcal{M}_S} E_P[(F(P) - \hat{F})^2]$. Minimax risk $= \sup_{P \in \mathcal{M}_S} E_P[(F(P) - \hat{F})^2]$.

14 Our evaluation criterion: minimax decision theory. Denote by $\mathcal{M}_S$ the set of distributions with support size $S$. Maximum risk: $R_n^{\max}(F, \hat{F}) = \sup_{P \in \mathcal{M}_S} E_P[(F(P) - \hat{F})^2]$. Minimax risk $= \inf_{\hat{F}} \sup_{P \in \mathcal{M}_S} E_P[(F(P) - \hat{F})^2]$.

15 Our evaluation criterion: minimax decision theory. Denote by $\mathcal{M}_S$ the set of distributions with support size $S$. Maximum risk: $R_n^{\max}(F, \hat{F}) = \sup_{P \in \mathcal{M}_S} E_P[(F(P) - \hat{F})^2]$. Minimax risk $= \inf_{\hat{F}} \sup_{P \in \mathcal{M}_S} E_P[(F(P) - \hat{F})^2]$. Notation: $a_n \asymp b_n$ means $0 < c \le a_n/b_n \le C < \infty$; $a_n \gtrsim b_n$ means $a_n/b_n \ge C > 0$.

16 Our evaluation criterion: minimax decision theory. Denote by $\mathcal{M}_S$ the set of distributions with support size $S$. Maximum risk: $R_n^{\max}(F, \hat{F}) = \sup_{P \in \mathcal{M}_S} E_P[(F(P) - \hat{F})^2]$. Minimax risk $= \inf_{\hat{F}} \sup_{P \in \mathcal{M}_S} E_P[(F(P) - \hat{F})^2]$. Notation: $a_n \asymp b_n$ means $0 < c \le a_n/b_n \le C < \infty$; $a_n \gtrsim b_n$ means $a_n/b_n \ge C > 0$. $L_2$ risk = Bias² + Variance: $E_P[(F(P) - \hat{F})^2] = \big(F(P) - E_P \hat{F}\big)^2 + \mathrm{Var}(\hat{F})$.

17 Shall we analyze MLE non-asymptotically? Theorem (J., Venkat, Han, Weissman 14): $R_n^{\max}(H, H(P_n)) \asymp \underbrace{\frac{S^2}{n^2}}_{\text{Bias}^2} + \underbrace{\frac{\ln^2 S}{n}}_{\text{Variance}}$, for $n \gtrsim S$.

18 Shall we analyze MLE non-asymptotically? Theorem (J., Venkat, Han, Weissman 14): $R_n^{\max}(H, H(P_n)) \asymp \underbrace{\frac{S^2}{n^2}}_{\text{Bias}^2} + \underbrace{\frac{\ln^2 S}{n}}_{\text{Variance}}$, for $n \gtrsim S$. $n \gg S \Leftrightarrow$ consistency. The bias dominates if $n$ is not too large compared to $S$.

19 Can we reduce the bias? Taylor series? (Taylor expansion of $H(P_n)$ around $P_n = P$.) Requires $n \gg S$ (Paninski 03).

20 Can we reduce the bias? Taylor series? (Taylor expansion of $H(P_n)$ around $P_n = P$.) Requires $n \gg S$ (Paninski 03). Jackknife? Requires $n \gg S$ (Paninski 03).

21 Can we reduce the bias? Taylor series? (Taylor expansion of $H(P_n)$ around $P_n = P$.) Requires $n \gg S$ (Paninski 03). Jackknife? Requires $n \gg S$ (Paninski 03). Bootstrap? Requires at least $n \gtrsim S$ (Han, J., Weissman 15).

22 Can we reduce the bias? Taylor series? (Taylor expansion of $H(P_n)$ around $P_n = P$.) Requires $n \gg S$ (Paninski 03). Jackknife? Requires $n \gg S$ (Paninski 03). Bootstrap? Requires at least $n \gtrsim S$ (Han, J., Weissman 15). Bayes estimator under a Dirichlet prior? Requires at least $n \gtrsim S$ (Han, J., Weissman 15).

23 Can we reduce the bias? Taylor series? (Taylor expansion of $H(P_n)$ around $P_n = P$.) Requires $n \gg S$ (Paninski 03). Jackknife? Requires $n \gg S$ (Paninski 03). Bootstrap? Requires at least $n \gtrsim S$ (Han, J., Weissman 15). Bayes estimator under a Dirichlet prior? Requires at least $n \gtrsim S$ (Han, J., Weissman 15). Plug-in of the Dirichlet-smoothed distribution? Requires at least $n \gtrsim S$ (Han, J., Weissman 15).

24 There are many more... Coverage adjusted estimator (Chao, Shen 03; Vu, Yu, Kass 07); BUB estimator (Paninski 03); shrinkage estimator (Hausser, Strimmer 09); Grassberger estimator (Grassberger 08); NSB estimator (Nemenman, Shafee, Bialek 02); B-splines estimator (Daub et al. 04); Wagner, Viswanath, Kulkarni 11; Ohannessian, Tan, Dahleh.

25 There are many more... Coverage adjusted estimator (Chao, Shen 03; Vu, Yu, Kass 07); BUB estimator (Paninski 03); shrinkage estimator (Hausser, Strimmer 09); Grassberger estimator (Grassberger 08); NSB estimator (Nemenman, Shafee, Bialek 02); B-splines estimator (Daub et al. 04); Wagner, Viswanath, Kulkarni 11; Ohannessian, Tan, Dahleh. Recent breakthrough (Valiant and Valiant 11): the exact phase transition of entropy estimation is $n \asymp S/\ln S$.

26 There are many more... Coverage adjusted estimator (Chao, Shen 03; Vu, Yu, Kass 07); BUB estimator (Paninski 03); shrinkage estimator (Hausser, Strimmer 09); Grassberger estimator (Grassberger 08); NSB estimator (Nemenman, Shafee, Bialek 02); B-splines estimator (Daub et al. 04); Wagner, Viswanath, Kulkarni 11; Ohannessian, Tan, Dahleh. Recent breakthrough (Valiant and Valiant 11): the exact phase transition of entropy estimation is $n \asymp S/\ln S$. Their linear-programming-based estimator is not shown to achieve the minimax rates (dependence on $\epsilon$); another of their estimators achieves it for $n \gtrsim S^{1.03}/\ln S$. Not clear about other functionals.

27 Systematic improvement. Question: Can we find a systematic methodology to improve the MLE and achieve the minimax rates for a wide class of functionals? George Pólya: the more general problem may be easier to solve than the special problem.

28 Start from first principles. If $X \sim B(n, p)$ and we use $g(X)$ to estimate $f(p)$: $\mathrm{Bias}(g(X)) \triangleq f(p) - E_p[g(X)] = f(p) - \sum_{j=0}^{n} g(j) \binom{n}{j} p^j (1-p)^{n-j}$.

29 Start from first principles. If $X \sim B(n, p)$ and we use $g(X)$ to estimate $f(p)$: $\mathrm{Bias}(g(X)) \triangleq f(p) - E_p[g(X)] = f(p) - \sum_{j=0}^{n} g(j) \binom{n}{j} p^j (1-p)^{n-j}$. Only polynomials of degree at most $n$ can be estimated without bias: $E_p\!\left[\frac{X(X-1)\cdots(X-r+1)}{n(n-1)\cdots(n-r+1)}\right] = p^r$, $0 \le r \le n$. The bias corresponds to a polynomial approximation error.
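
The unbiasedness identity above is easy to verify numerically; the following sketch (the parameter values are illustrative) checks by Monte Carlo that the falling-factorial statistic is an unbiased estimate of $p^r$.

```python
import numpy as np

def falling_factorial_estimator(x, r, n):
    """Unbiased estimator of p**r from X ~ Binomial(n, p):
       X(X-1)...(X-r+1) / (n(n-1)...(n-r+1))."""
    x = np.asarray(x, dtype=float)
    out = np.ones_like(x)
    for k in range(r):
        out *= (x - k) / (n - k)
    return out

rng = np.random.default_rng(1)
n, p, r = 50, 0.3, 3
x = rng.binomial(n, p, size=200_000)
print("Monte Carlo mean:", falling_factorial_estimator(x, r, n).mean(), " p^r:", p**r)
```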

30 Minimize the maximum bias. What if we choose $g(X)$ such that $g^* = \arg\min_g \sup_{p \in [0,1]} |\mathrm{Bias}(g(X))|$?

31 Minimize the maximum bias. What if we choose $g(X)$ such that $g^* = \arg\min_g \sup_{p \in [0,1]} |\mathrm{Bias}(g(X))|$? Paninski 03: it does not work for entropy (the variance is too large!).

32 Minimize the maximum bias. What if we choose $g(X)$ such that $g^* = \arg\min_g \sup_{p \in [0,1]} |\mathrm{Bias}(g(X))|$? Paninski 03: it does not work for entropy (the variance is too large!). Best polynomial approximation: $P_k^*(x) \triangleq \arg\min_{P_k \in \mathrm{Poly}_k} \sup_{x \in D} |f(x) - P_k(x)|$.
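
For intuition, here is a small numpy sketch (not part of the slides) that builds a polynomial approximation of $f(x) = -x \ln x$ on an interval $[0, D]$. Chebyshev interpolation is used as a near-minimax stand-in for the exact best approximation $P_k^*$, which would be computed with the Remez exchange algorithm (e.g., via Chebfun); the value of $n$ is illustrative.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def near_best_poly(f, degree, lo, hi):
    """Chebyshev interpolant on [lo, hi]: a near-minimax stand-in for the
    exact best polynomial approximation of f."""
    nodes = 0.5 * (hi - lo) * C.chebpts1(degree + 1) + 0.5 * (hi + lo)
    return C.Chebyshev.fit(nodes, f(nodes), degree, domain=[lo, hi])

n = 10_000
f = lambda x: np.where(x > 0, -x * np.log(np.maximum(x, 1e-300)), 0.0)
D = np.log(n) / n                    # the nonsmooth regime [0, ln(n)/n]
k = int(np.log(n))                   # approximation order ~ ln n
P_k = near_best_poly(f, k, 0.0, D)

xs = np.linspace(0.0, D, 2001)
print("degree:", k, " max |f - P_k| on [0, D]:", np.abs(f(xs) - P_k(xs)).max())
```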

33 We only approximate when the problem is hard! [Figure: $f(p)$ as a function of $\hat{p}_i \in [0, 1]$; the regime $\hat{p}_i \lesssim \ln n / n$ is labeled nonsmooth, the regime $\hat{p}_i \gtrsim \ln n / n$ is labeled smooth.]

34 We only approximate when the problem is hard! [Figure: $f(p)$ as a function of $\hat{p}_i \in [0, 1]$; the regime $\hat{p}_i \lesssim \ln n / n$ is labeled nonsmooth, the regime $\hat{p}_i \gtrsim \ln n / n$ is labeled smooth.] In the nonsmooth regime: use an unbiased estimate of the best polynomial approximation of order $\ln n$.

35 We only approximate when the problem is hard! [Figure: $f(p)$ as a function of $\hat{p}_i \in [0, 1]$; the regime $\hat{p}_i \lesssim \ln n / n$ is labeled nonsmooth, the regime $\hat{p}_i \gtrsim \ln n / n$ is labeled smooth.] In the nonsmooth regime: use an unbiased estimate of the best polynomial approximation of order $\ln n$. In the smooth regime: use the bias-corrected plug-in $f(\hat{p}_i) - \frac{f''(\hat{p}_i)\,\hat{p}_i(1-\hat{p}_i)}{2n}$.
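
Putting the pieces together, here is a deliberately simplified sketch of the two-regime construction outlined on these slides. It is not the authors' actual estimator: the threshold and order constants are illustrative, and a plain least-squares fit (with no constant term, so unseen symbols contribute zero and $S$ need not be known) stands in for the exact best polynomial approximation.

```python
import numpy as np

def falling(x, r, n):
    """Unbiased estimate of p**r from a count x ~ Binomial(n, p)."""
    out = 1.0
    for k in range(r):
        out *= (x - k) / (n - k)
    return out

def entropy_two_regime(counts, n, c1=1.0, c2=1.2):
    """Simplified sketch of the two-regime construction (illustrative constants;
    a least-squares fit stands in for the exact best polynomial approximation)."""
    thresh = c1 * np.log(n) / n               # smooth / nonsmooth boundary
    K = max(2, int(c2 * np.log(n)))           # approximation order ~ ln n
    L = 2.0 * thresh                          # approximate f(p) = -p ln p on [0, L]

    # Fit g(t) = f(L*t) on t in (0, 1] by a polynomial with no constant term,
    # so unseen symbols contribute zero and S need not be known.
    t = 0.5 * (1.0 - np.cos(np.linspace(0.0, np.pi, 400)))[1:]
    g = -L * t * np.log(L * t)
    design = t[:, None] ** np.arange(1, K + 1)
    b = np.linalg.lstsq(design, g, rcond=None)[0]    # g(t) ~ sum_r b[r-1] t^r

    est = 0.0
    for x in counts[counts > 0]:
        p_hat = x / n
        if p_hat > thresh:
            # smooth regime: bias-corrected plug-in with f''(p) = -1/p
            est += -p_hat * np.log(p_hat) + (1.0 - p_hat) / (2.0 * n)
        else:
            # nonsmooth regime: unbiased estimate of the approximating polynomial,
            # term by term, using E[falling(X, r, n)] = p^r
            est += sum(b[r - 1] * falling(x, r, n) / L**r for r in range(1, K + 1))
    return est

# Rough illustration on a uniform distribution with n < S (values arbitrary).
rng = np.random.default_rng(0)
S, n, trials = 5000, 2000, 20
two_regime, plug_in = [], []
for _ in range(trials):
    counts = np.bincount(rng.integers(S, size=n), minlength=S)
    two_regime.append(entropy_two_regime(counts, n))
    q = counts[counts > 0] / n
    plug_in.append(-np.sum(q * np.log(q)))
print("truth:", np.log(S), " two-regime:", np.mean(two_regime),
      " plug-in:", np.mean(plug_in))
```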

36 Why the $\ln n / n$ threshold and the $\ln n$ approximation order?

37 Approximation theory in functional estimation: Lepski, Nemirovski, Spokoiny 99; Cai and Low 11; Wu and Yang 14 (entropy estimation lower bound). Duality with shrinkage (J., Venkat, Han, Weissman 14): a new field to be explored!

38 A class of functionals. Note $F_\alpha(P) = \sum_{i=1}^{S} p_i^\alpha$, $\alpha > 0$. [J., Venkat, Han, Weissman 14] showed the following $L_2$ rates of the MLE: $H(P)$: $\frac{S^2}{n^2} + \frac{\ln^2 S}{n}$; $F_\alpha(P)$, $0 < \alpha \le \frac12$: $\frac{S^2}{n^{2\alpha}}$; $F_\alpha(P)$, $\frac12 < \alpha < 1$: $\frac{S^2}{n^{2\alpha}} + \frac{S^{2-2\alpha}}{n}$; $F_\alpha(P)$, $1 < \alpha < \frac32$: $n^{-2(\alpha-1)}$; $F_\alpha(P)$, $\alpha \ge \frac32$: $n^{-1}$.

39 A class of functionals. Note $F_\alpha(P) = \sum_{i=1}^{S} p_i^\alpha$, $\alpha > 0$. [J., Venkat, Han, Weissman 14] showed the following minimax $L_2$ rates versus the $L_2$ rates of the MLE: $H(P)$: minimax $\frac{S^2}{(n\ln n)^2} + \frac{\ln^2 S}{n}$, MLE $\frac{S^2}{n^2} + \frac{\ln^2 S}{n}$; $F_\alpha(P)$, $0 < \alpha \le \frac12$: minimax $\frac{S^2}{(n\ln n)^{2\alpha}}$, MLE $\frac{S^2}{n^{2\alpha}}$; $F_\alpha(P)$, $\frac12 < \alpha < 1$: minimax $\frac{S^2}{(n\ln n)^{2\alpha}} + \frac{S^{2-2\alpha}}{n}$, MLE $\frac{S^2}{n^{2\alpha}} + \frac{S^{2-2\alpha}}{n}$; $F_\alpha(P)$, $1 < \alpha < \frac32$: minimax $(n\ln n)^{-2(\alpha-1)}$, MLE $n^{-2(\alpha-1)}$; $F_\alpha(P)$, $\alpha \ge \frac32$: minimax $n^{-1}$, MLE $n^{-1}$.

40 A class of functionals. Note $F_\alpha(P) = \sum_{i=1}^{S} p_i^\alpha$, $\alpha > 0$. [J., Venkat, Han, Weissman 14] showed the following minimax $L_2$ rates versus the $L_2$ rates of the MLE: $H(P)$: minimax $\frac{S^2}{(n\ln n)^2} + \frac{\ln^2 S}{n}$, MLE $\frac{S^2}{n^2} + \frac{\ln^2 S}{n}$; $F_\alpha(P)$, $0 < \alpha \le \frac12$: minimax $\frac{S^2}{(n\ln n)^{2\alpha}}$, MLE $\frac{S^2}{n^{2\alpha}}$; $F_\alpha(P)$, $\frac12 < \alpha < 1$: minimax $\frac{S^2}{(n\ln n)^{2\alpha}} + \frac{S^{2-2\alpha}}{n}$, MLE $\frac{S^2}{n^{2\alpha}} + \frac{S^{2-2\alpha}}{n}$; $F_\alpha(P)$, $1 < \alpha < \frac32$: minimax $(n\ln n)^{-2(\alpha-1)}$, MLE $n^{-2(\alpha-1)}$; $F_\alpha(P)$, $\alpha \ge \frac32$: minimax $n^{-1}$, MLE $n^{-1}$. Effective sample size enlargement: minimax rate-optimal with $n$ samples $\Longleftrightarrow$ MLE with $n \ln n$ samples.

41 Sample complexity perspectives. $H(P)$: MLE $\Theta(S)$, optimal $\Theta(S/\ln S)$; $F_\alpha(P)$, $0 < \alpha < 1$: MLE $\Theta(S^{1/\alpha})$, optimal $\Theta(S^{1/\alpha}/\ln S)$; $F_\alpha(P)$, $\alpha > 1$: MLE $\Theta(1)$, optimal $\Theta(1)$. Existing literature on $F_\alpha(P)$ for $0 < \alpha < 2$: minimax rates unknown, sample complexity unknown, optimal schemes unknown...

42 Unique features of our entropy estimator. One realization of a general methodology.

43 Unique features of our entropy estimator. One realization of a general methodology. Achieves the minimax rate (lower bound due to Wu and Yang 14): $\frac{S^2}{(n\ln n)^2} + \frac{\ln^2 S}{n}$.

44 Unique features of our entropy estimator. One realization of a general methodology. Achieves the minimax rate (lower bound due to Wu and Yang 14): $\frac{S^2}{(n\ln n)^2} + \frac{\ln^2 S}{n}$. Agnostic to the knowledge of $S$.

45 Unique features of our entropy estimator. Easily implementable: approximation order $\ln n$; a 500th-order approximation takes 0.46 s on a PC (Chebfun). Empirical performance outperforms every available entropy estimator. (Code to be released this week, available upon request.)

46 Unique features of our entropy estimator. Easily implementable: approximation order $\ln n$; a 500th-order approximation takes 0.46 s on a PC (Chebfun). Empirical performance outperforms every available entropy estimator. (Code to be released this week, available upon request.) Some statisticians raised interesting questions: we may not use this estimator unless you prove it is adaptive.

47 Adaptive estimation. Near-minimax estimator: $\sup_{P \in \mathcal{M}_S} E_P\big(\hat{H}^{\mathrm{Our}} - H(P)\big)^2 \asymp \inf_{\hat{H}} \sup_{P \in \mathcal{M}_S} E_P\big(\hat{H} - H(P)\big)^2$.

48 Adaptive estimation. Near-minimax estimator: $\sup_{P \in \mathcal{M}_S} E_P\big(\hat{H}^{\mathrm{Our}} - H(P)\big)^2 \asymp \inf_{\hat{H}} \sup_{P \in \mathcal{M}_S} E_P\big(\hat{H} - H(P)\big)^2$. What if we know a priori that $H(P) \le H$? We want $\sup_{P \in \mathcal{M}_S(H)} E_P\big(\hat{H}^{\mathrm{Our}} - H(P)\big)^2 \asymp \inf_{\hat{H}} \sup_{P \in \mathcal{M}_S(H)} E_P\big(\hat{H} - H(P)\big)^2$, where $\mathcal{M}_S(H) = \{P : H(P) \le H,\; P \in \mathcal{M}_S\}$.

49 Adaptive estimation. Near-minimax estimator: $\sup_{P \in \mathcal{M}_S} E_P\big(\hat{H}^{\mathrm{Our}} - H(P)\big)^2 \asymp \inf_{\hat{H}} \sup_{P \in \mathcal{M}_S} E_P\big(\hat{H} - H(P)\big)^2$. What if we know a priori that $H(P) \le H$? We want $\sup_{P \in \mathcal{M}_S(H)} E_P\big(\hat{H}^{\mathrm{Our}} - H(P)\big)^2 \asymp \inf_{\hat{H}} \sup_{P \in \mathcal{M}_S(H)} E_P\big(\hat{H} - H(P)\big)^2$, where $\mathcal{M}_S(H) = \{P : H(P) \le H,\; P \in \mathcal{M}_S\}$. Can our estimator satisfy all these requirements without knowing $S$ and $H$?

50 Our estimator is adaptive! Han, J., Weissman 15 showed: Theorem. $\inf_{\hat{H}} \sup_{P \in \mathcal{M}_S(H)} E_P\,|\hat{H} - H(P)|^2 \asymp \frac{S^2}{(n\ln n)^2} + \frac{H\ln S}{n}$ if $\frac{S}{\ln S} \le \frac{e\,nH}{\ln n}$, and $\asymp \big[\frac{H}{\ln S}\,\ln\!\big(\frac{S/\ln S}{nH/\ln n}\big)\big]^2$ otherwise.

51 Our estimator is adaptive! Han, J., Weissman 15 showed: Theorem. $\inf_{\hat{H}} \sup_{P \in \mathcal{M}_S(H)} E_P\,|\hat{H} - H(P)|^2 \asymp \frac{S^2}{(n\ln n)^2} + \frac{H\ln S}{n}$ if $\frac{S}{\ln S} \le \frac{e\,nH}{\ln n}$, and $\asymp \big[\frac{H}{\ln S}\,\ln\!\big(\frac{S/\ln S}{nH/\ln n}\big)\big]^2$ otherwise. For $\epsilon > \frac{H}{\ln S}$, it requires $n \gtrsim \frac{S^{1-\epsilon/H}}{H}$ samples (much smaller than $S/\ln S$ for small $H$!) to achieve root MSE $\epsilon$.

52 Our estimator is adaptive! Han, J., Weissman 15 showed: Theorem. $\inf_{\hat{H}} \sup_{P \in \mathcal{M}_S(H)} E_P\,|\hat{H} - H(P)|^2 \asymp \frac{S^2}{(n\ln n)^2} + \frac{H\ln S}{n}$ if $\frac{S}{\ln S} \le \frac{e\,nH}{\ln n}$, and $\asymp \big[\frac{H}{\ln S}\,\ln\!\big(\frac{S/\ln S}{nH/\ln n}\big)\big]^2$ otherwise. For $\epsilon > \frac{H}{\ln S}$, it requires $n \gtrsim \frac{S^{1-\epsilon/H}}{H}$ samples (much smaller than $S/\ln S$ for small $H$!) to achieve root MSE $\epsilon$. The MLE precisely requires $n \asymp \frac{\ln S}{H}\,S^{1-\epsilon/H}$.

53 Our estimator is adaptive! Han, J., Weissman 15 showed: Theorem. $\inf_{\hat{H}} \sup_{P \in \mathcal{M}_S(H)} E_P\,|\hat{H} - H(P)|^2 \asymp \frac{S^2}{(n\ln n)^2} + \frac{H\ln S}{n}$ if $\frac{S}{\ln S} \le \frac{e\,nH}{\ln n}$, and $\asymp \big[\frac{H}{\ln S}\,\ln\!\big(\frac{S/\ln S}{nH/\ln n}\big)\big]^2$ otherwise. For $\epsilon > \frac{H}{\ln S}$, it requires $n \gtrsim \frac{S^{1-\epsilon/H}}{H}$ samples (much smaller than $S/\ln S$ for small $H$!) to achieve root MSE $\epsilon$. The MLE precisely requires $n \asymp \frac{\ln S}{H}\,S^{1-\epsilon/H}$. The $n \to n\ln n$ phenomenon is still true!

54 Understanding the generality. Can you deal with cases where the non-differentiable regime is not just a point? We consider $L_1(P, Q) = \sum_{i=1}^{S} |p_i - q_i|$. Breakthrough by Valiant and Valiant 11: the MLE requires $n \asymp S$ samples, while the optimal is $\frac{S}{\ln S}$! However, VV 11's construction is only proven optimal when $\frac{S}{\ln S} \lesssim n \lesssim S$.

55 Applying our general methodology. The minimax $L_2$ rate is $\frac{S}{n\ln n}$, while the MLE achieves $\frac{S}{n}$. The $n \to n\ln n$ phenomenon is still true! We show that VV 11 cannot achieve it when $n \gg S$.

56 Applying our general methodology. The minimax $L_2$ rate is $\frac{S}{n\ln n}$, while the MLE achieves $\frac{S}{n}$. The $n \to n\ln n$ phenomenon is still true! We show that VV 11 cannot achieve it when $n \gg S$. However, is that easy to do? The nonsmooth regime is a whole line! How to cover it? Use a small square $[0, \frac{\ln n}{n}]^2$? Use a band? Use some other shape?

57 Bad news. Best polynomial approximation in 2D is not unique. There exists no efficient algorithm to compute it. Some polynomials that achieve the best approximation rate cannot be used in our methodology. All nonsmooth-regime design methods mentioned before fail.

58 Bad news. Best polynomial approximation in 2D is not unique. There exists no efficient algorithm to compute it. Some polynomials that achieve the best approximation rate cannot be used in our methodology. All nonsmooth-regime design methods mentioned before fail. Solution: hints from our paper Minimax estimation of functionals of discrete distributions, or wait for the soon-to-be-finished paper Minimax estimation of divergence functions.

59 Significant difference in phase transitions. One may think: who cares about a $\ln n$ improvement? Take the sequence $n = 2S/\ln S$, with $S$ equally spaced (on a log scale) from $10^2$ to ... For each $(n, S)$, sample 20 times from a uniform distribution. The scale of $S/n$: $S/n \in [2.2727, \ldots]$.

60 Significant improvement! [Plot: MSE versus $S/n$ for the JVHW estimator and the MLE.]

61 Mutual Information $I(P_{XY})$. Mutual information: $I(P_{XY}) = H(P_X) + H(P_Y) - H(P_{XY})$. Theorem (J., Venkat, Han, Weissman 14): the $n \to n\ln n$ phenomenon also holds.
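
A small sketch (not from the slides) of estimating mutual information through the identity above: estimate three entropies from the marginal and joint counts. Plugging in empirical entropies gives the MLE; swapping in a better entropy estimator is the kind of improvement the theorem refers to. The toy data are arbitrary.

```python
import numpy as np

def entropy_from_counts(counts):
    """Plug-in entropy of a count vector (replace with a better estimator)."""
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log(p))

def mutual_information(x, y, entropy_fn=entropy_from_counts):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), each entropy estimated by entropy_fn."""
    x, y = np.asarray(x), np.asarray(y)
    _, cx = np.unique(x, return_counts=True)
    _, cy = np.unique(y, return_counts=True)
    _, cxy = np.unique(np.stack([x, y], axis=1), axis=0, return_counts=True)
    return entropy_fn(cx) + entropy_fn(cy) - entropy_fn(cxy)

rng = np.random.default_rng(0)
x = rng.integers(8, size=10_000)
print("I(X;X) ~", mutual_information(x, x), " (truth ln 8 =", np.log(8), ")")
print("I(X;Y), Y independent ~", mutual_information(x, rng.integers(8, size=10_000)))
```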

62 Learning Tree Graphical Models. A $d$-dimensional random vector with alphabet size $S$: $X = (X_1, X_2, \ldots, X_d)$. Suppose the p.m.f. has a tree structure: $P_X = \prod_{i=1}^{d} P_{X_i \mid X_{\pi(i)}}$.

63 Learning Tree Graphical Models. A $d$-dimensional random vector with alphabet size $S$: $X = (X_1, X_2, \ldots, X_d)$. Suppose the p.m.f. has a tree structure: $P_X = \prod_{i=1}^{d} P_{X_i \mid X_{\pi(i)}}$. [Figure: an approximating dependence tree on $X_1, \ldots, X_6$; in this example $P_T = P_{X_6}\,P_{X_3|X_6}\,P_{X_1|X_6}\,P_{X_4|X_3}\,P_{X_2|X_3}\,P_{X_5|X_2}$.]

64 Learning Tree Graphical Models. A $d$-dimensional random vector with alphabet size $S$: $X = (X_1, X_2, \ldots, X_d)$. Suppose the p.m.f. has a tree structure: $P_X = \prod_{i=1}^{d} P_{X_i \mid X_{\pi(i)}}$. [Figure: an approximating dependence tree on $X_1, \ldots, X_6$; in this example $P_T = P_{X_6}\,P_{X_3|X_6}\,P_{X_1|X_6}\,P_{X_4|X_3}\,P_{X_2|X_3}\,P_{X_5|X_2}$.] Given $n$ i.i.d. samples from $P_X$, estimate the underlying tree structure.

65 Chow–Liu algorithm (1968). Natural approach: maximum likelihood! Chow and Liu 68 solved it (2000 citations): a Maximum Weight Spanning Tree (MWST) with empirical mutual information as edge weights. Theorem (Chow, Liu 68): $E_{\mathrm{MLE}} = \arg\max_{E_Q:\, Q \text{ is a tree}} \sum_{e \in E_Q} I(\hat{P}_e)$.

66 Chow–Liu algorithm (1968). Natural approach: maximum likelihood! Chow and Liu 68 solved it (2000 citations): a Maximum Weight Spanning Tree (MWST) with empirical mutual information as edge weights. Theorem (Chow, Liu 68): $E_{\mathrm{MLE}} = \arg\max_{E_Q:\, Q \text{ is a tree}} \sum_{e \in E_Q} I(\hat{P}_e)$. Our approach: replace the empirical mutual information with better estimates!
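
A minimal sketch of the Chow–Liu procedure (toy data; plug-in mutual information as the default edge weight): compute pairwise mutual-information estimates and take a maximum weight spanning tree. The modification on this slide amounts to passing a better mutual-information estimator as `mi_estimator`.

```python
import numpy as np
from itertools import combinations

def empirical_mi(x, y):
    """Plug-in mutual information (swap in a minimax-optimal estimate for the modified CL)."""
    def H(counts):
        p = counts[counts > 0] / counts.sum()
        return -np.sum(p * np.log(p))
    _, cx = np.unique(x, return_counts=True)
    _, cy = np.unique(y, return_counts=True)
    _, cxy = np.unique(np.stack([x, y], axis=1), axis=0, return_counts=True)
    return H(cx) + H(cy) - H(cxy)

def chow_liu_tree(data, mi_estimator=empirical_mi):
    """Maximum weight spanning tree (Kruskal) over pairwise MI estimates.
    data: (n, d) array of discrete observations.  Returns the tree edges."""
    d = data.shape[1]
    weights = sorted(((mi_estimator(data[:, i], data[:, j]), i, j)
                      for i, j in combinations(range(d), 2)), reverse=True)
    parent = list(range(d))                       # union-find over the d nodes
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    edges = []
    for _, i, j in weights:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            edges.append((i, j))
    return edges

# Toy star graph: X_0 is the hub, X_1..X_3 are noisy copies of X_0.
rng = np.random.default_rng(0)
n, S = 5000, 10
x0 = rng.integers(S, size=n)
data = np.stack([x0] + [np.where(rng.random(n) < 0.7, x0, rng.integers(S, size=n))
                        for _ in range(3)], axis=1)
print(chow_liu_tree(data))       # expect every edge to touch node 0
```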

67 Original CL vs. Modified CL. We set $d = 7$, $S = 300$, a star graph, and sweep $n$ from 1k to 55k. [Plot: expected wrong-edges ratio versus the number of samples $n$ for the original CL.]

68 8-fold improvement! [and even more] We set $d = 7$, $S = 300$, a star graph, and sweep $n$ from 1k to 55k. [Plot: expected wrong-edges ratio versus the number of samples $n$ for the original CL and the modified CL.]

69 What we did not talk about: how to analyze the bias of the MLE; why bootstrap and jackknife fail; extensions to multivariate settings ($\ell_p$ distance); extensions to nonparametric estimation; extensions to non-i.i.d. models; applications in machine learning, medical imaging, computer vision, genomics, etc.

70 Hindsight. Unbiased estimation: the approximation basis must satisfy the unbiasedness equation (Kolmogorov 50); the basis should have good approximation properties; the additional variance incurred by the approximation should not be large.

71 Hindsight. Unbiased estimation: the approximation basis must satisfy the unbiasedness equation (Kolmogorov 50); the basis should have good approximation properties; the additional variance incurred by the approximation should not be large. Theory of functional estimation: analytical properties of functions; the whole field of approximation theory, or even more.

72 Related work. O. Lepski, A. Nemirovski, and V. Spokoiny 99, On estimation of the $L_r$ norm of a regression function. T. Cai and M. Low 11, Testing composite hypotheses, Hermite polynomials and optimal estimation of a nonsmooth functional. Y. Wu and P. Yang 14, Minimax rates of entropy estimation on large alphabets via best polynomial approximation. J. Acharya, A. Orlitsky, A. T. Suresh, and H. Tyagi 14, The complexity of estimating Rényi entropy.

73 Our work. J. Jiao, K. Venkat, Y. Han, T. Weissman, Minimax estimation of functionals of discrete distributions, to appear in IEEE Transactions on Information Theory. J. Jiao, K. Venkat, Y. Han, T. Weissman, Maximum likelihood estimation of functionals of discrete distributions, available on arXiv. J. Jiao, K. Venkat, Y. Han, T. Weissman, Beyond maximum likelihood: from theory to practice, available on arXiv. J. Jiao, Y. Han, T. Weissman, Minimax estimation of divergence functions, in preparation. Y. Han, J. Jiao, T. Weissman, Adaptive estimation of Shannon entropy, available on arXiv. Y. Han, J. Jiao, T. Weissman, Does Dirichlet prior smoothing solve the Shannon entropy estimation problem?, available on arXiv. Y. Han, J. Jiao, T. Weissman, Bias correction using Taylor series, Bootstrap, and Jackknife, in preparation. Y. Han, J. Jiao, T. Weissman, How to bound the gap of Jensen's inequality?, in preparation. Thank you!
