Information Measure Estimation and Applications: Boosting the Effective Sample Size from n to n ln n
Jiantao Jiao (Stanford EE)

Joint work with: Kartik Venkat (Stanford EE), Yanjun Han (Tsinghua EE), Tsachy Weissman (Stanford EE)

Information Theory, Learning, and Big Data Workshop, March 17th, 2015
Problem Setting

Observe n samples from a discrete distribution P with support size S. We want to estimate:

- H(P) = −∑_{i=1}^S p_i ln p_i (Shannon 48)
- F_α(P) = ∑_{i=1}^S p_i^α, α > 0 (diversity, Simpson index, Rényi entropy, etc.)
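For concreteness, here is how these two functionals evaluate on a known distribution. This is a minimal Python sketch of the definitions above; the function names are ours, not from the talk.

```python
import math

def shannon_entropy(p):
    """H(P) = -sum_i p_i ln p_i, in nats; symbols with p_i = 0 contribute 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def power_sum(p, alpha):
    """F_alpha(P) = sum_i p_i^alpha (alpha = 2 gives the Simpson index)."""
    return sum(pi ** alpha for pi in p if pi > 0)

# Uniform distribution on S symbols: H = ln S and F_alpha = S^(1 - alpha).
S = 8
uniform = [1.0 / S] * S
print(shannon_entropy(uniform))   # ln 8 ≈ 2.079
print(power_sum(uniform, 2.0))    # 8^(1-2) = 0.125
```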
What was known?

Start with entropy: H(P) = −∑_{i=1}^S p_i ln p_i.

Question: What is the optimal estimator for H(P) given n samples?

- There is no unbiased estimator, and the minimax estimator cannot be computed in closed form (Lehmann and Casella 98).
- Classical asymptotics: what is the optimal estimator for H(P) as n → ∞?
Classical problem

Empirical entropy: H(P_n) = −∑_{i=1}^S p̂_i ln p̂_i, where p̂_i is the empirical frequency of symbol i. H(P_n) is the maximum likelihood estimator (MLE).

Theorem: The MLE H(P_n) is asymptotically efficient. (Hájek–Le Cam theory)
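The empirical (plug-in) entropy is a one-liner from the symbol counts; a minimal sketch, with our own function name:

```python
import math
from collections import Counter

def empirical_entropy(samples):
    """Plug-in (MLE) estimate: H(P_n) = -sum_i phat_i ln phat_i."""
    n = len(samples)
    counts = Counter(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

print(empirical_entropy(["a", "a", "b", "b"]))  # ln 2 ≈ 0.693
```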
Is there something missing?

The optimal estimator for finite samples is unknown.

Question: How about using the MLE when n is finite?

We will show that this is a very bad idea in general.
Our evaluation criterion: minimax decision theory

Denote by M_S the set of distributions with support size S.

Maximum risk: R_n^max(F, F̂) = sup_{P ∈ M_S} E_P[(F(P) − F̂)²]

Minimax risk: inf_{F̂} sup_{P ∈ M_S} E_P[(F(P) − F̂)²]

Notation:
- a_n ≍ b_n: 0 < c ≤ a_n/b_n ≤ C < ∞
- a_n ≳ b_n: a_n/b_n ≥ C > 0

L² risk = bias² + variance:
E_P[(F(P) − F̂)²] = (F(P) − E_P F̂)² + Var(F̂)
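The bias²/variance split above is easy to see numerically. Below is a small Monte Carlo sketch (our own construction, not from the talk) that estimates both terms for the plug-in entropy estimator on a fixed P:

```python
import math
import random
from collections import Counter

def plug_in_entropy(samples):
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in Counter(samples).values())

def risk_decomposition(p, n, trials=2000, seed=0):
    """Monte Carlo estimate of bias^2 and variance of the plug-in estimator."""
    rng = random.Random(seed)
    symbols = list(range(len(p)))
    true_h = -sum(pi * math.log(pi) for pi in p if pi > 0)
    estimates = [plug_in_entropy(rng.choices(symbols, weights=p, k=n))
                 for _ in range(trials)]
    mean = sum(estimates) / trials
    var = sum((e - mean) ** 2 for e in estimates) / trials
    return (mean - true_h) ** 2, var

bias2, var = risk_decomposition([0.5, 0.25, 0.125, 0.125], n=50)
print(bias2, var)  # L2 risk ≈ bias^2 + variance
```

By Jensen's inequality the plug-in estimate underestimates H(P) on average, so the bias term here is genuinely nonzero.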
Shall we analyze the MLE non-asymptotically?

Theorem (J., Venkat, Han, Weissman 14): For n ≳ S,

R_n^max(H, H(P_n)) ≍ S²/n² [bias²] + ln²S/n [variance]

- n ≫ S ⟺ consistency
- The bias dominates if n is not too large compared to S.
Can we reduce the bias?

- Taylor series? (Taylor expansion of H(P_n) around P_n = P) Requires n ≫ S. (Paninski 03)
- Jackknife? Requires n ≫ S. (Paninski 03)
- Bootstrap? Requires at least n ≳ S. (Han, J., Weissman 15)
- Bayes estimator under a Dirichlet prior? Requires at least n ≳ S. (Han, J., Weissman 15)
- Plug-in of a Dirichlet-smoothed distribution? Requires at least n ≳ S. (Han, J., Weissman 15)
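As a concrete instance of one of these classical corrections, here is a sketch of the jackknife bias-corrected entropy estimator (our own implementation; the leave-one-out estimates are grouped by the symbol removed, weighted by its count):

```python
import math
from collections import Counter

def plug_in_entropy_counts(counts, n):
    return -sum((c / n) * math.log(c / n) for c in counts.values() if c > 0)

def jackknife_entropy(samples):
    """Jackknife correction: H_JK = n*H_n - (n-1) * mean of leave-one-out H_{n-1}."""
    n = len(samples)
    counts = Counter(samples)
    h_n = plug_in_entropy_counts(counts, n)
    loo_mean = 0.0
    for sym, c in counts.items():
        counts[sym] = c - 1            # remove one occurrence of sym
        loo_mean += (c / n) * plug_in_entropy_counts(counts, n - 1)
        counts[sym] = c                # restore
    return n * h_n - (n - 1) * loo_mean

print(jackknife_entropy(["a", "a", "b", "b"]))  # larger than the plug-in ln 2 ≈ 0.693
```

It removes the leading 1/n bias term, but as the slide above notes, it still needs n ≫ S to be consistent.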
There are many more...

- Coverage-adjusted estimator (Chao, Shen 03; Vu, Yu, Kass 07)
- BUB estimator (Paninski 03)
- Shrinkage estimator (Hausser, Strimmer 09)
- Grassberger estimator (Grassberger 08)
- NSB estimator (Nemenman, Shafee, Bialek 02)
- B-splines estimator (Daub et al. 04)
- Wagner, Viswanath, Kulkarni 11
- Ohannessian, Tan, Dahleh

Recent breakthrough (Valiant and Valiant 11): the exact phase transition of entropy estimation is n ≍ S/ln S.

- Their linear-programming-based estimator is not shown to achieve the minimax rates (in its dependence on ε).
- Another of their estimators achieves it, but only for n ≳ S^{1.03}/ln S.
- It is not clear how to handle other functionals.
Systematic improvement

Question: Can we find a systematic methodology to improve the MLE, and achieve the minimax rates for a wide class of functionals?

George Pólya: the more general problem may be easier to solve than the special problem.
Start from first principles

If X ∼ B(n, p) and we use g(X) to estimate f(p):

Bias(g(X)) ≜ f(p) − E_p g(X) = f(p) − ∑_{j=0}^n g(j) (n choose j) p^j (1−p)^{n−j}

- Only polynomials of order ≤ n can be estimated without bias:

  E_p[ X(X−1)⋯(X−r+1) / (n(n−1)⋯(n−r+1)) ] = p^r, 0 ≤ r ≤ n

- Bias corresponds to polynomial approximation error.
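The unbiasedness of the falling-factorial estimator can be checked exactly by summing against the binomial pmf; a small sketch (our own helper names):

```python
from math import comb

def expected_falling_factorial_ratio(n, p, r):
    """Exact E_p[ X(X-1)...(X-r+1) / (n(n-1)...(n-r+1)) ] for X ~ Binomial(n, p)."""
    def falling(x, r):
        out = 1
        for k in range(r):
            out *= (x - k)
        return out
    total = 0.0
    for j in range(n + 1):
        pmf = comb(n, j) * p**j * (1 - p)**(n - j)
        total += pmf * falling(j, r) / falling(n, r)
    return total

# Unbiased for p^r whenever r <= n:
print(expected_falling_factorial_ratio(10, 0.3, 3))  # 0.3**3 = 0.027 up to float error
```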
Minimize the maximum bias

What if we choose g(X) such that

g* = argmin_g sup_{p ∈ [0,1]} |Bias(g(X))|?

Paninski 03: it does not work for entropy. (The variance is too large!)

Best polynomial approximation:

P_k*(x) ≜ argmin_{P_k ∈ Poly_k} sup_{x ∈ D} |f(x) − P_k(x)|
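True best (minimax) polynomial approximation is computed with the Remez algorithm; as a lightweight, near-minimax stand-in, Chebyshev interpolation is easy to sketch from scratch. The example below (our own construction, not the talk's code) approximates f(x) = −x ln x, the per-symbol entropy function:

```python
import math

def chebyshev_coeffs(f, degree, a=0.0, b=1.0):
    """Chebyshev interpolation coefficients for f on [a, b]: a near-minimax
    stand-in for the true best polynomial approximation (Remez)."""
    N = degree + 1
    nodes = [math.cos(math.pi * (j + 0.5) / N) for j in range(N)]
    fvals = [f(0.5 * (b - a) * t + 0.5 * (b + a)) for t in nodes]
    coeffs = []
    for k in range(N):
        c = (2.0 / N) * sum(fvals[j] * math.cos(math.pi * k * (j + 0.5) / N)
                            for j in range(N))
        coeffs.append(c)
    coeffs[0] /= 2.0
    return coeffs

def chebyshev_eval(coeffs, x, a=0.0, b=1.0):
    t = (2.0 * x - (a + b)) / (b - a)
    t = max(-1.0, min(1.0, t))          # guard against float round-off
    return sum(c * math.cos(k * math.acos(t)) for k, c in enumerate(coeffs))

f = lambda x: -x * math.log(x) if x > 0 else 0.0
coeffs = chebyshev_coeffs(f, degree=30)
max_err = max(abs(f(x / 1000) - chebyshev_eval(coeffs, x / 1000))
              for x in range(1001))
print(max_err)  # small, but decays only polynomially in the degree: f is not smooth at 0
```

The slow decay of the error with the degree, caused by the non-smoothness of −x ln x at 0, is exactly why the approximation order and the regime threshold have to be tuned carefully.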
We only approximate when the problem is hard!

For each symbol, split [0, 1] at the threshold ln n / n:

- Nonsmooth regime (p̂_i below ln n / n): use an unbiased estimate of the best polynomial approximation of f of order ln n.
- Smooth regime (p̂_i above ln n / n): use the bias-corrected plug-in

  f(p̂_i) − f''(p̂_i) p̂_i (1 − p̂_i) / (2n)
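For f(p) = −p ln p we have f''(p) = −1/p, so the smooth-regime correction reduces to adding (1 − p̂_i)/(2n) per symbol. The sketch below implements only that smooth-regime piece; the nonsmooth regime, which needs the best-polynomial-approximation machinery, is replaced by a plain plug-in placeholder, so this is not the full JVHW estimator:

```python
import math
from collections import Counter

def entropy_smooth_regime_sketch(samples):
    """Partial sketch of the two-regime entropy estimator.
    Smooth regime: f(phat) - f''(phat) phat (1 - phat) / (2n) with f(p) = -p ln p,
    i.e. -phat ln phat + (1 - phat) / (2n).
    Nonsmooth regime: placeholder plug-in term (the real estimator uses an
    unbiased estimate of a degree-ln(n) best polynomial approximation here)."""
    n = len(samples)
    threshold = math.log(n) / n
    est = 0.0
    for c in Counter(samples).values():
        phat = c / n
        if phat > threshold:                     # smooth regime
            est += -phat * math.log(phat) + (1 - phat) / (2 * n)
        else:                                    # nonsmooth regime: placeholder
            est += -phat * math.log(phat)
    return est

print(entropy_smooth_regime_sketch(["a", "a", "b", "b"]))
```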
Why the ln n / n threshold, and why approximation order ln n?
Approximation theory in functional estimation

- Lepski, Nemirovski, Spokoiny 99; Cai and Low 11; Wu and Yang 14 (entropy estimation lower bound)
- Duality with shrinkage (J., Venkat, Han, Weissman 14): a new field to be explored!
A class of functionals

Recall F_α(P) = ∑_{i=1}^S p_i^α, α > 0. [J., Venkat, Han, Weissman 14] showed:

Functional               Minimax L² rate                   L² rate of MLE
H(P)                     S²/(n ln n)² + ln²S/n             S²/n² + ln²S/n
F_α(P), 0 < α ≤ 1/2      S²/(n ln n)^{2α}                  S²/n^{2α}
F_α(P), 1/2 < α < 1      S²/(n ln n)^{2α} + S^{2−2α}/n     S²/n^{2α} + S^{2−2α}/n
F_α(P), 1 < α < 3/2      (n ln n)^{−2(α−1)}                n^{−2(α−1)}
F_α(P), α ≥ 3/2          n^{−1}                            n^{−1}

Effective sample size enlargement: the minimax rate-optimal estimator with n samples performs as the MLE does with n ln n samples.
Sample complexity perspectives

Functional               MLE            Optimal
H(P)                     Θ(S)           Θ(S/ln S)
F_α(P), 0 < α < 1        Θ(S^{1/α})     Θ(S^{1/α}/ln S)
F_α(P), α > 1            Θ(1)           Θ(1)

In the existing literature on F_α(P) for 0 < α < 2, the minimax rates, sample complexity, and optimal schemes were all unknown.
Unique features of our entropy estimator

- One realization of a general methodology.
- Achieves the minimax rate S²/(n ln n)² + ln²S/n (lower bound due to Wu and Yang 14).
- Agnostic to the knowledge of S.
- Easily implementable: the approximation order is only ln n; a 500th-order approximation takes 0.46 s on a PC (Chebfun).
- Empirical performance outperforms every available entropy estimator. (Code to be released this week, available upon request.)

Some statisticians raised an interesting objection: "We may not use this estimator unless you prove it is adaptive."
Adaptive estimation

Near-minimax estimator:

sup_{P ∈ M_S} E_P(Ĥ_Our − H(P))² ≍ inf_Ĥ sup_{P ∈ M_S} E_P(Ĥ − H(P))²

What if we know a priori that H(P) ≤ H? We then want

sup_{P ∈ M_S(H)} E_P(Ĥ_Our − H(P))² ≍ inf_Ĥ sup_{P ∈ M_S(H)} E_P(Ĥ − H(P))²,

where M_S(H) = {P : H(P) ≤ H, P ∈ M_S}.

Can our estimator satisfy all these requirements without knowing S and H?
Our estimator is adaptive!

Theorem (Han, J., Weissman 15):

inf_Ĥ sup_{P ∈ M_S(H)} E_P(Ĥ − H(P))² ≍
  S²/(n ln n)² + H ln S / n,                        if S/ln S ≤ e · nH/ln n,
  [ (H/ln S) · ln( (S/ln S) / (nH/ln n) ) ]²,       otherwise.    (1)

- To achieve root MSE ε, our estimator requires n ≍ S^{1−ε/H} ln n / (H ln S) samples — much smaller than S/ln S for small H!
- The MLE requires roughly a factor ln n more samples. The n → n ln n phenomenon is still true!
Understanding the generality

Can we deal with cases where the non-differentiable regime is not just a point? Consider L₁(P, Q) = ∑_{i=1}^S |p_i − q_i|.

- Breakthrough by Valiant and Valiant 11: the MLE requires n ≍ S samples, while the optimal is n ≍ S/ln S!
- However, VV 11's construction is only proved optimal when n ≍ S/ln S.
Applying our general methodology

- The minimax L² rate is ≍ S/(n ln n), while that of the MLE is ≍ S/n. The n → n ln n phenomenon is still true!
- We show that VV 11 cannot achieve it when n ≫ S.

However, is this easy to do? The nonsmooth regime is now a whole line! How do we cover it?

- Use a small square [0, ln n/n]²?
- Use a band?
- Use some other shape?
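The MLE baseline here is the plug-in L₁ distance between the two empirical distributions; a minimal sketch (our own function name). Its bias is worst exactly in the nonsmooth regime, where p_i and q_i are close:

```python
from collections import Counter

def empirical_l1(samples_p, samples_q):
    """Plug-in estimate of L1(P, Q) = sum_i |p_i - q_i| from two sample sets."""
    n_p, n_q = len(samples_p), len(samples_q)
    cp, cq = Counter(samples_p), Counter(samples_q)
    support = set(cp) | set(cq)
    return sum(abs(cp[i] / n_p - cq[i] / n_q) for i in support)

print(empirical_l1([0, 0, 1, 1], [0, 1, 1, 1]))  # |0.5-0.25| + |0.5-0.75| = 0.5
```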
Bad news

- Best polynomial approximation in 2D is not unique.
- There exists no efficient algorithm to compute it.
- Some polynomials that achieve the best approximation rate cannot be used in our methodology.
- All of the nonsmooth-regime designs mentioned before fail.

Solution: see the hints in our paper "Minimax estimation of functionals of discrete distributions", or wait for the soon-to-be-finished paper "Minimax estimation of divergence functions".
Significant difference in phase transitions

One may think: who cares about a ln n improvement?

- Take the sequence n = 2S/ln S, with S equally spaced (on a log scale) from 10² to … .
- For each (n, S), sample 20 times from a uniform distribution.
- The resulting scale of S/n: S/n ∈ [2.2727, …].
Significant improvement!

[Figure: MSE versus S/n for the JVHW estimator and the MLE.]
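The flavor of this experiment is easy to reproduce. Since the JVHW estimator itself is not reproduced here, the sketch below uses the much simpler Miller-Madow correction as a stand-in (our own substitution; the real estimator does far better), which already makes the MLE's downward bias in the n = 2S/ln S regime visible:

```python
import math
import random
from collections import Counter

def mle_entropy(samples):
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in Counter(samples).values())

def miller_madow(samples):
    """Miller-Madow correction: add (#observed symbols - 1) / (2n). A crude
    stand-in for the JVHW estimator, used only to expose the MLE's bias."""
    n = len(samples)
    return mle_entropy(samples) + (len(set(samples)) - 1) / (2 * n)

rng = random.Random(1)
S = 1000
n = 2 * S // round(math.log(S))      # the talk's regime: n = 2S / ln S
true_h = math.log(S)                 # entropy of the uniform distribution
mle_vals, mm_vals = [], []
for _ in range(50):
    x = [rng.randrange(S) for _ in range(n)]
    mle_vals.append(mle_entropy(x))
    mm_vals.append(miller_madow(x))
mle_bias = sum(mle_vals) / 50 - true_h
mm_bias = sum(mm_vals) / 50 - true_h
print(mle_bias, mm_bias)  # the MLE is badly biased downward in this regime
```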
Mutual information

Mutual information: I(P_XY) = H(P_X) + H(P_Y) − H(P_XY)

Theorem (J., Venkat, Han, Weissman 14): The n → n ln n phenomenon also holds for mutual information.
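The identity above turns any entropy estimator into a mutual information estimator; here is the plug-in version as a minimal sketch (our own helper names). Swapping each plug-in entropy for a minimax-rate-optimal one is what yields the improvement the theorem describes:

```python
import math
from collections import Counter

def entropy_from_counts(counts, n):
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def plug_in_mutual_information(pairs):
    """I(X;Y) = H(P_X) + H(P_Y) - H(P_XY), each term estimated by plug-in."""
    n = len(pairs)
    cx = Counter(x for x, _ in pairs)
    cy = Counter(y for _, y in pairs)
    cxy = Counter(pairs)
    return (entropy_from_counts(cx, n) + entropy_from_counts(cy, n)
            - entropy_from_counts(cxy, n))

print(plug_in_mutual_information([(0, 0), (1, 1), (0, 0), (1, 1)]))  # ln 2: here X = Y
```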
Learning tree graphical models

- d-dimensional random vector X = (X_1, X_2, ..., X_d), each coordinate with alphabet size S.
- Suppose the p.m.f. has tree structure: P_X = ∏_{i=1}^d P_{X_i | X_{π(i)}}

[Figure: diagram of an approximating dependence tree; in this example, P_T = P_{X_6}(·) P_{X_3|X_6}(·) P_{X_1|X_6}(·) P_{X_4|X_3}(·) P_{X_2|X_3}(·) P_{X_5|X_2}(·).]

Given n i.i.d. samples from P_X, estimate the underlying tree structure.
Chow–Liu algorithm (1968)

Natural approach: maximum likelihood! Chow and Liu 68 solved it (2000 citations):

- Maximum weight spanning tree (MWST)
- Edge weights: empirical mutual information

Theorem (Chow, Liu 68):

E_MLE = argmax_{E_Q : Q is a tree} ∑_{e ∈ E_Q} I(P̂_e)

Our approach: replace the empirical mutual information with better estimates!
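A from-scratch sketch of Chow-Liu (our own implementation, not the authors' code): pairwise plug-in mutual information as edge weights, then Kruskal's algorithm on descending weights for the maximum-weight spanning tree. The `mi_estimator` parameter is exactly the hook the talk proposes: pass a better mutual-information estimator to get the modified algorithm.

```python
import math
from collections import Counter

def plug_in_mi(pairs):
    """Plug-in mutual information from a list of (x, y) samples."""
    n = len(pairs)
    cx = Counter(p[0] for p in pairs)
    cy = Counter(p[1] for p in pairs)
    cxy = Counter(pairs)
    h = lambda c: -sum((v / n) * math.log(v / n) for v in c.values())
    return h(cx) + h(cy) - h(cxy)

def chow_liu_tree(samples, mi_estimator=plug_in_mi):
    """Chow-Liu: maximum-weight spanning tree with edge weights I(X_i; X_j)."""
    d = len(samples[0])
    edges = []
    for i in range(d):
        for j in range(i + 1, d):
            pairs = [(row[i], row[j]) for row in samples]
            edges.append((mi_estimator(pairs), i, j))
    edges.sort(reverse=True)             # descending weight: max spanning tree
    parent = list(range(d))              # union-find for Kruskal's algorithm
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j))
    return tree

# Toy example: coordinates 0 and 1 are identical, coordinate 2 is unrelated,
# so the learned tree must contain the edge (0, 1).
samples = [(0, 0, 0), (1, 1, 0), (0, 0, 1), (1, 1, 1), (0, 0, 0), (1, 1, 1)]
print(chow_liu_tree(samples))
```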
Original CL vs. modified CL: 8-fold improvement! (and even more)

We set d = 7, S = 300, a star graph, and sweep n from 1k to 55k.

[Figure: expected wrong-edges ratio versus number of samples n, for the original Chow–Liu algorithm and the modified version with better mutual information estimates.]
What we did not talk about

- How to analyze the bias of the MLE
- Why the bootstrap and jackknife fail
- Extensions to multivariate settings (ℓ_p distance)
- Extensions to nonparametric estimation
- Extensions to non-i.i.d. models
- Applications in machine learning, medical imaging, computer vision, genomics, etc.
Hindsight

Unbiased estimation requires an approximation basis that:

- satisfies the unbiased-estimation equation (Kolmogorov 50),
- has good approximation properties,
- does not incur large additional variance.

Theory of functional estimation ⟷ analytical properties of functions: the whole field of approximation theory, or even more.
Related work

- O. Lepski, A. Nemirovski, and V. Spokoiny 99, "On estimation of the L_r norm of a regression function"
- T. Cai and M. Low 11, "Testing composite hypotheses, Hermite polynomials and optimal estimation of a nonsmooth functional"
- Y. Wu and P. Yang 14, "Minimax rates of entropy estimation on large alphabets via best polynomial approximation"
- J. Acharya, A. Orlitsky, A. T. Suresh, and H. Tyagi 14, "The complexity of estimating Rényi entropy"
Our work

- J. Jiao, K. Venkat, Y. Han, T. Weissman, "Minimax estimation of functionals of discrete distributions", to appear in IEEE Transactions on Information Theory
- J. Jiao, K. Venkat, Y. Han, T. Weissman, "Maximum likelihood estimation of functionals of discrete distributions", available on arXiv
- J. Jiao, K. Venkat, Y. Han, T. Weissman, "Beyond maximum likelihood: from theory to practice", available on arXiv
- J. Jiao, Y. Han, T. Weissman, "Minimax estimation of divergence functions", in preparation
- Y. Han, J. Jiao, T. Weissman, "Adaptive estimation of Shannon entropy", available on arXiv
- Y. Han, J. Jiao, T. Weissman, "Does Dirichlet prior smoothing solve the Shannon entropy estimation problem?", available on arXiv
- Y. Han, J. Jiao, T. Weissman, "Bias correction using Taylor series, Bootstrap, and Jackknife", in preparation
- Y. Han, J. Jiao, T. Weissman, "How to bound the gap in Jensen's inequality?", in preparation

Thank you!
More informationEconometrics I, Estimation
Econometrics I, Estimation Department of Economics Stanford University September, 2008 Part I Parameter, Estimator, Estimate A parametric is a feature of the population. An estimator is a function of the
More informationBayesian Regularization
Bayesian Regularization Aad van der Vaart Vrije Universiteit Amsterdam International Congress of Mathematicians Hyderabad, August 2010 Contents Introduction Abstract result Gaussian process priors Co-authors
More informationMachine Learning. Lecture 9: Learning Theory. Feng Li.
Machine Learning Lecture 9: Learning Theory Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Why Learning Theory How can we tell
More informationPlug-in Approach to Active Learning
Plug-in Approach to Active Learning Stanislav Minsker Stanislav Minsker (Georgia Tech) Plug-in approach to active learning 1 / 18 Prediction framework Let (X, Y ) be a random couple in R d { 1, +1}. X
More informationGeneralization theory
Generalization theory Daniel Hsu Columbia TRIPODS Bootcamp 1 Motivation 2 Support vector machines X = R d, Y = { 1, +1}. Return solution ŵ R d to following optimization problem: λ min w R d 2 w 2 2 + 1
More informationMethods of Estimation
Methods of Estimation MIT 18.655 Dr. Kempthorne Spring 2016 1 Outline Methods of Estimation I 1 Methods of Estimation I 2 X X, X P P = {P θ, θ Θ}. Problem: Finding a function θˆ(x ) which is close to θ.
More informationSpring 2012 Math 541B Exam 1
Spring 2012 Math 541B Exam 1 1. A sample of size n is drawn without replacement from an urn containing N balls, m of which are red and N m are black; the balls are otherwise indistinguishable. Let X denote
More informationHomework 1 Due: Wednesday, September 28, 2016
0-704 Information Processing and Learning Fall 06 Homework Due: Wednesday, September 8, 06 Notes: For positive integers k, [k] := {,..., k} denotes te set of te first k positive integers. Wen p and Y q
More informationLinear Methods for Prediction
Chapter 5 Linear Methods for Prediction 5.1 Introduction We now revisit the classification problem and focus on linear methods. Since our prediction Ĝ(x) will always take values in the discrete set G we
More informationA BLEND OF INFORMATION THEORY AND STATISTICS. Andrew R. Barron. Collaborators: Cong Huang, Jonathan Li, Gerald Cheang, Xi Luo
A BLEND OF INFORMATION THEORY AND STATISTICS Andrew R. YALE UNIVERSITY Collaborators: Cong Huang, Jonathan Li, Gerald Cheang, Xi Luo Frejus, France, September 1-5, 2008 A BLEND OF INFORMATION THEORY AND
More informationSemi-Nonparametric Inferences for Massive Data
Semi-Nonparametric Inferences for Massive Data Guang Cheng 1 Department of Statistics Purdue University Statistics Seminar at NCSU October, 2015 1 Acknowledge NSF, Simons Foundation and ONR. A Joint Work
More information3.0.1 Multivariate version and tensor product of experiments
ECE598: Information-theoretic methods in high-dimensional statistics Spring 2016 Lecture 3: Minimax risk of GLM and four extensions Lecturer: Yihong Wu Scribe: Ashok Vardhan, Jan 28, 2016 [Ed. Mar 24]
More informationIntroduction to Statistical Learning Theory
Introduction to Statistical Learning Theory In the last unit we looked at regularization - adding a w 2 penalty. We add a bias - we prefer classifiers with low norm. How to incorporate more complicated
More informationEstimation Tasks. Short Course on Image Quality. Matthew A. Kupinski. Introduction
Estimation Tasks Short Course on Image Quality Matthew A. Kupinski Introduction Section 13.3 in B&M Keep in mind the similarities between estimation and classification Image-quality is a statistical concept
More informationNaïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability
Probability theory Naïve Bayes classification Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s height, the outcome of a coin toss Distinguish
More informationLecture 3: More on regularization. Bayesian vs maximum likelihood learning
Lecture 3: More on regularization. Bayesian vs maximum likelihood learning L2 and L1 regularization for linear estimators A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting
More informationHypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3
Hypothesis Testing CB: chapter 8; section 0.3 Hypothesis: statement about an unknown population parameter Examples: The average age of males in Sweden is 7. (statement about population mean) The lowest
More informationInformation Geometry
2015 Workshop on High-Dimensional Statistical Analysis Dec.11 (Friday) ~15 (Tuesday) Humanities and Social Sciences Center, Academia Sinica, Taiwan Information Geometry and Spontaneous Data Learning Shinto
More informationSTATS 200: Introduction to Statistical Inference. Lecture 29: Course review
STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout
More informationModel Selection and Geometry
Model Selection and Geometry Pascal Massart Université Paris-Sud, Orsay Leipzig, February Purpose of the talk! Concentration of measure plays a fundamental role in the theory of model selection! Model
More informationBayesian estimation of the discrepancy with misspecified parametric models
Bayesian estimation of the discrepancy with misspecified parametric models Pierpaolo De Blasi University of Torino & Collegio Carlo Alberto Bayesian Nonparametrics workshop ICERM, 17-21 September 2012
More informationOutline of Today s Lecture
University of Washington Department of Electrical Engineering Computer Speech Processing EE516 Winter 2005 Jeff A. Bilmes Lecture 12 Slides Feb 23 rd, 2005 Outline of Today s
More informationEmpirical Bayes Quantile-Prediction aka E-B Prediction under Check-loss;
BFF4, May 2, 2017 Empirical Bayes Quantile-Prediction aka E-B Prediction under Check-loss; Lawrence D. Brown Wharton School, Univ. of Pennsylvania Joint work with Gourab Mukherjee and Paat Rusmevichientong
More informationSTAT 512 sp 2018 Summary Sheet
STAT 5 sp 08 Summary Sheet Karl B. Gregory Spring 08. Transformations of a random variable Let X be a rv with support X and let g be a function mapping X to Y with inverse mapping g (A = {x X : g(x A}
More informationFast learning rates for plug-in classifiers under the margin condition
Fast learning rates for plug-in classifiers under the margin condition Jean-Yves Audibert 1 Alexandre B. Tsybakov 2 1 Certis ParisTech - Ecole des Ponts, France 2 LPMA Université Pierre et Marie Curie,
More informationAn Introduction to Statistical Machine Learning - Theoretical Aspects -
An Introduction to Statistical Machine Learning - Theoretical Aspects - Samy Bengio bengio@idiap.ch Dalle Molle Institute for Perceptual Artificial Intelligence (IDIAP) CP 592, rue du Simplon 4 1920 Martigny,
More informationA New Algorithm for Nonparametric Sequential Detection
A New Algorithm for Nonparametric Sequential Detection Shouvik Ganguly, K. R. Sahasranand and Vinod Sharma Department of Electrical Communication Engineering Indian Institute of Science, Bangalore, India
More informationECE521 week 3: 23/26 January 2017
ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear
More informationPeter Hoff Minimax estimation October 31, Motivation and definition. 2 Least favorable prior 3. 3 Least favorable prior sequence 11
Contents 1 Motivation and definition 1 2 Least favorable prior 3 3 Least favorable prior sequence 11 4 Nonparametric problems 15 5 Minimax and admissibility 18 6 Superefficiency and sparsity 19 Most of
More information6.1 Variational representation of f-divergences
ECE598: Information-theoretic methods in high-dimensional statistics Spring 2016 Lecture 6: Variational representation, HCR and CR lower bounds Lecturer: Yihong Wu Scribe: Georgios Rovatsos, Feb 11, 2016
More informationNonparametric Estimation: Part I
Nonparametric Estimation: Part I Regression in Function Space Yanjun Han Department of Electrical Engineering Stanford University yjhan@stanford.edu November 13, 2015 (Pray for Paris) Outline 1 Nonparametric
More informationStatistics - Lecture One. Outline. Charlotte Wickham 1. Basic ideas about estimation
Statistics - Lecture One Charlotte Wickham wickham@stat.berkeley.edu http://www.stat.berkeley.edu/~wickham/ Outline 1. Basic ideas about estimation 2. Method of Moments 3. Maximum Likelihood 4. Confidence
More informationMachine Learning. Model Selection and Validation. Fabio Vandin November 7, 2017
Machine Learning Model Selection and Validation Fabio Vandin November 7, 2017 1 Model Selection When we have to solve a machine learning task: there are different algorithms/classes algorithms have parameters
More informationLECTURE NOTE #3 PROF. ALAN YUILLE
LECTURE NOTE #3 PROF. ALAN YUILLE 1. Three Topics (1) Precision and Recall Curves. Receiver Operating Characteristic Curves (ROC). What to do if we do not fix the loss function? (2) The Curse of Dimensionality.
More informationLecture 3: Just a little more math
Lecture 3: Just a little more math Last time Through simple algebra and some facts about sums of normal random variables, we derived some basic results about orthogonal regression We used as our major
More informationPractice Problems Section Problems
Practice Problems Section 4-4-3 4-4 4-5 4-6 4-7 4-8 4-10 Supplemental Problems 4-1 to 4-9 4-13, 14, 15, 17, 19, 0 4-3, 34, 36, 38 4-47, 49, 5, 54, 55 4-59, 60, 63 4-66, 68, 69, 70, 74 4-79, 81, 84 4-85,
More informationLecture 2: August 31
0-704: Information Processing and Learning Fall 206 Lecturer: Aarti Singh Lecture 2: August 3 Note: These notes are based on scribed notes from Spring5 offering of this course. LaTeX template courtesy
More information1.1 Basis of Statistical Decision Theory
ECE598: Information-theoretic methods in high-dimensional statistics Spring 2016 Lecture 1: Introduction Lecturer: Yihong Wu Scribe: AmirEmad Ghassami, Jan 21, 2016 [Ed. Jan 31] Outline: Introduction of
More informationPeter Hoff Minimax estimation November 12, Motivation and definition. 2 Least favorable prior 3. 3 Least favorable prior sequence 11
Contents 1 Motivation and definition 1 2 Least favorable prior 3 3 Least favorable prior sequence 11 4 Nonparametric problems 15 5 Minimax and admissibility 18 6 Superefficiency and sparsity 19 Most of
More informationSDS : Theoretical Statistics
SDS 384 11: Theoretical Statistics Lecture 1: Introduction Purnamrita Sarkar Department of Statistics and Data Science The University of Texas at Austin https://psarkar.github.io/teaching Manegerial Stuff
More informationMath 494: Mathematical Statistics
Math 494: Mathematical Statistics Instructor: Jimin Ding jmding@wustl.edu Department of Mathematics Washington University in St. Louis Class materials are available on course website (www.math.wustl.edu/
More informationNonconcave Penalized Likelihood with A Diverging Number of Parameters
Nonconcave Penalized Likelihood with A Diverging Number of Parameters Jianqing Fan and Heng Peng Presenter: Jiale Xu March 12, 2010 Jianqing Fan and Heng Peng Presenter: JialeNonconcave Xu () Penalized
More informationPart III. A Decision-Theoretic Approach and Bayesian testing
Part III A Decision-Theoretic Approach and Bayesian testing 1 Chapter 10 Bayesian Inference as a Decision Problem The decision-theoretic framework starts with the following situation. We would like to
More informationUnbiased Estimation. Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others.
Unbiased Estimation Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others. To compare ˆθ and θ, two estimators of θ: Say ˆθ is better than θ if it
More informationInference For High Dimensional M-estimates. Fixed Design Results
: Fixed Design Results Lihua Lei Advisors: Peter J. Bickel, Michael I. Jordan joint work with Peter J. Bickel and Noureddine El Karoui Dec. 8, 2016 1/57 Table of Contents 1 Background 2 Main Results and
More information6.867 Machine Learning
6.867 Machine Learning Problem set 1 Due Thursday, September 19, in class What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.
More informationReadings: K&F: 16.3, 16.4, Graphical Models Carlos Guestrin Carnegie Mellon University October 6 th, 2008
Readings: K&F: 16.3, 16.4, 17.3 Bayesian Param. Learning Bayesian Structure Learning Graphical Models 10708 Carlos Guestrin Carnegie Mellon University October 6 th, 2008 10-708 Carlos Guestrin 2006-2008
More informationLecture 2: Statistical Decision Theory
EE378A Statistical Signal Processing Lecture 2-04/06/207 Lecture 2: Statistical Decision Theory Lecturer: Jiantao Jiao Scribe: Andrew Hilger In this lecture, we discuss a unified theoretical framework
More informationThe risk of machine learning
/ 33 The risk of machine learning Alberto Abadie Maximilian Kasy July 27, 27 2 / 33 Two key features of machine learning procedures Regularization / shrinkage: Improve prediction or estimation performance
More informationBayesian Aggregation for Extraordinarily Large Dataset
Bayesian Aggregation for Extraordinarily Large Dataset Guang Cheng 1 Department of Statistics Purdue University www.science.purdue.edu/bigdata Department Seminar Statistics@LSE May 19, 2017 1 A Joint Work
More informationHierarchical Multinomial-Dirichlet model for the estimation of conditional probability tables
Hierarchical Multinomial-Dirichlet model for the estimation of conditional probability tables Laura Azzimonti IDSIA - SUPSI/USI Manno, Switzerland laura@idsia.ch Giorgio Corani IDSIA - SUPSI/USI Manno,
More informationA direct formulation for sparse PCA using semidefinite programming
A direct formulation for sparse PCA using semidefinite programming A. d Aspremont, L. El Ghaoui, M. Jordan, G. Lanckriet ORFE, Princeton University & EECS, U.C. Berkeley A. d Aspremont, INFORMS, Denver,
More informationAsymptotic Nonequivalence of Nonparametric Experiments When the Smoothness Index is ½
University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 1998 Asymptotic Nonequivalence of Nonparametric Experiments When the Smoothness Index is ½ Lawrence D. Brown University
More information41903: Introduction to Nonparametrics
41903: Notes 5 Introduction Nonparametrics fundamentally about fitting flexible models: want model that is flexible enough to accommodate important patterns but not so flexible it overspecializes to specific
More informationSTATISTICS/ECONOMETRICS PREP COURSE PROF. MASSIMO GUIDOLIN
Massimo Guidolin Massimo.Guidolin@unibocconi.it Dept. of Finance STATISTICS/ECONOMETRICS PREP COURSE PROF. MASSIMO GUIDOLIN SECOND PART, LECTURE 2: MODES OF CONVERGENCE AND POINT ESTIMATION Lecture 2:
More informationDeep Learning: Approximation of Functions by Composition
Deep Learning: Approximation of Functions by Composition Zuowei Shen Department of Mathematics National University of Singapore Outline 1 A brief introduction of approximation theory 2 Deep learning: approximation
More informationMachine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.
Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted
More informationMachine Learning Basics: Maximum Likelihood Estimation
Machine Learning Basics: Maximum Likelihood Estimation Sargur N. srihari@cedar.buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Topics 1. Learning
More informationA talk on Oracle inequalities and regularization. by Sara van de Geer
A talk on Oracle inequalities and regularization by Sara van de Geer Workshop Regularization in Statistics Banff International Regularization Station September 6-11, 2003 Aim: to compare l 1 and other
More informationFrom Histograms to Multivariate Polynomial Histograms and Shape Estimation. Assoc Prof Inge Koch
From Histograms to Multivariate Polynomial Histograms and Shape Estimation Assoc Prof Inge Koch Statistics, School of Mathematical Sciences University of Adelaide Inge Koch (UNSW, Adelaide) Poly Histograms
More informationSTATISTICAL INFERENCE WITH DATA AUGMENTATION AND PARAMETER EXPANSION
STATISTICAL INFERENCE WITH arxiv:1512.00847v1 [math.st] 2 Dec 2015 DATA AUGMENTATION AND PARAMETER EXPANSION Yannis G. Yatracos Faculty of Communication and Media Studies Cyprus University of Technology
More informationHigh-dimensional regression
High-dimensional regression Advanced Methods for Data Analysis 36-402/36-608) Spring 2014 1 Back to linear regression 1.1 Shortcomings Suppose that we are given outcome measurements y 1,... y n R, and
More informationMidterm Examination. STA 215: Statistical Inference. Due Wednesday, 2006 Mar 8, 1:15 pm
Midterm Examination STA 215: Statistical Inference Due Wednesday, 2006 Mar 8, 1:15 pm This is an open-book take-home examination. You may work on it during any consecutive 24-hour period you like; please
More informationMGMT 69000: Topics in High-dimensional Data Analysis Falll 2016
MGMT 69000: Topics in High-dimensional Data Analysis Falll 2016 Lecture 14: Information Theoretic Methods Lecturer: Jiaming Xu Scribe: Hilda Ibriga, Adarsh Barik, December 02, 2016 Outline f-divergence
More informationTesting Statistical Hypotheses
E.L. Lehmann Joseph P. Romano Testing Statistical Hypotheses Third Edition 4y Springer Preface vii I Small-Sample Theory 1 1 The General Decision Problem 3 1.1 Statistical Inference and Statistical Decisions
More informationSpatially Smoothed Kernel Density Estimation via Generalized Empirical Likelihood
Spatially Smoothed Kernel Density Estimation via Generalized Empirical Likelihood Kuangyu Wen & Ximing Wu Texas A&M University Info-Metrics Institute Conference: Recent Innovations in Info-Metrics October
More informationLecture 2: Statistical Decision Theory (Part I)
Lecture 2: Statistical Decision Theory (Part I) Hao Helen Zhang Hao Helen Zhang Lecture 2: Statistical Decision Theory (Part I) 1 / 35 Outline of This Note Part I: Statistics Decision Theory (from Statistical
More informationLecture 5 September 19
IFT 6269: Probabilistic Graphical Models Fall 2016 Lecture 5 September 19 Lecturer: Simon Lacoste-Julien Scribe: Sébastien Lachapelle Disclaimer: These notes have only been lightly proofread. 5.1 Statistical
More informationAgnostic Learning of Disjunctions on Symmetric Distributions
Agnostic Learning of Disjunctions on Symmetric Distributions Vitaly Feldman vitaly@post.harvard.edu Pravesh Kothari kothari@cs.utexas.edu May 26, 2014 Abstract We consider the problem of approximating
More informationBootstrap, Jackknife and other resampling methods
Bootstrap, Jackknife and other resampling methods Part III: Parametric Bootstrap Rozenn Dahyot Room 128, Department of Statistics Trinity College Dublin, Ireland dahyot@mee.tcd.ie 2005 R. Dahyot (TCD)
More informationMultiple Linear Regression for the Supervisor Data
for the Supervisor Data Rating 40 50 60 70 80 90 40 50 60 70 50 60 70 80 90 40 60 80 40 60 80 Complaints Privileges 30 50 70 40 60 Learn Raises 50 70 50 70 90 Critical 40 50 60 70 80 30 40 50 60 70 80
More information