Algebraic Geometry and Model Selection


1 Algebraic Geometry and Model Selection
American Institute of Mathematics, 2011/Dec/12-16
I would like to thank Prof. Russell Steele, Prof. Bernd Sturmfels, and all participants. Thank you very much.
Sumio Watanabe, Tokyo Institute of Technology

2 Contents
1. AIC, BIC, and DIC
2. Birational Invariants
3. Singular Fluctuation
4. Open Questions

3 1. AIC, BIC, and DIC

4 Statistical Model and True Distribution
x ∈ R^N, w ∈ W (compact) ⊂ R^d.
(1) True distribution q(x); i.i.d. samples X_1, X_2, ..., X_n.
(2) Statistical model p(x|w).
(3) Prior distribution φ(w).
(Remark) E[ · ] denotes the expectation over X_1, X_2, ..., X_n; E_x[ · ] denotes the expectation over X whose probability distribution is q(x).

5 Statistical Estimation
Posterior distribution expectation:
E_w[ · ] = ∫ ( · ) ∏_{i=1}^n p(X_i|w) φ(w) dw / ∫ ∏_{i=1}^n p(X_i|w) φ(w) dw
Predictive distribution:
p*(x) = E_w[ p(x|w) ]
(Remark) In Bayesian estimation, the true distribution q(x) is estimated by p*(x).
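As a concrete illustration of the predictive distribution, here is a minimal sketch assuming posterior samples w_s are already available (e.g. from MCMC); the function names, array layout, and Gaussian toy model are placeholders, not part of the slides.

```python
import numpy as np

def predictive_density(x, w_samples, logp):
    """p*(x) = E_w[p(x|w)], approximated by averaging the model density
    over posterior samples w_samples (e.g. MCMC output)."""
    # log p(x|w_s) for each posterior sample w_s
    logs = np.array([logp(x, w) for w in w_samples])
    # numerically stable average of exp(logs)
    m = logs.max()
    return np.exp(m) * np.mean(np.exp(logs - m))

# toy example: N(w, 1) model; w_samples stands in for MCMC output
logp = lambda x, w: -0.5 * np.log(2 * np.pi) - 0.5 * (x - w) ** 2
w_samples = np.random.normal(0.1, 0.05, size=2000)
print(predictive_density(0.0, w_samples, logp))
```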

6 Stochastic Complexity and Generalization Loss
(1) Stochastic complexity = - log (Bayes marginal likelihood):
F = - log ∫ ∏_{i=1}^n p(X_i|w) φ(w) dw
(2) Generalization loss:
G = - E_x[ log p*(x) ] = S + KL( q(x) || p*(x) )
(Remark) S = - E_x[ log q(x) ] is the entropy of q(x).
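The decomposition of the generalization loss into entropy plus a Kullback-Leibler term follows in one line from the definitions above:
G = - E_x[ log p*(x) ] = - E_x[ log q(x) ] + E_x[ log ( q(x) / p*(x) ) ] = S + KL( q(x) || p*(x) ),
so minimizing the generalization loss over models is the same as bringing the predictive distribution close to q(x) in Kullback-Leibler divergence.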

7 BIC (Schwarz, 1978)
(1) BIC = - Σ_{i=1}^n log p(X_i | w_MLE) + (d/2) log n
If the posterior is approximately a normal distribution, F = BIC + O_p(1).
In general, as we learned yesterday,
F = - Σ_{i=1}^n log p(X_i | w_MLE) + λ log n - (m-1) log log n + O_p(1),
where λ is the real log canonical threshold (RLCT) and m is its multiplicity.
In our workshop: Dr. Lin, relation to asymptotic integrals; Dr. Drton, how to use it in statistics; Dr. Leykin, D-modules and b-functions.

8 AIC (Akaike, 1974) & DIC (Spiegelhalter et al., 2002)
(2) AIC = - Σ_{i=1}^n log p(X_i | w_MLE) + d
(3) DIC = - Σ_{i=1}^n log E_w[ p(X_i|w) ] + 2 Σ_{i=1}^n { - E_w[ log p(X_i|w) ] + log p(X_i | E_w[w]) }
If the posterior is approximately a normal distribution,
E[AIC] = n E[G] + o(1),  E[DIC] = n E[G] + o(1).
Otherwise, such relations do not hold.
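For reference, a small sketch that evaluates the three criteria as they are written on slides 7-8 (note the slides use half of the conventional -2 log-likelihood scale); the function names, the posterior-sample matrix layout, and the helper are my own assumptions, and the DIC form follows the slide's transcription rather than the usual plug-in deviance.

```python
import numpy as np

def log_mean_exp(a, axis=0):
    """log of the mean of exp(a) along an axis, computed stably."""
    m = a.max(axis=axis)
    return m + np.log(np.mean(np.exp(a - m), axis=axis))

def aic(loglik_mle, d):
    """AIC on the slides' scale: -sum_i log p(X_i|w_MLE) + d."""
    return -loglik_mle + d

def bic(loglik_mle, d, n):
    """BIC on the slides' scale: -sum_i log p(X_i|w_MLE) + (d/2) log n."""
    return -loglik_mle + 0.5 * d * np.log(n)

def dic(logp_matrix, logp_at_post_mean):
    """DIC as written on slide 8.

    logp_matrix[s, i]    = log p(X_i | w_s) for posterior samples w_s
    logp_at_post_mean[i] = log p(X_i | E_w[w])
    """
    # first term: -sum_i log E_w[p(X_i|w)]
    first = -np.sum(log_mean_exp(logp_matrix, axis=0))
    # penalty: 2 * sum_i { -E_w[log p(X_i|w)] + log p(X_i|E_w[w]) }
    penalty = 2.0 * np.sum(-logp_matrix.mean(axis=0) + logp_at_post_mean)
    return first + penalty
```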

9 Estimation of G and F in regular cases
          | Main term estimated | Consistency in model selection | Unbiased estimator of G
AIC, DIC  | Random (order 1)    | No                             | Yes
BIC       | Constant (log n)    | Yes                            | No

10 2. Birational Statistics

11 Birational Invariant
If a value that is defined using a resolution of singularities does not depend on the choice of the resolution, then it is called a birational invariant.
[Diagram: several resolution maps g_1: U_1 → W, g_2: U_2 → W, g_3: U_3 → W of the same parameter set W.]

12 Nature in Statistics
It is natural that statistical theory should be invariant under birational transforms. Fisher's asymptotic theory does not satisfy such invariance; for example, it is not invariant under blow-up.
Y = aX + b + noise, with the blow-up (a, b) = (c, cd) in one chart and (a, b) = (cd, d) in the other.
Asymptotic normality of (a, b) holds, whereas that of (c, d) in projective space does not.
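A quick check of why normality fails in the blow-up chart (a worked example of my own, assuming Gaussian noise of unit variance and the chart a = c, b = cd): for the model Y = cX + cd + ε with ε ~ N(0,1),
∂_c log p(y|x,c,d) = (x + d)(y - cx - cd),   ∂_d log p(y|x,c,d) = c (y - cx - cd),
so the Fisher information matrix in the (c, d) chart is
I(c,d) = [ E[(X+d)^2]   c E[X+d] ;  c E[X+d]   c^2 ],
whose determinant is c^2 Var(X). It vanishes on the exceptional set {c = 0}, which is exactly the true parameter set when a = b = 0, so the usual asymptotic normality for (c, d) breaks down there.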

13 Differential and Birational
Algebraic geometry studies mathematical properties that are invariant under birational transforms (just as differential geometry studies properties invariant under diffeomorphisms).
To construct a birational statistics might be one of the purposes of algebraic statistics.

14 3. Singular Fluctuation and model selection

15 Loss for estimation
Predictive distribution: p*(x) = E_w[ p(x|w) ]
Generalization loss: G_n = - E_x[ log E_w[ p(x|w) ] ]
Training loss: T_n = - (1/n) Σ_{i=1}^n log E_w[ p(X_i|w) ]

16 Functional Cumulant
Def. Two cumulant generating functions:
g(α) = - E_x[ log E_w[ p(x|w)^α ] ]
t(α) = - (1/n) Σ_{i=1}^n log E_w[ p(X_i|w)^α ]
Then g(0) = t(0) = 0, and G_n = g(1), T_n = t(1).

17 Invariance
The two functions g(α) and t(α) are invariant under a change of variables w = g(u):
p(x|w) = p(x|g(u)),   φ(w) dw = φ(g(u)) |g'(u)| du.
Cumulant generating function = birational invariant generating function.
Example: λ = lim_{n→∞} n { E[g(1)] - S }.
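The invariance can be checked in one step from the definitions:
E_w[ p(x|w)^α ] = ∫ p(x|w)^α ∏_{i=1}^n p(X_i|w) φ(w) dw / ∫ ∏_{i=1}^n p(X_i|w) φ(w) dw,
and substituting w = g(u) replaces φ(w) dw by φ(g(u)) |g'(u)| du in both the numerator and the denominator, so the ratio, and hence g(α) and t(α), is unchanged.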

18 Notation
Def. Log density ratio function: f(x,w) = log( q(x) / p(x|w) ).
Then log p(x|w) = log q(x) - f(x,w).

19 Expansion of g(α)
g(α) is rewritten as
g(α) = α S - E_x[ log E_w[ exp(-α f(x,w)) ] ].
Therefore
g'(0) = S + E_x[ E_w[ f(x,w) ] ],
g''(0) = - E_x[ E_w[ f(x,w)^2 ] - E_w[ f(x,w) ]^2 ].
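Both derivatives follow by differentiating under E_x; writing h(α) = E_w[ exp(-α f(x,w)) ],
g'(α) = S + E_x[ E_w[ f e^{-αf} ] / h(α) ],   g''(α) = - E_x[ E_w[ f^2 e^{-αf} ] / h(α) - ( E_w[ f e^{-αf} ] / h(α) )^2 ],
and setting α = 0 (where h(0) = 1) gives the stated formulas.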

20 Expansion of t(α)
In the same way,
t(α) = α S_n - (1/n) Σ_{i=1}^n log E_w[ exp(-α f(X_i, w)) ],
where S_n = - (1/n) Σ_{i=1}^n log q(X_i) is the empirical entropy. Therefore
t'(0) = S_n + (1/n) Σ_{i=1}^n E_w[ f(X_i, w) ],
t''(0) = - (1/n) Σ_{i=1}^n { E_w[ f(X_i, w)^2 ] - E_w[ f(X_i, w) ]^2 }.

21 Functional Variance
Def. Two random variables:
V_1 = 2 { n E_x[ E_w[ f(x,w) ] ] - λ },
V_2 = Σ_{i=1}^n { E_w[ (log p(X_i|w))^2 ] - E_w[ log p(X_i|w) ]^2 }.
Remark. V_2 can be calculated from the samples and the model without any information about the true distribution. In order to calculate V_1, we need information about the true distribution.
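V_2 is straightforward to evaluate from MCMC output; a minimal sketch (the array layout and the function name are my own assumptions):

```python
import numpy as np

def functional_variance(logp_matrix):
    """V_2 = sum_i { E_w[(log p(X_i|w))^2] - E_w[log p(X_i|w)]^2 },
    i.e. the posterior variance of log p(X_i|w) summed over data points,
    with the posterior approximated by MCMC samples.

    logp_matrix[s, i] = log p(X_i | w_s) for posterior sample w_s.
    """
    return np.sum(np.var(logp_matrix, axis=0))
```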

22 Singular Fluctuation
Theorem. The convergences V_1 → V_1*, V_2 → V_2*, E[V_1] → E[V_1*], E[V_2] → E[V_2*] hold.
Theorem and Def. (Singular fluctuation)  E[V_1*] = E[V_2*] = 2ν.
Remark. In order to prove the above theorems, we need the resolution theorem and empirical process theory. In regular cases, λ = ν = d/2.

23 Outline of Proofs
Posterior distribution measure: exp( -n K_n(w) ) φ(w) dw, where K_n(w) = (1/n) Σ_{i=1}^n f(X_i, w).
After resolution of singularities w = g(u), this becomes
exp( -n u^{2k} + n^{1/2} u^k ξ_n(u) ) φ(g(u)) |g'(u)| du
= ∫_0^∞ dt ∫ δ( t - n u^{2k} ) u^h b(u) exp( -t + t^{1/2} ξ_n(u) ) du
≅ ( (log n)^{m-1} / n^λ ) ∫ dt ∫ t^{λ-1} exp( -t + t^{1/2} ξ_n(u) ) D(u) du.

24 Outline of Proofs 2
Def. Expectation over the limit posterior distribution:
⟨ · ⟩ = ∫ dt ∫ du D(u) ( · ) t^{λ-1} exp( -t + t^{1/2} ξ(u) ) / ∫ dt ∫ du D(u) t^{λ-1} exp( -t + t^{1/2} ξ(u) ).
Lemma. For s ≥ 0,
n^{s/2} E_w[ f(x,w)^s ] → ⟨ t^{s/2} a(x,u)^s ⟩.

25 Cumulants --- Random Variables
g'(0) = S + (λ + V_1/2)/n + o_p(1/n)
g''(0) = - V_2/n + o_p(1/n)
t'(0) = S_n + (λ - V_1/2)/n + o_p(1/n)
t''(0) = - V_2/n + o_p(1/n)

26 Cumulants --- Birational Invariants
E[ g'(0) ] = S + (λ + ν)/n + o(1/n)
E[ g''(0) ] = - 2ν/n + o(1/n)
E[ t'(0) ] = S + (λ - ν)/n + o(1/n)
E[ t''(0) ] = - 2ν/n + o(1/n)

27 Theorem
Generalization loss: G_n = S + [ λ + V_1/2 - V_2/2 ] / n + o_p(1/n)
Training loss: T_n = S_n + [ λ - V_1/2 - V_2/2 ] / n + o_p(1/n)

28 Theorem
When n tends to infinity,
E[ G_n ] = S + λ/n + o(1/n),
E[ T_n ] = S + (λ - 2ν)/n + o(1/n),
E[ V_2 ] = 2ν + o(1).
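Combining these three statements already shows why the correction on the next slide works: adding V_2/n to the training loss removes the bias in expectation,
E[ T_n + V_2/n ] = S + (λ - 2ν)/n + 2ν/n + o(1/n) = S + λ/n + o(1/n) = E[ G_n ] + o(1/n);
the theorem on the next slide sharpens this to o(1/n^2).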

29 Def. WAIC
WAIC is defined by W_n = T_n + V_2/n.
Theorem. For an arbitrary triple (q(x), p(x|w), φ(w)),
E[ G_n ] = E[ W_n ] + o(1/n^2),
(G_n - S) + (W_n - S_n) = 2λ/n + o_p(1/n).
Remark. WAIC is asymptotically equivalent to Bayes cross validation (Watanabe, JMLR, Vol. 11, 2010).
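A minimal WAIC computation from an MCMC log-likelihood matrix, following the slide's definition W_n = T_n + V_2/n (the array layout and names are the same assumptions as in the earlier sketches):

```python
import numpy as np

def waic(logp_matrix):
    """W_n = T_n + V_2/n, with logp_matrix[s, i] = log p(X_i | w_s).

    T_n = -(1/n) sum_i log E_w[p(X_i|w)]   (training loss, slide 15)
    V_2 =  sum_i Var_w[log p(X_i|w)]       (functional variance, slide 21)
    """
    n = logp_matrix.shape[1]
    # log E_w[p(X_i|w)], computed stably for each data point i
    m = logp_matrix.max(axis=0)
    log_pred = m + np.log(np.mean(np.exp(logp_matrix - m), axis=0))
    t_n = -np.mean(log_pred)
    v_2 = np.sum(np.var(logp_matrix, axis=0))
    return t_n + v_2 / n
```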

30 [Figure: numerical experiment with reduced rank regression (posterior by the Metropolis method, n = 200, 100 independent trials). Red: W_n - S_n; blue: G_n - S; the theoretical value λ/n with λ = 12 is shown for comparison.]

31 [Figure: scatter plot of G_n - S against W_n - S_n over the trials, illustrating the relation (G_n - S) + (W_n - S_n) = 2λ/n.]

32 WAIC
(1) E[ W_n ] = E[ G_n ] + o(1/n^2) holds even if q(x) is not realizable by p(x|w).
(2) The essential main term fluctuates, so there is no consistency in model selection.
(3) If the posterior can be approximated by a normal distribution, WAIC is equivalent to AIC and DIC as a random variable.
(4) Otherwise, WAIC is an unbiased estimator of the generalization error, whereas AIC and DIC are not.

33 4. Open Problems

34 Two Birational Invariants
If the regularity condition is satisfied, λ = ν = d/2. In general, they are different.
λ = a dimension that shows how fast the posterior shrinks.
ν = a dimension that shows how strongly the posterior fluctuates.
Q. Mathematically, what is ν?

35 Generalized RLCT
g(α) = - E_x[ log E_w[ p(x|w)^α ] ]
t(α) = - (1/n) Σ_{i=1}^n log E_w[ p(X_i|w)^α ]
Q. E[ (d/dα)^k g(0) ] and E[ (d/dα)^k t(0) ] (k = 1, 2, ...) are all birational invariants. They are statistical generalizations of the real log canonical threshold. What are they?

36 Variance of Information Criterion
E[ G_n ] = E[ W_n ] + o(1/n^2),
(G_n - S) + (W_n - S_n) = 2λ/n + o_p(1/n).
Q. These equations suggest that WAIC has the smallest variance among the information criteria whose averages equal that of G_n. Can we prove this?

37 Bayes Hypothesis Testing
For a given prior φ(w), we define
F(φ) = - log ∫ ∏_{i=1}^n p(X_i|w) φ(w) dw.
If the null hypothesis is φ_0(w) and the alternative is φ_1(w), then the most powerful test is given by ΔF = F(φ_1) - F(φ_0).
Q. In order to construct the most powerful hypothesis test, the probability distribution of ΔF is necessary.
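As a numerical illustration of ΔF (not from the slides): for small problems both marginal likelihoods can be estimated by naive Monte Carlo over the respective priors; the toy model, priors, and data below are placeholders.

```python
import numpy as np
from scipy.special import logsumexp

def log_marginal(loglik, prior_sampler, n_draws=20000, seed=0):
    """Estimate log of the Bayes marginal likelihood by averaging the
    likelihood over draws from the prior (naive Monte Carlo)."""
    rng = np.random.default_rng(seed)
    ws = prior_sampler(rng, n_draws)
    lls = np.array([loglik(w) for w in ws])
    return logsumexp(lls) - np.log(n_draws)

# toy example: data from N(0.3, 1); phi_0 concentrated near 0, phi_1 diffuse
rng = np.random.default_rng(0)
data = rng.normal(0.3, 1.0, size=50)
loglik = lambda w: -0.5 * np.sum((data - w) ** 2) - 0.5 * len(data) * np.log(2 * np.pi)
prior0 = lambda r, m: r.normal(0.0, 0.01, size=m)  # null hypothesis prior phi_0
prior1 = lambda r, m: r.normal(0.0, 1.0, size=m)   # alternative prior phi_1

F0 = -log_marginal(loglik, prior0, seed=1)
F1 = -log_marginal(loglik, prior1, seed=2)
print("Delta F = F(phi_1) - F(phi_0) =", F1 - F0)
```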
