A Bayesian view of model complexity


Angelika van der Linde, University of Bremen, Germany

Outline:
1. Intuitions
2. Measures of dependence between observables and parameters
3. Occurrence in predictive model comparison
4. Evaluation
5. Discussion

1. Intuitions

1.1 What is a statistical model?

For a Bayesian: likelihood and prior,
M = ({p(y | θ) : θ ∈ Θ}, p(θ)).
Note: M is focused on θ. Special case: p(θ) not informative.

1.2 What is model complexity (mc)?

(i) θ → y: explanatory power of θ for y, i.e. the potential of fitting y with θ.
mc = number of parameters: mc = p.
Generalized in linear regression (Wahba, 1990):
Y = μ + ε, ε ~ N(0, σ² Iₙ), μ̂ = Sy,
mc = tr(S).
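
A minimal numerical sketch (invented for illustration, not from the slides): for ordinary least squares the smoother is the hat matrix S = X(XᵀX)⁻¹Xᵀ, so tr(S) recovers p, while a shrinkage smoother such as ridge (penalty lam assumed here) has tr(S) < p.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))

# OLS smoother (hat matrix): mu_hat = S y with S = X (X'X)^{-1} X'
S = X @ np.linalg.inv(X.T @ X) @ X.T
print(np.trace(S))          # = p = 3 up to rounding

# A shrinkage smoother (ridge, penalty assumed for illustration):
lam = 5.0
S_ridge = X @ np.linalg.inv(X.T @ X + lam * np.eye(p)) @ X.T
print(np.trace(S_ridge))    # < p: shrinkage lowers effective complexity
```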

(ii) y → θ: discriminatory power of y for θ, i.e. the sensitivity of θ to y — how hard it is to learn θ from y.
Estimation variance in linear regression:
tr(cov(Sy)) = σ² tr(S²) = σ² tr(S) if S is an orthogonal projection.

mc = tr(S) = Σᵢ sᵢᵢ,
where sᵢᵢ is the sensitivity of the estimate μ̂ᵢ to yᵢ. Generalized degrees of freedom:
gdf(θ) = Σᵢ cov_{Y|θ}(μ̂ᵢ(Y), Yᵢ) (Ye, 1998).
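
A hedged Monte Carlo sketch of the gdf idea for a linear smoother (setup invented; the usual 1/σ² normalization is used so the estimate should approach tr(S)):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 40, 4, 1.0
X = rng.normal(size=(n, p))
S = X @ np.linalg.inv(X.T @ X) @ X.T
mu = X @ rng.normal(size=p)                  # fixed true mean μ

reps = 20000
Y = mu + sigma * rng.normal(size=(reps, n))  # repeated draws of Y | θ
Mu_hat = Y @ S.T                             # μ̂(Y) = S y for each draw
# per-coordinate sample covariances cov(μ̂ᵢ(Y), Yᵢ)
cov_ii = ((Mu_hat - Mu_hat.mean(0)) * (Y - Y.mean(0))).mean(0)
print(cov_ii.sum() / sigma**2, np.trace(S))  # both ≈ p = 4
```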

(iii) Information criteria for model comparison: IC = fit + complexity.
BIC = −2 log p(y | θ̂_ML(y)) + p log n
AIC = −2 log p(y | θ̂_ML(y)) + 2p
DIC = −2 log p(y | E(θ | y)) + 2 p_D

p_D = −2 E_{θ|y}[log p(y | θ)] + 2 log p(y | E(θ | y)) = D̄ − D(θ̄),
and many others (GIC, TIC, BPIC, WAIC).
Defined only by comparison, mc remains rather elusive.
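
A small sketch of p_D in an assumed conjugate normal model (yᵢ ~ N(θ, σ²) with σ known, prior θ ~ N(0, τ²)), where p_D has the closed form nτ²/(σ² + nτ²):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n, sigma, tau = 20, 1.0, 2.0
y = rng.normal(0.5, sigma, size=n)          # data (true mean 0.5, assumed)

v = 1.0 / (n / sigma**2 + 1 / tau**2)       # posterior variance of θ | y
m = v * y.sum() / sigma**2                  # posterior mean E(θ | y)

def deviance(theta):
    return -2 * norm.logpdf(y, loc=theta, scale=sigma).sum()

draws = rng.normal(m, np.sqrt(v), size=20000)                  # θ | y samples
ll = norm.logpdf(y[None, :], loc=draws[:, None], scale=sigma)  # (20000, n)
D_bar = (-2 * ll.sum(axis=1)).mean()                           # E_{θ|y} D(θ)
p_D = D_bar - deviance(m)                                      # D̄ − D(θ̄)
DIC = deviance(m) + 2 * p_D
print(p_D, n * tau**2 / (sigma**2 + n * tau**2))               # ≈ closed form
```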

Conclusion: issues in model complexity.
Duality y ↔ θ — but: y → θ or θ → y?
Idea: mc = a measure of dependence between y and θ.
Bayesian: θ random, no problem. Frequentist: θ fixed, but θ̂(Y) random.

2. Measures of dependence

Mutual information (symmetric):
I(Y, Θ) = E_{θ,y}[log p(θ, y) / (p(θ)p(y))]
and its symmetrized version
J(Y, Θ) = I(Y, Θ) + E_θ E_y[log p(θ)p(y) / p(θ, y)].
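
For a concrete check, assume (Θ, Y) jointly standard Gaussian with correlation ρ (a toy case, not from the slides): then I = −½ log(1 − ρ²) and J = ρ²/(1 − ρ²). A Monte Carlo sketch of both definitions:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

rho = 0.8
cov = np.array([[1.0, rho], [rho, 1.0]])
joint = multivariate_normal(mean=[0, 0], cov=cov)

rng = np.random.default_rng(3)
z = rng.multivariate_normal([0, 0], cov, size=200000)     # draws from p(θ, y)
log_ratio = joint.logpdf(z) - norm.logpdf(z[:, 0]) - norm.logpdf(z[:, 1])
I_mc = log_ratio.mean()

z_ind = rng.normal(size=(200000, 2))                      # draws from p(θ)p(y)
log_ratio_ind = (norm.logpdf(z_ind[:, 0]) + norm.logpdf(z_ind[:, 1])
                 - joint.logpdf(z_ind))
J_mc = I_mc + log_ratio_ind.mean()
print(I_mc, -0.5 * np.log(1 - rho**2))                    # ≈
print(J_mc, rho**2 / (1 - rho**2))                        # ≈
```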

Properties:
- dual in y and θ
- a measure of variability of p(y | θ) w.r.t. p(y): J(Y, Θ) = E_θ J(p(y | θ), p(y))
- in exponential families: J(Y, Θ) = tr(cov_Y(E(θ | Y), t(Y))) = tr(cov_Θ(θ, E(t(Y) | θ)))
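
A sketch of the exponential-family identity in an assumed normal-normal model, Y | θ ~ N(θ, σ²) with natural statistic t(y) = y/σ² and θ ~ N(0, τ²): both traces reduce to τ²/σ², which matches J = ρ²/(1 − ρ²) from the previous sketch with ρ² = τ²/(σ² + τ²).

```python
import numpy as np

rng = np.random.default_rng(4)
sigma2, tau2, N = 1.0, 4.0, 400000
theta = rng.normal(0, np.sqrt(tau2), size=N)
y = rng.normal(theta, np.sqrt(sigma2))

post_mean = tau2 / (sigma2 + tau2) * y        # E(θ | Y)
t = y / sigma2                                # t(Y)
print(np.cov(post_mean, t)[0, 1])             # ≈ τ²/σ² = 4
print(np.cov(theta, theta / sigma2)[0, 1])    # E(t(Y) | θ) = θ/σ² here; ≈ 4
```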

Prior and posterior versions:
- prior: with Y and Θ
- posterior: with a future replicate Ỹ and Θ_post
Coincidence with tr(S) in regression? Prior: NO. Posterior: YES!

Posterior J in detail:
J(Ỹ, Θ_post) = E_{θ|y} J(p(ỹ | θ), p(ỹ | y)),
the variability of p(ỹ | θ) having learnt about θ;
in exponential families, J(Ỹ, Θ_post) = E_{θ|y} J(p(ỹ | θ), p(ỹ | θ̄)) with θ̄ = E(θ | y).

3. Occurrence in predictive model comparison

3.1 Targets

Assessments of models according to:
- data fit: prior prediction of Y by Θ
- fit of future observations: posterior prediction of Ỹ by Θ_post
- parameters either averaged w.r.t. Θ or Θ_post, or represented by a point estimate θ̂(y)

Predictive success measured by, e.g.:

        prior                            posterior
ave     log E_Θ[p(y | θ)] = log p(y)     E_{Θ_post}[log p(ỹ | θ)]
rep     NA                               log p(ỹ | θ̂(y))

These should be large — prior: for the data y; posterior: on average over Ỹ (and Y, if rep?).
Bayesian and frequentist versions!

Examples of predictive targets:
- model evidence: log E_Θ[p(y | θ)] = log p(y), related to BIC and Bayes factors — prior predictive, average
- AIC: E_{y|θ₀} E_{ỹ|θ₀}[−2 log p(ỹ | θ̂_ML(y))] (θ₀ the true value) — posterior predictive, representative
- DIC: E_{ỹ|y}[−2 log p(ỹ | θ̄)] — posterior predictive, representative
- DIC+, BPIC: E_{ỹ|y} E_{Θ_post}[−2 log p(ỹ | θ)] — posterior predictive, average

3.2 Key decompositions

log p(y) = log p(y | θ) − log [p(y, θ) / (p(y)p(θ))]   (1)
log p(y | θ⁺) = log p(y | θ) − log [p(y | θ) / p(y | θ⁺)]   (2)

Taking expectations yields
E_{Y,Θ}[log p(y, θ) / (p(y)p(θ))] = I(Y, Θ) = E_Θ I(p(y | θ), p(y))   (3)
E_{Y,Θ}[log p(y | θ) / p(y | θ⁺)] = E_Θ I(p(y | θ), p(y | θ⁺))   (4)
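
Decomposition (1) is one line of Bayes' theorem; the following derivation fills in the step (expectation of the subtracted term under p(y, θ) then gives I(Y, Θ), i.e. (3)):

```latex
\log p(y)
  = \log\frac{p(y\mid\theta)\,p(\theta)}{p(\theta\mid y)}
  = \log p(y\mid\theta) - \log\frac{p(\theta\mid y)}{p(\theta)}
  = \log p(y\mid\theta) - \log\frac{p(y,\theta)}{p(y)\,p(\theta)} .
```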

More building blocks:
- symmetry assumption (approximation): J(p(y | θ), p(y | θ⁺)) ≈ 2 I(p(y | θ), p(y | θ⁺)), achievable under a second-order Taylor approximation
- link between ave and rep in exponential families: J(Y, Θ) = E_Θ J(p(y | θ), p(y | E(θ)))
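
A quick check of the symmetry approximation for two univariate Gaussians (a toy case, chosen here only because the KL divergence is available in closed form):

```python
import numpy as np

# KL(N(m1, s1²) || N(m2, s2²)) in closed form
def kl(m1, s1, m2, s2):
    return np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

for dm, ds in [(0.1, 0.0), (0.5, 0.2), (1.0, 0.5)]:
    I = kl(0.0, 1.0, dm, 1.0 + ds)
    J = I + kl(dm, 1.0 + ds, 0.0, 1.0)
    print(dm, ds, J, 2 * I)   # close for small (dm, ds), diverging as they grow
```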

Conclusions:
- IC = fit + complexity is based on decompositions; major variants:
  E_Y[−log p(Y)] = H(Y) = H(Y | Θ) + I(Y, Θ)
  E_Y E_Θ[−log p(Y | Θ)] = H(Y | Θ) + J(Y, Θ)
- a cloud of targets yields a cloud of measures of mc, but all can be identified as variants of KL divergences
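
Both variants can be verified numerically on any discrete joint distribution; a sketch with an arbitrary 3 × 2 table p(θ, y) (values assumed):

```python
import numpy as np

p = np.array([[0.2, 0.1], [0.05, 0.3], [0.15, 0.2]])   # p(θ, y): 3 θ × 2 y
p_theta = p.sum(1, keepdims=True)                      # marginal p(θ)
p_y = p.sum(0, keepdims=True)                          # marginal p(y)

H_Y = -(p_y * np.log(p_y)).sum()
H_Y_given_T = -(p * np.log(p / p_theta)).sum()         # H(Y | Θ)
I = (p * np.log(p / (p_theta * p_y))).sum()
J = I + ((p_theta * p_y) * np.log(p_theta * p_y / p)).sum()

print(H_Y, H_Y_given_T + I)                            # equal
cross = -((p_theta * p_y) * np.log(p / p_theta)).sum() # E_Y E_Θ[-log p(y|θ)]
print(cross, H_Y_given_T + J)                          # equal
```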

Problem: the decompositions hold only for model-specific expectations w.r.t. Y, Ỹ — not for true/common expectations w.r.t. Y, Ỹ! This forces the invocation of a "good model" assumption.

3.3 Summary of crucial issues

- Average targets are Bayesian; representative targets are frequentist.
- Is it meaningful to invoke a good-model assumption already in the definition of a predictive target (e.g. DIC)?
- Is there a definite trade-off between fit and complexity in model comparison? YES — not in any particular IC, but according to the decompositions (laws of probability).

- Is it meaningful to think of models as defined by a focus, and of mc as depending on focus and data?
- How to assess uniformity of performance, i.e. the accuracy of the approximations and estimates used in the derivation of an IC, across models? The decompositions may provide benchmarks.

4. Evaluations of posterior mc

Gaussian case: Y, Ỹ ~ N(θ, Σ), θ ~ N(0, K):
J(Ỹ, Θ_post) = tr((I_p + ΣK⁻¹)⁻¹).
With Σ = (σ²/n) I_p and K = τ² I_p this gives
J(Ỹ, Θ_post) = p · nτ² / (σ² + nτ²) → p as nτ² → ∞.
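
A sketch of the closed form under the stated assumption Σ = (σ²/n) I_p, K = τ² I_p (with p = 1 this is exactly the value p_D estimated in the earlier conjugate-normal sketch):

```python
import numpy as np

def J_post(p, n, sigma2, tau2):
    Sigma = (sigma2 / n) * np.eye(p)
    K = tau2 * np.eye(p)
    return np.trace(np.linalg.inv(np.eye(p) + Sigma @ np.linalg.inv(K)))

print(J_post(3, 10, 1.0, 2.0), 3 * 10 * 2.0 / (1.0 + 10 * 2.0))  # equal
print(J_post(3, 10**6, 1.0, 2.0))                                # → p = 3
```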

Exponential families:
J(Ỹ, Θ_post) = tr(cov_{Θ|y}(θ, E(t(Ỹ) | θ))),
and p_D of DIC estimates J(Ỹ, Θ_post) in exponential families.
In general, by simulation (Plummer, 2002):
J(Ỹ, Θ_post) = E_{Θ⁽¹⁾|y} E_{Θ⁽²⁾|y}[I(p(ỹ | θ⁽¹⁾), p(ỹ | θ⁽²⁾))].
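
A simulation sketch of this pairwise estimator in the same assumed conjugate normal model as before (yᵢ ~ N(θ, σ²), θ ~ N(0, τ²)); there the inner divergence is I(p(· | θ⁽¹⁾), p(· | θ⁽²⁾)) = n(θ⁽¹⁾ − θ⁽²⁾)²/(2σ²), it does not depend on the posterior mean, and the estimate agrees with p_D:

```python
import numpy as np

rng = np.random.default_rng(5)
n, sigma2, tau2 = 20, 1.0, 4.0
v = 1.0 / (n / sigma2 + 1 / tau2)             # posterior variance of θ | y

t1 = rng.normal(0.0, np.sqrt(v), size=100000) # θ⁽¹⁾ | y
t2 = rng.normal(0.0, np.sqrt(v), size=100000) # θ⁽²⁾ | y, independent
J_sim = np.mean(n * (t1 - t2)**2 / (2 * sigma2))
print(J_sim, n * tau2 / (sigma2 + n * tau2))  # ≈ p_D of DIC
```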

5. Discussion

The Bayesian view
- allows one to think of mc as a measure of dependence;
- emphasizes average targets, thus leading more naturally to decompositions of entropy than to Taylor expansions for representative targets.

ICs are only one option to estimate the targets; alternatives:
- with an independence assumption: cross-validation (CV)
- without a true/common distribution of Y, Ỹ: simulation

Singular models, without regularity assumptions, are under investigation in the machine learning community (Watanabe, WAIC, CV).

Information theory drives statistical thinking.
