Aki Vehtari, Aalto University

1 Aki Vehtari, Aalto University 1 / 89 Probabilistic machine learning group, Aalto University: Bayesian theory and methods, approximate integration, model assessment and selection, Gaussian processes, epidemiology and disease risk prediction

2 Stan and demos 2 / 89 Stan homepage with interfaces and documentation Demos used in this presentation are available at

3 Uncertainty and probabilistic modeling 3 / 89 Two types of uncertainty: aleatoric and epistemic Representing uncertainty with probabilities Updating uncertainty

4 Two types of uncertainty 4 / 89
- Aleatoric uncertainty due to randomness
  - we are not able to obtain observations which could reduce this uncertainty
- Epistemic uncertainty due to lack of knowledge
  - we are able to obtain observations which can reduce this uncertainty
  - two observers may have different epistemic uncertainty

5 Updating uncertainty 5 / 89
Probability of red: #red / (#red + #yellow) = θ
p(y = red | θ) = θ    (aleatoric uncertainty)
p(θ)                  (epistemic uncertainty)
Picking many chips updates our uncertainty about the proportion:
p(θ | y = red, yellow, red, red, ...) = ?
Bayes rule: p(θ | y) = p(y | θ) p(θ) / ∫ p(y | θ) p(θ) dθ

6 Model vs. likelihood 6 / 89
Bayes rule: p(θ | y) ∝ p(y | θ) p(θ)
Model: p(y | θ) as a function of y given fixed θ; describes the aleatoric uncertainty
Likelihood: p(y | θ) as a function of θ given fixed y; provides information about the epistemic uncertainty, but is not a probability distribution
Bayes rule combines the likelihood with the prior uncertainty p(θ) and transforms them into the updated posterior uncertainty

7 Practical application 7 / 89
How much does a certain brain area activate, as measured by fMRI, when the subject is shown a scary movie?
We have a lot of information to use, but also uncertainties:
- physics behind magnetic resonance imaging
- model for blood-oxygen-level changes
- effects of heart beat and breathing
- nearby voxels are likely to have similar activation
- some brain areas are connected
- brains of different people are likely to react in a similar way

8 I've been involved in 8 / 89
Brain signal analysis (EEG, MEG, fMRI, MRI), disease risk prediction (CVD, diabetes, cancers, alcohol deaths), GWAS, pharmacology, occupational health care, NMR spectroscopy, concrete quality estimation, steel manufacturing, animal population size estimation

9 GIST example 1 9 / 89 Probability of recurrence with Gaussian processes
[Figure: contour plots of estimated recurrence probability (0% to 90%) as a function of tumor size (cm) and mitotic count, in six panels: gastric, non-gastric, and E-GIST tumors, each with and without rupture.]

10 Stan example 10 / 89 Reaktor: kannattaakokauppa.fi

11 The art of probabilistic modeling 11 / 89
The art of probabilistic modeling is to describe in a mathematical form (model and prior distributions) what we already know and what we don't know
The easy part is to use Bayes rule to update the uncertainties (although there are computational challenges)
Other parts of the art of probabilistic modeling are, for example,
- model checking: is the data in conflict with our prior knowledge?
- presentation: presenting the model and the results to the application experts

12 Reminder: Uncertainty and probabilistic modeling 12 / 89 Two types of uncertainty: aleatoric and epistemic Representing uncertainty with probabilities Updating uncertainty Additional reading material: Dicing with the unknown by Tony O'Hagan

13 Model 13 / 89
Drop a ball from different heights and measure the time: Newton; air resistance, air pressure, shape and surface structure of the ball; relativity
Taking into account the accuracy of the measurements, how accurate a model is needed?
- often simple models are adequate and useful
"All models are wrong, but some of them are useful", George E. P. Box

14 Bayesian methods 14 / 89
Benefits of the Bayesian approach:
- use relevant prior information
- integrate over uncertainties to focus on the interesting parts
- hierarchical models
- model checking and evaluation

15 Some simple models 15 / 89
Binomial model: y ~ Bin(θ, n)
Comparison of two groups with the binomial model: y_1 ~ Bin(θ_1, n_1), y_2 ~ Bin(θ_2, n_2)

16 Some simple models 16 / 89
Normal distribution for continuous data: y ~ N(µ, σ)
Comparison of two groups with continuous data: y_1 ~ N(µ_1, σ_1), y_2 ~ N(µ_2, σ_2)

17 Some simple models 17 / 89
Gaussian linear model: y ~ N(α + βx, σ)
Student's t linear model: y ~ t_ν(α + βx, σ)
Generalized linear model (demo3_6): y ~ Bin(logit^-1(α + βx))

18 Hierarchical model 18 / 89
Hierarchical binomial (demo5_1): y_j ~ Bin(θ_j, n_j), θ_j ~ p(φ)

19 Hierarchical model 19 / 89 Why we (usually) don't have to worry about multiple comparisons
[Figure: treatment effect estimates with uncertainty intervals for each site under classical linear regression, classical linear regression with Bonferroni correction, and a multilevel model; from Gelman, Hill, and Yajima (2012).]

20 Reminder: Bayesian methods 20 / 89
Benefits of the Bayesian approach:
- use relevant prior information
- integrate over uncertainties to focus on the interesting parts
- hierarchical models
- model checking and evaluation

21 Computation 21 / 89
- Analytic: only for very simple models
- Monte Carlo, Markov chain Monte Carlo: generic
- Distributional approximations, e.g. Laplace, variational, expectation propagation: less generic, but can be much faster with sufficient accuracy

22 Approximating distributions with Monte Carlo 22 / 89
Visualize distributions, e.g., with histograms
Compute functions using the posterior draws, g(θ^(l)), where the θ^(l) are draws from p(θ | y)

23 Approximating integrals with Monte Carlo 23 / 89
Posterior expectation of an unknown quantity:
E[θ | y] = ∫ θ p(θ | y) dθ ≈ (1/L) Σ_{l=1}^{L} θ^(l), where the θ^(l) are draws from p(θ | y)
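
As a concrete illustration of the Monte Carlo approximation above, here is a small Python sketch (NumPy only; the draws and variable names are illustrative stand-ins for output of an actual sampler, not part of the course demos):

import numpy as np

# Suppose theta_draws contains L posterior draws theta^(l) from p(theta | y),
# e.g. produced by an MCMC run. Here we just simulate them for illustration.
rng = np.random.default_rng(1)
theta_draws = rng.normal(loc=1.7, scale=0.3, size=4000)

# Monte Carlo approximation of E[theta | y] = integral of theta p(theta|y) dtheta
posterior_mean = np.mean(theta_draws)

# The same draws approximate the expectation of any function g(theta)
posterior_mean_exp = np.mean(np.exp(theta_draws))   # E[exp(theta) | y]
print(posterior_mean, posterior_mean_exp)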

24 How many simulation samples are needed? 24 / 89
If the samples are independent, the usual methods to estimate the uncertainty due to a finite number of observations apply
Markov chain Monte Carlo produces dependent samples, which requires additional work to estimate the effective number of samples

25 How many simulation samples are needed? 25 / 89
Expectation of an unknown quantity: E(θ) ≈ (1/L) Σ_l θ^(l)
- if L is big and the θ^(l) are independent, we may assume that the distribution of the estimate approaches a normal distribution with variance σ²_θ / L (asymptotic normality)
- this variance is independent of the dimensionality of θ
- the total variance is the sum of the epistemic uncertainty in the posterior and the uncertainty due to using a finite number of Monte Carlo samples: σ²_θ + σ²_θ / L = σ²_θ (1 + 1/L)
- e.g. if L = 100, the deviation increases by a factor of sqrt(1 + 1/L) ≈ 1.005, i.e. the Monte Carlo error is very small (for the expectation)
See BDA3 Ch 4 for counter-examples to asymptotic normality

26 How many simulation samples are needed? 26 / 89
Posterior probability: p(θ ∈ A) ≈ (1/L) Σ_l I(θ^(l) ∈ A), where I(θ^(l) ∈ A) = 1 if θ^(l) ∈ A
- I(·) is binomially distributed with probability p = p(θ ∈ A), so var(I(·)) = p(1 - p) (Appendix A, p. 579)
- the standard deviation of the estimate of p is sqrt(p(1 - p)/L)
- if L = 100 and p ≈ 0.5, sqrt(p(1 - p)/L) = 0.05, i.e. the accuracy is about 5 percentage points
- L = 2500 simulation samples are needed for 1 percentage point accuracy
To estimate small probabilities, a large number of samples is needed: to be able to estimate p we need to get samples with θ^(l) ∈ A, which in expectation requires L >> 1/p
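
A minimal sketch of the calculation on this slide in Python, assuming independent draws (the numbers and names are illustrative, not from the demos):

import numpy as np

rng = np.random.default_rng(2)
L = 100
theta_draws = rng.normal(size=L)          # stand-in for L posterior draws

# Estimate p(theta > 1) and its Monte Carlo standard error sqrt(p(1-p)/L)
p_hat = np.mean(theta_draws > 1.0)
mcse = np.sqrt(p_hat * (1 - p_hat) / L)
print(p_hat, mcse)                        # with L=100 the error is a few percentage points

# Draws needed for about 1 percentage point accuracy when p is near 0.5:
print(0.5 * 0.5 / 0.01**2)                # 2500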

27 Markov chain Monte Carlo (MCMC) 27 / 89
Produce samples θ^(t) from a Markov chain which has been constructed so that its equilibrium distribution is p(θ | y)
+ generic
+ the chain goes where most of the posterior mass is
- the samples are dependent
- construction of efficient Markov chains is not always easy

28 Markov chain 28 / 89
A set of random variables θ^1, θ^2, ..., such that for all values of t, p(θ^t | θ^1, ..., θ^(t-1)) = p(θ^t | θ^(t-1))
- starting point θ^0
- transition distribution T_t(θ^t | θ^(t-1)) (may depend on t)
- by choosing a suitable transition distribution, the stationary distribution of the Markov chain is p(θ | y)

29 Gibbs sampling 29 / 89 demo11_1
- When using conditionally conjugate (hyper)priors, sampling from the conditional distributions is easy for a wide range of models, e.g. the hierarchical normal distribution model (WinBUGS/OpenBUGS/JAGS)
- No algorithm parameters to tune (cf. the proposal distribution in the Metropolis algorithm)
- For not-so-easy conditional distributions, it is trivial to use e.g. a grid, Metropolis-Hastings, or slice sampling
- Several parameters can be updated in blocks (blocking, cf. Metropolis-Hastings)
- Gibbs sampling can be very slow if the parameters are highly dependent in the posterior
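
A toy Gibbs sampler in Python, in the spirit of demo11_1 but not copied from it: the target is a bivariate normal with known correlation, so both full conditionals are univariate normals (a sketch under that assumption):

import numpy as np

rho = 0.8                      # correlation of the bivariate normal target
T = 5000
rng = np.random.default_rng(3)
theta = np.zeros((T, 2))
theta[0] = [-2.5, 2.5]         # starting point

for t in range(1, T):
    # sample theta1 | theta2 ~ N(rho*theta2, sqrt(1 - rho^2))
    theta1 = rng.normal(rho * theta[t - 1, 1], np.sqrt(1 - rho**2))
    # sample theta2 | theta1 ~ N(rho*theta1, sqrt(1 - rho^2))
    theta2 = rng.normal(rho * theta1, np.sqrt(1 - rho**2))
    theta[t] = [theta1, theta2]

# discard warm-up draws and check the sample moments
print(theta[100:].mean(axis=0), np.corrcoef(theta[100:].T)[0, 1])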

30 Metropolis algorithm 30 / 89 demo11_2
The Metropolis algorithm and its generalizations are the basis for all MCMC methods
Algorithm:
1. starting point θ^0
2. for t = 1, 2, ...
   (a) pick a proposal θ* from the proposal distribution J_t(θ* | θ^(t-1)); the proposal distribution has to be symmetric, i.e. J_t(θ_a | θ_b) = J_t(θ_b | θ_a) for all θ_a, θ_b
   (b) calculate the acceptance ratio r = p(θ* | y) / p(θ^(t-1) | y)
   (c) set θ^t = θ* with probability min(r, 1), and θ^t = θ^(t-1) otherwise

31 Metropolis algorithm 31 / 89
The Metropolis algorithm and its generalizations are the basis for all MCMC methods
Algorithm:
1. starting point θ^0
2. for t = 1, 2, ...
   (a) pick a proposal θ* from the proposal distribution J_t(θ* | θ^(t-1)); the proposal distribution has to be symmetric, i.e. J_t(θ_a | θ_b) = J_t(θ_b | θ_a) for all θ_a, θ_b
   (b) calculate the acceptance ratio r = p(θ* | y) / p(θ^(t-1) | y)
   (c) set θ^t = θ* with probability min(r, 1), and θ^t = θ^(t-1) otherwise
- instead of p(θ | y), the unnormalized q(θ | y) can be used
- step (c) is executed by generating a random number from U(0, 1)
- rejection of a proposal also increments the time t by one
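
A direct transcription of the algorithm above into Python for a one-dimensional target, using an unnormalized log-density and a symmetric normal proposal (the target and step size are chosen only for illustration, not taken from demo11_2):

import numpy as np

def log_q(theta):
    # unnormalized log target; a standard normal just for illustration
    return -0.5 * theta**2

rng = np.random.default_rng(4)
T = 10000
theta = np.empty(T)
theta[0] = 5.0                                       # starting point theta^0

for t in range(1, T):
    proposal = rng.normal(theta[t - 1], 1.0)         # symmetric proposal J_t
    log_r = log_q(proposal) - log_q(theta[t - 1])    # log acceptance ratio
    if np.log(rng.uniform()) < log_r:                # accept with prob min(r, 1)
        theta[t] = proposal
    else:
        theta[t] = theta[t - 1]                      # a rejection also advances t

print(theta[1000:].mean(), theta[1000:].std())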

32 Hamiltonian Monte Carlo 32 / 89 demo12_1
- Uses gradient information for more efficient sampling
- Alternates dynamic simulation and sampling of the energy level
- Parameters: step size, number of steps in each iteration
- No-U-Turn Sampling adaptively selects the number of steps to improve robustness and efficiency
- Adaptation in Stan: the step size and the mass matrix are estimated during an initial adaptation phase

33 Warm-up (burn-in) and convergence diagnostics 33 / 89
- How long does it take to forget the starting point θ^0? We need to forget the starting point
- Warm-up (burn-in) = remove samples from the beginning of the chain
- When the chain has forgotten the starting point, it has converged (convergence diagnostics)

34 MCMC samples are dependent 34 / 89
- Monte Carlo estimates are still valid
- Estimation of the Monte Carlo error is more difficult: time series analysis, thinning
- Evaluation of the effective sample size based on time series analysis

35 Several chains 35 / 89
- Use of several chains makes convergence diagnostics easier: start the chains from different starting points and use different pseudo-random number generator seeds
- Compare samples from the different chains
- Remove samples from the beginning of the chains and run the chains long enough so that it is not possible to distinguish where each chain started and the chains are well mixed

36 Visual convergence diagnostics 36 / 89 demo11_3
Visual inspection works when there is a small number of quantities and may indicate where the problem is
Visual inspection is hard when there is a large number of quantities

37 Comparison of within and between chain variances ( ˆR) 37 / 89 demo11_4
Examines mixing and stationarity of the chains
To examine stationarity, the chains are split into two parts; after splitting there are m chains, each with n samples, and scalar draws ψ_ij (i = 1, ..., n; j = 1, ..., m)

38 Comparison of within and between chain variances ( ˆR) 38 / 89
BDA3: potential scale reduction ( ˆR)
- compares the means and variances of the chains
- works best for quantities which are approximately normally distributed
- it is good to transform variables, e.g., by taking the logarithm of a positive quantity

39 Comparison of within and between chain variances ( ˆR) 39 / 89
Between-chains variance B:
B = n/(m-1) Σ_{j=1}^{m} (ψ_.j - ψ_..)², where ψ_.j = (1/n) Σ_{i=1}^{n} ψ_ij and ψ_.. = (1/m) Σ_{j=1}^{m} ψ_.j
B/n is the variance of the means of the chains
Within-chains variance W:
W = (1/m) Σ_{j=1}^{m} s_j², where s_j² = 1/(n-1) Σ_{i=1}^{n} (ψ_ij - ψ_.j)²
Estimate the marginal posterior variance var(ψ | y) as a weighted mean of W and B:
var⁺(ψ | y) = ((n-1)/n) W + (1/n) B

40 Comparison of within and between chain variances ( ˆR) 40 / 89
Estimate the marginal posterior variance var(ψ | y) as a weighted mean of W and B:
var⁺(ψ | y) = ((n-1)/n) W + (1/n) B
- this overestimates the marginal posterior variance if the starting points are overdispersed
Given finite n, W underestimates the marginal posterior variance:
- single chains have not yet visited all points in the distribution
- when n → ∞, E(W) → var(ψ | y)
As var⁺(ψ | y) overestimates and W underestimates, compute
ˆR = sqrt( var⁺(ψ | y) / W )

41 Comparison of within and between chain variances ( ˆR) 41 / 89
Potential scale reduction:
ˆR = sqrt( var⁺(ψ | y) / W )
- estimates how much the scale of ψ could be reduced if n → ∞
- ˆR → 1 when n → ∞
- if ˆR is big, keep sampling; big is, e.g., ˆR > 1.1
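
The formulas above translate directly into a few lines of Python; this is a plain sketch and not the exact BDA3 or Stan implementation (it also omits the chain splitting of the previous slides, which would simply reshape the array first):

import numpy as np

def rhat(psi):
    """psi: array of shape (n, m) with n draws from each of m chains."""
    n, m = psi.shape
    chain_means = psi.mean(axis=0)                               # psi_.j
    grand_mean = chain_means.mean()                              # psi_..
    B = n / (m - 1) * np.sum((chain_means - grand_mean) ** 2)    # between-chains variance
    W = np.mean(psi.var(axis=0, ddof=1))                         # within-chains variance, mean of s_j^2
    var_plus = (n - 1) / n * W + B / n                           # weighted variance estimate
    return np.sqrt(var_plus / W)

# Example: four chains of 1000 draws each
rng = np.random.default_rng(5)
draws = rng.normal(size=(1000, 4))
print(rhat(draws))   # close to 1 for well-mixed chains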

42 Potential scale reduction ( ˆR) 42 / 89
If ˆR is close to 1, it is still possible that the chains have not converged:
- if the starting points were not overdispersed
- if the distribution is far from normal
- just by chance when n is finite

43 Problematic distributions 43 / 89 Nonlinear dependencies Funnels Multimodal Long-tailed with undefined variance and mean

44 Time series analysis 44 / 89
Autocorrelation function: describes the correlation at a given lag; can be used to compare the efficiency of MCMC algorithms
[Figure: example autocorrelation plots for two quantities.]

45 Time series analysis 45 / 89
Time series analysis can be used to estimate the Monte Carlo error in the case of MCMC
For the Monte Carlo estimate of the expectation of θ:
Var[estimate] = σ²_θ / (L/τ),
where τ is the sum of the autocorrelations
- τ describes how many dependent samples correspond to one independent sample
- in BDA3, L = nm and n_eff = nm/τ

46 Time series analysis 46 / 89
Estimation of the autocorrelation using several chains:
ˆρ_t = 1 - V_t / (2 var⁺),
where V_t is the variogram
V_t = 1/(m(n-t)) Σ_{j=1}^{m} Σ_{i=t+1}^{n} (ψ_{i,j} - ψ_{i-t,j})²
Compared to the usual method, which computes the autocorrelation from a single chain, this estimate has smaller variance, and var⁺ is estimated using several chains

47 Time series analysis 47 / 89
Estimation of τ:
τ = Σ_{t=1}^{∞} ˆρ_t, where ˆρ_t is the empirical autocorrelation
- the empirical autocorrelation function is noisy and thus the estimate of τ is noisy
- the noise is larger for longer lags (fewer observations)
- a less noisy estimate is obtained by truncating: τ = Σ_{t=1}^{T} ˆρ_t
As τ is estimated from a finite number of samples, its expectation is overoptimistic
- if τ > mn/20, then the estimate is unreliable

48 Geyer's adaptive window estimator 48 / 89
The truncation can be decided adaptively by taking into account some properties of Markov chains
- for a stationary, irreducible, recurrent Markov chain, let Γ_m = ρ_{2m} + ρ_{2m+1}, which is the sum of two consecutive autocorrelations
- Γ_m is a positive, decreasing and convex function of m
Initial positive sequence estimator (Geyer's IPSE):
- choose the largest m such that all values of the sequence ˆΓ_1, ..., ˆΓ_m are positive
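
A rough single-chain sketch of this estimator in Python: compute the empirical autocorrelations, sum them in pairs Γ_m = ρ_{2m} + ρ_{2m+1}, stop at the first non-positive pair, and turn the result into an effective sample size. Note that this follows Geyer's counting convention τ = 1 + 2 Σ_{t≥1} ρ_t, which differs from the bare sum on the previous slide by the normalization; the multi-chain version of these slides would use the variogram estimate of ρ_t instead:

import numpy as np

def neff_ipse(x):
    """Effective sample size of one chain using Geyer's initial positive sequence."""
    x = np.asarray(x, dtype=float)
    L = len(x)
    xc = x - x.mean()
    # empirical autocorrelations rho_0, rho_1, ... via the autocovariance
    acov = np.correlate(xc, xc, mode="full")[L - 1:] / L
    rho = acov / acov[0]
    # pair sums Gamma_m = rho_{2m} + rho_{2m+1}; keep adding while they stay positive
    tau = -rho[0]              # so that tau = 1 + 2*sum(rho_t) after the loop
    for m in range(L // 2):
        gamma = rho[2 * m] + rho[2 * m + 1]
        if gamma <= 0:
            break
        tau += 2 * gamma
    return L / tau

# AR(1) chain with coefficient 0.9: strongly autocorrelated draws
rng = np.random.default_rng(6)
x = np.zeros(20000)
for t in range(1, len(x)):
    x[t] = 0.9 * x[t - 1] + rng.normal()
print(neff_ipse(x))    # much smaller than the nominal 20000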

49 Time series analysis 49 / 89 Effective number of samples: n_eff ≈ L/τ

50 Thinning 50 / 89
Not necessary, but often used
- to save disk space
- to make post-sampling computations faster
- to make estimation of the Monte Carlo error easier
Save every kth sample
- if k > m, where m is from Geyer's method, then the samples are almost independent
- information is lost, as m > τ

51 Stan 51 / 89
Probabilistic software
- describe the data and the model
- let the software do the computation
- automatically uses Markov chain Monte Carlo for inference (now also variational inference)

52 Stan interfaces 52 / 89 CmdStan, RStan, PyStan, MatlabStan (example later), JuliaStan, StataStan

53 Bernoulli model 53 / 89
data {
  int<lower=0> N;
  int<lower=0,upper=1> y[N];
}
parameters {
  real<lower=0,upper=1> theta;
}
model {
  theta ~ beta(1,1);
  for (n in 1:N)
    y[n] ~ bernoulli(theta);
}

54 Vectorized Bernoulli model 54 / 89
data {
  int<lower=0> N;
  int<lower=0,upper=1> y[N];
}
parameters {
  real<lower=0,upper=1> theta;
}
model {
  theta ~ beta(1,1);
  y ~ bernoulli(theta);
}

55 Binomial model 55 / 89
data {
  int<lower=0> N;              // number of experiments
  int<lower=0,upper=N> y;      // number of successes
}
parameters {
  real<lower=0,upper=1> theta; // parameter of the binomial
}
model {
  theta ~ beta(1,1);           // prior
  y ~ binomial(N, theta);      // observation model
}

56 Stan 56 / 89
Stan compiles the model written in the Stan language to C++
- this makes sampling for complex models and bigger data faster
- it also makes Stan models easily portable: you can use your own favorite interface

57 Stan 57 / 89
- Compilation (unless a previously compiled model is available)
- Adaptation
- Warm-up
- Sampling
- Generated quantities
- Save posterior draws
- Report divergences, n_eff, ˆR

58 Difference between proportions 58 / 89 matlabstan_demo or pystan_demo
An experiment was performed to estimate the effect of beta-blockers on the mortality of cardiac patients
A group of patients were randomly assigned to treatment and control groups:
- out of 674 patients receiving the control, 39 died
- out of 680 receiving the treatment, 22 died
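
A sketch of how this comparison could be run from Python with the PyStan 2 interface; the Stan program below is my own minimal two-binomial version for illustration, not the one used in the course demo:

import pystan

model_code = """
data {
  int<lower=0> N1; int<lower=0> y1;   // control: patients, deaths
  int<lower=0> N2; int<lower=0> y2;   // treatment: patients, deaths
}
parameters {
  real<lower=0,upper=1> theta1;
  real<lower=0,upper=1> theta2;
}
model {
  theta1 ~ beta(1,1);
  theta2 ~ beta(1,1);
  y1 ~ binomial(N1, theta1);
  y2 ~ binomial(N2, theta2);
}
"""

data = dict(N1=674, y1=39, N2=680, y2=22)
sm = pystan.StanModel(model_code=model_code)
fit = sm.sampling(data=data, iter=2000, chains=4)
post = fit.extract()
# posterior probability that mortality is lower in the treatment group
print((post["theta2"] < post["theta1"]).mean())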

59 Gaussian linear model 59 / 89
data {
  int<lower=0> N;  // number of data points
  vector[N] x;
  vector[N] y;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
transformed parameters {
  vector[N] mu;
  mu <- alpha + beta * x;
}
model {
  y ~ normal(mu, sigma);
}

60 Kilpisjärvi summer temperature 60 / 89 matlabstan_demo or pystan_demo
Temperature at Kilpisjärvi in June, July and August from 1952 to 2013
Is there a change in the temperature?

61 Priors for Gaussian linear model 61 / 89
data {
  int<lower=0> N;  // number of data points
  vector[N] x;
  vector[N] y;
  real pmualpha;   // prior mean for alpha
  real psalpha;    // prior std for alpha
  real pmubeta;    // prior mean for beta
  real psbeta;     // prior std for beta
}
...
transformed parameters {
  vector[N] mu;
  mu <- alpha + beta * x;
}
model {
  alpha ~ normal(pmualpha, psalpha);
  beta ~ normal(pmubeta, psbeta);
  y ~ normal(mu, sigma);
}

62 Student-t linear model 62 / 89
...
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
  real<lower=1,upper=80> nu;
}
transformed parameters {
  vector[N] mu;
  mu <- alpha + beta * x;
}
model {
  nu ~ gamma(2, 0.1);
  y ~ student_t(nu, mu, sigma);
}

63 Extreme value analysis 63 / 89 Geomagnetic storms

64 Extreme value analysis 64 / 89
data {
  int<lower=0> N;
  vector<lower=0>[N] y;
  int<lower=0> Nt;
  vector<lower=0>[Nt] yt;
}
transformed data {
  real ymax;
  ymax <- max(y);
}
parameters {
  real<lower=0> sigma;
  real<lower=-sigma/ymax> k;
}
model {
  y ~ gpareto(k, sigma);
}
generated quantities {
  vector[Nt] predccdf;
  predccdf <- gpareto_ccdf(yt, k, sigma);
}

65 Functions 65 / 89
functions {
  real gpareto_log(vector y, real k, real sigma) {
    // generalised Pareto log pdf with mu=0
    // should check and give error if k<0 and max(y)/sigma > -1/k
    int N;
    N <- dims(y)[1];
    if (fabs(k) > 1e-15)
      return -(1+1/k) * sum(log1pv(y*k/sigma)) - N*log(sigma);
    else
      return -sum(y/sigma) - N*log(sigma); // limit k->0
  }
  vector gpareto_ccdf(vector y, real k, real sigma) {
    // generalised Pareto log ccdf with mu=0
    // should check and give error if k<0 and max(y)/sigma > -1/k
    if (fabs(k) > 1e-15)
      return exp((-1/k) * log1pv(y/sigma*k));
    else
      return exp(-y/sigma); // limit k->0
  }
}

66 Transformed data 66 / 89 data { i n t <lower=0> p ; i n t <lower=0> N; i n t <lower =0, upper=1> y [N ] ; m a t r ix [N, p ] x ; } transformed data { m a t r ix [N, p ] z ; v e c t or [ p ] mean_x ; v e c t or [ p ] sd_x ; f o r ( j i n 1: p ) { mean_x [ j ] < mean( c o l ( x, j ) ) ; sd_x [ j ] < sd ( c o l ( x, j ) ) ; f o r ( i i n 1:N) z [ i, j ] < ( x [ i, j ] mean_x [ j ] ) / sd_x [ j ] ; } }

67 Hierarchical survival model 67 / 89
functions {
  vector sqrt_vec(vector x) {
    vector[dims(x)[1]] res;
    for (m in 1:dims(x)[1]) {
      res[m] <- sqrt(x[m]);
    }
    return res;
  }
  matrix joint_prior_lp(matrix beta_raw,  // raw beta parameters
                        vector csprime,   // cp and sp
                        vector cs_params,
                        vector r,         // scales of beta
                        matrix V          // eigenvectors of the correlation matrix
                        ) {
    matrix[dims(r)[1], 4] beta;
    csprime[1] ~ beta(cs_params[1], cs_params[2]);
    csprime[2] ~ beta(cs_params[3], cs_params[4]);
    col(beta_raw, 1) ~ normal(0.0, 1.0);
    col(beta_raw, 2) ~ normal(0.0, 1.0);
    col(beta_raw, 3) ~ normal(0.0, 1.0);
    col(beta_raw, 4) ~ normal(0.0, 1.0);
    {
      real c;
      real s;
      real lambda;
      vector[4] eigvals;
      c <- ... / (1.0 - csprime[1]);
      s <- ... / (1.0 - csprime[2]);
      lambda <- sqrt((2.0*c - 1.0) * (2.0*s - 1.0) * (2.0*(c + s) - 1.0)
                     / ((c + s - c*s) * (c + s - 1.0)));
      eigvals[1] <- 1.0;
      eigvals[2] <- 1.0 / sqrt(c);
      eigvals[3] <- 1.0 / sqrt(s);
      eigvals[4] <- 1.0 / sqrt(c + s);
      for (m in 1:dims(r)[1]) {
        beta[m] <- (r[m] * lambda) * (V * (eigvals .* (V * beta_raw[m])));
      }
    }
    return beta;
  }

68 Hierarchical survival model 68 / 89
  vector hs_prior_lp(real r1_global, real r2_global, vector r1_local, vector r2_local) {
    r1_global ~ normal(0.0, 1.0);
    r2_global ~ inv_gamma(0.5, 0.5);
    r1_local ~ normal(0.0, 1.0);
    r2_local ~ inv_gamma(0.5, 0.5);
    return (r1_global * sqrt(r2_global)) * (r1_local .* sqrt_vec(r2_local));
  }
  vector bg_prior_lp(real r_global, vector r_local) {
    r_global ~ normal(0.0, ...);
    r_local ~ inv_chi_square(1.0);
    return r_global * sqrt_vec(r_local);
  }
}
data {
  int<lower=0> NobsNM;  int<lower=0> NobsNW;  int<lower=0> NobsDM;  int<lower=0> NobsDW;
  int<lower=0> NcenNM;  int<lower=0> NcenNW;  int<lower=0> NcenDM;  int<lower=0> NcenDW;
  int<lower=0> M_bg;    int<lower=0> M_biom;
  vector[NobsNM] yobsnm;  vector[NcenNM] ycennm;
  matrix[NobsNM, M_bg] Xobs_bgNM;      matrix[NcenNM, M_bg] Xcen_bgNM;
  matrix[NobsNM, M_biom] Xobs_biomNM;  matrix[NcenNM, M_biom] Xcen_biomNM;
  vector[NobsNW] yobsnw;  vector[NcenNW] ycennw;
  matrix[NobsNW, M_bg] Xobs_bgNW;      matrix[NcenNW, M_bg] Xcen_bgNW;
  matrix[NobsNW, M_biom] Xobs_biomNW;  matrix[NcenNW, M_biom] Xcen_biomNW;
  vector[NobsDM] yobsdm;  vector[NcenDM] ycendm;
  matrix[NobsDM, M_bg] Xobs_bgDM;      matrix[NcenDM, M_bg] Xcen_bgDM;
  matrix[NobsDM, M_biom] Xobs_biomDM;  matrix[NcenDM, M_biom] Xcen_biomDM;
  vector[NobsDW] yobsdw;  vector[NcenDW] ycendw;

69 Hierarchical survival model 69 / 89
  matrix[NobsDW, M_bg] Xobs_bgDW;      matrix[NcenDW, M_bg] Xcen_bgDW;
  matrix[NobsDW, M_biom] Xobs_biomDW;  matrix[NcenDW, M_biom] Xcen_biomDW;
}
transformed data {
  real<lower=0> tau_mu;
  vector<lower=0>[1] tau_al;
  matrix[4,4] V;   // eigenvectors of the correlation matrix
  tau_mu <- 10.0;
  tau_al[1] <- 10.0;
  V[1,1] <- -0.5; V[2,1] <- -0.5; V[3,1] <- -0.5; V[4,1] <- -0.5;
  V[1,2] <- -0.5; V[2,2] <-  0.5; V[3,2] <- -0.5; V[4,2] <-  0.5;
  V[1,3] <- -0.5; V[2,3] <- -0.5; V[3,3] <-  0.5; V[4,3] <-  0.5;
  V[1,4] <-  0.5; V[2,4] <- -0.5; V[3,4] <- -0.5; V[4,4] <-  0.5;
}
parameters {
  vector<lower=0,upper=1>[2] csprime_biom;
  vector<lower=0,upper=1>[2] csprime_bg;
  vector<lower=0,upper=1>[2] csprime_al;
  vector<lower=0>[4] cs_params;   // a_c & b_c, and a_s & b_s
  real<lower=0> tau_s_bg_raw;
  vector<lower=0>[M_bg] tau_bg_raw;
  real<lower=0> tau_s1_biom_raw;
  real<lower=0> tau_s2_biom_raw;
  vector<lower=0>[M_biom] tau1_biom_raw;
  vector<lower=0>[M_biom] tau2_biom_raw;
  matrix[1, 4] alpha_raw;
  matrix[M_bg, 4] beta_bg_raw;
  matrix[M_biom, 4] beta_biom_raw;
  vector[4] mu;
}
transformed parameters {
  matrix[M_biom, 4] beta_biom;
  matrix[M_bg, 4] beta_bg;
  matrix[1, 4] alpha;
  beta_biom <- joint_prior_lp(beta_biom_raw, csprime_biom, cs_params,
                              hs_prior_lp(tau_s1_biom_raw, tau_s2_biom_raw,
                                          tau1_biom_raw, tau2_biom_raw), V);
  beta_bg <- joint_prior_lp(beta_bg_raw, csprime_bg, cs_params,
                            bg_prior_lp(tau_s_bg_raw, tau_bg_raw), V);

70 Hierarchical survival model 70 / 89
  alpha <- exp(joint_prior_lp(alpha_raw, csprime_al, cs_params, tau_al, V));
}
model {
  yobsnm ~ weibull(alpha[1,1], exp(-(mu[1] + Xobs_bgNM * col(beta_bg, 1)
                                     + Xobs_biomNM * col(beta_biom, 1)) / alpha[1,1]));
  yobsnw ~ weibull(alpha[1,2], exp(-(mu[2] + Xobs_bgNW * col(beta_bg, 2)
                                     + Xobs_biomNW * col(beta_biom, 2)) / alpha[1,2]));
  yobsdm ~ weibull(alpha[1,3], exp(-(mu[3] + Xobs_bgDM * col(beta_bg, 3)
                                     + Xobs_biomDM * col(beta_biom, 3)) / alpha[1,3]));
  yobsdw ~ weibull(alpha[1,4], exp(-(mu[4] + Xobs_bgDW * col(beta_bg, 4)
                                     + Xobs_biomDW * col(beta_biom, 4)) / alpha[1,4]));
  increment_log_prob(weibull_ccdf_log(ycennm, alpha[1,1], exp(-(mu[1] + Xcen_bgNM * col(beta_bg, 1)
                                     + Xcen_biomNM * col(beta_biom, 1)) / alpha[1,1])));
  increment_log_prob(weibull_ccdf_log(ycennw, alpha[1,2], exp(-(mu[2] + Xcen_bgNW * col(beta_bg, 2)
                                     + Xcen_biomNW * col(beta_biom, 2)) / alpha[1,2])));
  increment_log_prob(weibull_ccdf_log(ycendm, alpha[1,3], exp(-(mu[3] + Xcen_bgDM * col(beta_bg, 3)
                                     + Xcen_biomDM * col(beta_biom, 3)) / alpha[1,3])));
  increment_log_prob(weibull_ccdf_log(ycendw, alpha[1,4], exp(-(mu[4] + Xcen_bgDW * col(beta_bg, 4)
                                     + Xcen_biomDW * col(beta_biom, 4)) / alpha[1,4])));
  // cs_params ~ gamma(0.5, ...);   // changed to half-normal
  cs_params ~ normal(0.0, ...);
  mu ~ normal(0.0, tau_mu);
}
From Peltola, Havulinna, Salomaa, and Vehtari (2014). Hierarchical Bayesian survival analysis and projective covariate selection in cardiovascular event risk prediction.

71 Ways to estimate the predictive performance 71 / 89
Data estimate (within-sample)
- use the same data to form the posterior and to test the performance
- corresponds to training error
Partial predictive
- split the data in two parts; use one part to form the posterior and the other part to test the performance
- corresponds to test error
Cross-validation
- improved version of the partial approach
- divide the data into several parts (can also be n parts); use different parts to form the posterior and to test the performance
Information criterion
- compute a correction term to the data estimate

72 Why model selection? 72 / 89
Assume a model rich enough to capture a lot of the uncertainties, e.g. Bayesian model averaging (BMA) or a non-parametric model
- model criticism and predictive assessment done
- if we are happy with the model, no need for model selection
- Box: All models are wrong, but some are useful
- there are known unknowns and unknown unknowns
Model selection
- what if some smaller (or more sparse) or parametric model is practically as good?
- which uncertainties can be ignored? (e.g. Student-t vs. Gaussian, irrelevant covariates)
- reduced measurement cost, simpler to explain (e.g. fewer biomarkers, and easier to explain to doctors)

73 Predictive model selection 73 / 89 Goodness of the model is evaluated by its predictive performance Select a simpler model whose predictive performance is similar to the rich model

74 Predictive model 74 / 89
p(ỹ | x̃, D, M_k) is the posterior predictive distribution:
p(ỹ | x̃, D, M_k) = ∫ p(ỹ | x̃, θ, M_k) p(θ | D, x̃, M_k) dθ
- ỹ is a future observation
- x̃ is a future random or controlled covariate value
- D = {(x^(i), y^(i)); i = 1, 2, ..., n}
- M_k is a model
- θ denotes the parameters
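
In terms of posterior draws, the integral above is again just a Monte Carlo average over θ^s; a small Python sketch, assuming a Gaussian linear observation model and using simulated stand-ins for the posterior draws (all names and numbers are illustrative):

import numpy as np

# Stand-ins for posterior draws of (alpha, beta, sigma) of a Gaussian linear model
rng = np.random.default_rng(7)
S = 4000
alpha = rng.normal(2.0, 0.1, S)
beta = rng.normal(0.5, 0.05, S)
sigma = np.abs(rng.normal(1.0, 0.1, S))

x_new = 3.0                      # future covariate value x~
y_grid = np.linspace(0, 8, 200)

def normal_pdf(y, m, s):
    return np.exp(-0.5 * ((y - m) / s) ** 2) / (np.sqrt(2 * np.pi) * s)

# p(y~ | x~, D) ~= (1/S) sum_s p(y~ | x~, theta^s): a mixture of the per-draw densities
pred_density = np.mean(normal_pdf(y_grid[:, None], alpha + beta * x_new, sigma), axis=1)
print(np.trapz(pred_density, y_grid))   # integrates to approximately 1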

75 Predictive performance 75 / 89
The future outcome ỹ is unknown (ignoring x̃ in this slide)
With a known true distribution p_t(ỹ), the expected utility would be
ū(a) = ∫ p_t(ỹ) u(a; ỹ) dỹ,
where u is a utility and a is an action (in our case, a prediction)
Bayes generalization utility:
BU_g = ∫ p_t(ỹ) log p(ỹ | D, M_k) dỹ,
where a = p(· | D, M_k) and u(a; ỹ) = log(a(ỹ))
- the action a is to report the whole predictive distribution
- the utility is the log-density evaluated at ỹ

76 Bayesian predictive methods 76 / 89
Many ways to approximate BU_g = ∫ p_t(ỹ) log p(ỹ | D, M_k) dỹ, for example
- Bayesian cross-validation
- WAIC
- reference predictive methods
Many other Bayesian predictive methods estimating something else, e.g.
- DIC
- L-criterion, posterior predictive criterion
- projection methods
See our survey for more methods

77 M-open, -closed, -completed 77 / 89 Following Bernardo & Smith (1994), there are three different approaches for dealing with the unknown p_t: M-open, M-closed, M-completed

78 M-open 78 / 89
Explicit specification of p_t(ỹ) is avoided by re-using the observed data D as pseudo Monte Carlo samples from the distribution of future data
For example, Bayes leave-one-out cross-validation:
LOO = (1/n) Σ_{i=1}^{n} log p(y_i | x_i, D_{-i}, M_k)

79 Cross-validation 79 / 89
Bayes leave-one-out cross-validation:
LOO = (1/n) Σ_{i=1}^{n} log p(y_i | x_i, D_{-i}, M_k)
- a different part of the data is used to update the posterior and to assess the performance
- almost unbiased estimate for a single model: E[LOO(n)] = E[BU_g(n-1)], where the expectation is taken over all the possible training sets

80 Cross-validation 80 / 89
Naïve computation requires computation of n posteriors
Less computation with
- analytic solutions and approximations available for some models
- importance sampling using the full posterior as the proposal (easy to use with Stan)
- k-fold cross-validation (most robust)

81 Importance sampling 81 / 89
Having draws θ^s from p(θ | D),
p(ỹ_i | x_i, D_{-i}) ≈ Σ_{s=1}^{S} p(ỹ_i | θ^s) w_i^s / Σ_{s=1}^{S} w_i^s,
where the w_i^s are importance weights and
w_i^s = p(θ^s | x_i, D_{-i}) / p(θ^s | D) ∝ 1 / p(y_i | θ^s)
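
Given a matrix of pointwise log-likelihood values log p(y_i | θ^s) (for example from the generated quantities block shown a few slides later), plain IS-LOO is a few lines of Python; a sketch with unsmoothed weights, where the log_lik array is filled with placeholder values just so the snippet runs:

import numpy as np
from scipy.special import logsumexp

# log_lik: array of shape (S, n), entry [s, i] = log p(y_i | theta^s)
rng = np.random.default_rng(8)
log_lik = rng.normal(-1.0, 0.5, size=(4000, 50))   # placeholder values for illustration

# log importance weights: log w_i^s = -log p(y_i | theta^s) (up to a constant per i)
log_w = -log_lik

# self-normalized IS estimate of log p(y_i | x_i, D_-i):
# log( sum_s p(y_i | theta^s) w_i^s / sum_s w_i^s )
loo_i = logsumexp(log_lik + log_w, axis=0) - logsumexp(log_w, axis=0)
elpd_loo = loo_i.sum()
print(elpd_loo)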

82 Truncated importance sampling 82 / 89
The variance of the importance weights w^s in IS-LOO can be large or even infinite
Truncated importance sampling, with truncated weights min(w^s, sqrt(S) w_mean) where w_mean is the average weight, has a finite variance but also some optimistic bias

83 Pareto smoothed importance sampling 83 / 89
The variance of the importance weights in IS-LOO can be large or even infinite
By fitting a generalized Pareto distribution to the tail of the weight distribution we obtain an estimate of the shape parameter k
- if k < 1/2, the variance is finite and the central limit theorem holds
- if 1/2 ≤ k < 1, the variance is infinite but the mean exists, and the generalized central limit theorem holds
- if k ≥ 1, the variance and the mean do not exist; the truncated estimate will have a finite variance but considerable bias
The variance of the IS estimate can be reduced by Pareto smoothing the weights (PSIS-LOO)
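
A rough sketch of the diagnostic side of this idea in Python: fit a generalized Pareto distribution to the largest importance weights of one observation and inspect the shape estimate k. This uses scipy's generic maximum-likelihood fit rather than the estimator in the PSIS paper, the 20% tail fraction is an arbitrary illustrative choice, and the actual smoothing step is omitted:

import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(9)
log_w = rng.normal(size=4000)          # stand-in for log importance weights of one observation
w = np.exp(log_w - log_w.max())        # stabilize by subtracting the maximum

# take the tail: here the largest 20% of the weights
tail = np.sort(w)[-len(w) // 5:]
k_hat, loc, sigma = genpareto.fit(tail - tail.min(), floc=0)
print(k_hat)
# k_hat < 0.5   : variance of the weights is finite, the IS-LOO estimate is reliable
# 0.5 <= k_hat < 1 : infinite variance, convergence is slow
# k_hat >= 1    : the mean does not exist, the estimate cannot be trusted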

84 Pareto smoothed importance sampling 84 / 89
[Figure: IS-LOO, TIS-LOO, PSIS-LOO, and WAIC estimates compared against exact LOO.]
Aki Vehtari, Andrew Gelman and Jonah Gabry (2015). Efficient implementation of leave-one-out cross-validation and WAIC for evaluating fitted Bayesian models. arXiv preprint.

85 Generated quantities for LOO 85 / 89
...
model {
  vector[N] eta;
  eta <- beta0 + z * beta;
  beta ~ normal(0, phi);
  phi ~ double_exponential(0, 10);
  y ~ bernoulli_logit(eta);
}
generated quantities {
  vector[N] log_lik;
  vector[N] eta;
  eta <- beta0 + z * beta;
  for (n in 1:N)
    log_lik[n] <- bernoulli_logit_log(y[n], eta[n]);
}

86 Selection induced bias 86 / 89
Selection induced bias in LOO-CV
- the same data is used to assess the performance and to make the selection
- the selected model fits more to the data
- the LOO-CV estimate for the selected model is biased
- recognized already, e.g., by Stone (1974)
The same holds for many other methods, e.g., DIC/WAIC
The performance of the selection process itself can be assessed using two-level cross-validation, but it does not help in choosing better models
A bigger problem if there is a large number of models, as in covariate selection
Juho Piironen and Aki Vehtari (2015). Comparison of Bayesian predictive methods for model selection

87 Other forms of model selection / hypothesis testing 87 / 89
Marginal posterior probabilities and intervals
- problems when there are posterior dependencies, e.g. due to correlation of covariates
Bayes factor & evidence
- sensitive to the prior, as seen from the predictive interpretation
Posterior (CV) predictive checking

88 Bayes factor 88 / 89
The marginal likelihood in the Bayes factor is also a predictive criterion (chain rule):
p(y | M_k) = p(y_1 | M_k) p(y_2 | y_1, M_k) ... p(y_n | y_1, ..., y_{n-1}, M_k)
Sensitive to the first terms, and not defined if the prior is improper
- especially problematic for models with a large difference in the number of parameters


Graphical Models and Kernel Methods Graphical Models and Kernel Methods Jerry Zhu Department of Computer Sciences University of Wisconsin Madison, USA MLSS June 17, 2014 1 / 123 Outline Graphical Models Probabilistic Inference Directed vs.

More information

STA414/2104 Statistical Methods for Machine Learning II

STA414/2104 Statistical Methods for Machine Learning II STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements

More information

Machine Learning CSE546 Carlos Guestrin University of Washington. September 30, 2013

Machine Learning CSE546 Carlos Guestrin University of Washington. September 30, 2013 Bayesian Methods Machine Learning CSE546 Carlos Guestrin University of Washington September 30, 2013 1 What about prior n Billionaire says: Wait, I know that the thumbtack is close to 50-50. What can you

More information

Introduction to Gaussian Processes

Introduction to Gaussian Processes Introduction to Gaussian Processes Iain Murray murray@cs.toronto.edu CSC255, Introduction to Machine Learning, Fall 28 Dept. Computer Science, University of Toronto The problem Learn scalar function of

More information

Introduction to Machine Learning CMU-10701

Introduction to Machine Learning CMU-10701 Introduction to Machine Learning CMU-10701 Markov Chain Monte Carlo Methods Barnabás Póczos & Aarti Singh Contents Markov Chain Monte Carlo Methods Goal & Motivation Sampling Rejection Importance Markov

More information

Bagging During Markov Chain Monte Carlo for Smoother Predictions

Bagging During Markov Chain Monte Carlo for Smoother Predictions Bagging During Markov Chain Monte Carlo for Smoother Predictions Herbert K. H. Lee University of California, Santa Cruz Abstract: Making good predictions from noisy data is a challenging problem. Methods

More information

An introduction to Bayesian statistics and model calibration and a host of related topics

An introduction to Bayesian statistics and model calibration and a host of related topics An introduction to Bayesian statistics and model calibration and a host of related topics Derek Bingham Statistics and Actuarial Science Simon Fraser University Cast of thousands have participated in the

More information

Eco517 Fall 2013 C. Sims MCMC. October 8, 2013

Eco517 Fall 2013 C. Sims MCMC. October 8, 2013 Eco517 Fall 2013 C. Sims MCMC October 8, 2013 c 2013 by Christopher A. Sims. This document may be reproduced for educational and research purposes, so long as the copies contain this notice and are retained

More information

Disease mapping with Gaussian processes

Disease mapping with Gaussian processes Liverpool, UK, 4 5 November 3 Aki Vehtari Department of Biomedical Engineering and Computational Science (BECS) Outline Example: Alcohol related deaths in Finland Spatial priors and benefits of GP prior

More information

A Review of Pseudo-Marginal Markov Chain Monte Carlo

A Review of Pseudo-Marginal Markov Chain Monte Carlo A Review of Pseudo-Marginal Markov Chain Monte Carlo Discussed by: Yizhe Zhang October 21, 2016 Outline 1 Overview 2 Paper review 3 experiment 4 conclusion Motivation & overview Notation: θ denotes the

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Bayesian Inference: Concept and Practice

Bayesian Inference: Concept and Practice Inference: Concept and Practice fundamentals Johan A. Elkink School of Politics & International Relations University College Dublin 5 June 2017 1 2 3 Bayes theorem In order to estimate the parameters of

More information

INTRODUCTION TO BAYESIAN STATISTICS

INTRODUCTION TO BAYESIAN STATISTICS INTRODUCTION TO BAYESIAN STATISTICS Sarat C. Dass Department of Statistics & Probability Department of Computer Science & Engineering Michigan State University TOPICS The Bayesian Framework Different Types

More information

Default Priors and Effcient Posterior Computation in Bayesian

Default Priors and Effcient Posterior Computation in Bayesian Default Priors and Effcient Posterior Computation in Bayesian Factor Analysis January 16, 2010 Presented by Eric Wang, Duke University Background and Motivation A Brief Review of Parameter Expansion Literature

More information

Markov chain Monte Carlo

Markov chain Monte Carlo Markov chain Monte Carlo Markov chain Monte Carlo (MCMC) Gibbs and Metropolis Hastings Slice sampling Practical details Iain Murray http://iainmurray.net/ Reminder Need to sample large, non-standard distributions:

More information

Multivariate Normal & Wishart

Multivariate Normal & Wishart Multivariate Normal & Wishart Hoff Chapter 7 October 21, 2010 Reading Comprehesion Example Twenty-two children are given a reading comprehsion test before and after receiving a particular instruction method.

More information