Aki Vehtari, Aalto University

1 Aki Vehtari, Aalto University 1 / 89 Probabilistic machine learning group, Aalto University: Bayesian theory and methods, approximate integration, model assessment and selection, Gaussian processes, epidemiology and disease risk prediction

2 Stan and demos 2 / 89 Stan homepage with interfaces and documentation Demos used in this presentation are available at

3 Uncertainty and probabilistic modeling 3 / 89 Two types of uncertainty: aleatoric and epistemic Representing uncertainty with probabilities Updating uncertainty

4 Two types of uncertainty 4 / 89
- Aleatoric uncertainty due to randomness
  - we are not able to obtain observations which could reduce this uncertainty
- Epistemic uncertainty due to lack of knowledge
  - we are able to obtain observations which can reduce this uncertainty
  - two observers may have different epistemic uncertainty

5 Updating uncertainty 5 / 89
Probability of red: #red / (#red + #yellow) = θ
p(y = red | θ) = θ    (aleatoric uncertainty)
p(θ)                  (epistemic uncertainty)
Picking many chips updates our uncertainty about the proportion:
p(θ | y = red, yellow, red, red, ...) = ?
Bayes rule: p(θ | y) = p(y | θ) p(θ) / ∫ p(y | θ) p(θ) dθ

6 Model vs. likelihood 6 / 89
Bayes rule: p(θ | y) ∝ p(y | θ) p(θ)
Model: p(y | θ) as a function of y given fixed θ; describes the aleatoric uncertainty
Likelihood: p(y | θ) as a function of θ given fixed y; provides information about the epistemic uncertainty, but is not a probability distribution
Bayes rule combines the likelihood with the prior uncertainty p(θ) and transforms them into the updated posterior uncertainty

7 Practical application 7 / 89
How much does a certain brain area activate, as measured by fMRI, when the subject is shown a scary movie?
We have a lot of information to use, but also uncertainties:
- physics behind magnetic resonance imaging
- model for blood-oxygen-level changes
- effects of heart beat and breathing
- nearby voxels are likely to have similar activation
- some brain areas are connected
- brains of different people are likely to react in a similar way

8 I've been involved in 8 / 89
Brain signal analysis (EEG, MEG, fMRI, MRI), disease risk prediction (CVD, diabetes, cancers, alcohol deaths), GWAS, pharmacology, occupational health care, NMR spectroscopy, concrete quality estimation, steel manufacturing, animal population size estimation

9 GIST example 1 9 / 89 Probability of recurrence with Gaussian processes
[Figure: contour plots of estimated recurrence probability (0% to 90%) as a function of tumor size (cm) and mitotic count, in six panels: gastric, non-gastric, and E-GIST tumors, each with and without rupture.]

10 Stan example 10 / 89 Reaktor: kannattaakokauppa.fi

11 The art of probabilistic modeling 11 / 89
The art of probabilistic modeling is to describe in a mathematical form (model and prior distributions) what we already know and what we don't know
The easy part is to use Bayes rule to update the uncertainties (although there are computational challenges)
Other parts of the art of probabilistic modeling are, for example,
- model checking: is the data in conflict with our prior knowledge?
- presentation: presenting the model and the results to the application experts

12 Reminder: Uncertainty and probabilistic modeling 12 / 89 Two types of uncertainty: aleatoric and epistemic Representing uncertainty with probabilities Updating uncertainty Additional reading material: Dicing with the unknown by Tony O'Hagan

13 Model 13 / 89
Drop a ball from different heights and measure the time: Newton; air resistance, air pressure, shape and surface structure of the ball; relativity
Taking into account the accuracy of the measurements, how accurate a model is needed?
- often simple models are adequate and useful
"All models are wrong, but some of them are useful", George E. P. Box

14 Bayesian methods 14 / 89
Benefits of the Bayesian approach:
- use relevant prior information
- integrate over uncertainties to focus on the interesting parts
- hierarchical models
- model checking and evaluation

15 Some simple models 15 / 89
Binomial model: y ~ Bin(θ, n)
Comparison of two groups with the binomial model: y_1 ~ Bin(θ_1, n_1), y_2 ~ Bin(θ_2, n_2)

16 Some simple models 16 / 89
Normal distribution for continuous data: y ~ N(µ, σ)
Comparison of two groups with continuous data: y_1 ~ N(µ_1, σ_1), y_2 ~ N(µ_2, σ_2)

17 Some simple models 17 / 89
Gaussian linear model: y ~ N(α + βx, σ)
Student's t linear model: y ~ t_ν(α + βx, σ)
Generalized linear model (demo3_6): y ~ Bin(logit^-1(α + βx))

18 Hierarchical model 18 / 89
Hierarchical binomial (demo5_1): y_j ~ Bin(θ_j, n_j), θ_j ~ p(φ)

19 Hierarchical model 19 / 89 Why we (usually) don't have to worry about multiple comparisons
[Figure: treatment effect estimates with uncertainty intervals for each site under classical linear regression, classical linear regression with Bonferroni correction, and a multilevel model; from Gelman, Hill, and Yajima (2012).]

20 Reminder: Bayesian methods 20 / 89
Benefits of the Bayesian approach:
- use relevant prior information
- integrate over uncertainties to focus on the interesting parts
- hierarchical models
- model checking and evaluation

21 Computation 21 / 89
- Analytic: only for very simple models
- Monte Carlo, Markov chain Monte Carlo: generic
- Distributional approximations, e.g. Laplace, variational, expectation propagation: less generic, but can be much faster with sufficient accuracy

22 Approximating distributions with Monte Carlo 22 / 89
Visualize distributions, e.g., with histograms
Compute functions using the posterior draws, g(θ^(l)), where the θ^(l) are draws from p(θ | y)

23 Approximating integrals with Monte Carlo 23 / 89
Posterior expectation of an unknown quantity:
E[θ | y] = ∫ θ p(θ | y) dθ ≈ (1/L) Σ_{l=1}^{L} θ^(l), where the θ^(l) are draws from p(θ | y)
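
As a concrete illustration of the Monte Carlo approximation above, here is a small Python sketch (NumPy only; the draws and variable names are illustrative stand-ins for output of an actual sampler, not part of the course demos):

import numpy as np

# Suppose theta_draws contains L posterior draws theta^(l) from p(theta | y),
# e.g. produced by an MCMC run. Here we just simulate them for illustration.
rng = np.random.default_rng(1)
theta_draws = rng.normal(loc=1.7, scale=0.3, size=4000)

# Monte Carlo approximation of E[theta | y] = integral of theta p(theta|y) dtheta
posterior_mean = np.mean(theta_draws)

# The same draws approximate the expectation of any function g(theta)
posterior_mean_exp = np.mean(np.exp(theta_draws))   # E[exp(theta) | y]
print(posterior_mean, posterior_mean_exp)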

24 How many simulation samples are needed? 24 / 89
If the samples are independent, the usual methods to estimate the uncertainty due to a finite number of observations apply
Markov chain Monte Carlo produces dependent samples, which requires additional work to estimate the effective number of samples

25 How many simulation samples are needed? 25 / 89
Expectation of an unknown quantity: E(θ) ≈ (1/L) Σ_l θ^(l)
- if L is big and the θ^(l) are independent, we may assume that the distribution of the estimate approaches a normal distribution with variance σ²_θ / L (asymptotic normality)
- this variance is independent of the dimensionality of θ
- the total variance is the sum of the epistemic uncertainty in the posterior and the uncertainty due to using a finite number of Monte Carlo samples: σ²_θ + σ²_θ / L = σ²_θ (1 + 1/L)
- e.g. if L = 100, the deviation increases by a factor of sqrt(1 + 1/L) ≈ 1.005, i.e. the Monte Carlo error is very small (for the expectation)
See BDA3 Ch 4 for counter-examples to asymptotic normality

26 How many simulation samples are needed? 26 / 89
Posterior probability: p(θ ∈ A) ≈ (1/L) Σ_l I(θ^(l) ∈ A), where I(θ^(l) ∈ A) = 1 if θ^(l) ∈ A
- I(·) is binomially distributed with probability p = p(θ ∈ A), so var(I(·)) = p(1 - p) (Appendix A, p. 579)
- the standard deviation of the estimate of p is sqrt(p(1 - p)/L)
- if L = 100 and p ≈ 0.5, sqrt(p(1 - p)/L) = 0.05, i.e. the accuracy is about 5 percentage points
- L = 2500 simulation samples are needed for 1 percentage point accuracy
To estimate small probabilities, a large number of samples is needed: to be able to estimate p we need to get samples with θ^(l) ∈ A, which in expectation requires L >> 1/p
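
A minimal sketch of the calculation on this slide in Python, assuming independent draws (the numbers and names are illustrative, not from the demos):

import numpy as np

rng = np.random.default_rng(2)
L = 100
theta_draws = rng.normal(size=L)          # stand-in for L posterior draws

# Estimate p(theta > 1) and its Monte Carlo standard error sqrt(p(1-p)/L)
p_hat = np.mean(theta_draws > 1.0)
mcse = np.sqrt(p_hat * (1 - p_hat) / L)
print(p_hat, mcse)                        # with L=100 the error is a few percentage points

# Draws needed for about 1 percentage point accuracy when p is near 0.5:
print(0.5 * 0.5 / 0.01**2)                # 2500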

27 Markov chain Monte Carlo (MCMC) 27 / 89
Produce samples θ^(t) from a Markov chain which has been constructed so that its equilibrium distribution is p(θ | y)
+ generic
+ the chain goes where most of the posterior mass is
- the samples are dependent
- construction of efficient Markov chains is not always easy

28 Markov chain 28 / 89
A set of random variables θ^1, θ^2, ..., such that for all values of t, p(θ^t | θ^1, ..., θ^(t-1)) = p(θ^t | θ^(t-1))
- starting point θ^0
- transition distribution T_t(θ^t | θ^(t-1)) (may depend on t)
- by choosing a suitable transition distribution, the stationary distribution of the Markov chain is p(θ | y)

29 Gibbs sampling 29 / 89 demo11_1
- When using conditionally conjugate (hyper)priors, sampling from the conditional distributions is easy for a wide range of models, e.g. the hierarchical normal distribution model (WinBUGS/OpenBUGS/JAGS)
- No algorithm parameters to tune (cf. the proposal distribution in the Metropolis algorithm)
- For not-so-easy conditional distributions, it is trivial to use e.g. a grid, Metropolis-Hastings, or slice sampling
- Several parameters can be updated in blocks (blocking, cf. Metropolis-Hastings)
- Gibbs sampling can be very slow if the parameters are highly dependent in the posterior
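
A toy Gibbs sampler in Python, in the spirit of demo11_1 but not copied from it: the target is a bivariate normal with known correlation, so both full conditionals are univariate normals (a sketch under that assumption):

import numpy as np

rho = 0.8                      # correlation of the bivariate normal target
T = 5000
rng = np.random.default_rng(3)
theta = np.zeros((T, 2))
theta[0] = [-2.5, 2.5]         # starting point

for t in range(1, T):
    # sample theta1 | theta2 ~ N(rho*theta2, sqrt(1 - rho^2))
    theta1 = rng.normal(rho * theta[t - 1, 1], np.sqrt(1 - rho**2))
    # sample theta2 | theta1 ~ N(rho*theta1, sqrt(1 - rho^2))
    theta2 = rng.normal(rho * theta1, np.sqrt(1 - rho**2))
    theta[t] = [theta1, theta2]

# discard warm-up draws and check the sample moments
print(theta[100:].mean(axis=0), np.corrcoef(theta[100:].T)[0, 1])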

30 Metropolis algorithm 30 / 89 demo11_2
The Metropolis algorithm and its generalizations are the basis for all MCMC methods
Algorithm:
1. starting point θ^0
2. for t = 1, 2, ...
   (a) pick a proposal θ* from the proposal distribution J_t(θ* | θ^(t-1)); the proposal distribution has to be symmetric, i.e. J_t(θ_a | θ_b) = J_t(θ_b | θ_a) for all θ_a, θ_b
   (b) calculate the acceptance ratio r = p(θ* | y) / p(θ^(t-1) | y)
   (c) set θ^t = θ* with probability min(r, 1), and θ^t = θ^(t-1) otherwise

31 Metropolis algorithm 31 / 89
The Metropolis algorithm and its generalizations are the basis for all MCMC methods
Algorithm:
1. starting point θ^0
2. for t = 1, 2, ...
   (a) pick a proposal θ* from the proposal distribution J_t(θ* | θ^(t-1)); the proposal distribution has to be symmetric, i.e. J_t(θ_a | θ_b) = J_t(θ_b | θ_a) for all θ_a, θ_b
   (b) calculate the acceptance ratio r = p(θ* | y) / p(θ^(t-1) | y)
   (c) set θ^t = θ* with probability min(r, 1), and θ^t = θ^(t-1) otherwise
- instead of p(θ | y), the unnormalized q(θ | y) can be used
- step (c) is executed by generating a random number from U(0, 1)
- rejection of a proposal also increments the time t by one
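
A direct transcription of the algorithm above into Python for a one-dimensional target, using an unnormalized log-density and a symmetric normal proposal (the target and step size are chosen only for illustration, not taken from demo11_2):

import numpy as np

def log_q(theta):
    # unnormalized log target; a standard normal just for illustration
    return -0.5 * theta**2

rng = np.random.default_rng(4)
T = 10000
theta = np.empty(T)
theta[0] = 5.0                                       # starting point theta^0

for t in range(1, T):
    proposal = rng.normal(theta[t - 1], 1.0)         # symmetric proposal J_t
    log_r = log_q(proposal) - log_q(theta[t - 1])    # log acceptance ratio
    if np.log(rng.uniform()) < log_r:                # accept with prob min(r, 1)
        theta[t] = proposal
    else:
        theta[t] = theta[t - 1]                      # a rejection also advances t

print(theta[1000:].mean(), theta[1000:].std())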

32 Hamiltonian Monte Carlo 32 / 89 demo12_1
- Uses gradient information for more efficient sampling
- Alternates dynamic simulation and sampling of the energy level
- Parameters: step size, number of steps in each iteration
- No-U-Turn Sampling adaptively selects the number of steps to improve robustness and efficiency
- Adaptation in Stan: the step size and the mass matrix are estimated during an initial adaptation phase

33 Warm-up (burn-in) and convergence diagnostics 33 / 89
- How long does it take to forget the starting point θ^0? We need to forget the starting point
- Warm-up (burn-in) = remove samples from the beginning of the chain
- When the chain has forgotten the starting point, it has converged (convergence diagnostics)

34 MCMC samples are dependent 34 / 89
- Monte Carlo estimates are still valid
- Estimation of the Monte Carlo error is more difficult: time series analysis, thinning
- Evaluation of the effective sample size based on time series analysis

35 Several chains 35 / 89
- Use of several chains makes convergence diagnostics easier: start the chains from different starting points and use different pseudo-random number generator seeds
- Compare samples from the different chains
- Remove samples from the beginning of the chains and run the chains long enough so that it is not possible to distinguish where each chain started and the chains are well mixed

36 Visual convergence diagnostics 36 / 89 demo11_3
Visual inspection works when there is a small number of quantities and may indicate where the problem is
Visual inspection is hard when there is a large number of quantities

37 Comparison of within and between chain variances ( ˆR) 37 / 89 demo11_4
Examines mixing and stationarity of the chains
To examine stationarity, the chains are split into two parts; after splitting there are m chains, each with n samples, and scalar draws ψ_ij (i = 1, ..., n; j = 1, ..., m)

38 Comparison of within and between chain variances ( ˆR) 38 / 89
BDA3: potential scale reduction ( ˆR)
- compares the means and variances of the chains
- works best for quantities which are approximately normally distributed
- it is good to transform variables, e.g., by taking the logarithm of a positive quantity

39 Comparison of within and between chain variances ( ˆR) 39 / 89
Between-chains variance B:
B = n/(m-1) Σ_{j=1}^{m} (ψ_.j - ψ_..)², where ψ_.j = (1/n) Σ_{i=1}^{n} ψ_ij and ψ_.. = (1/m) Σ_{j=1}^{m} ψ_.j
B/n is the variance of the means of the chains
Within-chains variance W:
W = (1/m) Σ_{j=1}^{m} s_j², where s_j² = 1/(n-1) Σ_{i=1}^{n} (ψ_ij - ψ_.j)²
Estimate the marginal posterior variance var(ψ | y) as a weighted mean of W and B:
var⁺(ψ | y) = ((n-1)/n) W + (1/n) B

40 Comparison of within and between chain variances ( ˆR) 40 / 89
Estimate the marginal posterior variance var(ψ | y) as a weighted mean of W and B:
var⁺(ψ | y) = ((n-1)/n) W + (1/n) B
- this overestimates the marginal posterior variance if the starting points are overdispersed
Given finite n, W underestimates the marginal posterior variance:
- single chains have not yet visited all points in the distribution
- when n → ∞, E(W) → var(ψ | y)
As var⁺(ψ | y) overestimates and W underestimates, compute
ˆR = sqrt( var⁺(ψ | y) / W )

41 Comparison of within and between chain variances ( ˆR) 41 / 89
Potential scale reduction:
ˆR = sqrt( var⁺(ψ | y) / W )
- estimates how much the scale of ψ could be reduced if n → ∞
- ˆR → 1 when n → ∞
- if ˆR is big, keep sampling; big is, e.g., ˆR > 1.1
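
The formulas above translate directly into a few lines of Python; this is a plain sketch and not the exact BDA3 or Stan implementation (it also omits the chain splitting of the previous slides, which would simply reshape the array first):

import numpy as np

def rhat(psi):
    """psi: array of shape (n, m) with n draws from each of m chains."""
    n, m = psi.shape
    chain_means = psi.mean(axis=0)                               # psi_.j
    grand_mean = chain_means.mean()                              # psi_..
    B = n / (m - 1) * np.sum((chain_means - grand_mean) ** 2)    # between-chains variance
    W = np.mean(psi.var(axis=0, ddof=1))                         # within-chains variance, mean of s_j^2
    var_plus = (n - 1) / n * W + B / n                           # weighted variance estimate
    return np.sqrt(var_plus / W)

# Example: four chains of 1000 draws each
rng = np.random.default_rng(5)
draws = rng.normal(size=(1000, 4))
print(rhat(draws))   # close to 1 for well-mixed chains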

42 Potential scale reduction ( ˆR) 42 / 89
If ˆR is close to 1, it is still possible that the chains have not converged:
- if the starting points were not overdispersed
- if the distribution is far from normal
- just by chance when n is finite

43 Problematic distributions 43 / 89 Nonlinear dependencies Funnels Multimodal Long-tailed with undefined variance and mean

44 Time series analysis 44 / 89
Autocorrelation function: describes the correlation at a given lag; can be used to compare the efficiency of MCMC algorithms
[Figure: example autocorrelation plots for two quantities.]

45 Time series analysis 45 / 89
Time series analysis can be used to estimate the Monte Carlo error in the case of MCMC
For the Monte Carlo estimate of the expectation of θ:
Var[estimate] = σ²_θ / (L/τ),
where τ is the sum of the autocorrelations
- τ describes how many dependent samples correspond to one independent sample
- in BDA3, L = nm and n_eff = nm/τ

46 Time series analysis 46 / 89
Estimation of the autocorrelation using several chains:
ˆρ_t = 1 - V_t / (2 var⁺),
where V_t is the variogram
V_t = 1/(m(n-t)) Σ_{j=1}^{m} Σ_{i=t+1}^{n} (ψ_{i,j} - ψ_{i-t,j})²
Compared to the usual method, which computes the autocorrelation from a single chain, this estimate has smaller variance, and var⁺ is estimated using several chains

47 Time series analysis 47 / 89
Estimation of τ:
τ = Σ_{t=1}^{∞} ˆρ_t, where ˆρ_t is the empirical autocorrelation
- the empirical autocorrelation function is noisy and thus the estimate of τ is noisy
- the noise is larger for longer lags (fewer observations)
- a less noisy estimate is obtained by truncating: τ = Σ_{t=1}^{T} ˆρ_t
As τ is estimated from a finite number of samples, its expectation is overoptimistic
- if τ > mn/20, then the estimate is unreliable

48 Geyer's adaptive window estimator 48 / 89
The truncation can be decided adaptively by taking into account some properties of Markov chains
- for a stationary, irreducible, recurrent Markov chain, let Γ_m = ρ_{2m} + ρ_{2m+1}, which is the sum of two consecutive autocorrelations
- Γ_m is a positive, decreasing and convex function of m
Initial positive sequence estimator (Geyer's IPSE):
- choose the largest m such that all values of the sequence ˆΓ_1, ..., ˆΓ_m are positive
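
A rough single-chain sketch of this estimator in Python: compute the empirical autocorrelations, sum them in pairs Γ_m = ρ_{2m} + ρ_{2m+1}, stop at the first non-positive pair, and turn the result into an effective sample size. Note that this follows Geyer's counting convention τ = 1 + 2 Σ_{t≥1} ρ_t, which differs from the bare sum on the previous slide by the normalization; the multi-chain version of these slides would use the variogram estimate of ρ_t instead:

import numpy as np

def neff_ipse(x):
    """Effective sample size of one chain using Geyer's initial positive sequence."""
    x = np.asarray(x, dtype=float)
    L = len(x)
    xc = x - x.mean()
    # empirical autocorrelations rho_0, rho_1, ... via the autocovariance
    acov = np.correlate(xc, xc, mode="full")[L - 1:] / L
    rho = acov / acov[0]
    # pair sums Gamma_m = rho_{2m} + rho_{2m+1}; keep adding while they stay positive
    tau = -rho[0]              # so that tau = 1 + 2*sum(rho_t) after the loop
    for m in range(L // 2):
        gamma = rho[2 * m] + rho[2 * m + 1]
        if gamma <= 0:
            break
        tau += 2 * gamma
    return L / tau

# AR(1) chain with coefficient 0.9: strongly autocorrelated draws
rng = np.random.default_rng(6)
x = np.zeros(20000)
for t in range(1, len(x)):
    x[t] = 0.9 * x[t - 1] + rng.normal()
print(neff_ipse(x))    # much smaller than the nominal 20000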

49 Time series analysis 49 / 89 Effective number of samples: n_eff ≈ L/τ

50 Thinning 50 / 89
Not necessary, but often used
- to save disk space
- to make post-sampling computations faster
- to make estimation of the Monte Carlo error easier
Save every kth sample
- if k > m, where m is from Geyer's method, then the samples are almost independent
- information is lost, as m > τ

51 Stan 51 / 89
Probabilistic software
- describe the data and the model
- let the software do the computation
- automatically uses Markov chain Monte Carlo for inference (now also variational inference)

52 Stan interfaces 52 / 89 CmdStan, RStan, PyStan, MatlabStan (example later), JuliaStan, StataStan

53 Bernoulli model 53 / 89
data {
  int<lower=0> N;
  int<lower=0,upper=1> y[N];
}
parameters {
  real<lower=0,upper=1> theta;
}
model {
  theta ~ beta(1,1);
  for (n in 1:N)
    y[n] ~ bernoulli(theta);
}

54 Vectorized Bernoulli model 54 / 89
data {
  int<lower=0> N;
  int<lower=0,upper=1> y[N];
}
parameters {
  real<lower=0,upper=1> theta;
}
model {
  theta ~ beta(1,1);
  y ~ bernoulli(theta);
}

55 Binomial model 55 / 89
data {
  int<lower=0> N;              // number of experiments
  int<lower=0,upper=N> y;      // number of successes
}
parameters {
  real<lower=0,upper=1> theta; // parameter of the binomial
}
model {
  theta ~ beta(1,1);           // prior
  y ~ binomial(N, theta);      // observation model
}

56 Stan 56 / 89
Stan compiles the model written in the Stan language to C++
- this makes sampling for complex models and bigger data faster
- it also makes Stan models easily portable: you can use your own favorite interface

57 Stan 57 / 89
- Compilation (unless a previously compiled model is available)
- Adaptation
- Warm-up
- Sampling
- Generated quantities
- Save posterior draws
- Report divergences, n_eff, ˆR

58 Difference between proportions 58 / 89 matlabstan_demo or pystan_demo
An experiment was performed to estimate the effect of beta-blockers on the mortality of cardiac patients
A group of patients were randomly assigned to treatment and control groups:
- out of 674 patients receiving the control, 39 died
- out of 680 receiving the treatment, 22 died
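
A sketch of how this comparison could be run from Python with the PyStan 2 interface; the Stan program below is my own minimal two-binomial version for illustration, not the one used in the course demo:

import pystan

model_code = """
data {
  int<lower=0> N1; int<lower=0> y1;   // control: patients, deaths
  int<lower=0> N2; int<lower=0> y2;   // treatment: patients, deaths
}
parameters {
  real<lower=0,upper=1> theta1;
  real<lower=0,upper=1> theta2;
}
model {
  theta1 ~ beta(1,1);
  theta2 ~ beta(1,1);
  y1 ~ binomial(N1, theta1);
  y2 ~ binomial(N2, theta2);
}
"""

data = dict(N1=674, y1=39, N2=680, y2=22)
sm = pystan.StanModel(model_code=model_code)
fit = sm.sampling(data=data, iter=2000, chains=4)
post = fit.extract()
# posterior probability that mortality is lower in the treatment group
print((post["theta2"] < post["theta1"]).mean())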

59 Gaussian linear model 59 / 89
data {
  int<lower=0> N;  // number of data points
  vector[N] x;
  vector[N] y;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
transformed parameters {
  vector[N] mu;
  mu <- alpha + beta * x;
}
model {
  y ~ normal(mu, sigma);
}

60 Kilpisjärvi summer temperature 60 / 89 matlabstan_demo or pystan_demo
Temperature at Kilpisjärvi in June, July and August from 1952 to 2013
Is there a change in the temperature?

61 Priors for Gaussian linear model 61 / 89
data {
  int<lower=0> N;  // number of data points
  vector[N] x;
  vector[N] y;
  real pmualpha;   // prior mean for alpha
  real psalpha;    // prior std for alpha
  real pmubeta;    // prior mean for beta
  real psbeta;     // prior std for beta
}
...
transformed parameters {
  vector[N] mu;
  mu <- alpha + beta * x;
}
model {
  alpha ~ normal(pmualpha, psalpha);
  beta ~ normal(pmubeta, psbeta);
  y ~ normal(mu, sigma);
}

62 Student-t linear model 62 / 89
...
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
  real<lower=1,upper=80> nu;
}
transformed parameters {
  vector[N] mu;
  mu <- alpha + beta * x;
}
model {
  nu ~ gamma(2, 0.1);
  y ~ student_t(nu, mu, sigma);
}

63 Extreme value analysis 63 / 89 Geomagnetic storms

64 Extreme value analysis 64 / 89
data {
  int<lower=0> N;
  vector<lower=0>[N] y;
  int<lower=0> Nt;
  vector<lower=0>[Nt] yt;
}
transformed data {
  real ymax;
  ymax <- max(y);
}
parameters {
  real<lower=0> sigma;
  real<lower=-sigma/ymax> k;
}
model {
  y ~ gpareto(k, sigma);
}
generated quantities {
  vector[Nt] predccdf;
  predccdf <- gpareto_ccdf(yt, k, sigma);
}

65 Functions 65 / 89
functions {
  real gpareto_log(vector y, real k, real sigma) {
    // generalised Pareto log pdf with mu=0
    // should check and give error if k<0 and max(y)/sigma > -1/k
    int N;
    N <- dims(y)[1];
    if (fabs(k) > 1e-15)
      return -(1+1/k) * sum(log1pv(y*k/sigma)) - N*log(sigma);
    else
      return -sum(y/sigma) - N*log(sigma); // limit k->0
  }
  vector gpareto_ccdf(vector y, real k, real sigma) {
    // generalised Pareto log ccdf with mu=0
    // should check and give error if k<0 and max(y)/sigma > -1/k
    if (fabs(k) > 1e-15)
      return exp((-1/k) * log1pv(y/sigma*k));
    else
      return exp(-y/sigma); // limit k->0
  }
}

66 Transformed data 66 / 89 data { i n t <lower=0> p ; i n t <lower=0> N; i n t <lower =0, upper=1> y [N ] ; m a t r ix [N, p ] x ; } transformed data { m a t r ix [N, p ] z ; v e c t or [ p ] mean_x ; v e c t or [ p ] sd_x ; f o r ( j i n 1: p ) { mean_x [ j ] < mean( c o l ( x, j ) ) ; sd_x [ j ] < sd ( c o l ( x, j ) ) ; f o r ( i i n 1:N) z [ i, j ] < ( x [ i, j ] mean_x [ j ] ) / sd_x [ j ] ; } }

67 Hierarchical survival model 67 / 89
functions {
  vector sqrt_vec(vector x) {
    vector[dims(x)[1]] res;
    for (m in 1:dims(x)[1]) {
      res[m] <- sqrt(x[m]);
    }
    return res;
  }
  matrix joint_prior_lp(matrix beta_raw,  // raw beta parameters
                        vector csprime,   // cp and sp
                        vector cs_params,
                        vector r,         // scales of beta
                        matrix V          // eigenvectors of the correlation matrix
                        ) {
    matrix[dims(r)[1], 4] beta;
    csprime[1] ~ beta(cs_params[1], cs_params[2]);
    csprime[2] ~ beta(cs_params[3], cs_params[4]);
    col(beta_raw, 1) ~ normal(0.0, 1.0);
    col(beta_raw, 2) ~ normal(0.0, 1.0);
    col(beta_raw, 3) ~ normal(0.0, 1.0);
    col(beta_raw, 4) ~ normal(0.0, 1.0);
    {
      real c;
      real s;
      real lambda;
      vector[4] eigvals;
      c <- ... / (1.0 - csprime[1]);
      s <- ... / (1.0 - csprime[2]);
      lambda <- sqrt((2.0*c - 1.0) * (2.0*s - 1.0) * (2.0*(c + s) - 1.0)
                     / ((c + s - c*s) * (c + s - 1.0)));
      eigvals[1] <- 1.0;
      eigvals[2] <- 1.0 / sqrt(c);
      eigvals[3] <- 1.0 / sqrt(s);
      eigvals[4] <- 1.0 / sqrt(c + s);
      for (m in 1:dims(r)[1]) {
        beta[m] <- (r[m] * lambda) * (V * (eigvals .* (V * beta_raw[m])));
      }
    }
    return beta;
  }

68 Hierarchical survival model 68 / 89
  vector hs_prior_lp(real r1_global, real r2_global, vector r1_local, vector r2_local) {
    r1_global ~ normal(0.0, 1.0);
    r2_global ~ inv_gamma(0.5, 0.5);
    r1_local ~ normal(0.0, 1.0);
    r2_local ~ inv_gamma(0.5, 0.5);
    return (r1_global * sqrt(r2_global)) * (r1_local .* sqrt_vec(r2_local));
  }
  vector bg_prior_lp(real r_global, vector r_local) {
    r_global ~ normal(0.0, ...);
    r_local ~ inv_chi_square(1.0);
    return r_global * sqrt_vec(r_local);
  }
}
data {
  int<lower=0> NobsNM;  int<lower=0> NobsNW;  int<lower=0> NobsDM;  int<lower=0> NobsDW;
  int<lower=0> NcenNM;  int<lower=0> NcenNW;  int<lower=0> NcenDM;  int<lower=0> NcenDW;
  int<lower=0> M_bg;    int<lower=0> M_biom;
  vector[NobsNM] yobsnm;  vector[NcenNM] ycennm;
  matrix[NobsNM, M_bg] Xobs_bgNM;      matrix[NcenNM, M_bg] Xcen_bgNM;
  matrix[NobsNM, M_biom] Xobs_biomNM;  matrix[NcenNM, M_biom] Xcen_biomNM;
  vector[NobsNW] yobsnw;  vector[NcenNW] ycennw;
  matrix[NobsNW, M_bg] Xobs_bgNW;      matrix[NcenNW, M_bg] Xcen_bgNW;
  matrix[NobsNW, M_biom] Xobs_biomNW;  matrix[NcenNW, M_biom] Xcen_biomNW;
  vector[NobsDM] yobsdm;  vector[NcenDM] ycendm;
  matrix[NobsDM, M_bg] Xobs_bgDM;      matrix[NcenDM, M_bg] Xcen_bgDM;
  matrix[NobsDM, M_biom] Xobs_biomDM;  matrix[NcenDM, M_biom] Xcen_biomDM;
  vector[NobsDW] yobsdw;  vector[NcenDW] ycendw;

69 Hierarchical survival model 69 / 89
  matrix[NobsDW, M_bg] Xobs_bgDW;      matrix[NcenDW, M_bg] Xcen_bgDW;
  matrix[NobsDW, M_biom] Xobs_biomDW;  matrix[NcenDW, M_biom] Xcen_biomDW;
}
transformed data {
  real<lower=0> tau_mu;
  vector<lower=0>[1] tau_al;
  matrix[4,4] V;   // eigenvectors of the correlation matrix
  tau_mu <- 10.0;
  tau_al[1] <- 10.0;
  V[1,1] <- -0.5; V[2,1] <- -0.5; V[3,1] <- -0.5; V[4,1] <- -0.5;
  V[1,2] <- -0.5; V[2,2] <-  0.5; V[3,2] <- -0.5; V[4,2] <-  0.5;
  V[1,3] <- -0.5; V[2,3] <- -0.5; V[3,3] <-  0.5; V[4,3] <-  0.5;
  V[1,4] <-  0.5; V[2,4] <- -0.5; V[3,4] <- -0.5; V[4,4] <-  0.5;
}
parameters {
  vector<lower=0,upper=1>[2] csprime_biom;
  vector<lower=0,upper=1>[2] csprime_bg;
  vector<lower=0,upper=1>[2] csprime_al;
  vector<lower=0>[4] cs_params;   // a_c & b_c, and a_s & b_s
  real<lower=0> tau_s_bg_raw;
  vector<lower=0>[M_bg] tau_bg_raw;
  real<lower=0> tau_s1_biom_raw;
  real<lower=0> tau_s2_biom_raw;
  vector<lower=0>[M_biom] tau1_biom_raw;
  vector<lower=0>[M_biom] tau2_biom_raw;
  matrix[1, 4] alpha_raw;
  matrix[M_bg, 4] beta_bg_raw;
  matrix[M_biom, 4] beta_biom_raw;
  vector[4] mu;
}
transformed parameters {
  matrix[M_biom, 4] beta_biom;
  matrix[M_bg, 4] beta_bg;
  matrix[1, 4] alpha;
  beta_biom <- joint_prior_lp(beta_biom_raw, csprime_biom, cs_params,
                              hs_prior_lp(tau_s1_biom_raw, tau_s2_biom_raw,
                                          tau1_biom_raw, tau2_biom_raw), V);
  beta_bg <- joint_prior_lp(beta_bg_raw, csprime_bg, cs_params,
                            bg_prior_lp(tau_s_bg_raw, tau_bg_raw), V);

70 Hierarchical survival model 70 / 89
  alpha <- exp(joint_prior_lp(alpha_raw, csprime_al, cs_params, tau_al, V));
}
model {
  yobsnm ~ weibull(alpha[1,1], exp(-(mu[1] + Xobs_bgNM * col(beta_bg, 1)
                                     + Xobs_biomNM * col(beta_biom, 1)) / alpha[1,1]));
  yobsnw ~ weibull(alpha[1,2], exp(-(mu[2] + Xobs_bgNW * col(beta_bg, 2)
                                     + Xobs_biomNW * col(beta_biom, 2)) / alpha[1,2]));
  yobsdm ~ weibull(alpha[1,3], exp(-(mu[3] + Xobs_bgDM * col(beta_bg, 3)
                                     + Xobs_biomDM * col(beta_biom, 3)) / alpha[1,3]));
  yobsdw ~ weibull(alpha[1,4], exp(-(mu[4] + Xobs_bgDW * col(beta_bg, 4)
                                     + Xobs_biomDW * col(beta_biom, 4)) / alpha[1,4]));
  increment_log_prob(weibull_ccdf_log(ycennm, alpha[1,1], exp(-(mu[1] + Xcen_bgNM * col(beta_bg, 1)
                                     + Xcen_biomNM * col(beta_biom, 1)) / alpha[1,1])));
  increment_log_prob(weibull_ccdf_log(ycennw, alpha[1,2], exp(-(mu[2] + Xcen_bgNW * col(beta_bg, 2)
                                     + Xcen_biomNW * col(beta_biom, 2)) / alpha[1,2])));
  increment_log_prob(weibull_ccdf_log(ycendm, alpha[1,3], exp(-(mu[3] + Xcen_bgDM * col(beta_bg, 3)
                                     + Xcen_biomDM * col(beta_biom, 3)) / alpha[1,3])));
  increment_log_prob(weibull_ccdf_log(ycendw, alpha[1,4], exp(-(mu[4] + Xcen_bgDW * col(beta_bg, 4)
                                     + Xcen_biomDW * col(beta_biom, 4)) / alpha[1,4])));
  // cs_params ~ gamma(0.5, ...);   // changed to half-normal
  cs_params ~ normal(0.0, ...);
  mu ~ normal(0.0, tau_mu);
}
From Peltola, Havulinna, Salomaa, and Vehtari (2014). Hierarchical Bayesian survival analysis and projective covariate selection in cardiovascular event risk prediction.

71 Ways to estimate the predictive performance 71 / 89
Data estimate (within-sample)
- use the same data to form the posterior and to test the performance
- corresponds to training error
Partial predictive
- split the data in two parts; use one part to form the posterior and the other part to test the performance
- corresponds to test error
Cross-validation
- improved version of the partial approach
- divide the data into several parts (can also be n parts); use different parts to form the posterior and to test the performance
Information criterion
- compute a correction term to the data estimate

72 Why model selection? 72 / 89
Assume a model rich enough to capture a lot of the uncertainties, e.g. Bayesian model averaging (BMA) or a non-parametric model
- model criticism and predictive assessment done
- if we are happy with the model, no need for model selection
- Box: All models are wrong, but some are useful
- there are known unknowns and unknown unknowns
Model selection
- what if some smaller (or more sparse) or parametric model is practically as good?
- which uncertainties can be ignored? (e.g. Student-t vs. Gaussian, irrelevant covariates)
- reduced measurement cost, simpler to explain (e.g. fewer biomarkers, and easier to explain to doctors)

73 Predictive model selection 73 / 89 Goodness of the model is evaluated by its predictive performance Select a simpler model whose predictive performance is similar to the rich model

74 Predictive model 74 / 89
p(ỹ | x̃, D, M_k) is the posterior predictive distribution:
p(ỹ | x̃, D, M_k) = ∫ p(ỹ | x̃, θ, M_k) p(θ | D, x̃, M_k) dθ
- ỹ is a future observation
- x̃ is a future random or controlled covariate value
- D = {(x^(i), y^(i)); i = 1, 2, ..., n}
- M_k is a model
- θ denotes the parameters
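
In terms of posterior draws, the integral above is again just a Monte Carlo average over θ^s; a small Python sketch, assuming a Gaussian linear observation model and using simulated stand-ins for the posterior draws (all names and numbers are illustrative):

import numpy as np

# Stand-ins for posterior draws of (alpha, beta, sigma) of a Gaussian linear model
rng = np.random.default_rng(7)
S = 4000
alpha = rng.normal(2.0, 0.1, S)
beta = rng.normal(0.5, 0.05, S)
sigma = np.abs(rng.normal(1.0, 0.1, S))

x_new = 3.0                      # future covariate value x~
y_grid = np.linspace(0, 8, 200)

def normal_pdf(y, m, s):
    return np.exp(-0.5 * ((y - m) / s) ** 2) / (np.sqrt(2 * np.pi) * s)

# p(y~ | x~, D) ~= (1/S) sum_s p(y~ | x~, theta^s): a mixture of the per-draw densities
pred_density = np.mean(normal_pdf(y_grid[:, None], alpha + beta * x_new, sigma), axis=1)
print(np.trapz(pred_density, y_grid))   # integrates to approximately 1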

75 Predictive performance 75 / 89
The future outcome ỹ is unknown (ignoring x̃ in this slide)
With a known true distribution p_t(ỹ), the expected utility would be
ū(a) = ∫ p_t(ỹ) u(a; ỹ) dỹ,
where u is a utility and a is an action (in our case, a prediction)
Bayes generalization utility:
BU_g = ∫ p_t(ỹ) log p(ỹ | D, M_k) dỹ,
where a = p(· | D, M_k) and u(a; ỹ) = log(a(ỹ))
- the action a is to report the whole predictive distribution
- the utility is the log-density evaluated at ỹ

76 Bayesian predictive methods 76 / 89
Many ways to approximate BU_g = ∫ p_t(ỹ) log p(ỹ | D, M_k) dỹ, for example
- Bayesian cross-validation
- WAIC
- reference predictive methods
Many other Bayesian predictive methods estimating something else, e.g.
- DIC
- L-criterion, posterior predictive criterion
- projection methods
See our survey for more methods

77 M-open, -closed, -completed 77 / 89 Following Bernardo & Smith (1994), there are three different approaches for dealing with the unknown p_t: M-open, M-closed, M-completed

78 M-open 78 / 89
Explicit specification of p_t(ỹ) is avoided by re-using the observed data D as pseudo Monte Carlo samples from the distribution of future data
For example, Bayes leave-one-out cross-validation:
LOO = (1/n) Σ_{i=1}^{n} log p(y_i | x_i, D_{-i}, M_k)

79 Cross-validation 79 / 89
Bayes leave-one-out cross-validation:
LOO = (1/n) Σ_{i=1}^{n} log p(y_i | x_i, D_{-i}, M_k)
- a different part of the data is used to update the posterior and to assess the performance
- almost unbiased estimate for a single model: E[LOO(n)] = E[BU_g(n-1)], where the expectation is taken over all the possible training sets

80 Cross-validation 80 / 89
Naïve computation requires computation of n posteriors
Less computation with
- analytic solutions and approximations available for some models
- importance sampling using the full posterior as the proposal (easy to use with Stan)
- k-fold cross-validation (most robust)

81 Importance sampling 81 / 89
Having draws θ^s from p(θ | D),
p(ỹ_i | x_i, D_{-i}) ≈ Σ_{s=1}^{S} p(ỹ_i | θ^s) w_i^s / Σ_{s=1}^{S} w_i^s,
where the w_i^s are importance weights and
w_i^s = p(θ^s | x_i, D_{-i}) / p(θ^s | D) ∝ 1 / p(y_i | θ^s)
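
Given a matrix of pointwise log-likelihood values log p(y_i | θ^s) (for example from the generated quantities block shown a few slides later), plain IS-LOO is a few lines of Python; a sketch with unsmoothed weights, where the log_lik array is filled with placeholder values just so the snippet runs:

import numpy as np
from scipy.special import logsumexp

# log_lik: array of shape (S, n), entry [s, i] = log p(y_i | theta^s)
rng = np.random.default_rng(8)
log_lik = rng.normal(-1.0, 0.5, size=(4000, 50))   # placeholder values for illustration

# log importance weights: log w_i^s = -log p(y_i | theta^s) (up to a constant per i)
log_w = -log_lik

# self-normalized IS estimate of log p(y_i | x_i, D_-i):
# log( sum_s p(y_i | theta^s) w_i^s / sum_s w_i^s )
loo_i = logsumexp(log_lik + log_w, axis=0) - logsumexp(log_w, axis=0)
elpd_loo = loo_i.sum()
print(elpd_loo)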

82 Truncated importance sampling 82 / 89
The variance of the importance weights w^s in IS-LOO can be large or even infinite
Truncated importance sampling, with truncated weights min(w^s, sqrt(S) w_mean) where w_mean is the average weight, has a finite variance but also some optimistic bias

83 Pareto smoothed importance sampling 83 / 89
The variance of the importance weights in IS-LOO can be large or even infinite
By fitting a generalized Pareto distribution to the tail of the weight distribution we obtain an estimate of the shape parameter k
- if k < 1/2, the variance is finite and the central limit theorem holds
- if 1/2 ≤ k < 1, the variance is infinite but the mean exists, and the generalized central limit theorem holds
- if k ≥ 1, the variance and the mean do not exist; the truncated estimate will have a finite variance but considerable bias
The variance of the IS estimate can be reduced by Pareto smoothing the weights (PSIS-LOO)
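
A rough sketch of the diagnostic side of this idea in Python: fit a generalized Pareto distribution to the largest importance weights of one observation and inspect the shape estimate k. This uses scipy's generic maximum-likelihood fit rather than the estimator in the PSIS paper, the 20% tail fraction is an arbitrary illustrative choice, and the actual smoothing step is omitted:

import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(9)
log_w = rng.normal(size=4000)          # stand-in for log importance weights of one observation
w = np.exp(log_w - log_w.max())        # stabilize by subtracting the maximum

# take the tail: here the largest 20% of the weights
tail = np.sort(w)[-len(w) // 5:]
k_hat, loc, sigma = genpareto.fit(tail - tail.min(), floc=0)
print(k_hat)
# k_hat < 0.5   : variance of the weights is finite, the IS-LOO estimate is reliable
# 0.5 <= k_hat < 1 : infinite variance, convergence is slow
# k_hat >= 1    : the mean does not exist, the estimate cannot be trusted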

84 Pareto smoothed importance sampling 84 / 89
[Figure: IS-LOO, TIS-LOO, PSIS-LOO, and WAIC estimates compared against exact LOO.]
Aki Vehtari, Andrew Gelman and Jonah Gabry (2015). Efficient implementation of leave-one-out cross-validation and WAIC for evaluating fitted Bayesian models. arXiv preprint.

85 Generated quantities for LOO 85 / 89
...
model {
  vector[N] eta;
  eta <- beta0 + z * beta;
  beta ~ normal(0, phi);
  phi ~ double_exponential(0, 10);
  y ~ bernoulli_logit(eta);
}
generated quantities {
  vector[N] log_lik;
  vector[N] eta;
  eta <- beta0 + z * beta;
  for (n in 1:N)
    log_lik[n] <- bernoulli_logit_log(y[n], eta[n]);
}

86 Selection induced bias 86 / 89
Selection induced bias in LOO-CV
- the same data is used to assess the performance and to make the selection
- the selected model fits more to the data
- the LOO-CV estimate for the selected model is biased
- recognized already, e.g., by Stone (1974)
The same holds for many other methods, e.g., DIC/WAIC
The performance of the selection process itself can be assessed using two-level cross-validation, but it does not help in choosing better models
A bigger problem if there is a large number of models, as in covariate selection
Juho Piironen and Aki Vehtari (2015). Comparison of Bayesian predictive methods for model selection

87 Other forms of model selection / hypothesis testing 87 / 89
Marginal posterior probabilities and intervals
- problems when there are posterior dependencies, e.g. due to correlation of covariates
Bayes factor & evidence
- sensitive to the prior, as seen from the predictive interpretation
Posterior (CV) predictive checking

88 Bayes factor 88 / 89
The marginal likelihood in the Bayes factor is also a predictive criterion (chain rule):
p(y | M_k) = p(y_1 | M_k) p(y_2 | y_1, M_k) ... p(y_n | y_1, ..., y_{n-1}, M_k)
Sensitive to the first terms, and not defined if the prior is improper
- especially problematic for models with a large difference in the number of parameters


Graphical Models and Kernel Methods Graphical Models and Kernel Methods Jerry Zhu Department of Computer Sciences University of Wisconsin Madison, USA MLSS June 17, 2014 1 / 123 Outline Graphical Models Probabilistic Inference Directed vs.

More information

STA414/2104 Statistical Methods for Machine Learning II

STA414/2104 Statistical Methods for Machine Learning II STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements

More information

Machine Learning CSE546 Carlos Guestrin University of Washington. September 30, 2013

Machine Learning CSE546 Carlos Guestrin University of Washington. September 30, 2013 Bayesian Methods Machine Learning CSE546 Carlos Guestrin University of Washington September 30, 2013 1 What about prior n Billionaire says: Wait, I know that the thumbtack is close to 50-50. What can you

More information

Introduction to Gaussian Processes

Introduction to Gaussian Processes Introduction to Gaussian Processes Iain Murray murray@cs.toronto.edu CSC255, Introduction to Machine Learning, Fall 28 Dept. Computer Science, University of Toronto The problem Learn scalar function of

More information

Introduction to Machine Learning CMU-10701

Introduction to Machine Learning CMU-10701 Introduction to Machine Learning CMU-10701 Markov Chain Monte Carlo Methods Barnabás Póczos & Aarti Singh Contents Markov Chain Monte Carlo Methods Goal & Motivation Sampling Rejection Importance Markov

More information

Bagging During Markov Chain Monte Carlo for Smoother Predictions

Bagging During Markov Chain Monte Carlo for Smoother Predictions Bagging During Markov Chain Monte Carlo for Smoother Predictions Herbert K. H. Lee University of California, Santa Cruz Abstract: Making good predictions from noisy data is a challenging problem. Methods

More information

An introduction to Bayesian statistics and model calibration and a host of related topics

An introduction to Bayesian statistics and model calibration and a host of related topics An introduction to Bayesian statistics and model calibration and a host of related topics Derek Bingham Statistics and Actuarial Science Simon Fraser University Cast of thousands have participated in the

More information

Eco517 Fall 2013 C. Sims MCMC. October 8, 2013

Eco517 Fall 2013 C. Sims MCMC. October 8, 2013 Eco517 Fall 2013 C. Sims MCMC October 8, 2013 c 2013 by Christopher A. Sims. This document may be reproduced for educational and research purposes, so long as the copies contain this notice and are retained

More information

Disease mapping with Gaussian processes

Disease mapping with Gaussian processes Liverpool, UK, 4 5 November 3 Aki Vehtari Department of Biomedical Engineering and Computational Science (BECS) Outline Example: Alcohol related deaths in Finland Spatial priors and benefits of GP prior

More information

A Review of Pseudo-Marginal Markov Chain Monte Carlo

A Review of Pseudo-Marginal Markov Chain Monte Carlo A Review of Pseudo-Marginal Markov Chain Monte Carlo Discussed by: Yizhe Zhang October 21, 2016 Outline 1 Overview 2 Paper review 3 experiment 4 conclusion Motivation & overview Notation: θ denotes the

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Bayesian Inference: Concept and Practice

Bayesian Inference: Concept and Practice Inference: Concept and Practice fundamentals Johan A. Elkink School of Politics & International Relations University College Dublin 5 June 2017 1 2 3 Bayes theorem In order to estimate the parameters of

More information

INTRODUCTION TO BAYESIAN STATISTICS

INTRODUCTION TO BAYESIAN STATISTICS INTRODUCTION TO BAYESIAN STATISTICS Sarat C. Dass Department of Statistics & Probability Department of Computer Science & Engineering Michigan State University TOPICS The Bayesian Framework Different Types

More information

Default Priors and Effcient Posterior Computation in Bayesian

Default Priors and Effcient Posterior Computation in Bayesian Default Priors and Effcient Posterior Computation in Bayesian Factor Analysis January 16, 2010 Presented by Eric Wang, Duke University Background and Motivation A Brief Review of Parameter Expansion Literature

More information

Markov chain Monte Carlo

Markov chain Monte Carlo Markov chain Monte Carlo Markov chain Monte Carlo (MCMC) Gibbs and Metropolis Hastings Slice sampling Practical details Iain Murray http://iainmurray.net/ Reminder Need to sample large, non-standard distributions:

More information

Multivariate Normal & Wishart

Multivariate Normal & Wishart Multivariate Normal & Wishart Hoff Chapter 7 October 21, 2010 Reading Comprehesion Example Twenty-two children are given a reading comprehsion test before and after receiving a particular instruction method.

More information