Bayesian Aspects of Statistical Learning Summer School of Statistics 2011


1 Bayesian Aspects of Statistical Learning Summer School of Statistics 2011 Mattias Villani, Division of Statistics, Dept. of Computer and Information Science, Linköping University

2 Overview of my lectures
Narrow view of statistical learning and data mining
An introduction to Bayesian inference
Using Bayes to prevent overfitting: shrinkage priors, variable selection, Bayesian hierarchical models
Flexibility by Bayesian mixture modeling: finite mixture models, Dirichlet process mixture models, Mixture-of-Experts

3 What is Data Mining? Encyclopedia Britannica: data mining, also called knowledge discovery in databases, in computer science, the process of discovering interesting and useful patterns and relationships in large volumes of data. The field combines tools from statistics and artificial intelligence (such as neural networks and machine learning) with database management to analyze large digital collections, known as data sets. Data mining is widely used in business (insurance, banking, retail), science research (astronomy, medicine), and government security (detection of criminals and terrorists).

4 What is Data Mining? Encyclopedia Britannica: data mining, also called knowledge discovery in databases, in computer science, the process of discovering interesting and useful patterns and relationships in large volumes of data. The field combines tools from statistics and artificial intelligence (such as neural networks and machine learning) with database management to analyze large digital collections, known as data sets. Data mining is widely used in business (insurance, banking, retail), science research (astronomy, medicine), and government security (detection of criminals and terrorists).
The three steps of pattern extraction:
Pre-processing. Merging databases. Cleaning of data sets by transformation, filtering and noise-reduction.
Data mining. Algorithmic fitting. Estimation. Learning. [Association learning, Clustering, Classification, Regression]
Validation. Prediction on new data observations (test data). Automated procedures. Machine learning.

5 Uses of Data Mining
Business and marketing. Shopping patterns from scanner data. Directed marketing.
Science. DNA sequence clustering and prediction.
Surveillance. Terrorist threats. Product quality.
Technology. Internet search suggestions. Text mining. Image archives. Spam filters.
Health and medicine. Patient databases. X-rays.

6 What is Statistical Learning? Encyclopedia Britannica: No matches found.

7 What is Statistical Learning? Encyclopedia Britannica: No matches found. MY (current) view: Statistical learning is a branch of statistical inference which deals with the estimation of flexible probabilistic models with a strong focus on predictions, typically applied in data-rich environments, often with a view towards automated procedures.

8 What is Statistical Learning? Encyclopedia Britannica: No matches found. MY (current) view: Statistical learning is a branch of statistical inference which deals with the estimation of flexible probabilistic models with a strong focus on predictions, typically applied in data-rich environments, often with a view towards automated procedures.
Statistical learning terminology: Estimation = Training. Predictive performance = Generalization. Evaluation = Testing, etc.
It has its own theory and methods (Shawe-Taylor).

9 Exciting statistical problems - The death of Statistics? Hal Varian, chief economist at Google: I keep saying that the sexy job in the next 10 years will be statisticians. New York Times, August 5, 2009.

10 Exciting statistical problems - The death of Statistics? Hal Varian, chief economist at Google: I keep saying that the sexy job in the next 10 years will be statisticians. New York Times, August 5, 2009. And I'm not kidding.

11 Exciting statistical problems - The death of Statistics? Hal Varian, chief economist at Google: I keep saying that the sexy job in the next 10 years will be statisticians. New York Times, August 5, 2009. And I'm not kidding. New statistical opportunities, but can statisticians take advantage of them?

12 Exciting statistical problems - The death of Statistics? Hal Varian, chief economist at Google: I keep saying that the sexy job in the next 10 years will be statisticians. New York Times, August 5, 2009. And I'm not kidding. New statistical opportunities, but can statisticians take advantage of them?
Leo Breiman, "Statistical modeling: The two cultures", Statistical Science, 2001: Perhaps the damaging consequence of the insistence on data models is that statisticians have ruled themselves out of some of the most interesting and challenging statistical problems... Breiman: give up on statistical models, go algorithmic. Statistical learning is an attempt to unify the two cultures.

13 Introduction to Bayesian inference
The basics: subjective probability and Bayes' theorem
Warm-ups: Bayesian analysis of Bernoulli trials and of normal data
Advantages of the Bayesian approach
Brief intro to simulation-based methods: Markov Chain Monte Carlo.

14 Warm up: Bernoulli trials
Bernoulli trials: $x_1, \dots, x_n \mid \theta \overset{iid}{\sim} \mathrm{Bern}(\theta)$.
Likelihood: $p(x_1, \dots, x_n \mid \theta) = p(x_1 \mid \theta) \cdots p(x_n \mid \theta) = \theta^s (1-\theta)^f$, where $s = \sum_{i=1}^n x_i$ is the number of successes in the Bernoulli trials and $f = n - s$ is the number of failures.
Given the data $x_1, \dots, x_n$, we may plot $p(x_1, \dots, x_n \mid \theta)$ as a function of $\theta$.

15 The likelihood function from Bernoulli trials
[Figure: likelihood functions of the Bernoulli model plotted over $\theta \in (0,1)$ for several different data configurations (s, f).]

16 Uncertainty and subjective probability
Will the likelihood give us an idea of which values of θ should be regarded as probable (in some sense)? Kind of, but... No! In order to say that one value of θ is more probable than another we clearly must think of θ as random. But θ may be something that we know is non-random, e.g. a fixed natural constant.
Bayesian: it doesn't matter if θ is fixed or random. What matters is whether or not You know the value of θ. If θ is uncertain to You, then You can assign a probability distribution to θ which reflects Your knowledge about θ. Subjective probability.
The prior distribution, p(θ), summarizes your knowledge before observing the data, x. The posterior distribution, p(θ | x), summarizes your knowledge after observing the data, x.

17 Learning from data - Bayes' theorem
How to make the transition p(θ) → p(θ | x)? Bayes' theorem:
$$p(A_i \mid B) = \frac{p(B \mid A_i)\, p(A_i)}{\sum_{i=1}^k p(B \mid A_i)\, p(A_i)}$$
If θ takes on a continuum of values:
$$p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{\int_\theta p(x \mid \theta)\, p(\theta)\, d\theta}.$$
Short form of Bayes' theorem: $p(\theta \mid x) \propto p(x \mid \theta)\, p(\theta)$, or Posterior ∝ Likelihood × Prior.

18 Bernoulli trials - Beta prior
Model: $x_1, \dots, x_n \mid \theta \overset{iid}{\sim} \mathrm{Bern}(\theta)$
Conjugate prior: $\theta \sim \mathrm{Beta}(\alpha, \beta)$, with density $p(y) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, y^{\alpha-1} (1-y)^{\beta-1}$ for $0 \le y \le 1$.
Posterior: $p(\theta \mid x_1, \dots, x_n) \propto p(x_1, \dots, x_n \mid \theta)\, p(\theta) = \theta^s (1-\theta)^f \cdot \theta^{\alpha-1} (1-\theta)^{\beta-1} = \theta^{s+\alpha-1} (1-\theta)^{f+\beta-1}$.
But this is recognized as proportional to the Beta(α + s, β + f) density. That is, the prior-to-posterior mapping reads
$$\theta \sim \mathrm{Beta}(\alpha, \beta) \;\overset{x_1, \dots, x_n}{\longrightarrow}\; \theta \mid x_1, \dots, x_n \sim \mathrm{Beta}(\alpha + s, \beta + f).$$
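A minimal sketch of this prior-to-posterior mapping in code; the counts s, f and the Beta(2, 2) prior are made-up illustrative values, not from the slides:

```python
import numpy as np
from scipy import stats

# Hypothetical data: s successes and f failures in n Bernoulli trials
s, f = 13, 7
alpha, beta = 2.0, 2.0          # assumed Beta(alpha, beta) prior

# Conjugate update: theta | data ~ Beta(alpha + s, beta + f)
posterior = stats.beta(alpha + s, beta + f)

print("Posterior mean:", posterior.mean())            # (alpha + s) / (alpha + beta + s + f)
print("95% equal-tail interval:", posterior.interval(0.95))
```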

19 Spam data: The effect of different priors
[Figure: posterior of θ for the spam data under several different Beta(α, β) priors.]

20 Normal data with known variance - normal prior
Model: $x_1, \dots, x_n \mid \theta, \sigma^2 \sim N(\theta, \sigma^2)$
Prior: $\theta \sim N(\mu, \tau^2)$
Posterior: $\theta \mid x_1, \dots, x_n \sim N(\mu_n, \tau_n^2)$, where
$$\frac{1}{\tau_n^2} = \frac{n}{\sigma^2} + \frac{1}{\tau^2}, \qquad \mu_n = w \bar{x} + (1 - w)\mu, \qquad w = \frac{n/\sigma^2}{n/\sigma^2 + 1/\tau^2}.$$

21 Normal data with known variance - normal prior, cont.
$$\theta \sim N(\mu, \tau^2) \;\overset{x_1, \dots, x_n}{\longrightarrow}\; \theta \mid x_1, \dots, x_n \sim N(\mu_n, \tau_n^2).$$
Posterior precision = Data precision + Prior precision
Posterior mean = [Data precision × (Data mean) + Prior precision × (Prior mean)] / Posterior precision
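A quick numeric sketch of the precision-weighting formulas above; the observations, the known variance and the prior are made-up values:

```python
import numpy as np

x = np.array([1.2, 0.4, 2.1, 1.7, 0.9])    # hypothetical observations
sigma2 = 1.0                                 # known data variance
mu0, tau2 = 0.0, 4.0                         # prior: theta ~ N(mu0, tau2)

n = len(x)
post_prec = n / sigma2 + 1 / tau2            # posterior precision = data + prior precision
tau2_n = 1 / post_prec
w = (n / sigma2) / post_prec                 # weight on the data mean
mu_n = w * x.mean() + (1 - w) * mu0          # precision-weighted posterior mean

print(f"theta | x ~ N({mu_n:.3f}, {tau2_n:.3f})")
```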

22 Bayesianism - What's in it for you?
Prior information. Deep expert knowledge or just smoothness.
You know what to do: p(unknown | known). Probability calculus.
Marginalization is no longer a nuisance: $p(\theta \mid y) = \int p(\theta, \psi \mid y)\, d\psi$
Natural treatment of the prediction problem. Parameter uncertainty: $p(\tilde{y} \mid y) = \int p(\tilde{y} \mid y, \theta)\, p(\theta \mid y)\, d\theta$
Natural treatment of model inference. Model averaging: $p(\tilde{y} \mid y) = \sum_{k=1}^K \Pr(M_k \mid y)\, p(\tilde{y} \mid y, M_k)$
Direct connection to decision theory. Maximize expected utility: $\max_{a \in \mathcal{A}} \int U(a, \theta)\, p(\theta \mid y)\, d\theta$
Can tackle really complicated models. MCMC.

23 Approximate posterior density
Taylor expansion of the log-posterior around the posterior mode $\theta = \hat{\theta}$:
$$\ln p(\theta \mid y) = \ln p(\hat{\theta} \mid y) + \left.\frac{\partial \ln p(\theta \mid y)}{\partial \theta}\right|_{\theta=\hat{\theta}} (\theta - \hat{\theta}) + \frac{1}{2!} \left.\frac{\partial^2 \ln p(\theta \mid y)}{\partial \theta^2}\right|_{\theta=\hat{\theta}} (\theta - \hat{\theta})^2 + \dots$$
Since $\left.\frac{\partial \ln p(\theta \mid y)}{\partial \theta}\right|_{\theta=\hat{\theta}} = 0$, we have in large samples that
$$\ln p(\theta \mid y) \approx \ln p(\hat{\theta} \mid y) - \tfrac{1}{2} J_y(\hat{\theta}) (\theta - \hat{\theta})^2,$$
where $J_y(\hat{\theta}) = -\left.\frac{\partial^2 \ln p(\theta \mid y)}{\partial \theta^2}\right|_{\theta=\hat{\theta}}$ is the observed Fisher information.
Approximate posterior in large samples: $\theta \mid y \approx N\!\left[\hat{\theta}, J_y^{-1}(\hat{\theta})\right]$.
Numerical optimization (Newton-Raphson, BFGS) gives $\hat{\theta}$ and $J_y(\hat{\theta})$.
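A small sketch of this large-sample normal approximation for the Beta-Bernoulli posterior used earlier: the mode is found numerically and the negative second derivative there serves as the observed information. The data and prior are made up, and the second derivative is evaluated analytically for simplicity:

```python
import numpy as np
from scipy import optimize

s, f = 13, 7                      # hypothetical Bernoulli counts
alpha, beta = 2.0, 2.0            # assumed Beta prior

def neg_log_post(theta):
    # negative unnormalized log posterior of theta
    return -((s + alpha - 1) * np.log(theta) + (f + beta - 1) * np.log(1 - theta))

res = optimize.minimize_scalar(neg_log_post, bounds=(1e-6, 1 - 1e-6), method="bounded")
theta_hat = res.x

# Observed information J = -d^2 log p / d theta^2 evaluated at the mode
J = (s + alpha - 1) / theta_hat**2 + (f + beta - 1) / (1 - theta_hat)**2

print("Approximate posterior: N(%.4f, %.6f)" % (theta_hat, 1 / J))
```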

24 Gibbs sampling
Easily implemented method for sampling from multivariate distributions, $p(\theta_1, \dots, \theta_k)$. Typically a posterior distribution.
Requirements: easily sampled full conditional posteriors:
$p(\theta_1 \mid \theta_2, \theta_3, \dots, \theta_k)$
$p(\theta_2 \mid \theta_1, \theta_3, \dots, \theta_k)$
...
$p(\theta_k \mid \theta_1, \theta_2, \dots, \theta_{k-1})$

25 The Gibbs sampling algorithm
A: Choose initial values $\theta_2^{(0)}, \theta_3^{(0)}, \dots, \theta_n^{(0)}$.
B:
B1: Draw $\theta_1^{(1)}$ from $p(\theta_1 \mid \theta_2^{(0)}, \theta_3^{(0)}, \dots, \theta_n^{(0)})$
B2: Draw $\theta_2^{(1)}$ from $p(\theta_2 \mid \theta_1^{(1)}, \theta_3^{(0)}, \dots, \theta_n^{(0)})$
...
Bn: Draw $\theta_n^{(1)}$ from $p(\theta_n \mid \theta_1^{(1)}, \dots, \theta_{n-1}^{(1)})$
C: Repeat Step B N times.

26 Gibbs sampling, cont.
The Gibbs draws $\theta^{(1)}, \theta^{(2)}, \dots, \theta^{(N)}$ are dependent, but arithmetic means converge to expected values:
$$\frac{1}{N} \sum_{t=1}^N \theta_j^{(t)} \to E(\theta_j), \qquad \frac{1}{N} \sum_{t=1}^N g(\theta^{(t)}) \to E[g(\theta)].$$
More generally, the Gibbs sequence $\theta^{(1)}, \theta^{(2)}, \dots, \theta^{(N)}$ converges in distribution to the target posterior $p(\theta_1, \dots, \theta_k)$, and $\theta_j^{(1)}, \dots, \theta_j^{(N)}$ converge to the marginal distribution of $\theta_j$, $p(\theta_j)$.

27 Examples: Gibbs sampling
Bivariate normal: joint distribution
$$\begin{pmatrix} \theta_1 \\ \theta_2 \end{pmatrix} \sim N_2\left[\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\right]$$
Full conditional posteriors:
$$\theta_1 \mid \theta_2 \sim N[\mu_1 + \rho(\theta_2 - \mu_2),\; 1 - \rho^2], \qquad \theta_2 \mid \theta_1 \sim N[\mu_2 + \rho(\theta_1 - \mu_1),\; 1 - \rho^2].$$
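A minimal Gibbs sampler for this bivariate normal example; the parameter values (means zero, ρ = 0.9), the number of draws and the starting point are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
mu1, mu2, rho = 0.0, 0.0, 0.9     # assumed parameter values
N = 5000

draws = np.empty((N, 2))
th1, th2 = 10.0, 10.0             # deliberately bad initial values
for t in range(N):
    # theta1 | theta2 ~ N(mu1 + rho*(theta2 - mu2), 1 - rho^2)
    th1 = rng.normal(mu1 + rho * (th2 - mu2), np.sqrt(1 - rho**2))
    # theta2 | theta1 ~ N(mu2 + rho*(theta1 - mu1), 1 - rho^2)
    th2 = rng.normal(mu2 + rho * (th1 - mu1), np.sqrt(1 - rho**2))
    draws[t] = th1, th2

print("Posterior means:", draws[1000:].mean(axis=0))        # discard burn-in
print("Sample correlation:", np.corrcoef(draws[1000:].T)[0, 1])
```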

28 Bivariate normal - Initial values don't matter
[Figure: Gibbs sampling paths for the bivariate normal started from four different initial values; all chains move into the same high-probability region.]

29 Bivariate normal - Initial values don't matter
[Figure: Gibbs draws for the bivariate normal from the four different initial values.]

30 Gibbs sampling - Bivariate normal
[Figure: Monte Carlo summaries from the Gibbs output: density estimate of f(x1), estimated contours of f(x1, x2), density estimate of f[sin(x1)], cumulative estimate of E[sin(x1)], cumulative estimate of a posterior tail probability for X1, and density estimate of f[π·cos(x1)·x2^3].]

31 The Metropolis Algorithm
Initialize with $\theta^{(0)} = \theta_0$. For t = 1, 2, ...:
Sample a proposal draw $\theta^* \sim q_t(\theta^* \mid \theta^{(t-1)})$.
Accept $\theta^*$ with probability
$$r(\theta^{(t-1)} \to \theta^*) = \min\left[\frac{p(\theta^* \mid y)}{p(\theta^{(t-1)} \mid y)},\, 1\right].$$
If the proposal is accepted, set $\theta^{(t)} = \theta^*$, otherwise set $\theta^{(t)} = \theta^{(t-1)}$.
Every proposal $\theta^*$ that lies uphill is always accepted. Downhill moves are accepted with probability $r(\theta^{(t-1)} \to \theta^*)$.
It is enough if we can compute the unnormalized posterior density $p(y \mid \theta)\, p(\theta)$ for any θ.
$q_t(\theta^* \mid \theta^{(t-1)})$ must be symmetric, i.e. $q_t(\theta_a \mid \theta_b) = q_t(\theta_b \mid \theta_a)$.
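A sketch of a random-walk Metropolis sampler for the Bernoulli posterior from the warm-up example; the data, the flat prior and the proposal scale 0.1 are illustrative assumptions, and the acceptance step is done on the log scale:

```python
import numpy as np

rng = np.random.default_rng(1)
s, f = 13, 7                       # hypothetical Bernoulli counts, uniform prior

def log_post(theta):
    if theta <= 0 or theta >= 1:
        return -np.inf             # outside the support of theta
    return s * np.log(theta) + f * np.log(1 - theta)

N, scale = 10000, 0.1
theta = 0.5
draws = np.empty(N)
accepted = 0
for t in range(N):
    prop = theta + rng.normal(0, scale)            # symmetric random-walk proposal
    # accept with prob min(p(prop|y)/p(theta|y), 1), computed on the log scale
    if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
        theta = prop
        accepted += 1
    draws[t] = theta

print("Acceptance rate:", accepted / N)
print("Posterior mean (after burn-in):", draws[2000:].mean())
```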

32 Metropolis - Choosing the proposal distribution
Common choice of proposal distribution:
$$q_t(\theta^* \mid \theta^{(t-1)}) = N\left[\theta^{(t-1)}, c^2 J^{-1}(\hat{\theta})\right],$$
where c is a tuning constant.
A good proposal $q_t(\theta^* \mid \theta^{(t-1)})$ should have the following properties: easy to sample; easy to compute $r(\theta^{(t-1)} \to \theta^*)$; takes reasonably large jumps in the parameter space; the jumps are not rejected too frequently.
Set c so that the average acceptance probability is somewhere between 0.2 and 0.4.

33 Practical Implementation of MCMC Algorithms
The autocorrelation in the simulated sequence $\theta^{(1)}, \theta^{(2)}, \dots, \theta^{(N)}$ makes it somewhat problematic to define the effective number of simulation draws.
Inefficiency factor: $\mathrm{IF} = 1 + 2 \sum_{i=1}^{\infty} \rho_i$, where $\rho_i$ is the autocorrelation at lag i.
Effective sample size: $\mathrm{ESS} = N / \mathrm{IF}$.
When do we stop sampling? How many burn-in iterations to discard? Several short sequences or a single long sequence? To thin out or not to thin out? Convergence diagnostics.
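A rough sketch of estimating the inefficiency factor and effective sample size from MCMC output; the autocorrelations are simply summed up to a fixed maximum lag (a crude truncation rule) and the AR(1) test sequence is made up:

```python
import numpy as np

def inefficiency_factor(draws, max_lag=100):
    """IF = 1 + 2 * sum of autocorrelations; ESS = N / IF."""
    x = np.asarray(draws, dtype=float)
    x = x - x.mean()
    n = len(x)
    acov = np.correlate(x, x, mode="full")[n - 1:] / n   # autocovariances at lags 0, 1, ...
    rho = acov / acov[0]                                 # autocorrelations
    IF = 1 + 2 * rho[1:max_lag + 1].sum()
    return IF, n / IF

# Example on an AR(1) sequence with known persistence 0.8
rng = np.random.default_rng(2)
x = np.empty(5000)
x[0] = 0.0
for t in range(1, len(x)):
    x[t] = 0.8 * x[t - 1] + rng.normal()

IF, ess = inefficiency_factor(x)
print(f"IF = {IF:.1f}, ESS = {ess:.0f} out of {len(x)} draws")
```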

34 The Metropolis-Hastings algorithm
Generalization of the Metropolis algorithm to non-symmetric proposals. The acceptance probability is slightly more complicated:
$$r(\theta^{(t-1)} \to \theta^*) = \min\left[\frac{p(\theta^* \mid y)/q_t(\theta^* \mid \theta^{(t-1)})}{p(\theta^{(t-1)} \mid y)/q_t(\theta^{(t-1)} \mid \theta^*)},\, 1\right].$$
Gibbs sampling is a special case of the MH algorithm where the proposal is the full conditional posterior and every draw is accepted.
Independence MH: $q_t(\theta^* \mid \theta^{(t-1)}) = q_t(\theta^*)$. Example: $\theta^* \sim N[\hat{\theta}, J^{-1}(\hat{\theta})]$.
Metropolis-Hastings-within-Gibbs: $p(\theta_1 \mid \theta_2, x)$ is an easily sampled distribution, but $p(\theta_2 \mid \theta_1, x)$ is not easily sampled, so it is updated with an MH step.

35 Bayesian model inference
Comparing two models $p_1(x \mid \theta_1)$ and $p_2(x \mid \theta_2)$ would be easy if $\theta_1$ and $\theta_2$ were known. Usually they aren't. Bayes: average with respect to the prior.
Marginal likelihood: $p(x) = \int p(x \mid \theta)\, p(\theta)\, d\theta$
Bayes factor to compare models: $\mathrm{BF}_{12}(x) = p_1(x)/p_2(x)$
The marginal likelihood is a measure of out-of-sample forecasting performance:
$$p(y_1, \dots, y_n) = p(y_1)\, p(y_2 \mid y_1) \cdots p(y_n \mid y_1, y_2, \dots, y_{n-1}), \qquad p(y_t \mid y_1, \dots, y_{t-1}) = \int p(y_t \mid \theta)\, p(\theta \mid y_1, \dots, y_{t-1})\, d\theta.$$
The marginal likelihood is usually very sensitive to the prior. Log Predictive Score (LPS).

36 Example: Bayesian hypothesis testing, Bernoulli case
Hypothesis testing is just a special case of model selection:
$M_0$: $x_1, \dots, x_n \overset{iid}{\sim} \mathrm{Bernoulli}(\theta_0)$
$M_1$: $x_1, \dots, x_n \overset{iid}{\sim} \mathrm{Bernoulli}(\theta)$, $\theta \sim \mathrm{Beta}(\alpha, \beta)$
$$p(x_1, \dots, x_n \mid M_0) = \theta_0^y (1-\theta_0)^{n-y}, \qquad p(x_1, \dots, x_n \mid M_1) = \int_0^1 \theta^y (1-\theta)^{n-y}\, B(\alpha, \beta)^{-1} \theta^{\alpha-1} (1-\theta)^{\beta-1}\, d\theta = B(y+\alpha, n-y+\beta)/B(\alpha, \beta),$$
where $y = \sum_{i=1}^n x_i$ and $B(\cdot, \cdot)$ is the Beta function. Bayes factor:
$$\mathrm{BF}_{01}(x) = \frac{\theta_0^y (1-\theta_0)^{n-y}}{B(y+\alpha, n-y+\beta)/B(\alpha, \beta)}.$$
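A direct numerical evaluation of this Bayes factor; the counts, the point null θ0 = 0.5 and the Beta(1, 1) prior under M1 are made-up illustrative choices:

```python
import numpy as np
from scipy.special import betaln

y, n = 62, 100                 # hypothetical number of successes and trials
theta0 = 0.5                   # point null under M0
alpha, beta = 1.0, 1.0         # Beta prior under M1

# log marginal likelihoods of the two models
log_m0 = y * np.log(theta0) + (n - y) * np.log(1 - theta0)
log_m1 = betaln(y + alpha, n - y + beta) - betaln(alpha, beta)

print("log BF_01 =", log_m0 - log_m1)
print("BF_01 =", np.exp(log_m0 - log_m1))
```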

37 Approximate marginal likelihoods
Taylor approximation:
$$p(y \mid \theta)\, p(\theta) \approx p(y \mid \hat{\theta})\, p(\hat{\theta}) \exp\left[-\tfrac{1}{2} J_y(\hat{\theta}) (\theta - \hat{\theta})^2\right],$$
which can be integrated analytically using properties of the multivariate normal pdf.
The Laplace approximation:
$$\ln \hat{p}(y) = \ln p(y \mid \hat{\theta}) + \ln p(\hat{\theta}) + \tfrac{1}{2} \ln \left|J_y^{-1}(\hat{\theta})\right| + \frac{k}{2} \ln(2\pi),$$
where k is the number of unrestricted parameters in the model.
Cruder version of the Laplace: the BIC approximation
$$\ln \hat{p}(y) = \ln p(y \mid \hat{\theta}) + \ln p(\hat{\theta}) - \frac{k}{2} \ln n.$$

38 Model averaging
Collection of models $M_1, \dots, M_q$. Posterior model probabilities: $\Pr(M_i \mid x) \propto p(x \mid M_i)\, p(M_i)$.
Bayesian model averaging: let ξ be any unknown quantity whose interpretation is the same across models. Then
$$p(\xi \mid x) = \sum_{i=1}^q \Pr(M_i \mid x)\, p(\xi \mid M_i, x).$$
Bayesian prediction (ξ = future value of the process) takes into account: i) population uncertainty (the error variance), ii) parameter uncertainty, iii) model uncertainty.

39 Linear regression - Uniform prior
The linear regression model in matrix form:
$$\underset{(n \times 1)}{y} = \underset{(n \times k)}{X}\, \underset{(k \times 1)}{\beta} + \underset{(n \times 1)}{\varepsilon}$$
Standard non-informative prior: uniform on (β, log σ), i.e. $p(\beta, \sigma^2) \propto \sigma^{-2}$.
Joint posterior of β and σ²: $p(\beta, \sigma^2 \mid y) = p(\beta \mid \sigma^2, y)\, p(\sigma^2 \mid y)$, with
$$\beta \mid \sigma^2, y \sim N\left[\hat{\beta},\, \sigma^2 (X'X)^{-1}\right], \qquad \sigma^2 \mid y \sim \mathrm{Inv}\text{-}\chi^2(n-k, s^2),$$
$$\hat{\beta} = (X'X)^{-1} X'y, \qquad s^2 = \frac{1}{n-k} (y - X\hat{\beta})'(y - X\hat{\beta}).$$
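A sketch of direct Monte Carlo sampling from this posterior: draw σ² from its scaled inverse-χ² marginal, then β | σ² from the conditional normal. The simulated regression data and the true coefficients are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data from a linear model with intercept and two covariates
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.7, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
s2 = (y - X @ beta_hat) @ (y - X @ beta_hat) / (n - k)

ndraws = 2000
betas = np.empty((ndraws, k))
for i in range(ndraws):
    sigma2 = (n - k) * s2 / rng.chisquare(n - k)                      # sigma^2 | y ~ Inv-chi2(n-k, s2)
    betas[i] = rng.multivariate_normal(beta_hat, sigma2 * XtX_inv)    # beta | sigma^2, y

print("Posterior means:", betas.mean(axis=0))
print("Posterior std devs:", betas.std(axis=0))
```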

40 Avoiding overfitting in flexible models - the Bayesian way
Flexible models can be too flexible. Overfitting.
A good guard against overfitting: always pay attention to out-of-sample predictive performance.
Three Bayesian ways to avoid overfitting:
Zero-restrictions on parameters: variable selection in regression, covariance selection, etc.
Smoothness priors: don't set parameters to zero, shrink them all towards zero.
Hierarchical models.

41 Preventing overfitting 1: Shrinkage priors
Flexible nonlinear regression - the extreme version: let all n ordinates be unknown parameters: $y_i = \gamma_i + \varepsilon_i$.
Problem: too many parameters. The estimated curve wiggles way too much.
The usual solution: $\gamma_i = x_i \beta$. May be too restrictive.
Bayes: use a prior on $\gamma = (\gamma_1, \dots, \gamma_n)$ that carries the info that the regression curve is expected to be smooth: if $x_i$ and $x_k$ are close, then $\gamma_i$ is close to $\gamma_k$.
Possible implementation: order the data with respect to the covariate and assign the prior $\gamma_i \mid \gamma_{i-1} \sim N(\gamma_{i-1}, \tau^2)$, for $i = 2, \dots, n$.

42 Polynomial regression is linear regression on a new basis
Polynomial regression: $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_k x^k + \varepsilon$.
This can be written as a linear regression $y = x^P \beta + \varepsilon$, where $x^P = (1, x, x^2, \dots, x^k)$.
Polynomials are usually a bad idea. They are global: changing an observation in one part of the covariate space can affect the fit in regions far from the modified data point.

43 Polynomial basis functions
[Figure: quadratic regression fit, and the constant, linear and quadratic basis functions plotted against x.]

44 Splines
The perhaps most popular approach to non-parametric regression in applied work uses so-called spline functions.
First: change-point analysis using piecewise constant dummies. Use m change-points (knots) $k_1 < k_2 < \dots < k_m$. Construct a 'dummy covariate' for each change-point:
$$b_{ij} = \begin{cases} 1 & \text{if } x_i > k_j \\ 0 & \text{otherwise.} \end{cases}$$
Not smooth, the regression line has sudden jumps. Smoother: truncated linear splines
$$b_{ij} = \begin{cases} x_i - k_j & \text{if } x_i > k_j \\ 0 & \text{otherwise} \end{cases} = (x_i - k_j)_+.$$
Generalization: truncated power splines
$$b_{ij} = \begin{cases} (x_i - k_j)^p & \text{if } x_i > k_j \\ 0 & \text{otherwise} \end{cases} = (x_i - k_j)_+^p.$$
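A small helper that builds these truncated power basis columns for a given set of knots; the knot placement and the evaluation grid below are arbitrary illustrative values:

```python
import numpy as np

def truncated_power_basis(x, knots, p=1):
    """Columns (x - k_j)_+^p for each knot k_j; p=1 gives truncated linear splines."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([np.where(x > k, (x - k) ** p, 0.0) for k in knots])

x = np.linspace(0, 1, 11)
knots = [0.25, 0.5, 0.75]
B = truncated_power_basis(x, knots, p=1)
print(B.round(2))
```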

45 Truncated polynomial basis functions
[Figure: piecewise linear regression fit, and the constant basis function, the linear basis function and the truncated basis function (x − 0.4)+ plotted against x.]

46 Splines, cont.
Note: given the knots, the non-parametric spline regression model is a linear regression of y on the m 'dummy variables' $b_j$:
$$y = x_b \beta + \varepsilon,$$
where $x_b$ is the vector of basis functions $x_b = (b_1, \dots, b_m)$.
It is also common to include an intercept and the linear part of the model separately. In this case we have $x_b = (1, x, b_1, \dots, b_m)$.
Additive model for the case of r > 1 covariates:
$$y = \sum_{j=1}^r f_j(x_j) + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I_n),$$
where each $f_j(x_j)$ is a spline.

47 Shrinkage priors for splines
Problem: too many knots leads to overfitting. Solution: shrinkage prior
$$\beta_i \overset{iid}{\sim} N(0, \lambda^2),$$
where λ determines how smooth the regression function is (smaller λ gives a smoother function).
Equivalent to a penalized likelihood: $\mathrm{LogLik} - \frac{1}{2\lambda^2}\, \beta'\beta$. Ridge regression.
It is also possible to treat λ as an unknown quantity and estimate it from data. Prior, example: $\lambda \sim \mathrm{Gamma}(\alpha, \beta)$.
The famous Lasso variable selection method is equivalent to using the posterior mode estimate under the prior $\beta_i \overset{iid}{\sim} \mathrm{Laplace}(0, \lambda^{-1})$, where the Laplace density is
$$p(x) = \frac{1}{2b} \exp\left(-\frac{|x - \mu|}{b}\right).$$
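Under a Gaussian likelihood this shrinkage prior gives a closed-form posterior mean, the ridge estimator. A sketch under assumptions of my own: the simulated data, the noise variance, the knot grid and the λ values are illustrative, and for simplicity the intercept and linear term are shrunk along with the spline coefficients:

```python
import numpy as np

def ridge_posterior_mean(X, y, sigma2, lam2):
    """Posterior mean of beta under y ~ N(X beta, sigma2 I), beta_i ~ N(0, lam2)."""
    k = X.shape[1]
    A = X.T @ X / sigma2 + np.eye(k) / lam2
    return np.linalg.solve(A, X.T @ y / sigma2)

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 1, 80))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=len(x))

# Design matrix: intercept, linear term, and truncated linear spline basis
knots = np.linspace(0.1, 0.9, 15)
X = np.column_stack([np.ones_like(x), x] + [np.clip(x - k, 0, None) for k in knots])

for lam2 in (0.01, 1.0, 100.0):          # smaller lam2 -> stronger shrinkage, smoother fit
    beta = ridge_posterior_mean(X, y, sigma2=0.09, lam2=lam2)
    print(f"lam2={lam2:g}: max |spline coefficient| = {np.abs(beta[2:]).max():.3f}")
```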

48 Bayesian spline with smoothness prior
[Figure: LIDAR data (LogRatio vs Range) with the estimated E(y | x) from the Bayesian spline for different values of the shrinkage parameter λ.]

49 Preventing overfitting 2: Bayesian variable selection
Selecting the knots in a spline regression is exactly like variable/covariate selection in linear regression. Bayesian variable selection is ideal here.
Introduce variable selection indicators $I_j$ such that $\beta_j = 0$ if $I_j = 0$, and $\beta_j \sim N(0, \lambda^2)$ if $I_j = 1$.
Need a prior on $I_1, \dots, I_K$. Simple choice: $I_1, \dots, I_K \mid \theta \overset{iid}{\sim} \mathrm{Bernoulli}(\theta)$.
Simulate from the posterior distribution:
$$p(\beta, \sigma^2, I_1, \dots, I_K \mid y) = p(\beta, \sigma^2 \mid I_1, \dots, I_K, y)\, p(I_1, \dots, I_K \mid y).$$
Simulate from $p(I_1, \dots, I_K \mid y)$ using Gibbs sampling. Automatic model averaging, all in one simulation run.

50 General Bayesian variable selection - A simple approach
The previous algorithm only works when we can integrate out all the model parameters to obtain
$$p(I \mid y) = \int p(\beta, \sigma^2, I \mid y)\, d\beta\, d\sigma^2.$$
More generally we do MH and propose β and I jointly from the proposal distribution $q(\beta_p \mid \beta_c, I_p)\, q(I_p \mid I_c)$.
Main difficulty: how to propose the non-zero elements in $\beta_p$?
Simple approach: numerical optimization on the posterior with all variables in the model to obtain $\beta \mid y \overset{\text{approx}}{\sim} N\left[\hat{\beta}, J_y^{-1}(\hat{\beta})\right]$. Propose from $N\left[\hat{\beta}, J_y^{-1}(\hat{\beta})\right]$, conditional on the zero restrictions implied by $I_p$.

51 Finite step Newton proposals
Consider a class of regression models with likelihood functions in the following general form:
$$p(y \mid x, \beta_\mu, \beta_\varphi) = \prod_{i=1}^n p(y_i \mid x_i, \mu_i, \varphi_i), \qquad g(\mu_i) = x_i' \beta_\mu, \qquad h(\varphi_i) = x_i' \beta_\varphi.$$
Example: heteroscedastic Gaussian regression:
$$y_i = \mu_i + \varepsilon_i, \quad \varepsilon_i \overset{iid}{\sim} N(0, \varphi_i^2), \qquad \mu_i = x_i' \beta_\mu, \qquad \ln \varphi_i = x_i' \beta_\varphi.$$
Gibbs sampling. How to sample from the full conditionals i) $p(\beta_\mu \mid \beta_\varphi, y)$ and ii) $p(\beta_\varphi \mid \beta_\mu, y)$?
Naive, time-consuming way: Newton's method in each updating step.
Finite-step Newton: don't iterate all the way to the mode, a few steps are enough.

52 Finite step Newton proposals with variable selection
Variable selection: how to propose $\beta_\mu$ conditional on $I_\mu$? Finite-step Newton with variable dimension.
Exploit: $g(\mu_i) = x_i' \beta$ always has the same dimension, and $g(\mu_{ic}) = x_i' \beta_c$ and $g(\mu_{ip}) = x_i' \beta_p$ are expected to be quite close.
Important: only need to code the likelihood, gradient (and Hessian) at the model parameter level (e.g. μ, φ). Typically trivial. Efficient.

53 Bayesian knot selection - Lidar example

54 Bayesian knot selection - Lidar example

55 Bayesian knot selection - Lidar example

56 Preventing overfitting 3: Hierarchical modeling
Example: $y_j \mid \theta_j \sim \mathrm{Bin}(n_j, \theta_j)$, $j = 1, \dots, J$.
We could do inference on each $\theta_j$ separately. Problem: $n_j$ may be small for some j. Not much info then about $\theta_j$.
If you knew $\theta_j$, would that give information about $\theta_i$, $i \neq j$? If so, then inference about the parameters $\theta_j$, $j = 1, \dots, J$, may 'borrow strength' from each other.
Extreme case: assume $\theta_j = \theta$ for all j. Define $y = \sum_{j=1}^J y_j$ and $n = \sum_{j=1}^J n_j$. Straightforward to analyze θ with the usual Beta-Binomial approach.
Intermediate case: tie the θ's together by assuming the prior $\theta_j \overset{iid}{\sim} \mathrm{Beta}(\alpha, \beta)$.

57 Flexibility by mixture models
Two-component mixture of normals [MoN(2)]:
$$p(x) = \pi\, \phi(x; \mu_1, \sigma_1^2) + (1 - \pi)\, \phi(x; \mu_2, \sigma_2^2),$$
where $\phi(x; \mu, \sigma^2)$ denotes the PDF of a normal variate with mean μ and variance σ².
Simulate from a MoN(2): simulate an indicator $I \sim \mathrm{Bern}(\pi)$. If I = 1, simulate x from $N(\mu_1, \sigma_1^2)$; otherwise simulate x from $N(\mu_2, \sigma_2^2)$.
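Simulating from a MoN(2) exactly as described; the component weights, means and standard deviations are arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(5)
pi, mu1, s1, mu2, s2 = 0.7, 0.0, 1.0, 4.0, 0.5   # assumed mixture parameters

n = 10000
I = rng.uniform(size=n) < pi                      # component indicator, P(I = 1) = pi
x = np.where(I, rng.normal(mu1, s1, n), rng.normal(mu2, s2, n))

print("Simulated mean:", x.mean())
print("Theoretical mean:", pi * mu1 + (1 - pi) * mu2)
```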

58 Illustration of mixture distributions
[Figure: densities of normal mixtures producing bimodality, fat tails, outliers, skewness, and a shape close to uniform.]

59 Mixture distributions, cont.
Not easy to estimate directly - the likelihood is a product of sums.
Assume that we knew which of the two densities each observation came from:
$$I_i = \begin{cases} 1 & \text{if } x_i \text{ came from Density 1} \\ 2 & \text{if } x_i \text{ came from Density 2.} \end{cases}$$
Armed with knowledge of $I_1, \dots, I_n$ it is now easy to estimate $\pi, \mu_1, \sigma_1^2, \mu_2, \sigma_2^2$. But we do not know $I_1, \dots, I_n$!

60 Mixture distributions, cont.
Gibbs sampling to the rescue. Assume: prior $\pi \sim \mathrm{Beta}(\alpha_1, \alpha_2)$ and a conjugate prior for each $(\mu_j, \sigma_j^2)$. Let $n_1 = \sum_{i=1}^n 1\{I_i = 1\}$ and $n_2 = n - n_1$.
Algorithm:
$\pi \mid I, x \sim \mathrm{Beta}(\alpha_1 + n_1, \alpha_2 + n_2)$
$\sigma_1^2 \mid \mu_1, I, x \sim \mathrm{Inv}\text{-}\chi^2$ and $\mu_1 \mid I, \sigma^2, x \sim N$
$\sigma_2^2 \mid \mu_2, I, x \sim \mathrm{Inv}\text{-}\chi^2$ and $\mu_2 \mid I, \sigma^2, x \sim N$
$I_i = 1 \mid \pi, \mu_1, \sigma_1^2, \mu_2, \sigma_2^2, x \sim \mathrm{Bern}(\theta_i)$, $i = 1, \dots, n$, where
$$\theta_i = \frac{\pi\, \phi(x_i; \mu_1, \sigma_1^2)}{\pi\, \phi(x_i; \mu_1, \sigma_1^2) + (1 - \pi)\, \phi(x_i; \mu_2, \sigma_2^2)}.$$
This generalizes to K > 2 components: the full conditional posterior of the mixture weights is then Dirichlet, and each mixture indicator is drawn from its (multinomial) full conditional.

61 Multinomial model with Dirichlet prior
Data: $y = (y_1, \dots, y_K)$, where $y_k$ counts the number of observations in the kth category, $\sum_{k=1}^K y_k = n$. Example: brand choices.
Multinomial model:
$$p(y \mid \theta) \propto \prod_{k=1}^K \theta_k^{y_k}, \quad \text{where } \sum_{k=1}^K \theta_k = 1.$$
Conjugate prior: $\mathrm{Dirichlet}(\pi_1, \dots, \pi_K; \alpha)$:
$$p(\theta) \propto \prod_{k=1}^K \theta_k^{\alpha \pi_k - 1}.$$
Moments of $\theta = (\theta_1, \dots, \theta_K) \sim \mathrm{Dirichlet}(\pi_1, \dots, \pi_K; \alpha)$:
$$E(\theta_k) = \pi_k, \qquad V(\theta_k) = \frac{\pi_k (1 - \pi_k)}{\alpha + 1}.$$
Note that α is a precision parameter.

62 Multinomial model with Dirichlet prior, cont.
Prior-to-posterior updating:
Model: $y = (y_1, \dots, y_K) \sim \mathrm{Multin}(n; \theta_1, \dots, \theta_K)$
Prior: $\theta = (\theta_1, \dots, \theta_K) \sim \mathrm{Dirichlet}(\pi_1, \dots, \pi_K; \alpha)$
Posterior: $\theta \mid y \sim \mathrm{Dirichlet}\left(\dfrac{\alpha \pi_1 + y_1}{\alpha + n}, \dots, \dfrac{\alpha \pi_K + y_K}{\alpha + n};\; \alpha + n\right)$
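The Dirichlet-multinomial update in code; the counts and the prior below are made up, and the posterior is parameterized directly by its shape parameters $\alpha \pi_k + y_k$ (equivalent to the mean/precision form above):

```python
import numpy as np
from scipy import stats

y = np.array([20, 50, 30])                    # hypothetical category counts, n = 100
alpha = 6.0                                   # prior precision
pi = np.array([1/3, 1/3, 1/3])                # prior mean

post_shape = alpha * pi + y                   # posterior Dirichlet shape parameters
post_mean = post_shape / post_shape.sum()     # = (alpha*pi_k + y_k) / (alpha + n)

print("Posterior mean of theta:", post_mean)
print("One posterior draw:", stats.dirichlet(post_shape).rvs()[0])
```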

63 Bayesian Nonparametrics - The Dirichlet process
By partitioning the sample space, the Dirichlet distribution can be used to construct Bayesian histograms. The Dirichlet distribution is a distribution over discrete distributions.
What happens when the number of bins goes to infinity? The Dirichlet process:
$$x \mid G \sim G, \qquad G \sim DP(G_0, \alpha),$$
where G is a random distribution function following the Dirichlet process with base measure $G_0$ (the mean) and precision parameter α.
Posterior: $G \mid x_1, \dots, x_n \sim DP\left(\dfrac{\alpha G_0 + n F_n}{\alpha + n},\; \alpha + n\right)$, where $F_n$ is the empirical distribution function.

64 Bayesian nonparametrics - The Dirichlet process
Sethuraman's representation of the $DP(G_0, \alpha)$ process:
$$G = \sum_{i=1}^{\infty} p_i\, \delta_{\theta_i},$$
where $\delta_\theta$ is a Dirac point mass at θ, and $\theta_1, \theta_2, \dots \overset{iid}{\sim} G_0$.
The weights $p_i$ are given by stick-breaking:
$$p_1 = V_1, \quad p_2 = V_2(1 - V_1), \quad p_3 = V_3(1 - V_1)(1 - V_2), \quad \dots$$
and $V_1, V_2, \dots$ are iid $\mathrm{Beta}(1, \alpha)$, so $E(V_i) = 1/(1 + \alpha)$.
Realizations from G are almost surely discrete. Draws from G may have ties.
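A truncated stick-breaking draw from $DP(G_0, \alpha)$; the standard normal base measure $G_0$, the value of α and the truncation at 1000 atoms are illustrative assumptions:

```python
import numpy as np

def stick_breaking_draw(alpha, base_rvs, n_atoms=1000, rng=None):
    """Approximate draw G = sum_i p_i * delta_{theta_i} from DP(G0, alpha),
    truncated to n_atoms atoms."""
    if rng is None:
        rng = np.random.default_rng()
    V = rng.beta(1.0, alpha, size=n_atoms)
    # p_i = V_i * prod_{j<i} (1 - V_j)
    p = V * np.cumprod(np.concatenate(([1.0], 1 - V[:-1])))
    theta = base_rvs(n_atoms)                             # atom locations theta_i ~ G0
    return p, theta

rng = np.random.default_rng(6)
p, theta = stick_breaking_draw(alpha=2.0, base_rvs=lambda m: rng.normal(size=m), rng=rng)
print("Total weight captured by 1000 atoms:", p.sum())    # close to 1
print("Weight of the largest atom:", p.max())
```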

65 Dirichlet process mixtures
Finite mixture:
$$x_i \mid \theta_i \sim N(\theta_i, \sigma^2), \qquad \Pr(\theta_i = \theta_j) = \pi_j, \; j = 1, \dots, k, \qquad \theta_j \overset{iid}{\sim} N(\mu_0, \tau_0^2).$$
Dirichlet process mixture:
$$x_i \mid \theta_i \sim N(\theta_i, \sigma^2), \qquad \theta_i \mid G \overset{iid}{\sim} G, \qquad G \sim DP(G_0, \alpha).$$
But note that the discreteness of the DP will lead to ties in the $\theta_i$. The number of active mixture components is estimated from the data. Predictions take into account that the next observation may belong to a completely new component.

66 Mixture-of-Experts
Class of models for the conditional density p(y | x). Finite mixture model with mixture weights depending on covariates.
SAGM model (heteroscedastic Gaussian components):
$$p(y \mid x) = \sum_{j=1}^k \pi_j(x)\, \phi\left[y;\, \mu_j(x), \sigma_j(x)\right],$$
where both the mean and the (log) variance of the components are linear functions of covariates.
Bayesian variable selection in μ, σ and π. Finite-step Newton MCMC.

67 Generalized Smooth Mixtures
Mixture-of-Experts type of model:
$$p(y \mid x) = \sum_{j=1}^k \pi_j(x)\, p_j(y \mid x, \mu_j, \varphi_j), \qquad g(\mu_j) = x' \beta_{\mu,j}, \qquad h(\varphi_j) = x' \beta_{\varphi,j}.$$
Combines much of the above: mixtures, variable selection, shrinkage, and the Log Predictive Score (for choosing k).

68 Firm leverage data
Proportion data. Response: leverage = total debt / (total debt + book value of equity). 445 non-financial US firms with positive sales.
Covariates:
tang (tangible assets / book value of total assets)
market2book ((book value of total assets - book value of equity + market value of equity) / book value of total assets)
logsales
profit (earnings before interest, taxes, depreciation, and amortization / book value of total assets)

69 Firm leverage data
[Figure: scatter plots of Leverage against Tang, Market2Book, LogSale and Profit, on the original and logit scales.]

70 Smooth mixture of Beta regressions
Previous literature ignored that the response is a proportion. Recent literature acknowledges this, but typically ignores non-linearities. Here: smooth mixture of Beta regression models
$$y_i \mid \mu_i, \varphi_i \sim \mathrm{Beta}\left[\mu_i \varphi_i,\, (1 - \mu_i)\varphi_i\right], \qquad E(y_i \mid x_i) = \mu_i, \qquad \mathrm{Var}(y_i \mid x_i) = \frac{\mu_i(1 - \mu_i)}{1 + \varphi_i}.$$
We use a logit link for μ and a log link for φ:
$$\ln \frac{\mu_i}{1 - \mu_i} = \alpha_\mu + x_i' \beta_\mu, \qquad \ln \varphi_i = \alpha_\varphi + x_i' \beta_\varphi.$$

71 Model fit - one Beta component
[Figure: the data and the fit from a one-component mixture, Leverage plotted against Profit.]

72 Model fit - three Beta components
[Figure: the data and the fit from a three-component mixture, Leverage plotted against Profit.]


More information

Nonparametric Bayes Uncertainty Quantification

Nonparametric Bayes Uncertainty Quantification Nonparametric Bayes Uncertainty Quantification David Dunson Department of Statistical Science, Duke University Funded from NIH R01-ES017240, R01-ES017436 & ONR Review of Bayes Intro to Nonparametric Bayes

More information

Probabilistic Graphical Models Lecture 20: Gaussian Processes

Probabilistic Graphical Models Lecture 20: Gaussian Processes Probabilistic Graphical Models Lecture 20: Gaussian Processes Andrew Gordon Wilson www.cs.cmu.edu/~andrewgw Carnegie Mellon University March 30, 2015 1 / 53 What is Machine Learning? Machine learning algorithms

More information

Bayesian Inference for DSGE Models. Lawrence J. Christiano

Bayesian Inference for DSGE Models. Lawrence J. Christiano Bayesian Inference for DSGE Models Lawrence J. Christiano Outline State space-observer form. convenient for model estimation and many other things. Preliminaries. Probabilities. Maximum Likelihood. Bayesian

More information

CSC321 Lecture 18: Learning Probabilistic Models

CSC321 Lecture 18: Learning Probabilistic Models CSC321 Lecture 18: Learning Probabilistic Models Roger Grosse Roger Grosse CSC321 Lecture 18: Learning Probabilistic Models 1 / 25 Overview So far in this course: mainly supervised learning Language modeling

More information

Learning Bayesian network : Given structure and completely observed data

Learning Bayesian network : Given structure and completely observed data Learning Bayesian network : Given structure and completely observed data Probabilistic Graphical Models Sharif University of Technology Spring 2017 Soleymani Learning problem Target: true distribution

More information

Chapter 5 continued. Chapter 5 sections

Chapter 5 continued. Chapter 5 sections Chapter 5 sections Discrete univariate distributions: 5.2 Bernoulli and Binomial distributions Just skim 5.3 Hypergeometric distributions 5.4 Poisson distributions Just skim 5.5 Negative Binomial distributions

More information

Lecture 13 Fundamentals of Bayesian Inference

Lecture 13 Fundamentals of Bayesian Inference Lecture 13 Fundamentals of Bayesian Inference Dennis Sun Stats 253 August 11, 2014 Outline of Lecture 1 Bayesian Models 2 Modeling Correlations Using Bayes 3 The Universal Algorithm 4 BUGS 5 Wrapping Up

More information

Curve Fitting Re-visited, Bishop1.2.5

Curve Fitting Re-visited, Bishop1.2.5 Curve Fitting Re-visited, Bishop1.2.5 Maximum Likelihood Bishop 1.2.5 Model Likelihood differentiation p(t x, w, β) = Maximum Likelihood N N ( t n y(x n, w), β 1). (1.61) n=1 As we did in the case of the

More information

Bayesian RL Seminar. Chris Mansley September 9, 2008

Bayesian RL Seminar. Chris Mansley September 9, 2008 Bayesian RL Seminar Chris Mansley September 9, 2008 Bayes Basic Probability One of the basic principles of probability theory, the chain rule, will allow us to derive most of the background material in

More information

Physics 403. Segev BenZvi. Numerical Methods, Maximum Likelihood, and Least Squares. Department of Physics and Astronomy University of Rochester

Physics 403. Segev BenZvi. Numerical Methods, Maximum Likelihood, and Least Squares. Department of Physics and Astronomy University of Rochester Physics 403 Numerical Methods, Maximum Likelihood, and Least Squares Segev BenZvi Department of Physics and Astronomy University of Rochester Table of Contents 1 Review of Last Class Quadratic Approximation

More information

Motivation Scale Mixutres of Normals Finite Gaussian Mixtures Skew-Normal Models. Mixture Models. Econ 690. Purdue University

Motivation Scale Mixutres of Normals Finite Gaussian Mixtures Skew-Normal Models. Mixture Models. Econ 690. Purdue University Econ 690 Purdue University In virtually all of the previous lectures, our models have made use of normality assumptions. From a computational point of view, the reason for this assumption is clear: combined

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate

More information

Bayesian nonparametrics

Bayesian nonparametrics Bayesian nonparametrics 1 Some preliminaries 1.1 de Finetti s theorem We will start our discussion with this foundational theorem. We will assume throughout all variables are defined on the probability

More information

Markov Chain Monte Carlo

Markov Chain Monte Carlo Markov Chain Monte Carlo Recall: To compute the expectation E ( h(y ) ) we use the approximation E(h(Y )) 1 n n h(y ) t=1 with Y (1),..., Y (n) h(y). Thus our aim is to sample Y (1),..., Y (n) from f(y).

More information

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner Fundamentals CS 281A: Statistical Learning Theory Yangqing Jia Based on tutorial slides by Lester Mackey and Ariel Kleiner August, 2011 Outline 1 Probability 2 Statistics 3 Linear Algebra 4 Optimization

More information

Bayesian Phylogenetics:

Bayesian Phylogenetics: Bayesian Phylogenetics: an introduction Marc A. Suchard msuchard@ucla.edu UCLA Who is this man? How sure are you? The one true tree? Methods we ve learned so far try to find a single tree that best describes

More information

Introduction to Bayesian Statistics 1

Introduction to Bayesian Statistics 1 Introduction to Bayesian Statistics 1 STA 442/2101 Fall 2018 1 This slide show is an open-source document. See last slide for copyright information. 1 / 42 Thomas Bayes (1701-1761) Image from the Wikipedia

More information

Lecture 13 : Variational Inference: Mean Field Approximation

Lecture 13 : Variational Inference: Mean Field Approximation 10-708: Probabilistic Graphical Models 10-708, Spring 2017 Lecture 13 : Variational Inference: Mean Field Approximation Lecturer: Willie Neiswanger Scribes: Xupeng Tong, Minxing Liu 1 Problem Setup 1.1

More information