Bayesian Aspects of Statistical Learning Summer School of Statistics 2011


1 Bayesian Aspects of Statistical Learning Summer School of Statistics 2011 Mattias Villani, Division of Statistics, Dept. of Computer and Information Science, Linköping University

2 Overview of my lectures
Narrow view of statistical learning and data mining
An introduction to Bayesian inference
Using Bayes to prevent overfitting: shrinkage priors, variable selection, Bayesian hierarchical models
Flexibility by Bayesian mixture modeling: finite mixture models, Dirichlet process mixture models, Mixture-of-Experts

3 What is Data Mining? Encyclopedia Britannica: data mining, also called knowledge discovery in databases, in computer science, the process of discovering interesting and useful patterns and relationships in large volumes of data. The field combines tools from statistics and artificial intelligence (such as neural networks and machine learning) with database management to analyze large digital collections, known as data sets. Data mining is widely used in business (insurance, banking, retail), science research (astronomy, medicine), and government security (detection of criminals and terrorists).

4 What is Data Mining? Encyclopedia Britannica: data mining, also called knowledge discovery in databases, in computer science, the process of discovering interesting and useful patterns and relationships in large volumes of data. The field combines tools from statistics and artificial intelligence (such as neural networks and machine learning) with database management to analyze large digital collections, known as data sets. Data mining is widely used in business (insurance, banking, retail), science research (astronomy, medicine), and government security (detection of criminals and terrorists).
The three steps of pattern extraction:
Pre-processing. Merging databases. Cleaning of data sets by transformation, filtering and noise-reduction.
Data mining. Algorithmic fitting. Estimation. Learning. [Association learning, Clustering, Classification, Regression]
Validation. Prediction on new data observations (test data). Automated procedures. Machine learning.

5 Uses of Data Mining
Business and marketing. Shopping patterns from scanner data. Directed marketing.
Science. DNA sequence clustering and prediction.
Surveillance. Terrorist threats. Product quality.
Technology. Internet search suggestions. Text mining. Image archives. Spam filters.
Health and medicine. Patient databases. X-rays.

6 What is Statistical Learning? Encyclopedia Britannica: No matches found.

7 What is Statistical Learning? Encyclopedia Britannica: No matches found. MY (current) view: Statistical learning is a branch of statistical inference which deals with the estimation of flexible probabilistic models with a strong focus on predictions, typically applied in data-rich environments, often with a view towards automated procedures.

8 What is Statistical Learning? Encyclopedia Britannica: No matches found. MY (current) view: Statistical learning is a branch of statistical inference which deals with the estimation of flexible probabilistic models with a strong focus on predictions, typically applied in data-rich environments, often with a view towards automated procedures.
Statistical learning terminology: Estimation = Training. Predictive performance = Generalization. Evaluation = Testing, etc.
It has its own theory and methods (Shawe-Taylor).

9 Exciting statistical problems - The death of Statistics? Hal Varian, chief economist at Google: I keep saying that the sexy job in the next 10 years will be statisticians. New York Times, August 5, 2009.

10 Exciting statistical problems - The death of Statistics? Hal Varian, chief economist at Google: I keep saying that the sexy job in the next 10 years will be statisticians. New York Times, August 5, 2009. And I'm not kidding.

11 Exciting statistical problems - The death of Statistics? Hal Varian, chief economist at Google: I keep saying that the sexy job in the next 10 years will be statisticians. New York Times, August 5, 2009. And I'm not kidding. New statistical opportunities, but can statisticians take advantage of them?

12 Exciting statistical problems - The death of Statistics? Hal Varian, chief economist at Google: I keep saying that the sexy job in the next 10 years will be statisticians. New York Times, August 5, 2009. And I'm not kidding. New statistical opportunities, but can statisticians take advantage of them?
Leo Breiman, "Statistical modeling: The two cultures", Statistical Science, 2001: Perhaps the damaging consequence of the insistence on data models is that statisticians have ruled themselves out of some of the most interesting and challenging statistical problems... Breiman: give up on statistical models, go algorithmic. Statistical learning is an attempt to unify the two cultures.

13 Introduction to Bayesian inference
The basics: subjective probability and Bayes' theorem
Warm-ups: Bayesian analysis of Bernoulli trials and of normal data
Advantages of the Bayesian approach
Brief intro to simulation-based methods: Markov Chain Monte Carlo.

14 Warm up: Bernoulli trials
Bernoulli trials: $x_1, \dots, x_n \mid \theta \overset{iid}{\sim} \mathrm{Bern}(\theta)$.
Likelihood: $p(x_1, \dots, x_n \mid \theta) = p(x_1 \mid \theta) \cdots p(x_n \mid \theta) = \theta^s (1-\theta)^f$, where $s = \sum_{i=1}^n x_i$ is the number of successes in the Bernoulli trials and $f = n - s$ is the number of failures.
Given the data $x_1, \dots, x_n$, we may plot $p(x_1, \dots, x_n \mid \theta)$ as a function of $\theta$.

15 The likelihood function from Bernoulli trials
[Figure: likelihood functions of the Bernoulli model plotted over $\theta \in (0,1)$ for several different data configurations (s, f).]

16 Uncertainty and subjective probability
Will the likelihood give us an idea of which values of θ should be regarded as probable (in some sense)? Kind of, but... No! In order to say that one value of θ is more probable than another we clearly must think of θ as random. But θ may be something that we know is non-random, e.g. a fixed natural constant.
Bayesian: it doesn't matter if θ is fixed or random. What matters is whether or not You know the value of θ. If θ is uncertain to You, then You can assign a probability distribution to θ which reflects Your knowledge about θ. Subjective probability.
The prior distribution, p(θ), summarizes your knowledge before observing the data, x. The posterior distribution, p(θ | x), summarizes your knowledge after observing the data, x.

17 Learning from data - Bayes' theorem
How to make the transition p(θ) → p(θ | x)? Bayes' theorem:
$$p(A_i \mid B) = \frac{p(B \mid A_i)\, p(A_i)}{\sum_{i=1}^k p(B \mid A_i)\, p(A_i)}$$
If θ takes on a continuum of values:
$$p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{\int_\theta p(x \mid \theta)\, p(\theta)\, d\theta}.$$
Short form of Bayes' theorem: $p(\theta \mid x) \propto p(x \mid \theta)\, p(\theta)$, or Posterior ∝ Likelihood × Prior.

18 Bernoulli trials - Beta prior
Model: $x_1, \dots, x_n \mid \theta \overset{iid}{\sim} \mathrm{Bern}(\theta)$
Conjugate prior: $\theta \sim \mathrm{Beta}(\alpha, \beta)$, with density $p(y) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, y^{\alpha-1} (1-y)^{\beta-1}$ for $0 \le y \le 1$.
Posterior: $p(\theta \mid x_1, \dots, x_n) \propto p(x_1, \dots, x_n \mid \theta)\, p(\theta) = \theta^s (1-\theta)^f \cdot \theta^{\alpha-1} (1-\theta)^{\beta-1} = \theta^{s+\alpha-1} (1-\theta)^{f+\beta-1}$.
But this is recognized as proportional to the Beta(α + s, β + f) density. That is, the prior-to-posterior mapping reads
$$\theta \sim \mathrm{Beta}(\alpha, \beta) \;\overset{x_1, \dots, x_n}{\longrightarrow}\; \theta \mid x_1, \dots, x_n \sim \mathrm{Beta}(\alpha + s, \beta + f).$$
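A minimal sketch of this prior-to-posterior mapping in code; the counts s, f and the Beta(2, 2) prior are made-up illustrative values, not from the slides:

```python
import numpy as np
from scipy import stats

# Hypothetical data: s successes and f failures in n Bernoulli trials
s, f = 13, 7
alpha, beta = 2.0, 2.0          # assumed Beta(alpha, beta) prior

# Conjugate update: theta | data ~ Beta(alpha + s, beta + f)
posterior = stats.beta(alpha + s, beta + f)

print("Posterior mean:", posterior.mean())            # (alpha + s) / (alpha + beta + s + f)
print("95% equal-tail interval:", posterior.interval(0.95))
```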

19 Spam data: The effect of different priors
[Figure: posterior of θ for the spam data under several different Beta(α, β) priors.]

20 Normal data with known variance - normal prior
Model: $x_1, \dots, x_n \mid \theta, \sigma^2 \sim N(\theta, \sigma^2)$
Prior: $\theta \sim N(\mu, \tau^2)$
Posterior: $\theta \mid x_1, \dots, x_n \sim N(\mu_n, \tau_n^2)$, where
$$\frac{1}{\tau_n^2} = \frac{n}{\sigma^2} + \frac{1}{\tau^2}, \qquad \mu_n = w \bar{x} + (1 - w)\mu, \qquad w = \frac{n/\sigma^2}{n/\sigma^2 + 1/\tau^2}.$$

21 Normal data with known variance - normal prior, cont.
$$\theta \sim N(\mu, \tau^2) \;\overset{x_1, \dots, x_n}{\longrightarrow}\; \theta \mid x_1, \dots, x_n \sim N(\mu_n, \tau_n^2).$$
Posterior precision = Data precision + Prior precision
Posterior mean = [Data precision × (Data mean) + Prior precision × (Prior mean)] / Posterior precision
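A quick numeric sketch of the precision-weighting formulas above; the observations, the known variance and the prior are made-up values:

```python
import numpy as np

x = np.array([1.2, 0.4, 2.1, 1.7, 0.9])    # hypothetical observations
sigma2 = 1.0                                 # known data variance
mu0, tau2 = 0.0, 4.0                         # prior: theta ~ N(mu0, tau2)

n = len(x)
post_prec = n / sigma2 + 1 / tau2            # posterior precision = data + prior precision
tau2_n = 1 / post_prec
w = (n / sigma2) / post_prec                 # weight on the data mean
mu_n = w * x.mean() + (1 - w) * mu0          # precision-weighted posterior mean

print(f"theta | x ~ N({mu_n:.3f}, {tau2_n:.3f})")
```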

22 Bayesianism - What's in it for you?
Prior information. Deep expert knowledge or just smoothness.
You know what to do: p(unknown | known). Probability calculus.
Marginalization is no longer a nuisance: $p(\theta \mid y) = \int p(\theta, \psi \mid y)\, d\psi$
Natural treatment of the prediction problem. Parameter uncertainty: $p(\tilde{y} \mid y) = \int p(\tilde{y} \mid y, \theta)\, p(\theta \mid y)\, d\theta$
Natural treatment of model inference. Model averaging: $p(\tilde{y} \mid y) = \sum_{k=1}^K \Pr(M_k \mid y)\, p(\tilde{y} \mid y, M_k)$
Direct connection to decision theory. Maximize expected utility: $\max_{a \in \mathcal{A}} \int U(a, \theta)\, p(\theta \mid y)\, d\theta$
Can tackle really complicated models. MCMC.

23 Approximate posterior density
Taylor expansion of the log-posterior around the posterior mode $\theta = \hat{\theta}$:
$$\ln p(\theta \mid y) = \ln p(\hat{\theta} \mid y) + \left.\frac{\partial \ln p(\theta \mid y)}{\partial \theta}\right|_{\theta=\hat{\theta}} (\theta - \hat{\theta}) + \frac{1}{2!} \left.\frac{\partial^2 \ln p(\theta \mid y)}{\partial \theta^2}\right|_{\theta=\hat{\theta}} (\theta - \hat{\theta})^2 + \dots$$
Since $\left.\frac{\partial \ln p(\theta \mid y)}{\partial \theta}\right|_{\theta=\hat{\theta}} = 0$, we have in large samples that
$$\ln p(\theta \mid y) \approx \ln p(\hat{\theta} \mid y) - \tfrac{1}{2} J_y(\hat{\theta}) (\theta - \hat{\theta})^2,$$
where $J_y(\hat{\theta}) = -\left.\frac{\partial^2 \ln p(\theta \mid y)}{\partial \theta^2}\right|_{\theta=\hat{\theta}}$ is the observed Fisher information.
Approximate posterior in large samples: $\theta \mid y \approx N\!\left[\hat{\theta}, J_y^{-1}(\hat{\theta})\right]$.
Numerical optimization (Newton-Raphson, BFGS) gives $\hat{\theta}$ and $J_y(\hat{\theta})$.
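A small sketch of this large-sample normal approximation for the Beta-Bernoulli posterior used earlier: the mode is found numerically and the negative second derivative there serves as the observed information. The data and prior are made up, and the second derivative is evaluated analytically for simplicity:

```python
import numpy as np
from scipy import optimize

s, f = 13, 7                      # hypothetical Bernoulli counts
alpha, beta = 2.0, 2.0            # assumed Beta prior

def neg_log_post(theta):
    # negative unnormalized log posterior of theta
    return -((s + alpha - 1) * np.log(theta) + (f + beta - 1) * np.log(1 - theta))

res = optimize.minimize_scalar(neg_log_post, bounds=(1e-6, 1 - 1e-6), method="bounded")
theta_hat = res.x

# Observed information J = -d^2 log p / d theta^2 evaluated at the mode
J = (s + alpha - 1) / theta_hat**2 + (f + beta - 1) / (1 - theta_hat)**2

print("Approximate posterior: N(%.4f, %.6f)" % (theta_hat, 1 / J))
```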

24 Gibbs sampling
Easily implemented method for sampling from multivariate distributions, $p(\theta_1, \dots, \theta_k)$. Typically a posterior distribution.
Requirements: easily sampled full conditional posteriors:
$p(\theta_1 \mid \theta_2, \theta_3, \dots, \theta_k)$
$p(\theta_2 \mid \theta_1, \theta_3, \dots, \theta_k)$
...
$p(\theta_k \mid \theta_1, \theta_2, \dots, \theta_{k-1})$

25 The Gibbs sampling algorithm
A: Choose initial values $\theta_2^{(0)}, \theta_3^{(0)}, \dots, \theta_n^{(0)}$.
B:
B1: Draw $\theta_1^{(1)}$ from $p(\theta_1 \mid \theta_2^{(0)}, \theta_3^{(0)}, \dots, \theta_n^{(0)})$
B2: Draw $\theta_2^{(1)}$ from $p(\theta_2 \mid \theta_1^{(1)}, \theta_3^{(0)}, \dots, \theta_n^{(0)})$
...
Bn: Draw $\theta_n^{(1)}$ from $p(\theta_n \mid \theta_1^{(1)}, \dots, \theta_{n-1}^{(1)})$
C: Repeat Step B N times.

26 Gibbs sampling, cont.
The Gibbs draws $\theta^{(1)}, \theta^{(2)}, \dots, \theta^{(N)}$ are dependent, but arithmetic means converge to expected values:
$$\frac{1}{N} \sum_{t=1}^N \theta_j^{(t)} \to E(\theta_j), \qquad \frac{1}{N} \sum_{t=1}^N g(\theta^{(t)}) \to E[g(\theta)].$$
More generally, the Gibbs sequence $\theta^{(1)}, \theta^{(2)}, \dots, \theta^{(N)}$ converges in distribution to the target posterior $p(\theta_1, \dots, \theta_k)$, and $\theta_j^{(1)}, \dots, \theta_j^{(N)}$ converge to the marginal distribution of $\theta_j$, $p(\theta_j)$.

27 Examples: Gibbs sampling
Bivariate normal: joint distribution
$$\begin{pmatrix} \theta_1 \\ \theta_2 \end{pmatrix} \sim N_2\left[\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\right]$$
Full conditional posteriors:
$$\theta_1 \mid \theta_2 \sim N[\mu_1 + \rho(\theta_2 - \mu_2),\; 1 - \rho^2], \qquad \theta_2 \mid \theta_1 \sim N[\mu_2 + \rho(\theta_1 - \mu_1),\; 1 - \rho^2].$$
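A minimal Gibbs sampler for this bivariate normal example; the parameter values (means zero, ρ = 0.9), the number of draws and the starting point are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
mu1, mu2, rho = 0.0, 0.0, 0.9     # assumed parameter values
N = 5000

draws = np.empty((N, 2))
th1, th2 = 10.0, 10.0             # deliberately bad initial values
for t in range(N):
    # theta1 | theta2 ~ N(mu1 + rho*(theta2 - mu2), 1 - rho^2)
    th1 = rng.normal(mu1 + rho * (th2 - mu2), np.sqrt(1 - rho**2))
    # theta2 | theta1 ~ N(mu2 + rho*(theta1 - mu1), 1 - rho^2)
    th2 = rng.normal(mu2 + rho * (th1 - mu1), np.sqrt(1 - rho**2))
    draws[t] = th1, th2

print("Posterior means:", draws[1000:].mean(axis=0))        # discard burn-in
print("Sample correlation:", np.corrcoef(draws[1000:].T)[0, 1])
```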

28 Bivariate normal - Initial values don't matter
[Figure: Gibbs sampling paths for the bivariate normal started from four different initial values; all chains move into the same high-probability region.]

29 Bivariate normal - Initial values don't matter
[Figure: Gibbs draws for the bivariate normal from the four different initial values.]

30 Gibbs sampling - Bivariate normal
[Figure: Monte Carlo summaries from the Gibbs output: density estimate of f(x1), estimated contours of f(x1, x2), density estimate of f[sin(x1)], cumulative estimate of E[sin(x1)], cumulative estimate of a posterior tail probability for X1, and density estimate of f[π·cos(x1)·x2^3].]

31 The Metropolis Algorithm
Initialize with $\theta^{(0)} = \theta_0$. For t = 1, 2, ...:
Sample a proposal draw $\theta^* \sim q_t(\theta^* \mid \theta^{(t-1)})$.
Accept $\theta^*$ with probability
$$r(\theta^{(t-1)} \to \theta^*) = \min\left[\frac{p(\theta^* \mid y)}{p(\theta^{(t-1)} \mid y)},\, 1\right].$$
If the proposal is accepted, set $\theta^{(t)} = \theta^*$, otherwise set $\theta^{(t)} = \theta^{(t-1)}$.
Every proposal $\theta^*$ that lies uphill is always accepted. Downhill moves are accepted with probability $r(\theta^{(t-1)} \to \theta^*)$.
It is enough if we can compute the unnormalized posterior density $p(y \mid \theta)\, p(\theta)$ for any θ.
$q_t(\theta^* \mid \theta^{(t-1)})$ must be symmetric, i.e. $q_t(\theta_a \mid \theta_b) = q_t(\theta_b \mid \theta_a)$.
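A sketch of a random-walk Metropolis sampler for the Bernoulli posterior from the warm-up example; the data, the flat prior and the proposal scale 0.1 are illustrative assumptions, and the acceptance step is done on the log scale:

```python
import numpy as np

rng = np.random.default_rng(1)
s, f = 13, 7                       # hypothetical Bernoulli counts, uniform prior

def log_post(theta):
    if theta <= 0 or theta >= 1:
        return -np.inf             # outside the support of theta
    return s * np.log(theta) + f * np.log(1 - theta)

N, scale = 10000, 0.1
theta = 0.5
draws = np.empty(N)
accepted = 0
for t in range(N):
    prop = theta + rng.normal(0, scale)            # symmetric random-walk proposal
    # accept with prob min(p(prop|y)/p(theta|y), 1), computed on the log scale
    if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
        theta = prop
        accepted += 1
    draws[t] = theta

print("Acceptance rate:", accepted / N)
print("Posterior mean (after burn-in):", draws[2000:].mean())
```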

32 Metropolis - Choosing the proposal distribution
Common choice of proposal distribution:
$$q_t(\theta^* \mid \theta^{(t-1)}) = N\left[\theta^{(t-1)}, c^2 J^{-1}(\hat{\theta})\right],$$
where c is a tuning constant.
A good proposal $q_t(\theta^* \mid \theta^{(t-1)})$ should have the following properties: easy to sample; easy to compute $r(\theta^{(t-1)} \to \theta^*)$; takes reasonably large jumps in the parameter space; the jumps are not rejected too frequently.
Set c so that the average acceptance probability is somewhere between 0.2 and 0.4.

33 Practical Implementation of MCMC Algorithms
The autocorrelation in the simulated sequence $\theta^{(1)}, \theta^{(2)}, \dots, \theta^{(N)}$ makes it somewhat problematic to define the effective number of simulation draws.
Inefficiency factor: $\mathrm{IF} = 1 + 2 \sum_{i=1}^{\infty} \rho_i$, where $\rho_i$ is the autocorrelation at lag i.
Effective sample size: $\mathrm{ESS} = N / \mathrm{IF}$.
When do we stop sampling? How many burn-in iterations to discard? Several short sequences or a single long sequence? To thin out or not to thin out? Convergence diagnostics.
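A rough sketch of estimating the inefficiency factor and effective sample size from MCMC output; the autocorrelations are simply summed up to a fixed maximum lag (a crude truncation rule) and the AR(1) test sequence is made up:

```python
import numpy as np

def inefficiency_factor(draws, max_lag=100):
    """IF = 1 + 2 * sum of autocorrelations; ESS = N / IF."""
    x = np.asarray(draws, dtype=float)
    x = x - x.mean()
    n = len(x)
    acov = np.correlate(x, x, mode="full")[n - 1:] / n   # autocovariances at lags 0, 1, ...
    rho = acov / acov[0]                                 # autocorrelations
    IF = 1 + 2 * rho[1:max_lag + 1].sum()
    return IF, n / IF

# Example on an AR(1) sequence with known persistence 0.8
rng = np.random.default_rng(2)
x = np.empty(5000)
x[0] = 0.0
for t in range(1, len(x)):
    x[t] = 0.8 * x[t - 1] + rng.normal()

IF, ess = inefficiency_factor(x)
print(f"IF = {IF:.1f}, ESS = {ess:.0f} out of {len(x)} draws")
```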

34 The Metropolis-Hastings algorithm
Generalization of the Metropolis algorithm to non-symmetric proposals. The acceptance probability is slightly more complicated:
$$r(\theta^{(t-1)} \to \theta^*) = \min\left[\frac{p(\theta^* \mid y)/q_t(\theta^* \mid \theta^{(t-1)})}{p(\theta^{(t-1)} \mid y)/q_t(\theta^{(t-1)} \mid \theta^*)},\, 1\right].$$
Gibbs sampling is a special case of the MH algorithm where the proposal is the full conditional posterior and every draw is accepted.
Independence MH: $q_t(\theta^* \mid \theta^{(t-1)}) = q_t(\theta^*)$. Example: $\theta^* \sim N[\hat{\theta}, J^{-1}(\hat{\theta})]$.
Metropolis-Hastings-within-Gibbs: $p(\theta_1 \mid \theta_2, x)$ is an easily sampled distribution, but $p(\theta_2 \mid \theta_1, x)$ is not easily sampled, so it is updated with an MH step.

35 Bayesian model inference
Comparing two models $p_1(x \mid \theta_1)$ and $p_2(x \mid \theta_2)$ would be easy if $\theta_1$ and $\theta_2$ were known. Usually they aren't. Bayes: average with respect to the prior.
Marginal likelihood: $p(x) = \int p(x \mid \theta)\, p(\theta)\, d\theta$
Bayes factor to compare models: $\mathrm{BF}_{12}(x) = p_1(x)/p_2(x)$
The marginal likelihood is a measure of out-of-sample forecasting performance:
$$p(y_1, \dots, y_n) = p(y_1)\, p(y_2 \mid y_1) \cdots p(y_n \mid y_1, y_2, \dots, y_{n-1}), \qquad p(y_t \mid y_1, \dots, y_{t-1}) = \int p(y_t \mid \theta)\, p(\theta \mid y_1, \dots, y_{t-1})\, d\theta.$$
The marginal likelihood is usually very sensitive to the prior. Log Predictive Score (LPS).

36 Example: Bayesian hypothesis testing, Bernoulli case
Hypothesis testing is just a special case of model selection:
$M_0$: $x_1, \dots, x_n \overset{iid}{\sim} \mathrm{Bernoulli}(\theta_0)$
$M_1$: $x_1, \dots, x_n \overset{iid}{\sim} \mathrm{Bernoulli}(\theta)$, $\theta \sim \mathrm{Beta}(\alpha, \beta)$
$$p(x_1, \dots, x_n \mid M_0) = \theta_0^y (1-\theta_0)^{n-y}, \qquad p(x_1, \dots, x_n \mid M_1) = \int_0^1 \theta^y (1-\theta)^{n-y}\, B(\alpha, \beta)^{-1} \theta^{\alpha-1} (1-\theta)^{\beta-1}\, d\theta = B(y+\alpha, n-y+\beta)/B(\alpha, \beta),$$
where $y = \sum_{i=1}^n x_i$ and $B(\cdot, \cdot)$ is the Beta function. Bayes factor:
$$\mathrm{BF}_{01}(x) = \frac{\theta_0^y (1-\theta_0)^{n-y}}{B(y+\alpha, n-y+\beta)/B(\alpha, \beta)}.$$
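A direct numerical evaluation of this Bayes factor; the counts, the point null θ0 = 0.5 and the Beta(1, 1) prior under M1 are made-up illustrative choices:

```python
import numpy as np
from scipy.special import betaln

y, n = 62, 100                 # hypothetical number of successes and trials
theta0 = 0.5                   # point null under M0
alpha, beta = 1.0, 1.0         # Beta prior under M1

# log marginal likelihoods of the two models
log_m0 = y * np.log(theta0) + (n - y) * np.log(1 - theta0)
log_m1 = betaln(y + alpha, n - y + beta) - betaln(alpha, beta)

print("log BF_01 =", log_m0 - log_m1)
print("BF_01 =", np.exp(log_m0 - log_m1))
```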

37 Approximate marginal likelihoods
Taylor approximation:
$$p(y \mid \theta)\, p(\theta) \approx p(y \mid \hat{\theta})\, p(\hat{\theta}) \exp\left[-\tfrac{1}{2} J_y(\hat{\theta}) (\theta - \hat{\theta})^2\right],$$
which can be integrated analytically using properties of the multivariate normal pdf.
The Laplace approximation:
$$\ln \hat{p}(y) = \ln p(y \mid \hat{\theta}) + \ln p(\hat{\theta}) + \tfrac{1}{2} \ln \left|J_y^{-1}(\hat{\theta})\right| + \frac{k}{2} \ln(2\pi),$$
where k is the number of unrestricted parameters in the model.
Cruder version of the Laplace: the BIC approximation
$$\ln \hat{p}(y) = \ln p(y \mid \hat{\theta}) + \ln p(\hat{\theta}) - \frac{k}{2} \ln n.$$

38 Model averaging
Collection of models $M_1, \dots, M_q$. Posterior model probabilities: $\Pr(M_i \mid x) \propto p(x \mid M_i)\, p(M_i)$.
Bayesian model averaging: let ξ be any unknown quantity whose interpretation is the same across models. Then
$$p(\xi \mid x) = \sum_{i=1}^q \Pr(M_i \mid x)\, p(\xi \mid M_i, x).$$
Bayesian prediction (ξ = future value of the process) takes into account: i) population uncertainty (the error variance), ii) parameter uncertainty, iii) model uncertainty.

39 Linear regression - Uniform prior
The linear regression model in matrix form:
$$\underset{(n \times 1)}{y} = \underset{(n \times k)}{X}\, \underset{(k \times 1)}{\beta} + \underset{(n \times 1)}{\varepsilon}$$
Standard non-informative prior: uniform on (β, log σ), i.e. $p(\beta, \sigma^2) \propto \sigma^{-2}$.
Joint posterior of β and σ²: $p(\beta, \sigma^2 \mid y) = p(\beta \mid \sigma^2, y)\, p(\sigma^2 \mid y)$, with
$$\beta \mid \sigma^2, y \sim N\left[\hat{\beta},\, \sigma^2 (X'X)^{-1}\right], \qquad \sigma^2 \mid y \sim \mathrm{Inv}\text{-}\chi^2(n-k, s^2),$$
$$\hat{\beta} = (X'X)^{-1} X'y, \qquad s^2 = \frac{1}{n-k} (y - X\hat{\beta})'(y - X\hat{\beta}).$$
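A sketch of direct Monte Carlo sampling from this posterior: draw σ² from its scaled inverse-χ² marginal, then β | σ² from the conditional normal. The simulated regression data and the true coefficients are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data from a linear model with intercept and two covariates
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.7, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
s2 = (y - X @ beta_hat) @ (y - X @ beta_hat) / (n - k)

ndraws = 2000
betas = np.empty((ndraws, k))
for i in range(ndraws):
    sigma2 = (n - k) * s2 / rng.chisquare(n - k)                      # sigma^2 | y ~ Inv-chi2(n-k, s2)
    betas[i] = rng.multivariate_normal(beta_hat, sigma2 * XtX_inv)    # beta | sigma^2, y

print("Posterior means:", betas.mean(axis=0))
print("Posterior std devs:", betas.std(axis=0))
```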

40 Avoiding overfitting in flexible models - the Bayesian way
Flexible models can be too flexible. Overfitting.
A good guard against overfitting: always pay attention to out-of-sample predictive performance.
Three Bayesian ways to avoid overfitting:
Zero-restrictions on parameters: variable selection in regression, covariance selection, etc.
Smoothness priors: don't set parameters to zero, shrink them all towards zero.
Hierarchical models.

41 Preventing overfitting 1: Shrinkage priors
Flexible nonlinear regression - the extreme version: let all n ordinates be unknown parameters: $y_i = \gamma_i + \varepsilon_i$.
Problem: too many parameters. The estimated curve wiggles way too much.
The usual solution: $\gamma_i = x_i \beta$. May be too restrictive.
Bayes: use a prior on $\gamma = (\gamma_1, \dots, \gamma_n)$ that carries the info that the regression curve is expected to be smooth: if $x_i$ and $x_k$ are close, then $\gamma_i$ is close to $\gamma_k$.
Possible implementation: order the data with respect to the covariate and assign the prior $\gamma_i \mid \gamma_{i-1} \sim N(\gamma_{i-1}, \tau^2)$, for $i = 2, \dots, n$.

42 Polynomial regression is linear regression on a new basis
Polynomial regression: $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_k x^k + \varepsilon$.
This can be written as a linear regression $y = x^P \beta + \varepsilon$, where $x^P = (1, x, x^2, \dots, x^k)$.
Polynomials are usually a bad idea. They are global: changing an observation in one part of the covariate space can affect the fit in regions far from the modified data point.

43 Polynomial basis functions
[Figure: quadratic regression fit, and the constant, linear and quadratic basis functions plotted against x.]

44 Splines
The perhaps most popular approach to non-parametric regression in applied work uses so-called spline functions.
First: change-point analysis using piecewise constant dummies. Use m change-points (knots) $k_1 < k_2 < \dots < k_m$. Construct a 'dummy covariate' for each change-point:
$$b_{ij} = \begin{cases} 1 & \text{if } x_i > k_j \\ 0 & \text{otherwise.} \end{cases}$$
Not smooth, the regression line has sudden jumps. Smoother: truncated linear splines
$$b_{ij} = \begin{cases} x_i - k_j & \text{if } x_i > k_j \\ 0 & \text{otherwise} \end{cases} = (x_i - k_j)_+.$$
Generalization: truncated power splines
$$b_{ij} = \begin{cases} (x_i - k_j)^p & \text{if } x_i > k_j \\ 0 & \text{otherwise} \end{cases} = (x_i - k_j)_+^p.$$
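A small helper that builds these truncated power basis columns for a given set of knots; the knot placement and the evaluation grid below are arbitrary illustrative values:

```python
import numpy as np

def truncated_power_basis(x, knots, p=1):
    """Columns (x - k_j)_+^p for each knot k_j; p=1 gives truncated linear splines."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([np.where(x > k, (x - k) ** p, 0.0) for k in knots])

x = np.linspace(0, 1, 11)
knots = [0.25, 0.5, 0.75]
B = truncated_power_basis(x, knots, p=1)
print(B.round(2))
```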

45 Truncated polynomial basis functions
[Figure: piecewise linear regression fit, and the constant basis function, the linear basis function and the truncated basis function (x − 0.4)+ plotted against x.]

46 Splines, cont.
Note: given the knots, the non-parametric spline regression model is a linear regression of y on the m 'dummy variables' $b_j$:
$$y = x_b \beta + \varepsilon,$$
where $x_b$ is the vector of basis functions $x_b = (b_1, \dots, b_m)$.
It is also common to include an intercept and the linear part of the model separately. In this case we have $x_b = (1, x, b_1, \dots, b_m)$.
Additive model for the case of r > 1 covariates:
$$y = \sum_{j=1}^r f_j(x_j) + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I_n),$$
where each $f_j(x_j)$ is a spline.

47 Shrinkage priors for splines
Problem: too many knots leads to overfitting. Solution: shrinkage prior
$$\beta_i \overset{iid}{\sim} N(0, \lambda^2),$$
where λ determines how smooth the regression function is (smaller λ gives a smoother function).
Equivalent to a penalized likelihood: $\mathrm{LogLik} - \frac{1}{2\lambda^2}\, \beta'\beta$. Ridge regression.
It is also possible to treat λ as an unknown quantity and estimate it from data. Prior, example: $\lambda \sim \mathrm{Gamma}(\alpha, \beta)$.
The famous Lasso variable selection method is equivalent to using the posterior mode estimate under the prior $\beta_i \overset{iid}{\sim} \mathrm{Laplace}(0, \lambda^{-1})$, where the Laplace density is
$$p(x) = \frac{1}{2b} \exp\left(-\frac{|x - \mu|}{b}\right).$$
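Under a Gaussian likelihood this shrinkage prior gives a closed-form posterior mean, the ridge estimator. A sketch under assumptions of my own: the simulated data, the noise variance, the knot grid and the λ values are illustrative, and for simplicity the intercept and linear term are shrunk along with the spline coefficients:

```python
import numpy as np

def ridge_posterior_mean(X, y, sigma2, lam2):
    """Posterior mean of beta under y ~ N(X beta, sigma2 I), beta_i ~ N(0, lam2)."""
    k = X.shape[1]
    A = X.T @ X / sigma2 + np.eye(k) / lam2
    return np.linalg.solve(A, X.T @ y / sigma2)

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 1, 80))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=len(x))

# Design matrix: intercept, linear term, and truncated linear spline basis
knots = np.linspace(0.1, 0.9, 15)
X = np.column_stack([np.ones_like(x), x] + [np.clip(x - k, 0, None) for k in knots])

for lam2 in (0.01, 1.0, 100.0):          # smaller lam2 -> stronger shrinkage, smoother fit
    beta = ridge_posterior_mean(X, y, sigma2=0.09, lam2=lam2)
    print(f"lam2={lam2:g}: max |spline coefficient| = {np.abs(beta[2:]).max():.3f}")
```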

48 Bayesian spline with smoothness prior
[Figure: LIDAR data (LogRatio vs Range) with the estimated E(y | x) from the Bayesian spline for different values of the shrinkage parameter λ.]

49 Preventing overfitting 2: Bayesian variable selection
Selecting the knots in a spline regression is exactly like variable/covariate selection in linear regression. Bayesian variable selection is ideal here.
Introduce variable selection indicators $I_j$ such that $\beta_j = 0$ if $I_j = 0$, and $\beta_j \sim N(0, \lambda^2)$ if $I_j = 1$.
Need a prior on $I_1, \dots, I_K$. Simple choice: $I_1, \dots, I_K \mid \theta \overset{iid}{\sim} \mathrm{Bernoulli}(\theta)$.
Simulate from the posterior distribution:
$$p(\beta, \sigma^2, I_1, \dots, I_K \mid y) = p(\beta, \sigma^2 \mid I_1, \dots, I_K, y)\, p(I_1, \dots, I_K \mid y).$$
Simulate from $p(I_1, \dots, I_K \mid y)$ using Gibbs sampling. Automatic model averaging, all in one simulation run.

50 General Bayesian variable selection - A simple approach
The previous algorithm only works when we can integrate out all the model parameters to obtain
$$p(I \mid y) = \int p(\beta, \sigma^2, I \mid y)\, d\beta\, d\sigma^2.$$
More generally we do MH and propose β and I jointly from the proposal distribution $q(\beta_p \mid \beta_c, I_p)\, q(I_p \mid I_c)$.
Main difficulty: how to propose the non-zero elements in $\beta_p$?
Simple approach: numerical optimization on the posterior with all variables in the model to obtain $\beta \mid y \overset{\text{approx}}{\sim} N\left[\hat{\beta}, J_y^{-1}(\hat{\beta})\right]$. Propose from $N\left[\hat{\beta}, J_y^{-1}(\hat{\beta})\right]$, conditional on the zero restrictions implied by $I_p$.

51 Finite step Newton proposals
Consider a class of regression models with likelihood functions in the following general form:
$$p(y \mid x, \beta_\mu, \beta_\varphi) = \prod_{i=1}^n p(y_i \mid x_i, \mu_i, \varphi_i), \qquad g(\mu_i) = x_i' \beta_\mu, \qquad h(\varphi_i) = x_i' \beta_\varphi.$$
Example: heteroscedastic Gaussian regression:
$$y_i = \mu_i + \varepsilon_i, \quad \varepsilon_i \overset{iid}{\sim} N(0, \varphi_i^2), \qquad \mu_i = x_i' \beta_\mu, \qquad \ln \varphi_i = x_i' \beta_\varphi.$$
Gibbs sampling. How to sample from the full conditionals i) $p(\beta_\mu \mid \beta_\varphi, y)$ and ii) $p(\beta_\varphi \mid \beta_\mu, y)$?
Naive, time-consuming way: Newton's method in each updating step.
Finite-step Newton: don't iterate all the way to the mode, a few steps are enough.

52 Finite step Newton proposals with variable selection
Variable selection: how to propose $\beta_\mu$ conditional on $I_\mu$? Finite-step Newton with variable dimension.
Exploit: $g(\mu_i) = x_i' \beta$ always has the same dimension, and $g(\mu_{ic}) = x_i' \beta_c$ and $g(\mu_{ip}) = x_i' \beta_p$ are expected to be quite close.
Important: only need to code the likelihood, gradient (and Hessian) at the model parameter level (e.g. μ, φ). Typically trivial. Efficient.

53 Bayesian knot selection - Lidar example

54 Bayesian knot selection - Lidar example

55 Bayesian knot selection - Lidar example

56 Preventing overfitting 3: Hierarchical modeling
Example: $y_j \mid \theta_j \sim \mathrm{Bin}(n_j, \theta_j)$, $j = 1, \dots, J$.
We could do inference on each $\theta_j$ separately. Problem: $n_j$ may be small for some j. Not much info then about $\theta_j$.
If you knew $\theta_j$, would that give information about $\theta_i$, $i \neq j$? If so, then inference about the parameters $\theta_j$, $j = 1, \dots, J$, may 'borrow strength' from each other.
Extreme case: assume $\theta_j = \theta$ for all j. Define $y = \sum_{j=1}^J y_j$ and $n = \sum_{j=1}^J n_j$. Straightforward to analyze θ with the usual Beta-Binomial approach.
Intermediate case: tie the θ's together by assuming the prior $\theta_j \overset{iid}{\sim} \mathrm{Beta}(\alpha, \beta)$.

57 Flexibility by mixture models
Two-component mixture of normals [MoN(2)]:
$$p(x) = \pi\, \phi(x; \mu_1, \sigma_1^2) + (1 - \pi)\, \phi(x; \mu_2, \sigma_2^2),$$
where $\phi(x; \mu, \sigma^2)$ denotes the PDF of a normal variate with mean μ and variance σ².
Simulate from a MoN(2): simulate an indicator $I \sim \mathrm{Bern}(\pi)$. If I = 1, simulate x from $N(\mu_1, \sigma_1^2)$; otherwise simulate x from $N(\mu_2, \sigma_2^2)$.
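Simulating from a MoN(2) exactly as described; the component weights, means and standard deviations are arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(5)
pi, mu1, s1, mu2, s2 = 0.7, 0.0, 1.0, 4.0, 0.5   # assumed mixture parameters

n = 10000
I = rng.uniform(size=n) < pi                      # component indicator, P(I = 1) = pi
x = np.where(I, rng.normal(mu1, s1, n), rng.normal(mu2, s2, n))

print("Simulated mean:", x.mean())
print("Theoretical mean:", pi * mu1 + (1 - pi) * mu2)
```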

58 Illustration of mixture distributions
[Figure: densities of normal mixtures producing bimodality, fat tails, outliers, skewness, and a shape close to uniform.]

59 Mixture distributions, cont.
Not easy to estimate directly - the likelihood is a product of sums.
Assume that we knew which of the two densities each observation came from:
$$I_i = \begin{cases} 1 & \text{if } x_i \text{ came from Density 1} \\ 2 & \text{if } x_i \text{ came from Density 2.} \end{cases}$$
Armed with knowledge of $I_1, \dots, I_n$ it is now easy to estimate $\pi, \mu_1, \sigma_1^2, \mu_2, \sigma_2^2$. But we do not know $I_1, \dots, I_n$!

60 Mixture distributions, cont.
Gibbs sampling to the rescue. Assume: prior $\pi \sim \mathrm{Beta}(\alpha_1, \alpha_2)$ and a conjugate prior for each $(\mu_j, \sigma_j^2)$. Let $n_1 = \sum_{i=1}^n 1\{I_i = 1\}$ and $n_2 = n - n_1$.
Algorithm:
$\pi \mid I, x \sim \mathrm{Beta}(\alpha_1 + n_1, \alpha_2 + n_2)$
$\sigma_1^2 \mid \mu_1, I, x \sim \mathrm{Inv}\text{-}\chi^2$ and $\mu_1 \mid I, \sigma^2, x \sim N$
$\sigma_2^2 \mid \mu_2, I, x \sim \mathrm{Inv}\text{-}\chi^2$ and $\mu_2 \mid I, \sigma^2, x \sim N$
$I_i = 1 \mid \pi, \mu_1, \sigma_1^2, \mu_2, \sigma_2^2, x \sim \mathrm{Bern}(\theta_i)$, $i = 1, \dots, n$, where
$$\theta_i = \frac{\pi\, \phi(x_i; \mu_1, \sigma_1^2)}{\pi\, \phi(x_i; \mu_1, \sigma_1^2) + (1 - \pi)\, \phi(x_i; \mu_2, \sigma_2^2)}.$$
This generalizes to K > 2 components: the full conditional posterior of the mixture weights is then Dirichlet, and each mixture indicator is drawn from its (multinomial) full conditional.

61 Multinomial model with Dirichlet prior
Data: $y = (y_1, \dots, y_K)$, where $y_k$ counts the number of observations in the kth category, $\sum_{k=1}^K y_k = n$. Example: brand choices.
Multinomial model:
$$p(y \mid \theta) \propto \prod_{k=1}^K \theta_k^{y_k}, \quad \text{where } \sum_{k=1}^K \theta_k = 1.$$
Conjugate prior: $\mathrm{Dirichlet}(\pi_1, \dots, \pi_K; \alpha)$:
$$p(\theta) \propto \prod_{k=1}^K \theta_k^{\alpha \pi_k - 1}.$$
Moments of $\theta = (\theta_1, \dots, \theta_K) \sim \mathrm{Dirichlet}(\pi_1, \dots, \pi_K; \alpha)$:
$$E(\theta_k) = \pi_k, \qquad V(\theta_k) = \frac{\pi_k (1 - \pi_k)}{\alpha + 1}.$$
Note that α is a precision parameter.

62 Multinomial model with Dirichlet prior, cont.
Prior-to-posterior updating:
Model: $y = (y_1, \dots, y_K) \sim \mathrm{Multin}(n; \theta_1, \dots, \theta_K)$
Prior: $\theta = (\theta_1, \dots, \theta_K) \sim \mathrm{Dirichlet}(\pi_1, \dots, \pi_K; \alpha)$
Posterior: $\theta \mid y \sim \mathrm{Dirichlet}\left(\dfrac{\alpha \pi_1 + y_1}{\alpha + n}, \dots, \dfrac{\alpha \pi_K + y_K}{\alpha + n};\; \alpha + n\right)$
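The Dirichlet-multinomial update in code; the counts and the prior below are made up, and the posterior is parameterized directly by its shape parameters $\alpha \pi_k + y_k$ (equivalent to the mean/precision form above):

```python
import numpy as np
from scipy import stats

y = np.array([20, 50, 30])                    # hypothetical category counts, n = 100
alpha = 6.0                                   # prior precision
pi = np.array([1/3, 1/3, 1/3])                # prior mean

post_shape = alpha * pi + y                   # posterior Dirichlet shape parameters
post_mean = post_shape / post_shape.sum()     # = (alpha*pi_k + y_k) / (alpha + n)

print("Posterior mean of theta:", post_mean)
print("One posterior draw:", stats.dirichlet(post_shape).rvs()[0])
```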

63 Bayesian Nonparametrics - The Dirichlet process
By partitioning the sample space, the Dirichlet distribution can be used to construct Bayesian histograms. The Dirichlet distribution is a distribution over discrete distributions.
What happens when the number of bins goes to infinity? The Dirichlet process:
$$x \mid G \sim G, \qquad G \sim DP(G_0, \alpha),$$
where G is a random distribution function following the Dirichlet process with base measure $G_0$ (the mean) and precision parameter α.
Posterior: $G \mid x_1, \dots, x_n \sim DP\left(\dfrac{\alpha G_0 + n F_n}{\alpha + n},\; \alpha + n\right)$, where $F_n$ is the empirical distribution function.

64 Bayesian nonparametrics - The Dirichlet process
Sethuraman's representation of the $DP(G_0, \alpha)$ process:
$$G = \sum_{i=1}^{\infty} p_i\, \delta_{\theta_i},$$
where $\delta_\theta$ is a Dirac point mass at θ, and $\theta_1, \theta_2, \dots \overset{iid}{\sim} G_0$.
The weights $p_i$ are given by stick-breaking:
$$p_1 = V_1, \quad p_2 = V_2(1 - V_1), \quad p_3 = V_3(1 - V_1)(1 - V_2), \quad \dots$$
and $V_1, V_2, \dots$ are iid $\mathrm{Beta}(1, \alpha)$, so $E(V_i) = 1/(1 + \alpha)$.
Realizations from G are almost surely discrete. Draws from G may have ties.
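A truncated stick-breaking draw from $DP(G_0, \alpha)$; the standard normal base measure $G_0$, the value of α and the truncation at 1000 atoms are illustrative assumptions:

```python
import numpy as np

def stick_breaking_draw(alpha, base_rvs, n_atoms=1000, rng=None):
    """Approximate draw G = sum_i p_i * delta_{theta_i} from DP(G0, alpha),
    truncated to n_atoms atoms."""
    if rng is None:
        rng = np.random.default_rng()
    V = rng.beta(1.0, alpha, size=n_atoms)
    # p_i = V_i * prod_{j<i} (1 - V_j)
    p = V * np.cumprod(np.concatenate(([1.0], 1 - V[:-1])))
    theta = base_rvs(n_atoms)                             # atom locations theta_i ~ G0
    return p, theta

rng = np.random.default_rng(6)
p, theta = stick_breaking_draw(alpha=2.0, base_rvs=lambda m: rng.normal(size=m), rng=rng)
print("Total weight captured by 1000 atoms:", p.sum())    # close to 1
print("Weight of the largest atom:", p.max())
```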

65 Dirichlet process mixtures
Finite mixture:
$$x_i \mid \theta_i \sim N(\theta_i, \sigma^2), \qquad \Pr(\theta_i = \theta_j) = \pi_j, \; j = 1, \dots, k, \qquad \theta_j \overset{iid}{\sim} N(\mu_0, \tau_0^2).$$
Dirichlet process mixture:
$$x_i \mid \theta_i \sim N(\theta_i, \sigma^2), \qquad \theta_i \mid G \overset{iid}{\sim} G, \qquad G \sim DP(G_0, \alpha).$$
But note that the discreteness of the DP will lead to ties in the $\theta_i$. The number of active mixture components is estimated from the data. Predictions take into account that the next observation may belong to a completely new component.

66 Mixture-of-Experts
Class of models for the conditional density p(y | x). Finite mixture model with mixture weights depending on covariates.
SAGM model (heteroscedastic Gaussian components):
$$p(y \mid x) = \sum_{j=1}^k \pi_j(x)\, \phi\left[y;\, \mu_j(x), \sigma_j(x)\right],$$
where both the mean and the (log) variance of the components are linear functions of covariates.
Bayesian variable selection in μ, σ and π. Finite-step Newton MCMC.

67 Generalized Smooth Mixtures
Mixture-of-Experts type of model:
$$p(y \mid x) = \sum_{j=1}^k \pi_j(x)\, p_j(y \mid x, \mu_j, \varphi_j), \qquad g(\mu_j) = x' \beta_{\mu,j}, \qquad h(\varphi_j) = x' \beta_{\varphi,j}.$$
Combines much of the above: mixtures, variable selection, shrinkage, and the Log Predictive Score (for choosing k).

68 Firm leverage data
Proportion data. Response: leverage = total debt / (total debt + book value of equity). 445 non-financial US firms with positive sales.
Covariates:
tang (tangible assets / book value of total assets)
market2book ((book value of total assets - book value of equity + market value of equity) / book value of total assets)
logsales
profit (earnings before interest, taxes, depreciation, and amortization / book value of total assets)

69 Firm leverage data
[Figure: scatter plots of Leverage against Tang, Market2Book, LogSale and Profit, on the original and logit scales.]

70 Smooth mixture of Beta regressions
Previous literature ignored that the response is a proportion. Recent literature acknowledges this, but typically ignores non-linearities. Here: smooth mixture of Beta regression models
$$y_i \mid \mu_i, \varphi_i \sim \mathrm{Beta}\left[\mu_i \varphi_i,\, (1 - \mu_i)\varphi_i\right], \qquad E(y_i \mid x_i) = \mu_i, \qquad \mathrm{Var}(y_i \mid x_i) = \frac{\mu_i(1 - \mu_i)}{1 + \varphi_i}.$$
We use a logit link for μ and a log link for φ:
$$\ln \frac{\mu_i}{1 - \mu_i} = \alpha_\mu + x_i' \beta_\mu, \qquad \ln \varphi_i = \alpha_\varphi + x_i' \beta_\varphi.$$

71 Model fit - one Beta component
[Figure: the data and the fit from a one-component mixture, Leverage plotted against Profit.]

72 Model fit - three Beta components
[Figure: the data and the fit from a three-component mixture, Leverage plotted against Profit.]


More information

Nonparametric Bayes Uncertainty Quantification

Nonparametric Bayes Uncertainty Quantification Nonparametric Bayes Uncertainty Quantification David Dunson Department of Statistical Science, Duke University Funded from NIH R01-ES017240, R01-ES017436 & ONR Review of Bayes Intro to Nonparametric Bayes

More information

Probabilistic Graphical Models Lecture 20: Gaussian Processes

Probabilistic Graphical Models Lecture 20: Gaussian Processes Probabilistic Graphical Models Lecture 20: Gaussian Processes Andrew Gordon Wilson www.cs.cmu.edu/~andrewgw Carnegie Mellon University March 30, 2015 1 / 53 What is Machine Learning? Machine learning algorithms

More information

Bayesian Inference for DSGE Models. Lawrence J. Christiano

Bayesian Inference for DSGE Models. Lawrence J. Christiano Bayesian Inference for DSGE Models Lawrence J. Christiano Outline State space-observer form. convenient for model estimation and many other things. Preliminaries. Probabilities. Maximum Likelihood. Bayesian

More information

CSC321 Lecture 18: Learning Probabilistic Models

CSC321 Lecture 18: Learning Probabilistic Models CSC321 Lecture 18: Learning Probabilistic Models Roger Grosse Roger Grosse CSC321 Lecture 18: Learning Probabilistic Models 1 / 25 Overview So far in this course: mainly supervised learning Language modeling

More information

Learning Bayesian network : Given structure and completely observed data

Learning Bayesian network : Given structure and completely observed data Learning Bayesian network : Given structure and completely observed data Probabilistic Graphical Models Sharif University of Technology Spring 2017 Soleymani Learning problem Target: true distribution

More information

Chapter 5 continued. Chapter 5 sections

Chapter 5 continued. Chapter 5 sections Chapter 5 sections Discrete univariate distributions: 5.2 Bernoulli and Binomial distributions Just skim 5.3 Hypergeometric distributions 5.4 Poisson distributions Just skim 5.5 Negative Binomial distributions

More information

Lecture 13 Fundamentals of Bayesian Inference

Lecture 13 Fundamentals of Bayesian Inference Lecture 13 Fundamentals of Bayesian Inference Dennis Sun Stats 253 August 11, 2014 Outline of Lecture 1 Bayesian Models 2 Modeling Correlations Using Bayes 3 The Universal Algorithm 4 BUGS 5 Wrapping Up

More information

Curve Fitting Re-visited, Bishop1.2.5

Curve Fitting Re-visited, Bishop1.2.5 Curve Fitting Re-visited, Bishop1.2.5 Maximum Likelihood Bishop 1.2.5 Model Likelihood differentiation p(t x, w, β) = Maximum Likelihood N N ( t n y(x n, w), β 1). (1.61) n=1 As we did in the case of the

More information

Bayesian RL Seminar. Chris Mansley September 9, 2008

Bayesian RL Seminar. Chris Mansley September 9, 2008 Bayesian RL Seminar Chris Mansley September 9, 2008 Bayes Basic Probability One of the basic principles of probability theory, the chain rule, will allow us to derive most of the background material in

More information

Physics 403. Segev BenZvi. Numerical Methods, Maximum Likelihood, and Least Squares. Department of Physics and Astronomy University of Rochester

Physics 403. Segev BenZvi. Numerical Methods, Maximum Likelihood, and Least Squares. Department of Physics and Astronomy University of Rochester Physics 403 Numerical Methods, Maximum Likelihood, and Least Squares Segev BenZvi Department of Physics and Astronomy University of Rochester Table of Contents 1 Review of Last Class Quadratic Approximation

More information

Motivation Scale Mixutres of Normals Finite Gaussian Mixtures Skew-Normal Models. Mixture Models. Econ 690. Purdue University

Motivation Scale Mixutres of Normals Finite Gaussian Mixtures Skew-Normal Models. Mixture Models. Econ 690. Purdue University Econ 690 Purdue University In virtually all of the previous lectures, our models have made use of normality assumptions. From a computational point of view, the reason for this assumption is clear: combined

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate

More information

Bayesian nonparametrics

Bayesian nonparametrics Bayesian nonparametrics 1 Some preliminaries 1.1 de Finetti s theorem We will start our discussion with this foundational theorem. We will assume throughout all variables are defined on the probability

More information

Markov Chain Monte Carlo

Markov Chain Monte Carlo Markov Chain Monte Carlo Recall: To compute the expectation E ( h(y ) ) we use the approximation E(h(Y )) 1 n n h(y ) t=1 with Y (1),..., Y (n) h(y). Thus our aim is to sample Y (1),..., Y (n) from f(y).

More information

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner Fundamentals CS 281A: Statistical Learning Theory Yangqing Jia Based on tutorial slides by Lester Mackey and Ariel Kleiner August, 2011 Outline 1 Probability 2 Statistics 3 Linear Algebra 4 Optimization

More information

Bayesian Phylogenetics:

Bayesian Phylogenetics: Bayesian Phylogenetics: an introduction Marc A. Suchard msuchard@ucla.edu UCLA Who is this man? How sure are you? The one true tree? Methods we ve learned so far try to find a single tree that best describes

More information

Introduction to Bayesian Statistics 1

Introduction to Bayesian Statistics 1 Introduction to Bayesian Statistics 1 STA 442/2101 Fall 2018 1 This slide show is an open-source document. See last slide for copyright information. 1 / 42 Thomas Bayes (1701-1761) Image from the Wikipedia

More information

Lecture 13 : Variational Inference: Mean Field Approximation

Lecture 13 : Variational Inference: Mean Field Approximation 10-708: Probabilistic Graphical Models 10-708, Spring 2017 Lecture 13 : Variational Inference: Mean Field Approximation Lecturer: Willie Neiswanger Scribes: Xupeng Tong, Minxing Liu 1 Problem Setup 1.1

More information