GWAS IV: Bayesian linear (variance component) models

GWAS IV: Bayesian linear (variance component) models. Dr. Oliver Stegle, Christoph Lippert, Prof. Dr. Karsten Borgwardt. Max Planck Institutes Tübingen, Germany. Tübingen Summer 2011.

Motivation: Regression. Linear regression: making predictions, comparison of alternative models. Bayesian and regularized regression: uncertainty in model parameters, generalized basis functions. (Figure: phenotype Y against genotype X with a test input x*.)

Motivation: Further reading, useful material. Christopher M. Bishop, Pattern Recognition and Machine Learning [Bishop, 2006]; Sam Roweis, Gaussian identities [Roweis, 1999].

Outline: Linear Regression II

Linear Regression II. Regression: Noise model and likelihood. Given a dataset $\mathcal{D} = \{x_n, y_n\}_{n=1}^{N}$, where $x_n = (x_{n,1}, \ldots, x_{n,S})$ is $S$-dimensional (for example $S$ SNPs), fit parameters $\theta$ of a regressor $f$ with added Gaussian noise: $y_n = f(x_n; \theta) + \epsilon_n$, where $p(\epsilon \mid \sigma^2) = \mathcal{N}(\epsilon \mid 0, \sigma^2)$. Equivalent likelihood formulation: $p(y \mid X) = \prod_{n=1}^{N} \mathcal{N}(y_n \mid f(x_n), \sigma^2)$.

Linear Regression II. Regression: Choosing a regressor. Choose $f$ to be linear: $p(y \mid X) = \prod_{n=1}^{N} \mathcal{N}(y_n \mid x_n \theta + c, \sigma^2)$. Consider the bias-free case, $c = 0$; otherwise include an additional column of ones in each $x_n$. (Figure: equivalent graphical model.)

Linear Regression II. Linear Regression: Maximum likelihood. Taking the logarithm, we obtain $\ln p(y \mid \theta, X, \sigma^2) = \sum_{n=1}^{N} \ln \mathcal{N}(y_n \mid x_n \theta, \sigma^2) = -\frac{N}{2} \ln 2\pi\sigma^2 - \frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - x_n \theta)^2$, where the second term is the sum of squares. The likelihood is maximized when the squared error is minimized: least squares and maximum likelihood are equivalent.

Linear Regression II. Linear Regression and Least Squares. (Figure: data points $(x_n, y_n)$ and the fitted curve $f(x_n, w)$; C.M. Bishop, Pattern Recognition and Machine Learning.) Error function: $E(\theta) = \frac{1}{2} \sum_{n=1}^{N} (y_n - x_n \theta)^2$.

Linear Regression II. Linear Regression and Least Squares. Derivative w.r.t. a single weight entry $\theta_i$: $\frac{d}{d\theta_i} \ln p(y \mid \theta, \sigma^2) = \frac{d}{d\theta_i} \left[ -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - x_n \theta)^2 \right] = \frac{1}{\sigma^2} \sum_{n=1}^{N} (y_n - x_n \theta) x_{n,i}$. Setting the gradient w.r.t. $\theta$ to zero: $\nabla_{\theta} \ln p(y \mid \theta, \sigma^2) = \frac{1}{\sigma^2} \sum_{n=1}^{N} (y_n - x_n \theta) x_n^{T} = 0 \;\Rightarrow\; \theta_{\mathrm{ML}} = (X^{T} X)^{-1} X^{T} y$, where $(X^{T} X)^{-1} X^{T}$ is the pseudo-inverse. Here, the matrix $X$ is defined as $X = \begin{pmatrix} x_{1,1} & \cdots & x_{1,D} \\ \vdots & \ddots & \vdots \\ x_{N,1} & \cdots & x_{N,D} \end{pmatrix}$.
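
Below is a minimal numerical sketch (not part of the original slides) of the closed-form maximum-likelihood solution $\theta_{\mathrm{ML}} = (X^T X)^{-1} X^T y$ on simulated genotype-like data; the sample sizes and effect sizes are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
N, S = 500, 10                                        # samples, SNPs (illustrative)
X = rng.integers(0, 3, size=(N, S)).astype(float)     # 0/1/2 genotype codes
theta_true = rng.normal(0.0, 0.5, size=S)
y = X @ theta_true + rng.normal(0.0, 1.0, size=N)     # linear model plus Gaussian noise

# theta_ML = (X^T X)^{-1} X^T y; lstsq applies the pseudo-inverse in a numerically stable way
theta_ml, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(theta_ml - theta_true, 2))             # per-SNP estimation error
```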

Linear Regression II. Polynomial Curve Fitting: Motivation. Non-linear relationships; multiple SNPs playing a role for a particular phenotype. (Figure: phenotype Y against genotype X with a test input x*.)

Linear Regression II. Polynomial Curve Fitting: Univariate input $x$. Use the polynomials up to degree $K$ to construct new features from $x$: $f(x, \theta) = \theta_0 + \theta_1 x + \theta_2 x^2 + \cdots + \theta_K x^K = \sum_{k=0}^{K} \theta_k \phi_k(x) = \theta^{T} \phi(x)$, where we defined $\phi(x) = (1, x, x^2, \ldots, x^K)$. $\phi$ can be any feature mapping. Possible to show: the feature map $\phi$ can be expressed in terms of kernels (kernel trick).
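
A short sketch (an assumed example, not from the slides) of the polynomial feature map $\phi(x) = (1, x, \ldots, x^K)$ for a univariate input, fitted by least squares; the degree and the toy data are made up.

```python
import numpy as np

K = 3
x = np.linspace(0.0, 1.0, 50)
y = np.sin(2 * np.pi * x) + np.random.default_rng(1).normal(0.0, 0.1, x.size)

Phi = np.vander(x, K + 1, increasing=True)    # columns 1, x, x^2, ..., x^K
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
y_fit = Phi @ theta                           # fitted curve f(x, theta) = Phi theta
```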

Linear Regression II. Polynomial Curve Fitting: Overfitting. The degree of the polynomial is crucial to avoid under- and overfitting. (Figures: fits for polynomial degrees M = 0, 1, 3 and 9; C.M. Bishop, Pattern Recognition and Machine Learning.)

Linear Regression II. Multivariate regression. Polynomial curve fitting: $f(x, \theta) = \theta_0 + \theta_1 x + \cdots + \theta_K x^K = \sum_{k=0}^{K} \theta_k \phi_k(x) = \phi(x) \theta$. Multivariate regression (SNPs): $f(x, \theta) = \sum_{s=1}^{S} \theta_s x_s = x \theta$. Note: when fitting a single binary SNP genotype $x_i$, a linear model is the most general choice.

Linear Regression II. Regularized Least Squares. Solutions to avoid overfitting: 1. intelligently choose the number of dimensions; 2. regularize the regression weights $\theta$. Quadratically regularized objective function: $E(\theta) = \underbrace{\frac{1}{2} \sum_{n=1}^{N} (y_n - \phi(x_n) \theta)^2}_{\text{squared error}} + \underbrace{\frac{\lambda}{2} \theta^{T} \theta}_{\text{regularizer}}$.
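
The minimizer of this quadratically regularized objective is available in closed form, $\theta = (\lambda I + \Phi^T \Phi)^{-1} \Phi^T y$ (obtained by setting the gradient of $E(\theta)$ to zero; a standard result stated here for completeness). A hedged sketch with illustrative data and an illustrative $\lambda$:

```python
import numpy as np

def ridge_fit(Phi, y, lam):
    """Closed-form minimizer of the quadratically regularized least-squares objective."""
    D = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(D) + Phi.T @ Phi, Phi.T @ y)

# Example: a degree-9 polynomial that would badly overfit 20 points without regularization
rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 20)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, x.size)
Phi = np.vander(x, 10, increasing=True)
theta_ridge = ridge_fit(Phi, y, lam=1e-3)     # shrinks the weights and tames overfitting
```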

Linear Regression II. Regularized Least Squares: More general regularizers. More general regularization: $E(\theta) = \underbrace{\frac{1}{2} \sum_{n=1}^{N} (y_n - \phi(x_n) \theta)^2}_{\text{squared error}} + \underbrace{\frac{\lambda}{2} \sum_{d=1}^{D} |\theta_d|^{q}}_{\text{regularizer}}$. (Figure: contours of the regularizer for q = 0.5, 1, 2, 4; q = 1 is the sparsity-inducing Lasso penalty, q = 2 the quadratic penalty; C.M. Bishop, Pattern Recognition and Machine Learning.)

Linear Regression II. Loss functions and related methods. Even more general: a generic loss function, $E(\theta) = \underbrace{\frac{1}{2} \sum_{n=1}^{N} L(y_n - \phi(x_n) \theta)}_{\text{loss}} + \underbrace{\frac{\lambda}{2} \sum_{d=1}^{D} |\theta_d|^{q}}_{\text{regularizer}}$. Many state-of-the-art machine learning methods can be expressed within this framework: linear regression (squared loss, squared regularizer); support vector machine (hinge loss, squared regularizer); Lasso (squared loss, L1 regularizer). Inference: minimize the cost function $E(\theta)$, yielding a point estimate for $\theta$. Q: How to determine $q$ and a suitable loss function?

Linear Regression II. Loss functions and related methods: Cross-validation, minimization of the expected loss. For each candidate model $H$: split the data into $K$ folds; perform a training-test evaluation for each fold; assess the average loss on the test sets, $E_H = \frac{1}{K} \sum_{k=1}^{K} L^{\text{test}}_k$. (Figure: the total set of samples split into training and test sets over folds 1 to 3.)
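
A compact sketch (an assumed implementation, with illustrative fold count and $\lambda$ grid) of K-fold cross-validation for choosing the regularization strength of the ridge objective above:

```python
import numpy as np

def kfold_cv_mse(Phi, y, lam, K=3, seed=0):
    """Average test mean-squared error of ridge regression over K folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, K)
    losses = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        D = Phi.shape[1]
        theta = np.linalg.solve(lam * np.eye(D) + Phi[train].T @ Phi[train],
                                Phi[train].T @ y[train])
        losses.append(np.mean((y[test] - Phi[test] @ theta) ** 2))
    return np.mean(losses)    # E_H: average test loss across folds

# Pick the lambda with the smallest cross-validated loss (Phi, y as in the ridge sketch):
# best_lam = min([1e-4, 1e-2, 1e0], key=lambda lam: kfold_cv_mse(Phi, y, lam))
```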

Linear Regression II. Probabilistic interpretation. So far: minimization of error functions. Back to probabilities? $E(\theta) = \underbrace{\frac{1}{2} \sum_{n=1}^{N} (y_n - \phi(x_n) \theta)^2}_{\text{squared error}} + \underbrace{\frac{\lambda}{2} \theta^{T} \theta}_{\text{regularizer}} = -\ln \prod_{n=1}^{N} \mathcal{N}(y_n \mid \phi(x_n) \theta, \sigma^2) - \ln \mathcal{N}(\theta \mid 0, \tfrac{1}{\lambda} I) = -\ln p(y \mid \theta, \Phi(X), \sigma^2) - \ln p(\theta)$, up to additive constants and a scaling of the squared-error term by $\sigma^2$. Most alternative choices of regularizers and loss functions can be mapped to an equivalent probabilistic representation in a similar way.

Outline: Bayesian linear regression

Bayesian linear regression. Likelihood as before: $p(y \mid X, \theta, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(y_n \mid \phi(x_n) \theta, \sigma^2)$. Define a conjugate prior over $\theta$: $p(\theta) = \mathcal{N}(\theta \mid m_0, S_0)$.

Bayesian linear regression. Posterior probability of $\theta$: $p(\theta \mid y, X, \sigma^2) \propto \prod_{n=1}^{N} \mathcal{N}(y_n \mid \phi(x_n) \theta, \sigma^2) \, \mathcal{N}(\theta \mid m_0, S_0) = \mathcal{N}(y \mid \Phi(X) \theta, \sigma^2 I) \, \mathcal{N}(\theta \mid m_0, S_0)$, which gives $p(\theta \mid y, X, \sigma^2) = \mathcal{N}(\theta \mid \mu_{\theta}, \Sigma_{\theta})$ with $\mu_{\theta} = \Sigma_{\theta} \left( S_0^{-1} m_0 + \frac{1}{\sigma^2} \Phi(X)^{T} y \right)$ and $\Sigma_{\theta} = \left[ S_0^{-1} + \frac{1}{\sigma^2} \Phi(X)^{T} \Phi(X) \right]^{-1}$.
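
A brief sketch (assumed helper name, simulated data) of the posterior over the weights for a conjugate Gaussian prior $\mathcal{N}(\theta \mid m_0, S_0)$, following the formulas above:

```python
import numpy as np

def blr_posterior(Phi, y, sigma2, m0, S0):
    """Posterior mean and covariance of the weights in Bayesian linear regression."""
    S0_inv = np.linalg.inv(S0)
    Sigma = np.linalg.inv(S0_inv + Phi.T @ Phi / sigma2)   # posterior covariance
    mu = Sigma @ (S0_inv @ m0 + Phi.T @ y / sigma2)        # posterior mean
    return mu, Sigma

rng = np.random.default_rng(4)
Phi = rng.normal(size=(100, 3))
y = Phi @ np.array([1.0, -0.5, 0.2]) + rng.normal(0.0, 0.3, 100)
mu, Sigma = blr_posterior(Phi, y, sigma2=0.09, m0=np.zeros(3), S0=np.eye(3))
```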

Bayesian linear regression. Prior choice: regularized (ridge) regression. In this case $p(\theta) = \mathcal{N}(\theta \mid 0, \tfrac{1}{\lambda} I)$, and the posterior becomes $p(\theta \mid y, X, \sigma^2) = \mathcal{N}(\theta \mid \mu_{\theta}, \Sigma_{\theta})$ with $\mu_{\theta} = \Sigma_{\theta} \frac{1}{\sigma^2} \Phi(X)^{T} y$ and $\Sigma_{\theta} = \left[ \lambda I + \frac{1}{\sigma^2} \Phi(X)^{T} \Phi(X) \right]^{-1}$. Equivalent to the maximum likelihood estimate for $\lambda \to 0$.

Bayesian linear regression. Example: the posterior over functions after observing 0, 1 and 20 data points. (Figures: C.M. Bishop, Pattern Recognition and Machine Learning.)

Bayesian linear regression. Making predictions. Prediction for a fixed weight $\hat{\theta}$ at an input $x^{*}$ is trivial: $p(y^{*} \mid x^{*}, \hat{\theta}, \sigma^2) = \mathcal{N}(y^{*} \mid \phi(x^{*}) \hat{\theta}, \sigma^2)$. Integrate over $\theta$ to take the posterior uncertainty into account: $p(y^{*} \mid x^{*}, \mathcal{D}) = \int_{\theta} p(y^{*} \mid x^{*}, \theta, \sigma^2) \, p(\theta \mid X, y, \sigma^2) = \int_{\theta} \mathcal{N}(y^{*} \mid \phi(x^{*}) \theta, \sigma^2) \, \mathcal{N}(\theta \mid \mu_{\theta}, \Sigma_{\theta}) = \mathcal{N}(y^{*} \mid \phi(x^{*}) \mu_{\theta}, \; \sigma^2 + \phi(x^{*})^{T} \Sigma_{\theta} \phi(x^{*}))$. Key: the prediction is again Gaussian. The predictive variance is increased due to the posterior uncertainty in $\theta$.
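
A small sketch (assumed helper name) of the predictive mean and variance at a new input; the commented usage reuses mu and Sigma from the previous snippet:

```python
import numpy as np

def blr_predict(phi_star, mu, Sigma, sigma2):
    """Predictive mean and variance of y* for features phi_star of a new input x*."""
    mean = phi_star @ mu
    var = sigma2 + phi_star @ Sigma @ phi_star
    return mean, var

# Example usage with the posterior computed above:
# phi_star = np.array([1.0, 0.3, 0.3 ** 2])     # e.g. polynomial features of x* = 0.3
# mean, var = blr_predict(phi_star, mu, Sigma, sigma2=0.09)
```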

Outline: Model comparison and hypothesis testing

Model comparison and hypothesis testing. Model comparison: Motivation. What degree of polynomial describes the data best? Is the linear model at all appropriate? Association testing. (Figure: genome-to-phenome setting, a matrix of SNPs across individuals paired with phenotype measurements.)

Model comparison and hypothesis testing. Bayesian model comparison: How do we choose among alternative models? Assume we want to choose among models $H_0, \ldots, H_M$ for a dataset $\mathcal{D}$. Posterior probability for a particular model $i$: $p(H_i \mid \mathcal{D}) \propto \underbrace{p(\mathcal{D} \mid H_i)}_{\text{evidence}} \, \underbrace{p(H_i)}_{\text{prior}}$.

Model comparison and hypothesis testing. Bayesian model comparison: How to calculate the evidence. The evidence is not the model likelihood: $p(\mathcal{D} \mid H_i) = \int_{\Theta} d\Theta \, p(\mathcal{D} \mid \Theta) \, p(\Theta)$ for model parameters $\Theta$. Remember: $p(\Theta \mid H_i, \mathcal{D}) = \frac{p(\mathcal{D} \mid H_i, \Theta) \, p(\Theta)}{p(\mathcal{D} \mid H_i)}$, i.e. posterior = likelihood × prior / evidence.
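
For the Gaussian models used here the evidence integral is tractable: with a Gaussian prior and known noise, $p(y \mid X) = \mathcal{N}(y \mid \Phi m_0, \; \sigma^2 I + \Phi S_0 \Phi^{T})$. A sketch (assumed helper name) of evaluating this log evidence:

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_evidence(Phi, y, sigma2, m0, S0):
    """Log marginal likelihood of Bayesian linear regression with Gaussian prior N(m0, S0)."""
    mean = Phi @ m0
    cov = sigma2 * np.eye(len(y)) + Phi @ S0 @ Phi.T
    return multivariate_normal.logpdf(y, mean=mean, cov=cov)

# Example usage with the data from the posterior sketch above:
# log_evidence(Phi, y, sigma2=0.09, m0=np.zeros(3), S0=np.eye(3))
```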

Model comparison and hypothesis testing. Bayesian model comparison: Occam's razor. The evidence integral penalizes overly complex models. A model with few parameters and a lower maximum likelihood ($H_1$) may win over a model with a peaked likelihood that requires many more parameters ($H_2$). (Figure: likelihood curves for $H_1$ and $H_2$ around $w_{\mathrm{MAP}}$; C.M. Bishop, Pattern Recognition and Machine Learning.)

Model comparison and hypothesis testing. Application to GWAS: Relevance of a single SNP. Consider an association study. $H_0$: no association, $p(y \mid H_0, X, \Theta_0) = \mathcal{N}(y \mid 0, \sigma^2 I)$, with evidence $p(\mathcal{D} \mid H_0) = \int_{\sigma^2} \mathcal{N}(y \mid 0, \sigma^2 I) \, p(\sigma^2)$. $H_1$: linear association, $p(y \mid H_1, x_i, \Theta_1) = \mathcal{N}(y \mid x_i \theta, \sigma^2 I)$, with evidence $p(\mathcal{D} \mid H_1) = \int_{\sigma^2, \theta} \mathcal{N}(y \mid x_i \theta, \sigma^2 I) \, p(\sigma^2) \, p(\theta)$. Depending on the choice of priors $p(\sigma^2)$ and $p(\theta)$, the required integrals are often tractable in closed form.

Model comparison and hypothesis testing. Application to GWAS: Scoring models. Similar to likelihood ratios, the ratio of the evidences, the (log) Bayes factor, can be used to score alternative models: $\mathrm{BF} = \ln \frac{p(\mathcal{D} \mid H_1)}{p(\mathcal{D} \mid H_0)}$. (Figure: LOD scores and Bayes factors along a region of chromosome 7 near SLC35B4, with the 0.01% FPR thresholds marked.)
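
A simplified sketch of this score for a single SNP: to keep both evidences Gaussian and closed-form, the noise variance is held fixed and the effect prior is $\theta \sim \mathcal{N}(0, \sigma_g^2)$ (the slides additionally integrate over a prior $p(\sigma^2)$); the variance values are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_bayes_factor(x_snp, y, sigma2=1.0, sigma2_g=0.5):
    """Log Bayes factor of H1 (linear SNP effect) against H0 (no association)."""
    N = len(y)
    cov0 = sigma2 * np.eye(N)                                       # H0: noise only
    cov1 = sigma2_g * np.outer(x_snp, x_snp) + sigma2 * np.eye(N)   # H1: effect marginalized out
    log_ev0 = multivariate_normal.logpdf(y, mean=np.zeros(N), cov=cov0)
    log_ev1 = multivariate_normal.logpdf(y, mean=np.zeros(N), cov=cov1)
    return log_ev1 - log_ev0
```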

Model comparison and hypothesis testing. Application to GWAS: Posterior probability of an association. Bayes factors are useful; however, we would like a probabilistic answer to how certain an association really is. Posterior probability of $H_1$: $p(H_1 \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid H_1) \, p(H_1)}{p(\mathcal{D})} = \frac{p(\mathcal{D} \mid H_1) \, p(H_1)}{p(\mathcal{D} \mid H_1) \, p(H_1) + p(\mathcal{D} \mid H_0) \, p(H_0)}$, using $p(H_1 \mid \mathcal{D}) + p(H_0 \mid \mathcal{D}) = 1$; here $p(H_1)$ is the prior probability of observing a real association.
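
Converting a log Bayes factor into this posterior probability only requires the prior $p(H_1)$; a small sketch, where the prior value is a placeholder assumption:

```python
import numpy as np

def posterior_h1(log_bf, prior_h1=1e-4):
    """Posterior probability of H1 from a log Bayes factor and a prior p(H1)."""
    log_odds = log_bf + np.log(prior_h1) - np.log1p(-prior_h1)   # posterior log-odds
    return 1.0 / (1.0 + np.exp(-log_odds))                       # sigmoid of the log-odds
```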

Model comparison and hypothesis testing. Bayes factor versus likelihood ratio. Bayes factor: models of different complexity can be objectively compared; statistical significance as the posterior probability of a model; typically hard to compute. Likelihood ratio: scales with the number of parameters; has a known null distribution, yielding p-values; often easy to compute.

Model comparison and hypothesis testing. Marginal likelihood of variance component models. Consider a linear model accounting for a set of measured SNPs $X$: $p(y \mid X, \theta, \sigma^2) = \mathcal{N}\left(y \mid \sum_{s=1}^{S} x_s \theta_s, \sigma^2 I\right)$. Choose an identical Gaussian prior for all weights: $p(\theta) = \prod_{s=1}^{S} \mathcal{N}(\theta_s \mid 0, \sigma_g^2)$. Marginal likelihood: $p(y \mid X, \sigma^2, \sigma_g^2) = \int_{\theta} \mathcal{N}(y \mid X \theta, \sigma^2 I) \, \mathcal{N}(\theta \mid 0, \sigma_g^2 I) = \mathcal{N}(y \mid 0, \sigma_g^2 X X^{T} + \sigma^2 I)$. The number of hyperparameters is independent of the number of SNPs.
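
A direct sketch of evaluating this marginal likelihood (the same Gaussian closed form as the evidence above, now with the genetic covariance $\sigma_g^2 X X^T$); function and variable names are illustrative:

```python
import numpy as np
from scipy.stats import multivariate_normal

def vc_log_marginal(X, y, sigma2_g, sigma2_e):
    """Log marginal likelihood N(y | 0, sigma_g^2 X X^T + sigma_e^2 I) of the variance component model."""
    K = X @ X.T                                    # genetic similarity matrix
    cov = sigma2_g * K + sigma2_e * np.eye(len(y))
    return multivariate_normal.logpdf(y, mean=np.zeros(len(y)), cov=cov)
```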

Model comparison and hypothesis testing. Marginal likelihood of variance component models, application to GWAS: The missing heritability paradox. Complex traits are regulated by a large number of small effects. Human height: the best single SNP explains little variance, yet the parents are highly predictive of the height of the child!

Model comparison and hypothesis testing. Marginal likelihood of variance component models, application to GWAS: Multivariate additive models for complex traits. Multivariate model over causal SNPs: $p(y \mid X, \theta, \sigma^2) = \mathcal{N}\left(y \mid \sum_{s \in \text{causal}} x_s \theta_s, \sigma^2 I\right)$. Common variance prior for causal SNPs: $p(\theta_s) = \mathcal{N}(\theta_s \mid 0, \sigma_g^2)$. Marginalize out the weights: $p(y \mid X, \sigma_g^2, \sigma_e^2) = \mathcal{N}\left(y \mid 0, \sigma_g^2 \sum_{s \in \text{causal}} x_s x_s^{T} + \sigma_e^2 I\right)$. Which SNPs are causal? Approximation: consider all SNPs [Yang et al., 2011], $p(y \mid X, \sigma_g^2, \sigma_e^2) = \mathcal{N}(y \mid 0, \sigma_g^2 X X^{T} + \sigma_e^2 I)$.

Model comparison and hypothesis testing. Marginal likelihood of variance component models, application to GWAS: Approximate variance model, $p(y \mid X, \sigma_g^2, \sigma_e^2) = \mathcal{N}(y \mid 0, \sigma_g^2 X X^{T} + \sigma_e^2 I)$. The genetic variance $\sigma_g^2$ can be partitioned across chromosomes. Heritability: $h^2 = \frac{\sigma_g^2}{\sigma_g^2 + \sigma_e^2}$. (Figure: genome partitioning of genetic variance across chromosomes, [Yang et al., 2011].)
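
A deliberately crude sketch of estimating the heritability implied by this model: a grid search over $(\sigma_g^2, \sigma_e^2)$ that maximizes the marginal likelihood. Real analyses use REML or dedicated mixed-model software and typically standardize the genotypes and normalize the kinship matrix; the grid range and the simulated data here are illustrative assumptions.

```python
import itertools
import numpy as np
from scipy.stats import multivariate_normal

def estimate_h2(X, y, grid=np.linspace(0.05, 2.0, 15)):
    """Heritability h^2 = sigma_g^2 / (sigma_g^2 + sigma_e^2) via grid-search maximum marginal likelihood."""
    K = X @ X.T
    def loglik(sg2, se2):
        cov = sg2 * K + se2 * np.eye(len(y))
        return multivariate_normal.logpdf(y, mean=np.zeros(len(y)), cov=cov)
    sg2, se2 = max(itertools.product(grid, grid), key=lambda p: loglik(*p))
    return sg2 / (sg2 + se2)

# Example with simulated genotypes and many small effects:
# rng = np.random.default_rng(5)
# X = rng.integers(0, 3, size=(200, 50)).astype(float)
# y = X @ rng.normal(0.0, 0.1, 50) + rng.normal(0.0, 1.0, 200)
# print(estimate_h2(X, y))
```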

Outline: Summary

Summary. Generalized linear models for curve fitting and multivariate regression. Maximum likelihood and least-squares regression are identical. Construction of features using a mapping $\phi$. Regularized least squares and other models that correspond to different choices of loss functions. Bayesian linear regression. Model comparison and Occam's razor. Variance component models in GWAS.

Summary. Tasks: Prove that the product of two Gaussian densities is (proportional to) a Gaussian density. Try to understand the convolution formula for Gaussian random variables.

References. C. Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006. S. Roweis. Gaussian identities. Technical report, 1999. URL http://www.cs.nyu.edu/~roweis/notes/gaussid.pdf. J. Yang, T. Manolio, L. Pasquale, E. Boerwinkle, N. Caporaso, J. Cunningham, M. de Andrade, B. Feenstra, E. Feingold, M. Hayes, et al. Genome partitioning of genetic variation for complex traits using common SNPs. Nature Genetics, 43(6):519-525, 2011.