GWAS IV: Bayesian linear (variance component) models

1 GWAS IV: Bayesian linear (variance component) models. Dr. Oliver Stegle, Christoph Lippert, Prof. Dr. Karsten Borgwardt. Max-Planck-Institutes Tübingen, Germany. Tübingen, Summer 2011.

2 Motivation: Regression. Linear regression: making predictions; comparison of alternative models. Bayesian and regularized regression: uncertainty in model parameters; generalized basis functions.

5 Motivation: Further reading, useful material. Christopher M. Bishop: Pattern Recognition and Machine Learning [Bishop, 2006]. Sam Roweis: Gaussian identities [Roweis, 1999].

6 Outline

7 Linear Regression II: Outline

8 Linear Regression II: Regression, noise model and likelihood. Given a dataset $\mathcal{D} = \{x_n, y_n\}_{n=1}^{N}$, where $x_n = \{x_{n,1}, \ldots, x_{n,S}\}$ is $S$-dimensional (for example $S$ SNPs), fit parameters $\theta$ of a regressor $f$ with added Gaussian noise: $y_n = f(x_n; \theta) + \epsilon_n$, where $p(\epsilon \mid \sigma^2) = \mathcal{N}(\epsilon \mid 0, \sigma^2)$. Equivalent likelihood formulation: $p(\mathbf{y} \mid \mathbf{X}) = \prod_{n=1}^{N} \mathcal{N}(y_n \mid f(x_n), \sigma^2)$.
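
A minimal sketch of this noise model in Python (the variable names, toy sizes, and simulated genotypes are illustrative assumptions, not from the slides); it generates phenotypes under the linear-Gaussian model and evaluates the log-likelihood above:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
N, S = 100, 5                                         # individuals, SNPs (toy sizes)
X = rng.integers(0, 3, size=(N, S)).astype(float)     # genotypes coded 0/1/2
theta_true = rng.normal(0.0, 0.5, size=S)
sigma2 = 0.25
y = X @ theta_true + rng.normal(0.0, np.sqrt(sigma2), size=N)   # y_n = f(x_n; theta) + eps_n

def log_likelihood(theta, X, y, sigma2):
    """ln p(y | X, theta, sigma^2) = sum_n ln N(y_n | x_n^T theta, sigma^2)."""
    return norm.logpdf(y, loc=X @ theta, scale=np.sqrt(sigma2)).sum()

print(log_likelihood(theta_true, X, y, sigma2))
```

The later sketches below reuse these toy variables.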

9 Linear Regression II: Regression, choosing a regressor. Choose $f$ to be linear: $p(\mathbf{y} \mid \mathbf{X}) = \prod_{n=1}^{N} \mathcal{N}(y_n \mid x_n^{\mathsf{T}} \theta + c, \sigma^2)$. Consider the bias-free case, $c = 0$; otherwise include an additional column of ones in each $x_n$. (The slide also shows the equivalent graphical model.)

11 Linear Regression II: Linear regression, maximum likelihood. Taking the logarithm, we obtain $\ln p(\mathbf{y} \mid \theta, \mathbf{X}, \sigma^2) = \sum_{n=1}^{N} \ln \mathcal{N}(y_n \mid x_n^{\mathsf{T}} \theta, \sigma^2) = -\frac{N}{2} \ln 2\pi\sigma^2 - \frac{1}{2\sigma^2} \underbrace{\sum_{n=1}^{N} (y_n - x_n^{\mathsf{T}} \theta)^2}_{\text{sum of squares}}$. The likelihood is maximized when the squared error is minimized: least squares and maximum likelihood are equivalent.

14 Linear Regression II: Linear regression and least squares. (Figure: the squared error sums the vertical distances between the observations $y_n$ and the fit $f(x_n, \theta)$; from C.M. Bishop, Pattern Recognition and Machine Learning.) $E(\theta) = \frac{1}{2} \sum_{n=1}^{N} (y_n - x_n^{\mathsf{T}} \theta)^2$.

15 Linear Regression II: Linear regression and least squares. Derivative w.r.t. a single weight entry $\theta_i$: $\frac{d}{d\theta_i} \ln p(\mathbf{y} \mid \theta, \sigma^2) = \frac{d}{d\theta_i} \left[ -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - x_n^{\mathsf{T}} \theta)^2 \right] = \frac{1}{\sigma^2} \sum_{n=1}^{N} (y_n - x_n^{\mathsf{T}} \theta) x_{n,i}$. Setting the gradient w.r.t. $\theta$ to zero, $\nabla_\theta \ln p(\mathbf{y} \mid \theta, \sigma^2) = \frac{1}{\sigma^2} \sum_{n=1}^{N} (y_n - x_n^{\mathsf{T}} \theta) x_n^{\mathsf{T}} = \mathbf{0} \;\Rightarrow\; \theta_{\mathrm{ML}} = \underbrace{(\mathbf{X}^{\mathsf{T}} \mathbf{X})^{-1} \mathbf{X}^{\mathsf{T}}}_{\text{pseudo-inverse}} \mathbf{y}$. Here the matrix $\mathbf{X}$ is defined row-wise by the inputs, with entries $x_{n,d}$ for $n = 1, \ldots, N$ and $d = 1, \ldots, D$.
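
Continuing the toy example above, a sketch of the maximum-likelihood solution; the normal-equation form mirrors the pseudo-inverse, while a least-squares solver is numerically preferable:

```python
# Maximum likelihood = least squares (reusing X and y from the simulation above).
theta_ml = np.linalg.solve(X.T @ X, X.T @ y)              # (X^T X)^{-1} X^T y
theta_ml_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)    # numerically more stable
print(np.allclose(theta_ml, theta_ml_lstsq))
```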

18 Linear Regression II: Polynomial curve fitting, motivation. Non-linear relationships; multiple SNPs playing a role for a particular phenotype.

19 Linear Regression II: Polynomial curve fitting, univariate input $x$. Use the polynomials up to degree $K$ to construct new features from $x$: $f(x, \theta) = \theta_0 + \theta_1 x + \theta_2 x^2 + \cdots + \theta_K x^K = \sum_{k=0}^{K} \theta_k \phi_k(x) = \theta^{\mathsf{T}} \phi(x)$, where we defined $\phi(x) = (1, x, x^2, \ldots, x^K)$. $\phi$ can be any feature mapping. Possible to show: the feature map $\phi$ can be expressed in terms of kernels (kernel trick).
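
A short sketch of such a polynomial feature map and a least-squares fit in feature space (the degree, toy inputs, and target are illustrative):

```python
def poly_features(x, K):
    """phi(x) = (1, x, x^2, ..., x^K) for a vector of scalar inputs x."""
    return np.vander(x, N=K + 1, increasing=True)

x1d = rng.uniform(-1.0, 1.0, size=50)
t = np.sin(3 * x1d) + rng.normal(0.0, 0.1, size=50)    # toy non-linear target
Phi = poly_features(x1d, K=5)
theta_poly, *_ = np.linalg.lstsq(Phi, t, rcond=None)   # least squares in feature space
```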

21 Linear Regression II: Polynomial curve fitting, overfitting. The degree of the polynomial is crucial to avoid under- and overfitting. (Figure: polynomial fits of increasing degree $M$ to the same data set; from C.M. Bishop, Pattern Recognition and Machine Learning.)

25 Linear Regression II: Multivariate regression. Polynomial curve fitting: $f(x, \theta) = \theta_0 + \theta_1 x + \cdots + \theta_K x^K = \sum_{k=0}^{K} \theta_k \phi_k(x) = \phi(x)^{\mathsf{T}} \theta$. Multivariate regression (SNPs): $f(\mathbf{x}, \theta) = \sum_{s=1}^{S} \theta_s x_s = \mathbf{x}^{\mathsf{T}} \theta$. Note: when fitting a single binary SNP genotype $x_i$, a linear model is most general!

27 Linear Regression II: Regularized least squares. Solutions to avoid overfitting: 1. intelligently choose the number of dimensions; 2. regularize the regression weights $\theta$. Quadratically regularized objective function: $E(\theta) = \underbrace{\frac{1}{2} \sum_{n=1}^{N} (y_n - \phi(x_n)^{\mathsf{T}} \theta)^2}_{\text{squared error}} + \underbrace{\frac{\lambda}{2} \theta^{\mathsf{T}} \theta}_{\text{regularizer}}$.

29 Linear Regression II: Regularized least squares, more general regularizers. More general regularization: $E(\theta) = \underbrace{\frac{1}{2} \sum_{n=1}^{N} (y_n - \phi(x_n)^{\mathsf{T}} \theta)^2}_{\text{squared error}} + \underbrace{\frac{\lambda}{2} \sum_{d=1}^{D} |\theta_d|^q}_{\text{regularizer}}$. (Figure: contours of the regularizer for $q = 0.5, 1, 2, 4$; $q = 1$ gives the sparse Lasso penalty, $q = 2$ the quadratic penalty; from C.M. Bishop, Pattern Recognition and Machine Learning.)
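
For the quadratic case ($q = 2$, ridge regression) the minimizer is available in closed form; a sketch, with lam as an illustrative regularization strength:

```python
def ridge_fit(Phi, y, lam):
    """Minimize 0.5 * ||y - Phi theta||^2 + 0.5 * lam * ||theta||^2 in closed form."""
    D = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ y)

theta_ridge = ridge_fit(Phi, t, lam=1e-2)
```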

32 Linear Regression II: Loss functions and related methods. Even more general, a generic loss function: $E(\theta) = \underbrace{\frac{1}{2} \sum_{n=1}^{N} L\big(y_n, \phi(x_n)^{\mathsf{T}} \theta\big)}_{\text{loss}} + \underbrace{\frac{\lambda}{2} \sum_{d=1}^{D} |\theta_d|^q}_{\text{regularizer}}$. Many state-of-the-art machine learning methods can be expressed within this framework: linear regression (squared loss, squared regularizer); support vector machine (hinge loss, squared regularizer); Lasso (squared loss, L1 regularizer). Inference: minimize the cost function $E(\theta)$, yielding a point estimate for $\theta$. Q: how to determine $q$ and a suitable loss function?

36 Linear Regression II: Loss functions and related methods. Cross validation: minimization of expected loss. For each candidate model $\mathcal{H}$: split the data into $K$ folds; run a training-test evaluation for each fold; assess the average loss on the test sets, $E_{\mathcal{H}} = \frac{1}{K} \sum_{k=1}^{K} L^{\text{test}}_k$. (Figure: the total set of samples split into training set and test set for fold 1, fold 2, fold 3.)
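
A bare-bones K-fold cross-validation loop, a sketch that reuses the ridge fit from above and squared error as the loss (both choices are for illustration only):

```python
def kfold_cv_mse(Phi, y, lam, K=3):
    """Average held-out squared error over K folds for the ridge model above."""
    folds = np.array_split(rng.permutation(len(y)), K)
    losses = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        theta = ridge_fit(Phi[train], y[train], lam)
        losses.append(np.mean((y[test] - Phi[test] @ theta) ** 2))
    return np.mean(losses)

print(kfold_cv_mse(Phi, t, lam=1e-2))
```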

37 Linear Regression II: Probabilistic interpretation. So far: minimization of error functions. Back to probabilities? $E(\theta) = \underbrace{\frac{1}{2} \sum_{n=1}^{N} (y_n - \phi(x_n)^{\mathsf{T}} \theta)^2}_{\text{squared error}} + \underbrace{\frac{\lambda}{2} \theta^{\mathsf{T}} \theta}_{\text{regularizer}} = -\sum_{n=1}^{N} \ln \mathcal{N}(y_n \mid \phi(x_n)^{\mathsf{T}} \theta, \sigma^2) - \ln \mathcal{N}\big(\theta \mid \mathbf{0}, \tfrac{1}{\lambda} \mathbf{I}\big) = -\ln p(\mathbf{y} \mid \theta, \Phi(\mathbf{X}), \sigma^2) - \ln p(\theta)$. Most alternative choices of regularizers and loss functions can be mapped to an equivalent probabilistic representation in a similar way.
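
This correspondence can be checked numerically; a sketch that assumes unit noise variance (so the regularized objective and the negative log-posterior agree up to constants) and reuses Phi, t, and ridge_fit from the sketches above:

```python
from scipy.optimize import minimize

lam, sigma2_n = 1e-2, 1.0   # unit noise variance for the correspondence to hold exactly

def neg_log_posterior(theta):
    ll = norm.logpdf(t, loc=Phi @ theta, scale=np.sqrt(sigma2_n)).sum()   # ln p(y | theta)
    lp = norm.logpdf(theta, loc=0.0, scale=np.sqrt(1.0 / lam)).sum()      # ln p(theta)
    return -(ll + lp)

theta_map = minimize(neg_log_posterior, np.zeros(Phi.shape[1])).x
print(np.allclose(theta_map, ridge_fit(Phi, t, lam), atol=1e-4))
```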

41 Bayesian linear regression: Outline

42 Bayesian linear regression. Likelihood as before: $p(\mathbf{y} \mid \mathbf{X}, \theta, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(y_n \mid \phi(x_n)^{\mathsf{T}} \theta, \sigma^2)$. Define a conjugate prior over $\theta$: $p(\theta) = \mathcal{N}(\theta \mid \mathbf{m}_0, \mathbf{S}_0)$.

44 Bayesian linear regression: Posterior probability of $\theta$. $p(\theta \mid \mathbf{y}, \mathbf{X}, \sigma^2) \propto \prod_{n=1}^{N} \mathcal{N}(y_n \mid \phi(x_n)^{\mathsf{T}} \theta, \sigma^2) \, \mathcal{N}(\theta \mid \mathbf{m}_0, \mathbf{S}_0) = \mathcal{N}(\mathbf{y} \mid \Phi(\mathbf{X}) \theta, \sigma^2 \mathbf{I}) \, \mathcal{N}(\theta \mid \mathbf{m}_0, \mathbf{S}_0) = \mathcal{N}(\theta \mid \mu_\theta, \Sigma_\theta)$, where $\mu_\theta = \Sigma_\theta \big( \mathbf{S}_0^{-1} \mathbf{m}_0 + \frac{1}{\sigma^2} \Phi(\mathbf{X})^{\mathsf{T}} \mathbf{y} \big)$ and $\Sigma_\theta = \big[ \mathbf{S}_0^{-1} + \frac{1}{\sigma^2} \Phi(\mathbf{X})^{\mathsf{T}} \Phi(\mathbf{X}) \big]^{-1}$.

45 Bayesian linear regression: Prior choice. Choice of prior: regularized (ridge) regression. In this case $p(\theta) = \mathcal{N}\big(\theta \mid \mathbf{0}, \tfrac{1}{\lambda} \mathbf{I}\big)$, and $p(\theta \mid \mathbf{y}, \mathbf{X}, \sigma^2) = \mathcal{N}(\theta \mid \mu_\theta, \Sigma_\theta)$ with $\mu_\theta = \Sigma_\theta \big( \frac{1}{\sigma^2} \Phi(\mathbf{X})^{\mathsf{T}} \mathbf{y} \big)$ and $\Sigma_\theta = \big[ \lambda \mathbf{I} + \frac{1}{\sigma^2} \Phi(\mathbf{X})^{\mathsf{T}} \Phi(\mathbf{X}) \big]^{-1}$. Equivalent to the maximum likelihood estimate for $\lambda \to 0$!
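
A sketch of this ridge-prior posterior update in code, reusing Phi and t from the polynomial example (lam and the noise variance are illustrative values):

```python
def blr_posterior(Phi, y, lam, sigma2):
    """Posterior N(theta | mu, Sigma) under the prior N(0, I / lam) and noise variance sigma2."""
    D = Phi.shape[1]
    Sigma = np.linalg.inv(lam * np.eye(D) + Phi.T @ Phi / sigma2)
    mu = Sigma @ (Phi.T @ y / sigma2)
    return mu, Sigma

mu_theta, Sigma_theta = blr_posterior(Phi, t, lam=1e-2, sigma2=0.1 ** 2)
```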

48 Bayesian linear regression: Example. (Figures: evolution of the posterior over $\theta$ and of sampled regression functions after observing 0, 1, and 20 data points; from C.M. Bishop, Pattern Recognition and Machine Learning.)

51 Bayesian linear regression: Making predictions. Prediction for a fixed weight $\hat{\theta}$ at an input $x^*$ is trivial: $p(y^* \mid x^*, \hat{\theta}, \sigma^2) = \mathcal{N}(y^* \mid \phi(x^*)^{\mathsf{T}} \hat{\theta}, \sigma^2)$. Integrate over $\theta$ to take the posterior uncertainty into account: $p(y^* \mid x^*, \mathcal{D}) = \int_\theta p(y^* \mid x^*, \theta, \sigma^2) \, p(\theta \mid \mathbf{X}, \mathbf{y}, \sigma^2) = \int_\theta \mathcal{N}(y^* \mid \phi(x^*)^{\mathsf{T}} \theta, \sigma^2) \, \mathcal{N}(\theta \mid \mu_\theta, \Sigma_\theta) = \mathcal{N}\big(y^* \mid \phi(x^*)^{\mathsf{T}} \mu_\theta, \; \sigma^2 + \phi(x^*)^{\mathsf{T}} \Sigma_\theta \phi(x^*)\big)$. Key: the prediction is again Gaussian. The predictive variance is increased due to the posterior uncertainty in $\theta$.
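
The corresponding predictive mean and variance, a sketch continuing the posterior computed above:

```python
def blr_predict(x_star, mu, Sigma, sigma2, K=5):
    """Predictive mean and variance at a new scalar input x_star."""
    phi_star = poly_features(np.array([x_star]), K)[0]
    mean = phi_star @ mu
    var = sigma2 + phi_star @ Sigma @ phi_star
    return mean, var

print(blr_predict(0.3, mu_theta, Sigma_theta, sigma2=0.1 ** 2))
```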

54 Model comparison and hypothesis testing: Outline

55 Model comparison and hypothesis testing: Model comparison, motivation. What degree of polynomial describes the data best? Is the linear model at all appropriate? Association testing. (Figure: from genome to phenome; a matrix of SNPs across individuals is tested against the phenotypes.)

57 Model comparison and hypothesis testing: Bayesian model comparison. How do we choose among alternative models? Assume we want to choose among models $\mathcal{H}_0, \ldots, \mathcal{H}_M$ for a dataset $\mathcal{D}$. Posterior probability for a particular model $i$: $p(\mathcal{H}_i \mid \mathcal{D}) \propto \underbrace{p(\mathcal{D} \mid \mathcal{H}_i)}_{\text{evidence}} \, \underbrace{p(\mathcal{H}_i)}_{\text{prior}}$.

59 Model comparison and hypothesis testing: Bayesian model comparison, how to calculate the evidence. The evidence is not the model likelihood! $p(\mathcal{D} \mid \mathcal{H}_i) = \int_\Theta \mathrm{d}\Theta \, p(\mathcal{D} \mid \Theta) \, p(\Theta)$ for model parameters $\Theta$. Remember: $p(\Theta \mid \mathcal{H}_i, \mathcal{D}) = \frac{p(\mathcal{D} \mid \mathcal{H}_i, \Theta) \, p(\Theta)}{p(\mathcal{D} \mid \mathcal{H}_i)}$, i.e. $\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}$.

61 Model comparison and hypothesis testing: Bayesian model comparison, Occam's razor. The evidence integral penalizes overly complex models. A model with few parameters and a lower maximum likelihood ($\mathcal{H}_1$) may win over a model with a peaked likelihood that requires many more parameters ($\mathcal{H}_2$). (Figure: likelihood curves of $\mathcal{H}_1$ and $\mathcal{H}_2$ around $w_{\mathrm{MAP}}$; from C.M. Bishop, Pattern Recognition and Machine Learning.)

63 Model comparison and hypothesis testing: Application to GWAS, relevance of a single SNP. Consider an association study. $\mathcal{H}_0$: no association, $p(\mathbf{y} \mid \mathcal{H}_0, \mathbf{X}, \Theta_0) = \mathcal{N}(\mathbf{y} \mid \mathbf{0}, \sigma^2 \mathbf{I})$, $p(\mathcal{D} \mid \mathcal{H}_0) = \int_{\sigma^2} \mathcal{N}(\mathbf{y} \mid \mathbf{0}, \sigma^2 \mathbf{I}) \, p(\sigma^2)$. $\mathcal{H}_1$: linear association, $p(\mathbf{y} \mid \mathcal{H}_1, \mathbf{x}_i, \Theta_1) = \mathcal{N}(\mathbf{y} \mid \mathbf{x}_i \theta, \sigma^2 \mathbf{I})$, $p(\mathcal{D} \mid \mathcal{H}_1) = \int_{\sigma^2, \theta} \mathcal{N}(\mathbf{y} \mid \mathbf{x}_i \theta, \sigma^2 \mathbf{I}) \, p(\sigma^2) \, p(\theta)$. Depending on the choice of priors, $p(\sigma^2)$ and $p(\theta)$, the required integrals are often tractable in closed form.

66 Model comparison and hypothesis testing: Application to GWAS, scoring models. Similar to likelihood ratios, the ratio of the evidences, the Bayes factor, can be used to score alternative models: $\mathrm{BF} = \ln \frac{p(\mathcal{D} \mid \mathcal{H}_1)}{p(\mathcal{D} \mid \mathcal{H}_0)}$. (Figure: LOD scores and Bayes factors along chromosome 7 around SLC35B4, with the 0.01% FPR threshold marked.)
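
A numerical sketch of such a Bayes factor for a single SNP on the toy data, under the simplifying assumptions that the noise variance is fixed rather than integrated out and that the effect has a Gaussian prior $\mathcal{N}(0, \sigma_g^2)$, so that both evidences are Gaussian marginal likelihoods:

```python
from scipy.stats import multivariate_normal

def log_bayes_factor(y, x, sigma2, sigma2_g):
    """ln p(y | H1) - ln p(y | H0), with the SNP effect marginalized under N(0, sigma2_g)."""
    N = len(y)
    log_ev0 = multivariate_normal.logpdf(y, mean=np.zeros(N), cov=sigma2 * np.eye(N))
    cov1 = sigma2_g * np.outer(x, x) + sigma2 * np.eye(N)
    log_ev1 = multivariate_normal.logpdf(y, mean=np.zeros(N), cov=cov1)
    return log_ev1 - log_ev0

print(log_bayes_factor(y - y.mean(), X[:, 0], sigma2=0.25, sigma2_g=0.5))
```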

68 Model comparison and hypothesis testing: Application to GWAS, posterior probability of an association. Bayes factors are useful; however, we would like a probabilistic answer for how certain an association really is. Posterior probability of $\mathcal{H}_1$: $p(\mathcal{H}_1 \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \mathcal{H}_1) \, p(\mathcal{H}_1)}{p(\mathcal{D})} = \frac{p(\mathcal{D} \mid \mathcal{H}_1) \, p(\mathcal{H}_1)}{p(\mathcal{D} \mid \mathcal{H}_1) \, p(\mathcal{H}_1) + p(\mathcal{D} \mid \mathcal{H}_0) \, p(\mathcal{H}_0)}$, using $p(\mathcal{H}_1 \mid \mathcal{D}) + p(\mathcal{H}_0 \mid \mathcal{D}) = 1$; $p(\mathcal{H}_1)$ is the prior probability of observing a real association.

71 Model comparison and hypothesis testing: Bayes factor versus likelihood ratio. Bayes factor: models of different complexity can be objectively compared; statistical significance as the posterior probability of a model; typically hard to compute. Likelihood ratio: the likelihood ratio scales with the number of parameters; likelihood ratios have a known null distribution, yielding p-values; often easy to compute.

73 Model comparison and hypothesis testing: Marginal likelihood of variance component models. Consider a linear model accounting for a set of measured SNPs $\mathbf{X}$: $p(\mathbf{y} \mid \mathbf{X}, \theta, \sigma^2) = \mathcal{N}\big(\mathbf{y} \mid \sum_{s=1}^{S} \mathbf{x}_s \theta_s, \, \sigma^2 \mathbf{I}\big)$. Choose an identical Gaussian prior for all weights: $p(\theta) = \prod_{s=1}^{S} \mathcal{N}(\theta_s \mid 0, \sigma_g^2)$. Marginal likelihood: $p(\mathbf{y} \mid \mathbf{X}, \sigma^2, \sigma_g^2) = \int_\theta \mathcal{N}(\mathbf{y} \mid \mathbf{X} \theta, \sigma^2 \mathbf{I}) \, \mathcal{N}(\theta \mid \mathbf{0}, \sigma_g^2 \mathbf{I}) = \mathcal{N}(\mathbf{y} \mid \mathbf{0}, \, \sigma_g^2 \mathbf{X} \mathbf{X}^{\mathsf{T}} + \sigma^2 \mathbf{I})$. The number of hyperparameters is independent of the number of SNPs.
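
This marginal likelihood is cheap to evaluate directly; a sketch on the toy genotype matrix (the variance values passed in are illustrative):

```python
def vc_log_marginal(y, X, sigma2_g, sigma2_e):
    """ln N(y | 0, sigma2_g * X X^T + sigma2_e * I)."""
    N = len(y)
    K = sigma2_g * X @ X.T + sigma2_e * np.eye(N)
    return multivariate_normal.logpdf(y, mean=np.zeros(N), cov=K)

print(vc_log_marginal(y - y.mean(), X, sigma2_g=0.2, sigma2_e=0.3))
```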

77 Model comparison and hypothesis testing: Marginal likelihood of variance component models, application to GWAS. The missing heritability paradox: complex traits are regulated by a large number of small effects. Human height: the best single SNP explains little variance, but the parents are highly predictive for the height of the child!

78 Model comparison and hypothesis testing: Marginal likelihood of variance component models, application to GWAS. Multivariate additive models for complex traits. Multivariate model over causal SNPs: $p(\mathbf{y} \mid \mathbf{X}, \theta, \sigma^2) = \mathcal{N}\big(\mathbf{y} \mid \sum_{s \in \text{causal}} \mathbf{x}_s \theta_s, \, \sigma^2 \mathbf{I}\big)$. Common variance prior for the causal SNPs: $p(\theta_s) = \mathcal{N}(\theta_s \mid 0, \sigma_g^2)$. Marginalize out the weights: $p(\mathbf{y} \mid \mathbf{X}, \sigma_g^2, \sigma_e^2) = \mathcal{N}\big(\mathbf{y} \mid \mathbf{0}, \, \sigma_g^2 \sum_{s \in \text{causal}} \mathbf{x}_s \mathbf{x}_s^{\mathsf{T}} + \sigma_e^2 \mathbf{I}\big)$. Which SNPs are causal? Approximation: consider all SNPs [Yang et al., 2011], $p(\mathbf{y} \mid \mathbf{X}, \sigma_g^2, \sigma_e^2) = \mathcal{N}(\mathbf{y} \mid \mathbf{0}, \, \sigma_g^2 \mathbf{X} \mathbf{X}^{\mathsf{T}} + \sigma_e^2 \mathbf{I})$.

82 Model comparison and hypothesis testing: Marginal likelihood of variance component models, application to GWAS. Approximate variance model: $p(\mathbf{y} \mid \mathbf{X}, \sigma_g^2, \sigma_e^2) = \mathcal{N}(\mathbf{y} \mid \mathbf{0}, \, \sigma_g^2 \mathbf{X} \mathbf{X}^{\mathsf{T}} + \sigma_e^2 \mathbf{I})$. The genetic variance $\sigma_g^2$ can be partitioned across chromosomes; heritability $h^2 = \frac{\sigma_g^2}{\sigma_g^2 + \sigma_e^2}$. (Figure: genome partitioning of genetic variation across chromosomes; from [Yang et al., 2011].)
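
As a rough illustration of fitting such variance components, the sketch below standardizes the toy genotypes, grid-searches the two variances that maximize the marginal likelihood above, and reports $h^2$; dedicated tools (e.g. the REML-based approach of [Yang et al., 2011]) do this far more efficiently, so this is only a toy:

```python
# Standardize genotypes so that diag(X X^T) is roughly 1 and h^2 has the stated form.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
Xs = Xs / np.sqrt(Xs.shape[1])
yc = (y - y.mean()) / y.std()

grid = np.linspace(0.05, 2.0, 30)
best = max((vc_log_marginal(yc, Xs, sg, se), sg, se) for sg in grid for se in grid)
_, sg_hat, se_hat = best
print("h^2 estimate:", sg_hat / (sg_hat + se_hat))
```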

85 Summary: Outline

86 Summary. Generalized linear models for curve fitting and multivariate regression. Maximum likelihood and least squares regression are identical. Construction of features using a mapping $\phi$. Regularized least squares and other models that correspond to different choices of loss functions. Bayesian linear regression. Model comparison and Occam's razor. Variance component models in GWAS.

87 Summary: Tasks. Prove that the product of two Gaussians is Gaussian distributed. Try to understand the convolution formula of Gaussian random variables.

88 References. C. Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006. S. Roweis. Gaussian identities. Technical report, 1999. J. Yang, T. Manolio, L. Pasquale, E. Boerwinkle, N. Caporaso, J. Cunningham, M. de Andrade, B. Feenstra, E. Feingold, M. Hayes, et al. Genome partitioning of genetic variation for complex traits using common SNPs. Nature Genetics, 43(6), 2011.


More information

Parametric Techniques

Parametric Techniques Parametric Techniques Jason J. Corso SUNY at Buffalo J. Corso (SUNY at Buffalo) Parametric Techniques 1 / 39 Introduction When covering Bayesian Decision Theory, we assumed the full probabilistic structure

More information

Bayesian RL Seminar. Chris Mansley September 9, 2008

Bayesian RL Seminar. Chris Mansley September 9, 2008 Bayesian RL Seminar Chris Mansley September 9, 2008 Bayes Basic Probability One of the basic principles of probability theory, the chain rule, will allow us to derive most of the background material in

More information

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Gaussian Processes Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 01 Pictorial view of embedding distribution Transform the entire distribution to expected features Feature space Feature

More information

Lecture 5: GPs and Streaming regression

Lecture 5: GPs and Streaming regression Lecture 5: GPs and Streaming regression Gaussian Processes Information gain Confidence intervals COMP-652 and ECSE-608, Lecture 5 - September 19, 2017 1 Recall: Non-parametric regression Input space X

More information

Machine Learning Basics: Maximum Likelihood Estimation

Machine Learning Basics: Maximum Likelihood Estimation Machine Learning Basics: Maximum Likelihood Estimation Sargur N. srihari@cedar.buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Topics 1. Learning

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

Expectation Propagation for Approximate Bayesian Inference

Expectation Propagation for Approximate Bayesian Inference Expectation Propagation for Approximate Bayesian Inference José Miguel Hernández Lobato Universidad Autónoma de Madrid, Computer Science Department February 5, 2007 1/ 24 Bayesian Inference Inference Given

More information

Logistic Regression. COMP 527 Danushka Bollegala

Logistic Regression. COMP 527 Danushka Bollegala Logistic Regression COMP 527 Danushka Bollegala Binary Classification Given an instance x we must classify it to either positive (1) or negative (0) class We can use {1,-1} instead of {1,0} but we will

More information

Bayesian Deep Learning

Bayesian Deep Learning Bayesian Deep Learning Mohammad Emtiyaz Khan AIP (RIKEN), Tokyo http://emtiyaz.github.io emtiyaz.khan@riken.jp June 06, 2018 Mohammad Emtiyaz Khan 2018 1 What will you learn? Why is Bayesian inference

More information

Modeling Data with Linear Combinations of Basis Functions. Read Chapter 3 in the text by Bishop

Modeling Data with Linear Combinations of Basis Functions. Read Chapter 3 in the text by Bishop Modeling Data with Linear Combinations of Basis Functions Read Chapter 3 in the text by Bishop A Type of Supervised Learning Problem We want to model data (x 1, t 1 ),..., (x N, t N ), where x i is a vector

More information

Probabilistic Graphical Models & Applications

Probabilistic Graphical Models & Applications Probabilistic Graphical Models & Applications Learning of Graphical Models Bjoern Andres and Bernt Schiele Max Planck Institute for Informatics The slides of today s lecture are authored by and shown with

More information

Introduction to Bayesian Inference

Introduction to Bayesian Inference University of Pennsylvania EABCN Training School May 10, 2016 Bayesian Inference Ingredients of Bayesian Analysis: Likelihood function p(y φ) Prior density p(φ) Marginal data density p(y ) = p(y φ)p(φ)dφ

More information

Probabilistic & Bayesian deep learning. Andreas Damianou

Probabilistic & Bayesian deep learning. Andreas Damianou Probabilistic & Bayesian deep learning Andreas Damianou Amazon Research Cambridge, UK Talk at University of Sheffield, 19 March 2019 In this talk Not in this talk: CRFs, Boltzmann machines,... In this

More information

PATTERN RECOGNITION AND MACHINE LEARNING

PATTERN RECOGNITION AND MACHINE LEARNING PATTERN RECOGNITION AND MACHINE LEARNING Chapter 1. Introduction Shuai Huang April 21, 2014 Outline 1 What is Machine Learning? 2 Curve Fitting 3 Probability Theory 4 Model Selection 5 The curse of dimensionality

More information

Bayesian Interpretations of Regularization

Bayesian Interpretations of Regularization Bayesian Interpretations of Regularization Charlie Frogner 9.50 Class 15 April 1, 009 The Plan Regularized least squares maps {(x i, y i )} n i=1 to a function that minimizes the regularized loss: f S

More information

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) = Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

More information