GWAS IV: Bayesian linear (variance component) models

GWAS IV: Bayesian linear (variance component) models. Dr. Oliver Stegle, Christoph Lippert, Prof. Dr. Karsten Borgwardt. Max Planck Institutes Tübingen, Germany. Tübingen Summer 2011.

Motivation: Regression. Linear regression: making predictions, comparison of alternative models. Bayesian and regularized regression: uncertainty in model parameters, generalized basis functions. (Figure: phenotype Y against genotype X with a test input x*.)

Motivation: Further reading, useful material. Christopher M. Bishop, Pattern Recognition and Machine Learning [Bishop, 2006]; Sam Roweis, Gaussian identities [Roweis, 1999].

Outline: Linear Regression II

Linear Regression II. Regression: Noise model and likelihood. Given a dataset $\mathcal{D} = \{x_n, y_n\}_{n=1}^{N}$, where $x_n = (x_{n,1}, \ldots, x_{n,S})$ is $S$-dimensional (for example $S$ SNPs), fit parameters $\theta$ of a regressor $f$ with added Gaussian noise: $y_n = f(x_n; \theta) + \epsilon_n$, where $p(\epsilon \mid \sigma^2) = \mathcal{N}(\epsilon \mid 0, \sigma^2)$. Equivalent likelihood formulation: $p(y \mid X) = \prod_{n=1}^{N} \mathcal{N}(y_n \mid f(x_n), \sigma^2)$.

Linear Regression II. Regression: Choosing a regressor. Choose $f$ to be linear: $p(y \mid X) = \prod_{n=1}^{N} \mathcal{N}(y_n \mid x_n \theta + c, \sigma^2)$. Consider the bias-free case, $c = 0$; otherwise include an additional column of ones in each $x_n$. (Figure: equivalent graphical model.)

Linear Regression II. Linear Regression: Maximum likelihood. Taking the logarithm, we obtain $\ln p(y \mid \theta, X, \sigma^2) = \sum_{n=1}^{N} \ln \mathcal{N}(y_n \mid x_n \theta, \sigma^2) = -\frac{N}{2} \ln 2\pi\sigma^2 - \frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - x_n \theta)^2$, where the second term is the sum of squares. The likelihood is maximized when the squared error is minimized: least squares and maximum likelihood are equivalent.

Linear Regression II. Linear Regression and Least Squares. (Figure: data points $(x_n, y_n)$ and the fitted curve $f(x_n, w)$; C.M. Bishop, Pattern Recognition and Machine Learning.) Error function: $E(\theta) = \frac{1}{2} \sum_{n=1}^{N} (y_n - x_n \theta)^2$.

Linear Regression II. Linear Regression and Least Squares. Derivative w.r.t. a single weight entry $\theta_i$: $\frac{d}{d\theta_i} \ln p(y \mid \theta, \sigma^2) = \frac{d}{d\theta_i} \left[ -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - x_n \theta)^2 \right] = \frac{1}{\sigma^2} \sum_{n=1}^{N} (y_n - x_n \theta) x_{n,i}$. Setting the gradient w.r.t. $\theta$ to zero: $\nabla_{\theta} \ln p(y \mid \theta, \sigma^2) = \frac{1}{\sigma^2} \sum_{n=1}^{N} (y_n - x_n \theta) x_n^{T} = 0 \;\Rightarrow\; \theta_{\mathrm{ML}} = (X^{T} X)^{-1} X^{T} y$, where $(X^{T} X)^{-1} X^{T}$ is the pseudo-inverse. Here, the matrix $X$ is defined as $X = \begin{pmatrix} x_{1,1} & \cdots & x_{1,D} \\ \vdots & \ddots & \vdots \\ x_{N,1} & \cdots & x_{N,D} \end{pmatrix}$.
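
Below is a minimal numerical sketch (not part of the original slides) of the closed-form maximum-likelihood solution $\theta_{\mathrm{ML}} = (X^T X)^{-1} X^T y$ on simulated genotype-like data; the sample sizes and effect sizes are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
N, S = 500, 10                                        # samples, SNPs (illustrative)
X = rng.integers(0, 3, size=(N, S)).astype(float)     # 0/1/2 genotype codes
theta_true = rng.normal(0.0, 0.5, size=S)
y = X @ theta_true + rng.normal(0.0, 1.0, size=N)     # linear model plus Gaussian noise

# theta_ML = (X^T X)^{-1} X^T y; lstsq applies the pseudo-inverse in a numerically stable way
theta_ml, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(theta_ml - theta_true, 2))             # per-SNP estimation error
```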

Linear Regression II. Polynomial Curve Fitting: Motivation. Non-linear relationships; multiple SNPs playing a role for a particular phenotype. (Figure: phenotype Y against genotype X with a test input x*.)

Linear Regression II. Polynomial Curve Fitting: Univariate input $x$. Use the polynomials up to degree $K$ to construct new features from $x$: $f(x, \theta) = \theta_0 + \theta_1 x + \theta_2 x^2 + \cdots + \theta_K x^K = \sum_{k=0}^{K} \theta_k \phi_k(x) = \theta^{T} \phi(x)$, where we defined $\phi(x) = (1, x, x^2, \ldots, x^K)$. $\phi$ can be any feature mapping. Possible to show: the feature map $\phi$ can be expressed in terms of kernels (kernel trick).
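
A short sketch (an assumed example, not from the slides) of the polynomial feature map $\phi(x) = (1, x, \ldots, x^K)$ for a univariate input, fitted by least squares; the degree and the toy data are made up.

```python
import numpy as np

K = 3
x = np.linspace(0.0, 1.0, 50)
y = np.sin(2 * np.pi * x) + np.random.default_rng(1).normal(0.0, 0.1, x.size)

Phi = np.vander(x, K + 1, increasing=True)    # columns 1, x, x^2, ..., x^K
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
y_fit = Phi @ theta                           # fitted curve f(x, theta) = Phi theta
```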

Linear Regression II. Polynomial Curve Fitting: Overfitting. The degree of the polynomial is crucial to avoid under- and overfitting. (Figures: fits for polynomial degrees M = 0, 1, 3 and 9; C.M. Bishop, Pattern Recognition and Machine Learning.)

Linear Regression II. Multivariate regression. Polynomial curve fitting: $f(x, \theta) = \theta_0 + \theta_1 x + \cdots + \theta_K x^K = \sum_{k=0}^{K} \theta_k \phi_k(x) = \phi(x) \theta$. Multivariate regression (SNPs): $f(x, \theta) = \sum_{s=1}^{S} \theta_s x_s = x \theta$. Note: when fitting a single binary SNP genotype $x_i$, a linear model is the most general choice.

Linear Regression II. Regularized Least Squares. Solutions to avoid overfitting: 1. intelligently choose the number of dimensions; 2. regularize the regression weights $\theta$. Quadratically regularized objective function: $E(\theta) = \underbrace{\frac{1}{2} \sum_{n=1}^{N} (y_n - \phi(x_n) \theta)^2}_{\text{squared error}} + \underbrace{\frac{\lambda}{2} \theta^{T} \theta}_{\text{regularizer}}$.
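
The minimizer of this quadratically regularized objective is available in closed form, $\theta = (\lambda I + \Phi^T \Phi)^{-1} \Phi^T y$ (obtained by setting the gradient of $E(\theta)$ to zero; a standard result stated here for completeness). A hedged sketch with illustrative data and an illustrative $\lambda$:

```python
import numpy as np

def ridge_fit(Phi, y, lam):
    """Closed-form minimizer of the quadratically regularized least-squares objective."""
    D = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(D) + Phi.T @ Phi, Phi.T @ y)

# Example: a degree-9 polynomial that would badly overfit 20 points without regularization
rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 20)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, x.size)
Phi = np.vander(x, 10, increasing=True)
theta_ridge = ridge_fit(Phi, y, lam=1e-3)     # shrinks the weights and tames overfitting
```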

Linear Regression II. Regularized Least Squares: More general regularizers. More general regularization: $E(\theta) = \underbrace{\frac{1}{2} \sum_{n=1}^{N} (y_n - \phi(x_n) \theta)^2}_{\text{squared error}} + \underbrace{\frac{\lambda}{2} \sum_{d=1}^{D} |\theta_d|^{q}}_{\text{regularizer}}$. (Figure: contours of the regularizer for q = 0.5, 1, 2, 4; q = 1 is the sparsity-inducing Lasso penalty, q = 2 the quadratic penalty; C.M. Bishop, Pattern Recognition and Machine Learning.)

Linear Regression II. Loss functions and related methods. Even more general: a generic loss function, $E(\theta) = \underbrace{\frac{1}{2} \sum_{n=1}^{N} L(y_n - \phi(x_n) \theta)}_{\text{loss}} + \underbrace{\frac{\lambda}{2} \sum_{d=1}^{D} |\theta_d|^{q}}_{\text{regularizer}}$. Many state-of-the-art machine learning methods can be expressed within this framework: linear regression (squared loss, squared regularizer); support vector machine (hinge loss, squared regularizer); Lasso (squared loss, L1 regularizer). Inference: minimize the cost function $E(\theta)$, yielding a point estimate for $\theta$. Q: How to determine $q$ and a suitable loss function?

Linear Regression II. Loss functions and related methods: Cross-validation, minimization of the expected loss. For each candidate model $H$: split the data into $K$ folds; perform a training-test evaluation for each fold; assess the average loss on the test sets, $E_H = \frac{1}{K} \sum_{k=1}^{K} L^{\text{test}}_k$. (Figure: the total set of samples split into training and test sets over folds 1 to 3.)
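
A compact sketch (an assumed implementation, with illustrative fold count and $\lambda$ grid) of K-fold cross-validation for choosing the regularization strength of the ridge objective above:

```python
import numpy as np

def kfold_cv_mse(Phi, y, lam, K=3, seed=0):
    """Average test mean-squared error of ridge regression over K folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, K)
    losses = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        D = Phi.shape[1]
        theta = np.linalg.solve(lam * np.eye(D) + Phi[train].T @ Phi[train],
                                Phi[train].T @ y[train])
        losses.append(np.mean((y[test] - Phi[test] @ theta) ** 2))
    return np.mean(losses)    # E_H: average test loss across folds

# Pick the lambda with the smallest cross-validated loss (Phi, y as in the ridge sketch):
# best_lam = min([1e-4, 1e-2, 1e0], key=lambda lam: kfold_cv_mse(Phi, y, lam))
```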

Linear Regression II. Probabilistic interpretation. So far: minimization of error functions. Back to probabilities? $E(\theta) = \underbrace{\frac{1}{2} \sum_{n=1}^{N} (y_n - \phi(x_n) \theta)^2}_{\text{squared error}} + \underbrace{\frac{\lambda}{2} \theta^{T} \theta}_{\text{regularizer}} = -\ln \prod_{n=1}^{N} \mathcal{N}(y_n \mid \phi(x_n) \theta, \sigma^2) - \ln \mathcal{N}(\theta \mid 0, \tfrac{1}{\lambda} I) = -\ln p(y \mid \theta, \Phi(X), \sigma^2) - \ln p(\theta)$, up to additive constants and a scaling of the squared-error term by $\sigma^2$. Most alternative choices of regularizers and loss functions can be mapped to an equivalent probabilistic representation in a similar way.

Outline: Bayesian linear regression

Bayesian linear regression. Likelihood as before: $p(y \mid X, \theta, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(y_n \mid \phi(x_n) \theta, \sigma^2)$. Define a conjugate prior over $\theta$: $p(\theta) = \mathcal{N}(\theta \mid m_0, S_0)$.

Bayesian linear regression. Posterior probability of $\theta$: $p(\theta \mid y, X, \sigma^2) \propto \prod_{n=1}^{N} \mathcal{N}(y_n \mid \phi(x_n) \theta, \sigma^2) \, \mathcal{N}(\theta \mid m_0, S_0) = \mathcal{N}(y \mid \Phi(X) \theta, \sigma^2 I) \, \mathcal{N}(\theta \mid m_0, S_0)$, which gives $p(\theta \mid y, X, \sigma^2) = \mathcal{N}(\theta \mid \mu_{\theta}, \Sigma_{\theta})$ with $\mu_{\theta} = \Sigma_{\theta} \left( S_0^{-1} m_0 + \frac{1}{\sigma^2} \Phi(X)^{T} y \right)$ and $\Sigma_{\theta} = \left[ S_0^{-1} + \frac{1}{\sigma^2} \Phi(X)^{T} \Phi(X) \right]^{-1}$.
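
A brief sketch (assumed helper name, simulated data) of the posterior over the weights for a conjugate Gaussian prior $\mathcal{N}(\theta \mid m_0, S_0)$, following the formulas above:

```python
import numpy as np

def blr_posterior(Phi, y, sigma2, m0, S0):
    """Posterior mean and covariance of the weights in Bayesian linear regression."""
    S0_inv = np.linalg.inv(S0)
    Sigma = np.linalg.inv(S0_inv + Phi.T @ Phi / sigma2)   # posterior covariance
    mu = Sigma @ (S0_inv @ m0 + Phi.T @ y / sigma2)        # posterior mean
    return mu, Sigma

rng = np.random.default_rng(4)
Phi = rng.normal(size=(100, 3))
y = Phi @ np.array([1.0, -0.5, 0.2]) + rng.normal(0.0, 0.3, 100)
mu, Sigma = blr_posterior(Phi, y, sigma2=0.09, m0=np.zeros(3), S0=np.eye(3))
```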

Bayesian linear regression. Prior choice: regularized (ridge) regression. In this case $p(\theta) = \mathcal{N}(\theta \mid 0, \tfrac{1}{\lambda} I)$, and the posterior becomes $p(\theta \mid y, X, \sigma^2) = \mathcal{N}(\theta \mid \mu_{\theta}, \Sigma_{\theta})$ with $\mu_{\theta} = \Sigma_{\theta} \frac{1}{\sigma^2} \Phi(X)^{T} y$ and $\Sigma_{\theta} = \left[ \lambda I + \frac{1}{\sigma^2} \Phi(X)^{T} \Phi(X) \right]^{-1}$. Equivalent to the maximum likelihood estimate for $\lambda \to 0$.

Bayesian linear regression. Example: the posterior over functions after observing 0, 1 and 20 data points. (Figures: C.M. Bishop, Pattern Recognition and Machine Learning.)

Bayesian linear regression. Making predictions. Prediction for a fixed weight $\hat{\theta}$ at an input $x^{*}$ is trivial: $p(y^{*} \mid x^{*}, \hat{\theta}, \sigma^2) = \mathcal{N}(y^{*} \mid \phi(x^{*}) \hat{\theta}, \sigma^2)$. Integrate over $\theta$ to take the posterior uncertainty into account: $p(y^{*} \mid x^{*}, \mathcal{D}) = \int_{\theta} p(y^{*} \mid x^{*}, \theta, \sigma^2) \, p(\theta \mid X, y, \sigma^2) = \int_{\theta} \mathcal{N}(y^{*} \mid \phi(x^{*}) \theta, \sigma^2) \, \mathcal{N}(\theta \mid \mu_{\theta}, \Sigma_{\theta}) = \mathcal{N}(y^{*} \mid \phi(x^{*}) \mu_{\theta}, \; \sigma^2 + \phi(x^{*})^{T} \Sigma_{\theta} \phi(x^{*}))$. Key: the prediction is again Gaussian. The predictive variance is increased due to the posterior uncertainty in $\theta$.
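
A small sketch (assumed helper name) of the predictive mean and variance at a new input; the commented usage reuses mu and Sigma from the previous snippet:

```python
import numpy as np

def blr_predict(phi_star, mu, Sigma, sigma2):
    """Predictive mean and variance of y* for features phi_star of a new input x*."""
    mean = phi_star @ mu
    var = sigma2 + phi_star @ Sigma @ phi_star
    return mean, var

# Example usage with the posterior computed above:
# phi_star = np.array([1.0, 0.3, 0.3 ** 2])     # e.g. polynomial features of x* = 0.3
# mean, var = blr_predict(phi_star, mu, Sigma, sigma2=0.09)
```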

Outline: Model comparison and hypothesis testing

Model comparison and hypothesis testing. Model comparison: Motivation. What degree of polynomial describes the data best? Is the linear model at all appropriate? Association testing. (Figure: genome-to-phenome setting, a matrix of SNPs across individuals paired with phenotype measurements.)

Model comparison and hypothesis testing. Bayesian model comparison: How do we choose among alternative models? Assume we want to choose among models $H_0, \ldots, H_M$ for a dataset $\mathcal{D}$. Posterior probability for a particular model $i$: $p(H_i \mid \mathcal{D}) \propto \underbrace{p(\mathcal{D} \mid H_i)}_{\text{evidence}} \, \underbrace{p(H_i)}_{\text{prior}}$.

Model comparison and hypothesis testing. Bayesian model comparison: How to calculate the evidence. The evidence is not the model likelihood: $p(\mathcal{D} \mid H_i) = \int_{\Theta} d\Theta \, p(\mathcal{D} \mid \Theta) \, p(\Theta)$ for model parameters $\Theta$. Remember: $p(\Theta \mid H_i, \mathcal{D}) = \frac{p(\mathcal{D} \mid H_i, \Theta) \, p(\Theta)}{p(\mathcal{D} \mid H_i)}$, i.e. posterior = likelihood × prior / evidence.
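
For the Gaussian models used here the evidence integral is tractable: with a Gaussian prior and known noise, $p(y \mid X) = \mathcal{N}(y \mid \Phi m_0, \; \sigma^2 I + \Phi S_0 \Phi^{T})$. A sketch (assumed helper name) of evaluating this log evidence:

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_evidence(Phi, y, sigma2, m0, S0):
    """Log marginal likelihood of Bayesian linear regression with Gaussian prior N(m0, S0)."""
    mean = Phi @ m0
    cov = sigma2 * np.eye(len(y)) + Phi @ S0 @ Phi.T
    return multivariate_normal.logpdf(y, mean=mean, cov=cov)

# Example usage with the data from the posterior sketch above:
# log_evidence(Phi, y, sigma2=0.09, m0=np.zeros(3), S0=np.eye(3))
```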

Model comparison and hypothesis testing. Bayesian model comparison: Occam's razor. The evidence integral penalizes overly complex models. A model with few parameters and a lower maximum likelihood ($H_1$) may win over a model with a peaked likelihood that requires many more parameters ($H_2$). (Figure: likelihood curves for $H_1$ and $H_2$ around $w_{\mathrm{MAP}}$; C.M. Bishop, Pattern Recognition and Machine Learning.)

Model comparison and hypothesis testing. Application to GWAS: Relevance of a single SNP. Consider an association study. $H_0$: no association, $p(y \mid H_0, X, \Theta_0) = \mathcal{N}(y \mid 0, \sigma^2 I)$, with evidence $p(\mathcal{D} \mid H_0) = \int_{\sigma^2} \mathcal{N}(y \mid 0, \sigma^2 I) \, p(\sigma^2)$. $H_1$: linear association, $p(y \mid H_1, x_i, \Theta_1) = \mathcal{N}(y \mid x_i \theta, \sigma^2 I)$, with evidence $p(\mathcal{D} \mid H_1) = \int_{\sigma^2, \theta} \mathcal{N}(y \mid x_i \theta, \sigma^2 I) \, p(\sigma^2) \, p(\theta)$. Depending on the choice of priors $p(\sigma^2)$ and $p(\theta)$, the required integrals are often tractable in closed form.

Model comparison and hypothesis testing. Application to GWAS: Scoring models. Similar to likelihood ratios, the ratio of the evidences, the (log) Bayes factor, can be used to score alternative models: $\mathrm{BF} = \ln \frac{p(\mathcal{D} \mid H_1)}{p(\mathcal{D} \mid H_0)}$. (Figure: LOD scores and Bayes factors along a region of chromosome 7 near SLC35B4, with the 0.01% FPR thresholds marked.)
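
A simplified sketch of this score for a single SNP: to keep both evidences Gaussian and closed-form, the noise variance is held fixed and the effect prior is $\theta \sim \mathcal{N}(0, \sigma_g^2)$ (the slides additionally integrate over a prior $p(\sigma^2)$); the variance values are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_bayes_factor(x_snp, y, sigma2=1.0, sigma2_g=0.5):
    """Log Bayes factor of H1 (linear SNP effect) against H0 (no association)."""
    N = len(y)
    cov0 = sigma2 * np.eye(N)                                       # H0: noise only
    cov1 = sigma2_g * np.outer(x_snp, x_snp) + sigma2 * np.eye(N)   # H1: effect marginalized out
    log_ev0 = multivariate_normal.logpdf(y, mean=np.zeros(N), cov=cov0)
    log_ev1 = multivariate_normal.logpdf(y, mean=np.zeros(N), cov=cov1)
    return log_ev1 - log_ev0
```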

Model comparison and hypothesis testing. Application to GWAS: Posterior probability of an association. Bayes factors are useful; however, we would like a probabilistic answer to how certain an association really is. Posterior probability of $H_1$: $p(H_1 \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid H_1) \, p(H_1)}{p(\mathcal{D})} = \frac{p(\mathcal{D} \mid H_1) \, p(H_1)}{p(\mathcal{D} \mid H_1) \, p(H_1) + p(\mathcal{D} \mid H_0) \, p(H_0)}$, using $p(H_1 \mid \mathcal{D}) + p(H_0 \mid \mathcal{D}) = 1$; here $p(H_1)$ is the prior probability of observing a real association.
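
Converting a log Bayes factor into this posterior probability only requires the prior $p(H_1)$; a small sketch, where the prior value is a placeholder assumption:

```python
import numpy as np

def posterior_h1(log_bf, prior_h1=1e-4):
    """Posterior probability of H1 from a log Bayes factor and a prior p(H1)."""
    log_odds = log_bf + np.log(prior_h1) - np.log1p(-prior_h1)   # posterior log-odds
    return 1.0 / (1.0 + np.exp(-log_odds))                       # sigmoid of the log-odds
```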

Model comparison and hypothesis testing. Bayes factor versus likelihood ratio. Bayes factor: models of different complexity can be objectively compared; statistical significance as the posterior probability of a model; typically hard to compute. Likelihood ratio: scales with the number of parameters; has a known null distribution, yielding p-values; often easy to compute.

Model comparison and hypothesis testing. Marginal likelihood of variance component models. Consider a linear model accounting for a set of measured SNPs $X$: $p(y \mid X, \theta, \sigma^2) = \mathcal{N}\left(y \mid \sum_{s=1}^{S} x_s \theta_s, \sigma^2 I\right)$. Choose an identical Gaussian prior for all weights: $p(\theta) = \prod_{s=1}^{S} \mathcal{N}(\theta_s \mid 0, \sigma_g^2)$. Marginal likelihood: $p(y \mid X, \sigma^2, \sigma_g^2) = \int_{\theta} \mathcal{N}(y \mid X \theta, \sigma^2 I) \, \mathcal{N}(\theta \mid 0, \sigma_g^2 I) = \mathcal{N}(y \mid 0, \sigma_g^2 X X^{T} + \sigma^2 I)$. The number of hyperparameters is independent of the number of SNPs.
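
A direct sketch of evaluating this marginal likelihood (the same Gaussian closed form as the evidence above, now with the genetic covariance $\sigma_g^2 X X^T$); function and variable names are illustrative:

```python
import numpy as np
from scipy.stats import multivariate_normal

def vc_log_marginal(X, y, sigma2_g, sigma2_e):
    """Log marginal likelihood N(y | 0, sigma_g^2 X X^T + sigma_e^2 I) of the variance component model."""
    K = X @ X.T                                    # genetic similarity matrix
    cov = sigma2_g * K + sigma2_e * np.eye(len(y))
    return multivariate_normal.logpdf(y, mean=np.zeros(len(y)), cov=cov)
```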

Model comparison and hypothesis testing. Marginal likelihood of variance component models, application to GWAS: The missing heritability paradox. Complex traits are regulated by a large number of small effects. Human height: the best single SNP explains little variance, yet the parents are highly predictive of the height of the child!

Model comparison and hypothesis testing. Marginal likelihood of variance component models, application to GWAS: Multivariate additive models for complex traits. Multivariate model over causal SNPs: $p(y \mid X, \theta, \sigma^2) = \mathcal{N}\left(y \mid \sum_{s \in \text{causal}} x_s \theta_s, \sigma^2 I\right)$. Common variance prior for causal SNPs: $p(\theta_s) = \mathcal{N}(\theta_s \mid 0, \sigma_g^2)$. Marginalize out the weights: $p(y \mid X, \sigma_g^2, \sigma_e^2) = \mathcal{N}\left(y \mid 0, \sigma_g^2 \sum_{s \in \text{causal}} x_s x_s^{T} + \sigma_e^2 I\right)$. Which SNPs are causal? Approximation: consider all SNPs [Yang et al., 2011], $p(y \mid X, \sigma_g^2, \sigma_e^2) = \mathcal{N}(y \mid 0, \sigma_g^2 X X^{T} + \sigma_e^2 I)$.

Model comparison and hypothesis testing. Marginal likelihood of variance component models, application to GWAS: Approximate variance model, $p(y \mid X, \sigma_g^2, \sigma_e^2) = \mathcal{N}(y \mid 0, \sigma_g^2 X X^{T} + \sigma_e^2 I)$. The genetic variance $\sigma_g^2$ can be partitioned across chromosomes. Heritability: $h^2 = \frac{\sigma_g^2}{\sigma_g^2 + \sigma_e^2}$. (Figure: genome partitioning of genetic variance across chromosomes, [Yang et al., 2011].)
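
A deliberately crude sketch of estimating the heritability implied by this model: a grid search over $(\sigma_g^2, \sigma_e^2)$ that maximizes the marginal likelihood. Real analyses use REML or dedicated mixed-model software and typically standardize the genotypes and normalize the kinship matrix; the grid range and the simulated data here are illustrative assumptions.

```python
import itertools
import numpy as np
from scipy.stats import multivariate_normal

def estimate_h2(X, y, grid=np.linspace(0.05, 2.0, 15)):
    """Heritability h^2 = sigma_g^2 / (sigma_g^2 + sigma_e^2) via grid-search maximum marginal likelihood."""
    K = X @ X.T
    def loglik(sg2, se2):
        cov = sg2 * K + se2 * np.eye(len(y))
        return multivariate_normal.logpdf(y, mean=np.zeros(len(y)), cov=cov)
    sg2, se2 = max(itertools.product(grid, grid), key=lambda p: loglik(*p))
    return sg2 / (sg2 + se2)

# Example with simulated genotypes and many small effects:
# rng = np.random.default_rng(5)
# X = rng.integers(0, 3, size=(200, 50)).astype(float)
# y = X @ rng.normal(0.0, 0.1, 50) + rng.normal(0.0, 1.0, 200)
# print(estimate_h2(X, y))
```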

Outline: Summary

Summary. Generalized linear models for curve fitting and multivariate regression. Maximum likelihood and least-squares regression are identical. Construction of features using a mapping $\phi$. Regularized least squares and other models that correspond to different choices of loss functions. Bayesian linear regression. Model comparison and Occam's razor. Variance component models in GWAS.

Summary. Tasks: Prove that the product of two Gaussian densities is (proportional to) a Gaussian density. Try to understand the convolution formula for Gaussian random variables.

References. C. Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006. S. Roweis. Gaussian identities. Technical report, 1999. URL http://www.cs.nyu.edu/~roweis/notes/gaussid.pdf. J. Yang, T. Manolio, L. Pasquale, E. Boerwinkle, N. Caporaso, J. Cunningham, M. de Andrade, B. Feenstra, E. Feingold, M. Hayes, et al. Genome partitioning of genetic variation for complex traits using common SNPs. Nature Genetics, 43(6):519-525, 2011.