Multiple regression. CM226: Machine Learning for Bioinformatics. Fall 2016. Sriram Sankararaman. Acknowledgments: Fei Sha, Ameet Talwalkar.

Previous two lectures: Linear and logistic regression, viewed as probabilistic models that connect data and parameters. Parameters are estimated by maximizing the likelihood, which requires solving an optimization problem (analytical solutions are available in special cases, e.g. linear regression). Examples of numerical optimization: gradient descent, Newton's method. An important mathematical idea: convexity.

Applications to GWAS: practical issues, model checking, multiple hypothesis testing.

This lecture: Can we predict phenotype from genotype? This depends on heritability. Multiple regression: ridge regression. Bayesian statistics.

GWAS so far: 2,554 studies and 25,037 SNP-phenotype associations. Success? Can we use the results of GWAS to predict phenotype?

Outline: Heritability; Bayesian statistics (Bernoulli model); Ridge regression (probabilistic interpretation, estimating $\beta$, estimating hyperparameters); Back to heritability.

How well does a model fit the data? (Narrow-sense) heritability $h^2$: What is the accuracy of the best linear predictor of the phenotype? After we learn the parameters, how well can the model predict the phenotype? Narrow-sense: space of linear models. Broad-sense: space of all models.
$$h^2 = R^2, \qquad R^2 = 1 - \frac{SS_{res}}{SS_{tot}}, \qquad SS_{res} = \sum_{i=1}^n (y_i - \hat{\beta}^T x_i)^2, \qquad SS_{tot} = \sum_{i=1}^n (y_i - \bar{y})^2$$
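
For concreteness, a minimal numpy sketch (not from the slides; the simulated genotypes and settings are illustrative) of the $R^2$ computation above:

    import numpy as np

    rng = np.random.default_rng(0)
    n, m = 1000, 5
    X = rng.standard_normal((n, m))                # standardized genotypes
    beta = rng.normal(0, 0.5, size=m)              # true effect sizes
    y = X @ beta + rng.normal(0, 1.0, size=n)      # phenotype = linear signal + noise

    # Fit the linear predictor by least squares, then R^2 = 1 - SS_res / SS_tot
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    ss_res = np.sum((y - X @ beta_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    print("R^2:", 1 - ss_res / ss_tot)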

Examples of heritability. [Figure: phenotype vs. genotype scatter plots for two simulated traits, left with $h^2 = 0.2$ and right with $h^2 = 0.8$.] With higher heritability we can better predict phenotype from genotype: left $R^2 = 0.19$ vs. right $R^2 = 0.79$.

Heritability estimates from GWAS: 18 GWAS variants for type 2 diabetes explained 6% of known heritability (Manolio et al., Nature 2009). 180 GWAS loci for height explain 10% of phenotypic variance in a sample of 133,653 individuals (Lango-Allen et al., Nature 2010), while $h^2$ for height is estimated to be 0.80 (Silventoinen et al., Twin Res. 2003).

Heritability estimates from GWAS (So et al., Gen. Epi. 2011):

    Disease             Heritability (h^2)   GWAS loci   h^2 explained by GWAS loci   % of h^2 explained
    Alzheimer's         0.79                 4           0.18                         23%
    Bipolar disorder    0.77                 5           0.02                         3%
    Breast cancer       0.53                 13          0.07                         13%
    CAD                 0.49                 12          0.12                         25%
    Crohn's disease     0.55                 32          0.07                         13%
    Prostate cancer     0.50                 27          0.15                         31%
    Schizophrenia       0.81                 4           0.00                         0%
    SLE (lupus)         0.66                 23          0.09                         13%
    Type 1 diabetes     0.80                 45          0.11                         14%
    Type 2 diabetes     0.42                 25          0.12                         28%

The model for the phenotype:
$$y = \beta_0 + \sum_{j=1}^m \beta_j x_j + \epsilon$$
where $y$ is the phenotype, $x_j$ is the genotype at SNP $j$ (sampled independently), and $\epsilon \sim N(0, \sigma^2)$. Then
$$\mathrm{Var}[y] = \mathrm{Var}\Big[\beta_0 + \sum_{j=1}^m \beta_j x_j + \epsilon\Big] = \sum_{j=1}^m \mathrm{Var}[\beta_j x_j] + \mathrm{Var}[\epsilon] = \sum_{j=1}^m \beta_j^2 \mathrm{Var}[x_j] + \sigma^2$$

The model for the phenotype: to simplify notation, we assume that each genotype is standardized, so $E[x_j] = 0$ and $\mathrm{Var}[x_j] = 1$, and that the phenotype has mean zero, $E[y] = 0$. Then
$$\mathrm{Var}[y] = \sum_{j=1}^m \beta_j^2 \mathrm{Var}[x_j] + \sigma^2 = \sum_{j=1}^m \beta_j^2 + \sigma^2$$
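
A quick simulation check (illustrative, not from the original slides) that $\mathrm{Var}[y] \approx \sum_j \beta_j^2 + \sigma^2$ when the genotypes are standardized:

    import numpy as np

    rng = np.random.default_rng(1)
    n, m, sigma = 100_000, 10, 0.8
    beta = rng.normal(0, 0.3, size=m)
    X = rng.standard_normal((n, m))                # standardized: mean 0, variance 1
    y = X @ beta + rng.normal(0, sigma, size=n)

    print("empirical Var[y]     :", y.var())
    print("sum(beta^2) + sigma^2:", np.sum(beta**2) + sigma**2)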

Heritability: What is the best possible accuracy for predicting phenotype from genotype? The best accuracy is obtained when we use the true model (which we don't know in practice, of course). Accuracy here refers to low mean squared error:
$$E\big[(y - (\beta_0 + \beta^T x))^2\big] = E[\epsilon^2] = \sigma^2$$
Also note that
$$E\big[(y - (\beta_0 + \beta^T x))^2\big] \le \mathrm{Var}[y]$$

Heritability: if we knew the true model,
$$h^2 = 1 - \frac{E\big[(y - (\beta_0 + \beta^T x))^2\big]}{\mathrm{Var}[y]} = 1 - \frac{\sigma^2}{\sum_{j=1}^m \beta_j^2 + \sigma^2} = \frac{\sum_{j=1}^m \beta_j^2}{\sum_{j=1}^m \beta_j^2 + \sigma^2}$$
In practice we don't know the values of $\beta_j$ and $\sigma^2$.

What happens in GWAS? We test each of the $m$ SNPs for association. Let $A$ be the set of associated SNPs and $\hat{\beta}_j$ the estimate of the effect size. What is the heritability of the associated SNPs?
$$\hat{h}^2_A = \frac{\sum_{j \in A} \hat{\beta}_j^2}{\mathrm{Var}[y]}$$
If the number of discoveries $|A|$ is smaller than the number of associated SNPs $m$, then $\hat{h}^2_A < h^2$. The difference is termed missing heritability.
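
A small simulation sketch (illustrative assumptions throughout: 50 causal SNPs, of which only the 10 largest effects are "discovered") showing that the heritability explained by the discovered set $A$ falls short of the true $h^2$:

    import numpy as np

    rng = np.random.default_rng(2)
    n, m, sigma = 5000, 50, 1.0
    beta = rng.normal(0, 0.15, size=m)
    X = rng.standard_normal((n, m))
    y = X @ beta + rng.normal(0, sigma, size=n)

    h2_true = np.sum(beta**2) / (np.sum(beta**2) + sigma**2)

    # Pretend only the 10 largest-effect SNPs reach significance (the set A).
    A = np.argsort(-np.abs(beta))[:10]
    beta_hat_A = np.linalg.lstsq(X[:, A], y, rcond=None)[0]
    h2_A = np.sum(beta_hat_A**2) / y.var()

    print(f"true h^2 = {h2_true:.2f}, h^2 explained by A = {h2_A:.2f}")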

Reasons for missing heritability: Power (SNPs that should be in $A$ but are not included). The true function is non-linear. Estimates of heritability are biased upwards.

Solving the power issue. Idea: learn a function that relates all SNPs to the phenotype,
$$y = X\beta + \epsilon$$
We can compute the MLE (equivalently, the OLS) estimate:
$$\hat{\beta} = (X^T X)^{-1} X^T y$$
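
A minimal sketch (illustrative, with simulated data) of the OLS/MLE estimate via the normal equations; np.linalg.solve is used rather than forming an explicit inverse:

    import numpy as np

    rng = np.random.default_rng(3)
    n, m = 200, 5
    X = rng.standard_normal((n, m))
    beta = np.array([0.5, -0.3, 0.0, 0.8, 0.1])
    y = X @ beta + rng.normal(0, 1.0, size=n)

    # beta_hat = (X^T X)^{-1} X^T y
    beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
    print(np.round(beta_ols, 2))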

What if $X^T X$ is not invertible in $\hat{\beta} = (X^T X)^{-1} X^T y$? Can you think of any reasons why that could happen? Answer 1: $n < m + 1$; intuitively, there is not enough data to estimate all the parameters. Answer 2: the columns of $X$ are not linearly independent; intuitively, two features are perfectly correlated. In this case, the solution is not unique.

Ridge regression: for $X^T X$ that is not invertible,
$$\hat{\beta} = (X^T X + \lambda I)^{-1} X^T y$$
This is equivalent to adding an extra term to $RSS(\beta)$:
$$\underbrace{\frac{1}{2}\beta^T (X^T X)\beta - (X^T y)^T \beta}_{RSS(\beta)} \;+\; \underbrace{\frac{1}{2}\lambda \|\beta\|_2^2}_{\text{regularization}}$$
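
A sketch (illustrative) of the case $n < m$: $X^T X$ is singular, so the OLS formula breaks down, but the ridge estimate is still well defined:

    import numpy as np

    rng = np.random.default_rng(4)
    n, m, lam = 50, 200, 1.0
    X = rng.standard_normal((n, m))
    beta = rng.normal(0, 0.2, size=m)
    y = X @ beta + rng.normal(0, 1.0, size=n)

    print("rank of X^T X:", np.linalg.matrix_rank(X.T @ X), "out of", m)   # < m, so singular
    beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)
    print("first ridge coefficients:", np.round(beta_ridge[:3], 3))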

Bayesian statistics. Frequentist statistics: evaluate $\hat{\theta}$ on repeated samples; $\theta$ is fixed, $X \sim P$ is random, so $\hat{\theta} = t(X)$ is also random (e.g., is $E[\hat{\theta}] = \theta$?). Bayesian: both $\theta$ and $X$ are random, and inference uses Bayes' theorem,
$$P(\theta \mid X) = \frac{P(X \mid \theta)\,P(\theta)}{P(X)} = \frac{\text{likelihood} \times \text{prior}}{\text{marginal likelihood}}$$

Bernoulli model: Bayesian treatment. Let $X_1, \dots, X_n \overset{iid}{\sim} \mathrm{Ber}(p)$.
Likelihood: $L(p) \equiv P(x_1, \dots, x_n \mid p) = \prod_{i=1}^n P(x_i \mid p) = \prod_{i=1}^n p^{x_i}(1-p)^{1-x_i} = p^{n\bar{x}}(1-p)^{n(1-\bar{x})}$
Prior: $P(p) = \mathrm{Beta}(p; \alpha, \beta) \propto p^{\alpha-1}(1-p)^{\beta-1}$

Beta distribution:
$$P(x \mid \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, x^{\alpha-1}(1-x)^{\beta-1}\, 1\{0 \le x \le 1\}$$
For $X \sim \mathrm{Beta}(\alpha, \beta)$:
$$E[X] = \frac{\alpha}{\alpha+\beta}, \qquad \mathrm{Var}[X] = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$$

Beta distribution with $\alpha = \beta = c$. [Figure: Beta densities for $\alpha = \beta \in \{0.5, 1, 2\}$.] For $X \sim \mathrm{Beta}(c, c)$:
$$E[X] = \frac{1}{2}, \qquad \mathrm{Var}[X] = \frac{1}{4(c+1)}$$
What happens when $c \to \infty$? What happens when $c \to 0$?

Beta distribution with $\alpha = d$, $\beta = cd$. [Figure: Beta densities for $(\alpha, \beta) \in \{(0.2, 0.8), (2, 8), (20, 80)\}$.] For $X \sim \mathrm{Beta}(d, cd)$:
$$E[X] = \frac{1}{1+c}, \qquad \mathrm{Var}[X] = \frac{c}{(c+1)^2\,(d(1+c)+1)}$$
What happens when $d \to \infty$?
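
A quick numeric check (illustrative; the value $c = 4$ is an arbitrary choice) of the $\mathrm{Beta}(d, cd)$ mean and variance formulas, and of the limiting behavior as $d$ grows: the mean stays at $1/(1+c)$ while the variance shrinks.

    import numpy as np

    rng = np.random.default_rng(5)
    c = 4.0
    for d in (0.2, 2.0, 20.0):
        x = rng.beta(d, c * d, size=200_000)
        print(f"d={d:5.1f}  mean={x.mean():.3f} (theory {1/(1+c):.3f})  "
              f"var={x.var():.4f} (theory {c/((c+1)**2*(d*(1+c)+1)):.4f})")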

Bernoulli model: Bayesian treatment. $X_1, \dots, X_n \overset{iid}{\sim} \mathrm{Ber}(p)$.
Likelihood: $P(x_1, \dots, x_n \mid p) = p^{n\bar{x}}(1-p)^{n(1-\bar{x})}$
Prior: $P(p) = \mathrm{Beta}(p; \alpha, \beta) \propto p^{\alpha-1}(1-p)^{\beta-1}$
Posterior: $P(p \mid x_1, \dots, x_n) \propto P(x_1, \dots, x_n \mid p)\,P(p) = p^{\alpha + n\bar{x} - 1}(1-p)^{\beta + n(1-\bar{x}) - 1}$, i.e., the posterior is $\mathrm{Beta}(p; \alpha + n\bar{x}, \beta + n(1-\bar{x}))$.

Bernoulli model: Bayesian treatment. Posterior mean:
$$\hat{p}_{MEAN} \equiv E[p \mid x_1, \dots, x_n] = \int p\, P(p \mid x_1, \dots, x_n)\, dp = \text{mean of } \mathrm{Beta}(\alpha + n\bar{x}, \beta + n(1-\bar{x})) = \frac{\alpha + n\bar{x}}{\alpha + \beta + n} = \bar{x}\left(\frac{n}{\alpha+\beta+n}\right) + \frac{\alpha}{\alpha+\beta}\left(\frac{\alpha+\beta}{\alpha+\beta+n}\right)$$
The posterior mean is a convex combination of the MLE and the prior mean.

Bernoulli model: Bayesian treatment. Posterior mean:
$$E[p \mid x_1, \dots, x_n] = \bar{x}\,\frac{n}{\alpha+\beta+n} + \frac{\alpha}{\alpha+\beta}\,\frac{\alpha+\beta}{\alpha+\beta+n}$$
The posterior mean is a smoothed version of the MLE. Example: observe all 1s out of $n$ trials, with $\alpha = \beta = 5$:

    n     p_MLE   p_MEAN (alpha = beta = 5)
    5     1       0.67
    500   1       0.99

The posterior mean approaches the MLE as $n \to \infty$, i.e., the prior matters less with more data. The prior can be viewed as adding pseudo-observations.
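
A minimal sketch (illustrative) reproducing the $\alpha = \beta = 5$ example above:

    # Posterior mean of the Beta-Bernoulli model vs. the MLE, all observations equal to 1.
    alpha, beta_prior = 5, 5
    for n in (5, 500):
        x_bar = 1.0
        p_mle = x_bar
        p_mean = (alpha + n * x_bar) / (alpha + beta_prior + n)
        print(f"n={n:4d}  MLE={p_mle:.2f}  posterior mean={p_mean:.2f}")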

Choosing the prior: How do we choose $\alpha, \beta$? Subjective Bayes: encode all reasonable assumptions about the domain into the prior. Other considerations: computational efficiency (conjugate priors).

Review: probabilistic interpretation of OLS. Linear regression model: $y = \beta^T x + \epsilon$, where $\epsilon \sim N(0, \sigma_0^2)$ is a Gaussian random variable; thus $y \sim N(\beta^T x, \sigma_0^2)$. We assume that $\beta$ is fixed (frequentist interpretation). We define $p(y \mid x, \beta, \sigma_0^2)$ as the sampling distribution given fixed values for the parameters $\beta, \sigma_0^2$. The likelihood function maps parameters to probabilities:
$$L : (\beta, \sigma_0^2) \mapsto p(y \mid D, \beta, \sigma_0^2) = \prod_i p(y_i \mid x_i, \beta, \sigma_0^2)$$
Maximizing the likelihood with respect to $\beta$ minimizes the RSS and yields the OLS solution: $\beta_{OLS} = \beta_{ML} = \arg\max_\beta L(\beta, \sigma_0^2)$.

Probabilistic interpretation of ridge regression. Ridge regression model: $y = \beta^T x + \epsilon$, so $y \sim N(\beta^T x, \sigma_0^2)$ is a Gaussian random variable (as before), and $\beta_j \sim N(0, \sigma^2)$ are i.i.d. Gaussian random variables (unlike before). Note that $y$ has mean zero and that $\beta$ is a random variable with a prior distribution. To find $\beta$ given data $D$ and $(\sigma^2, \sigma_0^2)$, we can compute the posterior distribution of $\beta$:
$$P(\beta \mid D, \sigma^2, \sigma_0^2) = \frac{P(D \mid \beta, \sigma^2, \sigma_0^2)\,P(\beta)}{P(D \mid \sigma^2, \sigma_0^2)}$$
Maximum a posteriori (MAP) estimate: $\beta_{map} = \arg\max_\beta P(\beta \mid D, \sigma^2, \sigma_0^2) = \arg\max_\beta P(D, \beta \mid \sigma^2, \sigma_0^2)$. What's the relationship between MAP and MLE? MAP reduces to MLE if we assume a uniform prior for $p(\beta)$.

Estimating $\beta$ given hyperparameters $(\sigma^2, \sigma_0^2)$. Let the observations be i.i.d. with $y_i \mid \beta, x_i \sim N(\beta^T x_i, \sigma_0^2)$, and let the $\beta_j$ be i.i.d. with $\beta_j \sim N(0, \sigma^2)$. The joint likelihood of data and parameters (given $\sigma_0, \sigma$) is
$$P(D, \beta) = P(D \mid \beta)\,P(\beta) = \prod_i P(y_i \mid x_i, \beta) \prod_j P(\beta_j)$$
Plugging the Gaussian PDFs into the joint log-likelihood, we get
$$\log P(D, \beta) = \sum_i \log P(y_i \mid x_i, \beta) + \sum_j \log P(\beta_j) = -\sum_i \frac{(\beta^T x_i - y_i)^2}{2\sigma_0^2} - \frac{1}{2\sigma^2}\sum_j \beta_j^2 + \text{const}$$
MAP estimate: $\beta_{map} = \arg\max_\beta \log P(D, \beta)$. As with OLS, set the gradient equal to zero and solve for $\beta$.
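
As a sketch (illustrative; it maximizes the objective numerically with scipy.optimize rather than via the closed form given on a later slide), the joint log-likelihood above can be optimized directly:

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(6)
    n, m, sigma0, sigma = 100, 5, 1.0, 0.5
    X = rng.standard_normal((n, m))
    y = X @ rng.normal(0, sigma, size=m) + rng.normal(0, sigma0, size=n)

    def neg_log_joint(beta):
        # -log P(D, beta), dropping additive constants
        return (np.sum((X @ beta - y) ** 2) / (2 * sigma0**2)
                + np.sum(beta**2) / (2 * sigma**2))

    beta_map = minimize(neg_log_joint, x0=np.zeros(m)).x
    print(np.round(beta_map, 3))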

Maximum a posteriori (MAP) estimate. Regularized linear regression: a new error to minimize,
$$E(\beta) = \sum_i (\beta^T x_i - y_i)^2 + \lambda \|\beta\|_2^2$$
where $\lambda > 0$ denotes $\sigma_0^2 / \sigma^2$. The extra term $\|\beta\|_2^2$ is called the regularization (regularizer) and controls the model complexity. Intuitions: If $\lambda \to +\infty$, then $\sigma_0^2 \gg \sigma^2$; that is, the variance of the noise is far greater than what our prior allows for $\beta$, so the prior model on $\beta$ is more accurate than what the data can tell us and we get a simple model; numerically, $\beta_{map} \to 0$. If $\lambda \to 0$, then we trust our data more; numerically, $\beta_{map} \to \beta_{ols} = \arg\min_\beta \sum_i (\beta^T x_i - y_i)^2$.
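
A sketch (illustrative) of the two limits: as $\lambda \to \infty$ the MAP estimate shrinks to zero, and as $\lambda \to 0$ it approaches the OLS solution.

    import numpy as np

    rng = np.random.default_rng(7)
    n, m = 200, 4
    X = rng.standard_normal((n, m))
    y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(0, 1.0, size=n)

    beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
    for lam in (1e-6, 1.0, 1e6):
        beta_map = np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)
        print(f"lambda={lam:8.0e}  beta_map={np.round(beta_map, 3)}")
    print("beta_ols =", np.round(beta_ols, 3))   # matches the lambda -> 0 row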

Closed-form solution. For regularized linear regression the solution changes very little (in form) from the OLS solution:
$$\arg\min_\beta \sum_i (\beta^T x_i - y_i)^2 + \lambda \|\beta\|_2^2 \quad \Rightarrow \quad \beta_{map} = (X^T X + \lambda I)^{-1} X^T y$$
which reduces to the OLS solution when $\lambda = 0$, as expected. If we have to use a numerical procedure, the gradient and the Hessian matrix change only nominally too:
$$\nabla E(\beta) = 2(X^T X \beta - X^T y + \lambda \beta), \qquad H = 2(X^T X + \lambda I)$$
As long as $\lambda \ge 0$, the optimization is convex.
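
A sketch (illustrative) verifying on simulated data that the gradient above vanishes at the closed-form solution:

    import numpy as np

    rng = np.random.default_rng(8)
    n, m, lam = 300, 6, 2.0
    X = rng.standard_normal((n, m))
    y = X @ rng.normal(0, 0.5, size=m) + rng.normal(0, 1.0, size=n)

    beta_map = np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)
    grad = 2 * (X.T @ X @ beta_map - X.T @ y + lam * beta_map)
    print("max |gradient| at beta_map:", np.max(np.abs(grad)))   # ~0 up to round-off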

Estimating the hyperparameters $\sigma^2, \sigma_0^2$. Ridge regression model: $y = X\beta + \epsilon$, with $\beta \sim N(0, \sigma^2 I_m)$. To find $(\sigma^2, \sigma_0^2)$ given data $D = (y, X)$, we need to compute the marginal likelihood:
$$L(\sigma^2, \sigma_0^2) \equiv P(y \mid X, \sigma^2, \sigma_0^2) = \int_\beta P(y, \beta \mid X, \sigma^2, \sigma_0^2)\, d\beta = \int_\beta P(y \mid \beta, X, \sigma_0^2)\, P(\beta \mid \sigma^2)\, d\beta = N(y;\, 0,\, \sigma^2 X X^T + \sigma_0^2 I_n)$$
so the log marginal likelihood is
$$\log L(\sigma^2, \sigma_0^2) = -\frac{1}{2}\left[ y^T K^{-1} y + \log \det K \right] + \text{const}, \qquad K \equiv \sigma^2 X X^T + \sigma_0^2 I_n$$
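
A sketch (illustrative) of evaluating the log marginal likelihood for given hyperparameters; slogdet is used for numerical stability:

    import numpy as np

    def log_marginal_likelihood(y, X, sigma2, sigma02):
        # -0.5 * (y^T K^{-1} y + log det K) + const, with K = sigma^2 X X^T + sigma0^2 I_n
        n = len(y)
        K = sigma2 * (X @ X.T) + sigma02 * np.eye(n)
        _, logdet = np.linalg.slogdet(K)
        return -0.5 * (y @ np.linalg.solve(K, y) + logdet + n * np.log(2 * np.pi))

    rng = np.random.default_rng(9)
    n, m = 100, 20
    X = rng.standard_normal((n, m))
    y = X @ rng.normal(0, 0.3, size=m) + rng.normal(0, 1.0, size=n)
    print(log_marginal_likelihood(y, X, sigma2=0.09, sigma02=1.0))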

Statistical properties of the ridge-regression estimator. Assumption: the linear model is correct. $\beta_{map}$ is a biased estimator of $\beta$. Contrast with OLS: $\beta_{ols}$ is an unbiased estimator of $\beta$ (Lecture 3). What about the variance?
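
A simulation sketch (illustrative; a deliberately large $\lambda$ is used to make the effect visible) of the bias: averaged over repeated samples, OLS recovers $\beta$ while the ridge/MAP estimate is shrunk toward zero.

    import numpy as np

    rng = np.random.default_rng(10)
    n, m, lam, reps = 100, 3, 50.0, 2000
    beta = np.array([1.0, -1.0, 0.5])
    ols_avg, ridge_avg = np.zeros(m), np.zeros(m)
    for _ in range(reps):
        X = rng.standard_normal((n, m))
        y = X @ beta + rng.normal(0, 1.0, size=n)
        ols_avg += np.linalg.solve(X.T @ X, X.T @ y) / reps
        ridge_avg += np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y) / reps

    print("true beta         :", beta)
    print("average OLS       :", np.round(ols_avg, 3))     # close to beta
    print("average ridge/MAP :", np.round(ridge_avg, 3))   # shrunk toward zero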

Computing the ridge-regression estimator: Estimating $\beta$ given the hyperparameters has the same runtime as OLS, $O(m^2 n)$. Estimating the hyperparameters needs a numerical procedure; each iteration is $O(n^3)$.

So how does this relate to heritability? Heritability is related to the hyperparameters:
$$h^2_m = \frac{m\sigma^2}{m\sigma^2 + \sigma_0^2}$$
Given genotype and phenotype pairs $\{(x_i, y_i)\}$, model the phenotype $y_i$ as $y_i = \beta^T x_i + \epsilon_i$, where $\beta_j \sim N(0, \sigma^2)$ and $\epsilon_i \sim N(0, \sigma_0^2)$. Estimate the hyperparameters $(\sigma^2, \sigma_0^2)$ by maximizing the marginal likelihood.
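
A sketch (illustrative; the simulation settings, starting values, and use of a generic optimizer are assumptions, not the course's recipe) of estimating $(\sigma^2, \sigma_0^2)$ by maximizing the marginal likelihood and converting to $h^2_m$:

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(11)
    n, m, h2_true = 500, 100, 0.5
    sigma2_true = h2_true / m                      # per-SNP effect variance (Var[y] = 1)
    sigma02_true = 1.0 - h2_true
    X = rng.standard_normal((n, m))
    y = (X @ rng.normal(0, np.sqrt(sigma2_true), size=m)
         + rng.normal(0, np.sqrt(sigma02_true), size=n))

    def neg_log_ml(log_params):
        sigma2, sigma02 = np.exp(log_params)       # log scale keeps both positive
        K = sigma2 * (X @ X.T) + sigma02 * np.eye(n)
        _, logdet = np.linalg.slogdet(K)
        return 0.5 * (y @ np.linalg.solve(K, y) + logdet)

    sigma2_hat, sigma02_hat = np.exp(minimize(neg_log_ml, x0=np.log([0.01, 1.0])).x)
    h2_hat = m * sigma2_hat / (m * sigma2_hat + sigma02_hat)
    print(f"estimated h^2 = {h2_hat:.2f} (simulated with h^2 = {h2_true})")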

Application of ridge regression to estimating heritability: this model is termed a linear mixed model in the genetics literature. Yang et al. 2010 applied it to height to estimate $h^2_G = 0.45$, dramatically higher than the estimates from GWAS (0.05). It has since been applied to a number of phenotypes.

Application of ridge regression to estimate heritability. [Figure: Visscher et al., AJHG 2012.]

Summary: Increasing the number of SNPs in the regression leads to statistical and numerical difficulties. Regularized regression is a solution; ridge regression is one form of regularization and can also be derived from a Bayesian perspective. These models have been useful in closing the missing heritability gap.