1 Multiple regression CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar Multiple regression 1 / 36

2 Previous two lectures: Linear and logistic regression, viewed as probabilistic models that connect data and parameters. Estimating parameters by maximizing likelihood; this requires solving an optimization problem (analytical solutions are available in special cases, e.g. linear regression). Examples of numerical optimization: gradient descent, Newton's method. Important mathematical idea: convexity. Multiple regression 2 / 36

3 Applications to GWAS Practical issues Model checking Multiple hypothesis testing Multiple regression 3 / 36

4 This lecture Can we predict phenotype from genotype? Depends on heritability Multiple regression: ridge regression Bayesian statistics Multiple regression 4 / 36

5 GWAS so far 2554 studies and SNP-phenotype associations Success? Can we use the results of GWAS to predict phenotype? Multiple regression 5 / 36

7 Outline Heritability Bayesian statistics Bernoulli model Ridge regression Probabilistic interpretation Estimating β Estimating hyperparameters Back to Heritability Multiple regression Heritability 6 / 36

8 How well does a model fit the data? (Narrow-sense) heritability h^2: what is the accuracy of the best linear predictor of the phenotype? After we learn the parameters, how well can the model predict the phenotype? Narrow-sense: space of linear models. Broad-sense: space of all models. h^2 = R^2, where R^2 = 1 - SS_res/SS_tot, SS_res = Σ_{i=1}^n (y_i - β̂^T x_i)^2, and SS_tot = Σ_{i=1}^n (y_i - ȳ)^2. Multiple regression Heritability 7 / 36
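
As a quick illustration of the R^2 definition above, here is a minimal Python sketch (not part of the original slides; the function name and variables are illustrative):

```python
# A minimal sketch: R^2 of a fitted linear predictor, matching
# R^2 = 1 - SS_res / SS_tot from the slide above.
import numpy as np

def r_squared(y, X, beta_hat):
    """R^2 of the linear predictor X @ beta_hat for phenotypes y."""
    ss_res = np.sum((y - X @ beta_hat) ** 2)   # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)       # total sum of squares
    return 1.0 - ss_res / ss_tot
```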

9 Examples of heritability. [Figure: two scatter plots of phenotype against genotype, left panel h^2 = 0.2, right panel h^2 = 0.8.] With higher heritability we can better predict phenotype from genotype: left R^2 = 0.19 vs. right R^2 = 0.79. Multiple regression Heritability 8 / 36

10 Heritability estimates from GWAS: 18 GWAS variants for type 2 diabetes explained 6% of known heritability (Manolio et al., Nature 2009). 180 GWAS loci for height explain 10% of phenotypic variance (Lango-Allen et al., Nature 2010) from a sample of 133,653 individuals; h^2 for height is estimated to be 0.80 (Silventoinen et al., Twin Res. 2003). Multiple regression Heritability 9 / 36

11 Heritability estimates from GWAS (So et al., Gen. Epi. 2011):
Disease            h^2    GWAS loci   h^2 explained by GWAS loci   % of h^2 explained
Alzheimer's        0.79   4           0.18                         23%
Bipolar disorder   0.77   5           0.02                         3%
Breast cancer      0.53   13          0.07                         13%
CAD                0.49   12          0.12                         25%
Crohn's disease    0.55   32          0.07                         13%
Prostate cancer    0.50   27          0.15                         31%
Schizophrenia      0.81   4           0.00                         0%
SLE (lupus)        0.66   23          0.09                         13%
Type 1 diabetes    0.80   45          0.11                         14%
Type 2 diabetes    0.42   25          0.12                         28%
Multiple regression Heritability 9 / 36

12 The model for the phenotype: y = β_0 + Σ_{j=1}^m β_j x_j + ε, where y is the phenotype, x_j is the genotype at SNP j (sampled independently), and ε ~ N(0, σ^2). Then Var[y] = Var[β_0 + Σ_{j=1}^m β_j x_j + ε] = Σ_{j=1}^m Var[β_j x_j] + Var[ε] = Σ_{j=1}^m β_j^2 Var[x_j] + σ^2. Multiple regression Heritability 10 / 36

13 The model for the phenotype. To simplify notation, we assume that each genotype is standardized, so E[x_j] = 0 and Var[x_j] = 1, and that the phenotype has mean 0, E[y] = 0. Then Var[y] = Σ_{j=1}^m β_j^2 Var[x_j] + σ^2 = Σ_{j=1}^m β_j^2 + σ^2. Multiple regression Heritability 10 / 36
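
A small simulation sketch (not from the slides) checking the variance decomposition Var[y] = Σ_j β_j^2 + σ^2 for standardized genotypes, and the implied heritability; all numbers below are made up for illustration:

```python
# Simulate y = X beta + eps with standardized genotypes and compare the
# empirical Var[y] against sum(beta^2) + sigma^2.
import numpy as np

rng = np.random.default_rng(0)
n, m, sigma2 = 50_000, 100, 1.0
beta = rng.normal(0.0, 0.1, size=m)            # true effect sizes
X = rng.normal(0.0, 1.0, size=(n, m))          # standardized genotypes (mean 0, var 1)
y = X @ beta + rng.normal(0.0, np.sqrt(sigma2), size=n)

var_y_theory = np.sum(beta ** 2) + sigma2
h2_theory = np.sum(beta ** 2) / var_y_theory
print(np.var(y), var_y_theory)                 # empirical vs. theoretical Var[y]
print(h2_theory)                               # h^2 = sum(beta^2) / (sum(beta^2) + sigma^2)
```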

14 Heritability. What is the best possible accuracy for predicting phenotype from genotype? The best accuracy is obtained when we use the true model (which we don't know in practice, of course). Accuracy here refers to low mean squared error: E[(y - (β_0 + β^T x))^2] = E[ε^2] = σ^2. Also note that E[(y - (β_0 + β^T x))^2] ≤ Var[y]. Multiple regression Heritability 11 / 36

16 Heritability. If we knew the true model: h^2 = 1 - E[(y - (β_0 + β^T x))^2] / Var[y] = 1 - σ^2 / (Σ_{j=1}^m β_j^2 + σ^2) = Σ_{j=1}^m β_j^2 / (Σ_{j=1}^m β_j^2 + σ^2). In practice we don't know the values of β_j and σ^2. Multiple regression Heritability 12 / 36

18 What happens in GWAS? We test each of m SNPs for association. A is the set of associated SNPs, and β̂_j is the estimate of the effect size. The heritability of the associated SNPs is ĥ^2_A = Σ_{j∈A} β̂_j^2 / Var[y]. If the number of discoveries |A| is smaller than the number of truly associated SNPs (at most m), then ĥ^2_A < h^2; the difference is termed missing heritability. Multiple regression Heritability 13 / 36
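
A rough sketch (not from the slides) of why summing squared effects over only the genome-wide-significant SNPs underestimates h^2; the sample size, effect-size distribution, standard-error approximation, and threshold below are all arbitrary choices for illustration:

```python
# Estimate per-SNP effects by single-SNP regression, keep only the
# "significant" ones, and compare h^2_A with the true h^2.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, m, sigma2 = 2_000, 200, 1.0
beta = rng.normal(0.0, 0.05, size=m)
X = rng.normal(size=(n, m))
y = X @ beta + rng.normal(0.0, np.sqrt(sigma2), size=n)

h2_true = np.sum(beta ** 2) / (np.sum(beta ** 2) + sigma2)

beta_hat = X.T @ y / n                         # per-SNP marginal effect estimates
se = np.sqrt(np.var(y) / n)                    # rough per-SNP standard error
pvals = 2 * stats.norm.sf(np.abs(beta_hat) / se)
A = pvals < 5e-8                               # genome-wide significance threshold
h2_A = np.sum(beta_hat[A] ** 2) / np.var(y)

print(h2_true, h2_A)                           # h^2_A is typically much smaller
```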

20 Reasons for missing heritability: Power (SNPs that should be in A are not included). The true function is non-linear. Estimates of heritability are biased upwards. Multiple regression Heritability 14 / 36

21 Solving the power issue. Idea: learn a function that relates all SNPs to the phenotype, y = Xβ + ε. We can compute the MLE (equivalently the OLS) estimate: β̂ = (X^T X)^{-1} X^T y. Multiple regression Heritability 15 / 36
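
For concreteness, a minimal sketch of the OLS/MLE estimate (not from the slides); solving the normal equations rather than forming an explicit inverse is the usual numerical practice:

```python
# Minimal OLS sketch: solve (X^T X) beta = X^T y rather than explicitly
# inverting X^T X, which is numerically preferable.
import numpy as np

def ols(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent, and more robust when X^T X is ill-conditioned:
# beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```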

22 What if X^T X is not invertible in β̂ = (X^T X)^{-1} X^T y? Can you think of any reasons why that could happen? Answer 1: n < m + 1; intuitively, there is not enough data to estimate all the parameters. Answer 2: the columns of X are not linearly independent; intuitively, two features are perfectly correlated. In either case, the solution is not unique. Multiple regression Heritability 16 / 36

23 Ridge regression. For X^T X that is not invertible: β̂ = (X^T X + λI)^{-1} X^T y. This is equivalent to adding an extra term to RSS(β) and minimizing RSS(β) + λ‖β‖_2^2, where RSS(β) = β^T (X^T X) β - 2 (X^T y)^T β (up to a constant) and λ‖β‖_2^2 is the regularization term. Multiple regression Heritability 17 / 36
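
A minimal sketch of the ridge estimate (not from the slides); it shows that the estimate is well-defined even when n < m, where plain OLS breaks down:

```python
# Minimal ridge regression sketch. With n < m, X^T X is singular and OLS is
# not unique, but X^T X + lambda*I is invertible for any lambda > 0.
import numpy as np

def ridge(X, y, lam):
    m = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 200))                 # n = 50 samples, m = 200 SNPs
y = rng.normal(size=50)
beta_ridge = ridge(X, y, lam=1.0)              # well-defined despite n < m
```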

24 Outline Heritability Bayesian statistics Bernoulli model Ridge regression Probabilistic interpretation Estimating β Estimating hyperparameters Back to Heritability Multiple regression Bayesian statistics 18 / 36

25 Bayesian statistics. Frequentist statistics: evaluate θ̂ on repeated samples. θ is fixed, X ~ P is random, so θ̂ = t(X) is also random; e.g., is E[θ̂] = θ? Bayesian: θ is random and X is random; apply Bayes' theorem, P(θ | X) = P(X | θ) P(θ) / P(X), i.e., posterior = likelihood × prior / marginal likelihood. Multiple regression Bayesian statistics 19 / 36

26 Bernoulli model: Bayesian treatment. X_1, ..., X_n iid ~ Ber(p). Likelihood: L(p) = P(x_1, ..., x_n | p) = ∏_{i=1}^n P(x_i | p) = ∏_{i=1}^n p^{x_i} (1-p)^{1-x_i} = p^{n x̄} (1-p)^{n(1-x̄)}. Prior: P(p) = Beta(p; α, β) ∝ p^{α-1} (1-p)^{β-1}. Multiple regression Bayesian statistics 20 / 36

28 Beta distribution. P(x | α, β) = [Γ(α+β) / (Γ(α)Γ(β))] x^{α-1} (1-x)^{β-1} 1{0 ≤ x ≤ 1}. If X ~ Beta(α, β), then E[X] = α/(α+β) and Var[X] = αβ / [(α+β)^2 (α+β+1)]. Multiple regression Bayesian statistics 21 / 36
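
A quick numerical check of these moment formulas (not from the slides), using scipy:

```python
# Check the Beta mean/variance formulas against scipy's implementation.
from scipy import stats

alpha, beta_ = 2.0, 8.0
dist = stats.beta(alpha, beta_)
print(dist.mean(), alpha / (alpha + beta_))
print(dist.var(), alpha * beta_ / ((alpha + beta_) ** 2 * (alpha + beta_ + 1)))
```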

29 Beta distribution with α = β = c. P(x | α, β) = [Γ(α+β)/(Γ(α)Γ(β))] x^{α-1} (1-x)^{β-1} 1{0 ≤ x ≤ 1}. [Figure: densities of Beta(c, c) for several values of c.] If X ~ Beta(c, c), then E[X] = 1/2 and Var[X] = 1/[4(2c+1)]. What happens when c → ∞? What happens when c → 0? Multiple regression Bayesian statistics 21 / 36

30 Beta distribution with α = d, β = cd. P(x | α, β) = [Γ(α+β)/(Γ(α)Γ(β))] x^{α-1} (1-x)^{β-1} 1{0 ≤ x ≤ 1}. [Figure: densities for (α, β) = (0.2, 0.8), (2, 8), (20, 80).] If X ~ Beta(d, cd), then E[X] = 1/(1+c) and Var[X] = c/[(1+c)^2 (d(1+c)+1)]. What happens when d → ∞? Multiple regression Bayesian statistics 21 / 36

31 Bernoulli model: Bayesian treatment. X_1, ..., X_n iid ~ Ber(p). Likelihood: P(x_1, ..., x_n | p) = p^{n x̄} (1-p)^{n(1-x̄)}. Prior: P(p) = Beta(p; α, β) ∝ p^{α-1} (1-p)^{β-1}. Posterior: P(p | x_1, ..., x_n) ∝ P(x_1, ..., x_n | p) P(p) = p^{α + n x̄ - 1} (1-p)^{β + n(1-x̄) - 1}, i.e., the posterior is Beta(p; α + n x̄, β + n(1-x̄)). Multiple regression Bayesian statistics 22 / 36

32 Bernoulli model: Bayesian treatment. Posterior mean: p_MEAN = E[p | x_1, ..., x_n] = ∫ p P(p | x_1, ..., x_n) dp = mean of a Beta with parameters (α + n x̄, β + n(1-x̄)) = (α + n x̄)/(α + β + n) = x̄ · n/(α + β + n) + [α/(α+β)] · (α+β)/(α+β+n). The posterior mean is a convex combination of the MLE and the prior mean. Multiple regression Bayesian statistics 22 / 36

33 Bernoulli model: Bayesian treatment. Posterior mean: E[p | x_1, ..., x_n] = x̄ · n/(α + β + n) + [α/(α+β)] · (α+β)/(α+β+n). The posterior mean is a smoothed version of the MLE. Example: observe all 1s in n trials; then p_MLE = 1 while p_MEAN = (α + n)/(α + β + n) (e.g. with α = β = 5). The posterior mean approaches the MLE as n → ∞, i.e., the prior matters less with more data. The prior can be viewed as adding pseudo-observations. Multiple regression Bayesian statistics 22 / 36
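
A small sketch (not from the slides) reproducing this smoothing effect for the all-ones example, with α = β = 5 and a few illustrative values of n:

```python
# Beta-Bernoulli posterior mean for the "all ones" example. With
# alpha = beta = 5, the prior acts like 10 pseudo-observations with mean 0.5.
alpha, beta_ = 5, 5
for n in (1, 10, 100, 1000):                   # illustrative sample sizes
    p_mle = 1.0                                # all trials are 1, so x-bar = 1
    p_mean = (alpha + n) / (alpha + beta_ + n) # posterior mean
    print(n, p_mle, round(p_mean, 3))
```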

34 Choosing the prior. How do we choose α, β? Subjective Bayes: encode all reasonable assumptions about the domain into the prior. Other considerations: computational efficiency (conjugate priors). Multiple regression Bayesian statistics 23 / 36

35 Outline Heritability Bayesian statistics Bernoulli model Ridge regression Probabilistic interpretation Estimating β Estimating hyperparameters Back to Heritability Multiple regression Ridge regression 24 / 36

36 Review: Probabilistic interpretation of OLS. Linear regression model: y = β^T x + ε, where ε ~ N(0, σ_0^2) is a Gaussian random variable; thus y ~ N(β^T x, σ_0^2). We assume that β is fixed (frequentist interpretation). We define p(y | x, β, σ_0^2) as the sampling distribution given fixed values for the parameters β, σ_0^2. The likelihood function maps parameters to probabilities: L(β, σ_0^2) = p(y | D, β, σ_0^2) = ∏_i p(y_i | x_i, β, σ_0^2). Maximizing the likelihood with respect to β minimizes RSS and yields the OLS solution: β_OLS = β_ML = arg max_β L(β, σ_0^2). Multiple regression Ridge regression 25 / 36

42 Probabilistic interpretation of ridge regression. Ridge regression model: y = β^T x + ε, so y ~ N(β^T x, σ_0^2) is a Gaussian random variable (as before), and β_j ~ N(0, σ^2) are i.i.d. Gaussian random variables (unlike before). Note that y has mean zero, and β is a random variable with a prior distribution. To find β given data D and (σ^2, σ_0^2), we can compute the posterior distribution of β: P(β | D, σ^2, σ_0^2) = P(D | β, σ^2, σ_0^2) P(β) / P(D | σ^2, σ_0^2). Maximum a posteriori (MAP) estimate: β_MAP = arg max_β P(β | D, σ^2, σ_0^2) = arg max_β P(D, β | σ^2, σ_0^2). What's the relationship between MAP and MLE? MAP reduces to MLE if we assume a uniform prior for p(β). Multiple regression Ridge regression 26 / 36

49 Estimating β given hyperparameters (σ^2, σ_0^2). Let y_1, ..., y_n be independent with y_i | β, x_i ~ N(β^T x_i, σ_0^2), and let the β_j be i.i.d. with β_j ~ N(0, σ^2). Joint likelihood of data and parameters (given σ_0, σ): P(D, β) = P(D | β) P(β) = ∏_i P(y_i | x_i, β) ∏_j P(β_j). Joint log likelihood, plugging in the Gaussian PDF: log P(D, β) = Σ_i log P(y_i | x_i, β) + Σ_j log P(β_j) = -Σ_i (β^T x_i - y_i)^2 / (2σ_0^2) - Σ_j β_j^2 / (2σ^2) + const. MAP estimate: β_MAP = arg max_β log P(D, β). As with OLS, set the gradient equal to zero and solve for β. Multiple regression Ridge regression 27 / 36

52 Maximum a posteriori (MAP) estimate. Regularized linear regression: a new error to minimize, E(β) = Σ_i (β^T x_i - y_i)^2 + λ‖β‖_2^2, where λ > 0 denotes σ_0^2/σ^2. The extra term ‖β‖_2^2 is called the regularization term (regularizer) and controls the model complexity. Intuitions: If λ → +∞, then σ_0^2 ≫ σ^2; that is, the variance of the noise is far greater than what our prior allows for β, so the prior on β is more informative than what the data can tell us and we get a simple model; numerically, β_MAP → 0. If λ → 0, then we trust our data more; numerically, β_MAP → β_OLS = arg min_β Σ_i (β^T x_i - y_i)^2. Multiple regression Ridge regression 28 / 36

56 Closed-form solution. For regularized linear regression, the solution changes very little (in form) from the OLS solution: arg min_β Σ_i (β^T x_i - y_i)^2 + λ‖β‖_2^2 gives β_MAP = (X^T X + λI)^{-1} X^T y, which reduces to the OLS solution when λ = 0, as expected. If we have to use a numerical procedure, the gradient and the Hessian matrix change only slightly too: ∇E(β) = 2(X^T X β - X^T y + λβ), H = 2(X^T X + λI). As long as λ ≥ 0, the optimization is convex. Multiple regression Ridge regression 29 / 36
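
A small sketch (not from the slides) checking that the closed-form β_MAP makes the gradient 2(X^T X β - X^T y + λβ) vanish:

```python
# The closed-form ridge solution zeroes the gradient of
# E(beta) = ||X beta - y||^2 + lambda * ||beta||^2.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 20))
y = rng.normal(size=100)
lam = 2.0

beta_map = np.linalg.solve(X.T @ X + lam * np.eye(20), X.T @ y)
grad = 2 * (X.T @ X @ beta_map - X.T @ y + lam * beta_map)
print(np.max(np.abs(grad)))                    # ~0 up to numerical error
```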

58 Estimating hyperparameters σ^2, σ_0^2. Ridge regression model: y = Xβ + ε with β ~ N(0, σ^2 I_m). To find (σ^2, σ_0^2) given data D = (y, X), we compute the marginal likelihood: L(σ^2, σ_0^2) = P(y | X, σ^2, σ_0^2) = ∫ P(y, β | X, σ^2, σ_0^2) dβ = ∫ P(y | β, X, σ_0^2) P(β | σ^2) dβ = N(y; 0, σ^2 X X^T + σ_0^2 I_n), so the log marginal likelihood is -½[y^T K^{-1} y + log det K] + const, where K = σ^2 X X^T + σ_0^2 I_n. Multiple regression Ridge regression 30 / 36
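
A minimal sketch (not from the slides) of evaluating this log marginal likelihood; a Cholesky factorization of K keeps the computation numerically stable:

```python
# Log marginal likelihood log N(y; 0, K), K = sigma2 * X X^T + sigma0_2 * I_n,
# with constant terms dropped.
import numpy as np

def log_marginal_likelihood(y, X, sigma2, sigma0_2):
    n = len(y)
    K = sigma2 * (X @ X.T) + sigma0_2 * np.eye(n)
    L = np.linalg.cholesky(K)                      # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # K^{-1} y
    logdet = 2.0 * np.sum(np.log(np.diag(L)))              # log det K
    return -0.5 * (y @ alpha + logdet)
```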

61 Statistical properties of the ridge-regression estimator. Assumption: the linear model is correct. β_MAP is a biased estimator of β. Contrast with OLS: β_OLS is an unbiased estimator of β (Lecture 3). What about the variance? Multiple regression Ridge regression 31 / 36
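
A quick Monte Carlo sketch (not from the slides) illustrating the bias: over repeated datasets drawn from a fixed β, the average ridge estimate is shrunk toward zero while the average OLS estimate is close to β. The sample sizes, λ, and effect sizes below are arbitrary.

```python
# Ridge (MAP) estimates are shrunk toward zero on average (biased),
# while OLS estimates are unbiased.
import numpy as np

rng = np.random.default_rng(4)
n, m, lam, reps = 200, 5, 50.0, 2000
beta_true = np.array([1.0, -0.5, 0.0, 2.0, 0.25])

ols_mean = np.zeros(m)
ridge_mean = np.zeros(m)
for _ in range(reps):
    X = rng.normal(size=(n, m))
    y = X @ beta_true + rng.normal(size=n)
    ols_mean += np.linalg.solve(X.T @ X, X.T @ y) / reps
    ridge_mean += np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y) / reps

print(beta_true)
print(ols_mean.round(3))      # close to beta_true
print(ridge_mean.round(3))    # shrunk toward zero
```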

62 Computing the ridge-regression estimator. Estimating β given the hyperparameters has the same runtime as OLS, O(m^2 n). Estimating the hyperparameters needs a numerical procedure; each iteration is O(n^3). Multiple regression Ridge regression 32 / 36

63 Outline Heritability Bayesian statistics Bernoulli model Ridge regression Probabilistic interpretation Estimating β Estimating hyperparameters Back to Heritability Multiple regression Back to Heritability 33 / 36

64 So how does this relate to heritability? Heritability is related to the hyperparameters: h^2_m = mσ^2 / (mσ^2 + σ_0^2). Given genotype and phenotype pairs {(x_i, y_i)}, model the phenotype y_i as y_i = β^T x_i + ε_i, where β_j ~ N(0, σ^2) and ε_i ~ N(0, σ_0^2). Estimate the hyperparameters (σ^2, σ_0^2) by maximizing the marginal likelihood. Multiple regression Back to Heritability 34 / 36
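
Putting the pieces together, a rough sketch (not from the slides) that estimates h^2_m by a simple grid search over the hyperparameters, reusing the log_marginal_likelihood sketch from earlier; real tools (e.g. REML-based linear mixed model software) use more careful optimization, so this is only illustrative:

```python
# Estimate h^2_m = m*sigma2 / (m*sigma2 + sigma0_2) by a grid search over
# (sigma2, sigma0_2), maximizing the Gaussian marginal likelihood.
# Assumes log_marginal_likelihood() from the earlier sketch is in scope.
import numpy as np

def estimate_h2(y, X, grid=np.linspace(0.01, 0.99, 50)):
    n, m = X.shape
    var_y = np.var(y)
    best_h2, best_ll = None, -np.inf
    for h2 in grid:                              # parameterize by h^2 at fixed Var[y]
        sigma2 = h2 * var_y / m                  # per-SNP effect variance
        sigma0_2 = (1.0 - h2) * var_y            # residual variance
        ll = log_marginal_likelihood(y, X, sigma2, sigma0_2)
        if ll > best_ll:
            best_h2, best_ll = h2, ll
    return best_h2
```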

65 Application of ridge regression to estimate heritability. This approach is termed linear mixed models in the genetics literature. Yang et al. applied this model to height and estimated h^2_G = 0.45, dramatically higher than the estimate from GWAS loci (0.05). It has since been applied to a number of phenotypes. Multiple regression Back to Heritability 35 / 36

66 Application of ridge regression to estimate heritability. [Figure: heritability estimates across traits, from Visscher et al., AJHG 2012.] Multiple regression Back to Heritability 35 / 36

67 Summary. Increasing the number of SNPs in the regression leads to statistical and numerical difficulties; regularized regression is a solution. Ridge regression is one form of regularization and can also be derived from a Bayesian perspective. These methods have been useful in closing the missing heritability gap. Multiple regression Back to Heritability 36 / 36
