Association studies and regression

Association studies and regression CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar Association studies and regression 1 / 104

Administration HW1 will be posted today. Due in two weeks. Send me an email if you are still not enrolled! Association studies and regression 2 / 104

Review of last lecture Basics of statistical inference: parameter estimation, hypothesis testing, interval estimation. Different types of mistakes. Multiple hypothesis testing: would like to control FWER and FDR. FWER = P(at least one false positive); the Bonferroni procedure controls FWER. FDR = expected fraction of false discoveries; the Benjamini-Hochberg procedure controls FDR. Association studies and regression 3 / 104
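As a concrete reminder of those two procedures, here is a minimal Python sketch (the p-values, alpha = 0.05, and function names are hypothetical illustrations, not from the lecture):

import numpy as np

def bonferroni(pvals, alpha=0.05):
    # Reject H0_j when p_j <= alpha / m; controls the FWER at level alpha.
    pvals = np.asarray(pvals)
    return pvals <= alpha / len(pvals)

def benjamini_hochberg(pvals, alpha=0.05):
    # Benjamini-Hochberg step-up procedure; controls the FDR at level alpha.
    pvals = np.asarray(pvals)
    m = len(pvals)
    order = np.argsort(pvals)
    ranked = pvals[order]
    below = ranked <= (np.arange(1, m + 1) / m) * alpha   # p_(k) <= (k/m) * alpha
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.where(below)[0])      # largest k satisfying the step-up condition
        reject[order[:k + 1]] = True
    return reject

p = [0.001, 0.008, 0.012, 0.041, 0.27, 0.60]   # toy p-values
print(bonferroni(p))           # rejects the two smallest
print(benjamini_hochberg(p))   # rejects the three smallest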

Outline Motivation Population Association Studies and GWAS Linear regression Univariate solution Probabilistic interpretation Statistical properties of MLE Computational and numerical optimization Logistic regression General setup Maximum likelihood estimation Gradient descent Newton's method Application to GWAS Association studies and regression Motivation 4 / 104

Personalized genomics Association studies and regression Motivation 5 / 104

Path to Personalized Genomics Basic biology Phenotype is a function of genotype and environment i.e., learn P=f(G,E) How much of the phenotype is genetic vs environmental? Find the genetic and environmental factors associated with phenotype Finding drug targets Disease prediction Given genetic data, predict the phenotype for an individual Association studies and regression Motivation 6 / 104

Finding genetic factors that influence a phenotype Will be working with a population of individuals Genotypes of a population of individuals SNPs (most common) Phenotypes in the same set of individuals Binary (disease-healthy), ordinal (number of children or years of schooling), continuous (height, gene expression), heterogeneous and high-dimensional (images, text, videos) Association studies and regression Motivation 7 / 104

Outline Motivation Population Association Studies and GWAS Linear regression Univariate solution Probabilistic interpretation Statistical properties of MLE Computational and numerical optimization Logistic regression General setup Maximum likelihood estimation Gradient descent Newton's method Application to GWAS Association studies and regression Population Association Studies and GWAS 8 / 104

Population Association Studies Unrelated population of individuals measured for a phenotype Association studies and regression Population Association Studies and GWAS 9 / 104

Population Association Studies Find an association between a SNP and phenotype Really want to find a SNP that is causal. Simplest form: does the genotype at a single SNP predict phenotype? Association studies and regression Population Association Studies and GWAS 10 / 104

Recall some definitions from lecture 1 Locus: position along the chromosome (could be a single base or longer). Allele: set of variants at a locus Genotype: sequence of alleles along the loci of an individual Individual 1: (1,CT),(2,GG) Individual 2: (1,TT), (2,GA) Association studies and regression Population Association Studies and GWAS 11 / 104

Recall some definitions from lecture 1 Pick one allele as the reference; the other allele is called the alternate allele. Represent a genotype as the number of copies of the reference allele, so each genotype at a single base can be 0/1/2. Locus 1: C is the reference. Individual 1 (CT) has genotype 1, so x_{1,1} = 1. Individual 2 (TT) has genotype 0, so x_{2,1} = 0. Association studies and regression Population Association Studies and GWAS 11 / 104
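To make the 0/1/2 encoding concrete, here is a tiny sketch (the helper name is hypothetical, not from the lecture):

def encode_genotype(genotype, ref):
    # Count copies of the reference allele in a two-letter genotype string, e.g. "CT".
    return sum(1 for allele in genotype if allele == ref)

print(encode_genotype("CT", "C"))  # individual 1 at locus 1: x_{1,1} = 1
print(encode_genotype("TT", "C"))  # individual 2 at locus 1: x_{2,1} = 0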

Genome-wide Association Studies (GWAS) Perform a population association study across the genome Association studies and regression Population Association Studies and GWAS 12 / 104

GWAS Simplest form: univariate regression between SNP and continuous phenotype. y_i: phenotype for individual i. x_{i,j}: genotype for individual i at SNP j. Association studies and regression Population Association Studies and GWAS 13 / 104

Outline Motivation Population Association Studies and GWAS Linear regression Univariate solution Probabilistic interpretation Statistical properties of MLE Computational and numerical optimization Logistic regression General setup Maximum likelihood estimation Gradient descent Newton's method Application to GWAS Association studies and regression Linear regression 14 / 104

Linear regression Goals Predict continuous-valued output from inputs. Classification refers to binary outputs. Output, response: phenotype Input, covariate, predictor: genotype Common practice to test genotype at a single SNP (univariate regression). We will not make that assumption for now. Association studies and regression Linear regression 15 / 104

GWAS Single SNP association testing. [Figure: scatter plot of phenotype against genotype (AA, AG, GG) with a fitted regression line.] Phenotype = mean phenotype + effect size × genotype + noise. Learn parameters that predict the phenotype well. Squared difference: (actual - predicted phenotype)^2. Association studies and regression Linear regression 16 / 104
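A minimal simulation of this single-SNP model (the sample size, allele frequency, effect size, and noise level are made-up illustrative values):

import numpy as np

rng = np.random.default_rng(0)
n = 1000
maf = 0.3                            # hypothetical allele frequency
beta0, beta1, sigma = 2.0, 0.5, 1.0  # hypothetical mean, effect size, noise s.d.

x = rng.binomial(2, maf, size=n)                     # genotypes coded 0/1/2
y = beta0 + beta1 * x + rng.normal(0.0, sigma, n)    # phenotype = mean + effect*genotype + noise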

Linear regression Output: y ∈ R. Input: x ∈ R^m. Find f : x → y that predicts y well. Assume f(x) = β_0 + β^T x: linear in the parameters (hence the name). β_0 is called an intercept or bias. β = (β_1, ..., β_m): weights, parameters, or parameter vector. Sometimes \tilde{β} = (β_0, ..., β_m) is called the parameters. What does "well" mean? Association studies and regression Linear regression 17 / 104

Linear regression Setup We have labeled (training) data D = {(x_i, y_i), i = 1, ..., n}. i: individual; j: SNP. y_i: phenotype of individual i. x_{i,j}: genotype of individual i at SNP j, with x_{i,j} ∈ {0, 1, 2}. Association studies and regression Linear regression 18 / 104

Linear regression What does "well" mean? Minimize the residual sum of squares (RSS):
RSS(\tilde{β}) = \sum_{i=1}^n (y_i - f(x_i))^2 = \sum_{i=1}^n (y_i - (β_0 + β^T x_i))^2 = \sum_{i=1}^n (y_i - (β_0 + \sum_{j=1}^m β_j x_{i,j}))^2
Ordinary least squares estimator: \tilde{β}_{OLS} = \arg\min_{\tilde{β}} RSS(\tilde{β}).
Association studies and regression Linear regression 19 / 104

A simple case: x is just one-dimensional (m = 1). Residual sum of squares:
RSS(\tilde{β}) = \sum_i [y_i - f(x_i)]^2 = \sum_i [y_i - (β_0 + β_1 x_i)]^2
We denote x_i = x_{i,1} (scalar). Identify stationary points by taking the derivative with respect to each parameter and setting it to zero:
\partial RSS(\tilde{β}) / \partial β_0 = 0 ⟹ -2 \sum_i [y_i - (β_0 + β_1 x_i)] = 0
\partial RSS(\tilde{β}) / \partial β_1 = 0 ⟹ -2 \sum_i [y_i - (β_0 + β_1 x_i)] x_i = 0
Association studies and regression Linear regression 20 / 104

Simplify these expressions to get the Normal Equations:
\sum_i y_i = n β_0 + β_1 \sum_i x_i
\sum_i x_i y_i = β_0 \sum_i x_i + β_1 \sum_i x_i^2
We have two equations and two unknowns! Do some algebra to get:
β_1 = \sum_i (x_i - \bar{x})(y_i - \bar{y}) / \sum_i (x_i - \bar{x})^2   and   β_0 = \bar{y} - β_1 \bar{x}
where \bar{x} = (1/n) \sum_i x_i and \bar{y} = (1/n) \sum_i y_i.
Association studies and regression Linear regression 21 / 104
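A minimal numpy sketch of this closed-form univariate solution (the function name is hypothetical; x and y can be any paired arrays, e.g. the simulated genotypes and phenotypes above):

import numpy as np

def ols_univariate(x, y):
    # Closed-form least squares for y ≈ beta0 + beta1 * x.
    x, y = np.asarray(x, float), np.asarray(y, float)
    xbar, ybar = x.mean(), y.mean()
    beta1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
    beta0 = ybar - beta1 * xbar
    return beta0, beta1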

Why is minimizing RSS sensible? Probabilistic interpretation. Noisy observation model:
Y = β_0 + β_1 X + ε, where ε ~ N(0, σ^2) is a Gaussian random variable.
Likelihood of one training sample (x_i, y_i):
p(y_i | x_i, (β_0, β_1, σ^2)) = N(β_0 + β_1 x_i, σ^2) = \frac{1}{\sqrt{2πσ^2}} e^{-[y_i - (β_0 + β_1 x_i)]^2 / (2σ^2)}
Association studies and regression Linear regression 22 / 104

Probabilistic interpretation Log-likelihood of the training data D (assuming i.i.d. samples):
LL(β_0, β_1, σ^2) ≜ \log P(D | (β_0, β_1, σ^2))
= \log \prod_{i=1}^n p(y_i | x_i, (β_0, β_1, σ^2)) = \sum_i \log p(y_i | x_i, (β_0, β_1, σ^2))
= \sum_i \left\{ -\frac{[y_i - (β_0 + β_1 x_i)]^2}{2σ^2} - \log \sqrt{2πσ^2} \right\}
= -\frac{1}{2σ^2} \sum_i [y_i - (β_0 + β_1 x_i)]^2 - \frac{n}{2} \log σ^2 - \frac{n}{2} \log(2π)
= -\frac{1}{2} \left\{ \frac{1}{σ^2} \sum_i [y_i - (β_0 + β_1 x_i)]^2 + n \log σ^2 \right\} + const
Relationship between minimizing RSS and maximizing the log-likelihood?
Association studies and regression Linear regression 23 / 104
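For reference, a small sketch that evaluates this log-likelihood directly (hypothetical helper; for a fixed sigma^2, the (beta0, beta1) that maximizes it is exactly the pair that minimizes the RSS):

import numpy as np

def log_likelihood(beta0, beta1, sigma2, x, y):
    # Gaussian log-likelihood of y given x under y = beta0 + beta1*x + N(0, sigma2) noise.
    resid = y - (beta0 + beta1 * x)
    n = len(y)
    return -0.5 * np.sum(resid ** 2) / sigma2 - 0.5 * n * np.log(2 * np.pi * sigma2)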

Maximum likelihood estimation Estimating σ, β_0 and β_1 can be done in two steps.
Maximize over β_0 and β_1:
\arg\max_{β_0, β_1} LL(β_0, β_1, σ^2) = \arg\min_{β_0, β_1} \sum_i [y_i - (β_0 + β_1 x_i)]^2
That is RSS(\tilde{β})!
Maximize over s = σ^2 (we could estimate σ directly):
\frac{\partial LL(β_0, β_1, s)}{\partial s} = -\frac{1}{2} \left\{ -\frac{1}{s^2} \sum_i [y_i - (β_0 + β_1 x_i)]^2 + \frac{n}{s} \right\} = 0
⟹ \hat{σ}^2 = s = \frac{1}{n} \sum_i [y_i - (β_0 + β_1 x_i)]^2 = \frac{RSS(\tilde{β})}{n}
Association studies and regression Linear regression 24 / 104

How does this probabilistic interpretation help us? It gives a solid footing to our intuition: minimizing RSS(\tilde{β}) is a sensible thing to do based on reasonable modeling assumptions. Estimating σ tells us how much noise there could be in our predictions. For example, it allows us to place confidence intervals around our predictions. Association studies and regression Linear regression 25 / 104

Linear regression when x is m-dimensional Probabilistic model:
y_i = \tilde{β}^T \tilde{x}_i + ε_i,   ε_i ~ N(0, σ^2), i.i.d.
where we have redefined some variables (by augmenting):
\tilde{x} = [1, x_1, x_2, ..., x_m]^T,   \tilde{β} = [β_0, β_1, β_2, ..., β_m]^T
The likelihood of the parameters θ = (\tilde{β}, σ^2):
L(θ) = \prod_{i=1}^n Pr(y_i | x_i, θ) = \prod_{i=1}^n \frac{1}{\sqrt{2πσ^2}} \exp\left( -\frac{(y_i - \tilde{β}^T \tilde{x}_i)^2}{2σ^2} \right)
Choose parameters to maximize the likelihood.
Association studies and regression Linear regression 26 / 104

Linear regression Maximum Likelihood Estimation: \hat{θ} = \arg\max_θ \log L(θ).
LL(θ) ≜ \log L(θ) = \sum_{i=1}^n \log Pr(y_i | x_i, θ)
= -\sum_{i=1}^n \frac{(y_i - \tilde{β}^T \tilde{x}_i)^2}{2σ^2} - \frac{n}{2} \log(2πσ^2)
= -\frac{1}{2σ^2} RSS(\tilde{β}) - \frac{n}{2} \log(2πσ^2)
Maximizing the likelihood is equivalent to minimizing the RSS, i.e., least squares.
Association studies and regression Linear regression 27 / 104

Linear regression Computing the MLE: set the partial derivatives to zero,
\frac{\partial LL}{\partial σ^2}(\hat{θ}) = 0,   \nabla_{\tilde{β}} LL(\hat{θ}) = 0
Association studies and regression Linear regression 28 / 104

Linear regression Computing the MLE:
\frac{\partial LL}{\partial σ^2}(\hat{θ}) = 0
\frac{\partial}{\partial σ^2} \left( -\frac{1}{2σ^2} RSS(\tilde{β}) - \frac{n}{2} \log(2πσ^2) \right) = 0
\frac{1}{2(σ^2)^2} RSS(\tilde{β}) - \frac{n}{2σ^2} = 0
⟹ \hat{σ}^2 = \frac{RSS(\tilde{β})}{n}
Association studies and regression Linear regression 29 / 104

Linear regression Computing the MLE:
\nabla_{\tilde{β}} LL(\hat{θ}) = 0
\nabla_{\tilde{β}} \left( -\frac{1}{2σ^2} RSS(\tilde{β}) - \frac{n}{2} \log(2πσ^2) \right) = 0
⟹ \nabla_{\tilde{β}} RSS(\tilde{β}) = 0
Association studies and regression Linear regression 30 / 104

Linear regression RSS(\tilde{β}) in matrix form:
RSS(\tilde{β}) = \sum_i [y_i - (β_0 + \sum_j β_j x_{i,j})]^2 = \sum_i [y_i - \tilde{β}^T \tilde{x}_i]^2
which leads to
RSS(\tilde{β}) = \sum_i (y_i - \tilde{β}^T \tilde{x}_i)(y_i - \tilde{x}_i^T \tilde{β})
= \sum_i \left\{ \tilde{β}^T \tilde{x}_i \tilde{x}_i^T \tilde{β} - 2 y_i \tilde{x}_i^T \tilde{β} + const. \right\}
= \tilde{β}^T \left( \sum_i \tilde{x}_i \tilde{x}_i^T \right) \tilde{β} - 2 \left( \sum_i y_i \tilde{x}_i^T \right) \tilde{β} + const.
Association studies and regression Linear regression 31 / 104

RSS(\tilde{β}) in new notation. Design matrix and target vector:
X = [\tilde{x}_1^T ; ... ; \tilde{x}_n^T] ∈ R^{n×(m+1)},   y = [y_1 ; ... ; y_n]
Compact expression:
RSS(\tilde{β}) = ||X\tilde{β} - y||_2^2
= (X\tilde{β} - y)^T (X\tilde{β} - y)
= (\tilde{β}^T X^T - y^T)(X\tilde{β} - y)
= \tilde{β}^T X^T X \tilde{β} - 2 (X^T y)^T \tilde{β} + const
Association studies and regression Linear regression 32 / 104

Solution in matrix form. Compact expression:
RSS(\tilde{β}) = ||X\tilde{β} - y||_2^2 = \tilde{β}^T X^T X \tilde{β} - 2 (X^T y)^T \tilde{β} + const
Gradients of linear and quadratic functions:
\nabla_x (b^T x) = b,   \nabla_x (x^T A x) = 2Ax (symmetric A)
Normal equation:
\nabla_{\tilde{β}} RSS(\tilde{β}) = 2 X^T X \tilde{β} - 2 X^T y = 0
This leads to the ordinary least squares (OLS) solution:
\tilde{β} = (X^T X)^{-1} X^T y
Association studies and regression Linear regression 33 / 104
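A minimal numpy sketch of the matrix-form OLS solution (hypothetical helper; X here is an n x m matrix without the intercept column, which the function adds; it also returns the MLE noise variance RSS/n used on the next slide):

import numpy as np

def ols(X, y):
    # OLS via the normal equations: beta = (Xt^T Xt)^{-1} Xt^T y, with Xt the augmented design matrix.
    n = len(y)
    Xt = np.column_stack([np.ones(n), X])         # prepend a column of ones for the intercept
    beta = np.linalg.solve(Xt.T @ Xt, Xt.T @ y)   # solve the normal equations (no explicit inverse)
    rss = np.sum((y - Xt @ beta) ** 2)
    return beta, rss / n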

Remember the σ^2 parameter: \hat{σ}^2 = RSS(\tilde{β}) / n. Association studies and regression Linear regression 34 / 104

Mini-Summary Linear regression is a linear combination of features: f : x → y, with f(x) = β_0 + \sum_j β_j x_j = β_0 + β^T x. If we minimize the residual sum of squares as our learning objective, we get a closed-form solution for the parameters. Probabilistic interpretation: maximum likelihood, assuming the residual is Gaussian distributed. MLE: \tilde{β} = (X^T X)^{-1} X^T y, \hat{σ}^2 = RSS(\tilde{β}) / n. Association studies and regression Linear regression 35 / 104

Linear regression Statistical properties of the least squares estimator:
E[\hat{β} | X] = β,   Cov[\hat{β} | X] = σ^2 (X^T X)^{-1}
Assumptions: the linear model is correct; exogenous covariates, E[ε | X] = 0; uncorrelated, homoskedastic errors, Cov[ε | X] = σ^2 I_n. Does not assume normally distributed errors.
OLS is the best linear unbiased estimator (Gauss-Markov theorem), i.e., it has minimum variance among all linear unbiased estimators.
Association studies and regression Linear regression 36 / 104

OLS is unbiased. With y = Xβ + ε:
E[\hat{β} | X] = E[(X^T X)^{-1} X^T y | X]
= (X^T X)^{-1} X^T E[y | X]
= (X^T X)^{-1} X^T E[(Xβ + ε) | X]
= (X^T X)^{-1} X^T (E[Xβ | X] + E[ε | X])
= (X^T X)^{-1} X^T (Xβ + E[ε | X])
= (X^T X)^{-1} X^T X β
= β
Association studies and regression Linear regression 37 / 104

Linear regression Statistical properties of the MLE Exercise: use a similar argument to show that
Cov[\hat{β} | X] = σ^2 (X^T X)^{-1}
We can say more if we assume the errors are normally distributed:
\hat{β} ~ N(β, σ^2 (X^T X)^{-1})
Association studies and regression Linear regression 38 / 104
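A small simulation sketch of this sampling distribution, holding X fixed across replicates (all settings are made-up illustrative values):

import numpy as np

rng = np.random.default_rng(1)
n, m = 200, 3
beta_true = np.array([1.0, 0.5, -0.3, 0.0])            # [intercept, beta_1, beta_2, beta_3]
sigma = 1.0
X = rng.binomial(2, 0.3, size=(n, m)).astype(float)    # hypothetical genotype matrix
Xt = np.column_stack([np.ones(n), X])

estimates = []
for _ in range(2000):
    y = Xt @ beta_true + rng.normal(0.0, sigma, n)
    estimates.append(np.linalg.solve(Xt.T @ Xt, Xt.T @ y))
estimates = np.array(estimates)

print(estimates.mean(axis=0))   # close to beta_true (unbiasedness)
print(np.cov(estimates.T))      # close to sigma^2 * (Xt^T Xt)^{-1}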

Computational complexity Bottleneck of computing the solution \tilde{β} = (X^T X)^{-1} X^T y? Forming the matrix product X^T X ∈ R^{(m+1)×(m+1)}, and inverting the matrix X^T X. How many operations do we need? O(nm^2) for the matrix multiplication; O(m^3) (e.g., using Gauss-Jordan elimination) or O(m^{2.373}) (recent theoretical advances) for the matrix inversion. Impractical for very large m or n. Association studies and regression Linear regression 39 / 104

Alternative method: using numerical optimization Use gradient descent (more details when we talk about logistic regression next) Association studies and regression Linear regression 40 / 104
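A minimal gradient-descent sketch for the RSS objective (the step size and iteration count are arbitrary illustrative choices and may need tuning):

import numpy as np

def ols_gradient_descent(X, y, lr=0.01, n_iters=10000):
    # Minimize RSS(beta) = ||Xt beta - y||^2; the gradient is 2 Xt^T (Xt beta - y).
    n = len(y)
    Xt = np.column_stack([np.ones(n), X])    # augment with an intercept column
    beta = np.zeros(Xt.shape[1])
    for _ in range(n_iters):
        grad = 2.0 * Xt.T @ (Xt @ beta - y)
        beta -= lr * grad / n                # divide by n to keep the step size scale-free
    return beta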

Mini-Summary Assuming the probabilistic model is correct, the MLE is unbiased. Computing the MLE can be done by solving normal equations or by numerical optimization. Association studies and regression Linear regression 41 / 104

Linear regression After estimation, what? Prediction: given a new input x, our best guess of y is y = \hat{β}_0 + \hat{β}^T x. Association studies and regression Linear regression 42 / 104

Linear regression After estimation, what? Hypothesis testing: test whether β_j = 0 (Wald test).
\hat{β}_j ~ N(β_j, σ^2 [(X^T X)^{-1}]_{j,j})
Under H_0 : β_j = 0,
\hat{β}_j ~ N(0, σ^2 [(X^T X)^{-1}]_{j,j})   and so   \hat{β}_j / \hat{σ}_j ~ N(0, 1)
where \hat{σ}_j = \sqrt{\hat{σ}^2 [(X^T X)^{-1}]_{j,j}}.
Association studies and regression Linear regression 42 / 104
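A sketch of this Wald test for a single SNP (hypothetical helper; uses scipy for the normal tail probability and plugs in the MLE sigma^2 = RSS/n from earlier):

import numpy as np
from scipy import stats

def wald_test_single_snp(x, y):
    # Wald test of H0: beta_1 = 0 in y = beta0 + beta1 * x + noise.
    n = len(y)
    Xt = np.column_stack([np.ones(n), x])
    XtX_inv = np.linalg.inv(Xt.T @ Xt)
    beta = XtX_inv @ Xt.T @ y
    sigma2 = np.sum((y - Xt @ beta) ** 2) / n     # MLE of the noise variance
    se = np.sqrt(sigma2 * XtX_inv[1, 1])          # standard error of beta_1
    z = beta[1] / se
    pval = 2 * stats.norm.sf(abs(z))              # two-sided p-value
    return beta[1], z, pval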

Linear regression After estimation, what? (Generalized) likelihood ratio test; remember lecture 2. \hat{β}_0 : MLE with β_j = 0.
Λ = L(\hat{β}_0) / L(\hat{β})
-2 \log Λ = 2 [\log L(\hat{β}) - \log L(\hat{β}_0)]
-2 \log Λ ~ χ^2_1
Association studies and regression Linear regression 42 / 104
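A matching sketch of the likelihood ratio test for a single SNP (hypothetical helper; sigma^2 is profiled out with its MLE RSS/n in both the null and the full model):

import numpy as np
from scipy import stats

def lrt_single_snp(x, y):
    # LRT of H0: beta_1 = 0; with sigma^2 profiled out, 2 log(L_full/L_null) = n log(RSS_null/RSS_full).
    n = len(y)
    rss_null = np.sum((y - y.mean()) ** 2)        # null model: intercept only
    Xt = np.column_stack([np.ones(n), x])
    beta = np.linalg.solve(Xt.T @ Xt, Xt.T @ y)
    rss_full = np.sum((y - Xt @ beta) ** 2)       # full model: intercept + SNP
    lrt_stat = n * np.log(rss_null / rss_full)
    pval = stats.chi2.sf(lrt_stat, df=1)
    return lrt_stat, pval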

Linear regression Model diagnostics Do the observed statistics match the model assumptions? Residuals: e_i = y_i - \hat{β}^T x_i. If the model assumptions hold, the residuals must be independent and normally distributed with equal variance. How do we check this? Association studies and regression Linear regression 43 / 104

Linear regression Model diagnostics Are the residuals normally distributed (or, more generally, distributed according to a known distribution)? Q-Q plot: plot empirical quantiles vs. theoretical quantiles. Points close to the diagonal imply the empirical distribution is close to the theoretical distribution. Association studies and regression Linear regression 43 / 104
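A minimal sketch of such a residual Q-Q plot (assumes scipy and matplotlib are available; residuals is the array of e_i values from the previous slide):

from scipy import stats
import matplotlib.pyplot as plt

def qq_plot(residuals):
    # Plot empirical quantiles of the residuals against theoretical normal quantiles.
    stats.probplot(residuals, dist="norm", plot=plt)
    plt.title("Q-Q plot of residuals")
    plt.show()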

Linear regression Model diagnostics Q-Q plot. [Figure: Q-Q plots of residuals against normal quantiles for normally distributed errors (left) and Student-t distributed errors (right).] Association studies and regression Linear regression 43 / 104