Association studies and regression

Association studies and regression CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar Association studies and regression 1 / 104

Administration HW1 will be posted today. Due in two weeks. Send me an email if you are still not enrolled! Association studies and regression 2 / 104

Review of last lecture Basics of statistical inference: parameter estimation, hypothesis testing, interval estimation. Different types of mistakes. Multiple hypothesis testing: would like to control FWER and FDR. FWER = P(at least one false positive); the Bonferroni procedure controls FWER. FDR = expected fraction of false discoveries; the Benjamini-Hochberg procedure controls FDR. Association studies and regression 3 / 104
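As a concrete reminder of those two procedures, here is a minimal Python sketch (the p-values, alpha = 0.05, and function names are hypothetical illustrations, not from the lecture):

import numpy as np

def bonferroni(pvals, alpha=0.05):
    # Reject H0_j when p_j <= alpha / m; controls the FWER at level alpha.
    pvals = np.asarray(pvals)
    return pvals <= alpha / len(pvals)

def benjamini_hochberg(pvals, alpha=0.05):
    # Benjamini-Hochberg step-up procedure; controls the FDR at level alpha.
    pvals = np.asarray(pvals)
    m = len(pvals)
    order = np.argsort(pvals)
    ranked = pvals[order]
    below = ranked <= (np.arange(1, m + 1) / m) * alpha   # p_(k) <= (k/m) * alpha
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.where(below)[0])      # largest k satisfying the step-up condition
        reject[order[:k + 1]] = True
    return reject

p = [0.001, 0.008, 0.012, 0.041, 0.27, 0.60]   # toy p-values
print(bonferroni(p))           # rejects the two smallest
print(benjamini_hochberg(p))   # rejects the three smallest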

Outline Motivation Population Association Studies and GWAS Linear regression Univariate solution Probabilistic interpretation Statistical properties of MLE Computational and numerical optimization Logistic regression General setup Maximum likelihood estimation Gradient descent Newton's method Application to GWAS Association studies and regression Motivation 4 / 104

Personalized genomics Association studies and regression Motivation 5 / 104

Path to Personalized Genomics Basic biology Phenotype is a function of genotype and environment i.e., learn P=f(G,E) How much of the phenotype is genetic vs environmental? Find the genetic and environmental factors associated with phenotype Finding drug targets Disease prediction Given genetic data, predict the phenotype for an individual Association studies and regression Motivation 6 / 104

Finding genetic factors that influence a phenotype Will be working with a population of individuals Genotypes of a population of individuals SNPs (most common) Phenotypes in the same set of individuals Binary (disease-healthy), ordinal (number of children or years of schooling), continuous (height, gene expression), heterogeneous and high-dimensional (images, text, videos) Association studies and regression Motivation 7 / 104

Outline Motivation Population Association Studies and GWAS Linear regression Univariate solution Probabilistic interpretation Statistical properties of MLE Computational and numerical optimization Logistic regression General setup Maximum likelihood estimation Gradient descent Newton's method Application to GWAS Association studies and regression Population Association Studies and GWAS 8 / 104

Population Association Studies Unrelated population of individuals measured for a phenotype Association studies and regression Population Association Studies and GWAS 9 / 104

Population Association Studies Find an association between a SNP and phenotype Really want to find a SNP that is causal. Simplest form: does the genotype at a single SNP predict phenotype? Association studies and regression Population Association Studies and GWAS 10 / 104

Recall some definitions from lecture 1 Locus: position along the chromosome (could be a single base or longer). Allele: set of variants at a locus Genotype: sequence of alleles along the loci of an individual Individual 1: (1,CT),(2,GG) Individual 2: (1,TT), (2,GA) Association studies and regression Population Association Studies and GWAS 11 / 104

Recall some definitions from lecture 1 Pick one allele as the reference; the other allele is called the alternate allele. Represent a genotype as the number of copies of the reference allele, so each genotype at a single base can be 0/1/2. Locus 1: C is the reference. Individual 1 (CT) has genotype 1, so x_{1,1} = 1. Individual 2 (TT) has genotype 0, so x_{2,1} = 0. Association studies and regression Population Association Studies and GWAS 11 / 104
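To make the 0/1/2 encoding concrete, here is a tiny sketch (the helper name is hypothetical, not from the lecture):

def encode_genotype(genotype, ref):
    # Count copies of the reference allele in a two-letter genotype string, e.g. "CT".
    return sum(1 for allele in genotype if allele == ref)

print(encode_genotype("CT", "C"))  # individual 1 at locus 1: x_{1,1} = 1
print(encode_genotype("TT", "C"))  # individual 2 at locus 1: x_{2,1} = 0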

Genome-wide Association Studies (GWAS) Perform a population association study across the genome Association studies and regression Population Association Studies and GWAS 12 / 104

GWAS Simplest form: univariate regression between SNP and continuous phenotype. y_i: phenotype for individual i. x_{i,j}: genotype for individual i at SNP j. Association studies and regression Population Association Studies and GWAS 13 / 104

Outline Motivation Population Association Studies and GWAS Linear regression Univariate solution Probabilistic interpretation Statistical properties of MLE Computational and numerical optimization Logistic regression General setup Maximum likelihood estimation Gradient descent Newton's method Application to GWAS Association studies and regression Linear regression 14 / 104

Linear regression Goals Predict continuous-valued output from inputs. Classification refers to binary outputs. Output, response: phenotype Input, covariate, predictor: genotype Common practice to test genotype at a single SNP (univariate regression). We will not make that assumption for now. Association studies and regression Linear regression 15 / 104

GWAS Single SNP association testing. [Figure: scatter plot of phenotype against genotype (AA, AG, GG) with a fitted regression line.] Phenotype = mean phenotype + effect size × genotype + noise. Learn parameters that predict the phenotype well. Squared difference: (actual - predicted phenotype)^2. Association studies and regression Linear regression 16 / 104
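A minimal simulation of this single-SNP model (the sample size, allele frequency, effect size, and noise level are made-up illustrative values):

import numpy as np

rng = np.random.default_rng(0)
n = 1000
maf = 0.3                            # hypothetical allele frequency
beta0, beta1, sigma = 2.0, 0.5, 1.0  # hypothetical mean, effect size, noise s.d.

x = rng.binomial(2, maf, size=n)                     # genotypes coded 0/1/2
y = beta0 + beta1 * x + rng.normal(0.0, sigma, n)    # phenotype = mean + effect*genotype + noise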

Linear regression Output: y ∈ R. Input: x ∈ R^m. Find f : x → y that predicts y well. Assume f(x) = β_0 + β^T x: linear in the parameters (hence the name). β_0 is called an intercept or bias. β = (β_1, ..., β_m): weights, parameters, or parameter vector. Sometimes \tilde{β} = (β_0, ..., β_m) is called the parameters. What does "well" mean? Association studies and regression Linear regression 17 / 104

Linear regression Setup We have labeled (training) data D = {(x_i, y_i), i = 1, ..., n}. i: individual; j: SNP. y_i: phenotype of individual i. x_{i,j}: genotype of individual i at SNP j, with x_{i,j} ∈ {0, 1, 2}. Association studies and regression Linear regression 18 / 104

Linear regression What does "well" mean? Minimize the residual sum of squares (RSS):
RSS(\tilde{β}) = \sum_{i=1}^n (y_i - f(x_i))^2 = \sum_{i=1}^n (y_i - (β_0 + β^T x_i))^2 = \sum_{i=1}^n (y_i - (β_0 + \sum_{j=1}^m β_j x_{i,j}))^2
Ordinary least squares estimator: \tilde{β}_{OLS} = \arg\min_{\tilde{β}} RSS(\tilde{β}).
Association studies and regression Linear regression 19 / 104

A simple case: x is just one-dimensional (m = 1). Residual sum of squares:
RSS(\tilde{β}) = \sum_i [y_i - f(x_i)]^2 = \sum_i [y_i - (β_0 + β_1 x_i)]^2
We denote x_i = x_{i,1} (scalar). Identify stationary points by taking the derivative with respect to each parameter and setting it to zero:
\partial RSS(\tilde{β}) / \partial β_0 = 0 ⟹ -2 \sum_i [y_i - (β_0 + β_1 x_i)] = 0
\partial RSS(\tilde{β}) / \partial β_1 = 0 ⟹ -2 \sum_i [y_i - (β_0 + β_1 x_i)] x_i = 0
Association studies and regression Linear regression 20 / 104

Simplify these expressions to get the Normal Equations:
\sum_i y_i = n β_0 + β_1 \sum_i x_i
\sum_i x_i y_i = β_0 \sum_i x_i + β_1 \sum_i x_i^2
We have two equations and two unknowns! Do some algebra to get:
β_1 = \sum_i (x_i - \bar{x})(y_i - \bar{y}) / \sum_i (x_i - \bar{x})^2   and   β_0 = \bar{y} - β_1 \bar{x}
where \bar{x} = (1/n) \sum_i x_i and \bar{y} = (1/n) \sum_i y_i.
Association studies and regression Linear regression 21 / 104
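A minimal numpy sketch of this closed-form univariate solution (the function name is hypothetical; x and y can be any paired arrays, e.g. the simulated genotypes and phenotypes above):

import numpy as np

def ols_univariate(x, y):
    # Closed-form least squares for y ≈ beta0 + beta1 * x.
    x, y = np.asarray(x, float), np.asarray(y, float)
    xbar, ybar = x.mean(), y.mean()
    beta1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
    beta0 = ybar - beta1 * xbar
    return beta0, beta1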

Why is minimizing RSS sensible? Probabilistic interpretation. Noisy observation model:
Y = β_0 + β_1 X + ε, where ε ~ N(0, σ^2) is a Gaussian random variable.
Likelihood of one training sample (x_i, y_i):
p(y_i | x_i, (β_0, β_1, σ^2)) = N(β_0 + β_1 x_i, σ^2) = \frac{1}{\sqrt{2πσ^2}} e^{-[y_i - (β_0 + β_1 x_i)]^2 / (2σ^2)}
Association studies and regression Linear regression 22 / 104

Probabilistic interpretation Log-likelihood of the training data D (assuming i.i.d. samples):
LL(β_0, β_1, σ^2) ≜ \log P(D | (β_0, β_1, σ^2))
= \log \prod_{i=1}^n p(y_i | x_i, (β_0, β_1, σ^2)) = \sum_i \log p(y_i | x_i, (β_0, β_1, σ^2))
= \sum_i \left\{ -\frac{[y_i - (β_0 + β_1 x_i)]^2}{2σ^2} - \log \sqrt{2πσ^2} \right\}
= -\frac{1}{2σ^2} \sum_i [y_i - (β_0 + β_1 x_i)]^2 - \frac{n}{2} \log σ^2 - \frac{n}{2} \log(2π)
= -\frac{1}{2} \left\{ \frac{1}{σ^2} \sum_i [y_i - (β_0 + β_1 x_i)]^2 + n \log σ^2 \right\} + const
Relationship between minimizing RSS and maximizing the log-likelihood?
Association studies and regression Linear regression 23 / 104
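For reference, a small sketch that evaluates this log-likelihood directly (hypothetical helper; for a fixed sigma^2, the (beta0, beta1) that maximizes it is exactly the pair that minimizes the RSS):

import numpy as np

def log_likelihood(beta0, beta1, sigma2, x, y):
    # Gaussian log-likelihood of y given x under y = beta0 + beta1*x + N(0, sigma2) noise.
    resid = y - (beta0 + beta1 * x)
    n = len(y)
    return -0.5 * np.sum(resid ** 2) / sigma2 - 0.5 * n * np.log(2 * np.pi * sigma2)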

Maximum likelihood estimation Estimating σ, β_0 and β_1 can be done in two steps.
Maximize over β_0 and β_1:
\arg\max_{β_0, β_1} LL(β_0, β_1, σ^2) = \arg\min_{β_0, β_1} \sum_i [y_i - (β_0 + β_1 x_i)]^2
That is RSS(\tilde{β})!
Maximize over s = σ^2 (we could estimate σ directly):
\frac{\partial LL(β_0, β_1, s)}{\partial s} = -\frac{1}{2} \left\{ -\frac{1}{s^2} \sum_i [y_i - (β_0 + β_1 x_i)]^2 + \frac{n}{s} \right\} = 0
⟹ \hat{σ}^2 = s = \frac{1}{n} \sum_i [y_i - (β_0 + β_1 x_i)]^2 = \frac{RSS(\tilde{β})}{n}
Association studies and regression Linear regression 24 / 104

How does this probabilistic interpretation help us? It gives a solid footing to our intuition: minimizing RSS(\tilde{β}) is a sensible thing to do based on reasonable modeling assumptions. Estimating σ tells us how much noise there could be in our predictions. For example, it allows us to place confidence intervals around our predictions. Association studies and regression Linear regression 25 / 104

Linear regression when x is m-dimensional Probabilistic model:
y_i = \tilde{β}^T \tilde{x}_i + ε_i,   ε_i ~ N(0, σ^2), i.i.d.
where we have redefined some variables (by augmenting):
\tilde{x} = [1, x_1, x_2, ..., x_m]^T,   \tilde{β} = [β_0, β_1, β_2, ..., β_m]^T
The likelihood of the parameters θ = (\tilde{β}, σ^2):
L(θ) = \prod_{i=1}^n Pr(y_i | x_i, θ) = \prod_{i=1}^n \frac{1}{\sqrt{2πσ^2}} \exp\left( -\frac{(y_i - \tilde{β}^T \tilde{x}_i)^2}{2σ^2} \right)
Choose parameters to maximize the likelihood.
Association studies and regression Linear regression 26 / 104

Linear regression Maximum Likelihood Estimation: \hat{θ} = \arg\max_θ \log L(θ).
LL(θ) ≜ \log L(θ) = \sum_{i=1}^n \log Pr(y_i | x_i, θ)
= -\sum_{i=1}^n \frac{(y_i - \tilde{β}^T \tilde{x}_i)^2}{2σ^2} - \frac{n}{2} \log(2πσ^2)
= -\frac{1}{2σ^2} RSS(\tilde{β}) - \frac{n}{2} \log(2πσ^2)
Maximizing the likelihood is equivalent to minimizing the RSS, i.e., least squares.
Association studies and regression Linear regression 27 / 104

Linear regression Computing the MLE: set the partial derivatives to zero,
\frac{\partial LL}{\partial σ^2}(\hat{θ}) = 0,   \nabla_{\tilde{β}} LL(\hat{θ}) = 0
Association studies and regression Linear regression 28 / 104

Linear regression Computing the MLE:
\frac{\partial LL}{\partial σ^2}(\hat{θ}) = 0
\frac{\partial}{\partial σ^2} \left( -\frac{1}{2σ^2} RSS(\tilde{β}) - \frac{n}{2} \log(2πσ^2) \right) = 0
\frac{1}{2(σ^2)^2} RSS(\tilde{β}) - \frac{n}{2σ^2} = 0
⟹ \hat{σ}^2 = \frac{RSS(\tilde{β})}{n}
Association studies and regression Linear regression 29 / 104

Linear regression Computing the MLE:
\nabla_{\tilde{β}} LL(\hat{θ}) = 0
\nabla_{\tilde{β}} \left( -\frac{1}{2σ^2} RSS(\tilde{β}) - \frac{n}{2} \log(2πσ^2) \right) = 0
⟹ \nabla_{\tilde{β}} RSS(\tilde{β}) = 0
Association studies and regression Linear regression 30 / 104

Linear regression RSS(\tilde{β}) in matrix form:
RSS(\tilde{β}) = \sum_i [y_i - (β_0 + \sum_j β_j x_{i,j})]^2 = \sum_i [y_i - \tilde{β}^T \tilde{x}_i]^2
which leads to
RSS(\tilde{β}) = \sum_i (y_i - \tilde{β}^T \tilde{x}_i)(y_i - \tilde{x}_i^T \tilde{β})
= \sum_i \left\{ \tilde{β}^T \tilde{x}_i \tilde{x}_i^T \tilde{β} - 2 y_i \tilde{x}_i^T \tilde{β} + const. \right\}
= \tilde{β}^T \left( \sum_i \tilde{x}_i \tilde{x}_i^T \right) \tilde{β} - 2 \left( \sum_i y_i \tilde{x}_i^T \right) \tilde{β} + const.
Association studies and regression Linear regression 31 / 104

RSS(\tilde{β}) in new notation. Design matrix and target vector:
X = [\tilde{x}_1^T ; ... ; \tilde{x}_n^T] ∈ R^{n×(m+1)},   y = [y_1 ; ... ; y_n]
Compact expression:
RSS(\tilde{β}) = ||X\tilde{β} - y||_2^2
= (X\tilde{β} - y)^T (X\tilde{β} - y)
= (\tilde{β}^T X^T - y^T)(X\tilde{β} - y)
= \tilde{β}^T X^T X \tilde{β} - 2 (X^T y)^T \tilde{β} + const
Association studies and regression Linear regression 32 / 104

Solution in matrix form. Compact expression:
RSS(\tilde{β}) = ||X\tilde{β} - y||_2^2 = \tilde{β}^T X^T X \tilde{β} - 2 (X^T y)^T \tilde{β} + const
Gradients of linear and quadratic functions:
\nabla_x (b^T x) = b,   \nabla_x (x^T A x) = 2Ax (symmetric A)
Normal equation:
\nabla_{\tilde{β}} RSS(\tilde{β}) = 2 X^T X \tilde{β} - 2 X^T y = 0
This leads to the ordinary least squares (OLS) solution:
\tilde{β} = (X^T X)^{-1} X^T y
Association studies and regression Linear regression 33 / 104
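A minimal numpy sketch of the matrix-form OLS solution (hypothetical helper; X here is an n x m matrix without the intercept column, which the function adds; it also returns the MLE noise variance RSS/n used on the next slide):

import numpy as np

def ols(X, y):
    # OLS via the normal equations: beta = (Xt^T Xt)^{-1} Xt^T y, with Xt the augmented design matrix.
    n = len(y)
    Xt = np.column_stack([np.ones(n), X])         # prepend a column of ones for the intercept
    beta = np.linalg.solve(Xt.T @ Xt, Xt.T @ y)   # solve the normal equations (no explicit inverse)
    rss = np.sum((y - Xt @ beta) ** 2)
    return beta, rss / n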

Remember the σ^2 parameter: \hat{σ}^2 = RSS(\tilde{β}) / n. Association studies and regression Linear regression 34 / 104

Mini-Summary Linear regression is a linear combination of features: f : x → y, with f(x) = β_0 + \sum_j β_j x_j = β_0 + β^T x. If we minimize the residual sum of squares as our learning objective, we get a closed-form solution for the parameters. Probabilistic interpretation: maximum likelihood, assuming the residual is Gaussian distributed. MLE: \tilde{β} = (X^T X)^{-1} X^T y, \hat{σ}^2 = RSS(\tilde{β}) / n. Association studies and regression Linear regression 35 / 104

Linear regression Statistical properties of the least squares estimator:
E[\hat{β} | X] = β,   Cov[\hat{β} | X] = σ^2 (X^T X)^{-1}
Assumptions: the linear model is correct; exogenous covariates, E[ε | X] = 0; uncorrelated, homoskedastic errors, Cov[ε | X] = σ^2 I_n. Does not assume normally distributed errors.
OLS is the best linear unbiased estimator (Gauss-Markov theorem), i.e., it has minimum variance among all linear unbiased estimators.
Association studies and regression Linear regression 36 / 104

OLS is unbiased. With y = Xβ + ε:
E[\hat{β} | X] = E[(X^T X)^{-1} X^T y | X]
= (X^T X)^{-1} X^T E[y | X]
= (X^T X)^{-1} X^T E[(Xβ + ε) | X]
= (X^T X)^{-1} X^T (E[Xβ | X] + E[ε | X])
= (X^T X)^{-1} X^T (Xβ + E[ε | X])
= (X^T X)^{-1} X^T X β
= β
Association studies and regression Linear regression 37 / 104

Linear regression Statistical properties of the MLE Exercise: use a similar argument to show that
Cov[\hat{β} | X] = σ^2 (X^T X)^{-1}
We can say more if we assume the errors are normally distributed:
\hat{β} ~ N(β, σ^2 (X^T X)^{-1})
Association studies and regression Linear regression 38 / 104
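A small simulation sketch of this sampling distribution, holding X fixed across replicates (all settings are made-up illustrative values):

import numpy as np

rng = np.random.default_rng(1)
n, m = 200, 3
beta_true = np.array([1.0, 0.5, -0.3, 0.0])            # [intercept, beta_1, beta_2, beta_3]
sigma = 1.0
X = rng.binomial(2, 0.3, size=(n, m)).astype(float)    # hypothetical genotype matrix
Xt = np.column_stack([np.ones(n), X])

estimates = []
for _ in range(2000):
    y = Xt @ beta_true + rng.normal(0.0, sigma, n)
    estimates.append(np.linalg.solve(Xt.T @ Xt, Xt.T @ y))
estimates = np.array(estimates)

print(estimates.mean(axis=0))   # close to beta_true (unbiasedness)
print(np.cov(estimates.T))      # close to sigma^2 * (Xt^T Xt)^{-1}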

Computational complexity Bottleneck of computing the solution \tilde{β} = (X^T X)^{-1} X^T y? Forming the matrix product X^T X ∈ R^{(m+1)×(m+1)}, and inverting the matrix X^T X. How many operations do we need? O(nm^2) for the matrix multiplication; O(m^3) (e.g., using Gauss-Jordan elimination) or O(m^{2.373}) (recent theoretical advances) for the matrix inversion. Impractical for very large m or n. Association studies and regression Linear regression 39 / 104

Alternative method: using numerical optimization Use gradient descent (more details when we talk about logistic regression next) Association studies and regression Linear regression 40 / 104
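A minimal gradient-descent sketch for the RSS objective (the step size and iteration count are arbitrary illustrative choices and may need tuning):

import numpy as np

def ols_gradient_descent(X, y, lr=0.01, n_iters=10000):
    # Minimize RSS(beta) = ||Xt beta - y||^2; the gradient is 2 Xt^T (Xt beta - y).
    n = len(y)
    Xt = np.column_stack([np.ones(n), X])    # augment with an intercept column
    beta = np.zeros(Xt.shape[1])
    for _ in range(n_iters):
        grad = 2.0 * Xt.T @ (Xt @ beta - y)
        beta -= lr * grad / n                # divide by n to keep the step size scale-free
    return beta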

Mini-Summary Assuming the probabilistic model is correct, the MLE is unbiased. Computing the MLE can be done by solving normal equations or by numerical optimization. Association studies and regression Linear regression 41 / 104

Linear regression After estimation, what? Prediction: given a new input x, our best guess of y is y = \hat{β}_0 + \hat{β}^T x. Association studies and regression Linear regression 42 / 104

Linear regression After estimation, what? Hypothesis testing: test whether β_j = 0 (Wald test).
\hat{β}_j ~ N(β_j, σ^2 [(X^T X)^{-1}]_{j,j})
Under H_0 : β_j = 0,
\hat{β}_j ~ N(0, σ^2 [(X^T X)^{-1}]_{j,j})   and so   \hat{β}_j / \hat{σ}_j ~ N(0, 1)
where \hat{σ}_j = \sqrt{\hat{σ}^2 [(X^T X)^{-1}]_{j,j}}.
Association studies and regression Linear regression 42 / 104
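A sketch of this Wald test for a single SNP (hypothetical helper; uses scipy for the normal tail probability and plugs in the MLE sigma^2 = RSS/n from earlier):

import numpy as np
from scipy import stats

def wald_test_single_snp(x, y):
    # Wald test of H0: beta_1 = 0 in y = beta0 + beta1 * x + noise.
    n = len(y)
    Xt = np.column_stack([np.ones(n), x])
    XtX_inv = np.linalg.inv(Xt.T @ Xt)
    beta = XtX_inv @ Xt.T @ y
    sigma2 = np.sum((y - Xt @ beta) ** 2) / n     # MLE of the noise variance
    se = np.sqrt(sigma2 * XtX_inv[1, 1])          # standard error of beta_1
    z = beta[1] / se
    pval = 2 * stats.norm.sf(abs(z))              # two-sided p-value
    return beta[1], z, pval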

Linear regression After estimation, what? (Generalized) likelihood ratio test; remember lecture 2. \hat{β}_0 : MLE with β_j = 0.
Λ = L(\hat{β}_0) / L(\hat{β})
-2 \log Λ = 2 [\log L(\hat{β}) - \log L(\hat{β}_0)]
-2 \log Λ ~ χ^2_1
Association studies and regression Linear regression 42 / 104
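A matching sketch of the likelihood ratio test for a single SNP (hypothetical helper; sigma^2 is profiled out with its MLE RSS/n in both the null and the full model):

import numpy as np
from scipy import stats

def lrt_single_snp(x, y):
    # LRT of H0: beta_1 = 0; with sigma^2 profiled out, 2 log(L_full/L_null) = n log(RSS_null/RSS_full).
    n = len(y)
    rss_null = np.sum((y - y.mean()) ** 2)        # null model: intercept only
    Xt = np.column_stack([np.ones(n), x])
    beta = np.linalg.solve(Xt.T @ Xt, Xt.T @ y)
    rss_full = np.sum((y - Xt @ beta) ** 2)       # full model: intercept + SNP
    lrt_stat = n * np.log(rss_null / rss_full)
    pval = stats.chi2.sf(lrt_stat, df=1)
    return lrt_stat, pval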

Linear regression Model diagnostics Do the observed statistics match the model assumptions? Residuals: e_i = y_i - \hat{β}^T x_i. If the model assumptions hold, the residuals must be independent and normally distributed with equal variance. How do we check this? Association studies and regression Linear regression 43 / 104

Linear regression Model diagnostics Are the residuals normally distributed (or, more generally, distributed according to a known distribution)? Q-Q plot: plot empirical quantiles vs. theoretical quantiles. Points close to the diagonal imply the empirical distribution is close to the theoretical distribution. Association studies and regression Linear regression 43 / 104
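A minimal sketch of such a residual Q-Q plot (assumes scipy and matplotlib are available; residuals is the array of e_i values from the previous slide):

from scipy import stats
import matplotlib.pyplot as plt

def qq_plot(residuals):
    # Plot empirical quantiles of the residuals against theoretical normal quantiles.
    stats.probplot(residuals, dist="norm", plot=plt)
    plt.title("Q-Q plot of residuals")
    plt.show()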

Linear regression Model diagnostics Q-Q plot. [Figure: Q-Q plots of residuals against normal quantiles for normally distributed errors (left) and Student-t distributed errors (right).] Association studies and regression Linear regression 43 / 104