Generalized Linear Models (1/29/13)

STA613/CBB540: Statistical methods in computational biology
Lecturer: Barbara Engelhardt    Scribe: Yangxiaolu Cao

When working with discrete data, two commonly used probability distributions are the binomial distribution and the Poisson distribution. The binomial distribution is used when an event has only two possible outcomes (success, failure); the Poisson distribution describes the count of random events within a fixed interval of time or space with a known average rate. Generalized linear models (GLMs) generalize the Gaussian linear model in that the conditional distribution of the response variable can be any distribution in the exponential family. Logistic regression is a GLM with a binomially distributed response variable. Today we look at Poisson regression.

1 Poisson Regression

Let $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ be a set of paired data, where $y_i$ is a scalar and $x_i$ is a vector of length $p$. Let the parameter $\theta$ be a vector of length $p$. Then:

$$y_i \mid x_i, \theta \sim \mathrm{Poisson}(x_i^T \theta)$$

and

$$\Pr(y_i \mid \lambda) = \frac{\lambda^{y_i} e^{-\lambda}}{y_i!},$$

where the mean is $\mu = \lambda$ and the variance is $\sigma^2 = \lambda$.

Recall that the single-parameter exponential family is expressed as

$$P(x \mid \eta) = h(x) \exp\{\eta^T T(x) - A(\eta)\},$$

where $\eta$ is the natural parameter, $T(x)$ is the sufficient statistic, $A(\eta)$ is the log partition function, and $\mu$ is the mean parameter. We can write out the Poisson distribution in exponential family form by applying the $\exp(\log(\cdot))$ trick:

$$P(x \mid \eta) = \exp\left\{\log\left(\frac{\lambda^x e^{-\lambda}}{x!}\right)\right\} = \exp\{x \log \lambda - \lambda - \log x!\} = \frac{1}{x!}\exp\{x \log \lambda - \lambda\},$$

where $\eta = \log \lambda$, $T(x) = x$, $A(\eta) = \exp\{\eta\}$, and $\mu = \exp\{\eta\}$. The relationship between $\eta$ and $\mu$ is therefore $\mu = \exp\{\eta\}$ and $\eta = \log \mu$.

Let us choose to set $\eta = \theta^T x$. Since we have set the natural parameter in this way, the canonical response function for Poisson regression is $\exp$, which maps the natural parameter to the mean parameter, and the canonical link function is $\log$, which maps the mean parameter back to the natural (canonical) parameter.
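A quick numerical check of the exponential-family form above (an illustration, not part of the original notes): with $\eta = \log\lambda$, $T(x) = x$, $A(\eta) = e^{\eta}$, and $h(x) = 1/x!$, the expression $h(x)\exp\{\eta T(x) - A(\eta)\}$ reproduces the Poisson pmf. The values of $\lambda$ and $x$ below are hypothetical.

```python
# Verify numerically that the exponential-family form matches the Poisson pmf.
import numpy as np
from scipy.stats import poisson
from scipy.special import gammaln

lam = 3.5                                   # hypothetical rate parameter
eta = np.log(lam)                           # natural parameter eta = log(lambda)
x = np.arange(10)                           # support values to check

log_h = -gammaln(x + 1)                     # log h(x) = -log(x!)
log_p = log_h + eta * x - np.exp(eta)       # log[ h(x) exp{ eta T(x) - A(eta) } ], A(eta) = e^eta

print(np.allclose(np.exp(log_p), poisson.pmf(x, lam)))   # True: the two forms agree
```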

The distribution can be written in terms of $\theta$ and $x$:

$$P(y_i \mid x_i, \theta) = \frac{1}{y_i!} \exp\{y_i \theta^T x_i - e^{\theta^T x_i}\}.$$

Further, the mean of the distribution can be written as

$$\hat{y}_i = E(y_i \mid x_i, \theta) = e^{\theta^T x_i}, \qquad \hat{y}_i \geq 0.$$

If, as before, we have $n$ observations from the joint distribution, $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, then we can rewrite the probability of the data:

$$P(y_1, y_2, \ldots, y_n \mid x_1, x_2, \ldots, x_n, \theta) = \prod_{i=1}^n \frac{\exp\{y_i \theta^T x_i - e^{\theta^T x_i}\}}{y_i!}.$$

2 Estimating the parameter θ

In order to find the maximum likelihood estimate of $\theta$, we work with the log likelihood:

$$\ell(y \mid x, \theta) = \sum_{i=1}^n \left[ y_i \theta^T x_i - \exp\{\theta^T x_i\} - \log(y_i!) \right].$$

As before, we can take the partial derivative of the log likelihood with respect to $\theta$ in order to maximize it:

$$\frac{\partial \ell(y \mid x, \theta)}{\partial \theta} = \sum_{i=1}^n \left[ y_i x_i - x_i \exp\{\theta^T x_i\} \right] = \sum_{i=1}^n x_i \left( y_i - \exp\{\theta^T x_i\} \right) = \sum_{i=1}^n x_i (y_i - \hat{y}_i).$$

The form of this equation is identical to the corresponding term for logistic regression. As in logistic regression, notice that it is impossible to solve directly for $\theta$ in

$$\sum_{i=1}^n x_i \left( y_i - e^{\theta^T x_i} \right) = 0.$$

Instead of a closed-form solution to this equation, we can use a gradient method to estimate $\theta$. One efficient gradient method, described in the last lecture, is iteratively reweighted least squares (IRLS).
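Below is a minimal sketch of IRLS for Poisson regression with the canonical log link, following the gradient/weighted-least-squares idea referred to above. It is illustrative only; the function name irls_poisson, the simulated data, and the convergence settings are assumptions, not part of the notes.

```python
import numpy as np

def irls_poisson(X, y, n_iter=25, tol=1e-8):
    """Maximize the Poisson log likelihood over theta via IRLS (canonical log link)."""
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(n_iter):
        eta = X @ theta                 # natural parameter: eta_i = theta^T x_i
        mu = np.exp(eta)                # mean parameter: mu_i = exp(eta_i) = y_hat_i
        W = mu                          # IRLS weights; for the Poisson, Var(y_i) = mu_i
        z = eta + (y - mu) / mu         # working response
        XtW = X.T * W                   # X^T diag(W)
        theta_new = np.linalg.solve(XtW @ X, XtW @ z)   # weighted least squares update
        if np.max(np.abs(theta_new - theta)) < tol:
            theta = theta_new
            break
        theta = theta_new
    return theta

# Check on simulated (hypothetical) data: the estimate should be close to theta_true.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
theta_true = np.array([0.5, 1.2])
y = rng.poisson(np.exp(X @ theta_true))
print(irls_poisson(X, y))
```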

3 Paper: RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays [Marioni et al., 2008]

3.1 Example: RNA sequencing: differences in gene expression across cell types

From the paper, and recalling the previous lecture: suppose there are 31,840 genes, 1,487 of them are called significant for a given test, FDR = 0.1%, and FPR = 1%.

Q: How many false positives do we expect based on the FDR? A ≈ 1.4 (the FDR applies to the 1,487 significant calls).
Q: How many false positives do we expect based on the FPR? A ≈ 318 (the FPR applies to all 31,840 tests).

(A short numerical check of this arithmetic appears at the end of this subsection.)

Genes often have multiple exons, and different isoforms, which include different combinations of exons. Each isoform will include junction reads, i.e., RNA-sequencing reads that span exons, for the exon pairs present in that isoform.

Index samples by i (e.g., liver, kidney), genes by j, and sequencing lanes by k. Then:

1. perform RNA-sequencing on the sample, generating a large set of sequencing reads;
2. map the reads (32 base pairs each) to the underlying reference genome sequence.

The number of reads mapped to each exon then approximates the transcription level of that exon. Both of these steps introduce substantial bias. For example, if an exon has mutations, insertions, or deletions relative to the reference sequence, reads from that exon in that individual are less likely to map onto the reference.
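A quick numerical check of the false-positive arithmetic above (an illustration, not code from the paper): the FDR is a rate among the significant calls, while the FPR is a rate over all tests performed.

```python
n_genes = 31840       # total genes tested
n_sig = 1487          # genes called significant

fp_from_fdr = 0.001 * n_sig     # FDR = 0.1% of the significant calls: 1.487 (~1.4 in the notes)
fp_from_fpr = 0.01 * n_genes    # FPR = 1% of all tests: 318.4

print(fp_from_fdr, fp_from_fpr)
```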

3.2 Data Processing

If we model the read counts with a linear Gaussian model, we may lose information about lowly expressed genes. For example, if we transform the distribution of read counts to a standard normal, the genes in the bottom half of read counts are all mapped to the lower half of the Gaussian distribution, obscuring the differences among them.
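To make this concern concrete, here is a small sketch of a rank-based transformation of read counts to a standard normal (the notes do not specify the exact transform, so this particular rank-based version is an assumption). Genes with low, tied counts (e.g., all the zeros) collapse onto a single Gaussian score, so information distinguishing lowly expressed genes is lost.

```python
import numpy as np
from scipy.stats import rankdata, norm

rng = np.random.default_rng(0)
counts = rng.poisson(lam=0.5, size=1000)        # mostly 0s and 1s: lowly expressed genes
ranks = rankdata(counts, method="average")      # tied counts share an average rank
z = norm.ppf((ranks - 0.5) / len(counts))       # map ranks onto a standard normal

print("z-score shared by every zero count:", np.unique(z[counts == 0]))
print("z-score shared by every count of 1:", np.unique(z[counts == 1]))
```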

Instead, we can use a Poisson regression model. Define:

$X_{ijk}$: number of reads mapped to gene $j$ in lane $k$ of sample $i$
$C_{ik}$: total number of reads in lane $k$ of sample $i$
$\lambda_{ijk}$: rate of transcription of gene $j$ in lane $k$ of sample $i$

Read counts are taken from a particular lane, for a particular gene, from a particular sample. The paper further constrains $\lambda_{ijk}$ to be a rate parameter, so that $\sum_{j=1}^{G} \lambda_{ijk} = 1$ and $0 \leq \lambda_{ijk} \leq 1$. Set

$$X_{ijk} \mid C_{ik}, \lambda_{ijk} \sim \mathrm{Poisson}(C_{ik} \lambda_{ijk}).$$

The mean of the distribution can be written as

$$E[X_{ijk} \mid C_{ik}, \lambda_{ijk}] = C_{ik} \lambda_{ijk}.$$

Note that this definition of the mean parameter is not the same as in the canonical Poisson regression model: there the mean equals the rate ($\mu = \lambda$), whereas here $\mu = C_{ik}\lambda_{ijk}$.

Returning to the previous question, do we see a lane effect (is there differential expression of genes between two lanes in the same sample and the same cell type)? Consider the log ratio of rates between two lanes $k$ and $k'$,

$$\log\left(\frac{\lambda_{ijk}}{\lambda_{ijk'}}\right).$$
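A small simulation sketch of this read-count model (an illustration, not the paper's code): the Dirichlet draw below is just a convenient, assumed way to generate per-gene rates that are nonnegative and sum to one, and $X_{ijk}/C_{ik}$ is the natural estimate of the rate.

```python
import numpy as np

rng = np.random.default_rng(1)
G = 1000                                  # number of genes
C_ik = 3_000_000                          # total reads in lane k of sample i
lam = rng.dirichlet(np.ones(G) * 0.2)     # rates lambda_ijk >= 0 summing to 1 across genes
X = rng.poisson(C_ik * lam)               # X_ijk ~ Poisson(C_ik * lambda_ijk)

lam_hat = X / C_ik                        # estimated transcription rate per gene
print(lam.sum(), lam_hat.sum())           # both approximately 1
```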

To see whether there is a lane effect, we use the classical hypothesis testing framework:

$H_0$: null hypothesis, there is no lane effect ($\lambda_{ijk} = \lambda_{ijk'}$)
$H_A$: alternative hypothesis, there is a lane effect ($\lambda_{ijk} \neq \lambda_{ijk'}$).

In the paper, only a small number of genes showed differential expression across lanes, so the authors conclude that lane effects are minimal.

3.3 Likelihood Ratio Test

In the same framework, we can also ask whether we see differential expression within genes across tissue types. Let us specify the null and alternative hypotheses more formally:

$H_0$: $\lambda_{ijk} = \hat{\lambda}_j$
$H_A$: $\lambda_{ijk} \in \{\hat{\lambda}_j^A, \hat{\lambda}_j^B\}$ (a separate rate for each group)

Here $A$ and $B$ index the two groups being compared (lanes or tissues); for the tissue comparison, let $A$ = liver and $B$ = kidney. To simplify the expressions, take the log of the probability of the data under each hypothesis, summing over the observations for the gene being tested:

$$\log P(D \mid H_0) = \sum \log \Pr(X_{ijk} \mid C_{ik}, \hat{\lambda}_j)$$

$$\log P(D \mid H_A) = \sum \log \Pr(X_{ijk} \mid C_{ik}, \hat{\lambda}_j^A, \hat{\lambda}_j^B)$$

The test statistic (TS) is

$$D_{TS} = -2 \log\left(\frac{P(D \mid H_0)}{P(D \mid H_A)}\right) = -2 \log P(D \mid H_0) + 2 \log P(D \mid H_A).$$
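A minimal sketch of this likelihood ratio test for a single gene (illustrative, not the paper's code): under $H_0$ one shared rate $\hat{\lambda}_j$ is fit from all lanes, while under $H_A$ separate rates $\hat{\lambda}_j^A$ and $\hat{\lambda}_j^B$ are fit for the two tissues. The counts and lane totals below are hypothetical.

```python
import numpy as np
from scipy.stats import poisson

def lrt_statistic(x_A, C_A, x_B, C_B):
    """Return D_TS = -2 log( P(D | H0) / P(D | HA) ) for a single gene."""
    lam_0 = (x_A.sum() + x_B.sum()) / (C_A.sum() + C_B.sum())   # shared rate under H0
    lam_A = x_A.sum() / C_A.sum()                               # tissue A rate under HA
    lam_B = x_B.sum() / C_B.sum()                               # tissue B rate under HA
    ll_0 = poisson.logpmf(x_A, C_A * lam_0).sum() + poisson.logpmf(x_B, C_B * lam_0).sum()
    ll_A = poisson.logpmf(x_A, C_A * lam_A).sum() + poisson.logpmf(x_B, C_B * lam_B).sum()
    return -2 * (ll_0 - ll_A)

# Hypothetical counts for one gene across three lanes per tissue.
x_liver,  C_liver  = np.array([12, 15, 11]), np.array([3.1e6, 3.3e6, 2.9e6])
x_kidney, C_kidney = np.array([30, 28, 35]), np.array([3.0e6, 3.2e6, 3.4e6])
print(lrt_statistic(x_liver, C_liver, x_kidney, C_kidney))
```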

$D_{TS} \sim \chi^2_\nu$: this test statistic is distributed according to a chi-squared distribution with $\nu = 1$ degree of freedom (in this example, comparing two different cell types). The significance threshold $t(x)$ to control the FDR at a given value was calculated using the method of Storey and Tibshirani (2003), through q-values (q-values can be computed in R).
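Continuing the sketch above, the p-value for one gene would be obtained by comparing $D_{TS}$ to a $\chi^2_1$ distribution; the multiple-testing step (q-values) is not shown here. The value of $D_{TS}$ below is hypothetical.

```python
from scipy.stats import chi2

D_TS = 22.0                      # hypothetical likelihood-ratio statistic for one gene
p_value = chi2.sf(D_TS, df=1)    # survival function: P(chi^2 with 1 d.o.f. > D_TS)
print(p_value)
```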