General Regression Model


Scott S. Emerson, M.D., Ph.D.
Department of Biostatistics, University of Washington, Seattle, WA 98195, USA
January 5, 2015

Abstract

Regression analysis can be viewed as an extension of two-sample statistical analysis models to an infinite-sample setting in which the groups being compared may be defined by each combination of values for multiple variables, including continuous variables. The most commonly used regression models (linear, logistic, Poisson, proportional hazards, and parametric accelerated failure time models) differ primarily in the functional (summary measure) of the distribution that is used as the basis of inference. In this document I describe the general notation shared by all of these models, and then I highlight a few of the issues unique to the individual methods. In all cases I restrict attention to the use of univariate (i.e., one response variable) regression models to investigate associations among multiple variables.

1 Notation for the Statistical Analysis Model

The general regression model involves:

1. Random variable Y is the response variable. We will estimate some summary measure of the distribution of Y within groups.

2. Random variable X is the predictor of interest (POI). We will divide the population into subpopulations based on this variable. Each individual in a given subpopulation will have the same value of X as all other individuals in that subpopulation.

3. Random variables W1, W2, ..., Wp are additional covariates that are used to adjust for confounding or to provide additional precision for inference. We will also use these variables to divide the population into subpopulations. Each individual in a given subpopulation will have the same value of each of the Wj's as all other individuals in that subpopulation.

4. θ is some summary measure of the distribution of Y. (The technical statistical term is functional: something that can be computed from the probability distribution, though in the most general case it may itself be a function.) The choice of θ will have to depend on the type of variable (e.g., binary, unordered categorical, ordered categorical, continuous). Common choices for continuous random variables include

(a) Mean
(b) Geometric mean (provided Y is always positive)
(c) Median (or, less commonly, some other quantile such as the 25th or 75th percentile)
(d) Probability that Y > c for some specified, scientifically relevant value of c
(e) Odds that Y > c for some specified, scientifically relevant value of c

(f) Hazard function (the instantaneous rate of failure at a specified time, conditional on being at risk of failure at that time)

5. β is a (p+2)-dimensional vector of regression parameters (or, borrowing the algebraic term, regression coefficients).

6. η is a linear predictor that describes the combined effect of all regression predictors (X, W1, W2, ..., Wp) on θ. For specified values of those predictors, the exact form of η is taken to be

η = β0 + β1 X + β2 W1 + β3 W2 + ... + β(p+1) Wp.

7. g(·) is some link function that links the linear predictor to θ. That is, we have the regression model

g(θ) = η = β0 + β1 X + β2 W1 + β3 W2 + ... + β(p+1) Wp.

Although more varied choices of link function are sometimes considered, our typical use of the link function is either

(a) to describe an additive model using the identity link (g(θ) = θ), or
(b) to describe a multiplicative model using the log link (g(θ) = log(θ)).

Interpretation of the regression parameters depends on the link function. With an identity link:

1. β0 is the value of θ within the subpopulation having X = 0, W1 = 0, W2 = 0, ..., Wp = 0 (i.e., all predictors equal to zero).

2. β1 is the difference in θ between two subpopulations that differ by 1 unit in their value of X but agree in their values of W1, W2, ..., Wp. Depending on how the variables W1, W2, ..., Wp are defined, it may not be scientifically possible to have groups differing in their value of X without differing in the value of some Wj; in that case we will have to find other interpretations of β1.

3. βj for j > 1 has an interpretation analogous to that for β1: we consider differences in θ across groups that differ by one unit in the variable corresponding to βj but agree in their values for all other variables.

With a log link:

1. e^β0 is the value of θ within the subpopulation having X = 0, W1 = 0, W2 = 0, ..., Wp = 0 (i.e., all predictors equal to zero).

2. e^β1 is the ratio of θ between two subpopulations that differ by 1 unit in their value of X but agree in their values of W1, W2, ..., Wp.

Depending on how the variables W1, W2, ..., Wp are defined, it may not be scientifically possible to have groups differing in their value of X without differing in the value of some Wj; in that case we will have to find other interpretations of e^β1.

3. e^βj for j > 1 has an interpretation analogous to that for e^β1: we consider ratios of θ across groups that differ by one unit in the variable corresponding to βj but agree in their values for all other variables.

Note that in the above interpretations we were pretending that the differences in θ (for the identity link) or the ratios of θ (for the log link) would be the same for every two subpopulations that differ by 1 unit in the corresponding variable. If this linearity assumption does not hold, we instead interpret the slope parameters as some sort of average difference or average ratio across subpopulations differing by 1 unit in the corresponding variable.

1.1 Linear Regression

The linear regression model uses our general regression model with

1. θ the mean of the distribution of Y, and
2. g(·) the identity link (g(θ) = θ).

The regression parameters are then interpreted as follows:

1. β0 is the mean of Y within the subpopulation having X = 0, W1 = 0, W2 = 0, ..., Wp = 0 (i.e., all predictors equal to zero).

2. β1 is the difference in the mean of Y between two subpopulations that differ by 1 unit in their value of X but agree in their values of W1, W2, ..., Wp. Hence, for arbitrary x, w1, ..., wp,

β1 = E[Y | X = x + 1, W1 = w1, ..., Wp = wp] − E[Y | X = x, W1 = w1, ..., Wp = wp].

Estimation of the regression parameters uses least squares. This corresponds to maximum likelihood estimation when the distribution of Y is Gaussian within subpopulations. Optimality properties of linear regression include the following.

In the presence of independent observations, homoscedasticity (variances of Y within subpopulations are equal to each other), and the correct linear model for the means, the ordinary least squares estimates (OLSE) are best linear unbiased estimates (BLUE). This means that among all unbiased estimators that use linear combinations of the observed values of Y, the OLSE have the greatest precision. (In the presence of heteroscedasticity, we should instead use weighted least squares estimates (WLSE).)

In the presence of independent observations, homoscedasticity, the correct linear model for the means, and a Gaussian distribution of Y within subpopulations, the OLSE are the MLE and are fully efficient estimators of the means. Because we already know the OLSE are BLUE, this added fact just tells us that it is best to use linear combinations of the observed values of Y: nonlinear combinations of Y will not be better estimates.

The probability distribution of the OLSE can be shown to be asymptotically normal under reasonable assumptions on the distribution of the predictors. That is, as the sample size gets larger, the probability distribution of the OLSE gets closer and closer to a Gaussian distribution. If Y is normally distributed within subpopulations, the OLSE are exactly normally distributed no matter how small the sample size (so long as there is enough data to compute the estimates).

1.2 Linear Regression on log-Transformed Response Variables

This regression model is relevant only when Y is always positive. In this case, the linear regression model uses our general regression model with

1. θ the geometric mean of the distribution of Y.

The regression parameters are then interpreted as follows:

1. e^β0 is the geometric mean of Y within the subpopulation having X = 0, W1 = 0, W2 = 0, ..., Wp = 0 (i.e., all predictors equal to zero).

2. e^β1 is the ratio of geometric means of Y between two subpopulations that differ by 1 unit in their value of X but agree in their values of W1, W2, ..., Wp. Hence, for arbitrary x, w1, ..., wp,

e^β1 = GM[Y | X = x + 1, W1 = w1, ..., Wp = wp] / GM[Y | X = x, W1 = w1, ..., Wp = wp].

Estimation of the regression parameters uses least squares on log(Y). This corresponds to maximum likelihood estimation when the distribution of Y is lognormal (so the distribution of log(Y) is Gaussian) within subpopulations.
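The least squares mechanics of the two models above can be sketched in a few lines. The following is a minimal pure-Python sketch on hypothetical toy data, with a single predictor X and no adjustment covariates; it is meant only to make the interpretations concrete, not to show how one would fit these models in practice.

```python
import math

# Hypothetical toy data: positive response y and a single predictor x
# (no adjustment covariates W_j, so the linear predictor is b0 + b1*x).
x = [0, 1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]


def ols(xs, ys):
    """Ordinary least squares for one predictor (closed form)."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(xs, ys))
    sxx = sum((xi - xbar) ** 2 for xi in xs)
    b1 = sxy / sxx           # slope
    b0 = ybar - b1 * xbar    # intercept
    return b0, b1


# Section 1.1: b1 is the difference in the mean of Y per 1-unit difference in X.
b0, b1 = ols(x, y)

# Section 1.2: the same fit applied to log(Y); exp(slope) is then the ratio of
# geometric means of Y per 1-unit difference in X.
g0, g1 = ols(x, [math.log(yi) for yi in y])
gm_ratio = math.exp(g1)

print(f"mean difference per unit X: {b1:.3f}")
print(f"geometric mean ratio per unit X: {gm_ratio:.3f}")
```

Note that exp(g0) estimates the geometric mean of Y at X = 0, mirroring the e^β0 interpretation above.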

Optimality properties and probability distributions for the regression parameters of linear regression on log-transformed response variables derive directly from those of linear regression.

1.3 Logistic Regression

This regression model is relevant only when Y is a binary 0-1 variable. In this case, the logistic regression model uses our general regression model with

1. θ the odds that Y = 1.

The regression parameters are then interpreted as follows:

1. e^β0 is the odds of Y = 1 within the subpopulation having X = 0, W1 = 0, W2 = 0, ..., Wp = 0 (i.e., all predictors equal to zero).

2. e^β1 is the ratio of the odds that Y = 1 between two subpopulations that differ by 1 unit in their value of X but agree in their values of W1, W2, ..., Wp. Hence, for arbitrary x, w1, ..., wp,

e^β1 = odds[Y = 1 | X = x + 1, W1 = w1, ..., Wp = wp] / odds[Y = 1 | X = x, W1 = w1, ..., Wp = wp]
     = ( Pr[Y = 1 | X = x + 1, W1 = w1, ..., Wp = wp] Pr[Y = 0 | X = x, W1 = w1, ..., Wp = wp] ) / ( Pr[Y = 0 | X = x + 1, W1 = w1, ..., Wp = wp] Pr[Y = 1 | X = x, W1 = w1, ..., Wp = wp] ).

Estimation of the regression parameters uses maximum likelihood estimation (MLE) based on the Bernoulli (or binomial) distribution for Y. As such, the variability of the regression parameter estimates is classically based on the mean-variance relationship dictated by the Bernoulli distribution: if E[Y] = p, then Var(Y) = p(1 − p). If the observations of Y are independent, and if the log odds across subpopulations adhere to the linear model, the MLE are asymptotically efficient.

1.4 Poisson Regression

This regression model is relevant only when Y is nonnegative (it is classically defined for count data that have Poisson distributions within subpopulations). In this case, the Poisson regression model uses our general regression model with

1. θ the mean value of Y (if the data truly are Poisson count data, that mean is also an event rate).

The regression parameters are then interpreted as follows:

1. e^β0 is the mean (event rate) of Y within the subpopulation having X = 0, W1 = 0, W2 = 0, ..., Wp = 0 (i.e., all predictors equal to zero).

2. e^β1 is the ratio of the mean value of Y between two subpopulations that differ by 1 unit in their value of X but agree in their values of W1, W2, ..., Wp. Hence, for arbitrary x, w1, ..., wp,

e^β1 = E[Y | X = x + 1, W1 = w1, ..., Wp = wp] / E[Y | X = x, W1 = w1, ..., Wp = wp].

Estimation of the regression parameters uses maximum likelihood estimation (MLE) based on the Poisson distribution for Y. As such, the variability of the regression parameter estimates is classically based on the mean-variance relationship dictated by the Poisson distribution: if E[Y] = λ, then Var(Y) = λ. When used with the robust Huber-White sandwich estimator of the standard errors, this model can be viewed as a regression model for ratios of means across groups. If the observations of Y are independent Poisson random variables, and if the log means across subpopulations adhere to the linear model, the MLE are asymptotically efficient.

1.5 Proportional Hazards Regression

This regression model is classically relevant only when Y is nonnegative (it is classically defined for time-to-event data that exhibit proportional hazards across subpopulations). In this case, the proportional hazards regression model uses our general regression model with

1. θ the hazard function (instantaneous rate of failure) for Y. (This is a function that may vary over time.)

The regression parameters are then interpreted as follows:

1. The intercept in this model is a baseline hazard function λY(t): the hazard of Y within the subpopulation having X = 0, W1 = 0, W2 = 0, ..., Wp = 0 (i.e., all predictors equal to zero). Standard statistical software does not automatically return this estimated function, because the slope parameters can be estimated without ever estimating the baseline hazard function.

Most often we are not really interested in the baseline hazard function.

2. e^β1 is the ratio of the hazard function of Y between two subpopulations that differ by 1 unit in their value of X but agree in their values of W1, W2, ..., Wp. Under the proportional hazards assumption, the same ratio is obtained at every point in time. Hence, for arbitrary x, w1, ..., wp,

e^β1 = λY(t | X = x + 1, W1 = w1, ..., Wp = wp) / λY(t | X = x, W1 = w1, ..., Wp = wp).

Estimation of the regression parameters uses maximum partial likelihood estimation based on the proportional hazards assumption for Y. This is somewhat related to logistic regression, and as such, the variability of the regression parameter estimates is classically based on the mean-variance relationship dictated in part by the Bernoulli distribution.

Version: January 5, 2015
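For the likelihood-based models above, the MLE has no closed form and is typically computed by Newton-Raphson iteration on the log likelihood. The sketch below illustrates this for the logistic model of Section 1.3, on hypothetical toy data with a single predictor and no adjustment covariates; the data and the fixed 25 iterations are purely illustrative, not production code.

```python
import math

# Hypothetical toy data: binary response y and a single predictor x.
x = [0, 0, 1, 1, 2, 2, 3, 3]
y = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = 0.0, 0.0
for _ in range(25):
    # Fitted probabilities Pr[Y = 1] under the current coefficients.
    p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
    # Score vector (gradient of the Bernoulli log likelihood) ...
    u0 = sum(yi - pi for yi, pi in zip(y, p))
    u1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
    # ... and 2x2 information matrix, using Var(Y) = p(1 - p).
    w = [pi * (1.0 - pi) for pi in p]
    i00 = sum(w)
    i01 = sum(wi * xi for wi, xi in zip(w, x))
    i11 = sum(wi * xi * xi for wi, xi in zip(w, x))
    # Newton step: beta <- beta + I^{-1} U (2x2 inverse written out).
    det = i00 * i11 - i01 * i01
    b0 += (i11 * u0 - i01 * u1) / det
    b1 += (i00 * u1 - i01 * u0) / det

odds_ratio = math.exp(b1)  # odds ratio of Y = 1 per 1-unit difference in X
print(f"estimated odds ratio per unit X: {odds_ratio:.2f}")
```

The Poisson fit of Section 1.4 follows the same pattern, with the fitted probability p replaced by the fitted mean exp(b0 + b1*x) serving as both the mean and the weight.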