Statistics 262: Intermediate Biostatistics Regression & Survival Analysis

Statistics 262: Intermediate Biostatistics Regression & Survival Analysis Jonathan Taylor & Kristin Cobb Statistics 262: Intermediate Biostatistics p.1/??

Introduction
This course is an applied course, and we will look at the following types of data (to be explained, of course!):
- Linear Regression Models
- ANOVA
- Mixed Effects
- Generalized Linear Models
- Survival Models: (Non-, Semi-)Parametric

Logistics: People
- Instructor: Jonathan Taylor (jonathan.taylor@stanford.edu)
- Instructor: Kristin Cobb (kcobb@stanford.edu)
- TA: Pei Wang (wp57@stanford.edu)
- TA: Eric Bair (ebair@stat.stanford.edu)

Logistics: Computer labs
The following times are reserved for us (all the way over at the Fleischmann computer lab...):
- April 20th, 11:30am-1:00pm
- April 27th, Noon-1:00pm
- May 11th, 10:30am-12:30pm
- May 18th, 10:30am-12:30pm

Logistics: Evaluation
This is our proposed marking scheme:
- 6 assignments: 60%.
- Take-home final (40%) OR project (40%).

Logistics: Computing environment
Although I prefer R, I will use SAS for examples in class. Students can submit assignments in either R or SAS.

What is a linear regression model?
Dumb example: Suppose you wanted to test the following hypothesis H: Tall people tend to marry tall people. How could we do this?

Husband & Wife height
Collect height information from married couples in the form of $(M_i, F_i)$, $1 \le i \le n$, for some number $n$ of samples. Plot the pairs $(M_i, F_i)$. If there is a relationship between the $M$'s and the $F$'s, you might conclude that H is true. What is the relationship? A linear regression model says that
$$F_i = \beta_0 + \beta_1 M_i + \varepsilon_i$$
where $\varepsilon_i$ is random error.

Raw Data
[Figure: scatter plot of F (vertical axis) against M (horizontal axis).]

Fitting a model
The data seem to indicate that a linear relationship is a good description. How good is the fit? Can we quantify the quality of fit as in the two-sample t-test? This is the basis of simple linear regression.

Least Squares Fit
[Figure: least squares fit, F (vertical axis) against M (horizontal axis).]

More variables: multiple linear regression
Usually, we care about more than one predictor at a time: multiple linear regression. Example: predicting the birth weight of babies given the mother's habits during pregnancy, i.e. smoking / alcohol / exercise / diet. We can estimate effects for many variables: are some of them less important than others? With many variables there are many possible models: which is the best?

Analysis of Variance: ANOVA
Suppose you work for a pharmaceutical company developing a drug to lower cholesterol and you want to test the hypothesis: does the drug affect cholesterol, i.e. does it decrease cholesterol on average? To test its effect, you might consider a case / control experiment with a placebo on a genetically pure strain of mice. Observe $(D_i, P_i)$, $1 \le i \le n$ (there could be different numbers of cases / placebos).

ANOVA: more general
Question: is $\mu_D$ (average cholesterol in cases) different from $\mu_P$ (average cholesterol in placebos)? This example is just a two-sample t-test, but what if you had different dose levels / more than one drug / more than one strain? Other issues: in genetically pure mice, the average effect is well-defined. What about in human studies? How do we generalize the results of the study to the entire population? (random, mixed vs. fixed effects models).

Generalized Linear Models
In HRP/STATS 261 you have seen some examples of a generalized linear model:
- logistic regression: trying to predict probabilities based on covariates;
- log-linear models: modelling count data / contingency tables;
- multiple linear regression models are also "generalized linear models";
- these models can be put into a general class: GLMs.

Poisson regression
In a given HIV+ population, we might obtain a viral genotype of patients at two different time points. As HIV mutates in response to drug treatment, the virus undergoes mutations in between the two visits. How quickly do mutations develop? Are there some covariates that influence this rate of mutation? How do we model this?

Raw Data
[Figure: scatter plot of N (vertical axis) against T (horizontal axis).]

Poisson regression model
Basic model (no other covariates):
$$N_i \sim \text{Poisson}\left(e^{\beta_0 + \beta_1 T_i}\right), \quad 1 \le i \le n$$
where $N_i$ is the number of mutations for subject $i$, and $T_i$ is the time between visits. This implies that
$$\log(E(N_i)) = \beta_0 + \beta_1 T_i, \qquad \text{Var}(N_i) = e^{\beta_0 + \beta_1 T_i}.$$
This is a log link, with variance function $V(x) = x$.
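One way to see what fitting this model involves is to maximize the Poisson log-likelihood directly. The following is a minimal sketch in plain Python (not the SAS used in class), assuming a Newton-Raphson iteration on the two parameters; the data in `t_obs` and `n_obs` are made up for illustration and are not the HIV data from the slides.

```python
import math

def fit_poisson(t, n, max_iter=100, tol=1e-10):
    """Newton-Raphson for N_i ~ Poisson(exp(b0 + b1 * T_i))."""
    b0 = math.log(sum(n) / len(n))  # start from the intercept-only fit
    b1 = 0.0
    for _ in range(max_iter):
        mu = [math.exp(b0 + b1 * ti) for ti in t]
        # Score vector: gradient of the log-likelihood
        u0 = sum(ni - mi for ni, mi in zip(n, mu))
        u1 = sum((ni - mi) * ti for ni, mi, ti in zip(n, mu, t))
        # 2x2 information matrix [[a, b], [b, c]], using Var(N_i) = mu_i
        a = sum(mu)
        b = sum(mi * ti for mi, ti in zip(mu, t))
        c = sum(mi * ti * ti for mi, ti in zip(mu, t))
        det = a * c - b * b
        # Newton step: solve the 2x2 system explicitly
        d0 = (c * u0 - b * u1) / det
        d1 = (a * u1 - b * u0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

# Hypothetical data: time between visits and number of mutations
t_obs = [5, 10, 15, 20, 25, 30, 35, 40]
n_obs = [0, 1, 1, 2, 2, 4, 5, 7]
b0_hat, b1_hat = fit_poisson(t_obs, n_obs)
fitted = [math.exp(b0_hat + b1_hat * ti) for ti in t_obs]
```

At the maximum, the fitted means satisfy the score equations, so the fitted counts sum to the observed counts. With real data and more covariates one would use a packaged GLM routine instead of this hand-rolled two-parameter solver.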

Poisson regression fit
[Figure: Poisson regression fit, N (vertical axis) against T (horizontal axis).]

Survival analysis
The main topic of this course is survival data. Survival data refers to a fairly broad class of data: basically, the objects of study are survival times of individuals, and the observations have three components:
- An observed time, $T_i$, $1 \le i \le n$.
- Observed covariates $X_{ij}$, $1 \le i \le n$, $1 \le j \le p$.
- A censoring variable $\delta_i$, $1 \le i \le n$.

Censored data
What is $\delta$? In a given study, not everyone fails: eventually funding dries up and data has to be published! Alternatively, people leave the area, or are otherwise lost to follow-up. However, knowing that they left the study alive has some information in it. With much at stake, we do not want to throw away this data. The difficulty comes in incorporating $\delta$ into the likelihood.

Kidney infection data
Given two different techniques of catheter placement in patients (surgically, percutaneously), physicians want to decide if one technique is better at holding off infections. Observations: time on study, presence of infection, catheter placement. Censoring can bias the usual estimates of the CDF; the Kaplan-Meier estimate accounts for censoring and is consistent under independent censoring. Can we determine whether the survival experiences are different in the two groups? (log-rank test)
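The Kaplan-Meier estimate mentioned above multiplies, over the observed failure times, the fraction of subjects at risk who did not fail. Below is a minimal sketch in plain Python, assuming the convention that `events[i] == 1` means a failure was observed and `0` means the time was censored; the data are hypothetical, not the kidney data.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct observed failure time t."""
    failure_times = sorted({t for t, d in zip(times, events) if d == 1})
    surv = 1.0
    curve = []
    for t in failure_times:
        # Subjects still under observation just before time t
        at_risk = sum(1 for ti in times if ti >= t)
        # Failures observed exactly at time t
        failed = sum(1 for ti, di in zip(times, events) if ti == t and di == 1)
        surv *= 1.0 - failed / at_risk
        curve.append((t, surv))
    return curve

# Hypothetical data: 6 subjects, two of them censored (event = 0)
times = [1, 2, 2, 3, 4, 5]
events = [1, 0, 1, 1, 0, 1]
km_curve = kaplan_meier(times, events)
```

Censored subjects never trigger a drop in the curve, but they still count in the at-risk set until they leave, which is how the information in "left the study alive" is used rather than thrown away.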

Kidney data
[Figure: P(T > t) (vertical axis) against t (horizontal axis).]

Other survival models
We could model time as coming from a parametric family of distributions: Gumbel, Weibull, etc. Alternatively, we can fit a semi-parametric model like Cox's proportional hazards model (which is actually not appropriate in this case...).

Moving on
End of introduction. After break: regression!

Simple linear regression
- Specifying the model.
- Fitting the model: least squares.
- Inference about parameters.
- Diagnostics.

Specifying the model
Returning to the husband and wife data:
$$F_i = \beta_0 + \beta_1 M_i + \varepsilon_i$$
- Assumption: $E(\varepsilon_i \mid M_i) = 0$ (often: $\varepsilon_i \mid M_i \sim N(0, \sigma^2)$).
- Regression equation: $E(F_i \mid M_i) = \beta_0 + \beta_1 M_i$.
- Errors are independent across couples.
- This fully specifies the joint distribution of $(F_1, \dots, F_n)$ given $(M_1, \dots, M_n)$.

Least Squares
Least squares regression chooses the line that minimizes
$$SSE(\beta_0, \beta_1) = \sum_{i=1}^{n} (F_i - \beta_0 - \beta_1 M_i)^2.$$
The (conditional) mean can be estimated for any given height $M$ as
$$\hat{F}(M) = \hat{\beta}_0 + \hat{\beta}_1 M,$$
where $(\hat{\beta}_0, \hat{\beta}_1)$ are the minimizers of SSE.
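For a single predictor the minimizers of SSE have a well-known closed form: the slope is the ratio of the centered cross-products to the centered sum of squares, and the intercept makes the line pass through the point of means. A minimal sketch in plain Python (not the SAS used in class), with made-up heights in `husband` and `wife`:

```python
def fit_line(x, y):
    """Return (b0, b1) minimizing sum((y_i - b0 - b1 * x_i)^2)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def sse(x, y, b0, b1):
    """Sum of squared errors of the line b0 + b1 * x."""
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))

# Hypothetical heights of five couples (M_i, F_i)
husband = [165.0, 170.0, 175.0, 180.0, 185.0]
wife = [155.0, 158.0, 166.0, 167.0, 174.0]

b0_hat, b1_hat = fit_line(husband, wife)
best_sse = sse(husband, wife, b0_hat, b1_hat)
```

Any other line, for example one with a slightly perturbed slope, gives a larger SSE on the same data, which is a quick sanity check on the formulas.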

Estimating $\sigma^2$
Strength of association will depend on variability in the data, as in the two-sample t-test. A natural estimate of $\sigma^2$ is
$$\hat{\sigma}^2 = \frac{1}{n-2} \sum_{i=1}^{n} (F_i - \hat{F}(M_i))^2.$$
Under the normality assumption,
$$\hat{\sigma}^2 \sim \sigma^2 \cdot \frac{\chi^2_{n-2}}{n-2}.$$

SAS code to fit basic model
/* Library where the imported data set is stored */
LIBNAME DATADIR 'C:\sas';
/* Read the CSV file into DATADIR.height */
PROC IMPORT OUT=DATADIR.height
    DATAFILE='C:\heights.csv'
    DBMS=CSV REPLACE;
RUN;
/* Regress wife's height on husband's height */
PROC REG DATA=DATADIR.height;
    MODEL WIFE=HUSBAND;
RUN;

Inference: CIs
Under the normality assumption, any linear combination of $(\hat{\beta}_0, \hat{\beta}_1)$ is also normally distributed. This can be used to construct a confidence interval for the predicted mean:
$$\hat{F}(M) \pm t_{n-2,\,1-\alpha/2} \cdot \widehat{SD}(\hat{F}(M))$$
as well as an interval for a new observation:
$$\hat{F}(M) \pm t_{n-2,\,1-\alpha/2} \cdot \sqrt{\widehat{SD}(\hat{F}(M))^2 + \hat{\sigma}^2}$$
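To make the two intervals concrete, here is a small sketch in plain Python. It assumes the standard simple-regression formula $\widehat{SD}(\hat{F}(M))^2 = \hat{\sigma}^2 \left(1/n + (M - \bar{M})^2 / \sum_i (M_i - \bar{M})^2\right)$, which the slides do not spell out, and it takes the t quantile as an input (here 3.182, the 0.975 quantile with 3 degrees of freedom, read from a table). The heights are made up.

```python
import math

def fit_summary(x, y):
    """Least squares fit plus the quantities needed for intervals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    resid_ss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    sigma2 = resid_ss / (n - 2)  # estimate of sigma^2
    return b0, b1, sigma2, x_bar, sxx

def intervals(x, y, m_new, t_crit):
    """Predicted mean at m_new, CI for the mean, interval for a new observation."""
    b0, b1, sigma2, x_bar, sxx = fit_summary(x, y)
    n = len(x)
    pred = b0 + b1 * m_new
    sd_mean = math.sqrt(sigma2 * (1.0 / n + (m_new - x_bar) ** 2 / sxx))
    sd_new = math.sqrt(sd_mean ** 2 + sigma2)
    ci_mean = (pred - t_crit * sd_mean, pred + t_crit * sd_mean)
    pi_new = (pred - t_crit * sd_new, pred + t_crit * sd_new)
    return pred, ci_mean, pi_new

# Hypothetical heights of five couples; n - 2 = 3 degrees of freedom
husband = [165.0, 170.0, 175.0, 180.0, 185.0]
wife = [155.0, 158.0, 166.0, 167.0, 174.0]
pred, ci_mean, pi_new = intervals(husband, wife, 175.0, 3.182)
```

The interval for a new observation is always wider than the one for the mean, because it adds the variance of the individual error to the uncertainty in the fitted line.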

SAS: predicted mean, CIs
PROC REG DATA=DATADIR.height;
    MODEL WIFE=HUSBAND / P CLM CLI;
RUN;
P refers to the predicted mean. CLM refers to the confidence limits for the mean. CLI refers to the limits for a new individual.

Inference: Hypothesis tests
We can test our hypothesis formally in terms of the coefficient $\beta_1$. If the regression function is truly linear, then $\beta_1 = 0$ means the height of the husband has no effect on the height of the wife. Under $H_0: \beta_1 = 0$,
$$\frac{\hat{\beta}_1}{\widehat{SD}(\hat{\beta}_1)} \sim t_{n-2}.$$
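The test statistic can be computed by hand once the fit is available. The sketch below, in plain Python, assumes the standard formula $\widehat{SD}(\hat{\beta}_1) = \hat{\sigma} / \sqrt{\sum_i (M_i - \bar{M})^2}$, which is not written out on the slide, and again uses made-up heights.

```python
import math

def slope_t_statistic(x, y):
    """Return (b1, sd_b1, t) for testing H0: beta_1 = 0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    resid_ss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    sigma2 = resid_ss / (n - 2)      # estimate of sigma^2
    sd_b1 = math.sqrt(sigma2 / sxx)  # estimated SD of the slope
    return b1, sd_b1, b1 / sd_b1

# Hypothetical heights of five couples
husband = [165.0, 170.0, 175.0, 180.0, 185.0]
wife = [155.0, 158.0, 166.0, 167.0, 174.0]
b1_hat, sd_b1, t_stat = slope_t_statistic(husband, wife)

# Two-sided test at level 0.05: compare |t| with the 0.975 quantile
# of the t distribution with n - 2 = 3 degrees of freedom (3.182, from a table)
reject_h0 = abs(t_stat) > 3.182
```

For these illustrative numbers the statistic is far beyond the critical value, so the hypothesis of no linear relationship would be rejected.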

SAS: hypothesis tests
PROC REG DATA=DATADIR.height;
    MODEL WIFE=HUSBAND;
    /* TEST statements refer to the coefficient of the named regressor */
    MYTEST: TEST HUSBAND=0;
    OTHERTEST: TEST HUSBAND=1;
RUN;
Output has a page for each of MYTEST and OTHERTEST. If there is more than one covariate, more than one variable can be specified per TEST statement.

SAS: plots
PROC REG DATA=DATADIR.height;
    MODEL WIFE=HUSBAND;
    PLOT WIFE*HUSBAND;
RUN;

Least Squares Fit
[Figure: least squares fit, F (vertical axis) against M (horizontal axis).]

Diagnostics
How do we tell if our assumptions are justified?
- Can we check if we have left a higher order term out?
- Is the variance constant across couples?
- Are the residuals close to being normally distributed?

SAS: residual plot
PROC REG DATA=DATADIR.height;
    MODEL WIFE=HUSBAND;
    PLOT R.*P. / VREF=0;
RUN;
R. refers to the residual $R_i = F_i - \hat{F}(M_i)$. P. refers to the predicted value $\hat{F}(M_i)$. This plot can detect missing higher order terms and nonconstant variance.

Residual Plot: Count data
[Figure: residuals R (vertical axis) against predicted values P (horizontal axis).]

Residual Plot
[Figure: residuals R (vertical axis) against predicted values P (horizontal axis).]

SAS: QQplot
PROC REG DATA=DATADIR.height;
    MODEL WIFE=HUSBAND;
    PLOT R.*NQQ.;
RUN;
NQQ. refers to the normal quantiles, so this is a quantile-quantile plot of the residuals. It can detect departures from normality.
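The points of a normal quantile-quantile plot pair each ordered residual with a quantile of the standard normal distribution. A minimal sketch in plain Python, assuming the plotting positions $(k - 0.5)/n$ (one common choice, not necessarily the one SAS uses) and a small set of hypothetical residuals:

```python
from statistics import NormalDist

def normal_qq_points(residuals):
    """Return [(theoretical quantile, sample quantile)] for a normal QQ plot."""
    n = len(residuals)
    ordered = sorted(residuals)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    theoretical = [std_normal.inv_cdf((k - 0.5) / n) for k in range(1, n + 1)]
    return list(zip(theoretical, ordered))

# Hypothetical residuals from a fitted line
residuals = [0.4, -1.3, 2.0, -1.7, 0.6]
qq_points = normal_qq_points(residuals)
```

If the residuals are close to normally distributed, these points fall roughly on a straight line; systematic curvature or outlying points at the ends suggest a departure from normality.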

[Figure: quantile-quantile plot, Sample Quantiles (vertical axis) against Theoretical Quantiles (horizontal axis).]