Lecture 1. Introduction. Statistics 211 - Statistical Methods II. Presented January 8, 2018

Introduction
Statistics 211 - Statistical Methods II
Presented January 8, 2018
Dan Gillen, Department of Statistics, University of California, Irvine

Logistics and Contact Information
Lectures: Monday and Wednesday, 9:30-11:50, DBH 1423
Discussion: Tuesday, 11:00-11:50, SST 122
Course web site: http://www.ics.uci.edu/ dgillen/stat211
Instructor: Dan Gillen, Professor and Chair, Department of Statistics
Office: 2038 Donald Bren Hall
Telephone: 4-9862
E-mail: dgillen@uci.edu
Office hours: Monday 11:00-12:00 and Tuesday 12:00-1:00, or by appointment

Description and Textbooks
Prerequisites: Statistics 210 or equivalent, or permission of instructor
Description: This course will introduce theory and methods for analyzing non-normal outcomes. We will focus on developing a general theoretical framework for regression models and using that framework to address scientific questions.
Required text: Agresti A., Foundations of Linear and Generalized Linear Models

Description and Textbooks
Reference texts:
- McCullagh P. and Nelder J., Generalized Linear Models
- Hastie T., Tibshirani R., and Friedman J., Elements of Statistical Learning
- Casella G. and Berger R., Statistical Inference (2nd Ed)

Software/Computing
- Examples that are presented in class are primarily done using the R statistical package
- R is free software and can be installed on multiple platforms
- You can download R at http://www.r-project.org/
- I (highly) recommend that you use R, but you may choose any other software package that allows you to complete the assigned coursework

Assignments, Exams and Grading
Homework: There will be a total of 6-7 homework assignments. Assignments will typically be due 1-1.5 weeks from the date they are handed out.
Midterm Exam: Tentatively scheduled for Wednesday, Feb 21st. The exam will be in-class (closed-book, closed-note) and will cover material through the Wednesday, Feb 14th lecture.
Final Exam: The final exam is scheduled for Wednesday, March 21st. It will be a take-home exam, handed out on Wednesday, March 14th and due on Wednesday, March 21st by 10am.
Grading: Homework 35%, Midterm 30%, Final 35%

Stat 211: Course Outline
Part I - Introduction to generalized regression (.5 week)
- Goals of scientific studies
Part II - Review of linear regression (2.5 weeks)
- Ordinary least squares theory
- Implications of assumptions
- Performance of estimators when assumptions fail
- Regression modeling strategies
- Prediction
- Association estimation/testing
- Weighted least squares

Stat 211: Course Outline
Part III - Review of asymptotic theory (1 week)
- Review of theoretical tools
- Asymptotic results
- Likelihood theory
- Asymptotic inference
Part IV - GLMs for non-normal data (5.5 weeks)
- Models
  - Components of GLMs
  - Fitting GLMs
- Logistic regression
  - Model assumptions and diagnostics
  - Parameter interpretations
- Poisson regression
  - Model assumptions and diagnostics
  - Parameter interpretations

Stat 211: Course Outline
Part IV - GLMs for non-normal data, cont'd (5.5 weeks)
- Overdispersion
- Quasi-likelihood
- Polytomous response methods
  - Proportional odds model
  - Multinomial logistic regression
Part V - Bayesian estimation for GLMs (time permitting)
- Inclusion of prior distributions
- Model fitting
- Interpretation

First Stage of Scientific Investigation
- Hypothesis generation
- Observation
- Measurement of existing populations or systems
- Disadvantages:
  - Confounding
  - Limited ability to establish cause and effect
Further Stages of Scientific Investigation
- Refinement and confirmation of hypotheses
- Experiment
- Intervention
- Elements of experiment
  - Overall goal and specific aims (hypotheses)
  - Materials and methods
  - Collection of data
  - Analysis
  - Interpretation; refinement of hypotheses

Common aims of data analysis
A statistical analysis is often geared towards (at least) one of three common objectives:
1. Hypothesis generation (description, exploration)
2. Hypothesis testing (inference about associations between variables in a population)
3. Prediction (for future sampling units)

Distinctions without a difference?
- In application, not enough attention is given to distinguishing between these objectives
- This lack of distinction arises because the same statistical tools (i.e., generalized regression models) can be used to address all of these objectives
- HOWEVER, the strategies used to address each goal should be distinct, and the interpretation of results depends upon the strategy employed!

Distinctions without a difference?
- Hypothesis generation requires data-driven modeling (the scientific method!), which hopefully yields a simplified model (hypothesis)
  - Data-driven modeling implies that the usual inferential statements (p-values, confidence intervals, posterior probabilities) are invalid due to multiple comparisons
- Hypothesis testing seeks to test a formal (typically parsimonious) hypothesis via valid inferential statements
  - NO data-driven modeling!
- Prediction necessitates model building and data-driven results, since simple parsimonious models (e.g., only linear effects) do not generally lead to a reliable prediction model
  - Requires much stronger (generally untestable) assumptions to make probability statements
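The claim that data-driven selection invalidates the usual inferential statements can be illustrated with a small simulation. The following is only a sketch under simplifying assumptions that are not part of the lecture: 20 candidate predictors, none truly associated with the response, each tested with an independent standard normal test statistic, and only the most extreme one reported. The function name and settings are hypothetical.

```python
import random

def selected_rejection_rate(n_predictors=20, n_reps=2000, z_crit=1.96, seed=1):
    """Fraction of null datasets in which the most extreme of
    n_predictors z-statistics exceeds the usual two-sided 5% cutoff."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_reps):
        # Under the global null, each standardized test statistic is N(0, 1).
        z_stats = [rng.gauss(0.0, 1.0) for _ in range(n_predictors)]
        # Data-driven step: report only the most extreme statistic.
        if max(abs(z) for z in z_stats) > z_crit:
            rejections += 1
    return rejections / n_reps

rate = selected_rejection_rate()
# Chance that at least one of 20 independent 5%-level tests rejects.
theory = 1 - 0.95 ** 20
print(f"simulated: {rate:.3f}, theoretical: {theory:.3f}")
```

Under these assumptions the reported "significant" finding appears far more often than the nominal 5%, which is why a p-value computed after selection cannot be read at face value.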

Hypothesis Generation
Example: Factors that influence median life expectancy in stage II breast cancer
- General goal: Want to identify factors that affect prognosis
- Follow a cohort of newly diagnosed patients and measure survival time (may be censored) along with covariates of interest
- What defines the cohort?
  - All patients from a particular healthcare plan?
  - All patients diagnosed at a particular hospital?
  - All patients in a certain location?
- "Best" estimator of the median survival (??)
- Model building to identify most important factors
- Estimation of median survival for various combinations of factors

Hypothesis Testing
Example: Effect of serum albumin levels on risk of mortality in end-stage renal disease (ESRD) patients
- Randomly sample a cohort of ESRD patients and follow until death (or study end)
- Hypothesize a first-order relationship between albumin and the log-relative risk of death
- A priori specify additional adjustment variables: age, gender, ethnicity, duration of disease, etc.

Prediction
Example 1: Spam filters
- Goal: Use email content to classify an email as valid or spam
- Potential factors include:
  - Individual words or characters
  - Message length
  - Attachments
- Likely to be a non-linear association between message length and the probability of spam
- May be interactions between types of words, length of message, and the presence of attachments

Prediction
Example 2: Hormone replacement therapy use
[Figure: HRT use (per 1,000) versus calendar date, 1997-2004; panel title "Gaussian kernel". Credit: S. Haneuse, Biostat/Stat 572.]
- Hormone replacement therapy slows the decrease of bone density in post-menopausal women
- Observational studies also suggest a decrease in CVD
- Stock brokers would like to predict future use of HRT
- Goal: Collect monthly HRT use over the past 7 years and predict future use (if done in 1999...oops!)

Required steps in an inferential data analysis
1. Aims of a data analysis
   - Description, exploration, confirmation, prediction
2. Establish the context of the analysis
   - Statistics produces inference about a population based upon a sample
   - Need to understand the population sampled
   - Understand the data collection procedure (true random sample?)
   - Understand the background science
   - The scientific goals of the analysis/experiment

Required steps in an inferential data analysis
3. Develop a statistical model
   - Clearly defined (measurable) outcome is essential
   - Predictor(s) of interest
   - If we cannot decide which parameters would be appropriate when measurements are available on the entire population, then there is no chance that statistics can be of help!

Required steps in an inferential data analysis
4. Evaluation of the properties of the design, model, and estimation procedure
   - Essential that these aspects be addressed as completely as possible prior to data analysis
   - Clear specification of outcomes and predictors
   - Use of robust statistical methods
   - What is the cost of planning not to plan?
5. Computation
   - Turn the handle...

Required steps in an inferential data analysis
6. Interpretation of results
   - Present results clearly and precisely
   - If applicable, present scientific justification for why results agree with the hypothesis (should have been done at the design stage)
   - The most elegant experiment/data analysis is meaningless unless it can be easily explained to the scientific community it was designed to impact

Summary
- The basic distinction in strategies comes with respect to model selection
- When making inferential statements, model selection should be avoided when possible
  - "Know what you want to adjust for, and adjust for it."
  - As long as robust statistical procedures are used, this will ensure valid probability statements
- When performing data-driven modeling, we need to clearly define our criteria for choosing the best model and be careful with probability statements
  - How to choose covariates?
  - What functional form?
  - What does a 95% confidence interval mean at the end of the day?

Comparing distributions across subpopulations
- In many situations, the scientific question to be addressed by a statistical analysis can be viewed as comparing the distribution of some random variable or vector (the response or dependent variable) across subpopulations defined by the values of other random variables (the predictors or independent variables).
- Within each subpopulation, there is a distribution for the response variable.
- The question is whether that distribution is the same for every subpopulation.

General Setting
- The standard methods for addressing such questions depend upon the type of response variable and the number of subpopulations being compared.
- In the case of a univariate response with independent observations across subjects, we can make a chart of the basic type of analysis used to test hypotheses according to whether the response variable is binary, categorical, count data, continuous, or censored continuous.
- We can similarly consider whether the number of subpopulations is one, two, several unordered, several ordered, or potentially infinite ordered.

Our Setting

                                   Response Variable
Predictor     Binary     Categorical   Discrete     Continuous   Censored
Constant      Z-test     χ²            Poisson      t-test       KM
Binary        χ²         χ²            Poisson      t-test       logrank
Categorical   χ²         χ²            log linear   ANOVA        k logrank
Discrete      logistic   polytomous    Poisson      linear       prop hzd
Continuous    logistic   polytomous    Poisson      linear       prop hzd

(Note that the predictor measuring which subpopulation a subject belongs to is, respectively, constant, binary, categorical, discrete, or continuous.)
Proposition: Under appropriate assumptions, each of the lines in the above table can be regarded as a special case of the lines below it. (proof deferred)
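One instance of the proposition can be checked numerically: with a continuous response, the pooled-variance two-sample t-test (binary predictor row) gives the same statistic as the test of the slope in a simple linear regression on a 0/1 indicator (linear regression row). The sketch below is an illustration rather than the deferred proof; the data values and function names are made up for the example.

```python
import math

def pooled_t(y0, y1):
    """Two-sample t statistic with pooled (equal) variance."""
    n0, n1 = len(y0), len(y1)
    m0, m1 = sum(y0) / n0, sum(y1) / n1
    ss0 = sum((v - m0) ** 2 for v in y0)
    ss1 = sum((v - m1) ** 2 for v in y1)
    sp2 = (ss0 + ss1) / (n0 + n1 - 2)
    return (m1 - m0) / math.sqrt(sp2 * (1 / n0 + 1 / n1))

def ols_slope_t(x, y):
    """t statistic for the slope in the least squares fit of y on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((a - xbar) ** 2 for a in x)
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    rss = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope / se

# Hypothetical continuous responses in two subpopulations.
y0 = [4.1, 5.0, 3.8, 4.6, 5.3]
y1 = [5.9, 6.4, 5.1, 6.8, 6.0, 5.5]
x = [0] * len(y0) + [1] * len(y1)  # binary predictor: group membership
y = y0 + y1

t_two_sample = pooled_t(y0, y1)
t_regression = ols_slope_t(x, y)
print(t_two_sample, t_regression)
```

The two statistics agree up to floating-point error because the fitted slope is the difference in group means and the residual variance is the pooled variance.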

Regression modeling
The general problem we will address is one where we have for the $i$th subject:
1. measured variables $R_{i1}, R_{i2}, \ldots$ related to some quality to be compared across the subpopulations, and
2. measured variables (factors) $P_{i1}, P_{i2}, \ldots$ related to subpopulation membership
We assume that analysis shall proceed by defining, for some functions $f_R$ and $f_{P_j}$, $j = 1, \ldots, p$:
1. response variable $Y_i = f_R(R_{i1}, R_{i2}, \ldots)$
2. predictor variables $X_{ij} = f_{P_j}(P_{i1}, P_{i2}, \ldots)$

Regression modeling
Basic assumptions for methods development
- Statistical modeling will proceed by using the $X_{ij}$'s to model some functional of the distribution of the $Y_i$'s.
- When presenting methodologic development, we will presume knowledge of the form of the response and predictors.
- When exploring the application of these methods and their robustness to departures from the underlying assumptions of the theory, we will regard ourselves as free to explore alternative formulations of the predictors and response.

- In generalized regression, we model the interrelationships among subpopulations (as defined by predictor variables) with respect to the distribution of some response variable
- Unlike linear regression, we shall not always consider modelling the mean, and even when we do, we will not necessarily model the mean as a linear function of the parameters

Definition: In the general regression model, we typically have
Response $Y \mid \vec{X} \sim F(y; h(\vec{X}, \vec{\beta}), \phi)$
where $\vec{\beta}$ represents parameters and $\phi$ is some nuisance parameter necessary to know the full distribution of $Y$.
- The model is fully specified by providing the form of $F$ taking arguments $y$, $h$, and $\phi$.
- If all of our parameters of interest are in $\vec{\beta}$,
  (i) when $\phi$ is finite dimensional, we call the model parametric
  (ii) when $\phi$ is infinite dimensional, we call the model semiparametric
- If $\vec{\beta}$ does not contain all of our parameters of interest,
  (iii) when $\phi$ is infinite dimensional, we call the model nonparametric

- In generalized regression models, $h(\vec{X}, \vec{\beta})$ most often models some parameter of $F$ or some functional of the distribution (e.g., the mean, the median, the hazard, etc.).
- It is quite often the case that we choose $h(\vec{X}, \vec{\beta}) = h(\vec{X}^T \vec{\beta})$; that is, $F$ depends on $\vec{\beta}$ only through some linear combination of its elements. (Recall that $\vec{X}$ may include arbitrary transformations of the measured factors.)

Definition: A generalized linear model has
1. $F$ an exponential family distribution with pmf or pdf
   $f(y; \theta, \phi) = \exp\left( \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \right)$
   for some functions $a(\cdot)$, $b(\cdot)$, $c(\cdot, \cdot)$.
2. $E(Y) = \mu = b'(\theta)$ modelled by
   $g(\mu) = \eta = \vec{X}^T \vec{\beta}$
   where $g(\cdot)$ is termed the link function.
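As a concrete check of this definition, the Poisson distribution can be written in exponential family form with $\theta = \log\mu$, $b(\theta) = e^{\theta}$, $a(\phi) = 1$, and $c(y, \phi) = -\log y!$. The sketch below compares that form against the usual Poisson pmf and verifies $b'(\theta) = \mu$ by a finite difference. The value $\mu = 3.5$ and the helper names are arbitrary choices for illustration.

```python
import math

def poisson_pmf(y, mu):
    """Usual form of the Poisson pmf."""
    return math.exp(-mu) * mu ** y / math.factorial(y)

def expfam_density(y, theta, phi, a, b, c):
    """Exponential family form: exp((y*theta - b(theta)) / a(phi) + c(y, phi))."""
    return math.exp((y * theta - b(theta)) / a(phi) + c(y, phi))

# Poisson ingredients of the exponential family form.
def a(phi):
    return 1.0

def b(theta):
    return math.exp(theta)

def c(y, phi):
    return -math.lgamma(y + 1)  # -log(y!)

mu = 3.5              # arbitrary mean chosen for the check
theta = math.log(mu)  # canonical parameter

# Largest discrepancy between the two forms over y = 0, ..., 14.
max_gap = max(
    abs(poisson_pmf(y, mu) - expfam_density(y, theta, 1.0, a, b, c))
    for y in range(15)
)

# Central finite difference approximation to b'(theta), which should equal mu.
eps = 1e-6
b_prime = (b(theta + eps) - b(theta - eps)) / (2 * eps)
print(max_gap, b_prime)
```

Because $\theta = \log\mu$ here, the log link $g(\mu) = \log\mu$ is the canonical link for the Poisson model.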

Definition: A regression model shall consist of
1. $\theta$, a functional of the probability distribution of a response variable $Y$,
2. $\eta = h(\vec{X}, \vec{\beta})$, a predictor function based on the predictors $\vec{X}$ and the parameters $\vec{\beta}$. (Populations are defined as the set of individuals having the same value of $\eta$.)
3. $g(\cdot)$, a link function describing the relationship between $\theta$ and $\eta$: $\eta = g(\theta)$
4. $\phi$, a nuisance parameter vector that is required to be able to fully specify the distribution of the response variable $Y$ in every subpopulation. That is, the distribution of $Y$ will depend upon the predictor function $\eta$ and the nuisance parameters $\phi$.

Examples of commonly used regression models
1. Linear regression:
   - $F(y, h, \phi) = \Phi((y - h)/\sqrt{\phi})$ (the normal cdf)
   - $\theta = E[Y]$
   - $g(\cdot)$ is the identity link
   - $a(\phi) = \sigma^2$

Examples of commonly used regression models
2. Logistic regression:
   - $F(y, h, \phi) = h^y (1 - h)^{1 - y}$ (the Bernoulli pmf)
   - $\theta = \mathrm{logit}(E[Y])$ (the log-odds of "success")
   - $g(\cdot)$ is the logit link
   - $a(\phi) = 1$

Examples of commonly used regression models
3. Poisson regression:
   - $F(y, h, \phi) = e^{-h} h^y / y!$ (the Poisson pmf)
   - $\theta = \log(E[Y])$ (with $E[Y]$ usually standardized as a rate)
   - $g(\cdot)$ is the log link
   - $a(\phi) = 1$
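The three links named in these examples can be written out directly. The sketch below follows the convention of the generalized linear model definition, $g(\mu) = \eta$, and maps a linear predictor back to the mean scale with each inverse link. The covariate values, coefficients, and helper names are hypothetical.

```python
import math

def identity(mu):
    return mu

def identity_inv(eta):
    return eta

def logit(mu):
    return math.log(mu / (1 - mu))

def expit(eta):
    return 1 / (1 + math.exp(-eta))

def log_link(mu):
    return math.log(mu)

def log_inv(eta):
    return math.exp(eta)

# Each model paired with its link g and the inverse link.
LINKS = {
    "linear": (identity, identity_inv),
    "logistic": (logit, expit),
    "poisson": (log_link, log_inv),
}

def fitted_mean(model, x, beta):
    """Mean implied by the linear predictor eta = x^T beta under the model's link."""
    eta = sum(xj * bj for xj, bj in zip(x, beta))
    _, inverse = LINKS[model]
    return inverse(eta)

x = [1.0, 2.0]      # intercept term and one hypothetical covariate value
beta = [-1.0, 0.5]  # hypothetical coefficients, giving eta = 0
means = {model: fitted_mean(model, x, beta) for model in LINKS}
print(means)
```

The same linear predictor yields a mean on the real line, a probability in (0, 1), or a positive rate, depending on the link.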

Note: In order for a model to be useful, we must (at minimum) have a means of dealing with the nuisance parameter $\phi$ and estimating the parameters $\vec{\beta}$.

A prelude to parameter estimation
Estimating equations
Definition: In the abstract situation, if we have a function of the data and the unknown parameters that has expectation
$E[G(\vec{X}, \vec{\beta}, \phi)] = 0$ for all $\vec{\beta}$ and $\phi$,
then one possible form of estimation is to use such a function as an (unbiased) estimating equation. That is, we will search for the values $\hat{\vec{\beta}}$ and $\hat{\phi}$ such that
$G(\vec{X}, \hat{\vec{\beta}}, \hat{\phi}) = 0$.

A prelude to parameter estimation
Estimating equations
There are many constraints that we need to put on such an equation before it stands much chance of producing useful estimates:
1. We would like the estimates produced to be unique as the information about the parameters goes to infinity in our sample
2. We would like the estimates to be efficient and consistent
3. We would like the estimates to be relatively easy to find
4. We would like to be able to estimate the distribution of the estimates

A prelude to parameter estimation
Notes
Note 1: Maximum likelihood leads to one such estimating equation (the score equation); however, we may choose other estimating equations which do not necessarily correspond to a likelihood (more later).
Note 2: Bayesian estimation seeks to incorporate "prior information" in order to make direct probability statements regarding the model parameters. Estimation will be carried out by specifying an appropriate likelihood and combining the likelihood with prior distributions on model parameters to form a posterior distribution for the parameters.