Lecture 1. Introduction Statistics Statistical Methods II. Presented January 8, 2018

1 Introduction
Statistics Statistical Methods II
Presented January 8, 2018
Dan Gillen, Department of Statistics, University of California, Irvine

2 Logistics and Contact Information
Lectures: Monday and Wednesday, 9:30-11:50, DBH 1423
Discussion: Tuesday, 11:00-11:50, SST 122
Course web site: dgillen/stat211
Instructor: Dan Gillen, Professor and Chair, Department of Statistics
Office: 2038 Donald Bren Hall
Telephone:
Office hours: Monday 11:00-12:00 and Tuesday 12:00-1:00, or by appointment

3 Description and Textbooks
Prerequisites: Statistics 210 or equivalent, or permission of instructor
Description: This course will introduce theory and methods for analyzing non-normal outcomes. We will focus on developing a general theoretical framework for regression models and using that framework to address scientific questions.
Required text: Agresti A.; Foundations of Linear and Generalized Linear Models

4 Description and Textbooks
Reference texts:
- McCullagh P. and Nelder J.; Generalized Linear Models
- Hastie T., Tibshirani R., and Friedman J.; Elements of Statistical Learning
- Casella G. and Berger R.; Statistical Inference (2nd Ed)

5 Software/Computing
- Examples that are presented in class are primarily done using the R statistical package
- R is free software and can be installed on multiple platforms
- You can download R at
- I (highly) recommend that you use R, but you may choose to use any other software package that allows you to complete the assigned coursework

6 Assignments, Exams and Grading
Homework: There will be a total of 6-7 homework assignments. Assignments will typically be due weeks from the date they are handed out.
Midterm Exam: Tentatively scheduled for Wednesday, Feb 21st. The exam will be in-class (closed-book, closed-note), and will cover material through the Wednesday, Feb 14th lecture.
Final Exam: The final exam is scheduled for Wednesday, March 21st. The final exam will be a take-home, handed out on Wednesday, March 14th and due on Wednesday, March 21st by 10am.
Grading: Homework: 35%, Midterm: 30%, Final: 35%

7 Stat 211
Part I - Introduction to generalized regression (.5 week)
- Goals of scientific studies
Part II - Review of linear regression (2.5 weeks)
- Ordinary least squares theory
- Implications of assumptions
- Performance of estimators when assumptions fail
- Regression modeling strategies
- Prediction
- Association estimation/testing
- Weighted least squares

8 Stat 211
Part III - Review of asymptotic theory (1 week)
- Review of theoretical tools
- Asymptotic results
- Likelihood theory
- Asymptotic inference
Part IV - GLMs for non-normal data (5.5 weeks)
- Regression models: components of GLMs, fitting GLMs
- Logistic regression: model assumptions and diagnostics, parameter interpretations
- Poisson regression: model assumptions and diagnostics, parameter interpretations

9 Stat 211
Part IV - GLMs for non-normal data cont'd (5.5 weeks)
- Overdispersion
- Quasi-likelihood
- Polytomous response methods: proportional odds model, multinomial logistic regression
Part V - Bayesian estimation for GLMs (time permitting)
- Inclusion of prior distributions
- Model fitting
- Interpretation

10 First Stage of Scientific Investigation
Hypothesis generation
- Observation
- Measurement of existing populations or systems
- Disadvantages: confounding; limited ability to establish cause and effect
Further Stages of Scientific Investigation
Refinement and confirmation of hypotheses
- Experiment
- Intervention
- Elements of experiment: overall goal and specific aims (hypotheses); materials and methods; collection of data; analysis; interpretation; refinement of hypotheses

11 Common aims of data analysis
A statistical analysis is often geared towards (at least) one of three common objectives:
1. Hypothesis generation (description, exploration)
2. Hypothesis testing (inference about associations between variables in a population)
3. Prediction (for future sampling units)

12 Distinctions without a difference?
- In application, not enough attention is given to distinguishing between these objectives
- This lack of distinction arises because the same statistical tools (i.e., generalized regression models) can be used to address all of these objectives
- HOWEVER, the strategies used to address each goal should be distinct, and the interpretation of results depends upon the strategy employed!

13 Distinctions without a difference?
- Hypothesis generation requires data-driven modeling (the scientific method!) which hopefully yields a simplified model (hypothesis)
  - Data-driven modeling implies that usual inferential statements (p-values, confidence intervals, posterior probabilities) are invalid due to multiple comparisons
- Hypothesis testing seeks to test a formal (typically parsimonious) hypothesis via valid inferential statements
  - NO data-driven modeling!
- Prediction necessitates model building and data-driven results, since simple parsimonious models (e.g., only linear effects) do not generally lead to a reliable prediction model
  - Requires much stronger (generally untestable) assumptions to make probability statements
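The claim that data-driven modeling invalidates the usual p-values can be illustrated with a small simulation. The sketch below is not from the lecture: it is written in Python with only the standard library (the course itself uses R), and the sample size, number of candidate predictors, and the Fisher z-test for a correlation are all illustrative assumptions. Every predictor is generated independently of the outcome, so any "significant" finding is a false positive.

```python
import math
import random

def normal_cdf(z):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def correlation(x, y):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def p_value(x, y):
    """Approximate two-sided p-value for zero correlation (Fisher z-transform)."""
    z = math.atanh(correlation(x, y)) * math.sqrt(len(x) - 3)
    return 2.0 * (1.0 - normal_cdf(abs(z)))

def rejection_rates(n_sims=1000, n=50, k=10, alpha=0.05, seed=1):
    """Compare a pre-specified test with 'test the best-looking predictor'."""
    rng = random.Random(seed)
    prespecified = 0
    selected = 0
    for _ in range(n_sims):
        y = [rng.gauss(0, 1) for _ in range(n)]
        # k candidate predictors, all truly unrelated to y
        pvals = [p_value([rng.gauss(0, 1) for _ in range(n)], y) for _ in range(k)]
        prespecified += pvals[0] < alpha   # predictor chosen before seeing data
        selected += min(pvals) < alpha     # predictor chosen because it looked best
    return prespecified / n_sims, selected / n_sims

rate_prespecified, rate_selected = rejection_rates()
print("pre-specified predictor:", rate_prespecified)
print("data-selected predictor:", rate_selected)
```

With ten independent candidates, the chance that at least one naive test rejects at the 5% level is roughly 1 - 0.95^10, or about 40%, which is why the naive p-value reported after selection cannot be read at face value.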

14 Hypothesis Generation
Example: Factors that influence median life expectancy in stage II breast cancer
- General goal: Want to identify factors that affect prognosis
- Follow a cohort of newly diagnosed patients and measure survival time (may be censored) along with covariates of interest
- What defines the cohort? All patients from a particular healthcare plan? All patients diagnosed at a particular hospital? All patients in a certain location?
- "Best" estimator of the median survival (??)
- Model building to identify most important factors
- Estimation of median survival for various combinations of factors

15 Hypothesis Testing
Example: Effect of serum albumin levels on risk of mortality in end-stage renal disease (ESRD) patients
- Randomly sample a cohort of ESRD patients and follow until death (or study end)
- Hypothesize a first-order relationship between albumin and the log-relative risk of death
- A priori specify additional adjustment variables: age, gender, ethnicity, duration of disease, etc.

16 Prediction
Example 1: Spam filters
- Goal: Use content to classify an email as valid or spam
- Potential factors include: individual words or characters, message length, attachments
- Likely to be a non-linear association between message length and the probability of spam
- May be interactions between types of words, length of message, and the presence of attachments

17 Prediction
Example 2: Hormone replacement therapy use
[Figure: HRT use (per 1,000) versus calendar date, with a Gaussian kernel smooth (kernel parameter 0.5); source: S. Haneuse, Biostat/Stat 572]
- Hormone replacement therapy slows the decrease of bone density in post-menopausal women
- Observational studies also suggest decrease in CVD
- Stock brokers would like to predict future use of HRT
- Goal: Collect monthly HRT use over the past 7 years and predict future use (if done in oops!)

18 Required steps in an inferential data analysis
1. Aims of a data analysis
- Description, exploration, confirmation, prediction
2. Establish the context of the analysis
- Statistics produces inference about a population based upon a sample
- Need to understand the population sampled
- Understand the data collection procedure (true random sample?)
- Understand the background science
- The scientific goals of the analysis/experiment

19 Required steps in an inferential data analysis
3. Develop a statistical model
- Clearly defined (measurable) outcome is essential
- Predictor(s) of interest
- If we cannot decide which parameters would be appropriate when measurements are available on the entire population, then there is no chance that statistics can be of help!

20 Required steps in an inferential data analysis
4. Evaluation of the properties of the design, model, and estimation procedure
- Essential that these aspects be addressed as completely as possible prior to data analysis
- Clear specification of outcomes and predictors
- Use of robust statistical methods
- What is the cost of planning not to plan?
5. Computation
- Turn the handle

21 Required steps in an inferential data analysis
6. Interpretation of results
- Present results clearly and precisely
- If applicable, present scientific justification for why results agree with the hypothesis (should have been done at the design stage)
- The most elegant experiment/data analysis is meaningless unless it can be easily explained to the scientific community it was designed to impact

22 Summary
- The basic distinction in strategies comes with respect to model selection
- When making inferential statements, model selection should be avoided when possible
  - "Know what you want to adjust for, and adjust for it."
  - As long as robust statistical procedures are used, this will ensure valid probability statements
- When performing data-driven modeling, we need to clearly define our criteria for choosing the best model and be careful with probability statements
  - How to choose covariates?
  - What functional form?
  - What does a 95% confidence interval mean at the end of the day?

23 Comparing distributions across subpopulations
In many situations, the scientific question to be addressed by a statistical analysis can be viewed as comparing the distribution of some random variable or vector (the response or dependent variable) across subpopulations defined by the values of other random variables (the predictors or independent variables).
- Within each subpopulation, there is a distribution for the response variable.
- The question is whether that distribution is the same for every subpopulation.

24 General
The standard methods for addressing such questions depend upon the type of response variable and the number of subpopulations being compared.
- In the case of a univariate response with independent observations across subjects, we can make a chart of the basic type of analysis used to test hypotheses according to whether the response variable is binary, categorical, count data, continuous, or censored continuous.
- We can similarly consider whether the number of subpopulations is one, two, several unordered, several ordered, or potentially infinite ordered.

25 Our Setting
Basic analysis by type of response variable (columns) and type of predictor (rows):

                           Response Variable
Predictor     Binary     Categorical   Discrete     Continuous   Censored
Constant      Z-test     χ²            Poisson      t-test       KM
Binary        χ²         χ²            Poisson      t-test       logrank
Categorical   χ²         χ²            log linear   ANOVA        k logrank
Discrete      logistic   polytomous    Poisson      linear       prop hzd
Continuous    logistic   polytomous    Poisson      linear       prop hzd

(Note that the predictor measuring which subpopulation a subject belongs to is, respectively, constant, binary, categorical, discrete, or continuous.)
Proposition: Under appropriate assumptions, each of the lines in the above table can be regarded as a special case of the lines below it. (proof deferred)
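The proof of the proposition is deferred, but one instance can be checked numerically. The sketch below (an illustration, not the lecture's proof) compares the continuous-response entries for a binary predictor and a continuous predictor: the equal-variance two-sample t statistic coincides with the slope t statistic from ordinary least squares when the predictor is a 0/1 group indicator. The data values and function names are made up for the example.

```python
import math

def pooled_t(y0, y1):
    """Equal-variance two-sample t statistic for mean(y1) - mean(y0)."""
    n0, n1 = len(y0), len(y1)
    m0, m1 = sum(y0) / n0, sum(y1) / n1
    ss0 = sum((v - m0) ** 2 for v in y0)
    ss1 = sum((v - m1) ** 2 for v in y1)
    s2 = (ss0 + ss1) / (n0 + n1 - 2)          # pooled variance estimate
    return (m1 - m0) / math.sqrt(s2 * (1.0 / n0 + 1.0 / n1))

def ols_slope_t(x, y):
    """t statistic for the slope in simple linear regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)       # model-based standard error
    return slope / se

# Hypothetical data for two groups
group0 = [4.1, 5.0, 3.8, 4.6]
group1 = [5.9, 6.3, 5.1, 6.8, 6.0]

x = [0] * len(group0) + [1] * len(group1)     # binary predictor
y = group0 + group1

t_pooled = pooled_t(group0, group1)
t_ols = ols_slope_t(x, y)
print(t_pooled, t_ols)
```

The agreement relies on the equal-variance assumption shared by the pooled t-test and classical linear regression; an unequal-variance (Welch) t-test would not match the model-based regression statistic.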

26 Regression modeling
The general problem we will address is one where we have for the $i$th subject:
1. measured variables $R_{i1}, R_{i2}, \ldots$ related to some quality to be compared across the subpopulations, and
2. measured variables (factors) $P_{i1}, P_{i2}, \ldots$ related to subpopulation membership
We assume that analysis shall proceed by defining, for some functions $f_R$ and $f_{Pj}$, $j = 1, \ldots, p$:
1. response variable $Y_i = f_R(R_{i1}, R_{i2}, \ldots)$
2. predictor variables $X_{ij} = f_{Pj}(P_{i1}, P_{i2}, \ldots)$

27 Regression modeling
Basic assumptions for methods development
- Statistical modeling will proceed by using the $X_{ij}$'s to model some functional of the distribution of the $Y_i$'s.
- When presenting methodologic development, we will presume knowledge of the form of the response and predictors.
- When exploring the application of these methods and their robustness to departures from the underlying assumptions of the theory, we will consider ourselves free to explore alternative formulations of the predictors and response.

28 In generalized regression, we model the interrelationships among subpopulations (as defined by predictor variables) with respect to the distribution of some response variable
- Unlike linear regression, we shall not always consider modelling the mean, and even when we do, we will not necessarily model the mean as a linear function of the parameters

29 Definition: In the general regression model, we typically have
Response $Y \mid X \sim F(y; h(X, \beta), \phi)$
where $\beta$ represents parameters and $\phi$ is some nuisance parameter necessary to know the full distribution of $Y$.
The model is fully specified by providing the form of $F$ taking arguments $y$, $h$, and $\phi$.
If all of our parameters of interest are in $\beta$,
(i) when $\phi$ is finite dimensional, we call the model parametric
(ii) when $\phi$ is infinite dimensional, we call the model semiparametric
If $\beta$ does not contain all of our parameters of interest,
(iii) when $\phi$ is infinite dimensional, we call the model nonparametric

30 In generalized regression models, $h(X, \beta)$ most often models some parameter of $F$ or some functional of the distribution (e.g., the mean, the median, the hazard, etc.).
It is quite often the case that we choose $h(X, \beta) = h(X^T \beta)$; that is, $F$ depends on $\beta$ only through some linear combination of its elements. (Recall $X$ may include arbitrary transformations of the measured factors.)

31 Definition: A generalized linear model has
1. $F$ an exponential family distribution with pmf or pdf
$$f(y; \theta, \phi) = \exp\left( \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \right)$$
for some functions $a(\cdot)$, $b(\cdot)$, $c(\cdot, \cdot)$.
2. $E(Y) = \mu = b'(\theta)$ modelled by
$$g(\mu) = \eta = X^T \beta$$
where $g(\cdot)$ is termed the link function.
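The identity $E(Y) = b'(\theta)$ in part 2 is stated without derivation. A standard argument, sketched here under the assumption that differentiation and integration can be exchanged (for a pmf, replace the integral with a sum), starts from the fact that the density integrates to one:

```latex
\begin{align*}
1 &= \int f(y;\theta,\phi)\,dy \\
0 &= \frac{\partial}{\partial\theta}\int f(y;\theta,\phi)\,dy
   = \int \frac{y - b'(\theta)}{a(\phi)}\, f(y;\theta,\phi)\,dy
   = \frac{E(Y) - b'(\theta)}{a(\phi)} \\
\Rightarrow\quad E(Y) &= b'(\theta) = \mu
\end{align*}
```

Differentiating a second time under the same regularity assumption gives $\mathrm{Var}(Y) = b''(\theta)\, a(\phi)$, which is why $a(\phi)$ acts as a dispersion term.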

32 Definition: A regression model shall consist of
1. $\theta$, a functional of the probability distribution of a response variable $Y$,
2. $\eta = h(X, \beta)$, a predictor function based on the predictors $X$ and the parameters $\beta$. (Populations are defined as the set of individuals having the same value of $\eta$.)
3. $g(\cdot)$, a link function describing the relationship between $\theta$ and $\eta$: $\eta = g(\theta)$
4. $\phi$, a nuisance parameter vector that is required to be able to fully specify the distribution of the response variable $Y$ in every subpopulation. That is, the distribution of $Y$ will depend upon the predictor function $\eta$ and the nuisance parameters $\phi$.

33 Examples of commonly used regression models
1. Linear regression:
- $F(y, h, \phi) = \Phi((y - h)/\sqrt{\phi})$ (the normal cdf)
- $\theta = E[Y]$
- $g(\cdot)$ is the identity link
- $a(\phi) = \sigma^2$

34 Examples of commonly used regression models
2. Logistic regression:
- $F(y, h, \phi) = h^y (1 - h)^{1 - y}$ (the Bernoulli pmf)
- $\theta = \mathrm{logit}(E[Y])$ (the log-odds of "success")
- $g(\cdot)$ is the logit link
- $a(\phi) = 1$

35 Examples of commonly used regression models
3. Poisson regression:
- $F(y, h, \phi) = e^{-h} h^y / y!$ (the Poisson pmf)
- $\theta = \log(E[Y])$ (with $E[Y]$ usually standardized as a rate)
- $g(\cdot)$ is the log link
- $a(\phi) = 1$
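The logistic and Poisson examples can be checked against the exponential family form used in the definition of a generalized linear model. The sketch below is an illustration only: the specific choices of $b(\theta)$ and $c(y, \phi)$ are the standard ones for these two families (they are not spelled out on the slides), and the helper names are hypothetical. It confirms numerically that the exponential family expression reproduces each pmf and that the derivative of $b$ returns the mean.

```python
import math

def expfam_density(y, theta, phi, a, b, c):
    """Evaluate exp((y*theta - b(theta)) / a(phi) + c(y, phi))."""
    return math.exp((y * theta - b(theta)) / a(phi) + c(y, phi))

def numeric_derivative(f, x, h=1e-6):
    """Central-difference approximation to f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

def unit_dispersion(phi):
    """a(phi) = 1 for both the Poisson and Bernoulli families."""
    return 1.0

# Poisson with mean mu: theta = log(mu), b(theta) = exp(theta), c = -log(y!)
mu = 3.2
theta_pois = math.log(mu)
pois_expfam = expfam_density(
    2, theta_pois, 1.0,
    a=unit_dispersion,
    b=math.exp,
    c=lambda y, phi: -math.lgamma(y + 1),
)
pois_direct = math.exp(-mu) * mu ** 2 / math.factorial(2)
pois_mean = numeric_derivative(math.exp, theta_pois)

# Bernoulli with success probability p: theta = logit(p),
# b(theta) = log(1 + exp(theta)), c = 0
p = 0.3
theta_bern = math.log(p / (1.0 - p))

def b_bern(t):
    return math.log1p(math.exp(t))

bern_expfam = expfam_density(
    1, theta_bern, 1.0,
    a=unit_dispersion,
    b=b_bern,
    c=lambda y, phi: 0.0,
)
bern_mean = numeric_derivative(b_bern, theta_bern)

print(pois_expfam, pois_direct, pois_mean)
print(bern_expfam, p, bern_mean)
```

In both cases the canonical parameter $\theta$ equals the link applied to the mean (log for Poisson, logit for Bernoulli), which is what makes these the canonical links.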

36 Note: In order for a regression model to be useful, we must (at minimum) have a means of dealing with the nuisance parameter $\phi$ and estimating the parameters $\beta$.

37 A prelude to parameter estimation
Estimating equations
Definition: In the abstract situation, if we have a function of the data and the unknown parameters that has expectation
$$E[G(X, \beta, \phi)] = 0$$
for all $\beta$ and $\phi$, then one possible form of estimation is to use such a function as an (unbiased) estimating equation. That is, we will search for the values of $\hat{\beta}$ and $\hat{\phi}$ such that
$$G(X, \hat{\beta}, \hat{\phi}) = 0.$$
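As a concrete instance, the score function of a logistic regression has expectation zero at the true parameters, so setting it to zero gives an estimating equation. The sketch below is a minimal illustration with made-up data and hypothetical function names: it solves the two-parameter score equation $\sum_i x_i (y_i - \mathrm{expit}(\beta_0 + \beta_1 x_i)) = 0$ by Newton-Raphson. There is no nuisance parameter in this model, so only $\beta$ is estimated.

```python
import math

def expit(t):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-t))

def score(beta, xs, ys):
    """Estimating function G: sums of residuals and of x * residuals."""
    b0, b1 = beta
    g0 = g1 = 0.0
    for x, y in zip(xs, ys):
        r = y - expit(b0 + b1 * x)
        g0 += r
        g1 += x * r
    return g0, g1

def solve_score(xs, ys, tol=1e-10, max_iter=50):
    """Newton-Raphson search for beta with score(beta) = 0."""
    b0 = b1 = 0.0
    for _ in range(max_iter):
        g0, g1 = score((b0, b1), xs, ys)
        # 2x2 information matrix (negative derivative of the score)
        i00 = i01 = i11 = 0.0
        for x in xs:
            prob = expit(b0 + b1 * x)
            w = prob * (1.0 - prob)
            i00 += w
            i01 += w * x
            i11 += w * x * x
        det = i00 * i11 - i01 * i01
        d0 = (i11 * g0 - i01 * g1) / det
        d1 = (-i01 * g0 + i00 * g1) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

# Hypothetical data: binary outcome with overlapping x values in both groups
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

beta_hat = solve_score(xs, ys)
score_at_solution = score(beta_hat, xs, ys)
print(beta_hat, score_at_solution)
```

The data were chosen so that the outcomes are not perfectly separated by x; with perfect separation the score equation has no finite root and the iteration would not settle.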

38 A prelude to parameter estimation
Estimating equations
There are many constraints that we need to put on such an equation before it stands much chance of producing useful estimates:
1. We would like the estimates produced to be unique as the information about the parameters goes to infinity in our sample
2. We would like the estimates to be efficient and consistent
3. We would like the estimates to be relatively easy to find
4. We would like to be able to estimate the distribution of the estimates

39 A prelude to parameter estimation
Notes
Note 1: Maximum likelihood leads to one such estimating equation (the score equation); however, we may choose other estimating equations which do not necessarily correspond to a likelihood (more later)
Note 2: Bayesian estimation seeks to incorporate "prior information" in order to make direct probability statements regarding the model parameters. Estimation will be carried out by specifying an appropriate likelihood and combining the likelihood with prior distributions on model parameters to form a posterior distribution for the parameters
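The recipe in Note 2 (likelihood combined with a prior to form a posterior) can be sketched for the simplest case of a single Bernoulli success probability. Everything specific below is an assumption made for illustration: the data (7 successes in 10 trials), the Beta(2, 2) prior, and the grid approximation used in place of an exact or sampling-based computation.

```python
# Hypothetical data and prior
successes, failures = 7, 3
prior_a, prior_b = 2.0, 2.0          # Beta(2, 2) prior on the success probability

# Grid of candidate values for p (midpoints of equal-width cells on (0, 1))
m = 2000
grid = [(j + 0.5) / m for j in range(m)]

def likelihood(p):
    """Bernoulli likelihood, up to a constant not depending on p."""
    return p ** successes * (1.0 - p) ** failures

def prior(p):
    """Beta prior density, up to its normalizing constant."""
    return p ** (prior_a - 1.0) * (1.0 - p) ** (prior_b - 1.0)

# Posterior is proportional to likelihood times prior; normalize over the grid
unnormalized = [likelihood(p) * prior(p) for p in grid]
total = sum(unnormalized)
posterior = [u / total for u in unnormalized]

# Direct probability statements about the parameter
post_mean = sum(p * w for p, w in zip(grid, posterior))
post_prob_above_half = sum(w for p, w in zip(grid, posterior) if p > 0.5)

# Conjugacy check: the exact posterior is Beta(prior_a + successes, prior_b + failures)
exact_mean = (prior_a + successes) / (prior_a + prior_b + successes + failures)

print(post_mean, exact_mean, post_prob_above_half)
```

The quantity `post_prob_above_half` is the kind of direct probability statement about a parameter that the note refers to; its value depends on the chosen prior as well as on the data.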

General Regression Model

General Regression Model Scott S. Emerson, M.D., Ph.D. Department of Biostatistics, University of Washington, Seattle, WA 98195, USA January 5, 2015 Abstract Regression analysis can be viewed as an extension of two sample statistical

More information

Biost 518 Applied Biostatistics II. Purpose of Statistics. First Stage of Scientific Investigation. Further Stages of Scientific Investigation

Biost 518 Applied Biostatistics II. Purpose of Statistics. First Stage of Scientific Investigation. Further Stages of Scientific Investigation Biost 58 Applied Biostatistics II Scott S. Emerson, M.D., Ph.D. Professor of Biostatistics University of Washington Lecture 5: Review Purpose of Statistics Statistics is about science (Science in the broadest

More information

Introduction to Statistical Analysis

Introduction to Statistical Analysis Introduction to Statistical Analysis Changyu Shen Richard A. and Susan F. Smith Center for Outcomes Research in Cardiology Beth Israel Deaconess Medical Center Harvard Medical School Objectives Descriptive

More information

Prerequisite: STATS 7 or STATS 8 or AP90 or (STATS 120A and STATS 120B and STATS 120C). AP90 with a minimum score of 3

Prerequisite: STATS 7 or STATS 8 or AP90 or (STATS 120A and STATS 120B and STATS 120C). AP90 with a minimum score of 3 University of California, Irvine 2017-2018 1 Statistics (STATS) Courses STATS 5. Seminar in Data Science. 1 Unit. An introduction to the field of Data Science; intended for entering freshman and transfers.

More information

Generalized Linear Models (GLZ)

Generalized Linear Models (GLZ) Generalized Linear Models (GLZ) Generalized Linear Models (GLZ) are an extension of the linear modeling process that allows models to be fit to data that follow probability distributions other than the

More information

Generalized Linear Models

Generalized Linear Models Generalized Linear Models Advanced Methods for Data Analysis (36-402/36-608 Spring 2014 1 Generalized linear models 1.1 Introduction: two regressions So far we ve seen two canonical settings for regression.

More information


LOGISTIC REGRESSION Joseph M. Hilbe LOGISTIC REGRESSION Joseph M. Hilbe Arizona State University Logistic regression is the most common method used to model binary response data. When the response is binary, it typically takes the form of

More information

Outline of GLMs. Definitions

Outline of GLMs. Definitions Outline of GLMs Definitions This is a short outline of GLM details, adapted from the book Nonparametric Regression and Generalized Linear Models, by Green and Silverman. The responses Y i have density

More information

Semiparametric Regression

Semiparametric Regression Semiparametric Regression Patrick Breheny October 22 Patrick Breheny Survival Data Analysis (BIOS 7210) 1/23 Introduction Over the past few weeks, we ve introduced a variety of regression models under

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

ST3241 Categorical Data Analysis I Generalized Linear Models. Introduction and Some Examples

ST3241 Categorical Data Analysis I Generalized Linear Models. Introduction and Some Examples ST3241 Categorical Data Analysis I Generalized Linear Models Introduction and Some Examples 1 Introduction We have discussed methods for analyzing associations in two-way and three-way tables. Now we will

More information

Math 494: Mathematical Statistics

Math 494: Mathematical Statistics Math 494: Mathematical Statistics Instructor: Jimin Ding jmding@wustl.edu Department of Mathematics Washington University in St. Louis Class materials are available on course website (www.math.wustl.edu/

More information

Linear Regression Models P8111

Linear Regression Models P8111 Linear Regression Models P8111 Lecture 25 Jeff Goldsmith April 26, 2016 1 of 37 Today s Lecture Logistic regression / GLMs Model framework Interpretation Estimation 2 of 37 Linear regression Course started

More information

STAT 526 Spring Final Exam. Thursday May 5, 2011

STAT 526 Spring Final Exam. Thursday May 5, 2011 STAT 526 Spring 2011 Final Exam Thursday May 5, 2011 Time: 2 hours Name (please print): Show all your work and calculations. Partial credit will be given for work that is partially correct. Points will

More information

STA216: Generalized Linear Models. Lecture 1. Review and Introduction

STA216: Generalized Linear Models. Lecture 1. Review and Introduction STA216: Generalized Linear Models Lecture 1. Review and Introduction Let y 1,..., y n denote n independent observations on a response Treat y i as a realization of a random variable Y i In the general

More information

STA 216: GENERALIZED LINEAR MODELS. Lecture 1. Review and Introduction. Much of statistics is based on the assumption that random

STA 216: GENERALIZED LINEAR MODELS. Lecture 1. Review and Introduction. Much of statistics is based on the assumption that random STA 216: GENERALIZED LINEAR MODELS Lecture 1. Review and Introduction Much of statistics is based on the assumption that random variables are continuous & normally distributed. Normal linear regression

More information

Lecture 8. Poisson models for counts

Lecture 8. Poisson models for counts Lecture 8. Poisson models for counts Jesper Rydén Department of Mathematics, Uppsala University jesper.ryden@math.uu.se Statistical Risk Analysis Spring 2014 Absolute risks The failure intensity λ(t) describes

More information

STAT331. Cox s Proportional Hazards Model

STAT331. Cox s Proportional Hazards Model STAT331 Cox s Proportional Hazards Model In this unit we introduce Cox s proportional hazards (Cox s PH) model, give a heuristic development of the partial likelihood function, and discuss adaptations

More information

Today s Outline. Biostatistics Statistical Inference Lecture 01 Introduction to BIOSTAT602 Principles of Data Reduction

Today s Outline. Biostatistics Statistical Inference Lecture 01 Introduction to BIOSTAT602 Principles of Data Reduction Today s Outline Biostatistics 602 - Statistical Inference Lecture 01 Introduction to Principles of Hyun Min Kang Course Overview of January 10th, 2013 Hyun Min Kang Biostatistics 602 - Lecture 01 January

More information

Lecture 3. Truncation, length-bias and prevalence sampling

Lecture 3. Truncation, length-bias and prevalence sampling Lecture 3. Truncation, length-bias and prevalence sampling 3.1 Prevalent sampling Statistical techniques for truncated data have been integrated into survival analysis in last two decades. Truncation in

More information

Generalized Linear Models I

Generalized Linear Models I Statistics 203: Introduction to Regression and Analysis of Variance Generalized Linear Models I Jonathan Taylor - p. 1/16 Today s class Poisson regression. Residuals for diagnostics. Exponential families.

More information

Generalized Linear Models. Last time: Background & motivation for moving beyond linear

Generalized Linear Models. Last time: Background & motivation for moving beyond linear Generalized Linear Models Last time: Background & motivation for moving beyond linear regression - non-normal/non-linear cases, binary, categorical data Today s class: 1. Examples of count and ordered

More information

Econometrics I. Professor William Greene Stern School of Business Department of Economics 1-1/40. Part 1: Introduction

Econometrics I. Professor William Greene Stern School of Business Department of Economics 1-1/40. Part 1: Introduction Econometrics I Professor William Greene Stern School of Business Department of Economics 1-1/40 http://people.stern.nyu.edu/wgreene/econometrics/econometrics.htm 1-2/40 Overview: This is an intermediate

More information

Lecture 14: Introduction to Poisson Regression

Lecture 14: Introduction to Poisson Regression Lecture 14: Introduction to Poisson Regression Ani Manichaikul amanicha@jhsph.edu 8 May 2007 1 / 52 Overview Modelling counts Contingency tables Poisson regression models 2 / 52 Modelling counts I Why

More information

Modelling counts. Lecture 14: Introduction to Poisson Regression. Overview

Modelling counts. Lecture 14: Introduction to Poisson Regression. Overview Modelling counts I Lecture 14: Introduction to Poisson Regression Ani Manichaikul amanicha@jhsph.edu Why count data? Number of traffic accidents per day Mortality counts in a given neighborhood, per week

More information

Model Selection for Semiparametric Bayesian Models with Application to Overdispersion

Model Selection for Semiparametric Bayesian Models with Application to Overdispersion Proceedings 59th ISI World Statistics Congress, 25-30 August 2013, Hong Kong (Session CPS020) p.3863 Model Selection for Semiparametric Bayesian Models with Application to Overdispersion Jinfang Wang and

More information

Stat 642, Lecture notes for 04/12/05 96

Stat 642, Lecture notes for 04/12/05 96 Stat 642, Lecture notes for 04/12/05 96 Hosmer-Lemeshow Statistic The Hosmer-Lemeshow Statistic is another measure of lack of fit. Hosmer and Lemeshow recommend partitioning the observations into 10 equal

More information

MATH 450: Mathematical statistics

MATH 450: Mathematical statistics Departments of Mathematical Sciences University of Delaware August 28th, 2018 General information Classes: Tuesday & Thursday 9:30-10:45 am, Gore Hall 115 Office hours: Tuesday Wednesday 1-2:30 pm, Ewing

More information

STAT 305 Introduction to Statistical Inference

STAT 305 Introduction to Statistical Inference STAT 305 Introduction to Statistical Inference Pavel Krupskiy (Instructor) January April 2018 Course information Time and place: Mon, Wed, Fri, 13:00 14:00, LSK 201 Instructor: Pavel Krupskiy, ESB 3144,

More information

where F ( ) is the gamma function, y > 0, µ > 0, σ 2 > 0. (a) show that Y has an exponential family distribution of the form

where F ( ) is the gamma function, y > 0, µ > 0, σ 2 > 0. (a) show that Y has an exponential family distribution of the form Stat 579: General Instruction of Homework: All solutions should be rigorously explained. For problems using SAS or R, please attach code as part of your homework Assignment 1: Due Jan 30 Tuesday in class

More information

Review. Timothy Hanson. Department of Statistics, University of South Carolina. Stat 770: Categorical Data Analysis

Review. Timothy Hanson. Department of Statistics, University of South Carolina. Stat 770: Categorical Data Analysis Review Timothy Hanson Department of Statistics, University of South Carolina Stat 770: Categorical Data Analysis 1 / 22 Chapter 1: background Nominal, ordinal, interval data. Distributions: Poisson, binomial,

More information

Semiparametric Generalized Linear Models

Semiparametric Generalized Linear Models Semiparametric Generalized Linear Models North American Stata Users Group Meeting Chicago, Illinois Paul Rathouz Department of Health Studies University of Chicago prathouz@uchicago.edu Liping Gao MS Student

More information

High-Throughput Sequencing Course

High-Throughput Sequencing Course High-Throughput Sequencing Course DESeq Model for RNA-Seq Biostatistics and Bioinformatics Summer 2017 Outline Review: Standard linear regression model (e.g., to model gene expression as function of an

More information

Generalized Linear Models. Kurt Hornik

Generalized Linear Models. Kurt Hornik Generalized Linear Models Kurt Hornik Motivation Assuming normality, the linear model y = Xβ + e has y = β + ε, ε N(0, σ 2 ) such that y N(μ, σ 2 ), E(y ) = μ = β. Various generalizations, including general

More information

Discrete Multivariate Statistics

Discrete Multivariate Statistics Discrete Multivariate Statistics Univariate Discrete Random variables Let X be a discrete random variable which, in this module, will be assumed to take a finite number of t different values which are

More information

Improving Efficiency of Inferences in Randomized Clinical Trials Using Auxiliary Covariates

Improving Efficiency of Inferences in Randomized Clinical Trials Using Auxiliary Covariates Improving Efficiency of Inferences in Randomized Clinical Trials Using Auxiliary Covariates Anastasios (Butch) Tsiatis Department of Statistics North Carolina State University http://www.stat.ncsu.edu/

More information

Statistics Boot Camp. Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018

Statistics Boot Camp. Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018 Statistics Boot Camp Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018 March 21, 2018 Outline of boot camp Summarizing and simplifying data Point and interval estimation Foundations of statistical

More information

Generalized Linear Models for Non-Normal Data

Generalized Linear Models for Non-Normal Data Generalized Linear Models for Non-Normal Data Today s Class: 3 parts of a generalized model Models for binary outcomes Complications for generalized multivariate or multilevel models SPLH 861: Lecture

More information

β j = coefficient of x j in the model; β = ( β1, β2,

β j = coefficient of x j in the model; β = ( β1, β2, Regression Modeling of Survival Time Data Why regression models? Groups similar except for the treatment under study use the nonparametric methods discussed earlier. Groups differ in variables (covariates)

More information




