Standard Errors & Confidence Intervals

$\hat\beta - \beta \overset{\text{asy}}{\sim} N(0, I(\hat\beta)^{-1})$, where $I(\hat\beta) = -\left[ \dfrac{\partial^2 l(\beta, \phi; y)}{\partial \beta_i\, \partial \beta_j} \right]_{\beta = \hat\beta}$.

We can obtain asymptotic $100(1-\alpha)\%$ confidence intervals for $\beta_j$ using $\hat\beta_j \pm Z_{1-\alpha/2}\, \text{se}(\hat\beta_j)$, i.e., $\hat\beta_j \pm 1.96\, \text{se}(\hat\beta_j)$ for $\alpha = 0.05$, where $Z_p$ denotes the $p$th quantile of the N(0, 1) distribution.
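
As a quick illustration, a minimal sketch computing these Wald intervals from a fitted GLM in R; the data frame dat and the variables y, x1, x2 are hypothetical:

    ## Wald confidence intervals from a fitted GLM (hypothetical data `dat`)
    fit <- glm(y ~ x1 + x2, family = binomial, data = dat)
    beta.hat <- coef(fit)                 # MLEs
    se <- sqrt(diag(vcov(fit)))           # sqrt of diag of I(beta.hat)^{-1}
    z <- qnorm(1 - 0.05 / 2)              # Z_{1-alpha/2} = 1.96 for alpha = 0.05
    cbind(lower = beta.hat - z * se, upper = beta.hat + z * se)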

Profile Likelihood Confidence Intervals

Express the log-likelihood as $l(\alpha, \beta)$, where $\beta$ is the parameter of interest and $\alpha$ contains nuisance parameters. Maximize the log-likelihood with respect to $\alpha$ for a range of values of $\beta$ to give the profile log-likelihood function $l(\hat\alpha_\beta, \beta)$, where $\hat\alpha_\beta$ is the maximum likelihood estimate of $\alpha$ for a given $\beta$. A confidence region for $\beta$ is formed from those values of $\beta$ giving $l(\hat\alpha_\beta, \beta)$ sufficiently close to the overall maximum. For a 95% interval, the allowable difference from the maximum is $\frac{1}{2}\chi^2_{1,0.95} = \frac{1}{2}(3.841) = 1.920$.
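
In R, confint() applied to a glm fit computes profile likelihood intervals (the profiling method comes from the MASS package); below is also a manual sketch of the profiling idea for a single coefficient of the hypothetical fit above:

    ## Profile likelihood CI for the coefficient of x1 (R)
    confint(fit, level = 0.95)            # profiles the likelihood
    ## manual version: fix beta1 on a grid via an offset,
    ## maximize over the remaining parameters
    beta1.grid <- seq(-1, 3, length.out = 200)   # hypothetical grid
    prof <- sapply(beta1.grid, function(b1) {
      f <- glm(y ~ x2 + offset(b1 * x1), family = binomial, data = dat)
      as.numeric(logLik(f))               # l(alpha.hat_beta, beta)
    })
    ok <- prof >= max(prof) - 0.5 * qchisq(0.95, 1)   # within 1.920 of max
    range(beta1.grid[ok])                 # approximate 95% interval for beta1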

Model Selection and Goodness of Fit

In most data analyses there is uncertainty about the model, and some form of model comparison is needed:

1. Select q out of p predictors to form a parsimonious model
2. Select the link function (e.g., logit or probit)
3. Select the distributional form (e.g., normal or t)

We focus first on problem 1, variable selection.

Frequentist Strategies for Variable Selection

It is standard practice to sequentially add or drop variables from a model, one at a time, and examine the change in model fit. If model fit does not improve much (or significantly, at some arbitrary level) when adding a predictor, then that predictor is left out of the model (forward selection). Similarly, if model fit does not decrease much when removing a predictor, then that predictor is removed (backward elimination). Implementation: the step() function in S-PLUS (also in R); see the sketch below.
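
A minimal sketch of both directions using step(), which selects by AIC; the data frame dat and the predictors x1, x2, x3 are hypothetical:

    ## Stepwise selection by AIC with step() (R / S-PLUS)
    full <- glm(y ~ x1 + x2 + x3, family = binomial, data = dat)
    back <- step(full, direction = "backward")     # backward elimination
    null <- glm(y ~ 1, family = binomial, data = dat)
    fwd  <- step(null, scope = ~ x1 + x2 + x3,     # forward selection
                 direction = "forward")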

Some Issues with Stepwise Procedures: Normal Linear & Orthogonal Case

In normal linear models with orthogonal X, forward and backward selection will yield the same model (i.e., the selection process is not order-dependent). However, the choice of significance level for inclusion in the model is arbitrary and can have a large impact on the final model selected. Potentially, one can instead use a goodness of fit criterion (e.g., AIC, BIC). In addition, if interest focuses on inference for the β's, stepwise procedures can result in biased estimates and invalid hypothesis tests (i.e., if one naively uses the final selected model without correcting for the selection process).

Issues with Stepwise Procedures (General Case)

For GLMs other than orthogonal, linear Gaussian models, the order in which parameters are added to or dropped from the model can have a large impact on the final model selected. This is not good, since the order of selection is typically (if not always) chosen arbitrarily. Model selection is a challenging area, but there are alternatives.

Goodness of Fit Criteria

A number of criteria have been proposed for comparing models based on a measure of goodness of fit penalized by model complexity:

1. Akaike's Information Criterion (AIC): $AIC_M = D_M + 2p_M$, where $D_M$ is the deviance for model M and $p_M$ is the number of parameters in model M
2. Bayesian Information Criterion (BIC): $BIC_M = D_M + p_M \log(n)$

Note that the deviance decreases (and the likelihood increases) as variables are added. The AIC and BIC differ in the penalty for model complexity: the AIC uses twice the number of parameters, while the BIC uses the number of parameters multiplied by the logarithm of the sample size. The BIC tends to place a larger penalty on the number of parameters and hence selects more parsimonious models.
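
A sketch computing these deviance-based criteria for the hypothetical fits above; note that R's built-in AIC() and BIC() use $-2$ times the log-likelihood rather than the deviance, which differs from the definitions here only by a constant common to all models fit to the same data:

    ## Deviance-based AIC and BIC as defined above (R)
    ic <- function(fit, n) {
      p <- length(coef(fit))               # number of parameters
      c(AIC = deviance(fit) + 2 * p,
        BIC = deviance(fit) + p * log(n))
    }
    n <- nrow(dat)                         # sample size (hypothetical data)
    rbind(full = ic(full, n), reduced = ic(back, n))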

Probability of Selecting a Model

Posterior probability of model j, for j ∈ {1, ..., J}:

$$\Pr(M = j \mid y, X) = \frac{\exp(-BIC_j/2)}{\sum_{k=1}^{J} \exp(-BIC_k/2)}$$
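
A minimal sketch of these BIC-based model probabilities in R, for a few hypothetical candidate models; subtracting the minimum BIC before exponentiating avoids numerical underflow and leaves the probabilities unchanged:

    ## BIC-based posterior model probabilities (R)
    models <- list(m1 = glm(y ~ x1,      family = binomial, data = dat),
                   m2 = glm(y ~ x2,      family = binomial, data = dat),
                   m3 = glm(y ~ x1 + x2, family = binomial, data = dat))
    bic <- sapply(models, function(f)
      deviance(f) + length(coef(f)) * log(nrow(dat)))
    w <- exp(-(bic - min(bic)) / 2)        # shift by min(BIC) for stability
    w / sum(w)                             # Pr(M = j | y, X)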

Latent Variable Models for Binary Data

Suppose that, for a given vector of explanatory variables x, the latent variable U has a continuous cumulative distribution function $F(u; x)$, and that the binary response Y = 1 is recorded if and only if U > 0:

$$\theta = \Pr(Y = 1 \mid x) = 1 - F(0; x).$$

Since U is not directly observed, there is no loss of generality in taking the critical (i.e., cutoff) point to be 0. In addition, we can take the standard deviation of U (or some other measure of dispersion) to be 1, without loss of generality.

Probit Models

For example, if $U \sim N(x_i'\beta, 1)$, it follows that $\theta_i = \Pr(Y = 1 \mid x_i) = \Phi(x_i'\beta)$, where $\Phi(\cdot)$ is the standard normal cumulative distribution function

$$\Phi(t) = (2\pi)^{-1/2} \int_{-\infty}^{t} \exp(-\tfrac{1}{2} z^2)\, dz.$$

The relation is linearized by the inverse normal transformation $\Phi^{-1}(\theta_i) = x_i'\beta = \sum_{j=1}^{p} x_{ij}\beta_j$.
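
In R / S-PLUS, fitting a probit model amounts to choosing the probit link in glm(); the data are hypothetical:

    ## Probit regression: Pr(Y = 1 | x) = Phi(x'beta)
    probit.fit <- glm(y ~ x1 + x2, family = binomial(link = "probit"),
                      data = dat)
    summary(probit.fit)
    head(fitted(probit.fit))   # theta.hat_i = Phi(x_i' beta.hat)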

We have regarded the cutoff value of U as fixed, with the mean of U changing with x. Alternatively, one could assume that the distribution of U is fixed and allow the critical value to vary with x (e.g., dose). In toxicology studies where dose is the explanatory variable, it makes sense to let V denote the minimum dose needed to produce a response (i.e., the tolerance).

Under this second formulation, $y_i = 1$ if $x_i'\beta > v_i$. It follows that $\Pr(Y = 1 \mid x_i) = \Pr(V \leq x_i'\beta)$. Note that the shape of the dose-response curve is determined by the distribution function of V. If $V \sim N(0, 1)$, then $\Pr(Y = 1 \mid x_i) = \Phi(x_i'\beta)$, and it follows that the U and V formulations are equivalent. The U formulation is more common.
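
A small simulation sketch of the latent-variable view: drawing U from the normal linear model and thresholding at 0 recovers the probit coefficients (the true parameter values here are hypothetical):

    ## Latent-variable simulation: U > 0 is equivalent to a probit model (R)
    set.seed(1)
    n <- 1000
    x <- rnorm(n)
    beta <- c(-0.5, 1.2)                  # hypothetical true (intercept, slope)
    u <- beta[1] + beta[2] * x + rnorm(n) # latent U ~ N(x'beta, 1)
    y <- as.numeric(u > 0)                # observed binary response
    coef(glm(y ~ x, family = binomial(link = "probit")))  # approx (-0.5, 1.2)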

Latent Utilities & Choice Models

Suppose that Fred is choosing between two brands of a product (say, Ben & Jerry's or Häagen-Dazs). Fred has a utility for Ben & Jerry's (denoted by $Z_{i1}$) and a utility for Häagen-Dazs (denoted by $Z_{i2}$). Letting the difference in utilities be represented by the normal linear model, we have

$$U_i = Z_{i1} - Z_{i2} = x_i'\beta + \epsilon_i, \quad \epsilon_i \sim N(0, 1).$$

If Fred has a higher utility for Ben & Jerry's, then $Z_{i1} > Z_{i2}$, so $U_i > 0$ and Fred will choose Ben & Jerry's ($Y_i = 1$).

This latent utility formulation is again equivalent to a probit model for the binary response. The generalization to a multinomial response is straightforward: introduce k latent utilities instead of 2, and let an individual's response (i.e., choice) correspond to the category with the maximum utility. Although the probit model is preferred in bioassay and social science applications, the logistic model is preferred in the biomedical sciences. Of course, the choice of distribution function for U (and hence the choice of link in the binary response GLM) should be motivated by model fit.

Logistic Regression

The normal form is only one possibility for the distribution of U. Another is the logistic distribution with location $x_i'\beta$ and unit scale, which has cumulative distribution function

$$F(u) = \frac{\exp(u - x_i'\beta)}{1 + \exp(u - x_i'\beta)},$$

so that $F(0; x_i) = 1/\{1 + \exp(x_i'\beta)\}$. It follows that $\Pr(Y = 1 \mid x_i) = \Pr(U > 0 \mid x_i) = 1 - F(0; x_i) = 1/\{1 + \exp(-x_i'\beta)\}$. To linearize this relation, we take the logit transformation of both sides: $\log\{\theta_i/(1 - \theta_i)\} = x_i'\beta$.
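
The logit link is the binomial default in glm(), so fitting this model needs no link argument; a quick comparison with the hypothetical probit fit above (logit coefficients are typically larger in magnitude by roughly $\pi/\sqrt{3} \approx 1.8$, the standard deviation of the standard logistic):

    ## Logistic regression: logit link is the binomial default (R / S-PLUS)
    logit.fit <- glm(y ~ x1 + x2, family = binomial, data = dat)
    cbind(logit = coef(logit.fit), probit = coef(probit.fit))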

Homework Exercise: For $x_i = (1, x)'$ and $\beta_2 > 0$, reformulate the logistic regression in terms of a threshold model (i.e., the V formulation of the probit model described above). Derive the probability density function (pdf) obtained by differentiating $\Pr(Y = 1 \mid x_i)$ with respect to x. Reparameterize in terms of $\tau = 1/\beta_2$ and $\mu = -\beta_1/\beta_2$. Plot this pdf for $\mu = 0$ and $\pi\tau/\sqrt{3} = 1$ along with the N(0, 1) pdf in S-PLUS. Which density has the fatter tails? Is the pdf for x in the logistic case in the exponential family?

Some Generalizations of the Logistic Model

The logistic regression model assumes a restricted dose-response shape, and it is possible to generalize the model to relax this restriction. Aranda-Ordaz (1981) proposed two families of linearizing transformations, which can easily be inverted and which span a range of forms. The first, which is restricted to symmetric cases (i.e., invariant to interchanging success & failure), is

$$\frac{2}{\nu} \cdot \frac{\theta^\nu - (1-\theta)^\nu}{\theta^\nu + (1-\theta)^\nu}.$$

In the limit as ν → 0 this is the logistic (logit) transformation, and for ν = 1 it is linear. The second family is $\log[\{(1-\theta)^{-\nu} - 1\}/\nu]$, which reduces to the extreme value model as ν → 0 and to the logistic when ν = 1.

When there is doubt about the transformation, a formal approach is to use one or the other of the above families and fit the resulting model for a range of possible values of ν. A profile likelihood for ν can be obtained by plotting the maximized likelihood against ν. Potentially, one could choose a standard form, such as the logistic, if the corresponding value of ν falls within the 95% profile likelihood confidence region.
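
A sketch of this profiling for the second (asymmetric) Aranda-Ordaz family, written as a custom link object for glm() in R; the grid and data are hypothetical, and fits at extreme values of ν may need starting values:

    ## Profile likelihood over nu, Aranda-Ordaz asymmetric family (R)
    ## eta = log{((1-theta)^(-nu) - 1)/nu},
    ## theta = 1 - (1 + nu*exp(eta))^(-1/nu)
    ao.link <- function(nu) {
      structure(list(
        linkfun  = function(mu)  log(((1 - mu)^(-nu) - 1) / nu),
        linkinv  = function(eta) 1 - (1 + nu * exp(eta))^(-1 / nu),
        mu.eta   = function(eta) exp(eta) * (1 + nu * exp(eta))^(-1 / nu - 1),
        valideta = function(eta) TRUE,
        name     = paste("aranda-ordaz(", nu, ")", sep = "")
      ), class = "link-glm")
    }
    nu.grid <- seq(0.1, 2, by = 0.1)      # hypothetical grid of nu values
    prof <- sapply(nu.grid, function(nu) {
      f <- glm(y ~ x1 + x2, family = binomial(ao.link(nu)), data = dat)
      as.numeric(logLik(f))
    })
    plot(nu.grid, prof, type = "l",
         xlab = "nu", ylab = "profile log-likelihood")
    ## nu values with prof >= max(prof) - qchisq(0.95, 1)/2
    ## form an approximate 95% confidence region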

Another possibility for generalizing the logistic regression model is to allow the effects of one or more of the predictors to be non-linear: $\text{logit}\, \Pr(Y = 1 \mid x_i) = \alpha + s(x_i)\beta$, where $s(\cdot)$ is an unknown function. This type of model is referred to as a generalized additive model, and we will discuss it in more detail next class, motivating the approach using data from Longnecker et al. (2001).
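
As a preview, a minimal sketch of fitting such a smooth term; gam() here is taken from the mgcv package in R (S-PLUS had its own gam() with similar syntax), and the data are hypothetical:

    ## Generalized additive model preview (R, mgcv package)
    library(mgcv)
    gam.fit <- gam(y ~ s(x1), family = binomial, data = dat)
    plot(gam.fit)   # estimated smooth effect of x1 on the logit scale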