CHOICE OF REFERENCE SUBCLASS IN REGRESSION MODELS


Gilbert MacKenzie (1,2) & Defen Peng (2,3)
1 ENSAI, Rennes; 2 Centre of Biostatistics, University of Limerick, Ireland; 3 UBC, Vancouver, Canada.
CASI, Templepatrick, Northern Ireland, May 14-16, 2014

ENSAI Building. 2nd Int. BIO-SI W/S, Oct 6/7th, 2011.

Outline

This talk is about the choice of reference subclass in parametric regression models with categorical variables, mainly in observational studies.

- Introduction
- Linear Model Setting
- Precision & Multi-collinearity
- Extensions to GLMs
- Conclusions

Introduction

A quotation: "There is no statistical justification for choosing one reference category or another. The choice is usually made on subject matter grounds to make the interpretations easier and the choice can easily vary from data analyst to data analyst. So, the need for a reference category can complicate interpretations and the results..." R. Berk (2008).

We show that a judicious choice of reference subclass can improve certain properties of the regression model.

Secondary Criteria

Many model properties are invariant to the choice of reference subclass, so we need secondary criteria:

- Precision of estimates: total variance, $\hat{T}_r = \mathrm{tr}[V(\hat{\beta}_r)]$.
- Multicollinearity: condition number, $\hat{K}_r$.
- Logical considerations.

NB: The third can be evaluated in terms of the first two, so we are interested in the pair $(\hat{T}_r, \hat{K}_r)$. We illustrate using the linear model, since the talk is only 15 minutes.

Linear Model Setting

We consider the general linear model

$$Y = X\beta + \epsilon \qquad (1)$$

where $Y$ is a continuous response variable, $X$ is an $n \times p$ design matrix, $\beta$ is a $p \times 1$ column vector of regression parameters, $E(\epsilon) = 0$ and $E(\epsilon\epsilon^{T}) = \sigma^2 I_n$. We will also assume that $\epsilon_i \sim N(0, \sigma^2)$ when required, for $i = 1, \ldots, n$. It follows immediately that

$$\hat{\beta} = (X^{T}X)^{-1}X^{T}Y \qquad (2)$$

and that

$$V(\hat{\beta}) = \sigma^2 (X^{T}X)^{-1} \qquad (3)$$

which implies, under the Gaussian assumption, that the Fisher information matrix is

$$I(\beta) = (X^{T}X)/\sigma^2 \qquad (4)$$
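Equations (2) and (3) can be checked numerically. The following is a minimal sketch, not the authors' code: the toy data, the function name `ols_fit` and the use of the usual unbiased estimate of $\sigma^2$ are all assumptions made for illustration.

```python
import numpy as np

def ols_fit(X, y):
    """Return (beta_hat, sigma2_hat, cov_beta) for the model y = X beta + eps."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y           # equation (2)
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / (n - p)   # assumption: unbiased estimate of sigma^2
    cov_beta = sigma2_hat * XtX_inv        # equation (3) with sigma^2 estimated
    return beta_hat, sigma2_hat, cov_beta

# Hypothetical toy data: intercept plus one binary indicator.
X = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 1.0], [1.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 3.0, 4.0, 5.0, 6.0])
beta_hat, sigma2_hat, cov_beta = ols_fit(X, y)
print(beta_hat)   # intercept = mean of the reference group, slope = difference in means
```

For larger problems `np.linalg.lstsq` would usually be preferred to an explicit inverse; the inverse is used here only because the talk works directly with $(X^{T}X)^{-1}$.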

Form of Design Matrix

If the design matrix $X$ encodes a single categorical variable with $p = (k + 1)$ subclasses, $X^{T}X$ may take one of two main forms:

$$X^{T}X = \mathrm{diag}(n_1, n_2, \ldots, n_{k+1}) \qquad (5)$$

or

$$X^{T}X = \begin{pmatrix} n & n_1 & n_2 & \cdots & n_k \\ n_1 & n_1 & 0 & \cdots & 0 \\ n_2 & 0 & n_2 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ n_k & 0 & 0 & \cdots & n_k \end{pmatrix} \qquad (6)$$

In (5) we have included exactly $p = (k + 1)$ binary indicator variables, and in (6) we have included an intercept term and exactly $k$ binary indicator variables.
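The two codings can be built explicitly to see where forms (5) and (6) come from. This is an illustrative sketch; the labels, the allocation (2, 3, 5) and the helper names are hypothetical.

```python
import numpy as np

def design_cell_means(labels, levels):
    """One indicator column per category, no intercept: leads to form (5)."""
    return np.array([[1.0 if lab == lev else 0.0 for lev in levels] for lab in labels])

def design_reference(labels, levels, ref):
    """Intercept plus indicators for every category except `ref`: leads to form (6)."""
    others = [lev for lev in levels if lev != ref]
    indicators = design_cell_means(labels, others)
    return np.column_stack([np.ones(len(labels)), indicators])

labels = ["a"] * 2 + ["b"] * 3 + ["c"] * 5   # hypothetical allocation (2, 3, 5)
levels = ["a", "b", "c"]

X5 = design_cell_means(labels, levels)
X6 = design_reference(labels, levels, ref="c")
XtX_5 = X5.T @ X5   # diagonal matrix of subclass counts
XtX_6 = X6.T @ X6   # bordered matrix with n in the top-left corner
print(XtX_5)
print(XtX_6)
```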

Precision

Suppose we have a sample allocation $(n_1, n_2, \ldots, n_p)$, where at least one of the allocated numbers is different from the others. Let $r$ denote the reference category, which may be chosen freely from $(1, \ldots, p)$. Then

$$(X^{T}X)^{-1} = \frac{1}{n_r} \begin{pmatrix} 1 & -1 & -1 & \cdots & -1 \\ -1 & 1 + n_r/n_1 & 1 & \cdots & 1 \\ -1 & 1 & 1 + n_r/n_2 & \cdots & 1 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ -1 & 1 & 1 & \cdots & 1 + n_r/n_k \end{pmatrix} \qquad (7)$$

where $n_r = n - \sum_{j=1}^{k} n_j$ is the allocated number of the reference category.
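The closed form (7) can be compared with a numerical inverse of the bordered matrix (6). A small sketch under assumed counts; the function names are hypothetical.

```python
import numpy as np

def xtx_reference(n_other, n_ref):
    """X'X of form (6) for non-reference counts n_other and reference count n_ref."""
    n_other = np.asarray(n_other, dtype=float)
    k = n_other.size
    M = np.zeros((k + 1, k + 1))
    M[0, 0] = n_other.sum() + n_ref
    M[0, 1:] = n_other
    M[1:, 0] = n_other
    M[1:, 1:] = np.diag(n_other)
    return M

def xtx_inverse_closed_form(n_other, n_ref):
    """Closed-form inverse (7): ones, a -1 border, and n_ref/n_j added on the diagonal."""
    n_other = np.asarray(n_other, dtype=float)
    k = n_other.size
    A = np.ones((k + 1, k + 1))
    A[0, 1:] = -1.0
    A[1:, 0] = -1.0
    A[1:, 1:] += np.diag(n_ref / n_other)
    return A / n_ref

n_other, n_ref = [2, 3], 5   # hypothetical allocation
closed = xtx_inverse_closed_form(n_other, n_ref)
numeric = np.linalg.inv(xtx_reference(n_other, n_ref))
print(np.allclose(closed, numeric))
```

The diagonal of the closed form is $1/n_r$ for the intercept and $1/n_r + 1/n_j$ for the $j$th indicator, which is what drives the precision results that follow.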

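Example 1 - Binary Covariate

With $p = 2$ and categories $r$ and $s$, the design $X = (x_0, x_1)$ implies that category 2 is the reference. Taking $r$ and then $s$ as the reference gives

$$\hat{\beta}_r = \begin{pmatrix} \frac{1}{n_r}\sum_{i \in [r]} y_i \\ -\frac{1}{n_r}\sum_{i \in [r]} y_i + \frac{1}{n_s}\sum_{i \in [s]} y_i \end{pmatrix} \qquad (8)$$

$$\hat{\beta}_s = \begin{pmatrix} \frac{1}{n_s}\sum_{i \in [s]} y_i \\ -\frac{1}{n_s}\sum_{i \in [s]} y_i + \frac{1}{n_r}\sum_{i \in [r]} y_i \end{pmatrix} \qquad (9)$$

The intercepts differ, but the slopes differ only in sign, $\hat{\beta}_{1,r} = -\hat{\beta}_{1,s}$, so $|\hat{\beta}_{1,r}| = |\hat{\beta}_{1,s}|$. The diagonals of the $(2 \times 2)$ variance-covariance matrices are

$$\mathrm{diag}\, V(\hat{\beta}_r) = \left[ \frac{\sigma^2}{n_r},\; \sigma^2\left( \frac{1}{n_r} + \frac{1}{n_s} \right) \right], \qquad \mathrm{diag}\, V(\hat{\beta}_s) = \left[ \frac{\sigma^2}{n_s},\; \sigma^2\left( \frac{1}{n_s} + \frac{1}{n_r} \right) \right],$$

thus $\mathrm{Var}(\hat{\beta}_{1,r}) = \mathrm{Var}(\hat{\beta}_{1,s})$. Therefore, the precision of the regression coefficient is invariant to switching the reference category.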
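A quick numerical check of the binary case, fitting the same data twice with the reference switched. The responses are made up for illustration, and $\sigma^2$ is set to 1 so that the variances are just the diagonal of $(X^{T}X)^{-1}$.

```python
import numpy as np

y_r = np.array([1.0, 3.0])         # hypothetical responses in category r
y_s = np.array([4.0, 5.0, 6.0])    # hypothetical responses in category s

def fit_with_reference(y_ref, y_other):
    """OLS with an intercept and an indicator for the non-reference category."""
    y = np.concatenate([y_ref, y_other])
    x1 = np.concatenate([np.zeros(len(y_ref)), np.ones(len(y_other))])
    X = np.column_stack([np.ones(len(y)), x1])
    XtX_inv = np.linalg.inv(X.T @ X)
    return XtX_inv @ X.T @ y, np.diag(XtX_inv)

beta_r, v_r = fit_with_reference(y_r, y_s)   # category r as reference
beta_s, v_s = fit_with_reference(y_s, y_r)   # category s as reference
print(beta_r, v_r)
print(beta_s, v_s)
```

The slope changes sign but keeps its magnitude and its variance; only the intercept and its variance depend on which category is the reference.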

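Example 2 - Two binary covariates

With $p = 3$, $X = (x_0, x_1, x_2)$ implies that category 3 is the reference. First, the regression coefficients are different (not shown). Then the diagonals of the V-C matrices are

$$\mathrm{diag}\, V(\hat{\beta}_{r=3}) = \sigma^2 \left[ \frac{1}{n_3},\; \left( \frac{1}{n_3} + \frac{1}{n_1} \right),\; \left( \frac{1}{n_3} + \frac{1}{n_2} \right) \right],$$

$$\mathrm{diag}\, V(\hat{\beta}_{r=2}) = \sigma^2 \left[ \frac{1}{n_2},\; \left( \frac{1}{n_2} + \frac{1}{n_1} \right),\; \left( \frac{1}{n_2} + \frac{1}{n_3} \right) \right],$$

$$\mathrm{diag}\, V(\hat{\beta}_{r=1}) = \sigma^2 \left[ \frac{1}{n_1},\; \left( \frac{1}{n_1} + \frac{1}{n_2} \right),\; \left( \frac{1}{n_1} + \frac{1}{n_3} \right) \right].$$

So in LMs $\hat{T}_r$ is minimised when $n_r = n_{\max}$.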
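Summing the diagonals above gives the total variance for each choice of reference. The sketch below does this for an assumed allocation (10, 30, 60) with $\sigma^2 = 1$; the function name is hypothetical.

```python
import numpy as np

def total_variance(counts, ref, sigma2=1.0):
    """tr[V(beta_hat_r)] for one categorical covariate with reference index `ref`."""
    counts = np.asarray(counts, dtype=float)
    n_ref = counts[ref]
    others = np.delete(counts, ref)
    # Intercept variance plus one (1/n_ref + 1/n_j) term per non-reference category.
    return sigma2 * (1.0 / n_ref + np.sum(1.0 / n_ref + 1.0 / others))

counts = [10, 30, 60]   # hypothetical allocation
T = [total_variance(counts, r) for r in range(len(counts))]
best = int(np.argmin(T))
print(T, best)   # the largest subclass (index 2) gives the smallest total variance
```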

Multi-collinearity

We use the condition number, $\hat{K}_r$, to measure multi-collinearity. Belsley (2004) defines the condition number of a square matrix, $M$, as

$$K(M) = \lambda_{\max}/\lambda_{\min} = \nu_{\max}/\nu_{\min},$$

where $\lambda_{\max} = \max(\lambda_j)$, $\lambda_{\min} = \min(\lambda_j)$, the $\lambda_j$, $j = 1, 2, \ldots, p$, are the eigenvalues of $M$, and the $\nu$s are the singular values from the Singular Value Decomposition (SVD). The threshold values for $K(M = X^{T}X)$ are 10 and 30, indicating medium and serious degrees of multi-collinearity. We use $K_r$ to denote $K(M_r)$, where $M = X^{T}X$ and $r$ indicates reference subclass dependence.
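The definition translates directly into a few lines of NumPy. A minimal sketch, assuming a single categorical covariate coded as in form (6); the allocation and helper names are hypothetical.

```python
import numpy as np

def xtx_for_reference(counts, ref):
    """X'X of form (6) when category `ref` is the reference."""
    counts = np.asarray(counts, dtype=float)
    others = np.delete(counts, ref)
    k = others.size
    M = np.zeros((k + 1, k + 1))
    M[0, 0] = counts.sum()
    M[0, 1:] = others
    M[1:, 0] = others
    M[1:, 1:] = np.diag(others)
    return M

def condition_number(M):
    """K(M) = lambda_max / lambda_min for a symmetric positive definite M."""
    eig = np.linalg.eigvalsh(M)   # eigenvalues in ascending order
    return eig[-1] / eig[0]

counts = [10, 30, 60]   # hypothetical allocation
K = [condition_number(xtx_for_reference(counts, r)) for r in range(len(counts))]
print(K)
```

For a symmetric positive definite matrix the eigenvalue ratio and the singular-value ratio coincide, so `np.linalg.cond(M)` returns the same quantity.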

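Multi-collinearity in the LM: Binary Covariate

The eigenvalues $\lambda$ of $M = X^{T}X$ are the roots of the determinant

$$\det(X^{T}X - \lambda I) = \lambda^2 - (n + n_1)\lambda + n n_1 - n_1^2,$$

namely

$$\lambda_{\max} = \frac{n + n_1}{2} + \frac{1}{2}\sqrt{(n - n_1)^2 + 4 n_1^2}, \qquad \lambda_{\min} = \frac{n + n_1}{2} - \frac{1}{2}\sqrt{(n - n_1)^2 + 4 n_1^2},$$

where $I$ is the $2 \times 2$ identity matrix. The condition number is then

$$K_r(X^{T}X) = \frac{1 + \rho_1 + \sqrt{(1 - \rho_1)^2 + 4\rho_1^2}}{1 + \rho_1 - \sqrt{(1 - \rho_1)^2 + 4\rho_1^2}}, \qquad (10)$$

where $\rho_1 = n_1/n$.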
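Formula (10) can be checked against the eigenvalues of the $2 \times 2$ matrix directly. A short sketch with assumed values $n = 100$ and $n_1 = 30$ or $70$; the function names are hypothetical.

```python
import math
import numpy as np

def k_binary_closed_form(rho1):
    """Condition number (10) as a function of rho1 = n1 / n."""
    root = math.sqrt((1.0 - rho1) ** 2 + 4.0 * rho1 ** 2)
    return (1.0 + rho1 + root) / (1.0 + rho1 - root)

def k_binary_numeric(n, n1):
    """Eigenvalue ratio of X'X = [[n, n1], [n1, n1]]."""
    M = np.array([[n, n1], [n1, n1]], dtype=float)
    eig = np.linalg.eigvalsh(M)
    return eig[-1] / eig[0]

n = 100
k_small_indicator = k_binary_closed_form(30 / n)   # reference subclass has 70 subjects
k_large_indicator = k_binary_closed_form(70 / n)   # reference subclass has 30 subjects
print(k_small_indicator, k_large_indicator)
```

In this toy case the condition number is smaller when the larger subclass is the reference (roughly 5.9 against 11.7), in line with the choice $n_r = n_{\max}$.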

Relationship between $\hat{T}_r$ and $\hat{K}_r$

We have examined this in a variety of cases, analytically and via simulation, in LMs and GLMs, and the results are similar. The correlation between $\hat{T}_r$ and $\hat{K}_r$ is typically 0.95, showing a strong linear relationship. This means that minimising $\hat{T}_r$ also minimises $\hat{K}_r$. Thus the stability of the model is improved by selecting $n_r = n_{\max}$ in LMs. There is no loss of information in switching reference subclass, as contrasts of interest are invariant to this switch. In GLMs, minimising $\hat{T}_r$ is more complicated, but the principle is the same.

Lung Cancer Study

Survival study of lung cancer in NI (Wilkinson, 1992). A total of 855 incident cases were followed for 2 years; 50% were dead by 6 months. We are interested in who gets active treatment, and why. Some 51.5% received no active treatment! This leads to a standard MLF analysis with Y=1 for treatment, else Y=0. There are 5 covariates: WHO, Age, Cell type, Metastases and Albumen. See the example on the next slide.


Conclusions

- There is more to see than suggested by Berk.
- Maximising the precision minimises the multi-collinearity.
- Must be useful in sparse data situations with many categorical covariates.
- No loss of information on contrasts of interest.
- For LMs and GLMs (and beyond) results are similar.
- Overall we have created some useful tools. We hope their use will improve practice.

Acknowledgements

The work in this paper was supported by two Science Foundation Ireland (SFI) project grants. Professor MacKenzie was supported under the Mathematics Initiative, II, via the BIO-SI research programme in the Centre of Biostatistics, University of Limerick, Ireland: grant number 07/MI/012. Professor Peng is also supported via a Research Frontiers Programme award, grant number 05/RF/MAT 026.

References

ALTMAN, D. G. & ROYSTON, P. (2006). Statistics notes - The cost of dichotomising continuous variables. British Medical Journal, 332.

BELSLEY, D. A., KUH, E. & WELSCH, R. E. (2004). Regression diagnostics: Identifying influential data and sources of collinearity. John Wiley & Sons, First edition.

BERK, R. (2008). Statistical learning from a regression perspective. Springer, New York.

COHEN, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.

COX & SNELL (1989). Analysis of Binary Data. Chapman and Hall, Second edition.

CRAN.R-PROJECT (2009). R project. Retrieved 2010, from Package pwr.

ELWOOD, J. H., MACKENZIE, G. & CRAN, G. (1974). Observations on single births to women resident in Belfast: Part I - Factors associated with perinatal mortality. J. Chron. Dis, 27.

FELDSTEIN, M. S. (1966). A binary variable multiple regression method of analysing factors affecting Peri-Natal mortality and other outcomes of pregnancy. Journal of the Royal Statistical Society, A 129.

FRØSLIE, K. F., RØISLIEN, J., LAAKE, P., HENRIKSEN, T., QVIGSTAD, E. & VEIERØD, M. B. (2010). Categorisation of continuous exposure variables revisited. A response to the Hyperglycaemia and Adverse Pregnancy Outcome (HAPO) Study. BMC Medical Research Methodology, 10.

ISHAM (1991). Statistical theory and modelling, edited by D. V. Hinkley, N. Reid and E. J. Snell. Chapman and Hall.

MACKENZIE, G. & PENG, D. (2010). Properties of estimators in interval censored PH regression survival models. Submitted, Journal of the Royal Statistical Society, C.

NIJENHUIS, A. & WILF, H. S. (1978). Combinatorial Algorithms for Computers and Calculators. Academic Press, Second edition.

PENG, D. & MACKENZIE, G. (2014). Discrepancy and choice of reference subclass in categorical regression models. In: Statistical Modelling in Biostatistics and Bioinformatics. Springer, Munich, 260 pages.

POCOCK, S. J., COLLIER, T. J., DANDREO, K. J., DE STAVOLA, B. L., GOLDMAN, M. B., KALISH, L. A., LINDA, E. K. & VALERIE, A. M. (2004). Issues in the reporting of epidemiological studies: a survey of recent practice. British Medical Journal, 329.

RAO, C. R. & RAO, M. B. (1998). Matrix Algebra and its Applications to Statistics and Econometrics. World Scientific Publishing, Singapore, First edition.

SHAPIRO, S. S. (1980). How to test normality and other distributional assumptions. Statistical techniques, 3.

SMITH, O. K. (1961). Eigenvalues of a symmetric 3 × 3 matrix. Communications of the ACM, 4, 168.

WISSMANN, M., TOUTENBURG, H. & SHALABH (2007). Role of Categorical Variables in Multicollinearity in Linear Regression Model. Technical Report Number 008, Department of Statistics, University of Munich, Germany.

WILLIAM, G. J. (2005). Regression III: Advanced methods. Lecture notes. Department of Political Science, Michigan State University, America.

Minimizing the Total Variance

Proof. Let $n_r = \max(n_1, \ldots, n_p)$, and let $s \in \{1, 2, \ldots, p\}$ $(s \neq r)$ be another choice of reference category, where $n_r > n_s$. Then, from (17), the corresponding total variances are

$$V_r = 1/n_r + (1/n_r + 1/n_s) + \sum_{j \neq r,s}^{p} (1/n_r + 1/n_j)$$

and

$$V_s = 1/n_s + (1/n_s + 1/n_r) + \sum_{j \neq r,s}^{p} (1/n_s + 1/n_j).$$

Since $1/n_r < 1/n_s$, we have $V_r < V_s$, i.e., choosing $n_r = n_{\max}$ minimizes the total variance.
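The final inequality can be made explicit by collecting terms. This is a worked step added for illustration, using the same expressions for $V_r$ and $V_s$ (with the common factor $\sigma^2$ omitted, as in the proof) and assuming $p \geq 2$:

```latex
\begin{align*}
V_r &= \frac{p}{n_r} + \frac{1}{n_s} + \sum_{j \neq r,s} \frac{1}{n_j}, \\
V_s &= \frac{p}{n_s} + \frac{1}{n_r} + \sum_{j \neq r,s} \frac{1}{n_j}, \\
V_r - V_s &= (p - 1)\left( \frac{1}{n_r} - \frac{1}{n_s} \right) < 0
\quad \text{whenever } n_r > n_s .
\end{align*}
```

The sum over the remaining categories is the same in both expressions, so only the reference count and the swapped count contribute to the difference.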

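Canonical GLMs

Consider a canonical GLM for independent responses $Y_i$ with $E(Y_i) = \mu_i = g(\theta_i)$, where

$$\theta_i = \sum_{u=0}^{k} x_{ui}\beta_u$$

is the linear predictor and $\beta_u$, $u = 0, \ldots, k$, represents the $p$ regression parameters. Then the observed information matrix for $\beta$ is

$$I_o(\beta) = (\partial_{\beta}\theta^{T})(\partial_{\theta}\partial_{\theta}K)(\partial_{\beta}\theta^{T})^{T} = (\partial_{\beta}\mu^{T})(\partial_{\theta}\partial_{\theta}K)^{-1}(\partial_{\beta}\mu^{T})^{T}. \qquad (11)$$

When $\beta_0$ is the intercept, we can re-express this as the $(p \times p)$ matrix

$$I_o(\beta_0, \beta_c) = (X^{T}WX) = \begin{pmatrix} \sum_i w_i & \sum_i x_{ci}^{T} w_i \\ \sum_i x_{ci} w_i & \sum_i x_{ci} x_{ci}^{T} w_i \end{pmatrix}, \qquad (12)$$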
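The weighted cross-product in (12) is straightforward to compute once the weights are known. A minimal sketch, assuming a logistic model evaluated at a hypothetical parameter value $\beta = 0$, where every weight equals 0.25; the design and function name are illustrative only.

```python
import numpy as np

def information_matrix(X, w):
    """X' W X with W = diag(w), as in expression (12)."""
    return X.T @ (w[:, None] * X)

# Hypothetical design: intercept plus one binary indicator.
X = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 1.0], [1.0, 1.0], [1.0, 1.0]])
beta = np.array([0.0, 0.0])                    # assumed parameter value
eta = X @ beta                                 # linear predictor
w = np.exp(eta) / (1.0 + np.exp(eta)) ** 2     # logistic weights, 0.25 at eta = 0
I_o = information_matrix(X, w)
print(I_o)
```

The top-left entry is $\sum_i w_i$ and the border holds $\sum_i x_{ci} w_i$, matching the block layout of (12).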

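Structural Weights

Table 2: Structural weights

| Distribution | Density (mass) function | Link function | $w_i$ |
| --- | --- | --- | --- |
| Normal | $f(y; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(y-\mu)^2}{2\sigma^2}}$ | $x\beta = \mu = \theta$ | $\sigma^{-2}$ |
| Exponential | $f(y; \lambda) = \lambda e^{-\lambda y}$ | $x\beta = \mu^{-1} = \theta$ | $(x_i\beta)^{-2}$ |
| IG | $f(y; \mu, \lambda) = \left(\frac{\lambda}{2\pi y^3}\right)^{1/2} e^{-\frac{\lambda(y-\mu)^2}{2\mu^2 y}}$ | $x\beta = \mu^{-2} = \theta$ | $\frac{\lambda}{4}(x_i\beta)^{-3/2}$ |
| Poisson | $f(y; \lambda) = \frac{\lambda^{y}}{y!} e^{-\lambda}$ | $x\beta = \log(\mu) = \theta$ | $\exp(x_i\beta)$ |
| Binomial | $f(y; n, p) = \binom{n}{y} p^{y}(1-p)^{n-y}$ | $x\beta = \log\left(\frac{\mu}{1-\mu}\right) = \theta$ | $\frac{\exp(x_i\beta)}{(1+\exp(x_i\beta))^2}$ |
| Geometric | $f(y; p) = (1-p)^{y-1} p$ | $x\beta = \log\left(\frac{\mu}{1-\mu}\right) = \theta$ | $\frac{1}{1+\exp(x_i\beta)}$ |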
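Some of these weights are easy to express as functions of the linear predictor $\eta_i = x_i\beta$. The sketch below covers only the Normal, Exponential, Poisson and Binomial rows; the function name and its arguments are hypothetical, and it is meant as a reading aid rather than as the authors' implementation.

```python
import numpy as np

def structural_weight(family, eta, sigma2=1.0):
    """Weight w_i as a function of the linear predictor eta = x_i beta."""
    eta = np.asarray(eta, dtype=float)
    if family == "normal":
        return np.full_like(eta, 1.0 / sigma2)           # constant weight 1 / sigma^2
    if family == "exponential":
        return eta ** -2.0                               # (x_i beta)^(-2)
    if family == "poisson":
        return np.exp(eta)                               # exp(x_i beta)
    if family == "binomial":
        return np.exp(eta) / (1.0 + np.exp(eta)) ** 2    # logistic weight
    raise ValueError(f"unsupported family: {family}")

eta = np.array([0.0])
print(structural_weight("poisson", eta), structural_weight("binomial", eta))
```

Only the Normal weight is free of $\beta$, which is why the GLM results depend on the fitted intercept as well as on the allocation.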

Extension to Canonical GLMs

For a single categorical variate across GLMs we can:

- Show that the optimal choice depends on $n_r\varphi(\hat{\beta}_0)$.
- Show that we should choose the subclass where $n_r\varphi(\hat{\beta}_0)$ is maximal.
- Show that choosing $n_r = n_{\max}$ is usually good.
- Show that when the observed allocation is uniform ($n_1 = n_2 = \cdots = n_p$) or near uniform, the choice of reference subclass does not matter.
- Show there is an index to tell you when you need to worry about lack of uniformity.

These results extend to GLMs with multiple categorical covariates.
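One way to read the criterion $n_r\varphi(\hat{\beta}_0)$ is sketched below for a logistic model with a single categorical covariate. The assumptions are mine, not the slide's: $\varphi$ is taken to be the Binomial structural weight, and $\hat{\beta}_0$ under reference $r$ is taken to be the logit of the observed proportion in subclass $r$ (which holds for the saturated one-factor model), so that $n_r\varphi(\hat{\beta}_0) = n_r\hat{p}_r(1 - \hat{p}_r)$. The counts are invented.

```python
import numpy as np

def reference_scores(counts, successes):
    """n_r * p_r * (1 - p_r) for each candidate reference subclass r (assumed reading)."""
    counts = np.asarray(counts, dtype=float)
    p_hat = np.asarray(successes, dtype=float) / counts
    return counts * p_hat * (1.0 - p_hat)

counts = [10, 30, 60]     # hypothetical allocation
successes = [5, 15, 3]    # hypothetical numbers with Y = 1 in each subclass
scores = reference_scores(counts, successes)
best = int(np.argmax(scores))       # subclass maximising the criterion
largest = int(np.argmax(counts))    # subclass with n_max
print(scores, best, largest)
```

In this constructed example the largest subclass has an extreme proportion, so the criterion prefers the middle subclass; with less extreme proportions the two choices coincide, consistent with $n_r = n_{\max}$ being "usually good".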

Contrasts of Interest

Generally, such contrasts are conducted among the $k$ regression coefficients. For a contrast $Z = C^{T}\hat{\beta}_r$ we have

$$V(Z) = C^{T} V(\hat{\beta}_r)\, C,$$

where $c_0 = 0$ and $c^{T}\mathbf{1} = \sum_{j=1}^{k} c_j = 0$, so that

$$V(Z) = \sigma^2 \sum_{j=1}^{k} c_j^2 / n_j.$$

Then $Z$ does not depend on $\beta_0$ and accordingly such contrasts are invariant to the choice of reference subclass.
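The invariance can be illustrated numerically in the linear model. The sketch below compares subclasses 1 and 2 of a hypothetical allocation (2, 3, 5) with $\sigma^2 = 1$, first with subclass 3 as reference and then with subclass 1 as reference; the helper name is an assumption.

```python
import numpy as np

def xtx_inverse(counts, ref):
    """Inverse of X'X (form (6)) with category `ref` as reference, plus the other counts."""
    counts = np.asarray(counts, dtype=float)
    others = np.delete(counts, ref)
    k = others.size
    M = np.zeros((k + 1, k + 1))
    M[0, 0] = counts.sum()
    M[0, 1:] = others
    M[1:, 0] = others
    M[1:, 1:] = np.diag(others)
    return np.linalg.inv(M), others

counts = [2, 3, 5]   # hypothetical allocation

# Reference = subclass 3: coefficients are (b0, b1, b2); contrast b1 - b2.
V_c, others_c = xtx_inverse(counts, ref=2)
C = np.array([0.0, 1.0, -1.0])
var_contrast = C @ V_c @ C
var_formula = np.sum(C[1:] ** 2 / others_c)   # sum of c_j^2 / n_j

# Reference = subclass 1: coefficients are (b0, b2, b3); the same comparison is -b2.
V_a, _ = xtx_inverse(counts, ref=0)
C_a = np.array([0.0, -1.0, 0.0])
var_contrast_a = C_a @ V_a @ C_a
print(var_contrast, var_formula, var_contrast_a)
```

All three quantities equal $1/n_1 + 1/n_2$, so the precision of the comparison does not depend on which subclass serves as the reference.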

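Generalization of V-C matrix

The generalised variance-covariance matrix for GLMs is

$$I^{-1}(\beta_0, \beta_c) = \frac{1}{n_r\,\varphi(\beta_0)} \begin{pmatrix} 1 & -1 & -1 & \cdots & -1 \\ -1 & 1 + \frac{\tau_1 n_r}{n_1} & 1 & \cdots & 1 \\ -1 & 1 & 1 + \frac{\tau_2 n_r}{n_2} & \cdots & 1 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ -1 & 1 & 1 & \cdots & 1 + \frac{\tau_k n_r}{n_k} \end{pmatrix}, \qquad (13)$$

where $i \in [j]$ means that subject $i$ belongs to the $j$th category, whence $x_{ji} = 1$ for $i$ in the $j$th category; $\tau_j = \varphi(\beta_0)/\varphi(\beta_0 + \beta_j)$; and $n_r$ and $n_j$ are the allocated numbers in the reference subclass and the $j$th subclass respectively, $j = 1, 2, \ldots, k$. This matrix structure recurs in other settings (MacKenzie & Peng, 2013; Peng & MacKenzie, 2014).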
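The structure of (13) can be checked against a direct inverse of the information matrix. A sketch under stated assumptions: $\varphi$ is taken to be the logistic weight, the information matrix is built from $n_r\varphi(\beta_0)$ and $n_j\varphi(\beta_0 + \beta_j)$, and the counts and parameter values are invented for illustration.

```python
import numpy as np

def phi_logistic(eta):
    """Assumed structural weight: the logistic (Binomial) case."""
    return np.exp(eta) / (1.0 + np.exp(eta)) ** 2

def information(n_other, n_ref, beta0, beta_c, phi):
    """X'WX for one categorical covariate, with subclass-specific weights."""
    a = np.asarray(n_other, dtype=float) * phi(beta0 + np.asarray(beta_c, dtype=float))
    a_ref = n_ref * phi(beta0)
    k = a.size
    I = np.zeros((k + 1, k + 1))
    I[0, 0] = a_ref + a.sum()
    I[0, 1:] = a
    I[1:, 0] = a
    I[1:, 1:] = np.diag(a)
    return I

def inverse_closed_form(n_other, n_ref, beta0, beta_c, phi):
    """Closed form (13) with tau_j = phi(beta0) / phi(beta0 + beta_j)."""
    n_other = np.asarray(n_other, dtype=float)
    tau = phi(beta0) / phi(beta0 + np.asarray(beta_c, dtype=float))
    k = n_other.size
    A = np.ones((k + 1, k + 1))
    A[0, 1:] = -1.0
    A[1:, 0] = -1.0
    A[1:, 1:] += np.diag(tau * n_ref / n_other)
    return A / (n_ref * phi(beta0))

args = ([20, 30], 50, -0.5, [0.4, 1.0], phi_logistic)   # hypothetical values
closed = inverse_closed_form(*args)
numeric = np.linalg.inv(information(*args))
print(np.allclose(closed, numeric))
```

Setting $\varphi \equiv 1/\sigma^2$ makes every $\tau_j = 1$ and recovers the linear-model inverse (7) scaled by $\sigma^2$.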
