Entropy Coefficient of Determination and Its Application


Nobuoki Eshima
Department of Biostatistics, Faculty of Medicine, Oita University, Oita, Japan.

The objective of this seminar is to introduce the entropy coefficient of determination (ECD) for measuring the explanatory or predictive power of GLMs and to consider how ECD is used in data analysis. In the first section, the classical regression and GLM frameworks are compared, and properties of GLMs concerning entropy are discussed. First, the information of an event and the entropy of a random variable are explained, and the Kullback-Leibler information, which describes the difference between two distributions, is treated. Second, the log odds ratio and the mean in the GLM are considered from the viewpoint of entropy. ECD is interpreted as the ratio of variation of a response variable explained by the explanatory variables, and is compared with some other explanatory power measures with respect to the following properties: (i) interpretability; (ii) being the multiple correlation coefficient or the coefficient of determination in normal linear regression models; (iii) entropy-based property; (iv) applicability

to all GLMs; in addition to these, it may be appropriate for a measure to have the following property: (v) monotonicity in the complexity of the linear predictor. In Section 2, first the asymptotic properties of the maximum likelihood estimator (MLE) of ECD are discussed. The confidence interval of ECD is considered on the basis of an approximate normality of the non-central chi-square distribution. Second, in canonical link GLMs the contributions of factors are treated according to the decomposition of ECD. Numerical examples are also given.

1 Coefficient of determination for generalized linear models

1.1 Information Theory

Let $X$ be a categorical variable with categories $\{C_1, C_2, \ldots, C_K\}$; in what follows the categories are formally written as $\{1, 2, \ldots, K\}$. Then, the information of $X = k$ is defined by

$$I(X = k) = \log \frac{1}{\Pr(X = k)}. \tag{1.1}$$

In the above information, the base of the logarithm is $e$ and the unit is the nat. If the base is 2, the unit is the bit. In this seminar, the base of the logarithm is $e$. The mean of the above information, which is called entropy, is defined by

$$H(X) \equiv \sum_{k=1}^{K} \Pr(X = k)\, I(X = k) = \sum_{k=1}^{K} \Pr(X = k) \log \frac{1}{\Pr(X = k)}. \tag{1.2}$$

Entropy is a measure of uncertainty in the random variable $X$ or its sample space. Let $p_k = \Pr(X = k)$ $(k = 1, 2, \ldots, K)$. Then, we have the following theorem.

Theorem 1.1. Let $p = \{p_1, p_2, \ldots, p_K\}$ and $q = \{q_1, q_2, \ldots, q_K\}$ be two distributions. Then,

$$\sum_{k=1}^{K} p_k \log p_k \ge \sum_{k=1}^{K} p_k \log q_k. \tag{1.3}$$

Proof: Since $-\log x \ge 1 - x$, setting $x = q_k / p_k$ gives

$$\sum_{k=1}^{K} p_k \log p_k - \sum_{k=1}^{K} p_k \log q_k = -\sum_{k=1}^{K} p_k \log \frac{q_k}{p_k} \ge \sum_{k=1}^{K} p_k \left(1 - \frac{q_k}{p_k}\right) = 0.$$

Equality holds if and only if $p_k = q_k$ $(k = 1, 2, \ldots, K)$.

From (1.3) it follows that

$$H(p) = -\sum_{k=1}^{K} p_k \log p_k \le -\sum_{k=1}^{K} p_k \log q_k,$$

where $H(p)$ denotes the entropy of distribution $p$. Setting $q_k = \frac{1}{K}$ $(k = 1, 2, \ldots, K)$, we have

$$H(p) \le -\sum_{k=1}^{K} p_k \log \frac{1}{K} = \log K.$$

The following quantity is referred to as the Kullback-Leibler (KL) information or divergence.

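$$D(p\|q) \equiv \sum_{k=1}^{K} p_k \log \frac{p_k}{q_k} \quad (\ge 0). \tag{1.4}$$

This information is interpreted as the difference, or loss of information, incurred by using distribution $q$ instead of the true distribution $p$.

Example 1.1. Let $X \sim B_N(3, \tfrac{1}{2})$ and let $Y$ be distributed with $\Pr(Y = k) = \tfrac{1}{4}$ $(k = 1, 2, 3, 4)$. Matching the four categories of $X$ with those of $Y$ in order, we have

$$D(X\|Y) = \frac{1}{8}\log\frac{1/8}{1/4} + \frac{3}{8}\log\frac{3/8}{1/4} + \frac{3}{8}\log\frac{3/8}{1/4} + \frac{1}{8}\log\frac{1/8}{1/4} = \frac{3}{4}\ln 3 - \ln 2 = 0.1308.$$

The above quantity is the difference between $B_N(3, \tfrac{1}{2})$ and the uniform distribution on $\{1, 2, 3, 4\}$. The reverse KL information is

$$D(Y\|X) = \frac{2}{4}\log\frac{1/4}{1/8} + \frac{2}{4}\log\frac{1/4}{3/8} = 0.1438.$$

In this case, $D(X\|Y) \ne D(Y\|X)$.

For continuous distributions, the KL information can be defined similarly as

$$D(f(x)\|g(x)) \equiv \int f(x) \log \frac{f(x)}{g(x)}\, dx \quad (\ge 0), \tag{1.5}$$

where $f(x)$ and $g(x)$ are density functions.

Example 1.2. Let $f(x) \sim N(\mu_1, \sigma^2)$ and $g(x) \sim N(\mu_2, \sigma^2)$. Then,

$$D(f(x)\|g(x)) = \frac{(\mu_1 - \mu_2)^2}{2\sigma^2} \quad (= D(g(x)\|f(x))).$$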
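The two KL values in Example 1.1 and the closed form in Example 1.2 can be checked numerically. The following is a minimal sketch in Python; the function names are illustrative and not part of the seminar text, and the binomial success probability is taken to be 1/2, which is what the probabilities 1/8 and 3/8 in the example imply.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler information D(p||q) in nats for two discrete distributions."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

# Binomial(3, 1/2) probabilities (1/8, 3/8, 3/8, 1/8), matched in order
# with the uniform distribution on four categories.
binomial = [math.comb(3, k) * 0.5 ** 3 for k in range(4)]
uniform = [0.25] * 4

d_xy = kl_divergence(binomial, uniform)   # D(X||Y)
d_yx = kl_divergence(uniform, binomial)   # D(Y||X)

def kl_normal_equal_variance(mu1, mu2, sigma2):
    """Closed form of Example 1.2: (mu1 - mu2)^2 / (2 sigma^2)."""
    return (mu1 - mu2) ** 2 / (2.0 * sigma2)

print(round(d_xy, 4), round(d_yx, 4))   # 0.1308 0.1438
```

The asymmetry of the discrete case and the symmetry of the equal-variance normal case are both visible directly from these two helpers.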

1.2 Entropy in GLMs

Let $X$ and $Y$ be a $p \times 1$ explanatory variable vector and a response variable, respectively, and let $f(y|x)$ be the conditional probability or density function of $Y$ given $X = x$. The function $f(y|x)$ is assumed to be a member of the following exponential family of distributions:

$$f(y|x) = \exp\left\{\frac{y\theta - b(\theta)}{a(\varphi)} + c(y, \varphi)\right\}, \tag{1.6}$$

where $\theta$ and $\varphi$ are parameters, and $a(\varphi)$ $(> 0)$, $b(\theta)$ and $c(y, \varphi)$ are specific functions. This is the random component. Let $\beta = (\beta_1, \beta_2, \ldots, \beta_p)^T$. For a link function $h(u)$ (link component) and the linear predictor $\eta = \beta^T x$ (systematic component), the conditional expectation of $Y$ given $X = x$ is described as follows:

$$E(Y|X = x) = \frac{db(\theta)}{d\theta} = h^{-1}(\beta^T x). \tag{1.7}$$

Let us assume that the link function $h(u)$ is a strictly increasing differentiable function. The conditional variance of response $Y$ given $X = x$ is as follows:

$$\mathrm{Var}(Y|X = x) = a(\varphi)\frac{d^2 b(\theta)}{d\theta^2}.$$

From this, $a(\varphi)$ relates to the dispersion of $Y$, so it is referred to as a dispersion parameter. Since $\theta$ is a function of $\eta = \beta^T x$, for simplicity the function is denoted by $\theta = \theta(\beta^T x)$. Let us consider the following log odds ratio:

$$\log \mathrm{OR}(x, x_0; y, y_0) = \log\frac{f(y|x)/f(y_0|x)}{f(y|x_0)/f(y_0|x_0)} = \log\frac{f(y|x)\, f(y_0|x_0)}{f(y_0|x)\, f(y|x_0)} = \frac{1}{a(\varphi)}(y - y_0)\left\{\theta(\beta^T x) - \theta(\beta^T x_0)\right\}, \tag{1.8}$$

where $x_0$ and $y_0$ are baselines of $X$ and $Y$, respectively. The above log odds ratio is viewed as an inner product of $\theta(\beta^T x)$ and $y$ with respect to the dispersion parameter $a(\varphi)$. Since

$$\log \mathrm{OR}(x, x_0; y, y_0) = \left\{(-\log f(y_0|x)) - (-\log f(y|x))\right\} - \left\{(-\log f(y_0|x_0)) - (-\log f(y|x_0))\right\},$$

the log odds ratio (1.8) is the change of the uncertainty of response $Y$ in explanatory variable vector $X$, and, as seen in the above log odds ratio, predictor $\beta^T x$ is related to the reduction of uncertainty of response $Y$ through link functions.

For levels of the factor vector $X = x_1, x_2, \ldots, x_K$, the averages of $\theta$, $Y$ and $\theta Y$ are defined as follows:

$$E(\theta) = \frac{1}{K}\sum_{k=1}^{K} \theta(\beta^T x_k), \quad E(Y) = \frac{1}{K}\sum_{k=1}^{K} E(Y|X = x_k), \quad E(\theta Y) = \frac{1}{K}\sum_{k=1}^{K} E(Y|X = x_k)\, \theta(\beta^T x_k).$$

Remark 1.1. Let $n_k$ be sample sizes at factor levels $x_k$ $(k = 1, 2, \ldots, K)$, and $n = \sum_{k=1}^{K} n_k$. Then, the above averages (expectations) are replaced by the weighted ones, e.g. $E(\theta) = \sum_{k=1}^{K} \frac{n_k}{n}\, \theta(\beta^T x_k)$.
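As a small numerical illustration of these averages, the sketch below computes the weighted E(θ), E(Y), E(θY) and the covariance E(θY) − E(θ)E(Y) for a binary response with the canonical (logit) parameter, for which a(φ) = 1. The probabilities and sample sizes are hypothetical values chosen for illustration only. As a sanity check, the covariance is compared with the weighted average of the symmetrised KL information between each f(y|x_k) and the mixture f(y), which is the relation the text goes on to state.

```python
import math

# Hypothetical success probabilities Pr(Y = 1 | x_k) at K = 3 factor levels
# and hypothetical sample sizes n_k (illustrative values, not from the text).
p = [0.2, 0.5, 0.7]
n_k = [10, 30, 20]
w = [nk / sum(n_k) for nk in n_k]            # weights n_k / n, as in Remark 1.1

theta = [math.log(pk / (1 - pk)) for pk in p]   # canonical parameter (logit of p_k)

e_theta = sum(wk * t for wk, t in zip(w, theta))
e_y = sum(wk * pk for wk, pk in zip(w, p))
e_theta_y = sum(wk * pk * t for wk, pk, t in zip(w, p, theta))
cov_theta_y = e_theta_y - e_theta * e_y         # a(phi) = 1 for a binary response

def sym_kl_bernoulli(a, b):
    """D(Ber(a)||Ber(b)) + D(Ber(b)||Ber(a)) for success probabilities a and b."""
    return (a - b) * math.log(a / b) + (b - a) * math.log((1 - a) / (1 - b))

# The marginal f(y) is Bernoulli with success probability e_y.
avg_sym_kl = sum(wk * sym_kl_bernoulli(pk, e_y) for wk, pk in zip(w, p))

print(cov_theta_y, avg_sym_kl)   # the two numbers agree up to rounding error
```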

When we take the expectation of the inner product (1.8), we have

$$\frac{\mathrm{Cov}(\theta, Y)}{a(\varphi)} + \frac{(E(\theta) - \theta(\beta^T x_0))(E(Y) - y_0)}{a(\varphi)},$$

where $\mathrm{Cov}(\theta, Y) = E(\theta Y) - E(\theta)E(Y)$. For $y_0 = E(Y)$, the quantity becomes

$$\frac{\mathrm{Cov}(\theta, Y)}{a(\varphi)} \tag{1.9}$$

and it can be viewed as the average change of uncertainty of response variable $Y$ in explanatory variable vector $X$. We have Theorem 1.2 and Corollary 1.1.

Theorem 1.2. In the GLM with (1.6), the quantity (1.9) is expressed by the Kullback-Leibler information:

$$\frac{\mathrm{Cov}(\theta, Y)}{a(\varphi)} = \frac{1}{K}\sum_{k=1}^{K} \mathrm{KL}(f(y), f(y|x_k)), \tag{1.10}$$

where $f(y) = \frac{1}{K}\sum_{k=1}^{K} f(y|x_k)$ and

$$\mathrm{KL}(f(y), f(y|x_k)) = \int f(y|x_k) \log\frac{f(y|x_k)}{f(y)}\, dy + \int f(y) \log\frac{f(y)}{f(y|x_k)}\, dy = D(f(y|x_k)\|f(y)) + D(f(y)\|f(y|x_k)) \quad (k = 1, 2, \ldots, K).$$

Corollary 1.1. In the GLM with (1.6), the covariance of $Y$ and $\theta(\beta^T X)$ is nonnegative, and it is zero if and only if $X$ and $Y$ are independent, i.e. $f(y|x_k) = f(y)$ $(k = 1, 2, \ldots, K)$.

Example 1.3. An ordinary linear regression model is

$$Y = \alpha + \beta^T x + e,$$

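where $e$ is a normal error with mean 0 and variance $\sigma^2$. Let $f(y|x)$ be a normal density function with mean $\mu$ and variance $\sigma^2$. Then, the random component is

$$f(y|x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{(y - \mu)^2}{2\sigma^2}\right\} = \exp\left\{\frac{y\mu - \frac{1}{2}\mu^2}{\sigma^2} - \frac{y^2}{2\sigma^2} - \log\sqrt{2\pi\sigma^2}\right\}.$$

In this expression, setting $\theta = \mu$, $a(\varphi) = \sigma^2$, $b(\theta) = \frac{1}{2}\theta^2$ and $c(y, \varphi) = -\frac{y^2}{2\sigma^2} - \log\sqrt{2\pi\sigma^2}$, and for linear predictor $\eta = \alpha + \beta^T x$ and link function $\theta = \eta$, the normal linear regression model can be viewed as a GLM.

Example 1.4. Let $Y$ be a binary variable with $p = \Pr(Y = 1)$ $(= \mu)$. The random component is

$$f(y|x) = p^y (1 - p)^{1 - y} = \left\{p(1 - p)^{-1}\right\}^y (1 - p) = \exp\left\{y \log\frac{p}{1 - p} + \log(1 - p)\right\}.$$

Then, $\theta = \log\frac{p}{1 - p}$, $a(\varphi) = 1$, $b(\theta) = -\log(1 - p)$ and $c(y, \varphi) = 0$. Setting the link function $h(p) = \log\frac{p}{1 - p}$, we have the following logistic regression (logit) model: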

$$f(y|x) = \exp\{(\alpha + \beta x)y + \log(1 - p)\} = \frac{\exp\{(\alpha + \beta x)y\}}{1 + \exp(\alpha + \beta x)}.$$

For the logit model with explanatory variables $X_1, X_2, \ldots, X_p$, the model is expressed as

$$f(y|x) = \frac{\exp\left\{\left(\alpha + \sum_{i=1}^{p} \beta_i x_i\right)y\right\}}{1 + \exp\left(\alpha + \sum_{i=1}^{p} \beta_i x_i\right)}. \tag{1.11}$$

1.3 Basic Predictive Power Measures for GLMs

In the sense of the previous discussion, it may be appropriate to assess the predictive or explanatory power of factors based on entropy. In the GLM framework, predictive power measures are compared, and the advantage of ECD is mentioned. First, some predictive power measures for regression models are briefly discussed. In general regression models, for a variation function $D$, the predictive power can be measured as follows:

$$R_D^2 = \frac{D(Y) - D(Y|X)}{D(Y)}, \tag{1.12}$$

where $D(Y)$ and $D(Y|X)$ denote a variation function of $Y$ and a conditional or error variation function given $X$, respectively (Efron, 1978; Agresti, 1986; Korn & Simon, 1991). Predictive power measures based on the likelihood function (Theil, 1970; Goodman, 1971) are constructed from powers of the likelihood function, i.e.

$$R_L^2 = 1 - \left(\frac{l(0)}{l(\beta)}\right)^{2/n},$$

where $l(\beta)$ is the likelihood function and $n$ is the sample size. Let $R$ be the multiple correlation coefficient in the ordinary linear regression model. The above measure becomes $R^2$ in ordinary linear regression cases and increases with model complexity; however, it is difficult to interpret the measure in general (Zheng & Agresti, 2000). The entropy measure (Haberman, 1982) for categorical responses is based on the entropy of $Y$, $H(Y)$, and the conditional entropy $H(Y|X)$, i.e.

$$R_E^2 = \frac{H(Y) - H(Y|X)}{H(Y)}.$$

The above measure is included in (1.12). The correlation coefficient of response $Y$ and its conditional expectation given factor $X$, $\mathrm{Corr}(E(Y|X), Y)$, is recommended for measuring the predictive power of GLMs, because the correlation measure can be applied to all types of GLMs except polytomous response cases (Zheng & Agresti, 2000). This measure is the correlation coefficient between response $Y$ and the regression on $X$, and is referred to as the regression correlation coefficient. With respect to entropy, $R_L^2$ and $R_E^2$ may be suitable for GLMs. By considering the average change of log odds ratio, Eshima & Tabata (2007) proposed the following basic predictive power measure:

$$\mathrm{mPP}(Y|X) \equiv \frac{\mathrm{Cov}(\theta, Y)}{a(\varphi)}. \tag{1.13}$$

The above measure is expressed by the Kullback-Leibler information (Eshima & Tabata, 2007), and it is increasing in $\mathrm{Cov}(\theta, Y)$ and decreasing in $a(\varphi)$. Since

$$\mathrm{Var}(Y|X = x) = a(\varphi)\frac{d^2 b(\theta)}{d\theta^2},$$

function $a(\varphi)$ may be interpreted as the error variation of $Y$ in entropy, i.e. the residual randomness of $Y$ given $X$. From this, $\mathrm{Cov}(\theta, Y)$ can be interpreted as the entropy of $Y$ explained by $X$. Hence, measure (1.13) is the ratio of the explained variation of $Y$ to the error variation of $Y$ in entropy. The entropy variation function $D_E$ is defined by

$$D_E(Y) \equiv \mathrm{Cov}(\theta, Y) + a(\varphi).$$

Since $\theta$ is a function of $X$, $\mathrm{Cov}(\theta, Y|X) = 0$. From this, the conditional entropy variation of $Y$ given $X$ is

$$D_E(Y|X) \equiv a(\varphi).$$

Considering this, ECD is defined as follows:

$$\mathrm{ECD}(X, Y) = \frac{\mathrm{Cov}(\theta, Y)}{\mathrm{Cov}(\theta, Y) + a(\varphi)} = \frac{\mathrm{mPP}(Y|X)}{\mathrm{mPP}(Y|X) + 1} = \frac{D_E(Y) - D_E(Y|X)}{D_E(Y)}. \tag{1.14}$$

From (1.14), ECD is included in (1.12), and ECD can be viewed as the proportion of explained variation of $Y$ in entropy. For the normal linear regression model, it follows that $\mathrm{ECD}(X, Y) = R^2$. Let $\beta = (\beta_1, \beta_2, \ldots, \beta_p)^T$ be a regression coefficient vector. For canonical links $\theta = \sum_{i=1}^{p} \beta_i x_i$, ECD (1.14) and the entropy correlation coefficient (ECC) are decomposed as follows:

$$\mathrm{ECD}(X, Y) = \frac{\sum_{i=1}^{p} \beta_i \mathrm{Cov}(X_i, Y)}{\sum_{i=1}^{p} \beta_i \mathrm{Cov}(X_i, Y) + a(\varphi)}, \tag{1.15}$$

$$\mathrm{ECorr}(X, Y) = \frac{\sum_{i=1}^{p} \beta_i \mathrm{Cov}(X_i, Y)}{\sqrt{\mathrm{Var}(\theta)}\sqrt{\mathrm{Var}(Y)}}.$$

None of the predictive power measures except ECD and ECC admits the above type of decomposition for GLMs with canonical links. In addition to the desirable properties of measures for GLMs, (i) to (v), decomposability such as (1.15) may also be a suitable property for a predictive power measure, because the relative importance of $X_i$ may be assessed by $\beta_i \mathrm{Cov}(X_i, Y)$. Moreover, ECD is scale-invariant in GLMs with multivariate responses; however, ECC is not. In this respect, ECD is superior to ECC.

Table 1.1. Properties of five predictive power measures in GLMs
Measures (columns): $\mathrm{Corr}(E(Y|X), Y)$, $R_L^2$, $R_E^2$, ECC, ECD
Properties (rows): (i) interpretability; (ii) $R$ or $R^2$; (iii) entropy; (iv) all GLMs; (v) monotonicity; (vi) decomposition
[cell marks not recoverable]

Table 1.1 summarizes the properties of the five measures mentioned above. Measures $\mathrm{Corr}(E(Y|X), Y)$ and $\mathrm{ECorr}(X, Y)$ may have property (v) in most cases; however, it is not easy to prove the property in general. From this table, $\mathrm{ECD}(X, Y)$ is the most desirable predictive power measure for GLMs.

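Example 1.5. Let $X$ and $Y$ be $p$- and $q$-dimensional random vectors, respectively, and let the joint distribution be a $(p + q)$-variate normal distribution with the following covariance matrix:

$$\begin{pmatrix} \Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{YY} \end{pmatrix}.$$

Let the inverse of the above matrix be denoted by

$$\begin{pmatrix} \Sigma^{XX} & \Sigma^{XY} \\ \Sigma^{YX} & \Sigma^{YY} \end{pmatrix}.$$

Then, $\theta = -\Sigma^{YX} X$ and $a(\varphi) = 1$, so we have

$$\mathrm{mPP}(Y|X) = -\mathrm{tr}\, \Sigma^{YX} \Sigma_{XY}.$$

From this,

$$\mathrm{ECD}(X, Y) = \frac{-\mathrm{tr}\, \Sigma^{YX} \Sigma_{XY}}{-\mathrm{tr}\, \Sigma^{YX} \Sigma_{XY} + 1}.$$

Let $\rho_i^2$ $(i = 1, 2, \ldots, \min\{p, q\})$ be the squared canonical correlation coefficients. Then,

$$\mathrm{ECD}(X, Y) = \frac{\sum_{i=1}^{\min\{p,q\}} \frac{\rho_i^2}{1 - \rho_i^2}}{\sum_{i=1}^{\min\{p,q\}} \frac{\rho_i^2}{1 - \rho_i^2} + 1}.$$

For $q = 1$, ECD reduces to the usual coefficient of determination $\rho_1^2$ $(= R^2)$.

Example 1.6. In the logistic regression model (1.11), we have

$$\mathrm{ECD}(X, Y) = \frac{\sum_{i=1}^{p} \beta_i \mathrm{Cov}(X_i, Y)}{\sum_{i=1}^{p} \beta_i \mathrm{Cov}(X_i, Y) + 1}.$$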
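The reduction of ECD to the coefficient of determination in Example 1.5 can be checked in the simplest case p = q = 1. The sketch below is only an illustration under stated assumptions: the covariance values are hypothetical, and the sign convention is taken to be θ = −Σ^{YX}X with a(φ) = 1, where Σ^{YX} denotes the Y-X block of the inverse covariance matrix. This is the convention under which the q = 1 case gives R².

```python
# Hypothetical 2 x 2 covariance matrix of (X, Y); values are illustrative only.
s_xx, s_xy, s_yy = 4.0, 1.8, 2.5

det = s_xx * s_yy - s_xy ** 2
inv_yx = -s_xy / det                  # Y-X entry of the inverse covariance matrix

m_pp = -inv_yx * s_xy                 # -tr(Sigma^{YX} Sigma_{XY}), with a(phi) = 1
ecd = m_pp / (m_pp + 1.0)

rho_sq = s_xy ** 2 / (s_xx * s_yy)    # squared correlation (the only canonical correlation here)
ecd_from_rho = (rho_sq / (1 - rho_sq)) / (rho_sq / (1 - rho_sq) + 1.0)

print(ecd, ecd_from_rho, rho_sq)      # all three values coincide up to rounding
```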

2 Application of ECD

2.1 Asymptotic Property of the ML Estimator of ECD

Let $f(y)$ and $g(x)$ be the marginal density or probability functions of $Y$ and $X$, respectively. Then, the association measure is expressed as

$$\mathrm{mPP}(Y|X) = \iint f(y|x)\, g(x) \log\frac{f(y|x)}{f(y)}\, dx\, dy + \iint f(y)\, g(x) \log\frac{f(y)}{f(y|x)}\, dx\, dy. \tag{2.1}$$

If $Y$ is discrete, the integral is replaced with summation. If $X$ is not random and takes values $x_k$ $(k = 1, 2, \ldots, K)$, the above measure can be modified as follows:

$$\mathrm{mPP}(Y|X) = \sum_{k=1}^{K} \frac{n_k}{n}\left\{\int f(y|x_k) \log\frac{f(y|x_k)}{f(y)}\, dy + \int f(y) \log\frac{f(y)}{f(y|x_k)}\, dy\right\}, \tag{2.2}$$

where $n_k$ are sample sizes at levels $x_k$ $(k = 1, 2, \ldots, K)$, and $n = \sum_{k=1}^{K} n_k$. We have the following theorem.

Theorem 2.1. Let $\widehat{\mathrm{mPP}}(Y|X)$, $\hat{f}(y|x_k)$, and $\hat{f}(y)$ be the ML estimators of $\mathrm{mPP}(Y|X)$, $f(y|x_k)$, and $f(y)$, respectively. If the null model, i.e. $\beta = 0$, holds, the ML estimator of (2.2) multiplied by sample size $n$, i.e.

$$n\, \widehat{\mathrm{mPP}}(Y|X) = \sum_{k=1}^{K} n_k\left\{\int \hat{f}(y|x_k) \log\frac{\hat{f}(y|x_k)}{\hat{f}(y)}\, dy + \int \hat{f}(y) \log\frac{\hat{f}(y)}{\hat{f}(y|x_k)}\, dy\right\},$$

is asymptotically distributed according to the chi-square distribution with $p$ degrees of freedom as the sample sizes $n_k$ tend to infinity.

Proof. For simplicity of the discussion, the theorem is proven in the case where $Y$ is a polytomous variable with levels or categories $\{1, 2, \ldots, J\}$. Let $\pi_{j|k} = \Pr(Y = j|X = x_k)$ and $\pi_j = \Pr(Y = j)$; and let $\hat{\pi}_{j|k}$ and $\hat{\pi}_j$ be the ML estimators of $\pi_{j|k}$ and $\pi_j$, respectively. Under the null hypothesis and for sufficiently large $n_k$, we have

$$n\, \widehat{\mathrm{mPP}}(Y|X) = \sum_{k=1}^{K} n_k \sum_{j=1}^{J}\left\{\hat{\pi}_{j|k} \log\frac{\hat{\pi}_{j|k}}{\hat{\pi}_j} + \hat{\pi}_j \log\frac{\hat{\pi}_j}{\hat{\pi}_{j|k}}\right\}$$

$$= \sum_{k=1}^{K}\sum_{j=1}^{J} \frac{(n_k \hat{\pi}_{j|k} - n_k \hat{\pi}_j)^2}{2 n_k \hat{\pi}_{j|k}} + \sum_{k=1}^{K}\sum_{j=1}^{J} \frac{(n_k \hat{\pi}_{j|k} - n_k \hat{\pi}_j)^2}{2 n_k \hat{\pi}_j} + o(n)$$

$$= \sum_{k=1}^{K}\sum_{j=1}^{J} \frac{(n_k \hat{\pi}_{j|k} - n_k \hat{\pi}_j)^2}{n_k \hat{\pi}_{j|k}} + o(n),$$

where

$$\frac{o(n)}{n} \xrightarrow{P} 0 \quad (n \to \infty).$$

Hence, the theorem follows.

When the explanatory variables $X$ are random, the following theorem holds similarly.

Theorem 2.2. If the null model, i.e. $\beta = 0$, holds, the ML estimator of (2.1) multiplied by sample size $n$, i.e.

$$n\, \widehat{\mathrm{mPP}}(Y|X) = n\left\{\iint \hat{f}(y|x)\, \hat{g}(x) \log\frac{\hat{f}(y|x)}{\hat{f}(y)}\, dx\, dy + \iint \hat{f}(y)\, \hat{g}(x) \log\frac{\hat{f}(y)}{\hat{f}(y|x)}\, dx\, dy\right\},$$

is asymptotically distributed according to the chi-square distribution with $p$ degrees of freedom as the sample size $n$ tends to infinity.
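For the polytomous case treated in the proof, the statistic n·mPP can be evaluated directly from a table of counts. The sketch below is an illustration only: the counts are hypothetical, and the conditional and marginal probabilities are estimated by observed proportions (a saturated model), which is an assumption rather than the fitted GLM of the theorem. All cells must be positive for the logarithms to be defined.

```python
import math

# Hypothetical K x J table of counts (rows: levels x_k, columns: categories of Y).
counts = [
    [20, 15, 5],
    [10, 20, 10],
    [5, 15, 20],
]

def n_times_mpp(table):
    """Return sum_k n_k sum_j (pi_jk - pi_j) * log(pi_jk / pi_j).

    The summand equals pi_jk*log(pi_jk/pi_j) + pi_j*log(pi_j/pi_jk) summed over j,
    since both sets of probabilities sum to one.
    """
    n = sum(sum(row) for row in table)
    pi_marginal = [sum(col) / n for col in zip(*table)]   # estimated Pr(Y = j)
    stat = 0.0
    for row in table:
        n_k = sum(row)
        for n_kj, pi_j in zip(row, pi_marginal):
            pi_jk = n_kj / n_k                            # estimated Pr(Y = j | x_k)
            stat += n_k * (pi_jk - pi_j) * math.log(pi_jk / pi_j)
    return stat

statistic = n_times_mpp(counts)
print(statistic)
```

When every row has the same conditional distribution, each term vanishes and the statistic is zero, which matches the null model of the theorem.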

Since

$$\mathrm{ECD}(X, Y) = \frac{\mathrm{Cov}(\theta, Y)/a(\varphi)}{\mathrm{Cov}(\theta, Y)/a(\varphi) + 1} = \frac{\mathrm{mPP}(Y|X)}{\mathrm{mPP}(Y|X) + 1},$$

the ML estimator of $\mathrm{ECD}(X, Y)$ is

$$\widehat{\mathrm{ECD}}(X, Y) = \frac{\widehat{\mathrm{mPP}}(Y|X)}{\widehat{\mathrm{mPP}}(Y|X) + 1}.$$

From this, we can test the hypothesis $\mathrm{ECD}(X, Y) = 0$ based on the following statistic:

$$\chi^2 = n\, \widehat{\mathrm{mPP}}(Y|X) = n\, \frac{\widehat{\mathrm{Cov}}(\theta, Y)}{a(\hat{\varphi})}. \tag{2.3}$$

The above statistic is asymptotically distributed according to a non-central chi-square distribution with non-centrality $\lambda = n\, \mathrm{mPP}(Y|X)$ and $p$ degrees of freedom. Let

$$c = 1 + \frac{\lambda}{p + \lambda} \quad \text{and} \quad \nu' = p + \frac{\lambda^2}{p + 2\lambda}.$$

Statistic $\chi^2 / c$ is asymptotically distributed according to the chi-square distribution with $\nu'$ degrees of freedom. As $\nu'$ becomes large, the chi-square distribution tends to a normal distribution with mean $\nu'$ and variance $2\nu'$. From this, for sufficiently large sample size $n$, statistic (2.3)

scaled by the sample size, $\chi^2 / n = \widehat{\mathrm{mPP}}(Y|X)$, is asymptotically normally distributed with mean $\frac{c\nu'}{n}$ and variance $\frac{2c^2\nu'}{n^2}$ (Patnaik, 1949). For sufficiently large $n$, we have

$$\frac{c\nu'}{n} \approx \mathrm{mPP}(Y|X), \qquad \frac{2c^2\nu'}{n^2} \approx \frac{2c}{n}\, \mathrm{mPP}(Y|X).$$

From this, the asymptotic standard error (ASE) of $\widehat{\mathrm{mPP}}(Y|X)$ is $\sqrt{\frac{2c}{n}\, \widehat{\mathrm{mPP}}(Y|X)}$.

Example 2.1. Agresti (2002, pp. …) analyzed the beetle mortality data with a complementary log-log model, in which beetles were exposed to gaseous carbon disulfide at various concentrations, and the numbers of beetles killed after 5 hours of exposure were observed (Table 2.1). Let $X$ be the gaseous carbon disulfide concentration, and $\alpha$ and $\beta$ the model parameters. Then, for the complementary log-log link we have

$$\theta = \log\frac{1 - \exp\{-\exp(\alpha + \beta x)\}}{\exp\{-\exp(\alpha + \beta x)\}}.$$

For the maximum likelihood estimates $\hat{\alpha} = -39.52$ (ASE = 3.23) and $\hat{\beta} = 22.01$ (ASE = 1.80), ECD is calculated as

ECD = 0.475 (ASE = 0.024).

The ECD indicates that 47.5% of the variation of response variable $Y$ in entropy is explained by the gaseous carbon disulfide concentration $X$.
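To make the role of θ under a non-canonical link concrete, the following sketch evaluates θ for the complementary log-log model and plugs it into ECD = Cov(θ, Y)/(Cov(θ, Y) + 1) for a binary response. The parameter values are the estimates quoted in Example 2.1, with the intercept taken as negative (an assumption about a sign lost in this copy). The dose levels and equal weights are hypothetical, so the resulting ECD is not the 0.475 reported for the real data.

```python
import math

ALPHA_HAT, BETA_HAT = -39.52, 22.01   # estimates from Example 2.1 (intercept sign assumed negative)

def prob_killed(x):
    """Pr(Y = 1 | x) = 1 - exp{-exp(alpha + beta x)} under the complementary log-log link."""
    return -math.expm1(-math.exp(ALPHA_HAT + BETA_HAT * x))

def theta(x):
    """theta = log[p / (1 - p)]; here p / (1 - p) = exp(exp(eta)) - 1."""
    return math.log(math.expm1(math.exp(ALPHA_HAT + BETA_HAT * x)))

def ecd_binary(xs, weights):
    """ECD = Cov(theta, Y) / (Cov(theta, Y) + 1), with a(phi) = 1 for a binary response."""
    ps = [prob_killed(x) for x in xs]
    ts = [theta(x) for x in xs]
    e_t = sum(w * t for w, t in zip(weights, ts))
    e_y = sum(w * p for w, p in zip(weights, ps))
    e_ty = sum(w * p * t for w, p, t in zip(weights, ps, ts))
    cov = e_ty - e_t * e_y
    return cov / (cov + 1.0)

# Hypothetical log-dose levels with equal weights (the values of Table 2.1 are not used here).
doses = [1.70, 1.75, 1.80, 1.85]
weights = [0.25] * 4
ecd = ecd_binary(doses, weights)
print(ecd)
```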

Table 2.1. Beetles Killed after Exposure to Carbon Disulfide
Columns: Log Dose; No. of Beetles; No. of Beetles Killed
[table values not recoverable]

2.2 GLMs with canonical links

Most regression analyses with GLMs are performed using canonical links. Let $X = (X_1, X_2, \ldots, X_p)^T$ be a $p \times 1$ factor or explanatory variable vector; let $Y$ be a response variable; let $\beta = (\beta_1, \beta_2, \ldots, \beta_p)^T$ be a regression parameter vector; and let $\theta = \sum_{i=1}^{p} \beta_i x_i$ be the canonical link. Then, ECD is decomposed as

$$\mathrm{ECD}(X, Y) = \frac{\sum_{i=1}^{p} \beta_i \mathrm{Cov}(X_i, Y)}{\sum_{j=1}^{p} \beta_j \mathrm{Cov}(X_j, Y) + a(\varphi)}. \tag{2.4}$$

The above decomposition consists of components that relate to regression coefficients $\beta_i$, and the contribution of $X_i$ on $Y$ may be defined by using

$\beta_i \mathrm{Cov}(X_i, Y)$. If the $X_i$ are independent or the experimental design is a multiway layout experiment, the contribution ratio of $X_i$ on the response is defined by

$$\mathrm{CR}(X_i) = \frac{\beta_i \mathrm{Cov}(X_i, Y)}{\sum_{k=1}^{p} \beta_k \mathrm{Cov}(X_k, Y)}.$$

In general, the $X_i$ are correlated or the experimental model has higher-order interactions. Then, the contribution ratio of $X_i$ is defined by

$$\mathrm{CR}(X_i) = \frac{\mathrm{Cov}(\theta, Y) - \mathrm{Cov}(\theta, Y|X_i)}{\mathrm{Cov}(\theta, Y)}.$$

Example 2.2. The present discussion is applied to the ordinary two-way layout experimental design model. Let $X_1$ and $X_2$ be factors with levels $\{1, 2, \ldots, I\}$ and $\{1, 2, \ldots, J\}$, respectively. Then, the linear predictor is a function of $(X_1, X_2) = (i, j)$, i.e.

$$\theta = \alpha_i + \beta_j + (\alpha\beta)_{ij}.$$

For model identification, the following constraints are placed on these parameters:

$$\sum_{i=1}^{I} \alpha_i = \sum_{j=1}^{J} \beta_j = \sum_{i=1}^{I} (\alpha\beta)_{ij} = \sum_{j=1}^{J} (\alpha\beta)_{ij} = 0.$$

Let

$$X_{ki} = \begin{cases} 1 & (X_k = i) \\ 0 & (X_k \ne i) \end{cases} \quad (k = 1, 2).$$

Then, dummy vectors

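$X_1 = (X_{11}, X_{12}, \ldots, X_{1I})^T$ and $X_2 = (X_{21}, X_{22}, \ldots, X_{2J})^T$ are identified with factors $X_1$ and $X_2$, respectively. From this, the systematic component of the above model can be written as follows:

$$\theta = \alpha^T X_1 + \beta^T X_2 + \gamma^T X_1 \otimes X_2,$$

where $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_I)^T$, $\beta = (\beta_1, \beta_2, \ldots, \beta_J)^T$, $\gamma = ((\alpha\beta)_{11}, (\alpha\beta)_{12}, \ldots, (\alpha\beta)_{1J}, \ldots, (\alpha\beta)_{IJ})^T$, and $X_1 \otimes X_2 = (X_{11}X_{21}, X_{11}X_{22}, \ldots, X_{11}X_{2J}, \ldots, X_{1I}X_{2J})^T$. Let $\mathrm{Cov}(X_1, Y)$, $\mathrm{Cov}(X_2, Y)$ and $\mathrm{Cov}(X_1 \otimes X_2, Y)$ be covariance matrices. Then the total effect of $X_1$ and $X_2$ is

$$\mathrm{mPP}(Y|(X_1, X_2)) = \frac{\mathrm{Cov}(\theta, Y)}{\sigma^2} = \frac{1}{\sigma^2}\left\{\mathrm{tr}\, \alpha^T \mathrm{Cov}(X_1, Y) + \mathrm{tr}\, \beta^T \mathrm{Cov}(X_2, Y) + \mathrm{tr}\, \gamma^T \mathrm{Cov}(X_1 \otimes X_2, Y)\right\}$$

$$= \frac{1}{\sigma^2}\left\{\frac{1}{I}\sum_{i=1}^{I} \alpha_i^2 + \frac{1}{J}\sum_{j=1}^{J} \beta_j^2 + \frac{1}{IJ}\sum_{j=1}^{J}\sum_{i=1}^{I} (\alpha\beta)_{ij}^2\right\}.$$

The above three terms are referred to as the main effect of $X_1$, that of $X_2$ and the interactive effect, respectively. Then, ECD is calculated as follows:

$$\mathrm{ECD}((X_1, X_2), Y) = \frac{\frac{1}{I}\sum_{i=1}^{I} \alpha_i^2 + \frac{1}{J}\sum_{j=1}^{J} \beta_j^2 + \frac{1}{IJ}\sum_{j=1}^{J}\sum_{i=1}^{I} (\alpha\beta)_{ij}^2}{\frac{1}{I}\sum_{i=1}^{I} \alpha_i^2 + \frac{1}{J}\sum_{j=1}^{J} \beta_j^2 + \frac{1}{IJ}\sum_{j=1}^{J}\sum_{i=1}^{I} (\alpha\beta)_{ij}^2 + \sigma^2}.$$

In this case,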

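$$\mathrm{CR}(X_1) = \frac{\mathrm{Cov}(\theta, Y) - \mathrm{Cov}(\theta, Y|X_1)}{\mathrm{Cov}(\theta, Y)} = \frac{\frac{1}{I}\sum_{i=1}^{I} \alpha_i^2}{\mathrm{Cov}(\theta, Y)},$$

$$\mathrm{CR}(X_2) = \frac{\mathrm{Cov}(\theta, Y) - \mathrm{Cov}(\theta, Y|X_2)}{\mathrm{Cov}(\theta, Y)} = \frac{\frac{1}{J}\sum_{j=1}^{J} \beta_j^2}{\mathrm{Cov}(\theta, Y)}.$$

The rest,

$$1 - \mathrm{CR}(X_1) - \mathrm{CR}(X_2) = \frac{\frac{1}{IJ}\sum_{j=1}^{J}\sum_{i=1}^{I} (\alpha\beta)_{ij}^2}{\mathrm{Cov}(\theta, Y)},$$

is due to the effect of the interaction.

Table 2.2. Length of Home Visit in Minutes by Public Health Nurses, by Nurse's Age Group and Type of Patient
Factor $X_2$ (nurse's age group, years old; columns): 1 (20 to 29); 2 (30 to 39); 3 (40 to 49); 4 (50 and over)
Factor $X_1$ (type of patient; rows): 1 (Cardiac); 2 (Cancer); 3 (C.V.A.); 4 (Tuberculosis)
[table values not recoverable]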

Table 2.2 shows two-way layout experiment data from a study of the length of time spent on individual home visits by public health nurses (Daniel, 1999, pp. …). In this example, it is meaningful to analyze the effects of the factors, i.e. the type of patient and the age of the nurse, on the nurses' behavior. Let $Y$ be the length of a home visit, and let factors $X_1$ and $X_2$ denote the type of patient and the age of the nurse, respectively. The results of the two-way analysis of variance are shown in Table 2.3. The main and interactive effects of the factors are significant. In this case, the levels of factor vector $X = (X_1, X_2)$ are $(i, j)$ $(i = 1, 2, 3, 4;\ j = 1, 2, 3, 4)$, in line with the four patient types and four age groups of Table 2.2. Although factors $X_1$ and $X_2$ are independent (orthogonal), the model has interaction terms between them. By using the present approach and the variance decomposition in Table 2.3, we have

$$\mathrm{ECD}((X_1, X_2), Y) = \frac{3226.45 + 1185.05 + 704.45}{6423.55} = 0.796 \quad (\mathrm{SE} = 0.018).$$

From this, 79.6% of entropy is explained by the two variables (factors). The contributions of the factors are calculated as follows:

$$\mathrm{CR}(X_1) = \frac{3226.45}{6423.55} = 0.502 \quad \text{and} \quad \mathrm{CR}(X_2) = \frac{1185.05}{6423.55} = 0.184.$$

The contribution of $X_1$ on $Y$ is about three times greater than that of $X_2$.
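The arithmetic behind these figures can be reproduced from the sums of squares of Table 2.3. The sketch below is illustrative: the total (6423.55) and the reported ratios come from the text, while the decimal digits of three of the sums of squares are an assumption, inferred from the mean squares and F ratios that survive in the table.

```python
# Sums of squares as read from Table 2.3. The decimals of ss_x1, ss_interaction
# and ss_residual are assumptions inferred from the mean squares and F ratios.
ss_x1 = 3226.45           # type of patient
ss_x2 = 1185.05           # nurse's age group
ss_interaction = 704.45   # X1 x X2
ss_residual = 1307.60
ss_total = ss_x1 + ss_x2 + ss_interaction + ss_residual

ecd = (ss_x1 + ss_x2 + ss_interaction) / ss_total   # explained share of the total
cr_x1 = ss_x1 / ss_total                            # contribution of X1 as computed in the text
cr_x2 = ss_x2 / ss_total                            # contribution of X2 as computed in the text

print(round(ecd, 3), round(cr_x1, 3), round(cr_x2, 3))   # 0.796 0.502 0.184
```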

Table 2.3. Analysis of variance of length of time spent on individual home visits by public health nurses

Source        SS        df   MS        F        p
X_1           3226.45    3   1075.5    52.641   0.000
X_2           1185.05    3   395.02    19.334   0.000
X_1 × X_2     704.45     9   78.272    3.831    0.000
Residual      1307.6    64   20.431
Total         6423.55   79

Example 2.3. Table 2.4 shows death penalty data used to study the effects of the defendant's and victim's racial characteristics on whether persons convicted of homicide received the death penalty (Agresti, 2002, pp. …), and the data were analyzed with the following logit model:

$$f(y|x) = \frac{\exp\{(\alpha + \beta_d x_d + \beta_v x_v)y\}}{1 + \exp(\alpha + \beta_d x_d + \beta_v x_v)},$$

where $X_d$ and $X_v$ denote the defendant's and victim's race, i.e. 0 = black and 1 = white; the response death penalty $Y$ takes the value 0 = no or 1 = yes. The estimated parameters were $\hat{\alpha} = -3.596$ (SE = 0.507), $\hat{\beta}_d = -0.868$ (SE = 0.367) and $\hat{\beta}_v = 2.404$ (SE = 0.601) (Agresti, p. 201). From the results, the odds of the death penalty for a white defendant, given the victim's race, are $\exp(-0.868) = 0.420$ times those for a black defendant, and the odds of the death penalty for a white victim, given the defendant's race, are $\exp(2.404) = 11.067$ times those for a black victim. Since the

predictive power measured with ECD is ECD = 0.036 (SE = 0.014), the effects of the defendant's and victim's races on the death penalty are small. The contribution ratios of the explanatory variables on the response are calculated according to (2.4) as follows:

$\mathrm{CR}(X_d) = 0.603$ and $\mathrm{CR}(X_v) = 0.813$.

Table 2.4. Death Penalty Data

Victim's Race   Defendant's Race   Death Penalty: Yes   No
White           White              …                    …
White           Black              …                    …
Black           White              0                    16
Black           Black              …                    …

2.3 Application to a generalized logit model

A baseline-category logit model is considered. Let $X_1$ and $X_2$ be categorical factors that take levels $\{1, 2, \ldots, I\}$ and $\{1, 2, \ldots, J\}$, respectively, and let $Y$ be a categorical response variable with levels $\{1, 2, \ldots, K\}$. Let

$$X_{ai} = \begin{cases} 1 & (X_a = i) \\ 0 & (X_a \ne i) \end{cases} \quad (a = 1, 2) \quad \text{and} \quad Y_k = \begin{cases} 1 & (Y = k) \\ 0 & (Y \ne k). \end{cases} \tag{2.5}$$

Then, dummy variable vectors

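$X_1 = (X_{11}, X_{12}, \ldots, X_{1I})^T$, $X_2 = (X_{21}, X_{22}, \ldots, X_{2J})^T$ and $Y = (Y_1, Y_2, \ldots, Y_K)^T$ are identified with factors $X_1$, $X_2$ and response $Y$, respectively. From this, the systematic component of the baseline-category logit model is assumed as follows:

$$\theta = \alpha + B_{(1)} X_1 + B_{(2)} X_2,$$

where

$$\alpha = \begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_K \end{pmatrix}, \quad B_{(1)} = \begin{pmatrix} \beta_{(1)11} & \beta_{(1)12} & \cdots & \beta_{(1)1I} \\ \beta_{(1)21} & \beta_{(1)22} & \cdots & \beta_{(1)2I} \\ \vdots & \vdots & & \vdots \\ \beta_{(1)K1} & \beta_{(1)K2} & \cdots & \beta_{(1)KI} \end{pmatrix}, \quad \text{and} \quad B_{(2)} = \begin{pmatrix} \beta_{(2)11} & \beta_{(2)12} & \cdots & \beta_{(2)1J} \\ \beta_{(2)21} & \beta_{(2)22} & \cdots & \beta_{(2)2J} \\ \vdots & \vdots & & \vdots \\ \beta_{(2)K1} & \beta_{(2)K2} & \cdots & \beta_{(2)KJ} \end{pmatrix}.$$

Then, the logit model is described as

$$\Pr(Y = y|x_1, x_2) = \frac{\exp(y^T B_{(1)} x_1 + y^T B_{(2)} x_2 + y^T \alpha)}{\sum_{z} \exp(z^T B_{(1)} x_1 + z^T B_{(2)} x_2 + z^T \alpha)},$$

where $\sum_{z}$ denotes the summation over all $z$. In this model, we have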

$$\mathrm{ECD}(X, Y) = \frac{\mathrm{tr}\, B_{(1)} \mathrm{Cov}(X_1, Y) + \mathrm{tr}\, B_{(2)} \mathrm{Cov}(X_2, Y)}{\mathrm{tr}\, B_{(1)} \mathrm{Cov}(X_1, Y) + \mathrm{tr}\, B_{(2)} \mathrm{Cov}(X_2, Y) + 1}.$$

In the following example, the ECD approach is demonstrated.

Example 2.4. The data for an investigation of factors influencing the primary food choice of alligators (Table 2.5) are analyzed (Agresti, 2002, pp. …). In this example, the explanatory variables are $X_1$: lakes where alligators live, {1. Hancock, 2. Oklawaha, 3. Trafford, 4. George}; and $X_2$: sizes of alligators, {1. Small, 2. Large}; and the response variable is $Y$: primary food choice of alligators, {1. Fish, 2. Invertebrate, 3. Reptile, 4. Bird, 5. Other}. In this analysis, the generalized logit model described in this section is used, and we set $I = 4$, $J = 2$, and $K = 5$ in (2.5). From this model the estimates $\hat{B}_1$ and $\hat{B}_2$ of the regression coefficient matrices are obtained (Agresti, 2002, pp. …).

[entries of $\hat{B}_1$ and $\hat{B}_2$ not recoverable; surviving fragments, with signs and positions uncertain: ….826, 0.006, 1.…; ….485, 0.931, 0.394, 0; 0.417, 2.454, 1.419, 0; 0.131, 0.659, 0.…; ….683, 0]

By using the above estimates, we have

$$\mathrm{tr}\, \hat{B}_1 \widehat{\mathrm{Cov}}(X_1, Y) = 0.258 \quad \text{and} \quad \mathrm{tr}\, \hat{B}_2 \widehat{\mathrm{Cov}}(X_2, Y) = 0.107,$$

so that

$$n\, \widehat{\mathrm{mPP}}(Y|X_1, X_2) = 219\,(0.258 + 0.107) = 79.935 \quad (\mathrm{df} = 16,\ P = 0.000).$$
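The arithmetic that links the two trace terms to the test statistic, to ECD and to the relative effect of the two factors can be sketched as follows. The two trace values and the sample size 219 are taken from Example 2.4 (the second trace, 0.107, is read from the sum inside the test statistic); the variable names are illustrative.

```python
n = 219                    # sample size used in Example 2.4
tr_lake = 0.258            # tr(B1-hat Cov-hat(X1, Y)), the Lake term
tr_size = 0.107            # tr(B2-hat Cov-hat(X2, Y)), the Size term

m_pp = tr_lake + tr_size           # estimated mPP(Y | X1, X2), with a(phi) = 1
chi_square = n * m_pp              # test statistic n * mPP-hat
ecd = m_pp / (m_pp + 1.0)          # ECD = mPP / (mPP + 1)
lake_to_size = tr_lake / tr_size   # relative size of the two contributions

print(round(chi_square, 3), round(ecd, 3), round(lake_to_size, 1))   # 79.935 0.267 2.4
```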

From this, the effect of $X_1$ and $X_2$ is significant. From (1.14), we can calculate the ECD as follows:

$$\widehat{\mathrm{ECD}}(X, Y) = \frac{0.258 + 0.107}{0.258 + 0.107 + 1} = 0.267 \quad (\mathrm{SE} = 0.042).$$

Although the effects of the factors are statistically significant, the predictive power of the logit model may be small, i.e. only 26.7% of the variation of the response variable in entropy is explained by the explanatory variables. The effect of Lake on Food is about 2.4 times greater than that of Size.

Table 2.5. Alligator Food Choice Data
Columns: Lake; Size of Alligator; Primary Food Choice (Fish, Invertebrate, Reptile, Bird, Other)
Rows: Hancock (≤ 2.3 m (S), > 2.3 m (L)); Oklawaha (S, L); Trafford (S, L); George (S, L)
[table values not recoverable]

3 Conclusion

In the GLM framework, regression models are described with random, systematic and link components, and GLMs are widely applied in data analyses;

however, the explanatory power of GLMs has not been measured in practical data analyses, except for the ordinary linear regression model. In this seminar, GLMs have been discussed from the viewpoint of entropy, ECD for measuring the explanatory or predictive power of GLMs has been introduced, and the utility of ECD has been shown for practical data analyses.

Acknowledgement

The author would like to thank Prof. Claudio Borroni and all the members of the Department of Quantitative Methods for Economics and Business Sciences, the University of Milan (Università degli Studi di Milano), for giving me a valuable opportunity to give the present seminar.

References

[1] Agresti, A. (1986). Applying R²-type measures to ordered categorical data, Technometrics, 28.
[2] Agresti, A. (2002). Categorical Data Analysis, Second Edition, John Wiley & Sons, Inc.: New York.
[3] Ash, A. & Shwartz, M. (1999). R²: A useful measure of model performance when predicting a dichotomous outcome, Statistics in Medicine, 18.
[4] Daniel, W. W. (1999). Biostatistics: A Foundation for Analysis in the Health Sciences, Seventh Edition, John Wiley & Sons, Inc.: New York.

[5] Efron, B. (1978). Regression and ANOVA with zero-one data: measures of residual variation, Journal of the American Statistical Association, 73.
[6] Eshima, N. & Tabata, M. (2007). Entropy correlation coefficient for measuring predictive power of generalized linear models, Statistics and Probability Letters, 77.
[7] Eshima, N. & Tabata, M. (2010). Entropy coefficient of determination for generalized linear models, Computational Statistics and Data Analysis, 54.
[8] Eshima, N. & Tabata, M. (2011). Three predictive power measures for generalized linear models: Entropy coefficient of determination, entropy correlation coefficient and regression correlation coefficient, Computational Statistics and Data Analysis, 55.
[9] Goodman, L. A. (1971). The analysis of multinomial contingency tables: stepwise procedures and direct estimation methods for building models for multiple classifications, Technometrics, 13.
[10] Haberman, S. J. (1982). Analysis of dispersion of multinomial responses, Journal of the American Statistical Association, 77.
[11] Kent, J. T. (1983). Information gain and a general measure of correlation, Biometrika, 70.

[12] Korn, E. L. & Simon, R. (1991). Explained residual variation, explained risk and goodness of fit, American Statistician, 45.
[13] McCullagh, P. & Nelder, J. A. (1989). Generalized Linear Models, Second Edition, Chapman and Hall: London.
[14] Mittlböck, M. & Schemper, M. (1996). Explained variation for logistic regression, Statistics in Medicine, 15.
[15] Nelder, J. A. & Wedderburn, R. W. M. (1972). Generalized linear models, Journal of the Royal Statistical Society A, 135.
[16] Patnaik, P. B. (1949). The non-central χ²- and F-distributions and their applications, Biometrika, 36.
[17] Theil, H. (1970). On the estimation of relationships involving qualitative variables, American Journal of Sociology, 76.
[18] Zheng, B. & Agresti, A. (2000). Summarizing the predictive power of a generalized linear model, Statistics in Medicine, 19.
