Designing a Pseudo R-Squared Goodness-of-Fit Measure in Generalized Linear Models


H. I. Mbachu
Dept. of Mathematics/Statistics, University of Port Harcourt, Port Harcourt

E. C. Nduka
Dept. of Mathematics/Statistics, University of Port Harcourt, Port Harcourt

M. E. Nja (Corresponding author)
Dept. of Mathematics/Statistics, Cross River University of Technology, Calabar

Received: December 19, 2011   Accepted: January 4, 2012   Published: April 1, 2012
doi: /jmr.v4n2p148   URL:

Abstract

The coefficient of determination is a function of residuals in the General Linear Models. The deviance, logit, standardized and studentized residuals were examined in generalized linear models in order to determine the behaviour of residuals in this class of models and thereby design a new pseudo R-squared goodness-of-fit measure. The Newton-Raphson estimation procedure was adopted. It was observed that these residuals exhibit patterns that are unique to the subpopulations defined by levels of categorical predictors. Residuals block on the basis of signs, where positive signs indicate success responses and negative signs indicate failure responses. It was also observed that the deviance is a close approximation of the studentized residual. The logit residual is two times the size of the standardized residual. Borrowing from Nagelkerke's improvement of Cox and Snell's goodness-of-fit measure in generalized linear models and the coefficient of determination counterpart of the general linear model, a new pseudo R-squared goodness-of-fit test which uses predicted probabilities and a monotonic link function is here proposed to serve both the Linear and the Generalized Linear Models.

Keywords: Deviance, Normalized residuals, Logit, Standardized residuals, Loglikelihood function, Response probability

1. Introduction

A generalized linear model is one in which each component of the response variable $Y$ has a distribution in the exponential family, taking the form

$$f_Y(y; \theta, \phi) = \exp\left\{\frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi)\right\}$$

for some specific functions $a(\cdot)$, $b(\cdot)$ and $c(\cdot, \cdot)$ (McCullagh & Nelder, 1990).
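As a quick illustration of the exponential family form, the sketch below checks numerically that a single binary response fits it. The choices $\theta = \log\{p/(1-p)\}$, $b(\theta) = \log(1 + e^{\theta})$, $a(\phi) = 1$ and $c(y, \phi) = 0$ are the standard ones for a Bernoulli response; the function names are illustrative and are not taken from the paper.

```python
import math

def bernoulli_pmf(y, p):
    """Probability of y (0 or 1) when the success probability is p."""
    return p ** y * (1.0 - p) ** (1 - y)

def exp_family_density(y, theta, phi=1.0):
    """exp{(y*theta - b(theta)) / a(phi) + c(y, phi)} for a Bernoulli response.

    Assumed components: b(theta) = log(1 + e^theta), a(phi) = phi, c = 0.
    """
    b = math.log1p(math.exp(theta))
    return math.exp((y * theta - b) / phi)

# The canonical parameter is the logit of the success probability.
p = 0.3
theta = math.log(p / (1.0 - p))
for y in (0, 1):
    print(y, bernoulli_pmf(y, p), exp_family_density(y, theta))
```

Both columns of the printed output should agree for each value of y, which is all the sketch is meant to show.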
The functions $a$ and $c$ are such that $a(\phi) = \phi/w$ and $c = c(y, \phi/w)$, where $w$ is a known weight for each observation. The model can be stated as

$$z_i = \sum_{j=1}^{p} x_{ij}\beta_j + e_i\,h'(\mu_i) = \sum_{j=1}^{p} x_{ij}\beta_j + (y_i - \mu_i)\,h'(\mu_i), \qquad i = 1, 2, 3, \ldots, n \quad (1)$$

where $z_i$ is the adjusted dependent variate, $x_{ij}$ is the $(i, j)$th element of the design matrix, $h(\mu_i)$ is the link function and $e_i$ is the residual error. The link between $y$ and $z$ is in this expression, where $y$ is a binomial random response variable. From (1), a residual in a generalized linear model can be defined as

$$e_i = \frac{z_i - \sum_{j} x_{ij}\beta_j}{h'(\mu_i)} \quad (2)$$

$e_i$, so defined, is called the Pearson residual.

Standard theory for this type of distribution expresses the mean and variance of the response $y$ as:

$$E(y) = b'(\theta) \quad \text{and} \quad \operatorname{var}(y) = b''(\theta)\,\frac{\phi}{w} = V(\mu)\,\frac{\phi}{w} \quad (3)$$

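where $V$ is the variance function. The log-likelihood function, a goodness-of-fit measure, is defined for the exponential family models. Generally, the log-likelihood function is of the form

$$L(y, \mu, \phi) = \sum_i \log f(y_i, \mu_i, \phi)$$

with individual contribution for the binomial function as

$$l_i = r_i \log(p_i) + (n_i - r_i)\log(1 - p_i)$$

2. The Newton-Raphson Method

The Newton-Raphson estimation scheme is given as

$$\beta^{k+1} = \beta^{k} - H^{-1} g$$

where $H$, the Hessian matrix, is given as

$$H = \left\{\frac{\partial^2 l}{\partial\beta_r\,\partial\beta_s}\right\}_{rs}$$

with

$$\frac{\partial^2 l}{\partial\beta_r\,\partial\beta_s} = \sum\left[(y - \mu)\,\frac{\partial}{\partial\beta_s}\left\{W\frac{d\eta}{d\mu}x_r\right\} + W\frac{d\eta}{d\mu}x_r\,\frac{\partial}{\partial\beta_s}(y - \mu)\right]$$

and

$$\frac{\partial^2 l}{\partial\beta_j^2} = \sum\left[(y - \mu)\,\frac{\partial}{\partial\beta_j}\left\{W\frac{d\eta}{d\mu}x_j\right\} + W\frac{d\eta}{d\mu}x_j\,\frac{\partial}{\partial\beta_j}(y - \mu)\right]$$

$$\frac{\partial l}{\partial\beta_j} = \frac{y - \mu}{a(\phi)}\,\frac{1}{V}\,\frac{d\mu}{d\eta}\,x_j = \frac{W}{a(\phi)}\,(y - \mu)\,\frac{d\eta}{d\mu}\,x_j$$

$l$, the loglikelihood for a binary response variable, can be written as

$$l = l(\beta; y) = \sum_i \sum_j y_i x_{ij}\beta_j - \sum_i m_i \log\left(1 + \exp\sum_j x_{ij}\beta_j\right)$$

$\eta = \beta_0 + \sum_j x_j\beta_j$ is the linear predictor. $W$, the weight matrix, is given as

$$W = \operatorname{diag}\left\{\frac{m_i\,(d\mu_i/d\eta_i)^2}{\mu_i(1 - \mu_i)}\right\}$$

where $m_i$ is the row subtotal in the cross tabulation table. The gradient vector $g$ is given as

$$g = \left(\frac{\partial l}{\partial\beta_0}, \frac{\partial l}{\partial\beta_1}, \ldots, \frac{\partial l}{\partial\beta_n}\right), \qquad \frac{\partial l}{\partial\beta_r} = \sum_i \frac{y_i - m_i\mu_i}{\mu_i(1 - \mu_i)}\,\frac{d\mu_i}{d\eta_i}\,x_{ir} = \sum_i (y_i - m_i\mu_i)\,x_{ir}$$

where the response or fitted probability $\mu_i$ is defined as

$$\mu_i = \frac{\exp\sum_j x_{ij}\beta_j}{1 + \exp\sum_j x_{ij}\beta_j}$$

An alternative estimation procedure is the Iterative Weighted Least Squares method, which is often adopted in order to avoid the computational tedium associated with the Hessian matrix.

3. Residuals in Generalized Linear Models

The coefficient of determination, $R^2$, is a function of the residual. It was originally developed for the normal-theory model. Cameron and Windmeijer (1996) designed an $R^2$ for the Poisson and related count data after observing that it was rarely used for count data. Nagelkerke (1991) generalized the definition of $R^2$ in what is called the generalized $R^2$. The generalized $R^2$ is consistent with the classical $R^2$ and is also maximized by the maximum likelihood estimation of a model. The generalized coefficient of determination is given as follows:

$$R^2 = 1 - \left(\frac{L(0)}{L(\hat{\theta})}\right)^{2/n}$$

where $L(0)$ is the likelihood of the model with only the intercept, $L(\hat{\theta})$ is the likelihood of the estimated model and $n$ is the sample size.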
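The Newton-Raphson scheme of Section 2 can be sketched for the logit link as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: for the logit link the gradient reduces to $X^{T}(y - m\mu)$ and the Hessian to $-X^{T}WX$ with $W = \operatorname{diag}\{m_i\mu_i(1 - \mu_i)\}$. The function name, the design matrix and the counts below are made up for the example and are not the values of Table 1.

```python
import numpy as np

def newton_raphson_logit(X, y, m, tol=1e-10, max_iter=50):
    """Fit a grouped-binomial logistic model by Newton-Raphson.

    X: (n, p) design matrix including an intercept column
    y: number of successes in each row
    m: row totals
    """
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        eta = X @ beta                      # linear predictor
        mu = 1.0 / (1.0 + np.exp(-eta))     # fitted probability
        g = X.T @ (y - m * mu)              # gradient vector
        W = m * mu * (1.0 - mu)             # diagonal of the weight matrix
        H = -(X.T * W) @ X                  # Hessian matrix
        step = np.linalg.solve(H, g)
        beta = beta - step                  # beta_{k+1} = beta_k - H^{-1} g
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Hypothetical 2 x 2 cross-classification: intercept, sex and location indicators.
X = np.array([[1.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
y = np.array([12.0, 20.0, 8.0, 15.0])   # made-up numbers of infected persons
m = np.array([40.0, 50.0, 35.0, 45.0])  # made-up row totals

beta_hat = newton_raphson_logit(X, y, m)
mu_hat = 1.0 / (1.0 + np.exp(-(X @ beta_hat)))
print(beta_hat, mu_hat)
```

At convergence the gradient is numerically zero, which is a convenient check that the iteration has reached the maximum likelihood estimate.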
Residuals in a logistic model can be defined as the difference between $y$ and the predicted probability $\theta$ for $y$. We define the predicted probability in cross-classified data as the probability that an object or a person selected from a subgroup is a success (Stokes et al., 1997).

$$\theta_i = \frac{\exp\{\beta_0 + \sum_j \beta_j x_{ij}\}}{1 + \exp\{\beta_0 + \sum_j \beta_j x_{ij}\}}$$

The monotonic link function relates the predicted probability to the set of linear predictors. For the logistic regression, where the underlying distribution is binomial, the link function is a logit. The deviance, Pearson $\chi^2$, standardized, logit and studentized residuals are the residuals normally associated with generalized linear models. The analysis of residuals made in this paper shows that the logit residual is approximately twice the size of the standardized residual. The standardized residual is approximately equal to the deviance residual. This can be seen in the appendix.

4. Goodness of Fit Measures in Generalized Linear Models

The deviance and the generalized Pearson $\chi^2$ statistic are two measures of goodness of fit in generalized linear models. Both the deviance and the generalized Pearson $\chi^2$ have exact $\chi^2$ distributions for Normal-theory linear models if the models are true (McCullagh & Nelder, 1990). The deviance uses the log of the ratio of likelihoods. Cox and Snell R squared, another measure of goodness of fit in generalized linear models, is a pseudo R squared and a modification of the deviance which configures the test interval to lie between 0 and 1 (excluding 1), such that a smaller ratio implies a greater improvement. The deviance for the set of distributions in generalized linear models is given as follows. For the normal distribution, it is stated as

$$D = \sum_i w_i (y_i - \mu_i)^2$$

For the Poisson, binomial and gamma we have

$$2\sum_i w_i\left[y_i \log\left(\frac{y_i}{\mu_i}\right) - (y_i - \mu_i)\right],$$

$$2\sum_i w_i m_i\left[y_i \log\left(\frac{y_i}{\mu_i}\right) + (1 - y_i)\log\left(\frac{1 - y_i}{1 - \mu_i}\right)\right]$$

and

$$2\sum_i w_i\left[-\log\left(\frac{y_i}{\mu_i}\right) + \frac{y_i - \mu_i}{\mu_i}\right]$$

respectively. For the inverse-Gaussian, multinomial and negative binomial, we have

$$\sum_i \frac{w_i (y_i - \mu_i)^2}{\mu_i^2\, y_i},$$

$$\sum_i \sum_j w_i\, y_{ij} \log\left(\frac{y_{ij}}{p_{ij} m_i}\right)$$

and

$$2\sum_i w_i\left[y_i \log\left(\frac{y_i}{\mu_i}\right) - \left(y_i + \frac{1}{k}\right)\log\left(\frac{y_i + 1/k}{\mu_i + 1/k}\right)\right]$$

respectively. Cox and Snell $R^2$ is defined as

$$R^2 = 1 - \left\{\frac{L(M_{int})}{L(M_{full})}\right\}^{2/N}$$

where $L(M_{int})$ is the conditional probability of the dependent variable for the intercept model.
If $L(M_{full})$ is 1 then $R^2 < 1$. The Nagelkerke/Cragg & Uhler's modification is

$$R^2 = \frac{1 - \left\{L(M_{int})/L(M_{full})\right\}^{2/N}}{1 - L(M_{int})^{2/N}}$$

In this paper a new goodness of fit test that makes use of fitted probabilities, a monotonic link function and the Nagelkerke range of possible values is proposed. The test is designed to serve both the general linear and the generalized linear models. It is given as follows:

$$R^2_{G\&G} = 1 - [h'(\theta)]^{-1}\,\frac{\sum (y - \theta)^2}{\sum (y - h(\theta))^2}$$
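The Cox and Snell measure and the Nagelkerke/Cragg & Uhler modification can be computed directly from the two maximized log-likelihoods, since $\{L(M_{int})/L(M_{full})\}^{2/N} = \exp\{2(l_{int} - l_{full})/N\}$. The sketch below assumes that form; the function names and the log-likelihood values are hypothetical and are not results from this paper.

```python
import math

def cox_snell_r2(loglik_int, loglik_full, n):
    """1 - {L(M_int) / L(M_full)}^(2/N), evaluated on the log scale."""
    return 1.0 - math.exp(2.0 * (loglik_int - loglik_full) / n)

def nagelkerke_r2(loglik_int, loglik_full, n):
    """Cox and Snell R^2 divided by its upper bound 1 - L(M_int)^(2/N)."""
    upper_bound = 1.0 - math.exp(2.0 * loglik_int / n)
    return cox_snell_r2(loglik_int, loglik_full, n) / upper_bound

# Hypothetical log-likelihoods for an intercept-only and a fitted model.
ll_int, ll_full, n_obs = -68.0, -60.0, 100

r2_cs = cox_snell_r2(ll_int, ll_full, n_obs)
r2_nk = nagelkerke_r2(ll_int, ll_full, n_obs)
print(r2_cs, r2_nk)
```

Working on the log scale avoids underflow when the likelihoods themselves are very small. When the fitted model is perfect for binary data (log-likelihood 0, that is $L(M_{full}) = 1$), the Cox and Snell value equals its upper bound and the Nagelkerke value is 1.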

$R^2_{G\&G}$, designed for the generalized linear models, can be adapted for use as a goodness of fit measure in the general linear model by replacing the fitted probabilities and the link function values with fitted $y$ values and the mean of $y$ respectively. The value of $R^2_{G\&G}$ ranges from 0 to 1, with higher values implying better fits.

5. Illustrative Example

The hypothetical data below is used for the illustration of residual analysis in generalized linear models:

<Table 1>

The probability that a person from the $i$th sex level and the $j$th location status is infected with a certain virus.

The model

Let $y_{ij}$ be a binomial random response variable corresponding to the $i$th sex status and the $j$th location, which assumes the value 0 or 1. The probability $\theta_{ij}$ that a person of the $i$th sex and $j$th location is infected by the virus is modeled as

$$\theta_{ij} = \frac{\exp[\beta_0 + \text{sex}(i) + \text{location}(j)]}{1 + \exp[\beta_0 + \text{sex}(i) + \text{location}(j)]}$$

where
$i = 1, 2$; $j = 1, 2$
$\beta_0$ = overall mean
sex($i$) = effect of the $i$th sex level = $\beta_1$
location($j$) = effect of the $j$th location status = $\beta_2$
$e_{ij}$ = random error associated with observation.

The Newton-Raphson estimates of the illustrative example are as follows:

Solution
$\beta_0$ =
$\beta_1$ = is the effect of the $i$th sex level.
$\beta_2$ = is the effect of the $j$th location status.

The pseudo R-squared goodness of fit test reveals the following results:

Cox and Snell $R^2$ =
Nagelkerke/Cragg & Uhler's $R^2$ =
The proposed $R^2_{G\&G}$ =

The outlined residuals associated with this example are shown in the appendix. It is observed that residuals exhibit unique patterns in accordance with subpopulations defined by levels of the categorical variables. Residuals form blocks on the basis of signs, where positive signs indicate success and negative signs indicate failure responses. The deviance and the studentized residuals exhibit very close residual patterns.
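The sign pattern described above can be reproduced on a small sketch. The code is not the authors' computation and does not use the appendix values: it assumes the usual definition of the deviance residual for a binary response (the signed square root of each observation's contribution to the deviance), and the responses and fitted probabilities are made up.

```python
import numpy as np

def raw_residuals(y, theta):
    """Difference between the observed response and the predicted probability."""
    return y - theta

def deviance_residuals(y, theta):
    """Signed square root of each observation's deviance contribution."""
    loglik_i = y * np.log(theta) + (1.0 - y) * np.log(1.0 - theta)
    return np.sign(y - theta) * np.sqrt(-2.0 * loglik_i)

# Hypothetical binary responses and fitted probabilities (not from Table 1).
y = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 0.0])
theta = np.array([0.30, 0.30, 0.45, 0.45, 0.20, 0.20])

res = raw_residuals(y, theta)
dev = deviance_residuals(y, theta)
print(np.column_stack([y, theta, res, dev]))
```

Because a fitted probability lies strictly between 0 and 1, every success has a positive residual and every failure a negative one, so residuals fall into blocks by sign within each subpopulation that shares a fitted probability.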
Stat Computing (2011) gave three interpretations of $R^2$ as follows:

(i) $R^2$ as explained variability: The denominator of the ratio indicates total variation in the dependent variable, while the numerator is the variability in the dependent variable that is not predicted by the model. Subtracting the ratio from one gives the proportion of the total variability explained by the model, which agrees with $R^2$ in Ordinary Linear Models (Koutsoyiannis, 1983). Thus a higher $R^2$ implies a better model.

(ii) $R^2$ as improvement from null model to fitted model: A smaller ratio implies a greater improvement.

(iii) $R^2$ as the square of the correlation: the correlation between the predicted values and the actual values. A higher $R^2$ implies a greater improvement of fit.

It can be seen that the proposed $R^2$ goodness-of-fit measure compares favourably with the Nagelkerke/Cragg & Uhler's $R^2$ (0.180 against 0.187).

6. Conclusion

The Nagelkerke/Cragg & Uhler's improvement of Cox and Snell $R^2$ is applicable in Generalized Linear Models only. The existing R squared goodness of fit measure in General Linear Models is not applicable in Generalized Linear Models. This is because the model estimates from Generalized Linear Models are maximum likelihood estimates, which are obtained by iterative procedures. They are not calculated to minimize variance, so the Ordinary Least Squares approach to goodness of fit does not apply. To evaluate goodness of fit in generalized linear models a pseudo $R^2$ is required. This paper introduces a new pseudo R squared goodness of fit measure which has the advantage of assessing goodness of fit in both linear and generalized linear models. The result shows that the new pseudo R-squared method designed in this paper compares favourably with the existing Nagelkerke/Cragg & Uhler's design.

Published by Canadian Center of Science and Education

References

Cameron, A. C., & Windmeijer, F. A. G. (1996). R-Squared Measures for Count Data Regression Models with Applications to Health-Care Utilization. Journal of Business and Economic Statistics.

Koutsoyiannis, A. (1983). Theory of Econometrics: An Introductory Exposition of Econometric Methods. 2nd Ed. The Macmillan Press Ltd, London.

McCullagh, P., & Nelder, J. A. (1990). Generalized Linear Models. Chapman and Hall. Madras.

Nagelkerke, N. (1991). A note on a General Definition of the Coefficient of Determination. Biometrika, 78(3), pp

Nja, M. E., & Bamiduro, T. A. (2006). Relative performance of Optimization Methods in Solutions of Generalized Linear Models. An unpublished Ph.D thesis, University of Ibadan, Nigeria.

Stokes, M. E., Davis, C. S., & Koch, G. G. (1997). Categorical Data Analysis using the SAS system. SAS Institute Inc., Cary, NC, USA.

Table 1. Hypothetical data

I | Sex ($x_1$) | Location ($x_2$) | Infected ($y$) | Not infected | Total ($m_i$)
1 | Female | Urban | | |
  | Female | Rural | | |
  | Male | Urban | | |
  | Male | Rural | | |

Appendix: Residuals

PRE_1 | COO_1 | LEV_1 | RES_1 | LRE_1 | SRE_1 | ZRE_1 | DEV_1 | DFB0_1 | DFB1_1 | DFB2_1


Key:

PRE_1: Predicted probability
COO_1: Analog of Cook's influence statistics
LEV_1: Leverage value
RES_1: Difference between observed and predicted probabilities
LRE_1: Logit Residual
SRE_1: Standard Residual
ZRE_1: Normalized Residual
DEV_1: Deviance value
DFB0_1: DFBeta for constant
DFB1_1: DFBeta for VAR00002(1)
DFB2_1: DFBeta for VAR00003(1)
