Chapter 8: Indicator Variables
Regression Analysis, Shalabh, IIT Kanpur

In general, the explanatory variables in a regression analysis are assumed to be quantitative in nature. For example, variables like temperature, distance and age are quantitative in the sense that they are recorded on a well-defined scale.

In many applications, the variables cannot be measured on a well-defined scale and are qualitative in nature. For example, variables like sex (male or female), colour (black, white), nationality and employment status (employed, unemployed) are defined on a nominal scale. Such variables have no natural scale of measurement. They usually indicate the presence or absence of a quality or an attribute, like employed or unemployed, graduate or non-graduate, smoker or non-smoker, yes or no, acceptance or rejection. Such variables can be quantified by artificially constructing variables that take the values, e.g., 1 and 0, where 1 usually indicates the presence of the attribute and 0 usually indicates its absence. For example, 1 indicates that the person is male and 0 indicates that the person is female. Similarly, 1 may indicate that the person is employed, and then 0 indicates that the person is unemployed.

Such variables classify the data into mutually exclusive categories. They are called indicator variables or dummy variables.

Usually, the indicator variables take the values 0 and 1 to identify the mutually exclusive classes of the explanatory variables. For example,

$$D = \begin{cases} 1 & \text{if the person is male} \\ 0 & \text{if the person is female,} \end{cases} \qquad D = \begin{cases} 1 & \text{if the person is employed} \\ 0 & \text{if the person is unemployed.} \end{cases}$$

Here we use the notation $D$ in place of $X$ to denote the dummy variable. The choice of 1 and 0 to identify a category is arbitrary. For example, one can also define the dummy variables in the above examples as
$$D = \begin{cases} 1 & \text{if the person is female} \\ 0 & \text{if the person is male,} \end{cases} \qquad D = \begin{cases} 1 & \text{if the person is unemployed} \\ 0 & \text{if the person is employed.} \end{cases}$$

It is also not necessary to choose only 1 and 0 to denote the categories. In fact, any two distinct values of $D$ will serve the purpose. The values 1 and 0 are preferred because they make the calculations simple, help in easy interpretation of the values and usually turn out to be a satisfactory choice.

In a given regression model, qualitative and quantitative variables can also occur together, i.e., some variables are qualitative and others are quantitative.

When all explanatory variables are
- quantitative, the model is called a regression model,
- qualitative, the model is called an analysis of variance model, and
- both quantitative and qualitative, the model is called an analysis of covariance model.

Such models can be dealt with within the framework of regression analysis. The usual tools of regression analysis can be used in the case of dummy variables.

Example: Consider the following model with $x_1$ as a quantitative variable and $D_2$ as an indicator variable:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 D_2 + \varepsilon, \quad E(\varepsilon) = 0, \quad Var(\varepsilon) = \sigma^2,$$

$$D_2 = \begin{cases} 0 & \text{if an observation belongs to group A} \\ 1 & \text{if an observation belongs to group B.} \end{cases}$$

The interpretation of the result is important. We proceed as follows:

If $D_2 = 0$, then

$$y = \beta_0 + \beta_1 x_1 + \beta_2 \cdot 0 + \varepsilon = \beta_0 + \beta_1 x_1 + \varepsilon,$$
$$E(y \mid D_2 = 0) = \beta_0 + \beta_1 x_1,$$

which is a straight line relationship with intercept $\beta_0$ and slope $\beta_1$.
If $D_2 = 1$, then

$$y = \beta_0 + \beta_1 x_1 + \beta_2 \cdot 1 + \varepsilon = (\beta_0 + \beta_2) + \beta_1 x_1 + \varepsilon,$$
$$E(y \mid D_2 = 1) = (\beta_0 + \beta_2) + \beta_1 x_1,$$

which is a straight line relationship with intercept $(\beta_0 + \beta_2)$ and slope $\beta_1$.

The quantities $E(y \mid D_2 = 0)$ and $E(y \mid D_2 = 1)$ are the average responses when an observation belongs to group A and group B, respectively. Thus

$$\beta_2 = E(y \mid D_2 = 1) - E(y \mid D_2 = 0),$$

which is interpreted as the difference between the average values of $y$ with $D_2 = 1$ and $D_2 = 0$, at the same value of $x_1$.

Graphically, it looks as in the following figure. It describes two parallel regression lines with the same variance $\sigma^2$.

[Figure: two parallel regression lines, one for each group.]

If there are three explanatory variables in the model, two of which are indicator variables $D_2$ and $D_3$, then they describe three levels, e.g., groups A, B and C. The levels of the indicator variables are as follows:

1. $D_2 = 0$, $D_3 = 0$ if the observation is from group A
2. $D_2 = 1$, $D_3 = 0$ if the observation is from group B
3. $D_2 = 0$, $D_3 = 1$ if the observation is from group C

The concerned regression model is

$$y = \beta_0 + \beta_1 x_1 + \beta_2 D_2 + \beta_3 D_3 + \varepsilon, \quad E(\varepsilon) = 0, \quad Var(\varepsilon) = \sigma^2.$$
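The three-group coding above can be sketched in a few lines of Python. This is only an illustration: the function names `encode_group` and `mean_response` and the coefficient values in `beta` are hypothetical and are not taken from the notes. The sketch treats group A as the reference level, so that $\beta_2$ and $\beta_3$ shift the intercept for groups B and C.

```python
def encode_group(group, levels=("A", "B", "C")):
    """Return the (m - 1) indicator values for `group`.

    The first entry of `levels` is the reference level and is coded as all zeros.
    """
    if group not in levels:
        raise ValueError(f"unknown group: {group!r}")
    return [1 if group == level else 0 for level in levels[1:]]


def mean_response(x1, group, beta):
    """E(y) = b0 + b1*x1 + b2*D2 + b3*D3 for the three-group model."""
    b0, b1, b2, b3 = beta
    d2, d3 = encode_group(group)
    return b0 + b1 * x1 + b2 * d2 + b3 * d3


# Hypothetical coefficients (beta0, beta1, beta2, beta3), chosen only for illustration.
beta = (10.0, 2.0, 3.0, -1.0)

for g in ("A", "B", "C"):
    # Same slope in every group; only the intercept changes.
    print(g, encode_group(g), mean_response(5.0, g, beta))
```

With these assumed coefficients, the mean response at $x_1 = 5$ differs from group A by exactly $\beta_2$ for group B and by exactly $\beta_3$ for group C, which is the intercept-shift interpretation described above.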
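In general, if a qualitative variable has $m$ levels, then $(m - 1)$ indicator variables are required, and each of them takes the values 0 and 1.

Consider the following examples to understand how to define such indicator variables and how they can be handled.

Example: Suppose $y$ denotes the monthly salary of a person and $D$ denotes whether the person is a graduate ($D = 1$) or a non-graduate ($D = 0$). The model is

$$y = \beta_0 + \beta_1 D + \varepsilon, \quad E(\varepsilon) = 0, \quad Var(\varepsilon) = \sigma^2.$$

With $n$ observations, the model is

$$y_i = \beta_0 + \beta_1 D_i + \varepsilon_i, \quad i = 1, 2, \ldots, n,$$
$$E(y_i \mid D_i = 0) = \beta_0,$$
$$E(y_i \mid D_i = 1) = \beta_0 + \beta_1,$$
$$\beta_1 = E(y_i \mid D_i = 1) - E(y_i \mid D_i = 0).$$

Thus
- $\beta_0$ measures the mean salary of a non-graduate.
- $\beta_1$ measures the difference between the mean salaries of a graduate and a non-graduate person.

Now consider the same model with two indicator variables defined in the following way:

$$D_{i1} = \begin{cases} 1 & \text{if the person is a graduate} \\ 0 & \text{if the person is a non-graduate,} \end{cases} \qquad D_{i2} = \begin{cases} 1 & \text{if the person is a non-graduate} \\ 0 & \text{if the person is a graduate.} \end{cases}$$

The model with $n$ observations is

$$y_i = \beta_0 + \beta_1 D_{i1} + \beta_2 D_{i2} + \varepsilon_i, \quad E(\varepsilon_i) = 0, \quad Var(\varepsilon_i) = \sigma^2, \quad i = 1, 2, \ldots, n.$$

Then we have

1. $E[y_i \mid D_{i1} = 0, D_{i2} = 1] = \beta_0 + \beta_2$: average salary of a non-graduate.
2. $E[y_i \mid D_{i1} = 1, D_{i2} = 0] = \beta_0 + \beta_1$: average salary of a graduate.
3. $E[y_i \mid D_{i1} = 0, D_{i2} = 0] = \beta_0$: cannot exist.
4. $E[y_i \mid D_{i1} = 1, D_{i2} = 1] = \beta_0 + \beta_1 + \beta_2$: cannot exist.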
Notice that in this case $D_{i1} + D_{i2} = 1$ for all $i$. This is an exact constraint that holds in both cases:

$$D_{i1} + D_{i2} = 1 \Rightarrow \text{person is a graduate } (D_{i1} = 1, D_{i2} = 0),$$
$$D_{i1} + D_{i2} = 1 \Rightarrow \text{person is a non-graduate } (D_{i1} = 0, D_{i2} = 1).$$

The column of ones belonging to the intercept term is therefore exactly the sum of the two indicator columns, so multicollinearity is present in such cases. Hence the rank of the matrix of explanatory variables falls short by 1. So $\beta_0$, $\beta_1$ and $\beta_2$ are indeterminate and the least squares method breaks down. Introducing two indicator variables may look useful, but it leads to serious consequences. This is known as the dummy variable trap.

If the intercept term is dropped, then the model becomes

$$y_i = \beta_1 D_{i1} + \beta_2 D_{i2} + \varepsilon_i, \quad E(\varepsilon_i) = 0, \quad Var(\varepsilon_i) = \sigma^2, \quad i = 1, 2, \ldots, n,$$

and then

$$E(y_i \mid D_{i1} = 1, D_{i2} = 0) = \beta_1 \Rightarrow \text{average salary of a graduate,}$$
$$E(y_i \mid D_{i1} = 0, D_{i2} = 1) = \beta_2 \Rightarrow \text{average salary of a non-graduate.}$$

So when the intercept term is dropped, $\beta_1$ and $\beta_2$ have proper interpretations as the average salaries of graduate and non-graduate persons, respectively.

Now the parameters can be estimated using the ordinary least squares principle, and the standard procedures for drawing inferences can be used.

Rule: When an explanatory variable classifies the data into $m$ mutually exclusive categories, use $(m - 1)$ indicator variables to represent it. Alternatively, use $m$ indicator variables but drop the intercept term.
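A small numerical sketch may help make the dummy variable trap concrete. The data below (`graduate`, `salary`) and the helper names `gram` and `det3` are hypothetical and chosen only for illustration. The sketch checks that $X'X$ is singular when the intercept and both indicators are used, and that without the intercept the least squares estimates reduce to the two group means.

```python
def gram(columns):
    """Return X'X for a design matrix given as a list of its columns."""
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in columns] for ci in columns]


def det3(m):
    """Determinant of a 3 x 3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


# Hypothetical data: 1 = graduate, 0 = non-graduate, with made-up salaries.
graduate = [1, 1, 0, 0, 1, 0]
salary = [52, 48, 30, 34, 50, 32]

ones = [1] * len(graduate)
d1 = graduate                       # D_i1: graduate indicator
d2 = [1 - g for g in graduate]      # D_i2: non-graduate indicator

# Intercept plus both indicators: ones = d1 + d2, so X'X is singular.
xtx_trap = gram([ones, d1, d2])
print("det(X'X) with intercept and two dummies:", det3(xtx_trap))

# No intercept: X'X is diagonal (group counts), so OLS gives the group means.
xtx = gram([d1, d2])
xty = [sum(d * y for d, y in zip(col, salary)) for col in (d1, d2)]
beta1 = xty[0] / xtx[0][0]          # mean salary of graduates
beta2 = xty[1] / xtx[1][1]          # mean salary of non-graduates
print("beta1 =", beta1, "beta2 =", beta2)
```

Because the off-diagonal entries of $X'X$ vanish in the no-intercept model (no person is both a graduate and a non-graduate), the normal equations decouple and each coefficient is simply the average salary in its own group.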
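Interaction term:

Suppose a model has two explanatory variables, one quantitative and the other an indicator variable. Suppose the two interact, and their interaction is added to the model as an additional explanatory variable:

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 D_{i2} + \beta_3 x_{i1} D_{i2} + \varepsilon_i, \quad E(\varepsilon_i) = 0, \quad Var(\varepsilon_i) = \sigma^2, \quad i = 1, 2, \ldots, n.$$

To interpret the model parameters, we proceed as follows. Suppose the indicator variable is given by

$$D_{i2} = \begin{cases} 0 & \text{if the } i\text{-th person belongs to group A} \\ 1 & \text{if the } i\text{-th person belongs to group B,} \end{cases}$$

and $y_i$ is the salary of the $i$-th person. Then

$$E(y_i \mid D_{i2} = 0) = \beta_0 + \beta_1 x_{i1} + \beta_2 \cdot 0 + \beta_3 x_{i1} \cdot 0 = \beta_0 + \beta_1 x_{i1}.$$

This is a straight line with intercept $\beta_0$ and slope $\beta_1$. Next,

$$E(y_i \mid D_{i2} = 1) = \beta_0 + \beta_1 x_{i1} + \beta_2 \cdot 1 + \beta_3 x_{i1} \cdot 1 = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) x_{i1}.$$

This is a straight line with intercept $(\beta_0 + \beta_2)$ and slope $(\beta_1 + \beta_3)$.

The model

$$E(y_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 D_{i2} + \beta_3 x_{i1} D_{i2}$$

therefore has different slopes and different intercept terms for the two groups. Thus

- $\beta_2$ reflects the change in the intercept term associated with the change in the group of the person, i.e., when the group changes from A to B.
- $\beta_3$ reflects the change in the slope associated with the change in the group of the person, i.e., when the group changes from A to B.

Fitting the model

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 D_{i2} + \beta_3 x_{i1} D_{i2} + \varepsilon_i$$

is equivalent to fitting two separate regression models corresponding to $D_{i2} = 1$ and $D_{i2} = 0$, i.e.,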
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 \cdot 1 + \beta_3 x_{i1} \cdot 1 + \varepsilon_i = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) x_{i1} + \varepsilon_i$$

and

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 \cdot 0 + \beta_3 x_{i1} \cdot 0 + \varepsilon_i = \beta_0 + \beta_1 x_{i1} + \varepsilon_i,$$

respectively.

Tests of hypotheses become convenient when an indicator variable is used. For example, if we want to test whether the two regression models are identical, the test of hypothesis involves testing

$$H_0: \beta_2 = \beta_3 = 0$$
$$H_1: \beta_2 \neq 0 \text{ and/or } \beta_3 \neq 0.$$

If $H_0$ is not rejected, a single model is sufficient to explain the relationship.

As another example, if the objective is to test whether the two models differ with respect to intercepts only and have the same slope, then the test of hypothesis involves testing

$$H_0: \beta_3 = 0$$
$$H_1: \beta_3 \neq 0.$$

Indicator variables versus quantitative explanatory variable

Quantitative explanatory variables can be converted into indicator variables. For example, if the ages of persons are grouped as follows:

Group 1: 1 day to 3 years
Group 2: 3 years to 8 years
Group 3: 8 years to 12 years
Group 4: 12 years to 17 years
Group 5: 17 years to 25 years

then the variable age can be represented by four different indicator variables.

Since it is difficult to collect data on individual ages, such grouping makes data collection easier. A disadvantage is that some loss of information occurs. For example, if the ages in years are 2, 3, 4, 5, 6, 7 and suppose the indicator variable is defined as
$$D_i = \begin{cases} 1 & \text{if the age of the } i\text{-th person is} > 5 \text{ years} \\ 0 & \text{if the age of the } i\text{-th person is} \le 5 \text{ years.} \end{cases}$$

Then these values become 0, 0, 0, 0, 1, 1. Now, looking at the value 1, one cannot determine whether it corresponds to age 6 or 7 years.

Moreover, if a quantitative explanatory variable is grouped into $m$ categories, then $(m - 1)$ parameters are required, whereas if the original variable is used as such, only one parameter is required.

Treating a quantitative variable as a qualitative variable increases the complexity of the model. The degrees of freedom for error are also reduced. This can affect the inferences if the data set is small. In large data sets, the effect may be small.

On the other hand, the use of indicator variables does not require any assumption about the functional form of the relationship between the study and explanatory variables.

Regression analysis and analysis of variance

The analysis of variance is often used in analyzing data from designed experiments. There is a connection between the statistical tools used in analysis of variance and those used in regression analysis.

We consider the case of analysis of variance in one-way classification and establish its relation with regression analysis.

One way classification:

Suppose there are $k$ samples, each of size $n$, from $k$ normally distributed populations $N(\mu_i, \sigma^2)$, $i = 1, 2, \ldots, k$. The populations differ only in their means but have the same variance $\sigma^2$. This can be expressed as

$$y_{ij} = \mu_i + \varepsilon_{ij}, \quad i = 1, 2, \ldots, k; \; j = 1, 2, \ldots, n$$
$$= \mu + (\mu_i - \mu) + \varepsilon_{ij}$$
$$= \mu + \tau_i + \varepsilon_{ij},$$

where $y_{ij}$ is the $j$-th observation for the $i$-th fixed treatment effect $\tau_i = \mu_i - \mu$ (or factor level), $\mu$ is the general mean effect, and the $\varepsilon_{ij}$ are identically and independently distributed random errors following $N(0, \sigma^2)$.
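Note that

$$\tau_i = \mu_i - \mu, \qquad \sum_{i=1}^{k} \tau_i = 0.$$

The null hypothesis is

$$H_0: \tau_1 = \tau_2 = \cdots = \tau_k = 0$$
$$H_1: \tau_i \neq 0 \text{ for at least one } i.$$

Employing the method of least squares, we obtain the estimators of $\mu$ and $\tau_i$ as follows:

$$S = \sum_{i=1}^{k} \sum_{j=1}^{n} \varepsilon_{ij}^2 = \sum_{i=1}^{k} \sum_{j=1}^{n} (y_{ij} - \mu - \tau_i)^2,$$

$$\frac{\partial S}{\partial \mu} = 0 \Rightarrow \hat{\mu} = \frac{1}{nk} \sum_{i=1}^{k} \sum_{j=1}^{n} y_{ij} = \bar{y},$$

$$\frac{\partial S}{\partial \tau_i} = 0 \Rightarrow \hat{\tau}_i = \frac{1}{n} \sum_{j=1}^{n} y_{ij} - \hat{\mu} = \bar{y}_i - \bar{y},$$

where

$$\bar{y}_i = \frac{1}{n} \sum_{j=1}^{n} y_{ij}.$$

Based on this, the corresponding test statistic is

$$F = \frac{\dfrac{n \sum_{i=1}^{k} (\bar{y}_i - \bar{y})^2}{k - 1}}{\dfrac{\sum_{i=1}^{k} \sum_{j=1}^{n} (y_{ij} - \bar{y}_i)^2}{k(n - 1)}},$$

which follows an $F$-distribution with $k - 1$ and $k(n - 1)$ degrees of freedom when the null hypothesis is true. The decision rule is to reject $H_0$ whenever $F \ge F_\alpha(k - 1, k(n - 1))$, and it is then concluded that the $k$ treatment means are not identical.

Connection with regression:

To illustrate the connection between fixed effect one-way analysis of variance and regression, suppose there are 3 treatments, so that the model becomes

$$y_{ij} = \mu + \tau_i + \varepsilon_{ij}, \quad i = 1, 2, 3; \; j = 1, 2, \ldots, n.$$

The 3 treatments are the three levels of a qualitative factor. For example, the temperature can have three possible levels: low, medium and high. They can be represented by two indicator variables as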
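$$D_1 = \begin{cases} 1 & \text{if the observation is from treatment 1} \\ 0 & \text{otherwise,} \end{cases} \qquad D_2 = \begin{cases} 1 & \text{if the observation is from treatment 2} \\ 0 & \text{otherwise.} \end{cases}$$

The regression model can be rewritten as

$$y_{ij} = \beta_0 + \beta_1 D_{1j} + \beta_2 D_{2j} + \varepsilon_{ij}, \quad i = 1, 2, 3; \; j = 1, 2, \ldots, n,$$

where

$D_{1j}$: value of $D_1$ for the $j$-th observation with the 1st treatment,
$D_{2j}$: value of $D_2$ for the $j$-th observation with the 2nd treatment.

Note that
- the parameters in the regression model are $\beta_0, \beta_1, \beta_2$.
- the parameters in the analysis of variance model are $\mu, \tau_1, \tau_2, \tau_3$.

We establish a relationship between the two sets of parameters.

Suppose treatment 1 is used on the $j$-th observation, so $D_{1j} = 1$, $D_{2j} = 0$ and

$$y_{1j} = \beta_0 + \beta_1 \cdot 1 + \beta_2 \cdot 0 + \varepsilon_{1j} = \beta_0 + \beta_1 + \varepsilon_{1j}.$$

In the analysis of variance model, this is represented as

$$y_{1j} = \mu + \tau_1 + \varepsilon_{1j} = \mu_1 + \varepsilon_{1j}, \quad \text{where } \mu_1 = \mu + \tau_1,$$
$$\Rightarrow \beta_0 + \beta_1 = \mu_1.$$

If treatment 2 is applied on the $j$-th observation, then

- in the regression model set up, $D_{1j} = 0$, $D_{2j} = 1$ and

$$y_{2j} = \beta_0 + \beta_1 \cdot 0 + \beta_2 \cdot 1 + \varepsilon_{2j} = \beta_0 + \beta_2 + \varepsilon_{2j},$$

- in the analysis of variance model set up,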
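$$y_{2j} = \mu + \tau_2 + \varepsilon_{2j} = \mu_2 + \varepsilon_{2j}, \quad \text{where } \mu_2 = \mu + \tau_2,$$
$$\Rightarrow \beta_0 + \beta_2 = \mu_2.$$

When treatment 3 is used on the $j$-th observation, then

- in the regression model set up, $D_{1j} = D_{2j} = 0$ and

$$y_{3j} = \beta_0 + \beta_1 \cdot 0 + \beta_2 \cdot 0 + \varepsilon_{3j} = \beta_0 + \varepsilon_{3j},$$

- in the analysis of variance model set up,

$$y_{3j} = \mu + \tau_3 + \varepsilon_{3j} = \mu_3 + \varepsilon_{3j}, \quad \text{where } \mu_3 = \mu + \tau_3,$$
$$\Rightarrow \beta_0 = \mu_3.$$

So finally, there are the following three relationships:

$$\beta_0 + \beta_1 = \mu_1, \qquad \beta_0 + \beta_2 = \mu_2, \qquad \beta_0 = \mu_3,$$

$$\Rightarrow \quad \beta_0 = \mu_3, \qquad \beta_1 = \mu_1 - \mu_3, \qquad \beta_2 = \mu_2 - \mu_3.$$

In general, if there are $k$ treatments, then $(k - 1)$ indicator variables are needed. The regression model is given by

$$y_{ij} = \beta_0 + \beta_1 D_{1j} + \beta_2 D_{2j} + \cdots + \beta_{k-1} D_{k-1,j} + \varepsilon_{ij}, \quad i = 1, 2, \ldots, k; \; j = 1, 2, \ldots, n,$$

where

$$D_{ij} = \begin{cases} 1 & \text{if the } j\text{-th observation gets the } i\text{-th treatment} \\ 0 & \text{otherwise.} \end{cases}$$

In this case, the relationship is

$$\beta_0 = \mu_k,$$
$$\beta_i = \mu_i - \mu_k, \quad i = 1, 2, \ldots, k - 1.$$

So $\beta_0$ always estimates the mean of the $k$-th treatment, and $\beta_i$ estimates the difference between the means of the $i$-th treatment and the $k$-th treatment.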
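The relationships $\beta_0 = \mu_3$, $\beta_1 = \mu_1 - \mu_3$ and $\beta_2 = \mu_2 - \mu_3$ can be checked numerically. The following is a minimal sketch on made-up data: the `samples` values and the helper `solve` are hypothetical, and exact fractions are used only to avoid rounding. It fits the regression on two indicator variables through the normal equations, compares the estimates with the treatment sample means, and evaluates the one-way analysis of variance $F$ statistic from its definition.

```python
from fractions import Fraction


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination (exact arithmetic)."""
    n = len(b)
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [row[n] for row in m]


# Hypothetical data: k = 3 treatments, n = 3 observations each.
samples = {1: [20, 22, 24], 2: [30, 29, 31], 3: [25, 27, 26]}
k, n = len(samples), 3

# Design matrix rows (1, D1, D2); treatment 3 is the reference level.
rows, y = [], []
for treatment, values in samples.items():
    for value in values:
        rows.append([1, int(treatment == 1), int(treatment == 2)])
        y.append(value)

# Normal equations X'X beta = X'y.
p = 3
xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
xty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(p)]
beta = solve(xtx, xty)

# Treatment means and grand mean.
means = {t: Fraction(sum(v), len(v)) for t, v in samples.items()}
grand = Fraction(sum(y), len(y))

# One-way ANOVA F statistic from its definition.
between = n * sum((means[t] - grand) ** 2 for t in samples) / (k - 1)
within = sum((v - means[t]) ** 2 for t, vals in samples.items() for v in vals) / (k * (n - 1))
f_stat = between / within

print("beta:", [str(b) for b in beta])
print("means:", {t: str(mu) for t, mu in means.items()})
print("F =", f_stat)
```

On this data the fitted intercept equals the sample mean of treatment 3, and the two slope estimates equal the differences of the other treatment means from it, so testing $\beta_1 = \beta_2 = 0$ in the regression addresses the same question as testing equality of the three treatment means.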