BAYESIAN APPROACH FOR SELECTION BIAS CORRECTION IN REGRESSION. Labeed Mokatrin. Submitted to the. Faculty of the College of Arts and Sciences


BAYESIAN APPROACH FOR SELECTION BIAS CORRECTION IN REGRESSION

By Labeed Mokatrin

Submitted to the Faculty of the College of Arts and Sciences of American University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in Statistics

Chair: Jun Lu, Ph.D.
Mary Gray, Ph.D.
Dean of the College of Arts and Sciences
Monica Jackson, Ph.D.
Date

American University, Washington, D.C. 20016

BAYESIAN APPROACH FOR SAMPLE SELECTION BIAS CORRECTION IN REGRESSION

By Labeed Mokatrin

ABSTRACT

Selection bias occurs when samples are self-selected rather than randomly selected from the target population. This is a well-known problem and has been extensively studied in statistics and economics. In this work I adopt a Bayesian approach to correct sample selection bias under the self-selection setup proposed in the Heckman model. Bayesian methods treat the population parameters of interest as random variables instead of unknown constants. The distributions of these random parameters are called prior distributions. Statistical inference is based on the posterior distribution, which combines information from the data and the prior. Markov Chain Monte Carlo (MCMC) methods are used for Bayesian computation of the posterior distributions. The results from the proposed Bayesian method are compared with those of Heckman's two-step estimation via various simulation studies. A comprehensive simulation study is conducted in which various scenarios are considered for the simulation setup and design. Furthermore, in addition to the most common self-selection setup, the new approach is extended to handle self-selection with a binary outcome model.

ACKNOWLEDGMENTS

First, I would like to thank my dissertation advisor, Dr. Jun Lu. His advice, insightful criticisms, and patient encouragement aided the writing of this dissertation in innumerable ways. I would also like to thank the committee members, Dr. Mary Gray and Dr. Monica Jackson. This work would not have been possible without their valued advice and suggestions. Dr. Gray's steadfast support was greatly needed and deeply appreciated. I would like to thank the faculty and staff members of the Department of Mathematics and Statistics at AU, in particular Dr. Behzad Jalala and Linda Greene, for all their help.

I am forever grateful to my family, Shreen, Adam, and Rany, for their continued support and encouragement. The numerous sacrifices they have made over the last few years allowed me to reach important milestones in my professional career. Finally, I would like to extend my gratitude to my mom and dad for all their support and enthusiasm for my education throughout my life.

TABLE OF CONTENTS

ABSTRACT
ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF ILLUSTRATIONS

Chapter

1. INTRODUCTION
   The Problem of Sample Selection Bias
   Heckman's Two-Step Method
   Limitations of Heckman's Method and Known Alternatives
   Bayesian Approach
2. SELF-SELECTED SAMPLING MODEL AND BIAS
   Self-Selected Sampling Model
   Self-Selected Sampling Bias
   MLE and Heckman Correction
3. MODEL DEVELOPMENT
   Brief Introduction to Bayesian Methods and MCMC
   Missing Values and Latent Variables
   Prior Distribution of Linear Component
   Sampling Scheme of Linear Component
4. SIMULATION STUDY
   Women Wage Example
   Comprehensive Simulation Study
5. CASE STUDY: PLACEMENT EXAM AND MATH ACHIEVEMENT
6. BINARY SELECTIVITY MODEL
   Generalized Linear Model
   Bayesian Estimation
   Data Example
7. DISCUSSION

REFERENCES

LIST OF TABLES

Table
1. Women Wage Data Variable Description
2. Estimation Using OLS and Heckman Models
3. Estimation Using Bayesian Methods
4. Simulation Results for Selection Rate
5. Simulation Results for Correlation Level
6. Simulation Results for Multicollinearity
7. Simulation Results for Sample Size
8. Variable Descriptions for Placement Exam Data
9. Estimation Results for Placement Exam Data
10. Estimation Results for Binary Selectivity

LIST OF ILLUSTRATIONS

Figure
1. Sample Paths for Comprehensive Simulation
2. Density Plots for Comprehensive Simulation
3. Moving Averages for Comprehensive Simulation
4. Sample Paths for Placement Exam Example
5. Density Plots for Placement Exam Example
6. Moving Averages for Placement Exam Example
7. Sample Paths for Binary Selectivity Model
8. Density Plots for Binary Selectivity Model
9. Moving Averages for Binary Selectivity Model

CHAPTER 1

INTRODUCTION

The Problem of Sample Selection Bias

In some sociology and economics studies, samples are self-selected rather than randomly selected from the target population. Bias can occur when using self-selected samples because the selection criteria are often correlated with the variables of interest. Such bias is often called sample selection bias. For example, Manski & Wise (1983) studied the relationship between SAT scores and potential college achievement. The researchers could only sample from students who were already admitted to college, but not from all students who could potentially go to college. In this case, students who scored well on the SAT were more likely to attend college and hence were more likely to be selected into the sample. Another example is the study of women's educational background and their earnings (Heckman 1979). Samples were selected from women with labor force participation. However, individuals only join the labor force if their potential earnings or occupational status meet some criteria. As a result, sampling from women in the labor force ignored the women who had low potential earnings.

Heckman's Two-Step Method

Heckman (1976) raised the issue of sample selection bias when a dependent

variable in the regression has values that are missing not at random. He proposed to approximate the full information maximum likelihood (FIML) estimate by way of a two-step method. This method is called Limited Information Maximum Likelihood (LIML). FIML is a well-known econometric technique for estimating equation models in which the parameters of all equations are estimated simultaneously with all the information in the model (Maddala 1977). Similar to FIML, LIML is a maximum likelihood method, but it estimates one structural equation, or a proper subset of structural equations, from a system of equations (Anderson and Rubin 1949). Heckman (1976) discussed the common structure of statistical models of limited dependent variables as well as a simple estimator for such models. He presented a unified summary of statistical models of sample selection and limited dependent variables.

Heckman (1979) proposed a solution to sample selection bias using the two-step estimator. In the first step, we use probit regression to model the sample selection process. A new variable called the Inverse Mills Ratio is calculated from the probit regression results. In the second step, we add the Inverse Mills Ratio to the regression analysis as an independent variable and simply use Ordinary Least Squares (OLS) to estimate the regression coefficients.

Heckman's two-step estimation procedure is easy to implement. It has been well recognized in applied fields such as economics and sociology as a correction for sample selection bias. Examples of the application of Heckman's two-step estimation can be found in Mroz (1987), Nawata (1994), and Leung and Yu (1996).

Limitations of Heckman's Method and Known Alternatives

Discussions of Heckman's two-step estimation and other approaches to sample selection bias have appeared frequently in the literature of the last two decades. Winship and Mare (1992) discussed the difficulties and limitations of several sample selection bias correction techniques. They show how self-selection leads to biased estimates in regression, review models that have been proposed, discuss Heckman's estimator and its limitations, and discuss other approaches to selection, such as nonparametric approaches to estimating selection models. They suggest that when selection is an issue, researchers should present estimates using a variety of methods, because the results may depend on the method used.

Nawata (1993) analyzes methods for estimating models with selection bias by comparing Maximum Likelihood Estimation (MLE) and Heckman's two-step estimator with Monte Carlo experiments. The results show that Heckman's two-step estimator can perform well when there is no multicollinearity between the Inverse Mills Ratio and the explanatory variables. However, it performs relatively poorly when multicollinearity exists, in which case MLE is more efficient.

Stolzenberg and Relles (1997) provide mathematical tools to assist intuition about selection bias in concrete empirical analysis. They indicate that there is no general solution to the selection bias problem, but they present a new decomposition of selection bias. With this decomposition the analyst should be able to develop intuition and make reasonable judgments about the source, severity, and direction of sample selection bias in a particular analysis. The authors also list several bias correction procedures that are

available. They suggest that the safest approach to the sample selection bias problem is first to understand how nonrandom selection occurs in the data. If the data seem to be selected as described by Heckman, then it is appropriate to use Heckman's two-step model.

Puhani (2000) discusses Monte Carlo studies of Heckman's correction and presents a critique of Heckman's estimator. He indicates that the two equations in Heckman's two-step model may have a large set of explanatory variables in common, which causes collinearity with the Inverse Mills Ratio. Puhani concludes that we should diagnose collinearity problems before deciding which estimator to use. If there is no collinearity between the regressors and the Inverse Mills Ratio, the Heckman two-part model is the most robust approach. On the other hand, if collinearity problems exist, the MLE approach is preferable to Heckman's two-step method.

Bayesian Approach

In this study we propose a Bayesian approach to correct sample selection bias under the self-selection setup proposed in Heckman (1979). Bayesian methods treat the population parameters of interest as random variables instead of unknown constants. The distributions of these random parameters are called prior distributions. Often both expert knowledge and mathematical convenience play a role in selecting a particular type of prior distribution. Statistical inference is based on the posterior distribution, which combines information from the data and the prior. We use Markov Chain Monte Carlo (MCMC) methods for Bayesian computation of the posterior distributions in this study. We also compare the performance of the proposed Bayesian method with that of Heckman's two-step estimation via a simulation study.

In the next chapter we give a more detailed introduction to the sample selection problem and Heckman's two-step estimation. Using the same assumptions on the data collection as Heckman, a Bayesian model to correct the selection bias is introduced in chapter three. A simulation study is presented in chapter four, where we demonstrate the proposed Bayesian method and compare the estimates from various approaches, first using the Women Wage data example and then using a comprehensive simulation study in which various scenarios are considered for the simulation setup and design. In chapter five we apply the Bayesian model to a real-world data example using AU students' placement exam data. In chapter six we extend the proposed Bayesian method to the Generalized Linear Model (GLM).

CHAPTER 2

SELF-SELECTED SAMPLING MODEL AND BIAS

Self-Selected Sampling Model

A famous example of sample selection bias is the estimation of the wage equation (Puhani 2000). When trying to estimate the effect of schooling on the wage rate, the researcher faces the problem that some individuals who have received schooling do not work. These individuals have not received an offer that meets their reservation wage. If we assume a positive relationship between schooling and wages, people with little schooling will on average have a lower offered wage, and therefore a lower employment rate, than those with more years of schooling. But we only observe the wage offers that exceed an individual's reservation wage. As a consequence, among people with few years of schooling we only observe the wages of those who receive comparatively high wage offers. In this case there is self-selected sampling and the OLS estimate is biased.

In this example, a simple OLS regression of wages on years of schooling will lead to biased estimates because the sample (working people) is unrepresentative of the population one is interested in (all people who have received schooling). The selection problem can be viewed as a problem of missing observations, except that they are not missing at random. A linear regression model with self-selected samples can be presented using the following two-equation model:

$$Y_i = X_i\beta + \varepsilon_i \tag{1}$$

$$Z_i = W_i\gamma + u_i \tag{2}$$

We call Equation 1 the observation equation and Equation 2 the selection equation. In the previous example, individuals who are only able to achieve a low wage rate given their level of schooling will decide not to work. Therefore the probability that their offered wage is below their reservation wage is highest. In other words, $\varepsilon_i$ and $u_i$ can be expected to be positively correlated, which causes sample selection bias.

When observations are missing at random, Equation 1 can still be estimated by OLS. Typically there are three causes of non-randomly missing observations: censoring, truncation, or self-selected sampling. A sample is censored when observations on $Y$ are not available in some range and are reported at a cutoff value, but the explanatory variables $X$ are all available. When observations on $X$ are also unavailable, the sample is said to be truncated. When self-selected sampling occurs, observations on $Y$ are recorded only if another variable $Z$ takes on a value above or below some cutoff value. In this article we discuss self-selected sampling.

$Y_i$ is observed only if $Z_i$ is greater than a cutoff value $C$, meaning the $i$th subject is selected. Without loss of generality we can assume the cutoff $C$ to be 0. From Equation 1 the population regression function is

$$E[Y_i \mid X_i] = X_i\beta \tag{3}$$

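The regression function for the incomplete sample is

$$E[Y_i \mid X_i, \text{selection rule}] = X_i\beta + E[\varepsilon_i \mid \text{selection rule}] = X_i\beta + E[\varepsilon_i \mid u_i > -W_i\gamma] \tag{4}$$

The last term in Equation 4 is equal to 0 if $\varepsilon_i$ and $u_i$ are uncorrelated, and not equal to 0 otherwise.

Depending on whether $Z_i$ is directly observed or not, we consider the following two scenarios.

Scenario A: Assuming $Z_i$ is fully observed, we have

$$Z_i = W_i\gamma + u_i \tag{5}$$

$$Y_i = \begin{cases} \text{missing} & \text{if } Z_i \le 0 \\ X_i\beta + \varepsilon_i & \text{otherwise} \end{cases} \tag{6}$$

Scenario B: Assuming $Z_i$ is not fully observed, we observe a dummy variable

$$D_i = \begin{cases} 0 & \text{if } Z_i \le 0 \\ 1 & \text{otherwise} \end{cases} \tag{7}$$

Hence we can write the observation equation as:

$$Y_i = \begin{cases} \text{missing} & \text{if } D_i = 0 \\ X_i\beta + \varepsilon_i & \text{if } D_i = 1 \end{cases} \tag{8}$$

In practice, model B is used more often than model A. We describe the bias that arises from each model in the next section.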

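Self-Selected Sampling Bias

Consider scenario A. The regression function for the subsample where the data are available can be written as:

$$E[Y_i \mid X_i, Z_i] = X_i\beta + E[\varepsilon_i \mid Z_i] = X_i\beta + E[\varepsilon_i \mid u_i] \tag{9}$$

Assuming $(\varepsilon_i, u_i)$ has a bivariate normal distribution,

$$\begin{pmatrix} \varepsilon_i \\ u_i \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_\varepsilon^2 & \rho\sigma_\varepsilon\sigma_u \\ \rho\sigma_\varepsilon\sigma_u & \sigma_u^2 \end{pmatrix} \right) \tag{10}$$

hence

$$f(\varepsilon, u) = \frac{1}{2\pi\sigma_\varepsilon\sigma_u\sqrt{1-\rho^2}} \exp\left\{ -\frac{z_\varepsilon^2 - 2\rho z_\varepsilon z_u + z_u^2}{2(1-\rho^2)} \right\}, \qquad z_\varepsilon = \frac{\varepsilon}{\sigma_\varepsilon},\; z_u = \frac{u}{\sigma_u} \tag{11}$$

Then the correction will have the following form:

$$E[\varepsilon_i \mid u_i] = \rho\,\frac{\sigma_\varepsilon}{\sigma_u}\,u_i \tag{12}$$

Now consider scenario B. The regression function for the subsample where the data are available can be written as:

$$E[Y_i \mid X_i, D_i = 1] = X_i\beta + E[\varepsilon_i \mid Z_i > 0] = X_i\beta + E[\varepsilon_i \mid u_i > -W_i\gamma] \tag{13}$$

Then the correction will have the following form:

$$E[\varepsilon_i \mid D_i = 1] = \rho\,\sigma_\varepsilon\,\frac{\phi(W_i\gamma/\sigma_u)}{\Phi(W_i\gamma/\sigma_u)} \tag{14}$$

where $\phi$ and $\Phi$ are the standard normal density and distribution functions, respectively.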
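The scenario B correction can be checked numerically. The sketch below is only an illustration, assuming standard-normal selection errors (unit variance for u) and hypothetical function names; it compares the closed-form value rho * sigma_eps * phi(c) / Phi(c) with a Monte Carlo estimate of E[eps | u > -c], where c stands for the selection index.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: pdf is phi, cdf is Phi


def inverse_mills(c):
    """Inverse Mills Ratio lambda(c) = phi(c) / Phi(c)."""
    return STD_NORMAL.pdf(c) / STD_NORMAL.cdf(c)


def simulated_truncated_mean(c, rho, sigma_eps, n=200_000, seed=0):
    """Monte Carlo estimate of E[eps | u > -c] for bivariate normal errors."""
    rng = random.Random(seed)
    total, kept = 0.0, 0
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)
        # eps has standard deviation sigma_eps and correlation rho with u
        eps = sigma_eps * (rho * u + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0))
        if u > -c:  # the unit is selected
            total += eps
            kept += 1
    return total / kept


# Illustrative values (assumptions, not taken from the text)
closed_form = 0.6 * 2.0 * inverse_mills(0.5)
simulated = simulated_truncated_mean(c=0.5, rho=0.6, sigma_eps=2.0)
print(round(closed_form, 3), round(simulated, 3))
```

The two printed numbers should agree up to simulation noise; setting rho to 0 makes both vanish, which corresponds to the ignorable-selection case.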

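We can write Equation 14 as:

$$E[Y_i \mid X_i, D_i = 1] = X_i\beta + \rho\,\sigma_\varepsilon\,\lambda(W_i\gamma), \qquad \lambda(W_i\gamma) = \frac{\phi(W_i\gamma/\sigma_u)}{\Phi(W_i\gamma/\sigma_u)} \tag{15}$$

Equation 15 highlights the omitted variable that causes OLS estimation of Equation 1 to be biased. The variable $\lambda$ is the hazard ratio, or the Inverse Mills Ratio. For both scenarios A and B, Equations 12 and 14 show that the estimated $\beta$ will be unbiased when $\varepsilon$ is uncorrelated with $u$ (that is, $\rho = 0$), so that the data are missing randomly, or the selection process is "ignorable".

In general, assume that $\varepsilon$ and $u$ follow a joint distribution function $f(\varepsilon, u; \theta)$, where $\theta$ is a finite set of parameters. Applying the Bayes rule, we can write:

$$E[\varepsilon \mid u > -W\gamma] = \frac{\int_{-\infty}^{\infty}\int_{-W\gamma}^{\infty} \varepsilon\, f(\varepsilon, u; \theta)\, du\, d\varepsilon}{\int_{-\infty}^{\infty}\int_{-W\gamma}^{\infty} f(\varepsilon, u; \theta)\, du\, d\varepsilon} = \lambda(W; \theta) \tag{16}$$

Here $\lambda$ could be a nonlinear function of $W$ and the parameters $\theta$. This means that the conditional expectation of $Y$, given $X$ and the probability that $Y$ is observed, will be equal to the usual regression function $X\beta$ plus a nonlinear function of the selection equation regressors $W$ that has non-zero mean, as we showed in Equation 14.

Therefore, when estimating $\beta$ we can conclude that the estimated intercept will be biased because the mean of the residuals is not zero. Also, if $X$ and $W$ are not completely uncorrelated (i.e., they have variables in common or they are correlated), the estimated slope coefficient will be biased because there is an omitted variable in the regression, namely $\lambda(W\gamma)$, that is correlated with the included variable $X$. We can see that even if $X$ and $W$ are independent, the fact that the data are nonrandomly missing will introduce heteroskedasticity into the error term, so OLS is not fully efficient.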
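A small simulation can make the intercept and slope bias concrete. The following is a sketch under assumed settings that do not come from the text: a single regressor x that enters both equations, true intercept 1 and slope 2, unit error variances, and selection when x + u > 0. All names are hypothetical.

```python
import random


def simple_ols(xs, ys):
    """Intercept and slope of a least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def selected_sample_fit(rho, n=20_000, seed=42):
    """Fit OLS using only the units that pass the selection rule Z > 0."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        u = rng.gauss(0.0, 1.0)
        eps = rho * u + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
        y = 1.0 + 2.0 * x + eps      # observation equation (assumed coefficients)
        z = x + u                    # selection equation shares the regressor x
        if z > 0:
            xs.append(x)
            ys.append(y)
    return simple_ols(xs, ys)


intercept_indep, slope_indep = selected_sample_fit(rho=0.0)
intercept_corr, slope_corr = selected_sample_fit(rho=0.8)
print("rho = 0.0:", round(intercept_indep, 2), round(slope_indep, 2))
print("rho = 0.8:", round(intercept_corr, 2), round(slope_corr, 2))
```

With uncorrelated errors the selected-sample fit stays close to the assumed truth, while with correlated errors the intercept drifts upward and the slope downward, which is the omitted-variable effect described above.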

MLE and Heckman Correction

There are two major existing approaches for estimating the self-selected sample model under the assumption of bivariate normality. The first method is FIML and the second is Heckman's well-known two-step procedure. We discuss each of these methods and adopt Heckman's method as a benchmark for simulation comparisons because of its popularity. We consider scenario B in the selection stage for both methods, since in practice it is more common to assume that $Z_i$ is not fully observed.

In the maximum likelihood approach we specify a complete model setup as in Equations 1 and 2, and we assume the following joint distribution for $(\varepsilon_i, u_i)$:

$$\begin{pmatrix} \varepsilon_i \\ u_i \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_\varepsilon^2 & \rho\sigma_\varepsilon\sigma_u \\ \rho\sigma_\varepsilon\sigma_u & \sigma_u^2 \end{pmatrix} \right) \tag{17}$$

We typically assume a bivariate normal distribution with zero means and correlation $\rho$. There is no generally accepted name for this model. The restriction $\sigma_u = 1$ is used to simplify the calculations of the likelihood function.

We divide the observations into groups according to the type of data observed. Each group of observations has a different form for the likelihood. For the sample selection model there are two types of observations: those where $Y_i$ is observed and we know that $Z_i > 0$, and those where $Y_i$ is not observed and we know that $Z_i \le 0$.

For those where $Y_i$ is observed and $Z_i > 0$, the likelihood function is the probability of the joint event $Y_i$ and $Z_i > 0$. We can write this probability for the $i$th observation as the following:

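$$P(Y_i, Z_i > 0) = f(Y_i)\,P(Z_i > 0 \mid Y_i) = f(\varepsilon_i)\,P(u_i > -W_i\gamma \mid \varepsilon_i) = f(\varepsilon_i)\int_{-W_i\gamma}^{\infty} f(u_i \mid \varepsilon_i)\,du_i \tag{18}$$

Thus the probability of an observation for which we see the data is the density function at the point $Y_i$ multiplied by the conditional probability distribution for $Z_i$ given the value of $Y_i$ that was observed.

For those where $Y_i$ is not observed and we know that $Z_i \le 0$, the likelihood function is just the marginal probability that $Z_i \le 0$. We have no independent information on $Y_i$. This probability is written as

$$P(Z_i \le 0) = P(u_i \le -W_i\gamma) = 1 - \Phi(W_i\gamma) \tag{19}$$

Therefore, if we assume the first $N_1$ observations have $Z_i > 0$ and the rest have $Z_i \le 0$, then the log likelihood for the complete sample of observations is the following: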

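$$\log L(\beta, \gamma, \rho, \sigma_\varepsilon; \text{data}) = \sum_{i=1}^{N_1}\left[ \log\Phi\left( \frac{W_i\gamma + \rho\,(Y_i - X_i\beta)/\sigma_\varepsilon}{\sqrt{1-\rho^2}} \right) + \log\phi\left( \frac{Y_i - X_i\beta}{\sigma_\varepsilon} \right) - \log\sigma_\varepsilon \right] + \sum_{i=N_1+1}^{N} \log\left[ 1 - \Phi(W_i\gamma) \right] \tag{20}$$

In the above log likelihood there are $N_0$ observations where we do not observe $Y_i$ and $N_1$ observations where we do observe $Y_i$, so that $N_0 + N_1 = N$. The parameter estimates for the sample selection model can be obtained by maximizing this likelihood function with respect to its arguments.

As an alternative to MLE, Heckman (1979) developed a two-step model that is widely used for sample selection bias. Heckman's model is based on two latent dependent variables. The steps of Heckman's estimation are:

a) Estimate $\gamma$ in Equation 2 using a probit model;

b) Use the estimated $\hat\gamma$ to calculate $\hat\lambda_i = E[u_i \mid u_i > -W_i\hat\gamma] = \phi(W_i\hat\gamma)/\Phi(W_i\hat\gamma)$;

c) Estimate $\beta$ and $\rho\sigma_\varepsilon$ in Equation 15 by replacing $E[u_i \mid u_i > -W_i\gamma]$ with $\hat\lambda_i$.

Estimation of Equation 15 by OLS gives consistent parameter estimates, but special formulas are needed to get correct standard errors because the errors $v_i$ are correlated.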
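Steps (a) to (c) can be sketched in a few dozen lines. This is a minimal illustration, not the implementation used in the study: it assumes simulated data with made-up coefficients, one outcome regressor x, one extra selection regressor w, a probit fitted by Fisher scoring, and hypothetical helper names. It returns point estimates only and does not compute the corrected standard errors mentioned above.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= factor * m[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def ols(rows, y):
    """Least-squares coefficients from the normal equations."""
    k = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    return solve(xtx, xty)


def probit(rows, d, iters=15):
    """Probit coefficients by Fisher scoring, started from zero."""
    k = len(rows[0])
    g = [0.0] * k
    for _ in range(iters):
        score = [0.0] * k
        info = [[0.0] * k for _ in range(k)]
        for r, di in zip(rows, d):
            eta = sum(a * b for a, b in zip(r, g))
            p = min(max(STD_NORMAL.cdf(eta), 1e-10), 1.0 - 1e-10)
            f = STD_NORMAL.pdf(eta)
            s = f * (di - p) / (p * (1.0 - p))
            w = f * f / (p * (1.0 - p))
            for a in range(k):
                score[a] += s * r[a]
                for b in range(k):
                    info[a][b] += w * r[a] * r[b]
        g = [gi + si for gi, si in zip(g, solve(info, score))]
    return g


def two_step(n=4000, rho=0.7, seed=7):
    """Simulate a selected sample and return [intercept, slope, coef on lambda]."""
    rng = random.Random(seed)
    sel_rows, d, kept_rows, y_obs = [], [], [], []
    for _ in range(n):
        x, w, u = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        eps = rho * u + math.sqrt(1.0 - rho ** 2) * rng.gauss(0, 1)
        selected = 0.5 * x + w + u > 0           # assumed selection equation
        sel_rows.append([1.0, x, w])
        d.append(1 if selected else 0)
        if selected:
            kept_rows.append([1.0, x, w])
            y_obs.append(1.0 + 2.0 * x + eps)    # assumed observation equation
    gamma = probit(sel_rows, d)                                  # step (a)
    design = []
    for r in kept_rows:
        index = sum(a * b for a, b in zip(r, gamma))
        lam = STD_NORMAL.pdf(index) / STD_NORMAL.cdf(index)      # step (b)
        design.append([1.0, r[1], lam])
    return ols(design, y_obs)                                    # step (c)


estimates = two_step()
print([round(v, 2) for v in estimates])
```

In this setup the variable w appears only in the selection equation, so the Inverse Mills Ratio is not driven by x alone; the coefficient on the ratio estimates rho times the outcome error standard deviation (0.7 under the assumed values).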

If $\rho = 0$, the usual formula provides a consistent estimate of the covariance matrix of the parameters in the second-stage regression.

Heckman suggests that we use the t-test of the coefficient on the variable $\lambda$ as a test of sample selection bias. Melino (1982) shows that this represents the optimal test of selectivity bias under the maintained distributional assumptions, as it is based on the same moment as the Lagrange multiplier test. That is, both the Lagrange multiplier test and the t-test for the coefficient on $\lambda$ are based on the correlation between the errors in the primary equation and the errors from the selection equation. Note that the Inverse Mills Ratio is the error from the probit equation explaining selection. In other words, Heckman's proposal is to estimate the Inverse Mills Ratio in a probit model and then estimate Equation 15 by OLS to obtain consistent estimates of $\beta$ and $\rho\sigma_\varepsilon$.

Although Heckman's two-step procedure gives a consistent estimator, various papers criticize its small sample properties. Many claim that the predictive power of subsample OLS or the two-part model is at least as good as that of Heckman's procedure or MLE. Here the two-part model gives the conditional expectation of wages. Duan (1984) contends that the conditional expectation is what is of interest to us. In addition, we interpret the coefficients of the two-part model in the same way as when we estimate the

wage equation by subsample OLS. Stolzenberg and Relles (1990) provide evidence that the higher the correlation between the error terms, the greater the superiority of the maximum likelihood (and possibly the OLS) estimator over Heckman's procedure in terms of efficiency.

The most important line of criticism of Heckman's procedure is based on practical rather than theoretical grounds. If the set of variables that affect the wage in the wage equation is almost identical to the set of variables that affect labor force participation in the selection equation, then the second step of Heckman's method is only identified through the nonlinearity of the Inverse Mills Ratio. In many practical cases we only observe values within the quasi-linear (not completely linear) range of the Inverse Mills Ratio. Then we need variables that are good predictors of labor force participation and do not appear in the wage equation, and these are difficult to find in practice.

Most studies find that the two-step approach can be unreliable in the absence of exclusion restrictions. Generally an exclusion restriction is required to generate credible estimates: there must be at least one variable that appears with a non-zero coefficient in the selection equation but does not appear in the equation of interest. If no such variable is available, it may be difficult to correct for sampling selectivity. Leung and Yu (1996) conclude that this result is due to experimental design. They find that Heckman's two-step estimator is effective provided that at least one of the explanatory variables displays sufficient variation to induce tail behavior in the Inverse Mills Ratio. Under certain circumstances, even when its assumptions and formal requirements are satisfied, the two-step selection bias

correction is known to produce estimates that are farther from the true parameter values than estimates obtained by uncorrected OLS. Puhani (2000) strongly recommends exploratory work to check for collinearity problems before deciding on which estimator to apply. If there is no collinearity between the regressors and the Inverse Mills Ratio, the Heckman two-part model is the most robust approach. On the other hand, if collinearity problems exist, the MLE approach is preferable to Heckman's two-step method.

In the next chapter we propose a Bayesian method to correct sample selection bias. We study its behavior by comparing our estimates to Heckman's estimates.
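For the MLE alternative mentioned above, the quantity to maximize is the selection-model log likelihood. The function below is a sketch of that log likelihood under the usual normalization of unit variance in the selection equation; the argument layout and the name `heckman_loglik` are assumptions made for illustration, and no optimizer is included.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def heckman_loglik(beta, gamma, rho, sigma_eps, data):
    """Log likelihood of the selection model.

    data is a list of tuples (y, x, w, d): x and w are regressor lists,
    d is 1 when y is observed and 0 otherwise (y is ignored when d == 0).
    """
    total = 0.0
    for y, x, w, d in data:
        index = sum(a * b for a, b in zip(w, gamma))          # W_i * gamma
        if d == 1:
            resid = (y - sum(a * b for a, b in zip(x, beta))) / sigma_eps
            # density of the observed outcome
            total += math.log(STD_NORMAL.pdf(resid)) - math.log(sigma_eps)
            # probability of selection given the observed outcome
            shifted = (index + rho * resid) / math.sqrt(1.0 - rho ** 2)
            total += math.log(STD_NORMAL.cdf(shifted))
        else:
            # marginal probability of not being selected
            total += math.log(STD_NORMAL.cdf(-index))
    return total


# Tiny illustrative data set: one selected and one unselected unit
toy_data = [(1.0, [1.0], [1.0], 1), (None, [1.0], [1.0], 0)]
print(heckman_loglik(beta=[1.0], gamma=[0.0], rho=0.0, sigma_eps=1.0, data=toy_data))
```

When rho is 0 the expression separates into an ordinary normal regression likelihood for the selected units plus a probit likelihood for the selection indicator, which is a convenient sanity check.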

CHAPTER 3

MODEL DEVELOPMENT

Brief Introduction to Bayesian Methods and MCMC

The Bayesian approach is fundamentally different from the conventional approach. In the conventional approach, a sample $X_1, \ldots, X_n$ is drawn from a population with an unknown but fixed parameter $\theta$. Knowledge about $\theta$ is obtained from the observed random sample. In the Bayesian approach, $\theta$ is considered to be a random variable, and its variation can be described by a probability distribution called the prior distribution. This is a subjective distribution based on the researcher's belief and is formulated before the data are seen, hence "prior". When a sample from a population indexed by $\theta$ is observed, the prior distribution is updated with the information in the sample. The updated prior is called the posterior distribution.

The Bayesian approach is concerned with generating the posterior distribution of the parameters, and it provides a more complete picture of the uncertainty in the estimation of unknown parameters, especially after the confounding effects of nuisance parameters are removed. A complete introduction to Bayesian analysis can be found in Lee (1997) and Draper (2000).

The foundation of Bayesian statistics is Bayes' Theorem, which is used to update the prior distribution into the posterior distribution. Bayes' Theorem is named after Thomas Bayes. He calculated the probability of a new event on the basis of earlier probability estimates

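that have been derived from empirical data. Bayes' work became the basis of a statistical technique that is now called Bayesian statistics, or Bayes' method. The basic principle of Bayes' theorem is as follows. If event $A$ occurred, the probability that event $E$ also occurred is

$$P(E \mid A) = \frac{p(A \mid E)\,p(E)}{p(A)} = \frac{p(A \mid E)\,p(E)}{p(A \mid E)\,p(E) + p(A \mid E^c)\,p(E^c)}$$

In structured modeling and analysis, Bayes' method can be written as the following equation. Assume we observe data $y$ from a distribution with parameter $\theta$, and we wish to make inference about $\theta$, which is itself a random variable drawn from some distribution $\pi(\theta)$. Then

$$p(\theta \mid y) = \frac{p(y \mid \theta)\,\pi(\theta)}{p(y)} = \frac{p(y \mid \theta)\,\pi(\theta)}{\int p(y \mid \theta)\,\pi(\theta)\,d\theta}$$

where $y$ is a vector of the observed data and $\theta$ is the vector of unknown parameters. The posterior probability conditional on $y$ is $p(\theta \mid y)$. The prior distribution is $\pi(\theta)$, and it can be informative or non-informative. An informative prior expresses specific, definite information about a variable. The likelihood function is $p(y \mid \theta)$ when it is regarded as a function of $\theta$ for a fixed $y$. The prior predictive distribution, also called the marginal distribution of $y$, is $p(y)$.

With Bayes' model we can obtain the posterior distribution $p(\theta \mid y)$ analytically by integrating the product of the likelihood and prior probability functions. For example, if $y = (y_1, \ldots, y_n)$ represents a random sample from $N(\mu, \sigma^2)$ with $\sigma^2$ known, then we have:

$$p(y \mid \mu) = (2\pi\sigma^2)^{-n/2} \exp\left\{ -\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu)^2 \right\}$$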

under the non-informative prior $\pi(\mu) \propto 1$, the posterior distribution is:

$$\mu \mid y \sim N\left( \bar{y}, \frac{\sigma^2}{n} \right)$$

Another approach is to use MCMC simulation to obtain the posterior distribution. Metropolis (1953) showed how this method helps in constructing a Markov chain with a given stationary distribution. The method was generalized by Hastings (1970) and is now widely used to sample from analytically intractable probability distributions arising in statistics (Gilks 1996; Robert and Casella 1999). The efficiency of MCMC methods is of significant practical importance and, loosely speaking, is determined by the convergence rate of the chain. In contrast to the maximum likelihood method, the MCMC Bayesian method is useful and reliable even for finite sample sizes, since convergence results depend only on the number of iterations.

The main advantage of Bayesian methodology is that in the absence of much data the prior distribution carries a lot of weight, but the more data that are observed, the less influence the prior distribution has on the posterior distribution. The most common criticism of Bayesian methodology is that, since there is no single correct prior distribution, all conclusions drawn from the posterior distribution are suspect.

We develop a Bayesian method for estimating the parameters of the self-selected sampling model. We implement MCMC methods and Gibbs sampling to facilitate

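computation for the posterior estimates. We also conduct a simulation study to determine the performance of the MCMC algorithm for various prior distributions.

Missing Values and Latent Variables

Recall scenario B in section 2.1, with latent variable $Z_i = W_i\gamma + u_i$, where $u_i \sim N(0, 1)$ and hence $Z_i \sim N(W_i\gamma, 1)$. One can show that

$$P(D_i = 1) = P(Z_i > 0) = \Phi(W_i\gamma)$$

where $\Phi$ is the cumulative distribution function (cdf) of $N(0, 1)$. The priors for the other parameters remain the same, and

$$\begin{pmatrix} Y_i \\ Z_i \end{pmatrix} \sim BVN\left( \begin{pmatrix} X_i\beta \\ W_i\gamma \end{pmatrix}, \Sigma \right), \qquad \Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{pmatrix}$$

Note that $Z_i$ is a latent variable and it needs to be sampled. Also, $Y_{m+1}, \ldots, Y_n$ are missing and they will be sampled as well. The steps below show how to sample them.

If $D_i = 1$, then $Y_i$ is observed, for $i = 1, \ldots, m$. Given initial or sampled (MCMC) values of $\beta$, $\gamma$, and $\Sigma$, we can show that $Z_i$ is normally distributed with the following conditional mean and conditional variance:

$$Z_i \mid Y_i, D_i = 1 \sim N\left( W_i\gamma + \frac{\sigma_{12}}{\sigma_{11}}(Y_i - X_i\beta),\; \sigma_{22} - \frac{\sigma_{12}^2}{\sigma_{11}} \right) \quad \text{and} \quad Z_i > 0$$

We can sample $Z_i$ from the truncated normal distribution given by the above equation and the restriction $Z_i > 0$.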
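The truncated draw for a selected unit can be sketched with the inverse-cdf method. This is only one possible way to sample a truncated normal, not necessarily the one used in the study; the function name and argument names (`xb` for the outcome mean, `wg` for the selection mean, `s11`, `s12`, `s22` for the covariance entries) are hypothetical.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def sample_latent_z_selected(y, xb, wg, s11, s12, s22, rng):
    """Draw Z given Y = y and Z > 0 for a selected unit (D = 1)."""
    cond_mean = wg + (s12 / s11) * (y - xb)
    cond_sd = math.sqrt(s22 - s12 ** 2 / s11)
    # Probability mass of the conditional normal that lies below the cutoff 0
    p_low = STD_NORMAL.cdf((0.0 - cond_mean) / cond_sd)
    # Uniform draw on (p_low, 1), mapped back through the inverse cdf
    p = p_low + rng.random() * (1.0 - p_low)
    p = min(max(p, 1e-12), 1.0 - 1e-12)   # keep inv_cdf inside its open domain
    return cond_mean + cond_sd * STD_NORMAL.inv_cdf(p)


rng = random.Random(1)
draws = [sample_latent_z_selected(0.0, 0.0, 0.0, 1.0, 0.0, 1.0, rng)
         for _ in range(20_000)]
print(min(draws), sum(draws) / len(draws))
```

The clamping step is a numerical guard; when the conditional mean is far below zero this simple approach loses accuracy, and a dedicated tail sampler would be a safer choice.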

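If $D_i = 0$, then $Y_i$ is not observed, for $i = m+1, \ldots, n$; hence we sample

$$\begin{pmatrix} Y_i \\ Z_i \end{pmatrix} \sim BVN\left( \begin{pmatrix} X_i\beta \\ W_i\gamma \end{pmatrix}, \Sigma \right)$$

with the restriction $Z_i \le 0$, since $D_i = 0$. One way to do this is to generate $(Y_i, Z_i)$ jointly until we get a sample with $Z_i \le 0$.

Now recall scenario A, where $Z_i$ is fully observed and $Y_i$ is not. The joint distribution is:

$$\begin{pmatrix} Y_i \\ Z_i \end{pmatrix} \sim BVN\left( \begin{pmatrix} X_i\beta \\ W_i\gamma \end{pmatrix}, \Sigma \right)$$

For the missing $Y_i$, $i = m+1, \ldots, n$, one can sample

$$Y_i \mid Z_i \sim N\left( X_i\beta + \frac{\sigma_{12}}{\sigma_{22}}(Z_i - W_i\gamma),\; \sigma_{11} - \frac{\sigma_{12}^2}{\sigma_{22}} \right)$$

where $\sigma_{11}$, $\sigma_{12}$, and $\sigma_{22}$ are the elements of $\Sigma$.

Prior Distribution of Linear Component

Consider the multivariate regression model

$$y_i = x_i\beta + \varepsilon_i, \qquad i = 1, \ldots, n$$

where $y_i$ is an $m$-vector of dependent variables for unit $i$; $x_i$ is an $m \times p$ matrix of independent variables for unit $i$; $\beta$ is a $p$-vector of regression coefficients; and $\varepsilon_i$ is the error term. The error terms are mutually independent random variables from a multivariate normal distribution with mean zero and covariance matrix $\Sigma$: $\varepsilon_i \sim N_m(0, \Sigma)$.

A well-accepted Bayesian approach is to consider the normal distribution as the prior of $\beta$ because it is quite flexible. We will consider the prior distributions:

$$\beta \sim N_p(\beta_0, \Sigma_0), \qquad \Sigma \sim IW_m(df, H)$$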
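The "generate jointly until the restriction holds" step for unselected units is a rejection sampler. A minimal sketch follows, assuming a 2 by 2 covariance matrix given by its entries and using hypothetical names throughout; the retry cap is an added safeguard, not something the text specifies.

```python
import math
import random


def sample_yz_unselected(xb, wg, s11, s12, s22, rng, max_tries=100_000):
    """Draw (Y, Z) from the bivariate normal, keeping the first draw with Z <= 0."""
    # Cholesky factor of [[s11, s12], [s12, s22]]
    a = math.sqrt(s11)
    b = s12 / a
    c = math.sqrt(s22 - b * b)
    for _ in range(max_tries):
        e1, e2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        y = xb + a * e1
        z = wg + b * e1 + c * e2
        if z <= 0.0:          # consistent with D = 0
            return y, z
    raise RuntimeError("no draw with Z <= 0; the selection mean may be too large")


rng = random.Random(5)
pairs = [sample_yz_unselected(0.0, 0.0, 1.0, 0.8, 1.0, rng) for _ in range(20_000)]
mean_y = sum(y for y, _ in pairs) / len(pairs)
print(round(mean_y, 3))
```

With positively correlated errors the imputed outcomes for unselected units average below the regression mean, mirroring the fact that the selected units average above it. Rejection becomes slow when the selection mean is large and positive, where a truncated draw of Z followed by a conditional draw of Y would be more efficient.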

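where $IW_m(df, H)$ is the $m$-dimensional inverted Wishart distribution with $df$ prior degrees of freedom and scale parameter $H$, with density function:

$$p(\Sigma \mid df, H) = \frac{|H|^{df/2}\,|\Sigma|^{-(df+m+1)/2}}{2^{\,df\,m/2}\;\pi^{m(m-1)/4}\prod_{j=1}^{m}\Gamma\left(\frac{df+1-j}{2}\right)} \exp\left\{ -\frac{1}{2}\,\mathrm{tr}\left(H\Sigma^{-1}\right) \right\}$$

Now recall Heckman's sample selection model (Equations 1 and 2):

$$Y_i = X_i\beta + \varepsilon_i$$

$$Z_i = W_i\gamma + u_i$$

For the Bayesian analysis of the above model, we assume the joint distribution of $\varepsilon_i$ and $u_i$ is bivariate normal:

$$\begin{pmatrix} \varepsilon_i \\ u_i \end{pmatrix} \sim N(0, \Sigma), \qquad \Sigma = \begin{pmatrix} \sigma_\varepsilon^2 & \rho\sigma_\varepsilon\sigma_u \\ \rho\sigma_\varepsilon\sigma_u & \sigma_u^2 \end{pmatrix}$$

where $\rho$ is the correlation coefficient. To compute this model we need to provide the prior distributions

$$\beta \sim N_p(\beta_0, \Sigma_\beta), \qquad \gamma \sim N_q(\gamma_0, \Sigma_\gamma)$$

Since $\rho$ is unknown, we use the Inverse Wishart distribution:

$$\Sigma \sim IW_m(df, H)$$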

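Sampling Scheme of Linear Component

Consider the multivariate normal distribution $y_i \sim N(x_i\beta, \Sigma)$, where $\beta \sim N(\beta_0, \Sigma_0)$ and $\Sigma \sim IW_m(df, H)$. The conditional posterior of $\beta$ can be calculated using

$$f(\beta \mid y, \Sigma) \propto f(y \mid \beta, \Sigma)\,f(\beta)$$

The conditional posterior is proportional to the exponent part, so we can write:

$$f(\beta \mid y, \Sigma) \propto \exp\left[ -\frac{1}{2}\sum_{i=1}^{n}(y_i - x_i\beta)^T\Sigma^{-1}(y_i - x_i\beta) \right] \exp\left[ -\frac{1}{2}(\beta - \beta_0)^T\Sigma_0^{-1}(\beta - \beta_0) \right]$$

Taking the first and second derivatives of $\log f$ with respect to $\beta$, we can find the conditional precision matrix $\Omega$. The first derivative is:

$$\frac{\partial \log f}{\partial \beta} = \sum_{i=1}^{n} x_i^T\Sigma^{-1}(y_i - x_i\beta) - \Sigma_0^{-1}(\beta - \beta_0)$$

and the second derivative is:

$$\frac{\partial^2 \log f}{\partial \beta\,\partial \beta^T} = -\left( \sum_{i=1}^{n} x_i^T\Sigma^{-1}x_i + \Sigma_0^{-1} \right)$$

Therefore the conditional precision matrix (the inverse of the conditional variance) is:

$$\Omega = \sum_{i=1}^{n} x_i^T\Sigma^{-1}x_i + \Sigma_0^{-1}$$

Setting the first derivative equal to zero, we can calculate the mean:

$$\sum_{i=1}^{n} x_i^T\Sigma^{-1}y_i + \Sigma_0^{-1}\beta_0 = \left( \sum_{i=1}^{n} x_i^T\Sigma^{-1}x_i + \Sigma_0^{-1} \right)\beta$$

thus

$$\hat\beta = \Omega^{-1}\left( \sum_{i=1}^{n} x_i^T\Sigma^{-1}y_i + \Sigma_0^{-1}\beta_0 \right)$$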

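Therefore

$$\beta \mid y, \Sigma \sim MVN\left( \Omega^{-1}\left[ \sum_{i=1}^{n} x_i^T\Sigma^{-1}y_i + \Sigma_0^{-1}\beta_0 \right],\; \Omega^{-1} \right), \qquad \Omega = \sum_{i=1}^{n} x_i^T\Sigma^{-1}x_i + \Sigma_0^{-1}$$

The above posterior mean is a weighted average of the data and the prior mean, with weights given by the data and prior precision matrices $\sum_i x_i^T\Sigma^{-1}x_i$ and $\Sigma_0^{-1}$.

The conditional posterior of $\Sigma$ can be calculated using

$$f(\Sigma \mid y, \beta) \propto f(y \mid \beta, \Sigma)\,f(\Sigma)$$

Thus the conditional posterior can be written as:

$$f(\Sigma \mid y, \beta) \propto |\Sigma|^{-n/2}\exp\left[ -\frac{1}{2}\sum_{i=1}^{n}(y_i - x_i\beta)^T\Sigma^{-1}(y_i - x_i\beta) \right] |\Sigma|^{-(df+m+1)/2}\exp\left[ -\frac{1}{2}\mathrm{tr}\left(H\Sigma^{-1}\right) \right]$$

The above can be expressed as:

$$f(\Sigma \mid y, \beta) \propto |\Sigma|^{-(n+df+m+1)/2}\exp\left[ -\frac{1}{2}\mathrm{tr}\left( \left\{ \sum_{i=1}^{n}(y_i - x_i\beta)(y_i - x_i\beta)^T + H \right\}\Sigma^{-1} \right) \right]$$

Also we can write:

$$f(\Sigma \mid y, \beta) \propto |\Sigma|^{-(n+df+m+1)/2}\exp\left[ -\frac{1}{2}\mathrm{tr}\left( (S + H)\Sigma^{-1} \right) \right], \qquad \text{where } S = \sum_{i=1}^{n}(y_i - x_i\beta)(y_i - x_i\beta)^T$$

This leads to

$$\Sigma \mid y, \beta \sim IW_m(df + n,\; S + H)$$

The model developed in this section will be used to analyze data examples and simulated data in the next chapter. We show the results of the MCMC simulations carried out and evaluate the sampling properties of the estimators discussed in this section.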
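The two conditional posteriors can be alternated in a Gibbs sampler. The sketch below is a deliberately reduced special case chosen for illustration, not the bivariate selection model itself: a single equation with one coefficient (so the covariance matrix is a scalar variance and the one-dimensional inverse Wishart reduces to a scaled inverse chi-square). Prior values, data, and function names are assumptions.

```python
import math
import random


def beta_conditional(xs, ys, sigma2, beta0, tau0_sq):
    """Conditional posterior mean and variance of a scalar coefficient."""
    precision = sum(x * x for x in xs) / sigma2 + 1.0 / tau0_sq   # data + prior
    mean = (sum(x * y for x, y in zip(xs, ys)) / sigma2 + beta0 / tau0_sq) / precision
    return mean, 1.0 / precision


def gibbs(xs, ys, n_iter=2000, beta0=0.0, tau0_sq=100.0, df=3.0, h=1.0, seed=3):
    """Alternate draws of beta | sigma2 and sigma2 | beta."""
    rng = random.Random(seed)
    n = len(xs)
    sigma2 = 1.0
    draws = []
    for _ in range(n_iter):
        mean, var = beta_conditional(xs, ys, sigma2, beta0, tau0_sq)
        beta = rng.gauss(mean, math.sqrt(var))
        s = sum((y - x * beta) ** 2 for x, y in zip(xs, ys))
        # One-dimensional inverse Wishart(df + n, h + s):
        # (h + s) divided by a chi-square draw with df + n degrees of freedom
        sigma2 = (h + s) / rng.gammavariate((df + n) / 2.0, 2.0)
        draws.append((beta, sigma2))
    return draws


# Simulated data with an assumed true slope of 2 and unit error variance
data_rng = random.Random(0)
xs = [data_rng.gauss(0.0, 1.0) for _ in range(200)]
ys = [2.0 * x + data_rng.gauss(0.0, 1.0) for x in xs]
kept = gibbs(xs, ys)[500:]            # discard the first 500 draws as burn-in
post_beta = sum(b for b, _ in kept) / len(kept)
print(round(post_beta, 3))
```

In the full model the same pattern applies with matrix quantities, plus the extra steps that draw the latent selection variable and the missing outcomes before each parameter update.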

CHAPTER 4

SIMULATION STUDY

In this chapter we apply the proposed Bayesian approach by conducting two simulation studies, using MCMC methods and Gibbs sampling to facilitate computation of posterior estimates. The first simulation study uses an artificial data example to determine the performance of MCMC using various priors. The second is a comprehensive simulation study using generated data, with the ability to test various data scenarios such as sample missing rate, residual correlation, multicollinearity, and sample size.

Women Wage Example

Discussions in the context of labor economics concerning the labor force population, wages, and earnings highlight the importance of sample selection. One representative example is the estimation of women's wages. Since we only observe the wages of women who enter the workforce, our sample represents only one part of the wage offer distribution. Other secondary wage groups, such as married women and teenagers, are not represented. Therefore estimation procedures may involve certain bias when applied to the secondary wage groups. This is an example of self-selected sampling bias.

We propose a Bayesian MCMC algorithm to estimate the parameters of a self-selected sampling model. We consider the labor force example in the STATA user manual. These data are used to illustrate Heckman's approach by predicting women's wages from their education and age. To evaluate the performance of the proposed Bayesian approach and compare it with other methods, we simulate sample selection as it was

specified in the example and perform the MCMC algorithm using different Wishart prior specifications.

MCMC methods use simulation of Markov chains in the parameter space. The Markov chains are defined in such a way that the posterior distribution in the given statistical inference problem is the asymptotic distribution of the chain. This allows using averages of the simulated values to approximate the desired posterior expectations. Several standard algorithms to define such Markov chains exist, including Gibbs sampling and Metropolis-Hastings. Using these algorithms it is possible to implement posterior simulation in essentially any problem that allows pointwise evaluation of the prior distribution and likelihood function.

The data contain a sample of 2000 observations of 5 variables. A brief description of the variables that are relevant for our analysis is shown in Table 1. From among the 2000 observations, we observe wage data for only 1343. The remaining 657 women were not in the paid work force and so did not receive wages.

We are interested in modeling two things: the decision of the women to enter the labor force, and the women's hourly wage. We will consider a reasonable assumption that a woman's decision to enter the labor force is a function of age, marital status, the number of children, and her level of education. Also, the wage rate a woman earns is a function of her age and education.
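The idea that chain averages approximate posterior expectations can be illustrated with a random-walk Metropolis sampler. The example is a sketch on a toy problem that is not part of the wage analysis: the posterior of a normal mean with known unit variance and a flat prior, for which the exact posterior mean equals the sample mean. The data, step size, and names are assumptions.

```python
import math
import random


def log_posterior(mu, y, sigma=1.0):
    """Log posterior of a normal mean under a flat prior (up to a constant)."""
    return -sum((yi - mu) ** 2 for yi in y) / (2.0 * sigma ** 2)


def metropolis(y, n_iter=5000, step=0.5, seed=1):
    """Random-walk Metropolis chain started at mu = 0."""
    rng = random.Random(seed)
    mu = 0.0
    current = log_posterior(mu, y)
    chain = []
    for _ in range(n_iter):
        proposal = mu + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, y)
        # Accept with probability min(1, posterior ratio)
        if rng.random() < math.exp(min(0.0, candidate - current)):
            mu, current = proposal, candidate
        chain.append(mu)
    return chain


toy_y = [i / 10.0 for i in range(50)]          # fixed toy data, mean 2.45
chain = metropolis(toy_y)
kept = chain[1000:]                            # drop burn-in draws
chain_mean = sum(kept) / len(kept)
print(round(chain_mean, 3), sum(toy_y) / len(toy_y))
```

Only pointwise evaluation of the unnormalized log posterior is needed, which is why the same scheme carries over to models whose posterior has no closed form.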

Table 1
Women Wage Data Variable Description

Variable Name    Definition
Age              Age of the woman
Education        Number of years of education of the woman
Married          Dummy variable equal to 1 if the woman is married, 0 otherwise
Children         Number of children that the woman has in her household
Wage             Hourly wage of the woman

We begin with OLS estimation of the regression model using only the observations that have wage data. The estimates can be found in Table 2 (see the "OLS Selected Wage" row). This analysis would be fine if, in fact, the missing wage data were missing completely at random. However, the decision to work or not work was made by the individual woman. Thus those who were not working constitute a self-selected sample and not a random sample. It is likely that some of the women who would have earned low wages chose not to work. If so, this would account for much of the missing wage data. Thus it is likely that we will over-estimate the wages of the women in the population. So we need to account for the information that we have on the non-working women. We attempt to do this by replacing the missing values of the wage variable with zeros. The estimates can be found in Table 2 (see the "OLS Non-Missing Wage" row).
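The direction of the two biases discussed above can be checked on synthetic data. The following is a minimal sketch under assumed settings, not the Stata wage data: the coefficients, the error correlation of 0.7, the selection rule, and the variable names are all hypothetical. It compares OLS on the full outcome, OLS on the self-selected subset, and OLS after replacing unobserved outcomes with zeros.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000

# Hypothetical data-generating process (assumed values, not the wage data).
x = rng.uniform(0, 1, n)              # covariate in the outcome equation
w = rng.uniform(0, 1, n)              # covariate that only drives selection
rho = 0.7                             # correlation between the two error terms
cov = np.array([[1.0, rho], [rho, 1.0]])
eps, u = rng.multivariate_normal([0.0, 0.0], cov, size=n).T

y_full = 1.0 + 2.0 * x + eps          # outcome for everyone
selected = (0.5 + w + u) > 0          # outcome is observed only where True

def ols(X, y):
    """Least-squares coefficients."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

X = np.column_stack([np.ones(n), x])
beta_full = ols(X, y_full)                              # benchmark: full data
beta_selected = ols(X[selected], y_full[selected])      # selected subset only
y_zero = np.where(selected, y_full, 0.0)                # zeros for unobserved
beta_zero = ols(X, y_zero)

mean_full = y_full.mean()
mean_selected = y_full[selected].mean()
mean_zero = y_zero.mean()

print("intercept, slope (full):    ", beta_full.round(3))
print("intercept, slope (selected):", beta_selected.round(3))
print("intercept, slope (zeros):   ", beta_zero.round(3))
print("mean outcome full / selected / zeros:",
      round(mean_full, 3), round(mean_selected, 3), round(mean_zero, 3))
```

With positively correlated errors, the selected subset over-represents high outcomes, while the zero replacement pulls both the mean and the slope downward.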

Table 2
Estimation Using OLS and Heckman Models

Columns: Method, Parameter, Parameter Estimate, Standard Error, Bias.
Rows: for each parameter (Intercept, Education, Age, Inverse Mills), one row per method (OLS Full Wage, OLS Selected Wage, OLS Non-Missing Wage, Heckman). Bias is NA for OLS Full Wage; all Inverse Mills entries are NA for the three OLS methods, and the Inverse Mills Bias is NA for Heckman.

The zero-replacement analysis is also troubling. It is true that we are using data from all 2,000 women, but zero is not a fair estimate of what the women would have earned if they had chosen to work. It is likely that this model will under-estimate the wages of women in the population.

The solution to our quandary is to use the Heckman selection model (Heckman, 1979). The Heckman selection model allows us to use information from non-working women to improve the estimates of the parameters in the regression model. The Heckman selection model provides consistent, asymptotically efficient estimates for all parameters in the model. In our example we have one model predicting wages and one model predicting whether a woman will be working. We will use marital status, children,

education, and age to predict selection. In addition to the two equations, Heckman estimates ρ (actually the inverse hyperbolic tangent of ρ), the correlation of the residuals in the two equations, and σ (actually the log of σ), the standard error of the residuals of the wage equation. Then λ = ρσ. The estimates can be found in Table 2 (see the "Heckman" row).

Recall that we do have full wage information on all 2,000 women. We can therefore run a regression using the full wage information to use as a comparison. The estimates can be found in Table 2 (see the "OLS Full Wage" row). The Selected Wage model tends to over-estimate wages; the Non-Missing Wage model tends to severely under-estimate wages; and the Heckman model does the best job in predicting wages.

Finally, we consider the Bayesian approach to predicting women's wages from their education and age. In this approach the posterior distributions are too complicated to evaluate analytically. However, by using MCMC methods and Gibbs sampling, the posterior distribution can be sampled indirectly by generating a sample of parameter values from the conditional distributions of interest. Posterior Bayes estimates are then obtained from the generated samples.

We estimate the parameters using the MCMC algorithm as follows:

1. Use the Bayesian approach with many loops of Gibbs sampling / MCMC to repeatedly sample from the conditional distributions:
   - Sample the latent variable in the selection stage
   - Update the missing y's and the mean prediction of the y's
   - Update the regression coefficients jointly
   - Update the error covariance structure using a Wishart prior, which leads to a Wishart posterior

2. Repeat the above steps 500 times

The goal is to see how this method performs when we use different priors and different correlations of the error terms in the two-equation model. We run the algorithm for 500 iterations after convergence, discarding the first 500 iterations. The estimates are in Table 3. Comparing the results in Table 2 and Table 3, we find that the Bayesian approach provides estimates that are at least as accurate as Heckman's estimates and better than OLS using selected wages.

The Wishart distribution is an informative prior, so the posterior mean will be affected by the prior choice. We perform our analysis under two instances of a Wishart prior: Wish(3, H) and Wish(4, H). We also change the inverse scale matrix in the Wishart prior by using different values of σ.

The results in Table 3 show that the parameters did not change much compared with the results in Table 2. This indicates that the Bayesian approach performs as well as the Heckman approach, or even better in some scenarios. We can see good estimates under Wishart(3, H) with Sigma0000. The standard error and bias are low compared to the other estimates.
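The last update in the algorithm above relies on Wishart conjugacy. The sketch below shows one such draw in isolation, under stated assumptions: the prior is placed on the precision matrix (the inverse of the error covariance) with 3 degrees of freedom and an identity scale, the residuals are simulated stand-ins, and the helper names are hypothetical. The parameterization used in this work may differ, and the full sampler for the selection model is not reproduced here.

```python
import numpy as np

def draw_wishart(df, scale, rng):
    """Draw from Wishart(df, scale) for integer df as a sum of outer products."""
    z = rng.multivariate_normal(np.zeros(scale.shape[0]), scale, size=df)
    return z.T @ z

def update_precision(resid, prior_df, prior_scale, rng):
    """One Gibbs draw of the error precision matrix given n x 2 residuals.

    Assumed prior:  precision ~ Wishart(prior_df, prior_scale)
    Posterior:      Wishart(prior_df + n, (prior_scale^-1 + resid' resid)^-1)
    """
    n = resid.shape[0]
    post_scale = np.linalg.inv(np.linalg.inv(prior_scale) + resid.T @ resid)
    post_scale = (post_scale + post_scale.T) / 2.0   # guard against round-off
    return draw_wishart(prior_df + n, post_scale, rng)

rng = np.random.default_rng(1)

# Stand-in residuals from the two equations, with error correlation 0.5.
true_cov = np.array([[1.0, 0.5], [0.5, 1.0]])
resid = rng.multivariate_normal([0.0, 0.0], true_cov, size=5000)

precision = update_precision(resid, prior_df=3, prior_scale=np.eye(2), rng=rng)
sigma_draw = np.linalg.inv(precision)   # implied draw of the error covariance

print(sigma_draw.round(3))
```

With many residuals the draw is dominated by the data; with few, the prior degrees of freedom and scale matter more, which is the sensitivity the comparison of priors is meant to probe.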

Table 3
Estimation Using Bayesian Methods

Columns: Wishart Prior, Inverse Scale Matrix, then Parameter Estimate, Standard Error, and Bias for each of Intercept, Education, and Age.
Rows: the two Wishart priors (3 and 4 degrees of freedom), each under six inverse scale matrices H.

Comprehensive Simulation Study

Here we conduct a more comprehensive simulation study to further investigate the effect of prior distributions and the robustness of the Bayesian approach. Unlike the work in the previous section, the data sets here are generated under several specific conditions. More specifically, I consider the effects of the fraction of selection, the correlation between the selection and regression models, the sample size, and the fraction of the independent variables that appear in both the selection and regression models. The simulated data are analyzed by both Heckman's two-stage estimator and the Bayesian methods proposed above. The estimates from both methods are evaluated and compared in terms of Bias and RMSE (Root Mean Square Error).

The Bias of an estimator is the difference between the estimator's expected value and the true value of the parameter being estimated. An estimator or decision rule with zero bias is called unbiased; otherwise the estimator is said to be biased. RMSE is built from sums of squares: in a regression fit, the Sum of Squares Total (SST) measures how far the data are from the mean, and the Sum of Squares Error (SSE) measures how far the data are from the model's predicted values. For a parameter estimate, the RMSE is the square root of the average squared difference between the estimate and the true parameter value, so it reflects both the bias and the variance of the estimator.

The data sets are generated using the following two-equation model with self-selected samples.

Observation stage:
$$y_i = x_{1i}'\beta + \varepsilon_i$$

Selection stage:
$$z_i = x_{2i}'\gamma + u_i$$

where
$$\begin{pmatrix} \varepsilon_i \\ u_i \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \Sigma \right).$$

Note that the first equation is the observation stage and the second is the selection stage. Without loss of generality, we observe $y_1, y_2, \dots, y_n$ $(n \le m)$, where $z_1, z_2, \dots, z_n$ are greater than the selection cutoff value $c$. Hence $y_{n+1}, y_{n+2}, \dots, y_m$ are considered missing at the analysis step.

We consider the following scenarios at the data generation step:

- Change the level (percentage) of missing values
- Change the correlation between the error terms to control the level of correlation
- Change the two scalars $x_1$ and $x_2$ to control the level of multicollinearity
- Change the sample size

The data are simulated as follows: we generate the design matrices for the observation and screening stages, with a 3 × 1 vector for each observation and the first rows fixed at one to make constant terms for each equation. The two other rows of each design matrix are independently generated from a uniform distribution on [0, 1]. I set the true values of the regression coefficients (the last being 3), and I generate the random errors $\varepsilon$ and $u$ from a bivariate normal density with zero mean and variance-covariance matrix $\Sigma$.

In each case 500 Gibbs samples were drawn, the first 500 were discarded, and the remaining 000 were used for posterior inference. I tried multiple runs to ensure convergence of the results. The main simulation did not include thinning (the strategy of reducing autocorrelation by storing only every m-th point after the burn-in period); however, we are confident in the results because we use different settings with different starting points, and the results are obtained from 00 replications to avoid systematic mistakes. Since 500 might not be long enough for MCMC to converge, I

selected one simulation scenario from one chain and ran 000 MCMC samples using two different starting values. The first 000 were discarded as burn-in and we applied thinning of 5, leaving 000 effective posterior samples, to show the behavior of the plots and how the distribution converges.

Convergence refers to the idea that the MCMC and Gibbs sampler that we choose eventually reaches a stationary distribution, which is also our target distribution. To test the results we generated the following diagnostic plots: trace plots to show the sampling path, kernel density plots to show the posterior density function, and moving averages of posterior samples to show that the samples are converging to similar values.

One way to see if our chain has converged is to see how well it is mixing, or moving around the parameter space. If the chain is taking a long time to move around the parameter space, then it will take longer to converge. We can see how well the chain is mixing through visual inspection. We discuss these inspections for every parameter.

Figure 1 shows the sampling paths for the regression coefficients from two different starting chains. This figure contains plots, known as trace plots, of the iteration number against the value of the draw of the parameters at each iteration. These plots are useful to show whether our chain is converging to the same value or gets stuck in certain areas of the parameter space, which indicates bad mixing. Our results show that all samples converge to the same distribution, and there seems to be a large spread in the estimates.
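The burn-in, thinning, and moving-average diagnostics described above can be sketched as follows. The chain here is a stand-in (an autocorrelated AR(1) series started from two distant values), not output from the selection-model sampler, and the chain length, burn-in, thinning interval, and function names are illustrative assumptions.

```python
import numpy as np

def post_process(chain, burn_in, thin):
    """Drop the first burn_in draws, then keep every thin-th draw."""
    return np.asarray(chain)[burn_in::thin]

def moving_average(draws):
    """Running posterior mean after 1, 2, ..., len(draws) retained draws."""
    draws = np.asarray(draws, dtype=float)
    return np.cumsum(draws) / np.arange(1, draws.size + 1)

def toy_chain(start, n_iter, rng, target_mean=3.0, phi=0.9):
    """Stand-in for sampler output: an AR(1) chain with stationary mean 3."""
    chain = np.empty(n_iter)
    value = start
    for t in range(n_iter):
        value = target_mean + phi * (value - target_mean) + rng.normal()
        chain[t] = value
    return chain

rng = np.random.default_rng(42)

# Two chains from deliberately distant starting values.
chains = [toy_chain(start, 21000, rng) for start in (-20.0, 20.0)]
kept = [post_process(c, burn_in=1000, thin=5) for c in chains]
running = [moving_average(k) for k in kept]

for k, r in zip(kept, running):
    print(f"retained draws: {k.size}, final moving average: {r[-1]:.3f}")
```

Plotting each entry of `kept` against its index gives a trace plot, and plotting `running` gives the moving-average plot; chains that have converged should settle near the same value regardless of where they started.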

Figure 1. Sample Path for Comprehensive Simulation

Figure 2 shows the posterior density plots for the estimates from the two MCMC chains, together with the normal density curve. The plots show strong evidence of convergence for $\beta_0$ and one further coefficient, which is reflected in the distributions. Usually non-convergence is reflected in a multimodal distribution, and this is especially true if the kernel density is not just multimodal but lumpy.

Figure 3 shows the plots of the moving averages of the coefficients from the two MCMC chains. The x-axis represents the number of iterations and the y-axis shows the posterior mean from these iterations. As a result, all the paths are believed to be stationary within an acceptable range. When comparing the two settings, $\beta_0$ and one further coefficient seem to converge to the same value fairly quickly; however, convergence is less apparent for the remaining parameter, which is consistent with the results in previous studies.

The following sections summarize the simulation results for the various set-ups we mentioned earlier. For each sample of data generated, we obtain MCMC estimates by calculating the mean of the conditional posterior densities, and we report the RMSE and Bias across the simulated samples. We also include the estimates from Heckman's method and from OLS using all data and subset data.
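How Bias and RMSE are accumulated across replications can be sketched with the two non-Bayesian comparators. The code below is an illustration under assumptions, not the simulation reported here: the true coefficients (1, 2, 3), the selection coefficients, the sample size of 500, the 200 replications, and all function names are hypothetical, and the probit step is fitted by a plain Fisher-scoring loop.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def norm_pdf(t):
    return np.exp(-0.5 * t ** 2) / math.sqrt(2.0 * math.pi)

def norm_cdf(t):
    return 0.5 * (1.0 + _erf(t / math.sqrt(2.0)))

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

def probit_fit(W, d, n_iter=10):
    """Probit maximum likelihood by Fisher scoring."""
    d = d.astype(float)
    g = np.zeros(W.shape[1])
    for _ in range(n_iter):
        idx = W @ g
        p = np.clip(norm_cdf(idx), 1e-10, 1.0 - 1e-10)
        f = norm_pdf(idx)
        score = W.T @ (f * (d - p) / (p * (1.0 - p)))
        info = (W * (f ** 2 / (p * (1.0 - p)))[:, None]).T @ W
        g = g + np.linalg.solve(info, score)
    return g

def heckman_two_step(X, W, y, d):
    """Step 1: probit for selection. Step 2: OLS with the inverse Mills ratio."""
    g = probit_fit(W, d)
    idx = W[d] @ g
    mills = norm_pdf(idx) / norm_cdf(idx)
    coef = ols(np.column_stack([X[d], mills]), y[d])
    return coef[:-1]                     # drop the Mills-ratio coefficient

def simulate(n, rho, rng, beta=(1.0, 2.0, 3.0), gamma=(-1.0, 1.0, 1.0)):
    """One data set: constant plus two U(0, 1) covariates in each stage."""
    X = np.column_stack([np.ones(n), rng.uniform(0, 1, (n, 2))])
    W = np.column_stack([np.ones(n), rng.uniform(0, 1, (n, 2))])
    errs = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)
    y = X @ np.array(beta) + errs[:, 0]
    d = (W @ np.array(gamma) + errs[:, 1]) > 0   # True where y is observed
    return X, W, y, d

def bias_rmse(estimates, truth):
    """Bias and RMSE of each coefficient across replications."""
    estimates = np.asarray(estimates)
    bias = estimates.mean(axis=0) - truth
    rmse = np.sqrt(((estimates - truth) ** 2).mean(axis=0))
    return bias, rmse

rng = np.random.default_rng(2006)
truth = np.array([1.0, 2.0, 3.0])
ols_est, heck_est = [], []
for _ in range(200):
    X, W, y, d = simulate(500, 0.75, rng)
    ols_est.append(ols(X[d], y[d]))              # OLS on the selected subset
    heck_est.append(heckman_two_step(X, W, y, d))

ols_bias, ols_rmse = bias_rmse(ols_est, truth)
heck_bias, heck_rmse = bias_rmse(heck_est, truth)
print("OLS subset  bias:", ols_bias.round(3), "RMSE:", ols_rmse.round(3))
print("Heckman     bias:", heck_bias.round(3), "RMSE:", heck_rmse.round(3))
```

Posterior means from a Bayesian sampler would be appended to a third list and passed through the same `bias_rmse` helper, so that all methods are scored on identical data sets.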

Figure 2. Density Plots for Comprehensive Simulation

Figure 3. Moving Averages for Comprehensive Simulation

Selection Rate

Table 4 shows the simulation results for the regression coefficients under different selection rate scenarios. To test the effect of the missing rate, we use the following two levels: 50% and 0%. In both scenarios we notice a reduction in RMSE and in bias for most coefficients when using the Bayesian approach. Comparing the Bayesian method to Heckman's method, we see a 64% reduction in bias (0.4 versus 0.05) for $\beta_0$ in the first scenario and a 93% reduction in bias in the second scenario. This shows a significant improvement in the $\beta_0$ estimates using Bayesian methods compared to Heckman's method when the missing rate is low. We also notice a 0% reduction in RMSE in the second scenario for both $\beta_1$ and $\beta_2$. For one of these coefficients the RMSE is 0.7 under the Bayesian method versus 0.83 under Heckman's. There is also a slight reduction in RMSE for a further coefficient in the second scenario: 0.79 under the Bayesian method versus 0.8 under Heckman's. The results of both scenarios indicate that the Bayesian approach performs as well as the Heckman approach, or even better.

Table 4
Simulation Results for Selection Rate

Columns: Selection Rate, Method, Parameter, Mean, RMSE, Bias.
Rows: for each selection rate (50% and 0%) and each coefficient β, one row per method (OLS all data, OLS subset, Heckman, Bayesian).

Note. Sample size = 80, correlation level = 0.50, and multicollinearity COR < 0..

Correlation Level

Table 5 shows the simulation results for the regression coefficients under different correlation levels. The correlation level is controlled by the value of the correlation between the error terms, $\mathrm{cor}(\varepsilon, u)$. To check the performance of the Bayesian approach, we assign the following three correlation values: 0.3, 0.5, and 0.75.

The overall results show that the Bayesian approach provides a significant reduction in RMSE and in bias when the correlation level is high. We see a reduction in bias in the first two scenarios (0.3 and 0.5) and less reduction in the bias of $\beta_1$ and $\beta_2$ when the correlation level is high ($\mathrm{cor}(\varepsilon, u) = 0.75$). The results for the lowest correlation level ($\mathrm{cor}(\varepsilon, u) = 0.3$) show a slight reduction in bias for all coefficients, with no significant improvement in RMSE. However, a significant reduction in RMSE seems to exist in the high correlation scenarios.

Table 5
Simulation Results for Correlation Level

Columns: Correlation, Method, Parameter, Mean, RMSE, Bias.
Rows: for each of the three correlation levels (the first being 0.30) and each coefficient β, one row per method (OLS all data, OLS subset, Heckman, Bayesian).

Note. Sample size = 80, selection rate = 0%, and multicollinearity COR < 0..

Multicollinearity

Table 6 shows the simulation results for the regression coefficients under three different multicollinearity levels. This level is controlled by the correlation between two explanatory variables in the observation and screening stages, $\mathrm{cor}(x_1, x_2)$. To test the effect of multicollinearity we use three levels, the highest being 1.00. The third scenario, $\mathrm{cor}(x_1, x_2) = 1$, refers to the case where the screening stage and the observation stage contain one common explanatory variable.

The Bayesian method seems to provide the best results for all coefficients when the multicollinearity level is high. The third case shows a large decrease in RMSE and bias with the Bayesian approach compared to Heckman's. The bias for $\beta_0$ dropped 63%, and the RMSE for another coefficient dropped 5%, in Bayesian estimation compared to Heckman's method. There seems to be no significant improvement with the Bayesian approach in the first scenario, where the multicollinearity level is very low. In this case Heckman's approach seems to be a good choice for estimation, but not when a high level of multicollinearity exists.

Table 6
Simulation Results for Multicollinearity

Columns: Multicollinearity, Method, Parameter, Mean, RMSE, Bias.
Rows: for each of the three multicollinearity levels (the first being 0.06) and each coefficient β, one row per method (OLS all data, OLS subset, Heckman, Bayesian).

Note. Sample size = 80, selection rate = 0%, and correlation level = 0.50.

Sample Size

Finally, Table 7 shows the simulation results for the regression coefficients under three different sample sizes. This is another situation where the results of the scenarios indicate that the Bayesian approach performs as well as the Heckman approach, or even better. The results for the various sample sizes show a similar slight reduction, of less than 0% on average, in RMSE for the Bayesian method compared to Heckman's for all coefficients. However, we see a significant improvement in bias for $\beta_0$ under the Bayesian method. The bias reduction for $\beta_0$ under the Bayesian method compared to Heckman's exceeds 50% in all scenarios.

This comprehensive simulation study used various scenarios that a researcher could face when dealing with real data. The results support the effectiveness of the Bayesian approach and show the limitations of Heckman's approach, particularly when faced with a high level of multicollinearity. A detailed investigation of the multicollinearity issue can be found in Leung and Yu (1996). They show that the degree of multicollinearity is the main decision driver for judging the appropriateness of the LIML and FIML estimates in relation to the two-part model.

In empirical analysis of wage equations, for example, the standard procedure to solve the multicollinearity problem is to find variables that determine the probability of working (selection equation) but not the wage rate (observation equation) directly. Practical examples of such variables could be the income of the spouse, household income, etc. However, these variables are not always available in practical situations.

Table 7
Simulation Results for Sample Size

Columns: Sample Size, Method, Parameter, Mean, RMSE, Bias.
Rows: for each of the three sample sizes and each coefficient β, one row per method (OLS all data, OLS subset, Heckman, Bayesian).

Note. Selection rate = 0%, correlation level = 0.50, and multicollinearity COR < 0..

CHAPTER 5

CASE STUDY: PLACEMENT EXAM AND MATH ACHIEVEMENT

In this chapter we apply the Bayesian model to a real-world data example, using AU students' placement exam data. The goal is to investigate how the students' placement exam scores are associated with their first-year math achievement. All AU students are supposed to take the math placement exam and register for appropriate math/stat courses accordingly. The problem is that information about the students' first-year math achievement (grades) is only available for those who actually take and complete their first-year math courses. We wish to forecast outcomes in the whole pool of freshmen but are forced to rely on a subset chosen non-randomly.

The data contain 0 freshmen students with 4 variables. A brief description of the variables that are relevant for our analysis is shown in Table 8. Among the 0 students, we observe grades for only 75. The remaining 60 students did not register for or complete a math class in fall 00 and so did not receive a grade. We are interested in how the placement exam score is related to the student's math grade. Because of the self-selection, accurate estimation requires correcting for sample selection bias.

The student's grade is a function of the placement score and the recommended class. A dummy variable called Basic Level was created to indicate the type of recommended math class (e.g. Basic Algebra, Applied Calculus, etc.). If the recommended class type is classified as basic, the new dummy variable is equal to 1; otherwise it is equal to zero. This new dummy variable was included in the observation stage as an explanatory variable.

Table 8
Variable Descriptions for Placement Exam Data

Variable Name    Definition
Placement        Recommended class based on placement exam score
Score            Student's placement exam score
Basic Level      Dummy variable equal to 1 if the recommended class type is classified as basic level; otherwise equal to zero
Grade            Student's grade for the class

We begin with OLS estimation of the regression model using only the observations that have grade data. The estimates can be found in Table 9 in the "OLS subset" row. The estimated coefficient shows a very small effect of placement exam score on the student's math achievement. This analysis would be fine if, in fact, the missing grade data were missing completely at random. However, the decision to register for a math class or not was made by the individual student. Thus those who were not registered constitute a self-selected sample and not a random sample. It is likely that some of the students who had low placement scores chose not to register for any math class. If so, this would account for much of the missing grade data. Thus it is likely that we will over-estimate the grades of the students in the population.

On the other hand, Heckman's method is supposed to allow us to use information from non-registered students to improve the estimates of the parameters in the regression model. However, Heckman's method shows negative estimates and a low

negative effect of placement exam score on the student's math achievement.

Table 9
Case Study Results for Placement Exam Data

Columns: Mean and Std. Error for each of Intercept (β0), Score (β1), and Basic Level (β2).
Rows: OLS subset, Heckman, Bayesian.

Finally, we consider the Bayesian approach using MCMC methods and Gibbs sampling, where the posterior distribution can be sampled indirectly by generating a sample of parameter values from the conditional distributions of interest. Posterior Bayes estimates are then obtained from the generated samples. Of the 0000 samples, the first 000 are discarded and we apply thinning, leaving 9000 effective posterior samples.

Figure 4 shows the sample paths for the coefficients using two different starting points. The plots show that all values converge to the same distribution. Figure 5 shows the posterior density plots for the coefficients, and the two settings show normal distributions. Figure 6 shows the moving averages for the coefficients from the two MCMC chains. The plots indicate that the two MCMC chains are converging to the same value for $\beta_0$ and $\beta_1$. The moving averages for Basic Level from the two settings seem not to converge to the

same value. The Bayesian estimate for $\beta_2$ in Table 9 seems to support this result. This indicates a low effect of Basic Level on the student's math achievement.

Table 9 shows that the estimated effect of placement score on the student's grade is positive when applying the Bayesian model. Such a relationship is not identified in the estimates from Heckman's method. However, Basic Level seems to have less effect under the Bayesian method than under Heckman's method. There seems to be no improvement in the standard errors using the Bayesian approach for any coefficient. The Bayesian standard errors seem to be larger than those of the other two estimation methods (Heckman and OLS).

Figure 4. Sample Paths for Placement Exam Example

Figure 5. Density Plots for Placement Exam Example

Figure 6. Moving Averages for Placement Exam Example

CHAPTER 6

BINARY SELECTIVITY MODEL

In this chapter the proposed Bayesian method is extended to the Generalized Linear Model (GLM). The GLM extends the linear regression model to accommodate non-normal responses (e.g. binomial data, frequency data) by relating them to a linear equation via a link function. Examples of GLMs include well-known models such as logistic regression and log-linear models (Poisson regression for frequency tables, etc.). Conceptually the Bayesian specification is straightforward. We need to assign a prior for the regression coefficients, as in the previous regression examples. There is no closed-form solution available, but it is simple to obtain samples from the posterior via MCMC.

Generalized Linear Model

Let $y_1, \dots, y_n$ denote $n$ independent observations on a response variable, and treat $y_i$ as a realization of a random variable $Y_i$. In a GLM we assume that the distribution of $Y_i$ is a member of the exponential family, and the model has three main components: random, systematic, and link. The random part is the distribution of the observations, the systematic component is the linear combination of explanatory variables, and the link function is the link between the random part and the systematic component. The exponential family is defined as follows:

$$f(y; \theta, \phi) = \exp\left\{ \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \right\}$$

where $\theta$ and $\phi$ are location and scale parameters, respectively. The mean and variance are:

$$E(y) = \mu = b'(\theta) \quad \text{and} \quad V(y) = b''(\theta)\, a(\phi),$$

and we assume that the expected value is related to a linear function of $x$ through the link:

$$g(\mu) = \eta = x'\beta$$

where $\eta$ is the linear predictor, $g(\cdot)$ is the link function, $x$ are the predictors, and $\beta$ is a vector of unknown parameters (regression coefficients).

Our current model employs a linear regression in the observed stage and a probit/logistic regression in the selection stage. Let us consider $U \sim N(x'\beta, 1)$, from which it follows that $\pi(x) = \Phi(x'\beta)$, where

$$\Phi(t) = \int_{-\infty}^{t} \frac{1}{\sqrt{2\pi}} \exp\left(-z^2/2\right) dz.$$

The relation is linearized by the inverse normal transformation

$$\Phi^{-1}(\pi(x)) = x'\beta = \sum_{j=1}^{p} x_j \beta_j.$$

The cutoff value of $U$ is fixed and the mean of $U$ changes with $x$.

The goal is to have a probit model (or GLM) in the observed stage as well as in the selection stage. We obtain this by using two latent variables, one in each stage, and allowing correlation between them. This is similar to Heckman's selection model, except that now we have a binary outcome in the observation stage. Assume the following selection setup:

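$$D = \begin{cases} 1 & \text{if } \xi > 0 \\ 0 & \text{otherwise} \end{cases}$$

and $\xi \sim N(x_2'\gamma, 1)$ is a latent variable.

$$Y = \begin{cases} \text{missing} & \text{if } D = 0 \\ 1 & \text{if } D = 1 \text{ and } Y^* > 0 \\ 0 & \text{if } D = 1 \text{ and } Y^* \le 0 \end{cases}$$

The probit regression model can be expressed as $P(Y = 1 \mid x_1) = P(Y^* > 0) = \Phi(x_1'\beta)$, where $Y^* \sim N(x_1'\beta, 1)$ is also a latent variable. Therefore the latent variables can be expressed as the following bivariate normal:

$$\begin{pmatrix} \xi \\ Y^* \end{pmatrix} \sim BVN\left( \begin{pmatrix} x_2'\gamma \\ x_1'\beta \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right)$$

Bayesian Estimation

The Bayesian setup for the GLM is an extension of the framework we have used for regression models. Suppose we have $g(\mu) = x'\beta$; we need to choose a prior density $\pi(\beta, \phi)$ for the parameters. The posterior density is then expressed as:

$$\pi(\beta, \phi \mid y) = \frac{f(y; \beta, \phi)\, \pi(\beta, \phi)}{f(y)}, \qquad f(y) = \int\!\!\int f(y; \beta, \phi)\, \pi(\beta, \phi)\, d\beta\, d\phi$$

where $f(y)$ is the marginal likelihood of the data, obtained by integrating the likelihood (conditional on the unknown regression coefficients and dispersion parameter) across the prior density. The joint posterior density of the unobserved latent variables and the parameters, given the observed $D$ and $Y$, is proportional to

$$\pi(\beta, \gamma, \rho) \prod_{i=1}^{n} N_2\left( \begin{pmatrix} \xi_i \\ Y_i^* \end{pmatrix}; \begin{pmatrix} x_{2i}'\gamma \\ x_{1i}'\beta \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right) 1\{ (\xi_i, Y_i^*) \text{ agrees with } (D_i, Y_i) \}.$$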

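Let $\theta = [\beta', \gamma']'$. From the above joint posterior we now infer the conditional posteriors and implement the Gibbs sampler. We start with sampling the conditional posterior of $\xi$. Since $(\xi, Y^*)'$ is bivariate normal with means $x_2'\gamma$ and $x_1'\beta$, unit variances, and correlation $\rho$,

$$\xi \mid Y^*, D \sim N(z, 1 - \rho^2), \text{ truncated at left by 0 if } D = 1,$$
$$\xi \mid Y^*, D \sim N(z, 1 - \rho^2), \text{ truncated at right by 0 if } D = 0,$$

where $z = x_2'\gamma + \rho\,(Y^* - x_1'\beta)$.

In a similar way we calculate the conditional posterior of $Y^*$, which is also truncated normal when $D = 1$:

$$Y^* \mid \xi, D, Y \sim N(m, 1 - \rho^2), \text{ truncated at left by 0 if } D = 1 \text{ and } Y = 1,$$
$$Y^* \mid \xi, D, Y \sim N(m, 1 - \rho^2), \text{ truncated at right by 0 if } D = 1 \text{ and } Y = 0,$$

where $m = x_1'\beta + \rho\,(\xi - x_2'\gamma)$; when $D = 0$ the draw is not truncated.

To sample $\theta$ we use the prior $\theta \sim N(\theta_0, B_0)$. Let

$$X_i = \begin{pmatrix} x_{1i}' & 0 \\ 0 & x_{2i}' \end{pmatrix} \quad \text{and} \quad W_i = \begin{pmatrix} Y_i^* \\ \xi_i \end{pmatrix},$$

and let $\Omega$ denote the $2 \times 2$ correlation matrix of $(Y^*, \xi)$. We then get the conditional posterior of $\theta$, which is a normal density:

$$\theta \mid W \sim N(\hat{\theta}, B), \quad \text{where} \quad B = \left( B_0^{-1} + \sum_{i} X_i' \Omega^{-1} X_i \right)^{-1} \quad \text{and} \quad \hat{\theta} = B \left( B_0^{-1}\theta_0 + \sum_{i} X_i' \Omega^{-1} W_i \right).$$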

We can sample the two blocks of unknowns iteratively, by drawing one given the other and vice versa. We can also sample the first block from its posterior marginal distribution at each iteration. This marginal distribution is conditional only on the data and not on any parameters. We can then sample the remaining block from the same posterior full conditional distribution, as follows:

1. Sample the first block from its posterior marginal distribution.
2. Sample the remaining block from the same posterior full conditional distribution as described previously.
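The two kinds of draws used above (a truncated normal draw for a latent variable, followed by a normal draw for the coefficients) can be seen in a deliberately simplified setting. The sketch below is a single-equation probit sampler by data augmentation, not the two-equation selectivity model: there is no second latent variable and no correlation parameter, and the prior variance of 100, the simulated data, and the function names are assumptions made for illustration.

```python
import numpy as np

def draw_truncated_normal(mean, positive, rng):
    """Draw N(mean, 1) restricted to (0, inf) where positive is True and to
    (-inf, 0] elsewhere, by simple rejection sampling."""
    out = np.empty_like(mean)
    todo = np.arange(mean.size)
    while todo.size:
        cand = mean[todo] + rng.standard_normal(todo.size)
        ok = (cand > 0) == positive[todo]
        out[todo[ok]] = cand[ok]
        todo = todo[~ok]
    return out

def probit_gibbs(X, d, n_iter=1500, burn_in=500, prior_var=100.0, seed=0):
    """Gibbs sampler for a probit model with a N(0, prior_var * I) prior."""
    rng = np.random.default_rng(seed)
    k = X.shape[1]
    # Conditional posterior covariance of the coefficients (latent variance 1).
    B = np.linalg.inv(np.eye(k) / prior_var + X.T @ X)
    chol = np.linalg.cholesky(B)
    beta = np.zeros(k)
    draws = []
    for t in range(n_iter):
        # Step 1: latent variable given coefficients (truncated normal).
        latent = draw_truncated_normal(X @ beta, d, rng)
        # Step 2: coefficients given latent variable (normal).
        beta = B @ (X.T @ latent) + chol @ rng.standard_normal(k)
        if t >= burn_in:
            draws.append(beta)
    return np.array(draws)

# Hypothetical binary data with intercept -0.5 and slope 1.0.
data_rng = np.random.default_rng(5)
n = 2000
X = np.column_stack([np.ones(n), data_rng.uniform(0, 1, n)])
true_beta = np.array([-0.5, 1.0])
d = (X @ true_beta + data_rng.standard_normal(n)) > 0

draws = probit_gibbs(X, d)
posterior_mean = draws.mean(axis=0)
print("posterior mean:", posterior_mean.round(3))
```

In the selectivity model the same pattern is repeated for each latent variable, with the conditional mean and variance adjusted for the correlation between the two stages.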

Data Example

The previous model is applied to the same data example we used in Chapter 5 (AU Students Placement Exam Score Data), except that the response variable is now binary. For each student I assigned a new variable called Pass that takes the value 1 if Grade > 3.0 (Pass) and 0 otherwise (Fail).

I begin with MLE estimation of the bivariate probit model using only the observations that have grade data. The estimates can be found in the first row of Table 10. As discussed in previous examples, it is likely that we will over-estimate the proportion of passing students in the population.

The Bayesian method is applied after running 0000 MCMC iterations using the Gibbs sampler with 000 as burn-in. The results of this method can be found in the second row of Table 10.

Table 10
Model Results for Binary Selectivity

Columns: Mean and Std. Error for each of Intercept, Score, and Basic Level.
Rows: MLE, Bayesian.

Figure 7 shows the sample paths for the coefficients using two different starting points. The plots show that all values converge to the same distribution. Figure 8 shows the posterior density plots for the coefficients, and the two settings show normal distributions. Figure

9 shows the moving averages for the coefficients from the two MCMC chains. The graph indicates that the two MCMC chains are converging to the same value.

The results in Table 10 indicate that the Bayesian approach performs at least as well as the MLE approach. The Score coefficient seems to indicate a positive relationship with Grade. This relationship was reversed, with a negative coefficient for Score, under the MLE method. Another advantage of the Bayesian approach seems to be the slight reduction of the standard error for all coefficients.

In general we would expect a strong positive relationship between the placement exam score and the final outcome (whether the student passed or failed the math class in the first semester). Similarly, we would expect a larger correlation between Basic Level, which is based on the student's placement exam score, and whether the student passed or failed the math class. However, these strong correlations are not present in the data. This is mainly due to the limitation of the available variables that we have to use in our model. The variable Score is used in both stages (selection and observation), which leads to high multicollinearity. These results confirm that the parameter estimates are not very reliable when multicollinearity exists in the model.

Figure 7. Sample Paths for Binary Selectivity
