Bayesian Network Learning for Rare Events
International Conference on Computer Systems and Technologies - CompSysTech 06

Bayesian Network Learning for Rare Events

Samuel G. Gerssen, Leon J. M. Rothkrantz

Abstract: Parameter learning from data in Bayesian networks is a straightforward task. The average number of observed occurrences is stored in a conditional probability table, from which future predictions can be calculated. This method relies heavily on the quality of the data. A data set with rare events will not yield statistically reliable estimates. Bayesian networks allow prior and posterior learning. In this paper, new prior assessment techniques are introduced to obtain stable priors for a conditional probability table. These learning algorithms are implemented and tested, and the results are presented.

Key words: Bayesian networks, Learning, Naive Bayes priors.

INTRODUCTION
A Bayesian network [8, 10] is a directed acyclic graph with each node representing a variable and each arc representing a causal relation between two variables. Variables are characterized by a probability distribution over their values. The probability distribution of each node is influenced by the states (for discrete nodes) or values (for continuous nodes) of its parent nodes. The conditional probabilities of a node are stored in a conditional probability table (CPT). The CPT is needed to calculate any conditional probability in the model (inference) [5]. The size of the CPT depends on the node's number of states (s), its number of parents (p), and the numbers of parent states (s_p_i) in the following way:

    size(CPT) = s * Π_{i=1}^{p} s_p_i    (1)

For every possible combination of parent states, there is an entry listed in the CPT. Notice that for a large number of parents the CPT will expand drastically. Assume the variables in the Bayesian network illustrated in Figure 1 are binary.

[Figure 1: Example Bayesian network]

The conditional probability table of node C will have eight entries, with four degrees of freedom. The CPT for node C is given in Table 1.
Table 1: CPT for node C, P(C | A, B)

    parent states    c               ¬c
    a, b             P(c | a, b)     1 - P(c | a, b)
    a, ¬b            P(c | a, ¬b)    1 - P(c | a, ¬b)
    ¬a, b            P(c | ¬a, b)    1 - P(c | ¬a, b)
    ¬a, ¬b           P(c | ¬a, ¬b)   1 - P(c | ¬a, ¬b)

The research reported here is part of the Interactive Collaborative Information Systems (ICIS) project, supported by the Dutch Ministry of Economic Affairs, grant nr: BSIK
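As a quick numeric check of equation (1) and Table 1: node C has s = 2 states and two binary parents, giving 2 * 2 * 2 = 8 CPT entries and (2 - 1) * 2 * 2 = 4 degrees of freedom. A minimal sketch (the helper names are illustrative, not from the paper):

```python
from math import prod

def cpt_size(s, parent_states):
    """Equation (1): size(CPT) = s * product of the parents' state counts."""
    return s * prod(parent_states)

def cpt_dof(s, parent_states):
    """Degrees of freedom: (s - 1) * product of the parents' state counts."""
    return (s - 1) * prod(parent_states)

# Node C from Figure 1: two states, two binary parents A and B.
print(cpt_size(2, [2, 2]))  # 8 entries, as in Table 1
print(cpt_dof(2, [2, 2]))   # 4 free parameters
```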
The number of degrees of freedom of a CPT with number of states (s), number of parents (p), and numbers of parent states (s_p_i) is (s - 1) * Π_{i=1}^{p} s_p_i. Values for P can be obtained from the data by parameter learning. P(c | a, b) is the number of occurrences of (c, a, b) divided by the number of occurrences of (a, b), or o(c, a, b) / (o(c, a, b) + o(¬c, a, b)). In general, if for a variable X with states x_1, ..., x_n, π_j is the j-th combination of parent states, let o(x_i, π_j) be the number of observations of (x_i, π_j) in the data set. P(x_i | π_j) can now be calculated as follows:

    P(x_i | π_j) = o(x_i, π_j) / o(π_j),  where o(π_j) = Σ_{k=1}^{n} o(x_k, π_j)    (2)

Parameter learning is basically counting the number of observations of a specific event. Notice that the accuracy of P heavily depends on the number of observed events. A large number of observations will result in a more accurate estimate than just a small sample. Therefore, large data sets usually provide good, stable models. Unfortunately many real-world data sets are imbalanced; some states of the response variable may be dozens to thousands of times less likely than other states. This is the case in, for example, customer bankruptcies in banks, international armed conflicts, or epidemiological infections. Thus the effect of having a large data set available is canceled by the fact that it contains only a small number of interesting records. Therefore, a model based on such data may be severely biased or highly unstable.

PREVIOUS WORK
Rare event problems occur in generalized linear models, such as logistic regression, but also in models learned from data, such as neural networks and Bayesian networks. Especially in the case of generalized linear models, techniques such as prior correction and weighting have been developed to discard most of the uninteresting part of the data set without much performance loss [4]. Additionally there are ways of reducing bias and variance [9]. In Bayesian networks, small sample problems are usually solved by setting a good prior distribution [1, 6], based on expert knowledge.
However, if this expert knowledge is not available, an acceptable prior, based on the data, needs to be set. [7] proposes noisy-OR to obtain priors. This approach uses cutting to shift from the prior to the posterior estimate. Naive Bayes is a very good classifier, as described in [2]. Parameter learning is explained in detail by David Heckerman in [3].

Inference in a Bayesian network is the calculation of conditional probabilities, given the probabilities in a CPT. Inference is based on two rules. The first one is Bayes' rule, which is defined as:

    P(x | y) = P(y | x) P(x) / P(y)    (3)

The second rule is the expansion rule, which is defined for binary variables as:

    P(x) = P(x | y) P(y) + P(x | ¬y) P(¬y)    (4)

Using these two rules, P(a | c) in the example in Figure 1 can now be calculated:

    P(a | c) = P(c | a) P(a) / P(c)
             = P(a) [P(c | a, b) P(b) + P(c | a, ¬b) P(¬b)] / Σ_{A,B} P(c | A, B) P(A) P(B)    (5)

Inference will be used for the assessment of naive Bayes priors.

CPT Calculation
Deriving CPT parameters using a prior-posterior approach consists of three stages:
1. prior assessment
2. posterior assessment
3. merging

The posterior assessment is equivalent to CPT parameter learning, shown in equation 2, and will not be discussed here. First, the merging method is described and then the prior assessment.

Merging
The merging process of priors and posteriors can be handled in a couple of ways. [7] uses a cut value for the number of observations, in that paper called smoothing. If, in the data set, a combination of parent states occurs less often than the cut value, the prior will be used in the CPT; otherwise, the posterior will be used. Let the prior for P(x_i | π_j) be ρ_ij, the posterior r_ij, and the cut value c. The smoothing merge is defined as:

    P(x_i | π_j) = ρ_ij,   if o(π_j) < c
                   r_ij,   otherwise    (6)

Substitution of equations 2 and 6 yields:

    P(x_i | π_j) = ρ_ij,                   if o(π_j) < c
                   o(x_i, π_j) / o(π_j),   otherwise    (7)
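The posterior assessment (equation 2) and the smoothing merge (equations 6 and 7) can be sketched together; the toy data set, flat prior, and helper names below are illustrative assumptions, not taken from the paper:

```python
from collections import Counter

def posterior(records, child, parents):
    """Equation (2): P(x_i | pi_j) = o(x_i, pi_j) / o(pi_j), by counting."""
    o = Counter((r[child], tuple(r[p] for p in parents)) for r in records)
    o_parent = Counter(tuple(r[p] for p in parents) for r in records)
    return {(x, pi): n / o_parent[pi] for (x, pi), n in o.items()}, o_parent

def smooth_merge(prior, post, o_parent, cut):
    """Equations (6)/(7): keep the prior wherever o(pi_j) < cut."""
    return {key: (prior[key] if o_parent[key[1]] < cut else p)
            for key, p in post.items()}

data = ([{"A": 1, "B": 0, "C": 1}] * 30
        + [{"A": 1, "B": 0, "C": 0}] * 10
        + [{"A": 0, "B": 1, "C": 1}] * 2)      # (not-a, b) is a rare configuration
post, o_parent = posterior(data, "C", ["A", "B"])
prior = {key: 0.5 for key in post}             # flat prior, for illustration
merged = smooth_merge(prior, post, o_parent, cut=10)
print(merged[(1, (1, 0))])  # 0.75: 40 observations, so the posterior is used
print(merged[(1, (0, 1))])  # 0.5:  only 2 observations, so the prior is used
```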
4 Internatonal Conference on Comuter Systems and Technologes - ComSysTech 06 In ths aer, a gentle transton s roosed, by weghtng the rors. The rors wll receve an nteger value, weght w, whch s equvalent to a number of observatons. The resultng CPT entres wll be a weghted average of the rors and the osterors: (w ) + o(, ) (8) w + o(, ) If o(, ) s low, the CPT value wll manly be based on the ror, however, f (, ) s often observed n the data set, the data wll nfluence the CPT value more than the ror. Bascally ths effect s the same as smoothng, only the transton s gentler. In smoothng and weghtng, the values for c and w, resectvely, need to be set. The tests n Secton 4 show that these arameters have the greatest effect f a value between 5 and 80 s chosen, referly between 0 and 40. If the values are too low, the rocedure wll be smlar to normal CPT learnng. If the values are too hgh, the osterors wll have very lttle effect. otce that for a model wth etremely good rors, the weght or cut value should be very hgh. Pror Assessment A CPT contans very detaled nformaton out condtonal robltes for all ossble arent states. As shown n the ntroducton, the number of degrees of freedom eands radly f the number of arents and the number of arent states ncrease. A stle ror should therefore be derved from a low number of degrees of freedom, but as accurate as ossble. [7] roose the usage of nosy-max rors, because t s a good modelng technque for rare events. osy-max maes the assumton that the state of the resonse varle s a logcal combnaton of the states of the nut arameters. In ractce, for a lot of data sets, nosy-max shows hgh erformance, because n many cases, an effect s an addton or multlcaton of the causes. An eamle of a bnary nosy-max (nosy-or) model s gven n Fgure 2. Fgure 2: osy-max networ Asde from the nut varles X...X and resonse varle Y, there s a set of nhbtor nodes Z...Z. X can cause Z to be resent wth a roblty c, but sence of X always mles sence of Z. 
The CPT of node Y s smlar to a logcal MAX gate. ave Bayes s n many cases a sueror classfer [2]. Just as nosy-max, nave Bayes has a relatvely low number of degrees of freedom. The erformance as a classfer s slghtly better than nosy-max. The networ structure s comletely ooste from a regular CPT networ. It may be aganst ntuton to use rors from an ooste networ. In the ror assessment, the causaltes are less relevant than how y ) s calculated. A nave Bayes networ maes very unrealstc assumtons out causalty. It assumes the redctve varles to be deendent on the resonse varle. Also, t assumes condtonal ndeendence between the redctve varles, meanng that gven the value of Y, X...X are ndeendent. Asde from these assumtons, a nave Bayes model s a - II.5-4 -
5 Internatonal Conference on Comuter Systems and Technologes - ComSysTech 06 very owerful classfer. Inference n a nave Bayes networ s not as straghtforward as n a CPT networ. A networ wth redctve varles X...X and resonse varle Y wth states y...ym wll have the robltes y ) lsted n the CPT. In a nave Bayes networ, these values need to be calculated as follows: y ) ) ) ( y ) y )) (9) The values for y ) and can be obtaned by arameter learnng n the nave Bayes networ. from equaton 8 can be relaced by the rght art of equaton 9. ow the formula for calculaton of the CPT values s comlete. RESULTS The method ove was mlemented and tested on a banrutcy data set from a ban. The banrutces were rare (around 2% of all records). As a erformance measure for scorng the models, the GII nde was used. In a grah where a curve s lotted as the cumulatve banrut ercentage of clents aganst the total ercentage of clents when they are sorted on rs (low rs on the left, hgh rs on the rght), the GII nde s defned as the area between the dagonal and the curve (the Lorentz curve) dvded by the total area under the dagonal. It has a range between - and, or 00% to 00%. Hgh scores ndcate good models. The GII nde s wdely used n socal scences to measure the dscrmnatve ower of a model. Tle 2: Performance of models MODEL GII ave Weghtng W 76.4% ave Weghtng W5 77.0% ave Weghtng W0 ave Weghtng W20 77.% 77.2% ave Weghtng W % ave Weghtng W % osy-max Weghtng W0 7.4% ave Smoothng C0 ave Smoothng C % 76.6% ave Smoothng C40 osy-max Smoothng C % 7.4% More nformaton out the GII nde and ts alcatons can be found n []. A coule of modelng methods were comared, of whch the results are lsted n Tle 2. Aarently, for ths data set, nave Bayes rovdes much better rors than nosy-max. Ths s artly due to the fact that a subotmal learnng algorthm for nosy-max s used. - II.5-5 -
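The GINI index described above can be computed directly from the sorted risk scores. The helper below is an illustrative sketch (trapezoidal integration of the Lorenz curve on a toy portfolio), not the bank's actual scoring code:

```python
def gini_index(risks, bankrupt):
    """Area between the diagonal and the Lorenz curve, divided by the area
    under the diagonal (0.5). Clients are sorted on risk, low to high."""
    pairs = sorted(zip(risks, bankrupt))
    n, total = len(pairs), sum(b for _, b in pairs)
    xs, ys, cum = [0.0], [0.0], 0
    for k, (_, b) in enumerate(pairs, 1):
        cum += b
        xs.append(k / n)           # cumulative share of clients
        ys.append(cum / total)     # cumulative share of bankruptcies
    area = sum((xs[k] - xs[k-1]) * (ys[k] + ys[k-1]) / 2 for k in range(1, n + 1))
    return (0.5 - area) / 0.5      # ranges from -1 (inverted) to 1 (perfect)

# A model that gives the single bankrupt client the highest risk scores well:
print(gini_index([0.1, 0.2, 0.3, 0.9], [0, 0, 0, 1]))  # 0.75
```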
Even with EM learning, noisy-MAX scores will not be higher than 76%. Secondly, the weighting seems to perform better than the smoothing for this data set.

CONCLUSIONS AND FUTURE WORK
Two improvements to learning for rare event data were suggested in this paper. Firstly, weighted merging instead of cutting, which allows a more gentle balance between a prior and a posterior. Secondly, inference in a naive Bayes model can provide excellent priors for a CPT in a normal Bayesian network. Compared to existing methods, such as noisy-MAX priors, naive Bayes priors perform better on the test data set. Unfortunately, the best performing model on this data set was naive Bayes without any posterior learning. Therefore, more data sets to test these methods on are required for a definite statement. These initial results are very promising.

REFERENCES
[1] Beck, N., G. King & L. Zeng. The Problem with Quantitative Studies of International Conflict. http://web.polmeth.ufl.edu/papers/98/beck98.zip.
[2] Friedman, N., D. Geiger & M. Goldszmidt. Bayesian Network Classifiers. Machine Learning 29(2-3):131-163, 1997.
[3] Heckerman, D. A Tutorial on Learning Bayesian Networks. Technical Report, Microsoft Research, 1995.
[4] King, G. & L. Zeng. Logistic Regression in Rare Events Data. Political Analysis 9(2):137-163, 2001.
[5] Lauritzen, S.L. & D.J. Spiegelhalter. Local computations with probabilities on graphical structures and their application to expert systems (with discussion). Journal of the Royal Statistical Society, Series B 50(2):157-224, 1988.
[6] Neal, R.M. Bayesian Learning for Neural Networks. Springer-Verlag, 1996.
[7] Onisko, A., M.J. Druzdzel & H. Wasyluk. Learning Bayesian network parameters from small data sets: application of Noisy-OR gates. International Journal of Approximate Reasoning 27(2):165-182, 2001.
[8] Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
[9] Ripley, B.D. Pattern Recognition and Neural Networks. Cambridge University Press, 1996.
[10] Spirtes, P., C. Glymour & R. Scheines. Causation, Prediction and Search. Lecture Notes in Statistics 81, Springer Verlag, 1993.
[11] Xu, K. How has the literature on Gini's Index evolved in the past 80 years? Working paper.

ABOUT THE AUTHOR
Assoc. Prof. L. J. M. Rothkrantz, Department of Man-Machine Interaction, Delft University of Technology, e-mail: L.J.M.Rothkrantz@ewi.tudelft.nl
More informationOnline Classification: Perceptron and Winnow
E0 370 Statstcal Learnng Theory Lecture 18 Nov 8, 011 Onlne Classfcaton: Perceptron and Wnnow Lecturer: Shvan Agarwal Scrbe: Shvan Agarwal 1 Introducton In ths lecture we wll start to study the onlne learnng
More information6. Hamilton s Equations
6. Hamlton s Equatons Mchael Fowler A Dynamcal System s Path n Confguraton Sace and n State Sace The story so far: For a mechancal system wth n degrees of freedom, the satal confguraton at some nstant
More informationTime-Varying Systems and Computations Lecture 6
Tme-Varyng Systems and Computatons Lecture 6 Klaus Depold 14. Januar 2014 The Kalman Flter The Kalman estmaton flter attempts to estmate the actual state of an unknown dscrete dynamcal system, gven nosy
More informationAlgorithms for factoring
CSA E0 235: Crytograhy Arl 9,2015 Instructor: Arta Patra Algorthms for factorng Submtted by: Jay Oza, Nranjan Sngh Introducton Factorsaton of large ntegers has been a wdely studed toc manly because of
More informationOn New Selection Procedures for Unequal Probability Sampling
Int. J. Oen Problems Comt. Math., Vol. 4, o. 1, March 011 ISS 1998-66; Coyrght ICSRS Publcaton, 011 www.-csrs.org On ew Selecton Procedures for Unequal Probablty Samlng Muhammad Qaser Shahbaz, Saman Shahbaz
More informationQUANTITATIVE RISK MANAGEMENT TECHNIQUES USING INTERVAL ANALYSIS, WITH APPLICATIONS TO FINANCE AND INSURANCE
QANTITATIVE RISK MANAGEMENT TECHNIQES SING INTERVA ANAYSIS WITH APPICATIONS TO FINANCE AND INSRANCE Slva DED Ph.D. Bucharest nversty of Economc Studes Deartment of Aled Mathematcs; Romanan Academy Insttute
More informationPsychology 282 Lecture #24 Outline Regression Diagnostics: Outliers
Psychology 282 Lecture #24 Outlne Regresson Dagnostcs: Outlers In an earler lecture we studed the statstcal assumptons underlyng the regresson model, ncludng the followng ponts: Formal statement of assumptons.
More informationDependent variable for case i with variance σ 2 g i. Number of distinct cases. Number of independent variables
REGRESSION Notaton Ths rocedure erforms multle lnear regresson wth fve methods for entry and removal of varables. It also rovdes extensve analyss of resdual and nfluental cases. Caseweght (CASEWEIGHT)
More informationOn the correction of the h-index for career length
1 On the correcton of the h-ndex for career length by L. Egghe Unverstet Hasselt (UHasselt), Campus Depenbeek, Agoralaan, B-3590 Depenbeek, Belgum 1 and Unverstet Antwerpen (UA), IBW, Stadscampus, Venusstraat
More informationQuantitative Discrimination of Effective Porosity Using Digital Image Analysis - Implications for Porosity-Permeability Transforms
2004, 66th EAGE Conference, Pars Quanttatve Dscrmnaton of Effectve Porosty Usng Dgtal Image Analyss - Implcatons for Porosty-Permeablty Transforms Gregor P. Eberl 1, Gregor T. Baechle 1, Ralf Weger 1,
More informationBayesian predictive Configural Frequency Analysis
Psychologcal Test and Assessment Modelng, Volume 54, 2012 (3), 285-292 Bayesan predctve Confgural Frequency Analyss Eduardo Gutérrez-Peña 1 Abstract Confgural Frequency Analyss s a method for cell-wse
More informationMultilayer Perceptrons and Backpropagation. Perceptrons. Recap: Perceptrons. Informatics 1 CG: Lecture 6. Mirella Lapata
Multlayer Perceptrons and Informatcs CG: Lecture 6 Mrella Lapata School of Informatcs Unversty of Ednburgh mlap@nf.ed.ac.uk Readng: Kevn Gurney s Introducton to Neural Networks, Chapters 5 6.5 January,
More informationResearch Journal of Pure Algebra -2(12), 2012, Page: Available online through ISSN
Research Journal of Pure Algebra (, 0, Page: 37038 Avalable onlne through www.rja.nfo ISSN 48 9037 A NEW GENERALISATION OF SAMSOLAI S MULTIVARIATE ADDITIVE EXPONENTIAL DISTRIBUTION* Dr. G. S. Davd Sam
More informationPredictive Analytics : QM901.1x Prof U Dinesh Kumar, IIMB. All Rights Reserved, Indian Institute of Management Bangalore
Sesson Outlne Introducton to classfcaton problems and dscrete choce models. Introducton to Logstcs Regresson. Logstc functon and Logt functon. Maxmum Lkelhood Estmator (MLE) for estmaton of LR parameters.
More informationMLE and Bayesian Estimation. Jie Tang Department of Computer Science & Technology Tsinghua University 2012
MLE and Bayesan Estmaton Je Tang Department of Computer Scence & Technology Tsnghua Unversty 01 1 Lnear Regresson? As the frst step, we need to decde how we re gong to represent the functon f. One example:
More informationOn the Connectedness of the Solution Set for the Weak Vector Variational Inequality 1
Journal of Mathematcal Analyss and Alcatons 260, 15 2001 do:10.1006jmaa.2000.7389, avalable onlne at htt:.dealbrary.com on On the Connectedness of the Soluton Set for the Weak Vector Varatonal Inequalty
More informationImprovement in Estimating the Population Mean Using Exponential Estimator in Simple Random Sampling
Bulletn of Statstcs & Economcs Autumn 009; Volume 3; Number A09; Bull. Stat. Econ. ISSN 0973-70; Copyrght 009 by BSE CESER Improvement n Estmatng the Populaton Mean Usng Eponental Estmator n Smple Random
More informationCHAPTER 3: BAYESIAN DECISION THEORY
HATER 3: BAYESIAN DEISION THEORY Decson mang under uncertanty 3 Data comes from a process that s completely not nown The lac of nowledge can be compensated by modelng t as a random process May be the underlyng
More informationThe Gaussian classifier. Nuno Vasconcelos ECE Department, UCSD
he Gaussan classfer Nuno Vasconcelos ECE Department, UCSD Bayesan decson theory recall that we have state of the world X observatons g decson functon L[g,y] loss of predctng y wth g Bayes decson rule s
More informationMaximum Likelihood Estimation of Binary Dependent Variables Models: Probit and Logit. 1. General Formulation of Binary Dependent Variables Models
ECO 452 -- OE 4: Probt and Logt Models ECO 452 -- OE 4 Mamum Lkelhood Estmaton of Bnary Dependent Varables Models: Probt and Logt hs note demonstrates how to formulate bnary dependent varables models for
More informationSupplementary Material for Spectral Clustering based on the graph p-laplacian
Sulementary Materal for Sectral Clusterng based on the grah -Lalacan Thomas Bühler and Matthas Hen Saarland Unversty, Saarbrücken, Germany {tb,hen}@csun-sbde May 009 Corrected verson, June 00 Abstract
More informationANSWERS. Problem 1. and the moment generating function (mgf) by. defined for any real t. Use this to show that E( U) var( U)
Econ 413 Exam 13 H ANSWERS Settet er nndelt 9 deloppgaver, A,B,C, som alle anbefales å telle lkt for å gøre det ltt lettere å stå. Svar er gtt . Unfortunately, there s a prntng error n the hnt of
More informationMultiple Regression Analysis
Multle Regresson Analss Roland Szlág Ph.D. Assocate rofessor Correlaton descres the strength of a relatonsh, the degree to whch one varale s lnearl related to another Regresson shows us how to determne
More informationChapter 13: Multiple Regression
Chapter 13: Multple Regresson 13.1 Developng the multple-regresson Model The general model can be descrbed as: It smplfes for two ndependent varables: The sample ft parameter b 0, b 1, and b are used to
More information1 The Mistake Bound Model
5-850: Advanced Algorthms CMU, Sprng 07 Lecture #: Onlne Learnng and Multplcatve Weghts February 7, 07 Lecturer: Anupam Gupta Scrbe: Bryan Lee,Albert Gu, Eugene Cho he Mstake Bound Model Suppose there
More information