L1 Variable Selection Based Causal Structure Learning

Rania N. Elkhateeb (Faculty of Science, Sohag University), Mahmoud A. Mofaddel (Faculty of Science, Sohag University), Marghany H. Mohamed (Faculty of Computers & Information, Assiut University)

Abstract- Causal graphs are the most suitable representation of the causal relationships between the variables of a complex system. Directed acyclic graphs (DAGs), also known as Bayesian networks, are well known and frequently used to represent these causal relationships. Learning the structure of causal DAGs from data is very important in applications to various fields, such as medicine, artificial intelligence and bioinformatics. Recently the L1-regularization technique has been used to learn the structure of causal DAG models. In this paper we propose an improved version of the Grow-Shrink (GS) algorithm, presented by Margaritis (2000), in which we use L1 variable selection for the Markov blanket selection step. We then compare the performance of the proposed method with the state-of-the-art PC (Peter & Clark) and GS algorithms. Finally, we discuss and analyze the results.

Keywords: Bayesian networks; causal discovery; graphical models; Markov blanket; regularization.

I. INTRODUCTION

Bayesian networks (DAGs) are the most convenient graphs for representing causal relationships, because they are directional and thus capable of displaying relationships clearly and intuitively. A DAG consists of a set of nodes N, where each node represents a variable X that receives directed edges from its set of parent nodes P. DAGs can represent both direct and indirect causation and handle uncertainty through the established theory of probability [1]. The last decade witnessed great interest in learning causal Bayesian networks from observational data, because such networks can be used to automatically construct decision support systems. In addition, learned Bayesian networks have been used in several applications, for example in bioinformatics for the interpretation and discovery of gene regulatory pathways, in variable selection for classification, in designing algorithms that optimally solve a problem under certain conditions, in information retrieval and in natural language processing [3]. Algorithms for learning causal Bayesian networks can be classified into three main types. The first, constraint-based algorithms, use tests to detect relationships among variables; examples are the SGS and PC algorithms [7]. The second, score-based algorithms, select the structure with the highest value of a score function that measures how well a structure fits the data; an example is the GES algorithm [8]. The third, hybrid algorithms, utilize both score-based and constraint-based methods to construct the graph; an example is MMHC (Max-Min Hill Climbing) [12]. L1 regularization was first used to learn structure in probabilistic undirected graphical models. However, it is preferable to use it for learning directed acyclic graphs (DAGs): DAGs are more efficient for computing joint probabilities, samples and (approximate) marginals; the likelihood in DAG models factorizes into a product of single-variable conditional distributions, in contrast to undirected graphs, where estimating single-variable conditional distributions is only an approximation; and DAGs allow mixing different types of variables in a straightforward way. Some algorithms, like GS [2] and CS (Collider Set) [4], use the Markov blanket approach to improve scalability by exploiting information obtained from feature selection algorithms.
The GS algorithm belongs to the first class (constraint-based methods) and addresses the disadvantages of this category, namely exponential execution times and proneness to errors in the dependence tests used, by identifying the Markov blanket of each variable in the Bayesian network as a preprocessing step. This preprocessing step reduces the exponential running time to polynomial under the assumption of bounded Markov blanket size. In this paper we use the L1 regularization technique for feature selection [5] in the preprocessing step to improve the existing GS algorithm, since adding regularization to a learning algorithm avoids overfitting. The L1 norm is advantageous because it encourages sparse solutions, which in turn reduce the computational requirements. The background and the problem statement are given in section 2. The GS algorithm is discussed in section 3. After that, the proposed algorithm L1GS is described in section 4. The experimental results of applying the L1GS, GS and PC algorithms to a set of observational datasets, together with a comparison between them, are shown in section 5. Finally, section 6 provides the conclusion and hints at future work.
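To illustrate the sparsity claim, the following minimal sketch (our illustration, not from the paper) contrasts L1- and L2-penalized regression on synthetic data in which only three of twenty predictors are relevant; the L1 fit drives the irrelevant coefficients exactly to zero.

```python
# Minimal sketch: L1 (lasso) vs. L2 (ridge) sparsity on synthetic data.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 200, 20
X = rng.normal(size=(n, p))
# Only 3 of the 20 variables actually influence the target.
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + X[:, 7] + 0.1 * rng.normal(size=n)

l1 = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: sparse coefficients
l2 = Ridge(alpha=0.1).fit(X, y)   # L2 penalty: dense coefficients

print("nonzero L1 coefficients:", int(np.sum(np.abs(l1.coef_) > 1e-8)))  # ~3
print("nonzero L2 coefficients:", int(np.sum(np.abs(l2.coef_) > 1e-8)))  # 20
```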

II. BACKGROUND AND PROBLEM STATEMENT

Relatively recently (the 1980s), the idea of inferring causal relations from observational data took hold in many practical cases [10, 11]. Since then, several algorithms have been developed to infer such causal relations, greatly reducing the number of experiments required to discover the causal structure. In general, learning the causal structure [6, 7] is a multivariate data analysis problem that aims at constructing a directed acyclic graph (DAG) representing the direct causal relationships among the interesting variables of a given system. The last decade witnessed more interest in local learning. Markov blanket learning methods were first proposed by Koller & Sahami (1996), and Cooper (1997) introduced incomplete local causal methods. Local causal discovery methods (direct cause and effect) were first introduced by Aliferis & Tsamardinos (2002a) and Tsamardinos et al. (2003b) [13]. Using the Markov blanket definition (Mb(X) is a minimal variable subset conditioned on which all other measured variables are probabilistically independent of X) in causal structure learning achieves promising results, since it was first proposed by Margaritis (2000) [1]. This approach is quite efficient, since it extracts the Markov blanket information for each variable X in V from observational data, where V is the set of variables, and then constructs a DAG from it. Another attempt at using this technique was proposed by Pellet & Elisseeff (2008) [9]. The Markov blanket of each variable X in V is the smallest set containing all variables carrying information about X that cannot be obtained from any other variable; in a causal graph this is the set of all parents, children, and spouses of X. Fig. 1 presents an example of Mb(X). Formally, in the context of a faithful causal graph G we have: $X \in \mathrm{Mb}(Y) \Leftrightarrow Y \in \mathrm{Mb}(X)$. An important property of the Markov blanket is that Mb(X) is unique in a faithful BN or CPN.

Fig. 1: An example of the Markov blanket of variable X.

As mentioned before, DAG models are one of the best ways to model the joint distribution P(X_1, X_2, ..., X_n) of a set of n random variables. If we repeatedly apply the definition of conditional probability, P(X, Y) = P(Y | X) P(X), in the order n down to 1, we obtain the factorization of the joint distribution:

$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid X_1, \ldots, X_{i-1})$

This factorization of the joint distribution is valid for any probability distribution. The main goal of this paper is to use the Markov blanket approach to learn the structure of a DAG that encodes the causal relationships among a set of interesting variables V from observational data D. We propose an improved version of the GS algorithm in which we use L1 regularization for feature selection, selecting the Markov blanket of each interesting variable in a preprocessing step before starting the causal structure learning phase. We call the proposed algorithm L1GS.

III. GROW-SHRINK ALGORITHM (GS)

GS is the first published sound Markov blanket induction algorithm, proposed by Margaritis & Thrun [2]. It was introduced with the intent of inducing the Markov blanket in order to speed up global network learning. The whole algorithm can be split into two phases; the first finds the Markov blanket of each variable, which greatly facilitates the recovery of the local structure around each node; the second performs further conditional-independence tests around each variable to infer the structure locally, and uses a heuristic to remove cycles possibly introduced by previous steps. Table 1 lists the steps of GS responsible for building the local structure using the Markov blanket information; DN(X) denotes the direct neighbours of X.
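Since every step of GS revolves around Mb(X), a concrete illustration may help before the details. The following minimal sketch (our illustration, not the paper's code) computes Mb(X) as the union of parents, children and spouses in a known DAG represented as a parent dictionary, and checks the symmetry property X in Mb(Y) iff Y in Mb(X) stated above; the example graph is hypothetical.

```python
# Minimal sketch: Markov blanket = parents + children + spouses in a DAG.
def markov_blanket(parents, x):
    """parents: dict mapping each node to the set of its parent nodes."""
    children = {v for v, ps in parents.items() if x in ps}
    spouses = {p for c in children for p in parents[c]} - {x}
    return set(parents[x]) | children | spouses

# Hypothetical DAG: A -> X, X -> C, B -> C
dag = {"A": set(), "B": set(), "X": {"A"}, "C": {"X", "B"}}
print(markov_blanket(dag, "X"))  # {'A', 'C', 'B'}

# Symmetry check: X in Mb(Y) iff Y in Mb(X), for all pairs.
nodes = list(dag)
assert all((x in markov_blanket(dag, y)) == (y in markov_blanket(dag, x))
           for x in nodes for y in nodes if x != y)
```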
The complexity of step 1, which explores the Markov blankets, is O(n) in the number of independence tests.

The GS algorithm passes over each variable and its Markov blanket members twice. In the first pass it removes the possible spouse links between linked variables X and Y by looking for a d-separating set around X and Y (a set Z d-separates X and Y if every path from X to Y is blocked by Z). In the second pass, GS orients the arcs whenever it finds that conditioning on a middle node creates a dependency. While searching for the appropriate conditioning set, GS selects the smallest base search set (set B in Table 1) for each phase. The search for the smallest set B has two advantages. First, it reduces the number of tests, which is desirable because each phase contains a subset search that is exponential in time with respect to the searched set. Second, it reduces the average size of the conditioning set, which increases the power of the statistical tests and thus helps reduce the number of type II errors [9]. It is clear from Table 1 that the whole algorithm requires O(n^2 + n b^2 2^b) conditional independence tests, where b = max_X |B(X)|. If we assume that b is bounded by a constant, then the algorithm is O(n^2) in the number of conditional independence tests. The core advantage of the GS algorithm is its use of a Markov blanket technique before starting the structure learning, because this restricts the size of the conditioning sets. We concentrate on this step to improve the performance of GS and choose another Markov blanket algorithm, L1MB; it is described in detail in the next section.

TABLE I: PLAIN STEPS OF THE GS ALGORITHM

1. Compute Markov blankets. For all X in V, compute Mb(X).
2. Compute graph structure. For all X in V and Y in Mb(X), determine Y to be a direct neighbour of X if X and Y are dependent given S for all S ⊆ T, where T is the smaller of Mb(X) − {Y} and Mb(Y) − {X}.
3. Orient edges. For all X in V and Y in DN(X), orient Y → X if there exists Z in DN(X) − DN(Y) − {Y} such that Y and Z are dependent given S ∪ {X} for all S ⊆ U, where U is the smaller of Mb(Y) − {X, Z} and Mb(Z) − {X, Y}.
4. Remove cycles. Do the following while there exist cycles in the graph:
   a. compute the set of edges C = {X → Y : X → Y is part of a cycle};
   b. remove from the graph the edge in C that is part of the greatest number of cycles, and put it in R.
5. Reverse edges. Insert each edge from R in the graph, reversed.
6. Propagate directions. For all X in V and Y in DN(X) such that neither Y → X nor X → Y, execute the following rule until it no longer applies: if there exists a directed path from X to Y, orient X → Y.
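Step 1 is the step our proposal replaces. In Margaritis & Thrun's original formulation [2], Mb(X) is computed by a grow phase, which adds any variable dependent on X given the current blanket, followed by a shrink phase, which removes the false positives admitted earlier. A minimal sketch, assuming a conditional-independence test indep(x, y, s) is supplied by the caller:

```python
# Minimal sketch of the grow-shrink Markov blanket step (step 1 above),
# following Margaritis & Thrun's description; `indep(x, y, s)` is an
# abstract conditional-independence test supplied by the caller.
def grow_shrink_mb(x, variables, indep):
    mb = set()
    changed = True
    while changed:                        # grow phase: add dependent variables
        changed = False
        for y in variables:
            if y != x and y not in mb and not indep(x, y, mb):
                mb.add(y)
                changed = True
    for y in list(mb):                    # shrink phase: remove false positives
        if indep(x, y, mb - {y}):
            mb.discard(y)
    return mb
```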

IV. THE PROPOSED ALGORITHM L1GS

Using L1 regularization to learn the structure of DAG models has been studied before by Li & Yang (2004, 2005), Huang et al. (2006) and Levina et al. (2008). In this paper we use L1 regularization for variable selection to explore the Markov blanket of each variable. We propose an improved version of the GS algorithm in which we use the L1MB algorithm [5], introduced by Mark Schmidt (2007), as the first step of Table 1 to compute the Markov blanket of each variable X in V. Using regression to find the Markov blanket results in much lower false negative rates; it is also more statistically efficient because it does not need to perform conditional independence tests on exponentially large conditioning sets. Table 2 shows the steps of the L1MB algorithm. The problem of choosing the Markov blanket of a node X from a set V can be formulated as solving

$\hat{\theta} = \arg\min_{\theta} \; \mathrm{NLL}(X, V \setminus \{X\}, \theta) + \lambda \lVert \theta \rVert_1$,

where λ is the scale of the penalty on the L1 norm of the parameter vector θ; the simplest way of choosing λ is the cross-validation approach. Hence the L1MB technique depends on excluding the zero-weight variables when regressing each node X on all the others (V − {X}) using L1 variable selection. This process takes O(nd^3) time per node and (ideally) finds a set that is as small as possible but contains all of X's parents, children and co-parents (its Markov blanket).

TABLE II: THE L1MB ALGORITHM

Input: data x_ij for i = 1, ..., n and j = 1, ..., p (values coded as ±1).
Output: a Markov blanket MB_j for each node j.
for j = 1 to p do
    MB_j ← ∅                                          // start with an empty Markov blanket
    s ← score(x_j, x_∅)                               // score of the empty model
    b ← argmin_b Σ_{i=1..n} log(1 + exp(−x_ij b))     // optimize the base (bias) variable
    g_k ← Σ_{i=1..n} x_ik x_ij / (1 + exp(x_ij b))    // regression-weight gradients at w = 0
    λ_max ← max_k |g_k|                               // maximum value of the regularization parameter
    for λ = (p−1)λ_max/p down to 0 in increments of λ_max/p do
        (w, b) ← argmin_{w,b} Σ_{i=1..n} log(1 + exp(−x_ij (x_i^T w + b))) + λ‖w‖_1
        nz ← {u : w_u ≠ 0}                            // nonzero variables
        s_sub ← score(x_j, x_nz)                      // score with the selected blanket
        if s_sub > s then
            s ← s_sub                                 // new maximum score
            MB_j ← nz                                 // higher-scoring Markov blanket
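As an illustration of this regularization-path search, the sketch below re-creates it with scikit-learn's L1-penalized logistic regression. This is our own rough analogue under simplifying assumptions (binary 0/1 data, accuracy as a crude stand-in for the likelihood score), not the authors' Matlab implementation of L1MB [5].

```python
# Sketch of an L1MB-style Markov blanket search via an L1 regularization path.
import numpy as np
from sklearn.linear_model import LogisticRegression

def l1mb(X, j, n_lambdas=10):
    """Return indices of a candidate Markov blanket for column j of binary X."""
    n, p = X.shape
    y = X[:, j]                       # target node (must contain both classes)
    others = np.delete(np.arange(p), j)
    Xo = X[:, others]
    best_score, best_mb = -np.inf, np.array([], dtype=int)
    for C in np.logspace(-2, 2, n_lambdas):         # C is the inverse of lambda
        model = LogisticRegression(penalty="l1", solver="liblinear", C=C)
        model.fit(Xo, y)
        nz = others[np.abs(model.coef_[0]) > 1e-8]  # nonzero variables
        if nz.size == 0:
            continue
        # Refit (nearly) unpenalized on the selected subset and score it.
        sub = LogisticRegression(C=1e6).fit(X[:, nz], y)
        score = sub.score(X[:, nz], y)
        if score > best_score:
            best_score, best_mb = score, nz
    return best_mb
```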

V. EXPERIMENTAL RESULTS

In order to test the effectiveness of the proposed algorithm L1GS and compare its results with those of the original GS and the state-of-the-art PC algorithm, we performed a series of experiments on different samples of well-known datasets from the Bayes net repository [14]; Table 3 gives some statistical information about these datasets. We used the BNT Matlab toolbox [15] to implement PC, we used Mark Schmidt's implementation of L1MB [5] for the proposed algorithm L1GS, and the GS algorithm was also implemented in Matlab. The statistical tests were done using the chi-square (X^2) test; for PC and GS we chose the default value of the significance level α.

TABLE III: DATASETS FROM THE BAYES NET REPOSITORY. For each network (Alarm, Insurance, Hailfinder, Carpo), the table lists the number of nodes, the number of edges, the maximum number of parents and the corresponding figure.

Fig. 2: Alarm network. Fig. 3: Insurance network. Fig. 4: Hailfinder network. Fig. 5: Carpo network.
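For reference, the following is a minimal sketch of a chi-square conditional-independence test for discrete data, the kind of test used by PC, GS and L1GS above; it is our own illustration, not the toolbox implementation (which tests the sample partial correlation coefficient).

```python
# Minimal chi-square test of X independent of Y given discrete Z.
import numpy as np
from scipy.stats import chi2

def ci_test(x, y, z_cols, alpha=0.05):
    """x, y: 1-D integer arrays; z_cols: (n, k) integer array, k may be 0."""
    if z_cols.size:
        _, inv = np.unique(z_cols, axis=0, return_inverse=True)
        strata = inv.ravel()          # one stratum per configuration of Z
    else:
        strata = np.zeros(len(x), dtype=int)
    stat, dof = 0.0, 0
    for s in np.unique(strata):
        xs, ys = x[strata == s], y[strata == s]
        xv, yv = np.unique(xs), np.unique(ys)
        obs = np.array([[np.sum((xs == a) & (ys == b)) for b in yv] for a in xv])
        if obs.shape[0] < 2 or obs.shape[1] < 2:
            continue                  # stratum carries no information
        exp = np.outer(obs.sum(1), obs.sum(0)) / obs.sum()
        stat += np.sum((obs - exp) ** 2 / np.where(exp > 0, exp, 1.0))
        dof += (obs.shape[0] - 1) * (obs.shape[1] - 1)
    p_value = chi2.sf(stat, dof) if dof > 0 else 1.0
    return p_value > alpha            # True: accept independence at level alpha
```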

In this series of experiments, we compared the proposed algorithm L1GS to the original GS, where the graph being built is initialized with the Markov blanket information about each variable. The PC algorithm, however, was modified to start with the moral graph (instead of the full graph, as in the original version of PC). We tested the three algorithms on each network using the chi-square (X^2) test of the sample partial correlation coefficient as computed on artificial data, with significance level α. Table 4 shows the results of these experiments.

TABLE IV: NUMBER OF TESTS AND SIZE OF THE CONDITIONING SETS AS PERFORMED BY THE VARIOUS ALGORITHMS. For each network (Alarm, Insurance, Hailfinder, Carpo), the table reports, for each of L1GS, GS and PC, the number of tests and the average and maximum conditioning-set sizes.

The results for the modified PC algorithm are shown only for comparison purposes; it is a general-purpose algorithm that is not specialized in local structure learning using Markov blankets. From the comparison we note that, whenever the Markov blanket information is available or cheap to obtain, there are algorithms more efficient than the state of the art. From the results in Table 4, L1GS outperforms the others in nearly all the experiments, or at least attains the same results. L1GS and the original GS are close to one another in all scores, and both outperform PC in the number of tests and in the average and maximum size of the conditioning sets, because they use the Markov blanket information. Moreover, our approach is slightly better than the others in terms of the number of tests, while still using smaller average and maximum conditioning-set sizes on all tested networks.

VI. CONCLUSION AND FUTURE WORK

Markov blanket approaches enable constraint-based learning methods to learn the local causal structure of DAGs. They improve the algorithms' scalability by limiting the sizes of the conditioning sets to Markov blankets. Hence, given correct Markov blanket information together with theoretically correct conditional independence testing and a sound search procedure, the learning algorithm will recover the best DAG structure. Finally, causal discovery and feature selection are strongly linked: optimal feature selection discovers Markov blankets as sets of strongly relevant features, while causal discovery identifies Markov blankets as direct causes, direct effects and direct causes of direct effects. The undirected moral graph (an approximation of the causal graph) can be obtained by performing perfect feature selection on each variable. Then an extra step, computing the graph structure, is needed in order to transform the Markov blankets into parents, children and spouses. This step is exponential in the worst case, but is actually efficient provided the graph is sparse enough; this sparsity is achieved when using L1 variable selection, and this is what we implement in this paper.

The future challenges lie in learning causal structure, including robust and consistent distribution-free structure learning with continuous and potentially highly nonlinear data. We also look forward to extending the L1GS technique to handle multi-state discrete variables, as well as to modeling parent interactions and nonlinear effects. An apparent weakness of two-stage approaches is that if a true parent is missed in stage 1, it will never be recovered in stage 2. Another weakness of the existing algorithms is computational efficiency, i.e., it may take hours or days to learn a large-scale BN such as one with 500 nodes.

REFERENCES
[1] D. Margaritis, Learning Bayesian Network Structure from Data. PhD thesis, CMU-CS, 2003.
[2] D. Margaritis and S. Thrun, Bayesian network induction via local neighborhoods. In Advances in Neural Information Processing Systems 12, 1999.
[3] N. Friedman, M. Linial, I. Nachman and D. Pe'er, Using Bayesian networks to analyze expression data. Journal of Computational Biology, 7, 2000.
[4] J.-P. Pellet and A. Elisseeff, Using Markov blankets for causal structure learning. Journal of Machine Learning Research, 9, 2008.
[5] M. Schmidt, A. Niculescu-Mizil and K. Murphy, Learning graphical model structure using L1-regularization paths. In Proceedings of the AAAI Conference on Artificial Intelligence, 2007.
[6] J. Pearl and T. Verma, A theory of inferred causation. In Principles of Knowledge Representation and Reasoning: Proceedings of the Second International Conference, Morgan Kaufmann, San Francisco, California, 1991.
[7] P. Spirtes, C. Glymour and R. Scheines, Causation, Prediction, and Search. Springer-Verlag, 1993.
[8] D. M. Chickering, Optimal structure identification with greedy search. Journal of Machine Learning Research, 3, 2002.
[9] J.-P. Pellet and A. Elisseeff, Using Markov blankets for causal structure learning. Journal of Machine Learning Research, 9, 2008.
[10] J. Pearl, Causality: Models, Reasoning, and Inference. Cambridge University Press, Cambridge, U.K., 2000.
[11] P. Spirtes, C. N. Glymour and R. Scheines, Causation, Prediction, and Search, 2nd edition. MIT Press, Cambridge, Mass., 2000.
[12] I. Tsamardinos, L. E. Brown and C. F. Aliferis, The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 65(1):31-78, 2006.
[13] C. F. Aliferis, A. Statnikov, I. Tsamardinos, S. Mani and X. D. Koutsoukos, Local causal and Markov blanket induction for causal discovery and feature selection for classification, part I: algorithms and empirical evaluation. Journal of Machine Learning Research, 11, 2010.
[14] G. Elidan, Bayes net repository. Website.
[15] P. Leray and O. Francois, BNT structure learning package.
