Int. Statistical Inst.: Proc. 58th World Statistical Congress, 2011, Dublin (Session CPS05) p.585

Principal Component Analysis of Modal Interval-valued Data with Constant Numerical Characteristics

WANG Huiwen, CHEN Meiling, LI Nan, WANG Lanhui
School of Economics and Management, Beihang University, Beijing 100191, China
Research Center of Complex Data Analysis, Beihang University, Beijing 100191, China
School of Economics and Management, Beijing Forestry University, Beijing 100083, China

Abstract: Modal interval-valued data is one of the most important types of symbolic data, and each unit of its matrix contains a histogram or a distribution function. In this paper, a new method for Principal Component Analysis (PCA) of modal interval-valued data is discussed. PCA aims to reduce the dimensions of a large dataset by reconstructing the covariance matrix. The fundamental elements of the covariance matrix, namely the mean, the variance and the covariance, and the way they are defined are therefore important in PCA. Some current research on PCA of modal interval-valued data achieves dimension reduction by transforming the histogram-valued data into interval data. In other existing methods, the mean is defined in distributive data form, as an average level over all modal-valued observations. However, data centralization based on a mean defined this way actually yields residual distributions. PCA carried out on the matrix of residual distributions may thus fail to represent the essential variation of the original data. In this paper, we define the numerical characteristics of modal interval-valued data as real constants, which can make full use of the information in the histograms. Centralization in terms of constant numerical characteristics relocates the modal-valued variables as a whole, so that the original histograms are kept and their gravity center is settled on the origin. On this basis, PCA of modal interval-valued data with constant numerical characteristics, based on the resulting covariance matrix, is proposed.
A simulation demonstrates the effectiveness of the proposed method.

Key words: Modal interval-valued data; Principal Component Analysis; Constant Numerical Characteristics

1 Introduction

Symbolic data analysis is one of the most groundbreaking theoretical achievements in modern statistical data analysis. Modal interval-valued data is one of the symbolic data types, and each unit of a high-dimensional modal interval-valued data matrix contains a histogram or a distribution function. This paper focuses on PCA of modal interval-valued data.

The routine approach to PCA of modal interval-valued data is to transform it into interval data by a certain transformation method, as in Rodriguez O., Diday E., Winsberg S. [1] (2000) and Sun Makosso Kallyth, Edwin Diday [2] (2010); the result can then be analyzed by PCA for interval-valued data. A more direct method for histogram PCA was presented by P. Nagabhushan and R. Pradeep Kumar [3] (2007). In that paper, they defined the unit histogram, the null histogram, and the basic arithmetic operations on histograms (addition, subtraction, multiplication and division), and proposed a histogram PCA. However, the means of histogram variables under this method are themselves in histogram form, so data centralization yields residual histograms. Histogram PCA on the matrix of residual histograms may thus fail to represent the essential variation of the original data.

In this paper, we attempt to explore a new PCA method for modal interval-valued variables. Based on the integral calculation of numerical characteristics of continuous random variables in probability theory, we first define constant numerical characteristics of modal interval-valued data, and then implement dimension reduction modeling of a modal interval-valued dataset through PCA. The proposed method not only makes use of the complete information in the histogram, but may also give a more reliable conclusion, since data centralization is carried out on the basis of the constant numerical characteristics.
Furthermore, an approximate method to calculate the linear combination of modal interval-valued variables is given, following the algorithm of univariate histogram theory for histogram-valued data [4] (2006) combined with
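Moore algebra [5] in interval data analysis. Thus the projection of the original modal interval-valued data onto the principal axes can be realized.

This paper is structured as follows: Section 2 introduces several basic definitions of the numerical characteristics of modal interval-valued data, together with the derivation and calculation steps of PCA on modal interval-valued data; a simulation is conducted in Section 3 to validate the effectiveness of the proposed method; the last section gives the summary.

2 Methodology

We consider an $n \times p$ data matrix $X = (x_{ij})_{n \times p}$, which is called modal interval-valued data, and whose elements are all random variables that follow a histogram or a distribution function. Here $x_{ij} = \{I_{ij}, f_{ij}\}$ means the random variable $x_{ij}$ defined on the field of definitions $I_{ij}$ with the density function $f_{ij}$, subject to:

$$f_{ij}(x) \ge 0, \qquad \int_{I_{ij}} f_{ij}(x)\,dx = 1.$$

For histogram data, the density function $f_{ij}$ is denoted

$$f_{ij}(x) = \begin{cases} \dfrac{p_{ijk}}{\overline{x}_{ijk} - \underline{x}_{ijk}}, & x \in I_{ijk}\ (k = 1, 2, \ldots, K_{ij}) \\ 0, & \text{else} \end{cases} \qquad i = 1, 2, \ldots, n;\ j = 1, 2, \ldots, p, \qquad (1)$$

where $K_{ij}$ is the number of modalities of $x_{ij}$; $I_{ijk} = [\underline{x}_{ijk}, \overline{x}_{ijk})$ is the $k$th sub-interval of $I_{ij}$, which satisfies $\bigcup_{k=1}^{K_{ij}} I_{ijk} = I_{ij}$; and $p_{ijk}$ is the frequency of $I_{ijk}$, which satisfies $p_{ijk} \ge 0$ and $\sum_{k=1}^{K_{ij}} p_{ijk} = 1$. It is assumed that within each sub-interval $I_{ijk}$ the random variable is uniformly distributed across the sub-interval. Hence, $x_{ij}$ for histogram data can also be denoted as follows:

$$x_{ij} = \{I_{ij}, f_{ij}\} = \{[\underline{x}_{ijk}, \overline{x}_{ijk}),\ p_{ijk};\ k = 1, 2, \ldots, K_{ij}\}, \qquad i = 1, 2, \ldots, n;\ j = 1, 2, \ldots, p.$$

2.1 The first moment, second moment and second order mixed moment

According to classical probability theory, the first moment, second moment and second order mixed moment of a modal-valued variable can be defined as follows.

Definition 1. For a modal-valued variable $X_j$, the first moment is given by

$$E(X_j) = \frac{1}{n} \sum_{i=1}^{n} E(x_{ij}), \qquad (2)$$

where the first moment of unit $x_{ij}$ is defined as

$$E(x_{ij}) = \int_{I_{ij}} x f_{ij}(x)\,dx = \sum_{k=1}^{K_{ij}} \int_{I_{ijk}} \frac{p_{ijk}}{\overline{x}_{ijk} - \underline{x}_{ijk}}\, x\,dx. \qquad (3)$$

Accordingly, the centralization of $x_{ij}$ is given by

$$y_{ij} = x_{ij} - E(X_j) = \{[\underline{x}_{ijk} - E(X_j),\ \overline{x}_{ijk} - E(X_j)),\ p_{ijk};\ k = 1, 2, \ldots, K_{ij}\}, \qquad i = 1, 2, \ldots, n;\ j = 1, 2, \ldots, p. \qquad (4)$$

Definition 2. Given any two modal-valued variables $X_j$ and $X_l$, the second order mixed moment is defined as

$$E(X_j X_l) = \frac{1}{n} \sum_{i=1}^{n} E(x_{ij} x_{il}), \qquad (5)$$

where

$$E(x_{ij} x_{il}) = \iint x\,y\,f_{ij}(x) f_{il}(y)\,dx\,dy = \int_{I_{ij}} x f_{ij}(x)\,dx \int_{I_{il}} y f_{il}(y)\,dy = E(x_{ij})\,E(x_{il}). \qquad (6)$$

Here $x_{ij}$ and $x_{il}$ are supposed to be independent.

Definition 3. For any modal-valued variable $X_j$, the second moment is defined by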
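$$E(X_j^2) = \frac{1}{n} \sum_{i=1}^{n} E(x_{ij}^2), \qquad (7)$$

where

$$E(x_{ij}^2) = \int_{I_{ij}} x^2 f_{ij}(x)\,dx. \qquad (8)$$

Therefore the covariance and variance of modal-valued variables $X_j$, $X_l$ are as follows:

$$Cov(X_j, X_l) = E[(X_j - E(X_j))(X_l - E(X_l))] = E(X_j X_l) - E(X_j)\,E(X_l), \qquad (9)$$

$$D(X_j) = E[(X_j - E(X_j))^2] = E(X_j^2) - E(X_j)^2. \qquad (10)$$

2.2 Linear combination algorithm of modal interval-valued variables

A linear combination algorithm for interval-valued variables was introduced by Moore [5]; the definition can be expressed as follows.

Definition 4. Given interval-valued variables $X_1, X_2, \ldots, X_p$, all with $n$ observations, and real numbers $a_j$, $j = 1, 2, \ldots, p$, each observation can be regarded as a hypercube. Define an interval-valued variable $Y$ as a linear combination of $X_1, X_2, \ldots, X_p$, viz.

$$Y = \sum_{j=1}^{p} a_j X_j = ([\underline{y}_1, \overline{y}_1], [\underline{y}_2, \overline{y}_2], \ldots, [\underline{y}_n, \overline{y}_n])', \qquad (11)$$

where $\underline{y}_i = \sum_{j=1}^{p} a_j (\tau_j \underline{x}_{ij} + (1 - \tau_j) \overline{x}_{ij})$ and $\overline{y}_i = \sum_{j=1}^{p} a_j ((1 - \tau_j) \underline{x}_{ij} + \tau_j \overline{x}_{ij})$, with

$$\tau_j = \begin{cases} 0, & a_j < 0 \\ 1, & a_j \ge 0. \end{cases}$$

With Definition 4, an algorithm to calculate the linear combination of modal interval-valued variables can be presented. Given modal interval-valued variables $X_1, X_2, \ldots, X_p$ and real numbers $a_j$, $j = 1, 2, \ldots, p$, a modal interval-valued variable $Y$ can be defined as a linear combination of $X_1, X_2, \ldots, X_p$ as follows:

$$Y = \sum_{j=1}^{p} a_j X_j, \qquad (12)$$

where $Y$ is a histogram vector, each element of which can be expressed as

$$y_i = \{I_i, f_i\} = \{[\underline{y}_{ik}, \overline{y}_{ik}),\ p_{ik};\ k = 1, 2, \ldots, K\},$$

where $K = \max\{K_{ij};\ i = 1, \ldots, n;\ j = 1, \ldots, p\}$.

For the $i$th histogram observation $x_{ij}$, $j = 1, 2, \ldots, p$, with number of modalities $K_{ij}$, each sub-interval $I_{ijk}$ has its density $p_{ijk}$; thus the $i$th observation contains $\prod_{j=1}^{p} K_{ij}$ hypercubes in $p$-dimensional space. The density of each hypercube is the product of the densities of the corresponding sub-intervals, and is denoted $p_i^{(u)}$, $u = 1, 2, \ldots, \prod_{j} K_{ij}$. According to formula (11) in Definition 4, calculating the linear combination for all the hypercubes, we obtain the intervals $I_i^{(u)}$, $u = 1, 2, \ldots, \prod_{j} K_{ij}$. Then we take the maximum and the minimum values over all $I_i^{(u)}$, which bound the interval denoted $I_i$. The sub-intervals $I_{ik} = [\underline{y}_{ik}, \overline{y}_{ik})$, $k = 1, 2, \ldots, K$, are obtained by dividing $I_i$ into $K$ parts. Projecting the above $I_i^{(u)}$ onto $I_{ik}$ gives the frequency

$$p_{ik} = \sum_{u} p_i^{(u)}\, \frac{\| I_i^{(u)} \cap I_{ik} \|}{\| I_i^{(u)} \|}, \qquad k = 1, 2, \ldots, K. \qquad (13)$$

According to the above mentioned steps, we can get the linear combination of all variables in the $i$th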
histogram observation.

2.3 The Algorithm

For convenience, suppose the modal interval-valued vectors are all centralized. With the above definitions, we begin to derive the Principal Component Analysis method for modal interval-valued data.

Similar to the numeric case, the $j$th modal interval-valued PC $Y_j$, $j = 1, 2, \ldots, p$, is a linear combination of $X_1, X_2, \ldots, X_p$, i.e.,

$$Y_j = X u_j = \sum_{k=1}^{p} u_{jk} X_k,$$

where $u_j = (u_{j1}, u_{j2}, \ldots, u_{jp})'$, with the constraints $u_j' u_j = 1$ and $u_j' u_l = 0$ ($j, l = 1, 2, \ldots, p$; $j \ne l$). Also, the first $m$ principal components $Y_1, Y_2, \ldots, Y_m$ must maximize the total variance in order to represent the original information carried by $X_1, X_2, \ldots, X_p$. According to the definitions proposed above, we have

$$D(Y_j) = E(Y_j^2) = E\Big[\Big(\sum_{k=1}^{p} u_{jk} X_k\Big)^2\Big] = u_j' \begin{pmatrix} E(X_1^2) & E(X_1 X_2) & \cdots & E(X_1 X_p) \\ E(X_2 X_1) & E(X_2^2) & \cdots & E(X_2 X_p) \\ \vdots & \vdots & \ddots & \vdots \\ E(X_p X_1) & E(X_p X_2) & \cdots & E(X_p^2) \end{pmatrix} u_j = u_j' V u_j, \qquad (14)$$

where $V$ represents the covariance matrix of $X_1, X_2, \ldots, X_p$ (the variables being centralized, the moments in (14) coincide with the covariances in (9) and (10)).

The remaining derivation is the same as in classical PCA for numeric data, i.e., looking for orthonormal vectors $u_1, u_2, \ldots, u_p$ that maximize $D(Y_j)$ with $D(Y_1) \ge D(Y_2) \ge \cdots \ge D(Y_p)$ by solving the equations $V u_j = \lambda_j u_j$. Thus, $u_1, u_2, \ldots, u_p$ are the orthonormal eigenvectors of $V$ corresponding to the eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$. By the algorithm of linear combination of modal-valued variables, see formula (12), we finally get the $j$th modal-valued PC $Y_j = X u_j$.

3 Experimental Results of Synthetic Data Sets

In this section we conduct a comparison between PCA of modal interval-valued data and PCA of numeric data. The histogram dataset is generated by Monte-Carlo simulation, and the numeric dataset corresponding to this histogram dataset is obtained by differentiating (that is, finely subdividing) $I$, the field of definitions of the histogram. The comparison focuses on the eigenvalues and eigenvectors of the covariance matrices of the histogram dataset and the numeric dataset. It will be concluded that the eigenvalues and eigenvectors are similar for the two types of data: the results for the numeric dataset tend to those of the histogram dataset as the number of differentiation in $I$ becomes larger.

3.1 Dataset

3.1.1 Histogram dataset

We implement Monte-Carlo simulation to generate a $50 \times 4$ histogram dataset. For generality, we take the number of modalities of every histogram to be three.
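As a brief aside before the generation steps, the constant first and second moments defined above and the eigen-decomposition of the covariance matrix $V$ can be sketched in code. The following Python sketch is illustrative only and is not the authors' implementation. It assumes that each unit $x_{ij}$ is stored as a pair `(edges, freqs)` of sub-interval boundaries and frequencies, it uses the closed forms that follow from the uniform-within-sub-interval assumption, and it relies on NumPy. The function names `first_moment`, `second_moment`, `covariance_matrix` and `histogram_pca` are hypothetical.

```python
import numpy as np

def _bounds(edges):
    # Split K+1 boundaries into the lower and upper ends of K sub-intervals.
    edges = np.asarray(edges, dtype=float)
    return edges[:-1], edges[1:]

def first_moment(edges, freqs):
    """E(x_ij): integral of x * f(x) for a piecewise-uniform density."""
    lo, hi = _bounds(edges)
    return float(np.sum(np.asarray(freqs, dtype=float) * (lo + hi) / 2.0))

def second_moment(edges, freqs):
    """E(x_ij^2): integral of x^2 * f(x) for a piecewise-uniform density."""
    lo, hi = _bounds(edges)
    p = np.asarray(freqs, dtype=float)
    return float(np.sum(p * (lo**2 + lo * hi + hi**2) / 3.0))

def covariance_matrix(data):
    """data[i][j] = (edges, freqs) of unit x_ij; returns the p x p matrix V."""
    n, p = len(data), len(data[0])
    m1 = np.array([[first_moment(*data[i][j]) for j in range(p)]
                   for i in range(n)])
    m2 = np.array([[second_moment(*data[i][j]) for j in range(p)]
                   for i in range(n)])
    mean = m1.mean(axis=0)                    # E(X_j), average of unit means
    cross = m1.T @ m1 / n                     # E(X_j X_l) under independence
    np.fill_diagonal(cross, m2.mean(axis=0))  # E(X_j^2) on the diagonal
    return cross - np.outer(mean, mean)       # covariances and variances

def histogram_pca(data):
    """Eigenvalues (descending) and matching eigenvectors (columns) of V."""
    V = covariance_matrix(data)
    eigvals, eigvecs = np.linalg.eigh(V)      # V is symmetric
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

if __name__ == "__main__":
    # Hypothetical toy data: 2 units, 2 variables.
    toy = [[([0, 1, 2], [0.5, 0.5]), ([0, 2], [1.0])],
           [([1, 2, 3], [0.5, 0.5]), ([2, 4], [1.0])]]
    print(covariance_matrix(toy))
    print(histogram_pca(toy)[0])
```

Because every moment is a real constant, centralizing the data beforehand is not required in this sketch: subtracting the outer product of the means has the same effect on $V$.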
The histogram $x_{ij} = \{I_{ij}, f_{ij}\}$ can then be generated randomly by the following two steps.

1) Define the generation method of the field of definitions: generate the center C ~ U( 5,5) and the radius R ~ U( ,0) randomly; thus we get the interval of the histogram $I_{ij} = [C - R, C + R]$. Dividing the interval $I_{ij}$ equally into three parts, we obtain $I_{ijk} = [\underline{x}_{ijk}, \overline{x}_{ijk})$, $k = 1, 2, 3$.

2) Define the generation method of the density (frequency) function: generate three values $q_1, q_2, q_3$, where
$q_k \sim U(0, 1)$, $k = 1, 2, 3$, then normalize the values as follows:

$$Q = \sum_{k=1}^{3} q_k, \qquad p_{ijk} = q_k / Q, \quad k = 1, 2, 3.$$

Meanwhile, this satisfies the constraint $\sum_{k=1}^{3} p_{ijk} = 1$, and we get the density function of the $ij$th histogram in the dataset.

3.1.2 Numeric dataset

Once we have the histogram dataset, for each histogram we obtain a numeric dataset by differentiating the field of definitions. That is, for each histogram $x_{ij} = \{I_{ij}, f_{ij}\}$, we choose a positive integer number $m$ of values in the field of definition interval $I_{ij}$ by dividing the $k$th sub-interval equally into $m_k$ ($k = 1, 2, \ldots, K$) parts, so that the numeric dataset approximately abides by the distribution of $x_{ij}$. We call $m$ the number of differentiation. The larger the number of differentiation, the more similar the numeric dataset is to the histogram. For example, for the two-dimensional histogram

(x, y) = ({[,),0.;[,4),0.5;[4,5],0.}, {[5,6),0.;[6,7),0.7;[7,8],0.})

Fig 1 shows the conditions m =0 and m =50 respectively. In the left chart there are 44 points, while there are 60 in the right chart.

Fig 1. The diagram of the numeric dataset under the conditions m =0 and m =50

Therefore, a $p$-dimensional histogram is expanded into a numeric dataset, on which classical multivariate analysis can be performed. The method proposed in this paper is shown to be reasonable if the results obtained by PCA on the numeric dataset tend to those of the histogram dataset as the number of differentiation increases.

3.2 The simulation result of PCA of histogram data

For the $50 \times 4$ histogram dataset generated above, we calculate the eigenvalues and eigenvectors of its covariance matrix by the PCA method; the eigenvalues, sorted in descending order, are: $\lambda_1$ =.56; $\lambda_2$ =0.475; $\lambda_3$ =9.498; $\lambda_4$ =5.095. The corresponding eigenvectors are shown in Table 1.

Table 1. Eigenvectors corresponding to each eigenvalue

    u_1       u_2       u_3       u_4
   -0.65      0.6589   -0.66      0.74
    0.77      0.447    -0.486    -0.776
    0.675     0.9      -0.486     0.5067
    0.659     0.5557    0.7456   -0.06

In the following part, we compare the eigenvalues and eigenvectors of the histogram data and the numeric data. Taking the numbers of differentiation to be 10, 20, 30, 40, 50, 70, 100 and 200, we calculate the eigenvalues $\lambda_j^{(m)}$ and eigenvectors $u_j^{(m)}$, $j = 1, 2, 3, 4$; $m = 10, 20, 30, 40, 50, 70, 100, 200$.
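The expansion of a histogram into a numeric dataset, and the way its moments approach those of the histogram as the number of differentiation grows, can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the authors' procedure: it assumes that sub-interval $k$ receives `round(m * p_k)` points placed at the midpoints of equal parts, which is only one plausible allocation rule. The function names and the example histogram `[2,3), [3,4), [4,5]` with frequencies 0.2, 0.5, 0.3 are hypothetical.

```python
import numpy as np

def expand_histogram(edges, freqs, m):
    """Approximate a histogram by roughly m numeric points.

    Assumption: sub-interval k is divided equally into m_k = round(m * p_k)
    parts and one point is placed at the midpoint of each part.
    """
    points = []
    for lo, hi, p in zip(edges[:-1], edges[1:], freqs):
        m_k = int(round(m * p))
        if m_k == 0:
            continue  # sub-intervals with negligible frequency get no point
        step = (hi - lo) / m_k
        points.append(lo + step * (np.arange(m_k) + 0.5))
    return np.concatenate(points)

def exact_second_moment(edges, freqs):
    """Second moment of the piecewise-uniform histogram density."""
    lo = np.asarray(edges[:-1], dtype=float)
    hi = np.asarray(edges[1:], dtype=float)
    p = np.asarray(freqs, dtype=float)
    return float(np.sum(p * (lo**2 + lo * hi + hi**2) / 3.0))

if __name__ == "__main__":
    edges, freqs = [2.0, 3.0, 4.0, 5.0], [0.2, 0.5, 0.3]
    exact = exact_second_moment(edges, freqs)
    for m in (10, 50, 100):
        pts = expand_histogram(edges, freqs, m)
        # The gap to the exact moment shrinks as m grows.
        print(m, len(pts), abs(np.mean(pts**2) - exact))
```

Under this allocation rule the sample mean already matches the histogram mean whenever `m * p_k` is an integer, so the convergence shows up in the second moment, which is what drives the covariance matrix.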
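For the eigenvalues, we take the Absolute Error to compare the two, i.e.:

$$AE(\lambda_j, \lambda_j^{(m)}) = |\lambda_j - \lambda_j^{(m)}|, \qquad j = 1, 2, 3, 4;\ m = 10, 20, 30, 40, 50, 70, 100, 200.$$

For the eigenvectors, we take the Absolute Cosine Value, which is defined as follows: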
$$ACV(u_j, u_j^{(m)}) = \Big| \sum_{i=1}^{4} u_{ji}\, u_{ji}^{(m)} \Big|, \qquad m = 10, 20, 30, 40, 50, 70, 100, 200;\ j = 1, 2, 3, 4.$$

The horizontal axis in Fig 2 denotes the number of differentiation. In the left chart of Fig 2 the curves show how the AE changes as the number of differentiation increases; we can see that the AE gets smaller as the number of differentiation grows. The vertical axis of the right chart denotes the ACV; the results show that the ACV converges to 1 as the number of differentiation increases, which illustrates that the angle between the two compared vectors converges to 0. The obtained components are more ample when the similarity is high.

Fig 2. The AE of eigenvalues (left panel: AE(la1) to AE(la4)) and the ACV of eigenvectors (right panel: ACV(egv1) to ACV(egv4)), plotted against the number of differentiation

Finally, we contrast the sample distribution of the first principal component of the numeric data and of the histogram data. The first principal component of the histogram data is obtained by the linear combination of histograms as in Section 2.2, while the first principal component of the numeric data is calculated as the empirical distribution frequency. To save computation, we take m =0; the sample size of the numeric dataset is (0 )^4 × 50. The results of the two independent samples Kolmogorov-Smirnov test on the 50 samples show that both distributions are consistent.

4 Conclusion

This paper proposed a new PCA method for modal interval-valued variables based on constant numerical characteristics, which draws on the integral calculation of numerical characteristics of continuous random variables in probability theory. In the process of computation, the method not only adopts the complete information of the histogram, but also makes the characteristic analysis clear, and the result is reasonable and precise. The simulation supports the rationality and effectiveness of the method.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant No. 70806, 7077004, 7000).

References

[1] Rodriguez O., Diday E., Winsberg S.
Generalization of principal components analysis to histogram data, Workshop on Symbolic Data Analysis, 4th Eur. Conf. Principles and Practice of Knowledge Discovery in Databases, Lyon, France, Sept. 2000.
[2] Kallyth S.M., Diday E., Symbolic PCA of Compositional Data, 19th International Conference on Computational Statistics, Paris, France, August 22-27, 2010.
[3] Nagabhushan P., Kumar P., Principal Component Analysis of Histogram Data. Springer-Verlag Berlin Heidelberg, ISNN 2007, Part II, LNCS 4492, 1012-1021, 2007.
[4] Billard L., Diday E., Descriptive Statistics for Interval-valued Observations in the Presence of Rules, Computational Statistics, 21:187-210, 2006.
[5] Moore Ramon E., Baker Kearfott R., Cloud Michael J., Introduction to Interval Analysis. SIAM. ISBN 978-0-898716-69-6, 2009.