Int. Statistical Inst.: Proc. 58th World Statistical Congress, 2011, Dublin (Session CPS053) p.5185

Principal Component Analysis of Modal Interval-valued Data with Constant Numerical Characteristics

WANG Huiwen 1,2, CHEN Meiling 1,2, LI Nan 1,2, WANG Lanhui 3
1. School of Economics and Management, Beihang University, Beijing 100191, China
2. Research Center of Complex Data Analysis, Beihang University, Beijing 100191, China
3. School of Economics and Management, Beijing Forestry University, Beijing 100083, China

Abstract: Modal interval-valued data is one of the most important types of symbolic data; each unit of its matrix contains a histogram or a distribution function. In this paper, a new method for Principal Component Analysis (PCA) of modal interval-valued data is discussed. PCA aims to reduce the dimensions of a large dataset by reconstructing the covariance matrix. The fundamental elements of the covariance matrix, namely the mean, the variance and the covariance, and the way they are defined are therefore important in PCA. Some current research on PCA of modal interval-valued data achieves dimension reduction by transforming the histogram-valued data into interval data. In other existing methods, the mean is defined in distributive data form, as an average level over all modal-valued observations. However, data centralization based on a mean defined this way actually yields residual distributions, and a PCA based on the matrix of residual distributions may fail to represent the essential variation of the original data. In this paper, we define the numerical characteristics of modal interval-valued data as real constants, which makes full use of the information in the histograms. Centralization in terms of constant numerical characteristics relocates the modal-valued variables as a whole, giving histograms of the original shape whose gravity center is settled on the origin. On this basis, the PCA of modal interval-valued data with constant numerical characteristics, based on the resulting covariance matrix, is proposed.
Simulation proves the effectiveness of the proposed method.

Key words: Modal interval-valued data; Principal Component Analysis; Constant Numerical Characteristics

1 Introduction

Symbolic data analysis is one of the most groundbreaking theoretical achievements in modern statistical data analysis. Modal interval-valued data is one type of symbolic data: each unit of a high-dimensional modal interval-valued data matrix contains a histogram or a distribution function. This paper focuses on PCA of modal interval-valued data.

The routine approach to PCA of modal interval-valued data is to transform it into interval data by a certain transformation, as in Rodriguez O., Diday E., Winsberg S. [1] (2000) and Sun Makosso Kallyth, Edwin Diday [2] (2010), so that it can be analyzed by the PCA methods for interval-valued data. A more direct method for histogram PCA was presented by P. Nagabhushan and R. Pradeep Kumar [3] (2007). They defined the unit histogram, the null histogram, and the basic arithmetic operations on histograms (addition, subtraction, multiplication, division) and proposed a histogram PCA. However, the means of histogram variables under that method are themselves in histogram form, so data centralization yields residual histograms. A PCA based on the matrix of residual histograms may thus fail to represent the essential variation of the original data.

In this paper, we attempt to explore a new PCA method for modal interval-valued variables. Based on the integral calculation of numerical characteristics of continuous random variables in probability theory, we first define constant numerical characteristics of modal interval-valued data, and then implement dimension reduction modeling of a modal interval-valued dataset through PCA. The proposed method not only makes use of the complete information in the histogram, but may also give a more reliable conclusion, since data centralization is carried out on the basis of the constant numerical characteristics.
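Furthermore, an approximate method for calculating the linear combination of modal interval-valued variables is given, following the algorithm from the univariate histogram theory of histogram-valued data [4] (2006) combined with Moore's interval algebra [5] (1962) from interval data analysis. The projection of the original modal interval-valued data onto the principal axes can thus be realized.

This paper is structured as follows: Section 2 introduces several basic definitions of the numerical characteristics of modal interval-valued data, together with the derivation and calculation steps of PCA on modal interval-valued data; a simulation is conducted in Section 3 to validate the effectiveness of the proposed method; the last section gives the summary.

2 Methodology

We consider an $n \times p$ data matrix $X = (x_{ij})_{n \times p}$, called modal interval-valued data, whose elements are all random variables that follow a histogram or a distribution function. Here $x_{ij} = \{I_{ij}, f_{ij}\}$ denotes a random variable defined on the field of definition $I_{ij}$ with density function $f_{ij}$, subject to

\[ f_{ij}(x) \ge 0, \qquad \int_{I_{ij}} f_{ij}(x)\,dx = 1. \]

For histogram data, the density function $f_{ij}$ is

\[ f_{ij}(x) = \begin{cases} \dfrac{p_{ijk}}{\overline{x}_{ijk} - \underline{x}_{ijk}}, & x \in I_{ijk},\ k = 1, 2, \dots, K_{ij}, \\ 0, & \text{else}, \end{cases} \qquad i = 1, 2, \dots, n;\ j = 1, 2, \dots, p, \tag{1} \]

where $K_{ij}$ is the number of modalities of $x_{ij}$; $I_{ijk} = [\underline{x}_{ijk}, \overline{x}_{ijk})$ is the $k$th sub-interval of $I_{ij}$, which satisfies $\bigcup_{k=1}^{K_{ij}} I_{ijk} = I_{ij}$; and $p_{ijk}$ is the frequency of $I_{ijk}$, with $p_{ijk} \ge 0$ and $\sum_{k=1}^{K_{ij}} p_{ijk} = 1$. It is assumed that within each sub-interval $I_{ijk}$ the random variable is uniformly distributed across the sub-interval. Hence, for histogram data, $x_{ij}$ can also be denoted as

\[ x_{ij} = \{I_{ij}, f_{ij}\} = \{[\underline{x}_{ijk}, \overline{x}_{ijk}),\ p_{ijk};\ k = 1, 2, \dots, K_{ij}\}, \qquad i = 1, 2, \dots, n;\ j = 1, 2, \dots, p. \]

2.1 The first moment, second moment and second order mixed moment

According to classical probability theory, the first moment, second moment and second order mixed moment of a modal-valued variable can be defined as follows.

Definition 1. For a modal-valued variable $X_j$, the first moment is given by

\[ E(X_j) = \frac{1}{n} \sum_{i=1}^{n} E(x_{ij}), \tag{2} \]

where the first moment of the unit $x_{ij}$ is defined as

\[ E(x_{ij}) = \int_{I_{ij}} x f_{ij}(x)\,dx = \sum_{k=1}^{K_{ij}} \frac{p_{ijk}}{\overline{x}_{ijk} - \underline{x}_{ijk}} \int_{I_{ijk}} x\,dx. \tag{3} \]

Accordingly, the centralization of $x_{ij}$ is given by

\[ x_{ij} - E(X_j) = \{[\underline{x}_{ijk} - E(X_j),\ \overline{x}_{ijk} - E(X_j)),\ p_{ijk};\ k = 1, 2, \dots, K_{ij}\}, \qquad i = 1, 2, \dots, n;\ j = 1, 2, \dots, p. \tag{4} \]

Definition 2. Given any two modal-valued variables $X_j$ and $X_{j'}$, the second order mixed moment is defined as

\[ E(X_j X_{j'}) = \frac{1}{n} \sum_{i=1}^{n} E(x_{ij} x_{ij'}), \tag{5} \]

where

\[ E(x_{ij} x_{ij'}) = \iint x\,y\,f_{ij}(x) f_{ij'}(y)\,dx\,dy = \int x f_{ij}(x)\,dx \int y f_{ij'}(y)\,dy = E(x_{ij})\,E(x_{ij'}). \tag{6} \]

Here $x_{ij}$ and $x_{ij'}$ are supposed to be independent.

Definition 3. For any modal-valued variable $X_j$, the second moment is defined by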

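\[ E(X_j^2) = \frac{1}{n} \sum_{i=1}^{n} E(x_{ij}^2), \tag{7} \]

where

\[ E(x_{ij}^2) = \int_{I_{ij}} x^2 f_{ij}(x)\,dx = \sum_{k=1}^{K_{ij}} \frac{p_{ijk}}{\overline{x}_{ijk} - \underline{x}_{ijk}} \int_{I_{ijk}} x^2\,dx. \tag{8} \]

Therefore the covariance and variance of the modal-valued variables $X_j$, $X_{j'}$ are as follows:

\[ \mathrm{Cov}(X_j, X_{j'}) = E\big[(X_j - E(X_j))(X_{j'} - E(X_{j'}))\big] = E(X_j X_{j'}) - E(X_j)\,E(X_{j'}), \tag{9} \]

\[ D(X_j) = E\big[(X_j - E(X_j))^2\big] = E(X_j^2) - [E(X_j)]^2. \tag{10} \]

2.2 Linear combination algorithm of modal interval-valued variables

The linear combination algorithm for interval-valued variables was introduced by Moore (1962). The definition can be expressed as follows.

Definition 4. Given interval-valued variables $X_1, X_2, \dots, X_p$, all with $n$ observations, and real numbers $a_j$, $j = 1, 2, \dots, p$, each observation can be regarded as a hypercube. Define an interval-valued variable $Y$ as a linear combination of $X_1, X_2, \dots, X_p$, viz.

\[ Y = \sum_{j=1}^{p} a_j X_j = \big([\underline{y}_1, \overline{y}_1], [\underline{y}_2, \overline{y}_2], \dots, [\underline{y}_n, \overline{y}_n]\big)^T, \tag{11} \]

where

\[ \underline{y}_i = \sum_{j=1}^{p} a_j \big(\tau_j \underline{x}_{ij} + (1 - \tau_j)\overline{x}_{ij}\big), \qquad \overline{y}_i = \sum_{j=1}^{p} a_j \big((1 - \tau_j) \underline{x}_{ij} + \tau_j \overline{x}_{ij}\big), \qquad \text{with } \tau_j = \begin{cases} 0, & a_j < 0, \\ 1, & a_j \ge 0. \end{cases} \]

With Definition 4, an algorithm to calculate the linear combination of modal interval-valued variables can be presented. Given modal interval-valued variables $X_1, X_2, \dots, X_p$ and real numbers $a_j$, $j = 1, 2, \dots, p$, a modal interval-valued variable $Y$ can be defined as a linear combination of $X_1, X_2, \dots, X_p$:

\[ Y = \sum_{j=1}^{p} a_j X_j, \tag{12} \]

where $Y$ is a histogram vector, each element of which can be expressed as

\[ y_i = \{I_i, f_i\} = \{[\underline{y}_{ik}, \overline{y}_{ik}),\ p_{ik};\ k = 1, 2, \dots, K_i\}, \qquad K_i = \max\{K_{ij};\ j = 1, 2, \dots, p\},\ i = 1, 2, \dots, n. \]

For the $i$th histogram observation $x_{i1}, x_{i2}, \dots, x_{ip}$, with numbers of modalities $K_{i1}, K_{i2}, \dots, K_{ip}$, each sub-interval has its density $p_{ijk}$, so the $i$th observation contains $\prod_{j=1}^{p} K_{ij}$ hypercubes in $p$-dimensional space. The density of each hypercube is the product of the densities of the corresponding sub-intervals and is denoted $\rho_u$, $u = 1, 2, \dots, \prod_{j=1}^{p} K_{ij}$. Applying formula (11) of Definition 4 to every hypercube, we obtain the intervals $I_u$, $u = 1, 2, \dots, \prod_{j=1}^{p} K_{ij}$. The maximum and the minimum endpoint over all $I_u$ give the interval $I_i$. The sub-intervals $I_{ik} = [\underline{y}_{ik}, \overline{y}_{ik})$, $k = 1, 2, \dots, K_i$, are obtained by dividing $I_i$ into $K_i$ parts. Projecting the intervals $I_u$ onto $I_{ik}$ gives the frequencies

\[ p_{ik} = \sum_{u} \frac{\lVert I_u \cap I_{ik} \rVert}{\lVert I_u \rVert}\, \rho_u, \qquad k = 1, 2, \dots, K_i. \tag{13} \]

Following the above steps, we obtain the linear combination of all variables in the $i$th histogram observation.

2.3 The Algorithm

For convenience, suppose the modal interval-valued vectors are all centralized. With the above definitions, we derive the Principal Component Analysis method for modal interval-valued data.

Similar to the numeric case, the $h$th modal interval-valued principal component (PC) $Y_h$, $h = 1, 2, \dots, p$, is a linear combination of $X_1, X_2, \dots, X_p$, i.e.

\[ Y_h = X u_h = \sum_{j=1}^{p} u_{hj} X_j, \]

where $u_h = (u_{h1}, u_{h2}, \dots, u_{hp})^T$, with the constraints $u_h^T u_h = 1$ and $u_h^T u_l = 0$ for $h, l = 1, 2, \dots, p$, $h \ne l$. Also, the first $m$ principal components $Y_1, Y_2, \dots, Y_m$ must maximize the total variance in order to represent the original information carried by $X_1, X_2, \dots, X_p$. According to the definitions proposed above, we have

\[ D(Y_h) = E(Y_h^2) = E\big[(u_{h1} X_1 + \dots + u_{hp} X_p)^2\big] = (u_{h1}, \dots, u_{hp}) \begin{pmatrix} E(X_1^2) & \cdots & E(X_1 X_p) \\ \vdots & \ddots & \vdots \\ E(X_p X_1) & \cdots & E(X_p^2) \end{pmatrix} \begin{pmatrix} u_{h1} \\ \vdots \\ u_{hp} \end{pmatrix} = u_h^T V u_h, \tag{14} \]

where $V$ represents the covariance matrix of $X_1, X_2, \dots, X_p$.

The remaining derivation is the same as in classical PCA for numeric data, i.e. we look for orthonormal vectors $u_1, u_2, \dots, u_m$ that maximize $D(Y_h)$ by solving the equations $V u_h = \lambda_h u_h$ with $D(Y_1) \ge D(Y_2) \ge \dots \ge D(Y_m)$. Thus $u_1, u_2, \dots, u_m$ are the orthonormal eigenvectors of $V$ corresponding to the eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_m$. By the linear combination algorithm for modal-valued variables, see formula (12), we finally get the $h$th modal-valued PC $Y_h = X u_h$.

3 Experimental Results of Synthetic Data Sets

In this section we compare PCA of modal interval-valued data with PCA of numeric data. The histogram dataset is generated by Monte-Carlo simulation. The numeric dataset corresponding to this histogram dataset is obtained by discretizing $I$, the field of definition of each histogram. The comparison focuses on the eigenvalues and eigenvectors of the covariance matrices of the histogram dataset and of the numeric dataset. It will be concluded that the eigenvalues and eigenvectors are similar for the two types of data: the results for the numeric dataset tend to those for the histogram dataset as the number of differentiation in $I$ becomes larger.

3.1 Dataset

3.1.1 Histogram dataset

We use Monte-Carlo simulation to generate a 50 × 4 histogram dataset. For generality, we take the number of modalities to be three for every histogram.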
Therefore, the histogram $\{I, f\}$ can be generated randomly in the following two steps.

1) Define the generation of the field of definition: generate the center $C \sim U([\ldots]5, [\ldots]5)$ and the radius $R \sim U([\ldots], [\ldots]0)$ randomly, which gives the histogram interval $I = [C - R, C + R]$. Dividing $I$ equally into three parts, we obtain $I_k = [\underline{x}_k, \overline{x}_k)$, $k = 1, 2, 3$.

2) Define the generation of the density (frequency) function: generate three numbers $q_1, q_2, q_3$, where $q_k \sim U(0, 1)$, $k = 1, 2, 3$, then normalize them as $p_k = q_k / Q$, $k = 1, 2, 3$, with $Q = \sum_{k=1}^{3} q_k$. This satisfies the constraint $\sum_{k=1}^{3} p_k = 1$, and we obtain the density function of each histogram in the dataset.

3.1.2 Numeric dataset

Once the histogram dataset is available, we derive a numeric dataset from each histogram by discretizing its field of definition. That is, for each histogram $\{I, f\}$ we choose a positive integer $m$ (unrelated to the number of retained components in Section 2.3) and place points in the interval $I$ by dividing the $k$th sub-interval equally into $m_k$ ($k = 1, 2, \dots, K$) parts, so that the numeric dataset approximately follows the distribution of the histogram. We call $m$ the number of differentiation. The larger the number of differentiation, the more closely the numeric dataset resembles the histogram. For example, for the two-dimensional histogram

(y_1, y_2) = ({[…, …), 0.[…]; […, 4), 0.5; [4, 5], 0.[…]}, {[5, 6), 0.[…]; [6, 7), 0.7; [7, 8], 0.[…]})

Fig. 1 shows the cases $m = 20$ and $m = 50$. The left chart contains 441 points and the right chart contains 2601.

Fig. 1 The numeric dataset for m = 20 and m = 50

In this way a $p$-dimensional histogram is expanded into a numeric dataset, on which classical multivariate analysis can be performed. If the results obtained by PCA on the numeric dataset tend to those of the histogram dataset as the number of differentiation increases, the method proposed in this paper is shown to be reasonable.

3.2 The simulation result of PCA of histogram data

For the 50 × 4 histogram dataset generated above, we calculate the eigenvalues and eigenvectors of its covariance matrix by the proposed PCA method. The eigenvalues, sorted in descending order, are $\lambda_1 = [\ldots].56$; $\lambda_2 = [\ldots]0.475$; $\lambda_3 = 9.498$; $\lambda_4 = 5.095$. The corresponding eigenvectors are shown in Table 1.

Table 1. Eigenvectors corresponding to each eigenvalue
u_1        u_2        u_3        u_4
-0.65      0.6589     -0.66      0.74
0.77       0.447      -0.486     -0.776
0.675      0.9        -0.486     0.5067
0.659      0.5557     0.7456     -0.06

In the following part, we compare the eigenvalues and eigenvectors of the histogram data and of the numeric data. Taking the numbers of differentiation 10, 20, 30, 40, 50, 70, 100 and 200, we calculate the eigenvalues and eigenvectors $\lambda_h^{(m)}$, $u_h^{(m)}$, $h = 1, 2, 3, 4$; $m = 10, 20, 30, 40, 50, 70, 100, 200$.
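The generation steps of Section 3.1.1 and the constant moments of Section 2.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the bounds used for the center and the radius are placeholders (the source values are not legible), the function names are hypothetical, the closed forms for the moments follow from integrating a piecewise-uniform density, and a simple power iteration stands in for a full eigendecomposition, so only the leading eigenpair is computed.

```python
import random

def make_histogram(rng, K=3):
    """One histogram unit: K equal-width bins on [C - R, C + R] with random frequencies."""
    center = rng.uniform(-5.0, 5.0)   # assumed bounds, not taken from the paper
    radius = rng.uniform(1.0, 10.0)   # assumed bounds, not taken from the paper
    width = 2.0 * radius / K
    edges = [center - radius + k * width for k in range(K + 1)]
    q = [rng.uniform(0.0, 1.0) for _ in range(K)]
    total = sum(q)
    return edges, [qk / total for qk in q]   # frequencies sum to 1

def unit_moments(edges, freq):
    """First and second moments of a piecewise-uniform density on the given bins."""
    bins = list(zip(freq, edges, edges[1:]))
    m1 = sum(p * (lo + hi) / 2.0 for p, lo, hi in bins)
    m2 = sum(p * (lo * lo + lo * hi + hi * hi) / 3.0 for p, lo, hi in bins)
    return m1, m2

def covariance_matrix(data):
    """Covariance matrix V of histogram variables; data[i][j] = (edges, freq)."""
    n, p = len(data), len(data[0])
    mom = [[unit_moments(*data[i][j]) for j in range(p)] for i in range(n)]
    mean = [sum(mom[i][j][0] for i in range(n)) / n for j in range(p)]
    V = [[0.0] * p for _ in range(p)]
    for j in range(p):
        for j2 in range(p):
            if j == j2:
                # second moment of X_j: average of the unit second moments
                second = sum(mom[i][j][1] for i in range(n)) / n
            else:
                # mixed moment under independence: product of unit first moments
                second = sum(mom[i][j][0] * mom[i][j2][0] for i in range(n)) / n
            V[j][j2] = second - mean[j] * mean[j2]
    return V

def leading_eigenpair(V, iters=1000):
    """Power iteration for the largest eigenvalue of a symmetric positive semi-definite matrix."""
    p = len(V)
    u = [1.0 / p ** 0.5] * p
    lam = 0.0
    for _ in range(iters):
        w = [sum(V[j][j2] * u[j2] for j2 in range(p)) for j in range(p)]
        lam = sum(x * x for x in w) ** 0.5
        u = [x / lam for x in w]
    return lam, u

rng = random.Random(0)
data = [[make_histogram(rng) for _ in range(4)] for _ in range(50)]  # 50 x 4 dataset
V = covariance_matrix(data)
lam1, u1 = leading_eigenpair(V)
```

Because every moment is a real constant, the covariance matrix is an ordinary symmetric matrix and any standard eigen-solver could replace the power iteration used here.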
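For the eigenvalues, we use the Absolute Error to compare the two, i.e.

\[ AE(\lambda_h, \lambda_h^{(m)}) = \lvert \lambda_h - \lambda_h^{(m)} \rvert, \qquad h = 1, 2, 3, 4;\ m = 10, 20, 30, 40, 50, 70, 100, 200. \]

For the eigenvectors, we use the Absolute Cosine Value, which is defined as follows: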

\[ ACV(u_h, u_h^{(m)}) = \frac{\lvert u_h^T u_h^{(m)} \rvert}{\lVert u_h \rVert\, \lVert u_h^{(m)} \rVert}, \qquad m = 10, 20, 30, 40, 50, 70, 100, 200;\ h = 1, 2, 3, 4. \]

The horizontal axis in Fig. 2 denotes the number of differentiation. In the left chart of Fig. 2 the curves show how the AE changes as the number of differentiation increases: the AE gets smaller as the number of differentiation grows. The vertical axis of the right chart denotes the ACV; the results show that the ACV converges to 1 as the number of differentiation increases, which illustrates that the angle between the two compared vectors converges to 0. The higher this similarity, the more closely the components obtained from the two datasets agree.

Fig. 2 The AE of eigenvalues (left panel, curves AE(λ_1) to AE(λ_4)) and the ACV of eigenvectors (right panel, curves ACV(u_1) to ACV(u_4)); horizontal axis: number of differentiation

Finally, we compare the sample distributions of the first principal component of the numeric data and of the histogram data. The first principal component of the histogram data is obtained by the linear combination of histograms described in Section 2.2, while the first principal component of the numeric data is calculated as an empirical frequency distribution. To save computation, we take $m = [\ldots]0$, so the sample size of the numeric dataset is $([\ldots])^4 \times 50$. The results of the two independent samples Kolmogorov-Smirnov test on the 50 samples show that both distributions are consistent.

4 Conclusion

This paper proposed a new PCA method for modal interval-valued variables based on constant numerical characteristics, drawing on the integral calculation of numerical characteristics of continuous random variables in probability theory. In the process of computation, the method not only uses the complete information of the histogram, but also makes the characteristic analysis clear, and the result is reasonable and precise. The simulation supports the rationality and effectiveness of the method.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant No. 70806, 7077004, 7000).

References

[1] Rodriguez O., Diday E., Winsberg S., Generalization of principal components analysis to histogram data, Workshop of Symbolic Data Analysis, 4th Eur. Conf. Principles and Practice of Knowledge Discovery in Databases, Lyon, France, Sept. 2000.
[2] Kallyth S.M., Diday E., Symbolic PCA of Compositional Data, 19th International Conference on Computational Statistics, Paris, France, August 22-27, 2010.
[3] Nagabhushan P., Kumar P., Principal Component Analysis of Histogram Data. Springer-Verlag Berlin Heidelberg, Eds: ISNN, Part II, LNCS 4492, 1012-1021, 2007.
[4] Billard L., Diday E., Descriptive Statistics for Interval-valued Observations in the Presence of Rules, Computational Statistics, 21:187-210, 2006.
[5] Moore Ramon E., Baker Kearfott R., Cloud Michael J., Introduction to Interval Analysis, 223 pages. SIAM. ISBN 978-0-898716-69-6, 2009.
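As a supplementary illustration of the linear combination procedure of Section 2.2, the following Python sketch first applies Moore-style interval arithmetic to a single hypercube (Definition 4) and then redistributes the hypercube masses over $K$ equal sub-intervals of the resulting range. It is a sketch under assumptions, not the authors' code: the function names are hypothetical, each hypercube's mass is assumed to be spread uniformly over its image interval, and the example frequencies are illustrative values chosen here, not the ones used in the paper.

```python
from itertools import product

def interval_linear_combination(intervals, coeffs):
    """Bounds of sum_j a_j * [lo_j, hi_j] using interval arithmetic."""
    lower = upper = 0.0
    for (lo, hi), a in zip(intervals, coeffs):
        if a >= 0:
            lower += a * lo
            upper += a * hi
        else:
            # a negative coefficient swaps the roles of the two endpoints
            lower += a * hi
            upper += a * lo
    return lower, upper

def histogram_linear_combination(hists, coeffs, K):
    """Combine one observation's histograms; hists is a list of (edges, freq) pairs.

    Assumes at least one nonzero coefficient, so no image interval is degenerate.
    """
    cubes = []
    # enumerate every hypercube: one bin index per variable
    for combo in product(*(range(len(freq)) for _, freq in hists)):
        sides = [(edges[k], edges[k + 1]) for (edges, _), k in zip(hists, combo)]
        mass = 1.0
        for (_, freq), k in zip(hists, combo):
            mass *= freq[k]   # product of the bin frequencies
        cubes.append((interval_linear_combination(sides, coeffs), mass))
    lo = min(iv[0] for iv, _ in cubes)
    hi = max(iv[1] for iv, _ in cubes)
    width = (hi - lo) / K
    new_edges = [lo + k * width for k in range(K + 1)]
    new_freq = [0.0] * K
    for (a, b), mass in cubes:
        for k in range(K):
            overlap = min(b, new_edges[k + 1]) - max(a, new_edges[k])
            if overlap > 0:
                # share of the hypercube's mass falling into sub-interval k
                new_freq[k] += mass * overlap / (b - a)
    return new_edges, new_freq

# Illustrative two-variable observation (frequencies are made up for this example).
hists = [([1.0, 2.0, 4.0, 5.0], [0.2, 0.5, 0.3]),
         ([5.0, 6.0, 7.0, 8.0], [0.1, 0.7, 0.2])]
coeffs = [0.5, -1.0]
edges, freq = histogram_linear_combination(hists, coeffs, K=3)
```

The number of hypercubes grows as the product of the numbers of modalities, so this direct enumeration is only practical for a small number of variables and bins.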