Descriptive Statistics for Symbolic Data

1 Outline Descriptive Statistics for Symbolic Data Paula Brito Fac. Economia & LIAAD-INESC TEC, Universidade do Porto ECI Buenos Aires T3: Symbolic Data Analysis: Taking Variability in Data into Account

2 Outline Outline 1 A new framework 2 3

4 Descriptive Statistics for Symbolic Variables
No unique and straightforward definitions! What is the variance of a set of interval observations? How do we measure correlation?
Proposed measures are based on interval parameters, on distributional assumptions, or on distances.

6 Biplots for Interval variables

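7 Descriptive Statistics for Interval Variables
First option: using the dispersion of the interval centers. For observations $[l_{ij}, u_{ij}]$, $i = 1, \dots, n$, of variable $Y_j$, the mean value and the dispersion of all interval midpoints are given by
$$\bar{Y}_j = \frac{1}{n} \sum_{i=1}^{n} \frac{l_{ij} + u_{ij}}{2}, \qquad S^2_{Y_j} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{l_{ij} + u_{ij}}{2} - \bar{Y}_j \right)^2$$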

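8 Descriptive Statistics for Interval Variables
Second option: using the dispersion of the interval boundaries. The mean value is still the mean of the interval midpoints, while the dispersion averages the squared deviations of both bounds from that mean:
$$\bar{Y}_j = \frac{1}{n} \sum_{i=1}^{n} \frac{l_{ij} + u_{ij}}{2}, \qquad S^2_{Y_j} = \frac{1}{n} \sum_{i=1}^{n} \frac{(l_{ij} - \bar{Y}_j)^2 + (u_{ij} - \bar{Y}_j)^2}{2}$$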
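To see how the two options differ in practice, here is a minimal Python sketch. The function names and the sample intervals are illustrative assumptions, not part of the slides; each observation is a `(lower, upper)` tuple.

```python
def interval_mean(intervals):
    """Mean of the interval midpoints (shared by both options)."""
    return sum((l + u) / 2 for l, u in intervals) / len(intervals)


def variance_midpoints(intervals):
    """Option 1: dispersion of the midpoints around their mean."""
    m = interval_mean(intervals)
    return sum(((l + u) / 2 - m) ** 2 for l, u in intervals) / len(intervals)


def variance_boundaries(intervals):
    """Option 2: average squared deviation of both bounds from the mean."""
    m = interval_mean(intervals)
    return sum(((l - m) ** 2 + (u - m) ** 2) / 2 for l, u in intervals) / len(intervals)


# Hypothetical data: three observations of one interval variable.
data = [(1.0, 3.0), (2.0, 6.0), (4.0, 8.0)]
print(interval_mean(data))        # mean of midpoints 2, 4, 6
print(variance_midpoints(data))   # ignores the width of each interval
print(variance_boundaries(data))  # also grows with the interval widths
```

Option 1 discards the internal variability of each interval, whereas option 2 adds the mean squared half-range to the midpoint variance, so it is never smaller.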

9 Descriptive Statistics for Interval Variables
Under the assumption that the observed $Y_j(s_i)$ and $Y_{j'}(s_i)$ values, $i = 1, \dots, n$, are uniformly distributed across each interval $I_{ik} = [l_{ik}, u_{ik}]$, $k = j, j'$, we have $E(Y_{ik}) = (l_{ik} + u_{ik})/2 = c_{ik}$ and $\mathrm{Var}(Y_{ik}) = (u_{ik} - l_{ik})^2/12$.
Symbolic sample mean:
$$\bar{Y}_k = \frac{1}{2n} \sum_{i=1}^{n} (l_{ik} + u_{ik}) = \frac{1}{n} \sum_{i=1}^{n} c_{ik}$$
Symbolic sample variance:
$$S^2_{Y_k} = \frac{1}{3n} \sum_{i=1}^{n} \left[ (l_{ik} - \bar{Y}_k)^2 + (l_{ik} - \bar{Y}_k)(u_{ik} - \bar{Y}_k) + (u_{ik} - \bar{Y}_k)^2 \right] = \frac{1}{3n} \sum_{i=1}^{n} \left( l_{ik}^2 + l_{ik} u_{ik} + u_{ik}^2 \right) - \bar{Y}_k^2$$
Bertrand and Goupil (2000), obtained from the empirical density function for an interval variable.
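The two expressions for the symbolic sample variance can be checked numerically. The sketch below is illustrative only: the function names and the data are assumptions, and intervals are plain `(lower, upper)` tuples.

```python
def symbolic_mean(intervals):
    """Symbolic sample mean: sum of (l + u) divided by 2n."""
    return sum(l + u for l, u in intervals) / (2 * len(intervals))


def symbolic_variance(intervals):
    """Uncentered form: (1/3n) * sum(l^2 + l*u + u^2) - mean^2."""
    n = len(intervals)
    m = symbolic_mean(intervals)
    return sum(l * l + l * u + u * u for l, u in intervals) / (3 * n) - m * m


def symbolic_variance_centered(intervals):
    """Centered form: (1/3n) * sum of Q_i, with bounds centered at the mean."""
    n = len(intervals)
    m = symbolic_mean(intervals)
    return sum((l - m) ** 2 + (l - m) * (u - m) + (u - m) ** 2
               for l, u in intervals) / (3 * n)


# Hypothetical data: three observations of one interval variable.
data = [(1.0, 3.0), (2.0, 6.0), (4.0, 8.0)]
print(symbolic_mean(data))
print(symbolic_variance(data))
print(symbolic_variance_centered(data))  # agrees with the uncentered form
```

On this data the result also equals the variance of the midpoints plus the average within-interval variance $(u - l)^2/12$, which is what the uniformity assumption implies.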

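10 Descriptive Statistics for Interval Variables
For the symbolic covariance, three definitions were proposed:
$$\mathrm{Cov}_1(Y_j, Y_{j'}) = \frac{1}{4n} \sum_{i=1}^{n} (l_{ij} + u_{ij})(l_{ij'} + u_{ij'}) - \bar{Y}_j \bar{Y}_{j'}$$
Billard & Diday (2003), obtained from the empirical joint density function.
$$\mathrm{Cov}_2(Y_j, Y_{j'}) = \frac{1}{3n} \sum_{i=1}^{n} G_j G_{j'} \left[ Q_j Q_{j'} \right]^{1/2}$$
with, for each observation $i$ and $k = j, j'$,
$$Q_k = (l_{ik} - \bar{Y}_k)^2 + (l_{ik} - \bar{Y}_k)(u_{ik} - \bar{Y}_k) + (u_{ik} - \bar{Y}_k)^2, \qquad G_k = \begin{cases} -1 & \text{if } c_{ik} \le \bar{Y}_k \\ \phantom{-}1 & \text{if } c_{ik} > \bar{Y}_k \end{cases}$$
Billard & Diday (2006), incorporating more accurately both between- and within-interval variations into the overall covariance.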
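A small Python sketch of the first two covariance definitions follows. It is a hedged illustration: the function names, the helper functions `_q` and `_g`, and the data are assumptions introduced here, and $Q$ and $G$ are evaluated per observation.

```python
import math


def sym_mean(intervals):
    """Symbolic sample mean under the uniformity assumption."""
    return sum(l + u for l, u in intervals) / (2 * len(intervals))


def cov1(x, y):
    """Cov_1: (1/4n) * sum((l+u)(l'+u')) minus the product of the means."""
    n = len(x)
    s = sum((lx + ux) * (ly + uy) for (lx, ux), (ly, uy) in zip(x, y))
    return s / (4 * n) - sym_mean(x) * sym_mean(y)


def _q(l, u, m):
    """Q term for one interval, with bounds centered at the mean m."""
    return (l - m) ** 2 + (l - m) * (u - m) + (u - m) ** 2


def _g(l, u, m):
    """Sign term: -1 if the midpoint is at or below the mean, else +1."""
    return -1 if (l + u) / 2 <= m else 1


def cov2(x, y):
    """Cov_2: (1/3n) * sum(G_j * G_j' * sqrt(Q_j * Q_j'))."""
    n = len(x)
    mx, my = sym_mean(x), sym_mean(y)
    total = 0.0
    for (lx, ux), (ly, uy) in zip(x, y):
        sign = _g(lx, ux, mx) * _g(ly, uy, my)
        total += sign * math.sqrt(_q(lx, ux, mx) * _q(ly, uy, my))
    return total / (3 * n)


# Hypothetical data: two interval variables observed on three units.
x = [(1.0, 3.0), (2.0, 6.0), (4.0, 8.0)]
y = [(0.0, 2.0), (1.0, 3.0), (5.0, 9.0)]
print(cov1(x, y))
print(cov2(x, y))
print(cov2(x, x))  # reduces to the symbolic sample variance of x
```

Note that `cov1` depends only on the midpoints, while `cov2` applied to a variable with itself returns the symbolic sample variance, because the sign terms square to one.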

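11 Descriptive Statistics for Interval Variables
$$\mathrm{Cov}_3(Y_j, Y_{j'}) = \underbrace{\frac{1}{n} \sum_{i=1}^{n} \frac{(u_{ij} - l_{ij})(u_{ij'} - l_{ij'})}{12}}_{\text{WithinSP}} + \underbrace{\frac{1}{n} \sum_{i=1}^{n} \left( \frac{l_{ij} + u_{ij}}{2} - \bar{Y}_j \right) \left( \frac{l_{ij'} + u_{ij'}}{2} - \bar{Y}_{j'} \right)}_{\text{BetweenSP}}$$
$$= \frac{1}{6n} \sum_{i=1}^{n} \left[ 2 (l_{ij} - \bar{Y}_j)(l_{ij'} - \bar{Y}_{j'}) + (l_{ij} - \bar{Y}_j)(u_{ij'} - \bar{Y}_{j'}) + (u_{ij} - \bar{Y}_j)(l_{ij'} - \bar{Y}_{j'}) + 2 (u_{ij} - \bar{Y}_j)(u_{ij'} - \bar{Y}_{j'}) \right]$$
Billard (2008), considering a decomposition into Within observations Sum of Products (WithinSP) and Between observations Sum of Products (BetweenSP).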
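The decomposition can be verified numerically. In the sketch below, the function names and the data are illustrative assumptions; it computes WithinSP and BetweenSP separately and compares their sum with the single-sum form.

```python
def sym_mean(intervals):
    """Symbolic sample mean under the uniformity assumption."""
    return sum(l + u for l, u in intervals) / (2 * len(intervals))


def within_sp(x, y):
    """Within-observation term: (1/n) * sum((u - l)(u' - l') / 12)."""
    return sum((ux - lx) * (uy - ly) / 12
               for (lx, ux), (ly, uy) in zip(x, y)) / len(x)


def between_sp(x, y):
    """Between-observation term: covariance of the interval midpoints."""
    mx, my = sym_mean(x), sym_mean(y)
    return sum(((lx + ux) / 2 - mx) * ((ly + uy) / 2 - my)
               for (lx, ux), (ly, uy) in zip(x, y)) / len(x)


def cov3(x, y):
    """Cov_3 in its single-sum form, with bounds centered at the means."""
    mx, my = sym_mean(x), sym_mean(y)
    total = 0.0
    for (lx, ux), (ly, uy) in zip(x, y):
        a, b = lx - mx, ux - mx    # centered bounds of the first variable
        a2, b2 = ly - my, uy - my  # centered bounds of the second variable
        total += 2 * a * a2 + a * b2 + b * a2 + 2 * b * b2
    return total / (6 * len(x))


# Hypothetical data: two interval variables observed on three units.
x = [(1.0, 3.0), (2.0, 6.0), (4.0, 8.0)]
y = [(0.0, 2.0), (1.0, 3.0), (5.0, 9.0)]
print(within_sp(x, y), between_sp(x, y))
print(cov3(x, y))  # equals within_sp + between_sp
print(cov3(x, x))  # equals the symbolic sample variance of x
```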

12 Distance measures
Many measures have been proposed in the literature. For two intervals $I_i = [l_i, u_i]$ and $I_j = [l_j, u_j]$:
Hausdorff distance: $d_H(I_i, I_j) = \max \{ |l_i - l_j|, |u_i - u_j| \}$
Euclidean distance: $d_2(I_i, I_j) = \sqrt{(l_i - l_j)^2 + (u_i - u_j)^2}$
City-Block distance: $d_1(I_i, I_j) = |l_i - l_j| + |u_i - u_j|$
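The three distances are straightforward to compute. The following sketch uses illustrative function names and sample intervals, which are assumptions and not part of the slides.

```python
import math


def hausdorff(a, b):
    """Hausdorff distance between intervals a = (l, u) and b = (l, u)."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))


def euclidean(a, b):
    """Euclidean distance between the bound vectors of two intervals."""
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)


def city_block(a, b):
    """City-Block (L1) distance between the bound vectors."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


# Hypothetical intervals.
i1, i2 = (1.0, 4.0), (3.0, 5.0)
print(hausdorff(i1, i2), euclidean(i1, i2), city_block(i1, i2))
```

For any pair of intervals, the Hausdorff value is at most the Euclidean value, which in turn is at most the City-Block value.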

14 Biplots for Histogram variables

15 Descriptive Statistics for Histogram Variables
Assuming a Uniform distribution within each sub-interval of $Y_k(s_i)$, $i = 1, \dots, n$, $I_{ikl} = [l_{ikl}, u_{ikl}]$ with weight $p_{ikl}$, $l = 1, \dots, K_j$, $k = j, j'$, we have
Symbolic sample mean:
$$\bar{Y}_k = \frac{1}{2n} \sum_{i=1}^{n} \sum_{l=1}^{K_j} (l_{ikl} + u_{ikl})\, p_{ikl}$$
Symbolic sample variance:
$$S^2_{Y_k} = \frac{1}{3n} \sum_{i=1}^{n} \sum_{l=1}^{K_j} \left( l_{ikl}^2 + l_{ikl} u_{ikl} + u_{ikl}^2 \right) p_{ikl} - \bar{Y}_k^2$$
Billard and Diday (2003).
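As a hedged illustration of these histogram formulas, the sketch below represents each observation as a list of `(lower, upper, weight)` triples whose weights sum to one. This representation, the function names, and the data are assumptions made for the example.

```python
def hist_mean(observations):
    """Symbolic sample mean: (1/2n) * sum over bins of (l + u) * p."""
    n = len(observations)
    return sum((l + u) * p for bins in observations for l, u, p in bins) / (2 * n)


def hist_variance(observations):
    """Symbolic sample variance: (1/3n) * sum((l^2 + l*u + u^2) * p) - mean^2."""
    n = len(observations)
    m = hist_mean(observations)
    s = sum((l * l + l * u + u * u) * p for bins in observations for l, u, p in bins)
    return s / (3 * n) - m * m


# Hypothetical data: two histogram-valued observations with two bins each.
obs = [
    [(0.0, 2.0, 0.5), (2.0, 4.0, 0.5)],
    [(1.0, 3.0, 0.25), (3.0, 5.0, 0.75)],
]
print(hist_mean(obs))
print(hist_variance(obs))

# An interval is a histogram with a single bin of weight 1,
# so the interval formulas are recovered as a special case.
as_intervals = [[(1.0, 3.0, 1.0)], [(2.0, 6.0, 1.0)], [(4.0, 8.0, 1.0)]]
print(hist_mean(as_intervals), hist_variance(as_intervals))
```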

16 Descriptive Statistics for Histogram Variables
And for the symbolic covariance, three definitions:
$$\mathrm{Cov}_1(Y_j, Y_{j'}) = \frac{1}{4n} \sum_{i=1}^{n} \sum_{l=1}^{K_j} p_{ijl}\, p_{ij'l}\, (l_{ijl} + u_{ijl})(l_{ij'l} + u_{ij'l}) - \bar{Y}_j \bar{Y}_{j'}$$
Billard & Diday (2003), obtained from the empirical joint density function.

17 Correlation Between Symbolic Variables
As for classical variables, the correlation coefficient is defined as
$$r_{Y_j Y_{j'}} = \frac{\mathrm{Cov}(Y_j, Y_{j'})}{S_{Y_j} S_{Y_{j'}}}$$
where $\mathrm{Cov}(Y_j, Y_{j'})$ is the covariance between $Y_j$ and $Y_{j'}$, and $S_{Y_j}$, $S_{Y_{j'}}$ are the symbolic standard deviations of the variables $Y_j$ and $Y_{j'}$, respectively.
In the particular case of interval variables, the descriptive statistics depend on the distribution assumed within each interval. Results have already been obtained for other distributions, e.g., the triangular distribution.
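A minimal sketch of the correlation coefficient for interval variables is given below. The slide leaves the choice of covariance open; this example assumes the within/between covariance (the third definition for interval variables) together with the symbolic variance under uniformity. Function names and data are illustrative.

```python
import math


def sym_mean(intervals):
    """Symbolic sample mean under the uniformity assumption."""
    return sum(l + u for l, u in intervals) / (2 * len(intervals))


def sym_cov(x, y):
    """Assumed covariance: sum of within and between sums of products."""
    mx, my = sym_mean(x), sym_mean(y)
    total = 0.0
    for (lx, ux), (ly, uy) in zip(x, y):
        a, b = lx - mx, ux - mx    # centered bounds of the first variable
        a2, b2 = ly - my, uy - my  # centered bounds of the second variable
        total += 2 * a * a2 + a * b2 + b * a2 + 2 * b * b2
    return total / (6 * len(x))


def sym_corr(x, y):
    """Correlation: covariance divided by the two symbolic standard deviations."""
    # With this covariance, sym_cov(v, v) is the symbolic variance of v.
    return sym_cov(x, y) / math.sqrt(sym_cov(x, x) * sym_cov(y, y))


# Hypothetical data: two interval variables observed on three units.
x = [(1.0, 3.0), (2.0, 6.0), (4.0, 8.0)]
y = [(0.0, 2.0), (1.0, 3.0), (5.0, 9.0)]
print(sym_corr(x, y))
print(sym_corr(x, x))  # a variable is perfectly correlated with itself
```

Pairing this covariance with the matching variance keeps the coefficient within $[-1, 1]$; other combinations of the definitions above would give different values.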

Discriminant Analysis for Interval Data

Outline Discriminant Analysis for Interval Data Paula Brito Fac. Economia & LIAAD-INESC TEC, Universidade do Porto ECI 2015 - Buenos Aires T3: Symbolic Data Analysis: Taking Variability in Data into Account