Dependencies in Interval-valued Symbolic Data. Lynne Billard, University of Georgia


1 Dependencies in Interval-valued Symbolic Data. Lynne Billard, University of Georgia. Tribute to Professor Edwin Diday: Paris, France; 5 September 2007

2 Naturally occurring Symbolic Data -- Mushrooms

3 Patient Records -- Single Hospital, Cardiology

Patient     Hospital    Age   Smoker
Patient 1   Fontaines   74    heavy
Patient 2   Fontaines   78    light
Patient 3   Beaune      69    no
Patient 4   Beaune      73    heavy
Patient 5   Beaune      80    light
Patient 6   Fontaines   70    heavy
Patient 7   Fontaines   82    heavy
...

4 Patient Records by Hospital -- aggregate over patients. Result: Symbolic Data

Patient     Hospital    Age   Smoker
Patient 1   Fontaines   74    heavy
Patient 2   Fontaines   78    light
Patient 3   Beaune      69    no
Patient 4   Beaune      73    heavy
Patient 5   Beaune      80    light
Patient 6   Fontaines   70    heavy
Patient 7   Fontaines   82    heavy
...

Hospital    Age        Smoker
Fontaines   [70, 82]   {light 1/4, heavy 3/4}
Beaune      [69, 80]   {no, light, heavy}
...
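The aggregation step above can be sketched in code; a minimal illustration (not from the talk), building one interval-valued and one modal-valued description per hospital:

```python
# Aggregate classical patient records into symbolic descriptions:
# interval-valued Age = [min, max]; modal-valued Smoker = relative frequencies.
from collections import defaultdict

records = [
    ("Fontaines", 74, "heavy"), ("Fontaines", 78, "light"),
    ("Beaune", 69, "no"), ("Beaune", 73, "heavy"), ("Beaune", 80, "light"),
    ("Fontaines", 70, "heavy"), ("Fontaines", 82, "heavy"),
]

by_hospital = defaultdict(list)
for hospital, age, smoker in records:
    by_hospital[hospital].append((age, smoker))

symbolic = {}
for hospital, rows in by_hospital.items():
    ages = [age for age, _ in rows]
    smokers = [s for _, s in rows]
    interval = (min(ages), max(ages))                       # interval-valued Age
    freqs = {s: smokers.count(s) / len(smokers) for s in set(smokers)}
    symbolic[hospital] = (interval, freqs)

print(symbolic["Fontaines"])
print(symbolic["Beaune"])
```

This reproduces the table: Fontaines gets Age [70, 82] with {light 1/4, heavy 3/4}, Beaune gets Age [69, 80].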

5 Histogram-valued Data -- Weight by Age Distribution:

6 Logical dependency rule

E.g., Y1 = age, Y2 = # children.
Classical: Y_a = (10, 0), Y_b = (20, 2), Y_c = (18, 1).
Aggregation -- Symbolic: ξ = ([10, 20], {0, 1, 2}).
I.e., ξ implies the classical point Y_d = (10, 2) is possible.
Need rule ν: {If Y1 < 15, then Y2 = 0}.
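A small sketch (illustrative, not from the talk) of how rule ν trims the virtual description space implied by ξ:

```python
# Rule nu: if age < 15 then the number of children must be 0.
def satisfies_rule(age, children):
    return children == 0 if age < 15 else True

# Without nu, the classical point (10, 2) lies inside xi = ([10,20], {0,1,2});
# nu excludes it, while the genuine observations remain admissible.
assert not satisfies_rule(10, 2)   # Y_d, excluded by nu
assert satisfies_rule(10, 0)       # Y_a
assert satisfies_rule(20, 2)       # Y_b
assert satisfies_rule(18, 1)       # Y_c
```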

7 Interval-valued data -- # At-Bats (Y1) and # Hits (Y2) by team

Team  Y1 # At-Bats  Y2 # Hits     Team  Y1 # At-Bats  Y2 # Hits
 1    (289, 538)    (75, 162)     11    (212, 492)    (57, 151)
 2    ( 88, 422)    (49, 149)     12    (177, 245)    (189, 238)
 3    (189, 223)    (201, 254)    13    (342, 614)    (121, 206)
 4    (184, 476)    (46, 148)     14    (120, 439)    (35, 102)
 5    (283, 447)    (86, 115)     15    ( 80, 468)    (55, 115)
 6    ( 24, 26)     (133, 141)    16    ( 75, 110)    (75, 110)
 7    (168, 445)    (37, 135)     17    (116, 557)    (95, 163)
 8    (123, 148)    (137, 148)    18    (197, 507)    (52, 53)
 9    (256, 510)    (78, 124)     19    (167, 203)    (48, 232)
10    (101, 126)    (101, 132)

ξ(2): Y2 = 149 is not possible when Y1 < 149 (a team cannot have more hits than at-bats).

8 Observation ξ(2) [Figure: the rectangle for ξ(2) in the (Y1, Y2) plane, cut by the line Y2 = αY1 into regions R1-R4]

9

10 Dependencies between Variables -- Interval-valued Variables

E.g., Regression Analysis.
Dependent variable: Y = (Y1, ..., Yq), e.g., q = 1.
Predictor/regressor variables: X = (X1, ..., Xp).
Multiple regression model: Y = β0 + β1 X1 + ... + βp Xp + e.
Error e: E(e) = 0, Var(e) = σ², Cov(e_i, e_k) = 0, i ≠ k.

11 Multiple Regression Model: Y = β0 + β1 X1 + ... + βp Xp + e

In vector terms, Y = Xβ + e, with
Observation vector: Y' = (Y1, ..., Yn)
Design matrix:
  X = [ 1  X11 ... X1p
        ...
        1  Xn1 ... Xnp ]
Regression coefficient vector: β' = (β0, β1, ..., βp)
Error vector: e' = (e1, ..., en)

12 Model: Y = Xβ + e

Least squares estimator of β: β̂ = (X'X)⁻¹ X'Y.
When p = 1,
  β̂1 = Σ_{i=1}^n (Xi − X̄)(Yi − Ȳ) / Σ_{i=1}^n (Xi − X̄)² = Cov(X, Y)/Var(X),
  β̂0 = Ȳ − β̂1 X̄,
where Ȳ = (1/n) Σ_{i=1}^n Yi, X̄ = (1/n) Σ_{i=1}^n Xi.
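A quick numerical check (illustrative data, not from the talk) that the matrix estimator β̂ = (X'X)⁻¹X'Y and the p = 1 closed form agree:

```python
# Compare the matrix least-squares solution with the p = 1 closed form.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=20)
y = 2.0 + 3.0 * x + rng.normal(scale=0.1, size=20)   # true beta0=2, beta1=3

X = np.column_stack([np.ones_like(x), x])    # design matrix with intercept
beta = np.linalg.solve(X.T @ X, X.T @ y)     # (X'X)^{-1} X'Y

# Closed form: beta1 = Cov(X, Y)/Var(X), beta0 = Ybar - beta1 * Xbar
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
assert np.allclose(beta, [b0, b1])
print(beta)
```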

13 Model: Y = β0 + β1 X1 + ... + βp Xp + e

Or, write as Y − Ȳ = β1(X1 − X̄1) + ... + βp(Xp − X̄p) + e,
where X̄j = (1/n) Σ_{i=1}^n Xij, j = 1, ..., p,
and β0 = Ȳ − (β1 X̄1 + ... + βp X̄p).

14 Y − Ȳ = β1(X1 − X̄1) + ... + βp(Xp − X̄p) + e

Least squares estimator of β: β̂ = [(X − X̄)'(X − X̄)]⁻¹ (X − X̄)'(Y − Ȳ),
where (X − X̄)'(X − X̄) has (j1, j2) entry
  Σ_i (X_{i,j1} − X̄_{j1})(X_{i,j2} − X̄_{j2}),  j1, j2 = 1, ..., p
(diagonal entries Σ_i (X_{ij} − X̄_j)²), and (X − X̄)'(Y − Ȳ) has j-th entry
  Σ_i (X_{ij} − X̄_j)(Yi − Ȳ),  j = 1, ..., p.

15 Interval-valued data: Y_uj = [a_uj, b_uj], j = 1, ..., p, u ∈ E = {w1, ..., wu, ..., wm}

Bertrand and Goupil (2000): Symbolic sample mean is
  Ȳj = (1/2m) Σ_{u∈E} (b_uj + a_uj);
symbolic sample variance is
  Sj² = (1/3m) Σ_{u∈E} (b_uj² + b_uj a_uj + a_uj²) − (1/4m²) [Σ_{u∈E} (a_uj + b_uj)]².

Notice, e.g., with m = 1 and Y = Weight:
  Y1 = [132, 138]: Ȳ = 135, S² = 3;
  Y2 = [129, 141]: Ȳ = 135, S² = 12.
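The Bertrand-Goupil formulas are easy to verify numerically; a minimal sketch (not from the talk) checking the Weight example, where two intervals with the same midpoint give the same symbolic mean but different variances:

```python
# Bertrand-Goupil (2000) symbolic sample mean and variance for
# interval-valued data [a_u, b_u], u = 1, ..., m.
import numpy as np

def symbolic_mean(a, b):
    return np.sum(a + b) / (2 * len(a))

def symbolic_var(a, b):
    m = len(a)
    return (np.sum(b**2 + a * b + a**2) / (3 * m)
            - (np.sum(a + b)) ** 2 / (4 * m**2))

# Single-interval samples (m = 1) with equal means, unequal variances
a1, b1 = np.array([132.0]), np.array([138.0])
a2, b2 = np.array([129.0]), np.array([141.0])
print(symbolic_mean(a1, b1), symbolic_var(a1, b1))   # 135.0 3.0
print(symbolic_mean(a2, b2), symbolic_var(a2, b2))   # 135.0 12.0
```

The wider interval [129, 141] carries four times the internal variance of [132, 138], exactly the (b − a)²/12 behaviour of a Uniform distribution on the interval.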

16 Can rewrite
  Sj² = (1/3m) Σ_{u∈E} [(a_uj − Ȳj)² + (a_uj − Ȳj)(b_uj − Ȳj) + (b_uj − Ȳj)²].

Then, by analogy, for interval-valued variables Y1 and Y2 (j = 1, 2), the empirical covariance function Cov(Y1, Y2) is
  Cov(Y1, Y2) = (1/3m) Σ_{u∈E} G1 G2 [Q1 Q2]^{1/2},
where
  Qj = (a_uj − Ȳj)² + (a_uj − Ȳj)(b_uj − Ȳj) + (b_uj − Ȳj)²,
  Gj = −1 if Ȳ_uj ≤ Ȳj, +1 if Ȳ_uj > Ȳj,
  Ȳ_uj = (a_uj + b_uj)/2.

Notice, special cases:
(i) Cov(Yj, Yj) = Sj²;
(ii) if a_uj = b_uj = y_uj for all u, i.e., classical data,
  Cov(Y1, Y2) = (1/m) Σ_u (y_u1 − Ȳ1)(y_u2 − Ȳ2).
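A sketch (illustrative, not from the talk) of this empirical covariance, checking special case (ii) that degenerate intervals recover the classical covariance:

```python
# Empirical covariance for interval variables with the sign factors G_j.
import numpy as np

def interval_cov(a1, b1, a2, b2):
    m = len(a1)
    Y1bar = np.sum(a1 + b1) / (2 * m)
    Y2bar = np.sum(a2 + b2) / (2 * m)
    Q1 = (a1 - Y1bar)**2 + (a1 - Y1bar) * (b1 - Y1bar) + (b1 - Y1bar)**2
    Q2 = (a2 - Y2bar)**2 + (a2 - Y2bar) * (b2 - Y2bar) + (b2 - Y2bar)**2
    G1 = np.where((a1 + b1) / 2 <= Y1bar, -1.0, 1.0)   # sign of midpoint deviation
    G2 = np.where((a2 + b2) / 2 <= Y2bar, -1.0, 1.0)
    return np.sum(G1 * G2 * np.sqrt(Q1 * Q2)) / (3 * m)

# Degenerate intervals (a = b): recovers (1/m) * sum (y1 - Y1bar)(y2 - Y2bar)
y1 = np.array([1.0, 2.0, 3.0, 4.0])
y2 = np.array([2.0, 1.0, 5.0, 4.0])
classical = np.mean((y1 - y1.mean()) * (y2 - y2.mean()))
assert np.isclose(interval_cov(y1, y1, y2, y2), classical)
print(classical)
```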

17 Back to Bertrand and Goupil (2000)

Sample variance is
  Sj² = (1/3m) Σ_{u∈E} (b_uj² + b_uj a_uj + a_uj²) − (1/4m²) [Σ_{u∈E} (a_uj + b_uj)]².
This is a total variance. Take Total Sum of Squares = Total SSj = m Sj².
Then, we can show
  Total SSj = Within Objects SSj + Between Objects SSj.

18 Within Objects SSj = (1/3) Σ_{u∈E} [(a_uj − Ȳ_uj)² + (a_uj − Ȳ_uj)(b_uj − Ȳ_uj) + (b_uj − Ȳ_uj)²]

Between Objects SSj = Σ_{u∈E} [(a_uj + b_uj)/2 − Ȳj]²

with Ȳ_uj = (a_uj + b_uj)/2 and Ȳj = (1/2m) Σ_{u∈E} (a_uj + b_uj).

Classical data: a_uj = b_uj = Y_uj, so Within Objects SSj = 0.
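The decomposition can be checked numerically; a minimal sketch (illustrative intervals, not from the talk):

```python
# Check Total SS_j = Within SS_j + Between SS_j for interval data [a_u, b_u].
import numpy as np

a = np.array([44.0, 60.0, 56.0, 70.0, 54.0])
b = np.array([68.0, 72.0, 90.0, 112.0, 72.0])
m = len(a)

Ybar = np.sum(a + b) / (2 * m)     # symbolic mean
mid = (a + b) / 2                  # interval midpoints

total_ss = np.sum((a - Ybar)**2 + (a - Ybar) * (b - Ybar) + (b - Ybar)**2) / 3
within_ss = np.sum((a - mid)**2 + (a - mid) * (b - mid) + (b - mid)**2) / 3
between_ss = np.sum((mid - Ybar)**2)

assert np.isclose(total_ss, within_ss + between_ss)
print(total_ss, within_ss, between_ss)
```

Note that the Within term reduces to Σ_u (b_u − a_u)²/12, the internal (Uniform) variance of the intervals, while the Between term is the classical sum of squares of the midpoints.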

19 So, for Yj, we have the Sum of Squares (SS) decomposition
  Total SSj = Within Objects SSj + Between Objects SSj.
Likewise, for (Yi, Yj), we have a Sum of Products (SP) decomposition
  Total SPij = Within Objects SPij + Between Objects SPij.

20 Can rewrite
  Sj² = (1/3m) Σ_{u∈E} [(a_uj − Ȳj)² + (a_uj − Ȳj)(b_uj − Ȳj) + (b_uj − Ȳj)²].

Then, by analogy, for interval-valued variables Y1 and Y2 (j = 1, 2), the empirical covariance function Cov(Y1, Y2) is
  Cov(Y1, Y2) = (1/3m) Σ_{u∈E} G1 G2 [Q1 Q2]^{1/2},
where
  Qj = (a_uj − Ȳj)² + (a_uj − Ȳj)(b_uj − Ȳj) + (b_uj − Ȳj)²,
  Gj = −1 if Ȳ_uj ≤ Ȳj, +1 if Ȳ_uj > Ȳj,
  Ȳ_uj = (a_uj + b_uj)/2.

21 Can rewrite
  Sj² = (1/3m) Σ_{u∈E} [(a_uj − Ȳj)² + (a_uj − Ȳj)(b_uj − Ȳj) + (b_uj − Ȳj)²].

Then, by analogy, for interval-valued variables Y1 and Y2 (j = 1, 2), the empirical covariance function Cov(Y1, Y2) is
  Cov(Y1, Y2) = (1/3m) Σ_{u∈E} G1 G2 [Q1 Q2]^{1/2},
where
  Qj = (a_uj − Ȳj)² + (a_uj − Ȳj)(b_uj − Ȳj) + (b_uj − Ȳj)²,
  Gj = −1 if Ȳ_uj ≤ Ȳj, +1 if Ȳ_uj > Ȳj,
  Ȳ_uj = (a_uj + b_uj)/2.

The (Total) SP part can be replaced by
  Total SP = (1/6) Σ_u [2(a_u − Ȳ1)(c_u − Ȳ2) + (a_u − Ȳ1)(d_u − Ȳ2)
                        + (b_u − Ȳ1)(c_u − Ȳ2) + 2(b_u − Ȳ1)(d_u − Ȳ2)].

22 How is this obtained? Recall that for a Uniform distribution, Y ~ U(a, b),
  Var(Y) = (b − a)²/12.
By analogy, we can show, for u = 1, ..., m observations with Y_u1 = [a_u, b_u] and Y_u2 = [c_u, d_u],
  Within SP = (1/12) Σ_{u=1}^m (b_u − a_u)(d_u − c_u),
  Between SP = Σ_{u=1}^m [(a_u + b_u)/2 − Ȳ1][(c_u + d_u)/2 − Ȳ2],
where
  Ȳ1 = (1/m) Σ_{u=1}^m (a_u + b_u)/2,  Ȳ2 = (1/m) Σ_{u=1}^m (c_u + d_u)/2.

23 Within SP = (1/12) Σ_{u=1}^m (b_u − a_u)(d_u − c_u),
  Between SP = Σ_{u=1}^m [(a_u + b_u)/2 − Ȳ1][(c_u + d_u)/2 − Ȳ2].

Hence,
  Total SP = Within SP + Between SP
           = (1/6) Σ_{u=1}^m [2(a_u − Ȳ1)(c_u − Ȳ2) + (a_u − Ȳ1)(d_u − Ȳ2)
                              + (b_u − Ȳ1)(c_u − Ȳ2) + 2(b_u − Ȳ1)(d_u − Ȳ2)].
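The SP identity can also be checked numerically; a minimal sketch (illustrative intervals, not from the talk):

```python
# Check Total SP = Within SP + Between SP for two interval variables
# Y_u1 = [a_u, b_u] and Y_u2 = [c_u, d_u].
import numpy as np

a = np.array([44.0, 60.0, 56.0, 70.0])
b = np.array([68.0, 72.0, 90.0, 112.0])
c = np.array([90.0, 90.0, 140.0, 110.0])
d = np.array([110.0, 130.0, 180.0, 142.0])
m = len(a)

Y1bar = np.sum(a + b) / (2 * m)
Y2bar = np.sum(c + d) / (2 * m)

within_sp = np.sum((b - a) * (d - c)) / 12
between_sp = np.sum(((a + b) / 2 - Y1bar) * ((c + d) / 2 - Y2bar))
total_sp = np.sum(2 * (a - Y1bar) * (c - Y2bar) + (a - Y1bar) * (d - Y2bar)
                  + (b - Y1bar) * (c - Y2bar) + 2 * (b - Y1bar) * (d - Y2bar)) / 6

assert np.isclose(total_sp, within_sp + between_sp)
print(total_sp, within_sp, between_sp)
```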

24
u    Y = Pulse Rate  X1 = Systolic Pressure  X2 = Diastolic Pressure
1    [44, 68]        [90, 110]               [50, 70]
2    [60, 72]        [90, 130]               [70, 90]
3    [56, 90]        [140, 180]              [90, 100]
4    [70, 112]       [110, 142]              [80, 108]
5    [54, 72]        [90, 100]               [50, 70]
6    [70, 100]       [134, 142]              [80, 110]
7    [72, 100]       [130, 160]              [76, 90]
8    [76, 98]        [110, 190]              [70, 110]
9    [86, 96]        [138, 180]              [90, 110]
10   [86, 100]       [110, 150]              [78, 100]
11   [63, 75]        [60, 100]               [140, 150]

Rule: X2 = Diastolic Pressure < Systolic Pressure = X1.

25

26 The regression equation becomes, for Y = Pulse Rate, X1 = Systolic Pressure:
  Y = … + … X1
  Ȳ = …, X̄ = …
  Std Devn(Y) = …, Std Devn(X1) = …, Cov(Y, X1) = …
  rho(Y, X1) = 0.725

27 Y = Pulse Rate, X1 = Systolic Pressure: Y = … + … X1.
Prediction with Ŷ_u = [â_u1, b̂_u1], applying the fitted line to the interval endpoints:
  â_u1 = β̂0 + β̂1 a_u2,  b̂_u1 = β̂0 + β̂1 b_u2.
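A sketch of this endpoint-wise prediction (the talk's fitted coefficients are not legible in this transcript, so beta0 and beta1 below are illustrative placeholders):

```python
# Given a fitted line y = beta0 + beta1 * x, predict an interval response
# by mapping the predictor interval [a, b] endpoint-wise.
def predict_interval(beta0, beta1, a, b):
    lo, hi = beta0 + beta1 * a, beta0 + beta1 * b
    return (min(lo, hi), max(lo, hi))   # reorder endpoints if beta1 < 0

print(predict_interval(25.0, 0.5, 90, 110))  # (70.0, 80.0)
```

A negative slope flips the endpoints, which is why the result is reordered; with beta1 > 0 the predicted interval is simply [β̂0 + β̂1 a, β̂0 + β̂1 b].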

28 Symbolic Prediction Equation

29 Symbolic Prediction Intervals

30 Symbolic Prediction Intervals and Equation

31 Original Intervals vs. Prediction Intervals

32 Data Intervals vs. Prediction Intervals

33 Predicted Pulse Rates and Residuals

u    Observed Pulse Rate  Systolic     Predicted Y [â_u, b̂_u]  Residuals [Res_a, Res_b]
1    [44, 68]             [90, 100]    [62.099, …]              […, 1.805]
2    [60, 72]             [90, 130]    [62.099, …]              [-2.099, …]
3    [56, 90]             [140, 180]   […, …]                   […, …]
4    [70, 112]            [110, 142]   [70.292, …]              [-0.292, …]
5    [54, 72]             [90, 100]    [62.099, …]              [-8.099, 5.805]
6    [70, 100]            [130, 160]   [78.486, …]              [-8.486, …]
7    [72, 100]            [130, 160]   [78.486, …]              [-6.486, …]
8    [76, 98]             [110, 190]   [70.292, …]              [5.708, …]
9    [86, 96]             [138, 180]   [81.763, …]              [4.237, …]
10   [86, 100]            [110, 150]   [70.292, …]              [15.708, …]

Y_u = Pulse Rate = [a_u, b_u]; Ŷ_u = Predicted Pulse Rate = [â_u, b̂_u]; Residual = [Res_a, Res_b].

34 Sum of Residuals for Symbolic Fit:
  Sum of Min Residuals Σ_u Res_au = …
  Sum of Max Residuals Σ_u Res_bu = …
Sum of Squared Residuals for Symbolic Fit:
  Sum of Min Squared Residuals = …
  Sum of Max Squared Residuals = …

35 Classical Regression on Midpoints

Y^c_u = (a_u1 + b_u1)/2, X^c_ju = (a_uj + b_uj)/2, j = 1, 2.
Fitted: Y^c = … + … X1.
Prediction intervals Ŷ^c = [â^c, b̂^c], applying the fitted line to the endpoints:
  â^c_u = β̂0 + β̂1 a_u2,  b̂^c_u = β̂0 + β̂1 b_u2.
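The midpoint fit can be sketched directly (using the first five pulse/systolic intervals from slide 24; the coefficients this produces are illustrative, not the talk's full-data fit):

```python
# Classical OLS on interval midpoints Y^c_u = (a+b)/2, X^c_u = (a+b)/2.
import numpy as np

Y = np.array([[44, 68], [60, 72], [56, 90], [70, 112], [54, 72]], dtype=float)
X = np.array([[90, 110], [90, 130], [140, 180], [110, 142], [90, 100]], dtype=float)

yc = Y.mean(axis=1)    # midpoints of the response intervals
xc = X.mean(axis=1)    # midpoints of the predictor intervals
b1 = np.sum((xc - xc.mean()) * (yc - yc.mean())) / np.sum((xc - xc.mean()) ** 2)
b0 = yc.mean() - b1 * xc.mean()
print(b0, b1)
```

Note this fit uses only the interval centres; unlike the symbolic fit it ignores the internal variation of the intervals entirely.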

36 Classical Regression through Midpoints

37 Symbolic Regression vs. Classical Regression

38 Comparison of Regression Fits

Sum of Residuals for Symbolic Fit: Min = …, Max = …
Sum of Squared Residuals for Symbolic Fit: Min = …, Max = …
Sum of Residuals for Classical Fit: Min = …, Max = …
Sum of Squared Residuals for Classical Fit: Min = …, Max = …

39 Centers and Range Regression
De Carvalho, Lima Neto, Tenorio, Freire, ... (2004, 2005, ...)

Midpoint: Y^c = (a + b)/2, X^c = (c + d)/2.
Range: Y^r = (b − a)/2, X^r = (d − c)/2.
Fit: Ŷ^c = … + … X^c,  Ŷ^r = … + … X^r.

40 Centers and Range Regression
De Carvalho, Lima Neto, Tenorio, Freire, ... (2004, 2005, ...)

Midpoint: Y^c = (a + b)/2, X^c = (c + d)/2.
Range: Y^r = (b − a)/2, X^r = (d − c)/2.
Single: Ŷ^c = … + … X^c,  Ŷ^r = … + … X^r.
Multiple: Ŷ^c = … + … X1^c + … X1^r,  Ŷ^r = … + … X1^c + … X1^r.
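A sketch of the centre-and-range idea (illustrative data from slide 24, single-predictor case only): fit one regression to the midpoints, one to the half-ranges, then reassemble the predicted intervals as Ŷ^c ∓ Ŷ^r.

```python
# Centre-and-range regression sketch: separate OLS fits for centres and
# half-ranges, then reconstruct interval predictions.
import numpy as np

def ols1(x, y):
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    return y.mean() - b1 * x.mean(), b1

Y = np.array([[44, 68], [60, 72], [56, 90], [70, 112], [54, 72]], dtype=float)
X = np.array([[90, 110], [90, 130], [140, 180], [110, 142], [90, 100]], dtype=float)

yc, yr = Y.mean(axis=1), (Y[:, 1] - Y[:, 0]) / 2    # centre, half-range of Y
xc, xr = X.mean(axis=1), (X[:, 1] - X[:, 0]) / 2
(c0, c1), (r0, r1) = ols1(xc, yc), ols1(xr, yr)

yc_hat = c0 + c1 * xc
yr_hat = np.maximum(r0 + r1 * xr, 0.0)              # half-ranges must stay nonnegative
pred = np.column_stack([yc_hat - yr_hat, yc_hat + yr_hat])
print(pred)
```

Clipping the predicted half-range at zero is one simple way to guarantee well-formed intervals; the constrained variants in the Lima Neto-De Carvalho literature handle this more formally.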

41 Centers and Range Regression -- Predictions

                     Single              Multiple
Obs  Y               [Ŷ_a, Ŷ_b]          [Ŷ_a, Ŷ_b]
1    [44, 68]        [52.572, 77.439]    [53.195, 75.299]
2    [60, 72]        [59.230, 82.365]    [63.089, 81.937]
3    [56, 90]        [78.537, …]         [75.334, …]
4    [70, 112]       [65.178, 88.774]    [65.349, 88.470]
5    [54, 72]        [52.572, 77.439]    [53.195, 75.299]
6    [70, 100]       [72.457, 96.168]    [69.587, 96.331]
7    [72, 100]       [72.457, 96.168]    [69.587, 96.331]
8    [76, 98]        [75.831, 96.655]    [81.180, 99.092]
9    [86, 96]        [78.209, …]         [75.504, …]
10   [86, 100]       [66.953, 90.087]    [67.987, 90.241]

42 Symbolic Principal Components -- BATS
Y1 = Head, Y2 = Tail, Y3 = Height, Y4 = Forearm

Obs  [Y1a, Y1b]  [Y2a, Y2b]  [Y3a, Y3b]  [Y4a, Y4b]
1    [33, 52]    [26, 33]    [4, 7]      [27, 32]
2    [38, 50]    [30, 40]    [7, 8]      [32, 37]
3    [43, 48]    [34, 39]    [6, 7]      [31, 38]
4    [44, 48]    [34, 44]    [7, 8]      [31, 36]
5    [41, 51]    [30, 39]    [8, 11]     [33, 41]
6    [40, 45]    [39, 44]    [9, 9]      [36, 42]
7    [45, 53]    [35, 38]    [10, 12]    [39, 44]
8    [44, 58]    [41, 54]    [6, 8]      [35, 41]
9    [47, 53]    [43, 53]    [7, 9]      [37, 41]
10   [50, 69]    [30, 43]    [11, 13]    [51, 61]
11   [65, 80]    [48, 60]    [12, 16]    [55, 68]
12   [82, 87]    [46, 57]    [11, 12]    [58, 63]
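One common way to compute interval-valued principal components is the "centres" method: classical PCA on the interval midpoints, then project each observation's hyper-rectangle vertices to obtain interval scores. A minimal sketch (using only the first three bats; this is one standard approach, not necessarily the exact method of the talk):

```python
# Centres-method symbolic PCA sketch: PCA axes from interval midpoints,
# then interval PC scores from the 2^4 vertices of each observation.
import numpy as np
from itertools import product

data = np.array([  # bats 1-3: [Y1a,Y1b], [Y2a,Y2b], [Y3a,Y3b], [Y4a,Y4b]
    [[33, 52], [26, 33], [4, 7], [27, 32]],
    [[38, 50], [30, 40], [7, 8], [32, 37]],
    [[43, 48], [34, 39], [6, 7], [31, 38]],
], dtype=float)

centres = data.mean(axis=2)                       # midpoints, shape (n, 4)
grand_mean = centres.mean(axis=0)
_, _, Vt = np.linalg.svd(centres - grand_mean, full_matrices=False)

pc_intervals = []
for obs in data:
    # scores on PC1 for all 2^4 vertices of the observation's rectangle
    scores = [float((np.array(v) - grand_mean) @ Vt[0]) for v in product(*obs)]
    pc_intervals.append((min(scores), max(scores)))
print(pc_intervals)
```

The min/max over vertices gives the interval [PC1a, PC1b] for each bat; repeating with Vt[1], Vt[2] fills the PC2 and PC3 columns of the table on the next slide.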

43 Symbolic Principal Components -- BATS
Y1 = Head, Y2 = Tail, Y3 = Height, Y4 = Forearm

Obs  PC1a  PC1b  PC2a  PC2b  PC3a  PC3b

44 Symbolic Principal Component Analysis -- BATS


More information

Lecture 9 SLR in Matrix Form

Lecture 9 SLR in Matrix Form Lecture 9 SLR in Matrix Form STAT 51 Spring 011 Background Reading KNNL: Chapter 5 9-1 Topic Overview Matrix Equations for SLR Don t focus so much on the matrix arithmetic as on the form of the equations.

More information

18.440: Lecture 28 Lectures Review

18.440: Lecture 28 Lectures Review 18.440: Lecture 28 Lectures 18-27 Review Scott Sheffield MIT Outline Outline It s the coins, stupid Much of what we have done in this course can be motivated by the i.i.d. sequence X i where each X i is

More information

Advanced Econometrics I

Advanced Econometrics I Lecture Notes Autumn 2010 Dr. Getinet Haile, University of Mannheim 1. Introduction Introduction & CLRM, Autumn Term 2010 1 What is econometrics? Econometrics = economic statistics economic theory mathematics

More information

Answers to Problem Set #4

Answers to Problem Set #4 Answers to Problem Set #4 Problems. Suppose that, from a sample of 63 observations, the least squares estimates and the corresponding estimated variance covariance matrix are given by: bβ bβ 2 bβ 3 = 2

More information

Regression Analysis Chapter 2 Simple Linear Regression

Regression Analysis Chapter 2 Simple Linear Regression Regression Analysis Chapter 2 Simple Linear Regression Dr. Bisher Mamoun Iqelan biqelan@iugaza.edu.ps Department of Mathematics The Islamic University of Gaza 2010-2011, Semester 2 Dr. Bisher M. Iqelan

More information

Key Algebraic Results in Linear Regression

Key Algebraic Results in Linear Regression Key Algebraic Results in Linear Regression James H. Steiger Department of Psychology and Human Development Vanderbilt University James H. Steiger (Vanderbilt University) 1 / 30 Key Algebraic Results in

More information

Linear Regression. In this lecture we will study a particular type of regression model: the linear regression model

Linear Regression. In this lecture we will study a particular type of regression model: the linear regression model 1 Linear Regression 2 Linear Regression In this lecture we will study a particular type of regression model: the linear regression model We will first consider the case of the model with one predictor

More information

Principal Components Analysis

Principal Components Analysis Principal Components Analysis Nathaniel E. Helwig Assistant Professor of Psychology and Statistics University of Minnesota (Twin Cities) Updated 16-Mar-2017 Nathaniel E. Helwig (U of Minnesota) Principal

More information

Stat 579: Generalized Linear Models and Extensions

Stat 579: Generalized Linear Models and Extensions Stat 579: Generalized Linear Models and Extensions Linear Mixed Models for Longitudinal Data Yan Lu April, 2018, week 12 1 / 34 Correlated data multivariate observations clustered data repeated measurement

More information

Probability and Statistics Notes

Probability and Statistics Notes Probability and Statistics Notes Chapter Seven Jesse Crawford Department of Mathematics Tarleton State University Spring 2011 (Tarleton State University) Chapter Seven Notes Spring 2011 1 / 42 Outline

More information

Regression Review. Statistics 149. Spring Copyright c 2006 by Mark E. Irwin

Regression Review. Statistics 149. Spring Copyright c 2006 by Mark E. Irwin Regression Review Statistics 149 Spring 2006 Copyright c 2006 by Mark E. Irwin Matrix Approach to Regression Linear Model: Y i = β 0 + β 1 X i1 +... + β p X ip + ɛ i ; ɛ i iid N(0, σ 2 ), i = 1,..., n

More information

Chapter 4: Models for Stationary Time Series

Chapter 4: Models for Stationary Time Series Chapter 4: Models for Stationary Time Series Now we will introduce some useful parametric models for time series that are stationary processes. We begin by defining the General Linear Process. Let {Y t

More information

TMA4255 Applied Statistics V2016 (5)

TMA4255 Applied Statistics V2016 (5) TMA4255 Applied Statistics V2016 (5) Part 2: Regression Simple linear regression [11.1-11.4] Sum of squares [11.5] Anna Marie Holand To be lectured: January 26, 2016 wiki.math.ntnu.no/tma4255/2016v/start

More information

sociology 362 regression

sociology 362 regression sociology 36 regression Regression is a means of modeling how the conditional distribution of a response variable (say, Y) varies for different values of one or more independent explanatory variables (say,

More information

Analysis of Covariance

Analysis of Covariance Analysis of Covariance (ANCOVA) Bruce A Craig Department of Statistics Purdue University STAT 514 Topic 10 1 When to Use ANCOVA In experiment, there is a nuisance factor x that is 1 Correlated with y 2

More information

R 2 and F -Tests and ANOVA

R 2 and F -Tests and ANOVA R 2 and F -Tests and ANOVA December 6, 2018 1 Partition of Sums of Squares The distance from any point y i in a collection of data, to the mean of the data ȳ, is the deviation, written as y i ȳ. Definition.

More information

Analysis of Longitudinal Data. Patrick J. Heagerty PhD Department of Biostatistics University of Washington

Analysis of Longitudinal Data. Patrick J. Heagerty PhD Department of Biostatistics University of Washington Analsis of Longitudinal Data Patrick J. Heagert PhD Department of Biostatistics Universit of Washington 1 Auckland 2008 Session Three Outline Role of correlation Impact proper standard errors Used to weight

More information

Lecture 6 Multiple Linear Regression, cont.

Lecture 6 Multiple Linear Regression, cont. Lecture 6 Multiple Linear Regression, cont. BIOST 515 January 22, 2004 BIOST 515, Lecture 6 Testing general linear hypotheses Suppose we are interested in testing linear combinations of the regression

More information

Topic 16 Interval Estimation

Topic 16 Interval Estimation Topic 16 Interval Estimation Additional Topics 1 / 9 Outline Linear Regression Interpretation of the Confidence Interval 2 / 9 Linear Regression For ordinary linear regression, we have given least squares

More information

Scatter plot of data from the study. Linear Regression

Scatter plot of data from the study. Linear Regression 1 2 Linear Regression Scatter plot of data from the study. Consider a study to relate birthweight to the estriol level of pregnant women. The data is below. i Weight (g / 100) i Weight (g / 100) 1 7 25

More information

Statistical Analysis in Modelling: MCMC methods. Heikki Haario Lappeenranta University of Technology

Statistical Analysis in Modelling: MCMC methods. Heikki Haario Lappeenranta University of Technology Statistical Analysis in Modelling: methods Heikki Haario Lappeenranta University of Technology heikki.haario@lut.fi Contents Introduction Linear Models Regression Analysis Covariance, classical error bounds

More information

Applied Regression Modeling: A Business Approach Chapter 3: Multiple Linear Regression Sections

Applied Regression Modeling: A Business Approach Chapter 3: Multiple Linear Regression Sections Applied Regression Modeling: A Business Approach Chapter 3: Multiple Linear Regression Sections 3.4 3.6 by Iain Pardoe 3.4 Model assumptions 2 Regression model assumptions.............................................

More information

For more information about how to cite these materials visit

For more information about how to cite these materials visit Author(s): Kerby Shedden, Ph.D., 2010 License: Unless otherwise noted, this material is made available under the terms of the Creative Commons Attribution Share Alike 3.0 License: http://creativecommons.org/licenses/by-sa/3.0/

More information

Random Vectors, Random Matrices, and Matrix Expected Value

Random Vectors, Random Matrices, and Matrix Expected Value Random Vectors, Random Matrices, and Matrix Expected Value James H. Steiger Department of Psychology and Human Development Vanderbilt University James H. Steiger (Vanderbilt University) 1 / 16 Random Vectors,

More information

Applied Econometrics (QEM)

Applied Econometrics (QEM) Applied Econometrics (QEM) based on Prinicples of Econometrics Jakub Mućk Department of Quantitative Economics Jakub Mućk Applied Econometrics (QEM) Meeting #3 1 / 42 Outline 1 2 3 t-test P-value Linear

More information

1/15. Over or under dispersion Problem

1/15. Over or under dispersion Problem 1/15 Over or under dispersion Problem 2/15 Example 1: dogs and owners data set In the dogs and owners example, we had some concerns about the dependence among the measurements from each individual. Let

More information

Matrices and vectors A matrix is a rectangular array of numbers. Here s an example: A =

Matrices and vectors A matrix is a rectangular array of numbers. Here s an example: A = Matrices and vectors A matrix is a rectangular array of numbers Here s an example: 23 14 17 A = 225 0 2 This matrix has dimensions 2 3 The number of rows is first, then the number of columns We can write

More information

Multivariate Linear Models

Multivariate Linear Models Multivariate Linear Models Stanley Sawyer Washington University November 7, 2001 1. Introduction. Suppose that we have n observations, each of which has d components. For example, we may have d measurements

More information

ECON2228 Notes 2. Christopher F Baum. Boston College Economics. cfb (BC Econ) ECON2228 Notes / 47

ECON2228 Notes 2. Christopher F Baum. Boston College Economics. cfb (BC Econ) ECON2228 Notes / 47 ECON2228 Notes 2 Christopher F Baum Boston College Economics 2014 2015 cfb (BC Econ) ECON2228 Notes 2 2014 2015 1 / 47 Chapter 2: The simple regression model Most of this course will be concerned with

More information

Correlation and the Analysis of Variance Approach to Simple Linear Regression

Correlation and the Analysis of Variance Approach to Simple Linear Regression Correlation and the Analysis of Variance Approach to Simple Linear Regression Biometry 755 Spring 2009 Correlation and the Analysis of Variance Approach to Simple Linear Regression p. 1/35 Correlation

More information

Lecture 4: Proofs for Expectation, Variance, and Covariance Formula

Lecture 4: Proofs for Expectation, Variance, and Covariance Formula Lecture 4: Proofs for Expectation, Variance, and Covariance Formula by Hiro Kasahara Vancouver School of Economics University of British Columbia Hiro Kasahara (UBC) Econ 325 1 / 28 Discrete Random Variables:

More information