ASSESSING VARIATION: A UNIFYING APPROACH FOR ALL SCALES OF MEASUREMENT JSM Tamar Gadrich Emil Bashkansky


1 ASSESSING VARIATION: A UNIFYING APPROACH FOR ALL SCALES OF MEASUREMENT Tamar Gadrich, Emil Bashkansky (ORT Braude College of Engineering, Israel), Ričardas Zitikis (University of Western Ontario, Canada) JSM 2014

2 MOTIVATION For various reasons, we often wish or need to measure the variability of populations or samples, which can be quantitative, qualitative, and quite often mixed. Examples include social inequality and mobility, political consensus, homogeneity of a material, uncertainty of prediction, diversity or similarity of species, and the degree of synchronization of biological rhythms. These are complex tasks, not least because of the inherent heterogeneity of populations, which are usually made up of various groups and categories that often require different scales of measurement.

3 Status quo A number of variability measures have been developed to accommodate the four scales of measurement: nominal, ordinal, interval, and ratio. The variety of scales, and the corresponding restrictions on permissible arithmetical operations and order relations, pose serious challenges for researchers and decision makers. The most popular measures are designed for numerical (interval and ratio) data: the range, the interquartile range (IQR), the variance, and the standard deviation (SD), as well as measures based on the mean absolute deviation/difference and entropy-based measures. There is a need for variability measures that rely only on legitimate arithmetical operations between categorical variables and that decompose into intra- and inter-group components in a single unifying way.

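4 THE CLASSICAL VARIANCE AND THE GINI MEAN DIFFERENCE
Classical variance:
$$\mathrm{VAR}(x) = \sum_{k} (x_k - \mu)^2 p_k \quad \text{with} \quad \mu = \sum_{k} x_k p_k \qquad (1)$$
Gini formula for variance:
$$\mathrm{VAR}(x) = \frac{1}{2} \sum_{j} \sum_{i} (x_i - x_j)^2 p_i p_j \qquad (2)$$
Gini mean difference (GMD):
$$\mathrm{GMD}(x) = \sum_{j} \sum_{i} |x_i - x_j| \, p_i p_j \qquad (3)$$
For both definitions:
$$\sum_{j} \sum_{i} L(x_i, x_j) \, p_i p_j \qquad (4)$$
with $L(x_i, x_j) = \tfrac{1}{2}(x_i - x_j)^2$ for the variance and $L(x_i, x_j) = |x_i - x_j|$ for the GMD.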
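The equivalence of the two variance formulas can be checked numerically. The following is a minimal sketch, assuming the double sum runs over all ordered pairs (i, j), so that the variance carries a factor 1/2 and the GMD does not; the values `xs` and probabilities `ps` are hypothetical and not taken from the slides.

```python
def classical_variance(xs, ps):
    """Variance as the probability-weighted squared deviation from the mean."""
    mu = sum(x * p for x, p in zip(xs, ps))
    return sum((x - mu) ** 2 * p for x, p in zip(xs, ps))


def pairwise_mean(xs, ps, loss):
    """Double sum of loss(x_i, x_j) * p_i * p_j over all ordered pairs."""
    return sum(
        loss(xi, xj) * pi * pj
        for xi, pi in zip(xs, ps)
        for xj, pj in zip(xs, ps)
    )


# Hypothetical discrete distribution (illustration only).
xs = [1.0, 2.0, 4.0]
ps = [0.5, 0.3, 0.2]

var_classic = classical_variance(xs, ps)
# Gini formula: same quantity, written through pairwise squared differences.
var_gini = pairwise_mean(xs, ps, lambda a, b: 0.5 * (a - b) ** 2)
# Gini mean difference: pairwise absolute differences.
gmd = pairwise_mean(xs, ps, lambda a, b: abs(a - b))

print(var_classic, var_gini, gmd)
```

Both quantities are instances of one pairwise form; only the loss function passed to `pairwise_mean` changes.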

5 FOUR SCALES FOR QUALITY DATA
Numerical scales:
Ratio: quality cost, delivery time, number of defects, MTTF
Interval: temperature, calendar time, lake level
Categorical scales:
Ordinal: customer satisfaction, status, FMECA ranking, belt rank (Y, G, B, MB), quality level, ...
Nominal: vendor, failure mode, record type, quality requirement

6 THE LAYOUT OF CATEGORICAL DATA

7 DESCRIPTION OF CATEGORICAL DATA
Set of n ordinal/nominal data based on a scale with K ordered/non-ordered categories coded by integers k = 1, 2, ..., K.
$n_1, n_2, \ldots, n_K$: number of data items in each category.
$\hat{p}_k = n_k / n$: proportion of data belonging to the k-th category.
For ordinal data: $\hat{F}_k = \sum_{j=1}^{k} \hat{p}_j$, the cumulative frequency of data belonging up to the k-th category.
Example: n = 500 apples (Low = 1, Medium = 2, High = 3).

8 MEASURING VARIABILITY - CATEGORICAL
Let $c_k$, k = 1, 2, ..., K, be category codes and $p_k$ their corresponding probabilities, which we also call frequencies.
Let $L(c_i, c_j)$ be a two-argument function, defined on the codes, which is non-negative, symmetric, and such that $L(c_k, c_k) = 0$ for all k. We call it the loss-of-similarity function.
The population total variation is defined by:
$$V_T = \sum_{j} \sum_{i} L(c_i, c_j) \, p_i p_j \qquad (5)$$
The sample total variation is defined by:
$$\hat{V}_T = \sum_{j} \sum_{i} L(c_i, c_j) \, \hat{p}_i \hat{p}_j \qquad (6)$$
The m-th group variation:
$$\hat{V}_m = \sum_{j} \sum_{i} L(c_i, c_j) \, \hat{p}_{i|m} \hat{p}_{j|m}, \qquad \hat{p}_{k|m} = \frac{n_{km}}{n_m} \qquad (7)$$
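The definition of the sample total variation translates directly into a short function. This is a minimal sketch, assuming the double sum runs over all ordered pairs of categories; the function names, the two example loss functions, and the category counts are hypothetical illustrations rather than material from the slides.

```python
def frequencies(counts):
    """Convert category counts n_k into sample proportions p_k = n_k / n."""
    n = sum(counts)
    return [c / n for c in counts]


def total_variation(p, loss):
    """Double sum of loss(i, j) * p_i * p_j over all ordered category pairs."""
    k = len(p)
    return sum(loss(i, j) * p[i] * p[j] for i in range(k) for j in range(k))


def nominal_loss(i, j):
    """Categories are either the same (0) or different (1)."""
    return 0.0 if i == j else 1.0


def ordinal_loss(i, j):
    """Distance between category ranks, not between category codes."""
    return float(abs(i - j))


# Hypothetical counts for three categories (illustration only).
counts = [100, 250, 150]
p_hat = frequencies(counts)

v_nominal = total_variation(p_hat, nominal_loss)
v_ordinal = total_variation(p_hat, ordinal_loss)
print(v_nominal, v_ordinal)
```

Any function that is non-negative, symmetric, and zero on the diagonal can be passed as `loss`; the m-th group variation is the same call applied to the proportions within that group.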

9 WEIGHTED AVERAGE OF THE WITHIN GROUP VARIATIONS
Let $\hat{\omega}_m = n_m / N$ represent the proportion of data in the m-th group. Obviously, $\sum_{m=1}^{M} \hat{\omega}_m = 1$.
$$\hat{V}_W = \sum_{m=1}^{M} \hat{\omega}_m \hat{V}_m \qquad (8)$$

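10 THE BETWEEN GROUPS COVARIATION & ITS CHARACTERISTIC KERNEL
$$\hat{C}_B = \sum_{j} \sum_{i} L(c_i, c_j) \, \hat{\pi}_{ij} \qquad (9)$$
$$\hat{\pi}_{ij} = -\sum_{m=1}^{M} \hat{\omega}_m \, (\hat{p}_{i|m} - \hat{p}_i)(\hat{p}_{j|m} - \hat{p}_j) \qquad (10)$$
where $\hat{\omega}_m = n_m / N$ is the proportion of data in the m-th group.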

11 MEASURING VARIABILITY - CONTINUOUS
$$V_T = \iint L(x, x') \, dF(x) \, dF(x')$$
$$V_m = \iint L(x, x') \, dF(x|m) \, dF(x'|m)$$
$$C_B = \iint L(x, x') \, d\Pi(x, x')$$
$$\Pi(x, x') = -\sum_{m=1}^{M} \omega_m \, \big(F(x|m) - F(x)\big)\big(F(x'|m) - F(x')\big)$$

12 TOTAL-VARIATION DECOMPOSITION THEOREM The total variation can be split into the sum of the within-variation and the between-covariation:
$$V_T = V_W + C_B \qquad (11)$$
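The decomposition can be verified numerically on a small contingency table. The sketch below assumes that the double sums run over all ordered category pairs, that group weights are the group shares n_m / N, and that the between-groups kernel is minus the weighted sum of products of deviations of the group proportions from the pooled proportions. The counts and the function names are hypothetical.

```python
def pair_sum(p, loss):
    """Double sum of loss(i, j) * p_i * p_j over all ordered category pairs."""
    k = len(p)
    return sum(loss(i, j) * p[i] * p[j] for i in range(k) for j in range(k))


def decompose(counts, loss):
    """Return (V_T, V_W, C_B) for counts[m][k], group m and category k."""
    sizes = [sum(row) for row in counts]
    total = sum(sizes)
    k = len(counts[0])
    weights = [size / total for size in sizes]
    within_p = [[c / size for c in row] for row, size in zip(counts, sizes)]
    pooled = [sum(row[c] for row in counts) / total for c in range(k)]

    v_total = pair_sum(pooled, loss)
    v_within = sum(w * pair_sum(p, loss) for w, p in zip(weights, within_p))

    # Between-groups covariation computed from its own kernel,
    # not as the difference V_T - V_W.
    c_between = 0.0
    for i in range(k):
        for j in range(k):
            kernel = -sum(
                w * (p[i] - pooled[i]) * (p[j] - pooled[j])
                for w, p in zip(weights, within_p)
            )
            c_between += loss(i, j) * kernel
    return v_total, v_within, c_between


# Hypothetical counts: 3 groups of unequal size, 3 ordered categories.
counts = [[30, 50, 20], [10, 40, 50], [120, 60, 20]]
v_total, v_within, c_between = decompose(counts, lambda i, j: abs(i - j))
print(v_total, v_within + c_between)
```

Because the between term is computed independently, agreement of the two printed numbers is a genuine check of the identity rather than a tautology.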

13 INDEX PVE
$$\widehat{PVE} = \frac{\hat{C}_B}{\hat{V}_T} \qquad (12)$$
Note the following properties:
1. PVE = 0 when there is no association, that is, when there is no group effect on the category distribution. In mathematical terms, $\hat{V}_T = \hat{V}_W$, that is, the total variation is a pure (i.e., without any interaction) aggregate of the individual group variations.
2. PVE = 1 when data within every group fall into one (but perhaps not the same for all samples) category, that is, when there is perfect predictability.
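The two boundary properties can be illustrated with a small computation. This is a sketch under stated assumptions: PVE is taken as the between-covariation divided by the total variation, the between-covariation is obtained from the decomposition as total minus within, and the double sums run over all ordered category pairs. The count tables and the helper names are hypothetical.

```python
def pair_sum(p, loss):
    """Double sum of loss(i, j) * p_i * p_j over all ordered category pairs."""
    k = len(p)
    return sum(loss(i, j) * p[i] * p[j] for i in range(k) for j in range(k))


def pve(counts, loss):
    """Proportion of variation explained by the grouping in counts[m][k]."""
    sizes = [sum(row) for row in counts]
    total = sum(sizes)
    k = len(counts[0])
    pooled = [sum(row[c] for row in counts) / total for c in range(k)]
    v_total = pair_sum(pooled, loss)
    v_within = sum(
        (size / total) * pair_sum([c / size for c in row], loss)
        for row, size in zip(counts, sizes)
    )
    # Between-covariation via the decomposition V_T = V_W + C_B.
    return (v_total - v_within) / v_total


def ordinal_loss(i, j):
    return abs(i - j)


# Same category distribution in both groups: no group effect.
no_effect = pve([[20, 30, 50], [40, 60, 100]], ordinal_loss)
# Each group falls entirely into a single category: perfect predictability.
perfect = pve([[50, 0, 0], [0, 0, 80]], ordinal_loss)
print(no_effect, perfect)
```

The first table gives a value that is zero up to floating-point rounding, and the second gives one, matching the two properties listed above.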

14 INDEX OF SEGREGATION POWER (SP) AMONG GROUPS Rule of thumb: if SP > 3, the homogeneity hypothesis $H_0$ must be rejected; if SP < ..., it is not rejected; the region [..., 3] is the region of doubt, i.e., more data are required.

15 EXAMPLE: MONTE CARLO SIMULATION

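16 SPECIAL CASE 1: NOMINAL VARIABLES
CATANOVA OF LIGHT AND MARGOLIN (1971)
$$L(c_i, c_j) = \begin{cases} 0 & \text{when } i = j, \\ 1 & \text{when } i \neq j. \end{cases}$$
$$\hat{V}_T = \sum_{k} \hat{p}_k (1 - \hat{p}_k) = 1 - \sum_{k} \hat{p}_k^2$$
Normalizing the total variation by its maximal value we obtain:
$$IQV = \frac{K}{K-1} \Big(1 - \sum_{k} \hat{p}_k^2\Big)$$
The between-covariation:
$$\hat{C}_B = \sum_{m=1}^{M} \hat{\omega}_m \sum_{k} (\hat{p}_{km} - \hat{p}_k)^2$$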
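For the nominal loss, the pairwise definition collapses to a closed form, which a few lines of code can confirm. This is a sketch assuming the 0/1 loss and a double sum over all ordered pairs, so that the total variation equals one minus the sum of squared proportions and its maximum over K categories is (K - 1)/K. The proportions used are hypothetical.

```python
def nominal_total_variation(p):
    """Pairwise definition with the 0/1 loss: sum of p_i * p_j over i != j."""
    k = len(p)
    return sum(p[i] * p[j] for i in range(k) for j in range(k) if i != j)


def iqv(p):
    """Index of qualitative variation: total variation over its maximum."""
    k = len(p)
    return nominal_total_variation(p) * k / (k - 1)


# Hypothetical category proportions (illustration only).
p_hat = [0.2, 0.5, 0.3]

pairwise = nominal_total_variation(p_hat)
closed_form = 1.0 - sum(q * q for q in p_hat)

# The uniform distribution attains the maximum, so its IQV is 1.
iqv_uniform = iqv([1 / 3, 1 / 3, 1 / 3])
# A distribution concentrated in one category has no variation.
iqv_degenerate = iqv([1.0, 0.0, 0.0])
print(pairwise, closed_form, iqv(p_hat), iqv_uniform, iqv_degenerate)
```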

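17 SPECIAL CASE 2: ORDINAL VARIABLES
ORDANOVA OF GADRICH AND BASHKANSKY (2013)
$$L(c_i, c_j) = |i - j| \quad (\text{not } |c_i - c_j|!)$$
$$\hat{V}_T = 2 \sum_{k} \hat{F}_k (1 - \hat{F}_k)$$
Normalizing the total variation by its maximal value we obtain (Berry & Mielke, 1992; Blair & Lacy, 1996):
$$\hat{h} = \frac{4}{K-1} \sum_{k} \hat{F}_k (1 - \hat{F}_k)$$
The between-covariation:
$$\hat{C}_B = 2 \sum_{m=1}^{M} \hat{\omega}_m \sum_{k} (\hat{F}_{km} - \hat{F}_k)^2$$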
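The ordinal case can be checked the same way. This sketch assumes the rank-distance loss |i - j| and a double sum over all ordered pairs, under which the pairwise total variation equals twice the sum of F_k(1 - F_k); the normalized index divides the sum of F_k(1 - F_k) by its maximum (K - 1)/4. The proportions and function names are hypothetical.

```python
def cumulative(p):
    """Cumulative frequencies F_k = p_1 + ... + p_k."""
    running, out = 0.0, []
    for q in p:
        running += q
        out.append(running)
    return out


def ordinal_total_variation(p):
    """Pairwise definition with the rank-distance loss |i - j|."""
    k = len(p)
    return sum(abs(i - j) * p[i] * p[j] for i in range(k) for j in range(k))


def h_index(p):
    """Normalized ordinal variation: sum of F(1 - F) divided by (K - 1) / 4."""
    F = cumulative(p)
    k = len(p)
    # The last cumulative frequency equals 1 and contributes nothing.
    return sum(f * (1 - f) for f in F[:-1]) / ((k - 1) / 4)


# Hypothetical category proportions (illustration only).
p_hat = [0.2, 0.5, 0.3]
F_hat = cumulative(p_hat)

pairwise = ordinal_total_variation(p_hat)
closed_form = 2 * sum(f * (1 - f) for f in F_hat[:-1])

# Maximal polarization: half of the data at each extreme category.
h_polarized = h_index([0.5, 0.0, 0.5])
print(pairwise, closed_form, h_index(p_hat), h_polarized)
```

The polarized distribution reaches the upper bound of the index, which is why the normalizing constant is (K - 1)/4.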

18 SPECIAL CASE 3: INTERVAL VARIABLES (GMD)
$$L(x_i, x_j) = |x_i - x_j|$$
$$\hat{V}_T = 2 \int \hat{F}(x) \big(1 - \hat{F}(x)\big) \, dx$$
RHS = Gini mean difference (GMD).
The between-covariation:
$$\hat{C}_B = 2 \sum_{m=1}^{M} \hat{\omega}_m \int \big(\hat{F}_m(x) - \hat{F}(x)\big)^2 \, dx$$

19 SPECIAL CASE 4: RATIO SCALE The loss-of-similarity function $L(x, x') = |\log(x) - \log(x')|$ is well suited for the ratio scale. By adopting this function, we effectively replace our considerations on the ratio scale by those on the interval scale, and thus work with the loss-of-similarity function $L(y, y') = |y - y'|$, where instead of the original x's we now deal with their logarithms y = log x. Hence, all our earlier results pertaining to the interval scale can be utilized in a straightforward manner to establish analogous results on the ratio scale. Of course, there is an element of arbitrariness in our choice of the logarithmic transformation; there are indeed many alternatives. Nevertheless, our experience suggests that the underlying problems and the philosophies for tackling them usually restrict the class of loss-of-similarity functions, as well as of transformations, to just a few reasonable ones, and certain axiomatic approaches may even produce unique choices.
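One reason the logarithmic loss suits the ratio scale is that it depends only on ratios, so changing the unit of measurement leaves the variation unchanged. The sketch below illustrates this with a sample version of the Gini mean difference, taken here as the average absolute difference over all ordered pairs of observations; the data and function names are hypothetical.

```python
import math


def gmd(values):
    """Sample Gini mean difference: mean |a - b| over all ordered pairs."""
    n = len(values)
    return sum(abs(a - b) for a in values for b in values) / (n * n)


def ratio_scale_variation(xs):
    """Variation on the ratio scale: GMD of the log-transformed values."""
    return gmd([math.log(x) for x in xs])


# Hypothetical positive measurements (illustration only).
xs = [1.0, 2.0, 4.0, 8.0]

v_original = ratio_scale_variation(xs)
# Changing the unit, e.g. from kilograms to grams, rescales every value.
v_rescaled = ratio_scale_variation([1000.0 * x for x in xs])
# The plain GMD on the original scale is not invariant to the unit.
gmd_original = gmd(xs)
gmd_rescaled = gmd([1000.0 * x for x in xs])
print(v_original, v_rescaled, gmd_original, gmd_rescaled)
```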

20 SUMMARY We have presented a unifying approach for assessing variation in populations and data sets that accommodates every scale of measurement: nominal, ordinal, interval, and ratio. In particular, we have put forward a general decomposition of the total variation into within (intra) and between (inter) components. This has enabled us to introduce two indices: PVE, the proportion of variation explained, and SP, the segregation power. Our results extend and generalize the ORDANOVA method developed by Gadrich and Bashkansky (2012) for categorical ordinal variables.

21 THANK YOU FOR YOUR ATTENTION!

22 VARIATION DEFINITION
Nominal: $IQV = \dfrac{K}{K-1} \Big(1 - \sum_{k} \hat{p}_k^2\Big)$
Ordinal: $\hat{h} = \dfrac{4}{K-1} \sum_{k} \hat{F}_k (1 - \hat{F}_k)$

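23 CATANOVA (CATEGORICAL DATA ANALYSIS OF VARIATION) DECOMPOSITION
M samples.
Within:
$$IQV_W = \frac{1}{M} \sum_{m=1}^{M} IQV_{WITHIN}^{(m)} = \frac{1}{M} \sum_{m=1}^{M} \frac{K}{K-1} \Big[1 - \sum_{k} \hat{p}_{km}^2\Big]$$
Between:
$$IQV_B = \frac{K}{K-1} \sum_{k} \Big[\frac{1}{M} \sum_{m=1}^{M} \hat{p}_{km}^2 - \hat{p}_k^2\Big]$$
$\hat{p}_{km}$: the frequency of data belonging to the k-th category in the m-th sample.
Total variation:
$$IQV_T = \frac{K}{K-1} \Big(1 - \sum_{k} \hat{p}_k^2\Big)$$
$\hat{p}_k$: the total frequency of items belonging to the k-th category.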

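24 ORDANOVA (ORDINAL DATA ANALYSIS OF VARIATION) DECOMPOSITION
M ordinal samples of the same size n.
Within (within the m-th sample):
$$\hat{h}_W = \frac{1}{M} \sum_{m=1}^{M} \hat{h}_m, \qquad \hat{h}_m = \frac{\sum_{k} \hat{F}_{km} (1 - \hat{F}_{km})}{(K-1)/4}$$
Between (between samples for every k-th category):
$$\hat{S}_B = \frac{\frac{1}{M} \sum_{m=1}^{M} \sum_{k} (\hat{F}_{km} - \hat{F}_{k\cdot})^2}{(K-1)/4}$$
$\hat{F}_{km}$: the cumulative frequency of data belonging up to the k-th category in the m-th sample.
Total variation:
$$\hat{h}_T = \frac{\sum_{k} \hat{F}_{k\cdot} (1 - \hat{F}_{k\cdot})}{(K-1)/4}, \qquad \hat{F}_{k\cdot} = \frac{1}{M} \sum_{m=1}^{M} \hat{F}_{km}$$
$\hat{F}_{k\cdot}$: the total cumulative frequency of items belonging up to the k-th category.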
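The ORDANOVA decomposition for equal-size samples can be sketched in a few lines. The code assumes the normalized forms in which every term is divided by (K - 1)/4 and the pooled cumulative frequency is the plain average of the per-sample cumulative frequencies. The counts below are hypothetical (three samples of 200 items in four categories) and are not the data of the worked example in the slides.

```python
def cumulative(counts):
    """Cumulative relative frequencies F_k for one sample of counts."""
    n = sum(counts)
    running, out = 0, []
    for c in counts:
        running += c
        out.append(running / n)
    return out


def h_index(F):
    """Normalized ordinal variation computed from cumulative frequencies."""
    k = len(F)
    # The last cumulative frequency equals 1 and contributes nothing.
    return sum(f * (1 - f) for f in F[:-1]) / ((k - 1) / 4)


# Hypothetical data: M = 3 samples, n = 200 items each, K = 4 categories.
samples = [
    [80, 60, 40, 20],
    [40, 70, 60, 30],
    [20, 50, 70, 60],
]
M = len(samples)
K = len(samples[0])

F_by_sample = [cumulative(s) for s in samples]
# Equal sample sizes: pooled cumulative frequency is the simple average.
F_pooled = [sum(F[k] for F in F_by_sample) / M for k in range(K)]

h_total = h_index(F_pooled)
h_within = sum(h_index(F) for F in F_by_sample) / M
s_between = (
    sum((F[k] - F_pooled[k]) ** 2 for F in F_by_sample for k in range(K - 1))
    / M
    / ((K - 1) / 4)
)
print(h_total, h_within, s_between)
```

The identity behind the check is that, for each category, the mean of F(1 - F) over samples plus the variance of F across samples equals the pooled F times one minus the pooled F.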

25 ORDANOVA DECOMPOSITION EXAMPLE (1)
Given M = 3 samples, size n = 200, total N = 600 items.
Classifying according to K = 4 categories.
Samples data: [table of counts by category and sample (samples 1, 2, 3 and total); numerical entries not recoverable]

26 ORDANOVA DECOMPOSITION EXAMPLE (2)
Cumulative frequency up to the k-th category within the m-th sample (k = 1, 2, 3, 4; m = 1, 2, 3). Last column: the total cumulative frequency $\hat{F}_{k\cdot}$ of items belonging up to the k-th category.
[Table of cumulative frequencies by category and sample; numerical entries not recoverable]

27 ORDANOVA DECOMPOSITION EXAMPLE (3)
[Same table, followed by the total variation $\hat{h}_T$; numerical value not recoverable]

28 ORDANOVA DECOMPOSITION EXAMPLE (4)
Dispersion within the m-th sample: $\hat{h}_{mW}$, and their average $\hat{h}_W$.
[Same table, followed by $\hat{h}_{1W}$, $\hat{h}_{2W}$, $\hat{h}_{3W}$ and $\hat{h}_W$; numerical values not recoverable]

29 ORDANOVA DECOMPOSITION EXAMPLE (5)
Classic variation between the samples for the k-th category: $\hat{S}_{kB}$, and their total $\hat{S}_B$.
[Same table, followed by $\hat{S}_{1B}$, $\hat{S}_{2B}$, $\hat{S}_{3B}$ and $\hat{S}_B$; numerical values not recoverable]

30 ORDANOVA DECOMPOSITION EXAMPLE (6)
$$\hat{h}_T = \hat{h}_W + \hat{S}_B$$
[Same table, followed by the numerical check of the decomposition; values not recoverable]

31 DISTINGUISHING STATISTIC FOR ORDINAL DATA
Items are measured according to an ordinal scale with K categories. M samples of equal size n are drawn (N = M n).
Were all samples drawn from the same population characterized by $p_1, p_2, \ldots, p_K$, or not?
Under $H_0$ (multinomial distribution):
$$\frac{E(\text{between variation})}{df_B} = \frac{E(\text{within variation})}{df_W} = \frac{E(\text{total variation})}{df_T}$$
where $df_{TOTAL} = N - 1$, $df_{WITHIN} = M(n - 1)$, $df_{BETWEEN} = M - 1$;
in other words: $E(MS_B) = E(MS_W) = E(MS_T)$.

32 DISTINGUISHING STATISTIC
$$SP = \frac{MS_B}{MS_T}$$
SP can be asymptotically approximated by
$$\frac{\chi^2_{(M-1)(K-1)}}{(M-1)(K-1)}$$
with $(M-1)(K-1)$ degrees of freedom. The quantiles (95%, for example) of this distribution may be used for hypothesis checking:
$$I_{cr} = \frac{\chi^2_{df,\,0.95}}{df}$$

33 DISTINGUISHING FACTOR IDENTIFICATION Data can be divided (segregated) according to various types of factors. For each segregation, calculate the indicator I. The best segregating factor is the one for which the indicator is the largest.

34 DISTRIBUTION OF ACADEMIC DEGREE HOLDERS BY ORDINAL DEGREE LEVEL (1ST DEGREE, 2ND DEGREE, 3RD DEGREE)
First case: according to age.
[Table: number of degree holders by age group and degree level (undergraduate (1st) degree, graduate (2nd) degree, doctoral (3rd) degree, total), with the resulting SP value; numerical entries not recoverable]

35 DISTRIBUTION OF ACADEMIC DEGREE HOLDERS BY ORDINAL DEGREE LEVEL (1ST DEGREE, 2ND DEGREE, 3RD DEGREE)
Second case: according to religion and origin/ethnic group.
[Table: number of degree holders by group (Jews born in Israel, Jews born abroad, Moslems, Christians, Druze, others) and degree level (undergraduate (1st) degree, graduate (2nd) degree, doctoral (3rd) degree, total), with the resulting SP value; numerical entries not recoverable]
Age is a much more significant distinguishing/segregating factor than religion and origin/ethnic group.

36 EXAMPLE: IDENTIFY THE DISTINGUISHING FACTOR
Distribution of faculty by ordinal academic ranks (lecturer, senior lecturer, associate professor, full professor) in five different types of higher educational institutions.
[Table: number of positions by academic rank (lecturer, senior lecturer, associate professor, full professor) and institution type (Type 1 to Type 5), with the resulting indicator ratio; numerical entries not recoverable]

37 EXAMPLE: IDENTIFY THE DISTINGUISHING FACTOR
In order to find the outlier, use the jackknife procedure.
[Table: SP value for each option number; numerical entries not recoverable]
