A Study into Mechanisms of Attitudinal Scale Conversion: A Randomized Stochastic Ordering Approach


1 A Study into Mechanisms of Attitudinal Scale Conversion: A Randomized Stochastic Ordering Approach
Zvi Gilula (Hebrew University), Robert McCulloch (Arizona State), Ya'acov Ritov (University of Michigan), Oleg Urminsky (University of Chicago)

2 Outline
1. The Problem: Scale Conversion
2. Minimum conditional entropy conversion algorithm
3. MCE and Our Survey Data
4. Mixture Models Based on the MCE Algorithm
5. Mixture Model Inference
6. Conclusion

3 1. The Problem: Scale Conversion
A huge amount of survey data is collected in the form: "How much do you like it?" Respond with $Y \in \{1, 2, \ldots, N\}$, where 1 means you hate it and $N$ means you love it. Satisfaction: the bigger $Y$ is, the more you like it. We have ordered categorical data. By scale, we mean the choice of $N$.

4 You could measure satisfaction on two different scales: $Y_r \in \{1, 2, \ldots, R\}$ and $Y_c \in \{1, 2, \ldots, C\}$. How are $Y_r$ and $Y_c$ related? How do you CONVERT a $Y_r$ to a $Y_c$?

5 Example: $R = 5$, $C = 11$. "How satisfied are you with the quality of life in your city of residence?" If you were in a hurry: $Y_r = 4 \Rightarrow (4/5) \times 11 = 8.8 \Rightarrow Y_c = 9$. But that is not very satisfying.
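The quick-and-dirty conversion on this slide can be written out as a few lines of Python. This is only a sketch of the proportional rescaling in the example; the function name `naive_convert` and the clamping to the range 1..C are assumptions added for illustration, not part of the talk.

```python
def naive_convert(y_r: int, R: int, C: int) -> int:
    """Rescale a response on a 1..R scale to a 1..C scale by proportion."""
    scaled = y_r / R * C          # e.g. (4/5) * 11 = 8.8
    y_c = round(scaled)           # nearest integer category
    return min(max(y_c, 1), C)    # assumption: keep the result inside 1..C


print(naive_convert(4, 5, 11))    # prints 9, as on the slide
```

Note that this rule is deterministic and ignores how respondents actually use each scale, which is why the slide calls it unsatisfying.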

6 Scale Conversion: Of Practical Importance
It is not uncommon to have two separate surveys (about the same thing) on different scales, perhaps with demographic information $x$:
Data set 1 on the R scale: $\{y_{ir}, x_i\}$, $y_{ir} \in \{1, 2, \ldots, R\}$.
Data set 2 on the C scale: $\{y_{ic}, x_i\}$, $y_{ic} \in \{1, 2, \ldots, C\}$.
You want to combine the information. For example, Zvi has encountered this in practice, where firms have data on the same attribute collected by different consulting companies using different scales.

7 Note: It is not so easy to choose the scale:
Too few choices give too coarse a grid, and information is lost.
Too many choices confuse the respondents, introducing noise.
As a result, researchers actively consider and debate how many scale categories to use. A firm may change its choice of scale but wish to combine data on different scales (assuming no change in the underlying population).

8 Scale Conversion: Of General Interest
I think it is just interesting to investigate how the different measurements are related. Obviously, we hope $Y_r$ and $Y_c$ are dependent. How dependent are they? Is the level of dependence related to characteristics of the respondent?

9 2. Minimum conditional entropy conversion algorithm
Our basic approach to scale conversion is to construct a joint distribution for $(Y_r, Y_c)$. Then (stochastic) scale conversion $Y_r \to Y_c$ is simply obtained from the conditionals $p(Y_c = y_c \mid Y_r = y_r)$. Similarly, we can convert the other way around: $Y_c \to Y_r$.
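To make "conversion from the conditionals" concrete, here is a minimal sketch of stochastic conversion given any joint table with $Y_r$ on the rows and $Y_c$ on the columns. The function names and the small `toy_joint` table are hypothetical, invented for illustration; they are not the survey data from the talk.

```python
import random


def conditional_c_given_r(joint, y_r):
    """Return p(Y_c = . | Y_r = y_r) by normalizing row y_r (1-based)."""
    row = joint[y_r - 1]
    total = sum(row)              # marginal p(Y_r = y_r); assumed > 0
    return [p / total for p in row]


def convert_r_to_c(joint, y_r, rng):
    """Draw one converted response Y_c from the conditional given Y_r."""
    probs = conditional_c_given_r(joint, y_r)
    categories = list(range(1, len(probs) + 1))
    return rng.choices(categories, weights=probs, k=1)[0]


# Hypothetical 2 x 3 joint table, for illustration only.
toy_joint = [
    [0.30, 0.10, 0.00],
    [0.00, 0.20, 0.40],
]
rng = random.Random(0)
print(conditional_c_given_r(toy_joint, 1))
print(convert_r_to_c(toy_joint, 2, rng))
```

Converting the other way only requires normalizing a column instead of a row.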

10 But note: We want to consider the case where we only have the data $\{y_{ir}, x_i\}$, $\{y_{ic}, x_i\}$, or just $\{y_{ir}\}$, $\{y_{ic}\}$. We may not have data $\{y_{ir}, y_{ic}, x_i\}$. That is, in practice, we may not observe $Y_r$ and $Y_c$ jointly.

11 Given just this data, we can estimate the marginals $p(Y_r = y_r)$ and $p(Y_c = y_c)$. But how do we construct a joint $p(Y_r = y_r, Y_c = y_c)$? We need a copula.

12 Obviously we could use independence: $p(Y_r = y_r, Y_c = y_c) = p(Y_r = y_r)\, p(Y_c = y_c)$, but that rules out the very thing we want to study: dependence. The key to this paper is a simple, ingenious algorithm (due to Zvi!) which constructs a maximally dependent joint from the marginals. The algorithm is called MCE (Minimal Conditional Entropy) because the entropy of the conditionals is minimized.

13 The MCE Algorithm
Rather than stating the algorithm, let's work through a simple example. Let $R = 3$ and $C = 5$. We put $Y_r$ on the rows and $Y_c$ on the columns (hence the names) of the two-way table representation of the joint distribution.
Suppose $P(Y_r = y_r) = \{.3, .53, .17\}$ for $y_r = 1, 2, 3$.
Suppose $P(Y_c = y_c) = \{.2, .15, .25, .3, .1\}$ for $y_c = 1, 2, 3, 4, 5$.

14 As usual, the table represents the joint probabilities. We initialize the table with 0's. At each iteration we will move probability into the table from the marginals in such a way as to maximize the dependence. At our first iteration, .2 is the most probability we can put in the (1,1) position given the marginals: $.2 = \min(.2, .3)$.

15 Key to the MCE algorithm is that the final table will have the marginals we start with. The grey area in the figure shows the marginal probability we still have left to move into the table. Next we can move $.1 = \min(.15, .1)$ into the (1,2) position: .15 is the second column marginal and .1 is what is left of the first row marginal.

16 Table after two iterations: Now we need to move on to the second row, and we can move $.05 = \min(.05, .53)$ into the (2,2) position.

17 The next several iterations:

18 The final table: Looks pretty dependent!! Has the correct marginals.
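The walk-through above can be sketched in code. The function below is a reading of the worked example only (start at the top-left cell, move the largest mass the remaining row and column marginals allow, then step to the next row or column); it is not the formal statement of the algorithm from the paper, and the name `mce_joint` and the tolerance `tol` are assumptions made for this sketch.

```python
def mce_joint(p_r, p_c, tol=1e-12):
    """Greedy fill from the worked example: returns an R x C joint table."""
    r = list(p_r)                       # row mass still to be placed
    c = list(p_c)                       # column mass still to be placed
    joint = [[0.0] * len(c) for _ in r]
    i = j = 0
    while i < len(r) and j < len(c):
        moved = min(r[i], c[j])         # most mass cell (i, j) can take
        joint[i][j] += moved
        r[i] -= moved
        c[j] -= moved
        row_done = r[i] <= tol          # row marginal exhausted
        col_done = c[j] <= tol          # column marginal exhausted
        if row_done:
            i += 1
        if col_done:
            j += 1
    return joint


table = mce_joint([0.3, 0.53, 0.17], [0.2, 0.15, 0.25, 0.3, 0.1])
for row in table:
    print([round(p, 2) for p in row])
# [0.2, 0.1, 0.0, 0.0, 0.0]
# [0.0, 0.05, 0.25, 0.23, 0.0]
# [0.0, 0.0, 0.0, 0.07, 0.1]
```

The first three moves reproduce the values on the slides (.2, .1, .05), and the row and column sums of the result equal the starting marginals.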

19 Given marginals, the MCE algorithm gives us a joint. Zvi derives the algorithm by requiring the following properties of the joint:
The margins from the joint are the given marginals.
The row conditionals $P_j = (Y_r \mid Y_c = j)$ are such that $P_k$ is stochastically larger than $P_l$ for $k > l$.
Each $P_j$ has minimal entropy (the distribution is tight).
The same holds for the column conditionals.
But it is very intuitive that it loads up the diagonal while preserving the marginals.
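The stochastic ordering property can be checked numerically. The sketch below tests whether the conditionals $(Y_r \mid Y_c = j)$ are (weakly) stochastically increasing in $j$ by comparing their CDFs. The table `example_table` is the result of applying the greedy fill to the marginals of the $R = 3$, $C = 5$ example; the function names are assumptions made for illustration.

```python
def conditionals_given_column(joint):
    """Return the list of distributions (Y_r | Y_c = j), one per column j."""
    n_rows, n_cols = len(joint), len(joint[0])
    conds = []
    for j in range(n_cols):
        col = [joint[i][j] for i in range(n_rows)]
        total = sum(col)                 # p(Y_c = j); assumed > 0
        conds.append([p / total for p in col])
    return conds


def cdf(p):
    """Cumulative sums of a probability vector."""
    out, acc = [], 0.0
    for x in p:
        acc += x
        out.append(acc)
    return out


def is_stochastically_increasing(conds, tol=1e-12):
    """True if each distribution is stochastically >= the previous one,
    i.e. its CDF is pointwise <= the previous CDF."""
    cdfs = [cdf(p) for p in conds]
    return all(
        later[k] <= earlier[k] + tol
        for earlier, later in zip(cdfs, cdfs[1:])
        for k in range(len(earlier))
    )


# Joint table from the R = 3, C = 5 worked example.
example_table = [
    [0.20, 0.10, 0.00, 0.00, 0.00],
    [0.00, 0.05, 0.25, 0.23, 0.00],
    [0.00, 0.00, 0.00, 0.07, 0.10],
]
print(is_stochastically_increasing(conditionals_given_column(example_table)))
```

Comparing only adjacent columns is enough, since the ordering is transitive.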

20 3. MCE and Our Survey Data
To explore these ideas we collected a survey where we have $\{y_{ir}, y_{ic}, x_i\}$. We will:
See how methods that use just the marginals estimate the joint.
Use the joint data to estimate models based on the MCE.
The marginal data we looked at before is from our survey.

21 We asked three questions. Mostly we will look at results for the question: "How satisfied are you with the quality of life in your city of residence?" On the same date all individuals were approached (for the first time) and were asked to respond to the 3 questions. Seven days later the same individuals were contacted again and were asked the same 3 questions. The individuals were not informed that they would be approached a week later. When re-approached, the interviewees were not given any information on their first ratings. The second time, they may be asked to respond on a different scale!!! We used a professional survey company.

22 From our survey we do have an estimate of the joint (just the bivariate frequencies), and here are the conditionals $Y_c \mid Y_r$. Clearly, there is dependence.

23 We now consider three estimates of the joint using only the marginals:
The MCE table.
The independence table.
.5*MCE + .5*Independence.
We just plot the 5*11 = 55 joint probabilities from an estimate versus the joints obtained from the sample, and report the sum of absolute errors, $\mathrm{SAE} = \sum_{i,j} |\hat{p}_{ij} - p_{ij}|$. MCE is good. Mixture is killer.
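As a rough sketch of this comparison, the code below builds the three marginal-only estimates and scores each against a reference joint with the SAE. The `observed` table is a small made-up example, not the survey data, so the printed numbers say nothing about the results in the talk; `mce_joint` is the greedy fill as read from the worked example, and all function names are assumptions.

```python
def mce_joint(p_r, p_c, tol=1e-12):
    """Greedy fill as read from the worked example in the talk."""
    r, c = list(p_r), list(p_c)
    joint = [[0.0] * len(c) for _ in r]
    i = j = 0
    while i < len(r) and j < len(c):
        moved = min(r[i], c[j])
        joint[i][j] += moved
        r[i] -= moved
        c[j] -= moved
        row_done, col_done = r[i] <= tol, c[j] <= tol
        i += row_done
        j += col_done
    return joint


def independence_joint(p_r, p_c):
    """Product of the marginals."""
    return [[a * b for b in p_c] for a in p_r]


def mix(p_m, p_i, w=0.5):
    """Cell-wise mixture w * MCE + (1 - w) * independence."""
    return [[w * m + (1 - w) * q for m, q in zip(rm, ri)]
            for rm, ri in zip(p_m, p_i)]


def sae(est, ref):
    """Sum of absolute errors over all cells."""
    return sum(abs(e - r) for re_, rr in zip(est, ref)
               for e, r in zip(re_, rr))


# Made-up 2 x 3 "observed" joint, for illustration only.
observed = [[0.25, 0.15, 0.05],
            [0.05, 0.15, 0.35]]
p_r = [sum(row) for row in observed]
p_c = [sum(row[j] for row in observed) for j in range(len(observed[0]))]

p_m = mce_joint(p_r, p_c)
p_i = independence_joint(p_r, p_c)
p_mix = mix(p_m, p_i, w=0.5)
for name, est in [("MCE", p_m), ("independence", p_i), ("mixture", p_mix)]:
    print(name, round(sae(est, observed), 4))
```

Because both components have the given marginals, any mixture of them does too, so all three estimates are valid marginal-only reconstructions.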

24 To get a feeling for the different fitted joint distributions of $(Y_r, Y_c)$, let's look at the conditionals. Conditionals of $Y_c \mid Y_r = r$ for $r = 3, 4, 5$, from: data (black), MCE (blue), mixture (red). We can see that the MCE is too concentrated. The mixture does very well.

25 4. Mixture Models Based on the MCE Algorithm
The mixture approach seemed to work very well. Intuitively, it is like any shrinkage approach, but the context is kind of cool. When I ask you on the 11-point scale, are you more like yourself on the 5-point scale or like the other people (the marginal) on the 11-point scale? Or: can I reduce the noise in your imprecise answers by shrinking towards everyone else? We make up models using the MCE algorithm and the mixture idea and estimate them using the joint data. For example, is .5 really the best mixture weight?

26 Let $p_r$ and $p_c$ denote the $Y_r$ and $Y_c$ marginals. Let $p_M$ denote the joint obtained from the marginals using the MCE algorithm. Let $p_I$ denote the joint obtained from the marginals using independence. Our basic mixture model is:
$p(Y_r = y_r, Y_c = y_c \mid p_r, p_c, w) = w \, p_M(Y_r = y_r, Y_c = y_c \mid p_r, p_c) + (1 - w) \, p_I(Y_r = y_r, Y_c = y_c \mid p_r, p_c)$.
We can also let the mixture weight depend on demographics $x$:
$p(Y_r = y_r, Y_c = y_c \mid p_r, p_c, \theta, x) = p(Y_r = y_r, Y_c = y_c \mid p_r, p_c, w = F(x'\theta))$,
with $F(\eta) = \exp(\eta) / (1 + \exp(\eta))$.
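The log-likelihood implied by this model for jointly observed responses can be sketched as follows. The sketch assumes the weight enters through a linear predictor, $w_i = F(x_i'\theta)$, and takes the two joint tables as given inputs; the function names, the toy tables, and the toy data are all hypothetical and for illustration only.

```python
import math


def logistic(eta):
    """F(eta) = exp(eta) / (1 + exp(eta)), written in an equivalent form."""
    return 1.0 / (1.0 + math.exp(-eta))


def mixture_prob(y_r, y_c, p_m, p_i, w):
    """w * p_M + (1 - w) * p_I for one cell (1-based categories)."""
    return w * p_m[y_r - 1][y_c - 1] + (1.0 - w) * p_i[y_r - 1][y_c - 1]


def log_likelihood(data, p_m, p_i, theta):
    """Sum of log mixture probabilities over respondents.

    data: iterable of (y_r, y_c, x) with x a list of covariates.
    Assumption: the weight is w_i = F(x_i' theta).
    """
    total = 0.0
    for y_r, y_c, x in data:
        eta = sum(t * xk for t, xk in zip(theta, x))
        total += math.log(mixture_prob(y_r, y_c, p_m, p_i, logistic(eta)))
    return total


# Hypothetical 2 x 3 tables with the same marginals, and toy data.
p_m = [[0.30, 0.15, 0.00], [0.00, 0.15, 0.40]]
p_i = [[0.135, 0.135, 0.18], [0.165, 0.165, 0.22]]
toy_data = [(1, 1, [1.0, 0.2]), (2, 3, [1.0, -0.5]), (1, 3, [1.0, 1.1])]
print(log_likelihood(toy_data, p_m, p_i, theta=[0.0, 0.0]))
```

Setting every component of `theta` to zero gives $w = .5$ for everyone, which recovers the simple 50/50 mixture.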

27 We obtain MCMC draws from the posteriors $(p_r, p_c, \theta \mid y_r, y_c, x)$, $(p_r, p_c, w \mid y_r, y_c)$, and $(p_r, p_c \mid y_r, y_c)$, where the third model sets $w = 1$, so we are just applying the MCE algorithm to the marginals.

28 Note that the likelihood is highly nonstandard because of the MCE algorithm. This is a nice example of relatively simple Bayesian ideas solving a nonstandard inference problem. We use the obvious Gibbs sampler: $p_r \mid p_c, \theta$; $p_c \mid p_r, \theta$; $\theta \mid p_r, p_c$.
$\theta \mid p_r, p_c$: griddy Gibbs sampler on each component of $\theta$ (after we standardize the $x$'s).
$p_r \mid p_c, \theta$ and $p_c \mid p_r, \theta$: independence proposal Metropolis-Hastings.
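A griddy Gibbs update for one scalar component can be sketched as below: evaluate the unnormalized log conditional posterior on a grid, normalize, and sample a grid point. This is a simplified discrete version of the technique; the grid, the toy log posterior, and the function name `griddy_draw` are assumptions for illustration and do not reflect the settings used in the talk's C++ code.

```python
import math
import random


def griddy_draw(log_post, grid, rng):
    """Sample one grid value with probability proportional to exp(log_post)."""
    logs = [log_post(v) for v in grid]
    top = max(logs)                               # subtract max for stability
    weights = [math.exp(l - top) for l in logs]
    return rng.choices(grid, weights=weights, k=1)[0]


# Toy unnormalized log conditional posterior for one component of theta.
def toy_log_post(v):
    return -0.5 * (v - 0.3) ** 2 / 0.04

grid = [k / 20.0 for k in range(-40, 41)]         # grid on [-2, 2]
rng = random.Random(0)
print(griddy_draw(toy_log_post, grid, rng))
```

In a full sampler, `log_post` would hold the other components of $\theta$ and the marginals fixed at their current values and be re-evaluated at every sweep.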

29 Proposal: By the margin-preserving property we have $P(Y_r \mid p_r, p_c, w) = p_r$. We propose draws from the marginal model $p(y_r \mid p_r)\, p(p_r)$, where $p(p_r)$ is the standard Dirichlet prior, which is also our actual prior. Same for $p_c$. Coded in C++.
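One way to read this proposal is as an independence Metropolis-Hastings step whose proposal is the Dirichlet posterior of the marginal multinomial model. Because the prior is shared by target and proposal, it cancels, and the acceptance ratio only involves the full likelihood divided by the marginal likelihood. The Python sketch below follows that reading; it is not the authors' C++ code, and `log_full_lik` is a hypothetical user-supplied function standing in for the mixture-model log-likelihood as a function of $p_r$.

```python
import math
import random


def draw_dirichlet(alpha, rng):
    """Dirichlet draw via normalized gamma variates."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]


def log_marginal_lik(p, counts):
    """Multinomial log-likelihood of the marginal counts (up to a constant)."""
    return sum(n * math.log(q) for n, q in zip(counts, p) if n > 0)


def imh_step(p_current, counts, prior_alpha, log_full_lik, rng):
    """One independence MH update for a marginal probability vector.

    Proposal: Dirichlet(prior_alpha + counts), the marginal-model posterior.
    Returns (new value, accepted flag).
    """
    post_alpha = [a + n for a, n in zip(prior_alpha, counts)]
    p_prop = draw_dirichlet(post_alpha, rng)
    # log of [L_full(p') / L_marg(p')] / [L_full(p) / L_marg(p)]
    log_ratio = (
        (log_full_lik(p_prop) - log_marginal_lik(p_prop, counts))
        - (log_full_lik(p_current) - log_marginal_lik(p_current, counts))
    )
    accept = log_ratio >= 0 or rng.random() < math.exp(log_ratio)
    return (p_prop, True) if accept else (p_current, False)


# Toy check: if the full likelihood equals the marginal one, the proposal
# is exact and every draw is accepted.
counts = [3, 5, 2]
rng = random.Random(0)
p_new, accepted = imh_step([1 / 3] * 3, counts, [1.0, 1.0, 1.0],
                           lambda p: log_marginal_lik(p, counts), rng)
print(accepted)
```

The closer the full likelihood is to the marginal one as a function of $p_r$, the higher the acceptance rate, which is the appeal of proposing from the marginal model.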

30 Thinned marginal draws of a marginal probability.

31 Note: We also implemented a version with multinomial models $p_r = F(x, \theta_r)$ and $p_c = F(x, \theta_c)$. A good idea, but with our data it was a wash, and it obviously explodes the number of parameters to be estimated.

32 5. Mixture Model Inference
Inference for $\theta$: $w = F(x'\theta)$. The evidence for predictor-dependent weights is not definitive. However, the suggestion that older females, not retired or unemployed, have more highly dependent responses is interesting.

33 Inference for $w$, the mixture weight: $w \approx .6$, with interval $(.5, .7)$.

34 Estimate of the joint based on the full Bayesian predictive. We average over MCMC draws, and over $x$ when $\theta$ is in the model. Top left: MCE predictive ($(p_r, p_c)$ is the parameter) vs. data estimate. Top right: general mixture model. Bottom: plug-in estimates vs. Bayesian predictive.

35 Conditionals from the MCE algorithm, simple-mixture algorithm, and the joint data. The simple mixture model works pretty well!!

36 Posterior inference for a specific conditional probability from various models. Dashed line at the estimate directly from the joint data. Again, the mixture does very well; MCE misses.

37 6. Conclusion
Seems like a nice problem that came directly out of Zvi's consulting.
Zvi's MCE algorithm is a very nice copula for ordinal categorical data. We have a lot of ordinal categorical data!!!
The mixture model really does very well!!
The Bayesian approach works very well.
We have also done more complex $P(Y = y \mid x)$ modelling (multinomial), but in this data it was a wash.

38 Note: An alternative headspace would be to imagine there is a real-valued latent satisfaction that gets truncated by scale-dependent cutoffs. This is kind of appealing. There are pros and cons between this approach and ours. I like ours because it is simple.


Quantitative Marketing and Economics https://doi.org/10.1007/s11129-019-09209-3 A study into mechanisms of attitudinal scale conversion: A randomized stochastic ordering approach Zvi Gilula 1 & Robert