Advanced Quantitative Research Methodology Lecture Notes: Ecological Inference

Gary King

January 28, 2012

© Copyright 2008 Gary King, All Rights Reserved.

Reading

Gary King. A Solution to the Ecological Inference Problem: Reconstructing Individual Behavior from Aggregate Data. Princeton University Press, 1997.

Preliminaries

Definition: Ecological inference is the process of using aggregate (i.e., "ecological") data to infer discrete individual-level relationships of interest when individual-level data are not available.

History of the Problem:

1. Ogburn and Goltra (1919), in the very first multivariate statistical analysis of politics in a political science journal, made ecological inferences and recognized the problem. The big issue in 1919: are the newly enfranchised women going to take over the political system? They regressed votes in referenda in Oregon precincts on the percent of women in each precinct. But they worried:

   "It is also theoretically possible to gerrymander the precincts in such a way that there may be a negative correlative even though men and women each distribute their votes 50 to 50 on a given measure..." (Ogburn and Goltra, 1919).

2. Robinson (1950) clarified the problem, causing:
   (a) several literatures to wither, including studies of local and regional politics through aggregate electoral statistics, in favor of survey research based on national samples.
   (b) the development of a methodological literature devoted to solving the problem.
3. Hundreds of other articles have helped us understand the problem.

History of Solutions: A 45-year war between supporters of

1. Duncan and Davis (1953): a deterministic solution.
2. Goodman (1953, 1959): a statistical solution.
3. For 50 years, no other methods were used in applications.

If you can avoid making ecological inferences, do so!

Some of those who aren't so lucky:

1. Public policy: Applying the Voting Rights Act.
2. History: Who voted for the Nazis?
3. Marketing: What types of people buy your products?
4. Banking: Are banks complying with red-lining laws? Are there areas with certain types of people who might take out loans but have not?
5. Candidates for office: How do good representatives decide what policies they should favor? How can candidates tailor campaign appeals and target voter groups?
6. Sociology: Do the unemployed commit more crimes or is it just that there are more crimes in unemployed areas?
7. Economics: With some exceptions, most theories are based on assumptions about individuals, but most data are on groups.
8. Education: Do students who attend private schools through a voucher system do as well as students who can afford to attend on their own?
9. Atmospheric physics: How can we tell which types of the vehicles actually on the roads emit more carbon dioxide and carbon monoxide?
10. Oceanography: How many marine organisms of a certain type were collected at a given depth, from fishing nets dropped from the surface down through a variety of depths?
11. Epidemiology: Does radon cause lung cancer?
12. Changes in public opinion: How to use repeated independent cross-sectional surveys to measure individual change?

The Problem: The District Level

Race of Voting              Voting Decision
Age Person       Democrat   Republican   No vote     Total
black                ?          ?           ?        55,054
white                ?          ?           ?        25,706
Total             19,896     10,936      49,928      80,760

The Ecological Inference Problem at the District Level: The 1990 Election to the Ohio State House, District 42. The goal is to infer from the marginal entries (each of which is the sum of the corresponding row or column) to the cell entries. (Note the information in the bounds.)
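The remark about information in the bounds can be made concrete with a short sketch. The helper name `cell_bounds` is hypothetical and not from the lecture; it applies the usual deterministic logic that a cell can be no larger than either of its margins and no smaller than what the other rows cannot absorb, using the District 42 margins quoted above.

```python
# Hypothetical helper (not from the lecture): deterministic bounds on one
# cell of a table when only the row and column totals are known.

def cell_bounds(row_total, col_total, grand_total):
    """Return (lower, upper) bounds on a cell count given its margins."""
    # Whatever part of the column cannot fit in the other rows must be here.
    lower = max(0, col_total - (grand_total - row_total))
    # The cell cannot exceed either of its two margins.
    upper = min(row_total, col_total)
    return lower, upper

# Margins from the District 42 table.
rows = {"black": 55054, "white": 25706}
cols = {"Democrat": 19896, "Republican": 10936, "No vote": 49928}
grand = sum(rows.values())

bounds = {
    (race, choice): cell_bounds(r, c, grand)
    for race, r in rows.items()
    for choice, c in cols.items()
}

for (race, choice), (lo, hi) in bounds.items():
    print(f"{race:5s} {choice:10s} between {lo:6d} and {hi:6d}")
```

Under these margins, most cells are bounded below only by zero, but the number of black nonvoters must be at least 24,222, because the 25,706 white residents cannot account for all 49,928 nonvoters.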

The Problem: The Precinct Level

Race of Voting              Voting Decision
Age Person       Democrat   Republican   No vote     Total
black                ?          ?           ?          221
white                ?          ?           ?

The Ecological Inference Problem at the Precinct Level: Precinct P in District 42 (1 of 131 in the district). The goal is to infer from the margins of a set of tables like this one to the cell entries in each.

The best we could do, circa 1996

[Table columns: Year; District; Estimated Percent of Blacks Voting for the Democratic Candidate]

Sample Ecological Inferences: All Ohio State House districts where an African American Democrat ran against a white Republican (Source: Statement of Gordon G. Henderson, presented as an exhibit in federal court, using Goodman's regression). Figures above 100% are logically impossible.

What Information Does the New Method Provide?

Goodman's Method: One incorrect number (5 standard deviations outside the deterministic bounds).

The New Method:

[Figure: Non-minority Turnout in New Jersey Cities and Towns.] In contrast to the best existing methods, which provide one (incorrect) number for the entire state, the method offered here gives an accurate estimate of white turnout for all 567 minor civil divisions in the state, a few of which are labeled.

Notation

             Vote            No vote
black     $\beta_i^b$    $1 - \beta_i^b$    $X_i$
white     $\beta_i^w$    $1 - \beta_i^w$    $1 - X_i$
             $T_i$          $1 - T_i$

Notation for Precinct $i$ ($i = 1, \ldots, p$).

Observed variables:
$T_i$ = voter Turnout in precinct $i$
$X_i$ = Black proportion of Voting Age Population in precinct $i$

Unobserved quantities of interest:
$\beta_i^b$ = fraction of blacks who vote in precinct $i$
$\beta_i^w$ = fraction of whites who vote in precinct $i$

An accounting identity (a fact, not an assumption):

$T_i = \beta_i^b X_i + \beta_i^w (1 - X_i) = \beta_i^w + (\beta_i^b - \beta_i^w) X_i$

Goodman's regression: Run a regression of $T_i$ on $X_i$ and $(1 - X_i)$ (no constant term). Coefficients are intended to be:

$B^b$, district-wide black turnout
$B^w$, district-wide white turnout
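A minimal sketch of Goodman's regression may help fix ideas. It solves the two normal equations for a least-squares fit of $T_i$ on $X_i$ and $(1 - X_i)$ with no constant term. The function name `goodman_regression` and the five precincts are hypothetical, and the data are built so that the parameters really are constant across precincts, which is the case in which the regression works as intended.

```python
def goodman_regression(T, X):
    """Least squares of T on X and (1 - X) with no constant term.

    Returns (B_b, B_w), the coefficients on X and on (1 - X).
    """
    a = sum(x * x for x in X)                    # sum of X^2
    b = sum(x * (1 - x) for x in X)              # sum of X(1 - X)
    c = sum((1 - x) ** 2 for x in X)             # sum of (1 - X)^2
    d = sum(x * t for x, t in zip(X, T))         # sum of X * T
    e = sum((1 - x) * t for x, t in zip(X, T))   # sum of (1 - X) * T
    det = a * c - b * b                          # determinant of the 2x2 system
    B_b = (c * d - b * e) / det
    B_w = (a * e - b * d) / det
    return B_b, B_w

# Hypothetical precincts in which turnout rates are constant (an assumption).
X = [0.1, 0.25, 0.4, 0.6, 0.85]
beta_b, beta_w = 0.6, 0.4
T = [beta_b * x + beta_w * (1 - x) for x in X]   # accounting identity

B_b, B_w = goodman_regression(T, X)
print(round(B_b, 6), round(B_w, 6))
```

With constant parameters the fit reproduces the assumed black and white turnout rates up to rounding; the rest of the lecture concerns what happens when that constancy fails.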

Selected Problems with Goodman's Approach

If we follow Goodman's advice, we won't apply the model.

If we don't follow Goodman's advice and apply it anyway:

1. We know parameters are not constant.

   [Figure: Precincts in Marion County, Indiana: Voter Turnout for the U.S. Senate by Fraction Black. Axes: $X_i$ and $T_i$.]

   The accounting identity, $T_i = \beta_i^b X_i + \beta_i^w (1 - X_i)$, contains no error other than that due to parameter variation. Thus, all scatter around the regression line is due to parameter variation.

2. Goodman's model does not take into account information from the method of bounds or from massive heteroskedasticity in aggregate data. See the graph.

3. Goodman's regression is biased in the presence of aggregation bias: $C(\beta_i^b, X_i) \neq 0$ or $C(\beta_i^w, X_i) \neq 0$. (True in any regression even if not ecological.)
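The effect of aggregation bias can be illustrated with a small deterministic sketch. The ten precincts and the turnout rules are invented for illustration: black turnout is fixed at 0.8 everywhere, while white turnout rises with $X_i$, so $C(\beta_i^w, X_i) \neq 0$. The fit uses the equivalent form $T_i = \beta_i^w + (\beta_i^b - \beta_i^w) X_i$, so the intercept plays the role of $B^w$ and intercept plus slope the role of $B^b$. The helper `simple_ols` is a hypothetical name.

```python
def simple_ols(y, x):
    """Return (intercept, slope) from least squares of y on x with a constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

# Hypothetical precincts: X_i = 0.05, 0.10, ..., 0.50.
X = [0.05 * k for k in range(1, 11)]
beta_b = [0.8 for _ in X]              # black turnout, constant (assumed)
beta_w = [0.3 + 0.8 * x for x in X]    # white turnout rises with X_i (assumed)

# Turnout follows from the accounting identity, with no added noise.
T = [bb * x + bw * (1 - x) for bb, bw, x in zip(beta_b, beta_w, X)]

intercept, slope = simple_ols(T, X)
B_w_hat = intercept            # estimate of white turnout
B_b_hat = intercept + slope    # estimate of black turnout

print(f"B_w_hat = {B_w_hat:.3f}, B_b_hat = {B_b_hat:.3f}")
```

In this constructed example the estimated black turnout is about 1.2, above the logical maximum of 1, even though every precinct's true value is 0.8. This is the same kind of impossible figure that appears in the court exhibit discussed earlier.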

4. We cannot correct for aggregation bias within Goodman's framework.
   (a) The good idea that doesn't work: since the coefficients vary with $X_i$, let's model that explicitly, hence using $X_i$ to control for the covariation.
   (b) More specifically, even if $C(\beta_i^b, X_i) \neq 0$, if we control for $Z_i$ it might be true that $C(\beta_i^b, X_i \mid Z_i) = 0$. And if $Z_i = X_i$, it's true for sure.
   (c) Take Goodman's regression:

       $E(T_i) = B^b X_i + B^w (1 - X_i)$

   (d) Let $B^b = \gamma_0 + \gamma_1 X_i$ and $B^w = \theta_0 + \theta_1 X_i$ and substitute:

       $E(T_i) = (\gamma_0 + \gamma_1 X_i) X_i + (\theta_0 + \theta_1 X_i)(1 - X_i) = \theta_0 + (\gamma_0 + \theta_1 - \theta_0) X_i + (\gamma_1 - \theta_1) X_i^2$

   (e) Model is not identified: Four parameters need to be estimated ($\gamma_0$, $\gamma_1$, $\theta_0$, and $\theta_1$), but only 3 can be estimated ($\theta_0$ and the coefficients in parentheses on $X_i$ and $X_i^2$).

5. If the number of people differs across precincts, Goodman's model is not estimating the correct quantity of interest.
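The identification failure in item 4 can be checked numerically. The sketch below uses made-up parameter values: it takes one set of $(\gamma_0, \gamma_1, \theta_0, \theta_1)$, shifts three of them by an arbitrary constant `c` in a way that leaves the three estimable coefficients unchanged, and confirms that $E(T_i)$ is identical at every $X_i$. The function names are hypothetical.

```python
def expected_turnout(x, gamma0, gamma1, theta0, theta1):
    """E(T_i) when B^b = gamma0 + gamma1*x and B^w = theta0 + theta1*x."""
    return (gamma0 + gamma1 * x) * x + (theta0 + theta1 * x) * (1 - x)

def reduced_form(gamma0, gamma1, theta0, theta1):
    """Coefficients on 1, X_i and X_i^2 after expanding the product."""
    return (theta0, gamma0 + theta1 - theta0, gamma1 - theta1)

# Two different, purely illustrative parameter sets.
params_a = (0.70, 0.10, 0.30, 0.20)
c = 0.15  # arbitrary shift
params_b = (0.70 - c, 0.10 + c, 0.30, 0.20 + c)

# Largest difference in E(T_i) over a grid of X_i values in [0, 1].
grid = [k / 20 for k in range(21)]
max_gap = max(
    abs(expected_turnout(x, *params_a) - expected_turnout(x, *params_b))
    for x in grid
)

print(reduced_form(*params_a))
print(reduced_form(*params_b))
print(max_gap)
```

Both parameter sets imply the same three coefficients and therefore the same expected turnout everywhere, so no amount of aggregate data generated this way can tell them apart.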

The Data

[Figure: A Scattercross Graph of Voter Turnout by Fraction Hispanic. Axes: $X_i$ and $T_i$.]

Solve the accounting identity:

$T_i = \beta_i^w + (\beta_i^b - \beta_i^w) X_i$

for the unknowns:

$\beta_i^w = \frac{T_i}{1 - X_i} - \frac{X_i}{1 - X_i} \beta_i^b$

The Data: Continued

Precinct 52: $T_{52} = .19$, $X_{52} = .88$

$\beta_{52}^w = \frac{T_{52}}{1 - X_{52}} - \frac{X_{52}}{1 - X_{52}} \beta_{52}^b = \frac{.19}{1 - .88} - \frac{.88}{1 - .88} \beta_{52}^b = 1.58 - 7.33 \beta_{52}^b$

[Figure: plot with axes $\beta_i^b$ and $\beta_i^w$.]
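The Precinct 52 calculation can be reproduced in a few lines. The sketch below computes the intercept and slope of the line relating $\beta_{52}^w$ to $\beta_{52}^b$, and then the range of each parameter that is consistent with both the line and the requirement that each lies in $[0, 1]$. The function names are hypothetical, and the bounds formulas assume $0 < X_i < 1$.

```python
def line_for_precinct(T, X):
    """Intercept and slope of beta_w written as a linear function of beta_b."""
    return T / (1 - X), -X / (1 - X)

def bounds_for_precinct(T, X):
    """Ranges of beta_b and beta_w consistent with T, X and the unit square.

    Assumes 0 < X < 1 so that neither division is by zero.
    """
    b_lo = max(0.0, (T - (1 - X)) / X)   # beta_w at its maximum of 1
    b_hi = min(1.0, T / X)               # beta_w at its minimum of 0
    w_lo = max(0.0, (T - X) / (1 - X))   # beta_b at its maximum of 1
    w_hi = min(1.0, T / (1 - X))         # beta_b at its minimum of 0
    return (b_lo, b_hi), (w_lo, w_hi)

T52, X52 = 0.19, 0.88
intercept, slope = line_for_precinct(T52, X52)
(b_lo, b_hi), (w_lo, w_hi) = bounds_for_precinct(T52, X52)

print(f"beta_w = {intercept:.2f} + ({slope:.2f}) * beta_b")
print(f"beta_b in [{b_lo:.3f}, {b_hi:.3f}], beta_w in [{w_lo:.3f}, {w_hi:.3f}]")
```

Because this precinct is 88% of one group, the line is steep: $\beta_{52}^b$ is confined to a narrow interval of roughly 0.08 to 0.22, while $\beta_{52}^w$ can be anywhere in $[0, 1]$.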

The Model for Data Without Aggregation Bias, But Robust in its Presence

The Goal: Knowledge of $\beta_i^b$ and $\beta_i^w$ in each precinct.

Begin with the basic accounting identity (not an assumption of linearity):

$T_i = \beta_i^b X_i + \beta_i^w (1 - X_i)$

and add three assumptions (in the basic version of the model):

1. $\beta_i^b$ and $\beta_i^w$ are truncated bivariate normal.

   [Figure: three panels, (a), (b), and (c), each with axes $\beta_i^b$ and $\beta_i^w$.]

   (The 5 parameters of this density need to be estimated by forming the likelihood.)

2. No aggregation bias (a priori): $\beta_i^b$ and $\beta_i^w$ are mean independent of $X_i$. This allows a posteriori aggregation bias (i.e., after conditioning on $T_i$).

3. No spatial autocorrelation: $T_i \mid X_i$ are independent over observations.

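98 Deriving the Likelihood Function
1. The story of the model is that we learn things in order:
(a) (As in regression) everything is conditional on $X_i$, which means we learn it first.
(b) Then the world draws $\beta_i^b$ and $\beta_i^w$ from a truncated bivariate normal, but we don't get to see them.
(c) Finally, we learn $T_i$, which is computed deterministically via the accounting identity: $T_i = \beta_i^b X_i + \beta_i^w (1 - X_i)$.
2. The random variable is then $T$ (given $X$), whose distribution is induced by the truncated bivariate normal on $(\beta_i^b, \beta_i^w)$.
3. The five parameters of the truncated bivariate normal need to be estimated:
$$\breve{\psi} = \{\breve{B}^b, \breve{B}^w, \breve{\sigma}_b, \breve{\sigma}_w, \breve{\rho}\} = \{\breve{B}, \breve{\Sigma}\}$$
These are on the untruncated scale (and are not quantities of interest), since
$$TN(\beta_i^b, \beta_i^w \mid \breve{B}, \breve{\Sigma}) = N(\beta_i^b, \beta_i^w \mid \breve{B}, \breve{\Sigma}) \, \frac{1(\beta_i^b, \beta_i^w)}{R(\breve{B}, \breve{\Sigma})}$$
where $1(\beta_i^b, \beta_i^w)$ equals 1 when both arguments lie in $[0, 1]$ and 0 otherwise, and
$$R(\breve{B}, \breve{\Sigma}) = \int_0^1 \int_0^1 N(\beta^b, \beta^w \mid \breve{B}, \breve{\Sigma}) \, d\beta^b \, d\beta^w$$
is the volume above the unit square.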
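The story in item 1 can be sketched as a small simulation. This is only an illustration under stated assumptions: the parameter values are hypothetical, the function names are invented for this sketch, and rejection sampling is used as the simplest way to draw from a normal truncated to the unit square (the slides do not prescribe a sampler). The acceptance rate doubles as a Monte Carlo estimate of $R(\breve{B}, \breve{\Sigma})$.

```python
import math
import random


def draw_truncated_bvn(rng, Bb, Bw, sb, sw, rho):
    """Rejection-sample (beta_b, beta_w) from a bivariate normal truncated to [0, 1]^2.

    Returns the accepted pair and the number of proposals it took.
    """
    tries = 0
    while True:
        tries += 1
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        beta_b = Bb + sb * z1
        # Correlate the second coordinate with the first (2x2 Cholesky factor).
        beta_w = Bw + sw * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
        if 0.0 <= beta_b <= 1.0 and 0.0 <= beta_w <= 1.0:
            return beta_b, beta_w, tries


def simulate_precincts(X, Bb, Bw, sb, sw, rho, seed=0):
    """Follow steps (a)-(c): X is given, the betas are drawn, T is computed."""
    rng = random.Random(seed)
    rows, proposals = [], 0
    for x in X:  # (a) X_i is known first
        beta_b, beta_w, tries = draw_truncated_bvn(rng, Bb, Bw, sb, sw, rho)  # (b)
        proposals += tries
        T = beta_b * x + beta_w * (1.0 - x)  # (c) accounting identity
        rows.append((x, beta_b, beta_w, T))
    # Share of proposals that landed in the unit square estimates R.
    return rows, len(X) / proposals


# Hypothetical inputs, chosen only for illustration.
X = [i / 200 for i in range(1, 200)]
rows, R_hat = simulate_precincts(X, Bb=0.6, Bw=0.3, sb=0.2, sw=0.15, rho=0.3)
print(f"Monte Carlo estimate of R: {R_hat:.3f}")
```

In real data only the `x` and `T` columns of `rows` would be observed; the two beta columns are exactly what ecological inference tries to recover.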

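106 Deriving the Likelihood Function
4. (From simulations of these parameters, we will compute the quantities of interest, $\beta_i^b$ and $\beta_i^w$. Details shortly.)
5. The likelihood:
$$
\begin{aligned}
L(\breve{\psi} \mid T) &\propto \prod_{X_i \in (0,1)} P(T_i \mid \breve{\psi}) \\
&= \prod_{X_i \in (0,1)} \left( \frac{\text{What we observe}}{\text{What we could have observed}} \right) \\
&= \prod_{X_i \in (0,1)} \left( \frac{\text{Area above line segment}}{\text{Volume above square}} \right) \\
&= \prod_{X_i \in (0,1)} \frac{ \left( \dfrac{\text{Area above line}}{\text{Volume above plane}} \right) \left( \dfrac{\text{Area above line segment}}{\text{Area above line}} \right) }{ \left( \dfrac{\text{Volume above square}}{\text{Volume above plane}} \right) } \\
&= \prod_{X_i \in (0,1)} N(T_i \mid \mu_i, \sigma_i^2) \, \frac{S(\breve{B}, \breve{\Sigma})}{R(\breve{B}, \breve{\Sigma})}
\end{aligned}
$$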

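111 Deriving the Likelihood Function
where
$$E(T_i \mid X_i) \equiv \mu_i = \breve{B}^b X_i + \breve{B}^w (1 - X_i),$$
$$V(T_i \mid X_i) \equiv \sigma_i^2 = (\breve{\sigma}_w^2) + (2\breve{\sigma}_{bw} - 2\breve{\sigma}_w^2) X_i + (\breve{\sigma}_b^2 + \breve{\sigma}_w^2 - 2\breve{\sigma}_{bw}) X_i^2,$$
$$S(\breve{B}, \breve{\Sigma}) = \int_{\max\left(0, \frac{T_i - (1 - X_i)}{X_i}\right)}^{\min\left(1, \frac{T_i}{X_i}\right)} N\!\left(\beta^b \,\Big|\, \breve{B}^b + \frac{\omega_i}{\sigma_i^2}\,\epsilon_i, \; \breve{\sigma}_b^2 - \frac{\omega_i^2}{\sigma_i^2}\right) d\beta^b,$$
with $\epsilon_i = T_i - \mu_i$ and $\omega_i = \breve{\sigma}_b^2 X_i + \breve{\sigma}_{bw}(1 - X_i)$, the covariance between $\beta_i^b$ and $T_i$ on the untruncated scale.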
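The pieces $N(T_i \mid \mu_i, \sigma_i^2)$, $S$, and $R$ can be put together in a short numerical sketch. Everything below is an assumption made for illustration rather than the estimation code behind these slides: the function names are invented, the parameter values and data are hypothetical, and $R$ is approximated with a plain midpoint rule after integrating out $\beta^w$ analytically.

```python
import math


def norm_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def norm_cdf(x, mean, sd):
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def R_unit_square(Bb, Bw, sb, sw, rho, n=2000):
    """Mass of the untruncated bivariate normal over [0, 1]^2 (midpoint rule in beta_b)."""
    cond_sd = sw * math.sqrt(1.0 - rho ** 2)
    h = 1.0 / n
    total = 0.0
    for k in range(n):
        b = (k + 0.5) * h
        # beta_w given beta_b = b is normal, so its [0, 1] mass is a CDF difference.
        cond_mean = Bw + rho * sw / sb * (b - Bb)
        inner = norm_cdf(1.0, cond_mean, cond_sd) - norm_cdf(0.0, cond_mean, cond_sd)
        total += norm_pdf(b, Bb, sb) * inner
    return total * h


def precinct_terms(T, X, Bb, Bw, sb, sw, rho):
    """Return (N(T | mu_i, sigma_i^2), S) for one precinct with 0 < X < 1."""
    sbw = rho * sb * sw
    mu = Bb * X + Bw * (1.0 - X)
    var = sw ** 2 + (2 * sbw - 2 * sw ** 2) * X + (sb ** 2 + sw ** 2 - 2 * sbw) * X ** 2
    omega = sb ** 2 * X + sbw * (1.0 - X)
    cond_mean = Bb + omega / var * (T - mu)
    cond_sd = math.sqrt(sb ** 2 - omega ** 2 / var)
    lower = max(0.0, (T - (1.0 - X)) / X)
    upper = min(1.0, T / X)
    S = norm_cdf(upper, cond_mean, cond_sd) - norm_cdf(lower, cond_mean, cond_sd)
    return norm_pdf(T, mu, math.sqrt(var)), S


def log_likelihood(data, Bb, Bw, sb, sw, rho):
    """Sum of log[N(T_i | mu_i, sigma_i^2) * S / R] over precincts with X_i in (0, 1)."""
    log_R = math.log(R_unit_square(Bb, Bw, sb, sw, rho))
    total = 0.0
    for T, X in data:
        if not 0.0 < X < 1.0:
            continue  # the product in the likelihood runs over X_i in (0, 1) only
        density, S = precinct_terms(T, X, Bb, Bw, sb, sw, rho)
        total += math.log(density) + math.log(S) - log_R
    return total


# Hypothetical (T_i, X_i) pairs and parameter values.
data = [(0.45, 0.4), (0.50, 0.6), (0.35, 0.2)]
ll = log_likelihood(data, Bb=0.6, Bw=0.3, sb=0.2, sw=0.15, rho=0.3)
print(ll)
```

A useful sanity check on the decomposition is that, for a fixed $X_i$, integrating $N(T \mid \mu_i, \sigma_i^2)\,S$ over $T \in [0, 1]$ should reproduce $R$, so each precinct's term is a proper density in $T_i$.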

112 Deriving the Likelihood Function
6. A visual version of the likelihood:
[Figure: plot with $\beta_i^b$ on the horizontal axis and $\beta_i^w$ on the vertical axis.]

113 The Truncated Bivariate Normal Distribution's Five Parameters Can Be Estimated From Aggregate Data: Intuition
[Figure: six panels, (a) through (f), each plotting $T_i$ against $X_i$.]
Data were randomly generated from the model with parameter values $\breve{B}^b$, $\breve{B}^w$, $\breve{\sigma}_b$, $\breve{\sigma}_w$, and $\breve{\rho}$, shown at the top of each graph. The solid line is the expected value and the dashed lines are at plus and minus one standard deviation.

114 Another view of how the data change with the model
[Figure: six tomography plots, panels (a) through (f), each with $\beta_i^b$ on the horizontal axis and $\beta_i^w$ on the vertical axis.]
Observable Implications for Sample Parameter Values. The numbers at the top of each tomography plot are the parameter values for the distribution from which data were randomly generated: $\breve{B}^b$, $\breve{B}^w$, $\breve{\sigma}_b$, $\breve{\sigma}_w$, and $\breve{\rho}$.

118 Calculating Quantities of Interest: A story of X-Rays and tomography machines; then how to do it
Rearranging the basic accounting identity gives $\beta_i^w$ as a linear function of $\beta_i^b$:
$$\beta_i^w = \left( \frac{T_i}{1 - X_i} \right) - \left( \frac{X_i}{1 - X_i} \right) \beta_i^b$$
Thus, knowing $T_i$ and $X_i$ in one precinct narrows the possible values of $(\beta_i^b, \beta_i^w)$ to one line cut across this figure:
[Figure: A Tomography Plot, with $\beta_i^b$ on the horizontal axis and $\beta_i^w$ on the vertical axis.]
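A minimal sketch of the tomography line for a single precinct, assuming $0 < X_i < 1$. The function names are invented for this illustration, and the bounds are simply what the accounting identity implies once both coefficients are restricted to $[0, 1]$.

```python
def tomography_line(T, X):
    """Return (intercept, slope) so that beta_w = intercept + slope * beta_b."""
    if not 0.0 < X < 1.0:
        raise ValueError("the line is only defined for 0 < X < 1")
    return T / (1.0 - X), -X / (1.0 - X)


def beta_b_bounds(T, X):
    """Range of beta_b that keeps beta_w inside [0, 1]."""
    return max(0.0, (T - (1.0 - X)) / X), min(1.0, T / X)


def beta_w_bounds(T, X):
    """Range of beta_w that keeps beta_b inside [0, 1]."""
    return max(0.0, (T - X) / (1.0 - X)), min(1.0, T / (1.0 - X))


# Hypothetical precinct: 80% black, 90% turnout.
T, X = 0.9, 0.8
intercept, slope = tomography_line(T, X)
b_lo, b_hi = beta_b_bounds(T, X)
w_lo, w_hi = beta_w_bounds(T, X)
print(f"beta_w = {intercept:.3f} + ({slope:.3f}) * beta_b")
print(f"beta_b in [{b_lo:.3f}, {b_hi:.3f}], beta_w in [{w_lo:.3f}, {w_hi:.3f}]")
```

In this hypothetical precinct the line segment inside the unit square is short, so the data alone already pin $\beta_i^b$ down to a narrow range before any model is applied.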

119 Calculating Quantities of Interest: A story of X-Rays and tomography machines; then how to do it
[Figure: horizontal axis $\beta_i^b$.]

127 How to Calculate Quantities of Interest
1. Option 1. Simulate only (district-level) aggregate quantities.
(a) Algorithm to take one draw of the district-level fraction of blacks who vote:
i. Draw $\breve{\psi}$ from its posterior or sampling density: an asymptotic normal with mean equal to the point estimates and variance equal to the inverse of the negative Hessian at the maximum.
ii. Draw $\beta_i^b$ and $\beta_i^w$ from $TN(\beta_i^b, \beta_i^w \mid \breve{B}, \breve{\Sigma})$, given the simulated parameters, $\breve{\psi} = \{\breve{B}, \breve{\Sigma}\}$.
iii. Compute the weighted average of the simulated coefficients (weights based on precinct population):
$$B^b = \sum_{i=1}^{p} \frac{N_i^{b+} \beta_i^b}{N_+^{b+}}$$
(b) Problem: we only learn the district-wide aggregate, and it's not robust.
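Steps ii and iii of Option 1 can be sketched as follows. To keep the example short, $\breve{\psi}$ is held fixed at hypothetical values instead of being redrawn in step i, so the spread of the simulations below understates the full uncertainty. The precinct counts, the function names, and the use of rejection sampling are likewise assumptions for illustration.

```python
import math
import random


def draw_beta_b(rng, Bb, Bw, sb, sw, rho):
    """Step ii: rejection-sample (beta_b, beta_w) on the unit square, return beta_b."""
    while True:
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        beta_b = Bb + sb * z1
        beta_w = Bw + sw * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
        if 0.0 <= beta_b <= 1.0 and 0.0 <= beta_w <= 1.0:
            return beta_b


def district_draw(rng, n_black, psi):
    """Step iii: population-weighted average of one simulated beta_b per precinct."""
    draws = [draw_beta_b(rng, *psi) for _ in n_black]
    return sum(n * b for n, b in zip(n_black, draws)) / sum(n_black)


rng = random.Random(1)
psi = (0.6, 0.3, 0.2, 0.15, 0.3)  # hypothetical (Bb, Bw, sigma_b, sigma_w, rho)
n_black = [120, 450, 80, 300]     # hypothetical black population per precinct
sims = [district_draw(rng, n_black, psi) for _ in range(1000)]
sims.sort()
print("median:", sims[len(sims) // 2])
print("80% interval:", sims[100], sims[899])
```

Note that nothing in this sketch uses the observed $T_i$, which is one way to see why Option 1 says nothing precinct-specific.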

136 How to Calculate Quantities of Interest
2. Option 2. Use the knowledge that simulations for observation $i$ must come from its tomography line:
(a) By the story of the model, if we know $T_i$, we learn the entire tomography line (since $X_i$ is known ex ante).
(b) So we will condition on $T_i$ to make a prediction from the tomography line.
(c) We could apply the Option 1 algorithm and use rejection sampling (discard simulations of $\beta_i^b, \beta_i^w$ that are not on the tomography line), but this would take forever: a draw from the unit square lands exactly on a given line with probability zero.
(d) Alternative algorithm for drawing simulations of $\beta_i^b$ and $\beta_i^w$:
i. Find the expression for $P(\beta_i^b \mid T_i, \breve{\psi})$ analytically, which is a particular truncated univariate normal (see King, 1997: Appendix C).
ii. Draw $\breve{\psi}$ from its posterior or sampling density (the same multivariate normal as always).
iii. Insert the simulation into $P(\beta_i^b \mid T_i, \breve{\psi})$ and draw out one simulated $\beta_i^b$.
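Step iii can be sketched with an inverse-CDF draw from the truncated univariate normal. This is an illustrative sketch under stated assumptions, not the routine from King (1997): the conditional mean and variance are the standard bivariate-normal conditioning formulas, $\breve{\psi}$ is held fixed at hypothetical values (step ii would redraw it for every simulation), and the function names are invented. Computing $\beta_i^w$ from the accounting identity afterwards is an added convenience that keeps each pair exactly on the tomography line.

```python
import math
import random
from statistics import NormalDist


def conditional_beta_b(T, X, Bb, Bw, sb, sw, rho):
    """Mean, sd, and truncation bounds of beta_b given T on the untruncated scale."""
    sbw = rho * sb * sw
    mu = Bb * X + Bw * (1.0 - X)
    var = sw ** 2 + (2 * sbw - 2 * sw ** 2) * X + (sb ** 2 + sw ** 2 - 2 * sbw) * X ** 2
    omega = sb ** 2 * X + sbw * (1.0 - X)  # covariance of beta_b and T
    mean = Bb + omega / var * (T - mu)
    sd = math.sqrt(sb ** 2 - omega ** 2 / var)
    lower = max(0.0, (T - (1.0 - X)) / X)
    upper = min(1.0, T / X)
    return mean, sd, lower, upper


def draw_beta_pair(rng, T, X, psi):
    """Draw beta_b from its truncated normal, then get beta_w from the identity."""
    mean, sd, lower, upper = conditional_beta_b(T, X, *psi)
    dist = NormalDist(mean, sd)
    # Inverse-CDF sampling restricted to the probability range of [lower, upper].
    u = rng.uniform(dist.cdf(lower), dist.cdf(upper))
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # inv_cdf requires 0 < u < 1
    beta_b = min(max(dist.inv_cdf(u), lower), upper)
    beta_w = (T - beta_b * X) / (1.0 - X)
    return beta_b, beta_w


rng = random.Random(3)
psi = (0.6, 0.3, 0.2, 0.15, 0.3)  # hypothetical (Bb, Bw, sigma_b, sigma_w, rho)
T, X = 0.7, 0.8                   # hypothetical precinct
draws = [draw_beta_pair(rng, T, X, psi) for _ in range(1000)]
mean_b = sum(b for b, _ in draws) / len(draws)
print("simulated mean of beta_b:", mean_b)
```

Unlike rejection sampling from the unit square, every draw here is usable, because each one is generated on the tomography line by construction.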
