Ecological inference with distribution regression

1 Ecological inference with distribution regression Seth Flaxman 10 May 2017 Department of Politics and International Relations

2 Ecological inference
How to draw conclusions about individuals from aggregate-level data?

             Black    White    Total
Democrat       ??       ??     2,200
Republican     ??       ??     1,300
Total        1,500    2,000

3 Ecological inference
Ecological fallacy [Robinson 1950], method of bounds [Duncan and Davis 1953], Goodman's method [1959], neighborhood method [Freedman 1991], Bayesian modeling approaches to ecological inference [King 1997, Gelman 2001, Wakefield 2004]
(Figure source: King)

4 Ecological inference: a modern approach
We (often) have unlabeled individual-level data!
- Learning from label proportions [e.g. Kueck and de Freitas 2005, Quadrianto et al 2009, Sheldon and Dietterich 2011, Patrini et al 2014]
- Distribution regression [e.g. Szabó et al 2015]
- Data privacy literature / de-anonymization [e.g. Narayanan and Shmatikov 2009]
- Bayesian approaches [Jackson, Best, and Richardson 2006; Wakefield 2004 and 2008]

5 Electoral data
- Voting results per district, e.g. Obama 52%, Romney 48%
- Individual-level demographic data from the American Community Survey: race, education, income, occupation, time to work, etc., for 5% of US residents
- Goal: infer how demographic subgroups voted (by district!)

6 Spatial join
[Figure] Electoral data for 67 counties (left), census data for 127 public use microdata areas (not shown). Merging results in 37 electoral regions (right).

7 "Ground truth": exit polls
[Figure: exit poll estimates for women and men]
- Costly; only carried out in 18 states in 2012
- Limited number of demographic variables
- Not representative at the substate level
- Unreliable
Other sources of ground truth: public polls, voter file, post-election surveys

8 My new approach
Insight: ecological inference with individual-level data can be framed as distribution regression
Goal: a scalable, spatially varying Bayesian model
Approach: a new Bayesian distribution regression method with GPs and fast kernel methods
[Flaxman, Wang, Smola. Who Supported Obama in 2012? Ecological Inference through Distribution Regression. KDD 2015; Flaxman, Sutherland, Wang, and Teh. Understanding the 2016 US Presidential Election using ecological inference and distribution regression with census microdata.]

9-11 Distribution regression
Individual-level data with group-level labels:
({x_1^j}_{j=1}^{N_1}, y_1), ({x_2^j}_{j=1}^{N_2}, y_2), ..., ({x_n^j}_{j=1}^{N_n}, y_n)
Learn a function:
f : {x^j}_{j=1}^{N} -> y
Adapt it for ecological inference: make predictions for subgroups, e.g. f̂({x_women^j}_{j=1}^{N_1}) (toy sketch below).
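
A minimal toy sketch of this setup (my own illustration, not code from the talk): each group of individuals is reduced to the mean of a feature map, and an off-the-shelf regressor is fit on those group-level summaries.

```python
# Toy end-to-end distribution regression: groups -> mean features -> regression.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

def random_group(label_effect, size):
    """Simulate one group: individual covariates whose mean shifts with the group label."""
    return rng.normal(loc=label_effect, size=(size, 5))

labels = rng.uniform(0.3, 0.7, size=200)                        # group-level labels y_i
groups = [random_group(y, rng.integers(50, 500)) for y in labels]

def embed(group):
    """Simple feature map averaged over individuals: mu_i = mean_j phi(x_i^j)."""
    phi = np.concatenate([group, group ** 2], axis=1)           # stand-in for a kernel feature map
    return phi.mean(axis=0)

mu = np.stack([embed(g) for g in groups])
model = Ridge(alpha=1.0).fit(mu, labels)                        # learn f : mu_i -> y_i
print(model.predict(mu[:3]))
```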

12 Learning from distributions
Previous work: Jebara et al 2004; Hein and Bousquet 2005; Muandet et al 2012; Póczos et al 2013; Szabó et al 2014; Lopez-Paz et al 2015; Lopez-Paz 2016.
Distribution regression / distribution classification relies on the kernel mean embedding [see the survey by Muandet et al 2017].
Given a kernel k(x, ·), RKHS H_k, and corresponding embedding φ(x) ∈ H_k, consider a measure P with X ~ P. Then define:
µ_P := E[φ(X)] = ∫ φ(x) dP(x)    (1)
The obvious empirical estimator for samples x_1, ..., x_n:
µ̂_P := (1/n) Σ_i φ(x_i)    (2)
Learning: use any supervised learning method to learn a function f(µ_P).
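
To make equations (1)-(2) concrete, here is a small sketch (assumptions mine: a Gaussian RBF kernel and synthetic data) of the empirical mean embedding inner product between two groups, computed from the Gram matrix alone.

```python
# Empirical kernel mean embeddings for two groups, compared via their RKHS inner product.
import numpy as np

def rbf_kernel(X, Y, lengthscale=1.0):
    """Gaussian RBF kernel matrix k(x, y) = exp(-||x - y||^2 / (2 * lengthscale^2))."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2.0 * lengthscale ** 2))

def mean_embedding_inner_product(X_p, X_q, lengthscale=1.0):
    """<mu_hat_P, mu_hat_Q> = (1/(n*m)) sum_ij k(x_i, y_j), following eq. (2)."""
    return rbf_kernel(X_p, X_q, lengthscale).mean()

rng = np.random.default_rng(0)
X_p = rng.normal(loc=0.0, size=(500, 3))    # samples from P
X_q = rng.normal(loc=0.5, size=(400, 3))    # samples from Q
print(mean_embedding_inner_product(X_p, X_q))
# The squared MMD between P and Q follows as <mu_P,mu_P> + <mu_Q,mu_Q> - 2<mu_P,mu_Q>.
```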

13 Distribution embedding illustration
[Figure: densities p(x) of two distributions P and Q, and their embeddings in the reproducing kernel Hilbert space]
Figure caption: Each distribution is mapped into the reproducing kernel Hilbert space via an expectation operation. (Source: Muandet et al 2017)

14 My new approach: distribution regression
[Figure: individuals x_i^j in regions 1-3 (men, women, both) are mapped to mean embeddings µ_1, µ_2, µ_3 and subgroup embeddings µ_3^w, µ_3^m in feature space; each region's label y_i is its % vote for Obama, with y_3 unknown]

15 Bayesian distribution regression
Estimate the mean embeddings µ_1, ..., µ_n using kernel embeddings:
µ_i = (1/N_i) Σ_j k(x_i^j, ·) = (1/N_i) Σ_j φ(x_i^j)
Use GP logistic regression with additive kernels including a spatial component:
K_ij = σ_x^2 ⟨µ_i, µ_j⟩ + k_s(s_i, s_j)
f ~ GP(0, K)
k_i | f_i ~ Binomial(n_i, logit^{-1}(f_i))
Obama received k_i out of n_i votes in region i.
Make predictions for demographic subgroups: f(µ_i^women, s_i)
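
A minimal sketch of sampling from this generative model, assuming the mean embeddings µ_i are already finite-dimensional vectors (e.g. from random features); region counts, coordinates, and hyperparameter values are made up for illustration, not the authors' implementation.

```python
# Prior sample from the Bayesian distribution regression model on this slide.
import numpy as np

def matern32(S, rho=1.0):
    """Matern-3/2 covariance in the slide's parameterization: (1 + rho*r) exp(-rho*r)."""
    r = np.linalg.norm(S[:, None, :] - S[None, :, :], axis=-1)
    return (1.0 + rho * r) * np.exp(-rho * r)

rng = np.random.default_rng(0)
n_regions, embed_dim = 40, 10
mu = rng.normal(size=(n_regions, embed_dim))     # stand-in mean embeddings mu_i
S = rng.uniform(size=(n_regions, 2))             # region centroids (spatial coordinates s_i)
n_votes = rng.integers(1_000, 50_000, size=n_regions)

sigma_x = 1.0
K = sigma_x ** 2 * (mu @ mu.T) + matern32(S, rho=2.0)   # K_ij = sigma_x^2 <mu_i, mu_j> + k_s(s_i, s_j)
f = rng.multivariate_normal(np.zeros(n_regions), K + 1e-8 * np.eye(n_regions))
p = 1.0 / (1.0 + np.exp(-f))                             # logit^{-1}(f_i)
k = rng.binomial(n_votes, p)                             # Obama votes k_i out of n_i in region i
```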

16 Kernel details
Demographic attributes (Gaussian RBF):
- Standardize coordinates
- Expand discrete attributes: (low, medium, high income) → ([1 0 0], [0 1 0], [0 0 1])
- Use random Fourier features for speed (sketch below): k(x, x') = ⟨φ(x), φ(x')⟩ ≈ ⟨φ̂(x), φ̂(x')⟩ with φ̂(x) ∈ R^D
Spatial attributes with a Matérn-3/2 kernel:
k(s, s') = (1 + ρ ||s - s'||) exp(-ρ ||s - s'||)
Millions of observations, but the covariance matrix is only over the 843 electoral regions.
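
A sketch of the random Fourier feature approximation for the Gaussian RBF kernel (Rahimi and Recht's construction); the input dimension, number of features, and lengthscale below are placeholders.

```python
# Random Fourier features: a finite-dimensional map whose inner product approximates the RBF kernel.
import numpy as np

def make_rff(dim_in, num_features, lengthscale, rng):
    """Return phi_hat with <phi_hat(x), phi_hat(x')> ~= exp(-||x - x'||^2 / (2 * lengthscale^2))."""
    W = rng.normal(scale=1.0 / lengthscale, size=(dim_in, num_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
    return lambda X: np.sqrt(2.0 / num_features) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
phi_hat = make_rff(dim_in=20, num_features=2048, lengthscale=1.0, rng=rng)
X = rng.normal(size=(5, 20))
approx = phi_hat(X) @ phi_hat(X).T                            # approximate Gram matrix
exact = np.exp(-((X[:, None] - X[None]) ** 2).sum(-1) / 2.0)  # exact RBF Gram matrix
print(np.abs(approx - exact).max())                           # small for large num_features
```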

17 Algorithm details
One pass through the census data to create (survey-weighted) mean embeddings:
µ_1 = (Σ_j w_1^j φ(x_1^j)) / (Σ_j w_1^j), ..., µ_n = (Σ_j w_n^j φ(x_n^j)) / (Σ_j w_n^j)
Set up the GP regression:
f ~ GP(0, σ_x^2 K_x + σ_s^2 K_s)
k_i | f_i ~ Binomial(n_i, logit^{-1}(f_i))
Laplace approximation for hyperparameter learning θ = [σ_x, σ_s, ρ] with the marginal likelihood:
argmax_θ p(y | θ)
Bayesian posterior inference to make predictions for the latent f at new "locations":
p(f_men | y, θ̂)
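
The Laplace approximation step for the binomial-likelihood GP could look roughly like the following Newton iteration (adapted from Rasmussen and Williams' Algorithm 3.1; this is my sketch, not the authors' implementation). Here K is the n × n covariance over regions, k the Obama vote counts, and n the total votes per region.

```python
# Laplace approximation: find the posterior mode of f given k_i ~ Binomial(n_i, sigmoid(f_i)).
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def laplace_mode(K, k, n, num_iters=50):
    """Newton iterations (GPML Alg. 3.1) with a binomial count likelihood."""
    f = np.zeros(len(k))
    for _ in range(num_iters):
        p = 1.0 / (1.0 + np.exp(-f))
        grad = k - n * p                       # d log p(k|f) / df
        W = n * p * (1.0 - p)                  # -d^2 log p(k|f) / df^2 (diagonal)
        sqrt_W = np.sqrt(W)
        B = np.eye(len(k)) + sqrt_W[:, None] * K * sqrt_W[None, :]
        L = cholesky(B, lower=True)
        b = W * f + grad
        v = solve_triangular(L, sqrt_W * (K @ b), lower=True)
        a = b - sqrt_W * solve_triangular(L.T, v, lower=False)
        f = K @ a
    return f

# With f_hat = laplace_mode(K, obama_votes, total_votes), the approximate log marginal
# likelihood can be evaluated as a function of theta = [sigma_x, sigma_s, rho] and maximized.
```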

18 Experiments
[Figure: panels comparing exit poll estimates (women, men) with ecological regression estimates (women, men)]

19 Experiments
[Figure: ecological regression estimates (y-axis, 40%-70%) plotted against exit poll estimates (x-axis, 40%-70%), for women (left) and men (right)]

20 Experiments
[Figure: histogram of the estimated Obama support gender gap (percentage points, 0%-15%); y-axis: fraction of geographic regions]

21 [Figure: ecological regression estimates (y-axis) vs. exit poll estimates (x-axis), 30%-70%, for low-, medium-, and high-income voters]

22 Refinements for 2016 election
- Explicitly model non-voters: y_i = [Clinton votes, Trump votes, non-votes and third-party votes]
- Multinomial likelihood with softmax link, fit by penalized MLE with group lasso and L_2 penalties (see the sketch after this list)
- More interpretable / richer feature representation to allow for exploratory analysis and calculation of marginal effects:
  φ(x_i^j) := [φ_1(x_{r_1}^j), ..., φ_d(x_{r_d}^j)]    (4)
- Incorporation of some exit polling data as an extra set of labeled distributions
(arXiv preprint)
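
A hedged sketch of the penalized multinomial fit described above: a softmax likelihood on aggregated counts, optimized by proximal gradient with a group-lasso prox plus a ridge term. Function names, the grouping of features, and all tuning constants are illustrative, not taken from the paper's code.

```python
# Penalized multinomial regression on aggregated counts [Clinton, Trump, other/non-vote].
import numpy as np

def log_softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    return Z - np.log(np.exp(Z).sum(axis=1, keepdims=True))

def fit_grouped_multinomial(Phi, Y, groups, lam_group=1.0, lam_ridge=1e-3,
                            step=1e-3, num_iters=2000):
    """Phi: (regions, features); Y: (regions, 3) counts; groups: list of feature-index arrays."""
    B = np.zeros((Phi.shape[1], Y.shape[1]))
    for _ in range(num_iters):
        P = np.exp(log_softmax(Phi @ B))
        # Gradient of the multinomial negative log-likelihood plus the ridge term.
        grad = Phi.T @ (Y.sum(axis=1, keepdims=True) * P - Y) + lam_ridge * B
        B = B - step * grad
        for g in groups:                        # group-lasso prox: block soft-thresholding
            norm = np.linalg.norm(B[g])
            B[g] = 0.0 if norm == 0 else max(0.0, 1.0 - step * lam_group / norm) * B[g]
    return B
```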

23 Results for 2016 Presidential Election
[Table: estimated Clinton and Trump vote shares, fraction of the electorate, and participation rate, by gender and by age group; numeric values not recovered in this transcription]

24 Results for 2016 Presidential Election
[Table: estimated Clinton and Trump vote shares and participation rate for: language other than English spoken at home; mobility = lived here one year ago; mobility = moved here from outside the US and Puerto Rico; mobility = moved here from inside the US or Puerto Rico; active duty military; not enrolled in school; enrolled in a public school or public college; enrolled in private school, private college, or home school; numeric values not recovered]

25 Results for 2016 Presidential Election
[Table: estimated Clinton and Trump vote shares, fraction of the electorate, and participation rate by personal income bracket interacted with gender; income thresholds and numeric values not recovered]

26 Exploratory results
[Table: features ranked by deviance explained; deviance and fraction-of-deviance values not recovered]
RAC3P - race coding; ethnicity interacted with has degree; schooling attainment; ANC2P - detailed ancestry; OCCP - occupation; COW - class of worker; ANC1P - detailed ancestry; NAICSP - industry code; RAC2P - race code; age interacted with usual hours worked per week (WKHP); sex interacted with ethnicity; MSP - marital status; FOD1P - field of degree; ethnicity; RAC1P - recoded race; sex interacted with age; has degree interacted with age; age interacted with personal income; sex interacted with hours worked per week; personal income interacted with hours worked per week; personal income; RACSOR - single or multiple race; has degree interacted with hours worked per week; hispanic; sex interacted with personal income

27 Marginal results
[Figure: Clinton/Trump vote share (x-axis, 100% Clinton to 100% Trump) vs. participation rate (y-axis, 0%-100%) by ancestry: American, Irish, African American, Polish, German, Italian, English, Not reported, Mexican, White]

28 Marginal results
[Figure: Clinton/Trump vote share (x-axis, 100% Clinton to 100% Trump) vs. participation rate (y-axis, 0%-100%) by race/ethnicity interacted with college degree: White, Black, Hispanic, Asian, Amerindian, and Other/Multiracial, each with and without a college degree]

29 Conclusion: ecological inference
- New ecological inference method through Bayesian distribution regression
- Scalable to millions of observations through random features
- Good empirical results
- Realistic uncertainty intervals
- Simple method [off-the-shelf tools]
- Python package by Dougal Sutherland and replication code
Next steps: fully Bayesian version of the multinomial model, learning richer feature representations, validation on ground truth

30 Future work
- Fully Bayesian method (pre-print will be on arXiv later this week)
- Mean embedding as an ecological covariate in hierarchical modeling
- Poll design: how to define likely voters?
- Imputing demographic characteristics of voters from the voter file
- Many other application areas (public health, etc.)
Thanks! Papers and code: [link not recovered] and on arXiv
