A Bayesian Partitioning Approach to Duplicate Detection and Record Linkage

1 A Bayesian Partitioning Approach to Duplicate Detection and Record Linkage
Mauricio Sadinle, Duke University
Supported by NSF grants SES-11-30706 to Carnegie Mellon University and SES-11-31897 to Duke University
1 / 49

2 Record Linkage 2 / 49

3 Examples of (Non-Trivial) Record Linkage
Statistical agencies (US Census Bureau, Statistics Canada, ...):
- Creation of unified administrative databases
- Merging of census and post-enumeration surveys for census adjustments
Human rights:
- Creation of unified lists of casualties
- Combination of data sources to estimate true mortality levels
Epidemiological studies, educational studies, ...
3 / 49

6 Record Linkage vs Duplicate Detection
Record linkage: the task of finding records that refer to the same entity across different data sources.
Duplicate detection: the task of identifying groups of records referring to the same entities within one datafile.
4 / 49

7 The Intuition
Very similar records are likely to be coreferent!

Record  Gender  First Name  Last Name      ...  Age
i       Male    Benedict    Cumberbatch    ...
j       Male    Benedict    Cucumberbatch  ...

Very dissimilar records should be non-coreferent!

Record  Gender  First Name  Last Name    ...  Age
i       Male    Benedict    Cumberbatch  ...
j       Male    Martin      Freeman      ...

5 / 49

8 Comparison Data
Construct comparison vectors: γ_ij = (..., γ^f_ij, ...)
Examples of similarity measures:

Field                 Similarity Measure
Numeric (e.g. age)    Absolute difference
Strings (e.g. name)   Levenshtein, Jaro-Winkler
Nominal (e.g. race)   Binary comparison

6 / 49
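
To make the comparison-vector construction concrete, here is a minimal Python sketch (not from the slides); the field names, example records, and the normalized Levenshtein dissimilarity are illustrative assumptions.

```python
import numpy as np

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def compare_records(rec_i: dict, rec_j: dict) -> np.ndarray:
    """Build one comparison vector gamma_ij from two records (hypothetical fields)."""
    gamma = []
    # Numeric field: absolute difference
    gamma.append(abs(rec_i["age"] - rec_j["age"]))
    # String field: normalized Levenshtein dissimilarity in [0, 1]
    d = levenshtein(rec_i["last_name"], rec_j["last_name"])
    gamma.append(d / max(len(rec_i["last_name"]), len(rec_j["last_name"]), 1))
    # Nominal field: binary agree/disagree
    gamma.append(float(rec_i["gender"] != rec_j["gender"]))
    return np.array(gamma)

print(compare_records({"age": 40, "last_name": "Cumberbatch", "gender": "Male"},
                      {"age": 41, "last_name": "Cucumberbatch", "gender": "Male"}))
```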

9 Common Solutions: Training Classifiers
- Take a sample of record pairs and determine whether each pair refers to the same entity (e.g. via clerical review)
- Train your favorite classifier on the comparison vectors γ_ij (logistic regression, SVM, random forest, ...)
- Predict the coreference status of the remaining record pairs
7 / 49
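
A minimal sketch of this classifier route, assuming scikit-learn is available; the comparison vectors and labels below are synthetic stand-ins for a clerically reviewed sample.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical training data: comparison vectors for a reviewed sample of
# record pairs, with labels 1 = coreferent, 0 = non-coreferent.
gamma_train = rng.random((200, 3))
labels = (gamma_train.sum(axis=1) < 1.0).astype(int)  # stand-in labels

clf = LogisticRegression().fit(gamma_train, labels)

# Predict coreference status for the remaining (unlabeled) record pairs.
gamma_rest = rng.random((5, 3))
print(clf.predict(gamma_rest))        # hard 0/1 decisions
print(clf.predict_proba(gamma_rest))  # class probabilities
```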

10 Common Solutions: Mixture Model
The comparison vector γ_ij is a realization of a random vector Γ_ij.
Mixture model implementation:
Γ_ij | Δ_ij = 1 ~iid G_M
Γ_ij | Δ_ij = 0 ~iid G_U
Δ_ij ~iid Bernoulli(p)
Used by Winkler (1980), Jaro (1989), Larsen and Rubin (2001), and many others. Often called the Fellegi-Sunter approach, after these authors' seminal paper (JASA, 1969).
Caveat: the mixture components may not be associated with matches and non-matches!
8 / 49
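
A minimal sketch of fitting such a two-component mixture by EM, assuming binary agree/disagree comparisons that are conditionally independent across fields; the data are simulated stand-ins, and a real implementation would monitor convergence.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical binary comparison vectors gamma_ij (1 = fields agree),
# one row per record pair, one column per field.
gamma = rng.integers(0, 2, size=(1000, 4)).astype(float)

# p = P(match), m[f] = P(agree | match), u[f] = P(agree | non-match)
p, m, u = 0.1, np.full(4, 0.8), np.full(4, 0.3)

for _ in range(50):  # EM iterations
    # E-step: posterior match probability for each pair (independence model)
    log_m = gamma @ np.log(m) + (1 - gamma) @ np.log(1 - m)
    log_u = gamma @ np.log(u) + (1 - gamma) @ np.log(1 - u)
    w = p * np.exp(log_m)
    w = w / (w + (1 - p) * np.exp(log_u))
    # M-step: update parameters from the expected assignments
    p = w.mean()
    m = (w[:, None] * gamma).sum(axis=0) / w.sum()
    u = ((1 - w)[:, None] * gamma).sum(axis=0) / (1 - w).sum()

print(p, m.round(2), u.round(2))
```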

13 Drawbacks of the Traditional Methodologies
- They output independent decisions on the coreference status of pairs of records
- Independent pairwise decisions are often non-transitive: a and b declared coreferent, b and c declared coreferent, but a and c declared non-coreferent
- No proper account of uncertainty in linkage decisions
9 / 49
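
The non-transitivity problem is easy to demonstrate; the toy sketch below (illustrative only) shows three pairwise decisions that violate transitivity, and how a union-find transitive closure would force them into one cluster.

```python
# Independent pairwise decisions need not be transitive; a transitive
# closure can repair them, at the cost of merging clusters.
pairwise_links = {("a", "b"): 1, ("b", "c"): 1, ("a", "c"): 0}  # non-transitive!

parent = {r: r for r in "abc"}

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

for (i, j), linked in pairwise_links.items():
    if linked:
        parent[find(i)] = find(j)  # union

# After closure, all three records land in the same cluster.
print({r: find(r) for r in "abc"})
```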

14 Record Linkage à la Fellegi-Sunter
Fellegi-Sunter decision rule: compute the weight
w_ij = log [ P(γ_ij | Δ_ij = 1) / P(γ_ij | Δ_ij = 0) ]
and set
Δ̂_ij = 1 (link), if w_ij > t_M; 0 (non-link), if w_ij < t_U; R ("reject"), otherwise.
Caveat: the comparison vector γ_ij alone determines the linkage decision
10 / 49
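
A minimal sketch of this decision rule under the conditional-independence model, with illustrative m, u, and thresholds t_M, t_U:

```python
import numpy as np

def fs_decision(gamma, m, u, t_M=2.0, t_U=-2.0):
    """Fellegi-Sunter rule: weight each pair, then threshold (illustrative thresholds)."""
    log_m = gamma @ np.log(m) + (1 - gamma) @ np.log(1 - m)
    log_u = gamma @ np.log(u) + (1 - gamma) @ np.log(1 - u)
    w = log_m - log_u
    return np.where(w > t_M, "link", np.where(w < t_U, "non-link", "reject"))

m, u = np.array([0.9, 0.85, 0.8]), np.array([0.3, 0.2, 0.1])
gamma = np.array([[1, 1, 1], [0, 0, 0], [1, 0, 1]], dtype=float)
print(fs_decision(gamma, m, u))  # -> link, non-link, reject
```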

15 Traditional Approaches: What's Wrong? The Δ_ij's are not independent! 11 / 49

16 Overview of New Methodology
- General Bayesian approach to joint duplicate detection and record linkage (DDRL)
- Guarantees transitive coreference decisions
- The Bayesian approach allows us to incorporate prior information on the quality of the datafiles, and to provide a proper account of uncertainty in DDRL decisions
12 / 49

17 Contents Bayesian Partitioning Approach to DDRL Bipartite Record Linkage Population Size Estimation with Linked Files Conclusions 13 / 49

18 Contents Bayesian Partitioning Approach to DDRL Bipartite Record Linkage Population Size Estimation with Linked Files Conclusions 14 / 49

19 Linking Multiple Files that Contain Duplicates
[Figure: three example files with fields Sex, ..., Age. File 1: Male ... 24; Male ... 25; Male ... 24; Male ... 50; Male ... 52; Female ... 20. File 2: Male ... 24; Male ... 10; Female ... 77; Male ... . File 3: Female ... 78; Male ... 25; Female ... . Coreferent records appear both across files and within a file.]
15 / 49

20 Our Parameter of Interest
- The goal is to partition the datafile into groups of coreferent records
- This partition is our parameter of interest, called the coreference partition
- A coreference partition can be represented by a matrix with entries Δ_ij = 1 if records i and j are coreferent, and Δ_ij = 0 otherwise
- Or by labelings of the records (Z_1, ..., Z_r) such that Z_i = Z_j iff records i and j are coreferent
16 / 49
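
The two representations are equivalent; a small sketch of converting between them (labels are illustrative):

```python
import numpy as np

Z = np.array([0, 0, 1, 2, 1])  # cluster labels for r = 5 records

# Matrix representation: Delta[i, j] = 1 iff records i and j are coreferent
Delta = (Z[:, None] == Z[None, :]).astype(int)
print(Delta)

# And back: label each record by the first record in its cluster
Z_back = np.array([np.flatnonzero(row)[0] for row in Delta])
print(Z_back)  # relabeled, but induces the same partition
```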

21 Inference on the Partition of the Concatenated File
Comparison data: Γ_ij = (..., Γ^f_ij, ...)
We propose the model
Γ_ij | Δ_ij = 1, D_i = k, D_j = l ~iid G^kl_1
Γ_ij | Δ_ij = 0, D_i = k, D_j = l ~iid G^kl_0
Δ ~ distribution on valid partitions
where D_i denotes the datafile to which record i belongs. Linkage decisions are based on p(Δ | Γ).
17 / 49

23 Particular Cases and Further Connections
The approach can be constrained to handle important particular cases:
- Duplicate detection within one file (AOAS 2014)
- Traditional record linkage scenario: two files, no duplicates (JASA 2016+)
- Accounting for linkage uncertainty in some models for population size estimation
18 / 49

24 Contents Bayesian Partitioning Approach to DDRL Bipartite Record Linkage Population Size Estimation with Linked Files Conclusions 19 / 49

25 Bipartite Record Linkage: no duplicates within each file 20 / 49

26 Traditional Methodologies
Chains of links can happen!
[Figure: records in files X_1 and X_2 joined by a chain of pairwise links.]
21 / 49

27 Terminology: Bipartite Matching
[Figure: records in files X_1 and X_2 joined by a bipartite matching.]
Representation of bipartite matchings: a labeling Z = (..., Z_j, ...) for the records in file X_2, with
Z_j = i, if records i ∈ X_1 and j ∈ X_2 are coreferent;
Z_j = n_1 + j, if record j has no coreferent record in X_1.
Example: Z_1 = 1, Z_3 = 8
22 / 49
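
A small sketch of this Z-labeling (0-indexed, with illustrative sizes):

```python
n1, n2 = 5, 3  # records in files X1 and X2 (illustrative)

# Z[j] = i (0-indexed here) links record j in X2 to record i in X1;
# Z[j] = n1 + j means record j has no coreferent record in X1.
Z = [0, 6, 2]  # record 1 of X2 is unmatched (6 = n1 + 1)

matched = [(Z[j], j) for j in range(n2) if Z[j] < n1]
unmatched = [j for j in range(n2) if Z[j] >= n1]
print("matched pairs (i in X1, j in X2):", matched)
print("unmatched in X2:", unmatched)
```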

32 Enforcing One-to-One Matching in the Fellegi-Sunter Approach
Jaro (1989) proposes to plug the weights w_ij into a linear sum assignment problem:
maximize Σ_ij w_ij Δ_ij
subject to Δ_ij ∈ {0, 1}; Σ_i Δ_ij ≤ 1 for all j ∈ X_2; Σ_j Δ_ij ≤ 1 for all i ∈ X_1.
23 / 49
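
This assignment problem can be solved with the Hungarian algorithm; a minimal sketch using SciPy, with hypothetical weights (the positive-weight filter stands in for the option of leaving records unmatched):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Rows index records in X1, columns records in X2 (hypothetical weights).
w = np.array([[ 4.1, -2.0,  0.5],
              [-1.0,  3.7, -0.8],
              [ 0.2, -0.5, -3.0],
              [-2.2,  1.0,  2.9]])

rows, cols = linear_sum_assignment(w, maximize=True)
# Keep only assignments with positive weight: a negative-weight "match"
# is worse than leaving both records unmatched.
links = [(i, j) for i, j in zip(rows, cols) if w[i, j] > 0]
print(links)  # -> [(0, 0), (1, 1), (3, 2)]
```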

33 Model for Bipartite Matchings
Γ_ij | Z_j = i ~iid G_M
Γ_ij | Z_j ≠ i ~iid G_U
Z ~ prior on bipartite matchings
24 / 49

34 Beta Prior on Bipartite Matchings
Larsen (2005, 2010):
- The files' overlap size n_12 is Beta-Binomial(n_2, α, β)
- All bipartite matchings with the same n_12 are equally likely
25 / 49
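
A minimal sketch of simulating from this prior, assuming SciPy's beta-binomial distribution and illustrative file sizes and hyperparameters:

```python
import numpy as np
from scipy.stats import betabinom

rng = np.random.default_rng(2)
n1, n2, alpha, beta = 8, 5, 1.0, 1.0  # illustrative sizes and hyperparameters

# Draw the overlap size, then a uniformly random matching of that size.
n12 = betabinom.rvs(n2, alpha, beta, random_state=rng)
matched_j = rng.choice(n2, size=n12, replace=False)  # which X2 records match
matched_i = rng.choice(n1, size=n12, replace=False)  # their partners in X1

Z = np.arange(n1, n1 + n2)  # start with every X2 record unmatched (label n1 + j)
Z[matched_j] = matched_i
print(n12, Z)
```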

35 Models G_M and G_U for Comparison Vectors
Assume comparisons are independent, for both coreferent and non-coreferent records:
Γ^f_ij | Z_j = i ~ Multinomial(1, m_f)
Γ^f_ij | Z_j ≠ i ~ Multinomial(1, u_f)
Flat priors on m_f, u_f, for all f
26 / 49
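
With flat Dirichlet priors, the full conditionals of m_f and u_f given the matching Z are Dirichlet; a sketch of that conjugate update for one field, with illustrative counts:

```python
import numpy as np

rng = np.random.default_rng(3)

# One field f with 3 disagreement levels; counts of gamma^f levels among the
# currently matched and non-matched pairs (hypothetical values).
match_counts = np.array([40, 7, 3])
nonmatch_counts = np.array([50, 200, 750])

# Flat Dirichlet(1, 1, 1) priors, so the posteriors are Dirichlet(counts + 1).
m_f = rng.dirichlet(match_counts + 1)
u_f = rng.dirichlet(nonmatch_counts + 1)
print(m_f.round(3), u_f.round(3))
```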

36 Estimation of the Bipartite Matching Z via MCMC 27 / 49

37 Point Estimation
From the posterior p(Z | γ) we obtain Bayes estimates under the additive loss function
L(Z, Ẑ) = Σ_{j ∈ X_2} L(Z_j, Ẑ_j), with
L(Z_j, Ẑ_j) =
0, if Z_j = Ẑ_j (gets it right);
λ_R, if Ẑ_j = R (does not output a decision: "reject");
λ_FNM, if Z_j ≤ n_1, Ẑ_j = n_1 + j (false non-match);
λ_FM1, if Z_j = n_1 + j, Ẑ_j ≤ n_1 (false match, type 1);
λ_FM2, if Z_j, Ẑ_j ≤ n_1, Z_j ≠ Ẑ_j (false match, type 2).
For generic applications, taking λ_FNM = λ_FM1 = 1 and λ_FM2 = 2 works well.
28-32 / 49

43 Point Estimation
THEOREM. If λ_FM2 ≥ λ_FM1 ≥ 2λ_R > 0 and λ_FNM ≥ 2λ_R in the additive loss function, the Bayes estimate of the bipartite matching is Ẑ = (Ẑ_1, ..., Ẑ_{n_2}), where
Ẑ_j = i, if P(Z_j = i | γ_obs) > 1 − λ_R/λ_FM1 + [(λ_FM2 − λ_FM1)/λ_FM1] P(Z_j ∉ {i, n_1 + j} | γ_obs);
Ẑ_j = n_1 + j, if P(Z_j = n_1 + j | γ_obs) > 1 − λ_R/λ_FNM;
Ẑ_j = R, otherwise.
33 / 49
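
A sketch of applying this thresholding rule to hypothetical posterior probabilities for a single record j, with the generic losses from the previous slide and λ_R = 1/2:

```python
import numpy as np

lam_R, lam_FNM, lam_FM1, lam_FM2 = 0.5, 1.0, 1.0, 2.0
n1, j = 5, 3  # illustrative file size and record index (1-based)

# Hypothetical posterior over candidates: entries 0..4 are i = 1..n1 in X1;
# entries 5..7 are the "unmatched" labels n1 + j' for X2 records j' = 1..3.
p = np.array([0.70, 0.05, 0.02, 0.01, 0.02, 0.0, 0.0, 0.20])
p_nomatch = p[n1 + j - 1]  # P(Z_j = n1 + j | gamma_obs)

best_i = int(np.argmax(p[:n1]))
p_other = 1.0 - p[best_i] - p_nomatch  # P(Z_j not in {best_i, n1 + j})

if p[best_i] > 1 - lam_R/lam_FM1 + (lam_FM2 - lam_FM1)/lam_FM1 * p_other:
    print("link record j to record", best_i + 1)
elif p_nomatch > 1 - lam_R/lam_FNM:
    print("declare record j unmatched")
else:
    print("reject (no decision)")
```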

44 Simulation Setup
- Different scenarios of file overlap and measurement error
- 100 pairs of synthetic datafiles for each scenario
- Each datafile has 500 records
- Four fields: given and family names, age, and occupation
34 / 49

45 Results with Full Assignments
[Figure: precision and recall vs. number of erroneous fields, for the Fellegi-Sunter approach and for Beta Record Linkage, under 100%, 50%, and 10% file overlap. Black lines show medians; gray lines show the 1st and 99th percentiles.]
35 / 49

46 Results with Partial Assignments
[Figure: positive predictive value (PPV), negative predictive value (NPV), and rejection rate (RR) vs. number of erroneous fields, comparing full and partial estimates under the Fellegi-Sunter mixture model and Beta Record Linkage. Datafiles with 10% overlap.]
36 / 49

47 Contents Bayesian Partitioning Approach to DDRL Bipartite Record Linkage Population Size Estimation with Linked Files Conclusions 37 / 49

48 A Case Study from El Salvador
El Salvador underwent a civil war from 1980 until 1991. Three datafiles on casualties were obtained from the Human Rights Data Analysis Group (HRDAG):
- ER-TL: El Rescate - Tutela Legal (4335 records)
- CDHES: Salvadoran Human Rights Commission (1038 records)
Both collected throughout the period of the civil war; reports were investigated before being accepted into the files.
- UNTC: United Nations Truth Commission for El Salvador (5161 records)
Reports collected in 1992 from relatives and friends of the victims. It is common to find different reports on the same event with slightly different information.
DDRL for these datafiles allows us to estimate:
- The number of different reported killings in these datafiles
- The number of civilian killings that occurred during the war
38 / 49

52 Levels of Disagreement for the Salvadoran Datafiles

Field          Similarity Measure     Levels of Disagreement
Year           Absolute difference
Month          Absolute difference
Day            Absolute difference
Municipality   Binary comparison      Agree; Disagree
Given name     Modified Levenshtein   0; (0, 0.25]; (0.25, 0.5]; (0.5, 1]
Family name    Modified Levenshtein   0; (0, 0.25]; (0.25, 0.5]; (0.5, 1]

39 / 49
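
A small sketch of mapping a normalized dissimilarity in [0, 1] into the four levels used for the name fields:

```python
import numpy as np

def disagreement_level(d: float) -> int:
    """0: exact agreement; 1-3: levels (0, 0.25], (0.25, 0.5], (0.5, 1]."""
    return int(np.searchsorted([0.0, 0.25, 0.5], d, side="left"))

for d in [0.0, 0.1, 0.3, 0.9]:
    print(d, "->", disagreement_level(d))  # -> 0, 1, 2, 3
```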

53 Posterior Distribution of the Number of Reported Killings
[Figure: posterior distribution of ñ, the number of reported killings; in red, the value corresponding to the estimated posterior mode of the coreference partition.]
40 / 49

54 Population Size Estimation
Target: the size N of a population. Several models have the capture histories of individuals in multiple samples as sufficient statistics; with three samples, n is the table of counts:

                Third Sample: Yes       Third Sample: No
                Second Sample           Second Sample
First Sample    Yes      No             Yes      No
Yes             n_111    n_101          n_110    n_100
No              n_011    n_001          n_010    (n_000 unobserved)

From these models we can obtain p_A(N | n). We focus on the methodology of Madigan and York (1997): Bayesian model averaging over the class of decomposable graphical models for n.
41 / 49
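
Given a coreference partition of the concatenated files, the capture histories n follow deterministically; a toy sketch with hypothetical entity labels:

```python
from collections import Counter
from itertools import product

# Each set holds the (hypothetical) entity labels appearing in one list,
# as determined by a coreference partition of the concatenated files.
lists = [{1, 2, 3, 5}, {2, 3, 4}, {1, 3, 4, 6}]

entities = set().union(*lists)
n = Counter()
for e in entities:
    history = tuple(int(e in lst) for lst in lists)  # e.g. (1, 0, 1)
    n[history] += 1

for h in product([1, 0], repeat=3):
    if h != (0, 0, 0):  # n_000 is unobservable
        print("n_" + "".join(map(str, h)), "=", n[h])
```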

56 Population Size Estimation with Linked Files
- It is easy to see that n is a deterministic function of Δ
- On the other hand, the uncertainty about Δ is captured by the posterior p_L(Δ | X)
- Is p_C(N | X) = Σ_Δ p_A(N | n(Δ)) p_L(Δ | X) a valid posterior? Yes:
p_C(N | X) ∝ Σ_Δ p(X | Δ) p(N, Δ), where p(X | Δ) is the linkage model likelihood and the joint prior factors as p(N, Δ) = p_A(N | n(Δ)) p(Δ)
42 / 49
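
In practice this posterior can be approximated by Monte Carlo: for each posterior draw of Δ, compute n(Δ) and draw N from p_A(N | n(Δ)). The sketch below uses illustrative stand-in samplers for both distributions.

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_partition():          # stand-in for a draw from p_L(Delta | X)
    return rng.integers(0, 3)    # pretend 3 distinct partitions are plausible

def sample_N_given_n(n_key):     # stand-in for a draw from p_A(N | n(Delta))
    means = {0: 9000, 1: 9500, 2: 10200}
    return rng.poisson(means[n_key])

# Pooling the draws approximates p_C(N | X).
draws = [sample_N_given_n(sample_partition()) for _ in range(5000)]
print(np.mean(draws), np.std(draws))  # posterior mean and sd of N
```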

59 Back to the Salvadoran Datafiles

                UNTC: Present           UNTC: Absent
                CDHES                   CDHES
ER-TL           Present   Absent        Present   Absent
Present         n_111     n_101         n_110     n_100
Absent          n_011     n_001         n_010     (n_000 unobserved)

43 / 49

60 [Figure: posterior distributions of the number of killings, given the posterior mode of the coreference partition versus given a random coreference partition from the posterior; panels show the model average and the decomposable models [ER-TL][CDHES, UNTC], [ER-TL, CDHES][ER-TL, UNTC], [ER-TL, CDHES][CDHES, UNTC], and [ER-TL, UNTC][CDHES, UNTC]. Axes: probability vs. number of killings.]
44 / 49

61 Posterior of the Number of Killings Accounting for Coreference Partition Uncertainty
[Figure: posterior probability vs. number of killings.]
45 / 49

62 Total Variance Decomposition
Var(N | X) = Var_{Δ|X}[E(N | Δ)]                 (DDRL)
           + E_{Δ|X}{Var_{m|Δ}[E(N | Δ, m)]}     (pop. size model)
           + E_{Δ|X}{E_{m|Δ}[Var(N | Δ, m)]}     (intrinsic)
In our application the decomposition is:

DDRL    Pop. Size Model    Intrinsic
2.6%    65.6%              31.8%

46 / 49
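
This is the law of total variance applied twice; a sketch of estimating the three terms from nested Monte Carlo draws (all conditional distributions below are illustrative stand-ins):

```python
import numpy as np

rng = np.random.default_rng(5)

# Nested draws: Delta ~ p(Delta | X), m ~ p(m | Delta), N ~ p(N | Delta, m).
n_delta, n_m = 200, 50
cond_means = np.empty((n_delta, n_m))
cond_vars = np.empty((n_delta, n_m))
for a in range(n_delta):
    mu_delta = rng.normal(10000, 50)                  # effect of the partition draw
    for b in range(n_m):
        cond_means[a, b] = rng.normal(mu_delta, 250)  # effect of the model m
        cond_vars[a, b] = rng.gamma(2.0, 5000)        # intrinsic Var(N | Delta, m)

ddrl = np.var(cond_means.mean(axis=1))      # Var_{Delta|X}[E(N | Delta)]
model = np.mean(cond_means.var(axis=1))     # E_{Delta|X}{Var_m[E(N | Delta, m)]}
intrinsic = np.mean(cond_vars)              # E_{Delta|X}{E_m[Var(N | Delta, m)]}
total = ddrl + model + intrinsic
print((np.array([ddrl, model, intrinsic]) / total).round(3))
```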

64 Contents Bayesian Partitioning Approach to DDRL Bipartite Record Linkage Population Size Estimation with Linked Files Conclusions 47 / 49

65 Conclusions
- Unsupervised approach to duplicate detection and record linkage problems
- Guarantees transitive coreference decisions
- Provides a natural account of uncertainty in the coreference decisions, in the form of a posterior distribution
- The new methodology improves substantially on existing methods
48 / 49

66 Questions? 49 / 49
