Bayesian Estimation of Bipartite Matchings for Record Linkage
1 Bayesian Estimation of Bipartite Matchings for Record Linkage. Mauricio Sadinle, Duke University. Supported by NSF grants SES to Carnegie Mellon University and SES to Duke University.
2 Record Linkage
[figure: linking records across two datafiles; no duplicates within each file]
4 Outline
- The Fellegi-Sunter Framework for Record Linkage
- Record Linkage's Actual Target
- A Bayesian Approach to Bipartite Record Linkage
- Performance Comparison
- Take-Home Message
5 Record Linkage à la Fellegi-Sunter
After Fellegi and Sunter (1969). Datafiles: $X_1$ and $X_2$. Given $i \in X_1$ and $j \in X_2$, define
$$\Delta_{ij} = \begin{cases} 1, & \text{if records } i \text{ and } j \text{ are a match (refer to the same unit);} \\ 0, & \text{otherwise.} \end{cases}$$
Goal: to estimate $\Delta_{ij}$ for each $i \in X_1$ and $j \in X_2$.
8 Record Linkage à la Fellegi-Sunter
The intuition: very similar records are likely to be matches!
[table: records i and j both Male, Gatineau, similar values on the remaining fields]
Very dissimilar records should be non-matches!
[table: record i Male, Gatineau vs. record j Female, Durham, age 60]
9 Record Linkage à la Fellegi-Sunter
For each record pair, compute a comparison vector $\gamma_{ij} = (\dots, \gamma^f_{ij}, \dots)$.
The probabilities $P(\gamma_{ij} \mid \Delta_{ij} = 1)$ and $P(\gamma_{ij} \mid \Delta_{ij} = 0)$ should (generally) be very different.
11 Record Linkage à la Fellegi-Sunter
Fellegi-Sunter decision rule:
$$w_{ij} = \log \frac{P(\gamma_{ij} \mid \Delta_{ij} = 1)}{P(\gamma_{ij} \mid \Delta_{ij} = 0)}, \qquad \hat\Delta_{ij} = \begin{cases} 1, & \text{if } w_{ij} > t_M \text{ (link);} \\ 0, & \text{if } w_{ij} < t_U \text{ (non-link);} \\ R, & \text{otherwise (reject).} \end{cases}$$
Fellegi and Sunter (1969): this rule minimizes $P(\hat\Delta_{ij} = R)$ subject to some nominal error levels.
Caveat: the comparison vector $\gamma_{ij}$ alone determines the linkage decision.
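The decision rule above can be sketched as follows. This is a minimal illustration, not the talk's implementation: the per-field agreement probabilities `m` and `u` and the thresholds `t_M`, `t_U` are hypothetical values, and binary field comparisons with conditional independence are assumed.

```python
import math

def fs_weight(gamma, m, u):
    """Log-likelihood ratio w_ij for a binary comparison vector gamma,
    assuming conditionally independent fields with agreement probabilities
    m (given a match) and u (given a non-match)."""
    w = 0.0
    for g, mf, uf in zip(gamma, m, u):
        if g:  # field agrees
            w += math.log(mf / uf)
        else:  # field disagrees
            w += math.log((1 - mf) / (1 - uf))
    return w

def fs_decide(w, t_M, t_U):
    """Fellegi-Sunter three-way rule: link / non-link / reject."""
    if w > t_M:
        return 1       # link
    if w < t_U:
        return 0       # non-link
    return "R"         # send to clerical review

m = [0.95, 0.9, 0.85]  # P(field agrees | match), illustrative
u = [0.05, 0.1, 0.2]   # P(field agrees | non-match), illustrative

print(fs_decide(fs_weight([1, 1, 1], m, u), t_M=3.0, t_U=-3.0))  # 1
print(fs_decide(fs_weight([0, 0, 0], m, u), t_M=3.0, t_U=-3.0))  # 0
```

Pairs whose weight falls between the two thresholds land in the reject region, which is exactly what the nominal error levels trade off against.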
15 Record Linkage à la Fellegi-Sunter
The comparison vector $\gamma_{ij}$ is a realization of a random vector $\Gamma_{ij}$. Mixture model implementation:
$$\Gamma_{ij} \mid \Delta_{ij} = 1 \overset{iid}{\sim} G_M, \qquad \Gamma_{ij} \mid \Delta_{ij} = 0 \overset{iid}{\sim} G_U, \qquad \Delta_{ij} \overset{iid}{\sim} \text{Bernoulli}(p)$$
Used by Winkler (1980), Jaro (1989), Larsen and Rubin (2001), and many others.
Caveat: the mixture components may not be associated with links and non-links!
18 Record Linkage à la Fellegi-Sunter
Even if the mixture model is successful, chains of links can happen under the Fellegi-Sunter decision rule. [figure: a chain of links across the records of $X_1$ and $X_2$]
19 Record Linkage à la Fellegi-Sunter
What went wrong? The $\Delta_{ij}$'s are not independent!
20 Outline
- The Fellegi-Sunter Framework for Record Linkage
- Record Linkage's Actual Target
- A Bayesian Approach to Bipartite Record Linkage
- Performance Comparison
- Take-Home Message
21 Terminology: Bipartite Matching
[figure: a bipartite matching between the records of $X_1$ and $X_2$]
Representation of bipartite matchings: $Z = (\dots, Z_j, \dots)$ for the records in file $X_2$, with
$$Z_j = \begin{cases} i, & \text{if records } i \text{ and } j \text{ are coreferent;} \\ n_1 + j, & \text{if record } j \text{ has no coreferent record in } X_1. \end{cases}$$
Example (with $n_1 = 5$): $Z_1 = 1$, $Z_2 = 4$, $Z_3 = 8$, $Z_4 = 2$; here $Z_3 = n_1 + 3$ indicates that record 3 of $X_2$ is unmatched.
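The encoding above is easy to manipulate in code. The sketch below (an illustration, not the talk's software) validates a vector Z against the definition and decodes the implied links, using the slide's example with $n_1 = 5$:

```python
def is_bipartite_matching(Z, n1):
    """Check that Z = (Z_1,...,Z_{n2}) encodes a valid bipartite matching:
    Z_j in {1..n1} means record j in X2 is coreferent with record Z_j in X1
    (each X1 record used at most once); Z_j = n1 + j means no match in X1."""
    matched = [z for j, z in enumerate(Z, start=1) if z <= n1]
    unmatched_ok = all(z == n1 + j for j, z in enumerate(Z, start=1) if z > n1)
    return len(matched) == len(set(matched)) and unmatched_ok

def links(Z, n1):
    """Return the matched pairs (i, j) implied by Z."""
    return [(z, j) for j, z in enumerate(Z, start=1) if z <= n1]

# The slide's example with n1 = 5: records 1, 2, 4 in X2 match records
# 1, 4, 2 in X1, and Z_3 = 5 + 3 = 8 marks record 3 as unmatched.
Z = [1, 4, 8, 2]
print(is_bipartite_matching(Z, n1=5))  # True
print(links(Z, n1=5))                  # [(1, 1), (4, 2), (2, 4)]
```

Note that the uniqueness check on the matched labels is exactly the one-to-one constraint that the Fellegi-Sunter rule fails to enforce.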
26 Enforcing One-to-One Matching in the Fellegi-Sunter Approach
Jaro (1989) proposes to plug the weights $w_{ij}$ into a linear sum assignment problem:
$$\max \sum_{i}\sum_{j} w_{ij}\,\Delta_{ij} \quad \text{subject to} \quad \Delta_{ij} \in \{0,1\}; \;\; \sum_i \Delta_{ij} \le 1, \; j \in X_2; \;\; \sum_j \Delta_{ij} \le 1, \; i \in X_1.$$
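A tiny brute-force version of this assignment problem makes the idea concrete. The weight matrix and the positive-weight filter below are illustrative assumptions; a real implementation would use the Hungarian algorithm (e.g., SciPy's `linear_sum_assignment`) rather than enumerating permutations.

```python
from itertools import permutations

def best_assignment(W):
    """Brute-force linear sum assignment for a small n1 x n2 weight matrix
    (rows = X1, columns = X2, n1 <= n2): assign each row to a distinct
    column, maximizing the total weight. Exponential-time; for real data,
    use the Hungarian algorithm."""
    n1, n2 = len(W), len(W[0])
    best, best_pairs = float("-inf"), []
    for cols in permutations(range(n2), n1):
        total = sum(W[i][j] for i, j in enumerate(cols))
        if total > best:
            best, best_pairs = total, list(enumerate(cols))
    return best_pairs

# Hypothetical Fellegi-Sunter weights for 3 records in X1 vs 4 in X2.
W = [[ 6.1, -4.0, -5.2, -3.9],
     [-4.5,  5.0, -4.8, -5.0],
     [-5.1, -4.2, -0.5, -4.7]]

pairs = best_assignment(W)
# Keep only pairs with positive weight, so that forced low-weight
# assignments do not become links (illustrative post-processing).
selected = [(i, j) for i, j in pairs if W[i][j] > 0]
print(selected)  # [(0, 0), (1, 1)]
```

The assignment itself must place every row, so filtering the resulting pairs by weight is what turns a full assignment into a partial one-to-one matching.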
27 Outline
- The Fellegi-Sunter Framework for Record Linkage
- Record Linkage's Actual Target
- A Bayesian Approach to Bipartite Record Linkage
- Performance Comparison
- Take-Home Message
28 Just Modify the Mixture Model!
$$Z \sim \text{Prior on Bipartite Matchings}, \qquad \Gamma_{ij} \mid Z_j = i \overset{iid}{\sim} G_M, \qquad \Gamma_{ij} \mid Z_j \ne i \overset{iid}{\sim} G_U$$
29 Beta Prior on Bipartite Matchings
Larsen (2005, 2010): the files' overlap size $n_{12}$ is Beta-Binomial$(n_2, \alpha, \beta)$, and all bipartite matchings with the same $n_{12}$ are equally likely.
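A draw from this prior can be sketched in two steps: draw the overlap size, then pick a matching of that size uniformly at random. This is an illustrative sketch of the construction, not the paper's code; the truncation of $n_{12}$ at $n_1$ is a simplification added here.

```python
import random

def sample_bipartite_matching(n1, n2, alpha=1.0, beta=1.0, rng=random):
    """Draw Z from the beta prior on bipartite matchings: the overlap size
    n12 is Beta-Binomial(n2, alpha, beta), and given n12 all matchings of
    that size are equally likely. n12 is crudely truncated at n1 here."""
    p = rng.betavariate(alpha, beta)                  # Beta(alpha, beta)
    n12 = sum(rng.random() < p for _ in range(n2))    # Binomial(n2, p) given p
    n12 = min(n12, n1)                                # simplification: truncate
    matched_j = rng.sample(range(1, n2 + 1), n12)     # which X2 records match
    matched_i = rng.sample(range(1, n1 + 1), n12)     # their partners in X1
    Z = {j: n1 + j for j in range(1, n2 + 1)}         # default: unmatched
    Z.update(dict(zip(matched_j, matched_i)))
    return [Z[j] for j in range(1, n2 + 1)]

random.seed(7)
Z = sample_bipartite_matching(n1=5, n2=4)
print(Z)  # a length-4 vector encoding a valid bipartite matching
```

Marginalizing the Beta draw over $p$ is what gives $n_{12}$ its Beta-Binomial distribution, and the uniform choice of partners implements "all matchings with the same $n_{12}$ are equally likely."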
30 Models $G_M$ and $G_U$ for Comparison Vectors
Assume comparisons are independent across fields, for both coreferent and non-coreferent records:
$$\Gamma^f_{ij} \mid Z_j = i \sim \text{Multinomial}(1, \mathbf{m}_f), \qquad \Gamma^f_{ij} \mid Z_j \ne i \sim \text{Multinomial}(1, \mathbf{u}_f)$$
Flat priors on $\mathbf{m}_f$ and $\mathbf{u}_f$, for all $f$.
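Under field independence, the likelihood of a comparison vector is just a product over fields of the relevant multinomial cell probabilities. The sketch below evaluates it in log space; the parameter values `m` and `u` are hypothetical illustrations.

```python
import math

def log_likelihood(gamma, theta):
    """Log-likelihood of a comparison vector gamma = (gamma_1,...,gamma_F),
    where gamma_f is the observed disagreement level of field f and
    theta[f][level] are the Multinomial(1, .) cell probabilities (m_f for
    coreferent pairs, u_f for non-coreferent pairs). Field independence
    is assumed, as on the slide."""
    return sum(math.log(theta[f][g]) for f, g in enumerate(gamma))

# Hypothetical parameters for two fields with 3 disagreement levels each:
m = [[0.90, 0.08, 0.02], [0.85, 0.10, 0.05]]  # coreferent pairs favor level 0
u = [[0.10, 0.30, 0.60], [0.20, 0.30, 0.50]]  # non-coreferent pairs do not

gamma = (0, 0)  # both fields at the lowest disagreement level
w = log_likelihood(gamma, m) - log_likelihood(gamma, u)
print(round(w, 2))  # 3.64
```

The difference `w` is the same log-likelihood ratio that drives the Fellegi-Sunter weight, now with the multinomial levels-of-disagreement parameterization.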
31 Estimation of the Bipartite Matching Z
The posterior distribution of $Z$ is explored via MCMC.
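One building block of such a sampler is the conditional update of a single $Z_j$ given the rest of the matching. The sketch below is a simplified Gibbs-style step, not the paper's exact sampler: it assumes a flat conditional prior over the valid candidates, and `log_lik_ratio` stands in for $\log[P(\gamma_{ij} \mid \text{link}) / P(\gamma_{ij} \mid \text{non-link})]$, which would come from the $G_M$/$G_U$ models.

```python
import math
import random

def resample_Zj(j, Z, n1, log_lik_ratio, rng=random):
    """One Gibbs-style update of Z_j given the rest of Z (a simplified
    sketch): candidates are the X1 records not matched elsewhere, plus
    the 'unmatched' state n1 + j. A flat conditional prior over the
    candidates is assumed for brevity."""
    taken = {z for k, z in enumerate(Z, start=1) if k != j and z <= n1}
    candidates = [i for i in range(1, n1 + 1) if i not in taken] + [n1 + j]
    # Unnormalized log-probabilities: the unmatched state contributes 0.
    logw = [log_lik_ratio(i, j) if i <= n1 else 0.0 for i in candidates]
    mx = max(logw)
    w = [math.exp(x - mx) for x in logw]  # stabilize before exponentiating
    return rng.choices(candidates, weights=w, k=1)[0]

# Toy example: record 2 in X2 strongly resembles record 3 in X1.
ratio = lambda i, j: 8.0 if (i, j) == (3, 2) else -8.0
random.seed(0)
Z = [6, 7, 8, 9]  # n1 = 5; all of X2 initially unmatched
Z[1] = resample_Zj(2, Z, n1=5, log_lik_ratio=ratio)
print(Z)  # [6, 3, 8, 9] with this seed
```

Excluding already-taken X1 records from the candidate set is what keeps every state of the chain a valid bipartite matching.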
32 Point Estimation
From the posterior $P(Z \mid \gamma)$ we obtain Bayes estimates under the additive loss function $L(Z, \hat Z) = \sum_{j \in X_2} L(Z_j, \hat Z_j)$, with
$$L(Z_j, \hat Z_j) = \begin{cases} 0, & \text{if } \hat Z_j = Z_j \text{ (gets it right);} \\ \lambda_R, & \text{if } \hat Z_j = R \text{ (does not output a decision, ``reject'');} \\ \lambda_{FNM}, & \text{if } Z_j \le n_1, \; \hat Z_j = n_1 + j \text{ (false non-match);} \\ \lambda_{FM1}, & \text{if } Z_j = n_1 + j, \; \hat Z_j \le n_1 \text{ (false match, type 1);} \\ \lambda_{FM2}, & \text{if } Z_j, \hat Z_j \le n_1, \; Z_j \ne \hat Z_j \text{ (false match, type 2).} \end{cases}$$
For generic applications, $\lambda_{FNM} = \lambda_{FM1} = 1$ and $\lambda_{FM2} = 2$ works well.
38 Point Estimation
THEOREM. If $\lambda_{FM2} \ge \lambda_{FM1} \ge 2\lambda_R > 0$ and $\lambda_{FNM} \ge 2\lambda_R$ in the additive loss function, the Bayes estimate of the bipartite matching can be obtained from $\hat Z = (\hat Z_1, \dots, \hat Z_{n_2})$, where
$$\hat Z_j = \begin{cases} i, & \text{if } P(Z_j = i \mid \gamma_{obs}) > 1 - \dfrac{\lambda_R}{\lambda_{FM1}} + \dfrac{\lambda_{FM2} - \lambda_{FM1}}{\lambda_{FM1}}\, P(Z_j \notin \{i, n_1 + j\} \mid \gamma_{obs}); \\ n_1 + j, & \text{if } P(Z_j = n_1 + j \mid \gamma_{obs}) > 1 - \dfrac{\lambda_R}{\lambda_{FNM}}; \\ R, & \text{otherwise.} \end{cases}$$
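The theorem's thresholding rule is straightforward to apply once the posterior marginals of each $Z_j$ are available (e.g., from the MCMC output). The sketch below assumes the generic loss values from the previous slide plus an illustrative $\lambda_R = 0.25$; the posterior probabilities are hypothetical numbers.

```python
def bayes_estimate(post_j, j, n1, lam_R=0.25, lam_FNM=1.0, lam_FM1=1.0, lam_FM2=2.0):
    """Bayes estimate of Z_j from its posterior marginal, following the
    theorem on the slide. post_j maps candidates (1..n1 and n1 + j) to
    P(Z_j = . | gamma_obs); the default lambdas satisfy the theorem's
    conditions, with lam_R = 0.25 an illustrative choice."""
    p_unmatched = post_j.get(n1 + j, 0.0)
    for i in range(1, n1 + 1):
        p_i = post_j.get(i, 0.0)
        p_other = 1.0 - p_i - p_unmatched  # P(Z_j not in {i, n1 + j})
        if p_i > 1 - lam_R / lam_FM1 + (lam_FM2 - lam_FM1) / lam_FM1 * p_other:
            return i                       # link j to i
    if p_unmatched > 1 - lam_R / lam_FNM:
        return n1 + j                      # declare j unmatched
    return "R"                             # reject: leave undecided

# Posterior marginals for record j = 2 with n1 = 5 (hypothetical numbers):
print(bayes_estimate({3: 0.90, 1: 0.04, 7: 0.06}, j=2, n1=5))  # 3
print(bayes_estimate({3: 0.50, 1: 0.20, 7: 0.30}, j=2, n1=5))  # R
```

With these lambdas, a link requires high posterior mass on one specific partner, and an ambiguous posterior falls through to the reject decision, mirroring the three-way output of the Fellegi-Sunter rule.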
39 Outline
- The Fellegi-Sunter Framework for Record Linkage
- Record Linkage's Actual Target
- A Bayesian Approach to Bipartite Record Linkage
- Performance Comparison
- Take-Home Message
40 Simulation Setup
Different scenarios of file overlap and measurement error; 100 pairs of synthetic datafiles for each scenario; each datafile has 500 records; four fields: given and family names, age, and occupation.
41 Results with Full Assignments
[figure: precision and recall vs. number of erroneous fields, for Fellegi-Sunter and Beta Record Linkage, under 100%, 50%, and 10% overlap] Solid lines refer to precision, dashed lines to recall, black lines show medians, and gray lines show first and 99th percentiles.
42 Results with Partial Assignments
[figure: full and partial estimates vs. number of erroneous fields, for the Fellegi-Sunter mixture model and Beta Record Linkage, on datafiles with 10% overlap] Solid lines refer to precision or positive predictive value (PPV), dashed lines to negative predictive value (NPV), dot-dashed lines to rejection rate (RR), black lines show medians, and gray lines show first and 99th percentiles.
43 Outline
- The Fellegi-Sunter Framework for Record Linkage
- Record Linkage's Actual Target
- A Bayesian Approach to Bipartite Record Linkage
- Performance Comparison
- Take-Home Message
44 Take-Home Message
The optimality of the Fellegi-Sunter decision rule depends on conditions that are not met in practice. The mixture-model implementation of Fellegi-Sunter has a number of weaknesses. The new methodology improves on existing methods.
45 Questions? For further details see: Sadinle (2016+). Bayesian Estimation of Bipartite Matchings for Record Linkage. Journal of the American Statistical Association (in press).
46 Levels of Disagreement
Compute a measure of similarity and divide its range into levels of disagreement.
[table: field / similarity measure / levels of disagreement — Age: absolute difference, cut into levels; Race: binary comparison (agree, disagree)]
Example: comparing the ages of records $i$ and $j$ yields $\gamma^{Age}_{ij} = 3$, and stacking the fields gives $\gamma_{ij} = (\dots, \gamma^f_{ij}, \dots)$.
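The construction can be sketched concretely using the cut points from the simulation-setup table (normalized Levenshtein for names, binary comparison for age and occupation). The record values below are hypothetical illustrations.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def name_level(a, b):
    """Disagreement level for a name field, using the cut points from the
    simulation setup: normalized Levenshtein in 0, (0, .25], (.25, .5], (.5, 1]."""
    d = levenshtein(a, b) / max(len(a), len(b), 1)
    if d == 0:
        return 0
    if d <= 0.25:
        return 1
    if d <= 0.5:
        return 2
    return 3

def binary_level(a, b):
    """Agree/disagree comparison for age and occupation."""
    return 0 if a == b else 1

# Comparison vector for a hypothetical record pair:
rec_i = {"given": "Mauricio", "family": "Sadinle", "age": 30, "occ": "statistician"}
rec_j = {"given": "Maurício", "family": "Sadinle", "age": 31, "occ": "statistician"}
gamma = (name_level(rec_i["given"], rec_j["given"]),
         name_level(rec_i["family"], rec_j["family"]),
         binary_level(rec_i["age"], rec_j["age"]),
         binary_level(rec_i["occ"], rec_j["occ"]))
print(gamma)  # (1, 0, 1, 0)
```

Discretizing the similarity into a small number of levels is what allows the multinomial $G_M$/$G_U$ models to be fit per level.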
49 Simulation Setup
Table: types of errors per field in the simulation study. Fields: given and family names, age and occupation. Types of error: missing values, edits, OCR, keyboard, phonetic.
Table: construction of disagreement levels in the simulation study.
- Given and family names (Levenshtein similarity): levels 0, (0, .25], (.25, .5], (.5, 1].
- Age and occupation (binary comparison): agree, disagree.
51 Training Classifiers for Record Linkage
Take a sample of record pairs and determine whether each refers to the same entity or not (e.g., by clerical review). Train your favorite classification model (logistic regression, SVM, random forest, ...). Predict the coreference status of the remaining record pairs.
Training assumes the pairs are i.i.d., and the method takes independent decisions: chains of links are possible.
More informationShort Note: Naive Bayes Classifiers and Permanence of Ratios
Short Note: Naive Bayes Classifiers and Permanence of Ratios Julián M. Ortiz (jmo1@ualberta.ca) Department of Civil & Environmental Engineering University of Alberta Abstract The assumption of permanence
More informationSTA 414/2104: Machine Learning
STA 414/2104: Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistics! rsalakhu@cs.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 9 Sequential Data So far
More informationLINEAR MODELS FOR CLASSIFICATION. J. Elder CSE 6390/PSYC 6225 Computational Modeling of Visual Perception
LINEAR MODELS FOR CLASSIFICATION Classification: Problem Statement 2 In regression, we are modeling the relationship between a continuous input variable x and a continuous target variable t. In classification,
More informationStatistical Practice
Statistical Practice A Note on Bayesian Inference After Multiple Imputation Xiang ZHOU and Jerome P. REITER This article is aimed at practitioners who plan to use Bayesian inference on multiply-imputed
More information> DEPARTMENT OF MATHEMATICS AND COMPUTER SCIENCE GRAVIS 2016 BASEL. Logistic Regression. Pattern Recognition 2016 Sandro Schönborn University of Basel
Logistic Regression Pattern Recognition 2016 Sandro Schönborn University of Basel Two Worlds: Probabilistic & Algorithmic We have seen two conceptual approaches to classification: data class density estimation
More informationIntroduction to Machine Learning
Outline Introduction to Machine Learning Bayesian Classification Varun Chandola March 8, 017 1. {circular,large,light,smooth,thick}, malignant. {circular,large,light,irregular,thick}, malignant 3. {oval,large,dark,smooth,thin},
More informationTopic 2: Logistic Regression
CS 4850/6850: Introduction to Machine Learning Fall 208 Topic 2: Logistic Regression Instructor: Daniel L. Pimentel-Alarcón c Copyright 208 2. Introduction Arguably the simplest task that we can teach
More informationGeneralized logit models for nominal multinomial responses. Local odds ratios
Generalized logit models for nominal multinomial responses Categorical Data Analysis, Summer 2015 1/17 Local odds ratios Y 1 2 3 4 1 π 11 π 12 π 13 π 14 π 1+ X 2 π 21 π 22 π 23 π 24 π 2+ 3 π 31 π 32 π
More informationAlgorithm Independent Topics Lecture 6
Algorithm Independent Topics Lecture 6 Jason Corso SUNY at Buffalo Feb. 23 2009 J. Corso (SUNY at Buffalo) Algorithm Independent Topics Lecture 6 Feb. 23 2009 1 / 45 Introduction Now that we ve built an
More informationEnsemble Methods. NLP ML Web! Fall 2013! Andrew Rosenberg! TA/Grader: David Guy Brizan
Ensemble Methods NLP ML Web! Fall 2013! Andrew Rosenberg! TA/Grader: David Guy Brizan How do you make a decision? What do you want for lunch today?! What did you have last night?! What are your favorite
More informationMachine Learning. Gaussian Mixture Models. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall
Machine Learning Gaussian Mixture Models Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 1 The Generative Model POV We think of the data as being generated from some process. We assume
More informationMachine Learning. Theory of Classification and Nonparametric Classifier. Lecture 2, January 16, What is theoretically the best classifier
Machine Learning 10-701/15 701/15-781, 781, Spring 2008 Theory of Classification and Nonparametric Classifier Eric Xing Lecture 2, January 16, 2006 Reading: Chap. 2,5 CB and handouts Outline What is theoretically
More informationVBM683 Machine Learning
VBM683 Machine Learning Pinar Duygulu Slides are adapted from Dhruv Batra Bias is the algorithm's tendency to consistently learn the wrong thing by not taking into account all the information in the data
More informationApplications of Information Theory in Plant Disease Management 1
Applications of Information Theory in Plant Disease Management Gareth Hughes School of Biological Sciences, University of Edinburgh, Edinburgh EH9 3JG, UK We seek the truth, but, in this life, we usually
More informationLecture 9: Classification, LDA
Lecture 9: Classification, LDA Reading: Chapter 4 STATS 202: Data mining and analysis October 13, 2017 1 / 21 Review: Main strategy in Chapter 4 Find an estimate ˆP (Y X). Then, given an input x 0, we
More informationSUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION
SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION 1 Outline Basic terminology Features Training and validation Model selection Error and loss measures Statistical comparison Evaluation measures 2 Terminology
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear
More informationRandomized Decision Trees
Randomized Decision Trees compiled by Alvin Wan from Professor Jitendra Malik s lecture Discrete Variables First, let us consider some terminology. We have primarily been dealing with real-valued data,
More informationSTA102 Class Notes Chapter Logistic Regression
STA0 Class Notes Chapter 0 0. Logistic Regression We continue to study the relationship between a response variable and one or more eplanatory variables. For SLR and MLR (Chapters 8 and 9), our response
More informationArtificial Neural Networks
Introduction ANN in Action Final Observations Application: Poverty Detection Artificial Neural Networks Alvaro J. Riascos Villegas University of los Andes and Quantil July 6 2018 Artificial Neural Networks
More informationECE 5984: Introduction to Machine Learning
ECE 5984: Introduction to Machine Learning Topics: Ensemble Methods: Bagging, Boosting Readings: Murphy 16.4; Hastie 16 Dhruv Batra Virginia Tech Administrativia HW3 Due: April 14, 11:55pm You will implement
More informationLecture 13 : Variational Inference: Mean Field Approximation
10-708: Probabilistic Graphical Models 10-708, Spring 2017 Lecture 13 : Variational Inference: Mean Field Approximation Lecturer: Willie Neiswanger Scribes: Xupeng Tong, Minxing Liu 1 Problem Setup 1.1
More informationLinear Classification
Linear Classification Lili MOU moull12@sei.pku.edu.cn http://sei.pku.edu.cn/ moull12 23 April 2015 Outline Introduction Discriminant Functions Probabilistic Generative Models Probabilistic Discriminative
More informationMidterm Review CS 7301: Advanced Machine Learning. Vibhav Gogate The University of Texas at Dallas
Midterm Review CS 7301: Advanced Machine Learning Vibhav Gogate The University of Texas at Dallas Supervised Learning Issues in supervised learning What makes learning hard Point Estimation: MLE vs Bayesian
More informationLecture 11. Linear Soft Margin Support Vector Machines
CS142: Machine Learning Spring 2017 Lecture 11 Instructor: Pedro Felzenszwalb Scribes: Dan Xiang, Tyler Dae Devlin Linear Soft Margin Support Vector Machines We continue our discussion of linear soft margin
More informationMachine Learning for Signal Processing Bayes Classification and Regression
Machine Learning for Signal Processing Bayes Classification and Regression Instructor: Bhiksha Raj 11755/18797 1 Recap: KNN A very effective and simple way of performing classification Simple model: For
More informationA union of Bayesian, frequentist and fiducial inferences by confidence distribution and artificial data sampling
A union of Bayesian, frequentist and fiducial inferences by confidence distribution and artificial data sampling Min-ge Xie Department of Statistics, Rutgers University Workshop on Higher-Order Asymptotics
More informationGenerative Model (Naïve Bayes, LDA)
Generative Model (Naïve Bayes, LDA) IST557 Data Mining: Techniques and Applications Jessie Li, Penn State University Materials from Prof. Jia Li, sta3s3cal learning book (Has3e et al.), and machine learning
More information