Statistical models for record linkage


Training course on record linkage
Marco Fortini, Istat
fortini@istat.it

Estimation problems
- Parameter estimation for record linkage
  - Use of data from prior studies
  - Use of current samples (clerical review)
  - Maximum likelihood estimation under the independence assumption
  - More complex models
- Overcoming agreement/disagreement
- Frequency-based matching

Parameters
To decide whether a pair $(a,b)$ is a match or not, it is necessary to build the ratio $r(y) = m(y)/u(y)$. There are essentially three methods:
1) Data from prior studies
2) Analysis of special samples from the data set
3) Estimation from the current files A and B

Data from prior studies
Prior studies are quite rare, but they may suggest useful information.
- Newcombe (1988) elaborates some values of the 0-1 comparison distribution for M and U for some common key variables in the UK.

Identifier                     Surname            Forename           Year of birth
Comparison outcome             Agree   Disagree   Agree   Disagree   Agree   Disagree
Relative frequencies (%)
  Links (m)                    96.5    3.5        79.0    21.0       77.3    22.2
  Non-links (u)                0.1     99.9       0.9     99.1       1.1     98.9
Ratios
  r (m/u)                      965/1   1/29       88/1    1/5        70/1    1/4

Use of current samples
The u distribution can be approximated by the unconditional distribution of Y (in fact, only a small fraction of the $n_A \times n_B$ pairs is a match).
From Stat. New Zealand 2006

Special samples
Making inference on the distribution m is more complicated. A possibility is to use preliminary clerical work to identify a sample of matches on which to compute the m probabilities.
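The ratios in the Newcombe (1988) table can be checked with a few lines of Python. This is only an illustrative sketch: the function name `ratio` and the dictionary layout are assumptions made here, and the percentages are the ones read from the table.

```python
# Relative frequencies (%) read from the Newcombe (1988) table:
# (agree among links, agree among non-links, disagree among links, disagree among non-links)
newcombe = {
    "Surname":       (96.5, 0.1, 3.5, 99.9),
    "Forename":      (79.0, 0.9, 21.0, 99.1),
    "Year of birth": (77.3, 1.1, 22.2, 98.9),
}


def ratio(m_pct, u_pct):
    """Likelihood ratio r = m/u for one comparison outcome."""
    return m_pct / u_pct


for name, (m_agree, u_agree, m_dis, u_dis) in newcombe.items():
    r_agree = ratio(m_agree, u_agree)   # much larger than 1: agreement supports a link
    r_dis = ratio(m_dis, u_dis)         # smaller than 1: disagreement supports a non-link
    print(f"{name}: agree {r_agree:.0f}/1, disagree 1/{1 / r_dis:.0f}")
```

Agreement on the surname is far more informative than agreement on the year of birth because surnames rarely agree by chance among non-links.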

Special samples
Example: Copas and Hilton (1990) linked arrivals and departures in the UK.
- The first step of manual linkage (relative to the arrivals in a couple of weeks) gives the data set for estimating m.
- Then probabilistic record linkage among arrivals and departures is done using the key variables.

Estimation
The framework for record linkage is suitable for applying a particular estimation technique: the EM algorithm.

Statistical model
Y is distributed differently for M and U:
$m(y) = P(Y = y \mid C = 1)$
$u(y) = P(Y = y \mid C = 0)$
$P(C = 1)$ is the probability that a random choice from $\Omega$ returns a match (a pair from M). This probability is given by the fraction of matches over the whole number of pairs:
$P(C = 1) = p$, $P(C = 0) = 1 - p$

Statistical model
The joint probability that a pair has $C = c$ ($c = 1$ or $c = 0$) and $Y = y$ is:
$$P(Y = y, C = c) = [p\,m(y)]^{c}\,[(1-p)\,u(y)]^{1-c}$$

The likelihood for our parameters (the distribution m, the distribution u and the parameter p) is:
$$L(p, m, u) = \prod_{(a,b)} [p\,m(y_{ab})]^{c_{ab}}\,[(1-p)\,u(y_{ab})]^{1-c_{ab}}$$
where
- $p$, $m(y)$ and $u(y)$ are unknown parameters
- $y_{ab}$ is the observation
- $c_{ab}$ are latent (unknown) values

Maximum likelihood estimate of the unknown parameters
When the likelihood contains unknown data (in this case the status $c_{ab}$ of a pair), maximum likelihood estimates can be found by means of an iterative procedure known as EM (Expectation-Maximization).

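Maximum likelihood estimate of the unknown parameters
- 0 step: fix $p$, $m(y)$ and $u(y)$ at initial values.
- E step: substitute the unknown values $c_{ab}$ with their expected values $r^*(y_{ab})$. Pairs $(a,b)$ sharing the same pattern $y$ of key variables have the same value $\hat c_{ab} = r^*(y)$.
- M step: compute the $p$, $m(y)$ and $u(y)$ that maximize $L$ on the data set completed in the E step.

Conditional independence
When statistical independence among the key variables holds conditionally on the membership of the pair $(a,b)$ in M or U,
$$P(Y = y, C = c) = \prod_{k=1}^{K} P(Y_k = y_k \mid C = c)\,P(C = c)$$
then the maximum likelihood solution of the EM algorithm exists in closed form.

ML estimation via EM under conditional independence
E step:
$$r^*(y) = \frac{p \prod_{k=1}^{K} m_k^{y_k}(1-m_k)^{1-y_k}}{p \prod_{k=1}^{K} m_k^{y_k}(1-m_k)^{1-y_k} + (1-p) \prod_{k=1}^{K} u_k^{y_k}(1-u_k)^{1-y_k}}$$
M step:
$$m_k = \frac{\sum_{(a,b)} \hat c_{ab}\, y_{ab,k}}{\sum_{(a,b)} \hat c_{ab}}, \qquad u_k = \frac{\sum_{(a,b)} (1-\hat c_{ab})\, y_{ab,k}}{\sum_{(a,b)} (1-\hat c_{ab})}, \qquad p = \frac{\sum_{(a,b)} \hat c_{ab}}{N}$$
where $m_k = P(Y_k = 1 \mid C = 1)$, $u_k = P(Y_k = 1 \mid C = 0)$ and $N$ is the number of pairs.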
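The E and M steps under conditional independence can be sketched in plain Python as follows. This is a minimal illustration, not the course's own implementation: the function names, the grouping of pairs by agreement pattern, the fixed number of iterations and the parameter values in the usage example are all assumptions made here.

```python
from itertools import product
from math import prod


def pattern_prob(y, probs):
    """P(Y = y | class) when the key variables are conditionally independent."""
    return prod(q if yk else 1.0 - q for yk, q in zip(y, probs))


def em_conditional_independence(patterns, counts, p, m, u, n_iter=100):
    """EM for the match/non-match mixture with 0-1 comparisons.

    patterns: list of 0/1 tuples, one per distinct agreement pattern
    counts:   number of pairs showing each pattern
    p, m, u:  starting values (the "0 step")
    """
    n_pairs = sum(counts)
    n_vars = len(m)
    for _ in range(n_iter):
        # E step: expected match status r*(y), identical for pairs sharing a pattern
        c_hat = []
        for y in patterns:
            pm = p * pattern_prob(y, m)
            pu = (1.0 - p) * pattern_prob(y, u)
            c_hat.append(pm / (pm + pu))
        # M step: closed-form updates of m_k, u_k and p
        exp_matches = sum(n * c for n, c in zip(counts, c_hat))
        exp_nonmatches = n_pairs - exp_matches
        m = [sum(n * c * y[k] for n, c, y in zip(counts, c_hat, patterns)) / exp_matches
             for k in range(n_vars)]
        u = [sum(n * (1.0 - c) * y[k] for n, c, y in zip(counts, c_hat, patterns)) / exp_nonmatches
             for k in range(n_vars)]
        p = exp_matches / n_pairs
    return p, m, u


# Hypothetical usage: expected pattern counts generated from made-up parameters.
true_p, true_m, true_u = 0.02, [0.8, 0.8, 0.9, 0.85], [0.01, 0.05, 0.5, 0.02]
patterns = list(product([0, 1], repeat=4))
counts = [1_000_000 * (true_p * pattern_prob(y, true_m)
                       + (1 - true_p) * pattern_prob(y, true_u)) for y in patterns]
p_hat, m_hat, u_hat = em_conditional_independence(
    patterns, counts, p=0.05, m=[0.7] * 4, u=[0.1] * 4)
```

Working on pattern counts rather than on individual pairs keeps the loop short, since with K binary key variables there are at most 2^K distinct patterns. A production version would also need a convergence check and guards against probabilities reaching 0 or 1.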

Models
- Jaro (1989) assumes independence between the key variables for both M and U.
- Estimation is very easy, but this model rarely holds.
- Under independence the number of key variables must be at least 3 (otherwise there are identification problems).

Relationships among key variables
- Practical experience shows the robustness of the independence assumption.
- Sometimes the independence assumption does not hold, producing bias in the estimates of p, m and u.
- Two main effects can cause dependence:
  - Association among key variables for non-matches
  - Association between errors in key variables for matches

Example: conditional independence

Y               P(Y=1|M)   P(Y=1|U)
Surname         0.8063     0.0011
Forename        0.7936     0.0360
Gender          0.8856     0.4713
Year of birth   0.8713     0.0113

P(M) = 0.0234

Distribution of r: patterns of key variables

Rank  Y1 Y2 Y3 Y4  P(Y1|M)  P(Y1|U)  P(Y2|M)  P(Y2|U)  P(Y3|M)  P(Y3|U)  P(Y4|M)  P(Y4|U)  P(Y|M)    P(Y|U)    r
1     0  0  0  0   0.1937   0.9989   0.2064   0.964    0.1144   0.5287   0.1287   0.9887   0.000589  0.503353  0.001169
2     0  0  1  0   0.1937   0.9989   0.2064   0.964    0.8856   0.4713   0.1287   0.9887   0.004557  0.448705  0.010155
3     0  1  0  0   0.1937   0.9989   0.7936   0.036    0.1144   0.5287   0.1287   0.9887   0.002263  0.018797  0.120403
4     0  0  0  1   0.1937   0.9989   0.2064   0.964    0.1144   0.5287   0.8713   0.0113   0.003985  0.005753  0.692702
5     0  1  1  0   0.1937   0.9989   0.7936   0.036    0.8856   0.4713   0.1287   0.9887   0.01752   0.016757  1.045589
6     1  0  0  0   0.8063   0.0011   0.2064   0.964    0.1144   0.5287   0.1287   0.9887   0.00245   0.000554  4.420459
7     0  0  1  1   0.1937   0.9989   0.2064   0.964    0.8856   0.4713   0.8713   0.0113   0.030849  0.005128  6.015472
8     1  0  1  0   0.8063   0.0011   0.2064   0.964    0.8856   0.4713   0.1287   0.9887   0.018968  0.000494  38.38759
9     0  1  0  1   0.1937   0.9989   0.7936   0.036    0.1144   0.5287   0.8713   0.0113   0.015322  0.000215  71.32023
10    1  1  0  0   0.8063   0.0011   0.7936   0.036    0.1144   0.5287   0.1287   0.9887   0.009421  2.07E-05  455.1283
11    0  1  1  1   0.1937   0.9989   0.7936   0.036    0.8856   0.4713   0.8713   0.0113   0.118614  0.000192  619.3501
12    1  0  0  1   0.8063   0.0011   0.2064   0.964    0.1144   0.5287   0.8713   0.0113   0.016588  6.34E-06  2618.44
13    1  1  1  0   0.8063   0.0011   0.7936   0.036    0.8856   0.4713   0.1287   0.9887   0.072931  1.85E-05  3952.367
14    1  0  1  1   0.8063   0.0011   0.2064   0.964    0.8856   0.4713   0.8713   0.0113   0.128414  5.65E-06  22738.72
15    1  1  0  1   0.8063   0.0011   0.7936   0.036    0.1144   0.5287   0.8713   0.0113   0.063781  2.37E-07  269593.3
16    1  1  1  1   0.8063   0.0011   0.7936   0.036    0.8856   0.4713   0.8713   0.0113   0.493746  2.11E-07  2341168

Here P(Yk|M) and P(Yk|U) are the probabilities of the observed outcome of variable k, and P(Y|M) and P(Y|U) are their products over the four variables.

Error in parameters estimation

Magnitude of errors   0.8        1.2
Y                     P(Y=1|M)   P(Y=1|U)
Surname               0.6450     0.0013
Forename              0.6348     0.0432
Gender                0.7085     0.5656
Year of birth         0.69703    0.0135

P(M) = 0.0234

Change of ranking due to error in parameters
(r: ratio with the original parameters; r': ratio with the perturbed parameters, shown to six significant figures)

Rank  Y1 Y2 Y3 Y4  r          r'
1     0  0  0  0   0.001169   0.0279525
2     0  0  1  0   0.010155   0.0521832
3     0  1  0  0   0.120403   1.07650
5     0  1  1  0   1.045589   2.00967
4     0  0  0  1   0.692702   4.67847
7     0  0  1  1   6.015472   8.73403
6     1  0  0  0   4.420459   38.4309
8     1  0  1  0   38.38759   71.7448
9     0  1  0  1   71.32023   180.176
11    0  1  1  1   619.3501   336.363
10    1  1  0  0   455.1283   1480.04
13    1  1  1  0   3952.367   2763.02
12    1  0  0  1   2618.44    6432.26
14    1  0  1  1   22738.72   12008.1
15    1  1  0  1   269593.3   247718
16    1  1  1  1   2341168    462453
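The two rankings can be reproduced with a short Python sketch. It assumes conditional independence, uses the m and u probabilities of the example, and interprets the "magnitude of errors" as multiplying every m probability by 0.8 and every u probability by 1.2. The function names are illustrative.

```python
from itertools import product
from math import prod

# P(Y_k = 1 | M) and P(Y_k = 1 | U) for surname, forename, gender, year of birth
m = [0.8063, 0.7936, 0.8856, 0.8713]
u = [0.0011, 0.0360, 0.4713, 0.0113]


def likelihood_ratio(y, m, u):
    """r(y) = P(y | M) / P(y | U) under conditional independence."""
    num = prod(mk if yk else 1.0 - mk for yk, mk in zip(y, m))
    den = prod(uk if yk else 1.0 - uk for yk, uk in zip(y, u))
    return num / den


def ranked_patterns(m, u):
    """All 0/1 agreement patterns sorted by increasing likelihood ratio."""
    patterns = list(product([0, 1], repeat=len(m)))
    return sorted(patterns, key=lambda y: likelihood_ratio(y, m, u))


true_rank = ranked_patterns(m, u)

# Assumed perturbation: m underestimated by 20%, u overestimated by 20%
m_err = [0.8 * mk for mk in m]
u_err = [1.2 * uk for uk in u]
err_rank = ranked_patterns(m_err, u_err)

# Positions at which the two rankings disagree
moved = [i + 1 for i, (a, b) in enumerate(zip(true_rank, err_rank)) if a != b]
print(moved)
```

The extreme patterns keep their place. The swaps involve adjacent patterns in the middle of the ranking, which is where the decision thresholds usually fall.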

Association among key variables
It affects non-matches, i.e. pairs in U.
Example: the age of people is associated with their marital status.
- Young people are more likely to be single.
- Elderly people are more likely to be widows/widowers.
So two different people sharing the same year of birth are more likely to share the same marital status.

Association among errors in key variables
It affects matches, i.e. pairs in M.
Example: swap of surname and name.
So specific populations (e.g. foreigners) can experience a swap of name and surname in one of the two sources, causing an error in both key variables.

Models
Many authors have tried to define more complex models. When using the 0-1 comparison function, a suitable class of models is represented by the loglinear models. Applications include Thibaudeau (1993) and Armstrong and Mayda (1993). However, the loglinear model should be known in advance.

Tests
Winkler (1993) suggests using a loglinear model which is sufficiently general, such as a loglinear model with all the three-way interactions. Furthermore, estimation by the EM algorithm can be constrained (for example $p < n_A/(n_A n_B)$).

Other approaches
- Bayesian approaches: the status of a pair becomes a parameter to estimate (Fortini et al. 2001).
- Iterative approaches: alternate steps of probabilistic record linkage and clerical review (Larsen and Rubin 2001).

Improvements of the theory
- Overcoming agreement/disagreement: how to take into account comparisons in a continuous domain
- How to take advantage of the frequency distribution of the key variables

Overcome agreement/disagreement - 1
Agreement/disagreement comparisons could cause too much information loss for some variables:
- Age of people
- Date of events
- Turnover of firms
- Distances among string variables
When comparisons are made on a continuous scale, it is plausible that matches show small differences compared to non-matches. For example, if two people sharing the same name and surname differ by only one year of age, we will be more confident that they are the same person than if their age difference is, say, 5 years.

Overcome agreement/disagreement - 2
The likelihood ratio r can be adjusted in order to account for comparisons among continuous variables. The adjustment is based on a comparison function assuming values between 0 and 1:
$$\alpha = f(age(a), age(b)) = 1 - \frac{|age(a) - age(b)|}{\max\{\max(age(A)) - \min(age(B)),\ \max(age(B)) - \min(age(A))\}}$$

Overcome agreement/disagreement - 3
- $\alpha \in [0,1]$: result of the comparison
- $\theta \in [0,1]$: a lower bound chosen so as to designate as disagreement any comparison based on values $\alpha < \theta$
- $r_0 = P(y = 0 \mid M) / P(y = 0 \mid U)$: likelihood ratio from EM for pairs in which the variable disagrees
- $r_1 = P(y = 1 \mid M) / P(y = 1 \mid U)$: likelihood ratio from EM for pairs in which the variable agrees

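Overcome agreement/disagreement - 4
Let $r_0$ and $r_1$ be the EM likelihood ratios for disagreement and agreement on the variable.
- If $\alpha < \theta$ then the EM weight remains unchanged: $r(\alpha) = r_0$.
- If $\alpha \ge \theta$ then the resulting weight $r(\alpha)$ is given by a linear combination of $r_0$ and $r_1$.

Overcome agreement/disagreement - 5
[Figure: the adjusted weight $r(\alpha)$ plotted against $\alpha$, between the levels $r_0$ and $r_1$]

Frequency-based matching
- Useful when the distribution of the attributes of a key variable is not uniform
- Can take into account the information carried by rare values (e.g. rare names or surnames)
- Consists in the allocation of an outcome-specific weight to the pairs
- More weight goes to pairs that agree on rare values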
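For the continuous-comparison adjustment described above, the exact coefficients of the linear combination cannot be read from the slides. The sketch below therefore assumes the simplest choice, a linear interpolation that equals $r_0$ at $\alpha = \theta$ and $r_1$ at $\alpha = 1$. The function names, the `max_gap` argument and the interpolation itself are assumptions made for illustration.

```python
def similarity(age_a, age_b, max_gap):
    """Comparison value alpha in [0, 1]; max_gap is the largest possible age difference (assumed known)."""
    return 1.0 - abs(age_a - age_b) / max_gap


def adjusted_ratio(alpha, theta, r0, r1):
    """Weight for a continuous comparison.

    Below the threshold theta the pair is treated as a disagreement (r0).
    Above it, r0 and r1 are combined linearly; the interpolation used here
    is an assumption, not necessarily the formula of the original slides.
    Requires theta < 1.
    """
    if alpha < theta:
        return r0
    return r0 + (r1 - r0) * (alpha - theta) / (1.0 - theta)


# Hypothetical values: r0 and r1 would come from the EM estimates
r0, r1, theta = 0.15, 77.0, 0.9
close = adjusted_ratio(similarity(40, 41, 80), theta, r0, r1)   # one year apart
far = adjusted_ratio(similarity(40, 55, 80), theta, r0, r1)     # fifteen years apart
```

With these made-up numbers, a one-year difference keeps most of the agreement weight, while a fifteen-year difference falls below the threshold and gets the disagreement weight.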

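Larger weights for agreement on rare events
$P(M \mid y = \text{Zabrinsky}) > P(M \mid y = \text{Smith})$
implies that
$P(U \mid y = \text{Smith}) > P(U \mid y = \text{Zabrinsky})$
and, using Bayes' theorem,
$$\frac{P(y = \text{Zabrinsky} \mid M)}{P(y = \text{Zabrinsky} \mid U)} > \frac{P(y = \text{Smith} \mid M)}{P(y = \text{Smith} \mid U)}$$

How to calculate weights - 1
Solution by Fellegi and Sunter (1969):
- $f_1, f_2, \dots, f_m$ with $N_A = \sum_i f_i$ (File A)
- $g_1, g_2, \dots, g_m$ with $N_B = \sum_i g_i$ (File B)
- $f_i\, g_i$: frequency of the i-th occurrence among the $A \times B$ pairs
- Frequency in the set of matching pairs: $h_1, h_2, \dots, h_m$ with $N_M = \sum_i h_i$, where $h_i \le \min(f_i, g_i)$

How to calculate weights - 2
Assuming $h_i = \min(f_i, g_i)$ we obtain
$$P(\text{agreement on string } i \mid M) = \frac{h_i}{N_M}$$
$$P(\text{agreement on string } i \mid U) = \frac{f_i\, g_i - h_i}{N_A N_B - N_M}$$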
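The frequency-based probabilities above can be computed directly from the value counts of the two files. The following is a minimal sketch under the stated assumption $h_i = \min(f_i, g_i)$. The function name and the toy data are hypothetical.

```python
from collections import Counter


def frequency_weights(values_a, values_b):
    """Return {string: (P(agree on string | M), P(agree on string | U))}.

    Assumes h_i = min(f_i, g_i), i.e. every value that could match does match.
    """
    f, g = Counter(values_a), Counter(values_b)
    n_a, n_b = len(values_a), len(values_b)
    h = {s: min(f[s], g[s]) for s in f.keys() & g.keys()}
    n_m = sum(h.values())
    if n_m == 0:
        return {}   # no common values: nothing to estimate
    return {
        s: (h_s / n_m, (f[s] * g[s] - h_s) / (n_a * n_b - n_m))
        for s, h_s in h.items()
    }


# Toy example: a common and a rare surname
file_a = ["Smith"] * 50 + ["Zabrinsky"] * 2
file_b = ["Smith"] * 40 + ["Zabrinsky"] * 3
weights = frequency_weights(file_a, file_b)
ratios = {s: (pm / pu if pu > 0 else float("inf")) for s, (pm, pu) in weights.items()}
```

In this toy example the ratio for "Zabrinsky" is much larger than the one for "Smith", matching the idea that agreement on a rare value deserves more weight. A value occurring exactly once in each file gives a zero probability under U, which is why the ratio is guarded against division by zero.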

References
- Armstrong and Mayda (1993) Model-based estimation of record linkage error rates. Survey Methodology, 137-147.
- Copas and Hilton (1990) Record linkage: statistical models for matching computer records. JRSS/A, 287-320.
- Fortini, Liseo, Nuccitelli, Scanu (2001) On Bayesian record linkage. Research in Official Statistics, 185-198.
- Gill (2001) Methods for automatic record matching and linkage and their use in National Statistics. ONS methodological series no. 25.
- Jaro M.A. (1989) Advances in record linkage methodology as applied to matching the 1985 census of Tampa, Florida. JASA, 414-420.
- Larsen M.D., Rubin D.B. (2001) Iterative automated record linkage using mixture models. JASA, 32-41.
- Newcombe (1988) Handbook of record linkage methods for health and statistical studies, administration and business. Oxford University Press.
- Thibaudeau Y. (1993) The discrimination power of dependency structures in record linkage. Survey Methodology, 31-38.
- Winkler (1992) Comparative analysis of record linkage decision rules. Proceedings of the Section on Survey Research Methods, ASA, 829-834.
- Winkler (1993) Improved decision rules in the Fellegi-Sunter model of record linkage. Proceedings of the Section on Survey Research Methods, ASA, 274-279.