Statistical models for record linkage

Training course on record linkage
Marco Fortini
Istat
fortini@istat.it

Estimation problems
- Parameters estimation for record linkage
  - Use of data from prior studies
  - Use of current samples (clerical review)
  - Maximum likelihood estimation under independence assumption
  - More complex models
- Overcome agreement/disagreement
- Frequency-based matching

Parameters
In order to decide whether a pair (a,b) is a match or not, it is necessary to build the ratio r(y) = m(y)/u(y)
There are essentially three methods:
1) Data from prior studies
2) Analysis of special samples from the data set
3) Estimation from current files A and B

Data from prior studies
Prior studies are quite rare, but they may suggest useful information.
- Newcombe (1988) elaborates some values of the 0-1 comparison distribution for M and U for some common key variables in the UK

Relative frequencies (%) and ratios
Identifier            Surname            Forename           Year of birth
Comparison outcome    Agree   Disagree   Agree   Disagree   Agree   Disagree
Links (m)             96.5    3.5        79.0    21.0       77.3    22.2
Non links (u)         0.1     99.9       0.9     99.1       1.1     98.9
r (m/u)               965/1   1/29       88/1    1/5        70/1    1/4

Use of current samples
The u distribution can be approximated by the unconditional distribution of Y (in fact only a small fraction of the n_A x n_B pairs is a match)
From Stat. New Zealand 2006

Special samples
Making inference on the distribution m is more complicated
A possibility is to use clerical preliminary work, so as to identify a sample of matches upon which to compute the m probabilities
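
As a minimal sketch of how the r (m/u) row of the Newcombe table is obtained, the relative frequencies for links and non links can be divided outcome by outcome. The dictionary layout and the helper name `ratios` are assumptions made for illustration only; the percentages are the ones reported in the table.

```python
# Relative frequencies (%) from the table: (agree, disagree) per key variable.
links = {"Surname": (96.5, 3.5), "Forename": (79.0, 21.0), "Year of birth": (77.3, 22.2)}
non_links = {"Surname": (0.1, 99.9), "Forename": (0.9, 99.1), "Year of birth": (1.1, 98.9)}


def ratios(m, u):
    """Return (m/u for agreement, m/u for disagreement)."""
    return m[0] / u[0], m[1] / u[1]


table = {name: ratios(links[name], non_links[name]) for name in links}

for name, (r_agree, r_disagree) in table.items():
    # Disagreement ratios are below 1, so they are printed as 1/x like in the table.
    print(f"{name:14s} agree {r_agree:6.1f}/1   disagree 1/{1 / r_disagree:.1f}")
```

Agreement on a variable multiplies the evidence for a match by the first ratio, disagreement by the second one.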

Special samples
Example: Copas and Hilton (1990) linked arrivals and departures in the UK.
- The first step of manual linkage (relative to the arrivals in a couple of weeks) gives the data set for estimating m
- Then probabilistic record linkage among arrivals and departures is done using key variables.

Estimation
The framework for record linkage is suitable for applying a particular estimation technique: the EM algorithm

Statistical model
Y is distributed differently for M and U:
m(y) = P(Y=y | C=1)
u(y) = P(Y=y | C=0)
P(C=1) is the probability that a random choice from Ω returns a match (from M). This probability is given by the fraction of matches over the whole number of pairs:
P(C=1) = p
P(C=0) = 1-p

Statistical model
The joint probability that a pair has C=c (c=1,0) and Y=y is:
P(Y=y, C=c) = [p m(y)]^c [(1-p) u(y)]^(1-c)

Statistical model
The likelihood for our parameters (the distribution m, the distribution u and the parameter p) is:
$L = \prod_{(a,b)} [p\, m(y_{ab})]^{c_{ab}} \, [(1-p)\, u(y_{ab})]^{1-c_{ab}}$
where
- p, m(y) and u(y) are unknown parameters
- y_ab are the observations
- c_ab are latent (unknown) values

Maximum likelihood estimate of the unknown parameters
When the likelihood contains unknown data (in this case the status c_ab of a pair), maximum likelihood estimates can be found by means of an iterative procedure known as EM (Expectation-Maximization)
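
The role of the latent status can be made explicit with a short derivation. This is a sketch in notation that is assumed here (y_ab for the comparison pattern of pair (a,b), c_ab for its status), not taken from the slides. Taking logs of the likelihood gives a sum that is linear in c_ab, which is why replacing c_ab by its conditional expectation is enough for the E step; summing c_ab out gives the mixture likelihood of the observed comparisons alone.

```latex
\begin{align*}
% Complete-data log-likelihood: linear in the latent c_{ab}
\log L &= \sum_{(a,b)} \Big\{ c_{ab} \log\big[p\, m(y_{ab})\big]
          + (1 - c_{ab}) \log\big[(1-p)\, u(y_{ab})\big] \Big\} \\
% Observed-data likelihood: the status is summed out (two-component mixture)
L_{\mathrm{obs}} &= \prod_{(a,b)} \big[\, p\, m(y_{ab}) + (1-p)\, u(y_{ab}) \,\big] \\
% Conditional expectation of the status given the observed pattern
E\big[c_{ab} \mid y_{ab}\big] &= \frac{p\, m(y_{ab})}{p\, m(y_{ab}) + (1-p)\, u(y_{ab})}
\end{align*}
```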
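
Maximum likelihood estimate of the unknown parameters
0 Step: fix p, m(y) and u(y) at initial values
E Step: substitute the unknown values c_ab with the expected values r*(y), i.e. the probability that a pair with pattern y is a match given the current parameters. Pairs (a,b) sharing the same pattern y of key variables have the same value c_ab = r*(y).
M Step: compute p, m(y) and u(y) that maximize L on the data set completed in the E step

Conditional independence
When statistical independence among key variables holds conditionally on the knowledge of membership of a pair (a,b) to M or U:
$P(Y=y, C=c) = \Big[\prod_{k=1}^{K} P(Y_k=y_k \mid C=c)\Big] P(C=c)$
then the maximum likelihood solution of the EM algorithm exists in closed form

ML estimation via EM under conditional independence
With m_k = P(Y_k=1 | C=1) and u_k = P(Y_k=1 | C=0), k = 1, ..., K, and N the number of pairs:
E step
$\hat{C}_{ab} = r^*(y_{ab}) = \dfrac{p \prod_{k=1}^{K} m_k^{y_{ab}^k} (1-m_k)^{1-y_{ab}^k}}{p \prod_{k=1}^{K} m_k^{y_{ab}^k} (1-m_k)^{1-y_{ab}^k} + (1-p) \prod_{k=1}^{K} u_k^{y_{ab}^k} (1-u_k)^{1-y_{ab}^k}}$
M step
$m_k = \dfrac{\sum_{(a,b)} \hat{C}_{ab}\, y_{ab}^k}{\sum_{(a,b)} \hat{C}_{ab}}, \qquad u_k = \dfrac{\sum_{(a,b)} (1-\hat{C}_{ab})\, y_{ab}^k}{\sum_{(a,b)} (1-\hat{C}_{ab})}, \qquad p = \dfrac{\sum_{(a,b)} \hat{C}_{ab}}{N}$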
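
The E and M steps under conditional independence can be sketched in a few lines of NumPy. This is an illustrative implementation under stated assumptions, not the course's own code: the function name `em_conditional_independence`, the starting values, the stopping rule and the simulated data are all hypothetical choices.

```python
import numpy as np


def em_conditional_independence(Y, p=0.1, m=0.9, u=0.1, max_iter=500, tol=1e-8):
    """EM for the match/non-match mixture with independent 0-1 comparisons.

    Y is an (N, K) array of agreement indicators, one row per pair (a,b).
    Returns the estimates (p, m, u) and the expected status of each pair.
    """
    Y = np.asarray(Y, dtype=float)
    n_vars = Y.shape[1]
    m = np.ones(n_vars) * m          # step 0: initial values (assumed defaults)
    u = np.ones(n_vars) * u
    eps = 1e-12                      # keeps probabilities away from 0 and 1
    for _ in range(max_iter):
        # E step: expected status of each pair given the current parameters
        lik_m = p * np.prod(m ** Y * (1 - m) ** (1 - Y), axis=1)
        lik_u = (1 - p) * np.prod(u ** Y * (1 - u) ** (1 - Y), axis=1)
        c_hat = lik_m / (lik_m + lik_u)
        # M step: closed-form updates (weighted relative frequencies)
        p_new = c_hat.mean()
        m_new = np.clip(c_hat @ Y / c_hat.sum(), eps, 1 - eps)
        u_new = np.clip((1 - c_hat) @ Y / (1 - c_hat).sum(), eps, 1 - eps)
        delta = max(abs(p_new - p), np.abs(m_new - m).max(), np.abs(u_new - u).max())
        p, m, u = p_new, m_new, u_new
        if delta < tol:
            break
    return p, m, u, c_hat


# Simulated comparison vectors with made-up parameters, only to exercise the code.
rng = np.random.default_rng(0)
true_p = 0.1
true_m = np.array([0.90, 0.85, 0.80, 0.90])
true_u = np.array([0.05, 0.10, 0.20, 0.05])
is_match = rng.random(20000) < true_p
agree_prob = np.where(is_match[:, None], true_m, true_u)
Y = (rng.random(agree_prob.shape) < agree_prob).astype(int)

p_hat, m_hat, u_hat, c_hat = em_conditional_independence(Y)
print("p:", round(p_hat, 3))
print("m:", np.round(m_hat, 3))
print("u:", np.round(u_hat, 3))
```

Because pairs with the same pattern y share the same expected status, a production version would typically work on the table of pattern counts rather than on individual pairs; the pairwise form is kept here because it mirrors the formulas directly.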

Models
Jaro (1989) assumes independence between the key variables for both M and U
- Estimation is very easy, but this model rarely holds
- Under independence the number of key variables must be at least 3, otherwise there are identification problems: with K binary key variables the model has 2K+1 parameters, while the observed patterns provide only 2^K - 1 free frequencies

Relationships among key variables
- Practical experience shows robustness of the independence assumption
- Sometimes the independence assumption does not hold, so producing bias in the estimates of p, m and u
- Two main effects can cause dependence:
  - Association among key variables for non matches
  - Association between errors in key variables for matches

Example: conditional independence
Y               P(Y=1|M)   P(Y=1|U)
Surname         0.8063     0.0011
Forename        0.7936     0.0360
Gender          0.8856     0.4713
Year of birth   0.8713     0.0113
P(M)=0.0234

Distribution of r: patterns of key variables
Y1 = Surname, Y2 = Forename, Y3 = Gender, Y4 = Year of birth. P(Y|M) and P(Y|U) are the products over the four variables of P(Yk|M) and P(Yk|U): the values of the example for Yk=1 and their complements for Yk=0 (Surname 0.1937 and 0.9989, Forename 0.2064 and 0.9640, Gender 0.1144 and 0.5287, Year of birth 0.1287 and 0.9887).

Rank  Y1 Y2 Y3 Y4   P(Y|M)     P(Y|U)     r
1     0  0  0  0    0.000589   0.503353   0.001169
2     0  0  1  0    0.004557   0.448705   0.010155
3     0  1  0  0    0.002263   0.018797   0.120403
4     0  0  0  1    0.003985   0.005753   0.692702
5     0  1  1  0    0.017521   0.016757   1.045589
6     1  0  0  0    0.00245    0.000554   4.420459
7     0  0  1  1    0.030849   0.005128   6.015472
8     1  0  1  0    0.018968   0.000494   38.38759
9     0  1  0  1    0.015322   0.000215   71.32023
10    1  1  0  0    0.009421   2.07E-05   455.1283
11    0  1  1  1    0.118614   0.000192   619.3501
12    1  0  0  1    0.016588   6.34E-06   2618.44
13    1  1  1  0    0.072931   1.85E-05   3952.367
14    1  0  1  1    0.128414   5.65E-06   22738.72
15    1  1  0  1    0.063781   2.37E-07   269593.3
16    1  1  1  1    0.493746   2.11E-07   2341168

Error in parameters estimation
Magnitude of errors: 0.8 on P(Y=1|M), 1.2 on P(Y=1|U)
Y               P(Y=1|M)   P(Y=1|U)
Surname         0.6450     0.0013
Forename        0.6348     0.0432
Gender          0.7085     0.5656
Year of birth   0.69703    0.0135
P(M)=0.0234

Change of ranking due to error in parameters
r is the ratio under the original parameters, r' the ratio under the parameters affected by error; rows are sorted by r'.

Rank  Y1 Y2 Y3 Y4   r            r'
1     0  0  0  0    0.0011694    0.0279525
2     0  0  1  0    0.0101553    0.0521832
3     0  1  0  0    0.120403     1.0765002
5     0  1  1  0    1.0455889    2.0096688
4     0  0  0  1    0.6927017    4.6784725
7     0  0  1  1    6.0154719    8.7340252
6     1  0  0  0    4.4204589    38.430881
8     1  0  1  0    38.387587    71.744844
9     0  1  0  1    71.320233    180.17624
11    0  1  1  1    619.35009    336.36274
10    1  1  0  0    455.12832    1480.0411
13    1  1  1  0    3952.3674    2763.0207
12    1  0  0  1    2618.4399    6432.2621
14    1  0  1  1    22738.723    12008.094
15    1  1  0  1    269593.3     247717.78
16    1  1  1  1    2341168      462452.93
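
The two tables above can be reproduced, up to rounding, with a short script. It is a sketch that assumes the parameter values of the example (with the digits as read from the tables) and an error that multiplies every P(Y=1|M) by 0.8 and every P(Y=1|U) by 1.2; the helper name `likelihood_ratio` is hypothetical.

```python
from itertools import product

# Agreement probabilities for Surname, Forename, Gender, Year of birth.
m = [0.8063, 0.7936, 0.8856, 0.8713]
u = [0.0011, 0.0360, 0.4713, 0.0113]


def likelihood_ratio(pattern, m, u):
    """r(y) = P(y|M) / P(y|U) under conditional independence."""
    r = 1.0
    for y_k, m_k, u_k in zip(pattern, m, u):
        r *= (m_k / u_k) if y_k else ((1 - m_k) / (1 - u_k))
    return r


patterns = list(product([0, 1], repeat=4))

# Parameters affected by error (assumed multiplicative factors 0.8 and 1.2).
m_err = [0.8 * m_k for m_k in m]
u_err = [1.2 * u_k for u_k in u]

r_true = {y: likelihood_ratio(y, m, u) for y in patterns}
r_err = {y: likelihood_ratio(y, m_err, u_err) for y in patterns}

rank_true = sorted(patterns, key=r_true.get)   # increasing r
rank_err = sorted(patterns, key=r_err.get)     # increasing r'

for y in rank_err:
    print(rank_true.index(y) + 1, y, f"{r_true[y]:.7g}", f"{r_err[y]:.7g}")

moved = sum(a != b for a, b in zip(rank_true, rank_err))
print("positions whose pattern changed:", moved)
```

The extreme patterns keep their place, while patterns in the middle of the ranking, which are the ones close to the decision thresholds, swap positions.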

Association among key variables
It affects non matches, i.e. pairs in U
Example: the age of people is associated with their marital status
- Young people are more probably single
- Elderly people are more probably widows/widowers
SO
Two different people sharing the same year of birth have a higher probability of sharing the same marital status

Association among errors in key variables
It affects matches, i.e. pairs in M
Example: swap of surname and name
SO
Specific populations (e.g. foreigners) can experience the swap of name and surname in one of the two sources, so causing an error in both key variables

Models
Many authors have tried to define more complex models.
When using the 0-1 comparison function, a suitable class of models is represented by the loglinear models
Applications include Thibaudeau (1993) and Armstrong and Mayda (1993).
However, the loglinear model should be known in advance

Tests
Winkler (1993) suggests using a loglinear model which is sufficiently general, such as a loglinear model with all the three-way interactions.
Furthermore, estimation by the EM algorithm can be constrained (for example p < n_A/(n_A n_B))

Other approaches
- Bayesian approaches: the status c_ab of a pair becomes a parameter to estimate (Fortini et al. 2001)
- Iterative approaches: alternate steps of probabilistic record linkage and clerical review (Larsen and Rubin 2001)

Improvements of the theory
- Overcome agreement/disagreement: how to take into account comparisons in a continuous domain
- How to take advantage of the frequency distribution of key variables

Overcome agreement/disagreement - 1
Agreement/disagreement on comparisons could cause too much information loss for some variables:
- Age of people
- Date of events
- Turnover of firms
- Distances among string variables
When comparisons are made on a continuous scale, it is plausible that matches get small differences compared to non-matches
e.g. if two people sharing the same name and surname differ only by one year of age, we will be more confident that they are the same person than in case their age difference is of, say, 5 years

Overcome agreement/disagreement - 2
The likelihood ratio r can be adjusted in order to account for comparison among continuous variables
The adjustment is based on a comparison function assuming values between 0 and 1:
$\alpha = f(age(a), age(b)) = 1 - \dfrac{|age(a) - age(b)|}{\max\{\max(age(A)) - \min(age(B)),\ \max(age(B)) - \min(age(A))\}}$

Overcome agreement/disagreement - 3
α ∈ [0,1] is the result of the comparison
θ ∈ [0,1] is a lower bound chosen so as to designate as disagreement any comparison based on values α < θ.
$r_0(y) = \dfrac{P(y=0 \mid M)}{P(y=0 \mid U)}$   Likelihood ratio from EM for pairs in which y disagrees
$r_1(y) = \dfrac{P(y=1 \mid M)}{P(y=1 \mid U)}$   Likelihood ratio from EM for pairs in which y agrees
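
A minimal sketch of the comparison function for age follows. It assumes that α is one minus the absolute age difference divided by the largest age gap that can occur between the two files; the function names, the example ages and the value of θ are hypothetical.

```python
def age_similarity(age_a, age_b, ages_A, ages_B):
    """Comparison function alpha in [0, 1]: 1 for equal ages, 0 for the largest gap.

    Assumed form: 1 - |age(a) - age(b)| / D, where D is the largest
    possible age difference between a unit of file A and a unit of file B.
    """
    max_gap = max(max(ages_A) - min(ages_B), max(ages_B) - min(ages_A))
    if max_gap <= 0:          # every unit has the same age: no information
        return 1.0
    return 1.0 - abs(age_a - age_b) / max_gap


def is_disagreement(alpha, theta):
    """Comparisons with alpha below the lower bound theta count as disagreement."""
    return alpha < theta


ages_A = [20, 35, 60]         # made-up ages in file A
ages_B = [25, 40, 70]         # made-up ages in file B
theta = 0.8                   # hypothetical lower bound

close = age_similarity(35, 40, ages_A, ages_B)
far = age_similarity(20, 70, ages_A, ages_B)
print(close, is_disagreement(close, theta))
print(far, is_disagreement(far, theta))
```

With these inputs the largest possible gap is 50 years, so a 5-year difference stays well above the bound while the 50-year difference is treated as a disagreement.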

Overcome agreement/disagreement - 4
If α < θ then the EM weight remains unchanged:
$r = r_0(y)$
If α ≥ θ then the resulting weight is given by the linear combination:
$r = \alpha\, r_1(y) + (1 - \alpha)\, r_0(y)$

Overcome agreement/disagreement - 5
[Figure: adjusted weight r as a function of the comparison result α, with reference levels r_0(y) and r_1(y)]

Frequency-based matching
- Useful when the distribution of attributes for a key variable is not uniform
- Can take into account the information given by rare values (e.g. rare names or surnames)
- Consists in the allocation of an outcome-specific weight to the pairs
- More weight to pairs that are in agreement on rare statements

Larger weights for agreement on rare events
P(M | y=Zabrinsky) > P(M | y=Smith)
implies that
P(U | y=Smith) > P(U | y=Zabrinsky)
and, using Bayes theorem,
$\dfrac{P(y=\text{Zabrinsky} \mid M)}{P(y=\text{Zabrinsky} \mid U)} > \dfrac{P(y=\text{Smith} \mid M)}{P(y=\text{Smith} \mid U)}$

How to calculate weights - 1
Solution by Fellegi and Sunter (1969)
f_1, f_2, ..., f_m   N_A = Σ_i f_i   (File A)
g_1, g_2, ..., g_m   N_B = Σ_i g_i   (File B)
f_i g_i: frequency of the i-th occurrence among the AxB pairs
Frequency in the set of matching pairs:
h_1, h_2, ..., h_m   N_M = Σ_i h_i   where h_i ≤ min(f_i, g_i)

How to calculate weights - 2
Assuming h_i = min(f_i, g_i) we obtain
P(agreement on string i | M) = h_i / N_M
P(agreement on string i | U) = (f_i g_i - h_i) / (N_A N_B - N_M)
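
The frequency-based probabilities above can be computed directly from the value counts of the two files. The sketch below assumes h_i = min(f_i, g_i) as in the last slide; the function name `frequency_weights` and the toy surname lists are made up for illustration.

```python
from collections import Counter


def frequency_weights(values_A, values_B):
    """Outcome-specific m, u and m/u for agreement on each string.

    f_i and g_i are the frequencies of string i in files A and B,
    and h_i = min(f_i, g_i) is assumed for the matching pairs.
    """
    f, g = Counter(values_A), Counter(values_B)
    n_A, n_B = len(values_A), len(values_B)
    h = {s: min(f[s], g[s]) for s in f.keys() & g.keys()}
    n_M = sum(h.values())
    if n_M == 0:              # no string in common: nothing to weight
        return {}
    weights = {}
    for s, h_s in h.items():
        m_s = h_s / n_M
        u_s = (f[s] * g[s] - h_s) / (n_A * n_B - n_M)
        ratio = m_s / u_s if u_s > 0 else float("inf")
        weights[s] = (m_s, u_s, ratio)
    return weights


# Toy files: Smith is common, Zabrinsky is rare.
file_A = ["Smith"] * 5 + ["Zabrinsky"] * 2 + ["Jones"] * 2
file_B = ["Smith"] * 4 + ["Zabrinsky"] * 1 + ["Jones"] * 3

weights = frequency_weights(file_A, file_B)
for name, (m_s, u_s, ratio) in sorted(weights.items(), key=lambda kv: kv[1][2]):
    print(f"{name:10s} m={m_s:.3f} u={u_s:.4f} m/u={ratio:.2f}")
```

In this toy case agreement on the rare surname gets a larger ratio than agreement on the common one, which is the behaviour the slides describe. A string that occurs once in each file gives u = 0 under the assumption h_i = min(f_i, g_i), so the ratio is reported as infinite.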

References
Armstrong and Mayda (1993) Model-based estimation of record linkage error rates. Survey Methodology, 137-147
Copas and Hilton (1990) Record linkage: statistical models for matching computer records. JRSS/A, 287-320
Fortini, Liseo, Nuccitelli, Scanu (2001) On Bayesian record linkage. Research in Official Statistics, 185-198
Gill (2001) Methods for automatic record matching and linkage and their use in National Statistics. ONS methodological series no. 25
Jaro M.A. (1989) Advances in record linkage methodology as applied to matching the 1985 census of Tampa, Florida. JASA, 414-420
Larsen M.D., Rubin D.B. (2001) Iterative automated record linkage using mixture models. JASA, 32-41
Newcombe (1988) Handbook of record linkage: methods for health and statistical studies, administration and business. Oxford University Press
Thibaudeau Y. (1993) The discrimination power of dependency structures in record linkage. Survey Methodology, 31-38
Winkler (1992) Comparative analysis of record linkage decision rules. Proceedings of the section on survey research methods, ASA, 829-834
Winkler (1993) Improved decision rules in the Fellegi-Sunter model of record linkage. Proceedings of the section on survey research methods, ASA, 274-279