Regression analysis for longitudinally linked data

Size: px

Start display at page:

Download "Regression analysis for longitudinally linked data"

Gilbert Page
5 years ago
Views:

University of Wollongong Research Online Centre for Statistical & Survey Methodology Working Paper Series Faculty of Engineering and Information Sciences 2010 Regression analysis

au Ray Chambers University of Wollongong, ray@uow.edu.

1 University of Wollongong Research Online Centre for Statistical & Survey Methodology Working Paper Series Faculty of Engineering and Information Sciences 2010 Regression analysis for longitudinally linked data Gunky Kim University of Wollongong, Ray Chambers University of Wollongong, Recommended Citation Kim, Gunky and Chambers, Ray, Regression analysis for longitudinally linked data, Centre for Statistical and Survey Methodology, University of Wollongong, Working Paper 22-10, 2010, 56p. Research Online is the open access institutional repository for the University of Wollongong. For further information contact the UOW Library:

Centre for Statistical and Survey Methodology The University of Wollongong Working Paper 2210 Regression analysis for longitudinally linked data Gunky Kim and Raymond Chambers Copyright 2008 by the

2 Centre for Statistical and Survey Methodology The University of Wollongong Working Paper 2210 Regression analysis for longitudinally linked data Gunky Kim and Raymond Chambers Copyright 2008 by the Centre for Statistical & Survey Methodology, UOW. Work in progress, no part of this paper may be reproduced without permission from the Centre. Centre for Statistical & Survey Methodology, University of Wollongong, Wollongong NSW Phone , Fax anica@uow.edu.au

3 Regression analysis for longitudinally linked data Gunky Kim and Raymond Chambers Centre for Statistical and Survey Methodology University of Wollongong Abstract Most probability-based methods used to link records from two distinct data sets corresponding to the same target population do not lead to perfect linkage, i.e. there are linkage errors in the merged data. Chambers (2008) describes modifications to standard methods of regression analysis that can be used with such imperfectly linked data. However, these methods assume that the linkage process is complete, i.e. all records on the two data sets are linked. This paper extends theses ideas to accommodate the situation when the number of data sets are more than two. key words: Record matching; linkage errors; linear regression; logistic regression; estimating euations. 1 Introduction In recent years, because of its advantage of creating new information from already existing files by linking them, the linkage process becomes an important research tool in many areas such as health, business, economics and sociology. One important linkage application is where different data sets relating to the same individuals at different points in time are linked to provide a longitudinal data record for each individual, thus permitting longitudinal analysis for these individuals. To illustrate, the Census Data Enhancement project of the 1

4 Australian Bureau of Statistics aims to develop a Statistical Longitudinal Census Dataset by linking data from the same individuals over a number of censuses. It is expected that this linked data set will provide a powerful tool for future research into the longitudinal dynamics of the Australian population. However, without access to the same uniue identifier in each of the linked data sets, there is always the possibility that linkage errors in the merged data could lead to a longitudinal record ostensibly relating to a single individual being actually made up of a composite of data items from different individuals. This in turn could lead to bias and loss of efficiency for the longitudinal modelling process. Further, as the number of censuses to be linked increase, the structure of linkage error will be more complicated as it will increase more bias and inefficiency for the modelling process. The work of Neter et al. (1965) shows that small mismatching could cause significant response error. Their work has become a foundation of the analysis on the linkage error. Some authors, such as Scheuren and Winkler (1993), Scheuren and Winkler (1997) and Lahiri and Larsen (2005), have tried to extend the work of Neter et al. (1965) on regression setting. However, the volume of works on the analysis of the linkage error is not rich. In Chambers (2008), Chambers has developed new methods to adjust the bias in the linear regression parameters for the linkage process when two data sets were merged. In this study, we extended the ideas of Chambers (2008) to accommodate longitudinally linked data sets where the number of merged data sets are more than two. In general, most of works for linkage error correction has been done when two data sets are merged. However, the linkage error structure of longitudinally linked data sets, when the number of data sets are more than two, are more complicated compared to the linkage error structure of two data sets. As far as our knowledge, this is the first attempt to correct the linkage errors in the merged data sets when the number of data sets are more than two. We will use three data set case as an illustration of our regression analysis, but it is trivial to see that it can be easily extended to deal with any number of data sets. 1.1 Backgrounds and assumptions Suppose that we are interested in fitting a regression model of the form E X (Y ) = f(x; θ), 2

5 where f is a known function, but the parameter θ is to be estimated, and X has more than one data sets. For example, consider a linear regression model of the form Y = β 0 + X 1 β 1 + X 2 β 2 + ǫ = Xβ + ǫ, where we have three different files, one for Y, one for X 1 and one for X 2. When these three data sets were created separately, and if there is no uniue identifier among them to match each other, matching y i with the correct values of x 1i from one file and x 2i from another file could be a difficult task and there could be a strong chance of mismatching. If there exist mismatches, the estimation of β could be biased if we ignore them in the estimation process. The purpose of our study is to develop some methodological frames to adjust the bias of β estimations when the mismatches are expected. For the assumptions we made in this papers are: 1. For the case of register-register, there exist a population of N units for all Y, X 1 and X 2 such that each one of y i should be linked with one of x 1i from one file and x 2i from another file. 2. X can be partitioned into Q different blocks 1. Let us call this block as m-block. 3. The linkage errors occur only within the m-block, in the sense that records in distinct m-blocks can never be linked. The records from X that make up the th m-block is denoted X. 4. In case of sample-register, suppose that we only have sample 2 s from a bench mark register, for example, X 1 with possible relation E(Y X 1, X 2 ) = f(x 1, X 2 ; θ) when they are correctly linked. f could be either linear of logistic function. 5. Denote X 1s the sample records X 1 of the sample size s and some of records in X 1s may not be linked to the records in Y or X Even though some of records are not linked, we assume that the regression model of linked records would be valid for the non-linked records if the links are found. 1 See Chambers (2008) for more detailed discussion about the block. 2 The sample set can be drawn from any data set. To explain our assumptions in more details, here we assume that the samples are drawn from X 1, while Y and X 2 are registers. 3

6 2 Register-register case When there are three data sets, usually one of them is regraded as a bench mark data set and mismatches happens when someone try to link this bench mark data set with other data sets. Thus, when there are three data sets, we expect that at most two kinds of possible mismatches can happen. For example, if we set X 1 as the bench mark data set, possible mismatches happen when we link Y with X 1 or link X 1 with X 2. We will consider the case of one mismatch situations and the case of two possible mismatches case separately. For the two mismatches case, we assume that mismatches from the linkage process between Y and X 1 are independent of the mismatches from the linkage process between X 1 and X Three data sets and one mismatch cases: A ratio-type estimator Note that our model is of the form Y = β 0 + X 1 β 1 + X 2 β 2 + ǫ = Xβ + ǫ, where X = (1, X 1, X 2 ). Suppose that X 1 is the bench mark data set. Then, possible mismatch can happen either when one links records from X 1 with Y or when one links records from X 1 with X 2. However, if the mismatch happens only when one links records from X 1 with Y and X 1 and X 2 can be linked perfectly, one can regards X = (1, X 1, X 2 ) as a one data set, and this case has been dealt extensively in Chambers (2008). Hence, we will only consider the case where mismatch happens when one links records from X 1 with X 2. Let us call this situation as Case 0. Case 0: When each x 1i is correctly linked with corresponding y i, but some of x 2i are not correctly linked with x 1i, one has Y = β 0 + X 1 β 1 + X 2 β 2 + ǫ = X β + ǫ, where X = (1, X 1, X 2 ), X 2 = B 2X 2 and B 2 is a permutation matrix. Note that X 2 is not observable, and we only observe X 2. However, if the matrix B 2 is known, one has X 2 = B2 T X 2. 4

7 Thus, Let X = (1, X 1, X 2 ) = (1, X 1, B T 2X 2). X B 2 = (1, X 1, B2 T X 2 ). (1) Note that X B 2 is only observable if B 2 is known. But, generally, B 2 is unknown and in this case we adapt the non-informative linkage assumption 3, that is, E X (X 2 ) = E X (B T 2 )X 2 = E B 2 X 2, where E B2 satisfies the exchangeable linkage error model. It means E B2 = (λ B2 γ B2 )I + γ B2 1 1 T, where and Let λ B2 = pr(correct linkage between X 1 and X 2 ) γ B2 = pr(incorrect linkage between X 1 and X 2). X E = E X (X ) = E X (1, X 1, X 2 ) = (1, X 1, E B2 X 2 ). (2) Then, by OLS, one has ˆβ 1 = (X )T X (X )T Y, where E X (ˆβ 1 ) = (X ) T X (X ) T X E β = D 1 β. Hence, if the matrix E B2 is known and the inverse of D 1 exists, a ratio form of an unbiased estimator of β is of the form ˆβ R1 = D 1 1 ˆβ. Let f = X β, f = X β, (3) f E = XE β. 3 We assume that the distribution of B 2 is independent of X 2 given X. 5

8 Proposition 1. An asymptotic variance estimator of ˆβ R1 can be defined by 1 ( 1 ) T, ˆV 1 (ˆβ R1 ) = (X )T X E (X )T ˆV 1 (Y )X (X )T X E where ˆV 1 (Y ) = ˆσ 2 I + ˆV B2. Here, ˆV 1 (Y ) can be estimated by ˆσ 2 = N 1 (Y f E ) T (Y f E ) and, given f B 2 := X 2β 2, V B2 = diag (1 λ B2 ) { λ B2 (fb 2,i f B 2 ) 2 + f (2) B 2 ( f B 2 ) 2}, where f B 2 = (f B 2,i ) and f B 2, f (2) B 2 in f B 2. are the averages of f B 2,i and their suares respectively 2.2 Three data sets and two mismatches cases: A ratio-type estimator When there are three data sets and and two mismatches in the data linkage processes, there are two possible scenarios. Case 1: Y is the bench mark data set and the linkages between Y and X 1 and the linkages between Y and X 2 are done with some errors. Case 2: Either X 1 or X 2 is the bench mark data set 4 and the linkage between the bench mark data and other X data set and the linkage between the bench mark data set and Y are done with some errors. Let us consider the Case 1 first. So, we assume that the data set for y i is correctly recorded, but there are mismatches between y i and x 1i as well as between y i and x 2i. Also, we assume that mismatches between y i and x 1i are independent of the mismatches between y i and x 2i. In this case, our regression model is of the form Y = β 0 + X 1β 1 + X 2β 2 + ǫ = X β + ǫ, 4 In this paper, we assume that X 1 is the bench mark. 6

9 where X = (1, X 1, X 2), X 1 = B 1 X 1 and X 2 = B 2 X 2, and B 1 and B 2 are permutation matrices. If B 1 and B 2 are known, one has X = (1, X 1, X 2 ) = (1, B T 1 X 1, BT 2 X 2 ). Since, B 1 and B 2 are unknown in general, we apply the non-informative linkage assumption so that where, X E2 = E X (X ) = (1, E B1 X 1, E B 2 X 2 ), (4) E Bi = (λ Bi γ Bi )I + γ Bi 1 1 T and λ Bi = pr(correct linkage between Y and X i ) Then, by OLS, where γ Bi = pr(incorrect linkage between Y and X i ). ˆβ 1 = (X )T X (X )T Y, E X (ˆβ 1 ) = (X )T X (X )T X E2 β = D 2 β. Hence, if the matrices E B1 and E B2 are known and the inverse of D 2 exists, a ratio form of an unbiased estimator of β is of the form ˆβ R2 = D 1 2 ˆβ. Let f E2 = X E2 β. Proposition 2. An asymptotic variance estimator of ˆβ R2 can be defined by 1 ( 1 ) T, ˆV 2 (ˆβ R2 ) = (X ) T X E2 (X ) T ˆV 2 (Y )X (X ) T X E2 where Here, ˆV 2 (Y ) can be estimated by ˆV 2 (Y ) = ˆσ 2 I + ˆV B1 + ˆV B2. ˆσ 2 = N 1 (Y f E2 ) T (Y f E2 ) 7

10 and, given f B j := X jβ j for j = 1 or 2, V Bj = diag (1 λ Bj ) { λ Bj (fb j,i f B j ) 2 + f (2) B j ( f B j ) 2}, where f B j = (f B j,i ) and f B j, f (2) B j in f B j. are the averages of f B j,i and their suares respectively Now, we are considering the Case 2. When some of x 1i are incorrectly linked with corresponding y i or with x 2i, our regression model is of the form Y = β 0 + X 1 β 1 + X 2 β 2 + ǫ = X β + ǫ, where Y = A Y, X 2 = B 2X 2 and A and B 2 are permutation matrices. By non-informative linkage assumption 5 on A, one has E X (Y ) = E X (A Y ) = E X (A )E X (Y ) = E A E X (Y ) = E A X E β, (5) where E A = (λ A γ A )I + γ A 1 1 T with and λ A = pr(correct linkage between X 1 and Y ) γ A = pr(incorrect linkage between X 1 and Y ). Further, we assume that the mismatch between x 1i and y i is uncorrelated 6 with the mismatch between x 1i and x 2i. With these assumption, by OLS, one has ˆβ 1 = (X )T X (X )T Y and 1 = (X )T X (X )T A Y E X (ˆβ 1 ) = (X ) T X (X ) T E A X E β = D 3 β. 5 Here we assume the randomness of the linkage error between Y and X. See Chambers (2008) for a more detailed discussion. 6 We will try to relax this assumption soon. 8

11 Thus, if the matrices E X (B 2 ) = E B2 and E X (A ) = E A are known and the inverse of D 3 exists, a ratio form of an unbiased estimator of β for this case is of the form ˆβ R3 = D 1 3 ˆβ. Proposition 3. An asymptotic variance estimator of ˆβ R3 can be defined by 1 ( 1 ) T, ˆV 3 (ˆβ R3 ) = (X )T X E (X )T ˆV 3 (Y )X (X )T X E where ˆV 3 (Y ) = ˆσ 2 I + ˆV B2 + ˆV C2. Here, ˆV 3 (Y ) can be estimated by ( ˆσ 2 = N 1 (Y f E ) T (Y f E ) 2 (f E ) T ) I E A f E and, given f B 2 := X 2 β 2, V B2 = diag (1 λ B2 ) { λ B2 (fb 2,i f B 2 ) 2 + where f B 2 = (fb 2,i ) and f B (2) 2, f B 2 in f B 2. Further, one has V C2 = A Var X f (2) B 2 ( f B 2 ) 2}, are the averages of fb 2,i and their suares respectively E X (Y B 2 ) A T, and it can be estimated by V C2 = diag (1 λ C2 ) { λ C2 (fb 2,i f B 2 ) 2 + f (2) B 2 ( f B 2 ) 2}, where f B 2 = (fb 2,i ) and f B (2) 2, f B 2 are the averages of fb 2,i and their suares respectively in f B 2. Moreover, C B2 = A B2 T and λ C2 is the probability of correct linkages in C B The estimating function we will modify the estimating functions used in Chambers (2008) to accommodate the longitudinal linkage case. Suppose that one has E(Y X) = g(x; θ), where θ can be estimated by solving H(θ) = 0, 9

12 and H(θ) is a function that satisfies E X H(θ0 ) = 0 when θ 0 is the true value of θ. Let θ be the partial differentiation operator with respect to θ. Suppose that ˆθ satisfies H(ˆθ) = 0. Then, under some regularity condition for the smoothness and Taylor expansion, 0 = H(ˆθ) H(θ 0 ) + (ˆθ θ 0 ) θ H(θ 0 ). If H(θ) is an unbiased estimating function and θ H(θ 0 ) is non-singular, one has E X ˆθ θ0 θ H(θ 0 ) 1 EX H(θ0 ) = 0. Then, the variance function for ˆθ can be derived by Var X (ˆθ) θ H(θ 0 ) 1 Var X H(θ0 ) ( θ H(θ 0 ) ) 1 T. In Chambers (2008), the estimating function is of the form H(θ) = G (θ) { Y f }, where f = E X (Y ) and G (θ) 7 is a function of θ and X but not of Y. In the longitudinal case for the three data set, we have three different cases to consider. Firstly, consider the Case 0 where Y and X 1 are correctly linked, but X 1 and X 2 are not correctly linked. Hence, we can observe true Y, but we cannot observe the true X. Instead, we observe X, which is of the form X = (1, X 1, X 2 ), X 2 = B 2X 2 and B 2 is a permutation matrix that is not observable in general. Then, a naive estimating function can be of the form H (θ) = G (θ) { Y f (θ)}, where f (θ) = X β. Then, it is easy to see that the estimator from the naive estimating function is biased, because E X (Y ) = f E (θ) f (θ). Thus, an unbiased estimating function can be of the form H 1(θ) = G (θ) { Y f E (θ) }, (6) 7 Some examples of G for different estimators are given in the simulation section. 10

13 where f E (θ) = X E β = (1, X 1, E B2 X 2)β. Let us consider the Case 1 where Y is the bench mark data set and the linkages between Y and X 1 and the linkages between Y and X 2 are done with some errors. In this case, we have similar estimating function H 2 (θ) = G (θ) { Y f E2 (θ) }, (7) where, by (4), f E2 (θ) = X E2 β = (1, E B1 X 1, E B2 X 2)β. Now, consider the Case 2 where X 1 is the bench mark data set and the linkage between X 1 and X 2 and the linkage between X 1 and Y are done with some errors. In this case, Y is observed instead of Y and also the true X is not observable. Instead, we observe X, which is of the form X = (1, X 1, X 2 ), X 2 = B 2X 2 and B 2 is a permutation matrix that is not observable in general. Hence, a naive estimating function can be of the form H (θ) = G (θ) { Y f (θ)}, where f (θ) = X β. Then, as before, it is easy to see that the estimator from the naive estimation function is biased, because E X (Y ) = E A f E (θ) f (θ). Hence, by (2), (5) and (28), an unbiased estimator is of the form H 3(θ) = G (θ) { Y E A f E (θ) }, (8) and the estimator ˆθ 3 is defined as the the solution of H 3 (ˆθ 3 ) = 0. Theorem 4. Let ˆθ 1 be the solution of (6). Then, an asymptotic variance estimator is of the form V 1 X (ˆθ 1) = G θ f E (ˆθ 1 1) G ˆΣ 1 G T ( G θ f E (ˆθ 1 ) T 1) 11

14 where, ˆΣ 1 = ˆσ 2 I + ˆV B2. Also, let ˆθ 2 be the solution of (7). Then, an asymptotic variance estimator is of the form V 2 X (ˆθ 2) = G θ f E2 (ˆθ 1 ( 2) G ˆΣ 2 G T G θ f E2 (ˆθ 2) 1 ) T, where, by (26), ˆΣ 2 = ˆσ 2 I + ˆV B1 + ˆV B2. Finally, the asymptotic variance estimator for the solution of (8) is of the form V 3 X (ˆθ 3) = G E A θ f E (ˆθ 1 3) G ˆΣ 3 G T ( G E A θ f E (ˆθ 3) 1 ) T, where, ˆΣ 3 = ˆσ 2 I + ˆV C2 + ˆV A. 2.4 Variance estimation when linkage probabilities are estimated So far, we assume that we know the correct linkage probabilities which is a very strong assumption. In this subsection, we consider the case where the correct linkage probabilities are estimated by checking a random audit sample of linked records in each m-block. More details of this audit estimates when there are two data sets can be found in Chambers (2008), and we will modify his idea to accommodate the cases when there are more than two data sets. Let us consider the Case 2 where x 1i is neither correctly linked with corresponding y i, nor with x 2i. In this case, we need to consider two different linkage probabilities: λ A = pr(correct linkage between X 1 and Y ) λ B2 = pr(correct linkage between X 1 and X 2), where there is no correlation between them. Thus, the estimating function (8) can be replaced by H 3(θ, λ A, λ B2 ) = G { Y E A (λ A )f E (θ, λ B2 ) } = U (θ, λ A, λ B2 ), 12

15 and a first order Taylor series approximation is of the form 0 = H 3(ˆθ 3, ˆλ A, ˆλ B2 ) H 3 (θ 0, λ 0 A, λ0 B 2 ) + θ H 3 (θ 0, λ 0 A, λ0 B 2 )(ˆθ θ 0 ) + λa H 3 (θ 0, λ 0 A, λ0 B 2 )(ˆλ A λ 0 A ) + λ B2 H 3 (θ 0, λ 0 A, λ0 B 2 )(ˆλ B2 λ 0 B 2 ), where θ 0, λ 0 A and λ0 B 2 denote the true values of θ, λ A and λ B2 respectively. Denote H 0 = H 3(θ 0, λ 0 A, λ 0 B 2 ), λ1 = λa and λ2 = λb2. Then, one has ˆθ 3 = θ 0 θ H 1 0 H 0 + λ 1 H 0 (ˆλ A λ 0 A ) + λ 2 H 0 (ˆλ B2 λ 0 B 2 ). If the estimates of the linkage probabilities are obtained by a random audit sample (of the size m A for λ A and m B for λ B 2 ) of linked records, one has and Var X (λ A ) = (m A ) 1 λ A (1 λ A ) Var X (λ B2 ) = (m B ) 1 λ B2 (1 λ B2 ). Theorem 5. An asymptotic variance estimator of 3 is of the form ˆV λ 3 X (ˆθ 3 ) = θ Ĥ 1 ( ) ( 0 ˆV3 X θ 3 + λ1 Ĥ ) 0 VarX (ˆλ A ) ( λ1 Ĥ T 0) ˆθ + ( λ2 Ĥ 0) VarX (ˆλ B2 ) ( λ2 Ĥ 0) T { θ Ĥ 0 1 } T, where λ1 Ĥ 0 = G (M 1) 1 (M I 1 1 T E ) ˆf (ˆθ, ˆλ B2 ) and λ2 Ĥ 0 = G E A (M 1) 1 (M I 1 1 T ) X 2 ˆβ 2. and ˆV 3 X is the asymptotic variance estimator for ˆθ 3 in the Theorem 4. Similarly, an asymptotic variance estimator for when the linkage probabilities are unknown, can be represented by ˆV λ 2 X (ˆθ 2 ) = θ Ĥ 1 ( 0 ˆV2 X θ 2 ) ( + λb1 Ĥ 0 ˆθ 2, the unbiased estimator for the Case 1 ) VarX (ˆλ B1 ) ( λb1 Ĥ 0 + ( λb2 Ĥ ) 0 VarX (ˆλ B2 ) ( { λb2 Ĥ T θ } 0) Ĥ 1 T, 0 13 ) T

16 where, λ B1 = pr(correct linkage between Y and X 1 ) and λb1 Ĥ 0 = G (M 1) 1 (M I 1 1 T X ) ˆβ 1 1. Finally, an asymptotic variance estimator for when the linkage probabilities are unknown, can be represented by ˆθ 1, the unbiased estimator for the Case 0 ˆV λ 1 X (ˆθ 1 ) = θ Ĥ 1 ( ) ( 0 ˆV1 X θ 1 + λb2 Ĥ ) 0 VarX (ˆλ B2 ) ( { λb2 Ĥ T θ } 0) Ĥ 1 T Simulation We use simulation to compare the performances of different estimators we considered in this study. The linear model we used in this simulation is of the form Y = 1 + 5X 1 + 8X 2 + ǫ, where X 1 were drawn from the standard normal distribution and X 2 were drawn from the normal distribution with mean= 2 and the variance of 4. ǫ were drawn from the standard normal distribution as well. In this simulation, we consider all three cases we have studied: Case 0: X 1 is the bench mark data set and the mismatch happens only from the linkage between X 1 and r X 2. Case 1: Y is the bench mark data set and the linkages between Y and X 1 and the linkages between Y and X 2 are done with some errors. Case 2: X 1 is the bench mark data set and the linkage between X 1 and X 2 data set and the linkage between X 1 and Y are done with some errors. Here, we will only explain how we generate the data sets for Case 2. Generating the data sets for other cases are uite trivial. There are three m-blocks and in each m block, the pairs (x 1i, x 2i) were generated according to an independent exchangeable linkage error model. Further, given X i = (1, x 1i, x 2i), the pairs (yi, X i) were generated according to another independent exchangeable linkage error model. In this simulation, we use three m-blocks of sizes 500 for each m-block. Also we 14

17 assume that the probability of correct linkage between Y and X and probability of correct linkage between X 1 and X 2 are known. The estimators for the simulations are 1. the naive OLS estimator (ST), 2. the ratio-type estimator (R), 3. the Lahiri-Larsen estimator (A) and 4. the empirical Best Linear Unbiased Estimator, EBLUE, (C). Note that different estimating functions have different form of G. In our case, 1. the naive estimator: G = (X )T, 2. the Lahiri-Larsen estimator: G = (ÊA X E ) T and 3. the EBLUE: G = (ÊA X E )T (ˆσ 2I + ˆV C2 + ˆV A ) 1. The assumptions on the probability of correct linkage on each m-block are the probability of correct linkage between Y and X 1 : λ A1 = 1, λ A2 = 0.95 and λ A3 = 0.75, the probability of correct linkage between Y andx 1 : λ B11 = 1, λ B12 = 0.95 and λ B13 = 0.75 and the probability of correct linkage between X 1 and X 2 : λ B 21 = 1, λ B22 = 0.85 and λ B23 = 0.8. Under the above scenario, the estimators were independently simulated 1000 times. The regression parameters were estimated using the four estimators. The following plot box represent the overall performance of the estimators. Clearly, the ration-type estimator, the Lahiri-Larsen estimator and the EBLUE correct the bias due to incorrect linkage, and the EBLUE outperforms other estimators, that was also noted in Chambers (2008) where two registers were merged. These observations are consistent for all three cases. It is worth to note that the EBLUE(C) outperforms all other estimators in general. The figures 1-3 clearly show that EBLUE is the best one. However, our simulation shows that the relative biases of EBLUE, when λs are unknown, are 15

18 larger than the Lahiri-Larsen estimator and the ratio-type estimator. But the overall relative RMSE are smaller than other estimators. Table 1 here. Table 2 here. Table 3 here. 3 Sample-register case In this section, we consider the case where we only observe a sample s of records from the bench mark data set. Suppose that X 1 is the bench mark data set. When all the records in X 1 -register are linked to the records inx 2 -register and Y -register, all of the sample records s from X 1 -register are perfectly linked with some records in X 2 -register and Y -register. However, in reality, some records in the sample s cannot be linked to a record in X 2 -register or Y -register. We will consider these two cases separately. 3.1 Sample-register case: When sample records are perfectly linked As before, we will consider three different cases, Case 0, Case 1 and Case 2. Let us start with Case 2. If all records in the sample s are linked to the records inx 2 - register and Y -register, We can assume that the sample s is a part of X 1 -register that is complete register-register linkage. Hence we can use a weighted estimating function. In this subsection, we will modify the estimating function approach to accommodate this sampleregister linkage. When the sample records s from X 1 -register are linked to X 2 -register and Y -register, we observe a subset s of M records from Y, which we denote by Y s. More precisely, let M be the population number in the th m-block, and let m s be the sample size in the th m-block. We use a subscript of s to denote uantities that depend on the sample records in the th m-block. Similarly, the subscript of r is used to indicate uantities that depend on the non-sample records in the th m-block. 16

19 Under perfect linkage of the sample data, when there is no linkage error, the true parameter θ 0 can be estimated by solving the estimating euation H s (θ) = G s { Y s f s (θ) }, where G s is modified by the sample weights w s that depend on the ratio of the sample size from the population. When there exist linkage errors and we ignore the errors, the estimating euation is then of the form H s(θ) = G s { Y s f s(θ) }, where and A = ( As Y s = A sy A r ) ( ) Ass A = sr is the sample/non-sample decomposition of the complete linkage process in the th m-block. This estimating euation leads to a bias because E X (Y s ) f s. To correct the bias, by using the fact that E X (Y s ) = E X (A s Y ) = E A s f E (θ), we modify this estimating euation A rs A rr H adj s (θ) = = G s { Y s E A s f E (θ)} G s { Y s E Ass f E s(θ) E Asr f E r(θ) }, (9) where E A = ( EAs E Ar ) ( EAss E = Asr E Ars E Arr is the corresponding sample/non-sample decomposition of the expected value E A of A. Now, by the definition of E A, one has ( λa M 1) E Ass = I s + M 1 ( 1 λa M 1 ) ) 1 s 1 T s (10) and ( 1 λa E Asr = M 1 ) 1 s 1 T r 17

20 so that (9) becomes H adj s (θ) = { ( G s Y s λa M 1) M 1 I s f E s (θ) ( 1 λa M 1 ) 1 s 1 T f E (θ)}. Using a weighting approach 8, the unknown value 1 T f E (θ) can be replaced by wt s fe s (θ) under the assumption that the samples are chosen randomly from the population. This leads us to the estimating function of the form H adj ws3 (θ) = G s { Y s ẼA s f E s (θ)}, (11) where ( λa M 1) Ẽ As = I s + M 1 ( 1 λa M 1 ) 1 s w T s. For the Case 1, formulae are similar, but simpler than those in Case 2. Note that, in this case, we observe true Y s. Hence, the estimating function is of the form H adj ws2(θ) = G s { Y s f E2 s (θ)}, where f E2 s = X E2 s β = (1 s, ẼB 1s X 1s, ẼB 2s X 2s)β, ( λb1 M 1) ( 1 λb1 ) Ẽ B1s = I s + 1 s w T s and M 1 M 1 ( λb2 M 1) ( 1 λb2 ) Ẽ B2s = I s + 1 s w T s M 1 M 1. Finally, for the Case 0, it has simplest forms for their formulae since there is only one mismatch. The estimating function is of the form H adj ws1 (θ) = G s { Y s f E s (θ)}, where Theorem 6. Let f E s = XE s β = (1 s, X 1s, ẼB 2s X 2s )β and ( λb2 M 1) ( 1 λb2 ) Ẽ B2s = I s + 1 s w T s M 1 M 1. ˆθ s 3 be the solution of the estimating euation (11). Then, under the assumption that we know true λ A and λ B2, an asymptotic variance estimator is of the form V ws s 1 ( 1 ) T, 3 X (ˆθ 3 ) = G s Ẽ As θ f E s(ˆθ) G s ˆΣs G T s G s Ẽ As θ f s(ˆθ) E 8 In this article, we simply use weight w s = ( M m s ) 1. 18

21 where ( (λa M 1)d i + M (1 λ A ) ˆΣ s diag d s +(1 λ A ) λ A (fi E M 1 f s E )2 E(2) + f s ( f ) s E )2 ; i s with D s = diag{d i ; i s } Var X (Y s ) and d s is the mean of {d i ; i s }. Let ˆθ s 2 be the solution of the estimating euation for the Case 1. Then, under the assumption that we know true λ B1 and λ B2, an asymptotic variance estimator is of the form V ws s 2 X (ˆθ 2 ) = G s θ f E2 s 1 s (ˆθ 2 ) ( G ˆΣ(2) s s GT s G s θ f E2 s 1 ) T, s (ˆθ 2 ) where Finally, let ˆΣ (2) s = ˆσ 2 si s + ˆV B1s + ˆV B2s. ˆθ s 1 be the solution of the estimating euation for the Case 0. Then, an asymptotic variance estimator is of the form V ws s 1 X (ˆθ 1 ) = G s θ f E s 1 s (ˆθ 1 ) where, G s ˆΣ(1) s GT s ˆΣ (1) s = ˆσ 2 si s + ˆV B2s. ( G s θ f E s 1 ) T, s (ˆθ 1 ) Note that the above asymptotic variance estimator assumes that the λ A, λ B1 and λ B2 are known. If we need to estimate these probabilities, the asymptotic variance estimator should have more terms that count the estimations of λ A, λ B1 and λ B2. To see this, note that, when λ A and λ B2 are estimated, (11) becomes H adj ws3,λ (θ, λ A, λ B2 ) = G s { Y s ẼA s (λ A )f E s(θ, λ B2 ) }. (12) In this case, the asymptotic variance estimator is of the form Var X (ˆθ) θ H adj 1 ( ws3,0 Var X H adj ) ( ws3,0 + λa H adj ) ws3,0 VarX (λ A ) ( λa H adj ) T ws3,0 + ( λb2 H adj ) ws3,0 VarX (λ B2 ) ( ) { T θ λb2 Hadj ws3,0 H adj } 1 T, (13) ws3,0 where H adj ws3,0 = H adj ws3,λ (θ 0, λ 0 A, λ 0 B 2 ), λa H adj ws3,0 = λb2 H adj ws3,0 = G s (M 1) 1 (M I s 1 s w T s) f E s(θ 0, λ 0 B 2 ), G s Ẽ As (M 1) 1 (M I s 1 s w T s) X 2β 2. (14) 19

22 Corollary 7. Let variance estimator is of the form Also, Let ˆθ s 3 be the solution of the estimating euation (12). Then, an asymptotic V ws,λ s 3 X (ˆθ 3 ) = θ Ĥ adj 1 ws3,0 V ws s 3 X (ˆθ 3 ) + ( λ1 Ĥ adj ) ws3,0 VarX (ˆλ A ) ( λ1 Ĥ adj ) T ws3,0 + ( λ2 Ĥ adj ) ws3,0 VarX (ˆλ B2 ) ( λ2 Ĥ adj ) { T θ } ws3,0 Ĥ adj 1 T. ws3,0 ˆθ s 2 be the solution of the estimating euation for the Case 1. Then, an asymptotic variance estimator is of the form V ws,λ s 2 X (ˆθ 2 ) = θ Ĥ adj 1 ws2,0 V ws s 2 X (ˆθ 2 ) + ( λb1 Ĥ adj ws2,0) VarX (ˆλ B1 ) ( λb1 Ĥ adj ) T ws2,0 + ( λb2 Ĥ adj ws2,0) VarX (ˆλ B2 ) ( λb2 Ĥ adj ) { T θ } ws2,0 Ĥ adj 1 T, ws2,0 where H adj ws2,0 = H adj ws2(θ 0, λ B1, λ B2 ), Var X (λ B1 ) = (m B 1 ) 1 λ B1 (1 λ B1 ), λb1 H adj ws2,0 = G s (M 1) 1 (M I s 1 s w T s) X 1β 1. Further, let ˆθ s 1 be the solution of the estimating euation for the Case 0. Then, an asymptotic variance estimator is of the form V ws,λ s 1 X (ˆθ 1 ) = θ Ĥ adj 1 ws1,0 V ws s 1 X (ˆθ 1 ) + ( λb2 Ĥ adj ws1,0) VarX (ˆλ B2 ) ( λb2 Ĥ adj ws1,0 ) { T θ } Ĥ adj 1 T. ws1,0 3.2 Sample-register case: When sample records are not perfectly linked When some records are not linked, A or B 2 cannot be a permutation matrix, because the entries of some rows are all zero due to non-linkage. However, we can still use similar ideas introduced in the previous subsection. Firstly, we consider the Case 2. Let X 1s be the set of the sample records from X 1. Also let X 1sl be the set of sample records in X 1s that are linked both to X 2 -register and to Y -register. Further, let X 1su := X 1s X 1sl. Then it represents the set of sampled records in X 1s that cannot be linked either to X 2 -register or to Y -register. Also, let X 1r := X 1 X 1s, the set of non-sample records in X 1. We assume that, theoretically, 20

23 there exists X 1rl that represents the set of non-sample records that can be linked both to X 2 -register and Y -register. Similarly, X 1ru := X 1r X 1rl. Similarly, under the one to one linkage assumption, Y can be partitioned into four groups, namely Y sl, Y su, Y rl and Y ru. Thus, one has Y sl A slsl, A slsu, A slrl, A slru, Y sl Y = Y su Y = A susl, A susu, A surl, A suru, Y su rl A rlsl, A rlsu, A rlrl, A rlru, Y rl = A Y, Y ru A rusl, A rusu, A rurl, A ruru, Y ru and E slsl,a E slsu,a E slrl,a E slru,a E(A X ) = E A = E susl,a E susu,a E surl,a E suru,a E rlsl,a E rlsu,a E rlrl,a E rlru,a. E rusl,a E rusu,a E rurl,a E ruru,a Further, because X 2 also can be partitioned into X 2sl, X 2su, X 2rl and X 2ru, one has E slsl,b2 E slsu,b2 E slrl,b2 E slru,b2 E(B 2 X ) = E B 2 = E susl,b2 E susu,b2 E surl,b2 E suru,b2 E rlsl,b2 E rlsu,b2 E rlrl,b2 E rlru,b2. E rusl,b2 E rusu,b2 E rurl,b2 E ruru,b2 This leads to the estimating euation of the form H adj sl (θ) = = G sl { Y sl E A sl f E (θ)} G sl { Y sl E slsl,a f E sl (θ) E slsu,a f E su (θ) E slrl,a f E rl(θ) E slru,a f E ru(θ) }. (15) Under the exchangeable linkage error model, one has λa M 1 1 λa E slsl,a = I sl + M 1 M 1 1 λa E slsu,a = 1 sl 1 T su M 1, 1 λa E slrl,a = 1 sl 1 T M 1 rl, 1 λa E slru,a = 1 sl 1 T ru M 1. It leads (15) to the form of H adj sl (θ) = G sl { Y sl λa M 1 M 1 21 I sl f E sl(θ) 1 sl 1 T sl, 1 λa M 1 1 sl 1 T f E }.

24 If we assume that the distribution of Y sl is the same as that of Y in the population, the observable population value 1 T f E (θ) can be replaced by weighted sample estimate by w T sl f E sl (θ)9 so that one has H adj wsl (θ) = G sl { Y sl ẼA sl f E sl(θ) }, where For f E sl (θ), note that by (2) where λa M 1 Ẽ Asl = I sl + M 1 1 λa M 1 1 sl w T sl. f E sl(θ) = (1 sl, X 1sl, E Bsl,2 X 2)(β 0, β1, β 2 ) T, E Bsl,2 X 2 = E slsl,b 2 X 2sl + E slsu,b 2 X 2su + E slrl,b 2 X 2rl + E slru,b 2 X 2ru. The exchangeable linkage error model provides that λb2 M 1 E slsl,b2 = M 1 1 λb2 E slsu,b2 = M 1 1 λb2 E slrl,b2 = M 1 1 λb2 E slru,b2 = M 1 I sl + 1 sl 1 T su, 1 sl 1 T rl, 1 sl 1 T ru. 1 λb2 M 1 1 sl 1 T sl, If we also assume that the distribution of X 2sl is the same as that of X 2 in the population, then E Bsl,2 X 2 can be replaced by ẼB sl,2 X 2sl where Then, f E sl(θ) can be evaluated by λb2 M 1 Ẽ Bsl,2 = I sl + M 1 1 λb2 M 1 1 sl w T sl. f E sl (θ) = (1 sl, X 1sl, ẼB sl,2 X 2sl )(β 0, β1, β 2 ) T. 9 We will use w sl = ( M m sl )1 sl, where m sl is the number of linked sample records, while M is the total population number in th m-block. 22

25 Suppose that we know λ A and λ B2, and let ˆθ be the solution of the estimating euation. To derive the asymptotic variance estimator for ˆθ, note that (47) becomes now Var X (ˆθ) θ H adj wsl (θ 0) 1 VarX H adj wsl (θ 0) ( θ H adj wsl (θ 0) ) 1 T with corresponding estimator of the form V X (ˆθ) = θ H adj wsl (θ 0) 1 V X H adj wsl (θ 0) ( θ H adj wsl (θ 0) ) 1 T 1 ( G sl Ẽ Asl θ f E sl (ˆθ) (ˆθ) 1 ) T, G sl ˆΣsl G T sl G sl Ẽ Asl θ f E sl under the assumption that G sl is independent of θ. By the similar arguments in (48)- (49), Σ sl = Var X (Y sl ) Var X (A slsl, Y sl ) + Var X (A slsu, Y su ) + Var X (A slrl, Y rl ) + Var X (A slru, Y ru ) that can be approximated by ˆΣ sl diag ( (λa M 1)d i + M (1 λ A ) d sl M 1 + (1 λ A ) λ A (f E i f E sl )2 + E(2) f sl ( f ) sl E )2 ; i {1,...,m sl }. If we need to estimate λ A and λ B2, we still can use the asymptotic variance estimator defined by (13)-(14), except that the subscripts sp and ws need to be replaced by slp and wsl. That is, the asymptotic variance estimator is of the form Var X (ˆθ) θ H adj 1 ( ) ( ) wsl,0 Var X H adj wsl,0 + λ1 H adj wsl,0 VarX (ˆλ A ) ( ) λ1 H adj T wsl,0 + ( ) λ2 H adj wsl,0 VarX (ˆλ B2 ) ( ) { λ2 H adj T θ } wsl,0 Hwsl,0 adj 1 T. Using the above arguments, it is clear that, to deal with Case 0 and Case 1 in this case, we can use the formulae in the previous subsection by replacing sp and ws with slp and wsl. 3.3 Simulation We use simulation to compare the performances of different estimators we considered in this study for the sample to register linkage case. The linear model we used in this simulation is the same as before, Y = 1 + 5X 1 + 8X 2 + ǫ. 23

26 Most of assumptions and scenarios we made for the register to register case are the same except that we use the sample here instead of whole population. In this simulation, we considered the case of complete linkage and incomplete linkage separately. For the case of complete linkage, we assume that the sample records s from the bench mark data sets are linked to the records in other registers. The extra assumption we made in this simulation is that the population size of all registers the same and each m-block has 2000 records, and 500 samples are chosen randomly for each m-block. Further, in case of incomplete linkage, we assume that, among 2000 records, half of them cannot be linked. In this incomplete linkage case, we chose 1000 samples. The reason is that because half of them cannot be linked, we might have around 500 samples that are linked to other registers. This assumption will provide another consistent comparisons of the same estimators between the complete linkage case and incomplete linkage case. The results for the complete linkage case can be found in the Table 4 Table 6, while the results for the incomplete case are in Table 7 Table 9 The result shows very similar pattern in the register to register case. Clearly, while the ratio-type estimator, the Lahiri-Larsen estimator and the EBLUE correct the bias due to linkage errors, the EBLUE outperforms all other estimators. Here are the results for the complete linkage case: Table 4 here. Table 5 here. Table 6 here. Here are the results for the incomplete linkage case: Table 7 here. Table 8 here. Table 9 here. The results for the sample-register cases are very similar to the register-register cases as long as the sample sizes are similar. One thing to note is that the coverage rates are all higher than 95%. This is not the case when the number of merged data sets are two. One possible explanation is that the variance terms in these cases are more complicated and, as the number of merged data sets increase, the variances increase as well so that the confidence intervals are becoming wider. 24

27 4 Conclusion and further research direction In this paper we extend the linkage error adjusting techniue in regression analysis developed in Chambers (2008) to accommodated the situation where the number of merged data sets are more than two. We developed a ration-type estimator for the regression analysis and then it has been extended to more general adjusted estimating function approach. These methods can deal with the case where all the data sets are registers, as well as the case where the bench mark data sets are sample and the others are registers. Even though it hasn t been dealt here, it is easy to see that these methods can naturally accommodate the case where all the data sets are sample. These methods also extended to deal with the situation where some of sample data are failed to be linked to other registers. However, all of these bias correction methods have to pay the price of large variance. Furthermore, in the case of sample-registers case with non-linkage situation, the number of linked sample data, if the the number of merged data sets are increasing, will be decreasing. Thus, we expect some sort of loss of information by merging more data sets. We expect to overcome this limitation by adapting other approaches. Another limitation of these methods is that we assume that the linkage errors among the data sets occurs randomly. However, there might be some correlation among the linkage errors. To deal with this situation, our model should include more complicated covariance measures in the formulae and it will be dealt in our next research paper. A Proofs of the Propositions and Theorems A.1 Proof of Proposition 1 For the variance of the estimator, note that Var X (ˆβ R ) = D 1 1 Var X (ˆβ ) ( D 1 1 ) T, where Var X (ˆβ 1 1. ) = (X )T X (X )T Var X (Y )X (X )T X 25

28 Further, one has Var X (Y ) = E X Var X (Y B 2 ) + Var X E X (Y B 2 ). Note that, by (1), E X (Y B 2 ) = X B 2 β. Thus, Denote that V B2 = Var X E X (Y B 2 ) = E X = E X f B 2 = X 2β 2. 2 X B 2 β XE β 2. B2 T X 2 β 2 E B2 X 2 2 β Then, by (16) from Chambers (2008), V B2 = diag (1 λ B2 ) { λ B2 (fb 2,i f B 2 ) 2 + f (2) B 2 ( f B 2 ) 2}, (16) where f B 2 = (fb 2,i ) and f B (2) 2, f B 2 in f B 2. Furthermore, one has are the averages of fb 2,i and their suares respectively Var X (Y B 2 ) = E X (Y X B 2 β)2 = E X (ǫ ) 2 = σ 2 I. Therefore, one has Var X (Y ) = σ 2 I + V B2 (17) which implies that 1 ( Var X (ˆβ R ) = (X )T X E (X )T σ 2 I 1. + V B2 )X (X )T X E (18) To evaluate Var X (ˆβ R ), Then, one has T (Y f E )T (Y f E (Y ) = f ) (f E f ) (Y f ) (f E f ) = (Y f ) T (Y f ) (19) (Y f ) T (f E f ) (f E f ) T (Y f ) (20) + (f E f ) T (f E f ). (21) Note that, by the definition, (Y f ) T (Y f ) = Nˆσ 2. (22) 26

29 Further, E X (Y f ) T (f E f ) + (f E f ) T (Y f ) = 0 (23) because Y f = ǫ and cov(ǫ, X ) = 0. Moreover, one has E X (f E f ) T (f E f ) = E X (f E ) T (f E Y ) + (f E ) T (Y f ) + (f ) T (f Y ) + (f ) T (Y f E ) (24) because E X ( Y f E = 0 ) = 0. Thus, by (19)-(24), ˆσ 2 = N 1 (Y f E )T (Y f E ). (25) Conseuently, Var X (ˆβ R ) can be evaluated by using (25), (16) and (18). A.2 Proof of Proposition 2 For the variance of the estimator, one has Var X (ˆβ R ) = D 1 2 Var X (ˆβ ) ( D 1 2 ) T, where Var X (ˆβ 1 1. ) = (X )T X (X )T Var X (Y )X (X )T X Note that one has Var X (Y ) = E X Var X (Y B 1, B 2 ) + Var X E X (Y B 1, B 2 ). Then, by the assumption that the mismatches found in X 1 mismatches found in X 2, are not correlated with the 2 V B = Var X E X (Y B 1, B 2 ) = E X X B 2 β X E β 2 = E X (B1 T X 1β 1 E B1 X 1β 1 ) + (B2 T X 2β 2 E B2 X 2β 2 ) 2 = E X B1 T X 1β 1 E B1 X 1β 1 + EX B T2 X 2β 2 E B2 X 2β 2 2 = V B1 + V B2, (26) 27

30 where V B2 is defined in (16) and V B1 also can be defined similarly. Then, by the similar arguments to (17)-(25), one has 1 ( 1, Var X (ˆβ R ) = (X ) T X E (X ) T σi 2 + V B )X (X ) T X E where ˆσ 2 can be evaluated by (25). A.3 Proof of Proposition 3 To derive the variance of ˆβ R, note that Var X (ˆβ R ) = D 1 3 Var X (ˆβ ) ( D 1 3 )T, where Var X (ˆβ 1 1. ) = (X ) T X (X ) T Var X (Y )X (X ) T X Hence, we need to calculate Var X (Y ) first in order to derive the variance of ˆβ R. Note that Var X (Y ) Var X(Y ). To see this, one has Var X (Y ) = E X Var X (Y A ) + Var X E X (Y A ). (27) Then, by (2) and (3) E X (Y A ) = A E X (Y ) = A X E β = A f E. Note that f is not observable, because it is the expectation of Y, that is also not observable, under completely correct linkage. f is observable,but it contains incorrect linkage between X 1 and X 2. f E is a adjusted version of f to eliminate the bias due to incorrect linkage between X 1 and X 2. Also let V A = Var X E X (Y A ). Then, one has 10 V A = Var X (A f E ). 10 One way to estimate V A is using (16) from Chambers (2008). Then, V A = diag (1 λ A ) { λ A (f,i E f E )2 E(2) + f ( f E )2}, (28) where f E = (fe,i ) and f E, fe(2) are the averages of f E,i and their suares respectively in f E. 28

31 Further, Var X (Y A ) = Var X (A Y ) = A Var X (Y )A T ( ) = A (E X Var X (Y B 2 ) )A T + A Var X E X (Y B 2 ) A T, (29) because one has Var X (Y ) = E X Var X (Y B 2 ) + Var X E X (Y B 2 ). (30) Note that, by (1), E X (Y B 2 ) = X B 2 β. Thus, Denote that Also, let Var X E X (Y B 2 ) 2 = E X X B 2 β X E β = E X B T2 X 2β 2 E B2 X 2β 2. (31) 2 f B 2 = X 2β 2. C B2 = A B T 2, which is another permutation matrix, and let E C2 = E X ( A B T 2 ). Further, let Then, one has 11 V C2 = A Var X E X (Y B 2 ) A T. (32) V C2 = E X C 2 f B2 (f B2 ) T C T2 E C2 f B 2 (f B 2 ) T EC T 2. Furthermore, one has Var X (Y B 2 ) = E X (Y X B 2 β)2 = E X (ǫ ) 2 = σ 2 I. 11 By (16) from Chambers (2008), V C2 = diag (1 λ C2 ) { λ C2 (fb 2,i f B 2 ) 2 (2) + f B 2 ( f B 2 ) 2}, where f B 2 = (f B 2,i ) and f B 2, f (2) B 2 are the averages of f B 2,i and their suares respectively in f B 2. 29

32 Hence, A (E X ) Var X (Y B 2 ) A T = A σi 2 A T = σa 2 A T = σi 2. (33) Thus, by (29), (30), (32) and (33) E X Var X (Y A ) = E X A Var X (Y )A T = E X {σ 2I } + V C2 (34) = σ 2 I + V C2. Then, by (27), (32) and (34), Var X (Y ) = σ2 I + V C2 + V A = Σ. (35) Conseuently, one has Var X (ˆβ 1 ( 1, ) = (X )T X (X )T σ 2 I + V C2 + V A )X (X )T X and V R = Var X (ˆβ R ) 1 ( = (X ) T E A X E (X ) T σi 2 + V C2 + V A )X (X ) T E A X E 1. To define ˆV R, the estimator of V R, let ˆf B 2 = X 2 ˆβ 2 and ˆf E = XE ˆβ, where ˆβ 2 and ˆβ are the estimates of β 2 and β respectively. Then, ˆV A and ˆV C2 can be estimated by replacing f E and f B 2, in V A and V C2, with ˆf E and ˆf B 2 respectively. Now, to estimate σ 2, one has (Y fe )T (Y fe ) = (Y )T Y (Y )T f E (fe )T Y + (f E )T f E = Y T AT A Y Y T f f T Y + f T f + Y T f + f T Y f T f (36) (Y )T f E (f E )T Y + (fe )T f E, 30

33 where ) (Y T A T A Y Y T f f T Y + f T f = E X E X Also, one has Further, (Y f ) T (Y f ) = Nσ 2. (37) ( E X f T Y f T f ( ) = EX f T Y f ) ( = E X f T ǫ ) = 0. (38) Y T f (Y )T f E (fe )T Y + (f E )T f E = Y T f (Y )T f E (fe )T Y + (f E )T f E Y T f E + Y T fe (f E )T f + (f E )T f (f E )T f E + (f E )T f E = Y T f E (Y )T f E + (f E )T f E (fe )T Y (39) + Y T (f E )T f + (f E )T Y T f E + (f E )T f f E. (40) Then it is easy to see that the expectation of (40) is zero. Also, ( Y T E X f E (Y )T f E + (f E )T f E (f ) E )T Y = 2(f E )T I E A f E (41) Therefore,by (36) (41), ( ˆσ 2 = N 1 (Y fe )T (Y fe ) 2 (f E )T ) I E A f E. A.4 Proof of Theorem 4 Let ˆθ 1 be the solution of (6). Then, the asymptotic variance of ˆθ 1 is of the form Var X (ˆθ 1 ) θ H 1 (θ 0) 1 Var X H 1 (θ 0) ( θ H 1 (θ 0) 1 ) T. Note that, in general G (θ) is a function of both θ and X, but, in our case, we only consider the case where G is a function of X. Thus, Further, by (17), one has Var X θ H 1 (θ) = H 1 (θ) = = = G θ f E (θ). ( ) G Var X Y G T G σ 2 I + V B2 G T G Σ 1 G T. 31

34 Therefore, the asymptotic variance estimator is of the form V 1 X (ˆθ 1 ) = G θ f E (ˆθ 1 1 ) ( G ˆΣ 1 G T G θ f E (ˆθ 1 ) T. 1 ) Let us consider the Case 1 where Y is the bench mark data set and the linkages between Y and X 1 and the linkages between Y and X 2 are done with some errors. In this case, we have similar estimating function H 2(θ) = G (θ) { Y f E2 (θ) }, but, by (4), f E2 (θ) = X E2 β = (1, E B1 X 1, E B 2 X 2 )β. This leads the asymptotic variance estimator of the form V 2 X (ˆθ 2 ) = G θ f E2 (ˆθ 1 2 ) G ˆΣ 2 G T ( G θ f E2 (ˆθ 2 ) 1 ) T, where, by (26), ˆΣ 2 = ˆσ 2 I + ˆV B1 + ˆV B2. Finally, the asymptotic variance of ˆθ 3 is of the form Var X (ˆθ 3 ) θ H 3 (θ 0) 1 Var X H 3 (θ 0) ( θ H 3 (θ 0) ) 1 T, (42) where, θ H 3(θ) = G E A θ f E (θ). (43) Further, by (35), one has Var X H 3 (θ) = = = G Var X ( Y ) G T G σ 2 I + V C2 + V A G T G Σ 3 GT. Therefore, the asymptotic variance estimator is of the form V 3 X (ˆθ 3 ) = G E A θ f E (ˆθ 1 3 ) ( G ˆΣ 3 G T G E A θ f E (ˆθ 1 ) T, 3 ) as reuired. 32

35 A.5 Proof of the Theorem 5 Let λ 1 = λ A and λ 2 = λ B2. Then the variance of ˆθ 3 can be approximated by Var X (ˆθ 3 ) { θ H0 1VarX θ } H 0 + λ1 H 0(ˆλ A λ A ) + λ2 H 0(ˆλ B2 λ B2 ) H 1 T 0 = θ H 1 ( ( 0 Var X H 0) + λ1 H0) VarX (λ A ) ( ) λ1 H T 0 + ( λ2 H0) VarX (λ B2 ) ( { λ2 H0) T θ } H0 1 T. To derive λi H 0, we assume that the distribution of λ i is independent 12 of the distribution of H 0. Then, by the similar arguments in Chambers (2008), λ1 H 0 = { λ 1 G Y E A (λ A )f E (θ, λ B 2 ) } = = G λ1 E A (λ A ) f E (θ, λ B2 ) G (M 1) 1 (M I 1 1 T ) f E (θ, λ B2 ) (44) (45) and λ2 H 0 = λ 2 = = = = G { Y E A (λ A )f E (θ, λ B 2 ) } G E A (λ A ) λ2 f E (θ, λ B 2 ) G E A (λ A ) λ2 (β 0 + X 1 β 1 + E B2 X 2β 2 ) G E A (λ A ) λ2 (E B2 )X 2 β 2 G E A (M 1) 1 (M I 1 1 T ) X 2 β 2. (46) Therefore, the variance Var X (ˆθ 3 ) can be evaluated by substituting the estimated values of (43), (45) and (46) into (44). For the Case 1 where Y is the bench mark data set and the linkages between Y and X 1 and the linkages between Y and X 2 are done with some errors, the variance of Var X (ˆθ 2 ) is 12 This assumption was originally introduced in Chambers (2008). 33

36 of the form Var X (ˆθ 2 ) { θ H0 1 Var X H 0 + λ B1 H 0 (ˆλ B1 λ B1 ) + λb2 H 0 (ˆλ θ } B2 λ B2 ) H 1 T 0 = θ H 1 ( ( 0 Var X H 0) + λb1 0) H VarX (λ B1 ) ( ) T λb1 H 0 + ( λb2 H0) VarX (λ B2 ) ( { T θ } λb2 0) H H0 1 T, where, λ B1 = pr(correct linkage between Y and X 1), Further, it is easy to see that H 0 = H 2(θ 0, λ B1, λ B2 ). λb1 H 0 = G (M 1) 1 (M I 1 1 T ) X 1β 1 and λb2 H 0 = G (M 1) 1 (M I 1 1 T ) X 2 β 2. Finally, for the Case 0, one has Var X (ˆθ 1 ) { θ H0 1 Var X H 0 + λ B2 H 0 (ˆλ θ } B2 λ B2 ) H 1 T 0 = θ H 1 ( ( 0 Var X H 0) + λb2 0) H VarX (λ B2 ) ( { T θ } λb2 0) H H0 1 T, where, with λ B2 = pr(correct linkage between X 1 and X 2 ), H 0 = H 1(θ 0, λ B2 ) λb2 H 0 = G (M 1) 1 (M I 1 1 T ) X 2 β 2. A.6 Proof of the Theorem 6 ˆθ s Let 3 be the solution of the estimating euation (11). To derive the asymptotic variance s estimator for ˆθ 3, note that by (42), Var X (ˆθ s 3 ) θ H adj ws (θ 0) 1 Var X H adj ws (θ 0) ( θ H adj ws (θ 0) ) 1 T (47) 34

37 with corresponding estimator of the form V ws s 3 X (ˆθ 3 ) = θ H adj ws (θ 0 ) 1 V ws 3 X H adj ws (θ 0 ) ( θ H adj ws (θ 0 ) ) 1 T G s Ẽ As θ f E s 1 s (ˆθ 3 ) ( G s ˆΣs G T s G s Ẽ As θ f E s 1 ) T, s (ˆθ 3 ) under the assumption that G s is independent of θ. Next step is to define Σ s. Note that Σ s = Var X (Y s ) = Var X (A ss Y s + A sr Y r ) (48) = Var X (A ss Y s ) + 2cov X (A ss Y s, A sr Y r ) + Var X (A sr Y r ). Further, by (30) and with similar arguments in (31)-(34), one has Var X (Y ) = E X Var X (Y B 2 ) + Var X E X (Y B 2 ) = σ 2 I + V B2, where V B2 = Var X E X (Y B 2 ) that can be approximated with a diagonal matrix 13 by the same argument in (16) from Chambers (2008). Thus, Var X (Y ) can be approximately regarded as a diagonal matrix and set Var X (Y ) D = diag{d i ; i }. In this case, one has Also, (48) becomes cov X (A ss Y s, A sr Y r ) 0. Σ s Var X (A ss Y s ) + Var X (A sr Y r ) = E X Var X (A ss Y s A ) + Var X E X (A ss Y s A ) + E X Var X (A sr Y r A ) + Var X E X (A sr Y r A ) ( ) ( ) = E X A ss Var X (Y s )A T ss + E X A sr Var X (Y r )A T sr ( ) + Var X A ss f E s + A sr f E r ( ) ( ) ( ) E X A ss D s A T ss + E X A sr D r A T sr + Var X A ss f E s + A srf E r 13 By (16) from Chambers (2008), V B2 diag (1 λ B2 ) { λ B2 (fb 2,i f B 2 ) 2 (2) + f B 2 ( f B 2 ) 2}, where f B 2 = (f B 2,i ) and f B 2, f (2) B 2 are the averages of f B 2,i and their suares respectively in f B 2. 35

Linear regression with nested errors using probability-linked data

University of Wollongong Research Online Faculty of Engineering and Information Sciences - Papers: Part A Faculty of Engineering and Information Sciences 2014 Linear regression with nested errors using