Regression analysis for longitudinally linked data


University of Wollongong Research Online
Centre for Statistical & Survey Methodology Working Paper Series, Faculty of Engineering and Information Sciences, 2010

Regression analysis for longitudinally linked data

Gunky Kim, University of Wollongong
Ray Chambers, University of Wollongong

Recommended citation: Kim, Gunky and Chambers, Ray, Regression analysis for longitudinally linked data, Centre for Statistical and Survey Methodology, University of Wollongong, Working Paper 22-10, 2010, 56p.

Research Online is the open access institutional repository for the University of Wollongong. For further information contact the UOW Library.

Centre for Statistical and Survey Methodology, The University of Wollongong

Working Paper 22-10

Regression analysis for longitudinally linked data

Gunky Kim and Raymond Chambers

Copyright 2008 by the Centre for Statistical & Survey Methodology, UOW. Work in progress; no part of this paper may be reproduced without permission from the Centre.

Centre for Statistical & Survey Methodology, University of Wollongong, Wollongong NSW. Email: anica@uow.edu.au

Regression analysis for longitudinally linked data

Gunky Kim and Raymond Chambers
Centre for Statistical and Survey Methodology, University of Wollongong

Abstract

Most probability-based methods used to link records from two distinct data sets corresponding to the same target population do not lead to perfect linkage, i.e. there are linkage errors in the merged data. Chambers (2008) describes modifications to standard methods of regression analysis that can be used with such imperfectly linked data. However, these methods assume that the linkage process is complete, i.e. all records on the two data sets are linked. This paper extends these ideas to accommodate the situation where more than two data sets are linked.

Key words: record matching; linkage errors; linear regression; logistic regression; estimating equations.

1 Introduction

In recent years, because of its capacity to create new information from already existing files, record linkage has become an important research tool in many areas such as health, business, economics and sociology. One important linkage application arises when different data sets relating to the same individuals at different points in time are linked to provide a longitudinal data record for each individual, thus permitting longitudinal analysis for these individuals. To illustrate, the Census Data Enhancement project of the

Australian Bureau of Statistics aims to develop a Statistical Longitudinal Census Dataset by linking data from the same individuals over a number of censuses. It is expected that this linked data set will provide a powerful tool for future research into the longitudinal dynamics of the Australian population. However, without access to the same unique identifier in each of the linked data sets, there is always the possibility that linkage errors in the merged data could lead to a longitudinal record, ostensibly relating to a single individual, actually being made up of a composite of data items from different individuals. This in turn could lead to bias and loss of efficiency in the longitudinal modelling process. Further, as the number of censuses to be linked increases, the structure of the linkage errors becomes more complicated, adding further bias and inefficiency to the modelling process.

The work of Neter et al. (1965) shows that even a small amount of mismatching can cause significant response error, and it has become a foundation for the analysis of linkage error. Some authors, such as Scheuren and Winkler (1993), Scheuren and Winkler (1997) and Lahiri and Larsen (2005), have extended the work of Neter et al. (1965) to the regression setting. However, the literature on the analysis of linkage error remains sparse. Chambers (2008) developed methods that adjust for the bias in linear regression parameter estimates induced by the linkage process when two data sets are merged. In this study, we extend the ideas of Chambers (2008) to accommodate longitudinally linked data sets, where more than two data sets are merged. In general, most work on linkage error correction has been done for the case of two merged data sets; the linkage error structure of longitudinally linked data, when more than two data sets are involved, is considerably more complicated. To the best of our knowledge, this is the first attempt to correct for linkage errors in merged data when the number of data sets is more than two. We use the three data set case to illustrate our regression analysis, but it is easy to see that the approach extends to any number of data sets.

1.1 Background and assumptions

Suppose that we are interested in fitting a regression model of the form $E_X(Y) = f(X; \theta)$,

where $f$ is a known function, the parameter $\theta$ is to be estimated, and the components of $X$ come from more than one data set. For example, consider a linear regression model of the form
$$Y = \beta_0 + X_1\beta_1 + X_2\beta_2 + \epsilon = X\beta + \epsilon,$$
where we have three different files: one for $Y$, one for $X_1$ and one for $X_2$. When these three data sets have been created separately and there is no unique identifier with which to match them, matching $y_i$ with the correct values of $x_{1i}$ from one file and $x_{2i}$ from another file can be a difficult task, and there can be a strong chance of mismatching. If mismatches exist and are ignored in the estimation process, the estimate of $\beta$ can be biased. The purpose of our study is to develop a methodological framework for adjusting the bias of the $\beta$ estimates when mismatches are expected.

The assumptions we make in this paper are:

1. For the register-register case, there exists a population of $N$ units for each of $Y$, $X_1$ and $X_2$ such that each $y_i$ should be linked with one $x_{1i}$ from one file and one $x_{2i}$ from another file.
2. $X$ can be partitioned into $Q$ different blocks, which we call m-blocks (see Chambers (2008) for a more detailed discussion of blocking).
3. Linkage errors occur only within an m-block, in the sense that records in distinct m-blocks can never be linked. The records from $X$ that make up the $q$th m-block are denoted $X_q$.
4. For the sample-register case, suppose that we only have a sample $s$ from a benchmark register, for example $X_1$, with the relation $E(Y \mid X_1, X_2) = f(X_1, X_2; \theta)$ holding when the records are correctly linked. Here $f$ can be either a linear or a logistic function. (The sample can be drawn from any of the data sets; to explain our assumptions in more detail, we assume here that the sample is drawn from $X_1$, while $Y$ and $X_2$ are registers.)
5. Denote by $X_{1s}$ the sample records of $X_1$; some of the records in $X_{1s}$ may not be linked to the records in $Y$ or $X_2$.
6. Even though some records are not linked, we assume that the regression model fitted to the linked records would remain valid for the non-linked records if their links were found.
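To make this setting concrete, the following minimal Python sketch (not from the paper; all helper names are hypothetical) generates one m-block under the Case 0 pattern studied below: $Y$ and $X_1$ correctly linked, $X_2$ mislinked with correct-linkage probability $\lambda_{B_2}$. The shuffle used here only approximates the exchangeable linkage error model introduced in Section 2, and the parameter values mirror the simulation settings of Section 2.5.

```python
import numpy as np

rng = np.random.default_rng(0)

def mislink(values, lam, rng):
    """Return a copy of `values` in which each record keeps its correct
    position with probability lam, while the remaining records are
    shuffled among themselves (an approximation to the exchangeable
    linkage error model)."""
    n = len(values)
    out = np.array(values, copy=True)
    wrong = np.where(rng.random(n) >= lam)[0]
    out[wrong] = out[rng.permutation(wrong)]
    return out

# One m-block of M records, Case 0: Y and X1 correctly linked,
# X2 linked to X1 with correct-linkage probability lam_B2 = 0.85.
M, lam_B2 = 500, 0.85
x1 = rng.standard_normal(M)
x2 = rng.normal(2.0, 2.0, M)            # mean 2, variance 4
y = 1 + 5 * x1 + 8 * x2 + rng.standard_normal(M)

x2_star = mislink(x2, lam_B2, rng)      # the observed, mislinked column
X_star = np.column_stack([np.ones(M), x1, x2_star])

# Naive OLS on the mislinked design is biased for beta = (1, 5, 8).
beta_naive = np.linalg.lstsq(X_star, y, rcond=None)[0]
```

Fitting naive OLS to `X_star` illustrates the kind of bias that the estimators developed in the rest of the paper are designed to correct.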

2 Register-register case

When there are three data sets, one of them is usually regarded as a benchmark data set, and mismatches happen when this benchmark data set is linked with the other data sets. Thus, when there are three data sets, we expect that at most two kinds of mismatches can happen. For example, if we set $X_1$ as the benchmark data set, mismatches can happen when we link $Y$ with $X_1$ or when we link $X_1$ with $X_2$. We consider the case of one mismatch and the case of two possible mismatches separately. For the two mismatch case, we assume that mismatches from the linkage process between $Y$ and $X_1$ are independent of the mismatches from the linkage process between $X_1$ and $X_2$.

2.1 Three data sets and one mismatch: A ratio-type estimator

Note that our model is of the form $Y = \beta_0 + X_1\beta_1 + X_2\beta_2 + \epsilon = X\beta + \epsilon$, where $X = (1, X_1, X_2)$. Suppose that $X_1$ is the benchmark data set. Then a mismatch can happen either when one links records from $X_1$ with $Y$, or when one links records from $X_1$ with $X_2$. However, if the mismatch happens only when one links records from $X_1$ with $Y$, and $X_1$ and $X_2$ can be linked perfectly, one can regard $X = (1, X_1, X_2)$ as a single data set, and this case has been dealt with extensively in Chambers (2008). Hence, we only consider the case where mismatches happen when one links records from $X_1$ with $X_2$. We call this situation Case 0.

Case 0: When each $x_{1i}$ is correctly linked with the corresponding $y_i$, but some of the $x_{2i}$ are not correctly linked with $x_{1i}$, one has
$$Y = \beta_0 + X_1\beta_1 + X_2^*\beta_2 + \epsilon = X^*\beta + \epsilon,$$
where $X^* = (1, X_1, X_2^*)$, $X_2^* = B_2X_2$ and $B_2$ is a permutation matrix. Note that $X_2$ is not observable; we only observe $X_2^*$. However, if the matrix $B_2$ is known, one has $X_2 = B_2^TX_2^*$.

Thus
$$X = (1, X_1, X_2) = (1, X_1, B_2^TX_2^*).$$
Let
$$X_{B_2} = (1, X_1, B_2^TX_2^*). \quad (1)$$
Note that $X_{B_2}$ is only observable if $B_2$ is known. But, generally, $B_2$ is unknown, and in this case we adopt the non-informative linkage assumption (that is, we assume that the distribution of $B_2$ is independent of $X_2$ given $X$), under which
$$E_X(X_2^*) = E_X(B_2)X_2 = E_{B_2}X_2,$$
where $E_{B_2}$ satisfies the exchangeable linkage error model. That is,
$$E_{B_2} = (\lambda_{B_2} - \gamma_{B_2})I + \gamma_{B_2}\mathbf{1}\mathbf{1}^T,$$
where
$$\lambda_{B_2} = \mathrm{pr}(\text{correct linkage between } X_1 \text{ and } X_2)$$
and
$$\gamma_{B_2} = \mathrm{pr}(\text{incorrect linkage between } X_1 \text{ and } X_2).$$
Let
$$X^E = E_X(X^*) = E_X(1, X_1, X_2^*) = (1, X_1, E_{B_2}X_2). \quad (2)$$
Then, by OLS, one has
$$\hat\beta^* = \{(X^*)^TX^*\}^{-1}(X^*)^TY,$$
where
$$E_X(\hat\beta^*) = \{(X^*)^TX^*\}^{-1}(X^*)^TX^E\beta = D_1\beta.$$
Hence, if the matrix $E_{B_2}$ is known and the inverse of $D_1$ exists, a ratio form of an unbiased estimator of $\beta$ is
$$\hat\beta_{R1} = D_1^{-1}\hat\beta^*.$$
Let
$$f = X\beta, \qquad f^* = X^*\beta, \qquad f^E = X^E\beta. \quad (3)$$
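As a concrete illustration (a sketch under the paper's assumptions, not the authors' code), the snippet below builds $E_{B_2}$, the adjusted matrix $X^E$ of (2) and the ratio-type estimator $\hat\beta_{R1} = D_1^{-1}\hat\beta^*$, continuing the simulated block from the earlier snippet. Two caveats: the value $\gamma_{B_2} = (1-\lambda_{B_2})/(M-1)$ is the usual exchangeable-model choice (it is consistent with the derivative formulas used in Appendix A.5, but is an assumption here), and $X^E$ involves the correctly ordered $X_2$, which is available below only because the data are simulated.

```python
import numpy as np

def exch_E(M, lam):
    """E_{B_2} = (lam - gamma) I + gamma 1 1^T under the exchangeable
    linkage error model, taking gamma = (1 - lam) / (M - 1)."""
    gamma = (1.0 - lam) / (M - 1)
    return (lam - gamma) * np.eye(M) + gamma * np.ones((M, M))

def beta_R1(y, x1, x2_true, x2_star, lam_B2):
    """Ratio-type estimator beta_R1 = D1^{-1} beta_star for Case 0."""
    M = len(y)
    X_star = np.column_stack([np.ones(M), x1, x2_star])  # observed design
    X_E = np.column_stack([np.ones(M), x1, exch_E(M, lam_B2) @ x2_true])
    XtX = X_star.T @ X_star
    beta_star = np.linalg.solve(XtX, X_star.T @ y)       # naive OLS
    D1 = np.linalg.solve(XtX, X_star.T @ X_E)
    return np.linalg.solve(D1, beta_star)
```

Applied to the data generated earlier, `beta_R1(y, x1, x2, x2_star, 0.85)` is approximately unbiased for $(1, 5, 8)$, while the naive fit is not.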

Proposition 1. An asymptotic variance estimator of $\hat\beta_{R1}$ can be defined by
$$\hat V_1(\hat\beta_{R1}) = \{(X^*)^TX^E\}^{-1}(X^*)^T\hat V_1(Y)X^*[\{(X^*)^TX^E\}^{-1}]^T,$$
where $\hat V_1(Y) = \hat\sigma^2I + \hat V_{B_2}$. Here, $\hat V_1(Y)$ can be estimated using
$$\hat\sigma^2 = N^{-1}(Y - f^E)^T(Y - f^E)$$
and, given $f_{B_2} := X_2\beta_2$,
$$V_{B_2} = \mathrm{diag}\big[(1 - \lambda_{B_2})\{\lambda_{B_2}(f_{B_2,i} - \bar f_{B_2})^2 + \bar f^{(2)}_{B_2} - (\bar f_{B_2})^2\}\big],$$
where $f_{B_2} = (f_{B_2,i})$ and $\bar f_{B_2}$, $\bar f^{(2)}_{B_2}$ are the averages of the $f_{B_2,i}$ and of their squares, respectively.

2.2 Three data sets and two mismatches: A ratio-type estimator

When there are three data sets and two mismatches in the data linkage processes, there are two possible scenarios.

Case 1: $Y$ is the benchmark data set, and the linkages between $Y$ and $X_1$ and between $Y$ and $X_2$ are done with some errors.

Case 2: Either $X_1$ or $X_2$ is the benchmark data set (in this paper, we assume that $X_1$ is the benchmark), and the linkage between the benchmark data set and the other $X$ data set and the linkage between the benchmark data set and $Y$ are done with some errors.

Let us consider Case 1 first. We assume that the data set for the $y_i$ is correctly recorded, but that there are mismatches between $y_i$ and $x_{1i}$ as well as between $y_i$ and $x_{2i}$. Also, we assume that the mismatches between $y_i$ and $x_{1i}$ are independent of the mismatches between $y_i$ and $x_{2i}$. In this case, our regression model is of the form
$$Y = \beta_0 + X_1^*\beta_1 + X_2^*\beta_2 + \epsilon = X^*\beta + \epsilon,$$

where $X^* = (1, X_1^*, X_2^*)$, $X_1^* = B_1X_1$ and $X_2^* = B_2X_2$, and $B_1$ and $B_2$ are permutation matrices. If $B_1$ and $B_2$ are known, one has
$$X = (1, X_1, X_2) = (1, B_1^TX_1^*, B_2^TX_2^*).$$
Since $B_1$ and $B_2$ are unknown in general, we apply the non-informative linkage assumption, so that
$$X^{E2} = E_X(X^*) = (1, E_{B_1}X_1, E_{B_2}X_2), \quad (4)$$
where
$$E_{B_i} = (\lambda_{B_i} - \gamma_{B_i})I + \gamma_{B_i}\mathbf{1}\mathbf{1}^T$$
with
$$\lambda_{B_i} = \mathrm{pr}(\text{correct linkage between } Y \text{ and } X_i)$$
and
$$\gamma_{B_i} = \mathrm{pr}(\text{incorrect linkage between } Y \text{ and } X_i).$$
Then, by OLS,
$$\hat\beta^* = \{(X^*)^TX^*\}^{-1}(X^*)^TY,$$
where
$$E_X(\hat\beta^*) = \{(X^*)^TX^*\}^{-1}(X^*)^TX^{E2}\beta = D_2\beta.$$
Hence, if the matrices $E_{B_1}$ and $E_{B_2}$ are known and the inverse of $D_2$ exists, a ratio form of an unbiased estimator of $\beta$ is
$$\hat\beta_{R2} = D_2^{-1}\hat\beta^*.$$
Let $f^{E2} = X^{E2}\beta$.

Proposition 2. An asymptotic variance estimator of $\hat\beta_{R2}$ can be defined by
$$\hat V_2(\hat\beta_{R2}) = \{(X^*)^TX^{E2}\}^{-1}(X^*)^T\hat V_2(Y)X^*[\{(X^*)^TX^{E2}\}^{-1}]^T,$$
where
$$\hat V_2(Y) = \hat\sigma^2I + \hat V_{B_1} + \hat V_{B_2}.$$
Here, $\hat V_2(Y)$ can be estimated using
$$\hat\sigma^2 = N^{-1}(Y - f^{E2})^T(Y - f^{E2})$$

and, given $f_{B_j} := X_j\beta_j$ for $j = 1, 2$,
$$V_{B_j} = \mathrm{diag}\big[(1 - \lambda_{B_j})\{\lambda_{B_j}(f_{B_j,i} - \bar f_{B_j})^2 + \bar f^{(2)}_{B_j} - (\bar f_{B_j})^2\}\big],$$
where $f_{B_j} = (f_{B_j,i})$ and $\bar f_{B_j}$, $\bar f^{(2)}_{B_j}$ are the averages of the $f_{B_j,i}$ and of their squares, respectively.

Now we consider Case 2. When some of the $x_{1i}$ are incorrectly linked with the corresponding $y_i$ or with the $x_{2i}$, our regression model is of the form
$$Y^* = \beta_0 + X_1\beta_1 + X_2^*\beta_2 + \epsilon^* = X^*\beta + \epsilon^*,$$
where $Y^* = AY$, $X_2^* = B_2X_2$, and $A$ and $B_2$ are permutation matrices. By the non-informative linkage assumption on $A$ (here we assume the randomness of the linkage error between $Y$ and $X$; see Chambers (2008) for a more detailed discussion), one has
$$E_X(Y^*) = E_X(AY) = E_X(A)E_X(Y) = E_AE_X(Y) = E_AX^E\beta, \quad (5)$$
where $E_A = (\lambda_A - \gamma_A)I + \gamma_A\mathbf{1}\mathbf{1}^T$ with
$$\lambda_A = \mathrm{pr}(\text{correct linkage between } X_1 \text{ and } Y)$$
and
$$\gamma_A = \mathrm{pr}(\text{incorrect linkage between } X_1 \text{ and } Y).$$
Further, we assume that the mismatch between $x_{1i}$ and $y_i$ is uncorrelated with the mismatch between $x_{1i}$ and $x_{2i}$ (we will try to relax this assumption in future work). With these assumptions, by OLS, one has
$$\hat\beta^* = \{(X^*)^TX^*\}^{-1}(X^*)^TY^* = \{(X^*)^TX^*\}^{-1}(X^*)^TAY$$
and
$$E_X(\hat\beta^*) = \{(X^*)^TX^*\}^{-1}(X^*)^TE_AX^E\beta = D_3\beta.$$

Thus, if the matrices $E_X(B_2) = E_{B_2}$ and $E_X(A) = E_A$ are known and the inverse of $D_3$ exists, a ratio form of an unbiased estimator of $\beta$ for this case is
$$\hat\beta_{R3} = D_3^{-1}\hat\beta^*.$$

Proposition 3. An asymptotic variance estimator of $\hat\beta_{R3}$ can be defined by
$$\hat V_3(\hat\beta_{R3}) = \{(X^*)^TE_AX^E\}^{-1}(X^*)^T\hat V_3(Y^*)X^*[\{(X^*)^TE_AX^E\}^{-1}]^T,$$
where
$$\hat V_3(Y^*) = \hat\sigma^2I + \hat V_{C_2} + \hat V_A.$$
Here, $\hat V_3(Y^*)$ can be estimated using
$$\hat\sigma^2 = N^{-1}\big\{(Y^* - f^E)^T(Y^* - f^E) - 2(f^E)^T(I - E_A)f^E\big\}$$
and, given $f^E = X^E\beta$,
$$V_A = \mathrm{diag}\big[(1 - \lambda_A)\{\lambda_A(f^E_i - \bar f^E)^2 + \bar f^{E(2)} - (\bar f^E)^2\}\big],$$
where $f^E = (f^E_i)$ and $\bar f^E$, $\bar f^{E(2)}$ are the averages of the $f^E_i$ and of their squares, respectively. Further, given $f_{B_2} := X_2\beta_2$, one has
$$V_{C_2} = A\,\mathrm{Var}_X\{E_X(Y \mid B_2)\}A^T,$$
and it can be estimated by
$$V_{C_2} = \mathrm{diag}\big[(1 - \lambda_{C_2})\{\lambda_{C_2}(f_{B_2,i} - \bar f_{B_2})^2 + \bar f^{(2)}_{B_2} - (\bar f_{B_2})^2\}\big],$$
where $f_{B_2} = (f_{B_2,i})$ and $\bar f_{B_2}$, $\bar f^{(2)}_{B_2}$ are the averages of the $f_{B_2,i}$ and of their squares, respectively. Moreover, $C_{B_2} = AB_2^T$ and $\lambda_{C_2}$ is the probability of correct linkage in $C_{B_2}$.

2.3 The estimating function approach

We now modify the estimating functions used in Chambers (2008) to accommodate the longitudinal linkage case. Suppose that one has $E(Y \mid X) = g(X; \theta)$, where $\theta$ can be estimated by solving $H(\theta) = 0$,

and $H(\theta)$ is a function that satisfies $E_X\{H(\theta_0)\} = 0$ when $\theta_0$ is the true value of $\theta$. Let $\nabla_\theta$ be the partial differentiation operator with respect to $\theta$, and suppose that $\hat\theta$ satisfies $H(\hat\theta) = 0$. Then, under some regularity conditions on smoothness, a Taylor expansion gives
$$0 = H(\hat\theta) \approx H(\theta_0) + (\hat\theta - \theta_0)\nabla_\theta H(\theta_0).$$
If $H(\theta)$ is an unbiased estimating function and $\nabla_\theta H(\theta_0)$ is non-singular, one has
$$E_X(\hat\theta) - \theta_0 \approx -\{\nabla_\theta H(\theta_0)\}^{-1}E_X\{H(\theta_0)\} = 0.$$
Then, the variance function for $\hat\theta$ can be derived from
$$\mathrm{Var}_X(\hat\theta) \approx \{\nabla_\theta H(\theta_0)\}^{-1}\mathrm{Var}_X\{H(\theta_0)\}[\{\nabla_\theta H(\theta_0)\}^{-1}]^T.$$
In Chambers (2008), the estimating function is of the form
$$H(\theta) = G(\theta)\{Y - f\},$$
where $f = E_X(Y)$ and $G(\theta)$ is a function of $\theta$ and $X$ but not of $Y$ (some examples of $G$ for different estimators are given in the simulation section).

In the longitudinal case with three data sets, we have three different cases to consider. Firstly, consider Case 0, where $Y$ and $X_1$ are correctly linked, but $X_1$ and $X_2$ are not. Here we can observe the true $Y$, but we cannot observe the true $X$. Instead, we observe $X^* = (1, X_1, X_2^*)$, where $X_2^* = B_2X_2$ and $B_2$ is a permutation matrix that is not observable in general. A naive estimating function is then of the form
$$H^*(\theta) = G^*(\theta)\{Y - f^*(\theta)\},$$
where $f^*(\theta) = X^*\beta$. It is easy to see that the estimator from the naive estimating function is biased, because $E_X(Y) = f^E(\theta) \neq f^*(\theta)$. Thus, an unbiased estimating function is of the form
$$H_1(\theta) = G^*(\theta)\{Y - f^E(\theta)\}, \quad (6)$$
where $f^E(\theta) = X^E\beta = (1, X_1, E_{B_2}X_2)\beta$.
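For the linear model, the adjusted estimating equations of this section solve in closed form. The following sketch (a hypothetical helper, assuming $G$ does not depend on $\theta$) makes the point:

```python
import numpy as np

def solve_linear_ee(G, X_E, y):
    """Solve H(beta) = G (y - X^E beta) = 0 in closed form:
    beta_hat = (G X^E)^{-1} G y, provided G X^E is invertible."""
    return np.linalg.solve(G @ X_E, G @ y)
```

With the naive choice $G = (X^*)^T$ this reproduces the ratio-type estimator of Section 2.1, since $\{(X^*)^TX^E\}^{-1}(X^*)^TY = D_1^{-1}\hat\beta^*$.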

Let us consider Case 1, where $Y$ is the benchmark data set and the linkages between $Y$ and $X_1$ and between $Y$ and $X_2$ are done with some errors. In this case, we have the similar estimating function
$$H_2(\theta) = G^*(\theta)\{Y - f^{E2}(\theta)\}, \quad (7)$$
where, by (4), $f^{E2}(\theta) = X^{E2}\beta = (1, E_{B_1}X_1, E_{B_2}X_2)\beta$.

Now consider Case 2, where $X_1$ is the benchmark data set and the linkage between $X_1$ and $X_2$ and the linkage between $X_1$ and $Y$ are done with some errors. In this case, $Y^*$ is observed instead of $Y$, and the true $X$ is also not observable. Instead, we observe $X^* = (1, X_1, X_2^*)$, where $X_2^* = B_2X_2$ and $B_2$ is a permutation matrix that is not observable in general. Hence, a naive estimating function is of the form
$$H^*(\theta) = G^*(\theta)\{Y^* - f^*(\theta)\},$$
where $f^*(\theta) = X^*\beta$. Then, as before, it is easy to see that the estimator from the naive estimating function is biased, because $E_X(Y^*) = E_Af^E(\theta) \neq f^*(\theta)$. Hence, by (2), (5) and (28), an unbiased estimating function is of the form
$$H_3(\theta) = G^*(\theta)\{Y^* - E_Af^E(\theta)\}, \quad (8)$$
and the estimator $\hat\theta_3$ is defined as the solution of $H_3(\hat\theta_3) = 0$.

Theorem 4. Let $\hat\theta_1$ be the solution of (6). Then, an asymptotic variance estimator is of the form
$$V_{1X}(\hat\theta_1) = \{G^*\nabla_\theta f^E(\hat\theta_1)\}^{-1}G^*\hat\Sigma_1G^{*T}[\{G^*\nabla_\theta f^E(\hat\theta_1)\}^{-1}]^T,$$

where $\hat\Sigma_1 = \hat\sigma^2I + \hat V_{B_2}$. Also, let $\hat\theta_2$ be the solution of (7). Then, an asymptotic variance estimator is of the form
$$V_{2X}(\hat\theta_2) = \{G^*\nabla_\theta f^{E2}(\hat\theta_2)\}^{-1}G^*\hat\Sigma_2G^{*T}[\{G^*\nabla_\theta f^{E2}(\hat\theta_2)\}^{-1}]^T,$$
where, by (26), $\hat\Sigma_2 = \hat\sigma^2I + \hat V_{B_1} + \hat V_{B_2}$. Finally, the asymptotic variance estimator for the solution of (8) is of the form
$$V_{3X}(\hat\theta_3) = \{G^*E_A\nabla_\theta f^E(\hat\theta_3)\}^{-1}G^*\hat\Sigma_3G^{*T}[\{G^*E_A\nabla_\theta f^E(\hat\theta_3)\}^{-1}]^T,$$
where $\hat\Sigma_3 = \hat\sigma^2I + \hat V_{C_2} + \hat V_A$.

2.4 Variance estimation when linkage probabilities are estimated

So far, we have assumed that we know the correct linkage probabilities, which is a very strong assumption. In this subsection, we consider the case where the correct linkage probabilities are estimated by checking a random audit sample of linked records in each m-block. More details of these audit estimates when there are two data sets can be found in Chambers (2008); here we modify that idea to accommodate the case of more than two data sets.

Let us consider Case 2, where $x_{1i}$ is neither correctly linked with the corresponding $y_i$, nor with $x_{2i}$. In this case, we need to consider two different linkage probabilities,
$$\lambda_A = \mathrm{pr}(\text{correct linkage between } X_1 \text{ and } Y),$$
$$\lambda_{B_2} = \mathrm{pr}(\text{correct linkage between } X_1 \text{ and } X_2),$$
with no correlation between them. Thus, the estimating function (8) can be replaced by
$$H_3(\theta, \lambda_A, \lambda_{B_2}) = G^*\{Y^* - E_A(\lambda_A)f^E(\theta, \lambda_{B_2})\} = U(\theta, \lambda_A, \lambda_{B_2}),$$

and a first order Taylor series approximation is of the form
$$0 = H_3(\hat\theta_3, \hat\lambda_A, \hat\lambda_{B_2}) \approx H_3(\theta_0, \lambda^0_A, \lambda^0_{B_2}) + \nabla_\theta H_3(\theta_0, \lambda^0_A, \lambda^0_{B_2})(\hat\theta - \theta_0) + \nabla_{\lambda_A}H_3(\theta_0, \lambda^0_A, \lambda^0_{B_2})(\hat\lambda_A - \lambda^0_A) + \nabla_{\lambda_{B_2}}H_3(\theta_0, \lambda^0_A, \lambda^0_{B_2})(\hat\lambda_{B_2} - \lambda^0_{B_2}),$$
where $\theta_0$, $\lambda^0_A$ and $\lambda^0_{B_2}$ denote the true values of $\theta$, $\lambda_A$ and $\lambda_{B_2}$, respectively. Denote $H_0 = H_3(\theta_0, \lambda^0_A, \lambda^0_{B_2})$, $\nabla_{\lambda_1} = \nabla_{\lambda_A}$ and $\nabla_{\lambda_2} = \nabla_{\lambda_{B_2}}$. Then one has
$$\hat\theta_3 \approx \theta_0 - \{\nabla_\theta H_0\}^{-1}\big\{H_0 + \nabla_{\lambda_1}H_0(\hat\lambda_A - \lambda^0_A) + \nabla_{\lambda_2}H_0(\hat\lambda_{B_2} - \lambda^0_{B_2})\big\}.$$
If the estimates of the linkage probabilities are obtained from a random audit sample of linked records (of size $m_A$ for $\lambda_A$ and $m_B$ for $\lambda_{B_2}$), one has
$$\mathrm{Var}_X(\hat\lambda_A) = (m_A)^{-1}\lambda_A(1 - \lambda_A)$$
and
$$\mathrm{Var}_X(\hat\lambda_{B_2}) = (m_B)^{-1}\lambda_{B_2}(1 - \lambda_{B_2}).$$

Theorem 5. An asymptotic variance estimator of $\hat\theta_3$ is of the form
$$\hat V^\lambda_{3X}(\hat\theta_3) = \{\nabla_\theta\hat H_0\}^{-1}\big\{\hat V_{3X}(\hat\theta_3) + (\nabla_{\lambda_1}\hat H_0)\mathrm{Var}_X(\hat\lambda_A)(\nabla_{\lambda_1}\hat H_0)^T + (\nabla_{\lambda_2}\hat H_0)\mathrm{Var}_X(\hat\lambda_{B_2})(\nabla_{\lambda_2}\hat H_0)^T\big\}[\{\nabla_\theta\hat H_0\}^{-1}]^T,$$
where
$$\nabla_{\lambda_1}\hat H_0 = -G^*(M-1)^{-1}(MI - \mathbf{1}\mathbf{1}^T)\hat f^E(\hat\theta, \hat\lambda_{B_2}),$$
$$\nabla_{\lambda_2}\hat H_0 = -G^*E_A(M-1)^{-1}(MI - \mathbf{1}\mathbf{1}^T)X_2\hat\beta_2,$$
and $\hat V_{3X}$ is the asymptotic variance estimator for $\hat\theta_3$ in Theorem 4. Similarly, an asymptotic variance estimator for $\hat\theta_2$, the unbiased estimator for Case 1, when the linkage probabilities are unknown, can be represented by
$$\hat V^\lambda_{2X}(\hat\theta_2) = \{\nabla_\theta\hat H_0\}^{-1}\big\{\hat V_{2X}(\hat\theta_2) + (\nabla_{\lambda_{B_1}}\hat H_0)\mathrm{Var}_X(\hat\lambda_{B_1})(\nabla_{\lambda_{B_1}}\hat H_0)^T + (\nabla_{\lambda_{B_2}}\hat H_0)\mathrm{Var}_X(\hat\lambda_{B_2})(\nabla_{\lambda_{B_2}}\hat H_0)^T\big\}[\{\nabla_\theta\hat H_0\}^{-1}]^T,$$

where $\lambda_{B_1} = \mathrm{pr}(\text{correct linkage between } Y \text{ and } X_1)$ and
$$\nabla_{\lambda_{B_1}}\hat H_0 = -G^*(M-1)^{-1}(MI - \mathbf{1}\mathbf{1}^T)X_1\hat\beta_1.$$
Finally, an asymptotic variance estimator for $\hat\theta_1$, the unbiased estimator for Case 0, when the linkage probabilities are unknown, can be represented by
$$\hat V^\lambda_{1X}(\hat\theta_1) = \{\nabla_\theta\hat H_0\}^{-1}\big\{\hat V_{1X}(\hat\theta_1) + (\nabla_{\lambda_{B_2}}\hat H_0)\mathrm{Var}_X(\hat\lambda_{B_2})(\nabla_{\lambda_{B_2}}\hat H_0)^T\big\}[\{\nabla_\theta\hat H_0\}^{-1}]^T.$$
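The assembly of these corrected variance estimators is mechanical, and a small sketch may help (hypothetical names; a sketch of the sandwich structure above, not the authors' code). Each audit-estimated probability contributes one outer-product term $(\nabla_{\lambda_k}\hat H_0)\,\hat\lambda_k(1-\hat\lambda_k)/m_k\,(\nabla_{\lambda_k}\hat H_0)^T$:

```python
import numpy as np

def var_with_estimated_lams(H_grad, V_base, lam_terms):
    """Sandwich variance when linkage probabilities come from audit
    samples: inv(H_grad) (V_base + sum_k g_k v_k g_k^T) inv(H_grad)^T,
    where each (g_k, v_k) pair holds the gradient of H with respect to
    the k-th linkage probability and v_k = lam_k (1 - lam_k) / m_k."""
    extra = sum(np.outer(g, g) * v for g, v in lam_terms)
    H_inv = np.linalg.inv(H_grad)
    return H_inv @ (V_base + extra) @ H_inv.T
```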

2.5 Simulation

We use simulation to compare the performance of the different estimators considered in this study. The linear model used in this simulation is of the form
$$Y = 1 + 5X_1 + 8X_2 + \epsilon,$$
where the $X_1$ were drawn from the standard normal distribution and the $X_2$ were drawn from the normal distribution with mean 2 and variance 4. The errors $\epsilon$ were drawn from the standard normal distribution as well. In this simulation, we consider all three cases studied above:

Case 0: $X_1$ is the benchmark data set and the mismatches come only from the linkage between $X_1$ and $X_2$.

Case 1: $Y$ is the benchmark data set and the linkages between $Y$ and $X_1$ and between $Y$ and $X_2$ are done with some errors.

Case 2: $X_1$ is the benchmark data set and the linkage between $X_1$ and $X_2$ and the linkage between $X_1$ and $Y$ are done with some errors.

Here we only explain how we generate the data sets for Case 2; generating the data sets for the other cases is quite straightforward. There are three m-blocks, and in each m-block the pairs $(x_{1i}, x^*_{2i})$ were generated according to an independent exchangeable linkage error model. Further, given $X^*_i = (1, x_{1i}, x^*_{2i})$, the pairs $(y^*_i, X^*_i)$ were generated according to another independent exchangeable linkage error model. In this simulation, we use three m-blocks of size 500 each. We also assume that the probability of correct linkage between $Y$ and $X_1$ and the probability of correct linkage between $X_1$ and $X_2$ are known.

The estimators compared in the simulations are

1. the naive OLS estimator (ST),
2. the ratio-type estimator (R),
3. the Lahiri-Larsen estimator (A), and
4. the empirical Best Linear Unbiased Estimator, EBLUE, (C).

Note that different estimating functions have different forms of $G$. In our case,

1. the naive estimator: $G = (X^*)^T$,
2. the Lahiri-Larsen estimator: $G = (\hat E_AX^E)^T$, and
3. the EBLUE: $G = (\hat E_AX^E)^T(\hat\sigma^2I + \hat V_{C_2} + \hat V_A)^{-1}$.

The assumed probabilities of correct linkage in each m-block are:

the probability of correct linkage between $X_1$ and $Y$: $\lambda_{A1} = 1$, $\lambda_{A2} = 0.95$ and $\lambda_{A3} = 0.75$;

the probability of correct linkage between $Y$ and $X_1$ (Case 1): $\lambda_{B_11} = 1$, $\lambda_{B_12} = 0.95$ and $\lambda_{B_13} = 0.75$; and

the probability of correct linkage between $X_1$ and $X_2$: $\lambda_{B_21} = 1$, $\lambda_{B_22} = 0.85$ and $\lambda_{B_23} = 0.8$.

Under the above scenario, the estimators were independently simulated 1000 times, and the regression parameters were estimated using each of the four estimators.
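For the bias-corrected fits, the estimating equations again solve in closed form. The sketch below (hypothetical names; the `Sigma_inv` argument is whatever estimate of $\mathrm{Var}_X(Y^*)^{-1}$ is used) covers both the Lahiri-Larsen-type choice $G = (\hat E_AX^E)^T$ and the EBLUE choice $G = (\hat E_AX^E)^T\hat\Sigma^{-1}$ for Case 2, where $E_X(Y^*) = E_AX^E\beta$:

```python
import numpy as np

def fit_case2(y_star, X_E, E_A, Sigma_inv=None):
    """Solve G {y* - E_A X^E beta} = 0 for the linear model.
    Sigma_inv=None gives the Lahiri-Larsen-type estimator (A);
    passing an estimate of Var(Y*)^{-1} gives the EBLUE (C)."""
    Z = E_A @ X_E                       # bias-adjusted design matrix
    G = Z.T if Sigma_inv is None else Z.T @ Sigma_inv
    return np.linalg.solve(G @ Z, G @ y_star)
```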

Box plots of the results show the overall performance of the estimators. Clearly, the ratio-type estimator, the Lahiri-Larsen estimator and the EBLUE correct the bias due to incorrect linkage, and the EBLUE outperforms the other estimators, as was also noted in Chambers (2008) for the case of two merged registers. These observations are consistent across all three cases. It is worth noting that the EBLUE (C) outperforms all other estimators in general; Figures 1-3 clearly show that the EBLUE is the best. However, our simulation shows that the relative biases of the EBLUE, when the $\lambda$s are unknown, are larger than those of the Lahiri-Larsen estimator and the ratio-type estimator, although its overall relative RMSE is smaller than that of the other estimators.

Table 1 here. Table 2 here. Table 3 here.

3 Sample-register case

In this section, we consider the case where we only observe a sample $s$ of records from the benchmark data set. Suppose that $X_1$ is the benchmark data set. When all the records in the $X_1$-register are linked to the records in the $X_2$-register and the $Y$-register, all of the sample records $s$ from the $X_1$-register are perfectly linked with some records in the $X_2$-register and the $Y$-register. However, in reality, some records in the sample $s$ cannot be linked to a record in the $X_2$-register or the $Y$-register. We consider these two cases separately.

3.1 Sample-register case: When sample records are perfectly linked

As before, we consider the three different cases, Case 0, Case 1 and Case 2. Let us start with Case 2. If all records in the sample $s$ are linked to the records in the $X_2$-register and the $Y$-register, we can regard the sample $s$ as part of an $X_1$-register with complete register-register linkage. Hence we can use a weighted estimating function, and in this subsection we modify the estimating function approach to accommodate this sample-register linkage.

When the sample records $s$ from the $X_1$-register are linked to the $X_2$-register and the $Y$-register, we observe a subset $s$ of the $M$ records in $Y^*$, which we denote by $Y^*_s$. More precisely, let $M$ be the population count in the $q$th m-block, and let $m_s$ be the sample size in the $q$th m-block. We use the subscript $s$ to denote quantities that depend on the sample records in the $q$th m-block. Similarly, the subscript $r$ is used to indicate quantities that depend on the non-sample records in the $q$th m-block.

Under perfect linkage of the sample data, when there is no linkage error, the true parameter $\theta_0$ can be estimated by solving the estimating equation
$$H_s(\theta) = G_s\{Y_s - f_s(\theta)\},$$
where $G_s$ is modified by the sample weights $w_s$, which depend on the ratio of the sample size to the population size. When there exist linkage errors and we ignore them, the estimating equation is of the form
$$H^*_s(\theta) = G^*_s\{Y^*_s - f^*_s(\theta)\},$$
where $Y^*_s = A_sY$ and
$$A = \begin{pmatrix} A_s \\ A_r \end{pmatrix} = \begin{pmatrix} A_{ss} & A_{sr} \\ A_{rs} & A_{rr} \end{pmatrix}$$
is the sample/non-sample decomposition of the complete linkage process in the $q$th m-block. This estimating equation leads to a bias because $E_X(Y^*_s) \neq f^*_s$. To correct the bias, using the fact that $E_X(Y^*_s) = E_X(A_sY) = E_{A_s}f^E(\theta)$, we modify this estimating equation to
$$H^{adj}_s(\theta) = G^*_s\{Y^*_s - E_{A_s}f^E(\theta)\} = G^*_s\{Y^*_s - E_{A_{ss}}f^E_s(\theta) - E_{A_{sr}}f^E_r(\theta)\}, \quad (9)$$
where
$$E_A = \begin{pmatrix} E_{A_s} \\ E_{A_r} \end{pmatrix} = \begin{pmatrix} E_{A_{ss}} & E_{A_{sr}} \\ E_{A_{rs}} & E_{A_{rr}} \end{pmatrix}$$
is the corresponding sample/non-sample decomposition of the expected value $E_A$ of $A$. Now, by the definition of $E_A$, one has
$$E_{A_{ss}} = \frac{\lambda_AM - 1}{M - 1}I_s + \frac{1 - \lambda_A}{M - 1}\mathbf{1}_s\mathbf{1}_s^T \quad (10)$$
and
$$E_{A_{sr}} = \frac{1 - \lambda_A}{M - 1}\mathbf{1}_s\mathbf{1}_r^T,$$

so that (9) becomes
$$H^{adj}_s(\theta) = G^*_s\Big\{Y^*_s - \frac{\lambda_AM - 1}{M - 1}I_sf^E_s(\theta) - \frac{1 - \lambda_A}{M - 1}\mathbf{1}_s\mathbf{1}^Tf^E(\theta)\Big\}.$$
Using a weighting approach (in this article, we simply use the weight $w_s = (M/m_s)\mathbf{1}_s$), the unknown value $\mathbf{1}^Tf^E(\theta)$ can be replaced by $w_s^Tf^E_s(\theta)$ under the assumption that the sample is chosen randomly from the population. This leads us to the estimating function
$$H^{adj}_{ws3}(\theta) = G^*_s\{Y^*_s - \tilde E_{A_s}f^E_s(\theta)\}, \quad (11)$$
where
$$\tilde E_{A_s} = \frac{\lambda_AM - 1}{M - 1}I_s + \frac{1 - \lambda_A}{M - 1}\mathbf{1}_sw_s^T.$$
For Case 1, the formulae are similar but simpler than those of Case 2. Note that, in this case, we observe the true $Y_s$. Hence, the estimating function is of the form
$$H^{adj}_{ws2}(\theta) = G_s\{Y_s - f^{E2}_s(\theta)\},$$
where
$$f^{E2}_s = X^{E2}_s\beta = (\mathbf{1}_s, \tilde E_{B_1s}X_{1s}, \tilde E_{B_2s}X_{2s})\beta,$$
$$\tilde E_{B_1s} = \frac{\lambda_{B_1}M - 1}{M - 1}I_s + \frac{1 - \lambda_{B_1}}{M - 1}\mathbf{1}_sw_s^T \quad\text{and}\quad \tilde E_{B_2s} = \frac{\lambda_{B_2}M - 1}{M - 1}I_s + \frac{1 - \lambda_{B_2}}{M - 1}\mathbf{1}_sw_s^T.$$
Finally, Case 0 has the simplest formulae, since there is only one mismatch. The estimating function is of the form
$$H^{adj}_{ws1}(\theta) = G_s\{Y_s - f^E_s(\theta)\},$$
where
$$f^E_s = X^E_s\beta = (\mathbf{1}_s, X_{1s}, \tilde E_{B_2s}X_{2s})\beta$$
and $\tilde E_{B_2s}$ is as above.
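The weighted expectation matrices above all share one template, so a single helper suffices. A small sketch (hypothetical name) matching the displayed formulae, with $w_s = (M/m_s)\mathbf{1}_s$:

```python
import numpy as np

def E_tilde(M, m_s, lam):
    """Weighted sample analogue of an exchangeable-model expectation:
    ((lam*M - 1)/(M - 1)) I_s + ((1 - lam)/(M - 1)) 1_s w_s^T,
    with w_s = (M/m_s) 1_s.  Serves as E~_As (lam = lam_A) and as
    E~_B1s, E~_B2s (lam = lam_B1, lam_B2)."""
    w_s = np.full(m_s, M / m_s)
    return ((lam * M - 1) / (M - 1)) * np.eye(m_s) \
        + ((1 - lam) / (M - 1)) * np.outer(np.ones(m_s), w_s)
```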

21 where ( (λa M 1)d i + M (1 λ A ) ˆΣ s diag d s +(1 λ A ) λ A (fi E M 1 f s E )2 E(2) + f s ( f ) s E )2 ; i s with D s = diag{d i ; i s } Var X (Y s ) and d s is the mean of {d i ; i s }. Let ˆθ s 2 be the solution of the estimating euation for the Case 1. Then, under the assumption that we know true λ B1 and λ B2, an asymptotic variance estimator is of the form V ws s 2 X (ˆθ 2 ) = G s θ f E2 s 1 s (ˆθ 2 ) ( G ˆΣ(2) s s GT s G s θ f E2 s 1 ) T, s (ˆθ 2 ) where Finally, let ˆΣ (2) s = ˆσ 2 si s + ˆV B1s + ˆV B2s. ˆθ s 1 be the solution of the estimating euation for the Case 0. Then, an asymptotic variance estimator is of the form V ws s 1 X (ˆθ 1 ) = G s θ f E s 1 s (ˆθ 1 ) where, G s ˆΣ(1) s GT s ˆΣ (1) s = ˆσ 2 si s + ˆV B2s. ( G s θ f E s 1 ) T, s (ˆθ 1 ) Note that the above asymptotic variance estimator assumes that the λ A, λ B1 and λ B2 are known. If we need to estimate these probabilities, the asymptotic variance estimator should have more terms that count the estimations of λ A, λ B1 and λ B2. To see this, note that, when λ A and λ B2 are estimated, (11) becomes H adj ws3,λ (θ, λ A, λ B2 ) = G s { Y s ẼA s (λ A )f E s(θ, λ B2 ) }. (12) In this case, the asymptotic variance estimator is of the form Var X (ˆθ) θ H adj 1 ( ws3,0 Var X H adj ) ( ws3,0 + λa H adj ) ws3,0 VarX (λ A ) ( λa H adj ) T ws3,0 + ( λb2 H adj ) ws3,0 VarX (λ B2 ) ( ) { T θ λb2 Hadj ws3,0 H adj } 1 T, (13) ws3,0 where H adj ws3,0 = H adj ws3,λ (θ 0, λ 0 A, λ 0 B 2 ), λa H adj ws3,0 = λb2 H adj ws3,0 = G s (M 1) 1 (M I s 1 s w T s) f E s(θ 0, λ 0 B 2 ), G s Ẽ As (M 1) 1 (M I s 1 s w T s) X 2β 2. (14) 19

Corollary 7. Let $\hat\theta^s_3$ be the solution of the estimating equation (12). Then, an asymptotic variance estimator is of the form
$$V^{ws,\lambda}_{3X}(\hat\theta^s_3) = \{\nabla_\theta\hat H^{adj}_{ws3,0}\}^{-1}\big\{V^{ws}_{3X}(\hat\theta^s_3) + (\nabla_{\lambda_1}\hat H^{adj}_{ws3,0})\mathrm{Var}_X(\hat\lambda_A)(\nabla_{\lambda_1}\hat H^{adj}_{ws3,0})^T + (\nabla_{\lambda_2}\hat H^{adj}_{ws3,0})\mathrm{Var}_X(\hat\lambda_{B_2})(\nabla_{\lambda_2}\hat H^{adj}_{ws3,0})^T\big\}[\{\nabla_\theta\hat H^{adj}_{ws3,0}\}^{-1}]^T.$$
Also, let $\hat\theta^s_2$ be the solution of the estimating equation for Case 1. Then, an asymptotic variance estimator is of the form
$$V^{ws,\lambda}_{2X}(\hat\theta^s_2) = \{\nabla_\theta\hat H^{adj}_{ws2,0}\}^{-1}\big\{V^{ws}_{2X}(\hat\theta^s_2) + (\nabla_{\lambda_{B_1}}\hat H^{adj}_{ws2,0})\mathrm{Var}_X(\hat\lambda_{B_1})(\nabla_{\lambda_{B_1}}\hat H^{adj}_{ws2,0})^T + (\nabla_{\lambda_{B_2}}\hat H^{adj}_{ws2,0})\mathrm{Var}_X(\hat\lambda_{B_2})(\nabla_{\lambda_{B_2}}\hat H^{adj}_{ws2,0})^T\big\}[\{\nabla_\theta\hat H^{adj}_{ws2,0}\}^{-1}]^T,$$
where $H^{adj}_{ws2,0} = H^{adj}_{ws2}(\theta_0, \lambda_{B_1}, \lambda_{B_2})$, $\mathrm{Var}_X(\hat\lambda_{B_1}) = (m_{B_1})^{-1}\lambda_{B_1}(1 - \lambda_{B_1})$ and
$$\nabla_{\lambda_{B_1}}\hat H^{adj}_{ws2,0} = -G_s(M-1)^{-1}(MI_s - \mathbf{1}_sw_s^T)X_{1s}\beta_1.$$
Further, let $\hat\theta^s_1$ be the solution of the estimating equation for Case 0. Then, an asymptotic variance estimator is of the form
$$V^{ws,\lambda}_{1X}(\hat\theta^s_1) = \{\nabla_\theta\hat H^{adj}_{ws1,0}\}^{-1}\big\{V^{ws}_{1X}(\hat\theta^s_1) + (\nabla_{\lambda_{B_2}}\hat H^{adj}_{ws1,0})\mathrm{Var}_X(\hat\lambda_{B_2})(\nabla_{\lambda_{B_2}}\hat H^{adj}_{ws1,0})^T\big\}[\{\nabla_\theta\hat H^{adj}_{ws1,0}\}^{-1}]^T.$$

3.2 Sample-register case: When sample records are not perfectly linked

When some records are not linked, $A$ or $B_2$ cannot be a permutation matrix, because the entries of some rows are all zero due to non-linkage. However, we can still use ideas similar to those introduced in the previous subsection. Firstly, we consider Case 2. Let $X_{1s}$ be the set of sample records from $X_1$. Also, let $X_{1sl}$ be the set of sample records in $X_{1s}$ that are linked both to the $X_2$-register and to the $Y$-register, and let $X_{1su} := X_{1s} - X_{1sl}$; the latter represents the set of sampled records in $X_{1s}$ that cannot be linked either to the $X_2$-register or to the $Y$-register. Also, let $X_{1r} := X_1 - X_{1s}$, the set of non-sample records in $X_1$. We assume that, theoretically,

there exists $X_{1rl}$, representing the set of non-sample records that can be linked both to the $X_2$-register and the $Y$-register. Similarly, $X_{1ru} := X_{1r} - X_{1rl}$. In the same way, under the one-to-one linkage assumption, $Y$ can be partitioned into four groups, namely $Y_{sl}$, $Y_{su}$, $Y_{rl}$ and $Y_{ru}$. Thus, one has
$$Y^* = \begin{pmatrix} Y^*_{sl} \\ Y^*_{su} \\ Y^*_{rl} \\ Y^*_{ru} \end{pmatrix} = \begin{pmatrix} A_{slsl} & A_{slsu} & A_{slrl} & A_{slru} \\ A_{susl} & A_{susu} & A_{surl} & A_{suru} \\ A_{rlsl} & A_{rlsu} & A_{rlrl} & A_{rlru} \\ A_{rusl} & A_{rusu} & A_{rurl} & A_{ruru} \end{pmatrix}\begin{pmatrix} Y_{sl} \\ Y_{su} \\ Y_{rl} \\ Y_{ru} \end{pmatrix} = AY,$$
and
$$E(A \mid X) = E_A = \begin{pmatrix} E_{slsl,A} & E_{slsu,A} & E_{slrl,A} & E_{slru,A} \\ E_{susl,A} & E_{susu,A} & E_{surl,A} & E_{suru,A} \\ E_{rlsl,A} & E_{rlsu,A} & E_{rlrl,A} & E_{rlru,A} \\ E_{rusl,A} & E_{rusu,A} & E_{rurl,A} & E_{ruru,A} \end{pmatrix}.$$
Further, because $X_2$ can also be partitioned into $X_{2sl}$, $X_{2su}$, $X_{2rl}$ and $X_{2ru}$, one has
$$E(B_2 \mid X) = E_{B_2} = \begin{pmatrix} E_{slsl,B_2} & E_{slsu,B_2} & E_{slrl,B_2} & E_{slru,B_2} \\ E_{susl,B_2} & E_{susu,B_2} & E_{surl,B_2} & E_{suru,B_2} \\ E_{rlsl,B_2} & E_{rlsu,B_2} & E_{rlrl,B_2} & E_{rlru,B_2} \\ E_{rusl,B_2} & E_{rusu,B_2} & E_{rurl,B_2} & E_{ruru,B_2} \end{pmatrix}.$$
This leads to the estimating equation
$$H^{adj}_{sl}(\theta) = G^*_{sl}\{Y^*_{sl} - E_{A_{sl}}f^E(\theta)\} = G^*_{sl}\{Y^*_{sl} - E_{slsl,A}f^E_{sl}(\theta) - E_{slsu,A}f^E_{su}(\theta) - E_{slrl,A}f^E_{rl}(\theta) - E_{slru,A}f^E_{ru}(\theta)\}. \quad (15)$$
Under the exchangeable linkage error model, one has
$$E_{slsl,A} = \frac{\lambda_AM - 1}{M - 1}I_{sl} + \frac{1 - \lambda_A}{M - 1}\mathbf{1}_{sl}\mathbf{1}_{sl}^T, \qquad E_{slsu,A} = \frac{1 - \lambda_A}{M - 1}\mathbf{1}_{sl}\mathbf{1}_{su}^T,$$
$$E_{slrl,A} = \frac{1 - \lambda_A}{M - 1}\mathbf{1}_{sl}\mathbf{1}_{rl}^T, \qquad E_{slru,A} = \frac{1 - \lambda_A}{M - 1}\mathbf{1}_{sl}\mathbf{1}_{ru}^T.$$
This reduces (15) to
$$H^{adj}_{sl}(\theta) = G^*_{sl}\Big\{Y^*_{sl} - \frac{\lambda_AM - 1}{M - 1}I_{sl}f^E_{sl}(\theta) - \frac{1 - \lambda_A}{M - 1}\mathbf{1}_{sl}\mathbf{1}^Tf^E(\theta)\Big\}.$$

If we assume that the distribution of $Y_{sl}$ is the same as that of $Y$ in the population, the unobservable population value $\mathbf{1}^Tf^E(\theta)$ can be replaced by the weighted sample estimate $w_{sl}^Tf^E_{sl}(\theta)$, where we use $w_{sl} = (M/m_{sl})\mathbf{1}_{sl}$, $m_{sl}$ being the number of linked sample records and $M$ the total population count in the $q$th m-block. One then has
$$H^{adj}_{wsl}(\theta) = G^*_{sl}\{Y^*_{sl} - \tilde E_{A_{sl}}f^E_{sl}(\theta)\},$$
where
$$\tilde E_{A_{sl}} = \frac{\lambda_AM - 1}{M - 1}I_{sl} + \frac{1 - \lambda_A}{M - 1}\mathbf{1}_{sl}w_{sl}^T.$$
For $f^E_{sl}(\theta)$, note that by (2),
$$f^E_{sl}(\theta) = (\mathbf{1}_{sl}, X_{1sl}, E_{B_{sl,2}}X_2)(\beta_0, \beta_1, \beta_2)^T,$$
where
$$E_{B_{sl,2}}X_2 = E_{slsl,B_2}X_{2sl} + E_{slsu,B_2}X_{2su} + E_{slrl,B_2}X_{2rl} + E_{slru,B_2}X_{2ru}.$$
The exchangeable linkage error model provides
$$E_{slsl,B_2} = \frac{\lambda_{B_2}M - 1}{M - 1}I_{sl} + \frac{1 - \lambda_{B_2}}{M - 1}\mathbf{1}_{sl}\mathbf{1}_{sl}^T, \qquad E_{slsu,B_2} = \frac{1 - \lambda_{B_2}}{M - 1}\mathbf{1}_{sl}\mathbf{1}_{su}^T,$$
$$E_{slrl,B_2} = \frac{1 - \lambda_{B_2}}{M - 1}\mathbf{1}_{sl}\mathbf{1}_{rl}^T, \qquad E_{slru,B_2} = \frac{1 - \lambda_{B_2}}{M - 1}\mathbf{1}_{sl}\mathbf{1}_{ru}^T.$$
If we also assume that the distribution of $X_{2sl}$ is the same as that of $X_2$ in the population, then $E_{B_{sl,2}}X_2$ can be replaced by $\tilde E_{B_{sl,2}}X_{2sl}$, where
$$\tilde E_{B_{sl,2}} = \frac{\lambda_{B_2}M - 1}{M - 1}I_{sl} + \frac{1 - \lambda_{B_2}}{M - 1}\mathbf{1}_{sl}w_{sl}^T.$$
Then $f^E_{sl}(\theta)$ can be evaluated as
$$f^E_{sl}(\theta) = (\mathbf{1}_{sl}, X_{1sl}, \tilde E_{B_{sl,2}}X_{2sl})(\beta_0, \beta_1, \beta_2)^T.$$

Suppose that we know $\lambda_A$ and $\lambda_{B_2}$, and let $\hat\theta$ be the solution of the estimating equation. To derive the asymptotic variance estimator for $\hat\theta$, note that (47) now becomes
$$\mathrm{Var}_X(\hat\theta) \approx \{\nabla_\theta H^{adj}_{wsl}(\theta_0)\}^{-1}\mathrm{Var}_X\{H^{adj}_{wsl}(\theta_0)\}[\{\nabla_\theta H^{adj}_{wsl}(\theta_0)\}^{-1}]^T,$$
with corresponding estimator of the form
$$V_X(\hat\theta) = \{\nabla_\theta H^{adj}_{wsl}(\theta_0)\}^{-1}V_X\{H^{adj}_{wsl}(\theta_0)\}[\{\nabla_\theta H^{adj}_{wsl}(\theta_0)\}^{-1}]^T \approx \{G^*_{sl}\tilde E_{A_{sl}}\nabla_\theta f^E_{sl}(\hat\theta)\}^{-1}G^*_{sl}\hat\Sigma_{sl}G^{*T}_{sl}[\{G^*_{sl}\tilde E_{A_{sl}}\nabla_\theta f^E_{sl}(\hat\theta)\}^{-1}]^T,$$
under the assumption that $G^*_{sl}$ is independent of $\theta$. By arguments similar to (48)-(49),
$$\Sigma_{sl} = \mathrm{Var}_X(Y^*_{sl}) \approx \mathrm{Var}_X(A_{slsl}Y_{sl}) + \mathrm{Var}_X(A_{slsu}Y_{su}) + \mathrm{Var}_X(A_{slrl}Y_{rl}) + \mathrm{Var}_X(A_{slru}Y_{ru}),$$
which can be approximated by
$$\hat\Sigma_{sl} \approx \mathrm{diag}\Big[\frac{(\lambda_AM - 1)d_i + M(1 - \lambda_A)\bar d_{sl}}{M - 1} + (1 - \lambda_A)\big\{\lambda_A(f^E_i - \bar f^E_{sl})^2 + \bar f^{E(2)}_{sl} - (\bar f^E_{sl})^2\big\};\ i \in \{1, \ldots, m_{sl}\}\Big].$$
If we need to estimate $\lambda_A$ and $\lambda_{B_2}$, we can still use the asymptotic variance estimator defined by (13)-(14), except that the subscripts $s$ and $ws$ need to be replaced by $sl$ and $wsl$. That is, the asymptotic variance estimator is of the form
$$\mathrm{Var}_X(\hat\theta) \approx \{\nabla_\theta H^{adj}_{wsl,0}\}^{-1}\big\{\mathrm{Var}_X(H^{adj}_{wsl,0}) + (\nabla_{\lambda_1}H^{adj}_{wsl,0})\mathrm{Var}_X(\hat\lambda_A)(\nabla_{\lambda_1}H^{adj}_{wsl,0})^T + (\nabla_{\lambda_2}H^{adj}_{wsl,0})\mathrm{Var}_X(\hat\lambda_{B_2})(\nabla_{\lambda_2}H^{adj}_{wsl,0})^T\big\}[\{\nabla_\theta H^{adj}_{wsl,0}\}^{-1}]^T.$$
Using the above arguments, it is clear that Case 0 and Case 1 can be handled using the formulae of the previous subsection, with the subscripts $s$ and $ws$ replaced by $sl$ and $wsl$.

3.3 Simulation

We use simulation to compare the performance of the different estimators considered in this study for the sample-to-register linkage case. The linear model used in this simulation is the same as before,
$$Y = 1 + 5X_1 + 8X_2 + \epsilon.$$

Most of the assumptions and scenarios are the same as in the register-to-register case, except that we use a sample here instead of the whole population. In this simulation, we consider the cases of complete linkage and incomplete linkage separately. For the case of complete linkage, we assume that the sample records $s$ from the benchmark data set are all linked to records in the other registers. The extra assumptions made in this simulation are that the population size of all registers is the same, that each m-block has 2000 records, and that 500 sample records are chosen randomly from each m-block. Further, in the case of incomplete linkage, we assume that, among the 2000 records, half cannot be linked. In this incomplete linkage case, we chose 1000 sample records: because half of them cannot be linked, we end up with around 500 sample records that are linked to the other registers. This provides a consistent comparison of the same estimators between the complete linkage case and the incomplete linkage case.

The results for the complete linkage case can be found in Tables 4-6, while the results for the incomplete case are in Tables 7-9. The results show a very similar pattern to the register-to-register case. Clearly, while the ratio-type estimator, the Lahiri-Larsen estimator and the EBLUE all correct the bias due to linkage errors, the EBLUE outperforms all other estimators.

Here are the results for the complete linkage case: Table 4 here. Table 5 here. Table 6 here.

Here are the results for the incomplete linkage case: Table 7 here. Table 8 here. Table 9 here.

The results for the sample-register cases are very similar to the register-register cases as long as the sample sizes are similar. One thing to note is that the coverage rates are all higher than 95%. This is not the case when the number of merged data sets is two. One possible explanation is that the variance terms in these cases are more complicated, and, as the number of merged data sets increases, the variances increase as well, so that the confidence intervals become wider.
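The relative bias, relative RMSE and coverage figures reported in the tables can be computed from the Monte Carlo output in the usual way; a small sketch (hypothetical names, assuming one row of estimates and estimated variances per replication):

```python
import numpy as np

def mc_summary(estimates, variances, beta_true, z=1.96):
    """Monte Carlo relative bias, relative RMSE and coverage of nominal
    95% intervals.  estimates, variances: (R, p) arrays holding the
    coefficient estimates and their estimated variances per replicate."""
    err = estimates - beta_true
    rel_bias = err.mean(axis=0) / beta_true
    rel_rmse = np.sqrt((err ** 2).mean(axis=0)) / np.abs(beta_true)
    coverage = (np.abs(err) <= z * np.sqrt(variances)).mean(axis=0)
    return rel_bias, rel_rmse, coverage
```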

4 Conclusion and further research directions

In this paper, we extend the linkage error adjustment techniques for regression analysis developed in Chambers (2008) to accommodate the situation where more than two data sets are merged. We developed a ratio-type estimator for the regression analysis, which was then extended to a more general adjusted estimating function approach. These methods can deal with the case where all the data sets are registers, as well as the case where the benchmark data set is a sample and the others are registers. Even though it has not been dealt with here, it is easy to see that these methods can naturally accommodate the case where all the data sets are samples. The methods were also extended to deal with the situation where some of the sample records fail to be linked to the other registers.

However, all of these bias correction methods come at the price of increased variance. Furthermore, in the sample-register case with non-linkage, the number of linked sample records will decrease as the number of merged data sets increases. Thus, we expect some loss of information from merging more data sets, a limitation we hope to overcome by adapting other approaches. Another limitation of these methods is the assumption that the linkage errors among the data sets occur randomly. In practice, there may be correlation among the linkage errors; to deal with this situation, our model would need to include more complicated covariance structures in the formulae, and this will be dealt with in our next research paper.

A Proofs of the Propositions and Theorems

A.1 Proof of Proposition 1

For the variance of the estimator, note that
$$\mathrm{Var}_X(\hat\beta_R) = D_1^{-1}\mathrm{Var}_X(\hat\beta^*)(D_1^{-1})^T,$$
where
$$\mathrm{Var}_X(\hat\beta^*) = \{(X^*)^TX^*\}^{-1}(X^*)^T\mathrm{Var}_X(Y)X^*\{(X^*)^TX^*\}^{-1}.$$

Further, one has
$$\mathrm{Var}_X(Y) = E_X\{\mathrm{Var}_X(Y \mid B_2)\} + \mathrm{Var}_X\{E_X(Y \mid B_2)\}.$$
Note that, by (1), $E_X(Y \mid B_2) = X_{B_2}\beta$. Thus, denoting $f_{B_2} = X_2\beta_2$,
$$V_{B_2} = \mathrm{Var}_X\{E_X(Y \mid B_2)\} = E_X\big[(X_{B_2}\beta - X^E\beta)(X_{B_2}\beta - X^E\beta)^T\big] = E_X\big[(B_2^TX_2^*\beta_2 - E_{B_2}X_2\beta_2)(B_2^TX_2^*\beta_2 - E_{B_2}X_2\beta_2)^T\big].$$
Then, by (16) from Chambers (2008),
$$V_{B_2} = \mathrm{diag}\big[(1 - \lambda_{B_2})\{\lambda_{B_2}(f_{B_2,i} - \bar f_{B_2})^2 + \bar f^{(2)}_{B_2} - (\bar f_{B_2})^2\}\big], \quad (16)$$
where $f_{B_2} = (f_{B_2,i})$ and $\bar f_{B_2}$, $\bar f^{(2)}_{B_2}$ are the averages of the $f_{B_2,i}$ and of their squares, respectively. Furthermore, one has
$$\mathrm{Var}_X(Y \mid B_2) = E_X\{(Y - X_{B_2}\beta)(Y - X_{B_2}\beta)^T\} = E_X(\epsilon\epsilon^T) = \sigma^2I.$$
Therefore, one has
$$\mathrm{Var}_X(Y) = \sigma^2I + V_{B_2}, \quad (17)$$
which implies that
$$\mathrm{Var}_X(\hat\beta_R) = \{(X^*)^TX^E\}^{-1}(X^*)^T(\sigma^2I + V_{B_2})X^*[\{(X^*)^TX^E\}^{-1}]^T. \quad (18)$$
To evaluate $\mathrm{Var}_X(\hat\beta_R)$, one has
$$(Y - f^E)^T(Y - f^E) = \{(Y - f) - (f^E - f)\}^T\{(Y - f) - (f^E - f)\}$$
$$= (Y - f)^T(Y - f) \quad (19)$$
$$\quad - (Y - f)^T(f^E - f) - (f^E - f)^T(Y - f) \quad (20)$$
$$\quad + (f^E - f)^T(f^E - f). \quad (21)$$
Note that, by definition,
$$E_X\{(Y - f)^T(Y - f)\} = N\sigma^2. \quad (22)$$
Further,
$$E_X\{(Y - f)^T(f^E - f) + (f^E - f)^T(Y - f)\} = 0 \quad (23)$$
because $Y - f = \epsilon$ and $\mathrm{cov}(\epsilon, X^*) = 0$. Moreover, one has
$$E_X\{(f^E - f)^T(f^E - f)\} = E_X\{(f^E)^T(f^E - Y) + (f^E)^T(Y - f) + f^T(f - Y) + f^T(Y - f^E)\} = 0 \quad (24)$$
because $E_X(Y - f^E) = 0$. Thus, by (19)-(24),
$$\hat\sigma^2 = N^{-1}(Y - f^E)^T(Y - f^E). \quad (25)$$
Consequently, $\mathrm{Var}_X(\hat\beta_R)$ can be evaluated by using (25), (16) and (18).

29 Further, E X (Y f ) T (f E f ) + (f E f ) T (Y f ) = 0 (23) because Y f = ǫ and cov(ǫ, X ) = 0. Moreover, one has E X (f E f ) T (f E f ) = E X (f E ) T (f E Y ) + (f E ) T (Y f ) + (f ) T (f Y ) + (f ) T (Y f E ) (24) because E X ( Y f E = 0 ) = 0. Thus, by (19)-(24), ˆσ 2 = N 1 (Y f E )T (Y f E ). (25) Conseuently, Var X (ˆβ R ) can be evaluated by using (25), (16) and (18). A.2 Proof of Proposition 2 For the variance of the estimator, one has Var X (ˆβ R ) = D 1 2 Var X (ˆβ ) ( D 1 2 ) T, where Var X (ˆβ 1 1. ) = (X )T X (X )T Var X (Y )X (X )T X Note that one has Var X (Y ) = E X Var X (Y B 1, B 2 ) + Var X E X (Y B 1, B 2 ). Then, by the assumption that the mismatches found in X 1 mismatches found in X 2, are not correlated with the 2 V B = Var X E X (Y B 1, B 2 ) = E X X B 2 β X E β 2 = E X (B1 T X 1β 1 E B1 X 1β 1 ) + (B2 T X 2β 2 E B2 X 2β 2 ) 2 = E X B1 T X 1β 1 E B1 X 1β 1 + EX B T2 X 2β 2 E B2 X 2β 2 2 = V B1 + V B2, (26) 27

30 where V B2 is defined in (16) and V B1 also can be defined similarly. Then, by the similar arguments to (17)-(25), one has 1 ( 1, Var X (ˆβ R ) = (X ) T X E (X ) T σi 2 + V B )X (X ) T X E where ˆσ 2 can be evaluated by (25). A.3 Proof of Proposition 3 To derive the variance of ˆβ R, note that Var X (ˆβ R ) = D 1 3 Var X (ˆβ ) ( D 1 3 )T, where Var X (ˆβ 1 1. ) = (X ) T X (X ) T Var X (Y )X (X ) T X Hence, we need to calculate Var X (Y ) first in order to derive the variance of ˆβ R. Note that Var X (Y ) Var X(Y ). To see this, one has Var X (Y ) = E X Var X (Y A ) + Var X E X (Y A ). (27) Then, by (2) and (3) E X (Y A ) = A E X (Y ) = A X E β = A f E. Note that f is not observable, because it is the expectation of Y, that is also not observable, under completely correct linkage. f is observable,but it contains incorrect linkage between X 1 and X 2. f E is a adjusted version of f to eliminate the bias due to incorrect linkage between X 1 and X 2. Also let V A = Var X E X (Y A ). Then, one has 10 V A = Var X (A f E ). 10 One way to estimate V A is using (16) from Chambers (2008). Then, V A = diag (1 λ A ) { λ A (f,i E f E )2 E(2) + f ( f E )2}, (28) where f E = (fe,i ) and f E, fe(2) are the averages of f E,i and their suares respectively in f E. 28

Further,
$$\mathrm{Var}_X(Y^* \mid A) = \mathrm{Var}_X(AY) = A\mathrm{Var}_X(Y)A^T = A\big(E_X\{\mathrm{Var}_X(Y \mid B_2)\}\big)A^T + A\mathrm{Var}_X\{E_X(Y \mid B_2)\}A^T, \quad (29)$$
because one has
$$\mathrm{Var}_X(Y) = E_X\{\mathrm{Var}_X(Y \mid B_2)\} + \mathrm{Var}_X\{E_X(Y \mid B_2)\}. \quad (30)$$
Note that, by (1), $E_X(Y \mid B_2) = X_{B_2}\beta$. Thus,
$$\mathrm{Var}_X\{E_X(Y \mid B_2)\} = E_X\big[(X_{B_2}\beta - X^E\beta)(X_{B_2}\beta - X^E\beta)^T\big] = E_X\big[(B_2^TX_2^*\beta_2 - E_{B_2}X_2\beta_2)(B_2^TX_2^*\beta_2 - E_{B_2}X_2\beta_2)^T\big]. \quad (31)$$
Denote $f_{B_2} = X_2\beta_2$. Also, let $C_{B_2} = AB_2^T$, which is another permutation matrix, and let $E_{C_2} = E_X(AB_2^T)$. Further, let
$$V_{C_2} = A\mathrm{Var}_X\{E_X(Y \mid B_2)\}A^T. \quad (32)$$
Then one has
$$V_{C_2} = E_X\{C_2f_{B_2}(f_{B_2})^TC_2^T\} - E_{C_2}f_{B_2}(f_{B_2})^TE_{C_2}^T.$$
By (16) from Chambers (2008), this can be estimated by
$$V_{C_2} = \mathrm{diag}\big[(1 - \lambda_{C_2})\{\lambda_{C_2}(f_{B_2,i} - \bar f_{B_2})^2 + \bar f^{(2)}_{B_2} - (\bar f_{B_2})^2\}\big],$$
where $f_{B_2} = (f_{B_2,i})$ and $\bar f_{B_2}$, $\bar f^{(2)}_{B_2}$ are the averages of the $f_{B_2,i}$ and of their squares, respectively. Furthermore, one has
$$\mathrm{Var}_X(Y \mid B_2) = E_X\{(Y - X_{B_2}\beta)(Y - X_{B_2}\beta)^T\} = E_X(\epsilon\epsilon^T) = \sigma^2I.$$

Hence,
$$A\big(E_X\{\mathrm{Var}_X(Y \mid B_2)\}\big)A^T = A\sigma^2IA^T = \sigma^2AA^T = \sigma^2I. \quad (33)$$
Thus, by (29), (30), (32) and (33),
$$E_X\{\mathrm{Var}_X(Y^* \mid A)\} = E_X\{A\mathrm{Var}_X(Y)A^T\} = E_X\{\sigma^2I\} + V_{C_2} = \sigma^2I + V_{C_2}. \quad (34)$$
Then, by (27), (32) and (34),
$$\mathrm{Var}_X(Y^*) = \sigma^2I + V_{C_2} + V_A = \Sigma. \quad (35)$$
Consequently, one has
$$\mathrm{Var}_X(\hat\beta^*) = \{(X^*)^TX^*\}^{-1}(X^*)^T(\sigma^2I + V_{C_2} + V_A)X^*\{(X^*)^TX^*\}^{-1}$$
and
$$V_R = \mathrm{Var}_X(\hat\beta_R) = \{(X^*)^TE_AX^E\}^{-1}(X^*)^T(\sigma^2I + V_{C_2} + V_A)X^*[\{(X^*)^TE_AX^E\}^{-1}]^T.$$
To define $\hat V_R$, the estimator of $V_R$, let $\hat f_{B_2} = X_2\hat\beta_2$ and $\hat f^E = X^E\hat\beta$, where $\hat\beta_2$ and $\hat\beta$ are the estimates of $\beta_2$ and $\beta$, respectively. Then $\hat V_A$ and $\hat V_{C_2}$ can be obtained by replacing $f^E$ and $f_{B_2}$, in $V_A$ and $V_{C_2}$, with $\hat f^E$ and $\hat f_{B_2}$, respectively. Now, to estimate $\sigma^2$, one has
$$(Y^* - f^E)^T(Y^* - f^E) = (Y^*)^TY^* - (Y^*)^Tf^E - (f^E)^TY^* + (f^E)^Tf^E$$
$$= Y^TA^TAY - Y^Tf - f^TY + f^Tf + Y^Tf + f^TY - f^Tf - (Y^*)^Tf^E - (f^E)^TY^* + (f^E)^Tf^E, \quad (36)$$

where
$$E_X\big(Y^TA^TAY - Y^Tf - f^TY + f^Tf\big) = E_X\{(Y - f)^T(Y - f)\} = N\sigma^2. \quad (37)$$
Also, one has
$$E_X\big(f^TY - f^Tf\big) = E_X\{f^T(Y - f)\} = E_X(f^T\epsilon) = 0. \quad (38)$$
Further,
$$Y^Tf - (Y^*)^Tf^E - (f^E)^TY^* + (f^E)^Tf^E$$
$$= \big\{Y^Tf^E - (Y^*)^Tf^E + (f^E)^Tf^E - (f^E)^TY^*\big\} \quad (39)$$
$$\quad + \big\{Y^Tf - Y^Tf^E\big\}, \quad (40)$$
and it is easy to see that the expectation of (40) is zero. Also,
$$E_X\big\{Y^Tf^E - (Y^*)^Tf^E + (f^E)^Tf^E - (f^E)^TY^*\big\} = 2(f^E)^T(I - E_A)f^E. \quad (41)$$
Therefore, by (36)-(41),
$$\hat\sigma^2 = N^{-1}\big\{(Y^* - f^E)^T(Y^* - f^E) - 2(f^E)^T(I - E_A)f^E\big\}.$$

34 Therefore, the asymptotic variance estimator is of the form V 1 X (ˆθ 1 ) = G θ f E (ˆθ 1 1 ) ( G ˆΣ 1 G T G θ f E (ˆθ 1 ) T. 1 ) Let us consider the Case 1 where Y is the bench mark data set and the linkages between Y and X 1 and the linkages between Y and X 2 are done with some errors. In this case, we have similar estimating function H 2(θ) = G (θ) { Y f E2 (θ) }, but, by (4), f E2 (θ) = X E2 β = (1, E B1 X 1, E B 2 X 2 )β. This leads the asymptotic variance estimator of the form V 2 X (ˆθ 2 ) = G θ f E2 (ˆθ 1 2 ) G ˆΣ 2 G T ( G θ f E2 (ˆθ 2 ) 1 ) T, where, by (26), ˆΣ 2 = ˆσ 2 I + ˆV B1 + ˆV B2. Finally, the asymptotic variance of ˆθ 3 is of the form Var X (ˆθ 3 ) θ H 3 (θ 0) 1 Var X H 3 (θ 0) ( θ H 3 (θ 0) ) 1 T, (42) where, θ H 3(θ) = G E A θ f E (θ). (43) Further, by (35), one has Var X H 3 (θ) = = = G Var X ( Y ) G T G σ 2 I + V C2 + V A G T G Σ 3 GT. Therefore, the asymptotic variance estimator is of the form V 3 X (ˆθ 3 ) = G E A θ f E (ˆθ 1 3 ) ( G ˆΣ 3 G T G E A θ f E (ˆθ 1 ) T, 3 ) as reuired. 32

A.5 Proof of Theorem 5

Let $\lambda_1 = \lambda_A$ and $\lambda_2 = \lambda_{B_2}$. Then the variance of $\hat\theta_3$ can be approximated by
$$\mathrm{Var}_X(\hat\theta_3) \approx \{\nabla_\theta H_0\}^{-1}\mathrm{Var}_X\big\{H_0 + \nabla_{\lambda_1}H_0(\hat\lambda_A - \lambda_A) + \nabla_{\lambda_2}H_0(\hat\lambda_{B_2} - \lambda_{B_2})\big\}[\{\nabla_\theta H_0\}^{-1}]^T$$
$$= \{\nabla_\theta H_0\}^{-1}\big\{\mathrm{Var}_X(H_0) + (\nabla_{\lambda_1}H_0)\mathrm{Var}_X(\hat\lambda_A)(\nabla_{\lambda_1}H_0)^T + (\nabla_{\lambda_2}H_0)\mathrm{Var}_X(\hat\lambda_{B_2})(\nabla_{\lambda_2}H_0)^T\big\}[\{\nabla_\theta H_0\}^{-1}]^T. \quad (44)$$
To derive $\nabla_{\lambda_i}H_0$, we assume that the distribution of $\hat\lambda_i$ is independent of the distribution of $H_0$ (this assumption was originally introduced in Chambers (2008)). Then, by arguments similar to those in Chambers (2008),
$$\nabla_{\lambda_1}H_0 = \nabla_{\lambda_1}\big[G^*\{Y^* - E_A(\lambda_A)f^E(\theta, \lambda_{B_2})\}\big] = -G^*\{\nabla_{\lambda_1}E_A(\lambda_A)\}f^E(\theta, \lambda_{B_2}) = -G^*(M-1)^{-1}(MI - \mathbf{1}\mathbf{1}^T)f^E(\theta, \lambda_{B_2}) \quad (45)$$
and
$$\nabla_{\lambda_2}H_0 = \nabla_{\lambda_2}\big[G^*\{Y^* - E_A(\lambda_A)f^E(\theta, \lambda_{B_2})\}\big] = -G^*E_A(\lambda_A)\nabla_{\lambda_2}f^E(\theta, \lambda_{B_2}) = -G^*E_A(\lambda_A)\{\nabla_{\lambda_2}(E_{B_2})\}X_2\beta_2 = -G^*E_A(M-1)^{-1}(MI - \mathbf{1}\mathbf{1}^T)X_2\beta_2. \quad (46)$$
Therefore, the variance $\mathrm{Var}_X(\hat\theta_3)$ can be evaluated by substituting the estimated values of (43), (45) and (46) into (44). For Case 1, where $Y$ is the benchmark data set and the linkages between $Y$ and $X_1$ and between $Y$ and $X_2$ are done with some errors, the variance of $\hat\theta_2$ is

of the form
$$\mathrm{Var}_X(\hat\theta_2) \approx \{\nabla_\theta H_0\}^{-1}\mathrm{Var}_X\big\{H_0 + \nabla_{\lambda_{B_1}}H_0(\hat\lambda_{B_1} - \lambda_{B_1}) + \nabla_{\lambda_{B_2}}H_0(\hat\lambda_{B_2} - \lambda_{B_2})\big\}[\{\nabla_\theta H_0\}^{-1}]^T$$
$$= \{\nabla_\theta H_0\}^{-1}\big\{\mathrm{Var}_X(H_0) + (\nabla_{\lambda_{B_1}}H_0)\mathrm{Var}_X(\hat\lambda_{B_1})(\nabla_{\lambda_{B_1}}H_0)^T + (\nabla_{\lambda_{B_2}}H_0)\mathrm{Var}_X(\hat\lambda_{B_2})(\nabla_{\lambda_{B_2}}H_0)^T\big\}[\{\nabla_\theta H_0\}^{-1}]^T,$$
where $\lambda_{B_1} = \mathrm{pr}(\text{correct linkage between } Y \text{ and } X_1)$ and $H_0 = H_2(\theta_0, \lambda_{B_1}, \lambda_{B_2})$. Further, it is easy to see that
$$\nabla_{\lambda_{B_1}}H_0 = -G^*(M-1)^{-1}(MI - \mathbf{1}\mathbf{1}^T)X_1\beta_1$$
and
$$\nabla_{\lambda_{B_2}}H_0 = -G^*(M-1)^{-1}(MI - \mathbf{1}\mathbf{1}^T)X_2\beta_2.$$
Finally, for Case 0, one has
$$\mathrm{Var}_X(\hat\theta_1) \approx \{\nabla_\theta H_0\}^{-1}\mathrm{Var}_X\big\{H_0 + \nabla_{\lambda_{B_2}}H_0(\hat\lambda_{B_2} - \lambda_{B_2})\big\}[\{\nabla_\theta H_0\}^{-1}]^T = \{\nabla_\theta H_0\}^{-1}\big\{\mathrm{Var}_X(H_0) + (\nabla_{\lambda_{B_2}}H_0)\mathrm{Var}_X(\hat\lambda_{B_2})(\nabla_{\lambda_{B_2}}H_0)^T\big\}[\{\nabla_\theta H_0\}^{-1}]^T,$$
where, with $\lambda_{B_2} = \mathrm{pr}(\text{correct linkage between } X_1 \text{ and } X_2)$, $H_0 = H_1(\theta_0, \lambda_{B_2})$ and
$$\nabla_{\lambda_{B_2}}H_0 = -G^*(M-1)^{-1}(MI - \mathbf{1}\mathbf{1}^T)X_2\beta_2.$$

A.6 Proof of Theorem 6

Let $\hat\theta^s_3$ be the solution of the estimating equation (11). To derive the asymptotic variance estimator for $\hat\theta^s_3$, note that, by (42),
$$\mathrm{Var}_X(\hat\theta^s_3) \approx \{\nabla_\theta H^{adj}_{ws}(\theta_0)\}^{-1}\mathrm{Var}_X\{H^{adj}_{ws}(\theta_0)\}[\{\nabla_\theta H^{adj}_{ws}(\theta_0)\}^{-1}]^T \quad (47)$$

with corresponding estimator of the form
$$V^{ws}_{3X}(\hat\theta^s_3) = \{\nabla_\theta H^{adj}_{ws}(\theta_0)\}^{-1}V^{ws}_{3X}\{H^{adj}_{ws}(\theta_0)\}[\{\nabla_\theta H^{adj}_{ws}(\theta_0)\}^{-1}]^T \approx \{G^*_s\tilde E_{A_s}\nabla_\theta f^E_s(\hat\theta^s_3)\}^{-1}G^*_s\hat\Sigma_sG^{*T}_s[\{G^*_s\tilde E_{A_s}\nabla_\theta f^E_s(\hat\theta^s_3)\}^{-1}]^T,$$
under the assumption that $G^*_s$ is independent of $\theta$. The next step is to define $\Sigma_s$. Note that
$$\Sigma_s = \mathrm{Var}_X(Y^*_s) = \mathrm{Var}_X(A_{ss}Y_s + A_{sr}Y_r) \quad (48)$$
$$= \mathrm{Var}_X(A_{ss}Y_s) + 2\mathrm{cov}_X(A_{ss}Y_s, A_{sr}Y_r) + \mathrm{Var}_X(A_{sr}Y_r).$$
Further, by (30) and arguments similar to (31)-(34), one has
$$\mathrm{Var}_X(Y) = E_X\{\mathrm{Var}_X(Y \mid B_2)\} + \mathrm{Var}_X\{E_X(Y \mid B_2)\} = \sigma^2I + V_{B_2},$$
where $V_{B_2} = \mathrm{Var}_X\{E_X(Y \mid B_2)\}$ can be approximated by a diagonal matrix by the same argument as (16) from Chambers (2008), namely
$$V_{B_2} \approx \mathrm{diag}\big[(1 - \lambda_{B_2})\{\lambda_{B_2}(f_{B_2,i} - \bar f_{B_2})^2 + \bar f^{(2)}_{B_2} - (\bar f_{B_2})^2\}\big],$$
where $f_{B_2} = (f_{B_2,i})$ and $\bar f_{B_2}$, $\bar f^{(2)}_{B_2}$ are the averages of the $f_{B_2,i}$ and of their squares, respectively. Thus, $\mathrm{Var}_X(Y)$ can be approximately regarded as a diagonal matrix, and we set $\mathrm{Var}_X(Y) \approx D = \mathrm{diag}\{d_i;\ i\}$. In this case, one has
$$\mathrm{cov}_X(A_{ss}Y_s, A_{sr}Y_r) \approx 0.$$
Also, (48) becomes
$$\Sigma_s \approx \mathrm{Var}_X(A_{ss}Y_s) + \mathrm{Var}_X(A_{sr}Y_r)$$
$$= E_X\{\mathrm{Var}_X(A_{ss}Y_s \mid A)\} + \mathrm{Var}_X\{E_X(A_{ss}Y_s \mid A)\} + E_X\{\mathrm{Var}_X(A_{sr}Y_r \mid A)\} + \mathrm{Var}_X\{E_X(A_{sr}Y_r \mid A)\}$$
$$= E_X\big(A_{ss}\mathrm{Var}_X(Y_s)A_{ss}^T\big) + E_X\big(A_{sr}\mathrm{Var}_X(Y_r)A_{sr}^T\big) + \mathrm{Var}_X\big(A_{ss}f^E_s + A_{sr}f^E_r\big)$$
$$\approx E_X\big(A_{ss}D_sA_{ss}^T\big) + E_X\big(A_{sr}D_rA_{sr}^T\big) + \mathrm{Var}_X\big(A_{ss}f^E_s + A_{sr}f^E_r\big).$$


More information

Chapter 2 The Simple Linear Regression Model: Specification and Estimation

Chapter 2 The Simple Linear Regression Model: Specification and Estimation Chapter The Simple Linear Regression Model: Specification and Estimation Page 1 Chapter Contents.1 An Economic Model. An Econometric Model.3 Estimating the Regression Parameters.4 Assessing the Least Squares

More information

Summer School in Statistics for Astronomers V June 1 - June 6, Regression. Mosuk Chow Statistics Department Penn State University.

Summer School in Statistics for Astronomers V June 1 - June 6, Regression. Mosuk Chow Statistics Department Penn State University. Summer School in Statistics for Astronomers V June 1 - June 6, 2009 Regression Mosuk Chow Statistics Department Penn State University. Adapted from notes prepared by RL Karandikar Mean and variance Recall

More information

Link lecture - Lagrange Multipliers

Link lecture - Lagrange Multipliers Link lecture - Lagrange Multipliers Lagrange multipliers provide a method for finding a stationary point of a function, say f(x, y) when the variables are subject to constraints, say of the form g(x, y)

More information

Chapter 4: Constrained estimators and tests in the multiple linear regression model (Part III)

Chapter 4: Constrained estimators and tests in the multiple linear regression model (Part III) Chapter 4: Constrained estimators and tests in the multiple linear regression model (Part III) Florian Pelgrin HEC September-December 2010 Florian Pelgrin (HEC) Constrained estimators September-December

More information

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A. 1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n

More information

Small Area Estimation Using a Nonparametric Model Based Direct Estimator

Small Area Estimation Using a Nonparametric Model Based Direct Estimator University of Wollongong Research Online Centre for Statistical & Survey Methodology Working Paper Series Faculty of Engineering and Information Sciences 2009 Small Area Estimation Using a Nonparametric

More information

MS&E 226: Small Data. Lecture 11: Maximum likelihood (v2) Ramesh Johari

MS&E 226: Small Data. Lecture 11: Maximum likelihood (v2) Ramesh Johari MS&E 226: Small Data Lecture 11: Maximum likelihood (v2) Ramesh Johari ramesh.johari@stanford.edu 1 / 18 The likelihood function 2 / 18 Estimating the parameter This lecture develops the methodology behind

More information

[y i α βx i ] 2 (2) Q = i=1

[y i α βx i ] 2 (2) Q = i=1 Least squares fits This section has no probability in it. There are no random variables. We are given n points (x i, y i ) and want to find the equation of the line that best fits them. We take the equation

More information

IV estimators and forbidden regressions

IV estimators and forbidden regressions Economics 8379 Spring 2016 Ben Williams IV estimators and forbidden regressions Preliminary results Consider the triangular model with first stage given by x i2 = γ 1X i1 + γ 2 Z i + ν i and second stage

More information

Applied Econometrics (QEM)

Applied Econometrics (QEM) Applied Econometrics (QEM) The Simple Linear Regression Model based on Prinicples of Econometrics Jakub Mućk Department of Quantitative Economics Jakub Mućk Applied Econometrics (QEM) Meeting #2 The Simple

More information

Missing dependent variables in panel data models

Missing dependent variables in panel data models Missing dependent variables in panel data models Jason Abrevaya Abstract This paper considers estimation of a fixed-effects model in which the dependent variable may be missing. For cross-sectional units

More information

Estimating and Testing the US Model 8.1 Introduction

Estimating and Testing the US Model 8.1 Introduction 8 Estimating and Testing the US Model 8.1 Introduction The previous chapter discussed techniques for estimating and testing complete models, and this chapter applies these techniques to the US model. For

More information

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis. 401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis

More information

Outline of GLMs. Definitions

Outline of GLMs. Definitions Outline of GLMs Definitions This is a short outline of GLM details, adapted from the book Nonparametric Regression and Generalized Linear Models, by Green and Silverman. The responses Y i have density

More information

PANEL DATA RANDOM AND FIXED EFFECTS MODEL. Professor Menelaos Karanasos. December Panel Data (Institute) PANEL DATA December / 1

PANEL DATA RANDOM AND FIXED EFFECTS MODEL. Professor Menelaos Karanasos. December Panel Data (Institute) PANEL DATA December / 1 PANEL DATA RANDOM AND FIXED EFFECTS MODEL Professor Menelaos Karanasos December 2011 PANEL DATA Notation y it is the value of the dependent variable for cross-section unit i at time t where i = 1,...,

More information

Econometrics of Panel Data

Econometrics of Panel Data Econometrics of Panel Data Jakub Mućk Meeting # 2 Jakub Mućk Econometrics of Panel Data Meeting # 2 1 / 26 Outline 1 Fixed effects model The Least Squares Dummy Variable Estimator The Fixed Effect (Within

More information

Sensitivity of GLS estimators in random effects models

Sensitivity of GLS estimators in random effects models of GLS estimators in random effects models Andrey L. Vasnev (University of Sydney) Tokyo, August 4, 2009 1 / 19 Plan Plan Simulation studies and estimators 2 / 19 Simulation studies Plan Simulation studies

More information

To Estimate or Not to Estimate?

To Estimate or Not to Estimate? To Estimate or Not to Estimate? Benjamin Kedem and Shihua Wen In linear regression there are examples where some of the coefficients are known but are estimated anyway for various reasons not least of

More information

Econometrics II - EXAM Answer each question in separate sheets in three hours

Econometrics II - EXAM Answer each question in separate sheets in three hours Econometrics II - EXAM Answer each question in separate sheets in three hours. Let u and u be jointly Gaussian and independent of z in all the equations. a Investigate the identification of the following

More information

Multivariate Statistical Analysis

Multivariate Statistical Analysis Multivariate Statistical Analysis Fall 2011 C. L. Williams, Ph.D. Lecture 4 for Applied Multivariate Analysis Outline 1 Eigen values and eigen vectors Characteristic equation Some properties of eigendecompositions

More information

Restricted Maximum Likelihood in Linear Regression and Linear Mixed-Effects Model

Restricted Maximum Likelihood in Linear Regression and Linear Mixed-Effects Model Restricted Maximum Likelihood in Linear Regression and Linear Mixed-Effects Model Xiuming Zhang zhangxiuming@u.nus.edu A*STAR-NUS Clinical Imaging Research Center October, 015 Summary This report derives

More information

Asymptotic Theory. L. Magee revised January 21, 2013

Asymptotic Theory. L. Magee revised January 21, 2013 Asymptotic Theory L. Magee revised January 21, 2013 1 Convergence 1.1 Definitions Let a n to refer to a random variable that is a function of n random variables. Convergence in Probability The scalar a

More information

An Unbiased Estimator Of The Greatest Lower Bound

An Unbiased Estimator Of The Greatest Lower Bound Journal of Modern Applied Statistical Methods Volume 16 Issue 1 Article 36 5-1-017 An Unbiased Estimator Of The Greatest Lower Bound Nol Bendermacher Radboud University Nijmegen, Netherlands, Bendermacher@hotmail.com

More information

Properties of the least squares estimates

Properties of the least squares estimates Properties of the least squares estimates 2019-01-18 Warmup Let a and b be scalar constants, and X be a scalar random variable. Fill in the blanks E ax + b) = Var ax + b) = Goal Recall that the least squares

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

Data Integration for Big Data Analysis for finite population inference

Data Integration for Big Data Analysis for finite population inference for Big Data Analysis for finite population inference Jae-kwang Kim ISU January 23, 2018 1 / 36 What is big data? 2 / 36 Data do not speak for themselves Knowledge Reproducibility Information Intepretation

More information

Fractional Imputation in Survey Sampling: A Comparative Review

Fractional Imputation in Survey Sampling: A Comparative Review Fractional Imputation in Survey Sampling: A Comparative Review Shu Yang Jae-Kwang Kim Iowa State University Joint Statistical Meetings, August 2015 Outline Introduction Fractional imputation Features Numerical

More information

Introduction to Estimation Methods for Time Series models. Lecture 1

Introduction to Estimation Methods for Time Series models. Lecture 1 Introduction to Estimation Methods for Time Series models Lecture 1 Fulvio Corsi SNS Pisa Fulvio Corsi Introduction to Estimation () Methods for Time Series models Lecture 1 SNS Pisa 1 / 19 Estimation

More information

Statistics 135 Fall 2008 Final Exam

Statistics 135 Fall 2008 Final Exam Name: SID: Statistics 135 Fall 2008 Final Exam Show your work. The number of points each question is worth is shown at the beginning of the question. There are 10 problems. 1. [2] The normal equations

More information

IV Estimation and its Limitations: Weak Instruments and Weakly Endogeneous Regressors

IV Estimation and its Limitations: Weak Instruments and Weakly Endogeneous Regressors IV Estimation and its Limitations: Weak Instruments and Weakly Endogeneous Regressors Laura Mayoral IAE, Barcelona GSE and University of Gothenburg Gothenburg, May 2015 Roadmap Deviations from the standard

More information

36. Multisample U-statistics and jointly distributed U-statistics Lehmann 6.1

36. Multisample U-statistics and jointly distributed U-statistics Lehmann 6.1 36. Multisample U-statistics jointly distributed U-statistics Lehmann 6.1 In this topic, we generalize the idea of U-statistics in two different directions. First, we consider single U-statistics for situations

More information

Lecture 2: Linear Models. Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011

Lecture 2: Linear Models. Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011 Lecture 2: Linear Models Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011 1 Quick Review of the Major Points The general linear model can be written as y = X! + e y = vector

More information

Peter Hoff Linear and multilinear models April 3, GLS for multivariate regression 5. 3 Covariance estimation for the GLM 8

Peter Hoff Linear and multilinear models April 3, GLS for multivariate regression 5. 3 Covariance estimation for the GLM 8 Contents 1 Linear model 1 2 GLS for multivariate regression 5 3 Covariance estimation for the GLM 8 4 Testing the GLH 11 A reference for some of this material can be found somewhere. 1 Linear model Recall

More information

Lecture 3: Linear Models. Bruce Walsh lecture notes Uppsala EQG course version 28 Jan 2012

Lecture 3: Linear Models. Bruce Walsh lecture notes Uppsala EQG course version 28 Jan 2012 Lecture 3: Linear Models Bruce Walsh lecture notes Uppsala EQG course version 28 Jan 2012 1 Quick Review of the Major Points The general linear model can be written as y = X! + e y = vector of observed

More information

STAT 830 Non-parametric Inference Basics

STAT 830 Non-parametric Inference Basics STAT 830 Non-parametric Inference Basics Richard Lockhart Simon Fraser University STAT 801=830 Fall 2012 Richard Lockhart (Simon Fraser University)STAT 830 Non-parametric Inference Basics STAT 801=830

More information

Machine Learning. Lecture 9: Learning Theory. Feng Li.

Machine Learning. Lecture 9: Learning Theory. Feng Li. Machine Learning Lecture 9: Learning Theory Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Why Learning Theory How can we tell

More information

Linear Algebra March 16, 2019

Linear Algebra March 16, 2019 Linear Algebra March 16, 2019 2 Contents 0.1 Notation................................ 4 1 Systems of linear equations, and matrices 5 1.1 Systems of linear equations..................... 5 1.2 Augmented

More information

TECHNICAL REPORT # 59 MAY Interim sample size recalculation for linear and logistic regression models: a comprehensive Monte-Carlo study

TECHNICAL REPORT # 59 MAY Interim sample size recalculation for linear and logistic regression models: a comprehensive Monte-Carlo study TECHNICAL REPORT # 59 MAY 2013 Interim sample size recalculation for linear and logistic regression models: a comprehensive Monte-Carlo study Sergey Tarima, Peng He, Tao Wang, Aniko Szabo Division of Biostatistics,

More information

In the bivariate regression model, the original parameterization is. Y i = β 1 + β 2 X2 + β 2 X2. + β 2 (X 2i X 2 ) + ε i (2)

In the bivariate regression model, the original parameterization is. Y i = β 1 + β 2 X2 + β 2 X2. + β 2 (X 2i X 2 ) + ε i (2) RNy, econ460 autumn 04 Lecture note Orthogonalization and re-parameterization 5..3 and 7.. in HN Orthogonalization of variables, for example X i and X means that variables that are correlated are made

More information

Ch 3: Multiple Linear Regression

Ch 3: Multiple Linear Regression Ch 3: Multiple Linear Regression 1. Multiple Linear Regression Model Multiple regression model has more than one regressor. For example, we have one response variable and two regressor variables: 1. delivery

More information

Introduction to Survey Data Integration

Introduction to Survey Data Integration Introduction to Survey Data Integration Jae-Kwang Kim Iowa State University May 20, 2014 Outline 1 Introduction 2 Survey Integration Examples 3 Basic Theory for Survey Integration 4 NASS application 5

More information

Regression. Oscar García

Regression. Oscar García Regression Oscar García Regression methods are fundamental in Forest Mensuration For a more concise and general presentation, we shall first review some matrix concepts 1 Matrices An order n m matrix is

More information

Linear Methods for Prediction

Linear Methods for Prediction Chapter 5 Linear Methods for Prediction 5.1 Introduction We now revisit the classification problem and focus on linear methods. Since our prediction Ĝ(x) will always take values in the discrete set G we

More information

Statistics II. Management Degree Management Statistics IIDegree. Statistics II. 2 nd Sem. 2013/2014. Management Degree. Simple Linear Regression

Statistics II. Management Degree Management Statistics IIDegree. Statistics II. 2 nd Sem. 2013/2014. Management Degree. Simple Linear Regression Model 1 2 Ordinary Least Squares 3 4 Non-linearities 5 of the coefficients and their to the model We saw that econometrics studies E (Y x). More generally, we shall study regression analysis. : The regression

More information

Advanced Quantitative Research Methodology, Lecture Notes: Research Designs for Causal Inference 1

Advanced Quantitative Research Methodology, Lecture Notes: Research Designs for Causal Inference 1 Advanced Quantitative Research Methodology, Lecture Notes: Research Designs for Causal Inference 1 Gary King GaryKing.org April 13, 2014 1 c Copyright 2014 Gary King, All Rights Reserved. Gary King ()

More information

Linear Model Under General Variance

Linear Model Under General Variance Linear Model Under General Variance We have a sample of T random variables y 1, y 2,, y T, satisfying the linear model Y = X β + e, where Y = (y 1,, y T )' is a (T 1) vector of random variables, X = (T

More information

Econometrics. Week 4. Fall Institute of Economic Studies Faculty of Social Sciences Charles University in Prague

Econometrics. Week 4. Fall Institute of Economic Studies Faculty of Social Sciences Charles University in Prague Econometrics Week 4 Institute of Economic Studies Faculty of Social Sciences Charles University in Prague Fall 2012 1 / 23 Recommended Reading For the today Serial correlation and heteroskedasticity in

More information

MS&E 226: Small Data

MS&E 226: Small Data MS&E 226: Small Data Lecture 12: Frequentist properties of estimators (v4) Ramesh Johari ramesh.johari@stanford.edu 1 / 39 Frequentist inference 2 / 39 Thinking like a frequentist Suppose that for some

More information

Ordinary Least Squares Regression

Ordinary Least Squares Regression Ordinary Least Squares Regression Goals for this unit More on notation and terminology OLS scalar versus matrix derivation Some Preliminaries In this class we will be learning to analyze Cross Section

More information

9. Model Selection. statistical models. overview of model selection. information criteria. goodness-of-fit measures

9. Model Selection. statistical models. overview of model selection. information criteria. goodness-of-fit measures FE661 - Statistical Methods for Financial Engineering 9. Model Selection Jitkomut Songsiri statistical models overview of model selection information criteria goodness-of-fit measures 9-1 Statistical models

More information

A note on multiple imputation for general purpose estimation

A note on multiple imputation for general purpose estimation A note on multiple imputation for general purpose estimation Shu Yang Jae Kwang Kim SSC meeting June 16, 2015 Shu Yang, Jae Kwang Kim Multiple Imputation June 16, 2015 1 / 32 Introduction Basic Setup Assume

More information

Brief Review on Estimation Theory

Brief Review on Estimation Theory Brief Review on Estimation Theory K. Abed-Meraim ENST PARIS, Signal and Image Processing Dept. abed@tsi.enst.fr This presentation is essentially based on the course BASTA by E. Moulines Brief review on

More information

2.1 Linear regression with matrices

2.1 Linear regression with matrices 21 Linear regression with matrices The values of the independent variables are united into the matrix X (design matrix), the values of the outcome and the coefficient are represented by the vectors Y and

More information

Cross Sectional Time Series: The Normal Model and Panel Corrected Standard Errors

Cross Sectional Time Series: The Normal Model and Panel Corrected Standard Errors Cross Sectional Time Series: The Normal Model and Panel Corrected Standard Errors Paul Johnson 5th April 2004 The Beck & Katz (APSR 1995) is extremely widely cited and in case you deal

More information

Econ 583 Final Exam Fall 2008

Econ 583 Final Exam Fall 2008 Econ 583 Final Exam Fall 2008 Eric Zivot December 11, 2008 Exam is due at 9:00 am in my office on Friday, December 12. 1 Maximum Likelihood Estimation and Asymptotic Theory Let X 1,...,X n be iid random

More information

Relative Improvement by Alternative Solutions for Classes of Simple Shortest Path Problems with Uncertain Data

Relative Improvement by Alternative Solutions for Classes of Simple Shortest Path Problems with Uncertain Data Relative Improvement by Alternative Solutions for Classes of Simple Shortest Path Problems with Uncertain Data Part II: Strings of Pearls G n,r with Biased Perturbations Jörg Sameith Graduiertenkolleg

More information

EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix)

EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix) 1 EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix) Taisuke Otsu London School of Economics Summer 2018 A.1. Summation operator (Wooldridge, App. A.1) 2 3 Summation operator For

More information

Variable Selection for Highly Correlated Predictors

Variable Selection for Highly Correlated Predictors Variable Selection for Highly Correlated Predictors Fei Xue and Annie Qu arxiv:1709.04840v1 [stat.me] 14 Sep 2017 Abstract Penalty-based variable selection methods are powerful in selecting relevant covariates

More information

Answer Key for STAT 200B HW No. 8

Answer Key for STAT 200B HW No. 8 Answer Key for STAT 200B HW No. 8 May 8, 2007 Problem 3.42 p. 708 The values of Ȳ for x 00, 0, 20, 30 are 5/40, 0, 20/50, and, respectively. From Corollary 3.5 it follows that MLE exists i G is identiable

More information

ECON 3150/4150, Spring term Lecture 7

ECON 3150/4150, Spring term Lecture 7 ECON 3150/4150, Spring term 2014. Lecture 7 The multivariate regression model (I) Ragnar Nymoen University of Oslo 4 February 2014 1 / 23 References to Lecture 7 and 8 SW Ch. 6 BN Kap 7.1-7.8 2 / 23 Omitted

More information

Evaluation requires to define performance measures to be optimized

Evaluation requires to define performance measures to be optimized Evaluation Basic concepts Evaluation requires to define performance measures to be optimized Performance of learning algorithms cannot be evaluated on entire domain (generalization error) approximation

More information

Econometrics. Week 11. Fall Institute of Economic Studies Faculty of Social Sciences Charles University in Prague

Econometrics. Week 11. Fall Institute of Economic Studies Faculty of Social Sciences Charles University in Prague Econometrics Week 11 Institute of Economic Studies Faculty of Social Sciences Charles University in Prague Fall 2012 1 / 30 Recommended Reading For the today Advanced Time Series Topics Selected topics

More information

. a m1 a mn. a 1 a 2 a = a n

. a m1 a mn. a 1 a 2 a = a n Biostat 140655, 2008: Matrix Algebra Review 1 Definition: An m n matrix, A m n, is a rectangular array of real numbers with m rows and n columns Element in the i th row and the j th column is denoted by

More information

University of Regina. Lecture Notes. Michael Kozdron

University of Regina. Lecture Notes. Michael Kozdron University of Regina Statistics 252 Mathematical Statistics Lecture Notes Winter 2005 Michael Kozdron kozdron@math.uregina.ca www.math.uregina.ca/ kozdron Contents 1 The Basic Idea of Statistics: Estimating

More information

REGRESSION ANALYSIS AND INDICATOR VARIABLES

REGRESSION ANALYSIS AND INDICATOR VARIABLES REGRESSION ANALYSIS AND INDICATOR VARIABLES Thesis Submitted in partial fulfillment of the requirements for the award of degree of Masters of Science in Mathematics and Computing Submitted by Sweety Arora

More information

Econometrics of Panel Data

Econometrics of Panel Data Econometrics of Panel Data Jakub Mućk Meeting # 4 Jakub Mućk Econometrics of Panel Data Meeting # 4 1 / 30 Outline 1 Two-way Error Component Model Fixed effects model Random effects model 2 Non-spherical

More information

8 Nonlinear Regression

8 Nonlinear Regression 8 Nonlinear Regression Nonlinear regression relates to models, where the mean response is not linear in the parameters of the model. A MLRM Y = β 0 + β 1 x 1 + β 2 x 2 + + β k x k + ε, ε N (0, σ 2 ) has

More information

Design and Estimation for Split Questionnaire Surveys

Design and Estimation for Split Questionnaire Surveys University of Wollongong Research Online Centre for Statistical & Survey Methodology Working Paper Series Faculty of Engineering and Information Sciences 2008 Design and Estimation for Split Questionnaire

More information

This model of the conditional expectation is linear in the parameters. A more practical and relaxed attitude towards linear regression is to say that

This model of the conditional expectation is linear in the parameters. A more practical and relaxed attitude towards linear regression is to say that Linear Regression For (X, Y ) a pair of random variables with values in R p R we assume that E(Y X) = β 0 + with β R p+1. p X j β j = (1, X T )β j=1 This model of the conditional expectation is linear

More information

On dealing with spatially correlated residuals in remote sensing and GIS

On dealing with spatially correlated residuals in remote sensing and GIS On dealing with spatially correlated residuals in remote sensing and GIS Nicholas A. S. Hamm 1, Peter M. Atkinson and Edward J. Milton 3 School of Geography University of Southampton Southampton SO17 3AT

More information