Should Non-Sensitive Attributes be Masked? Data Quality Implications of Data Perturbation in Regression Analysis


Sumitra Mukherjee
Nova Southeastern University

Abstract

Ensuring the security of sensitive data is an increasingly important challenge for information systems managers. A widely used technique to protect sensitive data is to mask the data by adding zero-mean noise. Noise addition affects the quality of data available for legitimate statistical use. This article develops a framework that may be used to analyze the implications of additive noise data masking on data quality when the data is used for regression analysis. The framework is used to investigate whether noise should be added to non-sensitive attributes when only a subset of attributes in the database is considered sensitive, an issue that has not been addressed in the literature. Our analysis indicates that adding noise to all the attributes is preferable to the existing practice of masking only the subset of sensitive attributes.

1. Introduction

Ensuring the security of sensitive data is an increasingly important challenge for information systems managers (see [2] and [11]). Common security measures such as password protection and data encryption often fail to provide adequate protection, because a user may be able to infer sensitive information on the basis of information to which he has authorized access [7]. Recognizing the significance of this problem in statistical databases, researchers have developed several disclosure limitation methods (see [1] and [6] for surveys of security methods). A widely used method for disclosure limitation is data perturbation, under which sensitive data are masked by adding zero-mean noise. Noise addition, while reducing the risk of disclosure, has a negative effect on the quality of data available for legitimate statistical use. Information systems managers thus face the conflicting mandates of maintaining data quality while ensuring security.
Researchers have mainly focused on the security aspects of noise addition methods (see [3] and [10] for good reviews of the literature). While there is a growing body of research on data quality that addresses issues such as accuracy, consistency, timeliness, and completeness (see, e.g., [12]), the subject has not received much attention in the data security literature. Accuracy and consistency issues in masked data have been addressed by Matloff [9], Kim [8], and Tendick [13]. Matloff investigated univariate distributions, while the latter studies considered regression analysis. We focus on regression models since they are widely used in statistical analysis. Kim's work used the bias in the regression coefficient estimates as a measure of data quality, while Tendick's study was based on squared correlation coefficients. In each case the noise addition method was found to be optimal when the covariance matrix of the noise vector is proportional to the covariance matrix of the attributes to which the noise is added. Both studies assumed that all the attributes in the database were sensitive and hence had to be masked. Typically, however, only a few attributes in a database are considered sensitive, and the literature has not addressed the data quality implications of masking non-sensitive attributes.

In this article a general framework is developed to analyze the effect of noise addition on data quality when only a subset of attributes in the database is considered sensitive. The framework is used to investigate whether non-sensitive data should be masked. The utility of this approach is further demonstrated by comparing the method proposed by Kim [8] and Tendick [13] to the more common implementation of adding independent noise. In the case of independent noise our results are related to the extensive literature developed by statisticians and econometricians on measurement error models (see [5] and Fuller [4] for reviews of the literature).
In section 2 the mathematical framework used for the analysis is developed. Measures of data quality are derived in section 3. The framework is used to compare alternative additive noise data masking schemes in section 4. Section 5 provides an example to demonstrate the results, and section 6 concludes with discussions.

2. The Model

Since regression analysis is a widely used statistical tool, we use the classical normal linear regression (CNLR) model for the investigation of data quality issues. The model may be represented as:

Y = β′X + e (1)

where Y is the attribute representing the dependent variable, X is a (K×1) vector of attributes representing the regressors, β is a (K×1) vector of regression coefficients to be estimated, and e is the disturbance term. The assumptions of normality and independence in the CNLR model imply that:

(X′, Y)′ ~ N( (µ′, β′µ)′, [Σ, Σβ; β′Σ, β′Σβ + σ²] )

where µ is the (K×1) vector of means for X, Σ is the (K×K) covariance matrix for X, and σ² is the variance of the disturbance term e. The masked data may be represented as

M = X + U

where M is a (K×1) vector of masked attributes and U is a (K×1) vector of normally distributed zero-mean additive noise with covariance matrix Ω. Assume further, as is the case under all implementations of noise addition methods, that X and U are statistically independent. It follows that:

(M′, Y)′ ~ N( (µ′, β′µ)′, [Σ + Ω, Σβ; β′Σ, β′Σβ + σ²] ) (2)

Noise addition schemes may be characterized by the covariance matrix Ω of the additive noise vector. For instance, the methods proposed by Kim [8] and Tendick [13] require that

Ω = αΣ (3)

where α is a proportionality constant. Under an implementation where the noise components are independent of each other, Ω is a diagonal matrix. For obvious reasons, it is common practice to select the variances of the noise components to be proportional to the variances of the corresponding attributes. In this case:

Ω = Diagonal(αΣ) (4)

Under an implementation where only the subset of sensitive attributes is masked, the noise addition scheme may be characterized by a partitioned matrix.
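As a concrete illustration of the masking model M = X + U, the following sketch (Python/NumPy, with a hypothetical 3-attribute covariance matrix — these values do not come from the paper) generates masked data under scheme (3) and checks that the covariance of the released data is Σ + Ω = (1+α)Σ:

```python
import numpy as np

# Illustrative 3-attribute covariance matrix (hypothetical values).
Sigma = np.array([[4.0, 1.2, 0.8],
                  [1.2, 2.0, 0.5],
                  [0.8, 0.5, 1.0]])
mu = np.zeros(3)
alpha = 0.25

# Scheme (3): noise covariance proportional to Sigma, so Cov(M) = (1 + alpha) * Sigma.
Omega = alpha * Sigma

rng = np.random.default_rng(42)
n = 200_000
X = rng.multivariate_normal(mu, Sigma, size=n)   # true attributes
U = rng.multivariate_normal(mu, Omega, size=n)   # zero-mean masking noise, independent of X
M = X + U                                        # masked (released) data

# The sample covariance of the masked data approaches Sigma + Omega = 1.25 * Sigma.
print(np.round(np.cov(M, rowvar=False), 2))
```

The sample covariance of M converges to Σ + Ω, which is why the covariance structure of the noise fully characterizes a masking scheme in this framework.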
Without loss of generality, when only the first p of the K attributes are masked, the covariance matrix has the structure:

Ω = [Ω₁₁, 0; 0, 0]

where Ω₁₁ is a (p×p) sub-matrix. Under this approach Tendick's method would require that

Ω₁₁ = αΣ₁₁ (5)

and an implementation using independent noise components requires that

Ω₁₁ = α Diagonal(Σ₁₁) (6)

where Σ₁₁ is the corresponding (p×p) sub-matrix obtained by partitioning Σ. While in this article we investigate noise addition methods characterized by the covariance matrices specified by (3), (4), (5), and (6), the framework is general enough to accommodate any alternate implementation. Based on this structure, we specify measures for data quality in the context of regression analysis.

3. Measures of Data Quality

The expected value of the regression coefficients when Y is regressed on the true values of the attributes X is given by β. When masked attributes are used, it follows from (2) that:

E[Y|M] = β′[I − Σ(Σ+Ω)⁻¹]µ + β′Σ(Σ+Ω)⁻¹M (7)

where I is the (K×K) identity matrix. Hence, the expected value of the coefficients obtained by regressing Y on M is:

b = (Σ+Ω)⁻¹Σβ (8)

and the bias in the regression coefficients is:

B = β − b = [I − (Σ+Ω)⁻¹Σ]β (9)

Following the work of Kim [8], we use the bias B as a measure of data quality. This choice is further supported by the extensive literature on errors in variables (see, e.g., [5]). An alternate measure of data quality may be based on the squared correlation coefficient R². In a regression, R² represents the proportion of the variance in the outcome variable that is explained by the explanatory variables. Tendick's work [13] on optimal noise addition methods motivates the choice of this measure. Within our framework we take the ratio of the R² obtained when Y is regressed on the masked attributes M to that obtained by regressing Y on the true values X. The lower the ratio, the worse the quality of the masked data.
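The partitioned noise covariance matrices in (5) and (6) can be constructed directly. A minimal sketch, again with a hypothetical Σ and assuming the first p attributes are the sensitive ones:

```python
import numpy as np

Sigma = np.array([[4.0, 1.2, 0.8],
                  [1.2, 2.0, 0.5],
                  [0.8, 0.5, 1.0]])   # hypothetical attribute covariance
alpha = 0.25
p = 2   # suppose only the first p attributes are sensitive

K = Sigma.shape[0]

# Scheme (5): Omega_11 = alpha * Sigma_11, zeros everywhere else.
Omega_prop = np.zeros((K, K))
Omega_prop[:p, :p] = alpha * Sigma[:p, :p]

# Scheme (6): independent noise, Omega_11 = alpha * Diagonal(Sigma_11).
Omega_ind = np.zeros((K, K))
Omega_ind[:p, :p] = np.diag(alpha * np.diag(Sigma)[:p])

print(Omega_prop)
print(Omega_ind)
```

Either matrix can then be plugged into equations (7)–(9) in place of Ω.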
Using (1) and (2) it can be shown that this ratio is:

γ = R²_M / R²_X = β′Σ(Σ+Ω)⁻¹Σβ / (β′Σβ) (10)

Under alternate noise addition schemes, the bias B and the ratio γ of squared correlation coefficients may be computed to evaluate the impact of the masking method on data quality when the data is used for regression analysis. The relative merits of these candidate measures will also be discussed.

4. Comparison of Alternate Additive Noise Data Masking Schemes

In this section we illustrate how the framework may be used to compare implementations of data perturbation where all attributes are masked with those where only the sensitive attributes are masked. In the latter case we consider both the addition of noise with covariance matrix proportional to the covariance matrix of the attributes and implementations where the noise components are statistically independent.

Case I

Under this implementation noise is added not only to the sensitive attributes but also to the non-sensitive attributes. The covariance matrix of the noise vector is proportional to the covariance matrix of the attribute vector, as specified in (3).

The issue of bias. Substituting (3) in (8), we find the expected value of the regression coefficients:

b_i = β_i/(1+α) for i = 1, ..., K (11)

and hence the bias:

B_i = αβ_i/(1+α) for i = 1, ..., K (12)

This implies that under such a scheme all the regression coefficients are attenuated by the same factor 1/(1+α). For instance, when the variances of the additive noise components are 25% of the variances of the corresponding attributes (as may be the case in a practical data masking scheme), the attenuation in the regression coefficients is 20%. More significantly, if the proportionality factor α is published, the necessary adjustments can be made to obtain unbiased estimates for all coefficients.

The issue of reduction in R². The ratio γ of the squared correlation coefficients may be computed by substituting (3) in (10).
This measure of data quality is given by:

γ = 1/(1+α) (13)

The result implies that for additive noise with variance equal to 25% of the variance of the attributes, the R² for the regression decreases by 20% when the masked values rather than the true values are used as regressors. This reduction in R² has implications for the confidence that the user has in estimates obtained using masked data.

Case II

Under this implementation noise is added only to the subset of sensitive attributes. The covariance matrix of the noise vector is proportional to the covariance matrix of the corresponding sensitive attributes.

The issue of bias. Substituting (5) in (8), the expected value of the coefficients has the form:

b = [P₁₁, 0; P₂₁, I] β (14)

where I is an identity matrix and P₁₁ and P₂₁ are (p×p) and ((K−p)×p) matrices with elements that are functions of α and Σ. The exact expressions are complex (interested readers may obtain them by inverting the partitioned matrix using the Binomial Inverse Theorem (Woodbury 1950)), but they provide some insight into the nature of the biases introduced. The coefficients of the sensitive attributes are biased but not affected by the coefficients of the unperturbed attributes. However, the coefficients of the unperturbed attributes are influenced by those of the sensitive attributes. More significantly, the magnitude and sign of the bias in the coefficient estimates do not follow any easily discernible pattern. Some of the coefficients may be attenuated while others are amplified. In general, the signs of the coefficients may even be reversed, depending on the correlations between the attributes. Hence, unlike in the case where noise is added to all attributes, adding noise to only the sensitive attributes prevents the legitimate user from adopting convenient procedures to compensate for the biases introduced by noise addition. The estimates provided may be quite inaccurate and misleading, especially as the levels of noise increase.
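Equation (8) makes the contrast between the two cases easy to compute directly. The sketch below (hypothetical Σ and β, with α = 0.25) shows the uniform attenuation of Case I alongside the irregular biases of Case II, including the bias that spills into an unmasked coefficient:

```python
import numpy as np

Sigma = np.array([[4.0, 1.2, 0.8],
                  [1.2, 2.0, 0.5],
                  [0.8, 0.5, 1.0]])   # hypothetical attribute covariance
beta = np.array([1.0, 1.0, 1.0])      # hypothetical true coefficients
alpha, p = 0.25, 2

def expected_coefficients(Sigma, Omega, beta):
    # Equation (8): b = (Sigma + Omega)^{-1} Sigma beta.
    return np.linalg.solve(Sigma + Omega, Sigma @ beta)

# Case I: mask everything with Omega = alpha * Sigma -> uniform attenuation 1/(1+alpha).
b_full = expected_coefficients(Sigma, alpha * Sigma, beta)

# Case II: mask only the first p attributes -> attribute-dependent biases.
Omega_part = np.zeros_like(Sigma)
Omega_part[:p, :p] = alpha * Sigma[:p, :p]
b_part = expected_coefficients(Sigma, Omega_part, beta)

print(b_full)   # every coefficient equals 1/(1+0.25) = 0.8 times its true value
print(b_part)   # biases differ across coefficients, including the unmasked one
```

For this particular Σ the coefficient of the unmasked third attribute is amplified rather than attenuated, illustrating the irregular spillover described above.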
The issue of reduction in R². The ratio γ of the R² for perturbed and unperturbed data may be obtained by substituting (5) in (10). This measure of data quality has the form:

γ = β′ [Q₁₁, Σ₁₂; Σ₂₁, Σ₂₂] β / (β′Σβ) (15)

The matrices in the numerator and denominator differ only in the elements of the (p×p) sub-matrix corresponding to the sensitive attributes. The reduction in R² is hence typically small, especially when only a small subset of the attributes is masked. Further, it can be shown that the ratio is greater than 1/(1+α) (the proof is beyond the scope of this article, but a special case is established in equations (18) and (19)). Thus, based on this measure alone, adding noise to only the subset of sensitive attributes seems more attractive than adding noise to all attributes. This can be intuitively explained by the fact that the percentage of the variation in the outcome variable explained by the regressors is higher when fewer regressors are disturbed. However, concerns about biases in the coefficient estimates may far outweigh the lower reduction in R², and this measure may not be an appropriate candidate for evaluating data quality.

Case III

When independent noise is added, but only the sensitive attributes are masked, the covariance matrix of the noise vector is specified by (6). The expressions for the bias and the reduction in R² are complex and are treated exhaustively in the literature on errors in variables (see [4]). Rather than reproducing the results, we draw on this body of knowledge to investigate the data quality implications of data perturbation. To keep the analysis simple we consider the case when only one (i = 1) of the K attributes is sensitive. Notice that this implementation is equivalent to the previous case with p = 1.

The issue of bias. Based on results from [5] it can be shown that:

b₁ = β₁/(1 + α/(1 − R₁²)) (16)

b_i = β_i + αδ_iβ₁/(1 + α − R₁²) for i = 2, ..., K (17)

where R₁² is the squared correlation coefficient obtained by regressing the sensitive attribute X₁ on the K−1 non-sensitive attributes, and δ_i is the regression coefficient for X_i when X₁ is regressed on the other K−1 non-sensitive attributes. Comparing these results to those obtained in Case I, the following observations can be made:

a) For the sensitive attribute, the regression coefficient is attenuated. Moreover, the attenuation is greater than in the previous case, when noise is added to all K attributes.
This is obvious from (11) and (16), since the squared correlation coefficient R₁² is less than 1.

b) The bias in the coefficient of a non-sensitive (and hence unmasked) attribute is proportional to the bias in the coefficient of the sensitive attribute: from (9), (16), and (17), B_i = −δ_iB₁ for i = 2, ..., K. Note that the coefficients of the non-sensitive attributes are not necessarily attenuated; the bias may be positive or negative.

c) The absolute value of the bias in the coefficient of a non-sensitive attribute is smaller here than in Case I iff R₁² < (1+α)(1 − |δ_iβ₁/β_i|). Hence, it is not necessarily true that adding noise to non-sensitive attributes increases the bias of the regression coefficient estimates.

Results based on models with more than one disturbed regressor provide similar insights. The biases in the coefficients are not necessarily increased when noise is added to all the attributes. Hence, if the bias in the regression coefficients is taken as the measure of data quality, then adding independent noise provides no advantage over the methods proposed by Kim [8] and Tendick [13].

The issue of reduction in R². In investigating the effect on R², it can be shown that the matrix Σ(Σ+Ω)⁻¹Σ differs from the covariance matrix Σ only in its first-row, first-column element. Hence the ratio of the squared correlation coefficients may be written in terms of this difference c as:

γ = 1 − αβ₁²c/(β′Σβ) (18)

with

c < (β′Σβ)/(β₁²(1+α)) (19)

Comparing γ when all attributes are masked (13) to the case when only the first attribute is masked ((18) and (19)), we find that the ratio of squared correlation coefficients is always greater in the latter case. This is because the reduction in the percentage of variation in the outcome variable explained by the regressors is smaller when only a single regressor is disturbed. Notice that the results are consistent with those obtained for Case II.
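The closed forms (16) and (17) can be checked numerically against the general expression b = (Σ+Ω)⁻¹Σβ from (8). A sketch with hypothetical values (R₁² and δ are obtained from the population regression of X₁ on the remaining attributes):

```python
import numpy as np

Sigma = np.array([[4.0, 1.2, 0.8],
                  [1.2, 2.0, 0.5],
                  [0.8, 0.5, 1.0]])   # hypothetical attribute covariance
beta = np.array([1.0, 0.5, -0.5])     # hypothetical true coefficients
alpha = 0.25

# Independent noise on the single sensitive attribute X_1:
# Omega = diag(alpha * sigma_1^2, 0, ..., 0).
Omega = np.zeros_like(Sigma)
Omega[0, 0] = alpha * Sigma[0, 0]
b = np.linalg.solve(Sigma + Omega, Sigma @ beta)   # equation (8)

# R_1^2 and delta from the regression of X_1 on X_2, ..., X_K.
delta = np.linalg.solve(Sigma[1:, 1:], Sigma[1:, 0])
R2_1 = (Sigma[0, 1:] @ delta) / Sigma[0, 0]

b1_closed = beta[0] / (1 + alpha / (1 - R2_1))                       # (16)
bi_closed = beta[1:] + alpha * delta * beta[0] / (1 + alpha - R2_1)  # (17)

print(b)
print(b1_closed, bi_closed)
```

The two routes agree exactly, which also makes the proportionality B_i ∝ δ_iB₁ easy to verify for any chosen Σ.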
In view of the fact that the coefficients can be estimated with much greater accuracy under Kim's method, the higher R² obtained by masking only the sensitive attribute with independent noise may not be sufficient justification for that implementation. We now use an example to illustrate the results obtained in this section. In addition to the three cases analyzed, we also consider an implementation in which independent noise is added to all the attributes in the database.

5. An Example

Consider a database with eight attributes, of which only the first three are considered sensitive. Assume that the attributes are multinormally distributed with a covariance matrix Σ given by:

[8×8 covariance matrix Σ for attributes x1, ..., x8 not reproduced here.]

Case I. Consider the case when noise is added to all attributes with the covariance of the noise vector Ω = αΣ. For a choice of α = 0.25, the expected values of the coefficients estimated using masked data are b_i = 0.8β_i for i = 1, ..., 8, and the ratio of squared correlation coefficients is γ = 0.8. These are consistent with the theoretical results obtained in the previous section. Users may compensate for the systematic bias if the value of the parameter α is known.

Case II. When noise is added to only the three sensitive attributes, the covariance matrix has the structure specified by (5). The expected values of the coefficients are given by (Σ+Ω)⁻¹Σβ, and for α = 0.25 the matrix (Σ+Ω)⁻¹Σ can be computed [matrix not reproduced here]. To get a sense of the magnitude and sign of the biases, consider the case where the true regression coefficients are all one (that is, β = [1 1 1 1 1 1 1 1]′). In this case the expected values of the coefficients estimated using the masked data can be computed [values not reproduced here]. Notice that bias is introduced in the coefficient estimates for both the masked and the unperturbed attributes. An estimated coefficient can be more than 17 times the true value of the coefficient. The signs of some coefficients are reversed (as is the case with b₁ and b₅). In general, the biases do not follow a pattern and are determined by the covariance matrix Σ. The ratio of squared correlation coefficients is given by (10); for this example γ = 519.9/521 ≈ 0.998. This surprisingly high ratio is due to the fact that the elements by which the matrix Σ(Σ+Ω)⁻¹Σ differs from Σ are small compared to the elements they have in common. However, as the biases indicate, this ratio may be a poor measure of data quality.
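The simple correction available in Case I can be demonstrated end to end: a user who knows the published α regresses Y on the masked data and multiplies the estimates by (1+α). A simulation sketch with a hypothetical 3-attribute Σ, hypothetical β, and unit disturbance variance (none of these values come from the paper's example):

```python
import numpy as np

# Hypothetical setup: 3 regressors, all masked with Omega = alpha * Sigma (Case I).
Sigma = np.array([[4.0, 1.2, 0.8],
                  [1.2, 2.0, 0.5],
                  [0.8, 0.5, 1.0]])
beta = np.array([1.0, -0.5, 2.0])
alpha = 0.25

rng = np.random.default_rng(7)
n = 200_000
X = rng.multivariate_normal(np.zeros(3), Sigma, size=n)
Y = X @ beta + rng.normal(0.0, 1.0, size=n)                      # CNLR model (1)
M = X + rng.multivariate_normal(np.zeros(3), alpha * Sigma, size=n)

b_masked, *_ = np.linalg.lstsq(M, Y, rcond=None)   # attenuated by 1/(1+alpha)
b_corrected = (1 + alpha) * b_masked               # correction when alpha is published

print(np.round(b_masked, 3))     # roughly 0.8 * beta
print(np.round(b_corrected, 3))  # roughly beta
```

No such uniform rescaling exists for Cases II and III, which is the practical cost of masking only the sensitive subset.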
Case III. Now assume that noise is added to the sensitive attributes only and that the noise components are independent of each other, as characterized in (6). For α = 0.25 the matrix (Σ+Ω)⁻¹Σ can be computed [matrix not reproduced here]. Again, when the true regression coefficients are given by β = [1 1 1 1 1 1 1 1]′, the expected values of the coefficients estimated using the masked data can be computed [values not reproduced here]. Notice that there is considerable variation in the magnitude and sign of the bias; the signs of some coefficients are reversed. Comparing the coefficient estimates to those obtained in Case II, we find that the magnitude and sign of the biases are quite similar for each coefficient regardless of whether the noise components for the masked attributes are independent or follow the same covariance pattern as the vector of attributes. The ratio of squared correlation coefficients (10) can be computed for this example using the matrix Σ(Σ+Ω)⁻¹Σ; with β = [1 1 1 1 1 1 1 1]′ it is γ = 514.4/521 ≈ 0.987. Again, the high ratio is a result of the fact that only a few regressors are disturbed and should not be taken to indicate high data quality.

Case IV. Finally we consider an implementation in which all attributes are masked using independent noise (4). For α = 0.25 the matrix (Σ+Ω)⁻¹Σ can be computed [matrix not reproduced here]. With the true regression coefficients given by β = [1 1 1 1 1 1 1 1]′, the expected values of the coefficients estimated using the masked data can be computed [values not reproduced here]. Notice that the signs of the coefficients are all consistent. Moreover, the biases are not very significant (except in the case of the last coefficient) and are of the same order as those obtained using Kim's method. Compared to implementations that mask only the subset of sensitive attributes, adding independent noise to all attributes is much more desirable. The ratio of squared correlation coefficients (10) can be computed using the matrix Σ(Σ+Ω)⁻¹Σ; with β = [1 1 1 1 1 1 1 1]′ it is γ = 468.2/521 ≈ 0.899. This is lower than that obtained when only the sensitive attributes are masked, but higher than that obtained with Kim's method. The only significant disadvantage of this method is that the procedure to compensate for the biases is not as simple as in the first case. This example reinforces the analytical results obtained in the last section; the significant results are summarized in the discussion that follows.

6. Discussion

This article develops a framework that may be used to analyze the implications of additive noise data masking on data quality when the data is used for regression analysis.
Two measures of data quality that have been proposed in the literature are considered: the bias in the regression coefficient estimates and the proportion of the variation (R²) in the outcome variable that is explained by the regressors. The framework is used to investigate whether noise should be added to non-sensitive attributes when only a subset of attributes in the database is considered sensitive. Two common implementations of noise addition schemes are considered. The first adds noise with the covariance matrix of the noise vector proportional to the covariance matrix of the attribute vector; under the second scheme the noise components are statistically independent of each other. Based on our analysis the following observations can be made:

1. Of the two measures of data quality considered, the bias in the coefficient estimates is the more appropriate one. This is because the reduction in R² is strongly influenced by the proportion of attributes masked and may be misleading about the accuracy of the statistical estimates.

2. Adding noise to all the attributes is preferable to masking only the subset of sensitive attributes. The estimated coefficients are much more accurate in the former case. This is true regardless of the covariance structure of the noise vector.

3. Using noise with a covariance matrix proportional to the covariance matrix of the attributes (as proposed by Kim [8] and Tendick [13]) is preferable to adding statistically independent noise. Preserving the covariance structure of the data allows simple procedures to compensate for the bias introduced by noise addition.

Our results provide practical guidance to database administrators and data users. Database administrators have the conflicting mandates of protecting sensitive data while ensuring access to good quality data for legitimate statistical analysis. The literature has not considered the typical case when some of the attributes in the database are non-sensitive. In our investigation of this issue we focus on data quality, since security is not a concern for non-sensitive data. Our analysis indicates that both sensitive and non-sensitive data should be masked using the method proposed by Kim [8] and Tendick [13]. Under this implementation we specify a simple procedure for users to correct the estimates for bias. The only major drawback of this approach is that it adversely affects users interested in univariate statistics of non-sensitive attributes. Our analysis is restricted to the classical normal linear regression model; generalizations of our results pose significant and interesting research challenges. We relate our results to the literature on errors-in-variables models, since this extensive body of knowledge may prove useful in identifying and answering questions on data security and data quality.

References

[1] Adam, N.R. and Wortmann, J.C., "Security-Control Methods for Statistical Databases: A Comparative Study," ACM Computing Surveys, 1989, 21, 4.

[2] Duncan, G.T. and Pearson, R.W., "Enhancing Access to Data while Protecting Confidentiality: Prospects for the Future," Statistical Science, 1991, 6, 3.

[3] Fuller, W.A., "Masking Procedures for Microdata Disclosure Limitation," Journal of Official Statistics, 1993, 9, 2.

[4] Fuller, W.A., Measurement Error Models, New York: John Wiley.

[5] Garber, S. and Klepper, S., "Extending the Classical Normal Errors-in-Variables Model," Econometrica, 1980, 48, 6.

[6] Jabine, T.B., "Statistical Disclosure Limitation Practices of the United States Statistical Agencies," Journal of Official Statistics, 1993, 9, 2.

[7] Landwehr, C.E. (editor), Data Security: Status and Prospects, New York: North Holland.

[8] Kim, J., "A Method for Limiting Disclosure in Microdata Based on Random Noise and Transformation," ASA Proceedings of the Survey Research Method Section, 1986.

[9] Matloff, N.S., "Another Look at Noise Addition for Database Security," Proceedings of the IEEE Computer Society Symposium on Research in Security and Privacy, 1986.

[10] Muralidhar, K. and Batra, D., "Accessibility, Security, and Accuracy in Statistical Databases: The Case for the Multiplicative Fixed Data Perturbation Approach," Management Science, 1995, 41, 9.

[11] Straub, D.W. and Collins, R.W., "Key Information Liability Issues Facing Managers: Software Piracy, Proprietary Databases, and Individual Rights to Privacy," MIS Quarterly, 1990, 14.

[12] Strong, D.M., Lee, Y.W., and Wang, R.Y., "Data Quality in Context," Communications of the ACM, 1997, 40, 5.

[13] Tendick, P., "Optimal Noise Addition for the Preservation of Confidentiality in Multivariate Data," Journal of Statistical Planning and Inference, 1991, 27.


Pooling multiple imputations when the sample happens to be the population.

Pooling multiple imputations when the sample happens to be the population. Pooling multiple imputations when the sample happens to be the population. Gerko Vink 1,2, and Stef van Buuren 1,3 arxiv:1409.8542v1 [math.st] 30 Sep 2014 1 Department of Methodology and Statistics, Utrecht

More information

Very simple marginal effects in some discrete choice models *

Very simple marginal effects in some discrete choice models * UCD GEARY INSTITUTE DISCUSSION PAPER SERIES Very simple marginal effects in some discrete choice models * Kevin J. Denny School of Economics & Geary Institute University College Dublin 2 st July 2009 *

More information

Least Squares Estimation-Finite-Sample Properties

Least Squares Estimation-Finite-Sample Properties Least Squares Estimation-Finite-Sample Properties Ping Yu School of Economics and Finance The University of Hong Kong Ping Yu (HKU) Finite-Sample 1 / 29 Terminology and Assumptions 1 Terminology and Assumptions

More information

Bootstrap Approach to Comparison of Alternative Methods of Parameter Estimation of a Simultaneous Equation Model

Bootstrap Approach to Comparison of Alternative Methods of Parameter Estimation of a Simultaneous Equation Model Bootstrap Approach to Comparison of Alternative Methods of Parameter Estimation of a Simultaneous Equation Model Olubusoye, O. E., J. O. Olaomi, and O. O. Odetunde Abstract A bootstrap simulation approach

More information

Why is the field of statistics still an active one?

Why is the field of statistics still an active one? Why is the field of statistics still an active one? It s obvious that one needs statistics: to describe experimental data in a compact way, to compare datasets, to ask whether data are consistent with

More information

Heteroskedasticity-Consistent Covariance Matrix Estimators in Small Samples with High Leverage Points

Heteroskedasticity-Consistent Covariance Matrix Estimators in Small Samples with High Leverage Points Theoretical Economics Letters, 2016, 6, 658-677 Published Online August 2016 in SciRes. http://www.scirp.org/journal/tel http://dx.doi.org/10.4236/tel.2016.64071 Heteroskedasticity-Consistent Covariance

More information

Differentially Private Linear Regression

Differentially Private Linear Regression Differentially Private Linear Regression Christian Baehr August 5, 2017 Your abstract. Abstract 1 Introduction My research involved testing and implementing tools into the Harvard Privacy Tools Project

More information

Quantitative Analysis of Financial Markets. Summary of Part II. Key Concepts & Formulas. Christopher Ting. November 11, 2017

Quantitative Analysis of Financial Markets. Summary of Part II. Key Concepts & Formulas. Christopher Ting. November 11, 2017 Summary of Part II Key Concepts & Formulas Christopher Ting November 11, 2017 christopherting@smu.edu.sg http://www.mysmu.edu/faculty/christophert/ Christopher Ting 1 of 16 Why Regression Analysis? Understand

More information

A Note on Bayesian Inference After Multiple Imputation

A Note on Bayesian Inference After Multiple Imputation A Note on Bayesian Inference After Multiple Imputation Xiang Zhou and Jerome P. Reiter Abstract This article is aimed at practitioners who plan to use Bayesian inference on multiplyimputed datasets in

More information

SOME BASICS OF TIME-SERIES ANALYSIS

SOME BASICS OF TIME-SERIES ANALYSIS SOME BASICS OF TIME-SERIES ANALYSIS John E. Floyd University of Toronto December 8, 26 An excellent place to learn about time series analysis is from Walter Enders textbook. For a basic understanding of

More information

6.435, System Identification

6.435, System Identification System Identification 6.435 SET 3 Nonparametric Identification Munther A. Dahleh 1 Nonparametric Methods for System ID Time domain methods Impulse response Step response Correlation analysis / time Frequency

More information

Bootstrap Simulation Procedure Applied to the Selection of the Multiple Linear Regressions

Bootstrap Simulation Procedure Applied to the Selection of the Multiple Linear Regressions JKAU: Sci., Vol. 21 No. 2, pp: 197-212 (2009 A.D. / 1430 A.H.); DOI: 10.4197 / Sci. 21-2.2 Bootstrap Simulation Procedure Applied to the Selection of the Multiple Linear Regressions Ali Hussein Al-Marshadi

More information

6. Assessing studies based on multiple regression

6. Assessing studies based on multiple regression 6. Assessing studies based on multiple regression Questions of this section: What makes a study using multiple regression (un)reliable? When does multiple regression provide a useful estimate of the causal

More information

Specification Errors, Measurement Errors, Confounding

Specification Errors, Measurement Errors, Confounding Specification Errors, Measurement Errors, Confounding Kerby Shedden Department of Statistics, University of Michigan October 10, 2018 1 / 32 An unobserved covariate Suppose we have a data generating model

More information

Wooldridge, Introductory Econometrics, 4th ed. Chapter 15: Instrumental variables and two stage least squares

Wooldridge, Introductory Econometrics, 4th ed. Chapter 15: Instrumental variables and two stage least squares Wooldridge, Introductory Econometrics, 4th ed. Chapter 15: Instrumental variables and two stage least squares Many economic models involve endogeneity: that is, a theoretical relationship does not fit

More information

Outline. Possible Reasons. Nature of Heteroscedasticity. Basic Econometrics in Transportation. Heteroscedasticity

Outline. Possible Reasons. Nature of Heteroscedasticity. Basic Econometrics in Transportation. Heteroscedasticity 1/25 Outline Basic Econometrics in Transportation Heteroscedasticity What is the nature of heteroscedasticity? What are its consequences? How does one detect it? What are the remedial measures? Amir Samimi

More information

1. The General Linear-Quadratic Framework

1. The General Linear-Quadratic Framework ECO 317 Economics of Uncertainty Fall Term 2009 Slides to accompany 21. Incentives for Effort - Multi-Dimensional Cases 1. The General Linear-Quadratic Framework Notation: x = (x j ), n-vector of agent

More information

Selection Criteria Based on Monte Carlo Simulation and Cross Validation in Mixed Models

Selection Criteria Based on Monte Carlo Simulation and Cross Validation in Mixed Models Selection Criteria Based on Monte Carlo Simulation and Cross Validation in Mixed Models Junfeng Shang Bowling Green State University, USA Abstract In the mixed modeling framework, Monte Carlo simulation

More information

Part 6: Multivariate Normal and Linear Models

Part 6: Multivariate Normal and Linear Models Part 6: Multivariate Normal and Linear Models 1 Multiple measurements Up until now all of our statistical models have been univariate models models for a single measurement on each member of a sample of

More information

Robustness of Logit Analysis: Unobserved Heterogeneity and Misspecified Disturbances

Robustness of Logit Analysis: Unobserved Heterogeneity and Misspecified Disturbances Discussion Paper: 2006/07 Robustness of Logit Analysis: Unobserved Heterogeneity and Misspecified Disturbances J.S. Cramer www.fee.uva.nl/ke/uva-econometrics Amsterdam School of Economics Department of

More information

statistical sense, from the distributions of the xs. The model may now be generalized to the case of k regressors:

statistical sense, from the distributions of the xs. The model may now be generalized to the case of k regressors: Wooldridge, Introductory Econometrics, d ed. Chapter 3: Multiple regression analysis: Estimation In multiple regression analysis, we extend the simple (two-variable) regression model to consider the possibility

More information

6 Pattern Mixture Models

6 Pattern Mixture Models 6 Pattern Mixture Models A common theme underlying the methods we have discussed so far is that interest focuses on making inference on parameters in a parametric or semiparametric model for the full data

More information

Interpreting Regression Results

Interpreting Regression Results Interpreting Regression Results Carlo Favero Favero () Interpreting Regression Results 1 / 42 Interpreting Regression Results Interpreting regression results is not a simple exercise. We propose to split

More information

Inference about Clustering and Parametric. Assumptions in Covariance Matrix Estimation

Inference about Clustering and Parametric. Assumptions in Covariance Matrix Estimation Inference about Clustering and Parametric Assumptions in Covariance Matrix Estimation Mikko Packalen y Tony Wirjanto z 26 November 2010 Abstract Selecting an estimator for the variance covariance matrix

More information

SIMILAR-ON-THE-BOUNDARY TESTS FOR MOMENT INEQUALITIES EXIST, BUT HAVE POOR POWER. Donald W. K. Andrews. August 2011

SIMILAR-ON-THE-BOUNDARY TESTS FOR MOMENT INEQUALITIES EXIST, BUT HAVE POOR POWER. Donald W. K. Andrews. August 2011 SIMILAR-ON-THE-BOUNDARY TESTS FOR MOMENT INEQUALITIES EXIST, BUT HAVE POOR POWER By Donald W. K. Andrews August 2011 COWLES FOUNDATION DISCUSSION PAPER NO. 1815 COWLES FOUNDATION FOR RESEARCH IN ECONOMICS

More information

The LIML Estimator Has Finite Moments! T. W. Anderson. Department of Economics and Department of Statistics. Stanford University, Stanford, CA 94305

The LIML Estimator Has Finite Moments! T. W. Anderson. Department of Economics and Department of Statistics. Stanford University, Stanford, CA 94305 The LIML Estimator Has Finite Moments! T. W. Anderson Department of Economics and Department of Statistics Stanford University, Stanford, CA 9435 March 25, 2 Abstract The Limited Information Maximum Likelihood

More information

Bayesian Interpretations of Heteroskedastic Consistent Covariance Estimators Using the Informed Bayesian Bootstrap

Bayesian Interpretations of Heteroskedastic Consistent Covariance Estimators Using the Informed Bayesian Bootstrap Bayesian Interpretations of Heteroskedastic Consistent Covariance Estimators Using the Informed Bayesian Bootstrap Dale J. Poirier University of California, Irvine September 1, 2008 Abstract This paper

More information

Chapter 2 Sampling for Biostatistics

Chapter 2 Sampling for Biostatistics Chapter 2 Sampling for Biostatistics Angela Conley and Jason Pfefferkorn Abstract Define the terms sample, population, and statistic. Introduce the concept of bias in sampling methods. Demonstrate how

More information

Differential Privacy in an RKHS

Differential Privacy in an RKHS Differential Privacy in an RKHS Rob Hall (with Larry Wasserman and Alessandro Rinaldo) 2/20/2012 rjhall@cs.cmu.edu http://www.cs.cmu.edu/~rjhall 1 Overview Why care about privacy? Differential Privacy

More information

POL 681 Lecture Notes: Statistical Interactions

POL 681 Lecture Notes: Statistical Interactions POL 681 Lecture Notes: Statistical Interactions 1 Preliminaries To this point, the linear models we have considered have all been interpreted in terms of additive relationships. That is, the relationship

More information

Lecture Notes on Measurement Error

Lecture Notes on Measurement Error Steve Pischke Spring 2000 Lecture Notes on Measurement Error These notes summarize a variety of simple results on measurement error which I nd useful. They also provide some references where more complete

More information

Testing Homogeneity Of A Large Data Set By Bootstrapping

Testing Homogeneity Of A Large Data Set By Bootstrapping Testing Homogeneity Of A Large Data Set By Bootstrapping 1 Morimune, K and 2 Hoshino, Y 1 Graduate School of Economics, Kyoto University Yoshida Honcho Sakyo Kyoto 606-8501, Japan. E-Mail: morimune@econ.kyoto-u.ac.jp

More information

Estimation of the Conditional Variance in Paired Experiments

Estimation of the Conditional Variance in Paired Experiments Estimation of the Conditional Variance in Paired Experiments Alberto Abadie & Guido W. Imbens Harvard University and BER June 008 Abstract In paired randomized experiments units are grouped in pairs, often

More information

Confidence Estimation Methods for Neural Networks: A Practical Comparison

Confidence Estimation Methods for Neural Networks: A Practical Comparison , 6-8 000, Confidence Estimation Methods for : A Practical Comparison G. Papadopoulos, P.J. Edwards, A.F. Murray Department of Electronics and Electrical Engineering, University of Edinburgh Abstract.

More information

Chapter 8 Heteroskedasticity

Chapter 8 Heteroskedasticity Chapter 8 Walter R. Paczkowski Rutgers University Page 1 Chapter Contents 8.1 The Nature of 8. Detecting 8.3 -Consistent Standard Errors 8.4 Generalized Least Squares: Known Form of Variance 8.5 Generalized

More information

Multinomial Data. f(y θ) θ y i. where θ i is the probability that a given trial results in category i, i = 1,..., k. The parameter space is

Multinomial Data. f(y θ) θ y i. where θ i is the probability that a given trial results in category i, i = 1,..., k. The parameter space is Multinomial Data The multinomial distribution is a generalization of the binomial for the situation in which each trial results in one and only one of several categories, as opposed to just two, as in

More information

Sigmaplot di Systat Software

Sigmaplot di Systat Software Sigmaplot di Systat Software SigmaPlot Has Extensive Statistical Analysis Features SigmaPlot is now bundled with SigmaStat as an easy-to-use package for complete graphing and data analysis. The statistical

More information

PRINCIPAL COMPONENTS ANALYSIS

PRINCIPAL COMPONENTS ANALYSIS 121 CHAPTER 11 PRINCIPAL COMPONENTS ANALYSIS We now have the tools necessary to discuss one of the most important concepts in mathematical statistics: Principal Components Analysis (PCA). PCA involves

More information

Wooldridge, Introductory Econometrics, 3d ed. Chapter 9: More on specification and data problems

Wooldridge, Introductory Econometrics, 3d ed. Chapter 9: More on specification and data problems Wooldridge, Introductory Econometrics, 3d ed. Chapter 9: More on specification and data problems Functional form misspecification We may have a model that is correctly specified, in terms of including

More information

Variable Selection and Model Building

Variable Selection and Model Building LINEAR REGRESSION ANALYSIS MODULE XIII Lecture - 37 Variable Selection and Model Building Dr. Shalabh Department of Mathematics and Statistics Indian Institute of Technology Kanpur The complete regression

More information

Efficient Choice of Biasing Constant. for Ridge Regression

Efficient Choice of Biasing Constant. for Ridge Regression Int. J. Contemp. Math. Sciences, Vol. 3, 008, no., 57-536 Efficient Choice of Biasing Constant for Ridge Regression Sona Mardikyan* and Eyüp Çetin Department of Management Information Systems, School of

More information

Statistical Practice

Statistical Practice Statistical Practice A Note on Bayesian Inference After Multiple Imputation Xiang ZHOU and Jerome P. REITER This article is aimed at practitioners who plan to use Bayesian inference on multiply-imputed

More information

ACE 564 Spring Lecture 8. Violations of Basic Assumptions I: Multicollinearity and Non-Sample Information. by Professor Scott H.

ACE 564 Spring Lecture 8. Violations of Basic Assumptions I: Multicollinearity and Non-Sample Information. by Professor Scott H. ACE 564 Spring 2006 Lecture 8 Violations of Basic Assumptions I: Multicollinearity and Non-Sample Information by Professor Scott H. Irwin Readings: Griffiths, Hill and Judge. "Collinear Economic Variables,

More information

Linear Regression Models

Linear Regression Models Linear Regression Models Model Description and Model Parameters Modelling is a central theme in these notes. The idea is to develop and continuously improve a library of predictive models for hazards,

More information

Econometrics II. Nonstandard Standard Error Issues: A Guide for the. Practitioner

Econometrics II. Nonstandard Standard Error Issues: A Guide for the. Practitioner Econometrics II Nonstandard Standard Error Issues: A Guide for the Practitioner Måns Söderbom 10 May 2011 Department of Economics, University of Gothenburg. Email: mans.soderbom@economics.gu.se. Web: www.economics.gu.se/soderbom,

More information

Working Papers in Econometrics and Applied Statistics

Working Papers in Econometrics and Applied Statistics T h e U n i v e r s i t y o f NEW ENGLAND Working Papers in Econometrics and Applied Statistics Finite Sample Inference in the SUR Model Duangkamon Chotikapanich and William E. Griffiths No. 03 - April

More information

A NONLINEARITY MEASURE FOR ESTIMATION SYSTEMS

A NONLINEARITY MEASURE FOR ESTIMATION SYSTEMS AAS 6-135 A NONLINEARITY MEASURE FOR ESTIMATION SYSTEMS Andrew J. Sinclair,JohnE.Hurtado, and John L. Junkins The concept of nonlinearity measures for dynamical systems is extended to estimation systems,

More information

Section I. Define or explain the following terms (3 points each) 1. centered vs. uncentered 2 R - 2. Frisch theorem -

Section I. Define or explain the following terms (3 points each) 1. centered vs. uncentered 2 R - 2. Frisch theorem - First Exam: Economics 388, Econometrics Spring 006 in R. Butler s class YOUR NAME: Section I (30 points) Questions 1-10 (3 points each) Section II (40 points) Questions 11-15 (10 points each) Section III

More information

0 0'0 2S ~~ Employment category

0 0'0 2S ~~ Employment category Analyze Phase 331 60000 50000 40000 30000 20000 10000 O~----,------.------,------,,------,------.------,----- N = 227 136 27 41 32 5 ' V~ 00 0' 00 00 i-.~ fl' ~G ~~ ~O~ ()0 -S 0 -S ~~ 0 ~~ 0 ~G d> ~0~

More information

Regression M&M 2.3 and 10. Uses Curve fitting Summarization ('model') Description Prediction Explanation Adjustment for 'confounding' variables

Regression M&M 2.3 and 10. Uses Curve fitting Summarization ('model') Description Prediction Explanation Adjustment for 'confounding' variables Uses Curve fitting Summarization ('model') Description Prediction Explanation Adjustment for 'confounding' variables MALES FEMALES Age. Tot. %-ile; weight,g Tot. %-ile; weight,g wk N. 0th 50th 90th No.

More information

Linear Models 1. Isfahan University of Technology Fall Semester, 2014

Linear Models 1. Isfahan University of Technology Fall Semester, 2014 Linear Models 1 Isfahan University of Technology Fall Semester, 2014 References: [1] G. A. F., Seber and A. J. Lee (2003). Linear Regression Analysis (2nd ed.). Hoboken, NJ: Wiley. [2] A. C. Rencher and

More information

,i = 1,2,L, p. For a sample of size n, let the columns of data be

,i = 1,2,L, p. For a sample of size n, let the columns of data be MAC IIci: Miller Asymptotics Chapter 5: Regression Section?: Asymptotic Relationship Between a CC and its Associated Slope Estimates in Multiple Linear Regression The asymptotic null distribution of a

More information

Business Economics BUSINESS ECONOMICS. PAPER No. : 8, FUNDAMENTALS OF ECONOMETRICS MODULE No. : 3, GAUSS MARKOV THEOREM

Business Economics BUSINESS ECONOMICS. PAPER No. : 8, FUNDAMENTALS OF ECONOMETRICS MODULE No. : 3, GAUSS MARKOV THEOREM Subject Business Economics Paper No and Title Module No and Title Module Tag 8, Fundamentals of Econometrics 3, The gauss Markov theorem BSE_P8_M3 1 TABLE OF CONTENTS 1. INTRODUCTION 2. ASSUMPTIONS OF

More information

GLS and FGLS. Econ 671. Purdue University. Justin L. Tobias (Purdue) GLS and FGLS 1 / 22

GLS and FGLS. Econ 671. Purdue University. Justin L. Tobias (Purdue) GLS and FGLS 1 / 22 GLS and FGLS Econ 671 Purdue University Justin L. Tobias (Purdue) GLS and FGLS 1 / 22 In this lecture we continue to discuss properties associated with the GLS estimator. In addition we discuss the practical

More information

Estimation of Production Functions using Average Data

Estimation of Production Functions using Average Data Estimation of Production Functions using Average Data Matthew J. Salois Food and Resource Economics Department, University of Florida PO Box 110240, Gainesville, FL 32611-0240 msalois@ufl.edu Grigorios

More information

LECTURE 2 LINEAR REGRESSION MODEL AND OLS

LECTURE 2 LINEAR REGRESSION MODEL AND OLS SEPTEMBER 29, 2014 LECTURE 2 LINEAR REGRESSION MODEL AND OLS Definitions A common question in econometrics is to study the effect of one group of variables X i, usually called the regressors, on another

More information

Dependence. MFM Practitioner Module: Risk & Asset Allocation. John Dodson. September 11, Dependence. John Dodson. Outline.

Dependence. MFM Practitioner Module: Risk & Asset Allocation. John Dodson. September 11, Dependence. John Dodson. Outline. MFM Practitioner Module: Risk & Asset Allocation September 11, 2013 Before we define dependence, it is useful to define Random variables X and Y are independent iff For all x, y. In particular, F (X,Y

More information

coefficients n 2 are the residuals obtained when we estimate the regression on y equals the (simple regression) estimated effect of the part of x 1

coefficients n 2 are the residuals obtained when we estimate the regression on y equals the (simple regression) estimated effect of the part of x 1 Review - Interpreting the Regression If we estimate: It can be shown that: where ˆ1 r i coefficients β ˆ+ βˆ x+ βˆ ˆ= 0 1 1 2x2 y ˆβ n n 2 1 = rˆ i1yi rˆ i1 i= 1 i= 1 xˆ are the residuals obtained when

More information

Identifying the Monetary Policy Shock Christiano et al. (1999)

Identifying the Monetary Policy Shock Christiano et al. (1999) Identifying the Monetary Policy Shock Christiano et al. (1999) The question we are asking is: What are the consequences of a monetary policy shock a shock which is purely related to monetary conditions

More information

6.867 Machine Learning

6.867 Machine Learning 6.867 Machine Learning Problem set 1 Solutions Thursday, September 19 What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.

More information

A Note on Bootstraps and Robustness. Tony Lancaster, Brown University, December 2003.

A Note on Bootstraps and Robustness. Tony Lancaster, Brown University, December 2003. A Note on Bootstraps and Robustness Tony Lancaster, Brown University, December 2003. In this note we consider several versions of the bootstrap and argue that it is helpful in explaining and thinking about

More information

Hypothesis testing Goodness of fit Multicollinearity Prediction. Applied Statistics. Lecturer: Serena Arima

Hypothesis testing Goodness of fit Multicollinearity Prediction. Applied Statistics. Lecturer: Serena Arima Applied Statistics Lecturer: Serena Arima Hypothesis testing for the linear model Under the Gauss-Markov assumptions and the normality of the error terms, we saw that β N(β, σ 2 (X X ) 1 ) and hence s

More information

Uncertainty due to Finite Resolution Measurements

Uncertainty due to Finite Resolution Measurements Uncertainty due to Finite Resolution Measurements S.D. Phillips, B. Tolman, T.W. Estler National Institute of Standards and Technology Gaithersburg, MD 899 Steven.Phillips@NIST.gov Abstract We investigate

More information

, (1) e i = ˆσ 1 h ii. c 2016, Jeffrey S. Simonoff 1

, (1) e i = ˆσ 1 h ii. c 2016, Jeffrey S. Simonoff 1 Regression diagnostics As is true of all statistical methodologies, linear regression analysis can be a very effective way to model data, as along as the assumptions being made are true. For the regression

More information

Probability and Statistics

Probability and Statistics Probability and Statistics Kristel Van Steen, PhD 2 Montefiore Institute - Systems and Modeling GIGA - Bioinformatics ULg kristel.vansteen@ulg.ac.be CHAPTER 4: IT IS ALL ABOUT DATA 4a - 1 CHAPTER 4: IT

More information

IDENTIFICATION OF TREATMENT EFFECTS WITH SELECTIVE PARTICIPATION IN A RANDOMIZED TRIAL

IDENTIFICATION OF TREATMENT EFFECTS WITH SELECTIVE PARTICIPATION IN A RANDOMIZED TRIAL IDENTIFICATION OF TREATMENT EFFECTS WITH SELECTIVE PARTICIPATION IN A RANDOMIZED TRIAL BRENDAN KLINE AND ELIE TAMER Abstract. Randomized trials (RTs) are used to learn about treatment effects. This paper

More information

Mediation analyses. Advanced Psychometrics Methods in Cognitive Aging Research Workshop. June 6, 2016

Mediation analyses. Advanced Psychometrics Methods in Cognitive Aging Research Workshop. June 6, 2016 Mediation analyses Advanced Psychometrics Methods in Cognitive Aging Research Workshop June 6, 2016 1 / 40 1 2 3 4 5 2 / 40 Goals for today Motivate mediation analysis Survey rapidly developing field in

More information

Poisson regression: Further topics

Poisson regression: Further topics Poisson regression: Further topics April 21 Overdispersion One of the defining characteristics of Poisson regression is its lack of a scale parameter: E(Y ) = Var(Y ), and no parameter is available to

More information

Dependence. Practitioner Course: Portfolio Optimization. John Dodson. September 10, Dependence. John Dodson. Outline.

Dependence. Practitioner Course: Portfolio Optimization. John Dodson. September 10, Dependence. John Dodson. Outline. Practitioner Course: Portfolio Optimization September 10, 2008 Before we define dependence, it is useful to define Random variables X and Y are independent iff For all x, y. In particular, F (X,Y ) (x,

More information

Comparison of Some Improved Estimators for Linear Regression Model under Different Conditions

Comparison of Some Improved Estimators for Linear Regression Model under Different Conditions Florida International University FIU Digital Commons FIU Electronic Theses and Dissertations University Graduate School 3-24-2015 Comparison of Some Improved Estimators for Linear Regression Model under

More information

Conservative variance estimation for sampling designs with zero pairwise inclusion probabilities

Conservative variance estimation for sampling designs with zero pairwise inclusion probabilities Conservative variance estimation for sampling designs with zero pairwise inclusion probabilities Peter M. Aronow and Cyrus Samii Forthcoming at Survey Methodology Abstract We consider conservative variance

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

Covariance to PCA. CS 510 Lecture #14 February 23, 2018

Covariance to PCA. CS 510 Lecture #14 February 23, 2018 Covariance to PCA CS 510 Lecture 14 February 23, 2018 Overview: Goal Assume you have a gallery (database) of images, and a probe (test) image. The goal is to find the database image that is most similar

More information

Approximate analysis of covariance in trials in rare diseases, in particular rare cancers

Approximate analysis of covariance in trials in rare diseases, in particular rare cancers Approximate analysis of covariance in trials in rare diseases, in particular rare cancers Stephen Senn (c) Stephen Senn 1 Acknowledgements This work is partly supported by the European Union s 7th Framework

More information

Measuring Social Influence Without Bias

Measuring Social Influence Without Bias Measuring Social Influence Without Bias Annie Franco Bobbie NJ Macdonald December 9, 2015 The Problem CS224W: Final Paper How well can statistical models disentangle the effects of social influence from

More information

Lack-of-fit Tests to Indicate Material Model Improvement or Experimental Data Noise Reduction

Lack-of-fit Tests to Indicate Material Model Improvement or Experimental Data Noise Reduction Lack-of-fit Tests to Indicate Material Model Improvement or Experimental Data Noise Reduction Charles F. Jekel and Raphael T. Haftka University of Florida, Gainesville, FL, 32611, USA Gerhard Venter and

More information

4.7 Confidence and Prediction Intervals

4.7 Confidence and Prediction Intervals 4.7 Confidence and Prediction Intervals Instead of conducting tests we could find confidence intervals for a regression coefficient, or a set of regression coefficient, or for the mean of the response

More information

Investigation into the use of confidence indicators with calibration

Investigation into the use of confidence indicators with calibration WORKSHOP ON FRONTIERS IN BENCHMARKING TECHNIQUES AND THEIR APPLICATION TO OFFICIAL STATISTICS 7 8 APRIL 2005 Investigation into the use of confidence indicators with calibration Gerard Keogh and Dave Jennings

More information

WHEN IS A MAXIMAL INVARIANT HYPOTHESIS TEST BETTER THAN THE GLRT? Hyung Soo Kim and Alfred O. Hero

WHEN IS A MAXIMAL INVARIANT HYPOTHESIS TEST BETTER THAN THE GLRT? Hyung Soo Kim and Alfred O. Hero WHEN IS A MAXIMAL INVARIANT HYPTHESIS TEST BETTER THAN THE GLRT? Hyung Soo Kim and Alfred. Hero Department of Electrical Engineering and Computer Science University of Michigan, Ann Arbor, MI 489-222 ABSTRACT

More information