A NONPARAMETRIC TEST FOR HOMOGENEITY: APPLICATIONS TO PARAMETER ESTIMATION


Change-point Problems, IMS Lecture Notes - Monograph Series (Volume 23, 1994)

BY K. GHOUDI AND D. MCDONALD

Université Laval and University of Ottawa

Testing for homogeneity has many applications in statistical analysis. For example, regression analysis may be viewed as determining the set of parameters that makes the residuals homogeneous. Assume that for each i = 1, ..., q a small sample of n(i) observations is collected. Let N = Σ_{i=1}^q n(i), let F_i be the empirical distribution function of the i-th sample, and let F̄ be the empirical distribution function of all the samples taken together. The problem is to test whether these samples are homogeneous. Lehmann (1951) considered the problem of testing the equality of the distributions of q samples. He proposed the statistic

ℛ = Σ_{i=1}^q n(i) ∫ (F_i(x) − F̄(x))² dF̄(x).

The asymptotic properties of Cramér-von Mises statistics like the above were studied by Kiefer (1959), who considered the case where n(i) → ∞ while q stays fixed. McDonald (1991) considered the situation where q → ∞ and n(i) stays fixed, for univariate observations and for a more general family of statistics called randomness statistics. For multivariate observations, similar asymptotics are discussed in Ghoudi (1992).

Here we present an application of the above statistics to the estimation of the parameters of regression models with independent additive errors. The main novelty of our approach is the use of blocking: we contrast the empirical distribution of the residuals of observations whose independent variables lie in the same block with the empirical distribution of all the residuals taken together. Our estimated regression surface is the one whose residuals minimize a randomness statistic like Lehmann's. Confidence intervals and a test for the model follow without assumptions on the error distribution.

1. Introduction. Consider a multivariate linear regression model.
Observations of a dependent vector y = (y(1), y(2), ..., y(d)) and independent variables x = (x(1), x(2), ..., x(p)) are indexed by t = 1, ..., N and are governed by the model

y_t = (x_t)'β + ε_t,

with p unknown parameters β = (β_1, β_2, ..., β_p)'.

AMS 1991 Subject Classification: Primary 62G10; Secondary 62J02.
Key words and phrases: Homogeneity tests, nonparametric regression.

We assume throughout that the errors ε_t are independent, identically distributed d-dimensional vectors with unknown distribution F.

In many experimental designs there are replicate experiments for a fixed value of x_t. In this case, group into blocks i = 1, ..., q all the observations having the same independent variables. Index the observations in block i by j = 1, ..., n(i). The n(i) observations in block i satisfy

y_ij = (x_i)'β + ε_ij,

where x_i = (x_i(1), x_i(2), ..., x_i(p))'. Even if there are no replicates, we may simply pave the space of the independent variables with contiguous blocks indexed by i = 1, ..., q. For any choice of β we may calculate the residuals of the dependent vector; denote the j-th residual in the i-th block by u_ij. The true (unknown) value of β, say β₀, makes the residuals i.i.d.; hence the best fit, or best choice of β, is the one that makes these residuals homogeneous. Throughout we consider Lehmann's statistic (but any randomness statistic would do):

ℛ(β) = Σ_{i=1}^q n(i) ∫ (F_i(u) − F̄(u))² dF̄(u),

where F_i is the empirical distribution of the residuals in block i and F̄ is the empirical distribution of all N residuals. In principle the above statistic should be small when the residuals are homogeneous, since the empirical distribution of the residuals of each block would then be close to the empirical distribution of the entire sample. The best estimate of β should then be the value that minimizes the above randomness statistic of the residuals.

In Section 2 we exploit an efficient algorithm for finding the value β̂ minimizing the statistic ℛ(β) as a function of β; the resulting estimator compares well with the standard least squares estimator. Since the statistic ℛ(β) is based on empirical distributions, it works well when the errors are not normal, and even if outliers are present. The estimate β̂ can be shown to be a consistent estimator of β₀. Having found β̂ we calculate the N residuals. If β̂ is close to β₀ these residuals should be homogeneous.
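As a concrete sketch (not the authors' code), the Lehmann-type statistic can be computed directly from empirical distribution functions. The weighting by n(i) and the integration against the pooled empirical distribution below are assumptions consistent with the Cramér-von Mises form used in this paper:

```python
import numpy as np

def lehmann_statistic(samples):
    """Cramér-von Mises-type homogeneity statistic for q samples (a sketch)."""
    pooled = np.sort(np.concatenate(samples))
    N = pooled.size
    F_bar = np.arange(1, N + 1) / N  # pooled ECDF at the pooled points
    stat = 0.0
    for sample in samples:
        # ECDF of this block, evaluated at every pooled observation
        F_i = np.searchsorted(np.sort(sample), pooled, side="right") / len(sample)
        # integral of (F_i - F_bar)^2 dF_bar, approximated as an average
        stat += len(sample) * np.mean((F_i - F_bar) ** 2)
    return stat

rng = np.random.default_rng(0)
homogeneous = [rng.normal(size=5) for _ in range(10)]       # 10 blocks of 5
shifted = homogeneous[:-1] + [rng.normal(loc=3.0, size=5)]  # one block moved
print(lehmann_statistic(homogeneous), lehmann_statistic(shifted))
```

Homogeneous blocks give a small value; displacing a single block inflates the statistic, which is exactly the behavior the test exploits.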
More precisely, since β̂ is a consistent estimator of β₀, it can be shown that the empirical distribution of the marginals, denoted by F̂, converges to the true sampling distribution F. We now consider all N! permutations of the N residuals; for each permutation we reconstruct the blocks {n(i) : i = 1, 2, ..., q} and call the result a permutation sample. We now

recalculate ℛ for each permutation sample. The distribution of these ℛ values is called the permutation distribution P(ℛ, F̂). (In practice we sample at random from the set of permutation samples to approximate P(ℛ, F̂).) If the regression model is valid, it follows from the asymptotic normality of randomness statistics and the consistency of β̂ that the quantiles of the permutation distribution P(ℛ, F̂) converge to those of the distribution of ℛ(β₀). If, in fact, homogeneity is violated, one would expect ℛ(β̂) to lie in the upper percentiles of the permutation distribution P(ℛ, F̂). If this proves to be the case we reject homogeneity and hence the regression model. Numerical studies in Section 2 show that this procedure has good power for detecting deviations from the regression model. Finally, if the linear model is accepted, the 95% confidence region for β₀ is simply the set of β such that ℛ(β) is less than the 95th percentile of the permutation distribution P(ℛ, F̂).

Note that in the univariate case the quantiles of the distribution of ℛ(β₀) may be calculated exactly, since they are distribution free. In the multivariate case the permutation procedure is the only one possible. In fact, it is always advisable to use the quantiles of P(ℛ, F̂), since the model may contain some undetected nonlinear term, and the theoretical quantiles based on the distribution of ℛ(β₀) might then lead to an unjustifiably tight confidence interval.

Nonparametric regression has been treated by Adichie (1967), Jurečková (1971), Jaeckel (1972), Hettmansperger and McKean (1977), Koenker and Bassett (1978, 1982) and Gutenbrunner and Jurečková (1992).

2. Numerical Results. In this section we present an algorithm for computing the estimate β̂ described above. Consider all possible subsamples of p different points and index them by J, J = 1, ..., (N choose p). Let β_J be the vector of coefficients of the regression surface passing through the p points of subsample J.
The computation of such a β_J amounts to the solution of a linear system of p equations in p unknowns. The search procedure goes as follows: for each β_J we compute the corresponding residuals and then the corresponding statistic ℛ(β_J); this determines the argument β̂ giving the minimum of ℛ(β), and it also provides the curve of ℛ(β) as a function of β. Note that the number of β's that must be examined is (N choose p). Rousseeuw and Leroy (1987) reduce the computation by randomly selecting m subsamples, where m is chosen to ensure a high probability of selecting a "good subsample"; we can do the same.

We first consider the case d = 1 of univariate dependent variables, with the model y = 4x + ε, where ε is normal with mean 0 and variance 1. We make 5 observations {y_ij : j = 1, 2, ..., 5} at x_i = i; i = 1, 2, ..., 10.
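The search over exact-fit subsamples can be sketched as follows for the slope-only model y = βx (so p = 1 and each size-1 subsample yields one candidate β). This is a hedged illustration, not the authors' code; `randomness_stat` is a hypothetical helper implementing a Lehmann-type statistic:

```python
import numpy as np

def randomness_stat(resid_blocks):
    """Lehmann-type statistic on blocks of residuals (sketch)."""
    pooled = np.sort(np.concatenate(resid_blocks))
    N = pooled.size
    F_bar = np.arange(1, N + 1) / N
    total = 0.0
    for block in resid_blocks:
        F_i = np.searchsorted(np.sort(block), pooled, side="right") / len(block)
        total += len(block) * np.mean((F_i - F_bar) ** 2)
    return total

def subsample_search(x, y, blocks):
    """Fit y = beta*x by minimizing the randomness statistic of the residuals.

    Each candidate beta passes exactly through one data point, mirroring the
    exact-fit subsample idea; `blocks` is a list of index arrays, one per block.
    """
    best_beta, best_stat = None, np.inf
    for t in range(len(x)):           # each size-1 subsample gives a candidate
        if x[t] == 0:
            continue
        beta = y[t] / x[t]            # exact fit through point t
        resid = y - beta * x
        stat = randomness_stat([resid[idx] for idx in blocks])
        if stat < best_stat:
            best_beta, best_stat = beta, stat
    return best_beta, best_stat

rng = np.random.default_rng(1)
x = np.repeat(np.arange(1, 11), 5).astype(float)   # 10 blocks of 5 replicates
y = 4.0 * x + rng.normal(size=x.size)
blocks = [np.where(x == v)[0] for v in np.unique(x)]
beta_hat, _ = subsample_search(x, y, blocks)
print(round(beta_hat, 3))
```

For general p one would enumerate (or randomly sample, as in Rousseeuw and Leroy's scheme) subsamples of p points and solve the p-by-p linear system for each candidate.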

The graph of ℛ(β) is given in Figure 1, and the minimum ℛ(β̂) = 1.12111 is obtained at β̂ = 3.9033.

Figure 1: ℛ as a function of β

Next we calculate the residuals y_ij − β̂x_i. We then sample at random from the set of permutations of these residuals, arrange these values in 10 blocks of 5, and recalculate ℛ. The histogram of these resampled values of ℛ is given in Figure 2. The distribution obtained in this way is approximately the same as the permutation distribution P(ℛ, F̂).

Figure 2: Histogram of the resampled values of ℛ

We notice that the value ℛ(β̂) = 1.12111 obtained above is in the center of this histogram (at the 33.6th percentile), so we conclude that the linear model is compatible with our results. Finally, the 95th percentile of the estimated permutation distribution, L_0.05 = 1.88111, is plotted on Figure 1, and the associated confidence interval is [3.647, 4.096]. The corresponding least squares estimate and confidence interval are 3.892 and [3.782, 4.001]. We repeated this experiment 50 times; the histogram of the values of β̂ is given in Figure 3.
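The whole univariate experiment can be replayed in a few lines. The sketch below is an assumption-laden illustration: it uses a grid search over β instead of the subsample algorithm, freshly simulated data, and a hypothetical `stat` helper, so its numbers will differ from those reported for Figures 1 and 2:

```python
import numpy as np

def stat(resid, q=10, n=5):
    """Lehmann-type statistic on residuals arranged as q blocks of n."""
    blocks = resid.reshape(q, n)
    pooled = np.sort(resid)
    F_bar = np.arange(1, resid.size + 1) / resid.size
    total = 0.0
    for block in blocks:
        F_i = np.searchsorted(np.sort(block), pooled, side="right") / n
        total += n * np.mean((F_i - F_bar) ** 2)
    return total

rng = np.random.default_rng(2)
x = np.repeat(np.arange(1, 11), 5).astype(float)      # x_i = i, 5 replicates
y = 4.0 * x + rng.normal(size=x.size)                 # y = 4x + eps

# Grid search for the beta minimizing R(beta)
grid = np.linspace(3.5, 4.5, 1001)
stats = np.array([stat(y - b * x) for b in grid])
beta_hat = grid[np.argmin(stats)]

# Permutation distribution: shuffle residuals, regroup into 10 blocks of 5
resid = y - beta_hat * x
perm = np.array([stat(rng.permutation(resid)) for _ in range(2000)])
L05 = np.quantile(perm, 0.95)                         # 95th percentile

# Confidence interval: all beta with R(beta) below the 95th percentile
inside = grid[stats < L05]
print(beta_hat, (inside.min(), inside.max()))
```

The interval printed at the end is the permutation-based confidence interval described above: the set of slopes whose residual statistic stays below the estimated 95th percentile.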

Figure 3: Histogram of the values of β̂

Suppose the above model is modified to y = 4x + 0.1x² + ε but we continue to fit the model y = βx + ε. We repeat the above experiment 50 times, and each time we calculate the percentile of ℛ(β̂) in the estimated permutation distribution. The result is summarized in Figure 4, which shows that the linear model is (correctly) rejected in each experiment at least at the 80% level.

Figure 4: Histogram of the percentile of the observed value of ℛ(β̂)

Now modify the first model by changing the distribution of the error ε from a standard normal to a standard Cauchy distribution. Figure 5 gives the histogram of the observed values of β̂ for 50 replications of the experiment. It is clear that our nonparametric procedure is much more successful than the traditional least squares procedure, which in some of these experiments gave estimates like −30.

Consider the model (y(1), y(2)) = (4, 1)x + (ε(1), ε(2)), where the components of ε are normal with mean 0 and variance 1 and their correlation is 0.5. We make 5 observations {(y_ij(1), y_ij(2)) : j = 1, 2, ..., 5} at x_i = i; i = 1, 2, ..., 10.

Figure 5: Histogram of the values of β̂

The graph of ℛ(β) is given in Figure 6, and the minimum ℛ(β̂) = 4.78 is obtained at β̂ = (4.07, 1.08). Next we calculate the residuals (y_ij(1), y_ij(2)) − β̂x_i. We then sample at random from the set of permutations of these residuals, arrange these values in 10 blocks of 5, and recalculate ℛ. The histogram of these resampled values of ℛ is given in Figure 7. We notice that the value ℛ(β̂) = 4.78 obtained

above is in the center of this histogram, so we conclude that the linear model is compatible with our results. Finally, the value L_0.05 = 6.552, the 95th percentile of the histogram in Figure 7, is plotted on Figure 6 to give a confidence region for (β₁, β₂). The comparable least squares estimate for the slope is (4.08, 1.07).

Figure 7: Histogram of the resampled values of ℛ

REFERENCES

ADICHIE, J. N. (1967). Estimation of regression coefficients based on rank tests. Ann. Math. Statist. 38, 894-904.

BASSETT, G. and KOENKER, R. (1982). An empirical quantile function for linear models with iid errors. J. Amer. Statist. Assn. 77, 407-415.

GHOUDI, K. (1990). Multivariate Nonparametric Quality Control Statistics. Master's thesis, University of Ottawa, Ottawa, Ontario, Canada.

GHOUDI, K. (1992). Multivariate Randomness Statistics. PhD thesis, University of Ottawa, Ottawa, Ontario, Canada.

GHOUDI, K. (1992). Asymptotics of multivariate randomness statistics. Preprint.

GUTENBRUNNER, C. and JUREČKOVÁ, J. (1992). Regression rank scores and regression quantiles. Ann. Statist. 20, 305-330.

HETTMANSPERGER, T. P. and MCKEAN, J. W. (1977). A robust alternative based on ranks to least squares in analyzing linear models. Technometrics 19, 275-284.

JAECKEL, L. A. (1972). Estimating regression coefficients by minimizing the dispersion of residuals. Ann. Math. Statist. 43, 1449-1458.

JUREČKOVÁ, J. (1971). Nonparametric estimate of regression coefficients. Ann. Math. Statist. 42, 1328-1338.

KIEFER, J. (1959). K-sample analogues of the Kolmogorov-Smirnov and Cramér-von Mises tests. Ann. Math. Statist. 30, 420-447.

KOENKER, R. and BASSETT, G. (1978). Regression quantiles. Econometrica 46, 33-50.

KOENKER, R. and BASSETT, G. (1982). Robust tests for heteroscedasticity based on regression quantiles. Econometrica 50, 43-61.

LEHMANN, E. L. (1951). Consistency and unbiasedness of certain nonparametric tests. Ann. Math. Statist. 22, 165-179.

MCDONALD, D. (1991). On the asymptotics of randomness statistics. Can. J. Statist. 19, 209-217.

ROUSSEEUW, P. J. and LEROY, A. M. (1987). Robust Regression and Outlier Detection. John Wiley, New York.

SIEGEL, A. F. (1982). Robust regression using repeated medians. Biometrika 69, 242-244.