Monitoring Random Start Forward Searches for Multivariate Data
|
|
- Alyson Cook
- 5 years ago
- Views:
Transcription
1 Monitoring Random Start Forward Searches for Multivariate Data Anthony C. Atkinson 1, Marco Riani 2, and Andrea Cerioli 2 1 Department of Statistics, London School of Economics London WC2A 2AE, UK, a.c.atkinson@lse.ac.uk 2 Dipartimento di Economia, Università diparma Parma, Italy, mriani@unipr.it, andrea.cerioli@unipr.it Abstract. During a forward search from a robustly chosen starting point the plot of maximum Mahalanobis distances of observations in the subset may provide a test for outliers. This is not the customary test. We obtain distributional results for this distance during the search and exemplify its use. However, if clusters are present in the data, searches from random starts are required for their detection. We show that our new statistic has the same distributional properties whether the searches have random or robustly chosen starting points. Keywords: clustering, Mahalanobis distance, order statistic, outlier detection, robustness 1 Introduction The forward search is a powerful general method for detecting systematic or random departures from statistical models, such as those caused by outliers and the presence of clusters. The forward search for multivariate data is given book-length treatment by Atkinson, Riani and Cerioli (2004). To detect outliers they study the evolution of Mahalanobis distances calculated during a search through the data that starts from a carefully selected subset of observations. More recently Atkinson and Riani (2007) suggested the use of many searches starting from random starting points as a tool in the detection of clusters. An important aspect of this work is the provision of bounds against which to judge the observed values of the distances. Atkinson and Riani (2007) use simulation for this purpose as well as providing approximate numerical values for the quantiles of the distribution. These theoretical results are for the minimum Mahalanobis distance of observations not in the subset used for fitting when the starting point of the search is robustly selected. In this paper we consider instead the alternative statistic of the maximum Mahalanobis distance amongst observations in the subset. We derive good approximations to its distribution during the forward search and empirically compare its distribution to that of the minimum distance, both for random and robust starts. We find for the maximum distance,
2 448 Atkinson, A.C. et al. but not for the minimum, that the distribution of the distance does not depend on how the search starts. Our ultimate purpose is a more automatic method of outlier and cluster identification. We start in 2 with an introduction to the forward search that emphasises the importance of Mahalanobis distances in outlier detection. Some introductory theoretical results for the distributions of distances are in 3. Section 4 introduces the importance of random start searches in cluster detection. Our main theoretical results are in 5 where we use results on order statistics to derive good approximations to the distribution of the maximum distance during the search. Our methods are exemplified in 6 by the analysis of data on horse mussels. The comparison of distributions for random and elliptical starts to the search is conducted by simulation in 7. 2 Mahalanobis distances and the forward search The tools that we use for outlier detection and cluster identification are plots of various Mahalanobis distances. The squared distances for the sample are defined as d 2 i (ˆµ, ˆΣ) ={y i ˆµ} T ˆΣ 1 {y i ˆµ}, (1) where ˆµ and ˆΣ are estimates of the mean and covariance matrix of the n observations. In the forward search the parameters µ and Σ are replaced by their standard unbiased estimators from a subset of m observations, yielding estimates ˆµ(m)and ˆΣ(m). From this subset we obtain n squared Mahalanobis distances d 2 i (m) ={y i ˆµ(m)} T ˆΣ 1 (m){y i ˆµ(m)}, i =1,...,n. (2) We start with a subset of m 0 observations which grows in size during the search. When a subset S(m) ofm observations is used in fitting, we order the squared distances and take the observations corresponding to the m +1 smallest as the new subset S(m + 1). In what we call normal progression this process augments the subset by one observation, but sometimes two or more observations enter as one or more leave. In our examples we look at forward plots of quantities derived from the distances d i (m). These distances tend to decrease as n increases. If interest is in the latter part of the search we may use scaled distances d sc i (m) =d i (m) ( ˆΣ(m) / ˆΣ(n) ) 1/2v, (3) where v is the dimension of the observations y and ˆΣ(n) is the estimate of Σ at the end of the search. To detect outliers Atkinson et al. (2004) and Atkinson and Riani (2007) examined the minimum Mahalanobis distance amongst observations not in the subset d min (m) =mind i (m) i/ S(m), (4)
3 Random Forward Search 449 or its scaled version d sc min (m). In either case let this be observation i min (m). If observation i min (m) is an outlier relative to the other m observations, the distance (4) will be large compared to its reference distribution. In this paper we investigate instead the properties of the maximum Mahalanobis distance amongst the m observations in the subset d max (m) =maxd i (m) i S(m), (5) letting this be observation i max (m). Whether we monitor d max (m) or d min (m) the search is the same, progressing through the ordering of d 2 i (m). 3 Minimum and maximum Mahalanobis distances We now consider the relationship between d min (m) andd max (m) as outlier tests. This relationship depends on the subsets S(m) and S(m + 1). Let the kth largest ordered Mahalanobis distance be d [k] (m) whenestimation is based on the subset S(m). In normal progression d [m+1] (m) =d min (m) (6) and S(m +1)isformedfromS(m) by the addition of observation i min (m). Likewise, in normal progression this will give rise to the largest distance within the new subset, that is d [m+1] (m +1)=d max (m +1)=d imin(m) (m +1). (7) The distance d imin(m) (m + 1) is that for the new observation i min (m) when the parameters are estimated from S(m + 1). The consequence of (7) is that both d max (m +1)andd min (m) are tests of the outlyingness of observation i min (m). Although both statistics are testing the same hypothesis they do not have the same numerical value and should be referred to different null distributions. In 5 we discuss the effect of the ranking of the observations on these distributions as well as the consequence of estimating µ and Σ from a subset of the observations. For estimates using all n observations d max (n) isoneof the distances in (1). Standard distributional results in, for example, Atkinson et al. (2004, 2.6) show that d 2 i (ˆµ, ˆΣ) =d 2 (n 1)2 i (n) n ( v Beta 2, n v 1 ). (8) 2 On the other hand, d 2 min (n 1) is a deletion distance in which the parameters are estimated, in general, with the omission of observation i. The distribution of such distances is d 2 (i) n v(n 2) (n 1) (n v 1) F v,n v 1, (9)
4 450 Atkinson, A.C. et al. although the distribution of d 2 min (n 1) depends on the order statistics from this distribution. For moderate n the range of the distribution of d 2 i (n) in (8) is approximately (0,n) rather than the unbounded range for the F distribution of the deletion distances. As we shall see, the consequence is that the distribution of d 2 max (m) has much shorter tails than that of d2 min (m), particularly for small m. Our argument has been derived assuming normal progression. This occurs under the null hypothesis of a single multivariate normal population when there are no outliers or clusters in the data, so that the ordering of the observations by closeness to the fitted model does not alter appreciably during the search. Then we obtain very similar forward plots of d max (m) andd min (m), even if they have to be interpreted against different null distributions. In fact, we do not need the order to remain unchanged, but only that i min (m) and i max (m + 1) are the same observation and that the other observations in S(m + 1) arethosethatwereins(m). Dispersed outliers likewise do not appreciably affect the ordering of the data. This is however affected by clusters of observations that cause appreciable changes in the parameter estimates as they enter S(m). A discussion of the ordering of observations within and without S(m) is on pp of Atkinson et al. (2004). 4 Elliptical and random starts To find the starting subset for the search Atkinson et al. (2004) use the robust bivariate boxplots of Zani, Riani and Corbellini (1998) to pick a starting set S (m 0 ) that excludes any two-dimensional outliers. The boxplots have elliptical contours, so we refer to this method as the elliptical start. However, if there are clusters in the data, the elliptical start may lead to a search in which observations from several clusters enter the subset in sequence in such a way that the clusters are not revealed. Searches from more than one starting point are then needed to reveal the clustering structure. If we start with an initial subset of observations from each cluster in turn, the other clusters are revealed as outliers. However, such a procedure is not suitable for automatic cluster detection. Atkinson and Riani (2007) therefore instead run many forward searches from randomly selected starting points, monitoring the evolution of the values of d min (m) as the searches progress. Here we monitor d max (m). As the search progresses, the examples of Atkinson and Riani (2007) show that the effect of the starting point decreases. Once two searches have the same subsets S(m) forsomem, they will have the same subsets for all successive m. Typically, in the last third of the search the individual searches from random starts converge to that from the elliptical start. The implication is that the same envelopes can be used, except in the very early stages of the search, whether we use random or elliptical starts. If we are looking for a few
5 Random Forward Search 451 outliers, we will be looking at the end of the search. However, the envelopes for d min (m) andd max (m) will be different. 5 Envelopes from order statistics For relatively small samples we can use simulation to obtain envelopes for d max (m) during the search. For larger samples we adapt the method of Riani, Atkinson and Cerioli (2007) who find very good approximations to the envelopes for d min (m) using order statistics and a result of Tallis (1963) on truncated multivariate normal distributions. Let Y [m] be the mth order statistic from a sample of size n from a univariate distribution with c.d.f. G(y). From, for example Lehmann (1991, p. 353) and Guenther (1977), the required quantile of order γ of the distribution of Y [m] say y m,n;γ can be obtained as y m,n;γ = G 1 ( m m +(n m +1)x 2(n m+1),2m;1 γ ), (10) where x 2(n m+1),2m;1 γ is the quantile of order 1 γ of the F distribution with 2(n m +1)and2m degrees of freedom. Riani et al. (2007) comment that care needs to be taken to ensure that the numerical calculation of this inverse distribution is sufficiently accurate as m n, particularly for large n and extreme γ. We now consider our choice of G(x), which is different from that of Riani et al. (2007). We estimate Σ on m 1 degrees of freedom. The distribution of the m distances in the subset can, from (8), be written as d 2 (m 1)2 i (m) m Beta ( v 2, m v 1 2 ), i S(m). (11) The estimate of Σ that we use is biased since it is calculated from the m observations in the subset that have been chosen as having the m smallest distances. However, in the calculation of the scaled distances (3) we approximately correct for this effect by multiplication by a ratio derived from estimates of Σ. So the envelopes for the scaled Mahalanobis distances derived from d max (m) aregivenby (m 1) 2 V m,γ = ym,n;γ, (12) m with G the beta distribution in (11). For unscaled distances we need to correct for the bias in the estimate of Σ. We follow Riani et al. (2007) and consider elliptical truncation in the multivariate normal distribution. From the results of Tallis (1963) they obtain the large-sample correction factor m/n c FS (m) = P (Xv+2 2 <χ2 v,m/n ), (13)
6 452 Atkinson, A.C. et al. with χ 2 v,m/n the m/n quantile of χ2 v and X2 v+2 a chi-squared random variable on v + 2 degrees of freedom. Envelopes for unscaled distances are then obtained by scaling up the values of the order statistics V m,γ = c FS(m)V m,γ. Figure 1 shows the agreement between simulated envelopes (continuous lines) and theoretical envelopes (dotted lines) for d max (m) whenn = Scaled distances are in the upper panel; agreement between the two sets of envelopes is excellent throughout virtually the whole range. Agreement for the unscaled distances in the lower panel of the figure is less good, but is certainly more than satisfactory for inferences about outliers at least in the last half of the search. Unfortunately, the inclusion of ˆΣ(n) in the expression for scaled distances (3) results in small distances in the presence of outliers, due to the inflation of the variance estimate and to consequent difficulties of interpretation. For practical data analysis we have to use the unscaled distances, which are less well approximated. 6 Horse mussels As an example of the uses of elliptical and random starts in the analysis of multivariate data we look at measurements on horse mussels from New Zealand introduced by Cook and Weisberg (1994, p. 161) who treat them as regression with muscle mass, the edible portion of the mussel, as response. They focus on independent transformations of the response and of one of the explanatory variables. Atkinson et al. (2004, 4.9) consider multivariate normality obtained by joint transformation of all five variables. There are 82 observations on five variables: shell length, width, height and mass and the mass of the mussels muscle, which is the edible part. We begin with an analysis of the untransformed data using a forward search with an elliptical start. The left-hand panel of Figure 2 monitors d min (m), whereas the right-hand panel monitors d max (m). The two sets of simulation envelopes were found by direct simulation of 5,000 forward searches. The figure shows how very different the two distributions are at the beginning of the search. That in the left-hand panel for d min (m) isderived from the unbounded F distribution (9) whereas that for d max (m) in the right-hand panel is derived from the beta distribution (11). The two traces are very similar once they are calibrated by the envelopes. They both show appreciable departure from multivariate normality in the last one third of the search. Since we are selecting observations by their closeness to the multivariate normal model, we expect departure, if any, to be at the end of the search. Even allowing for the scaling of the two plots, the maximum distances seem to show less fluctuation at the beginning of the search. For
7 Random Forward Search 453 Scaled MD Unscaled MD Fig. 1. Envelopes for Mahalanobis distances dmax(m) whenn = 1000 and v =5. Dotted lines from order statistics, continuous lines from 5,000 simulations. Upper panel scaled distances, lower panel unscaled distances. Elliptical starts. 1%, 50% and 99% points.
8 454 Atkinson, A.C. et al Fig. 2. Horse mussels: forward search on untransformed data. Left-hand panel d min (m), right-hand panel dmax(m). Elliptical starts; 1%, 50% and 99% points from 5,000 simulations. much of the rest of the paper we focus on plots of the maximum Mahalanobis distances d max (m). We now analyse the data using a multivariate version of the parametric transformation of Box and Cox (1964). As a result of their analysis Atkinson et al. (2004) suggest the vector of transformation parameters λ =(0.5, 0, 0.5, 0, 0) T ; that is, the square root transformation for y 1 and y 3 and the logarithmic transformation for the other three variables. We look at forward plots of d max (m) to see whether this transformation yields multivariate normality. The upper-left panel of Figure 3 shows the maximum distance for all n = 82 observations for the transformed data. The contrast with the righthand panel of Figure 2 is informative. The plot still goes out of the 99% envelope at the end of the search, but the number of outliers is much smaller, now only around 5. The last five units to enter are those numbered 37, 16, 78, 8 and finally 48. The plot of the maximum distance in the upper-right panel of Figure 3 shows that, with these five observations deleted, the last value just lies below the 99% point of the distribution. We have found a multivariate normal sample, after transformation, with five outliers. That there are five outliers, not four, is confirmed in the lower panel of Figure 3 where observation 37 has been reincluded. Now the plot of maximum distances goes outside the 99% envelope at the end of the search. The limits in figures like 3 have been simulated to have the required pointwise level, that is they are correct for each m considered independently. However, the probability that the observed trace of values of d max (m) exceeds a specific bound at least once during the search is much greater than the pointwise value. Atkinson and Riani (2006) evaluate such simultaneous probabilities; they are surprisingly high.
9 Random Forward Search Fig. 3. Horse mussels: forward search on transformed data. Plots of dmax(m) for three different sample sizes. Upper-left panel n = 82; upper-right panel, observations 37, 16, 78, 8 and 48 removed (n = 77). Lower panel, observation 37 re-included (n = 78). Elliptical starts; 1%, 50% and 99% points from 5,000 simulations. There are five outliers. 7 Elliptical and random starts We have analysed the mussels data and investigated the properties of d max (m) andd min (m) using searches with elliptical starts. Finally we look at the properties of plots of both distances when random starts are used, for example to aid in the identification of clusters. The upper panel of Figure 4 presents a comparison of simulated envelopes for random and elliptical starts for d max (m) from data with the same dimensions as the mussels data. For this small data set there is no operationally important difference between the two envelopes. The important conclusion is that, for larger data sets, we can use the approximations of 5 based on order statistics whether we are using random or elliptical starts. The surprising conclusion that we obtain the same envelopes for searches from elliptical or random starts however does not hold when instead we monitor d min (m). The lower panel of Figure 4 repeats the simulations for d min (m). Now there is a noticeable difference, during the first half of the search, between the envelopes for random and those from elliptical starts. We now consider the implications of this difference on the properties of individual searches. The left-hand panel of Figure 5 repeats the simulated envelopes for elliptical starts from the upper panel of Figure 4 and adds 250
10 456 Atkinson, A.C. et al. Max. MD Random start Elliptical start Min. MD Random start Elliptical start Fig. 4. Simulated envelopes for Mahalanobis distances when n = 82andv = 5. Upper panel, dmax(m), lower panel d min (m) Dotted lines from elliptical starts, continuous lines from random starts. 1%, 50% and 99% points from 5,000 simulations.
11 Random Forward Search 457 trajectories of distances for d max (m) for simulations from random starting points. The simulated values nicely fill the envelope, although there are a surprising number of transient peaks above the envelope. The right-hand panel of the figure repeats this process of envelopes from elliptical starts and trajectories from random starts but for the minimum distances d min (m). Now, as we would expect from Figure 4, the simulated values sit a little low in the envelopes. If a subset contains one or more outliers, these will give rise to a too large estimate of Σ. As a consequence, some of the distances of units not included in the subset will be too small and the smallest of these will be selected as d min (m). On the contrary, if outliers are present in S(m) when we calculate d max (m), the distance that we look at will be that for one of the outliers and so will not be shrunken due to the too-large estimate of Σ Fig. 5. Simulated envelopes from elliptical starts for Mahalanobis distances when n = 82 and v = 5 with 250 trajectories from random starts. Left-hand panel dmax(m), right-hand panel d min (m). Our justification for the use of random start forward searches was that searches from elliptical starts may not detect clusters in the data if these start from a subset of units in more than one cluster. We have however analysed the mussels data using values of d max (m) from elliptical starts. Our conclusion, from Figure 3, was that after transformation there were 77 units from a multivariate normal population and five outliers. We checked this conclusion using random start forward searches with d max (m) and failed to detect any clusters. The purpose of this paper has been to explore the properties of the maximum distance d max (m). We have found its null distribution and obtained good approximations to this distribution for use in the forward search.the lack of dependence of this distribution on the starting point of the search is an appealing feature. However, we need to investigate the properties of this measure when the null distribution does not hold. One particular question is whether use of d max (m) provides tests for outliers and clusters that are as powerful as those using the customary minimum distance d min (m).
12 458 Atkinson, A.C. et al. References ATKINSON, A.C. and RIANI, M. (2000): Robust Diagnostic Regression Analysis. Springer-Verlag, New York. ATKINSON, A.C. and RIANI, M. (2006): Distribution theory and simulations for tests of outliers in regression. Journal of Computational and Graphical Statistics 15, ATKINSON, A.C. and RIANI, M. (2007): Exploratory tools for clustering multivariate data. Computational Statistics and Data Analysis 52, doi: /j.csda ATKINSON, A.C., RIANI, M. and CERIOLI, A (2004): Exploring Multivariate Data with the Forward Search. Springer-Verlag, New York. BOX, G.E.P. and COX, D.R. (1964): An analysis of transformations (with discussion). Journal of the Royal Statistical Society, Series B 26, COOK, R.D. and WEISBERG, S. (1994): An Introduction to Regression Graphics. Wiley, New York. GUENTHER, W.C. (1977): An easy method for obtaining percentage points of order statistics. Technometrics 19, LEHMANN, E. (1991): Point Estimation, 2nd edition. Wiley, New York. RIANI, M., ATKINSON, A.C. and CERIOLI, A. (2007): Results in finding an unknown number of multivariate outliers in large data sets. Research Report 140, Department of Statistics, London School of Economics. TALLIS, G.M.(1963): Elliptical and radial truncation in normal samples. Annals of Mathematical Statistics 34, ZANI, S., RIANI, M. and CORBELLINI, A. (1998): Robust bivariate boxplots and multiple outlier detection. Computational Statistics and Data Analysis 28,
TESTS FOR TRANSFORMATIONS AND ROBUST REGRESSION. Anthony Atkinson, 25th March 2014
TESTS FOR TRANSFORMATIONS AND ROBUST REGRESSION Anthony Atkinson, 25th March 2014 Joint work with Marco Riani, Parma Department of Statistics London School of Economics London WC2A 2AE, UK a.c.atkinson@lse.ac.uk
More informationAccurate and Powerful Multivariate Outlier Detection
Int. Statistical Inst.: Proc. 58th World Statistical Congress, 11, Dublin (Session CPS66) p.568 Accurate and Powerful Multivariate Outlier Detection Cerioli, Andrea Università di Parma, Dipartimento di
More informationCOMPUTER DATA ANALYSIS AND MODELING
MINISTRY OF EDUCATION BELARUSIAN STATE UNIVERSITY NATIONAL RESEARCH CENTER FOR APPLIED PROBLEMS OF MATHEMATICS AND INFORMATICS SCHOOL OF APPLIED MATHEMATICS AND INFORMATICS BELARUSIAN REPUBLICAN FOUNDATION
More informationFinding an unknown number of multivariate outliers
J. R. Statist. Soc. B (2009) 71, Part 2, pp. Finding an unknown number of multivariate outliers Marco Riani, Università di Parma, Italy Anthony C. Atkinson London School of Economics and Political Science,
More informationComputational Statistics and Data Analysis
Computational Statistics and Data Analysis 54 (2010) 3300 3312 Contents lists available at ScienceDirect Computational Statistics and Data Analysis journal homepage: www.elsevier.com/locate/csda Robust
More informationCluster Detection and Clustering with Random Start Forward Searches
To appear in the Journal of Applied Statistics Vol., No., Month XX, 9 Cluster Detection and Clustering with Random Start Forward Searches Anthony C. Atkinson a Marco Riani and Andrea Cerioli b a Department
More informationJournal of the Korean Statistical Society
Journal of the Korean Statistical Society 39 (2010) 117 134 Contents lists available at ScienceDirect Journal of the Korean Statistical Society journal homepage: www.elsevier.com/locate/jkss The forward
More informationIntroduction to Robust Statistics. Anthony Atkinson, London School of Economics, UK Marco Riani, Univ. of Parma, Italy
Introduction to Robust Statistics Anthony Atkinson, London School of Economics, UK Marco Riani, Univ. of Parma, Italy Multivariate analysis Multivariate location and scatter Data where the observations
More informationSupplementary Material for Wang and Serfling paper
Supplementary Material for Wang and Serfling paper March 6, 2017 1 Simulation study Here we provide a simulation study to compare empirically the masking and swamping robustness of our selected outlyingness
More informationRobust Bayesian regression with the forward search: theory and data analysis
TEST (2017) 26:869 886 DOI 10.1007/s11749-017-0542-6 ORIGINAL PAPER Robust Bayesian regression with the forward search: theory and data analysis Anthony C. Atkinson 1 Aldo Corbellini 2 Marco Riani 2 Received:
More informationJournal of Biostatistics and Epidemiology
Journal of Biostatistics and Epidemiology Original Article Robust correlation coefficient goodness-of-fit test for the Gumbel distribution Abbas Mahdavi 1* 1 Department of Statistics, School of Mathematical
More informationA Note on Visualizing Response Transformations in Regression
Southern Illinois University Carbondale OpenSIUC Articles and Preprints Department of Mathematics 11-2001 A Note on Visualizing Response Transformations in Regression R. Dennis Cook University of Minnesota
More informationHANDBOOK OF APPLICABLE MATHEMATICS
HANDBOOK OF APPLICABLE MATHEMATICS Chief Editor: Walter Ledermann Volume VI: Statistics PART A Edited by Emlyn Lloyd University of Lancaster A Wiley-Interscience Publication JOHN WILEY & SONS Chichester
More informationComputational Statistics and Data Analysis
Computational Statistics and Data Analysis 65 (2013) 29 45 Contents lists available at SciVerse ScienceDirect Computational Statistics and Data Analysis journal homepage: www.elsevier.com/locate/csda Robust
More informationSTT 843 Key to Homework 1 Spring 2018
STT 843 Key to Homework Spring 208 Due date: Feb 4, 208 42 (a Because σ = 2, σ 22 = and ρ 2 = 05, we have σ 2 = ρ 2 σ σ22 = 2/2 Then, the mean and covariance of the bivariate normal is µ = ( 0 2 and Σ
More informationPreface. Why We Wrote This Book
Preface Why We Wrote This Book This book is about using graphs to explore and model continuous multivariate data. Such data are often modelled using the multivariate normal distribution and, indeed, there
More informationDetecting outliers in weighted univariate survey data
Detecting outliers in weighted univariate survey data Anna Pauliina Sandqvist October 27, 21 Preliminary Version Abstract Outliers and influential observations are a frequent concern in all kind of statistics,
More informationSTAT 501 Assignment 2 NAME Spring Chapter 5, and Sections in Johnson & Wichern.
STAT 01 Assignment NAME Spring 00 Reading Assignment: Written Assignment: Chapter, and Sections 6.1-6.3 in Johnson & Wichern. Due Monday, February 1, in class. You should be able to do the first four problems
More informationCHAPTER 5. Outlier Detection in Multivariate Data
CHAPTER 5 Outlier Detection in Multivariate Data 5.1 Introduction Multivariate outlier detection is the important task of statistical analysis of multivariate data. Many methods have been proposed for
More informationA robust statistical analysis of the 1988 Turin Shroud radiocarbon dating results G. Fanti 1, F. Crosilla 2, M. Riani 3, A.C. Atkinson 4 1 Department of Mechanical Engineering University of Padua, Italy,
More informationIdentifying and accounting for outliers and extreme response patterns in latent variable modelling
Identifying and accounting for outliers and extreme response patterns in latent variable modelling Irini Moustaki Athens University of Economics and Business Outline 1. Define the problem of outliers and
More informationMultivariate Statistical Analysis
Multivariate Statistical Analysis Fall 2011 C. L. Williams, Ph.D. Lecture 9 for Applied Multivariate Analysis Outline Addressing ourliers 1 Addressing ourliers 2 Outliers in Multivariate samples (1) For
More informationRegression Graphics. 1 Introduction. 2 The Central Subspace. R. D. Cook Department of Applied Statistics University of Minnesota St.
Regression Graphics R. D. Cook Department of Applied Statistics University of Minnesota St. Paul, MN 55108 Abstract This article, which is based on an Interface tutorial, presents an overview of regression
More informationDiagnostics and Remedial Measures
Diagnostics and Remedial Measures Yang Feng http://www.stat.columbia.edu/~yangfeng Yang Feng (Columbia University) Diagnostics and Remedial Measures 1 / 72 Remedial Measures How do we know that the regression
More informationSome Results on the Multivariate Truncated Normal Distribution
Syracuse University SURFACE Economics Faculty Scholarship Maxwell School of Citizenship and Public Affairs 5-2005 Some Results on the Multivariate Truncated Normal Distribution William C. Horrace Syracuse
More informationGlossary. The ISI glossary of statistical terms provides definitions in a number of different languages:
Glossary The ISI glossary of statistical terms provides definitions in a number of different languages: http://isi.cbs.nl/glossary/index.htm Adjusted r 2 Adjusted R squared measures the proportion of the
More information9 Correlation and Regression
9 Correlation and Regression SW, Chapter 12. Suppose we select n = 10 persons from the population of college seniors who plan to take the MCAT exam. Each takes the test, is coached, and then retakes the
More informationEconometrics I KS. Module 2: Multivariate Linear Regression. Alexander Ahammer. This version: April 16, 2018
Econometrics I KS Module 2: Multivariate Linear Regression Alexander Ahammer Department of Economics Johannes Kepler University of Linz This version: April 16, 2018 Alexander Ahammer (JKU) Module 2: Multivariate
More informationPackage ForwardSearch
Package ForwardSearch February 19, 2015 Type Package Title Forward Search using asymptotic theory Version 1.0 Date 2014-09-10 Author Bent Nielsen Maintainer Bent Nielsen
More informationJournal of Statistical Software
JSS Journal of Statistical Software October 215, Volume 67, Code Snippet 1. doi: 1.18637/jss.v67.c1 The Forward Search for Very Large Datasets Marco Riani Università di Parma Domenico Perrotta European
More informationLECTURE 10. Introduction to Econometrics. Multicollinearity & Heteroskedasticity
LECTURE 10 Introduction to Econometrics Multicollinearity & Heteroskedasticity November 22, 2016 1 / 23 ON PREVIOUS LECTURES We discussed the specification of a regression equation Specification consists
More informationLinear Models 1. Isfahan University of Technology Fall Semester, 2014
Linear Models 1 Isfahan University of Technology Fall Semester, 2014 References: [1] G. A. F., Seber and A. J. Lee (2003). Linear Regression Analysis (2nd ed.). Hoboken, NJ: Wiley. [2] A. C. Rencher and
More informationIn most cases, a plot of d (j) = {d (j) 2 } against {χ p 2 (1-q j )} is preferable since there is less piling up
THE UNIVERSITY OF MINNESOTA Statistics 5401 September 17, 2005 Chi-Squared Q-Q plots to Assess Multivariate Normality Suppose x 1, x 2,..., x n is a random sample from a p-dimensional multivariate distribution
More informationApplied Multivariate and Longitudinal Data Analysis
Applied Multivariate and Longitudinal Data Analysis Chapter 2: Inference about the mean vector(s) Ana-Maria Staicu SAS Hall 5220; 919-515-0644; astaicu@ncsu.edu 1 In this chapter we will discuss inference
More informationRegression Diagnostics for Survey Data
Regression Diagnostics for Survey Data Richard Valliant Joint Program in Survey Methodology, University of Maryland and University of Michigan USA Jianzhu Li (Westat), Dan Liao (JPSM) 1 Introduction Topics
More informationA CONNECTION BETWEEN LOCAL AND DELETION INFLUENCE
Sankhyā : The Indian Journal of Statistics 2000, Volume 62, Series A, Pt. 1, pp. 144 149 A CONNECTION BETWEEN LOCAL AND DELETION INFLUENCE By M. MERCEDES SUÁREZ RANCEL and MIGUEL A. GONZÁLEZ SIERRA University
More informationNew robust dynamic plots for regression mixture detection
Adv Data Anal Classif (2009) 3:263 279 DOI 10.1007/s11634-009-0050-y REGULAR ARTICLE New robust dynamic plots for regression mixture detection Domenico Perrotta Marco Riani Francesca Torti Received: 12
More informationApplication of Variance Homogeneity Tests Under Violation of Normality Assumption
Application of Variance Homogeneity Tests Under Violation of Normality Assumption Alisa A. Gorbunova, Boris Yu. Lemeshko Novosibirsk State Technical University Novosibirsk, Russia e-mail: gorbunova.alisa@gmail.com
More informationIdentification of Multivariate Outliers: A Performance Study
AUSTRIAN JOURNAL OF STATISTICS Volume 34 (2005), Number 2, 127 138 Identification of Multivariate Outliers: A Performance Study Peter Filzmoser Vienna University of Technology, Austria Abstract: Three
More informationContents. Preface to Second Edition Preface to First Edition Abbreviations PART I PRINCIPLES OF STATISTICAL THINKING AND ANALYSIS 1
Contents Preface to Second Edition Preface to First Edition Abbreviations xv xvii xix PART I PRINCIPLES OF STATISTICAL THINKING AND ANALYSIS 1 1 The Role of Statistical Methods in Modern Industry and Services
More informationThe power of (extended) monitoring in robust clustering
Statistical Methods & Applications manuscript No. (will be inserted by the editor) The power of (extended) monitoring in robust clustering Alessio Farcomeni and Francesco Dotto Received: date / Accepted:
More informationAnalysis of the AIC Statistic for Optimal Detection of Small Changes in Dynamic Systems
Analysis of the AIC Statistic for Optimal Detection of Small Changes in Dynamic Systems Jeremy S. Conner and Dale E. Seborg Department of Chemical Engineering University of California, Santa Barbara, CA
More information, (1) e i = ˆσ 1 h ii. c 2016, Jeffrey S. Simonoff 1
Regression diagnostics As is true of all statistical methodologies, linear regression analysis can be a very effective way to model data, as along as the assumptions being made are true. For the regression
More informationof the Citizens, Support to External Security Unit, Ispra, Italy
Robust Methods for Complex Data ( ) Metodi robusti per l analisi di dati complessi Marco Riani 1, Andrea Cerioli 1, Domenico Perrotta 2, Francesca Torti 2 1 Dipartimento di Economia, Università di Parma,
More informationChapter 1 Statistical Inference
Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations
More informationPrentice Hall Stats: Modeling the World 2004 (Bock) Correlated to: National Advanced Placement (AP) Statistics Course Outline (Grades 9-12)
National Advanced Placement (AP) Statistics Course Outline (Grades 9-12) Following is an outline of the major topics covered by the AP Statistics Examination. The ordering here is intended to define the
More informationTesting Equality of Two Intercepts for the Parallel Regression Model with Non-sample Prior Information
Testing Equality of Two Intercepts for the Parallel Regression Model with Non-sample Prior Information Budi Pratikno 1 and Shahjahan Khan 2 1 Department of Mathematics and Natural Science Jenderal Soedirman
More informationRegression Analysis for Data Containing Outliers and High Leverage Points
Alabama Journal of Mathematics 39 (2015) ISSN 2373-0404 Regression Analysis for Data Containing Outliers and High Leverage Points Asim Kumer Dey Department of Mathematics Lamar University Md. Amir Hossain
More informationLecture 3. Inference about multivariate normal distribution
Lecture 3. Inference about multivariate normal distribution 3.1 Point and Interval Estimation Let X 1,..., X n be i.i.d. N p (µ, Σ). We are interested in evaluation of the maximum likelihood estimates
More informationCS 195-5: Machine Learning Problem Set 1
CS 95-5: Machine Learning Problem Set Douglas Lanman dlanman@brown.edu 7 September Regression Problem Show that the prediction errors y f(x; ŵ) are necessarily uncorrelated with any linear function of
More informationMultivariate Statistical Analysis
Multivariate Statistical Analysis Fall 2011 C. L. Williams, Ph.D. Lecture 9 for Applied Multivariate Analysis Outline Two sample T 2 test 1 Two sample T 2 test 2 Analogous to the univariate context, we
More informationMultivariate Capability Analysis Using Statgraphics. Presented by Dr. Neil W. Polhemus
Multivariate Capability Analysis Using Statgraphics Presented by Dr. Neil W. Polhemus Multivariate Capability Analysis Used to demonstrate conformance of a process to requirements or specifications that
More informationRegression III Lecture 1: Preliminary
Regression III Lecture 1: Preliminary Dave Armstrong University of Western Ontario Department of Political Science Department of Statistics and Actuarial Science (by courtesy) e: dave.armstrong@uwo.ca
More informationDiagnostics for Linear Models With Functional Responses
Diagnostics for Linear Models With Functional Responses Qing Shen Edmunds.com Inc. 2401 Colorado Ave., Suite 250 Santa Monica, CA 90404 (shenqing26@hotmail.com) Hongquan Xu Department of Statistics University
More informationWiley. Methods and Applications of Linear Models. Regression and the Analysis. of Variance. Third Edition. Ishpeming, Michigan RONALD R.
Methods and Applications of Linear Models Regression and the Analysis of Variance Third Edition RONALD R. HOCKING PenHock Statistical Consultants Ishpeming, Michigan Wiley Contents Preface to the Third
More informationOn the Distribution of Hotelling s T 2 Statistic Based on the Successive Differences Covariance Matrix Estimator
On the Distribution of Hotelling s T 2 Statistic Based on the Successive Differences Covariance Matrix Estimator JAMES D. WILLIAMS GE Global Research, Niskayuna, NY 12309 WILLIAM H. WOODALL and JEFFREY
More information-However, this definition can be expanded to include: biology (biometrics), environmental science (environmetrics), economics (econometrics).
Chemometrics Application of mathematical, statistical, graphical or symbolic methods to maximize chemical information. -However, this definition can be expanded to include: biology (biometrics), environmental
More informationON THE CALCULATION OF A ROBUST S-ESTIMATOR OF A COVARIANCE MATRIX
STATISTICS IN MEDICINE Statist. Med. 17, 2685 2695 (1998) ON THE CALCULATION OF A ROBUST S-ESTIMATOR OF A COVARIANCE MATRIX N. A. CAMPBELL *, H. P. LOPUHAA AND P. J. ROUSSEEUW CSIRO Mathematical and Information
More informationLater in the same chapter (page 45) he asserted that
Chapter 7 Randomization 7 Randomization 1 1 Fisher on randomization..................... 1 2 Shoes: a paired comparison................... 2 3 The randomization distribution................. 4 4 Theoretical
More informationSmall Sample Corrections for LTS and MCD
myjournal manuscript No. (will be inserted by the editor) Small Sample Corrections for LTS and MCD G. Pison, S. Van Aelst, and G. Willems Department of Mathematics and Computer Science, Universitaire Instelling
More informationarxiv: v1 [stat.me] 5 Apr 2013
Logistic regression geometry Karim Anaya-Izquierdo arxiv:1304.1720v1 [stat.me] 5 Apr 2013 London School of Hygiene and Tropical Medicine, London WC1E 7HT, UK e-mail: karim.anaya@lshtm.ac.uk and Frank Critchley
More informationResearch Article A Nonparametric Two-Sample Wald Test of Equality of Variances
Advances in Decision Sciences Volume 211, Article ID 74858, 8 pages doi:1.1155/211/74858 Research Article A Nonparametric Two-Sample Wald Test of Equality of Variances David Allingham 1 andj.c.w.rayner
More informationComputational Connections Between Robust Multivariate Analysis and Clustering
1 Computational Connections Between Robust Multivariate Analysis and Clustering David M. Rocke 1 and David L. Woodruff 2 1 Department of Applied Science, University of California at Davis, Davis, CA 95616,
More informationOutlier Detection via Feature Selection Algorithms in
Int. Statistical Inst.: Proc. 58th World Statistical Congress, 2011, Dublin (Session CPS032) p.4638 Outlier Detection via Feature Selection Algorithms in Covariance Estimation Menjoge, Rajiv S. M.I.T.,
More informationMeasurement And Uncertainty
Measurement And Uncertainty Based on Guidelines for Evaluating and Expressing the Uncertainty of NIST Measurement Results, NIST Technical Note 1297, 1994 Edition PHYS 407 1 Measurement approximates or
More informationBootstrapping Hypotheses Tests
Southern Illinois University Carbondale OpenSIUC Research Papers Graduate School Summer 2015 Bootstrapping Hypotheses Tests Chathurangi H. Pathiravasan Southern Illinois University Carbondale, chathurangi@siu.edu
More informationLAB 3 INSTRUCTIONS SIMPLE LINEAR REGRESSION
LAB 3 INSTRUCTIONS SIMPLE LINEAR REGRESSION In this lab you will first learn how to display the relationship between two quantitative variables with a scatterplot and also how to measure the strength of
More informationExperimental Design and Data Analysis for Biologists
Experimental Design and Data Analysis for Biologists Gerry P. Quinn Monash University Michael J. Keough University of Melbourne CAMBRIDGE UNIVERSITY PRESS Contents Preface page xv I I Introduction 1 1.1
More information1. Density and properties Brief outline 2. Sampling from multivariate normal and MLE 3. Sampling distribution and large sample behavior of X and S 4.
Multivariate normal distribution Reading: AMSA: pages 149-200 Multivariate Analysis, Spring 2016 Institute of Statistics, National Chiao Tung University March 1, 2016 1. Density and properties Brief outline
More informationIssues on quantile autoregression
Issues on quantile autoregression Jianqing Fan and Yingying Fan We congratulate Koenker and Xiao on their interesting and important contribution to the quantile autoregression (QAR). The paper provides
More informationFinite Population Sampling and Inference
Finite Population Sampling and Inference A Prediction Approach RICHARD VALLIANT ALAN H. DORFMAN RICHARD M. ROYALL A Wiley-Interscience Publication JOHN WILEY & SONS, INC. New York Chichester Weinheim Brisbane
More informationThe regression model with one fixed regressor cont d
The regression model with one fixed regressor cont d 3150/4150 Lecture 4 Ragnar Nymoen 27 January 2012 The model with transformed variables Regression with transformed variables I References HGL Ch 2.8
More informationDeccan Education Society s FERGUSSON COLLEGE, PUNE (AUTONOMOUS) SYLLABUS UNDER AUTOMONY. SECOND YEAR B.Sc. SEMESTER - III
Deccan Education Society s FERGUSSON COLLEGE, PUNE (AUTONOMOUS) SYLLABUS UNDER AUTOMONY SECOND YEAR B.Sc. SEMESTER - III SYLLABUS FOR S. Y. B. Sc. STATISTICS Academic Year 07-8 S.Y. B.Sc. (Statistics)
More informationTransition Passage to Descriptive Statistics 28
viii Preface xiv chapter 1 Introduction 1 Disciplines That Use Quantitative Data 5 What Do You Mean, Statistics? 6 Statistics: A Dynamic Discipline 8 Some Terminology 9 Problems and Answers 12 Scales of
More informationVariable Selection in Restricted Linear Regression Models. Y. Tuaç 1 and O. Arslan 1
Variable Selection in Restricted Linear Regression Models Y. Tuaç 1 and O. Arslan 1 Ankara University, Faculty of Science, Department of Statistics, 06100 Ankara/Turkey ytuac@ankara.edu.tr, oarslan@ankara.edu.tr
More information4 Statistics of Normally Distributed Data
4 Statistics of Normally Distributed Data 4.1 One Sample a The Three Basic Questions of Inferential Statistics. Inferential statistics form the bridge between the probability models that structure our
More informationREGRESSION DIAGNOSTICS AND REMEDIAL MEASURES
REGRESSION DIAGNOSTICS AND REMEDIAL MEASURES Lalmohan Bhar I.A.S.R.I., Library Avenue, Pusa, New Delhi 110 01 lmbhar@iasri.res.in 1. Introduction Regression analysis is a statistical methodology that utilizes
More informationRegression diagnostics
Regression diagnostics Kerby Shedden Department of Statistics, University of Michigan November 5, 018 1 / 6 Motivation When working with a linear model with design matrix X, the conventional linear model
More informationFor Bonnie and Jesse (again)
SECOND EDITION A P P L I E D R E G R E S S I O N A N A L Y S I S a n d G E N E R A L I Z E D L I N E A R M O D E L S For Bonnie and Jesse (again) SECOND EDITION A P P L I E D R E G R E S S I O N A N A
More informationCross-Validation with Confidence
Cross-Validation with Confidence Jing Lei Department of Statistics, Carnegie Mellon University UMN Statistics Seminar, Mar 30, 2017 Overview Parameter est. Model selection Point est. MLE, M-est.,... Cross-validation
More informationAn Introduction to Multivariate Statistical Analysis
An Introduction to Multivariate Statistical Analysis Third Edition T. W. ANDERSON Stanford University Department of Statistics Stanford, CA WILEY- INTERSCIENCE A JOHN WILEY & SONS, INC., PUBLICATION Contents
More informationReliability of inference (1 of 2 lectures)
Reliability of inference (1 of 2 lectures) Ragnar Nymoen University of Oslo 5 March 2013 1 / 19 This lecture (#13 and 14): I The optimality of the OLS estimators and tests depend on the assumptions of
More informationA NOTE ON ROBUST ESTIMATION IN LOGISTIC REGRESSION MODEL
Discussiones Mathematicae Probability and Statistics 36 206 43 5 doi:0.75/dmps.80 A NOTE ON ROBUST ESTIMATION IN LOGISTIC REGRESSION MODEL Tadeusz Bednarski Wroclaw University e-mail: t.bednarski@prawo.uni.wroc.pl
More informationTESTING FOR CO-INTEGRATION
Bo Sjö 2010-12-05 TESTING FOR CO-INTEGRATION To be used in combination with Sjö (2008) Testing for Unit Roots and Cointegration A Guide. Instructions: Use the Johansen method to test for Purchasing Power
More informationDetection of outliers in multivariate data:
1 Detection of outliers in multivariate data: a method based on clustering and robust estimators Carla M. Santos-Pereira 1 and Ana M. Pires 2 1 Universidade Portucalense Infante D. Henrique, Oporto, Portugal
More informationG. S. Maddala Kajal Lahiri. WILEY A John Wiley and Sons, Ltd., Publication
G. S. Maddala Kajal Lahiri WILEY A John Wiley and Sons, Ltd., Publication TEMT Foreword Preface to the Fourth Edition xvii xix Part I Introduction and the Linear Regression Model 1 CHAPTER 1 What is Econometrics?
More informationYUN WU. B.S., Gui Zhou University of Finance and Economics, 1996 A REPORT. Submitted in partial fulfillment of the requirements for the degree
A SIMULATION STUDY OF THE ROBUSTNESS OF HOTELLING S T TEST FOR THE MEAN OF A MULTIVARIATE DISTRIBUTION WHEN SAMPLING FROM A MULTIVARIATE SKEW-NORMAL DISTRIBUTION by YUN WU B.S., Gui Zhou University of
More informationThe S-estimator of multivariate location and scatter in Stata
The Stata Journal (yyyy) vv, Number ii, pp. 1 9 The S-estimator of multivariate location and scatter in Stata Vincenzo Verardi University of Namur (FUNDP) Center for Research in the Economics of Development
More informationAn overview of applied econometrics
An overview of applied econometrics Jo Thori Lind September 4, 2011 1 Introduction This note is intended as a brief overview of what is necessary to read and understand journal articles with empirical
More informationEXAMINERS REPORT & SOLUTIONS STATISTICS 1 (MATH 11400) May-June 2009
EAMINERS REPORT & SOLUTIONS STATISTICS (MATH 400) May-June 2009 Examiners Report A. Most plots were well done. Some candidates muddled hinges and quartiles and gave the wrong one. Generally candidates
More informationPAijpam.eu ADAPTIVE K-S TESTS FOR WHITE NOISE IN THE FREQUENCY DOMAIN Hossein Arsham University of Baltimore Baltimore, MD 21201, USA
International Journal of Pure and Applied Mathematics Volume 82 No. 4 2013, 521-529 ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version) url: http://www.ijpam.eu doi: http://dx.doi.org/10.12732/ijpam.v82i4.2
More informationFAST CROSS-VALIDATION IN ROBUST PCA
COMPSTAT 2004 Symposium c Physica-Verlag/Springer 2004 FAST CROSS-VALIDATION IN ROBUST PCA Sanne Engelen, Mia Hubert Key words: Cross-Validation, Robustness, fast algorithm COMPSTAT 2004 section: Partial
More informationA nonparametric two-sample wald test of equality of variances
University of Wollongong Research Online Faculty of Informatics - Papers (Archive) Faculty of Engineering and Information Sciences 211 A nonparametric two-sample wald test of equality of variances David
More informationGeochemical Data Evaluation and Interpretation
Geochemical Data Evaluation and Interpretation Eric Grunsky Geological Survey of Canada Workshop 2: Exploration Geochemistry Basic Principles & Concepts Exploration 07 8-Sep-2007 Outline What is geochemical
More informationConfidence Intervals, Testing and ANOVA Summary
Confidence Intervals, Testing and ANOVA Summary 1 One Sample Tests 1.1 One Sample z test: Mean (σ known) Let X 1,, X n a r.s. from N(µ, σ) or n > 30. Let The test statistic is H 0 : µ = µ 0. z = x µ 0
More informationModified confidence intervals for the Mahalanobis distance
Modified confidence intervals for the Mahalanobis distance Paul H. Garthwaite a, Fadlalla G. Elfadaly a,b, John R. Crawford c a School of Mathematics and Statistics, The Open University, MK7 6AA, UK; b
More informationWooldridge, Introductory Econometrics, 4th ed. Chapter 15: Instrumental variables and two stage least squares
Wooldridge, Introductory Econometrics, 4th ed. Chapter 15: Instrumental variables and two stage least squares Many economic models involve endogeneity: that is, a theoretical relationship does not fit
More information6348 Final, Fall 14. Closed book, closed notes, no electronic devices. Points (out of 200) in parentheses.
6348 Final, Fall 14. Closed book, closed notes, no electronic devices. Points (out of 200) in parentheses. 0 11 1 1.(5) Give the result of the following matrix multiplication: 1 10 1 Solution: 0 1 1 2
More informationForecasting Levels of log Variables in Vector Autoregressions
September 24, 200 Forecasting Levels of log Variables in Vector Autoregressions Gunnar Bårdsen Department of Economics, Dragvoll, NTNU, N-749 Trondheim, NORWAY email: gunnar.bardsen@svt.ntnu.no Helmut
More informationIntroduction to statistics. Laurent Eyer Maria Suveges, Marie Heim-Vögtlin SNSF grant
Introduction to statistics Laurent Eyer Maria Suveges, Marie Heim-Vögtlin SNSF grant Recent history at the Observatory Request of something on statistics from PhD students, because of an impression of
More information