ROBUST AND EFFICIENT ESTIMATION OF THE MODE OF CONTINUOUS DATA: THE MODE AS A VIABLE MEASURE OF CENTRAL TENDENCY


David R. Bickel
Medical College of Georgia
Office of Biostatistics and Bioinformatics
Fifteenth St., AE-337, Augusta, GA 39-49
URL: http://www.mcg.edu/research/biostat/bickel.html
E-mail: dbickel@mail.mcg.edu, bickel@mailaps.org

Key words: Robust estimation; robust mode; mode estimator; average value; measure of location; asymmetry; transformation; efficiency.

ABSTRACT

Although a natural measure of the central tendency of a sample of continuous data is its mode (the most probable value), the mean and median are the most popular measures of location due to their simplicity and ease of estimation. The median is often used instead of the mean for asymmetric data because it is closer to the mode and is insensitive to extreme values in the sample. However, the mode itself can be reliably estimated by first transforming the data into approximately normal data by raising the values to a real power, and then estimating the mean and standard deviation of the transformed data. With this method, two estimators of the mode of the original data are proposed: a simple estimator based on estimating the mean by the sample mean and the standard deviation by the sample standard deviation, and a more robust estimator based on estimating the mean by the median and the standard deviation by the standardized median absolute deviation. Both of these mode estimators were tested using simulated data drawn from normal (symmetric), lognormal (asymmetric), and Pareto (very asymmetric) distributions. The latter two distributions were chosen to test the generality of the method since they are not power transforms of the normal distribution. Each of the proposed estimators of the mode has a much lower variance than the mean and median for the two asymmetric distributions. When outliers were added to the simulations, the more robust of the two proposed mode estimators had a lower bias and variance than the median for the asymmetric distributions, especially when the level of contamination approached the 50% breakdown point. It is concluded that the mode is often a more reliable measure of location than the mean or median for asymmetric data. The proposed estimators also performed well relative to previous estimators of the mode. While different estimators are better under different conditions, the proposed robust estimator is reliable for a wide variety of distributions and contamination levels.

D. R. Bickel, submitted to InterStat.

1. INTRODUCTION

Although many measures of location have been developed in recent years, researchers still mostly use the mean and median to describe the location or average value of continuous data, largely because those measures are easy to understand and estimate. The concept of the mode is also easily understood and is attractive as the most probable value, but reliable methods of estimating the mode of continuous data are not widely known. Most investigators describe the average of continuous data by the mean except when the data are highly skewed, highly kurtotic, or contaminated with outliers, in which case the median is often used. The median is indeed a good choice in the latter two cases, when the mean is very unreliable or even undefined. In the case of asymmetric data, the median is often preferred to the mean since the median is almost always closer to the mode, but in many cases a better approach would be to estimate the mode itself. (Dharmadhikari and Joag-dev (1988) give the conditions under which the median is between the mode and the mean.) The mode is the most intuitive measure of central tendency since the mode represents the most typical value of the data. However, previous estimators of the mode have suffered from high bias, low efficiency (high variance), or a sensitivity to outliers, and these limitations have probably contributed to the neglect of the mode as a description of the central tendency. To enable a wider use of the mode, it will be demonstrated herein that the mode can be estimated with lower bias and even higher efficiency and robustness to outliers than the median for asymmetric, continuous data.

The mean, median, and mode are all measures of location $\mu(X)$ of a continuous random variable $X$ in the sense that they satisfy $\mu(aX + b) = a\mu(X) + b$ for all $a > 0$ and $b$, $\mu(-X) = -\mu(X)$, and $X \ge 0 \Rightarrow \mu(X) \ge 0$ (Staudte and Sheather 1990). The sample mean and sample median are simple nonparametric estimators of the mean and median of the underlying continuous distribution. For symmetric distributions, the sample mean and sample median estimate the same estimand since the mean and median are equal in this case. For asymmetric distributions, the sample mean and sample median estimate different, but known, estimands. The mode, however, has no natural estimator. In this paper, previous estimators of the mode are compared with estimators designed to have low variance.

The proposed strategy of mode estimation consists of the following steps:
1. transform the data such that the transformed data is approximately normal;
2. estimate the mean and standard deviation of the transformed data;
3. assuming that the transformed data were drawn from a normal distribution, use the estimated mean and standard deviation of the transformed data to estimate the mode of the original data.

The transformation used herein is the simple power transformation $y(x;\alpha) = x^{\alpha}$, where $y$ is called the transformed variable, $x$ is called the original variable, and $\alpha$ is a nonzero real constant. We require that $x > 0$, but the transformation can be generalized by $y(x;\alpha,\beta) = (x + \beta)^{\alpha}$ to allow negative values of $x$ (Box & Tiao 1992). Thus, given a data set $\{x_i\}_{i=1}^{n}$, the transformed data set is $\{y_i(\alpha)\}_{i=1}^{n}$, where $y_i(\alpha) = x_i^{\alpha}$. The value of $\alpha$ is chosen to make the transformed data as close as possible to normally-distributed data. Although $y$ is not exactly normal, it is constructed to be approximately normal through the choice of $\alpha$, so it can be considered normal for the purpose of estimation. If $y$ is normally distributed with parameters $\bar{y}$ and $\sigma$, then the probability density function (PDF) of $y$ is

$$f_y(y;\bar{y},\sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(y-\bar{y})^2}{2\sigma^2}\right] \qquad (1)$$

and thus the PDF of $x$ is

$$f_x(x;\bar{y},\sigma,\alpha) = f_y(x^{\alpha};\bar{y},\sigma)\left|\frac{\partial y}{\partial x}\right| = \frac{|\alpha|\,x^{\alpha-1}}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(x^{\alpha}-\bar{y})^2}{2\sigma^2}\right]. \qquad (2)$$
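The change of variables behind Eq. (2) can be checked numerically. The following is a minimal sketch in plain Python; the function name `power_normal_pdf` and the parameter values in the check are illustrative assumptions, not taken from the paper.

```python
import math

def power_normal_pdf(x, ybar, sigma, alpha):
    """Density of x > 0 implied by Eq. (2) when y = x**alpha is treated as N(ybar, sigma**2)."""
    if x <= 0.0:
        raise ValueError("the power transformation requires x > 0")
    y = x ** alpha
    jacobian = abs(alpha) * x ** (alpha - 1.0)          # |dy/dx|
    normal_part = math.exp(-0.5 * ((y - ybar) / sigma) ** 2) / (math.sqrt(2.0 * math.pi) * sigma)
    return jacobian * normal_part

# Illustrative parameter values (assumed): with ybar six standard deviations above
# zero, almost no normal mass falls at y < 0, so the density integrates to about 1.
step = 0.01
grid = [step * i for i in range(1, 6001)]
area = sum(power_normal_pdf(x, ybar=3.0, sigma=0.5, alpha=0.5) * step for x in grid)
print(area)
```

With $\alpha = 1$ the Jacobian factor is 1 and the function reduces to the normal density of Eq. (1).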

Many distributions can be approximated by Eq. (2), in which $\alpha$ quantifies the skewness, with $\alpha = 1$ for zero skewness. The mode of $x$, denoted by $M$, is the value of $x$ that maximizes its PDF. Requiring that $[\partial f_x(x;\bar{y},\sigma,\alpha)/\partial x]_{x=M} = 0$ gives the quadratic $\alpha M^{2\alpha} - \alpha\bar{y}M^{\alpha} - (\alpha-1)\sigma^2 = 0$ in $M^{\alpha}$, from which we find that

$$M = \left\{\frac{1}{2}\left[\bar{y} + \sqrt{\bar{y}^2 + 4\sigma^2\,\frac{\alpha-1}{\alpha}}\right]\right\}^{1/\alpha}. \qquad (3)$$

(Note that $\alpha = 1$ implies that $M = \bar{y}$, as expected from the fact that the mode of a normal distribution equals its mean.) Therefore, $M$ can be estimated by replacing $\bar{y}$ and $\sigma$ with estimates of the mean and standard deviation from the transformed sample $\{y_i(\alpha)\}_{i=1}^{n}$. If $\alpha$ is positive and so low that the argument of the square root is negative, as sometimes occurs for small samples with high skewness, then the estimate of the mode is the minimum value of the sample. This method of estimating the mode is described in detail in Section 2. Its bias, efficiency, and resistance to outliers were studied by simulation, as reported in Section 3.

2. ESTIMATORS OF THE MODE

2.1 Standard parametric estimator

A simple implementation of the mode estimation technique of Section 1 is the following algorithm:
1. transform the data using the value of $\alpha$ that maximizes the standard correlation coefficient between the ordered transformed data and the expected order statistics for a normal distribution;
2. estimate the mean and standard deviation of the transformed data using the sample mean and sample standard deviation;

3. in Eq. (3), substitute for $\bar{y}$ and $\sigma$ the sample mean and sample standard deviation of the transformed data in order to estimate the mode of the original data.

The first step involves computing Pearson's correlation coefficient between $\{y_i(\alpha)\}_{i=1}^{n}$, ordered such that $y_1(\alpha) \le y_2(\alpha) \le \cdots \le y_n(\alpha)$, and $\{z_i\}_{i=1}^{n}$, the expected order statistics given by the cumulative distribution function (CDF) $\Phi$ of the standard normal distribution:

$$z_i := \Phi^{-1}\left(\frac{i - 1/2}{n}\right). \qquad (4)$$

The correlation coefficient can be expressed as

$$r(\alpha) = \frac{s_+^2(\alpha) - s_-^2(\alpha)}{s_+^2(\alpha) + s_-^2(\alpha)}, \qquad (5)$$

where

$$s_\pm(\alpha) = \delta\left(\frac{z}{\delta z} \pm \frac{y(\alpha)}{\delta y(\alpha)}\right). \qquad (6)$$

The operator $\delta$ gives the sample standard deviation of its argument; e.g.,

$$\delta y(\alpha) = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left[y_i(\alpha) - \frac{1}{n}\sum_{j=1}^{n} y_j(\alpha)\right]^2}. \qquad (7)$$

Let $\hat{\alpha}$ be the value of $\alpha$ for which $r(\alpha)$ reaches a maximum. There is only one maximum for data from single-modal distributions since $r(\alpha)$ decreases monotonically as the transformed data becomes less and less normal. Thus, $\hat{\alpha}$ is easy to compute numerically; the Appendix gives an algorithm that can find the maximum. The transformation $y_i(\hat{\alpha}) = x_i^{\hat{\alpha}}$ ensures that the transformed data is as close as possible to following a normal distribution. Then the sample mean and sample standard deviation of $\{y_i(\hat{\alpha})\}_{i=1}^{n}$ are used in Eq. (3) to estimate $M$, the mode of the distribution for which $\{x_i\}_{i=1}^{n}$ is a sample.
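The three steps of the standard parametric estimator can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the author's implementation: the maximization uses a golden-section search in place of the Appendix algorithm, the search is restricted to a hypothetical positive range `alpha_range`, and all function names are invented for the example.

```python
import math
from statistics import NormalDist, mean, stdev

def normal_scores(n):
    """Expected normal order statistics z_i = Phi^{-1}((i - 1/2) / n), as in Eq. (4)."""
    std_normal = NormalDist()
    return [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

def pearson_r(u, v):
    """Pearson's correlation coefficient of two equal-length sequences."""
    mu, mv = mean(u), mean(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

def golden_section_max(f, lo, hi, tol=1e-4):
    """Maximize a single-peaked function on [lo, hi] (swapped in for the Appendix algorithm)."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c, d = b - g * (b - a), a + g * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:                      # the peak lies in [c, b]
            a, c, fc = c, d, fd
            d = a + g * (b - a)
            fd = f(d)
        else:                            # the peak lies in [a, d]
            b, d, fd = d, c, fc
            c = b - g * (b - a)
            fc = f(c)
    return 0.5 * (a + b)

def mode_from_eq3(ybar, sigma, alpha, fallback):
    """Eq. (3); returns `fallback` when the argument of the square root is negative."""
    disc = ybar ** 2 + 4.0 * sigma ** 2 * (alpha - 1.0) / alpha
    if disc < 0.0:
        return fallback
    return (0.5 * (ybar + math.sqrt(disc))) ** (1.0 / alpha)

def standard_parametric_mode(data, alpha_range=(0.05, 3.0)):
    """SPM sketch: choose alpha by Eq. (5), then plug the sample mean and sd into Eq. (3)."""
    x = sorted(data)                     # the power transformation requires x > 0
    z = normal_scores(len(x))
    alpha = golden_section_max(lambda a: pearson_r([v ** a for v in x], z), *alpha_range)
    y = [v ** alpha for v in x]
    return mode_from_eq3(mean(y), stdev(y), alpha, fallback=x[0])
```

For a sample whose ordered values are exactly linear in the normal scores, the search settles near $\alpha = 1$ and the estimate is close to the sample mean, in line with the note following Eq. (3).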

This estimator of the mode, called the standard parametric mode (SPM), has advantages in its simplicity and its efficiency in the case that the transformed sample $\{y_i(\alpha)\}_{i=1}^{n}$ is approximately normal. However, it is not robust to outliers since if the value of a single $x_i$ is sufficiently large, then the selected value of $\alpha$ can be brought past any bound and the estimation can thereby be rendered worthless. The next subsection modifies the algorithm to make it resistant to outliers.

2.2 Robust parametric estimator

The steps in Section 1 for computing the mode become highly robust to contamination in the data when they take this form:
1. transform the data using the value of $\alpha$ that maximizes a robust correlation coefficient between the ordered transformed data and the expected order statistics for a normal distribution;
2. estimate the mean and standard deviation of the transformed data using the median and standardized median absolute deviation (MAD);
3. in Eq. (3), substitute for $\bar{y}$ and $\sigma$ the median and MAD of the transformed data in order to estimate the mode of the original data.

The robust correlation coefficient is based on a generalization of the linear correlation coefficient (Huber, 1981), with the $\delta$ of Eq. (6) denoting a general measure of dispersion or scale, rather than the standard deviation. Again assuming that $y_1(\alpha) \le y_2(\alpha) \le \cdots \le y_n(\alpha)$, we use the correlation coefficient given by

$$R(\alpha) = \frac{S_+^2(\alpha) - S_-^2(\alpha)}{S_+^2(\alpha) + S_-^2(\alpha)}, \qquad (8)$$

where

$$S_\pm(\alpha) = \Delta\left(\frac{z}{\Delta z} \pm \frac{y(\alpha)}{\Delta y(\alpha)}\right). \qquad (9)$$

Here, the operator $\Delta$ yields the MAD, normalized such that $\Delta y = \sigma$ if $y$ is normally distributed with standard deviation $\sigma$. For example,

$$\Delta y(\alpha) = \frac{1}{\Phi^{-1}(3/4)}\,\operatorname{med}_i\left|y_i(\alpha) - \operatorname{med}_j y_j(\alpha)\right| \approx 1.4826\,\operatorname{med}_i\left|y_i(\alpha) - \operatorname{med}_j y_j(\alpha)\right|, \qquad (10)$$

where med is the sample median operator, so that $\operatorname{med}_j y_j(\alpha)$ is the median of $\{y_j(\alpha)\}_{j=1}^{n}$. Since $R(\alpha)$ quantifies the normality of the transformed data, the value $\hat{\alpha}$ is found that maximizes $R(\alpha)$, i.e., $R(\hat{\alpha}) = \max_{\alpha} R(\alpha)$. For a normal distribution, the mean is equal to the median and the standard deviation is equal to the MAD, so $\operatorname{med}_i y_i(\hat{\alpha})$ and $\Delta y(\hat{\alpha})$ are substituted for $\bar{y}$ and $\sigma$ in Eq. (3) to estimate the mode $M$.

The robustness of this estimator of the mode, termed the robust parametric mode (RPM), can be quantified by its finite-sample breakdown point, the minimum proportion of outliers in a sample that could make an estimator unbounded (Donoho and Huber 1983). For example, for a sample of size $n$, the breakdown point of the median is $(n+1)/(2n)$ for odd $n$ or $1/2$ for even $n$ since at least half of the points in the sample would have to be replaced with sufficiently high or sufficiently low values before the median would be higher or lower than any bound. Being based on the median, the mode estimator described in this subsection has the same breakdown point, which is the highest breakdown point possible for a measure of location. The mode estimator of Section 2.1, on the other hand, is less robust, since the sample mean and sample standard deviation each has a breakdown point of only $1/n$, entailing that a single outlier can make them arbitrarily large.

2.3 Grenander's estimators

The estimators of the mode introduced above are called parametric estimators since they make use of the parameters of the family of normal distributions. A simple class of nonparametric estimators is Grenander's (1965) family of estimators of the mode of $\{x_i\}_{i=1}^{n}$, with $x_1 \le x_2 \le \cdots \le x_n$:

$$M_{p,k} = \frac{\frac{1}{2}\sum_{i=1}^{n-k}(x_{i+k} + x_i)\,(x_{i+k} - x_i)^{-p}}{\sum_{i=1}^{n-k}(x_{i+k} - x_i)^{-p}}, \qquad 1 < p < k, \qquad (11)$$

where $p$ and $k$ are real numbers, fixed for each estimator. $M_{p,k}$ has a breakdown point of only $(k+1)/n$, which approaches 0 as $n \to \infty$ (Bickel, b), so, like the estimator of Subsection 2.1, $M_{p,k}$ is not robust to outliers. $M_{p,k}$ is compared to the parametric estimators in Section 3.

2.4 Robust direct estimators

Grenander's estimators are direct in the sense that they do not require density estimation. A class of direct estimators of the mode that are much more robust to outliers is based on the shortest half sample, the subsample of half of the original data with the minimum difference between the minimum and maximum values. The midpoint of the shortest half sample (location of the least median of squares) and the mean of the shortest half sample (Rousseeuw and Leroy, 1987) are highly biased estimators of the mode (Bickel, a). A low-bias mode estimator, the half-sample mode (HSM), can be computed by repeatedly taking shortest half samples within shortest half samples (Bickel, b). A closely related direct, nonparametric estimator is the half-range mode (HRM), which is based on the modal interval, the interval of a certain width that contains more values than any other interval of that width. The HRM is found by computing modal intervals within modal intervals, where each modal interval has a width equal to half the range of the observations within the previous modal interval, beginning with a modal interval containing the entire sample; Bickel (a) provides a detailed algorithm for this estimator. All of the estimators of this subsection have the same breakdown point as the median and are even more robust than the median in the sense that they are unaffected by any sufficiently high, finite outlier (Bickel, a).

2.5 Nonparametric density estimators

The nonparametric estimators of Sections 2.3 and 2.4 are direct, but there are also nonparametric estimators of the mode that depend on estimation of the probability density. Grenander (1965) and Dharmadhikari and Joag-dev (1988) note that the mode can be estimated as the argument for which a smoothed empirical density function (EDF), an estimate of the PDF, reaches a maximum. The EDF based on a normal kernel function is

$$\hat{f}(x) = \frac{1}{nh\sqrt{2\pi}}\sum_{i=1}^{n}\exp\left[-\frac{(x - x_i)^2}{2h^2}\right]. \qquad (12)$$

Smaller values of the smoothing parameter $h$ yield lower biases, but higher variances, in the mode estimates. Based on optimal estimates of the PDF, Silverman (1986) recommended setting $h$ equal to $0.9\,S\,n^{-1/5}$, where $S$ is the minimum of the sample standard deviation and the normal-consistent interquartile range. This recommendation is followed in the simulations below, except that the interquartile range is replaced with the MAD, made consistent with the normal distribution by multiplying by 1.4826. The MAD is preferred for these studies with large numbers of outliers since its asymptotic breakdown point is twice that of the interquartile range (Rousseeuw and Croux, 1993). The mode is estimated by the empirical density function mode (EDFM), denoted by $\hat{M}$ and defined such that $\hat{f}(\hat{M}) = \max_x \hat{f}(x)$.

3. SIMULATIONS

The methods of Section 2 were used to estimate the mode for samples generated from a normal distribution, which is symmetric (zero skewness), a lognormal distribution, which is moderately asymmetric (finite positive skewness), and a Pareto distribution, which is extremely asymmetric (infinite skewness). The normal distribution has a mean parameter of 6 and a standard deviation parameter of 1, with a median of 6 and a mode of 6; the lognormal distribution has a mean parameter of 1 and a standard deviation parameter of 1, with a median of $e \approx 2.72$ and a mode of 1; the Pareto distribution has a PDF of $\frac{1}{2}x^{-3/2}$ for $x \ge 1$ and 0 for $x < 1$, with a median of 4 and a mode of 1. From each of these distributions, samples of $n$ random numbers each were generated for three sample sizes $n$. For each sample, the mode was estimated by the SPM of Subsection 2.1, by the RPM of Subsection 2.2, by two Grenander estimators $M_{p,k}$ of Subsection 2.3 (one of them with $k = 3$), by the HRM of Subsection 2.4, and by the EDFM of Subsection 2.5. The sample means and medians were also computed for comparison. The bias, defined as the mean of the estimates minus the value of the estimand (e.g., 6 for the mode of the normal distribution), and the variance of the estimates are displayed in Tables 1-3 for each estimator and sample size, based on the simulations without contamination.
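The bias and variance computation described above can be sketched as follows for the lognormal case, using the sample mean, the sample median, and Grenander's estimator of Eq. (11). The sample size, the number of replicates, and the values of `p` and `k` are illustrative assumptions, since they are not taken from the text, and the helper names are invented for the example.

```python
import math
import random
from statistics import mean, median, variance

def grenander_mode(data, p, k):
    """Grenander's direct estimator of Eq. (11); assumes x[i+k] > x[i] for all i (no ties)."""
    x = sorted(data)
    num = den = 0.0
    for i in range(len(x) - k):
        w = (x[i + k] - x[i]) ** (-p)    # weight favouring tightly packed points
        num += 0.5 * (x[i + k] + x[i]) * w
        den += w
    return num / den

def bias_and_variance(estimator, sampler, estimand, n, replicates, rng):
    """Bias (mean of the estimates minus the estimand) and variance over simulated samples."""
    estimates = [estimator([sampler(rng) for _ in range(n)]) for _ in range(replicates)]
    return mean(estimates) - estimand, variance(estimates)

rng = random.Random(0)

def draw_lognormal(r):
    """One draw from the lognormal distribution with both parameters equal to 1."""
    return r.lognormvariate(1.0, 1.0)

# Each estimator is compared with its own estimand: mean e^{1.5}, median e, mode 1.
cases = [
    ("mean", mean, math.exp(1.5)),
    ("median", median, math.e),
    ("Grenander", lambda s: grenander_mode(s, p=2.0, k=3), 1.0),
]
for name, estimator, estimand in cases:
    b, v = bias_and_variance(estimator, draw_lognormal, estimand, n=50, replicates=200, rng=rng)
    print(f"{name:10s} bias={b:.3f} variance={v:.3f}")
```

The same harness applies to any estimator that maps a list of observations to a number, so the other mode estimators of Section 2 could be passed in the same way.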

Estimator                   Normal distr.        Lognormal distr.      Pareto distr.
Mean                        .77 (.5795)          .73546 (.439)         N/A (N/A)
Median                      .466 (.837)          .75 (.59698)          .65 (6.39938)
Standard parametric mode    .368 (.459)          .4738 (.8785)         .543767 (.6455)
Robust parametric mode      .538 (.9475)         .34578 (.6567)        .46777 (.35)
Grenander M_{p,3}           .69599 (.6737)       .656 (.3887)          .47 (5.46)
Half-range mode             .69 (.3385)          .74653 (.79)          .83833 (.6493)
Empirical density mode      .59 (.874)           .9566 (.333)          .64963 (.556)

Table 1. Bias (variance) of seven estimators of location, based on simulations of samples with observations drawn from one of three distributions. N/A indicates that estimates of the mean are unstable since the Pareto distribution has an infinite population mean; the second Grenander estimator is not included here, since the sample size is too small for that estimator. In the original table, the smallest absolute bias or variance of the mode estimators is marked for each distribution.

Estimator                   Normal distr.        Lognormal distr.      Pareto distr.
Mean                        .7595 (.755)         .77665 (.35677)       N/A (N/A)
Median                      .386 (.56868)        .35984 (.39)          .36 (.67445)
Standard parametric mode    .5738 (.669)         .649 (.373943)        .46835 (.555)
Robust parametric mode      .4438 (.64788)       .6989 (.79789)        .573 (.685)
Grenander M_{p,3}           .84455 (.4634)       .756 (.5499)          .93797 (.56443)
Grenander M_{p,k} (second)  .4834 (.3979)        .945 (3.87444)        .3 (.33)
Half-range mode             .98699 (.877)        .33984 (.3335)        .95489 (.78498)
Empirical density mode      .434379 (.74397)     .6544 (.59535)        .87 (.758)

Table 2. Bias (variance) of eight estimators of location, based on simulations of samples with observations drawn from one of three distributions. N/A indicates that estimates of the mean are unstable since the Pareto distribution has an infinite population mean. In the original table, the smallest absolute bias or variance of the mode estimators is marked for each distribution.

Estimator                   Normal distr.        Lognormal distr.      Pareto distr.
Mean                        .36734 (.95839)      .939 (.355888)        N/A (N/A)
Median                      .35 (.7484)          .6 (.634)             .48599 (.67678)
Standard parametric mode    .56764 (.9474)       .8966 (.974)          .43899 (.9655)
Robust parametric mode      .4354 (.8849)        .4563 (.66873)        .3349 (.386)
Grenander M_{p,3}           .37 (.)              .9 (.9399)            .97745 (.55889)
Grenander M_{p,k} (second)  .3753 (.359484)      .3373 (3.8784)        .834 (4.8345)
Half-range mode             .563 (.64674)        .5975 (.8978)         .544 (.73768)
Empirical density mode      .479948 (.4846)      .35788 (.67536)       .77875 (.43567)

Table 3. Bias (variance) of eight estimators of location, based on simulations of samples with observations drawn from one of three distributions. N/A indicates that estimates of the mean are unstable since the Pareto distribution has an infinite population mean. In the original table, the smallest absolute bias or variance of the mode estimators is marked for each distribution.

Because of their high breakdown points, the median, RPM, HRM, and EDFM have meaning even in the presence of many outliers, so they were applied to samples generated as described above with four levels of contamination: 5%, 10%, 15%, and 20% for the two smaller sample sizes, and 10%, 20%, 30%, and 40% for the largest sample size. The level of contamination is the probability that a given value in the sample was replaced by a value drawn from a normal distribution with a mean equal to the 99.99th percentile of the main distribution (normal, lognormal, or Pareto) and with a standard deviation equal to a hundredth of the interquartile range of the main distribution divided by the interquartile range of the standard normal distribution. Thus, the normal distribution was contaminated by $N(9.719, (0.01)^2)$, the lognormal distribution by $N(112.058, (0.0293)^2)$, and the Pareto distribution by $N(10^8, (0.1054)^2)$. Higher levels of contamination could not be used for the smaller sample sizes because that would sometimes lead to more than half of the values of a sample being drawn from the outlier distribution, which would break down any estimator. The bias and variance of the estimators for each contamination level and each main distribution are displayed in Figs. 1-3, one figure per sample size. Figs. 4-6 display the PDFs estimated from a sample of values from each distribution and the same sample with 40% contamination. The PDFs were estimated by Eq. (2), using the parameters $\alpha$, $\bar{y}$, and $\sigma$ estimated as described in Section 2.2. The value of $x$ yielding the maximum value of each estimated PDF is the RPM.
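The contamination scheme and the robust parametric mode of Section 2.2 can be sketched together in plain Python. Several choices here are assumptions made for illustration: the exponent is selected by a grid search over a hypothetical list `alphas` rather than by the Appendix algorithm, the sample size and contamination level in the demonstration are arbitrary, and the function names are invented.

```python
import random
from statistics import NormalDist, median

STD_NORMAL = NormalDist()
MAD_SCALE = 1.0 / STD_NORMAL.inv_cdf(0.75)      # about 1.4826, as in Eq. (10)

def mad(values):
    """Median absolute deviation, scaled to equal sigma for normal data."""
    m = median(values)
    return MAD_SCALE * median([abs(v - m) for v in values])

def robust_r(y, z):
    """Correlation of Eqs. (8)-(9), with the MAD as the scale operator."""
    sy, sz = mad(y), mad(z)
    s_plus = mad([b / sz + a / sy for a, b in zip(y, z)])
    s_minus = mad([b / sz - a / sy for a, b in zip(y, z)])
    return (s_plus ** 2 - s_minus ** 2) / (s_plus ** 2 + s_minus ** 2)

def robust_parametric_mode(data, alphas):
    """RPM sketch: pick alpha on a grid by Eq. (8), then use median and MAD in Eq. (3)."""
    x = sorted(data)                             # the power transformation requires x > 0
    n = len(x)
    z = [STD_NORMAL.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    alpha = max(alphas, key=lambda a: robust_r([v ** a for v in x], z))
    y = [v ** alpha for v in x]
    center, scale = median(y), mad(y)
    disc = center ** 2 + 4.0 * scale ** 2 * (alpha - 1.0) / alpha
    if disc < 0.0:
        return x[0]                              # fall back to the sample minimum
    return (0.5 * (center + disc ** 0.5)) ** (1.0 / alpha)

def contaminate(sample, level, main_quantile, rng):
    """Replace each value, with probability `level`, by a draw from the outlier normal."""
    iqr_main = main_quantile(0.75) - main_quantile(0.25)
    iqr_std = STD_NORMAL.inv_cdf(0.75) - STD_NORMAL.inv_cdf(0.25)
    outlier_mean = main_quantile(0.9999)         # 99.99th percentile of the main distribution
    outlier_sd = iqr_main / 100.0 / iqr_std
    return [rng.gauss(outlier_mean, outlier_sd) if rng.random() < level else v
            for v in sample]

rng = random.Random(0)
main = NormalDist(6.0, 1.0)                      # the normal distribution of Section 3
clean = [rng.gauss(6.0, 1.0) for _ in range(200)]
dirty = contaminate(clean, 0.20, main.inv_cdf, rng)
alphas = [0.05 * j for j in range(1, 61)]        # assumed search grid for alpha
print(median(dirty), robust_parametric_mode(dirty, alphas))
```

For the normal main distribution, `contaminate` reproduces the outlier distribution stated above, with mean near 9.719 and standard deviation 0.01.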

4. DISCUSSION

The bias and variance of the two proposed parametric estimators of the mode were low for all three distributions, even though the lognormal and Pareto distributions cannot be converted into a normal distribution by the simple power transformation. For the contaminated normal and lognormal distributions and for the Pareto distribution with and without contamination, Figs. 4-6 show large discrepancies between the theoretical distributions and the distributions estimated using Eq. (2). The fact that the estimates of the mode were affected little by those discrepancies suggests that the parametric estimators can be successfully applied not only to power transforms of the normal distribution, but also to a much more general class of contaminated single-modal distributions.

Tables 1-3 give an indication of the relative performance of the location estimators considered in the absence of contamination. The two mode estimators of Grenander (1965) have the highest bias and variance of the mode estimators for all distributions considered. The SPM performs consistently better than the other estimators of the mode, except that the RPM and HRM tend to be less biased for the Pareto distribution, and, at one of the sample sizes, the EDFM has a lower bias for the normal distribution. Based on these results, the SPM is a good choice of a mode estimator for uncontaminated data, except when the distribution of the data has extreme skewness or long tails, in which cases the RPM or HRM would be better.

When a particular data set may be contaminated with outliers, the selection of an estimator of the mode for that data can be informed by the estimator properties displayed in Figs. 1-3. While the EDFM has the lowest absolute bias for the contaminated normal distribution, the RPM has the best bias for the other two distributions, except for the Pareto distribution at the sample size of Fig. 3, in which case the HRM has a lower bias and variance. Thus, the RPM is appropriate for many cases of data with outliers and moderate to high skewness. The HRM is better in some instances of high sample size and very high skewness, but in these cases, the computation of the HRM is slow and similar results can instead be obtained using the HSM, which can be computed very quickly. The EDFM appears to work well for contaminated normal distributions.

The RPM is recommended as a general-purpose estimator of the mode since it was often the best mode estimator and, when it was not, it never performed much worse than the best mode estimator in the simulations of this study, except for the uncontaminated normal distribution. If the data are known to be approximately normal and uncontaminated, then the sample mean would have the lowest bias and variance and would be approximately equal to the mode since the normal distribution is symmetric. Without this knowledge, the RPM is a safe estimator of the mode: although it has less efficiency in some cases, it has low bias and variance in many cases and is never affected much by outliers when the number of outliers is less than the number of good values. Its much greater resistance to high levels of outliers makes the mode a viable alternative to the median as a robust measure of central tendency: the bias and variance of the sample median consistently increase with the level of contamination, a reflection of the fact that the median does not reject outliers, unlike robust estimators of the mode (Bickel, a) such as the RPM. Figs. 1-3 show that the median can be much less reliable than the RPM for contaminated, skewed distributions.

Modifications of the SPM and RPM may lead to improved estimation; I make three suggestions:
1. Using a trimmed mean to estimate the mean of transformed data would have better efficiency in the uncontaminated normal case than using the median and better robustness to outliers than using the mean, so the resulting estimator of the mode would have characteristics intermediate to those of the SPM and RPM.
2. The method of Section 1 can be implemented with criteria for selecting the transformation exponent other than those proposed in Section 2, e.g., the exponent could be defined as the value of $\alpha$ for which the Kolmogorov-Smirnov distance (Press et al., 1996) between the EDF of the transformed data and a normal distribution is minimized.
3. The power transformation was chosen for its simplicity, but other transformations to approximate normality could give better results.
Exploring the properties of these modified estimators and other generalizations of the proposed technique requires further research.

BIBLIOGRAPHY

Bickel, D. R. (a), Robust estimators of the mode and skewness of continuous data, Computational Statistics and Data Analysis (in press); available from preprint server: http://www.mathpreprints.com/math/preprint/bickel/75./3/
Bickel, D. R. (b), Simple estimator of the mode for continuous distributions: The mode as a more robust measure of location, under review.
Box, G. E. P. and G. C. Tiao (1992), Bayesian Inference in Statistical Analysis, John Wiley and Sons (New York).
Dharmadhikari, S. and K. Joag-dev (1988), Unimodality, Convexity, and Applications, Academic Press (New York).
Donoho, D. L. and P. J. Huber (1983), The notion of a breakdown point, in Festschrift for Erich L. Lehmann, edited by P. J. Bickel, K. A. Doksum, J. L. Hodges Jr., Wadsworth: Belmont, CA.
Grenander, U. (1965), Some direct estimates of the mode, Annals of Mathematical Statistics 36, 131-138.
Huber, P. J. (1981), Robust Statistics, John Wiley & Sons (New York).
Press, W. H., S. A. Teukolsky, W. T. Vetterling, B. P. Flannery (1996), Numerical Recipes in C, Cambridge University Press: Cambridge.
Rousseeuw, P. J. and C. Croux (1993), Alternatives to the median absolute deviation, Journal of the American Statistical Association 88, 1273-1283.
Silverman, B. W. (1986), Density Estimation for Statistics and Data Analysis, Chapman and Hall (New York).
Staudte, R. G. and S. J. Sheather (1990), Robust Estimation and Testing, Wiley-Interscience: New York.

APPENDIX: Algorithm to find the transformation exponent, $\hat{\alpha}$

Let $\rho(\alpha)$ be a function, such as $r(\alpha)$ or $R(\alpha)$, with a single maximum, $\rho(\hat{\alpha})$, and no plateaus (monotonically increasing for $\alpha < \hat{\alpha}$ and monotonically decreasing for $\alpha > \hat{\alpha}$). To compute $\hat{\alpha}$, first find three values $\alpha_1$, $\alpha_2$, and $\alpha_3$ that satisfy $\alpha_1 < \alpha_2 < \alpha_3$, $\rho(\alpha_1) < \rho(\alpha_2)$, and $\rho(\alpha_3) < \rho(\alpha_2)$; this ensures that $\alpha_1 < \hat{\alpha} < \alpha_3$. For the simulations in this paper, fixed values of $\alpha_1$, $\alpha_2$, and $\alpha_3$ were used as initial guesses, and $\alpha_1$ was decreased or $\alpha_3$ was increased as needed to ensure that $\rho(\alpha_1) < \rho(\alpha_2)$ and $\rho(\alpha_3) < \rho(\alpha_2)$. Non-integral values were used to avoid the numerical difficulties of evaluating $\rho(0)$ in the following algorithm. The algorithm ArgumentForMax($\alpha_1$, $\alpha_3$) returns $\hat{\alpha}$ to within the desired level of precision (a fixed precision was used in this study).

ArgumentForMax($A_1$, $A_5$) [$\rho(A_1) < \rho(\hat{\alpha})$ and $\rho(A_5) < \rho(\hat{\alpha})$ must be true]

1. If $A_5 - A_1$ is within the desired precision, then return $(A_1 + A_5)/2$ and stop; otherwise, proceed to Step 2.

2. Divide the domain $[A_1, A_5]$ into four equally-spaced intervals, $[A_1, A_2]$, $[A_2, A_3]$, $[A_3, A_4]$, $[A_4, A_5]$, which satisfy

$$A_2 - A_1 = A_3 - A_2 = A_4 - A_3 = A_5 - A_4.$$

3. Compute $d_1$, $d_2$, $d_3$, $d_4$, the difference in $\rho(\alpha)$ across each of the four intervals, letting $d_i = \rho(A_{i+1}) - \rho(A_i)$ for $i = 1, 2, 3, 4$.

4. If there is an interval number $j$ for which $d_j \ge 0$ and $d_{j+1} \le 0$, then it is known that $\rho(A_j) < \rho(\hat{\alpha})$ and $\rho(A_{j+2}) < \rho(\hat{\alpha})$ and thus that the recursive call ArgumentForMax($A_j$, $A_{j+2}$) satisfies the conditions needed to return $\hat{\alpha}$, so return ArgumentForMax($A_j$, $A_{j+2}$); otherwise, proceed to Step 5.

5. If $\rho(A_1) < \rho(A_5)$, then return ArgumentForMax($A_4$, $A_5$); otherwise, return ArgumentForMax($A_1$, $A_2$).
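The recursion of ArgumentForMax can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used for the simulations: the function name `argument_for_max`, the parameter `tol` and its default value, the strict stopping comparison, and the toy criterion at the end are all introduced here for illustration, and the non-strict comparisons in Step 4 are one reading of the procedure.

```python
from typing import Callable


def argument_for_max(rho: Callable[[float], float],
                     a1: float, a5: float, tol: float = 1e-2) -> float:
    """Approximate the maximizer of a unimodal, plateau-free rho on (a1, a5).

    Precondition (as in the appendix): rho(a1) and rho(a5) are both
    strictly below the maximum of rho. `tol` is a hypothetical tolerance.
    """
    # Step 1: stop once the bracket is narrower than the tolerance.
    if a5 - a1 < tol:
        return (a1 + a5) / 2.0
    # Step 2: five equally spaced grid points A_1..A_5 (0-based list here).
    h = (a5 - a1) / 4.0
    a = [a1 + k * h for k in range(4)] + [a5]
    r = [rho(x) for x in a]
    # Step 3: change in rho across each of the four intervals.
    d = [r[k + 1] - r[k] for k in range(4)]
    # Step 4: a rise followed by a fall brackets the maximum.
    for j in range(3):
        if d[j] >= 0 and d[j + 1] <= 0:
            return argument_for_max(rho, a[j], a[j + 2], tol)
    # Step 5: rho is monotone on the grid, so the maximum lies in an
    # end interval; pick the end with the larger value of rho.
    if r[0] < r[4]:
        return argument_for_max(rho, a[3], a[4], tol)
    return argument_for_max(rho, a[0], a[1], tol)


# Toy criterion (an assumption for illustration) with its maximum at 0.3.
alpha_hat = argument_for_max(lambda a: -(a - 0.3) ** 2, -2.0, 2.0, tol=1e-3)
```

Each recursive call shrinks the bracket to one half or one quarter of its width, so the number of evaluations of the criterion grows only logarithmically with the inverse of the tolerance. The sketch re-evaluates the criterion at the bracket endpoints on every call for clarity; caching those values would avoid the repeated work.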

Figures of D. R. Bickel

Fig. 1. Bias and variance of location estimators for samples of n values from the normal, lognormal, and Pareto distributions. Panels: estimator bias and estimator variance for normal, lognormal, and Pareto data. Legend: median, parametric mode, half-range mode, empirical density mode.

Fig. 2. Bias and variance of location estimators for samples of n values from the normal, lognormal, and Pareto distributions. Panels: estimator bias and estimator variance for normal, lognormal, and Pareto data. Legend: median, parametric mode, half-range mode, empirical density mode.

Fig. 3. Bias and variance of location estimators for samples of n values from the normal, lognormal, and Pareto distributions. The bias of the median for the Pareto distribution with 4% contamination is 34.45, which is too high to plot here. Panels: estimator bias and estimator variance for normal, lognormal, and Pareto data. Legend: median, parametric mode, half-range mode, empirical density mode.