G M = = = 0.927

Similar documents
Midrange: mean of highest and lowest scores. easy to compute, rough estimate, rarely used

PROC LOGISTIC: Traps for the unwary Peter L. Flom, Independent statistical consultant, New York, NY

ABSTRACT INTRODUCTION SUMMARY OF ANALYSIS. Paper

Paper: ST-161. Techniques for Evidence-Based Decision Making Using SAS Ian Stockwell, The Hilltop UMBC, Baltimore, MD

Tastitsticsss? What s that? Principles of Biostatistics and Informatics. Variables, outcomes. Tastitsticsss? What s that?

Postal Test Paper_P4_Foundation_Syllabus 2016_Set 1 Paper 4 - Fundamentals of Business Mathematics and Statistics

SESUG 2011 ABSTRACT INTRODUCTION BACKGROUND ON LOGLINEAR SMOOTHING DESCRIPTION OF AN EXAMPLE. Paper CC-01

Mapping Participants to the Closest Medical Center

Chapter 1 - Lecture 3 Measures of Location

Dynamic Determination of Mixed Model Covariance Structures. in Double-blind Clinical Trials. Matthew Davis - Omnicare Clinical Research

Describing distributions with numbers

SAS/STAT 12.3 User s Guide. The GLMMOD Procedure (Chapter)

Descriptive Statistics-I. Dr Mahmoud Alhussami

SAS/STAT 15.1 User s Guide The GLMMOD Procedure

1. Exploratory Data Analysis

AP Final Review II Exploring Data (20% 30%)

Introduction to Statistics for Traffic Crash Reconstruction

Similarity Analysis an Introduction, a Process, and a Supernova Paper

Lecture 3B: Chapter 4, Section 2 Quantitative Variables (Displays, Begin Summaries)

Chapter Four. Numerical Descriptive Techniques. Range, Standard Deviation, Variance, Coefficient of Variation

ADMS2320.com. We Make Stats Easy. Chapter 4. ADMS2320.com Tutorials Past Tests. Tutorial Length 1 Hour 45 Minutes

Lecture 2. Quantitative variables. There are three main graphical methods for describing, summarizing, and detecting patterns in quantitative data:

Automatic Singular Spectrum Analysis and Forecasting Michele Trovero, Michael Leonard, and Bruce Elsheimer SAS Institute Inc.

Chapter 31 The GLMMOD Procedure. Chapter Table of Contents

Chapter 3. Data Description

Chapter 4. Displaying and Summarizing. Quantitative Data

Lecture Slides. Elementary Statistics Eleventh Edition. by Mario F. Triola. and the Triola Statistics Series 3.1-1

Lecture 6: Chapter 4, Section 2 Quantitative Variables (Displays, Begin Summaries)

2011 Pearson Education, Inc

Unit Two Descriptive Biostatistics. Dr Mahmoud Alhussami

Meelis Kull Autumn Meelis Kull - Autumn MTAT Data Mining - Lecture 03

Unit 2: Numerical Descriptive Measures

Chapter. Numerically Summarizing Data Pearson Prentice Hall. All rights reserved

Chapter 2: Tools for Exploring Univariate Data

Describing distributions with numbers

Design of Experiments

Solutions to Selected Questions from Denis Sevee s Vector Geometry. (Updated )

Dosing In NONMEM Data Sets an Enigma

Statistics Boot Camp. Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018

Topic Page: Central tendency

Paper Equivalence Tests. Fei Wang and John Amrhein, McDougall Scientific Ltd.

Unit 2. Describing Data: Numerical

Charity Quick, Rho, Inc, Chapel Hill, NC Paul Nguyen, Rho, Inc, Chapel Hill, NC

Fitting PK Models with SAS NLMIXED Procedure Halimu Haridona, PPD Inc., Beijing

MEASURES OF CENTRAL TENDENCY Shahbaz Baig

CHAPTER 2: Describing Distributions with Numbers

MITOCW ocw f99-lec16_300k

Summarizing Measured Data

Leverage Sparse Information in Predictive Modeling

Mathematics skills framework

Statistics and parameters

Chapter 1: Introduction. Material from Devore s book (Ed 8), and Cengagebrain.com

Last Lecture. Distinguish Populations from Samples. Knowing different Sampling Techniques. Distinguish Parameters from Statistics

Chapter 1: Exploring Data

Unit 27 One-Way Analysis of Variance

Summarizing and Displaying Measurement Data/Understanding and Comparing Distributions

3 Lecture 3 Notes: Measures of Variation. The Boxplot. Definition of Probability

Ex-Ante Forecast Model Performance with Rolling Simulations

Chapter 3 Data Description

Business Mathematics & Statistics (MTH 302)

A Multistage Modeling Strategy for Demand Forecasting

A Little Stats Won t Hurt You

DEPARTMENT OF QUANTITATIVE METHODS & INFORMATION SYSTEMS QM 120. Spring 2008

Topic-1 Describing Data with Numerical Measures

Application of Ghosh, Grizzle and Sen s Nonparametric Methods in. Longitudinal Studies Using SAS PROC GLM

Tornado Inflicted Damages Pattern

MgtOp 215 Chapter 3 Dr. Ahn

Chapter 6 The Standard Deviation as a Ruler and the Normal Model

Chapter 3. Measuring data

Metabolite Identification and Characterization by Mining Mass Spectrometry Data with SAS and Python

STP 420 INTRODUCTION TO APPLIED STATISTICS NOTES

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages:

CHAPTER 1. Introduction

Chapter 1: Introduction. Material from Devore s book (Ed 8), and Cengagebrain.com

SUMMARIZING MEASURED DATA. Gaia Maselli

Error Correcting Codes Prof. Dr. P. Vijay Kumar Department of Electrical Communication Engineering Indian Institute of Science, Bangalore

MATH 117 Statistical Methods for Management I Chapter Three

Generating Half-normal Plot for Zero-inflated Binomial Regression

DESCRIPTIVE STATISTICS

Comparing Measures of Central Tendency *

Chapters 1 & 2 Exam Review

The standard deviation as a descriptive statistic

= n 1. n 1. Measures of Variability. Sample Variance. Range. Sample Standard Deviation ( ) 2. Chapter 2 Slides. Maurice Geraghty

A SAS/AF Application For Sample Size And Power Determination

Data Analyses in Multivariate Regression Chii-Dean Joey Lin, SDSU, San Diego, CA

You are allowed two hours to answer this question paper. All questions are compulsory.

The Law of Averages. MARK FLANAGAN School of Electrical, Electronic and Communications Engineering University College Dublin

a+bi form abscissa absolute value x x x absolute value equation x x absolute value function absolute value inequality x x accurate adjacent angles

Analyzing the effect of Weather on Uber Ridership

Stat 2300 International, Fall 2006 Sample Midterm. Friday, October 20, Your Name: A Number:

After completing this chapter, you should be able to:

Describing Distributions with Numbers

Calculating Confidence Intervals on Proportions of Variability Using the. VARCOMP and IML Procedures

Perinatal Mental Health Profile User Guide. 1. Using Fingertips Software

CS 147: Computer Systems Performance Analysis

1.3.1 Measuring Center: The Mean

Descriptive Data Summarization

Slide 1. Slide 2. Slide 3. Pick a Brick. Daphne. 400 pts 200 pts 300 pts 500 pts 100 pts. 300 pts. 300 pts 400 pts 100 pts 400 pts.

MITOCW watch?v=ztnnigvy5iq

Summarizing Measured Data

Transcription:

PharmaSUG 2016 Paper SP10 I Want the Mean, But not That One! David Franklin, Quintiles Real Late Phase Research, Cambridge, MA ABSTRACT The Mean, as most SAS programmers know it, is the Arithmetic Mean. However, there are situations where it may necessary to calculate different means. This paper first looks at different methods that are widely used from a programmer's perspective, starting with the humble Arithmetic Mean, then proceeding to the other Pythagorean Means, known as the Geometric Mean and Harmonic Mean, before ending with a quick look at the Interquartile Mean and its related Truncated Mean. During the journey there will be examples of data and code given to demonstrate how each method is done and output. INTRODUCTION I want the mean, but not that one! Some of us have heard that phrase, others have not. As a programmer we use procedures such as PROC MEANS or PROC UNIVARIATE, and occasionally write datastep code, to calculate the mean. But these ways of calculating the mean calculates the Arithmetic Mean. However, other methods exist. This paper looks at first that the Arithmetic Mean, then proceeds to the Geometric Mean and Harmonic Mean, before ending with a quick look at the Interquartile Mean and its related Truncated Mean. THE METHODS We have all heard of the humble mean or average, but what exactly is it? When we hear of mean or average we are actually often thinking of the Arithmetic Mean (AM) with defined as the sum of the non missing values in a set divided by the number of non missing values in the same set. The formula is often seen in textbooks as The Arithmetic Mean is the most commonly used and readily understood measure of central tendency, and is best when the data is not skewed (having no extreme outlier values) and the individual data points are not dependent on each other. But did you know there are other means out there? The first is the Geometric Mean (GM) which uses the product of the values as opposed to the arithmetic mean which uses their sum. This formula is often seen as The Geometric Mean should be used whenever the data are interrelated, usually ratios or percentages, and usually involve a time component. As an example, lets look at the growth rate of bacteria over a four hour period (as a ratio): Time 0 60 minutes = 1.2 (100 start, 120 end) Time >60 120 minutes = 1.4 (120 start, 168 end) Time >120 180 minutes = 1.1 (168 start, 185 end) Time >180 240 minutes = 0.4 (185 start, 74 end) G M = 4 1.2 1.4 1.1 0.4 = 4 0.7392 = 0.927 This is different to the arithmetic mean which would be

A M = 1.2+1.4+1.1+0.4 4 = 4.1 4 = 1.025 One shows an average decrease of 7.3% uniformly across each period while the other shows an increase of 2.5%. Which is more correct? 100 x (0.927) 4 = 74, the Geometric Mean. The main reason for using the Geometric Mean over the Arithmetic Mean is that in the example the starting value of the second period is dependent on the first result. The second mean we will look at is the harmonic mean, sometimes called the subcontrary mean) which uses reciprocals, with the formula often seen as This statistic is useful in situations involving data where the majority of the values are distributed uniformly but there there are a few outliers with significantly higher or lower values. The Harmonic Mean gives less significance to high value outliers, providing a truer picture of the average. Lets look at some data: is different to the arithmetic mean which would be 10 H M = 3.094 = 3.23 = 10 + + + + + + + + 21+ 7 1 5 1 2 1 7 1 1 1 2 1 9 1 4 1 5 1 1 A M = 5+2+7+1+2+9+4+5+21+7 10 = 63 10 = 6.3 In general the harmonic average is less biased due to a small number of outliers. The Arithmetic Mean, Geometric Mean and Harmonic Mean methods are considered part of what is often called the Pythagorean means. Another method that is useful when dealing with outlier data, as shown in the data above, is to calculate the Interquartile Mean (IM), calculated using the formula In this method the first and fourth quartiles are removed and the arithmetic mean is done on the records in the second and third quartiles. One useful reason for removing the first and fourth quartiles is that and outliers are removed. Looking at the data above and putting into order we get Using the formula above, the Interquartile Mean is 1 2 2 4 5 5 7 7 9 21 I M = 2 3 * 10 10/4 = 10 2 (4 + 5 + 5 + 7 ) = 4.2 (10/4)+1 = 2 10 7 4 A variation of the Interquartile Mean which incorporates the simplicity of the Arithmetic Mean is the Truncated Mean (TM) where numbers are excluded which are outside specified percentiles. Using this method we get bounds by removing particular observations. If we were to remove the top and bottom observations using a 10% trimmed mean, we would remove 10% of the observations on both sides then calculate the mean from the resulting data. Using our example above and a 10% truncated mean, taking 10% of the observations off each side result in the 1 and 21 are excluded, so the Truncated Mean can be computed T M = 2+2+4+5+5+7+7+9 8 = 41 8 = 5.125 2

As said previously, the Harmonic Mean, Interquartile Mean and Truncated Mean are useful if there are outliers which strongly influence the result. Which is correct, well that depends on the data and the statistician programmers, follow any guidance from the statistician. What does SAS do? This is an interesting question. Using SAS procedures the only mean that are available is the Arithmetic Mean (a number of procedures) and the Geometric Mean via the SURVEYMEANS procedure. Using the SAS functions in SAS 9.4, there is the MEAN, GEOMEAN and HARMEAN functions, but the problem with using these is that you have to have the data transposed to a row in order to use these while our examples have this form, real world we do not have this option. Now we have to write some code using a procedure and a datastep. To help understand the code with the examples above, I am going to use the same data in each example. First we will look at the Arithmetic Mean and the data we used for the Harmonic Mean, to get both of these means: data _dat input x @@ datalines data _null_ set _dat length numx sumx rcplx 8 retain numx sumx rcplx set _dat end=eof if _n_=1 then do numx=0 sumx=0 rcplx=0 if ^missing(x) then do numx+1 sumx+x rcplx=rcplx+(1/x) if eof then do AM=sumx/numx HM=numx/rcplx put NUMX= / SUMX= / AM= / HM= numx=10 sumx=63 AM=6.3 HM=3.2315978456 This is what we got by hand. Because the data used in the Geometric Mean is different than that for the other two, it is impossible to use the same data, so we go back to our original example data 3

data _dat input x @@ datalines 1.2 1.4 1.1 0.4 data _null_ length numx prodx 8 retain numx prodx set _dat end=eof if _n_=1 then do numx=0 prodx=1 if ^missing(x) then do numx+1 prodx=prodx*x if eof then do GM=prodx**(1/numx) put NUMX= / GM= numx=4 GM=0.9272364372 This is the same result we got by hand. For the the Interquartile Mean, we will go back to our original data that we used for the Harmonic Mean, and get the following output: data _dat input x @@ datalines proc sort data=_dat by x data _null_ set _dat nobs=nobs end=eof if ((nobs/4)+1)<=_n_<=(3*nobs/4) then sumx+x if eof then do IM=(2/nobs)*sumx put nobs= / sumx= / IM= nobs=10 sumx=21 IM=4.2 This is the same result we got by hand. For the the Truncated Mean, we go back to the previous data and do a Truncated Mean of 10%, which means we drop 10% of the observations each side, and do 4

data _dat input x @@ datalines 1 2 2 4 5 5 7 7 9 21 proc sort data=_dat by x data _null_ set _dat nobs=nobs end=eof if ((nobs*0.1)+1)<=_n_<=(nobs*0.9) then do numx+1 sumx+x put x= numx= sumx= _n_= if eof then do TM=sumx/numx put numx= / sumx= / TM= to get in the SAS LOG numx=8 sumx=41 TM=5.125 which we got by hand. CONCLUSION Key to using statistical software is knowing what it can and cannot do. It is very interesting that SAS itself does not calculate the Geometric Mean or Harmonic Mean using the MEANS or UNIVARIATE procedure, nor have an option for either the Interquartile Mean or Truncated Mean. Maybe in the next SAS Software Ballot we can ask SAS to apply this option? CONTACT INFORMATION Your comments and questions are valued and encouraged. Contact the author at: Name: Enterprise: E mail: David Franklin Quintiles Real Late Phase Research David.Franklin@Quintiles.com SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. indicates USA registration. Other brand and product names are trademarks of their respective companies. 5