S04-2008
The Over-Reliance on the Central Limit Theorem

Abstract

The objective is to demonstrate the theoretical and practical implications of the central limit theorem. The theorem states that as n approaches infinity, the distribution of the sample mean approaches normality with mean equal to the population mean and variance equal to the population variance divided by n. However, as n approaches infinity, the variance of the mean approaches zero. In practice, the population variance is unknown, and so the sample variance is used to estimate it. In that case, we assume a t-distribution, which requires the assumption that the population is itself normally distributed. In this presentation, we use data visualization to show some problems that can occur when assuming that n is large enough for the sample mean to be normally distributed. In particular, we use PROC SURVEYSELECT to sample data from non-normal distributions and compare the distribution of the sample mean to the distribution of the population.

Introduction

Regression requires the assumption that the residuals are normally distributed. However, most healthcare data follow exponential or gamma distributions because of the presence of extreme outliers. As shown in the previous section, there is a considerable difference between the mean and the median. Linear regression requires large samples to be effective. Power analysis tends to assume that the population is sufficiently homogeneous to be normally distributed. However, because healthcare outcomes tend to follow exponential or gamma distributions, we must consider just how large n has to be before the Central Limit Theorem is realistic.

To examine the issue, we take samples of different sizes to compute the distribution of the sample mean. The following code will compute 100 mean values from sample sizes starting with 5 and increasing to 10,000.
    /* Draw 100 simple random samples (replicates) of size 5 */
    proc surveyselect data=nis.nis_205 out=work.samples
         method=srs n=5 rep=100 noprint;
    run;

    /* Compute the mean length of stay (los) within each replicate */
    proc means data=work.samples noprint;
      by replicate;
      var los;
      output out=out mean=mean;
    run;

Once we have computed the means, we can graph them using kernel density estimation (a smoothed histogram). Figures 1-4 show the distribution of the sample mean compared to the distribution of the population for differing sample sizes. To compute the distribution of the sample mean, we collect 100 different samples using the above code. We compute the mean for the patient length of stay using the National Inpatient Sample.

Figure 1. Sample Mean With Sample=5
Figure 2. Sample Mean With Sample=30
Figure 3. Sample Mean With Sample=100
Figure 4. Sample Mean With Sample=1000
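The sampling step can be repeated for each sample size with a small macro. The following is only a sketch under stated assumptions, not the code used to produce the figures: the macro name mean_dist, the output data set names, and the variable name mean_los are hypothetical, and the use of PROC KDE for the kernel density estimate is an assumption, since the paper does not say which procedure drew the plots. The replicate option is written REPS= here, as in current SAS/STAT documentation, whereas the listing in the paper writes rep=.

```sas
/* Hypothetical macro: mean_dist, means_n<n>, kde_n<n> and mean_los are  */
/* illustrative names, not taken from the paper.                         */
%macro mean_dist(n=, reps=100);

  /* Draw &reps simple random samples, each of size &n */
  proc surveyselect data=nis.nis_205 out=work.samples
       method=srs n=&n reps=&reps noprint;
  run;

  /* One mean of length of stay (los) per replicate */
  proc means data=work.samples noprint;
    by replicate;
    var los;
    output out=work.means_n&n mean=mean_los;
  run;

  /* Kernel density estimate of the replicate means (assumed procedure) */
  proc kde data=work.means_n&n;
    univar mean_los / out=work.kde_n&n;
  run;

%mend mean_dist;

/* Sample sizes used in Figures 1-4 */
%mean_dist(n=5);
%mean_dist(n=30);
%mean_dist(n=100);
%mean_dist(n=1000);
```

Calling the macro with reps=1000 would give the larger number of replicates. The OUT= data set from the UNIVAR statement should hold the grid values and estimated densities, which can then be plotted against a density estimate of los from the full data set.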

In Figure 1, the sample mean peaks slightly to the right of the peak of the population distribution; this shift is much more pronounced in Figure 2. The peak shifts because the sample mean is susceptible to the influence of outliers and the population is very skewed. Because the population is so skewed, the distribution of the sample mean is not entirely normal. As the sample size increases to 100 and then to 1000, the shift from the population peak to the sample peak becomes much more exaggerated. We use the same sample sizes for 1000 replicates (Figures 5-8).

Figure 5. Sample Mean for Sample Size=5 and 1000 Replicates
Figure 6. Sample Mean for Sample Size=30 and 1000 Replicates
Figure 7. Sample Mean for Sample Size=100 and 1000 Replicates
Figure 8. Sample Mean for Sample Size=1000 and 1000 Replicates

Again, the peak of the sample mean is shifted away from the peak of the population distribution because the population is skewed. Moreover, the sample mean is still not normally distributed.
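The visual impression that the sample means are not normal can be supplemented with formal checks. The sketch below is an illustration and not part of the original analysis. It assumes the data set work.out and the variable mean created by the PROC MEANS step shown earlier, with one row per replicate.

```sas
/* Assumes work.out holds one row per replicate with the variable mean, */
/* as produced by the PROC MEANS step in the paper.                     */
proc univariate data=work.out normal;
  var mean;
  /* Q-Q plot against a normal distribution with estimated parameters */
  qqplot mean / normal(mu=est sigma=est);
run;
```

The NORMAL option requests tests for normality, and the default moments table reports skewness. Small p-values or a clearly curved Q-Q plot would be consistent with the figures above.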