Chapter 2 Descriptive Statistics

Chapter 2 Descriptive Statistics Lecture 1: Measures of Central Tendency and Dispersion Donald E. Mercante, PhD Biostatistics May 2010 Biostatistics (LSUHSC) Chapter 2 05/10 1 / 34

Lecture 1: Descriptive Statistics
We begin with a discussion of Descriptive Statistics, which will be followed later in the course by Inferential Statistics. Descriptive statistics generally fall into one of two categories:
1. Measures of Location or Central Tendency
2. Measures of Dispersion or Variability
Measures of Location:
- Arithmetic Mean
- Median
- Mode
- Geometric Mean

Arithmetic Mean
- Uses all of the data in the sample
- Susceptible to extreme values (outliers)
- Generally the preferred measure of location for continuous data
$\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i = \frac{X_1 + X_2 + \cdots + X_N}{N}$
Median
- Easily determined
- Uses at most two observations and the order of the data
- Resistant to extreme values
Median = the $\left(\frac{N+1}{2}\right)$th ordered observation if N is odd; the average of the $\left(\frac{N}{2}\right)$th and $\left(\frac{N}{2}+1\right)$th ordered observations if N is even.
That is, the median is the middle value if N is odd, or the average of the two middle values if N is even.

Mode
Mode = most frequently occurring value(s) in the data set.
- Easily determined
- Not unique
- Uses very little of the data
Data Set: Sample of Birthweights for 20 newborns (Table 2.1)
     i   x_i     i   x_i     i   x_i     i   x_i
     1  3265     6  3323    11  2581    16  2759
     2  3260     7  3649    12  2841    17  3248
     3  3245     8  3200    13  3609    18  3314
     4  3484     9  3031    14  2838    19  3101
     5  4146    10  2069    15  3541    20  2834

Data Summaries
Using R to calculate descriptives:
    > birthwgt
    [1] 3265 3260 3245 3484 4146 3323 3649 3200 3031 2069 2581 2841 3609 2838 3541 2759 3248 3314 3101 2834
Data set sorted in ascending order:
    > sort(birthwgt)
    [1] 2069 2581 2759 2834 2838 2841 3031 3101 3200 3245 3248 3260 3265 3314 3323 3484 3541 3609 3649 4146
    > summary(birthwgt)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
       2069    2840    3246    3167    3363    4146
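As a cross-check on the summary() output, the mean and median can also be computed directly from their definitions. The following is a minimal sketch; the helper `modes` and the vector `toy` are illustrative names and made-up data, not part of the original slides.

```r
# Birthweights from Table 2.1
birthwgt <- c(3265, 3260, 3245, 3484, 4146, 3323, 3649, 3200, 3031, 2069,
              2581, 2841, 3609, 2838, 3541, 2759, 3248, 3314, 3101, 2834)

n <- length(birthwgt)

# Arithmetic mean: sum of the observations divided by N
xbar <- sum(birthwgt) / n

# Median: N = 20 is even, so average the two middle ordered values
x_sorted <- sort(birthwgt)
med <- (x_sorted[n / 2] + x_sorted[n / 2 + 1]) / 2

# Mode(s): hypothetical helper returning every value with the highest count
modes <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[counts == max(counts)])
}

toy <- c(1, 2, 2, 3, 3)  # made-up data with two modes

print(c(mean = xbar, median = med))
print(modes(toy))
```

The exact median is 3246.5; summary() shows 3246 only because it rounds its output for display. The toy vector illustrates why the mode is "not unique": both 2 and 3 occur twice.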

Geometric Mean
Particularly useful for right-skewed data (e.g., serial dilutions), where a log transformation improves the symmetry of the distribution.
Calculation:
$\overline{\log x} = \frac{1}{N}\sum_{i=1}^{N} \log(x_i)$
Geometric Mean = antilog$\left(\overline{\log x}\right)$
- If the log was taken base 10, the antilog is $10^{\overline{\log x}}$.
- If the log was taken base e, the antilog is $e^{\overline{\log x}}$.
R code:
    > exp(mean(log(birthwgt)))
    [1] 3135.317
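The choice of log base does not affect the result, as long as the matching antilog is used. A short sketch (variable names are illustrative) compares the base-e and base-10 routes with the direct definition of the geometric mean as the Nth root of the product:

```r
# Birthweights from Table 2.1
birthwgt <- c(3265, 3260, 3245, 3484, 4146, 3323, 3649, 3200, 3031, 2069,
              2581, 2841, 3609, 2838, 3541, 2759, 3248, 3314, 3101, 2834)

# Natural log, so the antilog is exp()
gm_e <- exp(mean(log(birthwgt)))

# Base-10 log, so the antilog is 10^
gm_10 <- 10^mean(log10(birthwgt))

# Direct definition: Nth root of the product of the observations
gm_prod <- prod(birthwgt)^(1 / length(birthwgt))

print(c(base_e = gm_e, base_10 = gm_10, product = gm_prod))
```

The log route is usually preferable in practice, since the raw product can overflow for larger samples.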

Symmetry in Distribution
If the distribution of the data is symmetric with a single peak, then the Mean, Median, and Mode will coincide. In particular, we will see this is true for data that follow a normal distribution.
If the data distribution is skewed, the Median is the preferred measure of location.

Measures of Spread or Variability
- Range
- Quantiles/Percentiles
- Variance / Standard Deviation
- Coefficient of Variation
Range = R = Max - Min
Note on Calculating Percentiles and Quantiles: for a proportion p, the (100p)th percentile is found as:
- the average of the (Np)th and (Np+1)th values counting from the smallest, if Np is an integer;
- the (1 + largest integer in Np)th value counting from the smallest, if Np is not an integer.

Calculating Percentiles: Example
Let the sample size be N=200. Calculate the 25th, 50th and 75th percentiles. In each case Np is an integer.
25th percentile (p=0.25): Np = 200(.25) = 50. Since Np is an integer, the 25th percentile is the average of the 50th and 51st observations, counting from the smallest observation. That is, it is the average of the (Np)th and (Np+1)th observations.

Calculating Percentiles: Example
Let the sample size be N=200. Calculate the 50th percentile.
50th percentile (p=0.50): Np = 200(.5) = 100. Since Np is an integer, the 50th percentile (median) is the average of the 100th and 101st observations, counting from the smallest value.

Calculating Percentiles: Example
Let the sample size be N=200. Calculate the 75th percentile.
75th percentile (p=0.75): Np = 200(.75) = 150. Since Np is an integer, the 75th percentile is the average of the 150th and 151st observations, counting from the smallest value.

Calculating Percentiles: Example
Let the sample size be N=35. Calculate the 25th, 50th and 75th percentiles.
25th percentile (p=0.25): Np = 35(.25) = 8.75, which is not an integer. Since Np is not an integer, the 25th percentile is found at position 1 + the largest integer in Np. Here Np = 8.75 and the largest integer contained in 8.75 is 8. Add one to this value, i.e., 8 + 1 = 9, so the 25th percentile is the 9th value counting from the smallest value.

Calculating Percentiles: Example
Let the sample size be N=35. Calculate the 50th percentile.
50th percentile (p=0.50): Np = 35(.5) = 17.5, which is not an integer. Since Np is not an integer, the 50th percentile is found at position 1 + the largest integer in Np. Here Np = 17.5 and the largest integer contained in 17.5 is 17. Add one to this value, i.e., 17 + 1 = 18, so the 50th percentile is the 18th value counting from the smallest value.

Calculating Percentiles: Example
Let the sample size be N=35. Calculate the 75th percentile.
75th percentile (p=0.75): Np = 35(.75) = 26.25, which is not an integer. Since Np is not an integer, the 75th percentile is found at position 1 + the largest integer in Np. Here Np = 26.25 and the largest integer contained in 26.25 is 26. Add one to this value, i.e., 26 + 1 = 27, so the 75th percentile is the 27th value counting from the smallest value.
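The position rule used in these examples can be sketched as a small R function. The name `percentile_positions` is a hypothetical helper introduced here for illustration; it returns the position or positions, counted from the smallest value, whose observations are averaged.

```r
# Hypothetical helper: positions (counted from the smallest value)
# selected by the percentile rule described on the slides.
percentile_positions <- function(n, p) {
  np <- n * p
  if (isTRUE(all.equal(np, round(np)))) {
    c(round(np), round(np) + 1)  # Np is an integer: average these two
  } else {
    floor(np) + 1                # otherwise: 1 + largest integer in Np
  }
}

# Reproduce the six worked examples (N = 200 and N = 35)
for (n in c(200, 35)) {
  for (p in c(0.25, 0.50, 0.75)) {
    cat("N =", n, " p =", p,
        " position(s):", percentile_positions(n, p), "\n")
  }
}
```

Given a data vector x, the percentile itself would then be `mean(sort(x)[percentile_positions(length(x), p)])`.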

Sample Variance
The variance and its square root, the standard deviation, are the most widely used measures of variability.
- All of the data is used.
- The mean is used as a reference point.
- Only nonnegative values are possible.
$S^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(X_i - \bar{X}\right)^2$ (definitional form)
$S^2 = \frac{1}{N-1}\left(\sum_{i=1}^{N} X_i^2 - N\bar{X}^2\right)$ (computational form)
Standard deviation is the square root of the variance: $S = \sqrt{S^2}$
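The two forms of the sample variance are algebraically identical. A short derivation, using only the fact that $\sum X_i = N\bar{X}$, shows why:

```latex
\begin{aligned}
\sum_{i=1}^{N}\left(X_i - \bar{X}\right)^2
  &= \sum_{i=1}^{N} X_i^2 \;-\; 2\bar{X}\sum_{i=1}^{N} X_i \;+\; N\bar{X}^2 \\
  &= \sum_{i=1}^{N} X_i^2 \;-\; 2N\bar{X}^2 \;+\; N\bar{X}^2 \\
  &= \sum_{i=1}^{N} X_i^2 \;-\; N\bar{X}^2 .
\end{aligned}
```

Dividing both sides by $N-1$ gives the computational form from the definitional form.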

Sample Variance
Computing the sample variance and standard deviation using R:
    > data<-read.table("c:\\table2_1.txt",header=TRUE)
    > attach(data)
    > var(birthwgt)
    [1] 198323.6
    > sd(birthwgt)
    [1] 445.3353
Using the computational form, $S^2 = \frac{1}{N-1}\left(\sum X_i^2 - N\bar{X}^2\right)$:
    > (1/(length(birthwgt)-1))*(sum(birthwgt^2)-
    + length(birthwgt)*mean(birthwgt)^2)
    [1] 198323.6

Computing the sample variance using R, term by term
$\frac{1}{N-1}$:
    > nm1<-1/(length(birthwgt)-1)
    > nm1
    [1] 0.05263158
$\sum X_i^2$:
    > sum.x2<-sum(birthwgt^2)
    > sum.x2
    [1] 204353260
$N$:
    > n<-length(birthwgt)
    > n
    [1] 20
$\bar{X}^2$:
    > xbar2<-mean(birthwgt)^2
    > xbar2
    [1] 10029256
Computing the sample variance, $S^2 = \frac{1}{N-1}\left(\sum X_i^2 - N\bar{X}^2\right)$:
    > nm1*(sum.x2-n*xbar2)
    [1] 198323.6
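For comparison, the definitional form can be computed in the same way. This is a minimal sketch with illustrative variable names; it checks that the definitional form, the computational form, and R's built-in var() agree.

```r
# Birthweights from Table 2.1
birthwgt <- c(3265, 3260, 3245, 3484, 4146, 3323, 3649, 3200, 3031, 2069,
              2581, 2841, 3609, 2838, 3541, 2759, 3248, 3314, 3101, 2834)

n    <- length(birthwgt)
xbar <- mean(birthwgt)

# Definitional form: squared deviations from the mean, divided by N - 1
s2_def <- sum((birthwgt - xbar)^2) / (n - 1)

# Computational form: sum of squares minus N times the squared mean
s2_comp <- (sum(birthwgt^2) - n * xbar^2) / (n - 1)

# Standard deviation is the square root of the variance
s <- sqrt(s2_def)

print(c(definitional = s2_def, computational = s2_comp,
        builtin = var(birthwgt), sd = s))
```

The computational form was convenient for hand calculation, but it subtracts two large, nearly equal numbers, so in floating-point arithmetic the definitional form is generally the safer choice when the mean is large relative to the spread.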

Coefficient of Variation (CV)
The coefficient of variation is a unitless measure of variability that is the ratio of the standard deviation to the mean.
- Useful for comparing variability of datasets measured in different units
- Only useful for data on a ratio scale of measurement
$CV = \frac{S}{\bar{X}} \times 100\%$
R code for computing the CV:
    > 100*sd(birthwgt)/mean(birthwgt)
    [1] 14.06219
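To illustrate what "unitless" means, the sketch below rescales the birthweights by a constant and shows that the CV is unchanged. The helper name `cv` is illustrative, and the factor of 1/1000 is only an assumed example of a unit conversion (the slides do not state the measurement unit).

```r
# Birthweights from Table 2.1
birthwgt <- c(3265, 3260, 3245, 3484, 4146, 3323, 3649, 3200, 3031, 2069,
              2581, 2841, 3609, 2838, 3541, 2759, 3248, 3314, 3101, 2834)

# Hypothetical helper: coefficient of variation in percent
cv <- function(x) 100 * sd(x) / mean(x)

cv_original <- cv(birthwgt)
cv_rescaled <- cv(birthwgt / 1000)  # assumed unit change, e.g. dividing by 1000

print(c(original = cv_original, rescaled = cv_rescaled))
```

Both the standard deviation and the mean scale by the same factor, so their ratio does not change.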

Graphics: Scatter Plots
R code:
    > plot(fwtright,fwtleft,main="Scatter Plot")

Stem and Leaf Plots
Constructed from an ordered array of data.
    > sort(birthwgt)
    [1] 2069 2581 2759 2834 2838 2841 3031 3101 3200 3245 3248 3260 3265 3314 3323 3484 3541 3609 3649 4146
    > stem(birthwgt)
      The decimal point is 3 digit(s) to the right of the |
      2 | 1
      2 | 68888
      3 | 012223333
      3 | 5566
      4 | 1

Stem and Leaf Plots
    > stem(rnorm(n=200,mean=2.5,sd=.5))
      The decimal point is 1 digit(s) to the left of the |
      12 | 33
      14 | 035
      16 | 00567067788899
      18 | 00144668901123333366
      20 | 011244677890134445566778889
      22 | 0000023444668890023344556677788999
      24 | 1112226777888912335567
      26 | 0000011133555679999124446668
      28 | 001114477801222222556667
      30 | 67123589
      32 | 0034673448
      34 | 2658
      36 | 356

[Figure: Box Plots]

Side by Side Box Plots
Data set Lead.txt from Rosner's CD. R code:
    boxplot(fwt_r~sex)

Box Plots
Based on quartiles of sample data: 25th (Q1), 50th (Q2), and 75th (Q3) percentiles.
Step 1: Draw a number line scale encompassing the range of the data.
Step 2: Compute quartiles Q1, Q2, and Q3 (see section on calculating percentiles).
Step 3: Draw a box above the number line from Q1 to Q3.
Step 4: Draw a vertical hash within the box at Q2.
Step 5: Determine outliers as points farther than 1.5*(Q3-Q1) from the ends of the box.
Step 6: Extend "whiskers" to the largest (smallest) observations that are not outliers.
Step 7: Draw small circles to represent outliers.
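Steps 2, 5, and 6 can be sketched numerically for the birthweight data. This is an illustration under stated assumptions: quantile() is called with type = 2 so that the quartiles follow the percentile rule from the earlier slides (R's own boxplot() computes its box ends slightly differently), and the variable names are illustrative.

```r
# Birthweights from Table 2.1
birthwgt <- c(3265, 3260, 3245, 3484, 4146, 3323, 3649, 3200, 3031, 2069,
              2581, 2841, 3609, 2838, 3541, 2759, 3248, 3314, 3101, 2834)

# Step 2: quartiles (type = 2 averages at integer Np, as in the slides' rule)
q <- unname(quantile(birthwgt, c(0.25, 0.50, 0.75), type = 2))

# Step 5: points farther than 1.5 * (Q3 - Q1) from the box ends are outliers
iqr         <- q[3] - q[1]
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[3] + 1.5 * iqr
is_outlier  <- birthwgt < lower_fence | birthwgt > upper_fence

# Step 6: whiskers reach the most extreme observations that are not outliers
whiskers <- range(birthwgt[!is_outlier])

print(q)
print(c(lower_fence = lower_fence, upper_fence = upper_fence))
print(birthwgt[is_outlier])
print(whiskers)
```

For these data no observation falls outside the fences, so the whiskers extend to the minimum and maximum.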

Box Plots: Example
We will construct a box plot from a sample of n=10 observations taken as a random sample from a larger data set containing n=100 observations.
    y <- sample(y2,10)
    sort(y)
    2.000298 2.179386 2.236754 2.249943 2.342899 2.465442 2.596121 2.741588 2.772203 2.834250
    summary(y)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      2.000   2.240   2.404   2.442   2.705   2.834
    par(plt=c(0.2,0.5,.6,.9))
    boxplot(y,main="Box Plot of y",xlab="y")

Box Plots
The quartiles were easily obtained using R on the previous slide. Alternatively, we could use the method of computing percentiles on the data:
    2.000298 2.179386 2.236754 2.249943 2.342899 2.465442 2.596121 2.741588 2.772203 2.834250
Method of calculating percentiles:
25th percentile: np = 10(.25) = 2.5. Since np is not an integer, add one to the largest integer in 2.5: 2 + 1 = 3. The 25th percentile (Q1) is the 3rd observation from the left (when sorted in ascending order) = 2.236754.
Likewise, Q3 is the 3rd observation from the right end = 2.741588.
To determine the median (Q2): np = 10(.5) = 5. Since np is an integer, Q2 is the average of the (np)th and (np+1)th observations = (2.342899 + 2.465442)/2 = 2.4042.
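The hand-computed Q1 and Q3 (2.236754 and 2.741588) differ slightly from the summary() output (2.240 and 2.705). The reason is that R's default quantile method interpolates between neighboring observations, whereas the rule above picks or averages actual observations. The sketch below compares the two; quantile()'s type = 2 option corresponds to the hand rule, and the variable names are illustrative.

```r
# The n = 10 sample shown on the slide
y <- c(2.000298, 2.179386, 2.236754, 2.249943, 2.342899,
       2.465442, 2.596121, 2.741588, 2.772203, 2.834250)

probs <- c(0.25, 0.50, 0.75)

# Hand rule from the slides: pick an observation, or average two of them
q_rule <- unname(quantile(y, probs, type = 2))

# R's default (type = 7): linear interpolation between order statistics
q_default <- unname(quantile(y, probs))

print(rbind(rule = q_rule, default = q_default))
```

Both conventions give the same median here; they differ only in how Q1 and Q3 are placed between neighboring observations, and the difference shrinks as the sample size grows.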

[Figure: Box Plots]

[Figure: Graphics: Histograms]

[Figure: Histogram with Normal Distribution Curve]

Histograms
Based on the frequency distribution obtained by categorizing a continuous variable.
Step 1: Create categorical ranges (bins) of equal size by dividing the range of the data by the number of bins.
Step 2: Obtain the frequency distribution (number of observations per bin).
Step 3: Plot the histogram with the height of each rectangle proportional to the bin frequency.
R code:
    hist(y)
which can be embellished with titles and axis labels:
    hist(y,main="Histogram of y",xlab="y")

Histogram Frequency Table
    Category (bin)   Frequency
    2.0-2.2          2
    2.2-2.4          3
    2.4-2.6          2
    2.6-2.8          2
    2.8-3.0          1
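The frequency table can be reproduced from the n=10 sample by binning the data explicitly. This is a minimal sketch with illustrative variable names; the bin edges are taken from the table, and hist() is called with plot = FALSE so that only the counts are returned.

```r
# The n = 10 sample from the box plot example
y <- c(2.000298, 2.179386, 2.236754, 2.249943, 2.342899,
       2.465442, 2.596121, 2.741588, 2.772203, 2.834250)

# Step 1: equal-width bins from 2.0 to 3.0 in steps of 0.2
breaks <- seq(2.0, 3.0, by = 0.2)

# Step 2: frequency distribution (observations per bin)
bins <- cut(y, breaks = breaks, right = FALSE)
freq <- table(bins)
print(freq)

# The same counts, as hist() computes them without drawing the plot
h <- hist(y, breaks = breaks, plot = FALSE)
print(h$counts)
```

Calling `hist(y, breaks = breaks)` without `plot = FALSE` would draw the histogram with these bin heights.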

[Figure: Histogram]

[Figure: Data Graphics]

R Code for Generating 4-Panel Graphics
    # Read the birthweight data (Table 2.1) and make its columns visible
    data<-read.table("c:\\table2_1.txt",header=TRUE)
    attach(data)
    par(plt=c(0,0.5,.5,1.0))
    par(mfrow=c(2,2))
    # Top-left panel: index plot of the raw values
    par(fig=c(0.05,.25,.8,.95))
    plot(birthwgt,xlab="")
    # Top-right panel: box plot
    par(fig=c(.25,.45,.8,.95),new=TRUE)
    boxplot(birthwgt,xlab="")
    # Bottom-left panel: histogram
    par(fig=c(0.05,.25,.6,.8),new=TRUE)
    hist(birthwgt,main="",xlab="")
    # Bottom-right panel: normal Q-Q plot with dashed reference line
    par(fig=c(.25,.45,.6,.8),new=TRUE)
    qqnorm(birthwgt,xlab="")
    qqline(birthwgt,lty=2)