Data Analysis and Statistical Methods Statistics 651

Similar documents
CHAPTER 2: Describing Distributions with Numbers

Describing Distributions with Numbers

Chapter 1: Exploring Data

Data Analysis and Statistical Methods Statistics 651

Business Statistics. Lecture 10: Course Review

Lecture 3: Chapter 3

Objective A: Mean, Median and Mode Three measures of central of tendency: the mean, the median, and the mode.

1.3.1 Measuring Center: The Mean

Chapter 2: Tools for Exploring Univariate Data

Describing distributions with numbers

Data Analysis and Statistical Methods Statistics 651

1.3: Describing Quantitative Data with Numbers

Resistant Measure - A statistic that is not affected very much by extreme observations.

CIVL 7012/8012. Collection and Analysis of Information

2011 Pearson Education, Inc

Chapter 5. Understanding and Comparing. Distributions

Statistics for Managers using Microsoft Excel 6 th Edition

MAT Mathematics in Today's World

Data Analysis and Statistical Methods Statistics 651

CHAPTER 1 Exploring Data

Section 3. Measures of Variation

Descriptive Statistics-I. Dr Mahmoud Alhussami

Math 120 Introduction to Statistics Mr. Toner s Lecture Notes 3.1 Measures of Central Tendency

1/11/2011. Chapter 4: Variability. Overview

Recap: Ø Distribution Shape Ø Mean, Median, Mode Ø Standard Deviations

Chapter 4. Displaying and Summarizing. Quantitative Data

Data Analysis and Statistical Methods Statistics 651

Data Analysis and Statistical Methods Statistics 651

Measures of center. The mean The mean of a distribution is the arithmetic average of the observations:

Describing distributions with numbers

STOR 155 Introductory Statistics. Lecture 4: Displaying Distributions with Numbers (II)

Lecture 6: Chapter 4, Section 2 Quantitative Variables (Displays, Begin Summaries)

- a value calculated or derived from the data.

Lecture Slides. Elementary Statistics Twelfth Edition. by Mario F. Triola. and the Triola Statistics Series. Section 3.1- #

- measures the center of our distribution. In the case of a sample, it s given by: y i. y = where n = sample size.

Chapter 2 Class Notes Sample & Population Descriptions Classifying variables

Elementary Statistics

Data Analysis and Statistical Methods Statistics 651

Descriptive Data Summarization

Chapter 2. Mean and Standard Deviation

Chapter 3. Measuring data

STP 420 INTRODUCTION TO APPLIED STATISTICS NOTES

Statistics I Chapter 2: Univariate data analysis

MATH4427 Notebook 4 Fall Semester 2017/2018

Basics of Experimental Design. Review of Statistics. Basic Study. Experimental Design. When an Experiment is Not Possible. Studying Relations

QUANTITATIVE DATA. UNIVARIATE DATA data for one variable

Math 14 Lecture Notes Ch Percentile

Statistics I Chapter 2: Univariate data analysis

Chapter 24. Comparing Means. Copyright 2010 Pearson Education, Inc.

Measures of Central Tendency. Mean, Median, and Mode

Data Analysis and Statistical Methods Statistics 651

MATH 117 Statistical Methods for Management I Chapter Three

Unit Two Descriptive Biostatistics. Dr Mahmoud Alhussami

Lecture Slides. Elementary Statistics Tenth Edition. by Mario F. Triola. and the Triola Statistics Series. Slide 1

Numerical Measures of Central Tendency

Chapter 6 Group Activity - SOLUTIONS

Unit 2. Describing Data: Numerical

MgtOp 215 Chapter 3 Dr. Ahn

Lecture 2 and Lecture 3

3.2 Numerical Ways to Describe Data

A is one of the categories into which qualitative data can be classified.

MATH 1150 Chapter 2 Notation and Terminology

3 Lecture 3 Notes: Measures of Variation. The Boxplot. Definition of Probability

Description of Samples and Populations

Data Analysis and Statistical Methods Statistics 651

ST Presenting & Summarising Data Descriptive Statistics. Frequency Distribution, Histogram & Bar Chart

2.1 Measures of Location (P.9-11)

Lecture 3B: Chapter 4, Section 2 Quantitative Variables (Displays, Begin Summaries)

STAT 200 Chapter 1 Looking at Data - Distributions

1-1. Chapter 1. Sampling and Descriptive Statistics by The McGraw-Hill Companies, Inc. All rights reserved.

Descriptive Statistics (And a little bit on rounding and significant digits)

Chapter 2: Descriptive Analysis and Presentation of Single- Variable Data

What is statistics? Statistics is the science of: Collecting information. Organizing and summarizing the information collected

Determining the Spread of a Distribution

Math 223 Lecture Notes 3/15/04 From The Basic Practice of Statistics, bymoore

TOPIC: Descriptive Statistics Single Variable

AIM HIGH SCHOOL. Curriculum Map W. 12 Mile Road Farmington Hills, MI (248)

Determining the Spread of a Distribution

are the objects described by a set of data. They may be people, animals or things.

Descriptive statistics

What is Statistics? Statistics is the science of understanding data and of making decisions in the face of variability and uncertainty.

Section 2.3: One Quantitative Variable: Measures of Spread

Last Lecture. Distinguish Populations from Samples. Knowing different Sampling Techniques. Distinguish Parameters from Statistics

Lecture 1 : Basic Statistical Measures

Chapter 7: Random Variables

Chapters 1 & 2 Exam Review

Lecture 1: Descriptive Statistics

Chapter 3. Data Description

Lecture Slides. Elementary Statistics Eleventh Edition. by Mario F. Triola. and the Triola Statistics Series 3.1-1

Tastitsticsss? What s that? Principles of Biostatistics and Informatics. Variables, outcomes. Tastitsticsss? What s that?

Probability and Statistics

Review of Statistics 101

Describing Distributions with Numbers

12 Statistical Justifications; the Bias-Variance Decomposition

Determining the Spread of a Distribution Variance & Standard Deviation

CHAPTER 5: EXPLORING DATA DISTRIBUTIONS. Individuals are the objects described by a set of data. These individuals may be people, animals or things.

Units. Exploratory Data Analysis. Variables. Student Data

Survey on Population Mean

MATH 137 : Calculus 1 for Honours Mathematics. Online Assignment #2. Introduction to Sequences

Data Analysis and Statistical Methods Statistics 651

Transcription:

Data Analysis and Statistical Methods Statistics 651 http://www.stat.tamu.edu/~suhasini/teaching.html Boxplots and standard deviations Suhasini Subba Rao

Review of previous lecture In the previous lecture we defined the population mean and median and the sample mean and median. Both the mean and median are measures of centrality in a population or sample. It is also useful to have measures of spread. Measures of spread give us information about how spread out the population is and it will also give information about how accurate the sample mean will be. The more spread out the population the less accurate the sample mean is likely to be (see variance explanation lecture4.pdf, pages 1-2). The most simple measure of spread is to use the range of the data (smallest and largest number in the data set). But as illustrated by the 1

extreme examples (given in lecture 3) Sample 1: 1, 0, 0, 0, 0, 0 Sample 2: 1, 0, 0, 0, 0, 2 Sample 3: 1, 0, 0, 0, 0, 20 using the range does not necessarily represent the typical spread. An alternative measure of spread is the Interquartile Range. 2

Comparing multiple samples using boxplots Boxplots are excellent for comparing two samples. In JMP you can compare multiple samples through boxplots. Instructions are given in my JMP notes on how to to do this. Please do this: Make boxplots for the sample (since we only have the data from class) distribution of the total number of M&Ms for each type of M&M. 3

The Height data Sample 1 Sample 2 Sample 3 Sample 4 68 65 67 69 74 62 65 62 68 60 64 71 61 66 68 72 61 66 65 66 Average 66.4 63.8 65.8 68 Recall that the spread of the individual heights is far greater compared with the spread of the sample averages. We now define the variance, which is a measure of spread that is very useful in statistics. We will show that the variance of the averages can be completely determined from the variance of the individual observations and size of sample. 4

Measures of spread - The Variance The variance is one of the standard measures of spread. population x 1,..., x N we define the population variance as Given the σ 2 = 1 N N (x i µ) 2, µ = population mean. k=1 In words: this is the sum of all the squared differences between all possible outcomes and the population mean. In reality it will never be observed (since the population is unknown) and it has to be estimated instead. 5

Given the sample X 1,..., X n we define the sample variance as s 2 = 1 (n 1) n (X i X) 2, i=1 given the sample we not know the mean µ, so we estimate it using the sample mean X (swop µ with X). The reason we divide by (n 1) and not n when obtaining the sample variance is a quirk of statistics. We do it to reduce the bias. I don t much care whether you divide by n or (n 1). What you need to remember is that the difference between the sample variance and the population is that the population variance is calculated using the entire population, whereas the sample variance is calculated using the sample. 6

Example - the variance (by hand) We are given the sample 5, 4, 3, 0, 3. The sample mean x is 3. Calculating the sample variance: x i frequency x i x (x i x) 2 5 5 3 = 2 (5 3) 2 = 4 4 4 3 = 1 (4 3) 2 = 1 3 3 3 = 0 (3 3) 2 = 0 0 0 3 = 3 (0 3) 2 = 9 3 3 3 = 0 (3 3) 2 = 0 sum = 14 The sample variance is s 2 1 n = (x i x) 2 = 14 (n 1) 5 1 = 14 4. i=1 7

The variance does not change with shifts The variance is invariant to shifts in the data. Suppose we shift the data by 20 (this could be because of a change in units of observations), so we observe 25, 24, 23, 20, 23. The mean is shifted by 20 (it is 20 + 3 = 23), but the spread stays the same: x i frequency x i x (x i x) 2 5 5 3 = 2 (5 3) 2 = 4 4 4 3 = 1 (4 3) 2 = 1 3 3 3 = 0 (3 3) 2 = 0 0 0 3 = 3 (0 3) 2 = 9 3 3 3 = 0 (3 3) 2 = 0 sum = 14 x i frequency x i x (x i x) 2 25 25 23 = 2 (25 23) 2 = 4 24 24 23 = 1 (24 23) 2 = 1 23 23 23 = 0 (23 23) 2 = 0 20 20 23 = 3 (20 23) 2 = 9 23 23 23 = 0 (23 23) 2 = 0 sum = 14 The example illustrates that the variance is simply measuring the squared distance from the mean (and is invariant to shift transformations). 8

The population, sample and variance 3,4,6,9,10,15,44,16 3,16 Sample 1 Set of all measurements (population) 6,9,3 Sample 2 Population variance In this case it is σ 2 = 1 8 [(3 13.375)2 + (4 13.375) 2 +... + (16 13.375) 2 ] = 153.4844. 9

Observations Like the sample mean the sample variance is random and varies from sample to sample. Sample variance For Sample 2: it is s 2 2 = 1 3 1 [(6 6)2 + (9 6) 2 + (3 6) 2 ] = 9. For Sample 1: s 2 1 = 1 2 1 [(3 9.5)2 + (16 9.5) 2 ] = 84.5. Similar to the sample mean, the sample variance is also random (changes according to the sample). Often the sample variance will underestimate the population variance (this will have an impact on the statistical analysis). 10

When we start learning statistical methods in this class, we will work under the assumption that the variance of the population known. In many situations this is an unrealistic assumption, and we need to estimate the variance from the data. Rule for later on: If prior information means that the population variance is known (a rare case). Then inference about the population mean uses the normal distribution (the reason for normal distribution will come later). If the population variance is unknown and we use the same data to estimate both the mean and variance. Then we use the t-distribution or F-distribution. 11

The variance does not measure distance! Example Consider the observations 4, 3, 2, 1, 1, 2, 3, 4. The mean 1 is zero and the sample variance is 8 1 (42 +3 2 +2 2 +1 2 +1 2 +2 2 +3 2 +4 2 ) = 8.57 The range of the data is 4, 4, which has length 8. The sample variance = 8.57, is larger than the range itself! This is because the variance squares the distances. To overcome this we square root the variance, to give the standard deviation. This is a true measure of average distance. 12

The standard deviation The standard deviation is the square root of the variance. That is σ = 1 N (x i µ) N 2 N = size of population k=1 The sample standard deviation is s = 1 (n 1) n (X i X) 2 n = size of sample i=1 The standard deviation has the advantage that it has the same set of units as the data. 13

Comparing standard deviations and IQR For the majority of the analysis we do in this class, we be using the standard deviation, but it is useful to understand how outliers may influence the standard deviation: Observe that the majority of the points are about zero but one value is 5. This pulls the mean and the standard deviation to the right. However, it does not effect the median or IQR. 14

Standard deviation and IQR Here we illustrate the differences between the two measures of spread; IQR and standard deviations. We observe if all the data takes the same value, the IQR and standard deviation are both zero. The standard deviation is only zero if all the data is the same. However, we see on the second plot that the IQR can still be zero if most of the values (but not all) are the same. 15

The empirical rule (this just a rule of thumb) A general rule of thumb of data (which is not always true) is that Approximately 68% of the observations lie inside the interval [ x s, x+s]. Approximately 95% of the observations lie inside the interval [ x 2s, x + 2s]. Approximately 99.7% of the observations lie inside the interval [ x 3s, x + 3s]. This is only a rule of thumb and it only applies to data, which is normally distributed. Don t take it seriously. The take home message is that that the majority of the data will be within three standard deviations of the mean. 16

Interpretation of the empirical rule Some maths background: What do we mean by the interval [a, b]? These are two points of the time line. Starting with a and ending at b. Example, what does the interval [3, 4.5] mean? What we mean by Approximately 68% of the observations lie inside the interval [ x s, x + s]? We calculate s and X from the data. For example, it could be X = 3 and s = 1. For this example, the interval [ x s, x + s] is [3 1, 3 + 1], and we count the number of observations in this interval. The empirical rule states that for many data sets 68% of the observations lie in this interval. Similarly [ x 2s, x+2s] = [3 2, 3+2] and [ x 3s, x+3s] = [3 3, 3+3]. 17

Standard deviation: Graphical illustration Mean is 29 and standard deviation is 6.6. The green line [29 6.6, 29 + 6.6] = [22.4, 35.6] is one standard deviation from the mean. Observe that 7/11 = 63.3% of the points are within one standard deviation of the mean. 18

Where you may have previously encountered the standard deviation When you go to the doctors office, the technicians take several measurements, such as weight, height, blood pressure and blood samples. How do they determine whether it is normal (this does not refer to the distribution). Consider the following example: Suppose that you have a blood sample taken. The mean for the reading is 20 and you have 14, is that normal? The raw difference of 14 20 is unformative unless you know the general spread of the readings. Suppose, the standard deviation is 8. Then you are within (14 20)/8 = 6/8 standard deviations of the mean. The reading is relatively close to the mean. 19

In addition if blood samples followed the empirical rule, since your reading is within one standard deviation of the mean then your reading is within 68% of the mean Another example: A friend has their blood sample taken, he has the reading 45. Is that normal? His reading is 25 from the mean and (45 20)/8 = 3.125 standard deviations from the mean. His reading is in the far right tail of the distribution. The empirical rule tells us that for many data sets 99.7% of observations are within 3 standard deviations of the mean). 20