Median and IQR The median is the value which divides the ordered data values in half.


STA 666 Fall 2007 Web-based Course Notes 4: Describing Distributions Numerically

Numerical summaries for quantitative variables:
    median and interquartile range (IQR)
    5-number summary
    mean and standard deviation

Median and IQR

The median is the value which divides the ordered data values in half. A general formula for the position of the median is (n+1)/2. Example: n = 5 gives 3 as the position of the median (the 3rd ordered value); n = 6 gives 3.5, which means halfway between the 3rd and 4th ordered values (that is, the average of the two middle values). The median is a measure of the center of a distribution.

The interquartile range (IQR) is the difference between the upper quartile (also called the third quartile or Q3) and the lower quartile (also called the first quartile or Q1). The quartiles are the values which divide the data into quarters. The lower quartile is the 25th percentile. The upper quartile is the 75th percentile. IQR is a measure of the spread of a distribution.

Checkpoint 1: What's another name for the second quartile?

There are several algorithms for finding the quartiles by hand. They do not all give the same result because it's not clear how the lower quartile, for example, should be defined for a variable with n = 7 cases. However, they generally give very similar answers. The one we'll use when we do computations by hand is described below.

Example: Sammy Sosa's and Barry Bonds' home run counts.

Barry Bonds' home run counts (ordered, with positions):

    Position:    1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
    Home runs:  16  19  24  25  25  33  33  34  34  37  37  40  42  45  46  46  49  73

The median position is (18+1)/2 = 9.5, halfway between the 9th and 10th ordered values. Hence, M = median = (34+37)/2; M = 35.5.
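The position rule can be sketched in a few lines of Python. This is only an illustration, not part of the course's own materials: the function name `median_by_position` is made up here, and the list holds Bonds' 18 ordered counts written out in full.

```python
def median_by_position(values):
    """Median via the (n + 1) / 2 position rule on the sorted values."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2                      # 1-based position; ends in .5 when n is even
    lower = ordered[int(position) - 1]          # value at the position rounded down
    upper = ordered[int(position + 0.5) - 1]    # value at the position rounded up
    return position, (lower + upper) / 2        # average of the two (identical when n is odd)

bonds = [16, 19, 24, 25, 25, 33, 33, 34, 34, 37, 37, 40, 42, 45, 46, 46, 49, 73]
position, m = median_by_position(bonds)
print(position, m)  # 9.5 35.5
```

When n is odd, the rounded-down and rounded-up positions coincide, so the function simply returns the middle value.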

Q1 is the median of the 9 values below the position of M; hence Q1 is the 5th value; Q1 = 25. Q3 is the median of the 9 values above the position of M; hence Q3 is the 5th value above M (or 5th from the top); Q3 = 45. IQR = Q3 - Q1 = 45 - 25; IQR = 20.

Note: IQR is a single number; it is not the interval from 25 to 45, it's the length of this interval.

Sammy Sosa's home run counts (ordered, with positions):

    Position:    1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
    Home runs:   4   8  10  15  25  33  36  36  40  40  49  50  63  64  66

Checkpoint 2: Find the median and IQR for Sosa and compare to Bonds.

    Median =        Q1 =        Q3 =        IQR =

5-number summary

While the median and IQR are a useful two-number summary of center and spread, a more complete summary is the 5-number summary: minimum, Q1, median, Q3, maximum.

             min    Q1     M      Q3     max
    Bonds    16     25     35.5   45     73
    Sosa

The 5-number summary divides the data approximately into quarters. The most common use of the 5-number summary is as the basis for creating a boxplot (also called a box-and-whisker plot), a handy graphical tool for comparing two or more distributions.

Boxplots

A boxplot is a graphical display of a 5-number summary with one modification: points which are outliers are identified and plotted individually.

Bonds:
    Upper fence = Q3 + 1.5 IQR = 45 + 1.5(20) = 75
    Lower fence = Q1 - 1.5 IQR = 25 - 1.5(20) = -5

The central box in a boxplot shows Q1, the median, and Q3. The whiskers extend to the most extreme values which are within the fences (larger than -5 and smaller than 75). Any points outside the fences are considered outliers and are plotted individually. For Bonds, all the values are within the fences, so there are no outliers to be plotted individually.

Note: this is not the only definition of an outlier. Perhaps Bonds's 73 is an outlier. This is simply a reasonable definition that a computer can use.
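As a rough sketch of the by-hand method, the Python below computes the 5-number summary and the fences for Bonds. The function names are hypothetical, and the split into halves assumes the rule described above: the quartiles are the medians of the values strictly below and strictly above the median position. Statistical software often uses a different quartile rule and may give slightly different answers.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the by-hand quartile rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2                       # number of values strictly below (or above) the median position
    q1 = median(ordered[:half])         # median of the lower half
    q3 = median(ordered[n - half:])     # median of the upper half
    return ordered[0], q1, median(ordered), q3, ordered[-1]

def fences(q1, q3):
    """Lower and upper fences at 1.5 IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

bonds = [16, 19, 24, 25, 25, 33, 33, 34, 34, 37, 37, 40, 42, 45, 46, 46, 49, 73]
low, q1, m, q3, high = five_number_summary(bonds)
lower_fence, upper_fence = fences(q1, q3)
outliers = [y for y in bonds if y < lower_fence or y > upper_fence]

print((low, q1, m, q3, high))     # (16, 25, 35.5, 45, 73)
print(lower_fence, upper_fence)   # -5.0 75.0
print(outliers)                   # []
```

For an odd number of values, `n // 2` leaves the middle value out of both halves, which matches taking the values below and above the position of M.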

Checkpoint 3: Repeat these calculations for Sosa.

Sosa:
    Upper fence = Q3 + 1.5 IQR =
    Lower fence = Q1 - 1.5 IQR =

[Figure: side-by-side boxplots for Sosa and Bonds; horizontal axis: Home runs, 0 to 80]

Draw what you think Sosa's data would look like as a histogram:

Draw what you think Bonds' data would look like as a histogram:

A boxplot does not show the shape of the distribution as well as a histogram; it cannot show multiple modes, for example. Its big advantage is that it can be used to compare several distributions easily.

Note: if there are a small number of data values in each group (10 to 15 or less), then you should consider making side-by-side dotplots that show the actual values instead. And be very sure that you don't use a boxplot to summarize a data set of 5 values or less.

Mean and standard deviation

The most common numerical summary of a distribution is the mean (a measure of center) and the standard deviation (a measure of spread).

Notation: the data values are generally denoted $y_1, y_2, \ldots, y_n$. The mean is denoted by $\bar{y}$ (pronounced "y-bar"). The formula for the mean is then

$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i,$$

where $\sum$ denotes summation. We often use a shorthand notation, $\bar{y} = \frac{\sum y}{n}$, which is not as precise mathematically, but as long as we understand what it means, we're OK.

The standard deviation is

$$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}},$$

or, in shorthand, $s = \sqrt{\frac{\sum (y - \bar{y})^2}{n-1}}$. This is roughly the average distance of the data values from the mean, which is a logical measure of spread. "Roughly" because it is actually the square root of almost the average squared distance of the data values from the mean ("almost" because the sum is divided by n - 1 rather than n). Taking the square root puts it back in the original units.

Why not simply take the average distance to the mean, $\frac{1}{n}\sum |y_i - \bar{y}|$? This is a legitimate measure of spread, but is not commonly used because the standard deviation has some nice properties for some distributions, one of which we'll discover later.
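To make the formulas concrete, here is a minimal Python sketch that evaluates them for Bonds' 18 home run counts. The variable names are chosen for illustration only; the divisor n - 1 follows the sample standard deviation formula, and the average absolute distance is computed alongside for comparison.

```python
import math

bonds = [16, 19, 24, 25, 25, 33, 33, 34, 34, 37, 37, 40, 42, 45, 46, 46, 49, 73]

n = len(bonds)
y_bar = sum(bonds) / n                                          # mean: sum of the values over n
s = math.sqrt(sum((y - y_bar) ** 2 for y in bonds) / (n - 1))   # standard deviation, divisor n - 1
mean_abs_dev = sum(abs(y - y_bar) for y in bonds) / n           # average distance to the mean

print(round(y_bar, 2), round(s, 2))  # 36.56 13.21
print(round(mean_abs_dev, 2))
```

The standard library function `statistics.stdev` uses the same n - 1 divisor, so it can serve as a cross-check on the hand-written formula.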

    Name        n     Mean     Std. Deviation
    Sosa hr     15    35.93    20.47
    Bonds hr    18    36.56    13.21

Resistance: A measure is said to be resistant if it is not much affected by changes in the numerical values of a small proportion of the observations (i.e., it is resistant to outliers).

Checkpoint 4: Is the median a resistant measure of center? How about the mean? Is the IQR a resistant measure of spread? How about the standard deviation?

Relationship between mean and median

Checkpoint 5: What's the relationship between the mean and median for the following distribution shapes:
    Symmetric
    Skewed to the right
    Skewed to the left

Summarizing a distribution with a measure of center and spread: which measures should you use?

Since the mean and standard deviation are not resistant, they are not appropriate for skewed distributions or distributions with outliers. They're most appropriate for symmetric distributions with no outliers.

    Symmetric distribution with no outliers: mean and standard deviation (possibly, median and IQR also)
    Skewed distributions: median and IQR
    Symmetric distributions with outliers: median and IQR, or mean and standard deviation with and without outliers
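The idea of resistance can be illustrated with a small experiment in Python. The altered data set is purely hypothetical: it replaces Bonds' largest value, 73, with 730 (as if a digit had been mistyped) so that one observation out of 18 changes drastically.

```python
from statistics import mean, median

bonds = [16, 19, 24, 25, 25, 33, 33, 34, 34, 37, 37, 40, 42, 45, 46, 46, 49, 73]
altered = bonds[:-1] + [730]    # hypothetical data: the largest value 73 replaced by 730

print(median(bonds), median(altered))                  # 35.5 35.5
print(round(mean(bonds), 2), round(mean(altered), 2))  # 36.56 73.06
```

The median does not move at all, while the mean roughly doubles. This is the behavior behind the advice to prefer the median and IQR when outliers are present.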

Checkpoint 6: Why use the mean and standard deviation at all if the median and IQR are always appropriate?

Other measures of center and spread:

Trimmed mean: the mean computed after trimming off a percentage of the largest and smallest values. A 5% trimmed mean is the mean after trimming off the 5% of largest and 5% of smallest values. It's a useful compromise between the median (which is the 50% trimmed mean) and the mean. There is a trimmed standard deviation also.

Midrange = average of smallest and largest values, and Range = maximum - minimum.

Checkpoint 7: Which of the above measures are resistant?
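These additional measures can be sketched in Python as follows. The helper name `trimmed_mean` is made up for this illustration, and the rule for how many values to drop (the trimming proportion times n, rounded down) is an assumption; software packages differ on this detail. A 10% trim is used in the example because, under this rounding rule, 5% of 18 values would trim nothing.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from each end (count rounded down)."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be at least 0 and less than 0.5")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)       # assumed rule: round the trim count down
    kept = ordered[k:len(ordered) - k]       # drop the k smallest and the k largest values
    return sum(kept) / len(kept)

bonds = [16, 19, 24, 25, 25, 33, 33, 34, 34, 37, 37, 40, 42, 45, 46, 46, 49, 73]

midrange = (min(bonds) + max(bonds)) / 2     # average of smallest and largest values
data_range = max(bonds) - min(bonds)         # maximum minus minimum

print(trimmed_mean(bonds, 0.10))  # 35.5625
print(midrange, data_range)       # 44.5 57
```

With a 10% trim, one value is removed from each end (16 and 73), and the mean of the remaining 16 values is reported.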