Measures of center

The mean

The mean of a distribution is the arithmetic average of the observations:

$$\bar{x} = \frac{x_1 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

The median

The median is the midpoint of a distribution: the number $M$ such that half the observations are smaller and half are larger.

How to find the median

Suppose the observations are $x_1, x_2, \ldots, x_n$.

1. Arrange the data in increasing order and let $x_{(i)}$ denote the $i$th smallest observation.
2. If the number of observations $n$ is odd, the median is the center observation in the ordered list: $M = x_{((n+1)/2)}$
3. If the number of observations $n$ is even, the median is the average of the two center observations in the ordered list: $M = \frac{x_{(n/2)} + x_{(n/2+1)}}{2}$

Numerical Description of Data, Jan 7, 2004

Measures of center

Examples:

Data set 1:

Observations $x_1, \ldots, x_9$: 2, 4, 3, 4, 6, 5, 4, -6, 5

In increasing order, $x_{(1)}, \ldots, x_{(9)}$: -6, 2, 3, 4, 4, 4, 5, 5, 6

There is an odd number of observations, so the median is

$$M = x_{((n+1)/2)} = x_{(5)} = 4.$$

The mean is given by

$$\bar{x} = \frac{2 + 4 + 3 + 4 + 6 + 5 + 4 + (-6) + 5}{9} = \frac{27}{9} = 3.$$

Data set 2:

Observations $x_1, \ldots, x_{10}$: 2.3, 8.8, 3.9, 4.1, 6.4, 5.9, 4.2, 2.9, 1.3, 5.1

In increasing order, $x_{(1)}, \ldots, x_{(10)}$: 1.3, 2.3, 2.9, 3.9, 4.1, 4.2, 5.1, 5.9, 6.4, 8.8

There is an even number of observations, so the median is

$$M = \frac{x_{(n/2)} + x_{(n/2+1)}}{2} = \frac{x_{(5)} + x_{(6)}}{2} = \frac{4.1 + 4.2}{2} = 4.15.$$

The mean is given by

$$\bar{x} = \frac{2.3 + 8.8 + 3.9 + 4.1 + 6.4 + 5.9 + 4.2 + 2.9 + 1.3 + 5.1}{10} = \frac{44.9}{10} = 4.49.$$
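The two examples can be checked with a short Python sketch. The function names `mean` and `median` and the variable names `data1` and `data2` are chosen here for illustration and do not come from the slides; the median follows the odd/even rule stated for the ordered list.

```python
def mean(values):
    """Arithmetic average of the observations."""
    return sum(values) / len(values)


def median(values):
    """Median by the rule for odd and even n on the ordered list."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the center observation x_((n+1)/2), which is index n // 2 (0-based).
        return ordered[mid]
    # Even n: average of x_(n/2) and x_(n/2+1).
    return (ordered[mid - 1] + ordered[mid]) / 2


# Data sets from the examples.
data1 = [2, 4, 3, 4, 6, 5, 4, -6, 5]
data2 = [2.3, 8.8, 3.9, 4.1, 6.4, 5.9, 4.2, 2.9, 1.3, 5.1]

print(mean(data1), median(data1))
print(round(mean(data2), 2), round(median(data2), 2))
```

The results for data set 2 are rounded before printing only because floating-point sums rarely come out as exact decimals.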

Mean versus median

The mean is easy to work with algebraically, while the median is not. The mean is sensitive to extreme observations, while the median is more robust.

Example:

[Figure: the observations marked on a number line from 0 to 10]

The original mean and median are

$$\bar{x} = \frac{0 + 1 + 2}{3} = 1 \quad \text{and} \quad M = x_{((n+1)/2)} = 1.$$

The modified mean and median are

$$\bar{x} = \frac{0 + 1 + 10}{3} = 3\tfrac{2}{3} \quad \text{and} \quad M = x_{((n+1)/2)} = 1.$$

Moving the largest observation from 2 to 10 changes the mean but leaves the median unchanged.

If the distribution is exactly symmetric, then the mean equals the median. In a skewed distribution, the mean is further out in the longer tail than the median. The median is preferable for strongly skewed distributions, or when outliers are present.
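As a small illustration of this sensitivity, the sketch below recomputes the example with Python's standard `statistics` module. The variable names `original` and `modified` are illustrative only.

```python
import statistics

original = [0, 1, 2]
modified = [0, 1, 10]  # the largest observation is moved from 2 to 10

# The mean shifts with the extreme observation ...
mean_original = statistics.mean(original)
mean_modified = statistics.mean(modified)

# ... while the median stays at the center observation.
median_original = statistics.median(original)
median_modified = statistics.median(modified)

print(mean_original, median_original)
print(mean_modified, median_modified)
```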

Measures of spread

Example: Monthly returns on two stocks

[Figure: histograms for Stock A and Stock B; horizontal axis: Daily returns (in %), vertical axis: Frequency]

          Stock A   Stock B
Mean       4.95      4.82
Median     4.99      4.68

The distributions of the two stocks have approximately the same mean and median, but stock B is more volatile and thus more risky.

Measures of center alone are an insufficient description of a distribution and can be misleading. The simplest useful numerical description of a distribution consists of both a measure of center and a measure of spread.

Common measures of spread are the quartiles and the interquartile range, and the standard deviation.

Quartiles

Quartiles divide the data into 4 equal parts.

Lower (or first) quartile $Q_L$: median of all observations below the median $M$ in the ordered list
Middle (or second) quartile $M = Q_M$: median of all observations
Upper (or third) quartile $Q_U$: median of all observations above the median $M$ in the ordered list
Interquartile range: $\mathrm{IQR} = Q_U - Q_L$, the distance between the upper and lower quartiles

How to find the quartiles

1. Arrange the data in increasing order and find the median $M$.
2. Find the median of the observations to the left of $M$; this is the lower quartile $Q_L$.
3. Find the median of the observations to the right of $M$; this is the upper quartile $Q_U$.

Example:

Observations $x_1, \ldots, x_9$: 2, 4, 3, 4, 6, 5, 4, -6, 5

In increasing order, $x_{(1)}, \ldots, x_{(9)}$: -6, 2, 3, 4, 4, 4, 5, 5, 6

$Q_L$ is the median of $\{-6, 2, 3, 4\}$: $Q_L = 2.5$
$Q_U$ is the median of $\{4, 5, 5, 6\}$: $Q_U = 5$
$\mathrm{IQR} = 5 - 2.5 = 2.5$
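The three steps can be sketched in Python as follows. The helper names `median` and `quartiles` are illustrative. For an even number of observations the slides do not spell out the split, so this sketch assumes the lower and upper halves are simply the first and last n/2 ordered values. Library routines often use other quartile conventions and may return slightly different numbers.

```python
def median(values):
    """Median of a non-empty list, using the odd/even rule."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q_L, M, Q_U); assumes at least two observations."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    # Observations to the left of M in the ordered list.
    lower = ordered[:half]
    # Observations to the right of M; for odd n the center observation is skipped.
    # Assumption: for even n the halves are the first and last n/2 values.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(ordered), median(upper)


data = [2, 4, 3, 4, 6, 5, 4, -6, 5]
q_l, m, q_u = quartiles(data)
iqr = q_u - q_l
print(q_l, m, q_u, iqr)
```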

Percentiles

More generally, we might be interested in the value which is exceeded only by a certain percentage of the observations:

The $p$th percentile of a set of observations is the value such that $p\%$ of the observations are less than or equal to it and $(100 - p)\%$ of the observations are greater than or equal to it.

How to find the percentiles

1. Arrange the data in increasing order.
2. If $np/100$ is not an integer, then $x_{(k+1)}$ is the $p$th percentile, where $k$ is the largest integer less than $np/100$.
3. If $np/100$ is an integer, the $p$th percentile is the average of $x_{(np/100)}$ and $x_{(np/100+1)}$.

Five-number summary

A numerical summary of a distribution $\{x_1, \ldots, x_n\}$ is given by

$$x_{(1)}, \quad Q_L, \quad M, \quad Q_U, \quad x_{(n)}$$

A simple boxplot is a graph of the five-number summary.
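A minimal Python sketch of the percentile rule is given below. The function name `percentile` is illustrative, and the sketch assumes that p is a whole number strictly between 0 and 100, so that the integer test on np/100 can be done with exact integer arithmetic.

```python
def percentile(values, p):
    """p-th percentile by the np/100 rule; assumes integer p with 0 < p < 100."""
    ordered = sorted(values)
    n = len(ordered)
    k, remainder = divmod(n * p, 100)
    if remainder != 0:
        # np/100 is not an integer: k is the largest integer below it,
        # and x_(k+1) is index k in 0-based indexing.
        return ordered[k]
    # np/100 is an integer k: average x_(k) and x_(k+1).
    return (ordered[k - 1] + ordered[k]) / 2


data1 = [2, 4, 3, 4, 6, 5, 4, -6, 5]
data2 = [2.3, 8.8, 3.9, 4.1, 6.4, 5.9, 4.2, 2.9, 1.3, 5.1]

print(percentile(data1, 50))  # same as the median of data set 1
print(percentile(data2, 25))
print(percentile(data2, 90))
```

For p = 50 the rule reproduces the median. For p = 25 it can differ from the lower quartile found by the median-of-halves method: for the nine-value data set it gives 3 rather than 2.5.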

Boxplots

A common rule for discovering outliers is the 1.5 × IQR rule: an observation is a suspected outlier if it falls more than 1.5 × IQR below $Q_L$ or above $Q_U$.

How to draw a boxplot (box-and-whisker plot)

1. A box is drawn from the lower to the upper quartile ($Q_L$ and $Q_U$).
2. The median of the data is shown by a line in the box.
3. Lines (the whiskers) are drawn from the ends of the box to the most extreme observations within a distance of 1.5 × IQR (interquartile range) from the box.
4. Measurements falling more than 1.5 × IQR from the ends of the box are potential outliers and are marked individually.

[Figure: boxplots of Stock A and Stock B]

Plotting a boxplot with STATA:

    . infile A B using stocks.txt, clear
    . label var A "Stock A"
    . label var B "Stock B"
    . graph box A B, xsize(2) ysize(5)
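The quantities needed to draw such a plot can be computed by hand. The following Python sketch applies the 1.5 × IQR rule to the nine-value data set used in the earlier examples. The function names and the dictionary keys are illustrative, and the quartiles are found by the median-of-halves method, with the assumption that for even n the halves are the first and last n/2 ordered values.

```python
def median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def boxplot_stats(values):
    """Quartiles, whisker ends and suspected outliers (needs n >= 2)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q_l = median(ordered[:half])
    q_u = median(ordered[half + 1:] if n % 2 == 1 else ordered[half:])
    iqr = q_u - q_l
    low_fence = q_l - 1.5 * iqr
    high_fence = q_u + 1.5 * iqr
    # Whiskers end at the most extreme observations inside the fences.
    inside = [x for x in ordered if low_fence <= x <= high_fence]
    outliers = [x for x in ordered if x < low_fence or x > high_fence]
    return {
        "box": (q_l, median(ordered), q_u),
        "fences": (low_fence, high_fence),
        "whiskers": (inside[0], inside[-1]),
        "outliers": outliers,
    }


data = [2, 4, 3, 4, 6, 5, 4, -6, 5]
stats = boxplot_stats(data)
print(stats)
```

For this data set the fences are at -1.25 and 8.75, so the observation -6 is flagged as a suspected outlier and the whiskers run from 2 to 6.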

Boxplots

Interpretation of box plots

The IQR is a measure of the sample's variability.

If the whiskers differ in length, the distribution of the data is probably skewed in the direction of the longer whisker.

Very extreme observations (more than 3 × IQR away from the lower or upper quartile, respectively) are outliers, with one of the following explanations:

a) The measurement is incorrect (error in the measurement process or in data processing).
b) The measurement belongs to a different population.
c) The measurement is correct, but represents a rare (chance) event.

We accept the last explanation only after carefully ruling out all others.

Variance and standard deviation

Suppose there are $n$ observations $x_1, x_2, \ldots, x_n$.

The variance of the $n$ observations is:

$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

This is (approximately) the average of the squared distances of the observations from the mean.

The standard deviation is:

$$s = \sqrt{s^2} = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

Why $n - 1$?

Division by $n - 1$ instead of $n$ in the variance calculation is a common cause of confusion. Note that

$$\sum_{i=1}^{n} (x_i - \bar{x}) = 0$$

Thus, if you know any $n - 1$ of the differences, the last difference can be determined from the others. The number of freely varying observations, $n - 1$ in this case, is called the degrees of freedom.
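As a sketch of these formulas, the Python code below computes the variance and standard deviation of the nine-value data set from the earlier examples and checks that the deviations from the mean sum to zero. The function name `sample_variance` and the variable names are illustrative. The comparison uses `statistics.variance` from the standard library, which also divides by n - 1.

```python
import math
import statistics


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


data = [2, 4, 3, 4, 6, 5, 4, -6, 5]
x_bar = sum(data) / len(data)

# The deviations sum to zero, so only n - 1 of them vary freely.
deviation_sum = sum(x - x_bar for x in data)

s2 = sample_variance(data)
s = math.sqrt(s2)

print(deviation_sum, s2, round(s, 4))
print(statistics.variance(data))  # library value, same n - 1 divisor
```

For this data set the squared deviations add up to 102, so the variance is 102 / 8 = 12.75.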

Properties of s

$s$ measures spread around the mean, so use it only if the mean is used as the measure of center.
$s = 0$ if and only if all observations are the same.
$s$ is in the same units as the measurements, while $s^2$ is in the square of these units.
$s$, like $\bar{x}$, is not resistant to outliers.

Five-number summary versus standard deviation

The five-number summary is better for describing skewed distributions, since each side has a different spread.
$\bar{x}$ and $s$ are preferred for symmetric distributions with no outliers.