Chapter 5. Understanding and Comparing Distributions

STAT 141 Introduction to Statistics Chapter 5 Understanding and Comparing Distributions Bin Zou (bzou@ualberta.ca) STAT 141 University of Alberta Winter 2015 1 / 27

Boxplots. How do we create a boxplot? Assume we are given the histogram and the 5-number summary.

Step 1: draw a box with bottom Q1 and top Q3, then insert a line at Q2 (the median). Note: the red lines and the labels Q1, Q2, Q3 are NOT necessary; they are for illustration only.
Step 2: draw two fences: upper fence = Q3 + 1.5 × IQR, lower fence = Q1 - 1.5 × IQR.
Step 3: draw whiskers: draw lines from the ends of the box to the largest and smallest values within the fences.
Step 4: add outliers (observations outside the fences) with special symbols.
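The four steps can be sketched in Python. This is a minimal illustration, not the course's own code: the function name `boxplot_summary`, the sample data, and the quartile convention (`method="inclusive"` in the standard library's `statistics.quantiles`) are assumptions. Textbooks and software compute quartiles in slightly different ways, so Q1 and Q3 may differ a little from a hand calculation.

```python
import statistics


def boxplot_summary(data):
    """Compute the pieces needed to draw a boxplot (steps 1 to 4)."""
    values = sorted(data)
    # Step 1: quartiles. The "inclusive" convention is an assumption here.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    # Step 2: fences.
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Step 3: whiskers end at the most extreme values inside the fences.
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    # Step 4: anything outside the fences is an outlier.
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "q1": q1, "median": q2, "q3": q3, "iqr": iqr,
        "lower_fence": lower_fence, "upper_fence": upper_fence,
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "outliers": outliers,
    }


sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]  # made-up data with one extreme value
summary = boxplot_summary(sample)
print(summary)
```

With this made-up sample, the value 30 lies above the upper fence and is reported as an outlier, so the upper whisker stops at 9.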

Summary of Boxplots
The bottom line and the top line of the box are Q1 and Q3. The height of the box is the IQR.
The line inside the box is the median.
If the median line is centred, then the distribution is roughly symmetric.
If the median line is closer to the bottom (Q1), equivalently Q2 - Q1 < Q3 - Q2, the distribution is right skewed.
If the median line is closer to the top (Q3), equivalently Q2 - Q1 > Q3 - Q2, the distribution is left skewed.
Boxplots can also be drawn horizontally.

Comparing Groups with Boxplots. Conclusions: wind speeds are low in the summer. Wind speeds tend to decrease from Jan to Aug and then increase. Jan has the strongest winds and the largest spread.

Chapter 6 The Standard Deviation and the Normal Model

z-score. The z-score, also called the standardized value, is a measure of relative standing. Assume y is an observation from a sample with mean ȳ and standard deviation s. Then the z-score of y is defined as z = (y - ȳ) / s. This is the most important formula for the midterm.


The z-score tells us how many standard deviations away from the mean the measurement lies, and in which direction.
Positive z-score: the observation is greater than the mean.
Negative z-score: the observation is smaller than the mean.
Zero z-score: the observation is equal to the mean.
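As a small illustration of the definition z = (y - ȳ) / s, the sketch below standardizes a made-up sample. The function name `z_scores` and the data are assumptions; `statistics.stdev` computes the sample standard deviation s.

```python
import statistics


def z_scores(data):
    """Return the z-score of every observation in a sample."""
    mean = statistics.mean(data)   # sample mean, y-bar
    sd = statistics.stdev(data)    # sample standard deviation, s
    return [(y - mean) / sd for y in data]


scores = [60, 70, 80]  # made-up sample: mean 70, standard deviation 10
print(z_scores(scores))
```

Here 60 is one standard deviation below the mean (negative z-score), 70 equals the mean (zero z-score), and 80 is one standard deviation above it (positive z-score).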

Shifting Data. Add (or subtract) a constant c to (or from) each value of the data. Results: all measures of position (centre, percentiles, minimum, maximum) increase (or decrease) by the same constant. However, the measures of spread (range, IQR, standard deviation) do NOT change.

Rescaling Data. Multiply (or divide) all the data values by a constant d. In a formula, y_new = d × y_original. Result: position_new = d × position_original and spread_new = d × spread_original (for d > 0; for a negative d the spread is multiplied by |d|).
Standardizing into z-scores involves shifting down by the value of the mean and rescaling (dividing) by the value of the standard deviation.
Standardizing into z-scores changes the centre by making the mean 0.
Standardizing into z-scores changes the spread by making the standard deviation 1.
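The effects of shifting, rescaling, and standardizing can be checked numerically. The sketch below uses made-up data and made-up constants c and d (both assumptions, with d positive) and only standard-library functions.

```python
import statistics

data = [2.0, 4.0, 6.0, 8.0]  # made-up sample
c = 10.0                     # shift constant (assumed value)
d = 3.0                      # rescaling constant (assumed value, positive)

shifted = [y + c for y in data]
rescaled = [d * y for y in data]

mean = statistics.mean(data)
sd = statistics.stdev(data)
standardized = [(y - mean) / sd for y in data]  # shift by the mean, divide by s

# Shifting moves the centre by c but leaves the spread alone.
print(statistics.mean(shifted) - mean, statistics.stdev(shifted) - sd)
# Rescaling multiplies both the centre and the spread by d.
print(statistics.mean(rescaled) / mean, statistics.stdev(rescaled) / sd)
# Standardizing gives mean 0 and standard deviation 1 (up to rounding).
print(statistics.mean(standardized), statistics.stdev(standardized))
```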

Density Curve. Note: the area enclosed by the density curve and the x-axis is always 1. Why? The relative frequencies add up to 1.

Histogram vs. Density. Both describe the overall shape of the data, but a density curve is smooth (without sharp corners). You can think of a density curve as the limiting case of a histogram as the class width approaches 0 (the rectangles get narrower).

As shown by the graph, the area under the density curve between a and b is the proportion (percentage) of observations that fall in [a, b]. What if we want to know the proportion of observations that lie below a or above b? Note: for a density curve we do NOT discuss the proportion of observations that are exactly equal to a or b.

Normal Model. The density curve of a normal distribution/model is bell-shaped, symmetric and unimodal. Its shape is determined by two parameters: the mean µ (also the median and the mode) and the standard deviation σ. The above graph is the density curve of the standard normal distribution with µ = 0 and σ = 1.

Standard Normal Model. Recall what we have learnt from shifting and rescaling data. Assume we are given a normal model with mean µ and standard deviation σ (short notation N(µ, σ), where N stands for normal distribution). By subtracting µ from every value and dividing by σ (exactly as for the z-score), we obtain the standard normal model: z = (y - µ) / σ. Thus, only the distribution of the standard normal is provided in the table.

The 68-95-99.7 Rule. Does this graph look familiar to you? If not, go back to the slide on the Empirical Rule in Chapter 4. In a normal model, approximately 68% of the values fall within one standard deviation of the mean, approximately 95% within two, and approximately 99.7% within three.
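The three percentages in the rule can be checked against the standard normal model. The sketch below uses `statistics.NormalDist` from the Python standard library in place of the printed table; the helper name `within` is an assumption.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)  # the standard normal model N(0, 1)


def within(k):
    """Proportion of values within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(within(k), 4))
```

The printed proportions are approximately 0.68, 0.95 and 0.997, which is where the name of the rule comes from.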

Normal Table. Important! You must know how to use the normal table! The normal table provides the proportion of the left tail (shaded area) of the standard normal model below a given value z. The value of z is read from two side bars: the integer part and the first decimal from the vertical bar, and the second decimal from the horizontal bar. Example: to find the proportion of values below -3.65, we first locate the row of -3.6 from the rightmost column, then locate the second decimal 0.05 from the top row. The unique intersection, 0.0001 (0.01% as a percentage), gives the answer.

The first page of the z-table covers negative z values from -3.99 to 0, while the second page covers the positive side. Both pages still give the left tail. What if a question asks you to find the proportion of observations above a number (right tail), say greater than 0.19? From the table, the proportion of observations that fall below 0.19 is 0.5753. Since the total area is 1, the area of the right tail is 1 - 0.5753 = 0.4247 = 42.47%.

Some people have no interest in either tail; they care instead about the middle portion of the standard normal model. Example: what is the proportion of the values between -0.52 and 1.19 in the standard normal model? From the table, we find two numbers: 0.3015 (from -0.52) and 0.8830 (from 1.19). (Please check!) These two numbers are the areas of the left tails below -0.52 and 1.19. To get the area between these two values, we only need to subtract: 0.8830 - 0.3015 = 0.5815 = 58.15%.

Quick Summary. Find the proportion of values in an interval. The interval can be one of only three types.
z < a, the values below a (left tail): directly report the number found from the table.
a < z < b, the values between a and b (middle interval): bigger number (found using b) - smaller number (found using a).
z > b, the values above b (right tail): 1 - the number found from the table.
Not standard normal? Convert into the standard normal by z = (y - µ) / σ.
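The three cases can be reproduced in software, with `statistics.NormalDist` from the Python standard library standing in for the printed table (an assumption; the course itself uses the table). The variable names are illustrative. Because the table rounds each entry to four decimals, software results can differ from table results in the last digit.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal model: mean 0, standard deviation 1

# Left tail: proportion of values below 0.19 (table value 0.5753).
left = Z.cdf(0.19)

# Right tail: proportion of values above 0.19 is 1 minus the left tail.
right = 1 - Z.cdf(0.19)

# Middle interval: proportion of values between -0.52 and 1.19.
middle = Z.cdf(1.19) - Z.cdf(-0.52)

print(round(left, 4), round(right, 4), round(middle, 4))
```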

From Percentiles to Scores. We have just learnt how to find proportions from z-scores. Now we study how to go backwards and find z-scores for given percentiles.
1. Obtain the proportion below z (left tail). Think of the three cases discussed in the previous slide.
2. In the normal table, find the number (with four decimals) that is closest to the proportion.
3. From the position of that number, identify the value of the z-score.

Example 6.1. Suppose we want to find the z-score, z*, that cuts off the smallest 2% in the standard normal model. "Smallest" indicates the left tail, so in this question the proportion of the left tail is given directly: 2%, or 0.0200. The closest number to 0.0200 in the normal table is 0.0202. Do NOT look for 0.2 in the leftmost column under z: the proportion is known, but the z-score is unknown. From the position of 0.0202, we look to the rightmost column and get -2.0, and to the topmost row and get 0.05. Hence, the z-score is -2.05.

Example 6.2. Suppose now we are interested in the largest 5%. "Largest" indicates the right tail, so we are looking for z* such that area(z > z*) is 5%, or 0.05. The corresponding area of the left tail is then 0.95. From the table, we find 0.9495 and 0.9505, which are equally close to 0.95. Notice that 0.9495 gives z = 1.64 and 0.9505 gives z = 1.65. In this special case, since 0.95 is exactly halfway between 0.9495 and 0.9505, we take the z-score to be halfway between 1.64 and 1.65 as well. The solution is then 1.645. Note: remember this example! You will need this result many times throughout the course.

Example 6.3. Now we want to find the z-scores that enclose the middle 95%. We are looking for z* such that the area between -z* and z* is 0.95. Can you tell why these two statements are equivalent? Remember that normal distributions are all symmetric, including the standard normal model. After setting aside the middle 95%, we are left with 5% for two tails of equal area. Hence, each tail accounts for 2.5%. Using the proportion 0.0250, we find -z* = -1.96, so z* = 1.96. Another way: the area below z* is 0.025 + 0.95 = 0.975, which yields the same z-score, z* = 1.96.
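Examples 6.1 to 6.3 can be cross-checked with the inverse of the standard normal left-tail area. The sketch below assumes `statistics.NormalDist.inv_cdf` from the Python standard library as a stand-in for reading the table backwards; the variable names are illustrative. Software gives more decimals than the table, so the results agree with the table answers only after rounding.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal model

# Example 6.1: smallest 2%, so the left-tail area is 0.02.
z_smallest_2 = Z.inv_cdf(0.02)

# Example 6.2: largest 5%, so the left-tail area is 1 - 0.05 = 0.95.
z_largest_5 = Z.inv_cdf(0.95)

# Example 6.3: middle 95%, so each tail holds 2.5% and the area below z* is 0.975.
z_middle_95 = Z.inv_cdf(0.975)

print(round(z_smallest_2, 2), round(z_largest_5, 3), round(z_middle_95, 2))
```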

Examples. If the normal model in the question is not standard, use the standardization z = (y - µ) / σ to convert it into the standard normal model.
Example 6.4. Assume that the length of a human pregnancy follows a normal distribution with mean 266 days and standard deviation 16 days. What proportion of human pregnancies last longer than 280 days? Let y denote the length of a human pregnancy, so y ~ N(266, 16). What is area(y > 280)? Using the standardization z = (y - µ) / σ, we convert y into z (the standard normal): z = (280 - 266) / 16 = 0.875 ≈ 0.88. Then area(y > 280) = area(z > 0.88) = 1 - 0.8106 = 0.1894.
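Example 6.4 can be sketched in Python as a check. `statistics.NormalDist` is used in place of the table (an assumption), and the variable names are illustrative. The slide rounds z = 0.875 up to 0.88 before using the table, so the table answer (0.1894) and the unrounded software answer differ slightly in the third decimal.

```python
from statistics import NormalDist

mu, sigma = 266, 16  # pregnancy length model N(266, 16), in days
Z = NormalDist()     # standard normal model

# Standardize y = 280, then take the right tail, as on the slide.
z = (280 - mu) / sigma              # 0.875
p_table_style = 1 - Z.cdf(0.88)     # z rounded to two decimals, as in the table

# Same question without rounding z, using the N(266, 16) model directly.
p_direct = 1 - NormalDist(mu, sigma).cdf(280)

print(z, round(p_table_style, 4), round(p_direct, 4))
```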

Example 6.5. Assume a variable y is normally distributed with µ = 10 and σ = 2. Find the value that cuts off the smallest 10% of this distribution. That is, find y* such that area(y < y*) = 0.1. Equivalently, find z* such that area(z < z*) = 0.1, where z* = (y* - µ) / σ. Note: after standardization, y becomes z and y* becomes z*, but the direction of the inequality stays the same. From the proportion of 10%, we obtain z* = -1.28. Rewriting z* = (y* - µ) / σ gives y* = µ + σ z*. Hence, y* = 10 + 2 × (-1.28) = 7.44.
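Example 6.5 can be sketched the same way: find z* from the left-tail area, then undo the standardization with y* = µ + σ z*. `statistics.NormalDist.inv_cdf` stands in for the table here (an assumption), and the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 10, 2  # the model N(10, 2) from the example

# Step 1: z-score with left-tail area 0.10 in the standard normal model.
z_star = NormalDist().inv_cdf(0.10)

# Step 2: undo the standardization, y* = mu + sigma * z*.
y_star = mu + sigma * z_star

# Table-based version from the slide, using z* rounded to -1.28.
y_star_table = mu + sigma * (-1.28)

print(round(z_star, 2), round(y_star, 2), round(y_star_table, 2))
```

The unrounded answer and the table-based answer agree to about two decimals; the small gap comes from rounding z* to -1.28.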