Example: Find the SD of the set $\{x_j\} = \{2, 4, 5, 8, 5, 11, 7\}$.


(*) If a lot of the data is far from the mean, then many of the $(x_j - \bar{x})^2$ terms will be quite large, so the mean of these terms will be large and the SD of the data will be large.
(*) In particular, outliers can make the SD bigger. (Outliers have an even bigger effect on the range of the data.)
(*) On the other hand, if the data is all clustered close to the mean, then all of the $(x_j - \bar{x})^2$ terms will be fairly small, so their mean will be small and the SD will be small.


Example: Find the SD of the set $\{x_j\} = \{2, 4, 5, 8, 5, 11, 7\}$.

Step 1. Find the mean:
$$\bar{x} = \frac{2 + 4 + 5 + 8 + 5 + 11 + 7}{7} = \frac{42}{7} = 6.$$

Step 2. Find the mean of the squared deviations of the numbers from their mean:
$$\frac{(2-6)^2 + (4-6)^2 + (5-6)^2 + \cdots + (7-6)^2}{7} = \frac{52}{7}.$$

Step 3. Take the square root:
$$SD_x = \sqrt{52/7} \approx 2.726.$$
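The three steps above can be sketched in Python. This is only an illustration; the function name `population_sd` is an assumption of this sketch and does not come from the slides.

```python
import math

def population_sd(data):
    """SD as defined above: square root of the mean squared deviation."""
    n = len(data)
    mean = sum(data) / n                                  # Step 1
    mean_sq_dev = sum((x - mean) ** 2 for x in data) / n  # Step 2
    return math.sqrt(mean_sq_dev)                         # Step 3

data = [2, 4, 5, 8, 5, 11, 7]
sd = population_sd(data)
print(round(sd, 3))  # 2.726
```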

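(*) (Very) useful shortcut (for calculations done by hand):
$$\frac{1}{n}\sum (x_j - \bar{x})^2 = \left(\frac{1}{n}\sum x_j^2\right) - (\bar{x})^2,$$
so
$$SD_x = \sqrt{\frac{1}{n}\sum (x_j - \bar{x})^2} = \sqrt{\left(\frac{1}{n}\sum x_j^2\right) - (\bar{x})^2}.$$

Check with the example $\{x_j\} = \{2, 4, 5, 8, 5, 11, 7\}$ and $\bar{x} = 6$:
$$\left(\frac{1}{7}\sum x_j^2\right) - \bar{x}^2 = \frac{304}{7} - 36 = \frac{52}{7}.$$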
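As a quick check of the shortcut, the sketch below computes the mean squared deviation both ways with exact fractions, so that no rounding hides a discrepancy. The variable names are illustrative choices.

```python
from fractions import Fraction

data = [2, 4, 5, 8, 5, 11, 7]
n = len(data)
mean = Fraction(sum(data), n)

# Direct definition: mean of the squared deviations from the mean.
direct = sum((x - mean) ** 2 for x in data) / n

# Shortcut: mean of the squares minus the square of the mean.
shortcut = Fraction(sum(x * x for x in data), n) - mean ** 2

print(direct, shortcut)  # 52/7 52/7
```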


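Very useful special case: All the numbers in the data are 0s and 1s, with $m$ 1s and $n - m$ 0s ($n$ numbers in all).

$$\bar{x} = \frac{\overbrace{1 + 1 + \cdots + 1}^{m} + \overbrace{0 + 0 + \cdots + 0}^{n-m}}{n} = \frac{m}{n}$$
(I.e., the average is equal to the proportion of 1s in the data.)

Using the shortcut formula,
$$SD_x = \sqrt{\frac{\overbrace{1^2 + \cdots + 1^2}^{m} + \overbrace{0^2 + \cdots + 0^2}^{n-m}}{n} - \left(\frac{m}{n}\right)^2} = \sqrt{\frac{m}{n} - \left(\frac{m}{n}\right)^2} = \sqrt{\frac{m}{n}\left(1 - \frac{m}{n}\right)} = \sqrt{\frac{m}{n} \cdot \frac{n-m}{n}}$$
$$= \sqrt{(\text{proportion of 1s}) \cdot (\text{proportion of 0s})}.$$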
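The 0/1 formula can be compared against the general definition on a small made-up data set. In this sketch the counts `m = 3` and `n = 10` are arbitrary illustrative values, and the helper names are assumptions.

```python
import math

def population_sd(data):
    """Root of the mean squared deviation from the mean."""
    n = len(data)
    mean = sum(data) / n
    return math.sqrt(sum((x - mean) ** 2 for x in data) / n)

def sd_zero_one(m, n):
    """SD of a data set made of m ones and n - m zeros."""
    p = m / n  # proportion of 1s
    return math.sqrt(p * (1 - p))

m, n = 3, 10  # hypothetical counts
data = [1] * m + [0] * (n - m)
print(population_sd(data), sd_zero_one(m, n))
```

Both printed values should agree up to floating-point rounding.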

SD vs. $SD^+$

One of the most important uses of sample statistics is to estimate the corresponding population parameters. The mean of a representative sample is a good estimate of the mean of the population that the sample represents. The SD of a representative sample tends to underestimate the SD of the population from which it was drawn. To correct for this, statisticians use the $SD^+$ of the sample to estimate the SD of the population. If $n$ is the sample size, then
$$SD^+ = \sqrt{\frac{n}{n-1}} \cdot SD_{\text{sample}} = \sqrt{\frac{1}{n-1}\sum (x_j - \bar{x})^2}.$$
If the sample size is large, then there is no significant difference between SD and $SD^+$, because $\sqrt{n/(n-1)} \approx 1$ when $n$ is large. The $SD^+$ is called the sample standard deviation.
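Python's standard `statistics` module provides both quantities: `pstdev` divides by $n$ (the SD of these notes) and `stdev` divides by $n - 1$ (the $SD^+$). The sketch below applies them to the earlier example data and checks the conversion factor.

```python
import math
import statistics

data = [2, 4, 5, 8, 5, 11, 7]
n = len(data)

sd = statistics.pstdev(data)      # divides by n
sd_plus = statistics.stdev(data)  # divides by n - 1

# SD+ should equal sqrt(n / (n - 1)) times SD.
print(sd, sd_plus, math.sqrt(n / (n - 1)) * sd)
```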

How is the data clustered?

The proportion of the data that lies more than $k$ SDs from the mean is always less than $1/k^2$. This fact is known as Chebychev's inequality, and follows directly from how the standard deviation is defined.

For example, less than 1/4 = 25% of the values in any data set lie more than 2 SDs from the average value (mean). Less than 1/9 ≈ 11.11% of the data lie more than 3 SDs from the average value. Etc.

Turning this around, more than 75% of the data lie within 2 SDs of the mean, and more than 88.88% of the data lie within 3 SDs of the mean.

The estimates above are true for any set of data. On the other hand, if we know more about the data, then we can often get sharper estimates.
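Chebychev's inequality can be checked empirically on any list of numbers. The data set below is made up for illustration (it includes one deliberate outlier), and `fraction_beyond` is a hypothetical helper name.

```python
import math

def fraction_beyond(data, k):
    """Fraction of values lying more than k SDs from the mean."""
    n = len(data)
    mean = sum(data) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in data) / n)
    return sum(1 for x in data if abs(x - mean) > k * sd) / n

data = [2, 4, 5, 8, 5, 11, 7, 40]  # hypothetical data with one outlier
for k in (2, 3):
    # Each observed fraction should be below the bound 1 / k^2.
    print(k, fraction_beyond(data, k), 1 / k ** 2)
```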

For certain types of data sets, almost all of the data lies within two or three SDs of the average.

Example (from the book): $\bar{h} = 63.5$ inches and $SD_h \approx 3$ inches...

Statistics, Fourth Edition. Copyright 2007 W. W. Norton & Co., Inc.


Standard units

(*) We commonly measure the distance of data to their average in terms of the standard deviation of the data set... This leads to the concept of standard units. If $x_j$ comes from a distribution with average $\bar{x}$ and standard deviation $SD_x$, we convert $x_j$ to its standard units, $z_j$, by setting
$$z_j = \frac{x_j - \bar{x}}{SD_x}.$$
(*) $z_j$ tells us how far $x_j$ is from $\bar{x}$ as a multiple of $SD_x$.
(*) If $z_j > 0$, then $x_j$ is above average; if $z_j < 0$, then $x_j$ is below average.
(*) Standard units are pure numbers. This means that there are no units of measurement (inches, dollars, etc.) associated with standard units.
(*) The standard units value $z_j$ of a given datum $x_j$ is also called the z-score of $x_j$.

Example. Suppose that the average January temperature in Podunk is 45°F, with an SD of 2°F, while in Whoville the average January temperature is 25°F with an SD of 5°F. On January 20th, the temperature in Whoville was 16°F and in Podunk it was 38°F. Where was the temperature more unusual that day?

We can answer this by converting the temperatures on January 20th in both towns to standard units:
$$z_p = \frac{38 - 45}{2} = -3.5 \quad\text{and}\quad z_w = \frac{16 - 25}{5} = -1.8.$$
(*) Both temperatures were below average.
(*) The z-score for Podunk is more negative than the z-score for Whoville, so from a statistical point of view the temperature in Podunk was more unusual that day.
(*) The larger $|z_j|$, the more unusual $x_j$ is.
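The temperature comparison above amounts to two applications of the standard-units formula. A minimal sketch, with `z_score` as an illustrative function name:

```python
def z_score(x, mean, sd):
    """Convert x to standard units: distance from the mean in SDs."""
    return (x - mean) / sd

z_podunk = z_score(38, mean=45, sd=2)
z_whoville = z_score(16, mean=25, sd=5)
print(z_podunk, z_whoville)  # -3.5 -1.8

# The reading with the larger absolute z-score is the more unusual one.
more_unusual = "Podunk" if abs(z_podunk) > abs(z_whoville) else "Whoville"
print(more_unusual)  # Podunk
```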

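Observation. Converting any set of data, $\{x_1, x_2, \ldots, x_n\}$ with average $\bar{x}$ and standard deviation $SD_x = s$, to standard units produces a set of numbers $\{z_1, z_2, \ldots, z_n\}$ with average $\bar{z} = 0$ and standard deviation $SD_z = 1$.

Because arithmetic...
$$\begin{aligned}
\bar{z} &= \frac{z_1 + z_2 + \cdots + z_n}{n} \\
&= \frac{\frac{x_1 - \bar{x}}{s} + \frac{x_2 - \bar{x}}{s} + \cdots + \frac{x_n - \bar{x}}{s}}{n} \\
&= \frac{(x_1 + x_2 + \cdots + x_n) - \overbrace{(\bar{x} + \bar{x} + \cdots + \bar{x})}^{n}}{ns} \\
&= \frac{\frac{x_1 + x_2 + \cdots + x_n}{n} - \bar{x}}{s} = \frac{\bar{x} - \bar{x}}{s} = \frac{0}{s} = 0.
\end{aligned}$$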

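...and more arithmetic (using $\bar{z} = 0$):
$$\begin{aligned}
SD_z &= \sqrt{\frac{z_1^2 + z_2^2 + \cdots + z_n^2}{n}} \\
&= \sqrt{\frac{\left(\frac{x_1 - \bar{x}}{s}\right)^2 + \left(\frac{x_2 - \bar{x}}{s}\right)^2 + \cdots + \left(\frac{x_n - \bar{x}}{s}\right)^2}{n}} \\
&= \sqrt{\frac{\frac{(x_1 - \bar{x})^2}{s^2} + \frac{(x_2 - \bar{x})^2}{s^2} + \cdots + \frac{(x_n - \bar{x})^2}{s^2}}{n}} \\
&= \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n s^2}} \\
&= \frac{\sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n}}}{s} = \frac{s}{s} = 1.
\end{aligned}$$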
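The observation can also be confirmed numerically. The sketch below standardizes the example data set used earlier and checks that the result has mean 0 and SD 1 up to rounding; the helper names are assumptions of this sketch.

```python
import math

def mean(values):
    return sum(values) / len(values)

def population_sd(values):
    m = mean(values)
    return math.sqrt(sum((x - m) ** 2 for x in values) / len(values))

def standardize(values):
    """Convert every value to standard units."""
    m, s = mean(values), population_sd(values)
    return [(x - m) / s for x in values]

z = standardize([2, 4, 5, 8, 5, 11, 7])
print(mean(z), population_sd(z))  # approximately 0 and 1
```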

The normal approximation, I

Different sets of data may be seen to have very similar distributions, once they have been converted to standard units. Converting to standard units moves the center of the histogram (the average of the data) to 0, and scales the data as a whole so that one SD is converted to 1 unit.

In many cases, the histogram of the data, once converted to standard units, takes on a somewhat bell-shaped form: the form of the normal curve. The normal curve is the graph of the function
$$y = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}$$
(where $e = 2.7182818\ldots$).
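To get a feel for the normal curve, it can be evaluated directly and its area approximated numerically. The sketch below uses a simple trapezoid rule; the integration range from -8 to 8 and the step count are arbitrary choices of this sketch, picked so that the neglected tails are negligible.

```python
import math

def normal_curve(z):
    """Height of the normal curve at z (in standard units)."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def area_under_curve(a, b, steps=10_000):
    """Trapezoid-rule approximation of the area between a and b."""
    h = (b - a) / steps
    total = 0.5 * (normal_curve(a) + normal_curve(b))
    for i in range(1, steps):
        total += normal_curve(a + i * h)
    return total * h

print(normal_curve(0))          # peak height, roughly 0.399
print(area_under_curve(-8, 8))  # very close to 1
```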

[Figure: the normal curve. Horizontal axis: standard units z, from -4 to 4; vertical axis: percent per standard unit.]

The normal curve is symmetric around the line z = 0, and the total area under the curve is equal to 1 (or 100%, if you prefer).

Example: The distribution of heights of women age 18 and over in HANES5 (Health and Nutrition Examination Study, 03-04) appears in the histogram below (from page 81 in chapter 5 of FPP). The average height is 63.5 inches and the SD is about 3 inches. The shaded region represents the heights that fall within one SD of average.

[Figure: histogram of the heights of women age 18 and over in HANES5, with the region within one SD of average shaded.]

To see how well the distribution of the height data is approximated by the normal curve, we must convert the data to standard units and sketch the histogram for the standardized (or normalized) data.

To save a lot of drawing time, we observe that the conversion to standard units is just a rescaling. This means that instead of actually converting all of the heights to their standard units and then drawing a new histogram, we can simply change the horizontal and vertical scales on the original histogram.


If the (rescaled) histogram is well-approximated by the normal curve, then the areas of regions under the histogram will be approximately equal to areas under the normal curve for the same range of standard units.

I.e., the percentage of the data that lies within 1 SD of the average will be approximately equal to the area under the normal curve between -1 and 1; the percentage of the data lying within 2 SDs of the average will be approximately equal to the area under the normal curve between -2 and 2; and so forth.

This is useful, because the distribution of the area under the normal curve is well-understood. In particular...

[Figure: the normal curve with the region between -1 and 1 shaded, labeled 68%.]
(*) The area under the normal curve between -1 and 1 is approximately 0.68 = 68%.

[Figure: the normal curve with the region between -2 and 2 shaded, labeled 95%.]
(*) The area under the normal curve between -2 and 2 is approximately 0.95 = 95%.

[Figure: the normal curve with the region between -3 and 3 shaded, labeled 99%.]
(*) The area under the normal curve between -3 and 3 is approximately 0.997 = 99.7%.

Rule of thumb: If a set of data has an approximately normal distribution, then:
(*) About 68% of the data lies within one SD of average;
(*) About 95% of the data lies within two SDs of average;
(*) About 99.7% of the data lies within three SDs of average.

Remember: This rule only applies to data that is (approximately) normally distributed! Absent that condition (or assumptions about how the data is distributed) we rely on weaker (but more general) estimates (like Chebychev's inequality).

To calculate areas under the normal curve for regions other than those above (-1 to 1, -2 to 2 and -3 to 3), we use a normal table, like the one found in the back of the textbook.
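The rule-of-thumb percentages can be reproduced without a table. For the normal curve, the area between $-k$ and $k$ equals $\operatorname{erf}(k/\sqrt{2})$, where erf is the error function available as `math.erf` in Python. The function name `area_within` is an illustrative choice.

```python
import math

def area_within(k):
    """Area under the normal curve between -k and k standard units."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(100 * area_within(k), 2))
```

The printed percentages come out near 68, 95 and 99.7, in line with the rule of thumb.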

[Table: a normal table (from Statistics, 4th ed., W. W. Norton & Co., Inc.).]

Using the normal table

(i) The table in the appendix gives the areas for symmetric regions $-z_0 \le z \le z_0$ (as percentages), where $0 \le z_0 \le 4.45$. If $z_0 \ge 4.50$, you can assume that the corresponding area is 99.9999%.

Example: Suppose that the heights of men aged 25-35 in a certain city are distributed (approximately) normally with an average of 67 inches and a standard deviation of 2.5 inches. What percentage of these men are between 65 and 69 inches tall?

a. A height of 65 inches corresponds to $\frac{65 - 67}{2.5} = -0.8$ standard units, and 69 inches corresponds to $\frac{69 - 67}{2.5} = 0.8$ standard units.

b. The percentage we want is (approximately) equal to the area under the normal curve between -0.8 and 0.8, which is equal to the table entry for $z_0 = 0.8$, which is 57.63%.
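A table entry of this kind can be approximated in code. The sketch below defines a stand-in for the printed table, here called `table` (a hypothetical name), using the error function, and repeats the height example.

```python
import math

def table(z0):
    """Percent of area under the normal curve between -z0 and z0.

    A computed stand-in for the printed normal table, for z0 >= 0.
    """
    return 100 * math.erf(z0 / math.sqrt(2))

mean_height, sd_height = 67, 2.5
z_low = (65 - mean_height) / sd_height   # -0.8
z_high = (69 - mean_height) / sd_height  # 0.8
print(z_low, z_high, round(table(z_high), 2))  # -0.8 0.8 57.63
```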

(ii) The normal curve is symmetric around z = 0, so the area under the curve between 0 and $z_0$ is equal to the area under the curve between $-z_0$ and 0, and both are equal to exactly one half the table entry for $z_0$.

[Figure: two copies of the normal curve, one with the region between 0 and $z_0$ shaded and one with the region between $-z_0$ and 0 shaded; the two areas are equal.]

Example. What percentage of the men in the previous example are between 67 and 70 inches tall?

a. 67 inches is average, which corresponds to 0 standard units, and 70 inches corresponds to $\frac{70 - 67}{2.5} = 1.2$ standard units.

b. The percentage we want is (approximately) equal to the area under the normal curve between 0 and 1.2, which is equal to half the table entry for $z_0 = 1.2$. This is 76.99%/2 ≈ 38.5%.

(iii) If $z_0 > 0$, then the area under the normal curve to the left of $z_0$ is equal to 50% plus half the table entry for $z_0$, because the region to the left of $z_0$ consists of the left half of the curve together with the region between 0 and $z_0$:
$$\text{area to the left of } z_0 = 50\% + \tfrac{1}{2}\,\text{Table}(z_0).$$

[Figure: the region to the left of $z_0$ under the normal curve, shown as the sum of the region to the left of 0 and the region between 0 and $z_0$.]

Example. What percentage of the men in the previous examples are less than six feet, two inches tall?

Six feet, two inches is 74 inches, which corresponds to $\frac{74 - 67}{2.5} = 2.8$ standard units. The table entry for 2.8 is 99.49%, so the percentage of men who are under 74 inches tall is
$$50\% + \frac{99.49\%}{2} \approx 99.75\%.$$

(iv) If $z_0 > 0$, then the area under the normal curve to the right of $z_0$ is equal to 50% minus half the table entry for $z_0$, because the region to the right of $z_0$ is the right half of the curve with the region between 0 and $z_0$ removed:
$$\text{area to the right of } z_0 = 50\% - \tfrac{1}{2}\,\text{Table}(z_0).$$

[Figure: the region to the right of $z_0$ under the normal curve, shown as the region to the right of 0 minus the region between 0 and $z_0$.]

Example. What percentage of the men are taller than 68 inches?

68 inches corresponds to $\frac{68 - 67}{2.5} = 0.4$ standard units, so the percentage of men who are more than 68 inches tall is (approximately)
$$50\% - \frac{31.08\%}{2} = 34.46\%.$$

(*) The areas of other types of regions under the normal curve can be calculated from the table by using (i)-(iv) and the symmetry of the normal curve around 0.

For example, if $0 < z_0 < z_1$, then the area under the normal curve between $z_0$ and $z_1$ is
$$\tfrac{1}{2}\,\text{Table}(z_1) - \tfrac{1}{2}\,\text{Table}(z_0),$$
because the region between $z_0$ and $z_1$ is the region between 0 and $z_1$ with the region between 0 and $z_0$ removed.

[Figure: the region between $z_0$ and $z_1$ under the normal curve, shown as the region between 0 and $z_1$ minus the region between 0 and $z_0$.]
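Rules (ii)-(iv) and the between-two-values rule can be collected into a few small functions. This is a sketch under the assumption that the printed table is replaced by a computed one (`table`, built on `math.erf`); all function names are illustrative. Computed values can differ slightly from the worked examples, which use table entries rounded to two decimals.

```python
import math

def table(z0):
    """Percent of area between -z0 and z0 (computed stand-in for the table)."""
    return 100 * math.erf(z0 / math.sqrt(2))

def area_left(z0):
    """Percent of area to the left of z0, for z0 > 0 (rule iii)."""
    return 50 + table(z0) / 2

def area_right(z0):
    """Percent of area to the right of z0, for z0 > 0 (rule iv)."""
    return 50 - table(z0) / 2

def area_between(z0, z1):
    """Percent of area between z0 and z1, for 0 < z0 < z1."""
    return table(z1) / 2 - table(z0) / 2

print(round(table(1.2) / 2, 2))         # rule (ii): between 0 and 1.2
print(round(area_left(2.8), 2))         # men under 74 inches
print(round(area_right(0.4), 2))        # men over 68 inches
print(round(area_between(0.4, 1.2), 2)) # men between 68 and 70 inches
```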