Stat 101: Lecture 12. Summer 2006


Outline

- Answer Questions
- More on the CLT
- The Finite Population Correction Factor
- Confidence Intervals
- Problems

More on the CLT

Recall the Central Limit Theorem for averages:

$$\frac{\bar{X} - \mathrm{EV}}{\mathrm{sd}/\sqrt{n}} \approx N(0, 1)$$

where EV is the mean of the box, sd is the standard deviation of the box, and n is the number of draws. Similarly, for sums, just multiply the left-hand side by $n/n$ to get:

$$\frac{\mathrm{sum} - n \cdot \mathrm{EV}}{\mathrm{sd} \cdot \sqrt{n}} \approx N(0, 1)$$

Suppose the population (i.e., the tickets in the box) has some weird, non-normal distribution. If I take 1000 samples of size 20, find the $\bar{X}$'s for each of these 1000 samples, and make a histogram of them, what will the histogram of the averages look like? By the CLT, it will look approximately normal, centered near the EV of the box, even though the population itself is not normal.
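A quick way to see this is by simulation. The following is a minimal sketch (not part of the original lecture; the exponential population and the variable names are my own choices for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# A deliberately non-normal "box": a right-skewed exponential population.
n_samples, n_draws = 1000, 20
samples = rng.exponential(scale=1.0, size=(n_samples, n_draws))

# One average per sample of 20 draws.
averages = samples.mean(axis=1)

# Despite the skewed population, this histogram looks roughly normal,
# centered near the population mean (1.0) with spread sd / sqrt(20).
plt.hist(averages, bins=30, edgecolor="black")
plt.xlabel("sample average")
plt.ylabel("count")
plt.title("1000 sample averages, n = 20")
plt.show()
```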

As an extreme example of a non-normal box, suppose the box just contains B tickets, a portion labeled zero and the remainder labeled one. The CLT still applies. The expected value for the box is the population proportion of ones, or p. (We could replace zeroes and ones by Heads and Tails or Democrats and Republicans, of course, since our real interest is in estimating a proportion; the zeroes and ones are just codes for two different categories.) One can show that the standard deviation for this box of zeroes and ones is

$$\sigma = \sqrt{p(1-p)}$$
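For completeness, here is the short calculation behind that claim (not spelled out in the lecture). Since every ticket is 0 or 1, the squared tickets equal the tickets themselves, so

$$\sigma^2 = \mathrm{E}[X^2] - (\mathrm{E}[X])^2 = p - p^2 = p(1-p), \qquad \sigma = \sqrt{p(1-p)}.$$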

Chapter 21 raises a point that we have mentioned before. In practice, the important question is:

Q1: I have drawn a sample and found its average (or proportion, or sum). How close is this sample estimate to the true population value?

But so far, our discussion has focused on:

Q2: What is the probability that I will draw a sample whose estimate is within a given distance of the true population value?

Note that the first question asks about the sample one has, whereas the second asks about the sample one might get.

You sample 100 people at random from the U.S. and ask whether they believe that Karl Rove is Lance Armstrong's half-brother. Assume 22 say yes. What is the probability that the true proportion of Americans who think Rove and Armstrong are related is at least 25%?

The box model has ones for people who believe the LA/KR connection and zeroes for those who do not. The EV of the box is

$$\mathrm{EV} = \frac{1}{B}\,\{\text{sum of the zeroes and ones for the whole U.S.}\} = p,$$

which is the proportion we are trying to estimate. Our estimated proportion is just a sample average:

$$\hat{p} = \frac{1}{100}\sum_{i=1}^{100} X_i = 22/100 = 0.22,$$

where $X_i$ is 1 if the i-th respondent says yes, and is zero otherwise. Thus the CLT for averages applies.

$$
\begin{aligned}
P(\mathrm{EV} > 0.25) &= P\left(\frac{\bar{X} - \mathrm{EV}}{se} < \frac{\bar{X} - 0.25}{se}\right) \\
&= P\left(\frac{\hat{p} - \mathrm{EV}}{\sqrt{\hat{p}(1-\hat{p})/n}} < \frac{\hat{p} - 0.25}{\sqrt{\hat{p}(1-\hat{p})/n}}\right) \\
&= P\left(Z < \frac{\hat{p} - 0.25}{\sqrt{\hat{p}(1-\hat{p})/n}}\right) \\
&= P\left(Z < \frac{0.22 - 0.25}{\sqrt{0.22\,(1-0.22)/100}}\right) \\
&= P(Z < -0.724)
\end{aligned}
$$

From the standard normal table, this has chance $(1/2)(100 - 51.61) = 24.195\%$, so the probability that the true proportion is more than 0.25 is about 0.24. Note that we used the sample value $\hat{p}$ to estimate the $\sigma$ of the box.
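As a sanity check, the same number can be computed in a few lines of Python (a sketch of my own, not from the lecture; it uses the standard library's NormalDist):

```python
from math import sqrt
from statistics import NormalDist

p_hat, n = 0.22, 100
se = sqrt(p_hat * (1 - p_hat) / n)   # estimated standard error of p-hat

# z-score of the 0.25 cutoff relative to the sample proportion
z = (p_hat - 0.25) / se              # about -0.724

# Lower-tail area under the standard normal curve
print(NormalDist().cdf(z))           # about 0.234, i.e. roughly 24%
```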

The Finite Population Correction Factor

Our use of the CLT assumes draws from the box with replacement (or, equivalently, that the population is infinite). But in most survey situations, we do not draw with replacement. Respondents are only sampled once. This can have a big effect in small populations. Sampling without replacement shrinks the standard error for the average or the sum (why?), and we should adjust for this by using the finite population correction factor, or FPCF. The finite population correction factor is

$$\mathrm{FPCF} = \sqrt{\frac{B - n}{B - 1}}$$

where B is (as always) the number of tickets in the box and n is the number of draws.

This FPCF will reduce the standard error by a lot when n is a significant fraction of B, and not much otherwise. For example, in the Karl Rove/Lance Armstrong example, we chose a random sample from the entire U.S. population. In that case,

$$\mathrm{FPCF} = \sqrt{\frac{290{,}000{,}000 - 100}{290{,}000{,}000 - 1}} = 0.99999\ldots$$

But if we had drawn the random sample from a town of 200, and were making inferences about the gullibility of that town, then

$$\mathrm{FPCF} = \sqrt{\frac{200 - 100}{200 - 1}} = 0.70888.$$

Whenever one samples without replacement from a small population, one should multiply the usual standard error by the FPCF. Recall that:

$se = \sigma/\sqrt{n}$ for averages.
$se = \sigma\sqrt{n}$ for sums.

For the Karl Rove example in a town of 200,

$$
\begin{aligned}
P(\mathrm{EV} > 0.25) &= P\left(\frac{\hat{p} - \mathrm{EV}}{\mathrm{FPCF}\sqrt{\hat{p}(1-\hat{p})/n}} < \frac{\hat{p} - 0.25}{\mathrm{FPCF}\sqrt{\hat{p}(1-\hat{p})/n}}\right) \\
&= P\left(Z < \frac{\hat{p} - 0.25}{\mathrm{FPCF}\sqrt{\hat{p}(1-\hat{p})/n}}\right) \\
&= P\left(Z < \frac{0.22 - 0.25}{0.70888\sqrt{0.22\,(1-0.22)/100}}\right) \\
&= P(Z < -1.024)
\end{aligned}
$$

which is 0.16, a bit smaller than the 0.24 obtained before.
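The adjusted calculation is easy to reproduce in Python (again a sketch of my own; the names fpcf and se are just illustrative):

```python
from math import sqrt
from statistics import NormalDist

B, n, p_hat = 200, 100, 0.22

# Finite population correction factor for sampling without replacement
fpcf = sqrt((B - n) / (B - 1))       # about 0.70888

# FPCF-adjusted standard error of the sample proportion
se = fpcf * sqrt(p_hat * (1 - p_hat) / n)

z = (p_hat - 0.25) / se              # about -1.02
print(NormalDist().cdf(z))           # about 0.15
```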

Confidence Intervals

A confidence interval is an interval (L, U) such that C% of the time, the EV (the population average or proportion) will be greater than L but less than U. (Usually one uses a 95% confidence interval, but sometimes the situation demands more or less confidence. When?) The analyst gets to pick the confidence level C. The numbers L and U are obtained from the sample by using the CLT.

The formula for a confidence interval for a population mean is:

$$L = \bar{X} - se \cdot Z_c; \qquad U = \bar{X} + se \cdot Z_c$$

where $Z_c$ is the value from a normal table such that the area between $-Z_c$ and $Z_c$ is C. (For a 95% confidence interval, $Z_{95} = 1.96$, often approximated by 2.) Since $se = \sigma/\sqrt{n}$, the width $U - L$ of the confidence interval goes to zero as n increases. If we have sampled without replacement from a finite population, then $se = \mathrm{FPCF} \cdot \sigma/\sqrt{n}$, and the width goes to zero even more quickly. If we do not know the standard deviation of the box (or the population), then, typically, we estimate it by the standard deviation of the sample.

Similarly, the formula for a confidence interval for a proportion is:

$$L = \hat{p} - Z_c\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}; \qquad U = \hat{p} + Z_c\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

This is the same formula as before, since the sample proportion is just a sample average. All we have done is re-write the formula in a way that is a bit more readable. As before, if we do not know the standard deviation of the box (i.e., the population), then we can estimate the standard deviation by $\sqrt{\hat{p}(1-\hat{p})}$, which is just the standard deviation of the sample. (Note: we shall always have to do this for proportions. The true standard deviation is $\sqrt{p(1-p)}$, but p is what we are trying to estimate in the first place. If we knew the true $\sigma$, then we could solve to find p.)

Problems

Suppose you want a 95% confidence interval on the proportion of U.S. adults who have seen The Rocky Horror Picture Show. You sample 100 people at random; 82 are virgins (i.e., have never seen it). (Do you need to worry about the FPCF?) Your estimate of the proportion of people who have seen the show is $\hat{p} = 18/100 = 0.18$.

A C% CI on the true p is given by

$$L = \hat{p} - Z_c\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}; \qquad U = \hat{p} + Z_c\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

For C = 95, the normal table gives $Z_c = 1.96$. So,

$$U = \hat{p} + Z_c\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = 0.18 + 1.96\sqrt{\frac{0.18 \times 0.82}{100}} = 0.2553$$

Similarly, L is found by subtracting $se \cdot Z_c$ from $\hat{p}$, and is 0.1047. Thus the 95% confidence interval on the proportion who have seen the show is (0.1047, 0.2553).
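Here is the same computation as a short Python sketch (my own illustration, not from the lecture):

```python
from math import sqrt

p_hat, n, z_c = 0.18, 100, 1.96

# Estimated standard error of the sample proportion
se = sqrt(p_hat * (1 - p_hat) / n)

# 95% confidence interval for the true proportion p
lower = p_hat - z_c * se
upper = p_hat + z_c * se
print(f"({lower:.4f}, {upper:.4f})")   # (0.1047, 0.2553)
```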

One has to be careful when interpreting this confidence interval. It is technically wrong to say that the probability is 0.95 that the true proportion of people who have seen The Rocky Horror Picture Show is between 0.1047 and 0.2553. Instead, one should say that in 95% of similarly constructed intervals, the true proportion will be within the interval. The reason for this is that the true proportion is either within the interval or it isn't; there is no randomness in the parameter (unless you are a Bayesian...). Instead, the randomness comes from the sample. So all we can say is that 95% of the time, we will draw a sample that generates a confidence interval which contains the true value.
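This coverage interpretation can be demonstrated by simulation. The sketch below (my own illustration, assuming a true proportion of p = 0.18) builds many intervals from independent samples and counts how often they contain the truth:

```python
from math import sqrt
import random

random.seed(0)
p_true, n, z_c = 0.18, 100, 1.96
trials, covered = 10_000, 0

for _ in range(trials):
    # Draw n zeroes and ones from a box whose proportion of ones is p_true
    p_hat = sum(random.random() < p_true for _ in range(n)) / n
    se = sqrt(p_hat * (1 - p_hat) / n)
    # Count the intervals that actually contain the true proportion
    if p_hat - z_c * se < p_true < p_hat + z_c * se:
        covered += 1

print(covered / trials)   # close to 0.95 (a bit less, since n = 100 is modest)
```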