Inferential Statistics


Inferential Statistics Part 1: Sampling Distributions, Point Estimates & Confidence Intervals

Inferential statistics are used to draw inferences (make conclusions or judgements) about a population from a sample. Consider an experiment in which 10 students who sat an exam after 24 hours of sleep deprivation scored 12% lower than 10 students who sat the same exam after a normal night's sleep. Is the difference real, or could it be due to chance? How much larger could the real difference be than the 12% found in the sample? These are the types of questions answered by inferential statistics.

There are two main methods used in inferential statistics: estimation and hypothesis testing. In estimation, the sample is used to estimate a parameter, and a confidence interval about the estimate is constructed. In the most common use of hypothesis testing, a null hypothesis is put forward and it is determined whether the data are strong enough to reject it. For the sleep deprivation study, the null hypothesis would be that sleep deprivation has no effect on performance.

Sampling Error

When we looked at primary data collection and sampling methods, we stressed the importance of selecting a random sample so that every item or individual in the population had a known chance of being selected. To accomplish this, we could choose a simple random sample, a systematic sample, a stratified sample, a cluster sample, or a combination of these methods. However, it is unlikely that the mean of a sample would be identical to the population mean. Likewise, the sample standard deviation or any other measure computed from a sample would probably not be exactly equal to the corresponding population value. We can therefore expect some difference between a sample statistic, such as the sample mean or sample standard deviation, and the corresponding population parameter. The difference between a sample statistic and a population parameter is called sampling error.

Example: Suppose that a population of five students had exam results of 68, 72, 67, 69 and 74. Suppose that a sample of two results - 68 and 74 - is selected to estimate the population mean result. The mean of that sample is 71. Another sample of two is selected - 72 and 67 - with a sample mean of 69.5. The mean of all the results (the population mean) is 70. The sampling error for the first sample is 1, determined by X̄ - µ = 71 - 70 = 1. The second sample has a sampling error of -0.5, determined by 69.5 - 70 = -0.5.
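As a rough illustration of the calculation above (a minimal Python sketch, not part of the original example), the sampling errors can be computed directly:

from statistics import mean

population = [68, 72, 67, 69, 74]        # exam results for all five students
mu = mean(population)                    # population mean = 70

sample_1 = [68, 74]                      # first sample of two results
sample_2 = [72, 67]                      # second sample of two results

for sample in (sample_1, sample_2):
    x_bar = mean(sample)                 # sample mean
    sampling_error = x_bar - mu          # difference between statistic and parameter
    print(f"sample mean = {x_bar}, sampling error = {sampling_error}")

# Output: sample mean = 71, sampling error = 1
#         sample mean = 69.5, sampling error = -0.5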

Each of these differences, 1 and -0.5, is the error made in estimating the population mean based on a sample mean, and these sampling errors are due to chance. The amount of these errors will vary from one sample to the next. So, given the possibility of a sampling error when sample results are used to estimate a population parameter, how can we make accurate inferences/conclusions about the population based only on sample results? To begin with, we develop a sampling distribution of the sample means.

Sampling Distribution of the Sample Means

The exam results example showed that the means of samples of a specified size vary from sample to sample. The mean exam result of the first sample of two students was 71, and the second sample mean was 69.5. A third sample would probably result in a different mean. The population mean was 70. If we organised the means of all possible samples of two results into a probability distribution, we would obtain the sampling distribution of the sample means.

Example: A firm has seven production workers (considered the population). The hourly earnings of each worker are given below.

Employee No.   Hourly Earnings
10001          7
10002          7
10003          8
10004          8
10005          7
10006          8
10007          9

1. What is the population mean?
2. What is the sampling distribution of the sample means for samples of size 2?
3. What is the mean of the sampling distribution?
4. What observations can be made about the population and the sampling distribution?

The population mean is found by dividing the sum of all the hourly earnings by the number of workers:

µ = ΣX / N = (7 + 7 + 8 + 8 + 7 + 8 + 9) / 7 = 54 / 7 = 7.71 (to two decimal places)

To get the sampling distribution of the sample means, all possible samples of 2 are selected without replacement from the population, and their means are computed. There are 21 possible samples, found by:

Number of possible samples = N! / (n!(N - n)!) = 7! / (2! × 5!) = 21

where N is the number of observations in the population and n is the number of observations in the sample. The 21 distinct sample means from all possible samples of 2 that can be drawn from the population are shown below.

Sample   Employees         Hourly Earnings   Sum   Mean
1        10001 + 10002     7 + 7             14    7.00
2        10001 + 10003     7 + 8             15    7.50
3        10001 + 10004     7 + 8             15    7.50
4        10001 + 10005     7 + 7             14    7.00
5        10001 + 10006     7 + 8             15    7.50
6        10001 + 10007     7 + 9             16    8.00
7        10002 + 10003     7 + 8             15    7.50
8        10002 + 10004     7 + 8             15    7.50
9        10002 + 10005     7 + 7             14    7.00
10       10002 + 10006     7 + 8             15    7.50
11       10002 + 10007     7 + 9             16    8.00
12       10003 + 10004     8 + 8             16    8.00
13       10003 + 10005     8 + 7             15    7.50
14       10003 + 10006     8 + 8             16    8.00
15       10003 + 10007     8 + 9             17    8.50
16       10004 + 10005     8 + 7             15    7.50
17       10004 + 10006     8 + 8             16    8.00
18       10004 + 10007     8 + 9             17    8.50
19       10005 + 10006     7 + 8             15    7.50
20       10005 + 10007     7 + 9             16    8.00
21       10006 + 10007     8 + 9             17    8.50

This probability distribution is the sampling distribution of the sample means and can be summarised as follows.

Sampling Distribution of the Sample Mean for n = 2

Sample Mean   Number of Means   Probability
7.00          3                 0.1429
7.50          9                 0.4286
8.00          6                 0.2857
8.50          3                 0.1429
Total         21                1.0000
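The enumeration above can be reproduced with a short Python sketch (illustrative only; the variable names are my own), using itertools.combinations to list every sample of size 2 drawn without replacement:

from itertools import combinations
from collections import Counter
from statistics import mean

# hourly earnings of the seven production workers (the population)
earnings = {10001: 7, 10002: 7, 10003: 8, 10004: 8, 10005: 7, 10006: 8, 10007: 9}

# every possible sample of 2 employees drawn without replacement -> 21 samples
samples = list(combinations(earnings, 2))
sample_means = [mean([earnings[a], earnings[b]]) for a, b in samples]

# tally how often each sample mean occurs and convert the counts to probabilities
counts = Counter(sample_means)
for x_bar in sorted(counts):
    print(f"{x_bar:.2f}  count = {counts[x_bar]:2d}  probability = {counts[x_bar] / len(samples):.4f}")

# Output: 7.00  count =  3  probability = 0.1429
#         7.50  count =  9  probability = 0.4286
#         8.00  count =  6  probability = 0.2857
#         8.50  count =  3  probability = 0.1429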

The mean of the sampling distribution of the sample mean is obtained by summing the various sample means and dividing the sum by the number of samples. The mean of all the sample means is usually written µX̄:

µX̄ = (7.00 × 3 + 7.50 × 9 + 8.00 × 6 + 8.50 × 3) / 21 = 162 / 21 = 7.71

The µ reminds us that it is a population value, because we have considered all possible samples. The subscript X̄ indicates that it is the mean of the sampling distribution of the sample mean.
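A short continuation of the sketch (again an illustration built on the hypothetical earnings data, not part of the original notes) confirms the value 7.71 numerically and previews the observation below about the spread of the sample means:

from itertools import combinations
from statistics import mean, pstdev

earnings = [7, 7, 8, 8, 7, 8, 9]                                    # hourly earnings of the N = 7 workers
sample_means = [mean(pair) for pair in combinations(earnings, 2)]   # all 21 sample means

print(round(mean(sample_means), 2))     # 7.71, the same as the population mean
print(round(pstdev(earnings), 2))       # population standard deviation, about 0.70
print(round(pstdev(sample_means), 2))   # spread of the sample means, about 0.45

# Note: for a large (effectively infinite) population the spread of the sample means is
# sigma / sqrt(n). Here N is only 7 and sampling is without replacement, so the exact value
# carries a finite-population correction: (sigma / sqrt(2)) * sqrt((7 - 2) / (7 - 1)) ≈ 0.45.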

These observations can be made:

a. The mean of the sample means (µX̄ = 7.71) is equal to the mean of the population (µ = 7.71).

b. The spread in the distribution of the sample means is less than the spread in the population values. The sample means range from 7.00 to 8.50, while the population values vary from 7 to 9. In fact, for samples drawn from a large population, the standard deviation of the distribution of the sample means is equal to the population standard deviation divided by the square root of the sample size, so the formula for the standard deviation of the distribution of sample means is:

σX̄ = σ / √n

Therefore, as we increase the size of the sample, the spread of the distribution of the sample means becomes smaller.

c. The shape of the sampling distribution of the sample means and the shape of the frequency distribution of the population values are different. The distribution of sample means tends to be more bell-shaped and to approximate the normal probability distribution.

In summary, we took all possible random samples from a population and for each sample calculated a sample statistic (the mean). Because each possible sample has a chance of being selected, the probability that the mean hourly earnings will take a particular value, such as 7.00, 7.50 or 8.50, can be determined.

The distribution of the mean amounts earned is called the sampling distribution of the sample means. Even though in practice we see only one particular random sample, in theory any of the samples could arise. Consequently, we view the sampling process as repeated sampling of the statistic from its sampling distribution. This sampling distribution is then used to measure how likely a particular outcome might be.

The Central Limit Theorem

Applying the central limit theorem to the sampling distribution of the sample means allows us to use the normal probability distribution to create confidence intervals for the population mean. The central limit theorem states that, for large random samples, the shape of the sampling distribution of the sample means is close to a normal probability distribution. The approximation is more accurate for large samples than for small samples (most statisticians consider a sample of 30 or more large enough for the central limit theorem to be employed).

General Procedure

Sampling requires that we draw successive samples from a defined population. The samples must be randomly selected and of the same size. Calculate the mean for each sample and plot the sample means. This produces a distribution of sample means. A plot of an "infinite" number of sample means is called the sampling distribution of the mean.

Successive Sampling

Frequency distributions of sample means quickly approach the shape of a normal distribution, even if we are taking relatively few, small samples from a population that is not normally distributed. As we randomly select more and more samples from the population, the distribution of sample means becomes more normally distributed and looks smoother. With an "infinite" number of successive random samples, the sampling distributions all have a normal distribution with a mean that is equal to the population mean (μ).

Increasing Sample Size

As sample sizes increase, the sampling distributions approach a normal distribution. With an "infinite" number of successive random samples, the mean of the sampling distribution is equal to the population mean (μ). As the sample size increases, the variability of each sampling distribution decreases. The range of the sampling distribution is smaller than the range of the original population.
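These properties are easy to see in a small simulation (a sketch only; the exponential population, sample sizes and number of samples are arbitrary choices of mine, not from the original notes):

import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=10, size=100_000)   # a strongly skewed, non-normal population

for n in (2, 10, 30, 100):
    # draw many successive random samples of size n and record each sample mean
    sample_means = rng.choice(population, size=(5_000, n)).mean(axis=1)
    print(f"n = {n:3d}: mean of sample means = {sample_means.mean():5.2f}, "
          f"spread of sample means = {sample_means.std():4.2f}, "
          f"sigma / sqrt(n) = {population.std() / np.sqrt(n):4.2f}")

# The mean of the sample means stays close to the population mean (about 10), the spread
# shrinks roughly like sigma / sqrt(n), and a histogram of the sample means looks more and
# more like a normal curve as n grows, even though the population itself is heavily skewed.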

Taken together, these distributions suggest that the sample mean provides a good estimate of μ, and that the errors in our estimates (indicated by the variability of scores in the distribution) decrease as the size of the samples we draw from the population increases.

Population Distributions

The principles of successive sampling and increasing sample size work for all distributions. We can count on the sampling distribution of the mean being approximately normally distributed, no matter what the original population distribution looks like, as long as the sample size is relatively large. The central limit theorem states that when an infinite number of successive random samples are taken from a population, the distribution of sample means calculated for each sample will become approximately normally distributed, with mean μ and standard deviation σ/√n, as the sample size (n) becomes larger, irrespective of the shape of the population distribution. This is one of the most useful conclusions in statistics. We can reason about the distribution of the sample means with absolutely no information about the shape of the original distribution from which the sample is taken. In other words, the central limit theorem is true for all distributions. The central limit theorem applies only to sample means; its tenets cannot be applied to any other statistic. The central limit theorem tells us what to expect of the distribution of sample means when we take an infinite number of relatively large samples of a given size from a population. It works no matter how the population distribution is shaped, and it helps us test hypotheses about means because it tells us what to expect when we draw samples from a population.

Point Estimates & Confidence Intervals

In statistics, point estimation involves the use of sample data to calculate a single value (known as a statistic) which serves as a "best guess" for an unknown population parameter. For example, the sample mean, X̄, is a statistic and is a point estimate of the population mean, a parameter, μ. While we expect the point estimate (statistic) to be close to the population parameter, we would like to measure how close (accurate) it is. A confidence interval serves this purpose. In contrast to a point estimate, which is a single number, with a confidence interval we use sample data to construct an interval (range) of possible (or probable) values of an unknown population parameter, so that the parameter falls within that range with a specified probability. The specified probability is called the level of confidence.

The information developed about the shape of the sampling distribution of the sample means, that is, the sampling distribution of X̄, allows us to locate an interval that has a specified probability of containing the population mean μ. For reasonably large samples, we can use the central limit theorem and state the following:

1. 95% of the sample means selected from a population will be within 1.96 standard deviations of the population mean μ.
2. 99% of the sample means will lie within 2.58 standard deviations of the population mean.

How are the values of 1.96 and 2.58 obtained? The 95% and 99% refer to the percentage of the time that similarly constructed intervals would include the parameter being estimated. For example, 95% refers to the middle 95% of the observations, so the remaining 5% is equally divided between the two tails. The central limit theorem states that, for large random samples, the shape of the sampling distribution of the sample means is approximately normal, so we use z-tables to look up areas under the normal curve (see z-table).

When the sample size, n, is at least 30, it is generally accepted that the central limit theorem will ensure a normal distribution of the sample means. This is an important consideration. If the sample means are normally distributed, we can use the standard normal distribution, that is, z, in our calculations. When n ≥ 30, the formula for the confidence interval for a mean is:

X̄ ± z × s/√n

where:

X̄ = the sample mean
z = the appropriate z value for the level of confidence
s = the sample standard deviation
n = the sample size
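The multipliers 1.96 and 2.58 can be read from a z-table or computed directly; a quick check (a sketch using scipy, which the original notes do not require) is:

from scipy.stats import norm

# For a 95% interval, 2.5% of the area lies in each tail of the standard normal curve,
# so we need the z value with 97.5% of the area below it. For 99%, we need 99.5%.
print(round(norm.ppf(0.975), 2))   # 1.96
print(round(norm.ppf(0.995), 2))   # 2.58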

Example: An experiment involves selecting a random sample of 256 managers. One item of interest is annual income. The sample mean is 45,420 and the sample standard deviation is 2,050.

1. What is the estimated mean income of all managers (the population), i.e. what is the point estimate?
2. What is the 95% confidence interval for the population mean (rounded to the nearest 10)?
3. Interpret the findings.

1. The point estimate of the population mean is 45,420.

2. The confidence interval for the mean is:

X̄ ± z × s/√n, with X̄ = 45,420, z = 1.96, s = 2,050 and n = 256

45,420 ± 1.96 × 2,050/√256 = 45,420 ± 251.125

giving 45,168.875 and 45,671.125, i.e. (rounded to the nearest 10)

45,170 ≤ μ ≤ 45,670

3. We can say that we are 95% confident that the unknown population mean income (μ) is between 45,170 and 45,670. If we had time to select many samples of size 256 from the population of managers and compute the sample means and confidence intervals, the population mean annual income would be contained in about 95 of every 100 confidence intervals. Either an interval contains the population mean or it does not. About 5 out of every 100 confidence intervals would not contain the population mean annual income, μ.
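The same interval can be reproduced in a couple of lines (a minimal sketch; the figures are those of the worked example above):

from math import sqrt

x_bar, s, n, z = 45_420, 2_050, 256, 1.96   # sample mean, sample std dev, sample size, 95% z value

margin = z * s / sqrt(n)                    # 1.96 * 2,050 / 16 = 251.125
lower, upper = x_bar - margin, x_bar + margin

print(round(lower, -1), round(upper, -1))   # 45170.0 45670.0, i.e. 45,170 <= mu <= 45,670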