Data Analysis and Statistical Methods Statistics 651


Data Analysis and Statistical Methods Statistics 651
http://www.stat.tamu.edu/~suhasini/teaching.html
Suhasini Subba Rao

Comparing populations

Suppose I want to compare the heights of males and females at A&M. I can consider all boys at Texas A&M as one population and all girls at Texas A&M as another population.

Question 1: Is the mean girl height less than the mean boy height?

Question 2: What is the difference between the mean girl height and the mean boy height? How much taller are boys than girls?

As I do not have data from the entire student population, I can use the data from a class.

Suggestion: Compare the sample mean of the male heights with the sample mean of the female heights.

Female heights: 5.33 5.33 5.17 5.75 5.42 5.42 5.50 5.50 5.58 5.33 5.50 5.67 5.42 5.25 6.17 5.42 5.33 5.17 5.42 5.42 5.42 5.42 5.42 5.83 5.33 5.67 5.33 5.66 5.25 5.75 5.57 5.35 5.42 5.08 5.75 5.33 5.08

Male heights: 5.75 5.92 6.17 6.08 5.58 5.92 6.00 5.75 5.92 5.75 5.75 5.83 6.58 6.00 5.75 6.42 6.50 6.17 6.00 5.67 5.58 5.83 5.58 5.58 6.08 5.67 6.00

Let X be the height of a randomly selected female and Y the height of a randomly selected male. There are n = 37 girls and m = 27 boys in the samples. The sample mean for the girls is X̄ = (1/37) Σ_{i=1}^{37} X_i = 5.45 and the sample mean for the boys is Ȳ = (1/27) Σ_{i=1}^{27} Y_i = 5.92.

Let µ_X be the female population mean height and µ_Y the male population mean height. We are interested in the quantity µ_X - µ_Y. It will tell us how much larger or smaller one mean is than the other, or whether the male and female mean heights are equal. Of course, we do not know the difference µ_X - µ_Y, and need to infer something about it from the samples. Intuitively, to see whether µ_X and µ_Y are equal, we should compare the sample averages X̄ and Ȳ and look at their difference X̄ - Ȳ.

What can the difference in the sample means say about the difference in the true means, that is µ_X - µ_Y (population mean of females - population mean of males)?

We would expect the population mean of females to be less than the population mean of males; in other words, we expect (population mean of females) - (population mean of males) to be less than zero. Hence we would be interested in testing H_0: µ_X - µ_Y >= 0 against H_A: µ_X - µ_Y < 0. We also want to know the magnitude of the difference; this means constructing a CI for µ_X - µ_Y.

Clearly, if X̄ - Ȳ > 0 we would be unable to reject the null (why? Remember X̄ - Ȳ has to point in the same direction as the alternative). But if X̄ - Ȳ < 0, then we can use a statistical test. The question is how to make the comparison, that is, what is the distribution of X̄ - Ȳ. We look at this now.

Aims: comparing male and female heights

To build a confidence interval for the mean difference µ_X - µ_Y (this will tell us where the mean difference lies and is very informative).

To test the hypothesis H_0: µ_X - µ_Y >= 0 (mean female height is the same as or greater than mean male height) against the alternative H_A: µ_X - µ_Y < 0.

We can also test H_0: µ_X - µ_Y >= -0.3 against H_A: µ_X - µ_Y < -0.3. This is essentially testing whether boys tend on average to be more than 0.3 feet taller than girls. This situation can also arise.

We will consider both constructing CIs for the difference between the means and hypothesis testing. We show how to do this both by hand and by reading the output in JMP. It is important to understand both.

Below we will consider the assumptions that are required to make the test, and also the details. The details may appear to be overwhelming, but do not be deterred by them. In order to do any test, i.e. H_0: µ_X - µ_Y >= 0 against H_A: µ_X - µ_Y < 0, or H_0: µ_X - µ_Y >= -0.3 against H_A: µ_X - µ_Y < -0.3, or to construct a CI for µ_X - µ_Y, we need three magical ingredients:

The difference of the sample averages: X̄ - Ȳ.
The standard error of X̄ - Ȳ (this will turn out to be √(σ²/n + σ²/m)).
The sample sizes n and m.
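As a quick sketch, the first two "ingredients" can be computed from summary statistics alone. The helper function below is hypothetical (not part of the course materials); the numbers plugged in are the height-example summaries from these notes, with the pooled variance s² = 0.06 that is computed later.

```python
import math

def two_sample_ingredients(xbar, ybar, s2, n, m):
    """Difference of the sample means and its pooled standard error."""
    diff = xbar - ybar
    se = math.sqrt(s2 * (1 / n + 1 / m))  # sqrt(s^2/n + s^2/m)
    return diff, se

# Females: n = 37, mean 5.45; males: m = 27, mean 5.92; pooled s^2 = 0.06
diff, se = two_sample_ingredients(5.45, 5.92, 0.06, 37, 27)
print(diff, se)  # roughly -0.47 and 0.062
```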

Formal: comparing populations

We have two samples from two different populations. That is, X_1, ..., X_n is a size-n sample (e.g. the heights of females in the 651 class) from population 1 (e.g. heights of all females) and Y_1, ..., Y_m is a size-m sample (e.g. the heights of males in the 651 class) from population 2 (e.g. heights of all males). The mean of population 1 is µ_X (e.g. mean height of a female) and the mean of population 2 is µ_Y (e.g. mean height of a male). Given the samples, we want to make inference about the difference µ_X - µ_Y.

It is clear this is an important question. Other examples include:

Does a new therapy work better than an old therapy?
Is there a difference in the performance of one school over another?
On average, does eating healthy food mean you live longer?
On average, if one studies more does one get better grades?
Is one housing material better than another?

In the above examples, what are the different populations and samples? All these questions are important and can lead to quite important decisions, therefore it is important that we do a careful analysis.

To construct CIs and do a hypothesis test we use an independent two-sample t-test. To do this test we have to ensure the data satisfies the assumptions below.

Assumptions and how to check them

We have two samples from two different populations, X_1, ..., X_n and Y_1, ..., Y_m (sample sizes n and m respectively).

Both samples are independent of each other and independent within each sample. For example, the values X_1, ..., X_n should have no influence on Y_1, ..., Y_m, and X_1 should not have any influence on X_2, ..., X_n. Can you think of examples where this may not be true? For observations taken over time, those taken close together in time are likely to be close in value. Checking for independence can be difficult, though there are methods available.

If n and m are small, the observations X_1, ..., X_n and Y_1, ..., Y_m should be close to normal. If n and m are large, normality of the observations does not matter (this is the same as in the one-sample tests). When n and m are small, make a QQ-plot. There may be good reasons why the original data is normal.

The variances of the two populations need to be about the same. That is, var(X_i) = σ²_X and var(Y_i) = σ²_Y, and we must have σ²_X = σ²_Y. In practice this may not hold exactly, but the variances do not have to be strictly the same so long as the samples have similar sizes (see Ott and Longnecker, page 275). Make a boxplot of both samples and check whether the variation is the same. We can also do a test to see whether the variances of the two populations are the same (we do this later).
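The equal-variance assumption can also be checked numerically. The sketch below uses simulated stand-in samples (the means and standard deviations are assumptions chosen to mimic the height summaries in these notes, not the actual class data) and applies Levene's test from scipy:

```python
import numpy as np
from scipy import stats

# Simulated stand-ins for the two height samples (assumed values)
rng = np.random.default_rng(1)
x = rng.normal(5.45, 0.22, 37)    # "female" heights
y = rng.normal(5.92, 0.275, 27)   # "male" heights

# Rough check: the ratio of sample standard deviations should be near 1
ratio = x.std(ddof=1) / y.std(ddof=1)
print(ratio)

# A formal equal-variance test; a large p-value means no evidence
# that the population variances differ
stat, p = stats.levene(x, y)
print(p)
```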

Comparing the means of populations

We do not have the populations available, only the samples, and we have to base our conclusions on the samples. To make inference about the population means we should look at the difference between the sample means: X̄ - Ȳ. We are interested in:

Constructing a confidence interval for the difference in the population means. The CI will tell us how much larger one mean is than another, or whether they could be similar in value.

Testing the hypothesis H_0: µ_X - µ_Y = 0 against H_A: µ_X - µ_Y ≠ 0 (or the one-sided versions: H_0: µ_X - µ_Y >= 0 against H_A: µ_X - µ_Y < 0, or H_0: µ_X - µ_Y <= 0 against H_A: µ_X - µ_Y > 0).

But to do any of the above we require more than just X̄ - Ȳ. Remember, both X̄ and Ȳ are sample means, hence random variables; their distributions are centered about the true means µ_X and µ_Y. Therefore X̄ - Ȳ is a random variable too, and its distribution is centered about µ_X - µ_Y. To do anything we require the distribution of X̄ - Ȳ and its standard error, which describes how much spread or error there is in X̄ - Ȳ. We formalise this below. Don't panic, we will go through some examples later on.

Distribution of the difference of the sample means X̄ - Ȳ

If X_i and Y_i have the same variance σ², and X̄ and Ȳ are close to normal, then the difference of the averages has the following distribution:

X̄ - Ȳ ~ N( µ_X - µ_Y, σ²(1/n + 1/m) ).

Note that σ²(1/n + 1/m) = σ²/n + σ²/m.

Important points:

The distribution is centered about µ_X - µ_Y, hence X̄ - Ȳ is likely to be close to µ_X - µ_Y. How close depends on the standard error, which is √(σ²(1/n + 1/m)).

The larger the sample sizes n and m, the smaller the standard error (just like in the one-sample case, where we deal with just one sample mean X̄, which has standard error √(σ²/n)).

Therefore we can make a Z-transform of the difference X̄ - Ȳ:

( (X̄ - Ȳ) - (µ_X - µ_Y) ) / ( σ √(1/n + 1/m) ) ~ N(0, 1).

Of course, in practice σ² will not be known and has to be estimated from the data.
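The distributional claim above can be checked by simulation. The sketch below repeatedly draws two normal samples and records X̄ - Ȳ; the values of µ_X, µ_Y and σ are illustrative assumptions, not the class data. The mean of the simulated differences should sit near µ_X - µ_Y, and their variance near σ²(1/n + 1/m):

```python
import numpy as np

# Monte Carlo check of the distribution of Xbar - Ybar
rng = np.random.default_rng(0)
n, m = 37, 27
mu_x, mu_y, sigma = 5.45, 5.92, 0.25   # illustrative values

diffs = np.array([
    rng.normal(mu_x, sigma, n).mean() - rng.normal(mu_y, sigma, m).mean()
    for _ in range(20000)
])

print(diffs.mean())  # close to mu_x - mu_y = -0.47
print(diffs.var())   # close to sigma**2 * (1/n + 1/m), about 0.004
```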

The distribution, continued

When the variance is unknown and we use the pooled sample variance s², the distribution of the standardised transform using the sample variance is:

( (X̄ - Ȳ) - (µ_X - µ_Y) ) / ( s √(1/n + 1/m) ) ~ t(n + m - 2).

It has a t-distribution with (n + m - 2) degrees of freedom (that is, a t-distribution where the number of degrees of freedom is the sum of the two sample sizes minus two).

Don't panic! If the samples from both populations are larger than 30, then everything is wonderful, and all we require is X̄ - Ȳ, which we can get from the data, s (the pooled sample standard deviation), which is always given to you, and the sample sizes n and m. With these ingredients you can construct a CI and do tests. If you are really lucky you will not have to evaluate √(s²/n + s²/m) yourself; if you are given output, it will already be there in the JMP output! See how it is all put together in two sample independent t-test JMP.pdf.

Confidence intervals for the difference in means

At the 100(1 - α)% level, the confidence interval for the difference in means is

[ (X̄ - Ȳ) - t_{α/2}(n + m - 2) s √(1/n + 1/m), (X̄ - Ȳ) + t_{α/2}(n + m - 2) s √(1/n + 1/m) ].

Examples: The 95% CI in the case that n = 21 and m = 31 is

[ (X̄ - Ȳ) - t_{0.025}(50) s √(1/21 + 1/31), (X̄ - Ȳ) + t_{0.025}(50) s √(1/21 + 1/31) ].

The 99% CI in the case that n = 21 and m = 31 is

[ (X̄ - Ȳ) - t_{0.005}(50) s √(1/21 + 1/31), (X̄ - Ȳ) + t_{0.005}(50) s √(1/21 + 1/31) ].

You will need to look up t_{0.025}(50) and t_{0.005}(50) in the t-tables.
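Instead of t-tables, the critical value can be pulled from scipy. A minimal sketch of the 95% CI using the height-example summaries from these notes (X̄ = 5.45, Ȳ = 5.92, pooled s² = 0.06, n = 37, m = 27):

```python
import math
from scipy import stats

# Height-example summary statistics from the notes
xbar, ybar = 5.45, 5.92
s2, n, m = 0.06, 37, 27

se = math.sqrt(s2 * (1 / n + 1 / m))
tcrit = stats.t.ppf(1 - 0.05 / 2, df=n + m - 2)  # t_{0.025}(62)
lo = (xbar - ybar) - tcrit * se
hi = (xbar - ybar) + tcrit * se
print(lo, hi)  # roughly (-0.59, -0.35)
```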

Choosing the sample sizes

Notice that the length of the interval is small when √(1/n + 1/m) is small. Suppose n + m = 100. Then:

If n = 50 and m = 50, √(1/50 + 1/50) = 0.2.
If n = 99 and m = 1, √(1/99 + 1/1) ≈ 1.0.

We see that the variance will be small when n and m are close. Remember, a smaller variance means a better estimator. Therefore having similar sample sizes (for a given total sample size) is a good thing! We can always assess the quality of the difference in averages X̄ - Ȳ by looking at its variance, σ²(1/n + 1/m). As always, the smaller (1/n + 1/m), the better.

Hypothesis testing

Testing H_0: µ_X - µ_Y = 0 against H_A: µ_X - µ_Y ≠ 0. What we need to do: calculate the t-statistic under the null,

( (X̄ - Ȳ) - 0 ) / ( s √(1/n + 1/m) ),

and look this number up in the t-tables with (n + m - 2) degrees of freedom. If (n + m - 2) is large, use the normal tables instead. This will give you the p-value. If the p-value is small (say less than 5%), then we reject the null in favour of the alternative.

Example: heights of students

We know that there are n = 37 girls and m = 27 boys. The sample mean for the girls is X̄ = (1/37) Σ_{i=1}^{37} X_i = 5.45 and the sample mean for the boys is Ȳ = (1/27) Σ_{i=1}^{27} Y_i = 5.92. The sample variance for the girls is s²_x = (1/36) Σ_{i=1}^{37} (X_i - X̄)² = 0.0484 and for the boys s²_y = (1/26) Σ_{i=1}^{27} (Y_i - Ȳ)² = 0.0758.

The two populations are all male and all female heights at A&M. Suppose the population mean female height is µ_X and the population mean male height is µ_Y. Objectives:

Build a 95% confidence interval for µ_X - µ_Y.
Test the hypothesis that the mean male and female heights are the same against the alternative that the mean male height is greater than the mean female height (α = 0.05).
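The claim that a balanced split gives the shortest interval is easy to verify numerically. With n + m = 100 fixed, the sketch below prints √(1/n + 1/m) for a few splits:

```python
import math

# CI length scales with sqrt(1/n + 1/m); with n + m = 100 fixed,
# the balanced split n = m = 50 minimises it
total = 100
for n in (50, 80, 99):
    m = total - n
    print(n, m, round(math.sqrt(1 / n + 1 / m), 3))
```

The printed factors grow from 0.2 (balanced) to about 1.0 (extremely unbalanced), matching the two cases worked out above.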

Checking the assumptions for the height data, to do an independent-sample t-test

Unless many of the students in the 651 class were related, it is reasonable to assume that the observations are independent.

The sample standard deviations are s_x = 0.22 and s_y = 0.275, which are close.

Below we make boxplots and QQ-plots.

[Figure: side-by-side boxplots of the male and female heights.]

The standard deviations s_x = 0.22 and s_y = 0.275 are quite similar, and this is confirmed by the boxplots: the spreads of the interquartile ranges in the two plots look similar.

[Figure: normal QQ-plots; female heights in the top plot, male heights in the lower plot.]

The data looks close to normal (in a hand-wavy sense). The sample sizes of 27 and 37 are quite large, so I think we can stick with the normal assumption. There does seem to be one huge outlier in the female plot and a few male outliers, which we need to keep in mind.

We now do the test by hand, but compare it with the JMP output in two sample independent t-test JMP.pdf. The ingredients we need are:

The pooled sample variance:

s² = ( (37 - 1) × 0.0484 + (27 - 1) × 0.0758 ) / (37 + 27 - 2) = (1.74 + 1.97) / 62 = 0.06.

Don't worry about how this formula was obtained.
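The pooled-variance arithmetic above can be reproduced in a couple of lines, using the per-group sample variances given in the notes:

```python
# Pooled sample variance for the height data
n, m = 37, 27
s2x, s2y = 0.0484, 0.0758  # per-group sample variances from the notes

s2 = ((n - 1) * s2x + (m - 1) * s2y) / (n + m - 2)
print(round(s2, 3))  # 0.06
```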

t_{α/2}(n + m - 2) = t_{0.025}(62). The t-distribution with 62 degrees of freedom is not in the tables. Either use t_{0.025}(60), or, since 62 is quite large, use the normal approximation: z_{0.025} = 1.96.

X̄ - Ȳ = 5.45 - 5.92 = -0.47.

The confidence interval for the heights

The 95% CI is

[ -0.47 - 1.96 √(0.06 (1/27 + 1/37)), -0.47 + 1.96 √(0.06 (1/27 + 1/37)) ]
= [ -0.47 - 1.96 × 0.062, -0.47 + 1.96 × 0.062 ] = [ -0.59, -0.35 ].

Zero is not contained in the above, so it seems that Texas A&M boys tend to be taller than Texas A&M girls. With 95% confidence, the difference in mean heights lies in the interval [ -0.59, -0.35 ].

Hypothesis test for the heights

This is closely related to what we did above. We want to test H_0: µ_X - µ_Y >= 0 against H_A: µ_X - µ_Y < 0.

Note that whether the test is left-sided or right-sided depends on how you choose to order µ_X and µ_Y: either µ_X - µ_Y or µ_Y - µ_X. This becomes even more important when you do the test in JMP. JMP automatically selects whether it is considering the difference µ_X - µ_Y or µ_Y - µ_X, and this depends on how you code the levels (for the male/female data it depends on how you code the male and female categories). From the output you should see which way it takes the difference. In "Means for Oneway Anova", JMP gives the sample mean for each level (you should know what the levels correspond to); for example, in the height example 0 is male and 1 is female, the mean for level 0 is 5.9 and the mean for level 1 is 5.45. In the t-test output it gives the Difference; for example, the difference for the height example is -0.466, so you can see that JMP is evaluating (level 1) - (level 0), i.e. it formulates the test as µ_X - µ_Y, and hence you should state your hypothesis in terms of the difference µ_X - µ_Y. JMP also gives you a clue: just below "t-test" it states "1-0", which means it formulates the test as (level 1) - (level 0).

We assume the null for now and construct the test statistic. Under the null we have

( X̄ - Ȳ ) / ( s √(1/37 + 1/27) ) ~ t(37 + 27 - 2).
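The test statistic and its left-tail p-value can also be computed directly from the summary statistics. A sketch using the same numbers as the by-hand calculation in these notes:

```python
import math
from scipy import stats

# Left-tailed test H0: mu_X - mu_Y >= 0 vs HA: mu_X - mu_Y < 0,
# using the height-example summaries from the notes
xbar, ybar, s2, n, m = 5.45, 5.92, 0.06, 37, 27

t = (xbar - ybar) / math.sqrt(s2 * (1 / n + 1 / m))
p = stats.t.cdf(t, df=n + m - 2)  # P(T <= t), the left-tail p-value
print(t, p)  # t is about -7.5; p is essentially 0
```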

We do the calculation:

( -0.47 ) / √( 0.06 (1/37 + 1/27) ) = -0.47 / 0.062 ≈ -7.5.

We don't have t(62) in the tables, so we approximate with a normal distribution. Suppose Z ~ N(0, 1); then P(Z <= -7.5) ≈ 0. So the p-value is really small, and for pretty much all values of α we reject the null in favour of the alternative: Texas A&M boys tend to be taller than Texas A&M girls.

Example: diets

Two diets are being compared for effectiveness. 10 volunteers went on Diet I and 10 different volunteers went on Diet II. After one month their weight loss (in kilos) was recorded. The data is given below.

Diet I: 2.9 2.7 3.9 2.7 2.1 2.6 2.2 4.2 5.0 0.7
Diet II: 3.5 2.5 3.8 8.1 3.6 2.5 5.0 2.9 2.3 3.1

Let µ_I be the mean weight loss on Diet I and µ_II be the mean weight loss on Diet II. Test the hypothesis that the diets are different.

Aside: estimating the variance σ² - the pooled sample variance

This is the procedure for estimating σ²:

Evaluate the sample variance s²_x = (1/(n - 1)) Σ_{i=1}^{n} (X_i - X̄)².
Evaluate the sample variance s²_y = (1/(m - 1)) Σ_{i=1}^{m} (Y_i - Ȳ)².
Evaluate the pooled sample variance: s² = ( (n - 1) s²_x + (m - 1) s²_y ) / (n + m - 2).

You do not have to know this; you just need to know that JMP will estimate the population variance σ² using the pooled sample variance above.
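For the diet exercise, the whole pooled-variance two-sample t-test is one call in scipy (a sketch, using the diet values as listed in the table above; `ttest_ind` pools the variances by default, matching the test in these notes):

```python
from scipy import stats

# Weight losses (kilos) for the two diets, from the table in the notes
diet1 = [2.9, 2.7, 3.9, 2.7, 2.1, 2.6, 2.2, 4.2, 5.0, 0.7]
diet2 = [3.5, 2.5, 3.8, 8.1, 3.6, 2.5, 5.0, 2.9, 2.3, 3.1]

# Two-sided pooled-variance t-test of H0: mu_I = mu_II
t, p = stats.ttest_ind(diet1, diet2)
print(t, p)
```

The p-value here is well above 0.05, so this test alone gives no evidence that the diets differ; note also the one large value (8.1) in Diet II, which the normality check should flag.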