Study and research skills 2009. Duncan Golicher and Adrian Newton. Last draft 11/24/2008



Inference about the mean: What you will learn
Why we need to draw inferences from samples
The difference between a population and a sample
An intuitive understanding of the properties of a sample
The sampling distribution of the mean
The standard error of the mean
Confidence intervals
Large sample confidence intervals
Small sample confidence intervals

Inferring the mean: Inference
Statisticians try to make certain statements about uncertain quantities. We can't know everything about the world. Some properties that we are interested in must be inferred. This is a very difficult concept both to use and to accept. Probability is non-intuitive and often misunderstood.

Inferring the mean: Sometimes we just have to do our best!
"I am not going to give you a number for it because it's not my business to do intelligent work. What I told him was not necessarily accurate. It might also not have been inaccurate, but I'm disinclined to mislead anyone." Donald Rumsfeld, when asked to estimate the number of Iraqi insurgents while testifying before Congress

Inferring the mean: Making inferences about a mean
If we want to know a mean, why don't we just calculate it? Didn't we see that last time? It is not quite so simple. The problem arises when we want to know the mean of a population. We usually only have a single sample from that population. If we draw a different sample we get a different mean. So although we know what the mean of a sample is, we can only estimate the mean of a population. It is a known unknown.

Populations: What is a population?
There is another tricky issue to resolve. What do we mean by a population? This can become a philosophical question. A pragmatic way of looking at it is that a population is anything we really want to know about by drawing a sample. A population could be finite or infinite. For example, the population might be all the pine trees in a 5 ha wood in the New Forest, and we want to estimate the mean diameter of the trees growing at that site. Or the population might be Pinus sylvestris, and we might be interested in the mean needle length for the species. In the first case we could measure every single tree if we had time. In the second case we can only ever get a sample. We can't measure all the members of an infinite population!

An example
Imagine we have 100 trees in a forest. Their basal areas (area in cross section, in cm²) are drawn from a theoretical normal distribution with mean = 50 and sd = 20. The population could be either the 100 trees (there are no more) or it could be treated as effectively infinite. (Notice that the second assumption is not practical in this case.) Put that to one side for a moment and look at the trees.
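The setup above can be sketched in code. The slides did their computations in R; here is a minimal Python sketch, with an illustrative seed so the run is repeatable (the exact numbers are not the slides' numbers):

```python
import random
import statistics

random.seed(42)  # illustrative seed, not from the slides

# Simulate the basal areas (cm²) of 100 trees drawn from a
# theoretical normal distribution with mean 50 and sd 20.
basal_areas = [random.gauss(50, 20) for _ in range(100)]

# The finite population of 100 trees now has its own mean and sd,
# close to, but not exactly, the theoretical 50 and 20.
print(statistics.mean(basal_areas))
print(statistics.pstdev(basal_areas))
```

Note the use of `pstdev` (the population sd) here: for these 100 trees we are describing the whole finite population, not estimating from a sample.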

An example: One hundred trees [figure]

If only... If only we knew the truth!
If we could really know that the tree basal areas were drawn from a normal distribution with mean = 50 and sd = 20, we wouldn't need to measure any of them. About 68% of the values would lie between 30 and 70. About 95% of the values would lie between 10 and 90. We would know the mean itself. It is 50. No uncertainty at all.

If only... If only we knew the truth!
If there are only 100 trees (a finite population) we could also become certain about this particular population of trees. We could measure every single tree with extremely precise instruments. We could then calculate the mean and sd for the population of 100 trees.

If only... [figure: the theoretical infinite population (red line) and the empirical finite population (histogram)]

If only... The finite population parameters
The mean is 48.245. The standard deviation is 20.154. Not far from our theoretical values (50 and 20).

Simulating sampling: Sampling
It takes about three minutes to measure a tree's diameter accurately, plus walking time between trees. What if we only have one morning available? We might only manage to measure thirty of the hundred trees. What can we then say about the mean of the hundred trees?

Simulating sampling: A representative sample
It would not be a good idea to measure only the trees on the edge of the forest (they get more light and might be bigger). We should aim for a representative sample. This could be obtained by randomly selecting trees.
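Random selection without replacement is one line in most languages. A minimal Python sketch (the tree IDs are hypothetical labels, not data from the slides):

```python
import random

random.seed(1)  # illustrative seed

tree_ids = list(range(1, 101))            # the 100 trees in the wood
sample_ids = random.sample(tree_ids, 30)  # 30 distinct trees, chosen at random

# Every tree had the same chance of selection, so the sample is
# representative in expectation (unlike measuring only the edge trees).
print(sorted(sample_ids))
```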

Simulating sampling A random sample

Simulating sampling: A representative sample? [figure]

Simulating sampling: In fact it is completely representative
We can never expect the histogram of a relatively small sample to look classically normal. We can implement a test to find out whether it could have been drawn from a normal distribution. Most small samples could have been, and this one was. Small samples rarely look that normal.

Sample properties: The sample properties
The mean is 46.546. The standard deviation is 21.193. But what happens if we send someone else back to draw another random sample?

Sample properties Another random sample

Sample properties: A different sample's properties
The mean is 43.254. The standard deviation is 18.717. And what happens if we send another person back to draw another random sample?

Sample properties Another random sample

Sample properties: Another sample's properties
The mean is 50.726. The standard deviation is 18.95. And what happens if we send yet another person back to draw another random sample?

Sample properties Another random sample

Sample properties: Another sample's properties
The mean is 47.578. The standard deviation is 18.394. And so on, and so on...

Sample properties Another random sample

Sample properties Another random sample

Sample properties: Where is this taking us?
We never really would take repeated samples from the same population in this way. However, frequentist statistical theory is based on this idea. There is something very interesting about the properties of repeated samples. Let's get the computer to do this 1000 times and look at the result.

Sample properties: The sampling distribution of the mean [figure: histogram of the population basal areas; histogram of the mean values of 10000 samples of 30 trees]

Sample properties: The standard error of the mean
The means of our repeated sampling experiments form a much tighter distribution than the data. We can find the mean of the means if we want. It is 48.149. The mean of the finite population of one hundred trees is 48.245. They are close. We can also find the standard deviation of our hypothetical set of sampling experiments. It is 3.053. The standard deviation of the hundred trees is 20.154. They are not close. The standard deviation of the mean is always less than the standard deviation of the data, and it decreases with sample size.
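The whole repeated-sampling thought experiment fits in a few lines. A Python sketch (a fresh simulated population and an illustrative seed, so the numbers will be close to, but not identical to, the slides'):

```python
import random
import statistics

random.seed(7)  # illustrative seed

# A simulated finite population of 100 tree basal areas.
population = [random.gauss(50, 20) for _ in range(100)]
pop_mean = statistics.mean(population)
pop_sd = statistics.pstdev(population)

# Send 1000 imaginary field workers out, each measuring a random 30 trees.
means = [statistics.mean(random.sample(population, 30)) for _ in range(1000)]

mean_of_means = statistics.mean(means)  # close to the population mean
sd_of_means = statistics.stdev(means)   # far smaller than the population sd
print(round(mean_of_means, 3), round(sd_of_means, 3))
```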

Sample properties: 1000 samples of size 2 [figure: histogram of replicate(10000, mean(sample(d$a, 2)))]

Sample properties: 1000 samples of size 4 [figure: histogram of replicate(1000, mean(sample(d$a, 4)))]

Sample properties: 1000 samples of size 10 [figure: histogram of replicate(1000, mean(sample(d$a, 10)))]

Sample properties: 1000 samples of size 20 [figure: histogram of replicate(1000, mean(sample(d$a, 20)))]

Sample properties: 1000 samples of size 50 [figure: histogram of replicate(1000, mean(sample(d$a, 50)))]

Sample properties: 1000 samples of size 90 [figure: histogram of replicate(1000, mean(sample(d$a, 90)))]

Sample properties: So what use is this?
No one would be stupid enough to waste time this way. But the idea is still very useful. It turns out that if we know the standard deviation of a population, we know what the standard deviation of the mean will be for any sample size:

SD_x̄ = σ / √n

where SD_x̄ represents the true standard deviation of the means, σ is the population standard deviation and n is the sample size.

Standard error: The standard error
But that is no use to us! We don't know the population standard deviation unless we measure all the trees! We're back where we started. Fortunately we do have an estimate of it. We can get one every time we take a sample. It is the sample standard deviation. It won't be quite right, and it also varies between samples. In the case of small samples it might even be quite hopeless, but if we only measure some of the trees once, it's all we've got. We call the standard deviation of the mean calculated from the sample standard deviation the standard error:

SE_x̄ = s / √n
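As a formula in code (the helper name and the measurement values here are made up for illustration):

```python
import math
import statistics

def standard_error(sample):
    # SE of the mean: sample standard deviation over the square root of n
    return statistics.stdev(sample) / math.sqrt(len(sample))

measurements = [51.2, 47.9, 55.4, 43.1, 49.8]  # hypothetical basal areas
print(standard_error(measurements))
```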

Large samples: Inference from large samples
If the sample is large (n > 30) we might safely assume that our sample standard deviation s is more or less equal to σ. We can also assume that our standard error is pretty close to actually being the standard deviation of the mean. Now this is beginning to look more useful. We already know all about standard deviations from last time.

Large samples: 68 percent of observations lie within 1 sd of the mean [figure: standard normal density]

Large samples: 95 percent of observations lie within 2 sds of the mean [figure: standard normal density]

Large samples: Confidence intervals
So we can imagine what a histogram of the repeated sampling experiments would look like without having to do them. The means would form a normal distribution with standard deviation = standard error. A 95% confidence interval can be calculated from the sample mean plus or minus two standard errors (or 1.96 standard errors if we want to be really fussy):

x̄ ± 2 SE_x̄

It will include the true population mean 95% of the time.
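That last claim can itself be checked by simulation: build many samples from a population whose mean we know (because we chose it), and count how often the interval catches it. A Python sketch with illustrative settings:

```python
import math
import random
import statistics

random.seed(11)  # illustrative seed
mu, sigma, n, trials = 50, 20, 30, 1000

hits = 0
for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)
    # does the interval mean ± 2 SE contain the true mean?
    if m - 2 * se <= mu <= m + 2 * se:
        hits += 1

coverage = hits / trials  # should come out at roughly 0.95
print(coverage)
```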

Large samples: Calculating a confidence interval. A random sample to try it out on (BA values, cm²):

48.60  39.39  86.50  52.60  21.42  87.95  34.47  78.57  49.20  43.95
22.02  47.85  66.35  26.85  42.42  63.20  34.53  63.78  41.83  58.38
52.68  27.73  46.98  48.15  55.09  28.25  33.22  26.43  74.80  98.91

Large samples: Get the computer to do the work
Sample mean = 50.07. Sample standard deviation = 20.39. Standard error = 3.722. 95% confidence interval for the mean = 50.07 ± 7.44. What was the true population mean? 48.25. It falls inside the interval! And it should do so 19 times out of 20.
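The computer's work can be reproduced directly from the table of 30 values above. The slides did this in R; a Python sketch:

```python
import math
import statistics

ba = [48.60, 39.39, 86.50, 52.60, 21.42, 87.95, 34.47, 78.57, 49.20, 43.95,
      22.02, 47.85, 66.35, 26.85, 42.42, 63.20, 34.53, 63.78, 41.83, 58.38,
      52.68, 27.73, 46.98, 48.15, 55.09, 28.25, 33.22, 26.43, 74.80, 98.91]

m = statistics.mean(ba)                # 50.07
s = statistics.stdev(ba)               # about 20.39
se = s / math.sqrt(len(ba))            # about 3.72
lower, upper = m - 2 * se, m + 2 * se  # 95% CI, roughly 42.6 to 57.5

print(round(m, 2), round(se, 3))
```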

Large samples: What can make us more confident?
The smaller the confidence interval, the more precise our estimate. Remember that if the sample was biased it could still be inaccurate. We can improve precision by measuring something with intrinsically low variability, but in ecology most variability is naturally part of the system, so taking very precise measurements of each element in a sample could be a bit of a waste of time. Or we can take a large sample. But remember that the denominator is the square root of the sample size. Decreasing the uncertainty in your estimate of the mean by a factor of two needs four times as many samples. Decreasing uncertainty by a factor of ten requires a hundred times as many samples.
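The square-root penalty is easy to see numerically. A Python sketch with an assumed sample sd of 20:

```python
import math

def ci_halfwidth(s, n):
    # large-sample 95% CI half-width: two standard errors
    return 2 * s / math.sqrt(n)

s = 20.0  # an assumed sample standard deviation

# Halving the half-width takes four times the sample size,
# and a tenfold improvement takes a hundred times the sample size.
print(ci_halfwidth(s, 30), ci_halfwidth(s, 120), ci_halfwidth(s, 3000))
```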

Small samples: Small sample inference
Remember that the sample standard deviation varies between samples. It varies more for small samples, so as sample size decreases the estimate becomes worse. The use of the t statistic is designed to control for this.

Small samples: What is the t distribution?
A t distribution with a large number of degrees of freedom (a large sample) is the same as a normal distribution. So large sample inference using the t distribution is the same as using the z values of a normal distribution (±2 SEs for 95% confidence intervals). However, as the sample size gets smaller, the t distribution gets longer tails.
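One way to see the longer tails without a table is to build t-distributed values from normals: a t value with df degrees of freedom is a standard normal divided by the square root of an independent chi-squared over df. A Monte Carlo Python sketch (seed and replication count are illustrative):

```python
import math
import random

random.seed(5)  # illustrative seed

def t_draw(df):
    # t = Z / sqrt(chi2_df / df), built from standard normal draws
    z = random.gauss(0, 1)
    chi2 = sum(random.gauss(0, 1) ** 2 for _ in range(df))
    return z / math.sqrt(chi2 / df)

def q975(df, reps=50000):
    # empirical 97.5 percentile of the t distribution
    draws = sorted(t_draw(df) for _ in range(reps))
    return draws[int(0.975 * reps)]

q2 = q975(2)    # small df: far beyond the normal's 1.96
q30 = q975(30)  # large df: close to the normal's 1.96
print(round(q2, 2), round(q30, 2))
```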

Small samples: t distribution with 30 degrees of freedom [figure: dt(df = n - 1, x)]

Small samples: t distribution with 20 degrees of freedom [figure]

Small samples: t distribution with 10 degrees of freedom [figure]

Small samples: t distribution with 5 degrees of freedom [figure]

Small samples: t distribution with 2 degrees of freedom [figure]

Small samples: So how do we use this?
Look up the number that corresponds to the 97.5 percentile of the t distribution for the number of degrees of freedom you have (n - 1). Use this instead of the previous rule-of-thumb number (2) to calculate your confidence intervals.

Small samples: Values of the 97.5 percentile for a range of degrees of freedom

df   qt      df   qt      df   qt
1    12.71   11   2.20    21   2.08
2    4.30    12   2.18    22   2.07
3    3.18    13   2.16    23   2.07
4    2.78    14   2.14    24   2.06
5    2.57    15   2.13    25   2.06
6    2.45    16   2.12    26   2.06
7    2.36    17   2.11    27   2.05
8    2.31    18   2.10    28   2.05
9    2.26    19   2.09    29   2.05
10   2.23    20   2.09    30   2.04
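The table above is one function call in most statistics software. In R the slides would have used qt(0.975, df); assuming SciPy is available, the Python equivalent is:

```python
from scipy.stats import t

# 97.5 percentile of the t distribution for a few degrees of freedom,
# matching the entries in the table above
crit = {df: round(float(t.ppf(0.975, df)), 2) for df in (1, 2, 5, 10, 30)}
print(crit)
```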

Small samples: Calculating a confidence interval using the t distribution. A small sample to try it out on (BA): 46.84, 53.45, 41.83, 42.42, 33.35

Small samples: Get the computer to do the work again
Sample mean = 43.58. Sample standard deviation = 7.37. Standard error = 3.295. 97.5 percentile of the t distribution for 4 degrees of freedom = 2.78. 95% confidence interval for the mean = 43.58 ± 9.15. What was the true population mean? 48.25. It falls inside the interval! And it also should do so 19 times out of 20.
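Reproducing that calculation in code. A Python sketch; the t value 2.776 is the df = 4 critical value to three decimals (the table above rounds it to 2.78):

```python
import math
import statistics

ba = [46.84, 53.45, 41.83, 42.42, 33.35]

m = statistics.mean(ba)                         # 43.58
se = statistics.stdev(ba) / math.sqrt(len(ba))  # about 3.295
t975 = 2.776                                    # 97.5 percentile of t, df = 4
halfwidth = t975 * se                           # about 9.15

print(round(m, 2), round(halfwidth, 2))
```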

Small samples: Assumptions for calculating confidence intervals
Both small sample and large sample confidence intervals assume that the data are drawn from a normally distributed population. We will look more carefully at this assumption later. Perhaps more importantly, the sample must be representative of the population from which it has been drawn.

Small samples: What have we covered
The fundamental basis of statistics. Understand this class and you've passed the course! We've learnt how to guess a defensible range of values for a number we don't know with certainty. We use what we do know to make statements about what we don't. Donald Rumsfeld would approve.

Small samples: What you need to remember
You must remember all the formulas from class 2. The additional concept is the standard error of the mean. This is the sample standard deviation divided by the square root of the number of observations in the sample. Multiply this by two to get a 95% confidence interval for sample sizes > 30. Multiply it by a number greater than two (which you get from the t distribution) for sample sizes < 30.