Quantitative Foundations Project 3
Instructor: Linwei Wang

Confidence Intervals

Contents

1 Introduction
  1.1 Warning
  1.2 Goals of Statistics
  1.3 Random Variables
  1.4 Distributions
  1.5 Expectations
  1.6 Variance
  1.7 Exercise
  1.8 More Information
2 The Central Limit Theorem
  2.1 Averaging Variables
  2.2 Entropy
  2.3 The Normal Distribution
  2.4 Central Limit Theorem
  2.5 Summary
  2.6 More Information
3 Confidence Intervals
  3.1 Definition
  3.2 Exact Intervals Using Hoeffding's Inequality
  3.3 Asymptotic Intervals Using the Normal Distribution
    3.3.1 Known σ
    3.3.2 Unknown σ
  3.4 More Information
1 Introduction

1.1 Warning

By necessity, we will need to use many concepts here, such as "random variable", "independent", or "probability density", without defining them in a mathematically rigorous way. (Doing so would take the entire time allotted for this section!) Feel free to ask questions, and we will try to convey the meanings of these concepts at the level needed for a working knowledge, but in the end it is your responsibility to fill gaps in your own background as needed.

1.2 Goals of Statistics

Let's begin by talking about the basic goals of statistics, with a very simple example. Suppose we have a weighted 4-sided die, which has some probability of returning each number in 1, 2, ..., 4. Thus, we can think of that die as a simple probability distribution:

    p(x) = \begin{cases} p_1 & x = 1 \\ p_2 & x = 2 \\ p_3 & x = 3 \\ p_4 & x = 4 \end{cases}

So, if we knew the values of p_1, ..., p_4, we would know everything we care about the die. Now, in order for this to be a valid probability distribution, there are a couple of obvious conditions that must be satisfied. First, probabilities must be non-negative, that is, \forall i, p_i \ge 0. Secondly, we always have to get some number, which means that p_1 + p_2 + p_3 + p_4 = 1. Any set of numbers p_1, ..., p_4 satisfying these two conditions constitutes a valid probability distribution.

Now, suppose that we flip the die a bunch of times, and we get the following result: 4, 1, 3, 2, 1, 2, 4, 2, 1, 1.

Example statistical problem 1: What is p? Can you think of a way to estimate it?

An obvious solution would be to make a histogram, with probability proportional to the
number of times each flip occurred. This yields the estimated distribution

    \hat{p}(x) = \begin{cases} .4 & x = 1 \\ .3 & x = 2 \\ .1 & x = 3 \\ .2 & x = 4 \end{cases}

This is a reasonable guess, but there are some obvious problems here. In this particular case, perhaps we just happened to get more results of x = 1 by chance. Obviously, we can't expect that the above probabilities are the true ones. For example, simulating another dataset of size 10, I get the data 2, 1, 1, 2, 4, 4, 2, 4, 1, 3, along with the estimated distribution

    \hat{p}(x) = \begin{cases} .3 & x = 1 \\ .3 & x = 2 \\ .1 & x = 3 \\ .3 & x = 4 \end{cases}

If we really consider the situation, we can't make any rigorous guarantee about the difference between our estimated distribution and the true one. Maybe the dice rolls we got just happened to be highly unusual!

Let's consider another example. Suppose we are interested in the probability p(x) that a person has a given height. Now, notice a worrisome technical difficulty here. If we pick any particular height, say 152.92125243 cm, it seems exceedingly unlikely that we will ever find a person with exactly that height. Rather, for continuous variables, we should formally speak of probability densities, not probability distributions. This means we are looking for a function p(x) such that

    \Pr[a \le X \le b] = \int_{x=a}^{b} p(x) \, dx.

That is, we get real probabilities by integrating a probability density. In particular, notice that it is possible for a probability density to be greater than one. (E.g., the density p(x) = a \, \mathbb{I}[0 \le x \le 1/a] can take an arbitrarily high value a.) In any case, probability densities obey rules similar to those for probability distributions, namely

    \forall x, \ p(x) \ge 0
and

    \int_{x=-\infty}^{+\infty} p(x) \, dx = 1.

Anyway, suppose that we go out onto the street, get a set of 10 random people, and measure their heights in centimeters. We might get data like the following: 154, 192, 145, 101, 155, 167.23.

Now, we might ask to recover the original probability density p(x). However, we might also be interested in only certain aspects of the distribution. For example, we might only care about the mean of p,

    \mu = \int_{x=-\infty}^{+\infty} x \, p(x) \, dx.

Example statistical problem 2: What is \mu? Can you think of a way to estimate it?

The mean of the above dataset is 152.3717. But, of course, we want to know the true mean, which is presumably different. So, rather than simply reporting the mean, we should report some sort of guarantee of its reliability. It would be really nice if we could make a statement like the following:

    The true mean is in the range 149-155.

The problem is, we can't do that! We could have gotten really unlucky in our dataset. In principle (knowing nothing about the real heights of humans on earth), the true mean height could be 50, and we just happened to be very unlucky and get unusually tall people when we collected our data. Thus, in statistics, we will have to resign ourselves to fundamentally weaker guarantees. Roughly speaking, we will make guarantees of the following type:

    Unless we were unlucky, the true mean is in the range 149-155.

We will even go on to quantify exactly what "unlucky" means and how unlucky we would have to be. That is, we will ultimately make a guarantee like this:

    A 95% confidence interval for the true mean is the range 149-155.

Now, notice: this does NOT mean that there is a 95% probability that the true mean is in the range 149-155. (If you remember one thing about statistics from this course, let it be this!) The true mean is a fixed number. We don't happen to know it, but it is out there in the world, and it is what it is.
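Both of the estimates discussed so far, the histogram for the die rolls and the sample mean of the heights, take only a few lines to compute. Here is a minimal sketch in Python (used here purely for illustration; later code in these notes is MATLAB-style):

```python
from collections import Counter

# The ten die flips from example 1.
rolls = [4, 1, 3, 2, 1, 2, 4, 2, 1, 1]

# Estimated distribution: the fraction of flips landing on each face.
counts = Counter(rolls)
p_hat = {face: counts[face] / len(rolls) for face in [1, 2, 3, 4]}
print(p_hat)  # {1: 0.4, 2: 0.3, 3: 0.1, 4: 0.2}

# The height data from example 2, and its sample mean.
heights = [154, 192, 145, 101, 155, 167.23]
mean_height = sum(heights) / len(heights)
print(round(mean_height, 4))  # 152.3717
```

Note that neither computation, by itself, says anything about how reliable the estimate is; attaching such guarantees is exactly what the rest of these notes are about.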
Rather, what we are saying is this: we have a procedure for building these things we call confidence intervals. The guarantee we make is precisely this: if you go out into the world and collect data, and then build a confidence interval, then 95% of the time your confidence interval will contain the true mean. That's all the guarantee they make.

It isn't really the guarantee we would like to make. It is awkward. In real life, you do one experiment, and you want to know what the mean is. A confidence interval doesn't tell you what you want to know. We compute confidence intervals because they are the thing we are able to compute, not because they are the thing we want to compute.

The rest of these notes will concentrate on background material to get your statistical brain muscles warmed up.

1.3 Random Variables

Very informally, a random variable is a number that comes from a random event.

Example: Flip a coin 7 times, and let X be the number of heads that come up.

Example: Gather data on the heights of 15 people, and let X be the mean of the measured heights.

You will come to appreciate the purpose of random variables in time.

1.4 Distributions

A variable has a uniform distribution if its probability density is given by, for some numbers a < b,

    p(x) = \begin{cases} \frac{1}{b-a} & a \le x \le b \\ 0 & \text{else} \end{cases}

A variable has a Bernoulli distribution if its probability distribution is given by, for some number \theta \in [0, 1],

    p(x) = \begin{cases} \theta & x = 1 \\ 1 - \theta & x = 0 \end{cases}

A variable has a Normal or Gaussian distribution if its density is given by, for some numbers \mu and \sigma > 0,
    p(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{1}{2\sigma^2} (x - \mu)^2 \right)

The normal is extremely important because (as you might imagine from the name) many phenomena tend to have a Normal or approximately Normal distribution.

Exercise: Draw samples of sizes 10, 100, 1000, and 10000 from each of these three distributions. Calculate a histogram in each case. Calculate the mean of your data. Do you notice anything funny?

1.5 Expectations

Given a random variable X, its expected value is defined as

    E[X] = \int_x x \, p(x) \, dx

if X is continuous, and

    E[X] = \sum_x x \, p(x)

if X is discrete.

Exercise: Suppose X is uniform. What is the expected value? (Answer:

    \int_a^b x \, \frac{1}{b-a} \, dx = \frac{1}{b-a} \left[ \frac{x^2}{2} \right]_a^b = \frac{b^2 - a^2}{2(b-a)} = \frac{(b+a)(b-a)}{2(b-a)} = \frac{a+b}{2}.)

Exercise: Suppose X is Bernoulli. What is the expected value? (Answer: 0 \cdot p(0) + 1 \cdot p(1) = \theta.)

Exercise: Suppose X is Normal. What is the expected value? (Answer: the calculus gets ugly. However, clearly by symmetry the answer is \mu.)

An important property of expectations is that

Theorem 1. The expected value of a sum of a finite number of random variables is the sum of the expected values, i.e.

    E\left[ \sum_{i=1}^n X_i \right] = \sum_{i=1}^n E[X_i].

Note that this theorem does not assume anything about the random variables (other than that the expected values exist). In particular, we do not assume that they are independent.
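As a starting point for the sampling exercise above, here is a sketch in Python (chosen for illustration; the sizes and the parameter values \theta = 0.3, \mu = 5, \sigma = 2 are arbitrary choices, not taken from the notes). The "funny" thing to notice is that the sample means settle down toward (a+b)/2, \theta, and \mu as the sample size grows:

```python
import random

random.seed(0)  # fix the seed so the experiment is reproducible

def sample_means(draw, sizes=(10, 100, 1000, 10000)):
    """For each size n, draw n samples and return the sample mean."""
    return [sum(draw() for _ in range(n)) / n for n in sizes]

# Uniform on [a, b] = [0, 1]: true mean (a + b)/2 = 0.5
uniform_means = sample_means(lambda: random.uniform(0, 1))

# Bernoulli with theta = 0.3: true mean theta = 0.3
bernoulli_means = sample_means(lambda: 1 if random.random() < 0.3 else 0)

# Normal with mu = 5, sigma = 2: true mean mu = 5
normal_means = sample_means(lambda: random.gauss(5, 2))

for name, means in [("uniform", uniform_means),
                    ("bernoulli", bernoulli_means),
                    ("normal", normal_means)]:
    print(name, [round(m, 3) for m in means])
```

The means fluctuate quite a bit at size 10 and very little at size 10000; Section 2 on the Central Limit Theorem makes this behavior precise.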
Exercise: Prove this, for the case of two continuous random variables. Answer:

    E[X_1 + X_2] = \int_{x_1} \int_{x_2} (x_1 + x_2) \, p(x_1, x_2) \, dx_1 \, dx_2
                 = \int_{x_1} \int_{x_2} x_1 \, p(x_1, x_2) \, dx_1 \, dx_2 + \int_{x_1} \int_{x_2} x_2 \, p(x_1, x_2) \, dx_1 \, dx_2
                 = \int_{x_1} x_1 \, p(x_1) \, dx_1 + \int_{x_2} x_2 \, p(x_2) \, dx_2   (marginalizing out the other variable)
                 = E[X_1] + E[X_2]

A second, easy property of expectations is this:

Theorem 2. The expected value of a constant times a random variable is that constant times the expected value, i.e.

    E[aX] = a \, E[X].

Exercise: Prove this. Answer (for continuous variables):

    E[aX] = \int_x a \, x \, p(x) \, dx = a \int_x x \, p(x) \, dx = a \, E[X]

Another important property, which is true only for independent random variables, is this:

Theorem 3. The expected value of a product of a finite number of independent random variables is the product of the expected values, i.e.

    E\left[ \prod_{i=1}^n X_i \right] = \prod_{i=1}^n E[X_i].

Exercise: Prove this for the case of two variables. Answer:

    E[X_1 X_2] = \int_{x_1} \int_{x_2} x_1 x_2 \, p(x_1, x_2) \, dx_1 \, dx_2
               = \int_{x_1} \int_{x_2} x_1 x_2 \, p(x_1) \, p(x_2) \, dx_1 \, dx_2   (using independence)
               = \int_{x_1} x_1 \, p(x_1) \, dx_1 \int_{x_2} x_2 \, p(x_2) \, dx_2
               = E[X_1] \, E[X_2]
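Theorems 1 and 3 can be checked exactly on a tiny discrete example. The joint distribution below is an arbitrary choice for illustration; it makes X_1 and X_2 dependent, so linearity of expectation (Theorem 1) still holds while the product rule (Theorem 3) fails, since the latter requires independence:

```python
# Joint distribution of (x1, x2) over {0,1} x {0,1}; the two
# variables are positively correlated, hence NOT independent.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

E_X1 = sum(x1 * p for (x1, x2), p in joint.items())  # marginal expectation of X1
E_X2 = sum(x2 * p for (x1, x2), p in joint.items())  # marginal expectation of X2
E_sum = sum((x1 + x2) * p for (x1, x2), p in joint.items())
E_prod = sum(x1 * x2 * p for (x1, x2), p in joint.items())

print(E_sum, E_X1 + E_X2)   # equal: Theorem 1 holds even without independence
print(E_prod, E_X1 * E_X2)  # 0.4 vs 0.25: Theorem 3 fails without independence
```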
1.6 Variance

Given some random variable X with mean \mu, the variance is defined to be

    V[X] = E[(X - \mu)^2], where \mu = E[X].

A standard and useful result is that

Theorem 4. V[X] = E[X^2] - \mu^2.

Exercise: Prove this. (Answer: V[X] = E[(X - \mu)^2] = E[X^2 - 2X\mu + \mu^2] = E[X^2] - 2\mu E[X] + \mu^2 = E[X^2] - \mu^2.)

1.7 Exercise

Exercise: Suppose we have a dataset of size N, generated from a Bernoulli distribution (i.e., a bent coin). Let the data be X_1, X_2, X_3, ..., X_N. Suppose we want to estimate the parameter \theta of this distribution. The obvious estimator would be

    \hat{\theta}_N = \frac{1}{N} \sum_{i=1}^N X_i.

That is, we estimate the bias of the coin to be exactly the fraction of the data that resulted in a head.

Part 1: What is the expected value of \hat{\theta}_N?

Part 2: What is the variance of \hat{\theta}_N?

Part 3: Simulate this estimator and calculate its variance. Specifically, write a function that takes a value of \theta and N, generates 10000 datasets of size N, and computes the estimator on each. Make sure that your simulation actually displays the variance you calculated.

Answer to part 1:
    E[\hat{\theta}_N] = E\left[ \frac{1}{N} \sum_{i=1}^N X_i \right]
                      = \frac{1}{N} E\left[ \sum_{i=1}^N X_i \right]   (by Theorem 2)
                      = \frac{1}{N} \sum_{i=1}^N E[X_i]   (by Theorem 1)
                      = \frac{1}{N} N \theta
                      = \theta
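The conclusion of part 1 can also be verified without any simulation: for a small N we can enumerate every possible dataset, weight each one by its probability, and compute E[\hat{\theta}_N] exactly. A sketch in Python (the values \theta = 0.3 and N = 4 are arbitrary choices for the demonstration):

```python
from itertools import product

theta, N = 0.3, 4

# Enumerate all 2^N possible datasets of N Bernoulli draws.
expected = 0.0
for xs in product([0, 1], repeat=N):
    prob = 1.0
    for x in xs:  # the draws are independent, so probabilities multiply
        prob *= theta if x == 1 else 1 - theta
    theta_hat = sum(xs) / N  # the estimator evaluated on this dataset
    expected += prob * theta_hat

print(expected)  # equals theta (up to floating point), confirming unbiasedness
```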
Answer to part 2, in laborious detail:

    V[\hat{\theta}_N] = E[(\hat{\theta}_N - \mu)^2]
                      = E[\hat{\theta}_N^2] - \mu^2   (by Theorem 4)
                      = E[\hat{\theta}_N^2] - \theta^2

    E[\hat{\theta}_N^2] = E\left[ \left( \frac{1}{N} \sum_{i=1}^N X_i \right)^2 \right]
                        = \frac{1}{N^2} E\left[ \sum_i \sum_j X_i X_j \right]
                        = \frac{1}{N^2} E\left[ \sum_i X_i^2 + \sum_i \sum_{j \ne i} X_i X_j \right]   (split into two groups)
                        = \frac{1}{N^2} E\left[ \sum_i X_i + \sum_i \sum_{j \ne i} X_i X_j \right]   (since 0^2 = 0 and 1^2 = 1)
                        = \frac{1}{N^2} \left( \sum_i E[X_i] + \sum_i \sum_{j \ne i} E[X_i X_j] \right)   (by Theorem 1)
                        = \frac{1}{N^2} \left( \sum_i E[X_i] + \sum_i \sum_{j \ne i} E[X_i] E[X_j] \right)   (by Theorem 3)
                        = \frac{1}{N^2} \left( N\theta + N(N-1)\theta^2 \right)

Thus, finally, the variance is

    V[\hat{\theta}_N] = \frac{N\theta + N(N-1)\theta^2}{N^2} - \theta^2
                      = \frac{\theta}{N} + \frac{(N-1)\theta^2}{N} - \theta^2
                      = \frac{\theta}{N} - \frac{\theta^2}{N}
                      = \frac{1}{N} (\theta - \theta^2)
In particular, over the [0, 1] interval, \theta - \theta^2 is maximized at \theta = 1/2, where it takes the value 1/4, and so

    V[\hat{\theta}_N] \le \frac{1}{4N}.

Answer to part 3:

function estimate_bernoulli_variance(theta, N)
    maxrep = 10000;
    theta_est = zeros(maxrep, 1);
    for rep = 1:maxrep
        % sample_bernoulli(N, theta) draws N independent Bernoulli(theta) samples
        X = sample_bernoulli(N, theta);
        theta_est(rep) = mean(X);
    end
    % empirical mean of the estimates vs. theta
    [mean(theta_est) theta]
    % empirical variance vs. the theoretical value (theta - theta^2)/N
    [var(theta_est) (1/N)*(theta - theta^2)]

1.8 More Information

See Arian Maleki and Tom Do's review of probability theory.
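As a closing cross-check of the part 3 exercise, the MATLAB function above relies on a helper sample_bernoulli; here is a self-contained Python sketch of the same experiment (same replication count; the values \theta = 0.3 and N = 50 are arbitrary choices for the demonstration):

```python
import random

def estimate_bernoulli_variance(theta, N, maxrep=10000):
    """Simulate maxrep datasets of size N and return the empirical
    mean and variance of the estimator theta_hat = mean(X)."""
    theta_est = []
    for _ in range(maxrep):
        # one dataset of N Bernoulli(theta) draws
        X = [1 if random.random() < theta else 0 for _ in range(N)]
        theta_est.append(sum(X) / N)
    m = sum(theta_est) / maxrep
    v = sum((t - m) ** 2 for t in theta_est) / (maxrep - 1)
    return m, v

random.seed(0)
theta, N = 0.3, 50
m, v = estimate_bernoulli_variance(theta, N)
print(m, theta)                   # empirical mean of the estimates vs. theta
print(v, (theta - theta**2) / N)  # empirical variance vs. (theta - theta^2)/N
```

With 10000 replications, the empirical mean and variance land very close to \theta and (\theta - \theta^2)/N, as parts 1 and 2 predict.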