CS 70: Discrete Mathematics for CS, Spring 2008
David Wagner, Note 22

I.I.D. Random Variables

Estimating the bias of a coin

Question: We want to estimate the proportion $p$ of Democrats in the US population by taking a small random sample. How large does our sample have to be to guarantee that our estimate will be within (say) ±1 percentage point (in absolute terms) of the true value with probability at least 0.95?

This is perhaps the most basic statistical estimation problem, and it shows up everywhere. We will develop a simple solution that uses only Chebyshev's inequality. More refined methods can be used to get sharper results.

Let's denote the size of our sample by $n$ (to be determined), and the number of Democrats in it by the random variable $S_n$. (The subscript $n$ just reminds us that the r.v. depends on the size of the sample.) Then our estimate will be the value $A_n = \frac{1}{n} S_n$.

Now, as has often been the case, we will find it helpful to write $S_n = X_1 + X_2 + \cdots + X_n$, where
$$X_i = \begin{cases} 1 & \text{if person } i \text{ in the sample is a Democrat;} \\ 0 & \text{otherwise.} \end{cases}$$
Note that each $X_i$ can be viewed as a coin toss, with Heads probability $p$ (though of course we do not know the value of $p$). And the coin tosses are independent.¹ [We say that the $X_i$'s are independent and identically distributed, or just i.i.d. for short.]

What is the expectation of our estimate?
$$E[A_n] = E\Big[\frac{1}{n} S_n\Big] = \frac{1}{n} E[X_1 + X_2 + \cdots + X_n] = \frac{1}{n}(np) = p.$$
So for any value of $n$, our estimate will always have the correct expectation $p$. [Such an r.v. is often called an unbiased estimator of $p$.]

Now presumably, as we increase our sample size $n$, our estimate should get more and more accurate. This will show up in the fact that the variance decreases with $n$: i.e., as $n$ increases, the probability that we are far from the mean $p$ gets smaller. To see this, we need to compute $\mathrm{Var}(A_n)$. And since $A_n = \frac{1}{n}\sum_{i=1}^{n} X_i$, we need to figure out how to compute the variance of a sum of random variables.

¹ We are assuming here that the sampling is done "with replacement"; i.e., we select each person in the sample from the entire population, including those we have already picked. So there is a small chance that we will pick the same person twice.
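Before turning to the variance, here is a quick empirical check of the unbiasedness claim. This is a minimal simulation sketch (assuming numpy is available; the particular values of $p$, $n$, and the number of polls are illustrative choices, not from the notes): it runs many independent polls and confirms that the estimates $A_n$ center on the true $p$.

```python
import numpy as np

rng = np.random.default_rng(0)

p = 0.47        # "true" proportion: unknown in a real poll, fixed here for the test
n = 1_000       # sample size
polls = 10_000  # number of independent simulated polls

# Each poll draws n i.i.d. Bernoulli(p) responses, so S_n ~ Binomial(n, p),
# and the estimate from one poll is A_n = S_n / n.
S_n = rng.binomial(n, p, size=polls)
A_n = S_n / n

print(f"average of A_n over {polls} polls: {A_n.mean():.4f} (true p = {p})")
```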

Theorem 22.1: For any random variable $X$ and constant $c$, we have $\mathrm{Var}(cX) = c^2\,\mathrm{Var}(X)$. And for independent random variables $X$ and $Y$, we have $\mathrm{Var}(X+Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$.

Proof: From the definition of variance, we have
$$\mathrm{Var}(cX) = E[(cX - E[cX])^2] = E[(cX - cE[X])^2] = E[c^2(X - E[X])^2] = c^2\,\mathrm{Var}(X).$$
The proof of the second claim is left as an exercise. Note that the second claim does not in general hold unless $X$ and $Y$ are independent.

Using this theorem, we can now compute $\mathrm{Var}(A_n)$:
$$\mathrm{Var}(A_n) = \mathrm{Var}\Big(\frac{1}{n}\sum_{i=1}^{n} X_i\Big) = \Big(\frac{1}{n}\Big)^2\,\mathrm{Var}\Big(\sum_{i=1}^{n} X_i\Big) = \Big(\frac{1}{n}\Big)^2 \sum_{i=1}^{n}\mathrm{Var}(X_i) = \frac{\sigma^2}{n},$$
where we have written $\sigma^2$ for the variance of each of the $X_i$, i.e., $\sigma^2 = \mathrm{Var}(X_i)$. So we see that the variance of $A_n$ decreases linearly with $n$. This fact ensures that, as we take larger and larger sample sizes $n$, the probability that we deviate much from the expectation $p$ gets smaller and smaller.

Let's now use Chebyshev's inequality to figure out how large $n$ has to be to ensure a specified accuracy in our estimate of the proportion of Democrats $p$. A natural way to measure this is to specify two parameters, $\varepsilon$ and $\delta$, both in the range $(0,1)$. The parameter $\varepsilon$ controls the error we are prepared to tolerate in our estimate, and $\delta$ controls the confidence we want to have in our estimate. A more precise version of our original question is then the following:

Question: For the Democrat-estimation problem above, how large does the sample size $n$ have to be in order to ensure that $\Pr[|A_n - p| \ge \varepsilon] \le \delta$?

In our original question, we had $\varepsilon = 0.01$ and $\delta = 0.05$: we wanted to know how large $n$ needs to be so that $\Pr[p - 0.01 < A_n < p + 0.01] \ge 0.95$, which is equivalent to asking how large $n$ needs to be so that $\Pr[|A_n - p| \ge 0.01] \le 0.05$.

Notice that, in this example, $\varepsilon$ measures the absolute error, i.e., the difference between the estimate $A_n$ and the true value $p$. In many applications, the relative error is a better measure of error, but in the case of polling it's usually enough to bound the absolute error.²

Let's apply Chebyshev's inequality to answer our more precise question above. Since we know $\mathrm{Var}(A_n)$, this will be quite simple. From Chebyshev's inequality, we have
$$\Pr[|A_n - p| \ge \varepsilon] \le \frac{\mathrm{Var}(A_n)}{\varepsilon^2} = \frac{\sigma^2}{n\varepsilon^2}.$$
To make this less than the desired value $\delta$, we need to set
$$n \ge \frac{\sigma^2}{\varepsilon^2 \delta}. \qquad (1)$$

² In many other applications, the absolute error is a poor measure, because a given absolute error (say, ±0.01) might be quite small in the context of measuring a large value like $p = 0.5$, but very large when measuring a small value like $p = 0.005$. For this reason, in most real-life applications, it is more useful to examine the relative error, i.e., to measure the error as a ratio of the target value $p$. (Thus the absolute error of the estimate $A_n$ is $|A_n - p|$, while the relative error is $|A_n - p|/p = |A_n/p - 1|$.) The relative error has the advantage of treating all values of $p$ equally. However, polling is a special case where it is often sufficient to use the absolute error, and the mathematics for the absolute error is slightly simpler, so we will continue to use the absolute error in this example, confident in the knowledge that we could modify these calculations to use a relative error measure if we wanted.

Now recall that $\sigma^2 = \mathrm{Var}(X_i)$ is the variance of a single sample $X_i$. Since $X_i$ is a 0/1-valued r.v., we have $\sigma^2 = p(1-p)$. It is easy to see, using a bit of calculus, that $p(1-p) \le \frac{1}{4}$ (since $0 \le p \le 1$). As a result, inequality (1) becomes
$$n \ge \frac{1}{4\varepsilon^2 \delta}.$$
Plugging in $\varepsilon = 0.01$ and $\delta = 0.05$, we see that a sample size of $n = 50{,}000$ is sufficient.

One amazing corollary is that the necessary sample size depends only on the desired margin of error ($\varepsilon$) and confidence level ($\delta$), but not on the size of the underlying population. We could be polling the state of Wyoming, the state of California, the whole of the US, or the entire world, and the same sample size would be sufficient. This is perhaps a bit counterintuitive.

Estimating a general expectation

What if we wanted to estimate something a little more complex than the proportion of Democrats in the population, such as the average wealth of people in the US? Then we could use exactly the same scheme as above, except that now the r.v. $X_i$ is the wealth of the $i$th person in our sample. Clearly $E[X_i] = \mu$, the average wealth of people in the US (which is what we are trying to estimate). And our estimate will again be $A_n = \frac{1}{n}\sum_{i=1}^{n} X_i$, for a suitably chosen sample size $n$. Once again the $X_i$ are i.i.d. random variables, so we again have $E[A_n] = \mu$ and $\mathrm{Var}(A_n) = \frac{\sigma^2}{n}$, where $\sigma^2 = \mathrm{Var}(X_i)$ is the variance of the $X_i$. (Recall that the only facts we used about the $X_i$ were that they are independent and have the same distribution. Actually it would be enough for them to be independent and all have the same expectation and variance; do you see why?)

In this case, we probably want to use the relative error: we want to choose $n$ to ensure that $\Pr[(1-\varepsilon)\mu < A_n < (1+\varepsilon)\mu] \ge 1 - \delta$, i.e., to ensure that $\Pr[|A_n - \mu| \ge \varepsilon\mu] \le \delta$. Applying Chebyshev's inequality much as before, we find
$$\Pr[|A_n - \mu| \ge \varepsilon\mu] \le \frac{\mathrm{Var}(A_n)}{(\varepsilon\mu)^2} = \frac{\sigma^2}{n\varepsilon^2\mu^2}. \qquad (2)$$
Hence it is enough for the sample size $n$ to satisfy
$$n \ge \frac{\sigma^2}{\mu^2}\cdot\frac{1}{\varepsilon^2\delta}. \qquad (3)$$

Here $\varepsilon$ and $\delta$ are the desired error and confidence, respectively, as before. Now of course we don't know the other two quantities, $\mu$ and $\sigma^2$, appearing in equation (3). In practice, we would try to find some reasonable lower bound on $\mu$ and some reasonable upper bound on $\sigma^2$ (just as we used an upper bound on $p(1-p)$ in the Democrats problem). Plugging these bounds into equation (3) will ensure that our sample size is large enough.

For example, in the average wealth problem we could probably safely take $\mu$ to be at least (say) \$20k (probably more). However, the existence of people such as Bill Gates means that we would need to take a very high value for the variance $\sigma^2$. Indeed, if there is at least one individual with wealth \$50 billion, then, assuming a relatively small value of $\mu$, the variance must be at least about $\frac{(50 \times 10^9)^2}{250 \times 10^6} = 10^{13}$. (Check this.) However, this individual's contribution to the mean is only $\frac{50 \times 10^9}{250 \times 10^6} = 200$. There is really no way around this problem with simple uniform sampling: the uneven distribution of wealth means that the variance is inherently very large, and we will need a huge number of samples before we are likely to find anybody who is immensely wealthy. But if we don't include such people in our sample, then our estimate will be way too low.
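Inequalities (1) and (3) translate directly into a small sample-size calculator. The sketch below is plain Python; the function names are mine, and the 10% relative error used in the wealth example is an illustrative choice, not a figure from the notes.

```python
import math

def sample_size_absolute(sigma2_max: float, eps: float, delta: float) -> int:
    """Smallest n with sigma^2 / (n * eps^2) <= delta, i.e. inequality (1)."""
    return math.ceil(sigma2_max / (eps**2 * delta))

def sample_size_relative(sigma2_max: float, mu_min: float,
                         eps: float, delta: float) -> int:
    """Smallest n satisfying inequality (3), given bounds on sigma^2 and mu."""
    return math.ceil((sigma2_max / mu_min**2) / (eps**2 * delta))

# Democrats problem: sigma^2 <= 1/4, eps = 0.01, delta = 0.05.
print(sample_size_absolute(0.25, 0.01, 0.05))          # -> 50000

# Wealth problem with the crude bounds from the text (sigma^2 ~ 10^13,
# mu >= $20k), even allowing a generous 10% relative error:
print(sample_size_relative(1e13, 20_000, 0.10, 0.05))  # -> 50000000
```

The second answer (50 million samples) makes the point of the wealth discussion numerically: with a heavy-tailed distribution, the $\sigma^2/\mu^2$ factor dominates everything else.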

As a further example, suppose we are trying to estimate the average rate of emission from a radioactive source, and we are willing to assume that the emissions follow a Poisson distribution with some unknown parameter $\lambda$; of course, this $\lambda$ is precisely the expectation we are trying to estimate. Now in this case we have $\mu = \lambda$ and also $\sigma^2 = \lambda$ (see the previous lecture notes). So $\frac{\sigma^2}{\mu^2} = \frac{1}{\lambda}$. Thus in this case a sample size of $n = \frac{1}{\lambda\varepsilon^2\delta}$ suffices. (Again, in practice we would use a lower bound on $\lambda$.)

The Law of Large Numbers

The estimation method we used in the previous two sections is based on a principle that we accept as part of everyday life: namely, the Law of Large Numbers. This asserts that, if we observe some random variable many times, and take the average of the observations, then this average will converge to a single value, which is of course the expectation of the random variable. In other words, averaging tends to smooth out any large fluctuations, and the more averaging we do the better the smoothing.

Theorem 22.2: [Law of Large Numbers] Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables with common expectation $\mu = E[X_i]$. Define $A_n = \frac{1}{n}\sum_{i=1}^{n} X_i$. Then for any $\alpha > 0$, we have
$$\Pr[|A_n - \mu| \ge \alpha] \to 0 \quad \text{as } n \to \infty.$$

We will not prove this theorem here. Notice that it says that the probability of any deviation $\alpha$ from the mean, however small, tends to zero as the number of observations $n$ in our average tends to infinity. Thus by taking $n$ large enough, we can make the probability of any given deviation as small as we like. [Note, however, that the Law of Large Numbers does not say anything about how large $n$ has to be to achieve a certain accuracy. For that, we need Chebyshev's inequality or some other quantitative tool.]

Actually we can say something much stronger than the Law of Large Numbers: namely, the distribution of the sample average $A_n$, for large enough $n$, looks like a bell-shaped curve centered about the mean $\mu$. The width of this curve decreases with $n$, so it approaches a sharp spike at $\mu$. This fact is known as the Central Limit Theorem.

To say this precisely, we need to define the "bell-shaped curve." This is the so-called Normal distribution, and it is the first (and only) non-discrete distribution we will meet in this course. For random variables that take on continuous real values, it no longer makes sense to talk about $\Pr[X = a]$. As an example, consider an r.v. $X$ that has the uniform distribution on the continuous interval $[0,1]$. Then for any single point $0 \le a \le 1$, we have $\Pr[X = a] = 0$. However, it is clearly the case that, for example, $\Pr[\frac{1}{4} \le X \le \frac{3}{4}] = \frac{1}{2}$. So in place of point probabilities $\Pr[X = a]$, we need a different notion of distribution for continuous random variables.

Definition 22.1 (density function): For a real-valued r.v. $X$, a real-valued function $f(x)$ is called a (probability) density function for $X$ if
$$\Pr[X \le a] = \int_{-\infty}^{a} f(x)\,dx.$$

Thus we can think of $f(x)$ as defining a curve, such that the area under the curve between the points $x = a$ and $x = b$ is precisely $\Pr[a \le X \le b]$. Note that we must always have $\int_{-\infty}^{\infty} f(x)\,dx = 1$. (Do you see why?)

As an example, for the uniform distribution on $[0,1]$ the density would be
$$f(x) = \begin{cases} 0 & \text{for } x < 0; \\ 1 & \text{for } 0 \le x \le 1; \\ 0 & \text{for } x > 1. \end{cases}$$
[Check that you agree with this. What would be the density for the uniform distribution on $[-1,1]$?]
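The Law of Large Numbers stated above is easy to watch in action, and the uniform distribution just defined is a convenient test case: its expectation is $\frac{1}{2}$ (verified via the density in the next section). A minimal sketch, assuming numpy:

```python
import numpy as np

rng = np.random.default_rng(2)

# i.i.d. Uniform[0,1] draws; E[X_i] = 1/2, so A_n should approach 0.5.
x = rng.uniform(0.0, 1.0, size=100_000)
running_avg = np.cumsum(x) / np.arange(1, len(x) + 1)

for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"A_{n} = {running_avg[n - 1]:.4f}")
```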

Expectations of continuous r.v.'s are computed in an analogous way to those of discrete r.v.'s, except that we use integrals instead of summations. Thus
$$E[X] = \int_{-\infty}^{\infty} x f(x)\,dx,$$
and also
$$\mathrm{Var}(X) = E[X^2] - (E[X])^2, \quad \text{where } E[X^2] = \int_{-\infty}^{\infty} x^2 f(x)\,dx.$$
[You should check that, for the uniform distribution on $[0,1]$, the expectation is $\frac{1}{2}$ and the variance is $\frac{1}{12}$.]

Now we are in a position to define the Normal distribution.

Definition 22.2 (Normal distribution): The Normal distribution with mean $\mu$ and variance $\sigma^2$ is the distribution with density function
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/2\sigma^2}.$$

[The constant factor $\frac{1}{\sigma\sqrt{2\pi}}$ comes from the fact that $\int_{-\infty}^{\infty} e^{-(x-\mu)^2/2\sigma^2}\,dx = \sigma\sqrt{2\pi}$; we have to normalize by this constant factor to ensure that $\int_{-\infty}^{\infty} f(x)\,dx = 1$. If you like calculus, you might like to do the integrals to check that the expectation $\int x f(x)\,dx$ is indeed $\mu$ and that the variance is indeed $\sigma^2$.]

If you plot the above density function $f(x)$, you will see that it is a symmetrical bell-shaped curve centered around the mean $\mu$. Its height and width are determined by the standard deviation $\sigma$ as follows: 50% of the mass is contained in the interval of width $0.67\sigma$ on either side of the mean, and 99.7% in the interval of width $3\sigma$ on either side of the mean. (Note that, to get the correct scale, deviations are on the order of $\sigma$ rather than $\sigma^2$.) Put another way, if we sample a random value from a Normal distribution, then 50% of the time the value we get will be within 0.67 standard deviations of the mean (i.e., in the range $[\mu - 0.67\sigma,\, \mu + 0.67\sigma]$), and 99.7% of the time it will be within 3 standard deviations of the mean.

Now we are in a position to state the Central Limit Theorem. Because our treatment of continuous distributions has been rather sketchy, we shall be content with a rather imprecise statement. This can be made completely rigorous without too much extra effort.

Theorem 22.3: [Central Limit Theorem] Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables with common expectation $\mu = E[X_i]$ and variance $\sigma^2 = \mathrm{Var}(X_i)$ (both assumed to be finite). Define $A_n = \frac{1}{n}\sum_{i=1}^{n} X_i$. Then as $n \to \infty$, the distribution of $A_n$ approaches the Normal distribution with mean $\mu$ and variance $\frac{\sigma^2}{n}$.

Note that the variance is $\frac{\sigma^2}{n}$ (as we would expect), so the width of the bell-shaped curve decreases by a factor of $\sqrt{n}$ as $n$ increases.

The Central Limit Theorem is actually a very striking fact. What it says is the following. If we take the average of $n$ observations of absolutely any r.v. $X$, then the distribution of that average will be a bell-shaped curve centered at $\mu = E[X]$. Thus all trace of the distribution of $X$ disappears as $n$ gets large: all distributions, no matter how complex,³ look like the Normal distribution when they are averaged. The only effect of the original distribution is through the variance $\sigma^2$, which determines the width of the curve for a given value of $n$, and hence the rate at which the curve shrinks to a spike.

³ We do need to assume that the mean and variance of $X$ are finite.
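The 50% and 99.7% figures, combined with the Central Limit Theorem's claim that $A_n$ is approximately Normal, can be checked by simulation. A sketch (assuming numpy; Exponential(1) is chosen deliberately because its density looks nothing like a bell curve):

```python
import numpy as np

rng = np.random.default_rng(3)

n, trials = 200, 10_000
mu, sigma = 1.0, 1.0  # Exponential(1) has mean 1 and variance 1

# A_n for many independent experiments; the CLT says A_n is approximately
# Normal(mu, sigma^2 / n) even though the X_i themselves are far from normal.
A_n = rng.exponential(scale=1.0, size=(trials, n)).mean(axis=1)
sd = sigma / np.sqrt(n)  # standard deviation of A_n

within_067 = np.mean(np.abs(A_n - mu) <= 0.67 * sd)
within_3 = np.mean(np.abs(A_n - mu) <= 3.0 * sd)

print(f"within 0.67 sd: {within_067:.3f} (Normal predicts ~0.50)")
print(f"within 3 sd:    {within_3:.3f} (Normal predicts ~0.997)")
```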

One useful consequence is that the Binomial distribution can be approximated by the Normal distribution, since the Binomial$(n,p)$ distribution is obtained as the sum of $n$ i.i.d. (0/1-valued) random variables. In particular, if we hold $p$ fixed and let $X_n$ be a random variable with distribution $X_n \sim \text{Binomial}(n,p)$, then as $n \to \infty$, $\frac{1}{n}X_n$ approaches the Normal distribution with mean $p$ and variance $\frac{p(1-p)}{n}$. This means that if $n$ is sufficiently large, we can approximate the r.v. $\frac{1}{n}X_n$ as a Normal distribution with mean $p$ and variance $\frac{p(1-p)}{n}$; or, in other words, we can approximate the r.v. $X_n$ itself as a Normal distribution with mean $np$ and variance $np(1-p)$. So, for large $n$, the Binomial distribution can often be approximated by a Normal distribution, which can be a helpful tool for approximate computations about Binomially distributed random variables. What is amazing about the Central Limit Theorem is that it shows that the same kind of approximation applies not only to sums of 0/1-valued random variables (i.e., not only to the Binomial distribution) but also to sums of any other kind of i.i.d. r.v.'s, as long as $n$ is sufficiently large.
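As a concrete check on this approximation, the following sketch (standard library only; the parameters $n = 100$, $p = 0.3$ are illustrative) compares the exact Binomial CDF with the Normal approximation of mean $np$ and variance $np(1-p)$:

```python
import math

def binomial_cdf(k: int, n: int, p: float) -> float:
    """Exact Pr[X_n <= k] for X_n ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def normal_cdf(x: float, mu: float, var: float) -> float:
    """Pr[Y <= x] for Y ~ Normal(mu, var)."""
    return 0.5 * (1 + math.erf((x - mu) / math.sqrt(2 * var)))

n, p = 100, 0.3
for k in (20, 25, 30, 35, 40):
    exact = binomial_cdf(k, n, p)
    approx = normal_cdf(k + 0.5, n * p, n * p * (1 - p))  # +0.5: continuity correction
    print(f"k={k}: exact {exact:.4f}  normal approx {approx:.4f}")
```

(The $+0.5$ continuity correction is a standard refinement when approximating a discrete distribution by a continuous one.)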