Overview of Estimation



Inference is the problem of turning data into knowledge, where knowledge often is expressed in terms of entities that are not present in the data per se but are present in models that one uses to interpret the data. Statistical rigor is necessary to justify the inferential leap from data to knowledge, and many difficulties arise in attempting to bring statistical principles to bear on massive data. Overlooking this foundation may yield results that are, at best, not useful, or harmful at worst. In any discussion of massive data and inference, it is essential to be aware that it is quite possible to turn data into something resembling knowledge when actually it is not. Moreover, it can be quite difficult to know that this has happened.
- Frontiers in Massive Data Analysis by the National Research Council, 2013.

The balance of this book is devoted to developing formal procedures of statistical inference. In this introduction to inference, we will be basing our analysis on the premise that the data has been collected according to well-designed procedures. We will focus our presentation on parametric estimation and hypothesis testing based on a given family of probability models chosen to be consistent with the science under investigation and with the data collection procedures.

1 Introduction

In the simplest possible terms, the goal of estimation theory is to answer the question: What is that number? What is the length, the reaction rate, the fraction displaying a particular behavior, the temperature, the kinetic energy, the Michaelis constant, the speed of light, mutation rate, the melting point, the probability that the dominant allele is expressed, the elasticity, the force, the mass, the free energy, the mean number of offspring, the focal length, mean lifetime, the slope and intercept of a line?

The next step is to perform an experiment that is well designed to estimate one (or more) numbers.
However, before we can embark on such a design, we must learn some principles of estimation to have some understanding of the properties of a good estimator and to present our uncertainty about the estimation procedure. Statistics has provided two distinct approaches to this question, typically called classical or frequentist and Bayesian. We shall give an overview of both approaches. However, the notes will emphasize the classical approach.

We begin with a definition:

Definition 1. A statistic is a function of the data that does not depend on any unknown parameter.

We have, to this point, seen a variety of statistics.

Example 2.
- sample mean, $\bar{x}$

- sample variance, $s^2$
- sample standard deviation, $s$
- sample median, sample quartiles $Q_1, Q_3$, percentiles and other quantiles
- standardized scores $(x_i - \bar{x})/s$
- order statistics $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$, including sample maximum and minimum
- sample moments $\overline{x^m} = \frac{1}{n}\sum_{k=1}^{n} x_k^m$, $m = 1, 2, 3, \ldots$

Here, we will look at a particular type of parameter estimation, in which we consider $\mathbf{X} = (X_1, \ldots, X_n)$, independent random variables chosen according to one of a family of probabilities $P_\theta$ where $\theta$ is an element from the parameter space $\Theta$. Based on our analysis, we choose an estimator $\hat\theta(\mathbf{X})$. If the data $\mathbf{x}$ takes on the values $x_1, x_2, \ldots, x_n$, then $\hat\theta(x_1, x_2, \ldots, x_n)$ is called the estimate of $\theta$. Thus we have three closely related objects,

1. $\theta$ - the parameter, an element of the parameter space $\Theta$. This is a number or a vector.
2. $\hat\theta(x_1, x_2, \ldots, x_n)$ - the estimate. This again is a number or a vector obtained by evaluating the estimator on the data $\mathbf{x} = (x_1, x_2, \ldots, x_n)$.
3. $\hat\theta(X_1, \ldots, X_n)$ - the estimator. This is a random variable. We will analyze the distribution of this random variable to decide how well it performs in estimating $\theta$.

The first of these three objects is a number. The second is a statistic. The third can be analyzed and its properties described using the theory of probability. Keeping the relationship among these three objects in mind is essential in understanding the fundamental issues in statistical estimation.

Example 3. For Bernoulli trials $\mathbf{X} = (X_1, \ldots, X_n)$, we have
1. $p$, a single parameter, the probability of success, with parameter space $[0, 1]$.
2. $\hat{p}(x_1, \ldots, x_n)$ is the sample proportion of successes in the data set.
3. $\hat{p}(X_1, \ldots, X_n)$, the sample mean of the random variables
$$\hat{p}(X_1, \ldots, X_n) = \frac{1}{n}(X_1 + \cdots + X_n) = \frac{1}{n} S_n$$
is an estimator of $p$. We can give the distribution of this estimator because $S_n$ is a binomial random variable.

Example 4. Given pairs of observations $(\mathbf{x}, \mathbf{y}) = ((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n))$ that display a general linear pattern, we use ordinary least squares regression for
1. parameters - the slope $\beta$ and intercept $\alpha$ of the regression line. So, the parameter space is $\mathbb{R}^2$, pairs of real numbers.
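2. They are estimated using the statistics $\hat\beta$ and $\hat\alpha$ in the equations
$$\hat\beta(\mathbf{x}, \mathbf{y}) = \frac{\mathrm{cov}(\mathbf{x}, \mathbf{y})}{\mathrm{var}(\mathbf{x})}, \qquad \bar{y} = \hat\alpha(\mathbf{x}, \mathbf{y}) + \hat\beta(\mathbf{x}, \mathbf{y})\,\bar{x}.$$
3. Later, when we consider statistical inference for linear regression, we will analyze the distribution of the estimators.

Exercise 5. Let $\mathbf{X} = (X_1, \ldots, X_n)$ be independent uniform random variables on the interval $[0, \theta]$ with $\theta$ unknown. Give some estimators of $\theta$ from the statistics above.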
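The estimates in the Bernoulli and least squares examples are ordinary functions of the observed data, which a short Python sketch can make concrete. The function names `sample_proportion` and `ols_fit` and the small data sets are illustrative assumptions, not part of the notes.

```python
from statistics import mean


def sample_proportion(xs):
    """Estimate of p for Bernoulli data (0/1 values): the fraction of successes."""
    return sum(xs) / len(xs)


def ols_fit(xs, ys):
    """Least squares estimates of slope and intercept.

    slope = cov(x, y) / var(x); the intercept follows from
    ybar = intercept + slope * xbar. The common 1/(n-1) factor in
    cov and var cancels, so plain sums are used.
    """
    xbar, ybar = mean(xs), mean(ys)
    cov = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    var = sum((x - xbar) ** 2 for x in xs)
    slope = cov / var
    intercept = ybar - slope * xbar
    return slope, intercept


# Hypothetical data: four Bernoulli trials, and four points on y = 1 + 2x.
print(sample_proportion([1, 0, 1, 1]))
print(ols_fit([0, 1, 2, 3], [1, 3, 5, 7]))
```

Evaluating these functions on one data set gives an estimate. The estimator is the same function applied to the random variables, so its value changes from sample to sample.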

2 Classical Statistics

In classical statistics, the state of nature is assumed to be fixed, but unknown to us. Thus, one goal of estimation is to determine which of the $P_\theta$ is the source of the data. The estimate is a statistic
$$\hat\theta : \text{data} \to \Theta.$$
Introduction to estimation in the classical approach to statistics is based on two fundamental questions:
- How do we determine estimators?
- How do we evaluate estimators?

We can ask whether an estimator in any way systematically underestimates or overestimates the parameter, whether it has large or small variance, and how it compares to a notion of best possible estimator. How easy is it to determine and to compute, and how does the procedure improve with increased sample size?

The raw material for our analysis of any estimator is the distribution of the random variables that underlie the data under any possible value $\theta$ of the parameter. To simplify language, we shall use the term density function to refer to both continuous and discrete random variables. Thus, to each parameter value $\theta \in \Theta$, there exists a density function which we denote $f_X(\mathbf{x} \mid \theta)$.

We focus on experimental designs based on a simple random sample. To be more precise, the observations are based on an experimental design that yields a sequence of random variables $X_1, \ldots, X_n$, drawn from a family of distributions having common density $f_X(x \mid \theta)$ where the parameter value $\theta$ is unknown and must be estimated. Because the random variables are independent, the joint density is the product of the marginal densities.
$$f_X(\mathbf{x} \mid \theta) = \prod_{k=1}^{n} f_X(x_k \mid \theta) = f_X(x_1 \mid \theta) f_X(x_2 \mid \theta) \cdots f_X(x_n \mid \theta).$$
In this circumstance, the data $\mathbf{x}$ are known and the parameter $\theta$ is unknown. Thus, we write the density function as
$$L(\theta \mid \mathbf{x}) = f_X(\mathbf{x} \mid \theta)$$
and call $L$ the likelihood function.

Because the algebra and calculus of $f_X(\mathbf{x} \mid \theta)$ are a bit unfamiliar, we will look at several examples.

Example 6 (Parametric families of densities).
1. Bernoulli trials with a known number of trials $n$ but unknown success probability parameter $p$ have joint density
$$f_X(\mathbf{x} \mid p) = p^{x_1}(1-p)^{1-x_1}\, p^{x_2}(1-p)^{1-x_2} \cdots p^{x_n}(1-p)^{1-x_n} = p^{\sum_{k=1}^{n} x_k}(1-p)^{\sum_{k=1}^{n}(1-x_k)} = p^{\sum_{k=1}^{n} x_k}(1-p)^{n-\sum_{k=1}^{n} x_k} = p^{n\bar{x}}(1-p)^{n(1-\bar{x})}.$$
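2. Normal random variables with known variance $\sigma_0^2$ but unknown mean $\mu$ have joint density
$$f_X(\mathbf{x} \mid \mu) = \frac{1}{\sigma_0\sqrt{2\pi}}\exp\left(-\frac{(x_1-\mu)^2}{2\sigma_0^2}\right) \cdot \frac{1}{\sigma_0\sqrt{2\pi}}\exp\left(-\frac{(x_2-\mu)^2}{2\sigma_0^2}\right) \cdots \frac{1}{\sigma_0\sqrt{2\pi}}\exp\left(-\frac{(x_n-\mu)^2}{2\sigma_0^2}\right) = \frac{1}{(\sigma_0\sqrt{2\pi})^n}\exp\left(-\frac{1}{2\sigma_0^2}\sum_{k=1}^{n}(x_k-\mu)^2\right).$$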

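3. Normal random variables with unknown mean $\mu$ and variance $\sigma^2$ have joint density
$$f_X(\mathbf{x} \mid \mu, \sigma^2) = \frac{1}{(\sigma\sqrt{2\pi})^n}\exp\left(-\frac{1}{2\sigma^2}\sum_{k=1}^{n}(x_k-\mu)^2\right).$$
4. Beta random variables with parameters $\alpha$ and $\beta$ have joint density
$$f_X(\mathbf{x} \mid \alpha, \beta) = \left(\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\right)^{n} (x_1 x_2 \cdots x_n)^{\alpha-1}\,\big((1-x_1)(1-x_2)\cdots(1-x_n)\big)^{\beta-1}.$$

Exercise 7. Give the likelihood function for $n$ observations of independent $\Gamma(\alpha, \beta)$ random variables.

The choice of a point estimator $\hat\theta$ is often the first step. The next two topics will be devoted to considering two approaches for determining estimators - method of moments and maximum likelihood. We next move to analyze the quality of the estimator. With this in view, we will give methods for approximating the bias and the variance of the estimators. Typically, this information is, in part, summarized through what is known as an interval estimator. This is a procedure that determines a subset of the parameter space with high probability that it contains the real state of nature. We see this most frequently in the use of confidence intervals.

3 Bayesian Statistics

For a few tosses of a coin that always turn up tails, the estimate $\hat{p} = 0$ for the probability of heads did not seem reasonable to Thomas Bayes. He wanted a way to place our uncertainty of the value for $p$ into the procedure for estimation.

Today, the Bayesian approach to statistics takes into account not only the density $f_{X \mid \tilde\Theta}(\mathbf{x} \mid \theta)$ for the data collected for any given experiment but also external information to determine a prior density $\pi$ on the parameter space $\Theta$. Thus, in this approach, both the parameter and the data are modeled as random. Estimation is based on Bayes formula.

Let $\tilde\Theta$ be a random variable having the given prior density $\pi$. In the case in which both $\tilde\Theta$ and the data take on only a finite set of values, $\tilde\Theta$ is a discrete random variable and $\pi$ is a mass function
$$\pi\{\theta\} = P\{\tilde\Theta = \theta\}.$$
Let $C_\theta = \{\tilde\Theta = \theta\}$ be the event that $\tilde\Theta$ takes on the value $\theta$ and $A = \{\mathbf{X} = \mathbf{x}\}$ be the values taken on by the data. Then $\{C_\theta, \theta \in \Theta\}$ form a partition of the probability space. Bayes formula is
$$f_{\tilde\Theta \mid X}(\theta \mid \mathbf{x}) = P\{\tilde\Theta = \theta \mid \mathbf{X} = \mathbf{x}\} = P(C_\theta \mid A) = \frac{P(A \mid C_\theta)P(C_\theta)}{\sum_{\psi} P(A \mid C_\psi)P(C_\psi)}$$
or
$$\frac{P\{\mathbf{X} = \mathbf{x} \mid \tilde\Theta = \theta\}P\{\tilde\Theta = \theta\}}{\sum_{\psi} P\{\mathbf{X} = \mathbf{x} \mid \tilde\Theta = \psi\}P\{\tilde\Theta = \psi\}} = \frac{f_{X \mid \tilde\Theta}(\mathbf{x} \mid \theta)\,\pi\{\theta\}}{\sum_{\psi} f_{X \mid \tilde\Theta}(\mathbf{x} \mid \psi)\,\pi\{\psi\}}.$$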
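For a finite parameter set, Bayes formula amounts to multiplying each prior mass by the likelihood and renormalizing. The following Python sketch shows this for Bernoulli data. The three candidate values of $p$, the uniform prior mass, and the function names are assumptions chosen only for illustration.

```python
from fractions import Fraction


def bernoulli_likelihood(xs, p):
    """Joint density p^(sum x) (1-p)^(n - sum x) of 0/1 data xs."""
    successes = sum(xs)
    return p ** successes * (1 - p) ** (len(xs) - successes)


def discrete_posterior(prior, xs):
    """Bayes formula on a finite parameter set.

    prior maps each candidate parameter value to its prior mass.
    The denominator is the sum over all candidates of likelihood * prior.
    """
    weights = {p: bernoulli_likelihood(xs, p) * mass for p, mass in prior.items()}
    total = sum(weights.values())
    return {p: w / total for p, w in weights.items()}


# Hypothetical setup: p is one of 1/4, 1/2, 3/4 with equal prior mass,
# and three tosses all come up tails (coded 0).
prior = {Fraction(1, 4): Fraction(1, 3),
         Fraction(1, 2): Fraction(1, 3),
         Fraction(3, 4): Fraction(1, 3)}
posterior = discrete_posterior(prior, [0, 0, 0])
for p, mass in posterior.items():
    print(p, mass)
```

Three tails shift most of the posterior mass to the smallest candidate value, but no candidate with positive prior mass is ruled out entirely.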
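Given data $\mathbf{x}$, the function of $\theta$, $f_{\tilde\Theta \mid X}(\theta \mid \mathbf{x}) = P\{\tilde\Theta = \theta \mid \mathbf{X} = \mathbf{x}\}$, is called the posterior density.

For a continuous distribution on the parameter space, $\pi$ is now a density for a continuous random variable and the sum in Bayes formula becomes an integral.
$$f_{\tilde\Theta \mid X}(\theta \mid \mathbf{x}) = \frac{f_{X \mid \tilde\Theta}(\mathbf{x} \mid \theta)\,\pi(\theta)}{\int f_{X \mid \tilde\Theta}(\mathbf{x} \mid \psi)\,\pi(\psi)\, d\psi} \qquad (1)$$
Sometimes we shall write (1) as
$$f_{\tilde\Theta \mid X}(\theta \mid \mathbf{x}) = c(\mathbf{x})\, f_{X \mid \tilde\Theta}(\mathbf{x} \mid \theta)\,\pi(\theta)$$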

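where $c(\mathbf{x})$, the reciprocal of the integral in the denominator in (1), is the value necessary so that the integral of the posterior density $f_{\tilde\Theta \mid X}(\theta \mid \mathbf{x})$ with respect to $\theta$ equals 1. We might also write
$$f_{\tilde\Theta \mid X}(\theta \mid \mathbf{x}) \propto f_{X \mid \tilde\Theta}(\mathbf{x} \mid \theta)\,\pi(\theta) \qquad (2)$$
where $c(\mathbf{x})$ is the constant of proportionality.

Estimation, e.g., point and interval estimates, in the Bayesian approach is based on the data and an analysis using the posterior density. For example, one way to estimate $\theta$ is to use the mean of the posterior distribution, or more briefly, the posterior mean,
$$\hat\theta(\mathbf{x}) = E[\theta \mid \mathbf{x}] = \int \theta\, f_{\tilde\Theta \mid X}(\theta \mid \mathbf{x})\, d\theta.$$

Example 8. As suggested in the original question of Thomas Bayes, we will make independent flips of a biased coin and use a Bayesian approach to make some inference for the probability of heads. We first need to set a prior distribution for $\tilde{P}$. The beta family $Beta(\alpha, \beta)$ of distributions takes values in the interval $[0, 1]$ and provides a convenient prior density $\pi$. Thus,
$$\pi(p) = c_{\alpha,\beta}\, p^{\alpha-1}(1-p)^{\beta-1}, \qquad 0 < p < 1.$$
Any density on the interval $[0, 1]$ that can be written as a power of $p$ times a power of $1-p$ times a constant chosen so that
$$1 = \int_0^1 \pi(p)\, dp$$
is a member of the beta family. This distribution has
$$\text{mean } \frac{\alpha}{\alpha+\beta} \quad \text{and variance } \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}. \qquad (3)$$
Thus, the mean is the ratio of $\alpha$ and $\alpha+\beta$. If the two parameters are each multiplied by a factor of $k$, then the mean does not change. However, the variance is reduced by a factor close to $k$. The prior gives a sense of our prior knowledge of the mean through the ratio of $\alpha$ to $\alpha+\beta$ and our uncertainty through the size of $\alpha$ and $\beta$.

If we perform $n$ Bernoulli trials, $\mathbf{x} = (x_1, \ldots, x_n)$, then the joint density is
$$f_X(\mathbf{x} \mid p) = p^{\sum_{k=1}^{n} x_k}(1-p)^{n-\sum_{k=1}^{n} x_k}.$$
Thus, for the posterior distribution of the parameter $\tilde{P}$ given the data $\mathbf{x}$, using (2), we have
$$f_{\tilde{P} \mid X}(p \mid \mathbf{x}) \propto f_{X \mid \tilde{P}}(\mathbf{x} \mid p)\,\pi(p) = p^{\sum_{k=1}^{n} x_k}(1-p)^{n-\sum_{k=1}^{n} x_k} \cdot c_{\alpha,\beta}\, p^{\alpha-1}(1-p)^{\beta-1} = c_{\alpha,\beta}\, p^{\alpha+\sum_{k=1}^{n} x_k-1}(1-p)^{\beta+n-\sum_{k=1}^{n} x_k-1}.$$
Consequently, the posterior distribution is also from the beta family with parameters
$$\alpha + \sum_{k=1}^{n} x_k = \alpha + \#\text{successes} \quad \text{and} \quad \beta + n - \sum_{k=1}^{n} x_k = \beta + \sum_{k=1}^{n}(1-x_k) = \beta + \#\text{failures}.$$
Notice that the posterior mean can be written as
$$\frac{\alpha+\sum_{k=1}^{n} x_k}{\alpha+\beta+n} = \frac{\alpha}{\alpha+\beta+n} + \frac{\sum_{k=1}^{n} x_k}{\alpha+\beta+n} = \frac{\alpha}{\alpha+\beta}\cdot\frac{\alpha+\beta}{\alpha+\beta+n} + \frac{1}{n}\sum_{k=1}^{n} x_k \cdot \frac{n}{\alpha+\beta+n} = \frac{\alpha}{\alpha+\beta}\cdot\frac{\alpha+\beta}{\alpha+\beta+n} + \bar{x}\cdot\frac{n}{\alpha+\beta+n}.$$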
This expression allows us to see that the posterior mean can be expressed as a weighted average of $\alpha/(\alpha+\beta)$, the prior mean, and $\bar{x}$, the sample mean from the data. The relative weights are

$\alpha+\beta$ from the prior and $n$, the number of observations.

Thus, if the number of observations $n$ is small compared to $\alpha+\beta$, then most of the weight is placed on the prior mean $\alpha/(\alpha+\beta)$. As the number of observations $n$ increases, the weight
$$\frac{n}{\alpha+\beta+n}$$
on the sample mean increases towards 1. The weights result in a shift of the posterior mean away from the prior mean and towards the sample mean $\bar{x}$.

This brings forward two central issues in the use of the Bayesian approach to estimation.
- If the number of observations is small, then the estimate relies heavily on the quality of the choice of the prior distribution $\pi$. Thus, an unreliable choice for $\pi$ leads to an unreliable estimate.
- As the number of observations increases, the estimate relies less and less on the prior distribution. In this circumstance, the prior may simply be playing the role of a catalyst that allows the machinery of the Bayesian methodology to proceed.

Exercise 9. Show that this answer is equivalent to having $\alpha$ heads and $\beta$ tails in the data set before actually flipping coins.

Example 10. If we flip a coin $n = 14$ times with 8 heads, then the classical estimate of the success probability $p$ is $8/14 = 4/7$. For a Bayesian analysis with a beta prior distribution, using (3) we have a beta posterior distribution with the following parameters.

            prior                      |    data     |                  posterior
$\alpha$  $\beta$  mean  variance      | heads tails | $\alpha$  $\beta$  mean             variance
   6         6     1/2   1/52 = 0.0192 |   8     6   |   14       12     14/(12+14) = 7/13   168/18252 = 0.0092
   9         3     3/4   3/208 = 0.0144|   8     6   |   17        9     17/(17+9) = 17/26   153/18252 = 0.0084
   3         9     1/4   3/208 = 0.0144|   8     6   |   11       15     11/(15+11) = 11/26  165/18252 = 0.0090

Figure 1: Example of prior (black) and posterior (red) densities based on 14 coin flips, 8 heads and 6 tails. Left panel: Prior is Beta(6, 6). Right panel: Prior is Beta(9, 3). Note how the peak is narrowed. This shows that the posterior variance is smaller than the prior variance. In addition, the peak moves from the prior towards $\hat{p} = 4/7$, the sample proportion of the number of heads.
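The beta posterior parameters, means, and variances for the coin example with 14 flips and 8 heads can be recomputed in exact arithmetic. The Python sketch below assumes the three priors Beta(6, 6), Beta(9, 3), and Beta(3, 9), and the function names are hypothetical.

```python
from fractions import Fraction


def beta_mean_var(a, b):
    """Mean a/(a+b) and variance ab/((a+b)^2 (a+b+1)) of a Beta(a, b) distribution."""
    a, b = Fraction(a), Fraction(b)
    return a / (a + b), a * b / ((a + b) ** 2 * (a + b + 1))


def beta_posterior(a, b, heads, tails):
    """Conjugate update: add successes to alpha and failures to beta."""
    return a + heads, b + tails


heads, tails = 8, 6
for a, b in [(6, 6), (9, 3), (3, 9)]:
    post_a, post_b = beta_posterior(a, b, heads, tails)
    prior_mean, prior_var = beta_mean_var(a, b)
    post_mean, post_var = beta_mean_var(post_a, post_b)
    print(f"prior Beta({a},{b}): mean {prior_mean}, variance {float(prior_var):.4f}; "
          f"posterior Beta({post_a},{post_b}): mean {post_mean}, variance {float(post_var):.4f}")
```

Using `Fraction` keeps the means exact, so they can be compared directly with fractions such as 7/13 and 17/26.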

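In his original example, Bayes chose the uniform distribution ($\alpha = \beta = 1$) for his prior. In this case the posterior mean is
$$\frac{1}{n+2}\left(1 + \sum_{k=1}^{n} x_k\right).$$
For the example above

            prior                      |    data     |                  posterior
$\alpha$  $\beta$  mean  variance      | heads tails | $\alpha$  $\beta$  mean            variance
   1         1     1/2   1/12 = 0.0833 |   8     6   |    9        7     9/(9+7) = 9/16   63/4352 = 0.0145

Example 11. Suppose that the prior density is a normal random variable with mean $\theta_0$ and variance $1/\lambda$. This way of giving the variance may seem unusual, but we will see that $\lambda$ is a measure of information. Thus, low variance means high information. Our data $\mathbf{x}$ are a realization of independent normal random variables with unknown mean $\theta$. We shall choose the variance to be 1 to set a scale for the size of the variation in the measurements that yield the data $\mathbf{x}$. We will present this example omitting some of the algebraic steps to focus on the central ideas.

The prior density is
$$\pi(\theta) = \sqrt{\frac{\lambda}{2\pi}}\exp\left(-\frac{\lambda}{2}(\theta-\theta_0)^2\right).$$
We rewrite the density for the data to emphasize the difference between the parameter $\theta$ for the mean and $\bar{x}$, the sample mean.
$$f_{X \mid \tilde\Theta}(\mathbf{x} \mid \theta) = \frac{1}{(2\pi)^{n/2}}\exp\left(-\frac{1}{2}\sum_{i=1}^{n}(x_i-\theta)^2\right) = \frac{1}{(2\pi)^{n/2}}\exp\left(-\frac{n}{2}(\theta-\bar{x})^2 - \frac{1}{2}\sum_{i=1}^{n}(x_i-\bar{x})^2\right).$$
The posterior density is proportional to the product $f_{X \mid \tilde\Theta}(\mathbf{x} \mid \theta)\,\pi(\theta)$. Because the posterior is a function of $\theta$, we need only keep track of the terms which involve $\theta$. Consequently, we write the posterior density as
$$f_{\tilde\Theta \mid X}(\theta \mid \mathbf{x}) = c(\mathbf{x})\exp\left(-\frac{1}{2}\left(n(\theta-\bar{x})^2 + \lambda(\theta-\theta_0)^2\right)\right) = \tilde{c}(\mathbf{x})\exp\left(-\frac{n+\lambda}{2}(\theta-\theta_1(\mathbf{x}))^2\right)$$
where
$$\theta_1(\mathbf{x}) = \frac{\lambda}{\lambda+n}\theta_0 + \frac{n}{\lambda+n}\bar{x}. \qquad (4)$$
Notice that the posterior distribution is normal with mean $\theta_1(\mathbf{x})$ that results from the weighted average with relative weights

$\lambda$ from the information from the prior and $n$ from the data.

The variance is inversely proportional to the total information $\lambda + n$. Thus, if $n$ is small compared to $\lambda$, then $\theta_1(\mathbf{x})$ is near $\theta_0$. If $n$ is large compared to $\lambda$, then $\theta_1(\mathbf{x})$ is near $\bar{x}$.

Exercise 12. Fill in the steps in the derivation of the posterior density in the example above.

For these two examples, we see that the prior distribution and the posterior distribution are members of the same parameterized family of distributions, namely the beta family and the normal family. In these two cases, we say that the prior density and the density of the data form a conjugate pair.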
In the case of coin tosses, we find that the beta and the Bernoulli families form a conjugate pair. In Example 11, we learn that the normal density is conjugate to itself.
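The normal update with unit-variance data reduces to a two-line computation: the posterior mean is the information-weighted average of the prior mean and the sample mean, and the posterior variance is the reciprocal of the total information. In this Python sketch the function name `normal_posterior` is hypothetical, and the inputs (prior mean 1, prior information 2, three observations with sample mean 2/3) are chosen for illustration.

```python
from fractions import Fraction


def normal_posterior(theta0, lam, n, xbar):
    """Posterior mean and variance for a N(theta0, 1/lam) prior
    and n independent N(theta, 1) observations with sample mean xbar.

    mean     = (lam * theta0 + n * xbar) / (lam + n)
    variance = 1 / (lam + n)
    """
    total_information = lam + n
    mean = (lam * theta0 + n * xbar) / total_information
    variance = 1 / total_information
    return mean, variance


# Illustrative values: prior N(1, 1/2), so theta0 = 1 and lam = 2.
post_mean, post_var = normal_posterior(Fraction(1), Fraction(2), 3, Fraction(2, 3))
print(post_mean, post_var)
```

With these inputs the posterior mean lies between the sample mean 2/3 and the prior mean 1, and closer to the sample mean because $n = 3$ exceeds $\lambda = 2$.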

Figure 2: Example of prior (black) and posterior (red) densities for a normal prior distribution and normally distributed data. In this figure the prior density is $N(1, 1/2)$. Thus, $\theta_0 = 1$ and $\lambda = 2$. Here the data consist of 3 observations having sample mean $\bar{x} = 2/3$. Thus, the posterior mean from equation (4) is $\theta_1(\mathbf{x}) = 4/5$ and the variance is $1/(2+3) = 1/5$.

Typically, the computation of the posterior density is much more computationally intensive than what was shown in the two examples above. The choice of conjugate pairs is enticing because the posterior density is determined from a simple algebraic computation.

Bayesian statistics is seeing increasing use in the sciences, including the life sciences, as we see the explosive increase in the amount of data. For example, using a classical approach, mutation rates estimated from genetic sequence data are, due to the paucity of mutation events, often not very precise. However, we now have many data sets that can be synthesized to create a prior distribution for mutation rates, leading to estimates for this and other parameters of interest that will have much smaller variance than under the classical approach.

Exercise 13. Show that the gamma family of distributions is a conjugate prior for the Poisson family of distributions. Give the posterior mean based on $n$ observations.

4 Answers to Selected Exercises

5. Double the average, $2\bar{X}$. Take the maximum value of the data, $\max_{1 \le i \le n} x_i$. Double the difference of the maximum and the minimum, $2(\max_{1 \le i \le n} x_i - \min_{1 \le i \le n} x_i)$.

7. The density of a gamma random variable is
$$f(x \mid \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}.$$
Thus, for $n$ observations
$$L(\alpha, \beta \mid \mathbf{x}) = f(x_1 \mid \alpha, \beta) f(x_2 \mid \alpha, \beta) \cdots f(x_n \mid \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} x_1^{\alpha-1} e^{-\beta x_1} \cdot \frac{\beta^{\alpha}}{\Gamma(\alpha)} x_2^{\alpha-1} e^{-\beta x_2} \cdots \frac{\beta^{\alpha}}{\Gamma(\alpha)} x_n^{\alpha-1} e^{-\beta x_n} = \frac{\beta^{n\alpha}}{\Gamma(\alpha)^n} (x_1 x_2 \cdots x_n)^{\alpha-1} e^{-\beta(x_1 + x_2 + \cdots + x_n)}.$$

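9. In this case the total number of observations is $\alpha + \beta + n$ and the total number of successes is $\alpha + \sum_{i=1}^{n} x_i$. Their ratio is the posterior mean.

12. To include some of the details in the computation, we first add and subtract $\bar{x}$ in the sum for the joint density,
$$f_{X \mid \tilde\Theta}(\mathbf{x} \mid \theta) = \frac{1}{(2\pi)^{n/2}}\exp\left(-\frac{1}{2}\sum_{i=1}^{n}(x_i-\theta)^2\right) = \frac{1}{(2\pi)^{n/2}}\exp\left(-\frac{1}{2}\sum_{i=1}^{n}\big((x_i-\bar{x})+(\bar{x}-\theta)\big)^2\right).$$
Then we expand the square in the sum to obtain
$$\sum_{i=1}^{n}\big((x_i-\bar{x})+(\bar{x}-\theta)\big)^2 = \sum_{i=1}^{n}(x_i-\bar{x})^2 + 2\sum_{i=1}^{n}(x_i-\bar{x})(\bar{x}-\theta) + \sum_{i=1}^{n}(\bar{x}-\theta)^2 = \sum_{i=1}^{n}(x_i-\bar{x})^2 + 0 + n(\bar{x}-\theta)^2.$$
This gives the joint density
$$f_{X \mid \tilde\Theta}(\mathbf{x} \mid \theta) = \frac{1}{(2\pi)^{n/2}}\exp\left(-\frac{1}{2}\sum_{i=1}^{n}(x_i-\bar{x})^2\right)\exp\left(-\frac{n}{2}(\theta-\bar{x})^2\right).$$
The posterior density is
$$f_{\tilde\Theta \mid X}(\theta \mid \mathbf{x}) = c(\mathbf{x})\, f_{X \mid \tilde\Theta}(\mathbf{x} \mid \theta)\,\pi(\theta)$$
$$= c(\mathbf{x})\,\frac{1}{(2\pi)^{n/2}}\exp\left(-\frac{1}{2}\sum_{i=1}^{n}(x_i-\bar{x})^2\right)\exp\left(-\frac{n}{2}(\theta-\bar{x})^2\right)\cdot\sqrt{\frac{\lambda}{2\pi}}\exp\left(-\frac{\lambda}{2}(\theta-\theta_0)^2\right)$$
$$= \left(c(\mathbf{x})\,\frac{1}{(2\pi)^{n/2}}\sqrt{\frac{\lambda}{2\pi}}\exp\left(-\frac{1}{2}\sum_{i=1}^{n}(x_i-\bar{x})^2\right)\right)\exp\left(-\frac{1}{2}\left(n(\theta-\bar{x})^2+\lambda(\theta-\theta_0)^2\right)\right)$$
$$= c_1(\mathbf{x})\exp\left(-\frac{1}{2}\left(n(\theta-\bar{x})^2+\lambda(\theta-\theta_0)^2\right)\right).$$
Here $c_1(\mathbf{x})$ is the function of $\mathbf{x}$ in parentheses. We now expand the expressions in the exponent,
$$n(\theta-\bar{x})^2+\lambda(\theta-\theta_0)^2 = (n\theta^2 - 2n\theta\bar{x} + n\bar{x}^2) + (\lambda\theta^2 - 2\lambda\theta\theta_0 + \lambda\theta_0^2)$$
$$= (n+\lambda)\theta^2 - 2(n\bar{x}+\lambda\theta_0)\theta + (n\bar{x}^2+\lambda\theta_0^2)$$
$$= (n+\lambda)\left(\theta^2 - 2\,\frac{n\bar{x}+\lambda\theta_0}{n+\lambda}\,\theta\right) + (n\bar{x}^2+\lambda\theta_0^2)$$
$$= (n+\lambda)\left(\theta^2 - 2\theta\,\theta_1(\mathbf{x}) + \theta_1(\mathbf{x})^2\right) - (n+\lambda)\theta_1(\mathbf{x})^2 + (n\bar{x}^2+\lambda\theta_0^2)$$
$$= (n+\lambda)(\theta-\theta_1(\mathbf{x}))^2 - (n+\lambda)\theta_1(\mathbf{x})^2 + (n\bar{x}^2+\lambda\theta_0^2)$$
using the definition of $\theta_1(\mathbf{x})$ in (4) and completing the square.
$$f_{\tilde\Theta \mid X}(\theta \mid \mathbf{x}) = c_1(\mathbf{x})\exp\left(-\frac{1}{2}\left((n\bar{x}^2+\lambda\theta_0^2) - (n+\lambda)\theta_1(\mathbf{x})^2 + (n+\lambda)(\theta-\theta_1(\mathbf{x}))^2\right)\right)$$
$$= \left(c_1(\mathbf{x})\exp\left(-\frac{1}{2}\left((n\bar{x}^2+\lambda\theta_0^2) - (n+\lambda)\theta_1(\mathbf{x})^2\right)\right)\right)\exp\left(-\frac{n+\lambda}{2}(\theta-\theta_1(\mathbf{x}))^2\right)$$
$$= c_2(\mathbf{x})\exp\left(-\frac{n+\lambda}{2}(\theta-\theta_1(\mathbf{x}))^2\right)$$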

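where $c_2(\mathbf{x})$ is the function of $\mathbf{x}$ in parentheses. This gives a posterior density that is normal, with mean $\theta_1(\mathbf{x})$ and variance $1/(n+\lambda)$.

13. For $n$ observations $x_1, x_2, \ldots, x_n$ of independent Poisson random variables having parameter $\lambda$, the joint density is the product of the marginal densities.
$$f_{X \mid \Lambda}(\mathbf{x} \mid \lambda) = \frac{\lambda^{x_1}}{x_1!}e^{-\lambda}\cdot\frac{\lambda^{x_2}}{x_2!}e^{-\lambda}\cdots\frac{\lambda^{x_n}}{x_n!}e^{-\lambda} = \frac{1}{x_1!\,x_2!\cdots x_n!}\lambda^{x_1+x_2+\cdots+x_n}e^{-n\lambda} = \frac{1}{x_1!\,x_2!\cdots x_n!}\lambda^{n\bar{x}}e^{-n\lambda}.$$
The prior density on $\lambda$ has a $\Gamma(\alpha, \beta)$ density
$$\pi(\lambda) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\lambda^{\alpha-1}e^{-\beta\lambda}.$$
Thus, the posterior density
$$f_{\Lambda \mid X}(\lambda \mid \mathbf{x}) = c(\mathbf{x})\,\lambda^{\alpha-1}e^{-\beta\lambda}\cdot\lambda^{n\bar{x}}e^{-n\lambda} = c(\mathbf{x})\,\lambda^{\alpha+n\bar{x}-1}e^{-(\beta+n)\lambda}$$
is the density of a $\Gamma(\alpha + n\bar{x}, \beta + n)$ random variable. Its mean can be written as the weighted average
$$\frac{\alpha+n\bar{x}}{\beta+n} = \frac{\alpha}{\beta}\cdot\frac{\beta}{\beta+n} + \bar{x}\cdot\frac{n}{\beta+n}$$
of the prior mean $\alpha/\beta$ and the sample mean $\bar{x}$. The weights are, respectively, proportional to $\beta$ and the number of observations $n$.

The figure below demonstrates the case with a $\Gamma(2, 1)$ prior density on $\lambda$ and a sum $x_1+x_2+x_3+x_4+x_5 = 6$ for 5 values for independent observations of a Poisson random variable. Thus the posterior has a $\Gamma(2+6, 1+5) = \Gamma(8, 6)$ distribution.

[Figure: density plotted against x]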
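The gamma-Poisson update can be checked numerically with a minimal Python sketch. The notes give only the sum 6 over 5 observations, so the individual counts in `xs` below are hypothetical values chosen to have that sum, and the function names are also assumptions.

```python
from fractions import Fraction


def gamma_poisson_posterior(alpha, beta, xs):
    """Conjugate update for Poisson counts xs with a Gamma(alpha, beta) prior:
    add the total count to alpha and the number of observations to beta."""
    return alpha + sum(xs), beta + len(xs)


def gamma_mean(alpha, beta):
    """Mean alpha/beta of a Gamma(alpha, beta) distribution (rate parameterization)."""
    return Fraction(alpha, beta)


alpha, beta = 2, 1
xs = [0, 1, 1, 2, 2]  # hypothetical counts; only their sum (6) and number (5) matter
post_alpha, post_beta = gamma_poisson_posterior(alpha, beta, xs)
post_mean = gamma_mean(post_alpha, post_beta)

# Weighted average of the prior mean and the sample mean,
# with weights beta/(beta+n) and n/(beta+n).
n = len(xs)
xbar = Fraction(sum(xs), n)
weighted = gamma_mean(alpha, beta) * Fraction(beta, beta + n) + xbar * Fraction(n, beta + n)
print(post_alpha, post_beta, post_mean, weighted)
```

The posterior depends on the data only through the sum and the number of observations, which is why the individual counts can be chosen freely here.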
