KLMED8004 Medical statistics Part I, autumn 00

Estimation
Harald Johnsen, Sept 00

We have previously learned:
How known probability distributions (e.g. binomial distribution, normal distribution) with known population parameters (mean, variance) can answer questions such as the following.

Discrete distribution: Given n = …0 childbirths and probability of low birth weight p = 0.…, what is the probability of observing at least 3 low-birth-weight children?
Model: $X \sim \mathrm{bin}(n, p)$. The quantity sought is $P(X \ge 3 \mid n, p)$.

Continuous distribution: Given that the birth weight X is normally distributed with mean μ = 3750 g and standard deviation σ = 500 g, what is the probability that a newborn weighs at least …54… g?
Model: $X \sim N(\mu, \sigma) = N(3750, 500)$. With $x$ denoting the weight limit in the question,
$$P(X \ge x \mid \mu = 3750,\ \sigma = 500) = P\left(Z \ge \frac{x - 3750}{500}\right)$$

New questions
- How can a sample (NO: utvalg) be used to estimate unknown parameters in a probability distribution?
- How to estimate the central tendency in a distribution? Which one is the best measure of the central tendency?
- How to estimate variability (variance)?
- How to place an interval around a point estimate to indicate how sure the estimate is?

Population and sample
A population (statistical population, target population) is the complete set of the possible measurements, or the record of some qualitative trait, corresponding to the entire collection of units for which inferences can be made.
A sample is a limited subset of a population that is actually collected in the course of an investigation.
The objective of the process of data collection (sampling) is to draw conclusions about the population.
Essential questions:
- Which population?
- How is the sampling done?
- Given a sample, for which population are the conclusions valid?
Some common sampling procedures
A random sample is a selection of some members of a population such that members are independently chosen and each member has a known nonzero probability of being selected.
A simple random sample is a random sample in which each member has the same probability of being selected.
Stratified sampling: the population is divided into homogeneous subsets (strata) based upon specified traits of the members (sex, age, ...), and random sampling is subsequently carried out within each stratum.
... and there are more.

Point estimation
Given a population and a representative sample. Based on the sample, the challenge is to estimate unknown quantities (parameters) in the population distribution.
Recall that a parameter is a fixed, usually unknown numeric quantity and accordingly non-random.
Examples:
- In the binomial distribution $X \sim \mathrm{bin}(n, p)$ the parameter is $p$
- In a normal distribution $X \sim N(\mu, \sigma)$ the parameters are $\mu$ and $\sigma$
- In the Poisson distribution $X \sim \mathrm{Po}(\lambda)$ the parameter is $\lambda$
In the general case the Greek letter θ (theta) is used as the symbol for a parameter.

Point estimation cont.
Suppose simple random sampling of size $n$ from an effectively infinite population with population mean $\mu$ and variance $\sigma^2$. This gives $n$ identically distributed random variables $X_1, X_2, \ldots, X_n$ (not necessarily normally distributed).
An estimator $\hat\theta$ (theta hat) is a mathematical function of the random variables and is used to estimate the unknown value of the parameter θ. The estimator $\hat\theta$ is a random variable with a probability distribution.
When a random sample becomes available from the population and $\hat\theta$ is computed from the data set, the numeric value obtained is called an estimate of θ from the particular sample. Unlike the estimator, an estimate is nonrandom!
The sample arithmetic mean is an intuitive or natural estimator of the population mean μ:
$$\hat\mu = \frac{X_1 + X_2 + \cdots + X_n}{n} = \bar X$$

The properties of the estimator $\hat\mu = \bar X$
Expected value (mean, NO: forventning):
$$E(\hat\mu) = E(\bar X) = E\left(\frac{X_1 + X_2 + \cdots + X_n}{n}\right) = \frac{n\mu}{n} = \mu$$
Hence, the estimator is unbiased (NO: forventningsrett) ... a good property.
Variance (provided independent observations):
$$\mathrm{Var}(\hat\mu) = \mathrm{Var}(\bar X) = \mathrm{Var}\left(\frac{X_1 + X_2 + \cdots + X_n}{n}\right) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}$$
It follows that $\mathrm{Var}(\hat\mu) \to 0$ when $n \to \infty$ ... a good property too.
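
The two properties of the sample mean (unbiasedness, and variance equal to the population variance divided by the sample size) can be checked empirically. The following is a minimal simulation sketch, not part of the original slides. The exponential population (mean 2, variance 4), the sample size 16, the number of replications and the function name `simulate_sample_means` are all illustrative assumptions, chosen to show that the result does not require normally distributed data.

```python
import random
import statistics


def simulate_sample_means(n, reps, seed=1):
    """Return `reps` sample means, each from a simple random sample of size n.

    Assumed population (illustration only): exponential with rate 0.5,
    i.e. mean mu = 2 and variance sigma^2 = 4.
    """
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return [
        statistics.fmean(rng.expovariate(0.5) for _ in range(n))
        for _ in range(reps)
    ]


if __name__ == "__main__":
    n = 16
    means = simulate_sample_means(n, reps=20000)
    # Unbiasedness: the average of the sample means should be close to mu = 2.
    print("mean of sample means:    ", statistics.fmean(means))
    # Variance: should be close to sigma^2 / n = 4 / 16 = 0.25.
    print("variance of sample means:", statistics.pvariance(means))
```

Increasing `n` in the sketch should shrink the second number roughly in proportion to 1/n, which is the "good property" noted for the variance.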
Estimator cont.
For one and the same parameter there may exist several unbiased estimators. In symmetric distributions with only one mode (NO: modalverdi)
- the sample mean
- the sample median
- the sample mode
are all unbiased estimators for the population mean, but their variances may be different.
Usually, the estimator having the least variance is chosen. For normally distributed data that estimator will be the sample mean.
(Sometimes an unbiased estimator is chosen at the expense of an estimator with smaller variance.)

The distribution of the mean
We have shown that if $X_1, X_2, \ldots, X_n$ are independent and normally distributed with mean $\mu$ and variance $\sigma^2$, then $\bar X$ is normally distributed with mean $\mu$ and variance $\sigma^2/n$.
It can be shown that even if $X_1, X_2, \ldots, X_n$ are not normally distributed, $\bar X$ will be approximately normally distributed when $n$ is sufficiently large.
If the distribution of X is reasonably symmetric without too many modes, and not too peculiar, this statement will in practice hold from as early as n > …0.

The variance of the arithmetic mean
Fig. 3: The lower curve shows a normal distribution with mean (expectation) 3 and variance 1.44 (SD = 1.2). The arithmetic mean of a random sample of size 16 will be normally distributed with mean 3 and variance 1.44/16 = 0.09 (SD = 0.3). This is the peaked, narrow distribution.

Mean of non-normally distributed data
Fig. 3: Travelling times from home to campus. 300 random samples of size 1, 4, 9 and 16. (Aalen 1998)
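
The numbers in the caption of the first figure (variance 1.44 for one observation, 0.09 for the mean of 16) can be reproduced with Python's standard library. This is a small sketch only. The helper name `distribution_of_mean` and the comparison interval of 0.5 on either side of the mean are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def distribution_of_mean(mu, sigma, n):
    """Distribution of the mean of n independent N(mu, sigma) observations."""
    return NormalDist(mu, sigma / sqrt(n))


if __name__ == "__main__":
    single = NormalDist(3, 1.2)                # one observation: variance 1.44
    mean16 = distribution_of_mean(3, 1.2, 16)  # mean of 16 observations
    print("variance of the mean:", mean16.variance)
    print("SD of the mean:      ", mean16.stdev)
    # Probability of landing within 0.5 of mu (illustrative interval only):
    for label, dist in (("single value", single), ("mean of 16", mean16)):
        print(label, dist.cdf(3.5) - dist.cdf(2.5))
```

The last loop puts a number on the "peaked, narrow" shape: the mean of 16 observations falls within the same interval around 3 far more often than a single observation does.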
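Interval estimation
A point estimate (of an unknown parameter) is a numeric value obtained by putting observed sample values into the mathematical formula for the estimator.
Questions of interest:
- How precise is the estimate?
- Is it possible to calculate an interval covering the estimated parameter with a specified probability? The answer is NO!
But there is a recipe telling how such a probability interval, a confidence interval, can be constructed. As soon as observed values are used and a numeric interval is calculated, however, that interval can no longer be interpreted within a probability framework.

Construction of confidence intervals
Suppose $X_1, X_2, \ldots, X_n$ are independent and normally distributed with mean $\mu$ and variance $\sigma^2$. This gives
$$Z = \frac{\hat\mu - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$$
and
$$P\left(z_{\alpha/2} < Z \le z_{1-\alpha/2}\right) = P\left(z_{\alpha/2} < \frac{\hat\mu - \mu}{\sigma/\sqrt{n}} \le z_{1-\alpha/2}\right) = 1 - \alpha$$
Because $z_{\alpha/2} = -z_{1-\alpha/2}$,
$$P\left(\hat\mu - z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}} < \mu \le \hat\mu + z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha,$$
which is a probability statement and the recipe to construct a $(1-\alpha)$ confidence interval for the parameter μ.
The interval is
$$\left(\hat\mu - z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}},\ \hat\mu + z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}}\right)$$

Suppose repeated samples of size $n$. Each time we estimate a new $\hat\mu = \bar x$, and a new confidence interval.
Choosing α = 0.05 and exchanging $\hat\mu$ with $\bar X$, we have
$$P\left(\bar X - 1.96\frac{\sigma}{\sqrt{n}} < \mu \le \bar X + 1.96\frac{\sigma}{\sqrt{n}}\right) = 1 - 0.05 = 0.95$$
We arrive at the random interval
$$\left(\bar X - 1.96\frac{\sigma}{\sqrt{n}},\ \bar X + 1.96\frac{\sigma}{\sqrt{n}}\right)$$
with fixed length $2 \cdot 1.96\,\sigma/\sqrt{n}$. Factors affecting the length?
Since μ is and remains unknown, we will never know which intervals in fact do contain the parameter!!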
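
The repeated-sampling interpretation can be illustrated by simulation, where (unlike in practice) the true mean is known. The sketch below is an illustration under stated assumptions: the population values μ = 3750 and σ = 500 are borrowed from the birth-weight review example, while the sample size 25, the number of replications and the function names `z_interval` and `coverage` are hypothetical choices.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def z_interval(xbar, sigma, n, alpha=0.05):
    """(1 - alpha) confidence interval for mu when sigma is known."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{1-alpha/2}, about 1.96 for alpha = 0.05
    half = z * sigma / sqrt(n)
    return xbar - half, xbar + half


def coverage(mu, sigma, n, reps, alpha=0.05, seed=1):
    """Fraction of simulated intervals that contain the true mu."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    hits = 0
    for _ in range(reps):
        xbar = fmean(rng.gauss(mu, sigma) for _ in range(n))
        lo, hi = z_interval(xbar, sigma, n, alpha)
        hits += lo <= mu <= hi
    return hits / reps


if __name__ == "__main__":
    # Assumed sample size n = 25; mu and sigma as in the birth-weight example.
    print("one interval:", z_interval(3700.0, 500, 25))  # 3700.0 is a made-up sample mean
    print("share of intervals covering mu:", coverage(3750, 500, 25, reps=10000))
```

Every simulated interval has the same length, and the share that covers μ should be close to 0.95. For any single interval, though, there is no way to tell from the data alone whether it is one of those that cover μ.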
Confidence interval, normal distribution with unknown variance
The same argument as above, but the population variance is estimated by the sample variance
$$\hat\sigma^2 = s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar X\right)^2$$
and we arrive at the t-distribution with $n-1$ degrees of freedom, giving
$$P\left(\bar X - t_{n-1,\,1-\alpha/2}\frac{s}{\sqrt{n}} < \mu \le \bar X + t_{n-1,\,1-\alpha/2}\frac{s}{\sqrt{n}}\right) = 1 - \alpha$$
and get the random interval
$$\bar X - t_{n-1,\,1-\alpha/2}\frac{s}{\sqrt{n}} < \mu \le \bar X + t_{n-1,\,1-\alpha/2}\frac{s}{\sqrt{n}}$$
with random length $2\,t_{n-1,\,1-\alpha/2}\,s/\sqrt{n}$. (What is random?)

Approximation to normal distribution
Binomial series of trials (discrete distribution)
i) Each trial yields one of two outcomes, technically called success (A) and failure (A*).
ii) For each trial, the probability of success P(A) is the same and is denoted p = P(A). The probability of failure is then P(A*) = 1 - p and is denoted q, so that p + q = 1.
iii) Trials are independent. The probability of success in a trial does not change given any amount of information about the outcomes in other trials.
iv) The number of successes, X, is observed in n trials.
$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$$

Confidence interval for p
$$\hat p = \frac{X}{n}, \qquad E(\hat p) = p, \qquad \mathrm{Var}(\hat p) = \frac{p(1-p)}{n} \quad \text{(consistent estimator)}$$
Standardising $\hat p$ by subtraction of the mean and division by the standard deviation:
$$Z = \frac{\hat p - p}{\mathrm{SD}(\hat p)} = \frac{\hat p - p}{\sqrt{p(1-p)/n}}$$

We define
$$I_i = \begin{cases} 1 & \text{at outcome } A, & P(A) = p \\ 0 & \text{at outcome } A^*, & P(A^*) = 1 - p = q \end{cases}$$
Then
$$X = \sum_{i=1}^{n} I_i = I_1 + I_2 + \cdots + I_n$$
is the number of outcomes A in n trials.
Note that
$$\hat p = \frac{X}{n} = \frac{I_1 + I_2 + \cdots + I_n}{n}$$
is a sum of several independent contributions, none of which dominates X. The central limit theorem now implies that as n increases,
$$Z = \frac{\hat p - p}{\mathrm{SD}(\hat p)} = \frac{\hat p - p}{\sqrt{p(1-p)/n}}$$
will converge to the standard normal distribution. The approximation works especially well if $n\hat p(1-\hat p) > 5$.
When the condition above is met, the $(1-\alpha)$ confidence interval for p is approximately
$$\hat p \pm z_{1-\alpha/2}\sqrt{p(1-p)/n}$$
A numerical result is achieved by replacing p with an estimate $\hat p = x/n$ (note small x, the observed value of X).
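
The recipe for the approximate interval for p, together with the rule of thumb for when the normal approximation is reasonable, can be written as two short functions. This is a sketch only: the function names `normal_approx_ok` and `proportion_interval` are illustrative, and the data in the usage lines (30 events in 200 trials) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def normal_approx_ok(x, n):
    """Rule of thumb from the text: n * p_hat * (1 - p_hat) > 5."""
    p_hat = x / n
    return n * p_hat * (1 - p_hat) > 5


def proportion_interval(x, n, alpha=0.05):
    """Approximate (1 - alpha) confidence interval for p, with p replaced by p_hat."""
    p_hat = x / n
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{1-alpha/2}
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half


if __name__ == "__main__":
    x, n = 30, 200  # hypothetical data, for illustration only
    print("approximation reasonable:", normal_approx_ok(x, n))
    print("95% interval:", proportion_interval(x, n))
```

Keeping the rule-of-thumb check separate from the interval itself is a design choice of this sketch. When the check fails, the interval can still be computed, but it should not be trusted.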
Example 6.44, prevalence of breast cancer among women aged 50 to 54:
Random sample n = 10000, observed number of cancers x = 400
Point estimate of prevalence:
$$\hat p = \frac{400}{10000} = 0.040 \quad \text{(estimated prevalence in the target population)}$$
$n\hat p(1-\hat p) = 10000 \cdot 0.04 \cdot 0.96 = 384 > 5$, so the approximation to the normal distribution applies.
0.95 confidence interval estimate:
$$\left(\hat p - z_{0.975}\sqrt{\hat p(1-\hat p)/n},\ \hat p + z_{0.975}\sqrt{\hat p(1-\hat p)/n}\right)$$
$$= \left(0.040 - 1.96\sqrt{0.04 \cdot 0.96/10000},\ 0.040 + 1.96\sqrt{0.04 \cdot 0.96/10000}\right)$$
$$= (0.040 - 0.004,\ 0.040 + 0.004) = (0.036,\ 0.044)$$
Suppose we know that the prevalence in the population is 0.0…. How to interpret the findings above?

Example, exercise 4.40
Sample size n = 100
Primary event, A: bacteriuria, P(A) = p = 0.05
X: number of women having bacteriuria
Question: What is $P(X \ge 3)$?
Model: $X \sim \mathrm{bin}(n, p)$
$$P(X \ge 3) = 1 - P(X \le 2) = 1 - \left(P(X = 0) + P(X = 1) + P(X = 2)\right)$$
$$P(X = 0) = \binom{100}{0}\, 0.05^{0}\, 0.95^{100} = 0.006$$
$$P(X = 1) = \binom{100}{1}\, 0.05^{1}\, 0.95^{99} = 0.031$$
$$P(X = 2) = \binom{100}{2}\, 0.05^{2}\, 0.95^{98} = 0.081$$
$$P(X \le 2) = 0.006 + 0.031 + 0.081 = 0.118$$
$$P(X \ge 3) = 1 - 0.118 = 0.882$$

Approximation to normal distribution
$np(1-p) = 100 \cdot 0.05\,(1 - 0.05) = 4.75$ (borderline for the normal approximation)
$$E[X] = np = 100 \cdot 0.05 = 5$$
$$\mathrm{Var}[X] = np(1-p) = 4.75$$
$$P(X \ge 3) \approx P\left(Z \ge \frac{3 - 5 - 0.5}{\sqrt{4.75}} = -1.15\right) = P(Z \le 1.15) = 0.875,$$
reasonably fair compared to 0.882.

Final comment, small sample cases:
If n is not sufficiently large for the central limit theorem to apply, or the approximation to the normal distribution does not work, exact methods have to be used.
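
The bacteriuria calculation can be checked numerically, which also gives a taste of what an "exact method" looks like for a binomial tail probability. The following is a minimal sketch using only the Python standard library. The function names are illustrative assumptions, and the continuity correction of 0.5 follows the hand calculation above.

```python
from math import comb, sqrt
from statistics import NormalDist


def binom_pmf(k, n, p):
    """P(X = k) for X ~ bin(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def exact_upper_tail(k, n, p):
    """Exact P(X >= k), computed as 1 - P(X <= k - 1)."""
    return 1.0 - sum(binom_pmf(j, n, p) for j in range(k))


def approx_upper_tail(k, n, p):
    """Normal approximation to P(X >= k) with a continuity correction of 0.5."""
    z = (k - 0.5 - n * p) / sqrt(n * p * (1 - p))
    return 1.0 - NormalDist().cdf(z)


if __name__ == "__main__":
    n, p, k = 100, 0.05, 3  # values from the bacteriuria example
    print("exact P(X >= 3):        ", exact_upper_tail(k, n, p))
    print("normal approx P(X >= 3):", approx_upper_tail(k, n, p))
```

The approximate value may differ slightly in the third decimal from the hand calculation, because the code does not round z to two decimals before looking up the normal probability.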