Estimation for Complete Data


Meredith Simon
5 years ago

Complete data: there is no loss of information during the study. Complete data comes in two forms: complete individual data and grouped data.

Complete individual data is data in which the complete information of each individual is recorded. Grouped data is used when the population under study is large, so we put individual observations into groups.

1. Empirical distribution for complete individual data (section 11.2)

The density function of the population from which the data is collected is denoted by $f(x)$. Its distribution function is denoted by $F(x)$, its survival function by $S(x)$, and its hazard rate function by $h(x)$. Finally, its cumulative hazard function is denoted by $H(x)$ and is defined by

$$H(x) = -\ln S(x).$$

Note that for continuous random variables we have

$$H'(x) = -\frac{S'(x)}{S(x)} = \frac{f(x)}{S(x)} = h(x).$$

To estimate these functions we collect a sample from the population and then use the collected data to build functions that estimate them.

Suppose that we have collected $n$ data points, some of which might be repeated. We assume that the collected data represents the whole population, so we treat it as the population itself. Since we have no knowledge about the distribution of the data (that is why estimation comes in), the data points have no privilege over each other, and therefore we assign probability $1/n$ to each data point. The associated probability function, denoted by $f_n(x)$, is therefore

$$f_n(x) = \frac{\text{number of observations equal to } x}{n} \quad \text{for each observed value } x.$$

It is called the empirical density function. The distribution function associated with this density is called the empirical distribution function and is denoted by $F_n(x)$:

$$F_n(x) = \frac{\text{number of observations less than or equal to } x}{n}.$$

The empirical survival function is

$$S_n(x) = 1 - F_n(x).$$

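The empirical hazard rate function is

$$h_n(x) = \frac{f_n(x)}{S_n(x)},$$

and finally, the empirical cumulative hazard function is

$$H_n(x) = -\ln S_n(x).$$

Example (data from Finan's note). The following loss values have been obtained:

49, 50, 50, 50, 60, 75, 80, 120, 130

Calculate $f_9(x)$, $F_9(x)$, $S_9(x)$, $h_9(x)$, and $H_9(x)$ for all $x$.

Solution.

$$f_9(x) = \begin{cases} 1/9 & x \in \{49, 60, 75, 80, 120, 130\} \\ 3/9 & x = 50 \\ 0 & \text{otherwise} \end{cases}$$

$$F_9(x) = \begin{cases} 0 & x < 49 \\ 1/9 & 49 \le x < 50 \\ 4/9 & 50 \le x < 60 \\ 5/9 & 60 \le x < 75 \\ 6/9 & 75 \le x < 80 \\ 7/9 & 80 \le x < 120 \\ 8/9 & 120 \le x < 130 \\ 1 & 130 \le x \end{cases}
\qquad
S_9(x) = \begin{cases} 1 & x < 49 \\ 8/9 & 49 \le x < 50 \\ 5/9 & 50 \le x < 60 \\ 4/9 & 60 \le x < 75 \\ 3/9 & 75 \le x < 80 \\ 2/9 & 80 \le x < 120 \\ 1/9 & 120 \le x < 130 \\ 0 & 130 \le x \end{cases}$$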

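$$H_9(x) = -\ln S_9(x) = \begin{cases} 0 & x < 49 \\ 0.1178 & 49 \le x < 50 \\ 0.5878 & 50 \le x < 60 \\ 0.8109 & 60 \le x < 75 \\ 1.0986 & 75 \le x < 80 \\ 1.5041 & 80 \le x < 120 \\ 2.1972 & 120 \le x < 130 \\ \text{undefined} & 130 \le x \end{cases}$$

$$h_9(x) = \frac{f_9(x)}{S_9(x)} = \begin{cases} 1/8 & x = 49 \\ 3/5 & x = 50 \\ 1/4 & x = 60 \\ 1/3 & x = 75 \\ 1/2 & x = 80 \\ 1 & x = 120 \\ \text{undefined} & x \ge 130 \\ 0 & \text{otherwise} \end{cases}$$

Another way of estimating the cumulative hazard function is by means of the Nelson-Åalen estimate. Before introducing this estimate, we need some notation.

For the observed values $\{x_1, \dots, x_n\}$ let $y_1 < y_2 < \cdots < y_k$ be the unique values of the $x_i$'s. Let

$$s_j = \sum_{i=1}^{n} I(x_i = y_j)$$

be the number of times the observation $y_j$ appears in the sample; here $I$ denotes the indicator function that returns 1 if its argument is true and 0 if its argument is false. Note that $s_1 + \cdots + s_k = n$.

Example. For the data set of the previous example we have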

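$$\begin{array}{c|ccccccc|c} y_j & 49 & 50 & 60 & 75 & 80 & 120 & 130 & \text{total} \\ \hline s_j & 1 & 3 & 1 & 1 & 1 & 1 & 1 & 9 \end{array}$$

For each $j$ we associate a so-called risk set $r_j$ as follows:

$$r_j = \text{the number of observations greater than or equal to } y_j = s_j + s_{j+1} + \cdots + s_k = \sum_{i=j}^{k} s_i,$$

that is, the sum of those $s$ values whose indices are larger than or equal to $j$. Note that the $r_j$'s are decreasing and $r_1 = n$. So we have

$$r_j = s_j + r_{j+1}.$$

Note. In some manuscripts you find a better notation $n(i^-)$ instead of $r_i$, because $r_i$ is the number of individuals at time $i$ before any event at that time.

In continuation, note that with this notation, for every $x$ satisfying $y_{j-1} \le x < y_j$ we have

$$F_n(x) = F_n(y_{j-1}) = \frac{\text{number of observations less than or equal to } y_{j-1}}{n} = \frac{\sum_{i=1}^{j-1} s_i}{n} = \frac{n - \sum_{i=j}^{k} s_i}{n} = 1 - \frac{r_j}{n}.$$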

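In summary,

$$F_n(x) = \begin{cases} 0 & x < y_1 \\ 1 - \dfrac{r_j}{n} & y_{j-1} \le x < y_j, \quad j = 2, \dots, k \\ 1 & y_k \le x \end{cases}
\qquad
S_n(x) = \begin{cases} 1 & x < y_1 \\ \dfrac{r_j}{n} & y_{j-1} \le x < y_j, \quad j = 2, \dots, k \\ 0 & y_k \le x \end{cases}$$

$$H_n(x) = -\ln(S_n(x)) = \begin{cases} 0 & x < y_1 \\ -\ln\left(\dfrac{r_j}{n}\right) & y_{j-1} \le x < y_j, \quad j = 2, \dots, k \\ \text{undefined} & y_k \le x \end{cases}$$

Note that $r_i - s_i = r_{i+1}$ and $r_1 = n$. Therefore:

$$\frac{r_j}{n} = \frac{r_j}{r_1} = \frac{r_j}{r_{j-1}} \cdot \frac{r_{j-1}}{r_{j-2}} \cdots \frac{r_2}{r_1} = \prod_{i=1}^{j-1} \frac{r_{i+1}}{r_i} = \prod_{i=1}^{j-1} \frac{r_i - s_i}{r_i} = \prod_{i=1}^{j-1} \left(1 - \frac{s_i}{r_i}\right).$$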
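The product identity above can be checked numerically. The following is a minimal sketch, assuming the sample 49, 50, 50, 50, 60, 75, 80, 120, 130 as the reading of the example's loss values; the function names `risk_table`, `empirical_survival`, and `survival_product` are hypothetical helpers introduced only for this illustration. Exact fractions are used so the three quantities can be compared without rounding.

```python
from collections import Counter
from fractions import Fraction


def risk_table(sample):
    """Return (y, s, r): distinct values, their multiplicities, and risk sets."""
    counts = Counter(sample)
    y = sorted(counts)
    s = [counts[value] for value in y]
    r = []
    remaining = len(sample)  # r_1 = n
    for s_j in s:
        r.append(remaining)
        remaining -= s_j     # r_{j+1} = r_j - s_j
    return y, s, r


def empirical_survival(sample, x):
    """S_n(x): fraction of observations strictly greater than x."""
    return Fraction(sum(1 for value in sample if value > x), len(sample))


def survival_product(s, r, j):
    """Product of (1 - s_i / r_i) for i = 1, ..., j - 1 (j is 1-based)."""
    result = Fraction(1)
    for s_i, r_i in zip(s[:j - 1], r[:j - 1]):
        result *= 1 - Fraction(s_i, r_i)
    return result


# Assumed sample (a reading of the example's loss values, not confirmed data).
losses = [49, 50, 50, 50, 60, 75, 80, 120, 130]
y, s, r = risk_table(losses)
n = len(losses)

for j in range(2, len(y) + 1):
    x = y[j - 2]  # any x with y_{j-1} <= x < y_j gives the same S_n(x)
    print(j, empirical_survival(losses, x), Fraction(r[j - 1], n),
          survival_product(s, r, j))
```

Each printed row shows the same value three times: $S_n(x)$ computed by counting, $r_j/n$, and the product form.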

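$$H_n(x) = \begin{cases} 0 & x < y_1 \\ -\ln \displaystyle\prod_{i=1}^{j-1} \left(1 - \frac{s_i}{r_i}\right) & y_{j-1} \le x < y_j, \quad j = 2, \dots, k \\ \text{undefined} & y_k \le x \end{cases}$$

We recall from calculus that

$$e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots$$

and therefore for $|x| < 1$ we can approximate

$$e^x \approx 1 + x,$$

and by changing $x$ to $-x$ we get

$$e^{-x} \approx 1 - x, \qquad |x| < 1.$$

Because of this, we approximate $1 - \frac{s_i}{r_i}$ by $\exp\left(-\frac{s_i}{r_i}\right)$, and therefore we have the approximation

$$\prod_{i=1}^{j-1} \left(1 - \frac{s_i}{r_i}\right) \approx \prod_{i=1}^{j-1} \exp\left(-\frac{s_i}{r_i}\right) = \exp\left(-\sum_{i=1}^{j-1} \frac{s_i}{r_i}\right).$$

Then

$$-\ln \prod_{i=1}^{j-1} \left(1 - \frac{s_i}{r_i}\right) \approx \sum_{i=1}^{j-1} \frac{s_i}{r_i}.$$

So the $H_n(x)$ that we have above can be approximated with the Nelson-Åalen estimate: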

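$$\hat{H}(x) = \begin{cases} 0 & x < y_1 \\ \displaystyle\sum_{i=1}^{j-1} \frac{s_i}{r_i} & y_{j-1} \le x < y_j, \quad j = 2, \dots, k \\ \displaystyle\sum_{i=1}^{k} \frac{s_i}{r_i} & y_k \le x \end{cases}$$

Once this is found, then by setting

$$\hat{S}(x) = \exp\left(-\hat{H}(x)\right)$$

we have an estimate for the survival function.

Example (from Finan's note). The following loss values have been obtained:

49, 50, 50, 50, 60, 75, 80, 120, 130

Find the Nelson-Åalen estimate for the cumulative hazard function and then estimate the survival function.

Solution. Here $s = (1, 3, 1, 1, 1, 1, 1)$ and $r = (9, 8, 5, 4, 3, 2, 1)$, so

$$\hat{H}(x) = \begin{cases} 0 & x < 49 \\ \frac{1}{9} = 0.1111 & 49 \le x < 50 \\ 0.1111 + \frac{3}{8} = 0.4861 & 50 \le x < 60 \\ 0.4861 + \frac{1}{5} = 0.6861 & 60 \le x < 75 \\ 0.6861 + \frac{1}{4} = 0.9361 & 75 \le x < 80 \\ 0.9361 + \frac{1}{3} = 1.2694 & 80 \le x < 120 \\ 1.2694 + \frac{1}{2} = 1.7694 & 120 \le x < 130 \\ 1.7694 + \frac{1}{1} = 2.7694 & 130 \le x \end{cases}$$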

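$$\hat{S}(x) = \exp\left(-\hat{H}(x)\right) = \begin{cases} 1 & x < 49 \\ 0.8948 & 49 \le x < 50 \\ 0.6150 & 50 \le x < 60 \\ 0.5035 & 60 \le x < 75 \\ 0.3921 & 75 \le x < 80 \\ 0.2810 & 80 \le x < 120 \\ 0.1704 & 120 \le x < 130 \\ 0.0627 & 130 \le x \end{cases}$$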
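The Nelson-Åalen computation can be sketched in a few lines. This is only an illustration under stated assumptions: the sample 49, 50, 50, 50, 60, 75, 80, 120, 130 is an assumed reading of the example's loss values, and `nelson_aalen`, `cumulative_hazard`, and `survival_estimate` are hypothetical names, not functions from any library.

```python
import math
from collections import Counter


def nelson_aalen(sample):
    """Return [(y_j, H_hat(y_j))]: the running sum of s_i / r_i at each jump."""
    counts = Counter(sample)
    at_risk = len(sample)            # r_1 = n
    steps = []
    hazard = 0.0
    for y in sorted(counts):
        hazard += counts[y] / at_risk    # add s_j / r_j
        steps.append((y, hazard))
        at_risk -= counts[y]             # r_{j+1} = r_j - s_j
    return steps


def cumulative_hazard(steps, x):
    """H_hat(x): value of the step function at x (0 before the first jump)."""
    value = 0.0
    for y, hazard in steps:
        if y <= x:
            value = hazard
        else:
            break
    return value


def survival_estimate(steps, x):
    """S_hat(x) = exp(-H_hat(x))."""
    return math.exp(-cumulative_hazard(steps, x))


# Assumed sample (a reading of the example's loss values, not confirmed data).
losses = [49, 50, 50, 50, 60, 75, 80, 120, 130]
steps = nelson_aalen(losses)
for y, hazard in steps:
    print(y, round(hazard, 4), round(math.exp(-hazard), 4))
```

Unlike $H_n(x)$, the estimate $\hat{H}(x)$ stays finite past the largest observation, so $\hat{S}(x)$ remains strictly positive there.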

2. Empirical distribution for grouped data (section 11.3)

For grouped data we divide the values into $k$ intervals:

$$(c_0, c_1], \ (c_1, c_2], \ \dots, \ (c_{k-1}, c_k].$$

Suppose that there are $n$ observations in total, and the number of those that fall into the interval $(c_{j-1}, c_j]$ is $n_j$; so $\sum_{j=1}^{k} n_j = n$. We then define $F_n$ at the endpoints of those intervals by

$$F_n(c_0) = 0, \qquad F_n(c_j) = \frac{\sum_{i=1}^{j} n_i}{n}, \quad j = 1, \dots, k,$$

and then we use linear interpolation to define $F_n(x)$ at the other points of the intervals:

$$F_n(x) = \frac{c_j - x}{c_j - c_{j-1}} F_n(c_{j-1}) + \frac{x - c_{j-1}}{c_j - c_{j-1}} F_n(c_j), \qquad c_{j-1} \le x \le c_j.$$

Its graph is called an ogive. By differentiating this function we get the estimate for the density function:

$$f_n(x) = \frac{F_n(c_j) - F_n(c_{j-1})}{c_j - c_{j-1}} = \frac{n_j}{n (c_j - c_{j-1})}, \qquad c_{j-1} \le x < c_j.$$

Its graph is called a histogram.

Note. This density is a spliced density function.

Note. By taking complements of both sides of the interpolation relationship, we get

$$S_n(x) = \frac{c_j - x}{c_j - c_{j-1}} S_n(c_{j-1}) + \frac{x - c_{j-1}}{c_j - c_{j-1}} S_n(c_j), \qquad c_{j-1} \le x \le c_j.$$

Example (from Finan's note). Calculate the empirical distribution function and the empirical density function for the following grouped data.

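Interval        Number of observations
(0, 2]          25
(2, 10]         10
(10, 100]       10
(100, 1000]     5

Solution. We first find $F_{50}$ at the endpoints:

$$F_{50}(0) = 0, \quad F_{50}(2) = \frac{25}{50} = 0.5, \quad F_{50}(10) = \frac{35}{50} = 0.7, \quad F_{50}(100) = \frac{45}{50} = 0.9, \quad F_{50}(1000) = \frac{50}{50} = 1.$$

We then calculate $F_{50}$ at other points using interpolation; for example, for the interval $(10, 100]$ we have

$$F_{50}(x) = \frac{100 - x}{100 - 10}(0.7) + \frac{x - 10}{100 - 10}(0.9) = \frac{61}{90} + \frac{x}{450}.$$

Similarly, for the interval $(2, 10]$,

$$F_{50}(x) = \frac{10 - x}{10 - 2}(0.5) + \frac{x - 2}{10 - 2}(0.7) = \frac{9}{20} + \frac{x}{40}.$$

Altogether:

$$F_{50}(x) = \begin{cases} \dfrac{x}{4} & 0 \le x \le 2 \\ \dfrac{9}{20} + \dfrac{x}{40} & 2 \le x \le 10 \\ \dfrac{61}{90} + \dfrac{x}{450} & 10 \le x \le 100 \\ \dfrac{8}{9} + \dfrac{x}{9000} & 100 \le x \le 1000 \\ \text{undefined} & 1000 < x \end{cases}$$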

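$$f_{50}(x) = \begin{cases} \dfrac{1}{4} & 0 \le x < 2 \\ \dfrac{1}{40} & 2 \le x < 10 \\ \dfrac{1}{450} & 10 \le x < 100 \\ \dfrac{1}{9000} & 100 \le x < 1000 \\ \text{undefined} & 1000 \le x \end{cases}$$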
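The ogive and histogram for grouped data can be sketched as follows. This is a minimal illustration, assuming the grouped data are the intervals (0, 2], (2, 10], (10, 100], (100, 1000] with counts 25, 10, 10, 5; the names `ogive` and `histogram_density` are hypothetical, and exact fractions are used so results can be compared with the hand computation.

```python
from fractions import Fraction


def ogive(boundaries, counts, x):
    """F_n(x) by linear interpolation; x must be an int or Fraction in [c_0, c_k]."""
    if not boundaries[0] <= x <= boundaries[-1]:
        raise ValueError("F_n is left undefined outside [c_0, c_k]")
    n = sum(counts)
    cumulative = 0
    for j in range(1, len(boundaries)):
        lo, hi = boundaries[j - 1], boundaries[j]
        if x <= hi:
            f_lo = Fraction(cumulative, n)                   # F_n(c_{j-1})
            f_hi = Fraction(cumulative + counts[j - 1], n)   # F_n(c_j)
            return f_lo + Fraction(x - lo, hi - lo) * (f_hi - f_lo)
        cumulative += counts[j - 1]


def histogram_density(boundaries, counts, x):
    """f_n(x) = n_j / (n (c_j - c_{j-1})) for c_{j-1} <= x < c_j."""
    n = sum(counts)
    for j in range(1, len(boundaries)):
        lo, hi = boundaries[j - 1], boundaries[j]
        if lo <= x < hi:
            return Fraction(counts[j - 1], n * (hi - lo))
    raise ValueError("f_n is left undefined outside [c_0, c_k)")


# Assumed grouped data (interval boundaries c_0..c_k and counts n_1..n_k).
boundaries = [0, 2, 10, 100, 1000]
counts = [25, 10, 10, 5]

print(ogive(boundaries, counts, 55))              # halfway between 0.7 and 0.9
print(histogram_density(boundaries, counts, 55))  # height of the third bar
```

Writing the interpolation as $F_n(c_{j-1})$ plus a fraction of the jump is algebraically the same as the weighted-average form used in the text.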

3. Mean and Variance of Empirical Estimators for Complete Individual Data (from section 12.2)

Let $S_n(x)$ be the empirical survival function. This random variable is used to estimate the true value of $S(x)$. For hypothesis testing and some other reasons we need the mean and the variance of the estimator. To find these quantities, we start from

$$S_n(x) = \frac{\text{number of observations larger than } x}{n} = \frac{Y}{n},$$

where $Y$ is the number of observations that are larger than $x$. If $\{X_1, \dots, X_n\}$ denote the observed values, this set is an i.i.d. sample from a population in which the probability of being larger than $x$ is $S(x)$, and $Y$ is the number of such cases in our sample. So $Y$ is distributed as Binomial$(n, S(x))$, and

$$E(Y) = n S(x), \qquad \mathrm{Var}(Y) = n S(x)(1 - S(x)).$$

Then from $S_n(x) = \frac{Y}{n}$ we get:

$$E(S_n(x)) = \frac{E(Y)}{n} = S(x), \qquad \mathrm{Var}(S_n(x)) = \frac{\mathrm{Var}(Y)}{n^2} = \frac{S(x)(1 - S(x))}{n}.$$

So, first of all, $S_n(x)$ is an unbiased estimator of the value $S(x)$ (which is unknown to us, and that is why we do sampling), and secondly, since its variance tends to 0 as $n \to \infty$, it is a consistent estimator. So it is an unbiased consistent estimator.

Note. Since we do not have the value $S(x)$, we cannot actually calculate the variance $\mathrm{Var}(S_n(x))$; therefore in this formula we use $S_n(x)$ in place of $S(x)$, as $S_n(x)$ is an unbiased estimator of $S(x)$. So we use

$$\widehat{\mathrm{Var}}[S_n(x)] = \frac{S_n(x)(1 - S_n(x))}{n}.$$

Question. In the above discussion we found an unbiased estimator for the probability $S(x) = P(X > x)$. How about the probabilities

$$P(a \le X \le b), \ P(a \le X < b), \ P(a < X \le b), \ P(a < X < b), \ P(X < x), \ P(X \le x), \ \dots$$

Can we have reasonable unbiased consistent estimators for them?

Answer. Yes. In fact the above argument works for these cases too. For example, if you want to approximate $P(a \le X \le b)$, then the estimator

$$W = \frac{\text{number of observations } X_i \text{ that satisfy } a \le X_i \le b}{n}$$

is an unbiased estimator of $P(a \le X \le b)$.

Note. For $a < b$, to approximate the conditional probability $P(X > b \mid X > a) = \frac{S(b)}{S(a)}$ we use $\frac{S_n(b)}{S_n(a)}$. The approximate value for the variance would be

$$\frac{\dfrac{S_n(b)}{S_n(a)} \left(1 - \dfrac{S_n(b)}{S_n(a)}\right)}{\#(X_i > a)}.$$

Example (data from Finan's note). The following loss values have been obtained:

49, 50, 50, 50, 60, 75, 80, 120, 130

(i) What is the estimated variance of the estimator for $P(X > 60)$?
(ii) What is the estimated variance of the estimator for $P(75 < X \le 120)$?
(iii) What is the estimated variance of the estimator for $P(X > 60 \mid X > 50)$?

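Solution.

(i) We estimate the variance by

$$\frac{S_9(60)(1 - S_9(60))}{9} = \frac{\left(\frac{4}{9}\right)\left(\frac{5}{9}\right)}{9} = \frac{20}{729} \approx 0.0274.$$

(ii) The approximate value for $P(75 < X \le 120)$ is

$$\frac{\text{number of observations } X_i \text{ that satisfy } 75 < X_i \le 120}{9} = \frac{2}{9}.$$

So the approximate value of the variance is

$$\frac{\left(\frac{2}{9}\right)\left(\frac{7}{9}\right)}{9} = \frac{14}{729} \approx 0.0192.$$

(iii) The approximate value for the conditional probability $P(X > 60 \mid X > 50) = \frac{S(60)}{S(50)}$ is $\frac{S_9(60)}{S_9(50)} = \frac{4}{5}$. The estimated value of the variance is

$$\frac{\left(\frac{4}{5}\right)\left(\frac{1}{5}\right)}{5} = \frac{4}{125} = 0.032.$$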
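All three parts use the same binomial-proportion formula, only with different counts and denominators. The sketch below makes that explicit; it assumes the sample 49, 50, 50, 50, 60, 75, 80, 120, 130 as the reading of the example's loss values, and `estimated_variance` is a hypothetical helper name.

```python
from fractions import Fraction


def estimated_variance(successes, trials):
    """Estimated variance p(1 - p) / trials of the proportion p = successes / trials."""
    p_hat = Fraction(successes, trials)
    return p_hat * (1 - p_hat) / trials


# Assumed sample (a reading of the example's loss values, not confirmed data).
losses = [49, 50, 50, 50, 60, 75, 80, 120, 130]
n = len(losses)

# (i) P(X > 60): successes are observations above 60, out of all n.
above_60 = sum(1 for value in losses if value > 60)
var_i = estimated_variance(above_60, n)

# (ii) P(75 < X <= 120): successes are observations in (75, 120], out of all n.
in_band = sum(1 for value in losses if 75 < value <= 120)
var_ii = estimated_variance(in_band, n)

# (iii) P(X > 60 | X > 50): the trials are only the observations above 50.
above_50 = sum(1 for value in losses if value > 50)
var_iii = estimated_variance(above_60, above_50)

print(var_i, var_ii, var_iii)  # 20/729 14/729 4/125
```

For the conditional probability, the number of trials is the count of observations exceeding 50, which is why the denominator is 5 rather than 9.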

4. Mean and Variance of Empirical Estimators for Grouped Data (from section 12.2)

With the notation used for grouped data, recall the following two identities:

$$S_n(c_j) = 1 - F_n(c_j) = 1 - \frac{n_1 + \cdots + n_j}{n},$$

$$S_n(x) = \frac{c_j - x}{c_j - c_{j-1}} S_n(c_{j-1}) + \frac{x - c_{j-1}}{c_j - c_{j-1}} S_n(c_j), \qquad c_{j-1} \le x \le c_j.$$

For a moment let $Y$ be the number of observations up to $c_{j-1}$:

$$Y = n_1 + \cdots + n_{j-1}.$$

Then $Y$ is distributed as Binomial$(n, 1 - S(c_{j-1}))$.

Let $Z$ be the number of observations in $(c_{j-1}, c_j]$. Then $Z$ is distributed as Binomial$(n, S(c_{j-1}) - S(c_j))$. Then:

$$S_n(c_{j-1}) = 1 - \frac{Y}{n}, \qquad S_n(c_j) = 1 - \frac{Y + Z}{n}.$$

Then:

$$E[S_n(c_{j-1})] = 1 - \frac{E(Y)}{n} = 1 - (1 - S(c_{j-1})) = S(c_{j-1}).$$

Similarly,

$$E[S_n(c_j)] = S(c_j).$$

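Then by taking the linear interpolation we get, for $x \in (c_{j-1}, c_j]$:

$$E[S_n(x)] = \frac{c_j - x}{c_j - c_{j-1}} E[S_n(c_{j-1})] + \frac{x - c_{j-1}}{c_j - c_{j-1}} E[S_n(c_j)] = \frac{c_j - x}{c_j - c_{j-1}} S(c_{j-1}) + \frac{x - c_{j-1}}{c_j - c_{j-1}} S(c_j).$$

On the other hand, for $x \in (c_{j-1}, c_j]$:

$$S_n(x) = \frac{c_j - x}{c_j - c_{j-1}} \left(1 - \frac{Y}{n}\right) + \frac{x - c_{j-1}}{c_j - c_{j-1}} \left(1 - \frac{Y + Z}{n}\right)$$

$$= 1 - \left\{ \left(\frac{c_j - x}{c_j - c_{j-1}}\right) \frac{Y}{n} + \left(\frac{x - c_{j-1}}{c_j - c_{j-1}}\right) \frac{Y + Z}{n} \right\} = 1 - \frac{(c_j - c_{j-1}) Y + (x - c_{j-1}) Z}{n (c_j - c_{j-1})},$$

and therefore

$$\mathrm{Var}[S_n(x)] = \frac{(c_j - c_{j-1})^2 \mathrm{Var}(Y) + (x - c_{j-1})^2 \mathrm{Var}(Z) + 2 (c_j - c_{j-1})(x - c_{j-1}) \mathrm{Cov}(Y, Z)}{n^2 (c_j - c_{j-1})^2},$$

where:

$$\mathrm{Var}(Y) = n S(c_{j-1}) [1 - S(c_{j-1})]$$

$$\mathrm{Var}(Z) = n [S(c_{j-1}) - S(c_j)] [1 - S(c_{j-1}) + S(c_j)]$$

$$\mathrm{Cov}(Y, Z) = -n [1 - S(c_{j-1})] [S(c_{j-1}) - S(c_j)] \quad \text{(why?)}$$

In the following example we will see how to calculate this variance or its estimate without having to memorize this formula.

Since the values on the right-hand side are unknown, we substitute them with their estimates: we substitute $S(c_{j-1})$ with $S_n(c_{j-1})$, and substitute $S(c_j)$ with $S_n(c_j)$. This results in the estimates: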

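$$\widehat{\mathrm{Var}}(Y) = \frac{Y (n - Y)}{n}, \qquad \widehat{\mathrm{Var}}(Z) = \frac{Z (n - Z)}{n}, \qquad \widehat{\mathrm{Cov}}(Y, Z) = -\frac{Y Z}{n}.$$

Example. What are $E[f_n(x)]$ and $\mathrm{Var}[f_n(x)]$?

Solution. We have

$$f_n(x) = \frac{n_j}{n (c_j - c_{j-1})} = \frac{Z}{n (c_j - c_{j-1})}, \qquad c_{j-1} \le x < c_j,$$

$$E[f_n(x)] = \frac{E(Z)}{n (c_j - c_{j-1})} = \frac{n (S(c_{j-1}) - S(c_j))}{n (c_j - c_{j-1})} = \frac{S(c_{j-1}) - S(c_j)}{c_j - c_{j-1}},$$

$$\mathrm{Var}[f_n(x)] = \frac{\mathrm{Var}(Z)}{n^2 (c_j - c_{j-1})^2}.$$

But we estimate this variance with

$$\widehat{\mathrm{Var}}[f_n(x)] = \frac{\widehat{\mathrm{Var}}(Z)}{n^2 (c_j - c_{j-1})^2}, \qquad \text{where } \widehat{\mathrm{Var}}(Z) = \frac{Z (n - Z)}{n}.$$

Example (from Finan's note). For the following grouped data, estimate the probability that a loss will be no more than 90. Estimate the variance of the associated estimator.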

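Interval        Number of observations
(0, 2]          25
(2, 10]         10
(10, 100]       10
(100, 1000]     5

Solution. Since the point 90 is in the interval $(10, 100]$ we have:

$$S_n(x) = \frac{c_j - x}{c_j - c_{j-1}} S_n(c_{j-1}) + \frac{x - c_{j-1}}{c_j - c_{j-1}} S_n(c_j), \qquad c_{j-1} \le x \le c_j,$$

$$S_{50}(10) = \frac{15}{50} = \frac{3}{10}, \qquad S_{50}(100) = \frac{5}{50} = \frac{1}{10},$$

and the weights:

$$\text{left-end weight} = \frac{100 - 90}{100 - 10} = \frac{10}{90} = \frac{1}{9}, \qquad \text{right-end weight} = \frac{90 - 10}{100 - 10} = \frac{80}{90} = \frac{8}{9}.$$

$$S_{50}(90) = \left(\frac{1}{9}\right)\left(\frac{3}{10}\right) + \left(\frac{8}{9}\right)\left(\frac{1}{10}\right) = \frac{11}{90}.$$

So the answer to the first part is:

$$1 - S_{50}(90) = \frac{79}{90} \approx 0.8778.$$

For the second part:

$$Y = \text{number of observations in the interval } (0, 10] = 35,$$

$$\widehat{\mathrm{Var}}(Y) = \frac{Y (n - Y)}{n} = \frac{(35)(50 - 35)}{50} = 10.5,$$

$$Z = \text{number of observations in the interval } (10, 100] = 10.$$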

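$$\widehat{\mathrm{Var}}(Z) = \frac{Z (n - Z)}{n} = \frac{(10)(40)}{50} = 8, \qquad \widehat{\mathrm{Cov}}(Y, Z) = -\frac{Y Z}{n} = -\frac{(35)(10)}{50} = -7.$$

From

$$S_n(x) = 1 - \frac{(c_j - c_{j-1}) Y + (x - c_{j-1}) Z}{n (c_j - c_{j-1})}$$

we get

$$S_{50}(90) = 1 - \frac{(90) Y + (80) Z}{(50)(90)} = 1 - \frac{1}{50} Y - \frac{4}{225} Z,$$

$$\widehat{\mathrm{Var}}(S_{50}(90)) = \left(\frac{1}{50}\right)^2 \widehat{\mathrm{Var}}(Y) + \left(\frac{4}{225}\right)^2 \widehat{\mathrm{Var}}(Z) + 2 \left(\frac{1}{50}\right)\left(\frac{4}{225}\right) \widehat{\mathrm{Cov}}(Y, Z)$$

$$= \left(\frac{1}{50}\right)^2 (10.5) + \left(\frac{4}{225}\right)^2 (8) + 2 \left(\frac{1}{50}\right)\left(\frac{4}{225}\right)(-7) = \frac{709}{405000} \approx 0.00175.$$

The estimator of $P(X \le 90)$ is $1 - S_{50}(90)$, which has the same variance.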
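The same calculation can be carried out mechanically for any point inside a group. The sketch below is an illustration under stated assumptions: the grouped data are taken to be the intervals (0, 2], (2, 10], (10, 100], (100, 1000] with counts 25, 10, 10, 5, the evaluation point is 90, and `grouped_survival_and_variance` is a hypothetical name. It writes $S_n(x) = 1 - aY - bZ$ and plugs in the estimated variances and covariance.

```python
from fractions import Fraction


def grouped_survival_and_variance(boundaries, counts, x):
    """Return (S_n(x), estimated Var[S_n(x)]) for integer x with c_{j-1} < x <= c_j."""
    n = sum(counts)
    for j in range(1, len(boundaries)):
        lo, hi = boundaries[j - 1], boundaries[j]
        if lo < x <= hi:
            y = sum(counts[:j - 1])   # Y: observations up to c_{j-1}
            z = counts[j - 1]         # Z: observations in (c_{j-1}, c_j]
            a = Fraction(1, n)                    # coefficient of Y
            b = Fraction(x - lo, n * (hi - lo))   # coefficient of Z
            s_hat = 1 - a * y - b * z
            var_y = Fraction(y * (n - y), n)      # Y(n - Y)/n
            var_z = Fraction(z * (n - z), n)      # Z(n - Z)/n
            cov_yz = Fraction(-y * z, n)          # -YZ/n
            variance = a * a * var_y + b * b * var_z + 2 * a * b * cov_yz
            return s_hat, variance
    raise ValueError("x must lie in (c_0, c_k]")


# Assumed grouped data (interval boundaries c_0..c_k and counts n_1..n_k).
boundaries = [0, 2, 10, 100, 1000]
counts = [25, 10, 10, 5]

s_hat, variance = grouped_survival_and_variance(boundaries, counts, 90)
print(1 - s_hat, variance, float(variance))
```

Keeping the coefficients of $Y$ and $Z$ explicit mirrors the advice in the text: derive the linear form first, then apply the variance of a linear combination, rather than memorizing the closed formula.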
