Estimation for Complete Data

Complete data: there is no loss of information during the study. Complete data come in two forms: complete individual data and grouped data.

In complete individual data, the full information on each individual is recorded. Grouped data are used when the population under study is large, so individual observations are put into groups.

1. Empirical distribution for complete individual data (section 11.2)

The density function of the population from which the data are collected is denoted by $f(x)$, its distribution function by $F(x)$, its survival function by $S(x)$, and its hazard rate function by $h(x)$. Finally, its cumulative hazard function is denoted by $H(x)$ and is defined by

$$H(x) = -\ln S(x).$$

Note that for continuous random variables we have

$$H'(x) = -\frac{S'(x)}{S(x)} = \frac{f(x)}{S(x)} = h(x).$$

To estimate these functions we collect a sample from the population and use the collected data to build functions that estimate them. First, suppose that we have collected $n$ data points, some of which might be repeated. We assume that the collected data represent the whole population, so in effect we treat the sample as the population itself. Since we have no knowledge about the distribution of the data (that is why estimation comes in), the data points have no privilege over each other, and therefore we assign probability $1/n$ to each data point. The associated probability density function, denoted by $f_n(x)$, is

$$f_n(x) = \frac{\text{number of observations equal to } x}{n} \quad \text{for each observed value } x.$$

It is called the empirical density function. The distribution function associated with this density is called the empirical distribution function and is denoted by $F_n(x)$:

$$F_n(x) = \frac{\text{number of observations less than or equal to } x}{n}.$$

The empirical survival function is

$$S_n(x) = 1 - F_n(x).$$

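The empirical hazard rate function is

$$h_n(x) = \frac{f_n(x)}{S_n(x)}.$$

And finally, the empirical cumulative hazard function is

$$H_n(x) = -\ln S_n(x).$$

Example (data from Finan's notes). The following loss values have been obtained:

49, 50, 50, 50, 60, 75, 80, 120, 130

Calculate $f_n(x)$, $F_n(x)$, $S_n(x)$, $h_n(x)$, and $H_n(x)$ for all $x$.

Solution. Here $n = 9$.

| $x$ | 49 | 50 | 60 | 75 | 80 | 120 | 130 | otherwise |
|---|---|---|---|---|---|---|---|---|
| $f_9(x)$ | 1/9 | 3/9 | 1/9 | 1/9 | 1/9 | 1/9 | 1/9 | 0 |

$$F_9(x) = \begin{cases} 0 & x < 49 \\ 1/9 & 49 \le x < 50 \\ 4/9 & 50 \le x < 60 \\ 5/9 & 60 \le x < 75 \\ 6/9 & 75 \le x < 80 \\ 7/9 & 80 \le x < 120 \\ 8/9 & 120 \le x < 130 \\ 1 & 130 \le x \end{cases}
\qquad
S_9(x) = \begin{cases} 1 & x < 49 \\ 8/9 & 49 \le x < 50 \\ 5/9 & 50 \le x < 60 \\ 4/9 & 60 \le x < 75 \\ 3/9 & 75 \le x < 80 \\ 2/9 & 80 \le x < 120 \\ 1/9 & 120 \le x < 130 \\ 0 & 130 \le x \end{cases}$$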
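The empirical quantities defined above can be checked numerically. The following is a minimal sketch, not part of the original notes: the helper name `empirical` is hypothetical, exact fractions are used so the results can be compared with hand calculations, and `None` stands in for "undefined" wherever $S_n(x) = 0$.

```python
from collections import Counter
from fractions import Fraction
from math import log

# Loss values from the example (first value read as 49).
losses = [49, 50, 50, 50, 60, 75, 80, 120, 130]


def empirical(data):
    """Return the empirical f, F, S, h and H as functions of x (illustrative helper)."""
    n = len(data)
    counts = Counter(data)

    def f(x):
        # probability mass placed on x: (number of observations equal to x) / n
        return Fraction(counts.get(x, 0), n)

    def F(x):
        # (number of observations <= x) / n
        return Fraction(sum(1 for v in data if v <= x), n)

    def S(x):
        return 1 - F(x)

    def h(x):
        # f_n(x) / S_n(x); undefined (None) where S_n(x) = 0
        s = S(x)
        return f(x) / s if s > 0 else None

    def H(x):
        # -ln S_n(x); undefined (None) where S_n(x) = 0
        s = S(x)
        return -log(float(s)) if s > 0 else None

    return f, F, S, h, H


f, F, S, h, H = empirical(losses)
print(F(50), S(60), h(50), H(80))
```

Evaluating the functions at the observed values gives the entries of the tables in the solution.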

$$H_9(x) = -\ln S_9(x) = \begin{cases} 0 & x < 49 \\ 0.1178 & 49 \le x < 50 \\ 0.5878 & 50 \le x < 60 \\ 0.8109 & 60 \le x < 75 \\ 1.0986 & 75 \le x < 80 \\ 1.5041 & 80 \le x < 120 \\ 2.1972 & 120 \le x < 130 \\ \text{undefined} & 130 \le x \end{cases}$$

| $x$ | 49 | 50 | 60 | 75 | 80 | 120 | 130 | otherwise |
|---|---|---|---|---|---|---|---|---|
| $h_9(x)$ | 1/8 | 3/5 | 1/4 | 1/3 | 1/2 | 1 | undefined | 0 where $S_9(x) > 0$, undefined for $x > 130$ |

Another way of estimating the cumulative hazard function is by means of the Nelson-Åalen estimate. Before introducing it, we need some notation. For the observed values $\{x_1, \dots, x_n\}$ let $y_1 < y_2 < \dots < y_k$ be the unique values of the $x_i$'s. Let

$$s_j = \sum_{i} I(x_i = y_j)$$

be the number of times the observation $y_j$ appears in the sample; here $I$ denotes the indicator function, which returns 1 if its argument is true and 0 if its argument is false. Note that $s_1 + \dots + s_k = n$.

Example. For the data set of the previous example we have

| $y_j$ | $s_j$ |
|---|---|
| 49 | 1 |
| 50 | 3 |
| 60 | 1 |
| 75 | 1 |
| 80 | 1 |
| 120 | 1 |
| 130 | 1 |
| total | 9 |

To each $j$ we associate a so-called risk set $r_j$ as follows:

$r_j$ = the number of observations greater than or equal to $y_j$
= the sum of those $s$ values whose indices are greater than or equal to $j$

$$r_j = s_j + s_{j+1} + \dots + s_k = \sum_{i=j}^{k} s_i.$$

Note that the $r_j$'s are decreasing and $r_1 = n$. So we have $r_j = s_j + r_{j+1}$.

Note. Some texts use a different notation in place of $r_i$, chosen to stress that $r_i$ is the number of individuals at time $y_i$ before any event at that time.

Continuing, note that with this notation, for every $x$ satisfying $y_{j-1} \le x < y_j$ we have

$$F_n(x) = F_n(y_{j-1}) = \frac{\text{number of observations less than or equal to } y_{j-1}}{n} = \frac{\sum_{i=1}^{j-1} s_i}{n} = \frac{n - \sum_{i=j}^{k} s_i}{n} = 1 - \frac{r_j}{n},$$

so

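$$F_n(x) = \begin{cases} 0 & x < y_1 \\ 1 - \dfrac{r_j}{n} & y_{j-1} \le x < y_j, \quad j = 2, \dots, k \\ 1 & y_k \le x \end{cases}
\qquad
S_n(x) = \begin{cases} 1 & x < y_1 \\ \dfrac{r_j}{n} & y_{j-1} \le x < y_j, \quad j = 2, \dots, k \\ 0 & y_k \le x \end{cases}$$

$$H_n(x) = -\ln(S_n(x)) = \begin{cases} 0 & x < y_1 \\ -\ln\left(\dfrac{r_j}{n}\right) & y_{j-1} \le x < y_j, \quad j = 2, \dots, k \\ \text{undefined} & y_k \le x \end{cases}$$

Note that from $r_i - s_i = r_{i+1}$ and $r_1 = n$ we have

$$\frac{r_j}{n} = \frac{r_j}{r_1} = \frac{r_j}{r_{j-1}} \cdot \frac{r_{j-1}}{r_{j-2}} \cdots \frac{r_2}{r_1} = \prod_{i=1}^{j-1} \frac{r_{i+1}}{r_i} = \prod_{i=1}^{j-1} \frac{r_i - s_i}{r_i} = \prod_{i=1}^{j-1} \left(1 - \frac{s_i}{r_i}\right).$$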
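The risk-set notation and the product identity for $r_j/n$ can be verified on the example data. This is a small sketch under the assumption that the data are the nine loss values used above; the function names `risk_table` and `survival_product` are hypothetical.

```python
from collections import Counter
from fractions import Fraction

# Loss values from the example (first value read as 49).
losses = [49, 50, 50, 50, 60, 75, 80, 120, 130]


def risk_table(data):
    """Return the unique values y_j, their multiplicities s_j and the risk sets r_j."""
    counts = Counter(data)
    ys = sorted(counts)
    s = [counts[y] for y in ys]
    # r_j = s_j + s_{j+1} + ... + s_k
    r = [sum(s[j:]) for j in range(len(s))]
    return ys, s, r


def survival_product(data, x):
    """Empirical survival S_n(x) written as the product of (1 - s_i / r_i) over y_i <= x."""
    ys, s, r = risk_table(data)
    prod = Fraction(1)
    for y, s_i, r_i in zip(ys, s, r):
        if y <= x:
            prod *= 1 - Fraction(s_i, r_i)
    return prod


ys, s, r = risk_table(losses)
print(ys, s, r)
print(survival_product(losses, 60))  # agrees with S_9(60) computed by counting
```

The product telescopes exactly as in the derivation, so it reproduces the empirical survival function without any approximation.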

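$$H_n(x) = \begin{cases} 0 & x < y_1 \\ -\ln \displaystyle\prod_{i=1}^{j-1} \left(1 - \frac{s_i}{r_i}\right) & y_{j-1} \le x < y_j, \quad j = 2, \dots, k \\ \text{undefined} & y_k \le x \end{cases}$$

We recall from calculus that

$$e^{x} = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots$$

and therefore for $|x| < 1$ we can approximate $e^{x} \approx 1 + x$ (the approximation is best when $|x|$ is close to 0). By changing $x$ to $-x$ we get

$$e^{-x} \approx 1 - x, \qquad |x| < 1.$$

Because of this, we approximate $\left(1 - \frac{s_i}{r_i}\right)$ by $\exp\left(-\frac{s_i}{r_i}\right)$, and therefore we have the approximation

$$\prod_{i=1}^{j-1} \left(1 - \frac{s_i}{r_i}\right) \approx \prod_{i=1}^{j-1} \exp\left(-\frac{s_i}{r_i}\right) = \exp\left(-\sum_{i=1}^{j-1} \frac{s_i}{r_i}\right).$$

Then

$$-\ln \prod_{i=1}^{j-1} \left(1 - \frac{s_i}{r_i}\right) \approx \sum_{i=1}^{j-1} \frac{s_i}{r_i}.$$

So the $H_n(x)$ above can be approximated with the Nelson-Åalen estimate: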

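$$\hat{H}(x) = \begin{cases} 0 & x < y_1 \\ \displaystyle\sum_{i=1}^{j-1} \frac{s_i}{r_i} & y_{j-1} \le x < y_j, \quad j = 2, \dots, k \\ \displaystyle\sum_{i=1}^{k} \frac{s_i}{r_i} & y_k \le x \end{cases}$$

Once this is found, setting

$$\hat{S}(x) = \exp\left(-\hat{H}(x)\right)$$

gives an estimate of the survival function.

Example 49.3 (from Finan's notes). The following loss values have been obtained:

49, 50, 50, 50, 60, 75, 80, 120, 130

Find the Nelson-Åalen estimate of the cumulative hazard function and then estimate the survival function.

Solution.

$$\hat{H}(x) = \begin{cases} 0 & x < 49 \\ \frac{1}{9} & 49 \le x < 50 \\ \frac{1}{9} + \frac{3}{8} = \frac{35}{72} & 50 \le x < 60 \\ \frac{35}{72} + \frac{1}{5} = \frac{247}{360} & 60 \le x < 75 \\ \frac{247}{360} + \frac{1}{4} = \frac{337}{360} & 75 \le x < 80 \\ \frac{337}{360} + \frac{1}{3} = \frac{457}{360} & 80 \le x < 120 \\ \frac{457}{360} + \frac{1}{2} = \frac{637}{360} & 120 \le x < 130 \\ \frac{637}{360} + \frac{1}{1} = \frac{997}{360} & 130 \le x \end{cases}$$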

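$$\hat{S}(x) = \exp\left(-\hat{H}(x)\right) = \begin{cases} 1 & x < 49 \\ 0.8948 & 49 \le x < 50 \\ 0.6150 & 50 \le x < 60 \\ 0.5035 & 60 \le x < 75 \\ 0.3921 & 75 \le x < 80 \\ 0.2810 & 80 \le x < 120 \\ 0.1704 & 120 \le x < 130 \\ 0.0627 & 130 \le x \end{cases}$$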
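The Nelson-Åalen calculation is a running sum of $s_i / r_i$ over the distinct values not exceeding $x$. A minimal sketch follows, assuming the same nine loss values; the function names `nelson_aalen` and `survival_estimate` are illustrative and do not come from the notes.

```python
from collections import Counter
from fractions import Fraction
from math import exp

# Loss values from the example (first value read as 49).
losses = [49, 50, 50, 50, 60, 75, 80, 120, 130]


def nelson_aalen(data, x):
    """Nelson-Aalen estimate of the cumulative hazard at x, as an exact fraction."""
    counts = Counter(data)
    at_risk = len(data)      # r_1 = n
    total = Fraction(0)
    for y in sorted(counts):
        if y > x:
            break
        total += Fraction(counts[y], at_risk)  # add s_i / r_i
        at_risk -= counts[y]                   # r_{i+1} = r_i - s_i
    return total


def survival_estimate(data, x):
    """Survival estimate exp(-H_hat(x)) based on the Nelson-Aalen estimate."""
    return exp(-float(nelson_aalen(data, x)))


print(nelson_aalen(losses, 55))                  # value on the interval [50, 60)
print(round(survival_estimate(losses, 60), 4))   # value on the interval [60, 75)
```

Unlike the empirical survival function, this estimate stays strictly positive beyond the largest observation, because $\hat{H}$ is finite there.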

2. Empirical distribution for grouped data (section 11.3)

For grouped data we divide the values into $k$ intervals:

$$(c_0, c_1],\ (c_1, c_2],\ \dots,\ (c_{k-1}, c_k].$$

Suppose that there are $n$ observations in total, and that the number of those falling into the interval $(c_{j-1}, c_j]$ is $n_j$, so $\sum_{j=1}^{k} n_j = n$. We then define $F_n$ at the endpoints of the intervals by

$$F_n(c_0) = 0, \qquad F_n(c_j) = \frac{\sum_{i=1}^{j} n_i}{n}, \quad j = 1, \dots, k,$$

and use linear interpolation to define $F_n(x)$ at the other points of the intervals:

$$F_n(x) = \frac{c_j - x}{c_j - c_{j-1}} F_n(c_{j-1}) + \frac{x - c_{j-1}}{c_j - c_{j-1}} F_n(c_j), \qquad c_{j-1} \le x \le c_j.$$

Its graph is called an ogive. Differentiating this function gives the estimate of the density function:

$$f_n(x) = \frac{F_n(c_j) - F_n(c_{j-1})}{c_j - c_{j-1}} = \frac{n_j}{n\,(c_j - c_{j-1})}, \qquad c_{j-1} \le x < c_j.$$

Its graph is called a histogram.

Note. This density is a spliced density function.

Note. Taking complements of both sides of the interpolation relationship gives

$$S_n(x) = \frac{c_j - x}{c_j - c_{j-1}} S_n(c_{j-1}) + \frac{x - c_{j-1}}{c_j - c_{j-1}} S_n(c_j), \qquad c_{j-1} \le x \le c_j.$$

Examples 50.1 and 50.2 (from Finan's notes). Calculate the empirical distribution function and the empirical density function for the following grouped data.

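| Interval | Number of observations |
|---|---|
| (0, 2] | 25 |
| (2, 10] | 10 |
| (10, 100] | 10 |
| (100, 1000] | 5 |

Solution. We first find $F_{50}$ at the endpoints:

$$F_{50}(0) = 0, \quad F_{50}(2) = \frac{25}{50} = 0.5, \quad F_{50}(10) = \frac{35}{50} = 0.7, \quad F_{50}(100) = \frac{45}{50} = 0.9, \quad F_{50}(1000) = 1.$$

We then calculate $F_{50}$ at other points using interpolation. For example, for the interval $(10, 100]$ we have

$$F_{50}(x) = \frac{x - 10}{100 - 10}(0.9) + \frac{100 - x}{100 - 10}(0.7) = \frac{x - 10}{90}(0.9) + \frac{100 - x}{90}(0.7) = \frac{x}{450} + \frac{61}{90}.$$

Similarly we will have:

$$F_{50}(x) = \begin{cases} \dfrac{x}{4} & 0 \le x \le 2 \\ \dfrac{x}{40} + \dfrac{9}{20} & 2 \le x \le 10 \\ \dfrac{x}{450} + \dfrac{61}{90} & 10 \le x \le 100 \\ \dfrac{x}{9000} + \dfrac{8}{9} & 100 \le x \le 1000 \\ \text{undefined} & 1000 < x \end{cases}$$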

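$$f_{50}(x) = \begin{cases} \dfrac{1}{4} & 0 \le x < 2 \\ \dfrac{1}{40} & 2 \le x < 10 \\ \dfrac{1}{450} & 10 \le x < 100 \\ \dfrac{1}{9000} & 100 \le x < 1000 \\ \text{undefined} & 1000 \le x \end{cases}$$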
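The ogive and histogram for grouped data can be sketched as follows. This is an illustration only: the names `ogive` and `histogram_density` are hypothetical, the grouped data are those of the example (50 observations with boundaries 0, 2, 10, 100, 1000), and values outside the tabulated range raise an error because the estimates are undefined there.

```python
from fractions import Fraction

boundaries = [0, 2, 10, 100, 1000]   # c_0, ..., c_k
counts = [25, 10, 10, 5]             # n_1, ..., n_k


def ogive(x, boundaries, counts):
    """Empirical distribution function for grouped data (linear interpolation)."""
    n = sum(counts)
    if not boundaries[0] <= x <= boundaries[-1]:
        raise ValueError("the ogive is only defined on [c_0, c_k]")
    below = 0  # observations in the intervals to the left of the current one
    for j, n_j in enumerate(counts):
        left, right = boundaries[j], boundaries[j + 1]
        if x <= right:
            weight = (Fraction(x) - left) / (right - left)
            return Fraction(below, n) + weight * Fraction(n_j, n)
        below += n_j


def histogram_density(x, boundaries, counts):
    """Empirical density for grouped data: n_j / (n (c_j - c_{j-1})) on [c_{j-1}, c_j)."""
    n = sum(counts)
    for j, n_j in enumerate(counts):
        left, right = boundaries[j], boundaries[j + 1]
        if left <= x < right:
            return Fraction(n_j, n * (right - left))
    raise ValueError("the density is only defined on [c_0, c_k)")


print(ogive(55, boundaries, counts), histogram_density(55, boundaries, counts))
```

On each interval the ogive is a straight line whose slope equals the histogram height, which is why the density follows from differentiation.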

3. Mean and Variance of Empirical Estimators for Complete Individual Data (from section 12.2)

Let $S_n(x)$ be the empirical survival function. This random variable is used to estimate the true value of $S(x)$. For hypothesis testing and other purposes we need the mean and the variance of the estimator. To find them, we start from

$$S_n(x) = \frac{\text{number of observations larger than } x}{n} = \frac{Y}{n},$$

where $Y$ is the number of observations that are larger than $x$. If $\{X_1, \dots, X_n\}$ denote the observed values, this set is an i.i.d. sample from a population in which the probability of being larger than $x$ is $S(x)$, and $Y$ is the number of such cases in our sample. So $Y$ is distributed as Binomial$(n, S(x))$, and

$$E(Y) = n\,S(x), \qquad \mathrm{Var}(Y) = n\,S(x)\,(1 - S(x)).$$

Then from $S_n(x) = Y/n$ we get

$$E(S_n(x)) = \frac{E(Y)}{n} = S(x), \qquad \mathrm{Var}(S_n(x)) = \frac{1}{n^2}\mathrm{Var}(Y) = \frac{S(x)\,(1 - S(x))}{n}.$$

So, first, $S_n(x)$ is an unbiased estimator of the value $S(x)$ (which is unknown to us, and that is why we sample), and second, it is a consistent estimator, because its variance tends to 0 as $n \to \infty$. So it is an unbiased, consistent estimator.

Note. Since we do not have the values $S(x)$, we cannot actually calculate the variance $\mathrm{Var}(S_n(x))$. In this formula we therefore use $S_n(x)$ in place of $S(x)$, as $S_n(x)$ is an unbiased estimator of $S(x)$. So we use

$$\widehat{\mathrm{Var}}[S_n(x)] = \frac{S_n(x)\,(1 - S_n(x))}{n}.$$

Question. In the above discussion we found an unbiased estimator for the probability $S(x) = P(X > x)$. How about the probabilities $P(a \le X \le b)$, $P(a \le X < b)$, $P(a < X \le b)$, $P(a < X < b)$, $P(X < x)$, $P(X \le x)$, ...? Can we have reasonable unbiased consistent estimators for them?

Answer. Yes. In fact the above argument works for these cases too. For example, to approximate $P(a \le X \le b)$, the estimator

$$W = \frac{\text{number of observations } X_i \text{ that satisfy } a \le X_i \le b}{n}$$

is an unbiased estimator of $P(a \le X \le b)$.

Note. For $a < b$, to approximate the conditional probability $P(X > b \mid X > a) = \dfrac{S(b)}{S(a)}$ we use $\dfrac{S_n(b)}{S_n(a)}$. The approximate value of the variance would be

$$\frac{\left(\dfrac{S_n(b)}{S_n(a)}\right)\left(1 - \dfrac{S_n(b)}{S_n(a)}\right)}{\#(X_i > a)}.$$

Example (data from Finan's notes). The following loss values have been obtained:

49, 50, 50, 50, 60, 75, 80, 120, 130

(i) What is the estimated variance of the estimator for $P(X > 60)$?
(ii) What is the estimated variance of the estimator for $P(75 < X \le 120)$?
(iii) What is the estimated variance of the estimator for $P(X > 60 \mid X > 50)$?

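Solution.

(i) We estimate the variance by

$$\frac{S_9(60)\,(1 - S_9(60))}{9} = \frac{\left(\frac{4}{9}\right)\left(\frac{5}{9}\right)}{9} = \frac{20}{729}.$$

(ii) The approximate value of $P(75 < X \le 120)$ is

$$\frac{\text{number of observations } X_i \text{ that satisfy } 75 < X_i \le 120}{9} = \frac{2}{9}.$$

So the approximate value of the variance would be

$$\frac{\left(\frac{2}{9}\right)\left(\frac{7}{9}\right)}{9} = \frac{14}{729}.$$

(iii) The approximate value of the conditional probability $P(X > 60 \mid X > 50) = \dfrac{S(60)}{S(50)}$ is $\dfrac{S_9(60)}{S_9(50)} = \dfrac{4}{5}$. The estimated value of the variance is

$$\frac{\left(\frac{4}{5}\right)\left(\frac{1}{5}\right)}{5} = \frac{4}{125}.$$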
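All three parts follow the same pattern: estimate a proportion $\hat{p}$ from $m$ relevant observations and plug it into $\hat{p}(1 - \hat{p})/m$. A minimal sketch, assuming the nine loss values of the example; the helper name `proportion_and_variance` and its `given` argument (used for the conditional case) are hypothetical.

```python
from fractions import Fraction

# Loss values from the example (first value read as 49).
losses = [49, 50, 50, 50, 60, 75, 80, 120, 130]


def proportion_and_variance(data, condition, given=None):
    """Return (p_hat, estimated variance) for P(condition) or P(condition | given)."""
    # For a conditional probability only the observations satisfying `given` count.
    pool = [v for v in data if given is None or given(v)]
    m = len(pool)
    p_hat = Fraction(sum(1 for v in pool if condition(v)), m)
    return p_hat, p_hat * (1 - p_hat) / m


# (i) P(X > 60)
print(proportion_and_variance(losses, lambda v: v > 60))
# (ii) P(75 < X <= 120)
print(proportion_and_variance(losses, lambda v: 75 < v <= 120))
# (iii) P(X > 60 | X > 50)
print(proportion_and_variance(losses, lambda v: v > 60, given=lambda v: v > 50))
```

In the conditional case the divisor is the number of observations exceeding 50, not the full sample size, which matches the $\#(X_i > a)$ term in the formula.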

4. Mean and Variance of Empirical Estimators for Grouped Data (from section 12.2)

With the notation used for grouped data, recall the following two identities:

$$S_n(c_j) = 1 - F_n(c_j) = 1 - \frac{n_1 + \dots + n_j}{n},$$

$$S_n(x) = \frac{c_j - x}{c_j - c_{j-1}} S_n(c_{j-1}) + \frac{x - c_{j-1}}{c_j - c_{j-1}} S_n(c_j), \qquad c_{j-1} \le x \le c_j.$$

For a moment let $Y$ be the number of observations up to $c_{j-1}$:

$$Y = n_1 + \dots + n_{j-1}.$$

Then $Y$ is distributed as Binomial$(n, 1 - S(c_{j-1}))$. Let $Z$ be the number of observations in $(c_{j-1}, c_j]$. Then $Z$ is distributed as Binomial$(n, S(c_{j-1}) - S(c_j))$. Then:

$$S_n(c_{j-1}) = 1 - \frac{Y}{n}, \qquad S_n(c_j) = 1 - \frac{Y + Z}{n}.$$

Then:

$$E[S_n(c_{j-1})] = 1 - \frac{E(Y)}{n} = 1 - (1 - S(c_{j-1})) = S(c_{j-1}).$$

Similarly,

$$E[S_n(c_j)] = S(c_j).$$

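Then by taking the linear interpolation we get, for $x \in (c_{j-1}, c_j]$:

$$E[S_n(x)] = \frac{c_j - x}{c_j - c_{j-1}} E[S_n(c_{j-1})] + \frac{x - c_{j-1}}{c_j - c_{j-1}} E[S_n(c_j)] = \frac{c_j - x}{c_j - c_{j-1}} S(c_{j-1}) + \frac{x - c_{j-1}}{c_j - c_{j-1}} S(c_j).$$

On the other hand, for $x \in (c_{j-1}, c_j]$:

$$S_n(x) = \frac{c_j - x}{c_j - c_{j-1}}\left(1 - \frac{Y}{n}\right) + \frac{x - c_{j-1}}{c_j - c_{j-1}}\left(1 - \frac{Y + Z}{n}\right)$$

$$= 1 - \frac{1}{n}\left\{\left(\frac{c_j - x}{c_j - c_{j-1}}\right) Y + \left(\frac{x - c_{j-1}}{c_j - c_{j-1}}\right)(Y + Z)\right\} = 1 - \frac{(c_j - c_{j-1})\,Y + (x - c_{j-1})\,Z}{n\,(c_j - c_{j-1})},$$

and therefore

$$\mathrm{Var}[S_n(x)] = \frac{(c_j - c_{j-1})^2\,\mathrm{Var}(Y) + (x - c_{j-1})^2\,\mathrm{Var}(Z) + 2\,(c_j - c_{j-1})(x - c_{j-1})\,\mathrm{Cov}(Y, Z)}{n^2\,(c_j - c_{j-1})^2},$$

where

$$\mathrm{Var}(Y) = n\,S(c_{j-1})\,[1 - S(c_{j-1})],$$

$$\mathrm{Var}(Z) = n\,[S(c_{j-1}) - S(c_j)]\,[1 - S(c_{j-1}) + S(c_j)],$$

$$\mathrm{Cov}(Y, Z) = -n\,[1 - S(c_{j-1})]\,[S(c_{j-1}) - S(c_j)].$$

Why the covariance? The counts $(Y, Z, n - Y - Z)$ are jointly multinomial, and two multinomial counts with cell probabilities $p$ and $q$ have covariance $-npq$.

In the following example we will see how to calculate this variance, or its estimate, without having to memorize this formula. Since the values on the right-hand side are unknown, we substitute them with their estimates: we substitute $S(c_{j-1})$ with $S_n(c_{j-1}) = 1 - Y/n$, and $S(c_j)$ with $S_n(c_j) = 1 - (Y + Z)/n$. This results in the estimates: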

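$$\widehat{\mathrm{Var}}(Y) = \frac{Y\,(n - Y)}{n}, \qquad \widehat{\mathrm{Var}}(Z) = \frac{Z\,(n - Z)}{n}, \qquad \widehat{\mathrm{Cov}}(Y, Z) = -\frac{Y Z}{n}.$$

Example. What are $E[f_n(x)]$ and $\mathrm{Var}[f_n(x)]$?

Solution. We have

$$f_n(x) = \frac{n_j}{n\,(c_j - c_{j-1})} = \frac{Z}{n\,(c_j - c_{j-1})}, \qquad c_{j-1} \le x < c_j,$$

$$E[f_n(x)] = \frac{E(Z)}{n\,(c_j - c_{j-1})} = \frac{n\,(S(c_{j-1}) - S(c_j))}{n\,(c_j - c_{j-1})} = \frac{S(c_{j-1}) - S(c_j)}{c_j - c_{j-1}},$$

$$\mathrm{Var}[f_n(x)] = \frac{\mathrm{Var}(Z)}{n^2\,(c_j - c_{j-1})^2}.$$

But we estimate this variance with

$$\widehat{\mathrm{Var}}[f_n(x)] = \frac{\widehat{\mathrm{Var}}(Z)}{n^2\,(c_j - c_{j-1})^2}, \qquad \text{where } \widehat{\mathrm{Var}}(Z) = \frac{Z\,(n - Z)}{n}.$$

Example 53.6 (from Finan's notes). For the following grouped data, estimate the probability that a loss will be no more than 90. Estimate the variance of the associated estimator.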

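| Interval | Number of observations |
|---|---|
| (0, 2] | 25 |
| (2, 10] | 10 |
| (10, 100] | 10 |
| (100, 1000] | 5 |

Solution. Since the point 90 is in the interval $(10, 100]$, we use

$$S_n(x) = \frac{c_j - x}{c_j - c_{j-1}} S_n(c_{j-1}) + \frac{x - c_{j-1}}{c_j - c_{j-1}} S_n(c_j), \qquad c_{j-1} \le x \le c_j,$$

with

$$S_{50}(10) = \frac{15}{50} = \frac{3}{10}, \qquad S_{50}(100) = \frac{5}{50} = \frac{1}{10},$$

and the weights

$$\text{left-end weight} = \frac{100 - 90}{100 - 10} = \frac{10}{90} = \frac{1}{9}, \qquad \text{right-end weight} = \frac{90 - 10}{100 - 10} = \frac{80}{90} = \frac{8}{9}.$$

$$S_{50}(90) = \left(\frac{1}{9}\right)\left(\frac{3}{10}\right) + \left(\frac{8}{9}\right)\left(\frac{1}{10}\right) = \frac{11}{90}.$$

So the answer to the first part is

$$1 - S_{50}(90) = \frac{79}{90} = 0.8778.$$

For the second part:

$Y$ = number of observations in the interval $(0, 10]$ = 35

$$\widehat{\mathrm{Var}}(Y) = \frac{Y\,(n - Y)}{n} = \frac{(35)(50 - 35)}{50} = \frac{21}{2} = 10.5$$

$Z$ = number of observations in the interval $(10, 100]$ = 10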

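$$\widehat{\mathrm{Var}}(Z) = \frac{Z\,(n - Z)}{n} = \frac{(10)(40)}{50} = 8, \qquad \widehat{\mathrm{Cov}}(Y, Z) = -\frac{Y Z}{n} = -\frac{(35)(10)}{50} = -7.$$

From

$$S_n(x) = 1 - \frac{(c_j - c_{j-1})\,Y + (x - c_{j-1})\,Z}{n\,(c_j - c_{j-1})}$$

we get

$$S_{50}(90) = 1 - \frac{(90)\,Y + (80)\,Z}{(50)(90)} = 1 - \frac{1}{50}\,Y - \frac{4}{225}\,Z,$$

so

$$\widehat{\mathrm{Var}}(S_{50}(90)) = \left(\frac{1}{50}\right)^2 \widehat{\mathrm{Var}}(Y) + \left(\frac{4}{225}\right)^2 \widehat{\mathrm{Var}}(Z) + 2\left(\frac{1}{50}\right)\left(\frac{4}{225}\right)\widehat{\mathrm{Cov}}(Y, Z)$$

$$= \left(\frac{1}{50}\right)^2 (10.5) + \left(\frac{4}{225}\right)^2 (8) + 2\left(\frac{1}{50}\right)\left(\frac{4}{225}\right)(-7) = 0.0018.$$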
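The grouped-data estimate and its estimated variance can be reproduced with exact arithmetic. The sketch below is illustrative only: the function name `grouped_cdf_and_variance` is hypothetical, the data are the 50 grouped observations of the example, and the variance uses the plug-in estimates of $\mathrm{Var}(Y)$, $\mathrm{Var}(Z)$ and $\mathrm{Cov}(Y, Z)$ given earlier, with $w = (x - c_{j-1})/(c_j - c_{j-1})$ as the interpolation weight.

```python
from fractions import Fraction

boundaries = [0, 2, 10, 100, 1000]   # c_0, ..., c_k
counts = [25, 10, 10, 5]             # n_1, ..., n_k


def grouped_cdf_and_variance(x, boundaries, counts):
    """Return (estimate of P(X <= x), estimated variance) for grouped data."""
    n = sum(counts)
    for j, z in enumerate(counts):
        left, right = boundaries[j], boundaries[j + 1]
        if left < x <= right:
            y = sum(counts[:j])                       # observations up to c_{j-1}
            w = (Fraction(x) - left) / (right - left) # right-end interpolation weight
            estimate = (y + w * z) / n                # F_n(x) = 1 - S_n(x)
            var_y = Fraction(y * (n - y), n)
            var_z = Fraction(z * (n - z), n)
            cov_yz = Fraction(-y * z, n)
            # Var[S_n(x)] = Var[F_n(x)] = (Var Y + w^2 Var Z + 2 w Cov) / n^2
            variance = (var_y + w**2 * var_z + 2 * w * cov_yz) / n**2
            return estimate, variance
    raise ValueError("x must lie in (c_0, c_k]")


estimate, variance = grouped_cdf_and_variance(90, boundaries, counts)
print(estimate, float(estimate))
print(variance, float(variance))
```

Dividing the numerator and denominator of the variance formula by $(c_j - c_{j-1})^2$ gives the weighted form used in the code, so both expressions agree.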