Kernel density estimation in R


1 Kernel density estimation in R

Kernel density estimation can be done in R using the density() function. The default is a Gaussian kernel, but others are possible as well. The function uses its own algorithm to determine the bandwidth, but you can override this and choose your own. If you rely on the density() function, you are limited to the built-in kernels; if you want to try a different one, you have to write the code yourself.
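For example, here is a minimal hand-rolled estimator using a tricube kernel, which is not among density()'s built-ins. This is a sketch of my own, not code from the slides; the data x and bandwidth b are placeholders.

tricube <- function(u) (70/81) * (1 - abs(u)^3)^3 * (abs(u) <= 1)

# KDE at each grid point: average of scaled kernels centered at the data
kde <- function(xgrid, x, b) {
  sapply(xgrid, function(t) mean(tricube((t - x) / b)) / b)
}

x <- rnorm(100)
xgrid <- seq(-4, 4, by = 0.05)
plot(xgrid, kde(xgrid, x, b = 0.5), type = "l")
lines(density(x), col = "red")   # compare with the built-in Gaussian KDE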

2 Kernel density estimation in R: effect of bandwidth for rectangular kernel (figure)

3 Kernel density estimation in R

Note that exponential densities are a bit tricky to estimate using kernel methods. Here is the default behavior estimating the density for exponential data.

> x <- rexp(100)
> plot(density(x))

4 Kernel density estimation in R: exponential data with Gaussian kernel (figure)

5 Violin plots: a nice application of kernel density estimation

Violin plots are an alternative to boxplots that show nonparametric density estimates of the distribution in addition to the median and interquartile range. The densities are rotated sideways to have a similar orientation to a box plot.

> x <- rexp(100)
> install.packages("vioplot")
> library(vioplot)
> vioplot(x)

6 Kernel density estimation in R: violin plot (figure)

7 Kernel density estimation in R: violin plot

The violin plot uses the function sm.density() rather than density() for the nonparametric density estimate, and this leads to smoother density estimates. If you want to modify the behavior of the violin plot, you can copy the original code to your own function and change how the nonparametric density estimate is done (e.g., replacing sm.density with density, or changing the kernel used).

8 Kernel density estimation in R: violin plot (figure)

9 Kernel density estimation in R: violin plot (figure)

10 Kernel density estimation in R: violin plot

> vioplot
function (x, ..., range = 1.5, h = NULL, ylim = NULL, names = NULL,
    horizontal = FALSE, col = "magenta", border = "black", lty = 1,
    lwd = 1, rectcol = "black", colmed = "white", pchmed = 19,
    at, add = FALSE, wex = 1, drawrect = TRUE)
{
    datas <- list(x, ...)
    n <- length(datas)
    if (missing(at))
        at <- 1:n
    upper <- vector(mode = "numeric", length = n)
    lower <- vector(mode = "numeric", length = n)
    q1 <- vector(mode = "numeric", length = n)
    q3 <- vector(mode = "numeric", length = n)
    ...
    args <- list(display = "none")
    if (!(is.null(h)))
        args <- c(args, h = h)
    for (i in 1:n) {
        ...
        smout <- do.call("sm.density", c(list(data, xlim = est.xlim),
            args))

11 Kernel density estimation

There are lots of popular kernel density estimators, and statisticians have put a lot of work into establishing their properties, showing when some kernels work better than others (for example, using mean integrated squared error as a criterion), determining how to choose bandwidths, and so on. In addition to the Gaussian, common choices for the kernel include

- uniform, $K(u) = \frac{1}{2} I(-1 \le u \le 1)$
- Epanechnikov, $K(u) = 0.75\,(1 - u^2)\, I(-1 \le u \le 1)$
- biweight, $K(u) = \frac{15}{16}\,(1 - u^2)^2\, I(-1 \le u \le 1)$

12 Kernel-smoothed hazard estimation

To estimate a smoothed version of the hazard function using a kernel method, first pick a kernel, then use

$$\hat{h}(t) = \frac{1}{b} \sum_{i=1}^{D} K\left(\frac{t - t_i}{b}\right) \Delta \hat{H}(t_i)$$

where $D$ is the number of death times and $b$ is the bandwidth. A common notation for the bandwidth is $h$, but we use $b$ because $h$ is used for the hazard function. Here $\Delta \hat{H}(t_i)$ is the jump in $\hat{H}(t)$, the Nelson-Aalen estimator of the cumulative hazard function:

$$\hat{H}(t) = \begin{cases} 0, & \text{if } t \le t_1 \\ \sum_{t_i \le t} d_i / Y_i, & \text{if } t > t_1 \end{cases}$$

so $\Delta \hat{H}(t_i) = d_i / Y_i$.
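Here is a minimal sketch of this estimator in R, written from scratch (not from the slides), using an Epanechnikov kernel. It assumes vectors ti (distinct death times), di (numbers of deaths), and Yi (numbers at risk) are available.

# Epanechnikov kernel
epan <- function(u) 0.75 * (1 - u^2) * (abs(u) <= 1)

# Kernel-smoothed hazard at time t from Nelson-Aalen increments d_i / Y_i
smooth.hazard <- function(t, ti, di, Yi, b) {
  dH <- di / Yi
  sum(epan((t - ti) / b) * dH) / b
}

# Example with uncensored exponential data, where the true hazard is 1
ti <- sort(rexp(100)); di <- rep(1, 100); Yi <- 100:1
tgrid <- seq(0.2, 3, by = 0.05)
plot(tgrid, sapply(tgrid, smooth.hazard, ti = ti, di = di, Yi = Yi, b = 0.5),
     type = "l", ylab = "smoothed hazard")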

13 Kernel-smoothed hazard estimation

The variance of the smoothed hazard is

$$\sigma^2[\hat{h}(t)] = \frac{1}{b^2} \sum_{i=1}^{D} \left[ K\left(\frac{t - t_i}{b}\right) \right]^2 \Delta \hat{V}[\hat{H}(t_i)]$$

where $\Delta \hat{V}[\hat{H}(t_i)] = d_i / Y_i^2$ is the jump in the estimated variance of the Nelson-Aalen estimator.

14 Asymmetric kernels

A difficulty that we saw with the exponential data can also occur here: a symmetric kernel puts mass at negative times, biasing the estimate near $t = 0$. Consequently, you can use an asymmetric kernel instead for small $t$. For $t < b$, let $q = t/b$. A similar approach can be used for large $t$, when $t_D - b < t < t_D$. In this case, you can use $q = (t_D - t)/b$ and replace $x$ with $-x$ in the kernel density estimate for these larger times.

15 Asymmetric kernels (figure)

16 Asymmetric kernels (figure)

17 Confidence intervals

A pointwise confidence interval can be obtained with lower and upper limits

$$\left( \hat{h}(t) \exp\left[ -\frac{Z_{1-\alpha/2}\, \sigma(\hat{h}(t))}{\hat{h}(t)} \right],\; \hat{h}(t) \exp\left[ \frac{Z_{1-\alpha/2}\, \sigma(\hat{h}(t))}{\hat{h}(t)} \right] \right)$$

Note that this is really a confidence interval for the smoothed hazard function and not for the actual hazard function, making it difficult to interpret. In particular, the confidence interval will depend on both the kernel and the bandwidth. Coverage probabilities for smoothed hazard estimates (the proportion of times the confidence interval includes the true hazard rate) appear to be a topic of ongoing research.
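Continuing the earlier sketch (again my own illustration, reusing the assumed ti, di, Yi, and epan()), the interval can be computed directly from the variance increments $d_i/Y_i^2$:

# Pointwise log-transformed CI for the smoothed hazard at time t
hazard.ci <- function(t, ti, di, Yi, b, alpha = 0.05) {
  dH   <- di / Yi
  dV   <- di / Yi^2                       # Nelson-Aalen variance increments
  hhat <- sum(epan((t - ti) / b) * dH) / b
  sig  <- sqrt(sum(epan((t - ti) / b)^2 * dV)) / b
  z    <- qnorm(1 - alpha / 2)
  c(lower = hhat * exp(-z * sig / hhat),
    upper = hhat * exp(z * sig / hhat))
}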

18 Asymmetric kernels (figure)

19 Effect of bandwidth (figure)

20 Effect of bandwidth

Because the bandwidth has a big impact, we somehow want to pick the optimal bandwidth. An idea is to minimize the squared area between the true hazard function and the estimated hazard function. This squared area between the two functions is called the Mean Integrated Squared Error (MISE):

$$\text{MISE}(b) = E \int_{\tau_L}^{\tau_U} [\hat{h}(u) - h(u)]^2\, du = E \int_{\tau_L}^{\tau_U} \hat{h}^2(u)\, du - 2 E \int_{\tau_L}^{\tau_U} \hat{h}(u) h(u)\, du + E \int_{\tau_L}^{\tau_U} h^2(u)\, du$$

The last term doesn't depend on $b$, so it is sufficient to minimize the function ignoring the last term. The first term can be estimated by $\int_{\tau_L}^{\tau_U} \hat{h}^2(u)\, du$, which can be computed using the trapezoid rule from calculus.

21 Effect of bandwidth

The second term can be approximated by

$$\frac{1}{b} \sum_{i \ne j} K\left(\frac{t_i - t_j}{b}\right) \Delta \hat{H}(t_i)\, \Delta \hat{H}(t_j)$$

summing over event times between $\tau_L$ and $\tau_U$. Minimizing the MISE can therefore be done approximately by minimizing

$$g(b) = \sum_i \left(\frac{u_{i+1} - u_i}{2}\right) \left[\hat{h}^2(u_i) + \hat{h}^2(u_{i+1})\right] - \frac{2}{b} \sum_{i \ne j} K\left(\frac{t_i - t_j}{b}\right) \Delta \hat{H}(t_i)\, \Delta \hat{H}(t_j)$$

The minimization can be done numerically by plugging in different values of $b$ and evaluating.
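A sketch of this numerical search (my own illustration, reusing the assumed ti, di, Yi, epan(), and smooth.hazard() from the earlier sketches; the grids are placeholders):

# Approximate MISE criterion g(b)
g <- function(b, ugrid) {
  h2 <- sapply(ugrid, smooth.hazard, ti = ti, di = di, Yi = Yi, b = b)^2
  du <- diff(ugrid)
  term1 <- sum(du / 2 * (h2[-length(h2)] + h2[-1]))   # trapezoid rule
  dH <- di / Yi
  K <- epan(outer(ti, ti, "-") / b)
  diag(K) <- 0                                        # sum over i != j
  term1 - (2 / b) * sum(K * outer(dH, dH))
}

ugrid <- seq(0.2, 3, by = 0.05)
bs <- seq(0.1, 1, by = 0.05)
bs[which.min(sapply(bs, g, ugrid = ugrid))]   # approximate optimal bandwidth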

22 Effect of bandwidth (figure)

23 Effect of bandwidth

For this example, the minimum occurs around $b = 0.17$ to $b = 0.23$, depending on the kernel. Generally, there is a trade-off, with smaller bandwidths having smaller bias but higher variance, and larger bandwidths (more smoothing) having less variance but greater bias. Measuring the quality of bandwidths and kernels using MISE is standard in kernel density estimation (not just survival analysis). Bias here means that $E[\hat{h}(t)] \ne h(t)$.

24 Section 6.3: Estimation of Excess Mortality

The idea for this topic is to compare the survival curve or hazard rate for one group against a reference group, particularly if the non-reference group is thought to have higher risk. The reference group might come from a much larger sample, so that its survival curve can be considered known. An example is to compare the mortality for psychiatric patients against the general population. You could use census data to get the lifetable for the general population and determine the excess mortality for the psychiatric patients.

Two approaches are a multiplicative model and an additive model. In the multiplicative model, belonging to a particular group multiplies the hazard rate by a factor. In the additive model, belonging to a particular group adds a factor to the hazard rate.

25 Excess mortality

For the multiplicative model, if there is a reference hazard rate $\theta_j(t)$ for the $j$th individual in a study (based on sex, age, ethnicity, etc.), then due to other risk factors, the hazard rate for the $j$th individual is

$$h_j(t) = \beta(t)\, \theta_j(t)$$

where $\beta(t) \ge 1$ implies that the hazard rate is higher than the reference hazard. We define

$$B(t) = \int_0^t \beta(u)\, du$$

as the cumulative relative excess mortality.

26 Excess mortality

Note that $\frac{d}{dt} B(t) = \beta(t)$. To estimate $B(t)$, let $Y_j(t) = 1$ if the $j$th individual is at risk at time $t$; otherwise, let $Y_j(t) = 0$. Here $Y_j(t)$ is defined for left-truncated and right-censored data. Let

$$Q(t) = \sum_{j=1}^n \theta_j(t) Y_j(t)$$

where $n$ is the sample size. Then we estimate $B(t)$ by

$$\hat{B}(t) = \sum_{t_i \le t} \frac{d_i}{Q(t_i)}$$

This value compares the actual number of deaths that have occurred by time $t$ with the expected number of deaths based on the reference hazard rates and the number of patients available to have died.
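A minimal sketch of $\hat{B}(t)$ (my own illustration): it assumes hypothetical vectors times and status (1 = death) and a hypothetical function theta(j, t) returning the reference hazard for subject j at time t.

# Cumulative relative excess mortality at each death time
Bhat <- function(times, status, theta) {
  dt <- sort(unique(times[status == 1]))          # death times t_i
  jumps <- sapply(dt, function(t) {
    atrisk <- which(times >= t)                   # subjects with Y_j(t) = 1
    Q <- sum(sapply(atrisk, theta, t = t))        # Q(t_i)
    sum(times == t & status == 1) / Q             # d_i / Q(t_i)
  })
  cumsum(jumps)                                   # B(t_i)
}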

27 Excess mortality

The variance is estimated by

$$\hat{V}[\hat{B}(t)] = \sum_{t_i \le t} \frac{d_i}{Q(t_i)^2}$$

$\beta(t)$ can be estimated by the slope of $\hat{B}(t)$, and the estimate can be improved by using kernel-smoothing methods on $\hat{B}(t)$.

28 Excess mortality

For the additive model, the hazard is

$$h_j(t) = \alpha(t) + \theta_j(t)$$

Similarly to the multiplicative model, we estimate the cumulative excess mortality

$$A(t) = \int_0^t \alpha(u)\, du$$

In this case the expected cumulative hazard rate is

$$\Theta(t) = \sum_{j=1}^n \int_0^t \theta_j(u) \frac{Y_j(u)}{Y(u)}\, du$$

where

$$Y(u) = \sum_{j=1}^n Y_j(u)$$

is the number at risk at time $u$.

29 Excess mortality

The estimated excess mortality is

$$\hat{A}(t) = \sum_{t_i \le t} \frac{d_i}{Y_i} - \Theta(t)$$

where the first term is the Nelson-Aalen estimator of the cumulative hazard. The variance is

$$\hat{V}[\hat{A}(t)] = \sum_{t_i \le t} \frac{d_i}{Y(t_i)^2}$$

30 Excess mortality

For a lifetable where times are every year, you can also compute

$$\Theta(t) = \Theta(t-1) + \sum_j \frac{\lambda(a_j + t - 1)}{Y(t)}$$

summing over individuals at risk at time $t$, where $a_j$ is the age at the beginning of the study for patient $j$ and $\lambda$ is the reference hazard. Note that $\Theta(t)$ is a smooth function of $t$, while $\hat{A}(t)$ has jumps.

31 Excess mortality

A more general model combines multiplicative and additive components, using

$$h_j(t) = \beta(t)\, \theta_j(t) + \alpha(t)$$

which is done in Chapter 10.

32 Example: Iowa psychiatric patients

As an example, starting with the multiplicative model, consider 26 psychiatric patients from Iowa, where we compare to census data.

33 Iowa psychiatric patients (figure)

34 Census data for Iowa (figure)

35 Excess mortality for Iowa psychiatric patients (figure)

36 Excess mortality for Iowa psychiatric patients (figure)

37 Excess mortality

The cumulative excess mortality is difficult to interpret; the slope of the curve is more meaningful. The curve is relatively linear. If we consider age 10 to age 30, the curve goes from roughly 50 to 100, suggesting a slope of $(100 - 50)/(30 - 10) = 2.5$, so that patients aged 10 to 30 had roughly 2.5 times the hazard of the reference population. This is a fairly low-risk age group, for which suicide is a high risk factor. Note that the census data might include psychiatric patients who have committed suicide, so we might be comparing psychiatric patients to a general population that includes psychiatric patients, rather than to people who have not been psychiatric patients, and this might bias results.

38 Survival curves

You can use the reference distribution to inform the survival curve instead of just relying on the data. This results in an adjusted or corrected survival curve. Let $S^*(t) = \exp[-\Theta(t)]$ (or use the cumulative hazard based on multiplying the reference hazard by the excess hazard) and let $\hat{S}(t)$ be the standard Kaplan-Meier survival curve (using only the data, not the reference survival data). Then

$$S_c(t) = \hat{S}(t)/S^*(t)$$

is the corrected survival function. The estimate can be greater than 1, in which case it can be set to 1. Typically, $S^*(t)$ is less than 1, so dividing by this quantity increases the estimated survival probabilities. This is somewhat similar to the use of a prior in Bayesian statistics, using the reference survival times as a prior for what the psychiatric patients are likely to experience. Consequently, the adjusted survival curve is in between the Kaplan-Meier (data only) estimate and the reference survival times.
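A short sketch of this correction (my own illustration): given hypothetical vectors times and status and a hypothetical reference cumulative hazard function Theta(), the corrected curve divides the Kaplan-Meier estimate by the reference survival.

library(survival)
fit   <- survfit(Surv(times, status) ~ 1)   # Kaplan-Meier from the data alone
Sstar <- exp(-Theta(fit$time))              # reference survival S*(t)
Sc    <- pmin(fit$surv / Sstar, 1)          # corrected curve, capped at 1
plot(fit$time, Sc, type = "s", xlab = "time", ylab = "corrected survival")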

39 Survival curves (figure)

40 Survival curves (figure)

41 Bayesian nonparametric survival analysis

The previous example leads naturally to Bayesian nonparametric survival analysis. Here we have prior information (or prior beliefs) about the shape of the survival curve (such as a reference survival function). The survival curve based on this prior information is combined with the likelihood of the survival data to produce a posterior estimate of the survival function. Reasons for using a prior are: (1) to take advantage of prior information or the expertise of someone familiar with the type of data, and (2) to get a reasonable estimate when the sample size is small.

42 Bayesian survival analysis

In frequentist statistical methods, parameters are treated as fixed but unknown, and an estimator is chosen to estimate the parameters based on the data and a model (including model assumptions). Parameters are unknown but are treated as not being random. Philosophically, the Bayesian approach is to model all uncertainty using random variables. Uncertainty exists both in the data that would arise from a probability model and in the parameters of the model itself, so both observations and parameters are treated as random. Typically, the observations have a distribution that depends on the parameters, and the parameters themselves come from some other distribution. Bayesian models are therefore often hierarchical, often with multiple levels in the hierarchy.

43 Bayesian survival analysis

For survival analysis, we think of the (unknown) survival curve as the parameter. From a frequentist point of view, survival probabilities determine the probabilities of observing different death times, but there are no probabilities for the survival function itself. From a Bayesian point of view, you can imagine that there was some stochastic process generating survival curves according to some distribution on the space of survival curves. One of those survival curves happened to occur for the population we are studying. Once that survival function was chosen, event times could occur according to that survival curve.

44 Bayesian survival analysis (figure)

45 Bayesian survival analysis

We imagine that there is a true survival curve $S(t)$ and an estimated survival curve $\hat{S}(t)$. We define a loss function as

$$L(S, \hat{S}) = \int_0^\infty [\hat{S}(t) - S(t)]^2\, dt$$

The function $\hat{S}$ that minimizes the expected value of the loss function is the posterior mean, which is used to estimate the survival function.

46 A prior for survival curves

A typical way to assign a prior on the survival function is to use a Dirichlet process prior. For a Dirichlet process, we partition the real line into intervals $A_1, \ldots, A_k$, so that $P(X \in A_i) = W_i$. The numbers $(W_1, \ldots, W_k)$ have a $k$-dimensional Dirichlet distribution with parameters $\alpha_1, \ldots, \alpha_k$. For this to be a Dirichlet distribution, we must have $Z_i$, $i = 1, \ldots, k$, independent gamma random variables with shape parameters $\alpha_i$, and

$$W_i = \frac{Z_i}{\sum_{i=1}^k Z_i}$$

By construction, the random numbers $W_i$ are between 0 and 1 and sum to 1, so when interpreted as probabilities, they form a discrete probability distribution. Essentially, we can think of a Dirichlet distribution as a distribution on unfair dice with $k$ sides. We want to make a die that has $k$ sides, and we want the probabilities of each side to be randomly determined. How fair or unfair the die is depends partly on the $\alpha$ parameters and partly on chance itself.
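The gamma construction translates directly into R; a small sketch of my own:

# One draw from a Dirichlet(alpha) distribution via independent gammas
rdirichlet1 <- function(alpha) {
  z <- rgamma(length(alpha), shape = alpha, rate = 1)
  z / sum(z)   # W_i = Z_i / sum(Z_i): nonnegative and sums to 1
}

rdirichlet1(c(5, 5, 5))        # a fairly fair three-sided "die"
rdirichlet1(c(0.2, 0.2, 0.2))  # small alphas give very unfair dice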

47 A prior for survival curves

We can also think of the Dirichlet distribution as generalizing the beta distribution. A beta random variable is a number between 0 and 1; this number partitions the interval [0,1] into two pieces, $[0, x)$ and $[x, 1]$. A Dirichlet random variable partitions the interval into $k$ regions, using $k-1$ values between 0 and 1. The joint density for these $k-1$ values is

$$f(w_1, \ldots, w_{k-1}) = \frac{\Gamma(\alpha_1 + \cdots + \alpha_k)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_k)} \left[ \prod_{i=1}^{k-1} w_i^{\alpha_i - 1} \right] \left[ 1 - \sum_{i=1}^{k-1} w_i \right]^{\alpha_k - 1}$$

which reduces to a beta density with parameters $(\alpha_1, \alpha_2)$ when $k = 2$.

48 Assigning a prior

To assign a prior on the space of survival curves, first assume an average survival function, $S_0(t)$. The Dirichlet prior determines where the jumps occur, and the exponential curve gives the decay of the curve between jumps. Simulated survival curves when $S_0(t) = e^{-0.1t}$ and $\alpha = 5 S_0(t)$ are given below.

49 Bayesian survival analysis (figure)

50 Bayesian survival analysis

Other approaches are to put a prior on the cumulative hazard function and to use Gibbs sampling or Markov chain Monte Carlo. These topics would be more appropriate to cover after a class in Bayesian methods.

51 Chapter 7: Hypothesis testing

Hypothesis testing is typically done based on the cumulative hazard function. Here we'll use the Nelson-Aalen estimate of the cumulative hazard. The survival function is used to weight differences between the observed and expected cumulative hazard. Recall that the Nelson-Aalen estimate of the cumulative hazard is

$$\hat{H}(t) = \sum_{t_i \le t} \frac{d_i}{Y_i}$$

In a one-sample problem, you test whether the hazard rate $h(t)$ is equal to some reference hazard $h_0(t)$. The null hypothesis is $H_0: h(t) = h_0(t)$. Under the null hypothesis, the expected hazard rate at time $t_i$ is $h_0(t_i)$.

52 Hypothesis testing: one sample

The idea is then to compare observed minus expected cumulative hazard rates at the time $\tau$, the largest time in the study ($\tau = t_D$ if the largest time is a death time). The test statistic is

$$Z(\tau) = O(\tau) - E(\tau) = \sum_{i=1}^{D} W(t_i) \frac{d_i}{Y_i} - \int_0^\tau W(s)\, h_0(s)\, ds$$

where $W(\cdot)$ is a weight function. The variance is

$$V[Z(\tau)] = \int_0^\tau W^2(s) \frac{h_0(s)}{Y(s)}\, ds$$

53 Hypothesis testing

The expected value of $Z(\tau)$ is 0, so if we take a z-score of $Z(\tau)$ (subtracting the mean and dividing by the standard deviation), we get $Z(\tau)/\sqrt{V[Z(\tau)]}$, which has an approximate standard normal distribution. This can be used for either a two-sided or a one-sided test. For example, a one-sided test would be $H_1: h(t) > h_0(t)$, and you would reject only for large values of $Z(\tau)/\sqrt{V[Z(\tau)]}$.

54 Hypothesis testing

The most popular choice for a weight function is $W(t) = Y(t)$, which leads to

$$O(\tau) = \sum_{i=1}^{D} Y(t_i) \frac{d_i}{Y_i} = \sum_{i=1}^{D} d_i$$

This is also called the log-rank test (not sure why). Other weight functions are possible, for example

$$W(t) = Y(t)\, S_0(t)^p\, [1 - S_0(t)]^q$$

with $0 \le p, q \le 1$ (you don't necessarily need $q = 1 - p$ here). The choice of $p$ affects whether you care more about the hazard not matching the hypothesized hazard for small $t$ or large $t$. For example, if $p$ is large, then more emphasis is placed on the estimated hazard matching the null hazard for small values of $t$. $S_0(t)$ can be obtained from $S_0(t) = \exp[-H_0(t)]$.
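With $W(t) = Y(t)$ and $\tau$ taken as the largest time on study, $O(\tau)$ is the total number of deaths and $E(\tau) = V[Z(\tau)] = \sum_j H_0(X_j)$ summed over the subjects' follow-up times, so the test is easy to code. A minimal sketch of my own, with hypothetical inputs times, status, and a reference cumulative hazard function H0:

# One-sample log-rank test of H0: h(t) = h0(t), using W(t) = Y(t)
one.sample.logrank <- function(times, status, H0) {
  O <- sum(status)            # observed deaths
  E <- sum(H0(times))         # expected deaths; also the variance of O - E
  Z <- (O - E) / sqrt(E)
  c(Z = Z, p.two.sided = 2 * pnorm(-abs(Z)))
}

# Example: are the data consistent with a unit-rate exponential hazard?
one.sample.logrank(rexp(50), rep(1, 50), H0 = function(t) t)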

55 Hypothesis testing

An example where you would use the one-sided hypothesis test is in testing whether some population has a higher hazard than a reference population, such as the psychiatric patients from Iowa. Recall that for this example, we looked at excess mortality previously.

56 Hypothesis testing: two or more samples

If you have two or more samples (e.g., mortality for three different treatments or three different risk groups), then the null and alternative hypotheses are similar to those for ANOVA:

$$H_0: h_1(t) = h_2(t) = \cdots = h_K(t), \text{ for all } t \le \tau$$
$$H_A: h_i(t) \ne h_j(t) \text{ for some } i \ne j \text{ and some } t \le \tau$$

where $\tau$ is the largest time at which all of the groups have at least one subject at risk.

57 Hypothesis testing: two or more samples

We now define $t_i$ as the unique death times for the pooled data (i.e., ignoring the group that each observation comes from), and again $t_D$ is the largest death time. We observe $d_{ij}$ deaths at time $t_i$ in sample $j$, and there are $Y_{ij}$ individuals at risk at time $t_i$ in sample $j$. We let $d_i = \sum_{j=1}^K d_{ij}$ be the total number of deaths at time $t_i$ and $Y_i = \sum_{j=1}^K Y_{ij}$ be the total number of individuals at risk (available for death?) at time $t_i$.

58 Hypothesis testing: two or more samples

The idea for testing the hypothesis is that under the null hypothesis, the estimate of the hazard (and cumulative hazard) should be the same (in expectation) using the pooled data (ignoring which group the samples are from) and for the individual samples. We can think of the pooled data as providing a more precise estimate of the hazard for the $j$th sample than the $j$th sample itself, so using the idea of observed minus expected, we can write

$$Z_j(\tau) = \sum_{i=1}^{D} W_j(t_i) \left( \frac{d_{ij}}{Y_{ij}} - \frac{d_i}{Y_i} \right), \quad j = 1, \ldots, K$$

If all of the $Z_j(\tau)$ terms are close to 0, then all of the sample estimated cumulative hazards are close to the pooled cumulative hazard, so they all must be close to each other, and this supports the null hypothesis.

59 Hypothesis testing: two or more samples

The typical weight function used is $W_j(t_i) = Y_{ij} W(t_i)$, where $W(t_i)$ is a common weight shared by each group. For this weighting scheme,

$$Z_j(\tau) = \sum_{i=1}^{D} W(t_i) \left( d_{ij} - Y_{ij} \frac{d_i}{Y_i} \right), \quad j = 1, \ldots, K$$

$$V[Z_j(\tau)] = \hat{\sigma}_{jj} = \sum_{i=1}^{D} W(t_i)^2 \frac{Y_{ij}}{Y_i} \left( 1 - \frac{Y_{ij}}{Y_i} \right) \left( \frac{Y_i - d_i}{Y_i - 1} \right) d_i, \quad j = 1, \ldots, K$$

$$\text{cov}(Z_j(\tau), Z_k(\tau)) = \hat{\sigma}_{jk} = -\sum_{i=1}^{D} W(t_i)^2 \frac{Y_{ij}}{Y_i} \frac{Y_{ik}}{Y_i} \left( \frac{Y_i - d_i}{Y_i - 1} \right) d_i, \quad j \ne k$$

60 Hypothesis testing: two or more samples

Based on the second formula for $Z_j(\tau)$, the sum $\sum_{j=1}^K Z_j(\tau)$ is equal to 0, meaning that the $Z_j(\tau)$ are not independent of one another. In particular, $Z_K(\tau)$ is a linear combination of $Z_1(\tau), \ldots, Z_{K-1}(\tau)$. Consequently, we construct a test statistic based on just the first $K-1$ of the $Z_j(\tau)$ terms:

$$\chi^2 = (Z_1(\tau), \ldots, Z_{K-1}(\tau))\, \Sigma^{-1}\, (Z_1(\tau), \ldots, Z_{K-1}(\tau))'$$

where $(Z_1(\tau), \ldots, Z_{K-1}(\tau))$ is interpreted as a row vector and $\Sigma$ is the $(K-1) \times (K-1)$ covariance matrix (if you had made a $K \times K$ matrix using all the variables, it wouldn't be full rank, and therefore not invertible). The $\chi^2$ statistic has $K-1$ degrees of freedom, and you can base the test on this distribution.

61 Hypothesis testing: two samples

Several weight functions are possible. $W(t) = 1$ for all $t$ leads to the two-sample log-rank test. $W(t_i) = Y_i$ and $W(t_i) = \sqrt{Y_i}$ have also been used. In the case of $K = 2$ samples, the test statistic can be written as

$$Z = \frac{\sum_{i=1}^{D} W(t_i) \left[ d_{i1} - Y_{i1} \left( \frac{d_i}{Y_i} \right) \right]}{\sqrt{\sum_{i=1}^{D} W(t_i)^2 \frac{Y_{i1}}{Y_i} \left( 1 - \frac{Y_{i1}}{Y_i} \right) \left( \frac{Y_i - d_i}{Y_i - 1} \right) d_i}}$$

Since we don't have to square in this case, we can do one-sided as well as two-sided hypothesis tests based on a standard normal distribution instead of a $\chi^2$, or you can square the statistic and use a $\chi^2_1$ distribution.

62 Hypothesis testing: two samples (figure)

63 Hypothesis testing: two samples

This example was kidney dialysis patients with surgically implanted catheters versus percutaneous (needle-puncture) placement of catheters. Even though the survival curves look fairly different after 1 year or so, the differences are not statistically significant. Note that there are also very few observations for the percutaneous sample. Actually, the number of observations is fairly small for both samples, so the confidence intervals would be fairly wide.

64 Hypothesis testing: two samples (figure)

65 Hypothesis testing: two samples (figure)

66 Hypothesis testing: two samples

Different choices for the weight function affect the p-value. It is reassuring if a lot of weighting schemes give the same conclusion. The cases where the p-value was low were those where the weighting scheme gave a lot of weight to differences in the hazard for large values of $t_i$, which of course is where the curves appear different. This can also be sensitive to differences in censoring patterns in the two samples, so it should be used cautiously. A problem with using lots of weighting schemes is that, when different weights conflict, you might report only the weighting schemes that give the results you want. This would be dishonest, so you should either pick a weighting scheme in advance and stick to it, or report the results of all the different weighting schemes that you used.

67 Hypothesis testing: weight functions (figure)

68 Hypothesis testing: weight functions

The most common weight functions are either flat, $W(t_i) = 1$, or decreasing, with $W(t_i) = Y_i$. A weight function that is increasing might be used to compare longer-term survival when early survival might be due to complications rather than the long-term effectiveness of a treatment. An example is comparing autologous transplants versus allogeneic transplants of bone marrow for leukemia. Allogeneic transplant patients (receiving bone marrow from a sibling) tend to have more complications early on, reducing early survival rates (and increasing early hazard rates), but if interest is in long-term survival, then a weight function could be used that emphasizes later times.

69 Hypothesis testing in R

To test the difference in survival curves in R, you can use survdiff() from the survival library. An example is with the allo- versus auto- patients in the leukemia data.

> library(survival)
> x <- read.table("leukemia2.txt")
> a <- survdiff(Surv(x$V1, x$V2) ~ factor(x$V3))
> a
Call:
survdiff(formula = Surv(x$V1, x$V2) ~ factor(x$V3))

                N Observed Expected (O-E)^2/E (O-E)^2/V
factor(x$V3)=1 ...
factor(x$V3)=2 ...

 Chisq= 0.4  on 1 degrees of freedom, p= ...

The results suggest that the two groups had survival experiences that were not statistically significantly different from each other.

70 Hypothesis testing in R

To plot the two survival curves together you can use

> x <- read.table("leukemia2.txt")
> a <- survfit(Surv(x$V1[x$V3==1], x$V2[x$V3==1]) ~ 1)
> b <- survfit(Surv(x$V1[x$V3==2], x$V2[x$V3==2]) ~ 1)
> plot(a, conf.int = FALSE)
> points(b$time, b$surv, type = "s", col = "red", lwd = 3)
> legend(20, 1, legend = c("auto", "allo"), col = c("black", "red"),
+        lty = c(1, 1), lwd = c(1, 3), cex = 1.3)

71 Hypothesis testing in R (figure)

72 Hypothesis testing in R

The survdiff() function in R has an optional parameter rho whose default is 0, which results in the log-rank test. Nonzero values of rho weight each event time by the estimated survival raised to the power rho (so rho = 1 puts more weight on earlier times, where the estimated survival is larger) and can have a big impact on the p-value.
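For instance, the earlier leukemia comparison could be rerun with the Peto-Peto style weighting (a hypothetical variation on the example above):

> survdiff(Surv(x$V1, x$V2) ~ factor(x$V3), rho = 1)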

73 Tests of trend

For multiple samples ($K > 2$), a different alternative hypothesis is the following:

$$H_A: h_1(t) \le h_2(t) \le \cdots \le h_K(t) \text{ for } t \le \tau, \text{ where at least one inequality is strict}$$

This is equivalent to

$$H_A: S_1(t) \ge S_2(t) \ge \cdots \ge S_K(t)$$

74 Tests of trend

We construct the $Z_j(\tau)$'s as before, using any weight functions $W_j(t_i)$. We also pick a new set of weights $a_j$, $j = 1, \ldots, K$, where $a_j = j$ is often used. The test statistic is now

$$Z = \frac{\sum_{j=1}^K a_j Z_j(\tau)}{\sqrt{\sum_{j=1}^K \sum_{k=1}^K a_j a_k \hat{\sigma}_{jk}}}$$

where $\hat{\Sigma} = (\hat{\sigma}_{jk})$ is the $K \times K$ covariance matrix. (It isn't full rank, but we don't need the inverse.) The test statistic can be compared to a standard normal.

75 Tests of trend (figure)

76 Stratified tests

If different populations have different covariates (age, sex, etc.), then ideally you could use a regression approach to survival analysis to adjust for covariates before comparing survival curves or hazard rates. This is done in Chapter 8. If there is a small number of levels for a predictor, then you can use a stratified test instead. Let

$$H_0: h_{1s}(t) = h_{2s}(t) = \cdots = h_{Ks}(t), \quad s = 1, \ldots, M, \; t < \tau$$

The idea is that for each level of the covariate (indexed by $s$), the hazard rate should be the same. Typically, $M$ is small.

77 Stratified tests

For the stratified test, let

$$Z_{j\cdot}(\tau) = \sum_{s=1}^M Z_{js}(\tau), \qquad \hat{\sigma}_{jk\cdot} = \sum_{s=1}^M \hat{\sigma}_{jks}$$

Then the test statistic is as before with multiple samples:

$$(Z_{1\cdot}(\tau), \ldots, Z_{K-1,\cdot}(\tau))\, \Sigma^{-1}\, (Z_{1\cdot}(\tau), \ldots, Z_{K-1,\cdot}(\tau))'$$

which is approximately $\chi^2$ with $K-1$ degrees of freedom. Here we have $K$ samples and $M$ strata within each sample.
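In R, a stratified test can be run with survdiff() by adding a strata() term; a small sketch with hypothetical column names:

> # compare groups while stratifying on sex
> survdiff(Surv(time, status) ~ group + strata(sex), data = dat)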

78 Renyi type tests

For a two-sample problem, if the hazard functions cross, then the previous tests might not detect much overall difference in the hazard rates. The overall survival experience might be similar, but it could be different in the short term and different in the long term. If one group is more at risk in the short term and the other in the long term, these changes of direction could cancel out, leading one not to reject the hypothesis that the hazards are the same. Renyi-type tests are based on the maximum absolute value of the differences between cumulative hazard rates rather than the summed differences. The idea is similar to the Kolmogorov-Smirnov test for comparing two distributions, which uses the largest absolute value of the difference between the two empirical CDFs, but Renyi tests allow for censoring.

79 Renyi type tests

To construct this test, let

$$Z(t_i) = \sum_{t_k \le t_i} W(t_k) \left[ d_{k1} - Y_{k1} \left( \frac{d_k}{Y_k} \right) \right], \quad i = 1, \ldots, D$$

where as usual $d_k = d_{k1} + d_{k2}$ and $Y_k = Y_{k1} + Y_{k2}$ (i.e., $d_k$ and $Y_k$ are the pooled number of deaths and number at risk at time $t_k$ over both samples). The variance of $Z(\tau)$ is

$$\sigma^2(\tau) = \sum_{t_k \le \tau} W(t_k)^2 \left( \frac{Y_{k1}}{Y_k} \right) \left( \frac{Y_{k2}}{Y_k} \right) \left( \frac{Y_k - d_k}{Y_k - 1} \right) d_k$$

where $\tau$ is the largest death time $t_k$ with $Y_{k1}, Y_{k2} > 0$.

80 Renyi type tests

The test statistic is

$$Q = \sup\{ |Z(t)|, t \le \tau \} / \sigma(\tau)$$

You can think of the supremum here as just the maximum of the absolute values of the $Z(t_i)$ values. Critical values are given in the Appendix, Table C.5, and are based on the theory of Brownian motion.

81 Renyi type tests (figure)

82 Renyi type tests: finding the maximum $Z(t_j)$ (figure)


84 Renyi type tests

The maximum occurs at 315 days, with the maximum value being 9.8. The p-value (based on Table C.5) is 0.053, which is not significant at $\alpha = 0.05$, but this still gives more signal for the curves being different than the log-rank test, which gives a larger p-value.

85 Testing based on a fixed point in time

Instead of testing survival and hazard rates over all time points, you might be interested in, say, the 1-year survival rate. Note that the time being tested should be chosen before doing the test. If you look at two survival curves and say, "Wow, they look really different at year 3; is that significant?", then the p-value will be biased too low. It is similar to testing at many time points but not adjusting for multiple comparisons. In practice, this is what happens all the time, though. People look at a graph of the data, which is maybe meant to be descriptive, something jumps out at them as unusual, and they say, "Wow, is that significant?" It's extremely difficult to answer this type of question. A better approach in this type of case might be the Renyi type of test, because it accounts for the fact that you are looking at maximum differences over the entire time frame.

86 Testing based on a fixed point in time

Here we want to test

$$H_0: S_1(t_0) = S_2(t_0) \quad \text{versus} \quad H_A: S_1(t_0) \ne S_2(t_0)$$

for two survival curves. (The method can be generalized to more survival curves.) The test statistic is

$$Z = \frac{\hat{S}_1(t_0) - \hat{S}_2(t_0)}{\sqrt{\hat{V}[\hat{S}_1(t_0)] + \hat{V}[\hat{S}_2(t_0)]}}$$

which has an approximate standard normal distribution for large samples.
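A minimal sketch of this test in R (my own illustration, with hypothetical vectors time, status, and a two-level group): summary.survfit() returns the Kaplan-Meier estimates and their standard errors at the chosen time.

library(survival)
fixed.time.test <- function(time, status, group, t0) {
  fit <- survfit(Surv(time, status) ~ group)
  s <- summary(fit, times = t0)   # KM estimate and std. error per group at t0
  Z <- diff(s$surv) / sqrt(sum(s$std.err^2))
  c(Z = Z, p.two.sided = 2 * pnorm(-abs(Z)))
}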

87 Testing based on a fixed point in time

If you want to test multiple fixed time points, such as the 1-year and 5-year survival rates, then you should adjust for multiple comparisons. For testing two time points, a Bonferroni adjustment could be made, meaning that you reject each hypothesis only if its p-value is less than $\alpha/2$. The more time points you check, the less power you will have to find significant differences.

88 Bonferroni adjustments

Probably the most popular, and simplest, adjustment to make for multiple testing is the Bonferroni adjustment. The idea is that to have $k$ tests at overall level $\alpha$ (meaning that if the null hypotheses are true for all $k$ tests, there is only an $\alpha$ chance of making an error on any one of them), you use a level of $\alpha/k$ for each test. What is the rationale for doing this?

89 Bonferroni adjustments

There are several ways to justify Bonferroni adjustments. One is to look at the expected number of false positives under the null. Let $X_i = 1$ if you make an incorrect decision on test $i$ (a false positive), and otherwise $X_i = 0$. What type of variable is $X_i$? What is the probability that $X_i = 1$ if the null hypothesis (for experiment $i$) is true? What is the expected value of $X_i$?

90 Bonferroni adjustments

$X_i$ as defined previously is Bernoulli with $p = \alpha$ if testing at level $\alpha$. The expected value of a Bernoulli($p$) random variable is $p$ (why?), so the expected value of $X_i$ is $\alpha$. If you do $k$ experiments, the expected number of false positives is

$$E\left[ \sum_{i=1}^k X_i \right] = k\alpha$$

However, if you test at the $\alpha/k$ level, then the expected number of false positives is $\alpha$. Thus, the Bonferroni adjustment controls the expected number of false positives.
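A quick simulation check of this (my own illustration): under true nulls, p-values are uniform, so with $k = 20$ tests at level 0.05 the expected number of false positives is 1, while at level $0.05/20$ it drops to 0.05.

set.seed(1)
k <- 20; alpha <- 0.05; reps <- 10000
p <- matrix(runif(reps * k), reps, k)  # p-values are Uniform(0,1) under H0
mean(rowSums(p < alpha))               # close to k * alpha = 1
mean(rowSums(p < alpha / k))           # close to alpha = 0.05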

91 Bonferroni adjustments

Another approach is to use Bonferroni's inequality. Let $A_i$ be the event that you don't reject the null hypothesis on test $i$. Suppose we set $P(A_i) = 1 - \alpha/k$ when the null is true. From the inclusion-exclusion formula,

$$P(A_1 \cap A_2) = P(A_1) + P(A_2) - P(A_1 \cup A_2) \ge P(A_1) + P(A_2) - 1$$

If we apply the formula again, setting $B = A_1 \cap A_2$, we get

$$P(A_1 \cap A_2 \cap A_3) \ge [P(A_1) + P(A_2) - 1] + P(A_3) - 1 = P(A_1) + P(A_2) + P(A_3) - 2$$

In general, for $k$ events,

$$P(A_1 \cap \cdots \cap A_k) \ge \sum_{i=1}^k P(A_i) - (k-1)$$

92 Bonferroni adjustments

If $P(A_i) = 1 - \alpha/k$, then we get

$$P(A_1 \cap \cdots \cap A_k) \ge k \left( 1 - \frac{\alpha}{k} \right) - (k - 1) = 1 - \alpha$$

Thus, the probability of all decisions being correct is at least $1 - \alpha$, and the probability of making any wrong decision is at most $\alpha$.

93 Bonferroni adjustments

Bonferroni's inequality can be useful in other probabilistic arguments as well.


More information

Lecture 3. G. Cowan. Lecture 3 page 1. Lectures on Statistical Data Analysis

Lecture 3. G. Cowan. Lecture 3 page 1. Lectures on Statistical Data Analysis Lecture 3 1 Probability (90 min.) Definition, Bayes theorem, probability densities and their properties, catalogue of pdfs, Monte Carlo 2 Statistical tests (90 min.) general concepts, test statistics,

More information

Lectures on Statistics. William G. Faris

Lectures on Statistics. William G. Faris Lectures on Statistics William G. Faris December 1, 2003 ii Contents 1 Expectation 1 1.1 Random variables and expectation................. 1 1.2 The sample mean........................... 3 1.3 The sample

More information

Checking for Prior-Data Conflict

Checking for Prior-Data Conflict Bayesian Analysis (2006) 1, Number 4, pp. 893 914 Checking for Prior-Data Conflict Michael Evans and Hadas Moshonov Abstract. Inference proceeds from ingredients chosen by the analyst and data. To validate

More information

TDA231. Logistic regression

TDA231. Logistic regression TDA231 Devdatt Dubhashi dubhashi@chalmers.se Dept. of Computer Science and Engg. Chalmers University February 19, 2016 Some data 5 x2 0 5 5 0 5 x 1 In the Bayes classifier, we built a model of each class

More information

The Design of a Survival Study

The Design of a Survival Study The Design of a Survival Study The design of survival studies are usually based on the logrank test, and sometimes assumes the exponential distribution. As in standard designs, the power depends on The

More information

Econ 582 Nonparametric Regression

Econ 582 Nonparametric Regression Econ 582 Nonparametric Regression Eric Zivot May 28, 2013 Nonparametric Regression Sofarwehaveonlyconsideredlinearregressionmodels = x 0 β + [ x ]=0 [ x = x] =x 0 β = [ x = x] [ x = x] x = β The assume

More information

Learning Sequence Motif Models Using Expectation Maximization (EM) and Gibbs Sampling

Learning Sequence Motif Models Using Expectation Maximization (EM) and Gibbs Sampling Learning Sequence Motif Models Using Expectation Maximization (EM) and Gibbs Sampling BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 009 Mark Craven craven@biostat.wisc.edu Sequence Motifs what is a sequence

More information

arxiv: v1 [physics.data-an] 3 Jun 2008

arxiv: v1 [physics.data-an] 3 Jun 2008 arxiv:0806.0530v [physics.data-an] 3 Jun 008 Averaging Results with Theory Uncertainties F. C. Porter Lauritsen Laboratory for High Energy Physics California Institute of Technology Pasadena, California

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is

More information

COS513 LECTURE 8 STATISTICAL CONCEPTS

COS513 LECTURE 8 STATISTICAL CONCEPTS COS513 LECTURE 8 STATISTICAL CONCEPTS NIKOLAI SLAVOV AND ANKUR PARIKH 1. MAKING MEANINGFUL STATEMENTS FROM JOINT PROBABILITY DISTRIBUTIONS. A graphical model (GM) represents a family of probability distributions

More information

Ronald Christensen. University of New Mexico. Albuquerque, New Mexico. Wesley Johnson. University of California, Irvine. Irvine, California

Ronald Christensen. University of New Mexico. Albuquerque, New Mexico. Wesley Johnson. University of California, Irvine. Irvine, California Texts in Statistical Science Bayesian Ideas and Data Analysis An Introduction for Scientists and Statisticians Ronald Christensen University of New Mexico Albuquerque, New Mexico Wesley Johnson University

More information

Master s Written Examination

Master s Written Examination Master s Written Examination Option: Statistics and Probability Spring 016 Full points may be obtained for correct answers to eight questions. Each numbered question which may have several parts is worth

More information

A Re-Introduction to General Linear Models (GLM)

A Re-Introduction to General Linear Models (GLM) A Re-Introduction to General Linear Models (GLM) Today s Class: You do know the GLM Estimation (where the numbers in the output come from): From least squares to restricted maximum likelihood (REML) Reviewing

More information

EXAMINATIONS OF THE ROYAL STATISTICAL SOCIETY

EXAMINATIONS OF THE ROYAL STATISTICAL SOCIETY EXAMINATIONS OF THE ROYAL STATISTICAL SOCIETY GRADUATE DIPLOMA, 00 MODULE : Statistical Inference Time Allowed: Three Hours Candidates should answer FIVE questions. All questions carry equal marks. The

More information

Class 26: review for final exam 18.05, Spring 2014

Class 26: review for final exam 18.05, Spring 2014 Probability Class 26: review for final eam 8.05, Spring 204 Counting Sets Inclusion-eclusion principle Rule of product (multiplication rule) Permutation and combinations Basics Outcome, sample space, event

More information

Conditional probabilities and graphical models

Conditional probabilities and graphical models Conditional probabilities and graphical models Thomas Mailund Bioinformatics Research Centre (BiRC), Aarhus University Probability theory allows us to describe uncertainty in the processes we model within

More information

Gaussian Quiz. Preamble to The Humble Gaussian Distribution. David MacKay 1

Gaussian Quiz. Preamble to The Humble Gaussian Distribution. David MacKay 1 Preamble to The Humble Gaussian Distribution. David MacKay Gaussian Quiz H y y y 3. Assuming that the variables y, y, y 3 in this belief network have a joint Gaussian distribution, which of the following

More information

Markov Chain Monte Carlo (MCMC)

Markov Chain Monte Carlo (MCMC) Markov Chain Monte Carlo (MCMC Dependent Sampling Suppose we wish to sample from a density π, and we can evaluate π as a function but have no means to directly generate a sample. Rejection sampling can

More information

Introduction to Applied Bayesian Modeling. ICPSR Day 4

Introduction to Applied Bayesian Modeling. ICPSR Day 4 Introduction to Applied Bayesian Modeling ICPSR Day 4 Simple Priors Remember Bayes Law: Where P(A) is the prior probability of A Simple prior Recall the test for disease example where we specified the

More information

Chapter 6. Logistic Regression. 6.1 A linear model for the log odds

Chapter 6. Logistic Regression. 6.1 A linear model for the log odds Chapter 6 Logistic Regression In logistic regression, there is a categorical response variables, often coded 1=Yes and 0=No. Many important phenomena fit this framework. The patient survives the operation,

More information

Statistical Methods for Astronomy

Statistical Methods for Astronomy Statistical Methods for Astronomy If your experiment needs statistics, you ought to have done a better experiment. -Ernest Rutherford Lecture 1 Lecture 2 Why do we need statistics? Definitions Statistical

More information

Contents. Acknowledgments. xix

Contents. Acknowledgments. xix Table of Preface Acknowledgments page xv xix 1 Introduction 1 The Role of the Computer in Data Analysis 1 Statistics: Descriptive and Inferential 2 Variables and Constants 3 The Measurement of Variables

More information

1 Probabilities. 1.1 Basics 1 PROBABILITIES

1 Probabilities. 1.1 Basics 1 PROBABILITIES 1 PROBABILITIES 1 Probabilities Probability is a tricky word usually meaning the likelyhood of something occuring or how frequent something is. Obviously, if something happens frequently, then its probability

More information

TMA 4275 Lifetime Analysis June 2004 Solution

TMA 4275 Lifetime Analysis June 2004 Solution TMA 4275 Lifetime Analysis June 2004 Solution Problem 1 a) Observation of the outcome is censored, if the time of the outcome is not known exactly and only the last time when it was observed being intact,

More information

Fitting a Straight Line to Data

Fitting a Straight Line to Data Fitting a Straight Line to Data Thanks for your patience. Finally we ll take a shot at real data! The data set in question is baryonic Tully-Fisher data from http://astroweb.cwru.edu/sparc/btfr Lelli2016a.mrt,

More information

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo Group Prof. Daniel Cremers 10a. Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative is Markov Chain

More information

STAT 461/561- Assignments, Year 2015

STAT 461/561- Assignments, Year 2015 STAT 461/561- Assignments, Year 2015 This is the second set of assignment problems. When you hand in any problem, include the problem itself and its number. pdf are welcome. If so, use large fonts and

More information