In many applications it is necessary to compare two competing methods (for example, to compare the treatment effects of a standard drug and an experimental drug). To compare two methods from a statistical point of view:
- Select a statistical model for each method.
- Collect a sample of data for each method.
- Estimate the parameters of the two models.
- Quantify the difference between the two models by contrasting the estimated parameters.
Standard techniques for hypothesis testing can be adapted to compare the data from the two samples.
Assume that we have two samples, A and B: $Y_1^A, \ldots, Y_{n_A}^A$ i.i.d. $f_Y(\theta_A)$ and $Y_1^B, \ldots, Y_{n_B}^B$ i.i.d. $f_Y(\theta_B)$. To compare these two samples, we can define $\delta_{A,B} = \theta_A - \theta_B$. Using estimators $\hat\theta_A$ and $\hat\theta_B$ of $\theta_A$ and $\theta_B$, we define $\hat\delta_{A,B} = \hat\theta_A - \hat\theta_B$. The variance of $\hat\delta_{A,B}$ is
$$\mathrm{Var}(\hat\delta_{A,B}) = \mathrm{Var}(\hat\theta_A) + \mathrm{Var}(\hat\theta_B) - 2\,\mathrm{Cov}(\hat\theta_A, \hat\theta_B).$$
If the two samples are independent of each other, then $\mathrm{Cov}(\hat\theta_A, \hat\theta_B) = 0$ and $\mathrm{Var}(\hat\delta_{A,B}) = \mathrm{Var}(\hat\theta_A) + \mathrm{Var}(\hat\theta_B)$. In this case, $\hat\theta_A$ and $\hat\theta_B$ can be the maximum likelihood estimators of $\theta_A$ and $\theta_B$, and asymptotic normality of $\hat\theta_A$ and $\hat\theta_B$ implies asymptotic normality of $\hat\delta_{A,B}$. If there is dependence across samples, the covariance term must be included. Such dependence arises, for example, when comparing the treatment effects of two drugs, A and B, if the same patient is administered drug A for the first month and then drug B for the second month.
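A small simulation can illustrate why the covariance term matters. This is a hedged sketch with hypothetical numbers (a shared per-patient baseline induces positive correlation between the two sample means, which shrinks the variance of the difference):

```python
import random
import statistics

random.seed(1)
n, reps = 50, 2000
diffs_paired, diffs_indep = [], []
for _ in range(reps):
    base = [random.gauss(0, 1) for _ in range(n)]                 # latent patient baselines
    a = [b + random.gauss(0.0, 0.5) for b in base]                # drug A on patient i
    b_paired = [b + random.gauss(0.3, 0.5) for b in base]         # drug B on the SAME patients
    b_indep = [random.gauss(0, 1) + random.gauss(0.3, 0.5)
               for _ in range(n)]                                 # drug B on new patients
    m_a = statistics.mean(a)
    diffs_paired.append(m_a - statistics.mean(b_paired))
    diffs_indep.append(m_a - statistics.mean(b_indep))

# Positive Cov(theta_hat_A, theta_hat_B) reduces Var(delta_hat):
print(statistics.variance(diffs_paired) < statistics.variance(diffs_indep))  # True
```

Here the paired design cancels the shared baseline, so the difference of means varies far less across replications than under independent sampling.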
Example 1 (faults on data lines): Assume we collect data on the number of faults for lines of length 22 km (sample A) and for lines of length 170 km (sample B). The sample sizes are $n_A = 40$ and $n_B = 17$ lines, and the total numbers of faults are $\sum_{i=1}^{40} y_i^A = 10$ and $\sum_{i=1}^{17} y_i^B = 41$.
Probability model: $Y_i^A \sim \mathrm{Pois}(\lambda_A)$ and $Y_i^B \sim \mathrm{Pois}(\lambda_B)$.
ML estimators in the Poisson model: $\hat\lambda_A = \frac{1}{n_A}\sum_{i=1}^{n_A} Y_i^A$, $\hat\lambda_B = \frac{1}{n_B}\sum_{i=1}^{n_B} Y_i^B$.
ML estimates (for the observed data): $\hat\lambda_A = \frac{1}{40}\sum_{i=1}^{40} y_i^A = 0.25$, $\hat\lambda_B = \frac{1}{17}\sum_{i=1}^{17} y_i^B = 2.41$.
Estimates of the variance of the MLEs: $\widehat{\mathrm{Var}}(\hat\lambda_A) = \hat\lambda_A/40 = 0.00625$, $\widehat{\mathrm{Var}}(\hat\lambda_B) = \hat\lambda_B/17 = 0.14176$.
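These estimates can be checked with a few lines of Python (a sketch using the counts reported above):

```python
n_A, total_A = 40, 10    # 22 km lines: 40 lines, 10 faults in total
n_B, total_B = 17, 41    # 170 km lines: 17 lines, 41 faults in total

lam_A = total_A / n_A    # Poisson MLE is the sample mean
lam_B = total_B / n_B

var_A = lam_A / n_A      # estimated variance of the MLE: lambda_hat / n
var_B = lam_B / n_B

print(round(lam_A, 2), round(lam_B, 2))   # 0.25 2.41
print(round(var_A, 5))                    # 0.00625
# var_B ~= 0.14187; the slides report 0.14176 because they round
# lambda_hat_B to 2.41 before dividing by 17.
```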
Example 1 (faults on data lines, contd): Assume we collect data on the number of faults for lines of length 22 km (sample A) and for lines of length 170 km (sample B).
Properties of $\hat\delta_{A,B} = \hat\theta_A - \hat\theta_B$:
$$E(\hat\delta_{A,B}) = E(\hat\theta_A) - E(\hat\theta_B) = \theta_A - \theta_B = \delta_{A,B},$$
$$\mathrm{Var}(\hat\delta_{A,B}) = \mathrm{Var}(\hat\theta_A) + \mathrm{Var}(\hat\theta_B),$$
$$\widehat{\mathrm{Var}}(\hat\delta_{A,B}) = \widehat{\mathrm{Var}}(\hat\lambda_A) + \widehat{\mathrm{Var}}(\hat\lambda_B) = 0.00625 + 0.14176 = 0.14801.$$
Exact distribution of $\hat\delta_{A,B}$:
$$\hat\delta_{A,B} \sim \frac{P_A}{n_A} - \frac{P_B}{n_B}, \quad \text{where } P_A \sim \mathrm{Pois}(n_A\theta_A),\; P_B \sim \mathrm{Pois}(n_B\theta_B).$$
Example 1 (faults on data lines, contd): Assume we collect data on the number of faults for lines of length 22 km (sample A) and for lines of length 170 km (sample B). The approximate distribution of $\hat\delta_{A,B}$ is normal: $\hat\delta_{A,B} \approx N(\delta_{A,B}, \mathrm{Var}(\hat\delta_{A,B}))$. The approximate $100(1-\alpha)\%$ CI for $\delta_{A,B}$ is
$$\left(\hat\delta_{A,B} - z_{1-0.5\alpha}\sqrt{\widehat{\mathrm{Var}}(\hat\delta_{A,B})},\;\; \hat\delta_{A,B} + z_{1-0.5\alpha}\sqrt{\widehat{\mathrm{Var}}(\hat\delta_{A,B})}\right).$$
With $\alpha = 0.05$, $z_{1-0.5\alpha} = 1.96$. Taking the difference in the direction B minus A, $\hat\lambda_B - \hat\lambda_A = 2.16$, the 95% CI is $2.16 \pm 1.96 \cdot 0.385 = (1.41, 2.91)$. This interval does not contain zero, and therefore we are 95% confident that 170 km lines have more faults per line than 22 km lines.
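The interval above can be reproduced directly from the counts (a sketch; small rounding differences from the slides come from rounding intermediate values):

```python
import math

lam_A, lam_B = 10 / 40, 41 / 17           # Poisson MLEs from Example 1
var_hat = lam_A / 40 + lam_B / 17         # Var(lam_A_hat) + Var(lam_B_hat)

d_hat = lam_B - lam_A                     # difference B minus A, ~= 2.16
half = 1.96 * math.sqrt(var_hat)          # 95% normal-approximation half-width
lo, hi = d_hat - half, d_hat + half

print(round(lo, 2), round(hi, 2))         # 1.41 2.92 (slides: (1.41, 2.91))
print(lo > 0)                             # True: zero is excluded
```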
CLICKER QUESTION 1
Example 1 (faults on data lines, contd): Assume we collect data on the number of faults for lines of length 22 km (sample A) and for lines of length 170 km (sample B). Assume now we are interested in comparing the number of faults per km for the two samples. What quantity do we need to estimate?
A. $\lambda_B/\lambda_A$
B. $\lambda_B - \lambda_A$
C. $\lambda_B/170 + \lambda_A/22$
D. $\lambda_B/170 - \lambda_A/22$
E. None of the above
Example 1 (faults on data lines, contd): Assume we collect data on the number of faults for lines of length 22 km (sample A) and for lines of length 170 km (sample B). To compare the number of faults per km for the two samples, we consider $\delta_{A,B} = \lambda_B/170 - \lambda_A/22$:
$$\hat\delta_{A,B} = \frac{2.41}{170} - \frac{0.25}{22} = 0.01419 - 0.01136 = 0.00283 \text{ (faults/km)},$$
$$\widehat{\mathrm{Var}}(\hat\delta_{A,B}) = \frac{1}{170^2}\widehat{\mathrm{Var}}(\hat\lambda_B) + \frac{1}{22^2}\widehat{\mathrm{Var}}(\hat\lambda_A) = 1.7818 \times 10^{-5}.$$
The 95% CI for $\delta_{A,B}$ is $\hat\delta_{A,B} \pm 1.96\sqrt{\widehat{\mathrm{Var}}(\hat\delta_{A,B})} = (-0.0054, 0.0111)$ (faults/km). Since this interval contains zero, we cannot reject $H_0: \delta_{A,B} = 0$ in favor of $H_a: \delta_{A,B} \neq 0$.
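The per-km calculation can be sketched the same way (tiny rounding differences from the slides again come from rounded intermediates):

```python
import math

lam_A, lam_B = 10 / 40, 41 / 17
var_A, var_B = lam_A / 40, lam_B / 17

d_hat = lam_B / 170 - lam_A / 22                  # faults per km, B minus A
var_hat = var_B / 170**2 + var_A / 22**2
half = 1.96 * math.sqrt(var_hat)
lo, hi = d_hat - half, d_hat + half

print(round(lo, 4), round(hi, 4))                 # -0.0055 0.0111
print(lo < 0 < hi)                                # True: cannot reject H0
```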
Comparing the means of two normal distributions. Assume that we have two independent normal samples $Z_1^A, \ldots, Z_{n_A}^A$ i.i.d. $N(\mu_A, \sigma_A^2)$ and $Z_1^B, \ldots, Z_{n_B}^B$ i.i.d. $N(\mu_B, \sigma_B^2)$. Let $\delta_{A,B} = \mu_A - \mu_B$ and define $\hat\delta_{A,B} = \bar Z^A - \bar Z^B$. It follows that
$$\hat\delta_{A,B} \sim N\!\left(\delta_{A,B},\; \frac{\sigma_A^2}{n_A} + \frac{\sigma_B^2}{n_B}\right), \qquad \frac{\hat\delta_{A,B} - \delta_{A,B}}{\sqrt{\frac{\sigma_A^2}{n_A} + \frac{\sigma_B^2}{n_B}}} \sim N(0, 1).$$
Assuming $\sigma_A^2$ and $\sigma_B^2$ are known, the $100(1-\alpha)\%$ CI for $\delta_{A,B}$ is
$$\hat\delta_{A,B} - z_{1-0.5\alpha}\sqrt{\frac{\sigma_A^2}{n_A} + \frac{\sigma_B^2}{n_B}} < \delta_{A,B} < \hat\delta_{A,B} + z_{1-0.5\alpha}\sqrt{\frac{\sigma_A^2}{n_A} + \frac{\sigma_B^2}{n_B}}.$$
Comparing the means of two normal distributions. Assume now that $\sigma_A^2 = \sigma_B^2 = \sigma^2$ and $\sigma^2$ is unknown:
$$\hat\delta_{A,B} \sim N\!\left(\delta_{A,B},\; \frac{\sigma^2}{n_A} + \frac{\sigma^2}{n_B}\right), \qquad \frac{\hat\delta_{A,B} - \delta_{A,B}}{\sigma\sqrt{\frac{1}{n_A} + \frac{1}{n_B}}} \sim N(0, 1).$$
To estimate $\sigma^2$, we use the pooled sample variance
$$S_p^2 = \frac{(n_A - 1)S_A^2 + (n_B - 1)S_B^2}{n_A - 1 + n_B - 1},$$
where $S_A^2 = \frac{1}{n_A - 1}\sum_{i=1}^{n_A}(Z_i^A - \bar Z^A)^2$ and $S_B^2 = \frac{1}{n_B - 1}\sum_{i=1}^{n_B}(Z_i^B - \bar Z^B)^2$.
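The pooled sample variance is straightforward to compute; the sketch below checks it against the Example 2 data given later in these notes:

```python
import statistics

def pooled_variance(a, b):
    """Pooled sample variance of two samples (assumes a common sigma^2)."""
    na, nb = len(a), len(b)
    s2a = statistics.variance(a)   # sample variance with the (n - 1) denominator
    s2b = statistics.variance(b)
    return ((na - 1) * s2a + (nb - 1) * s2b) / (na + nb - 2)

# Example 2 data from later in these notes:
a = [1.87, 0.95, 0.36, 0.84]
b = [2.72, 1.52, 3.81]
print(round(pooled_variance(a, b), 2))   # 0.76
```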
Comparing the means of two normal distributions. We know that if the two normal samples are independent, then $(n_A - 1)S_A^2/\sigma^2 \sim \chi^2_{n_A - 1}$ and $(n_B - 1)S_B^2/\sigma^2 \sim \chi^2_{n_B - 1}$, and hence
$$\left[(n_A - 1)S_A^2 + (n_B - 1)S_B^2\right]/\sigma^2 \sim \chi^2_{n_A - 1 + n_B - 1}.$$
This implies that
$$\frac{\hat\delta_{A,B} - \delta_{A,B}}{S_p\sqrt{\frac{1}{n_A} + \frac{1}{n_B}}} \sim t_{n_A + n_B - 2}$$
and (with $n = n_A + n_B - 2$) the $100(1-\alpha)\%$ CI for $\delta_{A,B}$ is
$$\hat\delta_{A,B} - t_{1-0.5\alpha,\,n}\, s_p\sqrt{\frac{1}{n_A} + \frac{1}{n_B}} < \delta_{A,B} < \hat\delta_{A,B} + t_{1-0.5\alpha,\,n}\, s_p\sqrt{\frac{1}{n_A} + \frac{1}{n_B}}.$$
Comparing the means of two normal distributions. If $\sigma_A^2 \neq \sigma_B^2$ and the variances are unknown, then
$$\frac{\hat\delta_{A,B} - \delta_{A,B}}{\sqrt{\frac{S_A^2}{n_A} + \frac{S_B^2}{n_B}}} \sim t_{n_{A,B}} \quad \text{(approximate t-distribution)},$$
where the degrees of freedom $n_{A,B}$ can be obtained using a messy formula. If the difference between $\sigma_A^2$ and $\sigma_B^2$ is not too large, one can assume $\sigma_A^2 = \sigma_B^2 = \sigma^2$ and use the pooled sample variance to estimate $\sigma^2$, as shown before.
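A common choice for the "messy formula" is the Welch-Satterthwaite approximation; the notes do not name it, so treat this as an assumption. A minimal sketch:

```python
def welch_df(s2a, na, s2b, nb):
    """Welch-Satterthwaite approximate degrees of freedom."""
    va, vb = s2a / na, s2b / nb
    return (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))

# With the Example 2 sample variances (s_A^2 ~= 0.40, n_A = 4; s_B^2 ~= 1.31, n_B = 3):
print(round(welch_df(0.40, 4, 1.31, 3), 1))   # 2.9

# Sanity check: equal variances and equal sample sizes recover n_A + n_B - 2.
print(round(welch_df(1.0, 10, 1.0, 10)))      # 18
```

Note that the approximate degrees of freedom (about 2.9) are smaller than the pooled value $n_A + n_B - 2 = 5$, which is typical when one sample is small and has the larger variance.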
Example 2: Assume that $\sigma_A^2 = \sigma_B^2 = 1$ and the observed data are 1.87, 0.95, 0.36, 0.84 (sample A) and 2.72, 1.52, 3.81 (sample B). We find $\bar z^A = 1.01$, $\bar z^B = 2.68$ and $\hat\delta_{A,B} = -1.67$. If the variance is known, then the 95% CI for $\delta_{A,B}$ is
$$-1.67 \pm z_{0.975}\sqrt{\tfrac{1}{3} + \tfrac{1}{4}} = (-3.17, -0.17).$$
If the variance is unknown, we calculate $s_A^2 = 0.40$ and $s_B^2 = 1.31$, and the pooled sample variance is
$$s_p^2 = \frac{(4 - 1)s_A^2 + (3 - 1)s_B^2}{4 - 1 + 3 - 1} = 0.76, \qquad s_p = 0.87.$$
CLICKER QUESTION 2
Example 2: Assume that $\sigma_A^2 = \sigma_B^2 = 1$ and the observed data are 1.87, 0.95, 0.36, 0.84 (sample A) and 2.72, 1.52, 3.81 (sample B). Quantiles of what distribution should we use to construct a CI for $\delta_{A,B}$ if the variance is unknown?
A. $t_6$
B. $t_4$
C. $t_2$
D. $t_1$
E. None of the above
Example 2: Assume that $\sigma_A^2 = \sigma_B^2 = 1$ and the observed data are 1.87, 0.95, 0.36, 0.84 (sample A) and 2.72, 1.52, 3.81 (sample B). We use the t-distribution with $n_A + n_B - 2 = 5$ degrees of freedom to construct the 95% CI for $\delta_{A,B}$:
$$-1.67 \pm t_{0.975,5}\, s_p\sqrt{\tfrac{1}{3} + \tfrac{1}{4}} = (-3.38, 0.04).$$
This CI is wider than the known-variance interval because the variance is unknown and has to be estimated from the data.
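Both Example 2 intervals can be reproduced end to end (a sketch; the small differences from the slides, e.g. $-3.18$ vs $-3.17$, come from the slides rounding the means to two decimals first; $t_{0.975,5} = 2.571$ is taken from tables since the standard library has no t quantile function):

```python
import math
import statistics

a = [1.87, 0.95, 0.36, 0.84]
b = [2.72, 1.52, 3.81]
na, nb = len(a), len(b)
d_hat = statistics.mean(a) - statistics.mean(b)    # ~= -1.68 (slides: -1.67)

# Known variance (sigma^2 = 1): normal quantile.
half_z = 1.96 * math.sqrt(1 / na + 1 / nb)
ci_known = (d_hat - half_z, d_hat + half_z)        # ~= (-3.18, -0.18)

# Unknown common variance: pooled estimate and t quantile with 5 df.
s2p = ((na - 1) * statistics.variance(a)
       + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
t975_5 = 2.571                                     # t_{0.975,5} from tables
half_t = t975_5 * math.sqrt(s2p) * math.sqrt(1 / na + 1 / nb)
ci_t = (d_hat - half_t, d_hat + half_t)            # ~= (-3.39, 0.04)

print([round(x, 2) for x in ci_known])
print([round(x, 2) for x in ci_t])
```

As in the notes, the known-variance interval excludes zero while the t-interval (barely) contains it, illustrating the price of estimating the variance.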