Statistical Inference
Robert L. Wolpert
Institute of Statistics and Decision Sciences
Duke University, Durham, NC, USA

Week 12. Testing and Kullback-Leibler Divergence

1. Likelihood Ratios

Let $X_1, X_2, X_3, \dots$ be independent, identically distributed random variables taking values in some space $\mathcal{X}$, whose distribution has a probability density function $f(x \mid \theta)$. If we believe that one of the two hypotheses
\[
  H_0: [\theta = \theta_0] \qquad\qquad H_1: [\theta = \theta_1]
\]
is true but we are unsure which, we can compute the likelihood ratio against the null hypothesis $H_0$,
\[
  \lambda_n = \frac{f_1(x_1)\,f_1(x_2)\cdots f_1(x_n)}{f_0(x_1)\,f_0(x_2)\cdots f_0(x_n)},
\]
where we have simplified the notation by writing $f_0(x)$ for $f(x \mid \theta_0)$ and $f_1(x)$ for $f(x \mid \theta_1)$.

A Likelihoodist statistician would find the likelihood ratio $\lambda_n$ to be the best direct measure of the relative support of the data for these two hypotheses; a Bayesian statistician with prior probabilities $\pi_0 = P[H_0]$ and $\pi_1 = P[H_1] = (1 - \pi_0)$ would find the best measure to be the posterior probability
\[
  P[H_0 \mid x_n]
  = \frac{\pi_0\, f_0(x_1)\,f_0(x_2)\cdots f_0(x_n)}
         {\pi_0\, f_0(x_1)\,f_0(x_2)\cdots f_0(x_n) + \pi_1\, f_1(x_1)\,f_1(x_2)\cdots f_1(x_n)}
  = \frac{\pi_0}{\pi_0 + \pi_1 \lambda_n}
\]
or, equivalently, the posterior odds
\[
  \frac{P[H_1 \mid x_n]}{P[H_0 \mid x_n]} = \frac{\pi_1}{\pi_0}\,\lambda_n.
\]
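For concreteness, here is a minimal computational sketch of these quantities. The choices $f_0 = N(0,1)$, $f_1 = N(1,1)$, the prior $\pi_0 = 1/2$, and the sample size are illustrative assumptions, not part of the development above; the only point is that $\lambda_n$, $P[H_0 \mid x_n]$, and the posterior odds are all computed from the same product of density ratios (here on the log scale for numerical stability).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Hypothetical example: f0 = N(0,1) vs f1 = N(1,1), data drawn from f0 (H0 true).
theta0, theta1 = 0.0, 1.0
x = rng.normal(theta0, 1.0, size=50)

# Work on the log scale to avoid under/overflow in the product of densities.
log_lambda_n = np.sum(norm.logpdf(x, theta1, 1.0) - norm.logpdf(x, theta0, 1.0))
lambda_n = np.exp(log_lambda_n)

pi0 = 0.5                                        # assumed prior probability of H0
post_H0 = pi0 / (pi0 + (1 - pi0) * lambda_n)     # P[H0 | x_n] = pi0 / (pi0 + pi1 * lambda_n)
posterior_odds = (1 - pi0) / pi0 * lambda_n      # P[H1 | x_n] / P[H0 | x_n]

print(f"lambda_n = {lambda_n:.4g}, P[H0|x] = {post_H0:.4f}, odds(H1:H0) = {posterior_odds:.4g}")
```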
Finally, the Classical (or Frequentist) statistician would rely on the Neyman-Pearson Lemma, which asserts that the most powerful test of the hypothesis $H_0$ at level $0 < \alpha < 1$ upon observing $\lambda_n = \lambda$ is to

  Reject $H_0$ in favor of $H_1$ if $\lambda \ge c_\alpha$, where $c_\alpha \equiv \inf\{c < \infty : P[\lambda_n \ge c \mid \theta_0] \le \alpha\}$,

and otherwise fail to reject; or, equivalently, to reject if and only if the $P$-value $P_n \equiv P[\lambda_n \ge \lambda \mid \theta_0]$ satisfies $P_n \le \alpha$. This is most powerful in the sense that the power
\[
  P[\text{Reject } H_0 \mid H_1] = P[\lambda_n \ge c_\alpha \mid \theta_1]
\]
is larger than for any other test of size $P[\text{Reject } H_0 \mid H_0] = P[\lambda_n \ge c_\alpha \mid \theta_0] \le \alpha$.

Let's explore how $\lambda_n$ behaves as $n \to \infty$, and from that see what inference Likelihoodists, Bayesians, and Classical statisticians are likely to make.

1.1. Log Likelihood Ratio as a Random Walk

For any $\theta \in \Theta$ the log likelihood ratio is a sum of independent random variables
\[
  \log \lambda_n = \sum_{j=1}^{n} \log\bigl[ f_1(X_j)/f_0(X_j) \bigr],
\]
the $j$th of which has mean
\[
  \mu = E\Bigl[\log \frac{f_1(X)}{f_0(X)} \Bigm| \theta\Bigr]
      = \int \bigl[\log f_1(x) - \log f_0(x)\bigr]\, f(x)\,dx
      = \int \Bigl[\log \frac{f_1(x)}{f(x)} - \log \frac{f_0(x)}{f(x)}\Bigr]\, f(x)\,dx
      = K(f : f_0) - K(f : f_1),
\]
where $K(f : g)$ is the Kullback-Leibler divergence defined by
\[
  K(f : g) \equiv \int \log \frac{f(x)}{g(x)}\, f(x)\,dx.
\]
This divergence satisfies $K(f : f) = 0$ for all $f(x)$ and $K(f : g) > 0$ for all other $g(x)$, but it is not symmetric and so is not (quite!) a distance metric. It has many interesting properties, some described in the course text by Bickel & Doksum (§2.2, §3.2) and others we will encounter below.
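As a quick numerical sanity check of the identity $\mu = K(f : f_0) - K(f : f_1)$, consider a unit-variance normal family, for which $K(N(a,1) : N(b,1)) = (a-b)^2/2$ in closed form. The parameter values below are arbitrary illustrative choices.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Unit-variance normal family: closed-form KL divergence K(N(a,1) : N(b,1)) = (a-b)^2 / 2.
def kl_normal(a, b):
    return 0.5 * (a - b) ** 2

theta0, theta1, theta_true = 0.0, 1.0, 0.3     # data actually drawn from f = N(0.3, 1)
x = rng.normal(theta_true, 1.0, size=200_000)

# Monte Carlo estimate of one step's mean, mu = E[log f1(X) - log f0(X) | theta_true] ...
mu_mc = np.mean(norm.logpdf(x, theta1, 1.0) - norm.logpdf(x, theta0, 1.0))

# ... which should match K(f : f0) - K(f : f1).
mu_kl = kl_normal(theta_true, theta0) - kl_normal(theta_true, theta1)

print(f"Monte Carlo mu = {mu_mc:.4f}, K(f:f0) - K(f:f1) = {mu_kl:.4f}")
```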
By the Law of Large Numbers, $(1/n)\log\lambda_n \to \mu$ as $n \to \infty$ and hence $\lambda_n \approx e^{n\mu}$, so
\[
  \lim_{n\to\infty} \lambda_n = \lim_{n\to\infty} e^{n\mu} =
  \begin{cases}
    0      & \text{if } K(f : f_0) < K(f : f_1), \\
    \infty & \text{if } K(f : f_0) > K(f : f_1),
  \end{cases}
\]
and, in particular, statisticians of all three paradigms will be led to the right conclusion in the limit as $n \to \infty$: if $H_0$ is true, then $\mu = -K(f_0 : f_1) < 0$ and $\lambda_n \to 0$, while if $H_1$ is true, then $\mu = K(f_1 : f_0) > 0$ and $\lambda_n \to \infty$.

Similarly we can compute
\[
  \sigma^2 = E\Bigl[\Bigl(\log \frac{f_1(X)}{f_0(X)} - \mu\Bigr)^2 \Bigm| \theta\Bigr]
           = \int \bigl[\log f_1(x) - \log f_0(x) - \mu\bigr]^2 f(x)\,dx
\]
and, by the Central Limit Theorem,
\[
  \lim_{n\to\infty} P\Bigl[\frac{\log\lambda_n - n\mu}{\sigma\sqrt{n}} \le z\Bigr] = \Phi(z),
\]
and hence the Likelihoodist, Bayesian, and Frequentist reports for large $n$ and a true hypothesis $H_0: [\theta = \theta_0]$ will be approximately
\[
  \lambda_n \approx e^{-n K(f_0 : f_1)} \to 0, \qquad
  P[H_0 \mid x] \approx \frac{\pi_0}{\pi_0 + \pi_1 e^{-n K(f_0 : f_1)}} \to 1, \qquad
  P_n \approx \Phi\Bigl(-\frac{\log\lambda_n + n K(f_0 : f_1)}{\sigma\sqrt{n}}\Bigr) \sim \mathrm{Un}(0,1),
\]
while if $H_1: [\theta = \theta_1]$ is true they will be approximately
\[
  \lambda_n \approx e^{n K(f_1 : f_0)} \to \infty, \qquad
  P[H_0 \mid x] \approx \frac{\pi_0}{\pi_0 + \pi_1 e^{n K(f_1 : f_0)}} \to 0, \qquad
  P_n \approx \Phi\Bigl(-\frac{n K(f_1 : f_0) + n K(f_0 : f_1)}{\sigma\sqrt{n}}\Bigr)
      = \Phi\bigl(-[K(f_1 : f_0) + K(f_0 : f_1)]\sqrt{n/\sigma^2}\bigr) \to 0.
\]
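To see these large-$n$ approximations in action, the short simulation below reports $\log\lambda_n$, the posterior probability, and the approximate $P$-value under a true $H_0$ for the same illustrative normal pair used earlier ($f_0 = N(0,1)$, $f_1 = N(1,1)$, $\pi_0 = 1/2$, all assumed choices); for this family $K(f_0 : f_1) = 1/2$ and the per-step standard deviation is $\sigma = 1$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

# Assumed setup: f0 = N(0,1), f1 = N(1,1), H0 true, pi0 = 1/2.
theta0, theta1, pi0, n = 0.0, 1.0, 0.5, 500
x = rng.normal(theta0, 1.0, size=n)

steps = norm.logpdf(x, theta1, 1.0) - norm.logpdf(x, theta0, 1.0)
log_lambda = np.sum(steps)

K01 = 0.5 * (theta0 - theta1) ** 2             # K(f0:f1) for unit-variance normals
sigma = abs(theta1 - theta0)                   # sd of one step log[f1(X)/f0(X)] under H0

post_H0 = pi0 / (pi0 + (1 - pi0) * np.exp(log_lambda))          # Bayesian report
P_n = norm.cdf(-(log_lambda + n * K01) / (sigma * np.sqrt(n)))   # approximate P-value

print(f"log lambda_n = {log_lambda:.1f} (compare -n*K(f0:f1) = {-n * K01:.1f})")
print(f"P[H0 | x] = {post_H0:.4f}, approximate P-value = {P_n:.3f}")
```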
1.2. Martingales and Iterated Logarithms

Suppose that $H_0: [\theta = \theta_0]$ is true. At any fixed sample-size $n$, we have seen how Classical and Bayesian testing methods behave. What happens as we observe successively larger samples?

The Law of the Iterated Logarithm for random walks states that, with probability one, a random walk like $\log\lambda_n$ with i.i.d. steps of mean $\mu \in \mathbb{R}$ and variance $\sigma^2 > 0$ will satisfy
\[
  \limsup_{n\to\infty} \frac{\log\lambda_n - n\mu}{\sigma\sqrt{2n\log\log n}} = +1, \qquad
  \liminf_{n\to\infty} \frac{\log\lambda_n - n\mu}{\sigma\sqrt{2n\log\log n}} = -1
\]
and, in particular, that if $H_0$ is true (so that $\mu = -K(f_0 : f_1) < 0$), then with probability one there will be infinitely many numbers $n \in \mathbb{N}$ for which
\[
  \log\lambda_n + n\,K(f_0 : f_1) > \sigma\sqrt{n\log\log n},
\]
so that
\[
  P_n \approx \Phi\Bigl(-\frac{\log\lambda_n + n\,K(f_0 : f_1)}{\sigma\sqrt{n}}\Bigr) < \Phi\bigl(-\sqrt{\log\log n}\bigr) \to 0.
\]
Thus the $P$-value will fall arbitrarily close to zero as $n \to \infty$, lending arbitrarily strong evidence against even a true hypothesis. (Note: $\log\log n$ grows exceedingly slowly; for example, it only exceeds … for $n > \dots$, and only exceeds 1.96 for $n > \dots$.)

If $H_0$ is true then $\lambda_n$ is a positive martingale starting at $\lambda_0 = 1$, since
\[
  E[\lambda_{n+1} \mid X_1, \dots, X_n]
  = \lambda_n \int \frac{f_1(x)}{f_0(x)}\, f_0(x)\,dx
  = \lambda_n \int f_1(x)\,dx = \lambda_n,
\]
and so, by Doob's Martingale Maximal Inequality, for any $b < \infty$ and $N > 0$,
\[
  P\Bigl[\sup_{n \le N} \lambda_n > b\Bigr] \le \frac{E[\lambda_0]}{b} = \frac{1}{b},
\]
and hence if $H_0$ is true the probability that the Bayesian posterior probability of $H_0$ ever falls below any $0 < p < 1$ is bounded above by
\[
  P\Bigl[\inf_{n \le N} P[H_0 \mid x_n] \le p\Bigr]
  = P\Bigl[\inf_{n \le N} \frac{\pi_0}{\pi_0 + \pi_1\lambda_n} \le p\Bigr]
  = P\Bigl[\sup_{n \le N} \lambda_n \ge \frac{\pi_0(1-p)}{\pi_1\, p}\Bigr]
  \le \frac{\pi_1\, p}{\pi_0(1-p)}
\]
for every $N$, and hence also as $N \to \infty$; so, unlike Classical methods, Bayesian methods will not give arbitrarily strong evidence against a true hypothesis.
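The contrast can be illustrated by tracking both reports along a single long sample path under a true $H_0$: the running minimum of the approximate $P$-value keeps drifting downward over long horizons, while the running minimum of the posterior probability is controlled by the Doob bound. The sketch below again assumes the illustrative choices $f_0 = N(0,1)$, $f_1 = N(1,1)$, $\pi_0 = 1/2$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)

# Assumed setup: f0 = N(0,1) vs f1 = N(1,1), data from f0 (H0 true), pi0 = 1/2.
theta0, theta1, pi0, N = 0.0, 1.0, 0.5, 100_000
x = rng.normal(theta0, 1.0, size=N)

steps = norm.logpdf(x, theta1, 1.0) - norm.logpdf(x, theta0, 1.0)
log_lambda = np.cumsum(steps)                        # log lambda_n for n = 1..N
n = np.arange(1, N + 1)

K01, sigma = 0.5, 1.0                                # K(f0:f1) and per-step sd for this family
P_n = norm.cdf(-(log_lambda + n * K01) / (sigma * np.sqrt(n)))   # approximate P-values
post_H0 = pi0 / (pi0 + (1 - pi0) * np.exp(log_lambda))           # posterior P[H0 | x_n]

p = 0.2   # how often does the posterior ever drop below p?
doob_bound = (1 - pi0) * p / (pi0 * (1 - p))
print(f"min P-value over n <= {N}: {P_n.min():.4f}")
print(f"min posterior P[H0|x_n]:  {post_H0.min():.4f}  (Doob bound on P[ever <= {p}]: {doob_bound:.2f})")
```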
1.3. Martingales and Sample Size

If $H_0$ is true we have seen that $\lambda_n$ is a positive martingale starting at $\lambda_0 = 1$. For any $0 < a < 1 < b$ let $\tau_{ab}$ be the stopping time
\[
  \tau_{ab} \equiv \inf\{n \ge 0 : \lambda_n \notin (a, b)\}
\]
(the first time $\lambda_n$ either exceeds $b > 1$ or falls below $a < 1$) and denote by $p_{ab}$ the probability that $\lambda_n$ exceeds $b$ before falling below $a$, $p_{ab} = P[\lambda_{\tau_{ab}} \ge b]$. Then Doob's theorem applied to the two martingales $\lambda_n$ and $\log\lambda_n - n\mu$ gives us
\[
  E[\lambda_{\tau_{ab}} \mid H_0] \approx (1 - p_{ab})\,a + p_{ab}\,b = 1
  \quad\Longrightarrow\quad p_{ab} = \frac{1 - a}{b - a},
\]
\[
  E[\log\lambda_{\tau_{ab}} - \mu\,\tau_{ab} \mid H_0]
  \approx (1 - p_{ab})\log a + p_{ab}\log b - \mu\,E[\tau_{ab} \mid H_0] = 0
  \quad\Longrightarrow\quad
  E[\tau_{ab} \mid H_0] = -\frac{(b-1)\log a + (1-a)\log b}{(b - a)\,K(f_0 : f_1)}
\]
and, in the limit as $b \to \infty$,
\[
  E[\inf\{n \ge 0 : \lambda_n < a\} \mid H_0] = \frac{\log(1/a)}{K(f_0 : f_1)},
\]
so the sample-size needed to reach a posterior probability of $P[H_0 \mid x_n] = \frac{\pi_0}{\pi_0 + \pi_1\lambda_n} > 1 - \epsilon$ for a true hypothesis $H_0: \{x_j \sim f_0(x)\}$ has expectation
\[
  n \approx \frac{\log\frac{\pi_1}{\pi_0} + \log\frac{1-\epsilon}{\epsilon}}{K(f_0 : f_1)}
\]
and, similarly, if $H_0$ is false and $x_j \sim f_1(x)$, then
\[
  E[\inf\{n \ge 0 : \lambda_n > b\} \mid H_1] = \frac{\log b}{K(f_1 : f_0)}
\]
and the sample-size needed to achieve a posterior probability of $P[H_0 \mid x_n] = \frac{\pi_0}{\pi_0 + \pi_1\lambda_n} < \epsilon$ for a false hypothesis is
\[
  n \approx \frac{\log\frac{\pi_0}{\pi_1} + \log\frac{1-\epsilon}{\epsilon}}{K(f_1 : f_0)}.
\]
Evidently the sample-size needed varies directly with the logit of the desired posterior probability, and inversely with the Kullback-Leibler discrepancy between $f_0$ and $f_1$.
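The expected sample-size formula can be checked by simulating first-passage times of the posterior. The sketch below again assumes $f_0 = N(0,1)$, $f_1 = N(1,1)$, $\pi_0 = \pi_1 = 1/2$, and a target $\epsilon = 0.01$; the simulated mean tends to sit slightly above the prediction because the likelihood ratio overshoots the boundary, which the Wald-style approximation ignores.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)

# Assumed setup: f0 = N(0,1) vs f1 = N(1,1), H0 true, pi0 = pi1 = 1/2.
theta0, theta1, pi0, eps = 0.0, 1.0, 0.5, 0.01
K01 = 0.5 * (theta0 - theta1) ** 2             # K(f0:f1) = 1/2 for this family

# Predicted expected sample size to reach P[H0 | x_n] > 1 - eps under H0.
n_pred = (np.log((1 - pi0) / pi0) + np.log((1 - eps) / eps)) / K01

# Simulate the first time the posterior exceeds 1 - eps, i.e. lambda_n drops below a.
a = pi0 * eps / ((1 - pi0) * (1 - eps))        # posterior > 1-eps  <=>  lambda_n < a
def first_passage(max_n=20_000):
    log_lambda, n = 0.0, 0
    while n < max_n:
        n += 1
        xj = rng.normal(theta0, 1.0)
        log_lambda += norm.logpdf(xj, theta1, 1.0) - norm.logpdf(xj, theta0, 1.0)
        if log_lambda < np.log(a):
            return n
    return max_n

times = [first_passage() for _ in range(500)]
print(f"predicted E[n] = {n_pred:.1f}, simulated mean = {np.mean(times):.1f}")
```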
1.4. Kullback-Leibler and Fisher Information

For small $\epsilon > 0$, a second-order Taylor-series approximation of the KL divergence from $f(x \mid \theta)$ to $f(x \mid \theta + \epsilon)$ gives
\[
  K\bigl(f(x \mid \theta), f(x \mid \theta + \epsilon)\bigr)
  = -\int \log\frac{f(x \mid \theta + \epsilon)}{f(x \mid \theta)}\, f(x \mid \theta)\,dx
  \approx -\int \Bigl[\epsilon\,\frac{\partial}{\partial\theta}\log f(x \mid \theta)
      + \frac{\epsilon^2}{2}\,\frac{\partial^2}{\partial\theta^2}\log f(x \mid \theta)\Bigr] f(x \mid \theta)\,dx
  = I(\theta)\,\epsilon^2/2
\]
(the first-order term vanishes because the score has mean zero, and $-E[\partial^2 \log f(X \mid \theta)/\partial\theta^2] = I(\theta)$), or, in $q > 1$ dimensions, $\epsilon^{\mathsf T} I(\theta)\,\epsilon/2$. This suggests a close link between KL divergence and the Information Metric notion of the distance between different distributions,
\[
  d_I(\theta_0, \theta_1) = \int_{\theta_0}^{\theta_1} \sqrt{I(\theta)}\,d\theta \quad (\text{in } q = 1 \text{ dimension})
  \qquad\text{or}\qquad
  d_I(\theta_0, \theta_1) = \inf_{\gamma} \int_0^1 \sqrt{\dot\gamma_t^{\mathsf T}\, I(\gamma_t)\,\dot\gamma_t}\; dt,
\]
where the infimum is over all differentiable paths $\gamma$ from $\gamma_0 = \theta_0$ to $\gamma_1 = \theta_1$; evidently, for nearby $\theta_0, \theta_1$,
\[
  K\bigl(f(x \mid \theta_0), f(x \mid \theta_1)\bigr)
  \approx d_I(\theta_0, \theta_1)^2/2
  \approx \tfrac12\, (\theta_1 - \theta_0)^{\mathsf T} I(\theta)\,(\theta_1 - \theta_0).
\]
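A small numerical check of the approximation $K\bigl(f(\cdot \mid \theta), f(\cdot \mid \theta + \epsilon)\bigr) \approx I(\theta)\,\epsilon^2/2$ can be done with a Bernoulli$(\theta)$ family, where both sides are available in closed form and $I(\theta) = 1/[\theta(1-\theta)]$; the value $\theta = 0.3$ and the grid of $\epsilon$'s below are arbitrary illustrative choices.

```python
import numpy as np

# Check K(f(.|theta), f(.|theta+eps)) ~ I(theta) * eps^2 / 2 for a Bernoulli(theta) family,
# where I(theta) = 1 / (theta * (1 - theta)).
def kl_bernoulli(p, q):
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

theta = 0.3
fisher = 1.0 / (theta * (1 - theta))

for eps in [0.05, 0.01, 0.002]:
    exact = kl_bernoulli(theta, theta + eps)
    approx = fisher * eps ** 2 / 2
    print(f"eps = {eps:5.3f}:  K = {exact:.3e},  I(theta)*eps^2/2 = {approx:.3e}")
```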