A Bayesian Analysis of Some Nonparametric Problems
Thomas S. Ferguson
April 8, 2003

Introduction

The Bayesian approach has remained rather unsuccessful in treating nonparametric problems. This is primarily due to the difficulty of finding a workable prior distribution on the parameter space, which in nonparametric problems is taken to be a set of probability distributions on a given sample space.

Two Desirable Properties of a Prior
(1) The support of the prior should be large.
(2) The posterior distribution given a sample of observations should be manageable analytically.
These properties are antagonistic: one may be obtained only at the expense of the other.

The Main Idea of the Paper
The paper presents a class of prior distributions, called Dirichlet process priors, that are broad in the sense of (1) and for which (2) is realized, so that many nonparametric problems can be treated, yielding results comparable to those of the classical theory.
The Dirichlet Distribution

Known to Bayesians as the conjugate prior for the parameter of a multinomial distribution.

Let $G(\alpha, \beta)$ denote the gamma distribution with shape parameter $\alpha \ge 0$ and scale parameter $\beta > 0$; for $\alpha = 0$ it is degenerate at $0$. Let $Z_1, \dots, Z_k$ be independent with $Z_j \sim G(\alpha_j, 1)$, $j = 1, \dots, k$, where $\alpha_j \ge 0$ for all $j$ and $\alpha_j > 0$ for some $j$. The Dirichlet distribution with parameters $(\alpha_1, \dots, \alpha_k)$, denoted $D(\alpha_1, \dots, \alpha_k)$, is defined as the distribution of $(Y_1, \dots, Y_k)$, where
$$Y_j = \frac{Z_j}{\sum_{i=1}^k Z_i}, \quad j = 1, \dots, k. \qquad (1)$$

Note: Since $Y_1 + \dots + Y_k = 1$, the distribution is always singular. If any $\alpha_j = 0$, the corresponding $Y_j$ is degenerate at $0$. If $\alpha_j > 0$ for all $j$, the distribution of $(Y_1, \dots, Y_{k-1})$ is continuous with density
$$f(y_1, \dots, y_{k-1}) = \frac{\Gamma(\alpha_1 + \dots + \alpha_k)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_k)} \Big( \prod_{j=1}^{k-1} y_j^{\alpha_j - 1} \Big) \Big( 1 - \sum_{j=1}^{k-1} y_j \Big)^{\alpha_k - 1} I_S(y_1, \dots, y_{k-1}), \qquad (2)$$
where $S = \{(y_1, \dots, y_{k-1}) : y_j \ge 0, \sum_{j=1}^{k-1} y_j \le 1\}$. For $k = 2$, (2) reduces to the Beta distribution.
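The gamma construction (1) can be checked numerically. The sketch below (using NumPy; the function name and parameter values are illustrative, not from the paper) samples a Dirichlet vector by normalizing independent gamma variables and compares the empirical means against the moment formula $E(Y_i) = \alpha_i/\alpha$ stated in the next slide.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_dirichlet(alphas, size, rng):
    """Sample D(alpha_1,...,alpha_k) via the gamma construction (1):
    Y_j = Z_j / sum_i Z_i with Z_j ~ Gamma(alpha_j, 1) independent."""
    z = rng.gamma(shape=alphas, scale=1.0, size=(size, len(alphas)))
    return z / z.sum(axis=1, keepdims=True)

# Illustrative parameters: D(2, 3, 5), so alpha = 10 and E(Y) = (0.2, 0.3, 0.5).
alphas = np.array([2.0, 3.0, 5.0])
y = sample_dirichlet(alphas, 100_000, rng)
a = alphas.sum()
print(y.mean(axis=0))   # close to alphas / a = [0.2, 0.3, 0.5]
```

Since each row sums to one by construction, the samples live on the simplex, illustrating why the $k$-dimensional distribution is singular.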
Properties of the Dirichlet Distribution

(i) (Aggregation) If $(Y_1, \dots, Y_k) \sim D(\alpha_1, \dots, \alpha_k)$ and $r_1, \dots, r_l$ are integers with $0 < r_1 < \dots < r_l = k$, then
$$\Big( \sum_{i=1}^{r_1} Y_i, \sum_{i=r_1+1}^{r_2} Y_i, \dots, \sum_{i=r_{l-1}+1}^{r_l} Y_i \Big) \sim D\Big( \sum_{i=1}^{r_1} \alpha_i, \sum_{i=r_1+1}^{r_2} \alpha_i, \dots, \sum_{i=r_{l-1}+1}^{r_l} \alpha_i \Big).$$

(ii) The marginal distribution of $Y_j$ is Beta: $Y_j \sim \mathrm{Be}(\alpha_j, \alpha - \alpha_j)$, where $\alpha = \sum_{i=1}^k \alpha_i$.

(iii) $E(Y_i) = \dfrac{\alpha_i}{\alpha}$, $\quad E(Y_i^2) = \dfrac{\alpha_i(\alpha_i + 1)}{\alpha(\alpha + 1)}$, $\quad E(Y_i Y_j) = \dfrac{\alpha_i \alpha_j}{\alpha(\alpha + 1)}$ for $i \ne j$.

The Dirichlet Process

Let $X$ be a set, $\mathcal{A}$ a $\sigma$-field of subsets of $X$, and $\alpha$ a finite non-null measure on $(X, \mathcal{A})$. We define a random probability $P$ by specifying the joint distribution of $(P(B_1), \dots, P(B_k))$ for every $k$ and all measurable partitions $(B_1, \dots, B_k)$ of $X$. We say $P$ is a Dirichlet process on $(X, \mathcal{A})$ with parameter $\alpha$ if for every $k = 1, 2, \dots$ and every measurable partition $(B_1, \dots, B_k)$ of $X$, the distribution of $(P(B_1), \dots, P(B_k))$ is Dirichlet, $D(\alpha(B_1), \dots, \alpha(B_k))$.
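The aggregation property (i) is what makes the Dirichlet process definition consistent across refinements of a partition. A small simulation sketch (illustrative values only): grouping the coordinates of a $D(1,2,3,4)$ vector into $\{1,2\}$ and $\{3,4\}$ should give a $D(3,7)$ vector, whose first coordinate is $\mathrm{Be}(3,7)$ with mean $0.3$ and variance $21/1100$ by (ii) and (iii).

```python
import numpy as np

rng = np.random.default_rng(1)
alphas = np.array([1.0, 2.0, 3.0, 4.0])

# Sample D(1,2,3,4) and aggregate coordinates into groups {1,2} and {3,4}.
y = rng.dirichlet(alphas, size=200_000)
grouped = np.stack([y[:, :2].sum(axis=1), y[:, 2:].sum(axis=1)], axis=1)

# By property (i) the grouped vector is D(1+2, 3+4) = D(3, 7); its first
# coordinate is Be(3, 7): mean 3/10 = 0.3, variance 3*7/(10^2 * 11) = 21/1100.
print(grouped[:, 0].mean())   # close to 0.3
print(grouped[:, 0].var())    # close to 21/1100
```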
Proposition. Let P be a Dirichlet process on (X, A) with parameter α and let A A. If α(a) = 0, then P (A) = 0 with probability one. If α(a) > 0 then P (A) > 0 with probability one. Furthermore, E(P (A)) = α(a) α(x ). Proposition 2. Let P be a Dirichlet process on (X, A) with parameter α, and let Q be a fixed probability measure on (X, A) with Q α. Then, for any positive integer m and measurable sets A,..., A m, and ɛ > 0, P{ P (A i ) Q(A i ) < ɛ for i =,..., m} > 0. Proposition 3. Let P be a Dirichlet process on (X, A) with parameter α and let X be a sample of size from P. Then for A A, P(X A) = α(a)/α(x ). Theorem : Let P be a Dirichlet process on (X, A) with parameter α, and let X,..., X n be a sample of size n from P. Then conditional distribution of P given X,..., X n, is a Dirichlet process with parameter α+ n. Theorem 2 : Let P be the Dirichlet process and let Z be a measurable real valued function defined on (X, A). If Z dα <, then Z dp with probability one, and E( ZdP ) = ZdE(P ) = α(x ) Zdα. 4
Application

$P \sim D(\alpha)$ means $P$ is a Dirichlet process on $(X, \mathcal{A})$ with parameter $\alpha$. Let $R$ be the real line and $B$ the $\sigma$-field of Borel sets; take $(X, \mathcal{A}) = (R, B)$.

Nonparametric Statistical Decision Problem
The parameter space is the set of all probability measures $P$ on $(X, \mathcal{A})$. The statistician is to choose an action $a$ in some space, thereby incurring a loss $L(P, a)$. A sample $X_1, \dots, X_n$ from $P$ is available to the statistician, upon which he may base his choice of action.

What does a Bayesian do? He seeks a Bayes rule with respect to the prior distribution $P \sim D(\alpha)$. The posterior distribution of $P$ given the observations is $D(\alpha + \sum_{i=1}^n \delta_{X_i})$. So: find a Bayes rule for the no-sample problem ($n = 0$); the Bayes rule for the general problem is then found by replacing $\alpha$ with $\alpha + \sum_{i=1}^n \delta_{X_i}$.
(a) Estimation of a distribution function

Let $(X, \mathcal{A}) = (R, B)$, and let the action space be the space of all distribution functions on $R$.

Loss function: $L(P, \hat{F}) = \int (F(t) - \hat{F}(t))^2 \, dW(t)$, where $W$ is a given finite measure on $(R, B)$ (a weight function) and $F(t) = P((-\infty, t])$.

If $P \sim D(\alpha)$, then $F(t) \sim \mathrm{Be}(\alpha((-\infty, t]), \alpha((t, \infty)))$ for each $t$.

The Bayes risk for the no-sample problem, $E(L(P, \hat{F})) = \int E(F(t) - \hat{F}(t))^2 \, dW(t)$, is minimized by choosing $\hat{F}(t) = E(F(t))$. Thus the Bayes rule for the no-sample problem is
$$\hat{F}(t) = E(F(t)) = F_0(t), \quad \text{where } F_0(t) = \alpha((-\infty, t])/\alpha(R).$$

For a sample of size $n$, the Bayes rule is
$$\hat{F}_n(t \mid X_1, \dots, X_n) = \frac{\alpha((-\infty, t]) + \sum_{i=1}^n \delta_{X_i}((-\infty, t])}{\alpha(R) + n} = p_n F_0(t) + (1 - p_n) F_n(t \mid X_1, \dots, X_n),$$
where $p_n = \alpha(R)/(\alpha(R) + n)$ and $F_n(t \mid X_1, \dots, X_n) = \frac{1}{n} \sum_{i=1}^n \delta_{X_i}((-\infty, t])$ is the empirical distribution function.

The Bayes rule is a mixture of our prior guess at $F$ and the empirical distribution; $\alpha(R)$ can be interpreted as a measure of faith in the prior guess at $F$.
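The mixture formula for $\hat{F}_n$ is straightforward to compute. A minimal sketch (function name, prior guess $F_0 = $ standard normal cdf, and mass $\alpha(R) = 5$ are all illustrative assumptions):

```python
import numpy as np
from math import erf, sqrt

def dp_cdf_estimate(data, t, F0, c):
    """Bayes estimate of F under a DP prior with base cdf F0 and total
    mass c = alpha(R): the convex mixture p_n F0 + (1 - p_n) F_emp."""
    data, t = np.asarray(data), np.asarray(t)
    n = len(data)
    p_n = c / (c + n)
    F_emp = np.mean(data[:, None] <= t[None, :], axis=0)   # empirical cdf at t
    return p_n * F0(t) + (1 - p_n) * F_emp

# Hypothetical prior guess: standard normal cdf, with faith c = 5.
F0 = np.vectorize(lambda x: 0.5 * (1 + erf(x / sqrt(2))))
rng = np.random.default_rng(3)
data = rng.normal(0.5, 1.0, size=40)
vals = dp_cdf_estimate(data, [-1.0, 0.0, 1.0], F0, c=5.0)
print(vals)   # nondecreasing values in [0, 1]
```

Setting `c = 0` recovers the empirical distribution function, matching the remark that $\alpha(R)$ measures faith in the prior guess.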
(b) Estimation of the mean

Let $(X, \mathcal{A}) = (R, B)$. Loss function: $L(P, \hat{\mu}) = (\mu - \hat{\mu})^2$, where $\mu = \int x \, dP(x)$.

Assume $P \sim D(\alpha)$, where $\alpha$ has a finite first moment. The mean of the corresponding probability measure $\alpha(\cdot)/\alpha(R)$ is $\mu_0 = \int x \, d\alpha(x)/\alpha(R)$. By Theorem 2, the random variable $\mu$ exists, and the Bayes rule for the no-sample problem is the mean of $\mu$: $\hat{\mu} = \mu_0$.

For a sample of size $n$, the Bayes rule is
$$\hat{\mu}_n(X_1, \dots, X_n) = \frac{1}{\alpha(R) + n} \int x \, d\Big( \alpha(x) + \sum_{i=1}^n \delta_{X_i}(x) \Big) = p_n \mu_0 + (1 - p_n) \bar{X}_n,$$
with $p_n$ as before and $\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i$.

The Bayes estimate lies between the prior guess at $\mu$ and the sample mean. As $\alpha(R) \to 0$, $\hat{\mu}_n$ converges to $\bar{X}_n$. As $n \to \infty$, $p_n \to 0$, i.e. the Bayes estimate is strongly consistent within the class of distributions with finite first moments.
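The estimator is a simple shrinkage of the sample mean toward the prior guess. A sketch with illustrative numbers (prior guess $\mu_0 = 0$, mass $c = \alpha(R) = 2$, eight observations all equal to 5, so $\hat{\mu}_n = (2 \cdot 0 + 8 \cdot 5)/(2 + 8) = 4$):

```python
def dp_mean_estimate(data, mu0, c):
    """Bayes estimate of the mean under a DP prior: the shrinkage
    p_n * mu0 + (1 - p_n) * xbar with p_n = c / (c + n), c = alpha(R)."""
    n = len(data)
    p_n = c / (c + n)
    xbar = sum(data) / n
    return p_n * mu0 + (1 - p_n) * xbar

print(dp_mean_estimate([5.0] * 8, mu0=0.0, c=2.0))  # -> 4.0
```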
(c) Estimation of the median

Let $(X, \mathcal{A}) = (R, B)$. We wish to estimate the median, $m = \mathrm{med}\, P$, of an unknown probability measure $P$ on $(R, B)$. For $P \sim D(\alpha)$, $m$ is a random variable.

Loss function: $L(P, \hat{m}) = |m - \hat{m}|$ (absolute error loss). Any median of the distribution of $m$ is a Bayes estimate of $m$. For the Dirichlet process prior, any median of the distribution of $m$ is a median of the expectation of $P$:
$$\mathrm{med}(\text{dist. of } \mathrm{med}\, P) = \mathrm{med}\, E(P).$$

A number $t$ is a median of the distribution of $m$ iff $\mathcal{P}\{m < t\} \le \frac{1}{2} \le \mathcal{P}\{m \le t\}$. Now $\mathcal{P}\{m \le t\} = \mathcal{P}\{F(t) \ge \frac{1}{2}\}$, and $F(t) \sim \mathrm{Be}(\alpha((-\infty, t]), \alpha((t, \infty)))$. The median of this Beta distribution is a nondecreasing function of $t$, with value $\frac{1}{2}$ iff the two parameters are equal. Hence $t$ satisfies the condition above iff
$$\frac{\alpha((-\infty, t))}{\alpha(R)} \le \frac{1}{2} \le \frac{\alpha((-\infty, t])}{\alpha(R)},$$
and such $t$ are exactly the medians of $E(P)$. Thus any number $t$ satisfying this is a Bayes estimate of $m$ for the prior $D(\alpha)$ and absolute error loss. For $F_0$ defined as in (a), $\hat{m} = $ median of $F_0$.

For a sample of size $n$, the Bayes estimate is $\hat{m}_n(X_1, \dots, X_n) = $ median of $\hat{F}_n$, where $\hat{F}_n$ is the Bayes estimate of $F$ obtained in (a).
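Numerically, the estimate is just a median of the mixture cdf from (a). The sketch below finds it by grid search; this is a crude numerical illustration (not the paper's derivation), and the prior guess ($N(0,1)$ cdf with mass $c = 3$) and data are hypothetical.

```python
import numpy as np
from math import erf, sqrt

def dp_median_estimate(data, F0, c, grid):
    """Bayes estimate of the median under absolute error loss: the
    smallest grid point t with F_hat(t) >= 1/2, where
    F_hat = p_n F0 + (1 - p_n) F_emp is the mixture cdf from (a)."""
    data = np.asarray(data)
    n = len(data)
    p_n = c / (c + n)
    F_emp = np.mean(data[:, None] <= grid[None, :], axis=0)
    F_hat = p_n * F0(grid) + (1 - p_n) * F_emp     # nondecreasing in t
    return grid[np.searchsorted(F_hat, 0.5)]

F0 = np.vectorize(lambda x: 0.5 * (1 + erf(x / sqrt(2))))  # prior guess: N(0,1)
rng = np.random.default_rng(4)
data = rng.normal(2.0, 1.0, size=50)
grid = np.linspace(-5.0, 5.0, 2001)
est = dp_median_estimate(data, F0, c=3.0, grid=grid)
print(est)   # pulled slightly from the sample median toward the prior median 0
```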
(d) Estimation of $\int F \, dG$ for a two-sample problem

Let $F$ and $G$ be two distribution functions on the real line, let $X_1, \dots, X_m$ be a sample from $F$, and let $Y_1, \dots, Y_n$ be a sample from $G$. We estimate $\Delta = \mathcal{P}(X \le Y) = \int F \, dG$.

Loss function: squared error. Priors: $F$ is the d.f. of $P_1 \sim D(\alpha_1)$ and $G$ is the d.f. of $P_2 \sim D(\alpha_2)$, with $P_1$ and $P_2$ independent.

The Bayes rule for the no-sample problem is $\hat{\Delta}_0 = E(\Delta) = \int F_0 \, dG_0$, where $F_0 = E(F)$ and $G_0 = E(G)$.

Given the two samples, the Bayes rule is $\hat{\Delta}(X_1, \dots, X_m, Y_1, \dots, Y_n) = \int \hat{F}_m \, d\hat{G}_n$, where $\hat{F}_m$ and $\hat{G}_n$ are the respective Bayes estimates of $F$ and $G$ as in application (a). It can be written as
$$\hat{\Delta}(X_1, \dots, X_m, Y_1, \dots, Y_n) = p_{1,m} p_{2,n} \hat{\Delta}_0 + p_{1,m}(1 - p_{2,n}) \frac{1}{n} \sum_{j=1}^n F_0(Y_j) + (1 - p_{1,m}) p_{2,n} \frac{1}{m} \sum_{i=1}^m (1 - G_0(X_i)) + (1 - p_{1,m})(1 - p_{2,n}) \frac{U}{mn},$$
where $p_{1,m} = \alpha_1(R)/(\alpha_1(R) + m)$, $p_{2,n} = \alpha_2(R)/(\alpha_2(R) + n)$, and $U = \sum_{i=1}^m \sum_{j=1}^n I_{(-\infty, Y_j]}(X_i)$ is the Mann-Whitney statistic.
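The four-term formula above can be coded directly. A sketch under assumed priors (both prior-guess cdfs uniform on $[0,1]$, so $\hat{\Delta}_0 = 1/2$; the function name, masses $c_1 = c_2 = 2$, and data are hypothetical):

```python
import numpy as np

def dp_two_sample_delta(x, y, F0, G0, delta0, c1, c2):
    """Bayes estimate of Delta = P(X <= Y) under independent DP priors
    with total masses c1 = alpha_1(R), c2 = alpha_2(R), prior-guess cdfs
    F0, G0, and prior estimate delta0 = integral of F0 dG0."""
    x, y = np.asarray(x), np.asarray(y)
    m, n = len(x), len(y)
    p1, p2 = c1 / (c1 + m), c2 / (c2 + n)
    U = np.sum(x[:, None] <= y[None, :])        # Mann-Whitney statistic
    return (p1 * p2 * delta0
            + p1 * (1 - p2) * np.mean(F0(y))
            + (1 - p1) * p2 * np.mean(1 - G0(x))
            + (1 - p1) * (1 - p2) * U / (m * n))

# Hypothetical priors: F0 = G0 = uniform cdf on [0, 1], so delta0 = 1/2.
F0 = G0 = np.vectorize(lambda t: float(np.clip(t, 0.0, 1.0)))
rng = np.random.default_rng(5)
x = rng.uniform(0.0, 1.0, size=30)
y = rng.uniform(0.2, 1.2, size=40)   # Y stochastically larger than X
dhat = dp_two_sample_delta(x, y, F0, G0, 0.5, c1=2.0, c2=2.0)
print(dhat)
```

For small $c_1, c_2$ the estimate is dominated by $U/(mn)$, the classical Mann-Whitney estimate of $\Delta$.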