More on Bayes and conjugate forms

Saad Mneimneh

1 A cool function, Γ(x) (Gamma)

The Gamma function is defined as follows:

$$\Gamma(x) = \int_0^\infty t^{x-1} e^{-t}\,dt$$

For $x > 1$, if we integrate by parts ($\int u\,dv = uv - \int v\,du$), we have:

$$\Gamma(x) = \Big[-t^{x-1} e^{-t}\Big]_0^\infty + (x-1)\int_0^\infty t^{x-2} e^{-t}\,dt = (x-1)\Gamma(x-1)$$

Note also that $\Gamma(1) = \int_0^\infty e^{-t}\,dt = 1$. We conclude that if $x$ is a positive integer, $\Gamma(x) = (x-1)!$. Therefore, the Gamma function represents a generalization of the factorial function, which is defined only on the non-negative integers.

Given any $a > 0$, the product of the following $k$ terms can now be expressed as:

$$a(a+1)(a+2)\cdots(a+k-1) = \frac{\Gamma(a+k)}{\Gamma(a)}$$

Perhaps the most famous value of the Gamma function at a non-integer is $\Gamma(1/2) = \sqrt{\pi}$. The Gamma function is a component of various probability density functions, as we will see later on.
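Since everything above reduces to concrete identities, a quick numerical sanity check is possible with Python's standard-library math.gamma; a minimal sketch (the values of $a$ and $k$ below are arbitrary choices):

```python
import math

# Gamma(x) = (x-1)! for positive integer x
for x in range(1, 8):
    assert math.isclose(math.gamma(x), math.factorial(x - 1))

# Gamma(1/2) = sqrt(pi)
assert math.isclose(math.gamma(0.5), math.sqrt(math.pi))

# a (a+1) (a+2) ... (a+k-1) = Gamma(a+k) / Gamma(a), e.g. with a = 2.5, k = 6
a, k = 2.5, 6
prod = 1.0
for i in range(k):
    prod *= a + i
assert math.isclose(prod, math.gamma(a + k) / math.gamma(a))

print("all Gamma identities verified")
```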
2 Chi-squared density

Let $X_1, \ldots, X_n$ be independent normal random variables such that $X_i \sim N(0,1)$, and consider the sum $V = X_1^2 + \cdots + X_n^2$. What is the probability density of $V$? For simplicity, let us first stop the sum at $X_1^2$, i.e. find the probability density of $X^2$ when $X \sim N(0,1)$.

$$P(-\sqrt{y}-\delta \le X \le -\sqrt{y}) + P(\sqrt{y} \le X \le \sqrt{y}+\delta) = P(y \le X^2 \le (\sqrt{y}+\delta)^2)$$

$$f_X(-\sqrt{y})\,\delta + f_X(\sqrt{y})\,\delta = f_{X^2}(y)\,(\delta^2 + 2\sqrt{y}\,\delta)$$

Taking the limit as $\delta \to 0$ (after dividing both sides by $\delta$), we have:

$$f_{X^2}(y) = \frac{\phi(\sqrt{y}) + \phi(-\sqrt{y})}{2\sqrt{y}} = \frac{1}{\sqrt{2\pi}}\,y^{-1/2} e^{-y/2}$$

The expression above is arranged to reveal a form similar to the integrand of the Gamma function. Therefore, using the change of variable $t = y/2$:

$$\int_0^\infty \frac{1}{\sqrt{2\pi}}\,y^{-1/2} e^{-y/2}\,dy = \frac{1}{\sqrt{\pi}}\int_0^\infty t^{-1/2} e^{-t}\,dt = \frac{\Gamma(1/2)}{\sqrt{\pi}} = 1$$

This shows that the density integrates to 1.

Another way of obtaining the above density is to use an appropriate change of variable. To illustrate this general technique, assume that $f_X(x)$ is given, and that we are interested in finding $f_Y(y)$ where $y = g(x)$. Observe that $dy = g'(x)\,dx$ and, therefore, $dx = dy/g'(x)$. Hence,

$$\int f_X(x)\,dx = \int \frac{f_X(x)}{|g'(x)|}\,dy = 1$$

with appropriate adjustment of the integral range. If $x = g^{-1}(y)$ is uniquely obtained from $y$, then we can write the above as follows:

$$\int \frac{f_X(g^{-1}(y))}{|g'(g^{-1}(y))|}\,dy = 1$$

implying that

$$f_Y(y) = \frac{f_X(g^{-1}(y))}{|g'(g^{-1}(y))|}$$

If, however, $x$ is not uniquely obtained from $y$, then we add up the contributions of all solutions. Let us apply this to our example, where $f_X(x) = \phi(x)$ and $y = g(x) = x^2$. We have $g'(x) = 2x$ and $x = \pm\sqrt{y}$; therefore, $|g'(x)| = 2\sqrt{y}$.

$$f_Y(y) = \frac{f_X(\sqrt{y}) + f_X(-\sqrt{y})}{2\sqrt{y}}$$

and since $f_X(x) = \phi(x)$ is symmetric, we get

$$f_Y(y) = \frac{2\phi(\sqrt{y})}{2\sqrt{y}} = \frac{\phi(\sqrt{y})}{\sqrt{y}} = \frac{1}{\sqrt{2\pi}}\,y^{-1/2} e^{-y/2}$$

which is as obtained before. The above also suggests that we can generalize the form of the density for $k$ as follows:

$$\frac{1}{2^{k/2}\,\Gamma(k/2)}\,y^{k/2-1} e^{-y/2}$$

where $f_{X^2}(y)$ is the special case $k = 1$. This is called the $\chi^2$ (chi-squared) density with parameter $k$, or $k$ degrees of freedom.
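As a sanity check on this derivation, the density can be compared against simulation: square a large sample of standard normals and match an interval probability against a Riemann sum of the density. A minimal NumPy sketch; the helper chi2_pdf simply codes the generalized density above, and the interval, sample size, and seed are arbitrary choices:

```python
import math
import numpy as np

rng = np.random.default_rng(0)
y = rng.standard_normal(1_000_000) ** 2          # X^2 with X ~ N(0,1)

def chi2_pdf(t, k):
    """The generalized density derived above, with k degrees of freedom."""
    return t ** (k / 2 - 1) * np.exp(-t / 2) / (2 ** (k / 2) * math.gamma(k / 2))

# P(1 <= X^2 <= 2): empirical frequency vs. a Riemann sum of the density
emp = np.mean((y >= 1.0) & (y <= 2.0))
grid = np.linspace(1.0, 2.0, 10_001)
num = np.sum(chi2_pdf(grid, 1)) * (grid[1] - grid[0])
print(emp, num)    # both approximately 0.16
```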
It turns out that $\chi^2_n$ is exactly the density of $V$ when $k = n$:

$$V = X_1^2 + \cdots + X_n^2 \sim \chi^2_n$$

This is because the sum of two independent $\chi^2$ random variables with parameters $k_1$ and $k_2$ is a $\chi^2$ random variable with parameter $k = k_1 + k_2$. To prove this, one can use the transform of a $\chi^2$ random variable. Recall that the transform of a random variable $Z$ is $E[e^{sZ}]$. Therefore, the transform of $Z_1 + Z_2$ is

$$E[e^{s(Z_1+Z_2)}] = E[e^{sZ_1 + sZ_2}] = E[e^{sZ_1} e^{sZ_2}] = E[e^{sZ_1}]\,E[e^{sZ_2}]$$

if $Z_1$ and $Z_2$ are independent. It is easy to show that the transform of a $\chi^2_k$ random variable is $(1-2s)^{-k/2}$ and, therefore, the transform of the sum of two independent $\chi^2$ random variables with parameters $k_1$ and $k_2$ is $(1-2s)^{-k_1/2}(1-2s)^{-k_2/2} = (1-2s)^{-(k_1+k_2)/2}$, which is the transform of a $\chi^2_k$ random variable where $k = k_1 + k_2$.
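The additivity claim can also be corroborated (not proved) by simulation: squared standard normals grouped as $k_1 + k_2$ terms should behave like a single $\chi^2$ variable with $k_1 + k_2$ degrees of freedom. A minimal sketch checking mean and variance (a $\chi^2_k$ variable has mean $k$ and variance $2k$; the sample size, seed, $k_1$, and $k_2$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
N, k1, k2 = 1_000_000, 3, 4

# chi^2 variables built directly as sums of squared standard normals
v1 = (rng.standard_normal((N, k1)) ** 2).sum(axis=1)   # ~ chi^2_{k1}
v2 = (rng.standard_normal((N, k2)) ** 2).sum(axis=1)   # ~ chi^2_{k2}
v = v1 + v2                                            # claimed ~ chi^2_{k1+k2}

# mean k (each X_i^2 has mean 1) and variance 2k, with k = k1 + k2 = 7
print(v.mean(), v.var())   # approximately 7 and 14
```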
Example: Fairness of dice. Consider throwing a pair of dice, and let $S$ be the random variable corresponding to the total obtained on a single throw. If the pair of dice is fair, $S$ can take the following values with the given probabilities:

s      2     3     4     5     6     7     8     9     10    11    12
p_s    1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

Assume that we throw the pair of dice $n = 144$ times, and let $Y_s$ be the random variable corresponding to the number of times we obtain a particular value $s$ of $S$. Note that $Y_s$ is the sum of $n$ Bernoulli trials with success probability $p_s$ as given in the above table. Consider the following real data:

s      2   3   4   5   6   7   8   9   10  11  12
y_s    2   4   10  12  22  29  21  15  14  9   6
np_s   4   8   12  16  20  24  20  16  12  8   4

How can we test whether or not the given pair of dice is loaded? For $s = 2, \ldots, 12$, and in general for $s = 1, \ldots, k$, consider the following random variable:

$$Z_s = \frac{Y_s - np_s}{\sqrt{np_s(1-p_s)}}$$

where $Y_s$ is the number of times outcome $s$ occurs, and $p_s$ is the corresponding probability (thus $np_s$ is the expected number of times $s$ should occur). A natural way is to consider $\sum_s Z_s^2$ and to see whether this sum is (probabilistically) too high or too low. By the central limit theorem, $Z_s \approx N(0,1)$ when $n$ is large. Therefore, $\sum_s Z_s^2 \sim \chi^2_k$. But the $Z_s$ are not completely independent! For instance, $Y_k$ can be computed if $Y_1, \ldots, Y_{k-1}$ are known (because $\sum_s Y_s = n$). Instead, let us consider

$$Z_s = \frac{Y_s - np_s}{\sqrt{np_s}}$$

and, for simplicity, assume we only have two possible outcomes ($Y_1 + Y_2 = n$, $p_1 + p_2 = 1$).

$$Z_1^2 + Z_2^2 = \frac{(Y_1 - np_1)^2}{np_1} + \frac{(Y_2 - np_2)^2}{np_2} = \frac{(Y_1 - np_1)^2}{np_1} + \frac{(-Y_1 + np_1)^2}{n(1-p_1)} = \frac{(Y_1 - np_1)^2}{np_1(1-p_1)} = Z^2$$

We know that $Z^2 \sim \chi^2_1$. In general (proof omitted), if $k - d$ of the variables are sufficient to determine all $k$, then

$$V = \sum_{i=1}^{k} Z_i^2 \sim \chi^2_{k-d}$$

Applying this to our dice problem ($k = 11$, $d = 1$), we compute $V = 7.1458$. From the table for $\chi^2_{10}$, we can find $P(V \ge v) = \int_v^\infty f_{\chi^2_{10}}(y)\,dy$, and see that $V = 7.1458$ falls between the entries for 5% and 95%, so it is neither significantly high nor significantly low; thus the pair of dice is satisfactory (not loaded).
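The computation of $V$ is easy to reproduce; the sketch below recomputes it from the table above and, assuming SciPy is available, uses scipy.stats.chi2.sf for the tail probability instead of a printed table:

```python
import numpy as np
from scipy.stats import chi2

n = 144
# probabilities of the totals 2..12 for a fair pair of dice
p = np.array([1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1]) / 36
# observed counts y_s from the table above
y = np.array([2, 4, 10, 12, 22, 29, 21, 15, 14, 9, 6])

V = np.sum((y - n * p) ** 2 / (n * p))
print(V)                    # approximately 7.1458

# tail probability P(V >= v) under chi^2 with k - d = 10 degrees of freedom
print(chi2.sf(V, df=10))    # approximately 0.71, between the 5% and 95% points
```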
3 Chi-squared prior

To keep a Bayesian spirit, the $\chi^2$ density provides a conjugate prior in a number of cases. For instance, let $x_1, \ldots, x_n$ be independent Poisson random variables with parameter $\lambda$. Then

$$P(x_1, \ldots, x_n \mid \lambda) \propto \lambda^T e^{-n\lambda}$$

where $T = \sum_i x_i$. If $2m\lambda \sim \chi^2_k$ for a prior, i.e.

$$f(\lambda) = \frac{2m}{2^{k/2}\,\Gamma(k/2)}\,(2m\lambda)^{k/2-1} e^{-m\lambda}$$

then

$$f(\lambda \mid x_1, \ldots, x_n) \propto \lambda^T e^{-n\lambda}\,(2m\lambda)^{k/2-1} e^{-m\lambda}$$

$$f(\lambda \mid x_1, \ldots, x_n) \propto \big(2(n+m)\lambda\big)^{(k+2T)/2-1}\, e^{-(n+m)\lambda}$$

$$2(n+m)\lambda \mid x_1, \ldots, x_n \sim \chi^2_{k+2T}$$

Now consider $n$ independent Gaussian random variables with a variance $\sigma^2$ that is unknown (the mean $\mu$ is known):

$$X_i \mid \sigma^2 \sim N(\mu, \sigma^2)$$

What would be an appropriate prior for $\sigma^2$ (a conjugate one)? Note that

$$f(\sigma^2 \mid x_1, \ldots, x_n) \propto \frac{1}{\sigma^n}\, e^{-\sum_i (x_i-\mu)^2/(2\sigma^2)}\, f(\sigma^2)$$

Let $S = \sum_i (x_i - \mu)^2$; then:

$$f(\sigma^2 \mid x_1, \ldots, x_n) \propto (1/\sigma^2)^{n/2}\, e^{-S/(2\sigma^2)}\, f(\sigma^2)$$

$$f(\sigma^2 \mid x_1, \ldots, x_n)\left|\frac{d\sigma^2}{d(1/\sigma^2)}\right| \propto (1/\sigma^2)^{n/2}\, e^{-S/(2\sigma^2)}\, f(\sigma^2)\left|\frac{d\sigma^2}{d(1/\sigma^2)}\right|$$

The term $d\sigma^2/d(1/\sigma^2)$ adjusts for changing the variable from $\sigma^2$ to $1/\sigma^2$, so that integration of the density is with respect to $1/\sigma^2$ rather than $\sigma^2$. We get:

$$f(1/\sigma^2 \mid x_1, \ldots, x_n) \propto (1/\sigma^2)^{n/2}\, e^{-S/(2\sigma^2)}\, f(1/\sigma^2)$$

If $S'/\sigma^2 \sim \chi^2_k$ for some $S'$, then:

$$f(1/\sigma^2) \propto (1/\sigma^2)^{k/2-1}\, e^{-S'/(2\sigma^2)}$$

Therefore,

$$f(1/\sigma^2 \mid x_1, \ldots, x_n) \propto (1/\sigma^2)^{(n+k)/2-1}\, e^{-(S+S')/(2\sigma^2)}$$

$$(S + S')/\sigma^2 \mid x_1, \ldots, x_n \sim \chi^2_{n+k}$$

Example: Consider the following sample ($n = 20$):

9 8 6 4 8 7 5 9 9 5 9 4 3 6 4 3

We can compute $\bar{x}$, and let us, for simplicity, assume that this is the real value of $\mu$. In this case, $S = 664$. Now one may argue that knowledge of $k$ and $S'$ is not available. In this case, let $k = 0$ and $S' = 0$ to get the following (improper) prior:

$$f(1/\sigma^2) \propto (1/\sigma^2)^{-1}$$

Therefore, the posterior will be given by $664/\sigma^2 \sim \chi^2_{20}$, and from the tables we see that $\sigma^2$ is between 20 and 75 with probability 0.95.
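The conjugacy claim can be checked numerically with the numbers of this example: on a grid over $1/\sigma^2$, the normalized posterior, after the change of variables $v = S/\sigma^2$, should match the $\chi^2_n$ density. A sketch under the improper prior (chi2_pdf is the same helper as before; the grid bounds are arbitrary):

```python
import math
import numpy as np

n, S = 20, 664

def chi2_pdf(t, k):
    """Chi-squared density with k degrees of freedom (as defined earlier)."""
    return t ** (k / 2 - 1) * np.exp(-t / 2) / (2 ** (k / 2) * math.gamma(k / 2))

# unnormalized posterior of u = 1/sigma^2 under the improper prior
# (k = 0, S' = 0): f(u | x) is proportional to u^{n/2 - 1} e^{-S u / 2}
u = np.linspace(1e-6, 0.2, 200_001)
post = u ** (n / 2 - 1) * np.exp(-S * u / 2)
post /= np.sum(post) * (u[1] - u[0])      # normalize on the grid

# change variables to v = S u = S / sigma^2; the density picks up a 1/S factor
v = S * u
print(np.max(np.abs(post / S - chi2_pdf(v, n))))   # essentially 0
```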
Note that this improper prior is uniform in $\log \sigma^2$:

$$f(\log \sigma^2) \propto (1/\sigma^2)^{-1} \Big/ \left|\frac{d(\log \sigma^2)}{d(1/\sigma^2)}\right| = (1/\sigma^2)^{-1} \Big/ \left|\frac{d(-\log(1/\sigma^2))}{d(1/\sigma^2)}\right| = \frac{(1/\sigma^2)^{-1}}{\sigma^2} = 1$$

which is uniform in $\log \sigma^2 \in (-\infty, +\infty)$. This prompts the following question: is the improper uniform prior equivalent to some conjugate form? In other words, is there a function $g(\theta)$ such that $f(\theta \mid x)$ is a valid density when $f(g(\theta)) \propto 1$? This is equivalent to the following statement (why?):

$$f(\theta \mid x) \propto f(x \mid \theta)\left|\frac{dg(\theta)}{d\theta}\right|$$

The answer to this of course depends on $f(x \mid \theta)$. For our example, it is clear that we need

$$\left|\frac{dg(1/\sigma^2)}{d(1/\sigma^2)}\right| = (1/\sigma^2)^{-1}$$

which is satisfied if $g(1/\sigma^2) = \log \sigma^2$. If we recall the example of deciding on the mean $\mu$ of independent Gaussian samples, $g(\mu) = \mu$ is appropriate.

Example: Let $f(x \mid \lambda) = \lambda e^{-\lambda x}$ (exponential density). Then

$$f(\lambda \mid x) \propto \lambda e^{-\lambda x}\left|\frac{dg(\lambda)}{d\lambda}\right|$$

Obviously, $g(\lambda) = \log \lambda$ provides a valid posterior density (exponential): $|dg(\lambda)/d\lambda| = 1/\lambda$, so $f(\lambda \mid x) \propto e^{-\lambda x}$. Therefore, the improper uniform prior $f(\log \lambda) \propto 1$ works.
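As a closing numerical check: with $g(\lambda) = \log \lambda$, the posterior for a single observation $x$ is proportional to $e^{-\lambda x}$, which integrates to $1/x$ and is therefore a proper (exponential) density once normalized. A minimal sketch (the observed value $x = 3$ is an arbitrary choice):

```python
import numpy as np

x = 3.0                                  # a single observed value (arbitrary)
lam = np.linspace(1e-9, 20.0, 2_000_001)

# f(lambda | x) proportional to f(x | lambda) |dg/dlambda|
#             = (lam * exp(-lam * x)) * (1 / lam) = exp(-lam * x)
post = lam * np.exp(-lam * x) / lam

# the unnormalized posterior integrates to 1/x, so it normalizes to an
# Exponential(rate = x) density in lambda
print(np.sum(post) * (lam[1] - lam[0]))  # approximately 1/3
```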