Scribe Notes 11/30/07


Lauren Hannah reviewed salient points from "Some Developments of the Blackwell-MacQueen Urn Scheme" by Jim Pitman. The paper links existing ideas on Dirichlet Processes (Chinese Restaurant Process, stick-breaking construction, etc.) to the Poisson-Dirichlet distribution. Lauren reviewed measure theory, described the Poisson-Dirichlet distribution, and reviewed theorems from the paper. Peter Frazier then presented a novel extension from a theorem in the paper to Dirichlet hyperparameters.

1 Measure theory review

E is a state space and ℰ is a σ-algebra (a collection of subsets of E) if

- A ∈ ℰ implies E \ A ∈ ℰ, i.e. the collection of subsets is closed under complements;
- A_1, A_2, ... ∈ ℰ implies ∪_{i=1}^∞ A_i ∈ ℰ, i.e. the collection of subsets is closed under countable unions.

Examples:

- ℰ = {∅, E}, i.e. the trivial σ-algebra
- ℰ = B(E), i.e. the Borel σ-algebra generated by the collection of all open sets, most often on the real line R, denoted by B(R)

(E, ℰ) is a measurable space. µ is a function ℰ → R_+ = [0, ∞]. µ is a measure if

- µ(∅) = 0
- µ(∪_n A_n) = ∑_n µ(A_n) whenever the subsets A_1, A_2, ... are pairwise disjoint (A_i ∩ A_j = ∅ for i ≠ j).

1.1 Lebesgue Measure

On R, Leb(A) is the length of the interval A; on R², the area. Usually denoted by Leb(A).

1.2 Dirac Measure

δ_x(A) = 1 if x ∈ A and 0 if x ∉ A, i.e. a point measure [or delta measure] at the point x.

2 Random measures

Recall that a random variable X is defined as a [deterministic] function that maps an outcome to a state space, e.g. X = f : Ω → R. Assume a state space (E, ℰ) and an outcome [or sample] space (Ω, H). M : Ω × ℰ → R_+ is a random measure if

- for fixed A ∈ ℰ, ω ↦ M(ω, A) is a random variable;
- for fixed ω ∈ Ω, A ↦ M(ω, A) is a measure on ℰ.

To understand the definition above it helps to temporarily drop all notions of randomness. In a purely deterministic world we have two objects: ω, a state of the world, and A, the subset of E (an element of ℰ, the σ-algebra on the state space) that we care about in a random experiment. We additionally have a function M(ω, A) that maps any combination of ω and A to the positive portion of the real line, R_+. In formal measure-theory terms, M is a transition kernel from the product space Ω × ℰ to R_+. See figs. 1, 2, and 3. Randomness comes about by stating that some other agent fixed ω in advance, yielding the observed A and the corresponding M(ω, A); or rather, as Prof. Cinlar likes to say, the Greek goddess of chance, Tyche, fixes ω for you.

Doesn't it seem unnecessarily complex and unproductive to describe the source of randomness in a model through ω and then map it to another variable of interest, A? Why not just represent probabilities, etc., directly on the values of the variable of interest, e.g. through a probability density function? Well, the relationship between states of the world and experiment realizations may be more complicated than just a 1:1 bijection; e.g. consider an experiment where the cardinalities of H and ℰ differ substantially: many states of the world influencing the outcome of a binary-valued variable of interest vs. only two states of the world influencing the outcome. As more complex issues arise, one finds additional comfort modeling a random experiment in measure-theoretic terms. Joint probabilities and conditional expectations may be interpreted as the sequential composition of functions, mapping a value from one space to another, and to the next, and so on. Thus, by positing the existence of ω c. 1920, the forefathers of probability theory (Kolmogorov et al.) allowed us to retain the deterministic tools of math to analyze seemingly random or undeterministic systems. Flowery notions of chance and dice throws were violently usurped, seemingly overnight, by the strict formalism of measure theory and calculus, the very elements of probability theory as we know them today. One can only imagine the bewilderment of haggardly gambling-types upon initial confrontation with a presumably equally-bewildered camp of geeky Russian mathematicians. This mixing phenomenon still resonates within the halls of the Princeton Graduate School, a.k.a. "Where the Nerd World Meets the Third World."
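To make the two bullets in the definition concrete, here is a minimal Python/NumPy sketch (not from the lecture; the particular atoms-and-weights construction, parameters, and seed are arbitrary illustrative choices). It builds a purely atomic random measure M(ω, A) = ∑_i w_i δ_{x_i}(A): fixing ω gives an additive set function on subsets of E = [0, 1], while fixing A and re-drawing ω gives a real-valued random variable.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_omega(n_atoms=5):
    """Tyche's draw: fix omega by sampling random atom locations and weights.
    (Illustrative choice only; any joint distribution of atoms/weights would do.)"""
    locs = rng.uniform(0.0, 1.0, size=n_atoms)   # atom locations x_i in E = [0, 1]
    wts = rng.exponential(1.0, size=n_atoms)     # nonnegative weights w_i
    return locs, wts

def M(omega, A):
    """Random measure M(omega, A) = sum_i w_i * delta_{x_i}(A),
    where the set A is passed as an indicator function on E."""
    locs, wts = omega
    return float(np.sum(wts * np.array([A(x) for x in locs])))

A = lambda x: 0.25 <= x < 0.75   # a fixed measurable subset of E

# Fix omega, vary A: a deterministic measure, hence additive over disjoint sets.
omega = sample_omega()
left = lambda x: 0.25 <= x < 0.50
right = lambda x: 0.50 <= x < 0.75
assert abs(M(omega, A) - (M(omega, left) + M(omega, right))) < 1e-12

# Fix A, vary omega: an R_+-valued random variable.
draws = [M(sample_omega(), A) for _ in range(10_000)]
print("mean of M(., A) over many omegas:", np.mean(draws))   # about 5 * 1.0 * 0.5 = 2.5
```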

Figure 1: Fixing ω ∈ Ω. The resulting cdf of M(ω, ·) on (E, ℰ) is depicted here. The cdf is a deterministic measure dependent on ω.

Figure 2: Fixing A. The area of A is a random variable dependent on the underlying measure, which itself depends on ω.

Figure 3: Fixing A. Let M(ω, ·) be Poisson(λ). The measure of A [the # of dots in A, here 4] is a Poisson-distributed random variable. The dot pattern on E is a realization of the outcome ω.

An example of a random measure that we've seen is a Dirichlet Random Measure (DRM). In a DRM, realizations of M are almost surely purely atomic measures on E. Fixing ω while varying A, we obtain an R_+-valued scalar M(ω, A) for each A, which is a deterministic measure, here a probability distribution, on (E, ℰ). This would look like fig. 4. Similarly [as is often done with Poisson random measures, although not commonly with Dirichlet Processes] we can isolate a specific subset A, i.e. restrict ourselves to only a portion of our state space, while varying ω. This yields an R_+-valued scalar random variable ω ↦ M(ω, A). See fig. 5.

Figure 4: DRM realization fixing ω ∈ Ω. Discrete stick-valued probability atoms sitting on E, summing to 1.

Figure 5: Fixing A.

3 Poisson random measures

Let ν be a measure on (E, ℰ). A random measure N is a Poisson random measure with mean measure ν if

- N(A) is a Poisson random variable:

  P(N(A) = k) = e^{-ν(A)} ν(A)^k / k!

- whenever A_1, A_2, ..., A_n are disjoint, the random variables N(A_1), N(A_2), ..., N(A_n) are independent.
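As a quick sanity check of these two defining properties, here is a sketch (not part of the notes) for the simplest homogeneous case, where the mean measure is ν(dx) = λ dx on E = [0, 1] and NumPy supplies the randomness:

```python
import numpy as np

rng = np.random.default_rng(1)
lam = 10.0   # mean measure nu(dx) = lam * dx on E = [0, 1]

def sample_poisson_scatter():
    """One realization omega of the Poisson random measure:
    a Poisson(lam) total number of points, placed uniformly on [0, 1]."""
    n = rng.poisson(lam)
    return rng.uniform(0.0, 1.0, size=n)

def N(points, a, b):
    """N(A) for A = [a, b): the number of points falling in A."""
    return int(np.sum((points >= a) & (points < b)))

# Property 1: N(A) ~ Poisson(nu(A)), so mean and variance both equal nu(A).
counts = [N(sample_poisson_scatter(), 0.2, 0.5) for _ in range(20_000)]
print("empirical mean/var of N([0.2, 0.5)):", np.mean(counts), np.var(counts))
print("nu([0.2, 0.5)) =", lam * 0.3)

# Property 2: counts over disjoint sets are (empirically) uncorrelated.
pairs = np.array([(N(p, 0.0, 0.5), N(p, 0.5, 1.0))
                  for p in (sample_poisson_scatter() for _ in range(20_000))])
print("correlation of counts over disjoint sets:", np.corrcoef(pairs.T)[0, 1])
```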

3.1 Examples of Poisson random measures

Figure 6: Poisson arrival process.

Figure 7: Compound Poisson Process.

Poisson Process. See fig. 6. Interarrival times T_i ~ exp(λ). The number of arrivals in each disjoint subinterval of the space is independent. A is an arbitrary subset, e.g. A = {(X_i, Γ_i) : Γ_i² > X_i²}.

Compound Poisson Process. See fig. 7. The X_i are the Poisson arrival locations and the Γ_i are gamma-distributed values. A is a region of the x-y plane. The # of dots in A is a Poisson random variable with parameter p, where p is the integral of the mean measure over the region; here the mean measure is a Lebesgue measure (λ dt) on the x-axis [time] times a gamma measure on the y-axis [value]. For a pictorial understanding of the y-axis, see fig. 8.
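The claim that the number of dots in a region is Poisson with parameter equal to the mean measure of that region can be checked directly. The sketch below (not from the notes) uses Exponential(1) marks, i.e. the shape-1 special case of the gamma marks in fig. 7, so the mean measure of a rectangle has a closed form:

```python
import numpy as np

rng = np.random.default_rng(2)
lam, T = 4.0, 50.0   # arrival rate and time horizon

def sample_marked_scatter():
    """One realization: Poisson(lam * T) arrivals on [0, T] (equivalently,
    exp(lam) inter-arrival gaps), each carrying an independent Gamma(1, 1) mark."""
    n = rng.poisson(lam * T)
    times = rng.uniform(0.0, T, size=n)
    marks = rng.gamma(shape=1.0, scale=1.0, size=n)   # shape 1 = Exponential(1)
    return times, marks

# Rectangle A = [t1, t2) x [y1, y2) in the time-value plane.
t1, t2, y1, y2 = 10.0, 20.0, 0.5, 2.0

def count_in_A(times, marks):
    return int(np.sum((times >= t1) & (times < t2) & (marks >= y1) & (marks < y2)))

counts = [count_in_A(*sample_marked_scatter()) for _ in range(20_000)]

# Mean measure of A: (Lebesgue on the time axis) x (mark distribution on the value axis).
p = lam * (t2 - t1) * (np.exp(-y1) - np.exp(-y2))
print("empirical mean/var of N(A):", np.mean(counts), np.var(counts))
print("integral of the mean measure over A:", p)   # all three should agree
```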

4 Theorems from the paper

We now review some theorems from the paper.

4.1 Theorems 1 and 2

Let (X_n) be a sample from F. Then

  F | X_1, ..., X_n ~ Dirichlet(µ_n),   µ_n = µ + ∑_{i=1}^n δ(X_i).

Note: here, µ = αG_0 from before.

4.2 Theorem 3

Let µ be a measure with 0 < µ(S) < ∞, ν = µ/θ, θ = µ(S). Let Γ_(1) > Γ_(2) > ... be the points of a Poisson random measure on (0, ∞) with mean measure θ x⁻¹ e^{-x} dx.

To further aid our understanding of what a gamma measure might look like, we isolate the y-axis from fig. 7 above. Plotting the y-values now horizontally, we observe a dense clustering of points [raindrops] near the origin, thinning out as x → ∞. See fig. 8.

Figure 8: The y-values Γ_i in fig. 7 are reordered and labeled Γ_(i), here plotted horizontally. The dotted line, the resulting density of the Γ_(i), is a gamma distribution.

If we were to count the # of points falling in each subinterval of the x-axis, in other words integrate the mean measure θ x⁻¹ e^{-x} dx over the subintervals [ɛ, ∞) and [0, ɛ], we would get

  ∫_ɛ^∞ θ x⁻¹ e^{-x} dx ≤ (θ/ɛ) ∫_ɛ^∞ e^{-x} dx < ∞   (1)

  ∫_0^ɛ θ x⁻¹ e^{-x} dx ≥ θ e^{-ɛ} ∫_0^ɛ x⁻¹ dx = ∞.   (2)

(1) reveals that there are a finite # of points as x → ∞, while (2) reveals that there are an infinite # of points on [0, ɛ]. Note that the ∞ on the right-hand side of (2) does not render the measure invalid, since we are not using the Lebesgue measure but instead the gamma measure. Also note that the factor θ x⁻¹ gives us (1) and (2), since as x → ∞ the mean measure θ x⁻¹ e^{-x} dx → 0 [i.e. the intensity of raindrops decreases].

Peter illustrated how we would simulate these raindrops on a computer:

1. Choose an interval of the x-axis [ɛ_1, ∞). For example, [1, ∞).
2. To determine the # of points in the subinterval, draw a sample N from a Poisson distribution whose rate is the mean measure of the subinterval, i.e. compute ∫_1^∞ θ x⁻¹ e^{-x} dx. Let's assume that N = 3.
3. Obtain the x-values for these N points by taking N samples from this gamma distribution, conditioned on the interval [1, ∞). Here, we sample 3 values from the conditional gamma distribution and plot the 3 points along the x-axis at those values.
4. Choose the next interval of the x-axis [ɛ_2, ɛ_1), e.g. [0.1, 1), and repeat from step 2.

As ɛ → 0, we achieve our x-axis raindrop plot as depicted above. Thus, the density parameter θ x⁻¹ controls both the number of points and the location of the points in each subinterval. Generating a sequence of (x_i, y_i) values using the recipe above for the y_i, while drawing the x_i with exp(λ) gaps, yields the Compound Poisson process depicted above.
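Here is a minimal NumPy version of this recipe (a sketch, not the code shown in lecture; the grid-based inverse-CDF sampler and the finite stand-ins for 0 and ∞ are implementation shortcuts):

```python
import numpy as np

rng = np.random.default_rng(3)
theta = 2.0   # plays the role of the Dirichlet concentration parameter

def intensity(x):
    """Mean-measure density theta * x^{-1} * e^{-x} on (0, infinity)."""
    return theta * np.exp(-x) / x

def sample_interval(a, b, grid_size=10_000):
    """Points of the Poisson random measure falling in [a, b):
    draw a Poisson(mass of [a, b)) count, then place the points by sampling
    the normalized intensity restricted to [a, b) (inverse-CDF on a grid)."""
    x = np.linspace(a, b, grid_size)
    dens = intensity(x)
    mass = float(np.sum(0.5 * (dens[:-1] + dens[1:]) * np.diff(x)))   # integral over [a, b)
    n = rng.poisson(mass)
    cdf = np.cumsum(dens)
    cdf /= cdf[-1]
    return np.interp(rng.uniform(size=n), cdf, x)

# Step through subintervals [eps_k, eps_{k-1}) as eps -> 0, plus a tail interval.
edges = [1e-4, 1e-3, 1e-2, 1e-1, 1.0, 20.0]   # 20.0 stands in for "infinity"
points = np.concatenate([sample_interval(a, b) for a, b in zip(edges[:-1], edges[1:])])

print("total number of raindrops:", len(points))
print("sum of the raindrops (one draw; should behave like Gamma(theta, 1)):", points.sum())
```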

D. Blei provided some intuition relating the θ parameter in this Compound Poisson process to the α parameter as we have traditionally seen it.

- θ, α large: more points having a smaller height. In fig. 7 this would correspond to many raindrops clustered around y = 0, which, in a Dirichlet stick-breaking figure, looks like many low-probability stick lengths spread out evenly along the x-axis.
- θ, α small: fewer points having a larger height. In fig. 7 this would correspond to fewer raindrops around y = 0 and more raindrops around higher y-values, which, in fig. 4, would look like a few highly-varying probability stick lengths with a larger degree of clustering.

Continuing with the theorem, put

  P_i = Γ_(i) / Σ,   Σ = ∑_{i=1}^∞ Γ_(i).

Finally, define

  F = ∑_{i=1}^∞ P_i δ(X̂_i)   (3)

where the X̂_i are i.i.d. (ν), independent also of the Γ_(i). [It may be worth noting that we only use the Poisson random measure to generate the gamma random variables (rather than just drawing gamma random variables) in order to get an ordering of the gamma random variables. Otherwise, try to answer the question: have we drawn the largest random variable yet? The second largest? etc.]

Then F has a Dirichlet(θν) distribution, independently of Σ, which has a gamma(θ) distribution.
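To connect equation (3) back to something checkable, the following hedged sketch (not from the notes) reuses the raindrop recipe, truncates (0, ∞) to [10⁻⁴, 20], attaches uniform tags as the X̂_i, and compares the first two moments of F(A) with those of the Beta(θ ν(A), θ (1 - ν(A))) marginal that a Dirichlet(θν) random measure would have:

```python
import numpy as np

rng = np.random.default_rng(4)
theta = 2.0

# Grid approximation of the mean measure theta x^{-1} e^{-x} dx on [eps, upper].
# (The truncation and the grid inverse-CDF sampler are shortcuts, not part of the theorem.)
eps, upper = 1e-4, 20.0
x = np.linspace(eps, upper, 50_000)
dens = theta * np.exp(-x) / x
mass = float(np.sum(0.5 * (dens[:-1] + dens[1:]) * np.diff(x)))   # total mean measure
cdf = np.cumsum(dens)
cdf /= cdf[-1]

def raindrops():
    """One realization of the (truncated) Poisson random measure's points."""
    n = rng.poisson(mass)
    return np.interp(rng.uniform(size=n), cdf, x)

def draw_F_of_A(nu_A=0.3):
    """One draw of F(A) for F = sum_i P_i delta(X_hat_i), X_hat_i iid nu = U[0, 1],
    and A = [0, nu_A): add up the normalized weights whose atom lands in A."""
    gammas = raindrops()
    weights = gammas / gammas.sum()            # the P_i of equation (3)
    atoms = rng.uniform(size=len(weights))     # the X_hat_i
    return float(weights[atoms < nu_A].sum())

draws = np.array([draw_F_of_A() for _ in range(2_000)])
# If F ~ Dirichlet(theta * nu), then F(A) ~ Beta(theta*nu(A), theta*(1 - nu(A))).
print("empirical mean of F(A):", draws.mean(), "(target 0.3)")
print("empirical var  of F(A):", draws.var(), "(target", 0.3 * 0.7 / (theta + 1.0), ")")
```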

How does a gamma distribution emerge from a Poisson random measure? Specifically, how do we have Σ ~ gamma(θ, 1)? Let f(x) = x and

  Nf = ∫_{R_+} f(x) N(ω, dx)   [N(ω, dx) is a measure]
     = ∑_{i=1}^∞ ∫ x δ_{Γ_i}(dx) = ∑_{i=1}^∞ Γ_i.

Then

  E[e^{-rΣ}] = E[e^{-rNf}]   (4)
             = exp( - ∫_0^∞ θ x⁻¹ e^{-x} (1 - e^{-rx}) dx )   (5)
             = exp( - θ log(1 + r) )   (6)
             = (1/(1 + r))^θ,   so Nf ~ gamma(θ, 1).   (7)

Note: the step (4) → (5) is a property of Poisson random measures: E[e^{-rNf}] = exp( - ∫ ν(dx) (1 - e^{-r f(x)}) ).

Thus, using this setup, we are able to order our stick breaks such that they follow a gamma distribution. Each data point X_i has an associated value Γ_i drawn from a Poisson random measure with mean measure θ x⁻¹ e^{-x} dx. Reordering the Γ_i yields a gamma distribution with parameter θ, as shown in eqn. 7. Integrating Γ_(i) over ω ∈ Ω and normalizing so that the weights sum to 1 gives the posterior probability stick-lengths. The parameter θ of the obtained gamma distribution is exactly the hyperparameter α of the posterior Dirichlet distribution.
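A quick numerical sanity check of the step from (5) to (6), approximating the exponent integral on a finite grid (a sketch; the grid limits and the values of θ and r are arbitrary):

```python
import numpy as np

theta, r = 2.0, 0.7

# Check (5) -> (6): the exponent integral equals log(1 + r).
x = np.linspace(1e-8, 60.0, 2_000_000)          # fine grid standing in for (0, infinity)
integrand = np.exp(-x) * (1.0 - np.exp(-r * x)) / x
integral = float(np.sum(0.5 * (integrand[:-1] + integrand[1:]) * np.diff(x)))

print("integral of x^{-1} e^{-x} (1 - e^{-rx}) dx:", integral)
print("log(1 + r):", np.log(1.0 + r))
print("exp(-theta * integral):", np.exp(-theta * integral))
print("(1 + r)^{-theta}  [gamma(theta, 1) Laplace transform]:", (1.0 + r) ** (-theta))
```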

5 Theorem 5

[Definitions 4 and 6, as well as Theorem 5 and its associated corollaries from the paper, were written out.]

6 Species sampling

We demonstrate how the previous definition can be applied to species sampling models. First, let

- (P_i) be the frequencies (of each species);
- (X̂_i) be the tags (of each species);
- X_n be the species of the nth observation;
- X̃_j be the jth species to be observed;
- N_jn be the number of times the jth species appears in the sample X_1, ..., X_n, i.e. ∑_{m=1}^n 1(X_m = X̃_j);
- K_n be the number of distinct species in the first n observations, i.e. max{j : N_jn > 0}.

Further assume that ν is a uniform distribution U[0, 1]. Using the Blackwell-MacQueen prediction rule, the predictive distribution can be written as

  P(X_{n+1} ∈ A | X_1, ..., X_n, K_n = k) = ∑_{j=1}^k (N_jn / (n + θ)) 1(X̃_j ∈ A) + (θ / (n + θ)) ν(A).   (8)
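A small simulation of the prediction rule (a sketch, not from the notes): it tracks only the tags and the counts N_jn, and compares the number of distinct species it produces against the standard identity E[K_n] = ∑_{i=0}^{n-1} θ/(θ + i) for this urn.

```python
import numpy as np

rng = np.random.default_rng(5)
theta, n = 2.0, 500

def sample_sequence():
    """Sample X_1, ..., X_n from the prediction rule (8) with nu = U[0, 1]:
    a new tag w.p. theta/(i + theta), otherwise an old tag w.p. proportional to its count."""
    tags, counts = [], []
    for i in range(n):
        if rng.uniform() < theta / (i + theta):
            tags.append(rng.uniform())          # new species, tag drawn from nu
            counts.append(1)
        else:
            j = rng.choice(len(tags), p=np.array(counts) / i)   # existing species j
            counts[j] += 1
    return counts

k_draws = [len(sample_sequence()) for _ in range(500)]
expected_k = np.sum(theta / (theta + np.arange(n)))   # E[K_n] = sum_i theta/(theta + i)
print("average number of distinct species K_n:", np.mean(k_draws))
print("theoretical E[K_n]:", expected_k)
```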

Observe that the terms N_jn/(n + θ) and θ/(n + θ) can be seen as functions of the partition. This leads to generalizations of Equation 8 wherein the aforementioned terms are replaced by other functions of the partition. Define N_n ≡ (N_1n, N_2n, ...) ≡ (n_1, ..., n_k) ≡ n, and

  p_j(n) = P(X_{n+1} = X̃_j | N_n = n),   1 ≤ j ≤ k(n) + 1   (9)

  P(X_1 ∈ A) = ν(A).   (10)

With these definitions in place, (X_n) now has a distribution determined by the p_j(n). The Blackwell-MacQueen rule can then be seen as a special case of this formulation, with

  p_j(n) = n_j / (n + θ)   (1 ≤ j ≤ k),   p_j(n) = θ / (n + θ)   (j = k + 1).   (11)

A statement about this more general case is given by Proposition 12. It assumes that

1. (X_n) is an exchangeable sequence;
2. (X_n) obeys the predictive rules given by Equation 9 and Equation 10.

Then the predictive distribution converges (in total variation norm) a.s. as n → ∞ to

  F(·) = ∑_j P_j δ(X̃_j)(·) + (1 - ∑_j P_j) ν(·),   (12)

where the X̃_j are drawn i.i.d. from ν and

  P_j = lim_{n→∞} N_jn / n.   (13)

This proposition implies that

- the number of values may be finite;
- F is almost surely discrete iff ∑_j P_j = 1;
- the form of F is not specified.

7 Exchangeable partition probability functions (EPPFs)

Let [n] = {1, ..., n} be partitioned into k non-empty subsets A_1, ..., A_k. If (X_n) is exchangeable then

  P( ∩_{j=1}^k {X_l = X̃_j, l ∈ A_j} ) = p(#A_1, ..., #A_k),   (14)

where p is symmetric and #A_j denotes the number of elements of A_j. p is called the exchangeable partition probability function (EPPF) derived from the exchangeable sequence (X_n). Now denote

  n = (n_1, ..., n_k, 0, 0, ...)   (15)

  n^{j+} = (n_1, ..., n_j + 1, ..., n_k, 0, ...).   (16)

An EPPF must then satisfy

  p(1) = 1   (17)

  p(n) = ∑_{j=1}^{k(n)+1} p(n^{j+}).   (18)

Therefore, a p defined in such a way is an EPPF of an exchangeable sequence (X_n). Such a p can thus be used to define the predictive rules given in Equation 9 and Equation 10 in such a way as to satisfy the condition for Proposition 12 (Equation 12). The p_j governing the prediction rule can be written in terms of p as

  p_j(n) = p(n^{j+}) / p(n),   1 ≤ j ≤ k(n) + 1,   p(n) > 0.   (19)

The previous statement can be clarified by Proposition 13: corresponding to each pair (p, ν), where p is an EPPF and ν is a diffuse probability distribution, there is a unique distribution for a sampling sequence (X_n) such that p is the EPPF of (X_n) and ν is the distribution of X_1.

The implication of all this is the stronger Theorem 14, which states: given a diffuse probability distribution ν and a sequence of functions p_j which satisfy Equation 15 and Equation 16, let (X_n) be a sequence governed by Equation 9 and Equation 10. (X_n) is exchangeable iff there exists a non-negative, symmetric function p such that Equation 19 holds. Then (X_n) is a sample from F as in Equation 12 and the EPPF of (X_n) is uniquely this p.

Finally, we can once again return to the Blackwell-MacQueen Urn Scheme we are familiar with. For a given θ, we can write the EPPF p_θ as

  p_θ(n) = θ^{k-1} ∏_{i=1}^k (n_i - 1)! / [1 + θ]_{n-1},   (20)

where [x]_m = ∏_{j=0}^{m-1} (x + j). The result of Theorem 14 is that p_θ is the EPPF of a sample from a Dirichlet process with parameter θν.
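Finally, a short plain-Python check of equation (20) against the EPPF consistency condition (18) and the Blackwell-MacQueen weights of (11) via relation (19) (a sketch; the composition n = (3, 1, 2) and θ = 2 are arbitrary choices):

```python
import math

def rising(x, m):
    """[x]_m = x (x + 1) ... (x + m - 1), with [x]_0 = 1."""
    out = 1.0
    for j in range(m):
        out *= x + j
    return out

def eppf(counts, theta):
    """Equation (20): p_theta(n_1, ..., n_k) = theta^(k-1) prod (n_i - 1)! / [1 + theta]_(n-1)."""
    k, n = len(counts), sum(counts)
    num = theta ** (k - 1) * math.prod(math.factorial(c - 1) for c in counts)
    return num / rising(1.0 + theta, n - 1)

theta = 2.0
n = (3, 1, 2)          # block sizes of a partition of [6] into k = 3 blocks
k, total = len(n), sum(n)

# Consistency condition (18): p(n) = sum over the k + 1 ways of placing element n + 1.
succ = [eppf(n[:j] + (n[j] + 1,) + n[j + 1:], theta) for j in range(k)]
succ.append(eppf(n + (1,), theta))           # the new-block case, j = k + 1
print("p(n):", eppf(n, theta))
print("sum_j p(n^{j+}):", sum(succ))         # should equal p(n)

# Relation (19) recovers the Blackwell-MacQueen weights of equation (11).
print("p(n^{j+}) / p(n):", [s / eppf(n, theta) for s in succ])
print("n_j/(n+theta) and theta/(n+theta):",
      [c / (total + theta) for c in n] + [theta / (total + theta)])
```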
