Partition models and cluster processes


1  Partition models and cluster processes
With applications to classification
Jie Yang, Department of Statistics, University of Chicago
ICM, Madrid, August 2006

2  Outline
1. Partition models and cluster processes
2. Cox process models: permanent partition process; superposition of permanent processes; permanent partition models; conditional relative intensity; chequerboard illustration

3  Fisher discriminant model: description as a process (i.i.d.)
k classes with frequencies π_1, ..., π_k.
Features X for class r are i.i.d. N(µ_r, Σ) on X = R^q.
Observations (Y_1, X_1), (Y_2, X_2), ... are i.i.d. with pr(Y_i = r) = π_r.
Given Y, the features are independent, X_j ~ N(µ_{y_j}, Σ).
Density of (Y_i, X_i) at (y, x):
    π_y (2π)^{-q/2} |Σ|^{-1/2} exp(-(x - µ_y)' Σ^{-1} (x - µ_y)/2)

4  Fisher discriminant model: application to classification
Estimation step: π̂_r = n_r/n, µ̂_r = x̄_r, Σ̂ = S.
Classification step: for a new unit u* such that X(u*) = x*,
    pr(Y(u*) = r | data) ∝ π_r φ(x* - µ_r; Σ)
(independent of the data other than x*).
Substitution step:
    p̂r(Y(u*) = r | data) ∝ n_r φ(x* - x̄_r; S)
(dependent on the data through the estimates).
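As a concrete illustration, a minimal Python sketch of the estimation and substitution steps; the two-class Gaussian data, function names, and pooled-covariance estimator are our own illustrative choices, not from the talk.

```python
# Sketch of the Fisher discriminant steps: estimate (pi, mu, Sigma),
# then classify a new point by substituting the estimates.
import numpy as np

def fit_fisher(X, y):
    """Estimation step: class counts n_r, class means xbar_r, pooled S."""
    classes = np.unique(y)
    n = len(y)
    counts = {r: int(np.sum(y == r)) for r in classes}
    means = {r: X[y == r].mean(axis=0) for r in classes}
    S = sum((X[y == r] - means[r]).T @ (X[y == r] - means[r]) for r in classes)
    S /= (n - len(classes))
    return counts, means, S

def classify_fisher(x_new, counts, means, S):
    """Substitution step: pr(Y = r | data) proportional to n_r * phi(x - xbar_r; S)."""
    Sinv = np.linalg.inv(S)
    scores = {}
    for r, mu in means.items():
        d = x_new - mu
        scores[r] = counts[r] * np.exp(-0.5 * d @ Sinv @ d)
    total = sum(scores.values())
    return {r: s / total for r, s in scores.items()}

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.repeat([0, 1], 50)
counts, means, S = fit_fisher(X, y)
print(classify_fisher(np.array([1.5, 1.5]), counts, means, S))
```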

5  Description as a process (independent but not identically distributed)
Feature x(u) regarded as a covariate for unit u.
Class Y(u) regarded as the response.
Linear logistic model: components Y(u_1), Y(u_2), ... independent with
    log pr(Y(u) = r) = α_r + β_r' x(u) + const,
consistent with the Fisher model when β_r = Σ^{-1} µ_r.

6  Application to classification
Estimation step: maximum likelihood gives α̂, β̂.
Classification step:
    pr(Y(u*) = r | data) = exp(α_r + β_r' x(u*)) / Σ_c exp(α_c + β_c' x(u*))
(independent of the data other than x(u*)).
Substitution step: β → β̂ (fitted = predicted).
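The same classification problem via the linear logistic route, in a hedged sketch using scikit-learn's maximum-likelihood fit; the data and the library choice are ours, not the talk's.

```python
# Estimation by maximum likelihood, then classification with the fitted
# coefficients substituted for the true ones (fitted = predicted).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.repeat([0, 1], 50)

model = LogisticRegression().fit(X, y)      # gives alpha-hat, beta-hat
print(model.predict_proba([[1.5, 1.5]]))    # class probabilities at x(u*)
```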

7  Random partition of [n] = {1, ..., n}
A set of disjoint non-empty subsets (blocks) whose union is [n].
n = 2: B = {{1, 2}} = 12 or B = {{1}, {2}} = 1|2.
n = 3: B_3 = {123, 1|23, 2|13, 3|12, 1|2|3}.
n = 4: B_4 = {1234, 123|4 [4], 12|34 [3], 12|3|4 [6], 1|2|3|4}.
Bell numbers: #B_3 = 5; #B_4 = 15; #B_5 = 52; #B_6 = 203.
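The Bell numbers above can be checked by brute force. A short Python sketch (our own recursive enumeration, not from the talk) that lists the set partitions of [n]:

```python
def partitions(elements):
    """Yield every set partition of `elements` (a list), as a list of blocks."""
    if not elements:
        yield []
        return
    first, rest = elements[0], elements[1:]
    for p in partitions(rest):
        for i in range(len(p)):                        # put `first` into block i
            yield p[:i] + [[first] + p[i]] + p[i + 1:]
        yield [[first]] + p                            # or into a new block

for n in range(3, 7):
    print(n, sum(1 for _ in partitions(list(range(1, n + 1)))))
# prints the Bell numbers: 5, 15, 52, 203
```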

8  Permutation of elements
B^σ(i, j) = B(σ(i), σ(j)) for the partition matrix B (a 5×5 example matrix B is displayed on the slide).
Example 1: σ = (1, 3): B^σ shown as a matrix.
Example 2: σ = (2, 4, 5): B^σ(1, 5) = B(1, 2) = 1, B^σ(1, 2) = B(1, 4) = 0.
Block sizes are unaffected.
Exchangeable product partition distribution (Hartigan 1990):
    p_n(B) = D_n Π_{b∈B} C(#b)

9  Deletion of elements
Deletion (of 5) from 12345 gives 1234 in B_4.
Deletion (of 5) from 1234|5 gives 1234 in B_4.
Deletion is a map D: B_n → B_{n-1}. How does this affect distributions?
Example: p_5 is uniform on the 52 elements of B_5.
The marginal distribution on B_4 after deletion is
    p_4(1234) = p_5(12345) + p_5(1234|5) = 2/52
    p_4(123|4) = p_5(1235|4) + p_5(123|45) + p_5(123|4|5) = 3/52
    p_4(12|3|4) = p_5(125|3|4) + p_5(12|35|4) + p_5(12|3|45) + p_5(12|3|4|5) = 4/52
The marginal distribution is not uniform.
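A sketch verifying this deletion example numerically, assuming the uniform distribution on B_5; the restricted-growth-string encoding is our own device, not from the talk.

```python
# Encode a partition of [n] by a restricted growth string (RGS): s[i] is the
# block label of element i+1, with s[i] <= max(prefix) + 1. Deleting the
# last element just drops the last label, so counting prefixes counts the
# preimages of each partition of [4].
from collections import Counter
from itertools import product

def rgs(n):
    for s in product(range(n), repeat=n):
        if all(s[i] <= max(s[:i], default=-1) + 1 for i in range(n)):
            yield s

marginal = Counter(s[:4] for s in rgs(5))   # counts out of the 52 elements of B_5
print(marginal[(0, 0, 0, 0)])   # 2 -> p4(1234)   = 2/52
print(marginal[(0, 0, 0, 1)])   # 3 -> p4(123|4)  = 3/52
print(marginal[(0, 0, 1, 2)])   # 4 -> p4(12|3|4) = 4/52
```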

10  Ewens partition process
    p_n(B) = λ^{#B} (Γ(λ)/Γ(n + λ)) Π_{b∈B} Γ(#b)
Exponential family of distributions on B_n:
    canonical parameter θ = log λ
    canonical statistic #B, with E(#B) ≈ λ log(n)
    cumulant function log Γ(n + e^θ) - log Γ(e^θ)
Product partition model: C(#b) = λ (#b - 1)!.
Finitely exchangeable for each n: p_n is the marginal distribution of p_{n+1}.

11  Ewens distribution for n = 4
    p_4(1234) = 6/((λ+1)(λ+2)(λ+3))
    p_4(123|4) = 2λ/((λ+1)(λ+2)(λ+3))
    p_4(12|34) = λ/((λ+1)(λ+2)(λ+3))
    p_4(12|3|4) = λ²/((λ+1)(λ+2)(λ+3))
    p_4(1|2|3|4) = λ³/((λ+1)(λ+2)(λ+3))
Marginals:
    p_3(123) = p_4(1234) + p_4(123|4) = 2/((λ+1)(λ+2))
    p_3(12|3) = p_4(124|3) + p_4(12|34) + p_4(12|3|4)
    p_3(1|2|3) = p_4(14|2|3) + p_4(1|24|3) + p_4(1|2|34) + p_4(1|2|3|4)
p_n is the marginal distribution of p_{n+1}: an infinitely exchangeable partition process.
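A short numerical check of the Ewens formula and of the marginalization identity p_3(123) = p_4(1234) + p_4(123|4), at an arbitrary λ = 2; a sketch of ours, not the authors' code.

```python
# Ewens probability of a partition given as a list of blocks, and a check
# that both sides equal 2/((lam+1)(lam+2)) = 1/6 at lam = 2.
from math import gamma, prod, isclose

def ewens(blocks, lam):
    n = sum(len(b) for b in blocks)
    return (lam ** len(blocks) * gamma(lam) / gamma(n + lam)
            * prod(gamma(len(b)) for b in blocks))

lam = 2.0
lhs = ewens([[1, 2, 3]], lam)
rhs = ewens([[1, 2, 3, 4]], lam) + ewens([[1, 2, 3], [4]], lam)
print(lhs, rhs, isclose(lhs, rhs))
```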

12  Chinese restaurant process
Ewens process: B is a random partition of the natural numbers.
Regard it as a sequence of matrices B_1, B_2, ... such that B_n(i, j) = B(i, j) is the leading sub-matrix of order n, with B_n ~ Ewens(n, λ).
Conditional distribution of B_{n+1} given B_n (transition distribution B_n → B_{n+1}):
    pr(u_{n+1} ∈ b | B_n) = #b/(n + λ) for b ∈ B_n,  λ/(n + λ) for b = ∅ (a new block).
CRP interpretation due to Dubins and Pitman.
The number of occupied tables is approximately Poisson(λ log(n)).
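A minimal sampler implementing the CRP transition rule above; the parameter names and seeding convention are illustrative choices of ours.

```python
# Sample B_n ~ Ewens(n, lam) by sequential seating: customer i joins an
# occupied table b with probability #b/(i - 1 + lam), or opens a new table
# with probability lam/(i - 1 + lam).
import random

def crp(n, lam, seed=0):
    rng = random.Random(seed)
    blocks = []
    for i in range(1, n + 1):
        weights = [len(b) for b in blocks] + [lam]
        j = rng.choices(range(len(weights)), weights=weights)[0]
        if j == len(blocks):
            blocks.append([i])      # open a new table
        else:
            blocks[j].append(i)     # join occupied table j
    return blocks

print(crp(10, 1.5))   # number of occupied tables grows like lam * log(n)
```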

13  Gauss–Ewens cluster processes
Pair (B, X) such that X = (X_1, X_2, ...) is a random sequence and B = {B_ij} is a random partition of the integers.
Finite-dimensional distributions: p_n(B, x) is a distribution on B_n × X^n.
Invariant under permutation: p_n(B^σ, x^σ) = p_n(B, x); p_n is the marginal of p_{n+1}.
B ~ Ewens_n(λ); given B, X ~ N(0, Σ = I_n + θB):
    p_n(B, x) = λ^{#B} (Γ(λ)/Γ(n + λ)) Π_{b∈B} Γ(#b) · |2πΣ|^{-1/2} exp(-x' Σ^{-1} x/2)
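A sketch of simulation from the pair (B, X), reusing the `crp` sampler above and assuming one-dimensional features; the construction of the partition matrix B is spelled out explicitly.

```python
# Draw B from the CRP, build the partition matrix B_ij = 1 iff i ~ j,
# then draw X ~ N(0, I_n + theta * B).
import numpy as np

def gauss_ewens(n, lam, theta, seed=0):
    blocks = crp(n, lam, seed)              # CRP sampler from the sketch above
    B = np.zeros((n, n))
    for b in blocks:
        idx = np.asarray(b) - 1
        B[np.ix_(idx, idx)] = 1.0
    cov = np.eye(n) + theta * B             # Sigma = I_n + theta * B
    X = np.random.default_rng(seed).multivariate_normal(np.zeros(n), cov)
    return blocks, X

blocks, X = gauss_ewens(20, lam=1.5, theta=5.0)
print(blocks)
print(np.round(X, 2))
```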

14–16  Simulated realizations (figures)
Scatter plots of simulated Gauss–Ewens cluster processes, plotted as rpts$points[,1] against rpts$points[,2]; the plotting symbols indicate cluster membership.

17–20  Potential applications of cluster processes
Prediction: given (B, X)^(n), predict the next value. Easily computed, but not interesting or useful.
Density estimation: given X_1, ..., X_n (no B), predict the next value: p_{n+1}(X_{n+1} | X_1, ..., X_n).
Cluster analysis: p_n(B | X) is a distribution on clusterings of [n]:
    p_n(B_ij = 1 | X) = E(B_ij | X) (triples too)
    p_n(#B = r | X) (conditional distribution of #B)
    preliminary conclusions ...
Classification: given (B, X)^(n) and X(u_{n+1}) = x*, compute p_{n+1}(u_{n+1} ∈ b | data). Easily computed and fairly useful.


21  Classification and cluster processes
Given (B, x) on a sample of n units and a new unit u* with x(u*) = x*, classify the new unit (conditional distribution).
Gauss–Ewens conditional distribution:
    pr(u* ∈ b | ...) ∝ (#b) φ_{n+1}(x̃; I_{n+1} + θB_b) for b ∈ B,
                       λ φ_{n+1}(x̃; I_{n+1} + θB_∅) for b = ∅,
where x̃ = (x, x*) and B_b is the partition B with u* adjoined to block b (b = ∅: a new singleton block).

22  Application to classification (contd)
Simplification of the conditional distribution:
    pr(u* ∈ b | ...) ∝ (#b) φ_1(x(u*) - µ_b; σ_b²) for b ∈ B,
                       λ φ_1(x(u*); 1 + θ) for b = ∅,
where µ_b = x̄_b n_b θ/(1 + n_b θ) and σ_b² = 1 + θ/(1 + n_b θ).
Typical values θ ≈ 5 and n_b ≈ 5, so µ_b/x̄_b = n_b θ/(1 + n_b θ) ≈ 0.96
(similar to x̄_b, but with shrinkage).
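The simplified rule translates directly into code. A sketch assuming one-dimensional features; the block means, block sizes, and parameter values are made up for illustration.

```python
# Classification weights: (#b) * phi(x* - mu_b; sigma_b^2) for each existing
# block, and lam * phi(x*; 1 + theta) for a new singleton block.
import numpy as np

def phi(x, var):
    return np.exp(-x * x / (2 * var)) / np.sqrt(2 * np.pi * var)

def class_probs(x_star, block_means, block_sizes, lam, theta):
    w = []
    for xbar, nb in zip(block_means, block_sizes):
        mu = xbar * nb * theta / (1 + nb * theta)   # shrunk block mean
        var = 1 + theta / (1 + nb * theta)
        w.append(nb * phi(x_star - mu, var))
    w.append(lam * phi(x_star, 1 + theta))          # new block
    w = np.array(w)
    return w / w.sum()

# two existing blocks with means -2 and 3, five points each
print(class_probs(2.5, [-2.0, 3.0], [5, 5], lam=1.0, theta=5.0))
```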

23  Cox process models
Construction: λ(x) is a random intensity function on X = R; X is a Poisson process with intensity λ.
Product densities:
    m(x) = E(λ(x)), the first-order p.d.
    m(x_1, x_2) = E(λ(x_1) λ(x_2)), the second-order p.d.
    m(x) = E(λ(x_1) ··· λ(x_n)), the nth-order p.d.
X(A) = #(X ∩ A) = number of events in A.
m(x) dx: expected number of ordered n-tuples of events in dx.

24  Marked Cox process
λ_1(x), ..., λ_k(x): independent random intensity functions.
X^(r) is a Cox process with intensity λ_r(x); X = ∪_r X^(r) is the superposition process.
The marked process is a labelled partition (X^(1), ..., X^(k)) of X:
    pr(X^(r) contains an event in dx^(r)) = m_r(dx^(r)) + o(dx^(r))
    pr(X contains an event in dx) = m(dx) + o(dx)
    pr(y | x ⊂ X) = Π_r m_r(x^(r)) / m(x)
for a label function y: x → C with x^(r) = y^{-1}(r).

25  Boson and related processes
Z(x) is a Gaussian process N(0, K) on X (complex-valued for simplicity!).
λ(x) = |Z(x)|² is the intensity at x for the Cox process.
Product density: m(x) = E(λ(x_1) ··· λ(x_n)) = per[K](x), for the matrix [K](x) = {K(x_i, x_j)}.
Permanent polynomial of a matrix A:
    per_α(A) = Σ_σ α^{#σ} A_{1,σ(1)} ··· A_{n,σ(n)},
summing over permutations σ, with #σ the number of cycles of σ.
Convolution or superposition:
    Σ_{w ⊆ x} per_α[K](w) per_{α'}[K](x∖w) = per_{α+α'}[K](x)
Macchi 1975; Shirai and Takahashi 2003; McCullagh and Møller 2006.
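A brute-force sketch of the α-permanent, feasible only for small n since it sums over all n! permutations; the cycle-count convention follows the definition above, and the Gaussian kernel is an illustrative choice of ours.

```python
# per_alpha(A) = sum over permutations sigma of
#   alpha^(#cycles of sigma) * prod_i A[i, sigma(i)].
import numpy as np
from itertools import permutations

def cycle_count(sigma):
    seen, c = set(), 0
    for i in range(len(sigma)):
        if i not in seen:
            c += 1
            j = i
            while j not in seen:
                seen.add(j)
                j = sigma[j]
    return c

def alpha_permanent(A, alpha):
    n = A.shape[0]
    return sum(alpha ** cycle_count(s) * np.prod([A[i, s[i]] for i in range(n)])
               for s in permutations(range(n)))

K = np.exp(-np.subtract.outer(np.arange(3.0), np.arange(3.0)) ** 2)
print(alpha_permanent(K, 1.0))   # alpha = 1: the ordinary permanent per[K]
```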

26–27  Intensity functions and Cox processes (figures)

28  Superposition processes (figure)

29  Permanent partition models
Given that x ⊂ X occurs in the superposition, the conditional distribution of the labels y: x → C is
    p_n(y | x) = per_{α_1}[K](x^(1)) ··· per_{α_k}[K](x^(k)) / per_{α.}[K](x),
a probability distribution on labelled partitions of x.
Reduces to the Dirichlet-multinomial if K(x, x') ≡ 1.
Classification distribution for a new event at x*:
    p_{n+1}(y(x*) = r | data) ∝ per_{α_r}[K](x^(r), x*) / per_{α_r}[K](x^(r))
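A sketch of these classification weights, reusing `alpha_permanent` from the sketch above; the kernel, α values, and the tiny training sets are invented for illustration.

```python
# Weight of class r for a new event x*:
#   per_{alpha_r}[K](x^(r), x*) / per_{alpha_r}[K](x^(r)).
import numpy as np

def kmat(pts):
    pts = np.asarray(pts, dtype=float)
    return np.exp(-np.subtract.outer(pts, pts) ** 2)

def class_weight(x_class, x_star, alpha):
    num = alpha_permanent(kmat(list(x_class) + [x_star]), alpha)
    den = alpha_permanent(kmat(list(x_class)), alpha)
    return num / den

x1, x2 = [0.1, 0.4, 0.5], [2.0, 2.3]      # events labelled 1 and 2
w = np.array([class_weight(x1, 1.0, 1.0), class_weight(x2, 1.0, 1.0)])
print(w / w.sum())                        # classification distribution at x* = 1.0
```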

30  Conditional relative intensity (figure)
per_α[K](x, x*) / per_α[K](x) as a function of x*, with K(x, x') = exp(-(x - x')²) on (0, 2π).
Solid curve: α = 1; dashed curve: α = 0.25; dotted curve: α = 0.

31  Illustration 2: relative intensity functions (figure)
per[K](x, x*) / per[K](x) on a one-dimensional feature space with 2 labelled classes.

32  Chequerboard pattern
A two-class example is generated from a 3×3 chequerboard pattern. The permanent model with α_1 = α_2 = α and K(x, x') = exp(-|x - x'|²/τ²) is used.
(Figures: the data set, and cross-validation sum of log(pr) curves for α = 0.5, α = 1, and a third value of α.)
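A hedged sketch of a data-generating mechanism consistent with this description (uniform points on [0, 3]² labelled by cell parity), together with the Gaussian kernel matrix; the exact construction and the value of τ used in the talk may differ.

```python
# Two-class 3x3 chequerboard data and the kernel K(x, x') = exp(-|x-x'|^2 / tau^2).
import numpy as np

rng = np.random.default_rng(0)
pts = rng.uniform(0, 3, size=(200, 2))
labels = (np.floor(pts[:, 0]) + np.floor(pts[:, 1])).astype(int) % 2
print(np.bincount(labels))                 # class sizes

tau = 1.0                                  # illustrative bandwidth
K = np.exp(-((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1) / tau ** 2)
print(K.shape)                             # kernel matrix used by the permanent model
```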

33  Chequerboard pattern: classification (figure)
