Advanced Machine Learning
Nonparametric Bayesian Models -- Learning/Reasoning in Open Possible Worlds
Eric Xing
Lecture 7, August 4, 2009
Reading:
Eric Xing @ CMU, 2006-2009

Clustering
Image Segmentation
How to segment images?
- Manual segmentation (very expensive)
- Algorithmic segmentation:
  - K-means
  - Statistical mixture models
  - Spectral clustering
Problems with most existing algorithms:
- They ignore spatial information
- They segment one image at a time
- They require the number of segments to be specified a priori

Object Recognition and Tracking
[Figure: object attribute vectors tracked over time steps t=1, t=2, t=3]
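To make the last limitation concrete, here is a minimal 1-D k-means sketch (illustrative code, not from the lecture): the number of clusters k must be fixed before seeing the data, which is exactly the a priori choice the slides criticize.

```python
import random

def kmeans_1d(points, k, iters=20, seed=0):
    """Toy 1-D k-means. Note that k is an input: the algorithm
    cannot infer the number of clusters from the data."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # assignment step: each point joins its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: abs(p - centers[c]))
            clusters[j].append(p)
        # update step: move each center to its cluster mean
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

data = [0.1, 0.2, 0.15, 5.0, 5.1, 4.9]
print(kmeans_1d(data, k=2))
```

With two well-separated groups the centers converge near 0.15 and 5.0; but nothing in the procedure tells us whether k=2 was the right choice.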
Modeling the Mind
Latent brain processes: view picture, read sentence, decide whether consistent
[Figure: fMRI scans at time steps t=1, ..., t=T]

The Evolution of Science
[Figure: research circles (Phy, Bio, CS), research topics, and PNAS papers over time]
A Classical Approach
Clustering as mixture modeling, then "model selection"

Partially Observed, Open and Evolving Possible Worlds
- Unbounded # of objects/trajectories
- Changing attributes
- Birth/death, merge/split
- Relational ambiguity
The parametric paradigm:
- Event model: p({φ}) or p({φ}^{0:T})
- Sensor model: p(x | {φ})
- Motion model: p({φ}^{t+1} | {φ}^t)
The entity space and observation space are finite and structurally unambiguous. How to open it up?
Model Selection vs. Posterior Inference
Model selection:
- "intelligent" guess: ???
- cross validation: data-hungry
- information theoretic: AIC, TIC, MDL (parsimony, Occam's razor)
    arg min_K KL( f(·) || g(· | θ̂, K) )
- Bayes factor: need to compute the data likelihood
Posterior inference:
- we want to handle the uncertainty of model complexity explicitly:
    p(M | D) ∝ p(D | M) p(M),   M = {θ_K, K}
- we favor a distribution that does not constrain M to a "closed" space!

Two "Recent" Developments
First-order probabilistic languages (FOPLs)
- Examples: PRM, BLOG
- Lift graphical models to an "open" world (# of r.v.'s, relations, indices, lifespans, ...)
- Focus on complete, consistent, and operational rules to instantiate possible worlds, and a formal language for expressing such rules
- Operational way of defining distributions over possible worlds, via sampling methods
Bayesian nonparametrics
- Examples: Dirichlet processes, stick-breaking processes
- From finite, to infinite mixtures, to more complex constructions (hierarchies, spatial/temporal sequences, ...)
- Focus on the laws and behaviors of both the generative formalisms and the resulting distributions
- Often offer explicit expressions of distributions and expose their structure, motivating various approximation schemes
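The posterior-inference view p(M | D) ∝ p(D | M) p(M) can be illustrated with a toy discrete model space. The two-model coin example below is hypothetical; all numbers are illustrative:

```python
from math import comb

# Toy model space: M1 says the coin is fair (p=0.5),
# M2 says it is biased (p=0.8). Data: 8 heads in 10 flips.
def likelihood(p, heads, n):
    """Binomial data likelihood p(D | M) for a coin with bias p."""
    return comb(n, heads) * p**heads * (1 - p)**(n - heads)

heads, n = 8, 10
prior = {"M1": 0.5, "M2": 0.5}
lik = {"M1": likelihood(0.5, heads, n), "M2": likelihood(0.8, heads, n)}

# p(M | D) = p(D | M) p(M) / sum_M' p(D | M') p(M')
evidence = sum(prior[m] * lik[m] for m in prior)
posterior = {m: prior[m] * lik[m] / evidence for m in prior}
print(posterior)
```

Rather than committing to one model, the posterior keeps weight on both, with the biased-coin model favored by the data. A nonparametric prior extends this idea to an unbounded model space.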
Clustering
How to label them? How many clusters???

Random Partition of Probability Space
[Figure: a probability space partitioned into regions {φ_k, π_k}, k = 1, ..., 6; each region is an (event, p_event) pair; centroid := φ, image element := (x, θ)]
Stick-breaking Process

G = Σ_{k=1}^∞ π_k δ_{θ_k},   θ_k ~ G_0
π_k = β_k ∏_{j=1}^{k-1} (1 - β_j),   β_k ~ Beta(1, α)

Worked example (Location / Mass):
θ_1: β_1 = 0.4, mass π_1 = 0.4 (remaining stick: 0.6)
θ_2: β_2 = 0.5, mass π_2 = 0.6 × 0.5 = 0.3 (remaining stick: 0.3)
θ_3: β_3 = 0.8, mass π_3 = 0.3 × 0.8 = 0.24 ...

Chinese Restaurant Process

P(c_i = k | c_{-i}) = m_k / (i - 1 + α)   for an existing table k with m_k customers,
P(c_i = new | c_{-i}) = α / (i - 1 + α)   for a new table.

E.g., customer 2 joins table 1 with probability 1/(1+α) or a new table with probability α/(1+α); customer 3 sees probabilities m_1/(2+α), m_2/(2+α), α/(2+α); and so on.

The CRP defines an exchangeable distribution on partitions over an (infinite) sequence of samples; such a distribution is formally known as the Dirichlet process (DP).
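Both constructions can be simulated in a few lines. The sketch below (illustrative code, not from the lecture) draws truncated stick-breaking weights and then seats customers by the CRP rule:

```python
import random

def stick_breaking(alpha, n_sticks, rng):
    """First n_sticks weights: pi_k = beta_k * prod_{j<k} (1 - beta_j),
    with beta_k ~ Beta(1, alpha)."""
    remaining, weights = 1.0, []
    for _ in range(n_sticks):
        beta = rng.betavariate(1.0, alpha)
        weights.append(beta * remaining)
        remaining *= 1.0 - beta
    return weights

def crp(alpha, n_customers, rng):
    """Seat customers one by one: join table k with prob m_k/(i-1+alpha),
    open a new table with prob alpha/(i-1+alpha)."""
    counts, labels = [], []
    for i in range(n_customers):
        # i customers already seated; total unnormalized mass is i + alpha
        r = rng.random() * (i + alpha)
        acc, k = 0.0, len(counts)  # default: a new table
        for j, m in enumerate(counts):
            acc += m
            if r < acc:
                k = j
                break
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
        labels.append(k)
    return labels, counts

rng = random.Random(0)
pi = stick_breaking(alpha=2.0, n_sticks=5, rng=rng)
labels, counts = crp(alpha=2.0, n_customers=30, rng=rng)
print("stick weights:", [round(w, 3) for w in pi])
print("table sizes:  ", counts)
```

Note how the CRP is "rich get richer": large tables attract more customers, while α controls how often new tables open.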
Dirichlet Process

A CDF G on possible worlds of random partitions follows a Dirichlet process if, for any measurable finite partition (φ_1, φ_2, ..., φ_m):

(G(φ_1), G(φ_2), ..., G(φ_m)) ~ Dirichlet(αG_0(φ_1), ..., αG_0(φ_m))

where G_0 is the base measure and α is the scale parameter. Thus a Dirichlet process G defines a distribution over distributions.

Graphical Model Representations of DP

The CRP construction: G_0 and α generate a parameter θ_i for each observation x_i, i = 1, ..., N.
The stick-breaking construction: α generates the weights π, G_0 generates the atoms θ_k, and each indicator y_i drawn from π selects the atom for observation x_i.
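To see "a distribution over distributions" concretely: each call below draws a different random measure G ≈ Σ_k π_k δ_{θ_k} via truncated stick-breaking, with atoms from a standard-normal base measure G_0 (an illustrative choice; any base distribution works).

```python
import random

def draw_dp_measure(alpha, base_sampler, n_atoms, rng):
    """Truncated draw of G = sum_k pi_k * delta_{theta_k}, theta_k ~ G0.
    Returns a list of (location, mass) pairs."""
    remaining, atoms = 1.0, []
    for _ in range(n_atoms):
        beta = rng.betavariate(1.0, alpha)
        atoms.append((base_sampler(rng), beta * remaining))
        remaining *= 1.0 - beta
    return atoms

rng = random.Random(7)
G = draw_dp_measure(alpha=1.0,
                    base_sampler=lambda r: r.gauss(0.0, 1.0),
                    n_atoms=100, rng=rng)
total = sum(mass for _, mass in G)
print(f"mass captured by 100 atoms: {total:.6f}")
```

The draw is discrete with probability one, which is why DP mixtures naturally induce clustering: distinct data points can share the same atom.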
Ancestral Inference

[Figure: graphical model with ancestors A, θ; haplotypes H_n1, H_n2; genotype G_n; plate over N individuals]

Essentially a clustering problem, but:
- Better recovery of the ancestors leads to better haplotyping results (because of more accurate grouping of common haplotypes).
- True haplotypes are obtainable at high cost, but they can validate the model more objectively (as opposed to examining the saliency of a clustering).
- Many other biological/scientific utilities.

Example: DP-haplotyper [Xing et al., 2004]

Clustering human populations:
- DP(α, G_0) generates infinite mixture components (the population haplotypes).
- A likelihood model generates individual haplotypes H_n1, H_n2 and genotypes G_n.

Inference: Markov chain Monte Carlo (MCMC)
- Gibbs sampling
- Metropolis-Hastings
The DP Mixture of Ancestral Haplotypes

The customers around a table in the CRP form a cluster:
- associate a mixture component (i.e., a population haplotype) with each table
- sample {A, θ} at each table from the base measure G_0 to obtain the population haplotype and nucleotide-substitution frequency for that component

[Figure: customers 1-9 seated around tables, each table labeled with its {A, θ}]

With p(h | A, θ) and p(g | h_1, h_2), the CRP yields a posterior distribution on the number of population haplotypes (and on the haplotype configurations and the nucleotide-substitution frequencies).

Inheritance and Observation Models

Single-locus mutation model (h_t = a_t with probability θ):
P(h_t | a_t, θ) = θ                   if h_t = a_t
                = (1 - θ)/(|B| - 1)   if h_t ≠ a_t
where B is the nucleotide alphabet, e.g., {A, C, G, T}.

Noisy observation model:
P(g_t | h_1,t, h_2,t): g_t is consistent with the pair (h_1,t, h_2,t) with probability λ.

[Figure: ancestral pool A_1, A_2, A_3 → haplotypes H_i1, H_i2 → genotype G_i]
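A minimal sketch of the single-locus mutation model, assuming the (1 - θ)/(|B| - 1) form for mismatches and independence across loci (function names and the example sequences are illustrative):

```python
# Hypothetical single-locus mutation model:
# P(h_t | a_t, theta) = theta if h_t == a_t,
#                       (1 - theta) / (|B| - 1) otherwise.
ALPHABET = "ACGT"  # the nucleotide alphabet B

def locus_likelihood(h, a, theta):
    """Probability of haplotype allele h given ancestral allele a."""
    if h == a:
        return theta
    return (1.0 - theta) / (len(ALPHABET) - 1)

def haplotype_likelihood(hap, anc, theta):
    """p(h | a, theta), treating loci as independent."""
    p = 1.0
    for h, a in zip(hap, anc):
        p *= locus_likelihood(h, a, theta)
    return p

# three matching loci and one mismatch against the ancestor
p = haplotype_likelihood("ACGA", "ACGT", theta=0.97)
print(p)
```

With θ close to 1, each mismatch against the ancestral haplotype sharply reduces the likelihood, which is what drives individuals toward ancestors (tables) that explain their haplotypes well.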
MCMC for Haplotype Inference

Gibbs sampling for exploring the posterior distribution under the proposed model:
- integrate out the parameters θ and λ, and sample c_ie, a_k, and h_ie:

p(c_ie = k | c_[-ie], h, a) ∝ p(c_ie = k | c_[-ie]) × p(h_ie | a_k, h_[-ie], c)
         Posterior         ∝     (CRP) Prior       ×       Likelihood

Gibbs sampling algorithm: draw samples of each random variable to be sampled, given the values of all the remaining variables.

MCMC for Haplotype Inference (cont.)
1. Sample c_ie(j) from its conditional distribution
2. Sample a_k from its conditional distribution
3. Sample h_ie(j) from its conditional distribution
For the DP scale parameter α: use a vague inverse-Gamma prior.
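The collapsed Gibbs step for a cluster indicator multiplies the CRP prior by the likelihood term. A minimal sketch with made-up numbers (the likelihood values below are placeholders, not the model's actual p(h_ie | a_k, ·)):

```python
def gibbs_assignment_probs(counts, likelihoods, alpha, new_likelihood):
    """Normalized conditional p(c_i = k | c_{-i}, h, a):
    CRP prior (m_k for existing clusters, alpha for a new one)
    times the corresponding likelihood term."""
    probs = [m * lik for m, lik in zip(counts, likelihoods)]
    probs.append(alpha * new_likelihood)  # open a new cluster
    z = sum(probs)
    return [p / z for p in probs]

# toy numbers: two existing clusters with 3 and 1 members;
# the current point fits cluster 1 best
probs = gibbs_assignment_probs(counts=[3, 1],
                               likelihoods=[0.2, 0.05],
                               alpha=1.0,
                               new_likelihood=0.1)
print(probs)
```

A full sampler would draw c_i from these probabilities, update the table counts, and repeat over all variables; this is the "Posterior ∝ Prior × Likelihood" decomposition of the slide in code form.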
Convergence of Ancestral Inference
[Figure: convergence plots]

DP vs. Finite Mixture via EM
[Figure: individual error (0 to 0.45) on five data sets, comparing the DP model with a finite mixture fit by EM]
Variational Inference [Blei & Jordan 2005; Kurihara et al. 2007]

- The Gibbs sampling solution is not efficient enough to scale up to large problems.
- The truncated stick-breaking approximation can be formulated in the space of explicit, non-exchangeable cluster labels.
- Variational inference can then be applied to this finite-dimensional distribution.

Variational inference: for a complicated P(X_1, X_2, ..., X_n), approximate it with a tractable Q(X).

Approximations to DP

- Truncated stick-breaking representation: the joint distribution is expressed in terms of the first T sticks.
- Finite symmetric Dirichlet approximation: the joint distribution is expressed with K symmetric Dirichlet mixture weights.
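One way to gauge a truncated stick-breaking approximation: the expected mass beyond truncation level T is (α/(1+α))^T, since each β_j ~ Beta(1, α) independently leaves an expected fraction α/(1+α) of the stick. A quick sketch (function name is illustrative):

```python
def expected_residual_mass(alpha, T):
    """E[prod_{j<=T} (1 - beta_j)] with beta_j ~ Beta(1, alpha)
    equals (alpha / (1 + alpha))**T by independence, since
    E[beta_j] = 1 / (1 + alpha). This is the expected stick mass
    lost by truncating at T components."""
    return (alpha / (1.0 + alpha)) ** T

for T in (10, 20, 50):
    print(T, expected_residual_mass(1.0, T))
```

The residual decays geometrically in T, which is why modest truncation levels already give accurate finite-dimensional surrogates for variational inference; larger α requires a larger T.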
TDP vs. TSB

The TDP is size-biased: cluster labels are NOT exchangeable under the TDP, but are exchangeable under the TSB.

Marginalization

In the variational Bayesian approximation, we assume a factorized form for the posterior distribution. However, this is not a good assumption, since changes in π have a considerable impact on z. We can instead integrate out π; the joint distribution is then obtained in marginalized form for both the TSB representation and the FSD representation.
VB Inference

We can then apply VB inference to the four approximations (TSB and FSD, each with or without marginalization). The approximate posterior distributions for TSB and FSD factorize over the remaining variables; depending on whether we marginalize, v and π may be integrated out.

Experimental Results
[Figure: experimental results]
Summary

A nonparametric Bayesian model for pattern discovery:
- finite mixture model of latent patterns (e.g., image segments, objects)
- infinite mixture of prototypes: an alternative to model selection
- hierarchical infinite mixtures
- infinite hidden Markov model: a temporal infinite mixture model
- applications in general data mining