Advanced Machine Learning



1 Advanced Machine Learning

Nonparametric Bayesian Models: Learning/Reasoning in Open Possible Worlds
Eric Xing
Lecture 7, August 4, 2009
Reading:
Eric Xing @ CMU

Clustering


2 Image Segmentation

How to segment images?
- Manual segmentation (very expensive)
- Algorithmic segmentation
  - K-means
  - Statistical mixture models
  - Spectral clustering

Problems with most existing algorithms:
- They ignore spatial information
- They segment one image at a time
- The number of segments must be specified a priori

Object Recognition and Tracking

[Figure: objects tracked across frames t=1, t=2, t=3, each annotated with a three-component feature vector]


3 Modeling the Mind

Latent brain processes:
- View picture
- Read sentence
- Decide whether consistent

[Figure: fMRI scans at successive time points t]

The Evolution of Science

[Figure: research circles (Phy, Bio, CS) and research topics in PNAS papers]

4 A Classical Approach

Clustering as mixture modeling, then "model selection".

Partially Observed, Open and Evolving Possible Worlds

- Unbounded number of objects/trajectories
- Changing attributes
- Birth/death, merge/split
- Relational ambiguity

The parametric paradigm:
- Event model: $p(\{\phi_k\}^{0:T})$ or $p(\{\phi_k\})$
- Motion model: $p(\{\phi_k\}^{t+1} \mid \{\phi_k\}^{t})$
- Sensor model: $p(x \mid \{\phi_k\})$
- Entity space ($\Xi^*_{t|t}$, $\Xi^*_{t+1|t+1}$) and observation space
- Finite and structurally unambiguous

How to open it up?

5 Model Selection vs. Posterior Inference

Model selection:
- "Intelligent" guess: ???
- Cross-validation: data-hungry
- Information-theoretic criteria (AIC, TIC, MDL): $\arg\min \mathrm{KL}\big(f(\cdot) \,\|\, g(\cdot \mid \hat{\theta}_{ML}, K)\big)$; parsimony, Occam's razor
- Bayes factor: requires computing the data likelihood

Posterior inference:
- We want to handle uncertainty about model complexity explicitly: $p(M \mid D) \propto p(D \mid M)\, p(M)$, with $M \equiv \{\theta, K\}$
- We favor a distribution that does not constrain M to a "closed" space!

Two "Recent" Developments

First-order probabilistic languages (FOPLs)
- Examples: PRM, BLOG
- Lift graphical models to an "open" world (#rv, relation, index, lifespan, ...)
- Focus on complete, consistent, and operating rules to instantiate possible worlds, and on a formal language for expressing such rules
- An operational way of defining distributions over possible worlds, via sampling methods

Bayesian nonparametrics
- Examples: Dirichlet processes, stick-breaking processes
- From finite, to infinite mixture, to more complex constructions (hierarchies, spatial/temporal sequences, ...)
- Focus on the laws and behaviors of both the generative formalisms and the resulting distributions
- Often offer explicit expressions of distributions and expose their structure, which motivates various approximation schemes
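The posterior-inference relation on this slide, p(M | D) ∝ p(D | M) p(M), can be illustrated with a small sketch over a handful of candidate values of K. The function name and the log marginal likelihood values below are hypothetical, made up only to show the normalization step; they are not results from the lecture.

```python
import math

def posterior_over_models(log_marginal_likelihoods, log_priors):
    """Normalize p(M | D) ∝ p(D | M) p(M), working in log space for stability."""
    log_joint = [ll + lp for ll, lp in zip(log_marginal_likelihoods, log_priors)]
    peak = max(log_joint)
    # log-sum-exp gives the log of the normalizing constant p(D)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in log_joint))
    return [math.exp(v - log_norm) for v in log_joint]

# Hypothetical log p(D | K) for K = 1..4 (illustrative numbers only).
log_ml = [-120.0, -104.5, -103.9, -106.2]
# Uniform prior over the four candidate values of K (an assumption).
log_prior = [math.log(0.25)] * 4

post = posterior_over_models(log_ml, log_prior)
for k, p in enumerate(post, start=1):
    print(f"K={k}: posterior probability {p:.3f}")
```

Unlike picking a single K by a criterion, the output keeps the full distribution over K. It is still restricted to a finite candidate list, which is the "closed" space the slide argues against.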

6 Clustering

How to label them? How many clusters?

Random Partition of Probability Space

[Figure: a probability space partitioned into six regions, each labeled with a pair {φ_k, π_k}. The centroid is φ, the mass is π (event, p_event), and each image element is (x, θ).]

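7 Stick-breaking Process

$$G = \sum_{k=1}^{\infty} \pi_k \, \delta(\theta_k), \qquad \theta_k \sim G_0, \qquad \sum_{k=1}^{\infty} \pi_k = 1$$

$$\pi_k = \beta_k \prod_{j=1}^{k-1} (1 - \beta_j), \qquad \beta_k \sim \mathrm{Beta}(1, \alpha)$$

Here θ_k is the location and π_k the mass of the k-th atom, and G_0 is the base measure.

Chinese Restaurant Process

$$P(c_i = k \mid \mathbf{c}_{-i}) = \begin{cases} \dfrac{m_k}{i - 1 + \alpha} & \text{for an existing table } k \\[2ex] \dfrac{\alpha}{i - 1 + \alpha} & \text{for a new table} \end{cases}$$

where m_k is the number of customers already seated at table k when customer i arrives. The first customer sits at the first table with probability 1; the second joins that table with probability 1/(1+α) or opens a new one with probability α/(1+α).

The CRP defines an exchangeable distribution on partitions over an (infinite) sequence of samples; such a distribution is formally known as the Dirichlet Process (DP).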
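The two constructions on this slide can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names are hypothetical, and the stick-breaking function returns only the first `num_sticks` weights of the infinite sequence.

```python
import random

def stick_breaking_weights(alpha, num_sticks, rng):
    """First `num_sticks` weights pi_k = beta_k * prod_{j<k} (1 - beta_j)."""
    weights, remaining = [], 1.0
    for _ in range(num_sticks):
        beta_k = rng.betavariate(1.0, alpha)   # beta_k ~ Beta(1, alpha)
        weights.append(beta_k * remaining)     # break off a fraction of what is left
        remaining *= 1.0 - beta_k
    return weights

def crp_seating(alpha, num_customers, rng):
    """Seat customers one by one; return (table per customer, table sizes)."""
    assignments, table_sizes = [], []
    for i in range(num_customers):
        # i customers are already seated, so the normalizer is i + alpha
        # (this is i - 1 + alpha in the slide's 1-based indexing).
        u = rng.random() * (i + alpha)
        choice = len(table_sizes)              # default: open a new table
        cumulative = 0.0
        for k, m_k in enumerate(table_sizes):
            cumulative += m_k                  # existing table k has weight m_k
            if u < cumulative:
                choice = k
                break
        if choice == len(table_sizes):
            table_sizes.append(0)
        table_sizes[choice] += 1
        assignments.append(choice)
    return assignments, table_sizes

rng = random.Random(0)
weights = stick_breaking_weights(alpha=2.0, num_sticks=50, rng=rng)
assignments, table_sizes = crp_seating(alpha=1.0, num_customers=100, rng=rng)
print(f"mass covered by 50 sticks: {sum(weights):.4f}")
print(f"tables opened by 100 customers: {len(table_sizes)}")
```

The number of tables is not fixed in advance: it grows with the number of customers, and larger α opens new tables more often.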

8 Dirichlet Process

[Figure: a space partitioned into regions φ_1, ..., φ_6]

A CDF, G, on possible worlds of random partitions follows a Dirichlet process if, for any measurable finite partition (φ_1, φ_2, ..., φ_m):

$$\big(G(\phi_1), G(\phi_2), \ldots, G(\phi_m)\big) \sim \mathrm{Dirichlet}\big(\alpha G_0(\phi_1), \ldots, \alpha G_0(\phi_m)\big)$$

The left-hand side is a distribution, and the Dirichlet on the right is another distribution over it. G_0 is the base measure and α is the scale parameter. Thus a Dirichlet process G defines a distribution over distributions.

Graphical Model Representations of DP

[Figure: graphical models for the CRP construction and the stick-breaking construction, over nodes G_0, α, G, π, θ, y_i, x_i with plate N]
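A short worked consequence of the finite-partition definition on this slide, using the simplest partition of the space into a set A and its complement. This is a standard property of the Dirichlet and Beta distributions rather than something stated on the slide, and the set name A is introduced here only for illustration.

```latex
% Two-set partition (A, A^c) of the sample space:
\bigl(G(A),\, G(A^{c})\bigr) \sim \mathrm{Dirichlet}\bigl(\alpha G_0(A),\, \alpha G_0(A^{c})\bigr)
\;\Longrightarrow\;
G(A) \sim \mathrm{Beta}\bigl(\alpha G_0(A),\, \alpha\,(1 - G_0(A))\bigr)

% Moments of that Beta distribution:
\mathbb{E}[G(A)] = G_0(A), \qquad
\mathrm{Var}[G(A)] = \frac{G_0(A)\,\bigl(1 - G_0(A)\bigr)}{\alpha + 1}
```

So G_0 fixes the mean of the random measure G, and α controls how tightly G concentrates around G_0: the variance shrinks as α grows.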

9 Ancestral Inference

[Figure: graphical model with ancestral haplotypes A and θ (unknown), individual haplotypes H_n1 and H_n2, and genotype G_n, with plate N]

Essentially a clustering problem, but:
- Better recovery of the ancestors leads to better haplotyping results (because of more accurate grouping of common haplotypes)
- True haplotypes are obtainable with high cost, but they can validate model more subjectively (as opposed to examining saliency of clustering)
- Many other biological/scientific utilities

Example: DP-haplotyper [Xing et al, 2004]

Clustering human populations:
- DP prior (α, G_0, G) over infinite mixture components A, θ (plate K), for population haplotypes
- Likelihood model for individual haplotypes (H_n1, H_n2) and genotypes (G_n), plate N

Inference: Markov chain Monte Carlo (MCMC)
- Gibbs sampling
- Metropolis-Hastings

10 The DP Mixture of Ancestral Haplotypes

- The customers around a table in the CRP form a cluster.
- Associate a mixture component (i.e., a population haplotype) with a table.
- Sample {a, θ} at each table from a base measure G_0 to obtain the population haplotype and nucleotide substitution frequency for that component.

[Figure: restaurant tables, each labeled {A, θ}]

With p(h | {A, θ}) and p(g | h_1, h_2), the CRP yields a posterior distribution on the number of population haplotypes (and on the haplotype configurations and the nucleotide substitution frequencies).

Inheritance and Observation Models

Single-locus mutation model:

$$P_H(h_t \mid a_t, \theta) = \begin{cases} \theta & \text{for } h_t = a_t \\[1ex] \dfrac{1 - \theta}{|B| - 1} & \text{for } h_t \neq a_t \end{cases}$$

i.e., h_t = a_t with prob. θ.

Noisy observation model:

$$P_G(g \mid h_1, h_2): \quad g_t = h_{1,t} \oplus h_{2,t} \ \text{with prob. } \lambda$$

where ⊕ denotes the unordered pairing of the two alleles.

[Figure: ancestral pool A_1, A_2, A_3; haplotypes H_i1 and H_i2 with ancestor indicators C_i1 and C_i2; genotype G_i]
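The single-locus mutation model can be written as a small function. This sketch assumes the probability mass not kept by the ancestral allele, 1 − θ, is spread uniformly over the other |B| − 1 alleles, and it further assumes (for the multi-locus helper only) that loci mutate independently. The function names are hypothetical.

```python
def mutation_prob(h_t, a_t, theta, alphabet_size):
    """P_H(h_t | a_t, theta): keep the ancestral allele with probability theta."""
    if h_t == a_t:
        return theta
    # Assumption: the remaining mass is uniform over the other alleles.
    return (1.0 - theta) / (alphabet_size - 1)

def haplotype_likelihood(h, a, theta, alphabet_size=2):
    """p(h | a, theta) as a product over loci (independence is an assumption)."""
    prob = 1.0
    for h_t, a_t in zip(h, a):
        prob *= mutation_prob(h_t, a_t, theta, alphabet_size)
    return prob

# Binary alleles, three loci, one mismatch at the last locus.
ancestor = [0, 1, 0]
individual = [0, 1, 1]
print(haplotype_likelihood(individual, ancestor, theta=0.9))
```

With θ close to 1, an individual haplotype is very likely to match its ancestral template, so haplotypes drawn from the same ancestor cluster tightly around it.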

11 MCMC for Haplotype Inference

Gibbs sampling for exploring the posterior distribution under the proposed model:
- Integrate out the parameters such as θ or λ, and sample c_{i_e}, a_k and h_{i_e}:

$$p(c_{i_e} = k \mid \mathbf{c}_{[-i_e]}, \mathbf{h}, \mathbf{a}) \propto p(c_{i_e} = k \mid \mathbf{c}_{[-i_e]}) \; p(h_{i_e} \mid a_k, \mathbf{h}_{[-i_e]}, \mathbf{c})$$

  that is, Posterior ∝ Prior (CRP) × Likelihood.
- Gibbs sampling algorithm: draw samples of each random variable to be sampled given the values of all the remaining variables.

MCMC for Haplotype Inference (continued)

1. Sample c_{i_e}^{(j)} from its conditional distribution
2. Sample a_k from its conditional distribution
3. Sample h_{i_e}^{(j)} from its conditional distribution

For the DP scale parameter α: a vague inverse Gamma prior.
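The "posterior ∝ CRP prior × likelihood" update for the cluster indicators can be sketched as a collapsed Gibbs sweep. This is not the haplotype sampler from the lecture: to stay short it swaps in a one-dimensional Gaussian likelihood with known variance and a conjugate N(0, tau2) prior on each cluster mean, so the component parameter can be integrated out in closed form. All function names and default values are assumptions made for illustration.

```python
import math
import random

def log_predictive(x, members, sigma2=1.0, tau2=25.0):
    """log p(x | cluster members), with the cluster mean integrated out."""
    precision = 1.0 / tau2 + len(members) / sigma2
    mean = (sum(members) / sigma2) / precision
    var = 1.0 / precision + sigma2
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def gibbs_sweep(data, labels, alpha, rng):
    """Resample each label from p(c_i = k | c_-i, data) ∝ CRP prior × likelihood."""
    for i, x in enumerate(data):
        labels[i] = None                       # remove point i from its cluster
        clusters = {}
        for j, c in enumerate(labels):
            if c is not None:
                clusters.setdefault(c, []).append(data[j])
        keys = list(clusters)
        # Existing cluster k: prior weight m_k. New cluster: prior weight alpha.
        log_w = [math.log(len(clusters[k])) + log_predictive(x, clusters[k])
                 for k in keys]
        log_w.append(math.log(alpha) + log_predictive(x, []))
        peak = max(log_w)
        w = [math.exp(v - peak) for v in log_w]
        u = rng.random() * sum(w)
        pick, acc = len(w) - 1, 0.0
        for idx, w_k in enumerate(w):
            acc += w_k
            if u < acc:
                pick = idx
                break
        # The last slot means "open a new cluster" with a fresh integer label.
        labels[i] = keys[pick] if pick < len(keys) else max(keys, default=-1) + 1
    return labels

rng = random.Random(0)
data = [rng.gauss(-5.0, 1.0) for _ in range(20)] + [rng.gauss(5.0, 1.0) for _ in range(20)]
labels = [0] * len(data)                       # start with everything in one cluster
for _ in range(50):
    gibbs_sweep(data, labels, alpha=1.0, rng=rng)
print(f"clusters in the final sample: {len(set(labels))}")
```

The CRP normalizer (N − 1 + α) is the same for every option and is dropped before normalizing the weights. The number of clusters is itself sampled, so repeated sweeps yield a posterior over it rather than a single chosen value.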

12 Convergence of Ancestral Inference

DP vs. Finite Mixture via EM

[Figure: individual error across data sets, with one series for DP and one for EM]

13 Variational Inference [Blei & Jordan 2005, Kurihara et al 2007]

- Gibbs sampling is not efficient enough to scale to large problems.
- The truncated stick-breaking approximation can be formulated in the space of explicit, non-exchangeable cluster labels.
- Variational inference can then be applied to such a finite-dimensional distribution.

Variational inference: for a complicated P(X_1, X_2, ..., X_n), approximate it with Q(X).

Approximations to DP

- Truncated stick-breaking representation
- Finite symmetric Dirichlet approximation

For each, the joint distribution is given as an equation on the original slide.
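The two finite approximations named on this slide can be sketched as samplers for the mixing weights. The truncated version below fixes the last stick fraction to 1 so the T weights sum to one, which is a common truncation convention and an assumption here; the finite symmetric Dirichlet draws π from Dirichlet(α/K, ..., α/K) via normalized Gamma variates. Function names are hypothetical.

```python
import random

def truncated_stick_breaking(alpha, truncation, rng):
    """T weights from stick-breaking, with the last fraction set to 1."""
    weights, remaining = [], 1.0
    for k in range(truncation):
        last = (k == truncation - 1)
        v_k = 1.0 if last else rng.betavariate(1.0, alpha)
        weights.append(v_k * remaining)
        remaining *= 1.0 - v_k
    return weights

def finite_symmetric_dirichlet(alpha, num_components, rng):
    """pi ~ Dirichlet(alpha/K, ..., alpha/K) via normalized Gamma draws."""
    shape = alpha / num_components
    draws = [rng.gammavariate(shape, 1.0) for _ in range(num_components)]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)
tsb = truncated_stick_breaking(alpha=2.0, truncation=10, rng=rng)
fsd = finite_symmetric_dirichlet(alpha=2.0, num_components=10, rng=rng)
print("TSB weights:", [round(w, 3) for w in tsb])
print("FSD weights:", [round(w, 3) for w in fsd])
```

Both give a finite-dimensional weight vector, which is what makes a standard variational treatment applicable. They differ in how mass is ordered across component indices.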

14 TDP vs. TSB

- TDP is size-biased.
- Cluster labels are NOT interchangeable under TDP but are interchangeable under TSB.

Marginalization

In the variational Bayesian approximation, we assume a factorized form for the posterior distribution. However, this is not a good assumption, since changes in π have a considerable impact on z.

If we can integrate out π, the joint distribution takes a closed form, given on the original slide separately for the TSB representation and for the FSD representation.
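The slide's equations for the marginalized joint did not survive extraction. As a hedged sketch, the functions below compute log p(z) with the weights integrated out, using the standard Dirichlet-multinomial integral for the finite symmetric Dirichlet and the standard Beta integrals for truncated stick-breaking with the last stick fraction fixed to 1. These are textbook identities, not necessarily the exact form shown in the lecture, and the function names are hypothetical.

```python
import math

def log_p_z_fsd(counts, alpha):
    """log p(z) under pi ~ Dirichlet(alpha/K, ..., alpha/K), pi integrated out."""
    num_components = len(counts)
    n = sum(counts)
    a = alpha / num_components
    out = math.lgamma(alpha) - math.lgamma(n + alpha)
    for n_k in counts:
        out += math.lgamma(n_k + a) - math.lgamma(a)
    return out

def log_p_z_tsb(counts, alpha):
    """log p(z) under truncated stick-breaking, v_k ~ Beta(1, alpha) integrated out.

    `counts` is ordered by stick index; the last stick fraction is fixed to 1,
    so it contributes no integral.
    """
    out = 0.0
    at_or_after = sum(counts)                  # N_{>=k}
    for n_k in counts[:-1]:
        after = at_or_after - n_k              # N_{>k}
        # integral of alpha * v^{n_k} * (1 - v)^{after + alpha - 1} over [0, 1]
        out += (math.log(alpha) + math.lgamma(1 + n_k) + math.lgamma(alpha + after)
                - math.lgamma(1 + alpha + at_or_after))
        at_or_after = after
    return out

# Two components, one data point assigned to each.
print(math.exp(log_p_z_fsd([1, 1], alpha=2.0)))
print(math.exp(log_p_z_tsb([1, 1], alpha=1.0)))
```

Both functions depend on the assignments z only through the per-component counts, which is why integrating out the weights couples the assignments that a fully factorized posterior would treat as independent.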

15 VB Inference

- We can then apply VB inference to the four approximations.
- The approximate posterior distributions for TSB and FSD are given on the original slide.
- Depending on whether marginalization is used, v and π may be integrated out.

Experimental Results

16 Summary

A nonparametric Bayesian model for pattern uncovery:
- Finite mixture model of latent patterns (e.g., image segments, objects)
- Infinite mixture of prototypes: an alternative to model selection
- Hierarchical infinite mixture
- Infinite hidden Markov model
- Temporal infinite mixture model

Applications in general data mining
