Advanced Machine Learning


Advanced Machine Learning: Nonparametric Bayesian Models -- Learning/Reasoning in Open Possible Worlds
Eric Xing, Lecture 7, August 4, 2009

Clustering

Image Segmentation
How to segment images? Manual segmentation is very expensive; algorithmic segmentation includes K-means, statistical mixture models, and spectral clustering. Problems with most existing algorithms: they ignore spatial information, they perform the segmentation one image at a time, and they need the number of segments to be specified a priori.

Object Recognition and Tracking
[Figure: tracked object attribute vectors, e.g. (1.8, 7.4, 2.3), observed at times t = 1, 2, 3.]

Modeling the Mind
Latent brain processes: view a picture, read a sentence, decide whether they are consistent. Observed data: fMRI scans over time t = 1, ..., T.

The Evolution of Science
Research circles (e.g., physics, biology, CS) and research topics evolving over time, as reflected in PNAS papers from 1900 to 2000.

A Classical Approach
Clustering as mixture modeling, then "model selection."

Partially Observed, Open and Evolving Possible Worlds
Unbounded number of objects/trajectories, changing attributes, birth/death, merge/split, relational ambiguity. The parametric paradigm specifies an event model p({φ_k}), a sensor model p(x | {φ_k}), and a motion model p({φ_k}^(t+1) | {φ_k}^t) over a finite, structurally unambiguous entity space and observation space. How to open it up?

Model Selection vs. Posterior Inference
Model selection: an "intelligent" guess (???); cross-validation (data-hungry); information-theoretic criteria such as AIC, TIC, and MDL (parsimony, Occam's razor); Bayes factors (need to compute the data likelihood). Maximum likelihood amounts to argmin_{θ̂,K} KL( f(·) || g(·; θ̂, K) ).
Posterior inference: we want to handle uncertainty about model complexity explicitly, and we favor a distribution that does not constrain M to a "closed" space: p(M | D) ∝ p(D | M) p(M), with M = {θ_k, K}.

Two "Recent" Developments
First-order probabilistic languages (FOPLs), e.g., PRM and BLOG, lift graphical models to an "open" world (number of random variables, relations, indices, lifespans, ...). They focus on complete, consistent operating rules for instantiating possible worlds and on a formal language for expressing such rules, giving an operational way of defining distributions over possible worlds via sampling methods.
Bayesian nonparametrics, e.g., Dirichlet processes and stick-breaking processes, go from finite mixtures, to infinite mixtures, to more complex constructions (hierarchies, spatial/temporal sequences, ...). They focus on the laws and behaviors of both the generative formalisms and the resulting distributions, and often offer explicit expressions for the distributions and expose their structure, motivating various approximation schemes.
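To make the contrast concrete, here is a minimal sketch (not from the slides) of the classical route: fit finite Gaussian mixtures for a range of K with scikit-learn and pick K by BIC, yielding a single point estimate of the model complexity rather than a posterior over it. The toy data and the candidate range of K are assumptions of this example.

```python
# Classical "mixture model + model selection": choose K by BIC.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy data: three well-separated 2-D clusters (an assumption for this example).
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in ([0, 0], [5, 5], [0, 5])])

bic_scores = {}
for k in range(1, 8):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    bic_scores[k] = gmm.bic(X)

best_k = min(bic_scores, key=bic_scores.get)
print("BIC per K:", bic_scores)
print("Selected K:", best_k)  # a point estimate of K, not a posterior over K
```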

Clustering
How to label them? How many clusters???

Random Partition of Probability Space
[Figure: the probability space partitioned into regions {φ_1, π_1}, {φ_2, π_2}, ..., {φ_6, π_6}; each region is an (event, p_event) pair, with centroid := φ_k and image element := (x, θ).]

Stick-breaking Process
G = Σ_k π_k δ(θ_k), where the location θ_k ~ G_0 and the mass π_k = β_k Π_{j<k} (1 − β_j), with β_k ~ Beta(1, α).
[Figure: successive sticks broken off the unit interval, e.g. masses 0.4, then 0.6·0.5 = 0.3, then 0.3·0.8 = 0.24, ..., placed at locations drawn from G_0.]

Chinese Restaurant Process
Customer i joins an existing table k with probability P(c_i = k | c_{-i}) = m_k / (i − 1 + α), where m_k is the number of customers already at table k, and starts a new table with probability α / (i − 1 + α).
The CRP defines an exchangeable distribution on partitions over an (infinite) sequence of samples; such a distribution is formally known as the Dirichlet process (DP).
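Both constructions can be simulated directly. Below is a minimal Python sketch (my own illustration, not from the slides) of a truncated stick-breaking draw of the weights and a CRP draw of a partition; the truncation level K, the number of customers n, and α are illustrative assumptions.

```python
import numpy as np

def stick_breaking_weights(alpha, K, rng):
    """Draw pi_1..pi_K via beta_k ~ Beta(1, alpha), pi_k = beta_k * prod_{j<k}(1 - beta_j)."""
    betas = rng.beta(1.0, alpha, size=K)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    return betas * remaining

def crp_partition(alpha, n, rng):
    """Seat n customers; customer i joins table k w.p. m_k/(i-1+alpha), new table w.p. alpha/(i-1+alpha)."""
    assignments, counts = [], []
    for i in range(n):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= i + alpha                       # i customers already seated (0-indexed)
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)                     # open a new table
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments, counts

rng = np.random.default_rng(0)
print(stick_breaking_weights(alpha=1.0, K=10, rng=rng))   # weights sum to < 1 (the rest is the tail)
print(crp_partition(alpha=1.0, n=20, rng=rng))
```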

Dirichlet Process
A random distribution G follows a Dirichlet process if, for any measurable finite partition (φ_1, φ_2, ..., φ_m) of the sample space,
(G(φ_1), G(φ_2), ..., G(φ_m)) ~ Dirichlet(αG_0(φ_1), ..., αG_0(φ_m)),
where G_0 is the base measure and α is the scale parameter. Thus a Dirichlet process G defines a distribution over distributions.

Graphical Model Representations of DP
[Figure: two plate diagrams. The CRP construction: G_0, α → G → θ_i → x_i, for i = 1..N. The stick-breaking construction: α → π → y_i, G_0 → θ_k, with x_i depending on y_i and θ.]
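The defining property above can be checked by simulation. The following sketch (an illustration, not from the slides) draws G by truncated stick-breaking with base measure G_0 = Uniform(0, 1) and compares the mean of (G(A_1), G(A_2), G(A_3)) over many draws with the mean of the corresponding Dirichlet distribution; the truncation level, α, and the chosen partition are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, K, n_draws = 2.0, 500, 2000
edges = np.array([0.0, 0.2, 0.7, 1.0])          # partition A_1=[0,.2), A_2=[.2,.7), A_3=[.7,1]
masses = np.zeros((n_draws, 3))

for s in range(n_draws):
    betas = rng.beta(1.0, alpha, size=K)
    pis = betas * np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    thetas = rng.uniform(0.0, 1.0, size=K)      # atoms drawn from the base measure G_0
    bins = np.digitize(thetas, edges) - 1
    for j in range(3):
        masses[s, j] = pis[bins == j].sum()

print("empirical mean of (G(A_1), G(A_2), G(A_3)):", masses.mean(axis=0))
print("Dirichlet mean alpha*G_0(A_j)/alpha       :", np.diff(edges))
```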

Ancestral Inference
[Figure: ancestor haplotypes A_k with parameters θ generate individual haplotypes H_n1, H_n2 and genotypes G_n, for n = 1..N.]
Essentially a clustering problem, but: better recovery of the ancestors leads to better haplotyping results (because of more accurate grouping of common haplotypes); true haplotypes are obtainable at high cost, but they can validate the model more objectively (as opposed to examining the saliency of the clustering); and there are many other biological/scientific utilities.

Example: DP-haplotyper [Xing et al., 2004]
Clustering human populations. A DP prior with scale α and base measure G_0 generates G, which provides infinitely many mixture components {A_k, θ_k} (population haplotypes); a likelihood model then generates the individual haplotypes H_n1, H_n2 and genotypes G_n. Inference: Markov chain Monte Carlo (MCMC) via Gibbs sampling and Metropolis-Hastings.

The DP Mixture of Ancestral Haplotypes
The customers around a table in the CRP form a cluster: associate a mixture component (i.e., a population haplotype) with each table, and sample {A_k, θ_k} at each table from the base measure G_0 to obtain the population haplotype and the nucleotide-substitution frequency for that component.
[Figure: customers 1-9 seated at tables, each table labeled with its {A, θ}.]
With p(h | A_k, θ_k) and p(g | h_1, h_2), the CRP yields a posterior distribution on the number of population haplotypes (and on the haplotype configurations and the nucleotide-substitution frequencies).

Inheritance and Observation Models
Single-locus mutation model (ancestor a to haplotype h at locus t): P_H(h_t | a_t, θ) = θ if h_t = a_t, and (1 − θ)/(|B| − 1) otherwise; i.e., h_t equals a_t with probability θ, where B is the allele alphabet.
Noisy observation model (haplotypes to genotype): P(g_t | h_{1,t}, h_{2,t}) makes g_t consistent with the pair (h_{1,t}, h_{2,t}) with probability λ.
[Figure: ancestral pool A_1, A_2, A_3 → haplotypes H_i1, H_i2 → genotype G_i.]
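As a hedged sketch of the two likelihood terms just described (my own simplified illustration, not the slide's exact model), the following encodes the single-locus mutation probability p(h_t | a_t, θ) and a noisy genotype observation p(g_t | h_{1,t}, h_{2,t}). The allele alphabet and the values of θ and λ are assumptions, and the inconsistent-genotype case is not normalized over alternatives here.

```python
import numpy as np

ALPHABET = ["A", "C", "G", "T"]          # assumed allele alphabet B

def p_haplotype_given_ancestor(h_t, a_t, theta):
    """p(h_t | a_t, theta): equals the ancestral allele w.p. theta, else uniform over the rest."""
    return theta if h_t == a_t else (1.0 - theta) / (len(ALPHABET) - 1)

def p_genotype_given_haplotypes(g_t, h1_t, h2_t, lam):
    """p(g_t | h1_t, h2_t): observed genotype agrees with the haplotype pair w.p. lam
    (remaining mass is not spread over alternatives in this simplified sketch)."""
    consistent = (g_t == frozenset((h1_t, h2_t)))
    return lam if consistent else 1.0 - lam

def haplotype_loglik(h, a, theta):
    """Log-likelihood of a whole haplotype h given ancestor a (independent loci)."""
    return float(np.sum([np.log(p_haplotype_given_ancestor(ht, at, theta))
                         for ht, at in zip(h, a)]))

# Example usage with made-up sequences: one mutation relative to the ancestor.
print(haplotype_loglik("ACGA", "ACGT", theta=0.95))
print(p_genotype_given_haplotypes(frozenset(("A", "C")), "A", "C", lam=0.99))
```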

MCMC for Haplotype Inference
Gibbs sampling for exploring the posterior distribution under the proposed model. Integrate out parameters such as θ and λ, and sample c_ie, a_k, and h_ie:
p(c_ie = k | c_[-ie], h, a) ∝ p(c_ie = k | c_[-ie]) · p(h_ie | a_k, h_[-ie], c),
i.e., posterior ∝ prior (the CRP term) × likelihood.
Gibbs sampling algorithm: draw samples of each random variable to be sampled given the values of all the remaining variables.
The sampler iterates: 1. sample c_ie from its conditional; 2. sample a_k from its conditional; 3. sample h_ie from its conditional. For the DP scale parameter α: a vague inverse-Gamma prior.
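For intuition about the "CRP prior × likelihood" conditional above, here is a minimal collapsed CRP Gibbs sampler for binary sequences with Beta-Bernoulli components integrated out. It is my own simplified sketch, not the DP-haplotyper sampler itself; α, the Beta hyperparameters, and the toy data are assumptions.

```python
import numpy as np

def crp_gibbs(X, alpha=1.0, a0=1.0, b0=1.0, n_iters=50, seed=0):
    """Collapsed Gibbs sampling of cluster labels z for binary data X (N x L)."""
    rng = np.random.default_rng(seed)
    N, L = X.shape
    z = np.zeros(N, dtype=int)                      # start with everyone in one cluster
    for _ in range(n_iters):
        for i in range(N):
            z[i] = -1                               # remove point i from its cluster
            labels = [int(k) for k in np.unique(z) if k >= 0]
            log_probs = []
            for k in labels + ["new"]:
                if k == "new":
                    n_k, s_k, prior = 0, np.zeros(L), alpha
                else:
                    members = (z == k)
                    n_k, s_k, prior = members.sum(), X[members].sum(axis=0), members.sum()
                # Collapsed Beta-Bernoulli predictive, one independent term per locus.
                p1 = (a0 + s_k) / (a0 + b0 + n_k)
                loglik = np.sum(X[i] * np.log(p1) + (1 - X[i]) * np.log(1 - p1))
                log_probs.append(np.log(prior) + loglik)
            log_probs = np.array(log_probs)
            probs = np.exp(log_probs - log_probs.max())
            probs /= probs.sum()
            choice = rng.choice(len(probs), p=probs)
            z[i] = labels[choice] if choice < len(labels) else (max(labels, default=-1) + 1)
    return z

# Toy usage: two obvious groups of binary sequences.
X = np.array([[1, 1, 1, 0, 0]] * 5 + [[0, 0, 0, 1, 1]] * 5)
print(crp_gibbs(X))
```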

Convergence of Ancestral Inference
[Figure: convergence results of the sampler (not transcribed).]

DP vs. Finite Mixture via EM
[Figure: individual error (0 to 0.45) across five data sets, comparing the DP model with a finite mixture fit by EM.]

Variational Inference [Blei & Jordan 2005, Kurihara et al. 2007]
The Gibbs sampling solution is not efficient enough to scale up to large problems. A truncated stick-breaking approximation can be formulated in the space of explicit, non-exchangeable cluster labels, and variational inference can then be applied to this finite-dimensional distribution.
Variational inference: for a complicated P(X_1, X_2, ..., X_n), approximate it with a simpler Q(X).

Approximations to DP
Two finite approximations are used: the truncated stick-breaking representation and the finite symmetric Dirichlet approximation; in each case the joint distribution can be written down explicitly.
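Below is a hedged sketch of the truncated stick-breaking mean-field updates in the style of the Blei & Jordan approximation, simplified so that the component log-likelihoods log p(x_n | η_k) are held fixed and passed in as a matrix; only q(v_k) = Beta(γ_{k,1}, γ_{k,2}) and the responsibilities φ are updated. The truncation level, α, and the toy likelihoods in the usage example are assumptions.

```python
import numpy as np
from scipy.special import digamma

def vb_stick_breaking(loglik, alpha=1.0, n_iters=100):
    """loglik: (N, T) matrix of fixed log p(x_n | eta_k); returns responsibilities phi."""
    N, T = loglik.shape
    phi = np.full((N, T), 1.0 / T)
    for _ in range(n_iters):
        Nk = phi.sum(axis=0)                            # expected counts per component
        g1 = 1.0 + Nk[:-1]                              # q(v_k) = Beta(g1_k, g2_k), k = 0..T-2
        g2 = alpha + (Nk.sum() - np.cumsum(Nk))[:-1]    # sum of expected counts for j > k
        E_log_v = digamma(g1) - digamma(g1 + g2)
        E_log_1mv = digamma(g2) - digamma(g1 + g2)
        # E[log pi_k] = E[log v_k] + sum_{j<k} E[log(1 - v_j)]; v_{T-1} = 1 under truncation.
        E_log_pi = np.append(E_log_v, 0.0) + np.concatenate(([0.0], np.cumsum(E_log_1mv)))
        # Update responsibilities.
        log_phi = E_log_pi + loglik
        log_phi = log_phi - log_phi.max(axis=1, keepdims=True)
        phi = np.exp(log_phi)
        phi = phi / phi.sum(axis=1, keepdims=True)
    return phi

# Toy usage with made-up fixed log-likelihoods for N = 6 points and T = 4 components.
rng = np.random.default_rng(0)
print(vb_stick_breaking(rng.normal(size=(6, 4))).round(2))
```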

TDP vs. TSB
The TDP is size-biased: cluster labels are NOT exchangeable under the TDP, but they are exchangeable under the TSB.

Marginalization
In the variational Bayesian approximation we assume a factorized form for the posterior distribution. This is not a good assumption, however, since changes in π have a considerable impact on z. If we can integrate out π, the joint distribution no longer needs a separate factor for the mixing weights; this can be done for both the TSB and the FSD representations.
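As a concrete instance of the marginalization just mentioned (for the finite symmetric Dirichlet case; this standard formula is my addition, not transcribed from the slides), integrating π out of p(z | π) p(π | α) with π ~ Dirichlet(α/K, ..., α/K) gives the Dirichlet-multinomial marginal

p(z | α) = ∫ p(z | π) p(π | α) dπ = Γ(α)/Γ(N + α) · ∏_{k=1}^{K} Γ(n_k + α/K)/Γ(α/K),

where n_k = #{n : z_n = k}, so that z can be handled directly in the collapsed variational posterior.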

VB Inference
We can then apply VB inference to the four resulting approximations. The approximate posterior distributions for TSB and FSD take a factorized form; depending on whether we marginalize, v and π may additionally be integrated out.

Experimental Results
[Figure: experimental comparison of the variational approximations (not transcribed).]

Summary
A nonparametric Bayesian approach to pattern discovery: a finite mixture model of latent patterns (e.g., image segments, objects); an infinite mixture of prototypes as an alternative to model selection; hierarchical infinite mixtures; the infinite hidden Markov model as a temporal infinite mixture model; and applications in general data mining.