Machine Learning Techniques for Computer Vision

1 Machine Learning Techniques for Computer Vision. Part 2: Unsupervised Learning. Microsoft Research Cambridge. ECCV 2004, Prague.

2 Overview of Part 2: mixture models; EM; variational inference; Bayesian model complexity; continuous latent variables.

3 The Gaussian Distribution: the multivariate Gaussian is specified by its mean and covariance, which can be estimated by maximum likelihood.
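
The maximum likelihood estimates for a multivariate Gaussian can be sketched in a few lines. This is a minimal illustration, not code from the talk: the function names `gaussian_ml` and `gaussian_log_density` are hypothetical, and the data are assumed to be stored as an N x D NumPy array.

```python
import numpy as np


def gaussian_ml(X):
    """Maximum likelihood mean and covariance for data X of shape (N, D)."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    centered = X - mu
    # The ML covariance divides by N (not N - 1), so it is a biased estimate.
    sigma = centered.T @ centered / X.shape[0]
    return mu, sigma


def gaussian_log_density(X, mu, sigma):
    """Log of the multivariate Gaussian density at each row of X."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    D = X.shape[1]
    diff = X - mu
    _, logdet = np.linalg.slogdet(sigma)
    # Squared Mahalanobis distance of every row from the mean.
    maha = np.einsum("nd,nd->n", diff @ np.linalg.inv(sigma), diff)
    return -0.5 * (D * np.log(2.0 * np.pi) + logdet + maha)
```

Summing `gaussian_log_density(X, mu, sigma)` over the rows of `X` gives the log likelihood that these estimates maximize.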

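4 Gaussian Mixtures: a linear superposition of Gaussians, $p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$; normalization and positivity require $\sum_{k=1}^{K} \pi_k = 1$ and $0 \le \pi_k \le 1$.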

5 Example: Mixture of 3 Gaussians [Figure: panels (a) and (b)]

6 Maximum Likelihood for the GMM: in the log likelihood function the sum over components appears inside the log, so there is no closed-form ML solution.

7 EM Algorithm Informal Derivation

8 EM Algorithm Informal Derivation: M step equations.

9 EM Algorithm Informal Derivation: E step equation.

10 EM Algorithm Informal Derivation: the mixing coefficients can be interpreted as prior probabilities; the corresponding posterior probabilities are the responsibilities.
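
The E and M steps for a Gaussian mixture can be sketched as follows. This is the standard textbook form of the algorithm, written under assumptions the slides do not state: the function names, the initialisation (random data points as means, the overall data covariance for every component) and the small `reg` term added to each covariance for numerical stability are all illustrative choices.

```python
import numpy as np


def log_gaussian(X, mu, sigma):
    """Log density of N(mu, sigma) at each row of X."""
    D = X.shape[1]
    diff = X - mu
    _, logdet = np.linalg.slogdet(sigma)
    maha = np.einsum("nd,nd->n", diff @ np.linalg.inv(sigma), diff)
    return -0.5 * (D * np.log(2.0 * np.pi) + logdet + maha)


def em_gmm(X, K, n_iter=100, seed=0, reg=1e-6):
    """Fit a K-component Gaussian mixture to X (N, D) by EM."""
    X = np.asarray(X, dtype=float)
    N, D = X.shape
    rng = np.random.default_rng(seed)
    # Initialisation (an assumption, not from the slides).
    mu = X[rng.choice(N, size=K, replace=False)].copy()
    cov0 = np.atleast_2d(np.cov(X.T, bias=True)) + reg * np.eye(D)
    sigma = np.stack([cov0] * K)
    pi = np.full(K, 1.0 / K)
    log_liks = []
    for _ in range(n_iter):
        # E step: responsibilities = posterior probability of each component.
        log_joint = np.stack(
            [np.log(pi[k]) + log_gaussian(X, mu[k], sigma[k]) for k in range(K)],
            axis=1,
        )
        log_norm = np.logaddexp.reduce(log_joint, axis=1)
        log_liks.append(log_norm.sum())  # log likelihood before this update
        resp = np.exp(log_joint - log_norm[:, None])
        # M step: re-estimate parameters using responsibility-weighted data.
        Nk = resp.sum(axis=0)
        mu = (resp.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            sigma[k] = (resp[:, k, None] * diff).T @ diff / Nk[k] + reg * np.eye(D)
        pi = Nk / N
    return pi, mu, sigma, resp, np.array(log_liks)
```

The returned `log_liks` array makes it easy to check that the log likelihood does not decrease from one iteration to the next, up to the tiny perturbation introduced by `reg`.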

11 Old Faithful Data Set [Figure: time between eruptions (minutes) plotted against duration of eruption (minutes)]

18 Latent Variable View of EM: to sample from a Gaussian mixture, first pick one of the components with probability given by its mixing coefficient, then draw a sample from that component; repeat these two steps for each new data point. [Figure: panel (a)]
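
The two-step sampling procedure can be sketched directly. The function name `sample_gmm` and its argument layout (mixing coefficients `pi`, means `mu` of shape K x D, covariances `sigma` of shape K x D x D) are assumptions made for this illustration.

```python
import numpy as np


def sample_gmm(pi, mu, sigma, n_samples, seed=0):
    """Draw n_samples points from a Gaussian mixture.

    Returns the points X and the component labels z (the "colours").
    """
    rng = np.random.default_rng(seed)
    pi = np.asarray(pi, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    # Step 1: pick a component for every data point with probability pi[k].
    z = rng.choice(len(pi), size=n_samples, p=pi)
    # Step 2: draw each point from the Gaussian of its chosen component.
    X = np.empty((n_samples, mu.shape[1]))
    for k in range(len(pi)):
        idx = np.flatnonzero(z == k)
        if idx.size:
            X[idx] = rng.multivariate_normal(mu[k], sigma[k], size=idx.size)
    return X, z
```

Keeping `z` gives the complete data set; discarding it leaves the incomplete data that EM has to work with.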

19 Latent Variable View of EM. Goal: given a data set, find the parameters of the mixture. Suppose we knew the colours: maximum likelihood would then involve fitting each component to the corresponding cluster. Problem: the colours are latent (hidden) variables.

20 Incomplete and Complete Data [Figure: (a) complete, (b) incomplete]

21 Latent Variable Viewpoint

22 Latent Variable Viewpoint: binary latent variables describe which component generated each data point; the model is defined by the conditional distribution of the observed variable and the prior distribution of the latent variables; marginalizing over the latent variables, we obtain the Gaussian mixture distribution.

23 Graphical Representation of GMM [Figure: graphical model with latent variable z_n and observed variable x_n in a plate of size N]

24 Latent Variable View of EM. Suppose we knew the values of the latent variables: we could maximize the complete-data log likelihood, which has a trivial closed-form solution (fit each component to the corresponding set of data points). We don't know the values of the latent variables; however, for given parameter values we can compute the expected values of the latent variables.

25 Posterior Probabilities (colour coded) [Figure: panels (a) and (b)]

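26 Over-fitting in Gaussian Mixture Models: infinities arise in the likelihood function when a component collapses onto a data point, since that component's density at the point grows without bound as its variance shrinks to zero. Also, maximum likelihood cannot determine the number K of components.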

27 Cross Validation: model complexity can be selected using an independent validation data set. If data is scarce, use cross-validation: partition the data into S subsets; train on S - 1 subsets; test on the remainder; repeat and average. Disadvantages: computationally expensive; can only determine one or two complexity parameters.
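
The S-fold procedure can be sketched generically. In this illustration `cross_validate`, `fit` and `score` are hypothetical names: `fit` is any callable that trains a model on a subset of the data, and `score` is any callable that evaluates a trained model on held-out data (for a mixture model, the held-out log likelihood would be a natural choice).

```python
import numpy as np


def cross_validate(data, fit, score, S=5, seed=0):
    """Average held-out score over S folds.

    data  : array whose first axis indexes data points
    fit   : callable, fit(train_data) -> model
    score : callable, score(model, test_data) -> float (higher is better)
    """
    data = np.asarray(data)
    N = data.shape[0]
    rng = np.random.default_rng(seed)
    # Partition a random permutation of the indices into S subsets.
    folds = np.array_split(rng.permutation(N), S)
    scores = []
    for s in range(S):
        test_idx = folds[s]
        # Train on the other S - 1 subsets, test on the remaining one.
        train_idx = np.concatenate([folds[j] for j in range(S) if j != s])
        model = fit(data[train_idx])
        scores.append(score(model, data[test_idx]))
    return float(np.mean(scores))
```

Selecting a complexity parameter then means calling `cross_validate` once per candidate value and keeping the best average score, which is where the computational cost noted on the slide comes from: every candidate requires S training runs.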

28 Bayesian Mixture of Gaussians: parameters and latent variables appear on an equal footing; conjugate priors. [Figure: graphical model with latent variable z_n and observed variable x_n in a plate of size N]

29 Data Set Size. Problem 1: learn a function from 100 (slightly) noisy examples; the data set is computationally small but statistically large. Problem 2: learn to recognize 1,000 everyday objects from 5,000,000 natural images; the data set is computationally large but statistically small. Bayesian inference is computationally more demanding than ML or MAP (but see the discussion of Gaussian mixtures later), with significant benefit for statistically small data sets.

30 Variational Inference. Exact Bayesian inference is intractable. Markov chain Monte Carlo: computationally expensive; issues of convergence. Variational inference: a broadly applicable deterministic approximation; treat all latent variables and parameters as a single set of unknowns; approximate the true posterior over them using a simpler distribution q; minimize the Kullback-Leibler divergence between the two.

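31 General View of Variational Inference: for an arbitrary distribution q, the log marginal likelihood decomposes into a lower bound plus a Kullback-Leibler divergence. Maximizing the bound over q would give the true posterior, but this is intractable by definition.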
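
The decomposition behind this slide can be written out as follows. The notation is an assumption, since the slide's own symbols were lost: $D$ stands for the observed data, $\theta$ for all latent variables and parameters, and $q(\theta)$ for the approximating distribution.

```latex
\ln p(D) = \mathcal{L}(q) + \mathrm{KL}(q \,\|\, p),
\qquad
\mathcal{L}(q) = \int q(\theta) \ln \frac{p(D, \theta)}{q(\theta)} \, d\theta,
\qquad
\mathrm{KL}(q \,\|\, p) = -\int q(\theta) \ln \frac{p(\theta \mid D)}{q(\theta)} \, d\theta \;\ge\; 0 .
```

Because the left-hand side does not depend on $q$, maximizing the lower bound $\mathcal{L}(q)$ is equivalent to minimizing the Kullback-Leibler divergence, and the divergence vanishes exactly when $q(\theta) = p(\theta \mid D)$.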

32 Variational Lower Bound

33 Factorized Approximation. Goal: choose a family of q distributions which are sufficiently flexible to give a good approximation and sufficiently simple to remain tractable. Here we consider factorized distributions; no further assumptions are required! The optimal solution for one factor is found keeping the remainder fixed; the solutions are coupled, so initialize and then update the factors cyclically. Message passing view: Winn and Bishop (2004).

34 [Figure: panel (a)]

35 Lower Bound: can also be evaluated; useful for maths/code verification; also useful for model comparison.

36 Illustration: Univariate Gaussian. Likelihood function; conjugate prior; factorized variational distribution.
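
The cyclic updates for this example can be sketched as below. The slide's equations did not survive, so the parameterisation is an assumption taken from the standard treatment: a Gaussian likelihood with mean µ and precision τ, a Gaussian prior on µ with mean `mu0` and precision `lam0 * tau`, a Gamma prior on τ with shape `a0` and rate `b0`, and a factorized posterior q(µ)q(τ). All names and default values are illustrative.

```python
import numpy as np


def vb_univariate_gaussian(x, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0, n_iter=50):
    """Factorized variational posterior q(mu) q(tau) for a univariate Gaussian.

    Returns (mu_N, lam_N, a_N, b_N) where
      q(mu)  = Normal(mean=mu_N, precision=lam_N)
      q(tau) = Gamma(shape=a_N, rate=b_N)
    """
    x = np.asarray(x, dtype=float)
    N = x.size
    xbar = x.mean()
    # These two quantities do not change between iterations.
    mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
    a_N = a0 + 0.5 * (N + 1)
    e_tau = a0 / b0  # initial guess for E[tau]
    for _ in range(n_iter):
        # Update q(mu) given the current expectation of tau.
        lam_N = (lam0 + N) * e_tau
        # Update q(tau) given the current q(mu): expectations of the
        # squared deviations under q(mu) add a variance term 1 / lam_N.
        e_sq_data = np.sum((x - mu_N) ** 2) + N / lam_N
        e_sq_prior = (mu_N - mu0) ** 2 + 1.0 / lam_N
        b_N = b0 + 0.5 * (e_sq_data + lam0 * e_sq_prior)
        e_tau = a_N / b_N
    return mu_N, lam_N, a_N, b_N
```

The two factors depend on each other only through `e_tau` and `lam_N`, which is why the updates must be iterated rather than solved in one pass.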

37 Initial Configuration [Figure: panel (a), axes µ and τ]

38 After Updating [Figure: panel (b), axes µ and τ]

39 After Updating [Figure: panel (c), axes µ and τ]

40 Converged Solution [Figure: panel (d), axes µ and τ]

41 Variational Mixture of Gaussians: assume a factorized posterior distribution; no other approximations are needed!

42 Variational Equations for GMM

43 Lower Bound for GMM

44 VIBES Bishop, Spiegelhalter and Winn (2002)

45 ML Limit: if instead we choose a point estimate (delta function) for the distribution over the parameters, we recover the maximum likelihood EM algorithm.

46 Bound vs. K for Old Faithful Data

47 Bayesian Model Complexity

48 Sparse Bayes for Gaussian Mixture (Corduneanu and Bishop, 2001): start with a large value of K; treat the mixing coefficients as parameters; maximize the marginal likelihood; this prunes out excess components.

51 Summary: Variational Gaussian Mixtures. Simple modification of maximum likelihood EM code; small computational overhead compared to EM; no singularities; automatic model order selection.

52 Continuous Latent Variables. Conventional PCA: eigenvector decomposition of the data covariance matrix; minimizes the sum-of-squares projection error; not a probabilistic model; how should we choose L? [Figure: data point x_n and its projection x̃_n onto the direction u_1, axes x_1 and x_2]
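
Conventional PCA as described here can be sketched with an eigendecomposition of the data covariance matrix. The function name `pca` and its return values are choices made for this illustration; L is the number of retained directions.

```python
import numpy as np


def pca(X, L):
    """Project data X (N, D) onto its L leading principal directions."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean
    S = Xc.T @ Xc / X.shape[0]            # data covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]     # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    U = eigvecs[:, :L]                    # principal directions u_1 .. u_L
    Z = Xc @ U                            # coordinates in the principal subspace
    X_tilde = mean + Z @ U.T              # projected (reconstructed) points
    # Mean sum-of-squares projection error over the data set.
    error = np.mean(np.sum((X - X_tilde) ** 2, axis=1))
    return U, eigvals, X_tilde, error
```

The returned `error` equals the sum of the discarded eigenvalues, which shows why the criterion alone cannot choose L: the error can only shrink as L grows.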

53 Probabilistic PCA (Tipping and Bishop, 1998): L-dimensional continuous latent space; D-dimensional data space; PCA; factor analysis. [Figure: latent variable z mapped along the direction w into the data space, axes x_1 and x_2]

54 Probabilistic PCA: marginal distribution. Advantages: exact ML solution; computationally efficient EM algorithm; captures dominant correlations with few parameters; mixtures of PPCA; Bayesian PCA; building block for more complex models. [Figure: graphical model with z_n and x_n in a plate of size N, and parameter W]
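
The exact ML solution can be sketched using the closed form from Tipping and Bishop's analysis: the noise variance is the average of the discarded eigenvalues of the data covariance, and the columns of W are the leading eigenvectors scaled by the square root of the eigenvalue minus the noise variance. The function name `ppca_ml` is hypothetical, and the arbitrary rotation in the solution is assumed to be the identity.

```python
import numpy as np


def ppca_ml(X, L):
    """Closed-form maximum likelihood fit of probabilistic PCA (requires L < D)."""
    X = np.asarray(X, dtype=float)
    N, D = X.shape
    mean = X.mean(axis=0)
    Xc = X - mean
    S = Xc.T @ Xc / N
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Noise variance: average variance in the discarded directions.
    sigma2 = eigvals[L:].mean()
    # Columns of W: leading eigenvectors scaled by sqrt(eigenvalue - sigma2).
    W = eigvecs[:, :L] * np.sqrt(np.maximum(eigvals[:L] - sigma2, 0.0))
    # Covariance of the Gaussian marginal distribution over x.
    C = W @ W.T + sigma2 * np.eye(D)
    return mean, W, sigma2, C
```

The marginal covariance `C` needs only D * L + 1 numbers beyond the mean, compared with D * (D + 1) / 2 for a full covariance matrix, which is the sense in which the model captures dominant correlations with few parameters.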

55 EM for PCA [Figure: panel (a)]

56 EM for PCA [Figure: panel (b)]

57 EM for PCA [Figure: panel (c)]

58 EM for PCA [Figure: panel (d)]

59 EM for PCA [Figure: panel (e)]

60 EM for PCA [Figure: panel (f)]

61 EM for PCA [Figure: panel (g)]
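
Since only the figure panels survive for these slides, the following is a sketch of the standard EM algorithm for probabilistic PCA, which is assumed (not confirmed by the source) to be what the panels illustrate. The function name `em_ppca`, the random initialisation and the fixed iteration count are illustrative choices.

```python
import numpy as np


def em_ppca(X, L, n_iter=200, seed=0):
    """EM for probabilistic PCA; returns the data mean, W (D, L) and sigma2."""
    X = np.asarray(X, dtype=float)
    N, D = X.shape
    mean = X.mean(axis=0)
    Xc = X - mean
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((D, L))  # random start (an assumption)
    sigma2 = 1.0
    for _ in range(n_iter):
        # E step: posterior moments of the latent variables.
        M_inv = np.linalg.inv(W.T @ W + sigma2 * np.eye(L))
        Ez = Xc @ W @ M_inv                    # rows are E[z_n]
        Ezz = N * sigma2 * M_inv + Ez.T @ Ez   # sum over n of E[z_n z_n^T]
        # M step: re-estimate W and the noise variance.
        W_new = (Xc.T @ Ez) @ np.linalg.inv(Ezz)
        sigma2 = (
            np.sum(Xc ** 2)
            - 2.0 * np.sum(Ez * (Xc @ W_new))
            + np.trace(Ezz @ W_new.T @ W_new)
        ) / (N * D)
        W = W_new
    return mean, W, sigma2
```

Each iteration works with L x L matrices and products against the data, so the D x D covariance matrix is never formed explicitly, which is the practical appeal when D is large and L is small.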

62 Bayesian PCA (Bishop, 1998): Gaussian prior over the columns of W; automatic relevance determination (ARD). [Figures: graphical model with z_n, x_n, W and a plate of size N; ML PCA compared with Bayesian PCA]

63 Non-linear Manifolds. Example: images of a rigid object. [Figure: axes x_1, x_2 and x_3]

64 Bayesian Mixture of BPCA Models [Figure: graphical model with W_m, s_n, x_n, z_nm and plates N and M]

66 Flexible Sprites (Jojic and Frey, 2001): automatic decomposition of a video sequence into a background model, an ordered set of masks (one per object per frame), and a foreground model (one per object per frame).

68 Transformed Component Analysis: generative model; now include transformations (translations); extend to L layers; inference is intractable, so use the variational framework. [Figure: graphical model with s_l, m_l, T_nl, x_n and plates L and N]

70 Bayesian Constellation Model (Li, Fergus and Perona, 2003): object recognition from small training sets; variational treatment of a fully Bayesian model.

71 Bayesian Constellation Model

72 Summary of Part 2: discrete and continuous latent variables; EM algorithm; build complex models from simple components (represented graphically, incorporating prior knowledge); variational inference; Bayesian model comparison.
