Shared Gaussian Process Latent Variable Models. Carl Henrik Ek, Philip H. S. Torr, Neil D. Lawrence. Computer Science Departmental Seminar, University of Bristol, October 25th, 2007

1 Shared Gaussian Process Latent Variable Models. Carl Henrik Ek (Oxford Brookes University), Philip H. S. Torr (Oxford Brookes University), Neil D. Lawrence (University of Manchester). Computer Science Departmental Seminar, University of Bristol, October 25th, 2007

2 Source code and slides are available online. MATLAB toolboxes: neill/software.html. Contact: Carl Henrik Ek, Neil D. Lawrence

3 1 Introduction 2 Gaussian Processes 3 GP-LVM 4 Shared GP-LVM 5 Applications 6 Conclusion 7 References

5 Representation: a high-dimensional object. [Figure: example image] 1 code/image sample.m

11 Representation: the representation often reflects the collection of the data rather than the characteristics of the data. 1 code/image sample.m

14 Re-representation. Feature Selection: the technique of selecting a subset of relevant features for building robust learning models; by removing the most irrelevant and redundant features from the data, feature selection helps improve the performance of learning models (supervised dimensionality reduction). Feature Extraction: simplifying the amount of resources required to describe a large set of data accurately (unsupervised dimensionality reduction).

15 Today: Feature Extraction with supervised structure

16 1 Introduction 2 Gaussian Processes 3 GP-LVM 4 Shared GP-LVM 5 Applications 6 Conclusion 7 References

17 Gaussian Processes (GP) 2. Generalisation of the Gaussian distribution. Gaussian distribution: mean vector, covariance matrix. Gaussian process: mean function, covariance function. Distributions over functions: functions are infinite objects, so GPs are defined over infinite index sets, and finite instantiations of functions are jointly Gaussian. Provides a probabilistic framework for dealing with functions. 2 [Rasmussen and Williams(2006)]

18 Gaussian Processes: Design. f ~ GP(µ(x), k(x, x')). Mean function µ(x): often taken to be zero, i.e. centering of the data. Covariance function k(x, x'): defines the class of functions the GP contains; the class of valid covariance functions is the class of Mercer kernels.

19 GP covariance functions. Linear: k(x_i, x_j) = x_i^T x_j. Radial Basis Function (RBF): k(x_i, x_j) = θ exp(-(γ/2) ||x_i - x_j||²). Multi-Layered Perceptron (MLP): k(x_i, x_j) = θ sin⁻¹( (w x_i^T x_j + b) / sqrt((w x_i^T x_i + b + 1)(w x_j^T x_j + b + 1)) ). Notation: Φ collects the parameters of the covariance function.
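To make the three covariance functions concrete, here is a small MATLAB sketch evaluating them on a toy set of inputs; it is not the seminar's toolbox code, and the values of theta, gamma, w and b are illustrative stand-ins for the entries of Φ.

```matlab
% Sketch: evaluate the linear, RBF and MLP covariances on N inputs X (N x D).
N = 5;  X = linspace(-1, 1, N)';            % toy 1-D inputs
theta = 1; gamma = 10; w = 5; b = 0.5;      % illustrative covariance parameters (Phi)

Klin = X * X';                              % linear: k(x_i, x_j) = x_i' * x_j

s = sum(X.^2, 2);                           % squared norm of each input
sqdist = repmat(s, 1, N) + repmat(s', N, 1) - 2 * (X * X');
Krbf = theta * exp(-gamma/2 * sqdist);      % RBF

inner = w * (X * X') + b;                   % MLP kernel (argument stays in [-1,1] for w, b >= 0)
norms = w * s + b + 1;
Kmlp = theta * asin(inner ./ sqrt(norms * norms'));
```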

20 GP prior, linear kernel. [Figure: samples drawn from a GP prior with a linear kernel] 3 code/prior sample.m

21 GP prior, RBF kernel, width = 1. [Figure: samples drawn from a GP prior with an RBF kernel of width 1] 3 code/prior sample.m

22 GP prior, RBF kernel, width = 1e-1. [Figure: samples drawn from a GP prior with an RBF kernel of width 0.1] 3 code/prior sample.m

23 GP prior, RBF kernel, width = 1e-2. [Figure: samples drawn from a GP prior with an RBF kernel of width 0.01] 3 code/prior sample.m

24 GP prior, MLP kernel. [Figure: samples drawn from a GP prior with an MLP kernel] 3 code/prior sample.m

25 GP prior, RBF + linear + noise kernel. [Figure: samples drawn from a GP prior with a linear + RBF + noise kernel] 3 code/prior sample.m
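The prior-sample figures above can be reproduced in spirit with a few lines of MATLAB. This is a minimal sketch for the RBF case only, not the code/prior sample.m script used for the slides; the grid, width and jitter values are arbitrary.

```matlab
% Sketch: draw three functions from a zero-mean GP prior with an RBF covariance.
N = 200;  x = linspace(-1, 1, N)';          % input grid
theta = 1;  gamma = 20;                     % illustrative covariance parameters
s = sum(x.^2, 2);
K = theta * exp(-gamma/2 * (repmat(s, 1, N) + repmat(s', N, 1) - 2 * (x * x')));
L = chol(K + 1e-8 * eye(N), 'lower');       % small jitter for numerical stability
f = L * randn(N, 3);                        % three independent prior samples
plot(x, f);
```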

26 GP posterior. Observed input-output pairs D = {(x_i, y_i), i = 1,...,N}, x_i in R^D, y_i in R. Joint distribution with an unobserved test input x*: [y; y*] ~ N( 0, [ k(X, X) + σ²I, k(X, x*); k(x*, X), k(x*, x*) + σ² ] ). Predictions come from the posterior: y* ~ N(ȳ*, cov(y*)) (details in the appendix).
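A minimal sketch of computing the posterior predictive distribution above (the explicit equations are in the appendix). It is not the seminar's code/posterior sample.m; the training data, RBF width gamma and noise variance sigma2 are toy values.

```matlab
% Sketch: GP posterior prediction with an RBF covariance and observation noise sigma2.
X = (-1:0.25:1)';  y = sin(3*X) + 0.05*randn(size(X));   % toy training data
Xs = linspace(-1.5, 1.5, 100)';                          % test inputs x*
gamma = 10;  sigma2 = 0.01;
k = @(A, B) exp(-gamma/2 * (repmat(sum(A.^2, 2), 1, size(B, 1)) ...
        + repmat(sum(B.^2, 2)', size(A, 1), 1) - 2 * (A * B')));
Kxx = k(X, X) + sigma2 * eye(numel(X));
Ksx = k(Xs, X);
ymean = Ksx * (Kxx \ y);                                 % predictive mean
ycov  = k(Xs, Xs) + sigma2 * eye(numel(Xs)) - Ksx * (Kxx \ Ksx');   % predictive covariance
err = 2 * sqrt(diag(ycov));                              % two standard deviation error bars
plot(X, y, 'k+', Xs, ymean, 'b-', Xs, ymean + err, 'b--', Xs, ymean - err, 'b--');
```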

27 GP posterior. [Figure: samples from the GP posterior] 4 code/posterior sample.m

30 Regression. The regression problem: y_i = f(x_i) + ε, ε ~ N(0, σ²). How can we choose the covariance function? How do we choose the parameters of the covariance function?

31 GP training. Formulate the marginal likelihood p(y | X, Φ) = ∫ p(y | f, X, Φ) p(f | X, Φ) df, with p(f | X, Φ) = N(0, K). Find the parameters Φ that maximise the marginal likelihood: log p(y | X) = -(1/2) y^T (K + σ²I)^{-1} y [data fit] - (1/2) log det(K + σ²I) [complexity] - (N/2) log 2π.
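The marginal likelihood above can be evaluated directly once the covariance matrix is built. A sketch under an RBF-plus-noise assumption with toy data; the Cholesky factor is used for both the quadratic data-fit term and the log determinant.

```matlab
% Sketch: log marginal likelihood log p(y | X, Phi) for an RBF covariance plus noise.
X = (-1:0.25:1)';  y = sin(3*X) + 0.05*randn(size(X));   % toy data
gamma = 10;  sigma2 = 0.01;  N = numel(y);
s = sum(X.^2, 2);
K = exp(-gamma/2 * (repmat(s, 1, N) + repmat(s', N, 1) - 2*(X*X'))) + sigma2*eye(N);
L = chol(K, 'lower');
alpha = L' \ (L \ y);                                    % (K + sigma^2 I)^{-1} y
logml = -0.5*(y'*alpha) ...                              % data fit
        - sum(log(diag(L))) ...                          % complexity: -(1/2) log det(K + sigma^2 I)
        - 0.5*N*log(2*pi);                               % normalisation constant
```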

32 GP regression 5: learning kernel parameters. Can we determine length scales and noise levels from the data? (demOptimiseKern) [Figure: data fit and log-likelihood as a function of the length scale.] Objective: -(1/2) y^T (K + β⁻¹I)^{-1} y [data fit] - (1/2) log det(K + β⁻¹I) [complexity] - (N/2) log 2π. Model selection. 5 Images: N.D. Lawrence
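The demOptimiseKern demonstration in Lawrence's toolbox finds the kernel parameters by gradient-based optimisation of this objective; the same idea can be sketched crudely with a grid search over the RBF inverse width, everything else as in the previous snippet (toy data, fixed noise level).

```matlab
% Sketch: pick an RBF inverse width gamma by maximising the log marginal likelihood.
X = (-1:0.25:1)';  y = sin(3*X) + 0.05*randn(size(X));   % toy data
sigma2 = 0.01;  N = numel(y);
s = sum(X.^2, 2);
sq = repmat(s, 1, N) + repmat(s', N, 1) - 2*(X*X');      % pairwise squared distances
gammas = logspace(-1, 3, 50);
logml = zeros(size(gammas));
for i = 1:numel(gammas)
    K = exp(-gammas(i)/2 * sq) + sigma2*eye(N);
    L = chol(K, 'lower');
    logml(i) = -0.5*(y'*(L'\(L\y))) - sum(log(diag(L))) - 0.5*N*log(2*pi);
end
semilogx(gammas, logml);                                 % likelihood as a function of gamma
[best, idx] = max(logml);  bestGamma = gammas(idx);
```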

41 1 Introduction 2 Gaussian Processes 3 GP-LVM 4 Shared GP-LVM 5 Applications 6 Conclusion 7 References

42 Dimensionality Reduction. Un-raveling: assumes the manifold structure is preserved in the observed representation (MDS, PCA, Isomap, MVU 6, ...). Raveling: assumes the observed data are smoothly sampled from a low-dimensional manifold (PPCA, GTM, GP-LVM). 6 [Weinberger et al.(2004)Weinberger, Sha, and Saul]

43 GP-LVM 7. [Graphical model: latent X, parameters W, observations Y] PPCA: marginalise the latent locations, optimise the parameters; limited to linear relationships; closed-form solution. Dual PPCA: marginalise the parameters, optimise the latent locations; allows for non-linear relationships; no closed-form solution when non-linear. 7 [Lawrence(2005)]

44 GP-LVM. Observed data y_i in R^D generated from a latent variable x_i in R^q: y_i = f(x_i) + ε. Find the latent locations X and kernel parameters Φ maximising the marginal likelihood: {X̂, Φ̂} = argmax_{X,Φ} p(Y | X, Φ).
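A sketch of the GP-LVM objective evaluated for one candidate latent configuration X; the actual model optimises this jointly over X and Φ with gradient methods (e.g. scaled conjugate gradients in the referenced toolboxes), which is not shown. Y, X and the kernel parameters below are toy placeholders.

```matlab
% Sketch: GP-LVM log marginal likelihood log p(Y | X, Phi), RBF-plus-noise covariance.
N = 50;  D = 10;  q = 2;
Y = randn(N, D);  Y = Y - repmat(mean(Y, 1), N, 1);      % centred observations (toy)
X = randn(N, q);                                         % candidate latent locations
theta = 1;  gamma = 1;  sigma2 = 0.1;
s = sum(X.^2, 2);
K = theta*exp(-gamma/2*(repmat(s, 1, N) + repmat(s', N, 1) - 2*(X*X'))) + sigma2*eye(N);
L = chol(K, 'lower');
logp = -0.5*D*N*log(2*pi) ...
       - D*sum(log(diag(L))) ...                         % -(D/2) log det K
       - 0.5*sum(sum(Y .* (L'\(L\Y))));                  % -(1/2) tr(K^{-1} Y Y')
% maximising logp over X and Phi gives the GP-LVM latent locations and kernel parameters
```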

45 GP-LVM. Advantages: + correctly models the sampling process; + provides a mapping to the observed data; + associates uncertainty with latent and observed locations. Challenges: - non-convex optimisation problem for general covariance functions (initialisation of parameters); - computationally expensive; - manifold dimensionality is a free parameter.

46 Back-Constrained GP-LVM 8. The GP-LVM models the sampling process by a smooth function: points close in latent space are close in observed space, but smoothness from the observed space is not preserved. Constrain the latent coordinates to be given by a smooth mapping from the observed data, x_i = g(y_i, W), and indirectly optimise the latent locations X: {W, Φ} = argmax_{W,Φ} p(Y | W, Φ). 8 [Lawrence and Candela(2006)]

47 Dynamic GP-LVM 9. Learn a latent representation respecting the ordering of the observations. Auto-regressive function h: x_t = h(x_{t-1}) + ε_dyn, ε_dyn ~ N(0, σ²_dyn I). Place a GP prior over h and combine with the GP-LVM: {X̂, Φ̂_Y, Φ̂_dyn} = argmax_{X, Φ_Y, Φ_dyn} p(Y | X, Φ_Y) p(X | Φ_dyn). 9 [Wang et al.(2006)Wang, Fleet, and Hertzmann]

48 1 Introduction 2 Gaussian Processes 3 GP-LVM 4 Shared GP-LVM 5 Applications 6 Conclusion 7 References

49 Shared GP-LVM. Corresponding observations of the same underlying phenomenon. Examples: different language representations of a text; facial expressions and robot servos. Model both observations using a single latent representation; infer corresponding locations between the spaces.

50 Shared GP-LVM. [Graphical model: shared latent X generating Y and Z with noise ε_Y, ε_Z] Learn two separate kernels {Φ_Y, Φ_Z} from a single shared latent representation X. Objective: p(Y, Z | X, Φ_Y, Φ_Z) = p(Y | X, Φ_Y) p(Z | X, Φ_Z). Inference: see the appendix.
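Since Y and Z are conditionally independent given X, the shared GP-LVM objective is simply the sum of two GP-LVM log likelihoods evaluated at the same latent locations. A toy sketch (the two kernels and their parameters are illustrative, and no optimisation over X is performed here):

```matlab
% Sketch: shared GP-LVM objective log p(Y, Z | X) = log p(Y | X, PhiY) + log p(Z | X, PhiZ).
N = 50;  X = randn(N, 2);                                % shared latent locations (toy)
Y = randn(N, 10);  Z = randn(N, 5);                      % two observation spaces (toy)
Y = Y - repmat(mean(Y, 1), N, 1);  Z = Z - repmat(mean(Z, 1), N, 1);
gplvmLogp = @(Y, K) -0.5*size(Y, 2)*size(Y, 1)*log(2*pi) ...
    - size(Y, 2)*sum(log(diag(chol(K, 'lower')))) - 0.5*sum(sum(Y .* (K \ Y)));
s = sum(X.^2, 2);  sq = repmat(s, 1, N) + repmat(s', N, 1) - 2*(X*X');
KY = exp(-0.5*sq) + 0.1*eye(N);                          % covariance for Y (parameters PhiY)
KZ = exp(-2.5*sq) + 0.1*eye(N);                          % covariance for Z (parameters PhiZ)
objective = gplvmLogp(Y, KY) + gplvmLogp(Z, KZ);         % quantity maximised over X, PhiY, PhiZ
```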

51 Shared GP-LVM. The shared representation must represent the full variance of both observation spaces, i.e. a manifold alignment. It is not possible to align the manifolds when they are topologically different or when the shared variance is small relative to the full variance; this is not reflected by the objective function argmax_{X, Φ_Y, Φ_Z} p(Y, Z | X, Φ_Y, Φ_Z).

52 Shared GP-LVM Experiments 10. [Graphical model: shared latent X with dynamics, generating silhouette features Y and pose Z] Silhouette features: y_i in R^100; pose parameters: z_i in R^54. Multi-modal: the same silhouette could have been generated from several different poses, so it is NOT possible to model with a regression model. A generative model p(silhouette | pose) is hampered by 1) the dimensionality of the pose space and 2) the limited amount of training data. 10 [Ek et al.(2007)Ek, Torr, and Lawrence]

53 Shared GP-LVM Experiments. Most of the variance in the silhouette feature space is irrelevant for the pose.

55 Introduction. Manifold alignment: both observation spaces lie on manifolds of the same topology. New model: a subspace of each manifold shares the topology.

56 Assumptions. Observations y_i in Y and z_i in Z are generated from low-dimensional manifolds: y_ni = f_i^Y(u_n^Y) + ε_ni^Y, z_ni = f_i^Z(u_n^Z) + ε_ni^Z. Assume U^Y and U^Z share a non-zero subspace X^S: X^S ⊆ U^Y, X^S ⊆ U^Z, X^S ≠ 0.

57 Assumptions. Private spaces X^Y and X^Z complete the latent representation: y_ni = f_i^Y({x_n^S, x_n^Y}) + ε_ni^Y, z_ni = f_i^Z({x_n^S, x_n^Z}) + ε_ni^Z.

58 Assumptions. Shared subspace X^S: x_i^S = g^Y(y_i) = g^Z(z_i). Private subspaces X^Y and X^Z: x_i^Y = h^Y(y_i), x_i^Z = h^Z(z_i).

59 Graphical Model. [Graphical model: shared latent space X^S and private spaces X^Y, X^Z; mappings g and h from the observations to the latent spaces, mappings f^Y and f^Z to the observations Y and Z with noise ε_Y, ε_Z]

60 Canonical Correlation Analysis. Correlation: ρ_YZ = cov(Y, Z) / sqrt(var(Y) var(Z)). Canonical Correlation Analysis: find directions {W_Y, W_Z} in each observed space maximising the correlation of the canonical variates a_Y = Y W_Y and a_Z = Z W_Z. Solution through an eigenvalue problem (details and solution in the appendix).

61 Non-Consolidating Component Analysis. CCA explains the shared variance; Non-Consolidating Component Analysis finds the directions explaining the remaining variance: v̂_1^Y = argmax_{v_1^Y} (v_1^Y)^T cov(Y, Y) v_1^Y, subject to (v_1^Y)^T v_1^Y = 1 and (v_1^Y)^T W_Y = 0. Solution through an eigenvalue problem (details in the appendix).

62 Non-Consolidating Component Analysis. Successive directions are found through an additional eigenvalue problem (details in the appendix). Add directions until a sufficient amount of variance is explained by the basis. Summary: shared variance X^S from CCA; private variance X^Y, X^Z from NCCA.

63 Non-linear. Both algorithms can easily be kernelised: non-linearise through a kernel-induced feature space, Ψ_Y : Y → F_Y, Ψ_Z : Z → F_Z. Many kernels perform dimensionality expansion rather than reduction (e.g. RBF).

64 Non-linear. Sampled correlation: ρ_s = cov_s(y, z) / sqrt(var_s(y) var_s(z)) = {expand} = cos(y, z). The Gram matrix in the point-expanded feature space is close to full rank, so the feature spaces are effectively the same and CCA becomes trivial (details in the appendix).

65 Practical Non-Linearisation. 1) Represent each feature space by its dominant principal directions: this removes the trivial CCA solution and finds correlated directions explaining significant variance. 2) Apply CCA and NCCA in the reduced feature space (a sketch of step 1 follows below). Feature spaces: many possible choices, e.g. 1) linear kernel, 2) RBF, 3) Maximum Variance Unfolding, Isomap. The main interest is the topology of the latent space; for kernels not providing an explicit mapping, learn {g^{Y,Z}, h^{Y,Z}} by GP regression.
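Step 1 of the recipe can be sketched with a plain SVD; the 95% variance threshold and the random toy matrices are illustrative choices, not values from the talk.

```matlab
% Sketch: re-represent each observation space by its dominant principal directions.
N = 100;  Y = randn(N, 30);  Z = randn(N, 20);           % toy observations (rows = points)
Y = Y - repmat(mean(Y, 1), N, 1);  Z = Z - repmat(mean(Z, 1), N, 1);
[Uy, Sy, Vy] = svd(Y, 'econ');
[Uz, Sz, Vz] = svd(Z, 'econ');
ky = find(cumsum(diag(Sy).^2) / sum(diag(Sy).^2) >= 0.95, 1);   % keep 95% of the variance
kz = find(cumsum(diag(Sz).^2) / sum(diag(Sz).^2) >= 0.95, 1);
Yr = Y * Vy(:, 1:ky);  Zr = Z * Vz(:, 1:kz);             % reduced representations
% step 2: run CCA and NCCA on Yr and Zr (see the appendix sketches)
```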

66 Model Selection. Rank each embedding according to the generating function f: 1) data fit of f, 2) complexity of f, both encapsulated by the GP-LVM objective. Lawrence [Lawrence(2005)] suggested spectral initialisation of the latent locations; Harmeling [Harmeling(2007)] compared the GP-LVM likelihood with the Procrustes score to ground truth. If {g^{Y,Z}, h^{Y,Z}} have correctly unravelled the manifold, f^{Y,Z} should explain the data with a high likelihood. Select the embedding according to p(Y, Z | {X^S, X^Y, X^Z}, Φ_Y, Φ_Z).

67 Summary. 1) Map observations to feature space. 2) Re-represent each feature space by its dominant principal directions. 3) Extract shared directions using CCA. 4) Extract private directions using NCCA. 5) Choose the embedding maximising the GP-LVM likelihood. 6) Train GP regressors over the implicit mappings.

68 Illusion demo 11 DEMO: ILLUSION 11 code/demo illusion1.m

69 Illusion demo 12 DEMO: ILLUSION2 12 code/demo illusion2.m

70 Human Pose Estimation. [Graphical model: shared and private latent spaces with dynamics, generating silhouette features Y and pose Z] 1) Embed the observations using NCCA. 2) Learn a dynamic GP over the pose re-representation.

71 NCCA demo 13 DEMO: NCCA POSE 13 code/demo ncca.m

72 Conclusion. Introduced Gaussian Processes and the GP-LVM. Shared GP-LVM models for multiple observation spaces. Shared GP-LVM analogy to CCA. Application to multimodal regression.

73 eof.

74 References. C. H. Ek, P. H. Torr, and N. D. Lawrence. Gaussian process latent variable models for human pose estimation. In MLMI, 2007. S. Harmeling. Exploring model selection techniques for nonlinear dimensionality reduction. Technical Report EDI-INF-RR-0960, University of Edinburgh, 2007. N. D. Lawrence. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. JMLR, 6:1783-1816, 2005. N. D. Lawrence and J. Q. Candela. Local distance preservation in the GP-LVM through back constraints. In ICML, 2006. C. E. Rasmussen and C. K. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006. J. M. Wang, D. J. Fleet, and A. Hertzmann. Gaussian process dynamical models. In Y. Weiss, B. Schölkopf, and J. C. Platt, editors, NIPS, volume 18, Cambridge, MA, 2006. MIT Press.

75 K. Q. Weinberger, F. Sha, and L. K. Saul. Learning a kernel matrix for nonlinear dimensionality reduction. In R. Greiner and D. Schuurmans, editors, ICML, volume 21. Omnipress, 2004.

76 8 Appendix

77 GP Prediction. Predictive equations: y* ~ N(ȳ*, cov(y*)), with ȳ* = k(x*, X) (k(X, X) + σ²I)^{-1} y and cov(y*) = k(x*, x*) + σ² - k(x*, X) (k(X, X) + σ²I)^{-1} k(X, x*). Linear predictor: y* = Σ_{i=1}^N α_i k(x_i, x*), with α = (K(X, X) + σ²I)^{-1} y. Return
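The linear-predictor form is convenient when many test points are queried, since the weights α are computed once from the training data. A sketch under the same toy RBF setup as before:

```matlab
% Sketch: GP prediction via precomputed weights alpha = (K(X,X) + sigma2*I)^{-1} y.
X = (-1:0.25:1)';  y = sin(3*X) + 0.05*randn(size(X));   % toy training data
gamma = 10;  sigma2 = 0.01;  N = numel(y);
s = sum(X.^2, 2);
K = exp(-gamma/2*(repmat(s, 1, N) + repmat(s', N, 1) - 2*(X*X')));
alpha = (K + sigma2*eye(N)) \ y;                         % computed once, reused for any x*
xs = 0.3;                                                % a single test input
ks = exp(-gamma/2 * (X - xs).^2);                        % k(x_i, x*) for each training input
ystar = ks' * alpha;                                     % predictive mean at x*
```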

78 Shared GP-LVM: Inference. Infer the location of z* given the corresponding y*: x̂* = argmax_{x*} p(y* | x*, X, Φ_Y), ẑ* = f^Z(x̂*). Return

79 CCA. Correlation: ρ = tr(W_Y^T Y^T Z W_Z) / ( tr(W_Z^T Z^T Z W_Z) tr(W_Y^T Y^T Y W_Y) )^{1/2}. CCA: {W_Y, W_Z} = argmax_{W_Y, W_Z} tr(W_Y^T Y^T Z W_Z), subject to W_Y^T Y^T Y W_Y = I and W_Z^T Z^T Z W_Z = I. Return

80 CCA. CCA solution: (Y^T Y)^{-1} Y^T Z (Z^T Z)^{-1} Z^T Y w_y^i = λ_i² w_y^i and (Z^T Z)^{-1} Z^T Y (Y^T Y)^{-1} Y^T Z w_z^i = λ_i² w_z^i. Return
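The two eigenvalue problems above can be solved directly; a sketch on toy data with a small ridge term added for numerical stability (the ridge, the data and the ordering by eigenvalue are implementation choices, not part of the talk).

```matlab
% Sketch: CCA directions from (Y'Y)^{-1} Y'Z (Z'Z)^{-1} Z'Y w = lambda^2 w.
N = 200;  shared = randn(N, 2);                          % signal common to both spaces (toy)
Y = [shared, randn(N, 3)] * randn(5, 5);
Z = [shared, randn(N, 2)] * randn(4, 4);
Y = Y - repmat(mean(Y, 1), N, 1);  Z = Z - repmat(mean(Z, 1), N, 1);
ry = 1e-6*eye(size(Y, 2));  rz = 1e-6*eye(size(Z, 2));   % small ridge for stability
My = (Y'*Y + ry) \ (Y'*Z) * ((Z'*Z + rz) \ (Z'*Y));
[Wy, Dy] = eig(My);                                      % eigenvalues = squared canonical correlations
[rho2, order] = sort(real(diag(Dy)), 'descend');
Wy = real(Wy(:, order));                                 % canonical directions for Y
Wz = (Z'*Z + rz) \ ((Z'*Y) * Wy);                        % corresponding directions for Z (up to scale)
```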

81 CCA Geometry. Sampled correlation: ρ_s = cov_s(y, z) / sqrt(var_s(y) var_s(z)). Using cov_s(y, z) = y^T z and var_s(y) = y^T y = ||y||², this becomes ρ_s = y^T z / (||y|| ||z||); and since y^T z = ||y|| ||z|| cos(y, z), we get ρ_s = cos(y, z). Return

82 CCA Trivial Solution. Kernel trick: map points to feature space, Ψ_Y : Y → F_Y, Ψ_Z : Z → F_Z. Reduced feature-space representation: the subspace spanned by the training data, R^N; an arbitrary vector is v = Σ_{i=1}^N Ψ(y_i)^T α_i. Canonical variate: a = Ψ_Y Ψ_Y^T α = K_Y α, and rank(K_Y) = rank(Ψ_Y) = {rank-nullity theorem} = dim(im(Ψ_Y)). If K is full rank, each point is explained by a separate feature, giving perfect correlation by alignment. Return

83 NCCA. NCCA solution: (cov(Y, Y) - W_Y W_Y^T cov(Y, Y)) v_1 = λ_1 v_1. Successive directions, for the k-th: (cov(Y, Y) - (W_Y W_Y^T + Σ_{i=1}^{k-1} v_i v_i^T) cov(Y, Y)) v_k = λ_k v_k. Return
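A sketch of extracting the NCCA directions. Instead of the non-symmetric eigenproblem written above, it uses the equivalent route of repeatedly diagonalising the covariance projected away from the canonical directions (and from the NCCA directions already found); the toy data and the stand-in Wy are illustrative.

```matlab
% Sketch: NCCA directions, i.e. directions of variance in Y not explained by the CCA basis Wy.
N = 200;  d = 5;
Y = randn(N, d);  Y = Y - repmat(mean(Y, 1), N, 1);      % toy centred observations
Wy = orth(randn(d, 2));                                  % stand-in for orthonormalised CCA directions
C = (Y' * Y) / N;                                        % sample covariance cov(Y, Y)
V = [];                                                  % NCCA directions found so far
for k = 1:(d - size(Wy, 2))
    B = [Wy, V];
    P = eye(d) - B * B';                                 % projector removing Wy and earlier directions
    [vecs, vals] = eig(P * C * P);                       % deflated covariance
    [ignore, idx] = max(real(diag(vals)));
    V = [V, real(vecs(:, idx))];                         % next direction of remaining variance
end
```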
