The Gaussian Process Latent Variable Model with Cox Regression

James Barrett
Institute for Mathematical and Molecular Biomedicine (IMMB), King's College London
Layout
1 Introduction
2 GPLVM
3 Results
4 Extension to Cox proportional hazards model
Integrating Multiple Data Sources via the GPLVM

The Gaussian process latent variable model (Lawrence, 2005) is a flexible non-parametric probabilistic dimensionality reduction method. We want to:
- Represent each dataset in terms of latent variables.
- Extract information common to each data source.
- Retain information unique to each source.
Also:
- Account for dimension mismatch between multiple datasets.
- Detect any intrinsic low-dimensional structure.
Model Definition

Observe $S$ datasets $Y_1 \in \mathbb{R}^{N \times d_1}, \ldots, Y_S \in \mathbb{R}^{N \times d_S}$. It is assumed each column of $Y_s$ is normalised to zero mean and unit variance. Represent these data in terms of $q$ latent variables $x_i \in \mathbb{R}^q$, where $q < \min_s(d_s)$. For individual $i$ and covariate $\mu$ in source $s$ we write
$$y^s_{i\mu} = \sum_{m=1}^{M} w^s_{\mu m}\,\varphi^s_m(x_i) + \xi^s_{i\mu}$$
where
- $\varphi^s_m : \mathbb{R}^q \to \mathbb{R}$, $m = 1, \ldots, M$, are non-linear mappings that may depend on hyperparameters $\varphi^s$
- $w^s_{\mu m}$ are mapping coefficients
- $\xi^s_{i\mu}$ are noise variables.
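The generative model above can be sketched numerically. This is a minimal illustration, not the paper's implementation: it takes the simplest special case of linear mappings $\varphi_m(x) = x_m$ (so $M = q$), and all sizes and seeds are arbitrary choices for demonstration.

```python
import numpy as np

# Sketch of the slide's generative model with linear mappings phi_m(x) = x_m
# (M = q); sizes, precisions, and the random seed are illustrative.
rng = np.random.default_rng(0)
N, q = 50, 2          # samples, latent dimension
dims = [10, 20]       # d_s for S = 2 sources; note q < min_s(d_s)
beta = [10.0, 10.0]   # noise precisions beta_s

X = rng.standard_normal((N, q))               # latent variables x_i in R^q
Y = []
for d_s, b_s in zip(dims, beta):
    W_s = rng.standard_normal((d_s, q))       # mapping coefficients w_{mu m}
    noise = rng.standard_normal((N, d_s)) / np.sqrt(b_s)  # xi ~ N(0, 1/beta_s)
    Y_s = X @ W_s.T + noise                   # y^s = sum_m w_{mu m} x_m + xi
    # normalise each column to zero mean, unit variance (as the model assumes)
    Y_s = (Y_s - Y_s.mean(axis=0)) / Y_s.std(axis=0)
    Y.append(Y_s)
```
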
Data Likelihood

Assume Gaussian priors for $p(w^s)$ and $p(\xi^s \mid \beta_s)$ with zero mean and covariances given by
$$\langle w^s_{\mu m} w^{s'}_{\nu n}\rangle = \delta_{ss'}\,\delta_{\mu\nu}\,\delta_{mn} \qquad \text{and} \qquad \langle \xi^s_{i\mu}\,\xi^{s'}_{j\nu}\rangle = \beta_s^{-1}\,\delta_{ss'}\,\delta_{ij}\,\delta_{\mu\nu}.$$
For notational simplicity we define $\beta = \{\beta_1, \ldots, \beta_S\}$, $\Phi = \{\varphi^1, \ldots, \varphi^S\}$, $W = \{W_1, \ldots, W_S\}$, $\xi = \{\xi^1, \ldots, \xi^S\}$ and $Y = \{Y_1, \ldots, Y_S\}$. The data likelihood factorises over samples:
$$p(Y \mid X, W, \xi, \beta, \Phi) = \prod_{i=1}^{N} p(y_i \mid x_i, W, \xi_i, \beta, \Phi)$$
Marginalising over $W$ and $\xi$ we obtain a Gaussian distribution for $Y$ with mean $\langle y_{i\mu}\rangle = 0$ and covariance
$$\langle y^s_{i\mu}\, y^{s'}_{j\nu}\rangle = \delta_{ss'}\,\delta_{\mu\nu}\left(\sum_m \varphi^s_m(x_i)\,\varphi^s_m(x_j) + \beta_s^{-1}\delta_{ij}\right) = \delta_{ss'}\,\delta_{\mu\nu}\, K_s(x_i, x_j)$$
The data likelihood can then be written as
$$p(Y \mid X, \beta, \Phi) = \prod_{s=1}^{S} \prod_{\mu=1}^{d_s} \frac{e^{-\frac{1}{2}\, (y^s_{:,\mu})^{\mathrm{T}} K_s^{-1}\, y^s_{:,\mu}}}{(2\pi)^{N/2}\, |K_s|^{1/2}}$$
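The factorised Gaussian likelihood above is straightforward to evaluate. The sketch below assumes a linear kernel, $K_s = X X^{\mathrm T} + \beta_s^{-1} I$ (the marginalised covariance when $\varphi_m(x) = x_m$); the function name and signature are illustrative.

```python
import numpy as np

def log_likelihood(X, Y_list, beta_list):
    """Log of p(Y | X, beta, Phi) for a linear kernel K_s = X X^T + I/beta_s.

    Each source s contributes d_s Gaussian terms (one per column of Y_s)
    sharing the same N x N kernel matrix K_s, exactly as in the slide's
    factorised likelihood.
    """
    N = X.shape[0]
    total = 0.0
    for Y_s, b_s in zip(Y_list, beta_list):
        K_s = X @ X.T + np.eye(N) / b_s
        _, logdet = np.linalg.slogdet(K_s)
        Kinv_Y = np.linalg.solve(K_s, Y_s)          # K_s^{-1} y for all columns
        d_s = Y_s.shape[1]
        # sum over columns mu of -1/2 y^T K^{-1} y, minus the normalisation
        total += -0.5 * np.sum(Y_s * Kinv_Y) \
                 - 0.5 * d_s * (logdet + N * np.log(2.0 * np.pi))
    return total
```
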
Bayesian Inference

We specify three levels of uncertainty:
- Microscopic parameters: $\{X\}$
- Hyperparameters: $\{\beta, \Phi\}$
- Models: $H = \{q, \varphi_m\}$
The posterior distributions are
$$p(X \mid Y, \beta, \Phi, H) = \frac{p(Y \mid X, \beta, \Phi, H)\, p(X \mid H)}{\int \mathrm{d}X\; p(Y \mid X, \beta, \Phi, H)\, p(X \mid H)}$$
$$p(\beta, \Phi \mid Y, H) = \frac{p(Y \mid \beta, \Phi, H)\, p(\beta, \Phi \mid H)}{\int \mathrm{d}\beta\, \mathrm{d}\Phi\; p(Y \mid \beta, \Phi, H)\, p(\beta, \Phi \mid H)}$$
$$P(H \mid Y) = \frac{p(Y \mid H)\, P(H)}{\sum_{H'} p(Y \mid H')\, P(H')}$$
where
$$p(Y \mid \beta, \Phi, H) = \int \mathrm{d}X\; p(Y \mid X, \beta, \Phi, H)\, p(X \mid H)$$
$$p(Y \mid H) = \int \mathrm{d}\beta\, \mathrm{d}\Phi\; p(Y \mid \beta, \Phi, H)\, p(\beta, \Phi \mid H)$$
Inferring Latent Variables

To find the optimal latent variable representation $X^*$ we numerically minimise the negative log-likelihood of $p(X \mid Y, \beta, \Phi, H)$:
$$\mathcal{L}_X(X; \beta, \Phi) = \sum_s \left[\frac{d_s}{2N}\,\mathrm{tr}(K_s^{-1} S_s) + \frac{d_s}{2N}\log|K_s| + \frac{d_s}{2}\log 2\pi\right]$$
where $S_s = \frac{1}{d_s} Y_s Y_s^{\mathrm{T}}$. Should we rescale the contribution from each source by $d_{\mathrm{tot}}/d_s$, where $d_{\mathrm{tot}} = \sum_s d_s$?
Expanding to second order around the minimum $X^*$,
$$N\mathcal{L}_X(X; \beta, \Phi) \approx N\mathcal{L}_X(X^*; \beta, \Phi) + \frac{1}{2}\sum_{i,j}^{N}\sum_{\mu,\nu}^{q} (x_{i\mu} - x^*_{i\mu})(x_{j\nu} - x^*_{j\nu})\, A_{i\mu,j\nu}$$
where
$$A_{i\mu,j\nu} = N\,\frac{\partial^2 \mathcal{L}_X(X; \beta, \Phi)}{\partial x_{i\mu}\,\partial x_{j\nu}}\bigg|_{X = X^*}$$
This gives a Laplace approximation to the marginal likelihood:
$$p(Y \mid \beta, \Phi, H) = \int \mathrm{d}X\; e^{-N\mathcal{L}_X(X;\beta,\Phi)} \approx p(Y \mid X^*, \beta, \Phi, H)\int \mathrm{d}X\; e^{-\frac{1}{2}\sum_{ij}\sum_{\mu\nu}(x_{i\mu}-x^*_{i\mu})(x_{j\nu}-x^*_{j\nu})A_{i\mu,j\nu}} = p(Y \mid X^*, \beta, \Phi, H)\,(2\pi)^{Nq/2}\,|A(X^*, \beta, \Phi)|^{-1/2}$$
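The numerical minimisation of $\mathcal{L}_X$ can be sketched with an off-the-shelf optimiser. This is an illustrative toy, not the paper's implementation: it again assumes a linear kernel $K_s = X X^{\mathrm T} + \beta_s^{-1} I$, drops the constant $\log 2\pi$ term, and uses generic L-BFGS-B rather than any particular scheme.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_lik(X_flat, Y_list, beta_list, N, q):
    """L_X from the slide, up to additive constants, for a linear kernel."""
    X = X_flat.reshape(N, q)
    val = 0.0
    for Y_s, b_s in zip(Y_list, beta_list):
        d_s = Y_s.shape[1]
        S_s = Y_s @ Y_s.T / d_s                      # S_s = Y_s Y_s^T / d_s
        K_s = X @ X.T + np.eye(N) / b_s
        _, logdet = np.linalg.slogdet(K_s)
        # (d_s / 2N) [ tr(K_s^{-1} S_s) + log|K_s| ]
        val += (d_s / (2.0 * N)) * (np.trace(np.linalg.solve(K_s, S_s)) + logdet)
    return val

# usage sketch (illustrative sizes and seed)
rng = np.random.default_rng(1)
N, q = 20, 2
Y_list = [rng.standard_normal((N, 5))]
x0 = rng.standard_normal(N * q)                      # random starting point
res = minimize(neg_log_lik, x0, args=(Y_list, [1.0], N, q), method="L-BFGS-B")
X_star = res.x.reshape(N, q)                         # optimal representation X*
```
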
Invariance under Unitary Transformations

The kernel functions considered here are all invariant under arbitrary unitary transformations. Let $U$ be a unitary matrix, such that $U^{\mathrm{T}}U = UU^{\mathrm{T}} = I$, and let $\tilde{x} = Ux$. Then
$$\tilde{x}_i \cdot \tilde{x}_j = x_i^{\mathrm{T}} U^{\mathrm{T}} U x_j = x_i \cdot x_j \qquad \text{and} \qquad \|\tilde{x}_i - \tilde{x}_j\|^2 = (x_i - x_j)^{\mathrm{T}} U^{\mathrm{T}} U (x_i - x_j) = \|x_i - x_j\|^2.$$
This invariance under unitary transformations induces symmetries in the posterior search space of $X \in \mathbb{R}^{N \times q}$. We fix the representation with
$$X = \begin{pmatrix} x_{11} & 0 & 0 & 0 \\ x_{21} & x_{22} & 0 & 0 \\ x_{31} & x_{32} & x_{33} & 0 \\ x_{41} & x_{42} & x_{43} & x_{44} \\ \vdots & & & \end{pmatrix}$$
This pins down the latent variables, and we optimise over the $Nq - (q^2 - q)/2$ non-zero entries.
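The invariance claim above is easy to verify numerically: rotating every latent point by the same orthogonal matrix leaves both dot products and pairwise distances, and hence any kernel built from them, unchanged. A small check (sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N, q = 6, 3
X = rng.standard_normal((N, q))

# random orthogonal U via QR decomposition, so U^T U = U U^T = I
U, _ = np.linalg.qr(rng.standard_normal((q, q)))
X_rot = X @ U.T                        # x_tilde_i = U x_i

K = X @ X.T                            # matrix of dot products x_i . x_j
K_rot = X_rot @ X_rot.T
assert np.allclose(K, K_rot)           # dot products are invariant

D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)        # squared distances
D_rot = ((X_rot[:, None, :] - X_rot[None, :, :]) ** 2).sum(-1)
assert np.allclose(D, D_rot)           # distances are invariant too
```
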
Layout 1 Introduction 2 GPLVM 3 Results 4 Extension to Cox proportional hazards model James Barrett (IMMB, KCL) GP Latent Variable Model 10 / 19
Synthetic Data

[Figure: (a) true latent variables and (b) retrieved latent variables, plotted in the $(q_1, q_2)$ plane.]

We can define three ad hoc error measures:
$$E_{\mathrm{radial}} = \frac{1}{|C|}\sum_{i \in C}\frac{|r_i - r_C|}{r_C}, \qquad E_{\mathrm{angular}} = \frac{1}{|C|}\sum_{i \in C}\frac{|\theta_i - \theta_C|}{\theta_C}, \qquad E_{\mathrm{linear}} = \frac{SS_{\mathrm{err}}}{SS_{\mathrm{tot}}}$$
where $SS_{\mathrm{err}} = \sum_i (x_{i2} - \alpha x_{i1})^2$ and $SS_{\mathrm{tot}} = \sum_i (x_{i2} - \bar{x}_{2})^2$.
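The three error measures can be sketched in code. This follows our reading of the slide's definitions (a cluster $C$ with reference radius $r_C$ and reference angle $\theta_C$, and a reference line $x_2 = \alpha x_1$); the function names are illustrative.

```python
import numpy as np

def radial_error(X_C, r_C):
    """Mean relative deviation of cluster radii from the reference radius r_C."""
    r = np.linalg.norm(X_C, axis=1)
    return np.mean(np.abs(r - r_C)) / r_C

def angular_error(X_C, theta_C):
    """Mean relative deviation of cluster angles from the reference angle theta_C."""
    theta = np.arctan2(X_C[:, 1], X_C[:, 0])
    return np.mean(np.abs(theta - theta_C)) / abs(theta_C)

def linear_error(X, alpha):
    """SS_err / SS_tot for deviations from the reference line x2 = alpha * x1."""
    ss_err = np.sum((X[:, 1] - alpha * X[:, 0]) ** 2)
    ss_tot = np.sum((X[:, 1] - X[:, 1].mean()) ** 2)
    return ss_err / ss_tot
```
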
Dependence on β and d

(a) Dependence on β:
β      E_radial   E_angular   E_linear
0.1    0.0060     0.0046      0.0079
0.5    0.0766     0.0813      0.2577
1.0    0.0998     0.1701      0.3263

(b) Dependence on d:
d      E_radial   E_angular   E_linear
10     0.0944     0.0454      0.5491
100    0.0061     0.0051      0.0108
1000   0.0004     0.0008      0.0016

Table: (a) The magnitude of the errors increases as more noise is added (for fixed d) to the synthetic data. (b) For fixed noise levels, the greater d is, the better the extraction of the true low-dimensional structure from a dataset.
Dimensionality Detection

[Figure: minimum of $L_{\mathrm{hyp}}$ against the number of latent variables $q = 1, \ldots, 6$, for the linear and polynomial kernels.]

Figure: Plot of the minimal values of $L_{\mathrm{hyp}}$ obtained for different values of $q$ and two different kernels, the linear kernel and the polynomial kernel. Both kernel types detect that $q = 2$ is the optimal dimension. Furthermore, the model can distinguish that the linear kernel offers the best explanation of the data in this case.
Integration of Two Sources

[Figure: minimum of $L_{\mathrm{hyp}}$ against the number of latent variables $q = 2, \ldots, 10$ for (a) the linear kernel, (b) the polynomial kernel, and (c) the combined model.]
Classification Experiment

We generated N = 100 samples with q = 2:
- 50 samples from a Gaussian with unit variance and mean (1, 1), with class +1.
- 50 samples from unit-variance Gaussians with means (−1/2, 1/2) and (1/2, −1/2), with class −1.
- Projected into d = 100 space with a linear mapping.

                      Y (d = 100)   X (q = 2)
Training success      86.3%         86.6%
Validation success    74.0%         83.0%

We repeated this 300 times. The mean improvement between the validation success on Y and the success on X was found to be 8.7%, with a standard deviation of 4.8%.
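The data-generation step of this experiment can be sketched as follows. Sizes and class means are taken from the slide; the random projection matrix and the seed are illustrative choices (the slide does not specify how the linear mapping was drawn).

```python
import numpy as np

rng = np.random.default_rng(0)
n_per = 50

# class +1: unit-variance Gaussian with mean (1, 1)
X_pos = rng.standard_normal((n_per, 2)) + np.array([1.0, 1.0])

# class -1: mixture of two unit-variance Gaussians
X_neg = np.vstack([
    rng.standard_normal((n_per // 2, 2)) + np.array([-0.5,  0.5]),
    rng.standard_normal((n_per // 2, 2)) + np.array([ 0.5, -0.5]),
])

X = np.vstack([X_pos, X_neg])             # N = 100 samples, q = 2
labels = np.array([1] * n_per + [-1] * n_per)

A = rng.standard_normal((100, 2))         # linear mapping to d = 100 (assumed random)
Y = X @ A.T                               # high-dimensional observations
```
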
What is Survival Analysis?

Suppose we have a group of N cancer patients. For each individual $i$ we measure:
- The time $\tau_i \geq 0$ until an event of interest occurs, for example the time to metastasis.
- A vector of covariates (also called features or input variables) $x_i \in \mathbb{R}^d$.
We will assume one risk and use an indicator variable
$$\Delta_i = \begin{cases} 0 & \text{if } i \text{ is censored} \\ 1 & \text{if the primary risk occurs} \end{cases}$$

[Figure: event time $\tau$ (years) against covariate $x$, distinguishing primary events from censoring events.]

Aim: to extract any statistical relationship between $x$ and $\tau$ for each risk.
Challenges:
- How can we incorporate information from censored individuals?
- How can we deal with non-negative outputs?
Linking Survival Data to Latent Variables

For each patient we observe covariates $y$, time to event $t$, and event type $\Delta$. Then
$$p(X \mid Y, t) \propto p(Y, t \mid X)\, p(X) = \underbrace{p(Y \mid X)}_{\text{GPLVM}}\; \underbrace{p(t \mid X)}_{\text{Cox}}\; p(X)$$

[Diagram: covariates $Y$ and survival data $(T, \Delta)$ both generated from the latent variables $X$.]

where, as above,
$$p(Y \mid X) = \prod_{\mu=1}^{d} \frac{e^{-\frac{1}{2}\, y_{:,\mu}^{\mathrm{T}} K^{-1} y_{:,\mu}}}{(2\pi)^{N/2}\,|K|^{1/2}}$$
and for the Cox model
$$p(t \mid X) = \prod_{i=1}^{N} \left[\lambda_0(t_i)\, e^{\beta \cdot x_i}\right]^{\Delta_i} e^{-e^{\beta \cdot x_i}\, \Lambda_0(t_i)}$$
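The Cox term above can be sketched in code. This is a simplified illustration, not the paper's estimator: it assumes a constant baseline hazard $\lambda_0(t) = \lambda_0$ (so $\Lambda_0(t) = \lambda_0 t$), whereas in practice the baseline hazard is estimated; the function name and signature are ours.

```python
import numpy as np

def cox_log_likelihood(t, delta, X, beta, lam0=1.0):
    """Log of the slide's Cox likelihood with a constant baseline hazard.

    t     : event/censoring times, shape (N,)
    delta : 1 if the primary event occurred, 0 if censored
    X     : covariates (here, latent variables), shape (N, q)
    beta  : regression coefficients, shape (q,)
    lam0  : assumed constant baseline hazard lambda_0
    """
    eta = X @ beta                          # linear predictor beta . x_i
    log_hazard = np.log(lam0) + eta         # log[ lambda_0(t_i) e^{beta.x_i} ]
    cum_hazard = np.exp(eta) * lam0 * t     # e^{beta.x_i} Lambda_0(t_i)
    # censored individuals (delta=0) contribute only the survival term
    return np.sum(delta * log_hazard - cum_hazard)
```

Note how censoring is handled: a censored patient contributes only the probability of surviving past $t_i$, which is how the model uses information from censored individuals.
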
Predictions

If we observe a new patient with covariates $y$, we predict the corresponding event time $t$ via
$$y \xrightarrow{\text{GPLVM}} x \xrightarrow{\text{Cox}} t$$
We can also use the Cox model to generate survival curves:

[Figure: true survival curves compared with (left) survival curves retrieved from the low-dimensional representation and (right) survival curves obtained from the high-dimensional data, over 0–10 years.]