Modelling gene expression dynamics with Gaussian processes Regulatory Genomics and Epigenomics March th 6 Magnus Rattray Faculty of Life Sciences University of Manchester
Talk Outline Introduction to Gaussian process regression Example. Hierarchical models: batches and clusters Example. Branching models: perturbations and bifurcations Example 3. Differential equations: Pol-II to mrna dynamics
Gaussian processes: flexible non-parametric models Probability distributions over functions f (t) GP ( mean(t), cov(t, t ) ) Covariance function k = cov(t, t ) defines typical properties, Static... Dynamic Smooth... Rough Stationary... non-stationary Periodic... Chaotic The covariance function has parameters tuning these properties Bayesian Machine Learning perspective: Rasmussen & Williams Gaussian Processes for Machine Learning (MIT Press, 6)
Gaussian processes Samples from a 5-dimensional multivariate Gaussian distribution:.5 5.8 f n.5 m.6 5.4.5 5 5 5 n 5 5 5 5 n. [f, f,..., f 5 ] N (, C) Learning and Inference in Computational Systems Biology, MIT Press
Gaussian processes Take dimension.5.5.5 3 3.5 3.5.5 f GP(, k) k ( t, t ) = exp 3 3 ( 3 (t t ) l ) Learning and Inference in Computational Systems Biology, MIT Press
Gaussian processes for inference: Bayesian Regression..5..5..5..5. Figures from James Hensman
Regression example..5..5..5..5. Figures from James Hensman
Regression example..5..5..5..5. Figures from James Hensman
Regression example..5..5..5..5. Figures from James Hensman
Ex. Hierarchical models: batches and clusters 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 Data from Kalinka et al. Gene expression divergence recapitulates the developmental hourglass model Nature Joint work with James Hensman and Neil Lawrence
Usual processing options for time course batches Lumped Averaged 5 5 5 5
Hierarchical Gaussian process gene: g(t) GP (, k g (t, t ) ) replicate: f i (t) GP ( g(t), k f (t, t ) )
Hierarchical Gaussian process 3 J. Hensman, N.D. Lawrence, M.Rattray Hierarchical Bayesian modelling of gene expression time series across irregularly sampled replicates and clusters BMC Bioinformatics 3
Hierarchical Gaussian process for clustering An extended hierarchy ( ) h(t) GP, k h (t, t ) cluster ( ) g i (t) GP h(t), k g (t, t ) gene ( ) f ir (t) GP g i (t), k f (t, t ) replicate
Hierarchical Gaussian process for clustering CG786 CG73 Normalised gene expression CG396 CG367 CG4386 Osi5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 Time (hours)
Hierarchical Gaussian process for clustering Modifying an existing algorithm to include this model of replicate and cluster structure leads to more meaningful clustering MF BP CC L N. clust. agglomerative HGP.46.6.5 736.8 5 agglomerative GP.39.3.36 63.7 8 Mclust (concat.).39.7.5 34. 6 Mclust (averaged).4.8.4-736. Variational Bayes algorithm is more efficient, allowing Bayesian clustering of >K profiles with a Dirichlet Process prior J. Hensman, M.Rattray, N.D. Lawrence Fast non-parametric clustering of time-series data IEEE TPAMI 5
Clustering with a periodic covariance
Ex. Branching models: perturbations and bifurcations 5 4 Simulated control data Simulated perturbed data Simulated gene expression 3 3 4 5 5 Time (hrs) Joint work with Jing Yang, Chris Penfold and Murray Grant
Samples from a branching model Sample
Samples from a branching model Sample
Samples from a branching model Sample 3
Samples from a branching model Sample 4
Samples from a branching model Sample 5
Joint distribution to two functions crossing at t p f GP(, K), g GP(, K), g(t p ) = f (t p ) ( ) Kff K Σ = fg = K gf K gg ( K(T,t K(T, T) p)k(t p,t) ) k(t p,t p) K(T,t p)k(t p,t) k(t p,t p) K(T, T) ()
Joint distribution of two datasets diverging at t p y c (t n ) N (f (t n ), σ ) y p (t n ) N (f (t n ), σ ) for t n t p y p (t n ) N (g(t n ), σ ) for t n > t p
Posterior probability of the perturbation time t p p(t p y c (T), y p (T)) p(y c (T), y p (T) t p ) t=tmax t=t min p(y c (T), y p (T) t) 5 4 Simulated control data Simulated perturbed data Simulated gene expression 3 3 4 5 5 Time (hrs)
Comparison to DE thresholding approach Alternative: use first point where some DE threshold is passed We tested several DE metrics in the nsgp package DEtime nsgp Mean EMLL.5 Estimated perturbation times 5 5 Median MAP EMLL 5 5 5 5 Testing perturbation times
Investigating a plant s response to bacterial challenge Infection with virulent Pseudomonas syringage pv. tomato DC3 vs. disarmed strain DC3hrpA 3 CATMAc74 Condition Condition CATMAa9335 Condition Condition Gene expression 9 8 7 Times (hrs) 7 Times (hrs) Yang et al. Inferring the perturbation time from biological time course data Bioinformatics (accepted) preprint: arxiv 6.743
Current work: modelling branching in single-cell data 4 8 6 3 64 start 4 35 3 5 5 5 3 4 5 6 7 with A. Boukouvalas, M. Zweissele, J. Hensman and N. Lawrence
Ex 3. Linking Pol-II activity to mrna profiles a b Pol II mrna Pol II ENSG63659 (TIPARP) 6 4.8 c.6 d e f mrna (FPKM).4 5. 4 8 6 3 648 5 Delay (min) 4 8 6 3 64 8 4 8 6 3 64 8 Joint work with Antti Honkela, Jaakko Peltonen, Neil Lawrence
Linking Pol-II activity to mrna profiles dm(t) dt = βp(t ) αm(t) m(t) is mrna concentration (RNA-Seq data) p(t) is mrna production rate (3 pol-ii ChIP-Seq data) α is degradation rate (mrna half-life t / = /α) is processing delay
Linking Pol-II activity to mrna profiles dm(t) dt = βp(t ) αm(t) m(t) is mrna concentration (RNA-Seq data) p(t) is mrna production rate (3 pol-ii ChIP-Seq data) α is degradation rate (mrna half-life t / = /α) is processing delay We model p(t) GP(, k p ) as a Gaussian process (GP) Likelihood can be worked out exactly Bayesian MCMC used to estimate parameters α, β, and GP covariance () and noise variance () parameters
Example fits Pol II mrna a: Early pol-ii, no production delay a b ENSG66483 (WEE) 4 c d Pol II.8 e.6 f 4 8 6 3 64 8 4 8 6 3 64 8 mrna (FPKM).4 4. 4 8 6 3 648 5 Delay (min)
Example fits Pol II mrna b: Early pol-ii, delayed production a b ENSG9549 (SNAI) 3 c d Pol II.8 e.6 f 4 8 6 3 64 8 4 8 6 3 64 8 mrna (FPKM) 8.4 6 4. 4 8 6 3 648 5 Delay (min)
Example fits Pol II mrna c: Later pol-ii, delayed production a b ENSG63659 (TIPARP) 6 c d Pol II 4.8 e.6 f 4 8 6 3 64 8 4 8 6 3 64 8 mrna (FPKM).4 5. 4 8 6 3 648 5 Delay (min)
Example fits Pol II mrna d: Late pol-ii, no delay a b ENSG6949 (CAPN3) c d e Pol II.5.8.6 f 4 8 6 3 64 8 4 8 6 3 64 8 mrna (FPKM) 5.4. 5 4 8 6 3 648 5 Delay (min)
Example fits Pol II mrna e: Late pol-ii, no delay a b ENSG798 (ECE) c d e Pol II.5.8.6 f 4 8 6 3 64 8 4 8 6 3 64 8 mrna (FPKM).4 4. 4 8 6 3 648 5 Delay (min)
Example fits Pol II mrna f: Late pol-ii, delayed production a b ENSG86 (AEN) 3 c d Pol II.8 e.6 f 4 8 6 3 64 8 4 8 6 3 64 8 mrna (FPKM).4 3. 4 8 6 3 648 5 Delay (min)
Large processing delays observed in % of genes # of genes 5 5 55 4 8 > Delay (min)
Delay linked with gene length and intron structure Fraction of genes with > t..5..5..5.3 m< (N=8) m> (N=54) p value 4 6 8 4 6 8 log (p value) Fraction of genes with > t..5..5..5 f<. (N=443) f>. (N=343) p value 4 6 8 : delay m: gene length f: final intron length / gene length 4 6 8 log (p value)
Delayed genes show evidence of late-intron retention 3 min 6 min 8 min 4 min min min DLX3 6 4 5 3 5 5 8 6 4 8 6 4 Mean pre mrna end accumulation index.....3.4 > t < t p value 3 4 5 6 3 4 5 6 7 log (p value) Left: density of RNA-Seq reads uniquely mapping to the introns in the DLX3 gene Right: Differences in the mean pre-mrna accumulation index in long delay genes (blue) and short delay genes (red)
Comparison of processing and transcription times Transcriptional delay (min) 3 3 Transcription time = length/velocity estimate from Danko Mol Cell 3 Transcription time > processing delay in 87% of genes 3 3 RNA processing delay (min) Honkela et al. Genome-wide modeling of transcription kinetics reveals patterns of RNA production delays PNAS 5
Extension - inferring time-varying degradation rates dm(t) dt = βp(t) α(t)m(t).5. D 35 3 mrna measurements pre-mrna measurements 5.5 Degradation rate..5 Gene expression 5 5..5 4 6 8 Times 5 4 6 8 Times
Conclusion Gaussian process models provide a good mix of flexibility and tractability in temporal and spatial modelling: Hierarchical modelling, e.g. replicates, clusters, species Periodic models without strong sinuisoidal assumptions Tractable under branching and bifurcations Tractable under linear operations on functions Easy addition and multiplication of kernels Good approximate inference algorithms for non-gaussian data Codebase is growing making methods increasingly flexible and computationally efficient (e.g. GPy, GPflow and some R) Funding: BBSRC (EraSysBio+ SYNERGY), EU FP7 (RADIANT) Collaborators: Neil Lawrence, James Hensman, Antti Honkela, Jaakko Peltonen, Jing Yang, Alexis Boukouvalas, Max Zweissele
Our existential crisis