Some Statistical Models and Algorithms for Change-Point Problems in Genomics


S. Robin, UMR 518 AgroParisTech / INRA, Applied Mathematics & Computational Sciences. Journées SMAI-MAIRCI, Grenoble, September 2012.

Outline
1. Biological Motivation
2. Frequentist Inference: One Profile, Several Series
3. Bayesian Inference

Biological (and Statistical) Motivation

Genomic Alterations in Cancer Cells
Genomic alterations are associated with (responsible for?) various types of cancers. (Figure: karyotypes of a normal cell vs. a tumor cell; source: Hupé (2008).)

Array CGH Technology: Principle (figure only).

Segmentation: The Basic Problem
Data at hand, for a given individual:
- a sequence of known positions along the reference genome, labeled $t = 1, 2, \dots, n$;
- a signal measured at each position, $Y_t$ = signal at position $t$, with $Y_t \approx f(\text{relative copy number at position } t)$: log-fluorescence ratio, sum of the two allele signals for a given SNP, etc.
(Figure: a log2-ratio profile along the genome, showing amplified, deleted and unaltered segments.)

Cancer Cell Genomic Alterations
(Figure: zoom on a CGH profile for chromosomes 1 and 17, with the corresponding karyotype.) CGH technology gives access to copy number variations (CNV), but does not allow the positioning of translocations (Hupé (2008)).

Genome Annotation

A bit of biology: gene coding sequences are organized in exons / introns.
RNA-seq: data provided by NGS should make it possible to locate exon boundaries precisely. (Figure: RNA-seq counts along a gene vs. known exons; source: genomebiology.com.)
Reconsidering annotation: hence the official annotation can be reconsidered.

(Statistical) Questions
Biological questions: How many change points? Where are the change points (and how precise is the location)? Do the change points have the same location in samples A and B? ...
Statistical issues: statistical modeling, computational efficiency, model selection.

Basic Segmentation Model
Statistical model: Signal = f(position).
- Breakpoint positions: $\tau_1, \tau_2, \dots, \tau_{K-1}$;
- Mean signal (e.g. relative copy number) within each segment: $\theta_1, \theta_2, \dots, \theta_K$;
- Observed signal: independent variables with segment-specific means, i.e. if $t \in r_k$, then $Y_t \sim F(\theta_k)$.
Possible choices for the distribution $F$: CGH arrays give a continuous signal (fluorescence), hence a Gaussian distribution; NGS gives a discrete signal (counts), hence a Poisson or negative binomial distribution.
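
To make the model concrete, here is a minimal simulation sketch (illustrative, not from the slides; the function name and parameters are ours): segment boundaries and means are fixed, and a Gaussian or Poisson observation is drawn at each position.

    import numpy as np

    def simulate_profile(n, breakpoints, means, family="gaussian", sigma=0.2, seed=0):
        """Simulate a piecewise-constant signal Y_1..Y_n.
        breakpoints: change-point positions tau_1 < ... < tau_{K-1} (0 < tau < n).
        means: segment means theta_1..theta_K (len(breakpoints) + 1 values)."""
        rng = np.random.default_rng(seed)
        bounds = [0] + list(breakpoints) + [n]
        y = np.empty(n)
        for k, theta in enumerate(means):
            lo, hi = bounds[k], bounds[k + 1]
            if family == "gaussian":       # array CGH: continuous log-ratios
                y[lo:hi] = rng.normal(theta, sigma, hi - lo)
            else:                          # NGS: counts (means must be positive)
                y[lo:hi] = rng.poisson(theta, hi - lo)
        return y

    y = simulate_profile(500, breakpoints=[120, 200, 430], means=[0.0, 0.8, -1.0, 0.1])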

Some Specificities
Two different types of parameters in these models:
- Segmentation (change-point locations): $T = (\tau_1, \dots, \tau_{K-1})$, a discrete parameter;
- Distribution parameters (within-segment means): $\Theta = (\theta_1, \dots, \theta_K)$, a continuous parameter.
Segmentation space: the number of possible segmentations $T$ with $K$ segments is huge, $|\mathcal{M}_K| = \binom{n-1}{K-1}$, so the systematic exploration of $\mathcal{M}_K$ is impossible.
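
A quick numerical illustration (ours, not from the slides) of how fast this space grows:

    from math import comb
    n, K = 10**5, 10
    print(comb(n - 1, K - 1))   # about 2.8e39 possible segmentations

Even for a modest profile of 100,000 positions and 10 segments, enumerating all segmentations is hopeless, which is why dynamic programming over segment boundaries is needed.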

Frequentist Framework
Frequentist principle: the parameters $T$ and $\Theta$ are fixed, and we want to provide estimates $\widehat{T}$ and $\widehat{\Theta}$ of their true values.
One classical approach: maximum likelihood estimation (MLE),
$(\widehat{T}, \widehat{\Theta}) = \arg\max_{T,\Theta} \log P(Y; T, \Theta)$,
where $Y$ stands for the whole data set $Y = (Y_1, \dots, Y_n)$. This is a hybrid optimization problem with both discrete and continuous arguments.

One Profile
Likelihood. Thanks to the independence assumptions, we have
$\log P(Y; T, \Theta) = \sum_k \sum_{t \in r_k} \log P(Y_t; \theta_k)$.
Parameter estimates $\widehat{\Theta}$. For any segment $r_k$, the MLE of $\theta_k$ is
$\widehat{\theta}_k = \arg\max_{\theta_k} \sum_{t \in r_k} \log P(Y_t; \theta_k)$, e.g. $\widehat{\theta}_k$ = mean of the signal within segment $r_k$.
Segmentation estimate $\widehat{T}$. Finding $\widehat{T}$ is equivalent to a shortest-path problem where $-\sum_{t \in r_k} \log P(Y_t; \widehat{\theta}_k)$ is the cost of segment $r_k$. Because the cost is additive over segments, the problem can be solved via dynamic programming with time complexity $O(Kn^2)$ and space complexity $O(Kn)$.
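
A minimal sketch of this dynamic program for the Gaussian case (illustrative; the function name and the prefix-sum trick for the segment cost are ours): the cost of segment (i, j] is the residual sum of squares around the segment mean, obtained in O(1) from cumulative sums.

    import numpy as np

    def dp_segmentation(y, K):
        """Optimal segmentation of y into K segments (Gaussian cost),
        via dynamic programming in O(K n^2) time."""
        n = len(y)
        s1 = np.concatenate(([0.0], np.cumsum(y)))        # prefix sums
        s2 = np.concatenate(([0.0], np.cumsum(y ** 2)))   # prefix sums of squares

        def cost(i, j):  # residual sum of squares of segment y[i:j], j > i
            m = j - i
            return s2[j] - s2[i] - (s1[j] - s1[i]) ** 2 / m

        C = np.full((K + 1, n + 1), np.inf)   # C[k, j] = best cost of y[:j] in k segments
        back = np.zeros((K + 1, n + 1), dtype=int)
        C[0, 0] = 0.0
        for k in range(1, K + 1):
            for j in range(k, n + 1):
                cands = [C[k - 1, i] + cost(i, j) for i in range(k - 1, j)]
                i_best = int(np.argmin(cands)) + (k - 1)
                C[k, j], back[k, j] = cands[i_best - (k - 1)], i_best
        # backtrack the change points
        tau, j = [], n
        for k in range(K, 0, -1):
            j = back[k, j]
            tau.append(j)
        return sorted(tau)[1:], C[K, n]   # change points tau_1..tau_{K-1}, total cost

With the simulated profile above, tau_hat, cost = dp_segmentation(y, K=4) should recover change points close to the true ones (120, 200, 430).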

How Many Change-Points?
The number of segments $K$ is not known a priori, and the fit of the segmentation improves as $K$ increases. Penalized likelihood is the most common strategy:
$\widehat{K} = \arg\min_K \{ -\log P(Y; K) + \mathrm{pen}(K) \}$.
(Figure: contrast as a function of $K$.) A huge literature has been developed (Lebarbier (2005), Lavielle (2005), Zhang and Siegmund (2007), ...) to ensure that $\Pr\{\widehat{K} = K\} \to 1$ when $n \to \infty$, or to approximate $P(K \mid Y)$.
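
As an illustration (ours, not one of the penalties from the cited papers), the dynamic program above can be run for K = 1..Kmax and the K minimizing a penalized contrast retained; the BIC-like penalty constant below is a rough placeholder.

    import numpy as np

    def select_K(y, K_max, penalty_weight=None):
        """Pick the number of segments by penalized contrast (illustrative sketch)."""
        n = len(y)
        if penalty_weight is None:
            penalty_weight = 2.0 * np.var(y) * np.log(n)   # crude BIC-like constant
        crit = []
        for K in range(1, K_max + 1):
            _, rss = dp_segmentation(y, K)                  # contrast from the DP sketch above
            crit.append(rss + penalty_weight * K)
        return int(np.argmin(crit)) + 1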

Speeding up Dynamic Programming
Quadratic complexity $O(Kn^2)$ is not acceptable for very large signals (e.g. NGS for DNA-seq, where $n = 10^6$ or more). Pruned DP can strongly reduce the computational burden (Rigaill (2010)): its theoretical worst-case complexity is the same as regular DP, but its mean empirical complexity is almost linear, $O(Kn \log n)$.
Trick: do not compute the parameter estimates $\widehat{\Theta}$ (see the One Profile slide above) before performing the segmentation. This is only applicable for nice models (e.g. Gaussian, Poisson, negative binomial with known $\phi$, ...).

Breast Cancer Data (collaboration Curie / Servier)
(Figure: chromosomes 10 and 8 of two breast carcinomas (TNBC); Rigaill (2011).)

Several Series
Cancer studies are most often carried out on cohorts involving numerous CGH profiles ($m \simeq 100$). The goal is then to detect recurrent alterations among patients carrying the same type of cancer. As all the profiles are obtained with the same technology (same CGH probes, same NGS sequences), correlations between profiles are likely to exist (Picard et al. (2011b)).

Algorithmic Issue (graphical model representation)
- One series: handled with classical DP.
- Independent series: handled with two-stage DP.
- Correlation, same change points in all series: same complexity as for one series, classical DP.
- Correlation, different change points: the likelihood is no longer additive, so DP cannot be applied as such.

Modeling the Dependency
Joint segmentation model. In the Gaussian framework, we can write
$Y_{tm} = \mu_{km} + F_{tm}$, for $t \in r_k^m$,
where $m$ indexes the series (profiles) and $r_k^m$ is the $k$-th segment of series $m$.
Factor model (random effect model), an old trick in multivariate Gaussian models:
- an unobserved random effect $Z_t = (Z_{t1}, \dots, Z_{tQ}) \sim \mathcal{N}(0, I)$ is associated with each position (probe);
- the residual terms $F_t$ at position $t$ all depend on this effect: $F_t = B Z_t + E_t$, with $\{E_t\}$ i.i.d. $\mathcal{N}(0, \sigma^2 I)$;
- so the residuals at position $t$ have covariance matrix $\Sigma = B B^\top + \sigma^2 I$. This allows any covariance structure to be captured (e.g. taking $Q = m - 1$).
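
A small sketch (ours) of the residual factor model: correlated residuals across the m series are generated from Q latent probe effects, and the implied covariance across series is B B' + sigma^2 I.

    import numpy as np

    def simulate_factor_residuals(n, m, Q, sigma=0.2, seed=1):
        """Residuals F_t = B Z_t + E_t shared across m series at n positions."""
        rng = np.random.default_rng(seed)
        B = rng.normal(size=(m, Q))                 # loading matrix (m x Q)
        Z = rng.normal(size=(n, Q))                 # latent probe effects, N(0, I)
        E = rng.normal(scale=sigma, size=(n, m))    # independent noise
        F = Z @ B.T + E                             # n x m residual matrix
        Sigma = B @ B.T + sigma**2 * np.eye(m)      # implied covariance across series
        return F, Sigma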

Breaking Down the Dependency
- Marginal likelihood: $\log p(Y)$ is not additive w.r.t. the segments $r_k^m$.
- Joint likelihood: $\log p(Y, Z)$ is still not additive.
- Conditional likelihood: $\log p(Y \mid Z)$ is additive w.r.t. the segments.

E-M Algorithm
Incomplete data model: in the factor model, the random effect $Z$ is unobserved.
E-M algorithm. To perform maximum likelihood inference, first notice that the target is
$(\widehat{T}, \widehat{\Theta}, \widehat{\Sigma}) = \arg\max_{T,\Theta,\Sigma} \log P(Y; T, \Theta, \Sigma)$, with
$\log P(Y; T, \Theta) = E[\log P(Y, Z; T, \Theta) \mid Y] - E[\log P(Z \mid Y; T, \Theta) \mid Y]$;
then, iteratively,
- E-step: compute the conditional distribution $P(Z \mid Y; \widehat{T}, \widehat{\Theta}, \widehat{\Sigma})$;
- M-step: update the estimates as
$(\widehat{T}, \widehat{\Theta}, \widehat{\Sigma}) = \arg\max_{T,\Theta,\Sigma} E[\log P(Y, Z; T, \Theta, \Sigma) \mid Y] = \arg\max_{T,\Theta,\Sigma} \{ E[\log P(Z; \Sigma) \mid Y] + E[\log P(Y \mid Z; T, \Theta) \mid Y] \}$,
which can be achieved with dynamic programming (the conditional term is additive over segments).
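
A schematic and deliberately simplified sketch of such an EM loop (ours; it is not the CGHseg algorithm): the E-step computes E[Z_t | Y] under the Gaussian factor model, and the M-step re-segments each probe-effect-corrected series with the one-profile DP and crudely refreshes the loadings and noise variance.

    import numpy as np

    def em_joint_segmentation(Y, K, Q, n_iter=20, sigma2=0.05):
        """Schematic EM for joint segmentation with a residual factor model.
        Y: n x m matrix of profiles. Simplified sketch: each series keeps its
        own K segments, and the B / sigma2 updates are crude least squares."""
        n, m = Y.shape
        rng = np.random.default_rng(0)
        B = 0.1 * rng.normal(size=(m, Q))      # small random init to avoid a degenerate start
        mu = np.zeros_like(Y, dtype=float)
        for _ in range(n_iter):
            # E-step: E[Z_t | Y] = B' Sigma^{-1} F_t with F_t = Y_t - mu_t
            F = Y - mu
            Sigma = B @ B.T + sigma2 * np.eye(m)
            EZ = F @ np.linalg.solve(Sigma, B)          # n x Q
            # M-step (i): segment each probe-effect-corrected series with the DP sketch above
            Yc = Y - EZ @ B.T
            for j in range(m):
                tau, _ = dp_segmentation(Yc[:, j], K)
                bounds = [0] + list(tau) + [n]
                for lo, hi in zip(bounds[:-1], bounds[1:]):
                    mu[lo:hi, j] = Yc[lo:hi, j].mean()
            # M-step (ii): crude updates of the loadings and the noise variance
            B = np.linalg.lstsq(EZ, Y - mu, rcond=None)[0].T
            sigma2 = np.mean((Y - mu - EZ @ B.T) ** 2)
        return mu, B, sigma2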

Bladder Cancer Data (collaboration Curie)
Some large probe effects $Z_t$ are detected: poor probe affinity? wrong annotation? polymorphism? Breakpoints near such positions are likely to be artifacts, and the global segmentation is affected by this correction. (Figure: independent segmentation vs. multivariate mixed segmentation.)

CGH-seg Package
CGHseg is an R package dedicated to the analysis of CGH profiles (Picard et al. (2011a)). Linear model framework:
- Segmentation, regression on the unknown $T$: $Y = T\mu + E$;
- Correlation, factor model (or uniform correlation): $Y = T\mu + BZ + E$;
- Correction for fixed covariate effects $\beta$: $Y = T\mu + X\beta + E$;
- Calling, with a cross-tabulation table $C$: $Y = TCm + E$;
- Combinations, e.g. fixed + random effects: $Y = T\mu + X\beta + LZ + E$, or fixed effects + calling: $Y = TCm + X\beta + E$.

Bayesian Framework
Limitation of the frequentist approach: due to the discrete nature of the change points, their inference is complex and confidence intervals cannot easily be derived.
Bayesian principle: all parameters $(K, T, \Theta)$ are random.
- Prior distribution of the parameters: $P(K)$, $P(T \mid K)$, $P(\Theta \mid T, K)$;
- Likelihood of the data given the parameters: $P(Y \mid K, T, \Theta)$;
- Posterior distribution of the parameters:
$P(K, T, \Theta \mid Y) = P(K, T, \Theta, Y) / P(Y)$, where
$P(K, T, \Theta, Y) = P(K)\, P(T \mid K)\, P(\Theta \mid T, K)\, P(Y \mid K, T, \Theta)$ and
$P(Y) = \sum_K \sum_{T \in \mathcal{M}_K} \int_\Theta P(K, T, \Theta, Y)\, d\Theta$.

Posterior Distributions
Biological questions and their Bayesian counterparts:
- How many change points? $P(K \mid Y)$;
- Where are the change points? $P(T \mid Y, K)$ or $P(T \mid Y)$;
- How precise is the location? $P(\tau_1 \mid Y, K)$ or $P(\tau_1 \mid Y)$;
- Do the change points have the same location in samples A and B? $P(\tau_1^A = \tau_1^B \mid Y^A, Y^B)$.
One common issue: the calculation of all these quantities requires that of
$P(Y \mid K) = \sum_{T \in \mathcal{M}_K} P(Y, T \mid K)$,
where $\mathcal{M}_K$ stands for the set of all segmentations with $K$ segments.

Calculating Exact Posterior Distributions
1 - Factorization assumption:
$P(T) = \prod_{r \in T} f(r)$, $\quad P(\Theta \mid T) = \prod_{r \in T} f(\theta_r)$, $\quad P(Y \mid T, \Theta) = \prod_{r \in T} f(Y^r, \theta_r)$.
2 - Conjugate prior: $\int P(Y^r \mid \theta_r)\, P(\theta_r)\, d\theta_r$ has a closed-form expression.
Sum-product form: the total probability can then be rewritten as
$P(Y \mid K) = \sum_{T \in \mathcal{M}_K} \int_\Theta P(Y, T, \Theta \mid K)\, d\Theta = \sum_{T \in \mathcal{M}_K} \prod_{r \in T} P(Y^r)$.
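
A concrete instance of the conjugacy requirement (our example, not from the slides): for Poisson counts in a segment $r$ with a Gamma$(a, b)$ prior on the segment mean, the segment marginal has a closed form,

    P(Y^r) = \int \prod_{t \in r} \frac{\theta^{Y_t} e^{-\theta}}{Y_t!}
             \cdot \frac{b^a}{\Gamma(a)}\, \theta^{a-1} e^{-b\theta}\, d\theta
           = \frac{b^a}{\Gamma(a)} \cdot \frac{\Gamma(a + S_r)}{(b + n_r)^{a + S_r}}
             \prod_{t \in r} \frac{1}{Y_t!},
    \qquad S_r = \sum_{t \in r} Y_t, \quad n_r = |r|.

Similar closed forms exist for the Gaussian case and for the negative binomial with known dispersion.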

Exploring the Segmentation Space
Computational consequence: $\sum_{T \in \mathcal{M}_K} \prod_{r \in T} P(Y^r)$ can be computed as $[A^K]_{0,n}$, where $A_{ij} = P(Y_{i+1}, \dots, Y_j)$, with complexity $O(Kn^2)$ (Rigaill et al. (2011)).
We are hence able to compute, with quadratic complexity:
- the probability that there is a breakpoint at position $t$;
- the probability for a segment $r = [t_1, t_2]$ to be part of the segmentation;
- the posterior entropy of $T$ within a given dimension $K$.
This is similar to an algorithm proposed by Guédon (2007) in the HMM context with known parameters.
Algorithmic comment: replacing products with sums (i.e. working with log-probabilities) and sums with min gives the dynamic programming algorithm used above for one-profile segmentation.
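
A minimal sketch (ours) of this matrix computation in the Poisson–Gamma case: A[i, j] holds the segment marginal P(Y_{i+1..j}) from the closed form above, and repeated products of A give P(Y | K) and posterior breakpoint probabilities; the prior on T and the normalization conventions are simplified relative to Rigaill et al. (2011).

    import numpy as np
    from scipy.special import gammaln, logsumexp

    def segment_matrix(y, a=1.0, b=1.0):
        """logA[i, j] = log P(Y_{i+1..j}) for Poisson counts with a Gamma(a, b) prior."""
        n = len(y)
        s = np.concatenate(([0.0], np.cumsum(y)))
        lfact = np.concatenate(([0.0], np.cumsum(gammaln(np.asarray(y) + 1.0))))
        logA = np.full((n + 1, n + 1), -np.inf)
        for i in range(n):
            for j in range(i + 1, n + 1):
                S, m = s[j] - s[i], j - i
                logA[i, j] = (a * np.log(b) - gammaln(a)
                              + gammaln(a + S) - (a + S) * np.log(b + m)
                              - (lfact[j] - lfact[i]))
        return logA

    def evidence_and_breakpoints(logA, K):
        """log P(Y | K) = log [A^K]_{0,n}, plus posterior breakpoint probabilities (K >= 2)."""
        n = logA.shape[0] - 1
        # F[k, j] = log sum over k-segment segmentations of Y_{1..j} of the product of P(Y^r)
        F = np.full((K + 1, n + 1), -np.inf); F[0, 0] = 0.0
        # G[k, i] = same quantity for Y_{i+1..n} split into k segments
        G = np.full((K + 1, n + 1), -np.inf); G[0, n] = 0.0
        for k in range(1, K + 1):
            for j in range(k, n + 1):
                F[k, j] = logsumexp(F[k - 1, :j] + logA[:j, j])
            for i in range(n - k, -1, -1):
                G[k, i] = logsumexp(logA[i, i + 1:] + G[k - 1, i + 1:])
        log_evidence = F[K, n]          # = log [A^K]_{0,n}
        # P(breakpoint at t | Y, K): sum over the number of segments before position t
        bp = np.array([logsumexp([F[k, t] + G[K - k, t] for k in range(1, K)])
                       for t in range(1, n)]) - log_evidence
        return log_evidence, np.exp(bp)

For instance, evidence_and_breakpoints(segment_matrix(counts), K=4) returns log P(Y | K = 4) and the posterior probability of a breakpoint at each interior position of a vector of counts.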

A CGH Profile: K = 3, 4 (figure only).

Presence of an Alteration
To assess the gain or loss of a given locus (e.g. a gene), one needs to know whether it is included within an alteration (see the breast cancer profiles above). The posterior probability of a segment can be computed as well:
$f(t_1, t_2) = \Pr\{\exists k : r_k = [t_1, t_2] \mid Y, K\}$.
(Figure: segment probabilities for K = 3 and K = 4.)

Model Selection: Exact BIC and ICL
Standard model selection criteria can be computed exactly (without, say, a Laplace approximation).
- Choice of $K$: exact version of $\mathrm{BIC}(K) = \log P(Y, K)$.
- Choice of $T$: exact version of $\mathrm{BIC}(T) = \log P(Y, T)$.
- Alternative choice of $K$: exact version of $\mathrm{ICL}(K)$ (Biernacki et al. (2000)),
$\mathrm{ICL}(K) = \log P(Y, K) + \sum_{T \in \mathcal{M}_K} P(T \mid Y, K) \log P(T \mid Y, K)$,
where the second term is (minus) the posterior entropy of $T$ within $\mathcal{M}_K$; this criterion favors dimensions $K$ for which the best segmentation $\widehat{T}_K$ clearly outperforms the others.

Comparison BIC/ICL: Simulation Study
Simulations: Poisson signal with alternating means $\lambda_0$ and $\lambda_1$, $n = 100$, true number of segments $K = 5$; the criterion is the percentage of recovery of the true number of segments, plotted against the contrast between $\lambda_1$ and $\lambda_0$ (with $\lambda_0 = 1$). BIC(T) achieves (much) better performance than BIC(K), and ICL(K) outperforms both BIC(K) and BIC(T) (Rigaill et al. (2011)).

Re-annotation
The posterior distribution of the change-point locations can be used to assess the genome annotation. It also makes it possible to study variations of the transcription start due to changes in conditions (Cleynen (2012)).

Alternative Splicing (collaboration Univ. Berkeley)
Alternative splicing refers to the modification of exon boundaries (or to exon skipping). Based on the posterior distribution of exon boundaries, a test for alternative splicing can be derived. The EBS package performs exact Bayesian segmentation and returns all the quantities described above: cran.r-project.org/web/packages/ebs

Discussion
Statistics are needed to assess the existence and the location of genomic alterations, or to support genome annotation. Algorithmic complexity is one major limitation of such analyses: how to manage profiles over $10^8$ nucleotides for a cohort of $10^3$ or $10^4$ patients?
Some related questions: assess the relation between an alteration with a given length and frequency in a cohort and a certain type of cancer; study the influence of various experimental conditions on alternative splicing, for genes with a large number of exons (10 or 20).

References
Biernacki, C., Celeux, G. and Govaert, G. (2000). Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans. Pattern Anal. Machine Intel. 22(7).
Guédon, Y. (2007). Exploring the state sequence space for hidden Markov and semi-Markov chains. Comput. Statist. and Data Analysis.
Hupé, P. (2008). Biostatistical algorithms for omics data in oncology: Application to DNA copy number microarray experiments. PhD thesis, AgroParisTech.
Lavielle, M. (2005). Using penalized contrasts for the change-point problem. Signal Proc.
Lebarbier, E. (2005). Detecting multiple change-points in the mean of Gaussian process by model selection. Signal Proc.
Picard, F., Lebarbier, E., Hoebeke, M., Rigaill, G., Thiam, B. and Robin, S. (2011a). Joint segmentation, calling, and normalization of multiple CGH profiles. Biostatistics. doi:10.1093/biostatistics/kxq075.
Picard, F., Lebarbier, E., Budinska, E. and Robin, S. (2011b). Joint segmentation of multivariate Gaussian processes using mixed linear models. Comput. Statist. and Data Analysis 55(2).
Rigaill, G. (2010). Pruned dynamic programming for optimal multiple change-point detection. Technical report, arXiv.
Rigaill, G. (2011). Statistical and algorithmic developments for the analysis of Triple Negative Breast Cancers. PhD thesis, ABIES.
Rigaill, G., Lebarbier, E. and Robin, S. (2011). Exact posterior distributions over the segmentation space and model selection for multiple change-point detection problem. Stat. Comp.
Zhang, N. R. and Siegmund, D. O. (2007). A modified Bayes information criterion with applications to the analysis of comparative genomic hybridization data. Biometrics 63(1).


More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning Empirical Bayes, Hierarchical Bayes Mark Schmidt University of British Columbia Winter 2017 Admin Assignment 5: Due April 10. Project description on Piazza. Final details coming

More information

Computational Genomics. Systems biology. Putting it together: Data integration using graphical models

Computational Genomics. Systems biology. Putting it together: Data integration using graphical models 02-710 Computational Genomics Systems biology Putting it together: Data integration using graphical models High throughput data So far in this class we discussed several different types of high throughput

More information

Variable Selection in Structured High-dimensional Covariate Spaces

Variable Selection in Structured High-dimensional Covariate Spaces Variable Selection in Structured High-dimensional Covariate Spaces Fan Li 1 Nancy Zhang 2 1 Department of Health Care Policy Harvard University 2 Department of Statistics Stanford University May 14 2007

More information

Lecture 8 Learning Sequence Motif Models Using Expectation Maximization (EM) Colin Dewey February 14, 2008

Lecture 8 Learning Sequence Motif Models Using Expectation Maximization (EM) Colin Dewey February 14, 2008 Lecture 8 Learning Sequence Motif Models Using Expectation Maximization (EM) Colin Dewey February 14, 2008 1 Sequence Motifs what is a sequence motif? a sequence pattern of biological significance typically

More information

Learning Bayesian network : Given structure and completely observed data

Learning Bayesian network : Given structure and completely observed data Learning Bayesian network : Given structure and completely observed data Probabilistic Graphical Models Sharif University of Technology Spring 2017 Soleymani Learning problem Target: true distribution

More information

CS Homework 3. October 15, 2009

CS Homework 3. October 15, 2009 CS 294 - Homework 3 October 15, 2009 If you have questions, contact Alexandre Bouchard (bouchard@cs.berkeley.edu) for part 1 and Alex Simma (asimma@eecs.berkeley.edu) for part 2. Also check the class website

More information

Linear Dynamical Systems

Linear Dynamical Systems Linear Dynamical Systems Sargur N. srihari@cedar.buffalo.edu Machine Learning Course: http://www.cedar.buffalo.edu/~srihari/cse574/index.html Two Models Described by Same Graph Latent variables Observations

More information

Bayesian Inference on Joint Mixture Models for Survival-Longitudinal Data with Multiple Features. Yangxin Huang

Bayesian Inference on Joint Mixture Models for Survival-Longitudinal Data with Multiple Features. Yangxin Huang Bayesian Inference on Joint Mixture Models for Survival-Longitudinal Data with Multiple Features Yangxin Huang Department of Epidemiology and Biostatistics, COPH, USF, Tampa, FL yhuang@health.usf.edu January

More information

Hidden Markov Models. Terminology, Representation and Basic Problems

Hidden Markov Models. Terminology, Representation and Basic Problems Hidden Markov Models Terminology, Representation and Basic Problems Data analysis? Machine learning? In bioinformatics, we analyze a lot of (sequential) data (biological sequences) to learn unknown parameters

More information

Bayesian Machine Learning

Bayesian Machine Learning Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 2: Bayesian Basics https://people.orie.cornell.edu/andrew/orie6741 Cornell University August 25, 2016 1 / 17 Canonical Machine Learning

More information

Introduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Lior Wolf

Introduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Lior Wolf 1 Introduction to Machine Learning Maximum Likelihood and Bayesian Inference Lecturers: Eran Halperin, Lior Wolf 2014-15 We know that X ~ B(n,p), but we do not know p. We get a random sample from X, a

More information

Introduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Yishay Mansour, Lior Wolf

Introduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Yishay Mansour, Lior Wolf 1 Introduction to Machine Learning Maximum Likelihood and Bayesian Inference Lecturers: Eran Halperin, Yishay Mansour, Lior Wolf 2013-14 We know that X ~ B(n,p), but we do not know p. We get a random sample

More information

LECTURE 15 Markov chain Monte Carlo

LECTURE 15 Markov chain Monte Carlo LECTURE 15 Markov chain Monte Carlo There are many settings when posterior computation is a challenge in that one does not have a closed form expression for the posterior distribution. Markov chain Monte

More information

Machine Learning Summer School

Machine Learning Summer School Machine Learning Summer School Lecture 3: Learning parameters and structure Zoubin Ghahramani zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin/ Department of Engineering University of Cambridge,

More information

Module 22: Bayesian Methods Lecture 9 A: Default prior selection

Module 22: Bayesian Methods Lecture 9 A: Default prior selection Module 22: Bayesian Methods Lecture 9 A: Default prior selection Peter Hoff Departments of Statistics and Biostatistics University of Washington Outline Jeffreys prior Unit information priors Empirical

More information

Different points of view for selecting a latent structure model

Different points of view for selecting a latent structure model Different points of view for selecting a latent structure model Gilles Celeux Inria Saclay-Île-de-France, Université Paris-Sud Latent structure models: two different point of views Density estimation LSM

More information

Sparse Linear Models (10/7/13)

Sparse Linear Models (10/7/13) STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine

More information

Segmentation of the mean of heteroscedastic data via cross-validation

Segmentation of the mean of heteroscedastic data via cross-validation Segmentation of the mean of heteroscedastic data via cross-validation 1 UMR 8524 CNRS - Université Lille 1 2 SSB Group, Paris joint work with Sylvain Arlot GDR Statistique et Santé Paris, October, 21 2009

More information

Gaussian discriminant analysis Naive Bayes

Gaussian discriminant analysis Naive Bayes DM825 Introduction to Machine Learning Lecture 7 Gaussian discriminant analysis Marco Chiarandini Department of Mathematics & Computer Science University of Southern Denmark Outline 1. is 2. Multi-variate

More information

Estimation of reliability parameters from Experimental data (Parte 2) Prof. Enrico Zio

Estimation of reliability parameters from Experimental data (Parte 2) Prof. Enrico Zio Estimation of reliability parameters from Experimental data (Parte 2) This lecture Life test (t 1,t 2,...,t n ) Estimate θ of f T t θ For example: λ of f T (t)= λe - λt Classical approach (frequentist

More information

Markov Chains and Hidden Markov Models

Markov Chains and Hidden Markov Models Chapter 1 Markov Chains and Hidden Markov Models In this chapter, we will introduce the concept of Markov chains, and show how Markov chains can be used to model signals using structures such as hidden

More information

STA 414/2104: Machine Learning

STA 414/2104: Machine Learning STA 414/2104: Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistics! rsalakhu@cs.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 9 Sequential Data So far

More information

Bayesian model selection: methodology, computation and applications

Bayesian model selection: methodology, computation and applications Bayesian model selection: methodology, computation and applications David Nott Department of Statistics and Applied Probability National University of Singapore Statistical Genomics Summer School Program

More information

10 : HMM and CRF. 1 Case Study: Supervised Part-of-Speech Tagging

10 : HMM and CRF. 1 Case Study: Supervised Part-of-Speech Tagging 10-708: Probabilistic Graphical Models 10-708, Spring 2018 10 : HMM and CRF Lecturer: Kayhan Batmanghelich Scribes: Ben Lengerich, Michael Kleyman 1 Case Study: Supervised Part-of-Speech Tagging We will

More information

Principles of Bayesian Inference

Principles of Bayesian Inference Principles of Bayesian Inference Sudipto Banerjee 1 and Andrew O. Finley 2 1 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A. 2 Department of Forestry & Department

More information

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering Types of learning Modeling data Supervised: we know input and targets Goal is to learn a model that, given input data, accurately predicts target data Unsupervised: we know the input only and want to make

More information

Multivariate Survival Analysis

Multivariate Survival Analysis Multivariate Survival Analysis Previously we have assumed that either (X i, δ i ) or (X i, δ i, Z i ), i = 1,..., n, are i.i.d.. This may not always be the case. Multivariate survival data can arise in

More information

The Expectation Maximization or EM algorithm

The Expectation Maximization or EM algorithm The Expectation Maximization or EM algorithm Carl Edward Rasmussen November 15th, 2017 Carl Edward Rasmussen The EM algorithm November 15th, 2017 1 / 11 Contents notation, objective the lower bound functional,

More information

Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines

Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Maximilian Kasy Department of Economics, Harvard University 1 / 37 Agenda 6 equivalent representations of the

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning Expectation Maximization Mark Schmidt University of British Columbia Winter 2018 Last Time: Learning with MAR Values We discussed learning with missing at random values in data:

More information

Hidden Markov Models. Aarti Singh Slides courtesy: Eric Xing. Machine Learning / Nov 8, 2010

Hidden Markov Models. Aarti Singh Slides courtesy: Eric Xing. Machine Learning / Nov 8, 2010 Hidden Markov Models Aarti Singh Slides courtesy: Eric Xing Machine Learning 10-701/15-781 Nov 8, 2010 i.i.d to sequential data So far we assumed independent, identically distributed data Sequential data

More information

Lecture 6: Graphical Models: Learning

Lecture 6: Graphical Models: Learning Lecture 6: Graphical Models: Learning 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering, University of Cambridge February 3rd, 2010 Ghahramani & Rasmussen (CUED)

More information

Latent Variable models for GWAs

Latent Variable models for GWAs Latent Variable models for GWAs Oliver Stegle Machine Learning and Computational Biology Research Group Max-Planck-Institutes Tübingen, Germany September 2011 O. Stegle Latent variable models for GWAs

More information

Chapter 3: Maximum-Likelihood & Bayesian Parameter Estimation (part 1)

Chapter 3: Maximum-Likelihood & Bayesian Parameter Estimation (part 1) HW 1 due today Parameter Estimation Biometrics CSE 190 Lecture 7 Today s lecture was on the blackboard. These slides are an alternative presentation of the material. CSE190, Winter10 CSE190, Winter10 Chapter

More information