A segmentation-clustering problem for the analysis of array CGH data

1 A segmentation-clustering problem for the analysis of array CGH data
F. Picard, S. Robin, E. Lebarbier, J-J. Daudin
UMR INA P-G / ENGREF / INRA MIA 518
Applied Stochastic Models and Data Analysis, Brest, May 24

2 Microarray CGH technology
- Large chromosomal aberrations (e.g. trisomy) have known effects. Experimental tool: karyotype (resolution: the chromosome).
- Change of scale: what are the effects of deletions or amplifications of small DNA sequences? Experimental tool: "conventional" CGH (resolution 1Mb).
- CGH = Comparative Genomic Hybridization: a method for the comparative measurement of relative DNA copy numbers between two samples (normal/disease, test/reference).
- Application of microarray technology to CGH: the latest generation of chips reaches a resolution of 1kb.

3 Microarray technology: the principle

4 Interpretation of a CGH profile
[Figure: a CGH profile showing an amplified segment, a "normal" segment, and a deleted segment.]
A dot on the graph represents the ratio of the number of copies of BAC(t) in the test genome to the number of copies of BAC(t) in the reference genome.

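5 First step of the statistical analysis: break-point detection in a Gaussian signal
- $Y = (Y_1, \dots, Y_n)$ is a random process such that $Y_t \sim \mathcal{N}(\mu_t, \sigma_t^2)$.
- Suppose that the parameters of the distribution of $Y$ are affected by $K-1$ abrupt changes at unknown coordinates $T = (t_1, \dots, t_{K-1})$.
- Those break-points define a partition of the data into $K$ segments of size $n_k$: $I_k = \{t : t_{k-1} < t \le t_k\}$ and $Y^k = \{Y_t, t \in I_k\}$.
- Suppose that those parameters are constant between two changes: $\forall t \in I_k$, $Y_t \sim \mathcal{N}(\mu_k, \sigma_k^2)$.
- The parameters of this model are $T = (t_1, \dots, t_{K-1})$ and $\Theta = (\theta_1, \dots, \theta_K)$, with $\theta_k = (\mu_k, \sigma_k^2)$.
- Break-point detection aims at studying the spatial structure of the signal.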

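6 Estimating the parameters in a model of abrupt-change detection
Log-likelihood: $\mathcal{L}_K(T, \Theta) = \sum_{k=1}^{K} \sum_{t \in I_k} \log f(y_t; \theta_k)$, where $T$ denotes the break-points, $I_k$ the positions of segment $k$, and $f(\cdot; \theta_k)$ the Gaussian density with the parameters $\theta_k$ of segment $k$.
Estimating the parameters with $K$ fixed, by maximum likelihood:
- Joint estimation of $T$ and $\Theta$ with dynamic programming.
- Necessary property of the likelihood: additivity in $K$ (a sum of local likelihoods calculated on each segment).
Model selection: choice of $K$.
- Penalized likelihood: $\hat{K} = \arg\max_K \{ \hat{\mathcal{L}}_K - \beta\, \mathrm{pen}(K) \}$, where $\mathrm{pen}(K)$ is a penalty that increases with $K$.
- $\beta$ is estimated adaptively from the data (Lavielle(23)).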
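The dynamic-programming step for a fixed number of segments can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `gaussian_segment_cost` and `best_segmentation`, the minimum segment length `min_len` and the variance floor `var_floor` are assumptions introduced here. It relies only on the additivity of the log-likelihood over segments.

```python
import numpy as np

def gaussian_segment_cost(y, min_len=2, var_floor=1e-8):
    """cost[i, j] = minus the maximized Gaussian log-likelihood of y[i:j].

    Each segment gets its own mean and variance. min_len and var_floor are
    illustrative safeguards (assumptions), not part of the original model.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    s1 = np.concatenate(([0.0], np.cumsum(y)))
    s2 = np.concatenate(([0.0], np.cumsum(y ** 2)))
    cost = np.full((n + 1, n + 1), np.inf)
    for i in range(n):
        for j in range(i + min_len, n + 1):
            size = j - i
            mean = (s1[j] - s1[i]) / size
            var = max((s2[j] - s2[i]) / size - mean ** 2, var_floor)
            cost[i, j] = 0.5 * size * (np.log(2 * np.pi * var) + 1.0)
    return cost

def best_segmentation(cost, K):
    """Minimize the sum of segment costs over all partitions into K segments.

    Returns (total cost, segment end points t_1 < ... < t_K = n), where
    segment k is y[t_{k-1}:t_k] with t_0 = 0.
    """
    n = cost.shape[0] - 1
    best = np.full((K + 1, n + 1), np.inf)
    arg = np.zeros((K + 1, n + 1), dtype=int)
    best[0, 0] = 0.0
    for k in range(1, K + 1):
        for j in range(1, n + 1):
            # last segment is y[i:j]; the first i points use k - 1 segments
            cand = best[k - 1, :j] + cost[:j, j]
            arg[k, j] = int(np.argmin(cand))
            best[k, j] = cand[arg[k, j]]
    ends, j = [], n
    for k in range(K, 0, -1):  # backtrack the optimal end points
        ends.append(j)
        j = arg[k, j]
    return best[K, n], ends[::-1]
```

Running `best_segmentation` for K = 1, ..., K_max gives the maximized log-likelihoods (minus the returned costs) that a penalized criterion would then compare; the adaptive estimation of the penalty constant is not sketched here.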

7 Example of segmentation on array CGH data
[Figure: two segmented profiles, BT474 chromosome 1 and BT474 chromosome 9. x-axis: genomic position; y-axis: log2 ratio.]

8 Considering the biologists' objective and the need for a new model
[Figure: schematic panels labelled "structure on the y" and "structure on the t".]
- Segmentation: spatial structure of the signal.
- Segmentation/Clustering.

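9 A new model for segmentation-clustering purposes
- We suppose there exists a secondary underlying structure of the segments into $P$ populations with weights $\pi_1, \dots, \pi_P$ ($\sum_p \pi_p = 1$).
- We introduce hidden variables $Z_{kp}$, indicators of the population of origin of segment $k$.
- Those variables are supposed independent, with multinomial distribution: $(Z_{k1}, \dots, Z_{kP}) \sim \mathcal{M}(1; \pi_1, \dots, \pi_P)$.
- Conditionally on the hidden variables, we know the distribution of $Y$: $Y^k \mid Z_{kp} = 1 \sim \mathcal{N}(m_p \mathbb{1}_{n_k},\; s_p^2\, \mathrm{Id}_{n_k})$, where $Y^k$ is the vector of the $n_k$ observations of segment $k$.
- It is a model of segmentation/clustering.
- The parameters of this model are $T = (t_1, \dots, t_{K-1})$ and $\Theta = (\pi_1, \dots, \pi_P; \theta_1, \dots, \theta_P)$, with $\theta_p = (m_p, s_p^2)$.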

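10 Likelihood and statistical units of the model
- Mixture model of segments: the statistical units are the segments $Y^k$, and the density of $Y^k$ is a mixture density: $\log \mathcal{L}_{KP}(T, \Theta) = \sum_{k=1}^{K} \log f(y^k; \Theta) = \sum_{k=1}^{K} \log \{ \sum_{p=1}^{P} \pi_p f(y^k; \theta_p) \}$.
- If the $Y_t$s are independent, we have: $\log \mathcal{L}_{KP}(T, \Theta) = \sum_{k=1}^{K} \log \{ \sum_{p=1}^{P} \pi_p \prod_{t \in I_k} f(y_t; \theta_p) \}$, where $I_k$ is the set of positions of segment $k$.
- Classical mixture model: the statistical units are the $Y_t$s: $\log \mathcal{L}_{P}(\Theta) = \sum_{t=1}^{n} \log \{ \sum_{p=1}^{P} \pi_p f(y_t; \theta_p) \}$.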

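11 A hybrid algorithm for the optimization of the likelihood
Alternate parameter estimation with $K$ and $P$ known:
1 When $T$ is fixed, the EM algorithm estimates $\Theta$: $\hat{\Theta}^{(\ell+1)} = \arg\max_{\Theta} \log \mathcal{L}_{KP}(\hat{T}^{(\ell)}, \Theta)$.
2 When $\Theta$ is fixed, dynamic programming estimates $T$: $\hat{T}^{(\ell+1)} = \arg\max_{T} \log \mathcal{L}_{KP}(T, \hat{\Theta}^{(\ell+1)})$.
An increasing sequence of likelihoods: $\log \mathcal{L}_{KP}(\hat{T}^{(\ell+1)}, \hat{\Theta}^{(\ell+1)}) \ge \log \mathcal{L}_{KP}(\hat{T}^{(\ell)}, \hat{\Theta}^{(\ell)})$.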

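12 Mixture model when the segmentation is known
Mixture model parameter estimators, with posterior probabilities $\hat{\tau}_{kp} = \hat{\pi}_p f(y^k; \hat{\theta}_p) / \sum_{q} \hat{\pi}_q f(y^k; \hat{\theta}_q)$:
- The estimator of the mixing proportions is $\hat{\pi}_p = \frac{1}{K} \sum_{k=1}^{K} \hat{\tau}_{kp}$.
- In the Gaussian case, $\hat{m}_p = \frac{\sum_{k} \hat{\tau}_{kp} \sum_{t \in I_k} y_t}{\sum_{k} \hat{\tau}_{kp} n_k}$ and $\hat{s}_p^2 = \frac{\sum_{k} \hat{\tau}_{kp} \sum_{t \in I_k} (y_t - \hat{m}_p)^2}{\sum_{k} \hat{\tau}_{kp} n_k}$.
- Large vectors have a bigger impact on the estimation of the parameters, via the term $\sum_{k} \hat{\tau}_{kp} n_k$.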
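A possible EM loop for the mixture of segments with a known segmentation is sketched below. It assumes Gaussian populations with mean m_p and variance s_p^2, weights each segment by its posterior probability, and computes posteriors on the log scale. The names `segment_stats` and `em_segment_mixture`, the fixed iteration count and the variance floor are assumptions made for this illustration, and nothing here guards against the EM problems listed in the conclusions (initialization, local maxima, degeneracy).

```python
import numpy as np

def segment_stats(y, ends):
    """Size, sum and sum of squares of each segment y[start:end]."""
    y = np.asarray(y, dtype=float)
    starts = np.concatenate(([0], ends[:-1])).astype(int)
    n_k = np.array([b - a for a, b in zip(starts, ends)], dtype=float)
    s1 = np.array([y[a:b].sum() for a, b in zip(starts, ends)])
    s2 = np.array([(y[a:b] ** 2).sum() for a, b in zip(starts, ends)])
    return n_k, s1, s2

def em_segment_mixture(y, ends, m, s2, pi, n_iter=50, var_floor=1e-6):
    """EM where the statistical units are whole segments (illustrative sketch).

    Returns (pi, m, s2, tau); tau[k, p] is the posterior probability that
    segment k comes from population p, as computed in the last E-step.
    """
    n_k, seg_sum, seg_sq = segment_stats(y, ends)
    m, s2, pi = (np.array(a, dtype=float) for a in (m, s2, pi))
    for _ in range(n_iter):
        # sum over t in segment k of (y_t - m_p)^2, for every pair (k, p)
        ss = seg_sq[:, None] - 2 * seg_sum[:, None] * m + n_k[:, None] * m ** 2
        # E-step: log(pi_p) + log f(y^k; theta_p), normalized per segment
        logw = np.log(pi) - 0.5 * (n_k[:, None] * np.log(2 * np.pi * s2) + ss / s2)
        logw -= logw.max(axis=1, keepdims=True)
        tau = np.exp(logw)
        tau /= tau.sum(axis=1, keepdims=True)
        # M-step: segments contribute through tau_kp * n_k
        pi = tau.mean(axis=0)
        weight = tau.T @ n_k
        m = (tau.T @ seg_sum) / weight
        ss = seg_sq[:, None] - 2 * seg_sum[:, None] * m + n_k[:, None] * m ** 2
        s2 = np.maximum((tau * ss).sum(axis=0) / weight, var_floor)
    return pi, m, s2, tau
```

The denominator `weight` is the term through which long segments dominate the estimates: a segment of 200 points assigned to a population moves its mean far more than a segment of 5 points.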

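13 Influence of the vector size on the assignment (MAP)
- The density of $Y^k$ under population $p$ can be written as follows: $f(y^k; \theta_p) = \exp\{ -\frac{n_k}{2} [ \log(2\pi s_p^2) + \frac{1}{s_p^2} (\overline{y_k^2} - \bar{y}_k^2) + \frac{1}{s_p^2} (\bar{y}_k - m_p)^2 ] \}$, where $\bar{y}_k$ and $\overline{y_k^2}$ are the means of $y_t$ and $y_t^2$ over segment $k$.
- $(\bar{y}_k - m_p)^2$: distance of the mean of vector $k$ to population $p$.
- $(\overline{y_k^2} - \bar{y}_k^2)$: intra-vector variability.
- Large individuals are assigned with certainty to the closest population.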
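The effect of the segment size on the MAP assignment can be checked numerically. The sketch below assumes Gaussian populations and writes the log-density of a segment as -(n_k/2) [log(2 pi s_p^2) + intra-vector variability / s_p^2 + (segment mean - m_p)^2 / s_p^2], so the segment enters only through its size, mean and intra-vector variability. The function name `map_posterior` and its argument names are hypothetical.

```python
import numpy as np

def map_posterior(n_k, ybar, intra_var, m, s2, pi):
    """Posterior probabilities of each population for one segment.

    n_k: segment size; ybar: segment mean; intra_var: mean of y^2 minus ybar^2.
    m, s2, pi: population means, variances and weights.
    """
    m, s2, pi = (np.asarray(a, dtype=float) for a in (m, s2, pi))
    log_f = -0.5 * n_k * (np.log(2 * np.pi * s2) + intra_var / s2 + (ybar - m) ** 2 / s2)
    logw = np.log(pi) + log_f
    logw -= logw.max()  # stabilize before exponentiating
    w = np.exp(logw)
    return w / w.sum()

# Same segment mean (0.4, slightly closer to population 0), growing size.
for n_k in (1, 10, 100):
    print(n_k, map_posterior(n_k, 0.4, 1.0, m=[0.0, 1.0], s2=[1.0, 1.0], pi=[0.5, 0.5]))
```

With equal variances and weights, the log-odds in favour of the closer population grow linearly with n_k, which is why a long segment is assigned almost surely even when its mean lies nearly halfway between two populations.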

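14 Segmentation with a fixed mixture
Back to dynamic programming:
- The incomplete mixture log-likelihood can be written as a sum of local log-likelihoods: $\log \mathcal{L}_{KP}(T, \Theta) = \sum_{k=1}^{K} \ell_k(y^k; \Theta)$.
- The local log-likelihood of segment $k$ corresponds to the mixture log-density of vector $Y^k$: $\ell_k(y^k; \Theta) = \log \{ \sum_{p=1}^{P} \pi_p \prod_{t \in I_k} f(y_t; \theta_p) \}$.
- $\log \mathcal{L}_{KP}(T, \Theta)$ can be optimized in $T$ with $\Theta$ fixed, by dynamic programming.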
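As an illustration of this step, the sketch below builds the local costs from a fixed Gaussian mixture and feeds them to a generic dynamic-programming recursion. The cost of a candidate segment is minus the log of the mixture density of the whole segment, evaluated with a log-sum-exp. The function names and the toy parameters in the usage note are assumptions, not taken from the slides.

```python
import numpy as np

def mixture_segment_cost(y, m, s2, pi):
    """cost[i, j] = -log( sum_p pi_p * prod_{t in [i, j)} f(y_t; m_p, s2_p) )."""
    y = np.asarray(y, dtype=float)
    m, s2, pi = (np.asarray(a, dtype=float) for a in (m, s2, pi))
    n = len(y)
    # per-point Gaussian log-densities, accumulated along the signal
    logf = -0.5 * (np.log(2 * np.pi * s2) + (y[:, None] - m) ** 2 / s2)
    cum = np.vstack([np.zeros(len(m)), np.cumsum(logf, axis=0)])
    cost = np.full((n + 1, n + 1), np.inf)
    for i in range(n):
        a = np.log(pi) + cum[i + 1:] - cum[i]  # row r is the segment y[i:i+1+r]
        top = a.max(axis=1)
        cost[i, i + 1:] = -(top + np.log(np.exp(a - top[:, None]).sum(axis=1)))
    return cost

def best_segmentation(cost, K):
    """Minimize the sum of local costs over partitions into K segments."""
    n = cost.shape[0] - 1
    best = np.full((K + 1, n + 1), np.inf)
    arg = np.zeros((K + 1, n + 1), dtype=int)
    best[0, 0] = 0.0
    for k in range(1, K + 1):
        for j in range(1, n + 1):
            cand = best[k - 1, :j] + cost[:j, j]
            arg[k, j] = int(np.argmin(cand))
            best[k, j] = cand[arg[k, j]]
    ends, j = [], n
    for k in range(K, 0, -1):  # backtrack the optimal end points
        ends.append(j)
        j = arg[k, j]
    return best[K, n], ends[::-1]
```

On a toy signal with three true segments drawn from two well-separated populations, asking for a fourth segment raises the minimal cost by about -log(pi_p), the price of declaring one more segment from a population of weight pi_p. This matches the observation in the deck that the incomplete log-likelihood can decrease once the true number of segments is exceeded.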

15 A decreasing log-likelihood? Evolution of the incomplete log-likelihood with respect to the number of segments.

16 What is going on? When the true number of segments is reached (6), segments are cut on the edges.

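17 Explaining the behavior of the likelihood
Optimization of the incomplete likelihood with dynamic programming. Hypotheses:
1 We suppose that the true number of segments is $K^*$ and that the partitions are nested for $K \ge K^*$. Segment $Y^k$ is cut into $(Y^k_1, Y^k_2)$: $f(Y^k; \theta_p) = f(Y^k_1; \theta_p) \times f(Y^k_2; \theta_p)$.
2 We suppose that if $Y^k \in p$, then the posterior probabilities of the segment and of its two pieces are approximately equal: $\tau_p(Y^k) \simeq \tau_p(Y^k_1) \simeq \tau_p(Y^k_2)$.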

18 An intrinsic penalty
Under hypotheses 1-2, the log-likelihood is decomposed into two terms:
- A term of fit that increases with $K$ and is constant from a certain $K$ (nested partitions).
- A term of differences of entropies that decreases with $K$ and plays the role of a penalty for the choice of $K$.
Choosing the number of segments $K$ when $P$ is fixed can be done with a penalized likelihood.
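One way to see where a decomposition of this kind can come from is the standard identity for mixture likelihoods, written here with the posterior probabilities $\tau_{kp}$ of segment $k$ belonging to population $p$. This is a sketch under that notation and not necessarily the authors' exact derivation; the labels under the braces are an interpretation.

```latex
\begin{align*}
\tau_{kp} &= \frac{\pi_p f(y^k; \theta_p)}{\sum_{q=1}^{P} \pi_q f(y^k; \theta_q)}, \\
\log \mathcal{L}_{KP}(T, \Theta)
  &= \sum_{k=1}^{K} \log \Big\{ \sum_{p=1}^{P} \pi_p f(y^k; \theta_p) \Big\} \\
  &= \underbrace{\sum_{k=1}^{K} \sum_{p=1}^{P} \tau_{kp} \log f(y^k; \theta_p)}_{\text{term of fit}}
   + \underbrace{\sum_{k=1}^{K} \sum_{p=1}^{P} \tau_{kp} \log \pi_p
   - \sum_{k=1}^{K} \sum_{p=1}^{P} \tau_{kp} \log \tau_{kp}}_{\text{difference of entropies}} .
\end{align*}
```

The second equality holds because $\log \sum_q \pi_q f(y^k; \theta_q) = \log \pi_p + \log f(y^k; \theta_p) - \log \tau_{kp}$ for every $p$, and the $\tau_{kp}$ sum to one over $p$. For each segment, the second group equals $-\sum_p \tau_{kp} \log(\tau_{kp} / \pi_p) \le 0$, and it is close to $\log \pi_p$ when $\tau_{kp} \simeq 1$. Each additional segment therefore contributes a non-positive amount, which is how this part can act as a penalty on $K$.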

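19 Incomplete likelihood behavior with respect to the number of segments
[Figure: incomplete log-likelihood against the number of segments, one curve per number of populations (P=2, P=3, P=4, P=5, ...).]
The incomplete log-likelihood is decreasing from a certain number of segments.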

20 Decomposition of the log-likelihood
[Figure: two panels, the term of fit and the differences of entropies, one curve per number of populations (P=2, P=3, P=4, P=5, ...).]

21 Resulting clusters
[Figure: segmentation/clustering compared with segmentation. x-axis: genomic position; y-axis: log2 ratio.]

22 Resulting clusters
[Figure: segmentation/clustering compared with segmentation. x-axis: genomic position; y-axis: log2 ratio.]

23 Perspective: simultaneous choice of $K$ and $P$
[Figure: incomplete log-likelihood with respect to $K$ and $P$.]

24 This is the end
Conclusions:
- Definition of a new model that takes into account the a priori knowledge we have about the biological phenomena under study.
- Development of a hybrid algorithm (EM/dynamic programming) for parameter estimation (problems linked to EM: initialization, local maxima, degeneracy).
- Still waiting for another data set to assess the performance of the clustering.
Perspectives:
- Modeling: comparison with Hidden Markov Models.
- Model choice: develop an adaptive procedure for two components.
- Other application field: DNA sequences (in progress).
