Robust Detection and Identification of Sparse Segments in Ultra-High Dimensional Data Analysis

Size: px

Start display at page:

Download "Robust Detection and Identification of Sparse Segments in Ultra-High Dimensional Data Analysis"

Chloe Melanie Harvey
5 years ago
Views:

1 Robust Detection and Identification of Sparse Segments in Ultra-High Dimensional Data Analysis Hongzhe Li University of Pennsylvania Perelman School of Medicine Joint work with Jessie Jeng and Tony Cai Robust Segment Identification - 1 of 34-1 p. 1

Genetic variations and complex diseases Commonly observed genetic variations: Single nucleotide variants (SNVs). Small insertions/deletions (InDels).

2 Genetic variations and complex diseases Commonly observed genetic variations: Single nucleotide variants (SNVs). Small insertions/deletions (InDels). Structure variations, including the copy number variations (CNVs). All are associated with risk of complex diseases. Robust Segment Identification - 2 of 34-1 p. 2

disrupt genes, or uncover deleterious alleles.

3 Copy number variants (CNVs) Copy Number Variation Many CNVs have functional consequences: alter gene dosage, disrupt genes, or uncover deleterious alleles. NEJM 2007, 356: Robust Segment Identification - 3 of 34-1 p. 3

4 CNV Associations CNVs and Human Diseases Science 315: (2007) Science 316: (2007) Am J Med Genet 144: (2007) Science: in press (2008) Circulation 115: (2007) 4 Robust Segment Identification - 4 of 34-1 p. 4

5 Data available for CNV Analysis, literature Two types of data can be used for CNV analysis for germline DNA. SNP chip data for GWAS - high dimensional continuous data. Sebat et al. 2004; McCarrol and Altshuler 2007; Wellcome trust (Nature 2010). Next generation sequencing data (NGS) - ultra high dimensional discrete data; Medvedev et al., 2009 (Nat. Meth). Methods available: GWAS SNP data: circular binary segmentation (CBS) (Olshen et al, 2004); HMM based method (PennCNV, Wang et al 2007); scanning-based methods (Zhang and Siegmund 2010); Likelihood ratio selection (Jeng, Cai and Li, 2010). NGS data: Most methods are computational. Robust Segment Identification - 5 of 34-1 p. 5

6 CNV analysis based on SNP Chip data Visualization of CNVs 3 copies BBB ABB AAB AAA 4 copies BBBB ABBB AABB Normal BB AB AA AAAB AAAA 7 Robust Segment Identification - 6 of 34-1 p. 6

7 CNV Analysis and Next Generation Sequence Data Sequence Read Depth Analysis Individual sequence Reads Mapping Reference genome Read depth signal Counting mapped reads Zero level Robust Segment Identification - 7 of 34-1 p. 7

8 CNV Analysis and Next Generation Sequence Data Robust Segment Identification - 8 of 34-1 p. 8

9 CNV Analysis and Next Generation Sequence Data Robust Segment Identification - 9 of 34-1 p. 9

10 Statistical challenges Y i : # of read counts covering location i, i = 1,,n; or Y i : # of read counts in 100bp intervals. n is ultra-high, computational challenge. Y i usually does not follow a normal distribution, outliers. Existing methods do not work well when noise distribution is non-gaussian and hard to be estimated. Cauchy distribution: data, LRS, RSI Location index Location index Location index Robust Segment Identification - 10 of 34-1 p. 10

11 Statistical model for read depth data - one sample For a given individual, observe read counts {Y i,i = 1,...,n} with Y i = µ 1 1 {i I1 } +...+µ q 1 {i Iq } +ξ i, 1 i n. (1) n: length of genome (billions); q = q n : unknown number of the signal segments; I = {I 1,...I q }: disjoint intervals representing signal segments with unknown locations; µ 1,...µ q are unknown means ξ i is symmetric at 0 (median(ξ i )=0) and density function h s.t. h(0) > 0, h(y) h(0) Cy 2 in an open nbhd of 0. We want to (a) (detection) test H 0 : I = against H 1 : I, (b) (identification) if the alternative is true, identify each I j I. Robust Segment Identification - 11 of 34-1 p. 11

12 Methods Assuming Gaussian Noise (Jeng et al., 2010) Methods for detecting the presence of segments assuming Gaussian noise: Arias-Castro, Donoho and Huo (2005) Identification methods assuming Gaussian noise: likelihood ratio selector (LRS) (Jeng, Cai and Li 2010 JASA). Key of the LRS: (1) For any given interval as Ĩ {1,2,...,n}, define its likelihood ratio statistic Y(Ĩ) = i Ĩ Y i / Ĩ. (2) Scan the genome with intervals of length L, threshold σ 2log(nL). (3) Identify local maximums by removing the overlapping intervals. Detection boundaries and optimality results are established. Robust Segment Identification - 12 of 34-1 p. 12

13 Non-normal Data - Local median transformation Equally divide the n observations (e.g., counts at each bp) into T = T n groups with m = m n observations in each group. Define kth interval J k = {i : (k 1)m+1 i km} and take median: X k = median(y i : i J k ), η k = median{ξ i : i J k }, 1 k T. We have θ k X k = θ k +η k, 1 k T, = µ j, J k I j for some I j, [0,µ j ], J k I j for some I j and J k I j, = 0, otherwise. Key point: mη k = 1 2h(0) Z k +ζ k, Z k N(0,1),ζ k D 0 fast η k N(0,1/(4h 2 (0)m)). (Brown, Cai and Zhou: AoS 08). Robust Segment Identification - 13 of 34-1 p. 13

14 Robust Segment Detection (RSD) Segment detection: test H 0 : I = vs H 1 : I. For any interval Ĩ, define X(Ĩ) = k / k ĨX Ĩ, and threshold λ n = 2logn/(2h(0) m). The RSD rejects H 0 when maxĩ JT X(Ĩ) > λ n, where J T is the collection of all possible intervals in {1,...,T}. Note the effect of h(0). Robust Segment Identification - 14 of 34-1 p. 14

15 Robust Segment Detection - Type 1 error and power Under the assumed model and median transformation with m = log 1+b n for some b > 0. Type 1 error: For the collection J T of all the possible intervals in {1,...,T}, P H0 (max Ĩ J T X(Ĩ) > λ n ) C logt 0, T. Power: If there exists some segment I j I that satisfies I j /m and µ j I j 2(1+ǫ)logn/(2h(0)) for some ǫ > 0, then RSD has the sum of the probabilities of type I and type II errors going to 0. Robust Segment Identification - 15 of 34-1 p. 15

16 Robust Segment Identifier (RSI) Perform local median transformation with bin size m, get X k = θ k +η k, 1 k T. Set data-driven threshold at λ n = ˆσ 2logn, ˆσ 2 : estimate of Var(η k )(e.g.,mad) Apply LRS (Jeng, Cai and Li, JASA 2010) on X k : select intervals with their likelihood ratio statistics > λ n and achieve local maximums (removing overlapping intervals). only consider short intervals with length L/m, L: max CNV size. Conditions on m and L: m = log 1+b n, s L < d, b > 0, s =length of the longest segment, d =shortest distance between two adjacent segments. Robust Segment Identification - 16 of 34-1 p. 16

17 Theory - consistency Assume the general condition on the background noise ξ i and some sparsity conditions on the signal segments. Define s = min Ij I I j, s = max Ij I I j, and d = min Ij I{distance between I j and I j+1 }, assume s log 2 n and log s = o(logn) and logq = o(logn). If all I j I satisfies I j /m and µ j I j 2(1+ǫ)logn/(2h(0)) for some ǫ > 0, then the RSI with m = log 1+b n for b > 0 and s L < d, is consistent for I, i.e., for some δ n = o(1), P ( Î > 0)+P H0 H 1 (max min D(Î j,i j ) > δ n ) 0 I j I Î j Î Dissimilarity: D(Î,I) = 1 Î I / Î I Robust Segment Identification - 17 of 34-1 p. 17

18 Theory - Optimality If for all I j I, µ j I j 2(1 ǫ)logn/(2h(0)), then no method constructed on X k with m is consistent. Robust Segment Identification - 18 of 34-1 p. 18

19 Comparison with Gaussian noises Compare to the case with Gaussian noise: Assume ξ i N(0,1), then the original GLRT based on Y i is optimal. Further, if I j I s.t. µ j I j 2(1+ǫ n )logn, then the original GLRT is consistent. Possible price for robustness: 2(1+ǫn )logn/(2h(0)) (1+ǫ n )logn Robust Segment Identification - 19 of 34-1 p. 19

20 Simulation studies n = , I = 3, I 1 = 100, I 2 = 40 and I 3 = 20, the signal mean for all segments at µ = 1.0,1.5, and 2.0. Noise is generated from Cauchy(t(1)), t(3), t(30), median transformation m = 20. Frequency Frequency Y X Robust Segment Identification - 20 of 34-1 p. 20

21 Simulation results - robustness n = , I = 3. Noise is generated from t(1),t(3),t(30). Estimation error for I j : D j = minîk Î{ 1 Ij Îk / I j Îk } [0,1]. Number of over-selections:#o = #{Î Î : Î I j =, j = 1,...,q}.. Medians of D j and #O for RSI with m = 20,L = 120, 100 replications. D 1( I1 =100) D 2( I2 =40) D 3( I3 =20) #O t(1) µ = (0.015) 1.000(0.026) 1.000(0.000) 2(0.33) µ = (0.003) 0.184(0.017) 1.000(0.000) 2(0.26) µ = (0.009) 0.150(0.020) 0.423(0.220) 2(0.14) t(3) µ = (0.005) 1.000(0.270) 1.000(0.000) 0(0.00) µ = (0.009) 0.175(0.029) 1.000(0.000) 0(0.00) µ = (0.008) 0.150(0.016) 0.293(0.019) 0(0.00) t(30) µ = (0.014) 1.000(0.320) 1.000(0.000) 0(0.00) µ = (0.012) 0.175(0.021) 1.000(0.245) 0(0.00) µ = (0.010) (0.019) 0.250(0.028) 0(0.00) Robust Segment Identification - 21 of 34-1 p. 21

22 Simulation results - comparison with LRS and CBS Table 1: Both homogeneous and heterogeneous noises are considered. Homogenous noise is generated from the t-distribution with degrees of freedom 1, 3, and 30. Heterogeneous noise is generated from a mixture of N(0,1) and N(0,σ 2 ), where σ Gamma(2,τ). µ is fixed at 2.0. RSI LRS CBS D 2( I2 =40) #O D 2( I2 =40) #O D 2( I2 =40) #O t(1) 0.163(0.024) 2(0.2) 0.340(0.054) 3882(7) 1.000(0.000) 0(0.0) t(3) 0.125(0.028) 0(0.0) 0.025(0.006) 467(4) 1.000(0.000) 0(0.0) t(30) 0.125(0.018) 0(0.0) 0.000(0.001) 2(0) 0.006(0.006) 0(0.0) τ = (0.015) 2(0.4) 0.013(0.005) 37(3) 0.180(0.006) 4(0.6) τ = (0.022) 12(0.6) 0.000(0.006) 227(6) 1.000(0.010) 10(1.1) τ = (0.016) 26(0.8) 0.000(0.006) 461(11) 1.000(0.000) 8(1.1) Robust Segment Identification - 22 of 34-1 p. 22

23 Simulation results - effect of m Table 2: Effect of bin size m on the performance of RSI. µ is fixed at 2. D 1( I1 =100) D 2( I2 =40) D 3( I3 =20) #O t(1) m = (0.009) 0.10(0.018) 0.184(0.033) 19(0.85) m = (0.009) 0.15(0.020) 0.423(0.220) 2(0.14) m = (0.006) 0.25(0.056) 1.000(0.024) 0(0.00) t(3) m = (0.004) 0.088(0.015) 0.150(0.033) 1(0.22) m = (0.008) 0.150(0.016) 0.293(0.019) 0(0.00) m = (0.006) 0.293(0.041) 1.000(0.250) 0(0.00) t(30) m = (0.007) 0.075(0.008) 0.150(0.018) 0(0.00) m = (0.010) 0.175(0.019) 0.250(0.028) 0(0.00) m = (0.008) 0.293(0.035) 1.000(0.094) 0(0.00) Robust Segment Identification - 23 of 34-1 p. 23

24 1000 Genomes Project - NA19240, Chr 19 NA19240: an International HapMap project Yoruban daughter sample and parents, lymphoblastoid cell lines (Nature 2010). Ave 42x, SOLiD, map to the human reference genome (BAM file), ave 2.36 Gb accessed. n = 54,361,060 read counts for Chr 19. Apply RSI with m = 400, L = 60,000. Identified 106 deletions and 63 duplications. Take less than 3 mins. Compare with the CNV map from 1000 Genomes Project based on 185 samples (Mills et al. 2011), 76 overlap with the reported deletions based on 185 low-coverage samples and three methods ( 438, 332 and 615 CNVs). Robust Segment Identification - 24 of 34-1 p. 24

25 Concordant of the Yoruba trio - top ranked CNVs CNVs are inheritable - concordant rates, ranked by µ Î, apply to 3 samples separately. Concordance % Median NB top20 top50 top100 top200 top300 Robust Segment Identification - 25 of 34-1 p. 25

26 Chr 19 sequence data - NA19240 Depth of overage 0e+00 2e+04 4e+04 6e+04 8e+04 1e+05 Depth of overage Genomic location index Genomic location index Possible error in reference genome: multicopy sequences which have been incorrectly assembled and collapsed into a single copy. Robust Segment Identification - 26 of 34-1 p. 26

27 Chr 19 sequence data - NA19240 Depth of overage Depth of overage Genomic location index Genomic location index Robust Segment Identification - 27 of 34-1 p. 27

28 Chr 19 sequence data - NA19240 Depth of overage Depth of overage Genomic location index Genomic location index Robust Segment Identification - 28 of 34-1 p. 28

29 Alternative Approach -Negative Binomial Counts NB model: Y i Negative Binomial(r,p i ), p i = p 0 + q d j 1 (i Ij ). j=1 µ i = rp i 1 p i,σ 2 i = µ2 i r +µ i. Data transformation - mean-matching variance stabilizing transformation (VST) to turn into Gaussian noise problem: divide the n obs into T = T n groups of m = m n obs, for the kth interval, define X k = 2 ˆrln Y i Jk i +1/4 mˆr 1/ i J k Y i +1/4 mˆr 1/2, 1 k T, Robust Segment Identification - 29 of 34-1 p. 29

30 Transformed Data - normally distributed Model for transformed data: X k = 2ln( θ k + r +θ k )+ǫ k +m 1/2 Z k +ξ k, where = r(p 0 +d j )/(1 p 0 d j ), J k I j for some I j, θ k [rp 0 /(1 p 0 ), r(p 0 +d j )/(1 p 0 d j )], J k I j for some I j and J k I j, = rp 0 /(1 p 0 ), otherwise, ǫ k and ξ k are stochastically small, Z k N(0,1). Apply LRS to X k assuming the baseline distribution as ( ) rp0 r N 2ln( + ),1/m. 1 p 0 1 p 0 Robust Segment Identification - 30 of 34-1 p. 30

31 Concordant of the Yoruba trio - top ranked CNVs CNVs are inheritable - concordant rates, ranked by µ Î, apply to 3 samples separately. Concordance % Median NB top20 top50 top100 top200 top300 Robust Segment Identification - 31 of 34-1 p. 31

32 Comments and extensions Cai, Jeng and Li (2012): JRSS(B), in press. Many other complicated factors: repeated regions, complex rearrangments, highly repetitive elements. There is a relationship between GC content and coverage, but this effect is small for false CNVs. Read depths data: difficulty in finding high repetitive CNVs (LINE, SINE), uncertain in CNV location, but can ba applied to paired-end, single-end and mixed data; Paired-end whole genome sequencing data: statistical modeling of anomalous read pairs, can detect highly repetitive CNVs (LINE and SINE), precise location of CNVs; but span distances have effects on resolution. Robust Segment Identification - 32 of 34-1 p. 32

33 Software -RobustCNV C++ software available, with some post-processing steps: CG content adjustment; refining boundaries; filtering CNVs based on read mapping qualities; plots of the CNVs. Robust Segment Identification - 33 of 34-1 p. 33

34 THANKS! Collaborators: Jessie Jeng - Postdoc Yinghua Wu - Postdoc Tony Cai - Statistics Dept John Maris - CHOP. NIH grants support. Robust Segment Identification - 34 of 34-1 p. 34

Multiple Change-Point Detection and Analysis of Chromosome Copy Number Variations

Multiple Change-Point Detection and Analysis of Chromosome Copy Number Variations Yale School of Public Health Joint work with Ning Hao, Yue S. Niu presented @Tsinghua University Outline 1 The Problem