1 Decomposition of ESG


DiffSplice resolves alternative splicing events in complex gene models through decomposition of the splice graph. Figure 1 shows the hierarchical decomposition of gene VEGFA. In total, 6 ASMs result from the decomposition.

[Figure 1 image: gene model of VEGFA (exonic segments E1 to E16) and its decomposition over levels 1 to 5 into ASM1.1, ASM1.2, ASM2, ASM3.1, ASM3.2 and ASM4.]
Figure 1: Gene model and decomposition of gene VEGFA.

The pseudo-code for the algorithm to decompose an ESG follows. Here d+(u) denotes the out-degree of u and d-(v) the in-degree of v.

Algorithm 1: Find maximal edges in an ESG G (CalculateMaximalEdges(G))
    input : G = ⟨V, E, ts, te, w⟩
    output: E_max
    E_max ← {any edge e ∈ E};
    for all e1 = (u1, v1) ∈ E do
        for all e2 = (u2, v2) ∈ E_max do
            if there is a path from u1 to u2 and a path from v2 to v1 then
                E_max ← (E_max \ {e2}) ∪ {e1};
            end
        end
    end

Algorithm 2: Find the ASMs in an ESG G (Decompose(G, P))
    input : An ESG G = ⟨V, E, ts, te, w⟩, parent P
    output: The set of ASMs A
    Calculate pre-dominators in G;
    Calculate post-dominators in G;
    Candidate_entry ← {u : d+(u) > 1};
    Candidate_exit ← {v : d-(v) > 1};
    for all u ∈ Candidate_entry do
        v ← the immediate post-dominator of u;
        if v ∈ Candidate_exit and u is the immediate pre-dominator of v then
            parent(H(u, v)) ← P;
            A ← A ∪ {H(u, v)};
            E_max ← CalculateMaximalEdges(H(u, v));
            Decompose(H(u, v) \ E_max, H(u, v));
        end
    end
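The containment test at the heart of Algorithm 1 can be sketched in Python. This is only an illustration under stated assumptions, not the DiffSplice implementation: the graph is assumed to be a DAG given as a list of `(u, v)` pairs without duplicate edges, every node is assumed to reach itself, and "maximal" is interpreted as "not contained in any other edge". The names `reachability` and `maximal_edges` are hypothetical.

```python
from collections import defaultdict


def reachability(nodes, edges):
    """Map each node to the set of nodes reachable from it (itself included)."""
    succ = defaultdict(set)
    for u, v in edges:
        succ[u].add(v)
    reach = {}
    for start in nodes:
        seen = {start}
        stack = [start]
        while stack:  # iterative depth-first search
            x = stack.pop()
            for y in succ[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        reach[start] = seen
    return reach


def maximal_edges(nodes, edges):
    """Return the edges that are not contained in any other edge.

    Edge e1 = (u1, v1) contains e2 = (u2, v2) when there is a path from
    u1 to u2 and a path from v2 to v1 (the test used in Algorithm 1).
    """
    reach = reachability(nodes, edges)

    def contains(e1, e2):
        (u1, v1), (u2, v2) = e1, e2
        return e1 != e2 and u2 in reach[u1] and v1 in reach[v2]

    return [e for e in edges if not any(contains(o, e) for o in edges)]


# Hypothetical toy graph: a chain a -> b -> c -> d plus a skipping edge a -> d.
toy_nodes = ["a", "b", "c", "d"]
toy_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "d")]
toy_result = maximal_edges(toy_nodes, toy_edges)
```

In the toy graph the skipping edge spans every other edge, so it is the only one that survives the containment test.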

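2 Abundance estimation in ASM

Consider an ASM with n alternative transcription paths and m features (exonic segments and splice junctions). We define $A_{t,e}$ as an indicator for the presence of feature e in transcription path t, with value 1 if t covers e and 0 otherwise. The indicators for the presence of every exon/junction in each path form an $n \times m$ indicator matrix A.

2.1 Derivation of likelihood function

Let $C_t^e$ denote the coverage on the e-th feature from the t-th path. Under the independence assumption, the likelihood can be factorized as

\[
\begin{aligned}
L(q, N \mid C^1, \dots, C^m)
&= P(C_1^1, \dots, C_n^1, C_1^2, \dots, C_n^2, \dots, C_1^m, \dots, C_n^m \mid q, N) \\
&= \prod_{t=1}^{n} P(C_t^1, C_t^2, \dots, C_t^m) \\
&= \prod_{t=1}^{n} P(C_t^1, C_t^2, \dots, C_t^m \mid N_t)\, P(N_t) \\
&= \prod_{t=1}^{n} \Big[ P(N_t) \prod_{i=1}^{m} P(C_t^i \mid N_t) \Big] \\
&= \prod_{t=1}^{n} \Big[ g(N_t) \prod_{i=1}^{m} f(C_t^i \mid N_t) \Big],
\end{aligned}
\]

where $f(\cdot)$ is the density of $\mathcal{N}\big(C_t,\; r(l_t - l_i)C_t/(l_t l_i)\big)$ for feature i and $g(\cdot)$ is the density of $\mathrm{Poisson}(\lambda_t)$, $\lambda_t = N p_t$.

2.2 Maximum likelihood estimators

The maximum likelihood estimators for q and N are the values that maximize the likelihood,

\[
(\hat{q}, \hat{N}) = \arg\max_{q, N} L(q, N \mid \text{data}).
\]

The log-likelihood is

\[
\begin{aligned}
l(q, N \mid C^1, \dots, C^m)
&= \log L(q, N \mid C^1, \dots, C^m) \\
&= \sum_{t=1}^{n} \Big[ \log g(N_t) + \sum_{i=1}^{m} \log f(C_t^i \mid N_t) \Big] \\
&= \sum_{t=1}^{n} \Big\{ \log \frac{e^{-\lambda_t} \lambda_t^{N_t}}{N_t!}
 + \sum_{i=1}^{m} \log \Big[ \frac{1}{\sqrt{2\pi r (l_t - l_i) C_t / (l_t l_i)}}
 \exp\Big( -\frac{(C_t^i - C_t)^2}{2 r (l_t - l_i) C_t / (l_t l_i)} \Big) \Big] \Big\} \\
&= \sum_{t=1}^{n} \Big\{ -\lambda_t + N_t \log \lambda_t - \log N_t!
 + \sum_{i=1}^{m} \Big[ \tfrac{1}{2} \log l_t + \tfrac{1}{2} \log l_i - \tfrac{1}{2} \log 2\pi - \tfrac{1}{2} \log r \\
&\qquad\qquad - \tfrac{1}{2} \log (l_t - l_i) - \tfrac{1}{2} \log C_t
 - \frac{(C_t^i - C_t)^2}{2 r (l_t - l_i) C_t / (l_t l_i)} \Big] \Big\}.
\end{aligned}
\]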
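The contribution of a single path t to the log-likelihood can be evaluated numerically as follows. This is a minimal sketch under stated assumptions: the path coverage is taken as C_t = N_t r / l_t (the inverse of the relation between the expected read count and the expected coverage used in the EM section), every feature is assumed shorter than the path, and the function name and arguments are hypothetical.

```python
import math


def path_log_likelihood(cov, seg_len, path_len, n_reads, lam, read_len):
    """Log-likelihood term of one path: Poisson part plus normal parts.

    cov      -- observed coverages C_t^i of the features on this path
    seg_len  -- feature lengths l_i (each assumed smaller than path_len)
    path_len -- path length l_t
    n_reads  -- read count N_t on the path
    lam      -- Poisson mean lambda_t
    read_len -- read length r
    """
    # Assumption: path coverage derived from the read count, C_t = N_t * r / l_t.
    c_t = n_reads * read_len / path_len
    # Poisson part: -lambda_t + N_t log(lambda_t) - log(N_t!)
    ll = -lam + n_reads * math.log(lam) - math.lgamma(n_reads + 1)
    for c_i, l_i in zip(cov, seg_len):
        # Variance of the normal density: r (l_t - l_i) C_t / (l_t l_i)
        var = read_len * (path_len - l_i) * c_t / (path_len * l_i)
        ll += -0.5 * math.log(2 * math.pi * var) - (c_i - c_t) ** 2 / (2 * var)
    return ll


# Hypothetical numbers: one feature of length 100 on a path of length 1000.
example_ll = path_log_likelihood([5.0], [100], 1000, 100, 100.0, 50)
```

Summing this quantity over all n paths gives the full log-likelihood of the ASM.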

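2.3 EM algorithm for deriving estimators

The expectation maximization (EM) algorithm to find the maximum likelihood estimators $\hat{q}$ and $\hat{N}$ proceeds as follows.

1. E-step: Denoting the value of $q_t$ at step v as $q_t^{(v)}$, we first calculate the conditional expectation of $C_t$ given $q_t^{(v)}$. Let $C^{(1)}, C^{(2)}, \dots, C^{(m')}$ be the read coverage of the exonic segments that are in path t, i.e., $A_{t,e} = 1$ if $e \in \{(1), (2), \dots, (m')\}$ and $A_{t,e} = 0$ otherwise. Let $\hat{C}_t^e$ denote the expected coverage on exonic segment e from t,

\[
\hat{C}_t^e = \frac{p_e\, q_t^{(v)} A_{t,e}}{\sum_{j=1}^{n} p_e\, q_j^{(v)} A_{j,e}}\; C^e .
\]

Let $k_{t,e}$ denote $r(l_t - l_e)/(l_t l_e)$, so that $C_t^e \sim \mathcal{N}(C_t,\, k_{t,e} C_t)$. Therefore, the conditional expectation of $C_t$ is the maximum likelihood estimator that maximizes the joint density of the $m'$ normal densities,

\[
\hat{C}_t = E_{q_t^{(v)}}\big[ C_t \mid C^{(1)}, C^{(2)}, \dots, C^{(m')} \big]
= \frac{-m' + \sqrt{m'^2 + 4 \sum_{i=1}^{m'} k_{t,(i)}^{-1} \sum_{i=1}^{m'} k_{t,(i)}^{-1} \big(\hat{C}_t^{(i)}\big)^2}}{2 \sum_{i=1}^{m'} k_{t,(i)}^{-1}} .
\]

The expected number of reads on path t is hence calculated as $\hat{N}_t = \hat{C}_t l_t / r$.

2. M-step: We then derive the parameters that maximize the likelihood conditional on $\hat{N}_t$.

Set $\partial L / \partial N$ to 0:

\[
\sum_{t=1}^{n} \hat{N}_t - \hat{N} = 0
\quad\Longrightarrow\quad
\hat{N} = \sum_{t=1}^{n} \hat{N}_t .
\]

Set $\partial L / \partial q_t$ to 0:

\[
\sum_{t=1}^{n} \Big( -\frac{d\lambda_t}{d\hat{q}_t} + N_t \frac{1}{\lambda_t} \frac{d\lambda_t}{d\hat{q}_t} \Big) = 0
\quad\Longrightarrow\quad
\sum_{t=1}^{n} \Big( \frac{N_t}{\lambda_t} - 1 \Big) \frac{d\lambda_t}{d\hat{q}_t} = 0 ,
\]

which gives the update

\[
\hat{q}_t^{(v)} = \frac{\hat{N}_t \sum_{j=1, j \neq t}^{n} \hat{q}_j^{(v-1)} \big( \sum_{i=1}^{m} p_i A_{j,i} \big)}{(\hat{N} - \hat{N}_t) \big( \sum_{i=1}^{m} p_i A_{t,i} \big)} .
\]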
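The two closed-form updates of the EM iteration can be sketched as plain Python functions. This is an illustration under stated assumptions rather than the DiffSplice code: the E-step uses the positive root of the quadratic obtained by maximizing the joint normal density over C_t, the inputs are hypothetical precomputed lists (`k_inv` holds the values 1/k for the features on the path, `weight[t]` holds the sum over features of p_i times the indicator for path t), and at least two paths are assumed to carry reads so that the M-step denominator is nonzero.

```python
import math


def e_step_coverage(k_inv, c_hat):
    """Maximizer of the joint normal density over the path coverage C_t.

    Solves S*mu^2 + m'*mu - T = 0 for the positive root, where
    S = sum(1/k_i), T = sum(c_i^2 / k_i) and m' = number of features.
    """
    m_prime = len(c_hat)
    s = sum(k_inv)
    t = sum(w * c * c for w, c in zip(k_inv, c_hat))
    return (-m_prime + math.sqrt(m_prime ** 2 + 4.0 * s * t)) / (2.0 * s)


def expected_reads(c_t, path_len, read_len):
    """Expected read count on a path: N_t = C_t * l_t / r."""
    return c_t * path_len / read_len


def m_step_q(n_hat, q_prev, weight):
    """Update every q_t from the expected read counts of the E-step."""
    total = sum(n_hat)  # N_hat = sum of N_hat_t
    q_new = []
    for t in range(len(n_hat)):
        rest = sum(q_prev[j] * weight[j] for j in range(len(n_hat)) if j != t)
        # Assumes total - n_hat[t] > 0, i.e. another path also has reads.
        q_new.append(n_hat[t] * rest / ((total - n_hat[t]) * weight[t]))
    return q_new


# Hypothetical inputs for a two-path ASM.
mu = e_step_coverage([1.0, 1.0], [2.0, 2.0])
reads = expected_reads(2.0, 1000, 50)
q_next = m_step_q([30.0, 10.0], [0.5, 0.5], [1.0, 1.0])
```

A full EM loop would alternate these two steps until the values of q stop changing; the stopping rule is not specified in the text and is left out here.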

3 Statistical test for differential transcription

Jensen-Shannon divergence. Let $p = (p_1, \dots, p_t)^T$ and $q = (q_1, \dots, q_t)^T$ be two t-dimensional distributions. The Jensen-Shannon divergence (JSD) is calculated as

\[
\mathrm{JSD}(p \,\|\, q) = \big( \mathrm{KLD}(p \,\|\, \mu) + \mathrm{KLD}(q \,\|\, \mu) \big) / 2 ,
\]

where $\mathrm{KLD}(p \,\|\, q) = \sum_{j=1}^{t} p_j \log \frac{p_j}{q_j}$ and $\mu = (p + q)/2$.
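The JSD definition translates directly into a few lines of Python. The sketch below assumes the natural logarithm (the text does not state the base) and uses the usual convention that terms with p_j = 0 contribute zero; the function names are hypothetical.

```python
import math


def kld(p, q):
    """Kullback-Leibler divergence sum_j p_j * log(p_j / q_j).

    Terms with p_j == 0 are skipped (convention 0 * log 0 = 0).
    """
    return sum(pj * math.log(pj / qj) for pj, qj in zip(p, q) if pj > 0)


def jsd(p, q):
    """Jensen-Shannon divergence with mu = (p + q) / 2."""
    mu = [(pj + qj) / 2 for pj, qj in zip(p, q)]
    return (kld(p, mu) + kld(q, mu)) / 2


# Hypothetical path distributions of one ASM in two samples.
same = jsd([0.3, 0.7], [0.3, 0.7])
disjoint = jsd([1.0, 0.0], [0.0, 1.0])
```

With the natural logarithm the divergence ranges from 0 for identical distributions to log 2 for distributions with disjoint support. Because mu is positive wherever p or q is positive, the division inside `kld` is always defined when it is called from `jsd`.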

4 Biological meanings and applications of ASM

Here we give three examples to demonstrate that the investigation of ASMs may reveal functional sequences. The first two examples (ERBB4 and VEGFA) show significant sequences residing in single ASMs, while the third example (CD44) shows an isoform transition associated with multiple ASMs.

In Figure 2 we plot the ASM in gene ERBB4. ASM1 indicates an exon skipping event that alternatively includes or excludes exon E3. The skipping path (p2), which corresponds to the CYT-2 isoform of ERBB4, deletes a WW binding motif, leading to increased cell proliferation [3].

[Figure 2 image: splice graph of ERBB4 (exonic segments E1 to E28) and its decomposition over levels 1 and 2, with ASM1 spanning E2, E3 and E4.]
Figure 2: The splice graph and the ASM decomposition of gene ERBB4.

We take gene VEGFA as another example, which has 6 ASMs with a complex nesting structure. Bainbridge et al. identified a 7-amino-acid peptide, RKRKKSR, encoded by exon E10 [1]. This peptide could inhibit VEGF receptor binding and angiogenesis in vitro. Figure 1 shows the ASMs in gene VEGFA. ASM3.1 captures the alternative inclusion/exclusion of E10. Thus, this ASM shows that some isoforms of VEGFA lack this important peptide sequence.

Lastly, we look at two isoforms of gene CD44, CD44s and CD44v. Isoform CD44s includes exons E1-E5, E14-E17 and E18, and CD44v includes exons E1-E5, E6-E13, E14-E17 and E18 (Figure 3). Brown et al. suggested that a shift in CD44 expression from the variant isoforms (CD44v) to the standard isoform (CD44s) is essential in epithelial cell development and is associated with breast cancer progression [2]. The alternative exons by which CD44s and CD44v differ, E6-E13, are captured by three ASMs (ASM4, ASM5 and ASM6), where CD44s takes path p1 in ASM4 and CD44v takes path p2 in all of ASM4, ASM5 and ASM6. Therefore, the joint analysis of all three ASMs is essential for studying the isoform transition in this gene.

[Figure 3 image: splice graph of CD44 (exonic segments E1 to E19) and its decomposition over levels 1 to 7 into ASM1.1, ASM1.2, ASM2, ASM3, ASM4, ASM5 and ASM6.]
Figure 3: The splice graph and the ASM decomposition of gene CD44.

References

[1] James Bainbridge, Haiyan Jia, Azadeh Bagherzadeh, David Selwood, Robin Ali, and Ian Zachary. A peptide encoded by exon 6 of VEGF (EG3306) inhibits VEGF-induced angiogenesis in vitro and ischaemic retinal neovascularisation in vivo. Biochemical and Biophysical Research Communications, 302(4):793-9.

[2] Rhonda Brown, Lauren Reinke, Marin Damerow, Denise Perez, Lewis Chodosh, Jing Yang, and Chonghui Cheng. CD44 splice isoform switching in human and mouse epithelium is essential for epithelial-mesenchymal transition and breast cancer progression. The Journal of Clinical Investigation, 121(3).

[3] Rebecca Muraoka-Cook, Melissa Sandahl, Karen Strunk, Leah Miraglia, Carty Husted, Debra Hunter, Klaus Elenius, Lewis Chodosh, and H. Shelton Earp. ErbB4 splice variants Cyt1 and Cyt2 differ by 16 amino acids and exert opposing effects on the mammary epithelium in vivo. Molecular and Cellular Biology, 29(18).

5 Simulation datasets

5.1 Gene VEGFA

We simulated 100 runs of experiments on this gene. In each run, 2 sets of RNA-seq reads were generated from 2 independently created transcript expression profiles. Every set had 50K 50bp single-end reads. In Figure 4a, every dot represents an ASM in one run. For all ASMs the divergence estimated by DiffSplice is very close to the profile divergence, with a Pearson correlation as high as [...]. This precision in quantifying sample-to-sample divergence results from the accuracy of the path abundance estimation. Figure 4b plots the distribution of the MSE between path distributions for every ASM. All 6 ASMs have the majority of their MSE below [...], with mean close to 0 and small variance, showing the accuracy of the abundance estimator developed in DiffSplice.

[Figure 4 image: (a) Sqrt of Profile JSD vs. Sqrt of DiffSplice JSD; (b) MSE between Path Distribution per ASM ID.]
Figure 4: Evaluation of DiffSplice on the simulated dataset of gene VEGFA. (a) Comparison between the difference calculated from the sampling profile and the difference estimated by DiffSplice, measured by the square root of JSD. The Pearson correlation is [...]. (b) The mean squared error (MSE) between the sampling profile and the estimated alternative path distribution, averaged between the two samples. The abundance estimation procedure of DiffSplice has very low error on all 6 ASMs.

5.2 Human transcriptome

Following the UCSC human hg19 gene annotation, two sets of RNA-seq reads were generated by sampling from the whole human transcriptome with different transcript expression profiles. Each dataset consisted of 50M 50bp single-end reads. Genes with an average read coverage per base greater than 10 were selected to compare the difference given by the profile with the difference derived by DiffSplice. The majority of the points stay close to the diagonal, where the DiffSplice JSD and the profile JSD are equal, resulting in a correlation of [...] (Figure 5a). The variance of the difference between the DiffSplice JSD and the profile JSD is larger for ASMs with similar profiles in the two samples (i.e. ASMs with low profile JSD) and decreases as the divergence between the profiles grows. This observation follows from the nonlinearity of the JSD: compared to the Euclidean distance, the JSD gives a larger value for small differences and a smaller value for greater differences. Because of the randomness in the read sampling procedure, the sampled reads may deviate from the profile expression. Therefore the differences measured by JSD may be slightly inflated when the difference is low. However, the MSE in path abundance estimation still mainly stays below 0.01 (Figure 5b). As coverage increases, the deviation between the estimated path distribution and the profile distribution converges to 0 and the variance also tends to decrease, consistent with an unbiased and asymptotically efficient abundance estimator.

[Figure 5 image: (a) Sqrt of Profile JSD vs. Sqrt of DiffSplice JSD; (b) MSE between Path Distribution per Expression Level group.]
Figure 5: Evaluation of DiffSplice on the simulated dataset of the human transcriptome. (a) Comparison between the difference calculated from the sampling profile and the difference estimated by DiffSplice, measured by the square root of JSD. The Pearson correlation is [...]. (b) The mean squared error (MSE) between the sampling profile and the estimated alternative path distribution, averaged between the two samples. ASMs are separated into 10 quantile groups according to their expression level. ASMs with higher expression have less estimation error.
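The error measure reported in panels (b) of Figures 4 and 5 can be sketched as follows. The text only says that the MSE between the sampling profile and the estimated path distribution is averaged between the two samples, so the normalization used here (mean over the paths of one ASM) is an assumption, and the function names and numbers are hypothetical.

```python
def mse(profile, estimate):
    """Mean squared error between two path distributions of one ASM."""
    return sum((a - b) ** 2 for a, b in zip(profile, estimate)) / len(profile)


def mean_mse_two_samples(profile1, estimate1, profile2, estimate2):
    """Average the per-sample MSE over the two simulated samples."""
    return (mse(profile1, estimate1) + mse(profile2, estimate2)) / 2


# Hypothetical two-path ASM: sample 1 is slightly off, sample 2 is exact.
score = mean_mse_two_samples([0.5, 0.5], [0.6, 0.4], [0.2, 0.8], [0.2, 0.8])
```

For these made-up inputs the per-sample errors are about 0.01 and 0, so the averaged score is about 0.005.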

6 Real datasets

6.1 qRT-PCR validation

RNA was isolated from the cell lines using the standard Trizol protocol (Invitrogen, Inc.). RNA was reverse transcribed into cDNA using an iScript cDNA synthesis kit exactly according to the manufacturer's instructions (Bio-Rad, Hercules, CA). Expression of the target genes TMC5 and LMO7, and of TBP as a normalizing control, was measured by real-time PCR using 20 ng of template cDNA, forward and reverse primers at a final concentration of 500 nM each, and SsoFast EvaGreen Supermix with low ROX (Bio-Rad, Hercules, CA). The total reaction volume was 20 µL. Reactions were run on an Applied Biosystems 7500HT thermocycler under the following conditions: denaturation at 95 °C for 30 seconds, followed by 40 cycles of denaturation at 95 °C for 5 seconds and annealing/extension at 60 °C for 30 seconds. Relative expression levels were calculated by the delta-delta Ct method.

[Figure 6 image: relative splice variant expression at Day 3 and Day 35 (fold change relative to TBP) for the inclusion (IN) and exclusion (EX) variants of TMC5 and LMO7; bars D3-IN, D3-EX, D35-IN, D35-EX per gene.]
Figure 6: Relative splice variant expression at day 3 and day 35 from the PCR validation.
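The delta-delta Ct calculation can be sketched as follows. This assumes the common 2^(-ddCt) form with roughly 100% amplification efficiency, which the text does not state explicitly; the function name and all Ct values below are hypothetical and are not measurements from this study.

```python
def delta_delta_ct_fold_change(ct_target_sample, ct_ref_sample,
                               ct_target_calibrator, ct_ref_calibrator):
    """Relative expression by the delta-delta Ct method.

    Each Ct of the target is first normalized to the reference gene
    (TBP in this study), then the sample is compared to the calibrator.
    Assumes the amplicon doubles every cycle (efficiency of about 100%).
    """
    delta_sample = ct_target_sample - ct_ref_sample
    delta_calibrator = ct_target_calibrator - ct_ref_calibrator
    delta_delta = delta_sample - delta_calibrator
    return 2.0 ** (-delta_delta)


# Hypothetical Ct values: the target appears two cycles earlier in the
# sample than in the calibrator, at equal reference Ct.
fold = delta_delta_ct_fold_change(24.0, 20.0, 26.0, 20.0)
```

With these made-up values the delta-delta Ct is -2, which corresponds to a four-fold higher relative expression in the sample.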

6.2 Lung differentiation dataset

[Figure 7 image: genome browser view on chr13 with coverage tracks for Day 3 Replicates 1-3 and Day 35 Replicates 1-3, the DiffSplice splice graph of LMO7 (ASM1 to ASM4, two paths each) and RefSeq Genes.]
Figure 7: Exon skipping event identified by DiffSplice in gene LMO7. The skipping variant (ASM2.path1) had significantly higher relative abundance at day 35 (78%) than at day 3 (28%), consistent with the result of the qRT-PCR experiment.

[Figure 8 image: genome browser view on chr10 with coverage tracks for Day 3 Replicates 1-3 and Day 35 Replicates 1-3, the DiffSplice splice graph of HNRNPF (ASM1, paths 1 to 6), Cufflinks transcripts and RefSeq Genes.]
Figure 8: Alternative transcription start sites identified by DiffSplice in gene HNRNPF. DiffSplice correctly reconstructed all 6 alternative transcription start sites in the RefSeq annotation and called the differential transcription in this event a significant change. The alternative path ASM1.path4 (corresponding to the 5th transcript in the RefSeq annotation) had significantly higher expression at day [...]

6.3 Breast cancer dataset

Figure 9: The Venn diagram of the differentially transcribed genes called by DiffSplice and FDM on the breast cancer dataset. The number of shared genes is 955, which is 38.1% of the DiffSplice result and 45.7% of the FDM result.

[Figure 10 image: genome browser view on chr7 (hg19) with coverage tracks for four MCF7 samples (MCF7_SM6_HS, MCF7_SM4_HS, MCF7_11_HS, MCF7_5_HS) and four SUM102 samples (SUM102_12_HS, SUM102_10_HS, SUM102_SM6_HS, SUM102_SM7_HS), the DiffSplice splice graph of ESYT2 (ASM1, two paths) and RefSeq Genes.]
Figure 10: Exon skipping event identified by DiffSplice but not by FDM in gene ESYT2. The skipping variant (ASM1.path1) had significantly higher relative abundance in the SUM102 group than in the MCF7 group.

[Figure 11 image: genome browser view on chr10 (hg19) with coverage tracks for four MCF7 samples and four SUM102 samples, the DiffSplice splice graph of MYOF (ASM1 to ASM3, two paths each) and RefSeq Genes.]
Figure 11: Exon skipping event identified by DiffSplice but not by FDM in gene MYOF. Three ASMs were found in this gene. The skipping variant (ASM3.path1) in ASM3 had significantly higher relative abundance in the MCF7 group than in the SUM102 group.

[Figure 12 image: genome browser view on chr9 (hg19) with coverage tracks for four MCF7 samples and four SUM102 samples, the DiffSplice splice graph of RGS3 (ASM1 with 4 paths, ASM2 with 2 paths, ASM3 with 3 paths) and RefSeq Genes.]
Figure 12: Alternative transcription start sites identified by DiffSplice but not by FDM in gene RGS3. In the MCF7 group, the earliest start site (ASM1.path1) was expressed but the second start site (ASM1.path2) was barely expressed. In the SUM102 group, the earliest start site was barely expressed but the second start site was expressed.

[Figure 13 image: genome browser view on chr1 (hg19) with coverage tracks for four MCF7 samples and four SUM102 samples, the DiffSplice splice graph of SHC1 (ASM1 to ASM3 with 2 paths each, ASM4 with 3 paths) and RefSeq Genes.]
Figure 13: Alternative transcription start sites identified by DiffSplice but not by FDM in gene SHC1. In the MCF7 group, the start site ASM4.path1 had low expression but the start site ASM4.path2 was highly expressed, compared to the overall gene expression level. In the SUM102 group, expression switched from ASM4.path2 to ASM4.path1.
