Supplementary Information for Discovery and characterization of indel and point mutations


Supplementary Information for Discovery and characterization of indel and point mutations using DeNovoGear

Avinash Ramu 1, Michiel J. Noordam 1, Rachel S. Schwartz 2, Arthur Wuster 3, Matthew E. Hurles 3, Reed A. Cartwright 2,4, Donald F. Conrad 1,5

1 Department of Genetics, 5 Department of Pathology and Immunology, Washington University School of Medicine, St. Louis, MO 63110, USA
2 Center for Evolutionary Medicine and Informatics, The Biodesign Institute, 4 School of Life Sciences, Arizona State University, Tempe, AZ, USA
3 Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1SA, UK

Supplementary Figures

Supplementary Figure 1

[Figure: panels A and B, with axes Sensitivity and FDR and one series per Prior (legend values 1e-12, 1e-10, 1e-08, 1e-06, 1e-…).]

Effect of different mutation rate prior values on de novo SNP calling on the Whole Genome Sequencing dataset. Sensitivity and False Discovery Rates for each mutation prior were calculated for (a) germline and (b) somatic SNPs using the validated 1000 Genomes call-set as reference. Germline sensitivity was defined as the number of true positive germline mutations that were called by DeNovoGear, whereas somatic sensitivity was defined as the number of true positive somatic mutations that were called. The False Discovery Rate was defined as the proportion of calls validated as false positive to the total number of calls that were validated.

Supplementary Figure 2

[Figure: panels A and B, with axes Sensitivity and FDR and one series per Prior (legend values 1e-12, 1e-10, 1e-08, 1e-06, 1e-…).]

Effect of different mutation rate prior values on de novo SNP calling on the Whole Exome Sequencing dataset. Sensitivity and False Discovery Rates for each mutation prior were calculated for (a) germline and (b) somatic SNPs using the validated 1000 Genomes call-set as reference. Germline sensitivity was defined as the number of true positive germline mutations that were called by DeNovoGear, whereas somatic sensitivity was defined as the number of true positive somatic mutations that were called. The False Discovery Rate was defined as the proportion of calls validated as false positive to the total number of calls that were validated.
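The effect of the mutation rate prior shown in Supplementary Figures 1 and 2 can be illustrated with a deliberately simplified two-hypothesis calculation. This is a sketch under stated assumptions, not DeNovoGear's actual model: the function name `dnm_posterior` and the two likelihood values are hypothetical, and the real method evaluates trio genotype configurations rather than two lumped hypotheses.

```python
def dnm_posterior(prior, lik_denovo, lik_inherited):
    """Posterior probability of a de novo mutation under a two-hypothesis model.

    prior         -- assumed prior probability of a de novo mutation at the site
    lik_denovo    -- likelihood of the read data if the site is de novo (hypothetical)
    lik_inherited -- likelihood of the read data otherwise (hypothetical)
    """
    numerator = prior * lik_denovo
    return numerator / (numerator + (1.0 - prior) * lik_inherited)


# Same read evidence, evaluated under increasingly permissive priors.
priors = [1e-12, 1e-10, 1e-8, 1e-6, 1e-4]
posteriors = [dnm_posterior(p, 1e-3, 1e-9) for p in priors]

for p, post in zip(priors, posteriors):
    print(f"prior={p:.0e}  posterior={post:.6f}")
```

With the evidence held fixed, a larger prior yields a larger posterior, so more sites clear any given posterior probability cutoff. That is the mechanism by which both sensitivity and the number of false positive calls can rise together.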

Supplementary Figure 3

[Figure: panels A and B, with axes Sensitivity and FDR and one series per Caller (Denovogear, GATK, NaiveCaller, Polymutt, Samtools).]

Comparison of the results from several de novo SNP callers on the Whole Genome Sequencing dataset. Sensitivity and False Discovery Rates for each caller were calculated for (a) germline and (b) somatic SNPs using the validated 1000 Genomes call-set as reference. Germline sensitivity was defined as the number of true positive germline mutations that were called by each caller, whereas somatic sensitivity was defined as the number of true positive somatic mutations that were called. The False Discovery Rate was defined as the proportion of calls validated as false positive to the total number of calls that were validated.

Supplementary Figure 4

[Figure: panels A and B, with axes Sensitivity and FDR and one series per Caller (Denovogear, Denovogear BB, GATK, NaiveCaller, Polymutt, Samtools).]

Comparison of the results from several de novo SNP callers on the Whole Exome Sequencing dataset. Sensitivity and False Discovery Rates for each caller were calculated for (a) germline and (b) somatic SNPs using the validated 1000 Genomes call-set as reference. Germline sensitivity was defined as the number of true positive germline mutations that were called by each caller, whereas somatic sensitivity was defined as the number of true positive somatic mutations that were called. The False Discovery Rate was defined as the proportion of calls validated as false positive to the total number of calls that were validated.
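The two metrics used throughout Supplementary Figures 1 to 4 can be written out as a short sketch. The function names are hypothetical; sensitivity is computed here as a proportion of the validated true positives, and the FDR follows the definition in the legends (calls validated as false positive divided by all calls that were validated). The sensitivity example reuses the 879/939 figure quoted in the Supplementary Note; the FDR inputs are made-up numbers for illustration only.

```python
def sensitivity(true_positives_called, true_positives_total):
    """Fraction of validated true mutations that the caller recovered."""
    if true_positives_total <= 0:
        raise ValueError("need at least one validated true positive")
    return true_positives_called / true_positives_total


def false_discovery_rate(false_positives_called, validated_calls):
    """Fraction of validated calls that turned out to be false positives."""
    if validated_calls <= 0:
        raise ValueError("need at least one validated call")
    return false_positives_called / validated_calls


# 879 of 939 validated DNMs detected, as quoted in the Supplementary Note.
print(f"sensitivity = {sensitivity(879, 939):.2f}")

# Hypothetical counts, for illustration only.
print(f"FDR = {false_discovery_rate(5, 20):.2f}")
```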

Supplementary Figure 5

[Figure: three panels plotting Count against Length of Indel, Length of Insertion and Length of Deletion.]

The frequency distribution of indel lengths in the 1000 Genomes phase 1 dataset. The counts of (a) all indels of different lengths are plotted, as well as the separate counts for (b) insertions and (c) deletions.

Supplementary Figure 6

[Figure: panels A and B plot log(µ) against Length of Insertion and Length of Deletion, each showing the Empirical values and the Log linear fit.]

A log-linear model is used to estimate the mutation rate given the length of an indel. Separate models were estimated for (a) insertions and (b) deletions. The fitted values of the model are plotted against the logarithm of the mutation rate. Smaller indels have a higher mutation rate than larger ones.
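A log-linear model of this kind treats log(µ) as a straight line in indel length. The sketch below fits such a line by ordinary least squares; the fitting method, the function names and the synthetic rates are all assumptions made for illustration, since the legend does not state how the model was estimated.

```python
import math


def fit_log_linear(lengths, rates):
    """Least-squares fit of log(rate) = intercept + slope * length."""
    logs = [math.log(r) for r in rates]
    n = len(lengths)
    mean_x = sum(lengths) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in lengths)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lengths, logs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predicted_rate(length, intercept, slope):
    """Mutation rate implied by the fitted line for an indel of this length."""
    return math.exp(intercept + slope * length)


# Synthetic rates generated from known coefficients, purely for illustration.
lengths = [1, 2, 3, 4, 5, 6]
rates = [math.exp(-20.0 - 0.5 * length) for length in lengths]

intercept, slope = fit_log_linear(lengths, rates)
print(intercept, slope)
print(predicted_rate(1, intercept, slope) > predicted_rate(5, intercept, slope))
```

A negative slope corresponds to the pattern described in the legend, where shorter indels have the higher rate.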

Supplementary Figure 7

De novo indel artifact 1. In this class of artifact, an indel is clearly present in one of the parents but has been placed with a slightly different alignment than in the child. This class can be avoided by filtering out candidate DNMs that overlap an indel call in the parent.

Supplementary Figure 8

De novo indel artifact 2. In this class of artifact, an indel is clearly present in both parents with the same breakpoints, yet has not been called in either. This undercalling in the parents could be addressed either by using an alternative likelihood function for indel genotypes, as described in the main text, or by filtering sites where some indel reads are observed in the parents.

Supplementary Figure 9

De novo indel artifact 3. In this class of artifact, a high frequency of reads with a non-reference allele is seen in both parents. Many of these alternate base calls occupy the first or last position in the read, suggesting that both the insertion in the child and the SNPs in the parents may be alignment artifacts, possibly caused by the presence of a large structural variant with a breakpoint in this region. This class of variant could be avoided by filtering sites with a high frequency of non-reference reads in one or both parents. The filter could be applied strictly to reads spanning the indel call, or within a small window around the indel (say plus or minus 10 bp).
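The filters suggested for the three artifact classes in Supplementary Figures 7 to 9 can be combined into one post-calling check, sketched below. Everything named here is hypothetical: the `Indel` record, the `keep_candidate` function, the choice to pad the parental-call overlap test by the same 10 bp window, and the 5% non-reference read threshold are illustrative assumptions, not DeNovoGear options. The per-parent non-reference fractions are assumed to have been computed beforehand over the chosen window.

```python
from dataclasses import dataclass


@dataclass
class Indel:
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def overlaps(a, b, window=0):
    """True if b lies within `window` bp of a on the same chromosome."""
    return (a.chrom == b.chrom
            and a.start - window <= b.end
            and b.start <= a.end + window)


def keep_candidate(candidate, parent_indels, parent_nonref_fractions,
                   window=10, max_nonref=0.05):
    """Return False if the candidate matches one of the artifact patterns.

    parent_indels           -- indel calls made in either parent
    parent_nonref_fractions -- per-parent fraction of non-reference reads
                               near the candidate (precomputed, assumed input)
    """
    # Artifact 1: the indel was called in a parent with a shifted alignment.
    if any(overlaps(candidate, p, window) for p in parent_indels):
        return False
    # Artifacts 2 and 3: parents carry non-reference reads but no call.
    if any(f > max_nonref for f in parent_nonref_fractions):
        return False
    return True


candidate = Indel("3", 100, 102)
print(keep_candidate(candidate, [], [0.0, 0.0]))
print(keep_candidate(candidate, [Indel("3", 105, 107)], [0.0, 0.0]))
print(keep_candidate(candidate, [], [0.0, 0.3]))
```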

Supplementary Figure 10

[Figure: panel A shows an ideogram of chromosome 3; panel B shows the forward strand sequence and reverse strand sequence beneath a coordinate ruler ending at 102,153,406.]

Example of a de novo indel called by DNG and confirmed by Sanger sequencing. A) The location of the indel is indicated by a red box. B) At this specific location, the child has a 3-base-pair deletion, namely a deletion of the bases T, T and A (see left box), which results in a double sequence read on the forward strand (middle box) and the reverse strand (right box), both starting at the site of this indel. C) The parents do not have an indel at this position and consequently have single reads on both strands. We confirmed 53/56 (95%) de novo indel predictions in this family by Sanger sequencing (summary statistics of the predicted de novo indels are in Supplementary Table 7, a full list of de novo indels in Supplementary Table 9, and validation plots for all calls in Supplementary Figure 11).

Supplementary Figure 11: Validation Plots

The following pages depict (a) IGV screenshots of next-generation sequencing data and (b) Sanger sequencing traces for all candidate de novo indels for which we attempted validation. See the legend of Supplementary Figure 10 and the main text for additional details. Within the True Positive class, the figures are ordered by chromosome and position of the de novo indel. All coordinates are with respect to NCBI37.

[Figure pages: one validation plot per candidate site. Each page shows a chromosome ideogram and, in panel B, a coordinate ruler above the sequence tracks. Only the chromosome, the ruler coordinates and any site label are listed below; where the first ruler labels are incomplete, only the final label is given.]

Chromosome 17; ruler 28,388,993 / 28,389,003 / 28,389,012. E7: a site for which primers could not be designed.
Chromosome 15; ruler 74,185,293 / 74,185,303 / 74,185,312. A7: a false positive site.
Chromosome 17; ruler 2,120,331 / 2,120,341 / 2,120,350. B7: a false positive site.
Chromosome 2; ruler 69,058,980 / 69,058,990 / 69,058,999. G1: a false positive site.
Chromosome 1; ruler 36,060,969 / 36,060,979 / 36,060,988.
Chromosome 1; ruler 62,752,481 / 62,752,491 / 62,752,500.
Chromosome 1; ruler 79,985,409 / 79,985,409 / 79,985,418.
Chromosome 1; ruler 85,896,648 / 85,896,658 / 85,896,667.
Chromosome 1; ruler ends at 215,772,765.
Chromosome 2; ruler 56,479,693 / 56,479,703 / 56,479,712.
Chromosome 2; ruler 69,058,980 / 69,058,990 / 69,058,999.
Chromosome 2; ruler 83,119,261 / 83,119,271 / 83,119,280.
Chromosome 2; ruler ends at 124,110,179.
Chromosome 2; ruler 128,809,725 / 128,809,735 / 128,809,744.
Chromosome 2; ruler ends at 140,541,050.
Chromosome 2; ruler 197,724,617 / 197,724,627 / 197,724,636.
Chromosome 2; ruler ends at 206,261,601.
Chromosome 3; ruler ends at 102,153,406.
Chromosome 3; ruler ends at 119,913,685.
Chromosome 3; ruler ends at 119,913,685.
Chromosome 3; ruler ends at 146,628,651.
Chromosome 4; ruler 28,194,765 / 28,194,775 / 28,194,784.
Chromosome 4; ruler ends at 115,663,969.
Chromosome 4; ruler labels begin 134,749 and end at 134,745,002 (as printed).
Chromosome 4; ruler ends at 176,373,080.
Chromosome 5; ruler 8,675,861 / 8,675,871 / 8,675,880.
Chromosome 5; ruler 27,924,741 / 27,924,751 / 27,924,760.
Chromosome 5; ruler 32,884,887 / 32,884,897 / 32,884,906.
Chromosome 5; ruler 53,623,826 / 53,623,836 / 53,623,845.
Chromosome 5; ruler 64,122,061 / 64,122,071 / 64,122,080.
Chromosome 5; ruler 74,866,858 / 74,866,868 / 74,866,877.
Chromosome 5; ruler ends at 105,980,810.
Chromosome 5; ruler ends at 152,139,425.
Chromosome 6; ruler 48,157,882 / 48,157,892 / 48,157,901.
Chromosome 7; ruler 78,999,365 / 78,999,375 / 78,999,384.
Chromosome 7; ruler 80,119,047 / 80,119,057 / 80,119,066.
Chromosome 8; ruler 34,242,517 / 34,242,527 / 34,242,536.
Chromosome 8; ruler 27,870,072 / 27,870,082 / 27,870,091.
Chromosome 8; ruler 34,242,517 / 34,242,527 / 34,242,536.
Chromosome 8; ruler 51,457,368 / 51,457,378 / 51,457,387.
Chromosome 8; ruler 57,613,105 / 57,613,115 / 57,613,124.
Chromosome 10; ruler 78,698,792 / 78,698,802 / 78,698,811.
Chromosome 10; ruler 82,162,004 / 82,162,104 / 82,162,113 (as printed).
Chromosome 11; ruler 39,326,082 / 39,326,092 / 39,326,101.
Chromosome 11; ruler 44,762,096 / 44,762,106 / 44,762,115.
Chromosome 11; ruler 44,762,096 / 44,762,106 / 44,762,115.
Chromosome 12; ruler 23,363,908 / 23,363,918 / 23,363,927.
Chromosome 12; ruler 96,920,559 / 96,920,569 / 96,920,578.
Chromosome 12; ruler 110,398,738 / 110,398,748 / 110,398,757.
Chromosome 14; ruler 47,965,766 / 47,965,776 / 47,965,785.
Chromosome 14; ruler 53,307,154 / 53,307,164 / 53,307,173.
Chromosome 14; ruler 86,537,553 / 86,537,563 / 86,537,572.
Chromosome 17; ruler 3,936,611 / 3,936,621 / 3,936,630.
Chromosome 17; ruler 28,388,993 / 28,389,003 / 28,389,012.
Chromosome 18; ruler 18,681,824 / 18,681,834 / 18,681,843.
Chromosome 18; ruler 54,759,473 / 54,759,483 / 54,759,492.
Chromosome 22; ruler 31,174,501 / 31,174,511 / 41,174,520 (as printed).

Supplementary Tables

Supplementary Table 1

[Table: columns are Prior, Posterior Probability cutoff, Known Germline DNMs, Known Somatic DNMs, Known False Positives, Total DNM Calls, Sensitivity (Germline), Sensitivity (Somatic) and FDR. The cell values are not recoverable from this copy.]

Assessing the impact of the mutation rate prior on the sensitivity and specificity of DNM discovery with DeNovoGear in the WES dataset (see Methods of main text for a description of the WES dataset). In a previous study of the same samples, we experimentally validated DNM predictions and assigned each prediction a value of Germline DNM, Somatic DNM or False Positive (Conrad et al. 2011, PMID ). For each combination of mutation rate prior and posterior probability cutoff used in the current analysis, we list:

Total DNM Calls: the total number of unfiltered DNM calls made by DNG.
Known Germline DNMs: the total number of calls made at sites of validated germline mutation reported by Conrad et al.
Known Somatic DNMs: the total number of calls made at sites of validated somatic mutation reported by Conrad et al.
Known False Positives: the total number of calls made at sites of false positives reported by Conrad et al.

In total, Conrad et al. validated 2 germline de novo mutations and 19 somatic mutations, and identified 39 false positive calls, in the regions with coverage in the WES dataset. A higher mutation rate prior leads to an increase in the total number of calls made, which brings about increased sensitivity at the cost of an increase in the False Discovery Rate. The sensitivity and FDR calculations are similar to Table ??.

Supplementary Table 2

[Table: columns are Prior, Posterior Probability cutoff, Known Germline DNMs, Known Somatic DNMs, Known False Positives, Total DNM Calls, Sensitivity (Germline), Sensitivity (Somatic) and FDR. The cell values are not recoverable from this copy.]

Assessing the impact of the mutation rate prior on the sensitivity and specificity of DNM discovery with DeNovoGear in the WGS dataset (see Methods of main text for a description of the WGS dataset). In a previous study of the same samples, we experimentally validated DNM predictions and assigned each prediction a value of Germline DNM, Somatic DNM or False Positive (Conrad et al. 2011, PMID ). For each combination of mutation rate prior and posterior probability cutoff used in the current analysis, we list:

Total DNM Calls: the total number of unfiltered DNM calls made by DNG.
Known Germline DNMs: the total number of calls made at sites of validated germline mutation reported by Conrad et al.
Known Somatic DNMs: the total number of calls made at sites of validated somatic mutation reported by Conrad et al.
Known False Positives: the total number of calls made at sites of false positives reported by Conrad et al.

In total, Conrad et al. validated 49 germline de novo mutations and 952 somatic mutations, and identified 2235 false positive calls, in whole genome sequencing data from these samples. A higher mutation rate prior leads to an increase in the total number of calls made, which brings about increased sensitivity at the cost of an increase in the False Discovery Rate.

Supplementary Table 3

[Table: columns are Dataset, Class, Alpha and Beta. Surviving row labels: three rows of class RA whose dataset names are missing, CEU-trio (AA), CEU-trio (RR) and WUSTL exome (RA). The alpha and beta values are not recoverable from this copy.]

The alpha and beta values estimated using Maximum Likelihood Estimation for various exome datasets. A different model is fitted to each genotype class (homozygous reference, RR; heterozygous, RA; and homozygous alternate, AA). The values of α and β estimated for the RA class of genotypes are significantly different between any of the CEU exomes and an internal exome dataset generated at Washington University (p < , LRT).
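The likelihood ratio test mentioned in this legend can be sketched as follows. The legend does not name the distribution that α and β parameterize; the code assumes a beta-binomial model of alternate-allele read counts, which is an assumption made here for illustration. The function names are hypothetical, and the maximization step that would produce the fitted parameters is omitted. The test compares separate (α, β) pairs for two datasets against one pooled pair, which gives 2 degrees of freedom, for which the chi-square tail probability has the closed form exp(-x/2).

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def betabinom_logpmf(k, n, alpha, beta):
    """Log probability of k alternate reads out of n (assumed beta-binomial)."""
    log_choose = (math.lgamma(n + 1) - math.lgamma(k + 1)
                  - math.lgamma(n - k + 1))
    return log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)


def log_likelihood(sites, alpha, beta):
    """Sum of log probabilities over (alt_reads, depth) pairs."""
    return sum(betabinom_logpmf(k, n, alpha, beta) for k, n in sites)


def lrt_pvalue_2df(loglik_separate, loglik_pooled):
    """P-value for separate versus pooled (alpha, beta), 2 degrees of freedom."""
    statistic = max(2.0 * (loglik_separate - loglik_pooled), 0.0)
    return math.exp(-statistic / 2.0)  # exact chi-square tail for df = 2


# Hypothetical read counts at heterozygous (RA) sites, for illustration only.
sites = [(12, 30), (15, 28), (9, 25)]
print(log_likelihood(sites, 10.0, 10.0))
print(lrt_pvalue_2df(-100.0, -103.0))
```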

Supplementary Table 4

[Table: columns are Caller, Posterior Probability cutoff, Known Germline DNMs, Known Somatic DNMs, Known False Positives, Total DNM Calls, Sensitivity (Germline), Sensitivity (Somatic) and FDR. Rows: six cutoffs each for Denovogear, Denovogear-BB, GATK, Polymutt and Samtools, plus one row for NaiveCaller with cutoff NA. The cell values are not recoverable from this copy.]

Comparison between different de novo mutation callers on the WES dataset. Likelihood ratios from some packages were converted to posterior probabilities to enable comparison. The NaiveCaller does not have a score associated with its calls. In a previous study of the same samples, we experimentally validated DNM predictions and assigned each prediction a value of Germline DNM, Somatic DNM or False Positive (Conrad et al. 2011, PMID ). Here, for each combination of calling method and posterior probability cutoff, we list:

Total DNM Calls: the total number of unfiltered DNM calls made by DNG.
Known Germline DNMs: the total number of calls made at sites of validated germline mutation reported by Conrad et al.
Known Somatic DNMs: the total number of calls made at sites of validated somatic mutation reported by Conrad et al.
Known False Positives: the total number of calls made at sites of false positives reported by Conrad et al.

The same validation dataset used in Table ?? was used for this analysis. All the tools have similar germline sensitivity. Denovogear has the lowest false discovery rates. SamTools and the NaiveCaller show slightly increased somatic sensitivity but have a very high False Discovery Rate due to the high number of total calls made.

Supplementary Table 5

[Table: columns are Caller, Posterior Probability cutoff, Known Germline DNMs, Known Somatic DNMs, Known False Positives, Total DNM Calls, Sensitivity (Germline), Sensitivity (Somatic) and FDR. Rows: six cutoffs each for Denovogear, Polymutt, GATK and Samtools, plus one row for NaiveCaller with cutoff NA. The cell values are not recoverable from this copy.]

Comparison between the different de novo mutation callers on the WGS dataset. Here, for each combination of calling method and posterior probability cutoff, we list:

Total DNM Calls: the total number of unfiltered DNM calls made by DNG.
Known Germline DNMs: the total number of calls made at sites of validated germline mutation reported by Conrad et al.
Known Somatic DNMs: the total number of calls made at sites of validated somatic mutation reported by Conrad et al.
Known False Positives: the total number of calls made at sites of false positives reported by Conrad et al.

The sensitivity and false discovery rate calculations were done as in Table ??. Polymutt overcalls on the X chromosome, so all calls made by Polymutt on the X were ignored, as recommended on the developers' website. This leads to a drop in its somatic sensitivity. Denovogear shows high germline and somatic sensitivity at a comparatively low False Discovery Rate.

Supplementary Table 6

[Table: columns are Program, Dataset, Total Calls and Validated Calls. Rows: DeNovoGear on WEx (surviving values 98 and 0), SamTools on WEx, DeNovoGear on WGS, SamTools on WGS, and DeNovoGear DINDEL on WGS. The remaining cell values are not recoverable from this copy.]

The total number of de novo indel calls (posterior probability > ) and the number of validated true positive calls made by DeNovoGear, SamTools, and DeNovoGear using DINDEL genotype likelihoods on the WEx and WGS datasets.

Supplementary Table 7

[Table: columns are Validated (split into True Positives and False Positives) and Not Validated. Rows are given separately for Insertions and Deletions: Total called; Avg posterior probability (Samtools, DINDEL); Average length; and Coding context (Intergenic, Exonic, Intronic, Upstream, Downstream, ncRNA exonic, ncRNA intronic, UTR3). Most cell values are missing from this copy; the surviving Deletions entries read Exonic 0-0, Upstream 2-1, Downstream 1-4, ncRNA exonic 0-1, ncRNA intronic 0-7 and UTR3 0-4.]

Summary statistics for de novo indels called by DNG. Sites classified as Validated are sites for which we attempted validation by Sanger sequencing. Sites classified as Not Validated are sites which we did not attempt to validate, due to manual or algorithmic filtering.

Supplementary Table 8

57 different PCR assays designed to confirm indel calls made by DNG

[Table: columns are Assay #, Chromosome, Position, Mutation, Forward primer, Reverse primer, Amplicon size, Annealing temperature and Elongation time (s). The row values, including the primer sequences, are damaged in this copy (every C is missing from the sequences) and are not reproduced.]

The 57 different PCR assays designed and used to confirm indel calls made by DNG. The chromosomes on which these assays were designed and the locations of the indels to be captured are indicated, in addition to the primers, annealing temperatures and elongation times used for each PCR assay.

Supplementary Table 9 is provided as a stand-alone document available through the publisher's website.

Supplementary Note

Analysis of Priors

The DeNovoGear framework allows the user to specify the prior probability of observing a DNM, which in principle can be used as a lever to increase or decrease calling sensitivity. We performed simulations to show that increasing the mutation rate prior increases detection sensitivity (Supplementary Figures 1 and 2, Supplementary Tables 1 and 2). Specifically, we ran DeNovoGear with the mutation rate prior set from 10^-4 to [...] mutations/bp in geometric increments of [...]. Our results show that varying the mutation rate prior does have a dramatic effect on the sensitivity and specificity of DNM calling when using a standard whole-genome sequencing study design such as the one that generated the WGS dataset (Supplementary Tables 1 and 2, Supplementary Figs. 1 and 2). The total number of false positive calls increases over 5-fold when moving from [...] to 10^-4, while 879/939 (94%) of validated DNMs are detected at the smallest rate prior, and 100% sensitivity for germline DNMs is achieved at [...]. These results indicate that use of biologically realistic values for the mutation prior will give near 100% sensitivity to non-mosaic DNMs, while increasing the prior

beyond this threshold will massively inflate the number of false positive calls with marginal or no increase in sensitivity.

Next, we investigated whether our use of a prior on mutation rate helps control Type I error at low sequencing depth. Low coverage data had high specificity, but low sensitivity. With a cutoff on the posterior probability of being de novo of > 0.001, specificity ranged from 1 (1x coverage) to 0.87 (20x coverage); sensitivity ranged from 0 (1x coverage) to 0.97 (20x coverage). Greater than 95% of de novo mutations were identified at 16x coverage and above.

DINDEL genotype likelihoods

After the fact, we wanted to assess the performance of alternative indel genotype likelihood functions for de novo indel calling. We selected one such alternative modeling framework, a computationally intensive, haplotype-based realignment method that is implemented in the package DINDEL (PubMed ID: ). In order to evaluate the feasibility of running DINDEL on the whole genome, we first ran DINDEL in the default mode and in the heuristic mode on chromosome 21 of the WGS dataset. DINDEL identified a total of 631,686 distinct candidate indels on chromosome 21

across all three samples. These calls are spread across 288,658 windows of 120 bp each. Run time varied by only 5% across different samples. The more complex modeling used by DINDEL comes at a great computational cost: the average run time for default mode was 142 hours per sample, and for heuristic mode, 80 hours. Using these numbers produces an estimated run time of 344 or 144 days per whole genome sequence.

Using the heuristic likelihoods, DINDEL calls were first made on each member of the trio separately. These calls were then merged to create a list of candidate sites, and we directed DINDEL to calculate genotype likelihoods for all three samples at these candidate sites. The resultant genotype likelihoods were then fed to DeNovoGear to call de novo indels. DeNovoGear produced 136 indel calls from the WGS dataset with posterior probability > 0.9 and 463 calls with posterior probability > 1 x 10^-4 (Table S6). Forty-four (79%) of the 56 candidate DNMs from our Samtools analysis were also called as DNMs with DINDEL likelihoods when considering this larger set of 463; in contrast, 2/3 false positives were no longer supported as DNMs. Our results suggest that DINDEL genotype likelihoods are conservative (i.e. they underestimate the evidence in support of an indel when a true indel is present) but this is balanced by a major

increase in specificity.

Robustness of Indel Mutation Rate Estimate

Previous estimates of the ratio of deletion to insertion variants from human polymorphism data range from [...]; thus we interpret our observation of a nearly 8-fold enrichment of validated deletions as possibly reflecting a lack of power to detect short insertions with next-generation sequencing data. Alternatively, our finding may be an indication that purifying selection is much stronger on new deletions than on new insertions. If we were to adjust our mutation rate estimate to account for a theoretical under-ascertainment of indels, the resulting values would be only slightly higher than those presented here; assuming a true 4:1 ratio of deletions:insertions produces an estimate of 1.18 x 10^-9, while assuming a 2:1 ratio leads to an estimate of 1.42 x [...].

It seems likely that discovery power for indels may be lower than that for SNVs. If our power to discover indel DNMs were much less than the value of 0.95 that we assume here, say as low as 0.2, then using the other parameter values described in the main text, our rate estimate would be revised upwards to [...].

Due to our filtering strategy, the indel rate estimate we

provide in the main text applies to the non-repeat portion of the genome. The indel mutation rate is predicted to be much higher in the repetitive portion. One way to approach the true genomic indel mutation rate would be to simply include the DNM calls from these repeat regions in a rate estimate; even under the assumption that these are all true positives (which is unlikely), the resulting callset may also suffer from a lack of power to identify indels in repeat regions. Sensitivity and specificity of indel calling in repeats are still very poorly characterized. We manually removed 55% of the post-filtered calls in our original analysis, due to visual identification of artifacts. Assuming we would remove the same proportion from the unfiltered callset, we would have 203 de novo indels, and we would expect to validate 192 of these based on our observed validation rate. Then, using the equation that we define in the methods section of the main text, with the parameter values a = 1, p = 0.95, s = 49/1001, d = 193, b = [...], our rate estimate would be [...]
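
The deletion:insertion adjustment discussed under "Robustness of Indel Mutation Rate Estimate" can be checked with a little arithmetic. The sketch below is only an illustration under a stated assumption that is not taken from the text: that the adjustment keeps the number of deletions fixed and rescales the number of insertions to match the assumed true ratio, so the rate is proportional to (1 + 1/r) for a deletions:insertions ratio of r:1. The function names are hypothetical. Under that assumption, moving from a 4:1 to a 2:1 ratio multiplies the estimate by 1.5/1.25 = 1.2, which is in line with the quoted change in mantissa from 1.18 to 1.42.

```python
def total_per_deletion(del_to_ins_ratio: float) -> float:
    """Total indels (deletions + insertions) per deletion for a ratio of r:1.

    Assumption (not from the source): deletions are held fixed and the
    insertion count is deletions / r.
    """
    if del_to_ins_ratio <= 0:
        raise ValueError("ratio must be positive")
    return 1.0 + 1.0 / del_to_ins_ratio


def rescale_rate(rate: float, from_ratio: float, to_ratio: float) -> float:
    """Rescale a rate computed under one assumed ratio to another assumed ratio."""
    return rate * total_per_deletion(to_ratio) / total_per_deletion(from_ratio)


# Estimate quoted in the text for an assumed true 4:1 ratio.
rate_4_to_1 = 1.18e-9

# Implied estimate for an assumed true 2:1 ratio under the assumption above.
rate_2_to_1 = rescale_rate(rate_4_to_1, from_ratio=4.0, to_ratio=2.0)

print(f"scaling factor 4:1 -> 2:1: {rate_2_to_1 / rate_4_to_1:.2f}")
print(f"implied 2:1 estimate: {rate_2_to_1:.2e}")
```

This does not reproduce the rate equation from the main text, which is not given here; it only checks that the two quoted adjusted estimates are mutually consistent under a simple proportional model.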
