Supplementary Figure 1. Schematic of split-merger microfluidic device used to add transposase to template drops for fragmentation.


Supplementary Figure 1. Schematic of split-merger microfluidic device used to add transposase to template drops for fragmentation. Inlets are labelled in blue, outlets are labelled in red, and static channels labelled in green are filled with 2 M NaCl. This device consists of four modules: the droplet inlet and droplet spacer, used to reinject drops; the splitter channel, used to controllably split off a fraction of every drop; the oil inlet and aqueous inlet, which combine into a droplet maker; and the electrode and moat channels, which form a merger channel. The dashed box denotes the area of the device shown in Fig. 2 of the main text. Pump flow rates: oil 700 µl per hr, aqueous 250 µl per hr, droplet 150 µl per hr, droplet spacer 200 µl per hr, splitter 170 µl per hr.

Supplementary Figure 2. Schematic of the barcode merging device, which merges two small drops with one large drop and then splits the large drop into 4 smaller drops. Inlets are labelled in blue, outlets are labelled in red, and static channels labelled in green are filled with 2 M NaCl. This device consists of four modules: the droplet inlets and spacer, used to reinject and intercalate two sets of drops; the aqueous and oil inlets, used to generate large droplets on the device; the moat and electrode channels, used to merge the droplets; and the bifurcations at the outlet, used to split large droplets into small droplets. The dashed boxes denote the area of the device shown in Fig. 2 of the main text. Pump flow rates: oil 10000 µl per hr, droplet inlets 50 µl per hr and 70 µl per hr, aqueous 800 µl per hr, spacer 200 µl per hr.

Supplementary Figure 3. The pinched flow fractionation device used to remove large coalesced droplets from an emulsion. a) Emulsion collected at the end of the triple merger device before and after thermal cycling, showing coalesced droplets that will be sorted out in the next step. b) Schematic and bright field picture of the pinched flow fractionation device used to remove large droplets. Droplets are injected at 400 µl per hr. HFE7500 oil is injected at 4000 µl per hr. The smaller droplets are collected into a tube, while the outlet for large droplets is attached to a syringe filled with water, pulling at 3000 µl per hr. Microscope image of the pinched flow fractionation device separating large and small droplets. Arrows indicate large droplets that flow to the lower outlet, separating them from the smaller droplets.

Supplementary Figure 4. Schematic and validation of droplet barcodes. a) Chemically synthesized random Nmer barcodes are encapsulated into droplets so that most droplets contain zero or one barcode. Inside droplets, each barcode is clonally amplified by PCR, generating droplets that contain zero or many copies of a unique barcode. SYBR staining inside droplets is used to identify droplets that contain barcodes. b) Probability of reusing the same barcode as a function of the total number of barcodes used in an experiment, for barcodes 15-20 bp long. Error bars represent the standard error of the mean from 10,000 simulation runs. See Supplementary Methods for simulation details. c) Empirical data from an SMDB sequencing run. Distribution of the Hamming distance from each barcode to its closest neighbor before and after clustering. Error barcodes are one Hamming distance away from their closest neighbor, while original barcodes are on average three Hamming distances away. The dashed blue line shows the theoretical distribution of Hamming distances for an equal number of randomly chosen barcodes.

Supplementary Figure 5. Local GC content and aggregate coverage for all 8 templates. Dashed line shows local GC content of the templates plotted on the left axis. Solid line shows the aggregate normalized coverage for each template plotted on the right axis.

Supplementary Figure 6. Additional metrics on the success of de novo assembly of barcode clusters. a) Correlation between de novo assembly success rate and coverage entropy (green line) for the known templates dataset. For comparison, assembly success is also plotted against the number of reads in the barcode cluster (blue dots), showing a weaker relationship. b) Percent identities of contigs assembled from the E. coli genomic DNA compared to the reference E. coli genome. c) The same data with each contig represented on the x-axis.

Supplementary Figure 7. Number of barcode clusters containing a minimum number of reads for a given sequencing effort. More sequencing effort results in more barcode clusters containing the minimum number of reads, but the rate of new barcode cluster discovery decreases as more reads are sequenced.

Supplementary Figure 8. Qualities and distributions of SNP calls. a) Distribution of SNP quality scores. b) Distribution of the number of high quality (>Q50) SNPs in each barcode cluster. About 90% of barcode clusters contain no SNPs; the remaining 10% are distributed across one or more SNPs per cluster.

Supplementary Table 1. Oligonucleotides used in SMDB

| Oligo Name | Sequence 5'-3' | Notes | Structure |
|---|---|---|---|
| FL127 | AATGATACGGCGACCACCGAGATCTACAC-TCGTCGGCAGCGTC | Primer for barcoded fragment | Illumina P5 - Nextera Const Sequence |
| Barcode Oligo | GCAGCTGGCGTAATAGCGAGTACAATCTGCTCTGATGCCGCATAGNNNNNNNNNNNNNNNTAAGCCAGCCCCGACACT | Barcode Oligo Sequence | ConstA-N(15)-ConstB |
| FL128 | CTGTCTCTTATACACATCTCCGAGCCCACGAGACGTGTCGGGGCTGGCTTA | Barcode PCR Fwd | Nextera Complementary - Barcode Priming Site |
| FL129 | CAAGCAGAAGACGGCATACGAGATCAGCTGGCGTAATAGCG | Barcode PCR Rev | Illumina P7 - Barcode Priming Site |
| FL166 | GCCCACGAGACGTGTCGGGGCTGGCTTA | Custom Barcode Read Sequencing Primer | Contains barcode priming site of FL128 |
| UnivAdaptora-w | CCACTACGCCTCCGCTTTCCTCTCTATGGGCAGTCGGTGAT | Same as adaptor in NEB E6285S | Watson strand of universal adaptor A |
| UnivAdaptora-c | ATCACCGACTGCCCATAGAGAGGAAAGCGGAGGCGTAGTGG*T*T | Same as adaptor in NEB E6285S | Crick strand of universal adaptor A |
| UnivAdaptorb-w | CCATCTCATCCCTGCGTGTCTCCGACTCAG | Same as adaptor in NEB E6285S | Watson strand of universal adaptor B |
| UnivAdaptorb-c | CTGAGTCGGAGACACGCAGGGATGAGATGG*T*T | Same as adaptor in NEB E6285S | Crick strand of universal adaptor B |
| FL178 | CCACTACGCCTCCGCTTTC | Primer for Template PCR | Complementary to Adaptor A |
| FL179 | CCATCTCATCCCTGCGTGT | Primer for Template PCR | Complementary to Adaptor B |
| FL143 | GACCTCGCGGGTTTTCGCT | For known template | FragC Forward |
| FL144 | CCT GAC CGC TGT ACA CTG CA | For known template | FragC Reverse |
| FL145 | GGTGAACGATGCGTAATGTG | For known template | FragD Forward |
| FL146 | TCAGCATCTAGCATGCAACC | For known template | FragD Reverse |
| FL147 | TCGGATTTAGTGCGCTTTCT | For known template | FragE Forward |
| FL148 | GCCCATGACAGGAAGTTGTT | For known template | FragE Reverse |
| FL170 | ATTTGAATCCTCCGGCTCCG | For known template | FragG Forward Primer (3K) |
| FL171 | TCC CGG ACG AAC CTC TGT AA | For known template | FragG Reverse Primer (3K) |
| FL172 | GGC TTG GCT CTG CTA ACA CG | For known template | FragH Forward |
| FL173 | GGA TCA GAA ATG GGA AGA AGG CG | For known template | FragH Reverse |
| FL174 | GCC ACC TGT TAC TGG TCG AT | For known template | FragI Forward Primer (5k) |
| FL175 | ACC GAC TCA ATA AAC ACG GC | For known template | FragI Reverse Primer (5k) |
| FL176 | ACC TCT AAA TCG TGC ACA GGC | For known template | FragJ Forward (4K) |
| FL177 | TTC CCC GAT ACC TTG TGT GC | For known template | FragJ Reverse (4K) |

Supplementary Note 1. Droplet stability in thermal cycling

Droplet stability to thermocycling depends on surfactant, aqueous buffer, and droplet size. Using the EA surfactant and our PCR buffer, we found empirically that droplets are most stable to thermal cycling when they are immersed in FC-40 with 5% w/w EA surfactant, with 2% w/v Tween-20 and 2% w/v PEG-6000 in the aqueous phase. Under these conditions, droplets are most stable to thermal cycling if their spherical diameter is less than 55 µm.

Supplementary Note 2. Algorithm to cluster error barcodes to their original sequences

The algorithm we use, called dfscluster, available at https://github.com/abatelab/barcoding, operates under the expectation that each sequence in a barcode cluster is one Hamming distance away from another sequence in that cluster, and at least two Hamming distances away from any sequence not in that cluster. If this is the case, the sequences associated with unique barcode clusters form connected components in Hamming space, which can then be identified using a depth first search (dfs) in time proportional to barcode length times the number of unique sequences amongst the barcodes and their derivatives.

One scenario where this expectation does not hold is where sequences from one part of a barcode cluster are at least two mutations away from all sequences from another part. Computer simulation using a single template of length 15, a template replication rate of 0.8, and a single-base error rate of 0.0001 shows that although clusters do split, the splits are inconsequential. When we run dfscluster on these simulated barcode clusters, it consistently groups 99.99% of the simulated cluster's sequences into a single cluster. The other scenario is collisions, where multiple barcode clusters merge into a single component. Collisions are heavily dependent on the minimum Hamming distances between the original barcodes.
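
The connected-component search described above can be sketched as follows. This is a minimal illustration in Python, not the dfscluster code itself: the function names `hamming_neighbors` and `cluster_barcodes` are hypothetical, and the sketch assumes all barcode reads have the same length and contain only the bases A, C, G, and T. Each unique sequence is visited once, and its neighbors are found by enumerating single-base substitutions and looking them up in a set.

```python
from collections import Counter

BASES = "ACGT"  # assumption: reads contain only these four bases


def hamming_neighbors(seq):
    """Yield every sequence exactly one substitution away from seq."""
    for i, base in enumerate(seq):
        for alt in BASES:
            if alt != base:
                yield seq[:i] + alt + seq[i + 1:]


def cluster_barcodes(reads):
    """Group barcode reads into connected components in Hamming space.

    Two unique sequences are linked when they differ at exactly one
    position. Returns a list of Counters, one per component, mapping
    each unique sequence in that component to its read count.
    """
    counts = Counter(reads)
    unvisited = set(counts)
    clusters = []
    while unvisited:
        start = unvisited.pop()
        stack = [start]  # explicit stack for an iterative depth first search
        component = Counter({start: counts[start]})
        while stack:
            seq = stack.pop()
            for neighbor in hamming_neighbors(seq):
                if neighbor in unvisited:
                    unvisited.remove(neighbor)
                    component[neighbor] = counts[neighbor]
                    stack.append(neighbor)
        clusters.append(component)
    return clusters


# Hypothetical reads: three linked variants of one barcode plus an unrelated one.
reads = ["ACGTA", "ACGTA", "ACGTT", "ACGAT", "GGGGG"]
clusters = cluster_barcodes(reads)
```

In this toy input, ACGAT is two mismatches from ACGTA but still joins its component through ACGTT, which is the chaining behaviour that connected components provide.
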
To this end, dfscluster provides an option to identify and remove components suspected of being collisions from its output. This filter marks as collisions those clusters in which the normalized difference between the counts of the most and second most populous sequences is less than 0.7, with a false positive rate of 0.017 and a false negative rate of 0. For more functional details, consult the Abate lab GitHub.

Supplementary Note 3. Comparing rate of double encapsulation to theoretical Poisson distribution

If the process of template encapsulation is completely random, then the number of templates per droplet should follow a Poisson distribution, where $\lambda$ is the average number of templates per droplet and $k$ is the number of templates in a particular droplet (i.e. $k = 2$ represents droplets with two templates):

$$P(k, \lambda) = \frac{e^{-\lambda} \lambda^{k}}{k!}$$

Approximating each barcode cluster as a single droplet, the fraction of droplets that contain a single template ($k = 1$) is:

$$P(1, \lambda) = e^{-\lambda} \lambda$$

The fraction of droplets containing two templates is:

$$P(2, \lambda) = \frac{e^{-\lambda} \lambda^{2}}{2}$$

The ratio between $P(1, \lambda)$ and $P(2, \lambda)$ is $R = 2/\lambda$. Using the number of barcode clusters containing one and two templates, we estimate $\lambda = 0.1$, which matches the target encapsulation ratio and is supported by counts of fluorescent versus non-fluorescent droplets under SYBR staining after initial template amplification in droplets.

Supplementary Note 4. Defining coverage entropy

In order to visualize the coverage distribution for every barcode cluster, it must be described numerically. We applied the informational entropy from information theory to the distribution of reads to arrive at the coverage entropy $S$:

$$S = \sum_{i} P_i \log\left(\frac{1}{P_i}\right)$$

where $P_i$ is the probability of finding a read that maps into the $i$th bin along the template in each barcode cluster. Coverage entropy is maximal when reads are equally likely to fall into every bin.
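
As an illustration of the coverage entropy definition, the following Python sketch computes S from per-bin read counts. It is a sketch under stated assumptions rather than the analysis code used here: the function name `coverage_entropy` is hypothetical, the natural logarithm is assumed because the base is not specified, and empty bins are taken to contribute zero (the limit of P log(1/P) as P goes to 0).

```python
import math


def coverage_entropy(bin_counts):
    """Entropy S = sum_i P_i * log(1 / P_i) of reads across template bins.

    bin_counts holds the number of reads mapping to each bin along the
    template. The natural logarithm is an assumption of this sketch.
    """
    total = sum(bin_counts)
    if total == 0:
        raise ValueError("barcode cluster contains no mapped reads")
    entropy = 0.0
    for count in bin_counts:
        if count > 0:  # empty bins contribute nothing to the sum
            p = count / total
            entropy += p * math.log(1.0 / p)
    return entropy


# Hypothetical clusters: even coverage versus all reads piled into one bin.
even = coverage_entropy([25, 25, 25, 25])
piled = coverage_entropy([100, 0, 0, 0])
```

With four bins, the evenly covered cluster reaches the maximum value log(4), while the cluster with every read in one bin scores 0.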

Supplementary Methods

Gravity induced droplet size fractionation

In a tumbling emulsion, larger droplets experience a higher buoyant force than smaller droplets. This phenomenon can be used as a simple method of segregating large and small droplets within one emulsion. To use gravitational fractionation, droplets are loaded into a syringe with an equal volume of HFE7500 with 2% EA surfactant, then gently rolled about the long axis of the syringe, tilted at 30°, at approximately 0.5 Hz for one hour. The rolling fluidizes the emulsion, allowing droplets to shuffle past one another based on their buoyancies. After rolling, the syringe is fully tilted to 90°, facing down. The large droplets are at the top of the emulsion while the small droplets are at the bottom. The half of the emulsion containing the small droplets is collected. The other half of the emulsion contains a mixture of large and small droplets, which are further sorted using a pinched flow fractionation device (Fig. S3).

Simulating probability of repeating barcodes

The probability of resampling the same barcode by randomly drawing from an unlimited pool is akin to the birthday problem, for which the analytical solution is not computationally tractable for such a large number of possible barcodes. In order to determine the probabilities, we performed in silico simulations. Barcodes are generated by randomly selecting one of four bases with equal probabilities until N bases are selected, resulting in a random barcode. A repeat event occurs when a newly generated barcode matches exactly with a previously generated barcode. The probability is determined by averaging the results from multiple simulations.
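
The simulation procedure above can be sketched in Python as follows. This is an illustrative reimplementation, not the script referred to in this section: the function names `has_repeat` and `repeat_probability` are hypothetical, and the sketch assumes that each simulation run is scored by whether at least one repeat event occurs among the barcodes drawn.

```python
import random


def has_repeat(num_barcodes, length, rng):
    """Draw random barcodes and report whether any exact repeat occurs."""
    seen = set()
    for _ in range(num_barcodes):
        # Pick one of four bases with equal probability until `length` bases are chosen.
        barcode = "".join(rng.choice("ACGT") for _ in range(length))
        if barcode in seen:
            return True
        seen.add(barcode)
    return False


def repeat_probability(num_barcodes, length, num_runs, seed=0):
    """Estimate the repeat probability as the fraction of runs with a repeat.

    Assumption of this sketch: a run counts as a repeat if at least one
    newly generated barcode matches a previously generated one.
    """
    rng = random.Random(seed)
    hits = sum(has_repeat(num_barcodes, length, rng) for _ in range(num_runs))
    return hits / num_runs


# Hypothetical small example: 1,000 barcodes of length 8, averaged over 200 runs.
estimate = repeat_probability(num_barcodes=1000, length=8, num_runs=200)
```

Barcode lengths of 15-20 bases and realistic barcode numbers need far more draws per run, so a production version would likely encode barcodes as integers rather than strings to save time and memory.
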
The script used for simulation is available at the Abate Lab GitHub: https://github.com/abatelab/

Calculating the limit of detection for rare variants

The probability of observing a rare mutation present at frequency $f$ when sampling the population $n$ times is described by a Poisson distribution:

$$P(k > 0) = 1 - e^{-fn}$$

Hence, the frequency of mutants that can be detected with probability $P(k > 0)$ is:

$$f = -\frac{\ln\left(1 - P(k > 0)\right)}{n}$$

Setting $P(k > 0) = 0.95$, we can calculate the minimum frequency of molecules we expect to detect 95% of the time when we sequence $n$ molecules with SMDB.
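
The limit-of-detection relation can be evaluated numerically with a short Python sketch. The function names `detection_probability` and `min_detectable_frequency` are hypothetical, and the value n = 1000 in the example is an arbitrary illustration rather than a figure from this work.

```python
import math


def detection_probability(frequency, num_molecules):
    """Probability of seeing at least one mutant: P(k > 0) = 1 - exp(-f * n)."""
    return 1.0 - math.exp(-frequency * num_molecules)


def min_detectable_frequency(num_molecules, probability=0.95):
    """Lowest mutant frequency detected with the given probability.

    Inverts the expression above: f = -ln(1 - P) / n.
    """
    if num_molecules <= 0:
        raise ValueError("num_molecules must be positive")
    if not 0.0 < probability < 1.0:
        raise ValueError("probability must lie strictly between 0 and 1")
    return -math.log(1.0 - probability) / num_molecules


# Hypothetical example: 1,000 molecules sequenced, 95% detection probability.
f_min = min_detectable_frequency(1000)
```

For n = 1000 this gives a minimum frequency of roughly 0.003, that is, variants present in about 0.3% of molecules.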