Article type: Focus Article

The Limits of Multiplexing

Dan Shen 1,*, D.P. Dittmer 2, and J. S. Marron 3

Keywords: Multiplexing, Genomics, nanostring TM, DASL TM, Probability

Abstract

We were motivated by three novel technologies, which exemplify a new design paradigm in high-throughput genomics: nanostring TM, DNA-mediated Annealing, Selection, extension, and Ligation (DASL TM), and multiplex real-time quantitative polymerase chain reaction (QPCR). All three are solution hybridization based, and all three employ multiple DNA sequence probes in a small volume, each probe specific for a particular sequence in a different human gene. Nanostring TM uses 50-mer probes; DASL and multiplex QPCR use 20-mer probes. Assuming a nM probe concentration in a µl volume, there are 10^-9 x 10^-9 x 6.023 x 10^23, or 6.023 x 10^5, molecules of each probe present in the reaction, compared to a much smaller number of target molecules. Excess probe drives the sensitivity of the reaction. We are interested in the limits of multiplexing, i.e. the probability that in such a design a particular probe would bind to any other, sequence-related probe rather than the intended, specific target. If this were to happen with appreciable frequency, it would result in much reduced sensitivity and potential failure of this design. We established upper and lower bounds for the probability that in a multiplex assay at least one probe would bind to another sequence-related probe rather than its cognate target. These bounds are reassuring, because for reasonable degrees of multiplexing (10^3 probes) the probability of such an event is practically negligible. As the degree of multiplexing increases to 10^6 probes, our theoretical boundaries gain practical importance and establish a principal upper limit for the use of highly multiplexed solution-based assays vis-à-vis solid-support anchored designs.
* Correspondence to: danshen@usf.edu; ddittmer@med.unc.edu
1 Interdisciplinary Data Sciences Consortium, Department of Mathematics and Statistics, University of South Florida.
2 Department of Microbiology and Immunology and Center for AIDS Research, Comprehensive Cancer Center, University of North Carolina at Chapel Hill.
3 Department of Statistics and Operations Research, Comprehensive Cancer Center, University of North Carolina at Chapel Hill.

Recently, solution-based multiplex hybridization methods have been developed and used for messenger ribonucleic acid (mRNA) profiling experiments that were previously the purview of solid-state, anchored methods, the so-called microarrays or chips. By most practical accounts their performance seems equal, but practical experiments represent examples with a bias towards reporting positive data. They are not exhaustive and do not represent a general solution. Because massive multiplexing involves up to 10^6 probes, individual experimental validation is no longer feasible. At the same time, the reaction volume of polymerase chain reaction (PCR) and hybridization assays has been reduced due to nanotechnology. Conventional PCR instruments can now use the 1536-well format, e.g. the Roche system with µl-scale reaction volumes. Newer microfluidics-based machines perform thousands of individual reactions on the same chip, e.g. the Fluidigm system with nanoliter-scale volumes. Recently, a picoliter device has been described by White et al. (2011). As engineering pushes the technical boundaries of miniaturization, it becomes important to define the statistical boundaries of experimental designs.

Problem

We were concerned that the multiplex design introduces the potential for cross-hybridization among probe molecules, resulting in a loss of sensitivity or detection failure. This possibility is not present in solid-state, anchored designs, since in this case there are no free probes available in solution, only the target molecules. Traditionally, cross-hybridization refers to a scenario where the probe binds to a second, related but not intended target. In solution-based multiplex designs there exists in addition the possibility that the probe cross-hybridizes to another probe rather than to any target at all. Since every probe has to be in excess over any potential target in order to drive the hybridization reaction to completion, cross-hybridization to an unrelated probe would be favored and would prevent detection of the cognate target. Here we determine theoretical bounds for this cross-hybridization problem.

Figure 1 lays out the problem. A microarray can be seen as a set S of probes in a special orientation. Each probe of length ℓ (solid arrow) is physically attached to the support surface. Two probes never touch each other. These are then hybridized to a mixture of target mRNAs.
The probes are in molar excess compared to the target. Each probe is an oligonucleotide of length ℓ, i.e. an ℓ-mer. The ℓ-mer is made up of the four bases A, T, C, G. We assume that A and T have the same frequency, as do C and G. Furthermore, we parameterize the CG-ratio (i.e. the relative frequency of A or T) as p, where 0 ≤ p ≤ 1. Each ℓ-mer binds to the target mRNA with perfect complementarity and we assume 100 percent efficiency. We also assume that a given ℓ-mer does not bind non-specifically to non-target mRNAs. These assumptions are supposed to hold in maximally efficient instruments or assays. In a solid-state anchored microarray, this is the only interaction that can take place (Figure 1, panel A). There is no binding of ℓ-mer probes to each other. The situation is different in a solution-hybridization based multiplex design, e.g. a bead array (Figure 1, panel B). In a bead array, probes are coupled to individual beads and in principle any two beads with complementary probes (blue and red arrows in Figure 1) can hybridize to each other. Other designs rely on free oligonucleotides/probes with no beads attached.

Figure 1: Conceptual illustrations of binding possibilities for (A) a solid-state anchored microarray and (B) a multiplex solution array (bead array). Solid arrows indicate probes, the dotted arrow the correct target. The blue arrow refers to a probe or oligonucleotide of length ℓ, which is the desired one, perfectly complementary to the target mRNA (dashed arrow). The red arrow refers to an oligonucleotide of length ℓ, which is similar to the target mRNA and thus can bind the blue probe except for m mismatches. The blue-red interactions are possible due to sequence complementarity.

In addition to each ℓ-mer binding its cognate target mRNA, each ℓ-mer can also bind to any other ℓ-mer in the probe set. If the ℓ-mer binds to another ℓ-mer rather than the target mRNA, the assay fails. It is important to keep in mind that for solution hybridization based designs the concentration of probe is orders of magnitude larger than that of the target mRNA being detected. Otherwise, the assay would not be quantitative. Since hybridization efficiency is a function of probe concentration, unwanted probe ℓ-mer to probe ℓ-mer hybridization poses a novel problem for solution-based multiplex approaches.

In a complete set that contains all possible ℓ-mers, there exists for each ℓ-mer one and only one perfectly complementary ℓ-mer. For example, for an ℓ-mer of sequence ACTG the perfect complement would be TGAC. In a complete set, all ℓ-mer probes would hybridize to their complementary ℓ-mer probe and none would bind to the target mRNA. This is avoided in traditional microarray designs, since all ℓ-mers are spatially separated by anchoring them to a solid support matrix, i.e. a slide microarray or chip. The size of the complete set, i.e. the number of all possible ℓ-mers, is 4^ℓ. In practice, one would never multiplex the complete set, but only a subset of all possible ℓ-mers. This subset S is much smaller than the complete set.
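The complement and the complete-set count described above can be illustrated with a short Python sketch. It follows the text's convention of a base-wise complement (ACTG pairs with TGAC, without reversing strand orientation); the function names are hypothetical and chosen only for this illustration.

```python
# Base-pairing rules used in the text: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(probe: str) -> str:
    """Return the base-wise complement of a probe (no strand reversal),
    matching the ACTG -> TGAC example in the text."""
    return "".join(COMPLEMENT[base] for base in probe)


def complete_set_size(length: int) -> int:
    """Number of distinct probes of the given length over the bases A, T, C, G."""
    return 4 ** length


print(complement("ACTG"))     # TGAC
print(complete_set_size(4))   # 256
```

Even for short probes the complete set grows quickly, which is why only a small subset is ever multiplexed.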
In practice, not only perfectly complementary ℓ-mers would hybridize, but also those with m mismatches, the number of mismatches being determined by the stringency of the hybridization reaction. The more mismatches are tolerated by the reaction conditions, the more the size of S approaches the size of the complete set. Suppose ℓ = 4; then the size of the complete set is 4^4 = 256. Assume we only have one probe of length ℓ = 4; then the size of S is 1, e.g. ACTG, and S makes up 1/256 of the complete set. Now we allow one mismatch at the end to yield: ACTg, ACTt, ACTc, ACTa. The size of S is 4, i.e. 4/256 of the complete set. If we allow 4 mismatches, S equals the complete set.

Figure 2: Upper (solid) and lower (dashed) bounds on the probability that there exist no m-mismatched ℓ-mers in a subset under different CG-ratios, plotted against m, the number of mismatches (ℓ = 40). The left panels (A, C, E) show the absolute probability scale. To depict closeness to 1 at higher resolution, the right panels (B, D, F) are plotted on the log10(1 - probability) scale. Rows compare the effect of the CG-ratio: p = 0.5 (Panels A and B), p = 0.28 (Panels C and D) and p = 0.2 (Panels E and F). The upper and lower bounds are very close. Dashed red and pink lines give specific interesting comparisons.

This paper gives useful theoretical bounds on how many ℓ-mers can be multiplexed and how these depend on the length ℓ, the number of mismatches and the CG-ratio p. There exists a vibrant literature on the probability of individual ℓ-mers and the complete set of all possible sequence permutations: D'yachkov et al. (2005); Bishop et al. (2007); Dyachkov and Voronina (2009). Fewer studies have investigated this problem in the context of subsets S of the complete set and how subset size influences the probability of annealing.

Results

We say that there are no m-mismatched ℓ-mers in a subset S if the number of mismatches between any two ℓ-mers in S does not equal m. The probability of no m-mismatched ℓ-mers in a subset S of all possible ℓ-mers is a function of s (the size of the subset S), of ℓ (the length of the ℓ-mer), of m (the number of mismatches) and of p (the CG-ratio).
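For very small ℓ and s, the probability defined above can be estimated by simulation. The following Python sketch is only an illustration and not the closed-form bounds of this paper. It rests on several assumptions that are not stated in the text: bases are drawn independently, A and T each with probability p/2 and C and G each with probability (1 - p)/2; a mismatch is a position where two probes are not base-wise complementary; and all function names are hypothetical.

```python
import random
from itertools import combinations

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def random_probe(length, p, rng):
    """Draw one probe with independent bases (assumed sampling model):
    A and T each with probability p/2, C and G each with (1 - p)/2."""
    weights = [p / 2, p / 2, (1 - p) / 2, (1 - p) / 2]
    return "".join(rng.choices("ATCG", weights=weights, k=length))


def mismatches(a, b):
    """Count positions where probe b is not the complement of probe a."""
    return sum(COMPLEMENT[x] != y for x, y in zip(a, b))


def estimate_no_m_mismatch(length, s, m, p, trials=2000, seed=0):
    """Monte Carlo estimate of the probability that no pair of probes in a
    random subset of size s has exactly m mismatches."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        probes = [random_probe(length, p, rng) for _ in range(s)]
        if all(mismatches(a, b) != m for a, b in combinations(probes, 2)):
            hits += 1
    return hits / trials


# Example call for a toy setting; the estimate varies with the seed.
print(estimate_no_m_mismatch(length=8, s=10, m=2, p=0.5))
```

Brute-force simulation of this kind becomes infeasible for realistic probe lengths and subset sizes, which is the motivation for analytical bounds.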
Panel A of Figure 2 shows our lower and upper bounds on the probability of no m-mismatched ℓ-mers in the subset S for a fixed subset size s, ℓ = 40 and p = 0.5. The lower and upper bounds on this probability are close to 1 when m is small, then decrease to 0 for m between 16 and 24 and increase again for m near ℓ = 40.
Panels C and E show the lower and upper bounds for the same s and ℓ = 40 but with different CG-ratios, p = 0.28 and p = 0.2 respectively. The bound curves in Panels C and E have a similar trend as in Panel A. For smaller values of the CG-ratio, the sharp decrease from 1 to 0 happens for smaller m. Also, the increase happens for larger m, and does not occur for p = 0.2 in Panel E.

The panels of Figure 2 also show that the lower and upper bounds on the probability of no m-mismatched ℓ-mers in the subset S are almost equal to 1 when m is small. For example, the vertical scale in Panel A does not effectively distinguish the bounds from 1 for m < 15. A better visualization of this practically important range is achieved by applying the log10 function to 1 minus the bounds, as shown in the second-column panels of Figure 2. These transformed plots clearly show the order of magnitude of the difference between the probability and 1. For example, the dashed red and pink lines in the plot show that to have a probability within 0.001 = 10^-3 of 1, we need m ≤ 13, 8, or 6 for the CG-ratios p = 0.5, 0.28, or 0.2 respectively, and to be within 10^-6 of 1, m ≤ 10, 5, or 3 respectively.

As mentioned above, the combinatorial problem (Graham, 1995; Riordan, 2012) of finding a general closed form for the probability of no m-mismatched ℓ-mers in the subset S is very challenging. This motivated us to instead find closed forms for the lower and upper bounds on this probability. Without loss of generality, we assume that the CG-ratio p ≤ 0.5.

First, we need to introduce some notation. Given ℓ, m and s, we define N_i^l and N_i^u, i = 1, ..., s, as follows.

If i ≤ 2^ℓ, N_i^l = i p^ℓ and N_i^u = i(1 - p)^ℓ.

If 2 l k= < i 2 l k=, where l, then N l i = 2 l k= p k ( p k + p l ( p l (i 2 l k= and N u i = 2 l k= ( p k p k + ( p l p l (i 2 l k=.

In addition, we need to define M_i^l and M_i^u, i = 1, ..., s, as follows.

If i ( m < 2, then Mi l = i ( m ( + p m p m and Mi u = i ( m (2 p m ( p m.

If 2 l ( k= < i m 2 l k=, where l m, then M l i = 2 l k= ( + p m ( p k p m k + [i ( m 2 l k= ]( + p m ( p l p m l and Mi u = 2 l k= (2 p m ( p m k p k + [i ( m 2 l k= ](2 p m ( p m l p l.

If 2 l k= ( < i m 2 l k=, where m < l, then M l i = 2 m k= (+p m ( p k p m k +2 l k= m (2 p k +m (+ p k ( p m + [i ( m 2 l k= ](2 p l +m ( + p l ( p m and Mi u = 2 m k= (2 p m p k ( p m k + 2 l k= m ( + p k +m (2 p k p m +[i ( m 2 l k= ](+p l +m (2 p l p m.

Note that N_i^l ≤ N_i^u and M_i^l ≤ M_i^u, i = 1, ..., s, and equality holds when the CG-ratio p = 0.5.

A convenient notation is as follows. The number of possible results for the first ℓ-mer is M = (2p + 2 - 2p)^ℓ = 2^ℓ. Given i ℓ-mers with no m mismatches, the number of possible ℓ-mers that are m-mismatched with the given ℓ-mers is greater than or equal to M_i^l and less than or equal to M_i^u. Given two different ℓ-mers that are not m-mismatched, the number of possible ℓ-mers H that are m-mismatched with one of them is { M l H + ( m 2 m p, m < M l + ( + p p, m =. Let the lower bound be H l = M l + ( m 2 m p for m < and H l = M l + ( + p p for m =.

Lower and upper bounds on the probability of no m-mismatched ℓ-mers in the subset S are:

If M M u s N u s <, then P [no m mismatches] =. (1)

If M M u s N u s, then ( s i= (M Mi u Ni u (M Ni l P [no m mismatches] ( M M l s ( M max { } H l, Ni u i=2 s i= (M N u i. (2)

The derivation of equations (1) and (2) is shown in the supplementary material.
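The value M = 2^ℓ for the first ℓ-mer can be checked numerically for small ℓ. The sketch below assumes one particular reading of the notation, namely that each A or T contributes a factor p and each C or G a factor 1 - p to the weight of an ℓ-mer, so that the weights over all ℓ-mers sum to (2p + 2(1 - p))^ℓ. This weighting is an interpretation, not something the text states explicitly, and the function names are hypothetical.

```python
from itertools import product


def probe_weight(probe, p):
    """Weight of one probe under the assumed scheme: a factor p for each
    A or T and a factor (1 - p) for each C or G."""
    weight = 1.0
    for base in probe:
        weight *= p if base in "AT" else 1.0 - p
    return weight


def total_weight(length, p):
    """Sum of the weights over all 4**length probes of the given length."""
    return sum(probe_weight(probe, p) for probe in product("ATCG", repeat=length))


# For every CG-ratio the total should be close to 2**4 = 16.
for p in (0.5, 0.28, 0.2):
    print(p, total_weight(4, p))
```

Under this reading the total does not depend on p, which is consistent with M being the same for every CG-ratio.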
Figure 3: Bounds on the probability of no m-mismatched ℓ = 40-mers in subsets with different sizes, plotted against s and m. This extends Figure 2 to a range of different values of s (the size of the subset). It continues to show large probability for small m (more so for small s), for both the lower (Panels A, C and E) and the upper (Panels B, D and F) bounds. The bounds remain close, indicating good approximation quality, over a range of different CG-ratios, p = 0.5 (A and B), 0.28 (C and D) and 0.2 (E and F).

Discussion

Given the CG-ratio p, equations (1) and (2) give bounds on the probability of no m-mismatched ℓ-mers in the subset S as functions of s and m. As an example we modeled the bounds for a fixed subset size s in Figure 2. This showed that for probes (ℓ-mers) with 50% (p = 0.5) CG-content, even if we allow as many as 10 mismatches in a probe of length ℓ = 40, the probability that any two probes in this set anneal to each other is 1 in 1,000,000, i.e. very unlikely. The situation becomes less favorable as the CG-content ratio becomes more skewed. At 20% (p = 0.2) CG-content, such as experienced in certain microorganisms (mycoplasma has 24% CG-content), allowing for 8 mismatches yields a chance of 1 in 1,000 that any two probes would anneal to each other. Because our model is symmetric around 50% CG-content, the same reasoning applies to positively skewed ratios such as found in streptomyces species, which average 72% CG-content. Hence, multiplex assays with > probes are limited to organisms with balanced CG-content.

Next, we explore how the bounds change when both subset size s and mismatch number m change.
We use Figure 3 to illustrate the lower and upper bounds as functions of both the size of the subset s and the number of allowed mismatches m, for the CG-ratios p = 0.5, 0.28, 0.2. For practical applications, we want to maximize the size of the subset s, which increases the degree of multiplexing, and we want to minimize the number of allowed mismatches m, which increases specificity. As seen from Figure 3, a practical limit of the degree of multiplexing again depends on the CG-content of the target organism. Up to a set size of s = probes, though, it is extremely unlikely that any two probes in a multiplex assay would bind to each other. This assumes large probes of length ℓ = 40 or longer, as used in the Nanostring TM assay. The chance of unwanted cross-hybridization increases as the probe length decreases. At a probe length of ℓ = 20, such as used in multiplex PCR, and applied to the worst-case scenario of a microorganism with heavily skewed CG-content, allowing as few as 2 mismatches per probe may result in cross-hybridization between probes in the probe set. Luckily, Homo sapiens has a balanced CG-content, which allows the use of highly multiplexed assays for clinical applications.

Current solid-state microarrays can achieve a size of s = 1.8 x 10^6 different probes per chip and m = 1, since they can detect single nucleotide polymorphisms (SNPs). Based on our calculation, we can answer the question: can a solution-based, multiplex design reach or exceed this performance? For ℓ = 40, s = 1.8 x 10^6 and m = 1, the probability of no m mismatches is within 10^-9, 10^-3 and 2 x 10^-2 of 1 for the CG-ratios p = 0.5, 0.28, or 0.2 respectively. Hence, solution-based SNP arrays based on probe sizes of 40 or longer have comparable performance to solid-state microarrays only for balanced (CG-ratio p = 0.5) probes. If the CG-ratio drops, as is known for many microbial genomes, solution-based SNP arrays underperform due to cross-hybridization among probes.

Mismatch and mismatch probability have a concrete biophysical meaning, see Cantor and Schimmel (1980). Every match lowers the free energy G of the probe-target duplex and every mismatch m increases G. Every probe-target duplex has a characteristic melting temperature T, which is a function of G. We show here that it is extremely unlikely that in a set S of size s < we would encounter any pair of probes of length ℓ = 40 with m < 13, 8, or 6 (corresponding to the CG-ratio p = 0.5, 0.28 or 0.2) mismatches between them.

In sum, the current multiplex assays (e.g. nanostring TM, DASL TM) are expected to work and have a large margin of error built in before they encounter the theoretical boundaries which we derived here. As we move into higher and higher modes of multiplexing, it is important to know the principal boundaries of each design. As it is no longer possible to experimentally test all possible failure scenarios or experimentally validate the performance of each and every probe, our theoretical understanding needs to improve to near certainty. Otherwise the true potential of highly multiplexed methods cannot be realized.

Acknowledgements. This work was partially supported by the Startup Fund of University of South Florida, and public health service grants CA94 and AI78 to DPD.

References

Bishop, M. A., D'yachkov, A. G., Macula, A. J., Renz, T. E. and Rykov, V. V. (2007) Free energy gap and statistical thermodynamic fidelity of DNA codes. Journal of Computational Biology 14(8), 1088–1104.
Cantor, C. R. and Schimmel, P. R. (1980) Biophysical chemistry: Part III: the behavior of biological macromolecules. Macmillan.

D'yachkov, A. G., Vilenkin, P. A., Ismagilov, I. K., Sarbaev, R. S., Macula, A., Torney, D. and White, S. (2005) On DNA codes. Problems of Information Transmission 41(4), 349–367.

Dyachkov, A. G. and Voronina, A. N. (2009) DNA codes for additive stem similarity. Problems of Information Transmission 45(2), 124–144.

Graham, R. L. (1995) Handbook of combinatorics, Volume 1. Elsevier.

Riordan, J. (2012) Introduction to combinatorial analysis. Courier Corporation.

White, A. K., VanInsberghe, M., Petriv, I., Hamidi, M., Sikorski, D., Marra, M. A., Piret, J., Aparicio, S. and Hansen, C. L. (2011) High-throughput microfluidic single-cell RT-qPCR. Proceedings of the National Academy of Sciences 108(34), 13999–14004.