Haplotyping estimation from aligned single nucleotide polymorphism fragments has attracted increasing

Size: px

Start display at page:

Download "Haplotyping estimation from aligned single nucleotide polymorphism fragments has attracted increasing"

Albert Lambert
5 years ago
Views:

1 INFORMS Journal on Computing Vol. 22, No. 2, Spring 2010, pp issn eissn informs doi /ijoc INFORMS A Class Representative Model for Pure Parsimony Haplotyping Daniele Catanzaro Graphs and Mathematical Optimization, Computer Science Department, Université Libre de Bruxelles, B-1050 Brussels, Belgium, dacatanz@ulb.ac.be Alessandra Godi Istituto di Analisi dei Sistemi ed Informatica, Consiglio Nazionale delle Ricerche, Roma Italy, alessandragodi@inwind.it Martine Labbé Graphs and Mathematical Optimization, Computer Science Department, Université Libre de Bruxelles, B-1050 Brussels, Belgium, mlabbe@ulb.ac.be Haplotyping estimation from aligned single nucleotide polymorphism fragments has attracted increasing attention in recent years because of its importance in the analysis of fine-scale genetic data. Its application fields range from mapping of complex disease genes to inferring population histories, passing through designing drugs, functional genomics, and pharmacogenetics. The literature proposes several criteria for haplotyping populations, each of them characterized by biological motivations. One of the most important haplotyping criteria is parsimony, which consists of finding the minimum number of haplotypes necessary to explain a given set of genotypes. Parsimonious haplotype estimation is an -hard problem for which the literature has proposed several integer programming (IP) models. Here, we describe a new polynomial-sized IP model based on the concept of class representatives, already used for the coloring problem. We propose valid inequalities to strengthen our model and show, through computational experiments, that our model outperforms the best IP models currently known in literature. Key words: haplotype inference; computational biology; integer programming History: Accepted by Harvey Greenberg, former Area Editor for Computational Biology and Medical Applications; received October 2008; revised January 2009, February 2009, March 2009; accepted March Published online in Articles in Advance July 29, Introduction Diploid organisms, such as humans, are characterized by having the DNA organized in pairs of chromosomes, of which one copy is inherited from the father and the other from the mother. The recent completion of the sequencing phase of the Human Genome Project (Venter et al. 2001) showed that such copies are extremely similar and that the genomes of two different individuals are identical in more than 99% of the overall number of nucleotides. Nevertheless, differences at the genomic level (also known as polymorphisms) occur, on average, every 1,000 bases (Chakravarti 1998) and are (excluding the recombination process) the predominant form of human variation as well as of genetic diseases (Hoehe et al. 2000, Terwilliger and Weiss 1998). When a site (i.e., the position of a specific nucleotide) of the genome shows a statistically significant variability within a population (i.e., a set of individuals) it is then called a single nucleotide polymorphism (SNP). Specifically, a site is considered a SNP if for a minority of the population a certain nucleotide is observed (called the least frequent allele) while for the rest of the population a different nucleotide is observed (the most frequent allele). For a given SNP, an individual can be either homozygous (i.e., possess the same allele on both chromosomes) or heterozygous (i.e., possess two different alleles). The values of a set of SNPs on a particular chromosome region define a haplotype. Haplotyping an individual therefore consists of determining, for each copy of a given chromosome region, a pair of haplotypes. Haplotyping populations of individuals has attracted an increasing amount of attention in recent years (Helmuth 2001, Marshall 1999) because of its importance in the analysis of fine-scale genetic data (Clark et al. 1998, Schwartz et al. 2002). For example, haplotypes are necessary in evolutionary studies to extract the information needed to detect diseases and to reduce the number of tests to be carried out. In 195

2 196 INFORMS Journal on Computing 22(2), pp , 2010 INFORMS functional genomics, haplotypes are used to discover a functional gene or to study an altered response of an organism to a particular therapy. In human pharmacogenetics, haplotypes explain why people react differently to different types or amounts of drugs. In fact, because SNPs affect the structure and function of proteins and enzymes, they may influence the way in which a drug is absorbed and metabolized. Direct sequencing of haplotypes via experimental methods is both time-consuming and expensive, and therefore current molecular sequencing methods generally provide more general genotype information. Specifically, genotype data provide information about the multiplicity of each SNP allele of a given individual, i.e., knowledge about its homozygous or heterozygous nature. Unfortunately, a drawback of using genotype data is that information regarding which heterozygous site SNP variants came from the same chromosome copy remains unknown (Wang and Xu 2003). Hence, in silico haplotyping methods become attractive alternatives and in some cases the only viable way for haplotyping populations (Lancia et al. 2004). The simplest way for haplotyping a population is described in Bonizzoni et al. (2003), Gusfield (2003), and Lancia et al. (2004) and can be resumed as follows: first, to experimentally obtain genotype data, and subsequently, for each individual, to retrieve the haplotypes computationally i.e., to find a set of haplotypes such that, if they are assumed to be the corresponding set of chromosome copies, then computing the multiplicity of each SNP allele one can obtain exactly the genotypes given. However, this approach requires the presence of some haplotyping criterion. Several criteria have been proposed for haplotyping populations, each of them based on biological motivations (see, for example, Bafna et al. 2003, Clark et al. 1998, Eskin et al. 2003, Excoffier and Slatkin 1995, Fallin and Schork 2000, Lancia and Rizzi 2006, Niu et al. 2002, Qin et al. 2002, Stephens and Donnelly 2003, Stephens et al. 2001). In this article we consider the parsimony criterion (Lancia et al. 2004). The idea at the core of the parsimony criterion is that under many plausible explanations of an observed phenomenon, the one requiring the fewest assumptions should be preferred (Semple and Steel 2003). Hence, because the number of distinct haplotypes observed in a population is much smaller than the number of possible haplotypes, the parsimony-based approaches aim to determine the minimum number of different haplotypes that, combined in pairs over time, have given rise to a set of observed genotypes (Lancia et al. 2004). The problem of haplotyping populations under parsimony (hereafter denoted as the pure parsimony haplotyping (PPH) problem) is known to be -hard (Lancia et al. 2004). This result justifies the development of several enumerative optimization algorithms that aim to solve exactly instances of PPH. Specifically, Gusfield (2003) first proposed an integerprogramming model to tackle instances of PPH. The author described a model, exponential in size, characterized by two kinds of variables one for haplotypes and the other for haplotype pairs and by the exhaustive generation of the set of all haplotypes compatible with some genotype in the input. Similar integer programming models were also used by Brown and Harrower (2004) and Bertolazzi et al. (2008). To minimize the number of distinct haplotypes, Brown and Harrower (2004) proposed constructing haplotype vectors by associating a variable to each site; they subsequently used constraints to establish the exact haplotype structures. On the other hand, Bertolazzi et al. (2008) first formulated PPH as a minimization problem characterized by a polynomial number of variables and constraints. Then the authors turned the problem into a maximization problem and strengthened the model by using clique inequalities, symmetry breaking, inequalities, and dominance relations. Whereas Gusfield (2003) and Bertolazzi et al. (2008) used commercial mixed-integer programing solvers (CPLEX and Xpress-MP, respectively) to get solutions to their models, Brown and Harrower (2004) used a branch-and-cut algorithm to solve their polynomial model. A comparison of their results shows that the Brown and Harrower (2004) polynomial model is well suited for big dimension samples, whereas the Gusfield (2003) and Bertolazzi et al. (2008) models are more efficient for medium dimensions, and specifically, when the recombination level (i.e., the parameter that affects the structure of the haplotypes) increases. So far, to our best knowledge, data sets containing 68 genotypes and 75 SNPs represent the limit size instances of PPH that can be exactly analyzed (Brown and Harrower 2006). In this article we present a new polynomialsized integer programming (IP) model for PPH. (See the Online Supplement, available at pubs.informs.org/ecompanion.html/, for codes and data.) The idea at the core of the model is based on class representatives already proposed by Campelo et al. (2008) for the coloring problem (Garey and Johnson 2003). Computational experiments show that our model outperforms the existing polynomial IP models for PPH (Bonizzoni et al. 2003, Brown and Harrower 2006, Gusfield 2003, Lancia et al. 2004) both on real and simulated data, and allows haplotyping of populations containing hundreds of genotypes and SNPs. The model is compact, easy to implement, solvable with standard solvers, and usable in those cases for which the parsimony criterion is well suited for haplotyping populations. 2. Notation and Problem Formulation Assume that a population of diploid individuals is given and that for each individual the knowledge of

3 INFORMS Journal on Computing 22(2), pp , 2010 INFORMS 197 Individual Chromosome region DNA sequence Paternal A T A G C T G C C C A A A T Maternal A T A G G T G C C C A T A T Paternal haplotype C C A } Maternal haplotype G C T genotype Figure 1 SNP1 SNP2 SNP3 An Example of Haplotypes and SNPs a specific (paternal and maternal) chromosome region is available. The sites of the chromosome regions from the individuals at which polymorphism arises are called SNP. Given a SNP, the most frequent allele in the population is encoded with the value 0 and the least frequent allele with the value 1. For an individual in the population, a SNP is homozygous if the corresponding paternal and maternal nucleotides are equal, and heterozygous otherwise. The set of SNPs of a paternal (or maternal) chromosome region of an individual defines a haplotype. Haplotypes can be encoded as strings of fixed length over an alphabet = 0 1. For example, consider the paternal and maternal chromosome regions from an individual as shown in Figure 1. Encode the first SNPs (i.e., C and G ) as 0 and 1, respectively; the second SNPs (i.e., C and C ) as 1; and the third SNPs (i.e., A and T ) as 0 and 1, respectively. Then the paternal haplotype is the string (010) and the maternal haplotype is the string (111). Because at each SNP only three possibilities can arise (homozygous of type 0 or 1, or heterozygous), a genotype can be encoded as strings of fixed length over an alphabet = 0 1 2, where entries equal to 0 (or 1) denote homozygous sites of type 0 (or 1), and entries equal to 2 denote heterozygous sites. Given a pair of haplotypes h i and h j, define the operator sum among h i and h j as the genotype g k whose pth entry is h ip if h ip = h jp, and 2 otherwise. For example, the genotype obtained by summing the paternal and maternal haplotypes in Figure 1 is g k = 212. We say that a genotype g k is resolved from the pair of haplotypes h i and h j if g k = h i h j. An instance of PPH is specified by indicating a set of m genotypes. A feasible solution to PPH is constituted by (i) a set H of (at most 2m (Lancia et al. 2004)) haplotypes and (ii) for each genotype g k, a pair of haplotypes, say, h i h j, resolving g k such that g k = h i h j. An optimal (or minimum) solution to PPH is a feasible solution in which the set H has the minimum cardinality (Blain et al. 2009). As an example, an instance of PPH and two alternative (minimum) solutions are shown in Figure 2. It is worth nothing that, although having the same set of haplotypes, Solutions 1 and 2 of Figure 2 are different. Bipartite graphs can be used to represent feasible solutions to PPH (see Blain et al. 2009). Specifically, a feasible solution can be represented by such a graph H E in which each vertex g k is of degree 2 and the two other vertices, say, h i and h j, adjacent to g k satisfy g k = h i h j (i.e., h i and h j are mates for g k ; see Blain et al. 2009). The bipartite graphs corresponding to Solutions 1 and 2 of Figure 2 are depicted in Figures 3(a) and 3(b), respectively. In the bipartite graph representation of a solution, each haplotype is incident to a subset of genotypes. The family of the subsets of genotypes induced by the haplotypes of a feasible solution satisfies the following three properties: (1) each subset of genotypes shares one haplotype, (2) each genotype belongs to exactly two subsets, and (3) every pair of subsets intersects in at most one genotype. As an example, the haplotypes of Solution 1 in Figure 2 induce the family of subsets of Figure 4(a) satisfying properties (1) (3). Specifically, the subsets are induced by the following four haplotypes: h 1 = 1100 inducing the subset S 1 = , h 2 = 0011 inducing the subset S 2 = , h 3 = 1001 inducing the subset S 3 = 1202, and finally, h 4 = 0110 inducing the subset S 4 = Similarly, the haplotypes of Solution 2 in Figure 2 induce the family of subsets of Figure 4(b) satisfying properties (1) (3). Specifically, the subsets are induced by the following four haplotypes: h 1 = 1100 inducing the subset S 1 = 1202, h 2 = 0011 inducing the subset S 2 = 0212, h 3 = 1001 inducing the subset S 3 = , and finally, h 4 = 0110 inducing the subset S 4 = Integer Programming Models In this section we describe an IP model for PPH. First, we describe a basic model that requires a polynomial number of variables and constraints. We subsequently show how to reduce the model size by exploiting the particular structure and properties of the variables involved. Finally, we provide valid inequalities to further strengthen the model Basic Model The integer programming model for PPH that we propose is based on the class representatives with the smallest index, similar to the one proposed by Campelo et al. (2008) to tackle the coloring problem (Garey and Johnson 2003). Following the properties of a feasible solution presented in the previous section, we associate an index to each subset S of genotypes induced by a haplotype h. Specifically, if i is the smallest index of a genotype belonging to S, then i is the index associated to S and the subset will be denoted as S i. Because each genotype g k belongs to exactly two subsets, it may happen that g k is itself the genotype with the smallest index in

4 198 INFORMS Journal on Computing 22(2), pp , 2010 INFORMS Instance of PPH Genotypes SNP Genotype Genotype Figure 2 Haplotypes Haplotype 1 Haplotype 2 Haplotype 3 Haplotype 4 Haplotypes Haplotype 1 Haplotype 2 Haplotype 3 Haplotype 4 Genotype 3 SNP SNP Minimum solution 1 Minimum solution Genotype 1 = Haplotype 1 Haplotype 2 Genotype 2 = Haplotype 1 Haplotype 3 Genotype 3 = Haplotype 2 Haplotype 4 Genotype 1 Graphical Representation of an Instance of PPH and Two Alternative Solutions both subsets. In this case a dummy genotype g k is added, and the subset S k is created. As an example, one can imagine that the haplotype h 1 induces the subset S i = g i g j g k, h 2 induces the subset S i = g i g l g r g s, h 3 induces the subset S k = g k g l g s g t, and so on. We remark that the index k can be considered only if k was previously used, i.e., if the subset S k already exists. For the sake of Figure 3 h 1 h 2 h 3 h 4 (a) g 1 g 2 g 3 h 1 h 2 h 3 h 4 (b) Bipartite Graph Representation of Solutions 1 and 2 of Figure 2 g 1 g 2 g 3 Conflation Conflation = Haplotype 3 Haplotype 4 Genotype 2 = Haplotype 1 Haplotype 3 Genotype 3 = Haplotype 2 Haplotype 4 notation, in the rest of the paper we will refer to the genotype g k as the genotype k. Because at most 2m haplotypes are necessary to resolve m genotypes (Lancia et al. 2004), the indices i of the subsets S i can vary inside K K, where K = 1 m and K = 1 m. Assume that an order is defined on K K in such a way that 1 < 1 < 2 < 2 < <m<m. Define x i, i K K, as a decision variable equal to 1 if, in the solution, there exists a haplotype inducing a subset S i of genotypes whose smallest index genotype is i, and 0 otherwise. Denote y k ij, k K i j K K, i<j, as a decision variable equal to 1 if the genotype k belongs to the subsets S i and S j, and 0 otherwise. Finally, denote as the set of the p SNPs characterizing each genotype g k in, and let z ip, i K K, p, be a decision variable equal to 1 if the haplotype inducing the subset S i of genotypes has such a value at the pth site, and 0 otherwise. Variables z ip describe explicitly the haplotypes of the solution. Then, an integer programming model for PPH follows: Formulation 1. Basic Model BM min x i (1) i K K s.t. x i x i i K (2)

5 INFORMS Journal on Computing 22(2), pp , 2010 INFORMS 199 (a) (b) Figure 4 Genotype 1 = 2222 Genotype 1 = 2222 Haplotype 1 = 1100 Haplotype 4 = 0110 Haplotype 3 = 1001 Haplotype 2 = 0011 Genotype 2 = 1202 Genotype 2 = 1202 Haplotype 1 = 1100 Genotype 3 = 0212 Haplotype 4 = 0110 Haplotype 3 = 1001 Genotype 3 = 0212 Haplotype 2 = 0011 Examples of Subsets of Genotypes Induced by the Two Alternative Solutions of Figure 2 i j K K y k ij j K K j i 1 k K (3) y k ij + j K K j<i y k ji x i k K i K K (4) y k kk x k k K (5) z kp = z k p = 0 k K p g kp = 0 (6) z kp = z k p = 1 k K p g kp = 1 (7) z kp + z k p = 1 k K p g kp = 2 (8) z ip 1 y k ij j K K j i y k ji j K K j<i p k K g kp =0 i K K i k k (9) z ip y k ij + j K K j i y k ji j K K j<i p k K g kp =1 i K K i k k (10) z ip + z jp y k ij p k K g kp = 2 i j K K (11) z ip + z jp 2 y k ij p k K g kp = 2 i j K K (12) x i z ip y k ij 0 1 (13) The objective function (1) represents the number of distinct haplotypes or equivalently the cardinality of H. Because the index i is considered only if i is already used, constraints (2) imply that if the haplotype h i is not used, then h i should not be used. Constraints (3) impose that each genotype g k must belong to exactly two subsets S i S j, and constraints (4) force x i to be 1, i.e., to take haplotype h i into account, if some genotype g k is resolved by h i. Constraints (5) are a consequence of the definition of the dummy genotype k ; actually, they constitute a special version of constraints (4) when genotype k is resolved by haplotype k. Constraints (6) (8) translate the sum operation among haplotypes. Specifically, constraints (6) impose that the pth SNP of the haplotype h i (h i ), which induces the subset S i (S i ), must be set to 0 when genotype g i has its pth SNP equal to 0. By analogy, constraints (7) impose that the pth SNP of the haplotype h i (h i ) must be set to 1 when genotype g i has its pth SNP equal to 1. Constraints (8) impose that exactly one among the pth SNPs of the haplotypes h i and h i can be set to 1 when genotype g i has its pth SNP is equal to 2. Constraints (9) establish the relations between variables z ip and yij k. Specifically, they force the pth SNP of the haplotype h i to be equal to 0 when at least one genotype g k, whose pth entry is equal to 0, belongs to the induced subset S i. By analogy, constraints (10) force the pth SNP of the haplotype h i to be equal to 1 when at least one genotype g k, whose pth entry is equal to 1, belongs to the induced subset S i. Finally, constraints (11) and (12) impose that the sum operation among haplotypes is respected for any pair of haplotypes h i and h j in the solution Reducing Model Size It is worth noting that the particular nature of the set of indices K K can be exploited to reduce the size of BM. In fact, given that yij k = 1 if and only if k belongs to two subsets having g i and g j for the smallest index genotype, we need to define variables yij k only when i<j k or i = k and j = k. For example, variable y1 2 1 does not need to be defined as well as all variables yii k for all i k K, k i. Similarly, variables yk ik or y k k i (depending on whether k is smaller or larger than i) do not need to be defined for i K K with i k and i k. In fact, if y k ik = 1, then k belongs to two subsets, one represented by i and the other by k, which contradicts the assumption that the dummy genotype k can be considered only if k is already used. By extending this analysis to all the possible cases in which variables yij k are redundant and assuming that variable y11 1 = 1, it is easy to see that variables yk ij do not need to be defined whenever they belong to one of the following sets: R 1 = y k ij k K i j K K j<i<k (14)

6 200 INFORMS Journal on Computing 22(2), pp , 2010 INFORMS R 2 = y k ik k K i K K i k 1 (15) R 3 = y k ii k K i K K 2 i k 1 (16) The sets of redundant variables can be further expanded by observing that for each triplet of genotypes g i g j g k such that the respective pth SNP is g ip = 0, g jp = 0, and g kp = 2, variable y k ij is necessarily equal to 0 since the containment of the genotype g k to the subsets S i and S j would violate the sum operator among haplotypes at least at the pth SNP. Extending this argument to all the possible combinations of triplets of genotypes that violate the haplotype sum operator, we see that the following proposition, whose proof is omitted, holds: Proposition 1. The set of variables R 4 = y k ij i j k K p g kp = 2 g ip = g jp 2 (17) is redundant. Similar arguments can be used to prove the following proposition: Proposition 2. The set of variables R 5 = { y k ij i j k K p g kp + g ip = 1 or g kp + g jp = 1 } (18) is redundant. Proof. By contradiction, if the set R 5 is not redundant, then for some p, genotype k may belong to two subsets S i and S j. Without loss of generality, assume g kp = 0 and g ip = 1. Because haplotype h i is associated with the subset S i and explains genotype i, it must have the pth SNP equal to 1; otherwise, the sum operator would be violated. In turn, this implies that genotype k cannot be resolved by h i because a necessary condition for its resolution is that the pth SNP of h i be equal to 0. Hence, independent of j, yij k = 0 in any feasible solution of PPH. Note that removing the redundant variables yij k of Propositions 1 and 2 can be performed in O m 3 n. Finally, a similar process of reduction can be applied to the variables z ip both by removing those whose value is fixed by constraints (6) and (7) and by the set of redundant variables (17) and (18). In this way, only variables z ip involved in constraints (8) need to be defined. In conclusion, denoting Y as the set of variables yij k, R = q R q, and Y = Y \R, we can be reduce BM to the following: Formulation 2. Reduced Model RM min x i (19) i K K s.t. x i x i i K (20) i j y k ij Y \R y k ij y k ij + j yij k Y \R j i 1 k K (21) j y k ij Y j<i y k ji x i k K i y k ij Y (22) y k kk x k k K (23) z kp + z k p = 1 k K p g kp = 2 (24) z ip 1 y k ij p k K z ip z ip y k ij z jp y k ij j y k ij Y j i j y k ij Y j i z ip 1 y k ij z jp 1 y k ij z ip + z jp y k ij y k ji j yij k Y j<i i y k ij Y g kp = 0 g ip = 2 (25) y k ij + p k K y k ji j yij k Y j<i i y k ij Y g kp = 1 g ip = 2 (26) p k K i j y k ij Y g ip = 2 g jp = 0 g kp = 2 (27) p k K i j y k ij Y g ip = 0 g jp = 2 g kp = 2 (28) p k K i j y k ij Y g ip = 2 g jp = 1 g kp = 2 (29) p k K i j y k ij Y g ip = 1 g jp = 2 g kp = 2 (30) p k K i j y k ij Y g ip = 2 g jp = 2 g kp = 2 (31) z ip + z jp 2 y k ij p k K i j y k ij Y g ip = 2 g jp = 2 g kp = 2 (32) x i z ip y k ij 0 1 (33) Note that the integrality of variables yij k suffices to guarantee the integrality of variables x i. Hence, we denote by RMM the formulation in which the integrality condition on variables x i is relaxed Strengthening Inequalities In this section, we provide valid inequalities to strengthen RM. Proposition 3. The inequality k K y k ij Y y k ij x i i j y k ij Y (34)

7 INFORMS Journal on Computing 22(2), pp , 2010 INFORMS 201 Proof. Two subsets S i and S j of a feasible solution can share at most one genotype. So it follows that k K yij k 0 1. If the solution is such that k K yij k = 0, then the inequality reduces to x i 0, which is trivially valid. If the solution is such that k K yij k = 1, then genotype k belongs to the subset represented by genotype i, which in turn implies x i = 1, and the inequality is again valid. Similarly, it is easy to see that the following proposition holds: Proposition 4. The inequality y k ij x j i j y k ij Y (35) k K y k ij Y Note that k K y k ij 0 1 also implies k S y k ij 0 1, for any S K. Let 1 k = k K g kp = 1. The following proposition holds: Proposition 5. The inequality z ip y k ij p i j y k ij Y (36) k 1 k yk ij Y Proof. If the solution is such that k k 1 yk ij = 0, then the inequality reduces to z ip 0, which is trivially valid. If the solution is such that k k 1 yk ij = 1, then the inequality reduces to z ip = 1, which is again valid. Note that inequality (36) only applies if g ip = 2. In fact, if g ip = 1, (36) is redundant, and if g ip = 0, the corresponding variables yij k are not defined. By analogy, fixing i, j, and p, consider 0 k = k K g kp = 0. Then, the following proposition holds: Proposition 6. The inequality z ip x i y k ij p i j y k ij Y (37) k 0 k yk ij Y Proof. If the solution is such that k k 0 yk ij = 0, then the inequality reduces to z ip x i, which is trivially valid. If the solution is such that k k 0 yk ij = 1, then genotype k belongs to the subset represented by genotype i, which implies x i = 1 and thus z ip = 0. Also in this case, the inequality (37) only applies if g ip = 2. In fact, if g ip = 0 and if g ip = 1, the corresponding variables yij k are not defined. Defining 2 k = k K g kp = 2, we have the following proposition: Proposition 7. The inequality z ip + z jp y k ij + 2 k 2 k yk ij Y k 1 k yk ij Y y k ij p i j y k ij Y (38) Proof. The right-hand side of the inequality k k 2 yk ij + 2 k k 1 yk ij If the solution is such that k k 2 yk ij + 2 k k 1 yk ij = 0, then the inequality reduces to z ip + z jp 0, which is trivially valid. If the solution is such that k p 2 yk ij = 1, then the inequality reduces to z ip + z jp 1, which is still valid. Finally, if the solution is such that k k 1 yk ij = 1, then the inequality reduces to z ip +z jp = 2, which is again valid. In fact, if haplotypes h i and h j resolve genotype k, then they must be characterized by having the pth SNP equal to 1. Similar arguments can be used to prove the following propositions: Proposition 8. The inequality z ip + z jp x i + x j y k ij 2 k 2 k yk ij Y k 0 k yk ij Y y k ij p i j y k ij Y (39) A distinct class of inequalities are obtained by introducing the concept of conflict among genotypes (Bertolazzi et al. 2008) at the pth SNP. Specifically, consider a pair of two genotypes g i g j and assume that there exists a conflict; i.e., one of the two genotypes (e.g., g i ) is such that g ip = 1, whereas the corresponding entry of the other is such that g jp = 0 (or vice versa). Then this condition is sufficient to exclude common haplotypes explaining g i g j because their existence would violate the haplotype sum operator. The rationale can be extended to all the subset of genotypes having conflicts and can be formalized as follows. Define CG as a graph whose set of vertices is the set of genotypes and whose set of edges contains all the pairs of genotypes g i g j that are in conflict. Then, the following proposition holds: Proposition 9. Let C be a clique in CG, then the inequality y k ij x i i y k ij Y C CG (40) k C j yij k Y j i Proof. Only one genotype in C can be shared by two subsets S i and S j of a feasible solution. This implies that k C j yij k Y j i yk ij 0 1. If the solution is such that k C j yij k Y j i yk ij = 0, then the inequality reduces to x i 0, which is valid. If the solution is such that k C j yij k Y j i yk ij = 1, then the genotype k belongs to the subset represented by genotype i, which in turn implies x i = 1; hence, the inequality is again valid.

8 202 INFORMS Journal on Computing 22(2), pp , 2010 INFORMS Proposition 9 can be further extended to variables z ip. Specifically, define the set Ep 1 = g i g j 1 k : q q p g iq + g jq = 1 for all p, and consider the graph CG 1 p = 1 k E1 p CG. Then the following proposition holds: Proposition 10. The inequality z ip y k ij + k C 1 j yij k Y j i k C 1 j yij k Y j<i y k ji p i y k ij Y clique C 1 CG 1 p (41) By analogy, defining the set Ep 0 = g i g j 0 k : q q p g iq + g jq = 1, for all p, and considering the graph CG 0 p = 0 k E0 p CG, we have the following proposition: Proposition 11. The inequality z ip x i y k ij k C 0 j yij k Y j i k C 0 j yij k Y j<i y k ji p i y k ij Y clique C 0 CG 0 p (42) The proofs of Propositions 10 and 11 are quite similar to the ones for Propositions 7 and 8 and are omitted. Finally, observe that Propositions 10 and 11 imply Proposition 9 and only apply if g ip = Experiments In this section we analyze the performances of our model to solve the pure parsimony haplotyping problem. Our experiments were motivated by a number of goals namely, to evaluate, with respect to BM, the benefits obtained by removing the redundant variables and including the valid inequalities previously described; to compare the performances of our model with the ones obtained by Brown and Harrower s HybridIP model (Brown and Harrower 2006), currently the best model for PPH; and finally, to allow the analysis of larger data sets with respect to the ones currently analyzed. Similar to Brown and Harrower (2006), we emphasize that our experiments aim simply to evaluate the runtime performance of our model for solving PPH. We neither attempt to study the efficiency of PPH for haplotype inference nor compare the accuracy of our algorithm to haplotype inference solvers that do not use the parsimony criterion; this analysis has been performed by both Gusfield (2003) and Wang and Xu (2003), and we refer the interested reader to these articles Implementation We implemented BM and RM by means of Mosel 2.0 of Xpress-MP, Optimizer version 18, running on a Pentium 4, 3.2 GHz, equipped with 2 GB of RAM, and operating system Gentoo release 7 (kernel linux ). We have combined RM with different constraints obtaining two other models specifically, RM including the valid inequalities (hereafter indicated with SM) and RMM including the valid inequalities and such that variables x i are declared continuous (hereafter indicated with SMM). To evaluate the benefits of the valid inequalities, we have deactivated the Xpress Optimizer automatic cuts and the pre-solving strategy. Finally, we have used the Xpress-MP primal heuristic to generate the first upper bound Separation Oracle for the Valid Inequalities When using models SM and SMM, the valid inequalities (34) and (35) are loaded together with RM. On the contrary, the valid inequalities (36) (39) are dynamically generated by means of a separation oracle. Specifically, let x ȳ z be a current solution at a given node of the search tree. For each p, we build the set 1 k and for all i j such that yk ij Y, and we check if z ip < k k 1 yk ij Y ȳk ij. If so, the inequality z ip y k ij k k 1 yk ij Y is added. To reduce the computational overhead, no more than 10 cuts are added at each node of the search tree. A similar procedure is used to separate inequalities (37) (39). If no inequality of types (36) (39) is added to the formulation, then the separation oracle checks for violated clique inequalities (41) (42). Before running the exact search, we precompute the graph CG for the instance analyzed. At a given node of the search tree, we subsequently pick at random an index p and an index i yij k Y, and we dynamically construct the graph CG 1 p and set w k = j yij k Y j i ȳk ij + j yij k Y j<iȳk ji, k CG1 p. If z ip k w p k, then there does not exist any violated inequality in CG 1 p 1, and we select at random a different index i; otherwise, we proceed by finding a maximum clique in CG 1 p. After testing alternative (heuristic and exact) solution strategies for finding a maximum clique in a graph, a well-known hard problem (Pardalos and Xue 1994), we opted for Nemhauser and Trotter s exact algorithm (Pardalos and Xue 1994), which proved to be, for the data sets analyzed, a good trade-off between speed (a few milliseconds) and efficiency. Specifically, Nemhauser and Trotter s algorithm consists in solving the model Formulation 3. k 1 max z o = k=1 w k t k (43) s.t. t i + t j 1 i j E 1 p (44) t i 0 1 (45)

9 INFORMS Journal on Computing 22(2), pp , 2010 INFORMS 203 where t i denotes a decision variable equal to 1 if vertex t i 1 k belongs to the clique and 0 otherwise, and E p 1 is the complement of the edge set E1 p. If the optimal value of the objective function is greater than z ip, then the inequality z ip k C 1 j yij k Y j i y k ij + k C 1 j yij k Y j<i is added to the formulation. A similar procedure is used for separating the clique inequalities (42). To speed up the cut generation, branching priorities on the integer variables were introduced. Specifically, first the branches are performed on variables z ip, then on variables x i, and finally on variables y k ij. Moreover, the search is performed previously on variables z ip whose relaxed values are in the interval and subsequently on those whose relaxed values are in the interval Basic and Reduced Models on Simulated Data Sets Analogously to Brown and Harrower (2006), we used Hudson s ms program (Hudson 1990) to generate benchmark simulated data sets to compare the performances of BM and RM. Under the neutral evolution hypothesis (Kimura 1983), ms is able to generate many independent replicate samples (haplotypes) through a variety of assumptions about migration, recombination rate, and population size. A coalescent approach (Hudson 1990) is at the core of the simulator: a random genealogy of the sample is generated, and then mutations are randomly placed on the genealogy. A parameter called recombination level allows one to tune the heterogeneity degree of the haplotypes in a population: as r increases, haplotypes become increasingly different from each other, and hence the number of heterozygous sites in genotype samples tends to become large. Using a strategy similar to the one already proposed by Brown and Harrower (2006) we generated three collections: 10 data sets containing 80 samples of 50 SNPs each; 10 data sets containing 140 samples of 70 SNPs each; and 10 data sets containing 200 samples of 100 SNPs each, respectively. For each data set, the recombination level was randomly chosen in The reason for choosing nonzero recombination levels was due to the fact that positive recombination levels may lead to more difficult PPH instances (Brown and Harrower 2006). To obtain data sets of genotypes, sample data sets were paired randomly so that all of them were used. This process led to the generation of three data sets of 10 instances, each containing 40 genotypes, 70 genotypes, and 100 genotypes, respectively. Duplicate genotypes were removed though a preprocessing step. It is worth noting that this reduction does y k ji Table 1 Overview of the Data Sets Analyzed to Compare the Basic and the Reduced Models Max no. Recombination Mean no. of 2s per No. of Genotypes SNPs level of 2s sites genotype (mean) data sets Random in Random in Random in not interfere with the value of the optimal solution or reduce the gap between the optimal solution of the IP problem and its relaxation. Furthermore, Brown and Harrower s preprocessing step consisting in the elimination of duplicate columns was also implemented (Brown and Harrower 2006). Table 1 summarizes the details about the genotype data sets used, whereas Table 2 shows the results obtained by BM and RM. The second and third columns of Table 2 indicate the mean and the standard deviation of the time (RAM) required by BM and RM to solve a given data set. The fourth column indicates the maximum number of branch-and-bound branches required by BM and RM to solve a given data set. Finally, the fifth column shows the overall number of instances solved by BM and RM on each data set. In general, numerical experiments show that RM performs better than BM both in terms of runtime and memory. The reduction of variables yij k and z ip (and the corresponding constraints) performed by RM approaches 99% of the overall number of variables and constraints required by BM. As a consequence, this strategy leads to a model requiring only a few megabytes of RAM, whereas several hundred of megabytes of RAM are required by the BM model (the main reason the BM model is unable to tackle instances of PPH larger than 40 genotypes). Because of its better performance, we prefer RM to BM for the remainder of our numerical analysis Performance Analysis on Brown and Harrower s Data Sets We used Brown and Harrower s data sets (Brown and Harrower 2006) for testing the performances of our models versus Brown and Harrower s HybridIP stateof-the-art model for PPH. Specifically, using Hudson s (1990) ms program, the authors created two families of data sets (called the uniform and nonuniform data sets) by randomly pairing the resulting haplotypes. The distinction in the two simulated methods comes in how the random pairing is performed. In the uniform data sets the haplotypes are randomly paired by sampling uniformly from the set of distinct haplotypes. In the nonuniform data sets the haplotypes are sampled uniformly from the collection of haplotypes generated by the coalescent process. In this collection,

10 204 INFORMS Journal on Computing 22(2), pp , 2010 INFORMS Table 2 Results Obtained by the Basic and the Reduced Models on the Data Sets Analyzed Time (s) RAM required (MB) Max no. of branches Fraction solved Data sets BM RM BM RM BM RM BM RM 40 50s ± ± ± 20 5 ± /10 10/ s ± 0 94 > ± 3 1 0/10 10/ s ± 2 33 > ± 6 1 0/10 10/10 haplotypes may not be unique, so some haplotypes are sampled with higher frequency than others. As shown in Table 3, both the uniform and nonuniform data sets consist of collections of 30 or 50 genotypes having 10, 30, 50, 75, or 100 SNPs each. Each data set contains a number of instances varying between 15 and 50. Brown and Harrower (2006) also considered biological inputs from chromosomes 10 and 21 over all four HapMap (International HapMap Consortium 2004) populations. For each input length, the authors selected sequences of lengths 30, 50, and 75, giving a total of eight data sets consisting of three instances each. Table 4 summarizes the main characteristics of Brown and Harrower s biological data sets. We show in Tables 5 7 the performances of our models on the uniform, nonuniform, and biological data sets, respectively. Specifically, the columns of each table evidence the mean, the maximum, and the minimum of the solution time, the gap (i.e., the difference between the optimal value found and the value of linear relaxation at the root node of the search tree, divided by the optimal value), the number of branches performed, and the number of cuts added to solve each group of instances belonging to a given data set Uniform Data Sets. Brown and Harrower (2006) showed that the HybridIP model is able to Table 3 Characteristics of Brown and Harrower s (2006) Artificial Data Sets Max no. of 2s Mean no. of 2s Recombination Genotypes SNPs in a genotype in a genotype level Uniform Nonuniform n.a n.a n.a n.a n.a. solve instances of the data sets having a genotype length of 30 or less. Specifically, the authors observed that HybridIP needs a run time ranging from 7 seconds to 2 minutes to solve instances of a data set having genotype length of 10, and from 30 seconds to an hour to solve instances of the data set having genotype length of 30. However, HybridIP is only partially able to solve instances of the data sets having a genotype length of 30 or greater. In fact, HybridIP solved 70% of the instances of the data set having a genotype length of 50, 60% of instances of the data set having a genotype length of 75, and only 30% of the instances of the data set having a genotype length of 100. The run times range from five seconds to two hours. On the contrary, our numerical experiments (see Table 5) show that RM on average is able to solve instances of the data sets having genotype length of 10 in about 8 seconds, although five instances of the data set 50 10r16 (specifically, instances 02, 04, 06, 11, and 13) took a bit longer (22.44, , , , and Table 4 Characteristics of Brown and Harrower s (2006) Biological Data Sets Max no. of 2s Mean no. of 2s Data set Genotypes SNPs in a genotype in a genotype Biological Test-chr10-CEU Test-chr10-CEU Test-chr10-CEU Test-chr10-HCB Test-chr10-HCB Test-chr10-HCB Test-chr10-JPT Test-chr10-JPT Test-chr10-JPT Test-chr10-YRI Test-chr10-YRI Test-chr10-YRI Test-chr21-CEU Test-chr21-CEU Test-chr21-CEU Test-chr21-HCB Test-chr21-HCB Test-chr21-HCB Test-chr21-JPT Test-chr21-JPT Test-chr21-JPT Test-chr21-YRI Test-chr21-YRI Test-chr21-YRI

11 INFORMS Journal on Computing 22(2), pp , 2010 INFORMS 205 Table 5 Performances of RM Time (sec.) Gap (%) Nodes Data set Average Max Min Average Max Min Average Max Min Uniform r r Nonuniform Biological CHR10-CEU CHR10-HCB CHR10-JPT CHR10-YRI CHR21-CEU CHR21-HCB CHR21-JPT CHR21-YRI seconds, respectively). Instances of the data set having a genotype length of 30 on average have been solved in about seconds, with two instances (02 and 08) taking more than the average time (53.42 and seconds, respectively). The remaining ones took less than nine seconds. Instances of the data sets having genotype length greater than 30 on average were solved in at most seconds, with a maximum solution of hence, much less than Table 6 Performances of SM HybridIP. Moreover, the time taken by the Xpress-MP primal heuristic to obtain the first feasible solution was always smaller than one second even in the most difficult instance. However, it is worth noting that the hardware used in our experiments is twice as fast than that of Brown and Harrower (2006); hence, a direct comparison of the runtimes is unfair. The comparison of the gap and the branches instead provides a more independent insight about the effectiveness Time (sec.) Gap (%) Nodes Cuts Data set Average Max Min Average Max Min Average Max Min Average Max Min Uniform r r Nonuniform Biological CHR10-CEU CHR10-HCB CHR10-JPT CHR10-YRI CHR21-CEU CHR21-HCB CHR21-JPT CHR21-YRI

12 206 INFORMS Journal on Computing 22(2), pp , 2010 INFORMS Table 7 Performances of SMM Time (sec.) Gap (%) Nodes Cuts Data set Average Max Min Average Max Min Average Max Min Average Max Min Uniform r r Nonuniform Biological CHR10-CEU CHR10-HCB CHR10-JPT CHR10-YRI CHR21-CEU CHR21-HCB CHR21-JPT CHR21-YRI of our model. To this end, we observe that, unlike HybridIP, the gap is 0 in at least the 50% of the instances analyzed and that several instances are typically solved without branching. SM shows better performances than RM (see Table 6) both in terms of solution times and gaps. In fact, on average, SM was able to solve instances of the data sets having genotype length of 10 in about 5.8 seconds, and maximum and minimum solution times were always smaller than RM. The first two data sets were solved at the root node, whereas an instance of the third data set needed at most 49 branches to reach the optimum. The efficiency of the valid inequalities is confirmed by both the smaller gaps than RM and the average number of cuts added during the exact search. This trend is also confirmed for the data sets in which genotype length was greater than or equal to 30, with the only exception being instance 19 of the data set 30 50, which took seconds versus seconds of RM. This instance had a gap of (versus 0.0 of RM), a number of branches equal to 1,109 (versus 75 of RM), and the number of cuts added equal to 741. The anomaly of this instance seems to be due to the valid inequalities (34) and (35), which negatively interfere with the primal heuristic of the Xpress Optimizer. Finally, SMM shows worse running time performances than RM and SM (see Table 7) in all uniform data sets. This is in contrast to the average, maximum, and minimum gaps, which are always smaller than RM and SM. Once again, the anomalous behavior is due to the valid inequalities (34) and (35), which interfere both with the primal heuristic of the Xpress Optimizer and with the automatic branching strategy Nonuniform Data Sets. Brown and Harrower (2006) showed that on the nonuniform data sets the HybridIP solves the instances having lengths 10 and 30. Specifically, the instances having genotypes of length 10 are solved in 1 second, and ten out of the fifteen length 30 instances are solved in less than 15 seconds. However, the HybridIP was only able to solve six instances having genotype length of 50, although those instances were solved in 39 seconds or less. Again, the Hybrid was able to solve 5 of the 15 instances with a genotype length of 75, two of which were solved in less than one minute. Finally, the HybridIP is able to solve three instances having genotype length 100, two of which solved in under two minutes. Numerical experiments show that our models outperform HybridIP on the nonuniform data sets. In fact, as shown in Table 5, RM on average is able to solve instances of the data sets having genotype length of 10 in about seconds, with a maximum runtime of seconds; instances having genotype length of 30 in about seconds, with a maximum runtime of seconds; instances having genotype length of 50 in about seconds, with a maximum runtime of seconds; instances having genotype length of 75 in about seconds, with a maximum

A Class Representative Model for Pure Parsimony Haplotyping

A Class Representative Model for Pure Parsimony Haplotyping Daniele Catanzaro, Alessandra Godi, and Martine Labbé June 5, 2008 Abstract Haplotyping estimation from aligned Single Nucleotide Polymorphism