An Application of Integer Linear Programming to Haplotyping Inference by Parsimony Problem


Università degli Studi Roma Tre
Dottorato di Ricerca in Informatica e Automazione, XVIII Ciclo, 2005
An Application of Integer Linear Programming to Haplotyping Inference by Parsimony Problem
Alessandra Godi
Advisor: Dr. Paola Bertolazzi
Reviewers: Prof. Martine Labbé

Author's address: Alessandra Godi, Istituto di Analisi dei Sistemi ed Informatica "Antonio Ruberti" - CNR, Viale Manzoni, 30 - Roma, Italy. godi@iasi.cnr.it www:

Contents

1 INTRODUCTION
2 BIOLOGICAL BACKGROUND
   DNA and RNA
   Genes, chromosomes, haplotypes, genotypes and SNPs
   The importance of SNPs
   Haplotypes and genotypes in disease association studies
   Technical methods to obtain genotypes
3 AN OVERVIEW ON HAPLOTYPING INFERENCE
   The Clark's Rule
   The Phylogeny (or Coalescent) Haplotyping Problem
   Statistical models and softwares
   Maximization Likelihood by Expectation Maximization
   Bayesian Inference Methods
   Statistical software tools
   The Inference Haplotyping in Pedigree
   The Haplotyping Inference by Parsimony problem
4 SOLVING HIP USING A NEW HEURISTIC: COLLHAPS
   The COLLHAPS Algorithm
   The collapse rule
   The preprocessing
   The heuristic sequence of collapse steps
   Haplotype set reduction
   Precollapsing
   Postprocessing: removing residual variables
   Performance measures
   Experimental Results
5 EXISTING ILP FORMULATION FOR HIP
   An exponential formulation
   Complete and reduce model
   An inclusion/exclusion strategy to count the variables of Gusfield's models
   Experimental results
   A polynomial formulation
   Branch-and-cut and experimental results of the polynomial formulation
   Conclusion about the linear formulations
   HIP problem is APX-hard
6 SOLVING HIP USING EXPONENTIAL FORMULATIONS
   Polyhedral study of Gusfield's formulation
   General polyhedral theory
   Facets characterization for the HIP problem
   A Branch-and-Price algorithm for HIP
   The Branch-and-Price Algorithm
   Implementation Issues of B&P
   Computational Experience
   A new exponential formulation for the HIP problem
   Basic properties of the set-covering problem
   Characterization of some SC facets and valid inequalities for the HIP problem
7 SOLVING HIP USING A NEW POLYNOMIAL FORMULATION
   The basic model as a minimum problem
   Turning P min into a maximization problem
   Strengthening of formulation
   Computational Experience
8 CONCLUSIONS AND FUTURE WORKS

Acknowledgements

I wish to thank all those people who taught me, listened to me, accompanied me, inspired me and distracted me during these three years of PhD. This time has passed very quickly. I have learned so many interesting things from so many interesting people, and I have had many opportunities to travel to interesting places to attend conferences and workshops. There are many who deserve to be thanked for their part in making my period of study a pleasant time. Here I can only mention a few of them. In particular I owe my gratitude to Dr. Paola Bertolazzi, my advisor, who introduced me to the computational biology field and has guided me through this work, being the ultimate guide one can possibly hope for. I really wish to thank Prof. Martine Labbé for hosting me at Université Libre de Bruxelles for four months and for a lot of inspiring discussions while I was there. I had the opportunity to learn many things from her and to improve my thesis. It has been real fun working with her. Then I would like to thank Dr. Giovanni Rinaldi, the director of Istituto di Analisi dei Sistemi ed Informatica "A. Ruberti", for his advice during this experience and for hosting me in his institute, which represents my second family. Thanks to Prof. Fernando Nicolò for being a constant guide in the computer science research group of Università di Roma Tre. I wish to thank Prof. Giuseppe Lancia, who introduced me to the problem addressed in this thesis: he is the Dr. Dolittle of the sciences, speaking computer science, mathematics, biology and statistics fluently. Working with him has been really inspiring. Thanks to Dr. Leonardo Tininini for being my office-mate for the last two years and co-author of the heuristic included in this thesis. He has been creative, encouraging and trusting. I am also grateful to Dr. Claudio Gentile and Dr. Paolo Ventura for their help and patience in answering several questions.
Special thanks to my good friends and colleagues Mara and Marta, who accompanied me through my PhD, for their friendship and help in good and bad times.

I also thank all the friends who distracted me from my PhD (lists may overlap): Luca ("come siamo fortunati!"), Mara, Marta; and all the "IASI Fellowship", particularly Anna, Barbara, Cristina, Gabriella, Guido, Leonardo, Mariagrazia, and then Mauro (our Frodo Baggins, or definitely better, our Mandrake). Guys, it was pure fun! The period I spent in Bruxelles was a special time for me. I met a lot of interesting people, from my nice flat-mate Jennifer to all the friends at ULB. They helped me in every situation, especially when I was homesick. Thank you! Thanks to mum and dad for all the help and every kind of support. They have the passion for life of 10 people, the humility of 100 people, and the generosity of 1000 people, but the goodness and kindness that only few people have, and I am proud to be their daughter. I thank the rest of my family, especially my brothers Gianmatteo and Francesco, who never took my work seriously :-) but gave me constant encouragement. And last, but not least, I wish to thank my beloved Davide, for making everything better and whose love never ceases to fill me with amazement. In my optimism I dedicate this thesis to him as a story of life and progress, as a small tribute to the many who will follow in the never-ending chain of science.

Chapter 1 INTRODUCTION

The work presented in this thesis concerns mathematical programming techniques for a particular problem of biological relevance, the Haplotyping Inference by Parsimony (HIP) problem. This work belongs to an interdisciplinary area called computational biology, a field concerned with the development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological systems. Computational biology spans several classical areas such as biology, chemistry, physics, statistics and computer science, and the activities in the area are numerous. From a computational point of view, the activities range from algorithmic theory focusing on problems with biological relevance, via the construction of computational tools and mathematical models for specific biological problems, to experimental work in which a laboratory with test tubes and microscopes is replaced by a fast computer and a hard disk full of computational tools written to analyze huge amounts of biological data in order to prove or disprove a certain hypothesis. The area of computational biology is often, and erroneously, referred to as bioinformatics. In fact, computational biology refers to activities that mainly focus on constructing models and algorithms that address problems with biological relevance, while bioinformatics refers to activities that mainly focus on constructing and using computational tools to analyze available biological data. This distinction is needed here only to expose the main focus of the work: this thesis lies in the computational biology field, because its aim is to analyze some existing integer programming formulations and to propose new models, with the associated polyhedral studies, and new (exact and approximate) algorithms that address the HIP problem.
This problem is motivated by genetic differences among individuals of a species. Most plant and animal cells are diploid, i.e., they have two similar, but not identical, versions (or copies) of each chromosome (homologous chromosomes). In general, individuals of the same species are genetically very similar; in humans, for instance, the DNA of two random people is about 99.9% identical. Individual uniqueness lies in the small number of positions at which single-base DNA differences occur. A SNP (Single Nucleotide Polymorphism) is a single base pair position in genomic DNA at which different nucleotide variants (alleles) exist. In humans, SNPs are almost always biallelic, that is, only two of the four possible nucleotides occur at each site. The knowledge of these two variants is referred to as the phase of the SNP. The sequence of alleles along a chromosome copy is called a haplotype. The genotype, instead, gives the pair of bases at each site of the chromosome pair, but does not specify which base (i.e., which allele) occurs on which chromosome. For a given set of SNPs, an individual possesses two haplotypes, one inherited from the paternal genome and the other from the maternal genome, and exactly one genotype associated with the chromosome pair. The inheritance process is complicated by a phenomenon known as recombination, in which portions of the paternal and maternal chromosomes are exchanged. A SNP site where both haplotypes have the same variant (nucleotide) is called a homozygous site; a SNP site where the haplotypes have different variants is called a heterozygous site. Thus, in genotype data the nucleotide variants at homozygous sites are known, but it is unknown which variants at heterozygous sites came from the same chromosome copy; in haplotype data the alleles are completely known. The determination of the haplotypes within a population is essential.
For instance, haplotypes are necessary in evolutionary studies, in extracting the information needed to detect diseases and to reduce the number of tests to be carried out, in the discovery of a functional gene, or in the study of an altered response of an organism to a particular therapy. In human pharmacogenetics, haplotype SNPs seem to explain why people react differently to different types or amounts of drugs: since SNPs can affect the structure and function of proteins and enzymes, they can influence how efficiently a drug is absorbed and metabolized. Unfortunately, experimental techniques to obtain the haplotypes of an individual are very expensive, time consuming and labor intensive. However, it is possible to determine the genotype of an individual quickly and easily. The use of computational techniques together with specific biological models offers a way of deriving the haplotypes from the genotype data (i.e., Haplotyping or Haplotype Inference (HI)). The HI problem has been studied since the 1990s, when a wide variety of techniques (statistical and combinatorial methods) were proposed. Statistical approaches try to iteratively determine the haplotype frequencies, and then infer the haplotype pairs. In the methods based on expectation-maximization (EM), the haplotype frequency estimates are iteratively updated, starting from an initial guess and trying to maximize a likelihood function. Other statistical methods are based on Bayesian inference and on the adoption of a more or less biologically-based prior, so as to get more accurate estimates of the haplotype frequencies and consequently of the genotype reconstructions. Combinatorial methods are mostly inspired by Clark's inference rule, based on the principle that, given a genotype and a haplotype compatible with this genotype, the other haplotype can be inferred simply by difference between the genotype and the given haplotype. Clark's rule was applied directly, giving rise to the first algorithm for haplotyping. The algorithm has good accuracy but two major drawbacks: it may not even be able to start, or it may resolve only a subset of the genotypes, leaving the others unresolved. A second step is due to Gusfield, who first used integer programming for the haplotyping problem and formulated two different optimization problems. The first one looks for the best sequence of applications of Clark's rule, so as to resolve the maximum number of genotypes. The second one is the formulation of the inference problem with the requirement that the number of inferred haplotypes be minimum (the HIP problem, i.e., the problem addressed in this thesis). The parsimony principle does not state, as is sometimes erroneously assumed, that haplotypes with high frequency in a population should be preferred in a haplotype reconstruction (in fact, parsimony is affected by haplotype frequencies only in the weakest sense); rather, it means that the haplotypes of a population cannot be very different from each other, as supported by real data and by the history described by phylogenetic haplotype trees. This thesis is organized as follows. Chapter 2 introduces the basic concepts and definitions of computational biology that are needed to understand the underlying biology.
Chapter 3 describes the general Haplotyping Inference problem, considering several variants and complexity aspects, and gives an overview of the existing solution methods for the HI problem in general and for the Haplotyping Inference by Parsimony problem in particular. The first original result for the HIP is contained in Chapter 4: COLLHAPS, a rule-based heuristic which uses a collapsing rule to reach the minimum number of haplotypes. Then, Chapter 5 introduces the existing integer linear programming formulations for the HIP (exponential and polynomial models). Chapter 6 and Chapter 7 describe two different approaches for solving the HIP problem exactly. The first is based on the use of two exponential formulations (one is introduced in Chapter 5, the other is originally obtained from the previous one by a Fourier-Motzkin procedure), a Branch-and-Price algorithm and some polyhedral results (new facets and new valid inequalities). The second approach is based on a new polynomial formulation, which is described from a basic model up to a strengthened one with the use of clique inequalities, symmetry-breaking inequalities and a dominance study. Finally, Chapter 8, concerning the conclusions, presents some ideas and future works for the HIP problem.
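Clark's inference rule, as described in this chapter, can be illustrated with a minimal Python sketch. The encoding is an assumption made here for illustration and is not taken from the text above: a haplotype is a string over {0, 1} (the two alleles of each biallelic SNP), and a genotype is a string over {0, 1, 2}, where 2 marks a heterozygous site. The function names `is_compatible`, `infer_mate` and `clark_pass` are hypothetical, and the greedy loop is only one possible order of application of the rule, not the algorithm studied in the thesis.

```python
def is_compatible(haplotype: str, genotype: str) -> bool:
    """True if the haplotype agrees with the genotype at every homozygous site.

    Assumed encoding: genotype symbols '0'/'1' are homozygous sites,
    '2' is a heterozygous site; haplotypes use only '0' and '1'.
    """
    return len(haplotype) == len(genotype) and all(
        g == "2" or g == h for h, g in zip(haplotype, genotype)
    )


def infer_mate(haplotype: str, genotype: str) -> str:
    """Infer the second haplotype 'by difference' from a compatible one."""
    if not is_compatible(haplotype, genotype):
        raise ValueError("haplotype is not compatible with genotype")
    # Heterozygous site: take the opposite allele. Homozygous site: copy it.
    return "".join(
        ("1" if h == "0" else "0") if g == "2" else g
        for h, g in zip(haplotype, genotype)
    )


def clark_pass(genotypes, known):
    """Greedily resolve genotypes using haplotypes that are already known.

    Returns (resolved, unresolved): a dict genotype -> haplotype pair,
    and the list of genotypes for which the rule could not be applied.
    """
    known = set(known)
    resolved = {}
    progress = True
    while progress:
        progress = False
        for g in genotypes:
            if g in resolved:
                continue
            for h in sorted(known):  # sorted only to make the run deterministic
                if is_compatible(h, g):
                    mate = infer_mate(h, g)
                    resolved[g] = (h, mate)
                    known.add(mate)  # newly inferred haplotypes can be reused
                    progress = True
                    break
    unresolved = [g for g in genotypes if g not in resolved]
    return resolved, unresolved
```

With the known haplotype `0101` and the genotypes `0221`, `2211` and `2020`, this sketch resolves the first two genotypes and leaves `2020` unresolved; with an empty set of known haplotypes it resolves nothing. These are the two drawbacks mentioned above.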

Chapter 2 BIOLOGICAL BACKGROUND

The genetic material of each living organism (plant or animal, bacterium or virus) possesses sequences of basic building blocks (usually DNA, sometimes RNA) that are uniquely and specifically present only in its own species. Indeed, complex organisms, such as humans, possess DNA sequences that are uniquely and specifically present only in particular individuals. These unique variations make it possible to trace genetic material back to its origin, identifying with precision at least what species of organism it came from and often which particular member of that species, or to isolate DNA regions which carry particular pathologies. This chapter describes some fundamental concepts concerning the structure of the genome, chromosomes, genes, haplotypes, genotypes and polymorphisms, which are relevant to motivate the topic of this thesis.

2.1 DNA and RNA

Most questions in computational biology are related to molecular or evolutionary biology and focus on analyzing and comparing the composition of the key biomolecules DNA, RNA and proteins, which together constitute the fundamental building blocks of organisms. The genetic material of an organism is the system which guides and determines the functions and the characteristics that organisms need for the complex task of living. Questions about how the genetic material is stored and used by an organism have been studied intensively. This has revealed that the biomolecules DNA, RNA and proteins are the most important players of the game, and thus important components to model in any method for comparing and analyzing the genetic material of organisms.

Figure 2.1: An abstract illustration of a segment of a DNA or RNA molecule. It shows that the molecule consists of a backbone of sugars linked together by phosphates, with a base side chain attached to each sugar. The two ends of the backbone are conventionally called the 5' end and the 3' end.

The DNA (deoxyribonucleic acid) molecule was discovered in 1869 during a study of the chemistry of white blood cells. The very similar RNA (ribonucleic acid) molecule was discovered a few years later. DNA and RNA are chainlike molecules, called polymers, that consist of nucleotides linked together by phosphate bonds. A nucleotide consists of a phosphoric acid, a pentose sugar and an amine base (or simply base). In DNA the pentose sugar is 2-deoxyribose and the base is either adenine (A), guanine (G), cytosine (C), or thymine (T). In RNA the pentose sugar is ribose instead of 2-deoxyribose and the base thymine is exchanged with the very similar base uracil (U). Two of these bases (A and G) belong to the group of purines and the others (C and T, or U for RNA) to the pyrimidines. A DNA or RNA molecule is a uniform backbone of sugars linked together by the phosphates, with side chains of bases attached to each sugar. This implies that a DNA or RNA molecule can be specified uniquely by listing the sequence of base side chains starting from one end of the sequence of nucleotides. The two ends of a nucleotide sequence are conventionally denoted as the 5' end and the 3' end. These names refer to the orientation of the sugars along the backbone. It is common to start the listing of the base side chains from the 5' end of the sequence (see Fig. 2.1). Since there are only four possible base side chains, the listing can be described as a string over a four-letter alphabet. Proteins are polymers that consist of amino acids linked together by peptide bonds. An amino acid consists of a central carbon atom, an amino group, a carboxyl group and a side chain.
The side chain determines the type of the amino acid. As illustrated in Figure 2.2, chains of amino acids are formed by peptide bonds between the nitrogen atom in the amino group of one amino acid and the carbon atom in the carboxyl group of another amino acid. A protein thus consists of a backbone of the common structure shared between all amino acids, with the different side chains attached to the central carbon atoms. Even though there is an infinite number of different types of amino acids, only twenty of these types are encountered in proteins. As with DNA and RNA molecules, it is thus possible to uniquely specify a protein by listing the sequence of side chains. Since there are only twenty possible side chains, the listing can be described as a string over a twenty-letter alphabet. The chemical structure of DNA, RNA and protein molecules that makes it possible to specify them uniquely by listing the sequence of side chains, also called the sequence of residues, is the reason why these biomolecules are often referred to as biological sequences.

Figure 2.2: An abstract illustration of a segment of a protein. It shows that the molecule consists of a backbone of elements shared between the amino acids, with a variable side chain attached to the central carbon atom in each amino acid. The peptide bonds linking the amino acids are indicated by gray lines.

The correspondence between biological sequences and strings over finite alphabets has many modeling advantages, most prominently its simplicity. For example, a DNA sequence corresponds to a string over the alphabet {A, G, C, T}, where each character represents one of the four possible nucleotides. Similarly, an RNA sequence corresponds to a string over the alphabet {A, G, C, U}. The relevance of modeling biomolecules as strings over finite alphabets follows from the way the genetic material of an organism is stored and used. Probably one of the most amazing discoveries of the twentieth century is that the entire genetic material of an organism, called its genome, is (with few exceptions) stored in two complementary DNA sequences that wind around each other in a helix. The genome is the entire set of hereditary instructions for building, running and maintaining an organism and for passing life on to the next generation. Two DNA sequences are complementary if one is the other read backwards with the complementary bases adenine/thymine and guanine/cytosine interchanged. For example, ATTCGC and GCGAAT are complementary because ATTCGC with A and T interchanged and G and C interchanged becomes TAAGCG, which is GCGAAT read backwards.
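The definition of complementary sequences given above can be checked mechanically. The following is a minimal Python sketch; the names `COMPLEMENT`, `reverse_complement` and `are_complementary` are chosen here for illustration, and the sketch assumes sequences written over the alphabet {A, G, C, T} only.

```python
# Base pairing used in the text: adenine/thymine and guanine/cytosine.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq: str) -> str:
    """Interchange complementary bases and read the result backwards."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def are_complementary(a: str, b: str) -> bool:
    """True if b is the complementary sequence of a, as defined in the text."""
    return len(a) == len(b) and reverse_complement(a) == b
```

Applied to the example in the text, `reverse_complement("ATTCGC")` gives `GCGAAT`, so `are_complementary("ATTCGC", "GCGAAT")` holds.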
Two complementary bases can form strong interactions, called base pairings, by hydrogen bonds. Hence, two complementary DNA sequences placed against each other, such that the head (the 5' end) of one sequence is placed opposite the tail (the 3' end) of the other sequence, are glued together by base pairings between opposing complementary bases. The result is a double-stranded DNA molecule with the famous double helix structure described by Watson and Crick in [84] (see Fig. 2.3). Genome size is usually stated as the total number of base pairs (bp); the human genome contains roughly 3 billion bp [75] (see Fig. 2.4). Despite the complex three-dimensional structure of this molecule, the genetic material it stores depends only on the sequence of nucleotides and can thus be described without loss of information as a string over the alphabet {A, G, C, T}.

Figure 2.3: The base pairing is thus restricted. This restriction is essential when the DNA is being copied: the DNA helix is first unzipped into two long stretches of sugar-phosphate backbone with a line of free bases sticking up from it, like the teeth of a comb. Each half will then be the template for a new, complementary strand. Biological machines inside the cell put the corresponding free bases onto the split molecule and also proof-read the result to find and correct any mistakes. After the doubling, this gives rise to two exact copies of the original DNA molecule.

Figure 2.4: Comparison of largest known DNA sequences.

The genome of an organism contains the templates [footnote 1] of all the molecules necessary for the organism to live. A region of the genome that encodes a single molecule is called a gene (see next section). When a particular molecule is needed by the organism, the corresponding gene is transcribed to an RNA sequence. The transcribed RNA sequence is complementary to the complementary DNA sequence of the gene, and thus (except for thymine being replaced by uracil) identical to the gene. Sometimes this RNA sequence is the molecule needed by the organism, but most often it is only intended as an intermediate template for a protein. In eukaryotes (which are higher order organisms such as humans) a gene usually consists of coding parts, called exons, and non-coding parts, called introns. By removing the introns and concatenating the exons, the intermediate template is turned into a sequence of messenger RNA that encodes the protein (see Fig. 2.5). The messenger RNA is translated to a protein by reading it three nucleotides at a time. Each triplet of nucleotides, called a codon, uniquely describes an amino acid, which is added to the sequence of amino acids being generated. The correspondence between codons and amino acids is given by the almost universal genetic code shown in Figure 2.6.

2.2 Genes, chromosomes, haplotypes, genotypes and SNPs

A small piece of the genome that codes for a protein is called a gene. Different genes determine the different characteristics, or traits, of an organism. In the simplest terms, one gene might determine the color of a bird's feathers, while another gene would determine the shape of its beak. The number of genes in the genome varies from species to species. More complex organisms tend to have more genes.
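The transcription and translation steps described in Section 2.1 (thymine replaced by uracil, then the messenger RNA read three nucleotides at a time) can be sketched in Python as follows. This is only an illustration under stated assumptions: the input is taken to be an intron-free coding sequence starting at the first codon, amino acids are written with the standard one-letter codes (M for Met), `*` marks a termination codon, and the names `CODON_TABLE`, `transcribe` and `translate` are chosen here rather than taken from the text.

```python
from itertools import product

# Standard genetic code, listed with the bases in the order U, C, A, G
# for the first, second and third codon position ('*' = termination codon).
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"  # codons starting with U
    "LLLLPPPPHHQQRRRR"  # codons starting with C
    "IIIMTTTTNNKKSSRR"  # codons starting with A
    "VVVVAAAADDEEGGGG"  # codons starting with G
)
CODON_TABLE = {
    "".join(codon): amino
    for codon, amino in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna: str) -> str:
    """RNA copy of the gene: identical to it except that T becomes U."""
    return coding_dna.replace("T", "U")


def translate(mrna: str) -> str:
    """Read the mRNA codon by codon until a termination codon is met."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino = CODON_TABLE[mrna[i:i + 3]]
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)
```

For instance, `translate(transcribe("ATGGCCTGGTAA"))` reads the codons AUG, GCC, UGG and UAA and returns `MAW`, stopping at the termination codon UAA.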
Bacteria have several hundred to several thousand genes. Estimates of the number of human genes, by contrast, range from 25,000 to 30,000. Most gene products in the human genome are identical in all individuals. Genes are found on chromosomes and are made of DNA. A chromosome is a package containing a chunk of a genome, that is, it contains some of an organism's genes. The important word here is "package": chromosomes help a cell to keep a large amount of genetic information neat, organized, and compact. Chromosomes are made of DNA and protein. Most living things have chromosomes that are linear and are kept in the nucleus, a sphere-shaped sac within the cell. In a few very simple forms of life, such as bacteria, the entire genome is packaged into a single chromosome. But other organisms, with genomes a thousand or even a million times larger than those of bacteria, divide their hereditary material among a number of different chromosomes. Exactly how many chromosomes there are depends on the species. A mosquito has 6 chromosomes, a pea plant has 14, a sunflower 34, a human being 46, and a dog 78. In the case of humans, the 46 chromosomes are divided into 23 pairs of corresponding (homologous) chromosomes. So, the structure is clear: the genome contains genes, which are packaged in chromosomes and affect specific characteristics of the organism.

Footnote 1: A template is a single DNA strand that serves as a pattern for building a new second strand.

Figure 2.5: RNA synthesis and processing.

Figure 2.6: The genetic code that describes how the 64 possible triplets of nucleotides are translated to amino acids. The table is read such that the triplet AUG encodes the amino acid Met. The three triplets UAA, UAG and UGA are termination codons that signal the end of a translation of triplets.

A location on a chromosome is called a locus (pl. loci). The locus can be either a single nucleotide or a string of nucleotides. Different variants that are present in the population at a specific locus (or loci) are called alleles; if there are only two variants, the locus is biallelic. Without loss of generality, let us consider only biallelic loci. When an individual inherits DNA from his or her parents, one copy of each chromosome is inherited from each parent. This means that for every individual there are two alleles at each locus. The combined outcome of the two alleles at a locus is called a genotype. If a genotype consists of two copies of the same allele it is homozygous, and otherwise heterozygous. Traits resulting from genotypes are called phenotypes; they can be qualitative phenotypes (e.g., healthy/diseased) or quantitative phenotypes (e.g., length, colour, ...). When an individual inherits a chromosome from a parent, it is not one of the two parental copies: each parental chromosome recombines on average about 1.5 times [75], so that the inherited chromosome is a patchwork of the parental chromosomes (recombination).
The locations at which parental chromosomes recombine differ from generation to generation, so that after several generations only small fragments of the original chromosomes remain and only bases that are located close together are inherited together. Recombination does not occur uniformly over chromosomes, and recombination frequency, rather than physical distance, is used to describe distances between loci. Dependence between loci is called Linkage Disequilibrium (LD): if there is a low probability of a recombination between two or more loci, then they have a high probability of being inherited together by successive generations, and the loci are said to be in linkage disequilibrium. The positions studied are chosen at genomic sites that have been experimentally confirmed to be polymorphic, that is, where variation between individuals exists. Areas that are segregating together (that is, are close to each other), but are not necessarily coding for the gene of interest, are called genetic markers. When searching for a gene, the hope is that markers are either in a coding part of the gene or in linkage disequilibrium with the gene. Commonly used genetic markers are Single Nucleotide Polymorphisms (SNPs, pronounced "snips"). A SNP is a single-base mutation in a DNA sequence that occurs when a single nucleotide (A, T, C, or G) in the genome sequence is altered. For example, a SNP might change the DNA sequence AAGGCTAA to ACGGCTAA. For a variation to be considered a SNP, it must occur in at least 1% of the population. SNPs, which make up about 90% of all human genetic variation, occur every 100 to 300 bases along the 3-billion-base human genome. Two of every three SNPs involve the replacement of cytosine (C) with thymine (T). SNPs can occur in both coding (gene) and noncoding regions of the genome. Many SNPs have no effect on cell function, but scientists believe others could predispose people to disease or influence their response to a drug (see next section for more information). The mapping of SNPs has progressed rapidly, and currently approximately 2.7 million SNPs have been mapped [10], most having been discovered in recent years. These kinds of polymorphisms tend to be rare events (in some cases, unique events in the history of the human race), with mutation rates estimated at around 175 total SNP mutations per individual per generation, or per base per generation [60]. Combinations of alleles from different loci which reside on the same copy of a chromosome are called haplotypes. Typically genetic markers are measured one at a time, so that it is not always possible to infer haplotype phase with certainty, that is, which alleles belong together on the same chromosome. When the phase is unknown, the estimation of haplotypes can be viewed as a missing data problem. It appears that the variation among individuals is limited to an extremely small percentage of the overall genome. In fact, approximately 99.9% of our DNA sequence is conserved, leaving only the remaining 0.1% of the human genome to account for the entire diversity of the human race.
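To make the missing-data view of phasing concrete, the following Python sketch enumerates all haplotype pairs that are consistent with an unphased genotype. The encoding is an assumption introduced here for illustration: each biallelic site of a haplotype is written as 0 or 1, and a genotype site is written as 0 or 1 when homozygous and as 2 when heterozygous. The function name `phasings` is hypothetical.

```python
from itertools import product


def phasings(genotype: str):
    """Return the sorted list of unordered haplotype pairs explaining a genotype.

    Assumed encoding: '0'/'1' = homozygous site, '2' = heterozygous site.
    """
    het_sites = [i for i, g in enumerate(genotype) if g == "2"]
    pairs = set()
    # Choose the allele of the first haplotype at every heterozygous site;
    # the second haplotype must then carry the opposite allele there.
    for choice in product("01", repeat=len(het_sites)):
        first = list(genotype)
        second = list(genotype)
        for site, allele in zip(het_sites, choice):
            first[site] = allele
            second[site] = "1" if allele == "0" else "0"
        # Sorting the pair discards the distinction between the two chromosomes.
        pairs.add(tuple(sorted(("".join(first), "".join(second)))))
    return sorted(pairs)
```

A genotype with a single heterozygous site, such as `0120`, has exactly one phasing, while `222` (three heterozygous sites) has four and `2222` has eight: with k heterozygous sites the ambiguity grows as 2^(k-1), which is why phase cannot in general be read off genotype data.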
These variations consist of insertions, deletions and SNPs within the genome; there is intense interest in identifying the SNPs, estimated to number in the millions, and in determining their role in phenotypic variation. When analyzing multi-locus genotypes, if it is impossible to determine which chromosome of a pair a specific allele came from, the data are said to be unphased. The problem of determining which alleles at each locus of a set of linked diploid loci are physically located on the same chromosome is known as haplotyping or determining phase (see Fig. 2.7). For example, for a set of three linked loci, we have $3 \times 2 = 6$ alleles in the unphased genotype, yielding a maximum of $2^3 = 8$ possible assignments of alleles to specific chromosomes, or $2^3/2 = 4$ possible phases when not distinguishing between the chromosomes. Depending on the allele values, some of these phases may be identical to each other, due to homozygosity (where the two alleles at a locus are identical). As opposed to haplotypes, the genotype gives the bases at each SNP for both copies of the chromosome, but loses the information as to the chromosome on which each base appears. Unfortunately, many sequencing techniques provide the genotypes and not the haplotypes (see last section of the chapter). Haplotype analysis has become increasingly common in genetic studies of human disease. However, many of these methods rely on phase information, that is, on haplotype information rather than genotype information. Phase can be inferred by genotyping family members of each subject, but this has its downsides because of logistic and budget issues. Alternatively, laboratory techniques (such as PCR [footnote 2]) have also been used, but these are often costly and are not suitable for large-scale polymorphism screening. As an alternative to those technologies, many computational methods have been developed for phasing the genotypes (see Chapter 3).

Figure 2.7: (a) The problem of haplotyping, or determining which alleles in a diploid genotype come from the same chromosome; (b) Determining which chromosome came from which parent.

Footnote 2: The Polymerase Chain Reaction is a method for the rapid copying of DNA. The principle itself is very simple: the first step involves the copying of a long, but very specific, DNA fragment, which forms the basis for all subsequent steps. Smaller fragments of a standard length are then synthesized from the DNA copies and then replicated millions of times over.

2.3 The importance of SNPs

Although more than 99% of human DNA sequences are the same across the population, variations in DNA sequence can have a considerable impact on how humans respond to disease, environmental insults (such as bacteria, viruses, toxins, and chemicals), drugs and other therapies. This makes SNPs of great value for biomedical research and for developing pharmaceutical products or medical diagnostics. In fact, scientists believe SNP maps will help them in identifying the multiple genes associated with complex diseases, in particular because the evolutionary stability of SNPs (they do not change much from generation to generation) makes them easier to follow in population studies. Such associations are difficult to establish with conventional gene-hunting methods because a single altered gene may make only a small contribution to the disease. Genes are the basic physical and functional units of heredity. They are specific sequences of bases that encode instructions on how to make proteins. When genes are altered so that the encoded proteins are unable to carry out their normal functions, genetic diseases can result. In the previous section we have seen that genes are carried on chromosomes: the maternal and paternal chromosomes pair up and exchange segments of DNA in a process called recombination. After recombination (which may also involve the exchange of parts of a given gene), the chromosomes contain a mixture of alleles from each parent. Recombination occurs frequently between DNA sequences that are a long way apart but only rarely between sequences that are close together. Therefore, by measuring the frequency of recombination between the disease gene and other DNA sequences whose location is already known, the position of the disease gene can be established. A consequence of recombination is that blocks of sequences on the same chromosome tend to be inherited together (linkage disequilibrium). Several groups worked to find SNPs and ultimately create SNP maps of the human genome. Among these groups are the U.S. Human Genome Project (HGP) and a large group of pharmaceutical companies called the SNP Consortium or TSC project. Their aims were to develop technologies for rapid identification of SNPs, identify common variants in the coding regions of most identified genes and create public resources of DNA samples and cell lines. In the end, many more SNPs (1.8 million in total) were discovered than originally planned.
Now that the SNP discovery phase of the TSC project is essentially complete, the emphasis has shifted to studying SNPs in populations. Various TSC member laboratories are genotyping (that is, identifying the SNP alleles carried by the individuals of a population sample) a subset of SNPs as part of the Allele Frequency Project. The goal of the TSC allele frequency/genotype project is to determine the frequency of certain SNPs in three major world populations. See the TSC Web site for more information [89]. Besides the TSC Web site, SNP data are also available from the dbSNP database (from the National Center for Biotechnology Information) [90] and HGVbase (Human Genome Variation Database) [91].

2.4 Haplotypes and genotypes in disease association studies

The aim of genetic disease association studies is to find or characterize relationships between genes and phenotypes, in order to investigate the identities and functions of genes and their roles in the presence of diseases, responsiveness to drug therapies, and susceptibility to toxic side effects. To interpret the results of an association study it is important to understand which mechanisms can lead to an association between a marker (SNP) and a phenotype: either the marker is causally related to the phenotype, or the marker is in linkage disequilibrium with a causal gene (see previous section). Typically the exact genomic location of the gene responsible for a disease is not known, and genetic markers such as SNPs are measured instead of the gene of interest. Hence, the first step in identifying a gene that causes a disease is to locate the chromosome that carries it; the position of the gene on that chromosome must then be pinpointed more precisely. This analysis is carried out by comparing phenotype distributions between persons with different genotypes: the keys in this hunt are the set of SNPs situated on the chromosome and the set of persons (actually, their genotypes) considered in relation to a qualitative phenotype (e.g., affected/not affected). By observing the patterns of alleles (the SNP values), it is possible to determine which SNP is closest to the gene that produces the disease, and hence to estimate the position of that gene on the genome.

Researchers working to identify SNPs are discovering that, as the number of known SNPs increases, identifying genotypes and correlating them with phenotypes is becoming a huge task. Fortunately, nature may have made this process simpler than would be expected from the number of SNPs.
Recent research has shown that groups of SNPs are inherited together in a stretch of DNA, rather than being randomly segregated through genetic recombination. These groups of SNPs are called haplotype blocks or simply haplotypes: in the most general sense, as we have already explained, a haplotype is the genotype of a single chromosome or of a haploid set of chromosomes. One advantage of studying haplotypes is that they are more polymorphic than single marker loci; if the SNPs from which haplotypes are constructed are closely linked, then it may be easier to demonstrate an association between a particular region of the genome and a disease than by using single marker loci. Several recent studies showed that haplotypes, if used as genetic markers, have higher statistical power than individual markers [1]. In fact, haplotypes capture the local linkage disequilibrium information and may reflect the presence of additional, undetected mutation sites that are the underlying cause of the disease. Also, haplotypes may reflect two or more mutation sites which must act together to cause a disease, yet are harmless when present on separate chromosomes. In other words, haplotypes are expected to predict the genetic contributions to phenotypes more accurately than single SNP genotypes. Moreover, whether the SNPs contained in a haplotype lie within a single gene or span multiple genes in the sequence, it is believed that the haplotype provides the context in which those genes act.

A major difficulty in using haplotypes as genetic markers lies in determining the haplotype phase for individuals who are heterozygous for more than one marker. There are several approaches to overcome this difficulty. However, for an approach to be practical, it needs to meet low-cost and high-throughput requirements. Only such approaches can potentially be used in studies using large samples and involving a large number of genetic markers.

All in all, both genotype and haplotype data are used in genetic studies. Haplotypes are often more informative. Unfortunately, current experimental methods for haplotype determination are technically complicated and prohibitively expensive [24]. In contrast, SNP genotypes can be detected by using a variety of cheap technologies (see next section). After generating the genotypes of a statistically relevant number of individuals, it is possible to use computer algorithms to infer haplotypes in a process called resolving, phasing or haplotyping [14], [22], [32], [2], [21]. These inferred haplotypes typically have an accuracy greater than 90%. Let us note that a single genotype may be resolved by different, equally plausible haplotype pairs (see Fig. 2.8), but the joint inference of a set of genotypes may favor one haplotype pair over the others for each individual. Such inference is usually based on a model for the data.
Informally, most models rely on the observed phenomenon that over relatively short genomic regions, different human genotypes tend to share the same small set of haplotypes [65], [16]. We conclude the section with a remark on the International HapMap Project [92], which is conducting an ambitious study to generate haplotype maps based on the genotypes of hundreds of individuals, with the expectation that the resulting data will parse into a few general, common haplotypes. The results of this effort will become public domain, with the HapMap freely available to all researchers, both academic and commercial.
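The ambiguity discussed in this section, where one genotype admits several haplotype pairs, can be made concrete with a short enumeration. The following is a minimal sketch, not a method taken from the literature cited here: the function name `haplotype_pairs` and the representation of a genotype as a list of two-letter strings (the two unordered alleles at each SNP) are assumptions made for illustration, and the example sequences are hypothetical.

```python
from itertools import product


def haplotype_pairs(genotype):
    """Enumerate the unordered haplotype pairs compatible with a genotype.

    `genotype` is a list of 2-character strings, one per SNP, holding the
    two (unordered) alleles observed at that site: "CT" is heterozygous,
    "GG" is homozygous. This representation is an illustrative assumption.
    """
    # Indices of the heterozygous sites; only these create ambiguity.
    het = [j for j, (a, b) in enumerate(genotype) if a != b]
    pairs = set()
    # For every heterozygous site, decide which allele lies on the first copy.
    for choice in product((0, 1), repeat=len(het)):
        h1 = [site[0] for site in genotype]
        h2 = [site[1] for site in genotype]
        for j, flip in zip(het, choice):
            if flip:
                h1[j], h2[j] = h2[j], h1[j]
        # Sorting the pair discards the (unknown) order of the two chromosomes.
        pairs.add(tuple(sorted(("".join(h1), "".join(h2)))))
    return sorted(pairs)


if __name__ == "__main__":
    # Three heterozygous loci: 2**3 = 8 assignments, 4 distinct phases.
    print(len(haplotype_pairs(["Aa", "Bb", "Cc"])))
    # Six SNPs, heterozygous at sites 2 and 5 only: two candidate pairs.
    for pair in haplotype_pairs(["AA", "CT", "GG", "TT", "AG", "CC"]):
        print(pair)
```

With k heterozygous sites the loop visits 2^k assignments, which collapse to 2^(k-1) unordered pairs when k is at least 1, and to a single pair when the genotype is homozygous everywhere. This exponential growth is why the joint inference over many genotypes needs a model to prefer one pair over the others.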

Figure 2.8: An example of 6 SNPs along two homologous chromosomes of an individual. (a) The individual's haplotypes. (b) The individual's genotype. Here the set of heterozygous SNPs is {2, 5}. (c) Another potential haplotype pair giving rise to the same genotype. Note that only SNPs are presented here. Every two SNPs can be separated by several hundred monomorphic base pairs.

2.5 Technical methods to obtain genotypes

One of the aims of the Human Genome Project (see footnote 3) is the discovery of millions of DNA sequence variants in the human genome. The procedure of detecting SNPs is called genotyping. Since genotypes are the data of our problem, we want to give a general idea of the methodology for SNP genotyping in terms of the mechanisms of allelic discrimination and the detection modalities; we also describe a genotyping method currently in use. Genotyping methods are preferred to haplotyping ones because, in general, they possess the following attributes: (a) the assay is easily and quickly developed from sequence information; (b) the cost of assay development is low in terms of marker-specific reagents and time spent by expert personnel on optimization; (c) the assay is easily automated and requires minimal hands-on operation; (d) the data analysis is simple, with automated, accurate genotype definition; (e) the reaction format is flexible and scalable, capable of performing a few hundred to a million assays per day; and (f) once optimized, the total assay cost per genotype (including equipment, reagents, and personnel) is low.

Allelic discrimination detects different forms of the same gene that differ by a nucleotide substitution, insertion, or deletion. At the DNA level, we can say that allelic discrimination detects SNPs in a specific sequence.

Footnote 3. Begun formally in 1990, the Human Genome Project was a 13-year effort coordinated by the U.S. Department of Energy and the National Institutes of Health. The project originally was planned to last 15 years, but rapid technological advances accelerated the completion date to 2003. Project goals were to identify all the approximately 20,000-25,000 genes in human DNA, determine the sequences of the 3 billion chemical base pairs that make up human DNA, store this information in databases, improve tools for data analysis, transfer related technologies to the private sector, and address the ethical, legal, and social issues that may arise from the project.

Sequence-specific detection relies on four general mechanisms for allelic discrimination: allele-specific hybridization, allele-specific nucleotide incorporation, allele-specific oligonucleotide ligation, and allele-specific invasive cleavage. All four mechanisms are reliable, but each has its pros and cons. For instance, with the hybridization approach, two allele-specific probes (see footnote 4) are designed to hybridize to the target sequence only when they match perfectly (see Fig. 2.9). Under optimized assay conditions, a one-base mismatch destabilizes the hybridization enough to prevent the allelic probe from annealing (see footnote 5) to the target sequence. Because no enzymes are involved in allelic discrimination, hybridization is the simplest mechanism for genotyping. The challenge in ensuring robust allelic discrimination lies in the design of the probe. With ever more sophisticated probe design algorithms, allele-specific probes can be designed with a high success rate. When the allele-specific probes are immobilized on a solid support, labeled target DNA samples are captured, and the hybridization event is visualized with a fluorescence filter by detecting the label after the unbound targets are washed away. Knowing the location of the probe sequences on the solid support allows one to infer the genotype of the target DNA sample.

A positive allelic discrimination reaction is detected by monitoring the light emitted by the products, measuring the mass of the products, or detecting a change in an electrical property when the products are formed. Numerous labels with various light-emitting properties have been synthesized and utilized in detection methods based on light detection or electrical detection. In general, only one label with ordinary properties is needed in genotyping methods where the products are separated or purified from the excess starting reagents.
Monitoring light emission is the most widely used detection modality in genotyping, and there are many ways to do so. Luminescence, fluorescence, time-resolved fluorescence, fluorescence resonance energy transfer (FRET), and fluorescence polarization (FP) are useful properties of light utilized in a host of genotyping methods [48].

Footnote 4. A sequence of DNA or RNA, labeled or marked with a radioactive isotope, used to detect the presence of complementary nucleotide sequences by hybridization.

Footnote 5. Annealing, in biology, means for DNA or RNA to pair by hydrogen bonds to a complementary sequence, forming a double-stranded polynucleotide. The term is often used to describe the binding of a DNA probe, or the binding of a primer to a DNA strand during polymerase chain reaction.

Figure 2.9: Allele-specific hybridization (the probe is the sequence segment with the C).
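The logic of allele-specific hybridization described in this section can be mimicked by a toy model. The sketch below is an idealization under stated assumptions, not a description of a real assay: a probe is taken to anneal only when its reverse complement occurs exactly in the target, and the function names, the eight-base probe windows and the two sample sequences are all hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def hybridizes(probe, target):
    """Idealized annealing: a perfect match only, so one mismatch prevents it."""
    return reverse_complement(probe) in target


def call_genotype(probes, copies):
    """Infer the genotype at one SNP from which allele-specific probes anneal.

    `probes` maps each allele to its probe sequence; `copies` holds the two
    chromosome copies of the sample (both hypothetical inputs).
    """
    detected = sorted(
        allele
        for allele, probe in probes.items()
        if any(hybridizes(probe, copy) for copy in copies)
    )
    if len(detected) == 1:
        return f"{detected[0]}/{detected[0]}"  # homozygous
    if len(detected) == 2:
        return f"{detected[0]}/{detected[1]}"  # heterozygous
    raise ValueError("no call: expected one or two alleles to be detected")


# Hypothetical probes for a C/T SNP, built from the sequence around the site.
PROBES = {
    "C": reverse_complement("ACGTCGGA"),
    "T": reverse_complement("ACGTTGGA"),
}

if __name__ == "__main__":
    print(call_genotype(PROBES, ["TTACGTCGGATT", "TTACGTTGGATT"]))  # C/T
    print(call_genotype(PROBES, ["TTACGTCGGATT", "TTACGTCGGATT"]))  # C/C
```

Note that the call reports only which alleles are present in the sample, not which chromosome copy carries which allele. This is the sense in which genotyping loses the phase information that haplotyping must later recover.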

Chapter 3 AN OVERVIEW ON HAPLOTYPING INFERENCE

Any of the four nucleotides {A, T, C, G} could be present at any position in the genome, so it might be imagined that each SNP should have four alleles. Theoretically this is possible, but in practice most SNPs exist as just two variants. This is because of the way in which SNPs arise and spread in a population. A SNP originates when a point mutation occurs in a genome, converting one nucleotide into another. If the mutation is in the reproductive cells of an individual, then one or more of the children might inherit the mutation and, after many generations, the SNP may eventually become established in the population. But there are just two alleles: the original sequence and the mutated version. For a third allele to arise, a new mutation must occur at the same position in the genome in another individual, and this individual and his or her offspring must reproduce in such a way that the new allele becomes established. This scenario is not impossible but it is unlikely; consequently, the vast majority of SNPs are biallelic.

That allows us to represent a haplotype $h$ with $n$ SNPs as a row vector of length $n$ with binary entries. Each component $h_j$ of the vector indicates the state (i.e., the allele) at a particular polymorphic position in this haplotype: $h_j \in \{0,1\}$. Similarly, a genotype $g$, which is the conflated data of two haplotypes, is represented by an $n$-dimensional vector, where each component $g_j \in \{0,1,2\}$: 0 and 1 denote homozygous sites, while heterozygous sites are denoted by 2. We introduce the conflate operator $\oplus : \{0,1\} \times \{0,1\} \to \{0,1,2\}$, defined as follows:

\[ 0 \oplus 0 = 0, \qquad 1 \oplus 1 = 1, \qquad 0 \oplus 1 = 1 \oplus 0 = 2, \]

which generalizes to vectors in the obvious way: given an $n$-dimensional genotype $g$ and a pair of $n$-dimensional haplotypes $h_1$ and $h_2$,

\[ g = h_1 \oplus h_2 \iff g_j = h_{1,j} \oplus h_{2,j} \quad (j = 1,\dots,n). \]

Thus a pair $h_1, h_2$ of haplotypes is compatible with a genotype $g$ if $h_1$ and $h_2$ both contain 0 in a position where $g$ contains 0, 1 in a position where $g$ contains 1, and opposite binary values where $g$ contains 2 (see Table 3.1 for the haplotype coding and Table 3.2 for the genotype coding); they are said to generate or explain $g$, and both $h_1$ and $h_2$ are said to be consistent with $g$.

Alleles   C/A   G/A   C/G   T/C   T/C   G/A   C/G   Encoded
h_1        C     G     C     T     T     A     C    0000010
h_2        C     G     G     C     C     G     G    0011101
h_3        A     A     C     T     T     A     C    1100010
h_4        C     G     C     T     T     G     C    0000000
h_5        A     A     C     T     T     G     C    1100000

Table 3.1: Example of haplotype coding. There are 7 SNPs, each of them biallelic. At each SNP the most frequent allele is encoded by 0 and the least frequent one by 1. The encoded haplotypes (the binary vectors) are in the last column of the table.

Genotype                                            Encoded
g_1       C/A   G/G   C/C   T/C   T/C   A/G   C/C   2002220
g_2       C/C   G/G   G/C   C/T   C/C   G/A   G/C   0022122
g_3       A/A   A/A   C/C   T/C   T/C   A/A   C/C   1102210
g_4       C/A   G/A   C/C   T/C   T/T   G/G   C/G   2202002

Table 3.2: Example of genotype coding. For each SNP of each genotype we have only mixed information, that is, we know whether the site is homozygous or heterozygous. The encoded genotypes (vectors in $\{0,1,2\}^7$) are in the last column of the table.

The Haplotyping Inference (HI) problem consists in determining the allele values for a set of SNPs given genotypes as input. In other words, given a set $G$ of genotypes, we have to find a set $H$ of haplotypes such that for each $g \in G$ there exist $h_1, h_2 \in H$ with $h_1 \oplus h_2 = g$. Different versions of the haplotyping problem are known in the literature and have been extensively studied in recent years, under many objective functions, scenarios and applications.
This chapter surveys some of the approaches proposed for this biological problem, focusing mainly on formulations, algorithmic approaches, complexity results and existing software tools.
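The definitions introduced above (binary haplotype coding, the conflate operator, consistency, and the requirement that a haplotype set explain every genotype) can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, haplotypes and genotypes are stored as tuples of integers, and the alphabetical tie-break in the encoding is an assumption, since the rule "most frequent allele is 0" does not say what to do when two alleles are equally frequent. The letter haplotypes are those listed in Table 3.1.

```python
from collections import Counter


def encode_haplotypes(rows):
    """Binary-encode letter haplotypes, SNP by SNP.

    At each position the most frequent allele among `rows` becomes 0 and the
    other allele becomes 1. Ties are broken alphabetically (an assumption).
    """
    n = len(rows[0])
    zero_allele = []
    for j in range(n):
        counts = Counter(row[j] for row in rows)
        # max() keeps the first maximal element of the sorted allele list.
        zero_allele.append(max(sorted(counts), key=counts.__getitem__))
    return [tuple(0 if row[j] == zero_allele[j] else 1 for j in range(n))
            for row in rows]


def conflate(h1, h2):
    """Conflate two haplotypes: equal entries keep their value, unequal give 2."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same length")
    return tuple(a if a == b else 2 for a, b in zip(h1, h2))


def is_consistent(h, g):
    """A haplotype is consistent with g if it agrees on every homozygous site."""
    return len(h) == len(g) and all(gj == 2 or hj == gj for hj, gj in zip(h, g))


def explains(H, G):
    """Feasibility condition of HI: each genotype is the conflation of two
    (not necessarily distinct) haplotypes of H."""
    return all(any(conflate(h1, h2) == g for h1 in H for h2 in H) for g in G)


if __name__ == "__main__":
    letters = ["CGCTTAC", "CGGCCGG", "AACTTAC", "CGCTTGC", "AACTTGC"]
    H = encode_haplotypes(letters)
    g = conflate(H[2], H[4])        # genotype generated by h_3 and h_5
    print(H[2], H[4], g)
    print(explains(H, [g]))         # True: H contains a generating pair
    print(explains(H[:1], [g]))     # False: h_1 alone cannot generate g
```

The `explains` check tries every ordered pair, so it costs on the order of |H|^2 conflations per genotype. That is adequate for checking a candidate solution, whereas finding a suitable set H in the first place is the actual optimization problem studied in the following chapters.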
