Study and Implementation of Various Techniques Involved in DNA and Protein Sequence Analysis

Kumud Joseph Kujur, Sumit Pal Singh, O.P. Vyas, Ruchir Bhatia, Varun Singh*
Indian Institute of Information Technology - Allahabad, India
*Email: rickky27@rediffmail.com

ABSTRACT

Various techniques and algorithms exist for pairwise sequence alignment, multiple sequence alignment and entropy analysis. This paper results from the study and implementation of these techniques for DNA-to-protein translation, sequence alignment and comparison, multiple sequence alignment and entropy analysis.

DNA-to-protein translation is performed by detecting the open reading frame (ORF), taking a DNA coding sequence (CDS) as input. The sequence is converted into amino acids three nucleotides (one codon) at a time; each codon specifies one amino acid. Three reading frames are obtained by shifting the starting position by one nucleotide at a time, and the other three frames come from the complementary sequence. The codons are checked for start and stop codons, which mark the possible proteins; the longest of these gives the final protein.

In pairwise sequence alignment and comparison, the input query sequence, a primary protein sequence, is compared with the subject sequences that exist in the database. Local alignment and global alignment are the techniques used: the Smith-Waterman algorithm was implemented for local alignment and the Needleman-Wunsch algorithm for global alignment. Various scoring schemes can be used, including PAM and BLOSUM. Both techniques are discussed in detail later.

Pairwise comparison is fundamental to sequence analysis. However, analysis of groups of sequences that form gene families requires the ability to make connections between more than two members of the group, in order to reveal subtle conserved family characteristics. The process of multiple alignment can be regarded as an exercise in enhancing the signal-to-noise ratio within a set of sequences, which ultimately facilitates the elucidation of biologically significant motifs.

Entropy analysis is done to detect the coding and noncoding regions in a DNA sequence. The sliding window method and the recursive segmentation method are applied to calculate the Jensen-Shannon divergence, which helps to distinguish the homogeneous segments in a heterogeneous DNA sequence.

Keywords: Pairwise alignment, Multiple alignment, Phylogeny, Entropy, DNA sequences.

PROBLEM DEFINITION AND SOLUTION

1. DNA TO PROTEIN TRANSLATION

The conversion of DNA to protein is carried out through replication, transcription and translation.

[Figure: transcription followed by translation]

After transcription, the next step is to remove the unnecessary parts, called introns, so that the useful parts, the exons, are concatenated. The result is the coding sequence (CDS). Starting from the first nucleotide of the CDS, we form triplets (codons) until the end. Given a CDS and the genetic code, it is possible to translate the DNA into protein by looking up successive codons in a genetic code table. However, this is only one case; the other cases arise when we shift the starting nucleotide by one and by two positions, and similarly for the complementary strand. Thus there are six frames in all, giving six candidate proteins; the correct protein is taken to be the longest one among these six frames that starts at a start codon and ends at a stop codon. [1]

Detecting open reading frames: An ORF is normally deemed to be the longest frame uninterrupted by a stop codon. Finding the end of an ORF is easier than finding its beginning. Usually, the initial codon in the CDS is that for methionine (ATG); but methionine is also a common residue within the CDS, so its presence is not an absolute indicator of ORF initiation. Several features may be used as indicators of potential protein-coding regions in DNA. One of these is sufficient ORF length (based on the premise that long ORFs rarely occur by chance). Recognition of flanking Kozak sequences (CCGCCATGG) may also be helpful in pinpointing the start of the CDS. [1]

Understanding the effect of introns and exons: Eukaryotic genes are made up of regions that contribute to the CDS, known as exons, and regions that do not, known as introns. One consequence of the presence of exons and introns in eukaryotic genes is that potential gene products can be of different lengths, because not all exons may be represented in the final transcribed mRNA.
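The six-frame translation described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names (reverse_complement, six_frames, longest_orf) are hypothetical, the standard genetic code is assumed, the reverse frames are read from the reverse complement, and the input is assumed to contain only the letters A, C, G and T. Stop codons are written as # as in this paper, while methionine is written with its usual letter M.

```python
import re
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '#' marks a stop codon.
AMINO = "FFLLSSSSYY##CC#WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Complement every base, then reverse the strand."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna, offset):
    """Translate complete codons starting at the given offset (0, 1 or 2)."""
    return "".join(CODON_TABLE[dna[i:i + 3]]
                   for i in range(offset, len(dna) - 2, 3))


def six_frames(dna):
    """Return the three forward and three reverse translations."""
    dna = dna.upper()
    rev = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"Forward {offset}"] = translate(dna, offset)
        frames[f"Reverse {offset}"] = translate(rev, offset)
    return frames


def longest_orf(frames):
    """Pick the longest stretch running from a start (M) to a stop (#)."""
    best = (None, "")
    for name, protein in frames.items():
        for match in re.finditer(r"M[^#]*#", protein):
            if len(match.group()) > len(best[1]):
                best = (name, match.group())
    return best


if __name__ == "__main__":
    frames = six_frames("ACATGAGTCGTACGTAGCTGACTGATCGT")
    for name, protein in frames.items():
        print(name, protein)
    print(longest_orf(frames))
```

Choosing the longest start-to-stop stretch mirrors the rule stated in the text; a production ORF finder would typically also apply a minimum length and look at the sequence context of the start codon.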
The whole process involved in DNA-to-protein translation can be illustrated as follows.

Query sequence (DNA): ACATGAGTCGTACGTAGCTGACTGATCGT

Six-frame amino-acid translation:
Forward 0: T#VVRS#LI
Forward 1: HESYVAD#S
Forward 2: *SRT#LTDR
Reverse 0: TISQLRTTH
Reverse 1: RSVSYVRL*
Reverse 2: DQSATYDSC

The start codon is represented by * and the stop codon by #. From these six frames we conclude that the protein is generated from the Forward 2 translation, and the possible protein is *SRT#, with length 5.

After obtaining this possible protein, alignment and comparison are applied to it with respect to other protein sequences present in the databases. If we get an identical match or high similarity, the protein produced is related to a known gene family. If not, the translated protein sequence may indicate the existence of a new gene family.

2. PAIRWISE SEQUENCE ALIGNMENT

Pairwise alignment is a fundamental process in sequence analysis, carried out to find the relationship between any two sequences (protein, DNA or RNA) based on their sequence properties. This section describes the comparison of two sequences, a query sequence (whose properties need to be determined) and a subject sequence (whose properties are already known), by searching for series of individual characters or character patterns that occur in the same order in both sequences. This helps in identifying any similarity (of function) or evolutionary relationship (homology) between the query sequence and a family of known genes. A correct alignment is required to find which segment of the gene has been altered (for example by point mutation, insertion, deletion or duplication).

There are two types of sequence alignment technique, local and global. In global alignment, an attempt is made to align the entire sequence, using as many characters as possible, up to both ends of each sequence. Sequences that are quite similar and approximately the same length are suitable candidates for global alignment. In local alignment, stretches of sequence with the highest density of matches are aligned, generating one or more islands of matches, or sub-alignments, in the aligned sequences. Local alignments are more suitable for sequences that are similar along some of their length but dissimilar in other parts, sequences that differ in length, or sequences that share a conserved region or domain. The first technique deals with similarity across the entire length of the sequences, and the second with regions of similarity in parts of the sequences (subsequences).
It is important to understand the difference between these two alignments, as sequences are not uniformly similar or identical. There is no point in performing a global alignment on sequences that share only local similarity. Let us now discuss each of these techniques in detail.

SIMILARITY AND IDENTITY

Not only identity but also similarity is biologically significant. Many amino acids can be substituted by another with the same chemical properties, and the substituted amino acid remains compatible with protein structure and function. Hence we can also take into account different scoring matrices (e.g. PAM or BLOSUM). These scoring matrices assign a different score to each pairing depending on similarity or dissimilarity. Some scoring matrices are superior to others at finding related proteins based on either sequence or structure. BLOSUM matrices take into account the full range of amino acid substitutions in families of related proteins. PAM matrices are based on variation in closely related proteins, which is extrapolated to produce matrices for more distantly related proteins.

GLOBAL ALIGNMENT

As already discussed, this alignment technique finds the similarity across the full length of the sequences. The algorithm used here is the Needleman-Wunsch algorithm (named after the scientists who proposed it), which is based on dynamic programming. The algorithm consists of three main parts, illustrated here on two example sequences.

1) We form a matrix representation of the query sequence and the subject sequence by placing them along the margins of the matrix. This matrix is a unitary matrix, which weights identical elements with value 1 and the rest with value 0. The elements can also be scored according to a scoring matrix.

Table: Initial setup for Needleman-Wunsch

2) The next step is to assign a score to all the pathways. Here we start from the bottom right and end at the upper left of the matrix; the matrix can equally be traversed in the opposite direction. The scoring (matrix fill) is done as:

$$M(i,j) = M(i,j) + \max\left[\, M(k, j+1),\; M(i+1, l) \,\right]$$

where k is any integer greater than i and l is any integer greater than j.

Table: Half way through the second step

3) The final step is to trace back the whole path. The trace back starts from the highest value (in this case the top leftmost element). The alignment is traced proceeding left to right, top to bottom, choosing the largest numbers available.

Table: Trace the alignment

LOCAL ALIGNMENT

The Needleman-Wunsch algorithm works well for sequences that show similarity across their full length. However, sequences that are distantly related to each other might show small regions of local similarity rather than similarity across the full length. The Smith-Waterman algorithm handles this problem quite efficiently. It follows the same initial matrix-based technique as the Needleman-Wunsch algorithm. The main difference between the two algorithms is that, in the Smith-Waterman case, each element in the matrix defines the end point of a potential alignment (any element of the matrix can have the highest value, not necessarily a terminal one). Only minimal changes to the Needleman-Wunsch algorithm are required. These are:

1) A negative score (weight) must be given to mismatches; whenever a negative score would result for a matrix element, zero is substituted. The score at any matrix point is given by

$$S_{ij} = \max\left\{\, S_{i-1,j-1} + s(a_i, b_j),\;\; \max_{x \ge 1}\left(S_{i-x,j} - W_x\right),\;\; \max_{y \ge 1}\left(S_{i,j-y} - W_y\right),\;\; 0 \,\right\}$$

where S_ij is the score at position i in sequence A and position j in sequence B, s(a_i, b_j) is the score for aligning the characters at positions i and j, W_x is the penalty for a gap of length x in sequence A, and W_y is the penalty for a gap of length y in sequence B.

2) As explained above, the beginning and the end of an optimal path may be found anywhere in the matrix, not only at the endpoints.

In this example the penalty for a mismatch is 0.5 and the gap penalty is 0.
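The two dynamic-programming schemes above can be sketched in a single Python function. This is a minimal illustration under stated assumptions, not the implementation used in this paper: the matrix is filled from the top left (the mirror image of the bottom-right fill described in the text), a simple linear gap penalty is used instead of a length-dependent W_x, matches and mismatches get fixed scores instead of PAM or BLOSUM values, and the names align, match, mismatch, gap and local are hypothetical.

```python
def align(a, b, match=1.0, mismatch=-1.0, gap=-1.0, local=False):
    """Global (Needleman-Wunsch) or local (Smith-Waterman) alignment.

    Returns (score, aligned_a, aligned_b). The gap penalty is linear.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0.0] * cols for _ in range(rows)]

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    if not local:
        # Global alignment pays for leading gaps; local alignment does not.
        for i in range(1, rows):
            score[i][0] = score[i - 1][0] + gap
        for j in range(1, cols):
            score[0][j] = score[0][j - 1] + gap

    best, best_pos = 0.0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            cell = max(score[i - 1][j - 1] + sub(i, j),
                       score[i - 1][j] + gap,
                       score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0.0)      # negative scores become zero
                if cell > best:            # any cell may end the alignment
                    best, best_pos = cell, (i, j)
            score[i][j] = cell

    # Trace back from the corner (global) or from the best cell (local).
    i, j = best_pos if local else (rows - 1, cols - 1)
    final = score[i][j]
    top, bottom = [], []
    while i > 0 or j > 0:
        if local and score[i][j] == 0:
            break                          # a zero ends a local alignment
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return final, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(align("GATTACA", "GATACA"))
    print(align("TTTGATTACATTT", "CCCGATTACACCC", local=True))
```

The only differences between the two modes are the ones listed in the text: the local variant replaces negative scores with zero and starts its trace back from the highest-scoring cell rather than from a corner of the matrix.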

Table: Smith-Waterman example

GAP AND GAP PENALTY

The inclusion of gaps and gap penalties is necessary in order to obtain the best alignment between any two sequences. Gaps are the result of changes (mutations) occurring in a particular sequence during evolution, so our job is to allow gaps at the right positions in the sequence to get a meaningful result. A gap penalty combines a gap opening penalty and a gap extension penalty. The total gap penalty for a sequence can be summarized as:

$$W = n_{\mathrm{gaps}} \cdot g_{\mathrm{open}} + L_{\mathrm{gap}} \cdot g_{\mathrm{ext}}$$

where n_gaps is the number of gaps opened, g_open is the opening penalty, L_gap is the gap length and g_ext is the extension penalty.

The values of these penalties are chosen so that they do not disturb the overall balance. If the gap penalty is too high compared to the matrix scores, gaps will never appear in the alignment. On the other hand, if the gap penalty is too low compared to the matrix scores, gaps will appear everywhere in the alignment in order to align as many identical characters as possible.

3. MULTIPLE SEQUENCE ALIGNMENT

Pairwise comparison is fundamental to sequence analysis. However, analysis of groups of sequences that form gene families requires the ability to make connections between more than two members of the group, in order to reveal subtle conserved family characteristics. The process of multiple alignment can be regarded as an exercise in enhancing the signal-to-noise ratio within a set of sequences, which ultimately facilitates the elucidation of biologically significant motifs.

The goal of multiple sequence alignment is to generate a concise, information-rich summary of sequence data in order to inform decision-making on the relatedness of the sequences to a gene family. Sometimes, indeed, multiple alignments may be used to express the dissimilarity between a set of sequences. Alignments should be regarded as models that can be used to test hypotheses. Just as there is nothing inherently correct or incorrect about any particular pairwise alignment, the same maxim holds for multiple alignments. [2]

Definition of multiple sequence alignment: Here, a small alignment of 5 short sequences (I-V) is presented. The sequences have been arranged so that the most similar residues are brought into vertical register, through the use of gaps, while the order of residues in each sequence is preserved.

MULTIPLE SEQUENCE ALIGNMENTS -- USES

Just as the alignment of a pair of nucleic acid or protein sequences can reveal whether or not there is an evolutionary relationship between the sequences, so can the alignment of three or more sequences reveal relationships among multiple sequences. A multiple sequence alignment of a set of sequences can provide information as to the most alike regions in the set. In proteins, such regions may represent conserved functional or structural domains. If the structure of one or more members of the alignment is known, it may be possible to predict which amino acids occupy the same spatial relationship in other proteins in the alignment. In nucleic acids, such alignments also reveal structural and functional relationships. For example, aligned promoters of a set of similarly regulated genes may reveal consensus binding sites for regulatory proteins.

Consensus

Another use for consensus information retrieved from a multiple sequence alignment is the prediction of specific probes for other members of the same group or family of similar sequences in the same or other organisms. There are both computer and molecular biological applications. Once a consensus pattern has been found, database searching programs may be used to find other sequences with a similar pattern.

MULTIPLE SEQUENCE ALIGNMENT (MSA) TO PHYLOGENETIC ANALYSIS -- RELATIONSHIP

Once the MSA has been found, the number or types of changes in the aligned sequence residues may be used for a phylogenetic analysis. The alignment provides a prediction as to which sequence characters correspond.
Each column in the alignment predicts the mutations that occurred at one site during the evolution of the sequence family, as illustrated in Figure 1. Within the column are original characters that were present early, as well as other derived characters that appeared later in evolutionary time. In some cases the position is so important for function that mutational changes are not observed. It is these conserved positions that are useful for producing an alignment. In other cases, the position is less important, and substitutions are observed. Deletions and insertions may also be present in some regions of the alignment. Thus, starting with the alignment, one can hope to dissect the order of appearance of the sequences during evolution.

Seq A   A  -  N  Q  P
Seq B   A  -  N  -  P
Seq C   A  R  Y  Q  P
Seq D   A  -  Y  Q  P

Figure 1: The close relationship between MSA and evolutionary tree construction. Shown is a short section of one MSA of four protein sequences, including conserved and substituted positions, an insertion (of R) and a deletion (of Q).

PROGRESSIVE GLOBAL ALIGNMENT

The pairwise alignment technique can be extended to align multiple sequences. However, the number of sequences that can be aligned in this way is limited, because the number of computational steps and the amount of memory required grow exponentially with the number of sequences to be analyzed. Progressive alignment is the most commonly used method to align biological sequences. This heuristic approach is very rapid, requires little memory and offers good performance on relatively well-conserved, homologous sequences.

Description of progressive alignment methods: Progressive alignment consists of building a multiple alignment from pairwise alignments in three steps:

a) Compute the alignment scores (or distances) between all pairs of sequences.
b) Build a guide tree that reflects the similarities between sequences, using the pairwise alignment distances (as in Figure 1).
c) Align the sequences following the guide tree. At each node in the tree, the two sequences or alignments associated with its daughter nodes are aligned. The process is repeated, beginning from the tree leaves (the sequences) and ending at the tree root.

The problem with progressive alignment stems from the greedy nature of the algorithm: any mistake that appears during early alignments cannot be corrected later as new sequence information is added.
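Steps (a) and (b) above can be sketched in Python as follows. This is a hedged illustration only: it assumes plain edit distance as the pairwise distance and average-linkage (UPGMA-style) clustering for the guide tree, neither of which is stated in this paper, and the names edit_distance and guide_tree are hypothetical. Step (c), the alignment of sequences and sub-alignments along the tree, is not shown. The example sequences are the rows of Figure 1 with the gaps removed.

```python
from itertools import combinations


def edit_distance(a, b):
    """Minimum number of substitutions, insertions and deletions (step a)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def guide_tree(seqs):
    """Build a guide tree as nested tuples of sequence names (step b).

    seqs maps a name to its sequence. The two closest clusters are merged
    repeatedly; the distance between clusters is the average pairwise
    distance between their members.
    """
    dist = {frozenset(pair): edit_distance(seqs[pair[0]], seqs[pair[1]])
            for pair in combinations(seqs, 2)}
    # Each key is a (sub)tree; each value lists the leaves below it.
    clusters = {name: [name] for name in seqs}

    def linkage(x, y):
        pairs = [dist[frozenset((m, n))]
                 for m in clusters[x] for n in clusters[y]]
        return sum(pairs) / len(pairs)

    while len(clusters) > 1:
        x, y = min(combinations(clusters, 2), key=lambda p: linkage(*p))
        clusters[(x, y)] = clusters.pop(x) + clusters.pop(y)
    return next(iter(clusters))


if __name__ == "__main__":
    sequences = {"A": "ANQP", "B": "ANP", "C": "ARYQP", "D": "AYQP"}
    print(guide_tree(sequences))
```

The nested tuples give the order in which step (c) would proceed: the innermost pairs are aligned first, and the resulting alignments are then aligned to each other on the way up to the root.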

4. SIGNIFICANCE OF SEQUENCE ALIGNMENT

Sequence alignment is useful for discovering functional, structural and evolutionary information in biological sequences. We need the best possible (optimal) alignment to discover this information. Sequences that are very similar probably have the same function, be it a regulatory role in the case of similar DNA molecules, or a similar biochemical function and three-dimensional structure in the case of proteins. In addition, if two sequences from different organisms are similar, they may share a common ancestor. In this case the sequences are defined as homologous. The alignment indicates the changes (mutations) that could have occurred between the two homologous sequences and a common ancestor sequence during evolution. Hence one can trace the origin of a new sequence that has arisen from these mutational changes. In other cases, similar regions in sequences may not have a common ancestor but might have arisen independently by two evolutionary pathways converging on the same function, which is called convergent evolution.

5. ENTROPY ANALYSIS

The entropy is measured in linear time as the number of distinctive segments occurring in the regions. It helps in locating the borders between the coding and non-coding regions of a gene. The entropic segmentation process partitions a heterogeneous DNA sequence into homogeneous subsequences, which we term compositional domains. If we accept a domain picture of DNA sequences, it is natural to design computational approaches that segment a DNA sequence into homogeneous domains; computer algorithms that accomplish such a segmentation are commonly called segmentation algorithms. Two well-known examples of segmentation algorithms are the one based on a hidden Markov model by Churchill and the walking Markov model algorithm by Fickett et al. (1992).
In the biology community, however, most people still use the older moving window approach. One advantage of the widely used sliding-window methods is that their implementation is straightforward: one calculates the density of a sequence feature of interest within a window, moves the window along the sequence, and recalculates the density. However, the choices of window size and moving distance are, in general, arbitrary. If the window size is too large, local fluctuations that contain significant biological information may be averaged out. If the moving distance is too long, one domain can be split between two windows and its distinctive feature may not be revealed. The moving window approach has some other drawbacks as well. Another approach for the detection of coding and non-coding borders is recursive segmentation. [5]

Detection of coding-noncoding borders

The coding potential measurement is obtained from within a coding or non-coding region (as opposed to from their borders). Such a measurement can either be learned from the data or be based on existing biological knowledge. However, current biological knowledge about coding potential is still mainly limited to that of the codon structure. The fact that coding regions, unlike noncoding regions, consist of three-base units, together with the fact that these units are not used with equal probability, provides a strong signal for coding potential. [4]
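Both the step-wise scan and the recursive segmentation applied in this section rest on two small computations: the two-symbol (purine/pyrimidine) Shannon entropy H = -p*log2(p) - q*log2(q), and the divergence H(W) - (nu/n)*H(U) - (nv/n)*H(V) obtained when a sequence W of length n is cut into a left part U and a right part V. The Python sketch below is illustrative only and is not the authors' code: the function names (entropy, js_divergence, scan, segment), the exhaustive search over every cut position and the threshold argument are assumptions.

```python
from math import log2

PURINES = frozenset("AG")


def count_purines(seq):
    return sum(base in PURINES for base in seq.upper())


def entropy(purines, size):
    """Two-symbol Shannon entropy in bits; 0*log2(0) is taken as 0."""
    h = 0.0
    for count in (purines, size - purines):
        if count:
            frac = count / size
            h -= frac * log2(frac)
    return h


def js_divergence(seq, cut):
    """H(W) - (nu/n)*H(U) - (nv/n)*H(V) for U = seq[:cut], V = seq[cut:]."""
    n = len(seq)
    left, right = seq[:cut], seq[cut:]
    hw = entropy(count_purines(seq), n)
    hu = entropy(count_purines(left), len(left))
    hv = entropy(count_purines(right), len(right))
    return hw - len(left) / n * hu - len(right) / n * hv


def scan(seq, step):
    """Divergence at every multiple of step, moving from left to right."""
    return [(cut, js_divergence(seq, cut))
            for cut in range(step, len(seq), step)]


def segment(seq, threshold, offset=0):
    """Recursive segmentation: cut at the maximum divergence while it
    reaches the threshold, and return the sorted cut positions."""
    if len(seq) < 2:
        return []
    cut, best = max(((c, js_divergence(seq, c)) for c in range(1, len(seq))),
                    key=lambda item: item[1])
    if best < threshold:
        return []
    return (segment(seq[:cut], threshold, offset)
            + [offset + cut]
            + segment(seq[cut:], threshold, offset + cut))


QUERY = ("TCCATTGAGCCTTATACCAGTAACATCTACACTCGAAGATCTTGTCAGGGGAATTTCAGATTG"
         "TGAATCCTCACTTACTGAAAGATCTTACTGAGCGGGG")

if __name__ == "__main__":
    for cut, d in scan(QUERY, 20):
        print(cut, round(d, 6))
```

QUERY is the 100-base example sequence used later in this section, so the output of scan(QUERY, 20) can be compared with the divergences listed there. The exhaustive search in segment recomputes the counts for every cut and is therefore quadratic per level; running totals of purine counts would avoid that.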

Jensen-Shannon divergence was implemented to perform the entropy analysis.

Jensen-Shannon Divergence using Sliding Window Method

For a given DNA sequence of length N, our code calculates the Jensen-Shannon divergence at multiples of a step size. First, it determines the total number of purines (A and G) and pyrimidines (C and T) in the query DNA sequence, then calculates the entropy of the whole sequence, H(W): [4]

$$H = -p \log_2 p - q \log_2 q, \qquad p = \frac{\mathrm{pur}}{\mathrm{size}}, \quad q = \frac{\mathrm{pyr}}{\mathrm{size}}, \qquad \log_2 q = \frac{\log q}{\log 2}$$

Second, moving from left to right in steps, it computes the Jensen-Shannon divergence for the segments U = 1, ..., i*step and V = i*step+1, ..., n, where U is the left segment and V the right segment. At each step it determines the number of purines and pyrimidines in the left (U) and right (V) segments, calculates the entropy of each, H(U) and H(V), and then calculates the divergence for that step:

$$D = H(W) - \frac{n_u}{n} H(U) - \frac{n_v}{n} H(V)$$

where n_u is the length of the left segment, n_v the length of the right segment and n the length of the whole sequence. High divergence indicates that the left and right segments are each more homogeneous in themselves than the sequence as a whole.

Example: Consider a step size of 20 for calculating the Jensen-Shannon divergence. The results for a query sequence are given below.

Query Sequence:
TCCATTGAGCCTTATACCAGTAACATCTACACTCGAAGATCTTGTCAGGGGAATTTCAGATTG
TGAATCCTCACTTACTGAAAGATCTTACTGAGCGGGG

FOR THE WHOLE SEQUENCE
Purines: 49
Pyrimidines: 51
Genome Length: 100
Entropy H(W): 0.999711

AFTER STEP 1
Length of left segment (nu) = 20
Length of right segment (nv) = 80
Purines in U = 8
Pyrimidines in U = 12
Purines in V = 41
Pyrimidines in V = 39
DIVERGENCE = 0.005882

AFTER STEP 2
Length of left segment (nu) = 40
Length of right segment (nv) = 60
Purines in U = 18
Pyrimidines in U = 22
Purines in V = 31
Pyrimidines in V = 29
DIVERGENCE = 0.003083

AFTER STEP 3
Length of left segment (nu) = 60
Length of right segment (nv) = 40
Purines in U = 29
Pyrimidines in U = 31
Purines in V = 20
Pyrimidines in V = 20
DIVERGENCE = 0.000192

AFTER STEP 4
Length of left segment (nu) = 80
Length of right segment (nv) = 20
Purines in U = 36
Pyrimidines in U = 44
Purines in V = 13
Pyrimidines in V = 7
DIVERGENCE = 0.018678

Jensen-Shannon Divergence using Recursive Segmentation

For a sequence of length N, we calculate at each position i (0 < i < N) the entropy H(W) of the whole sequence, the entropy H(U) of the subsequence on the left side of the partition point, and the entropy H(V) of the subsequence on the right side of the partition point, and then calculate the Jensen-Shannon divergence. As a measure of the heterogeneity of the sequence we choose the maximum Jensen-Shannon divergence, say maxdjs. If this is large enough, we say that the sequence is heterogeneous and should be segmented. We recursively apply the same procedure to both the left and the right subsequence; as soon as maxdjs falls below a given threshold, the recursion along the current path is stopped. This recursive segmentation procedure is very similar to the procedure of growing a binary tree. When the segmentation is continued, two branches of the tree are generated; if it is stopped, that branch becomes a leaf. [5]

CONCLUSION AND DISCUSSION

This paper mainly discusses the implementation of basic existing algorithms and techniques used in sequence analysis, such as Needleman-Wunsch, Smith-Waterman, progressive methods for MSA, Jensen-Shannon divergence, entropy, and translation of DNA into protein. The Needleman-Wunsch and Smith-Waterman algorithms (examples of dynamic programming) are considered highly accurate in finding the optimal output but are relatively slow when large amounts of data are involved.
Dynamic programming has time and space complexity of O(n^2) in the case of pairwise alignment, where n is the length of the sequences. If we generalize this algorithm to multiple sequence alignment, we add an extra dimension for each new sequence. Thus, the complexity becomes O(n^d), where d is the number of sequences being aligned. New methods that combine dynamic programming with a heuristic approach are coming into the picture. These are much faster than the existing basic algorithms.

Entropy analysis is an open area, and research is still going on to obtain accurate results for coding and non-coding regions. Recursive segmentation is an interesting alternative to the traditional moving window approach. Admittedly, the moving window approach is simple, fast (O(N) computational complexity versus O(N log N) for recursive segmentation), and usually provides an answer to the questions of interest to investigators. Nevertheless, the recursive segmentation approach can be more accurate, and it also avoids the common problem in the moving window approach of selecting a window size and a moving distance. We suggest using recursive segmentation as a refinement of the moving window approach, or as a second-stage analysis after a rough result has been obtained from the moving window approach.

BIBLIOGRAPHY

Books
1) T K Attwood, D J Parry-Smith. Introduction to Bioinformatics. Pearson Education Asia.
2) David W Mount. Bioinformatics: Sequence and Genome Analysis.
3) Cynthia Gibas, Per Jambeck. Developing Bioinformatics Computer Skills. O'Reilly Publications.

Papers
4) Pedro Bernaola-Galván, Ivo Grosse, Pedro Carpena, José L. Oliver, Ramón Roldán, H. Eugene Stanley. Finding borders between coding and non-coding DNA regions by an entropic segmentation method. (Aug 1999)
5) Wentian Li, Pedro Bernaola-Galván, Fatameh Haghighi, Ivo Grosse. Application of recursive segmentation to the analysis of DNA sequences. (Nov 2001)
6) Aaron Davidson. A fast pruning algorithm for optimal sequence alignment.