Gene Prediction in Eukaryotes Using Machine Learning


Aarhus University, Master's Thesis

Gene Prediction in Eukaryotes Using Machine Learning

Author: Lukasz Telezynski
Supervisor: Christian Storm Pedersen

June 15, 2018


Abstract

Gene prediction, especially in eukaryotic DNA, where it is more complicated than in prokaryotes, is an interesting problem that might be solved using machine learning. Since a DNA sequence is a chain of nucleotides, a Hidden Markov Model (HMM) can be used to infer gene structure. In this project I focus on predicting the protein coding parts of eukaryotic DNA sequences. The problem is approached with a Hidden Markov Model, in particular the Viterbi algorithm, since it finds the most probable sequence of hidden states for a given sequence of observable symbols. Therefore, given a DNA sequence, we can find a sequence of hidden states indicating whether the nucleotide at each position is protein coding or non-coding. I have implemented a Viterbi algorithm that allows hidden states to emit a variable number of symbols and developed a set of models of eukaryotic gene structure, from very simple to more complex. Each model was trained with a program implementing the training-by-counting method. The Viterbi implementation with the trained models was used to infer which nucleotides of a DNA sequence are protein coding and which are non-coding. A cross-validation experiment was conducted with the Genie data set, and the obtained results were evaluated using various prediction accuracy measures, both at the nucleotide and the exon level. The resulting prediction accuracy measures were compared with two existing programs, Genscan and Augustus. However, the prediction accuracy of the models I have developed, even though it increases with model complexity, was not as high as that of Genscan or Augustus.


Contents

Abstract
Contents
1 Introduction
2 Background
  2.1 HMM
  2.2 Viterbi decoding
  2.3 HMM extensions and modifications
  2.4 Training by counting
  2.5 Existing programs
    2.5.1 Genscan
    2.5.2 Augustus
3 Implementations
  3.1 Viterbi
  3.2 Training
  3.3 Other
4 Experiments
  4.1 Prediction accuracy measures
  4.2 Data set
  4.3 Validation
    4.3.1 Training
    4.3.2 Prediction
    4.3.3 Results evaluation
5 Models of eukaryotic gene structure
  5.1 Simple 3-state model
  5.2 Single and multi exon genes
  5.3 Translation start and termination codons
  5.4 Nucleotide frequencies bias
  5.5 Introns
  5.6 Codon usage bias
  5.7 Splice sites and translation start
6 Conclusions and future work
References
A Appendix

Chapter 1

Introduction

DNA, in the simplest definition, is a chain of 4 nucleotides: adenine, cytosine, guanine and thymine. However, eukaryotic DNA is significantly different from the genetic material of prokaryotic organisms. Eukaryotic genomes are bigger, but only a small fraction consists of protein coding genes. Also, while in prokaryotes each protein is coded by a single uninterrupted chain of nucleotides, eukaryotic genes might be interrupted by non-coding introns. These differences are presented with simple models in Fig. 1.1. They make gene prediction a harder task and the models more complex.

Figure 1.1: Gene structure with respect to protein coding parts in prokaryotes (a) and eukaryotes (b)

Since DNA can be represented as a sequence of 4 symbols, Hidden Markov Models (HMM) might be used to predict gene structure, which is the main goal of this work. I have implemented a Viterbi algorithm that allows hidden states to emit a variable number of symbols and developed a set of models of eukaryotic gene structure, from very simple to more complex. Each model was trained with a program implementing the training-by-counting method. The Viterbi implementation with the trained models was used to infer which nucleotides of a DNA sequence are protein coding and which are non-coding. The obtained results were evaluated using various prediction accuracy measures, both at the nucleotide and the exon level.

The resulting prediction accuracy measures were compared with two existing programs, Genscan and Augustus. AC and EAvg, as the most important measures, evaluate predictions at the nucleotide and exon levels respectively. Prediction results of the most complex model developed in this project, HMMcod53st, are compared with the mentioned programs in Table 1.1.

Table 1.1: Summary of average prediction accuracy measures for all sequences in the validation data sets. AC - approximation correlation, EAvg - average of exon level sensitivity (ESn) and specificity (ESp).

Model        AC    EAvg
Genscan
Augustus
HMMcod53st

In Chapter 2, the theory behind HMM is presented as well as existing gene finding programs. Next, the implementation of the Viterbi algorithm and the training programs are described in Chapter 3. The data set and the experiment setup are explained in Chapter 4. The main part, a description of models of eukaryotic gene structure from the simplest one to the most complex, can be found in Chapter 5. Programs and scripts I have developed and used in this project are available online [1] in the form of a zip file. It also contains files with model parameters and short descriptions, as well as files used for testing my Viterbi implementation.

Chapter 2

Background

In this chapter I present the theory necessary to understand the methods and algorithms used in this project. I start with a general description of the Hidden Markov Model (2.1). Then, in the following sections, I explain the Viterbi decoding algorithm (2.2) and the extensions and modifications of it (2.3) that I have used. In section 2.4 the training-by-counting method is described, and in the last part (2.5) two gene structure prediction programs based on HMM are presented.

2.1 HMM

In a simple Markov Model (Fig. 2.1a) all the states can be directly observed, X = x_1, x_2, ..., x_{N-1}, x_N, where N is the length of the sequence. In an HMM (Fig. 2.1b), however, the sequence of states that makes up the Markov chain cannot be directly observed; therefore, they are called hidden states, Z = z_1, z_2, ..., z_{N-1}, z_N. The output of an HMM is a sequence of observables that depend on the hidden states.

Figure 2.1: Markov Model (a) and Hidden Markov Model (b)

In order to calculate the probability of observing a particular sequence X, called the joint probability, we need a set of parameters. In the first case we only need initial probabilities π and transition probabilities A. In an HMM, additional emission probabilities φ are necessary. The initial probabilities are a list of probabilities of observing any of the L possible symbols at the first position x_1 of a sequence X in the case of a simple MM, or of starting in any of the K hidden states (z_1) in an HMM. In either case, the sum of all the values in the list always has to be 1. The transition probabilities form a table of size either LxL or KxK, where each value A_{i,j} is the probability of a transition from state i to state j. Similarly, all values in each row have to sum to 1. The emission probabilities form a table of size KxL, where each value φ_{k,l} is the probability of emitting observable symbol l while being in hidden state k.

The joint probability in a simple MM is simply the product of the probability of observing the first state and the probabilities of the transitions between all remaining states in the sequence, as given by Eq. 2.1.

JP = π_{x_1} · ∏_{n=2}^{N} A_{x_{n-1}, x_n}    (2.1)

In an HMM the joint probability is calculated in a similar manner but is slightly more complicated because of the emission probabilities. The first part of the product is the product of the initial probability of starting in hidden state z_1 and the probability of emitting observable symbol x_1 while being in state z_1. The second part is the product of the probability of a transition from state z_{n-1} to state z_n and the probability of emitting symbol x_n, taken over the remaining N-1 positions of the sequence (Eq. 2.2).

JP = π_{z_1} φ_{z_1, x_1} · ∏_{n=2}^{N} A_{z_{n-1}, z_n} φ_{z_n, x_n}    (2.2)
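
As a concrete illustration of Eq. 2.2, the following minimal Python sketch (not part of the thesis code; the two-state toy parameters are invented) evaluates the joint probability of a hidden-state path and an observed DNA sequence.

    # Minimal sketch: joint probability of a hidden-state path and an observed
    # sequence in an HMM (Eq. 2.2). All numbers below are made-up toy values.
    init = {"N": 0.9, "C": 0.1}                      # pi: initial probabilities
    trans = {"N": {"N": 0.95, "C": 0.05},            # A: transition probabilities
             "C": {"N": 0.10, "C": 0.90}}
    emit = {"N": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},   # phi: emissions
            "C": {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20}}

    def joint_probability(Z, X):
        """P(Z, X) = pi[z1] * phi[z1][x1] * prod_n A[z_{n-1}][z_n] * phi[z_n][x_n]."""
        p = init[Z[0]] * emit[Z[0]][X[0]]
        for n in range(1, len(X)):
            p *= trans[Z[n - 1]][Z[n]] * emit[Z[n]][X[n]]
        return p

    print(joint_probability("NNCC", "ATGC"))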

2.2 Viterbi decoding

The Viterbi decoding algorithm can be used to find the most plausible explanation Z of a sequence of observables X in an HMM, that is, the most probable sequence of hidden states given a sequence of observable states. In order to explain this algorithm, some basic notation and definitions must be introduced. Let X be a sequence of observable states of length N, where x[n] is the symbol in X at position n. There are L possible observable states. Let Z be a sequence of hidden states of the same length as X, where z[n] is the hidden state at position n. There are K possible hidden states. The initial probabilities π form a vector of length K, where π[k] is the probability of starting a sequence in hidden state k. The transition probabilities A form a matrix of size KxK, where A[i][j] is the probability of a transition from hidden state i to state j. The emission probabilities φ form a table of size KxL, where φ[k][l] is the probability of emitting symbol l while being in hidden state k.

The main part of this algorithm is building a KxN omega table ω, where K is the number of possible hidden states and N is the length of the sequence. The first column is filled out as the product of the initial probability π[k] and the probability of emitting the symbol at the first position x[1] while being in state k, which can be written as φ[k][x[1]]. Each cell in the first column can be calculated according to Eq. 2.3.

ω[k][1] = π[k] · φ[k][x[1]],    for k = 1, ..., K    (2.3)

Consecutive columns are computed in the following manner. First, the maximum product of a value in a cell of the previous column and the transition probability A from that state to the current state has to be found. Then, this value has to be multiplied by an emission probability φ. This means that when computing values in column n, we need to find the maximum product of a value in column n-1, the transition probability from that particular hidden state at position n-1 to the hidden state at position n, and the emission probability of observing the symbol x[n]. In particular, calculating the value ω[j][n] means finding the maximum product ω[i][n-1] · A[i][j] over all possible i, remembering there are K possible hidden states, and multiplying it by the emission probability of observing symbol x[n] while being in hidden state j, which is φ[j][x[n]]. The formula for calculating the values in all cells of the remaining columns, from the second one to N, is given in Eq. 2.4.

ω[j][n] = max_{i=1,...,K} ( ω[i][n-1] · A[i][j] ) · φ[j][x[n]],    for j = 1, ..., K and n = 2, ..., N    (2.4)

Using memoization, computing the ω table takes time O(NK^2) and space O(NK). Given the omega table, finding the most plausible explanation Z of X can be done by backtracking. We start with the last column N and find the j which corresponds to the maximum value in that column. That state j is the last hidden state z[N] (Eq. 2.5).

z[N] = argmax_{j=1,...,K} ω[j][N]    (2.5)

All remaining hidden states, from position N-1 down to 1, can be found according to the formula in Eq. 2.6. The hidden state at position n is state i if the product of the value in cell ω[i][n] and the transition probability A from state i to state z[n+1] is maximal over all values in column n for i = 1 to i = K.

z[n] = argmax_{i=1,...,K} ( ω[i][n] · A[i][z[n+1]] ),    for n = N-1, ..., 1    (2.6)

Backtracking takes time O(NK) and space O(NK) using the previously computed ω table.

2.3 HMM extensions and modifications

In this section the extensions and modifications of the HMM that I have implemented (3.1) and experimented with (4.3) are described. These are emitting a variable number of symbols and working in log-space. A DNA sequence can be represented as a sequence of 4 symbols. However, protein coding parts are better represented as a sequence of triplets of those 4 symbols. Each triplet, called a codon, encodes one amino acid in the translated protein.

Therefore, it makes sense to model different parts of the DNA according to their function. In an HMM, that requires emitting a variable number of symbols depending on the hidden state. For instance, a hidden state that encodes a non-coding region would emit only one symbol, while a hidden state encoding a protein coding part of the DNA would emit 3 symbols at a time. This requires an additional list d, of length equal to the number of hidden states K, containing the number of symbols each state emits. The numbers in the d list have to be integers greater than 0. The previous formula for computing the ω table (Eq. 2.4) has to be slightly modified in order to compute the ω table for an HMM in which a variable number of emitted symbols is allowed (Eq. 2.7).

ω[j][n] = max_{i=1,...,K} ( ω[i][n-d[j]] · A[i][j] ) · φ[j][x[n-d[j]+1]],    for j = 1, ..., K and n = 2, ..., N    (2.7)

When implementing this modified version of the Viterbi algorithm, one has to remember that a variable number of symbols can be emitted, which is not explicitly noted in the above formula: only the position of the first symbol of the emitted group is indicated. Also, the size of the emission probabilities table φ is much bigger, since every combination of symbols has to be addressed. While we only had 4 symbols in the standard HMM, when we allow emission of 3 symbols we have to add an additional 64 codon symbols (4^3). The emission probabilities table grows from Kx4 to Kx68.

Values in the ω table can get very close to zero. Continuous multiplication of very small numbers can lead to underflow by exceeding the precision of floating point numbers. A double precision number can range from approximately 10^-308 to 10^308, which would only be sufficient to handle sequences of length around 500. That is definitely not enough to work with whole gene containing DNA sequences. The solution to this problem is working with log values, since log(max f) = max(log f). When working with the Viterbi algorithm in log-space, multiplications have to be substituted by additions, and computing the values in the ω̂ table in log-space can be done according to the formulas in Eq. 2.8 and Eq. 2.9.

ω̂[k][1] = log π[k] + log φ[k][x[1]],    for k = 1, ..., K    (2.8)

ω̂[j][n] = max_{i=1,...,K} ( ω̂[i][n-1] + log A[i][j] ) + log φ[j][x[n]],    for j = 1, ..., K and n = 2, ..., N    (2.9)

2.4 Training by counting

In order to use Viterbi decoding, we need to have a set of parameters: initial π, transition A and emission φ probabilities. A simple yet very effective method is training by counting. It requires training data and simply counts occurrences of events.
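
As a minimal, self-contained sketch of this idea (the tiny annotated training pair is invented for illustration; the real training programs are the Java implementations from Chapter 3), training by counting can be written as follows, with the details of each counting step described in the following paragraphs.

    # Minimal sketch of training by counting for a standard HMM (Section 2.4).
    # The tiny training pair below is invented purely for illustration.
    from collections import defaultdict

    def train_by_counting(hidden_seqs, observed_seqs):
        """Estimate pi, A and phi by counting events in annotated training data."""
        init_counts = defaultdict(int)
        trans_counts = defaultdict(lambda: defaultdict(int))
        emit_counts = defaultdict(lambda: defaultdict(int))
        for Z, X in zip(hidden_seqs, observed_seqs):
            init_counts[Z[0]] += 1
            for n, (z, x) in enumerate(zip(Z, X)):
                emit_counts[z][x] += 1
                if n > 0:
                    trans_counts[Z[n - 1]][z] += 1
        # normalize counts into probabilities
        total = sum(init_counts.values())
        pi = {k: c / total for k, c in init_counts.items()}
        A = {i: {j: c / sum(row.values()) for j, c in row.items()}
             for i, row in trans_counts.items()}
        phi = {k: {l: c / sum(row.values()) for l, c in row.items()}
               for k, row in emit_counts.items()}
        return pi, A, phi

    pi, A, phi = train_by_counting(["NNNCCCNN"], ["ATGCGTAA"])
    print(pi, A, phi)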

The initial probabilities can be found by counting how many times the sequences in a training data set start in each of the K hidden states. Each count is then divided by the total number of sequences processed, giving the probabilities of starting in any of the K states. All probabilities in π have to sum to 1, and the value at π[k] is the probability of starting a sequence in hidden state k.

The transition probabilities are computed in a similar manner. Each row in A contains the probabilities of transitions from a particular state k to every hidden state. Therefore, every transition from state k to any hidden state, including state k itself, has to be counted separately. After counting, each value has to be divided by the total number of occurrences of state k. All rows in A are computed in this way, and the value at A[i][j] is the probability of jumping from state i to state j.

In the emission probabilities table φ, each row k contains the probabilities of observing each possible symbol while being in hidden state k. As in the case above, for each hidden state k the occurrences of all observable symbols have to be counted and then divided by the total number of times state k appears in the training data set. Then, for example, the value at φ[k][l] is the probability of observing symbol l while being in hidden state k.

2.5 Existing programs

Many gene predicting programs have already been developed. Therefore, I have chosen two that perform very well in order to compare the results obtained with my models against them. They were also chosen because they are based on the Hidden Markov Model.

2.5.1 Genscan

Genscan is a program that, given a DNA sequence, finds a complete exon/intron structure that could possibly be encoded in it. The model applied in the program is described in detail in an article written by Burge and Karlin [2]. A web version of Genscan can be tested online [3]. However, only one DNA sequence of up to 1 million base pairs in length can be accepted as input at a time. Therefore, requesting a local copy of the program is suggested in order to process longer or many sequences. Genscan can not only predict a single gene in a sequence, but is also able to identify multiple as well as partial genes. Gene structure can be found on both DNA strands; however, overlapping genes, which are very rare, cause problems in correct predictions.

2.5.2 Augustus

Augustus is another program using a Hidden Markov Model for gene structure prediction. It can incorporate user defined constraints into the prediction process to improve prediction accuracy. As a constraint, the user can specify the position of a known complete exon or part of an exon, a part of an intron, or only a splice site, translation initiation site or stop codon.

A detailed description of the model used in Augustus can be found in an article [4] written by its authors, Mario Stanke and Burkhard Morgenstern. Augustus can be used through a web interface [5] or downloaded [6] and installed locally. In this project I have downloaded and installed augustus-3.3, which was later used in the validation experiments described in Chapter 4.
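
Before moving on to the implementation, the recursions of this chapter can be summarized in a short sketch. This is an illustrative log-space Viterbi for a standard single-symbol HMM (Eqs. 2.8 and 2.9) with invented toy parameters; it stores backpointers instead of recomputing Eq. 2.6 and is not the VarVitHMM program described in Chapter 3.

    # Minimal log-space Viterbi sketch for a standard HMM (Eqs. 2.8 and 2.9),
    # with toy parameters that contain no zero probabilities.
    import math

    def viterbi_log(X, states, pi, A, phi):
        K, N = len(states), len(X)
        omega = [[-math.inf] * N for _ in range(K)]
        back = [[0] * N for _ in range(K)]
        for k in range(K):                                   # first column (Eq. 2.8)
            omega[k][0] = math.log(pi[k]) + math.log(phi[k][X[0]])
        for n in range(1, N):                                # remaining columns (Eq. 2.9)
            for j in range(K):
                best_i = max(range(K), key=lambda i: omega[i][n - 1] + math.log(A[i][j]))
                back[j][n] = best_i
                omega[j][n] = (omega[best_i][n - 1] + math.log(A[best_i][j])
                               + math.log(phi[j][X[n]]))
        # backtracking via stored backpointers
        z = [max(range(K), key=lambda j: omega[j][N - 1])]
        for n in range(N - 1, 0, -1):
            z.append(back[z[-1]][n])
        return "".join(states[k] for k in reversed(z))

    # toy two-state model over the DNA alphabet (all numbers are made up)
    states = ["N", "C"]
    pi = [0.9, 0.1]
    A = [[0.95, 0.05], [0.10, 0.90]]
    phi = [{"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
           {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}]
    print(viterbi_log("ATGCGC", states, pi, A, phi))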

Chapter 3

Implementations

Programs and scripts developed in this project are written in Java, Python or Bash. I used Java to implement the Viterbi algorithm and to write the programs that train all models. Python was used to write programs and scripts that manipulate data, automate the validation process and evaluate results. All programs and scripts are available online in the form of a zip file [1]. In the first section (3.1) my implementation of the Viterbi algorithm is described, as well as experiments testing its running time and correctness. The next two sections contain information about the implemented training programs (3.2) and other programs (3.3) I have written and used in this project.

3.1 Viterbi

In this section my implementation of a Viterbi algorithm that can emit a variable number of symbols, as described in sections 2.2 and 2.3, is presented. The developed program VarVitHMM was later used to conduct the experiments described in Chapter 4. I have chosen Java to implement the algorithm as a compromise between running time and ease of development and usage. The program can be used on multiple platforms. Dynamic programming and memoization were used to fill out the ω table iteratively from left to right. Logs of the initial, transition and emission probabilities were computed and stored while reading the HMM parameters from a file. log 0 is substituted by the negative of the maximum value of the Double type. The source code for computing the ω table is presented in Listing 3.1.

Listing 3.1: Source code for computing ω table in VarVitHMM

    char[] X = seqs.get(m).toCharArray();
    N = X.length;
    long startTimeOmega = System.currentTimeMillis();
    double[][] omega = new double[K][N];
    // fill out first column of omega table
    for (int k = 0; k < K; k++) {
        omega[k][0] = init[k] + em[k][obsMap.get(String.valueOf(X[0]))];
    }
    // second column
    for (int k = 0; k < K; k++) {
        if (varD[k] == 1) {
            omega[k][1] = getMaxOmega(omega, 0, k)
                    + em[k][obsMap.get(String.valueOf(X[1]))];
        } else {
            // states emitting more than one symbol copy the previous column here
            omega[k][1] = omega[k][0];
        }
    }
    // third column
    for (int k = 0; k < K; k++) {
        if (varD[k] == 1) {
            omega[k][2] = getMaxOmega(omega, 1, k)
                    + em[k][obsMap.get(String.valueOf(X[2]))];
        } else if (varD[k] == 2) {
            omega[k][2] = getMaxOmega(omega, 0, k)
                    + em[k][obsMap.get(String.valueOf(X[1]) + String.valueOf(X[2]))];
        } else {
            omega[k][2] = omega[k][1];
        }
    }
    // remaining columns
    for (int n = 3; n < N; n++) {
        for (int k = 0; k < K; k++) {
            if (varD[k] == 1) {
                omega[k][n] = getMaxOmega(omega, n - 1, k)
                        + em[k][obsMap.get(String.valueOf(X[n]))];
            } else {
                // build the 2- or 3-symbol string ending at position n
                String symbol = "";
                for (int d = varD[k] - 1; d >= 0; d--) {
                    symbol += String.valueOf(X[n - d]);
                }
                omega[k][n] = getMaxOmega(omega, n - varD[k], k)
                        + em[k][obsMap.get(symbol)];
            }
        }
    }

Backtracking was done by recomputing the maximum sum of the previous column and the log value of the transition probability. In case of a tie, i.e. more than one maximum value, the first value from the top of the table is chosen. The source code for backtracking is shown in Listing 3.2.

Listing 3.2: Source code for backtracking in VarVitHMM

    char[] Z = new char[N];
    // find last hidden state
    int nextState = getMaxOmegaIdx(omega);
    double omegaLogJP = omega[nextState][N - 1];
    int end = N - 2;
    if (varD[nextState] == 1) {        // if last state emits 1 symbol
        Z[N - 1] = hid[nextState];
    } else if (varD[nextState] == 2) { // if last state emits 2 symbols
        Z[N - 1] = hid[nextState];
        Z[N - 2] = hid[nextState];
        end = N - 3;
    } else {                           // if last state emits 3 symbols
        Z[N - 1] = hid[nextState];
        Z[N - 2] = hid[nextState];
        Z[N - 3] = hid[nextState];
        end = N - 4;
    }
    // find the rest of the hidden states
    for (int n = end; n >= 2; n--) {
        nextState = getMaxIdx(omega, nextState, n);
        if (varD[nextState] == 1) {
            Z[n] = hid[nextState];
        } else if (varD[nextState] == 2) {
            Z[n] = hid[nextState];
            Z[n - 1] = hid[nextState];
            n -= 1;
        } else {
            Z[n] = hid[nextState];
            Z[n - 1] = hid[nextState];
            Z[n - 2] = hid[nextState];
            n -= 2;
        }
    }
    // find first 2 hidden states
    nextState = getMaxIdx(omega, nextState, 1);
    if (varD[nextState] == 1) {
        Z[1] = hid[nextState];
        nextState = getMaxIdx(omega, nextState, 0);
        if (varD[nextState] == 1) {
            Z[0] = hid[nextState];
        }
    } else if (varD[nextState] == 2) {
        Z[1] = hid[nextState];
        Z[0] = hid[nextState];
    }

VarVitHMM is a command line program that requires 2 files to be executed: a file with parameters describing the HMM and an input file with a DNA sequence or sequences in Fasta format. The names of those files are 2 of the 3 required parameters. The third parameter is the name of the output file that will contain the gene structure, also in Fasta format, of the sequences decoded from the input file. VarVitHMM can be run as:

    java -jar VarVitHMM HMMparameters.mat inpseq.fa outdecoding.fa

As previously mentioned, the HMM parameters are read from a file and are required by the program to execute. All lines starting with # are treated as comment lines and are ignored. As a minimum, the file has to contain:

- a line that starts with Init: followed by a line with K initial probabilities separated with a space
- a line that starts with Transitions: followed by K lines, each with K transition probabilities separated with a space
- a line that starts with Emissions: followed by K lines, each with L emission probabilities separated with a space
- a line that starts with hidden: followed by K characters separated with a space that represent the hidden states in the output sequence
- a line that starts with observables: followed by L characters separated with a space that represent all characters allowed in an input sequence
- a line that starts with variable followed by K integers greater than 0 and smaller than or equal to 3, separated with a space, that represent how many characters are emitted from each of the hidden states

An example of a file with 3-state HMM parameters is shown in Listing 3.3 (the numeric probability values are omitted here).

Listing 3.3: Example of HMM parameters for VarVitHMM

    # HMM model with 3 states, train by count (Java),
    # no set 0, validation set Genie
    Init:
    ...
    Transitions:
    ...
    Emissions:
    ...
    hidden: N C N
    observables: A C G T
    variable number of symbols: ...

An input file with DNA sequences has to be in Fasta format. A single sequence as well as multiple sequences are allowed per file. However, the program requires that sequences are strings over only 4 symbols: A, C, G and T. Therefore, any other character found in the input sequences is substituted with one of the 4 characters. In particular, each character that is not allowed but can be found in a DNA sequence represented in Fasta format is substituted as follows:

- R, meaning purine, is substituted randomly by A or G
- Y, meaning pyrimidine, is substituted randomly by C or T
- K, meaning a keto base, is substituted randomly by G or T
- M, meaning a base with an amino group, is substituted randomly by A or C
- S, meaning strong interaction, is substituted randomly by C or G
- W, meaning weak interaction, is substituted randomly by A or T
- B, meaning not A, is substituted randomly by C, G or T
- D, meaning not C, is substituted randomly by A, G or T
- H, meaning not G, is substituted randomly by A, C or T
- V, meaning not T, is substituted randomly by A, C or G
- N, meaning any base, is substituted randomly by A, C, G or T

At first, introducing these substitutions might appear to be an unnecessary manipulation of the data. However, it can also be seen as a simple way of simulating random mutations.
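
A minimal sketch of this substitution step could look as follows (this is an illustration, not the Java code used in VarVitHMM).

    # Minimal sketch of the ambiguity-code substitution described above
    # (illustrative only; VarVitHMM performs the equivalent step in Java).
    import random

    IUPAC = {
        "R": "AG", "Y": "CT", "K": "GT", "M": "AC",
        "S": "CG", "W": "AT", "B": "CGT", "D": "AGT",
        "H": "ACT", "V": "ACG", "N": "ACGT",
    }

    def clean_sequence(seq, rng=random):
        """Replace every ambiguous IUPAC character by a random compatible base."""
        return "".join(c if c in "ACGT" else rng.choice(IUPAC.get(c, "ACGT"))
                       for c in seq.upper())

    print(clean_sequence("ACGTNRYSWatgc"))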

An output file is in a format similar to the input file, with all of its sequences; however, each sequence is represented as a string over an alphabet of 2 letters that correspond to the nucleotides of the input sequence:

- N - non-coding
- C - coding

Since the running time of this algorithm is O(NK^2), I have made a small experiment to confirm the correctness of my implementation with regard to execution time. Specifically, I confirm the time of filling the ω table, which is O(NK^2), and the time of backtracking, O(NK), separately. I have also tested the total execution time of the whole program. This is done using the System.currentTimeMillis() and System.nanoTime() methods for the ω table and backtracking respectively. I have used the command time in Linux to measure the running time of the whole program. In order to ensure uninterrupted execution of the tested program, I have used a Linux system booted with the isolcpus=3,7 kernel parameter so that one of the 4 cores could be exclusively available for the experiment. Testing was automated with a bash script where the tested program was run with taskset -c 7 to make sure it was executed using only one thread of an isolated CPU core.

I have split this experiment into 2 parts. The first one confirms a linear relationship between the sequence length N and the running time. In the second part, I confirm that the running time of my Viterbi implementation depends on the number of hidden states K, and that the increase in running time is proportional to the squared number of hidden states K^2 in the case of filling the ω table, or proportional to the number of hidden states K in the case of backtracking.

In the first part, I have generated a data set with 19 completely random sequences of variable length consisting of the four letters A, C, G and T. The lengths of the sequences are 1kB to 10kB with an increase of 1kB and 10kB to 100kB with an increase of 10kB. For each sequence the program was executed 10 times to ensure repeatability of the results. Then the running times for filling out the ω table (Fig. 3.1), backtracking (Fig. 3.2) and the complete VarVitHMM program (Fig. 3.3) were divided by the sequence length N and plotted. As can be seen in the plots, the obtained values converge to a constant, confirming a linear relationship between the sequence length and the running time of my implementation.

In the second part, I started by selecting some of the models with an increasing number of hidden states K that I developed in this project. That includes 7 HMMs with a number of hidden states from K = 3 to K = 34. The test sequence was the previously used random one of length 100kB. As in the first part of this experiment, the program was executed ten times with each model. Then, the running time was divided by the squared number of hidden states K^2 (filling the ω table and the complete program) or by the number of hidden states K (backtracking) and plotted. Later, however, I decided to extend this part of the experiment to confirm the theoretical running time more clearly. I generated 10 random models with a number of hidden states from K = 10 to K = 100 with an increase of 10. The experiment was then repeated with those ten models, and the results, confirming the theoretical running time of filling the ω table (Fig. 3.4), backtracking (Fig. 3.5) and the complete program (Fig. 3.6), are shown in the plots, where the obtained values converge to a constant with an increasing number of hidden states K.

Figure 3.1: Running time of filling out the ω table in my implementation of the Viterbi algorithm with respect to increasing sequence length N (grey - measured values; red - mean of 10 measurements)

Figure 3.2: Running time of backtracking in my implementation of the Viterbi algorithm with respect to increasing sequence length N (grey - measured values; red - mean of 10 measurements)

Figure 3.3: Running time of the complete program VarVitHMM with respect to increasing sequence length N (grey - measured values; red - mean of 10 measurements)

The plots with results from the experiment with 7 models are presented in the Appendix in Fig. A.1, Fig. A.2 and Fig. A.3.

The memory consumption O(NK) of the Viterbi algorithm is mostly determined by the size of the ω table; therefore its increase should be linear in the length of the sequence N as well as in the number of hidden states K. An experiment confirming the correctness of my implementation with respect to memory consumption was performed using the command time in Linux. The memory usage of the whole VarVitHMM program was measured. First, the program was run with sequences of different lengths as described before, and the memory consumption divided by the sequence length N was plotted (Fig. 3.7). As can be clearly seen, with increasing sequence length the values converge to a constant, confirming a linear relationship between the size of the sequence and memory consumption. In the next step, I tried to show a linear relationship between the number of hidden states K and memory consumption as well. As in the running time part, I started the experiment with some of the models with different numbers of hidden states developed in this project, and the resulting plots are presented in Fig. A.4. Then, I repeated the experiment with 10 randomly generated HMM models with a number of hidden states from K = 10 to K = 100 and a random sequence of length 100kB. The obtained memory consumption was then divided by the number of hidden states K and plotted (Fig. 3.8). In this plot too, the values converge to a constant as the number of hidden states increases, confirming a linear relationship between the number of hidden states K and memory consumption.

Figure 3.4: Running time of filling out the ω table in my implementation of the Viterbi algorithm with respect to increasing number of hidden states K (grey - measured values; red - mean of 10 measurements)

Figure 3.5: Running time of backtracking in my implementation of the Viterbi algorithm with respect to increasing number of hidden states K (grey - measured values; red - mean of 10 measurements)

Figure 3.6: Running time of the complete program VarVitHMM with respect to increasing number of hidden states K (grey - measured values; red - mean of 10 measurements)

Figure 3.7: Memory consumption of VarVitHMM with respect to increasing sequence size N; K = 3, N = 1kB to N = 100kB

Figure 3.8: Memory consumption of VarVitHMM with respect to increasing number of hidden states K; N = 100kB, K = 10 to K = 100

Even though the above experiments confirm the theoretical running time and memory consumption of the Viterbi algorithm, suggesting correctness of the implementation, I have made another experiment that tests the actual results. I have made an artificial 2-state model with only 2 symbols, A and G, as in Fig. 3.9. In this HMM each A should be decoded as hidden state N and each G as C. In order to confirm the correctness of the results I have used 6 sequences of length 100 symbols each:

- all A's
- all G's
- half A's followed by half G's
- half G's followed by half A's
- repeated AG
- repeated GA

The obtained results were as predicted, and the files with the HMM parameters (testhmm2st.mat), test sequences (testsequences.fa) and resulting annotations (testsequences.ann) can be found in the project's subfolder testvarvithmm/ [1].

Figure 3.9: Transition diagram of the HMM with 2 hidden states used to test the correctness of the implementation of the Viterbi algorithm
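
A minimal sketch of how those six test sequences could be generated and written in Fasta format is given below (the record names are illustrative and not necessarily those used in testsequences.fa).

    # Illustrative generation of the six 100-symbol test sequences described above.
    tests = {
        "all_A": "A" * 100,
        "all_G": "G" * 100,
        "A_then_G": "A" * 50 + "G" * 50,
        "G_then_A": "G" * 50 + "A" * 50,
        "repeated_AG": "AG" * 50,
        "repeated_GA": "GA" * 50,
    }

    with open("testsequences.fa", "w") as fh:
        for name, seq in tests.items():
            fh.write(f">{name}\n{seq}\n")

    # Expected decoding under the 2-state test model: every A maps to N, every G to C.
    expected = {name: seq.replace("A", "N").replace("G", "C") for name, seq in tests.items()}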

3.2 Training

The HMMs I have developed in this project were trained using training by counting. For each model, I have made a Java program counting all possible transitions and emissions for that specific HMM. I have decided to save those counts instead of probabilities. The purpose of that was to allow adding more training data whenever necessary or possible. All training programs require 3 parameters:

- a file with parameters containing the lists of hidden states, observables and the number of symbols each state emits
- a fasta file with DNA sequences
- a fasta file with the gene structure encoded by 6 characters:
  N - non-coding nucleotide
  S - nucleotide being a part of a single exon gene
  B - nucleotide being a part of the first exon of a multi exon gene
  M - nucleotide being a part of a middle exon of a multi exon gene
  E - nucleotide being a part of the last exon of a multi exon gene
  I - nucleotide being a part of an intron

Then, for each model, I have written a Python script that converts the counts obtained from the Java training program into a file with HMM parameters required by VarVitHMM, as for instance in the example presented in Listing 3.3.
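
As a hedged illustration of what such a conversion script does, the counts are simply normalized row by row into probabilities. The in-memory count values below are invented; the actual scripts in Scripts/v1/ read the counts from files.

    # Minimal sketch of turning raw counts into HMM parameters (Section 3.2).
    def normalize(row):
        """Turn a list of counts into probabilities summing to 1."""
        total = sum(row)
        return [c / total if total else 0.0 for c in row]

    init_counts = [120, 15, 0]                      # illustrative starts in N, C, I
    trans_counts = [[950, 30, 0], [25, 800, 40], [0, 35, 600]]
    emit_counts = [[300, 200, 200, 300], [150, 350, 350, 150], [280, 220, 220, 280]]

    pi = normalize(init_counts)
    A = [normalize(row) for row in trans_counts]
    phi = [normalize(row) for row in emit_counts]

    print(pi)
    print(A)
    print(phi)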

3.3 Other

In this section I describe other programs and scripts I have written while doing this project. They can all be found in the subfolder Scripts/.

gb_to_6states.py is a Python program that extracts the gene structure annotation required by my training programs from a file in GenBank format. It requires 2 parameters: the name of the GenBank file and the name of the output file. It can be run as, for instance: python gb_to_6states.py GenBankFile.gb outannotation.ann.

gb_to_fasta.py is a Python program that converts a file in GenBank format to fasta format. It requires 2 parameters: the name of the GenBank file as input and the name of the output. If the program is run as python gb_to_fasta.py human.gb human_from_gb, two output files will be generated: a fasta file human_from_gb.fa with the DNA sequences and human_from_gb_states.fa with the gene structure annotations encoded with the 2 states N (non-coding) and C (coding).

split_dataset.py is a Python program that splits a data set with DNA sequences in a fasta file into subsets in separate files, according to a list in a file describing the partitioning. The list of subsets has to contain sequence names corresponding to the ones in the fasta file, in the format shown in Listing 3.4. The program requires 3 parameters: the name of the file with subsets, the name of the input file with DNA sequences in fasta format, and the name of the output file. It can be run as, for instance:

    python split_dataset.py humandna.sets humandna_GB.dat humandna

There will be as many output files created as there are subsets in the humandna.sets file. The output files will get names of the form humandna_0.fa. The program can be used to split files containing DNA sequences as well as gene structure annotations, as long as they are in fasta format.

conv_aug_ann.py is a Python program that converts results from Augustus into a file with annotations in fasta format. Those annotations have the lengths of the input DNA sequences and consist of two symbols, N and C, corresponding to non-coding and coding nucleotides respectively. The program requires the name of a file with results from the Augustus program and the name of the output file and can be run as, for instance: python conv_aug_ann.py augustus.res augustus.ann.

genscan_mult_seq.py is a Python program that is useful for running Genscan with a file that contains multiple DNA sequences. Since Genscan can only handle one sequence per file, and concatenates all sequences of an input file into one long sequence if there is more than one, I have written this Python script to simplify and automate my experiments. The program can be executed only in Linux with the Genscan program installed and its parameters located in /usr/lib/genscan/humaniso.smat. The program requires two parameters: the name of an input file with DNA sequences in fasta format and the name of the output file that will contain the resulting annotations. The results from Genscan are automatically converted into annotations as sequences of the two symbols N and C, corresponding to non-coding and coding nucleotides respectively. It can be run as, for instance:

    python genscan_mult_seq.py human.fa human.ann

The output of this program is in the same format as the annotations from the gb_to_fasta.py program, the results from VarVitHMM, and the input annotations required by the eval_pred.py program, so that results from different programs and the true annotations can be compared.

eval_pred.py is a Python program that is used to evaluate predicted annotations by comparing them with the true ones and calculating the various measures described in the next chapter (4.1). It requires 3 parameters: the name of a file with true annotations, the name of a file with predicted annotations, and the name of an output file. The files with annotations have to be in fasta format, with sequences encoded by only the 2 symbols N and C corresponding to non-coding and coding nucleotides respectively. The sequences in both input files do not have to be in the same order; however, their names have to match. The output file is in .csv format and contains all calculated evaluation measures. If the file does not exist, a new one is created; otherwise, the results are appended at the end of the file.

Listing 3.4: Example of a file with a list of data set partitioning

    PART 0:
    HSCOMT2
    HUMCYC1A
    PART 1:
    HSU37106
    HSGTRH
    PART 2:
    HUMGALK1A
    HSU01102
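
A minimal sketch of parsing a partition file in the format of Listing 3.4 could look as follows (this is an illustration; split_dataset.py itself may parse the file differently).

    # Minimal sketch of reading a PART-style partition listing (Listing 3.4).
    def read_partitions(path):
        """Return {part name: [sequence names]} from a PART-style listing."""
        parts, current = {}, None
        with open(path) as fh:
            for line in fh:
                line = line.strip()
                if not line:
                    continue
                if line.upper().startswith("PART"):
                    current = line.rstrip(":").strip()
                    parts[current] = []
                elif current is not None:
                    # names may be one per line or space separated
                    parts[current].extend(line.split())
        return parts

    # parts = read_partitions("humandna.sets")
    # print(parts)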

Chapter 4

Experiments

In this chapter I present the experiments conducted with various models of gene structure in DNA sequences. In the first section (4.1) I describe the different accuracy measures used to evaluate the results of gene structure predictions. Next, I present the Genie data set used in this project (4.2). In the last part of this chapter (4.3) the validation process is described.

4.1 Prediction accuracy measures

The prediction accuracy measures used in this project are described in detail in the article by Burset and Guigo [7]. The results obtained during the experiments were compared with known gene annotations at both the nucleotide and the exon level using the previously mentioned program eval_pred.py (3.3), which is able to calculate all measures presented in this section. Since the annotations of gene structure used in this project are represented with 2 symbols, there are 4 possible relationships between predictions and reality:

- true positive (TP), when a nucleotide is correctly predicted as coding
- false positive (FP), when a nucleotide is incorrectly predicted as coding
- false negative (FN), when a nucleotide is incorrectly predicted as non-coding
- true negative (TN), when a nucleotide is correctly predicted as non-coding

At the nucleotide level, 4 prediction accuracy measures were calculated based on the above relationships:

Sensitivity (Sn) is the proportion of coding nucleotides that have been correctly predicted as coding and can be computed according to the formula in Eq. 4.1.

Sn = TP / (TP + FN)    (4.1)

Specificity (Sp) is the proportion of predicted coding nucleotides that are truly coding and can be computed using the formula in Eq. 4.2.

Sp = TP / (TP + FP)    (4.2)

Normally, specificity would be calculated as the proportion of non-coding nucleotides that have been correctly predicted as non-coding, according to the formula in Eq. 4.3. However, since the number of non-coding nucleotides is much greater than the number of coding nucleotides in eukaryotic DNA sequences, values of Sp calculated this way tend to be very large and non-informative, because the number of TN will usually be much bigger than the number of FP.

Sp = TN / (TN + FP)    (4.3)

The Correlation Coefficient (CC) has traditionally been preferred in the gene structure prediction literature and can be calculated according to the formula in Eq. 4.4. However, CC is not defined when one of the factors of the product in the denominator is equal to 0, which happens when either the predicted or the true annotation does not contain both coding and non-coding nucleotides.

CC = (TP · TN - FN · FP) / sqrt( (TP + FN)(TN + FP)(TP + FP)(TN + FN) )    (4.4)

The Approximation Correlation (AC) is a measure that approximates CC and can be calculated in any circumstances. It is based on the Average Conditional Probability (ACP), which is computed according to the formula in Eq. 4.5.

ACP = 1/4 · [ TP/(TP + FN) + TP/(TP + FP) + TN/(TN + FP) + TN/(TN + FN) ]    (4.5)

ACP is a probability that ranges from 0 to 1, whereas AC is its transformation, computed according to the formula in Eq. 4.6, which ranges from -1 to 1. Therefore, it can be compared to CC.

AC = (ACP - 0.5) · 2    (4.6)

At the exon level, 5 different types of exons were identified as measures of prediction accuracy:

- CR - correct exon, where both ends of the predicted exon match the true annotation;
- PC - partially correct exon, where only one end is matched correctly;
- OL - overlapping exon, where some nucleotides of the predicted exon match but neither of the ends do;

- WE - wrong exon, where all nucleotides of the predicted exon are non-coding in the true annotation;
- ME - missed exon, where all nucleotides of a true exon are predicted as non-coding.

Moreover, the following measures were used to evaluate the results at the exon level:

Exon Level Sensitivity (ESn) is the fraction of true exons that are correctly predicted (Eq. 4.7).

ESn = (# of correctly predicted exons) / (# of true exons)    (4.7)

Exon Level Specificity (ESp) is the fraction of predicted exons that are correctly predicted (Eq. 4.8).

ESp = (# of correctly predicted exons) / (# of predicted exons)    (4.8)

The average of ESn and ESp (EAvg) is calculated according to the formula in Eq. 4.9.

EAvg = (ESn + ESp) / 2    (4.9)

The last of the measures, called Missed Genes (MG), is the number of genes that were not predicted even partially. A missed gene occurs when a whole sequence that truly contains a protein coding gene is predicted as non-coding.

4.2 Data set

The data set called Genie that was used in this project was downloaded from the Berkeley Drosophila Genome Project website [8]. It is a gene finding data set containing a total of 793 unrelated human genes. This data set was used to train the Genie gene finding system (WWW-access) developed at LBNL and UC Santa Cruz. The last update was done in March 1998 using GenBank v.105. Sequences with multiple exon genes and single exon genes are separated. Together with the data set, a list with a suggested partitioning into 7 validation subsets is available and was used in the validation experiment in this project.

The Genie data set consists of 2 files in GenBank format, multi_exon_gb.dat and single_exon_gb.dat, containing sequences with multiple exon genes and single exon genes respectively. In addition, 2 files with the suggested validation partitioning (multi_exon_gb.sets and single_exon_gb.sets) are available. Each sequence in this data set contains one and only one protein coding gene. Since the Genie data set files are in GenBank format and the training and prediction programs developed in this project require input files in fasta format, I have converted the data files using the 2 Python programs gb_to_6states.py and gb_to_fasta.py described in the previous chapter (3.3). Also, the sequences in each file were split into 7 validation subsets with the help of the split_dataset.py Python script (3.3), making the data set ready for the validation experiments.
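
To make the nucleotide-level measures from Section 4.1 concrete, the following minimal sketch computes Sn, Sp, CC and AC from a true and a predicted annotation given as strings over {N, C}. It is an illustration, not the eval_pred.py program.

    # Minimal sketch of the nucleotide-level measures (Eqs. 4.1-4.6).
    import math

    def nucleotide_measures(true_ann, pred_ann):
        tp = sum(t == "C" and p == "C" for t, p in zip(true_ann, pred_ann))
        fp = sum(t == "N" and p == "C" for t, p in zip(true_ann, pred_ann))
        fn = sum(t == "C" and p == "N" for t, p in zip(true_ann, pred_ann))
        tn = sum(t == "N" and p == "N" for t, p in zip(true_ann, pred_ann))
        sn = tp / (tp + fn) if tp + fn else 0.0                    # Eq. 4.1
        sp = tp / (tp + fp) if tp + fp else 0.0                    # Eq. 4.2
        denom = (tp + fn) * (tn + fp) * (tp + fp) * (tn + fn)
        cc = (tp * tn - fn * fp) / math.sqrt(denom) if denom else None  # Eq. 4.4
        ratios = [(tp, tp + fn), (tp, tp + fp), (tn, tn + fp), (tn, tn + fn)]
        defined = [num / den for num, den in ratios if den]
        acp = sum(defined) / len(defined)                          # Eq. 4.5
        ac = (acp - 0.5) * 2                                       # Eq. 4.6
        return sn, sp, cc, ac

    print(nucleotide_measures("NNCCCNNN", "NCCCNNNN"))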

4.3 Validation

The validation experiments in this project were performed on the previously introduced Genie data set (4.2). Since it already had a list with a suggested partitioning into 7 validation subsets, 7-fold validation was performed with each model developed in this project and described in Chapter 5.

4.3.1 Training

Training of all HMM models was done using the train-by-count method described in Chapter 2 (2.4), using the implementations from Chapter 3 (3.2 and 3.3). For each model, 7 different sets of parameters were acquired by training the HMM on 6 parts of the data set. I have divided the training process into two parts: counting the occurrences of transitions and emissions using the Java programs, and making the files with HMM parameters required by the prediction program VarVitHMM. I have automated this process with Python scripts, two for each model. They can all be found in the Scripts/v1/ subfolder. The counting scripts simply read all validation parts of the data set and execute the Java counting program with the appropriate sequences. All counts of initial states, transitions and emissions were saved into one file, with the data for each of the 7 sets separated. The other type of Python script reads the file with counts and outputs 7 files with HMM parameters.

4.3.2 Prediction

The same VarVitHMM program was used to obtain gene structure predictions with all models developed in this project. For each model, a Bash script was written to execute VarVitHMM with the appropriate parameters and data subsets. That helps to automate the validation process and ensures repeatability of the results. Since the data set was structured in a way that single exon genes were separated from multi exon genes, this separation was maintained in this experiment as well, and for each HMM two files with those two types of sequences were produced. In order to compare the prediction performance of my models together with the Viterbi implementation not only among themselves, but also with other existing gene finding programs, this part of the experiment was conducted with Genscan (2.5.1) and Augustus (2.5.2) as well. Since those programs were already trained, the training part was not necessary. However, prediction was done, as with VarVitHMM, on all 7 validation parts separately.

4.3.3 Results evaluation

The results obtained during the validation experiments were evaluated using the prediction accuracy measures mentioned earlier in this chapter (4.1). They were computed by the program eval_pred.py described in Chapter 3 (3.3). All those measures were calculated for each sequence, and their averages were output by the program and stored in a .csv file.

The mean values over all sequences in all validation parts, for each of my HMMs, Genscan and Augustus, are presented in Table 4.1. Besides that, evaluation results for sequences with multiple exon genes (Table 4.2) and single exon genes (Table 4.3) are presented as well. The two most informative prediction accuracy measures, the approximation correlation AC and EAvg, are visualized in the plots in Fig. 4.1, Fig. 4.2 and Fig. 4.3. Starting from the left side, the results from Genscan and Augustus are presented, followed by the models I have developed and tested, from the simplest one to the most complex on the right side. Together with increasing model complexity, improvements in prediction accuracy can be noticed.

Figure 4.1: Average prediction accuracy measures for all sequences in the validation data sets. AC - approximation correlation, EAvg - average of exon level sensitivity ESn and specificity ESp

Figure 4.2: Average prediction accuracy measures for sequences with multiple exon genes in the validation data sets. AC - approximation correlation, EAvg - average of exon level sensitivity ESn and specificity ESp

Figure 4.3: Average prediction accuracy measures for sequences with single exon genes in the validation data sets. AC - approximation correlation, EAvg - average of exon level sensitivity ESn and specificity ESp

Table 4.1: Average prediction accuracy measures for all sequences in the validation data sets. Sn - sensitivity, Sp - specificity, CC - correlation coefficient, AC - approximation correlation, CR - correct exons, PC - partially correct exons, OL - overlapping exons, WE - wrong exons, ME - missed exons, ESn - exon level sensitivity, ESp - exon level specificity, EAvg - average of ESn and ESp, MG - missed genes. Rows: Genscan, Augustus, HMM3st, HMM6st, HMM18st, HMM19st, HMM13st, HMMs13st, HMM22st, HMMs22st, HMM34st, HMMs34st, HMM25st, HMMs25st, HMMcod16st, HMMcod19st, HMMcods19st, HMMcod22st, HMMcods22st, HMMcod31st, HMMcod25st, HMMcod28st, HMMcod40st, HMMcod43st, HMMcod47st, HMMcod53st.

Table 4.2: Average prediction accuracy measures for sequences with multiple exon genes in the validation data sets. Sn - sensitivity, Sp - specificity, CC - correlation coefficient, AC - approximation correlation, CR - correct exons, PC - partially correct exons, OL - overlapping exons, WE - wrong exons, ME - missed exons, ESn - exon level sensitivity, ESp - exon level specificity, EAvg - average of ESn and ESp, MG - missed genes. Rows: Genscan, Augustus, HMM3st, HMM6st, HMM18st, HMM19st, HMM13st, HMMs13st, HMM22st, HMMs22st, HMM34st, HMMs34st, HMM25st, HMMs25st, HMMcod16st, HMMcod19st, HMMcods19st, HMMcod22st, HMMcods22st, HMMcod31st, HMMcod25st, HMMcod28st, HMMcod40st, HMMcod43st, HMMcod47st, HMMcod53st.

Table 4.3: Average prediction accuracy measures for sequences with single exon genes in the validation data sets. Sn - sensitivity, Sp - specificity, CC - correlation coefficient, AC - approximation correlation, CR - correct exons, PC - partially correct exons, OL - overlapping exons, WE - wrong exons, ME - missed exons, ESn - exon level sensitivity, ESp - exon level specificity, EAvg - average of ESn and ESp, MG - missed genes. Rows: Genscan, Augustus, HMM3st, HMM6st, HMM18st, HMM19st, HMM13st, HMMs13st, HMM22st, HMMs22st, HMM34st, HMMs34st, HMM25st, HMMs25st, HMMcod16st, HMMcod19st, HMMcods19st, HMMcod22st, HMMcods22st, HMMcod31st, HMMcod25st, HMMcod28st, HMMcod40st, HMMcod43st, HMMcod47st, HMMcod53st.


Chapter 5

Models of eukaryotic gene structure

In this chapter the different models of eukaryotic gene structure I have developed and tested during this project are described. They start with very simple models, upon which more complex and accurate ones are built. I often refer to the prediction accuracy measures presented in the 3 tables in the previous chapter (Table 4.1, Table 4.2 and Table 4.3) when I mention how much the introduced models improved prediction accuracy.

5.1 Simple 3-state model

All nucleotides in a DNA sequence can be divided into either gene coding (C) or non-coding (N). Furthermore, in eukaryotes the non-coding parts between coding regions inside a gene are distinguished from other non-coding parts and called introns (I). Therefore, in the simplest HMM model of gene structure, called HMM3st in this work, these 3 different parts of a DNA sequence are encoded as 3 hidden states. Each nucleotide type corresponds to one of the 3 hidden states. In this model, it is only possible to jump from the non-coding state to the coding state or to stay in the non-coding state. From the coding state, a transition to any state is possible, and the intron state can only change to the coding state or continue to stay in the intron state (Fig. 5.1). Each of those states emits one of the four possible nucleotides: adenine (A), cytosine (C), guanine (G) or thymine (T).

Figure 5.1: Transition diagram of HMM3st; N - non-coding, C - coding, I - intron
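
A minimal sketch of what an HMM3st parameter set looks like in memory is given below. The transition and emission values are invented purely to illustrate the allowed structure (zeros mark forbidden transitions); the real parameters are obtained by training by counting.

    # Illustrative HMM3st parameter structure (all probability values are
    # made up; the real values come from training by counting).
    states = ["N", "C", "I"]          # non-coding, coding, intron
    pi = [0.99, 0.01, 0.0]            # toy assumption: sequences start outside a gene

    # Allowed transitions follow Fig. 5.1: N -> {N, C}, C -> {N, C, I}, I -> {I, C}.
    A = [
        [0.999, 0.001, 0.0],          # from N
        [0.002, 0.996, 0.002],        # from C
        [0.0,   0.002, 0.998],        # from I
    ]

    # One emission distribution over A, C, G, T per hidden state
    # (coding rows with slightly elevated C/G content, as observed in 5.1).
    phi = [
        [0.28, 0.22, 0.22, 0.28],     # N
        [0.23, 0.27, 0.27, 0.23],     # C
        [0.28, 0.22, 0.22, 0.28],     # I
    ]

    for row in A + phi:
        assert abs(sum(row) - 1.0) < 1e-9
    assert abs(sum(pi) - 1.0) < 1e-9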

HMM3st can only capture the average length of the different regions, based on the transition probabilities, and a bias towards emitting different nucleotides from the hidden states, based on the emission probabilities. However, the only difference in nucleotide frequencies between the coding, non-coding and intron states was a slightly elevated C and G occurrence in the coding parts. Therefore, this model results in very poor predictions, but it is an essential foundation upon which more complex models are built.

5.2 Single and multi exon genes

Genes in eukaryotes might consist of a single exon as well as of multiple exons. Those distinct structures might be modeled as different hidden states of an HMM. In this 6-state HMM (HMM6st) the hidden states are: non-coding parts (N), coding parts of single exon genes (S), coding parts of the first exon (B), internal exons (M) and the last exon (E) of multiple exon genes, and of course introns (I). Possible transitions between hidden states are indicated by arrows in Fig. 5.2.

Figure 5.2: Transition diagram of HMM6st; N - non-coding, S - single exon, B - first exon of a multiple exon gene, M - internal exon of a multiple exon gene, E - last exon of a multiple exon gene, I - intron

The HMM6st model showed an improvement in prediction accuracy at the nucleotide level, compared with the simplest HMM3st, only for sequences containing single exon genes. At the exon level, while none of the exons of multiple exon genes were predicted correctly (CR), a small improvement can again be noticed for single exon genes.
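
Models such as HMM6st are trained on the 6-character annotation from Section 3.2 but evaluated on the 2-symbol alphabet, so true or decoded state sequences have to be collapsed before evaluation. A minimal sketch of that mapping is given below (an illustration, not one of the project scripts).

    # Collapse the 6-character training annotation (Section 3.2) to the
    # 2-symbol evaluation alphabet: exon states become C, everything else N.
    SIX_TO_TWO = {"N": "N", "I": "N", "S": "C", "B": "C", "M": "C", "E": "C"}

    def collapse_annotation(ann):
        return "".join(SIX_TO_TWO[c] for c in ann)

    print(collapse_annotation("NNBBBIIIMMMIIIEEENN"))  # -> NNCCCNNNCCCNNNCCCNN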

5.3 Translation start and termination codons

Translation always starts with the same codon, ATG, and terminates at one of 3 codons: TAA, TAG or TGA. Therefore, the next model, HMM18st, divides exons according to the occurrence of those signals. Possible transitions between hidden states are presented in Fig. 5.3. Single exon genes were divided into 7 hidden states:

- S1 - first nucleotide in the exon
- S2 - second nucleotide in the exon
- S3 - third nucleotide in the exon
- S4 - all internal nucleotides in the exon between the start and stop codons
- S5 - first nucleotide of the stop codon
- S6 - second nucleotide of the stop codon
- S7 - third nucleotide of the stop codon

Multiple exon genes were divided into 8 hidden states as follows:

- B1 - first nucleotide in the first exon
- B2 - second nucleotide in the first exon
- B3 - third nucleotide in the first exon
- B4 - all other nucleotides in the first exon, from the end of the start codon to the end of the exon
- M - all nucleotides in all internal exons
- E1 - all nucleotides in the last exon, except the stop codon
- E2 - first nucleotide of the stop codon
- E3 - second nucleotide of the stop codon
- E4 - third nucleotide of the stop codon

This information improved the results, and the average AC score almost doubled in comparison with the previous models.

Since the data set under consideration consists of sequences with one and only one gene, the non-coding parts before and after the gene were modeled separately in HMM19st, in order to investigate whether such information would improve the results. The transition diagram is similar to that of HMM18st (Fig. 5.3); however, transitions from states S7 and E4 are now possible only to a separate non-coding absorbing state that models the non-coding end of a sequence. The obtained results in this case were slightly worse, which is due to the very low prediction accuracy and the fact that the previous model could predict more than one gene in a sequence, and some of those additional coding predictions could match actual genes, while in the latter model the non-coding end-of-sequence state is absorbing, and after entering that state the rest of the sequence is simply non-coding.

Figure 5.3: Transition diagram of HMM18st

5.4 Nucleotide frequencies bias

Codon usage inside the coding parts of DNA is biased towards certain nucleotide triplets. The easiest way to use this fact in gene prediction is to assume different nucleotide frequencies at each of the 3 codon positions. In the HMM13st model each nucleotide position in a codon is encoded as a different hidden state. Additionally, the first and the last codon of a gene are each modeled as 3 different hidden states. The transition diagram for the HMM13st model is presented in Fig. 5.4, and the hidden states encoding the coding parts of a gene are as follows:

- M1 - first nucleotide of the gene
- M2 - second nucleotide of the gene
- M3 - third nucleotide of the gene
- M4 - first nucleotide of a codon between the start and stop codons
- M5 - second nucleotide of a codon between the start and stop codons
- M6 - third nucleotide of a codon between the start and stop codons

- M7 - first nucleotide of the stop codon
- M8 - second nucleotide of the stop codon
- M9 - third nucleotide of the stop codon

Introns can interrupt exons at any position, which means that codons can also be divided. In order to keep track of where the previous exon ended, i.e. whether it was after the first, the second or the last nucleotide of a codon, 3 separate hidden states encode introns:

- I1 - introns that split the gene after the first nucleotide of a codon
- I2 - introns that split the gene after the second nucleotide of a codon
- I3 - introns that split the gene after the third nucleotide of a codon

Figure 5.4: Transition diagram of HMM13st

Results improved significantly in this model, and the AC score increased further. Since the data set used contains only one gene per sequence, the HMM13st model was modified to allow only one gene per sequence (HMMs13st). However, the obtained results were slightly worse, as in the case of the models in the previous section. HMM13st does not distinguish between single and multi exon genes; therefore, the next model (HMM22st) includes that distinction using 9 additional hidden states. In this model, there are 2 sets of almost identical hidden states, one for multi and one for single exon genes. The only difference is that in the case of single exon genes, introns cannot interrupt the protein coding states. Modeling this information at this stage did not help to improve the predictions. Also, predicting only one gene per sequence (HMMs22st) did not help either, repeating the pattern from the previously described models.

5.5 Introns

Each intron starts with the sequence GT and ends with AG. This information should increase prediction accuracy for multiple exon genes. The next models, HMM25st and HMM34st, are built upon the 2 previously described models HMM13st and HMM22st, with the only difference being the modeling of the intron hidden states. Each of the 3 hidden states that were encoding introns is now expanded into 5 different hidden states, as in Fig. 5.5: the first (I1), second (I2), any internal (I3), second to last (I4) and the last (I5) nucleotide of an intron.

Figure 5.5: Transition diagram of the hidden states modeling introns in HMM25st and HMM34st

Results improved as expected, especially for sequences with multiple exon genes. As with the previous models, the information that each tested sequence contains exactly one gene was also added to these new models, resulting in 2 additional models, HMMs25st and HMMs34st. However, again no improvement in prediction accuracy was noticed.

5.6 Codon usage bias

Nucleotides can form 64 different triplets (4^3) that can code amino acids in the coding regions of a DNA sequence, and since the number of amino acids is much smaller than the number of possible codons, each amino acid can be coded by multiple codons. Therefore, not all codons are used equally frequently. Different organisms are biased towards more frequent usage of different sets of codons. Also in the Genie data set used in this project, the distribution of frequencies of all codons in the coding regions is not uniform, which is presented in Fig. 5.6. Discrepancies in how often different codons are used can be clearly noticed. In contrast, Fig. 5.7 presents how frequently each of the 64 possible nucleotide triplets occurs in the non-coding parts of the DNA sequences, both non-coding parts outside of genes and introns. Clearly, differences in triplet usage between coding and non-coding parts are visible; however, there is no distinguishable difference in how frequently they occur between non-coding parts outside of genes and inside introns (Fig. A.5 and A.6). Codon usage bias was not captured very well by the previous models, since each hidden state was allowed to emit only one symbol.

Figure 5.6: Frequencies of the 64 codons in the Genie data set in coding regions (in the first (B), internal (M) and last (E) exons of multiple exon genes and in exons of single exon genes (S))

Figure 5.7: Frequencies of the 64 codons in the Genie data set in non-coding and intron parts

In the next model, HMMcod16st, hidden states encoding the protein coding parts of DNA are allowed to emit triplets of symbols, since each amino acid is coded by 3 nucleotides and only sequences whose length is a multiple of 3 are translated into proteins. The only problem is that eukaryotic exons can be separated by introns, which can occur anywhere inside a gene, not necessarily between full codons. Therefore, while non-coding and intron states in this model always emit a single symbol, coding states may emit one, two or three symbols at a time.
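The decoding step therefore has to consider, at every position, predecessors whose emissions ended one, two or three symbols earlier, depending on how many symbols the current state emits. A minimal log-space sketch of that idea is shown below; it assumes each state emits a fixed number of symbols given by `emit_len`, which is a simplification introduced here for illustration rather than a copy of the actual VarVitHMM implementation.

```python
def viterbi_varlen(obs, states, emit_len, log_init, log_trans, log_emit):
    """Viterbi decoding where state k always emits emit_len[k] symbols at once.

    obs is a string; log_emit[k] maps an emitted chunk (string of length
    emit_len[k]) to its log-probability.  omega[i][k] is the best
    log-probability of any state path whose emissions cover obs[:i] and end
    in state k.  Assumes at least one valid parse of obs exists.
    """
    n = len(obs)
    neg_inf = float("-inf")
    omega = [{k: neg_inf for k in states} for _ in range(n + 1)]
    back = [{k: None for k in states} for _ in range(n + 1)]

    for i in range(1, n + 1):
        for k in states:
            L = emit_len[k]
            if i < L:
                continue
            e = log_emit[k].get(obs[i - L:i], neg_inf)
            if e == neg_inf:
                continue
            if i == L:  # k could be the very first state of the path
                cand = log_init.get(k, neg_inf) + e
                if cand > omega[i][k]:
                    omega[i][k], back[i][k] = cand, None
            # transitions from any state whose emission ended at position i - L
            for j in states:
                cand = omega[i - L][j] + log_trans[j].get(k, neg_inf) + e
                if cand > omega[i][k]:
                    omega[i][k], back[i][k] = cand, j

    # backtrack from the best final state
    k = max(states, key=lambda s: omega[n][s])
    path, i = [], n
    while i > 0:
        path.append(k)
        prev = back[i][k]
        i -= emit_len[k]
        k = prev
    return list(reversed(path))
```

Indexing ω by the number of observed symbols consumed so far, rather than by the number of states visited, keeps the table at (N + 1) × K entries and makes the backtracking step straightforward.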

The coding hidden states are:

B1* - first nucleotide of a codon in the first exon of a multiple exon gene
B2* - first 2 nucleotides of a codon in the first exon of a multiple exon gene
B - complete codon in the first exon of a multiple exon gene
M*1 - last nucleotide of a codon in an internal exon of a multiple exon gene
M*2 - last 2 nucleotides of a codon in an internal exon of a multiple exon gene
M - complete codon in an internal exon of a multiple exon gene
M1* - first nucleotide of a codon in an internal exon of a multiple exon gene
M2* - first 2 nucleotides of a codon in an internal exon of a multiple exon gene
E*1 - last nucleotide of a codon in the last exon of a multiple exon gene
E*2 - last 2 nucleotides of a codon in the last exon of a multiple exon gene
E - complete codon in the last exon of a multiple exon gene
S - complete codon in the exon of a single exon gene

Introns are encoded as 3 different hidden states in order to keep track of whether they split the coding part after the first, second or third nucleotide of a codon:

I - introns occurring after a full codon has been emitted
I1* - introns occurring after only one nucleotide of a codon has been emitted
I2* - introns occurring after two nucleotides of a codon have been emitted

The fact that introns can interrupt coding parts at any position, not only between full codons, makes the model much more complicated and requires additional hidden states to keep track of whether the previously emitted codon was complete or consisted of only one or two nucleotides. The transition diagram with all allowed transitions indicated by arrows is presented in Fig. 5.8.

Even though HMMcod16st does not use information about the translation start and stop codons, its performance at the nucleotide level is better than that of all models described before. However, none of the exons were predicted completely correctly. Therefore, the next model, HMMcod19st, models translation start and stop codons. It is similar to HMMcod16st, but instead of dividing codons based on which exon they are in, HMMcod19st models only the first and last codons of the coding part of each gene, and any other codon between them. The transition diagram is given in Fig. 5.9. The hidden states that model introns are the same as before, and all hidden states are described as:

B1* - first nucleotide of a translation start codon

Figure 5.8: Transition diagram of HMMcod16st

B2* - first 2 nucleotides of a translation start codon
B - complete translation start codon
B*1 - last nucleotide of a translation start codon
B*2 - last 2 nucleotides of a translation start codon
M*1 - last nucleotide of an internal codon
M*2 - last 2 nucleotides of an internal codon
M - complete internal codon
M1* - first nucleotide of an internal codon
M2* - first 2 nucleotides of an internal codon
E1* - first nucleotide of a stop codon
E2* - first 2 nucleotides of a stop codon
E - complete stop codon
E*1 - last nucleotide of a stop codon
E*2 - last 2 nucleotides of a stop codon

In addition to HMMcod19st, I have built and tested some of its variants. HMMcod22st has 3 additional hidden states that model the translation start codon, the stop codon and any internal codon. The next 2 models (HMMcods19st and HMMcods22st) model DNA sequences that contain a single gene.
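Because the start- and stop-codon states emit whole triplets, training by counting reduces, for those states, to tallying which triplets occur at the annotated translation boundaries. The toy tally below uses invented annotations; on real data the start-codon state ends up emitting essentially only ATG, and the stop-codon state only TAA, TAG or TGA.

```python
from collections import Counter

def codon_emission_probs(codons):
    """Maximum-likelihood emission probabilities for a 3-symbol-emitting state,
    estimated by counting (training by counting)."""
    counts = Counter(codons)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

# Invented annotated examples, standing in for triplets extracted from the
# training sequences at translation start and stop positions.
start_codons = ["ATG", "ATG", "ATG", "ATG"]
stop_codons = ["TAA", "TGA", "TAG", "TAA"]

print("B (start) emissions:", codon_emission_probs(start_codons))
print("E (stop)  emissions:", codon_emission_probs(stop_codons))
```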

Figure 5.9: Transition diagram of HMMcod19st

While the information about start and stop codons in HMMcod19st improved prediction accuracy, especially at the exon level, the other 3 models were not very helpful. Information about introns from the previous section (5.5) should improve multiple exon gene prediction, therefore the next model, HMMcod31st, is almost the same as HMMcod19st but with each hidden state modeling introns extended into 5 separate states, as described before and presented in Fig. 5.5. Since the nucleotides at the first two and last two positions of introns were the same in all tested sequences, with emission probabilities equal to 1, I decided to modify HMMcod31st. The hidden states modeling the first two and last two positions of introns were reduced to one state at the beginning and one at the end of the intron. Therefore, 4 hidden states emitting 1 symbol each were replaced with 2 hidden states emitting 2 symbols each. The total number of hidden states decreased from 31 to 25, and since the running time of the Viterbi algorithm, O(NK^2), depends on the number of hidden states K, HMMcod25st can be used to predict gene structure faster while producing the same results. This model helped to improve multiple exon gene prediction at the exon level as expected, resulting in more exons predicted correctly (CR) and fewer missed exons (ME) and genes (MG). However, the prediction accuracy for single exon genes decreased. Therefore, I added 3 additional hidden states, emitting 3 symbols each, to model the start, stop and internal codons of single exon genes. This new model, HMMcod28st, improved the accuracy of single exon gene prediction significantly, as expected.
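One way to picture the simplification: the four boundary states that each emitted one fixed nucleotide are collapsed into two states whose emission tables contain a single two-symbol chunk each. The fragment below is hypothetical (the state names and table layout are mine), and the final lines work out the effect on the per-symbol Viterbi cost, which grows as K^2: going from 31 to 25 states reduces the dominant factor from 961 to 625, roughly 35% less work.

```python
# Hypothetical fragment of an emission table for intron-boundary states
# (log-probabilities; log(1) = 0 for deterministic emissions).
before = {
    "I_first":       {"G": 0.0},   # four 1-symbol states ...
    "I_second":      {"T": 0.0},
    "I_second_last": {"A": 0.0},
    "I_last":        {"G": 0.0},
}
after = {
    "I_start": {"GT": 0.0},        # ... replaced by two 2-symbol states
    "I_end":   {"AG": 0.0},
}
print("before:", sorted(before))
print("after: ", sorted(after))

# Effect on the dominant Viterbi cost factor K^2 per observed symbol.
K_before, K_after = 31, 25
print(f"{K_after**2} / {K_before**2} = {K_after**2 / K_before**2:.2f}")  # 625 / 961 ≈ 0.65
```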

5.7 Splice sites and translation start

In the previous sections I have modeled the almost invariant nucleotides at the beginning and end of introns. However, nucleotide usage at other positions around splice sites is not completely random either. One of the patterns, called U2-dependent introns, is presented in Fig. 5.10.

Figure 5.10: Consensus sequence for U2-dependent introns in mammals (R - purine, Y - pyrimidine)

I have also analyzed nucleotide frequencies at different positions around splice sites and confirmed that nucleotides do not occur randomly there. Therefore, I increased the number of hidden states encoding introns from 3 to 8, resulting in a new model, HMMcod40st. Apart from the hidden states encoding introns, it is the same as HMMcod25st. HMMcod40st tries to capture the differences in nucleotide frequencies at the first 6 and last 3 positions of introns, and each of the 3 sets of hidden states encoding introns is presented as a transition diagram in Fig. 5.11.

Figure 5.11: Transition diagram of one set of hidden states modeling introns in HMMcod40st (I1 - first 2 positions; I2-I5 - positions 3 to 6; I6 - any internal nucleotide between position 6 and the last 3; I7 - third position from the end; I8 - last 2 positions)

HMMcod40st introduced further improvements in prediction accuracy. Furthermore, its variant with 3 additional hidden states encoding the coding parts of single exon genes (HMMcod43st) increased the quality of predictions, especially on sequences containing single exon genes. In fact, this was the model with the best prediction accuracy on sequences with single exon genes among all developed during this project.
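A straightforward way to obtain position-specific emission probabilities for states such as I1 to I8 is to align all annotated introns on their boundaries and count nucleotides per position. The snippet below sketches that tally for the first six intron positions; the intron sequences are invented placeholders rather than Genie annotations.

```python
from collections import Counter

def positional_frequencies(seqs, n_positions):
    """Per-position nucleotide frequencies over the first n_positions of each sequence."""
    freqs = []
    for pos in range(n_positions):
        counts = Counter(s[pos] for s in seqs if len(s) > pos)
        total = sum(counts.values())
        freqs.append({b: counts[b] / total for b in "ACGT"})
    return freqs

# Invented introns; real ones would come from the training annotations.
introns = ["GTAAGTTTCCCTTAG", "GTGAGTACGTAACAG", "GTAAGAATTTTGCAG"]

for pos, f in enumerate(positional_frequencies(introns, 6), start=1):
    print(pos, {b: round(p, 2) for b, p in f.items()})
```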

I have analyzed nucleotide frequencies at the first 10 and last 10 positions of introns, of the first, internal and last exons of multiple exon genes, as well as of exons in single exon genes. Inspired by the differences in nucleotide occurrence in those parts of DNA, a new model, HMMcod47st, was developed. Introns are modeled as before, however, while previously only the start and stop codons were treated specially within the coding parts, in this new model the exons themselves are modeled differently, as presented in Fig. 5.12. I noticed that none of the first or last exons of multiple exon genes were shorter than 3 nucleotides, which means that start and stop codons were never interrupted by introns. Therefore, using 5 hidden states to model each of those codons was unnecessary, and they were reduced to a single hidden state emitting 3 symbols. Also, each type of exon, whether the first, internal or last exon of a multiple exon gene or the exon of a single exon gene, is distinguished in this model.

Figure 5.12: Parts of the HMMcod47st transition diagram with hidden states encoding exons. Multiple exon gene codons (first exon: B1 - first, B2 - second, B3 - internal, B4 - last, B41* - one nucleotide of the last one, B42* - two nucleotides of the last one; internal exon: M1 - first, M1*1 - one nucleotide of the first, M1*2 - two nucleotides of the first, M2 - internal, M3 - last, M31* - one nucleotide of the last one, M32* - two nucleotides of the last one; last exon: E1 - first, E1*1 - one nucleotide of the first one, E1*2 - two nucleotides of the first one, E2 - internal, E3 - last) and single exon gene codons (S1 - first, S2 - second, S3 - internal, S4 - last)

This model further improved prediction accuracy for multiple exon genes, however, the quality of single exon gene predictions decreased. While the differences at the nucleotide level are barely noticeable, at the exon level EAvg decreased almost two-fold, meaning that many beginnings and ends of coding parts were not identified precisely. In fact, many of the correct exons (CR) are predicted only partially (PC).

I have also analyzed the nucleotide frequencies in the Genie data set in the non-coding part right before the translation start, and the results were consistent with the Kozak consensus sequence occurring at this position. Therefore, building upon the previous model, I added 6 hidden states encoding the last 6 nucleotide positions of the non-coding part right before the translation start, resulting in the HMMcod53st model. Overall prediction accuracy was further improved with HMMcod53st, which is the last and most complex model developed in this project.
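Structurally, the six new states form a deterministic chain entered from the intergenic state and leading into the translation-start state, so the only parameters they add are their position-specific emission probabilities. The wiring below is a hypothetical sketch with state names of my own choosing (K1-K6, N, B), not the labels used in VarVitHMM.

```python
# Hypothetical wiring of the six upstream ("Kozak context") states: a linear
# chain from the non-coding state N into the translation-start state B.
upstream = [f"K{i}" for i in range(1, 7)]                    # K1 .. K6
transitions = {"N": ["N", "K1"]}                             # loop or enter the chain
transitions.update({upstream[i]: [upstream[i + 1]] for i in range(5)})
transitions["K6"] = ["B"]                                    # chain ends at the start codon

for state, nxt in transitions.items():
    print(state, "->", ", ".join(nxt))
```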


Chapter 6

Conclusions and future work

In this project I have used machine learning, Hidden Markov Models in particular, to predict eukaryotic gene structure. In fact, only the inference of coding and non-coding positions was of concern. The problem was solved using Viterbi decoding, which, given a sequence of observable symbols and a set of probabilities, produces the most probable sequence of hidden states.

My implementation of the Viterbi algorithm was compatible with all the models I developed during this project. The modifications introduced allowed emitting a variable number of symbols from hidden states. Therefore, codons and other parts of a DNA sequence could be modeled more appropriately. The correctness of the implementation was confirmed by experiments measuring running time and memory consumption, as well as by running VarVitHMM with a special case model and sequences with known results.

Besides the implementation of the Viterbi and training algorithms, I have written multiple small programs that were helpful in data handling and conversion between different formats, in converting results from different prediction programs, and in automating the execution of Genscan on files containing multiple sequences. Also, all experiments, for instance the running time measurements of the Viterbi implementation and the validation, were run using various scripts, which simplifies later reproducibility and troubleshooting.

Cross validation experiments with all models were conducted using VarVitHMM and the Genie data set. All models were trained using the training by counting method, and the results were evaluated with the help of various prediction accuracy measures. The prediction accuracy of the models developed in this project, even though increasing with model complexity, was not as high as in the case of Genscan or Augustus. Therefore, further analysis and experiments should be performed, with a special focus on identifying splice sites as well as translation start and termination positions.

In future work, I might analyze the regions mentioned above in search of different types of splice sites or translation start and termination signals, since different mechanisms might be involved in those events. Sequences containing those signals might be clustered with the K-means algorithm, and silhouette scores might be useful to assess the number of clusters. The results might be used to build a better HMM with more appropriate hidden states. Convolutional neural networks might be used to find special signals such as splice sites. The obtained probability of being a splice site might be incorporated into the HMM by modifying the transition probabilities.
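As a rough sketch of the clustering idea, fixed-length windows around putative splice sites could be one-hot encoded and clustered with K-means, comparing silhouette scores across candidate numbers of clusters. The windows below are invented, and scikit-learn is assumed to be available; this is an illustration of the suggestion, not part of the work done in this thesis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def one_hot(seq):
    """One-hot encode a DNA string as a flat vector (A, C, G, T)."""
    table = {"A": 0, "C": 1, "G": 2, "T": 3}
    v = np.zeros((len(seq), 4))
    for i, b in enumerate(seq):
        v[i, table[b]] = 1.0
    return v.ravel()

# Invented windows around candidate donor splice sites; real ones would be
# extracted from the training sequences.
windows = ["AAGGTAAGT", "CAGGTGAGT", "TTGGTAAGA", "ACGGTATGT",
           "GAGGTAAGG", "AAGGTGAGC", "TTTTTTCAG", "CCCCTGCAG"]
X = np.array([one_hot(w) for w in windows])

for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```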

References

[1] Master's thesis: Gene prediction in eukaryotes using machine learning - downloads. 1SUZUPW6OWmBSwNwNSz3HldiFMegcDutt.

[2] C. Burge and S. Karlin, "Prediction of complete gene structures in human genomic DNA", Journal of Molecular Biology, vol. 268, no. 1, pp. 78-94, 1997.

[3] The Genscan web server at MIT.

[4] M. Stanke and B. Morgenstern, "AUGUSTUS: a web server for gene prediction in eukaryotes that allows user-defined constraints", Nucleic Acids Research, vol. 33, 2005.

[5] Augustus web interface. augustus/submission.php.

[6] M. Stanke, Augustus download. augustus/downloads/.

[7] M. Burset and R. Guigó, "Evaluation of gene structure prediction programs", Genomics, vol. 34, no. 3, pp. 353-367, 1996.

[8] Berkeley Drosophila Genome Project. sequence/human-datasets.html.


Appendix A

Appendix

Figure A.1: Running time of filling out the ω table in my implementation of the Viterbi algorithm with respect to an increasing number of hidden states K; N = 100kb, K = 3 to K = 34 (grey - measured values; red - mean of 10 measurements)

Figure A.2: Running time of backtracking in my implementation of the Viterbi algorithm with respect to an increasing number of hidden states K; N = 100kb, K = 3 to K = 34 (grey - measured values; red - mean of 10 measurements)

Figure A.3: Running time of the complete program VarVitHMM with respect to an increasing number of hidden states K; N = 100kb, K = 3 to K = 34 (grey - measured values; red - mean of 10 measurements)

Figure A.4: Memory consumption of VarVitHMM with respect to an increasing number of hidden states K; N = 100kb, K = 3 to K = 34

Figure A.5: Frequencies of the 64 possible nucleotide triplets in the Genie data set in non-coding parts
