
PROTEIN SECONDARY STRUCTURE PREDICTION USING ARTIFICIAL NEURAL NETWORKS
Judit Kisistók, 201602119
Bakhtawar Noor, 201602561
Master's Thesis, June 2018
Supervisor: Christian Nørgaard Storm Pedersen
Aarhus University

Bakhtawar Noor, Judit Kisistók: Protein secondary structure prediction using artificial neural networks, Master's Thesis, June 2018

ABSTRACT

Protein secondary structure prediction is an important step in the process of attempting to infer a protein's tertiary structure and its function. This thesis explores the use of feed-forward artificial neural networks to solve this problem. As a frame of reference, we present experimental and machine learning methods used to determine and predict protein secondary structure, with an in-depth overview of artificial neural networks. We give an overview of the four algorithms that have been implemented as part of this thesis: a simple, one-layer neural network described by Qian and Sejnowski [1]; an extension of the previous implementation incorporating multiple sequence alignments by majority voting; a cascaded neural network utilizing a profile table created from multiple sequence alignment data, described by Rost and Sander [2]; and a convolutional neural network learning from the position-specific scoring matrices of proteins, described by Liu and Cheng [3]. We have conducted experiments in order to optimize the performance of our models and tested the optimal networks on previously unseen data. In each case, we obtained results that are comparable to those presented in the papers our implementations are based on. Taken together, the experiments suggest that our models are robust, and we believe they will generalize well to data not featured in this thesis.

ACKNOWLEDGEMENTS

First and foremost, we want to thank our supervisor, Christian Nørgaard Storm Pedersen, for encouraging us to freely explore and experiment within our project. We appreciate all the valuable advice, time and work he devoted to this thesis, and the faith he put in the quality of our work - we hope we could live up to it. We want to express our gratitude to our close friends, Emil Malta-Müller and Tine Sneibjerg Ebsen, a.k.a. the Knights of 420, for seeing us in our most annoyingly frustrated states and still wanting to hang out with us. They never failed to lift our spirits, and we appreciate every single insider joke, deep conversation and cup of coffee we shared. Last but not least, we would like to acknowledge our respective families in Pakistan and Hungary for always supporting us, believing in us and being there for us.

CONTENTS

i  Theoretical framework  1
1  Introduction  3
   1.1  Proteins  3
        1.1.1  Primary structure  3
        1.1.2  Secondary structure  4
        1.1.3  Tertiary structure  4
        1.1.4  Quaternary structure  5
   1.2  Objective  6
2  Experimental protein secondary structure determination  9
   2.1  Spectroscopic methods  9
        2.1.1  Circular dichroism spectroscopy  9
        2.1.2  Fourier transform infrared (FT-IR) spectroscopy  9
        2.1.3  Raman spectroscopy  10
        2.1.4  NMR spectroscopy  10
   2.2  X-ray crystallography  10
3  Machine learning approaches to predict protein secondary structure  13
   3.1  Support Vector Machines  13
   3.2  Hidden Markov Models  14
   3.3  Neural networks  15
        3.3.1  The biological neuron  15
        3.3.2  The artificial neuron  15
        3.3.3  Convolutional neural networks  17
        3.3.4  Training a neural network  18
        3.3.5  Commonly used PSSP tools utilizing neural networks  20
4  Tools and methods  25
   4.1  Simple neural network (jnn)  25
   4.2  Using multiple sequence alignments (jsnn)  26
   4.3  Cascaded neural network (mnn)  27
   4.4  Convolutional neural network (snn)  28
ii  Practical experiments  31
5  fnn - our command line tool for secondary structure prediction  33
   5.1  Dataset  33
   5.2  Framework used  34
   5.3  Implemented algorithms  34
        5.3.1  Parsing  34
        5.3.2  Encoding  34
        5.3.3  Neural networks  35
   5.4  User manual  35
        5.4.1  System specifications  35
        5.4.2  Getting started  36
        5.4.3  Structure prediction  36
6  Experiments  39
   6.1  Workbench  39
   6.2  Preliminaries  40
        6.2.1  Batch size  40
        6.2.2  Epochs  40
        6.2.3  Dropout regularization  40
        6.2.4  L1/L2 regularization  41
        6.2.5  Q3 score  41
   6.3  Experiments  41
        6.3.1  JNN  41
        6.3.2  JSNN  52
        6.3.3  MNN  61
        6.3.4  SNN  70
7  Conclusion and outlook  85
iii  Appendix  87
A  Appendix  89
   A.1  Derivation of the backpropagation algorithm  89
   A.2  Supplementary material  91
Bibliography  111

LIST OF FIGURES

Figure 1: Structure of an amino acid.  4
Figure 2: Levels of protein structure [8].  5
Figure 3: The schematic diagram of SVMs as given by [22]. (a) shows the linearly separable and (b) the non-linearly separable case.  14
Figure 4: The schematic diagram of HMMs, where x_i are observables, z_i are hidden variables, A_i are transition probabilities and φ_i are emission probabilities.  15
Figure 5: Structure of a biological neuron [33].  16
Figure 6: Structure of a single artificial neuron.  16
Figure 7: Topology of a fully connected feed-forward neural network [33].  17
Figure 8: Computing output values of a convolutional layer.  17
Figure 9: Max pooling on the output obtained from a convolutional layer. Different patches are represented by different colors.  18
Figure 10: Part of the neural network considered for the derivation of backpropagation. See section A.1 in the appendix for the whole derivation.  19
Figure 11: The outline of the PHD method, as given by Rost and Sander in [36].  21
Figure 12: The outline of the PSIPRED method, as given by Jones in [37].  22
Figure 13: The outline of the JPred method, as given by Cuff and Barton in [11].  23
Figure 14: The neural network architecture described by Qian and Sejnowski [1].  26
Figure 15: Majority voting on a multiple sequence alignment.  27
Figure 16: Sequence-to-structure neural network architecture described by Rost and Sander [2]. The structure-to-structure neural network is not shown.  28
Figure 17: The convolutional neural network described by Liu and Cheng in [3].  29
Figure 18: The process of encoding target sequences before presenting them to the neural networks.  35
Figure 19: Steps followed to find the optimal set of hyperparameters for jnn and jsnn.  42
Figure 20: Validation accuracies observed using five different batch sizes and twenty different numbers of nodes in the hidden layer of jnn. Window sizes 13, 17 and 21 were considered.  43
Figure 21: Validation accuracies obtained after doing a local search around a batch size of 100 in jnn.  44
Figure 22: Accuracy and loss plot for jnn. The neural network was trained without regularization.  45
Figure 23: Accuracy and loss plots of jnn with L1 regularizer and Adam optimization.  47
Figure 24: Validation and training accuracies obtained by iteratively increasing the number of hidden layers in jnn.  48
Figure 25: Final cross-validation accuracies of jnn.  49
Figure 26: Validation accuracy and loss obtained by jnn using the TSP1607 dataset.  51
Figure 27: Validation accuracies observed using five different batch sizes in jsnn. Window sizes of 13, 17 and 21 were considered.  53
Figure 28: Validation accuracy using different batch sizes for window sizes of 17 and 21 in jsnn.  55
Figure 29: Validation and loss plots for jsnn using L2 regularizer with a dropout layer and Adam optimizer.  57
Figure 30: Validation accuracy and test accuracy obtained by iteratively increasing the number of hidden layers in jsnn.  58
Figure 31: Validation accuracy and test accuracy observed after performing K-fold cross-validation on jsnn.  59
Figure 32: Steps followed to find the optimal set of hyperparameters for mnn.  61
Figure 33: Validation accuracy using different numbers of nodes in the first neural network of mnn.  63
Figure 34: Validation accuracy using different numbers of nodes in the second neural network of mnn.  63
Figure 35: Validation accuracy using different batch sizes in mnn.  64
Figure 36: Accuracy and loss plots of mnn using 10 and 100 nodes in the first and second neural network and a batch size of 200.  65
Figure 37: Accuracy and loss plots of mnn using Adam optimizer and one dropout layer in each network with a dropout rate of 0.2.  67
Figure 38: Training and validation accuracy using different numbers of hidden layers in mnn.  68
Figure 39: Validation accuracy of mnn using different numbers of folds in K-fold cross-validation.  69
Figure 40: Steps followed to find the optimal set of hyperparameters for snn.  71
Figure 41: Accuracy and loss plots of snn without regularization.  72
Figure 42: Accuracy and loss plots of snn with L2 norm regularization and two dropout layers (one in each convolutional layer).  74
Figure 43: Accuracy and loss plots of snn with L1 norm regularization.  75
Figure 44: Mean validation accuracy for batch sizes 100, 500 and 1000 in snn.  77
Figure 45: Mean validation accuracy for the number of filters in the first convolutional layer of snn.  78
Figure 46: Mean validation accuracy for the number of filters in the second convolutional layer of snn.  78
Figure 47: Accuracy and loss plots for snn obtained using 96 and 10 filters in the first and second convolutional layers, respectively, and a batch size of 1000.  79
Figure 48: Accuracy and loss plots of snn obtained using Adam and 5*5 and 2*2 filters in the first and second convolutional layers.  81
Figure 49: Validation accuracy using different numbers of folds in K-fold cross-validation in snn.  83
Figure 50: Part of the neural network considered in order to do the derivation for backpropagation.  89
Figure 51: Left: Training and validation accuracies of jnn using L2 regularizer with Adam optimizer. Right: Training and validation losses. Batch size used is 50 and the number of nodes is 110.  91
Figure 52: Left: Training and validation accuracies of jnn using L2 regularizer with Adam optimizer. Right: Training and validation losses. Batch size used is 60 and the number of nodes is 190.  92
Figure 53: Left: Training and validation accuracies of jnn using L2 regularizer with Adam optimizer. Right: Training and validation losses. Batch size used is 60 and the number of nodes is 130.  93
Figure 54: Left: Training and validation accuracies of jnn using L2 regularizer with Adam optimizer. Right: Training and validation losses. Batch size used is 50 and the number of nodes is 110.  94
Figure 55: Left: Training and validation accuracies of jnn using L2 regularizer with Adam optimizer. Right: Training and validation losses. Batch size used is 60 and the number of nodes is 130.  95
Figure 56: Left: Training and validation accuracies of jnn using L2 regularizer with Adam optimizer. Right: Training and validation losses. Batch size used is 60 and the number of nodes is 190.  96
Figure 57: Left: Training and validation accuracies of jnn using L1 regularizer with Adam optimizer. Right: Training and validation losses. Batch size used is 50 and the number of nodes is 110.  97
Figure 58: Left: Training and validation accuracies of jnn using L1 regularizer with Adam optimizer. Right: Training and validation losses. Batch size used is 60 and the number of nodes is 130.  98
Figure 59: Left: Training and validation accuracies of jsnn using L1 regularizer with Adam optimizer. Right: Training and validation losses. Batch size, number of nodes and window size are 100, 90 and 21, respectively.  101
Figure 60: Left: Training and validation accuracies of jsnn using L1 regularizer with Adam optimizer. Right: Training and validation losses. Batch size, number of nodes and window size are 100, 180 and 21, respectively.  102
Figure 61: Left: Training and validation accuracies of jsnn using L1 regularizer with Adam optimizer. Right: Training and validation losses. Batch size, number of nodes and window size are 100, 90 and 21, respectively.  103
Figure 62: Accuracy and loss plots obtained using 20 and 10 nodes in the first and second convolutional layers, respectively, and a batch size of 500 in snn.  104
Figure 63: Accuracy and loss plots of snn obtained using Adam and 5*5 filters in the first and second convolutional layers.  105
Figure 64: Accuracy and loss plots of snn obtained using Adam and 10*10 and 5*5 filters in the first and second convolutional layers.  106

LIST OF TABLES

Table 1: Mean validation accuracies for window sizes 13, 17 and 21 after experimenting with the number of nodes and batch size in jnn.  44
Table 2: Hyperparameters chosen to find the optimal regularization method and optimizer.  45
Table 3: Hyperparameters chosen to investigate the effect of the number of hidden layers on jnn.  47
Table 4: Final architecture of jnn.  48
Table 5: Q3 accuracy of jnn (trained on the CB513 dataset) for each test sequence.  50
Table 6: Q3 accuracy of jnn (trained on the TSP1607 dataset) for each test sequence.  52
Table 7: Mean validation accuracies for window sizes 13, 17 and 21 after experimenting with the number of nodes and batch size in jsnn.  54
Table 8: Hyperparameters chosen to find the optimal regularization method and optimizer for jsnn.  56
Table 9: Final architecture of jsnn.  58
Table 10: Q3 accuracy of jsnn (trained by virtue of majority voting) for each test sequence.  60
Table 11: Q3 accuracy of jnn for each test sequence.  60
Table 12: Mean validation accuracies for window sizes 7, 13, 17 and 21 after experimenting with the number of nodes and batch size in mnn. Validation accuracy 1 indicates the results obtained with only the first neural network and validation accuracy 2 indicates the total accuracy obtained from the entire cascaded system.  62
Table 13: The effect of the choice of regularizer and optimizer on the validation accuracy in mnn.  66
Table 14: Final hyperparameters chosen to be used in mnn.  68
Table 15: Q3 accuracy of mnn for each test sequence.  70
Table 16: Mean validation accuracy for window sizes 13, 17 and 21 after regularization experiments using snn.  72
Table 17: The effect of different regularization methods on the accuracy of snn.  73
Table 18: Mean validation accuracy for window sizes 13, 17 and 21 after filter number and batch size experiments in snn.  76
Table 19: Validation accuracy using different filter sizes and optimizers in snn.  80
Table 20: Final hyperparameters chosen for snn.  82
Table 21: Results obtained by testing snn on unseen data.  83
Table 22: Training and validation accuracies obtained after performing regularization and optimizer experiments on jnn.  99
Table 23: Validation accuracies obtained by performing regularization and optimizer experiments on jsnn.  107
Table 24: The effect of regularization on the accuracy of snn, for window sizes 13, 17 and 21.  108
Table 25: The effect of the number of nodes in the convolutional layers and the batch size on the accuracy of snn, for window size 21.  109

ACRONYMS

CD     Circular Dichroism
NMR    Nuclear Magnetic Resonance
FT-IR  Fourier Transform Infrared
SVM    Support Vector Machines
ANN    Artificial Neural Network
PSSM   Position Specific Scoring Matrix
CNN    Convolutional Neural Network
PSSP   Protein Secondary Structure Prediction
HMM    Hidden Markov Model
ML     Machine Learning
NN     Neural Network

Part I: THEORETICAL FRAMEWORK

1 INTRODUCTION

In this chapter we give an overview of proteins and the four levels of protein structure required to understand the reported work. We also explain the main objective (in section 1.2) behind implementing neural network models for protein secondary structure prediction (PSSP).

1.1 Proteins

Proteins are macromolecules that play a pivotal role in various processes in all organisms. Proteins serve as catalysts that can be part of a large complex or temporarily associate with a cofactor, thereby either accelerating or inhibiting chemical processes in living organisms. They can also be responsible for various other functions such as the transportation of other molecules, immunity, cell growth and differentiation, mechanical support, and storage. All of these functions are dictated by the three-dimensional (3D) structure of a protein, which in turn is determined by the linear sequence of amino acids. The process of going from the linear sequence of amino acids to the 3D protein structure is referred to as protein folding [4]. Protein folding is an important biological process which, if it goes wrong, can lead to neurological diseases such as Alzheimer's disease, Parkinson's disease and Huntington's disease, as well as several cancer-related diseases. Therefore, understanding the process of protein folding and elucidating a protein's structure is important in order to understand its function, which in turn provides useful insights for various medical and pharmaceutical applications. Although the amino acid sequence drives protein folding, there is an intermediate level of organization, i.e. the β-sheets, helices and coil regions of a protein, that also contributes to the final 3D structure [4].

1.1.1 Primary structure

Proteins are linear polymer chains composed of 20 naturally occurring amino acids. Each amino acid consists of a carboxyl group, an amino group, a hydrogen atom and an R-side chain attached to a central carbon atom called the α-carbon (shown in Figure 1). It is this side chain that distinguishes one amino acid from another and determines whether it will be hydrophilic, hydrophobic or neutral [5].

Figure 1: Structure of an amino acid

Amino acids can form peptide bonds with each other, whereby the carboxyl group of one amino acid forms a bond with the amino group of another. Peptide bonds hold amino acids in a linear polypeptide chain, which is known as the primary structure of a protein [5].

1.1.2 Secondary structure

Polypeptide chains have various segments that are either coiled or folded, stabilized by hydrogen bonds between atoms of the polypeptide backbone. These coiled and folded segments are referred to as the secondary structure of a protein. The two major folds in a protein are α-helices and β-sheets [6]. α-helices are rigid, rod-like coiled strands held together by hydrogen bonds between every fourth amino acid, whereas β-sheets are strands arranged side by side and held together by hydrogen bonds. β-sheets can either be parallel or anti-parallel, depending on whether the direction of the strands is the same or opposite. Anti-parallel sheets have better-aligned hydrogen bonds, making them more stable than parallel sheets [6].

1.1.3 Tertiary structure

The three-dimensional structure of a protein is referred to as its tertiary structure, which is formed when a protein molecule twists, folds and bends in a certain way. The aim of protein folding is to minimize the energy of the structure, thus maximizing structural stability. Unlike secondary structure, tertiary structure is formed by various interactions between the R-side chains of the amino acids. Hydrophobic interactions ensure that polypeptide chains fold into the correct shape, such that the non-polar amino acids cluster together to form a non-polar protein core away from the aqueous environment. Weak van der Waals forces act on this hydrophobic core to further stabilize the protein. Polar amino acids interacting with each other via hydrogen

bonds and ionic interactions also contribute to the structural stability of a protein. Individually, all of these interactions are weak, but their cumulative effect is enough to give the protein a unique 3D structure. Lastly, there are covalent bonds formed between two cysteine residues which further stabilize the structure. These covalent bonds are known as disulfide bridges and are categorized as weak interactions [7].

1.1.4 Quaternary structure

Quaternary structure is the result of several proteins interacting with each other and arranging themselves into a protein complex. This protein complex is stabilized by various interactions: hydrogen bonding, disulfide bridges and salt bridges. The four levels of protein structure are shown in Figure 2.

Figure 2: Levels of protein structure [8]

1.2 Objective

A protein's function is determined by its 3D structure, which in turn is dictated by its amino acid sequence. Therefore, elucidating protein structure is important for understanding protein function. Experimental approaches such as X-ray crystallography and nuclear magnetic resonance spectroscopy have played a major role in determining protein structures [9]. There are 40,000 proteins in the Protein Data Bank whose structures have been determined experimentally [9][10]. These structures have provided useful insights into how protein chains fold into their unique 3D structures, how chains interact to form complexes, and how to use the amino acid sequence of a protein to predict its structure. However, with current high-throughput DNA and protein sequencing technologies, the number of proteins with known sequences has increased exponentially, and structures are not being resolved at the same pace. It has been estimated that only 40,000 out of 2.5 million known sequences have resolved structures. This gap keeps increasing because experimental approaches are not only expensive but also time-consuming, labor-intensive and at times not applicable. Therefore, computational approaches are required to narrow the gap between known sequences and solved structures [9].
In our thesis, we focused on one of the computational approaches, namely artificial neural networks (ANNs), to predict proteins' secondary structure from their primary sequence. We implemented classic neural network algorithms that predict helices, β-sheets and coil regions in a protein given its primary sequence. Although all of the models are well documented, they are largely unexplored in practice in terms of their validation accuracy and Q3 score when trained and tested on different datasets. Our thesis provides a working implementation of the algorithms in addition to the theoretical background. We use two datasets (the CB513 dataset [11] and a combination of the TMP166 and SP1441 datasets [12]) and investigate whether we obtain results similar to those reported by the original authors. We also compare the different implementations based on their accuracies.
The thesis is divided into two parts: theory and experiments. Chapter 2 focuses on experimental approaches that are used for protein structure determination. From there, we delve into the details of the machine learning approaches used to study the challenging problem of protein secondary structure prediction in Chapter 3, including ANNs. Following these, Chapter 4 focuses on the different neural network models that we implemented. The second part of the thesis focuses solely on how the models worked in practice. Chapter 5 presents the datasets and framework used in our project. In this chapter we also explain how our command line tool can be used to predict the structure of any protein using the models we implemented. In Chapter 6 we explain the experiments

carried out to find the optimal set of hyperparameters for the different neural network models. We compare our results to those reported by the authors, and we discuss how similar or different our hyperparameters are to those used in the original publications. We end our thesis by stating our conclusions and possible future work in Chapter 7.

2 EXPERIMENTAL PROTEIN SECONDARY STRUCTURE DETERMINATION

In this chapter we give an overview of several analytical techniques used in experimental protein secondary structure determination: spectroscopic methods, including circular dichroism, Fourier transform infrared, Raman and NMR spectroscopy, and X-ray crystallography.

2.1 Spectroscopic methods

2.1.1 Circular dichroism spectroscopy

Circular dichroism spectroscopy is a method that enables the quick determination of protein secondary structure. It is based on the concept of circular dichroism, defined as the differential absorption of left and right circularly polarized light. In proteins, the intensity and the wavelength of the optical transition depend on the orientation of the peptide bonds; therefore, many secondary structure motifs give rise to characteristic CD spectra, allowing for the estimation of the structure of unknown proteins. The CD of the protein molecule is measured over a range of wavelengths, and secondary structure information is obtained from the resulting CD spectra under the assumption that the spectrum of a protein molecule is given by a linear combination of the spectra of its secondary structure elements and a noise term. For the analysis, one can use polypeptide standards with defined compositions in known conformations, or proteins whose secondary structures have been determined with X-ray crystallography. A variety of methods is used to compare the reference spectra to the spectrum of the unknown protein, including linear regression aiming to fit the spectrum of the unknown protein to the spectra of fixed standards, or neural networks trained on the CD spectra of reference sequences aiming to predict the secondary structure of unknown proteins [13].

2.1.2 Fourier transform infrared (FT-IR) spectroscopy

Fourier transform spectroscopy is a type of vibrational spectroscopy using a beam combining many frequencies of light to obtain the interferogram, the raw light absorption data. This raw data is then turned into a spectrum using the Fourier transform [14].

The strength and polarity of the vibrating bonds of the molecules influence the wavelength and probability of absorption; hence, the spectrum is influenced by the conformation. In proteins, there are nine characteristic IR absorption bands: amides A, B and I-VII. The amide I band is related to the backbone conformation; therefore, it is quite sensitive to the secondary structure composition and has been widely used to determine protein secondary structures [15].

2.1.3 Raman spectroscopy

Raman spectroscopy, similarly to FT-IR, is a vibrational spectroscopy method; however, it differs in the manner in which the vibrational state of the molecule is changed, thus resulting in complementary information [16]. It uses a monochromatic beam of radiation, normally a laser, to illuminate the sample and measures the molecular vibrations triggered by the inelastic scattering of light, resulting in characteristic spectra [14][17]. Primarily the amide I and amide III bands are used in practice to determine protein secondary structures [18]. The spectra are analyzed as a linear combination of the spectra of proteins with known structures [19].

2.1.4 NMR spectroscopy

Nuclei in a magnetic field absorb and re-emit electromagnetic radiation, a phenomenon that serves as the basis of nuclear magnetic resonance spectroscopy. The nuclear chemical shift observed in the NMR spectra is a reliable indicator of biomolecular structure [20]. This method can provide detailed structural information at atomic resolution. The link between chemical shifts and secondary structure elements is not fully understood; however, a library of NMR chemical shift data exists for peptides and small proteins that can be used to infer the structure of larger sequences [18].

2.2 X-ray crystallography

X-ray crystallography enables the study of protein structure at near-atomic resolution. The proteins have to be arranged into three-dimensional periodic arrays, known as crystals, to amplify the scattering signal so that meaningful X-ray diffraction data can be obtained; this would not be possible otherwise, as scattering from individual molecules is rather weak. Crystallization is a time-consuming process influenced by many factors such as protein concentration and purity, temperature or ionic strength.

X-ray diffraction can be used to determine atomic structures because X-rays have wavelengths comparable to atomic bond distances. As the waves encounter the molecules, they bend and interfere, and the intensities of these diffracted waves can be analyzed using diffraction theory to reconstruct the atomic structures of the molecules [21].

3 MACHINE LEARNING APPROACHES TO PREDICT PROTEIN SECONDARY STRUCTURE

In this chapter we give an overview of several machine learning techniques used to predict protein secondary structure, including Support Vector Machines, Hidden Markov Models and neural networks. In the section covering neural networks we also describe PHD, PSIPRED and JPred, three protein secondary structure prediction tools utilizing this machine learning technique.

3.1 Support Vector Machines

When presented with a linearly separable problem, Support Vector Machines aim to find a separating hyperplane with maximum margin, which means that the method attempts to create a hyperplane such that the distance between the hyperplane and the closest training samples is maximized. SVMs can also be used to find the boundary between classes in non-linearly separable datasets. This is achieved using a kernel function, which maps the input data into a higher-dimensional feature space where it may indeed be linearly separable. The decision boundary found in the higher-dimensional space is then projected back to the original space, giving a non-linear decision boundary. A schematic diagram of SVMs can be seen in Figure 3.
The task of protein secondary structure prediction is high-dimensional and non-linear. Due to the ability of SVMs to implicitly map non-linear data into a high-dimensional space using kernel functions, this method is well suited to the PSSP problem, as the complexity of the problem continues to depend on the dimensionality of the input data and not on that of the feature space. This kernel trick helps to avoid the "curse of dimensionality", the problem of overfitting when the number of parameters is too large with respect to the number of training samples [22].
A variety of methods exist to implement SVMs in the context of protein secondary structure prediction. The frequency patterns of consecutive amino acids, or of amino acid groups with common properties, can be used as the input vector, as introduced by Birzele and Kramer in [23]. In this case, only the patterns exceeding a frequency threshold are considered, to ensure good predictivity. Karypis' method, discussed in [24], is based on an input coding scheme combining both position-specific and non-position-specific information, generated by PSI-BLAST and BLOSUM62, and utilizes a kernel function designed to capture sequence conservation signals around the local window of

each residue. Zamani and Kremer [25] propose the usage of amino acid codon encoding, which incorporates evolutionary information into the prediction model, since the encoding is based on the genetic code. SVMs were originally designed to perform binary classification; how to extend them effectively to multiclass classification is still an ongoing research issue [26].

Figure 3: The schematic diagram of SVMs as given by [22]. (a) shows the linearly separable and (b) the non-linearly separable case.

3.2 Hidden Markov Models

Hidden Markov Models are probabilistic graphical models that can be represented as directed acyclic graphs reflecting a series of probabilistic dependency relationships among variables. In an HMM, the states cannot be observed directly (they are latent); however, each state emits an observation. The nth observable in a chain of observations only depends on the corresponding hidden variable, and the nth hidden variable only depends on the (n-1)st hidden variable, as shown in Figure 4. The parameters governing the model are π, A and φ, the initial, transition and emission probabilities, respectively. The initial probability π_k expresses the probability of state k being the initial state, the transition probability A_jk gives the probability of going from state j to state k, and the emission probability φ_kn describes the probability of emitting observable k from latent state n [27].
In the context of protein secondary structure prediction, the secondary structure of a given residue is the hidden variable and the amino acids are the observables. The first attempt to use HMMs for protein secondary structure prediction was due to Asai et al. [28]. In their approach, four submodels are trained separately, representing the four secondary structures: helix, sheet, turn and other. At the end, the four submodels are combined to give a network suitable for practical use. Other methods have been proposed to add more biologically relevant details, such as the solvent accessibility status, the length distribution of the secondary structure segments [29], the distinction between N-cap and C-cap positions or the explicit modeling of amphipathic helices and β-turns [30]; however, good results have also been obtained with models that do not take prior biological knowledge into account [31].

Figure 4: The schematic diagram of HMMs, where x_i are observables, z_i are hidden variables, A_i are transition probabilities and φ_i are emission probabilities.

3.3 Neural networks

Neural networks are non-linear hypothesis sets that are widely used in machine learning (ML). The topology of present-day ANNs was developed by simulating a network of biological neurons.

3.3.1 The biological neuron

A neuron (shown in Figure 5) consists of three basic units: dendrites, soma and axon. Dendrites are input wires arising from the soma, the main cell body, receiving signals from other neurons. The axon is an output wire that sends signals to other neurons. Neurons communicate with each other by sending electrical impulses via their axons. The axon's terminal makes contact with a dendrite through a synapse. A neuron receiving these pulses of electricity manipulates its cell potential and fires an electrical impulse if the cell potential exceeds a certain threshold. The incoming signals from each synapse are categorized as either excitatory or inhibitory. The neuron assigns a "positive weight" to an excitatory signal and a "negative weight" to an inhibitory signal. The excitatory signals must exceed the inhibitory signals by a certain threshold within a certain amount of time, which then causes the neuron to fire electrical impulses [32].

Figure 5: Structure of a biological neuron [33]

3.3.2 The artificial neuron

The artificial neuron is based on a very simple model of what a biological neuron does (shown in Figure 6). An artificial neuron receives inputs x_1, x_2, x_3, ..., x_N from the input layer, where each input has some

weight associated with it. Therefore, a neuron (or node) receives a weighted sum of the inputs:

$\sum_{i=1}^{N} x_i w_i$

This weighted sum is passed through an activation function, such as the sigmoid, ReLU or tanh, which then outputs some value. It is the activation function that introduces non-linearity into the model.

Figure 6: Structure of a single artificial neuron
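To make this computation concrete, the following is a minimal NumPy sketch of a single artificial neuron; the input values, weights, bias and the choice of a sigmoid activation are purely illustrative and are not taken from any of our trained models.

```python
import numpy as np

def sigmoid(a):
    # Squashes the weighted sum into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-a))

def neuron(x, w, b):
    # Weighted sum of the inputs followed by a non-linear activation.
    a = np.dot(w, x) + b
    return sigmoid(a)

# Toy example: three inputs with arbitrary weights and bias.
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.4, 0.1, -0.7])
b = 0.2
print(neuron(x, w, b))
```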

A neural network is simply a group of these neurons stacked together. The most common neural network is the fully connected feed-forward neural network shown in Figure 7, in which the first layer is the input layer, the last is the output layer and all other layers are termed hidden layers. Each layer can have a different number of nodes, and a given layer is fully connected to its adjacent layers, making the neural network an acyclic graph. The output from a given layer serves as the input to the next [33].

Figure 7: Topology of a fully connected feed-forward neural network [33]

3.3.3 Convolutional neural networks

Convolutional neural networks (CNNs) have been very successful in the field of image recognition and classification. They are feed-forward neural networks inspired by the organization of the visual cortex. Mathematically, the term "convolution" expresses how much two functions overlap as one function is passed over the other; it can be thought of as multiplying two functions (or matrices, in the case of neural networks) in order to mix them. The convolutional layer has small matrices called kernels, also known as filters, that slide across the input matrix [34]. This layer involves element-wise multiplication of the values in the filter and the input matrix, as shown in Figure 8.

Figure 8: Computing output values of a convolutional layer
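As an illustration of the operation sketched in Figure 8, the following NumPy snippet computes the output of a single filter sliding over an input matrix with a given stride; the input, filter values and stride are arbitrary and chosen only for this example.

```python
import numpy as np

def convolve2d(inp, kernel, stride=1):
    # Element-wise multiply the kernel with each local patch and sum the result.
    kh, kw = kernel.shape
    out_h = (inp.shape[0] - kh) // stride + 1
    out_w = (inp.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = inp[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

inp = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 input matrix
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])     # toy 2x2 filter
print(convolve2d(inp, kernel, stride=1))         # 3x3 output matrix
```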

For each filter, one takes a local receptive field of the input matrix, applies element-wise multiplication, moves the filter by a stride and repeats the process for the entire input matrix. Padding can be applied around the input matrix to ensure that the output of the convolutional layer does not get smaller [34].
The next layer in a CNN is the pooling layer, which reduces the size of the input matrix and therefore the total number of computations required to train the network. In Figure 9, max pooling is applied to a matrix: it simply takes the maximum value from each patch of the matrix and stores it in a new matrix, while the rest of the values in the patch are discarded. No learning takes place in a pooling layer; it only identifies and stores the locations that have the strongest correlation with a given feature [34].

Figure 9: Max pooling on the output obtained from a convolutional layer. Different patches are represented by different colors.

After the convolutional layer(s), a CNN can have one or more fully connected layers that are similar to the layers in a regular multi-layered neural network [34].

3.3.4 Training a neural network

Like any supervised learning model, a NN is implemented such that it approximates the unknown target function h(x) well by ensuring that the objective function is as small as possible:

$E = \| y - \hat{y} \|^2$   (1)

The weights in a NN are responsible for amplifying the input signal and dampening the noise. A large weight signifies a higher correlation between the signal and the NN's output [33]; therefore, a NN has to learn these weights such that the objective function is minimized. As with other learning algorithms, one has to take the derivative of the objective function with respect to the parameters to be optimized, in this case the weights. This derivative can be set to zero if a closed-form solution exists, or one can perform gradient descent, in which the objective function is iteratively decreased until convergence. A NN is trained using backpropagation, which is an application of the chain rule that yields the derivative of the objective function with respect to each weight. The derivation in section A.1 shows that the

basis of backpropagation is to find the error of each node in some layer i, which is the derivative of the activation times a weighted sum of the errors in the following layer:

$\delta_i = \sigma'(a_i) \sum_j \delta_j u_{ji}$   (2)

where σ is the activation function (σ' its derivative), a_i is the weighted sum of inputs from the previous layer and u_ji is the weight connecting a node in layer i to a node in layer j.

Figure 10: Part of the neural network considered for the derivation of backpropagation. See section A.1 in the appendix for the whole derivation.

The error of the last layer (k) is simply the derivative of the objective function:

$\delta_k = \frac{\partial E}{\partial \hat{y}} = \frac{\partial}{\partial \hat{y}} \| y - \hat{y} \|^2 = -2(y - \hat{y})$   (3)

Equation (2) shows that the error of some layer n depends on layer n + 1. The way backpropagation works is that one initializes the weights randomly, because if the weights were all the same, the nodes in the NN would follow the same gradient and thus become identical. The network is given the first datapoint; each node in a layer receives a weighted signal from the previous layer, and if this signal exceeds some threshold, that node is activated. This way the signal is propagated forward, which allows the network to produce an output in the last layer. That output can be used to compute the δ (the derivative of the objective function) of the last layer, which can be used to compute the δ of the previous layer, and so on. Once one has these δs, the derivative with respect to every weight can be taken:

$\frac{\partial E}{\partial u_{il}} = \delta_i z_l$   (4)

where z_l is the output of node l in the previous layer. A compact numerical sketch of this backward pass and the corresponding weight update is given below.
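The snippet below is a minimal NumPy sketch of equations (2)-(5) for a network with one hidden layer and sigmoid activations; the layer sizes, learning rate and data are arbitrary illustrations rather than settings used in our experiments.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
x = rng.normal(size=3)            # one toy input vector
y = np.array([0.0, 1.0, 0.0])     # one toy target (e.g. a one-hot structure class)

# Random initialization so that the hidden nodes do not stay identical.
U1 = rng.normal(scale=0.1, size=(4, 3))   # weights input -> hidden
U2 = rng.normal(scale=0.1, size=(3, 4))   # weights hidden -> output
lr = 0.1                                  # learning rate

# Forward pass.
a1 = U1 @ x;  z1 = sigmoid(a1)            # hidden activations
a2 = U2 @ z1; y_hat = sigmoid(a2)         # network output

# Backward pass: equation (3) combined with sigma'(a2) = y_hat * (1 - y_hat)
# gives the output-layer deltas; equation (2) gives the hidden-layer deltas.
delta_out = -2.0 * (y - y_hat) * y_hat * (1.0 - y_hat)
delta_hid = z1 * (1.0 - z1) * (U2.T @ delta_out)

# Equation (4): gradients, and equation (5): gradient descent update.
U2 -= lr * np.outer(delta_out, z1)
U1 -= lr * np.outer(delta_hid, x)
```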

Having the derivatives with respect to all of the weights, gradient descent can be performed, in which the weights are updated:

$u_{il} \leftarrow u_{il} - \mathrm{lr} \cdot \frac{\partial E}{\partial u_{il}}$   (5)

where lr is the learning rate. This process is repeated until convergence.

3.3.5 Commonly used PSSP tools utilizing neural networks

In this section we describe three protein secondary structure prediction tools utilizing neural networks: PHD, PSIPRED and JPred.

PHD
PHD is an automatic e-mail server presented by Rost, Sander and Schneider in [35]. The algorithm it is based on is described in detail in [36]. The method incorporates evolutionary information coming from multiple sequence alignments, which serve as inputs to a system of networks consisting of three levels. The first level is a sequence-to-structure net, using sliding windows of 13 residues and classifying each central residue into one of the 3 secondary structure classes (helix, strand and loop). The second level is a structure-to-structure net, taking into account the correlations between consecutive patterns. In this network, a window of 17 basic cells is used, each cell corresponding to the 3 output values for the secondary structure prediction of the central residue. In both of these networks, the target output is the secondary structure of the central residue. 2*2 different architectures were created and trained independently in an effort to reduce noise and improve accuracy, one set trained with a real coding of the sequences and the other with conservation weights added, introducing additional evolutionary information. As a third level, a jury decision step is implemented, which takes the arithmetic averages for alpha, beta and loop based on the outputs coming from the different networks. An outline of the network system is shown in Figure 11. According to the authors, the method achieves a three-state accuracy of 70.8%. The original e-mail server is no longer accessible: when we attempted to submit a protein for prediction to the e-mail address given in [36], we received an error message stating that the e-mail was undeliverable.

PSIPRED
PSIPRED [37] is a cascaded system of two neural networks, used to predict protein secondary structure based on the position-specific scoring matrices generated by PSI-BLAST. First, a sequence profile is created using the profiles PSI-BLAST generates as an intermediate step during its search process.

Figure 11: The outline of the PHD method, as given by Rost and Sander in [36]

As PSI-BLAST is sensitive to biases in the sequence data banks, a custom sequence data bank was created for PSIPRED. The final PSSM, which is a 20*M matrix where M is the length of the target sequence, is used as input to the neural network. In the first neural network, a standard feed-forward architecture with one hidden layer is used, with a window of 15 amino acid residues. The final input layer is made up of 15 groups of 21 units, where the extra unit per amino acid is used to indicate where the window spans either the N or C terminus of the protein. The hidden layer contains 75 units and the output layer includes three units, corresponding to the three secondary structure elements: helix, strand or coil. The second neural network also uses a feed-forward architecture with a window size of 15, and it is used to filter the outputs of the main network. This network comprises 15 groups of 4 input units, where 3 input units correspond to the secondary structure elements and the 4th is used to indicate where the window spans an N or C terminus, similarly to the first network. The hidden layer is made up of 60 units. An outline of the method is shown in Figure 12. In order to train the network, an on-line backpropagation training procedure is used, meaning that the weights are updated each time the network is presented with a pattern. Using a new testing set and three-way cross-validation, the author claims to have achieved an average Q3 score of 75.5% to 78.3%.

JPred4
JPred4 [38] is a secondary structure prediction server providing predictions using the JNet algorithm, presented in [11].

Figure 12: The outline of the PSIPRED method, as given by Jones in [37]

The PSSP algorithm is trained with different types of multiple sequence alignment profiles originating from the same sequences, including a PSIBLAST profile (PSSM), a multiple sequence alignment, an HMMer2 GCG profile and a PSIBLAST profile frequency. The outline of the method is shown in Figure 13. JNet uses a network ensemble consisting of two artificial neural networks. The first one utilizes a sliding window of 17 residues over each amino acid in the multiple sequence alignment and the addition of a conservation number. The network comprises 9 nodes in the hidden layer and 3 nodes in the output layer. The input for the second network is the output from the first network, windowed into 19 residues, and a conservation number. The second network also has 9 nodes in the hidden layer and 3 nodes in the output layer. As Figure 13 shows, each type of alignment data trains a different neural network. At the end, the networks are combined to yield a consensus solution based on the average taken for each predicted state. The positions where the predictions given by all methods were identical ("jury agreement") were taken as final predictions, while the other ("no jury") positions were used to train a separate neural network, and the final predictions were obtained by replacing the original "no jury" positions with these new predictions. Cuff and Barton claim to have achieved an average accuracy of 76.4%.
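As a small illustration of such a consensus step (not the exact JNet jury procedure), the per-state outputs of several networks can be averaged and the highest-scoring state taken as the prediction:

```python
import numpy as np

# Toy per-residue outputs (helix, strand, coil) from three hypothetical networks
# for a two-residue stretch; the numbers are made up for the example.
net_outputs = np.array([
    [[0.7, 0.2, 0.1], [0.3, 0.4, 0.3]],
    [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]],
    [[0.8, 0.1, 0.1], [0.1, 0.6, 0.3]],
])

consensus = net_outputs.mean(axis=0)       # average for each predicted state
states = np.array(["H", "E", "C"])
print(states[consensus.argmax(axis=1)])    # -> ['H' 'E']
```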

Figure 13: The outline of the JPred method, as given by Cuff and Barton in [11]

4 TOOLS AND METHODS

In this chapter we give an overview of the four neural network models we implemented for our thesis.

4.1 Simple neural network (jnn)

Qian and Sejnowski [1] made the first attempt to predict protein secondary structure using a neural network. Their approach serves as the basis for jnn, our multi-layered feed-forward neural network implementation. The objective of the method was to use the information contained in known protein structures in the database to predict the structures of unresolved proteins for which no homologous structures are present. The proposed model's input consisted of 13 groups, where each group had 21 units: 20 units corresponded to one of the amino acids and 1 unit was used as a spacer between the sliding windows. In the sliding window method, a window serves as a training pattern for predicting the structure of the amino acid at the center of the window. Qian and Sejnowski used local (one-hot) encoding to encode their dataset (consisting of 106 proteins), in which, within a group, only the unit corresponding to a particular amino acid was set to 1 and the rest of the units were set to 0. The network was trained using backpropagation with 40 units in the hidden layer, resulting in a Q3 accuracy of 62.7%. Qian and Sejnowski investigated the dependence of the test accuracy on the number of nodes in the hidden layer (0, 3, 5, 7, 10, 15, 20, 30, 40, 60); a neural network with 40 hidden units gave the optimal performance. They then used these 40 hidden nodes to explore the dependence of the network's performance on the size of the window, and observed peak performance when the window size reached 13.
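For illustration, a network of this shape (13 * 21 one-hot input units, 40 hidden units and 3 output classes) can be sketched in Keras as follows; this is a schematic example, under the assumption that Keras is available, and not our exact jnn implementation, whose details are given in Chapter 5.

```python
from keras.models import Sequential
from keras.layers import Dense

window_size, units_per_position = 13, 21   # 20 amino acids + 1 spacer unit

model = Sequential()
# One hidden layer of 40 units, as in Qian and Sejnowski's best-performing network.
model.add(Dense(40, activation="sigmoid",
                input_dim=window_size * units_per_position))
# Three output units: helix, strand and coil.
model.add(Dense(3, activation="softmax"))

model.compile(optimizer="sgd", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
# model.fit(X_train, y_train, epochs=..., batch_size=...) would then be called
# on the one-hot encoded sliding windows and their structure labels.
```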

Figure 14: The neural network architecture described by Qian and Sejnowski [1]

4.2 Using multiple sequence alignments (jsnn)

In this section we describe the approach we used to improve the previous implementation by incorporating multiple sequence alignments. The basic topology of jsnn is similar to the one proposed by Qian and Sejnowski [1]. The input was still a single sequence; however, this sequence was generated by performing a majority vote on a multiple sequence alignment (shown in Figure 15). We applied one-hot encoding to the sequence generated by majority voting and used the sliding window method to generate the input for the network. The input was propagated forward to the output layer via a hidden layer. The output layer had three nodes, for helix, strand and coil. This method gave us a prediction accuracy of 67.8%.
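A minimal sketch of this majority-voting step is shown below; the gap handling and tie-breaking are illustrative choices and may differ from our actual implementation.

```python
from collections import Counter

def majority_vote(alignment):
    # alignment: list of equally long, aligned sequences (strings).
    consensus = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]   # ignore gap characters
        consensus.append(Counter(residues or column).most_common(1)[0][0])
    return "".join(consensus)

alignment = ["MKTAY", "MRTAY", "MKSAY"]   # toy aligned homologues
print(majority_vote(alignment))           # -> "MKTAY"
```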

Figure 15: Majority voting on a multiple sequence alignment

4.3 cascaded neural network (mnn)

Rost and Sander [2] proposed a cascaded neural network, which served as the basis for our mnn implementation. The model makes use of multiple sequence alignments because, unlike a single sequence, a multiple alignment carries more information: amino acid substitutions reflect the protein family's folding properties. Figure 16 shows the architecture of the first neural network of the proposed model. It is a regular sequence-to-structure neural network as proposed by Qian and Sejnowski [1]. The second neural network is a structure-to-structure neural network which refines the structural information obtained from the first neural network. For each protein in the dataset, a set of aligned homologous proteins was created. Instead of feeding a single sequence, an entire alignment was given to the network in the form of a profile table. Each sequence position in the profile table is represented by residue frequencies determined from the alignment. Therefore, the input to the network is a residue frequency vector for each residue in the sequence. The first neural network uses sliding windows of size 13, corresponding to 13 * 21 (273) input units. The input signal is passed through a network with one hidden layer and an output layer with three nodes. The three nodes in the output layer correspond to the three secondary structures: coil, α-helix, and β-sheet. The output values are between 0 and 1. The output from this network serves as the input to the second, structure-to-structure neural network. For the second network, overlapping windows of size 17 are used. In addition to a spacer, the inputs to the second network are three real numbers, each corresponding to one of the three secondary structure elements; therefore, the second network has 17 * 4 (68) input units. Like the first network, the signal is propagated through one hidden layer to the output layer, which consists of three nodes for helix, sheet, and coil.
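A minimal Keras sketch of this cascade is shown below. The input widths (13 * 21 and 17 * 4) follow the description above, while the hidden-layer sizes and activations are placeholders rather than the values used by Rost and Sander or in our mnn.

from keras.models import Sequential
from keras.layers import Dense

# First network: sequence-to-structure. A window of 13 profile columns, each with
# 20 residue frequencies plus a spacer unit, gives 13 * 21 = 273 inputs.
seq_to_struct = Sequential()
seq_to_struct.add(Dense(100, activation='sigmoid', input_shape=(13 * 21,)))   # placeholder size
seq_to_struct.add(Dense(3, activation='softmax'))          # helix, sheet, coil

# Second network: structure-to-structure. A window of 17 positions, each with the
# three state values from the first network plus a spacer unit, gives 17 * 4 = 68 inputs.
struct_to_struct = Sequential()
struct_to_struct.add(Dense(100, activation='sigmoid', input_shape=(17 * 4,)))  # placeholder size
struct_to_struct.add(Dense(3, activation='softmax'))

for net in (seq_to_struct, struct_to_struct):
    net.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy'])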

Using the model described above, the authors claim to have obtained a Q3 accuracy of 69.7%.

Figure 16: Sequence-to-structure neural network architecture described by Rost and Sander [2]. The structure-to-structure neural network is not shown.

4.4 convolutional neural network (snn)

In this section we describe the approach proposed by Liu and Cheng in [3], which serves as the basis for snn, our convolutional neural network implementation. Liu and Cheng propose a 2D convolutional neural network architecture utilising position-specific scoring matrices (PSSMs). The improvement in protein secondary structure prediction comes from the evolutionary information carried by the PSSM. The PSSM was obtained by running the PSIBLAST software with the BLOSUM62 scoring matrix on multiple sequence alignments, giving a two-dimensional matrix of size 20*N, where 20 is the number of different amino acid types and N is the protein length. In order to capture information about the local sequence context of a residue and to predict the secondary structure of the central residue, a consecutive sliding window of length 21 is used. The architecture of the convolutional neural network is given by Figure 17. The first convolutional layer has 96 filters of size 5*5, followed by a max pooling layer of size 2*2 and a second convolutional layer with 24 filters of size 2*2. The features extracted by the second convolutional layer are used as the features for prediction.

Figure 17: The convolutional neural network described by Liu and Cheng in [3].

This layer is followed by a fully connected layer with 3 units, as this approach predicts three classes corresponding to three different secondary structure elements: H (α-helix), E (extended strand) and C (coil). This multi-class classification problem is solved using a softmax classifier. On the widely used benchmark dataset 25PDB, the authors claim to have obtained a Q3 accuracy of 77.7%.
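As a reference for our own snn implementation, the architecture described above can be written down in a few lines of Keras. This is a sketch under the assumption that each training example is a 21 x 20 PSSM window with a single channel; padding and activation details are not specified in [3] and are left at illustrative defaults here.

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

# Each sample: 21 window positions x 20 PSSM columns, one channel.
model = Sequential()
model.add(Conv2D(96, (5, 5), activation='relu', input_shape=(21, 20, 1)))   # first convolutional layer
model.add(MaxPooling2D(pool_size=(2, 2)))                                   # 2*2 max pooling
model.add(Conv2D(24, (2, 2), activation='relu'))                            # second convolutional layer
model.add(Flatten())
model.add(Dense(3, activation='softmax'))   # H, E and C via a softmax classifier
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()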

Part II: PRACTICAL EXPERIMENTS

5 FNN - OUR COMMAND LINE TOOL FOR SECONDARY STRUCTURE PREDICTION

In this chapter we give an overview and usage manual of fnn, the command line tool we created. fnn is a tool for predicting the secondary structure of proteins using their primary sequence, multiple sequence alignment data (in FASTA format), or a position-specific scoring matrix (PSSM).

5.1 dataset

Two datasets were used in our work: the CB513 dataset [11] and the combination of the TMP166 and the SP1441 datasets [12], which we will call TSP1607 from now on. Cuff and Barton [11] used the CB513 dataset to train the Jnet neural networks. This dataset contains 513 non-redundant sequences which we used to train and test our neural network models. The CB513 dataset consists of 396 sequences from the 3Dee database of protein domains, 117 proteins from Rost, and 126 non-redundant proteins. All of the proteins were compared pairwise, and are non-redundant to a 5 standard deviation cut-off. Each file in the CB513 dataset contains secondary structure definitions from the DSSP, DEFINE and STRIDE definition methods. DSSP, DEFINE and STRIDE have 8 categories of secondary structure: G (3-turn helix), H (4-turn helix), I (5-turn helix), T (hydrogen bond turn), E (extended strand in parallel or anti-parallel β-sheet conformation), B (residue in isolated β-bridge), S (bend), and _ or C (coil). For our project, we utilized the widely used DSSP definition and carried out an 8-state to 3-state reduction on the data: H, G and I were translated to H, E and B to S, and all other states to C (represented as a blank space in our data). The TSP1607 dataset was used to train and test our convolutional neural network. This dataset was originally used to develop TMSEG [12], a method to predict transmembrane helices. In addition to the sequences, this dataset has evolutionary information for each of the sequences in the form of Position-Specific Scoring Matrices (PSSMs). These matrices were generated using PSI-BLAST. However, this dataset did not contain secondary structure information. We obtained 3-state structure predictions, helices (H), β-sheets (B), and coil (C), from PSIPRED [37]. PSIPRED took a long time (approximately 2-3 hours) to predict the secondary structure of a given sequence and it only accepted 20 job submissions from a given IP address. Therefore, due to time constraints and the restriction to submit at most

20 jobs at a time, we decided to work with 511 sequences from the TSP1607 dataset. This dataset contains helical transmembrane proteins and short signal peptides, serving as an overall easier target with fewer patterns to learn.

5.2 framework used

Several machine learning and numerical libraries are available, for example TensorFlow, Keras, Theano and NumPy. We used Keras [39] for the construction of our neural networks. Keras is an open source neural network library in Python, capable of running on top of TensorFlow, CNTK or Theano. We used Keras with Theano as the backend. Theano is a Python library allowing the efficient definition, optimization and evaluation of mathematical expressions involving multi-dimensional arrays [40]. Keras is slow compared to other libraries, mostly because it first constructs a computational graph using the backend infrastructure and then uses it to perform operations. We chose it because it is relatively easy to implement a neural network in Keras and it provides useful utilities such as data preprocessing, model compilation, result evaluation and graph visualization.

5.3 implemented algorithms

5.3.1 Parsing

We implemented our own parsers for PSSM and FASTA files. In the case of the secondary structures, we included a three-state generator which performs the 8-state to 3-state reduction of the data as described in section 5.1.

5.3.2 Encoding

As both the input and output sequences are strings, we needed to encode the variables before presenting them to the networks. In the case of the target sequences, each secondary structure symbol is integer-encoded first, forming an N*1 array where N is the length of the protein and the possible values in the array are 0, 1 and 2, corresponding to coil, strand and helix. The elements of this array are then one-hot encoded, yielding an N*3 matrix. Figure 18 shows this encoding process.
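A minimal sketch of this target encoding, using Keras' to_categorical helper, is shown below. The state-to-integer mapping follows the order given above (coil = 0, strand = 1, helix = 2); the one-letter state codes are illustrative.

import numpy as np
from keras.utils import to_categorical

STATES = {'C': 0, 'S': 1, 'H': 2}    # coil, strand, helix

def encode_targets(secondary_structure):
    # Integer-encode the structure string (N*1) and one-hot encode it into an N*3 matrix.
    integers = np.array([STATES[s] for s in secondary_structure])
    return to_categorical(integers, num_classes=3)

print(encode_targets("CHHSC"))
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [1. 0. 0.]]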

5.4 user manual 35 Figure 18: The process of encoding target sequences before presenting them to the neural networks 5.3.3 Neural networks The core functionality of fnn is given by the four neural networks we implemented - jnn, jsnn, mnn and snn, as described in Chapter 4. 5.4 user manual In this section we explain how to use fnn. 5.4.1 System specifications For the tool to work properly, following packages are required: Python 3.6 : https://anaconda.org/anaconda/python Keras : https://anaconda.org/conda-forge/keras Scikit-learn : https://anaconda.org/anaconda/scikit-learn

5.4.2 Getting started

In order to download fnn, you should clone the repository via the commands

git clone https://github.com/sbnoor/fnn.git
cd fnn

Once the above process has completed, you can run

python tool.py -h

to list all of the command-line options. If this command fails, it means that something went wrong during the installation process.

5.4.3 Structure prediction

In this section you will predict the secondary structure of a protein using its primary sequence. We will assume you have already followed the instructions from subsections 5.4.1 and 5.4.2 for downloading Python 3.6, Keras, Scikit-learn and fnn. The following command will allow you to predict the secondary structure of a protein:

python tool.py <file_name>.fasta

Protein structure prediction will take approximately 12 minutes if a single FASTA sequence is given as input and approximately 27 minutes if the input is a multiple sequence alignment. The time is mostly spent on training the neural network. You can also specify which neural network you want to give your data to. In that case you can use one of the following flags:

-j JNN
This flag will run the neural network described by Qian and Sejnowski [1], which is a simple one-hidden-layer feed-forward neural network that requires a single sequence as input. This neural network will run by default if you provide a single FASTA sequence even without the above-mentioned flag. The command is:

python tool.py -j JNN <file_name>.fasta

-js MSA
This flag will run the neural network explained in section 4.2, which is a standard feed-forward neural network that also requires a single sequence as input, but that sequence is generated from a multiple sequence alignment by virtue of majority voting. This neural network is the one that will be used by default if a multiple sequence alignment is provided without a flag. The command is:

python tool.py -js MSA <file_name>.fasta

-m mnn
This flag will predict the secondary structure using a network similar to the one explained by Rost and Sander [2]. It is a cascaded neural network whereby the first neural network is a sequence-to-structure network and the second one is a structure-to-structure neural network. The command is as follows:

python tool.py -m mnn <file_name>.fasta

-s snn
This flag runs a convolutional neural network based on the approach of Liu and Cheng [3]. This neural network will run by default if the user enters a PSSM as input. The command to be entered is:

python tool.py -s snn <file_name>.pssm

You can also have the prediction written to a text file using the -o flag:

python tool.py -o <file_name>.fasta

6 EXPERIMENTS

In this chapter we discuss the experiments conducted to find the optimal neural networks for PSSP. Our search for the optimal parameters differed between the models; therefore, the experiments for each model will be explained separately. That being said, we generally tweaked the same set of parameters in each model, although the steps might differ between them. We experimented with the following parameters:

- Number of nodes in a hidden layer
- Batch size
- L1 and L2 norm regularization
- Dropout
- Adam or Stochastic Gradient Descent (SGD) optimizer
- Number of hidden layers

A major drawback of using an artificial neural network is that it takes a long time to converge to a minimum compared to other machine learning techniques. There are various hyperparameters, such as the number of nodes, the number of hidden layers, the type of optimizer and the type of regularization, that can be tuned to optimize a neural network. Scikit-learn has a GridSearchCV class that lets the user specify different hyperparameter values and in turn returns the best-performing combination. However, GridSearchCV is a brute-force method that reruns the base model with every combination of parameters the user specifies. GridSearchCV can take a lot of time to run if a large number of hyperparameters is to be tested and/or the model in question is generally slow. Therefore, it was not feasible for us to do a complete grid search to find the best combination of hyperparameters, because we had a large number of hyperparameter combinations to test, coupled with the fact that training a neural network is a time-consuming process overall, thus making the entire procedure really slow.

6.1 workbench

All experiments were conducted on the GenomeDK HPC cluster and two regular laptops. The specifications of the laptops and the cluster are as follows:

- GenomeDK HPC Cluster: 190 nodes (3384 compute cores) connected with 10GigE/InfiniBand, each node having 16 to 32 cores and either 64 GB, 128 GB, 256 GB, 512 GB or 1 TB of RAM, with 3.5 PB of storage capacity
- MacBook Pro Retina, 13-inch, with macOS High Sierra (version 10.13.14), a 3.1 GHz Intel Core i7 processor and 16 GB of memory
- MacBook Pro Retina, 15-inch, with macOS High Sierra (version 10.13.3), a 2.3 GHz Intel Core i7 processor and 8 GB of memory

We used Python 3.6 to implement the neural networks. We made sure that all other programs were closed while conducting the experiments.

6.2 preliminaries

In this section we briefly explain terms that will be used repeatedly in the next section (6.3).

6.2.1 Batch size

The batch size refers to the number of training examples given to a neural network at a time. The dataset is divided into such batches/chunks because the entire dataset cannot be given to the neural network at once.

6.2.2 Epochs

One epoch is one forward and backward propagation of the whole dataset through the neural network. One epoch is never enough to train a neural network; instead, we pass the complete dataset through the network multiple times. This is necessary because the data is limited and the neural network is trained using gradient descent, an iterative process in which the weights are updated step by step.

6.2.3 Dropout regularization

Dropout is a technique in which randomly selected nodes are dropped/ignored during training. This means that the ignored nodes do not contribute to the signal for the next layer in the forward pass, and no weight updates are applied to them in backpropagation.

6.2.4 L1/L2 regularization

L1 and L2 regularization are the most common regularization techniques. Both techniques add a penalty to the objective function for each additional coefficient added to the model. What makes them different is the way they apply these penalties. L2 regularization adds the squared magnitude ½λw² of every weight in the NN to the objective function. L1 regularization, on the other hand, adds the absolute magnitude of the weight, λ|w|, as a penalty to the objective function.

6.2.5 Q3 score

Q3 is the most common performance measure used in PSSP. It is the percentage of correctly predicted residues for a sequence:

Q_3 = \frac{\sum_{i \in \{H,S,C\}} \text{correctly predicted}_i}{\sum_{i \in \{H,S,C\}} \text{observed}_i} \times 100 \qquad (6)

6.3 experiments

In this section we give a detailed overview of the experiments we did to obtain the optimal hyperparameters for jnn, jsnn, mnn and snn. In each case, we present cross-validation results to evaluate model performance using the network with the optimal hyperparameters, and we finish each implementation's experiment section by testing the prediction accuracy of the optimal network on unseen data.

6.3.1 JNN

We started our experiments by investigating different numbers of nodes (10, 20, 30, ..., 200) in the hidden layer and batch sizes (100, 200, 300, 400, 500) for overlapping window sizes 13, 17 and 21. We then fine-tuned the batch size and the number of nodes by doing a local search around the optimal batch size and number of nodes found in the global search. After that we experimented with regularization methods (L1 and L2) in addition to a dropout layer (with a dropout rate of 0.2) and optimizers (Adam and SGD). Finally, we experimented with the number of hidden layers to see whether we could observe any improvements in the implemented model. Figure 19 summarizes the steps we followed to find the optimal hyperparameters for jnn.
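For reference, equation (6) and the per-state scores reported in the tables below translate into a few lines of Python. This is a sketch that assumes the predicted and observed structures are equal-length strings over {H, S, C}.

def q3(predicted, observed):
    # Equation (6): percentage of residues whose three-state label is predicted correctly.
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)

def q3_state(predicted, observed, state):
    # Per-state accuracy (e.g. Q3H): correctly predicted residues among those observed in `state`.
    pairs = [(p, o) for p, o in zip(predicted, observed) if o == state]
    if not pairs:
        return 0.0
    return 100.0 * sum(p == o for p, o in pairs) / len(pairs)

print(q3("HHHCC", "HHCCC"))             # 80.0
print(q3_state("HHHCC", "HHCCC", "H"))  # 100.0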

Figure 19: Steps followed to find the optimal set of hyperparameters for jnn and jsnn.

6.3.1.1 Number of nodes and batch size

First, we determined the optimal number of nodes and batch size (100, 200, 300, 400, 500) in jnn. For each batch size we considered 20 different numbers of nodes in the hidden layer (10, 20, 30, ..., 200). We ran these experiments using three different window sizes: 13, 17, and 21. Figure 20 shows the validation accuracies obtained after training and validating the network using the different batch sizes. The network trained with a batch size of 100 gave far superior results compared to the neural networks trained with the other batch sizes.
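The global search is a plain nested loop over batch sizes and hidden-layer sizes. The sketch below illustrates the idea for one window size; build_jnn mirrors the architecture sketched in Chapter 4, and X_train, y_train, X_val and y_val stand for the encoded window and label arrays, which are assumed to be prepared already.

from keras.models import Sequential
from keras.layers import Dense

def build_jnn(n_hidden, window=17):
    model = Sequential()
    model.add(Dense(n_hidden, activation='relu', input_shape=(window * 21,)))
    model.add(Dense(3, activation='softmax'))
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    return model

# X_train, y_train, X_val, y_val: encoded windows and one-hot labels (assumed prepared).
results = {}
for batch_size in (100, 200, 300, 400, 500):
    for n_hidden in range(10, 201, 10):
        model = build_jnn(n_hidden)
        history = model.fit(X_train, y_train, validation_data=(X_val, y_val),
                            epochs=50, batch_size=batch_size, verbose=0)
        results[(batch_size, n_hidden)] = max(history.history['val_acc'])

best = max(results, key=results.get)
print("best (batch size, hidden nodes):", best)

With the Keras version we used, the validation accuracy is stored under the 'val_acc' key; newer Keras releases use 'val_accuracy' instead.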

Figure 20: Validation accuracies observed using five different batch sizes and twenty different numbers of nodes in the hidden layer of jnn. Window sizes 13, 17 and 21 were considered.

Table 1: Mean validation accuracies for window sizes 13, 17 and 21 after experimenting with the number of nodes and batch size in jnn.

Window size    Mean validation accuracy (%)
13             61.3
17             63.4
21             61.9

On average, we obtained the best accuracies with window size 17 as shown by Table 1; therefore, we are going to concentrate on this window size in our discussion. Because we got peak performance using a batch size of 100, we decided to do a local search around it and considered batch sizes of 50, 60, 70, ..., 150 with the same numbers of nodes as before. Figure 21 shows that neural networks trained with batch sizes 50 and 60 gave better results overall. Therefore, we decided to test regularization methods using batch sizes 50 and 60. For both batch sizes, we chose the node numbers corresponding to the peaks (110 and 190, respectively), and we also kept other node numbers giving slightly, but not considerably, smaller validation accuracies (180 with batch size 50, and 130 and 180 with batch size 60). Table 2 summarizes the hyperparameters selected so far.

Figure 21: Validation accuracies obtained after doing a local search around the batch size of 100 in jnn.

Table 2: Hyperparameters chosen to find the optimal regularization method and optimizer

Window size    Batch size    Number of nodes
17             50            110, 180
17             60            130, 180, 190

6.3.1.2 Regularization and optimizer

Figure 22 shows that without regularization and early stopping, the network started overfitting the data after 30 epochs: the training accuracy kept increasing, whereas the validation accuracy flattened out.

Figure 22: Accuracy and loss plot for jnn. The neural network was trained without regularization.
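Early stopping, mentioned above, is one simple way to halt training once the validation curve flattens out. A minimal sketch with Keras' EarlyStopping callback is shown below; the patience value is an illustrative choice, and `model`, `X_train`, `y_train`, `X_val` and `y_val` stand for the compiled jnn network and the prepared data arrays.

from keras.callbacks import EarlyStopping

# Stop training once the validation loss has not improved for 5 consecutive epochs.
early_stop = EarlyStopping(monitor='val_loss', patience=5)

model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          epochs=100, batch_size=50,
          callbacks=[early_stop])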

46 experiments In order to overcome the problem of overfitting, we decided to explore L1 and L2 regularization methods in addition to a dropout layer (with a dropout rate of 0.2) and optimizers (Adam and SGD) on the neural network. We tested these parameters using the hyperparameters mentioned in Table 2. Table 22 (in appendix) summarizes the results obtained from the experiments we conducted in order to optimize the performance of our neural network. Regardless of the number of nodes in the hidden layer and batch size, L2 regularization gave higher validation accuracies with both Adam and SGD. However, the gap between the training and validation accuracies and losses was bigger. In contrast, L1 regularizer together with Adam optimizer gave lower accuracies but the gap between the training and validation accuracy and loss curves was small thus implying good generalization properties (see Figure 23). This is not what we expected to see. We know that L2 regularization works better in practice because it has a stable and analytical solution. However, since L1 regularization with Adam optimizer gave us better results, we proceeded with this set of hyperparameters. We believe this could be due to the greedy approach through which we are trying to find the optimal set of hyperparameters. We do not rule out the possibility that training the network using L2 regularizer and a different set of hyperparameters may yield a better result. Furthermore, when we used SGD optimizer with L1 regularization, the model did not learn anything. Moreover, using L1 regularizer and a dropout layer resulted in underfitting. Using L2 regularizer and a dropout layer gave us a higher validation accuracy, however, there were fluctuations in the learning curve due to which we decided not to use these hyperparameters in our neural network. We believe it is because jnn is a simple feed forward neural network with one hidden layer and therefore, does not have to be that strongly regularized.
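The configuration we proceeded with (window size 17, batch size 50, 110 hidden nodes, L1 regularization and the Adam optimizer) can be expressed as the following sketch; the regularization factor of 0.001 and the activation functions are illustrative assumptions, as the exact values are not restated here.

from keras.models import Sequential
from keras.layers import Dense
from keras.regularizers import l1

model = Sequential()
model.add(Dense(110, activation='relu', input_shape=(17 * 21,),
                kernel_regularizer=l1(0.001)))   # L1 penalty on the hidden-layer weights
model.add(Dense(3, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=100, batch_size=50)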

Figure 23: Accuracy and loss plots of jnn with the L1 regularizer and Adam optimization.

Table 3 summarizes the hyperparameters we decided to proceed with to experiment with the number of hidden layers in jnn.

Table 3: Hyperparameters chosen to investigate the effect of the number of hidden layers on jnn

Window size    Batch size    Regularization    Optimizer    Number of nodes
17             50            L1                Adam         110

6.3.1.3 Number of hidden layers

Figure 24 suggests that adding more hidden layers did not enable the model to learn anything - instead, the model seems to suffer from the vanishing gradient problem, meaning that as more layers are added, the

gradient becomes so small that the weights can no longer be updated during backpropagation.

Figure 24: Validation and training accuracies obtained by iteratively increasing the number of hidden layers in jnn.

6.3.1.4 Performance of the neural network

Table 4 shows the final architecture of jnn. We then cross-validated our model using k-fold cross-validation where k ∈ {2, 3, ..., 15}.

Table 4: Final architecture of jnn

Window size    Batch size    Regularization    Optimizer    Number of nodes    Number of hidden layers
17             50            L1                Adam         110                1

Figure 25 shows that the performance of the neural network remained more or less the same regardless of the number of folds used. We got the highest cross-validation accuracy (62.2%) using 6 folds. However, 5 folds and 10 folds gave us accuracies of 62.1% and 61.9%, respectively. These accuracies are not considerably inferior to what 6-fold cross-validation gave us. Since 5-fold and 10-fold cross-validation are the most used in practice, we decided to use 5-fold cross-validation

since it is computationally less expensive than 10-fold cross-validation. Lastly, since the cross-validation accuracies did not change considerably, we are led to the conclusion that the model we created is robust and will generalize well to unseen data.

Figure 25: Final cross-validation accuracies of jnn

6.3.1.5 Prediction accuracy

We used the most common measure, Q3, to determine the prediction accuracy of jnn. Table 5 summarizes the test accuracies obtained when we gave the trained jnn different sequences it had not seen before. Based on these results, jnn can predict the secondary structure of sequences with an average accuracy of 62.6%, which is comparable to what the authors [1] reported (64.3%).

6.3.1.6 Conclusion

The final validation accuracy we obtained was 62.1%, which is comparable to what Qian and Sejnowski reported (64.3%). The difference in accuracies could be due to the difference in datasets. Our hyperparameters were also different from the authors'. Firstly, they used a window size of 13, whereas our neural network uses overlapping windows of size 17. Moreover, they used 40 nodes in the hidden layer and we had to use 110 nodes. Other hyperparameters, such as the batch size and regularization methods, were not mentioned in the article.
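The 5-fold cross-validation itself was a simple loop; the sketch below illustrates it with scikit-learn's KFold, reusing the (assumed) build_jnn factory and encoded arrays X and y from the earlier sketches.

import numpy as np
from sklearn.model_selection import KFold

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, val_idx in kfold.split(X):
    model = build_jnn(110)                       # fresh model for every fold
    model.fit(X[train_idx], y[train_idx], epochs=50, batch_size=50, verbose=0)
    _, accuracy = model.evaluate(X[val_idx], y[val_idx], verbose=0)
    scores.append(accuracy)

print("mean cross-validation accuracy: %.3f" % np.mean(scores))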

Table 5: Q3 accuracy of jnn (trained on the CB513 dataset) for each test sequence

Test sequence    Number of residues    Q3C(%)    Q3H(%)    Q3S(%)    Q3(%)
VTC1_YEAST       129                   77.6      85.8      0.0       66.7
VPS55_SCHPO      128                   50.8      74.5      0.0       34.8
Y1176_CORGB      378                   82.9      28.5      69.7      64.7
XLF1_SCHPO       203                   68.9      45.7      31.6      51.9
XENO_XENLA       81                    60.7      96.5      0.0       69.9
WOX3_ORYSJ       203                   97.5      63.8      32.6      74.5
XKR3_HUMAN       459                   79.7      74.5      19.8      62.7
Y1796_SYNY3      201                   86.5      58.6      57.6      68.9
Y2070_CORGB      211                   87.6      52.6      60.6      72.6
VSTM1_HUMAN      236                   95.8      19.7      55.9      59.6

6.3.1.7 Testing jnn on the TSP1607 dataset

We ran jnn on the TSP1607 dataset to see whether we would get similar results as before. However, the TSP1607 dataset gave us a higher validation accuracy (shown in Figure 26): jnn reached a validation accuracy of 70%, which is almost 7% higher than what Qian and Sejnowski [1] reported. We did not expect our validation accuracy to be this different from what the authors obtained. We believe this could be due to noise in the CB513 dataset: CB513 had certain residues for which there were no secondary structure assignments, whereas the TSP1607 dataset did not have such problems. Secondly, the TSP1607 dataset is an easier dataset and has fewer patterns compared to the CB513 dataset.

6.3 experiments 51 Figure 26: Validation accuracy and loss obtained by jnn using the TSP1607 dataset. Table 6 summarizes the Q 3 score obtained for 10 test sequences after jnn was trained on the TSP1607 dataset. The average accuracy obtained is 75.8% which is considerably higher than what we got when we trained jnn with the CB513 dataset. Therefore, having less noise in the dataset could explain both the higher validation accuracy and the higher average Q 3 score than the ones reported. However, this needs to be looked into in more detail to be certain why training our model on the TSP1607 dataset yielded a better validation accuracy.

Table 6: Q3 accuracy of jnn (trained on the TSP1607 dataset) for each test sequence

Test sequence    Number of residues    Q3C(%)    Q3H(%)    Q3S(%)    Q3(%)
VTC1_YEAST       129                   75.6      93.4      0.0       81.8
VPS55_SCHPO      128                   53.8      89.6      0.0       72.6
Y1176_CORGB      378                   78.9      34.7      88.6      68.5
XLF1_SCHPO       203                   66.4      47.5      49.5      59.5
XENO_XENLA       81                    88.7      95.6      100       85.8
WOX3_ORYSJ       203                   97.9      80.8      74.8      89.9
XKR3_HUMAN       459                   85.5      82.9      69.6      80.8
Y1796_SYNY3      201                   82.7      65.5      87.9      76.5
Y2070_CORGB      211                   79.6      59.8      60.7      72.8
VSTM1_HUMAN      236                   91.8      19.6      87.5      69.5

6.3.2 JSNN

First, we experimented with different numbers of nodes in the hidden layer and different batch sizes. We carried out these experiments using overlapping windows of sizes 13, 17 and 21. Through this process we were able to find the optimal combination of window size, batch size, and number of hidden nodes, after which we investigated the performance of the neural network using different regularization methods (L1 and L2 regularization and/or a dropout layer) in addition to the optimizers. Next, we experimented with introducing more hidden layers in jsnn.

Figure 27: Validation accuracies observed using five different batch sizes in jsnn. Window sizes of 13, 17 and 21 were considered.

Table 7: Mean validation accuracies for window sizes 13, 17 and 21 after experimenting with the number of nodes and batch size in jsnn.

Window size    Mean validation accuracy (%)
13             63.7
17             64.5
21             64.2

On average, we obtained the best accuracies with window sizes 17 and 21 as shown by Table 7; therefore, we are going to concentrate on these window sizes in our discussion. Next, we did a local search around batch size 100 and explored batch sizes from 50 to 190. Figure 28 indicates that in the case of window size 17, the neural network's performance peaked using a batch size of 90 and 190 nodes in the hidden layer, whereas a batch size of 100 gave the highest validation accuracy with 180 hidden nodes when using overlapping windows of size 21.

Figure 28: Validation accuracy using different batch sizes for window sizes of 17 and 21 in jsnn.

Table 8 summarizes the hyperparameters we decided to use to investigate the regularization methods (explained in 6.3.2.2).

Table 8: Hyperparameters chosen to find the optimal regularization method and optimizer for jsnn

Window size    Batch size    Number of nodes
17             90            70
17             100           190
21             100           90, 170, 180

6.3.2.2 Regularization and optimizer

To ensure that the neural network neither overfits nor underfits, we explored a combination of regularization methods (L2/L1 and a dropout layer) in addition to two optimizers: Adam and SGD. Table 23 (in appendix) summarizes the regularizers and optimizers we investigated using the hyperparameters found in section 6.3.2.1. It shows that the neural network using window size 21, 180 nodes in the hidden layer and batch size 100 gave the highest validation accuracy when using the L2 regularizer with a dropout layer and the Adam optimizer. Figure 29 shows that there was a small gap between the training and validation accuracies and a minimal gap between the training and validation losses, which implies good generalization properties. Furthermore, similarly to jnn, when we used SGD with L1 regularization, the test and training accuracy remained constant and the network did not learn any patterns in the dataset. The L1 regularized network with the addition of a dropout layer and the Adam optimizer resulted in underfitting. The L2 regularized network with the Adam optimizer resulted in a wider gap between the training and validation accuracies and losses. Unlike jnn, using a combination of L2 regularization and a dropout layer with a dropout rate of 0.2 gave superior results. We believe that in jnn we had simpler and less noisy data; therefore, soft regularization was able to keep the balance between overfitting and underfitting. However, in the case of jsnn, the data was not as clean. We assumed that the secondary structure of the majority-voted sequence would be the same as the secondary structure of the reference sequence. We believe that this overly simplistic assumption introduced some variation into the dataset, which is why adding a dropout layer gave relatively better results.

Figure 29: Accuracy and loss plots for jsnn using the L2 regularizer with a dropout layer and the Adam optimizer.

6.3.2.3 Number of hidden layers

Figure 30 shows that adding more hidden layers did not improve the performance of the neural network. With a two-hidden-layer feed-forward neural network, the validation accuracy was similar to the one obtained from the one-hidden-layer neural network, which means that adding an extra hidden layer did not enable the network to learn more patterns in the training set. Moreover, as the number of hidden layers increased, the validation and training accuracies kept decreasing. With 6 hidden layers or more, the neural network suffered from vanishing gradients, as both the validation and training accuracy stopped changing.
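Varying the depth of the network only requires a loop that stacks additional identical hidden layers; the sketch below shows the idea for jsnn, with the input width, activations and training settings as assumptions rather than the exact values used.

from keras.models import Sequential
from keras.layers import Dense

def build_deep_jsnn(n_hidden_layers, n_nodes=180, window=21):
    # Feed-forward network with a configurable number of identical hidden layers.
    model = Sequential()
    model.add(Dense(n_nodes, activation='relu', input_shape=(window * 21,)))
    for _ in range(n_hidden_layers - 1):
        model.add(Dense(n_nodes, activation='relu'))
    model.add(Dense(3, activation='softmax'))
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    return model

# Train networks with 1 to 8 hidden layers and record the best validation accuracy of each.
for depth in range(1, 9):
    model = build_deep_jsnn(depth)
    history = model.fit(X_train, y_train, validation_data=(X_val, y_val),
                        epochs=50, batch_size=100, verbose=0)
    print(depth, "hidden layer(s): %.3f" % max(history.history['val_acc']))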

Figure 30: Validation accuracy and test accuracy obtained by iteratively increasing the number of hidden layers in jsnn.

6.3.2.4 Performance of the neural network

Table 9 shows the final architecture of jsnn. We cross-validated our model using k-fold cross-validation where k ∈ {2, 3, ..., 15}.

Table 9: Final architecture of jsnn

Window size    Batch size    Regularization    Optimizer    Number of nodes    Number of hidden layers
21             100           L2 + dropout      Adam         180                1

Figure 31 shows that the performance of our neural network remained roughly the same regardless of the number of folds we used. We got the highest cross-validation accuracy (66.7%) using 14 folds. However, 5 folds and 10 folds gave us accuracies of 66.2% and 66.5%, respectively, which are not considerably different from what 14-fold cross-validation gave us. As mentioned previously, 5-fold and 10-fold cross-validation are widely used in practice; therefore, we decided to use 5-fold cross-validation since it is computationally less expensive than 10-fold cross-validation. Lastly, our cross-validation accuracies did not change considerably, which means that the model we created is robust and generalizes well to unseen data.

Figure 31: Validation accuracy and test accuracy observed after performing k-fold cross-validation on jsnn.

6.3.2.5 Prediction accuracy

Table 10 summarizes the Q3 scores we obtained when we tested jsnn on sequences that the model had not seen before. We got an average Q3 accuracy of 67.7%, which is 5.1% higher than what we got using jnn in subsection 6.3.1.5. Table 11 summarizes the results we obtained when we ran jnn on the same set of sequences we tested jsnn on. On these sequences jnn gave us an average prediction accuracy of 63%, which is 4.7% lower than jsnn.

Table 10: Q3 accuracy of jsnn (trained by virtue of majority voting) for each test sequence

Test sequence      Number of residues    Q3C(%)    Q3H(%)    Q3S(%)    Q3(%)
1comc-1-DOMAK      119                   68.6      59.8      92.7      67.8
1bmv1              185                   69.8      29.7      85.9      63.9
1krca-1-AUTO.1     100                   41.8      82.5      49.7      58.5
1qbb-3-AUTO.1      483                   67.3      81.9      36.5      66.8
1add-1-AS          349                   61.5      80.7      71.5      70.7
1mla-2-AS.1        70                    49.8      80.5      90.7      70.6
1cei-1-GJB         95                    72.9      100.0     0.0       73.0
1bds               43                    74.0      49.6      100.0     76.7
1alkb-1-AS         449                   60.7      69.8      54.9      61.8
1bsdb-1-DOMAK      107                   71.5      64.9      56.7      66.7

Table 11: Q3 accuracy of jnn for each test sequence

Test sequence      Number of residues    Q3C(%)    Q3H(%)    Q3S(%)    Q3(%)
1comc-1-DOMAK      119                   65.7      59.6      78.9      59.8
1bmv1              185                   72.0      29.6      86.7      63.9
1krca-1-AUTO.1     100                   40.1      82.7      69.7      62.5
1qbb-3-AUTO.1      483                   63.8      66.7      30.6      59.7
1add-1-AS          349                   57.8      80.7      44.9      62.7
1mla-2-AS.1        70                    51.6      73.6      90.7      67.8
1cei-1-GJB         95                    76.7      90.1      0.0       65.6
1bds               43                    74.5      22.9      100       62.9
1alkb-1-AS         449                   60.6      57.8      47.8      56.7
1bsdb-1-DOMAK      107                   69.0      74.6      59.5      67.8

6.3 experiments 61 Figure 32: Steps followed to find optimal set of hyperparameters for mnn. 6.3.2.6 Conclusion The overall performance of jsnn was better than jnn which shows that multiple sequence alignments contain more information, therefore, yield better predictions compared to when a single sequence is fed to a neural network. 6.3.3 MNN In the case of the cascaded neural network mnn, we started our experiments examining different numbers of nodes in the first and second neural networks, as well as the batch size for overlapping window sizes of 7, 13, 17 and 21. After that, we explored different regularization methods (L1 and L2 norm, a combination of L1 and L2 and the addition of dropout layers with a dropout rate of 0.2) as well as the choice of optimizers (Adam or SGD). Finally, we experimented with the addition of multiple hidden layers (2 to 8). 6.3.3.1 Number of nodes and batch size First, we determined the optimal number of nodes in the first and second neural networks (10, 50, 100, 200, 300, 400, 500 in each) as well as the optimal batch size (10, 50, 100, 200, 300, 400, 500), testing

all combinations for overlapping windows of sizes 7, 13, 17 and 21. On average, we obtained the best validation accuracies with window size 7 as shown in Table 12; therefore, we are going to concentrate on this window size in our discussion. A window of size 7 is quite small compared to the window sizes we found in the literature, whether we look strictly at the basis of our mnn implementation, the Rost and Sander approach presented in [2] using a window size of 13, or the other approaches presented in Chapters 3 and 4. We are uncertain about the specifics as to why decreasing the window size seemingly increases the validation accuracy of our model. We suspect that it might be due to the noisy nature of the CB513 dataset, where certain residues do not have structure assignments and the multiple sequence alignments might also carry alignment errors. Therefore, it is possible that a smaller window captures all the necessary information, while increasing the window size would introduce noise and contamination overpowering the valuable information content, thus contributing to a deteriorating validation accuracy. Based on this consideration and because we trust our experiments, we continue to focus on window size 7. We also observed that the addition of the second, structure-to-structure network indeed brings an improvement as seen in Table 12 (2.98% on average). This increase can be explained by the second network eliminating predictions that are biologically less plausible, such as segments that are too short to form a helix in nature but that the first network nevertheless predicted as helix. We aggregated the raw data according to the number of nodes in the first and second neural networks and the batch size to determine the optimal set of hyperparameters.

Table 12: Mean validation accuracies for window sizes 7, 13, 17 and 21 after experimenting with the number of nodes and batch size in mnn. Validation accuracy 1 indicates the results obtained with only the first neural network and validation accuracy 2 indicates the total accuracy obtained from the entire cascaded system.

Window size    Validation accuracy 1 (%)    Validation accuracy 2 (%)
7              61.8                         64.5
13             56.3                         60.0
17             54.6                         57.4
21             52.9                         55.6

Figures 33, 34 and 35 show that in the case of window size 7, mnn provides a balanced performance and the tested hyperparameters only cause small changes in the validation accuracy. Because of this observation, we decided to omit the local search step we performed in the case of jnn and jsnn, as we believe that it would not give us considerable improvements. We decided to use 10 nodes in

Figure 33: Validation accuracy using different numbers of nodes in the first neural network of mnn.

Figure 34: Validation accuracy using different numbers of nodes in the second neural network of mnn.

the first neural network, 100 nodes in the second neural network and a batch size of 200, as this combination gave us the second highest validation accuracy (65.4%) and a reasonably small gap between the training and validation accuracies and losses, as shown in Figure 36. The best result was given by a combination of 10 and 500 nodes in the first and second neural networks and a batch size of 500; however, this only yields a 0.1% improvement over the chosen combination, corresponding to a validation accuracy of 65.5%, so we concluded that it seems unnecessary to increase the number of nodes by 400 for such a small improvement.

Figure 35: Validation accuracy using different batch sizes in mnn.

Figure 36: Accuracy and loss plots of mnn using 10 and 100 nodes in the first and second neural networks and a batch size of 200.

6.3.3.2 Regularization and optimizers

In the next step, we explored two optimizers, Adam and SGD, as well as different regularization options: the L1 and L2 norms, a combination of L1 and L2, and the addition of dropout layers with a dropout rate of 0.2. All combinations were tested on a network having 10 and 100 nodes in the first and second neural networks, respectively, and a batch size of 200. Table 13 shows that mnn severely underfits when using SGD, and that choosing Adam with the addition of dropout layers gives superior results. We also observed that the network does not learn when utilizing the L1, L2 or L1L2 regularizers, which might be due to the less noisy input profile emerging as a result of the small window size. Therefore, the addition of dropout layers provides

enough regularization to keep the balance between underfitting and overfitting, while the more stringent L1, L2 and L1L2 regularizers diminish our model's flexibility.

Table 13: The effect of the choice of regularizer and optimizer on the validation accuracy in mnn.

Regularizer       Optimizer    Validation accuracy (%)
none              SGD          44.0
none              Adam         65.1
Dropout (0.2)     SGD          43.7
Dropout (0.2)     Adam         65.7
L1                SGD          43.7
L1                Adam         43.7
L2                SGD          43.7
L2                Adam         43.7
L1L2              SGD          43.7
L1L2              Adam         43.7

Figure 37 shows that we were able to obtain reasonable accuracy and loss plots using this combination of optimizer and regularizer, and comparing this figure to Figure 36 also shows that the gap between the training and validation accuracy curves narrowed, suggesting an improvement in the generalization properties of the model.

6.3 experiments 67 Figure 37: Accuracy and loss plots of mnn using Adam optimizer and one dropout layer in each network with a dropout rate of 0.2. 6.3.3.3 Number of hidden layers Using the optimal hyperparameters found so far, we experimented with the addition of more hidden layers (2-8), each having the same parameters as the first one. Figure 38 shows that adding more hidden layers does not improve the performance of mnn - the training accuracy slightly increases when adding 3 or 4, however, the validation accuracy keeps decreasing. Based on these results, we decided to keep using one hidden layer.

Figure 38: Training and validation accuracy using different numbers of hidden layers in mnn.

6.3.3.4 Performance of the neural network

Table 14 shows the final architecture of mnn. We cross-validated our model using k-fold cross-validation where k ∈ {2, 3, ..., 15}.

Table 14: Final hyperparameters chosen to be used in mnn.

Window size    Nodes in the first and second network    Batch size    Regularizer            Optimizer    Number of hidden layers
7              10, 100                                  200           Dropout (rate: 0.2)    Adam         1

6.3 experiments 69 Figure 39: Validation accuracy of mnn using different number of folds in K-fold cross-validation. Figure 39 shows that regardless of the number of folds used, the performance of our model remained roughly the same. The highest validation accuracy, 66.5% was obtained using 3 folds, however, because 5-fold and 10-fold cross-validation are the most widely used in practice, we decided to use 5-fold cross-validation which gave us a validation accuracy of 66.2%, a result not considerably inferior to the one obtained using 3 folds. As our cross-validation results did not vary substantially, we conclude that our model is robust and generalizes well to new and unseen data.

Table 15: Q3 accuracy of mnn for each test sequence

Test sequence      Number of residues    Q3H(%)    Q3S(%)    Q3C(%)    Q3(%)
1add-1-AS          349                   65.4      21.6      79.8      63.9
1alkb-1-AS         449                   62.2      32.6      87.5      66.1
1bds               43                    0.0       9.1       90.6      69.7
1bmv1              185                   54.6      61.5      75.0      68.1
1bsdb-1-DOMAK      107                   50.0      28.0      74.1      57.9
1cei-1-GJB         95                    57.1      0.0       94.1      70.6
1comc-1-DOMAK      119                   81.1      66.7      67.4      71.4
1krca-1-AUTO.1     100                   80.0      44.4      55.6      67.0
1mla-2-AS.1        70                    100.0     30.8      70.0      65.7
1qbb-3-AUTO.1      483                   63.7      17.1      79.5      64.6

6.3.3.5 Prediction accuracy

We presented mnn with the same test sequences that we used to evaluate the prediction accuracy of jsnn; Table 15 shows the Q3 scores obtained. We got an average Q3 accuracy of 66.5%, which is 1.3% lower than the results we got from jsnn.

6.3.3.6 Conclusion

The results obtained from mnn and jsnn are comparable, which is expected, as both algorithms incorporate the idea of using multiple sequence alignment data in an attempt to increase the amount of information the neural network can learn about the sequence-structure relationship. Based on our experiments, we conclude that mnn, despite its more complex architecture, is slightly inferior to jsnn in terms of results; however, we do not rule out the possibility that training the networks on a different dataset or using hyperparameters we have not explored may result in mnn surpassing the performance of jsnn.

6.3.4 SNN

First, we explored different regularization choices for the convolutional neural network snn: L1, L2, a combination of L1 and L2, as well as the addition of dropout layers with a dropout rate of 0.2. After finding the best regularization method, we experimented with the number of filters in the first and second convolutional layers (20, 60, 96, 140 and 10, 24, 50, respectively) and the batch size (100, 500, 1000). In the next step, we examined the effect of different filter sizes in the

first and second convolutional layers (2*2, 5*5, 10*10), as well as the optimizer (SGD or Adam) used when compiling the model.

Figure 40: Steps followed to find the optimal set of hyperparameters for snn

6.3.4.1 Regularization

We decided to start our experiments with the exploration of regularizers because our model was severely overfitting before applying regularization. Figure 41 shows that the validation accuracy slightly decreases and the validation loss increases with the epochs, which indicates that our model generalizes poorly to unseen data.

Figure 41: Accuracy and loss plots of snn without regularization

These experiments were carried out using window sizes 13, 17 and 21. We found that window size 21 yielded the best accuracies on average, as Table 16 shows; therefore, we are going to report our results for this window size. The results for window sizes 13 and 17 can be seen in Table 24 (in the appendix).

Table 16: Mean validation accuracy for window sizes 13, 17 and 21 after the regularization experiments using snn.

Window size    Mean validation accuracy (%)
13             81.6
17             81.8
21             82.1

The hyperparameters described in [41] were used during this experiment (96 5*5 filters in the first convolutional layer, a 2*2 max-pooling layer, and 24 2*2 filters in the second convolutional layer). We used a batch size of 500 and ran 100 epochs in each case. Table 17 summarizes the regularizers we experimented with and the accuracies we obtained.

Table 17: The effect of different regularization methods on the accuracy of snn

Regularizer              Training accuracy (%)    Validation accuracy (%)
None                     99.5                     79.6
Dropout (0.2)            91.0                     82.6
L1                       84.3                     82.2
L1 + Dropout (0.2)       82.0                     81.8
L2                       91.7                     82.7
L2 + Dropout (0.2)       88.4                     84.1
L1L2                     84.2                     81.8
L1L2 + Dropout (0.2)     81.9                     81.5

In terms of validation accuracy, using L2 regularization with two dropout layers (dropout rate: 0.2) seems to be the best choice; however, Figure 42 shows that the network is overfitting, as the training accuracy keeps increasing while the validation accuracy remains relatively stagnant after epoch 10.

74 experiments Figure 42: Accuracy and loss plots of snn with L2 norm regularization and two dropout layers (one in each convolutional layer). Based on these results, we decided to use the L1 norm regularizer without dropout layers because this method yields satisfactory accuracies with a very small gap between the training and validation accuracies and losses as figure 43 shows, which implies good generalization properties.

6.3 experiments 75 Figure 43: Accuracy and loss plots of snn with L1 norm regularization. 6.3.4.2 Number of filters and batch size We used the L1 regularized neural network to experiment with the number of filters in the convolutional layers and the batch size. We carried out our tests using overlapping windows of 13, 17 and 21, however, we obtained the best accuracies on average with window size 21 as shown by table 18, therefore, we are going to report those results in this section.

Table 18: Mean validation accuracy for window sizes 13, 17 and 21 after the filter number and batch size experiments in snn.

Window size    Mean validation accuracy (%)
13             80.8
17             81.5
21             81.9

The hyperparameters explored in this step are the following: 20, 60, 96, 140 and 10, 24, 50 filters in the first and second convolutional layers, respectively, and batch sizes of 100, 500, and 1000. All combinations of hyperparameters were tested. To obtain the optimal set of hyperparameters, we decided to aggregate the raw data according to the batch size and the number of filters in the first and second convolutional layers, and we examined the hyperparameter value corresponding to the largest mean validation accuracy for each. Based on these results, the optimal combination seemed to be a batch size of 500 (Figure 44), 20 filters in the first convolutional layer (Figure 45) and 10 filters in the second convolutional layer (Figure 46). If we look at the raw data for window size 21 as given by Table 25 (in the appendix), however, we see that this is not the combination corresponding to the overall highest validation accuracy, which is given by the combination 96, 10, 1000 (number of filters in layers 1 and 2 and batch size, respectively).
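The aggregation step is a plain group-by over the raw results; the sketch below assumes the runs have been collected in a pandas DataFrame with (hypothetical) columns filters1, filters2, batch_size and val_acc, and the numbers shown are made-up placeholders rather than our measurements.

import pandas as pd

# Hypothetical raw results: one row per hyperparameter combination (placeholder numbers).
raw = pd.DataFrame({
    'filters1':   [20, 20, 96, 96, 140, 140],
    'filters2':   [10, 24, 10, 24, 10, 24],
    'batch_size': [100, 500, 1000, 500, 100, 1000],
    'val_acc':    [81.2, 81.9, 82.4, 81.7, 81.5, 80.9],
})

# Mean validation accuracy per value of each hyperparameter (the quantity plotted in Figures 44-46).
for column in ('batch_size', 'filters1', 'filters2'):
    print(raw.groupby(column)['val_acc'].mean())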

6.3 experiments 77 Figure 44: Mean validation accuracy for batch sizes 100, 500 and 1000 in snn. We looked at the accuracy and loss plots obtained using these hyperparameter combinations and based on those, we decided to use 96 filters in the first layer, 10 filters in the second layer and the batch size of 1000, as the other combination yielded an accuracy plot where the training and validation accuracies are rather tangled, suggesting mild overfitting as shown in Figure 62. Our chosen combination, however, shows that the validation accuracy tracks the training accuracy, as expected, with a sufficiently small gap between the training and validation accuracies (figure 47).

Figure 45: Mean validation accuracy for the number of filters in the first convolutional layer of snn.

Figure 46: Mean validation accuracy for the number of filters in the second convolutional layer of snn.