Neural Networks for Protein Structure Prediction Brown, JMB CS 466 Saurabh Sinha

Neural Networks for Protein Structure Prediction Brown, JMB 1999 CS 466 Saurabh Sinha

Outline
The goal is to predict the secondary structure of a protein from its sequence. An artificial neural network is used for this task. The prediction accuracy is then evaluated.

What is Protein Structure?

http://academic.brooklyn.cuny.edu/biology/bio4fv/page/3d_prot.htm

http://matcmadison.edu/biotech/resources/proteins/labmanual/images/220_04_114.png

Protein Structure
An amino acid sequence folds into a complex 3-D structure. Determining this 3-D structure is a crucial and challenging task. Experimental methods (e.g., X-ray crystallography) are very tedious. Computational prediction is a possibility, but it is very difficult.

What is secondary structure?

Figure: strand and helix. http://www.wiley.com/college/pratt/0471393878/student/structure/secondary_structure/secondary_structure.gif

Figure: helix and strand. http://www.npaci.edu/features/00/mar/protein.jpg

Secondary structure prediction
The whole 3-D tertiary structure of a protein may be hard to predict from sequence. But can we at least predict the secondary structural elements, such as strand, helix, or coil? This is what this paper does, and so do many other papers (it is a hard problem!).

A survey of structure prediction
The most reliable technique is comparative modeling. Find a protein P whose amino acid sequence is very similar to that of your target protein T. Hope that P has a known structure. Predict a structure for T similar to that of P, after carefully considering how the sequences of P and T differ.

A survey of structure prediction
Comparative modeling fails if we don't have a suitable homologous template protein P for our protein T. Ab initio tertiary methods attempt to predict the structure without using a known protein structure as a template. They incorporate basic physical and chemical principles into the structure calculation. This gets very hairy and is highly computationally intensive. The other option is to predict secondary structure only (i.e., to make the goal more modest). Such predictions may be used to provide constraints for tertiary structure prediction.

Secondary structure prediction
Early methods were based on stereochemical principles. Later methods recognized that we can do better by using not only the one sequence T (our sequence) but also a family of related sequences. Search for sequences similar to T, build a multiple alignment of them, and predict secondary structure from the multiple alignment.

What's multiple alignment doing here?
The most conserved regions of a protein sequence are either functionally important or buried in the protein core. More variable regions are usually on the surface of the protein, where there are few constraints on which types of amino acids appear (apart from a bias towards hydrophilic residues). A multiple alignment tells us which portions are conserved and which are not.

Figure: hydrophobic core. http://bio.nagaokaut.ac.jp/~mbp-lab/img/hpc.png

What's multiple alignment doing here?
Therefore, by looking at a multiple alignment, we can predict which residues are in the core of the protein and which are on the surface ("solvent accessibility"). Secondary structure is then predicted by comparing against the accessibility patterns associated with helices, strands, etc. This approach (Benner & Gerloff) is mostly manual. Today's paper suggests an automated method.

The PSIPRED algorithm
Given an amino acid sequence, predict the secondary structure elements in the protein. Three stages:
1. Generation of a sequence profile (the multiple alignment step)
2. Prediction of an initial secondary structure (the neural network step)
3. Filtering of the predicted structure (another neural network step)

Generation of sequence profile
A BLAST-like program called PSI-BLAST is used for this step. We saw BLAST earlier: it is a fast way to find high-scoring local alignments. PSI-BLAST is an iterative approach:
1. Run an initial scan of a protein database using the target sequence T.
2. Align all matching sequences to construct a sequence profile.
3. Scan the database again using this new profile.
It can also pick out and align distantly related protein sequences for our target sequence T.

The sequence profile looks like this
It has 20 x M numbers (20 amino acids by M positions). The numbers are the log likelihood of each residue at each position.
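To make the 20 x M layout concrete, here is a minimal sketch that builds a log-score profile from a tiny gapless alignment. It is not PSI-BLAST's actual scoring: the uniform background frequency, the pseudocount, the function name `log_odds_profile`, and the toy alignment are all assumptions made for illustration.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def log_odds_profile(alignment, pseudocount=1.0):
    """Return M rows of 20 log scores for a gapless toy alignment."""
    length = len(alignment[0])
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    profile = []
    for pos in range(length):
        column = [seq[pos] for seq in alignment]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        row = []
        for aa in AMINO_ACIDS:
            # Pseudocounts keep unseen residues from giving log(0).
            freq = (column.count(aa) + pseudocount) / total
            row.append(math.log(freq / background))
        profile.append(row)
    return profile

toy_alignment = ["MKVL", "MKIL", "MRVL"]  # hypothetical aligned sequences
profile = log_odds_profile(toy_alignment)
print(len(profile), len(profile[0]))  # 4 positions, 20 scores each
```

Residues that dominate a column (such as M in the first column) get positive scores, and residues never seen there get negative ones.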

Preparing for the second step
Feed the sequence profile to an artificial neural network. But before feeding it in, apply a simple scaling to bring the numbers into the 0-1 range:
$$x \mapsto \frac{1}{1 + e^{-x}}$$
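The scaling step is the standard logistic function applied to every profile entry. A minimal sketch, where `scale_profile` and the sample row are illustrative names and values rather than anything from the paper:

```python
import math

def logistic(x):
    """Map any real score into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def scale_profile(profile):
    """Apply the logistic scaling to every number in the profile."""
    return [[logistic(value) for value in row] for row in profile]

# One hypothetical profile row with a low, a neutral, and a high score.
scaled = scale_profile([[-3.0, 0.0, 5.0]])
print(scaled[0])
```

A score of 0 maps to exactly 0.5, strongly negative scores approach 0, and strongly positive scores approach 1. For scores of very large magnitude `math.exp` can overflow, which is not a concern for typical profile values.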

Intro to Neural nets (the second and third steps of PSIPRED)

Artificial Neural Network
A supervised learning algorithm. It is given training examples, each with a label: the class of the example (e.g., positive or negative; or helix, strand, or coil). It learns how to predict the class of an example.

Artificial Neural Network
A directed graph. Nodes are also called units or neurons. Edges connect units. Each edge has a weight (not known a priori).

Layered Architecture
Figure: http://www.akri.org/cognition/images/annet2.gif
The input here is a four-dimensional vector. Each dimension goes into one input unit.

Layered Architecture http://www.geocomputation.org/2000/gc016/gc016_01.gif (units)

What a unit (neuron) does
Unit i receives a total input $x_i$ from the units connected to it and produces an output $y_i = f_i(x_i)$, where $f_i$ is the transfer function of unit i:
$$x_i = \sum_{j \in N - \{i\}} w_{ij} y_j + w_i$$
$$y_i = f_i(x_i) = f_i\Big(\sum_{j \in N - \{i\}} w_{ij} y_j + w_i\Big)$$
$w_i$ is called the bias of the unit.
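The computation of a single unit can be sketched in a few lines. The sigmoid transfer function and the example numbers below are assumptions chosen for illustration; the slide leaves the transfer function open.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def unit_output(inputs, weights, bias, transfer=sigmoid):
    """Weighted sum of the incoming outputs y_j plus the bias, then the transfer function."""
    total_input = sum(w * y for w, y in zip(weights, inputs)) + bias
    return transfer(total_input)

# Hypothetical unit with two incoming edges.
y = unit_output(inputs=[0.5, 0.2], weights=[1.0, -2.0], bias=0.1)
print(y)
```

Here the total input is 0.5 * 1.0 + 0.2 * (-2.0) + 0.1 = 0.2, and the output is the sigmoid of that value.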

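Weights, bias and transfer function
A unit takes n inputs. Each input edge i has a weight w_i. The unit has a bias b and produces an output a. The transfer function f() may be linear, sigmoidal, or other. (Note that the notation here differs from the previous slide: here w_i is the weight on input i and b is the bias.)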

Weights, bias and transfer function
The weights w_ij and bias w_i of each unit are the parameters of the ANN. Parameter values are learned from input data. The transfer function is usually the same for every unit in a given layer. The graphical architecture (connectivity) is decided by you. You could use a fully connected architecture, in which all units in one layer connect to all units in the next layer.
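A fully connected, layered network can be sketched as a list of layers, each holding one weight row and one bias per unit. The layer sizes, the random initial weights, and the helper names `make_layer` and `forward` are assumptions for illustration only.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def make_layer(n_in, n_out, rng):
    """One fully connected layer: every unit has a weight for every input."""
    return {
        "weights": [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                    for _ in range(n_out)],
        "biases": [0.0] * n_out,
    }

def forward(layers, inputs):
    """Propagate an input vector through the layers, one after another."""
    activations = inputs
    for layer in layers:
        activations = [
            sigmoid(sum(w * a for w, a in zip(row, activations)) + b)
            for row, b in zip(layer["weights"], layer["biases"])
        ]
    return activations

rng = random.Random(0)  # fixed seed so the sketch is repeatable
network = [make_layer(4, 3, rng), make_layer(3, 2, rng)]  # 4 inputs, 3 hidden, 2 outputs
outputs = forward(network, [0.1, 0.9, 0.3, 0.5])
print(outputs)
```

The weights here are random, so the outputs are meaningless until the parameters are trained.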

Where's the algorithm?
It's in the training of the parameters! We are given several examples and their labels: the training data. Search for parameter values such that the output units make correct predictions on the training examples. This is done with the back-propagation algorithm. Read up more on neural nets if you are interested.
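As a rough illustration of what "searching for parameter values" means, the sketch below trains a single sigmoid unit by gradient descent on squared error. This is only the one-unit special case; full back-propagation passes the same kind of gradient back through the hidden layers. The toy task (logical OR), the learning rate, and the number of epochs are assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(weights, bias, inputs):
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def train_unit(examples, epochs=2000, rate=0.5):
    """Gradient descent on 0.5 * (output - target)^2 for one sigmoid unit."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in examples:
            output = predict(weights, bias, inputs)
            # Derivative of the error with respect to the unit's total input.
            delta = (output - target) * output * (1.0 - output)
            weights = [w - rate * delta * x for w, x in zip(weights, inputs)]
            bias -= rate * delta
    return weights, bias

# Toy training data: two binary inputs, label is their logical OR.
examples = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
weights, bias = train_unit(examples)
predictions = [1 if predict(weights, bias, x) > 0.5 else 0 for x, _ in examples]
print(predictions)
```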

Back to PSIPRED

Step 2
Feed the sequence profile to the input layer of an ANN: not the whole profile, only a window of 15 consecutive positions. For each position, there are 20 numbers in the profile (one for each amino acid). Therefore, ~15 x 20 = 300 numbers are fed in, and the ANN has ~300 input units. There are 3 output units, for strand, helix, and coil. Each output number is the confidence in that secondary structure for the central position in the window of 15.
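The windowing can be sketched as follows: for a chosen central position, concatenate the 20 profile numbers of the 15 surrounding positions into one input vector of length 300. How positions beyond the ends of the sequence are handled is not stated on the slide; the zero padding used here is an assumption, as are the function name and the dummy profile.

```python
WINDOW = 15    # consecutive positions per window
N_AMINO = 20   # profile numbers per position

def window_inputs(profile, center, window=WINDOW):
    """Flatten the profile rows around `center` into one input vector."""
    half = window // 2
    values = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(profile):
            values.extend(profile[pos])
        else:
            values.extend([0.0] * N_AMINO)  # assumption: zero padding past the ends
    return values

# Dummy scaled profile for a 30-residue sequence.
profile = [[0.5] * N_AMINO for _ in range(30)]
inputs = window_inputs(profile, center=0)
print(len(inputs))  # 300
```

For the first residue, the 7 positions to its left do not exist, so the first 140 inputs are padding in this sketch.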

Figure: example network for a window of 15 positions, with an input layer, a hidden layer, and three outputs (e.g., helix 0.18, strand 0.09, coil 0.67).

Step 3
Feed the output of the 1st ANN to the 2nd ANN. Each window of 15 positions gave 3 numbers from the 1st ANN. Take the outputs of 15 successive windows and feed them to the 2nd ANN. Therefore, the 2nd ANN has ~15 x 3 = 45 input units. It has 3 output units, for strand, helix, and coil.
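The input to the second network can be assembled the same way, this time from the three first-network outputs at each of 15 successive positions. The sketch also shows one plausible way to turn three output confidences into a single label, by picking the largest. The output order (helix, strand, coil), the zero padding, and the pick-the-maximum rule are assumptions, not details taken from the slides.

```python
LABELS = "HSC"  # assumed output order: helix, strand, coil

def second_stage_inputs(first_outputs, center, window=15):
    """Concatenate the 3 first-network outputs of `window` successive positions."""
    half = window // 2
    values = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(first_outputs):
            values.extend(first_outputs[pos])
        else:
            values.extend([0.0, 0.0, 0.0])  # assumption: zero padding past the ends
    return values

def call_label(outputs):
    """Pick the label whose output unit has the highest confidence."""
    best = max(range(len(outputs)), key=lambda k: outputs[k])
    return LABELS[best]

# Dummy first-network outputs for a 20-residue sequence.
first_outputs = [(0.18, 0.09, 0.67)] * 20
inputs = second_stage_inputs(first_outputs, center=10)
print(len(inputs))                      # 45
print(call_label((0.18, 0.09, 0.67)))   # C
```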

Test of performance

Cross-validation
Partition the data into a training set (two thirds of the examples) and a test set (the remaining one third). Train PSIPRED on the training set, then make predictions on the test set and compare them with the known answers. What is an answer? For each position of the sequence, a prediction of which secondary structure that position is involved in; that is, a sequence over H/S/C (helix/strand/coil). How to compare an answer with the known answer? Count the number of positions that match.
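Both parts of the evaluation are easy to sketch: a two-thirds / one-third split of the examples, and a score that counts matching positions between a predicted and a known H/S/C string. The function names and the example strings are made up for illustration, and a real split would also need care that test proteins are not close relatives of training proteins.

```python
def split_two_thirds(examples):
    """First two thirds for training, remaining third for testing."""
    cut = (2 * len(examples)) // 3
    return examples[:cut], examples[cut:]

def fraction_matching(predicted, known):
    """Fraction of positions where the predicted state equals the known state."""
    if len(predicted) != len(known):
        raise ValueError("sequences must have the same length")
    matches = sum(p == k for p, k in zip(predicted, known))
    return matches / len(known)

train, test = split_two_thirds(list(range(9)))  # 9 hypothetical examples
print(len(train), len(test))  # 6 3

predicted = "CCHHHHCCSSSC"  # hypothetical prediction
known     = "CHHHHHCCSSCC"  # hypothetical known structure
score = fraction_matching(predicted, known)
print(score)  # 10 of 12 positions match
```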