Accelerating Biomolecular Nuclear Magnetic Resonance Assignment with A*

Similar documents
Automated Assignment of Backbone NMR Data using Artificial Intelligence

Utilizing Machine Learning to Accelerate Automated Assignment of Backbone NMR Data

Protein Structure Analysis and Verification. Course S Basics for Biosystems of the Cell exercise work. Maija Nevala, BIO, 67485U 16.1.

CAP 5510 Lecture 3 Protein Structures

Protein NMR. Bin Huang

Copyright Mark Brandt, Ph.D A third method, cryogenic electron microscopy has seen increasing use over the past few years.

Determining Protein Structure BIBC 100

Basics of protein structure

Protein Structure Determination using NMR Spectroscopy. Cesar Trinidad

Model-based assignment and inference of protein backbone nuclear magnetic resonances.

Protein NMR. Part III. (let s start by reviewing some of the things we have learned already)

NMR Spectroscopy of Polymers

Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche

I690/B680 Structural Bioinformatics Spring Protein Structure Determination by NMR Spectroscopy

CMPS 6630: Introduction to Computational Biology and Bioinformatics. Structure Comparison

Sequential Assignment Strategies in Proteins

BMB/Bi/Ch 173 Winter 2018

Conformational Geometry of Peptides and Proteins:

Analysis of NMR Spectra Part 2

Chem 460 Laboratory Fall 2008 Experiment 3: Investigating Fumarase: ph Profile, Stereospecificity and Thermodynamics of Reaction

Comparison and Analysis of Heat Shock Proteins in Organisms of the Kingdom Viridiplantae. Emily Germain, Rensselaer Polytechnic Institute

Section III - Designing Models for 3D Printing

ORGANIC - EGE 5E CH NUCLEAR MAGNETIC RESONANCE SPECTROSCOPY

Statistical Machine Learning Methods for Bioinformatics IV. Neural Network & Deep Learning Applications in Bioinformatics

Two Dimensional (2D) NMR Spectroscopy

Protein Structure Prediction Using Multiple Artificial Neural Network Classifier *

State the position of protons, neutrons and electrons in the atom

Qualitative Analysis of Unknown Compounds

The rest of topic 11 INTRODUCTION TO ORGANIC SPECTROSCOPY

MRI Homework. i. (0.5 pt each) Consider the following arrangements of bar magnets in a strong magnetic field.

IPASS: Error Tolerant NMR Backbone Resonance Assignment by Linear Programming

NMR BMB 173 Lecture 16, February

Experiment 11: NUCLEAR MAGNETIC RESONANCE SPECTROSCOPY

Spectroscopy in Organic Chemistry. Types of Spectroscopy in Organic

9. Nuclear Magnetic Resonance

Artificial Intelligence (AI) Common AI Methods. Training. Signals to Perceptrons. Artificial Neural Networks (ANN) Artificial Intelligence

7. Nuclear Magnetic Resonance

NMRis the most valuable spectroscopic technique for organic chemists because it maps the carbon-hydrogen framework of a molecule.

Triple Resonance Experiments For Proteins

NMR in Medicine and Biology

Guided Prediction with Sparse NMR Data

Experimental Techniques in Protein Structure Determination

Mining Molecular Fragments: Finding Relevant Substructures of Molecules

Timescales of Protein Dynamics

Acta Crystallographica Section D

Name Date Class. covalent bond molecule sigma bond exothermic pi bond

BME Engineering Molecular Cell Biology. Structure and Dynamics of Cellular Molecules. Basics of Cell Biology Literature Reading

What is Systems Biology

NMR Resonance Assignment Assisted by Mass Spectrometry

Timescales of Protein Dynamics

Introduction to Bioinformatics

quantum mechanics is a hugely successful theory... QSIT08.V01 Page 1

Preparing a PDB File

Introduction to Comparative Protein Modeling. Chapter 4 Part I

1.0 Introduction to Quantum Systems for Information Technology 1.1 Motivation

LineShapeKin NMR Line Shape Analysis Software for Studies of Protein-Ligand Interaction Kinetics

Basic principles of multidimensional NMR in solution

GASA: A GRAPH-BASED AUTOMATED NMR BACKBONE RESONANCE SEQUENTIAL ASSIGNMENT PROGRAM

Covalent Bonding 1 of 27 Boardworks Ltd 2016

Name. Molecular Models

NMR in Structural Biology

BMB/Bi/Ch 173 Winter 2018

ZAHID IQBAL WARRAICH

Carbon and Molecular Diversity - 1

1) NMR is a method of chemical analysis. (Who uses NMR in this way?) 2) NMR is used as a method for medical imaging. (called MRI )

Proteins and Graph Theory [50] Scientific Mathematics Round

Electrospray ionization mass spectrometry (ESI-

Yifei Bao. Beatrix. Manor Askenazi

Course Syllabus. Department: Science & Technology. Date: April I. Course Prefix and Number: CHM 212. Course Name: Organic Chemistry II

Indirect Coupling. aka: J-coupling, indirect spin-spin coupling, indirect dipole-dipole coupling, mutual coupling, scalar coupling (liquids only)

file:///biology Exploring Life/BiologyExploringLife04/

Mitochondrial Genome Annotation

The Logic of Biological Phenomena

Biochemistry 530 NMR Theory and Practice. Gabriele Varani Department of Biochemistry and Department of Chemistry University of Washington

1. 3-hour Open book exam. No discussion among yourselves.

Filtered/edited NOESY spectra

the spatial arrangement of atoms in a molecule and the chemical bonds that hold the atoms together Chemical structure Covalent bond Ionic bond

THE UNIVERSITY OF MANITOBA. PAPER NO: 409 LOCATION: Fr. Kennedy Gold Gym PAGE NO: 1 of 6 DEPARTMENT & COURSE NO: CHEM 4630 TIME: 3 HOURS

(Refer Slide Time: 1:03)

Computational Biology: Basics & Interesting Problems

Biochemistry 530 NMR Theory and Practice

NMR Nuclear Magnetic Resonance Spectroscopy p. 83. a hydrogen nucleus (a proton) has a charge, spread over the surface

Course Home Page. Why is PHYSICS important in life science? Why is biophysics important?

Protein Structure. W. M. Grogan, Ph.D. OBJECTIVES

Calculate a rate given a species concentration change.

From Amino Acids to Proteins - in 4 Easy Steps

APPENDIX 1 Version of 8/24/05 11:01 AM Simulation of H-NMR without a structure*:

EH1008 : Biology for Public Health : Biomolecules and Metabolism

Introduction to" Protein Structure

Better Bond Angles in the Protein Data Bank

High Magnetic Field Science and the Magnetic Resonance Industry

Orientational degeneracy in the presence of one alignment tensor.

Structural biomathematics: an overview of molecular simulations and protein structure prediction

Shapes of Molecules. Lewis structures are useful but don t allow prediction of the shape of a molecule.

Name: BCMB/CHEM 8190, BIOMOLECULAR NMR FINAL EXAM-5/5/10

1. (5) Draw a diagram of an isomeric molecule to demonstrate a structural, geometric, and an enantiomer organization.

MD Simulations of Proteins: Practical Hints and Pitfalls to Avoid

Giri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748

Automated NMR protein structure calculation

Introduction to Computational Structural Biology

Transcription:

Accelerating Biomolecular Nuclear Magnetic Resonance Assignment with A* Joel Venzke, Paxten Johnson, Rachel Davis, John Emmons, Katherine Roth, David Mascharka, Leah Robison, Timothy Urness and Adina Kilpatrick Department of Mathematics and Computer Science Drake University Des Moines, IA joel.venzke@drake.edu Abstract Nuclear magnetic resonance (NMR) spectroscopy is a powerful method for studying the three-dimensional structure of molecules such as proteins. The post-genomic era has created the need to gather functional and structural information of unknown proteins encoded by newly discovered genes. Unfortunately, current manual techniques for analyzing NMR datasets can take days to months to assign, and are prone to error. The current goal of this research is to develop an algorithm to automate the process of assigning nontrivial NMR datasets, in an attempt to minimize human error and accelerate a time consuming task. NMR images yield structural information that needs to be analyzed rapidly and accurately. Accurate assignment of nontrivial NMR datasets will provide the necessary framework for continued advancement in the fields of structural biology and proteomics. The algorithm utilizes A* search to sequence amino acids produced from NMR datasets of proteins. The program is given a set of nodes, each corresponding to a single amino acid in the protein chain, from an input file. Each node is placed in the initial node position of the protein chain. The node that most correctly fits the reference protein chain is used to generate child nodes. Unplaced nodes are sequentially added to the end of a list to create child node lists. The node list with the smallest error is then expanded further. The error is a running total of how well the nodes fit the reference protein chain, coupled with the accuracy of the match of the placed child. This process of assigning the child nodes and calculating costs repeats until every node is placed and the best node list contains the lowest cost of all generated lists. Assignments of test sets have shown this method can be used to assign NMR datasets, but there remains need for improvement. Research is ongoing as we continue to refine our algorithm. Future goals include adjusting for potentially missing data, sorting nodes into amino acid groups, and testing with larger datasets. Upon completion, this algorithm will allow for great advancements in the areas of structural biology and proteomics.

1 Introduction Knowledge of a protein s structure can explain the function of a protein and how alterations to its function can lead to disease. With the use of nuclear magnetic resonance (NMR) spectroscopy, the structure of molecules, such as proteins, can be studied. However, NMR spectroscopy generates large amounts of data that need to be analyzed. This process can take months to complete and is prone to error. To allow for major advances in structural biology, a faster and more accurate method of analysis needs to be developed. Our research focuses on developing an algorithm to automate the assignment process. We hope our attempts can minimize the human error and decrease the time consumption associated with manual assignment, increasing the overall effectiveness of the operation. 1.1 Nuclear Magnetic Resonance Spectroscopy Several types of NMR variables can be used in the analysis of proteins structures. In particular, essential information is provided by the chemical shifts of NMR-active nuclei present in proteins, including hydrogen and isotopes of carbon and nitrogen. The chemical shift is a quantifier for the deviation in the resonant frequency of a nucleus from its value in a structure-free environment, and therefore provides information on the local conformation. Determining the chemical shifts of all or most of the nuclei in a biomolecule is the first step in determining its structure. An important set of chemical shifts in a protein are those corresponding to the nuclei in the backbone of the protein polypeptide chain, including the nitrogen, attached hydrogen, and the alpha and beta carbon atoms (C α and C β ) of each residue shown in Figure 1. These chemical shifts are measured using various threedimensional NMR experiments, and then matched to the individual residues in the protein in a process called sequential assignment. HNCACB (i 1) i C β C β N C α C N C α C H H O H H O 1 Figure 1: NMR 1

A prerequisite of the assignment process is data collection using experiments that can provide connectivities between neighboring residues. For example, the HNCACB experiment identifies the chemical shifts corresponding to the C α and C β nuclei of one residue (residue i), as well as the C α and C β chemical shifts of the immediately preceding residue (residue i 1). In this experiment, the values corresponding to residue i can usually be distinguished from the (i 1) values by their higher intensity. However, ambiguities can arise if the intensities are comparable or if the chemical shift values are overlapping. These ambiguities can be resolved by using additional experiments, such as CBCA(CO) NH, which yields the chemical shifts of the preceding residue only. Using all inter-residue connectivities, a chain of correlations through the protein backbone chemical shifts can be established. This pattern of sequentially linked chemical shift values reflects the sequential linear arrangement of the individual residues in the protein sequence. The pattern is then matched with the protein sequence by using the fact that certain residues have characteristic C α and C β chemical shift ranges to uniquely identify them. Thus, each measured chemical shift is assigned to a protein residue, and this information can then be used to infer structural information about the biomolecule. 2 Assignment 2.1 Data Collection Determining the structure of a protein requires many highly involved steps. The first, which can take at least five days, is protein production. NMR experiments are then performed using the produced proteins in order to gather the data necessary for the assignment process. This process usually takes another one to two days for every spectrum involved. After all the data has been gathered, the assigning of the protein s structure, the most lengthy of the steps, will begin [1]. 2.2 Manual Assignment In most cases, backbone NMR data is assigned manually. However, with hundreds of chemical shift values, this process can take anywhere from 20 days to 9 months with perfect data [3]. To complicate matters even more, experience has shown most data sets to have missing or false peaks, requiring whomever is doing the actual assignment to either skip large portions of the protein sequence or to guess where a certain chain is located in the sequence. As a result, this method is extremely error prone [1]. 2.3 Algorithm Design Our algorithm, which is designed to decrease the error and time consumption associated with manual assignment, begins by reading in an expected amino acid sequence and the 2

NMR chemical shift data. It then changes the amino acids to their expected C α and C β values and stores them in a list that represents the protein chain. A tile containing the C α and C β values for the amino acid residue i and residue (i 1) is then created. Next, missing data is accounted for. If there are fewer tiles than the number of amino acids in the protein chain, placeholder tiles are generated to make up the difference. Before the search for the best solution begins, the tiles are grouped by the types of amino acids they most likely represent in an attempt to accelerate the assignment process. These groups are created using the expected C α and C β values shown in Figure 2. Tiles that do not match are grouped with the placeholder tiles. Figure 2: Expected Carbon Shift Values [2] To start the search, all the tiles matching the first amino acid in the reference protein chain are placed in the first position of a node. All placeholder tiles are also added to this position. The tile is then compared to the first amino acid in the protein chain to generate a cost. The cost is the sum of the differences generated by placing any previous tiles in the node, how well the current tile matches the corresponding amino acid in the protein chain, and the disparity between the tile s residue (i 1) values and the residue i values of the tile placed before it. All of the generated nodes are placed in an array of possible solutions, also called the frontier. The node in the frontier with the lowest cost is removed from the array and used to generate child nodes, which is accomplished simply by adding a tile to the end of the node. A child node is generated for all unplaced tiles that are either in the placeholder group or in the 3

group matching the corresponding amino acid in the protein chain. The cost of placing the tile is then calculated. Next, the child nodes are added to the frontier. This process of generating child nodes is repeated until a goal state is reached. Once a node has every tile placed, it is said to be in the goal state. The first node that reaches this state is stored as the current best solution. Every node that achieves the goal state is compared to the current best solution. The node with the lowest cost then replaces the previously stored best solution. The final solution has been found once the current best solution s cost is lower than the cost of any other nodes in the frontier. The search is completed. The algorithm then outputs the cost of the best solution followed by the assigned chemical shift values. 3 Conclusions 3.1 Results Our algorithm has been tested with NMR datasets of varying size from experiments conducted by Dr. Kilpatrick. When compared to manually assigned data, the algorithm produces the same, optimal ordering of amino acids. Initially, we focused our research on accurate assignment at the cost of speed. Thus, at the start of our research, our algorithm was only able to assign smaller datasets with a chain of ten to fifteen amino acids. Shifting our focus to optimizing the algorithm s speed while maintaining the accuracy we had managed to achieve, we were able to improve runtime. With the increased performance, we were able to correctly assign forty amino acids in less than 30 seconds. The results of our testing are shown in Figure 3. 45 40 Time to Completion (seconds) 35 30 25 20 15 10 5 0 5 10 15 20 25 30 35 40 Number of Residues Figure 3: Time of Assignment 4

Since we are using real data to test the algorithm, some data points are missing. Missing data can lead to dramatic increases in assignment time. The jump in runtime between 37 and 38 residues, seen in Figure 3, is one such example. Residue 38 is missing the C α and C β values for residue (i 1). This missing data forces the algorithm to explore a greater number of possibilities to ensure accurate assignment, incidentally increasing the time it takes for the best solution to be found. 3.2 Future Goals In future research, we will look into maximizing speed and efficiency with the goal of assigning full amino acid sequences. In order to achieve this, we will experiment with both parallelization and machine learning. The latter of these two will assist in optimizing the cost calculation while the former will increase the speed of assignment. We believe these implementations will help our algorithm handle datasets with numerous missing data points, validating our results even when given incomplete data. References [1] Babak Alipanahi, Xin Gao, Emre Karakoc, Frank Balbach, Shuai Cheng Li, Guangyu Feng, Logan Donaldson and Ming Li, Error tolerant NMR backbone resonance assignment and automated structure generation., Journal of bioinformatics and computational biology, 9 (2011), 15 41. [2] Sean Cahill and Mark Girvin. Introduction to 3d triple resonance experiments. 2012. [3] Peter Gntert, Automated structure determination from NMR spectra, European Biophysics Journal, 38 (2009), 129 143. [4] Flemming M. Poulsen. A brief introduction to nmr spectroscopy of proteins. 2002. 5