Structural Alignment of Proteins

Similar documents
CMPS 6630: Introduction to Computational Biology and Bioinformatics. Structure Comparison

Supplemental Materials for. Structural Diversity of Protein Segments Follows a Power-law Distribution

Programme Last week s quiz results + Summary Fold recognition Break Exercise: Modelling remote homologues

Introduction to Comparative Protein Modeling. Chapter 4 Part I

Molecular Modeling Lecture 7. Homology modeling insertions/deletions manual realignment

Physiochemical Properties of Residues

Structure and evolution of the spliceosomal peptidyl-prolyl cistrans isomerase Cwc27

Packing of Secondary Structures

Supplementary Figure 3 a. Structural comparison between the two determined structures for the IL 23:MA12 complex. The overall RMSD between the two

114 Grundlagen der Bioinformatik, SS 09, D. Huson, July 6, 2009

Protein Structure Prediction

Protein Structure: Data Bases and Classification Ingo Ruczinski

Secondary Structure. Bioch/BIMS 503 Lecture 2. Structure and Function of Proteins. Further Reading. Φ, Ψ angles alone determine protein structure

Heteropolymer. Mostly in regular secondary structure

Viewing and Analyzing Proteins, Ligands and their Complexes 2

CMPS 3110: Bioinformatics. Tertiary Structure Prediction

CMPS 6630: Introduction to Computational Biology and Bioinformatics. Tertiary Structure Prediction

Sequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013

2 Dean C. Adams and Gavin J. P. Naylor the best three-dimensional ordination of the structure space is found through an eigen-decomposition (correspon

A profile-based protein sequence alignment algorithm for a domain clustering database

Protein Structure Comparison Methods

Protein Structures: Experiments and Modeling. Patrice Koehl

Computational Protein Design

Supporting information to: Time-resolved observation of protein allosteric communication. Sebastian Buchenberg, Florian Sittel and Gerhard Stock 1

A Novel Text Modeling Approach for Structural Comparison and Alignment of Biomolecules

Supplementary figure 1. Comparison of unbound ogm-csf and ogm-csf as captured in the GIF:GM-CSF complex. Alignment of two copies of unbound ovine

ALL LECTURES IN SB Introduction

Protein Structure Prediction

Large-Scale Genomic Surveys

Similarity or Identity? When are molecules similar?

Secondary-structure matching (SSM), a new tool for fast protein structure alignment in three dimensions

Number sequence representation of protein structures based on the second derivative of a folded tetrahedron sequence

April, The energy functions include:

What makes a good graphene-binding peptide? Adsorption of amino acids and peptides at aqueous graphene interfaces: Electronic Supplementary

Course Notes: Topics in Computational. Structural Biology.

Protein structure. Protein structure. Amino acid residue. Cell communication channel. Bioinformatics Methods

Computer simulations of protein folding with a small number of distance restraints

Figure 1. Molecules geometries of 5021 and Each neutral group in CHARMM topology was grouped in dash circle.

Protein Secondary Structure Prediction

Detection of Protein Binding Sites II

FoldMiner: Structural motif discovery using an improved superposition algorithm

NMR study of complexes between low molecular mass inhibitors and the West Nile virus NS2B-NS3 protease

ITERATIVE NON-SEQUENTIAL PROTEIN STRUCTURAL ALIGNMENT

Central Dogma. modifications genome transcriptome proteome

Full wwpdb X-ray Structure Validation Report i

Procheck output. Bond angles (Procheck) Structure verification and validation Bond lengths (Procheck) Introduction to Bioinformatics.

Homology Modeling. Roberto Lins EPFL - summer semester 2005

2MHR. Protein structure classification is important because it organizes the protein structure universe that is independent of sequence similarity.

Automated Identification of Protein Structural Features

A Tool for Structure Alignment of Molecules

SUPPLEMENTARY INFORMATION

Esser et al. Crystal Structures of R. sphaeroides bc 1

A NEW ALGORITHM FOR THE ALIGNMENT OF MULTIPLE PROTEIN STRUCTURES USING MONTE CARLO OPTIMIZATION

Homologous proteins have similar structures and structural superposition means to rotate and translate the structures so that corresponding atoms are

Protein structure alignments

Supplementary Figure 1. Aligned sequences of yeast IDH1 (top) and IDH2 (bottom) with isocitrate

Efficient Protein Tertiary Structure Retrievals and Classifications Using Content Based Comparison Algorithms

Using Higher Calculus to Study Biologically Important Molecules Julie C. Mitchell

Automated Identification of Protein Structural Features

Molecular Modeling lecture 4. Rotation Least-squares Superposition Structure-based alignment algorithms

7.012 Problem Set 1. i) What are two main differences between prokaryotic cells and eukaryotic cells?

Structure to Function. Molecular Bioinformatics, X3, 2006

Better Bond Angles in the Protein Data Bank

Design of a Novel Globular Protein Fold with Atomic-Level Accuracy

Supplementary Information. Broad Spectrum Anti-Influenza Agents by Inhibiting Self- Association of Matrix Protein 1

Table 1. Crystallographic data collection, phasing and refinement statistics. Native Hg soaked Mn soaked 1 Mn soaked 2

Full wwpdb X-ray Structure Validation Report i

Protein Structure Overlap

Protein structure analysis. Risto Laakso 10th January 2005

Giri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748

Building 3D models of proteins

Bioinformatics. Macromolecular structure

BIOINF 4120 Bioinformatics 2 - Structures and Systems - Oliver Kohlbacher Summer Protein Structure Prediction I

ATP GTP Problem 2 mm.py

1. Protein Data Bank (PDB) 1. Protein Data Bank (PDB)

Can protein model accuracy be. identified? NO! CBS, BioCentrum, Morten Nielsen, DTU

Molecular Modeling lecture 17, Tue, Mar. 19. Rotation Least-squares Superposition Structure-based alignment algorithms

Sequential resonance assignments in (small) proteins: homonuclear method 2º structure determination

fconv Tutorial Part 2

DATE A DAtabase of TIM Barrel Enzymes

Alpha-helical Topology and Tertiary Structure Prediction of Globular Proteins Scott R. McAllister Christodoulos A. Floudas Princeton University

Supplementary Information Intrinsic Localized Modes in Proteins

CSE 549: Computational Biology. Substitution Matrices

Syllabus of BIOINF 528 (2017 Fall, Bioinformatics Program)

Sequence Based Bioinformatics

FW 1 CDR 1 FW 2 CDR 2

Any protein that can be labelled by both procedures must be a transmembrane protein.

Full wwpdb X-ray Structure Validation Report i

Details of Protein Structure

Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche

Supporting Online Material for

Resonance assignments in proteins. Christina Redfield

Analysis and Prediction of Protein Structure (I)

SCOP. all-β class. all-α class, 3 different folds. T4 endonuclease V. 4-helical cytokines. Globin-like

Protein Structure Prediction and Display

SUPPLEMENTARY INFORMATION

Identification of correct regions in protein models using structural, alignment, and consensus information

Full wwpdb X-ray Structure Validation Report i

A General Model for Amino Acid Interaction Networks

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

Transcription:

Goal Align protein structures Structural Alignment of Proteins 1 2 3 4 5 6 7 8 9 10 11 12 13 14 PHE ASP ILE CYS ARG LEU PRO GLY SER ALA GLU ALA VAL CYS PHE ASN VAL CYS ARG THR PRO --- --- --- GLU ALA ILE CYS PHE ASN VAL CYS ARG --- --- --- THR PRO GLU ALA ILE CYS Thomas Funkhouser Princeton University CS597A, Fall 2007 [Marian Novotny] Terminology Structure vs. Sequence Superposition Given correspondences, compute optimal alignment transformation, and compute alignment score Alignment Find correspondences, and then superpose structures Sequence Identity (Structure similarity) [Orengo04, Fig 6.2] Structure vs. Sequence Applications Fundamental step in: Analysis Visualization Comparison Design Useful for: [Orengo04, Fig 6.1] Structure classification Structure prediction Function prediction Drug discovery Comparison of S1 binding pockets of thrombin (blue) and trypsin (red). [Katzenholtz00] 1

Goals Desirable properties: Automatic Discriminating Fast Theoretical Issues NP-complete problem Arbitrary gap lengths Global scoring function 1 2 3 4 5 6 7 8 9 10 11 12 13 14 PHE ASP ILE CYS ARG LEU PRO GLY SER ALA GLU ALA VAL CYS PHE ASN VAL CYS ARG THR PRO --- --- --- GLU ALA ILE CYS PHE ASN VAL CYS ARG --- --- --- THR PRO GLU ALA ILE CYS Methodological Issues Choices: Representation Scoring function Search algorithm Methodological Issues Factors governing choices:? Methodological Issues Factors governing choices: Application: homology detection, drug design, etc. Granularity: atom, residue, fragment, SSE Representation: inter-molecular, intra-molecular Scoring: geometric, gaps, chemical, structural, etc. Correspondences: sequential, non-sequential Gap penalty: expect gaps near loops, etc. Flexibility: rigid, flexible Target: single protein, representative proteins, PDB Methodological Issues Representations: Residue positions Local geometry Side chain contacts Distance matrices (DALI) Properties (COMPARER) SSEs (, VAST) Geometric invariants 2

Methodological Issues Scoring functions: Distances (RMSD) Substitutions Gaps Methodological Issues Search algorithms: Heuristics (CE) Monte Carlo (DALI, VAST) Dynamic programming (, ) Graph matching () Outline Alignment issues Example alignment methods Fold prediction experiment Function prediction experiment Example Methods DALI DEJAVU /LSQMAN CE Taylor & Orengo, 1989 Subbiah, Laurents & Levitt, 1993 Gerstein & Levitt 1998 Holm & Sander, 1993 Holm & Park, 2000 Kleywegt, 1996 Shindyalov & Bourne, 1998 Krissinel & Henrick, 2003 + 30 others! Slide by Rachel Kolodny 2) Superimpose to minimize RMS 4) Use dynamic prog. to find the best set of equivalences 5) Superimpose given the new alignment 1) Alignment fixed 3) Calculate distances between all atoms 6) Recalculate distances between all atoms [Orengo04, Fig 6.6] [Subbiah93, Gerstein98] [Subbiah93, Gerstein98] 3

[Orengo04, Fig 6.11] [Orengo96] DALI [Orengo96] DALI [Orengo04, Fig 6.9] Distance Maps [Holm93] CE Basic steps: 1. Compare octameric fragments to create candidate aligned fragment pairs (AFP) 2. Stitch together AFPs according to heuristics 3. Find the optimal path through the AFPs Two-step solution: 1. Graph representation of structures 2. Graph matching Protein A Protein B Protein A Protein B [Orengo04, Fig 6.7] 4

Graph representation of molecular structures Simple and intuitive, however results in intractably large graphs for proteins Solution: build graphs over stable substructures, such as secondary structure elements (SSEs). Having a correspondence between SSEs, one may use that for the 3D alignment of all core atoms. [Orengo04, Fig 6.8] Graph representation of protein SSEs E. M. Mitchell et al. (1990) J. Mol. Biol. 212:151 A. P. Singh and D. L. Brutlag (1997) ISMB-97 4:284 Protein graph labeling Composite label of a vertex type - helix or strand length r Composite label of an edge length L (directed if connects vertices from the same chain) vertex orientation angles a 1 and a 2 torsion angle t Vertex and edge labels are matched with thresholds on particular quantities C α alignment SSE-alignment is used as an initial guess for C -alignment C -alignment is an iterative procedure based on the expansion of shortest contacts at best superposition of structures Statistical significance of match The overall probability of getting a particular match score by chance is the measure of the statistical significance of the match P M is traditionally expressed through so-called Z-characteristics C -alignment is a compromise between the alignment length N a and r.m.s.d. The optimised quantity is ω 2 ( y) 2 exp ( 1 y ) = π 2 5

output List of matches Table of matched Secondary Structure Elements (SSE alignment) Table of matched core atoms (C a - alignment ) with dists between them Rotational-translation matrix of best structure superposition R.m.s.d. of C a - alignment Length of C a - alignment N a Number of gaps in C a - alignment N g Quality score Q Probability estimate for the match P M Z - characteristics Sequence identity Match details SSE alignment Results Rotational-translation matrix of best superposition C a - alignment 6

Results Results Outline Alignment issues Example alignment methods Fold prediction experiment Function prediction experiment Fold Prediction Experiments Evaluate how useful alignment algorithms are for predicting a protein s fold How? Fold Prediction Experiments Kolodny, Koehl, & Levitt [2005] ROC curves and geometric measures using CATH Sierk & Pearson [2004] ROC curves using CATH Novotny et al. [2004] Checked a few dozen cases using CATH Leplae & Hubbard [2002] ROC curves using SCOP Fold Prediction Experiments Kolodny, Koehl, & Levitt [2005] ROC curves and geometric measures using CATH Sierk & Pearson [2004] ROC curves using CATH Novotny et al. [2004] Checked a few dozen cases using CATH Leplae & Hubbard [2002] ROC curves using SCOP 7

Kolodny, Koehl, & Levitt [2005] Large scale alignment study 2,930 structures (all pairs) 6 structural alignment algorithms 4 geometric scoring functions Evaluation with respect to CATH topology level 20,000 hours of compute time Tested Methods DALI DEJAVU /LSQMAN CE Best-of-All Taylor & Orengo, 1989 Subbiah, Laurents & Levitt, 1993 Gerstein & Levitt 1998 Holm & Sander, 1993 Holm & Park, 2000 Kleywegt, 1996 Shindyalov & Bourne, 1998 Krissinel & Henrick, 2003 Best of above methods Slide by Rachel Kolodny Scoring Functions Consider # aligned residues & geometric similarity: RMSD 100 SAS = N mat Also penalize gaps: if ( Nmat > N GSAS = else gap ) RMSD 100 Nmat N gap 99.9 Evaluation Using ROC Curves # $! " " %& '( ) *+. /, ) * -.+ [Kolodny05] Slide by Rachel Kolodny SAS & Native ROC Curves Fraction of TP 1 0.8 0.6 0.4 0.2 CE LSQMAN DALI LSQMAN DALI CE Dream Team Best of All 0 0 0.2 0.4 0.6 0.8 1 Fraction of FP 10-7 10-5 10-3 10-1 10 0 Log (Fraction of FP) Slide by Rachel Kolodny ROC Curve Issues Uses only internal ordering Estimation of similarity can be very wrong Converts a classification gold standard into binary truth # $ #$! "!" Slide by Rachel Kolodny 8

Comparing SAS Values Directly GSAS & SAS Distributions Fraction of TP 1 0.8 0.6 0.4 0.2 CE DALI LSQMAN LSQMAN DALI CE 01.1 Dream Best Team of All 0 0 0.2 0.4 0.6 0.8 1 Fraction of FP 0 2 4 6 8 10 TP s Average SAS Slide by Rachel Kolodny Same CAT Pairs percent All Pairs percent 100 90 80 70 60 50 40 30 20 10 0 18 16 14 12 10 8 6 4 2 0 CE LSQMAN DALI Dream Team 01.1 0 1 2 3 4 5 GSAS 0 1 2 3 4 5 SAS Slide by Rachel Kolodny Contributions to Best-of-All Outline Alignment issues Example alignment methods Fold prediction experiment Function prediction experiment [Kolodny05] Function Prediction Experiment Evaluate how useful alignment methods are for predicting a protein s molecular function How? Data Set Proteins crystallized with bound ligands PDB file must have resolution 3 Angstroms Ligands must have 20 HETATOMS Classified by reaction/reactant PDB file must have an EC number (enzymes only) EC number must have a KEGG reaction with a reactant whose graph closely matches ligand in PDB file Non-redundant No two ligands contacting domains with same CATH S95 No two ligands contacting domains with same SCOP SP No two ligands from same PDB file 9

Data Set Data Set 351 proteins / 58 Reactions (189 outliers) REACTION NAME # R00145 2 REACTION NAME # R00162 3 REACTION NAME # R00408 FAD 5 R00214 2 R00342 7 R03647 2 R00124 ADP 2 R00924 FAD 2 R01175 FAD 2 R00538 3 R00623 5 R00497 ADP 2 R00756 ADP 2 MISC FAD 2 R00351 COA 3 R00703 5 R01061 5 R01512 ADP 2 R02412 ADP 2 R03552 COA 2 MISC COA 7 R01403 2 R03647 AMP 2 R02961 SAM 3 R01778 3 R00112 NAP 2 R00330 GDP 2 R01135 GDP 4 MISC SAM 3 R03552 ACO 2 R00343 NAP 2 R00625 NAP 2 R01130 3 R02094 TMP 2 R00291 GDU 2 R03522 GTT 12 55 (34/9) 25 NDP (9/3) 38 NAP (18/8) 11 FAD (9/3) R00939 NAP 2 R01041 NAP 4 R01058 NAP 2 R02101 6 R00965 U5P 2 R00966 U5P 2 R01146 PQQ 3 R00190 PRP 2 R01402 MTA 2 R01195 NAP 2 R02477 NAP 2 R01229 5GP 2 MISC 16 R03435 I3P 2 R02886 CBI 4 R00703 NAI 2 R00939 NDP 5 MISC ADP 19 MISC AMP 10 R01590 ACD 2 R00529 ADX 2 R01063 NDP 2 MISC A3P 5 R03491 SIA 2 R01195 NDP 2 MISC 21 MISC GTP 2 MISC UDP 4 R00137 NMN 3 R03992 MYA 2 MISC NAP 20 MISC NAH 2 MISC 1 MISC 5GP 1 R03509 137 2 MISC etc etc 21 (5/2) 29 ADP (10/5) 6 GDP (6/2) 12 COA (5/2) MISC NAI 2 MISC NDP 16 Evaluation Method Leave-one-out classification experiment Match every ligand against all the others in data set Log a hit when best match performs same reaction Report percentage of hits (correctly classified ligands) Evaluation Method Leave-one-out classification experiment Match every ligand against all the others in data set Log a hit when best match performs same reaction Report percentage of hits (correctly classified ligands)...... Query 1st 2nd 3rd 4th Query 1st 2nd 3rd 4th Same Class Different Class Evaluation Method Leave-one-out classification experiment Match every ligand against all the others in data set Log a hit when best match performs same reaction Report percentage of hits (correctly classified ligands) Evaluation Method Classification rate is 33% is this example......... Query 1st 2nd 3rd 4th Nearest Neighbor Matches HIT Query 1st 2nd 3rd 4th... 10

Sequence Alignment Method Use FASTA to compute Smith-Waterman score for every pair of SCOP domains contacting ligand > fasta34 d1gv0a d1guya 10 20 30 40 50 60 d1gv0a AGVLDSARFRSFIAMELGVSMQDVTACVLGGHGDAMVPVVKYTTVAGIPVADLISAERIA :::::.::.:.::::: :::..:: :..::::: :::....:..::::...:..:.: d1guya AGVLDAARYRTFIAMEAGVSVEDVQAMLMGGHGDEMVPLPRFSTISGIPVSEFIAPDRLA 10 20 30 40 50 60 70 80 90 100 110 120 d1gv0a ELVERTRTGGAEIVNHLKQGSAFYSPATSVVEMVESIVLDRKRVLTCAVSLDGQYGIDGT..::::: ::.:::: :: :::.:.::...:::... :.:::. :. : ::::.. d1guya QIVERTRKGGGEIVNLLKTGSAYYAPAAATAQMVEAVLKDKKRVMPVAAYLTGQYGLNDI 70 80 90 100 110 120 130 140 150 160 d1gv0a FVGVPVKLGKNGVEHIYEIKLDQSDLDLLQKSAKIVDENCKML. :::: ::.:::.: :. :.... ::. ::: : d1guya YFGVPVILGAGGVEKILELPLNEEEMALLNASAKAVRATLDTL 130 140 150 160 54.487% identity 156 out of 163 amino acids overlap Smith-Waterman score: 588 Sequence Alignment Method Use FASTA to compute Smith-Waterman score for every pair of SCOP domains contacting ligand > fasta34 d1gv0a d1guya 10 20 30 40 50 60 d1gv0a AGVLDSARFRSFIAMELGVSMQDVTACVLGGHGDAMVPVVKYTTVAGIPVADLISAERIA :::::.::.:.::::: :::..:: :..::::: :::....:..::::...:..:.: d1guya AGVLDAARYRTFIAMEAGVSVEDVQAMLMGGHGDEMVPLPRFSTISGIPVSEFIAPDRLA 10 20 30 40 50 60 54.487% identity 156 out of 163 amino acids overlap 70 80 90 100 110 120 d1gv0a ELVERTRTGGAEIVNHLKQGSAFYSPATSVVEMVESIVLDRKRVLTCAVSLDGQYGIDGT..::::: ::.:::: :: :::.:.::...:::... :.:::. :. : ::::.. d1guya QIVERTRKGGGEIVNLLKTGSAYYAPAAATAQMVEAVLKDKKRVMPVAAYLTGQYGLNDI Smith-Waterman score: 588 70 80 90 100 110 120 130 140 150 160 d1gv0a FVGVPVKLGKNGVEHIYEIKLDQSDLDLLQKSAKIVDENCKML. :::: ::.:::.: :. :.... ::. ::: : d1guya YFGVPVILGAGGVEKILELPLNEEEMALLNASAKAVRATLDTL 130 140 150 160 54.487% identity 156 out of 163 amino acids overlap Smith-Waterman score: 588 Sequence Alignment Method Sequence Alignment Results Use FASTA to compute Smith-Waterman score for every pair of SCOP domains contacting ligand Similarity matrix: GTT GDP FAD ADP NAP NDP D( A, B) = 1/ max SmithWaterman( Ai, B j) Ai A, B j B NAI NDP NAP ADP COA FAD GDP GTT U5P 1/SmithWaterman Score: (Darker means better match) CBI Sequence Alignment Results Sequence Alignment Results Tier matrix: GTT GDP FAD ADP NAP NDP Classification rate FASTA = 68% Random = <1% GTT GDP FAD ADP NAP NDP NAI NDP NAI NDP NAP NAP ADP ADP COA FAD COA FAD GDP GDP GTT GTT U5P U5P Best Matches: (Beige = Best match) (Yellow = 1 st tier match) (Orange = 2 nd tier match) CBI Best Matches: (Beige = Best match) (Yellow = 1 st tier match) (Orange = 2 nd tier match) CBI 11

Structure Alignment Method Structure Alignment Results Use CE to compute similarity of protein structures Similarity matrix: GTT GDP FAD ADP NAP NDP CE - ~/ebi/data/pdbs/1jsu.pdb A ~/ebi/data/pdbs/1hcl.pdb _ scratch Structure Alignment Calculator, version 1.02, last modified: Jun 15, 2001. CE Algorithm, version 1.00, 1998. Rmsd = 2.28A Z-Score = 6.8 Gaps = 30(11.5%) Alignment length = 262 Rmsd = 2.28A Z-Score = 6.8 Gaps = 30(11.5%) CPU = 14s Sequence identities = 94.7% X2 = ( 0.997420)*X1 + ( 0.071548)*Y1 + ( 0.005923)*Z1 + ( -93.687386) Y2 = ( 0.059473)*X1 + (-0.777232)*Y1 + (-0.626397)*Z1 + ( 119.695427) Z2 = (-0.040214)*X1 + ( 0.625133)*Y1 + (-0.779482)*Z1 + ( 84.334198) NAI NDP NAP ADP COA FAD GDP GTT U5P 1/CE -Z-Score: (Darker means better match) CBI Image from Shindyalov and Bourne (1998) Structure Alignment Results Structure Alignment Results Tier matrix: GTT GDP FAD ADP NAP NDP NAI NDP Classification rate: FASTA = 68% CE = 65% Random = <1% NAP ADP COA FAD GDP GTT U5P Best Matches: (Beige = Best match) (Yellow = 1 st tier match) (Orange = 2 nd tier match) CBI Structure Alignment Results Classification rate: FASTA = 68% CE = 65% Random = <1% When Smith-Waterman 500: Sequence = 80% CE = 72% Random = <1% When Smith-Waterman < 500: CE = 53% FASTA = 44% Random = <1% Conclusion Many algorithms for structural alignment, differing according to Application: homology detection, drug design, etc. Granularity: atom, residue, fragment, SSE Representation: inter-molecular, intra-molecular Scoring: geometric, gaps, chemical, structural, etc. Correspondences: sequential, non-sequential Gap penalty: expect gaps near loops, etc. Flexibility: rigid, flexible Target: single protein, representative proteins, PDB None seems best for all situations All probably provide some benefit over sequence 12