Homology Modeling. Roberto Lins EPFL - summer semester 2005


Disclaimer: course material is mainly taken from: P.E. Bourne & H. Weissig, Structural Bioinformatics; C.A. Orengo, D.T. Jones & J.M. Thornton, Bioinformatics: Genes, Proteins and Computers; A.M. Lesk, Introduction to Bioinformatics; A.D. Baxevanis & B.F. Ouellette, Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins; several online materials (George Washington University, University of Houston, Tel-Aviv University) and resources (RCSB, NCBI, SWISS-PROT), as well as personal research data.

Functional Genomics
[Figure: genome, expressome, proteome (tertiary structure, i.e. fold) and metabolome, each linked to the next through algorithms and databases]

Limitations of Experimental Methods
- Annotated proteins in the databank: ~100,000
- Total number including ORFs: ~700,000
- Proteins with known structure: ~5,000! (dataset for analysis)
- An ORF, or Open Reading Frame, is a region of the genome that codes for a protein
- ORFs have been identified by whole-genome sequencing efforts
- ORFs with no known function are termed orphans

Structural Biology Consortia: Brute-Force Approach Towards Structure Elucidation
- Aim to solve about 400 structures a year
- Employment of an army of Ph.D.s and postdocs
- Large-scale expression and crystallization attempts
- Basic strategies remain the same
- No (known) new tricks
- Unrelenting ones will be ignored
+ Enhances the statistical base for inferring sequence-structure relationships

Can we predict structure from sequence? GCTCCTCACTGTCTGTGTTTATTC TTTTAGCTTCTTCAGATCTTTTAG TCTGAGGAAGCCTGGCATGTGCA AATGAAGTTAACCTAA...

Comparative Modeling (Homology Modeling)
Basis
- Structure is much more conserved than sequence during evolution
- The higher the similarity, the higher the confidence in the modeled structure
Limited applicability
- A large number of proteins and ORFs have no similarity to proteins with known structure

What's homology modeling?
- Predicts the three-dimensional structure of a given protein sequence (target) based on an alignment to one or more known protein structures (templates)
- If similarity between the target sequence and the template sequence is detected, structural similarity can be assumed
- In general, about 30% sequence identity is required to generate a useful model
- It can be used to understand function, activity, specificity, etc.
- It is of interest to drug companies wishing to do structure-aided drug design
- A keystone of structural proteomics

Homology modeling - applications
- Structure-based assessment of target druggability
- Structure-guided design of mutagenesis experiments
- Tool compound design for probing biological function
- Homology-model-based ligand design
- Design of in vitro test assays
- Structure-based prediction of drug metabolism and toxicity

Accuracy and application of protein structure

Does sequence similarity imply structure similarity?
[Figure: safe zone (thanks to evolution!) and twilight zone]

[Figure: RMSD of backbone atoms (Å), from 0.0 to 2.5, plotted against % identical residues in the core, from 100 to 0; Chothia & Lesk, 1986]
$$ \mathrm{RMSD} = \sqrt{\frac{1}{N_{atoms}} \sum_{i=1}^{N_{atoms}} d_i^2} $$
N_{atoms} = total number of atoms; d_i = distance between the coordinates of atom i at t_0 and t_n, when the structures are superimposed.
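The RMSD definition above translates directly into a few lines of code. The following is a minimal sketch, assuming the two structures are already superimposed and given as equal-length lists of (x, y, z) tuples; the function name `rmsd` is chosen here for illustration and is not from the course material.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two superimposed coordinate sets.

    Assumes atom i in coords_a corresponds to atom i in coords_b and that
    the superposition has already been done elsewhere.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate sets must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # d_i^2: squared distance between the two positions of atom i
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

# Every atom displaced by 1 Å along z gives an RMSD of 1 Å
print(rmsd([(0, 0, 0), (1, 0, 0)], [(0, 0, 1), (1, 0, 1)]))
```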

My target sequence has over 30% sequence identity with a known protein structure, so I want to generate a 3D model. What do I have to do?
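Checking the 30% threshold requires a percent-identity value from an alignment. The sketch below is one possible way to compute it, assuming two pre-aligned strings with "-" as the gap character and counting identity over gap-free columns only. Other conventions for the denominator exist (for example the length of the shorter sequence), so the choice here is an assumption, as is the function name.

```python
def percent_identity(aligned_target, aligned_template, gap="-"):
    """Percent of gap-free aligned columns holding identical residues."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    aligned = 0
    identical = 0
    for a, b in zip(aligned_target, aligned_template):
        if a == gap or b == gap:
            continue  # columns containing a gap are not counted
        aligned += 1
        if a == b:
            identical += 1
    if aligned == 0:
        return 0.0
    return 100.0 * identical / aligned

# 4 identical residues out of 5 gap-free columns
print(percent_identity("ACDE-G", "ACQEFG"))
```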

Structure prediction by homology modeling

Homology modeling makes two fundamental assumptions
- The structure of a protein is determined by its primary amino acid sequence (Anfinsen).
- During evolution, the structure of a protein has changed much more slowly than its sequence. Similar sequences adopt practically identical structures, and distantly related sequences fold into similar structures.

In summary: homology modeling steps
1) Template recognition & initial alignment
2) Alignment correction
3) Backbone generation
4) Loop modeling
5) Side-chain modeling
6) Model optimization
7) Model validation

Template recognition & initial alignment
- Select the best template from a library of known protein structures derived from the PDB
- Templates can be found by using the target sequence as a query in a FASTA or BLAST search

Gaining confidence in template searching
- Once a suitable template is found, a literature search on the relevant fold can determine what biological role it plays
- Does this match the biological/biochemical function that you expect?
- Ligand(s) present?
- Resolution of the template
- Family of proteins
- Multiple templates?

Further Considerations:
Proteins are homologous if they are related by divergence from a common ancestor
- Duplication gives paralogues: function may be related or very different!
- Speciation gives orthologues (species 1, species 2): function more likely to be conserved

In summary: there are two types of homologs
- Orthologs: proteins that carry out the same function in different species
- Paralogs: proteins that perform different, but related functions within one organism

Alignment of the target onto the template
- Correct alignment is necessary to create the most probable 3D structure of the target
- If the sequences are aligned incorrectly, the outcome will be false positive or false negative results
- Important to consider: algorithms, alignment scoring, gap penalties
- Identify SCRs (Structurally Conserved Regions) and SVRs (Structurally Variable Regions)

Alignment Outcome The (true) alignment indicates the evolutionary process giving rise to the different sequences starting from the same ancestor sequence and then changing through mutations (insertions, deletions, and substitutions)

Alignment vs. databases Task: given a query sequence and millions of database records, find the optimal alignment between the query and a record AGTCTCCAGTTATGCCA

Alignment vs. databases
- Tool: given two sequences, there exists an algorithm to find the best alignment.
- Naïve solution: apply the algorithm to each of the records, one by one.
- Problem: an exact algorithm is just too slow to run millions of times (even a linear-time algorithm will run slowly on a huge database).
- Solution: run in parallel (expensive), or use a fast (heuristic) method to discard irrelevant records and then apply the exact algorithm to the remaining few
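The "discard first, align exactly afterwards" strategy can be sketched as follows. This is an illustration under stated assumptions, not how BLAST or FastA are actually implemented: the heuristic filter is a simple shared k-mer test, the exact step is a Smith-Waterman local alignment score with a linear gap penalty, and all names, the scoring values and the toy database are hypothetical.

```python
def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (exact dynamic programming, linear gaps)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # a local alignment may restart anywhere, hence the floor at 0
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

def search(query, database, k=4, min_shared=1):
    """Cheap k-mer filter first, exact scoring only for the survivors."""
    query_kmers = kmers(query, k)
    hits = []
    for name, seq in database.items():
        if len(query_kmers & kmers(seq, k)) < min_shared:
            continue  # heuristic step: discard records sharing no k-mer
        hits.append((smith_waterman_score(query, seq), name))
    return sorted(hits, reverse=True)

database = {
    "rec1": "TTTAGTCTCCAGTTTT",
    "rec2": "GGGGGGGGGGGG",
    "rec3": "CCAGTTATGCCA",
}
for score, name in search("AGTCTCCAGTTATGCCA", database):
    print(name, score)
```

The filter can miss true but weak hits that share no exact k-mer with the query; that loss of sensitivity is the price paid for speed.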

Sequence alignment algorithms
Used to calculate a similarity score to infer sequence homology between two sequences.
Examples: the two most used in homology modeling are:
- BLAST: the general strategy is to optimise the maximal segment pair (MSP) score; BLAST computes similarity, not alignment (Altschul, S. F., Gish, W., Miller, W., Myers, E. W., Lipman, D. J., J. Mol. Biol. (1990) 215:403-410)
- FastA (local alignment): searches for both full and partial sequence matches, i.e., local similarity is obtained; more sensitive than BLAST, but slower; many gaps may represent a problem (Pearson, W. R., Lipman, D. J., P.N.A.S. (1988) 85:2444-2448)

Sequence alignment outputs
[Figure: example BLAST and FastA outputs]

Alignment corrections
Alignments are scored in order to define the similarity between two amino acid residues in the sequences: a substitution score is calculated for each aligned pair of letters.
Substitution matrices:
- reflect the probabilities of mutations occurring over a period of evolution
- PAM family: based on global alignments of closely related proteins (mutation probability matrix)
- BLOSUM family: based on observed alignments of related sequences, with no extrapolation
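As an illustration of per-pair substitution scoring, the sketch below sums matrix entries over the aligned columns of two gap-free sequences. `TOY_MATRIX` is a tiny made-up table for illustration only; it is not a real PAM or BLOSUM matrix, and the fallback score for missing pairs is likewise an assumption.

```python
# Hypothetical toy scores, NOT a real PAM/BLOSUM matrix
TOY_MATRIX = {
    ("A", "A"): 4, ("L", "L"): 4, ("W", "W"): 11,
    ("I", "L"): 2, ("A", "W"): -3, ("L", "W"): -2,
}

def pair_score(a, b, matrix, default=-1):
    """Look up a symmetric substitution score; fall back to a default."""
    return matrix.get((a, b), matrix.get((b, a), default))

def substitution_score(seq_a, seq_b, matrix):
    """Sum of substitution scores over all aligned (gap-free) columns."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(pair_score(a, b, matrix) for a, b in zip(seq_a, seq_b))

# A/A (4) + L/I (2) + W/W (11)
print(substitution_score("ALW", "AIW", TOY_MATRIX))
```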

Gap Penalties
A gap is one or more empty spaces in one sequence aligned with letters in the other sequence.
These empty spaces may or may not be penalized:
- a higher penalty is assigned to the first missing amino acid than to the subsequent ones; this reflects the fact that a single mutational event can insert or delete many residues at a time

Gap Penalties

Gap Penalties
Insertion/deletion of structural domains can easily be done at loop sites
[Figure: protein chain with N and C termini labeled]

Gap Penalties The overall alignment score is the sum of similarity and gap scores: the higher the overall alignment score, the better the alignment (more conserved)
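The overall score of a given alignment, with a higher penalty for opening a gap than for extending it, can be sketched as below. The match, mismatch, gap-opening and gap-extension values are illustrative assumptions, and a simple match/mismatch score stands in for a full substitution matrix.

```python
def alignment_score(aln_a, aln_b, match=2, mismatch=-1,
                    gap_open=-5, gap_extend=-1):
    """Similarity score plus affine gap penalties for a given alignment."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    in_gap_a = in_gap_b = False  # is a gap currently open in row a / row b?
    for a, b in zip(aln_a, aln_b):
        if a == "-":
            score += gap_extend if in_gap_a else gap_open
            in_gap_a, in_gap_b = True, False
        elif b == "-":
            score += gap_extend if in_gap_b else gap_open
            in_gap_a, in_gap_b = False, True
        else:
            score += match if a == b else mismatch
            in_gap_a = in_gap_b = False
    return score

# One gap of length 2 is cheaper than two separate gaps of length 1
print(alignment_score("A--CG", "ATTCG"))  # one opening, one extension
print(alignment_score("A-C-G", "ATCTG"))  # two openings
```

With these values, a single two-residue gap costs 6 while two one-residue gaps cost 10, which is how the scheme favors one mutational event over several.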

Corrections by hand may still be needed!

Multiple Sequence Alignments
Multiple nucleotide or amino acid sequence alignments are usually performed for one of the following purposes:
- to characterize protein families and identify shared regions of homology (this generally happens when a sequence search has revealed homologies to several sequences);
- to determine the consensus sequence of several aligned sequences;
- to help predict the secondary and tertiary structures of new sequences;
- as a preliminary step in molecular evolution analysis using phylogenetic methods for constructing phylogenetic trees.
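Determining a consensus sequence from an existing multiple alignment can be sketched as a per-column majority vote. The gap handling and the alphabetical tie-break below are assumptions made for this illustration; real tools often apply frequency thresholds or ambiguity codes instead.

```python
from collections import Counter

def consensus(alignment, gap="-"):
    """Most frequent non-gap residue in each column of an alignment."""
    if not alignment or len({len(s) for s in alignment}) != 1:
        raise ValueError("alignment rows must be non-empty and of equal length")
    result = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c != gap)
        if not counts:
            result.append(gap)  # column made of gaps only
            continue
        top = max(counts.values())
        # ties are broken alphabetically so the output is deterministic
        result.append(min(c for c, n in counts.items() if n == top))
    return "".join(result)

print(consensus(["ACDE", "ACDF", "A-DF", "GCDF"]))
```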

Backbone generation
Uses known structurally conserved regions to generate coordinates for the unknown structure.
- For SCRs: copy coordinates from known structures
- For variable regions (VRs): copy from a known structure if the residue types are similar; otherwise, use databases of fragmented loop sequences
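Copying coordinates for aligned positions can be sketched as a walk along the target-template alignment. This is a simplified illustration, assuming one coordinate tuple per template residue (for example the C-alpha atom) and "-" as the gap character; target residues facing a template gap get `None` and are left for loop modeling. The function and variable names are hypothetical.

```python
def copy_backbone(aligned_target, aligned_template, template_coords):
    """Return [(target_residue, coords_or_None), ...] for the target."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    n_template = sum(1 for c in aligned_template if c != "-")
    if n_template != len(template_coords):
        raise ValueError("one coordinate entry per template residue expected")
    model = []
    t_idx = 0  # index of the current template residue
    for tgt, tpl in zip(aligned_target, aligned_template):
        coords = None
        if tpl != "-":
            if tgt != "-":
                coords = template_coords[t_idx]  # aligned pair: copy
            t_idx += 1
        if tgt != "-":
            model.append((tgt, coords))  # None marks a residue to build later
    return model

template_ca = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
for residue, coords in copy_backbone("AC-DE", "A-GDE", template_ca):
    print(residue, coords)
```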

Backbone generation
Template-based fragment assembly
a) Find structurally conserved regions
b) Build model core

Loop modeling

Loop modeling
1. Database search for segments from known protein structures fitting fixed end-points
2. Molecular mechanics/molecular dynamics
3. Combination of 1+2
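The database-search idea in option 1 can be sketched as a filter on fragment length and end-to-end distance. This is a deliberately reduced illustration: real loop searches also check the geometry of the flanking anchor residues and possible clashes with the rest of the model. The fragment library, tolerance and names below are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math

def candidate_loops(fragments, anchor_start, anchor_end,
                    loop_length, tolerance=1.0):
    """Names of fragments whose end-to-end distance fits the fixed anchors."""
    target = math.dist(anchor_start, anchor_end)
    hits = []
    for name, coords in fragments.items():
        if len(coords) != loop_length:
            continue  # wrong number of residues for this gap
        deviation = abs(math.dist(coords[0], coords[-1]) - target)
        if deviation <= tolerance:
            hits.append((deviation, name))
    return [name for _, name in sorted(hits)]  # best fit first

fragments = {
    "f1": [(0, 0, 0), (2, 1, 0), (4, 0, 0)],
    "f2": [(0, 0, 0), (1, 0, 0), (10, 0, 0)],
    "f3": [(0, 0, 0), (4.5, 0, 0)],
}
print(candidate_loops(fragments, (0, 0, 0), (4.2, 0, 0), loop_length=3))
```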

Loop modeling Ab initio rebuilding (e.g., Monte Carlo, MD, etc) to build missing loops

Side chain modeling
1. Use of rotamer libraries (backbone dependent)
2. Molecular mechanics optimization
   - Dead-end elimination (exact pruning)
   - Monte Carlo (heuristic)
   - Branch & Bound (exact)
3. Mean-field methods
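The combinatorial problem behind rotamer selection can be made concrete with a brute-force search over all rotamer combinations. This is plain exhaustive enumeration, not one of the methods listed above; those methods exist precisely because the number of combinations explodes for real proteins. The energy tables are made-up numbers and every name is hypothetical.

```python
from itertools import product

def best_rotamers(self_energy, pair_energy):
    """Exhaustively find the rotamer assignment with the lowest total energy.

    self_energy: {position: [energy of rotamer 0, rotamer 1, ...]}
    pair_energy: {(pos_i, rot_i, pos_j, rot_j): energy} with pos_i < pos_j;
                 missing pairs are treated as 0.
    """
    positions = sorted(self_energy)
    best_choice, best_total = None, float("inf")
    options = (range(len(self_energy[p])) for p in positions)
    for choice in product(*options):
        total = sum(self_energy[p][r] for p, r in zip(positions, choice))
        for i in range(len(positions)):
            for j in range(i + 1, len(positions)):
                key = (positions[i], choice[i], positions[j], choice[j])
                total += pair_energy.get(key, 0.0)
        if total < best_total:
            best_choice, best_total = choice, total
    return dict(zip(positions, best_choice)), best_total

# Rotamer 0 is best for each residue alone, but the pair (0, 0) clashes
self_energy = {1: [0.0, 1.0], 2: [0.0, 0.5]}
pair_energy = {(1, 0, 2, 0): 5.0}
print(best_rotamers(self_energy, pair_energy))
```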

Model optimization
- Molecular mechanics methods

Model validation/evaluation
The model should be evaluated for:
- correctness of the overall fold/structure
- errors over localized regions
- stereochemical parameters: bond lengths, angles, etc.
Some software for model verification:
- Procheck http://www.biochem.ucl.ac.uk/~roman/procheck/procheck.html
- WHAT IF http://swift.cmbi.kun.nl/whatif
- PROSA II http://www.came.sbg.ac.at/services/prosa.html
- Profile 3D & Verify 3D http://shannon.mbi.ucla.edu/doe/services

Model validation/evaluation The Ramachandran plot
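A Ramachandran plot shows the backbone dihedral angles phi and psi of each residue. The sketch below computes a dihedral angle from four atom positions; phi would use the atoms C(i-1), N(i), CA(i), C(i) and psi the atoms N(i), CA(i), C(i), N(i+1). The helper names are chosen here for illustration, and reading coordinates from a PDB file is left out.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Dihedral angle in degrees defined by four points (range -180..180)."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)  # unit vector along the central bond
    # components of b0 and b2 perpendicular to the central bond
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))

# Four points arranged so that the dihedral is a right angle
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```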

Model validation/evaluation

Model validation/evaluation
Profile 3D & Verify 3D:
- verify newly solved structures or homology models
- find structures/folds compatible with a given sequence
- find sequences compatible with a known structure/fold from a database of sequences