Homology Modeling
Roberto Lins
EPFL - summer semester 2005
Disclaimer: course material is mainly taken from: P.E. Bourne & H. Weissig, Structural Bioinformatics; C.A. Orengo, D.T. Jones & J.M. Thornton, Bioinformatics: Genes, Proteins and Computers; A.M. Lesk, Introduction to Bioinformatics; A.D. Baxevanis & B.F. Ouellette, Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins; several online materials (George Washington University, University of Houston, Tel-Aviv University) and resources (RCSB, NCBI, SWISS-PROT), as well as personal research data.
Functional Genomics
[Diagram: Genome, Expressome, Proteome and Metabolome, each paired with "database" and "algorithm" labels; TERTIARY STRUCTURE (fold)]
Limitations of Experimental Methods
- Annotated proteins in the databank: ~100,000
- Total number including ORFs: ~700,000
- Proteins with known structure: ~5,000!
Dataset for analysis
- An ORF, or Open Reading Frame, is a region of a genome that potentially codes for a protein
- ORFs have been identified by whole-genome sequencing efforts
- ORFs with no known function are termed orphans
Structural Biology Consortia: Brute Force Approach Towards Structure Elucidation
- Aim to solve about 400 structures a year
- Employ an army of Ph.D.s and postdocs
- Large-scale expression and crystallization attempts
- Basic strategies remain the same: no (known) new tricks
- Targets that do not yield will be ignored
+ Enhances the statistical base for inferring sequence-structure relationships
Can we predict structure from sequence? GCTCCTCACTGTCTGTGTTTATTC TTTTAGCTTCTTCAGATCTTTTAG TCTGAGGAAGCCTGGCATGTGCA AATGAAGTTAACCTAA...
Comparative Modeling (Homology Modeling)
Basis
- Structure is much more conserved than sequence during evolution
- The higher the similarity, the higher the confidence in the modeled structure
Limited applicability
- A large number of proteins and ORFs have no similarity to proteins with known structure
What is homology modeling?
- It predicts the three-dimensional structure of a given protein sequence (target) based on an alignment to one or more known protein structures (templates).
- If similarity between the target sequence and the template sequence is detected, structural similarity can be assumed.
- In general, at least 30% sequence identity is required to generate a useful model.
- It can be used to understand function, activity, specificity, etc.
- It is of interest to drug companies wishing to do structure-aided drug design.
- It is a keystone of structural proteomics.
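A minimal sketch of the 30% rule of thumb, assuming the target and template are already aligned and gaps are written as "-". The function names and the choice of denominator (columns where both sequences have a residue) are illustrative assumptions; other definitions of percent identity are in use.

```python
def percent_identity(aligned_target: str, aligned_template: str) -> float:
    """Percent of gap-free alignment columns that hold identical residues."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    aligned = 0
    identical = 0
    for a, b in zip(aligned_target, aligned_template):
        if a == "-" or b == "-":
            continue  # columns containing a gap are not counted
        aligned += 1
        if a == b:
            identical += 1
    return 100.0 * identical / aligned if aligned else 0.0


def is_modelable(aligned_target: str, aligned_template: str,
                 threshold: float = 30.0) -> bool:
    """Apply the rule-of-thumb identity threshold (30% by default)."""
    return percent_identity(aligned_target, aligned_template) >= threshold


# Four identical residues out of five gap-free columns.
print(percent_identity("ACDE-G", "ACEEFG"))  # 80.0
```

The threshold is a parameter because the 30% figure is a guideline, not a hard limit.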
Homology modeling - applications
- Structure-based assessment of target druggability
- Structure-guided design of mutagenesis experiments
- Tool compound design for probing biological function
- Homology-model-based ligand design
- Design of in vitro test assays
- Structure-based prediction of drug metabolism and toxicity
Accuracy and application of protein structure
Does sequence similarity imply structure similarity?
[Plot regions: Safe zone (thanks to evolution!); Twilight zone]
[Plot: RMSD of backbone atoms (Å), from 0.0 to 2.5, versus % identical residues in core, from 100 to 0; Chothia & Lesk, 1986]
\[ \mathrm{RMSD} = \sqrt{\frac{1}{N_{atoms}} \sum_{i=1}^{N_{atoms}} d_i^2} \]
N_{atoms} = total number of atoms; d_i = distance between the coordinates of atom i at t_0 and t_n, when the structures are superimposed.
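The RMSD definition on this slide can be sketched as follows. This is a minimal illustration that assumes the two coordinate sets are already superimposed and list the same atoms in the same order; the function name and the example coordinates are made up.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two superimposed coordinate sets."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared_sum = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # d_i^2: squared distance between equivalent atoms
        squared_sum += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared_sum / len(coords_a))


# Hypothetical example: every atom is displaced by 1 Å along z.
structure_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
structure_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(structure_a, structure_b))  # 1.0
```

The optimal superposition itself (for example a least-squares fit) is a separate step that is not shown here.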
My target sequence has over 30% sequence identity with a known protein structure, so I want to generate a 3D model. What do I have to do?
Structure prediction by homology modeling
Homology modeling makes two fundamental assumptions
- The structure of a protein is determined by its primary amino acid sequence (Anfinsen).
- During evolution, the structure of a protein has changed much more slowly than its sequence: similar sequences adopt nearly identical structures, and distantly related sequences still fold into similar structures.
In summary: homology modeling steps
1) Template recognition & initial alignment
2) Alignment correction
3) Backbone generation
4) Loop modeling
5) Side-chain modeling
6) Model optimization
7) Model validation
Template recognition & initial alignment
- Select the best template from a library of known protein structures derived from the PDB
- Templates can be found by using the target sequence as the query in a FASTA or BLAST search
Gaining confidence in template searching
- Once a suitable template is found, a literature search on the relevant fold can determine what biological role it plays
- Does this match the biological/biochemical function that you expect?
- Are ligand(s) present?
- What is the resolution of the template?
- Does it belong to a family of proteins?
- Are multiple templates available?
Further Considerations
Proteins are homologous if they are related by divergence from a common ancestor
- Duplication gives paralogues: function may be related or very different!
- Speciation gives orthologues (species 1, species 2): function more likely to be conserved
In summary: there are two types of homologs
- Orthologs: proteins that carry out the same function in different species
- Paralogs: proteins that perform different, but related, functions within one organism
Alignment of the target onto the template
- A correct alignment is necessary to create the most probable 3D structure of the target
- If the sequences are aligned incorrectly, the result will be false positives or false negatives
- Important to consider: algorithms, scoring of alignments, gap penalties
- Identify SCRs (Structurally Conserved Regions) and SVRs (Structurally Variable Regions)
Alignment Outcome The (true) alignment indicates the evolutionary process giving rise to the different sequences starting from the same ancestor sequence and then changing through mutations (insertions, deletions, and substitutions)
Alignment vs. databases Task: given a query sequence and millions of database records, find the optimal alignment between the query and a record AGTCTCCAGTTATGCCA
Alignment vs. databases
- Tool: given two sequences, there exists an algorithm to find the best alignment.
- Naïve solution: apply the algorithm to each of the records, one by one.
- Problem: an exact algorithm is just too slow to run millions of times (even a linear-time algorithm will run slowly on a huge database).
- Solution: run in parallel (expensive), or use a fast (heuristic) method to discard irrelevant records and then apply the exact algorithm to the remaining few.
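The "filter first, then align exactly" idea can be sketched as below. This is a toy illustration, not how any particular search tool is implemented: the shared k-mer filter, the scoring values, the function names and the three database records are all assumptions chosen for the example. Smith-Waterman local alignment stands in for the exact algorithm.

```python
def kmers(seq, k=3):
    """Set of all overlapping words of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (exact dynamic programming, linear gaps)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def search(query, database, k=3, min_shared=2):
    """Discard records sharing few k-mers, then score the rest exactly."""
    query_kmers = kmers(query, k)
    hits = []
    for name, seq in database.items():
        if len(query_kmers & kmers(seq, k)) < min_shared:
            continue  # cheap heuristic filter: skip the expensive alignment
        hits.append((smith_waterman_score(query, seq), name))
    return sorted(hits, reverse=True)


QUERY = "AGTCTCCAGTTATGCCA"
DATABASE = {  # hypothetical records
    "rec1": "AGTCTCCAGTTATGCCA",
    "rec2": "GGGGGGGGGG",
    "rec3": "TTTAGTCTCCAGAAA",
}
print(search(QUERY, DATABASE))
```

The filter can miss true hits that share no exact word with the query; that loss of sensitivity is the price of the speed-up.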
Sequence alignment algorithms
Used to calculate a similarity score to infer homology between two sequences. The two most used in homology modeling are:
- BLAST: the general strategy is to optimise the maximal segment pair (MSP) score; BLAST computes similarity, not alignment (Altschul, S. F., Gish, W., Miller, W., Myers, E. W., Lipman, D. J., J. Mol. Biol. (1990) 215:403-410)
- FastA (local alignment): searches for both full and partial sequence matches, i.e., local similarity is obtained; more sensitive than BLAST, but slower; many gaps may represent a problem (Pearson, W. R., Lipman, D. J., P.N.A.S. (1988) 85:2444-2448)
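To make the maximal segment pair (MSP) concrete: it is the highest-scoring pair of equal-length segments, one from each sequence, aligned without gaps. The sketch below finds it by scanning every diagonal exhaustively. That exhaustive scan is an assumption made for clarity; BLAST itself uses word seeding and extension with a substitution matrix rather than this brute-force search, and the match/mismatch scores here are placeholders.

```python
def maximal_segment_pair(a, b, match=1, mismatch=-1):
    """Return (score, start_a, start_b, length) of the best ungapped segment pair."""
    best = (0, 0, 0, 0)
    # Each diagonal pairs a[i] with b[j] for a fixed offset i - j.
    for offset in range(-(len(b) - 1), len(a)):
        i, j = (offset, 0) if offset >= 0 else (0, -offset)
        running, start_i, start_j = 0, i, j
        while i < len(a) and j < len(b):
            running += match if a[i] == b[j] else mismatch
            if running <= 0:
                # A non-positive prefix can never help: restart after it.
                running, start_i, start_j = 0, i + 1, j + 1
            elif running > best[0]:
                best = (running, start_i, start_j, i - start_i + 1)
            i += 1
            j += 1
    return best


score, start_a, start_b, length = maximal_segment_pair("TTACGTACGTT", "GGACGTACGCC")
print(score, "TTACGTACGTT"[start_a:start_a + length])
```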
Sequence alignment outputs
[Figure panels: BLAST output; FastA output]
Alignment corrections
- Alignments are scored (substitution score) in order to define the similarity between two amino acid residues in the sequences
- A substitution score is calculated for each aligned pair of letters
Substitution matrices:
- aim to reflect the probabilities of mutations occurring over a period of evolution
- PAM family: based on global alignments of closely related proteins; a mutation probability matrix
- BLOSUM family: based on observed alignments (conserved blocks), not extrapolated from comparisons of closely related proteins
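Substitution scores of this kind are usually log-odds values: the observed frequency of an aligned pair divided by the frequency expected by chance. The sketch below derives such scores from a list of aligned residue pairs. It is a simplified illustration under stated assumptions: the two-letter alphabet and pair counts are invented, and real matrices such as BLOSUM add sequence clustering, scaling and rounding that are not reproduced here.

```python
import math
from collections import Counter


def log_odds_scores(aligned_pairs):
    """Log2 odds of observing each residue pair versus chance pairing."""
    pair_counts = Counter(tuple(sorted(pair)) for pair in aligned_pairs)
    total_pairs = sum(pair_counts.values())

    # Background residue frequencies taken from the same aligned pairs.
    residue_counts = Counter()
    for (a, b), n in pair_counts.items():
        residue_counts[a] += n
        residue_counts[b] += n
    total_residues = 2 * total_pairs

    scores = {}
    for (a, b), n in pair_counts.items():
        observed = n / total_pairs
        p_a = residue_counts[a] / total_residues
        p_b = residue_counts[b] / total_residues
        # Unordered pair of different residues can arise in two ways.
        expected = p_a * p_b if a == b else 2 * p_a * p_b
        scores[(a, b)] = math.log2(observed / expected)
    return scores


# Invented counts: "A" and "B" mostly stay unchanged, rarely swap.
PAIRS = [("A", "A")] * 6 + [("B", "B")] * 2 + [("A", "B")] * 2
for pair, score in sorted(log_odds_scores(PAIRS).items()):
    print(pair, round(score, 2))
```

Pairs seen more often than chance get a positive score (conservative substitutions), pairs seen less often get a negative one.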
Gap Penalties
- A gap is one or more empty spaces in one sequence aligned with letters in the other sequence
- These empty spaces may or may not be penalized
- Typically a higher penalty is assigned to the first missing amino acid than to the subsequent ones, reflecting the fact that a single mutational event can insert or delete many residues at a time
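The scheme described here, where opening a gap costs more than extending it, is commonly called an affine gap penalty. A minimal sketch follows; the default values 10 and 0.5 are placeholder assumptions, not values taken from the course.

```python
def affine_gap_penalty(length, gap_open=10.0, gap_extend=0.5):
    """Penalty for one gap of the given length: costly to open, cheap to extend."""
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


def linear_gap_penalty(length, per_residue=10.0):
    """For comparison: every missing residue costs the same."""
    return per_residue * max(length, 0)


# One five-residue gap versus five separate one-residue gaps.
print(affine_gap_penalty(5), 5 * affine_gap_penalty(1))
```

With the affine form a single long gap is much cheaper than several scattered short gaps, which matches the idea that one mutational event can insert or delete many residues.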
Gap Penalties
- Insertion/deletion of structural domains can easily be done at loop sites
[Figure labels: N and C termini]
Gap Penalties The overall alignment score is the sum of similarity and gap scores: the higher the overall alignment score, the better the alignment (more conserved)
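As a sketch of how the overall score is assembled, the function below adds up the similarity score of every residue pair and the (negative) gap scores of every gap column in a given alignment. The match/mismatch values stand in for a real substitution matrix, and all numeric defaults are assumptions for illustration.

```python
def alignment_score(row_a, row_b, match=2, mismatch=-1,
                    gap_open=-5, gap_extend=-1):
    """Sum of similarity scores and gap scores over all alignment columns."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    score = 0
    gap_in_a = gap_in_b = False  # was the previous column a gap in that row?
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        if a == "-":
            score += gap_extend if gap_in_a else gap_open
            gap_in_a, gap_in_b = True, False
        elif b == "-":
            score += gap_extend if gap_in_b else gap_open
            gap_in_a, gap_in_b = False, True
        else:
            score += match if a == b else mismatch
            gap_in_a = gap_in_b = False
    return score


# Two candidate alignments of the same pair of sequences.
print(alignment_score("ACGTTA", "AC--TA"))
print(alignment_score("ACGTTA", "A-C-TA"))
```

The first alignment, with one contiguous gap, scores higher than the second, with two separate gaps and a mismatch, so it would be preferred.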
Corrections by hand may still be needed!
Multiple Sequence Alignments
Multiple nucleotide or amino acid sequence alignments are usually performed for one of the following purposes:
- to characterize protein families and identify shared regions of homology (this generally happens when a sequence search has revealed homologies to several sequences);
- to determine the consensus sequence of several aligned sequences;
- to help predict the secondary and tertiary structures of new sequences;
- as a preliminary step in molecular evolution analysis, using phylogenetic methods to construct phylogenetic trees.
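Determining a consensus sequence from an existing multiple alignment can be sketched as a per-column majority vote. The tie-breaking rule (alphabetical order) and the decision to let the gap character "-" compete like any residue are assumptions of this example, and the four aligned sequences are invented.

```python
from collections import Counter


def consensus(alignment):
    """Most frequent symbol in each column of an aligned set of sequences."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("need at least one row, all of equal length")
    result = []
    for column in zip(*alignment):
        counts = Counter(column)
        # sorted() makes ties resolve to the alphabetically first symbol.
        symbol, _ = max(sorted(counts.items()), key=lambda item: item[1])
        result.append(symbol)
    return "".join(result)


ALIGNMENT = [  # hypothetical aligned sequences
    "ACDEF",
    "ACDQF",
    "AC-EF",
    "GCDEF",
]
print(consensus(ALIGNMENT))  # ACDEF
```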
Backbone generation
- Uses known structurally conserved regions to generate coordinates for the unknown structure
- For SCRs: copy coordinates from known structures
- For variable regions (VRs): copy from a known structure if the residue types are similar; otherwise, use databases of loop fragments
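The coordinate-copying step can be sketched as follows, assuming a target-template alignment with "-" for gaps and one coordinate triple per template residue. This is a deliberate simplification: it copies coordinates for every aligned pair without checking residue similarity, and it marks target residues that face a gap as None so that a later loop-modeling step can fill them in. All names and coordinates are hypothetical.

```python
def copy_backbone(aligned_target, aligned_template, template_coords):
    """Pair each target residue with template coordinates, or None if unaligned."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("alignment rows must have equal length")
    model = []
    template_index = 0  # position in the ungapped template
    for target_res, template_res in zip(aligned_target, aligned_template):
        coords = None
        if template_res != "-":
            if target_res != "-":
                coords = template_coords[template_index]
            template_index += 1  # consume this template residue either way
        if target_res != "-":
            model.append((target_res, coords))
    return model


TEMPLATE_COORDS = [  # one point per template residue (A, D, E, F)
    (0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0),
]
for residue, coords in copy_backbone("ACD-F", "A-DEF", TEMPLATE_COORDS):
    print(residue, coords)
```

In the example, target residue C has no template partner and is left without coordinates, while template residue E is skipped because the target has a deletion there.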
Backbone generation
Template-based fragment assembly
a) Find structurally conserved regions
b) Build model core
Loop modeling
1. Database search for segments from known protein structures fitting fixed end-points
2. Molecular mechanics/molecular dynamics
3. Combination of 1+2
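Option 1 can be sketched as a filter on end-to-end distance: a database fragment is a candidate if the distance between its first and last residue matches the distance between the two fixed anchor points in the model. This is only a first screening criterion under assumed names, tolerance and toy coordinates; practical loop searches also compare anchor geometry and check for clashes.

```python
import math


def rank_loop_candidates(anchor_start, anchor_end, loop_length,
                         fragments, tolerance=1.0):
    """Names of fragments whose end-to-end distance fits the anchors, best first."""
    target_span = math.dist(anchor_start, anchor_end)
    hits = []
    for name, coords in fragments.items():
        # A usable fragment holds the loop residues plus one anchor on each side.
        if len(coords) != loop_length + 2:
            continue
        deviation = abs(math.dist(coords[0], coords[-1]) - target_span)
        if deviation <= tolerance:
            hits.append((deviation, name))
    return [name for _, name in sorted(hits)]


FRAGMENTS = {  # hypothetical fragment library, one point per residue
    "f1": [(0, 0, 0), (2, 2, 0), (4, 2, 0), (6.5, 0, 0)],
    "f2": [(0, 0, 0), (1, 1, 0), (2, 1, 0), (3, 0, 0)],
    "f3": [(1, 0, 0), (2, 2, 0), (5, 2, 0), (7.1, 0, 0)],
    "f4": [(0, 0, 0), (3, 2, 0), (6, 0, 0)],
}
print(rank_loop_candidates((0, 0, 0), (6, 0, 0), 2, FRAGMENTS))
```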
Loop modeling Ab initio rebuilding (e.g., Monte Carlo, MD, etc) to build missing loops
Side chain modeling
1. Use of rotamer libraries (backbone dependent)
2. Molecular mechanics optimization
- Dead-end elimination (exact pruning criterion)
- Monte Carlo (heuristic)
- Branch & Bound (exact)
3. Mean-field methods
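As an illustration of dead-end elimination, the sketch below applies the simplest form of the criterion: a rotamer is discarded when some other rotamer at the same position is better even in its worst case than the first is in its best case. The energy tables, data layout and function names are assumptions invented for this toy example with two positions.

```python
def pair_energy(pair_e, i, r, j, s):
    """Pairwise energy between rotamer r at position i and rotamer s at position j."""
    return pair_e[(i, j)][r][s] if i < j else pair_e[(j, i)][s][r]


def dead_end_elimination(self_e, pair_e):
    """Return, per position, the set of rotamer indices that survive pruning."""
    n = len(self_e)
    alive = [set(range(len(energies))) for energies in self_e]
    changed = True
    while changed:  # repeat until no more rotamers can be removed
        changed = False
        for i in range(n):
            others = [j for j in range(n) if j != i]
            for r in sorted(alive[i]):
                # Best case for r: most favourable partner at every other position.
                best_r = self_e[i][r] + sum(
                    min(pair_energy(pair_e, i, r, j, s) for s in alive[j])
                    for j in others)
                for t in sorted(alive[i] - {r}):
                    # Worst case for t: least favourable partner everywhere.
                    worst_t = self_e[i][t] + sum(
                        max(pair_energy(pair_e, i, t, j, s) for s in alive[j])
                        for j in others)
                    if best_r > worst_t:
                        alive[i].discard(r)
                        changed = True
                        break
    return alive


SELF_E = [[0.0, 5.0], [1.0, 0.0, 4.0]]                    # toy self energies
PAIR_E = {(0, 1): [[0.0, 0.2, 0.5], [0.5, 0.0, 1.0]]}     # toy pair energies
print(dead_end_elimination(SELF_E, PAIR_E))
```

In this small case the pruning leaves one rotamer per position; in general several may survive and a search over the remaining combinations is still needed.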
Model optimization
- Molecular mechanics methods
Model validation/evaluation
The model should be evaluated for:
- correctness of the overall fold/structure
- errors over localized regions
- stereochemical parameters: bond lengths, angles, etc.
Some software for model verification:
- Procheck http://www.biochem.ucl.ac.uk/~roman/procheck/procheck.html
- WHAT IF http://swift.cmbi.kun.nl/whatif
- PROSA II http://www.came.sbg.ac.at/services/prosa.html
- Profile 3D & Verify 3D http://shannon.mbi.ucla.edu/doe/services
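A very small example of a stereochemical check, far simpler than what the tools listed above do: consecutive C-alpha atoms joined by a trans peptide bond lie about 3.8 Å apart, so larger deviations point to a chain break or a distorted region. The tolerance, function name and coordinates below are assumptions for illustration.

```python
import math


def flag_ca_distances(ca_coords, ideal=3.8, tolerance=0.2):
    """List (i, i+1, distance) for consecutive C-alpha pairs far from the ideal."""
    flagged = []
    for i in range(len(ca_coords) - 1):
        distance = math.dist(ca_coords[i], ca_coords[i + 1])
        if abs(distance - ideal) > tolerance:
            flagged.append((i, i + 1, round(distance, 2)))
    return flagged


# Hypothetical trace: the last pair is too far apart.
CA_TRACE = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (13.0, 0.0, 0.0)]
print(flag_ca_distances(CA_TRACE))
```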
Model validation/evaluation The Ramachandran plot
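The Ramachandran plot shows the backbone dihedral angles phi and psi of each residue. Phi is the dihedral defined by C(i-1), N(i), CA(i), C(i), and psi the one defined by N(i), CA(i), C(i), N(i+1). A minimal sketch of the dihedral calculation from four atom positions is given below; the helper names are made up and the test points are artificial rather than real protein coordinates.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees about the p1-p2 bond, in (-180, 180]."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)  # normals of the two planes
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# phi = dihedral(C_prev, N, CA, C); psi = dihedral(N, CA, C, N_next)
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

Computing phi and psi for every residue and plotting them against each other gives the points of the Ramachandran plot; which regions count as allowed is decided by the validation tool, not by this sketch.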
Model validation/evaluation
Profile 3D & Verify 3D:
- verify newly solved structures or homology models
- find structures/folds compatible with a given sequence
- find sequences compatible with a known structure/fold from a database of sequences