Homology Modeling (Comparative Structure Modeling)
Aims of Structural Genomics
- High-throughput 3D structure determination and analysis
- To determine or predict the 3D structures of all the proteins encoded in the genome
- Up to 40% of the known protein sequences have at least one segment related to one or more known structures
  => Determine all of the folds
  => Use homology modeling to predict 3D structures
Growth in the PDB
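What is Homology?
- Homology: having a common evolutionary origin
- Homology cannot be partial: two proteins either share a common ancestor or they do not (it is similarity, not homology, that comes in degrees)
- An assertion of homology is a hypothesis
- The hypothesis is usually based on the extent of sequence similarity between proteins, though similar functions should be demonstrated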
Some Definitions
- Homologues (homologs): proteins that are evolutionarily related
- Orthologues (orthologs): homologues in different organisms that diverged through speciation
- Paralogues (paralogs): homologues that arose through gene duplication, typically found in the same organism
Basis of Homology Modeling
- 3D structures are conserved to a greater extent than primary structures
- Models of protein structure are developed from the structures of homologues
- Using a known structure as a template, calculate a 3D model of a protein for which only the sequence is known (the target)
Steps in Homology Modeling
Template Selection
- Identify protein structures related to the target and select those to be used as templates
- Involves searching a database (e.g., with BLAST at NCBI)
- Involves a certain amount of sequence alignment
Aligning Sequences
- Critical step in homology modeling
- Many options to consider
- Factors to consider:
  - Which algorithm to use
  - Which scoring method to apply
  - Whether and how to assign gap penalties
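To make the three factors above concrete, here is a minimal sketch of one possible choice for each: a global dynamic-programming alignment (Needleman-Wunsch), a simple match/mismatch score, and a linear gap penalty. The function name and the default score values are illustrative assumptions, not parameters recommended by the slides; real modeling pipelines normally use substitution matrices and affine gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b). The scoring values are
    illustrative defaults, not recommended settings.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,    # gap in b
                score[i][j - 1] + gap,    # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
```

Changing `gap` relative to `match` and `mismatch` shifts the balance between opening gaps and accepting mismatches, which is why the choice of gap penalty matters for the final model.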
Scoring Alignments
- Need some method of scoring to find the optimal alignment
- Four general types of scoring have been applied:
  - Identity: considers only identical residues
  - Genetic code: considers the number of base changes in DNA or RNA needed to interconvert the codons for the amino acids
  - Chemical similarity: considers physico-chemical properties
  - Observed substitutions: considers substitution frequencies observed in alignments of sequences (*used the most*)
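As a small illustration of the simplest of these schemes, identity scoring, the sketch below computes percent identity for two already-aligned sequences. The function name is hypothetical, and the choice to divide by the number of gap-free columns is an assumption; other denominators (alignment length, shorter sequence length) are also in use.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of gap-free alignment columns in which both residues are identical."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    identical = 0
    compared = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # columns containing a gap are not counted (assumed convention)
        compared += 1
        if x == y:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


if __name__ == "__main__":
    # 5 identical residues out of 6 gap-free columns
    print(round(percent_identity("ACDE-FG", "ACDQWFG"), 1))
```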
Scoring Matrices
- PAM40: short, highly similar sequences
- PAM160: detecting members of a protein family
- PAM250: longer, more divergent sequences
- BLOSUM90: short, highly similar sequences
- BLOSUM80: detecting members of a protein family
- BLOSUM62: most effective in finding all potential similarities
- BLOSUM30: longer, more divergent sequences
Log-Odds Matrix
$S_{i,j} = \log\left[\dfrac{q_{i,j}}{p_i \, p_j}\right]$
- $q_{i,j}$ = frequency of substitution between residues $i$ and $j$
- $p_i$, $p_j$ = probabilities of occurrence of residues $i$ and $j$ in proteins
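The log-odds definition can be tried out on a toy set of aligned residue pairs. The sketch below is only an illustration under stated assumptions: each aligned pair is counted in both orders so that the matrix is symmetric, the background probabilities are taken as the marginals of the pair frequencies, and the logarithm is base 2. The function name and the two-letter alphabet in the example are hypothetical; published matrices such as PAM and BLOSUM involve additional steps (sequence clustering, scaling, rounding) that are not reproduced here.

```python
import math
from collections import Counter


def log_odds_scores(aligned_pairs, base=2.0):
    """Compute S[i, j] = log(q[i, j] / (p[i] * p[j])) from aligned residue pairs.

    aligned_pairs: iterable of (residue_i, residue_j) taken from gap-free
    alignment columns. Pairs never observed are simply absent from the result.
    """
    pair_counts = Counter()
    for x, y in aligned_pairs:
        # Count both orders so that q[i, j] == q[j, i] (assumed convention).
        pair_counts[(x, y)] += 1
        pair_counts[(y, x)] += 1
    total = sum(pair_counts.values())

    # q[i, j]: observed frequency of the ordered pair (i, j)
    q = {pair: count / total for pair, count in pair_counts.items()}

    # p[i]: background probability of residue i, the marginal of q
    p = Counter()
    for (x, _), freq in q.items():
        p[x] += freq

    return {
        (x, y): math.log(freq / (p[x] * p[y]), base)
        for (x, y), freq in q.items()
    }


if __name__ == "__main__":
    toy_pairs = [("A", "A")] * 6 + [("A", "B")] * 2 + [("B", "B")] * 2
    for pair, s in sorted(log_odds_scores(toy_pairs).items()):
        print(pair, round(s, 2))
```

A positive score means the pair is observed more often than expected by chance, and a negative score means it is observed less often.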
Building the 3D Model
- Rigid body assembly
  - Rigid bodies from aligned sequences
  - Core region, loops, and side chains
- Satisfaction of spatial restraints
  - Generate restraints from templates
  - Assume distances and angles at aligned positions are similar in template and target
  - Minimize violations of all restraints using distance geometry or optimization techniques (e.g., with a force field)
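The idea of minimizing restraint violations can be illustrated with a deliberately small example. The sketch below uses plain gradient descent on a sum of squared distance violations; this is a stand-in chosen for brevity, not the distance-geometry or force-field procedure used by any particular modeling program. The function names, step size, iteration count, and the three restraints in the example are all illustrative assumptions.

```python
import math


def restraint_violation(coords, restraints):
    """Sum of squared differences between actual and target distances."""
    return sum(
        (math.dist(coords[i], coords[j]) - target) ** 2
        for i, j, target in restraints
    )


def minimize_restraints(coords, restraints, step=0.05, iterations=500):
    """Move points downhill on the violation function by gradient descent.

    coords: list of (x, y, z) starting positions.
    restraints: list of (i, j, target_distance) tuples, e.g. derived from a template.
    """
    coords = [list(c) for c in coords]
    for _ in range(iterations):
        grads = [[0.0, 0.0, 0.0] for _ in coords]
        for i, j, target in restraints:
            d = math.dist(coords[i], coords[j])
            if d == 0.0:
                continue  # direction undefined for coincident points
            factor = 2.0 * (d - target) / d
            for k in range(3):
                g = factor * (coords[i][k] - coords[j][k])
                grads[i][k] += g
                grads[j][k] -= g
        for c, g in zip(coords, grads):
            for k in range(3):
                c[k] -= step * g[k]
    return coords


if __name__ == "__main__":
    # Three pseudo-atoms with illustrative target distances (in angstroms).
    start = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
    restraints = [(0, 1, 3.8), (1, 2, 3.8), (0, 2, 5.5)]
    final = minimize_restraints(start, restraints)
    print(restraint_violation(start, restraints))
    print(restraint_violation(final, restraints))
```

In a real model the restraints would number in the thousands and would include angles and stereochemical terms as well as distances, so more robust optimizers are used.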
Evaluation of Model Quality
- Check for proper protein stereochemistry
  - ProCheck (http://biotech.ebi.ac.uk:8400/cgi-bin/sendquery): Ramachandran plot, bond lengths
  - Whatif (http://www.cmbi.kun.nl/gv/servers/wiwwwi): packing quality
  - Both are web servers
- Fitness of sequence to structure
  - ProsaII (http://lore.came.sbg.ac.at/services/prosa.html): program runs on Linux and Unix
  - Verify3D (http://www.doe-mbi.ucla.edu/services/verify_3d/): web server
Evaluating the 3D Model
- Ramachandran plot
- Planar peptide bonds
- Side chain conformations that correspond to those in a rotamer library
- Hydrogen bonding
- No bad atom-atom contacts
- ProCheck
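A Ramachandran plot is built from the backbone torsion angles phi (defined by atoms C of residue i-1, N, CA, C) and psi (defined by N, CA, C, N of residue i+1). The sketch below shows one common way to compute a torsion angle from four atomic positions; the function name is hypothetical and the coordinates in the example are synthetic points chosen to give simple angles, not atoms from a real structure.

```python
import math


def dihedral_degrees(p0, p1, p2, p3):
    """Torsion angle (degrees, in the range -180 to 180) about the p1-p2 bond."""
    def sub(a, b):
        return [a[k] - b[k] for k in range(3)]

    def dot(a, b):
        return sum(a[k] * b[k] for k in range(3))

    def cross(a, b):
        return [
            a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0],
        ]

    b0 = sub(p0, p1)          # from the central bond back to the first atom
    b1 = sub(p2, p1)          # central bond
    b2 = sub(p3, p2)          # from the central bond on to the last atom
    norm = math.sqrt(dot(b1, b1))
    b1 = [c / norm for c in b1]

    # Remove the components along the central bond, leaving the parts
    # of b0 and b2 that lie in the plane perpendicular to it.
    v = [b0[k] - dot(b0, b1) * b1[k] for k in range(3)]
    w = [b2[k] - dot(b2, b1) * b1[k] for k in range(3)]

    x = dot(v, w)
    y = dot(cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


if __name__ == "__main__":
    # Synthetic points: central bond along z, outer atoms at right angles.
    print(dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

Computing phi and psi for every residue and plotting psi against phi gives the Ramachandran plot; residues falling outside the commonly populated regions are candidates for closer inspection.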
Evaluating the 3D Model
- 3D-Profiler (Verify 3D)
- Based on statistical preferences of each of the 20 amino acids for particular environments within a protein
- Residue positions are characterized by environment
- Preferred environments are defined by three parameters:
  - Area of each residue that is buried
  - Fraction of side-chain area that is covered by polar atoms (i.e., O and N)
  - Local secondary structure
Refining the 3D Model
- Molecular dynamics (MD) and energy minimization
- Application of restraints based on experimental data (e.g., NMR, fluorescence)
Applications of the Model