Homology Modeling (Comparative Structure Modeling) GBCB 5874: Problem Solving in GBCB

Aims of Structural Genomics
- High-throughput 3D structure determination and analysis
- Goal: determine or predict the 3D structures of all the proteins encoded in a genome
- Up to 40% of the known protein sequences have at least one segment related to one or more known structures
  => Determine all of the folds
  => Use homology modeling to predict 3D structures

Growth in the PDB

What is Homology?
- Homology: having a common evolutionary origin
- Cannot be partial: two proteins either share a common ancestor or they do not
- An assertion of homology is a hypothesis
- The hypothesis is usually based on the extent of sequence similarity between proteins, though similar functions should also be demonstrated

Some Definitions
- Homologues (homologs): proteins that are evolutionarily related
- Orthologues (orthologs): homologues in different organisms that diverged through speciation
- Paralogues (paralogs): homologues that arose through gene duplication, typically found in the same organism

Basis of Homology Modeling
- 3D structures are conserved to a greater extent than primary structures
- Develop models of protein structure based on the structures of homologues
- Using a known structure as a template, calculate a 3D model of a protein for which only the sequence is known (the target)

Steps in Homology Modeling

Template Selection
- Identify protein structures related to the target and select those to be used as templates
- Involves searching a database, such as those hosted at NCBI (e.g., with BLAST)
- Involves a certain amount of sequence alignment

Aligning Sequences
- Critical step in homology modeling
- Many options to consider, including:
  - Which algorithm to use
  - Which scoring method to apply
  - Whether and how to assign gap penalties
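
To make the three choices above concrete (algorithm, scoring, gap penalty), here is a minimal sketch of one possible algorithm: Needleman-Wunsch global alignment with a linear gap penalty. The function name `needleman_wunsch` and the default `match`, `mismatch`, and `gap` values are illustrative assumptions, not values taken from these slides; a real alignment would normally use a substitution matrix and affine gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b). Scoring values are illustrative.
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,     # gap in b
                              score[i][j - 1] + gap)     # gap in a

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = needleman_wunsch("MKVL", "MVL")
print(score)
print(top)
print(bottom)
```

With these toy settings the short example aligns `MKVL` against `M-VL`: three matches and one gap. Changing the gap penalty or the match/mismatch values can change which alignment is optimal, which is why these choices matter for the final model.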

Scoring Alignments
- Need some method of scoring to find the optimal alignment
- Four general types of scoring have been applied:
  - Identity: considers only identical residues
  - Genetic code: considers the number of base changes in DNA or RNA needed to interconvert the codons for the amino acids
  - Chemical similarity: considers physico-chemical properties
  - Observed substitutions: considers substitution frequencies observed in alignments of sequences (the most widely used)
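
As a rough illustration of two of these scoring types, the sketch below scores an existing alignment by identity and by a substitution table. The helper names (`percent_identity`, `alignment_score`), the `TOY_MATRIX` entries, and the gap penalty of -4 are assumptions made for this example only; a real analysis would load a full published matrix.

```python
def percent_identity(aligned_a, aligned_b):
    """Identity scoring: percent of gap-free columns with identical residues."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b)
             if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def alignment_score(aligned_a, aligned_b, matrix, gap=-4):
    """Observed-substitution scoring: sum matrix entries, charge `gap` per gap column."""
    total = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            total += gap
        else:
            # The table is symmetric, so look the pair up in either order.
            key = (x, y) if (x, y) in matrix else (y, x)
            total += matrix[key]
    return total


# Illustrative entries for three residues only (hypothetical toy table).
TOY_MATRIX = {
    ("L", "L"): 4, ("I", "I"): 4, ("K", "K"): 5,
    ("L", "I"): 2, ("L", "K"): -2, ("I", "K"): -3,
}

print(percent_identity("LIK", "LLK"))
print(alignment_score("LIK", "LLK", TOY_MATRIX))
```

Identity scoring treats the I/L column as a plain mismatch, while the substitution table still rewards it as a conservative replacement. That difference is the main reason observed-substitution matrices are preferred for divergent sequences.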

Scoring Matrices
- PAM40: short, highly similar sequences
- PAM160: detecting members of a protein family
- PAM250: longer, more divergent sequences
- BLOSUM90: short, highly similar sequences
- BLOSUM80: detecting members of a protein family
- BLOSUM62: most effective in finding all potential similarities
- BLOSUM30: longer, more divergent sequences

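Log-Odds Matrix
$S_{i,j} = \log\left[ \dfrac{q_{i,j}}{p_i \, p_j} \right]$
- $q_{i,j}$ = observed frequency of substitution between residues $i$ and $j$
- $p_i p_j$ = product of the probabilities of occurrence of residues $i$ and $j$ in proteins (the pairing frequency expected by chance)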
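
A small numerical sketch of the log-odds formula may help. The slide does not specify the base of the logarithm; base 2 is assumed here, and the frequencies in the example are made-up values chosen only to give round numbers. The function name `log_odds` is likewise just for illustration.

```python
import math


def log_odds(q_ij, p_i, p_j, base=2.0):
    """S_ij = log(q_ij / (p_i * p_j)); the base-2 default is an assumption."""
    if min(q_ij, p_i, p_j) <= 0:
        raise ValueError("frequencies must be positive")
    return math.log(q_ij / (p_i * p_j), base)


# Hypothetical frequencies: both residues occur with probability 0.1,
# so chance alone predicts the pair at frequency 0.1 * 0.1 = 0.01.
favoured = log_odds(q_ij=0.02, p_i=0.1, p_j=0.1)     # pair seen twice as often as chance
disfavoured = log_odds(q_ij=0.005, p_i=0.1, p_j=0.1) # pair seen half as often as chance
print(round(favoured, 3), round(disfavoured, 3))
```

A positive score means the substitution is observed more often than expected by chance, a negative score means it is observed less often, and a score of zero means the pair occurs at the chance frequency.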

Building the 3D Model
- Rigid body assembly
  - Rigid bodies taken from the templates, based on the aligned sequences
  - Core region, loops, and side chains
- Satisfaction of spatial restraints
  - Generate restraints from the templates
  - Assume that distances and angles between aligned residues are similar in the template and the target
  - Minimize violations of all restraints using distance geometry or optimization techniques (e.g., with a force field)
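
The idea of minimizing restraint violations can be sketched with a toy example: three points, three target distances, and plain gradient descent on the sum of squared distance violations. This is only an illustration of the principle, not how any particular modeling program implements it. The function names, step size, iteration count, and target distances are all assumptions chosen for the example.

```python
import math


def restraint_violation(coords, restraints):
    """Sum of squared deviations between current and target distances."""
    return sum((math.dist(coords[i], coords[j]) - target) ** 2
               for i, j, target in restraints)


def satisfy_restraints(coords, restraints, step=0.05, n_iter=2000):
    """Plain gradient descent on the squared distance violations."""
    coords = [list(p) for p in coords]  # work on a copy of the input
    for _ in range(n_iter):
        grads = [[0.0, 0.0, 0.0] for _ in coords]
        for i, j, target in restraints:
            d = math.dist(coords[i], coords[j])
            if d == 0.0:
                continue  # direction undefined for coincident points; skip
            factor = 2.0 * (d - target) / d
            for k in range(3):
                g = factor * (coords[i][k] - coords[j][k])
                grads[i][k] += g
                grads[j][k] -= g
        # Move every point downhill after all gradients are accumulated.
        for p, g in zip(coords, grads):
            for k in range(3):
                p[k] -= step * g[k]
    return coords


# Hypothetical restraints (i, j, target distance), as if copied from a template.
restraints = [(0, 1, 3.8), (1, 2, 3.8), (0, 2, 6.0)]
start = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]

model = satisfy_restraints(start, restraints)
print(restraint_violation(start, restraints))
print(restraint_violation(model, restraints))
```

In a real modeling run the restraints number in the thousands, include angles as well as distances, and are combined with stereochemical terms, so more robust optimizers are used. The objective has the same shape, though: a sum of penalties that is zero when every restraint is satisfied.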

Evaluation of Model Quality
- Check for proper protein stereochemistry
  - ProCheck (http://biotech.ebi.ac.uk:8400/cgi-bin/sendquery): Ramachandran plot, bond lengths
  - Whatif (http://www.cmbi.kun.nl/gv/servers/wiwwwi): packing quality
  - Both are web servers
- Fitness of sequence to structure
  - ProsaII (http://lore.came.sbg.ac.at/services/prosa.html): program runs on Linux and Unix
  - Verify3D (http://www.doe-mbi.ucla.edu/services/verify_3d/): web server

Evaluating the 3D Model
- Ramachandran plot
- Planar peptide bonds
- Side-chain conformations that correspond to those in a rotamer library
- Hydrogen bonding
- No bad atom-atom contacts
- Tool: ProCheck
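
A Ramachandran plot is built from the backbone torsion angles phi (atoms C(i-1), N(i), CA(i), C(i)) and psi (atoms N(i), CA(i), C(i), N(i+1)) of each residue. The sketch below shows how one such torsion angle can be computed from four atomic positions. The function name `dihedral` and the test coordinates are illustrative assumptions; validation programs also compare the resulting angles against allowed regions, which is not shown here.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def dihedral(p0, p1, p2, p3):
    """Torsion angle in degrees, in (-180, 180], about the p1-p2 bond."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1 = _cross(b1, b2)   # normal to the plane through p0, p1, p2
    n2 = _cross(b2, b3)   # normal to the plane through p1, p2, p3
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Simple geometric checks with made-up coordinates (not real atoms).
cis = dihedral((1, 0, 0), (0, 0, 0), (0, 1, 0), (1, 1, 0))
trans = dihedral((1, 0, 0), (0, 0, 0), (0, 1, 0), (-1, 1, 0))
print(cis, trans)
```

Given backbone coordinates, calling `dihedral` with the phi and psi atom quartets for each residue yields the points plotted in a Ramachandran diagram. Residues falling outside the favoured regions are candidates for closer inspection.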

Evaluating the 3D Model
- 3D-Profiler (Verify 3D)
  - Based on the statistical preferences of each of the 20 amino acids for particular environments within a protein
  - Residue positions are characterized by their environment
  - Preferred environments are defined by three parameters:
    - Area of each residue that is buried
    - Fraction of side-chain area that is covered by polar atoms (i.e., O and N)
    - Local secondary structure

Refining the 3D Model
- Molecular dynamics (MD) and energy minimization
- Application of restraints based on experimental data (e.g., NMR, fluorescence)

Applications of the Model