Template-Based 3D Structure Prediction

Elijah Wilkinson
6 years ago
1 Template-Based 3D Structure Prediction Sequence and Structure-based Template Detection and Alignment Issues

2 The rate of new sequences is growing exponentially relative to the rate of protein structures being solved!

3 Why such a shift? Sequencing DNA is easy (1-2 days). Experimental determination of a protein structure is difficult (1-3 years). Small targets

4 How could we fill the gap between the number of known sequences and known structures? Structural Genomics Initiatives: JCSG

5 2005

6 2005

7 How could we fill the gap between the number of known sequences and known structures? Structural Genomics Initiative: JCSG or

8 SHORT REMINDER 1D: SECONDARY STRUCTURE ELEMENTS HELIX SHEET LOOPS* 3D=> FOLDING OF THESE SECONDARY STRUCTURE ELEMENTS (SEQUENTIAL and SPATIAL ARRANGEMENT OF SECONDARY STRUCTURE ELEMENTS)

9 Current methods to predict protein structure
Structural levels: Secondary (1D, 2D), Tertiary (3D), Quaternary (4D). Example sequence: AAVLYFGREDHTLLVY
Additional info: secondary structure prediction, correlated mutations
Ab initio: molecular dynamics, energy minimization, docking
Not ab initio: secondary structure prediction, homology modeling, threading, filtered docking

10 3D?? MREYKLVVLGSGGVGKSALTVQFVQGIFVDE YDPTIEDSYRKQVEVDCQQCMLEILDTAGTE QFTAMRDLYMKNGQGFALVYSITAQSTFNDL QDLREQILRVKDTEDVPMILVGNKCDLEDER VVGKEQGQNLARQWCNCAFLESSAKSKINVN EIFYDLVRQINR? How does it fold?

11 So How Do You Get from Query Sequence to Model Structure?
Template Detection: The first step is to find a sufficiently similar structural template or templates from the PDB, either by sequence searches or by more sophisticated structure-based techniques.
Alignment: All template detection methods need to create alignments in order to evaluate the query-template fit. Alignments are also crucial for the next stage...
Model Building: Ranges from the simple transference of PDB coordinates built into many fold recognition methods to complex all-atom comparative modelling.
Evaluation: All methods use some sort of quality assessment of the models, either at the level of the alignment or of the feasibility of the 3D structure.

12 Template Identification Template Detection The simplest form of template detection is a sequence search against the sequences in the PDB database. This should always be the first step because the results of this search will condition the approach. Domains Searching for templates is complicated by the fact that many proteins are made up of several structural domains. A domain search should be carried out at the same time as the sequence search.

13 Russell: Structural prediction flowchart

14 Template Detection Continued But, if No Similar PDB Template Exists If no template is found for one or more of the domains, more work will be needed, particularly with the alignment, in order to produce a good model. In this case the predictor can move on to more complex sequence search methods (PSI-BLAST, FFAS, HMMs) or use fold recognition techniques.

15 Structural prediction flowchart

16 Homology Modelling vs Fold Recognition
Fold Recognition - Target sequence: any sequence. Model quality: fold level.
Homology Modelling - Target sequence: >= 30-50% ID with template. Model quality: atomic level.
If the sequence is similar to a known structure (>30-50% identity) you can usually move straight on to generating an all-atom model by homology modelling.
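The identity rule of thumb on this slide can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the 30% default is simply the lower end of the 30-50% range quoted above, and identity is computed over aligned (non-gap) columns, which is one of several conventions in use.

```python
def percent_identity(aligned_query: str, aligned_template: str) -> float:
    """Percent identity over columns where neither sequence has a gap ('-')."""
    if len(aligned_query) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(q, t) for q, t in zip(aligned_query, aligned_template)
             if q != "-" and t != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for q, t in pairs if q == t)
    return 100.0 * matches / len(pairs)


def suggest_approach(identity: float, threshold: float = 30.0) -> str:
    """Assumed cutoff: at or above the threshold, try homology modelling first."""
    return "homology modelling" if identity >= threshold else "fold recognition"
```

For example, suggest_approach(percent_identity("ACDE-F", "ACDQGF")) suggests homology modelling, because four of the five aligned columns are identical.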

17 No Template Found by BLAST? Pairwise sequence search methods can detect folds when sequence similarity is high, but are very poor at detecting relationships that have less than 20% identity. One possibility is to use profile-based sequence search methods. These have evolved greatly, and can find templates with very low sequence similarity. Fold recognition methods can find folds that are too distantly related to be detected by sequence-based methods, because they evaluate not only sequence similarity, but also structural fit.

18 Why We Can Build Structures? Because Small Changes in Sequence Have Little Effect on Structure

19 Relationship between sequence and structural similarity (Chothia & Lesk, 1986) %id seq. => same 3D (for sure) %id seq. => sometimes same structure, sometimes not: depends on the length of the aligned region.

20 Sequence Space vs. Structure Space Homology Modelling Targets Fold Recognition Targets Sequence space Structural space The development of fold recognition methods came from the observation that many apparently unrelated sequences had very similar 3-dimensional structures (folds).

21 FOLD RECOGNITION Find out the real structure with prediction methods FIT SEQUENCES INTO STRUCTURES AND FIND THE BEST MATCH when? If Little Sequence Similarity Then, Fold Recognition

22 FOLD RECOGNITION BIOLOGIST'S APPROACH: If seq 1 is similar to seq 2 then structure 1 is similar to structure 2 and there is probably an evolutionary explanation! PHYSICIST'S APPROACH: Proteins form structures according to fundamental rules that they call energies or free energies! Quoted from: Protein Structure Prediction, Huber & Torda.

23 Fold Recognition Algorithms: General Principle When fold recognition methods were developed, it was thought that they could detect analogues, proteins that were structurally similar but that had no evolutionary relationship. In fact most of these predictions were later shown to be homologous (have an evolutionary relationship) by advanced sequence comparison methods, such as PSI-BLAST. They still have a place though, in part because many of the newer methods are more sensitive than PSI-BLAST, in part because research also shows that no one method can always hope to correctly identify a fold.

24 CAPABLE TO DETECT VERY DISTANT HOMOLOGY (WHEN SEQUENCE-BASED METHODS FAIL) FFAS03 example

25 FOLD RECOGNITION FOLD DETECTION THREADING BLAST, FASTA eg. FFAS03 GenThreader FOLD RECOGNITION eg HMM Alignment of sequences to structures as in THREADER (Jones et al. 1992) CAPABLE TO DETECT VERY DISTANT HOMOLOGY (WHEN SEQUENCE-BASED METHODS FAIL) Fold recognition: distant/no clear homology

26 FOLD RECOGNITION WHAT IS THREADING? To fit a structure into a sequence!..given a protein structure, what amino acid sequences are likely to fold into that structure?

27 QUERY TO STRUCTURE ALIGNMENT S1 S2 S3 S4 S5 Sheet helix Optimal alignments Suboptimal alignments

28 QUERY TO STRUCTURE ALIGNMENT I query sequence Structure template ALIGNMENT (threading): covering of segments of the query sequence by template blocks! A threading is completely determined by the starting positions of the blocks

29 QUERY TO STRUCTURE ALIGNMENT II: Rules query sequence The blocks preserve their order Structure template The blocks DO NOT OVERLAP There are NO GAPS in the blocks!
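Since a threading is fully determined by the block start positions, the rules above can be checked mechanically. A minimal sketch follows; the function name and the 0-based start convention are assumptions made for illustration. The "no gaps inside a block" rule is implicit, because each block covers a contiguous run of the query.

```python
def is_valid_threading(starts, block_lengths, query_length):
    """Check that blocks keep their order, do not overlap, and fit in the query.

    starts[i] is the 0-based query position where template block i begins.
    """
    if len(starts) != len(block_lengths):
        return False
    previous_end = 0
    for start, length in zip(starts, block_lengths):
        # A block may not begin before the previous block has ended.
        if start < previous_end:
            return False
        previous_end = start + length
    return previous_end <= query_length
```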

30 STEPS Construct a library of Potential core folds (structural templates) Choose an objective function (score function) to evaluate any alignment of a sequence to a structure template

31 The General Principle I 1. Library of protein structures (fold library) all known structures representative subset (seq. similarity filters) structural cores with loops removed

32 Building a Fold library

33 The General Principle II 2. Binary alignment algorithm with Scoring function contact potential environments Others.. Instead of aligning a sequence to a sequence, align strings of descriptors that represent 3D structural features. Usual Dynamic Programming: score matrix relates two amino acids Threading Dynamic Programming: relates amino acids to environments in 3D structure ALMVWTGH Evaluation of the fitness: probability The final score is the goodness of fit of the target sequence to each fold and is usually reported as a probability....

34 Position j=4 j=3 j=2 j=1 S T i=1 i=2 i=3 i=4 i=5 i=6 Block Each possible threading corresponds to a path from S to T in the graph and vice-versa The BLUE path corresponds to the threading (1,4,1,4,1,4) The GREEN path corresponds to the threading (1,2,2,3,4,4) THE KEY IS TO FIND THE SHORTEST PATH FROM S TO T =dynamic programming!!!
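The shortest-path view can be sketched as a small dynamic programme over (block, start position) pairs. This is an illustrative sketch only: best_threading and its cost callback are hypothetical names, starts are 0-based, and the cost of placing a block is assumed to depend only on the block and its start position, not on where the other blocks are placed.

```python
import math


def best_threading(block_lengths, query_length, cost):
    """Return (total_cost, starts) minimising the sum of cost(block, start)."""
    if not block_lengths:
        raise ValueError("need at least one block")
    n = len(block_lengths)
    # best[k][s]: lowest cost with blocks 0..k placed and block k starting at s
    best = [[math.inf] * (query_length + 1) for _ in range(n)]
    back = [[-1] * (query_length + 1) for _ in range(n)]
    for k in range(n):
        run_cost, run_start = math.inf, -1  # running minimum over block k-1
        for s in range(query_length - block_lengths[k] + 1):
            if k == 0:
                best[k][s] = cost(k, s)
                continue
            # Latest start of block k-1 that still ends at or before s.
            prev = s - block_lengths[k - 1]
            if prev >= 0 and best[k - 1][prev] < run_cost:
                run_cost, run_start = best[k - 1][prev], prev
            if run_cost < math.inf:
                best[k][s] = run_cost + cost(k, s)
                back[k][s] = run_start
    last = min(range(query_length + 1), key=lambda s: best[-1][s])
    if best[-1][last] == math.inf:
        raise ValueError("blocks do not fit into the query")
    starts = [last]
    for k in range(n - 1, 0, -1):
        starts.append(back[k][starts[-1]])
    return best[-1][last], starts[::-1]
```

With two blocks of length 2 in a query of length 6 and a toy cost that prefers starts 1 and 4, the call best_threading([2, 2], 6, lambda k, s: abs(s - (1, 4)[k])) returns (0, [1, 4]).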

35 Scoring Functions for Fold Recognition Scoring functions measure one or more of the following: The similarity between the observed structural environment of the residue and the environment in which the residue is usually found Pair potentials Solvation energy Coincidence of real and predicted secondary structure and accessibility Evolutionary information (from aligned structures and sequences)

36 Structural Environments Bowie et al. (1991) created a fold recognition approach that classifies each position of a fold template as being in one of eighteen environments. Environment: defined by measuring the side chain buried area, the fraction of the side chain area that is exposed to polar atoms, and the local secondary structure. Other researchers have developed similar methods, where the structural environments described include exposed atomic areas and type of residue-residue contacts.

37 How Structural Environment Scores are Used 20 amino acids x 18 environments, e.g. the probability of finding K buried. Scoring matrices are pre-generated for the probabilities of finding each of the twenty amino acids in each of the environment classes. Probabilities are drawn from databases of known structures. Using these probabilities a 3D profile is created for each fold in the fold library. This 3D matrix defines the probability of finding a certain amino acid in a certain position in each fold. When the target sequence is aligned with the fold, a score is calculated from the pre-generated 3D profile for each of the positions in the alignment. The fit of a fold is the sum of the probabilities of each residue being found in each environment.
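A toy version of this scoring is sketched below. It assumes the per-position scores are log-odds values (probability of the residue in the environment divided by its background frequency), a common way of turning such probabilities into additive scores. The two environments and all numbers are invented for illustration and do not come from any real structure database.

```python
import math

# Invented toy tables: P(residue | environment) and background frequencies.
TOY_PROBABILITIES = {
    "buried": {"L": 0.2, "K": 0.01},
    "exposed": {"L": 0.02, "K": 0.1},
}
TOY_BACKGROUND = {"L": 0.1, "K": 0.05}


def profile_score(sequence, environments, probabilities, background):
    """Sum of log-odds scores for each residue placed in its aligned environment."""
    if len(sequence) != len(environments):
        raise ValueError("one environment per aligned residue is required")
    total = 0.0
    for residue, environment in zip(sequence, environments):
        odds = probabilities[environment][residue] / background[residue]
        total += math.log(odds)
    return total
```

With these toy tables, placing L in the buried position and K in the exposed one scores higher than the reverse placement.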

38 Solvation Energy Solvation potential is a term used to describe the preference of an amino acid for a specific level of residue burial. It is derived by comparing the frequency of occurrence of each amino acid at a specific degree of residue burial to the frequency of occurrence of all other amino acid types with this degree of burial. The degree of burial of a residue is defined as the ratio between its solvent accessible surface area and its overall surface area.
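The two definitions on this slide can be written out as follows. This is a sketch under assumptions: the function names are hypothetical, the potential is expressed as the negative logarithm of a frequency ratio (a Boltzmann-style convention), the reference frequency is taken over all residue types, and counts are assumed to be non-zero.

```python
import math


def degree_of_burial(accessible_area, total_area):
    """Ratio of solvent accessible surface area to overall surface area (0 = buried)."""
    return accessible_area / total_area


def solvation_potential(aa_in_bin, aa_total, all_in_bin, all_total):
    """-ln of (frequency of one amino acid in a burial bin / frequency of all residues)."""
    f_aa = aa_in_bin / aa_total
    f_all = all_in_bin / all_total
    return -math.log(f_aa / f_all)
```

A negative value means the amino acid is found at that degree of burial more often than residues in general, i.e. it prefers that environment.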

39 Pair or Contact Potentials - the Tendency of Residues to be in Contact Make a count of interacting pairs of each residue type at different distance separations (d). Counts become propensities (frequency at each distance separation) or energies E(d) (Boltzmann principle, -kT ln)
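The conversion from counts to energies can be sketched as below. The function name is hypothetical, the reference distribution is assumed to be the counts over all residue pairs in the same distance bins, kT is set to 1 for convenience, and no pseudocounts are applied, so every bin must contain at least one observation.

```python
import math


def pair_potential(pair_counts, reference_counts, kT=1.0):
    """Energy per distance bin: -kT * ln(f_pair(d) / f_reference(d))."""
    if len(pair_counts) != len(reference_counts):
        raise ValueError("both count lists must use the same distance bins")
    pair_total = sum(pair_counts)
    reference_total = sum(reference_counts)
    energies = []
    for observed, reference in zip(pair_counts, reference_counts):
        f_pair = observed / pair_total
        f_reference = reference / reference_total
        energies.append(-kT * math.log(f_pair / f_reference))
    return energies
```

Bins where the residue pair is seen more often than the reference get negative (favourable) energies, and bins where it is seen less often get positive ones.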

40 Pair Potentials in Fold Recognition The energy that results from aligning a certain target sequence residue at a certain position depends on its interactions with other residues. This creates problems when pair potentials are used to create sequence structure alignments, since you do not know the position of all the residues in the model before threading them. Threading methods that use pair potentials in this way, such as THREADER (Jones et al, 1992) have to use clever programming methods to get round this problem.

41 INPUT Secondary structure pred TOPITS uses predicted secondary structure and accessibilities for the target sequence and compares them with the known values of the template. Rost, 1995

42 Alignments
1aac DKATIPSEPFAAAEVADGAIVVDIAKMKYETPELHVKVGDTVTWINREAMPHNVHFVAGV
:.... :... :.... :..:::. :
1plc IDVLLGADDGSLAFVPSEFSISPGEKIVFKNNAGFPHNIVFDEDS
1aac L--GEAALKGPMMKKE------QAYSLTFTEAGTYDYHCTPHPF--MRGKVVVE
. : : : : : :...:.:: : :::.:.
1plc IPSGVDASKISMSEEDLLNAKGETFEVALSNKGEYSFYCSPHQGAGMVGKVTVN
All methods of template detection, whether sequence-based, fold recognition or hybrid, need alignments between the query sequence and the PDB template sequence. The quality of these alignments is highly variable. If an accurate 3D model is to be built, it is vital that the target-template alignments are correct. Particularly at lower percentage identity, the biggest errors stem from the alignments.

43 Alignments Generally the higher the sequence similarity and the lower the number of gaps between the two sequences, the more likely the alignment is to be correct. The more sequences that are included in the alignment the more likely the alignment is to be reliable in an evolutionary sense. Coincidence of real and predicted secondary structure and accessibility also generally improves alignments. Even with all this information automatic methods are far from perfect.

44 Alignments by Hand Sequence-based methods tend to produce alignments that are biased towards sequence evolution rather than structure, and fold recognition alignments are not any more reliable. In practice most predictors update alignments manually using actual and predicted secondary structure and accessibility information, and careful placement of gaps.
KSLKGSRTEKNILTAFAGESQARNRYNYFGGQAKKDGFVQISDIFAETADQEREHAKRLFKFLE GGDLEIVAAFPAGI. ::---========+==++==+====-==--::-======+==++=++==+====== MKGDTKVINYLNKLLGNELVAINQYFLHARMFKNWGLKRLNDVEYHESIDEMKHADRYIERILFLEGLPNLQDLGKLNI IADTHANLIASAAGEHHEYTEMYPSFARIAREEGYEEIARVFASIAVAEEFHEKRFLDFARNIKE GRVFLREQATK.:---===-=+--==--=- --==-==------:--======-====++==+====----:-:::.. GEDVEEMLRSDLALELDGA KNLREAIGYADSVHDYVSRDMMIEILRDEEGHIDWLETELDLIQKMGLQNYLQAQ WRCRNCGYVHEGTGAPELCPACAHPKAHFELLGINW. :. I REE

45 Sequence Alignment Correction PHE ASP ILE CYS ARG LEU PRO GLY SER ALA GLU ALA VAL CYS TEMPLATE PHE ASN VAL CYS ARG THR PRO GLU ALA ILE CYS TARGET (ALIGNMENT 1) PHE ASN VAL CYS ARG THR PRO GLU ALA ILE CYS TARGET (ALIGNMENT 2) "Alignment 1" is chosen because of the PROs at position 7. But the 10 Angstrom gap that results is too big to close with a single peptide bond.
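The geometric argument on this slide can be turned into a quick check. Consecutive C-alpha atoms joined by a trans peptide bond are about 3.8 Angstrom apart, so a deletion is only plausible if the template residues flanking it are close enough to be bridged. The function name, the tolerance value and the coordinates in the example are assumptions chosen for illustration.

```python
import math

CA_CA_DISTANCE = 3.8  # approximate C-alpha to C-alpha spacing in Angstrom (trans peptide)


def gap_is_closable(ca_before, ca_after, residues_between=0, tolerance=0.5):
    """True if the flanking C-alpha atoms can be bridged by the available residues."""
    distance = math.dist(ca_before, ca_after)
    max_span = (residues_between + 1) * CA_CA_DISTANCE
    return distance <= max_span + tolerance
```

For two flanking atoms 10 Angstrom apart with no residues left to bridge them, gap_is_closable((0, 0, 0), (10, 0, 0)) is False, which is the situation described for "Alignment 1".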

46 A Fold Recognition Example - 3D-PSSM 3D-PSSM combines: Target sequence profiles. Template sequence profiles. Residue equivalence. Secondary structure matching. Solvation potentials. Sequences are aligned to folds using dynamic programming with the alignments scored by a range of 1D and 3D profiles.

47 3D-PSSM Fold Profile Library PSI-BLAST and a non-redundant database are used to create profiles for each of the folds in the library. Each fold is aligned with members of the same superfamily using the structural alignment program SSAP. Those folds from SCOP with sufficient structural similarity are then also used to create profiles using PSI-BLAST in the same way. All the related profiles are merged using the structural alignment to form a 3D-profile.

48 Secondary Structure and Solvation Potentials Secondary structure is assigned to each fold based on the annotation in the STRIDE database. Each residue in the fold is also assigned a solvation potential. The degree of burial of each residue is defined as the ratio between its solvent accessible surface area and its overall surface area. Solvation potential is divided into 21 bins, ranging from 0% (buried) to 100% (exposed).

49 Sequence and Secondary Structure Profiles 3D-PSSM also uses the coincidence of predicted secondary structure (target sequence) and known secondary structure (fold). Here a simple scoring scheme is used for matching secondary structure types, +1 for a match, otherwise -1.

50 Preparing the Query Sequence Query sequences have their secondary structure predicted by PSIPRED. PSI-BLAST profiles are also generated for the query sequence to allow bi-directional scoring. The 3D-PSSM dynamic programming algorithm is used to scan the fold library with the query sequence.

51 3D-PSSM - Dynamic Programming Three passes of dynamic programming are performed for each query-template alignment. Each pass uses a different matrix to score the alignment, but secondary structure and solvation potential are used in each pass. The score for a match between a query residue and a fold residue is calculated as the sum of the secondary structure, solvation potential and profile scores. The final score is simply the maximum of the scores from the three passes.
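The scoring arithmetic described here is simple enough to sketch directly. The snippet below is not the 3D-PSSM implementation: the function names are hypothetical, the +1/-1 secondary structure term follows the simple match scheme described for the method, and the solvation and profile terms are passed in as numbers that are already computed.

```python
def match_score(predicted_ss, fold_ss, solvation_score, profile_score):
    """Score for pairing one query residue with one fold residue."""
    # Secondary structure term: +1 if the predicted and known types agree, else -1.
    ss_score = 1 if predicted_ss == fold_ss else -1
    return ss_score + solvation_score + profile_score


def final_score(pass_scores):
    """Final query-template score: the best of the dynamic programming passes."""
    return max(pass_scores)
```

For example, a helix-helix match with a solvation score of 0.5 and a profile score of 2.0 gives 3.5, and three passes scoring 12.0, 15.5 and 9.0 give a final score of 15.5.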

52 Differences between profile-based methods (Rychlewski et al., 2000)
PSI-BLAST - Multiple alignment: 5 iterations with 10^-3 e-value threshold. Profile: preclustering with 98% cutoff, pseudocount based on variability estimation, background amino acid frequencies. Database: NR.
PDB-BLAST - Multiple alignment: same as PSI-BLAST. Profile: same as PSI-BLAST. Database: PDB database.
BASIC - Multiple alignment: 2 PSI-BLAST iterations with 0.1 e-value threshold. Profile: preclustering with 97% id cutoff; amino acid composition filter, distant homologues have smaller weights. Database: profiles of proteins from PDB.
FFAS/FFAS03 - Multiple alignment: same as PSI-BLAST. Profile: preclustering with 97% id cutoff; amino acid composition filter, sequence diversity based weight. Database: profiles of proteins from PDB.

53 Baker & Sali, Science 2001.

54 COMBINING ADDITIONAL INFORMATION Conserved Tree-Determinant Correlated mutations

55 rcc1 ran Ras Ral Rho Ras Ral Rho by J.A. G-Ranea

56 Azuma et al., J. Mol. Biol. 1999

57 Complex (Model on complex superposition) Mapping of mutants (side view) Model GDP E157 H304 Mg++ D44 H410 H78 E157 H270 Mg++ GDP D44 H304 R206 H78 R206 H410 H270 D128 D128 H78 Green: Km, red: Kcat.

58 VISUALIZATION Pazos et al.,

59 Fold Recognition Servers I 3D-PSSM - Based on sequence profiles, solvation potentials and secondary structure. SPARKS2 - Top server in CM predictions in CASP 6. Sequence, secondary structure Profiles And Residue-level Knowledge-based Score for fold recognition. mGenTHREADER - Combines profiles and sequence-structure alignments. A neural network-based jury system calculates the final score based on solvation and pair potentials.

60 Fold Recognition Servers I RAPTOR - Best-scoring server in CAFASP3 competition in You have to ask to use it first... ROBETTA - ROBETTA makes both ab initio and template-based predictions. It detects fragments with BLAST, FFAS03, or 3DJury, generates alignments with its own K*SYNC method and uses fragment insertion and assembly. PHYRE - A new server (so new it doesn't even have documentation) that attempts to assemble fragments in a similar way to Robetta.

61 Advanced Sequence-Based and Hybrid Techniques PSI-BLAST Profile methods, beginning with PSI-BLAST, can be as accurate as many fold recognition techniques at detecting remote homologues. Although expert users of these methods can usually spot biologically meaningful templates from careful analysis of low-scoring hits, many remote homologues are not detected. Intermediate Searching Profile-profile alignment methods use evolutionary information in both query and template sequences. As a result, they are able to detect remote homologies beyond the reach of other sequence comparison methods. HHpred! Profile-profile

62 Advanced Sequence-Based and Hybrid Techniques Hidden Markov Models Hidden Markov models were originally developed for speech recognition. They regard the sequence as a series of nodes, each corresponding to a column in a multiple alignment. Each node has a residue state and states for insertion and deletion. A model can be built from many sequences and these models have many similarities to profiles. META-PROFILES Many methods now also use predicted secondary structure. By adding structural information to the profiles (meta-profiles) it is often possible to find homologues that have very low sequence similarity but are still structurally similar.

63 Hybrid Sequence-Based Servers SAM T The query is checked against a library of hidden Markov models. This is NOT a threading technique, it is sequence based, but it does use secondary structure information. Meta-BASIC - basic.bioinfo.pl Meta-BASIC is based on consensus alignments of profiles. It combines sequence profiles with predicted secondary structure and uses several scoring systems and alignment algorithms. FFAS ffas.ljcrf.edu/ffas-cgi/cgi/ffas.pl FFAS03 is a profile-profile alignment method, which takes advantage of the evolutionary information in both query and template sequences.

64 Consensus Fold Recognition It has long been recognised that human experts are better at fold prediction than the methods these same experts had developed. Human experts usually use several different fold recognition methods and predict folds after evaluating all the results (not just the top hits) from a range of methods. So why not produce an algorithm that mimics the human experts? In the first consensus server, Pcons, the target sequence was sent to six publicly available fold recognition web servers. Models were built from all the predictions. The models were then structurally superimposed and evaluated for their similarity. The quality of the model was predicted from the rescaled score and from its similarity to other predicted models.
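The consensus idea can be illustrated with a deliberately simplified selection rule: pick the model that is, on average, most similar to all the other models. This is not the Pcons algorithm, which also uses the rescaled server scores; the function name and the similarity matrix in the example are invented for illustration.

```python
def consensus_pick(similarity):
    """Index of the model with the highest mean similarity to the other models.

    similarity[i][j] is a structural similarity score between models i and j.
    """
    n = len(similarity)
    if n < 2:
        raise ValueError("need at least two models to form a consensus")
    best_index, best_mean = -1, float("-inf")
    for i in range(n):
        others = [similarity[i][j] for j in range(n) if j != i]
        mean = sum(others) / len(others)
        if mean > best_mean:
            best_index, best_mean = i, mean
    return best_index
```

With the matrix [[1.0, 0.8, 0.2], [0.8, 1.0, 0.3], [0.2, 0.3, 1.0]], model 1 is selected because it agrees best with the other two.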

65 Consensus Fold Recognition Servers 3D Jury - 3D Jury is a consensus predictor that utilizes the results of fold recognition servers, such as FFAS, 3D-PSSM, FUGUE and mGenTHREADER, and uses a jury system to select structures. INBGU - This produces a consensus prediction based on five methods that exploit sequence and structure information in different ways. Pcons - Pcons was the first consensus server for fold recognition. It selects the best prediction from several servers. PMOD can also generate models using the alignment, template and MODELLER.

66 Structure Prediction in a Nutshell Target sequence Biological information from papers Active sites, domains, cofactors etc. Are there domains? PFAM/ProDom/ InterPro - BLAST results Secondary structure, accessibility, Trans-membrane segments PHD, PSIPRED Domain1 Domain 2 Domain 3 etc... BLAST search for PDB Structural Template Yes No Homology modelling programs SWISSMODEL, coremodeller Align with template Consensus Servers, 3D Jury Alignment 1 Alignment 2 Alignment 3 Alignment 4... Loops... Fold Recognition Servers Eg 3DPSSM GenTHREADER Model Evaluation 3D - ProSa model Ana Rojas - Biotech Mendoza Structural Bioinformatics suite Group Side chain canonical Complete loops MaxSprout 3D model

67 SOME REAL EXAMPLES BIOLOGICALLY RELEVANT

68 PAAD DOMAIN AIM: TRY TO PREDICT BINDING MODE structure was unknown: we needed a model.

69 BACKGROUND WHERE IS THE PAAD DOMAIN? 1.-First, location of this domain using BLAST! PAAD family: MEFV/PYRIN (Pawlowski et al., 2001, others) Nacht family: PAN/NALPs/DEFCAP/PYCARD, CATERPILLER (Tschopp et al., Nature, 2003)

70 BACKGROUND THE PROBLEM OF DOMAIN SHUFFLING NALP2 PAAD NACHT LRR S ASC2 PAAD MATER? NACHT LRR S ASC PAAD CARD CARD4 CARD NACHT LRR S CASPASE ZF PAAD CASPASE NOD2 CARD CARD NACHT LRR S PYRIN PAAD B-BOX Zn FINGER SPRY NAIP BIR BIR BIR NACHT LRR S COS1.5? NACHT LRR S IF16 PAAD IF120X IF120X CLAN CARD NACHT LRR S MNDA,AIM2 PAAD IF120X NAC PAAD? NACHT LRR S? CARD Sensors! They connect different pathways! 2.-Domain analyses in different sequences (PFAM)

71 WHERE DOES IT COME FROM? 3.-Phylogenetic analyses (PFAM) PAAD CARD DD DED

72 Hydrophobic core (sol. acc. area <10% maximum solv. area) 4.-MAL & Sec. Structure Prediction HELIX 3 does not have core residues. In DD and others, helix 3 doesn't pack too well

73 domain Homology modeling of PAAD domain (MEFV from mouse) N N H3 H3 C C 4.-Template detection, alignment and modeling! Hydrophobic core

74 pyrin LYS35 LYS52 LYS39 ARG49 ARG ILE40 PRO41 VAL51 MET45 Charged patch Pan2/NALP4 Hydrophobic patch 4.-Identification of patches or relevant features in the surfaces! ALA50 TRP44 LYS48 VAL47 PRO43 ILE42

75 IFI204 ASP32 LYS64 90 o GLU53 GLU71 GLU67 GLU70 GLU54 LYS76 LYS55 AIM2 ASP19 LYS23 GLU o ARG67 LYS71 LYS64 - CHARGED (CONCAVE) + CHARGED (CONVEX) +CHARGED

76 PAAD is a 6 alpha helical bundle Helix 3 is disordered Binding patches correctly predicted Real structure 1PN5 Released October 2003 September 2003

77 SPOC DOMAIN Combining HMMER sequence analyses and threading

78 METHODS: Selecting regions first! Query seq Blast to nr/uniprot90 Blast to ESTs & unfinished genomes Multiple alignment T COFFEE, MUSCLE, etc TO ENRICH PROFILE! PROFILE BUILDING HMMER/PSI BLAST SEARCHES in Uniprot90

79 METHODS: HMMER Strategy/Intermediate searches Known Known!!!

80 METHODS HMMER ANALYSES III iso1 iso aa NLS PHD 614 aa Coiled coil SPOC: Protein protein interaction (Sanchez Pulido et al, 2004) iso aa

81 METHODS HMMER ANALYSES III iso2 SPOC: Protein protein interaction RBMF_HUMAN Homology Structural modeling Bioinformatics Group

82 Acknowledgments Michael Tress, David de Juan (CNIO) Florencio Pazos, Luis Sanchez-Pulido (CNB) Rest of (CNIO) and anyone else whose figures I used...
