Template-Based 3D Structure Prediction

Template-Based 3D Structure Prediction: Sequence and Structure-based Template Detection and Alignment Issues

The rate of new sequences is growing exponentially relative to the rate of protein structures being solved!

Why such a shift? Sequencing DNA is easy (1-2 days). Experimental determination of a protein structure is difficult (1-3 years). Small targets.

How could we fill the gap between the number of known sequences and known structures? Structural Genomics Initiatives: JCSG

SHORT REMINDER
1D: SECONDARY STRUCTURE ELEMENTS (HELIX, SHEET, LOOPS*)
3D => FOLDING OF THESE SECONDARY STRUCTURE ELEMENTS (SEQUENTIAL AND SPATIAL ARRANGEMENT OF SECONDARY STRUCTURE ELEMENTS)

Current methods to predict protein structure
Structural level: secondary (1D, 2D), tertiary (3D), quaternary (4D)
Schema: example sequence AAVLYFGREDHTLLVY
Additional info: secondary structure prediction, correlated mutations
Ab initio: molecular dynamics, energy minimization, docking
Not ab initio: secondary structure prediction, homology modeling, threading, filtered docking

3D?? MREYKLVVLGSGGVGKSALTVQFVQGIFVDE YDPTIEDSYRKQVEVDCQQCMLEILDTAGTE QFTAMRDLYMKNGQGFALVYSITAQSTFNDL QDLREQILRVKDTEDVPMILVGNKCDLEDER VVGKEQGQNLARQWCNCAFLESSAKSKINVN EIFYDLVRQINR? How does it fold?

So How Do You Get from Query Sequence to Model Structure?
Template Detection: The first step is to find a sufficiently similar structural template or templates from the PDB, either by sequence searches or by more sophisticated structure-based techniques.
Alignment: All template detection methods need to create alignments in order to evaluate the query-template fit. Alignments are also crucial for the next stage.
Model Building: Ranges from the simple transference of PDB coordinates built into many fold recognition methods to complex all-atom comparative modelling.
Evaluation: All methods again use some sort of quality assessment of the models, either at the level of the alignment or of the feasibility of the 3D structure.

Template Identification
Template Detection: The simplest form of template detection is a sequence search against the sequences in the PDB database. This should always be the first step because the results of this search determine the approach that follows.
Domains: Searching for templates is complicated by the fact that many proteins are made up of several structural domains. A domain search should be carried out at the same time as the sequence search.

Structural prediction flowchart (Russell: http://speedy.embl-heidelberg.de/gtsp/flowchart2.html)

Template Detection Continued
But, if No Similar PDB Template Exists: If no template is found for one or more of the domains, more work will be needed, particularly with the alignment, in order to produce a good model. In this case the predictor can move on to more complex sequence search methods (PSI-BLAST, FFAS, HMMs) or use fold recognition techniques.

Homology Modelling vs Fold Recognition
Scale: 0 - 30 - 100 % seq. ID
Application | Target Sequence | Model Quality
Fold Recognition | Any sequence | Fold level
Homology Modelling | >= 30-50% ID with template | Atomic level
If the sequence is similar to a known structure (>30-50% identity) you can usually move straight on to generating an all-atom model by homology modelling.

No Template Found by BLAST?
Pairwise sequence search methods can detect folds when sequence similarity is high, but are very poor at detecting relationships that have less than 20% identity. One possibility is to use profile-based sequence search methods. These have evolved greatly, and can find templates with very low sequence similarity. Fold recognition methods can find folds that are too distantly related to be detected by sequence-based methods, because they evaluate not only sequence similarity, but also structural fit.
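The identity thresholds quoted on these slides can be read as a rough decision rule. The sketch below is only an illustration: the function name `choose_strategy` is hypothetical, the 30% and 20% cut-offs come from the slides, and the handling of the 20-30% zone is an assumption.

```python
def choose_strategy(percent_identity: float) -> str:
    """Map query-template sequence identity (0-100) to a modelling route.

    Cut-offs follow the slides (>= 30% and < 20%); the wording for the
    20-30% zone is an assumption made for this sketch.
    """
    if not 0.0 <= percent_identity <= 100.0:
        raise ValueError("percent identity must be between 0 and 100")
    if percent_identity >= 30.0:
        return "homology modelling"
    if percent_identity >= 20.0:
        return "profile-based search, then check the alignment carefully"
    return "profile-based search or fold recognition"


if __name__ == "__main__":
    for identity in (45.0, 25.0, 12.0):
        print(identity, "->", choose_strategy(identity))
```

In practice the length of the aligned region matters as well, so a single identity value is only a first filter.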

Why Can We Build Structures? Because Small Changes in Sequence Have Little Effect on Structure

Relationship between sequence and structural similarity (Chothia & Lesk, 1986)
%id seq. => same 3D (for sure)
%id seq. => sometimes same structure, sometimes not
Which case applies depends on the length of the aligned region.

Sequence Space vs. Structure Space
Figure labels: homology modelling targets, fold recognition targets, sequence space, structural space.
The development of fold recognition methods came from the observation that many apparently unrelated sequences had very similar 3-dimensional structures (folds).

FOLD RECOGNITION
Find out the real structure with prediction methods: FIT SEQUENCES INTO STRUCTURES AND FIND THE BEST MATCH.
When? If little sequence similarity, then fold recognition.

FOLD RECOGNITION
BIOLOGIST'S APPROACH: If seq1 is similar to seq2 then structure1 is similar to structure2 and there is probably an evolutionary explanation!
PHYSICIST'S APPROACH: Proteins form structures according to fundamental rules that they call energies or free energies!
Quoted from: Protein Structure Prediction, Huber & Torda.

Fold Recognition Algorithms: General Principle
When fold recognition methods were developed, it was thought that they could detect analogues: proteins that were structurally similar but had no evolutionary relationship. In fact most of these predictions were later shown to be homologous (to have an evolutionary relationship) by advanced sequence comparison methods, such as PSI-BLAST. They still have a place though, in part because many of the newer methods are more sensitive than PSI-BLAST, and in part because research also shows that no one method can always hope to correctly identify a fold.

CAPABLE OF DETECTING VERY DISTANT HOMOLOGY (WHEN SEQUENCE-BASED METHODS FAIL): FFAS03 example

FOLD RECOGNITION: FOLD DETECTION and THREADING
Methods named on the slide: BLAST, FASTA; e.g. FFAS03, GenTHREADER; fold recognition, e.g. HMM; alignment of sequences to structures as in THREADER (Jones et al. 1992).
CAPABLE OF DETECTING VERY DISTANT HOMOLOGY (WHEN SEQUENCE-BASED METHODS FAIL)
Fold recognition: distant/no clear homology

FOLD RECOGNITION: WHAT IS THREADING?
To fit a structure into a sequence! Given a protein structure, what amino acid sequences are likely to fold into that structure?

QUERY TO STRUCTURE ALIGNMENT
Figure: structural segments S1-S5 (sheet, helix), with optimal and suboptimal alignments.

QUERY TO STRUCTURE ALIGNMENT I
Figure labels: query sequence, structure template.
ALIGNMENT (threading): covering of segments of the query sequence by template blocks! A threading is completely determined by the starting positions of the blocks.

QUERY TO STRUCTURE ALIGNMENT II: Rules
Figure labels: query sequence, structure template.
The blocks preserve their order.
The blocks DO NOT OVERLAP.
There are NO GAPS in the blocks!
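The three rules can be checked mechanically once a threading is written as a list of block start positions. This is a minimal sketch: the function name `is_valid_threading` and the 0-based start positions are assumptions for illustration, not notation from the slides.

```python
def is_valid_threading(starts, block_lengths, query_length):
    """Check the threading rules for one placement of template blocks.

    starts[k] is the (0-based) query position where block k begins.
    Blocks must keep their order, must not overlap, and must fit in the query.
    Blocks are gapless, so block k covers starts[k] .. starts[k] + length - 1.
    """
    if len(starts) != len(block_lengths):
        return False
    next_free = 0  # first query position not yet covered by a block
    for start, length in zip(starts, block_lengths):
        if start < next_free:  # out of order, overlapping, or negative start
            return False
        next_free = start + length
    return next_free <= query_length


if __name__ == "__main__":
    print(is_valid_threading([0, 5, 12], [4, 6, 3], 20))  # blocks in order, no overlap
    print(is_valid_threading([0, 3], [4, 2], 20))         # second block overlaps the first
```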

STEPS
Construct a library of potential core folds (structural templates).
Choose an objective function (score function) to evaluate any alignment of a sequence to a structure template.

The General Principle I
1. Library of protein structures (fold library): all known structures; a representative subset (seq. similarity filters); structural cores with loops removed.

Building a Fold library

The General Principle II
2. Binary alignment algorithm with a scoring function (contact potentials, environments, others...). Instead of aligning a sequence to a sequence, align strings of descriptors that represent 3D structural features. Usual dynamic programming: the score matrix relates two amino acids. Threading dynamic programming: relates amino acids (ALMVWTGH...) to environments in the 3D structure.
3. Evaluation of the fitness: probability. The final score is the goodness of fit of the target sequence to each fold and is usually reported as a probability.

Figure: graph with blocks i=1..6 on one axis and positions j=1..4 on the other, with end nodes S and T.
Each possible threading corresponds to a path from S to T in the graph and vice versa. The BLUE path corresponds to the threading (1,4,1,4,1,4). The GREEN path corresponds to the threading (1,2,2,3,4,4).
THE KEY IS TO FIND THE SHORTEST PATH FROM S TO T = dynamic programming!
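The shortest-path idea can be sketched with a small dynamic programme. This is an illustrative sketch under stated assumptions: each block's cost depends only on its own start position (no pair terms), starts are 0-based absolute query positions rather than the relative positions used in the figure, and `best_threading` and the `cost` callback are hypothetical names.

```python
def best_threading(block_lengths, query_length, cost):
    """Return (total_cost, starts) minimising the sum of cost(block, start).

    Blocks keep their order, do not overlap and contain no gaps.
    Returns (inf, None) when the blocks cannot fit into the query.
    """
    n = len(block_lengths)
    inf = float("inf")
    # best[k][p]: minimal cost of placing blocks k..n-1 with block k starting at >= p
    best = [[inf] * (query_length + 2) for _ in range(n + 1)]
    choice = [[None] * (query_length + 2) for _ in range(n)]
    for p in range(query_length + 2):
        best[n][p] = 0.0  # no blocks left to place
    for k in range(n - 1, -1, -1):
        length = block_lengths[k]
        for p in range(query_length, -1, -1):
            # Option 1: do not start block k at p, defer to a later position.
            best[k][p] = best[k][p + 1]
            choice[k][p] = choice[k][p + 1]
            # Option 2: start block k at p, if it fits and the rest can follow.
            end = p + length
            if end <= query_length and best[k + 1][end] < inf:
                total = cost(k, p) + best[k + 1][end]
                if total < best[k][p]:
                    best[k][p] = total
                    choice[k][p] = p
    if best[0][0] == inf:
        return inf, None
    starts, p = [], 0
    for k in range(n):  # trace the chosen start of each block
        start = choice[k][p]
        starts.append(start)
        p = start + block_lengths[k]
    return best[0][0], starts


if __name__ == "__main__":
    preferred = [1, 4]  # toy data: each block "likes" one start position
    print(best_threading([2, 2], 6, lambda k, p: abs(p - preferred[k])))
```

Once pair potentials make a block's cost depend on where the other blocks sit, this simple recurrence no longer applies directly, which is the difficulty discussed on the pair potential slide.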

Scoring Functions for Fold Recognition
Scoring functions measure one or more of the following:
The similarity between the observed structural environment of the residue and the environment in which the residue is usually found.
Pair potentials.
Solvation energy.
Coincidence of real and predicted secondary structure and accessibility.
Evolutionary information (from aligned structures and sequences).

Structural Environments
Bowie et al. (1991) created a fold recognition approach that classifies each position of a fold template as being in one of eighteen environments. The environment is defined by measuring the buried side chain area, the fraction of the side chain area that is exposed to polar atoms, and the local secondary structure. Other researchers have developed similar methods, where the structural environments described include exposed atomic areas and the type of residue-residue contacts.

How Structural Environment Scores are Used
Table: 20 amino acids x 18 environments, e.g. the probability of finding K buried.
Scoring matrices are pre-generated for the probabilities of finding each of the twenty amino acids in each of the environment classes. Probabilities are drawn from databases of known structures. Using these probabilities a 3D profile is created for each fold in the fold library. This 3D profile defines the probability of finding a certain amino acid in a certain position in each fold. When the target sequence is aligned with the fold, a score is calculated from the pre-generated 3D profile for each of the positions in the alignment. The fit of a fold is the sum of the probabilities of each residue being found in each environment.
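A toy version of this scoring can be written in a few lines. The sketch below makes several assumptions that are not on the slide: it uses log-odds scores (log of the probability of an amino acid in an environment over its background frequency) so that summing over positions is meaningful, it uses two made-up environments instead of eighteen, and all counts are invented illustration data.

```python
import math


def log_odds_table(counts, background):
    """counts[env][aa] -> score[env][aa] = ln(P(aa | env) / P(aa))."""
    table = {}
    for env, aa_counts in counts.items():
        total = sum(aa_counts.values())
        table[env] = {
            aa: math.log((n / total) / background[aa])
            for aa, n in aa_counts.items()
            if n > 0
        }
    return table


def score_threading(sequence, environments, table, missing=-5.0):
    """Sum per-position scores for a gapless sequence-to-fold alignment.

    environments[i] is the environment class of fold position i.
    Amino acids never seen in an environment get the penalty `missing`.
    """
    if len(sequence) != len(environments):
        raise ValueError("sequence and environment string must have equal length")
    return sum(table[env].get(aa, missing) for aa, env in zip(sequence, environments))


if __name__ == "__main__":
    # Invented counts: leucine prefers buried positions, lysine exposed ones.
    counts = {"buried": {"L": 8, "K": 2}, "exposed": {"L": 2, "K": 8}}
    background = {"L": 0.5, "K": 0.5}
    table = log_odds_table(counts, background)
    fold = ["buried", "exposed"]
    print(score_threading("LK", fold, table))  # positive: good fit
    print(score_threading("KL", fold, table))  # negative: poor fit
```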

Solvation Energy
Solvation potential is a term used to describe the preference of an amino acid for a specific level of residue burial. It is derived by comparing the frequency of occurrence of each amino acid at a specific degree of residue burial to the frequency of occurrence of all other amino acid types with this degree of burial. The degree of burial of a residue is defined as the ratio between its solvent accessible surface area and its overall surface area.

Pair or Contact Potentials: the Tendency of Residues to be in Contact
Make a count of interacting pairs of each residue type at different distance separations.
Counts become propensities (frequency at each distance separation) or energies (Boltzmann principle, -kT ln).
Figure: counts versus distance d, and energy E versus distance d.
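The conversion from counts to energies can be sketched as follows. This is a minimal illustration of the inverse Boltzmann idea, E(d) = -kT ln(f_pair(d) / f_ref(d)); the choice of reference state (here, counts over all residue pairs), the pseudocount, and the function name `pair_energies` are assumptions, and the numbers are invented.

```python
import math


def pair_energies(pair_counts, reference_counts, kT=1.0, pseudocount=1.0):
    """Turn distance-binned counts into energies with the Boltzmann principle.

    pair_counts[d]: how often one residue-type pair is seen in distance bin d.
    reference_counts[d]: counts for the reference state in the same bin
    (assumed here to be all residue pairs pooled together).
    """
    if len(pair_counts) != len(reference_counts):
        raise ValueError("both count lists must use the same distance bins")
    pair_total = sum(pair_counts) + pseudocount * len(pair_counts)
    ref_total = sum(reference_counts) + pseudocount * len(reference_counts)
    energies = []
    for observed, reference in zip(pair_counts, reference_counts):
        f_pair = (observed + pseudocount) / pair_total  # propensity in this bin
        f_ref = (reference + pseudocount) / ref_total
        energies.append(-kT * math.log(f_pair / f_ref))
    return energies


if __name__ == "__main__":
    # Invented counts: the pair is over-represented at short distance.
    print(pair_energies([30, 10, 10], [50, 50, 50]))
```

Bins where the pair is seen more often than the reference get a negative (favourable) energy.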

Pair Potentials in Fold Recognition
The energy that results from aligning a certain target sequence residue at a certain position depends on its interactions with other residues. This creates problems when pair potentials are used to create sequence-structure alignments, since you do not know the position of all the residues in the model before threading them. Threading methods that use pair potentials in this way, such as THREADER (Jones et al., 1992), have to use clever programming methods to get round this problem.

INPUT: Secondary structure prediction
TOPITS uses predicted secondary structure and accessibilities for the target sequence and compares them with the known values of the template (Rost, 1995).

Alignments
1aac DKATIPSEPFAAAEVADGAIVVDIAKMKYETPELHVKVGDTVTWINREAMPHNVHFVAGV
     :.... :... :.... :..:::. :
1plc IDVLLGADDGSLAFVPSEFSISPGEKIVFKNNAGFPHNIVFDEDS
1aac L--GEAALKGPMMKKE------QAYSLTFTEAGTYDYHCTPHPF--MRGKVVVE
     . : : : :......... : :...:.:: : :::.:.
1plc IPSGVDASKISMSEEDLLNAKGETFEVALSNKGEYSFYCSPHQGAGMVGKVTVN
All methods of template detection, whether sequence-based, fold recognition or hybrid, need alignments between the query sequence and the PDB template sequence. The quality of these alignments is highly variable. If an accurate 3D model is to be built, it is vital that the target-template alignments are correct. Particularly at lower percentage identity, the biggest errors stem from the alignments.

Alignments
Generally the higher the sequence similarity and the lower the number of gaps between the two sequences, the more likely the alignment is to be correct. The more sequences that are included in the alignment, the more likely the alignment is to be reliable in an evolutionary sense. Coincidence of real and predicted secondary structure and accessibility also generally improves alignments. Even with all this information automatic methods are far from perfect.
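Both quantities mentioned here, sequence identity and number of gaps, are easy to read off a pairwise alignment. The sketch below is an illustration only: the function name `alignment_stats` is hypothetical, and identity is computed over columns where neither sequence has a gap, which is one of several conventions in use.

```python
def alignment_stats(aligned_a, aligned_b, gap="-"):
    """Return (percent_identity, gap_columns) for two aligned sequences.

    Identity is counted over columns without a gap in either sequence;
    other definitions (e.g. dividing by the shorter sequence length) exist.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    identical = compared = gap_columns = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            gap_columns += 1
        else:
            compared += 1
            if a == b:
                identical += 1
    percent_identity = 100.0 * identical / compared if compared else 0.0
    return percent_identity, gap_columns


if __name__ == "__main__":
    # Second block of the 1aac/1plc alignment shown on the earlier slide.
    seq_1aac = "L--GEAALKGPMMKKE------QAYSLTFTEAGTYDYHCTPHPF--MRGKVVVE"
    seq_1plc = "IPSGVDASKISMSEEDLLNAKGETFEVALSNKGEYSFYCSPHQGAGMVGKVTVN"
    print(alignment_stats(seq_1aac, seq_1plc))
```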

Alignments by Hand
Sequence-based methods tend to produce alignments that are biased towards sequence evolution, not structure, and fold recognition alignments are not any more reliable. In practice most predictors update alignments manually using actual and predicted secondary structure and accessibility information, and careful placement of gaps.
KSLKGSRTEKNILTAFAGESQARNRYNYFGGQAKKDGFVQISDIFAETADQEREHAKRLFKFLE GGDLEIVAAFPAGI. ::---========+==++==+====-==--::-======+==++=++==+======--. -... MKGDTKVINYLNKLLGNELVAINQYFLHARMFKNWGLKRLNDVEYHESIDEMKHADRYIERILFLEGLPNLQDLGKLNI IADTHANLIASAAGEHHEYTEMYPSFARIAREEGYEEIARVFASIAVAEEFHEKRFLDFARNIKE GRVFLREQATK.:---===-=+--==--=- --==-==------:--======-====++==+====----:-:::.. GEDVEEMLRSDLALELDGA KNLREAIGYADSVHDYVSRDMMIEILRDEEGHIDWLETELDLIQKMGLQNYLQAQ WRCRNCGYVHEGTGAPELCPACAHPKAHFELLGINW. :. I REE

Sequence Alignment Correction
TEMPLATE             PHE ASP ILE CYS ARG LEU PRO GLY SER ALA GLU ALA VAL CYS
TARGET (ALIGNMENT 1) PHE ASN VAL CYS ARG THR PRO --- --- --- GLU ALA ILE CYS
TARGET (ALIGNMENT 2) PHE ASN VAL CYS ARG --- --- --- THR PRO GLU ALA ILE CYS
"Alignment 1" is chosen because of the PROs at position 7. But the 10 Angstrom gap that results is too big to close with a single peptide bond.

A Fold Recognition Example - 3D-PSSM
3D-PSSM combines: target sequence profiles, template sequence profiles, residue equivalence, secondary structure matching, and solvation potentials. Sequences are aligned to folds using dynamic programming, with the alignments scored by a range of 1D and 3D profiles.

3D-PSSM Fold Profile Library
PSI-BLAST and a non-redundant database are used to create profiles for each of the folds in the library. Each fold is aligned with members of the same superfamily using the structural alignment program SSAP. Those folds from SCOP with sufficient structural similarity are then also used to create profiles using PSI-BLAST in the same way. All the related profiles are merged using the structural alignment to form a 3D profile.

Secondary Structure and Solvation Potentials
Secondary structure is assigned to each fold based on the annotation in the STRIDE database. Each residue in the fold is also assigned a solvation potential. The degree of burial of each residue is defined as the ratio between its solvent accessible surface area and its overall surface area. Solvation potential is divided into 21 bins, ranging from 0% (buried) to 100% (exposed).
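The burial ratio and its binning can be sketched as below. The slide gives only the number of bins (21) and the range (0% to 100%); equal spacing, which gives 5% steps, is an assumption of this sketch, as are the function names.

```python
def burial_fraction(accessible_area, total_area):
    """Ratio of solvent accessible surface area to overall surface area, clipped to [0, 1]."""
    if total_area <= 0:
        raise ValueError("total surface area must be positive")
    return min(max(accessible_area / total_area, 0.0), 1.0)


def burial_bin(fraction, n_bins=21):
    """Map a fraction in [0, 1] to one of n_bins levels (0 = buried, n_bins - 1 = exposed).

    Equal spacing is assumed, so 21 bins correspond to steps of 5%.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie between 0 and 1")
    return round(fraction * (n_bins - 1))


if __name__ == "__main__":
    fraction = burial_fraction(30.0, 120.0)  # invented areas in square Angstroms
    print(fraction, burial_bin(fraction))
```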

Sequence and Secondary Structure Profiles
3D-PSSM also uses the coincidence of predicted secondary structure (target sequence) and known secondary structure (fold). Here a simple scoring scheme is used for matching secondary structure types: +1 for a match, otherwise -1.

Preparing the Query Sequence
Query sequences have their secondary structure predicted by PSIPRED. PSI-BLAST profiles are also generated for the query sequence to allow bi-directional scoring. The 3D-PSSM dynamic programming algorithm is used to scan the fold library with the query sequence.

3D-PSSM - Dynamic Programming
Three passes of dynamic programming are performed for each query-template alignment. Each pass uses a different matrix to score the alignment, but secondary structure and solvation potential are used in each pass. The score for a match between a query residue and a fold residue is calculated as the sum of the secondary structure, solvation potential and profile scores. The final score is simply the maximum of the scores from the three passes.
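The way the terms are combined can be written down directly. This is a sketch of the combination described on the slides, not the 3D-PSSM code: the function names are hypothetical, the terms are added without weights (an assumption), and the solvation and profile values are placeholders that would come from the fold library.

```python
def secondary_structure_score(predicted, observed):
    """+1 if the predicted query state matches the fold's state, otherwise -1."""
    return 1 if predicted == observed else -1


def match_score(predicted_ss, fold_ss, solvation_score, profile_score):
    """Score for pairing one query residue with one fold residue.

    Unweighted sum of the three terms; any relative weighting is not
    given on the slides and is left out here.
    """
    return secondary_structure_score(predicted_ss, fold_ss) + solvation_score + profile_score


def final_score(pass_scores):
    """The final score is the maximum over the dynamic programming passes."""
    return max(pass_scores)


if __name__ == "__main__":
    print(match_score("H", "H", 0.5, 2.0))
    print(match_score("H", "E", 0.5, 2.0))
    print(final_score([12.0, 15.5, 9.0]))  # invented totals for three passes
```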

Differences between profile-based methods (Rychlewski et al., 2000)
PSI-BLAST
  Multiple alignment: 5 iterations with 10^-3 e-value threshold
  Profile: preclustering with 98% cutoff; pseudocount based on variability estimation, background amino acid frequencies
  Database: NR
PDB-BLAST
  Multiple alignment: same as PSI-BLAST
  Profile: same as PSI-BLAST
  Database: PDB database
BASIC
  Multiple alignment: 2 PSI-BLAST iterations with 0.1 e-value threshold
  Profile: preclustering with 97% id cutoff; amino acid composition filter; distant homologues have smaller weights
  Database: profiles of proteins from PDB
FFAS/FFAS03
  Multiple alignment: same as PSI-BLAST
  Profile: preclustering with 97% id cutoff; amino acid composition filter; sequence diversity based weight
  Database: profiles of proteins from PDB

Baker & Sali, Science 2001.

COMBINING ADDITIONAL INFORMATION: Conserved, Tree-Determinant, Correlated mutations

Figure labels: RCC1, Ran; Ras, Ral, Rho (by J.A. G-Ranea).

Azuma et al., J. Mol. Biol. 1999

Figure: complex (model on complex superposition) and mapping of mutants (side view) on the model. Labelled: GDP, Mg++, D44, H78, D128, E157, R206, H270, H304, H410. Green: Km, red: Kcat.

VISUALIZATION Pazos et al., 1999 http://www.cnb.uam.es/~pazos/threadlize

Fold Recognition Servers I
3D-PSSM - www.sbg.bio.ic.ac.uk/~3dpssm/ Based on sequence profiles, solvation potentials and secondary structure.
SPARKS2 - http://phyyz4.med.buffalo.edu/hzhou/anonymous-foldsparks2.html Top server in CM predictions in CASP 6. Sequence, secondary structure Profiles And Residue-level Knowledge-based Score for fold recognition.
mGenTHREADER - www.psipred.net/ Combines profiles and sequence-structure alignments. A neural network-based jury system calculates the final score based on solvation and pair potentials.

Fold Recognition Servers II
RAPTOR - www.bioinformatics.uwaterloo.ca/~j3xu/raptor.htm Best-scoring server in the CAFASP3 competition in 2002. You have to ask to use it first...
ROBETTA - http://robetta.bakerlab.org/ ROBETTA makes both ab initio and template-based predictions. It detects fragments with BLAST, FFAS03, or 3DJury, generates alignments with its own K*SYNC method and uses fragment insertion and assembly.
PHYRE - www.sbg.bio.ic.ac.uk/phyre A new server (so new it doesn't even have documentation) that attempts to assemble fragments in a similar way to Robetta.

Advanced Sequence-Based and Hybrid Techniques
PSI-BLAST: Profile methods, beginning with PSI-BLAST, can be as accurate as many fold recognition techniques at detecting remote homologues. Although expert users of these methods can usually spot biologically meaningful templates from careful analysis of low-scoring hits, many remote homologues are not detected.
Intermediate Searching
Profile-profile: Profile-profile alignment methods use evolutionary information in both query and template sequences. As a result, they are able to detect remote homologies beyond the reach of other sequence comparison methods. Example: HHpred.
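To make the idea of a profile concrete, the sketch below builds per-column amino acid frequencies from a small multiple alignment and compares two columns. It is a toy illustration only: the flat pseudocount and the dot-product column score are simple assumptions chosen for this sketch and are not taken from PSI-BLAST or HHpred, and the alignment is invented.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_profile(column, pseudocount=1.0):
    """Frequencies of the 20 amino acids in one alignment column (gaps ignored)."""
    counts = Counter(c for c in column if c in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return [(counts[aa] + pseudocount) / total for aa in AMINO_ACIDS]


def build_profile(alignment, pseudocount=1.0):
    """List of equal-length aligned sequences -> list of column profiles."""
    return [column_profile(column, pseudocount) for column in zip(*alignment)]


def column_similarity(p, q):
    """Dot product of two frequency vectors: high when both columns favour the same residues."""
    return sum(a * b for a, b in zip(p, q))


if __name__ == "__main__":
    profile = build_profile(["ACD", "ACE", "A-D"])  # invented three-sequence alignment
    print(column_similarity(profile[0], profile[0]))
    print(column_similarity(profile[0], profile[1]))
```

A profile-profile method aligns two such profiles column by column instead of aligning a single sequence to a profile.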

Advanced Sequence-Based and Hybrid Techniques
Hidden Markov Models: Hidden Markov models were originally developed for speech recognition. They regard the sequence as a series of nodes, each corresponding to a column in a multiple alignment. Each node has a residue state and states for insertion and deletion. A model can be built from many sequences, and these models have many similarities to profiles.
META-PROFILES: Many methods now also use predicted secondary structure. By adding structural information to the profiles (meta-profiles) it is often possible to find homologues that have very low sequence similarity but are still structurally similar.

Hybrid Sequence-Based Servers
SAM T02 - www.cse.ucsc.edu/research/compbio/hmmapps/t02-query.html The query is checked against a library of hidden Markov models. This is NOT a threading technique; it is sequence based, but it does use secondary structure information.
Meta-BASIC - basic.bioinfo.pl Meta-BASIC is based on consensus alignments of profiles. It combines sequence profiles with predicted secondary structure and uses several scoring systems and alignment algorithms.
FFAS - ffas.ljcrf.edu/ffas-cgi/cgi/ffas.pl FFAS03 is a profile-profile alignment method, which takes advantage of the evolutionary information in both query and template sequences.

Consensus Fold Recognition
It has long been recognised that human experts are better at fold prediction than the methods these same experts had developed. Human experts usually use several different fold recognition methods and predict folds after evaluating all the results (not just the top hits) from a range of methods. So why not produce an algorithm that mimics the human experts? In the first consensus server, Pcons, the target sequence was sent to six publicly available fold recognition web servers. Models were built from all the predictions. The models were then structurally superimposed and evaluated for their similarity. The quality of each model was predicted from the rescaled score and from its similarity to other predicted models.
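The similarity part of this idea can be sketched as follows. This is not the Pcons algorithm: it uses only the pairwise structural similarity between models and ignores the rescaled server scores, and the function name `consensus_pick` and the similarity values are made up for illustration.

```python
def consensus_pick(similarity):
    """Pick the model that is, on average, most similar to all other models.

    similarity[i][j] is a structural similarity between models i and j
    (assumed symmetric and scaled to [0, 1]).
    Returns (index_of_best_model, list_of_support_values).
    """
    n = len(similarity)
    if n < 2:
        raise ValueError("need at least two models to form a consensus")
    support = [
        sum(similarity[i][j] for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]
    best = max(range(n), key=support.__getitem__)
    return best, support


if __name__ == "__main__":
    # Invented similarities: models 0 and 1 agree, model 2 is an outlier.
    sim = [
        [1.0, 0.8, 0.2],
        [0.8, 1.0, 0.3],
        [0.2, 0.3, 1.0],
    ]
    print(consensus_pick(sim))
```

The intuition is that a fold proposed independently by several servers is more likely to be right than an isolated top hit.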

Consensus Fold Recognition Servers
3D Jury - http://bioinfo.pl/meta/ 3D Jury is a consensus predictor that utilizes the results of fold recognition servers, such as FFAS, 3D-PSSM, FUGUE and mGenTHREADER, and uses a jury system to select structures.
INBGU - www.cs.bgu.ac.il/~bioinbgu/form.html This produces a consensus prediction based on five methods that exploit sequence and structure information in different ways.
Pcons - www.sbc.su.se/~arne/pcons/ Pcons was the first consensus server for fold recognition. It selects the best prediction from several servers. PMOD can also generate models using the alignment, template and MODELLER.

Structure Prediction in a Nutshell (flowchart)
Target sequence.
Biological information from papers: active sites, domains, cofactors etc.
Are there domains? PFAM/ProDom/InterPro, BLAST results.
Secondary structure, accessibility, trans-membrane segments: PHD, PSIPRED.
For each domain (Domain 1, Domain 2, Domain 3, etc.): BLAST search for a PDB structural template.
Yes: homology modelling programs (SWISSMODEL, coremodeller); align with template.
No: fold recognition servers (e.g. 3D-PSSM, GenTHREADER) and consensus servers (3D Jury), giving Alignment 1, Alignment 2, Alignment 3, Alignment 4...
Model building: loops, side chains, canonical; complete loops; MaxSprout; 3D model.
Model evaluation: ProSa.
Ana Rojas - Biotech Mendoza, Structural Bioinformatics Group

SOME REAL EXAMPLES BIOLOGICALLY RELEVANT

PAAD DOMAIN
AIM: TRY TO PREDICT THE BINDING MODE. The structure was unknown: we needed a model.

BACKGROUND: WHERE IS THE PAAD DOMAIN?
1. First, location of this domain using BLAST!
PAAD family: MEFV/PYRIN (Pawlowski et al., 2001, others)
NACHT family: PAN/NALPs/DEFCAP/PYCARD, CATERPILLER (Tschopp et al., Nature, 2003)

BACKGROUND: THE PROBLEM OF DOMAIN SHUFFLING
NALP2: PAAD, NACHT, LRRs
ASC2: PAAD
MATER: ?, NACHT, LRRs
ASC: PAAD, CARD
CARD4: CARD, NACHT, LRRs
CASPASE ZF: PAAD, CASPASE
NOD2: CARD, CARD, NACHT, LRRs
PYRIN: PAAD, B-BOX, Zn FINGER, SPRY
NAIP: BIR, BIR, BIR, NACHT, LRRs
COS1.5: ?, NACHT, LRRs
IF16: PAAD, IF120X, IF120X
CLAN: CARD, NACHT, LRRs
MNDA, AIM2: PAAD, IF120X
NAC: PAAD, ?, NACHT, LRRs, ?, CARD
Sensors! They connect different pathways!
2. Domain analyses in different sequences (PFAM)

WHERE DOES IT COME FROM?
3. Phylogenetic analyses (PFAM): PAAD, CARD, DD, DED

4. MAL & Sec. Structure Prediction
Figure: helices 1 2 3 4 5 6. Hydrophobic core (sol. acc. area <10% of maximum solv. area).
HELIX 3 does not have core residues. In DD and others, helix 3 doesn't pack too well.

domain Homology modeling of PAAD domain (MEFV from mouse) N N H3 H3 C C 4.-Template detection, alignment and modeling! Hydrophobic core

pyrin LYS35 LYS52 LYS39 ARG49 ARG42 180 ILE40 PRO41 VAL51 MET45 Charged patch Pan2/NALP4 Hydrophobic patch 4.-Identification of patches or relevant features in the surfaces! ALA50 TRP44 LYS48 VAL47 PRO43 ILE42

IFI204 ASP32 LYS64 90 o GLU53 GLU71 GLU67 GLU70 GLU54 LYS76 LYS55 AIM2 ASP19 LYS23 GLU20 180 o ARG67 LYS71 LYS64 - CHARGED (CONCAVE) + CHARGED (CONVEX) +CHARGED

PAAD is a 6 alpha-helical bundle. Helix 3 is disordered. Binding patches correctly predicted.
Real structure: 1PN5, released October 2003. September 2003.

SPOC DOMAIN
Combining HMMER sequence analyses and threading

METHODS: Selecting regions first!
Query seq -> BLAST to nr/uniprot90; BLAST to ESTs & unfinished genomes -> multiple alignment (T-COFFEE, MUSCLE, etc.) TO ENRICH THE PROFILE! -> PROFILE BUILDING -> HMMER/PSI-BLAST SEARCHES in Uniprot90

METHODS: HMMER Strategy/Intermediate searches Known Known!!!

METHODS HMMER ANALYSES III iso1 iso2 1183 aa NLS PHD 614 aa Coiled coil SPOC: Protein protein interaction (Sanchez Pulido et al, 2004) iso3 2256 aa 0.083 0.05

METHODS HMMER ANALYSES III iso2 SPOC: Protein protein interaction RBMF_HUMAN Homology Structural modeling Bioinformatics Group

Acknowledgments Michael Tress, David de Juan (CNIO) Florencio Pazos, Luis Sanchez-Pulido (CNB) Rest of (CNIO) and anyone else whose figures I used...