Biol Introduction to Bioinformatics

Kenneth Watkins

1 Biol Introduction to Bioinformatics: Schedule
Week Nov 15 Nov
Reading: Ch Ch; for next week, Ch 14.4
Monday: Protein energetics/dynamics
Wednesday: Homology-based modeling
Friday: Homology-based modeling, protein families
Biol47800/59500 Bioinformatics

2 Motivation
Secondary structure prediction, however accurate, does not tell you what the three-dimensional structure is.
It is difficult or impossible to go even from KNOWN secondary structure elements to the three-dimensional structure.
What then can one do?
If a structure is known, one can often "predict" (model) the three-dimensional structure of homologous proteins with reasonable accuracy.
The 3D structure database (PDB) is growing exponentially, like the other databases. Homologous structures are available for many sequences, perhaps 50% of all sequences.
Structural genomics (high-throughput structure solution) is increasing the number of sequences for which this is possible.

3 Protein Energetics
Proteins exist at or very near their minimum free energy conformations.
  Crystallographic structures may be slightly stressed due to crystal contacts and solution conditions.
Folding is generally rapid and often does not require any assistance.
  Anfinsen experiments
  Chaperones
Proteins are not solid rocks: they exhibit thermal motions, which are important in conformational change.

4 Molecular dynamics
[Figures: example simulations]
  Beta-lactamase with water
  Green fluorescent protein
  Neuraminidase + Tamiflu
  GLFG backbone, 100 ps
  MHC protein
  Glucocorticoid receptor/DNA
  Glucosamine deaminase
  Blood clotting protein binding to membrane
  Protein and DOPC bilayer, 50 ns
  DADME binding to polynucleotide phosphorylase

5 Molecular Dynamics
MD simulations generally begin where experimental structure determination leaves off, if not during the structure refinement itself.
MD is generally not used to predict structure from sequence or to model protein folding pathways.
MD simulations can fold extended sequences to global potential energy minima ONLY for very small systems (peptides of about ten residues, in vacuum).
MD is most commonly used to simulate the dynamics of known structures.

6 Molecular Dynamics
Proteins are flexible and rapidly fluctuating molecules.
Classification of motions:

Motion                              Time (log10 s)   Distance
Atomic fluctuations                 -15 to -11       ~1 Å
Collective motions                  -12 to -3        ~10 Å
Triggered conformational changes    -9 to +3         ~100 Å

Atomic fluctuations: vibrations of individual bonds.
Collective motions: groups of atoms (amino acid side chains, protein motif or domain, RNA base, ...).
Triggered conformational changes: motion in response to a stimulus.
Correct structural template (H bonding, disulfide bridges, solvent accessibility, etc.)

7 Molecular Dynamics
Energy minimization
  Inputs: atomic coordinates and potential energy (force field).
  Incrementally change the coordinates according to the force field (descent to lowest energy).
Molecular dynamics
  Includes velocities.
  Incrementally change atomic coordinates using numerical solutions of the time-dependent equations of motion for the atoms (F = ma).
Result: a simulated trajectory through time of the positions and momenta of all atoms of the molecule, exploring conformational space in time.

8 Molecular Dynamics
Basic computational approach:
Begin with initial atomic coordinates.
Calculate the potential energy (U) of the system (force field).
This gives the force on each atom: force is the negative derivative of the potential energy with respect to position, $F = -dU/dr$.
The sum of forces on each atom gives its acceleration.
Let the molecules move for a very short time (femtoseconds).
Recalculate the energy and repeat.
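The loop on this slide can be sketched in a few lines. The example below is only an illustration under stated assumptions: the system is a single particle on a harmonic spring rather than a protein, and the velocity Verlet integrator is one common choice that the slide does not name. All function names are hypothetical.

```python
def harmonic_force(x, k=1.0, x0=0.0):
    """Force from U(x) = 0.5*k*(x - x0)**2, i.e. F = -dU/dx."""
    return -k * (x - x0)


def total_energy(x, v, mass=1.0, k=1.0, x0=0.0):
    """Kinetic plus potential energy of the toy oscillator."""
    return 0.5 * mass * v**2 + 0.5 * k * (x - x0) ** 2


def velocity_verlet(x, v, mass, dt, n_steps, force=harmonic_force):
    """Propagate position and velocity; returns the list of (x, v) states."""
    trajectory = [(x, v)]
    f = force(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity from current force
        x = x + dt * v_half                # move for one short time step
        f = force(x)                       # recalculate force at the new position
        v = v_half + 0.5 * dt * f / mass   # finish the velocity update
        trajectory.append((x, v))
    return trajectory


# Assumed toy parameters: unit mass, unit spring constant, dt = 0.01.
traj = velocity_verlet(x=1.0, v=0.0, mass=1.0, dt=0.01, n_steps=1000)
```

In a real simulation the force comes from the full force field and the time step is on the order of a femtosecond; the structure of the loop is the same.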

9 Molecular Dynamics
Force field = empirical energy functions. Large molecules are treated essentially as spheres and springs, giving the following potential energy terms:
$E_{empirical} = \sum E_{bond} + \sum E_{angle} + \sum E_{dihedral} + \sum E_{VDW} + \sum E_{elec}$
where:
$E_{bond} = \sum k_b (r - r_0)^2$
$E_{angle} = \sum k_\theta (\theta - \theta_0)^2 + k (r_{13} - r_0)^2$
$E_{dihedral} = \sum \sum k_\phi (1 + \cos(n\phi + \delta))$
$E_{VDW} = A/r^{12} - B/r^{6}$ (Lennard-Jones potential)
$E_{elec} = q_i q_j / (\epsilon r)$ (Coulomb; with a distance-dependent dielectric $\epsilon \propto r$ this takes the form $q_i q_j / r^2$)
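As a rough illustration of the individual terms, the sketch below writes each one as a small function. The function names and the parameter values in the example calls are assumptions made for illustration, not values from any real force field, and the electrostatic term is written in the standard Coulomb form $q_i q_j / (\epsilon r)$.

```python
import math


def bond_energy(r, r0, k_b):
    """Harmonic bond stretch: k_b * (r - r0)^2."""
    return k_b * (r - r0) ** 2


def angle_energy(theta, theta0, k_theta):
    """Harmonic angle bend: k_theta * (theta - theta0)^2 (angles in radians)."""
    return k_theta * (theta - theta0) ** 2


def dihedral_energy(phi, k_phi, n, delta):
    """Periodic torsion: k_phi * (1 + cos(n*phi + delta))."""
    return k_phi * (1.0 + math.cos(n * phi + delta))


def lennard_jones(r, A, B):
    """Van der Waals term: A/r^12 - B/r^6."""
    return A / r**12 - B / r**6


def coulomb(r, q_i, q_j, eps=1.0):
    """Electrostatic term: q_i*q_j / (eps*r), in arbitrary units."""
    return q_i * q_j / (eps * r)


# Illustrative values only. With A = 1 and B = 2 the Lennard-Jones
# minimum lies at r = (2A/B)**(1/6) = 1 with depth -B**2/(4A) = -1.
e_lj_min = lennard_jones(1.0, A=1.0, B=2.0)
e_bond = bond_energy(1.5, r0=1.0, k_b=2.0)
```

The total empirical energy is the sum of these terms over all bonds, angles, dihedrals, and non-bonded atom pairs.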

10 Force field terms: local and non-local
Local: $k_b (r - r_0)^2$
Local: $k_\theta (\theta - \theta_0)^2 + k (r_{13} - r_0)^2$
$k_\phi (1 + \cos(n\phi + \delta))$
Long-range, non-local: $q_i q_j / r^2$ and $A/r^{12} - B/r^{6}$

11 Molecular Dynamics - Energy Minimization
General optimization methods
Iterative descent method
  Change each atomic coordinate by a small descent step, in the direction of the force acting on the atom.
  Recalculate the potential energy from the new atomic coordinates.
  Recalculate the descent direction from the new potential energy.
  Iterate this procedure, varying the descent step size as needed.
  Stop when a minimum in the potential energy is reached (cannot proceed in any direction without increasing the potential energy).
Conjugate gradient method
  Similar to the iterative descent method, BUT each new descent direction is based on previous directions as well as the current force.
  Changes in direction are less abrupt.
  Convergence is faster.
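The iterative descent procedure can be sketched as follows. This is a minimal illustration, assuming a toy two-coordinate quadratic energy in place of a force field; the step-size rule (grow by 1.2 on success, halve on failure) and all names are assumptions chosen for the example.

```python
def steepest_descent(energy, gradient, x, step=0.1, gtol=1e-6, max_iter=10000):
    """Move coordinates along the force (minus the gradient) until the force vanishes."""
    x = list(x)
    e = energy(x)
    for _ in range(max_iter):
        g = gradient(x)
        if max(abs(gi) for gi in g) < gtol:      # no significant force left: at a minimum
            break
        trial = [xi - step * gi for xi, gi in zip(x, g)]
        e_trial = energy(trial)
        if e_trial < e:                          # downhill: accept and lengthen the step
            x, e = trial, e_trial
            step *= 1.2
        else:                                    # uphill: reject and shorten the step
            step *= 0.5
    return x, e


# Toy energy surface (assumed): a stiff and a soft coordinate, minimum at (1, -2).
def toy_energy(x):
    return (x[0] - 1.0) ** 2 + 10.0 * (x[1] + 2.0) ** 2


def toy_gradient(x):
    return [2.0 * (x[0] - 1.0), 20.0 * (x[1] + 2.0)]


x_min, e_min = steepest_descent(toy_energy, toy_gradient, [0.0, 0.0])
```

A conjugate gradient minimizer differs only in how the search direction is chosen: it mixes the previous direction into the current force instead of following the force alone.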

12 Molecular Dynamics - Energy Minimization
Energy calculations are used to solve Newton's equation of motion, i.e., $F = ma = -\nabla E_{empirical}$.
These calculations yield an acceleration and velocity for each atom.
Very small time steps are used, about 1 femtosecond ($10^{-15}$ s).
To minimize energy, it is most common to use "simulated annealing" (SA):
  "Heat" the molecule to get high thermal motion, which samples conformational space.
  Slowly "cool" to find a minimum energy, hopefully a global minimum.
SA will only move a structure a small distance from the starting point, perhaps 1-2 Å.
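The heat-then-cool idea can be illustrated on a toy problem. The sketch below is not an MD-based annealing protocol: it swaps in Metropolis Monte Carlo moves on a one-dimensional double-well energy, and the cooling schedule, step size, and function names are all assumptions made for the example.

```python
import math
import random


def double_well(x):
    """Toy energy with a local minimum near x = 1.35 and a deeper one near x = -1.47."""
    return x**4 - 4.0 * x**2 + x


def simulated_annealing(energy, x0, t_start=5.0, t_end=0.01, cooling=0.99,
                        step=0.5, seed=0):
    """Random moves accepted by the Metropolis rule while the temperature is lowered."""
    rng = random.Random(seed)          # fixed seed so the run is reproducible
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    t = t_start
    while t > t_end:
        trial = x + rng.uniform(-step, step)
        e_trial = energy(trial)
        # Downhill moves are always accepted; uphill moves with Boltzmann probability.
        if e_trial <= e or rng.random() < math.exp(-(e_trial - e) / t):
            x, e = trial, e_trial
            if e < best_e:
                best_x, best_e = x, e
        t *= cooling                   # slow "cooling"
    return best_x, best_e


# Start in the shallower well; high temperature lets the walker cross the barrier.
best_x, best_e = simulated_annealing(double_well, x0=1.35)
```

At high temperature uphill moves are accepted often, which is what allows escape from a local minimum; as the temperature drops the search settles into whichever basin it currently occupies.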

13 Molecular Dynamics - Energy Minimization
Computationally intensive
  Requires 10,000s of energy evaluations and 1000s of steps of dynamics to minimize the energy of a medium-size structure.
  This can require hours of supercomputer time.
Difficult to correctly model solvent effects
  The hydrophobic effect is important.
  Solvent models:
    Bulk solvent model (continuum model)
    Explicit solvent model: insert the model in a "box" of water; this adds thousands of additional atoms.
Energy minimization is often used to refine a model or structure.
  Particularly useful with a good initial structure, e.g., position of sidechain or

14 Homology-based modeling (Comparative modeling)
Prediction of the three-dimensional structure of a target protein from its amino acid sequence (primary structure), using a homologous (template) protein for which an X-ray or NMR structure is available.
Why a model?
  X-ray crystallography or NMR structures are unavailable or intractable.
  The model provides a wealth of information on how the protein functions, down to the level of individual residue properties.
  This information can then be used for mutational studies or for drug design.

15 Some Applications of Comparative Modeling
Design mutants to test hypotheses about a protein's function
Identify active sites and binding interfaces
Model substrate specificity
Protein-protein docking
Effects of coding SNPs (single nucleotide polymorphisms) and other naturally occurring polymorphisms on protein structure

16 Methods
Homology-based modeling
  Match sequence to known structure
  Change sequence
  Optimize with MD
Fragment-based modeling
  Match subsequences to structure fragments
  Optimize with MD
Threading
  Environment-based profiles
  Pseudo-energy fitting

17 Homology Modeling Flowchart
[Flowchart boxes]
  Query protein sequence
  Sequence database search; structure database search
  Hits / no hits
  Homology modeling: (multiple) sequence alignment, identify structurally conserved regions (SCRs), model core SCRs, model loops, model sidechains, energy minimization, evaluate model(s)
  Iterative search (PSI-BLAST/profiles); similar hits
  Threading; fold recognition; secondary structure prediction

18 Quality of Known Structures
What is a good three-dimensional structure?
6 Å resolution or so: secondary structure is often clear, particularly alpha helices.
Worse than 3 Å resolution: there are many errors in side groups.
2.5 Å or better: good, BUT loops or surface regions may still be disordered. A structure usually must be at least this good for successful homology modeling.
2.0 Å or better: very good to excellent. The best structures are below 1.5 Å resolution. Portions may still be invisible.
R-factor: measures X-ray crystallographic error.
  R measures the difference between the observed reflections and the reflections predicted from the model.
  Should be close to or below 20%.
Temperature factor: lower is better.
  Measures thermal motion.
  Temperature factors for well-ordered residues are in the 1-15 range. Above 50 means the residue was invisible.
Main-chain torsion angles reflect the quality of the structure (sometimes). Torsion angles are restrained in refinement.
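The R-factor is conventionally computed as $R = \sum \big| |F_{obs}| - |F_{calc}| \big| / \sum |F_{obs}|$ over the measured reflections. A small sketch follows; the structure factor amplitudes in the example are made-up numbers, not data from any structure.

```python
def r_factor(f_obs, f_calc):
    """Crystallographic R: summed amplitude disagreement relative to summed observed amplitude."""
    if len(f_obs) != len(f_calc):
        raise ValueError("observed and calculated lists must have the same length")
    numerator = sum(abs(abs(fo) - abs(fc)) for fo, fc in zip(f_obs, f_calc))
    denominator = sum(abs(fo) for fo in f_obs)
    return numerator / denominator


# Hypothetical amplitudes for four reflections.
observed = [100.0, 50.0, 25.0, 10.0]
calculated = [90.0, 55.0, 20.0, 12.0]
r = r_factor(observed, calculated)   # 22 / 185, about 0.12, i.e. below the 20% guideline
```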

19 Electron Density Maps
[Figures: two-dimensional and three-dimensional electron density maps]

20 Known Structures
Crystallographic and NMR structures are models.
Models minimize the difference between observed and calculated data.
  Crystallography: diffraction intensities
  NMR: coupling between atoms (distance restraints)

21 Protein Models
[Figure: stereo images]

22 Cα trace

23 Protein Models
[Figure: NMR structural ensemble]

24 How good does a structure need to be?

25 Homology Modeling Assumptions
The overall 3-D structure of the target protein is similar to that of related proteins, and particularly the template structure.
Regions of conserved sequence have similar structure.
  Residues conserved throughout a family of proteins are the most structurally conserved.
  Residues involved in biological activity have similar structure throughout the protein family.
Loop regions (non-conserved residues) allow insertions and deletions without disrupting the core structure of the protein.
Loop regions are flexible and therefore need not be constructed as strictly as the conserved regions, assuming that they play no role in biological activity. This does not apply to proteins whose surface loops play critical roles.

26 Requirements for Homology-based Modeling
The query: the amino acid sequence of the protein to be built.
The template (AKA reference): the high-resolution structure of a homologous protein.
Desirables for a homology project:
  Additional sequences of related proteins (for multiple sequence alignment)
  Additional reference protein structures

27 Steps in Homology Modeling
Identify reference/template structures, one or more (the more the better).
  These will form the template for the target structure (model).
Sequence alignment
  The most important step: errors made at this point cannot be fixed.
  Use the best alignment possible.
  Multiple alignments are usually better than pairwise alignments.
  Proteins with less than ~30% sequence identity to the reference can be problematic.
Map sequence onto template
  Transfer the coordinates of structurally conserved regions (SCRs) from the template(s) to the target.
  Convert template side chains.
  Optimize sidechain orientations with a rotamer library.
Model variable regions: loops and side chains
  Loop insertions: search a high-resolution fragment database.
  Deletions: local minimizations may be sufficient.
Minimize free energy of model
  Local, especially loop-hinge regions
  Global molecular dynamics/energy minimization
Evaluate model

28 Locating and Aligning Homologs
The modeling idea: extrapolate knowledge of related protein structures to a new homologous sequence.
  Can include both related sequences and related 3D structures.
Approach: alignment procedures and database searches already learned in this course.
Extend the search beyond a single sequence: multiple alignments, profile analysis, or at least consensus sequences or regular expressions.
Motifs via the PROSITE database: regular expressions may be able to model some small regions, if not the entire protein.
Global vs local alignments: it may be possible to make separate models for independent domains and duplicated regions.

29 Sequence Similarity and Alignment
Homology modeling is based on using similar structures: no similar structures = no model.
Need sequence similarity across the whole sequence, not just in one part.
  40% amino acid identity or higher is best.
  Below 25% is less useful, but examples of success exist at this level.
  20%-35% sequence identity is often referred to as the "twilight zone".
Identify the template structure by a sequence-based search of the structure database (PDB) sequences.
  FASTA or BLAST
Multiple sequence comparison improves the sensitivity of the search and identifies highly conserved regions.
  Muscle, HMM, profile, ClustalW2, PSI-
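Percent identity is easy to compute once two sequences are aligned. The sketch below assumes one particular convention (identical residues divided by aligned, non-gap columns); other denominators are in use and give different numbers. The function names and the example sequences are hypothetical, and the labels simply restate the thresholds from this slide.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Identity over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    aligned_columns = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue                      # gap columns do not count
        aligned_columns += 1
        if a == b:
            matches += 1
    return 100.0 * matches / aligned_columns if aligned_columns else 0.0


def modeling_outlook(identity):
    """Rough label using the slide's thresholds (40% and 25%)."""
    if identity >= 40.0:
        return "favorable"
    if identity < 25.0:
        return "difficult"
    return "intermediate"


def in_twilight_zone(identity):
    """True for the 20%-35% range the slide calls the twilight zone."""
    return 20.0 <= identity <= 35.0


# Made-up pairwise alignment: 7 identical residues in 8 aligned columns.
identity = percent_identity("ACDEFGHIK", "ACDQFG-IK")
label = modeling_outlook(identity)
```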

30 Modeling Structurally Conserved Regions
Core regions must be examined for the effect of indels.
Sequence residues are copied into the positions of template residues in the three-dimensional structure.
  When the template residue is bigger, some empty space is left. Nature abhors a vacuum.
  When the template residue is smaller, there is steric conflict: some atoms are too close together or maybe even interpenetrating.

31 Sidechain Conflicts
Fixing conflicts
  Amino acid sidechains assume preferred positions (rotamers), which have been tabulated from known structures.
  Computationally try all rotamers for the sidechains affected in the region of a conflict.
  Not all problems can be fixed; some require backbone movement.
  Alternative alignments may be desirable.

32 Modeling Variable Regions (Loops)
Search the structure database for loops with similar size and anchor points.
Ab initio
  Use molecular dynamics/energy minimization to find a plausible (energetically reasonable) structure.
  The structure outside the loop region is not allowed to move.
  Mainly used for very small loops and deletions where the endpoints are close.

33 Models tend to stay close to template
[Figure: 1u5b/1qs0; comparison of the experimental model (1u5b) and the model templates, with RMSD for 1qs0 and 2o1x. Model error by position: red = high error]

34 How good are models?
[Figure: cABC II - C4S tetrasaccharide complex; pink: cABC I (template), gray: cABC II (model). Active site groove seen from the nonreducing and reducing end of the octasaccharide, respectively.]
The octasaccharide is readily accommodated in the active site of cABC I, but the access for the octasaccharide is constricted on the nonreducing end in cABC II.
Recombinant Expression, Purification, and Biochemical Characterization of Chondroitinase ABC II from Proteus vulgaris, Prabhakar et al., J. Biol. Chem. 284, 2009

35 Homology Modeling
Fragment assembly method (Rosetta)
Start with known structures in the PDB.
Divide them up into short fragments.
  9-residue library
  3-residue library
For the unknown protein, find the best 200 three- and nine-residue fragments at each position (sequence match).
Start with the protein in a fully extended conformation (no steric conflicts).
Energies (term: name)
  steric repulsion: vdw
  environment (solvation): env
  residue pair interactions: pair
  strand pairing (hydrogen bonding): SS
  strand arrangement in sheets: sheet
  helix-strand packing: HS
  radius of gyration (compactness): rg
  Cβ density (compactness): cbeta

36 Homology Modeling
Fragment assembly method (Rosetta)
Iterate 28,000 times:
  Choose a random 9-residue fragment in the model.
  Replace its torsion angles with one of the best from the list.
  Evaluate the energy; keep the change if it is better.
The energy function is a very approximate version of the MD energy function.
  Initially, only the steric overlap energy is calculated (until all initial torsion angles are replaced).
  Next 2,000 iterations: evaluate all energy terms except compactness; strand pairing weight = 0.3.
  Next 20,000 iterations: strand pairing weight = 1.0, compactness weight = 0.5.
  Last 6,000 iterations: full weights on energies.
Attempt to improve using 8,000 trials with the 3-residue fragment library.

37 Homology Modeling
Fragment assembly method (Rosetta)
[Figure: models and correct structure for CASP5 targets T0135 and T0171]

38 How good are models?

39 Modeling: Good or Bad?
Proteins whose structure cannot be solved by NMR or X-ray crystallography can still be modeled.
Modeling takes only a few hours, but 3D structures often take months to years to solve experimentally.
The accuracy of models can be very good, nearly as good as crystal structures in the best case.
  Can be good enough to generate lead compounds.
Models can be (and need to be) experimentally tested:
  NMR
  In vitro mutagenesis

40 Some sources of errors in comparative models
Errors due to misalignments: the largest source of error, minimized by constructing multiple alignments.
  No amount of MD will fix these errors.
Errors in sidechain packing: as sequences diverge, the packing of sidechains in the protein core changes.
  Backbone movements accommodate sidechain changes.
Distortions and shifts in correctly aligned regions: in some correctly aligned regions, the template is locally different from the target.
Errors in regions without a template: segments of the target sequence that have no equivalent region in the template structure are the most difficult regions to model (insertions and loops).
  If insertions are relatively short (less than 9 residues), some methods can correctly predict the conformation of the backbone.
Incorrect templates: this is a problem when distantly related proteins are used as templates.
  Difficult to distinguish between a model based on an incorrect template

41 Model Evaluation
If it were easy to tell a correct model from an incorrect one, the modeling process would be easy: one would simply use the "correctness" criterion as the objective function. Unfortunately, there is no completely satisfactory approach.
Techniques for evaluation
  Model geometry: bond lengths, bond angles, dihedral angles, van der Waals contacts, H bonds
    Programs used to evaluate models: VERIFY3D, PROSAII, HARMONY, ANOLEA, and many others
  Agreement with homologous sequences (multiple alignment, profile)
    Conserved regions in core, variable regions at surface
  Structural templates (3D profiles)
  Pair potentials (pseudo-energies)

42 Model Quality
[Figure: QMEAN scores (higher is better) for the model based on 1qs0 and the model based on 2o1x. Terms: torsion angles, pairwise potential, solvation, secondary structure potential, phi/psi agreement, solvent accessibility agreement]

43 Model Quality
Anolea: Atomic Non-Local Environment Assessment
Distance-based mean force potential
[Figure: Anolea profiles for the model based on 1qs0 and the model based on 2o1x]

44 Homology Modeling
Threading/inverse folding methods
Try to determine whether a sequence is compatible with a known structure.
  Inverse folding: predict sequence from 3-D structure.
  Compare to folding: predict 3-D structure from sequence.
  Threading: imagine pulling the sequence through the known structure until a best match is obtained.
Threading approaches
  Local environment methods
    Characterize each sequence position according to its local three-dimensional environment (3D profile).
    Simple to calculate match.
    Could allow flexibility in variable regions.
  Pseudo-energy methods (contact potential)
    Optimize pairwise interactions between residues in 3D space.
    Difficult calculation.
    Ensures that residue-residue interactions approximate real proteins.

45 Homology Modeling
Local environment methods: three-dimensional profile
For each residue in the three-dimensional structure, look at the structure type and surrounding residues to infer the spectrum of allowed substitutions.
  Secondary structure: alpha, beta, or coil
  Solvent accessibility: buried, partially buried, or accessible
  Hydrogen bonding / sidechain polarity
  18 total states
Preferred distributions of residues are calculated from known structures in the PDB.
  Probabilities for each of the 20 residues in each environment (observed frequencies are presumed to be optimal)
  Does not take conservation into account: conserved positions use the same distributions as unconserved ones.
Align to the profile as discussed previously.

46 Homology Modeling
Threading - pseudo-energy methods
Two approaches to threading: soft and hard threading (my terms)
Soft threading: move the sequence along the template structure, assuming that the interacting residues are the ones in the template.
  Equivalent to the local environment method
  Dynamic programming works
Hard threading: move the sequence through the structure, with gaps, calculating all of the interacting pairs.
  Very time consuming (NP-complete)

47 Homology Modeling
Pseudo-energy methods (quasi-energy, statistical potential, empirical energy function, knowledge-based force field)
The Boltzmann distribution relates probability to energy: $p_i = e^{-E_i/kT} / Z$.
Z is the partition function, the sum of $e^{-E_i/kT}$ over all states of the system, which normalizes the probabilities.
Frequencies at which residue pairs are seen in real structures can be converted to a pseudo-energy.
Calculate the energies for all residue pairs at all different separations.
The energy of any three-dimensional structure can then be calculated by summing the energies of all the pairs at the observed distances.
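The conversion from frequency to pseudo-energy is usually done by inverting the Boltzmann relation, $E = -kT \ln(f_{obs}/f_{ref})$, where $f_{ref}$ is the frequency expected in some reference state. The sketch below assumes that form; the reference state, the distance bins, and every number in the toy table are invented for illustration.

```python
import math

KT = 0.592  # kcal/mol at roughly 298 K


def pseudo_energy(observed_freq, reference_freq, kT=KT):
    """Inverse Boltzmann: pairs seen more often than expected get negative energy."""
    return -kT * math.log(observed_freq / reference_freq)


def structure_energy(contacts, energy_table):
    """Sum the tabulated pair energies over all (residue, residue, distance bin) contacts."""
    return sum(energy_table[contact] for contact in contacts)


# Hypothetical table for two residue pairs in one distance bin.
table = {
    ("LEU", "ILE", "4-6A"): pseudo_energy(0.020, 0.010),   # enriched: favorable
    ("LYS", "ARG", "4-6A"): pseudo_energy(0.005, 0.010),   # depleted: unfavorable
}
model_contacts = [("LEU", "ILE", "4-6A"), ("LEU", "ILE", "4-6A"), ("LYS", "ARG", "4-6A")]
e_model = structure_energy(model_contacts, table)
```

Scoring a threaded sequence then amounts to looking up each residue pair at its distance in the template and adding the entries.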

48 Homology Modeling
Pseudo-energy methods (see also fig 13.6 in text)

49 Homology Modeling
Threading: can it find matches that sequence matching cannot?
A is dihydrofolate reductase (DHFR)
  Interacts to form a homodimer
  Contains catalytic site
B is kinase SH3
  Interacts with other proteins to make protein-protein interactions
  Structural only
[Figure legend]
  DHFR: thick blue
  Human survival motor protein: grey
  E. coli biotin holoenzyme: magenta
  Repressor KotB: green
  HIV integrase: orange

50 Example Swiss-model (http://swissmodel.expasy.
Starting sequence: Medicago calcium-dependent protein kinase
Contains a protein kinase domain and an EF-hand Ca-binding domain

51 Example Swiss-model
Four templates found
  2vn9 83-375 Human calcium/calmodulin-dependent protein kinase
  2qg5 79-344 Cryptosporidium parvum calcium-dependent protein kinase
  3hx Toxoplasma gondii CDPK1
  2aao Arabidopsis thaliana calcium-dependent kinase, EF-hand region

52 Example Swiss-model
Alignment and structure assignment for each template (reference structure)
Deletions after residues 238, 260, 345
Insertion after 122, 140

53 Example Swiss-model
Deletions after residues 238, 260, 345
Insertion after 122, 140
Structurally conserved region
Add loops
Delete extra residues
Rotamer optimization
Energy minimization

54 Example Swiss-model
Structure assessment
  Gromos MD
  Anolea statistical potential

55 Example Swiss-model
Final model

56 Protein Analysis
Homologs: fructose bisphosphate aldolase

57 Protein Analysis
Homology vs structural similarity
TIM barrel proteins
  One of the most common protein folds (>900 examples)
  Active site always at the C-terminal end of the beta-barrel
Fructose 1,6-bisphosphate aldolase: homologs
Triose phosphate isomerase: probably not a homolog

58 Protein Analysis
Structurally similar?
Text, page 569: There are many cases where a protein shares little or no sequence homology and yet is a functional homolog.
While these proteins share a beta-sandwich architecture, they are connected entirely differently. Are they homologs?
  Polycystin 1 (polycystic kidney disease protein)
  A cell surface glycoprotein
  Histone deposition protein

59 Protein Analysis
Structure classifications
  SCOP: manual
  CATH: largely automatic

60 Protein Analysis
Structural Similarity
Structural similarity is measured by overlap of corresponding residue coordinates.
  The most commonly used measure is the RMS coordinate difference (RMSD).
  RMSD is very sensitive to outliers (car door effect).
The problem is how to find which residues correspond.
  DALI / FSSP: matches secondary structure elements regardless of connectivity.
  CE (combinatorial extension): builds up from small matching pieces, according to connectivity.
  VAST: secondary structure orientation and connectivity.
It is not clear which is best, and not clear how to evaluate significance, since completely unrelated structures are unavailable.
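The outlier sensitivity is easy to see numerically. The sketch below assumes the two structures are already superimposed and that the residue correspondence is known (the hard parts, which DALI, CE, and VAST address); the coordinates are invented.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between paired (x, y, z) coordinates."""
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must pair up one to one")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


# Ten hypothetical C-alpha positions; the second structure is identical
# except that the last atom is swung 10 Å away ("car door").
reference = [(float(i), 0.0, 0.0) for i in range(10)]
moved = reference[:-1] + [(9.0, 10.0, 0.0)]
value = rmsd(reference, moved)   # sqrt(100 / 10), about 3.16 Å
```

Nine of the ten atoms match exactly, yet the RMSD is over 3 Å, a value that would otherwise suggest a poor overall fit.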

61 Protein Analysis
Protein folds

62 Protein Analysis
The protein structure universe
[Figure: total and yearly counts]
How many protein folds are there?
Are there certain kinds of folds that are more stable?
How do you detect structural similarity?

63 Protein Families
Protein families: groups of homologous molecules
  Superfamily, family, subfamily classification introduced by Dayhoff
  Homeologous family
  Families are seen both across and within species.
Structural classes / folds: similar structures based on 3-dimensional coordinates
  May not be homologous; it is not clear to what extent certain structures are preferred by chance.
  Only recently becoming populated
Domain
  Sequence or structure based
  Independently folding unit
Families are important for information mapping because they give a guide to how much variation is expected between homologous proteins that maintain similar (or have different) function.

64 Protein Families
Dayhoff Protein Classification (hierarchical classification)
Folds: structural similarity
Superfamilies: P < 10^-3
  Highly probable homology
  Superfamilies generally are entire sequences (homeomorphic family).
  A newer concept is the homology domain, covering only part of the sequence.
Families: > 50% identical (~E < 10^-30)
  Clear homology
  Similar function: substrates and function similar but not identical
Subfamilies: > 80% identical (~E < 10^-80)
  Identical function
  Probably bind nearly identical substrates

65 Protein Families
Clusters of Orthologous Groups: COGs & KOGs
genomes, 38 orders, 28 classes, 14 phyla (192,987 proteins)
  prokaryotic (COGs): 5666
  eukaryotic (KOGs): 4852
Originally (1997), 3307 COGs were delineated by comparing protein sequences encoded in 43 complete genomes, representing 30 major phylogenetic lineages. Each COG consists of individual proteins or groups of paralogs from at least 3 lineages and thus corresponds to an ancient conserved domain.
% of the gene products from each of the complete bacterial and archaeal genomes and ~35% of those from the yeast Saccharomyces cerevisiae genome.

66 Protein Families
COGs
1. Perform the all-against-all protein sequence comparison.
2. Detect and collapse obvious paralogs, that is, proteins from the same genome that are more similar to each other than to any proteins from other species.
3. Detect triangles of mutually consistent, genome-specific best hits (BeTs), taking into account the paralogous groups detected at step 2.
4. Merge triangles with a common side to form COGs.
5. A case-by-case analysis of each COG. This analysis serves to eliminate false positives and to identify groups that contain multidomain proteins by examining the pictorial representation of the BLAST search outputs. The sequences of detected multidomain proteins are split into single-domain segments and steps 1-4 are repeated with these sequences, which results in the assignment of individual domains to COGs in accordance with their distinct evolutionary affinities.
6. Examination of large COGs that include multiple members from all or several of the genomes using phylogenetic trees, cluster analysis, and visual inspection of alignments; as a result, some of these groups are split into two or more smaller ones that are included in the final set of COGs.
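Steps 3 and 4 can be illustrated on a toy data set. The sketch below is a simplification, not the COG pipeline: it assumes best hits are already known, requires every side of a triangle to be a reciprocal best hit, ignores paralog collapsing, and treats "a common side" as two shared proteins. The protein and genome names are hypothetical.

```python
from itertools import combinations


def reciprocal_best_hits(best_hit, genome_of):
    """Edges between proteins that are each other's best hit in the other's genome."""
    edges = set()
    for p, hits in best_hit.items():
        for q in hits.values():
            if best_hit.get(q, {}).get(genome_of[p]) == p:
                edges.add(frozenset((p, q)))
    return edges


def find_triangles(edges, genome_of):
    """Triples from three different genomes with all three sides present."""
    proteins = sorted({p for edge in edges for p in edge})
    triangles = []
    for a, b, c in combinations(proteins, 3):
        if len({genome_of[a], genome_of[b], genome_of[c]}) < 3:
            continue
        if all(frozenset(pair) in edges for pair in ((a, b), (a, c), (b, c))):
            triangles.append(frozenset((a, b, c)))
    return triangles


def merge_triangles(triangles):
    """Repeatedly merge clusters that share a side (at least two members)."""
    clusters = [set(t) for t in triangles]
    merged = True
    while merged:
        merged = False
        for i, j in combinations(range(len(clusters)), 2):
            if len(clusters[i] & clusters[j]) >= 2:
                clusters[i] |= clusters[j]
                del clusters[j]
                merged = True
                break                      # restart: indices changed
    return clusters


# Toy input: best_hit[protein][genome] is the protein's best hit in that genome.
genome_of = {"a1": "A", "a2": "A", "b1": "B", "c1": "C", "d1": "D"}
best_hit = {
    "a1": {"B": "b1", "C": "c1"},
    "a2": {"B": "b1"},                     # one-way hit only
    "b1": {"A": "a1", "C": "c1", "D": "d1"},
    "c1": {"A": "a1", "B": "b1", "D": "d1"},
    "d1": {"A": "a2", "B": "b1", "C": "c1"},
}
edges = reciprocal_best_hits(best_hit, genome_of)
cogs = merge_triangles(find_triangles(edges, genome_of))
```

Here the triangles (a1, b1, c1) and (b1, c1, d1) share the side b1-c1 and merge into one group, while a2 is left out because none of its hits is reciprocated.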

67 Protein Families
COGs & KOGs
How well do COGs cover complete genomes?
[Figures: phyletic patterns of COGs; phyletic patterns of KOGs]

68 Protein Families
COGs

69 Protein Families
EggNOG: automatic COGs
630 genomes: 529 bacteria, 46 archaea, 55 eukarya
224,847 OGs
9724 extended versions of original COG and KOG
[Figure legend] Green = function annotated; orange = unannotated; gray = no match

70 Protein Families
Structural classifications
SCOP
  Heuristic classification according to traditional crystallographic ideas
  Recently used as a standard for sequence comparisons
  v1.75, June PDB Entries Domains.
CATH
  Systematic semi-automatic procedure with a more clearly defined process
  Version 3.3.0, July ,625 PDB chains, 128,688 domains

71 Protein Families
SCOP: primarily manually curated according to traditional crystallographic ideas
Family: clear evolutionary relationship
  Generally, pairwise residue identities greater than 30%. In some cases, similar functions and structures provide definitive evidence of common descent in the absence of high sequence identity; for example, many globins form a family though some members have sequence identities of only 15%.
Superfamily: probable common evolutionary origin
  Low sequence identity, but structural and functional features suggest a common evolutionary origin. For example, actin, the ATPase domain of the heat shock protein, and hexokinase together form a superfamily.
Fold: major structural similarity
  Major secondary structures in the same arrangement and topology. Proteins with the same fold often have peripheral elements of secondary structure and turn regions that differ in size and conformation. Proteins with a common fold may not have a common evolutionary origin: the structural similarities could arise from physical-chemical properties of proteins that

72 Protein Families
SCOP 1.75A statistics: PDB entries (released/updated prior to ) Domains. 1 Literature reference.
[Table: number of folds, superfamilies, and families for each class]
  a: All alpha proteins
  b: All beta proteins
  c: Alpha and beta proteins (a/b)
  d: Alpha and beta proteins (a+b)
  e: Multi-domain proteins (alpha and beta)
  f: Membrane and cell surface proteins and peptides
  g: Small proteins
  Totals

73 Protein Families: SCOP Class, All Alpha Proteins
Globin-like (2) (globins and phycocyanins): core of 6 helices; folded leaf, partly opened.
Long alpha-hairpin (11): 2 helices; antiparallel hairpin, left-handed twist.
Cytochrome c (1): core of 3 helices; folded leaf, opened.
DNA-binding 3-helical bundle (10): core of 3 helices; bundle, closed or partly opened, right-handed twist; up-and-down.
Many more.

74 Protein Families: CATH Classification (http://www.cathdb.info/), v3.5.0, September 2011. CATH is more formally specified and less reliant on human intervention than SCOP. CATH ,536 domains; 2,626 superfamilies; 51,334 PDB entries.

75 Protein Families: CATH Classification
Class: determined according to the secondary structure composition and packing within the structure. Assigned automatically using the method of Michie et al. (1996).
Architecture: the overall shape of the domain structure as determined by the orientations of the secondary structures; ignores the connectivity between the secondary structures. Assigned manually.
Topology: fold families at this level depend on both the overall shape and the connectivity of the secondary structures. Assigned using the structure comparison algorithm SSAP (Taylor & Orengo, 1989).
Homologous Superfamily: similarities are identified first by sequence comparisons and subsequently by structure comparison using SSAP. Criteria (any one of):
(1) sequence identity >= 35%, with 60% of the larger structure equivalent to the smaller;
(2) SSAP score >= 80.0 and sequence identity >= 20%, with 60% of the larger structure equivalent to the smaller;
(3) SSAP score >= 80.0, with 60% of the larger structure equivalent to the smaller, and the domains have related functions.
Sequence Families: domains clustered in the same sequence families have sequence identities
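The homologous superfamily criteria on this slide can be written as a small decision function. This is only a sketch: the function name and arguments are hypothetical, it is not CATH's own code, and it assumes the three criteria are alternatives, each requiring that at least 60% of the larger structure is equivalent to the smaller.

```python
def same_homologous_superfamily(seq_identity, overlap, ssap_score=None,
                                related_function=False):
    """Apply the three slide criteria to one pair of domains.

    seq_identity     -- pairwise sequence identity, in percent
    overlap          -- percent of the larger structure equivalent to the smaller
    ssap_score       -- SSAP structure comparison score, or None if not computed
    related_function -- True if the two domains have related functions
    """
    # Every criterion on the slide requires at least 60% structural overlap.
    if overlap < 60.0:
        return False
    # Criterion 1: high sequence identity alone is enough.
    if seq_identity >= 35.0:
        return True
    # Criteria 2 and 3 both need a strong structural match (SSAP >= 80).
    if ssap_score is not None and ssap_score >= 80.0:
        return seq_identity >= 20.0 or related_function
    return False
```

For example, a pair with 25% identity, 70% overlap, and an SSAP score of 85 passes through criterion 2, while the same pair at 10% identity passes only if the functions are related.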

76 Protein Families: 2010 CATH Classification

77 Protein Docking
Finding binding sites (proteins with unknown function): conserved surface areas; hydrophobic surface area; highly charged areas (mostly for nucleic acid binding); active sites are usually in pockets.
Proteins with known partners (docking): rotate and translate in all possible orientations; use a scoring function to evaluate the match (charge, shape, hydrophobicity).
How should you deal with flexibility of the protein (induced fit)?

78 Protein Docking: Conformational Search (text: Ch 14)
Given two proteins with three-dimensional structures, how do they bind? Hold one fixed; rotate and translate the other.
3 angles, 10° increments = 23,000 positions
3 translational parameters, 100 Å at 0.5 Å intervals = 8 × 10^6 positions
Total ≈ 2 × 10^11 positions to consider
All docking methods use approximations.
What is a good position? Electrostatic interactions, steric interactions, solvent effects.
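The search-space arithmetic on this slide can be checked with a few lines of Python. The slide does not say how the figure of 23,000 orientations was obtained; the sketch below assumes the usual Euler-angle convention, in which two angles span 360° and the third spans 180°, which gives a count close to the slide's number.

```python
# Rotational sampling at 10 degree increments.
# Assumption: two Euler angles cover 0-360 degrees, the third covers 0-180.
angle_step = 10
n_rotations = (360 // angle_step) * (360 // angle_step) * (180 // angle_step)

# Translational sampling: 100 Angstrom box at 0.5 Angstrom intervals, 3 axes.
box_length = 100.0
grid_spacing = 0.5
n_translations = int(box_length / grid_spacing) ** 3

total_poses = n_rotations * n_translations

print(f"orientations: {n_rotations:,}")      # about 23,000
print(f"translations: {n_translations:,}")   # 8 x 10^6
print(f"total poses:  {total_poses:.1e}")    # about 2 x 10^11
```

Even at one microsecond per pose, scoring every position would take roughly two days for a single rigid pair, which is why all docking methods rely on approximations.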

79 Protein Docking: Search Methods
Monte Carlo (Metropolis) methods are the most common: start in a random position; calculate the approximate energy; make a random move; accept the move probabilistically based on the energy difference.
Often merged with a genetic algorithm: consider many random starting positions (each is a genome); each random modification is a mutation; fitness is energy.
Examples: Gold (see text), Autodock.
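The Monte Carlo steps listed on this slide can be sketched as follows. This is a minimal illustration, not the algorithm used by Gold or Autodock: `toy_energy` is a hypothetical stand-in for a docking score, the pose is just three translational coordinates, and the temperature is expressed in the same arbitrary units as the energy.

```python
import math
import random


def toy_energy(pose):
    """Hypothetical score: a quadratic bowl with its minimum at (1, -2, 0.5)."""
    target = (1.0, -2.0, 0.5)
    return sum((p - t) ** 2 for p, t in zip(pose, target))


def metropolis_accept(delta_e, temperature, rng):
    """Always accept downhill moves; accept uphill moves with Boltzmann probability."""
    if delta_e <= 0:
        return True
    return rng.random() < math.exp(-delta_e / temperature)


def monte_carlo_search(energy, start, steps=5000, step_size=0.3,
                       temperature=0.5, seed=0):
    rng = random.Random(seed)
    current = list(start)
    current_e = energy(current)
    best, best_e = list(current), current_e
    for _ in range(steps):
        # Make a random move: perturb every coordinate by a small amount.
        trial = [x + rng.uniform(-step_size, step_size) for x in current]
        trial_e = energy(trial)
        # Accept or reject based on the energy difference.
        if metropolis_accept(trial_e - current_e, temperature, rng):
            current, current_e = trial, trial_e
            if current_e < best_e:
                best, best_e = list(current), current_e
    return best, best_e


best_pose, best_energy = monte_carlo_search(toy_energy, [5.0, 5.0, 5.0])
print(best_pose, best_energy)
```

Accepting some uphill moves is what lets the search escape local minima; a real docking run would also perturb the rotation and would repeat the search from many random starting positions.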

80 Protein Docking: Search Methods
Other methods: point complementarity, distance geometry, tabu search.
CAPRI (Critical Assessment of PRedicted Interactions): docking contest (like CASP).

81 Protein Docking: Quality
Is it a good fit (scoring function)?
MD energy models (force fields) from MD programs such as CHARMM, AMBER, Gromos: time consuming to calculate.
Approximate models, usually focusing on electrostatics and atomic overlap.
Statistical potentials (pseudo energies, knowledge-based scoring).
Problems: both molecules can move to accommodate binding (induced fit). Water in the binding site may be bound and act as part of the molecule, or water may be released, resulting in an entropy increase (ΔG = ΔH − TΔS).
Flexible docking allows the molecules to move.
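An approximate model of the kind described on this slide, covering only electrostatics and atomic overlap, might look like the sketch below. The function name, the flat clash penalty, the 3 Å clash distance, and the dielectric constant of 4 are all illustrative assumptions rather than values from the lecture; 332 is the standard Coulomb conversion factor for charges in electron units, distances in Å, and energies in kcal/mol.

```python
import math

COULOMB_KCAL = 332.0  # kcal*Angstrom/(mol*e^2)


def approximate_score(atoms_a, atoms_b, clash_distance=3.0,
                      clash_penalty=10.0, dielectric=4.0):
    """Score a rigid pose; lower is better.

    Each atom is a tuple (x, y, z, partial_charge). Pairs closer than
    clash_distance get a flat steric penalty (assumed form); all other
    pairs contribute a Coulomb term.
    """
    score = 0.0
    for xa, ya, za, qa in atoms_a:
        for xb, yb, zb, qb in atoms_b:
            r = math.dist((xa, ya, za), (xb, yb, zb))
            if r < clash_distance:
                score += clash_penalty          # atomic overlap
            else:
                score += COULOMB_KCAL * qa * qb / (dielectric * r)
    return score


# Opposite unit charges 5 Angstroms apart: favorable (negative) score.
print(approximate_score([(0.0, 0.0, 0.0, 1.0)], [(5.0, 0.0, 0.0, -1.0)]))
```

A function like this is cheap enough to evaluate for millions of poses, but it ignores solvent, entropy, and flexibility, which are exactly the problems the slide lists.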

82 Protein Docking: Scoring functions are not that great
Example: trypsin/trypsin inhibitor. 2PTC: beta trypsin (structure with inhibitor, I). 1TPO: beta trypsin (structure without I).
The bound structure is often significantly different from the free structure. Even when the binding site is correct, the conformation may still be wrong.
Figure: 2PTC vs inhibitor.

83 Protein Docking
