11/7/05
Protein Structure: Classification, Databases, Visualization

Announcements
BCB 544 Projects - Important Dates:
  - Nov 2 Wed noon - Project proposals due to David/Drena
  - Nov 4 Fri PM - Approvals/responses & tentative presentation schedule to students
  - Dec 2 Fri noon - Written project reports due
  - Dec 5,7,8,9 class/lab - Oral Presentations (20')
  - (Dec 15 Thurs = Final Exam)

Bioinformatics Seminars
  - Nov 7 Mon 12:10 IG Faculty Seminar in 101 Ind Ed II
    Inborn Errors of Metabolism in Humans & Animal Models
    Matt Ellinwood, Animal Science, ISU
  - Nov 10 Thurs 3:40 Com S Seminar in 223 Atanasoff
    Computational Epidemiology
    Armin R. Mikler, Univ. North Texas
    http://www.cs.iastate.edu/~colloq/#t3

CORRECTION: Bioinformatics Seminars
Next week - Baker Center/BCB Seminars (seminar abstracts available at above link):
  - Nov 14 Mon 1:10 PM Doug Brutlag, Stanford
    Discovering transcription factor binding sites
  - Nov 15 Tues 1:10 PM Ilya Vakser, Univ Kansas
    Modeling protein-protein interactions
Both seminars will be in Howe Hall Auditorium

Protein Structure & Function: Analysis & Prediction
  - Mon: Protein structure: classification, databases, visualization
  - Wed: Protein structure: prediction & modeling
  - Thurs Lab: Protein structure prediction
  - Fri: Protein-nucleic acid interactions; Protein-ligand docking

Reading Assignment (for Mon-Fri)
  - Mount Bioinformatics Chp 10: Protein classification & structure prediction
    http://www.bioinformaticsonline.org/ch/ch10/index.html
    pp. 409-491
    Ck Errata: http://www.bioinformaticsonline.org/help/errata2.html
  - Other? Additional reading assignments for BCB 544

D Dobbs ISU - BCB 444/544X

RNA structure prediction strategies
Review last lecture: RNA Structure Prediction Algorithms

Secondary structure prediction
  1) Energy minimization (thermodynamics)
  2) Comparative sequence analysis (co-variation)
  3) Combined experimental & computational

1) Energy minimization method
What are the assumptions?
  - Native tertiary structure or "fold" of an RNA molecule is (one of) its lowest free energy configuration(s)
  - Gibbs free energy = ΔG in kcal/mol at 37 °C = equilibrium stability of structure
  - Lower (more negative) values are more favorable
Is this assumption valid in vivo?
  - It may not hold, but we don't really know

Gibbs free energy: ΔG
Gibbs free energy (G) is formally defined in terms of the state functions enthalpy & entropy, & the state variable temperature:
  $G = H - TS$
  $\Delta G = \Delta H - T\Delta S$ (for constant temp)
  - Enthalpy (H) = amount of heat absorbed by a system at constant pressure
  - Entropy (S) = measure of the amount of disorder or randomness in a system
  - Note: this is not the same as "entropy" in information theory, but is related, see: http://en.wikipedia.org/wiki/information_theory

Gibbs free energy: ΔG
Gibbs free energy for formation of an RNA or protein structure = ΔG = equilibrium stability of that structure at a specific temperature (kcal/mol at 37 °C)
  $\Delta G = -RT \ln K_{eq}$
  - R = gas constant, T = absolute temperature, K_eq = equilibrium constant

Nearest-neighbor parameters
  - Most methods for free energy minimization use nearest-neighbor parameters (derived from experiment) to predict the stability of an RNA secondary structure (in terms of ΔG at 37 °C)
  - Most available software packages use the same set of parameters: Mathews, Sabina, Zuker & Turner, 1999
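
To make the two relations on the Gibbs free energy slides concrete, here is a minimal sketch that converts between ΔG and K_eq at 37 °C. The function names and the example ΔH/ΔS values are hypothetical and are not from the lecture. The gas constant is taken as roughly 1.987 × 10^-3 kcal/(mol·K), to match the kcal/mol units used above.

```python
import math

R_KCAL = 1.987e-3   # gas constant in kcal/(mol*K)
T_37C = 310.15      # 37 degrees C expressed in kelvin


def delta_g(delta_h, delta_s, temp=T_37C):
    """dG = dH - T*dS, with dH in kcal/mol and dS in kcal/(mol*K)."""
    return delta_h - temp * delta_s


def keq_from_delta_g(dg, temp=T_37C):
    """Invert dG = -RT ln(K_eq) to get the equilibrium constant."""
    return math.exp(-dg / (R_KCAL * temp))


def delta_g_from_keq(keq, temp=T_37C):
    """dG = -RT ln(K_eq), in kcal/mol."""
    return -R_KCAL * temp * math.log(keq)


if __name__ == "__main__":
    # Made-up enthalpy and entropy changes for forming a small structure.
    dg = delta_g(delta_h=-30.0, delta_s=-0.085)
    print(f"dG at 37 C = {dg:.2f} kcal/mol")
    print(f"K_eq       = {keq_from_delta_g(dg):.1f}")
```

A useful rule of thumb falls out of the same formula: at 37 °C, each factor of 10 in K_eq corresponds to about 1.4 kcal/mol of ΔG, which is why folded structures with ΔG of only a few kcal/mol are described as marginally stable.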

Energy minimization - calculations
(Fig 6.3, Baxevanis & Ouellette 2005)
Total free energy of a specific conformation for a specific RNA molecule = sum of incremental energy terms for:
  - helical stacking (sequence dependent)
  - loop initiation
  - unpaired stacking
(favorable "increments" are < 0)

But how many possible conformations for a single RNA molecule?
Huge number: Zuker estimates $1.8^N$ possible secondary structures for a sequence of N nucleotides
  - for 100 nts (small RNA) = $3 \times 10^{25}$ structures!
Solution? Not exhaustive enumeration
  - Dynamic programming: $O(N^3)$ in time, $O(N^2)$ in space/storage
  - This holds only if pseudoknots are excluded; otherwise $O(N^6)$ time, $O(N^4)$ space

Algorithms based on energy minimization
For an outline of the algorithm used in Mfold, including a description of the dynamic programming recursion, please visit Michael Zuker's lecture:
http://www.bioinfo.rpi.edu/~zukerm/lectures/rnafold-html
From this site, you may also download Zuker's lecture as either a PDF or PS file.

2) Comparative sequence analysis (co-variation)
Two basic approaches:
  - Algorithms constrained by initial alignment
      Much faster, but not as robust as unconstrained
      Base-pairing probabilities determined by a partition function
  - Algorithms not constrained by initial alignment
      Genetic algorithms often used for finding an alignment & set of structures

RNA structure prediction strategies
Tertiary structure prediction
Requires "craft" & significant user input & insight
  1) Extensive comparative sequence analysis to predict tertiary contacts (co-variation)
     e.g., MANIP - Westhof
  2) Use experimental data to constrain model building
     e.g., MC-SYM - Major
  3) Homology modeling using sequence alignment & reference tertiary structure (not many of these!)
  4) Low resolution molecular mechanics
     e.g., yammp - Harvey

New Last Time: Protein Structure & Function
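
Looking back at the RNA folding slides, the $O(N^3)$ time and $O(N^2)$ space dynamic programming idea can be sketched as follows. This is not the Mfold recursion: as a simplifying assumption it maximizes the number of nested base pairs (a Nussinov-style recursion) instead of minimizing nearest-neighbor free energy. It has the same table shape and the same restriction to pseudoknot-free structures. The function name, the allowed-pair set (Watson-Crick plus G-U wobble), and the minimum hairpin loop size of 3 are choices made for this illustration.

```python
# Allowed base pairs: Watson-Crick plus G-U wobble (an assumption for this sketch).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested (pseudoknot-free) base pairs in seq.

    dp[i][j] holds the best score for the subsequence seq[i..j].
    Filling the N x N table takes O(N^2) space, and each cell scans
    O(N) split points, giving O(N^3) time.
    """
    n = len(seq)
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]                      # case 1: j is unpaired
            for k in range(i, j - min_loop):         # case 2: j pairs with k
                if (seq[k], seq[j]) in PAIRS:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1] if n else 0


if __name__ == "__main__":
    # Exhaustive enumeration is hopeless: Zuker's estimate for N = 100.
    print(f"1.8^100 is about {1.8 ** 100:.1e} structures")
    print(max_base_pairs("GGGAAAUCC"))
```

An energy-based method replaces the "+1 per pair" score with stacking and loop increments, which requires extra tables but keeps the same cubic-time structure when pseudoknots are excluded.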

Protein Structure & Function
  - Protein structure - primarily determined by sequence
  - Protein function - primarily determined by structure
  - Globular proteins: compact hydrophobic core & hydrophilic surface
  - Membrane proteins: special hydrophobic surfaces
  - Folded proteins are only marginally stable
  - Some proteins do not assume a stable "fold" until they bind to something = Intrinsically disordered
  - Predicting protein structure and function can be very hard -- & fun!

4 Basic Levels of Protein Structure

Primary & Secondary Structure
Primary
  - Linear sequence of amino acids
  - Description of covalent bonds linking aa's
Secondary
  - Local spatial arrangement of amino acids
  - Description of short-range non-covalent interactions
  - Periodic structural patterns: α-helix, β-sheet

Tertiary & Quaternary Structure
Tertiary
  - Overall 3-D "fold" of a single polypeptide chain
  - Spatial arrangement of 2° structural elements; packing of these into compact "domains"
  - Description of long-range non-covalent interactions (plus disulfide bonds)
Quaternary
  - In proteins with > 1 polypeptide chain, spatial arrangement of subunits

"Additional" Structural Levels
  - Super-secondary elements
  - Motifs
  - Domains
  - Foldons

New Today: Protein Structure & Function
  - Amino acid characteristics
  - Structural classes & motifs
  - Protein functions & functional families (not much - more on this later)
  - Classification
  - Databases
  - Visualization

Amino Acids
  - Each of 20 different amino acids has a different "R-group," a side chain attached to Cα

Peptide bond is rigid and planar

Hydrophobic Amino Acids

Charged Amino Acids

Polar Amino Acids

Certain side-chain configurations are energetically favored (rotamers)

Ramachandran plot: "Allowable" psi & phi angles
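
The phi and psi angles on a Ramachandran plot are dihedral (torsion) angles defined by four consecutive backbone atoms. The sketch below shows one common way to compute a dihedral angle from four 3-D points using only the standard library. The function names are hypothetical, and the atom orderings in the comments (C, N, Cα, C for phi; N, Cα, C, N for psi) are the usual convention, not something stated on the slides.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Dihedral angle in degrees (-180..180) about the p1-p2 bond.

    phi(i) = dihedral(C[i-1], N[i], CA[i], C[i])
    psi(i) = dihedral(N[i], CA[i], C[i], N[i+1])
    """
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)          # unit vector along the central bond
    # Remove the components along the central bond, leaving the two
    # vectors as seen when looking down that bond.
    v = tuple(x - _dot(b0, b1) * y for x, y in zip(b0, b1))
    w = tuple(x - _dot(b2, b1) * y for x, y in zip(b2, b1))
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))


if __name__ == "__main__":
    # Four made-up points in a trans (fully extended) arrangement.
    print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1)))
```

Computing phi and psi this way for every residue of a structure, then plotting psi against phi, gives the Ramachandran plot; residues in α-helices and β-sheets cluster in separate "allowed" regions.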

Glycine is smallest amino acid
  - R group = H atom
  - Glycine residues increase backbone flexibility because the R group is only an H atom (no bulky side chain)

Proline is cyclic
  - Proline residues reduce flexibility of polypeptide chain
  - Proline cis-trans isomerization is often a rate-limiting step in protein folding
  - Recent work suggests it may also regulate ligand binding in native proteins (Andreotti)

Cysteines can form disulfide bonds
  - Disulfide bonds (covalent) stabilize 3-D structures
  - In eukaryotes, disulfide bonds are generally found only in secreted proteins or extracellular domains

Globular proteins have a compact hydrophobic core
  - Packing of hydrophobic side chains into interior is main driving force for folding
  - Problem? Polypeptide backbone is highly polar (hydrophilic) due to polar -NH and C=O in each peptide unit; these polar groups must be neutralized
  - Solution? Form regular secondary structures, e.g., α-helix, β-sheet, stabilized by H-bonds

Exterior surface of globular proteins is generally hydrophilic
  - Hydrophobic core formed by packed secondary structural elements provides compact, stable core
  - "Functional groups" of protein are attached to this framework; exterior has more flexible regions (loops) and polar/charged residues
  - Hydrophobic "patches" on protein surface are often involved in protein-protein interactions

Protein Secondary Structures
  - α Helix
  - β Sheets
  - Loops
  - Coils

α-Helix
  - Most abundant 2° structure in proteins
  - Average length = 10 aa's (~15 Å, at 1.5 Å rise per residue)
  - Length varies from 5-40 aa's
  - Alignment of H-bonds creates dipole moment (partial positive charge at NH end)
  - Often at surface of core, with hydrophobic residues on inner-facing side, hydrophilic on other side

α helix is stabilized by H-bonds between ~ every 4th residue
(Figure color key: C = black, O = red, N = blue)

R-groups are on outside of α helix

Types of α helices
  - "Standard" α helix: 3.6 residues per turn
  - H-bonds between C=O of residue n and NH of residue n + 4
  - Helix ends are polar; almost always on surface of protein
  - Other types of helices?
      n + 5 = π helix
      n + 3 = 3₁₀ helix

Certain amino acids are "preferred" & others are rare in α helices
  - Ala, Glu, Leu, Met = good helix formers
  - Pro, Gly, Tyr, Ser = very poor
  - Amino acid composition & distribution varies, depending on location of helix in 3-D structure

β-strands & Sheets
  - H-bonds formed between 5-10 consecutive residues in one portion of chain with another set of 5-10 residues farther down chain
  - Interacting regions may be adjacent (with short loop between) or far apart
  - β-sheets usually have all strands either parallel or antiparallel
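
Two numbers from the α-helix slides lend themselves to a quick calculation: the n to n + 3/4/5 H-bond spacing of the three helix types, and the 3.6 residues per turn, which places successive side chains 100° apart around the helix axis. The sketch below lists the H-bonded residue pairs and the "helical wheel" angle of each residue. The function names and the example peptide are hypothetical, and residue numbering starting at 1 is an assumption.

```python
# H-bond offset between the C=O of residue n and the NH of residue n + offset.
HELIX_OFFSETS = {"3-10": 3, "alpha": 4, "pi": 5}


def hbond_pairs(n_residues, helix="alpha"):
    """List (n, n + offset) backbone H-bond pairs for a helix of n_residues."""
    offset = HELIX_OFFSETS[helix]
    return [(i, i + offset) for i in range(1, n_residues - offset + 1)]


def wheel_angles(seq, residues_per_turn=3.6):
    """Angular position (degrees) of each side chain viewed down the helix axis."""
    step = 360.0 / residues_per_turn        # 100 degrees for a standard alpha helix
    return [(aa, round(i * step, 1) % 360) for i, aa in enumerate(seq)]


if __name__ == "__main__":
    print(hbond_pairs(10, "alpha"))
    for aa, angle in wheel_angles("LKELLKKL"):   # made-up peptide
        print(aa, angle)
```

Residues whose wheel angles fall within the same half-circle lie on the same face of the helix, which is how a helix at the protein surface can present hydrophobic residues to the core and hydrophilic residues to the solvent.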

Antiparallel β-sheet

Parallel β-sheet

Mixed β-sheets also occur

Loops
  - Connect helices and sheets
  - Vary in length and 3-D configurations
  - Are located on surface of structure
  - Are more "tolerant" of mutations
  - Are more flexible and can adopt multiple conformations
  - Tend to have charged and polar amino acids
  - Are frequently components of active sites
  - Some fall into distinct structural families (e.g., hairpin loops, reverse turns)

Coils
  - Regions of 2° structure that are not helices, sheets, or recognizable turns
  - Intrinsically disordered regions appear to play important functional roles

Globular proteins are built from recurring structural patterns
  - Motifs or supersecondary structures = combinations of 2° structural elements
  - Domains = combinations of motifs
      Independently folding unit (foldon)
      Functional unit

A few common structural motifs
  - Helix-turn-helix: e.g., DNA binding
  - Helix-loop-helix: e.g., Calcium binding
  - β-hairpin: 2 adjacent antiparallel strands connected by short loop
  - Greek key: 4 adjacent antiparallel strands
  - β-α-β: 2 parallel strands connected by helix

Figure slides: H-T-H; H-L-H; β-hairpin; Greek key; Beta-alpha-beta

Simple motifs combine to form domains

Large polypeptide chains fold into several domains

6 main classes of protein structure
  1) α Domains: Bundles of helices connected by loops
  2) β Domains: Mainly antiparallel sheets, usually with 2 sheets forming sandwich
  3) α/β Domains: Mainly parallel sheets with intervening helices, also mixed sheets
  4) α+β Domains: Mainly segregated helices and sheets
  5) Multidomain (α & β): Containing domains from more than one class
  6) Membrane & cell-surface proteins

α-domain structures: coiled-coils

α-domain structures: 4-helix bundles

All-α proteins: Globins

β-domain structures
  - Anti-parallel β structures
  - Functionally most diverse
  - Includes:
      Up-and-down sheets or barrels
      Propeller-like structures
      Jelly roll barrels (from Greek key motifs)

Up-and-down sheets and barrel

Up-and-down sheets can form propeller-like structures

Greek key motifs can form jelly roll barrels

α/β-domain structures: 3 main classes
  - TIM barrel = Core of twisted parallel strands close together
  - Rossmann fold = open twisted sheet surrounded by helices on both sides
  - Leucine-rich motif = specific pattern of Leu residues, strands form a curved sheet with helices on outside

Figure slides: TIM barrel; Rossmann fold

Leucine-rich motifs can form α/β horseshoes

Protein structure databases, structural classification & visualization
  - PDB = Protein Data Bank
    http://www.rcsb.org/pdb/
    (RCSB) - several different structure viewers
  - MMDB = Molecular Modeling Database
    http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=structure
    (NCBI Entrez) - Cn3D viewer
  - SCOP = Structural Classification of Proteins
    Levels reflect both evolutionary and structural relationships
  - CATH = Classification by Class, Architecture, Topology and Homology
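
Structures downloaded from the PDB are commonly distributed as fixed-column text files in which each atom is one ATOM (or HETATM) record. As a minimal sketch of what a viewer has to read, the code below pulls atom names, residue names, chain IDs, and coordinates out of such records by column position. The column ranges follow the ATOM record layout of the PDB file format as I understand it. The `Atom` and `parse_atoms` names are hypothetical, and the two sample records use made-up coordinates, not data from a real entry.

```python
from typing import List, NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atoms(pdb_text: str) -> List[Atom]:
    """Extract atoms from ATOM/HETATM records using fixed column slices."""
    atoms = []
    for line in pdb_text.splitlines():
        if line[0:6] not in ("ATOM  ", "HETATM"):
            continue                                  # skip REMARK, END, etc.
        atoms.append(Atom(
            serial=int(line[6:11]),                   # columns 7-11
            name=line[12:16].strip(),                 # columns 13-16
            res_name=line[17:20].strip(),             # columns 18-20
            chain=line[21],                           # column 22
            res_seq=int(line[22:26]),                 # columns 23-26
            x=float(line[30:38]),                     # columns 31-38
            y=float(line[38:46]),                     # columns 39-46
            z=float(line[46:54]),                     # columns 47-54
        ))
    return atoms


# Illustrative records only; the coordinates are made up.
SAMPLE = "\n".join([
    "REMARK illustrative sample, not a real PDB entry",
    "ATOM      1  N   GLY A   1      11.104   6.134  -6.504  1.00  0.00           N",
    "ATOM      2  CA  GLY A   1      11.639   6.071  -5.147  1.00  0.00           C",
    "END",
])

if __name__ == "__main__":
    for atom in parse_atoms(SAMPLE):
        print(atom)
```

Selecting only the atoms named CA from the parsed list gives the Cα trace that many viewers draw as a backbone or ribbon; for real work, an established parser is a safer choice than hand-written column slicing.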