Data File Formats. There are dozens of file formats for chemical data.


Introduction

There are dozens of file formats for chemical data. This overview covers a few that are often used in structural bioinformatics.

PDB File Format (1)

The PDB file format specification is rather extensive. A tutorial on the topic can be found at: http://www.wwpdb.org/documentation/format23/v2.3.html

A PDB file includes the following information:

HEADER  First line of the file; contains the PDB ID code, classification, and date of deposition.
COMPND  A description of the macromolecular contents.
SOURCE  Biological source of the macromolecule.
REVDAT  Revision dates.
JRNL    Reference to the journal article dealing with this structure.

PDB File Format (2)

A PDB file also includes:

AUTHOR  List of contributors.
REMARK  General remarks, some structured, some free form.
SEQRES  Primary sequence of the residues on a protein backbone.

Example SEQRES records:

SEQRES   1 A   99  PRO GLN ILE THR LEU TRP GLN ARG PRO LEU VAL THR ILE  1HVS  91
SEQRES   2 A   99  LYS ILE GLY GLY GLN LEU LYS GLU ALA LEU LEU ASP THR  1HVS  92

Column layout of a SEQRES record:

Columns 1-6                          Record name
Columns 9-10                         Serial number
Column 12                            Chain ID
Columns 14-17                        Number of residues
Columns 20+4n to 22+4n (n = 0..12)   Residue names
End of line                          PDB ID and file line number
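The SEQRES column layout above lends itself to plain string slicing. The following is a minimal sketch, not a full PDB parser: the function name `parse_seqres` and the returned dictionary keys are illustrative choices, and the example record is rebuilt from the 1HVS line shown above with a format string so that the column positions are explicit.

```python
def parse_seqres(line):
    """Split one SEQRES record into fields using the fixed column layout."""
    if not line.startswith("SEQRES"):            # columns 1-6: record name
        raise ValueError("not a SEQRES record")
    residues = []
    for n in range(13):                          # n = 0 .. 12
        start = 19 + 4 * n                       # column 20+4n as a 0-based index
        name = line[start:start + 3].strip()     # columns 20+4n to 22+4n
        if name:                                 # a short final record leaves slots blank
            residues.append(name)
    return {
        "serial": int(line[8:10]),               # columns 9-10
        "chain_id": line[11],                    # column 12
        "num_residues": int(line[13:17]),        # columns 14-17
        "residues": residues,
    }


# First SEQRES record of 1HVS, assembled field by field (trailing PDB ID omitted).
names = "PRO GLN ILE THR LEU TRP GLN ARG PRO LEU VAL THR ILE"
record = f"SEQRES {1:>3} A {99:>4}  {names}"

fields = parse_seqres(record)
print(fields["chain_id"], fields["num_residues"], fields["residues"][:3])
```

Slicing by column rather than splitting on whitespace matters here because PDB fields may be blank or may touch their neighbours.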

PDB File Format (3)

For your assignment, the most important type of record is the ATOM record, which contains the coordinates of a particular atom.

ATOM      1  N   PRO A   1      -3.071   7.851  33.748  1.00 32.40      1HVS 136
ATOM      2  CA  PRO A   1      -2.327   7.086  34.745  1.00 29.77      1HVS 137
ATOM      3  C   PRO A   1      -0.920   7.200  34.299  1.00 29.71      1HVS 138

Column layout of an ATOM record:

Columns 1-6     Record name
Columns 7-11    Serial number
Columns 13-16   Atom name
Columns 18-20   Residue name
Column 22       Chain ID
Columns 23-26   Residue sequence number
Columns 31-38   X coordinate in Angstroms
Columns 39-46   Y coordinate in Angstroms
Columns 47-54   Z coordinate in Angstroms
Columns 55-60   Occupancy
Columns 61-66   Temperature factor
End of line     PDB ID and file line number

The list of ATOM records in a chain is terminated by a TER record.

PDB File Format (4)

Other record types:

HET     Names of hetero-atoms. Usually designates atoms and molecules that are separate from the protein.
FORMUL  Chemical formula for the hetero-atoms.
HELIX   Location of helices (designated by depositor).
SHEET   Location of beta sheets (designated by depositor).
SSBOND  Location of disulfide bonds (if they exist).
ORIGXn  Transformation from the orthogonal coordinates to the coordinates as originally submitted.
SCALEn  Transformation from the orthogonal coordinates to fractional crystallographic coordinates.
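Because the ATOM record is the one the assignment depends on, here is a small sketch of reading it with the column ranges listed earlier (serial 7-11, atom name 13-16, residue name 18-20, chain ID 22, residue number 23-26, coordinates 31-54, occupancy 55-60, temperature factor 61-66). The function name `parse_atom` and the dictionary keys are illustrative, and the three input lines are the 1HVS records from the slide with the trailing PDB ID and line number left off.

```python
def parse_atom(line):
    """Extract the fields of one ATOM record by fixed column position."""
    if not line.startswith("ATOM"):
        raise ValueError("not an ATOM record")
    return {
        "serial": int(line[6:11]),            # columns 7-11
        "name": line[12:16].strip(),          # columns 13-16
        "res_name": line[17:20].strip(),      # columns 18-20
        "chain_id": line[21],                 # column 22
        "res_seq": int(line[22:26]),          # columns 23-26
        "x": float(line[30:38]),              # columns 31-38
        "y": float(line[38:46]),              # columns 39-46
        "z": float(line[46:54]),              # columns 47-54
        "occupancy": float(line[54:60]),      # columns 55-60
        "temp_factor": float(line[60:66]),    # columns 61-66
    }


records = [
    "ATOM      1  N   PRO A   1      -3.071   7.851  33.748  1.00 32.40",
    "ATOM      2  CA  PRO A   1      -2.327   7.086  34.745  1.00 29.77",
    "ATOM      3  C   PRO A   1      -0.920   7.200  34.299  1.00 29.71",
]

atoms = [parse_atom(r) for r in records]
for a in atoms:
    print(a["serial"], a["name"], a["res_name"], a["x"], a["y"], a["z"])
```

Splitting on whitespace would fail on lines where the residue number and a negative X coordinate run together, which is why the sketch slices by column.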

A New File Format: mmCIF

mmCIF: Macromolecular Crystallographic Information File

Derived from CIF:
S.R. Hall, F.H. Allen, and I.D. Brown. 1991. The Crystallographic Information File (CIF): a new standard archive file for crystallography. Acta Cryst. A47, 655-685.

Dictionary based:
The macromolecular CIF dictionary is a set of data names designed to describe the macromolecular crystallographic experiment and its results.
Each data item in an mmCIF file matches an entry in the dictionary. This enables validation of the data and gives a more consistent representation of structures.
Easier to parse, with less ambiguity.

Comparing PDB Files and mmCIF (1)

PDB header:

HEADER    HYDROLASE(ACID PROTEASE)                26-JAN-94   1HVI

becomes:

_struct.entry_id                 1HVI
_struct.title
;INFLUENCE OF STEREOCHEMISTRY ON ACTIVITY AND BINDING MODES FOR C2
SYMMETRY-BASED DIOL INHIBITORS OF HIV-1 PROTEASE
;
_struct_keywords.entry_id        1HVI
_struct_keywords.pdbx_keywords   'HYDROLASE(ACID PROTEASE)'
_struct_keywords.text            'HYDROLASE(ACID PROTEASE)'

Comparing PDB and mmCIF (2)

The equivalent of ATOM records:

loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_alt_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.label_entity_id
_atom_site.label_seq_id
_atom_site.pdbx_PDB_ins_code
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
_atom_site.occupancy
_atom_site.B_iso_or_equiv
_atom_site.Cartn_x_esd
_atom_site.Cartn_y_esd
_atom_site.Cartn_z_esd
_atom_site.occupancy_esd
_atom_site.B_iso_or_equiv_esd
_atom_site.auth_seq_id
_atom_site.auth_comp_id
_atom_site.auth_asym_id
_atom_site.auth_atom_id
_atom_site.pdbx_PDB_model_num

Comparing PDB and mmCIF (3)

The equivalent of ATOM records (cont.):

ATOM 1 N N  . PRO A 1 1 ? -3.609 7.549 33.926 1.00 27.71 ? ? ? ? ? 1 PRO A N  1
ATOM 2 C CA . PRO A 1 1 ? -2.655 6.720 34.710 1.00 27.34 ? ? ? ? ? 1 PRO A CA 1
ATOM 4 O O  . PRO A 1 1 ? -1.000 7.590 33.270 1.00 29.50 ? ? ? ? ? 1 PRO A O  1

PDB records for the first three atoms:

ATOM      1  N   PRO A   1      -3.609   7.549  33.926  1.00 27.71
ATOM      2  CA  PRO A   1      -2.655   6.720  34.710  1.00 27.34
ATOM      3  C   PRO A   1      -1.219   6.969  34.289  1.00 27.56
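To illustrate why the loop layout is easier to parse than fixed columns, here is a minimal sketch that reads the atom_site loop above by pairing each whitespace-separated value with its data name. It is deliberately simplified: it assumes one row per line and no quoted or multi-line values, which real mmCIF files do contain, so a dedicated mmCIF library is the safer choice in practice. The function name `parse_simple_loop` is illustrative. Data names are lowercased because CIF data names are case-insensitive.

```python
def parse_simple_loop(text):
    """Parse one loop_ block whose values contain no quotes or spaces."""
    names, rows = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "loop_":
            continue
        if line.startswith("_"):              # a data name such as _atom_site.id
            names.append(line.lower())
            continue
        values = line.split()                 # one value per data name
        if len(values) != len(names):
            raise ValueError(f"expected {len(names)} values, got {len(values)}")
        rows.append(dict(zip(names, values)))
    return rows


# The 25 atom_site data names from the slide, in order.
columns = [
    "group_PDB", "id", "type_symbol", "label_atom_id", "label_alt_id",
    "label_comp_id", "label_asym_id", "label_entity_id", "label_seq_id",
    "pdbx_PDB_ins_code", "Cartn_x", "Cartn_y", "Cartn_z", "occupancy",
    "B_iso_or_equiv", "Cartn_x_esd", "Cartn_y_esd", "Cartn_z_esd",
    "occupancy_esd", "B_iso_or_equiv_esd", "auth_seq_id", "auth_comp_id",
    "auth_asym_id", "auth_atom_id", "pdbx_PDB_model_num",
]
header = "loop_\n" + "\n".join("_atom_site." + c for c in columns)
body = """
ATOM 1 N N . PRO A 1 1 ? -3.609 7.549 33.926 1.00 27.71 ? ? ? ? ? 1 PRO A N 1
ATOM 2 C CA . PRO A 1 1 ? -2.655 6.720 34.710 1.00 27.34 ? ? ? ? ? 1 PRO A CA 1
ATOM 4 O O . PRO A 1 1 ? -1.000 7.590 33.270 1.00 29.50 ? ? ? ? ? 1 PRO A O 1
"""

rows = parse_simple_loop(header + body)
for row in rows:
    xyz = [float(row["_atom_site.cartn_" + axis]) for axis in "xyz"]
    print(row["_atom_site.id"], row["_atom_site.label_atom_id"], xyz)
```

Each value is looked up by name rather than by position in the line, so adding or reordering columns in the file does not break the reader.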

Comparing PDB and mmCIF (4)

For a good tutorial on mmCIF, visit the SDSC (San Diego Supercomputer Center) site: http://www.sdsc.edu/pb/cif/tutorial_mm.html

It includes many examples of how dictionary entries and their associated values appear in a typical mmCIF file. There are also several interesting links from their homepage.

PubChem File Formats: ASN.1 (1)

ASN.1 (Abstract Syntax Notation One):

A standardized data specification language used to represent complex data types.
Data structure is hierarchical.
Designed by Xerox.
A formal notation for describing data transmitted by telecommunications protocols, regardless of the physical representation of such data.
Also used in telephone systems, air traffic control, building and machine control, toll highways, smart cards, security, and much more.
Used by NCBI (National Center for Biotechnology Information) to store GenBank, PubMed, and MMDB data.

PubChem File Formats: ASN.1 (2)

[Figure: ASN.1 record for water (excerpt)]

PubChem File Formats: XML (1)

XML (Extensible Markup Language):

A general-purpose markup language that supports a wide variety of applications.
Like HTML, it is built around a hierarchical layout that uses markup tags.
It is quite versatile and almost human readable.
However, files tend to be rather long. The PubChem entry for water has an XML file that is 406 lines long! The first few lines of this file appear next.

PubChem File Formats: XML (2)

[Figure: water in XML]

PubChem File Formats: XML (3)

[Figure: water in XML (cont.); 357 lines to go]

PubChem File Formats: SDF (1)

SDF is a chemical file format.
File extensions: .mol, .sd, .sdf

An MDL Molfile is a file format designed by MDL for storing data about the atoms, bonds, connectivity, and atom coordinates of a molecule. The molfile consists of header information, then the Connection Table (CT) containing atom information followed by bond connections and types, and finally records for more complex information. The format is owned by MDL (Molecular Design Limited).

PubChem File Formats: SDF (2)

[Figure: SDF for glycine (CID 750)]
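The molfile structure described above (header lines, then a connection table of atoms and bonds) can be sketched as follows. This assumes the common V2000 fixed-column layout, which the slides do not spell out: three header lines, a counts line whose first two 3-character fields hold the atom and bond counts, atom lines with three 10-character coordinates followed by the element symbol, and bond lines with three 3-character fields. The water record is a hand-written illustration with approximate coordinates, not PubChem data, and `parse_molfile` is a hypothetical helper name.

```python
def parse_molfile(text):
    """Read atoms and bonds from a V2000 molfile connection table."""
    lines = text.splitlines()
    counts = lines[3]                          # line 4, after three header lines
    n_atoms = int(counts[0:3])                 # first 3-character field
    n_bonds = int(counts[3:6])                 # second 3-character field
    atoms = []
    for line in lines[4:4 + n_atoms]:
        x = float(line[0:10])
        y = float(line[10:20])
        z = float(line[20:30])
        symbol = line[31:34].strip()           # element symbol after one space
        atoms.append((symbol, x, y, z))
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        first, second, order = int(line[0:3]), int(line[3:6]), int(line[6:9])
        bonds.append((first, second, order))   # atom indices are 1-based
    return atoms, bonds


# Hand-written water example (illustrative coordinates, not taken from PubChem).
WATER = "\n".join([
    "water (hand-written example)",
    "  sketch",
    "",
    "  3  2  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "    0.9572    0.0000    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0",
    "   -0.2400    0.9266    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0  0  0  0",
    "  1  3  1  0  0  0  0",
    "M  END",
])

atoms, bonds = parse_molfile(WATER)
print([a[0] for a in atoms], bonds)
```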

PubChem File Formats: SDF (3)

SDF for glycine (CID 750), other useful information:

> <PUBCHEM_CACTVS_HBOND_ACCEPTOR>
3

> <PUBCHEM_CACTVS_HBOND_DONOR>
2

> <PUBCHEM_CACTVS_ROTATABLE_BOND>
1

> <PUBCHEM_IUPAC_NAME>
2-aminoacetic acid

> <PUBCHEM_NIST_INCHI>
InChI=1/C2H5NO2/c3-1-2(4)5/h1,3H2,(H,4,5)/f/h4H

> <PUBCHEM_CACTVS_XLOGP>
-3.4

> <PUBCHEM_OPENEYE_MF>
C2H5NO2

> <PUBCHEM_OPENEYE_MW>
75.0666

> <PUBCHEM_OPENEYE_CAN_SMILES>
C(C(=O)O)N
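The tagged data items that follow the molfile in an SDF record can be collected into a dictionary with a few lines of code. This is a sketch under the usual SDF convention that each item starts with a header line of the form `> <TAG>`, is followed by one or more value lines, and ends with a blank line, with `$$$$` closing the record. The function name `parse_sdf_data_items` is illustrative, and the sample text reuses four of the glycine values listed above.

```python
def parse_sdf_data_items(text):
    """Collect '> <TAG>' data items from the tail of one SDF record."""
    items = {}
    tag = None
    for line in text.splitlines():
        if line.startswith("$$$$"):            # end of the SDF record
            break
        if line.startswith(">"):
            start = line.find("<")
            end = line.find(">", start + 1)
            if start == -1 or end == -1:
                tag = None                     # header without a <TAG> field
                continue
            tag = line[start + 1:end]
            items[tag] = []
        elif not line.strip():                 # a blank line closes the item
            tag = None
        elif tag is not None:
            items[tag].append(line)
    return {name: "\n".join(values) for name, values in items.items()}


SAMPLE = """> <PUBCHEM_CACTVS_HBOND_ACCEPTOR>
3

> <PUBCHEM_IUPAC_NAME>
2-aminoacetic acid

> <PUBCHEM_OPENEYE_MW>
75.0666

> <PUBCHEM_OPENEYE_CAN_SMILES>
C(C(=O)O)N

$$$$
"""

data = parse_sdf_data_items(SAMPLE)
print(data["PUBCHEM_IUPAC_NAME"], float(data["PUBCHEM_OPENEYE_MW"]))
```

Values are kept as strings, so the caller decides which ones to convert to numbers.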