CS612 - Algorithms in Bioinformatics

Fall 2017 Databases and Protein Structure Representation October 2, 2017

Molecular Biology as Information Science
More than 12,000 genomes sequenced, mostly bacterial (2013)
More than 5×10^6 unique sequences available
What do we do with them?
  Compare them to find what is common and different among organisms (comparative genomics)
  Find out which genes encode which proteins, and how
  Identify changes that lead to disease
  Associate structural and functional information with new gene sequences
Source: http://www.33rdsquare.com
NIH structural genomics project: Protein Structure Initiative (PSI)
GOLD (Genomes Online Database): http://www.genomesonline.org
Fewer than 1% of sequences have a solved structure
Experiments are lagging behind
Way too much data for computer scientists to sit around doing nothing

What We Expect From a Biological Database
Sequence, functional, and structural information, with related bibliography
Well structured and indexed
Well cross-referenced (with other databases)
Periodically updated and maintained
Provides tools for analysis and visualization, or is at least formatted in a way compatible with known tools
Source: http://www.docfoc.com/biological-databases-pharmamatrix-workshop-2010-philip-winter-ishwar-v-hosamani

Sequence Databases
International Nucleotide Sequence Database Collaboration (INSDC): http://www.insdc.org/
  NCBI (National Center for Biotechnology Information): http://ncbi.nih.gov
  EMBL-EBI (European Molecular Biology Laboratory, European Bioinformatics Institute): https://www.ebi.ac.uk/
  DDBJ (DNA Data Bank of Japan): http://www.ddbj.nig.ac.jp/

Contents of a Database
Sequences/structures (depends on the database)
Accession number
References
Taxonomic data
Annotation/curation
Keywords
Cross-references
Documentation

NCBI Nucleotide Sequence Databases
NCBI GenBank (the nucleotide sequence database): http://www.ncbi.nlm.nih.gov/genbank/
Provides tools for submission (BankIt, Sequin), retrieval (Entrez), and analysis (BLAST, Genome Workbench)
Provides easy access to other NCBI resources
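Retrieval through Entrez can also be scripted. The sketch below only builds a request URL for the NCBI E-utilities `efetch` endpoint; the helper name `build_efetch_url`, the default database `nuccore`, and the example accession are illustrative assumptions and do not come from the slides.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities "efetch" endpoint.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, db="nuccore", rettype="fasta"):
    """Return a URL asking Entrez for one record as plain text.

    `db` and `rettype` defaults are assumptions chosen for nucleotide
    FASTA retrieval; other databases and formats take other values.
    """
    params = {
        "db": db,
        "id": accession,
        "rettype": rettype,
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)
```

The resulting URL can be passed to `urllib.request.urlopen` to download the record, for example with an accession such as `NM_000518`. Keeping URL construction separate from the network call makes the helper easy to check offline.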

Protein Sequence Databases
UniProt: http://www.uniprot.org/
  A universal resource, resulting from a merger of several databases.
  Tools: BLAST, Align, Retrieve/ID mapping
Pfam: http://pfam.xfam.org/
  A database of protein families based on conserved regions.

Protein Structure Databases
PDB (Protein Data Bank): http://www.rcsb.org/pdb/
SCOP2 (Structural Classification of Proteins v.2): http://scop2.mrc-lmb.cam.ac.uk/
CATH (another structural classification database): http://www.cathdb.info/
EMDB (Electron Microscopy Data Bank): https://www.ebi.ac.uk/pdbe/emdb/ (actually part of the PDB now)

The Protein Data Bank (PDB)
Most, if not all, of the protein structures determined to date can be found in a large repository, the RCSB Protein Data Bank (PDB): http://www.rcsb.org.
The PDB is a public domain repository that contains experimentally determined three-dimensional structures of proteins.
The majority of the structures in the PDB have been determined by X-ray crystallography.
The number of structures determined by NMR has been increasing as efficient computational techniques for deriving structures from NMR data have been developed.

Retrieving Protein Structures from the PDB
Starting with 7 structures in 1971, the number of entries has been growing exponentially.
There are over 100,000 structures as of today (early 2016).
All PDB entries have a 4-character alphanumeric ID: 1CRZ, 2BHL...
Sometimes the chain identifier is appended: 1CRZA, 1CRZB...
You can download the coordinates and display the structure.
The BLAST server and other databases contain links to PDB entries if the sequence has a known structure.
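A minimal sketch of handling identifiers such as 1CRZ and 1CRZA: split off the optional chain identifier and build a download link. The function names are hypothetical, the pattern assumes the classic 4-character ID (a digit followed by three alphanumeric characters), and the URL pattern is the RCSB file-download address as commonly used, which should be checked against the current RCSB documentation.

```python
import re

# Assumed shape of a classic PDB ID: one digit, then three alphanumerics,
# optionally followed by a single-character chain identifier.
_PDB_ID = re.compile(r"^([0-9][A-Za-z0-9]{3})([A-Za-z0-9]?)$")


def split_pdb_id(identifier):
    """Split '1CRZA' into ('1CRZ', 'A'); chain is None when absent."""
    match = _PDB_ID.match(identifier.strip())
    if match is None:
        raise ValueError(f"not a PDB identifier: {identifier!r}")
    entry = match.group(1).upper()   # entry IDs are case-insensitive
    chain = match.group(2) or None   # chain IDs keep their case
    return entry, chain


def download_url(entry):
    """Return the (assumed) RCSB download URL for an entry in PDB format."""
    return f"https://files.rcsb.org/download/{entry.upper()}.pdb"
```

For example, `split_pdb_id("1CRZA")` gives the entry `1CRZ` and chain `A`, and `download_url("1CRZ")` gives the link to fetch its coordinates.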

The PDB

The PDB File Format
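A minimal sketch of reading atom coordinates from a PDB file, assuming the fixed-column layout of the legacy PDB format for ATOM and HETATM records. The function name and the returned field names are illustrative; only a subset of the columns is extracted.

```python
def parse_atom_line(line):
    """Parse one ATOM/HETATM record; return None for any other record.

    Column positions follow the legacy fixed-width PDB format
    (1-based columns converted to 0-based Python slices).
    """
    record = line[0:6].strip()
    if record not in ("ATOM", "HETATM"):
        return None
    return {
        "serial": int(line[6:11]),         # columns 7-11: atom serial number
        "atom_name": line[12:16].strip(),  # columns 13-16: atom name
        "res_name": line[17:20].strip(),   # columns 18-20: residue name
        "chain_id": line[21],              # column 22: chain identifier
        "res_seq": int(line[22:26]),       # columns 23-26: residue number
        "x": float(line[30:38]),           # columns 31-38: x in angstroms
        "y": float(line[38:46]),           # columns 39-46: y in angstroms
        "z": float(line[46:54]),           # columns 47-54: z in angstroms
    }


def read_atoms(lines):
    """Collect the parsed atoms from an iterable of PDB lines."""
    atoms = []
    for line in lines:
        atom = parse_atom_line(line)
        if atom is not None:
            atoms.append(atom)
    return atoms
```

Because the format is column-based rather than whitespace-delimited, slicing by position is safer than `str.split`, which breaks when adjacent fields touch.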

Classification of Protein Structures - The SCOP Database
Chothia, Murzin (Cambridge)
Hand-curated hierarchical taxonomy of proteins based on their structural and evolutionary relationships.
Levels of the hierarchy:
  Class
  Fold
  Superfamily
  Family
  Domain
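The hierarchy can be illustrated with SCOP's dotted classification string (sccs), e.g. a.1.1.2, which encodes class, fold, superfamily, and family. The sketch below is an illustration under that assumption; the function names are hypothetical and the individual domain level is not part of the string.

```python
# Levels encoded by the four fields of an sccs string, outermost first.
SCOP_LEVELS = ("class", "fold", "superfamily", "family")


def parse_sccs(sccs):
    """Map each SCOP level to its cumulative prefix, e.g. fold -> 'a.1'."""
    parts = sccs.split(".")
    if len(parts) != len(SCOP_LEVELS) or not all(parts):
        raise ValueError(f"expected class.fold.superfamily.family: {sccs!r}")
    return {
        level: ".".join(parts[: i + 1])
        for i, level in enumerate(SCOP_LEVELS)
    }


def deepest_shared_level(sccs_a, sccs_b):
    """Return the most specific level two domains share, or None."""
    a, b = parse_sccs(sccs_a), parse_sccs(sccs_b)
    shared = None
    for level in SCOP_LEVELS:
        if a[level] != b[level]:
            break
        shared = level
    return shared
```

Two domains that share a superfamily but not a family, such as a.1.1.2 and a.1.1.1, are treated as evolutionarily related but more distantly than members of one family.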

The SCOPe Database
The successor of SCOP (which is no longer maintained/updated).
Rather similar to SCOP; uses a combination of hand-curated and automated methods.

The SCOP2 Database Prototype
Similar to SCOP(e), but with differences.
Adds evolutionary events and protein types, among others.
Several new hierarchical categories.
The evolutionary relationships induce a graph-like structure rather than a rigid hierarchy.

The CATH Database
Another database which classifies protein structures downloaded from the Protein Data Bank.
It is a semi-automatic, hierarchical classification of protein domains, initially published in 1997.
CATH is an acronym of the four main levels in the classification:
  1  Class: overall secondary-structure content of the domain (equivalent to SCOP class).
  2  Architecture: a large-scale grouping of topologies which share particular structural features.
  3  Topology: high structural similarity but no evidence of homology (equivalent to SCOP fold).
  4  Homologous superfamily: indicative of a demonstrable evolutionary relationship (equivalent to SCOP superfamily).
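CATH identifiers are written as four dot-separated numbers, one per level (for example 3.40.50.720). A minimal sketch of splitting such a code into its C, A, T, and H components; the class and function names are hypothetical.

```python
from typing import NamedTuple


class CathCode(NamedTuple):
    """One number per CATH level, outermost first."""
    class_: int        # C: Class
    architecture: int  # A: Architecture
    topology: int      # T: Topology
    superfamily: int   # H: Homologous superfamily

    def prefix(self, depth):
        """Return the code truncated to `depth` levels, e.g. '3.40'."""
        return ".".join(str(n) for n in self[:depth])


def parse_cath(code):
    """Parse a dotted CATH code such as '3.40.50.720'."""
    parts = code.split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected four dot-separated numbers: {code!r}")
    return CathCode(*(int(p) for p in parts))
```

Comparing `prefix(3)` of two codes tells whether the domains share a topology (fold), while equal full codes mean they sit in the same homologous superfamily.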

The CATH Database
Much of the work is done by automatic methods; however, there are important manual elements to the classification.
First, the proteins are separated into domains. It is difficult to produce an unequivocal definition of a domain, and this is one area in which CATH and SCOP differ.
The domains are automatically sorted into classes and clustered on the basis of sequence similarity. These groups form the H level of the classification.
The Topology level is formed by structural comparisons of the homologous groups.
Finally, the Architecture level is assigned manually.
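The clustering step can be sketched as single-linkage grouping by pairwise sequence identity. This is a toy illustration, not the CATH pipeline: it assumes the sequences are already aligned to equal length, and the identity threshold of 0.35 is an assumed example value.

```python
def identity(a, b):
    """Fraction of aligned positions with the same residue (gaps never match)."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return matches / len(a)


def cluster(seqs, threshold=0.35):
    """Single-linkage clustering; returns sorted lists of sequence indices."""
    parent = list(range(len(seqs)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge every pair whose identity reaches the threshold.
    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if identity(seqs[i], seqs[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(seqs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Single linkage means two domains end up in the same group if a chain of sufficiently similar sequences connects them, even when the two themselves fall below the threshold.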

The CATH Database
Class-level classification is done on the basis of 4 criteria:
  1  Secondary structure content
  2  Secondary structure contacts
  3  Secondary structure alternation score
  4  Percentage of parallel strands
CATH defines four classes: mostly-α, mostly-β, α and β, few secondary structures.
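A toy sketch of class assignment using only the first criterion, secondary structure content. The thresholds `min_ss` and `min_each` are hypothetical values chosen for illustration and are not CATH's actual cutoffs; the other three criteria are ignored. The class numbers 1 to 4 follow the order in which the classes are listed above.

```python
# Class numbers mapped to the four class names listed on the slide.
CATH_CLASSES = {
    1: "mostly-alpha",
    2: "mostly-beta",
    3: "alpha and beta",
    4: "few secondary structures",
}


def toy_class(helix_frac, strand_frac, min_ss=0.2, min_each=0.1):
    """Assign a class number from helix and strand fractions (0 to 1).

    Hypothetical thresholds: `min_ss` is the minimum total secondary
    structure content, `min_each` the minimum fraction for a type to count.
    """
    if helix_frac + strand_frac < min_ss:
        return 4  # too little regular secondary structure
    if strand_frac < min_each:
        return 1  # helices dominate
    if helix_frac < min_each:
        return 2  # strands dominate
    return 3      # substantial amounts of both
```

A real classifier would also need the contact, alternation, and parallel-strand criteria, for instance to separate alternating α/β from segregated α+β arrangements.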