EBI web resources II: Ensembl and InterPro. Yanbin Yin Spring 2013

Outline: intro to genome annotation; protein family/domain databases (InterPro, Pfam, Superfamily, etc.); genome browser (Ensembl); hands-on practice.

Genome annotation
1. Predict genes (where are the genes?): protein coding and RNA coding.
2. Function annotation (what are the genes?): search against UniProt or NCBI-nr (GenPept); search against protein family/domain databases; search against pathway databases; use the function vocabularies defined in Gene Ontology.

[Diagram labels: Superfamily, Gene3D, SCOP, CATH, PDB]

InterPro components
1. CATH/Gene3D: University College London, UK
2. PANTHER: University of Southern California, CA, USA
3. PIRSF: Protein Information Resource, Georgetown University, USA
4. Pfam: Wellcome Trust Sanger Institute, Hinxton, UK
5. PRINTS: University of Manchester, UK
6. ProDom: PRABI Villeurbanne, France
7. PROSITE: Swiss Institute of Bioinformatics (SIB), Geneva, Switzerland
8. SMART: EMBL, Heidelberg, Germany
9. SUPERFAMILY: University of Bristol, UK
10. TIGRFAMs: J. Craig Venter Institute, Rockville, MD, USA
11. HAMAP: Swiss Institute of Bioinformatics (SIB), Geneva, Switzerland

CDD components: Pfam, SMART, TIGRFAM, COG, KOG, PRK, CD, LOAD

Protein Classification

Nearly all proteins have structural similarities with other proteins and, in some of these cases, share a common evolutionary origin. Proteins are classified to reflect both structural and evolutionary relatedness. Many levels exist in the hierarchy, but the principal levels are family, superfamily and fold, described below.

Family: Clear evolutionary relationship. Proteins clustered together into families are clearly evolutionarily related. Generally, this means that pairwise residue identities between the proteins are 30% and greater.

Superfamily: Probable common evolutionary origin. Proteins that have low sequence identities, but whose structural and functional features suggest that a common evolutionary origin is probable, are placed together in superfamilies.

Fold: Major structural similarity. Proteins are defined as having a common fold if they have the same major secondary structures in the same arrangement and with the same topological connections. Different proteins with the same fold often have peripheral elements of secondary structure and turn regions that differ in size and conformation. Proteins placed together in the same fold category may not have a common evolutionary origin: the structural similarities could arise just from the physics and chemistry of proteins favoring certain packing arrangements and chain topologies.

http://scop.mrc-lmb.cam.ac.uk/scop/intro.html
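
The 30% pairwise identity rule of thumb for families can be illustrated with a minimal Python sketch. It assumes the two sequences are already aligned, with gaps written as "-", and it counts identity only over columns where both sequences have a residue. The function names, the gap convention and the choice of denominator are illustrative assumptions and are not taken from SCOP.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip gap columns (assumed convention)
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


def likely_same_family(aligned_a, aligned_b, threshold=30.0):
    """Apply the 'identity of 30% and greater' rule of thumb quoted above."""
    return percent_identity(aligned_a, aligned_b) >= threshold
```

A pair that falls below the threshold is not thereby unrelated: as the superfamily definition notes, structural and functional features can still suggest a common origin at low sequence identity.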

[Diagram labels: PDB, Structure, Superfamily, Gene3D, Pfam, SMART, ProSite, Function (literature), SCOP, CATH, Protein, Sequence, UniProt, GenPept, Evolution]

fold ~ class; superfamily ~ clan; family; subfamily; domain; sequence

Hands-on exercise 1: search against protein family databases

Google "InterPro"

Google "NCBI CDD search"

Google "Pfam". You will see two Pfam sites: Sanger Pfam and Janelia Pfam.

Text/keyword search

http://supfam.cs.bris.ac.uk/superfamily/

http://www.ebi.ac.uk/tools/

The Ensembl project aims to automatically annotate genome sequences, integrate these data with other biological information, and make the results freely available to geneticists, molecular biologists, bioinformaticians and the wider research community. Ensembl is jointly headed by Dr Stephen Searle at the Wellcome Trust Sanger Institute and Dr Paul Flicek at the European Bioinformatics Institute (EBI).

http://www.ensembl.org/

Why do we need genome browsers?

To make the bare DNA sequence, its properties, and the associated annotations more accessible through a graphical interface. Genome browsers provide access to large amounts of sequence data via a graphical user interface. They present a visual, high-level overview of complex data in a form that can be grasped at a glance and provide the means to explore the data in increasing resolution, from megabase scales down to the level of individual elements of the DNA sequence.

http://useast.ensembl.org/info/website/tutorials/index.html

Nature 491, 56-65 (01 November 2012)

Nature 458, 719-724 (9 April 2009); Nature Vol 464, 15 April 2010

While a user may start browsing for a particular gene, the user interface will display the area of the genome containing the gene, along with a broader context of other information available in the region of the chromosome occupied by the gene. This information is shown in tracks, with each track showing either the genomic sequence from a particular species or a particular kind of annotation on the gene. The tracks are aligned so that the information about a particular base in the sequence is lined up and can be viewed easily. In modern browsers, the abundance of contextual information linked to a genomic region not only helps to satisfy the most directed search, but also makes available a depth of content that facilitates integration of knowledge about genes, gene expression, regulatory sequences, sequence conservation between species, and many other classes of data.
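
The idea of tracks aligned on a shared coordinate system can be sketched in a few lines of Python. This is a toy model and not how Ensembl or any other browser stores its data: the `Feature` class, the track names, the coordinates and the 1-based inclusive convention are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    start: int   # 1-based, inclusive (assumed convention)
    end: int     # 1-based, inclusive
    label: str


def features_in_region(tracks, start, end):
    """For each track, return the features overlapping the window [start, end]."""
    visible = {}
    for name, features in tracks.items():
        visible[name] = [f for f in features if f.start <= end and f.end >= start]
    return visible


# Hypothetical tracks that share one chromosome coordinate system.
example_tracks = {
    "genes": [Feature(100, 500, "geneA"), Feature(900, 1200, "geneB")],
    "conservation": [Feature(450, 950, "block1")],
}

for track_name, hits in features_in_region(example_tracks, 400, 600).items():
    print(track_name, [f.label for f in hits])
```

Because every track is indexed by the same coordinates, narrowing or widening the window is all that is needed to move between an overview and a base-level view.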

Ensembl Genome Browsers: http://www.ensemblgenomes.org
NCBI Map Viewer: http://www.ncbi.nlm.nih.gov/mapview/
UCSC Genome Browser: http://genome.ucsc.edu

Each uses a centralized model, where the web site provides access to a large public database of genome data for many species and also integrates specialized tools, such as BLAST at NCBI and Ensembl and BLAT at UCSC. The public browsers provide a valuable service to the research community by providing tools for free access to whole genome data and by supporting the complex and robust informatics infrastructure required to make the data accessible.

Hands-on exercise 2: Ensembl gene search

http://www.ensembl.org/ (search for "colon cancer")
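
The exercise itself uses the search box on the Ensembl web site. For readers who prefer a script, a gene found that way can also be looked up through the Ensembl REST service. The sketch below is an assumption-laden illustration and not part of the lecture: it uses the `lookup/symbol` endpoint at rest.ensembl.org, and the gene symbol MLH1 (a gene linked to hereditary colorectal cancer) is only an example input. The function names are hypothetical, and the network call is kept in a separate function so that the URL construction can be checked offline.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

REST_SERVER = "https://rest.ensembl.org"  # public Ensembl REST server


def build_lookup_url(species, symbol):
    """Build a lookup-by-symbol URL that asks for a JSON response."""
    return (f"{REST_SERVER}/lookup/symbol/{quote(species)}/{quote(symbol)}"
            "?content-type=application/json")


def lookup_gene(species, symbol, timeout=30):
    """Fetch the gene record as a dict. Requires network access."""
    with urlopen(build_lookup_url(species, symbol), timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Illustrative input only; any valid species name and gene symbol can be used.
    print(build_lookup_url("homo_sapiens", "MLH1"))
```

Calling `lookup_gene("homo_sapiens", "MLH1")` on a machine with network access should return the gene's record, which can then be compared with what the browser shows for the same gene.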

Next lecture: ExPASy and DTU tools