Procedure to Create NCBI KOGS

Full details in: Tatusov et al. (2003) BMC Bioinformatics 4:41.

1. Detect and mask typical repetitive domains
   Reason: masking prevents spurious lumping of non-orthologs based only on common shared domains
   Examples:
   o PPR repeat
     - 35 amino acids long
     - up to 18 copies per protein
     - copy number is generally expanded in plants
   o C2H2 classic zinc finger domain
     - 25 amino acids long
     - binds to the major groove of DNA
     - protein can function as a transcription regulator

2. All-against-all BLASTP analysis
   - Comparison of amino acid sequences
   - All sequences used as the database
   - Each sequence used as a query against this database
   - P values can be rather high

Ubiquitin cluster
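The masking in step 1 can be illustrated with a minimal sketch. It assumes the repeat coordinates are already known (detecting the domains is a separate step not shown here), uses 1-based inclusive positions, and replaces masked residues with "X". The function name `mask_regions` and the toy sequence are hypothetical and are not taken from the KOG pipeline.

```python
def mask_regions(sequence, regions, mask_char="X"):
    """Return `sequence` with each 1-based, inclusive (start, end) region masked."""
    residues = list(sequence)
    for start, end in regions:
        if start < 1 or end > len(residues) or start > end:
            raise ValueError(f"invalid region: {(start, end)}")
        for i in range(start - 1, end):
            residues[i] = mask_char
    return "".join(residues)


# Made-up protein: a 6-residue "repeat" occupies positions 5-10.
toy_protein = "MKTL" + "A" * 6 + "GHWQ"
masked_protein = mask_regions(toy_protein, [(5, 10)])
print(masked_protein)
```

Masked residues no longer contribute to BLASTP scores, so two proteins that share only the repeat are less likely to be reported as each other's best hit.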

3. Mutually consistent triangles of genome-specific best hits
4. Merge triangles with common sides
5. Manual analysis of each cluster to eliminate false positives
6. Assignment of masked proteins (step 1) to clusters
7. Perform phylogenetic clustering on KOGs with proteins from multiple species

Species                                        Designation
Arabidopsis thaliana (thale cress)             A
Caenorhabditis elegans (worm)                  C
Drosophila melanogaster (fruit fly)            D
Homo sapiens (human)                           H
Saccharomyces cerevisiae (baker's yeast)       Y
Schizosaccharomyces pombe (fission yeast)      P
Encephalitozoon cuniculi (Microsporidia)       E

Species Groupings with Largest Members
Genomes      Members
CDH          1147
ACDHYP       928
ACDHYPE      860
ACDH         484
CDHYP        152
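Steps 3 and 4 can be sketched in a few lines of Python. This is a simplified illustration, not the procedure as published: it assumes that "mutually consistent" means every pair in a triangle is a reciprocal genome-specific best hit, it ranks hits by a single score, and it tests all protein triples by brute force. All names (`genome_best_hits`, `find_triangles`, `merge_triangles`) and the toy scores are hypothetical.

```python
from itertools import combinations


def genome_best_hits(hits, genome_of):
    """Map (query, other_genome) -> highest-scoring subject in that genome."""
    best = {}
    for query, subject, score in hits:
        genome = genome_of[subject]
        if genome == genome_of[query]:
            continue  # skip within-genome hits (paralogs)
        key = (query, genome)
        if key not in best or score > best[key][1]:
            best[key] = (subject, score)
    return {key: subject for key, (subject, _) in best.items()}


def find_triangles(best, genome_of):
    """Triples from three genomes in which every pair is a reciprocal best hit."""
    def mutual(a, b):
        return (best.get((a, genome_of[b])) == b
                and best.get((b, genome_of[a])) == a)

    triangles = []
    for trio in combinations(sorted(genome_of), 3):
        if len({genome_of[p] for p in trio}) < 3:
            continue
        if all(mutual(a, b) for a, b in combinations(trio, 2)):
            triangles.append(frozenset(trio))
    return triangles


def merge_triangles(triangles):
    """Union triangles that share a side (two proteins) into clusters."""
    parent = list(range(len(triangles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(triangles)), 2):
        if len(triangles[i] & triangles[j]) >= 2:
            parent[find(i)] = find(j)

    clusters = {}
    for i, triangle in enumerate(triangles):
        clusters.setdefault(find(i), set()).update(triangle)
    return sorted(sorted(members) for members in clusters.values())


# Toy data: one protein each in genomes C, D, H; two paralogs (a1, a2) in A.
GENOME_OF = {"a1": "A", "a2": "A", "c1": "C", "d1": "D", "h1": "H"}
PAIR_SCORES = {
    ("a1", "c1"): 100, ("a1", "d1"): 90, ("a1", "h1"): 95,
    ("c1", "d1"): 120, ("c1", "h1"): 110, ("d1", "h1"): 130,
    ("a2", "c1"): 50,  # weaker paralog hit; c1 still prefers a1
}
HITS = [(q, s, score) for (q, s), score in PAIR_SCORES.items()]
HITS += [(s, q, score) for (q, s), score in PAIR_SCORES.items()]

best_hits = genome_best_hits(HITS, GENOME_OF)
triangles = find_triangles(best_hits, GENOME_OF)
clusters = merge_triangles(triangles)
print(clusters)
```

In the toy data the four triangles drawn from a1, c1, d1 and h1 all share sides, so they merge into one cluster. The paralog a2 stays out because c1's best hit in genome A is a1.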

Examples of KOGs

KOGs containing all species: 860 clusters

KOG 0001: Ubiquitin and ubiquitin-like proteins

Species                                        Number of members
Arabidopsis thaliana (thale cress)             29
Caenorhabditis elegans (worm)                  12
Drosophila melanogaster (fruit fly)            3
Homo sapiens (human)                           17
Saccharomyces cerevisiae (baker's yeast)       2
Schizosaccharomyces pombe (fission yeast)      1
Encephalitozoon cuniculi (Microsporidia)       1
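The one-letter species designations (A C D H Y P E) make it easy to summarize which genomes a KOG covers. The sketch below derives such a pattern string from per-species member counts, using the ubiquitin KOG counts listed above. The function name `phyletic_pattern` and the second, animal-only example are hypothetical.

```python
SPECIES_CODES = "ACDHYPE"  # order of the species designations


def phyletic_pattern(member_counts):
    """Return the codes of all species with at least one member, in fixed order."""
    return "".join(code for code in SPECIES_CODES
                   if member_counts.get(code, 0) > 0)


# Member counts for the ubiquitin KOG, keyed by species designation.
ubiquitin_kog = {"A": 29, "C": 12, "D": 3, "H": 17, "Y": 2, "P": 1, "E": 1}
ubiquitin_pattern = phyletic_pattern(ubiquitin_kog)
ubiquitin_total = sum(ubiquitin_kog.values())
print(ubiquitin_pattern, ubiquitin_total)

# Hypothetical KOG found only in the three animal genomes.
animal_only_pattern = phyletic_pattern({"C": 1, "D": 2, "H": 4})
print(animal_only_pattern)
```

Counting how many KOGs share each pattern string would give a table of species groupings of the kind shown earlier (CDH, ACDHYP, ACDHYPE, and so on).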

Arabidopsis At5g20620 vs. Baker's Yeast YLL039c

>YLL039c
 Length = 381

 Score = 723 bits (1866), Expect = 0.0
 Identities = 370/381 (97%), Positives = 381/381 (99%)

Query: 1   MQIFVKTLTGKTITLEVESSDTIDNVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLADYN 60
           MQIFVKTLTGKTITLEVESSDTIDNVK+KIQDKEGIPPDQQRLIFAGKQLEDGRTL+DYN
Sbjct: 1   MQIFVKTLTGKTITLEVESSDTIDNVKSKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYN 60

Query: 61  IQKESTLHLVLRLRGGMQIFVKTLTGKTITLEVESSDTIDNVKAKIQDKEGIPPDQQRLI 120
           IQKESTLHLVLRLRGGMQIFVKTLTGKTITLEVESSDTIDNVK+KIQDKEGIPPDQQRLI
Sbjct: 61  IQKESTLHLVLRLRGGMQIFVKTLTGKTITLEVESSDTIDNVKSKIQDKEGIPPDQQRLI 120

Query: 121 FAGKQLEDGRTLADYNIQKESTLHLVLRLRGGMQIFVKTLTGKTITLEVESSDTIDNVKA 180
           FAGKQLEDGRTL+DYNIQKESTLHLVLRLRGGMQIFVKTLTGKTITLEVESSDTIDNVK+
Sbjct: 121 FAGKQLEDGRTLSDYNIQKESTLHLVLRLRGGMQIFVKTLTGKTITLEVESSDTIDNVKS 180

Query: 181 KIQDKEGIPPDQQRLIFAGKQLEDGRTLADYNIQKESTLHLVLRLRGGMQIFVKTLTGKT 240
           KIQDKEGIPPDQQRLIFAGKQLEDGRTL+DYNIQKESTLHLVLRLRGGMQIFVKTLTGKT
Sbjct: 181 KIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGGMQIFVKTLTGKT 240

Query: 241 ITLEVESSDTIDNVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLADYNIQKESTLHLVLR 300
           ITLEVESSDTIDNVK+KIQDKEGIPPDQQRLIFAGKQLEDGRTL+DYNIQKESTLHLVLR
Sbjct: 241 ITLEVESSDTIDNVKSKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLR 300

Query: 301 LRGGMQIFVKTLTGKTITLEVESSDTIDNVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTL 360
           LRGGMQIFVKTLTGKTITLEVESSDTIDNVK+KIQDKEGIPPDQQRLIFAGKQLEDGRTL
Sbjct: 301 LRGGMQIFVKTLTGKTITLEVESSDTIDNVKSKIQDKEGIPPDQQRLIFAGKQLEDGRTL 360

Query: 361 ADYNIQKESTLHLVLRLRGGS 381
           +DYNIQKESTLHLVLRLRGG+
Sbjct: 361 SDYNIQKESTLHLVLRLRGGN 381

Arabidopsis At5g20620 vs. Arabidopsis At1g14650

>At1g14650
 Length = 785

 Score = 38.5 bits (88), Expect = 0.024
 Identities = 24/68 (35%), Positives = 39/68 (57%), Gaps = 1/68 (1%)

Query: 314 GKTITLEVES-SDTIDNVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLADYNIQKESTLH 372
           G+ + + V+S S+ + ++K KI  +  IP ++Q+L      L+D  +LA YN+     L
Sbjct: 715 GQFMEITVQSLSENVGSLKEKIAGEIQIPANKQKLSGKAGFLKDNMSLAHYNVGAGEILT 774

Query: 373 LVLRLRGG 380
           L LR RGG
Sbjct: 775 LSLRERGG 782
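The Identities, Positives and Gaps figures in a BLAST report can be recounted from the alignment rows themselves. The sketch below does this for the At5g20620 vs. At1g14650 alignment: identities and gaps come from comparing the query and subject columns, and positives are the non-blank characters of the middle line (letters for identical residues, "+" for conservative substitutions). The function name `alignment_stats` is hypothetical, and the integer percentages use floor division, which is an assumption about presentation rather than a statement of how BLAST rounds.

```python
def alignment_stats(query, midline, subject):
    """Count identities, positives and gaps for one pairwise alignment."""
    if len(query) != len(subject):
        raise ValueError("aligned query and subject must have equal length")
    length = len(query)
    identities = sum(1 for q, s in zip(query, subject) if q == s and q != "-")
    gaps = sum(1 for q, s in zip(query, subject) if q == "-" or s == "-")
    # Every non-blank midline character marks an identical or similar pair,
    # so this count does not depend on the midline's exact spacing.
    positives = sum(1 for ch in midline if ch != " ")
    return {"length": length, "identities": identities,
            "positives": positives, "gaps": gaps}


# Alignment rows for At5g20620 (query) vs. At1g14650 (subject), both blocks joined.
QUERY = ("GKTITLEVES-SDTIDNVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLADYNIQKESTLH"
         "LVLRLRGG")
MIDLINE = ("G+ + + V+S S+ + ++K KI + IP ++Q+L L+D +LA YN+ L"
           "L LR RGG")
SUBJECT = ("GQFMEITVQSLSENVGSLKEKIAGEIQIPANKQKLSGKAGFLKDNMSLAHYNVGAGEILT"
           "LSLRERGG")

stats = alignment_stats(QUERY, MIDLINE, SUBJECT)
percent = {name: 100 * stats[name] // stats["length"]
           for name in ("identities", "positives", "gaps")}
print(stats)
print(percent)
```

The recount agrees with the reported header for this alignment: 24 identities, 39 positives and 1 gap over 68 aligned columns.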

KOG 0001: 40S Ribosomal Protein S17

Species                                        Number of members
Arabidopsis thaliana (thale cress)             4
Caenorhabditis elegans (worm)                  1
Drosophila melanogaster (fruit fly)            1
Homo sapiens (human)                           8
Saccharomyces cerevisiae (baker's yeast)       2
Schizosaccharomyces pombe (fission yeast)      2
Encephalitozoon cuniculi (Microsporidia)       1