Procedure to Create NCBI KOGs
Full details in: Tatusov et al. (2003) BMC Bioinformatics 4:41.

1. Detect and mask typical repetitive domains
   Reason: masking prevents spurious lumping of non-orthologs based only on commonly shared domains
   o Examples
     PPR repeat
       35 amino acids long
       up to 18 copies per protein
       copy number is generally expanded in plants
     C2H2 classic zinc finger domain
       25 amino acids long
       binds to the major groove of DNA
       protein can function as a transcription regulator
2. All-against-all BLASTP analysis
   Comparison of amino acid sequences
   All sequences used as the database
   Each sequence used as a query against this database
   P values can be rather high
     Example: ubiquitin cluster
3. Mutually consistent triangles of genome-specific best hits
4. Merge triangles with common sides
5. Manual analysis of each cluster to eliminate false positives
6. Assignment of masked proteins (step 1) to clusters
7. Perform phylogenetic clustering on KOGs with proteins from multiple species

Species                                     Designation
Arabidopsis thaliana (thale cress)          A
Caenorhabditis elegans (worm)               C
Drosophila melanogaster (fruit fly)         D
Homo sapiens (human)                        H
Saccharomyces cerevisiae (baker's yeast)    Y
Schizosaccharomyces pombe (fission yeast)   P
Encephalitozoon cuniculi (Microsporidia)    E

Species Groupings with Largest Members
Genomes     Members
CDH         1147
ACDHYP      928
ACDHYPE     860
ACDH        484
CDHYP       152
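
Steps 3 and 4 can be illustrated with a small Python sketch. It is not the NCBI implementation: it assumes that "mutually consistent" means symmetric best hits, it uses invented protein identifiers of the form GENOME|name, and the function names (genome_of, find_triangles, merge_triangles) are hypothetical. The toy input contains two triangles that share the side C|c1 - D|d1, so they merge into one cluster.

```python
from itertools import combinations

def genome_of(protein):
    """Return the genome letter from a hypothetical 'GENOME|name' identifier."""
    return protein.split("|")[0]

def symmetric_best_hits(pairs):
    """Build a (protein, target genome) -> best hit table from reciprocal pairs."""
    hits = {}
    for p, q in pairs:
        hits[(p, genome_of(q))] = q
        hits[(q, genome_of(p))] = p
    return hits

def find_triangles(best_hit):
    """Find triples from three genomes in which every pair is a mutual best hit."""
    def mutual(p, q):
        return (best_hit.get((p, genome_of(q))) == q
                and best_hit.get((q, genome_of(p))) == p)
    proteins = sorted({p for p, _ in best_hit})
    triangles = set()
    for a, b, c in combinations(proteins, 3):
        if len({genome_of(a), genome_of(b), genome_of(c)}) < 3:
            continue  # a triangle needs three different genomes
        if mutual(a, b) and mutual(b, c) and mutual(a, c):
            triangles.add(frozenset((a, b, c)))
    return triangles

def merge_triangles(triangles):
    """Merge triangles that share a side (two proteins) using union-find."""
    triangles = [frozenset(t) for t in triangles]
    parent = list(range(len(triangles)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    side_owner = {}
    for i, tri in enumerate(triangles):
        for side in combinations(sorted(tri), 2):
            if side in side_owner:
                parent[find(i)] = find(side_owner[side])
            else:
                side_owner[side] = i
    clusters = {}
    for i, tri in enumerate(triangles):
        clusters.setdefault(find(i), set()).update(tri)
    return sorted(sorted(c) for c in clusters.values())

# Toy data (invented): a1 and h1 are NOT mutual best hits, and a2/c2 form no triangle.
pairs = [("A|a1", "C|c1"), ("A|a1", "D|d1"), ("C|c1", "D|d1"),
         ("C|c1", "H|h1"), ("D|d1", "H|h1"), ("A|a2", "C|c2")]
triangles = find_triangles(symmetric_best_hits(pairs))
clusters = merge_triangles(triangles)
print(clusters)
```

The lone reciprocal pair A|a2 / C|c2 is left unclustered because it never closes a triangle, which is why a cluster built this way needs members from at least three genomes.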
Examples of KOGs

KOGs containing all species: 860 clusters

KOG 0001: Ubiquitin and ubiquitin-like proteins
Species                                     Number of members
Arabidopsis thaliana (thale cress)          29
Caenorhabditis elegans (worm)               12
Drosophila melanogaster (fruit fly)         3
Homo sapiens (human)                        17
Saccharomyces cerevisiae (baker's yeast)    2
Schizosaccharomyces pombe (fission yeast)   1
Encephalitozoon cuniculi (Microsporidia)    1
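
The link between a member table like the one above and the species groupings (e.g. ACDHYPE) can be shown with a short Python sketch. The function name phyletic_pattern and the dictionary layout are assumptions made for illustration; the species letters and member counts are those listed for KOG 0001.

```python
# One-letter designations, in the order used for the species groupings.
SPECIES_CODES = {
    "Arabidopsis thaliana": "A",
    "Caenorhabditis elegans": "C",
    "Drosophila melanogaster": "D",
    "Homo sapiens": "H",
    "Saccharomyces cerevisiae": "Y",
    "Schizosaccharomyces pombe": "P",
    "Encephalitozoon cuniculi": "E",
}
CODE_ORDER = "ACDHYPE"

def phyletic_pattern(member_counts):
    """Return the letters of all species that have at least one member."""
    present = {SPECIES_CODES[s] for s, n in member_counts.items() if n > 0}
    return "".join(code for code in CODE_ORDER if code in present)

# Member counts for KOG 0001 (ubiquitin and ubiquitin-like proteins).
kog0001 = {
    "Arabidopsis thaliana": 29,
    "Caenorhabditis elegans": 12,
    "Drosophila melanogaster": 3,
    "Homo sapiens": 17,
    "Saccharomyces cerevisiae": 2,
    "Schizosaccharomyces pombe": 1,
    "Encephalitozoon cuniculi": 1,
}
pattern = phyletic_pattern(kog0001)
total_members = sum(kog0001.values())
print(pattern, total_members)
```

Every species has at least one member here, so this KOG falls in the ACDHYPE grouping.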
Arabidopsis At5g20620 vs. Baker Yeast YLL039c

>YLL039c
          Length = 381

 Score = 723 bits (1866), Expect = 0.0
 Identities = 370/381 (97%), Positives = 381/381 (99%)

Query: 1   MQIFVKTLTGKTITLEVESSDTIDNVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLADYN 60
           MQIFVKTLTGKTITLEVESSDTIDNVK+KIQDKEGIPPDQQRLIFAGKQLEDGRTL+DYN
Sbjct: 1   MQIFVKTLTGKTITLEVESSDTIDNVKSKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYN 60

Query: 61  IQKESTLHLVLRLRGGMQIFVKTLTGKTITLEVESSDTIDNVKAKIQDKEGIPPDQQRLI 120
           IQKESTLHLVLRLRGGMQIFVKTLTGKTITLEVESSDTIDNVK+KIQDKEGIPPDQQRLI
Sbjct: 61  IQKESTLHLVLRLRGGMQIFVKTLTGKTITLEVESSDTIDNVKSKIQDKEGIPPDQQRLI 120

Query: 121 FAGKQLEDGRTLADYNIQKESTLHLVLRLRGGMQIFVKTLTGKTITLEVESSDTIDNVKA 180
           FAGKQLEDGRTL+DYNIQKESTLHLVLRLRGGMQIFVKTLTGKTITLEVESSDTIDNVK+
Sbjct: 121 FAGKQLEDGRTLSDYNIQKESTLHLVLRLRGGMQIFVKTLTGKTITLEVESSDTIDNVKS 180

Query: 181 KIQDKEGIPPDQQRLIFAGKQLEDGRTLADYNIQKESTLHLVLRLRGGMQIFVKTLTGKT 240
           KIQDKEGIPPDQQRLIFAGKQLEDGRTL+DYNIQKESTLHLVLRLRGGMQIFVKTLTGKT
Sbjct: 181 KIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGGMQIFVKTLTGKT 240

Query: 241 ITLEVESSDTIDNVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLADYNIQKESTLHLVLR 300
           ITLEVESSDTIDNVK+KIQDKEGIPPDQQRLIFAGKQLEDGRTL+DYNIQKESTLHLVLR
Sbjct: 241 ITLEVESSDTIDNVKSKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLR 300

Query: 301 LRGGMQIFVKTLTGKTITLEVESSDTIDNVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTL 360
           LRGGMQIFVKTLTGKTITLEVESSDTIDNVK+KIQDKEGIPPDQQRLIFAGKQLEDGRTL
Sbjct: 301 LRGGMQIFVKTLTGKTITLEVESSDTIDNVKSKIQDKEGIPPDQQRLIFAGKQLEDGRTL 360

Query: 361 ADYNIQKESTLHLVLRLRGGS 381
           +DYNIQKESTLHLVLRLRGG+
Sbjct: 361 SDYNIQKESTLHLVLRLRGGN 381


Arabidopsis At5g20620 vs. Arabidopsis At1g14650

>At1g14650
          Length = 785

 Score = 38.5 bits (88), Expect = 0.024
 Identities = 24/68 (35%), Positives = 39/68 (57%), Gaps = 1/68 (1%)

Query: 314 GKTITLEVES-SDTIDNVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLADYNIQKESTLH 372
           G+ + + V+S S+ + ++K KI  +  IP ++Q+L      L+D  +LA YN+     L
Sbjct: 715 GQFMEITVQSLSENVGSLKEKIAGEIQIPANKQKLSGKAGFLKDNMSLAHYNVGAGEILT 774

Query: 373 LVLRLRGG 380
           L LR RGG
Sbjct: 775 LSLRERGG 782
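
How the "Identities" figure in a BLAST report is obtained can be sketched in Python by counting identical columns in a pair of aligned sequences. The helper name alignment_identity is hypothetical, and the two strings are only the first 60 columns of the At5g20620 vs. YLL039c alignment, not the full 381.

```python
def alignment_identity(query, subject):
    """Count identical columns in two aligned sequences of equal length.

    Gap characters ('-') never count as identities.
    """
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    identical = sum(1 for q, s in zip(query, subject) if q == s and q != "-")
    return identical, len(query)

# First alignment row (columns 1-60) of At5g20620 (query) vs. YLL039c (subject).
query = "MQIFVKTLTGKTITLEVESSDTIDNVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLADYN"
sbjct = "MQIFVKTLTGKTITLEVESSDTIDNVKSKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYN"

identical, columns = alignment_identity(query, sbjct)
percent = 100 * identical / columns
print(f"Identities = {identical}/{columns} ({percent:.0f}%)")
```

The two mismatched columns in this row are the alanine-to-serine substitutions at positions 28 and 57, which BLAST marks with "+" because they still score as positives. Counting positives as well would require a substitution matrix such as BLOSUM62, which this sketch leaves out.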
KOG 0001: 40S Ribosomal Protein S17
Species                                     Number of members
Arabidopsis thaliana (thale cress)          4
Caenorhabditis elegans (worm)               1
Drosophila melanogaster (fruit fly)         1
Homo sapiens (human)                        8
Saccharomyces cerevisiae (baker's yeast)    2
Schizosaccharomyces pombe (fission yeast)   2
Encephalitozoon cuniculi (Microsporidia)    1