We have: assembled six genomes; made predictions of most likely gene locations. We will: add layers of biological meaning to the sequences


Paul Williamson
5 years ago

2 Recap We have: assembled six genomes; made predictions of most likely gene locations. We will: add layers of biological meaning to the sequences

3 Start with Biology This will motivate the choices we make in picking tools for bioinformatics later on

4 Pal, Debnath. (2006) On gene ontology and function annotation.


7 Functional annotation Assigning biological meaning to sequence info. Types of genomic features (increasing scale): short sequences; genes (naming and function description); control of expression (promoter); operons; pathways/networks

8 Short sequences AGTGTTCTGATTACTGGGACTAAGTGCGGTACGTACGATGAGTCGATCAAATGCGTGC

9 Short sequences Four main categories: dispersed repeat motifs (competence signals, promoter regions); homopolymer tracts; short-motif SSRs (repeats of 2-6 bases; have been shown to modulate virulence; informative in epidemiological studies for phylogeny, etc.; knock-out of these regions); long-motif SSRs (8+ bases)
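A minimal sketch of how the homopolymer and short-motif SSR classes above could be located with regular expressions. The function names, the minimum tract length, the minimum repeat count, and the toy sequence are illustrative assumptions, not values from the slides.

```python
import re

def find_homopolymers(seq, min_len=5):
    """Return (start, base, length) for runs of a single base at least min_len long."""
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.group(1), len(m.group(0))) for m in pattern.finditer(seq)]

def find_short_ssrs(seq, min_repeats=3):
    """Return (start, unit, copies) for tandem repeats of a 2-6 base unit."""
    # Lazy {2,6}? prefers the shortest repeating unit (AT rather than ATAT).
    pattern = re.compile(r"([ACGT]{2,6}?)\1{%d,}" % (min_repeats - 1))
    hits = []
    for m in pattern.finditer(seq):
        unit = m.group(1)
        if len(set(unit)) == 1:
            continue  # a one-base unit is a homopolymer, reported separately
        hits.append((m.start(), unit, len(m.group(0)) // len(unit)))
    return hits

# Toy sequence (invented): a CAAT x4 repeat followed by a T x6 tract.
toy = "GGC" + "CAAT" * 4 + "GA" + "T" * 6 + "C"
homopolymers = find_homopolymers(toy)
ssrs = find_short_ssrs(toy)
```

Note that the reported repeat unit depends on where the scan first finds a match, so rotations of the same unit (CAAT versus AATC) are equivalent descriptions of one tract.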

10 Dispersed repeat motif AAGTGCGGT = one signal for competence machinery in Haemophilus influenzae. Promoters: TATAAT (Pribnow box) at -10; TTGACA at -35, allows for high transcription
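As a hedged illustration, the competence signal above can be searched for on both strands of the example sequence shown on the earlier slides. The helper names are hypothetical, and exact matching is an assumption; a real scan would probably tolerate mismatches.

```python
def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_motif(seq, motif):
    """Return (strand, start) for every exact occurrence on either strand.

    Start positions are 0-based and always given on the forward strand.
    """
    hits = []
    for strand, target in (("+", motif), ("-", reverse_complement(motif))):
        start = seq.find(target)
        while start != -1:
            hits.append((strand, start))
            start = seq.find(target, start + 1)
    return hits

# Example sequence and signal copied from the slides.
slide_seq = "AGTGTTCTGATTACTGGGACTAAGTGCGGTACGTACGATGAGTCGATCAAATGCGTGC"
uss_hits = find_motif(slide_seq, "AAGTGCGGT")
```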

11 Genes gena AGTGTTCTGATTACTGGGACTAAGTGCGGTACGTACGATGAGTCGATCAAATGCGTGC

12 Gene naming So the LORD God formed out of the ground various wild animals and various birds of the air, [and protein coding genes,] and he brought them to the man to see what he would call them; whatever the man called each of them would be its name. Genesis 2:19

13 Gene naming

14 Gene naming Gene ontology: study of what the gene is; assigning putative function. How is this helpful? Facilitates communication of much information (2000 genes times 6 genomes); confirmation of experimental data from the CDC; allows for comparative analysis

15 Gene ontology Sub-domains. Molecular function: elemental activities of a gene product at the molecular level (binding, catalysis). Biological processes: sets of molecular events with a defined beginning and end (e.g. induction of cell death). Cellular components: the parts of a cell or its extracellular environment. Source: An Introduction to the Gene Ontology.

16 Operons gena genb genc gena AGTGTTCTGATTACTGGGACTAAGTGCGGTACGTACGATGAGTCGATCAAATGCGTGC

17 What is an operon Operon - a cluster of structural genes that are expressed as a group and their associated promoter and operator. In addition to being physically close in the genome, these genes are regulated such that they are all turned on or off together.

19 lac Operon in E.coli

22 Operons in Haemophilus influenzae hitABC: periplasmic iron transport operon, encoding a classic high-affinity iron acquisition system. dprABC: genes required for efficient processing of linear DNA during cellular transformation.

23 Why operons are important Bacteria respond to changing environments by altering their gene expression patterns; thus, they express different enzymes depending on the carbon sources and other nutrients available to them. Grouping related genes under a common control mechanism allows bacteria to rapidly adapt to changes in the environment.

24 Functional networks some function gena genb genc gena AGTGTTCTGATTACTGGGACTAAGTGCGGTACGTACGATGAGTCGATCAAATGCGTGC

25 Examples of functions [Figure: reaction scheme; surviving labels: 6x, 6x, hν, 1x, 6x]

26 Metabolism $n_{X_1} X_1 + n_{X_2} X_2 + \dots + n_{X_K} X_K \rightarrow n_{Y_1} Y_1 + n_{Y_2} Y_2 + \dots + n_{Y_J} Y_J$ Role of proteins in metabolism: help get over the free energy barrier! [Figure: Gibbs free energy versus reaction coordinate; labels: "to breathe", "gotta work"]

27 Metabolism $n_{X_1} X_1 + n_{X_2} X_2 + \dots + n_{X_K} X_K \rightarrow n_{Y_1} Y_1 + n_{Y_2} Y_2 + \dots + n_{Y_J} Y_J$ From A. Goelzer, et al. BMC Systems Biology 2008, 2:20

28 Flagellar biogenesis and chemotaxis Modifications to DNA sequences (and thus the functional network) can result in phenotypic changes WT Tumble mutant Speed mutant Left figure from S. Kalir, et al. Science 292, 2080 (2001)

29 Competence and transformation

30 From biology to bioinformatics GENOME DATA + GENE PREDICTIONS -> Small sequences, Genes, Operons, Networks/Pathways -> FINAL ANNOTATION


31 Simple Pipeline for Short Sequences AGTGTTCTGATTACTGGGACTAAGTGCGGTACGTACGATGAGTCGATCAAATGCGTGC Genome data Ab initio patterns Database Motif finder The computer should be doing the hard work. That's what it's paid to do, after all ~Larry Wall Statistical analysis Final annotation The most important point is that the biases in the distributions [of sequence motifs] need to be supported by some statistical analyses. Some sort of goodness-of-fit such as chi-square with an appropriate correction for multiple tests should suffice. ~King Jordan
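One way the statistical step quoted above might look: a chi-square goodness-of-fit test on motif present/absent counts, followed by a Bonferroni correction across all motifs tested. This is a sketch under stated assumptions. The counts are invented, the choice of Bonferroni as the "appropriate correction" is ours, and the p-value formula used is only valid for one degree of freedom.

```python
import math

def chi_square_1df(observed, expected, total):
    """Goodness-of-fit on two cells (motif present / absent); returns (stat, p)."""
    cells = ((observed, expected), (total - observed, total - expected))
    stat = sum((o - e) ** 2 / e for o, e in cells)
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p = math.erfc(math.sqrt(stat / 2.0))
    return stat, p

def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]

# Invented counts: (observed occurrences, expected under a background model).
total_windows = 100000
motif_counts = {"AAGTGCGGT": (40, 10.0), "ACGTACGTA": (12, 10.0)}

raw = [chi_square_1df(obs, exp, total_windows)[1] for obs, exp in motif_counts.values()]
adjusted = dict(zip(motif_counts, bonferroni(raw)))
significant = [motif for motif, p in adjusted.items() if p < 0.05]
```

With many candidate motifs the correction matters: each extra test raises the bar a motif must clear before its bias is reported.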

32 GENE PREDICTION RESULTS Analyze Overlaps Identify overlaps and store for future analysis High + Medium Low Gene Level BLASTn BLASTx Pangenome Panproteome Haemophilus database intrinsic Transcript Level INTERPROSCAN Reverse PSI-BLAST BLASTx CDD UNIPROT Consensus Molecular Function Cellular Component SignalP LipoP TMHMM Results BLASTx NR Analyze overlaps extrinsic FINAL ANNOTATION GO terms Level 1 Small Sequence Pipeline KEGG Pathway Tools Pathways Level 2 Operon DOORS OPERON DB

33 Understanding the Gene Pipeline gena Homology and BLAST InterProScan Ab initio Methods

34 Homology and BLAST Homology is sequence similarity due to common ancestry. BLAST: heuristic algorithm for matching similar sequences. Programs: blastn, blastp; blastx, tblastn, tblastx; RPS-BLAST

35 Steps of BLAST
1. Filter out low-complexity repeats (they may give statistically significant but biologically uninteresting results)
2. Generate list of all words in query (length of 3 for aa queries, 11 for nt queries)
3. Precompute all possible high-scoring matches to these words
4. Use this expanded word list as query
5. Search database for sequences containing two nearby exact matches
6. Score hits
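The word-generation and two-hit steps above can be sketched as follows. This is a simplified illustration, not BLAST itself: it skips low-complexity filtering and the expansion to high-scoring neighbor words, uses exact word matches only, and the window size and function names are assumptions.

```python
from collections import defaultdict

def words(seq, w):
    """All overlapping words of length w with their start positions."""
    return [(i, seq[i:i + w]) for i in range(len(seq) - w + 1)]

def two_hit_seeds(query, subject, w=3, window=40):
    """Pairs of non-overlapping word hits on the same diagonal within `window`.

    Returns (diagonal, first_query_pos, second_query_pos) tuples, where
    diagonal = subject position - query position.
    """
    index = defaultdict(list)
    for i, word in words(query, w):
        index[word].append(i)

    by_diagonal = defaultdict(list)
    for j, word in words(subject, w):
        for i in index.get(word, ()):
            by_diagonal[j - i].append(i)

    seeds = []
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        for k, first in enumerate(positions):
            for second in positions[k + 1:]:
                if second - first > window:
                    break
                if second - first >= w:  # hits must not overlap
                    seeds.append((diagonal, first, second))
                    break
    return seeds

# Toy protein query embedded in a longer subject (both invented).
seeds = two_hit_seeds("MKTAYIAKQR", "GGMKTAYIAKQRGG")
```

In the full algorithm each seed would then be extended into an alignment and scored; here the seeds are simply returned.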

36 Scoring Matrices PAM: calculated from a model of evolutionary distance, based on alignments of closely related sequences. PAM1: substitution probabilities for the distance at which 1 aa in 100 undergoes substitution. PAM(N) = (PAM1)^N. PAM120 considered good for scoring closely related sequences
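The relation PAM(N) = (PAM1)^N is repeated matrix multiplication of a substitution probability matrix. A minimal sketch with a made-up three-letter alphabet; the probabilities are toy values chosen so that 1% of residues change per step, not real PAM data.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(matrix, power):
    """matrix raised to a non-negative integer power."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(power):
        result = mat_mul(result, matrix)
    return result

# Toy "PAM1": each residue stays the same with probability 0.99.
toy_pam1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]
toy_pam120 = mat_pow(toy_pam1, 120)
```

Each row of the result still sums to 1, and the diagonal shrinks as N grows, reflecting that fewer residues remain unchanged at larger evolutionary distances.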

37 Scoring Matrices BLOSUM - derived from BLOCKS database Blocks were sorted into closely related clusters Frequency of substitutions between clusters within a family used to calculate probability of meaningful substitution BLOSUM(N) - N=cutoff value for percentage sequence identity that defines the clusters

38 Database Look for hits in related genomes (expected functional relationship): H. flu; Haemophilus pan-genome; Pasteurellaceae family (may contain more closely related organisms than Haemophilus)

39 Blastn, Blastx 80% identity. If a gene encodes a protein, blastx is expected to be better: the aa sequence is more complex and contains more functional information. Frameshift due to sequencing error: blastn would still hit, blastx would fail
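The frameshift point can be illustrated with a toy gene: deleting one base barely changes the nucleotide sequence but scrambles the downstream protein. This sketch uses Python's `difflib` similarity ratio as a stand-in for an alignment score, which is an assumption made for brevity; the gene sequence is invented.

```python
from difflib import SequenceMatcher
from itertools import product

# Standard genetic code, codons ordered T, C, A, G at each position.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product("TCAG", repeat=3), AMINO_ACIDS)}

def translate(dna):
    """Translate frame 1; a trailing incomplete codon is ignored."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def similarity(a, b):
    """Crude similarity in [0, 1]; not a real alignment score."""
    return SequenceMatcher(None, a, b).ratio()

gene = "ATGAAAACCGCTTATATCGCGAAACAGCGTCAGATTAGCTTTGTGAAAAGCCATTTTAGC"
shifted = gene[:10] + gene[11:]  # simulate a one-base deletion error

nt_similarity = similarity(gene, shifted)
aa_similarity = similarity(translate(gene), translate(shifted))
```

The nucleotide similarity stays close to 1, while the protein similarity collapses after the deletion, which is why a protein-level search can miss a gene that a nucleotide-level search still finds.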

40 RPS-Blast Identify conserved domains in proteins Compares protein sequence to a database of position specific scoring matrices (PSSM) Uses substitution frequency at each position in MSAs of recognized conserved domains From SMART, PFAM, LOAD
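To make the idea of position-specific scoring concrete, here is a small sketch that builds a log-odds matrix from a toy alignment and slides it along a sequence. It is not RPS-BLAST: the alignment is invented, the background is assumed uniform, the pseudocount is arbitrary, and no BLAST heuristics are involved.

```python
import math
from collections import Counter

AMINO = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment, pseudocount=1.0):
    """One dict of log2-odds scores per alignment column (uniform background)."""
    background = 1.0 / len(AMINO)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO)
        pssm.append({aa: math.log2((counts[aa] + pseudocount) / total / background)
                     for aa in AMINO})
    return pssm

def best_window(pssm, seq):
    """Return (score, start) of the highest-scoring ungapped window."""
    width = len(pssm)
    scored = [(sum(pssm[k][seq[i + k]] for k in range(width)), i)
              for i in range(len(seq) - width + 1)]
    return max(scored)

# Toy aligned domain fragment (invented).
toy_alignment = ["GDSL", "GDSL", "GDSV", "GESL"]
toy_pssm = build_pssm(toy_alignment)
score, start = best_window(toy_pssm, "MKTGDSLAAR")
```

Unlike a single substitution matrix, the same residue scores differently in each column, so conserved positions of a domain dominate the match.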

41 InterPro Database of databases 13 officially integrated Signatures derived from the collection Represent domains, families, functional sites, etc Manually curated

42 HMM Databases PIRSF Superfamilies based on evolutionary relationships TIGRFAMs Functionally equivalent proteins equivalogs PANTHER Divergence of function within families

43 HMM DB continued Pfam Protein families based on functional regions Gene3D Structural annotation Extends CATH structural domain database SUPERFAMILY Structural annotation SCOP structural domain database

44 Profiles & Patterns HAMAP Identify conserved prokaryotic protein families and subfamilies PROSITE Profiles predict structural properties of proteins Patterns predict protein function

45 Clusters and Fingerprints ProDom Sequence clusters built from UniprotKB PRINTS Conserved motifs used as fingerprints

46 Integration into InterPro
Signature databases (version): GENE3D (3.3.0*), HAMAP, PANTHER, PIRSF, PRINTS, PROSITE patterns (20.66*), PROSITE profiles (20.66*), Pfam, PfamB, ProDom, SMART, SUPERFAMILY (1.73*), TIGRFAMs (9.0*)
Some signatures may not have matches to UniProtKB proteins. Not all signatures of a member database may be integrated at the time of an InterPro release. InterPro is using an older version of the databases marked with a * symbol.
Data based on InterPro release 31.0, 9th February 2011

47 Integration continued: InterPro and UniProtKB
UniProtKB (version 2011_): 85.5% of proteins match any signature; 79.3% match integrated signatures
UniProtKB/TrEMBL (version 2011_): 85.0% match any signature; 78.7% match integrated signatures
UniProtKB/Swiss-Prot (version 2011_): 97.2% match any signature; 95.3% match integrated signatures
InterPro to GO: 24,236 GO terms mapped to InterPro entries

48 InterProScan A suite of tools ScanRegExp, Pfscan, FingerPrintScan, HMMpfam Web-based vs. standalone install Run limitations Input limitations Signatures

49 InterProScan Output Formats Raw, html, gff3 Output Accession Numbers Swiss-Prot, PDB, TrEMBL, Member DBs etc Annotation GO Terms, Structural, Functional, etc Metadata Literature references, taxonomy, cross-references, etc

50 Intrinsic Method (Ab initio) SignalP LipoP TMHMM

51 SignalP SignalP 3.0 service A prediction of cleavage sites and a signal peptide/non-signal peptide prediction Gram-positive prokaryotes, Gram-negative prokaryotes, and eukaryotes Several artificial neural networks and hidden Markov models

52 Biology Background "Proteins have intrinsic signals that govern their transport and localization in the cell." Günter Blobel. Signal peptide: cleaved by signal peptidase I (SPase I). Signal anchors are "uncleaved signal peptides", which have no SPase recognition site

53 Data sets The data used for SignalP version 3.0 were extracted from SWISS-PROT version 40

54 Algorithms 2 Neural networks: one for predicting the actual signal peptide and one for predicting the position of the signal peptidase I (SPase I) cleavage site.

55 Algorithms The HMM: prediction of signal anchors in addition to the prediction of signal peptides

56 Input

57 Output C-score: the "cleavage site" score. S-score: signal peptide indicator. Y-score: a better cleavage site prediction

58 Output

59 LipoP LipoP 1.0 server predictions of lipoproteins Gram-negative bacteria only HMM

60 Biology background Prokaryotic lipoprotein cleavage sites are not predicted using SignalP. Prokaryotic lipoproteins are cleaved by a specific lipoprotein signal peptidase, Lsp or signal peptidase II. This peptidase recognizes a conserved sequence and cuts upstream of a cysteine residue to which a glyceride-fatty acid lipid is attached. The cleavage sites of these proteins differ considerably from those cleaved by the standard prokaryotic signal peptidase (SPase I).

61 Input/Output

62 TMHMM TMHMM Server v. 2.0 Prediction of transmembrane helices in proteins HMM

63 Input/Output The program takes proteins in FASTA format. It recognizes the 20 amino acids and B, Z, and X, which are all treated equally as unknown. Any other character is changed to X
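A minimal sketch of preparing input under the rule above: read FASTA records and replace any character outside the 20 amino acids plus B, Z, and X with X. The function names are hypothetical, and converting to upper case first is our assumption rather than documented TMHMM behavior.

```python
ALLOWED = set("ACDEFGHIKLMNPQRSTVWY" "BZX")

def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records

def sanitize(seq):
    """Replace every unrecognized character with X (upper-casing is an assumption)."""
    return "".join(c if c in ALLOWED else "X" for c in seq.upper())

example = ">protein1 hypothetical\nMKT*U-BZ\nacd\n"
cleaned = [(name, sanitize(seq)) for name, seq in read_fasta(example)]
```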

64 Operon Pipeline Tools gena genb genc OperonDB DOORS

65 OperonDB Operon DataBase Relies on conservation of gene order and orientation in two or more species to infer operon structure Calculate the probability that gene pairs belong in the same operon Needs a training set of genomes Input: Full sequence + Gene loci
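The conservation idea above can be sketched by counting, for each adjacent same-strand gene pair in a target genome, how many other genomes keep the same pair adjacent in the same transcription order. This is only an illustration: OperonDB turns such evidence into a probability, which is not attempted here, and the genome layouts are invented (gene labels borrowed from the hitABC example).

```python
def adjacent_pairs(genes):
    """Adjacent same-strand pairs, keyed in transcription order.

    `genes` is a list of (ortholog_id, strand) in left-to-right genome order.
    """
    pairs = set()
    for (left, left_strand), (right, right_strand) in zip(genes, genes[1:]):
        if left_strand == right_strand:
            # On the minus strand transcription runs right to left.
            pairs.add((left, right) if left_strand == "+" else (right, left))
    return pairs

def conservation_counts(target, others):
    """Number of other genomes in which each target pair is also adjacent."""
    other_pairs = [adjacent_pairs(genome) for genome in others]
    return {pair: sum(pair in pairs for pairs in other_pairs)
            for pair in adjacent_pairs(target)}

target = [("hitA", "+"), ("hitB", "+"), ("hitC", "+"), ("x1", "-")]
genome_1 = [("x9", "+"), ("hitA", "+"), ("hitB", "+"), ("hitC", "+")]
genome_2 = [("hitC", "-"), ("hitB", "-"), ("hitA", "-"), ("x2", "+")]  # inverted locus

counts = conservation_counts(target, [genome_1, genome_2])
```

Keying pairs by transcription order lets an inverted copy of the locus count as conserved, since gene order and relative orientation are both preserved.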

66 OperonDB output Columns: Gene1, Gene2, confidence level

67 Pro/Cons Can use training set to bias the data for Haemophilus genus Can only find operons that are conserved in other species as well

68 DOORS Database for prokaryotic OpeRons. Predicts operons based on the features of gene pairs: intergenic distance; distance between adjacent genes' phylogenetic profiles; conservation of gene neighborhood; similarity score between GO terms of gene pairs; frequencies of specific DNA motifs in intergenic regions. Uses the above features to train a linear logistic function-based classifier
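A hedged sketch of what a linear logistic classifier over gene-pair features looks like. The weights, bias, and feature values below are made up for illustration and are not the trained DOORS model; only three of the listed features are used.

```python
import math

def intergenic_distance(upstream, downstream):
    """Gap in bp between two genes given as (start, end), 1-based inclusive."""
    return downstream[0] - upstream[1] - 1

def operon_probability(features, weights, bias):
    """Logistic function applied to a weighted sum of the features."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical parameters: short gaps and high conservation favor an operon call.
WEIGHTS = {"intergenic_distance": -0.02,
           "neighborhood_conservation": 2.0,
           "go_similarity": 1.5}
BIAS = 0.5

close_pair = {"intergenic_distance": intergenic_distance((100, 400), (411, 900)),
              "neighborhood_conservation": 0.9,
              "go_similarity": 0.8}
far_pair = {"intergenic_distance": 300,
            "neighborhood_conservation": 0.1,
            "go_similarity": 0.1}

p_close = operon_probability(close_pair, WEIGHTS, BIAS)
p_far = operon_probability(far_pair, WEIGHTS, BIAS)
```

In a trained model the weights would be fitted on gene pairs with known operon status; chaining adjacent pairs that score above a threshold then yields predicted operons.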

69 DOORS input Full genome sequence file - fasta Gene location information - gff Protein Sequence information - fasta

70 DOORS classification

71 Pro / Cons Brings in data from other operon databases: ODB, MicrobesOnline Operon Not all operons in DOORS are experimentally verified

72 Functional network tools KEGG : Kyoto Encyclopedia of Genes and Genomes

73 About KEGG Initiated in May 1995 under the Human Genome Program of the Ministry of Education, Science, Sports and Culture in Japan. Developed by the Kanehisa Laboratory (Bioinformatics Center) in the Institute for Chemical Research, Kyoto University. Database resource for understanding higher-order functions and utilities of the biological system, such as the cell or organism, from genomic and molecular information.

74 Components of KEGG

75 GENES database: GenBank + NCBI RefSeq + EMBL + publicly available organism-specific databases. Genes in high-quality genomes (140 eukaryotes, 1185 bacteria, 95 archaea): 6,290,236 (as of 2011/3/2). Internal re-annotation -> SSEARCH.
SSDB database: sequence similarity database. Pre-computed sequence similarity scores + best hits (SSEARCH). Generates ortholog clusters and paralog clusters.
KO system: KO (KEGG Orthology) identifiers, or K numbers. Pathway-based classification of orthologous genes. Common identifier for linking genomic to pathway information. KAAS-SSDB + GFIT + manual verification.
PATHWAY mapping and BRITE mapping: based on K numbers, computationally generates organism-specific pathways and BRITE hierarchies.

76 PATHWAY database The KEGG PATHWAY database is a collection of manually drawn pathway maps for metabolism, genetic information processing, various other cellular processes, and human diseases. KEGG reference pathways (maps): a known network of functional significance. Organism-specific pathways: automatically generated by superimposing (coloring) genes in given organisms

77 BRITE database KEGG BRITE is a collection of hierarchical classifications representing our knowledge on various aspects of biological systems. In contrast to KEGG PATHWAY, which is limited to molecular interactions and reactions, KEGG BRITE incorporates many different types of relationships. It includes various biological objects, including molecules, cells, organisms, diseases and drugs, as well as relationships among them. Mainly aims to automate functional interpretation. KEGG pathway reconstruction. KEGG BRITE mapping is the process of mapping molecular datasets to the BRITE functional hierarchies for biological interpretation of higher-level systemic functions.

78 PATHWAY TOOLS

79 Pathway Tools is a comprehensive symbolic systems biology software system. Mainly used to create a type of model-organism database (MOD) called a Pathway/Genome Database (PGDB). It provides two ways to interact with the PGDB: 1. Graphical component -> to visualize and update contents 2. Ontology and database API -> allows programs to perform complex queries and data mining on the contents.

80 COMPONENTS PathoLogic: creates a new PGDB containing the predicted metabolic pathways of an organism. Pathway/Genome Navigator: supports query, visualization, and analysis of PGDBs. Pathway/Genome Editors: provide interactive editing capabilities for PGDBs.

81 WORKFLOW INPUT FILE: Flat file descriptions of genes and gene products Conversion Process Converts to PGDB representation DEVELOPER PATHWAY/GENOME EDITOR: Provides interactive forms for editing contents refining, updating etc. USER Inference Process Predicts metabolic pathway complement MetaCyc Pathway Tools Ontology Groups pathways by functional pathway PATHOLOGIC PATHWAY/GENOME NAVIGATOR Query, visualization and analysis of the PGDB

82 It supports
Development of organism-specific databases
Computational inferences, including prediction of metabolic pathways, metabolic pathway hole fillers, and operons
Scientific visualization, including: automatic display of metabolic pathways and full metabolic networks; a genome browser; display of operons, regulons, and full transcriptional regulatory networks; visual analysis of omics datasets, such as painting omics data onto diagrams of the full metabolic network, full regulatory network, and full genome
Comparative analyses of organism-specific databases
Analysis of biological networks: interactively tracing metabolites through the metabolic network; finding dead-end metabolites in metabolic networks
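The dead-end metabolite analysis listed above can be illustrated with a tiny reaction network. This is a simplified sketch and not the Pathway Tools implementation: reactions are treated as irreversible, transport across the cell boundary is ignored, and the network is invented.

```python
def dead_end_metabolites(reactions):
    """Metabolites that are only produced or only consumed in the network.

    `reactions` maps a reaction name to (substrates, products), both sets.
    """
    produced, consumed = set(), set()
    for substrates, products in reactions.values():
        consumed.update(substrates)
        produced.update(products)
    return {"only_produced": produced - consumed,
            "only_consumed": consumed - produced}

# Toy network: A -> B, B -> C, B -> D
toy_network = {
    "r1": ({"A"}, {"B"}),
    "r2": ({"B"}, {"C"}),
    "r3": ({"B"}, {"D"}),
}
dead_ends = dead_end_metabolites(toy_network)
```

In a real reconstruction, metabolites that are imported or secreted would need to be excluded before the remaining dead ends are read as gaps in the annotation.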

83 GENE PREDICTION RESULTS Analyze Overlaps Identify overlaps and store for future analysis High + Medium Low Gene Level BLASTn BLASTx Pangenome Panproteome Haemophilus database intrinsic Transcript Level INTERPROSCAN Reverse PSI-BLAST BLASTx CDD UNIPROT Consensus Results Molecular Function Cellular Component SignalP LipoP TMHMM ProtCompB BLASTx NR Analyze overlaps extrinsic FINAL ANNOTATION GO terms Level 1 Small Sequence Pipeline KEGG Pathway Tools Pathways Level 2 Operon DOORS OPERON DB
