Introduction to Bioinformatics
Integrated Science, 11/9/05
Morris Levy, Biological Sciences
  Research: Evolutionary Ecology, Plant-Fungal Pathogen Interactions
  Coordinator: BIOL 495S/CS490B/STAT490B Introduction to Bioinformatics (Fall semester)
  [Selected slides from Mark Levinthal and Daisuke Kihara]

Bioinformatics/Computational Biology
  Development and application of computational tools (algorithms*) for genome sequencing and massive data analysis
    * Algorithm: a systematic procedure for solving a problem in a finite number of steps; it can be written in a computer language and run as a program.
  Interdisciplinary (Biol/CS/Stat)
  Emphasis on determining the sequence, structure, and functional relationships of DNAs, genes, and proteins necessary for cell metabolism and organism development
  Databases and information management
DNA -> RNA -> protein -> phenotype
  [Figure: data resources at each level: genomic DNA databases; cDNA, ESTs, and expression profiles; protein sequence and structure databases]

Research Areas in Bioinformatics
  Genomics: sequence and organization of the genome (structural), gene finding, and functional annotation; comparative genomics
  Proteomics: structure and function of the entire inventory of proteins produced
  Transcriptomics: gene expression profiles in cells, tissues, organs, and organisms during development; comparative expression in disease pathology or genetic disorders
  Metabolomics: organization and flux of all cellular pathways (chemistry and physiology)
  Phylogenetics: evolutionary history of all of the above
Plan of my three lectures
  1. Intro to bioinformatics: comparative genomics
  2. Tutorial on NCBI database use, incl. BLAST (Basic Local Alignment Search Tool)
  3. Phylogenetic informatics
  Always ask questions for clarification during lecture; even other questions (@$0.50)

3 Domains of Life
  Archaea: prokaryote, lacks a nuclear membrane, single-cell; initially found in extreme conditions (high temperature, high pressure, low pH)
  Bacteria: prokaryote; e.g., E. coli
  Eucarya/Eukarya: e.g., yeast (unicellular), human
  [Figure: a tree of life based on small subunit rRNA sequences (Pace, 2001)]
Three Domains of Life + Endosymbiosis
  Monophyly, but with horizontal transfer (eukaryotic organelles, i.e., mitochondria and chloroplasts, are bacterial in evolutionary origin)
  Archaea and Eukaryota are more closely related to each other than either is to Bacteria (they share information-processing genes)

Genome Sequences
  Human genome sequence: working draft completed in 2000
1995: The genome of the bacterium Haemophilus influenzae is sequenced
Overview of bacterial complete genomes

MBGD database
  http://mbgd.genome.ad.jp
Genome sizes in nucleotide base pairs
  [Figure: ranges of genome size, on a log scale from 10^4 to 10^11 base pairs, for plasmids, viruses, bacteria, fungi, plants, algae, insects, mollusks, bony fish, amphibians, reptiles, birds, and mammals]
  The human genome is ~3 x 10^9 base pairs and is thought to contain ~25,000-35,000 genes; protein-coding genes = <2% of the total genome (Why?)
  Source: http://www3.kumc.edu/jcalvet/powerpoint/bioc801b.ppt

Genes in the genome
  Organism                              Domain                    Genome size (kb)   ORFs    Ratio (kb per ORF)
  Escherichia coli                      Bacteria                  4639               4289    1.08
  Bacillus subtilis                     Bacteria                  4214               4099    1.02
  Methanobacterium thermoautotrophicum  Archaea                   1751               1918    0.91
  Saccharomyces cerevisiae              Eukaryote (single cell)   12069              6294    1.9
  Caenorhabditis elegans                Eukaryote (nematode)      97000              19099   5.07
  Oryza sativa                          Eukaryote (plant)         420000             50000   8.4
  Drosophila melanogaster               Eukaryote (insect)        137000             14100   9.71
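The ratio column in the "Genes in the genome" table is simply genome size divided by ORF count. A minimal sketch that recomputes it from the table's own numbers is given here; the function name `kb_per_orf` is illustrative only. The slide's values appear to be truncated rather than rounded, so the last printed digit may differ slightly from the table.

```python
# Genome size (kb) and ORF counts copied from the "Genes in the genome" table.
GENOMES = {
    "Escherichia coli": (4639, 4289),
    "Bacillus subtilis": (4214, 4099),
    "Methanobacterium thermoautotrophicum": (1751, 1918),
    "Saccharomyces cerevisiae": (12069, 6294),
    "Caenorhabditis elegans": (97000, 19099),
    "Oryza sativa": (420000, 50000),
    "Drosophila melanogaster": (137000, 14100),
}


def kb_per_orf(genome_kb, orfs):
    """Average genome length (kb) per open reading frame (hypothetical helper)."""
    return genome_kb / orfs


for name, (size_kb, orfs) in GENOMES.items():
    # A ratio near 1 means a gene-dense genome; larger values mean more
    # non-ORF sequence per gene.
    print(f"{name:38s} {kb_per_orf(size_kb, orfs):6.2f}")
```

The prokaryotes come out near 1 kb per ORF, while the multicellular eukaryotes are several-fold higher, which is the pattern the table is meant to show.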
GC content varies across genomes
  [Figure: number of species in each GC class for bacteria, plants, invertebrates, and vertebrates; x-axis: GC content (%), 20-80]

Function Assignment
  BLAST/FASTA sequence comparison with genes of known function
  Motif (protein folding structures) search via structure prediction
  Amount of unknown function in genomes (201 genomes) (Hawkins & Kihara)
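For the GC-content slide above, GC content is the percentage of G and C bases in a sequence. A minimal sketch follows, assuming the sequence is a plain string and that ambiguous symbols such as N should be ignored; the function name and the example sequence are made up for illustration.

```python
def gc_content(seq):
    """Return GC content as a percentage of the unambiguous (A/C/G/T) bases."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T bases")
    return 100.0 * gc / total


example = "ATGCGCGTTAAC"  # made-up sequence, not from any genome
print(f"GC content: {gc_content(example):.1f}%")
```

Ignoring ambiguous bases is one possible design choice; counting them in the denominator instead would lower the reported percentage.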
Comparison Strategies for Deciphering Gene Function
  Genomes of closely related organisms
  Genomes of distantly related organisms
  Genomes vs. metabolic pathways, compounds
  * Inference: conservation of sequence, physical order, and/or phylogenetic clustering of genes implies their functional association

Dynamic rearrangement of genomes: Mycoplasma pneumoniae and Mycoplasma genitalium (Himmelreich et al., 1997)
  M. pneumoniae (1996): 732 genes
  M. genitalium (1995): 522 genes = smallest genome among self-replicating organisms
Genome map of two Mycoplasma sp.
  Method: FASTA/BLAST bidirectional hits

Dot-plots of closely related genomes (Suyama & Bork 2001)
  (a) Chlamydia pneumoniae, AR39 (CPa) & CWL029 (CP)
  (b) Neisseria meningitidis, Z2491 (NMa) & MC58 (NMb)
  (c) Helicobacter pylori, J99 (HP99) & 26695 (HP)
  (d) Chlamydia trachomatis, serovar D (CT) & MoPn (CTm)
  (e) Mycobacterium leprae (ML) & M. tuberculosis (MT)
  (f) Pyrococcus horikoshii (PH) & P. abyssi (PA)
  (g) E. coli (EC) & Vibrio cholerae chromosome 1 (VC1)
  (h) Mycoplasma pneumoniae (MP) & M. genitalium (MG)
  (i) CP & CT
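The bidirectional-hit method named on the Mycoplasma genome-map slide can be sketched as follows: gene a in genome A and gene b in genome B are paired when b is a's best-scoring hit and a is b's best-scoring hit. This is a minimal sketch, assuming the FASTA/BLAST scores have already been parsed into dictionaries; the gene names, scores, and function names are hypothetical, and no BLAST call is made.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring subject gene.

    scores: dict {(query, subject): score}, e.g. parsed from search output.
    On ties, the first pair encountered is kept.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Toy scores (made up): a2 and a3 both hit b2, but b2's best hit is a3.
a_vs_b = {("a1", "b1"): 250, ("a1", "b2"): 90, ("a2", "b2"): 180, ("a3", "b2"): 200}
b_vs_a = {("b1", "a1"): 245, ("b2", "a3"): 198, ("b2", "a2"): 175}
print(bidirectional_best_hits(a_vs_b, b_vs_a))
```

In the toy data, a2 is dropped because its best hit b2 points back to a3, which is how the reciprocal test filters out likely paralogs.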
Conservation of gene order (gene clusters)
  Dandekar, Snel, Huynen & Bork (1998)
  Analysis of selective constraints that preserve gene order
  Genomes to be compared should be neither too distant nor too close in evolutionary distance
  Reason for conservation: physical interactions between the encoded proteins
  Operon: a unit of transcription that consists of several genes with related functions, one or more promoter regions, and other regulatory sites

Conserved gene arrangements
  Ribosomal proteins
  ATP synthases
  Transporters
  ABC (ATP-binding cassette) transporters
  Enzyme pairs
  GroEL & GroES etc.
  Cell-division proteins
  Gene pairs of unknown function
  Tryptophan operon
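One simple way to look for conserved gene order, in the spirit of the slides above, is to find gene pairs that are neighbors in both genomes. This is a minimal sketch under stated assumptions: each genome is given as a list of orthologous-group labels in chromosomal order, orientation and circularity are ignored, and the labels `g1`..`g6` are hypothetical. It is not the published method of Dandekar et al.

```python
def adjacent_pairs(gene_order):
    """Return the set of unordered neighboring gene pairs in one genome."""
    return {frozenset(pair) for pair in zip(gene_order, gene_order[1:])}


def conserved_pairs(order_a, order_b):
    """Gene pairs that are adjacent in both genomes (strand ignored)."""
    return adjacent_pairs(order_a) & adjacent_pairs(order_b)


# Hypothetical ortholog labels in chromosomal order for two genomes.
genome_a = ["g1", "g2", "g3", "g4", "g5"]
genome_b = ["g4", "g3", "g2", "g6", "g1", "g5"]

for pair in sorted(sorted(p) for p in conserved_pairs(genome_a, genome_b)):
    print(" - ".join(pair))
```

Here the block g2-g3-g4 stays together (inverted) in the second genome, so its two neighbor pairs are reported, while g1 and g5 have lost their original neighbors.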
Examples of proteins with the same phylogenetic profile (co-occurrence)
  A. Ribosomal proteins
  B. Flagellar structural proteins
  C. Histidine biosynthetic proteins

Domain/gene fusion
  [Diagram: genes A and B in Genome 1 and in Genome 2, separate in one genome and fused in the other]
  Two separate genes in one genome are fused into a single gene in another genome
  Most probably they are involved in the same function
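The phylogenetic-profile (co-occurrence) idea above can be sketched as follows: record, for each gene, the genomes in which a homolog is found, then group genes whose presence/absence patterns are identical. This is a minimal sketch; the genome labels, gene names, and presence data are made up, and real analyses (e.g., Pellegrini et al.) typically also tolerate near-identical profiles.

```python
from collections import defaultdict


def group_by_profile(presence, genomes):
    """Group genes that share an identical presence/absence profile.

    presence: dict {gene: set of genomes where a homolog is found}
    genomes:  ordered list of genome labels defining the profile positions
    Returns {profile tuple: sorted gene list} for profiles shared by 2+ genes.
    """
    groups = defaultdict(list)
    for gene, found_in in presence.items():
        profile = tuple(genome in found_in for genome in genomes)
        groups[profile].append(gene)
    return {p: sorted(genes) for p, genes in groups.items() if len(genes) > 1}


# Hypothetical data: geneA and geneB co-occur in exactly the same genomes.
genomes = ["G1", "G2", "G3", "G4"]
presence = {
    "geneA": {"G1", "G3"},
    "geneB": {"G1", "G3"},
    "geneC": {"G1", "G2", "G3", "G4"},
    "geneD": {"G2"},
}
print(group_by_profile(presence, genomes))
```

Genes that land in the same group are candidates for functional association, in the same way that ribosomal or flagellar proteins share profiles on the slide.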
Fusion proteins in human genome
  Domain Fusion Database: http://calcium.uhnres.utoronto.ca/pi/

Pathway database
  KEGG database: http://www.genome.ad.jp
  [Figure legend: green = E. coli genes]
Clusters of chemical compounds on pathways (Hattori, Okuno, Goto, Kanehisa 2003)
  Compounds in pathways are compared and clustered
  Sub-pathways with similar compounds sometimes correspond to operons of enzymes

Functional network of Escherichia coli
  89 complete genomes
  Functional interactions can be predicted from:
    Conserved gene order
    Gene fusion events
    Common phylogenetic pattern
Summary: Comparative Genomics
  Conservation of gene order, phylogenetic patterns, and gene fusion events detected by comparative genomics analyses imply functional association.
  Detection of orthologous genes (bidirectional best hits) is the basis of all of the above analyses.
  Combinations with pathway/compound associations broaden inferences for metabolic pathway analysis.
  Gene/gene family comparisons across phylogeny indicate functional diversification (not discussed; see Dr. Mason later re: globin evolution).

References
  Comparative analysis of the genomes of the bacteria Mycoplasma pneumoniae and Mycoplasma genitalium. Himmelreich R et al. Nucleic Acids Research 25: 701-712 (1997)
  Comparative genomics, minimal gene-sets, and the last universal common ancestor. Koonin EV. Nature Reviews Microbiology 1: 128-136 (2003)
  Evolution of prokaryotic gene order: genome rearrangements in closely related species. Suyama M & Bork P. Trends in Genetics 17: 10-13 (2001)
  Assigning protein functions by comparative genome analysis: Protein phylogenetic profiles. Pellegrini M, ..., Eisenberg D, Yeates TO. Proc. Natl. Acad. Sci. USA 96: 4285-4288 (1999)
  Development of a chemical structure comparison method for integrated analysis of chemical and genomic information in the metabolic pathways. Hattori M et al. J. Am. Chem. Soc. 125: 11853-11865 (2003)
  The identification of functional modules from the genomic association of genes. Snel B, Bork P, Huynen MA. Proc. Natl. Acad. Sci. USA 99: 5890-5895 (2002)
  Genome evolution reveals biochemical networks and functional modules. von Mering C et al. Proc. Natl. Acad. Sci. USA 100: 15428-15433 (2003)