Comparative Genomics Background and Strategies. Nitya Sharma, Emily Rogers, Kanika Arora, Zhiming Zhao, Yun Gyeong Lee


Introduction

Why comparative genomes? http://www.ensembl.org/info/about/species.html http://www.ncbi.nlm.nih.gov/sites/entrez?db=genome http://genome.ucsc.edu/cgi-bin/hggateway?org=human&db=hg18&hgsid=124431327

Why comparative genomes? Genome information Pan genome Core genome Pathogenome Genome evolution Carriage strain vs virulent strain

Genome Structure Small scale: nucleotide Large scale: Gene Synteny: physical co-localization of genetic loci on the same chromosome within an individual or species. Chromosomes (unichromosome; multichromosome)

Genome Evolution Local events: point mutations, small insertions and deletions Large scale events: Gene content: indel Gene order: translocation, transposition Gene orientation: inversion Gene number: duplication Chromosome fusion and fission

Large scale genome evolution http://www.daimi.au.dk/~cstorm/courses/aibs_e07/slides/genomealignment.pdf

Signed permutation model (genome evolution) Savva, 2003

Main Pipeline: Protein/DNA sequences from gene prediction -> COG, HGT, Synteny, Phylogenies, Virulence -> functional annotation, evolutionary history, candidate genes/regions for further investigation of pathogenicity

Clusters of Orthologous Groups of Proteins (COGs)


Orthologs vs Paralogs
- Homolog: a gene related to a second gene by descent from a common ancestral DNA sequence.
- Orthologs are genes in different species that evolved from a common ancestral gene by speciation. Orthologs typically occupy the same functional niche in different species.
- Paralogs are genes that evolved by duplication within a genome. Paralogs tend to evolve towards functional diversification.

Clusters of Orthologous Groups of Proteins. Represents an attempt at a phylogenetic classification of the proteins encoded in complete genomes. Each COG includes proteins that are connected through vertical evolutionary descent. Serves as a platform for: functional annotation of newly sequenced genomes; studies of genome evolution.

Clusters of Orthologous Groups of Proteins Database COGs were delineated by comparing protein sequences encoded in complete genomes, representing major phylogenetic lineages. Each COG consists of individual proteins or groups of paralogs from at least 3 lineages and thus corresponds to an ancient conserved domain. COG database

Construction of COGs
1. All-against-all sequence comparison of the proteins encoded in complete genomes
2. Detect and collapse obvious paralogs
3. Detect triangles of mutually consistent genome-specific best hits (BeTs)
4. Merge triangles with a common side to form COGs
5. Identify multidomain proteins, separate the domains and assign them to different COGs
6. Examine large COGs using phylogenetic trees and split them into two or more smaller groups
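The triangle detection and merging steps can be sketched in a few lines of Python. This is a minimal illustration, not the COG database's actual procedure: it assumes best hits are already computed and stored in a hypothetical `best_hit` dictionary, it treats a BeT as a reciprocal best hit (a simplification), and the protein names (`"genome|protein"`) are invented toy data.

```python
from itertools import combinations

def genome_of(protein):
    # Toy convention (assumption): protein IDs look like "genome|name".
    return protein.split("|")[0]

def is_bet_pair(a, b, best_hit):
    # Simplification: a and b count as BeTs only if each is the other's best hit.
    return (best_hit.get(a, {}).get(genome_of(b)) == b
            and best_hit.get(b, {}).get(genome_of(a)) == a)

def find_triangles(best_hit):
    # A triangle is three proteins from three different genomes, pairwise BeTs.
    triangles = []
    for trio in combinations(sorted(best_hit), 3):
        if len({genome_of(p) for p in trio}) < 3:
            continue
        if all(is_bet_pair(x, y, best_hit) for x, y in combinations(trio, 2)):
            triangles.append(frozenset(trio))
    return triangles

def merge_triangles(triangles):
    # Union-find over triangles: two triangles sharing a side (2 proteins) are joined.
    parent = list(range(len(triangles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(triangles)), 2):
        if len(triangles[i] & triangles[j]) >= 2:
            parent[find(i)] = find(j)
    clusters = {}
    for i, tri in enumerate(triangles):
        clusters.setdefault(find(i), set()).update(tri)
    return list(clusters.values())

# Invented toy input: one family present in four genomes, one in only two.
best_hit = {
    "A|p1": {"B": "B|p1", "C": "C|p1", "D": "D|p1"},
    "B|p1": {"A": "A|p1", "C": "C|p1", "D": "D|p1"},
    "C|p1": {"A": "A|p1", "B": "B|p1", "D": "D|p1"},
    "D|p1": {"A": "A|p1", "B": "B|p1", "C": "C|p1"},
    "A|p2": {"B": "B|p2"},
    "B|p2": {"A": "A|p2"},
}
cogs = merge_triangles(find_triangles(best_hit))
print(cogs)
```

The two-genome family forms no triangle, so it never becomes a cluster, which mirrors the requirement that a COG spans at least three lineages.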

Goal: To look for differential distribution of COGs in different strains of Neisseria meningitidis and use these data to determine the phylogeny. Approach: Create a comprehensive list of COGs for Neisseria gonorrhoeae (FA 1090) and for different strains of Neisseria meningitidis, and create a presence/absence matrix of COGs for each strain. N. meningitidis strains to be used: Z2491*, MC58*, FAM18, α14, α153, α275 and our strain. * The list of COGs for these strains is present in the COG database.

Protein sequences from a strain -> BLAST against COG database -> list of COGs -> comprehensive list of COGs -> presence/absence matrix -> phylogenetic tree
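As a rough sketch of the presence/absence step, the per-strain COG lists can be turned into a binary matrix, and a simple distance between strains can then feed a tree-building method. The strain names and COG identifiers below are invented placeholders, and the Hamming distance is only one possible choice of distance.

```python
def presence_absence(cogs_by_strain):
    # Columns are the comprehensive (union) list of COGs, in sorted order.
    all_cogs = sorted(set().union(*cogs_by_strain.values()))
    matrix = {
        strain: [1 if cog in cogs else 0 for cog in all_cogs]
        for strain, cogs in cogs_by_strain.items()
    }
    return all_cogs, matrix

def hamming(u, v):
    # Number of COGs present in one strain but absent in the other.
    return sum(a != b for a, b in zip(u, v))

# Invented toy data (not real COG assignments).
cogs_by_strain = {
    "strain_1": {"COG_a", "COG_b", "COG_c"},
    "strain_2": {"COG_a", "COG_c"},
    "strain_3": {"COG_b", "COG_d"},
}
all_cogs, matrix = presence_absence(cogs_by_strain)
for strain, row in matrix.items():
    print(strain, row)
```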

Searching for Horizontal Gene Transfer Events Emily Rogers


What are horizontal gene transfers? Horizontal gene transfers are events where an organism acquires genetic material from another organism that is not its ancestor. HGT events are believed to be a major phenomenon among prokaryotes, and are common among unicellular eukaryotes. Thus, we should expect Neisseria meningitidis to exhibit signs of horizontal gene transfer.

Why do we care about HGTs? HGTs are important because they can mess up your phylogenies, since the history of a laterally acquired gene is not the history of the organism. Also, in our investigation of virulence, we would like to investigate the origin of pathogenicity, i.e. whether any virulence gene came from another similarly pathogenic organism. Horizontal (or lateral) gene transfer is a known mechanism for the acquisition of blocks of virulence genes known as pathogenicity islands (PAIs); HGT is what allows quantum leaps in the evolution of a bacterium that can drastically alter its phenotype.

A tree illustrating HGTs

Illustration of HGT vs inheritance http://www.nsf.gov/news/special_reports/fibr/step.jsp

How can we detect HGT events? As mentioned earlier, methods can be either intrinsic (using information embodied in the gene of interest alone) or extrinsic (relying on outside knowledge); these are known as signature methods and phylogenetic methods, respectively. We will be using both to uncover HGT information. We will combine programs that predict potential HGTs with comparisons to databases of HGTs predicted in other Neisseria meningitidis strains.

Programs We found three programs available on the command line that use differing methods to predict HGTs. These different methods complement each other, giving us a breadth of predicted HGTs and also a level of confidence wherever they agree. Available methods for identifying horizontal transfer generally rely on finding anomalies in either nucleotide composition or phylogenetic relationships with orthologous proteins. The three we found and will be using are UCSD's DarkHorse, EMBL's alien_hunter, and CodonW.

DarkHorse

DarkHorse DarkHorse works by selecting potential ortholog matches from a reference amino acid database. It then uses these matches to calculate what it calls a lineage probability index (LPI) score. LPI scores are inversely proportional to the phylogenetic distance between database match sequences and the query genome. Candidates having low LPI scores are likely to have been horizontally transferred, since they are not highly conserved among closely related organisms.

alien_hunter

alien_hunter alien_hunter is another program that searches for HGTs. It uses Interpolated Variable Order Motifs (IVOMs), a novel computational method introduced by the authors: "An IVOM approach exploits compositional biases using variable order motif distributions and captures more reliably the local composition of a sequence compared to fixed-order methods."

CodonW

Codon usage bias and CodonW Although the genetic code is redundant, often with more than one three-letter codon specifying the same amino acid, most genes do not use all synonymous codons equally. The literature has shown that more highly expressed genes tend to have optimized their translational efficiency, such that they prefer certain codons for a given amino acid. CodonW analyses sequences and reports statistics of their codon usage bias. This is handy for getting a feel for the general codon bias, and for detecting unusual deviations from it that may indicate HGTs. CodonW also calculates G+C content, which may be another indicator of abnormal gene lineage and is linked with a particular genome's codon usage bias.
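To make the idea of compositional deviation concrete, here is a small Python sketch that counts codons and flags genes whose G+C content is far from the average. It is not CodonW and does not reproduce its statistics; the gene sequences, the gene names and the `threshold` value are invented for illustration.

```python
from collections import Counter

def gc_content(seq):
    # Fraction of G and C bases in the sequence.
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def codon_counts(seq):
    # Count in-frame codons, ignoring any trailing partial codon.
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return Counter(seq[i:i + 3] for i in range(0, usable, 3))

def flag_atypical_gc(genes, threshold):
    # Flag genes whose G+C content differs from the mean by more than threshold.
    mean_gc = sum(gc_content(s) for s in genes.values()) / len(genes)
    return [name for name, s in genes.items()
            if abs(gc_content(s) - mean_gc) > threshold]

# Invented toy genes; real genes are far longer and the mean would be genome-wide.
genes = {
    "g1": "ATGGCCGCGTAA",
    "g2": "ATGGCGGCCTGA",
    "g3": "ATGAAATTTTAA",
}
print(codon_counts(genes["g1"]))
print(flag_atypical_gc(genes, threshold=0.25))
```

In practice the cutoff would be chosen from the genome-wide distribution rather than fixed by hand, and atypical composition alone is only a hint of HGT, which is why the predictions are combined with the other programs.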

Databases Once we have the three programs' predictions, we can compare them with databases of predicted HGTs in other Neisseria meningitidis strains. DarkHorse's DB contains pre-computed predictions for N. meningitidis 053442, FAM18 and MC58 using its LPI index. IBM's Bioinformatics and Pattern Discovery Group's HGT-DB contains predictions for strains MC58 and Z2491. A codon usage program called CAICal has a database containing strains FAM18, MC58 and Z2491, based on unusual codon usage. These putative HGT genes can be reciprocally BLASTed against our set of predictions to see if our genes have any match in other strains, and if other strains have any predictions we missed.

Proposed HGT pipeline (diagram): Genes -> DarkHorse -> candidate HGT among diff. phyl. granularities -> compare HGT across granularities; Genes -> alien_hunter -> HGT candidates; Genes -> CodonW -> codon usage stats -> genes with atypical codon/GC usage; compare predictions, compare with HGT DBs -> list of HGTs and support -> Virulence (Nitya), Phylogenies (Yun)

Genome Alignment and Visualization


Large scale genome evolution http://www.daimi.au.dk/~cstorm/courses/aibs_e07/slides/genomealignment.pdf

How to align genomes? http://www.daimi.au.dk/~cstorm/courses/aibs_e07/slides/genomealignment.pdf

Genome Alignment Computation: time and space Genome large scale evolution: rearrangement, inversion

Tools for Genome Alignment and Visualization Jayaraj, 2005

Genome Alignment Pairwise: MUMmer (Maximal Unique Match), 1999, Steven Salzberg's group (also Glimmer). Multiple: MAUVE (Multiple Alignment of Conserved Genomic Sequence with Rearrangements), 2004.

MUMmer Maximal Unique Matcher (MUM). Match: exact match of a minimum length. Maximal: cannot be extended in either direction without a mismatch. Unique: occurs only once in both sequences.
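The three conditions (minimum length, maximal, unique) can be checked directly with a naive Python search. This is only a sketch for short toy strings: it is quadratic or worse, it looks at the forward strand only, and it is not how MUMmer itself finds matches (MUMmer relies on suffix-tree style indexing to stay fast on whole genomes). The function names and sequences are invented.

```python
def count_occurrences(text, pattern):
    # Count (possibly overlapping) occurrences of pattern in text.
    count = start = 0
    while True:
        idx = text.find(pattern, start)
        if idx == -1:
            return count
        count += 1
        start = idx + 1

def find_mums(ref, qry, min_len):
    # Returns (ref_start, qry_start, length) for each maximal unique match.
    mums = []
    for i in range(len(ref)):
        for j in range(len(qry)):
            if ref[i] != qry[j]:
                continue
            # Skip if the match could be extended to the left (not maximal).
            if i > 0 and j > 0 and ref[i - 1] == qry[j - 1]:
                continue
            # Extend to the right until a mismatch or a sequence end.
            k = 0
            while i + k < len(ref) and j + k < len(qry) and ref[i + k] == qry[j + k]:
                k += 1
            match = ref[i:i + k]
            if (k >= min_len
                    and count_occurrences(ref, match) == 1
                    and count_occurrences(qry, match) == 1):
                mums.append((i, j, k))
    return mums

# Invented toy sequences.
print(find_mums("GATTACAGG", "CCATTACATT", min_len=3))
```

In this toy pair, "ATTACA" is the only match that passes all three conditions; "ATT" also matches at a second query position but occurs twice in the query, so it is not unique.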

MUMmer: MUM, MAM, MEM. MUM: maximal unique match. MAM: maximal almost-unique match. MEM: maximal exact match. (Figure: reference vs. query.) http://www.cbcb.umd.edu/~mschatz/assemblyclass/06.%20whole%20genome%20alignment.pdf

http://www.daimi.au.dk/~cstorm/courses/aibs_e07/slides/genomealignment.pdf

Output: 2D plot of sequence A against sequence B, showing translocation, inversion and insertion. http://mummer.sourceforge.net/manual/alignmenttypes.pdf http://www.cbcb.umd.edu/~mschatz/assemblyclass/06.%20whole%20genome%20alignment.pdf

MUMmer - VISTA Reference genome: Neisseria meningitidis Z2491. 1- Neisseria meningitidis MC58. 2- Neisseria gonorrhoeae FA1090.

MAUVE Multiple Alignment of Conserved Genomic Sequence with Rearrangements. LCB: locally collinear blocks (many anchors). Genomic distance: based on the gene order (or LCB order), computed with GRIMM; can be used to infer genomic phylogeny.

http://www.daimi.au.dk/~cstorm/courses/aibs_e07/slides/genomealignment.pdf

MAUVE - GRIMM: signed permutation, genomic distance, genomic phylogeny. Reversal distance example (each row is obtained from the previous one by one reversal):
1 2 3 4 5 6 7 8 9 10
1 2 3 8 7 6 5 4 9 10
1 8 3 2 7 6 5 4 9 10
1 8 3 7 2 6 5 4 9 10
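The reversal example above can be replayed in Python. This sketch uses unsigned permutations, exactly as the numbers are printed on the slide (a signed model would also flip the sign of every element inside the reversed segment), and the helper names are our own. The breakpoint count gives a quick lower bound on the unsigned reversal distance, since one reversal can remove at most two breakpoints.

```python
import math

def reverse_segment(perm, i, j):
    # Reverse the elements at 0-based positions i..j (inclusive).
    return perm[:i] + perm[i:j + 1][::-1] + perm[j + 1:]

def breakpoints(perm):
    # Adjacent pairs that are not consecutive numbers, after framing with 0 and n+1.
    framed = [0] + list(perm) + [len(perm) + 1]
    return sum(abs(a - b) != 1 for a, b in zip(framed, framed[1:]))

identity = list(range(1, 11))
step1 = reverse_segment(identity, 3, 7)   # reverse 4..8
step2 = reverse_segment(step1, 1, 3)      # reverse 2 3 8
step3 = reverse_segment(step2, 3, 4)      # reverse 2 7
for row in (identity, step1, step2, step3):
    print(row, "breakpoints:", breakpoints(row))

# Lower bound on the number of reversals needed to sort the last row.
print("lower bound:", math.ceil(breakpoints(step3) / 2))
```

For the last row the bound equals the three reversals actually used, so in this toy case three is the unsigned reversal distance.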

Reversal distance (rearrangement distance) Software: MGR, GRAPPA, GRIMM web server. Bourque and Pevzner, 2002

Pipeline MUMmer Sequences VISTA Synteny Virulence MAUVE


Phylogenetic tree Purpose: To summarize the key aspects of a reconstructed evolutionary history by providing a simple representation.

Maximum parsimony based on 23 proteins; Brown et al. 2001

Main Goals 1. Find out the evolution of Neisseria meningitidis 2. Discover the relatedness between Neisseria meningitidis strains

Main questions before starting analysis 1. Which data to use? 2. Which method to use? 3. Which tests to perform to assess the robustness of the prediction of particular tree features? 4. What is the state of the art in phylogenetic analysis tools for this type of data?

1. Which data to use? 1) 16S rRNA What is 16S rRNA? - 16S rRNA is a 1542 nt long component of the small prokaryotic ribosomal subunit Why 16S rRNA? - Derived from a common ancestor - It's a highly conserved region in all prokaryotes

1. Which data to use? 2) COGs binary result - HGT result: result from COGs, result from HGT, result from COGs-HGT Why? COGs = Clusters of Orthologous Groups of proteins, HGT = Horizontal Gene Transfer

What data to use? MLST(Multi Locus Sequence Typing) A nucleotide sequence based approach for the unambiguous characterisation of isolates of bacteria and other organisms via the internet. To provide a portable, accurate, and highly discriminating typing system Helpful for the typing of bacterial pathogens

Methods of phylogenetic reconstruction
Distance based: Pairwise evolutionary distances computed for all taxa. Tree constructed using an algorithm based on relationships between distances. Algorithmic: UPGMA, Neighbor-joining. Optimality criteria: Least Squares, Minimum Evolution. Output: one tree.
Maximum parsimony: Nucleotides or amino acids are considered as character states. Best phylogeny is chosen as the one that minimizes the number of changes between character states. Output: a set of trees.
Maximum likelihood: Statistical method of phylogeny reconstruction. Explicit model for how the data set was generated (nucleotide or amino acid substitution). Find the topology that maximizes the probability of the data given the model and the parameter values (estimated from data). Output: a set of trees.

UPGMA (unweighted pair group method with arithmetic mean)
- Simplest method; uses a sequential clustering algorithm
- Results in ultrametric trees: equal distances from root to all tips
- Relies on the overly strict assumption of rate constancy among lineages, but is conceptually important
Neighbor-joining
- Star decomposition: identification of neighbors that sequentially minimize the total length of the tree
- Extremely fast and efficient method
- Tends to perform fairly well in simulation studies
- Greedy algorithm, so it can get stuck in local optima
- Produces only one tree and does not give any idea of how many other trees are equally or almost as well supported by the data
- Can be used to find a starting tree that other methods (e.g. minimum evolution) will evaluate to find the best tree
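The sequential clustering behind UPGMA fits in a short Python sketch. This is a teaching version under stated assumptions, not what MEGA or any other package runs: the distance matrix is an invented ultrametric toy example, trees are returned as nested tuples, and internal nodes get made-up labels (`node0`, `node1`, ...), so taxon names must not collide with those.

```python
from itertools import combinations

def upgma(names, dist):
    # Each cluster id maps to (subtree, number of leaves, node height).
    clusters = {n: (n, 1, 0.0) for n in names}
    d = {frozenset(pair): value for pair, value in dist.items()}
    next_id = 0
    while len(clusters) > 1:
        # Join the two clusters with the smallest distance.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        (tree_a, n_a, _), (tree_b, n_b, _) = clusters.pop(a), clusters.pop(b)
        height = d[frozenset((a, b))] / 2
        new = f"node{next_id}"
        next_id += 1
        # Distance to the new cluster is the size-weighted average of the old ones.
        for c in clusters:
            d[frozenset((new, c))] = (
                n_a * d[frozenset((a, c))] + n_b * d[frozenset((b, c))]
            ) / (n_a + n_b)
        clusters[new] = ((tree_a, tree_b), n_a + n_b, height)
    tree, _, height = next(iter(clusters.values()))
    return tree, height

# Invented ultrametric toy distances.
names = ["A", "B", "C", "D"]
dist = {
    ("A", "B"): 2, ("A", "C"): 4, ("B", "C"): 4,
    ("A", "D"): 6, ("B", "D"): 6, ("C", "D"): 6,
}
tree, root_height = upgma(names, dist)
print(tree, root_height)
```

Because the toy distances are ultrametric, UPGMA recovers the tree exactly; with unequal rates among lineages the same procedure can return a wrong topology, which is the rate-constancy caveat above.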

Maximum parsimony method
- The best tree is chosen as the one that requires the smallest number of changes between characters
- Based on a logically coherent and biologically plausible model of evolution
- Useful for certain types of molecular data, e.g. insertions and deletions
- Provides several ways to evaluate the support for the topologies produced
- Gives incorrect topologies when backward substitutions are present (common with nucleotides), when the number of sites is fairly small, or when the rate of substitution varies substantially across lineages
- Long branch attraction: long branches (and short branches) tend to group together on the reconstructed tree
- Difficult to treat the results in a statistical framework
Maximum likelihood
- Statistically very well defined
- Extremely slow method (computationally expensive)
- Method estimates branch lengths not topology so may give wrong topology
- Based on explicit models of evolution
- Uses all sequence information (characters)
- Requires expert user input for model and parameter selection

3. Which tests to perform to assess the robustness of the prediction of particular tree features? How confident are we of this tree? Do a bootstrap. What is bootstrap sampling? Bootstrap is sampling with replacement from a sample, i.e. sampling within a sample. The name may come from the phrase "pull yourself up by your own bootstraps", which means "rely on your own resources". What are the assumptions of the bootstrap? Your sample is a valid representative of the population. Each resampling is independent and identically distributed (i.i.d.): the resamples are assumed to come from the same distribution as the population, but each is drawn independently of the others.
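For sequence data, the resampling unit is usually the alignment column. The Python sketch below resamples columns with replacement and reports how often a grouping is recovered. To stay self-contained it uses a deliberately crude stand-in for tree inference (the pair of sequences with the smallest Hamming distance); in a real analysis each pseudo-alignment would be passed to the chosen tree-building method instead. The alignment and all function names are invented.

```python
import random
from itertools import combinations

def bootstrap_columns(alignment, rng):
    # Draw as many columns as the alignment has, with replacement.
    n_sites = len(next(iter(alignment.values())))
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}

def closest_pair(alignment):
    # Stand-in for tree inference: the two most similar sequences.
    # Ties go to the first pair in sorted order (a known simplification).
    def hamming(a, b):
        return sum(x != y for x, y in zip(alignment[a], alignment[b]))
    return min(combinations(sorted(alignment), 2), key=lambda p: hamming(*p))

def bootstrap_support(alignment, pair, n_replicates, seed=0):
    # Percentage of pseudo-alignments in which `pair` is recovered.
    rng = random.Random(seed)
    hits = sum(
        closest_pair(bootstrap_columns(alignment, rng)) == pair
        for _ in range(n_replicates)
    )
    return 100 * hits / n_replicates

# Invented toy alignment: A and B differ at one site, C differs from both everywhere.
alignment = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "TGCATGCC"}
print(bootstrap_support(alignment, ("A", "B"), n_replicates=50))
```

In this toy alignment the A-B grouping is recovered in every replicate because C differs from both at all sites; real data give intermediate values, which is what makes the bootstrap value informative.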

Bootstrap example (figure): sample data -> re-sampling -> pseudosample data (n replicates) -> bootstrap trees -> bootstrap value on the inferred tree, e.g. 63 (D. Graur and W. Li, 2000)

4. Which tool is the state of the art in phylogenetic analysis? SplitsTree4

Software SplitsTree4 Details Compute evolutionary networks from molecular sequence data (alignment of sequences, a distance matrix or a set of trees) Integrates a wide range of phylogenetic network and phylogenetic tree methods Compute a phylogenetic tree or network using many methods such as split decomposition, neighbor-net, consensus network, super networks methods or methods for computing hybridization or simple recombination networks. Why SplitsTree? Phylogenetic networks are more useful for reticulate events than phylogenetic trees.

Software SplitsTree4

Software MEGA 4.0 Feature Input Data :DNA, Protein, Pairwise distance matrix Sequence Alignment Construction Tree-making Methods Distance Matrix Viewer Tree Explorers

Pipeline 7 loci seq. (MLST Database) MEGA4 SplitsTree4

Virulence


N. meningitidis Gram-negative Pangenome is open Colonizes the nasopharynx and can enter the bloodstream (bypassing the epithelial barrier) Septicaemia Meningitis via BBB crossing Accidental pathogen Non disease causing isolates (carriage) in about 10% of healthy population

Pathogenicity vs. Virulence Bacterial pathogen: any bacterium that has the capacity to cause disease; the ability to cause disease is called pathogenicity. Virulence: provides a quantitative measure of pathogenicity or the likelihood of causing disease. Virulence factors: properties (i.e. gene products) that enable a microorganism to establish itself on or within a host and enhance its potential to cause disease. Pathogenicity islands: large genomic regions that encode various virulence factors.

Polysaccharide Capsule Defining characteristic for serogroup classification A C, W-135, Y Most characterized virulence factor Involved in evading immune defense against complement-mediated lysis and opsonophagocytosis Necessary but NOT SUFFICIENT!

Virulence Factors Adherence: genes able to mediate adhesion to host nasopharynx epithelium. Immune evasion: mediates resistance to both phagocytosis and complement-mediated killing by expression of capsule. Invasion: enzymes that mediate movement across epithelium. Iron uptake: systems that mediate iron uptake from the host and contribute to virulence. Protease: genes that encode proteins that cleave antibodies to evade the immune system response. Toxin: modifies or disrupts essential functions of eukaryotic cells; major toxin LOS.

Virulence Factor DB - Virulence Factors divided by category, with lists of corresponding genes - Comparative pathogenomics of disease causing strains

Pathogenicity Islands Criteria Subclass of genomic islands (GI) defined by the following criteria: 1) Encode virulence factors 2) Present in pathogenic strains, absent in non-pathogenic strains of one species or a related species 3) Different G+C content and codon usage (remember HGT) 4) Large genomic regions 5) Flanked by insertion sequence (IS) elements and/or direct repeats and/or tRNA genes at their boundaries (sites of recombination) 6) Unstable

Pathogenicity Islands Neisseria meningitidis MC58: IHT-A: genes of the serogroup B capsulation cluster and an adenine rRNA methylase. IHT-C: three toxin/toxin-related homologs; a protein known to be immunogenic; one intact and three fragmented proteins previously associated with bacteriophage. Neisseria meningitidis Z2491: no known PAIs. cPAI: candidate PAI (PAI-like region overlapping genomic islands) homologous to IHT-A.

PAIs Cannot determine virulence by the presence or absence of specific genes Loses its utility in investigating virulence in our context Found PAIs in N.meningitidis, but did not investigate carriage vs. virulent strains More later background research

What have we learned about virulence and pathogenicity from past research?

Schoen et al. 2008

Comparative Genomics 2008

Majority of candidate virulence genes are found in the core genome (shared by all), and are not virulent strain-specific Not just due to presence or absence of certain genes So, what is causing differences in virulence?

What have we learned about virulence from past research? What can we do differently?

Candidate causes of virulence variability Chromosomal rearrangements Affect expression breadth Insertion Sequences Small genetic differences in genes from core genome or between genomes of carriage and disease strains May influence pathogenic potential SNPs

Goal and Approach Goal: Use more fine tuned methods to compare carriage versus disease strains in N. meningitidis Approach 1: Determine whether the IS profile distribution discriminates carriage strains from virulent strains Approach 2: Whole genome association mapping (WGAM) in disease vs. carriage strains

Insertion Sequences Short DNA sequences (generally under 2.5 kb) whose function is exclusively involved in mobility. Can cause mutations as a result of their translocation. Many IS elements can enhance expression of neighboring genes if inserted (Mahillon and Chandler 1998). Associated with bacterial pathogenesis and virulence. Most have short terminal repeat sequences. Composite transposon: two copies of certain ISs flanking a DNA segment, causing mobility of the whole region. Upon insertion, most generate short directly repeated sequences (DRs) of the target DNA.

Insertion Sequences The phenotype of the recipient bacterium can be changed if the IS is inserted into a structural gene, or if the insertion in front of a gene affects the expression of a downstream gene(s). IS elements mediate deletions, duplications, inversions and cointegrate formation, contributing to changes in the bacterial genome.

IS element structure

IS Family Classification Based on: similarities in genetic organization; relatedness of transposases; similar features of ends (terminal IRs); fate of the nucleotide sequence of their target sites. Families of interest: IS110, IS3, IS30, IS5, ISNCY (Schoen et al. 2008). IS1655 (IS30 family) is specific to N. meningitidis.

IS info from gene prediction R IS info from all available strains Distribution by IS family Genome BLAST to VFDB Significantly different IS family in carriage vs. disease strains Positional info Genes around IS sequences within family HGT Flanking genes, interrupted genes, neighboring genes associated with IS interference, and Virulence genes Synteny

SNPs as markers for WGAM Haplotype: a set of SNPs that are statistically associated (haplotype block). Use whole genome sequences of disease vs. carriage strains and look for increased variability in local haplotype structure. If there is increased variation in virulent strains as compared to carriage strains, then such variation can be considered to be associated with virulence. Identify regions of high variability in virulent vs. carriage strains. These regions can be used as pointers to direct further study of genes within and/or around the haplotype block.

Conclusion

Main Pipeline: Protein/DNA sequences from gene prediction -> COG, HGT, Synteny, Phylogenies, Virulence -> functional annotation, evolutionary history, candidate genes/regions for further investigation of pathogenicity