Gene Ontology and Functional Enrichment. Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein

Similar documents
Lecture 3: A basic statistical concept

Clustering. Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein. Some slides adapted from Jacques van Helden

Gene Ontology. Shifra Ben-Dor. Weizmann Institute of Science

Clustering. Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein. Some slides adapted from Jacques van Helden

Clustering. Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein. Some slides adapted from Jacques van Helden

GENE ONTOLOGY (GO) Wilver Martínez Martínez Giovanny Silva Rincón

BMD645. Integration of Omics

Gene Ontology and overrepresentation analysis

Cluster Analysis of Gene Expression Microarray Data. BIOL 495S/ CS 490B/ MATH 490B/ STAT 490B Introduction to Bioinformatics April 8, 2002

Dr. Amira A. AL-Hosary

Phylogenetic Tree Reconstruction

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree)

Using Phylogenomics to Predict Novel Fungal Pathogenicity Genes

Comparative Network Analysis

CSCE555 Bioinformatics. Protein Function Annotation

2 GENE FUNCTIONAL SIMILARITY. 2.1 Semantic values of GO terms

Analysis and visualization of protein-protein interactions. Olga Vitek Assistant Professor Statistics and Computer Science

Evolutionary Tree Analysis. Overview

Computational approaches for functional genomics

BLAST. Varieties of BLAST

Inferring Transcriptional Regulatory Networks from Gene Expression Data II

Bioinformatics. Dept. of Computational Biology & Bioinformatics

Comparative RNA-seq analysis of transcriptome dynamics during petal development in Rosa chinensis

CATEGORY a TERM COUNT b P VALUE GENE FAMILIES: REPRESENTATIVE GENE SYMBOLS c. Annotation Cluster 1 Enrichment Score: 1.

Biological Networks: Comparison, Conservation, and Evolution via Relative Description Length By: Tamir Tuller & Benny Chor

Genome Annotation. Bioinformatics and Computational Biology. Genome sequencing Assembly. Gene prediction. Protein targeting.

BINF6201/8201. Molecular phylogenetic methods

Introduction to Bioinformatics

Written Exam 15 December Course name: Introduction to Systems Biology Course no

Computational methods for predicting protein-protein interactions

Clustering and Network

Phylogenetic Trees. What They Are Why We Do It & How To Do It. Presented by Amy Harris Dr Brad Morantz

Bioinformatics: Network Analysis

C3020 Molecular Evolution. Exercises #3: Phylogenetics

BLAST Database Searching. BME 110: CompBio Tools Todd Lowe April 8, 2010

Elements of Bioinformatics 14F01 TP5 -Phylogenetic analysis

Phylogenetics: Building Phylogenetic Trees

Phylogenetic inference

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University

USING BLAST TO IDENTIFY PROTEINS THAT ARE EVOLUTIONARILY RELATED ACROSS SPECIES

Supplementary Figure 1 The number of differentially expressed genes for uniparental males (green), uniparental females (yellow), biparental males

Model Accuracy Measures

Introduction to Bioinformatics. Shifra Ben-Dor Irit Orr

Supplementary Figure 1: To test the role of mir-17~92 in orthologous genetic model of ADPKD, we generated Ksp/Cre;Pkd1 F/F (Pkd1-KO) and Ksp/Cre;Pkd1

Bioinformatics 2. Yeast two hybrid. Proteomics. Proteomics

Effects of Gap Open and Gap Extension Penalties

Computational Genomics. Systems biology. Putting it together: Data integration using graphical models

AP Biology. Metabolism & Enzymes

Genomics and bioinformatics summary. Finding genes -- computer searches

Integration of functional genomics data

Understanding Science Through the Lens of Computation. Richard M. Karp Nov. 3, 2007

Phylogenies Scores for Exhaustive Maximum Likelihood and Parsimony Scores Searches

Comparative Genomics II

Multiple Sequence Alignment, Gunnar Klau, December 9, 2005, 17:

Bioinformatics tools for phylogeny and visualization. Yanbin Yin

Phylogenetic Trees. Phylogenetic Trees Five. Phylogeny: Inference Tool. Phylogeny Terminology. Picture of Last Quagga. Importance of Phylogeny 5.

Predicting Protein Functions and Domain Interactions from Protein Interactions

Sample Size Estimation for Studies of High-Dimensional Data

JMJ14-HA. Col. Col. jmj14-1. jmj14-1 JMJ14ΔFYR-HA. Methylene Blue. Methylene Blue

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

Supplementary Information

Graph Alignment and Biological Networks

Quantifying sequence similarity

Lecture Notes for Fall Network Modeling. Ernest Fraenkel

Grundlagen der Bioinformatik Summer semester Lecturer: Prof. Daniel Huson

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

Bayesian Hierarchical Classification. Seminar on Predicting Structured Data Jukka Kohonen

Computational methods for the analysis of bacterial gene regulation Brouwer, Rutger Wubbe Willem

S1 Gene ontology (GO) analysis of the network alignment results

Biological Networks Analysis

Ontology Alignment in the Presence of a Domain Ontology

SUPPLEMENTAL DATA - 1. This file contains: Supplemental methods. Supplemental results. Supplemental tables S1 and S2. Supplemental figures S1 to S4

Algorithms in Bioinformatics

Chapter 26: Phylogeny and the Tree of Life Phylogenies Show Evolutionary Relationships

Discovering molecular pathways from protein interaction and ge

Proteomics. Yeast two hybrid. Proteomics - PAGE techniques. Data obtained. What is it?


Supplementary Information 16

Phylogeny. November 7, 2017

networks in molecular biology Wolfgang Huber

Integrative Protein Function Transfer using Factor Graphs and Heterogeneous Data Sources

Supplementary Materials for

Multiple Sequence Alignment: HMMs and Other Approaches

Network Alignment 858L

Comparative analysis of RNA- Seq data with DESeq2

Genomic expression catalogue of a global collection of BCG vaccine strains. show evidence for highly diverged metabolic and cell-wall adaptations.

Proteomics Systems Biology

Lecture 14: Multiple Sequence Alignment (Gene Finding, Conserved Elements) Scribe: John Ekins

Correspondence of D. melanogaster and C. elegans developmental stages revealed by alternative splicing characteristics of conserved exons

Class 4: Classification. Quaid Morris February 11 th, 2011 ML4Bio

122-Biology Guide-5thPass 12/06/14. Topic 1 An overview of the topic

Hands-On Nine The PAX6 Gene and Protein

Outline. Terminologies and Ontologies. Communication and Computation. Communication. Outline. Terminologies and Vocabularies.

High-Throughput Sequencing Course. Introduction. Introduction. Multiple Testing. Biostatistics and Bioinformatics. Summer 2018

METHODS FOR DETERMINING PHYLOGENY. In Chapter 11, we discovered that classifying organisms into groups was, and still is, a difficult task.

Comparison of Human Protein-Protein Interaction Maps

Supplementary online material

What Is a Cell? What Defines a Cell? Figure 1: Transport proteins in the cell membrane

Zhongyi Xiao. Correlation. In probability theory and statistics, correlation indicates the

Transcription:

Gene Ontology and Functional Enrichment Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein

The parsimony principle: A quick review Find the tree that requires the fewest evolutionary changes! A fundamentally different method: Search rather than reconstruct Parsimony algorithm 1. Construct all possible trees 2. For each site in the alignment and for each tree count the minimal number of changes required 3. Add sites to obtain the total number of changes required for each tree 4. Pick the tree with the lowest score

A quick review cont Small vs. large parsimony Fitch s algorithm: 1. Bottom-up phase: Determine the set of possible states 2. Top-down phase: Pick a state for each internal node Searching the tree space: Exhaustive search, branch and bound Hill climbing with Nearest-Neighbor Interchange Branch confidence and bootstrap support

From sequence to function Which molecular processes/functions are involved in a certain phenotype -disease, response, development, etc. (what is the cell doing vs. what it could possibly do) Gene expression profiling

Gene expression profiling Measuring gene expression: (Northern blots and RT-qPCR) Microarray RNA-Seq Experimental conditions: Disease vs. control Across tissues Across time Across environments Many more

Different techniques, same structure conditions genes

Back in the good old days 1.Find the set of differentially expressed genes. 2.Survey the literature to obtain insights about the functions that differentially expressed genes are involved in. 3. Group together genes with similar functions. 4.Identify functional categories with many differentially expressed genes. Conclude that these functions are important in disease/condition under study

The good old days were not so good! Time-consuming Not systematic Extremely subjective No statistical validation

What do we need? A shared functional vocabulary Systematic linkage between genes and functions A way to identify genes relevant to the condition under study Statistical analysis (combining all of the above to identify cellular functions that contributed to the disease or condition under study) A way to identify related genes

What do we need? A shared functional vocabulary Gene Ontology Annotation Systematic linkage between genes and functions A way to identify genes relevant to the condition under study Statistical analysis (combining all of the above to identify cellular functions that contributed to the disease or condition under study) Fold change, Ranking, ANOVA Clustering, classification A way to identify related genes Enrichment analysis, GSEA

The Gene Ontology (GO) Project A major bioinformatics initiative with the aim of standardizing the representation of gene and gene product attributes across species and databases. Three goals: 1. Maintain and further develop itscontrolled vocabularyof gene and gene product attributes 2. Annotategenesandgene products, and assimilate and disseminate annotation data 3. Provide toolsto facilitate access to all aspects of the data provided by the Gene Ontology project

GO terms The Gene Ontology (GO) is acontrolled vocabulary, a set of standard terms(words and phrases) used for indexing and retrieving information.

Ontology structure GO also defines therelationshipsbetween the terms, making it a structured vocabulary. GO is structured as a directed acyclic graph, and each term has defined relationships to one or more other terms.

Three ontology domains: GO domains 1. Molecular function: basic activity or task e.g. catalytic activity, calcium ion binding 2. Biological process: broad objective or goal e.g. signal transduction, immune response 3. Cellular component: location or complex e.g. nucleus, mitochondrion Genes can have multiple annotations: For example, the gene product cytochrome c can be described by the molecular function term oxidoreductase activity, the biological process termsoxidative phosphorylationandinduction of cell death, and the cellular component terms mitochondrial matrix and mitochondrial inner membrane.

Go domains Biological process Molecular function Cellular component

Ontology and annotation databases eggnog Clusters of Orthologous Groups (COG) The nice thing about standards is that there are so many to choose from Andrew S. Tanenbaum

What do we need? A shared functional vocabulary Systematic linkage between genes and functions A way to identify genes relevant to the condition under study Statistical analysis (combining all of the above to identify cellular functions that contributed to the disease or condition under study) A way to identify related genes GO annotation

Picking relevant genes In most cases, we will consider differential expression as a marker: Fold change cutoff (e.g., > two fold change) Fold change rank (e.g., top 10%) Significant differential expression (e.g., ANOVA) (don t forget to correct for multiple testing, e.g., Bonferroni or FDR) Gene study set

Enrichment analysis Signaling category contains 27.6% of all genes in the study set - by far the largest category. Reasonable to conclude that signaling may be important in the condition under study Functional category #of genes in the study set Signaling 82 27.6 Metabolism 40 13.5 Others 31 10.4 Trans factors 28 9.4 Transporters 26 8.8 Proteases 20 6.7 Protein synthesis 19 6.4 Adhesion 16 5.4 Oxidation 13 4.4 Cell structure 10 3.4 Secretion 6 2.0 Detoxification 6 2.0 %

Enrichment analysis the wrong way Signaling category contains 27.6% of all genes in the study set - by far the largest category. Reasonable to conclude that signaling may be important in the condition under study Functional category #of genes in the study set Signaling 82 27.6 Metabolism 40 13.5 Others 31 10.4 Trans factors 28 9.4 Transporters 26 8.8 Proteases 20 6.7 Protein synthesis 19 6.4 Adhesion 16 5.4 Oxidation 13 4.4 Cell structure 10 3.4 Secretion 6 2.0 Detoxification 6 2.0 %

Enrichment analysis the wrong way What if ~27% of the genes on the array are involved in signaling? The number of signaling genes in the set is what expected by chance. We need to consider not only the number of genes in the set for each category, but also the total number on the array. We want to know which category is over-represented (occurs more times than expected by chance). Functional category #of genes in the study set % % on array Signaling 82 27.6% 26% Metabolism 40 13.5% 15% Others 31 10.4% 11% Trans factors 28 9.4% 10% Transporters 26 8.8% 2% Proteases 20 6.7% 7% Protein synthesis 19 6.4% 7% Adhesion 16 5.4% 6% Oxidation 13 4.4% 4% Cell structure 10 3.4% 8% Secretion 6 2.0% 2% Detoxification 6 2.0% 2%

Enrichment analysis the right way A statistical test, based on a null model Assume the study set has nothing to do with the specific function at hand and was selected randomly, would we be surprised to see a certain number of genes annotated with this function? The urn version: You pick a set of 20 balls from an urn that contains 250 black and white balls. How surprised will you be to find that 16 of the balls you picked are white?

Modified Fisher's Exact Test Let mdenote the total number of genes in the array and nthe number of genes in the study set. Let m t denote the total number of genes annotated with function tand n t the number of genes in the study set annotated with this function.

Modified Fisher's Exact Test Let Sbe a set of size n, sampled randomly without replacementfrom the entire population of mgenes, and let σ t the number of genes in Sannotated with t. The probability of observing exactly k genes in S annotated with t is: hypergeometric distribution:

Modified Fisher's Exact Test We are interested in knowing the probability of seeing n t or more annotated genes! We can simply sum over all possibilities: This is equivalent to a one-sided Fisher exact test

So what do we have so far? A shared functional vocabulary Systematic linkage between genes and functions A way to identify genes relevant to the condition under study Statistical analysis (combining all of the above to identify cellular functions that contributed to the disease or condition under study) A way to identify related genes

Still far from being perfect! A shared functional vocabulary Systematic linkage between genes and functions Arbitrary! A way to identify genes relevant to the condition under study Ignores links between GO categories Limited hypotheses Statistical analysis (combining all of the above to identify cellular functions that contributed to the disease or condition under study) A way to identify related genes Considers only a few genes Simplistic null model!