Rule learning for gene expression data

Stefan Enroth
Original slides by Torgeir R. Hvidsten
The Linnaeus Centre for Bioinformatics

Predicting biological process from gene expression time profiles
Papers:
I. T. R. Hvidsten, A. Lægreid and J. Komorowski. Learning rule-based models of biological process from gene expression time profiles using gene ontology, Bioinformatics 19(9): 1116-23, 2003.
II. A. Lægreid, T. R. Hvidsten, H. Midelfart, J. Komorowski and A. K. Sandvik. Predicting Gene Ontology Biological Process From Temporal Gene Expression Patterns, Genome Research, 13(5): 965-979, 2003.

Hierarchical clustering
Iyer et al., The transcriptional program in the response of human fibroblasts to serum, Science, 283(5398): 83-87, 1999

Gene Ontology vs. expression clustering

Methodology

1. Annotation

Ontology process                          Genes
Transport                                 g2, ...
Defense response                          g2, ...
Positive control of cell proliferation    g4, g5, ...
Cell cycle control                        g3, ...

2. Extracting features for learning

Gene  0HR   15MIN  30MIN  1HR    2HR    4HR    6HR    8HR    12HR   16HR   20HR   24HR   Process
g1    0.00  -0.47  -3.32  -0.81   0.11  -0.60  -1.36  -1.03  -1.84  -1.00  -0.60  -0.94  Unknown
g2    0.00   0.66   0.07   0.20   0.29  -0.89  -0.45  -0.29  -0.29  -0.15  -0.45  -0.42  Transport and defense response
g3    0.00   0.14  -0.04   0.00  -0.15  -0.58  -0.30  -0.18  -0.38  -0.49  -0.81  -1.12  Cell cycle control
g4    0.00  -0.04   0.00  -0.23  -0.25  -0.47  -0.60  -0.56  -1.09  -0.71  -0.76  -0.62  Positive control of cell proliferation
g5    0.00   0.28   0.37   0.11  -0.17  -0.18  -0.60  -0.23  -0.58  -0.79  -0.29  -0.74  Positive control of cell proliferation
...

3. Inducing minimal decision rules using rough sets

0-4(Increasing) AND 6-10(Decreasing) AND 14-18(Constant) => GO(cell proliferation)
[Figure: expression profile plotted over the 0 to 24 time axis]

4. The function of uncharacterized genes is predicted using the rules
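Step 2 above can be sketched in a few lines of Python. This is only an illustration: the helper name `interval_features`, the tolerance of 0.2 and the rule "compare the end of an interval with its start" are assumptions, not the discretization used in the cited papers, and how the interval labels on the slide (0-4, 6-10, 14-18) map onto the measured time points is not recoverable from the text, so the sketch indexes intervals by time-point label.

```python
# Hypothetical feature extraction: label each pair of time points by the
# change in expression between them. Tolerance and labelling are assumptions.
TIME_POINTS = ["0HR", "15MIN", "30MIN", "1HR", "2HR", "4HR",
               "6HR", "8HR", "12HR", "16HR", "20HR", "24HR"]

def interval_features(profile, time_points=TIME_POINTS, tolerance=0.2):
    """Map (start, end) time-point pairs to Increasing/Decreasing/Constant."""
    features = {}
    for i in range(len(time_points)):
        for j in range(i + 1, len(time_points)):
            change = profile[j] - profile[i]
            if change > tolerance:
                label = "Increasing"
            elif change < -tolerance:
                label = "Decreasing"
            else:
                label = "Constant"
            features[(time_points[i], time_points[j])] = label
    return features

# Profile of gene g2 from the table on the slide.
g2 = [0.00, 0.66, 0.07, 0.20, 0.29, -0.89, -0.45, -0.29, -0.29, -0.15, -0.45, -0.42]
g2_features = interval_features(g2)
print(g2_features[("0HR", "4HR")])
print(g2_features[("4HR", "8HR")])
```

Each gene then becomes a row of discrete interval labels, which is the kind of input a rule learner can work with.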

Rule Induction

IF 0-4(Constant) AND 0-10(Increasing)
THEN GO(prot. met. and mod.) OR GO(mesoderm develop.) OR GO(prot. biosynt.)

IF-part (antecedent, premise): the minimal set of discrete changes in expression needed to uphold the discriminatory power of the full data set
THEN-part (conclusion): all functions of genes described by the antecedent side

We want rules that describe the expression profiles of several genes with one or a few functions.
accuracy: the fraction of genes matching the IF-part that are annotated with the process in the THEN-part
coverage: the fraction of genes annotated with the process in the THEN-part that match the IF-part
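The two quality measures defined above translate directly into set arithmetic. The sketch below assumes a hypothetical helper `rule_quality` and made-up gene identifiers; it is not taken from the papers.

```python
def rule_quality(matching_genes, annotated_genes):
    """Return (accuracy, coverage) of a rule for one process.

    matching_genes: genes whose profile matches the IF-part.
    annotated_genes: genes annotated with the process in the THEN-part.
    """
    matching = set(matching_genes)
    annotated = set(annotated_genes)
    both = matching & annotated
    accuracy = len(both) / len(matching) if matching else 0.0
    coverage = len(both) / len(annotated) if annotated else 0.0
    return accuracy, coverage

# Hypothetical example: 4 genes match the IF-part, 6 genes carry the annotation,
# and 3 genes are in both sets.
accuracy, coverage = rule_quality(
    {"gA", "gB", "gC", "gD"},
    {"gA", "gB", "gC", "gE", "gF", "gG"},
)
print(accuracy, coverage)
```

A rule with high accuracy but low coverage is reliable yet describes few genes of the process; the opposite case is general but noisy.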

Rule example
[Figure: expression profiles of the covered genes over the 0 to 24 time axis]
Rule: 0-4(Constant) AND 0-10(Increasing) => GO(protein metabolism and modification) OR GO(mesoderm development) OR GO(protein biosynthesis)
Covered genes:
4: M35296, J02783, D13748, X05130
1: X60957
1: D13748

Classification
[Figure: expression profile of X60957 over the 0 to 24 time axis. Function?]
Every rule (IF ... THEN ...) whose IF-part matches the gene casts votes for the processes in its THEN-part, for example:
IF 0-4(Constant) AND 0-10(Increasing) THEN GO(protein metabolism and modification) OR GO(mesoderm development) OR GO(protein biosynthesis)
The figure annotates this rule with vote increments of +1, +1 and +4.
Votes are normalized, and processes with vote fractions higher than a selection threshold are chosen as predictions.

Process                                Votes
protein metabolism and modification    6
mesoderm development                   3
proteolysis and peptidolysis           2
transcription                          1
protein biosynthesis                   1
vision                                 1
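The normalization and selection step can be sketched as follows, using the vote counts from the slide. The helper name `predict_processes` and the threshold of 0.2 are assumptions made for illustration; the slides do not state which threshold was used.

```python
# Vote counts for gene X60957, copied from the slide.
votes = {
    "protein metabolism and modification": 6,
    "mesoderm development": 3,
    "proteolysis and peptidolysis": 2,
    "transcription": 1,
    "protein biosynthesis": 1,
    "vision": 1,
}

def predict_processes(votes, threshold):
    """Normalize votes to fractions and keep processes above the threshold."""
    total = sum(votes.values())
    fractions = {process: count / total for process, count in votes.items()}
    predicted = [p for p, f in fractions.items() if f > threshold]
    return fractions, predicted

# The threshold value 0.2 is a hypothetical choice.
fractions, predicted = predict_processes(votes, threshold=0.2)
print(predicted)
```

Raising the threshold yields fewer but more confident predictions, which is the trade-off explored on the threshold selection slide.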

Cross validation

          Iteration 1     Iteration 2     Iteration 3
Fold 1    Training set    Training set    Test set
Fold 2    Training set    Test set        Training set
Fold 3    Test set        Training set    Training set

The observations (Observation 1, Observation 2, ..., Observation n) are divided among the folds.
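A minimal sketch of the splitting scheme in the table, assuming a hypothetical generator `k_fold_splits` that assigns observations to folds round-robin. The order in which folds serve as the test set differs from the figure, which does not affect the estimate.

```python
def k_fold_splits(n_observations, k=3):
    """Yield (train_indices, test_indices) for each of the k iterations."""
    indices = list(range(n_observations))
    # Round-robin assignment of observations to folds (an arbitrary choice).
    folds = [indices[i::k] for i in range(k)]
    for test_fold in range(k):
        test = folds[test_fold]
        train = [i for f, fold in enumerate(folds) if f != test_fold for i in fold]
        yield train, test

splits = list(k_fold_splits(9, k=3))
for train, test in splits:
    print("train:", train, "test:", test)
```

Every observation is used for testing exactly once and for training in the remaining iterations.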

Evaluation
Evaluation technique:
- divide examples into training set and test set
- cross validation
Evaluation measures:
accuracy = (TP+TN)/(TP+FN+TN+FP)
sensitivity = TP/(TP+FN)
specificity = TN/(TN+FP)
Confusion matrix: [figure]

Threshold selection
[Figure: test set genes g1 to g13 plotted by their fraction of votes for protein biosynthesis (0 to 1), each marked as a gene with function protein biosynthesis or a gene with a different function, with Threshold 1 and Threshold 2 drawn in]
sensitivity: TP/(TP+FN)
specificity: TN/(TN+FP)
The two thresholds in the figure give:
Sensitivity = 2/3, Specificity = 1
Sensitivity = 1, Specificity = 2/3
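The effect of moving the threshold can be reproduced on a small example. The six vote fractions below are hypothetical (the values in the figure are not recoverable); they were chosen so that the two thresholds give the same two sensitivity/specificity pairs as on the slide. The function name `sensitivity_specificity` is likewise an assumption.

```python
def sensitivity_specificity(vote_fractions, has_function, threshold):
    """Classify genes as positive when their vote fraction exceeds the threshold."""
    tp = fn = tn = fp = 0
    for fraction, positive in zip(vote_fractions, has_function):
        predicted = fraction > threshold
        if positive and predicted:
            tp += 1
        elif positive:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical test set: three genes with the function, three without.
vote_fractions = [0.9, 0.7, 0.3, 0.5, 0.2, 0.1]
has_function = [True, True, True, False, False, False]

strict = sensitivity_specificity(vote_fractions, has_function, threshold=0.6)
lenient = sensitivity_specificity(vote_fractions, has_function, threshold=0.25)
print(strict, lenient)
```

A strict threshold misses one true gene but raises no false alarms; a lenient one finds all true genes at the cost of one false alarm.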

ROC analysis and classifier evaluation
ROC: the Receiver Operating Characteristic curve results from plotting sensitivity against 1 - specificity (the false alarm rate) for all possible thresholds
sensitivity: TP/(TP+FN)
specificity: TN/(TN+FP)
AUC: Area Under the ROC Curve
[Figure: ROC plot with sensitivity on the vertical axis and 1 - specificity (false alarm) on the horizontal axis, both from 0 to 1, showing the perfect-discrimination curve, the no-discrimination line and the AUC]
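As a sketch of how the curve and its area can be computed, the code below sweeps the threshold over all observed scores and integrates with the trapezoidal rule. The function names and the six-gene data set are hypothetical; labels are 1 for genes with the function and 0 otherwise.

```python
def roc_points(scores, labels):
    """Return (1 - specificity, sensitivity) points, one per distinct threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Emit a point only after all genes sharing this score are counted.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical vote fractions and true labels.
scores = [0.9, 0.7, 0.3, 0.5, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
area = auc(roc_points(scores, labels))
print(area)
```

The AUC equals the fraction of (positive, negative) gene pairs that the scores rank correctly, so 1.0 means perfect discrimination and 0.5 means none.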

ROC analysis and classifier evaluation
Which ROC curve is better?
A dominates B and C and clearly has a higher AUC.
B and C have approximately the same AUC; B is better for some thresholds, C for others.
[Figure: ROC curves A, B and C between the perfect-discrimination curve and the no-discrimination line; sensitivity against 1 - specificity, both from 0 to 1]

Cross validation estimates

PROCESS                                  AUC    SE     P-VALUE
Ion homeostasis                          1.00   0.00   0.008
Protein targeting                        0.99   0.03   0.000
Blood coagulation                        0.96   0.08   0.000
DNA metabolism                           0.94   0.09   0.000
Intracellular signaling cascade          0.94   0.06   0.000
Cell cycle                               0.93   0.04   0.000
Energy pathways                          0.93   0.12   0.004
Oncogenesis                              0.92   0.11   0.000
Circulation                              0.91   0.11   0.001
Cell death                               0.90   0.10   0.000
Developmental processes                  0.90   0.07   0.000
Defense (immune) response                0.88   0.05   0.000
Transcription                            0.88   0.11   0.002
Cell adhesion                            0.87   0.09   0.002
Stress response                          0.86   0.15   0.002
Protein metabolism and modification      0.85   0.10   0.000
Cell motility                            0.84   0.11   0.000
Cell surface rec linked signal transd    0.82   0.15   0.005
Lipid metabolism                         0.81   0.14   0.000
Cell organization and biogenesis         0.79   0.11   0.000
Cell proliferation                       0.79   0.06   0.002
Transport                                0.79   0.17   0.001
Amino acid and derivative metabolism     0.69   0.06   0.288
AVERAGE                                  0.88   0.09

Over all classes:
Coverage (recall) = TP/(TP+FN)
Precision = TP/(TP+FP)

Coverage: 84%, Precision: 50%
Coverage: 71%, Precision: 60%
Coverage: 39%, Precision: 90%

* Iyer et al.

Discovering regulatory binding site modules
Paper:
III. T. R. Hvidsten, B. Wilczyński, A. Kryshtafovych, J. Tiuryn, J. Komorowski and K. Fidelis. Discovering regulatory binding site modules using rule-based learning, Genome Research 15: 856-66, 2005.

Gene regulation
Gene expression is regulated by regulatory proteins (transcription factors).
Transcription factors depend on recognizing sequence motifs (binding sites) in order to affect the expression of genes.
Transcription factors combine to respond to a large number of stress factors (e.g. heat shock) with a large number of expression outcomes.
[Figure: regulatory region (enhancer, silencer, response elements, promoter, binding sites) and transcription region; yeast promoter]

Background
Many studies have used gene expression data to search for overrepresented sequence motifs in co-expressed genes.
Pilpel et al. (2001) found that genes sharing pairs of binding sites are significantly more likely to be co-expressed than genes with only single binding sites in common.
Expression coherence score (EC)
Pilpel, Y., P. Sudarsanam, and G.M. Church. 2001. Identifying regulatory networks by combinatorial analysis of promoter elements. Nat Genet 29: 153-159.

Introductory remarks
We want to explore the combinatorial nature of gene regulation.
ASSUMPTION: co-regulated genes (genes regulated by the same transcription factors through the same binding sites) are co-expressed (have similar expression profiles).
Genome-wide analysis of combinatorial regulation using sequence and expression data.
Binding site module: a set of binding sites that is used to co-regulate a set of genes.

Data and Method
Database of 43 known and 313 putative yeast binding site motifs.
Expression profiles for yeast genes measured under six different conditions: cell cycle and five stress conditions (sporulation, diauxic shift, heat shock and cold shock, pheromone and DNA-damaging agents).

Binding sites per gene, and whether the gene has similar expression to RPL18A:

Gene     HAP234  RAP1  PHO  SWI5  ECB  MCM1'  Similar expression to RPL18A?
RPL18A   0       1     0    1     0    1      yes
RPS18A   0       1     0    1     0    1      yes
RPL16B   1       1     0    1     0    1      yes
RPL26A   0       1     0    1     0    1      yes
RPS24A   0       1     0    1     0    1      yes
RPL30    0       1     0    1     0    1      yes
RPL14A   0       1     0    1     0    1      yes
SST2     0       1     0    1     0    1      no
DRS2     0       0     0    1     0    1      no
GIT1     0       1     0    1     0    0      no
CLN3     0       0     0    1     1    1      no
RPO21    1       0     0    0     1    1      no
BIT89    0       1     0    0     0    1      no

Rule learning*: IF RAP1 AND SWI5 AND MCM1' THEN [similar expression profile plot]
Filtering*
Evaluation: Gene Ontology, binding data
Next gene
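To make the rule-learning step concrete, the sketch below searches the table above for the conjunction of binding sites that best separates the genes with similar expression from the rest. It uses plain brute-force enumeration scored by accuracy times coverage, which is a stand-in chosen for brevity and not the rule learner used in the paper; the function names and the scoring choice are assumptions.

```python
from itertools import combinations

SITES = ["HAP234", "RAP1", "PHO", "SWI5", "ECB", "MCM1'"]
# Gene -> (binding site indicators, similar expression to RPL18A?), from the slide.
GENES = {
    "RPL18A": ([0, 1, 0, 1, 0, 1], True),
    "RPS18A": ([0, 1, 0, 1, 0, 1], True),
    "RPL16B": ([1, 1, 0, 1, 0, 1], True),
    "RPL26A": ([0, 1, 0, 1, 0, 1], True),
    "RPS24A": ([0, 1, 0, 1, 0, 1], True),
    "RPL30": ([0, 1, 0, 1, 0, 1], True),
    "RPL14A": ([0, 1, 0, 1, 0, 1], True),
    "SST2": ([0, 1, 0, 1, 0, 1], False),
    "DRS2": ([0, 0, 0, 1, 0, 1], False),
    "GIT1": ([0, 1, 0, 1, 0, 0], False),
    "CLN3": ([0, 0, 0, 1, 1, 1], False),
    "RPO21": ([1, 0, 0, 0, 1, 1], False),
    "BIT89": ([0, 1, 0, 0, 0, 1], False),
}

def score(rule):
    """Return (accuracy, coverage) of the conjunction of sites in rule."""
    idx = [SITES.index(site) for site in rule]
    matched = [g for g, (sites, _) in GENES.items() if all(sites[i] for i in idx)]
    if not matched:
        return 0.0, 0.0
    hits = sum(GENES[g][1] for g in matched)
    similar = sum(is_similar for _, is_similar in GENES.values())
    return hits / len(matched), hits / similar

def best_rule(max_sites=3):
    """Brute-force search over conjunctions of up to max_sites binding sites."""
    candidates = [c for n in range(1, max_sites + 1) for c in combinations(SITES, n)]
    return max(candidates, key=lambda rule: score(rule)[0] * score(rule)[1])

rule = best_rule()
print(" AND ".join(rule), score(rule))
```

On this table the search recovers the conjunction shown on the slide; SST2 matches the IF-part without sharing the expression profile, which is why the accuracy stays below 1.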

An example of a binding site module
IF RAP1 AND SWI5 AND MCM1' THEN [similar expression profile plot]
The rule was (re-)discovered in five of the six expression data sets:
a) Cell cycle: RPL30, RPL18A, RPL26A, RPL14A, RPS18A, (SST2), RPL16B
b) Sporulation: RPL30, RPL18A, RPL14A, (RPS18A), RPL16B, RPL26A
c) Diauxic shift: RPL30, RPL18A, RPL14A, RPS18A, SST2, RPL16B, RPL26A
d) Heat and cold shock: RPS24A, RPL26A, RPL14A, RPS18A, RPL16B
e) DNA-damaging agents: RPL30, RPL18A, RPL26A, RPL14A, RPS18A, (SST2), RPL16B
The central gene in the expression cluster is underlined on the original slide.
Genes with differing expression profiles are in parentheses.
[Figure: one expression profile plot per data set]

Evaluation
IF RAP1 AND SWI5 AND MCM1' THEN <similar expression>

RPL16B
  Biological process: protein biosynthesis
  Molecular function: RNA binding, structural constituent of ribosome
  Cellular component: cytosolic ribosome (sensu Eukarya), large ribosomal subunit
  Possible transcription factors (P-value < 0.01): FHL1, GAT3, PDR1, RAP1, RGM1, YAP5
RPL26A
  Biological process: protein biosynthesis
  Molecular function: RNA binding, structural constituent of ribosome
  Cellular component: cytosolic ribosome (sensu Eukarya), large ribosomal subunit
  Possible transcription factors (P-value < 0.01): FHL1, RAP1
RPS18A
  Biological process: protein biosynthesis
  Molecular function: structural constituent of ribosome
  Cellular component: cytosolic ribosome (sensu Eukarya), eukaryotic 43S preinitiation complex, eukaryotic 48S initiation complex, small ribosomal subunit
  Possible transcription factors (P-value < 0.01): FHL1, GAT3, HIR2, RAP1, RGM1, YAP5
RPL30
  Biological process: protein biosynthesis, rRNA processing, mRNA splicing, regulation of translation
  Molecular function: structural constituent of ribosome
  Cellular component: cytosolic ribosome (sensu Eukarya), cytoplasm, large ribosomal subunit
  Possible transcription factors (P-value < 0.01): FHL1, GAT3, RAP1, SFP1
RPL18A
  Biological process: protein biosynthesis
  Molecular function: structural constituent of ribosome
  Cellular component: cytosolic ribosome (sensu Eukarya), large ribosomal subunit
  Possible transcription factors (P-value < 0.01): FHL1, MAL13, RAP1, YAP5
RPL14A
  Biological process: protein biosynthesis
  Molecular function: RNA binding, structural constituent of ribosome
  Cellular component: cytosolic ribosome (sensu Eukarya), large ribosomal subunit
  Possible transcription factors (P-value < 0.01): FHL1, GAT3, GRF10(Pho2), GTS1, RAP1
SST2
  Biological process: signal transduction, adaptation to pheromone during conjugation with cellular fusion
  Molecular function: GTPase activator activity
  Cellular component: plasma membrane
  Possible transcription factors (P-value < 0.01): DIG1, FHL1, RAP1, STE12
RPS24A
  Biological process: protein biosynthesis
  Molecular function: structural constituent of ribosome
  Cellular component: cytosolic ribosome (sensu Eukarya), eukaryotic 43S preinitiation complex, eukaryotic 48S initiation complex, small ribosomal subunit
  Possible transcription factors (P-value < 0.01): FHL1, GAT3, PDR1, RAP1, RGM1, SMP1, YAP5

P-VALUE
  Biological process: 2.35E-04 (protein biosynthesis)
  Molecular function: 2.36E-06 (structural constituent of ribosome)
  Cellular component: 5.66E-07 (cytosolic ribosome)
  Possible transcription factors: 2.38E-10 (FHL1), 3.38E-08 (RAP1), 2.96E-05 (GAT3)

Model-based detection of periodic expression
Paper:
IV. C.R. Andersson, T.R. Hvidsten, A. Isaksson, M.G. Gustafsson and J. Komorowski. Revealing cell cycle control by combining model-based detection of periodic expression with novel cis-regulatory descriptors, BMC Systems Biology, 1: 45, 2007.

[Figure: cycle linking Experiment, Induction, Prior knowledge, Hypothesis and Inference]
Induction: a technique that infers generalizations from the information in the data.
Inference: the reasoning involved in drawing a conclusion or making a logical judgment on the basis of circumstantial evidence and prior conclusions rather than on the basis of direct observation.

Synchronization
To recover mRNA levels, cultures must be synchronized.
Synchronization halts cells at a particular point in the cell cycle.
Typically the mating system or temperature-sensitive mutants (cdc28, cdc15) are used.
Spellman (1998): S. cerevisiae under various synchronizations (alpha-factor, ts cdc15, cdc28 and elutriation).
Synchronization reveals periodic expression.
Periodic expression is related to the cell cycle.

Detecting periodically expressed genes
Which model is most similar to the signal? => Probability(periodic expression)
[Figure: the signal (expression against time) compared with Model 0 and Model 1 (expression against time), with the period T marked]

The models
$H_0: y_t = \mu + \varepsilon$
$H_1: y_t = \mu + A \cos(\omega t + \varphi) + \varepsilon$
Prior knowledge: the period T
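A rough sketch of how the two models can be fitted and compared. With the period T known, the angular frequency is ω = 2π/T, and A cos(ωt + φ) = a cos(ωt) + b sin(ωt) makes the periodic model linear in (μ, a, b), so ordinary least squares suffices. The comparison used here (fraction of the constant model's residual variance removed by the periodic term) is an assumption standing in for the probability computed in the paper, and all function names and the test signal are hypothetical.

```python
import math

def fit_constant(y):
    """H0: y_t = mu + eps. Returns the residual sum of squares."""
    mu = sum(y) / len(y)
    return sum((v - mu) ** 2 for v in y)

def solve3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            factor = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= factor * m[col][c]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        x[r] = (m[r][3] - sum(m[r][c] * x[c] for c in range(r + 1, 3))) / m[r][r]
    return x

def fit_periodic(t, y, period):
    """H1: y_t = mu + A cos(omega t + phi) + eps. Returns (mu, A, phi, rss)."""
    w = 2 * math.pi / period
    rows = [[1.0, math.cos(w * ti), math.sin(w * ti)] for ti in t]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    mu, a, b = solve3(xtx, xty)
    rss = sum((yi - (mu + a * r[1] + b * r[2])) ** 2 for r, yi in zip(rows, y))
    # a = A cos(phi) and b = -A sin(phi)
    return mu, math.hypot(a, b), math.atan2(-b, a), rss

# Hypothetical noise-free signal with period 60, sampled every 10 time units.
times = [10 * i for i in range(12)]
signal = [0.5 + 2.0 * math.cos(2 * math.pi / 60 * ti + 0.3) for ti in times]
mu, amplitude, phase, rss1 = fit_periodic(times, signal, period=60)
explained = 1 - rss1 / fit_constant(signal)
print(mu, amplitude, phase, explained)
```

For a real expression profile, a value of `explained` near 1 suggests periodic expression at the assumed period, while a value near 0 favours the constant model.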

Conditional periodicity

Method

Comparison to clustering
Known cell cycle related:
- sequence motifs: CCA, ECB, MCB, SWI5, SCB, SFF, MCM1, SFF' and MCM1.
- transcription factors: ACE2, FKH1, FKH2, GTS1, HIR1, HIR2, MBP1, MCM1, NDD1, STB1, SWI4, SWI5, SWI6, XBP1, YBR267W, YHP1 and YOX1.

Examples of interactions
The point (0.034, 0.73) on the curve is associated with the 145 rules with p-value lower than 0.000195. These rules include 19 of the 26 known phase-specific regulators (73%) and 18 other regulators (3.4%). Furthermore, they describe 24% of the genes in the periodic classes.

Examples of interactions
Ellipses/rectangles: transcription factors/sequence motifs
Green/Blue: cell cycle related transcription factors/sequence motifs
Red: interactions between transcription factors

Summary
Gene expression data can be used in many types of applications in which rule-based methods can be used to synthesize hypotheses.
Discovered rules/classes must always be validated against real-world knowledge!
If appropriate, prior real-world knowledge can be incorporated into the models.