
Bayesian Hierarchical Classification. Seminar on Predicting Structured Data. Jukka Kohonen, 17.4.2008

Overview
- Intro: the task of hierarchical gene annotation
- Approach I: SVM/Bayes hybrid. Barutcuoglu et al.: Hierarchical multi-label prediction of gene function (Bioinformatics 2006)
- Approach II: Joint Bayesian model. Shahbaba & Neal: Gene function classification using Bayesian models with hierarchy-based priors (BMC Bioinformatics 2006)
- Discussion

Intro: Gene Ontology (GO), www.geneontology.org
A controlled hierarchical vocabulary of gene or protein classes:
- Molecular function (7600 nodes)
- Biological process (13500 nodes)
- Cellular component (2000 nodes)

Intro: Hierarchical constraints
Each gene can belong to one or more classes (multi-label classification).
Hierarchical constraint: if a gene belongs to class X, then it also belongs to the parent(s) of class X.
[Figure: an example path in the molecular-function hierarchy: Molecular function > Catalytic > Hydrolase > Peptidase > Endopeptidase > Metalloendopeptidase, with markers A-D keyed to the cases below.]
Example: the yeast Saccharomyces cerevisiae has ~6500 named proteins, of which
A. ~2200 have no molecular function annotation
B. ~1600 have only partial annotation (not at leaf level)
C. ~2300 have a single complete MF annotation (at leaf level)
D. ~300 have multiple complete annotations (at leaf level)
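
To make the hierarchical constraint concrete, here is a minimal Python sketch of closing a label set under the ancestor relation; the parent map is a hypothetical toy fragment, not actual GO data:

    # Hierarchical constraint: any assigned class implies all of its
    # ancestors. The 'parents' map is a hypothetical toy fragment.
    parents = {
        "metalloendopeptidase": ["endopeptidase"],
        "endopeptidase": ["peptidase"],
        "peptidase": ["hydrolase"],
        "hydrolase": ["catalytic activity"],
        "catalytic activity": ["molecular function"],
        "molecular function": [],
    }

    def propagate_up(labels):
        """Close a set of class labels under the ancestor relation."""
        closed = set(labels)
        stack = list(labels)
        while stack:
            for parent in parents[stack.pop()]:
                if parent not in closed:
                    closed.add(parent)
                    stack.append(parent)
        return closed

    # A gene annotated at the leaf level belongs to every ancestor class too.
    print(propagate_up({"metalloendopeptidase"}))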

Approach I: SVM/Bayes hybrid
For a (sub)hierarchy of 105 classes:
1. Use 105 separate SVM classifiers (one per class) to generate outputs ŷ_i.
2. Use the ŷ_i as observations and infer the actual memberships y_i with a Bayes network (with the GO structure built in).
Hierarchical dependence between classes: if y_3 = 1, then y_2 = 1 (membership in a child class implies membership in its parent).
[Figure: Bayes network in which y_5 is the actual membership in class 5 and ŷ_5 is the observed SVM output for class 5.]
[Barutcuoglu et al., Bioinformatics 2006]

Hybrid: SVM part
- 3465 annotated genes (yeast S. cerevisiae)
- Input: 88721 raw features (very sparse; data from pairwise interactions, microarray experiments, colocalization, and transcription factor binding sites); 5930 features remain after some pruning
- 105 independent support vector machines (SVMs) are trained, using the same input features but different yes/no labelings (according to class membership)
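
As a rough, self-contained illustration of this setup (synthetic toy data and sizes, not the yeast features from the paper), one independent linear SVM can be trained per class with scikit-learn:

    import numpy as np
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)
    n_genes, n_features, n_classes = 200, 50, 10   # toy sizes

    X = rng.random((n_genes, n_features))          # stand-in for pruned features
    Y = rng.integers(0, 2, (n_genes, n_classes))   # one yes/no labeling per class

    # One independent SVM per class, all sharing the same input features.
    svms = [LinearSVC().fit(X, Y[:, j]) for j in range(n_classes)]

    # The raw decision values play the role of the per-class SVM outputs.
    margins = np.column_stack([clf.decision_function(X) for clf in svms])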

Hybrid: Bayes part
Class memberships y_i depend on each other according to the GO hierarchy.
Each SVM output ŷ_i depends only on the true membership y_i.
The two conditional distributions are approximated as Gaussians:
(ŷ_i | y_i = 0) ~ N(μ_i0, σ_i0²)
(ŷ_i | y_i = 1) ~ N(μ_i1, σ_i1²)
[Figure: empirical distributions of ŷ_i vs. the Gaussian model under y_i = 0 and y_i = 1, for one particular class. Note: not a very good class separation.]
[Barutcuoglu et al.]
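
A minimal sketch of this observation model for a single class (scipy for the Gaussian densities); note that the full method couples all classes through the Bayes network, so this is only the local likelihood piece:

    import numpy as np
    from scipy.stats import norm

    def fit_conditional_gaussians(margins, labels):
        """Fit one Gaussian to the SVM outputs of negatives, one to positives."""
        neg, pos = margins[labels == 0], margins[labels == 1]
        return (neg.mean(), neg.std()), (pos.mean(), pos.std())

    def posterior_positive(m, params0, params1, prior1=0.5):
        """P(y = 1 | SVM output m) by Bayes' rule with the fitted Gaussians."""
        like0 = norm.pdf(m, *params0) * (1.0 - prior1)
        like1 = norm.pdf(m, *params1) * prior1
        return like1 / (like0 + like1)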

Hybrid: Example of reconciliation
[Figure: example of reconciling the SVM outputs with the hierarchy; Barutcuoglu et al.]

Hybrid: Prediction results
Bayesian correction improves the AUC scores slightly (AUC = area under the ROC curve). [Barutcuoglu et al.]
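
For reference, an AUC score for one class is easy to compute from true memberships and predicted scores (scikit-learn; the numbers here are made up):

    from sklearn.metrics import roc_auc_score

    y_true  = [0, 0, 1, 1, 1]             # true memberships for one class
    y_score = [0.1, 0.4, 0.35, 0.8, 0.9]  # raw SVM outputs or corrected posteriors
    print(roc_auc_score(y_true, y_score)) # area under the ROC curve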

Approach II: Joint Bayesian model
Leaf class probabilities depend on the input features x according to a multinomial logit (MNL) model:
P(y = j | x, α, β) ∝ exp(α_j + x^T β_j)
For each class j, different parameters α_j, β_j are used. The β_j parameters are constructed in a way that makes them correlated for nearby classes (hence the name corMNL).
[Shahbaba & Neal, BMC Bioinformatics 2006]
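
A small numpy sketch of the MNL likelihood, together with the corMNL construction as described above (each leaf's β as a sum of per-node parameters along its root-to-leaf path; the function names are mine, not the paper's):

    import numpy as np

    def mnl_probs(x, alpha, beta):
        """P(y = j | x) proportional to exp(alpha_j + x . beta_j)."""
        logits = alpha + beta @ x     # beta has shape (n_classes, n_features)
        logits -= logits.max()        # for numerical stability
        p = np.exp(logits)
        return p / p.sum()

    def cormnl_beta(paths, phi):
        """Build correlated leaf coefficients: beta_j is the sum of phi over
        the nodes on leaf j's root-to-leaf path, so nearby leaves share terms."""
        return np.stack([phi[path].sum(axis=0) for path in paths])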

Joint Bayesian: feature selection
- Initial dimension reduction by PCA to ~100 features
- Then Automatic Relevance Determination (ARD)
Parameters and hyperparameters (from the diagram on the slide):
- α_j: intercept parameter of the MNL (for class j)
- β_jl: slope parameter of the MNL (for class j, feature l)
- scale hyperparameters: an overall scale, a scale for each class j, and a scale for each feature l
Note: if a feature is irrelevant, its coefficients will tend to be near zero.
For corMNL a similar model is applied, but instead of β the model first chooses per-node parameters, from which β is calculated by summing along the hierarchy path (as shown on the previous slide).
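
A toy numpy sketch of the ARD effect implied by these scale hyperparameters (the prior shapes and values are illustrative assumptions, not the paper's exact choices): a feature whose scale is small gets near-zero slopes in every class.

    import numpy as np

    rng = np.random.default_rng(1)
    n_classes, n_features = 6, 8

    tau   = 1.0                              # overall scale
    eta   = rng.gamma(2.0, 0.5, n_classes)   # per-class scales (illustrative)
    sigma = rng.gamma(2.0, 0.5, n_features)  # per-feature ARD scales

    # Slope beta_jl is drawn with standard deviation tau * eta_j * sigma_l:
    # a small sigma_l shrinks feature l toward zero for every class.
    beta = rng.normal(0.0, tau * np.outer(eta, sigma))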

Joint Bayesian: Data for the experiment
- 2122 genes of the bacterium Escherichia coli with known function
- Functional hierarchy of 3 levels, 6 + 20 + 146 = 172 classes (simpler than GO)
- Three sources of input features (phylogenetic similarity, sequence attributes, secondary structure), reduced with PCA to 100, 100, and 150 features respectively

Joint Bayesian: Prediction results
Data sources: SEQ = sequence features, STR = secondary structure, SIM = phylogenetic similarity.
Classification methods:
- Baseline = simply assign every gene to the most common class (at each level)
- MNL = non-hierarchical multinomial logit (ignores the hierarchy)
- treeMNL = nested models (first predict level 1, then level 2, then level 3)
- corMNL = joint model with correlated parameters (as described before)
Note: corMNL improves slightly over the simple MNL. [Shahbaba & Neal 2006]
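
For concreteness, the baseline corresponds to predicting the most common training class at each level (plain Python; the labels here are made up):

    from collections import Counter

    def baseline_predict(labels_per_level):
        """For each hierarchy level, return the most common training class."""
        return [Counter(level).most_common(1)[0][0]
                for level in labels_per_level]

    # e.g. level-1, level-2 and level-3 labels of the training genes
    print(baseline_predict([["a", "a", "b"],
                            ["a1", "a2", "a1"],
                            ["x", "x", "y"]]))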

Discussion (1/3)
- Very different methods for basically the same problem (and these were just two examples).
- In both cases, taking the hierarchy into account improved the prediction results over hierarchy-blind classification (but not by much; why?).
- Difficult to compare results between methods: different feature data, different class hierarchies, and different definitions of accuracy for hierarchical classification.
- Measuring prediction quality with a test set of known functions might not give a realistic view of true prediction quality: the functions of unknown genes might be, overall, very different from the functions of known genes.

Discussion (2/3)
- Functional classification seems to be a difficult, noisy classification task.
- Input features are often nominally high-dimensional but extremely sparse; often some feature data is available only for a subset of the genes. For the other genes it is missing data: treat it as zero, impute it with KNNimpute, or use something like EM or MCMC.
- Both methods (SVM, and MNL+ARD) are able to use high-dimensional feature data; they assign small weights to irrelevant feature dimensions.
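
The KNNimpute option mentioned above is nearest-neighbour imputation; a minimal example with scikit-learn's KNNImputer on a toy matrix:

    import numpy as np
    from sklearn.impute import KNNImputer

    X = np.array([[1.0, 2.0, np.nan],
                  [3.0, 4.0, 3.0],
                  [np.nan, 6.0, 5.0],
                  [8.0, 8.0, 7.0]])

    # Each missing entry is filled with the mean of that feature over the
    # k nearest rows (k=2 here), with distances computed on observed entries.
    X_filled = KNNImputer(n_neighbors=2).fit_transform(X)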

Discussion (3/3)
- The quality of the training labels is problematic: e.g. lots of uncertain classifications (which might change between today and tomorrow), and a lack of explicit negative examples.
- Functional classification of genes is a field that needs more work on unifying the definitions: a better definition of what the prediction task really is, and of how prediction quality should be measured.
- Similar hierarchical classification tasks appear in other domains (e.g. document classification).