Multiplex network inference

(using hidden Markov models) University of Cambridge Bioinformatics Group Meeting, 11 February 2016

Words of warning Disclaimer These slides have been produced by combining & translating two of my previous slide decks: a five-minute presentation given during my internship at Jane Street, as part of the "short expositions by interns" series; and a 1.5h-long lecture given during the Nedelja informatike seminar at my high school (for students gifted in informatics).[1] An implication of this is that it might feel like more of a high-level pitch than a low-level specification of the models involved; I am more than happy to discuss the internals during or after the presentation! (You may also consult the paper distributed by Thomas before the talk.) [1] If anyone would like to visit Belgrade and hold a lecture at this seminar, that'd be really incredible :)

Motivation Why multiplex networks? This talk is an introduction to one of the most popular types of complex networks, as well as to its applications to machine learning. Multiplex networks were the central topic of my Bachelor's dissertation @Cambridge; this resulted in a journal publication (Journal of Complex Networks, Oxford). These networks hold significant potential for modelling many real-world systems. My work represents (to the best of my knowledge) the first attempt at developing a machine learning algorithm over these networks, with highly satisfactory results!

Motivation Roadmap 1. We will start off with a few slides that (in)formally define multiplex networks, along with some motivating examples. 2. Then our attention turns to hidden Markov models (HMMs), and in particular to how their standard algorithms can be used to tackle a machine learning problem. 3. Finally, we will show how the two concepts can be integrated (i.e. how I have integrated them...).

Theoretical introduction Let's start with graphs! Imagine that, within a system containing four nodes, you have concluded that certain pairs of nodes are connected in a certain manner. [Figure: a graph on the nodes 0-3.] You've got your usual, boring graph; in this context it is often called a monoplex network.

Theoretical introduction Some more graphs You now notice that, in other frames of reference (examples to come soon), these nodes may be connected in different ways. [Figure: two further graphs on the same nodes 0-3, with different edge sets.]

Theoretical introduction Influence Finally, you conclude that these layers of interaction are not independent, but may interact with each other in nontrivial ways (thus forming a "network of networks"). Multiplex networks provide us with a relatively simple way of representing these interactions: by adding new interlayer edges between a node's images in different layers. Revisiting the previous example...

Theoretical introduction Previous example [Figure: the three example graphs on nodes 0-3, stacked as layers of one structure.]

Theoretical introduction Previous example [Figure: the same structure drawn over node-layer pairs (0, G_1), ..., (3, G_1), (0, G_2), ..., (3, G_2), (0, G_3), ..., (3, G_3), with interlayer edges joining each node's images across layers.]

Theoretical introduction We have a multiplex network!

Review of applications Examples Despite the simplicity of this model, a wide variety of real-world systems exhibit natural multiplexity. Examples include: transportation networks (De Domenico et al.); genetic networks (De Domenico et al.); social networks (Granell et al.).

Theoretical introduction Markov chains Let S be a discrete set of states, and {X_n}_{n≥0} a sequence of random variables taking values from S. This sequence satisfies the Markov property if it is memoryless: if the next value in the sequence depends only on the current value; i.e. for all n ≥ 0: P(X_{n+1} = x_{n+1} | X_n = x_n, ..., X_0 = x_0) = P(X_{n+1} = x_{n+1} | X_n = x_n). It is then called a Markov chain. X_t = x_t signifies that the chain is in state x_t (∈ S) at time t.

Theoretical introduction Time homogeneity A common assumption is that a Markov chain is time-homogeneous: that the transition probabilities do not change with time; i.e. for all n ≥ 0: P(X_{n+1} = b | X_n = a) = P(X_1 = b | X_0 = a). Time homogeneity allows us to represent Markov chains with a finite state set S using only a single matrix, T: T_{ij} = P(X_1 = j | X_0 = i) for all i, j ∈ S. It holds that Σ_{x∈S} T_{ix} = 1 for all i ∈ S. It is also useful to define a start-state probability vector π, s.t. π_x = P(X_0 = x).

Theoretical introduction Markov chain example S = {x, y, z} [Figure: transition diagram; x moves to y with probability 1; y stays at y with probability 0.5 and moves to z with probability 0.5; z moves to x with probability 0.3 and to y with probability 0.7.]
T = [ 0.0 1.0 0.0
      0.0 0.5 0.5
      0.3 0.7 0.0 ]
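
To make the chain concrete, here is a minimal Python/NumPy sketch (mine, not from the talk) that samples a trajectory from this example chain; the state names and transition matrix are exactly the ones above.

```python
import numpy as np

# States and transition matrix from the example slide.
states = ["x", "y", "z"]
T = np.array([
    [0.0, 1.0, 0.0],   # from x
    [0.0, 0.5, 0.5],   # from y
    [0.3, 0.7, 0.0],   # from z
])

def simulate(T, start, steps, rng=np.random.default_rng(0)):
    """Sample a state sequence from a time-homogeneous Markov chain."""
    path = [start]
    for _ in range(steps):
        path.append(rng.choice(T.shape[1], p=T[path[-1]]))
    return [states[s] for s in path]

print(simulate(T, start=0, steps=10))   # e.g. ['x', 'y', 'z', 'y', ...]
```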

Theoretical introduction Hidden Markov Models A hidden Markov model (HMM) is a Markov chain in which the state sequence may be unobservable (hidden). This means that, while the Markov chain parameters (e.g. transition matrix and start-state probabilities) are still known, there is no way to directly determine the state sequence {X_n}_{n≥0} the system will follow. What can be observed is an output sequence produced at each time step, {Y_n}_{n≥0}. The output sequence can assume any value from a given set of outputs, O. Here we will assume O to be discrete, but it is easily extendable to the continuous case (GMHMMs, as used in the paper), and all the standard algorithms retain their usual form.

Theoretical introduction Further parameters It is assumed that the output at any given moment depends only on the current state; i.e. for all n ≥ 0: P(Y_n = y_n | X_n = x_n, ..., X_0 = x_0, Y_{n-1} = y_{n-1}, ..., Y_0 = y_0) = P(Y_n = y_n | X_n = x_n). Assuming time homogeneity on P(Y_n = y_n | X_n = x_n) as before, the only additional parameter needed to fully specify an HMM is the output probability matrix, O, defined as O_{xy} = P(Y_0 = y | X_0 = x).

Theoretical introduction HMM example S = {x, y, z}, O = {a, b, c} [Figure: the same transition diagram, now with emissions; x emits a with probability 0.9 and b with probability 0.1; y emits b with probability 0.6 and c with probability 0.4; z emits c with probability 1.]
T = [ 0.0 1.0 0.0
      0.0 0.5 0.5
      0.3 0.7 0.0 ]
O = [ 0.9 0.1 0.0
      0.0 0.6 0.4
      0.0 0.0 1.0 ]
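
As a companion sketch (again mine, not from the talk), sampling a hidden state path and its output sequence from this example HMM; the start-state vector pi is an assumption, since the slide does not specify one.

```python
import numpy as np

states, outputs = ["x", "y", "z"], ["a", "b", "c"]
pi = np.array([1.0, 0.0, 0.0])   # assumed start-state vector; not given on the slide
T = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.5, 0.5],
              [0.3, 0.7, 0.0]])
O = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.6, 0.4],
              [0.0, 0.0, 1.0]])

def sample(pi, T, O, steps, rng=np.random.default_rng(0)):
    """Sample a hidden state path and the corresponding observed output sequence."""
    xs, ys = [], []
    x = rng.choice(len(pi), p=pi)
    for _ in range(steps):
        xs.append(states[x])
        ys.append(outputs[rng.choice(O.shape[1], p=O[x])])
        x = rng.choice(T.shape[1], p=T[x])
    return xs, ys

print(sample(pi, T, O, steps=8))
```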

Theoretical introduction Learning and inference There are three main problems that one may wish to solve on a (GM)HMM, and each can be addressed by a standard algorithm:

Probability of an observed output sequence. Given an output sequence {y_t}_{t=0}^T, determine the probability that it was produced by the given HMM Θ, i.e. P(Y_0 = y_0, ..., Y_T = y_T | Θ). (1) This problem is efficiently solved with the forward algorithm.
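
A minimal NumPy sketch of the forward recursion, with outputs encoded as column indices into O; a practical implementation would rescale alpha or work in log space to avoid underflow.

```python
import numpy as np

def forward(pi, T, O, ys):
    """Likelihood P(Y_0 = y_0, ..., Y_T = y_T | model), summing over all hidden state paths."""
    alpha = pi * O[:, ys[0]]              # alpha_0(x) = pi_x * O_{x, y_0}
    for y in ys[1:]:
        alpha = (alpha @ T) * O[:, y]     # alpha_t(b) = sum_a alpha_{t-1}(a) * T_{ab} * O_{b, y_t}
    return float(alpha.sum())

# e.g. forward(pi, T, O, [0, 1, 2]) with the example parameters above
# scores the output sequence a, b, c.
```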

Most likely sequence of states for an observed output sequence. Given an output sequence {y_t}_{t=0}^T, determine the most likely sequence of states {x̂_t}_{t=0}^T that produced it within a given HMM Θ, i.e. {x̂_t}_{t=0}^T = argmax_{{x_t}_{t=0}^T} P({x_t}_{t=0}^T | {y_t}_{t=0}^T, Θ). (2) This problem is efficiently solved with the Viterbi algorithm.
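
A corresponding Viterbi sketch in log space (outputs again as indices; returns the most likely state indices, with ties broken arbitrarily).

```python
import numpy as np

def viterbi(pi, T, O, ys):
    """Most likely hidden state sequence for outputs ys (as indices), computed in log space."""
    eps = 1e-300                               # avoid log(0) on impossible transitions/emissions
    logT, logO = np.log(T + eps), np.log(O + eps)
    delta = np.log(pi + eps) + logO[:, ys[0]]  # best log-score of any path ending in each state
    back = []
    for y in ys[1:]:
        scores = delta[:, None] + logT         # scores[a, b]: best path ending at a, extended by a -> b
        back.append(scores.argmax(axis=0))     # best predecessor of each state b
        delta = scores.max(axis=0) + logO[:, y]
    path = [int(delta.argmax())]
    for b in reversed(back):
        path.append(int(b[path[-1]]))
    return path[::-1]                          # state indices for times 0 .. T
```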

Adjusting the model parameters. Given an output sequence {y_t}_{t=0}^T and an HMM Θ, produce a new HMM Θ' that is more likely to produce that sequence, i.e. P({y_t}_{t=0}^T | Θ') ≥ P({y_t}_{t=0}^T | Θ). (3) This problem is efficiently solved with the Baum-Welch algorithm.
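
A sketch of a single Baum-Welch (EM) update for one output sequence, using unscaled forward and backward variables; this is fine for short sequences, but real implementations scale the recursions and pool expected counts over many sequences.

```python
import numpy as np

def baum_welch_step(pi, T, O, ys):
    """One Baum-Welch (EM) update of (pi, T, O) for a single output sequence ys (output indices)."""
    n, L = len(pi), len(ys)
    # E-step: forward (alpha) and backward (beta) variables, unscaled.
    alpha, beta = np.zeros((L, n)), np.zeros((L, n))
    alpha[0] = pi * O[:, ys[0]]
    for t in range(1, L):
        alpha[t] = (alpha[t - 1] @ T) * O[:, ys[t]]
    beta[-1] = 1.0
    for t in range(L - 2, -1, -1):
        beta[t] = T @ (O[:, ys[t + 1]] * beta[t + 1])
    evidence = alpha[-1].sum()                 # P(ys | current model)
    gamma = alpha * beta / evidence            # gamma[t, x] = P(X_t = x | ys)
    xi = np.zeros((n, n))                      # expected transition counts, summed over t
    for t in range(L - 1):
        xi += np.outer(alpha[t], O[:, ys[t + 1]] * beta[t + 1]) * T / evidence
    # M-step: re-estimate the parameters from the expected counts.
    new_pi = gamma[0]
    new_T = xi / gamma[:-1].sum(axis=0)[:, None]
    new_O = np.zeros_like(O)
    for t in range(L):
        new_O[:, ys[t]] += gamma[t]
    new_O /= gamma.sum(axis=0)[:, None]
    return new_pi, new_T, new_O
```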

Problem setup Supervised learning One of the most common kinds of machine learning is supervised learning; the task is to construct a learning algorithm that, upon observing a set of data with known labels (training data), constructs a function capable of determining labels for thus-far unseen data (test data). [Diagram: training data s → learning algorithm L → labelling function h = L(s); unseen data x → h → label y = h(x).]

Problem setup Binary classification The simplest example of a problem solvable via supervised learning is binary classification: given two classes (C_1 and C_2), determine which class an input x belongs to. The applications of this are widespread: diagnostics (does a patient have a given disease, based on glucose levels, blood pressure, and similar measurements?); credit approval (is a person expected to repay their credit, based on their financial history?); trading (should a stock be bought or sold, depending on previous price movements?); autonomous driving (are the driving conditions too dangerous for self-driving, based on meteorological data?); ...

Classification via HMMs The training data consists of sequences for which class membership (in C_1 or C_2) is known. We may construct two separate HMMs: one producing all the training sequences belonging to C_1, the other producing all the training sequences belonging to C_2. The models can be trained by running a sufficient number of iterations of the Baum-Welch algorithm over the sequences from the training data belonging to their respective classes. After constructing the models, classifying new sequences is simple: employ the forward algorithm to determine whether it is more likely that a new sequence was produced by the C_1 model or the C_2 model.
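
A sketch of this scheme, reusing the forward and baum_welch_step functions from the earlier sketches; the per-sequence update inside train is a simplification of full multi-sequence Baum-Welch, which pools expected counts over all sequences before each M-step.

```python
def train(model, sequences, iterations=20):
    """Fit one class model (a (pi, T, O) triple) by repeated Baum-Welch passes
    over that class's training sequences (simplified: one update per sequence)."""
    for _ in range(iterations):
        for seq in sequences:
            model = baum_welch_step(*model, seq)
    return model

def classify(seq, model_c1, model_c2):
    """Assign seq to the class whose trained HMM gives it the higher forward likelihood."""
    return "C1" if forward(*model_c1, seq) >= forward(*model_c2, seq) else "C2"
```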

Motivation A slightly different problem Now assume that we have access to more than one output type at all times (e.g. if we measure patient parameters through time, we may simultaneously measure the blood pressure and blood glucose levels). We may attempt to reformulate our output matrix O, such that it has k + 1 dimensions for k output types: O_{x,a,b,c,...} = P(a, b, c, ... | x)... however this becomes intractable fairly quickly, both memory- and numerics-wise... also, many combinations of the outputs may never be seen in the training data.

Motivation Modelling There exist a variety of ways of handling multiple outputs simultaneously; however, most of them do not take into account the potentially nontrivial nature of interactions between these outputs. Worst offender: Naïve/Idiot Bayes. This is where multiplex networks come into play, as a model which has proved efficient at modelling real-world systems. Fundamental idea: we will model each of the outputs separately within separate HMMs, after which we will combine the HMMs into a larger-scale multiplex HMM. The entire structure will still behave as an HMM, so we will be able to classify using the forward algorithm, just as before.

Model description Interlayer edges Therefore, we assume that we have k HMMs (one for each output type) with n nodes each. In each time step, the system is within one of the nodes of one of the HMMs, and may either change the current node (remaining within the same HMM), or change the current HMM (remaining in the same node). Assumption: the multiplex is layer-coupled, i.e. the probabilities of changing the HMM at each time step can be represented with a single matrix, ω (of size k × k). Therefore, ω_{ij} gives the probability of, at any time step, transitioning from the HMM producing the ith output type to the HMM producing the jth output type.

Model description Multiplex HMM parameters Important: while in the ith HMM, we are only interested in the probability of producing the ith output type; we do not consider the other k - 1 types! (They are assumed to be produced with probability 1.) With that in mind, the full system may be observed as an HMM over node-layer pairs (x, i), such that:
π_{(x,i)} = ω_{ii} π^i_x
T_{(a,i),(b,j)} = ω_{ii} T^i_{ab} if i = j; ω_{ij} if a = b and i ≠ j; 0 if a ≠ b and i ≠ j
O_{(x,i),y} = O^i_{x,y_i}, where y is the joint output and y_i its ith component
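
A sketch of flattening k layer HMMs and a layer-coupling matrix ω into the single HMM described above; the function and variable names are mine, not muxstep's, and all layers are assumed to share the same node set of size n, as on the slide.

```python
import numpy as np

def build_multiplex_hmm(pis, Ts, Os, omega):
    """Flatten k layer HMMs (lists of per-layer pi, T, O) and the k x k layer-coupling
    matrix omega into one HMM over node-layer pairs (x, i), following the slide's formulas."""
    k, n = omega.shape[0], Ts[0].shape[0]
    big_pi = np.concatenate([omega[i, i] * pis[i] for i in range(k)])
    big_T = np.zeros((k * n, k * n))
    for i in range(k):
        for j in range(k):
            # Same layer: scaled within-layer transitions; different layer: jump to the same node.
            block = omega[i, i] * Ts[i] if i == j else omega[i, j] * np.eye(n)
            big_T[i * n:(i + 1) * n, j * n:(j + 1) * n] = block
    return big_pi, big_T

def emission_probs(Os, y_vec):
    """P(joint output y_vec | state (x, i)) = O^i_{x, y_vec[i]}: while in layer i,
    only the i-th output type is scored."""
    return np.concatenate([Os[i][:, y_vec[i]] for i in range(len(Os))])
```

Note that when ω is row-stochastic, each row of the flattened transition matrix sums to ω_{ii} + Σ_{j≠i} ω_{ij} = 1, so the result is again a valid HMM and the standard algorithms apply unchanged.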

Model description Example: Chain HMM [Figure: a chain HMM; from a start state the system moves through states x_0, x_1, x_2, ..., x_n, each transition with probability 1, and state x_t emits output y_t with probability O_{t,y_t}.]

Model description Example: Two Chain HMMs [Figure: two such chain HMMs stacked one above the other, one per output type, each with its own states, outputs and emission probabilities.]

Model description Example: Multiplex Chain HMM [Figure: the two chains combined into a multiplex chain HMM; the start state enters the two layers with probabilities π_1 and π_2, within-layer transitions x_t → x_{t+1} carry probabilities ω_11 (upper layer) and ω_22 (lower layer), interlayer edges between a node's two images carry probabilities ω_12 and ω_21, and emissions are as before.]

Training and classification Training Training the individual HMM layers (i.e. determining the parameters π^i, T^i, O^i) may be done as before (by using the Baum-Welch algorithm). Determining the entries of ω is much harder; this matrix specifies the relative dependencies between the processes generating the different output types, which is undetermined for many practical problems! Therefore we adopt an optimisation approach that makes no further assumptions about the problem: within my project, a multi-objective genetic algorithm, NSGA-II, was used to determine optimal values of the entries of ω.
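
As a stand-in illustration only (the project used NSGA-II, a multi-objective genetic algorithm, which is not reproduced here), the sketch below does a naive single-objective random search over row-stochastic ω, scored by the forward likelihood of the training sequences under the flattened model; it reuses build_multiplex_hmm and emission_probs from the previous sketch.

```python
import numpy as np

def fit_omega_random_search(pis, Ts, Os, sequences, trials=200, rng=np.random.default_rng(0)):
    """Pick the row-stochastic omega maximising the total forward log-likelihood of the
    training sequences under the flattened multiplex HMM (single-objective random search,
    standing in for NSGA-II). Each sequence is a list of length-k vectors of output indices."""
    k = len(pis)
    best_omega, best_score = None, -np.inf
    for _ in range(trials):
        omega = rng.random((k, k))
        omega /= omega.sum(axis=1, keepdims=True)        # make each row a distribution
        big_pi, big_T = build_multiplex_hmm(pis, Ts, Os, omega)
        score = 0.0
        for seq in sequences:
            alpha = big_pi * emission_probs(Os, seq[0])  # forward recursion over the flat HMM
            for y_vec in seq[1:]:
                alpha = (alpha @ big_T) * emission_probs(Os, y_vec)
            score += np.log(alpha.sum() + 1e-300)
        if score > best_score:
            best_omega, best_score = omega, score
    return best_omega
```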

Training and classification Classification As mentioned before, for binary classification: we construct two separate models; each model is separately trained over the training sequences belonging to its class; new sequences are classified into the class whose model the forward algorithm assigns the larger likelihood to.

Training and classification Putting it all together [Diagram: the full pipeline; the training set s is split into the sequences (t_1, t_2, ...) belonging to C_1 and those belonging to C_2; for each class, Baum-Welch trains the individual layers and NSGA-II trains the interlayer edges; for an unseen sequence y, the forward algorithm computes P(y | C_1) and P(y | C_2), and the predicted class is C = argmax_{C_i} P(y | C_i).]

Results and implementation Application & results My project has applied this method to classifying patients for breast invasive carcinoma, based on gene expression and methylation data for genes assumed to be responsible: we have achieved a mean accuracy of 94.2% and mean sensitivity of 95.8% after 10-fold cross-validation! This was accomplished without any optimisation efforts: we fixed the number of nodes to n = 4, used the standard NSGA-II parameters without tweaking, and ordered the sequences based on the Euclidean norm of the expression and methylation levels; so we expect further advances to be possible.

Results and implementation Implementation and conclusions You may find the full C++ implementation of this model at https://github.com/petarv-/muxstep. We are currently in the process of publishing a new paper describing basic workflows with the software. Hopefully, more multiplex-related work to come! Developed viral during the Hack Cambridge hackathon; check it out at http://devpost.com/software/viral...

Results and implementation Thank you! Questions?