(using hidden Markov models) University of Cambridge Bioinformatics Group Meeting 11 February 2016
Words of warning Disclaimer These slides have been produced by combining and translating two of my previous slide decks: a five-minute presentation given during my internship at Jane Street, as part of the short expositions by interns series; and a 1.5h lecture given during the Nedelja informatike 1 seminar at my high school (for students gifted in informatics). An implication of this is that it might feel like more of a high-level pitch than a low-level specification of the models involved. I am more than happy to discuss the internals during or after the presentation! (You may also consult the paper distributed by Thomas before the talk.) 1 If anyone would like to visit Belgrade and hold a lecture at this seminar, that'd be really incredible :)
Motivation Why multiplex networks? This talk is an introduction to one of the most popular types of complex networks, as well as its applications to machine learning. Multiplex networks were the central topic of my Bachelor's dissertation @Cambridge; this resulted in a journal publication (Journal of Complex Networks, Oxford). These networks hold significant potential for modelling many real-world systems. My work represents (to the best of my knowledge) the first attempt at developing a machine learning algorithm over these networks, with highly satisfactory results!
Motivation Roadmap 1. We will start off with a few slides that (in)formally define multiplex networks, along with some motivating examples. 2. Then our attention turns to hidden Markov models (HMMs); in particular, we take advantage of the standard HMM algorithms to tackle a machine learning problem. 3. Finally, we will show how the two concepts can be integrated (i.e. how I have integrated them...).
Theoretical introduction Let's start with graphs! Imagine that, within a system containing four nodes, you have concluded that certain pairs of nodes are connected in a certain manner. [Figure: a graph over nodes 0-3] You've got your usual, boring graph; in this context often called a monoplex network.
Theoretical introduction Some more graphs You now notice that, in other frames of reference (examples to come soon), these nodes may be connected in different ways. [Figure: two further graphs over the same nodes 0-3, with different edge sets]
Theoretical introduction Influence Finally, you conclude that these layers of interaction are not independent, but may interact with each other in nontrivial ways (thus forming a "network of networks"). Multiplex networks provide us with a relatively simple way of representing these interactions, by adding new interlayer edges between a node's images in different layers. Revisiting the previous example...
Theoretical introduction Previous example [Figure: the three layers stacked, each a graph over nodes 0-3 with its own edge set]
Theoretical introduction Previous example [Figure: the same system as a graph over node-layer pairs (0, G_1), ..., (3, G_1), (0, G_2), ..., (3, G_2), (0, G_3), ..., (3, G_3), with interlayer edges joining each node's images across layers]
Theoretical introduction We have a multiplex network!
Review of applications Examples Despite the simplicity of this model, a wide variety of real-world systems exhibit natural multiplexity. Examples include: Transportation networks (De Domenico et al.); Genetic networks (De Domenico et al.); Social networks (Granell et al.)
Theoretical introduction Markov chains Let S be a discrete set of states, and {X_n}_{n≥0} a sequence of random variables taking values from S. This sequence satisfies the Markov property if it is memoryless: if the next value in the sequence depends only on the current value; i.e. for all n ≥ 0: P(X_{n+1} = x_{n+1} | X_n = x_n, ..., X_0 = x_0) = P(X_{n+1} = x_{n+1} | X_n = x_n) It is then called a Markov chain. X_t = x_t signifies that the chain is in state x_t (∈ S) at time t.
Theoretical introduction Time homogeneity A common assumption is that a Markov chain is time-homogeneous: that the transition probabilities do not change with time; i.e. for all n ≥ 0: P(X_{n+1} = b | X_n = a) = P(X_1 = b | X_0 = a) Time homogeneity allows us to represent Markov chains with a finite state set S using only a single matrix, T: T_{ij} = P(X_1 = j | X_0 = i) for all i, j ∈ S. It holds that Σ_{x∈S} T_{ix} = 1 for all i ∈ S. It is also useful to define a start-state probability vector π, s.t. π_x = P(X_0 = x).
Theoretical introduction Markov chain example S = {x, y, z} T = [[0.0, 1.0, 0.0], [0.0, 0.5, 0.5], [0.3, 0.7, 0.0]] (rows and columns ordered x, y, z) [Figure: the corresponding transition diagram: x → y with probability 1; y → y and y → z with probability 0.5 each; z → x with probability 0.3 and z → y with probability 0.7]
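As an illustrative aside (not part of the original slides), the example chain above can be simulated in a few lines; a minimal Python sketch, assuming the transition matrix T given on this slide:

```python
import random

# Transition matrix from the example slide, indexed by state name.
STATES = ["x", "y", "z"]
T = {
    "x": {"x": 0.0, "y": 1.0, "z": 0.0},
    "y": {"x": 0.0, "y": 0.5, "z": 0.5},
    "z": {"x": 0.3, "y": 0.7, "z": 0.0},
}

def step(state, rng):
    """Sample the next state from the row T[state]."""
    r, acc = rng.random(), 0.0
    for nxt in STATES:
        acc += T[state][nxt]
        if r < acc:
            return nxt
    return STATES[-1]  # guard against floating-point round-off

def simulate(start, n_steps, seed=0):
    """Walk the chain for n_steps steps from the given start state."""
    rng = random.Random(seed)
    chain = [start]
    for _ in range(n_steps):
        chain.append(step(chain[-1], rng))
    return chain

# From x, the chain must move to y, since T[x][y] = 1.
print(simulate("x", 5))
```

Note how the structure of T is directly visible in the samples: every visit to x is immediately followed by y.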
Theoretical introduction Hidden Markov Models A hidden Markov model (HMM) is a Markov chain in which the state sequence may be unobservable (hidden). This means that, while the Markov chain parameters (e.g. transition matrix and start-state probabilities) are still known, there is no way to directly determine the state sequence {X_n}_{n≥0} the system will follow. What can be observed is an output sequence produced at each time step, {Y_n}_{n≥0}. The output sequence can assume any value from a given set of outputs, O. Here we will assume O to be discrete, but it is easily extendable to the continuous case (GMHMMs, as used in the paper), and all the standard algorithms retain their usual form.
Theoretical introduction Further parameters It is assumed that the output at any given moment depends only on the current state; i.e. for all n ≥ 0: P(Y_n = y_n | X_n = x_n, ..., X_0 = x_0, Y_{n-1} = y_{n-1}, ..., Y_0 = y_0) = P(Y_n = y_n | X_n = x_n) Assuming time homogeneity on P(Y_n = y_n | X_n = x_n) as before, the only additional parameter needed to fully specify an HMM is the output probability matrix, O, defined as O_{xy} = P(Y_0 = y | X_0 = x)
Theoretical introduction HMM example S = {x, y, z} O = {a, b, c} T = [[0.0, 1.0, 0.0], [0.0, 0.5, 0.5], [0.3, 0.7, 0.0]] O = [[0.9, 0.1, 0.0], [0.0, 0.6, 0.4], [0.0, 0.0, 1.0]] [Figure: the transition diagram of the chain, annotated with output probabilities: x emits a with probability 0.9 and b with probability 0.1; y emits b with probability 0.6 and c with probability 0.4; z emits c with probability 1]
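To make the hidden/observed distinction concrete, here is a small sketch (not from the slides) that samples both sequences from the example HMM above, using the T and O matrices just given:

```python
import random

STATES = ["x", "y", "z"]
OUTPUTS = ["a", "b", "c"]

# Parameters from the example slide.
T = [[0.0, 1.0, 0.0],
     [0.0, 0.5, 0.5],
     [0.3, 0.7, 0.0]]
O = [[0.9, 0.1, 0.0],
     [0.0, 0.6, 0.4],
     [0.0, 0.0, 1.0]]

def sample_categorical(probs, rng):
    """Sample an index according to the given probability row."""
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

def generate(start_state, n_steps, seed=0):
    """Walk the hidden chain, emitting one output per visited state."""
    rng = random.Random(seed)
    s = STATES.index(start_state)
    hidden, observed = [], []
    for _ in range(n_steps):
        hidden.append(STATES[s])
        observed.append(OUTPUTS[sample_categorical(O[s], rng)])
        s = sample_categorical(T[s], rng)
    return hidden, observed

hidden, observed = generate("z", 6)
print(hidden, observed)
```

Only the `observed` list is visible to us in the HMM setting; note that whenever the hidden state is z, the output is necessarily c (O_{zc} = 1).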
Theoretical introduction Learning and inference There are three main problems that one may wish to solve on a (GM)HMM, and each can be addressed by a standard algorithm:
Theoretical introduction Learning and inference Probability of an observed output sequence. Given an output sequence, {y_t}_{t=0}^T, determine the probability that it was produced by the given HMM Θ, i.e. P(Y_0 = y_0, ..., Y_T = y_T | Θ) (1) This problem is efficiently solved with the forward algorithm.
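For the curious, the forward algorithm can be sketched in a few lines of Python; the toy parameters below are hypothetical, purely for illustration:

```python
def forward(pi, T, O, obs):
    """P(Y_0, ..., Y_T = obs | model), by dynamic programming.

    pi[i]   : start-state probability of state i
    T[i][j] : transition probability i -> j
    O[i][k] : probability that state i emits output k
    obs     : observed sequence, as output indices
    """
    n = len(pi)
    # alpha[i] = P(obs so far, current state = i)
    alpha = [pi[i] * O[i][obs[0]] for i in range(n)]
    for y in obs[1:]:
        alpha = [sum(alpha[i] * T[i][j] for i in range(n)) * O[j][y]
                 for j in range(n)]
    return sum(alpha)

# Toy two-state model (hypothetical parameters, not from the slides):
pi = [1.0, 0.0]
T = [[0.7, 0.3], [0.4, 0.6]]
O = [[0.9, 0.1], [0.2, 0.8]]
print(forward(pi, T, O, [0, 1]))  # → 0.279 (up to round-off)
```

The dynamic programming makes this O(T n^2) rather than the exponential cost of summing over all state paths. (Practical implementations rescale alpha or work in log space to avoid underflow on long sequences.)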
Theoretical introduction Learning and inference Most likely sequence of states for an observed output sequence. Given an output sequence, {y_t}_{t=0}^T, determine the most likely sequence of states, {x̂_t}_{t=0}^T, that produced it within a given HMM Θ, i.e. {x̂_t}_{t=0}^T = argmax_{{x_t}_{t=0}^T} P({x_t}_{t=0}^T | {y_t}_{t=0}^T, Θ) (2) This problem is efficiently solved with the Viterbi algorithm.
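Similarly, a minimal Viterbi sketch; it is the same recursion as the forward algorithm with the sum replaced by a max, plus backpointers (toy parameters again hypothetical):

```python
def viterbi(pi, T, O, obs):
    """Most likely hidden state sequence for obs (max-product DP)."""
    n = len(pi)
    # delta[i] = max probability of any state path ending in i,
    # having emitted the outputs seen so far
    delta = [pi[i] * O[i][obs[0]] for i in range(n)]
    back = []
    for y in obs[1:]:
        prev, delta, ptr = delta, [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: prev[i] * T[i][j])
            delta.append(prev[best_i] * T[best_i][j] * O[j][y])
            ptr.append(best_i)
        back.append(ptr)
    # Trace the best path backwards from the most likely final state.
    path = [max(range(n), key=lambda i: delta[i])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

# Toy two-state model (hypothetical parameters, not from the slides):
pi = [1.0, 0.0]
T = [[0.7, 0.3], [0.4, 0.6]]
O = [[0.9, 0.1], [0.2, 0.8]]
print(viterbi(pi, T, O, [0, 1]))  # → [0, 1]
```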
Theoretical introduction Learning and inference Adjusting the model parameters. Given an output sequence, {y_t}_{t=0}^T, and an HMM Θ, produce a new HMM Θ′ that is more likely to produce that sequence, i.e. P({y_t}_{t=0}^T | Θ′) ≥ P({y_t}_{t=0}^T | Θ) (3) This problem is efficiently solved with the Baum-Welch algorithm.
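A single Baum-Welch re-estimation step can be sketched as follows. This is a plain, unscaled version suitable only for short sequences (real implementations rescale or work in log space), and the toy parameters used below are hypothetical:

```python
def forward_backward(pi, T, O, obs):
    """Compute the forward (alpha) and backward (beta) tables."""
    n, L = len(pi), len(obs)
    alpha = [[0.0] * n for _ in range(L)]
    beta = [[0.0] * n for _ in range(L)]
    for i in range(n):
        alpha[0][i] = pi[i] * O[i][obs[0]]
    for t in range(1, L):
        for j in range(n):
            alpha[t][j] = sum(alpha[t-1][i] * T[i][j]
                              for i in range(n)) * O[j][obs[t]]
    for i in range(n):
        beta[L-1][i] = 1.0
    for t in range(L - 2, -1, -1):
        for i in range(n):
            beta[t][i] = sum(T[i][j] * O[j][obs[t+1]] * beta[t+1][j]
                             for j in range(n))
    return alpha, beta

def baum_welch_step(pi, T, O, obs):
    """One EM re-estimation step; the likelihood never decreases."""
    n, m, L = len(pi), len(O[0]), len(obs)
    alpha, beta = forward_backward(pi, T, O, obs)
    lik = sum(alpha[L-1][i] for i in range(n))
    # gamma[t][i] = P(X_t = i | obs);
    # xi[t][i][j] = P(X_t = i, X_{t+1} = j | obs)
    gamma = [[alpha[t][i] * beta[t][i] / lik for i in range(n)]
             for t in range(L)]
    xi = [[[alpha[t][i] * T[i][j] * O[j][obs[t+1]] * beta[t+1][j] / lik
            for j in range(n)] for i in range(n)] for t in range(L - 1)]
    new_pi = gamma[0][:]
    new_T = [[sum(xi[t][i][j] for t in range(L-1)) /
              sum(gamma[t][i] for t in range(L-1)) for j in range(n)]
             for i in range(n)]
    new_O = [[sum(gamma[t][i] for t in range(L) if obs[t] == k) /
              sum(gamma[t][i] for t in range(L)) for k in range(m)]
             for i in range(n)]
    return new_pi, new_T, new_O

# Toy two-state model (hypothetical parameters, not from the slides):
pi = [0.6, 0.4]
T = [[0.7, 0.3], [0.4, 0.6]]
O = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 1, 0, 0, 1]
pi2, T2, O2 = baum_welch_step(pi, T, O, obs)
```

Iterating this step converges to a local maximum of the likelihood, which is why the slides speak of doing "a sufficient number of iterations" during training.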
Problem setup Supervised learning One of the most common kinds of machine learning is supervised learning; the task is to construct a learning algorithm that will, upon observing a set of data with known labels (training data), construct a function capable of determining labels for, thus far, unseen data (test data). [Diagram: training data s → learning algorithm L → labelling function h = L(s); unseen data x → h(x) → label y]
Problem setup Binary classification The simplest example of a problem solvable via supervised learning is binary classification: given two classes (C_1 and C_2), determining which class an input x belongs to. The applications of this are widespread: Diagnostics (does a patient have a given disease, based on glucose levels, blood pressure, and similar measurements?); Credit approval (is a person expected to repay their loan, based on their financial history?); Trading (should a stock be bought or sold, depending on previous price movements?); Autonomous driving (are the driving conditions too dangerous for self-driving, based on meteorological data?); ...
Classification via HMMs Classification via HMMs The training data consists of sequences for which class membership (in C_1 or C_2) is known. We may construct two separate HMMs: one producing all the training sequences belonging to C_1, the other producing all the training sequences belonging to C_2. The models can be trained by doing a sufficient number of iterations of the Baum-Welch algorithm over the sequences from the training data belonging to their respective classes. After constructing the models, classifying new sequences is simple: employ the forward algorithm to determine whether it is more likely that a new sequence was produced by the C_1 model or the C_2 model.
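The decision rule just described can be sketched as follows; the two "trained" models below are hypothetical stand-ins for HMMs that Baum-Welch would produce:

```python
def forward(pi, T, O, obs):
    """Likelihood of obs under the HMM (pi, T, O)."""
    n = len(pi)
    alpha = [pi[i] * O[i][obs[0]] for i in range(n)]
    for y in obs[1:]:
        alpha = [sum(alpha[i] * T[i][j] for i in range(n)) * O[j][y]
                 for j in range(n)]
    return sum(alpha)

def classify(models, obs):
    """Assign obs to the class whose trained HMM gives it the
    highest likelihood (maximum-likelihood decision rule)."""
    return max(models, key=lambda c: forward(*models[c], obs))

# Two hypothetical, already-trained models: class "A" tends to start
# in a state that prefers output 0, class "B" in one preferring 1.
models = {
    "A": ([1.0, 0.0], [[0.9, 0.1], [0.1, 0.9]], [[0.9, 0.1], [0.2, 0.8]]),
    "B": ([0.0, 1.0], [[0.9, 0.1], [0.1, 0.9]], [[0.9, 0.1], [0.2, 0.8]]),
}
print(classify(models, [0, 0, 0]))  # → A
```

In practice one would compare log-likelihoods (for numerical stability) and could equally use more than two classes: `max` simply picks the argmax over all models.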
Motivation A slightly different problem Now assume that we have access to more than one output type at all times (e.g. if we measure patient parameters through time, we may simultaneously measure the blood pressure and blood glucose levels). We may attempt to reformulate our output matrix O, such that it has k + 1 dimensions for k output types: O_{x,a,b,c,...} = P(a, b, c, ... | x) ... however this becomes intractable fairly quickly, both memory- and numerics-wise... also, many combinations of the outputs may never be seen in the training data.
Motivation Modelling There exists a variety of ways of handling multiple outputs simultaneously; however, most of them do not take into account the potentially nontrivial nature of interactions between these outputs. Worst offender: Naïve/Idiot Bayes. This is where multiplex networks come into play, as a model which has proven effective in modelling real-world systems. Fundamental idea: we will model each of the outputs separately within separate HMMs, after which we will combine the HMMs into a larger-scale multiplex HMM. The entire structure will still behave as an HMM, so we will be able to classify using the forward algorithm, just as before.
Model description Interlayer edges Therefore, we assume that we have k HMMs (one for each output type) with n nodes each. In each time step, the system is within one of the nodes of one of the HMMs, and may either: change the current node (remaining within the same HMM), or change the current HMM (remaining in the same node). Assumption: the multiplex is layer-coupled; the probabilities of changing the HMM at each timestep can be represented with a single matrix, ω (of size k × k). Therefore, ω_{ij} gives the probability of, at any time step, transitioning from the HMM producing the ith output type to the HMM producing the jth output type.
Model description Multiplex HMM parameters Important: while in the ith HMM, we are only interested in the probability of producing the ith output type; we do not consider the other k − 1 types! (they are assumed to be produced with probability 1) With that in mind, the full system may be observed as an HMM over node-layer pairs (x, i), such that: π_{(x,i)} = ω_{ii} π^i_x; T_{(a,i),(b,j)} = 0 if a ≠ b and i ≠ j, ω_{ij} if a = b and i ≠ j, ω_{ii} T^i_{ab} if i = j; O_{(x,i),y} = O^i_{xy} for all i.
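The construction above can be sketched directly from these formulas. The flattening of a node-layer pair (x, i) to the index i·n + x is my own convention, and for simplicity the sketch copies each layer's output matrix verbatim (per the slide's convention that only the ith output type is scored in layer i); it assumes all layers share an output alphabet size:

```python
def multiplex_hmm(layers, omega):
    """Combine k layer HMMs into one flat HMM over node-layer pairs.

    layers : list of (pi, T, O) triples, one per layer, n states each
    omega  : k x k interlayer transition matrix (layer-coupled)
    A pair (x, i) is flattened to index i * n + x.
    """
    k = len(layers)
    n = len(layers[0][0])
    N = n * k
    big_pi = [0.0] * N
    big_T = [[0.0] * N for _ in range(N)]
    big_O = [[0.0] * len(layers[0][2][0]) for _ in range(N)]
    for i, (pi, T, O) in enumerate(layers):
        for x in range(n):
            s = i * n + x
            # Start probabilities, following the slide's formula verbatim.
            big_pi[s] = omega[i][i] * pi[x]
            big_O[s] = list(O[x])
            for j in range(k):
                if j == i:
                    # Move within layer i: omega_ii * T^i_ab.
                    for b in range(n):
                        big_T[s][i * n + b] = omega[i][i] * T[x][b]
                else:
                    # Switch layer, staying at the same node: omega_ij.
                    big_T[s][j * n + x] = omega[i][j]
    return big_pi, big_T, big_O

# Two hypothetical 2-state layers and an interlayer matrix omega:
layer1 = ([0.5, 0.5], [[0.5, 0.5], [0.5, 0.5]], [[1.0, 0.0], [0.0, 1.0]])
layer2 = ([0.5, 0.5], [[0.5, 0.5], [0.5, 0.5]], [[0.5, 0.5], [0.5, 0.5]])
omega = [[0.8, 0.2], [0.3, 0.7]]
big_pi, big_T, big_O = multiplex_hmm([layer1, layer2], omega)
```

A useful sanity check: if each layer's T and omega are row-stochastic, every row of the combined transition matrix sums to 1 (ω_{ii}·Σ_b T^i_{ab} + Σ_{j≠i} ω_{ij} = 1), so the result behaves as an ordinary HMM and the forward algorithm applies unchanged.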
Model description Example: Chain HMM [Figure: a chain HMM: start → x_0 → x_1 → x_2 → ... → x_n, each transition taken with probability 1; state x_t emits output y_t with probability O_{t,y_t}]
Model description Example: Two Chain HMMs [Figure: two parallel chain HMMs as above: one over states x_0, ..., x_n emitting y_0, ..., y_n, and one over states x′_0, ..., x′_n emitting y′_0, ..., y′_n, each with transition probability 1 along the chain]
Model description Example: Multiplex Chain HMM [Figure: the two chain HMMs combined into a multiplex: from the start, layer 1 is entered with probability π_1 and layer 2 with probability π_2; each step along layer 1 has probability ω_11 and along layer 2 probability ω_22, while switching layers at the same node has probability ω_12 or ω_21]
Training and classification Training Training the individual HMM layers (i.e. determining the parameters π^i, T^i, O^i) may be done as before (by using the Baum-Welch algorithm). Determining the entries of ω is much harder; this matrix specifies the relative dependencies between the processes generating the different output types, which are undetermined for many practical problems! Therefore we adopt an optimisation approach that makes no further assumptions about the problem; within my project, a multiobjective genetic algorithm, NSGA-II, was used to determine optimal values of the entries of ω.
Training and classification Classification As mentioned before, for binary classification: We construct two separate models; Each model is separately trained over the training sequences belonging to its class; New sequences are classified into the class for whose model the forward algorithm assigns a larger likelihood.
Training and classification Putting it all together [Diagram: the training set s is split into the sequences (t_1, t_2, ...) belonging to C_1 and those belonging to C_2; for each class, Baum-Welch trains the individual layers (layer 1, layer 2, ...), then NSGA-II trains the interlayer edges; for an unseen sequence y, the forward algorithm computes P(y | C_1) and P(y | C_2), and the output is C = argmax_{C_i} P(y | C_i)]
Results and implementation Application & results My project has applied this method to classifying patients for breast invasive carcinoma, based on gene expression and methylation data for genes assumed to be responsible: we have achieved a mean accuracy of 94.2% and mean sensitivity of 95.8% after 10-fold cross-validation! This was accomplished without any optimisation efforts: Fixed the number of nodes to n = 4; Used the standard NSGA-II parameters without tweaking; Ordered the sequences based on the Euclidean norm of the expression and methylation levels. ...so we expect further advances to be possible.
Results and implementation Implementation and conclusions You may find the full C++ implementation of this model at https://github.com/petarv-/muxstep. We are currently in the process of publishing a new paper describing basic workflows with the software. Hopefully, more multiplex-related work to come! Developed viral during the Hack Cambridge hackathon; check it out at http://devpost.com/software/viral...
Results and implementation Thank you! Questions?