Learning in graphical models & MC methods Fall Cours 8 November 18

Size: px

Start display at page:

Download "Learning in graphical models & MC methods Fall Cours 8 November 18"

Gabriel McDowell
5 years ago
Views:

1 Learning in graphical models & MC methods Fall 2015 Cours 8 November 18 Enseignant: Guillaume Obozinsi Scribe: Khalife Sammy, Maryan Morel 8.1 HMM (end) As a reminder, the message propagation algorithm for Hidden Marov Models requires 2 recursions to compute α t (z t ) := p(z t, y 1,..., y t ) and β t (z t ) := p(y t+1,..., y T z t ): α t+1 (z t+1 ) = p (y t+1 z t+1 ) z t p (z t+1 z t ) α t (z t ) (8.1) β t (z t ) = z t+1 p (y t+1 z t+1 ) p (z t+1 z t ) β t+1 (z t+1 ) (8.2) Addressing practical implementation issues Since α t and β t are respectively joint probabilities of t + 1 and T t variables they tend to become exponentially small respectively for t large and t small. A naive implementation of the forward-bacward algorithm therefore typically leads to rounding errors. It is therefore necessary to wor on a logarithmic scale. So when considering operations on quantities say a 1,..., a n whose logarithms are l i = log(a i ), the log of the product is easily computed as l Π = log i a i = i l i and the log of the sum can be computed with the smallest amount of numerical errors by factoring the largest element. Precisely if i = arg max i a i and l = log a i then l Σ = log i a i = log i exp(l i ) = log [ exp(l ) i ] exp(l i l ) ( = l +log 1+ ) exp(l i l ) i i provides a stable way of computing the logarithm of the sum. For hidden Marov models, remember that the max-product (aa Viterbi) algorithm allows to compute the most probable sequence for hidden states. 8.2 Multiclass classification We return briefly to classification to mention two simple yet classical and useful models for multi-class classification: the naive Bayes model and the multiclass logistic regression. We consider classification problems where the input data is in X = R p and the output variable is a binary indicator in Y = {y {0, 1} K y y K = 1}. 8-1

2 8.2.1 Naive Bayes classifier The naive Bayes classifier is relevant when modeling the joint distribution of p(x y) is too complicated. We will present it the special case where the input data is a vector of binary random variable. X i : Ω {0, 1} p A practical example of classification problem in this setting is the problem of classification of documents based on a bag of word representation. In the bag-of-word approach, a document is represented as a long binary vector which indicates for each word of a reference dictionary whether that word is present in the document considered or not. So the document i would be represented by a vector x i {0, 1} p, with x i j = 1 iff word j of the dictionary is present in the ith document. As we saw in the second lecture, it is possible to approach the problem using directly a conditional model of p(y x) or using a generative model of the joint distribution modeling separately p(y) and p(x y) and computing p(y x) using Bayes rule. The naive Bayes model is an instance of a generative model. By contrast the multi class logistic regression of the following section is an example of a conditional model. Y i is naturally modeled as a multinomial distribution with p(y i ) = K =1 πyi. However p(x i y i ) = p(x i 1,..., x i p y i ) has a priori 2 p 1 parameters. The ey assumption made in the naive Bayes model is that X1, i..., Xp i are all independent conditionally on Y i. This assumption is not realistic and simplistic, hence the term naive". This assumption is clearly not satisfied in practice for documents where one would expect that there would be correlations between words that are not just explained by a document category. The corresponding modeling strategy is nonetheless woring well in practice. These conditional independence assumptions correspond to the following graphical model: Y i X i 1 X i 2... X i p The distribution of Y i is a multinomial distribution which we parameterize with (π 1,..., π K ), and we write µ j = P (X (i) j = 1 Y (i) = 1) We then have p p(x i = x i, Y i = y i ) = p(x i, y i ) = p(x i y i )p(y i ) = p(x i j y i )p(y i ) which leads to j=1 =1 j=1 [ p K ] K p(x i, y i x ) = µ i j yi j (1 µj ) (1 xi j )yi =1 π yi 8-2

3 and log p(x i, y i ) = K ( p ( x i j y i log µ j + (1 x i j)y i log(1 µ j ) ) + y i log(π )) =1 j=1 We can then use Bayes rule (hence the Bayes in Naive Bayes ), which leads to with η(x) = (η 1 (x),..., η K (x)) R K and log p(y i x i ) = η(x i ) y i A(η(x i )) η (x) = w x + b, w R p, [w ] j = log µ j 1 µ j, b = log π. Note that, in spite of the name the naive Bayes classifier is not a Bayesian approach to classification. Multiclass logistic regression In the light of the course on exponential families, the logistic regression model can be seen as resulting from a linear parameterization as a function of x of the natural parameter η(x) of the Bernoulli distribution corresponding to the conditional distribution of Y given X = x. Indeed for binary classification, we have that Y X = x Ber(µ(x)) and in the logistic regression model we set µ(x) = exp(η(x) A(η(x))) = (1+exp( η(x)) 1 and η(x) = w x+b. It is then natural to consider the generalization to a multiclass classification setting. In that case, Y X = x is multinomial distribution with natural parameters (η 1 (x),..., η K (x)). To again parameterize them linearly as a function of x, we need to introduce parameters w R p and b R, for all 1 K and set η (x) = w x + b. We then have P(Y = 1 X = x) = exp(η (x) A(η(x))) = e η (x) K =1 eη (x) = e w x+b K =1 ew x+b, and thus log P(Y = y X = x) = K [ K y (w x + b ) log =1 =1 ] e w x+b. Lie for binary logistic regression, the maximum lielihood principle can used to learn (w, b ) 1 K using numerical optimization methods such as the IRLS algorithm. Note that the form of the parameterization obtained is the same as for the Naive Bayes model; however, the Naive Bayes model is learnt as a generative model, while the logistic regression is learnt as conditional model. We have not taled about the multi class generalization of Fisher s linear discriminant. It exists as well as the multi class counterpart of the model seen for binary regression. It relies lie in the binary case on the assumption that p(x y) is Gaussian. This is good exercise to derive it. 8-3

4 8.3 Learning on graphical models ML principle for general Graphical Models Directed graphical model Proposition : Let G be a directed graph with p nodes. Assume that (X 1,...X n ) are i.i.d., with p features : i.e i {1,.., n}, X i R p, and that are fully observed, i.e., there is no latent or hidden variable among them. Then the ML principle decouples in p optimisation problems. Proof : Let us assume we have a decoupled model P Θ, i.e. : P Θ := { p θ (x) = j p(x j x πj, θ j ) θ = (θ 1,..., θ p ) Θ = Θ 1... Θ p } L(θ) = n p(x i θ) = l(θ) = p j=1 n j=1 p p(x i j x i π j, θ j ) n log p(x i j x i πj, θ j ). Then the ML principle reduces to solving p optimization problems of the form max θ j l j (θ j ) s.t θ j Θ j, with l j (θ j ) := Undirected graphical model n log p(x i j x i π j, θ j ). The ML problem is convex with respect to canonical parameters if: the data is fully observed (no latent or hidden variable), and the parameters are decoupled. In general, if the data is not fully observed, the EM scheme or similar scheme is used. If the parameters are coupled, the problem remains convex in some cases (e.g linear coupling), but not in general. If the model is a tree, one can reformulate the model as a directed tree to get bac to the directed case. In general, to compute the gradient of the log partition function and thus to compute the gradient of the log-lielihood, it is necessary to perform probabilistic inference on the model (i.e. to compute A(θ) = µ(θ) = E θ [φ(x)]). If the model is a tree, this can be done with the sum-product algorithm and if the model is a close to a tree, the junction tree theory can be leveraged to perform probabilistic inference; however in general probabilistic inference is NP-hard and so one needs to use approximate probabilistic inference techniques. 8-4

5 8.4 Approximate inference with Monte Carlo methods Sampling methods We often need to compute the expectation of a function f under some distribution p that cannot be computed. Let X be a random variable following the distribution p, we want to compute µ = E[f(X)]. Example For X = (X 1,..., X p ) the vector of variables corresponding to a graphical model, f(x) = δ(x A = x A ) E[f(X)] = P(X A = x A ) If we now how to sample from p, we can use the following method : Algorithm 1 Monte Carlo Estimation 1: Draw X (1),..., X (n) i.i.d. 2: ˆµ = 1 n n f(x(i) ) p This method relies on the two following propositions : Proposition 8.1 (Law of Large Numbers (LLN)) ˆµ a.s. µ if µ < Proposition 8.2 (Central Limit Theorem (CLT)) For X a scalar random variable, if Var(f(X)) = σ 2 <, then n(ˆµ µ) D N (0, σ 2 ) thus E( ˆµ µ 2 2) = σ2 n How to sample from a specific distribution? 1. Uniform distribution on [0, 1] : use rand 2. Bernoulli distribution of parameter p : X = 1 {U<p} with U U([0, 1]) 3. Using inverse transform sampling : x R F (x) = x p(t)dt = P(X [, x]) X = F 1 (U) avec U U([0, 1]) Proof P(X y) = P(F 1 (U) y) = P(U F (y)) = F (y) 8-5

6 Example Exponential distribution (one of the rare cases admitting an explicit inverse CDF 1 ) p(x) = λe λx 1 R+ (x) Ancestral sampling X = 1 λ ln(u) Consider the problem of sampling from a directed graphical model, whose distribution taes the form d p(x 1,..., x d ) = p(x i x πi ). We assume, without loss of generality, that the variables are indexed in a topological order. Consider the following algorithm Algorithm 2 Ancestral sampling for i = 1 to d do Draw z i from P(X i =. X πi = z πi ) end for return (z 1,..., z d ) Proposition 8.3 The random variable (Z 1,..., Z d ) returned by the ancestral sampling algorithm follows exactly the distribution p(x 1,..., x d ) = P(X 1 = x 1,..., X d = x d ). Proof We prove the result by induction. It is clearly obvious for a graph with a single node. For two nodes corresponding the pair of variable (X 1, X 2 ), then either X 1 and X 2 are independent and we are bac to the single node case. Or π 2 = {1} and then if z 1 is drawn from p X1 and, given the value z 1 obtained, z 2 is drawn from the conditional distribution p X2 X 1 ( z 1 ), then the pair (z 1, z 2 ) follows the joint distribution p X1,X 2. Now, assuming the result is true for n 1 nodes we prove that it is also true for n nodes. First note that after sampling z 1,..., z n 1 according to the algorithm z 1,..., z n 1 follows the distribution given by n 1 p(x i x πi ) which is exactly the marginal distribution of (X 1,..., X n 1 ). But then z n is drawn according to the distribution P(X n = X πn = z πn ) which by the Marov property is equal to P ( X n = (X 1,..., X n 1 ) = (z 1,..., z n 1 ) ). Setting X 2 = X n and X 1 = (X 1,..., X n 1 ), we see that we are essentially bac to the two nodes case, and that clearly (z 1,..., z n 1 ) is indeed drawn from the distribution of the random variable (X 1,..., X n ). By induction the result is proven. 1 Cumulative Distribution Function 8-6

7 8.4.3 Rejection sampling Assume that p(x) is the density of x with respect to some measure µ (typically the Lebesgue measure for a continuous random variable and the counting measure for a discrete variable) is nown up to a constant p(x) = p(x) Z p Assume that we can construct and compute q such that p(x) < q (x) with q a probability distribution. Assume we can sample from q. We define the rejection sampling algorithm as : Algorithm 3 Rejection Sampling Algorithm 1: Draw X from q 2: Accept X with probability p(x) q [0, 1], otherwise, reject the sample (x) Proposition 8.4 Accepted draws from rejection sampling follow exactly the distribution p. Proof We write the proof for the case of a discrete random variable X. and so that P(X = x, X is accepted) = P(X = x, X is accepted) = P(X is accepted X = x)p(x = x) = p(x) q(x) q(x) = p(x) P(X is accepted) = x p(x) P(X = x X is accepted) = p(x) = Z p = p(x). Z p To write the general version of this proof formally for any random variable (continuous or not) that has a density with respect to a measure µ, we would need to define Y to be the Bernoulli random variable such that {Y = 1} = {X is accepted}, and to consider p X,Y the joint density of (X, Y ) with respect to the product measure µ ν, where ν is the counting measure on {0, 1}. The computations of p X,Y (x, y) and p X Y (x, 1) then lead to the exact same calculations as above, but with less transparent notations. 8-7

8 Remar In practice, finding q and such that acceptance has a reasonably large probability is hard, because it requires to find a fairly tight bound on p(x) over the entire space Importance Sampling Assume X p. We aim at compuing the expectation of a function f : E p (f(x)) = f(x)p(x)dx w(y i ) = p(y j) q(y j ) Thus we get Lemme 8.5 If x, f(x) M, Proof f(x)p(x) = q(x)dx q(x) ( = E q f(y ) p(y ) ) with Y q q(y ) = E q (g(y )) 1 n g(y j ) n iid with Y j q = 1 n j=1 n j=1 f(y j ) p(y j) q(y j ) are called importance weights. Remember that µ = E p (f(x)) ˆµ = 1 n n f(x i ) E(ˆµ) = 1 f(x) p(x) n q(x) q(x)dx = V ar(ˆµ) = 1 ( ) f(x)p(x) n V ar q(x). q(x) V ar(ˆµ) M 2 V ar(ˆµ) = n 1 n V ar q(x) 1 n M 2 n p(x) 2 q(x) dx. f(x) 2 p(x) q(x) 2 ( ) f(x)p(x) q(x) q(x)dx p(x) 2 q(x) dx. f(x)p(x)dx

9 Remar p(x) 2 q(x) dx = = p 2 (x) 2p(x)q(x) + q 2 (x) 2p(x)q(x) q 2 (x) dx + dx q(x) q(x) (p(x) q(x)) 2 dx +1 q(x) }{{} χ 2 divergence between p and q. Hence, importance sampling will give good results if q has mass where p has. Indeed, if for some y, q(y) p(y), importance weights V ar(ˆµ) may be very large. Extension of Importance Sampling Assume we only now p and q up to a constant : p(x) = p(x) Z p and q(x) = q(x) Z p, and only p(x) and q(x) are nown. ( E f(y ) p(y ) ) q(y ) Tae f to be a constant, we get ( = E f(y ) p(y ) q(y ) ˆ µ = 1 n n Z p Z q f(y i ) p(y i) q(y i ) ) = µ Z p Z q a.s. µ Z p Z q Ẑ p/q = 1 n ˆµ = ˆ µ n Ẑ p/q p(y i ) q(y i ) a.s. µ a.s. Z p Z q Remar Even if Z p = Z q = 1, renormalizing by Ẑp/q often improves the estimation. 8.5 Marov Chain Monte Carlo (MCMC) Unfortunately, the previous techniques are often insufficient, especially for complex multivariate distributions, so that it is not possible to draw exactly from the distribution of interest or to obtain a reasonably good estimates based on importance sampling. The idea 8-9

10 of MCMC is that in many cases, even though it is not possible to sample directly from a distribution of interest, it is possible to construct a Marov Chain of samples X 0, X 1,... whose distribution q t (x) = p(x t = x) converges to a target distribution p(x). The idea is then that if T 0 is sufficiently large, we can consider that for all t T 0, X t follows approximately the distribution p and that 1 T T 0 T t=t 0 +1 f(x t ) 1 T T 0 T t=t 0 +1 f(x t ) E p [f(x)] Note that there is a double approximation: one due to the use of the law of large numbers and the second due to the approximation q t p for t sufficiently large. Note also that the draws of X t, X t+1 etc are not independent (but this is not necessary here to have a law of large numbers). The times before T 0 is often called the burn in period. The most classical procedure to obtain such a Marov Chain in the context of graphical models is called Gibbs sampling. We will see it in more details later. N.B.: In this whole section we write X t instead of X (t) which would match better with other sections. Indeed, here X t should be thought typically as the whole vector of variables corresponding to a graphical model X t = (X t,j ) 1 j d. We write t as an index just to simplify notations. In the rest of this section we will assume that we wor with random variables taing values in a set X with X = K <. However K is typically very large since it corresponds to all the configurations that the set of variables of a graphical model can tae Review of Marov chains Consider an order 1 homogenous Marov chain, i.e. such that for all t, P(X t = y X t 1 = x) = P(X t 1 = y X t 2 = x) Definition 8.6 (Time Homogenous Marov chain) t 0, (x, y) X, p(x t+1 = y X t = x, X t 1,..., X 0 ) = p(x t+1 = y X t = x) = p(x 1 = y X 0 = x) = S(x, y) Definition 8.7 (Transition matrix) Let = card(x ) <. We define the matrix S R such that x, y X, S(x, y) = P(X t = y X t 1 = x). S is called transition matrix of the Marov chain (X ). Properties If = card(x ) <, then: x, y X, S(x, y)

11 S1 = 1 (i.e. row sums are equal to 1) S is a stochastic matrix Definition 8.8 (Stationary Distribution) The distribution π on X is stationary if S T π = π with π = ( π(x) ) or equivalently y X, π(y) = π(x)s(x, y). x X x X If P(X n = x) = π(x) with π a stationary distribution of S, then we have P(X n+1 = y) = x P(X n+1 = y X n = x)p(x n = x) = x S(x, y)π(x) = π(y) Theorem 8.9 (Perron-Frobenius) Every stochastic matrix S has at least one stationary distribution. Let S m (x, y) := P(X t+m = y X t = x). Definition 8.10 (Irreducible Marov Chain) A Marov chain is irreducible if x, y X, m N, S m (x, y) > 0. Definition 8.11 (Period of a state) The greatest common divider of the elements in the set {m > 0 S m (x, x) > 0} is called the period of a state. When the period is equal to 1 the state is said to be aperiodic. Definition 8.12 (Aperiodic Marov Chain) If all the states of a Marov chain are aperiodic the chain is said to be aperiodic. Definition 8.13 (Regular Marov Chain) A Marov chain is regular if x, y X, S(x, y) > 0. Remar A regular Marov chain is clearly irreducible aperiodic. The converse is not true. Proposition 8.14 If a Marov chain on a finite state space is irreducible and aperiodic, then its transition matrix has a unique stationary distribution π and for any initial distribution q 0 on X 0, if q t ( ) = P(X t = ), then q t π Let q n be the distribution of X n, then for all t + distribution q 0 we get q n π Remar If the state space is not finite, an additional assumption is needed on the Marov chain: that it is recurrent positive. We do not define this notion in this course. 8-11

12 We want to construct a irreducible aperiodic transition S whose stationary distribu- Goal tion is. π(x) = 1 ψ c (x c ) Z c C Definition 8.15 (Detailed Balance) A Marov chain is reversible if for the transition matrix S, π, x, y X, π(x)s(x, y) = π(y)s(y, x) This equation is called the detailed balance equation. It can be reformulated P(X t+1 = y, X t = x) = P(X t+1 = x, X t = y) Proposition 8.16 If π satisfies detailed balance, then π is a stationary distribution. Proof S(x, y)p(x) = x x p(y)s(y, x) = p(y) x S(y, x) = p(y) Metropolis-Hastings Algorithm Proposal transition T (x, z) = P(Z = z X = x) Acceptance probability α(x, t) = P(Accept z X = x, Z = z) α is not a transition matrix. Algorithm 4 Metropolis Hastings 1: Initialize x 0 from X 0 q 2: for t = 1,..., T do 3: Draw z t from P(Z = X t 1 = x t 1 ) = T (x t 1, ) 4: With probability α(z t, x t 1 ), set x t = z t, otherwise, set x t = x t 1 5: end for Proposition 8.17 With that choice of α(x, z), if X is finite, if T (, ) is the transition matrix of an irreducible Marov chain such that, for any (x, z), ( T (x, z) > 0 T (z, x) > 0 ), and if π(x) > 0 for all x X, then the Metropolis-Hastings algorithm defines a Marov chain that converges to π. 8-12

13 Proof P(X t = x t X t 1 = x t 1 ) = S(x t 1, x t ) z x, S(x, z) = T (x, z)α(x, z) S(x, x) = T (x, x) + z x T (x, z)(1 α(x, z)) Let π be given; we want to choose S such that we have detailed balance: The fact that we have detailed balance for a transition from x to x is obvious because When z x, we have π(x)s(x, z) = π(z)s(z, x) π(x)t (x, z)α(x, z) = π(z)t (z, x)α(z, x) Then If α(x, z) α(z, x) = π(z)t (z, x) π(x)t (x, z) ( ) ( ) π(z)t (z, x) α(x, z) = min 1, π(x)t (x, z) then { α(x, z) [0, 1] ( ) is satisfied = detailed balance 8-13

Cours 7 12th November 2014

Sum Product Algorithm and Hidden Markov Model 2014/2015 Cours 7 12th November 2014 Enseignant: Francis Bach Scribe: Pauline Luc, Mathieu Andreux 7.1 Sum Product Algorithm 7.1.1 Motivations Inference, along