Lecture 1 October 9, 2013


Probabilistic Graphical Models, Fall 2013

Lecture 1, October 9, 2013
Lecturer: Guillaume Obozinski        Scribes: Huu Dien Khue Le, Robin Bénesse

The web page of the course:

1.1 Introduction

Problem

To model complex data, one is confronted with two main questions:

- How to manage the complexity of the data to be processed?
- How to infer global properties from local models?

These questions lead to three types of problems: the representation of the data (how to obtain a global model from local models), the inference of the distributions (how to use the model), and the learning of the models (what are the parameters of the models?).

Examples

Image: consider a monochromatic image of n x n pixels. If each pixel is modelled by a discrete random variable (so that there are n^2 of them), then the image can be modelled using a grid-shaped graph over these variables.

Bioinformatics: consider a long sequence of DNA bases. If each base of this sequence is modelled by a discrete random variable that, in general, can take values in {A, C, G, T}, then the sequence can be modelled by a Markov chain.

Finance: consider the evolution of stock prices in discrete time, where we observe the prices at each time n. It is reasonable to postulate that the change of price of a stock at time n only depends on its own price and the prices of other stocks at time n-1. For only two stocks, a possible simplified model is a dependency graph in which the prices at time n depend on the prices at time n-1.

Speech processing: consider the syllables of a word and the way they are interpreted by a human ear or by a computer. Each syllable can be represented by a random sound. The objective is then to retrieve the word from the sequence of sounds heard or recorded. In this case, we can use a hidden Markov model.

Text: consider a text and a fixed set of keywords. The text is modelled by a vector such that each of its components equals the number of times the corresponding keyword appears. This is usually called the bag-of-words model. This model may seem weak, as it does not take the order of the words into account; however, it works quite well in practice. A so-called naive Bayes classifier can be used for classification, for example spam vs. non-spam (a small code sketch is given below).

It is clear that models which ignore the dependence among variables are too simple for real-world problems. On the other hand, models in which every random variable depends on all (or too many) other ones are doomed to fail, both for statistical reasons (lack of data) and for computational reasons. Therefore, in practice, one has to make suitable assumptions to design models with the right level of complexity, so that the models obtained are able to generalize well from a statistical point of view and lead to tractable computations from an algorithmic perspective.
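To make the bag-of-words representation above concrete, here is a minimal sketch in Python (the vocabulary and the document are made up purely for illustration; only the standard library is used):

```python
from collections import Counter

# Hypothetical keyword vocabulary; in practice it would be chosen from a corpus.
vocabulary = ["money", "free", "meeting", "project", "winner"]

def bag_of_words(text, vocabulary):
    """Map a text to its vector of keyword counts; word order is ignored."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

doc = "Free money money today you are a winner"
print(bag_of_words(doc, vocabulary))   # [2, 1, 0, 0, 1]
```

A classifier such as naive Bayes then works directly on these count vectors, ignoring word order exactly as described above.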

1.2 Basic notations and properties

In this section we recall some basic notations and properties of random variables.

Convention. Mathematically, the probability that a random variable X takes the value x is denoted P(X = x). In this document, we simply write p(X) to denote a distribution over the random variable X, or p(x) to denote the distribution evaluated at the particular value x. The same convention is used for several variables.

Fundamental rules. For two random variables X, Y we have

    Sum rule:      p(X) = \sum_Y p(X, Y).
    Product rule:  p(X, Y) = p(Y | X) p(X).

Independence. Two random variables X and Y are said to be independent if and only if

    P(X, Y) = P(X) P(Y).

Conditional independence. Let X, Y, Z be random variables. X and Y are said to be conditionally independent given Z if and only if

    P(X, Y | Z) = P(X | Z) P(Y | Z).

Property: if X and Y are conditionally independent given Z, then P(X | Y, Z) = P(X | Z).

Independent and identically distributed. A set of random variables is independent and identically distributed (i.i.d.) if each random variable has the same probability distribution as the others and all are mutually independent.

Bayes formula. For two random variables X, Y we have

    P(X | Y) = P(Y | X) P(X) / P(Y).

Note that Bayes formula is not a Bayesian formula in the sense of Bayesian statistics.
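As a small sanity check of these rules, the following sketch works through a made-up joint distribution of two binary variables and verifies Bayes formula numerically (the numbers are arbitrary, chosen only for illustration):

```python
# Joint distribution p(X, Y) of two binary variables (arbitrary illustrative numbers).
p_xy = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

# Sum rule: p(X) = sum_Y p(X, Y), and similarly for p(Y).
p_x = {x: sum(p for (xx, y), p in p_xy.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(p for (x, yy), p in p_xy.items() if yy == y) for y in (0, 1)}

# Product rule: p(Y | X) = p(X, Y) / p(X).
p_y_given_x = {(y, x): p_xy[(x, y)] / p_x[x] for x in (0, 1) for y in (0, 1)}

# Bayes formula: p(X | Y) = p(Y | X) p(X) / p(Y), compared with the direct definition.
x, y = 1, 0
bayes = p_y_given_x[(y, x)] * p_x[x] / p_y[y]
direct = p_xy[(x, y)] / p_y[y]
print(round(bayes, 10), round(direct, 10))   # both 0.25
```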

1.3 Statistical models

Definition 1.1 (Statistical model). A parametric statistical model P_Θ is a collection of probability distributions (or a collection of probability density functions, in which case they are all defined with respect to the same base measure, such as the Lebesgue measure), defined on the same space and parameterized by a parameter θ belonging to a set Θ ⊂ R^p. Formally:

    P_Θ = { p_θ : θ ∈ Θ }.

Bernoulli model

Consider a binary random variable X that can take the value 0 or 1. If P(X = 1) is parametrized by θ ∈ [0, 1],

    P(X = 1) = θ,    P(X = 0) = 1 - θ,

then a probability distribution of the Bernoulli model can be written as

    p(x; θ) = θ^x (1 - θ)^{1-x}                              (1.1)

and we can write

    X ~ Ber(θ).                                              (1.2)

The Bernoulli model is the collection of these distributions for θ ∈ Θ = [0, 1].

Binomial model

A binomial random variable N ~ Bin(n, θ) is defined as the sum of n i.i.d. Bernoulli random variables with parameter θ. Its distribution is

    P(N = k) = \binom{n}{k} θ^k (1 - θ)^{n-k}.

The set Θ is the same as for the Bernoulli model.
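As a quick illustration, the sketch below evaluates the Bernoulli probability (1.1) and the corresponding binomial probability for small made-up values (only the Python standard library is used; θ and the counts are arbitrary):

```python
from math import comb

def bernoulli_pmf(x, theta):
    """p(x; theta) = theta^x (1 - theta)^(1 - x) for x in {0, 1}, as in (1.1)."""
    return theta**x * (1 - theta)**(1 - x)

def binomial_pmf(k, n, theta):
    """P(N = k) = C(n, k) theta^k (1 - theta)^(n - k), N = sum of n i.i.d. Bernoulli(theta)."""
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

theta = 0.3                                   # arbitrary parameter in [0, 1]
print(bernoulli_pmf(1, theta))                # 0.3
print(binomial_pmf(2, 5, theta))              # 10 * 0.3^2 * 0.7^3 ≈ 0.3087
print(sum(binomial_pmf(k, 5, theta) for k in range(6)))   # sums to 1 (up to rounding)
```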

Multinomial model

Consider a discrete random variable C that can take one of K possible values {1, 2, ..., K}. The random variable C can be represented by a K-dimensional random vector X = (X_1, X_2, ..., X_K)^T for which the event {C = k} corresponds to the event {X_k = 1 and X_l = 0 for all l ≠ k}.

If we parametrize P(C = k) by a parameter π_k ∈ [0, 1], then by definition we also have P(X_k = 1) = π_k for k = 1, 2, ..., K, with \sum_{k=1}^K π_k = 1. The probability distribution over x = (x_1, ..., x_K) can be written as

    p(x; π) = \prod_{k=1}^K π_k^{x_k}                        (1.3)

where π = (π_1, π_2, ..., π_K)^T. We will denote by M(1, π_1, ..., π_K) such a discrete distribution. The corresponding set of parameters is Θ = { π ∈ R_+^K : 1^T π = 1 }.

Now if we consider n independent observations of a M(1, π) multinomial random variable X, and we denote by N_k the number of observations for which x_k = 1, then the joint distribution of (N_1, N_2, ..., N_K) is called a multinomial M(n, π) distribution. It takes the form

    p(n_1, n_2, ..., n_K; π, n) = n! / (n_1! n_2! ... n_K!)  \prod_{k=1}^K π_k^{n_k}        (1.4)

and we can write

    (N_1, ..., N_K) ~ M(n, π_1, π_2, ..., π_K).              (1.5)

The multinomial M(n, π) is to the M(1, π) distribution as the binomial distribution is to the Bernoulli distribution. In the rest of this course, when we talk about multinomial distributions, we will always refer to a M(1, π) distribution.

Gaussian models

The Gaussian distribution is also known as the normal distribution. In the case of a scalar variable X, the Gaussian density can be written in the form

    N(x; μ, σ^2) = 1 / (2π σ^2)^{1/2}  exp( - (x - μ)^2 / (2 σ^2) )                          (1.6)

where μ is the mean and σ^2 is the variance. For a d-dimensional vector x, the multivariate Gaussian density takes the form

    N(x | μ, Σ) = 1 / ( (2π)^{d/2} |Σ|^{1/2} )  exp( - (1/2) (x - μ)^T Σ^{-1} (x - μ) )      (1.7)

where μ is a d-dimensional mean vector, Σ is a d x d symmetric positive definite matrix, and |Σ| denotes the determinant of Σ. It is a well-known property that the parameter μ is equal to the expectation of X and that the matrix Σ is the covariance matrix of X, which means that Σ_ij = E[(X_i - μ_i)(X_j - μ_j)].
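The multivariate density (1.7) can be transcribed directly into code. The sketch below is a minimal version assuming NumPy is available; it inverts Σ explicitly, makes no attempt at numerical robustness, and the chosen parameters are arbitrary:

```python
import numpy as np

def gaussian_density(x, mu, sigma):
    """Evaluate the multivariate Gaussian density (1.7) at the point x."""
    d = mu.shape[0]
    diff = x - mu
    norm_const = 1.0 / ((2 * np.pi) ** (d / 2) * np.linalg.det(sigma) ** 0.5)
    return norm_const * np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff)

mu = np.array([0.0, 1.0])                       # illustrative mean vector
sigma = np.array([[2.0, 0.5],                   # symmetric positive definite covariance
                  [0.5, 1.0]])
x = np.array([0.3, 0.8])
print(gaussian_density(x, mu, sigma))           # the density evaluated at x
```

In practice one would typically use a library routine (e.g. scipy.stats.multivariate_normal) rather than this direct transcription.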

1.4 Parameter estimation by maximum likelihood

Definition

Maximum likelihood estimation (MLE) is a method of estimating the parameters of a statistical model. Suppose we have a sample (x_1, x_2, ..., x_n) of n independent and identically distributed observations, coming from a distribution p(x_1, x_2, ..., x_n; θ) where θ is an unknown parameter (both x_i and θ can be vectors). As the name suggests, the MLE finds the parameter θ̂ under which the data x_1, x_2, ..., x_n are most likely:

    θ̂ = argmax_θ  p(x_1, x_2, ..., x_n; θ)                  (1.8)

The probability on the right-hand side of the above equation can be seen as a function of θ, denoted by L(θ):

    L(θ) = p(x_1, x_2, ..., x_n; θ)                          (1.9)

This function is called the likelihood. As x_1, x_2, ..., x_n are independent and identically distributed, we have

    L(θ) = \prod_{i=1}^n p(x_i; θ)                           (1.10)

In practice it is often more convenient to work with the logarithm of the likelihood function, called the log-likelihood:

    l(θ) = log L(θ) = log \prod_{i=1}^n p(x_i; θ)            (1.11)
         = \sum_{i=1}^n log p(x_i; θ)                        (1.12)

Next, we will apply this method to the models presented previously. We assume that all the observations are independent and identically distributed in all of the remainder of this lecture.

MLE for the Bernoulli model

Consider n observations x_1, x_2, ..., x_n of a binary random variable X following a Bernoulli distribution Ber(θ). From (1.12) and (1.1) we have

    l(θ) = \sum_{i=1}^n log p(x_i; θ) = \sum_{i=1}^n log( θ^{x_i} (1 - θ)^{1 - x_i} ) = N log θ + (n - N) log(1 - θ),

where N = \sum_{i=1}^n x_i. As l(θ) is strictly concave, it has a unique maximizer, and since the function is in addition differentiable, its maximizer θ̂ is the zero of its gradient ∇l(θ):

    ∇l(θ) = dl(θ)/dθ = N/θ - (n - N)/(1 - θ).

It is easy to show that ∇l(θ) = 0 if and only if θ = N/n. Therefore we have

    θ̂ = N/n = (x_1 + x_2 + ... + x_n)/n.                    (1.13)
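The closed form (1.13) is easy to confirm numerically: the sketch below simulates Bernoulli data, evaluates the log-likelihood on a grid of θ values, and compares the grid maximizer with N/n (a sketch assuming NumPy; the sample size and true parameter are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 0.7                        # arbitrary true parameter
x = rng.binomial(1, theta_true, size=1000)
n, N = x.size, x.sum()

def log_likelihood(theta):
    # l(theta) = N log(theta) + (n - N) log(1 - theta)
    return N * np.log(theta) + (n - N) * np.log(1 - theta)

grid = np.linspace(1e-3, 1 - 1e-3, 10_000)
theta_grid = grid[np.argmax(log_likelihood(grid))]
theta_closed_form = N / n

print(theta_closed_form, theta_grid)    # both close to theta_true, equal up to the grid resolution
```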

MLE for the multinomial model

Consider N observations X_1, X_2, ..., X_N of a discrete random variable X following a multinomial distribution M(1, π), where π = (π_1, π_2, ..., π_K)^T. We denote by x_i (i = 1, 2, ..., N) the K-dimensional vectors of 0s and 1s representing X_i, as presented above for the multinomial model. From (1.12) and (1.3) we have

    l(π) = \sum_{i=1}^N log p(x_i; π) = \sum_{i=1}^N log \prod_{k=1}^K π_k^{x_{ik}} = \sum_{i=1}^N \sum_{k=1}^K x_{ik} log π_k = \sum_{k=1}^K n_k log π_k

where n_k = \sum_{i=1}^N x_{ik}; n_k is therefore the number of observations for which x_k = 1. We need to maximize this quantity subject to the constraint \sum_{k=1}^K π_k = 1.

Brief review of Lagrange duality

Lagrangian. Consider the following convex optimization problem:

    min_{x ∈ X} f(x)   subject to   Ax = b                   (1.14)

where f is a convex function, X ⊂ R^p is a convex set included in the domain of f (the domain of a function is the set on which the function is finite), A ∈ R^{n x p} and b ∈ R^n. The Lagrangian associated with this optimization problem is defined as

    L(x, λ) = f(x) + λ^T (Ax - b).                           (1.15)

The vector λ ∈ R^n is called the Lagrange multiplier vector.

Lagrange dual function. The Lagrange dual function is defined as

    g(λ) = min_{x ∈ X} L(x, λ).                              (1.16)

The problem of maximizing g(λ) with respect to λ is known as the Lagrange dual problem.

Max-min inequality. Let f : R^n x R^m → R, W ⊂ R^n and Z ⊂ R^m. For any w ∈ W and z ∈ Z we have f(w, z) ≤ max_{z' ∈ Z} f(w, z'), hence

    min_{w ∈ W} f(w, z) ≤ min_{w ∈ W} max_{z' ∈ Z} f(w, z')                  (1.17)

and, taking the maximum over z ∈ Z on the left-hand side,

    max_{z ∈ Z} min_{w ∈ W} f(w, z) ≤ min_{w ∈ W} max_{z ∈ Z} f(w, z).       (1.18)

The last inequality is known as the max-min inequality.

Duality. It is easy to show that

    max_λ L(x, λ) = f(x) if Ax = b, and +∞ otherwise,

which gives us

    min_{x ∈ X : Ax = b} f(x) = min_{x ∈ X} max_λ L(x, λ).                   (1.19)

Now from (1.16), (1.18) and (1.19) we have

    max_λ g(λ) = max_λ min_{x ∈ X} L(x, λ) ≤ min_{x ∈ X} max_λ L(x, λ) = min_{x ∈ X : Ax = b} f(x).      (1.20)

This inequality says that the optimal value d* of the Lagrange dual problem always lower-bounds the optimal value p* of the original problem. This property is called weak duality. If the equality d* = p* holds, then we say that strong duality holds. Strong duality means that the order of the minimization over x and the maximization over λ can be switched without affecting the result.

Slater's constraint qualification lemma. If there exists an x in the relative interior of X ∩ {Ax = b}, then strong duality holds. Note that by definition X is included in the domain of f, so that if x ∈ X then f(x) < +∞.

Note that all the above notions and results are stated for problem (1.14) only. For a more general problem and more details about Lagrange duality, please refer to the chapter on duality in [9].
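As a concrete instance of problem (1.14) and of strong duality, take f(x) = (1/2)||x||^2 subject to Ax = b. Minimizing the Lagrangian over x gives x = -A^T λ, so the dual function is g(λ) = -(1/2) λ^T A A^T λ - λ^T b, and at the optimum the primal and dual values coincide. The sketch below verifies this numerically (assuming NumPy; A and b are random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 5))          # 2 equality constraints on 5 variables
b = rng.standard_normal(2)

# Maximizing g(lambda) gives lambda* = -(A A^T)^{-1} b; the primal solution is x* = -A^T lambda*.
lam_star = -np.linalg.solve(A @ A.T, b)
x_star = -A.T @ lam_star

primal_value = 0.5 * x_star @ x_star                                # f(x*), with A x* = b
dual_value = -0.5 * lam_star @ A @ A.T @ lam_star - lam_star @ b    # g(lambda*)

print(np.allclose(A @ x_star, b))        # the constraint is satisfied: True
print(primal_value, dual_value)          # equal values: strong duality holds for this problem
```

Here strong duality could also be predicted from Slater's condition, since the constraints are affine and f is finite everywhere.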

Back to our problem

We need to minimize

    f(π) = -l(π) = - \sum_{k=1}^K n_k log π_k                (1.21)

subject to the constraint 1^T π = 1. The Lagrangian of this problem is

    L(π, λ) = - \sum_{k=1}^K n_k log π_k + λ ( \sum_{k=1}^K π_k - 1 ).       (1.22)

Clearly, as n_k ≥ 0 for k = 1, 2, ..., K, f is convex and this problem is a convex optimization problem. Moreover, it is trivial that there exist π_1, π_2, ..., π_K such that π_k > 0 for all k and \sum_{k=1}^K π_k = 1, so by Slater's constraint qualification the problem has the strong duality property. Therefore, we have

    min_π f(π) = max_λ min_π L(π, λ).                        (1.23)

As L(π, λ) is convex with respect to π, to find min_π L(π, λ) it suffices to set the partial derivatives with respect to π_k to zero. This yields

    ∂L/∂π_k = - n_k/π_k + λ = 0,   k = 1, 2, ..., K,

or

    π_k = n_k/λ,   k = 1, 2, ..., K.                         (1.24)

Substituting these into the constraint \sum_{k=1}^K π_k = 1 we get \sum_{k=1}^K n_k = λ, yielding λ = N. From this and (1.24) we finally get

    π̂_k = n_k/N,   k = 1, 2, ..., K.                        (1.25)

Remark: π̂_k is the fraction of the N observations for which x_k = 1.
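The closed form (1.25) can be cross-checked against a generic constrained optimizer: the sketch below simulates draws from an M(1, π) distribution and compares the empirical frequencies n_k/N with the numerical minimizer of (1.21) under the constraint (a sketch assuming NumPy and SciPy; the true π is arbitrary):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
pi_true = np.array([0.5, 0.3, 0.2])           # arbitrary true parameter
counts = rng.multinomial(1000, pi_true)       # n_k for k = 1..K, with N = 1000 observations

def neg_log_likelihood(pi):
    # f(pi) = -l(pi) = -sum_k n_k log pi_k, as in (1.21)
    return -np.sum(counts * np.log(pi))

K = len(counts)
res = minimize(
    neg_log_likelihood,
    x0=np.full(K, 1.0 / K),
    bounds=[(1e-9, 1.0)] * K,
    constraints=[{"type": "eq", "fun": lambda pi: pi.sum() - 1.0}],
    method="SLSQP",
)

print(counts / counts.sum())    # closed form: pi_k = n_k / N
print(res.x)                    # numerical solution, essentially identical
```

This is only a consistency check: the closed form requires no optimizer at all.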

MLE for the univariate Gaussian model

Consider n observations x_1, x_2, ..., x_n of a random variable X following a Gaussian distribution N(μ, σ^2). From (1.12) and (1.6) we have

    l(μ, σ^2) = \sum_{i=1}^n log p(x_i; μ, σ^2)
              = \sum_{i=1}^n log [ 1/(2π σ^2)^{1/2} exp( - (x_i - μ)^2 / (2 σ^2) ) ]
              = - (n/2) log(2π) - (n/2) log σ^2 - 1/(2 σ^2) \sum_{i=1}^n (x_i - μ)^2.

We need to maximize this quantity with respect to μ and σ^2. By taking the derivative with respect to μ and then with respect to σ^2, it is easy to obtain that the pair (μ̂, σ̂^2) defined by

    μ̂ = (1/n) \sum_{i=1}^n x_i,    σ̂^2 = (1/n) \sum_{i=1}^n (x_i - μ̂)^2    (1.26)

is the only stationary point of the likelihood. One can actually check (for example by computing the Hessian with respect to (μ, σ^2)) that this is indeed a maximum. We will have a confirmation of this in the lecture on exponential families.

MLE for the multivariate Gaussian model

Let X ∈ R^d be a Gaussian random vector, with mean vector μ ∈ R^d and covariance matrix Σ ∈ R^{d x d} positive definite:

    p(x | μ, Σ) = 1 / ( (2π)^{d/2} (det Σ)^{1/2} )  exp( - (1/2) (x - μ)^T Σ^{-1} (x - μ) ).

Let (x_1, ..., x_n) be an i.i.d. sample. The log-likelihood is given by

    l(μ, Σ) = log p(x_1, ..., x_n; μ, Σ) = \sum_{i=1}^n log p(x_i | μ, Σ)
            = - (nd/2) log(2π) - (n/2) log det Σ - (1/2) \sum_{i=1}^n (x_i - μ)^T Σ^{-1} (x_i - μ).

In this case, one should be careful that this log-likelihood is not concave with respect to the pair of parameters (μ, Σ). It is concave with respect to μ when Σ is fixed, but it is not even concave with respect to Σ when μ is fixed.

Digression: review on differentials

Differentiable function. A function f is differentiable at x ∈ R^d if there exists a linear form df_x such that

    f(x + h) - f(x) = df_x(h) + o(||h||).

Since R^d is a Hilbert space, we know that there exists g ∈ R^d such that df_x(h) = ⟨g, h⟩. We call g the gradient of f: g = ∇f(x).

Example 1: if f(x) = a^T x + b, then we have

    f(x + h) - f(x) = a^T h,

and thus ∇f(x) = a.

Example 2: if f(x) = (1/2) x^T A x, then we have

    f(x + h) - f(x) = (1/2) (x + h)^T A (x + h) - (1/2) x^T A x = (1/2) (x^T A h + h^T A x) + o(||h||).

The gradient is then

    ∇f(x) = (1/2) (A + A^T) x.
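Both example gradients are easy to confirm with a finite-difference check (a sketch assuming NumPy; the vectors are random and A is deliberately non-symmetric so that the (1/2)(A + A^T) factor matters):

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Central finite-difference approximation of the gradient of f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        grad[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
d = 4
a, b = rng.standard_normal(d), rng.standard_normal()
A = rng.standard_normal((d, d))                  # non-symmetric on purpose
x = rng.standard_normal(d)

# Example 1: f(x) = a^T x + b has gradient a.
print(np.allclose(numerical_gradient(lambda v: a @ v + b, x), a))
# Example 2: f(x) = (1/2) x^T A x has gradient (1/2)(A + A^T) x.
print(np.allclose(numerical_gradient(lambda v: 0.5 * v @ A @ v, x), 0.5 * (A + A.T) @ x))
```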

Let us first differentiate l(μ, Σ) with respect to μ. We need to differentiate

    μ ↦ \sum_{i=1}^n (1/2) (x_i - μ)^T Σ^{-1} (x_i - μ),

which is a sum of compositions f ∘ g_i, where

    f : R^d → R,  y ↦ (1/2) y^T Σ^{-1} y     and     g_i : R^d → R^d,  μ ↦ x_i - μ.

Reminder (composition of differentials): d(f ∘ g)_x(h) = df_{g(x)}( dg_x(h) ).

The differential of f is

    df_y(h) = ⟨∇f(y), h⟩ = (1/2) ⟨ Σ^{-1} y + Σ^{-T} y, h ⟩ = ⟨ Σ^{-1} y, h ⟩,

as Σ^{-1} is symmetric. Since dg_{i,μ}(h) = -h, the composition rule gives, for φ : μ ↦ \sum_{i=1}^n (1/2) (x_i - μ)^T Σ^{-1} (x_i - μ),

    dφ_μ(h) = \sum_{i=1}^n ⟨ Σ^{-1} (μ - x_i), h ⟩,

and we deduce the gradient

    ∇φ(μ) = Σ^{-1} \sum_{i=1}^n (μ - x_i).

Back to the MLE for the multivariate Gaussian

Remember that the function we want to differentiate is

    l(μ, Σ) = - (nd/2) log(2π) - (n/2) log det Σ - (1/2) \sum_{i=1}^n (x_i - μ)^T Σ^{-1} (x_i - μ).

According to the results above, we have

    ∇_μ l(μ, Σ) = - ∇φ(μ) = Σ^{-1} \sum_{i=1}^n (x_i - μ) = Σ^{-1} ( n x̄ - n μ ),   where x̄ = (1/n) \sum_{i=1}^n x_i.

The gradient is equal to 0 if and only if

    μ = x̄ = (1/n) \sum_{i=1}^n x_i.
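The expression ∇_μ l(μ, Σ) = Σ^{-1} \sum_i (x_i - μ) can itself be checked by finite differences (a sketch assuming NumPy; the data, μ and Σ are all arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.standard_normal((n, d))                  # arbitrary data, one observation per row
mu = rng.standard_normal(d)
B = rng.standard_normal((d, d))
Sigma = B @ B.T + d * np.eye(d)                  # symmetric positive definite

def log_likelihood(m):
    diff = X - m
    quad = np.sum(diff @ np.linalg.inv(Sigma) * diff)   # sum_i (x_i - m)^T Sigma^{-1} (x_i - m)
    return (-0.5 * n * d * np.log(2 * np.pi)
            - 0.5 * n * np.log(np.linalg.det(Sigma))
            - 0.5 * quad)

grad_formula = np.linalg.inv(Sigma) @ (X - mu).sum(axis=0)   # Sigma^{-1} sum_i (x_i - mu)

eps = 1e-6
grad_numeric = np.array([
    (log_likelihood(mu + eps * e) - log_likelihood(mu - eps * e)) / (2 * eps)
    for e in np.eye(d)
])

print(np.allclose(grad_formula, grad_numeric, atol=1e-4))    # True
```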

Let us now differentiate l with respect to Σ^{-1}. Let A = Σ^{-1}. We have

    l(μ, Σ) = - (nd/2) log(2π) + (n/2) log det A - (1/2) \sum_{i=1}^n (x_i - μ)^T A (x_i - μ).

The last term is a real number, so it is equal to its trace. Thus

    (1/2) \sum_{i=1}^n (x_i - μ)^T A (x_i - μ) = (1/2) \sum_{i=1}^n Trace( A (x_i - μ)(x_i - μ)^T ) = (n/2) Trace( A Σ̃ ),

where

    Σ̃ = (1/n) \sum_{i=1}^n (x_i - μ)(x_i - μ)^T

is the empirical covariance matrix. Hence

    l(μ, Σ) = - (nd/2) log(2π) + (n/2) log det A - (n/2) Trace( A Σ̃ ).

Let f : A ↦ (n/2) Trace( A Σ̃ ). We have

    f(A + H) - f(A) = (n/2) Trace( Σ̃ H ),

so, writing df_A(H) = ⟨∇f(A), H⟩ = Trace( ∇f(A)^T H ), the gradient of this term is

    ∇f(A) = (n/2) Σ̃.

Let us now focus on the log det term. For symmetric positive definite A and a symmetric perturbation H, we can write

    log det(A + H) = log det( A^{1/2} ( I + A^{-1/2} H A^{-1/2} ) A^{1/2} ) = log det A + log det( I + H̃ ),

where H̃ = A^{-1/2} H A^{-1/2}.

Since H̃ is symmetric, it can be written as H̃ = U Λ U^T, where U is an orthogonal matrix and Λ = diag(λ_1, ..., λ_d). Therefore

    log det( I + H̃ ) - log det( I ) = \sum_{j=1}^d log(1 + λ_j) = \sum_{j=1}^d λ_j + o(||H̃||) = Trace( H̃ ) + o(||H̃||).

We deduce that

    d(log det)_A(H) = Trace( A^{-1/2} H A^{-1/2} ) = Trace( H A^{-1} ),

and therefore the gradient of log det A is

    ∇_A log det A = A^{-1}.

The gradient of l with respect to A is then

    ∇_A l = (n/2) A^{-1} - (n/2) Σ̃.

It is equal to zero if and only if

    Σ = A^{-1} = Σ̃,

when Σ̃ is invertible. Finally, we have shown that the pair

    μ̂ = x̄ = (1/n) \sum_{i=1}^n x_i    and    Σ̂ = (1/n) \sum_{i=1}^n (x_i - x̄)(x_i - x̄)^T

is the only stationary point of the likelihood. One can actually check (for example by computing the Hessian with respect to (μ, Σ)) that this is indeed a maximum. We will have a confirmation of this in the lecture on exponential families.
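Both the identity ∇_A log det A = A^{-1} and the final result can be checked numerically. The sketch below verifies the log det gradient by finite differences and then checks that, at μ = x̄ and Σ equal to the empirical covariance, the gradient with respect to μ vanishes and Σ̂ coincides with the usual biased covariance estimator (a sketch assuming NumPy; the matrix and data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
B = rng.standard_normal((d, d))
A = B @ B.T + d * np.eye(d)                     # symmetric positive definite

# Finite-difference check of d(log det)_A(H) = Trace(H A^{-1}); for a general perturbation
# the gradient is (A^{-1})^T, which equals A^{-1} here since A is symmetric.
eps = 1e-6
grad_numeric = np.zeros((d, d))
for i in range(d):
    for j in range(d):
        H = np.zeros((d, d))
        H[i, j] = eps                           # single-entry perturbation
        grad_numeric[i, j] = (np.linalg.slogdet(A + H)[1]
                              - np.linalg.slogdet(A - H)[1]) / (2 * eps)
print(np.allclose(grad_numeric, np.linalg.inv(A), atol=1e-6))    # True

# At mu = xbar and Sigma = empirical covariance, the gradient with respect to mu vanishes.
n = 500
X = rng.multivariate_normal(np.zeros(d), A, size=n)              # arbitrary Gaussian sample
mu_hat = X.mean(axis=0)
Sigma_hat = (X - mu_hat).T @ (X - mu_hat) / n
print(np.allclose(np.linalg.inv(Sigma_hat) @ (X - mu_hat).sum(axis=0), 0))   # True
print(np.allclose(Sigma_hat, np.cov(X, rowvar=False, bias=True)))            # matches np.cov with bias=True
```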

Bibliography

[1] J. M. Amigó and M. B. Kennel. Variance estimators for the Lempel-Ziv entropy rate estimator. Chaos, 16:043102.
[2] Dominique Foata and Aimé Fuchs. Calcul des probabilités, 2ème édition. Dunod.
[3] Gilbert Saporta. Probabilités, analyse des données et statistique. Technip.
[4] Frédéric Bonnans. Optimisation continue: cours et problèmes corrigés. Dunod.
[5] Michael Jordan. An Introduction to Graphical Models. In preparation.
[6] Multiplicateurs de Lagrange.
[7] S. J. D. Prince. Computer Vision: Models, Learning, and Inference. Cambridge University Press.
[8] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag New York, Inc., Secaucus, NJ, USA.
[9] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA.
