Lecture 1 October 9, 2013

Similar documents
Maximum likelihood estimation

Lecture 4 September 15

Naïve Bayes classification

Machine Learning. Option Data Science Mathématiques Appliquées & Modélisation. Alexandre Aussem

Naïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability

Introduction to Machine Learning. Lecture 2

Machine Learning, Fall 2012 Homework 2

The Expectation-Maximization Algorithm

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner

Naive Bayes and Gaussian Bayes Classifier

Naive Bayes and Gaussian Bayes Classifier

Naive Bayes and Gaussian Bayes Classifier

Chapter 17: Undirected Graphical Models

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Lecture 4 October 18th

Lecture 4: Probabilistic Learning. Estimation Theory. Classification with Probability Distributions

Constrained Optimization and Lagrangian Duality

Today. Probability and Statistics. Linear Algebra. Calculus. Naïve Bayes Classification. Matrix Multiplication Matrix Inversion

Introduction to Machine Learning Lecture 7. Mehryar Mohri Courant Institute and Google Research

Lecture 4: Probabilistic Learning

Bayesian decision theory Introduction to Pattern Recognition. Lectures 4 and 5: Bayesian decision theory

Introduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Yishay Mansour, Lior Wolf

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training

Introduction to Probability and Statistics (Continued)

COS513 LECTURE 8 STATISTICAL CONCEPTS

COM336: Neural Computing

Lecture 4. Generative Models for Discrete Data - Part 3. Luigi Freda. ALCOR Lab DIAG University of Rome La Sapienza.

Statistics 3858 : Maximum Likelihood Estimators

PMR Learning as Inference

Machine learning - HT Maximum Likelihood

Introduction to Machine Learning

Probabilistic Graphical Models

Bayesian Methods: Naïve Bayes

Discrete Mathematics and Probability Theory Fall 2015 Lecture 21

Introduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Lior Wolf

Introduction to Machine Learning

Linear Regression and Discrimination

Statistical and Learning Techniques in Computer Vision Lecture 2: Maximum Likelihood and Bayesian Estimation Jens Rittscher and Chuck Stewart

Quick Tour of Basic Probability Theory and Linear Algebra

Association studies and regression

Support Vector Machines

Latent Variable Models and EM algorithm

STA 4273H: Statistical Machine Learning

Outline. Supervised Learning. Hong Chang. Institute of Computing Technology, Chinese Academy of Sciences. Machine Learning Methods (Fall 2012)

PROBABILITY AND INFORMATION THEORY. Dr. Gjergji Kasneci Introduction to Information Retrieval WS

Machine Learning Basics Lecture 2: Linear Classification. Princeton University COS 495 Instructor: Yingyu Liang

Introduction to Machine Learning

University of Cambridge Engineering Part IIB Module 4F10: Statistical Pattern Processing Handout 2: Multivariate Gaussians

Learning in graphical models & MC methods Fall Cours 8 November 18

Gaussian Models (9/9/13)

A brief introduction to Conditional Random Fields

CS 195-5: Machine Learning Problem Set 1

Expectation maximization

Introduction to Machine Learning

Probabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016

Convex Functions. Daniel P. Palomar. Hong Kong University of Science and Technology (HKUST)

Graphical Models and Kernel Methods

MAS223 Statistical Inference and Modelling Exercises

Gaussian Mixture Models

Representation. Stefano Ermon, Aditya Grover. Stanford University. Lecture 2

CS839: Probabilistic Graphical Models. Lecture 7: Learning Fully Observed BNs. Theo Rekatsinas

Undirected Graphical Models

Machine Learning

Advanced Machine Learning & Perception

Two hours. To be provided by Examinations Office: Mathematical Formula Tables. THE UNIVERSITY OF MANCHESTER. xx xxxx 2017 xx:xx xx.

14 : Theory of Variational Inference: Inner and Outer Approximation

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012

Latent Variable Models and EM Algorithm

COMS 4771 Lecture Course overview 2. Maximum likelihood estimation (review of some statistics)

Probabilistic Graphical Models

Clustering K-means. Clustering images. Machine Learning CSE546 Carlos Guestrin University of Washington. November 4, 2014.

Chris Bishop s PRML Ch. 8: Graphical Models

Mixture Models & EM. Nicholas Ruozzi University of Texas at Dallas. based on the slides of Vibhav Gogate

Some Probability and Statistics

Gaussian discriminant analysis Naive Bayes

Probabilistic Graphical Models

Parametric Techniques

Linear Dynamical Systems

13: Variational inference II

Introduction to Machine Learning

Probabilistic modeling. The slides are closely adapted from Subhransu Maji s slides

Parametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a

Some Probability and Statistics

Introduction: exponential family, conjugacy, and sufficiency (9/2/13)

Chapter 4: Asymptotic Properties of the MLE (Part 2)

Multivariate Bayesian Linear Regression MLAI Lecture 11

Statistical Data Mining and Machine Learning Hilary Term 2016

Parametric Techniques Lecture 3

Inf2b Learning and Data

A graph contains a set of nodes (vertices) connected by links (edges or arcs)

Probabilistic Graphical Models

Estimation Theory. as Θ = (Θ 1,Θ 2,...,Θ m ) T. An estimator

COMP90051 Statistical Machine Learning

Lecture 2 Machine Learning Review

Machine Learning - MT Classification: Generative Models

CS 591, Lecture 2 Data Analytics: Theory and Applications Boston University

Pattern Recognition and Machine Learning

Lecture 5. Gaussian Models - Part 1. Luigi Freda. ALCOR Lab DIAG University of Rome La Sapienza. November 29, 2016

An Introduction to Bayesian Machine Learning

Logistic Regression. Seungjin Choi

Transcription:

Probabilistic Graphical Models Fall 2013 Lecture 1 October 9, 2013 Lecturer: Guillaume Obozinski Scribe: Huu Dien Khue Le, Robin Bénesse The web page of the course: http://www.di.ens.fr/~fbach/courses/fall2013/ 1.1 Introduction 1.1.1 Problem To model complex data, one is confronted with two main questions: How to manage the complexity of the data to be processed? How to infer global properties from local models? These questions lead to 3 types of problems: the representation of data or how to obtain a global model from a local model, the inference of the distributions how to use the model, and the learning of the models what are the parameters of the models?. 1.1.2 Examples Image: consider a 100 100 pixels monochromatic image. If each pixel is modelled by a discrete random variable so there are 10000 of them, then the image can be modelled using a grid of the form: Bioinformatics: consider a long sequence of 10000 ADN bases. If each base of this sequence is modelled by a discrete random variable that, in general, can take values in {A, C, G, T }, then the sequence can be modelled by a Markov chain: 1-1

Finance: consider the evolution of stock prices in discrete time, where we have values at time n. It is reasonable to postulate that the change of price of a stock at time n only depends only on its price or the price of other stocks at time n 1. For only two stocks, a possible simplified model is the following dependency graph: Speech processing: consider the syllables of a word and the way they are interpreted by a human ear or by a computer. Each syllable can be represented by a random sound. The objective is then to retrieve the word from the sequence of sounds heard or recorded. In this case, we can use a hidden Markov model: Text: consider a text with 1000000 words. The text is modelled by a vector such that each of its components equals to the number of times each keyword appears. This is usually called the bag of words model. This model seems to be weak, as it does not take the order of the words into account. However, it works quite well in practice. A so-called naive Bayes classifier can be used for classification for example spam vs non spam. It is clear that models which ignore the dependence among variables are too simple for real-world problems. On the other hand, models in which every random variable is dependent all or too many other ones are doomed both for statistical lack of data and computational reasons. Therefore, in practice, one has to make suitable assumptions to design models with 1-2

the right level of complexity, so that the models obtained are able to generalize well from a statistical point of view and lead to tractable computations from an algorithmic perspective. 1.2 Basic notations and properties In this section we recall some basic notations and properties of random variables. Convention: Mathematically, the probability that a random variable X takes the value x is denoted px = x. In this document, we simply write px to denote a distribution over the random variable X, or px to denote the distribution evaluated for the particular value x. It is similar for more variables. Fundamental rules. For two random variables X, Y we have Sum rule: px = Y px, Y. Product rule: px, Y = py XpX. Independence. Two random variables X and Y are said to be independent if and only if P X, Y = P XP Y. Conditional independence. Let X, Y, Z be random variables. We define X and Y to be conditionally independent given Z if and only if P X, Y Z = P X ZP Y Z. Property: If X and Y are conditionally independent given Z, then P X Y, Z = P X Z. Independent and identically distributed. A set of random variables is independent and identically distributed i.i.d. if each random variable has the same probability distribution as the others and all are mutually independent. Bayes formula. For two random variables X, Y we have P X Y = P Y XP X. P Y Note that Bayes formula is not a Bayesian formula in the sense of Bayesian statistics. 1-3

1.3 Statistical models Definition 1.1 Statistical model A parametric statistical model P Θ is a collection of probability distributions or a collection of probability density functions 1 defined on the same space and parameterized by parameters θ belonging to a set Θ R p. Formally: P Θ = {p θ θ Θ}. 1.3.1 Bernoulli model Consider a binary random variable X that can take the value 0 or 1. parametrized by θ [0, 1]: { PX = 1 = θ PX = 0 = 1 θ then a probability distribution of the Bernoulli model can be written as If px = 1 is and we can write px = x; θ = θ x 1 θ 1 x 1.1 X Berθ. 1.2 The Bernoulli model is the collection of these distributions for θ Θ = [0, 1]. 1.3.2 Binomial model A binomial random variable Binθ, N is defined as the value of the sum of n i.i.d. Bernoulli r.v. with parameter θ. The distribution of a binomial random variable N is n PN = k = θ k 1 θ n k. k The set Θ is the same as for the Bernoulli model. 1.3.3 Multinomial model Consider a discrete random variable C that can take one of K possible values {1, 2,..., K}. The random variable C can be represented by a K-dimensional random variable X = X 1, X 2,..., X K T for which the event {C = k} corresponds to the event in R d {X k = 1 and X l = 0, l k}. 1 In which case, they are all defined with respect to the same base measure, such as the Lebesgue measure 1-4

If we parametrize PC = k by a parameter π k [0, 1], then by definition we also have PX k = 1 = π k k = 1, 2,..., K, with K π k = 1. The probability distribution over x = x 1,..., x k can be written as px; π = K π x k k 1.3 where π = π 1, π 2,..., π K T. We will denote M1, π 1,..., π K such a discrete distribution. The corresponding set of parameters is Θ = {π R + K π = 1}. Now if we consider n independent observations of a M1, π multinomial random variable X, and we denote by N k the number of observations for which x k = 1, then the joint distribution of N 1, N 2,..., N K is called a multinomial Mn, π distribution. It takes the form: n! K pn 1, n 2,..., n K ; π, n = π n k k 1.4 n 1!n 2!... n K! and we can write N 1,..., N K MN, π 1, π 2,..., π K. 1.5 The multinomial Mn, π is to the M1, π distribution, as the binomial distribution is to the Bernoulli distribution. In the rest of this course, when we will talk about multinomial distributions, we will always refer to a M1, π distribution. 1.3.4 Gaussian models The Gaussian distribution is also known as the normal distribution. In the case of a scalar variable X, the Gaussian distribution can be written in the form N x; µ, σ 2 1 x µ2 = exp 1.6 2πσ 2 1/2 2σ 2 where µ is the mean and σ 2 is the variance. For a d-dimensional vector x, the multivariate Gaussian distribution takes the form 1 1 N x µ, Σ = exp 1 2π d/2 Σ 1/2 2 x µt Σ 1 x µ 1.7 where µ is a d-dimensional vector, Σ is a d d symmetric positive definite matrix, and Σ denotes the determinant of Σ. It is a well-known property that the parameter µ is equal to the expectation of X and that the matrix Σ is the covariance matrix of X, which means that Σ ij = E[X i µ i X j µ j ]. 1-5

1.4 Parameter estimation by maximum likelihood 1.4.1 Definition Maximum likelihood estimation MLE is a method of estimating the parameters of a statistical model. Suppose we have a sample x 1, x 2,..., x n of n independent and identically distributed observations, coming from a distribution px 1, x 2,..., x n ; θ where θ is an unknown parameter both x i and θ can be vectors. As the name suggests, the MLE finds the parameter ˆθ under which the data x 1, x 2,..., x n are most likely: ˆθ = argmax px 1, x 2,..., x n ; θ 1.8 θ The probability on the right-hand side in the above equation can be seen as a function of θ and can be denoted by Lθ: Lθ = px 1, x 2,..., x n ; θ 1.9 This function is called the likelihood. As x 1, x 2,..., x n are independent and identically distributed, we have n Lθ = px i ; θ 1.10 In practice it is often more convenient to work with the logarithm of the likelihood function, called the log-likelihood: n lθ = log Lθ = log px i ; θ 1.11 = log px i ; θ 1.12 Next, we will apply this method for the models presented previously. We assume that all the observations are independent and identically distributed in all of the remainder of this lecture. 1.4.2 MLE for the Bernoulli model Consider n observations x 1, x 2,..., x n of a binary random variable X following a Bernoulli distribution Berθ. From 1.12 and 1.1 we have lθ = log px i ; θ = log θ x i 1 θ 1 x i = N logθ + n N log1 θ 1-6

where N = n x i. As lθ is strictly concave, it has a unique maximizer, and since the function is in addition differentiable, its maximizer ˆθ is the zero of its gradient lθ: lθ = θ lθ = N θ n N 1 θ. It is easy to show that lθ = 0 θ = N. Therefore we have n 1.4.3 MLE for the multinomial model ˆθ = N n = x 1 + x 2 + + x n. 1.13 n Consider N observations X 1, X 2,..., X N of a discrete random variable X following a multinomial distribution M1, π, where π = π 1, π 2,..., π K T. We denote x i i = 1, 2,..., N the K-dimensional vectors of 0s and 1s representing X i, as presented in Section 1.3.3. From 1.12 and 1.3 we have N lπ = log px i ; π = = = N K log N π x ik k K x ik log π k K n k log π k where n k = N x ik n k is therefore the number of observations of x k = 1. We need to maximize this quantity subject to the constraint K π k = 1. Brief review on Lagrange duality Lagrangian. Consider the following convex optimization problem: min fx, subject to Ax = b 1.14 x X where f is a convex function, X R p is a convex set included in the domain 2 of f, A R n p, b R n. The Lagrangian associated with this optimization problem is defined as 2 The domain of a function is the set on which the function is finite. Lx, λ = fx + λ T Ax b 1.15 1-7

The vector λ R n is called the Lagrange multiplier vector. Lagrange dual function. The Lagrange dual function is defined as gλ = min x Lx, λ 1.16 The problem of maximizing gλ with respect to λ is known as the Lagrange dual problem. Max-min inequality. For any f : R n R m and any w R n and z R m, we have fw, z max fw, z = min fw, z min fw, z z Z w W w W z Z 1.17 = max fw, z min fw, z. z Z w W w W z Z 1.18 The last inequality is known as the max-min inequality. Duality. It is easy to show that Which gives us max Lx, λ = λ min x { fx fx = min x Now from 1.16, 1.18 and 1.19 we have max λ gλ = max λ min x Lx, λ min if Ax = b + otherwise. max Lx, λ 1.19 λ x max λ Lx, λ = min fx 1.20 This inequality says that the optimal value d of the Lagrange dual problem always lower-bounds the optimal value p of the original problem. This property is called the weak duality. If the equality d = p holds, then we say that the strong duality holds. Strong duality means that the order of the minimization over x and the maximization over λ can be switched without affecting the result. Slater s constraint qualification lemma. If there exists an x in the relative interior of X {Ax = b} then strong duality holds. Note that by definition X is included in the domain of f so that if x X then fx <. Note that all the above notions and results are stated for the problem 1.14 only. For a more general problem and more details about Lagrange duality, please refer to [9] chapter 5. 1-8 x

Back to our problem We need to minimize subject to the constraint 1 T π = 1. The Lagrangian of this problem is fπ = lπ = Lπ, λ = K n k log π k 1.21 K K n k log π k + λ π k 1 1.22 Clearly, as n k 0 k = 1, 2,..., K, f is convex and this problem is a convex optimization problem. Moreover, it is trivial that there exist π 1, π 2,..., π K such that π k > 0 k = 1, 2,..., K and K π k = 1, so by Slater s constraint qualification, the problem has strong duality property. Therefore, we have min π fπ = max min Lπ, λ 1.23 λ π As Lπ, λ is convex with respect to π, to find min π Lπ, λ, it suffices to take derivatives with respect to π k. This yields L = n k + λ = 0, k = 1, 2,..., K. π k π k or π k = n k, k = 1, 2,..., K. 1.24 λ Substituting these into the constraint K π k = 1 we get K n k = λ, yielding λ = N. From this and 1.24 we get finally ˆπ k = n k, k = 1, 2,..., K. 1.25 N Remark: ˆπ k is the fraction of the N observations for which x k = 1. 1.4.4 MLE for the univariate Gaussian model Consider n observations x 1, x 2,..., x n of a random variable X following a Gaussian distribution N µ, σ 2. From 1.12 and 1.6 we have lµ, σ 2 = log px i ; µ, σ 2 [ 1 = log exp x ] i µ 2 2πσ 2 1/2 2σ 2 = n 2 log2π n 2 logσ2 1 x i µ 2. 2 σ 2 1-9

We need to maximize this quantity with respect to µ and σ 2. By taking derivative with respect to µ and then σ 2, it is easy to obtain that the pair ˆµ, ˆσ 2, defined by ˆµ = 1 n ˆσ 2 = 1 n x i 1.26 x i ˆµ 2 1.27 is the only stationary point of the likelihood. One can actually check for example computing the Hessian w.r.t. µ, σ 2 that this actually a maximum. We will have a confirmation of this in the lecture on exponential families. 1.4.5 MLE for the multivariate Gaussian model Let X R d be a Gaussian random vector, with mean vector µ R d and a covariance matrix Σ R d d positive definite: px µ, Σ = 1 1 x µ Σ 1 x µ exp 2π k 2 det Σ 2 Let x 1,..., x n be a i.i.d. sample. The log-likelihood is given by: l µ, Σ = log p x 1,..., x n ; µ, Σ n = log p x i µ, Σ = nd 2 log2π + n 2 log det Σ + 1 2 x i µ Σ 1 x i µ In this case, one should be careful that these log-likelihoods are not concave with respect to the pair of parameters µ, Σ. They are concave w.r.t. µ when Σ is fixed but they are not even concave with respect to Σ when µ is fixed. 1.4.6 Digression: review on differentials Differentiable function. A function f is differentiable at x R d if there exists a linear form df such that: fx + h fx = dfh + o h Since R d is a Hilbert space, we know that there exists g R d such that df x h = g, h. We call g the gradient of f : g = fx. 1-10

Example 1 : if f a x + b then we have : and thus fx + h fx = a h fx = a. Example 2 : if f 1 2 x Ax then we have : fx + h fx = 1 2 x + ht Ax + h 1 2 x Ax = 1 2 x Ah + h Ax + o h The gradient is then : fx = 1 2 Ax + A x Let us first differentiate l µ, Σ w.r.t. µ. We need to differentiate : µ x i µ Σ 1 x i µ Which is equal to f g where : and f : R d R y y Σ 1 y g : R d R d µ µ x i Reminder : Composition of differentials df g = df gx dg x h = df gx dg x h 1-11

The differential of f is : df y h = fy, h = 1 2 as Σ 1 is symmetric. The function l : µ x i µσ 1 x i µ. We have : We deduce the gradient of l : dl µ h = Σ 1 µ x i, h lµ = Σ 1 µ x i Σ 1 y + Σ 1 y, h = Σ 1 y, h 1.4.7 Back to the MLE for the multivariate Gaussian Remember that the function we want to differentiate is : nd l µ, Σ = 2 log2π + n 2 log det Σ + 1 x i µ Σ 1 x i µ 2 According to the results above, we have : µ l µ, Σ 1 = Σ 1 µ x i = Σ 1 nµ = Σ 1 nµ nx where x = 1 n n x i The gradient is equal to 0 iff : x i µ = 1 n x i Let us now differentiate l w.r.t. Σ 1. Let A = Σ 1. We have : nd l µ, Σ = 2 log2π n 2 log det A + 1 x i µ A x i µ 2 The last term is a real number, so it equal to its trace. Thus : 1-12

nd l µ, Σ = 2 log2π n 2 log det A + 1 Trace x i µ Ax i µ 2 nd = 2 log2π n 2 logdet A + n 2 TraceA Σ where Σ = 1 n is the empirical covariance matrix. Let f : A A Σ ntrace. 2 x i µx i µ We have : fa + H fa = n H 2 Trace Σ The gradient of the last term of l is then : and : fa = n 2 Σ df A H = fa, H = Trace fa H Let us focus on the second term. We have : where H = log deta + H = log A 1 2 I + A 1 2 HA 1 2 A 1 2 = log A + log deti + H A 1 2 HA 1 2. Let g : A log deta where A = I + H. We have : 1-13

log deti + H log deti = H is symmetric, so it can be written as : d log1 + λ j d λ j + o H = Trace H + o H H = UΛU where U is an orthogonal matrix and Λ = diagλ 1,..., λ d. d log det A H = Trace A 1 2 HA 1 2 = TraceHA 1 We deduce the gradient of log deta: And the gradient of l w.r.t. A is : It is equal to zero iff : when Σ is invertible. Finally we have shown that the pair log deta = A 1 A l = n 2 A 1 + n 2 Σ Σ = Σ ˆµ = x = 1 n x i andˆσ = 1 n x i xx i x is the only stationary point of the likelihood. One can actually check for example computing the Hessian w.r.t. µ, Σ that this actually a maximum. We will have a confirmation of this in the lecture on exponential families. 1-14

Bibliography [1] J.M Amigo and M.B. Kennel. Variance estimators for the Lempel-Ziv entropy rate estimator. Chaos, 16:043102, 2006. [2] Aim Fuchs Dominique Foata. Calcul des probabilités. 2ème édition. Dunod, 2003. [3] Gilbert Saporta. Probabilités, analyses des données et statistiques. Technip, 1990. [4] Frédéric Bonnans. Optimisation continue, Cours et problèmes corrigés. Dunod, 2003. [5] Michael Jordan. An introduction to graphical models. In preparation. [6] http://fr.wikipedia.org/wiki/multiplicateur de lagrange. [7] S.J.D. Prince. Computer Vision: Models Learning and Inference. Cambridge University Press, 2012. [8] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006. [9] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004. 15