Maximum likelihood estimation

Guillaume Obozinski
Ecole des Ponts - ParisTech, Master MVA

Outline
1. Statistical concepts
2. A short review of convex analysis and optimization
3. The maximum likelihood principle

Statistical concepts

Statistical model

Parametric model
Definition: a set of distributions parametrized by a vector θ ∈ Θ ⊂ R^p,
    P_Θ = { p_θ(x) : θ ∈ Θ }.

Bernoulli model: X ~ Ber(θ), Θ = [0, 1],
    p_θ(x) = θ^x (1 − θ)^{1−x}.

Binomial model: X ~ Bin(n, θ), Θ = [0, 1],
    p_θ(x) = \binom{n}{x} θ^x (1 − θ)^{n−x}.

Multinomial model: X ~ M(n, π_1, π_2, ..., π_K), Θ = [0, 1]^K,
    p_θ(x) = \binom{n}{x_1, ..., x_K} π_1^{x_1} ... π_K^{x_K}.
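As a quick, hedged illustration (not from the slides; the use of scipy.stats and the parameter values are illustrative assumptions), the three model densities above can be evaluated as follows:

    from scipy.stats import bernoulli, binom, multinomial

    theta = 0.3
    print(bernoulli.pmf(1, theta))                                # Ber: p_theta(1) = theta
    print(binom.pmf(4, n=10, p=theta))                            # Bin: binom(10,4) theta^4 (1-theta)^6
    print(multinomial.pmf([2, 3, 5], n=10, p=[0.2, 0.3, 0.5]))    # M(10, pi): multinomial pmf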

Indicator variable coding for multinomial variables

Let C be a r.v. taking values in {1, ..., K}, with P(C = k) = π_k.
We code C with a r.v. Y = (Y_1, ..., Y_K) with Y_k = 1_{C=k}.
For example, if K = 5 and c = 4, then y = (0, 0, 0, 1, 0).
So y ∈ {0, 1}^K with ∑_{k=1}^K y_k = 1.
P(C = k) = P(Y_k = 1) and P(Y = y) = ∏_{k=1}^K π_k^{y_k}.
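A minimal sketch of this indicator (one-hot) coding, not from the slides; K and c are the values of the example above, the vector π is made up:

    import numpy as np

    K, c = 5, 4
    y = np.zeros(K, dtype=int)
    y[c - 1] = 1                               # categories are numbered 1..K
    print(y, y.sum())                          # [0 0 0 1 0] 1

    pi = np.array([0.1, 0.2, 0.3, 0.3, 0.1])   # P(C = k) = pi_k
    print(np.prod(pi ** y))                    # P(Y = y) = prod_k pi_k^{y_k} = pi_4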

Bernoulli, Binomial, Multinomial

Y ~ Ber(π):
    p(y) = π^y (1 − π)^{1−y}
(Y_1, ..., Y_K) ~ M(1, π_1, ..., π_K):
    p(y) = π_1^{y_1} ... π_K^{y_K}

N_1 ~ Bin(n, π):
    p(n_1) = \binom{n}{n_1} π^{n_1} (1 − π)^{n−n_1}
(N_1, ..., N_K) ~ M(n, π_1, ..., π_K):
    p(n) = \binom{n}{n_1, ..., n_K} π_1^{n_1} ... π_K^{n_K}

with \binom{n}{i} = n! / ((n − i)! i!) and \binom{n}{n_1, ..., n_K} = n! / (n_1! ... n_K!).

Gaussian model

Scalar Gaussian model: X ~ N(µ, σ²)
X is a real-valued r.v., and θ = (µ, σ²) ∈ Θ = R × R_+^*.
    p_{µ,σ²}(x) = 1/√(2πσ²) · exp( −(x − µ)² / (2σ²) )

Multivariate Gaussian model: X ~ N(µ, Σ)
X is a r.v. taking values in R^d. If K_d is the set of positive definite matrices of size d × d, then θ = (µ, Σ) ∈ Θ = R^d × K_d.
    p_{µ,Σ}(x) = 1/√((2π)^d det Σ) · exp( −(1/2) (x − µ)^T Σ^{−1} (x − µ) )
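A minimal sketch (not from the slides; the parameter values and the use of scipy.stats are illustrative assumptions) evaluating the two densities above:

    import numpy as np
    from scipy.stats import norm, multivariate_normal

    # scalar case: compare the formula above with scipy's norm.pdf
    mu, sigma, x = 1.0, 2.0, 0.5
    manual = np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
    print(manual, norm.pdf(x, loc=mu, scale=sigma))     # same value

    # multivariate case in R^2 (Sigma must be positive definite)
    mu_vec = np.array([0.0, 1.0])
    Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
    print(multivariate_normal.pdf([0.5, 0.5], mean=mu_vec, cov=Sigma))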

Gaussian densities (figure slide)

A short review of convex analysis and optimization

Review: convex analysis

Convex function: ∀ λ ∈ [0, 1],
    f(λx + (1 − λ)y) ≤ λ f(x) + (1 − λ) f(y)

Strictly convex function: ∀ λ ∈ ]0, 1[ and x ≠ y,
    f(λx + (1 − λ)y) < λ f(x) + (1 − λ) f(y)

Strongly convex function: ∃ µ > 0 such that x ↦ f(x) − µ ‖x‖² is convex.
Equivalently: ∀ λ ∈ [0, 1],
    f(λx + (1 − λ)y) ≤ λ f(x) + (1 − λ) f(y) − µ λ(1 − λ) ‖x − y‖²

The largest possible µ is called the strong convexity constant.
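A minimal numerical sketch (not from the slides): checking the equivalent strong convexity inequality above for f(x) = x² + |x|, whose strong convexity constant under this convention is µ = 1, since f(x) − µx² = |x| is convex.

    import numpy as np

    def f(x):
        return x ** 2 + np.abs(x)

    mu = 1.0
    rng = np.random.default_rng(4)
    for _ in range(10000):
        x, y, lam = rng.normal(), rng.normal(), rng.uniform()
        lhs = f(lam * x + (1 - lam) * y)
        rhs = lam * f(x) + (1 - lam) * f(y) - mu * lam * (1 - lam) * (x - y) ** 2
        assert lhs <= rhs + 1e-9               # the strong convexity inequality holds
    print("inequality holds on all sampled points")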

Minima of convex functions

Proposition (Supporting hyperplane)
If f is convex and differentiable at x, then
    f(y) ≥ f(x) + ∇f(x)^T (y − x)

Convex function: all local minima are global minima.
Strictly convex function: if there is a local minimum, then it is unique and global.
Strongly convex function: there exists a unique local minimum, which is also global.

Minima and stationary points of differentiable functions

Definition (Stationary point)
For f differentiable, we say that x is a stationary point if ∇f(x) = 0.

Theorem (Fermat)
If f is differentiable at x and x is a local minimum, then x is stationary.

Theorem (Stationary point of a convex differentiable function)
If f is convex and differentiable at x, and x is stationary, then x is a minimum.

Theorem (Stationary points of twice differentiable functions)
For f twice differentiable at x:
- if x is a local minimum, then ∇f(x) = 0 and ∇²f(x) ⪰ 0;
- conversely, if ∇f(x) = 0 and ∇²f(x) ≻ 0, then x is a strict local minimum.

Sample/Training set

The data used to learn or estimate a model typically consists of a collection of observations which can be thought of as instantiations of random variables
    X^{(1)}, ..., X^{(n)}.
A common assumption is that the variables are i.i.d. (independent and identically distributed), i.e., mutually independent and all with the same distribution P.

This collection of observations is called
- the sample or the observations in statistics,
- the samples in engineering,
- the training set in machine learning.

The maximum likelihood principle

Maximum likelihood principle

Let P_Θ = { p(x; θ) : θ ∈ Θ } be a model, and let x be an observation.

Likelihood:
    L : Θ → R_+, θ ↦ p(x; θ)

Maximum likelihood estimator:
    θ̂_ML = argmax_{θ ∈ Θ} p(x; θ)

(Sir Ronald Fisher, 1890-1962)

Case of i.i.d. data
If (x_i)_{1 ≤ i ≤ n} is an i.i.d. sample of size n:
    θ̂_ML = argmax_{θ ∈ Θ} ∏_{i=1}^n p_θ(x_i) = argmax_{θ ∈ Θ} ∑_{i=1}^n log p_θ(x_i)
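A hedged sketch of the principle in practice (the simulated data and the use of scipy.optimize are assumptions, not part of the lecture): the i.i.d. log-likelihood of the scalar Gaussian model is maximized numerically, which should recover the closed-form answer derived later.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    x = rng.normal(loc=2.0, scale=1.5, size=1000)   # i.i.d. sample x_1, ..., x_n

    def neg_log_likelihood(params):
        mu, log_sigma = params                      # optimize log(sigma) so that sigma > 0
        sigma = np.exp(log_sigma)
        return -np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                       - (x - mu)**2 / (2 * sigma**2))

    res = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]), method="Nelder-Mead")
    mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
    print(mu_hat, sigma_hat)        # close to the closed-form MLE: x.mean(), x.std()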

The maximum likelihood estimator

The MLE
- does not always exist,
- is not necessarily unique,
- is not admissible in general.

MLE for the Bernoulli model

Let X_1, X_2, ..., X_n be an i.i.d. sample ~ Ber(θ). The log-likelihood is

    ℓ(θ) = ∑_{i=1}^n log p(x_i; θ) = ∑_{i=1}^n log[ θ^{x_i} (1 − θ)^{1−x_i} ]
         = ∑_{i=1}^n ( x_i log θ + (1 − x_i) log(1 − θ) )
         = N log θ + (n − N) log(1 − θ),   with N := ∑_{i=1}^n x_i.

θ ↦ ℓ(θ) is strongly concave ⟹ the MLE exists and is unique.
Since ℓ is differentiable and strongly concave, its maximizer is the unique stationary point:
    ℓ′(θ) = ∂ℓ/∂θ (θ) = N/θ − (n − N)/(1 − θ).
Setting this to zero gives
    θ̂_MLE = N/n = (x_1 + x_2 + ... + x_n)/n.
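A minimal sketch (assumed simulated data, not from the lecture): the closed-form Bernoulli MLE θ̂ = N/n, checked against a brute-force maximization of the log-likelihood over a grid of θ values.

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.binomial(n=1, p=0.3, size=500)          # i.i.d. Bernoulli(0.3) sample
    N, n = x.sum(), x.size

    theta_closed_form = N / n                       # theta_hat = N / n

    thetas = np.linspace(1e-6, 1 - 1e-6, 100000)
    log_lik = N * np.log(thetas) + (n - N) * np.log(1 - thetas)
    theta_grid = thetas[np.argmax(log_lik)]

    print(theta_closed_form, theta_grid)            # the two estimates agree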

MLE for the multinomial

Done on the board. See lecture notes.
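A hedged sketch of the derivation from the board (assuming every count n_k is positive), using the Lagrangian reviewed on the next slides; maximizing the multinomial log-likelihood over the simplex gives the empirical frequencies:

    ℓ(π) = ∑_{k=1}^K n_k log π_k + const,   to be maximized subject to ∑_{k=1}^K π_k = 1.
    L(π, λ) = ∑_{k=1}^K n_k log π_k + λ ( ∑_{k=1}^K π_k − 1 )
    ∂L/∂π_k = n_k/π_k + λ = 0   ⟹   π_k = −n_k/λ
    ∑_{k=1}^K π_k = 1   ⟹   λ = −n,   hence   π̂_k = n_k/n.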

Brief review of Lagrange duality

Convex optimization problem with linear constraints
For f a convex function, X ⊂ R^p a convex set included in the domain of f, A ∈ R^{n×p}, b ∈ R^n:
    min_{x ∈ X} f(x)   subject to   Ax = b        (P)

Lagrangian
    L(x, λ) = f(x) + λ^T (Ax − b),
with λ ∈ R^n the Lagrange multiplier.

Properties of the Lagrangian

Link between primal and Lagrangian
    max_{λ ∈ R^n} L(x, λ) = f(x) if Ax = b, and +∞ otherwise.
So that
    min_{x ∈ X : Ax = b} f(x) = min_{x ∈ X} max_{λ ∈ R^n} L(x, λ).

Lagrangian dual objective function
    g(λ) = min_{x ∈ X} L(x, λ)

Dual optimization problem
    max_{λ ∈ R^n} g(λ) = max_{λ ∈ R^n} min_{x ∈ X} L(x, λ)        (D)

Maxmin-minmax inequality, weak and strong duality

For any f : R^n × R^m → R and any W ⊂ R^n and Z ⊂ R^m, we have
    max_{z ∈ Z} min_{w ∈ W} f(w, z) ≤ min_{w ∈ W} max_{z ∈ Z} f(w, z).

Weak duality
    d* := max_{λ ∈ R^n} g(λ) = max_{λ ∈ R^n} min_{x ∈ X} L(x, λ)
        ≤ min_{x ∈ X} max_{λ ∈ R^n} L(x, λ) = min_{x ∈ X : Ax = b} f(x) =: p*
So that in general we have d* ≤ p*. This is called weak duality.

Strong duality
In some cases we have strong duality: d* = p*, i.e., the optimal values of (P) and (D) coincide.

Slater's qualification condition

Slater's qualification condition is a condition on the constraints of a convex optimization problem that guarantees that strong duality holds. For linear constraints, Slater's condition is very simple:

Slater's condition for a convex optimization problem with linear constraints
If there exists an x in the relative interior of X ∩ {Ax = b}, then strong duality holds.
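A minimal numerical sketch (not from the slides; the problem instance is made up): for min_x (1/2)‖x‖² subject to Ax = b, with X = R^p, Slater's condition holds as soon as the constraint is feasible, and the primal and dual optimal values computed below coincide.

    import numpy as np

    rng = np.random.default_rng(2)
    A = rng.normal(size=(3, 5))                  # n = 3 constraints, p = 5 variables
    b = rng.normal(size=3)

    # Primal: min (1/2)||x||^2 s.t. Ax = b; the minimizer is the minimum-norm solution
    x_star = A.T @ np.linalg.solve(A @ A.T, b)
    p_star = 0.5 * x_star @ x_star

    # Dual: g(lambda) = min_x L(x, lambda) = -(1/2) lambda^T A A^T lambda - lambda^T b,
    # maximized at lambda* = -(A A^T)^{-1} b
    lam_star = -np.linalg.solve(A @ A.T, b)
    d_star = -0.5 * lam_star @ (A @ A.T) @ lam_star - lam_star @ b

    print(p_star, d_star)                        # equal up to rounding: strong duality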

MLE for the univariate and multivariate Gaussian

Done on the board. See lecture notes.
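A minimal sketch (assumed simulated data, not from the lecture) of the closed-form Gaussian MLE derived on the board: the sample mean, and the empirical covariance with a 1/n normalization (not the unbiased 1/(n − 1)).

    import numpy as np

    rng = np.random.default_rng(3)
    mu_true = np.array([1.0, -2.0])
    Sigma_true = np.array([[2.0, 0.5], [0.5, 1.0]])
    X = rng.multivariate_normal(mu_true, Sigma_true, size=2000)   # rows are x_i in R^d

    n = X.shape[0]
    mu_hat = X.mean(axis=0)                 # mu_hat = (1/n) sum_i x_i
    centered = X - mu_hat
    Sigma_hat = centered.T @ centered / n   # Sigma_hat = (1/n) sum_i (x_i - mu_hat)(x_i - mu_hat)^T

    print(mu_hat)        # close to mu_true
    print(Sigma_hat)     # close to Sigma_true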