Maximum Likelihood Estimation

Similar documents
Optimization 2. CS5240 Theoretical Foundations in Multimedia. Leow Wee Kheng

Optimization: Nonlinear Optimization without Constraints. Nonlinear Optimization without Constraints 1 / 23

2 Nonlinear least squares algorithms

Lecture 7 Unconstrained nonlinear programming

Optimization for neural networks

Optimization II: Unconstrained Multivariable

Optimization II: Unconstrained Multivariable

Convex Optimization CMU-10725

2. Quasi-Newton methods

Methods that avoid calculating the Hessian. Nonlinear Optimization; Steepest Descent, Quasi-Newton. Steepest Descent

Applied Mathematics 205. Unit I: Data Fitting. Lecturer: Dr. David Knezevic

Newton s Method. Ryan Tibshirani Convex Optimization /36-725

Quasi-Newton Methods

Numerical Optimization Professor Horst Cerjak, Horst Bischof, Thomas Pock Mat Vis-Gra SS09

Math 273a: Optimization Netwon s methods

Higher-Order Methods

Algorithms for Constrained Optimization

Programming, numerics and optimization

Pose Tracking II! Gordon Wetzstein! Stanford University! EE 267 Virtual Reality! Lecture 12! stanford.edu/class/ee267/!

Lecture 14: October 17

Stochastic Quasi-Newton Methods

Optimization Methods

AM 205: lecture 19. Last time: Conditions for optimality Today: Newton s method for optimization, survey of optimization methods

Lecture 4 September 15

NonlinearOptimization

Optimization Methods

Methods for Unconstrained Optimization Numerical Optimization Lectures 1-2

Introduction to gradient descent

Numerisches Rechnen. (für Informatiker) M. Grepl P. Esser & G. Welper & L. Zhang. Institut für Geometrie und Praktische Mathematik RWTH Aachen

Numerical solution of Least Squares Problems 1/32

Shiqian Ma, MAT-258A: Numerical Optimization 1. Chapter 3. Gradient Method

Optimization. The value x is called a maximizer of f and is written argmax X f. g(λx + (1 λ)y) < λg(x) + (1 λ)g(y) 0 < λ < 1; x, y X.

Scientific Computing: An Introductory Survey

Scientific Computing: An Introductory Survey

Lecture V. Numerical Optimization

Linear and Nonlinear Optimization

Matrix Derivatives and Descent Optimization Methods

Unconstrained optimization

Last lecture 1/35. General optimization problems Newton Raphson Fisher scoring Quasi Newton

ORIE 6326: Convex Optimization. Quasi-Newton Methods

Lecture Notes: Geometric Considerations in Unconstrained Optimization

Convex Optimization. Problem set 2. Due Monday April 26th

Fitting The Unknown 1/28. Joshua Lande. September 1, Stanford

j=1 r 1 x 1 x n. r m r j (x) r j r j (x) r j (x). r j x k

Improving the Convergence of Back-Propogation Learning with Second Order Methods

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training

EAD 115. Numerical Solution of Engineering and Scientific Problems. David M. Rocke Department of Applied Science

Nonlinear Optimization Methods for Machine Learning

Statistical Estimation

Lecture 3 September 1

Lecture 6 Optimization for Deep Neural Networks

Numerical computation II. Reprojection error Bundle adjustment Family of Newtonʼs methods Statistical background Maximum likelihood estimation

Approximate Normality, Newton-Raphson, & Multivariate Delta Method

Series 6, May 14th, 2018 (EM Algorithm and Semi-Supervised Learning)

MATH 4211/6211 Optimization Quasi-Newton Method

5 Quasi-Newton Methods

Chapter 4. Unconstrained optimization

17 Solution of Nonlinear Systems

13. Nonlinear least squares

Outline. Scientific Computing: An Introductory Survey. Optimization. Optimization Problems. Examples: Optimization Problems

Lecture 17: Numerical Optimization October 2014

4 Newton Method. Unconstrained Convex Optimization 21. H(x)p = f(x). Newton direction. Why? Recall second-order staylor series expansion:

CS 542G: Robustifying Newton, Constraints, Nonlinear Least Squares

Lecture 7. Logistic Regression. Luigi Freda. ALCOR Lab DIAG University of Rome La Sapienza. December 11, 2016

Lecture 10. Neural networks and optimization. Machine Learning and Data Mining November Nando de Freitas UBC. Nonlinear Supervised Learning

Gradient Methods Using Momentum and Memory

Linear Models for Regression CS534

Quasi-Newton methods: Symmetric rank 1 (SR1) Broyden Fletcher Goldfarb Shanno February 6, / 25 (BFG. Limited memory BFGS (L-BFGS)

Non-linear least squares

Statistics 3858 : Maximum Likelihood Estimators

Introduction to Logistic Regression and Support Vector Machine

Neural Network Training

University of Houston, Department of Mathematics Numerical Analysis, Fall 2005

Ph.D. Qualifying Exam Friday Saturday, January 6 7, 2017

Vasil Khalidov & Miles Hansard. C.M. Bishop s PRML: Chapter 5; Neural Networks

Non-polynomial Least-squares fitting

The Newton-Raphson Algorithm

Statistics 580 Optimization Methods

Maximum Likelihood Estimation. only training data is available to design a classifier

Optimization Methods. Lecture 19: Line Searches and Newton s Method

Data Mining (Mineria de Dades)

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;

Numerical solutions of nonlinear systems of equations

Lecture 18: November Review on Primal-dual interior-poit methods

ECON 5350 Class Notes Nonlinear Regression Models

Optimization Methods for Machine Learning

Least Mean Squares Regression. Machine Learning Fall 2018

, b = 0. (2) 1 2 The eigenvectors of A corresponding to the eigenvalues λ 1 = 1, λ 2 = 3 are

Optimization and Root Finding. Kurt Hornik

Lecture 3 - Linear and Logistic Regression

Geometry optimization

Nonlinear Optimization: What s important?

The K-FAC method for neural network optimization

ECE521 lecture 4: 19 January Optimization, MLE, regularization

IPAM Summer School Optimization methods for machine learning. Jorge Nocedal

Scientific Computing: An Introductory Survey

Some explanations about the IWLS algorithm to fit generalized linear models

Sequence Modelling with Features: Linear-Chain Conditional Random Fields. COMP-599 Oct 6, 2015

On Nesterov s Random Coordinate Descent Algorithms - Continued

Introduction to Maximum Likelihood Estimation

Transcription:

Maximum Likelihood Estimation Prof. C. F. Jeff Wu ISyE 8813

Section 1 Motivation

What is parameter estimation? A modeler proposes a model M(θ) for explaining some observed phenomenon θ are the parameters which dictate properties of such a model Parameter estimation is the process by which the modeler determines the best parameter choice θ given a set of training observations

Why parameter estimation? θ often encodes important and interpretable properties of the model M e.g., calibration parameters in computer experiment models An estimate of θ (call this θ) allows for: Model validation: Checking whether the proposed model fits the observed data Prediction: Forecasting future observations at untested settings

Simple example Suppose we observe a random sample X 1, X 2,, X n : X i = 1 if student i own a sports car, X i = 0 if student i does not own a sports car i.i.d. We then postulate the following model: X i Bernoulli(θ), i.e., X i s are i.i.d. Bernoulli random variables with the same (unknown) parameter θ We would like to estimate θ based on the observations X 1, X 2,, X n

Popular estimation techniques Maximum-likelihood estimation (MLE) Mnimax estimation Methods-of-moments (MOM) (Non-linear) Least-squares estimation

Popular estimation techniques Maximum-likelihood estimation (MLE) Mnimax estimation Methods-of-moments (MOM) (Non-linear) Least-squares estimation We will focus on these two techniques in this lecture.

Section 2 Maximum likelihood estimation (MLE)

Likelihood function In words, MLE chooses the parameter setting which maximizes the likelihood of the observed sample But how do we define this likelihood as a function of θ? Definition Let f( x θ) be the probability of observating the sample x = (x 1,, x n ) from M(θ). The likelihood function is defined as: L(θ x) = f( x θ) MLE tries to find the parameter setting which maximizes the probability of observing the data

Log-likelihood Function i.i.d. If X 1, X 2,, X n f( x θ), the likelihood function simplifies to L(θ x) = n i=1 f(x i θ) In this case, the likelihood can become very small, because we are multiplying many terms To fix this computational problem, the log-likelihood l(θ x) = log(l(θ x)) is often used instead For i.i.d observations, this becomes: l(θ x) = log(l(θ x)) = n i=1 log[f(x i θ x)]

Maximum likehood estimator We can now formally define the estimator for MLE: Definition Given observed data x, the maximum likelihood estimator (MLE) of θ is defined as: θ Argmax [L(θ x)] θ Equivalently, because the log-function is monotonic, we can instead solve for: θ Argmax [l(θ x)] θ The latter is more numerically stable for optimization.

MLE: a simple example Suppose some count data is observed X i Z +, and the i.i.d. following Poisson model is assumed X i Pois(λ) The likelihood function can be shown to be L(λ x) = with log-likelihood function: n n i=1 ( λx i x i! e λ ) l(λ x) = log [( x i! e λ )] = log(λ) ( x i ) log(x i!) nλ i=1 λ x i n i=1 n i=1

MLE: a simple example Since l(λ x) is differentiable, we can solve for its minimizer by: Differentiating the log-likelihood: d[l(λ x)] dλ = 1 λ Setting it to 0, and solving for λ: λ = n i=1 x i n n i=1 x i n = X Checking the Hessian matrix is positive-definite: 2 λ l(λ x) 0

What if a closed-form solution does not exist? In most practical models, there are two computational difficulties: No closed-form solution exists for the MLE, Multiple locally optimal solutions or stationary points. Standard non-linear optimization methods (Nocedal and Wright, 2006) are often used as a black-box technique for obtaining locally optimal solutions.

What if a closed-form solution does not exist? We can do better if the model exhibits specific structure: If optimization program is convex, one can use some form of accelerated gradient descent (Nesterov, 1983) If non-convex but twice-differentiable, the limited-memory Broyden-Fletcher-Goldfard-Shanno (L-BFGS) method works quite well for local optimization If missing data is present, the EM algorithm (Dempster et al, 1977; Wu, 1983) can be employed

Section 3 Non-linear least-squares estimation (Adapted from en.wikipedia.org/wiki/non-linear_least_squares)

Problem statement Consider a set of m data points, (x 1, y 1 ), (x 2, y 2 ),..., (x m, y m ), and a curve (model function) y = f(x; β): Curve model depends on β = (β 1, β 2,..., β n ), with m n Want to choose the parameters β so that the curve best fits the given data in the least squares sense, i.e., such that S(β) = m i=1 r i (β) 2 is minimized, where the residuals r i (β) are defined as r i (β) = y i f(x i ; β)

Problem statement The minimum value of S occurs when the gradient equals zero: Since the model contains n parameters, there are n gradient equations to solve: S r i = 2 r i = 0, j = 1,..., n. β j i β j In a nonlinear system, this system of equations typically do not have a closed-form solution One way to solve for this system is: Set initial values for parameters Update parameters using the successive iterations: β (k+1) j = β (k) j + β j. Here, k is the increment count, and β is known as the shift vector

Deriving the normal equations From first-order Taylor series expansion about β k, we get: f(x i ; β) f(x i ; β (k) )+ j f(x i ; β (k) ) β j where J is the Jacobian matrix. (β j β (k) j ) f(x i ; β (k) )+ J ij β j, j

Deriving the normal equations In terms of the linearized model: with residuals given by: J ij = r i β j, r i = (y i f(x i ; β (k) ))+(f(x i ; β (k) ) f(x i ; β)) = y i J is β s where y i = y i f(x i ; β (k) ). Substituting these into the gradient equations, we get: m 2 J ij ( y i i=1 n s=1 J is β s ) = 0 n s=1

Deriving the normal equations This system of equations is better known as the set of normal equations: m n i=1 s=1 J ij J is β s = In matrix form, this becomes: where y = ( y 1,, y n ) T. m i=1 J ij y i j = 1,..., n. (J T J) β = J T y

Solving the normal equations Several methods have been proposed to solve the set of normal equations: Gauss-Newton method Levenberg-Marquardt algorithm Gradient methods Davidson-Fletcher-Powell Steepest descent

Solving the normal equations Several methods have been proposed to solve the set of normal equations: Gauss-Newton method Levenberg-Marquardt-Fletcher algorithm Gradient method Davidson-Fletcher-Powell Steepest descent We will focus on two in this lecture.

Gauss-Newton method Starting with an initial guess β (0) for the minimum, the Gauss-Newton method iteratively updates β (k) by solving for the shift vector β in the normal equations: β (k+1) = β (k) + (J T J) 1 J T r(β (k) ). This approach can encounter several problems: (J T J) 1 is often ill-conditioned, When far from fixed-point solution, such an iterative map may not be contractive; the sequence (β (k) ) k=1 may diverge without reaching a limiting solution.

Levenberg-Marquardt-Fletcher algorithm Historical perspective of the Levenberg-Marquardt-Fletcher (LMF) algorithm: The first form of LMF was first published in 1944 by Kenneth Levenberg, while working at the Frankford Army Arsenal, It was then rediscovered and improved upon by Donald Marquardt in 1963, who worked as a statistician at DuPont, A further modification was suggested by Roger Fletcher in 1971, which greatly improved the robustness of the algorithm.

Levenberg s contribution To improve the numerical stability of the algorithm, Levenberg (1944) proposed the modified iterative updates: β (k+1) = β (k) + (J T J + λi) 1 J T r(β (k) ), where λ is a damping factor and I is the identity matrix. λ 0 + gives the previous Gauss-Newton updates, which converges quickly but may diverge, λ gives the steepest-descent updates, which converge slowly but is more stable.

Marquardt s contribution How does one choose λ to control this trade-off between quick convergence and numerical stability? Marquardt (1963) proposes the improved iterative updates: β (k+1) = β (k) + (J T J + λ (k) I) 1 J T r(β (k) ), where the damping factor λ (k) can vary between iterations. When convergence is stable, the damping factor is iteratively decreased by λ (k+1) λ (k) /ν to exploit the accelerated Gauss-Newton rate, When divergence is observed, the damping factor is increased by λ (k+1) λ (k) ν to restore stability.

Fletcher s contribution Fletcher (1971) proposes a further improvement of this algorithm through the following update scheme: β (k+1) = β (k) + (J T J + λ diag{j T J}) 1 J T r(β (k) ), where diag{j T J} is a diagonal matrix of the diagonal entries in J T J. Intuition: By scaling each gradient component by the curvature, greater movement is encouraged in directions where the gradient is smaller, Can be viewed as a pre-conditioning step for solving ill-conditioned problems, Similar approach is used in Tikhonov regularization for linear least-squares.

Summary Parameter estimation is an unavoidable step in model validation and prediction, MLE is a popular approach for parameter estimation, For the non-linear least-squares problem, the Levenberg-Marquardt-Fletcher algorithm provides a stable and efficient method for estimating coefficients.