Link lecture - Lagrange Multipliers

Lagrange multipliers provide a method for finding a stationary point of a function, say f(x, y), when the variables are subject to constraints, say of the form g(x, y) = 0. Extra arguments may be needed to check whether a stationary point is a maximum, a minimum, or neither.

Links to:
- Calculus unit: the method uses simple properties of partial derivatives
- Statistics unit: it can be used to calculate or derive properties of estimators

The lecture will introduce the idea and cover two substantial examples:
- Gauss-Markov Theorem (links to Statistics 1, 4. Linear regression)
- Constrained MLEs (links to Statistics 1, 3. Maximum likelihood estimation)

1. Lagrange multipliers: simplest case

Consider a function f of just two variables x and y. Say we want to find a stationary point of f(x, y) subject to a single constraint of the form g(x, y) = 0.

Introduce a single new variable λ; we call λ a Lagrange multiplier.

Find all sets of values of (x, y, λ) such that

∇f = λ ∇g and g(x, y) = 0, where ∇f = (∂f/∂x, ∂f/∂y),

i.e.

∂f/∂x = λ ∂g/∂x, ∂f/∂y = λ ∂g/∂y and g(x, y) = 0

(number of equations = original number of variables + number of constraints).

Evaluate f(x, y) at each of these points. We can often identify the largest/smallest value as the maximum/minimum of f(x, y) subject to the constraint, taking account of whether f is unbounded or bounded above/below.
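As a concrete illustration of this recipe (my own addition, not part of the original lecture, and assuming sympy is available), the sketch below solves ∇f = λ∇g together with g = 0 symbolically for an illustrative problem: maximise f(x, y) = x + y on the unit circle.

```python
# Sketch: solve the Lagrange conditions symbolically for an illustrative problem,
# maximise f(x, y) = x + y subject to x^2 + y^2 = 1.
import sympy as sp

x, y, lam = sp.symbols('x y lam', real=True)
f = x + y                     # illustrative objective (not from the lecture)
g = x**2 + y**2 - 1           # constraint written as g(x, y) = 0

eqs = [sp.diff(f, x) - lam*sp.diff(g, x),   # df/dx = lam * dg/dx
       sp.diff(f, y) - lam*sp.diff(g, y),   # df/dy = lam * dg/dy
       g]                                   # g(x, y) = 0

for sol in sp.solve(eqs, [x, y, lam], dict=True):
    print(sol, ' f =', f.subs(sol))
# expect f = sqrt(2) at x = y = 1/sqrt(2) (max) and f = -sqrt(2) at x = y = -1/sqrt(2) (min)
```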

2. Geometric motivation

Consider finding a local maximum (or local minimum) of f(x, y) subject to a single constraint of the form g(x, y) = 0.

Recall that a contour of f is a set of points (x, y) for which f takes some given fixed value. Consider how the curve g(x, y) = 0 (a contour of g) intersects the contours of f.

In the following diagram, as we move from A to C along the curve g(x, y) = 0, the function f(x, y) first increases then decreases, with a stationary point (here a maximum) at B. At the stationary point the contours of f and g have a common tangent, so the normal vectors to the two contours at that point are parallel, so the gradient vectors are parallel, so

∇f = λ ∇g for some scalar λ.

[Figure: contours f(x, y) = 1, 2, 3, 4 crossed by the constraint curve g(x, y) = 0, which passes through the points A, B and C; at B the constraint curve and a contour of f have a common tangent, and ∇f is parallel to ∇g.]

3. General number of variables and constraints

The method easily generalises to finding the stationary points of a function f of n variables subject to k independent constraints.

E.g. consider a function f(x, y, z) of three variables x, y, z subject to two constraints g(x, y, z) = 0 and h(x, y, z) = 0. Then:
- at a stationary point ∇f lies in the plane determined by ∇g and ∇h
- introduce two Lagrange multipliers, say λ and µ (one per constraint)
- find all sets of values (x, y, z, λ, µ) satisfying the five (i.e. 3 + 2) equations

∇f = λ ∇g + µ ∇h, g(x, y, z) = 0 and h(x, y, z) = 0

i.e.

∂f/∂x = λ ∂g/∂x + µ ∂h/∂x, ∂f/∂y = λ ∂g/∂y + µ ∂h/∂y, ∂f/∂z = λ ∂g/∂z + µ ∂h/∂z,
g(x, y, z) = 0 and h(x, y, z) = 0
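The same symbolic approach extends to several variables and several multipliers. The sketch below is my own illustration (not from the lecture), assuming sympy is available: it solves the five equations for minimising x² + y² + z² subject to x + y + z = 1 and x − y = 0.

```python
# Sketch of the three-variable, two-constraint case (illustrative problem):
# minimise x^2 + y^2 + z^2 subject to x + y + z = 1 and x - y = 0.
import sympy as sp

x, y, z, lam, mu = sp.symbols('x y z lam mu', real=True)
f = x**2 + y**2 + z**2
g = x + y + z - 1
h = x - y

# df/dv = lam*dg/dv + mu*dh/dv for v in {x, y, z}, plus the two constraints
eqs = [sp.diff(f, v) - lam*sp.diff(g, v) - mu*sp.diff(h, v) for v in (x, y, z)]
eqs += [g, h]

print(sp.solve(eqs, [x, y, z, lam, mu], dict=True))
# expect the single stationary point x = y = z = 1/3 (with lam = 2/3, mu = 0)
```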

4. Interpretation in terms of the Lagrangian

Again consider the general case of finding a stationary point of a function f(x_1, ..., x_n), subject to k constraints g_1(x_1, ..., x_n) = 0, ..., g_k(x_1, ..., x_n) = 0.

Introduce k Lagrange multipliers λ_1, ..., λ_k.

Define the Lagrangian Λ by

Λ(x, λ) = f(x_1, ..., x_n) − ∑_{r=1}^{k} λ_r g_r(x_1, ..., x_n)
        = f(x_1, ..., x_n) − λ_1 g_1(x_1, ..., x_n) − ... − λ_k g_k(x_1, ..., x_n)

The stationary points of f subject to the constraints g_1 = 0, ..., g_k = 0 are precisely the sets of values of (x_1, ..., x_n, λ_1, ..., λ_k) at which

∂Λ/∂x_i = 0, i = 1, ..., n and ∂Λ/∂λ_r = 0, r = 1, ..., k

i.e. they are stationary points of the unconstrained function Λ.
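A quick symbolic check of this equivalence (my own addition, assuming sympy is available): for unspecified functions f and g of two variables, the partial derivatives of Λ = f − λg with respect to x, y and λ are exactly ∂f/∂x − λ ∂g/∂x, ∂f/∂y − λ ∂g/∂y and −g, so setting them all to zero reproduces the system of Section 1.

```python
# Sketch: stationarity of the Lagrangian in (x, y, lam) recovers
# grad f = lam * grad g together with g = 0 (abstract f and g).
import sympy as sp

x, y, lam = sp.symbols('x y lam')
f = sp.Function('f')(x, y)     # unspecified objective
g = sp.Function('g')(x, y)     # unspecified constraint function
L = f - lam * g                # the Lagrangian

print(sp.diff(L, x))    # df/dx - lam*dg/dx
print(sp.diff(L, y))    # df/dy - lam*dg/dy
print(sp.diff(L, lam))  # -g(x, y): setting this to zero is the constraint
```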

5. Example: maximise f(x, y) = xy subject to x + y = 1

i.e. subject to g(x, y) = 0 where g(x, y) = x + y − 1.

One constraint, so introduce one Lagrange multiplier λ.

Compute ∂f/∂x = y, ∂f/∂y = x, ∂g/∂x = 1, ∂g/∂y = 1, and solve the (two + one) equations

∂f/∂x = λ ∂g/∂x i.e. y = λ (1)
∂f/∂y = λ ∂g/∂y i.e. x = λ (2)
g(x, y) = 0 i.e. x + y = 1 (3)

Substituting (1) and (2) in (3) gives 2λ = 1, i.e. λ = 1/2, so from (1) and (2) the function has a stationary point subject to the constraint (here a maximum) at x = 1/2, y = 1/2.
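To tie Sections 4 and 5 together, here is a small sympy sketch (my own addition, not part of the lecture) that builds the Lagrangian Λ = xy − λ(x + y − 1) for this example and finds the stationary points of Λ as an unconstrained function; it should recover x = y = 1/2 with λ = 1/2.

```python
# Sketch: Example 5 solved via the Lagrangian of Section 4 (sympy assumed).
import sympy as sp

x, y, lam = sp.symbols('x y lam', real=True)
L = x*y - lam*(x + y - 1)                     # Lagrangian for Example 5

stationary = sp.solve([sp.diff(L, v) for v in (x, y, lam)],
                      [x, y, lam], dict=True)
print(stationary)   # expect x = 1/2, y = 1/2, lam = 1/2
```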

6. Example: maximise f(x, y) = x² + 2y² subject to x² + y² = 1

i.e. subject to g(x, y) = 0 where g(x, y) = x² + y² − 1.

One constraint, so introduce one Lagrange multiplier λ.

Compute ∂f/∂x = 2x, ∂f/∂y = 4y, ∂g/∂x = 2x, ∂g/∂y = 2y, and solve the (two + one) equations

∂f/∂x = λ ∂g/∂x i.e. 2x = 2λx (1)
∂f/∂y = λ ∂g/∂y i.e. 4y = 2λy (2)
g(x, y) = 0 i.e. x² + y² = 1 (3)

From (1), either λ = 1 or x = 0; from (2), either λ = 2 or y = 0. So the possible solutions are x = 0, λ = 2, y = ±1 and y = 0, λ = 1, x = ±1, where f(0, ±1) = 2 [maximum] while f(±1, 0) = 1 [minimum].
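The same example can also be checked numerically. The sketch below is my own addition (assuming numpy and scipy are installed): it maximises f on the unit circle by minimising −f with scipy's SLSQP solver under an equality constraint, and should return a point close to (0, ±1) where f = 2.

```python
# Numerical cross-check of Example 6: maximise f(x, y) = x^2 + 2y^2 on the
# unit circle by minimising -f subject to x^2 + y^2 - 1 = 0.
import numpy as np
from scipy.optimize import minimize

f = lambda v: v[0]**2 + 2*v[1]**2
con = {'type': 'eq', 'fun': lambda v: v[0]**2 + v[1]**2 - 1}

res = minimize(lambda v: -f(v), x0=np.array([0.5, 0.5]),
               method='SLSQP', constraints=[con])
print(res.x, f(res.x))   # expect a point near (0, +/-1) with f close to 2
```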

7. Gauss-Markov Theorem: a minimum variance property of least squares estimators

In linear regression the values y_1, ..., y_n are assumed to be observed values of random variables Y_1, ..., Y_n satisfying the model

E(Y_i) = α + βx_i, Var(Y_i) = σ², i = 1, ..., n

A linear estimator of β is an estimator of the form c_1 Y_1 + c_2 Y_2 + ... + c_n Y_n for some choice of constants c_1, ..., c_n.

The variance of a linear estimator (assuming, as usual, that the Y_i are uncorrelated) is

Var(c_1 Y_1 + ... + c_n Y_n) = c_1² Var(Y_1) + ... + c_n² Var(Y_n) = c_1² σ² + ... + c_n² σ² = σ² ∑_{i=1}^{n} c_i²

A linear estimator β̂ = c_1 Y_1 + ... + c_n Y_n is unbiased if E(β̂) = β, i.e.

β = E(c_1 Y_1 + c_2 Y_2 + ... + c_n Y_n)
  = c_1 E(Y_1) + ... + c_n E(Y_n)
  = c_1(α + βx_1) + ... + c_n(α + βx_n)
  = α(c_1 + ... + c_n) + β(c_1 x_1 + ... + c_n x_n)

which, whatever the values of α and β, requires ∑_{i=1}^{n} c_i = 0 and ∑_{i=1}^{n} c_i x_i = 1.
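The following Monte Carlo sketch (my own addition, with made-up design points and parameter values) illustrates these two facts numerically: any weights c_i with ∑c_i = 0 and ∑c_i x_i = 1 give an unbiased estimator of β with variance σ² ∑c_i². Here an "endpoints only" choice of weights is used; it is unbiased, but, as the rest of this section shows, not the minimum variance choice.

```python
# Sketch: unbiasedness and the variance formula for one linear estimator
# of beta, checked by simulation (made-up x values and parameters).
import numpy as np

rng = np.random.default_rng(1)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
alpha, beta, sigma = 2.0, 0.7, 1.0

# weights using only the endpoints: sum(c) = 0 and sum(c * x) = 1
c = np.array([-1.0, 0.0, 0.0, 0.0, 1.0]) / (x[-1] - x[0])
print(c.sum(), np.sum(c * x))                  # the two constraints

est = [np.sum(c * (alpha + beta*x + rng.normal(scale=sigma, size=x.size)))
       for _ in range(20000)]
print(np.mean(est), beta)                      # unbiased: mean is close to beta
print(np.var(est), sigma**2 * np.sum(c**2))    # variance matches sigma^2 * sum(c^2)
```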

Thus, for fixed σ², a linear estimator that has minimum variance in the class of linear unbiased estimators is obtained by choosing the variables c_1, ..., c_n to minimise the objective function

f(c_1, ..., c_n) = ∑_{i=1}^{n} c_i²

subject to the two constraints

g(c_1, ..., c_n) = ∑_{i=1}^{n} c_i = 0 and h(c_1, ..., c_n) = ∑_{i=1}^{n} c_i x_i − 1 = 0

We introduce two Lagrange multipliers λ and µ, compute the 3n partial derivatives

∂f/∂c_i = 2c_i, ∂g/∂c_i = 1, ∂h/∂c_i = x_i, i = 1, ..., n

and solve the (n + 2) equations

∂f/∂c_i = λ ∂g/∂c_i + µ ∂h/∂c_i, i = 1, ..., n, ∑_{i=1}^{n} c_i = 0, ∑_{i=1}^{n} c_i x_i − 1 = 0

The first n equations give

2c_i = λ + µx_i, i = 1, ..., n

Summing these n equations over i = 1, ..., n and using ∑_{i=1}^{n} c_i = 0 gives

2∑c_i = nλ + µ∑x_i ⇒ λ = −µ(∑x_i)/n = −µx̄

On the other hand, multiplying each equation by x_i and then summing over i = 1, ..., n gives

2∑c_i x_i = λ∑x_i + µ∑x_i²

Now using ∑_{i=1}^{n} c_i x_i − 1 = 0 and λ = −µx̄ from above gives

2 = −µx̄∑x_i + µ∑x_i² ⇒ µ = 2/(∑x_i² − nx̄²)

Thus the values of the variables c_1, ..., c_n that minimise the variance of the constrained linear estimator c_1 Y_1 + c_2 Y_2 + ... + c_n Y_n are the values satisfying the equations

c_i = (λ + µx_i)/2, i = 1, ..., n, where λ = −µx̄ and µ = 2/(∑x_i² − nx̄²)

so

c_i = (x_i − x̄)/(∑x_i² − nx̄²), i = 1, ..., n

and the resulting estimate is

β̂ = ∑ c_i y_i = ∑ y_i (x_i − x̄)/(∑x_i² − nx̄²) = (∑ x_i y_i − nx̄ȳ)/(∑x_i² − nx̄²)

which is just the least squares estimate. This gives a simple proof, in the linear regression case, of the Gauss-Markov theorem: the least squares estimator β̂ has minimum variance in the class of all linear unbiased estimators of β.
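As a numerical sanity check (my own addition, with simulated data and assuming numpy is available), the sketch below forms the Lagrange-derived weights c_i = (x_i − x̄)/(∑x_i² − nx̄²), verifies the two constraints, and confirms that ∑ c_i y_i agrees with the least squares slope returned by np.polyfit.

```python
# Sketch: the Lagrange-derived weights reproduce the least squares slope.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 20)
y = 1.5 + 2.0*x + rng.normal(scale=0.5, size=x.size)   # simulated responses

n, xbar = x.size, x.mean()
c = (x - xbar) / (np.sum(x**2) - n*xbar**2)             # c_i from the derivation

print(np.sum(c), np.sum(c * x))     # constraints: 0 and 1 (up to rounding)
print(np.sum(c * y))                # Lagrange-derived estimate of beta
print(np.polyfit(x, y, 1)[0])       # least squares slope: should agree
```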

8. Maximum likelihood estimates: multinomial distributions

Consider a statistical experiment in which a sample of size m is drawn from a large population. Assume each observation can take one of four values, say A_1, A_2, A_3 or A_4. The respective proportions of these values in the population are θ_1, θ_2, θ_3 and θ_4, so

0 < θ_i < 1, i = 1, ..., 4 and θ_1 + θ_2 + θ_3 + θ_4 = 1

Assume there are m_1 observations with value A_1, m_2 with value A_2, m_3 with value A_3 and m_4 with value A_4, so m_1 + m_2 + m_3 + m_4 = m. What are the maximum likelihood estimates of θ_1, θ_2, θ_3 and θ_4?

[Note that in this example we use m rather than n to denote the sample size, so as not to clash with the notation in earlier sections, where n denoted the number of variables we were optimising over.]

Here the m_i are observed values of random variables M_i, i = 1, ..., 4, where the joint distribution of M_1, M_2, M_3 and M_4 is called a multinomial distribution. The joint distribution has probability mass function

p(m_1, m_2, m_3, m_4; θ_1, θ_2, θ_3, θ_4) = [m! / (m_1! m_2! m_3! m_4!)] θ_1^{m_1} θ_2^{m_2} θ_3^{m_3} θ_4^{m_4}

and so has log likelihood function

l(θ_1, θ_2, θ_3, θ_4) = c + m_1 log θ_1 + m_2 log θ_2 + m_3 log θ_3 + m_4 log θ_4

where the constant c = log m! − (log m_1! + log m_2! + log m_3! + log m_4!).

Since the θ_i are probabilities and must therefore sum to 1, the maximum likelihood estimates θ̂_1, θ̂_2, θ̂_3, θ̂_4 are the values that maximise the log likelihood l(θ_1, θ_2, θ_3, θ_4), subject to the condition θ_1 + θ_2 + θ_3 + θ_4 = 1.
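As a quick check of this log likelihood (my own addition, with hypothetical counts and assuming scipy is installed), the sketch below evaluates the formula directly, using gammaln(k + 1) = log k! for the constant c, and compares it with scipy.stats.multinomial.logpmf.

```python
# Sketch: the stated multinomial log likelihood agrees with scipy's logpmf.
import numpy as np
from scipy.special import gammaln
from scipy.stats import multinomial

m_counts = np.array([3, 5, 2, 6])                  # hypothetical observed counts
theta = np.array([0.2, 0.3, 0.1, 0.4])             # hypothetical probabilities
m = m_counts.sum()

const = gammaln(m + 1) - np.sum(gammaln(m_counts + 1))   # log m! - sum log m_i!
loglik = const + np.sum(m_counts * np.log(theta))

print(loglik)
print(multinomial.logpmf(m_counts, n=m, p=theta))  # should agree
```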

Thus we want to maximise the objective function

l(θ_1, θ_2, θ_3, θ_4) = c + m_1 log θ_1 + m_2 log θ_2 + m_3 log θ_3 + m_4 log θ_4

subject to the constraint

g(θ_1, θ_2, θ_3, θ_4) = θ_1 + θ_2 + θ_3 + θ_4 − 1 = 0

We introduce a single Lagrange multiplier λ and compute the partial derivatives

∂l/∂θ_i = m_i/θ_i, ∂g/∂θ_i = 1, i = 1, ..., 4

and solve the (four plus one) equations

∂l/∂θ_i = λ ∂g/∂θ_i, i = 1, ..., 4 and θ_1 + θ_2 + θ_3 + θ_4 − 1 = 0

The first four equations give

m_i/θ_i = λ, i = 1, ..., 4

i.e.

θ_1 = m_1/λ, θ_2 = m_2/λ, θ_3 = m_3/λ, θ_4 = m_4/λ

Substituting these values into the last equation gives

1 = θ_1 + θ_2 + θ_3 + θ_4 = m_1/λ + m_2/λ + m_3/λ + m_4/λ = (m_1 + m_2 + m_3 + m_4)/λ = m/λ

Putting λ = m back into the equations for each θ_i, we see that the maximising values (the maximum likelihood estimates) are

θ̂_1 = m_1/m, θ̂_2 = m_2/m, θ̂_3 = m_3/m, θ̂_4 = m_4/m
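A numerical cross-check (my own addition, with hypothetical counts and assuming scipy is installed): maximising the multinomial log likelihood subject to θ_1 + θ_2 + θ_3 + θ_4 = 1 with a general-purpose constrained optimiser should land on the closed-form answer θ̂_i = m_i/m.

```python
# Sketch: constrained numerical maximisation of the multinomial log likelihood
# recovers the closed-form MLE theta_i = m_i / m (hypothetical counts).
import numpy as np
from scipy.optimize import minimize

m_counts = np.array([12, 7, 25, 6])
m = m_counts.sum()

negloglik = lambda theta: -np.sum(m_counts * np.log(theta))
con = {'type': 'eq', 'fun': lambda theta: theta.sum() - 1}
bounds = [(1e-9, 1.0)] * 4                    # keep log(theta) well defined

res = minimize(negloglik, x0=np.full(4, 0.25), method='SLSQP',
               constraints=[con], bounds=bounds)
print(res.x)            # numerical maximiser
print(m_counts / m)     # closed-form MLE from the lecture: [0.24, 0.14, 0.5, 0.12]
```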