Fundamentals of Unconstrained Optimization

dalmau@cimat.mx, Centro de Investigación en Matemáticas (CIMAT) A.C., Mexico. January 2016

Outline: Introduction

Optimization problem

$$\min_{x \in \Omega} f(x)$$

where $f(x)$ is a real-valued function. The function $f : \mathbb{R}^n \to \mathbb{R}$ is called the objective function or cost function. The vector $x$ is an $n$-vector of independent variables, $x = [x_1, x_2, \dots, x_n]^T \in \mathbb{R}^n$; the variables $x_1, x_2, \dots, x_n$ are often referred to as decision variables. The set $\Omega \subseteq \mathbb{R}^n$ is called the constraint set or feasible set.

Introduction

Definition: Suppose that $f : \mathbb{R}^n \to \mathbb{R}$ is a real-valued function defined on $\Omega \subseteq \mathbb{R}^n$. A point $x^* \in \Omega$ is a local minimizer of $f$ over $\Omega$ if there exists $\epsilon > 0$ such that $f(x) \geq f(x^*)$ for all $x \in \Omega \setminus \{x^*\}$ with $\|x - x^*\| < \epsilon$.

Definition: Suppose that $f : \mathbb{R}^n \to \mathbb{R}$ is a real-valued function defined on $\Omega \subseteq \mathbb{R}^n$. A point $x^* \in \Omega$ is a global minimizer of $f$ over $\Omega$ if $f(x) \geq f(x^*)$ for all $x \in \Omega \setminus \{x^*\}$.

Replacing $\geq$ with $>$ in the previous definitions, we obtain a strict local minimizer and a strict global minimizer, respectively.

Introduction

If $x^*$ is a global minimizer of $f$ over $\Omega$, we write
$$f(x^*) = \min_{x \in \Omega} f(x), \qquad x^* = \arg\min_{x \in \Omega} f(x).$$
If $x^*$ is a global minimizer of $f$ over $\mathbb{R}^n$, i.e. the unconstrained problem, we write
$$f(x^*) = \min_{x} f(x), \qquad x^* = \arg\min_{x} f(x).$$
In general, global minimizers are difficult to find, so in practice we are often satisfied with finding local minimizers.

First order necessary conditions

Theorem: If $x^*$ is a local minimizer (or maximizer) and $f$ is continuously differentiable in an open neighborhood of $x^*$, then $\nabla f(x^*) = 0$.

First order necessary conditions

Theorem: If $x^*$ is a local minimizer (or maximizer) and $f$ is continuously differentiable in an open neighborhood of $x^*$, then $\nabla f(x^*) = 0$.

Proof (1): Suppose that $\nabla f(x^*) \neq 0$. Then the direction $v = -\nabla f(x^*)/\|\nabla f(x^*)\|$ satisfies $\nabla f(x^*)^T v < 0$. Let $h(\theta) = \nabla f(x^* + \theta v)^T v$. Since $h$ is continuous and $h(0) < 0$, there exists $\epsilon > 0$ such that $h(\theta) < 0$ for all $\theta \in (0, \epsilon)$. By Taylor's theorem, for each $\hat{\epsilon} \in (0, \epsilon)$ there exists $\tau \in (0, 1)$ such that
$$f(x^* + \hat{\epsilon} v) = f(x^*) + \hat{\epsilon}\, \nabla f(x^* + \tau \hat{\epsilon} v)^T v.$$
Defining $\theta = \tau \hat{\epsilon}$, we have $\theta \in (0, \hat{\epsilon}) \subset (0, \epsilon)$, so $\nabla f(x^* + \tau \hat{\epsilon} v)^T v < 0$, and consequently $f(x^* + \hat{\epsilon} v) < f(x^*)$, which contradicts the assumption that $x^*$ is a local minimizer.

First order necessary conditions

Theorem: If $x^*$ is a local minimizer (or maximizer) and $f$ is continuously differentiable in an open neighborhood of $x^*$, then $\nabla f(x^*) = 0$.

Proof (2): Suppose that $\nabla f(x^*) \neq 0$ and write $g = \nabla f$. Consider the direction $h = -\alpha\, g(x^*)/\|g(x^*)\| = \alpha v$ with $\alpha > 0$, for which $g(x^*)^T h < 0$. Using Taylor's formula at $x = x^* + h$,
$$f(x) = f(x^*) + g(x^*)^T h + o(\|h\|).$$
If $\alpha \to 0$ then $\|h\| \to 0$, and for $\alpha$ small enough $g(x^*)^T h + o(\|h\|) < 0$, because $o(\|h\|)$ goes to zero faster than $g(x^*)^T h$; indeed
$$\lim_{\alpha \to 0} \frac{g(x^*)^T h}{\|h\|} = g(x^*)^T v < 0.$$
Therefore $f(x) < f(x^*)$. This contradicts the assumption that $x^*$ is a minimizer.

First order necessary conditions

Theorem: If $x^*$ is a local minimizer (or maximizer) and $f$ is continuously differentiable in an open neighborhood of $x^*$, then $\nabla f(x^*) = 0$.

A point that satisfies $\nabla f(x^*) = 0$ is called a stationary point. According to the previous theorem, any local minimizer (or maximizer) must be a stationary point.

Second order necessary conditions

Theorem: If $x^*$ is a local minimizer (maximizer) of $f$, and $\nabla^2 f$ exists and is continuous in an open neighborhood of $x^*$, then $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*)$ is positive semidefinite (negative semidefinite).

Second order necessary conditions

Theorem: If $x^*$ is a local minimizer (maximizer) of $f$, and $\nabla^2 f$ exists and is continuous in an open neighborhood of $x^*$, then $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*)$ is positive semidefinite (negative semidefinite).

Proof: From the previous theorem, $\nabla f(x^*) = 0$. By contradiction, assume that $\nabla^2 f(x^*)$ is not positive semidefinite. Then we can choose a vector $v$ such that $v^T \nabla^2 f(x^*) v < 0$, and because $\nabla^2 f$ is continuous near $x^*$, there is a scalar $\epsilon > 0$ such that $v^T \nabla^2 f(x^* + \hat{\epsilon} v) v < 0$ for all $\hat{\epsilon} \in [0, \epsilon)$.

Second order necessary conditions

Theorem: If $x^*$ is a local minimizer (maximizer) of $f$, and $\nabla^2 f$ exists and is continuous in an open neighborhood of $x^*$, then $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*)$ is positive semidefinite (negative semidefinite).

Proof (continued): Applying Taylor's theorem around $x^*$, for each $\hat{\epsilon} \in [0, \epsilon)$ there exists $\tau \in (0, 1)$ such that
$$f(x^* + \hat{\epsilon} v) = f(x^*) + \hat{\epsilon}\, \nabla f(x^*)^T v + \tfrac{1}{2}\hat{\epsilon}^2\, v^T \nabla^2 f(x^* + \tau\hat{\epsilon} v)\, v.$$
Using $\nabla f(x^*)^T v = 0$ and $v^T \nabla^2 f(x^* + \tau\hat{\epsilon} v)\, v < 0$, we obtain $f(x^* + \hat{\epsilon} v) < f(x^*)$, which is a contradiction.

Second order sufficient conditions

Theorem: Suppose that $\nabla^2 f$ exists and is continuous in an open neighborhood of $x^*$, and that $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*)$ is positive definite (negative definite). Then $x^*$ is a strict local minimizer (maximizer) of $f$.

Proof: Since $\nabla^2 f$ is continuous and $\nabla^2 f(x^*)$ is positive definite, there exists a ball $B_r(x^*) = \{x : \|x - x^*\| < r\}$ on which $q(\theta) = h^T \nabla^2 f(z)\, h > 0$ for every $h \neq 0$ with $\|h\| < r$ and $z = x^* + \theta h$ with $\theta \in (0, 1)$ (note that $z \in B_r(x^*)$), because $q$ is continuous and $q(0) > 0$. Using Taylor's theorem with $x = x^* + h \in B_r(x^*)$, i.e. $\|h\| < r$, and the fact that $\nabla f(x^*) = 0$, there exists $\theta \in (0, 1)$ such that
$$f(x) = f(x^*) + \tfrac{1}{2}\, h^T \nabla^2 f(x^* + \theta h)\, h.$$
As $z = x^* + \theta h \in B_r(x^*)$, we have $h^T \nabla^2 f(z)\, h > 0$, and hence $f(x) > f(x^*)$ for all $x \in B_r(x^*)$ with $x \neq x^*$, which gives the result.

Classification of stationary points

Definition: A point $x^*$ that satisfies $g(x^*) = \nabla f(x^*) = 0$ is called a stationary point.

Definition: A stationary point $x^*$ that is neither a local maximizer nor a local minimizer is called a saddle point.

Classification of stationary points

At a point $x = x^* + \alpha d$ in the neighborhood of a saddle point $x^*$, the Taylor series gives
$$f(x) = f(x^*) + \tfrac{1}{2}\alpha^2\, d^T H(x^*)\, d + o(\alpha^2 \|d\|^2),$$
since $g(x^*) = 0$. As $x^*$ is neither a maximizer nor a minimizer, there must be directions $d_1$, $d_2$ (equivalently, nearby points $x_1 = x^* + \alpha d_1$ and $x_2 = x^* + \alpha d_2$) such that
$$f(x^* + \alpha d_1) < f(x^*) \quad\Rightarrow\quad d_1^T H(x^*)\, d_1 < 0,$$
$$f(x^* + \alpha d_2) > f(x^*) \quad\Rightarrow\quad d_2^T H(x^*)\, d_2 > 0.$$
Then $H(x^*)$ is indefinite.

Find and classify stationary points

We can find and classify stationary points as follows. Find the points $x^*$ at which $g(x^*) = \nabla f(x^*) = 0$. Obtain the Hessian $H(x)$. Determine the character of $H(x^*)$ for each point $x^*$. If $H(x^*)$ is positive (negative) definite, then $x^*$ is a minimizer (maximizer). If $H(x^*)$ is indefinite, $x^*$ is a saddle point. If $H(x^*)$ is positive (negative) semidefinite, $x^*$ may be a minimizer (maximizer); in this case further work is necessary to classify the stationary point. A possible approach is to compute the third partial derivatives of $f$ at $x^*$ and then evaluate the corresponding term in the Taylor series. If this term is zero, the next term needs to be calculated, and so on (see the next slide for the 1D case).
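As an illustration of this procedure (my own sketch, not part of the original slides), the following Python snippet classifies a stationary point of the hypothetical function $f(x_1, x_2) = x_1^2 + x_1 x_2 + 2 x_2^2$ by checking the gradient and the eigenvalues of the Hessian with NumPy.

```python
import numpy as np

def grad(x):
    # Gradient of the hypothetical f(x1, x2) = x1^2 + x1*x2 + 2*x2^2, worked out by hand.
    return np.array([2*x[0] + x[1], x[0] + 4*x[1]])

def hessian(x):
    # Hessian of the same function (constant, since f is quadratic).
    return np.array([[2.0, 1.0],
                     [1.0, 4.0]])

x_star = np.array([0.0, 0.0])          # candidate stationary point
assert np.allclose(grad(x_star), 0.0)  # first order necessary condition holds

eigvals = np.linalg.eigvalsh(hessian(x_star))
if np.all(eigvals > 0):
    kind = "local minimizer"
elif np.all(eigvals < 0):
    kind = "local maximizer"
elif np.any(eigvals > 0) and np.any(eigvals < 0):
    kind = "saddle point"
else:
    kind = "inconclusive (semidefinite Hessian)"

print(kind, eigvals)  # local minimizer, both eigenvalues positive
```

Here both eigenvalues are positive, so the Hessian is positive definite and $x^* = (0, 0)$ is a strict local (in fact global) minimizer.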

Find and classify stationary points: 1D case

Assume that $f^{(1)}(x_0) = f^{(2)}(x_0) = \dots = f^{(n-1)}(x_0) = 0$ and $f^{(n)}(x_0) \neq 0$, i.e. either $f^{(n)}(x_0) > 0$ or $f^{(n)}(x_0) < 0$. Using Taylor's theorem,
$$f(x) = f(x_0) + \frac{f^{(n)}(x_0 + th)}{n!}(x - x_0)^n, \qquad h = x - x_0,$$
for some $t \in (0, 1)$. Therefore, one just needs to consider the sign of
$$q = \frac{f^{(n)}(x_0 + th)}{n!}(x - x_0)^n.$$

Find and classify stationary points: 1D case

Under the same assumption, the sign of $q = \frac{f^{(n)}(x_0 + th)}{n!}(x - x_0)^n$ determines the character of $x_0$.

If $q$ is positive for all $x \in (x_0 - \delta, x_0 + \delta)$, then $f(x) > f(x_0)$, and $x_0$ is a local minimum.

If $q$ is negative for all $x \in (x_0 - \delta, x_0 + \delta)$, then $f(x) < f(x_0)$, and $x_0$ is a local maximum.

If $q$ takes both positive and negative values for $x \in (x_0 - \delta, x_0 + \delta)$, then $x_0$ is an inflection point.

Find and classify stationary points: 1D case

Using Taylor's theorem,
$$f(x) = f(x_0) + \frac{f^{(n)}(x_0 + th)}{n!}(x - x_0)^n$$
for some $t \in (0, 1)$. What is the sign of $q = \frac{f^{(n)}(x_0 + th)}{n!}(x - x_0)^n$?

Find and classify stationary points: 1D case

Using Taylor's theorem, $f(x) = f(x_0) + \frac{f^{(n)}(x_0 + th)}{n!}(x - x_0)^n$ for some $t \in (0, 1)$. What is the sign of $q = \frac{f^{(n)}(x_0 + th)}{n!}(x - x_0)^n$?

If $n$ is even, then the sign of $q$ depends only on the factor $f^{(n)}(x_0 + th)$. By the sign-preserving property of continuous functions, if $f^{(n)}(x_0) > 0$ then $q > 0$ in a neighborhood of $x_0$ and $x_0$ is a local minimum; on the contrary, if $f^{(n)}(x_0) < 0$ then $q < 0$ in a neighborhood of $x_0$ and $x_0$ is a local maximum.

If $n$ is odd, $q$ can be positive or negative regardless of the sign of $f^{(n)}(x_0)$: the sign of $q$ changes between $x > x_0$ and $x < x_0$. Then $x_0$ is an inflection point.
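A minimal sketch of this higher-order test (my own illustration, assuming SymPy is available; the example functions are hypothetical and not from the slides):

```python
import sympy as sp

x = sp.symbols('x')

def classify_1d(f, x0):
    # Assumes x0 is a stationary point of f.
    # Find the order n of the first nonvanishing derivative at x0,
    # then apply the even/odd test described above.
    n = 1
    d = sp.diff(f, x, n)
    while sp.simplify(d.subs(x, x0)) == 0:
        n += 1
        d = sp.diff(f, x, n)
    val = d.subs(x, x0)
    if n % 2 == 0:
        return "local minimum" if val > 0 else "local maximum"
    return "inflection point"

print(classify_1d(x**4, 0))   # local minimum    (n = 4, f^(4)(0) = 24 > 0)
print(classify_1d(-x**6, 0))  # local maximum    (n = 6, f^(6)(0) < 0)
print(classify_1d(x**3, 0))   # inflection point (n = 3, odd)
```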

$x^*$ is a stationary point and $H(x^*) = 0$

In the special case where $H(x^*) = 0$, $x^*$ can be a minimizer or a maximizer, since the necessary conditions are satisfied in both cases. If $H(x^*)$ is semidefinite, more information is required for the complete characterization of the stationary point, and further work is necessary.

$x^*$ is a stationary point and $H(x^*) = 0$

A possible approach is to compute the third partial derivatives of $f(x)$ and then evaluate the corresponding term in the Taylor series, $D^3 f(x^*)/3!$. If this term is zero, then the next term $D^4 f(x^*)/4!$ needs to be computed, and so on:
$$f(x + h) = f(x) + g(x)^T h + \tfrac{1}{2}\, h^T H(x)\, h + \tfrac{1}{3!} D^3 f(x) + \cdots$$
where
$$D^r f(x) = \sum_{i_1=1}^{n}\sum_{i_2=1}^{n}\cdots\sum_{i_r=1}^{n} \frac{\partial^r f(x)}{\partial x_{i_1}\,\partial x_{i_2}\cdots \partial x_{i_r}}\; h_{i_1} h_{i_2}\cdots h_{i_r}.$$
$P(h) = D^r f(x) : \mathbb{R}^n \to \mathbb{R}$ is a homogeneous polynomial of degree $r$ in the variable $h$ (a multilinear form evaluated at $h$).

Example when $H(x^*) = 0$

$$f(x) = \tfrac{1}{6}\left[(x_1 - 2)^3 + (x_2 - 3)^3\right]$$
$$\nabla f(x) = \tfrac{1}{2}\left[(x_1 - 2)^2,\; (x_2 - 3)^2\right]^T = 0 \quad\Rightarrow\quad x^* = [2, 3]^T$$
$$H(x) = \begin{bmatrix} x_1 - 2 & 0 \\ 0 & x_2 - 3 \end{bmatrix}, \qquad H(x^*) = 0,$$
since $\frac{\partial^2 f(x)}{\partial x_1^2} = x_1 - 2$, $\frac{\partial^2 f(x)}{\partial x_2^2} = x_2 - 3$, and $\frac{\partial^2 f(x)}{\partial x_1 \partial x_2} = \frac{\partial^2 f(x)}{\partial x_2 \partial x_1} = 0$. The third derivatives of $f$ at $x^*$ are all zero except $\frac{\partial^3 f(x^*)}{\partial x_1^3} = \frac{\partial^3 f(x^*)}{\partial x_2^3} = 1$.

Example when $H(x^*) = 0$

Since the third derivatives of $f$ at $x^*$ are all zero except $\frac{\partial^3 f(x^*)}{\partial x_1^3} = \frac{\partial^3 f(x^*)}{\partial x_2^3} = 1$, we have
$$D^3 f(x^*) = \sum_{i_1, i_2, i_3 = 1}^{2} \frac{\partial^3 f(x^*)}{\partial x_{i_1}\,\partial x_{i_2}\,\partial x_{i_3}}\; h_{i_1} h_{i_2} h_{i_3} = h_1^3\,\frac{\partial^3 f(x^*)}{\partial x_1^3} + h_2^3\,\frac{\partial^3 f(x^*)}{\partial x_2^3} = h_1^3 + h_2^3,$$
which is positive if $h_1, h_2 > 0$ and negative if $h_1, h_2 < 0$. Then $x^*$ is a saddle point, because $f(x^* + h) > f(x^*)$ if $h_1, h_2 > 0$ and $f(x^* + h) < f(x^*)$ if $h_1, h_2 < 0$ (for $\|h\|$ small).
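A quick numerical check of this conclusion (my own sketch, not part of the original slides), evaluating the example function on both sides of $x^*$:

```python
import numpy as np

def f(x):
    # Example from the slides: f(x) = (1/6) [(x1 - 2)^3 + (x2 - 3)^3].
    return ((x[0] - 2.0)**3 + (x[1] - 3.0)**3) / 6.0

x_star = np.array([2.0, 3.0])
h = np.array([1e-3, 1e-3])

print(f(x_star + h) > f(x_star))  # True: f increases along h = (+, +)
print(f(x_star - h) < f(x_star))  # True: f decreases along h = (-, -)
```

So $x^*$ is neither a minimizer nor a maximizer, consistent with it being a saddle point.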

$x^*$ is a stationary point and $H(x^*) \neq 0$

In this case, and from the previous discussion, the problem of classifying stationary points of the function $f(x)$ becomes the problem of characterizing the Hessian $H(x)$ at $x = x^*$, i.e., one needs to determine whether $H(x^*)$ is positive definite, negative definite, positive semidefinite, negative semidefinite, or indefinite.

$x^*$ is a stationary point and $H(x^*) \neq 0$

Theorem (characterization of symmetric matrices): A real symmetric $n \times n$ matrix $H$ is positive definite, positive semidefinite, etc., if and only if, for every nonsingular matrix $B$ of the same order, the $n \times n$ matrix $\hat H = B^T H B$ is positive definite, positive semidefinite, etc.

Proof: Let $x \neq 0$. Then $x^T \hat H x = x^T B^T H B x = y^T H y$, where $y = Bx \neq 0$ since $B$ is nonsingular. Hence, if $H$ is positive definite (positive semidefinite, etc.), then $x^T \hat H x = y^T H y > 0$ (respectively $\geq 0$, etc.), and therefore $\hat H$ is positive definite (positive semidefinite, etc.). The converse follows by applying the same argument to $H = (B^{-1})^T \hat H B^{-1}$.

Theorem (characterization of symmetric matrices via diagonalization)

1. If the $n \times n$ matrix $B$ is nonsingular and $\hat H = B^T H B$ is a diagonal matrix with diagonal elements $\hat h_1, \hat h_2, \dots, \hat h_n$, then $H$ is positive definite, positive semidefinite, negative semidefinite, or negative definite if $\hat h_i > 0$, $\geq 0$, $\leq 0$, or $< 0$, respectively, for $i = 1, 2, \dots, n$. Otherwise, if some $\hat h_i$ are positive and some are negative, $H$ is indefinite.

2. The converse of part 1 is also true: if $H$ is positive definite, positive semidefinite, etc., then $\hat h_i > 0$, $\geq 0$, etc., and if $H$ is indefinite, then some $\hat h_i$ are positive and some are negative.

Theorem (characterization of symmetric matrices via diagonalization)

Proof (Part 1): $x^T \hat H x = \hat h_1 x_1^2 + \hat h_2 x_2^2 + \dots + \hat h_n x_n^2$. If $\hat h_i > 0$, $\geq 0$, $\leq 0$, $< 0$ for $i = 1, 2, \dots, n$, then $x^T \hat H x > 0$, $\geq 0$, $\leq 0$, $< 0$ for all $x \neq 0$, and therefore $\hat H$ is positive definite, positive semidefinite, negative semidefinite, or negative definite, respectively. If some $\hat h_i$ are positive and some are negative, one can find vectors $x$ that make $x^T \hat H x$ positive or negative, and then $\hat H$ is indefinite. Using the previous theorem (slide 24), one concludes that $H = (B^{-1})^T \hat H B^{-1}$ is also positive definite, positive semidefinite, negative semidefinite, negative definite, or indefinite, respectively.

Theorem (characterization of symmetric matrices via diagonalization)

Proof (Part 2): Suppose that $H$ is positive definite, positive semidefinite, etc. Since $\hat H = B^T H B$, it follows from the previous theorem (slide 24) that $\hat H$ is positive definite, positive semidefinite, etc. Taking $x$ to be a vector of the canonical basis, $x = e_j$, we get $x^T \hat H x = e_j^T \hat H e_j = \hat h_j > 0$, $\geq 0$, etc., for $j = 1, 2, \dots, n$. On the other hand, if $H$ is indefinite, then by the theorem in slide 24, $\hat H$ is indefinite, and therefore some $\hat h_i$ must be positive and some must be negative (otherwise $\hat H$, and hence $H$, would be positive definite, positive semidefinite, etc.).
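A quick NumPy check (my own sketch, not from the slides) of the congruence result above: for a hypothetical positive definite $H$ and a random nonsingular $B$, the matrix $B^T H B$ has the same character.

```python
import numpy as np

rng = np.random.default_rng(2)

H = np.array([[3.0, 1.0],
              [1.0, 2.0]])              # hypothetical positive definite matrix
B = rng.standard_normal((2, 2))         # random B, nonsingular with probability 1
assert abs(np.linalg.det(B)) > 1e-12    # sanity check that B is indeed nonsingular

H_hat = B.T @ H @ B

print(np.linalg.eigvalsh(H))      # all positive
print(np.linalg.eigvalsh(H_hat))  # also all positive: definiteness is preserved
```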

Theorem (eigendecomposition of symmetric matrices)

1. If $H$ is a real symmetric matrix, then there exists a real unitary (orthogonal) matrix $U$ such that $\Lambda = U^T H U$ is a diagonal matrix whose diagonal elements are the eigenvalues of $H$.

2. The eigenvalues of $H$ are real.

Comments on part 1

Part 1: If $H$ is a real symmetric matrix, then there exists a real unitary (orthogonal) matrix $U$ such that $\Lambda = U^T H U$ is a diagonal matrix whose diagonal elements are the eigenvalues of $H$.

Comments: According to the (real) Schur decomposition, any real matrix $A$ can be written as $A = U D U^T$, where $A$, $U$, $D$ contain only real numbers, $D$ is a block upper triangular matrix and $U$ is an orthogonal matrix. Applying the Schur decomposition to $H$, $H = U D U^T$, so $U^T H U = D$. Since $H$ is symmetric, $U^T H U$ is also symmetric; therefore $D$ is a symmetric (block) triangular matrix, which implies that $D$ is necessarily diagonal.

Comments on part 2

Part 2: The eigenvalues of $H$ are real.

Comments: If $H x = \lambda x$, then $H \bar x = \bar\lambda \bar x$, where the bar denotes the complex conjugate. Then $\lambda\, \bar x^T x = \bar x^T H x = \bar\lambda\, \bar x^T x$, and since $\bar x^T x > 0$ it follows that $\lambda = \bar\lambda$, i.e. $\lambda$ is real.
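A small NumPy sketch (my own illustration, using a hypothetical symmetric matrix) of this decomposition, checking that $U^T H U$ is diagonal, that $U$ is orthogonal, and that the eigenvalues are real:

```python
import numpy as np

H = np.array([[2.0, 1.0],
              [1.0, 4.0]])           # hypothetical real symmetric matrix

eigvals, U = np.linalg.eigh(H)       # eigh is specialized for symmetric matrices
Lambda = U.T @ H @ U                 # should equal diag(eigvals)

print(np.allclose(Lambda, np.diag(eigvals)))  # True: U^T H U is diagonal
print(np.allclose(U.T @ U, np.eye(2)))        # True: U is orthogonal
print(eigvals)                                # real eigenvalues of H
```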

Definition: principal minor

A minor of $A$ of order $k$ is principal if it is obtained by deleting $n - k$ rows and the $n - k$ columns with the same indices. For example, in a principal minor where rows 1 and 3 have been deleted, columns 1 and 3 must also be deleted. There are $\binom{n}{k}$ principal minors of order $k$.

Definition: leading principal minor

The leading principal minor of $A$ of order $k$ is the minor of order $k$ obtained by deleting the last $n - k$ rows and columns.

Let $A = \begin{pmatrix} a & b \\ b & c \end{pmatrix}$ be a symmetric $2 \times 2$ matrix. Then the leading principal minors are $D_1 = a$ and $D_2 = ac - b^2$. If we want to find all the principal minors, these are $\Delta_1 = a$ and $\Delta_1' = c$ (of order one) and $\Delta_2 = ac - b^2$ (of order two).

Properties of matrices

Theorem:
(a) If $H$ is positive semidefinite or positive definite, then $\det(H) \geq 0$ or $\det(H) > 0$, respectively.
(b) $H$ is positive definite if and only if all its leading principal minors are positive, i.e., $\det(H_i) > 0$ for $i = 1, 2, \dots, n$ (Sylvester's criterion).
(c) $H$ is positive semidefinite if and only if all its principal minors are nonnegative, i.e., $\det(H_i^{(l)}) \geq 0$ for all possible selections of $\{l_1, l_2, \dots, l_i\}$, for $i = 1, 2, \dots, n$.

Properties of matrices

(d) $H$ is negative definite if and only if all the leading principal minors of $-H$ are positive, i.e., $\det(-H_i) > 0$ for $i = 1, 2, \dots, n$.
(e) $H$ is negative semidefinite if and only if all the principal minors of $-H$ are nonnegative, i.e., $\det(-H_i^{(l)}) \geq 0$ for all possible selections of $\{l_1, l_2, \dots, l_i\}$, for $i = 1, 2, \dots, n$.
(f) $H$ is indefinite if neither (c) nor (e) holds.
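As a rough numerical sketch (my own, not from the slides), Sylvester's criterion in (b) can be checked with NumPy by computing the leading principal minors and comparing with the eigenvalue test:

```python
import numpy as np

def is_positive_definite_sylvester(H):
    # Sylvester's criterion: all leading principal minors are positive.
    n = H.shape[0]
    return all(np.linalg.det(H[:k, :k]) > 0 for k in range(1, n + 1))

H = np.array([[ 2.0, -1.0,  0.0],
              [-1.0,  2.0, -1.0],
              [ 0.0, -1.0,  2.0]])   # hypothetical symmetric matrix

print(is_positive_definite_sylvester(H))   # True
print(np.all(np.linalg.eigvalsh(H) > 0))   # True: agrees with the eigenvalue test
```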

Summary

If the Hessian is positive definite (all eigenvalues positive) at $x^*$, then $x^*$ is a local minimum. If the Hessian is negative definite (all eigenvalues negative) at $x^*$, then $x^*$ is a local maximum. If the Hessian has both positive and negative eigenvalues, then $x^*$ is a saddle point. Otherwise the test is inconclusive: at a local minimum (local maximum) the Hessian is positive semidefinite (negative semidefinite), but for positive semidefinite and negative semidefinite Hessians the test is inconclusive.

Approximation problem

Example 1. Suppose that through an experiment the value of a function $g$ is observed at $m$ points $x_1, x_2, \dots, x_m$, which means that the values $g(x_1), g(x_2), \dots, g(x_m)$ are known. We want to approximate the function by a polynomial
$$h(x) = a_0 + a_1 x + a_2 x^2 + \dots + a_n x^n$$
with $n < m$. The error at each observation point is
$$\epsilon_k = g(x_k) - h(x_k), \qquad k = 1, 2, \dots, m.$$
Then we obtain the following optimization problem:
$$\min_{a} \sum_{k=1}^{m} \epsilon_k^2.$$

Approximation problem

Let $\mathbf{x}_k = [1, x_k, x_k^2, \dots, x_k^n]^T$, $\mathbf{g} = [g(x_1), g(x_2), \dots, g(x_m)]^T$ and $X = [\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_m]^T$. Then
$$f(a) = \sum_{k=1}^{m} \epsilon_k^2 = \sum_{k=1}^{m} [g(x_k) - h(x_k)]^2 = \sum_{k=1}^{m} \left[g(x_k) - (a_0 + a_1 x_k + a_2 x_k^2 + \dots + a_n x_k^n)\right]^2$$
$$= \sum_{k=1}^{m} [g(x_k) - \mathbf{x}_k^T a]^2 = \|\mathbf{g} - X a\|_2^2 = a^T Q a - 2 b^T a + c,$$
with $Q = X^T X$, $b = X^T \mathbf{g}$ and $c = \|\mathbf{g}\|_2^2$.

Approximation problem

Then the problem is
$$\min_{a} f(a) = a^T Q a - 2 b^T a + c,$$
with $\mathbf{x}_k = [1, x_k, x_k^2, \dots, x_k^n]^T$, $\mathbf{g} = [g(x_1), g(x_2), \dots, g(x_m)]^T$, $X = [\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_m]^T$, $Q = X^T X$, $b = X^T \mathbf{g}$ and $c = \|\mathbf{g}\|_2^2$. Setting the gradient to zero,
$$\nabla_a f(a) = 2 Q a - 2 b = 0 \quad\Rightarrow\quad a^* = Q^{-1} b$$
(provided $Q$ is nonsingular).
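A minimal Python sketch of this least-squares fit (my own illustration; the data and the choice $g(x) = \sin(x)$ are hypothetical and assume NumPy is available):

```python
import numpy as np

# Hypothetical data: noisy observations of g(x) = sin(x) at m points.
rng = np.random.default_rng(0)
m, n = 20, 3                              # m observations, polynomial degree n < m
xs = np.linspace(0.0, 2.0, m)
g = np.sin(xs) + 0.01 * rng.standard_normal(m)

# Design matrix X with rows x_k = [1, x_k, x_k^2, ..., x_k^n].
X = np.vander(xs, n + 1, increasing=True)

# Normal equations: a* = Q^{-1} b with Q = X^T X, b = X^T g.
Q = X.T @ X
b = X.T @ g
a = np.linalg.solve(Q, b)                 # solve instead of explicitly inverting Q

print(a)                                  # fitted coefficients a_0, ..., a_n
print(np.linalg.norm(g - X @ a))          # residual norm ||g - X a||_2
```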

Approximation problem

Example 2. Given a continuous function $f(x)$ on $[a, b]$, find the approximating polynomial of degree $n$,
$$p(x) = a_0 + a_1 x + a_2 x^2 + \dots + a_n x^n,$$
that minimizes
$$\int_a^b [f(x) - p(x)]^2\, dx.$$

Maximum likelihood

Suppose there is a sample $x_1, x_2, \dots, x_n$ of $n$ independent and identically distributed observations coming from a distribution with an unknown probability density function $f_0(\cdot)$. Suppose that $f_0$ belongs to a certain family of distributions $\{f(\cdot \mid \theta),\ \theta \in \Theta\}$ (where $\theta$ is a vector of parameters for this family), called the parametric model, so that $f_0 = f(\cdot \mid \theta_0)$. The value $\theta_0$ is unknown and is referred to as the true value of the parameter vector. It is desirable to find an estimator $\hat\theta$ that is as close to the true value $\theta_0$ as possible. For example,
$$f(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}},$$
where $\theta = [\mu, \sigma]^T$.

Maximum likelihood method

In the method of maximum likelihood, one first specifies the joint density function of all observations. For an independent and identically distributed sample, this joint density function is
$$f(x_1, x_2, \dots, x_n \mid \theta) = f(x_1 \mid \theta)\, f(x_2 \mid \theta) \cdots f(x_n \mid \theta).$$
The observed values $x_1, x_2, \dots, x_n$ are fixed, whereas $\theta$ is the variable of this function. The function is called the likelihood:
$$L(x; \theta) = f(x \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta).$$
The problem is to maximize the log-likelihood, $\ell(x; \theta) = \log L(x; \theta)$. This method of estimation defines a maximum-likelihood estimator.

Maximum likelihood method: example

For the normal distribution $\mathcal{N}(\mu, \sigma^2)$, which has probability density function
$$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),$$
the corresponding joint density for a sample of $n$ independent identically distributed normal random variables (the likelihood) is
$$f(x_1, \dots, x_n \mid \mu, \sigma^2) = \prod_{i=1}^{n} f(x_i \mid \mu, \sigma^2) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\!\left(-\frac{\sum_{i=1}^{n}(x_i - \mu)^2}{2\sigma^2}\right).$$

Maximum likelihood method: example

The log-likelihood can be written as
$$\log L(\mu, \sigma) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2.$$
Setting its derivative with respect to $\mu$ to zero,
$$0 = \frac{\partial}{\partial\mu}\log L(\mu, \sigma) = \frac{2n(\bar x - \mu)}{2\sigma^2}, \tag{1}$$
which is solved by $\hat\mu = \bar x = \frac{1}{n}\sum_{i=1}^{n} x_i$. Proceeding similarly for $\sigma$, one obtains $\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat\mu)^2$.
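A minimal sketch (my own illustration, assuming NumPy and SciPy are available) that checks these closed-form estimators against a direct numerical maximization of the log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.5, size=1000)   # hypothetical i.i.d. normal sample

# Closed-form maximum-likelihood estimators derived above.
mu_hat = x.mean()
sigma2_hat = np.mean((x - mu_hat)**2)

# Numerical check: minimize the negative log-likelihood over (mu, sigma^2).
def neg_log_likelihood(theta):
    mu, sigma2 = theta
    n = x.size
    return 0.5 * n * np.log(2.0 * np.pi * sigma2) + np.sum((x - mu)**2) / (2.0 * sigma2)

res = minimize(neg_log_likelihood, x0=[0.0, 1.0],
               bounds=[(None, None), (1e-6, None)])

print(mu_hat, sigma2_hat)   # closed-form estimates
print(res.x)                # numerical estimates; should be very close
```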