Gradient descents and inner products


Gradient descent and inner products
Outline: Gradient (definition); Usual inner product (L²); Natural inner products; Other inner products; Examples.

Gradient (definition)
Energy E depending on a variable f (a vector or a function); what is the gradient ∇E(f)?
Directional derivative: consider a variation v of f:
δE(f)(v) := lim_{ε→0} ( E(f + εv) − E(f) ) / ε
Hopefully δE(f) is linear and continuous with respect to v. The Riesz representation theorem then gives the gradient: the variation ∇E(f) such that, for all v,
δE(f)(v) = ⟨∇E(f) | v⟩_f
Usual choice of inner product: L²(f), which yields the usual L² gradient.
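
As a quick illustration in the finite-dimensional (Euclidean / L²) case, the sketch below checks numerically that the directional derivative matches ⟨∇E(f), v⟩; the quadratic energy E and the vectors f and v are arbitrary choices made only for this example.

```python
import numpy as np

# Quick check in the finite-dimensional (Euclidean / L2) case:
# the finite-difference directional derivative of E at f in direction v
# agrees with <grad E(f), v>.  E, f, v are arbitrary toy choices.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)

def E(f):
    r = A @ f - b
    return 0.5 * r @ r              # E(f) = 0.5 ||A f - b||^2

def grad_E(f):
    return A.T @ (A @ f - b)        # Euclidean (L2) gradient of E

f = rng.standard_normal(3)
v = rng.standard_normal(3)
eps = 1e-6

dE_fd = (E(f + eps * v) - E(f)) / eps   # directional derivative, finite differences
dE_ip = grad_E(f) @ v                   # <grad E(f), v>
print(dE_fd, dE_ip)                     # the two values agree up to O(eps)
```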

Gradient Descent Scheme
Build a minimizing path:
f(t = 0) = f_0,  ∂f/∂t = −∇_P^f E(f)
Changing the inner product P changes the minimizing path:
−∇_P^f E(f) = arg min_v { δE(f)(v) + ½ ‖v‖²_P }
P acts as a prior on the minimizing flow.
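
A minimal finite-dimensional sketch of this statement: with ⟨u, v⟩_P = uᵀPv, the minimizer of δE(f)(v) + ½‖v‖²_P is −P⁻¹∇E(f), i.e. a preconditioned descent direction. The energy and metric below are toy choices, not taken from the slides.

```python
import numpy as np

# Finite-dimensional sketch: with the inner product <u, v>_P = u^T P v,
#   argmin_v { dE(f)(v) + 0.5 ||v||_P^2 }  =  -P^{-1} grad E(f),
# i.e. preconditioned gradient descent.  E and P are toy choices.
rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))
b = rng.standard_normal(6)
P = np.diag([100.0, 1.0, 1.0, 1.0])      # metric: moving the 1st coordinate is "expensive"

def E(f):
    r = A @ f - b
    return 0.5 * r @ r

def grad_E(f):
    return A.T @ (A @ f - b)

def model(f, v):
    # local model: directional derivative plus half the squared P-norm of the step
    return grad_E(f) @ v + 0.5 * v @ P @ v

f = rng.standard_normal(4)
v_star = -np.linalg.solve(P, grad_E(f))  # descent direction w.r.t. the P inner product

# v_star is the exact minimizer of the local model ...
for _ in range(5):
    assert model(f, v_star) <= model(f, rng.standard_normal(4))

# ... and a small step along it decreases the energy.
print(E(f), E(f + 0.01 * v_star))
```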

Example 1: vectors, lists of parameters
E(α) = E(α_1, α_2, α_3, ..., α_n)
Usual L² gradient: ∇E(α) = Σ_i ∂_i E(α_1, α_2, ..., α_n) e_i (supposes the α_i are independent)
Should all parameters α_i have the same weight? Consider a different parameterization β = P(α):
β_1 = 10 α_1 and β_i = α_i for i ≠ 1,  so F(β) = E(P⁻¹(β)) = E(0.1 β_1, β_2, ..., β_n)
For i ≠ 1: ∂_{β_i} F(β) = ∂_{α_i} E(α), while ∂_{β_1} F(β) = 0.1 ∂_{α_1} E(α), evaluated at α = P⁻¹(β).
Gradient ascent with respect to β: δβ_1 = ∂_{β_1} F(β), hence δβ_1 = 0.1 ∂_{α_1} E(α).
Gradient ascent with respect to α: δα_1 = ∂_{α_1} E(α), hence δβ_1 = 10 δα_1 = 10 ∂_{α_1} E(α).
Difference between the two approaches: a factor 100 on the first parameter!
Conclusion: the L² inner product is a bad choice (it is not intrinsic).
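
A small numerical illustration of this factor-100 discrepancy, using a toy energy chosen only for the example:

```python
import numpy as np

# Numerical illustration of the reparameterization effect (toy energy).
def E(alpha):
    return float(np.sum(alpha ** 2))    # E(alpha) = sum_i alpha_i^2

def grad_E(alpha):
    return 2.0 * alpha

def grad_F(beta):
    # F(beta) = E(0.1 * beta_1, beta_2, ..., beta_n); chain rule on the 1st coordinate
    alpha = beta.copy()
    alpha[0] = 0.1 * beta[0]
    g = grad_E(alpha)
    g[0] *= 0.1
    return g

alpha = np.array([1.0, 2.0, 3.0])
beta = alpha.copy()
beta[0] = 10.0 * alpha[0]               # same point, expressed in the beta parameterization

step_beta = grad_F(beta)[0]             # move of beta_1 when ascending F w.r.t. beta
step_alpha = 10.0 * grad_E(alpha)[0]    # move of beta_1 induced by ascending E w.r.t. alpha
print(step_beta, step_alpha, step_alpha / step_beta)   # ratio: 100
```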

Example 2: (not specific to) kernels and intrinsic gradients
f = Σ_i α_i k(x_i, ·),  E(f) = Σ_i |f(x_i) − y_i|² + ‖f‖²_H
The L² gradient with respect to α: not the best idea.
Choose a natural, intrinsic inner product instead:
⟨δα_a | δα_b⟩_P := ⟨δf_a | δf_b⟩_H = ⟨Df(δα_a) | Df(δα_b)⟩_H = δα_aᵗ (ᵗDf Df) δα_b
In the case of kernels this gives: ⟨δα_a | δα_b⟩_P = δα_aᵗ K δα_b
Corresponding gradient: ∇_P^α E = (ᵗDf Df)⁻¹ ( P_adm( ∇_H^f E ) )
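
A hedged sketch of the kernel case: with the metric ⟨δα_a, δα_b⟩ = δα_aᵀ K δα_b, the descent direction on the coefficients is K⁻¹ times the plain gradient of E with respect to α. The Gaussian kernel, the data, the regularization weight lam, and the step-size rule below are illustrative assumptions, not taken from the slides.

```python
import numpy as np

# Sketch: kernel least squares with f = sum_i alpha_i k(x_i, .),
#   E(alpha) = ||K alpha - y||^2 + lam * alpha^T K alpha.
# With the metric <u, v> = u^T K v on the coefficients, the descent
# direction becomes K^{-1} grad_alpha E.  Kernel, data, lam and the
# step-size rule are illustrative assumptions.
rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 30)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(30)
lam = 1e-2

K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.1 ** 2)  # Gaussian kernel matrix
K += 1e-8 * np.eye(len(x))                                    # numerical jitter

def grad_alpha(alpha):
    # plain gradient of E with respect to the coefficients alpha
    return 2.0 * K @ (K @ alpha - y) + 2.0 * lam * K @ alpha

alpha = np.zeros(len(x))
step = 0.4 / (np.linalg.norm(K, 2) + lam)         # conservative step size
for _ in range(500):
    v = -np.linalg.solve(K, grad_alpha(alpha))    # descent direction in the K metric
    alpha = alpha + step * v

print(np.linalg.norm(K @ alpha - y))              # residual of the fitted values
```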

Example 3: parameterized shapes
The same as for kernels. Choose any representation for shapes (splines, polygons, level sets, ...).
Intrinsic gradient ⇒ as close as possible to the real evolution, as independent as possible of the representation/parameterization.

Example 4: shapes, contours
Curve f; deformation field v, in the tangent space of f: f + v.
Usual tangent space: L²(f), with ⟨u | v⟩_{L²(f)} = ∫_f u(x) · v(x) df(x)

Examples of inner products for shapes (or functions)
L² inner product: ⟨u | v⟩_{L²(f)} = ∫_f u(x) · v(x) df(x)
H¹ inner product: ⟨u | v⟩_{H¹} = ⟨u | v⟩_{L²(f)} + ⟨∂_x u | ∂_x v⟩_{L²(f)}
Interesting property of the H¹ gradient: ∇_{H¹}E = arg inf_u { ‖u − ∇_{L²}E‖²_{L²} + ‖∂_x u‖²_{L²} }
Set S of preferred transformations (e.g. rigid motions); projection onto S: Q; projection orthogonal to S: R (with Q + R = Id):
⟨u | v⟩_S = ⟨Q(u) | Q(v)⟩_{L²} + α ⟨R(u) | R(v)⟩_{L²}
Example: minimizing the Hausdorff distance between two curves, ∂f/∂t = −∇_f d_H(f, f_2). (Figures: the usual flow vs. the rigidified flow.)
Changing one inner product for another amounts to a linear, symmetric, positive definite transformation of the gradient.
Gaussian smoothing of the L² gradient is such a symmetric positive definite transformation.
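
A minimal sketch of the H¹ case on a discretized closed curve: the H¹ gradient is obtained from the L² gradient by solving (Id − ∂²_x) ∇_{H¹}E = ∇_{L²}E, which is the Euler-Lagrange equation of the arg inf above. The noisy "L² gradient" field below is made up purely for illustration.

```python
import numpy as np

# Sketch: H^1 (Sobolev) gradient on a uniformly sampled closed curve.
# The H^1 gradient solves (Id - d^2/dx^2) g_h1 = g_l2, the Euler-Lagrange
# equation of  arginf_u ||u - g_l2||^2 + ||du/dx||^2.
# The "L^2 gradient" g_l2 is a made-up noisy field, purely for illustration.
n = 200
s = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
h = s[1] - s[0]
rng = np.random.default_rng(3)
g_l2 = np.sin(3 * s) + 0.5 * rng.standard_normal(n)

# Periodic second-difference operator (the curve is closed).
I = np.eye(n)
D2 = (np.roll(I, 1, axis=1) - 2.0 * I + np.roll(I, -1, axis=1)) / h ** 2

g_h1 = np.linalg.solve(I - D2, g_l2)     # smoothed H^1 gradient

# The H^1 gradient is much smoother, and -g_h1 is still a descent direction
# for the same energy: <g_h1, g_l2> > 0 because (Id - D2) is positive definite.
print(float(g_h1 @ g_l2))
print(float(np.abs(np.diff(g_l2)).max()), float(np.abs(np.diff(g_h1)).max()))
```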

Extension to non-linear criteria
−∇_P^f E(f) = arg min_v { δE(f)(v) + ½ ‖v‖²_P }
can be generalized by replacing the quadratic term with any suitable penalty R_P:
−∇_P^f E(f) = arg min_v { δE(f)(v) + R_P(v) }
Example: semi-local rigidification.

Remark on Taylor series and Newton's method
−∇_P^f E(f) = arg min_v { δE(f)(v) + ½ ‖v‖²_P }
Different approximations of E(f + v) give different gradient descents:
arg min_v { E(f) + δE(f)(v) + ½ ‖v‖²_P } = −∇_P^f E(f)   (1st-order term + a chosen metric for the quadratic term)
arg min_v { E(f) + δE(f)(v) + ½ δ²E(f)(v)(v) }   (the quadratic term is the true 2nd-order term: Newton's method)
arg min_v { E(f) + δE(f)(v) + R_P(v) }   (1st-order term + any suitable criterion)
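
To make the comparison concrete, here is a toy sketch (an assumed 2-D quadratic energy, not from the slides) contrasting the step obtained with a fixed metric P with the Newton step, which is the special case where the metric is the second derivative δ²E:

```python
import numpy as np

# Toy comparison (assumed 2-D quadratic energy, not from the slides):
#  (a) argmin_v { dE(f)(v) + 0.5 v^T P v }     = -P^{-1}    grad E(f)   (fixed metric P)
#  (b) argmin_v { dE(f)(v) + 0.5 v^T H(f) v }  = -H(f)^{-1} grad E(f)   (Newton step)
# Newton's method is the special case where the metric is the second derivative.

def E(f):
    x, y = f
    return (x - 1.0) ** 2 + 10.0 * (y + 2.0) ** 2   # anisotropic quadratic bowl

def grad_E(f):
    x, y = f
    return np.array([2.0 * (x - 1.0), 20.0 * (y + 2.0)])

def hess_E(f):
    return np.diag([2.0, 20.0])

f = np.array([5.0, 5.0])
P = np.eye(2)                                       # plain L2 metric

v_l2 = -np.linalg.solve(P, grad_E(f))               # steepest descent direction
v_newton = -np.linalg.solve(hess_E(f), grad_E(f))   # Newton direction

print(E(f))                   # energy at the starting point
print(E(f + 0.05 * v_l2))     # small L2 step: decreases, but slowly on this bowl
print(E(f + v_newton))        # full Newton step reaches the minimum of the quadratic: 0.0
```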

(End of the parenthesis)