IFT 6085 - Lecture 2: Basics of convex analysis and gradient descent

IFT 6085: Theoretical principles for deep learning (lecture of January 11, 2018)

This version of the notes has not yet been thoroughly checked. Please report any bugs to the scribes or instructor.

Scribes: Assya Trofimov, Mohammad Pezeshki, Reyhane Askari
Instructor: Ioannis Mitliagkas

1 Summary

In this first lecture we cover some optimization basics around the following themes:

- Lipschitz continuity
- Some notions and definitions for convexity
- Smoothness and strong convexity
- Gradient descent

Note: Most of this lecture has been adapted from [1].

2 Introduction

In this section we introduce the basic concepts of optimization. The gradient descent algorithm is the workhorse of machine learning. It generally has two equivalent interpretations:

- going downhill
- local minimization of a function

Definition 1 (Lipschitz continuity). A function $f(x)$ is $L$-Lipschitz if
$$|f(x) - f(y)| \leq L \, \|x - y\|.$$
Intuitively, the Lipschitz constant $L$ measures how steep the function can get (Figure 1).

Figure 1: Lipschitz constant
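To make the definition concrete, here is a small numerical sketch (not part of the original notes): it estimates a lower bound on the Lipschitz constant of a one-dimensional function by sampling pairs of points and taking the steepest observed slope. The function, interval, and sample size are arbitrary choices for illustration.

```python
import numpy as np

def lipschitz_lower_bound(f, lo=-3.0, hi=3.0, n=2000, seed=0):
    """Estimate a lower bound on the Lipschitz constant of f on [lo, hi]
    by sampling random pairs (x, y) and taking the steepest observed slope."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(lo, hi, size=n)
    y = rng.uniform(lo, hi, size=n)
    return np.max(np.abs(f(x) - f(y)) / np.abs(x - y))

# |sin'(x)| <= 1 everywhere, so the estimate approaches 1 from below.
print(lipschitz_lower_bound(np.sin))
```

Any finite sample only certifies a lower bound; the Lipschitz constant $L$ is the supremum of these slopes over all pairs of points.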

This also implies that the derivative of the function cannot exceed $L$. Writing
$$f'(x) = \lim_{\delta \to 0} \frac{f(x + \delta) - f(x)}{\delta},$$
we have
$$f'(x) = \lim_{y \to x} \frac{f(x) - f(y)}{x - y} \leq \lim_{y \to x} \frac{|f(x) - f(y)|}{|x - y|} \leq L.$$
As a consequence, $L$-Lipschitz continuity implies that $f'(x)$ is bounded by $L$:
$$|f'(x)| \leq L.$$
Lipschitz continuity can be seen as a refinement of continuity.

Example:
$$f(x) = \begin{cases} \exp(-\lambda x), & \text{if } x > 0 \\ 1, & \text{otherwise.} \end{cases}$$
Note that $f(x)$ is $L$-Lipschitz (with $L$ depending on $\lambda$). As the value of $\lambda$ increases, the function gets closer to being discontinuous (Figure 2).

Figure 2: As the value of $\lambda$ increases, the function gets closer to being discontinuous

3 Convexity

Let us first look at the definition of convexity for a set.

Definition 2 (Convex set). A set $X$ is convex if, for any two points $x, y \in X$, the line segment between them lies within the set (Figure 3 A). That is, for all $\theta \in [0, 1]$ and $z = \theta x + (1 - \theta) y$, we have $z \in X$.

When the parameter $\theta$ is equal to 1 we get $x$, and when $\theta$ is 0 we get $y$. In contrast, a non-convex set is a set where $z$ may lie outside of the set (Figure 3 B).

Figure 3: (A) Convex set and (B) Non-convex set

Definition 3 (Convex linear combination). The sum $\theta x + (1 - \theta) y$ with $\theta \in [0, 1]$ is called a convex linear combination of $x$ and $y$.
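As a quick illustration (again not from the notes), the sketch below checks Definition 2 on two sets in the plane: a disc of radius 2, which is convex, and the annulus $1 \leq \|z\| \leq 2$, which is not. The particular sets and sample points are assumptions made for this example.

```python
import numpy as np

def in_disc(z):      # disc of radius 2: a convex set
    return np.linalg.norm(z) <= 2.0

def in_annulus(z):   # ring 1 <= ||z|| <= 2: a non-convex set
    return 1.0 <= np.linalg.norm(z) <= 2.0

x = np.array([ 1.5, 0.0])   # lies in both sets
y = np.array([-1.5, 0.0])   # lies in both sets

for theta in np.linspace(0.0, 1.0, 5):
    z = theta * x + (1.0 - theta) * y        # convex combination of x and y
    print(f"theta={theta:.2f}  disc: {in_disc(z)}  annulus: {in_annulus(z)}")
# Every z stays in the disc, but intermediate values of theta leave the annulus.
```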

We can apply the definition of convexity to functions.

Definition 4 (Convex function). A function $f(x)$ is convex if the following holds:

- The domain of $f$ is a convex set.
- For any two members of the domain, the function value at a convex combination does not exceed the convex combination of the values:
$$\forall x, y \in \mathrm{dom}(f), \; \forall \theta \in [0, 1]: \quad f(\theta x + (1 - \theta) y) \leq \theta f(x) + (1 - \theta) f(y).$$

Another way to express this is to check the line segment connecting $x$ and $y$ (the chord): if the chord lies above the function itself (Figure 4), the function is convex.

Figure 4: Example of convex and non-convex functions

Moreover, for differentiable or twice differentiable functions, it is possible to define convexity with the following first and second order conditions.

Definition 5 (First order condition for convexity). $f(x)$ is convex if and only if $\mathrm{dom}(f)$ is convex and the following holds for all $x, y \in \mathrm{dom}(f)$:
$$f(y) \geq f(x) + \nabla f(x)^\top (y - x).$$
In other words, the function should be lower bounded by all its tangents. In Figure 5, part of the non-convex function lies below the tangent at point $x$; this is not the case for the convex function. A convex function is therefore lower bounded by its tangent at any point.

As a reminder, the Hessian is a measure of curvature; it is the multivariate generalization of the second derivative. Indeed, for the function $f(x) = \frac{1}{2} h x^2$, the second derivative is $f''(x) = h$, which measures how quickly the slope of the function changes. A multivariate quadratic can be written as
$$f(x) = \frac{1}{2} x^\top H x,$$
where $H$ is the Hessian. Curvature along the eigenvectors of the Hessian is given by the corresponding eigenvalues:
$$H = Q \Lambda Q^\top, \qquad \Lambda = \mathrm{diag}(h_1, h_2, \dots, h_d).$$
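The following numpy sketch (an added illustration with an arbitrarily chosen matrix $H$, not part of the notes) builds the quadratic $f(x) = \frac{1}{2} x^\top H x$, eigendecomposes its Hessian, and checks that the curvature along each eigenvector $q_i$ equals the corresponding eigenvalue $h_i$.

```python
import numpy as np

H = np.array([[3.0, 1.0],
              [1.0, 2.0]])              # symmetric Hessian (example choice)
f = lambda x: 0.5 * x @ H @ x           # quadratic f(x) = 1/2 x^T H x

eigvals, Q = np.linalg.eigh(H)          # H = Q diag(eigvals) Q^T
for h_i, q_i in zip(eigvals, Q.T):      # columns of Q are the eigenvectors q_i
    # Curvature of f along the unit vector q_i is q_i^T H q_i = 2 f(q_i).
    print(f"eigenvalue {h_i:.4f}   curvature along q_i {2.0 * f(q_i):.4f}")
```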

Figure 5: Example of a convex and a non-convex function relative to the tangent at point $x$

Figure 6: Looking along the principal directions of the quadratic, we reach higher values more quickly along $q_1$; this means the curvature is higher along $q_1$

Changing the basis with $Q$, we decompose the matrix and focus on the directions described by $Q = [q_1, q_2, \dots, q_d]$. Along the direction of $q_i$, the curvature is $h_i$ (Figure 6). Note that $[h_1, h_2, \dots, h_d]$ are sorted in order of magnitude.

If the function is twice differentiable, another definition of convexity applies.

Definition 6 (Second order condition for convexity). A twice differentiable function $f$ is convex if and only if
$$\nabla^2 f(x) \succeq 0 \quad \text{for all } x \in \mathrm{dom}(f).$$
This means that the Hessian needs to be positive semi-definite; in other words, its eigenvalues need to be nonnegative.

Note: All the definitions of convexity are equivalent when the right level of differentiability holds.

4 Smoothness and Strong Convexity

Definition 7 (Smoothness). A function $f(x)$ is $\beta$-smooth if the following holds for all $x, y \in \mathrm{dom}(f)$:
$$\|\nabla f(x) - \nabla f(y)\| \leq \beta \, \|x - y\|. \tag{1}$$
Note that $\beta$-smoothness of $f(x)$ is equivalent to $\beta$-Lipschitz continuity of $\nabla f(x)$. The smoothness constraint requires that the gradient of $f(x)$ does not change too rapidly.
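A minimal numerical check, under the assumption that $f(x) = \frac{1}{2} x^\top H x$ for a fixed matrix $H$ (an example chosen for illustration, not from the notes): the gradient is $\nabla f(x) = H x$, so $\|\nabla f(x) - \nabla f(y)\| = \|H(x - y)\| \leq \lambda_{\max}(H) \, \|x - y\|$, i.e. this quadratic is $\beta$-smooth with $\beta = \lambda_{\max}(H)$. The sketch verifies the inequality on random pairs of points.

```python
import numpy as np

H = np.array([[3.0, 1.0],
              [1.0, 2.0]])
grad = lambda x: H @ x                       # gradient of f(x) = 1/2 x^T H x
beta = np.linalg.eigvalsh(H).max()           # smoothness constant for this quadratic

rng = np.random.default_rng(0)
ratios = []
for _ in range(1000):
    x, y = rng.normal(size=2), rng.normal(size=2)
    ratios.append(np.linalg.norm(grad(x) - grad(y)) / np.linalg.norm(x - y))

print(max(ratios), "<=", beta)               # the observed ratio never exceeds beta
```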

Definition 8 (Strong convexity). A function $f(x)$ is $\alpha$-strongly convex if $f(x) - \frac{\alpha}{2} \|x\|^2$ is convex. If $f(x)$ is twice differentiable and $\alpha$-strongly convex, then the following holds:
$$\nabla^2 f(x) \succeq \alpha I \quad \Longleftrightarrow \quad \nabla^2 f(x) - \alpha I \succeq 0. \tag{2}$$
Informally, this means that the curvature of $f(x)$ is not very close to zero. For instance, in the 1-D case, $f(x) = \frac{h}{2} x^2$ is $h$-strongly convex but not $(h + \epsilon)$-strongly convex for any $\epsilon > 0$. Figure 7 illustrates two convex functions, of which only one is strongly convex.

Figure 7: (a) A convex function which is also strongly convex. (b) A convex function which is not strongly convex.

5 Gradient Descent

Gradient descent is an optimization algorithm based on the fact that a function $f(x)$ decreases fastest in the direction of the negative gradient of $f(x)$ at the current point. Consequently, starting from a guess $x_0$ for a local minimum of $f(x)$, the sequence $x_0, x_1, \dots, x_t \in \mathbb{R}^d$ is generated using the following rule:
$$x_{k+1} = x_k - \gamma \nabla f(x_k), \tag{3}$$
in which $\gamma$ is called the step size or the learning rate. If $f(x)$ is convex and $\gamma$ decays at the right rate, it is guaranteed that $x_k \to x^*$ as $k \to \infty$, where the optimum $x^*$ is
$$x^* = \operatorname*{argmin}_{x \in \mathrm{dom}(f)} f(x). \tag{4}$$
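To connect the update rule (3) with code, here is a minimal gradient descent sketch (not part of the original notes). The objective, starting point, step size, and number of steps are assumptions made for illustration; on the quadratic below the unique minimizer is $x^* = 0$, so the distance of the last iterate to $x^*$ should shrink.

```python
import numpy as np

def gradient_descent(grad_f, x0, step_size, num_steps):
    """Iterate x_{k+1} = x_k - gamma * grad_f(x_k) and return all iterates."""
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(num_steps):
        xs.append(xs[-1] - step_size * grad_f(xs[-1]))
    return xs

# Example objective: f(x) = 1/2 x^T H x, whose unique minimizer is x* = 0.
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])
iterates = gradient_descent(grad_f=lambda x: H @ x,
                            x0=[2.0, -1.0], step_size=0.1, num_steps=100)
print(np.linalg.norm(iterates[-1]))     # distance to x*; shrinks toward 0
```

Theorem 1 below quantifies this kind of convergence for the averaged iterate when $f$ is convex and $L$-Lipschitz.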

Lemma 1. From the $L$-Lipschitz assumption, the following holds:
$$\|\nabla f(x_k)\| \leq L. \tag{5}$$
This lemma is used in the proof of the following theorem.

Theorem 1 (Gradient descent guarantee). Let $f(x)$ be convex and $L$-Lipschitz. If $T$ is the total number of steps taken and the learning rate is chosen as
$$\gamma = \frac{\|x_1 - x^*\|}{L \sqrt{T}}, \tag{6}$$
then the following holds:
$$f\left(\frac{1}{T} \sum_{k=1}^{T} x_k\right) - f(x^*) \leq \frac{\|x_1 - x^*\| \, L}{\sqrt{T}}. \tag{7}$$

Proof. By applying the Taylor expansion of $f(x)$ at the point $x_k$ (equivalently, the first order condition for convexity), we have
$$f(x_k) - f(x^*) \leq \langle \nabla f(x_k), \, x_k - x^* \rangle \tag{8}$$
$$= \frac{1}{\gamma} \langle x_k - x_{k+1}, \, x_k - x^* \rangle \tag{9}$$
$$= \frac{1}{2\gamma} \left( \|x_k - x^*\|^2 - \|x_{k+1} - x^*\|^2 + \gamma^2 \|\nabla f(x_k)\|^2 \right). \tag{10}$$

From Equation (10) and Lemma 1, the following holds:
$$f(x_k) - f(x^*) \leq \frac{1}{2\gamma} \left( \|x_k - x^*\|^2 - \|x_{k+1} - x^*\|^2 \right) + \frac{\gamma}{2} L^2. \tag{11}$$

By the change of variable $\|x_k - x^*\|^2 \to D_k^2$:
$$f(x_1) - f(x^*) \leq \frac{1}{2\gamma} \left[ D_1^2 - D_2^2 \right] + \frac{\gamma}{2} L^2$$
$$f(x_2) - f(x^*) \leq \frac{1}{2\gamma} \left[ D_2^2 - D_3^2 \right] + \frac{\gamma}{2} L^2$$
$$\vdots$$
$$f(x_T) - f(x^*) \leq \frac{1}{2\gamma} \left[ D_T^2 - D_{T+1}^2 \right] + \frac{\gamma}{2} L^2 \leq \frac{1}{2\gamma} D_T^2 + \frac{\gamma}{2} L^2. \tag{12}$$

Summing all the equations, most terms cancel; this is known as a telescoping sum. We get:
$$\sum_{k=1}^{T} \left( f(x_k) - f(x^*) \right) \leq \frac{1}{2\gamma} D_1^2 + \frac{\gamma T}{2} L^2 \tag{13}$$
$$\frac{1}{T} \sum_{k=1}^{T} f(x_k) - f(x^*) \leq \frac{D_1^2}{2\gamma T} + \frac{\gamma}{2} L^2. \tag{14}$$

From convexity of $f(x)$ we know:
$$f(\theta x + (1 - \theta) y) \leq \theta f(x) + (1 - \theta) f(y), \tag{15}$$
which by Jensen's inequality gives $f\left(\frac{1}{T} \sum_{k=1}^{T} x_k\right) \leq \frac{1}{T} \sum_{k=1}^{T} f(x_k)$. So from Equations (14) and (15) the following holds:
$$f\left(\frac{1}{T} \sum_{k=1}^{T} x_k\right) - f(x^*) \leq \frac{D_1^2}{2\gamma T} + \frac{\gamma}{2} L^2. \tag{16}$$

Thus, if we set $\gamma = \frac{\|x_1 - x^*\|}{L \sqrt{T}}$, we get:
$$f\left(\frac{1}{T} \sum_{k=1}^{T} x_k\right) - f(x^*) \leq \frac{\|x_1 - x^*\| \, L}{\sqrt{T}}. \tag{17}$$

References

[1] Bubeck, Sébastien. "Convex optimization: Algorithms and complexity." Foundations and Trends in Machine Learning 8.3-4 (2015): 231-357.