Numerical Optimization Techniques


Numerical Optimization Techniques Léon Bottou NEC Labs America COS 424 3/2/2010

Today's Agenda
- Goals: classification, clustering, regression, other.
- Representation: parametric vs. kernels vs. nonparametric; probabilistic vs. nonprobabilistic; linear vs. nonlinear; deep vs. shallow.
- Capacity control: explicit (architecture, feature selection; regularization, priors) and implicit (approximate optimization; Bayesian averaging, ensembles).
- Operational considerations: loss functions; budget constraints; online vs. offline.
- Computational considerations: exact algorithms for small datasets; stochastic algorithms for big datasets; parallel algorithms.

Introduction
General scheme:
- Set a goal.
- Define a parametric model.
- Choose a suitable loss function.
- Choose suitable capacity control methods.
- Optimize the average loss over the training set.
Optimization:
- Sometimes analytic (e.g. linear model with squared loss).
- Usually numerical (e.g. everything else).

Summary
1. Convex vs. nonconvex
2. Differentiable vs. nondifferentiable
3. Constrained vs. unconstrained
4. Line search
5. Gradient descent
6. Hessian matrix, etc.
7. Stochastic optimization

Convex
Definition: for all x, y and all 0 ≤ λ ≤ 1,
    f(λx + (1−λ)y) ≤ λ f(x) + (1−λ) f(y).
Property: any local minimum is a global minimum.
Conclusion: optimization algorithms are easy to use; they always return the same solution.
Example: linear model with a convex loss function.
- Curve fitting with mean squared error.
- Linear classification with log loss or hinge loss.
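As a quick numerical sanity check of this definition (a sketch of mine, not from the slides; the data and sample counts are arbitrary), one can sample random segments in weight space and verify the convexity inequality for the mean squared error of a linear model:

```python
# Verify f(lam*w1 + (1-lam)*w2) <= lam*f(w1) + (1-lam)*f(w2) on random segments
# for the mean squared error of a linear model (convex in w).
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(50, 5)), rng.normal(size=50)

def f(w):
    return 0.5 * np.mean((X @ w - y) ** 2)

for _ in range(1000):
    w1, w2 = rng.normal(size=5), rng.normal(size=5)
    lam = rng.uniform()
    assert f(lam * w1 + (1 - lam) * w2) <= lam * f(w1) + (1 - lam) * f(w2) + 1e-9
print("convexity inequality holds on all sampled segments")
```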

Nonconvex
Landscape: local minima, saddle points, plateaux, ravines, etc.
Optimization algorithms:
- Usually find local minima.
- There are good and bad local minima.
- Results depend on subtle details.
Examples: multilayer networks, clustering algorithms, learning features, semi-supervised learning, mixture models, Hidden Markov Models, selecting features (some), transfer learning.

Differentiable vs. Nondifferentiable
The derivative tells us locally which way to move; without derivatives there are no such local cues.
- Derivatives may not exist.
- Derivatives may be too costly to compute.
Example: log loss (smooth) versus hinge loss (nondifferentiable at the hinge).

Constrained vs. Unconstrained
Compare the constrained problem
    min_w f(w)  subject to  ‖w‖² < C
with the unconstrained, penalized problem
    min_w f(w) + λ ‖w‖².
Constraints: adding constraints leads to very different algorithms.
Keywords: Lagrange coefficients; Karush-Kuhn-Tucker theorem; primal optimization, dual optimization.
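To make the comparison concrete, here is a small SciPy sketch (my own illustration, not from the slides) that solves a tiny instance of both formulations; the objective, the constant C, and the penalty weight are assumptions chosen so that the two solutions coincide:

```python
# Constrained form: min f(w) subject to ||w||^2 <= C (solved with SLSQP).
# Penalized form:   min f(w) + lam * ||w||^2 (unconstrained).
import numpy as np
from scipy.optimize import minimize

f = lambda w: (w[0] - 2) ** 2 + (w[1] - 1) ** 2
C = 1.0

cons = {"type": "ineq", "fun": lambda w: C - w @ w}        # feasible iff ||w||^2 <= C
w_con = minimize(f, x0=np.zeros(2), constraints=[cons]).x

lam = np.sqrt(5) - 1                                       # penalty weight matching this C
w_pen = minimize(lambda w: f(w) + lam * w @ w, x0=np.zeros(2)).x

print(w_con, w_pen)    # both approach (2, 1)/sqrt(5): same solution, two formulations
```

The penalty weight plays the role of the Lagrange coefficient attached to the constraint, which is exactly what the Karush-Kuhn-Tucker conditions formalize.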

Line search - Bracketing a minimum
Three points a < b < c such that f(b) < f(a) and f(b) < f(c).

Line search - Refining the bracket
Split the largest half of the bracket and compute f(x) at the new point x.

Line search - Refining the bracket
Redefine a < b < c; here a ← x. Split the largest half and compute f(x).

Line search - Refining the bracket
Redefine a < b < c; here a ← b, b ← x. Split the largest half and compute f(x).

Line search - Refining the bracket
Redefine a < b < c; here c ← x. Split the largest half and compute f(x).

Line search - Golden Section Algorithm
Optimal improvement by splitting at the golden ratio.
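A minimal sketch of the golden-section idea (illustrative code of mine, not from the slides; the test function and tolerance are arbitrary): repeatedly shrink the bracket by evaluating f at golden-ratio points, reusing one interior point at every step.

```python
import math

def golden_section(f, a, c, tol=1e-8):
    """Minimize f over the bracket [a, c], assuming a single interior minimum."""
    invphi = (math.sqrt(5) - 1) / 2          # 1/golden ratio, about 0.618
    x1 = c - invphi * (c - a)
    x2 = a + invphi * (c - a)
    f1, f2 = f(x1), f(x2)
    while c - a > tol:
        if f1 < f2:                          # minimum lies in [a, x2]
            c, x2, f2 = x2, x1, f1
            x1 = c - invphi * (c - a)
            f1 = f(x1)
        else:                                # minimum lies in [x1, c]
            a, x1, f1 = x1, x2, f2
            x2 = a + invphi * (c - a)
            f2 = f(x2)
    return (a + c) / 2

print(golden_section(lambda x: (x - 1.3) ** 2, 0.0, 3.0))   # about 1.3
```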

Line search - Parabolic Interpolation
Fitting a parabola can give a much better guess.

Line search - Parabolic Interpolation
Fitting a parabola sometimes gives a much better guess.
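The parabolic step itself is a one-line formula (sketch below in the classical Numerical Recipes form; the helper name and test values are mine): fit a parabola through the three bracket points and jump to the abscissa of its vertex.

```python
def parabolic_step(a, fa, b, fb, c, fc):
    """Vertex of the parabola through (a, fa), (b, fb), (c, fc)."""
    num = (b - a) ** 2 * (fb - fc) - (b - c) ** 2 * (fb - fa)
    den = (b - a) * (fb - fc) - (b - c) * (fb - fa)
    return b - 0.5 * num / den               # caller must guard against den == 0

# On an exactly parabolic function the minimizer is recovered in a single step.
f = lambda x: (x - 1.3) ** 2 + 0.7
print(parabolic_step(0.0, f(0.0), 1.0, f(1.0), 3.0, f(3.0)))   # 1.3
```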

Line search - Brent Algorithm
Brent's algorithm for line search alternates golden section and parabolic interpolation steps:
- No more than twice slower than golden section alone.
- No more than twice slower than parabolic interpolation alone.
- In practice, almost as good as the better of the two.
Variants with derivatives:
- Improvements if we can compute f(x) and f'(x) together.
- Improvements if we can compute f(x), f'(x), f''(x) together.
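In practice one rarely reimplements Brent's method; SciPy, for instance, exposes it through minimize_scalar (usage sketch below; the function and bracket are arbitrary examples of mine).

```python
from scipy.optimize import minimize_scalar

res = minimize_scalar(lambda x: (x - 1.3) ** 2,
                      bracket=(0.0, 1.0, 3.0),   # a < b < c with f(b) < f(a), f(c)
                      method="brent")
print(res.x)                                     # about 1.3
```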

Coordinate Descent
Perform successive line searches along the coordinate axes. Tends to zig-zag.
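A small sketch of coordinate descent on a two-dimensional quadratic (illustrative; the matrix, vector, and iteration count are assumptions of mine): each inner step is a one-dimensional line search along one coordinate axis, and the off-diagonal coupling is what makes the successive axis searches zig-zag.

```python
import numpy as np
from scipy.optimize import minimize_scalar

A = np.array([[3.0, 2.0], [2.0, 6.0]])          # positive definite
b = np.array([2.0, -8.0])
f = lambda w: 0.5 * w @ A @ w - b @ w

w = np.zeros(2)
for sweep in range(50):
    for i in range(2):
        e = np.zeros(2); e[i] = 1.0
        t = minimize_scalar(lambda t: f(w + t * e)).x   # line search along axis i
        w = w + t * e
print(w, np.linalg.solve(A, b))                 # coordinate descent vs. exact minimizer
```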

Gradient
The gradient ∂f/∂w = (∂f/∂w_1, ..., ∂f/∂w_d) points in the direction of steepest ascent; its opposite, −∂f/∂w, gives the steepest descent direction.
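Since everything that follows relies on gradients, a common sanity check (not from the slides; the test function is made up) is to compare a hand-coded gradient against central finite differences:

```python
import numpy as np

def f(w):
    return 0.5 * w @ w + np.sin(w[0])

def grad(w):
    return w + np.array([np.cos(w[0]), 0.0])    # hand-coded gradient of f

w, h = np.array([0.3, -1.2]), 1e-6
numeric = np.array([(f(w + h * np.eye(2)[i]) - f(w - h * np.eye(2)[i])) / (2 * h)
                    for i in range(2)])
print(numeric, grad(w))                         # the two should agree to ~1e-9
```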

Steepest Descent
Perform successive line searches along the steepest descent direction (the negative gradient).
- Beneficial if computing the gradients is cheap enough.
- Line searches can be expensive.
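A compact sketch of steepest descent with a scalar line search (illustrative; the quadratic test problem and SciPy's one-dimensional minimizer are assumptions of mine):

```python
import numpy as np
from scipy.optimize import minimize_scalar

A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])
f = lambda w: 0.5 * w @ A @ w - b @ w
grad = lambda w: A @ w - b

w = np.zeros(2)
for it in range(20):
    g = grad(w)
    t = minimize_scalar(lambda t: f(w - t * g)).x   # line search along -gradient
    w = w - t * g
print(w)                                            # approaches the minimizer [2, -2]
```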

Gradient Descent
Repeat:  w ← w − γ ∂f/∂w (w)
This merges the gradient step and the line search into a single fixed-gain update.
- A large gain γ increases the zig-zag tendency and can make the iteration diverge.
- The high-curvature direction limits the usable gain size.
- The low-curvature direction limits the speed of approach.
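To see the interplay between gain and curvature (illustrative numbers, not from the slides), the sketch below runs the fixed-gain update on the same quadratic with three gains; any gain above 2/λ_max of the Hessian makes the iteration diverge.

```python
import numpy as np

A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])
grad = lambda w: A @ w - b
lam_max = np.linalg.eigvalsh(A).max()            # largest curvature

for gamma in (0.5 / lam_max, 1.9 / lam_max, 2.1 / lam_max):
    w = np.zeros(2)
    for _ in range(200):
        w = w - gamma * grad(w)                  # fixed-gain gradient step
    print(gamma, w)                              # the last gain exceeds 2/lam_max and diverges
```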

Hessian matrix
The Hessian matrix H(w) is the d×d matrix of second derivatives, with entries H(w)_ij = ∂²f/∂w_i ∂w_j.
Curvature information: the Taylor expansion near the optimum w* is
    f(w) ≈ f(w*) + (1/2) (w − w*)' H(w*) (w − w*).
This paraboloid has ellipsoidal level curves.
- The principal axes are the eigenvectors of the Hessian.
- The ratio of curvatures is the ratio of the eigenvalues of the Hessian.
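A numeric illustration of these statements (mine, not from the slides): for a quadratic, the Hessian is a constant matrix whose eigenvectors are the principal axes of the level ellipses and whose eigenvalue ratio is the ratio of curvatures.

```python
import numpy as np

H = np.array([[3.0, 2.0], [2.0, 6.0]])           # Hessian of the quadratic used above
eigvals, eigvecs = np.linalg.eigh(H)
print(eigvals)                                   # curvatures along the principal axes: 2 and 7
print(eigvals.max() / eigvals.min())             # ratio of curvatures (condition number)
print(eigvecs)                                   # columns: principal axes of the level curves
```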

Newton method
Idea: since the Taylor expansion gives ∂f/∂w (w) ≈ H(w) (w − w*), we get w* ≈ w − H(w)⁻¹ ∂f/∂w (w).
Newton algorithm: repeat  w ← w − H(w)⁻¹ ∂f/∂w (w)
- A succession of paraboloidal approximations.
- Exact when f(w) is a paraboloid, e.g. linear model + squared loss.
- Very few iterations needed when H(w) is positive definite!
- Beware when H(w) is not positive definite.
- Computing and storing H(w)⁻¹ can be too costly.
Quasi-Newton methods: methods that avoid the drawbacks of Newton but behave like Newton during the final convergence.
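A sketch of the Newton iteration on a convex learning problem, L2-regularized logistic regression, where H(w) is positive definite (the synthetic data, regularization constant, and iteration count are assumptions of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.where(rng.uniform(size=200) < 1 / (1 + np.exp(-X @ np.array([0.5, 2.0, -1.0]))),
             1.0, -1.0)                          # labels in {-1, +1}
lam = 0.1                                        # C(w) = lam/2 ||w||^2 + mean log loss

w = np.zeros(3)
for _ in range(10):
    p = 1 / (1 + np.exp(-y * (X @ w)))           # probability of the observed label
    g = lam * w - X.T @ ((1 - p) * y) / len(y)   # gradient of C
    H = lam * np.eye(3) + (X * (p * (1 - p))[:, None]).T @ X / len(y)   # Hessian of C
    w = w - np.linalg.solve(H, g)                # Newton step: solve, do not invert H
print(w)                                         # minimizer of the regularized log loss
```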

Conjugate Gradient algorithm
Conjugate directions: u and v are conjugate when u' H v = 0 (non-interacting directions).
Conjugate Gradient algorithm:
- Compute g_t = ∂f/∂w (w_t).
- Determine a line search direction d_t = g_t − λ d_{t-1}, choosing λ such that d_t' H d_{t-1} = 0.
  Since g_t − g_{t-1} ≈ H (w_t − w_{t-1}) ∝ H d_{t-1}, this means λ = g_t'(g_t − g_{t-1}) / (d_{t-1}'(g_t − g_{t-1})).
- Perform a line search in direction d_t.
- Loop.
This is a fast and robust quasi-Newton algorithm. A solution for all our learning problems?
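A sketch of this loop on a small quadratic (illustrative; the matrix, tolerance, and line search are assumptions of mine). The code searches along the descent direction −g_t + λ d_{t-1}, i.e. the negative of d_t as written above, with the same λ and the same conjugacy property.

```python
import numpy as np
from scipy.optimize import minimize_scalar

A = np.array([[3.0, 2.0], [2.0, 6.0]])           # Hessian of the quadratic below
b = np.array([2.0, -8.0])
f = lambda w: 0.5 * w @ A @ w - b @ w
grad = lambda w: A @ w - b

w = np.zeros(2)
g_prev = d = None
for t in range(20):
    g = grad(w)
    if np.linalg.norm(g) < 1e-9:
        break                                    # converged
    if d is None:
        d = -g                                   # first step: plain steepest descent
    else:
        lam = g @ (g - g_prev) / (d @ (g - g_prev))
        d = -g + lam * d                         # new direction satisfies d' A d_prev = 0
    s = minimize_scalar(lambda s: f(w + s * d)).x
    w, g_prev = w + s * d, g
print(w)     # a 2-D quadratic is minimized after at most two conjugate steps
```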

Optimization vs. learning
Empirical cost: usually f(w) = (1/n) Σ_{i=1}^n L(x_i, y_i, w).
The number n of training examples can be large (billions?).
Redundant examples:
- Examples are redundant (otherwise there is nothing to learn).
- Doubling the number of examples brings a little more information.
- Do we need it during the first optimization iterations?
Examples on the fly:
- All examples may not be available simultaneously.
- Sometimes they come on the fly (e.g. a web click stream).
- In quantities that are too large to store or retrieve (e.g. a click stream).

Offline vs. Online
Minimize C(w) = (λ/2) ‖w‖² + (1/n) Σ_{i=1}^n L(x_i, y_i, w).
Offline: process all examples together.
Example: minimization by gradient descent. Repeat:
    w ← w − γ ( λw + (1/n) Σ_{i=1}^n ∂L/∂w (x_i, y_i, w) )
Online: process examples one by one.
Example: minimization by stochastic gradient descent. Repeat:
    (a) pick a random example x_t, y_t;
    (b) w ← w − γ_t ( λw + ∂L/∂w (x_t, y_t, w) )
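The two updates side by side on a synthetic linear classification problem with the log loss (an illustrative sketch; the data, gains, λ, and iteration counts are assumptions of mine). The stochastic loop already uses the decreasing gain γ_t = γ_0 / (1 + λ γ_0 t) discussed on the next slide.

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam, gamma0 = 1000, 0.01, 0.5
X = rng.normal(size=(n, 5))
y = np.sign(X @ rng.normal(size=5) + 0.3 * rng.normal(size=n))   # labels in {-1, +1}

def grad_L(w, x, yi):                            # gradient of log(1 + exp(-y w.x))
    return -yi * x / (1 + np.exp(yi * (x @ w)))

# Offline: every update touches all n examples (full gradient of C).
w = np.zeros(5)
for _ in range(100):
    full = lam * w + X.T @ (-y / (1 + np.exp(y * (X @ w)))) / n
    w = w - gamma0 * full

# Online: every update touches one randomly picked example.
v = np.zeros(5)
for t in range(10 * n):
    i = rng.integers(n)
    gamma_t = gamma0 / (1 + lam * gamma0 * t)    # decreasing gain (next slide)
    v = v - gamma_t * (lam * v + grad_L(v, X[i], y[i]))

print(w, v)   # both head toward the same minimizer of C, the SGD iterate more noisily
```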

Stochastic Gradient Descent
- Very noisy estimates of the gradient.
- The gain γ_t controls the size of the cloud of iterates around the optimum.
- Decreasing gains: γ_t = γ_0 / (1 + λ γ_0 t).
Why is it attractive?

Stochastic Gradient Descent
Redundant examples:
- increase the computing cost of offline learning;
- do not change the computing cost of online learning.
Imagine the dataset contains 10 copies of the same 100 examples.
- Offline gradient descent: computation is 10 times larger than necessary.
- Stochastic gradient descent: no difference regardless of the number of copies.

Practical example
Document classification, similar to homework #2 but bigger: 781,264 training examples, 47,152 dimensions.
Linear classifier with hinge loss:
- offline dual coordinate descent (svmlight): 6 hours;
- offline primal bundle optimizer (svmperf): 66 seconds;
- stochastic gradient descent: 1.4 seconds.
Linear classifier with log loss:
- offline truncated Newton (tron): 44 seconds;
- offline conjugate gradient descent: 40 seconds;
- stochastic gradient descent: 2.3 seconds.
These are the times needed to reach the same test set error.

The wall
[Figure: testing cost (ticks near 0.2-0.3) and training time in seconds for SGD vs. TRON (LibLinear), plotted against optimization accuracy, i.e. training cost − optimal training cost, from 0.1 down to 1e-09.]