Optimization for Machine Learning

Similar documents
Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 5

Selected Topics in Optimization. Some slides borrowed from

ECE521 lecture 4: 19 January Optimization, MLE, regularization

ECS171: Machine Learning

Error Functions & Linear Regression (1)

Classification: Logistic Regression from Data

Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron

Optimization and Gradient Descent

STA141C: Big Data & High Performance Statistical Computing

Case Study 1: Estimating Click Probabilities. Kakade Announcements: Project Proposals: due this Friday!

CSC2515 Winter 2015 Introduction to Machine Learning. Lecture 2: Linear regression

Neural Networks and Deep Learning

Big Data Analytics. Lucas Rego Drumond

ECS171: Machine Learning

ECS289: Scalable Machine Learning

CSC 411: Lecture 02: Linear Regression

CSC321 Lecture 8: Optimization

Stochastic Gradient Descent. Ryan Tibshirani Convex Optimization

Stochastic Gradient Descent. CS 584: Big Data Analytics

Linear classifiers: Overfitting and regularization

CSC321 Lecture 4: Learning a Classifier

Stochastic Gradient Descent: The Workhorse of Machine Learning. CS6787 Lecture 1 Fall 2017

Machine Learning CS 4900/5900. Lecture 03. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Regression with Numerical Optimization. Logistic

CS 179: LECTURE 16 MODEL COMPLEXITY, REGULARIZATION, AND CONVOLUTIONAL NETS

Logistic Regression Logistic

Introduction to Logistic Regression and Support Vector Machine

Introduction to Machine Learning

Modern Optimization Techniques

Logistic Regression. Stochastic Gradient Descent

CSE446: Clustering and EM Spring 2017

CSC 411 Lecture 6: Linear Regression

CSC321 Lecture 4: Learning a Classifier

Lecture 7. Logistic Regression. Luigi Freda. ALCOR Lab DIAG University of Rome La Sapienza. December 11, 2016

Ad Placement Strategies

CSCI 1951-G Optimization Methods in Finance Part 12: Variants of Gradient Descent

Day 3 Lecture 3. Optimizing deep networks

CSC321 Lecture 7: Optimization

Logistic Regression Introduction to Machine Learning. Matt Gormley Lecture 8 Feb. 12, 2018

CPSC 340: Machine Learning and Data Mining. Stochastic Gradient Fall 2017

Contents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016

Warm up. Regrade requests submitted directly in Gradescope, do not instructors.

Support Vector Machines: Training with Stochastic Gradient Descent. Machine Learning Fall 2017

Logistic Regression. COMP 527 Danushka Bollegala

CS260: Machine Learning Algorithms

The Kernel Trick, Gram Matrices, and Feature Extraction. CS6787 Lecture 4 Fall 2017

Statistical Computing (36-350)

Lecture 17: October 27

Machine Learning Basics III

Classification: Logistic Regression from Data

Nonlinear Optimization Methods for Machine Learning

A Quick Tour of Linear Algebra and Optimization for Machine Learning

Lecture 4: Training a Classifier

Optimization Tutorial 1. Basic Gradient Descent

Gradient Descent. Stephan Robert

An Evolving Gradient Resampling Method for Machine Learning. Jorge Nocedal

Variations of Logistic Regression with Stochastic Gradient Descent

CSC321 Lecture 2: Linear Regression

Gradient Descent. Ryan Tibshirani Convex Optimization /36-725

Constrained Optimization and Lagrangian Duality

Neural Networks. David Rosenberg. July 26, New York University. David Rosenberg (New York University) DS-GA 1003 July 26, / 35

SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning

COMPUTATIONAL INTELLIGENCE (INTRODUCTION TO MACHINE LEARNING) SS16

Deep Neural Networks (3) Computational Graphs, Learning Algorithms, Initialisation

Lecture 10. Neural networks and optimization. Machine Learning and Data Mining November Nando de Freitas UBC. Nonlinear Supervised Learning

OPTIMIZATION METHODS IN DEEP LEARNING

Introduction to Optimization

Advanced computational methods X Selected Topics: SGD

Least Mean Squares Regression

Gradient descent. Barnabas Poczos & Ryan Tibshirani Convex Optimization /36-725

Coordinate Descent and Ascent Methods

CPSC 540: Machine Learning

Learning From Data Lecture 9 Logistic Regression and Gradient Descent

Overfitting, Bias / Variance Analysis

Introduction to Machine Learning Prof. Sudeshna Sarkar Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

Neural Networks: Backpropagation

Bayesian Deep Learning

MATH 829: Introduction to Data Mining and Analysis Computing the lasso solution

Machine Learning for NLP

Comments. Assignment 3 code released. Thought questions 3 due this week. Mini-project: hopefully you have started. implement classification algorithms

Lecture 5: Gradient Descent. 5.1 Unconstrained minimization problems and Gradient descent

Linear Regression (continued)

Lecture 35: Optimization and Neural Nets

Gradient Descent. Model

Motivation Subgradient Method Stochastic Subgradient Method. Convex Optimization. Lecture 15 - Gradient Descent in Machine Learning

Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 16

CSE 417T: Introduction to Machine Learning. Lecture 11: Review. Henry Chai 10/02/18

Neural Networks. Yan Shao Department of Linguistics and Philology, Uppsala University 7 December 2016

Reward-modulated inference

Deep Learning II: Momentum & Adaptive Step Size

Neural Networks Teaser

Non-Convex Optimization. CS6787 Lecture 7 Fall 2017

This ensures that we walk downhill. For fixed λ not even this may be the case.

Introduction to Logistic Regression

Machine Learning. Lecture 04: Logistic and Softmax Regression. Nevin L. Zhang

Ad Placement Strategies

Bias-Variance Tradeoff

A Tutorial On Backward Propagation Through Time (BPTT) In The Gated Recurrent Unit (GRU) RNN

Linear Regression. Robot Image Credit: Viktoriya Sukhanova 123RF.com

Transcription:

Optimization for Machine Learning Elman Mansimov 1 September 24, 2015 1 Modified based on Shenlong Wang s and Jake Snell s tutorials, with additional contents borrowed from Kevin Swersky and Jasper Snoek

Contents Overview Gradient Descent

An informal definition of optimization Minimize (or maximize) some quantity.

Applications Engineering: Minimize fuel consumption of an automobile Economics: Maximize returns on an investment Supply Chain Logistics: Minimize time taken to fulfill an order Life: Maximize happiness

Recap: Linear Regression Want to predict house price based on some information about the house (location, number of rooms, etc.) Assume that you have a data for n houses. x i is a d dimensional vector of observations for house i x i = (x 1 i, x 2 i,..., x d i ); X is a d n matrix of all houses. y is a vector of the price of each house y = (y 1, y 2,..., y n ) y = WX, W is a 1 d dimensional matrix W that would result in a lowest error, i.e. smallest difference between predicted price and real price.

More formally Goal: find θ = argmin θ f (θ), (possibly subject to constraints on θ). θ R n : optimization variable f : R n R: objective function Maximizing f (θ) is equivalent to minimizing f (θ), so we can treat everything as a minimization problem.

Optimization is a large area of research The best method for solving the optimization problem depends on which assumptions we want to make: Is θ discrete or continuous? What form do constraints on θ take? (if any) Is f well-behaved? (linear, differentiable, convex, submodular, etc.)

Convex Functions A function f is convex if for any two points θ 1 and θ 2 and any t [0, 1], f (tθ 1 + (1 t)θ 2 ) tf (θ 1 ) + (1 t)f (θ 2 )

Convex Functions

Why do we care about convexity? Any local minimum is a global minimum. Which means that whatever solution we find would be the best solution. This makes optimization a lot easier because we don t have to worry about getting stuck in a local minimum.

Overview of Optimization for Machine Learning Often in machine learning we are interested in learning the parameters θ of a model. Goal: minimize some loss function For example, if we have some data (x, y), we may want to maximize P(y x, θ). Equivalently, we can minimize log P(y x, θ). We can also minimize other sorts of loss functions log can help for numerical reasons

Naive Optimization Algorithm Try all posssible combinations of W until you find one that has the lowest error (brute force). Doesn t scale as you grow number of parameters and dimensions. Need help from calculus.

Gradient Descent: Motivation From calculus, we know that the minimum of f must lie at a point where f (θ ) θ = 0. Sometimes, we can solve this equation analytically for θ. Most of the time, we are not so lucky and must resort to iterative methods. Review Gradient: θ f = ( f θ 1, f θ 2,..., f θ k )

Outline of Gradient Descent Algorithm Where η is the learning rate and T is the number of iterations: Initialize θ 0 randomly for t = 1 : T : δt η θt 1 f (i.e. calculate the change for the θ t ) θt θ t 1 + δ t The learning rate shouldn t be too big (objective function will blow up) or too small (will take a long time to converge)

Illustration of Learning Rates 2 2 First image taken from Andrej Karpathy s Stanford Lectures, second image taken from Wikipedia

Gradient Descent with Line-Search Where η is the learning rate and T is the number of iterations: Initialize θ 0 randomly for t = 1 : T : Finding a step size η t such that f (θ t η t θt 1 ) < f (θ t ) δt η t θt 1 f θ t θ t 1 + δ t Require a line-search step in each iteration.

Gradient Descent with Momentum We can introduce a momentum coefficient α [0, 1) so that the updates have memory : Initialize θ 0 randomly Initialize δ 0 to the zero vector for t = 1 : T : δ t η θt 1 f +αδ t 1 θt θ t 1 + δ t Momentum is a nice trick that can help speed up convergence. Generally we choose α between 0.8 and 0.95, but this is problem dependent

Outline of Gradient Descent Algorithm Where η is the learning rate and T is the number of iterations: Initialize θ 0 randomly Do: δ t η θt 1 f θt θ t 1 + δ t Until convergence Setting a convergence criteria.

Some convergence criteria Change in objective function value is close to zero: f (θ t+1 ) f (θ t ) < ɛ Gradient norm is close to zero: θ f < ɛ Validation error starts to increase (this is called early stopping)

Checkgrad When implementing the gradient computation for machine learning models, it s often difficult to know if our implementation of f and f is correct. We can use finite-differences approximation to the gradient to help: f f ((θ 1,..., θ i + ɛ,..., θ n )) f ((θ 1,..., θ i ɛ,..., θ n )) θ i 2ɛ Why don t we always just use the finite differences approximation? slow: we need to recompute f twice for each parameter in our model. numerical issues

Stochastic Gradient Descent Any iteration of gradient descent method requires that we sum over the entire dataset to compute gradient. SGD idea: at each iteraton, sub-sample a small amount of data (even just 1 point can work) and use that to estimate the gradient. (typically use around 100 samples) Each update is noisy, but very fast! This is the basis of optimizing ML algorithms with huge datasets (e.g., deep learning).

More on optimization Convex Optimization by Boyd & Vandenberghe Book available for free online at http://www.stanford.edu/ boyd/cvxbook/ Numerical Optimization by Nocedal & Wright Electronic version available from UofT Library

Resources for MATLAB Tutorials are available on the course website at http://www.cs.toronto.edu/ zemel/inquiry/matlab.php

Resources for Python Official tutorial: http://docs.python.org/2/tutorial/ Google s Python class: https://developers.google.com/edu/python/ Zed Shaw s Learn Python the Hard Way: http://learnpythonthehardway.org/book/ NumPy/SciPy/Matplotlib Scientific Python bootcamp (with video!): http://register.pythonbootcamp.info/agenda SciPy lectures: http://scipy-lectures.github.io/index.html

Questions?