Sequential and reinforcement learning: Stochastic Optimization I


Summary

This session describes the important and now ubiquitous framework of on-line learning and estimation. This kind of problem arises in bandit games (see below for details) and in the optimization of big-data problems that involve data sets so massive that the user cannot imagine exploiting the entire data with a batch algorithm. Instead, we are forced to use subsets of the observations sequentially, so as to produce at the same time: tractable algorithms, fast predictions, and efficient statistical estimators.

We will describe below the method of Stochastic Gradient Descent, with applications to:
- optimization of convex functions (regression, S.V.M., Lasso);
- optimization of non-convex functions: E.M. algorithm, quantile estimation.

As well, we will introduce later a second family of methods that relies on a pure randomness exploration strategy with Markov chains. We will describe algorithms for solving the problem of simulating posterior distributions through another family of stochastic methods:
- Metropolis-Hastings Markov chain algorithms;
- Metropolis-adjusted Langevin algorithms.

At last, we will briefly describe an ultimate method for solving the minimization of a non-convex function with multiple local traps, by using simulated annealing. This problem will be illustrated on the so-called travelling salesman problem.

1 Stochastic Gradient Descent (aka S.G.D.)

1.1 Gradient descent

1.1.1 Convex case

Let us consider a function f : ℝ^p → ℝ, which is convex and C² (for the sake of simplicity). We are interested in finding the minimum of f over ℝ^p. A natural method relies on a discretization of the so-called gradient flow:

    ẋ_t = −∇f(x_t).    (1)

It is easy to check that the energy E(t) = f(x_t) is a decreasing function all along the flow:

    E′(t) = −‖∇f(x_t)‖².

Assume now that f(x*) = min_{ℝ^p} f = 0. Integrating the identity above, we get

    E(t) − E(0) = −∫_0^t ‖∇f(x_s)‖² ds,

so that

    ∫_0^t ‖∇f(x_s)‖² ds ≤ E(0).

This last inequality immediately implies that lim_{t→+∞} ∇f(x_t) = 0. The convexity of f then permits us to conclude the convergence of (1):

    lim_{t→+∞} x_t = x*.
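For the reader who wants the computation spelled out, here is the energy argument of Section 1.1.1 written as a short LaTeX block (nothing new is claimed; only the chain-rule step is made explicit):

\[
E'(t) = \frac{d}{dt} f(x_t)
      = \langle \nabla f(x_t), \dot{x}_t \rangle
      = -\|\nabla f(x_t)\|^2 \le 0,
\qquad
\int_0^t \|\nabla f(x_s)\|^2 \, ds = E(0) - E(t) \le E(0).
\]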

1.1.2 Strongly convex case

We can derive convergence rates if we add a slightly stronger hypothesis on f. Assume now that f is α-strongly convex: it means that x ↦ f(x) − (α/2)‖x‖² is still convex, or equivalently that D²f ⪰ α Id, in the following sense:

    ∀x, h ∈ ℝ^p,  ᵗh D²f(x) h ≥ α ‖h‖².

A simple consequence is that

    ∀x ∈ ℝ^p,  f(x) ≥ (α/2) ‖x − x*‖².

A typical example (involved in the linear models): f(x) = ‖Ax − b‖² with A ∈ M_{n,p}(ℝ), such that ᵗA A is an invertible matrix.

We can be more precise about the behaviour of (1). We introduce F(t) = ‖x_t − x*‖² and compute

    F′(t) = −2 ⟨x_t − x*, ∇f(x_t)⟩.

But any convex function with minimum x* (and f(x*) = 0) always satisfies

    ⟨∇f(x), x − x*⟩ ≥ f(x) − f(x*) = f(x),

so that

    F′(t) ≤ −2 f(x_t) ≤ −α F(t).

We then conclude that

    F(t) ≤ F(0) e^{−αt},

leading to an exponential convergence rate of x_t towards its target x*.

1.2 Stochastic settings

We are now interested in the situation where the numerical scheme is discrete, given by an explicit Euler scheme with step sizes (α_k)_{k≥1}:

    x_{k+1} = x_k − α_k d_k(x_k),    (2)

where d_k(x_k) is a "descent direction" that should be made close to the gradient of f at the point x_k. In some situations, it is not reasonable to imagine using d_k(x) = ∇f(x), because of
- computational issues (computing ∇f may be costly),
- availability (the computation of ∇f is not possible).

However, it is possible to produce an unbiased estimate of ∇f(x_k) at time k, while assuming the important hypothesis:

    ∀k ∈ ℕ, ∀x ∈ ℝ^p,  E[d_k(x)] = ∇f(x)  and  Var(d_k(x)) ≤ σ².

It means that we now have access to an unbiased estimator of ∇f at each iteration of the algorithm, with bounded variance. The natural question raised by this framework is as follows: does algorithm (2) still converge to x* if we tune the step-size sequence (α_k)_{k≥1} suitably?

1.3 Examples of applications of SGD

1.3.1 f as an average of convex functions

A famous use of S.G.D. concerns the situation where f is given by

    ∀x ∈ ℝ^p,  f(x) = E_{U∼P}[f(x, U)],

where U is a random variable of distribution P and, for any u, f(·, u) is convex. If we can sample from the distribution P, then S.G.D. can be written as

    x_{k+1} = x_k − α_k ∇_x f(x_k, U_k),  where U_k ∼ P.

Above, we implicitly assume that ∇_x f(x_k, U_k) is easy to handle, which is not the case for the whole gradient ∇_x f(x_k). Another great advantage is that this approach generally permits handling sequential arrivals of new observations X_n, with a very low cost of updates.

1.3.2 Linear regression

The loss L is given by

    L(w) = (1/2) Σ_{i=1}^n [Y_i − ⟨w, X_i⟩]².

We then immediately adapt on-line regression with stochastic gradient, using the descent direction

    d_k(w) = [⟨w_k, X_{i_k}⟩ − Y_{i_k}] X_{i_k},

where (X_{i_k}, Y_{i_k}) is an observation picked uniformly in the training set. Note that d_k(w) and X_{i_k} are vectors of size p. This stochastic step does not involve the inversion of the Fisher information matrix.
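As a complement (the script of Section 1.4.3 below treats the streaming case), here is a minimal Matlab sketch of this uniform-sampling variant on a stored training set; the synthetic data, the horizon and the step-size exponent 3/4 are illustrative assumptions, not prescribed by the text.

% S.G.D. for least squares, sampling uniformly in a stored training set
clear all
n=500; p=5;
X=randn(n,p);                      % rows are the observations X_i
wstar=randn(p,1);                  % unknown parameter
Y=X*wstar+0.1*randn(n,1);          % noisy responses
w=zeros(p,1);
for k=1:20000
    i=randi(n);                    % pick (X_i,Y_i) uniformly at random
    d=(X(i,:)*w-Y(i))*X(i,:)';     % d_k(w) = [<w_k,X_i>-Y_i] X_i
    w=w-k^(-3/4)*d;                % step size alpha_k = k^(-3/4)
end
norm(w-wstar)                      % estimation error, should be small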

1.3.3 The S.V.M. problem

The S.V.M. problem induces the minimization of a cost function that depends on a variable w ∈ ℝ^{p+1}. The S.V.M. is dedicated to a supervised classification problem with n observations (X_i, Y_i), where Y_i ∈ {±1}. The observation X_i is sent into a p-dimensional feature space through a kernel map φ. The loss function is defined as

    L(w) = λ ‖w‖₂² + (1/n) Σ_{i=1}^n max{0; 1 − Y_i ⟨w, φ(X_i)⟩}.

This function is convex in w, and minimizing this loss corresponds to trying to find the best separating hyperplane in a p-dimensional space. If n is huge, handling the gradient of L may be complicated. However, we can notice that

    L(w) = λ ‖w‖₂² + E_{(X,Y)∼P_n} [ max{0; 1 − Y ⟨w, φ(X)⟩} ],

where P_n denotes the empirical distribution of the training set. Hence, we can replace at each step the gradient of L by an unbiased estimate given by

    d_k(w) = 2λ w − Y_k φ(X_k) 1_{Y_k ⟨w, φ(X_k)⟩ < 1},  where (X_k, Y_k) ∼ P_n.

Here, we can see that we only need one sample of the training set to produce an estimator of the gradient at each step of the algorithm.

1.3.4 Lasso

The Lasso loss is

    L(w) = λ ‖w‖₁ + (1/2) Σ_{i=1}^n [Y_i − ⟨w, φ(X_i)⟩]².

We can trivially apply S.G.D. to this loss, using again a sampling of the empirical measure P_n. Nevertheless, this kind of method fails to produce a sparse reconstruction. Instead, we can also randomize over the variables (w¹, ..., w^p) by picking one coordinate of w uniformly at random among the p variables. In that case, the descent becomes (a Matlab sketch is given after Section 1.3.5):

- Pick one variable j_k uniformly in {1, ..., p}.
- Update the chosen coordinate:
      w̃^j_{k+1} = w^j_k − α_k 1_{j=j_k} ∂_{j_k} ( (1/2) Σ_{i=1}^n [Y_i − ⟨w_k, φ(X_i)⟩]² ).
- Use the proximal step
      w_{k+1} = s_λ[w̃_{k+1}],
  where s_λ denotes the soft-thresholding operator.

1.3.5 Quantile estimation

The quantile of a real distribution may be of interest while looking for confidence sets. Recall that the quantile q_α is defined by

    P(X < q_α) = ∫_{−∞}^{q_α} p(s) ds = 1 − α.

Now, imagine you want to find q_α from sequential arrivals of observations (X_n)_{n∈ℕ}: you are thus looking for the minimizer of the loss

    L(q) = [P(X ≤ q) − (1 − α)]²,

whose gradient is given by

    L′(q) = 2 p(q) [P(X ≤ q) − (1 − α)] = 2 p(q) [E[1_{X≤q}] − (1 − α)].

A natural extension of the initial gradient algorithm then produces the following descent choice:

    d_k(q_k) = 1_{X_k ≤ q_k} − (1 − α),

which solely depends on the observation X_k at time k and the current estimate q_k.
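Going back to the Lasso recipe of Section 1.3.4, here is a minimal Matlab sketch of the randomized coordinate descent with a proximal step. The identity feature map φ, the synthetic data, the step sizes, the rescaling of the partial derivative by n, the threshold level α_k·λ (the usual proximal-gradient convention) and the coordinate-wise application of s_λ are all assumptions made for this illustration.

% Randomized coordinate descent for the Lasso, with soft-thresholding
clear all
n=100; p=10;
X=randn(n,p);                        % here phi is the identity map
wstar=[3;-2;zeros(p-2,1)];           % sparse unknown parameter
Y=X*wstar+0.1*randn(n,1);
lambda=0.5;
w=zeros(p,1);
for k=1:20000
    j=randi(p);                      % pick one variable uniformly at random
    g=X(:,j)'*(X*w-Y);               % partial derivative of the quadratic term along j
    step=k^(-3/4);
    w(j)=w(j)-step*g/n;              % rescaling by n tames the full sum, as in Section 1.4.3
    % proximal step on the updated coordinate:
    % s_lambda(x) = sign(x)*max(|x|-step*lambda,0)
    w(j)=sign(w(j))*max(abs(w(j))-step*lambda,0);
end
w                                    % two large coordinates, the others shrunk to 0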

1.4 Examples in Matlab

Each time, try to modify some parameters (step size, functions to be minimized, initialization, etc.).

1.4.1 Convex minimization

The baseline (toy) functions:

function y=e1(x)
y=x.^2/2;

function y=grade1(x)
y=x;

The Matlab script:

% Script of the S.G.D.
clear all
close all
% Step size power parameter
alpha=1;
% Maximal number of iterations
Nstop=1000;
% Initialisation of the gradient descent and of the S.G.D.
x(1)=randn;
xx(1)=randn;
for i=1:Nstop-1
    x(i+1)=x(i)-(i+1)^(-alpha)*grade1(x(i));
    xx(i+1)=xx(i)-(i+1)^(-alpha)*(grade1(xx(i))+randn);
end
plot(x)
hold on
plot(xx,'r')
legend('EGD','SGD')

1.4.2 Non-convex minimization

The baseline (toy) functions:

function y=e2(x)
y=5*sin(x)+x.^2/2;

function y=grade2(x)
y=5*cos(x)+x;

The Matlab script:

% Initialisation of the gradient descent and of the S.G.D.
x(1)=3;
xx(1)=3;
for i=1:Nstop-1
    x(i+1)=x(i)-(i+1)^(-alpha)*grade2(x(i));
    xx(i+1)=xx(i)-(i+1)^(-alpha)*(grade2(xx(i))+randn);
end
figure
plot(x)
hold on
plot(xx,'r')
legend('EGD','SGD')

1.4.3 Linear regression with (possibly) giant dataset

The Matlab script:

% Linear Model
clear all
% Dimension of the problem:
p=10;
% Step size parameter
alpha=1/2;
% Unknown parameter
theta=(2*rand(p,1)-1)*10;
% Number of observations:
nobs=1000;
esti(:,1)=zeros(p,1);
for i=1:nobs
    % Current error
    s(i)=norm(theta-esti(:,i));
    % Sequential arrival of an observation (X,Y)
    X=10*randn(p,1);
    Y=sum(X.*theta)+randn;
    esti(:,i+1)=esti(:,i)+(i+1)^(-alpha)*(Y-sum(esti(:,i).*X))*X/(100*p);
end
% Evolution of the error
plot(s)

1.4.4 Quantile estimation

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Sequential quantile estimation
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
clear all
% Alpha
alpha=0.01;
% Step size parameter
nu=1/2;
% Number of observations:
nobs=10000;
q(1)=0;
for i=1:nobs
    X=randn;
    q(i+1)=q(i)-i^(-nu)*((X<q(i))-(1-alpha));
end
% Evolution of the quantile estimation for the standard Gaussian r.v.
plot(q)

1.5 Theoretical guarantees

1.5.1 Convergence results

A first important ingredient for the good behaviour of these stochastic algorithms is the choice of the step-size sequence. This choice is not so obvious. It can be shown that two baseline properties are essential:

    Σ_{n=1}^∞ γ_n = +∞  and  Σ_{n=1}^∞ γ_n^{1+ε} < +∞  for some ε > 0.

In practice, it is important to choose

    γ_n ≃ C n^{−α}  with α ∈ (1/2, 1].

Such a choice permits one to establish the following result.

THEOREM 1. Assume that f is strongly convex, that ∇f is L-Lipschitz, and that the noise at each step has a bounded variance. Then γ_n = c n^{−α} with α ∈ (0, 1) leads to

    E[f(X_n)] − min_{ℝ^p} f ≲ γ_n.

Consequently, the smaller the step size, the lower the upper bound. Unfortunately, we cannot choose α = 1 without an additional assumption that relies on the strong convexity constant of f: it can be shown that if γ_n = c/n with c large enough, then the result of Theorem 1 remains true. These results are difficult (especially when α < 1/2) and can be found in the early works of Benaïm and Duflo.
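As a quick sanity check of the two summability conditions of Section 1.5.1, here is the computation for the canonical choice γ_n = C n^{−α}, written as a LaTeX block (a reading aid, not an additional result):

\[
\sum_{n\ge 1} \gamma_n = C \sum_{n\ge 1} n^{-\alpha} = +\infty
 \iff \alpha \le 1,
\qquad
\sum_{n\ge 1} \gamma_n^{1+\varepsilon}
 = C^{1+\varepsilon} \sum_{n\ge 1} n^{-\alpha(1+\varepsilon)} < +\infty
 \iff \alpha(1+\varepsilon) > 1 .
\]

For α ∈ (1/2, 1], the second condition holds with ε = 1, which is exactly why this range of exponents is recommended above.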

1.5.2 Polyak averaging

A very good alternative to a very careful choice of the step-size parameter (which would depend on whether f is strongly convex or not) is to use a step size γ_n = C/n^α with α ∈ (1/2, 1), together with an additional Cesàro averaging procedure:

    X̄_n := (1/n) Σ_{k=1}^n X_k.

In that situation, we can show the next result.

THEOREM 2. If f is strongly convex and γ_n = c n^{−α} with α ∈ [1/2, 1), then

    E[f(X̄_n)] − min_{ℝ^p} f ≲ 1/n.

Hence, Polyak averaging has the good property of catching the best rate achievable under strong convexity, without having to set a sharp constant in front of a hypothetical step size γ_n = c/n. In practice, a good setup is γ_n ∝ 1/√n, coupled with Polyak averaging.
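To see Theorem 2 in action, here is a minimal Matlab sketch grafting the Cesàro average onto the toy S.G.D. of Section 1.4.1 (same toy function f(x) = x²/2, whose gradient is x, with the same additive Gaussian noise; the exponent ν = 1/2 and the horizon are illustrative choices):

% S.G.D. with Polyak (Cesaro) averaging on f(x)=x^2/2
clear all
% Step size exponent: gamma_i ~ 1/sqrt(i)
nu=1/2;
% Number of iterations
Nstop=10000;
% Initialisation of the S.G.D. and of its running average
xx(1)=randn;
xbar(1)=xx(1);
for i=1:Nstop-1
    xx(i+1)=xx(i)-(i+1)^(-nu)*(xx(i)+randn);    % noisy gradient of x^2/2
    xbar(i+1)=xbar(i)+(xx(i+1)-xbar(i))/(i+1);  % recursive Cesaro average
end
plot(xx)
hold on
plot(xbar,'r')
legend('SGD','Polyak average')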