
STA141C: Big Data & High Performance Statistical Computing
Lecture 8: Optimization
Cho-Jui Hsieh
UC Davis
May 9, 2017

Optimization

Numerical Optimization
Numerical optimization: min_X f(X).
It can be applied to many areas. Machine learning: find a model that minimizes the prediction error.

Properties of the Function
Smooth function: a function with continuous derivative.
Example: ridge regression
  argmin_w (1/2)||Xw − y||^2 + (λ/2)||w||^2
Non-smooth functions: Lasso, primal SVM.
Lasso: argmin_w (1/2)||Xw − y||^2 + λ||w||_1
SVM:   argmin_w Σ_{i=1}^n max(0, 1 − y_i w^T x_i) + (λ/2)||w||^2
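As a concrete illustration (not from the slides), these three objectives can be evaluated in a few lines of numpy; all names here (X, y, w, lam) are placeholders for synthetic data:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))    # N x d design matrix
y = rng.standard_normal(100)          # regression targets
w = rng.standard_normal(10)           # parameter vector
lam = 0.1                             # regularization weight

ridge = 0.5 * np.sum((X @ w - y) ** 2) + 0.5 * lam * np.sum(w ** 2)     # smooth
lasso = 0.5 * np.sum((X @ w - y) ** 2) + lam * np.sum(np.abs(w))        # non-smooth (L1 term)
y_pm = np.where(y >= 0, 1.0, -1.0)                                      # pretend +/-1 labels for the SVM loss
svm = np.sum(np.maximum(0.0, 1.0 - y_pm * (X @ w))) + 0.5 * lam * np.sum(w ** 2)  # non-smooth (hinge)
print(ridge, lasso, svm)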

Convex Functions
A function f is convex if, for all x_1, x_2 and t ∈ [0, 1],
  f(t x_1 + (1 − t) x_2) ≤ t f(x_1) + (1 − t) f(x_2)
A convex function has no local optimum that is not a global optimum (why?).
(Figure from Wikipedia)

Convex Functions
If f(x) is twice differentiable, then f is convex if and only if ∇²f(x) ⪰ 0.
The optimal solution may not be unique: there is a set of optimal solutions S.
Gradient: captures the first-order change of f:
  f(x + αd) = f(x) + α ∇f(x)^T d + O(α²)
If f is convex and differentiable, we have the following optimality condition:
  x ∈ S if and only if ∇f(x) = 0

Nonconvex Functions
If f is nonconvex, most algorithms can only converge to stationary points.
x̄ is a stationary point if and only if ∇f(x̄) = 0.
Three types of stationary points: (1) global optimum, (2) local optimum, (3) saddle point.
Examples of nonconvex problems: matrix completion, neural networks, ...
Example: f(x, y) = (1/2)(xy − a)² (the origin is a saddle point when a ≠ 0).

Coordinate Descent

Coordinate Descent
Update one variable at a time.
Coordinate descent: repeatedly perform the following loop:
  Step 1: pick an index i
  Step 2: compute a step size δ by (approximately) solving δ ← argmin_δ f(x + δ e_i)
  Step 3: x_i ← x_i + δ
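A minimal sketch of cyclic coordinate descent, specialized to ridge regression so that the one-dimensional subproblem has a closed form; the function name and the O(n) residual bookkeeping are illustrative choices, not part of the slides:

import numpy as np

def coordinate_descent_ridge(X, y, lam, n_outer=50):
    n, d = X.shape
    w = np.zeros(d)
    r = X @ w - y                       # residual Xw - y, maintained incrementally
    col_sq = (X ** 2).sum(axis=0)       # ||X_j||^2 for every column j
    for _ in range(n_outer):            # outer iterations
        for j in range(d):              # Step 1: pick an index (cyclic order here)
            grad_j = X[:, j] @ r + lam * w[j]      # partial derivative of the objective
            delta = -grad_j / (col_sq[j] + lam)    # Step 2: exact 1-D minimizer (quadratic)
            w[j] += delta                          # Step 3: x_i <- x_i + delta
            r += delta * X[:, j]                   # keep the residual consistent in O(n)
    return w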

Coordinate Descent (update sequence)
Three types of updating order:
Cyclic: update sequence x_1, x_2, ..., x_n (1st outer iteration), x_1, x_2, ..., x_n (2nd outer iteration), ...
  Randomly permuting the sequence for each outer iteration gives faster convergence in practice.
  Some interesting theoretical analysis for this recently:
  C.-P. Lee and S. J. Wright, Random Permutations Fix a Worst Case for Cyclic Coordinate Descent, 2016.
Random: each time pick a random coordinate to update.
  Typical way: sample from the uniform distribution.
  Sampling from the uniform distribution vs. sampling from a biased distribution:
  P. Zhao and T. Zhang, Stochastic Optimization with Importance Sampling for Regularized Loss Minimization, ICML 2015.
  D. Csiba, Z. Qu and P. Richtarik, Stochastic Dual Coordinate Ascent with Adaptive Probabilities, ICML 2015.

Greedy Coordinate Descent
Greedy: choose the most important coordinate to update. How to measure the importance?
  By the first derivative: |∇_i f(x)|
  By the first and second derivatives: |∇_i f(x)| / ∇²_{ii} f(x)
  By the maximum reduction of the objective function:
    i* = argmax_{i=1,...,n} ( f(x) − min_δ f(x + δ e_i) )
Need to consider the time complexity of variable selection.
Useful for kernel SVM (see lecture 6).
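A small illustrative sketch of the first (gradient-magnitude) rule for the same ridge objective used above; note that computing the full gradient already costs O(nnz(X)), which is exactly the selection cost the slide warns about:

import numpy as np

def pick_greedy_coordinate(X, y, w, lam):
    grad = X.T @ (X @ w - y) + lam * w      # full gradient of the ridge objective
    return int(np.argmax(np.abs(grad)))     # i* = argmax_i |grad_i f(w)|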

Extension: block coordinate descent
Variables are divided into blocks {X_1, ..., X_p}, where each X_i is a subset of variables,
  X_1 ∪ X_2 ∪ ... ∪ X_p = {1, ..., n} and X_i ∩ X_j = ∅ for all i ≠ j.
Each time, update one block X_i by (approximately) solving the subproblem within the block.
Example: alternating minimization for matrix completion (2 blocks). (See lecture 7.)

Coordinate Descent (convergence)
Converges to the optimum if f(·) is convex and smooth.
Has a linear convergence rate if f(·) is strongly convex.
Linear convergence: the error f(x_t) − f(x*) decays as β, β², β³, ... for some β < 1.
Local linear convergence: the algorithm converges linearly once ||x − x*|| ≤ K for some K > 0.

Coordinate Descent: other names
  Alternating minimization (matrix completion)
  Iterative scaling (for log-linear models)
  Decomposition method (for kernel SVM)
  Gauss-Seidel (for linear systems when the matrix is positive definite)
  ...

Gradient Descent

Gradient Descent
Gradient descent algorithm: repeatedly conduct the following update:
  x_{t+1} ← x_t − α ∇f(x_t)
where α > 0 is the step size.
It is a fixed-point iteration method: x − α ∇f(x) = x if and only if x is an optimal solution.
Step size too large → diverge; too small → slow convergence.

Gradient Descent: step size
A twice differentiable function has an L-Lipschitz continuous gradient if and only if ∇²f(x) ⪯ L·I for all x.
In this case, the step-size condition is satisfied if α < 1/L.
Theorem: gradient descent converges if α < 1/L.
Theorem: gradient descent converges linearly with α < 1/L if f is strongly convex.

Gradient Descent
In practice, we do not know L...
Step size α too large: the algorithm diverges.
Step size α too small: the algorithm converges very slowly.

Gradient Descent: line search
d is a descent direction if and only if d^T ∇f(x) < 0.
Armijo-rule backtracking line search: try α = 1, 1/2, 1/4, ... until it satisfies
  f(x + αd) ≤ f(x) + γ α d^T ∇f(x)
where 0 < γ < 1.
(Figure from http://ool.sourceforge.net/ool-ref.html)
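A minimal sketch of the Armijo backtracking rule just described; f and grad_f are assumed to be user-supplied callables, and the shrink factor 1/2 and γ = 1e-4 are common but arbitrary defaults:

import numpy as np

def backtracking_line_search(f, grad_f, x, d, gamma=1e-4, shrink=0.5, max_tries=50):
    slope = grad_f(x) @ d              # d is a descent direction iff this is < 0
    fx = f(x)
    alpha = 1.0
    for _ in range(max_tries):         # try alpha = 1, 1/2, 1/4, ...
        if f(x + alpha * d) <= fx + gamma * alpha * slope:
            return alpha               # Armijo (sufficient decrease) condition holds
        alpha *= shrink
    return alpha                       # fall back to the smallest alpha tried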

Gradient Descent: line search
Gradient descent with line search:
  Converges to an optimal solution if f is smooth and convex.
  Converges linearly if f is strongly convex.
However, each iteration requires evaluating f several times.
There are several other step-size selection approaches.

Gradient Descent: applying to ridge regression
Input: X ∈ R^{N×d}, y ∈ R^N, initial w^(0)
Output: solution of w* = argmin_w (1/2)||Xw − y||^2 + (λ/2)||w||^2
1: t = 0
2: while not converged do
3:   Compute the gradient g = X^T(Xw − y) + λw
4:   Choose step size α_t
5:   Update w ← w − α_t g
6:   t ← t + 1
7: end while
Time complexity: O(nnz(X)) per iteration.
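A direct numpy transcription of this pseudocode with a fixed step size supplied by the caller; the α = 1/L choice in the usage comment, with L = ||X||₂² + λ, is one safe option, not the slide's prescription:

import numpy as np

def gd_ridge(X, y, lam, alpha, n_iter=500):
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        g = X.T @ (X @ w - y) + lam * w    # gradient, O(nnz(X)) with a sparse X
        w = w - alpha * g                  # gradient step with fixed step size
    return w

# Example usage with a conservative step size alpha < 1/L:
# L = np.linalg.norm(X, 2) ** 2 + lam      # L = lambda_max(X^T X) + lam
# w = gd_ridge(X, y, lam, alpha=1.0 / L)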

Proximal Gradient Descent
How can we apply gradient descent to solve the Lasso problem?
  argmin_w (1/2)||Xw − y||^2 + λ||w||_1   (the second term is non-differentiable)
General composite function minimization:
  argmin_x f(x) := g(x) + h(x)
where g is smooth and convex, and h is convex but may be non-differentiable.
Usually we assume h is simple (for computational efficiency).

Proximal Gradient Descent: successive approximation
At each iteration, form an approximation of f(·):
  f(x_t + d) ≈ f̄_{x_t}(d) := g(x_t) + ∇g(x_t)^T d + (1/2) d^T ((1/α) I) d + h(x_t + d)
                           = g(x_t) + ∇g(x_t)^T d + (1/(2α)) d^T d + h(x_t + d)
Update the solution by x_{t+1} ← x_t + argmin_d f̄_{x_t}(d).
This is called proximal gradient descent.
Usually d* = argmin_d f̄_{x_t}(d) has a closed-form solution.

Proximal Gradient Descent: ℓ1-regularization
The subproblem:
  x_{t+1} = x_t + argmin_d { ∇g(x_t)^T d + (1/(2α)) d^T d + λ||x_t + d||_1 }
          = argmin_u { (1/2)||u − (x_t − α ∇g(x_t))||^2 + λα ||u||_1 }
          = S(x_t − α ∇g(x_t), αλ),
where S is the soft-thresholding operator defined by
  S(a, z) = a − z   if a > z
            a + z   if a < −z
            0       if a ∈ [−z, z]
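The soft-thresholding operator has a one-line numpy implementation (applied elementwise), shown here as an illustration:

import numpy as np

def soft_threshold(a, z):
    # S(a, z): a - z if a > z, a + z if a < -z, 0 otherwise (elementwise)
    return np.sign(a) * np.maximum(np.abs(a) - z, 0.0)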

Proximal Gradient: soft-thresholding
(Figure from http://jocelynchi.com/soft-thresholding-operator-and-the-lasso-solution/)

Proximal Gradient Descent for Lasso
Input: X ∈ R^{N×d}, y ∈ R^N, initial w^(0)
Output: solution of w* = argmin_w (1/2)||Xw − y||^2 + λ||w||_1
1: t = 0
2: while not converged do
3:   Compute the gradient g = X^T(Xw − y)
4:   Choose step size α_t
5:   Update w ← S(w − α_t g, α_t λ)
6:   t ← t + 1
7: end while
Time complexity: O(nnz(X)) per iteration.
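A compact numpy sketch of this loop (ISTA-style), with the soft-thresholding operator inlined and a fixed step size alpha assumed to be valid (e.g., alpha ≤ 1/λ_max(X^T X)):

import numpy as np

def prox_grad_lasso(X, y, lam, alpha, n_iter=500):
    soft = lambda a, z: np.sign(a) * np.maximum(np.abs(a) - z, 0.0)  # soft-thresholding S(a, z)
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        g = X.T @ (X @ w - y)                        # gradient of the smooth part g(w) only
        w = soft(w - alpha * g, alpha * lam)         # proximal (soft-thresholding) step
    return w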

Newton's Method

Newton's Method
Iteratively conduct the following update, where α is the step size:
  x ← x − α ∇²f(x)^{-1} ∇f(x)
If α = 1: converges quadratically when x_t is close enough to x*:
  ||x_{t+1} − x*|| ≤ K ||x_t − x*||² for some constant K.
This means the error f(x_t) − f(x*) decays quadratically: β, β², β⁴, β⁸, β¹⁶, ...
Only a few iterations are needed to converge once inside this quadratic convergence region.
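A generic sketch of the damped Newton update; grad_f and hess_f are assumed to be user-supplied callables returning the gradient and the full d×d Hessian, so each step solves a linear system as discussed on the following slides:

import numpy as np

def newton(grad_f, hess_f, x0, alpha=1.0, n_iter=20):
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        g = grad_f(x)
        H = hess_f(x)
        x = x - alpha * np.linalg.solve(H, g)   # Newton direction H^{-1} g, O(d^3) per step
    return x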

Newton's Method
However, Newton's update rule is more expensive than gradient descent / coordinate descent.

Newton's Method
Need to compute ∇²f(x)^{-1} ∇f(x).
Closed-form solution: O(d³) for solving a d-dimensional linear system.
Usually solved by another iterative solver:
  gradient descent
  coordinate descent
  conjugate gradient method
  ...
Examples: primal L2-SVM / logistic regression, ℓ1-regularized logistic regression, ...

Stochastic Gradient Method

Stochastic Gradient Method: Motivation
Widely used for machine learning problems (with a large number of samples).
Given training samples x_1, ..., x_n, we usually want to solve the following empirical risk minimization (ERM) problem:
  argmin_w Σ_{i=1}^n ℓ_i(w)
where each ℓ_i(·) is the loss function on sample x_i.
Minimize the summation of the individual loss on each sample.

Stochastic Gradient Method
Assume the objective function can be written as f(x) = (1/n) Σ_{i=1}^n f_i(x).
Stochastic gradient method: iteratively conduct the following updates:
  1. Choose an index i uniformly at random.
  2. x_{t+1} ← x_t − η_t ∇f_i(x_t), where η_t > 0 is the step size.
Why does SG work?  E_i[∇f_i(x)] = (1/n) Σ_{i=1}^n ∇f_i(x) = ∇f(x).
Is it a fixed-point method? No if η > 0, because x* − η ∇f_i(x*) ≠ x* in general.
Is it a descent method? No, because f(x_{t+1}) ≤ f(x_t) is not guaranteed.
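A generic numpy sketch of this loop; grad_fi(x, i) is assumed to return ∇f_i(x) for a single component, and the decaying step size follows the η_t = C t^{-a} form mentioned on the next slide:

import numpy as np

def sgd(grad_fi, x0, n, C=0.1, a=0.5, n_iter=10000, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for t in range(1, n_iter + 1):
        i = rng.integers(n)                 # 1. pick an index i uniformly at random
        eta_t = C / t ** a                  # decaying step size eta_t = C * t^{-a}
        x = x - eta_t * grad_fi(x, i)       # 2. x_{t+1} <- x_t - eta_t * grad f_i(x_t)
    return x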

Stochastic Gradient
The step size η_t has to decay to 0 (e.g., η_t = C t^{-a} for some constants a, C).
SGD converges sub-linearly.
Many variants have been proposed recently:
  SVRG, SAGA (2013, 2014): variance reduction
  AdaGrad (2011): adaptive learning rate
  RMSProp (2012): estimate the learning rate by a running average of gradients
  Adam (2015): adaptive moment estimation
Widely used in machine learning.

SGD in Deep Learning
AdaGrad: adaptive step size for each parameter.
Update rule at the t-th iteration:
  Compute g = ∇f(x)
  Estimate the second moment: G_ii ← G_ii + g_i² for all i
  Parameter update: x_i ← x_i − (η / √G_ii) g_i for all i
Proposed for convex optimization in:
  Adaptive Subgradient Methods for Online Learning and Stochastic Optimization (JMLR 2011)
  Adaptive Bound Optimization for Online Convex Optimization (COLT 2010)
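A per-coordinate AdaGrad sketch matching this update; the small eps term is a standard numerical safeguard that is not spelled out on the slide, and grad_f may return either a full or a stochastic gradient:

import numpy as np

def adagrad(grad_f, x0, eta=0.1, n_iter=1000, eps=1e-8):
    x = np.asarray(x0, dtype=float)
    G = np.zeros_like(x)                     # accumulated squared gradients G_ii
    for _ in range(n_iter):
        g = grad_f(x)                        # g = grad f(x)
        G += g ** 2                          # G_ii <- G_ii + g_i^2 for all i
        x = x - eta * g / (np.sqrt(G) + eps) # x_i <- x_i - eta / sqrt(G_ii) * g_i
    return x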

Stochastic Gradient: applying to ridge regression
Objective function:
  argmin_w (1/n) Σ_{i=1}^n (w^T x_i − y_i)² + λ||w||²
How do we write this as argmin_w (1/n) Σ_{i=1}^n f_i(w)?
First approach: f_i(w) = (w^T x_i − y_i)² + λ||w||²
Update rule:
  w_{t+1} ← w_t − 2η_t (w^T x_i − y_i) x_i − 2η_t λ w = (1 − 2η_t λ) w − 2η_t (w^T x_i − y_i) x_i
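The derived update rule written out in numpy as a sketch; the step-size constants C and a are placeholder choices:

import numpy as np

def sgd_ridge(X, y, lam, C=0.01, a=0.5, n_iter=10000, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, n_iter + 1):
        i = rng.integers(n)                          # sample one index uniformly
        eta = C / t ** a                             # decaying step size
        resid = X[i] @ w - y[i]                      # w^T x_i - y_i
        w = (1.0 - 2.0 * eta * lam) * w - 2.0 * eta * resid * X[i]   # derived update rule
    return w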

Coming up
Other classification models.
Questions?