ECS289: Scalable Machine Learning
1 ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016
2 Outline Convex vs. Nonconvex Functions, Coordinate Descent, Gradient Descent, Newton's Method, Stochastic Gradient Descent
3 Numerical Optimization Numerical optimization: min_x f(x). Can be applied to computer science, economics, control engineering, operations research, ... Machine learning: find a model that minimizes the prediction error.
4 Properties of the Function Smooth function: a function with continuous derivative. Example: ridge regression: argmin_w (1/2)‖Xw − y‖² + (λ/2)‖w‖². Non-smooth functions: Lasso, primal SVM. Lasso: argmin_w (1/2)‖Xw − y‖² + λ‖w‖₁. SVM: argmin_w Σ_{i=1}^n max(0, 1 − y_i wᵀx_i) + (λ/2)‖w‖².
5 Convex Functions A function f is convex if for all x₁, x₂ and t ∈ [0, 1], f(tx₁ + (1 − t)x₂) ≤ t f(x₁) + (1 − t) f(x₂). Every local optimum is a global optimum (why?). Figure from Wikipedia.
6 Convex Functions If f(x) is twice differentiable, then f is convex if and only if ∇²f(x) ⪰ 0. The optimal solution may not be unique: there is a set of optimal solutions S. Gradient: captures the first-order change of f: f(x + αd) = f(x) + α∇f(x)ᵀd + O(α²). If f is convex and differentiable, we have the following optimality condition: x ∈ S if and only if ∇f(x) = 0.
7 Strongly Convex Functions f is strongly convex if there exists an m > 0 such that f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) + (m/2)‖y − x‖₂² for all x, y. A strongly convex function has a unique global optimum x* (why?). If f is twice differentiable, then f is strongly convex if and only if ∇²f(x) ⪰ mI for all x. Gradient descent and coordinate descent converge linearly on strongly convex functions (will see later).
8 Nonconvex Functions If f is nonconvex, most algorithms can only converge to stationary points. x̄ is a stationary point if and only if ∇f(x̄) = 0. Three types of stationary points: (1) global optimum, (2) local optimum, (3) saddle point. Examples: matrix completion, neural networks, ... Example: f(x, y) = (1/2)(xy − a)².
9 Coordinate Descent
10 Coordinate Descent Update one variable at a time. Coordinate descent: repeatedly perform the following loop. Step 1: pick an index i. Step 2: compute a step size δ* by (approximately) solving δ* = argmin_δ f(x + δe_i). Step 3: x_i ← x_i + δ*.
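As a concrete illustration, here is a minimal sketch of the three-step loop above, using cyclic order and exact one-dimensional minimization on a quadratic test problem of my own choosing, f(x) = ½xᵀAx − bᵀx (for a quadratic, the 1-D subproblem has a closed form):

```python
import numpy as np

def coordinate_descent(A, b, n_iters=100):
    """Cyclic coordinate descent on f(x) = 0.5 x^T A x - b^T x, A positive definite."""
    n = len(b)
    x = np.zeros(n)
    for _ in range(n_iters):
        for i in range(n):  # cyclic order: x_1, x_2, ..., x_n
            # exact 1-D minimization: set the partial derivative w.r.t. x_i to zero
            x[i] = (b[i] - A[i] @ x + A[i, i] * x[i]) / A[i, i]
    return x

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x = coordinate_descent(A, b)
# x converges to the solution of Ax = b
```

Each inner step only changes one coordinate, so its cost is one row-vector product rather than a full gradient evaluation.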
11-12 Coordinate Descent (update sequence) Three types of updating order: (1) Cyclic: update sequence x₁, x₂, ..., xₙ (1st outer iteration), x₁, x₂, ..., xₙ (2nd outer iteration), ... (2) Randomly permuted: randomly permute the sequence for each outer iteration (faster convergence in practice); some interesting theoretical analysis for this recently: C.-P. Lee and S. J. Wright, "Random Permutations Fix a Worst Case for Cyclic Coordinate Descent," 2016. (3) Random: each time pick a random coordinate to update. Typical way: sample from the uniform distribution. Uniform vs. biased sampling: P. Zhao and T. Zhang, "Stochastic Optimization with Importance Sampling for Regularized Loss Minimization," ICML 2015; D. Csiba, Z. Qu and P. Richtarik, "Stochastic Dual Coordinate Ascent with Adaptive Probabilities," ICML 2015.
13 Greedy Coordinate Descent Greedy: choose the most important coordinate to update. How to measure the importance? By the first derivative: |∇_i f(x)|. By the first and second derivatives: |∇_i f(x)| / ∇²_{ii} f(x). By the maximum reduction of the objective function: i* = argmax_{i=1,...,n} (f(x) − min_δ f(x + δe_i)). Need to consider the time complexity of variable selection. Useful for kernel SVM (see lecture 6).
14 Extension: block coordinate descent Variables are divided into blocks {X₁, ..., X_p}, where each X_i is a subset of variables, X₁ ∪ X₂ ∪ ... ∪ X_p = {1, ..., n}, and X_i ∩ X_j = ∅ for i ≠ j. Each time, update one block X_i by (approximately) solving the subproblem within the block. Example: alternating minimization for matrix completion (2 blocks). (See lecture 7.)
15 Coordinate Descent (convergence) Converges to the optimum if f(·) is convex and smooth. Has a linear convergence rate if f(·) is strongly convex. Linear convergence: the error f(x_t) − f(x*) decays as β, β², β³, ... for some β < 1. Local linear convergence: the algorithm converges linearly once ‖x − x*‖ ≤ K for some K > 0.
16 Coordinate Descent: other names Alternating minimization (matrix completion). Iterative scaling (for log-linear models). Decomposition method (for kernel SVM). Gauss-Seidel (for linear systems when the matrix is positive definite). ...
17 Gradient Descent
18 Gradient Descent Gradient descent algorithm: repeatedly conduct the update x_{t+1} ← x_t − α∇f(x_t), where α > 0 is the step size. It is a fixed-point iteration method: x − α∇f(x) = x if and only if x is an optimal solution. Step size too large → divergence; too small → slow convergence.
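The fixed-step update above can be sketched in a few lines; the least-squares objective and data here are my own toy example, not from the slides:

```python
import numpy as np

def gradient_descent(grad, x0, alpha=0.2, n_iters=500):
    """Fixed-step gradient descent: x_{t+1} = x_t - alpha * grad(x_t)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = x - alpha * grad(x)
    return x

# f(x) = 0.5 * ||Ax - b||^2, so grad f(x) = A^T (Ax - b)
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([2.0, 3.0])
x = gradient_descent(lambda x: A.T @ (A @ x - b), np.zeros(2))
# x converges to the solution of Ax = b, here [1, 3]
```

With alpha = 0.2 the iteration contracts (here the Lipschitz constant of the gradient is 4, and 0.2 < 2/4); doubling alpha past 2/L would make it diverge, matching the step-size remark above.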
19 Gradient Descent: successive approximation At each iteration, form an approximation of f(·): f(x_t + d) ≈ f̄_{x_t}(d) := f(x_t) + ∇f(x_t)ᵀd + (1/2)dᵀ(α⁻¹I)d = f(x_t) + ∇f(x_t)ᵀd + (1/2α)dᵀd. Update the solution by x_{t+1} ← x_t + argmin_d f̄_{x_t}(d). The minimizer of argmin_d f̄_{x_t}(d) is d* = −α∇f(x_t). d* decreases the original objective function f if the step size α is small enough.
20 Gradient Descent: successive approximation The function value is guaranteed to decrease if: Condition 1: f̄_x(d) ≥ f(x + d) for all d. Condition 2: f̄_x(0) = f(x). Why? f(x_t + d*) ≤ f̄_{x_t}(d*) ≤ f̄_{x_t}(0) = f(x_t). Condition 2 is satisfied by the construction of f̄_{x_t}. Condition 1 is satisfied if (1/α)I ⪰ ∇²f(x) for all x.
21 Gradient Descent: step size A twice-differentiable function has an L-Lipschitz continuous gradient if and only if ∇²f(x) ⪯ LI for all x. In this case, Condition 1 is satisfied if α ≤ 1/L. Theorem: gradient descent converges if α < 1/L. Theorem: gradient descent with α < 1/L converges linearly if f is strongly convex.
22 Gradient Descent In practice, we do not know L Step size α too large: the algorithm diverges Step size α too small: the algorithm converges very slowly
23 Gradient Descent: line search d is a descent direction if and only if dᵀ∇f(x) < 0. Armijo-rule backtracking line search: try α = 1, 1/2, 1/4, ... until it satisfies f(x + αd) ≤ f(x) + γα dᵀ∇f(x), where 0 < γ < 1.
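The Armijo rule above translates directly to code. A minimal sketch (the quadratic test function and the value γ = 0.3 are my own choices):

```python
import numpy as np

def backtracking(f, grad_f, x, d, gamma=0.3, max_halvings=50):
    """Armijo backtracking: halve alpha until the sufficient-decrease condition
       f(x + alpha*d) <= f(x) + gamma * alpha * d^T grad_f(x) holds."""
    g_dot_d = grad_f(x) @ d          # must be negative for a descent direction
    alpha = 1.0
    for _ in range(max_halvings):
        if f(x + alpha * d) <= f(x) + gamma * alpha * g_dot_d:
            return alpha
        alpha *= 0.5
    return alpha

f = lambda x: 0.5 * x @ x
grad_f = lambda x: x
x = np.array([4.0, 0.0])
alpha = backtracking(f, grad_f, x, -grad_f(x))  # negative gradient direction
```

On this well-scaled quadratic the very first trial α = 1 already satisfies the condition, which is why trying α = 1 first is the standard choice.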
24 Gradient Descent: line search Gradient descent with line search: Converges to optimal solutions if f is smooth Converges linearly if f is strongly convex However, each iteration requires evaluating f several times Several other step-size selection approaches (an ongoing research topic, especially for stochastic gradient descent)
25 Gradient Descent: applying to ridge regression
Input: X ∈ R^{N×d}, y ∈ R^N, initial w^(0). Output: solution w* := argmin_w (1/2)‖Xw − y‖² + (λ/2)‖w‖².
1: t = 0
2: while not converged do
3: Compute the gradient g = Xᵀ(Xw − y) + λw
4: Choose step size α_t
5: Update w ← w − α_t g
6: t ← t + 1
7: end while
Time complexity: O(nnz(X)) per iteration.
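The ridge-regression loop above, with a fixed step size and random toy data of my own (the closed-form solution (XᵀX + λI)⁻¹Xᵀy serves only as a sanity check):

```python
import numpy as np

def ridge_gd(X, y, lam=0.1, alpha=0.01, n_iters=2000):
    """Gradient descent for min_w 0.5*||Xw - y||^2 + 0.5*lam*||w||^2."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        g = X.T @ (X @ w - y) + lam * w  # gradient; O(nnz(X)) when X is sparse
        w -= alpha * g                   # fixed step size for simplicity
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
y = rng.standard_normal(50)
w = ridge_gd(X, y)
# compare against the closed form (X^T X + lam*I)^{-1} X^T y
```

The two matrix-vector products in the gradient are the whole per-iteration cost, which is the O(nnz(X)) claim on the slide.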
26 Proximal Gradient Descent How can we apply gradient descent to solve the Lasso problem? argmin_w (1/2)‖Xw − y‖² + λ‖w‖₁, where the second term is non-differentiable. General composite function minimization: argmin_x f(x) := g(x) + h(x), where g is smooth and convex, and h is convex but may be non-differentiable. Usually assume h is simple (for computational efficiency).
27 Proximal Gradient Descent: successive approximation At each iteration, form an approximation of f(·): f(x_t + d) ≈ f̄_{x_t}(d) := g(x_t) + ∇g(x_t)ᵀd + (1/2)dᵀ(α⁻¹I)d + h(x_t + d) = g(x_t) + ∇g(x_t)ᵀd + (1/2α)dᵀd + h(x_t + d). Update the solution by x_{t+1} ← x_t + argmin_d f̄_{x_t}(d). This is called proximal gradient descent. Usually d* = argmin_d f̄_{x_t}(d) has a closed-form solution.
28 Proximal Gradient Descent: ℓ₁-regularization The subproblem: x_{t+1} = x_t + argmin_d ∇g(x_t)ᵀd + (1/2α)dᵀd + λ‖x_t + d‖₁ = argmin_u (1/2)‖u − (x_t − α∇g(x_t))‖² + λα‖u‖₁ = S(x_t − α∇g(x_t), αλ), where S is the soft-thresholding operator defined by S(a, z) = a − z if a > z; a + z if a < −z; 0 if a ∈ [−z, z].
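The three-case definition of S above collapses to one vectorized expression:

```python
import numpy as np

def soft_threshold(a, z):
    """Elementwise soft-thresholding S(a, z):
       a - z if a > z, a + z if a < -z, and 0 if a is in [-z, z]."""
    return np.sign(a) * np.maximum(np.abs(a) - z, 0.0)

out = soft_threshold(np.array([3.0, -0.5, -2.0]), 1.0)
# entries with |a| <= z are zeroed out; the rest shrink toward 0 by z
```

This is the closed-form prox step that makes each proximal-gradient iteration as cheap as a plain gradient step.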
29 Proximal Gradient: soft-thresholding (figure: plot of the soft-thresholding operator)
30 Proximal Gradient Descent for Lasso
Input: X ∈ R^{N×d}, y ∈ R^N, initial w^(0). Output: solution w* := argmin_w (1/2)‖Xw − y‖² + λ‖w‖₁.
1: t = 0
2: while not converged do
3: Compute the gradient g = Xᵀ(Xw − y)
4: Choose step size α_t
5: Update w ← S(w − α_t g, α_t λ)
6: t ← t + 1
7: end while
Time complexity: O(nnz(X)) per iteration.
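The Lasso loop above (often called ISTA) in code, with the standard step size α = 1/L where L is the largest eigenvalue of XᵀX; the sparse test problem is my own toy example:

```python
import numpy as np

def soft_threshold(a, z):
    return np.sign(a) * np.maximum(np.abs(a) - z, 0.0)

def lasso_pgd(X, y, lam=0.5, n_iters=3000):
    """Proximal gradient descent for min_w 0.5*||Xw - y||^2 + lam*||w||_1."""
    alpha = 1.0 / np.linalg.eigvalsh(X.T @ X)[-1]  # step 1/L
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        g = X.T @ (X @ w - y)                          # gradient of smooth part
        w = soft_threshold(w - alpha * g, alpha * lam)  # prox step
    return w

rng = np.random.default_rng(1)
X = rng.standard_normal((40, 5))
w_true = np.array([2.0, 0.0, 0.0, -1.0, 0.0])
y = X @ w_true                         # noiseless labels
w = lasso_pgd(X, y)
# recovered w is sparse: near-zero where w_true is zero
```

With noiseless data and a small λ, the recovered w matches the true support, with the usual small shrinkage bias on the nonzero coordinates.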
31 Newton's Method
32 Newton's Method Iteratively conduct the update x ← x − α(∇²f(x))⁻¹∇f(x), where α is the step size. If α = 1: converges quadratically when x_t is close enough to x*: ‖x_{t+1} − x*‖ ≤ K‖x_t − x*‖² for some constant K. This means the error f(x_t) − f(x*) decays quadratically: β, β², β⁴, β⁸, β¹⁶, ... Only a few iterations are needed to converge inside this quadratic convergence region.
33 Newton's Method However, Newton's update rule is more expensive per iteration than gradient descent or coordinate descent.
34 Newton's Method Need to compute (∇²f(x))⁻¹∇f(x). Closed-form solution: O(d³) for solving a d-dimensional linear system. Usually solved by another iterative solver: gradient descent, coordinate descent, conjugate gradient method, ... Examples: primal L2-SVM/logistic regression, ℓ₁-regularized logistic regression, ...
35 Newton's Method At each iteration, form an approximation of f(·): f(x_t + d) ≈ f̄_{x_t}(d) := f(x_t) + ∇f(x_t)ᵀd + (1/2α)dᵀ∇²f(x_t)d. Update the solution by x_{t+1} ← x_t + argmin_d f̄_{x_t}(d). When x is far away from x*, line search is needed to guarantee convergence. Assume mI ⪯ ∇²f(x) ⪯ LI for all x; then α ≤ m/L guarantees that the objective function value decreases, because (L/m)∇²f(x) ⪰ ∇²f(y) for all x, y. In practice, we often just use line search.
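To see the quadratic convergence discussed above, here is a pure Newton iteration (α = 1) on a one-dimensional strictly convex test function of my own choosing, f(x) = eˣ + e⁻²ˣ, whose minimizer is x* = ln(2)/3:

```python
import math

def newton_1d(f_prime, f_double_prime, x0, n_iters=20):
    """Pure Newton's method (alpha = 1): x <- x - f'(x) / f''(x)."""
    x = x0
    for _ in range(n_iters):
        x -= f_prime(x) / f_double_prime(x)
    return x

# minimize f(x) = exp(x) + exp(-2x); setting f'(x) = 0 gives x* = ln(2)/3
fp  = lambda x: math.exp(x) - 2.0 * math.exp(-2.0 * x)   # f'
fpp = lambda x: math.exp(x) + 4.0 * math.exp(-2.0 * x)   # f'' > 0 everywhere
x_star = newton_1d(fp, fpp, x0=0.0)
```

Starting from x₀ = 0, the number of correct digits roughly doubles per iteration, so 20 iterations reach machine precision; far from the minimizer, a damped step (α < 1 or line search) would be the safe variant.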
36 Proximal Newton What if f(x) = g(x) + h(x) and h(x) is non-smooth (e.g., h(x) = ‖x‖₁)? At each iteration, form an approximation of f(·): f(x_t + d) ≈ f̄_{x_t}(d) := g(x_t) + ∇g(x_t)ᵀd + (1/2)dᵀ∇²g(x_t)d + h(x_t + d). Update the solution by x_{t+1} ← x_t + argmin_d f̄_{x_t}(d). Need another iterative solver for the subproblem.
37 Stochastic Gradient Method
38 Stochastic Gradient Method: Motivation Widely used for machine learning problems (with a large number of samples). Given training samples x₁, ..., xₙ, we usually want to solve the following empirical risk minimization (ERM) problem: argmin_w Σ_{i=1}^n ℓ_i(w), where ℓ_i(·) is the loss function on sample x_i. Minimize the sum of the individual losses on the samples.
39-41 Stochastic Gradient Method Assume the objective function can be written as f(x) = (1/n)Σ_{i=1}^n f_i(x). Stochastic gradient (SG) method: iteratively conduct the following updates: 1. Choose an index i uniformly at random. 2. x_{t+1} ← x_t − η_t ∇f_i(x_t), where η_t > 0 is the step size. Why does SG work? E_i[∇f_i(x)] = (1/n)Σ_{i=1}^n ∇f_i(x) = ∇f(x). Is it a fixed-point method? No: for η > 0, x* − η∇f_i(x*) ≠ x* in general. Is it a descent method? No: f(x_{t+1}) ≤ f(x_t) does not always hold.
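The stochastic gradient update above, sketched for a least-squares objective with a decaying step size (the data, seed, and decay schedule are my own toy choices):

```python
import numpy as np

def sgd_least_squares(X, y, eta0=0.05, n_iters=5000, seed=0):
    """SGD for f(w) = (1/n) * sum_i (w^T x_i - y_i)^2.
       Each step uses the gradient of one randomly chosen f_i."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(n_iters):
        i = rng.integers(n)                        # uniform random index
        eta = eta0 / (1.0 + 0.01 * t)              # decaying step size
        w -= eta * 2.0 * (X[i] @ w - y[i]) * X[i]  # gradient of f_i
    return w

rng = np.random.default_rng(42)
X = rng.standard_normal((40, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                                     # noiseless labels
w = sgd_least_squares(X, y)
```

Each iteration touches a single sample, so the per-step cost is O(d) regardless of n; that is the whole appeal when n is large.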
42 Stochastic Gradient The step size η has to decay to 0 (e.g., η_t = C t^{−a} for some constants a, C). SGD converges sublinearly. Many variants have been proposed recently: SVRG, SAGA (2013, 2014): variance reduction. AdaGrad (2011): adaptive learning rate. RMSProp (2012): estimate the learning rate from a running average of the gradient. Adam (2015): adaptive moment estimation. Widely used in machine learning.
43-44 SGD in Deep Learning AdaGrad: adaptive step size for each parameter. Update rule at the t-th iteration: compute g = ∇f(x); accumulate the second moment G_ii ← G_ii + g_i² for all i; parameter update x_i ← x_i − (η/√G_ii) g_i for all i. Proposed for convex optimization in: "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization" (JMLR 2011); "Adaptive Bound Optimization for Online Convex Optimization" (COLT 2010). Adam (Kingma and Ba, 2015): maintain both first-moment and second-moment estimates. Update rule at the t-th iteration: compute g = ∇f(x); m ← β₁m + (1 − β₁)g, m̂ = m/(1 − β₁ᵗ) (bias-corrected estimate of the first moment); v ← β₂v + (1 − β₂)g², v̂ = v/(1 − β₂ᵗ) (bias-corrected estimate of the second moment); x ← x − η m̂/(√v̂ + ε). All operations are element-wise.
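The Adam update above as one element-wise step function, with the usual default hyperparameters; the toy quadratic it is driven on, and the larger η = 0.01 used there, are my own choices for illustration:

```python
import numpy as np

def adam_step(x, g, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (all operations element-wise); t starts at 1."""
    m = beta1 * m + (1 - beta1) * g          # first-moment running average
    v = beta2 * v + (1 - beta2) * g * g      # second-moment running average
    m_hat = m / (1 - beta1 ** t)             # bias corrections
    v_hat = v / (1 - beta2 ** t)
    x = x - eta * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v

# drive it on f(x) = 0.5 * ||x||^2, whose gradient is simply x
x = np.array([1.0, -3.0])
m = np.zeros(2)
v = np.zeros(2)
for t in range(1, 5001):
    x, m, v = adam_step(x, x, m, v, t, eta=0.01)
```

Note the effective per-coordinate step is roughly η·sign(gradient) when gradients are consistent, since m̂/√v̂ is scale-free; that is what makes the method robust to per-parameter gradient magnitudes.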
45 Stochastic Gradient: Variance Reduction SGD is not a fixed-point method: even if x* is optimal, x* − η∇f_i(x*) ≠ x*. Reason: E[∇f_i(x*)] = ∇f(x*) = 0, but the variance can be large. Variance reduction: reduce the variance so that the variance → 0 as t → ∞.
46 SVRG SVRG (Johnson and Zhang, 2013):
For s = 1, 2, ... (outer loop):
1. x̃ = x
2. g̃ = (1/n)Σ_{i=1}^n ∇f_i(x̃)
3. For t = 1, 2, ... (inner loop): randomly pick i; x ← x − η a_i, where a_i = ∇f_i(x) − ∇f_i(x̃) + g̃.
Can easily show E[a_i] = ∇f(x) (unbiased), and Var[a_i] = 0 when x = x̃ = x*. Has a linear convergence rate.
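A minimal sketch of the two-loop SVRG structure above for a least-squares objective (the step size, epoch lengths, and noiseless toy data are my own choices):

```python
import numpy as np

def svrg(X, y, eta=0.02, n_outer=30, seed=0):
    """SVRG for f(w) = (1/n) * sum_i (x_i^T w - y_i)^2."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_outer):
        w_snap = w.copy()                          # snapshot x~
        g_full = 2.0 * X.T @ (X @ w_snap - y) / n  # full gradient at snapshot
        for _ in range(2 * n):                     # inner loop
            i = rng.integers(n)
            gi      = 2.0 * (X[i] @ w      - y[i]) * X[i]
            gi_snap = 2.0 * (X[i] @ w_snap - y[i]) * X[i]
            w -= eta * (gi - gi_snap + g_full)     # variance-reduced step a_i
    return w

rng = np.random.default_rng(3)
X = rng.standard_normal((40, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true
w = svrg(X, y)
```

The control variate ∇f_i(x̃) − g̃ leaves each step unbiased but shrinks its variance as both w and the snapshot approach w*, which is why a constant step size suffices here (unlike plain SGD).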
47-50 Stochastic Gradient: applying to ridge regression Objective function: argmin_w (1/n)Σ_{i=1}^n (wᵀx_i − y_i)² + λ‖w‖². How to write this as argmin_w (1/n)Σ_{i=1}^n f_i(w)? First approach: f_i(w) = (wᵀx_i − y_i)² + λ‖w‖². Update rule: w_{t+1} ← w_t − 2η_t(wᵀx_i − y_i)x_i − 2η_t λw = (1 − 2η_t λ)w − 2η_t(wᵀx_i − y_i)x_i. This needs O(d) work per iteration even if the data is sparse (the shrinkage touches every coordinate). Solution: store w = s·v, where s is a scalar.
51-52 Stochastic Gradient: applying to ridge regression Second approach: define Ω_i = {j : X_ij ≠ 0} for i = 1, ..., n; define n_j = |{i : X_ij ≠ 0}| for j = 1, ..., d; define f_i(w) = (wᵀx_i − y_i)² + Σ_{j∈Ω_i} (λn/n_j) w_j². Update rule when selecting index i: w_j^{t+1} ← w_j^t − 2η_t(x_iᵀw^t − y_i)X_ij − 2η_t(λn/n_j)w_j^t for j ∈ Ω_i. The update can be done in O(|Ω_i|) operations.
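The sparse update rule above in code, touching only the nonzero coordinates Ω_i; the dense toy vectors here stand in for a real sparse format, and the numbers are my own example:

```python
import numpy as np

def sparse_sgd_ridge_update(w, x_i, y_i, eta, lam, n, n_j):
    """One SGD step for the decomposition
       f_i(w) = (x_i^T w - y_i)^2 + sum_{j in Omega_i} (lam*n/n_j) * w_j^2.
       Only coordinates in Omega_i (nonzeros of x_i) are read or written."""
    omega = np.nonzero(x_i)[0]                   # Omega_i
    r = x_i[omega] @ w[omega] - y_i              # residual via sparse dot product
    w[omega] -= eta * (2.0 * r * x_i[omega]
                       + 2.0 * (lam * n / n_j[omega]) * w[omega])
    return w

w = sparse_sgd_ridge_update(np.array([1.0, 2.0, 3.0]),
                            np.array([1.0, 0.0, 2.0]),  # x_i, so Omega_i = {0, 2}
                            y_i=0.0, eta=0.1, lam=0.5, n=4,
                            n_j=np.array([2.0, 1.0, 4.0]))
# w[1] is left untouched, as the slide's O(|Omega_i|) claim requires
```

Averaging over i recovers the original objective: coordinate j appears in exactly n_j of the f_i, so (1/n)·n_j·(λn/n_j)·w_j² = λw_j².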
53 Coming up Next class: Parallel Optimization Questions?
More informationOptimization for Machine Learning
Optimization for Machine Learning (Lecture 3-A - Convex) SUVRIT SRA Massachusetts Institute of Technology Special thanks: Francis Bach (INRIA, ENS) (for sharing this material, and permitting its use) MPI-IS
More informationNonlinear Optimization Methods for Machine Learning
Nonlinear Optimization Methods for Machine Learning Jorge Nocedal Northwestern University University of California, Davis, Sept 2018 1 Introduction We don t really know, do we? a) Deep neural networks
More informationTopics we covered. Machine Learning. Statistics. Optimization. Systems! Basics of probability Tail bounds Density Estimation Exponential Families
Midterm Review Topics we covered Machine Learning Optimization Basics of optimization Convexity Unconstrained: GD, SGD Constrained: Lagrange, KKT Duality Linear Methods Perceptrons Support Vector Machines
More informationStochastic Analogues to Deterministic Optimizers
Stochastic Analogues to Deterministic Optimizers ISMP 2018 Bordeaux, France Vivak Patel Presented by: Mihai Anitescu July 6, 2018 1 Apology I apologize for not being here to give this talk myself. I injured
More informationLARGE-SCALE NONCONVEX STOCHASTIC OPTIMIZATION BY DOUBLY STOCHASTIC SUCCESSIVE CONVEX APPROXIMATION
LARGE-SCALE NONCONVEX STOCHASTIC OPTIMIZATION BY DOUBLY STOCHASTIC SUCCESSIVE CONVEX APPROXIMATION Aryan Mokhtari, Alec Koppel, Gesualdo Scutari, and Alejandro Ribeiro Department of Electrical and Systems
More informationVariables. Cho-Jui Hsieh The University of Texas at Austin. ICML workshop on Covariance Selection Beijing, China June 26, 2014
for a Million Variables Cho-Jui Hsieh The University of Texas at Austin ICML workshop on Covariance Selection Beijing, China June 26, 2014 Joint work with M. Sustik, I. Dhillon, P. Ravikumar, R. Poldrack,
More informationImportance Sampling for Minibatches
Importance Sampling for Minibatches Dominik Csiba School of Mathematics University of Edinburgh 07.09.2016, Birmingham Dominik Csiba (University of Edinburgh) Importance Sampling for Minibatches 07.09.2016,
More informationMachine Learning. Support Vector Machines. Fabio Vandin November 20, 2017
Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training
More informationOptimization methods
Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,
More informationClassification Logistic Regression
Announcements: Classification Logistic Regression Machine Learning CSE546 Sham Kakade University of Washington HW due on Friday. Today: Review: sub-gradients,lasso Logistic Regression October 3, 26 Sham
More informationSupport Vector Machine
Andrea Passerini passerini@disi.unitn.it Machine Learning Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)
More informationLecture 25: November 27
10-725: Optimization Fall 2012 Lecture 25: November 27 Lecturer: Ryan Tibshirani Scribes: Matt Wytock, Supreeth Achar Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes have
More informationStochastic Optimization Algorithms Beyond SG
Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods
More informationOptimization for Training I. First-Order Methods Training algorithm
Optimization for Training I First-Order Methods Training algorithm 2 OPTIMIZATION METHODS Topics: Types of optimization methods. Practical optimization methods breakdown into two categories: 1. First-order
More informationSub-Sampled Newton Methods
Sub-Sampled Newton Methods F. Roosta-Khorasani and M. W. Mahoney ICSI and Dept of Statistics, UC Berkeley February 2016 F. Roosta-Khorasani and M. W. Mahoney (UCB) Sub-Sampled Newton Methods Feb 2016 1
More informationAccelerate Subgradient Methods
Accelerate Subgradient Methods Tianbao Yang Department of Computer Science The University of Iowa Contributors: students Yi Xu, Yan Yan and colleague Qihang Lin Yang (CS@Uiowa) Accelerate Subgradient Methods
More informationIntroduction to Logistic Regression and Support Vector Machine
Introduction to Logistic Regression and Support Vector Machine guest lecturer: Ming-Wei Chang CS 446 Fall, 2009 () / 25 Fall, 2009 / 25 Before we start () 2 / 25 Fall, 2009 2 / 25 Before we start Feel
More informationDan Roth 461C, 3401 Walnut
CIS 519/419 Applied Machine Learning www.seas.upenn.edu/~cis519 Dan Roth danroth@seas.upenn.edu http://www.cis.upenn.edu/~danroth/ 461C, 3401 Walnut Slides were created by Dan Roth (for CIS519/419 at Penn
More informationCSC321 Lecture 8: Optimization
CSC321 Lecture 8: Optimization Roger Grosse Roger Grosse CSC321 Lecture 8: Optimization 1 / 26 Overview We ve talked a lot about how to compute gradients. What do we actually do with them? Today s lecture:
More informationECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference
ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Sparse Recovery using L1 minimization - algorithms Yuejie Chi Department of Electrical and Computer Engineering Spring
More informationCoordinate Descent Faceoff: Primal or Dual?
JMLR: Workshop and Conference Proceedings 83:1 22, 2018 Algorithmic Learning Theory 2018 Coordinate Descent Faceoff: Primal or Dual? Dominik Csiba peter.richtarik@ed.ac.uk School of Mathematics University
More informationBig Data Analytics. Lucas Rego Drumond
Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Predictive Models Predictive Models 1 / 34 Outline
More informationLasso Regression: Regularization for feature selection
Lasso Regression: Regularization for feature selection Emily Fox University of Washington January 18, 2017 1 Feature selection task 2 1 Why might you want to perform feature selection? Efficiency: - If
More informationl 1 and l 2 Regularization
David Rosenberg New York University February 5, 2015 David Rosenberg (New York University) DS-GA 1003 February 5, 2015 1 / 32 Tikhonov and Ivanov Regularization Hypothesis Spaces We ve spoken vaguely about
More informationWarm up. Regrade requests submitted directly in Gradescope, do not instructors.
Warm up Regrade requests submitted directly in Gradescope, do not email instructors. 1 float in NumPy = 8 bytes 10 6 2 20 bytes = 1 MB 10 9 2 30 bytes = 1 GB For each block compute the memory required
More informationProximal Minimization by Incremental Surrogate Optimization (MISO)
Proximal Minimization by Incremental Surrogate Optimization (MISO) (and a few variants) Julien Mairal Inria, Grenoble ICCOPT, Tokyo, 2016 Julien Mairal, Inria MISO 1/26 Motivation: large-scale machine
More informationBarzilai-Borwein Step Size for Stochastic Gradient Descent
Barzilai-Borwein Step Size for Stochastic Gradient Descent Conghui Tan Shiqian Ma Yu-Hong Dai Yuqiu Qian May 16, 2016 Abstract One of the major issues in stochastic gradient descent (SGD) methods is how
More informationLinear classifiers: Overfitting and regularization
Linear classifiers: Overfitting and regularization Emily Fox University of Washington January 25, 2017 Logistic regression recap 1 . Thus far, we focused on decision boundaries Score(x i ) = w 0 h 0 (x
More informationStochastic and online algorithms
Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1 Stochastic optimization problem
More informationDeep Feedforward Networks
Deep Feedforward Networks Liu Yang March 30, 2017 Liu Yang Short title March 30, 2017 1 / 24 Overview 1 Background A general introduction Example 2 Gradient based learning Cost functions Output Units 3
More informationLinear Regression (continued)
Linear Regression (continued) Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 6, 2017 1 / 39 Outline 1 Administration 2 Review of last lecture 3 Linear regression
More informationProximal Gradient Descent and Acceleration. Ryan Tibshirani Convex Optimization /36-725
Proximal Gradient Descent and Acceleration Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: subgradient method Consider the problem min f(x) with f convex, and dom(f) = R n. Subgradient method:
More information1 Sparsity and l 1 relaxation
6.883 Learning with Combinatorial Structure Note for Lecture 2 Author: Chiyuan Zhang Sparsity and l relaxation Last time we talked about sparsity and characterized when an l relaxation could recover the
More information
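The coordinate descent loop above (Step 1: pick an index i; Step 2: minimize over the step size δ; Step 3: update x_i) can be made concrete on the ridge regression problem from earlier, where the one-dimensional subproblem in Step 2 has a closed-form minimizer. This is a minimal illustrative sketch, not a reference implementation from the lecture; the function name and the incremental-residual bookkeeping are the author's own choices.

```python
import numpy as np

def coordinate_descent_ridge(X, y, lam, n_sweeps=100):
    """Cyclic coordinate descent for ridge regression:
        min_w 0.5 * ||Xw - y||^2 + (lam / 2) * ||w||^2
    Each inner step solves the 1-D subproblem min_delta f(w + delta * e_i)
    exactly, since f restricted to one coordinate is a quadratic."""
    n, d = X.shape
    w = np.zeros(d)
    r = X @ w - y                    # residual Xw - y, maintained incrementally
    col_sq = (X ** 2).sum(axis=0)    # ||x_i||^2 for each column i
    for _ in range(n_sweeps):
        for i in range(d):           # Step 1: cyclic choice of coordinate
            # Step 2: exact minimizer of the quadratic 1-D subproblem
            grad_i = X[:, i] @ r + lam * w[i]
            delta = -grad_i / (col_sq[i] + lam)
            # Step 3: update the coordinate and keep the residual in sync
            w[i] += delta
            r += delta * X[:, i]
    return w

# Sanity check against the closed-form ridge solution
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
y = rng.standard_normal(50)
lam = 0.1
w_cd = coordinate_descent_ridge(X, y, lam)
w_exact = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)
print(np.allclose(w_cd, w_exact, atol=1e-6))
```

Because the objective is strongly convex (λ > 0), the sweeps converge linearly to the unique minimizer, matching the convergence claim for strongly convex functions above.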