ECS289: Scalable Machine Learning

1 ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016

2 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton's Method Stochastic Gradient Descent

3 Numerical Optimization Numerical Optimization: min_X f(X) Can be applied to computer science, economics, control engineering, operations research, ... Machine Learning: find a model that minimizes the prediction error.

4 Properties of the Function Smooth function: a function with continuous derivative. Example: ridge regression argmin_w (1/2)||Xw - y||^2 + (λ/2)||w||^2 Non-smooth functions: Lasso, primal SVM Lasso: argmin_w (1/2)||Xw - y||^2 + λ||w||_1 SVM: argmin_w Σ_{i=1}^n max(0, 1 - y_i w^T x_i) + (λ/2)||w||^2
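
As a quick illustration of these objectives, here is a minimal NumPy sketch that evaluates the three losses above; the function names and the dense-matrix representation are my own choices, not from the slides.

```python
import numpy as np

def ridge_objective(w, X, y, lam):
    """(1/2)||Xw - y||^2 + (lam/2)||w||^2 -- smooth."""
    r = X @ w - y
    return 0.5 * r @ r + 0.5 * lam * w @ w

def lasso_objective(w, X, y, lam):
    """(1/2)||Xw - y||^2 + lam*||w||_1 -- non-smooth at w_j = 0."""
    r = X @ w - y
    return 0.5 * r @ r + lam * np.abs(w).sum()

def svm_objective(w, X, y, lam):
    """sum_i max(0, 1 - y_i w^T x_i) + (lam/2)||w||^2 -- non-smooth (hinge loss)."""
    margins = 1.0 - y * (X @ w)
    return np.maximum(0.0, margins).sum() + 0.5 * lam * w @ w
```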

5 Convex Functions A function is convex if: for all x_1, x_2 and t ∈ [0, 1], f(t x_1 + (1-t) x_2) ≤ t f(x_1) + (1-t) f(x_2) No local optimum other than the global optimum (why?) Figure from Wikipedia

6 Convex Functions If f(x) is twice differentiable, then f is convex if and only if ∇²f(x) ⪰ 0 Optimal solution may not be unique: there is a set of optimal solutions S Gradient: captures the first-order change of f: f(x + αd) = f(x) + α ∇f(x)^T d + O(α²) If f is convex and differentiable, we have the following optimality condition: x ∈ S if and only if ∇f(x) = 0

7 Strongly Convex Functions f is strongly convex if there exists an m > 0 such that f(y) ≥ f(x) + ∇f(x)^T (y - x) + (m/2)||y - x||₂² A strongly convex function has a unique global optimum x* (why?) If f is twice differentiable, then f is strongly convex if and only if ∇²f(x) ⪰ mI for all x Gradient descent and coordinate descent will converge linearly (will see later)

8 Nonconvex Functions If f is nonconvex, most algorithms can only converge to stationary points x̄ is a stationary point if and only if ∇f(x̄) = 0 Three types of stationary points: (1) global optimum (2) local optimum (3) saddle point Examples: matrix completion, neural networks, ... Example: f(x, y) = (1/2)(xy - a)²

9 Coordinate Descent

10 Coordinate Descent Update one variable at a time Coordinate Descent: repeatedly perform the following loop Step 1: pick an index i Step 2: compute a step size δ* by (approximately) solving δ* = argmin_δ f(x + δ e_i) Step 3: x_i ← x_i + δ*
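
A minimal sketch of this loop for ridge regression, where Step 2 can be solved exactly because the one-dimensional subproblem is a quadratic; the function name, the cyclic order, and the dense NumPy representation are illustrative choices, not from the slides.

```python
import numpy as np

def coordinate_descent_ridge(X, y, lam, n_outer=50):
    """Cyclic coordinate descent for (1/2)||Xw - y||^2 + (lam/2)||w||^2.

    Step 2 is solved exactly: the 1-D subproblem in delta is a quadratic.
    """
    n, d = X.shape
    w = np.zeros(d)
    r = X @ w - y                       # residual Xw - y, maintained incrementally
    col_sq = (X ** 2).sum(axis=0)       # ||X_{:,i}||^2, precomputed once
    for _ in range(n_outer):            # one outer iteration = one cyclic sweep
        for i in range(d):              # Step 1: pick index i (cyclic order)
            grad_i = X[:, i] @ r + lam * w[i]
            delta = -grad_i / (col_sq[i] + lam)   # Step 2: exact 1-D minimizer
            w[i] += delta                          # Step 3: update coordinate i
            r += delta * X[:, i]                   # keep the residual consistent
    return w
```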

11 Coordinate Descent (update sequence) Three types of updating order: Cyclic: update sequence x_1, x_2, ..., x_n (1st outer iteration), x_1, x_2, ..., x_n (2nd outer iteration), ... Randomly permute the sequence for each outer iteration (faster convergence in practice) Some interesting theoretical analysis for this recently: C.-P. Lee and S. J. Wright, Random Permutations Fix a Worst Case for Cyclic Coordinate Descent, 2016

12 Coordinate Descent (update sequence) Three types of updating order: Cyclic: update sequence x_1, x_2, ..., x_n (1st outer iteration), x_1, x_2, ..., x_n (2nd outer iteration), ... Randomly permute the sequence for each outer iteration (faster convergence in practice) Some interesting theoretical analysis for this recently: C.-P. Lee and S. J. Wright, Random Permutations Fix a Worst Case for Cyclic Coordinate Descent, 2016 Random: each time pick a random coordinate to update Typical way: sample from the uniform distribution Sample from the uniform distribution vs. sample from a biased distribution: P. Zhao and T. Zhang, Stochastic Optimization with Importance Sampling for Regularized Loss Minimization. In ICML 2015 D. Csiba, Z. Qu and P. Richtarik, Stochastic Dual Coordinate Ascent with Adaptive Probabilities. In ICML 2015

13 Greedy Coordinate Descent Greedy: choose the most important coordinate to update How to measure the importance? By the first derivative: |∇_i f(x)| By the first and second derivatives: |∇_i f(x)| / ∇²_ii f(x) By the maximum reduction of the objective function: i* = argmax_{i=1,...,n} (f(x) - min_δ f(x + δ e_i)) Need to consider the time complexity of variable selection Useful for kernel SVM (see lecture 6)

14 Extension: block coordinate descent Variables are divided into blocks {X_1, ..., X_p}, where each X_i is a subset of variables, X_1 ∪ X_2 ∪ ... ∪ X_p = {1, ..., n}, and X_i ∩ X_j = ∅ for all i ≠ j Each time, update a block X_i by (approximately) solving the subproblem within the block Example: alternating minimization for matrix completion (2 blocks) (see lecture 7)

15 Coordinate Descent (convergence) Converges to the optimum if f(·) is convex and smooth Has a linear convergence rate if f(·) is strongly convex Linear convergence: the error f(x^t) - f(x*) decays as β, β², β³, ... for some β < 1 Local linear convergence: an algorithm converges linearly after ||x - x*|| ≤ K for some K > 0

16 Coordinate Descent: other names Alternating minimization (matrix completion) Iterative scaling (for log-linear models) Decomposition method (for kernel SVM) Gauss-Seidel (for linear systems with a positive definite matrix) ...

17 Gradient Descent

18 Gradient Descent Gradient descent algorithm: repeatedly conduct the following update: x^{t+1} ← x^t - α ∇f(x^t) where α > 0 is the step size It is a fixed-point iteration method: x - α∇f(x) = x if and only if x is an optimal solution Step size too large ⇒ divergence; too small ⇒ slow convergence

19 Gradient Descent: successive approximation At each iteration, form an approximation of f(·): f(x^t + d) ≈ f_{x^t}(d) := f(x^t) + ∇f(x^t)^T d + (1/2) d^T ((1/α) I) d = f(x^t) + ∇f(x^t)^T d + (1/(2α)) d^T d Update the solution by x^{t+1} ← x^t + argmin_d f_{x^t}(d) d* = -α∇f(x^t) is the minimizer of argmin_d f_{x^t}(d) d* will decrease the original objective function f if α (the step size) is small enough

20 Gradient Descent: successive approximation However, the function value will decrease if Condition 1: f_x(d) ≥ f(x + d) for all d Condition 2: f_x(0) = f(x) Why? f(x^t + d*) ≤ f_{x^t}(d*) ≤ f_{x^t}(0) = f(x^t) Condition 2 is satisfied by the construction of f_{x^t} Condition 1 is satisfied if (1/α) I ⪰ ∇²f(x) for all x

21 Gradient Descent: step size A twice differentiable function has an L-Lipschitz continuous gradient if and only if ∇²f(x) ⪯ LI for all x In this case, Condition 1 is satisfied if α < 1/L Theorem: gradient descent converges if α < 1/L Theorem: gradient descent converges linearly with α < 1/L if f is strongly convex

22 Gradient Descent In practice, we do not know L Step size α too large: the algorithm diverges Step size α too small: the algorithm converges very slowly

23 Gradient Descent: line search d is a descent direction if and only if d^T ∇f(x) < 0 Armijo-rule backtracking line search: try α = 1, 1/2, 1/4, ... until it satisfies f(x + αd) ≤ f(x) + γ α d^T ∇f(x), where 0 < γ < 1
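
A small sketch of the Armijo backtracking rule above, assuming `f` is a callable objective, `grad_fx` is the gradient at `x`, and `x`, `d` are NumPy arrays; the cap on the number of halvings is my own safeguard.

```python
def backtracking_line_search(f, grad_fx, x, d, gamma=0.1, max_halvings=30):
    """Armijo backtracking: try alpha = 1, 1/2, 1/4, ... until
    f(x + alpha*d) <= f(x) + gamma * alpha * d^T grad f(x), with 0 < gamma < 1.
    x, d, grad_fx are NumPy arrays; d must be a descent direction (d^T grad < 0).
    """
    fx = f(x)
    slope = d @ grad_fx        # negative for a descent direction
    alpha = 1.0
    for _ in range(max_halvings):
        if f(x + alpha * d) <= fx + gamma * alpha * slope:
            return alpha
        alpha *= 0.5
    return alpha               # fallback: smallest step size tried
```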

24 Gradient Descent: line search Gradient descent with line search: Converges to optimal solutions if f is smooth Converges linearly if f is strongly convex However, each iteration requires evaluating f several times Several other step-size selection approaches (an ongoing research topic, especially for stochastic gradient descent)

25 Gradient Descent: applying to ridge regression Input: X ∈ R^{N×d}, y ∈ R^N, initial w^(0) Output: Solution w* := argmin_w (1/2)||Xw - y||^2 + (λ/2)||w||^2 1: t = 0 2: while not converged do 3: Compute the gradient g = X^T(Xw - y) + λw 4: Choose step size α_t 5: Update w ← w - α_t g 6: t ← t + 1 7: end while Time complexity: O(nnz(X)) per iteration
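
A direct NumPy transcription of this pseudocode, using a fixed step size `alpha` and a fixed iteration count in place of the unspecified convergence test; both simplifications are mine.

```python
import numpy as np

def gradient_descent_ridge(X, y, lam, alpha, n_iters=500):
    """Gradient descent for (1/2)||Xw - y||^2 + (lam/2)||w||^2.

    Each iteration costs O(nnz(X)) when X is stored sparse; X is dense
    here only for brevity.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        g = X.T @ (X @ w - y) + lam * w   # gradient from the slide
        w = w - alpha * g                 # fixed step size alpha
    return w
```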

26 Proximal Gradient Descent How can we apply gradient descent to solve the Lasso problem? argmin_w (1/2)||Xw - y||^2 + λ||w||_1 (the second term is non-differentiable) General composite function minimization: argmin_x f(x) := g(x) + h(x) where g is smooth and convex, and h is convex but may be non-differentiable Usually assume h is simple (for computational efficiency)

27 Proximal Gradient Descent: successive approximation At each iteration, form an approximation of f(·): f(x^t + d) ≈ f_{x^t}(d) := g(x^t) + ∇g(x^t)^T d + (1/2) d^T ((1/α) I) d + h(x^t + d) = g(x^t) + ∇g(x^t)^T d + (1/(2α)) d^T d + h(x^t + d) Update the solution by x^{t+1} ← x^t + argmin_d f_{x^t}(d) This is called proximal gradient descent Usually d* = argmin_d f_{x^t}(d) has a closed form solution

28 Proximal Gradient Descent: ℓ1-regularization The subproblem: x^{t+1} = x^t + argmin_d ∇g(x^t)^T d + (1/(2α)) d^T d + λ||x^t + d||_1 = argmin_u (1/2)||u - (x^t - α∇g(x^t))||^2 + λα||u||_1 = S(x^t - α∇g(x^t), αλ), where S is the soft-thresholding operator defined by S(a, z) = a - z if a > z; a + z if a < -z; 0 if a ∈ [-z, z]

29 Proximal Gradient: soft-thresholding (figure illustrating the soft-thresholding operator S(a, z))

30 Proximal Gradient Descent for Lasso Input: X ∈ R^{N×d}, y ∈ R^N, initial w^(0) Output: Solution w* := argmin_w (1/2)||Xw - y||^2 + λ||w||_1 1: t = 0 2: while not converged do 3: Compute the gradient g = X^T(Xw - y) 4: Choose step size α_t 5: Update w ← S(w - α_t g, α_t λ) 6: t ← t + 1 7: end while Time complexity: O(nnz(X)) per iteration
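
A NumPy sketch of this algorithm (ISTA-style proximal gradient for Lasso), using the soft-thresholding operator from slide 28; the fixed step size and iteration count stand in for the unspecified stopping rule and are my simplifications.

```python
import numpy as np

def soft_threshold(a, z):
    """S(a, z): shrink a toward zero by z, element-wise."""
    return np.sign(a) * np.maximum(np.abs(a) - z, 0.0)

def proximal_gradient_lasso(X, y, lam, alpha, n_iters=500):
    """Proximal gradient descent for (1/2)||Xw - y||^2 + lam*||w||_1.

    alpha should be at most 1/L, where L is the largest eigenvalue of X^T X.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        g = X.T @ (X @ w - y)                        # gradient of the smooth part
        w = soft_threshold(w - alpha * g, alpha * lam)
    return w
```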

31 Newton's Method

32 Newton's Method Iteratively conduct the following updates: x ← x - α ∇²f(x)^{-1} ∇f(x) where α is the step size If α = 1: converges quadratically when x^t is close enough to x*: ||x^{t+1} - x*|| ≤ K ||x^t - x*||^2 for some constant K. This means the error f(x^t) - f(x*) decays quadratically: β, β², β⁴, β⁸, β¹⁶, ... Only need a few iterations to converge in this quadratic convergence region

33 Newton's Method However, Newton's update rule is more expensive than gradient descent/coordinate descent

34 Newton's Method Need to compute ∇²f(x)^{-1} ∇f(x) Closed form solution: O(d³) for solving a d-dimensional linear system Usually solved by another iterative solver: gradient descent, coordinate descent, conjugate gradient method, ... Examples: primal L2-SVM/logistic regression, ℓ1-regularized logistic regression, ...
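
For concreteness, a hedged sketch of Newton's method on ℓ2-regularized logistic regression (one of the examples above), using a dense direct solve for the Newton system; for large d one would swap in conjugate gradient or coordinate descent as listed on the slide. The step size is fixed at α = 1, so this assumes we are already in the quadratic convergence region where no line search is needed.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def newton_logistic(X, y, lam, n_iters=20):
    """Newton's method for L2-regularized logistic regression, y_i in {-1, +1}.

    The Newton direction solves (X^T D X + lam*I) d = -grad with a dense
    direct solve (O(d^3) per iteration).
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        z = X @ w
        p = sigmoid(y * z)                           # P(correct label | x_i)
        grad = X.T @ (-y * (1.0 - p)) + lam * w      # gradient of the objective
        D = p * (1.0 - p)                            # diagonal Hessian weights
        H = X.T @ (D[:, None] * X) + lam * np.eye(d) # Hessian
        direction = np.linalg.solve(H, -grad)
        w = w + direction                            # alpha = 1, no line search
    return w
```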

35 Newton's Method At each iteration, form an approximation of f(·): f(x^t + d) ≈ f_{x^t}(d) := f(x^t) + ∇f(x^t)^T d + (1/(2α)) d^T ∇²f(x^t) d Update the solution by x^{t+1} ← x^t + argmin_d f_{x^t}(d) When x is far away from x*, needs line search to guarantee convergence Assume LI ⪰ ∇²f(x) ⪰ mI for all x; then α ≤ m/L guarantees the objective function value decreases, because (L/m) ∇²f(x) ⪰ ∇²f(y) for all x, y In practice, we often just use line search.

36 Proximal Newton What if f(x) = g(x) + h(x) and h(x) is non-smooth (e.g., h(x) = ||x||_1)? At each iteration, form an approximation of f(·): f(x^t + d) ≈ f_{x^t}(d) := g(x^t) + ∇g(x^t)^T d + (1/(2α)) d^T ∇²g(x^t) d + h(x^t + d) Update the solution by x^{t+1} ← x^t + argmin_d f_{x^t}(d) Need another iterative solver for solving the subproblem

37 Stochastic Gradient Method

38 Stochastic Gradient Method: Motivation Widely used for machine learning problems (with a large number of samples) Given training samples x_1, ..., x_n, we usually want to solve the following empirical risk minimization (ERM) problem: argmin_w Σ_{i=1}^n ℓ_i(w), where each ℓ_i(·) is the loss function on sample x_i Minimize the summation of the individual losses on each sample

39 Stochastic Gradient Method Assume the objective function can be written as f(x) = (1/n) Σ_{i=1}^n f_i(x) Stochastic gradient method: iteratively conducts the following updates: 1. Choose an index i uniformly at random 2. x^{t+1} ← x^t - η_t ∇f_i(x^t), where η_t > 0 is the step size

40 Stochastic Gradient Method Assume the objective function can be written as f(x) = (1/n) Σ_{i=1}^n f_i(x) Stochastic gradient method: iteratively conducts the following updates: 1. Choose an index i uniformly at random 2. x^{t+1} ← x^t - η_t ∇f_i(x^t), where η_t > 0 is the step size Why does SG work? E_i[∇f_i(x)] = (1/n) Σ_{i=1}^n ∇f_i(x) = ∇f(x)

41 Stochastic Gradient Method Assume the objective function can be written as f(x) = (1/n) Σ_{i=1}^n f_i(x) Stochastic gradient method: iteratively conducts the following updates: 1. Choose an index i uniformly at random 2. x^{t+1} ← x^t - η_t ∇f_i(x^t), where η_t > 0 is the step size Why does SG work? E_i[∇f_i(x)] = (1/n) Σ_{i=1}^n ∇f_i(x) = ∇f(x) Is it a fixed-point method? No if η > 0, because x* - η∇f_i(x*) ≠ x* Is it a descent method? No, because f(x^{t+1}) ≤ f(x^t) is not guaranteed
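
A generic sketch of this update loop, assuming the caller supplies `grad_fi(x, i)` for the component gradients; the decaying step-size schedule here is just one common choice (see the next slide), not prescribed by this slide.

```python
import numpy as np

def sgd(grad_fi, n, x0, eta0=0.1, a=0.5, n_iters=10000, seed=0):
    """Stochastic gradient method for f(x) = (1/n) * sum_i f_i(x).

    grad_fi(x, i) returns the gradient of f_i at x; the step size decays as
    eta_t = eta0 / (t + 1)**a.
    """
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for t in range(n_iters):
        i = rng.integers(n)              # pick an index uniformly at random
        eta_t = eta0 / (t + 1) ** a      # decaying step size
        x = x - eta_t * grad_fi(x, i)
    return x
```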

42 Stochastic Gradient Step size η_t has to decay to 0 (e.g., η_t = C/t^a for some constants a, C) SGD converges sub-linearly Many variants proposed recently: SVRG, SAGA (2013, 2014): variance reduction AdaGrad (2011): adaptive learning rate RMSProp (2012): estimate the learning rate by a running average of the gradient Adam (2015): adaptive moment estimation Widely used in machine learning

43 SGD in Deep Learning AdaGrad: adaptive step size for each parameter Update rule at the t-th iteration: Compute g = ∇f(x) Estimate the second moment: G_ii ← G_ii + g_i² for all i Parameter update: x_i ← x_i - (η/√G_ii) g_i for all i Proposed for convex optimization in: Adaptive Subgradient Methods for Online Learning and Stochastic Optimization (JMLR 2011) Adaptive Bound Optimization for Online Convex Optimization (COLT 2010)

44 SGD in Deep Learning AdaGrad: adaptive step size for each parameter Update rule at the t-th iteration: Compute g = ∇f(x) Estimate the second moment: G_ii ← G_ii + g_i² for all i Parameter update: x_i ← x_i - (η/√G_ii) g_i for all i Adam (Kingma and Ba, 2015): maintain both the first moment and the second moment Update rule at the t-th iteration: Compute g = ∇f(x) m ← β₁ m + (1 - β₁) g, m̂ = m/(1 - β₁^t) (estimate of the first moment) v ← β₂ v + (1 - β₂) g², v̂ = v/(1 - β₂^t) (estimate of the second moment) x ← x - η m̂/(√v̂ + ε) All the operations are element-wise
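
A sketch of the Adam update rule as written above, with the usual √v̂ in the denominator and the standard default hyperparameters as assumptions; `grad_f(x)` may return a stochastic gradient.

```python
import numpy as np

def adam(grad_f, x0, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, n_iters=1000):
    """Adam: adaptive moment estimation; all operations are element-wise."""
    x = x0.copy()
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    for t in range(1, n_iters + 1):
        g = grad_f(x)
        m = beta1 * m + (1 - beta1) * g            # first moment
        v = beta2 * v + (1 - beta2) * g ** 2       # second moment
        m_hat = m / (1 - beta1 ** t)               # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)               # bias-corrected second moment
        x = x - eta * m_hat / (np.sqrt(v_hat) + eps)
    return x
```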

45 Stochastic Gradient: Variance Reduction SGD is not a fixed-point method: even if x* is optimal, x* - η∇f_i(x*) ≠ x* Reason: E[∇f_i(x*)] = ∇f(x*) = 0, but the variance can be large Variance reduction: reduce the variance so that the variance → 0 as t → ∞.

46 SVRG SVRG (Johnson and Zhang, 2013) For t = 1, 2, ...: 1. x̃ = x 2. g̃ = (1/n) Σ_{i=1}^n ∇f_i(x̃) 3. For s = 1, 2, ...: 1. Randomly pick i 2. x ← x - η (∇f_i(x) - ∇f_i(x̃) + g̃), where the update direction is denoted a_i Can easily show E[a_i] = ∇f(x) (unbiased) Var[a_i] = 0 when x = x̃ = x* Has a linear convergence rate
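
A sketch of the SVRG loop above, assuming the caller supplies both the component gradient `grad_fi(x, i)` and the full gradient `full_grad(x)`; the inner-loop length of 2n is a common convention, not taken from the slide.

```python
import numpy as np

def svrg(grad_fi, full_grad, x0, n, eta=0.01, n_outer=20, n_inner=None, seed=0):
    """SVRG (Johnson and Zhang, 2013) for f(x) = (1/n) * sum_i f_i(x)."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    if n_inner is None:
        n_inner = 2 * n                       # a common inner-loop length
    for _ in range(n_outer):
        x_tilde = x.copy()                    # snapshot point
        g_tilde = full_grad(x_tilde)          # (1/n) sum_i grad f_i(x_tilde)
        for _ in range(n_inner):
            i = rng.integers(n)
            a_i = grad_fi(x, i) - grad_fi(x_tilde, i) + g_tilde
            x = x - eta * a_i                 # constant step size, no decay
    return x
```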

47 Stochastic Gradient: applying to ridge regression Objective function: argmin_w (1/n) Σ_{i=1}^n (w^T x_i - y_i)² + λ||w||² How to write this as argmin_w (1/n) Σ_{i=1}^n f_i(w)? How to decompose into n components?

48 Stochastic Gradient: applying to ridge regression Objective function: argmin_w (1/n) Σ_{i=1}^n (w^T x_i - y_i)² + λ||w||² How to write this as argmin_w (1/n) Σ_{i=1}^n f_i(w)? First approach: f_i(w) = (w^T x_i - y_i)² + λ||w||² Update rule: w^{t+1} ← w^t - 2η_t (w^T x_i - y_i) x_i - 2η_t λ w = (1 - 2η_t λ) w - 2η_t (w^T x_i - y_i) x_i

49 Stochastic Gradient: applying to ridge regression Objective function: argmin_w (1/n) Σ_{i=1}^n (w^T x_i - y_i)² + λ||w||² How to write this as argmin_w (1/n) Σ_{i=1}^n f_i(w)? First approach: f_i(w) = (w^T x_i - y_i)² + λ||w||² Update rule: w^{t+1} ← w^t - 2η_t (w^T x_i - y_i) x_i - 2η_t λ w = (1 - 2η_t λ) w - 2η_t (w^T x_i - y_i) x_i Need O(d) complexity per iteration even if the data is sparse

50 Stochastic Gradient: applying to ridge regression Objective function: argmin_w (1/n) Σ_{i=1}^n (w^T x_i - y_i)² + λ||w||² How to write this as argmin_w (1/n) Σ_{i=1}^n f_i(w)? First approach: f_i(w) = (w^T x_i - y_i)² + λ||w||² Update rule: w^{t+1} ← w^t - 2η_t (w^T x_i - y_i) x_i - 2η_t λ w = (1 - 2η_t λ) w - 2η_t (w^T x_i - y_i) x_i Need O(d) complexity per iteration even if the data is sparse Solution: store w = s·v where s is a scalar
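
A sketch of the first approach with the w = s·v trick: the dense shrink by (1 - 2η_t λ) is absorbed into the scalar s, so each update touches only the nonzero coordinates of x_i. The dense storage of X, the fixed step size, and the epoch count are simplifications of mine; in practice one would also rescale v occasionally to keep s well away from zero.

```python
import numpy as np

def sgd_ridge_scaled(X, y, lam, eta, n_epochs=5, seed=0):
    """SGD for (1/n) sum_i (w^T x_i - y_i)^2 + lam*||w||^2, first decomposition.

    Represent w = s * v: the shrink (1 - 2*eta*lam) only touches the scalar s,
    and only the nonzero coordinates of x_i are updated (O(nnz(x_i)) work).
    Assumes 2*eta*lam < 1 so that s stays positive.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    s, v = 1.0, np.zeros(d)                  # w = s * v
    for _ in range(n_epochs * n):
        i = rng.integers(n)
        x_i = X[i]
        nz = np.nonzero(x_i)[0]              # Omega_i: nonzero features of x_i
        pred = s * (v[nz] @ x_i[nz])         # w^T x_i using only nonzeros
        s *= (1.0 - 2.0 * eta * lam)         # shrink all of w via the scalar
        v[nz] -= (2.0 * eta * (pred - y[i]) / s) * x_i[nz]
    return s * v
```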

51 Stochastic Gradient: applying to ridge regression Objective function: argmin_w (1/n) Σ_{i=1}^n (w^T x_i - y_i)² + λ||w||² Second approach: define Ω_i = {j : X_ij ≠ 0} for i = 1, ..., n define n_j = |{i : X_ij ≠ 0}| for j = 1, ..., d define f_i(w) = (w^T x_i - y_i)² + Σ_{j∈Ω_i} (λn/n_j) w_j²

52 Stochastic Gradient: applying to ridge regression Objective function: argmin_w (1/n) Σ_{i=1}^n (w^T x_i - y_i)² + λ||w||² Second approach: define Ω_i = {j : X_ij ≠ 0} for i = 1, ..., n define n_j = |{i : X_ij ≠ 0}| for j = 1, ..., d define f_i(w) = (w^T x_i - y_i)² + Σ_{j∈Ω_i} (λn/n_j) w_j² Update rule when selecting index i: w_j^{t+1} ← w_j^t - 2η_t (x_i^T w^t - y_i) X_ij - 2η_t (λn/n_j) w_j^t for all j ∈ Ω_i Solution: the update can be done in O(|Ω_i|) operations
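
A sketch of the second approach: each f_i carries the reweighted regularizer λn/n_j on its own nonzero coordinates, so the update touches only j ∈ Ω_i. As before, X is stored densely here only for brevity; the function name and hyperparameters are illustrative.

```python
import numpy as np

def sgd_ridge_sparse(X, y, lam, eta, n_epochs=5, seed=0):
    """SGD with the second decomposition: the regularizer on coordinate j is
    weighted by lam*n/n_j inside f_i, so each update only touches j in Omega_i.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    n_j = np.count_nonzero(X, axis=0)        # number of samples touching feature j
    w = np.zeros(d)
    for _ in range(n_epochs * n):
        i = rng.integers(n)
        x_i = X[i]
        nz = np.nonzero(x_i)[0]              # Omega_i
        pred = w[nz] @ x_i[nz]               # x_i^T w over nonzeros only
        w[nz] -= 2.0 * eta * ((pred - y[i]) * x_i[nz] + (lam * n / n_j[nz]) * w[nz])
    return w
```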

53 Coming up Next class: Parallel Optimization Questions?
