Modern Optimization Techniques
|
|
- Theodore McBride
- 5 years ago
- Views:
Transcription
1 Modern Optimization Techniques 2. Unconstrained Optimization / 2.2. Stochastic Gradient Descent Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany original slides by Lucas Rego Drumond (ISMLL) 1 / 44
2 Syllabus Tue (0) 0. Overview 1. Theory Tue (1) 1. Convex Sets and Functions 2. Unconstrained Optimization Tue (2) 2.1 Gradient Descent Tue (3) 2.2 Stochastic Gradient Descent Tue (4) 2.3 Newton s Method Tue (5) 2.3b Newton s Method (Part 2) Tue (6) 2.4 Subgradient Methods Tue (7) 2.5 Coordinate Descent 3. Equality Constrained Optimization Tue (8) 3.1 Duality Tue (9) 3.2 Methods Christmas Break 4. Inequality Constrained Optimization Tue (10) 4.1 Interior Point Methods Tue (11) 4.2 Cutting Plane Method 5. Distributed Optimization Tue (11) 5.1 Alternating Direction Method of Multipliers Tue (12) Questions and Answers 1 / 44
3 Outline 1. Stochastic Gradient Descent (SGD) 2. More on Line Search 3. Example: SGD for Linear Regression 4. Stochastic Gradient Descent in Practice 2 / 44
4 1. Stochastic Gradient Descent (SGD) Outline 1. Stochastic Gradient Descent (SGD) 2. More on Line Search 3. Example: SGD for Linear Regression 4. Stochastic Gradient Descent in Practice 2 / 44
5 1. Stochastic Gradient Descent (SGD) Unconstrained Convex Optimization arg min f (x) x dom f dom f R N is open (unconstrained optimization) f is convex 2 / 44
6 1. Stochastic Gradient Descent (SGD) Stochastic Gradient Gradient Descent makes use of the gradient f (x) Stochastic Gradient Descent: makes use of Stochastic Gradient only: g(x) p(g R M x), E p (g(x)) = f (x) for each point x R N : random variable over R N with distribution p (conditional on x) on average yields the gradient (at each point) 3 / 44
7 1. Stochastic Gradient Descent (SGD) Stochastic Gradient / Example: Big Sums f is a big sum : f (x) = 1 C C f c (x) c=1 with f c convex, c = 1,..., C g is the gradient of a random summand: p(g x) := Unif({ f c (x) c = 1,..., C}) 4 / 44
8 1. Stochastic Gradient Descent (SGD) Stochastic Gradient / Example: Least Squares min x R N f (x) := Ax b 2 2 will find solution for Ax = b if there is any (then Ax b 2 = 0) otherwise will find the x where the difference Ax b of left and right side is as small as possible (in the squared L2 norm) is a big sum: M f (x) := Ax b 2 2 = ((Ax) m b m ) 2 = 1 M m=1 M f m (x), f m (x) := M((Ax) m b m ) 2 m=1 stochastic gradient g: gradient for a random component m 5 / 44
9 1. Stochastic Gradient Descent (SGD) Stochastic Gradient / Example: Supervised Learning where min θ R P f (x) := 1 N N l(y n, ŷ(x n, θ)) + λ θ 2 2 n=1 (xn, y n ) R M R (k) are N training samples, ŷ is a parametrized model, e.g., logistic regression ŷ(x; θ) := (1 + e θt x ) 1, P := M, T := 1 l is a loss, e.g., negative binomial loglikelihood: l(y, ŷ) := y log ŷ (1 y) log(1 ŷ) λ R + 0 is the regularization weight. will find parametrization with best trade-off between low loss and low model complexity 6 / 44
10 1. Stochastic Gradient Descent (SGD) Stochastic Gradient / Example: Supervised Learning (2/2) min θ R P f (x) := 1 N N l(y n, ŷ(x n, θ)) + λ θ 2 2 n=1 where (x n, y n ) R M R T are N training samples,... is a big sum: f (θ) := 1 N N f n (θ), f n (θ) := l(y n, ŷ(x n, θ)) + λ θ 2 2 n=1 stochastic gradient g: gradient for a random sample n 7 / 44
11 1. Stochastic Gradient Descent (SGD) Stochastic Gradient Descent the very same as Gradient Descent but use stochastic gradient g(x) instead of exact gradient f (x) in each step where 1 min-sgd(f, p, x (0), µ, K): 2 for k := 1,..., K: 3 draw g (k 1) p(g x) 4 x (k 1) := g (k 1) 5 µ (k 1) := µ(f, x (k 1), x (k 1) ) 6 x (k) := x (k 1) + µ (k 1) x (k 1) 7 return x (K) f objective function p (distribution of the) stochastic gradient of f x (0) starting value µ step length controller Lars Schmidt-Thieme, K number Information of iterations Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 8 / 44
12 1. Stochastic Gradient Descent (SGD) Stochastic Gradient Descent / For Big Sums where 1 min-sgd((f c ) c=1,...,c, ( f c ) c=1,...,c, x (0), µ, K): 2 for k := 1,..., K: 3 draw c (k 1) Unif(1,..., C) 4 g (k 1) := f c (k 1)(x (k 1) ) 5 x (k 1) := g (k 1) 6 µ (k 1) := µ(f, x (k 1), x (k 1) ) 7 x (k) := x (k 1) + µ (k 1) x (k 1) 8 return x (K) (f c ) c=1,...,c objective function summands, f := 1 C C c=1 f c ( f c ) c=1,...,c gradients of the objective function summands x (0) starting value µ step length controller K number of iterations 9 / 44
13 1. Stochastic Gradient Descent (SGD) SGD / For Big Sums / Epochs where 1 min-sgd((f c ) c=1,...,c, ( f c ) c=1,...,c, x (0), µ, K): 2 C := (1, 2,..., C) 3 x (0,C) := x (0) 4 for k := 1,..., K: 5 randomly shuffle C 6 x (k,0) := x (k 1,C) 7 for i = 1,..., C: 8 g (k,i 1) := f ci (x (k,i 1) ) 9 x (k,i 1) := g (k,i 1) 10 µ (k,i 1) := µ(f, x (k,i 1), x (k,i 1) ) 11 x (k,i) := x (k,i 1) + µ (k,i 1) x (k,i 1) 12 return x (K,C) (f c ) c=1,...,c objective function summands, f := 1 C C c=1 f c... K number of epochs 10 / 44
14 1. Stochastic Gradient Descent (SGD) Convergence of SGD Theorem (Convergence of SGD) If (i) f is strongly convex ( 2 f (x) mi, m R + ), (ii) the expected squared norm of its stochastic gradient g is uniformly bounded ( G R + 0 x : E( g(x) 2 ) G 2 ) and (iii) the steplength µ (k) := 1 m(k+1) is used, then E p ( x (k) x 2 ) 1 k + 1 max{ x (0) x 2, G 2 m 2 } 11 / 44
15 1. Stochastic Gradient Descent (SGD) Convergence of SGD / Proof f (x ) f (x) f (x) T (x x) + m 2 x x 2 str. conv. (i) f (x) f (x ) f (x ) T (x x ) + m 2 x x 2 = m 2 x x 2 summing both yields 0 f (x) T (x x) + m x x 2 f (x) T (x x ) m x x 2 (1) E( x (k) x 2 ) = E( x (k 1) µ (k 1) g (k 1) x 2 ) = E( x (k 1) x 2 ) 2µ (k 1) E((g (k 1) ) T (x (k 1) x )) + (µ (k 1) ) 2 E( g k 1 2 ) = E( x (k 1) x 2 ) 2µ (k 1) E( f (x (k 1) ) T (x (k 1) x )) + (µ (k 1) ) 2 E( g k 1 2 (ii),(1) E( x (k 1) x 2 ) 2µ (k 1) me( x x (k 1) 2 ) + (µ (k 1) ) 2 G 2 = (1 2µ (k 1) m)e( x (k 1) x 2 ) + (µ (k 1) ) 2 G 2 (2) 12 / 44
16 1. Stochastic Gradient Descent (SGD) Convergence of SGD / Proof (2/2) induction over k: k := 0: k > 0: x (0) x L, L := max{ x (0) x 2, G 2 m 2 } E( x (k) x 2 ) (2) (1 2µ (k 1) m)e( x (k 1) x 2 ) + (µ (k 1) ) 2 G 2 (iii) = (1 2 k )E( x (k 1) x 2 ) + G 2 m 2 k 2 ind.hyp. (1 2 k ) 1 k 1 L + G 2 m 2 k 2 def. L (1 2 k ) 1 k L + 1 k 2 L = k 1 k 2 L 1 k L 13 / 44
17 2. More on Line Search Outline 1. Stochastic Gradient Descent (SGD) 2. More on Line Search 3. Example: SGD for Linear Regression 4. Stochastic Gradient Descent in Practice 14 / 44
18 2. More on Line Search Choosing the step size for SGD The step size µ is a crucial parameter to be tuned Given the low cost of the SGD update, using line search for the step size is a bad choice Possible alternatives: Fixed step size Armijo principle Bold-Driver Adagrad 14 / 44
19 2. More on Line Search Example: Body Fat prediction We want to estimate the percentage of body fat based on various attributes: Age (years) Weight (lbs) Height (inches) Neck circumference (cm) Chest circumference (cm) Abdomen 2 circumference (cm) Hip circumference (cm) Thigh circumference (cm) Knee circumference (cm) / 44
20 2. More on Line Search Example: Body Fat prediction The data is represented it as: 1 a 1,1 a 1,2... a 1,M y 1 1 a 2,1 a 2,2... a 2,M A =....., y = y 2. 1 a N,1 a N,2... a N,M y N with N = 252, M = 14 We can model the percentage of body fat y as a linear combination of the body measurements with parameters x: ŷ n = x T a n = x x 1 a n,1 + x 2 a n, x M a n,m 16 / 44
21 2. More on Line Search SGD - Fixed Step Size on the Body Fat dataset SGD Step Size MSE Iterations 17 / 44
22 2. More on Line Search Bold Driver Heuristic The Bold Driver Heuristic makes the assumption that smaller step sizes are needed when closer to the optimum It adjusts the step size based on the value of f (x t ) f (x (k 1) ) If the value of f (x) grows, the step size must decrease If the value of f (x) decreases, the step size can be larger for faster convergence 18 / 44
23 2. More on Line Search Bold Driver Heuristic - Update Rule We need to define an increase factor µ + > 1, e.g. µ + := 1.05, and a decay factor µ (0, 1), e.g., µ := 0.5. Step size update rule: adapt stepsize only once after each epoch, not for every (inner) iteration. Cycle through the whole data and update the parameters Evaluate the objective function f (x (k) ) if f (x (k) ) < f (x (k 1) ) then µ µ + µ else f (x (k) ) > f (x (k 1) ) then µ µ µ different from the bold driver heuristics for batch gradient descent, there is no way to evaluate f (x + µ x) for different µ. stepsize µ is adapted once after the step has been done 19 / 44
24 2. More on Line Search Bold Driver where µ stepsize of last update 1 stepsize-bd(µ, f new, f old, µ +, µ ): 2 if f new < f old 3 µ := µ + µ 4 else 5 µ := µ µ 6 return µ f new, f old = f (x k ), f (x k 1 ) function values before and after the last update µ +, µ stepsize increase and decay factors 20 / 44
25 2. More on Line Search Considerations Works well for a range of problems The initial µ just needs to be large enough µ + and µ have to be adjusted to the problem at hand May lead to faster convergence 21 / 44
26 2. More on Line Search AdaGrad Adagrad adjusts the step size individually for each variable to be optimized It uses information about the past gradients Leads to faster convergence Less sensitive to the choice of the step size 22 / 44
27 2. More on Line Search AdaGrad - Update Rule We have Update rule: f (x) = M f m (x) m=1 Update stepsize for every inner iteration Pick a random instance m Uniform(1, M) Compute the gradient x f m (x) Update the gradient history h := h + x f m (x) x f m (x) The step size for variable x n is µ n := µ 0 hn Update x next := x µ x f m (x) i.e., x next n := x n µ 0 hn ( x f m (x)) n denotes the elementwise product. 23 / 44
28 2. More on Line Search AdaGrad where 1 stepsize-adagrad(g, h, µ 0 ): 2 h := h + g g 3 µ n := µ 0 / h n for n = 1,..., N 4 return (µ, h) returns a vector of stepsizes, one for each variable g = f m (x) current (stochastic) gradient h past gradient size history µ 0 initial stepsize 24 / 44
29 2. More on Line Search AdaGrad Step Size ADAGRAD Step Size MSE Iterations 25 / 44
30 2. More on Line Search AdaGrad vs Fixed Step Size ADAGRAD Step Size MSE AdaGrad Fixed Step Size Iterations 26 / 44
31 3. Example: SGD for Linear Regression Outline 1. Stochastic Gradient Descent (SGD) 2. More on Line Search 3. Example: SGD for Linear Regression 4. Stochastic Gradient Descent in Practice 27 / 44
32 3. Example: SGD for Linear Regression Practical Example: Household Spending Suppose we have the following data about different households: Number of workers in the household (a 1 ) Household composition (a 2 ) Region (a 3 ) Gross normal weekly household income (a 4 ) Weekly household spending (y) We want to creat a model of the weekly household spending 27 / 44
33 3. Example: SGD for Linear Regression Practical Example: Household Spending If we have data about m households, we can represent it as: 1 a 1,2... a 1,n y 1 1 a 2,2... a 2,n A m,n =.... y = y 2. 1 a m,2... a m,n y m We can model the household consumption is a linear combination of the household features with parameters x: ŷ i = x T a i = x x 1 a i,1 + x 2 a i,2 + x 3 a i,3 + x 4 a i,4 28 / 44
34 3. Example: SGD for Linear Regression Least Square Problem Revisited The following least square problem minimize Ax y 2 2 Can be rewritten as minimize m (x T a i y i ) 2 i=1 1 a 1,1 a 1,2 a 1,3 a 1,4 y 1 1 a 2,1 a 2,2 a 2,3 a 2,4 A m,n =..... y = y 2. 1 a m,1 a m,2 a m,3 a m,4 y m 29 / 44
35 3. Example: SGD for Linear Regression The Gradient Descent update rule For the problem minimize m (x T a i y i ) 2 i=1 The the gradient f (x) of the objective function is: m x f (x) = 2 (x T a i y i )a i i=1 The Gradient Descent update rule is then: ( ) m x x µ 2 (x T a i y i )a i i=1 30 / 44
36 3. Example: SGD for Linear Regression The Gradient Descent update rule We need to see all the data before updating x ( ) m x x µ 2 (x T a i y i )a i Can we make any progress before iterating over all the data? i=1 31 / 44
37 3. Example: SGD for Linear Regression Decomposing the objective function The objective function m f (x) = (x T a i y i ) 2 i=1 Can be expressed as a function of the objective on each data point (a, y): So that f i (x) = (x T a i y i ) 2 f (x) = m f i (x) i=1 32 / 44
38 3. Example: SGD for Linear Regression A simpler update rule Now that we have f (x) = m f i (x) i=1 We can define the following update rule Pick a random instance i Uniform(1, m) Update x x x + µ ( x f i (x)) 33 / 44
39 3. Example: SGD for Linear Regression Stochastic Gradient Descent (SGD) 1: procedure StochasticGradiendDescent input: f, µ 2: Get initial point x 3: repeat 4: for i 1,..., m do 5: x x µ f i (x) 6: end for 7: until convergence 8: return x, f (x) 9: end procedure 34 / 44
40 3. Example: SGD for Linear Regression SGD and Least Squares We have m f (x) = f i (x) i=1 with f i (x) = (x T a i y i ) 2 The update rule is x f i (x) = 2(x T a i y i )a i ) x x µ (2(x T a i y i )a i 35 / 44
41 3. Example: SGD for Linear Regression SGD vs. GD 1: procedure SGD input: f, µ 2: Get initial point x 3: repeat 4: for i 1,..., m do 5: x x µ f i (x) 6: end for 7: until convergence 8: return x, f (x) 9: end procedure 1: procedure GradientDescent input: f 2: Get initial point x 3: repeat 4: Get Step Size µ 5: x := x µ f (x) 6: until convergence 7: return x, f (x) 8: end procedure 36 / 44
42 3. Example: SGD for Linear Regression SGD vs. GD - Least Squares 1: procedure SGD input: f, µ 2: Get initial point x 3: repeat 4: for i 1,..., m do 5: x x µ ( 2(x T ) a i y i )a i 6: end for 7: until convergence 8: return x, f (x) 9: end procedure 1: procedure GD input: f 2: Get initial point x 3: repeat 4: Get Step Size µ 5: x x µ ( 2 m ) i=1 (xt a i y i )a i 6: until convergence 7: return x, f (x) 8: end procedure 37 / 44
43 4. Stochastic Gradient Descent in Practice Outline 1. Stochastic Gradient Descent (SGD) 2. More on Line Search 3. Example: SGD for Linear Regression 4. Stochastic Gradient Descent in Practice 38 / 44
44 4. Stochastic Gradient Descent in Practice GD Step Size GD Step Size MSE Iterations 38 / 44
45 4. Stochastic Gradient Descent in Practice SGD vs GD - Body Fat Dataset SGD vs GD MSE SGD GD Iterations 39 / 44
46 4. Stochastic Gradient Descent in Practice Year Prediction Data Set Least Squares Problem Prediction of the release year of a song from audio features 90 features Experiments done on a subset of 1000 instances of the data 40 / 44
47 4. Stochastic Gradient Descent in Practice GD Step Size - Year Prediction GD Step Size MSE 0e+00 2e+05 4e+05 6e+05 8e+05 1e Iterations 41 / 44
48 4. Stochastic Gradient Descent in Practice SGD Step Size - Year Prediction SGD Step Size MSE 0e+00 1e+05 2e+05 3e+05 4e+05 5e Iterations 42 / 44
49 4. Stochastic Gradient Descent in Practice AdaGrad Step Size - Year Prediction ADAGRAD Step Size MSE 0e+00 1e+05 2e+05 3e+05 4e+05 5e Iterations 43 / 44
50 4. Stochastic Gradient Descent in Practice AdaGrad vs SGD vs GD - Year Prediction ADAGRAD Step Size MSE 0e+00 2e+05 4e+05 6e+05 8e+05 1e+06 AdaGrad GD SGD Iterations 44 / 44
51 Further Readings SGD is not covered in Boyd and Vandenberghe [2004]. Leon Bottou, Frank E. Curtis, Jorge Nocedal (2016): Stochastic Gradient Methods for Large-Scale Machine Learning, ICML 2016 Tutorial, Francis Bach (2013): Stochastic gradient methods for machine learning, Microsoft Machine Learning Summit 2013, events/mls2013/downloads/stochastic_gradient.pdf for the convergence proof: Ji Liu (2014), Notes Stochastic Gradient Descent, 45 / 44
52 References I Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge Univ Press, / 44
Modern Optimization Techniques
Modern Optimization Techniques Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Stochastic Gradient Descent Stochastic
More informationBig Data Analytics. Lucas Rego Drumond
Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Predictive Models Predictive Models 1 / 34 Outline
More informationModern Optimization Techniques
Modern Optimization Techniques 0. Overview Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany 1 / 44 Syllabus Mon.
More informationSelected Topics in Optimization. Some slides borrowed from
Selected Topics in Optimization Some slides borrowed from http://www.stat.cmu.edu/~ryantibs/convexopt/ Overview Optimization problems are almost everywhere in statistics and machine learning. Input Model
More informationMachine Learning. Support Vector Machines. Fabio Vandin November 20, 2017
Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training
More informationMachine Learning. A. Supervised Learning A.1. Linear Regression. Lars Schmidt-Thieme
Machine Learning A. Supervised Learning A.1. Linear Regression Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) Institute for Computer Science University of Hildesheim, Germany
More informationOptimization for Machine Learning
Optimization for Machine Learning (Lecture 3-A - Convex) SUVRIT SRA Massachusetts Institute of Technology Special thanks: Francis Bach (INRIA, ENS) (for sharing this material, and permitting its use) MPI-IS
More informationConvex Optimization Lecture 16
Convex Optimization Lecture 16 Today: Projected Gradient Descent Conditional Gradient Descent Stochastic Gradient Descent Random Coordinate Descent Recall: Gradient Descent (Steepest Descent w.r.t Euclidean
More informationECS171: Machine Learning
ECS171: Machine Learning Lecture 4: Optimization (LFD 3.3, SGD) Cho-Jui Hsieh UC Davis Jan 22, 2018 Gradient descent Optimization Goal: find the minimizer of a function min f (w) w For now we assume f
More informationSTA141C: Big Data & High Performance Statistical Computing
STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied
More informationA Parallel SGD method with Strong Convergence
A Parallel SGD method with Strong Convergence Dhruv Mahajan Microsoft Research India dhrumaha@microsoft.com S. Sundararajan Microsoft Research India ssrajan@microsoft.com S. Sathiya Keerthi Microsoft Corporation,
More informationStochastic Gradient Descent. Ryan Tibshirani Convex Optimization
Stochastic Gradient Descent Ryan Tibshirani Convex Optimization 10-725 Last time: proximal gradient descent Consider the problem min x g(x) + h(x) with g, h convex, g differentiable, and h simple in so
More informationOptimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade
Optimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade Machine Learning for Big Data CSE547/STAT548 University of Washington S. M. Kakade (UW) Optimization for
More informationCS260: Machine Learning Algorithms
CS260: Machine Learning Algorithms Lecture 4: Stochastic Gradient Descent Cho-Jui Hsieh UCLA Jan 16, 2019 Large-scale Problems Machine learning: usually minimizing the training loss min w { 1 N min w {
More informationSupport Vector Machines: Training with Stochastic Gradient Descent. Machine Learning Fall 2017
Support Vector Machines: Training with Stochastic Gradient Descent Machine Learning Fall 2017 1 Support vector machines Training by maximizing margin The SVM objective Solving the SVM optimization problem
More informationOptimization for Machine Learning
Optimization for Machine Learning (Problems; Algorithms - A) SUVRIT SRA Massachusetts Institute of Technology PKU Summer School on Data Science (July 2017) Course materials http://suvrit.de/teaching.html
More informationLogistic Regression. Stochastic Gradient Descent
Tutorial 8 CPSC 340 Logistic Regression Stochastic Gradient Descent Logistic Regression Model A discriminative probabilistic model for classification e.g. spam filtering Let x R d be input and y { 1, 1}
More informationNonlinear Optimization Methods for Machine Learning
Nonlinear Optimization Methods for Machine Learning Jorge Nocedal Northwestern University University of California, Davis, Sept 2018 1 Introduction We don t really know, do we? a) Deep neural networks
More informationLinear Regression (continued)
Linear Regression (continued) Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 6, 2017 1 / 39 Outline 1 Administration 2 Review of last lecture 3 Linear regression
More informationDay 3 Lecture 3. Optimizing deep networks
Day 3 Lecture 3 Optimizing deep networks Convex optimization A function is convex if for all α [0,1]: f(x) Tangent line Examples Quadratics 2-norms Properties Local minimum is global minimum x Gradient
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton s method Stochastic Gradient Descent Numerical Optimization
More informationCS60021: Scalable Data Mining. Large Scale Machine Learning
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 1 CS60021: Scalable Data Mining Large Scale Machine Learning Sourangshu Bhattacharya Example: Spam filtering Instance
More informationConstrained Optimization and Lagrangian Duality
CIS 520: Machine Learning Oct 02, 2017 Constrained Optimization and Lagrangian Duality Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may
More informationOptimization Methods for Machine Learning
Optimization Methods for Machine Learning Sathiya Keerthi Microsoft Talks given at UC Santa Cruz February 21-23, 2017 The slides for the talks will be made available at: http://www.keerthis.com/ Introduction
More informationLeast Mean Squares Regression
Least Mean Squares Regression Machine Learning Spring 2018 The slides are mainly from Vivek Srikumar 1 Lecture Overview Linear classifiers What functions do linear classifiers express? Least Squares Method
More informationLarge-scale Stochastic Optimization
Large-scale Stochastic Optimization 11-741/641/441 (Spring 2016) Hanxiao Liu hanxiaol@cs.cmu.edu March 24, 2016 1 / 22 Outline 1. Gradient Descent (GD) 2. Stochastic Gradient Descent (SGD) Formulation
More informationCoordinate Descent and Ascent Methods
Coordinate Descent and Ascent Methods Julie Nutini Machine Learning Reading Group November 3 rd, 2015 1 / 22 Projected-Gradient Methods Motivation Rewrite non-smooth problem as smooth constrained problem:
More informationOptimization for Machine Learning
Optimization for Machine Learning Elman Mansimov 1 September 24, 2015 1 Modified based on Shenlong Wang s and Jake Snell s tutorials, with additional contents borrowed from Kevin Swersky and Jasper Snoek
More informationSub-Sampled Newton Methods
Sub-Sampled Newton Methods F. Roosta-Khorasani and M. W. Mahoney ICSI and Dept of Statistics, UC Berkeley February 2016 F. Roosta-Khorasani and M. W. Mahoney (UCB) Sub-Sampled Newton Methods Feb 2016 1
More informationCSE 417T: Introduction to Machine Learning. Lecture 11: Review. Henry Chai 10/02/18
CSE 417T: Introduction to Machine Learning Lecture 11: Review Henry Chai 10/02/18 Unknown Target Function!: # % Training data Formal Setup & = ( ), + ),, ( -, + - Learning Algorithm 2 Hypothesis Set H
More informationStochastic Optimization Algorithms Beyond SG
Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods
More informationStochastic Gradient Descent. CS 584: Big Data Analytics
Stochastic Gradient Descent CS 584: Big Data Analytics Gradient Descent Recap Simplest and extremely popular Main Idea: take a step proportional to the negative of the gradient Easy to implement Each iteration
More informationIPAM Summer School Optimization methods for machine learning. Jorge Nocedal
IPAM Summer School 2012 Tutorial on Optimization methods for machine learning Jorge Nocedal Northwestern University Overview 1. We discuss some characteristics of optimization problems arising in deep
More informationQuasi-Newton Methods. Javier Peña Convex Optimization /36-725
Quasi-Newton Methods Javier Peña Convex Optimization 10-725/36-725 Last time: primal-dual interior-point methods Consider the problem min x subject to f(x) Ax = b h(x) 0 Assume f, h 1,..., h m are convex
More informationStochastic Gradient Descent
Stochastic Gradient Descent Weihang Chen, Xingchen Chen, Jinxiu Liang, Cheng Xu, Zehao Chen and Donglin He March 26, 2017 Outline What is Stochastic Gradient Descent Comparison between BGD and SGD Analysis
More informationConvex Optimization. Dani Yogatama. School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA. February 12, 2014
Convex Optimization Dani Yogatama School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA February 12, 2014 Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12,
More informationAn Evolving Gradient Resampling Method for Machine Learning. Jorge Nocedal
An Evolving Gradient Resampling Method for Machine Learning Jorge Nocedal Northwestern University NIPS, Montreal 2015 1 Collaborators Figen Oztoprak Stefan Solntsev Richard Byrd 2 Outline 1. How to improve
More informationApprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning
Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning Nicolas Thome Prenom.Nom@cnam.fr http://cedric.cnam.fr/vertigo/cours/ml2/ Département Informatique Conservatoire
More informationLecture 2 - Learning Binary & Multi-class Classifiers from Labelled Training Data
Lecture 2 - Learning Binary & Multi-class Classifiers from Labelled Training Data DD2424 March 23, 2017 Binary classification problem given labelled training data Have labelled training examples? Given
More informationA Quick Tour of Linear Algebra and Optimization for Machine Learning
A Quick Tour of Linear Algebra and Optimization for Machine Learning Masoud Farivar January 8, 2015 1 / 28 Outline of Part I: Review of Basic Linear Algebra Matrices and Vectors Matrix Multiplication Operators
More informationComparison of Modern Stochastic Optimization Algorithms
Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,
More informationStatistical Computing (36-350)
Statistical Computing (36-350) Lecture 19: Optimization III: Constrained and Stochastic Optimization Cosma Shalizi 30 October 2013 Agenda Constraints and Penalties Constraints and penalties Stochastic
More informationSubgradient Method. Guest Lecturer: Fatma Kilinc-Karzan. Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization /36-725
Subgradient Method Guest Lecturer: Fatma Kilinc-Karzan Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization 10-725/36-725 Adapted from slides from Ryan Tibshirani Consider the problem Recall:
More informationDeep Feedforward Networks
Deep Feedforward Networks Liu Yang March 30, 2017 Liu Yang Short title March 30, 2017 1 / 24 Overview 1 Background A general introduction Example 2 Gradient based learning Cost functions Output Units 3
More informationSub-Sampled Newton Methods for Machine Learning. Jorge Nocedal
Sub-Sampled Newton Methods for Machine Learning Jorge Nocedal Northwestern University Goldman Lecture, Sept 2016 1 Collaborators Raghu Bollapragada Northwestern University Richard Byrd University of Colorado
More informationBeyond stochastic gradient descent for large-scale machine learning
Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines - October 2014 Big data revolution? A new
More informationStochastic optimization in Hilbert spaces
Stochastic optimization in Hilbert spaces Aymeric Dieuleveut Aymeric Dieuleveut Stochastic optimization Hilbert spaces 1 / 48 Outline Learning vs Statistics Aymeric Dieuleveut Stochastic optimization Hilbert
More informationStochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure
Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure Alberto Bietti Julien Mairal Inria Grenoble (Thoth) March 21, 2017 Alberto Bietti Stochastic MISO March 21,
More informationAdvanced Topics in Machine Learning
Advanced Topics in Machine Learning 1. Learning SVMs / Primal Methods Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) University of Hildesheim, Germany 1 / 16 Outline 10. Linearization
More informationSimple Techniques for Improving SGD. CS6787 Lecture 2 Fall 2017
Simple Techniques for Improving SGD CS6787 Lecture 2 Fall 2017 Step Sizes and Convergence Where we left off Stochastic gradient descent x t+1 = x t rf(x t ; yĩt ) Much faster per iteration than gradient
More informationRegression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning)
Linear Regression Regression Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning) Example: Height, Gender, Weight Shoe Size Audio features
More informationMachine Learning. Lecture 2: Linear regression. Feng Li. https://funglee.github.io
Machine Learning Lecture 2: Linear regression Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2017 Supervised Learning Regression: Predict
More informationMax Margin-Classifier
Max Margin-Classifier Oliver Schulte - CMPT 726 Bishop PRML Ch. 7 Outline Maximum Margin Criterion Math Maximizing the Margin Non-Separable Data Kernels and Non-linear Mappings Where does the maximization
More informationRirdge Regression. Szymon Bobek. Institute of Applied Computer science AGH University of Science and Technology
Rirdge Regression Szymon Bobek Institute of Applied Computer science AGH University of Science and Technology Based on Carlos Guestrin adn Emily Fox slides from Coursera Specialization on Machine Learnign
More informationCSCI 1951-G Optimization Methods in Finance Part 12: Variants of Gradient Descent
CSCI 1951-G Optimization Methods in Finance Part 12: Variants of Gradient Descent April 27, 2018 1 / 32 Outline 1) Moment and Nesterov s accelerated gradient descent 2) AdaGrad and RMSProp 4) Adam 5) Stochastic
More informationRegression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning)
Linear Regression Regression Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning) Example: Height, Gender, Weight Shoe Size Audio features
More informationNeural Networks. David Rosenberg. July 26, New York University. David Rosenberg (New York University) DS-GA 1003 July 26, / 35
Neural Networks David Rosenberg New York University July 26, 2017 David Rosenberg (New York University) DS-GA 1003 July 26, 2017 1 / 35 Neural Networks Overview Objectives What are neural networks? How
More informationSimple Optimization, Bigger Models, and Faster Learning. Niao He
Simple Optimization, Bigger Models, and Faster Learning Niao He Big Data Symposium, UIUC, 2016 Big Data, Big Picture Niao He (UIUC) 2/26 Big Data, Big Picture Niao He (UIUC) 3/26 Big Data, Big Picture
More informationConditional Gradient (Frank-Wolfe) Method
Conditional Gradient (Frank-Wolfe) Method Lecturer: Aarti Singh Co-instructor: Pradeep Ravikumar Convex Optimization 10-725/36-725 1 Outline Today: Conditional gradient method Convergence analysis Properties
More informationGaussian and Linear Discriminant Analysis; Multiclass Classification
Gaussian and Linear Discriminant Analysis; Multiclass Classification Professor Ameet Talwalkar Slide Credit: Professor Fei Sha Professor Ameet Talwalkar CS260 Machine Learning Algorithms October 13, 2015
More informationHomework 4. Convex Optimization /36-725
Homework 4 Convex Optimization 10-725/36-725 Due Friday November 4 at 5:30pm submitted to Christoph Dann in Gates 8013 (Remember to a submit separate writeup for each problem, with your name at the top)
More informationVariations of Logistic Regression with Stochastic Gradient Descent
Variations of Logistic Regression with Stochastic Gradient Descent Panqu Wang(pawang@ucsd.edu) Phuc Xuan Nguyen(pxn002@ucsd.edu) January 26, 2012 Abstract In this paper, we extend the traditional logistic
More informationAccelerating SVRG via second-order information
Accelerating via second-order information Ritesh Kolte Department of Electrical Engineering rkolte@stanford.edu Murat Erdogdu Department of Statistics erdogdu@stanford.edu Ayfer Özgür Department of Electrical
More information12. Interior-point methods
12. Interior-point methods Convex Optimization Boyd & Vandenberghe inequality constrained minimization logarithmic barrier function and central path barrier method feasibility and phase I methods complexity
More informationMotivation Subgradient Method Stochastic Subgradient Method. Convex Optimization. Lecture 15 - Gradient Descent in Machine Learning
Convex Optimization Lecture 15 - Gradient Descent in Machine Learning Instructor: Yuanzhang Xiao University of Hawaii at Manoa Fall 2017 1 / 21 Today s Lecture 1 Motivation 2 Subgradient Method 3 Stochastic
More informationDeep Learning Lab Course 2017 (Deep Learning Practical)
Deep Learning Lab Course 207 (Deep Learning Practical) Labs: (Computer Vision) Thomas Brox, (Robotics) Wolfram Burgard, (Machine Learning) Frank Hutter, (Neurorobotics) Joschka Boedecker University of
More informationNumerical optimization
Numerical optimization Lecture 4 Alexander & Michael Bronstein tosca.cs.technion.ac.il/book Numerical geometry of non-rigid shapes Stanford University, Winter 2009 2 Longest Slowest Shortest Minimal Maximal
More informationCSC321 Lecture 8: Optimization
CSC321 Lecture 8: Optimization Roger Grosse Roger Grosse CSC321 Lecture 8: Optimization 1 / 26 Overview We ve talked a lot about how to compute gradients. What do we actually do with them? Today s lecture:
More informationCSC321 Lecture 2: Linear Regression
CSC32 Lecture 2: Linear Regression Roger Grosse Roger Grosse CSC32 Lecture 2: Linear Regression / 26 Overview First learning algorithm of the course: linear regression Task: predict scalar-valued targets,
More informationBig Data Analytics: Optimization and Randomization
Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.
More informationLeast Mean Squares Regression. Machine Learning Fall 2018
Least Mean Squares Regression Machine Learning Fall 2018 1 Where are we? Least Squares Method for regression Examples The LMS objective Gradient descent Incremental/stochastic gradient descent Exercises
More informationStochastic Subgradient Method
Stochastic Subgradient Method Lingjie Weng, Yutian Chen Bren School of Information and Computer Science UC Irvine Subgradient Recall basic inequality for convex differentiable f : f y f x + f x T (y x)
More informationNeural Networks: Optimization & Regularization
Neural Networks: Optimization & Regularization Shan-Hung Wu shwu@cs.nthu.edu.tw Department of Computer Science, National Tsing Hua University, Taiwan Machine Learning Shan-Hung Wu (CS, NTHU) NN Opt & Reg
More informationLogistic Regression. William Cohen
Logistic Regression William Cohen 1 Outline Quick review classi5ication, naïve Bayes, perceptrons new result for naïve Bayes Learning as optimization Logistic regression via gradient ascent Over5itting
More informationNumerical optimization. Numerical optimization. Longest Shortest where Maximal Minimal. Fastest. Largest. Optimization problems
1 Numerical optimization Alexander & Michael Bronstein, 2006-2009 Michael Bronstein, 2010 tosca.cs.technion.ac.il/book Numerical optimization 048921 Advanced topics in vision Processing and Analysis of
More informationLinear Convergence under the Polyak-Łojasiewicz Inequality
Linear Convergence under the Polyak-Łojasiewicz Inequality Hamed Karimi, Julie Nutini, Mark Schmidt University of British Columbia Linear of Convergence of Gradient-Based Methods Fitting most machine learning
More informationEmpirical Risk Minimization
Empirical Risk Minimization Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Introduction PAC learning ERM in practice 2 General setting Data X the input space and Y the output space
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Nov 2, 2016 Outline SGD-typed algorithms for Deep Learning Parallel SGD for deep learning Perceptron Prediction value for a training data: prediction
More informationLinear Discrimination Functions
Laurea Magistrale in Informatica Nicola Fanizzi Dipartimento di Informatica Università degli Studi di Bari November 4, 2009 Outline Linear models Gradient descent Perceptron Minimum square error approach
More informationImproving L-BFGS Initialization for Trust-Region Methods in Deep Learning
Improving L-BFGS Initialization for Trust-Region Methods in Deep Learning Jacob Rafati http://rafati.net jrafatiheravi@ucmerced.edu Ph.D. Candidate, Electrical Engineering and Computer Science University
More informationRandomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity
Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity Zheng Qu University of Hong Kong CAM, 23-26 Aug 2016 Hong Kong based on joint work with Peter Richtarik and Dominique Cisba(University
More informationTutorial on: Optimization I. (from a deep learning perspective) Jimmy Ba
Tutorial on: Optimization I (from a deep learning perspective) Jimmy Ba Outline Random search v.s. gradient descent Finding better search directions Design white-box optimization methods to improve computation
More informationThe Kernel Trick, Gram Matrices, and Feature Extraction. CS6787 Lecture 4 Fall 2017
The Kernel Trick, Gram Matrices, and Feature Extraction CS6787 Lecture 4 Fall 2017 Momentum for Principle Component Analysis CS6787 Lecture 3.1 Fall 2017 Principle Component Analysis Setting: find the
More informationStochastic gradient descent and robustness to ill-conditioning
Stochastic gradient descent and robustness to ill-conditioning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint work with Aymeric Dieuleveut, Nicolas Flammarion,
More informationCSC 411 Lecture 17: Support Vector Machine
CSC 411 Lecture 17: Support Vector Machine Ethan Fetaya, James Lucas and Emad Andrews University of Toronto CSC411 Lec17 1 / 1 Today Max-margin classification SVM Hard SVM Duality Soft SVM CSC411 Lec17
More informationDeepLearning on FPGAs
DeepLearning on FPGAs Introduction to Deep Learning Sebastian Buschäger Technische Universität Dortmund - Fakultät Informatik - Lehrstuhl 8 October 21, 2017 1 Recap Computer Science Approach Technical
More informationLinear Models in Machine Learning
CS540 Intro to AI Linear Models in Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu We briefly go over two linear models frequently used in machine learning: linear regression for, well, regression,
More informationIntroduction to Optimization
Introduction to Optimization Konstantin Tretyakov (kt@ut.ee) MTAT.03.227 Machine Learning So far Machine learning is important and interesting The general concept: Fitting models to data So far Machine
More informationAdaptive Subgradient Methods for Online Learning and Stochastic Optimization John Duchi, Elad Hanzan, Yoram Singer
Adaptive Subgradient Methods for Online Learning and Stochastic Optimization John Duchi, Elad Hanzan, Yoram Singer Vicente L. Malave February 23, 2011 Outline Notation minimize a number of functions φ
More informationLecture 6 Optimization for Deep Neural Networks
Lecture 6 Optimization for Deep Neural Networks CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago April 12, 2017 Things we will look at today Stochastic Gradient Descent Things
More informationNumerical Optimization Techniques
Numerical Optimization Techniques Léon Bottou NEC Labs America COS 424 3/2/2010 Today s Agenda Goals Representation Capacity Control Operational Considerations Computational Considerations Classification,
More informationGRADIENT DESCENT. CSE 559A: Computer Vision GRADIENT DESCENT GRADIENT DESCENT [0, 1] Pr(y = 1) w T x. 1 f (x; θ) = 1 f (x; θ) = exp( w T x)
0 x x x CSE 559A: Computer Vision For Binary Classification: [0, ] f (x; ) = σ( x) = exp( x) + exp( x) Output is interpreted as probability Pr(y = ) x are the log-odds. Fall 207: -R: :30-pm @ Lopata 0
More informationNeed for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels
Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)
More informationMachine Learning CS 4900/5900. Lecture 03. Razvan C. Bunescu School of Electrical Engineering and Computer Science
Machine Learning CS 4900/5900 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Machine Learning is Optimization Parametric ML involves minimizing an objective function
More informationECS171: Machine Learning
ECS171: Machine Learning Lecture 3: Linear Models I (LFD 3.2, 3.3) Cho-Jui Hsieh UC Davis Jan 17, 2018 Linear Regression (LFD 3.2) Regression Classification: Customer record Yes/No Regression: predicting
More informationContents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016
ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 LECTURERS: NAMAN AGARWAL AND BRIAN BULLINS SCRIBE: KIRAN VODRAHALLI Contents 1 Introduction 1 1.1 History of Optimization.....................................
More informationLogistic Regression. COMP 527 Danushka Bollegala
Logistic Regression COMP 527 Danushka Bollegala Binary Classification Given an instance x we must classify it to either positive (1) or negative (0) class We can use {1,-1} instead of {1,0} but we will
More informationECE521 lecture 4: 19 January Optimization, MLE, regularization
ECE521 lecture 4: 19 January 2017 Optimization, MLE, regularization First four lectures Lectures 1 and 2: Intro to ML Probability review Types of loss functions and algorithms Lecture 3: KNN Convexity
More informationStatistical Machine Learning from Data
January 17, 2006 Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Multi-Layer Perceptrons Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole
More informationStochastic Optimization Methods for Machine Learning. Jorge Nocedal
Stochastic Optimization Methods for Machine Learning Jorge Nocedal Northwestern University SIAM CSE, March 2017 1 Collaborators Richard Byrd R. Bollagragada N. Keskar University of Colorado Northwestern
More information