Modern Optimization Techniques

2. Unconstrained Optimization / 2.2 Stochastic Gradient Descent
Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), Institute of Computer Science, University of Hildesheim, Germany
Original slides by Lucas Rego Drumond (ISMLL)

Syllabus
Tue. (0): 0. Overview
1. Theory
Tue. (1): 1. Convex Sets and Functions
2. Unconstrained Optimization
Tue. (2): 2.1 Gradient Descent
Tue. (3): 2.2 Stochastic Gradient Descent
Tue. (4): 2.3 Newton's Method
Tue. (5): 2.3b Newton's Method (Part 2)
Tue. (6): 2.4 Subgradient Methods
Tue. (7): 2.5 Coordinate Descent
3. Equality Constrained Optimization
Tue. (8): 3.1 Duality
Tue. (9): 3.2 Methods
(Christmas Break)
4. Inequality Constrained Optimization
Tue. (10): 4.1 Interior Point Methods
Tue. (11): 4.2 Cutting Plane Method
5. Distributed Optimization
Tue. (11): 5.1 Alternating Direction Method of Multipliers
Tue. (12): Questions and Answers

Outline
1. Stochastic Gradient Descent (SGD)
2. More on Line Search
3. Example: SGD for Linear Regression
4. Stochastic Gradient Descent in Practice

1. Stochastic Gradient Descent (SGD)

Unconstrained Convex Optimization
arg min_{x ∈ dom f} f(x)
where dom f ⊆ R^N is open (unconstrained optimization) and f is convex.

Stochastic Gradient
Gradient Descent makes use of the gradient ∇f(x).
Stochastic Gradient Descent makes use of a stochastic gradient only:
g(x) ~ p(g ∈ R^N | x),  E_p(g(x)) = ∇f(x)
i.e., for each point x ∈ R^N, g(x) is a random variable over R^N with distribution p (conditional on x) that on average yields the gradient (at each point).

Stochastic Gradient / Example: Big Sums
f is a "big sum":
f(x) = (1/C) Σ_{c=1}^C f_c(x),  with f_c convex, c = 1, ..., C
g is the gradient of a random summand:
p(g | x) := Unif({∇f_c(x) | c = 1, ..., C})

Stochastic Gradient / Example: Least Squares
min_{x ∈ R^N} f(x) := ||Ax − b||_2^2
- will find a solution of Ax = b if there is any (then ||Ax − b||_2^2 = 0)
- otherwise will find the x where the difference Ax − b between left and right side is as small as possible (in the squared L2 norm)
- is a big sum:
f(x) := ||Ax − b||_2^2 = Σ_{m=1}^M ((Ax)_m − b_m)^2 = (1/M) Σ_{m=1}^M f_m(x),  f_m(x) := M ((Ax)_m − b_m)^2
- stochastic gradient g: the gradient of a random component m
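
This decomposition can be checked numerically: averaging the component gradients ∇f_m recovers the full gradient ∇f, so a uniformly drawn component yields an unbiased stochastic gradient. A minimal NumPy sketch (A, b, and x are made-up illustration data, not from the lecture):

import numpy as np

rng = np.random.default_rng(0)
M, N = 5, 3
A = rng.normal(size=(M, N))   # made-up data, only for illustration
b = rng.normal(size=M)
x = rng.normal(size=N)

def grad_f(x):
    # gradient of f(x) = ||Ax - b||_2^2
    return 2 * A.T @ (A @ x - b)

def grad_f_m(x, m):
    # gradient of f_m(x) = M * ((Ax)_m - b_m)^2
    return 2 * M * (A[m] @ x - b[m]) * A[m]

# averaging the per-component gradients recovers the full gradient,
# so a uniformly drawn component yields an unbiased stochastic gradient
avg = np.mean([grad_f_m(x, m) for m in range(M)], axis=0)
print(np.allclose(avg, grad_f(x)))   # expected: True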

Stochastic Gradient / Example: Supervised Learning
min_{θ ∈ R^P} f(θ) := (1/N) Σ_{n=1}^N ℓ(y_n, ŷ(x_n; θ)) + λ ||θ||_2^2
where
- (x_n, y_n) ∈ R^M × R^T are N training samples,
- ŷ is a parametrized model, e.g., logistic regression ŷ(x; θ) := (1 + e^{−θ^T x})^{−1}, with P := M, T := 1,
- ℓ is a loss, e.g., the negative binomial log-likelihood: ℓ(y, ŷ) := −y log ŷ − (1 − y) log(1 − ŷ),
- λ ∈ R_0^+ is the regularization weight.
This will find the parametrization with the best trade-off between low loss and low model complexity.

Stochastic Gradient / Example: Supervised Learning (2/2)
min_{θ ∈ R^P} f(θ) := (1/N) Σ_{n=1}^N ℓ(y_n, ŷ(x_n; θ)) + λ ||θ||_2^2
where (x_n, y_n) ∈ R^M × R^T are N training samples, ...
is a big sum:
f(θ) := (1/N) Σ_{n=1}^N f_n(θ),  f_n(θ) := ℓ(y_n, ŷ(x_n; θ)) + λ ||θ||_2^2
stochastic gradient g: the gradient for a random sample n
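
For the logistic regression instance of this objective, the per-sample stochastic gradient can be sketched as follows; the function and variable names are placeholders and the data is made up purely for illustration:

import numpy as np

def stochastic_gradient(theta, X, y, lam, rng):
    """Stochastic gradient of f_n(theta) = l(y_n, yhat(x_n; theta)) + lam*||theta||^2
    for a uniformly drawn sample n, with logistic regression and y in {0, 1}."""
    n = rng.integers(len(y))                    # draw a random sample n
    x_n, y_n = X[n], y[n]
    yhat = 1.0 / (1.0 + np.exp(-theta @ x_n))   # logistic model yhat(x_n; theta)
    # gradient of the negative log-likelihood w.r.t. theta is (yhat - y_n) * x_n
    return (yhat - y_n) * x_n + 2.0 * lam * theta

# toy usage with made-up data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (rng.random(100) < 0.5).astype(float)
theta = np.zeros(5)
g = stochastic_gradient(theta, X, y, lam=0.01, rng=rng)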

Stochastic Gradient Descent
The very same as Gradient Descent, but use the stochastic gradient g(x) instead of the exact gradient ∇f(x) in each step:

min-sgd(f, p, x^(0), µ, K):
  for k := 1, ..., K:
    draw g^(k−1) ~ p(g | x^(k−1))
    Δx^(k−1) := −g^(k−1)
    µ^(k−1) := µ(f, x^(k−1), Δx^(k−1))
    x^(k) := x^(k−1) + µ^(k−1) Δx^(k−1)
  return x^(K)

where
f: objective function
p: (distribution of the) stochastic gradient of f
x^(0): starting value
µ: step length controller
K: number of iterations
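
A direct Python transcription of min-sgd could look like the following sketch, in which the stochastic gradient sampler draw_g and the step length controller mu are assumed to be supplied by the caller:

def min_sgd(f, draw_g, x0, mu, K):
    """Sketch of min-sgd: f objective function, draw_g(x) draws a stochastic
    gradient g ~ p(g | x), x0 starting value, mu(f, x, dx) step length
    controller, K number of iterations."""
    x = x0
    for k in range(K):
        g = draw_g(x)          # draw g^(k-1) ~ p(g | x^(k-1))
        dx = -g                # descent direction: Delta x^(k-1) := -g^(k-1)
        step = mu(f, x, dx)    # step length mu^(k-1)
        x = x + step * dx      # x^(k) := x^(k-1) + mu^(k-1) * Delta x^(k-1)
    return x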

Stochastic Gradient Descent / For Big Sums

min-sgd((f_c)_{c=1,...,C}, (∇f_c)_{c=1,...,C}, x^(0), µ, K):
  for k := 1, ..., K:
    draw c^(k−1) ~ Unif(1, ..., C)
    g^(k−1) := ∇f_{c^(k−1)}(x^(k−1))
    Δx^(k−1) := −g^(k−1)
    µ^(k−1) := µ(f, x^(k−1), Δx^(k−1))
    x^(k) := x^(k−1) + µ^(k−1) Δx^(k−1)
  return x^(K)

where
(f_c)_{c=1,...,C}: objective function summands, f := (1/C) Σ_{c=1}^C f_c
(∇f_c)_{c=1,...,C}: gradients of the objective function summands
x^(0): starting value
µ: step length controller
K: number of iterations

SGD / For Big Sums / Epochs

min-sgd((f_c)_{c=1,...,C}, (∇f_c)_{c=1,...,C}, x^(0), µ, K):
  C := (1, 2, ..., C)
  x^(0,C) := x^(0)
  for k := 1, ..., K:
    randomly shuffle C
    x^(k,0) := x^(k−1,C)
    for i := 1, ..., C:
      g^(k,i−1) := ∇f_{c_i}(x^(k,i−1))
      Δx^(k,i−1) := −g^(k,i−1)
      µ^(k,i−1) := µ(f, x^(k,i−1), Δx^(k,i−1))
      x^(k,i) := x^(k,i−1) + µ^(k,i−1) Δx^(k,i−1)
  return x^(K,C)

where
(f_c)_{c=1,...,C}: objective function summands, f := (1/C) Σ_{c=1}^C f_c
...
K: number of epochs
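
A sketch of the epoch variant in Python, with the summand gradients given as a list of callables and, for simplicity, a fixed step size instead of a step length controller:

import random

def min_sgd_epochs(grad_fs, x0, mu, K, seed=0):
    """Epoch-based SGD sketch: K passes over a freshly shuffled permutation
    of the summands; grad_fs is the list of gradients of the f_c."""
    rng = random.Random(seed)
    order = list(range(len(grad_fs)))
    x = x0
    for k in range(K):
        rng.shuffle(order)                  # randomly shuffle the summand order
        for c in order:
            x = x - mu * grad_fs[c](x)      # one stochastic gradient step
    return x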

Convergence of SGD
Theorem (Convergence of SGD). If
(i) f is strongly convex (∇²f(x) ⪰ mI, m ∈ R^+),
(ii) the expected squared norm of its stochastic gradient g is uniformly bounded (there is G ∈ R_0^+ such that for all x: E(||g(x)||^2) ≤ G^2), and
(iii) the step length µ^(k) := 1/(m(k+1)) is used,
then
E_p(||x^(k) − x*||^2) ≤ (1/(k+1)) max{||x^(0) − x*||^2, G^2/m^2}
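
The step size schedule assumed in (iii) is simple to generate in code; a small sketch where m is the strong convexity constant (assumed to be known):

def make_mu_schedule(m):
    """Yields the step sizes mu^(k) = 1 / (m * (k + 1)) for k = 0, 1, 2, ..."""
    k = 0
    def next_mu():
        nonlocal k
        mu = 1.0 / (m * (k + 1))
        k += 1
        return mu
    return next_mu

next_mu = make_mu_schedule(2.0)
print(next_mu(), next_mu(), next_mu())   # 0.5 0.25 0.1666...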

Convergence of SGD / Proof
By strong convexity (i):
f(x*) ≥ f(x) + ∇f(x)^T (x* − x) + (m/2) ||x* − x||^2
f(x) ≥ f(x*) + ∇f(x*)^T (x − x*) + (m/2) ||x − x*||^2 = f(x*) + (m/2) ||x − x*||^2
Summing both yields
0 ≥ ∇f(x)^T (x* − x) + m ||x* − x||^2,  i.e.  ∇f(x)^T (x − x*) ≥ m ||x − x*||^2   (1)
Then:
E(||x^(k) − x*||^2) = E(||x^(k−1) − µ^(k−1) g^(k−1) − x*||^2)
= E(||x^(k−1) − x*||^2) − 2 µ^(k−1) E((g^(k−1))^T (x^(k−1) − x*)) + (µ^(k−1))^2 E(||g^(k−1)||^2)
= E(||x^(k−1) − x*||^2) − 2 µ^(k−1) E(∇f(x^(k−1))^T (x^(k−1) − x*)) + (µ^(k−1))^2 E(||g^(k−1)||^2)
≤ E(||x^(k−1) − x*||^2) − 2 µ^(k−1) m E(||x^(k−1) − x*||^2) + (µ^(k−1))^2 G^2    [by (ii) and (1)]
= (1 − 2 µ^(k−1) m) E(||x^(k−1) − x*||^2) + (µ^(k−1))^2 G^2   (2)

Convergence of SGD / Proof (2/2)
Induction over k, with L := max{||x^(0) − x*||^2, G^2/m^2}:
k = 0: ||x^(0) − x*||^2 ≤ L
k > 0:
E(||x^(k) − x*||^2)
≤ (1 − 2 µ^(k−1) m) E(||x^(k−1) − x*||^2) + (µ^(k−1))^2 G^2    [by (2)]
= (1 − 2/k) E(||x^(k−1) − x*||^2) + G^2 / (m^2 k^2)            [by (iii)]
≤ (1 − 2/k) (1/k) L + G^2 / (m^2 k^2)                          [induction hypothesis]
≤ (1 − 2/k) (1/k) L + (1/k^2) L                                [definition of L]
= ((k − 1)/k^2) L
≤ (1/(k+1)) L

2. More on Line Search

Choosing the Step Size for SGD
The step size µ is a crucial parameter to be tuned. Given the low cost of the SGD update, using line search for the step size is a bad choice. Possible alternatives:
- Fixed step size
- Armijo principle
- Bold driver
- AdaGrad

Example: Body Fat Prediction
We want to estimate the percentage of body fat based on various attributes:
- Age (years)
- Weight (lbs)
- Height (inches)
- Neck circumference (cm)
- Chest circumference (cm)
- Abdomen 2 circumference (cm)
- Hip circumference (cm)
- Thigh circumference (cm)
- Knee circumference (cm)
- ...

Example: Body Fat Prediction
The data is represented as a matrix A whose n-th row is a_n = (1, a_{n,1}, a_{n,2}, ..., a_{n,M}) and a vector y = (y_1, y_2, ..., y_N)^T, with N = 252, M = 14.
We can model the percentage of body fat y as a linear combination of the body measurements with parameters x:
ŷ_n = x^T a_n = x_0 + x_1 a_{n,1} + x_2 a_{n,2} + ... + x_M a_{n,M}
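
In code, the design matrix with its leading column of ones and the linear predictions ŷ could be built as follows; this is a sketch using random placeholder measurements instead of the actual body fat data:

import numpy as np

rng = np.random.default_rng(0)
N, M = 252, 14
measurements = rng.normal(size=(N, M))           # placeholder body measurements
A = np.hstack([np.ones((N, 1)), measurements])   # prepend the constant-1 column
x = rng.normal(size=M + 1)                       # parameters, x_0 is the intercept

y_hat = A @ x                                    # yhat_n = x^T a_n for all n at once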

SGD with Fixed Step Size on the Body Fat Dataset
[Figure: MSE vs. iterations for several fixed SGD step sizes.]

Bold Driver Heuristic
The bold driver heuristic makes the assumption that smaller step sizes are needed when closer to the optimum. It adjusts the step size based on the change f(x^(k)) − f(x^(k−1)):
- If the value of f(x) grows, the step size must decrease.
- If the value of f(x) decreases, the step size can be larger for faster convergence.

Bold Driver Heuristic: Update Rule
We need to define an increase factor µ⁺ > 1, e.g., µ⁺ := 1.05, and a decay factor µ⁻ ∈ (0, 1), e.g., µ⁻ := 0.5.
Step size update rule (adapt the step size only once after each epoch, not for every inner iteration):
- Cycle through the whole data and update the parameters.
- Evaluate the objective function f(x^(k)).
- If f(x^(k)) < f(x^(k−1)), then µ := µ⁺ µ.
- Else (f(x^(k)) > f(x^(k−1))), then µ := µ⁻ µ.
Different from the bold driver heuristic for batch gradient descent, there is no way to evaluate f(x + µ Δx) for different µ; the step size µ is adapted once after the step has been done.

Bold Driver

stepsize-bd(µ, f_new, f_old, µ⁺, µ⁻):
  if f_new < f_old:
    µ := µ⁺ µ
  else:
    µ := µ⁻ µ
  return µ

where
µ: step size of the last update
f_new, f_old = f(x^(k)), f(x^(k−1)): function values after and before the last update
µ⁺, µ⁻: step size increase and decay factors
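
A Python sketch of the bold driver update, together with the once-per-epoch usage pattern it assumes (run_one_sgd_epoch is a hypothetical helper standing in for one SGD pass over the data):

def stepsize_bd(mu, f_new, f_old, mu_inc=1.05, mu_dec=0.5):
    """Bold driver: grow the step size if the objective decreased over the
    last epoch, shrink it otherwise (defaults are the example factors)."""
    return mu * mu_inc if f_new < f_old else mu * mu_dec

# usage pattern (sketch), adapting mu once after every epoch:
# f_old = f(x)
# for epoch in range(K):
#     x = run_one_sgd_epoch(x, mu)   # hypothetical helper: one pass over the data
#     f_new = f(x)
#     mu = stepsize_bd(mu, f_new, f_old)
#     f_old = f_new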

Considerations
- Works well for a range of problems.
- The initial µ just needs to be large enough.
- µ⁺ and µ⁻ have to be adjusted to the problem at hand.
- May lead to faster convergence.

AdaGrad
- AdaGrad adjusts the step size individually for each variable to be optimized.
- It uses information about the past gradients.
- Leads to faster convergence.
- Less sensitive to the choice of the step size.

AdaGrad: Update Rule
We have f(x) = Σ_{m=1}^M f_m(x).
Update rule (update the step size at every inner iteration):
- Pick a random instance m ~ Uniform(1, M).
- Compute the gradient ∇_x f_m(x).
- Update the gradient history h := h + ∇_x f_m(x) ⊙ ∇_x f_m(x).
- The step size for variable x_n is µ_n := µ_0 / √h_n.
- Update x^next := x − µ ⊙ ∇_x f_m(x), i.e., x^next_n := x_n − (µ_0 / √h_n) (∇_x f_m(x))_n.
⊙ denotes the elementwise product.

AdaGrad

stepsize-adagrad(g, h, µ_0):
  h := h + g ⊙ g
  µ_n := µ_0 / √h_n  for n = 1, ..., N
  return (µ, h)

Returns a vector of step sizes, one for each variable, where
g = ∇f_m(x): current (stochastic) gradient
h: past gradient size history
µ_0: initial step size
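
A NumPy sketch of the AdaGrad bookkeeping; the small constant eps is an assumption added for numerical safety and is not part of the slide:

import numpy as np

def stepsize_adagrad(g, h, mu0, eps=1e-8):
    """AdaGrad: accumulate squared gradients in h and return the per-variable
    step sizes mu_n = mu0 / sqrt(h_n)."""
    h = h + g * g                     # elementwise g (.) g
    mu = mu0 / (np.sqrt(h) + eps)     # one step size per variable
    return mu, h

# usage inside an SGD loop (sketch):
# h = np.zeros_like(x)
# g = grad_f_m(x)                     # current stochastic gradient
# mu, h = stepsize_adagrad(g, h, mu0=0.1)
# x = x - mu * g                      # elementwise update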

AdaGrad Step Size
[Figure: MSE vs. iterations for different AdaGrad initial step sizes.]

AdaGrad vs. Fixed Step Size
[Figure: MSE vs. iterations, AdaGrad compared with a fixed step size.]

3. Example: SGD for Linear Regression

Practical Example: Household Spending
Suppose we have the following data about different households:
- Number of workers in the household (a_1)
- Household composition (a_2)
- Region (a_3)
- Gross normal weekly household income (a_4)
- Weekly household spending (y)
We want to create a model of the weekly household spending.

Practical Example: Household Spending
If we have data about m households, we can represent them as a matrix A whose i-th row is a_i = (1, a_{i,1}, a_{i,2}, ..., a_{i,n}) and a vector y = (y_1, y_2, ..., y_m)^T.
We can model the household consumption as a linear combination of the household features with parameters x:
ŷ_i = x^T a_i = x_0 + x_1 a_{i,1} + x_2 a_{i,2} + x_3 a_{i,3} + x_4 a_{i,4}

Least Squares Problem Revisited
The least squares problem
minimize ||Ax − y||_2^2
can be rewritten as
minimize Σ_{i=1}^m (x^T a_i − y_i)^2
with A the matrix whose i-th row is a_i = (1, a_{i,1}, a_{i,2}, a_{i,3}, a_{i,4}) and y = (y_1, y_2, ..., y_m)^T.

The Gradient Descent Update Rule
For the problem
minimize Σ_{i=1}^m (x^T a_i − y_i)^2
the gradient ∇f(x) of the objective function is:
∇_x f(x) = 2 Σ_{i=1}^m (x^T a_i − y_i) a_i
The Gradient Descent update rule is then:
x ← x − µ (2 Σ_{i=1}^m (x^T a_i − y_i) a_i)
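
A minimal NumPy sketch of this batch update rule (the step size and iteration count are illustrative defaults, not values from the lecture):

import numpy as np

def gd_least_squares(A, y, mu=1e-3, iterations=1000):
    """Batch gradient descent for minimize ||Ax - y||_2^2 using the full-sum
    gradient 2 * A^T (Ax - y) = 2 * sum_i (x^T a_i - y_i) a_i."""
    x = np.zeros(A.shape[1])
    for _ in range(iterations):
        grad = 2 * A.T @ (A @ x - y)
        x = x - mu * grad
    return x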

The Gradient Descent Update Rule
We need to see all the data before updating x:
x ← x − µ (2 Σ_{i=1}^m (x^T a_i − y_i) a_i)
Can we make any progress before iterating over all the data?

Decomposing the Objective Function
The objective function
f(x) = Σ_{i=1}^m (x^T a_i − y_i)^2
can be expressed as a sum of objectives on each data point (a_i, y_i):
f_i(x) = (x^T a_i − y_i)^2
so that
f(x) = Σ_{i=1}^m f_i(x)

A Simpler Update Rule
Now that we have f(x) = Σ_{i=1}^m f_i(x), we can define the following update rule:
- Pick a random instance i ~ Uniform(1, m)
- Update x ← x − µ ∇_x f_i(x)

Stochastic Gradient Descent (SGD)

procedure StochasticGradientDescent(f, µ):
  get initial point x
  repeat
    for i ← 1, ..., m do
      x ← x − µ ∇f_i(x)
    end for
  until convergence
  return x, f(x)
end procedure

SGD and Least Squares
We have
f(x) = Σ_{i=1}^m f_i(x)  with  f_i(x) = (x^T a_i − y_i)^2,  ∇_x f_i(x) = 2 (x^T a_i − y_i) a_i
The update rule is
x ← x − µ (2 (x^T a_i − y_i) a_i)
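
The corresponding stochastic update in NumPy, using one randomly chosen row per step (again a sketch with a fixed step size):

import numpy as np

def sgd_least_squares(A, y, mu=1e-3, epochs=50, seed=0):
    """SGD for minimize sum_i (x^T a_i - y_i)^2: each inner step uses the
    gradient 2 (x^T a_i - y_i) a_i of a single random row i."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.zeros(n)
    for _ in range(epochs):
        for i in rng.permutation(m):             # shuffled pass over the rows
            grad_i = 2 * (A[i] @ x - y[i]) * A[i]
            x = x - mu * grad_i
    return x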

SGD vs. GD

procedure SGD(f, µ):
  get initial point x
  repeat
    for i ← 1, ..., m do
      x ← x − µ ∇f_i(x)
    end for
  until convergence
  return x, f(x)
end procedure

procedure GradientDescent(f):
  get initial point x
  repeat
    get step size µ
    x := x − µ ∇f(x)
  until convergence
  return x, f(x)
end procedure

SGD vs. GD: Least Squares

procedure SGD(f, µ):
  get initial point x
  repeat
    for i ← 1, ..., m do
      x ← x − µ (2 (x^T a_i − y_i) a_i)
    end for
  until convergence
  return x, f(x)
end procedure

procedure GD(f):
  get initial point x
  repeat
    get step size µ
    x ← x − µ (2 Σ_{i=1}^m (x^T a_i − y_i) a_i)
  until convergence
  return x, f(x)
end procedure

4. Stochastic Gradient Descent in Practice

GD Step Size
[Figure: MSE vs. iterations for different GD step sizes.]

SGD vs. GD: Body Fat Dataset
[Figure: MSE vs. iterations for SGD and GD.]

Year Prediction Data Set
Least squares problem:
- Prediction of the release year of a song from audio features
- 90 features
- Experiments done on a subset of 1000 instances of the data

GD Step Size: Year Prediction
[Figure: MSE vs. iterations for different GD step sizes.]

SGD Step Size: Year Prediction
[Figure: MSE vs. iterations for different SGD step sizes.]

AdaGrad Step Size: Year Prediction
[Figure: MSE vs. iterations for different AdaGrad step sizes.]

AdaGrad vs. SGD vs. GD: Year Prediction
[Figure: MSE vs. iterations for AdaGrad, GD, and SGD.]

Further Readings
- SGD is not covered in Boyd and Vandenberghe [2004].
- Léon Bottou, Frank E. Curtis, Jorge Nocedal (2016): Stochastic Gradient Methods for Large-Scale Machine Learning, ICML 2016 Tutorial.
- Francis Bach (2013): Stochastic gradient methods for machine learning, Microsoft Machine Learning Summit 2013, events/mls2013/downloads/stochastic_gradient.pdf
- For the convergence proof: Ji Liu (2014), Notes Stochastic Gradient Descent.

References
Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
