Modern Optimization Techniques


1 Modern Optimization Techniques Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Stochastic Gradient Descent Stochastic Gradient Descent 1 / 32

2 Outline 1. Unconstrained Optimization 2. Stochastic Gradient Descent 3. Choosing the right step size 4. Stochastic Gradient Descent in Practice Stochastic Gradient Descent 1 / 32

3 1. Unconstrained Optimization Outline 1. Unconstrained Optimization 2. Stochastic Gradient Descent 3. Choosing the right step size 4. Stochastic Gradient Descent in Practice Stochastic Gradient Descent 1 / 32

4 1. Unconstrained Optimization Gradient Descent
1: procedure GradientDescent    input: f_0
2:   Get initial point β
3:   repeat
4:     Get Step Size µ
5:     β := β − µ ∇f_0(β)
6:   until convergence
7:   return β, f_0(β)
8: end procedure
Stochastic Gradient Descent 1 / 32
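
As an illustration (not from the slides), a minimal Python sketch of this procedure, assuming the caller supplies the gradient ∇f_0 as a function grad_f0 and a fixed step size in place of the step-size search:

import numpy as np

def gradient_descent(grad_f0, beta0, step_size=0.01, max_epochs=1000, tol=1e-8):
    # beta0: initial point; grad_f0(beta) returns the gradient of f_0 at beta
    beta = beta0.copy()
    for _ in range(max_epochs):
        g = grad_f0(beta)
        beta = beta - step_size * g       # beta := beta - mu * grad f_0(beta)
        if np.linalg.norm(g) < tol:       # simple convergence criterion
            break
    return beta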

5 2. Stochastic Gradient Descent Outline 1. Unconstrained Optimization 2. Stochastic Gradient Descent 3. Choosing the right step size 4. Stochastic Gradient Descent in Practice Stochastic Gradient Descent 2 / 32

6 2. Stochastic Gradient Descent Practical Example: Household Spending Suppose we have the following data about different households: Number of workers in the household (x_1) Household composition (x_2) Region (x_3) Gross normal weekly household income (x_4) Weekly household spending (y) We want to create a model of the weekly household spending Stochastic Gradient Descent 2 / 32

7 2. Stochastic Gradient Descent Practical Example: Household Spending If we have data about m households, we can represent it as a matrix X ∈ R^{m×n} with rows x_i^T = (1, x_{i,2}, ..., x_{i,n}) and a target vector y = (y_1, y_2, ..., y_m)^T. We can model the household consumption as a linear combination of the household features with parameters β: ŷ_i = β^T x_i = β_0 + β_1 x_{i,1} + β_2 x_{i,2} + β_3 x_{i,3} + β_4 x_{i,4} Stochastic Gradient Descent 3 / 32
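
A concrete prediction with this linear model in NumPy (not from the slides; the parameter and feature values below are made up purely for illustration):

import numpy as np

# beta = (beta_0, beta_1, ..., beta_4), x_i = (1, x_{i,1}, ..., x_{i,4})
beta = np.array([50.0, 30.0, 10.0, 5.0, 0.4])    # illustrative parameters
x_i  = np.array([1.0, 2.0, 3.0, 1.0, 800.0])     # illustrative household features
y_hat = beta @ x_i                               # predicted weekly spending: beta^T x_i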

8 2. Stochastic Gradient Descent Least Squares Problem Revisited The following least squares problem: minimize ‖Xβ − y‖₂² can be rewritten as: minimize ∑_{i=1}^m (β^T x_i − y_i)² with X ∈ R^{m×n} having rows x_i^T = (1, x_{i,1}, x_{i,2}, x_{i,3}, x_{i,4}) and y = (y_1, y_2, ..., y_m)^T Stochastic Gradient Descent 4 / 32

9 2. Stochastic Gradient Descent The Gradient Descent update rule For the problem: minimize ∑_{i=1}^m (β^T x_i − y_i)² the gradient of the objective function is: ∇_β f_0(β) = 2 ∑_{i=1}^m (β^T x_i − y_i) x_i The Gradient Descent update rule is then: β ← β − µ (2 ∑_{i=1}^m (β^T x_i − y_i) x_i) Stochastic Gradient Descent 5 / 32
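
A hypothetical NumPy sketch of this full-batch update (not from the slides), assuming the design matrix X already contains the leading column of ones:

import numpy as np

def gd_least_squares(X, y, step_size=1e-4, max_epochs=500):
    # Full-batch gradient descent for f_0(beta) = sum_i (beta^T x_i - y_i)^2
    m, n = X.shape
    beta = np.zeros(n)
    for _ in range(max_epochs):
        residuals = X @ beta - y          # beta^T x_i - y_i for all i
        grad = 2.0 * X.T @ residuals      # 2 * sum_i (beta^T x_i - y_i) x_i
        beta = beta - step_size * grad
    return beta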

10 2. Stochastic Gradient Descent The Gradient Descent update rule β ← β − µ (2 ∑_{i=1}^m (β^T x_i − y_i) x_i) We need to see all the data before updating β. Can we make any progress before iterating over all the data? Stochastic Gradient Descent 6 / 32

11 2. Stochastic Gradient Descent Decomposing the objective function The objective function f_0(β) = ∑_{i=1}^m (β^T x_i − y_i)² can be expressed through the objective on each data point (x_i, y_i): g(β, i) = (β^T x_i − y_i)² so that f_0(β) = ∑_{i=1}^m g(β, i) Stochastic Gradient Descent 7 / 32
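
As a small illustration (not from the slides), this decomposition for the least squares case, with hypothetical helper functions g and f0:

import numpy as np

def g(beta, X, y, i):
    # Objective on a single data point: (beta^T x_i - y_i)^2
    return (X[i] @ beta - y[i]) ** 2

def f0(beta, X, y):
    # Full objective as the sum of the per-instance objectives
    return sum(g(beta, X, y, i) for i in range(X.shape[0]))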

12 2. Stochastic Gradient Descent A simpler update rule Now that we have f_0(β) = ∑_{i=1}^m g(β, i) we can define the following update rule: Pick a random instance i ∼ Uniform(1, m) Update β ← β − µ ∇_β g(β, i) Stochastic Gradient Descent 8 / 32

13 2. Stochastic Gradient Descent Stochastic Gradient Descent (SGD)
1: procedure StochasticGradientDescent    input: f_0, µ
2:   Get initial point β
3:   repeat
4:     for i ← 1, ..., m do
5:       β ← β − µ ∇_β g(β, i)
6:     end for
7:   until convergence
8:   return β, f_0(β)
9: end procedure
Stochastic Gradient Descent 9 / 32

14 2. Stochastic Gradient Descent SGD and the least squares We have f_0(β) = ∑_{i=1}^m g(β, i) with g(β, i) = (β^T x_i − y_i)² and ∇_β g(β, i) = 2(β^T x_i − y_i) x_i The update rule is β ← β − µ (2(β^T x_i − y_i) x_i) Stochastic Gradient Descent 10 / 32
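
A hypothetical NumPy sketch of SGD for the least squares problem, following this per-instance update (visiting the instances in a random order each epoch is an implementation choice, not from the slide):

import numpy as np

def sgd_least_squares(X, y, step_size=1e-4, epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    m, n = X.shape
    beta = np.zeros(n)
    for _ in range(epochs):
        for i in rng.permutation(m):                    # one pass over shuffled instances
            grad_i = 2.0 * (X[i] @ beta - y[i]) * X[i]  # gradient of g(beta, i)
            beta = beta - step_size * grad_i
    return beta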

15 2. Stochastic Gradient Descent SGD vs. GD
1: procedure SGD    input: f_0, µ
2:   Get initial point β
3:   repeat
4:     for i ← 1, ..., m do
5:       β ← β − µ ∇_β g(β, i)
6:     end for
7:   until convergence
8:   return β, f_0(β)
9: end procedure
1: procedure GradientDescent    input: f_0
2:   Get initial point β
3:   repeat
4:     Get Step Size µ
5:     β := β − µ ∇f_0(β)
6:   until convergence
7:   return β, f_0(β)
8: end procedure
Stochastic Gradient Descent 11 / 32

16 2. Stochastic Gradient Descent SGD vs. GD - Least Squares
1: procedure SGD    input: f_0, µ
2:   Get initial point β
3:   repeat
4:     for i ← 1, ..., m do
5:       β ← β − µ (2(β^T x_i − y_i) x_i)
6:     end for
7:   until convergence
8:   return β, f_0(β)
9: end procedure
1: procedure GD    input: f_0
2:   Get initial point β
3:   repeat
4:     Get Step Size µ
5:     β ← β − µ (2 ∑_{i=1}^m (β^T x_i − y_i) x_i)
6:   until convergence
7:   return β, f_0(β)
8: end procedure
Stochastic Gradient Descent 12 / 32

17 3. Choosing the right step size Outline 1. Unconstrained Optimization 2. Stochastic Gradient Descent 3. Choosing the right step size 4. Stochastic Gradient Descent in Practice Stochastic Gradient Descent 13 / 32

18 3. Choosing the right step size Choosing the step size for SGD The step size µ is a crucial parameter to be tuned Given the low cost of the SGD update, using line search for the step size is a bad choice Possible alternatives: Fixed step size Armijo principle Bold-Driver Adagrad Stochastic Gradient Descent 13 / 32

19 3. Choosing the right step size Real World Dataset: Body Fat prediction We want to estimate the percentage of body fat based on various attributes: Age (years) Weight (lbs) Height (inches) Neck circumference (cm) Chest circumference (cm) Abdomen 2 circumference (cm) Hip circumference (cm) Thigh circumference (cm) Knee circumference (cm)... Stochastic Gradient Descent 14 / 32

20 3. Choosing the right step size Real World Dataset: Body Fat prediction The data is represented as a matrix X ∈ R^{m×n} with rows x_i^T = (1, x_{i,1}, x_{i,2}, ..., x_{i,n}) and a target vector y = (y_1, y_2, ..., y_m)^T, with m = 252, n = 14 We can model the percentage of body fat y as a linear combination of the body measurements with parameters β: ŷ_i = β^T x_i = β_0 + β_1 x_{i,1} + β_2 x_{i,2} + ... + β_n x_{i,n} Stochastic Gradient Descent 15 / 32

21 3. Choosing the right step size SGD - Fixed Step Size on the Body Fat dataset [Figure: MSE over iterations for SGD with different fixed step sizes] Stochastic Gradient Descent 16 / 32

22 3. Choosing the right step size Bold Driver Heuristic The Bold Driver Heuristic makes the assumption that smaller step sizes are needed when closer to the optimum It adjusts the step size based on the value of f_0(β_t) compared to f_0(β_{t−1}) If the value of f_0(β) grows, the step size must decrease If the value of f_0(β) decreases, the step size can be larger for faster convergence Stochastic Gradient Descent 17 / 32

23 3. Choosing the right step size Bold Driver Heuristic - Update Rule We have f_0(β) = ∑_{i=1}^m g(β, i) We need to define an increase factor γ and a decay factor ν For each epoch: Evaluate the objective function f_0(β_{t−1}) Cycle through the whole data and update the parameters Evaluate the objective function f_0(β_t) If f_0(β_t) < f_0(β_{t−1}) then µ ← γµ, else µ ← νµ Widely used values: γ = 1.05 and ν = 0.5 Stochastic Gradient Descent 18 / 32

24 3. Choosing the right step size SGD with Bold Driver
1: procedure BoldDriverSGD    input: f_0, µ, γ and ν
2:   Get initial point β
3:   repeat
4:     ε_{t−1} ← f_0(β)
5:     for i ← 1, ..., m do
6:       β ← β − µ ∇_β g(β, i)
7:     end for
8:     ε_t ← f_0(β)
9:     if ε_t < ε_{t−1} then
10:      µ ← γµ
11:    else
12:      µ ← νµ
13:    end if
14:  until convergence
15:  return β, f_0(β)
16: end procedure
Stochastic Gradient Descent 19 / 32
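
A hypothetical Python sketch of this procedure (not from the slides), assuming the caller supplies f0(beta) and grad_g(beta, i) as functions and beta0 as a NumPy array; the step-size adaptation follows the rule on the previous slide:

def bold_driver_sgd(f0, grad_g, beta0, m, step_size=0.01,
                    gamma=1.05, nu=0.5, epochs=50):
    beta = beta0.copy()
    for _ in range(epochs):
        prev_loss = f0(beta)                      # f_0(beta_{t-1})
        for i in range(m):                        # one pass over the data
            beta = beta - step_size * grad_g(beta, i)
        curr_loss = f0(beta)                      # f_0(beta_t)
        if curr_loss < prev_loss:
            step_size *= gamma                    # objective decreased: allow a larger step
        else:
            step_size *= nu                       # objective increased: shrink the step
    return beta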

25 3. Choosing the right step size Considerations Works well for a range of problems The initial µ just needs to be large enough γ and ν need to be adjusted May lead to faster convergence rates Stochastic Gradient Descent 20 / 32

26 3. Choosing the right step size AdaGrad Adagrad adjusts the step size for each parameter to be optimized It uses information about the past gradients Leads to faster convergence Less sensitive to the choice of the step size Stochastic Gradient Descent 21 / 32

27 3. Choosing the right step size AdaGrad - Update Rule We have f_0(β) = ∑_{i=1}^m g(β, i) Update rule: Pick a random instance i ∼ Uniform(1, m) Compute the gradient ∇_β g(β, i) Update the gradient history h ← h + ∇_β g(β, i) ⊙ ∇_β g(β, i), where ⊙ denotes the elementwise product The step size for parameter β_j is µ / √h_j Update β ← β − (µ / √h) ⊙ ∇_β g(β, i) Stochastic Gradient Descent 22 / 32

28 3. Choosing the right step size SGD with AdaGrad
1: procedure AdaGradSGD    input: f_0, µ
2:   Get initial point β
3:   h ← 0
4:   repeat
5:     for i ← 1, ..., m do
6:       h ← h + ∇_β g(β, i) ⊙ ∇_β g(β, i)
7:       β ← β − (µ / √h) ⊙ ∇_β g(β, i)
8:     end for
9:   until convergence
10:  return β, f_0(β)
11: end procedure
Stochastic Gradient Descent 23 / 32
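
A hypothetical NumPy sketch of this procedure (not from the slides), assuming grad_g(beta, i) returns the per-instance gradient; the small constant eps guarding against division by zero is an implementation detail, not part of the slide:

import numpy as np

def adagrad_sgd(grad_g, beta0, m, step_size=0.1, epochs=50, eps=1e-8):
    beta = beta0.copy()
    h = np.zeros_like(beta)                  # accumulated squared gradients
    for _ in range(epochs):
        for i in range(m):
            grad = grad_g(beta, i)
            h = h + grad * grad              # elementwise product
            beta = beta - (step_size / np.sqrt(h + eps)) * grad
    return beta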

29 3. Choosing the right step size AdaGrad Step Size [Figure: MSE over iterations for AdaGrad with different initial step sizes] Stochastic Gradient Descent 24 / 32

30 3. Choosing the right step size AdaGrad vs Fixed Step Size [Figure: MSE over iterations, AdaGrad vs. fixed step size] Stochastic Gradient Descent 25 / 32

31 4. Stochastic Gradient Descent in Practice Outline 1. Unconstrained Optimization 2. Stochastic Gradient Descent 3. Choosing the right step size 4. Stochastic Gradient Descent in Practice Stochastic Gradient Descent 26 / 32

32 4. Stochastic Gradient Descent in Practice GD Step Size [Figure: MSE over iterations for GD with different step sizes] Stochastic Gradient Descent 26 / 32

33 4. Stochastic Gradient Descent in Practice SGD vs GD - Body Fat Dataset [Figure: MSE over iterations, SGD vs. GD] Stochastic Gradient Descent 27 / 32

34 4. Stochastic Gradient Descent in Practice Year Prediction Data Set Least Squares Problem Prediction of the release year of a song from audio features 90 features Experiments done on a subset of 1000 instances of the data Stochastic Gradient Descent 28 / 32

35 4. Stochastic Gradient Descent in Practice GD Step Size - Year Prediction [Figure: MSE over iterations for GD with different step sizes on the Year Prediction data] Stochastic Gradient Descent 29 / 32

36 4. Stochastic Gradient Descent in Practice SGD Step Size - Year Prediction [Figure: MSE over iterations for SGD with different step sizes on the Year Prediction data] Stochastic Gradient Descent 30 / 32

37 4. Stochastic Gradient Descent in Practice AdaGrad Step Size - Year Prediction [Figure: MSE over iterations for AdaGrad with different step sizes on the Year Prediction data] Stochastic Gradient Descent 31 / 32

38 4. Stochastic Gradient Descent in Practice AdaGrad vs SGD vs GD - Year Prediction [Figure: MSE over iterations for AdaGrad, GD, and SGD on the Year Prediction data] Stochastic Gradient Descent 32 / 32
