Modern Optimization Techniques
1 Modern Optimization Techniques: Stochastic Gradient Descent
Lucas Rego Drumond
Information Systems and Machine Learning Lab (ISMLL), Institute of Computer Science, University of Hildesheim, Germany
2 Outline
1. Unconstrained Optimization
2. Stochastic Gradient Descent
3. Choosing the right step size
4. Stochastic Gradient Descent in Practice
3 1. Unconstrained Optimization
4 1. Unconstrained Optimization - Gradient Descent
procedure GradientDescent(f_0):
    get initial point β
    repeat:
        get step size µ
        β ← β − µ ∇f_0(β)
    until convergence
    return β, f_0(β)
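A minimal sketch of this procedure in Python (assuming the gradient of the objective is supplied as a function; the names grad_f0, step_size, and the convergence test are illustrative, not from the slides):

    import numpy as np

    def gradient_descent(grad_f0, beta0, step_size=0.01, max_iters=1000, tol=1e-8):
        """Plain gradient descent: beta <- beta - mu * grad_f0(beta)."""
        beta = beta0.copy()
        for _ in range(max_iters):
            step = step_size * grad_f0(beta)
            beta = beta - step
            if np.linalg.norm(step) < tol:  # crude convergence test
                break
        return beta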
5 2. Stochastic Gradient Descent
6 2. Stochastic Gradient Descent - Practical Example: Household Spending
Suppose we have the following data about different households:
- Number of workers in the household (x_1)
- Household composition (x_2)
- Region (x_3)
- Gross normal weekly household income (x_4)
- Weekly household spending (y)
We want to create a model of the weekly household spending.
7 2. Stochastic Gradient Descent - Practical Example: Household Spending
If we have data about m households, we can represent it as a design matrix X (one row per household, with a leading 1 for the intercept) and a target vector y:

    X = [ 1  x_{1,1}  ...  x_{1,n} ]        y = [ y_1 ]
        [ 1  x_{2,1}  ...  x_{2,n} ]            [ y_2 ]
        [ ...                      ]            [ ... ]
        [ 1  x_{m,1}  ...  x_{m,n} ]            [ y_m ]

We can model the household spending as a linear combination of the household features with parameters β:

    ŷ_i = β^T x_i = β_0 + β_1 x_{i,1} + β_2 x_{i,2} + β_3 x_{i,3} + β_4 x_{i,4}
8 2. Stochastic Gradient Descent - Least Squares Problem Revisited
The least squares problem

    minimize ‖Xβ − y‖_2^2

can be rewritten as

    minimize Σ_{i=1}^m (β^T x_i − y_i)^2

where x_i = (1, x_{i,1}, ..., x_{i,n}) is the i-th row of X.
9 2. Stochastic Gradient Descent - The Gradient Descent Update Rule
For the problem

    minimize Σ_{i=1}^m (β^T x_i − y_i)^2

the gradient ∇f_0(β) of the objective function is:

    ∇_β f_0(β) = 2 Σ_{i=1}^m (β^T x_i − y_i) x_i

The gradient descent update rule is then:

    β ← β − µ ( 2 Σ_{i=1}^m (β^T x_i − y_i) x_i )
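In matrix form this gradient is 2 X^T (Xβ − y). A small numpy sketch of the full-batch gradient (array names are illustrative):

    import numpy as np

    def full_gradient(beta, X, y):
        """Gradient of f0(beta) = sum_i (beta^T x_i - y_i)^2, i.e. 2 X^T (X beta - y)."""
        residual = X @ beta - y        # vector of beta^T x_i - y_i, one entry per instance
        return 2.0 * X.T @ residual    # sums the per-instance gradients over all m rows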
10 2. Stochastic Gradient Descent - The Gradient Descent Update Rule
We need to see all the data before updating β:

    β ← β − µ ( 2 Σ_{i=1}^m (β^T x_i − y_i) x_i )

Can we make any progress before iterating over all the data?
11 2. Stochastic Gradient Descent - Decomposing the Objective Function
The objective function

    f_0(β) = Σ_{i=1}^m (β^T x_i − y_i)^2

can be expressed in terms of the objective on each data point (x_i, y_i):

    g(β, i) = (β^T x_i − y_i)^2

so that

    f_0(β) = Σ_{i=1}^m g(β, i)
12 2. Stochastic Gradient Descent - A Simpler Update Rule
Now that we have

    f_0(β) = Σ_{i=1}^m g(β, i)

we can define the following update rule:
- Pick a random instance i ~ Uniform(1, m)
- Update β ← β − µ ∇_β g(β, i)
13 2. Stochastic Gradient Descent - Stochastic Gradient Descent (SGD)
procedure StochasticGradientDescent(f_0, µ):
    get initial point β
    repeat:
        for i = 1, ..., m:
            β ← β − µ ∇_β g(β, i)
    until convergence
    return β, f_0(β)
14 2. Stochastic Gradient Descent - SGD and Least Squares
We have

    f_0(β) = Σ_{i=1}^m g(β, i)    with    g(β, i) = (β^T x_i − y_i)^2

The gradient of a single instance is

    ∇_β g(β, i) = 2 (β^T x_i − y_i) x_i

so the update rule is

    β ← β − µ ( 2 (β^T x_i − y_i) x_i )
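A minimal SGD sketch for this least squares objective in Python (a sketch of the slides' update rule; the epoch count, step size, and random shuffling are illustrative choices):

    import numpy as np

    def sgd_least_squares(X, y, step_size=0.01, epochs=100, seed=0):
        """SGD for min_beta sum_i (beta^T x_i - y_i)^2: one instance per update."""
        rng = np.random.default_rng(seed)
        m, n = X.shape
        beta = np.zeros(n)
        for _ in range(epochs):
            for i in rng.permutation(m):        # cycle through the data in random order
                residual = X[i] @ beta - y[i]   # beta^T x_i - y_i
                beta -= step_size * 2.0 * residual * X[i]
        return beta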
15 2. Stochastic Gradient Descent - SGD vs. GD
procedure SGD(f_0, µ):
    get initial point β
    repeat:
        for i = 1, ..., m:
            β ← β − µ ∇_β g(β, i)
    until convergence
    return β, f_0(β)

procedure GradientDescent(f_0):
    get initial point β
    repeat:
        get step size µ
        β ← β − µ ∇f_0(β)
    until convergence
    return β, f_0(β)
16 2. Stochastic Gradient Descent - SGD vs. GD: Least Squares
procedure SGD(f_0, µ):
    get initial point β
    repeat:
        for i = 1, ..., m:
            β ← β − µ ( 2 (β^T x_i − y_i) x_i )
    until convergence
    return β, f_0(β)

procedure GD(f_0):
    get initial point β
    repeat:
        get step size µ
        β ← β − µ ( 2 Σ_{i=1}^m (β^T x_i − y_i) x_i )
    until convergence
    return β, f_0(β)
17 3. Choosing the right step size
18 3. Choosing the right step size - Choosing the Step Size for SGD
The step size µ is a crucial parameter to be tuned. Given the low cost of the SGD update, using line search for the step size is a bad choice.
Possible alternatives:
- Fixed step size
- Armijo principle
- Bold driver
- AdaGrad
19 3. Choosing the right step size - Real-World Dataset: Body Fat Prediction
We want to estimate the percentage of body fat based on various attributes:
- Age (years)
- Weight (lbs)
- Height (inches)
- Neck circumference (cm)
- Chest circumference (cm)
- Abdomen 2 circumference (cm)
- Hip circumference (cm)
- Thigh circumference (cm)
- Knee circumference (cm)
- ...
20 3. Choosing the right step size - Real-World Dataset: Body Fat Prediction
The data is represented as a design matrix X with rows (1, x_{i,1}, x_{i,2}, ..., x_{i,n}) and a target vector y = (y_1, ..., y_m)^T, with m = 252 and n = 14.
We can model the percentage of body fat y as a linear combination of the body measurements with parameters β:

    ŷ_i = β^T x_i = β_0 + β_1 x_{i,1} + β_2 x_{i,2} + ... + β_n x_{i,n}
21 3. Choosing the right step size - SGD with Fixed Step Size on the Body Fat Dataset
[Figure: MSE vs. iterations for SGD with different fixed step sizes]
22 3. Choosing the right step size - Bold Driver Heuristic
The bold driver heuristic makes the assumption that smaller step sizes are needed closer to the optimum. It adjusts the step size based on the value of f_0(β_t) − f_0(β_{t−1}):
- If the value of f_0(β) grows, the step size must decrease.
- If the value of f_0(β) decreases, the step size can be made larger for faster convergence.
23 3. Choosing the right step size - Bold Driver Heuristic: Update Rule
We have

    f_0(β) = Σ_{i=1}^m g(β, i)

We need to define an increase factor γ and a decay factor ν. For each epoch:
- Evaluate the objective function f_0(β_{t−1})
- Cycle through the whole data set and update the parameters
- Evaluate the objective function f_0(β_t)
- If f_0(β_t) < f_0(β_{t−1}), then µ ← γµ; otherwise µ ← νµ
Widely used values: γ = 1.05 and ν = 0.5.
24 3. Choosing the right step size - SGD with Bold Driver
procedure BoldDriverSGD(f_0, µ, γ, ν):
    get initial point β
    repeat:
        ε_{t−1} ← f_0(β)
        for i = 1, ..., m:
            β ← β − µ ∇_β g(β, i)
        ε_t ← f_0(β)
        if ε_t < ε_{t−1}:
            µ ← γµ    (objective decreased: increase the step size)
        else:
            µ ← νµ    (objective grew: decrease the step size)
    until convergence
    return β, f_0(β)
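A Python sketch of bold driver SGD for the least squares objective (evaluating f_0 once per epoch as above; the defaults for γ and ν follow the widely used values from the slides, the rest is illustrative):

    import numpy as np

    def bold_driver_sgd(X, y, step_size=0.1, gamma=1.05, nu=0.5, epochs=100, seed=0):
        """SGD with the bold driver step size heuristic (a sketch)."""
        rng = np.random.default_rng(seed)
        m, n = X.shape
        beta = np.zeros(n)
        f0 = lambda b: np.sum((X @ b - y) ** 2)   # objective f_0(beta)
        prev = f0(beta)
        for _ in range(epochs):
            for i in rng.permutation(m):
                residual = X[i] @ beta - y[i]
                beta -= step_size * 2.0 * residual * X[i]
            curr = f0(beta)
            # grow the step size on improvement, shrink it otherwise
            step_size = gamma * step_size if curr < prev else nu * step_size
            prev = curr
        return beta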
25 3. Choosing the right step size - Considerations
- Works well for a range of problems
- The initial µ just needs to be large enough
- γ and ν need to be adjusted
- May lead to faster convergence rates
26 3. Choosing the right step size - AdaGrad
- AdaGrad adjusts the step size for each parameter to be optimized
- It uses information about the past gradients
- Leads to faster convergence
- Less sensitive to the choice of the step size
27 3. Choosing the right step size - AdaGrad: Update Rule
We have

    f_0(β) = Σ_{i=1}^m g(β, i)

Update rule:
- Pick a random instance i ~ Uniform(1, m)
- Compute the gradient ∇_β g(β, i)
- Update the gradient history h ← h + ∇_β g(β, i) ⊙ ∇_β g(β, i)
- The step size for parameter β_j is µ / √(h_j)
- Update β ← β − (µ / √h) ⊙ ∇_β g(β, i)
where ⊙ denotes the elementwise product and √h is taken elementwise.
28 3. Choosing the right step size - SGD with AdaGrad
procedure AdaGradSGD(f_0, µ):
    get initial point β
    h ← 0
    repeat:
        for i = 1, ..., m:
            h ← h + ∇_β g(β, i) ⊙ ∇_β g(β, i)
            β ← β − (µ / √h) ⊙ ∇_β g(β, i)
    until convergence
    return β, f_0(β)
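An AdaGrad SGD sketch in Python for the least squares objective (the small constant eps guards against division by zero before h accumulates; it is an implementation detail, not from the slides):

    import numpy as np

    def adagrad_sgd(X, y, step_size=0.1, epochs=100, eps=1e-8, seed=0):
        """SGD with AdaGrad per-parameter step sizes (a sketch)."""
        rng = np.random.default_rng(seed)
        m, n = X.shape
        beta = np.zeros(n)
        h = np.zeros(n)                                   # elementwise gradient history
        for _ in range(epochs):
            for i in rng.permutation(m):
                grad = 2.0 * (X[i] @ beta - y[i]) * X[i]  # gradient of g(beta, i)
                h += grad * grad                          # accumulate squared gradients
                beta -= step_size / np.sqrt(h + eps) * grad
        return beta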
29 3. Choosing the right step size - AdaGrad Step Size
[Figure: MSE vs. iterations for AdaGrad with different step sizes]
30 3. Choosing the right step size - AdaGrad vs. Fixed Step Size
[Figure: MSE vs. iterations comparing AdaGrad with a fixed step size]
31 4. Stochastic Gradient Descent in Practice
32 4. Stochastic Gradient Descent in Practice - GD Step Size
[Figure: MSE vs. iterations for GD with different step sizes]
33 4. Stochastic Gradient Descent in Practice - SGD vs. GD on the Body Fat Dataset
[Figure: MSE vs. iterations for SGD and GD]
34 4. Stochastic Gradient Descent in Practice - Year Prediction Dataset
- Least squares problem
- Prediction of the release year of a song from audio features
- 90 features
- Experiments done on a subset of 1000 instances of the data
35 4. Stochastic Gradient Descent in Practice - GD Step Size: Year Prediction
[Figure: MSE vs. iterations for GD with different step sizes]
36 4. Stochastic Gradient Descent in Practice - SGD Step Size: Year Prediction
[Figure: MSE vs. iterations for SGD with different step sizes]
37 4. Stochastic Gradient Descent in Practice - AdaGrad Step Size: Year Prediction
[Figure: MSE vs. iterations for AdaGrad with different step sizes]
38 4. Stochastic Gradient Descent in Practice - AdaGrad vs. SGD vs. GD: Year Prediction
[Figure: MSE vs. iterations comparing AdaGrad, GD, and SGD]