Deep Learning & Neural Networks Lecture 4

Deep Learning & Neural Networks, Lecture 4. Kevin Duh, Graduate School of Information Science, Nara Institute of Science and Technology. Jan 23, 2014.

Advanced Topics in Optimization. Today we'll briefly survey an assortment of exciting new tricks for optimizing deep architectures. Although there are many exciting new things out of NIPS/ICML every year, I'll pick the four that I think are most helpful for practitioners: an SGD alternative, better regularization, scaling to large data, and hyper-parameter search.

Today's Topics: (1) Hessian-free optimization [Martens, 2010], (2) Dropout regularization [Hinton et al., 2012], (3) Large-scale distributed training [Dean et al., 2012], (4) Hyper-parameter search [Bergstra et al., 2011].

Difficulty of optimizing highly non-convex loss functions: pathological curvature is tough to navigate for SGD, and 2nd-order (Newton) methods may be needed to avoid underfitting. (Figure from [Martens, 2010].)

2nd-order (Newton) methods. Idea: approximate the loss function locally with a quadratic,

$L(w + z) \approx q_w(z) \equiv L(w) + \nabla L(w)^\top z + \tfrac{1}{2} z^\top H z$,

where $H$ is the Hessian (curvature matrix) at $w$. Minimizing this quadratic gives the search direction $z^* = -H^{-1} \nabla L(w)$. Intuitively, $H^{-1}$ fixes any pathological curvature of $L(w)$. In practice, we want to neither store nor invert $H$.

Quasi-Newton methods: L-BFGS uses a low-rank approximation of $H^{-1}$.

Hessian-free (i.e. truncated Newton) methods: (1) minimize $q_w(z)$ with the conjugate gradient method; (2) compute $Hz$ directly via the finite difference $Hz = \lim_{\epsilon \to 0} \frac{\nabla L(w + \epsilon z) - \nabla L(w)}{\epsilon}$.
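To make these two ingredients concrete, here is a minimal NumPy sketch (my own illustration, not the full setup of [Martens, 2010], which adds damping, backtracking, and careful CG termination): a finite-difference Hessian-vector product and a few conjugate-gradient steps that approximately minimize $q_w(z)$ without ever forming $H$. The function names, the toy quadratic loss, and the iteration counts are hypothetical.

```python
import numpy as np

def hessian_vector_product(grad_fn, w, z, eps=1e-5):
    """Finite-difference Hessian-vector product:
    Hz ~ (grad L(w + eps*z) - grad L(w)) / eps."""
    return (grad_fn(w + eps * z) - grad_fn(w)) / eps

def truncated_newton_direction(grad_fn, w, n_cg_iters=10, tol=1e-10):
    """Approximately solve H z = -grad L(w) with conjugate gradient,
    using only Hessian-vector products (H is never stored or inverted)."""
    g = grad_fn(w)
    z = np.zeros_like(w)
    r = -g.copy()            # residual of H z = -g at z = 0
    d = r.copy()
    for _ in range(n_cg_iters):
        Hd = hessian_vector_product(grad_fn, w, d)
        alpha = (r @ r) / (d @ Hd)
        z = z + alpha * d
        r_new = r - alpha * Hd
        if np.linalg.norm(r_new) < tol:
            break
        d = r_new + ((r_new @ r_new) / (r @ r)) * d
        r = r_new
    return z

# Toy quadratic loss L(w) = 0.5 * w^T A w - b^T w; its Newton step solves A z = b - A w.
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -2.0])
grad_fn = lambda w: A @ w - b
print(truncated_newton_direction(grad_fn, np.zeros(2)))   # ~ A^{-1} b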

Hessian-free optimization applied to deep learning: [Martens, 2010] describes some important modifications and settings needed to make Hessian-free methods work for deep learning. Experiments: (random initialization + 2nd-order Hessian-free optimizer) gives lower training error than (pre-training initialization + 1st-order optimizer). Nice results in recurrent nets too [Martens and Sutskever, 2011].

Dropout [Hinton et al., 2012]: each time we present a training example $x^{(m)}$, randomly delete each hidden node with probability 0.5. This is like sampling from $2^h$ different architectures. At test time, use all nodes but halve the weights. Effect: reduces overfitting by preventing co-adaptation; acts as ensemble model averaging. (Figure: a small network with inputs $x_1, x_2, x_3$, hidden nodes $h_1, h_2, h_3$, and output $y$, shown with and without dropped hidden nodes.)
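A minimal NumPy sketch of the train/test asymmetry described above (my own illustration, not the authors' code): at training time each hidden unit is kept with probability 0.5, and at test time all units are used with the outgoing weights halved. The layer sizes, weight matrices, and function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
p_keep = 0.5                      # keep probability for hidden units

# Hypothetical one-hidden-layer net: x (3,) -> h (4,) -> y (2,)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
relu = lambda a: np.maximum(a, 0.0)

def forward_train(x):
    h = relu(W1 @ x)
    mask = rng.random(h.shape) < p_keep   # drop each hidden node w.p. 0.5
    return W2 @ (h * mask)

def forward_test(x):
    h = relu(W1 @ x)
    return (p_keep * W2) @ h              # use all nodes, halve the outgoing weights

x = rng.normal(size=3)
print(forward_train(x), forward_test(x))
```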

Some results: TIMIT phone recognition. (Results figure not transcribed.)

Model Parallelism: the deep net is stored and processed across multiple cores (multi-threading) or machines (message passing). The performance benefit depends on the connectivity structure vs. the computational demand. (Figure from [Dean et al., 2012].)
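As a toy illustration of the idea (my own sketch, not the DistBelief framework of [Dean et al., 2012]): the hidden units of one layer are partitioned across two "machines", each of which stores and applies only its slice of the weight matrix, and only the resulting activations cross the partition boundary. The layer sizes and worker count are hypothetical.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

rng = np.random.default_rng(0)

# Hidden layer with 8 units over 4 inputs, split so each "machine" owns 4 units.
W = rng.normal(size=(8, 4))
W_parts = np.split(W, 2, axis=0)        # machine 0 holds units 0-3, machine 1 holds 4-7

def machine_forward(W_part, x):
    return np.maximum(W_part @ x, 0.0)  # each machine computes only its own hidden units

x = rng.normal(size=4)
with ThreadPoolExecutor(max_workers=2) as pool:
    parts = list(pool.map(machine_forward, W_parts, [x, x]))
h = np.concatenate(parts)               # only activations cross the machine boundary
print(h.shape)                          # (8,)
```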

Data Parallelism. Left: asynchronous SGD. The training data is partitioned, with a different model replica per shard. Each replica pushes and pulls gradient/weight information from a parameter server asynchronously, which makes training robust to machine failures. Note that gradient updates may be applied out of order and weights may be out of date, but it works! (cf. [Langford et al., 2009]). An AdaGrad learning rate is beneficial in practice. Right: distributed L-BFGS. More precisely, each machine within a model replica communicates with the relevant parameter server shard. (Figure from [Dean et al., 2012].)
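To illustrate the asynchronous pattern described above, here is a minimal single-process sketch (my own simplification, not the Downpour SGD implementation of [Dean et al., 2012]): worker threads repeatedly pull a possibly stale copy of the shared parameters, compute a gradient on their own data shard, and push an update without coordinating with each other. The quadratic toy losses, learning rate, and thread counts are hypothetical.

```python
import threading
import numpy as np

# Shared "parameter server": one global weight vector plus a lock for the update step
params = np.zeros(2)
lock = threading.Lock()
lr = 0.1

# Toy objective per shard: 0.5 * ||w - target||^2 with a shard-specific target
targets = [np.array([1.0, 2.0]), np.array([3.0, -1.0]), np.array([2.0, 0.5])]

def worker(shard_target, n_steps=200):
    global params
    for _ in range(n_steps):
        w = params.copy()                 # pull (possibly stale) weights
        grad = w - shard_target           # gradient of the shard's loss
        with lock:                        # push the update asynchronously
            params -= lr * grad

threads = [threading.Thread(target=worker, args=(t,)) for t in targets]
for t in threads: t.start()
for t in threads: t.join()
print(params)   # roughly the mean of the shard targets, despite stale reads
```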

Performance Analysis. Left: exploiting model parallelism on a single data shard gives speedups of up to 12x for sparse nets (Images), but with diminishing returns; for dense nets (Speech), the speedup peaks at 8 machines. Right: when data parallelism is also exploited, how much time and how many machines are needed to reach 16% test accuracy (Speech)? (Figure from [Dean et al., 2012].)

Hyper-parameter search is important. Lots of hyper-parameters: (1) number of layers, (2) number of nodes per layer, (3) SGD learning rate, (4) regularization constant, (5) mini-batch size, (6) type of non-linearity, (7) type of distribution for random initialization. It's important to invest in finding good settings for your data; it could be the difference between a winning system and a useless one.

Approaches to hyper-parameter search: (1) grid search; (2) random search; (3) manual search, a.k.a. Graduate Student Descent (GSD); (4) treat the search itself as a meta machine-learning problem [Bergstra et al., 2011], where the input x is a hyper-parameter setting and the output y is the validation error after training with those hyper-parameters.

Hyper-parameter search as a machine learning problem. Input x = a hyper-parameter setting; output y = validation error after training with those hyper-parameters. (1) Computing y is expensive, so we learn a function f(x) that can predict it from past (x, y) pairs, e.g. linear regression, a Gaussian Process, or a Parzen estimator [Bergstra et al., 2011]. (2) Try the hyper-parameter setting $x^* = \arg\min_x f(x)$, or some variant. (3) Repeat steps 1-2 until you solve AI!
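A minimal sketch of this loop (my own illustration, not the TPE algorithm of [Bergstra et al., 2011]): fit a cheap surrogate f(x) on the past (hyper-parameter, validation error) pairs, evaluate the candidate the surrogate predicts to be best, and repeat. The quadratic surrogate, the search range, and the toy validation_error function are hypothetical stand-ins for actually training a network.

```python
import numpy as np

rng = np.random.default_rng(0)

def validation_error(log_lr):
    """Hypothetical expensive evaluation: pretend validation error is a noisy
    bowl with its minimum near log10(lr) = -2, i.e. lr = 0.01."""
    return (log_lr + 2.0) ** 2 + 0.05 * rng.normal()

# Start from a few random evaluations (cf. random search)
X = list(rng.uniform(-5, 0, size=4))     # search log10(learning rate) in [-5, 0]
Y = [validation_error(x) for x in X]

for _ in range(10):
    # (1) Fit a cheap surrogate f(x) to the past (x, y) pairs (quadratic fit here)
    coeffs = np.polyfit(X, Y, deg=2)
    # (2) Pick the candidate that minimizes the surrogate over a dense grid
    grid = np.linspace(-5, 0, 501)
    x_next = grid[np.argmin(np.polyval(coeffs, grid))]
    # (3) Run the expensive evaluation there and add the result to the history
    X.append(x_next)
    Y.append(validation_error(x_next))

best = X[int(np.argmin(Y))]
print(f"best log10(learning rate) found: {best:.2f}")   # close to -2
```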

References I

Bergstra, J., Bardenet, R., Bengio, Y., and Kégl, B. (2011). Algorithms for hyper-parameter optimization. In Proc. Neural Information Processing Systems 24 (NIPS 2011).

Dean, J., Corrado, G. S., Monga, R., Chen, K., Devin, M., Le, Q. V., Mao, M. Z., Ranzato, M., Senior, A., Tucker, P., Yang, K., and Ng, A. Y. (2012). Large scale distributed deep networks. In Neural Information Processing Systems (NIPS).

Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580.

References II

Langford, J., Smola, A., and Zinkevich, M. (2009). Slow learners are fast. In NIPS.

Martens, J. (2010). Deep learning via Hessian-free optimization. In Proceedings of the 27th International Conference on Machine Learning (ICML).

Martens, J. and Sutskever, I. (2011). Learning recurrent neural networks with Hessian-free optimization. In Proceedings of the 28th International Conference on Machine Learning (ICML).