Energy Landscapes of Deep Networks

Size: px
Start display at page:

Download "Energy Landscapes of Deep Networks"

Transcription

1 Energy Landscapes of Deep Networks Pratik Chaudhari February 29, 2016 References Auffinger, A., Arous, G. B., & Černý, J. (2013). Random matrices and complexity of spin glasses. arxiv: # Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., & LeCun, Y. (2014). The Loss Surface of Multilayer Networks. arxiv: # Fyodorov, Yan V. "High-dimensional random fields and random matrix theory." arxiv: (2013).# Chaudhari Pratik & Soatto Stefano. The Effect of Gradient Noise on the Energy Landscape of Deep Networks. arxiv: (2015). 1

2 Today Introduction to spin glasses! Write a deep network as a spin glass! Analyze local minima of such networks! Modify the energy landscape of deep networks 2

3 An introduction to spin glasses Ising model Model from statistical mechanics to explain magnetism Hamiltonian H(s) =Â (i,j) J s i s j s i 2 { 1, 1} upward / downward spins sum over all neighbors coupling strength, also called disorder p n p n dipoles on a lattice Hamiltonian gives the energy of a configuration# Thermodynamically, the model likes to find the ground state P(configuration s) = 1 Z e bh(s) Boltzmann distribution Boltzmann s constant inverse temperature b = 1 k B T temperature normalizing constant, also called partition function # very hard to compute Z = Â e bh(s) s2{ 1,1} n 3

4 Ising model Hamiltonian H(s) =Â (i,j) J s i s j s i 2 { 1, 1} upward / downward spins Why interesting? sum over all neighbors coupling strength, also called disorder Behaves like a magnet!# Beyond a temperature threshold, spins no# longer aligned together# Great model for analysis of graphical models Would also be a good model for deep nets! But very hard to solve for large correlations at # low temperature complete disorder at# high temperature Z = Â s2{ 1,1} {z } n 2 n terms e bh(s) exponentially many terms 4

5 Enter spin glasses 2-spin glass H(s) = Â (i 1,i 2 applen) J i1,i 2 s i1 s i2 Hamiltonian is also a random variable We will need a p-spin glass note the sum over all other spins is now a random variable with# exponential tails# usually Gaussian# couples all possible pairs of spins helps use a lot of results from# probability theory# much more known about spin-glasses# than Ising / Potts models# In particular, their energy landscape H p (s) = Â 1applei 1,...,i p applen J i1,...,i p s i1 s i2...s ip Why call them glasses? Long range coupling, not just neighbors# Glass is really just a liquid that flows very slowly Crystals: cooling a liquid very quickly, short range coupling# Glasses have long range coupling 5

6 A model for deep networks sparse, random connections, threshold nonlinearities scalar output h: hidden units n neurons on each layer p layers X: input Loss function (zero-one loss) is a spin glass Hamiltonian: H(s) law J n (p 1)/2 n  i 1,...,i p =1 J i1,...,i p s i1... s ip Active paths in the network# Unique weights in the network zero-mean standard Gaussian 6

7 Random sparse deep network p total layers p = log d n = 1 a Y = W > p s(w p 1 s(w p 2...s(W 1 X))...)) output X 2 R n class-label Y 2 R thresholds weights W i 2 [ 1, 1] n n decision mechanism W p 2 R n SVMs, soft-max etc.! d sparse as well Choromanska et. al. use a non-sparse# model with ReLU nonlinearities feature vector Some assumptions on weights on an average, d non-zeros d = n a ; a < 1/5 weights are not all close to zero E W k,ij var W k,ij = 0 n 2/5 7

8 Random sparse deep network Can write it as class-label Y = n  i=1 sum over the output neurons# dimensions of X but introduces weird couplings / disorder z } {  X i g2g i i th p k=1 w k g! element dimension of X sum over all active paths in the network these are our spins! product of all the# weights along pathg G i set of active paths from# output neuron to Y i th 8

9 Deep network as a spin glass Two ways to remove correlations Assume independence of paths Choromanska et. al. X i1! X i1,...,i p We shall explicitly account for them Y = n  i,i 1,...,i p =1 first edge, i.e., between# layer 1 and layer 2 g i,i1 X i W i,i1 {z } :=Z i,i1 net coupling of a path due to# the sum over all outputs is small g i1,...,i p W i1,...,i p path from layer 2 to Y  i Z i apple n 1/5 9

10 Model for deep networks For a 0-1 / logistic loss function, can show: H(s) law number of layers J n (p 1)/2 n  i 1,...,i p =1 number of neurons on each layer J i1,...,i p s i1... s ip Because of sparsity, can# estimate the probability# that a neuron of any layer# is active or not estimate the number of active# paths in the sparse network# zero-mean standard Gaussian rewrite in terms of unique weights Q(n) (for analysis) assume spherical constraint s 2 S n 1 ( p n) Key Idea - Can now analyze a generic neural network 10

11 Number of critical points All critical points: crt(b) = {s : rh(s) =0, H(s) 2 nb} any interval of real line gradient is zero energy lies in interval B Critical points of index k: crt k (B) = n s : rh(s) =0, H(s) 2 nb, i r 2 H(s) o = k counts saddle points in interval B k eigenvalues of Hessian# have negative real part 11

12 Number of critical points crt(b) =all critical points crt k (B) =saddle points of index k lim n! 1 n log E crt k(u) =Q k (u), for all u 2 R, k 2 N lim n! 1 n log E crt(u) =Q(u), for all u 2 R, k 2 N exponentially many local minima and saddle points! u (, u) rate functions, look like this Energy barriers for p = Q(u) 0.00 k = 0 E 0 = E inf = k = 1 k = 4 k = H/n 12

13 Number of critical points crt(b) =all critical points crt k (B) =saddle points of index k Corollary lim n! 1 n log E crt k(r) = 1 2 log(p 2) p 3 p 1 lim n! 1 n log E crt(r) =1 2 log(p 2) exponentially many local minima as well as saddle points in total! 13

14 Location of critical points Theorem i.e., decreases exponentially lim sup 1 n! n 2 log P (9 a saddle point of index k above energy n(e e)) < 0 lim sup 1 n! n log P (9 a saddle point of index k above energy n(e k + e)) < 0 Local minima exist throughout# SGD might think low-index saddle points# are minima# SGD finds one of these jumping between local minima is hard # normalized width of energy bands# is small, can t check numerically [ ne 0, ne ] layered structure index apple k saddle points between [ ne 0, ne k ] energy bands have to climb the energy barrier to reach a saddle point, then climb down again fortunately, width of bands decreases with k 14

15 Experimental evaluation: 2-spin glass, simulated annealing same as number of neurons n Choromanska et. al. 15

16 Experimental evaluation: 1-layer MNIST network, SGD number of neurons n from Choromanska et. al. Choromanska et. al. 16

17 Experimental evaluation: 1-layer MNIST network, SGD variance increases with n? should become worse with more layers due to correlations Choromanska et. al. SGD finds critical points of low normalized index,# i.e., does not get stuck at saddle points# Of course, there are those exponentially many local minima 17

18 Trivialization of critical points Simple example H = 1 2 s> Js + h > s; s = n spherical constraint random matrix, each entry is Gaussian var(j ij )= 1 + d ij n random magnetic field, Gaussian var(h i )=n 2 Three distinct regimes O(n) asymptotically critical points for # n = O(n 1/2 ) Number of critical points decreases with magnetic field# Total trivialization, 2 critical points for n = O(n 1/6 ) 18

19 Trivialization of critical points n = O(n 1/2 ) n = O(n 1/6 ) Ecrt(R) 2n Ecrt(R) g = nn2 2 k = n1/3 n

20 Trivialization: Changing the number of critical points eh(s) = H(s)+Â i h i s i original Hamiltonian random magnetic field, zero-mean Gaussian Variance of magnetic field: n c = q p(p 2) Three distinct regimes Exponentially many critical points with small noise# Polynomially-many critical points near threshold# Total trivialization, 2 critical points n < n c n > n c + W(n 1 ) n n c O(n 1 ) Important regime Want to avoid this 20

21 Monte-Carlo simulation on a spin glass eh(s) = H(s)+Â i h i s i n = 100, p = 3 Local-minima separated by high energy barriers# Saddle points are the canyons above Almost constant Hamiltonian# All points land at the same local minimum 21

22 Monte-Carlo simulation on a spin glass eh(s) = H(s)+Â i h i s i Very few local minima# Much larger proportion of saddle points (for a slightly different model) can prove that in this regime, only low order saddle points and local minima persist 22

23 Distribution of local minima concentrates lim P min H(s) m n t n! s = exp 1 e c pt c p Subag, Zeitouni (2015) Probability density of Hamiltonian at local minima Gumbel density µ = 1.58 b = 0.03 Gumbel distribution H/n 23

24 15 10 Probability density of Hamiltonian at local minima Distribution of local minima concentrates lim P min H(s) m n t n! s Gumbel density µ = 1.58 b = 0.03 and all saddle points = exp 1 e c pt c p Subag, Zeitouni (2015) Gumbel distribution H/n k = 0 E 0 = E inf = k = 1 k = 4 k = 10 24

25 Perturbations of local minima Theorem DH n (0), 1 n H quad(y s ) H quad (0) quadratic approximation at a critical point quadratic approximation at a critical point n = O(1) lim n! DH n(0) =n ky s kapplen 1/4 Roughly, magnetic field does not perturb the local minima, only changes the Hamiltonian 25

26 Annealing the magnetic field eh(s) = H(s)+ p a 2 p n  i s 2 i +  i h i s i L2 regularization (same as before) random magnetic field Set a parameter: n = Jp t n Magnetic field for the polynomial regime is: 1 + t a2 n + 2t n 1/2 Helps transition between# Polynomial regime:# Exponential regime: t 1 t! But fix its direction, don t resample! Anneal magnetic field to zero as training progresses: t(k) = n a k N epoch number total number of epochs 26

27 Fully-connected deep network : MNIST mnistfc : input! linear 64 {z } 20 relu nonlinearities! linear 10! softmax 100 mnistfc: Training error 90 mnistfc: Validation error `2 `2, triv, anneal %error %error 40 5 `2 `2, triv, anneal Iteration Epoch Does not exactly match the theoretical model, so set p = 20, n = s #weights p 27

28 Fully-connected deep network : CIFAR-10 cifarfc : input! linear 128 {z } 20! linear 10! softmax relu nonlinearities error = k c Trivialization is a great way to avoid bad initialization of weights 100 cifarfc: Training error 90 cifarfc: Validation error `2 `2, triv, anneal %error 40 %error `2 `2, triv, anneal Iteration Epoch 28

29 Convolution neural network: CIFAR-10 cifarconv : input! conv {z } 8! linear 128 {z } 8! linear 10! softmax relu nonlinearities, max-pooling 80 cifarconv: Training error `2 `2, triv, anneal `2, anneal, resample cifarconv: Validation error `2 `2, triv, anneal `2, anneal, resample %error %error Iteration Epoch Exact same training parameters as previous experiment 29

30 Convolution neural network: CIFAR-10 cifarconv : input! conv {z } 8! linear 128 {z } 8! linear 10! softmax relu nonlinearities, max-pooling Weights keep sloshing with resampling Much larger minimum gradient 10 0 cifarconv: Alignment of weights cifarconv: Minimum gradient Âi h i s i `2, triv, anneal `2, anneal, resample min g k `2 `2, triv, anneal `2, anneal, resample Iteration Iteration 30

31 Network-in-network architecture: CIFAR-10 cifarnin : input! conv {z } 3! max-pool 3 3! dropout {z } 2! conv ! conv ! conv ! mean-pool 8 8! softmax relu nonlinearities, batch-normalization, global average pooling cifarnin: Validation error `2 `2, triv, anneal `2, anneal, resample %error Epoch Exact same parameters! 31

32 Conclusions Spin glasses Good models for deep networks,! energy landscape has similar properties Gradient noise of right magnitude dramatically helps very-deep networks, fully-connected layers Theoretical goals Effect of regularization on the energy landscape! Devise specialized, non-local optimization algorithms 32

A picture of the energy landscape of! deep neural networks

A picture of the energy landscape of! deep neural networks A picture of the energy landscape of! deep neural networks Pratik Chaudhari December 15, 2017 UCLA VISION LAB 1 Dy (x; w) = (w p (w p 1 (... (w 1 x))...)) w = argmin w (x,y ) 2 D kx 1 {y =i } log Dy i

More information

Day 3 Lecture 3. Optimizing deep networks

Day 3 Lecture 3. Optimizing deep networks Day 3 Lecture 3 Optimizing deep networks Convex optimization A function is convex if for all α [0,1]: f(x) Tangent line Examples Quadratics 2-norms Properties Local minimum is global minimum x Gradient

More information

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Neural Networks: A brief touch Yuejie Chi Department of Electrical and Computer Engineering Spring 2018 1/41 Outline

More information

CSC321 Lecture 8: Optimization

CSC321 Lecture 8: Optimization CSC321 Lecture 8: Optimization Roger Grosse Roger Grosse CSC321 Lecture 8: Optimization 1 / 26 Overview We ve talked a lot about how to compute gradients. What do we actually do with them? Today s lecture:

More information

Non-Convex Optimization. CS6787 Lecture 7 Fall 2017

Non-Convex Optimization. CS6787 Lecture 7 Fall 2017 Non-Convex Optimization CS6787 Lecture 7 Fall 2017 First some words about grading I sent out a bunch of grades on the course management system Everyone should have all their grades in Not including paper

More information

Unraveling the mysteries of stochastic gradient descent on deep neural networks

Unraveling the mysteries of stochastic gradient descent on deep neural networks Unraveling the mysteries of stochastic gradient descent on deep neural networks Pratik Chaudhari UCLA VISION LAB 1 The question measures disagreement of predictions with ground truth Cat Dog... x = argmin

More information

Comments. Assignment 3 code released. Thought questions 3 due this week. Mini-project: hopefully you have started. implement classification algorithms

Comments. Assignment 3 code released. Thought questions 3 due this week. Mini-project: hopefully you have started. implement classification algorithms Neural networks Comments Assignment 3 code released implement classification algorithms use kernels for census dataset Thought questions 3 due this week Mini-project: hopefully you have started 2 Example:

More information

CSC321 Lecture 7: Optimization

CSC321 Lecture 7: Optimization CSC321 Lecture 7: Optimization Roger Grosse Roger Grosse CSC321 Lecture 7: Optimization 1 / 25 Overview We ve talked a lot about how to compute gradients. What do we actually do with them? Today s lecture:

More information

CS 179: LECTURE 16 MODEL COMPLEXITY, REGULARIZATION, AND CONVOLUTIONAL NETS

CS 179: LECTURE 16 MODEL COMPLEXITY, REGULARIZATION, AND CONVOLUTIONAL NETS CS 179: LECTURE 16 MODEL COMPLEXITY, REGULARIZATION, AND CONVOLUTIONAL NETS LAST TIME Intro to cudnn Deep neural nets using cublas and cudnn TODAY Building a better model for image classification Overfitting

More information

<Special Topics in VLSI> Learning for Deep Neural Networks (Back-propagation)

<Special Topics in VLSI> Learning for Deep Neural Networks (Back-propagation) Learning for Deep Neural Networks (Back-propagation) Outline Summary of Previous Standford Lecture Universal Approximation Theorem Inference vs Training Gradient Descent Back-Propagation

More information

Quantum and classical annealing in spin glasses and quantum computing. Anders W Sandvik, Boston University

Quantum and classical annealing in spin glasses and quantum computing. Anders W Sandvik, Boston University NATIONAL TAIWAN UNIVERSITY, COLLOQUIUM, MARCH 10, 2015 Quantum and classical annealing in spin glasses and quantum computing Anders W Sandvik, Boston University Cheng-Wei Liu (BU) Anatoli Polkovnikov (BU)

More information

DEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY

DEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY DEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY 1 On-line Resources http://neuralnetworksanddeeplearning.com/index.html Online book by Michael Nielsen http://matlabtricks.com/post-5/3x3-convolution-kernelswith-online-demo

More information

Simple Techniques for Improving SGD. CS6787 Lecture 2 Fall 2017

Simple Techniques for Improving SGD. CS6787 Lecture 2 Fall 2017 Simple Techniques for Improving SGD CS6787 Lecture 2 Fall 2017 Step Sizes and Convergence Where we left off Stochastic gradient descent x t+1 = x t rf(x t ; yĩt ) Much faster per iteration than gradient

More information

Implicit Optimization Bias

Implicit Optimization Bias Implicit Optimization Bias as a key to Understanding Deep Learning Nati Srebro (TTIC) Based on joint work with Behnam Neyshabur (TTIC IAS), Ryota Tomioka (TTIC MSR), Srinadh Bhojanapalli, Suriya Gunasekar,

More information

CS 229 Project Final Report: Reinforcement Learning for Neural Network Architecture Category : Theory & Reinforcement Learning

CS 229 Project Final Report: Reinforcement Learning for Neural Network Architecture Category : Theory & Reinforcement Learning CS 229 Project Final Report: Reinforcement Learning for Neural Network Architecture Category : Theory & Reinforcement Learning Lei Lei Ruoxuan Xiong December 16, 2017 1 Introduction Deep Neural Network

More information

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6 Machine Learning for Large-Scale Data Analysis and Decision Making 80-629-17A Neural Networks Week #6 Today Neural Networks A. Modeling B. Fitting C. Deep neural networks Today s material is (adapted)

More information

Overparametrization for Landscape Design in Non-convex Optimization

Overparametrization for Landscape Design in Non-convex Optimization Overparametrization for Landscape Design in Non-convex Optimization Jason D. Lee University of Southern California September 19, 2018 The State of Non-Convex Optimization Practical observation: Empirically,

More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

Neural Network Training

Neural Network Training Neural Network Training Sargur Srihari Topics in Network Training 0. Neural network parameters Probabilistic problem formulation Specifying the activation and error functions for Regression Binary classification

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

Introduction to Convolutional Neural Networks (CNNs)

Introduction to Convolutional Neural Networks (CNNs) Introduction to Convolutional Neural Networks (CNNs) nojunk@snu.ac.kr http://mipal.snu.ac.kr Department of Transdisciplinary Studies Seoul National University, Korea Jan. 2016 Many slides are from Fei-Fei

More information

Introduction to Machine Learning Midterm Exam

Introduction to Machine Learning Midterm Exam 10-701 Introduction to Machine Learning Midterm Exam Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes, but

More information

Neural Networks. Nicholas Ruozzi University of Texas at Dallas

Neural Networks. Nicholas Ruozzi University of Texas at Dallas Neural Networks Nicholas Ruozzi University of Texas at Dallas Handwritten Digit Recognition Given a collection of handwritten digits and their corresponding labels, we d like to be able to correctly classify

More information

Non-Convex Optimization in Machine Learning. Jan Mrkos AIC

Non-Convex Optimization in Machine Learning. Jan Mrkos AIC Non-Convex Optimization in Machine Learning Jan Mrkos AIC The Plan 1. Introduction 2. Non convexity 3. (Some) optimization approaches 4. Speed and stuff? Neural net universal approximation Theorem (1989):

More information

Mark Gales October y (x) x 1. x 2 y (x) Inputs. Outputs. x d. y (x) Second Output layer layer. layer.

Mark Gales October y (x) x 1. x 2 y (x) Inputs. Outputs. x d. y (x) Second Output layer layer. layer. University of Cambridge Engineering Part IIB & EIST Part II Paper I0: Advanced Pattern Processing Handouts 4 & 5: Multi-Layer Perceptron: Introduction and Training x y (x) Inputs x 2 y (x) 2 Outputs x

More information

3. General properties of phase transitions and the Landau theory

3. General properties of phase transitions and the Landau theory 3. General properties of phase transitions and the Landau theory In this Section we review the general properties and the terminology used to characterise phase transitions, which you will have already

More information

CSCI 315: Artificial Intelligence through Deep Learning

CSCI 315: Artificial Intelligence through Deep Learning CSCI 35: Artificial Intelligence through Deep Learning W&L Fall Term 27 Prof. Levy Convolutional Networks http://wernerstudio.typepad.com/.a/6ad83549adb53ef53629ccf97c-5wi Convolution: Convolution is

More information

Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses about the label (Top-5 error) No Bounding Box

Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses about the label (Top-5 error) No Bounding Box ImageNet Classification with Deep Convolutional Neural Networks Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton Motivation Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses

More information

SGD and Deep Learning

SGD and Deep Learning SGD and Deep Learning Subgradients Lets make the gradient cheating more formal. Recall that the gradient is the slope of the tangent. f(w 1 )+rf(w 1 ) (w w 1 ) Non differentiable case? w 1 Subgradients

More information

Improving L-BFGS Initialization for Trust-Region Methods in Deep Learning

Improving L-BFGS Initialization for Trust-Region Methods in Deep Learning Improving L-BFGS Initialization for Trust-Region Methods in Deep Learning Jacob Rafati http://rafati.net jrafatiheravi@ucmerced.edu Ph.D. Candidate, Electrical Engineering and Computer Science University

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

More information

1 What a Neural Network Computes

1 What a Neural Network Computes Neural Networks 1 What a Neural Network Computes To begin with, we will discuss fully connected feed-forward neural networks, also known as multilayer perceptrons. A feedforward neural network consists

More information

The K-FAC method for neural network optimization

The K-FAC method for neural network optimization The K-FAC method for neural network optimization James Martens Thanks to my various collaborators on K-FAC research and engineering: Roger Grosse, Jimmy Ba, Vikram Tankasali, Matthew Johnson, Daniel Duckworth,

More information

Contents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016

Contents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 LECTURERS: NAMAN AGARWAL AND BRIAN BULLINS SCRIBE: KIRAN VODRAHALLI Contents 1 Introduction 1 1.1 History of Optimization.....................................

More information

Neural Networks: Optimization & Regularization

Neural Networks: Optimization & Regularization Neural Networks: Optimization & Regularization Shan-Hung Wu shwu@cs.nthu.edu.tw Department of Computer Science, National Tsing Hua University, Taiwan Machine Learning Shan-Hung Wu (CS, NTHU) NN Opt & Reg

More information

Theory IIIb: Generalization in Deep Networks

Theory IIIb: Generalization in Deep Networks CBMM Memo No. 90 June 29, 2018 Theory IIIb: Generalization in Deep Networks Tomaso Poggio 1, Qianli Liao 1, Brando Miranda 1, Andrzej Banburski 1, Xavier Boix 1 and Jack Hidary 2 1 Center for Brains, Minds,

More information

Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima J. Nocedal with N. Keskar Northwestern University D. Mudigere INTEL P. Tang INTEL M. Smelyanskiy INTEL 1 Initial Remarks SGD

More information

CSE446: Neural Networks Spring Many slides are adapted from Carlos Guestrin and Luke Zettlemoyer

CSE446: Neural Networks Spring Many slides are adapted from Carlos Guestrin and Luke Zettlemoyer CSE446: Neural Networks Spring 2017 Many slides are adapted from Carlos Guestrin and Luke Zettlemoyer Human Neurons Switching time ~ 0.001 second Number of neurons 10 10 Connections per neuron 10 4-5 Scene

More information

Introduction to Machine Learning Midterm Exam Solutions

Introduction to Machine Learning Midterm Exam Solutions 10-701 Introduction to Machine Learning Midterm Exam Solutions Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes,

More information

Advanced computational methods X Selected Topics: SGD

Advanced computational methods X Selected Topics: SGD Advanced computational methods X071521-Selected Topics: SGD. In this lecture, we look at the stochastic gradient descent (SGD) method 1 An illustrating example The MNIST is a simple dataset of variety

More information

Neural networks and optimization

Neural networks and optimization Neural networks and optimization Nicolas Le Roux Criteo 18/05/15 Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 1 / 85 1 Introduction 2 Deep networks 3 Optimization 4 Convolutional

More information

Neural Networks for Machine Learning. Lecture 11a Hopfield Nets

Neural Networks for Machine Learning. Lecture 11a Hopfield Nets Neural Networks for Machine Learning Lecture 11a Hopfield Nets Geoffrey Hinton Nitish Srivastava, Kevin Swersky Tijmen Tieleman Abdel-rahman Mohamed Hopfield Nets A Hopfield net is composed of binary threshold

More information

Quantum Annealing in spin glasses and quantum computing Anders W Sandvik, Boston University

Quantum Annealing in spin glasses and quantum computing Anders W Sandvik, Boston University PY502, Computational Physics, December 12, 2017 Quantum Annealing in spin glasses and quantum computing Anders W Sandvik, Boston University Advancing Research in Basic Science and Mathematics Example:

More information

Introduction to the Renormalization Group

Introduction to the Renormalization Group Introduction to the Renormalization Group Gregory Petropoulos University of Colorado Boulder March 4, 2015 1 / 17 Summary Flavor of Statistical Physics Universality / Critical Exponents Ising Model Renormalization

More information

Why is Deep Learning so effective?

Why is Deep Learning so effective? Ma191b Winter 2017 Geometry of Neuroscience The unreasonable effectiveness of deep learning This lecture is based entirely on the paper: Reference: Henry W. Lin and Max Tegmark, Why does deep and cheap

More information

text classification 3: neural networks

text classification 3: neural networks text classification 3: neural networks CS 585, Fall 2018 Introduction to Natural Language Processing http://people.cs.umass.edu/~miyyer/cs585/ Mohit Iyyer College of Information and Computer Sciences University

More information

Normalization Techniques

Normalization Techniques Normalization Techniques Devansh Arpit Normalization Techniques 1 / 39 Table of Contents 1 Introduction 2 Motivation 3 Batch Normalization 4 Normalization Propagation 5 Weight Normalization 6 Layer Normalization

More information

Markov Chain Monte Carlo Method

Markov Chain Monte Carlo Method Markov Chain Monte Carlo Method Macoto Kikuchi Cybermedia Center, Osaka University 6th July 2017 Thermal Simulations 1 Why temperature 2 Statistical mechanics in a nutshell 3 Temperature in computers 4

More information

The exam is closed book, closed notes except your one-page cheat sheet.

The exam is closed book, closed notes except your one-page cheat sheet. CS 189 Fall 2015 Introduction to Machine Learning Final Please do not turn over the page before you are instructed to do so. You have 2 hours and 50 minutes. Please write your initials on the top-right

More information

The connection of dropout and Bayesian statistics

The connection of dropout and Bayesian statistics The connection of dropout and Bayesian statistics Interpretation of dropout as approximate Bayesian modelling of NN http://mlg.eng.cam.ac.uk/yarin/thesis/thesis.pdf Dropout Geoffrey Hinton Google, University

More information

Jakub Hajic Artificial Intelligence Seminar I

Jakub Hajic Artificial Intelligence Seminar I Jakub Hajic Artificial Intelligence Seminar I. 11. 11. 2014 Outline Key concepts Deep Belief Networks Convolutional Neural Networks A couple of questions Convolution Perceptron Feedforward Neural Network

More information

Deep Relaxation: PDEs for optimizing Deep Neural Networks

Deep Relaxation: PDEs for optimizing Deep Neural Networks Deep Relaxation: PDEs for optimizing Deep Neural Networks IPAM Mean Field Games August 30, 2017 Adam Oberman (McGill) Thanks to the Simons Foundation (grant 395980) and the hospitality of the UCLA math

More information

UNDERSTANDING LOCAL MINIMA IN NEURAL NET-

UNDERSTANDING LOCAL MINIMA IN NEURAL NET- UNDERSTANDING LOCAL MINIMA IN NEURAL NET- WORKS BY LOSS SURFACE DECOMPOSITION Anonymous authors Paper under double-blind review ABSTRACT To provide principled ways of designing proper Deep Neural Network

More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

Distributed Estimation, Information Loss and Exponential Families. Qiang Liu Department of Computer Science Dartmouth College

Distributed Estimation, Information Loss and Exponential Families. Qiang Liu Department of Computer Science Dartmouth College Distributed Estimation, Information Loss and Exponential Families Qiang Liu Department of Computer Science Dartmouth College Statistical Learning / Estimation Learning generative models from data Topic

More information

Introduction to Neural Networks

Introduction to Neural Networks CUONG TUAN NGUYEN SEIJI HOTTA MASAKI NAKAGAWA Tokyo University of Agriculture and Technology Copyright by Nguyen, Hotta and Nakagawa 1 Pattern classification Which category of an input? Example: Character

More information

Gradient-Based Learning. Sargur N. Srihari

Gradient-Based Learning. Sargur N. Srihari Gradient-Based Learning Sargur N. srihari@cedar.buffalo.edu 1 Topics Overview 1. Example: Learning XOR 2. Gradient-Based Learning 3. Hidden Units 4. Architecture Design 5. Backpropagation and Other Differentiation

More information

Normalization Techniques in Training of Deep Neural Networks

Normalization Techniques in Training of Deep Neural Networks Normalization Techniques in Training of Deep Neural Networks Lei Huang ( 黄雷 ) State Key Laboratory of Software Development Environment, Beihang University Mail:huanglei@nlsde.buaa.edu.cn August 17 th,

More information

S j H o = gµ o H o. j=1

S j H o = gµ o H o. j=1 LECTURE 17 Ferromagnetism (Refs.: Sections 10.6-10.7 of Reif; Book by J. S. Smart, Effective Field Theories of Magnetism) Consider a solid consisting of N identical atoms arranged in a regular lattice.

More information

Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning

Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning Nicolas Thome Prenom.Nom@cnam.fr http://cedric.cnam.fr/vertigo/cours/ml2/ Département Informatique Conservatoire

More information

Midterm for CSC321, Intro to Neural Networks Winter 2018, night section Tuesday, March 6, 6:10-7pm

Midterm for CSC321, Intro to Neural Networks Winter 2018, night section Tuesday, March 6, 6:10-7pm Midterm for CSC321, Intro to Neural Networks Winter 2018, night section Tuesday, March 6, 6:10-7pm Name: Student number: This is a closed-book test. It is marked out of 15 marks. Please answer ALL of the

More information

Deep Feedforward Networks

Deep Feedforward Networks Deep Feedforward Networks Liu Yang March 30, 2017 Liu Yang Short title March 30, 2017 1 / 24 Overview 1 Background A general introduction Example 2 Gradient based learning Cost functions Output Units 3

More information

Deep Learning Lecture 2

Deep Learning Lecture 2 Fall 2016 Machine Learning CMPSCI 689 Deep Learning Lecture 2 Sridhar Mahadevan Autonomous Learning Lab UMass Amherst COLLEGE Outline of lecture New type of units Convolutional units, Rectified linear

More information

CSC321 Lecture 5: Multilayer Perceptrons

CSC321 Lecture 5: Multilayer Perceptrons CSC321 Lecture 5: Multilayer Perceptrons Roger Grosse Roger Grosse CSC321 Lecture 5: Multilayer Perceptrons 1 / 21 Overview Recall the simple neuron-like unit: y output output bias i'th weight w 1 w2 w3

More information

Neural networks and support vector machines

Neural networks and support vector machines Neural netorks and support vector machines Perceptron Input x 1 Weights 1 x 2 x 3... x D 2 3 D Output: sgn( x + b) Can incorporate bias as component of the eight vector by alays including a feature ith

More information

A Monte Carlo Implementation of the Ising Model in Python

A Monte Carlo Implementation of the Ising Model in Python A Monte Carlo Implementation of the Ising Model in Python Alexey Khorev alexey.s.khorev@gmail.com 2017.08.29 Contents 1 Theory 1 1.1 Introduction...................................... 1 1.2 Model.........................................

More information

Vasil Khalidov & Miles Hansard. C.M. Bishop s PRML: Chapter 5; Neural Networks

Vasil Khalidov & Miles Hansard. C.M. Bishop s PRML: Chapter 5; Neural Networks C.M. Bishop s PRML: Chapter 5; Neural Networks Introduction The aim is, as before, to find useful decompositions of the target variable; t(x) = y(x, w) + ɛ(x) (3.7) t(x n ) and x n are the observations,

More information

Porcupine Neural Networks: (Almost) All Local Optima are Global

Porcupine Neural Networks: (Almost) All Local Optima are Global Porcupine Neural Networs: (Almost) All Local Optima are Global Soheil Feizi, Hamid Javadi, Jesse Zhang and David Tse arxiv:1710.0196v1 [stat.ml] 5 Oct 017 Stanford University Abstract Neural networs have

More information

IPAM Summer School Optimization methods for machine learning. Jorge Nocedal

IPAM Summer School Optimization methods for machine learning. Jorge Nocedal IPAM Summer School 2012 Tutorial on Optimization methods for machine learning Jorge Nocedal Northwestern University Overview 1. We discuss some characteristics of optimization problems arising in deep

More information

Deep Neural Networks: From Flat Minima to Numerically Nonvacuous Generalization Bounds via PAC-Bayes

Deep Neural Networks: From Flat Minima to Numerically Nonvacuous Generalization Bounds via PAC-Bayes Deep Neural Networks: From Flat Minima to Numerically Nonvacuous Generalization Bounds via PAC-Bayes Daniel M. Roy University of Toronto; Vector Institute Joint work with Gintarė K. Džiugaitė University

More information

11 More Regression; Newton s Method; ROC Curves

11 More Regression; Newton s Method; ROC Curves More Regression; Newton s Method; ROC Curves 59 11 More Regression; Newton s Method; ROC Curves LEAST-SQUARES POLYNOMIAL REGRESSION Replace each X i with feature vector e.g. (X i ) = [X 2 i1 X i1 X i2

More information

When can Deep Networks avoid the curse of dimensionality and other theoretical puzzles

When can Deep Networks avoid the curse of dimensionality and other theoretical puzzles When can Deep Networks avoid the curse of dimensionality and other theoretical puzzles Tomaso Poggio, MIT, CBMM Astar CBMM s focus is the Science and the Engineering of Intelligence We aim to make progress

More information

STA 414/2104: Lecture 8

STA 414/2104: Lecture 8 STA 414/2104: Lecture 8 6-7 March 2017: Continuous Latent Variable Models, Neural networks Delivered by Mark Ebden With thanks to Russ Salakhutdinov, Jimmy Ba and others Outline Continuous latent variable

More information

Online Videos FERPA. Sign waiver or sit on the sides or in the back. Off camera question time before and after lecture. Questions?

Online Videos FERPA. Sign waiver or sit on the sides or in the back. Off camera question time before and after lecture. Questions? Online Videos FERPA Sign waiver or sit on the sides or in the back Off camera question time before and after lecture Questions? Lecture 1, Slide 1 CS224d Deep NLP Lecture 4: Word Window Classification

More information

Statistical Machine Learning

Statistical Machine Learning Statistical Machine Learning Lecture 9 Numerical optimization and deep learning Niklas Wahlström Division of Systems and Control Department of Information Technology Uppsala University niklas.wahlstrom@it.uu.se

More information

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others)

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others) Machine Learning Neural Networks (slides from Domingos, Pardo, others) For this week, Reading Chapter 4: Neural Networks (Mitchell, 1997) See Canvas For subsequent weeks: Scaling Learning Algorithms toward

More information

Introduction to Machine Learning HW6

Introduction to Machine Learning HW6 CS 189 Spring 2018 Introduction to Machine Learning HW6 Your self-grade URL is http://eecs189.org/self_grade?question_ids=1_1,1_ 2,2_1,2_2,3_1,3_2,3_3,4_1,4_2,4_3,4_4,4_5,4_6,5_1,5_2,6. This homework is

More information

Basic Principles of Unsupervised and Unsupervised

Basic Principles of Unsupervised and Unsupervised Basic Principles of Unsupervised and Unsupervised Learning Toward Deep Learning Shun ichi Amari (RIKEN Brain Science Institute) collaborators: R. Karakida, M. Okada (U. Tokyo) Deep Learning Self Organization

More information

Lecture 14: Deep Generative Learning

Lecture 14: Deep Generative Learning Generative Modeling CSED703R: Deep Learning for Visual Recognition (2017F) Lecture 14: Deep Generative Learning Density estimation Reconstructing probability density function using samples Bohyung Han

More information

ECE521 Lecture 7/8. Logistic Regression

ECE521 Lecture 7/8. Logistic Regression ECE521 Lecture 7/8 Logistic Regression Outline Logistic regression (Continue) A single neuron Learning neural networks Multi-class classification 2 Logistic regression The output of a logistic regression

More information

Summary and discussion of: Dropout Training as Adaptive Regularization

Summary and discussion of: Dropout Training as Adaptive Regularization Summary and discussion of: Dropout Training as Adaptive Regularization Statistics Journal Club, 36-825 Kirstin Early and Calvin Murdock November 21, 2014 1 Introduction Multi-layered (i.e. deep) artificial

More information

Neural Nets and Symbolic Reasoning Hopfield Networks

Neural Nets and Symbolic Reasoning Hopfield Networks Neural Nets and Symbolic Reasoning Hopfield Networks Outline The idea of pattern completion The fast dynamics of Hopfield networks Learning with Hopfield networks Emerging properties of Hopfield networks

More information

NEURAL NETWORKS

NEURAL NETWORKS 5 Neural Networks In Chapters 3 and 4 we considered models for regression and classification that comprised linear combinations of fixed basis functions. We saw that such models have useful analytical

More information

Machine Learning for Computer Vision 8. Neural Networks and Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group

Machine Learning for Computer Vision 8. Neural Networks and Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group Machine Learning for Computer Vision 8. Neural Networks and Deep Learning Vladimir Golkov Technical University of Munich Computer Vision Group INTRODUCTION Nonlinear Coordinate Transformation http://cs.stanford.edu/people/karpathy/convnetjs/

More information

Discrete Latent Variable Models

Discrete Latent Variable Models Discrete Latent Variable Models Stefano Ermon, Aditya Grover Stanford University Lecture 14 Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 14 1 / 29 Summary Major themes in the course

More information

Deep Learning: Self-Taught Learning and Deep vs. Shallow Architectures. Lecture 04

Deep Learning: Self-Taught Learning and Deep vs. Shallow Architectures. Lecture 04 Deep Learning: Self-Taught Learning and Deep vs. Shallow Architectures Lecture 04 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Self-Taught Learning 1. Learn

More information

Deep Feedforward Networks. Seung-Hoon Na Chonbuk National University

Deep Feedforward Networks. Seung-Hoon Na Chonbuk National University Deep Feedforward Networks Seung-Hoon Na Chonbuk National University Neural Network: Types Feedforward neural networks (FNN) = Deep feedforward networks = multilayer perceptrons (MLP) No feedback connections

More information

Ferromagnets and the classical Heisenberg model. Kay Kirkpatrick, UIUC

Ferromagnets and the classical Heisenberg model. Kay Kirkpatrick, UIUC Ferromagnets and the classical Heisenberg model Kay Kirkpatrick, UIUC Ferromagnets and the classical Heisenberg model: asymptotics for a mean-field phase transition Kay Kirkpatrick, Urbana-Champaign June

More information

Energy Based Models. Stefano Ermon, Aditya Grover. Stanford University. Lecture 13

Energy Based Models. Stefano Ermon, Aditya Grover. Stanford University. Lecture 13 Energy Based Models Stefano Ermon, Aditya Grover Stanford University Lecture 13 Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 13 1 / 21 Summary Story so far Representation: Latent

More information

Machine Learning

Machine Learning Machine Learning 10-315 Maria Florina Balcan Machine Learning Department Carnegie Mellon University 03/29/2019 Today: Artificial neural networks Backpropagation Reading: Mitchell: Chapter 4 Bishop: Chapter

More information

Markov Chain Monte Carlo The Metropolis-Hastings Algorithm

Markov Chain Monte Carlo The Metropolis-Hastings Algorithm Markov Chain Monte Carlo The Metropolis-Hastings Algorithm Anthony Trubiano April 11th, 2018 1 Introduction Markov Chain Monte Carlo (MCMC) methods are a class of algorithms for sampling from a probability

More information

PAC-learning, VC Dimension and Margin-based Bounds

PAC-learning, VC Dimension and Margin-based Bounds More details: General: http://www.learning-with-kernels.org/ Example of more complex bounds: http://www.research.ibm.com/people/t/tzhang/papers/jmlr02_cover.ps.gz PAC-learning, VC Dimension and Margin-based

More information

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted

More information

Feed-forward Network Functions

Feed-forward Network Functions Feed-forward Network Functions Sargur Srihari Topics 1. Extension of linear models 2. Feed-forward Network Functions 3. Weight-space symmetries 2 Recap of Linear Models Linear Models for Regression, Classification

More information

Based on the original slides of Hung-yi Lee

Based on the original slides of Hung-yi Lee Based on the original slides of Hung-yi Lee Google Trends Deep learning obtains many exciting results. Can contribute to new Smart Services in the Context of the Internet of Things (IoT). IoT Services

More information

Lecture 5: Logistic Regression. Neural Networks

Lecture 5: Logistic Regression. Neural Networks Lecture 5: Logistic Regression. Neural Networks Logistic regression Comparison with generative models Feed-forward neural networks Backpropagation Tricks for training neural networks COMP-652, Lecture

More information

A.I.: Beyond Classical Search

A.I.: Beyond Classical Search A.I.: Beyond Classical Search Random Sampling Trivial Algorithms Generate a state randomly Random Walk Randomly pick a neighbor of the current state Both algorithms asymptotically complete. Overview Previously

More information

Lecture 3 Feedforward Networks and Backpropagation

Lecture 3 Feedforward Networks and Backpropagation Lecture 3 Feedforward Networks and Backpropagation CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago April 3, 2017 Things we will look at today Recap of Logistic Regression

More information

Topics in AI (CPSC 532L): Multimodal Learning with Vision, Language and Sound. Lecture 3: Introduction to Deep Learning (continued)

Topics in AI (CPSC 532L): Multimodal Learning with Vision, Language and Sound. Lecture 3: Introduction to Deep Learning (continued) Topics in AI (CPSC 532L): Multimodal Learning with Vision, Language and Sound Lecture 3: Introduction to Deep Learning (continued) Course Logistics - Update on course registrations - 6 seats left now -

More information