Learning the Number of Neurons in Deep Networks

Size: px

Start display at page:

Download "Learning the Number of Neurons in Deep Networks"

Megan Whitehead
5 years ago
Views:

1 Learning the Number of Jose M. Alvarez 1 Mathieu Salzmanno 2 1 CSIRO,Canberra, ACT 2601, Australia 2 CVLab, EPFL,CH-1015 Lausanne, Switzerland NIPS,2016 Presenter: Arshdeep Sekhon NIPS,2016 Presenter: Arshdeep Sekhon 1

2 Motivation 1 Designing a deep NN architecture 2 Configure: number of layers number of units 3 mostly hand-designed 4 have redundant parameters NIPS,2016 Presenter: Arshdeep Sekhon 2

3 Automatic Model Selection: Constructive Approaches 1 Incrementally add layers/parameters 2 But Shallow networks are less expressive 3 bad initialization when incrementally adding layers NIPS,2016 Presenter: Arshdeep Sekhon 3

4 Automatic Model Selection: Destructive Approaches 1 Very Deep networks more expressive 2 Start from very deep networks, eliminate redundant parameters 3 check influence of every parameter 4 for example, check network Hessian wrt every parameter in an over complete network 5 not scalable to large networks NIPS,2016 Presenter: Arshdeep Sekhon 4

5 Automatic Model Selection 1 Automatically get number of neurons for each layer 2 cancel effects of individual neurons 3 jointly as we learn 4 no preprocessing NIPS,2016 Presenter: Arshdeep Sekhon 5

6 Model Selection 1 Start with an overcomplete network 2 Find neurons for each layer 3 A general deep network: L layers in network architecture N l neurons in each layer 4 weights Θ = [θ l, b l ] for layer l θ l = [θ n l ] 1 l L and 1 n N l 5 The optimization problem: min Θ 1 N N l(y i, f (x i, Θ)) + r(θ) (1) i=1 NIPS,2016 Presenter: Arshdeep Sekhon 6

7 Model Selection: Regularizer 1 The optimization problem: min Θ 1 N N l(y i, f (x i, Θ)) + r(θ) (2) i=1 2 Goal: Cancel entire neurons NIPS,2016 Presenter: Arshdeep Sekhon 7

8 Model Selection: Regularizer 1 The optimization problem: min Θ 1 N N l(y i, f (x i, Θ)) + r(θ) (2) i=1 2 Goal: Cancel entire neurons 3 traditional regularizers: l 1 or l 2 4 cannot cancel entire neurons because they control weights individually. NIPS,2016 Presenter: Arshdeep Sekhon 7

9 Model Selection: Regularizer 1 The optimization problem: min Θ 1 N N l(y i, f (x i, Θ)) + r(θ) (2) i=1 2 Goal: Cancel entire neurons 3 traditional regularizers: l 1 or l 2 4 cannot cancel entire neurons because they control weights individually. 5 Neurons are groups of parameters 6 weights Θ = [θ l, b l ] for layer l θ l = [θ n l ] 1 l L and 1 n N l NIPS,2016 Presenter: Arshdeep Sekhon 7

10 Model Selection: Regularizer 1 The optimization problem: min Θ 1 N N l(y i, f (x i, Θ)) + r(θ) (2) i=1 2 Goal: Cancel entire neurons 3 traditional regularizers: l 1 or l 2 4 cannot cancel entire neurons because they control weights individually. 5 Neurons are groups of parameters 6 weights Θ = [θ l, b l ] for layer l θ l = [θ n l ] 1 l L and 1 n N l 7 Use new regularizer: group sparsity ose M. Alvarez, Mathieu Salzmanno NIPS,2016 Presenter: Arshdeep Sekhon 7

11 Model Selection: Group Sparsity 1 Parameters associated with a neuron are grouped together 2 Penalty on groups of weights instead of individual weights 3 parameters of each neuron in layer l are grouped in a vector of size P l 4 New regularizer: r(θ) = L l=1 N l λ l Pl θn l 2 (3) n=1 5 θ n l are the parameters for neuron n in layer l 6 l 2 norm followed by l 1 norm 7 λ l sets the influence of the penalty. NIPS,2016 Presenter: Arshdeep Sekhon 8

12 Model Selection: Group Sparsity 1 But does not lead to sparsity within a group L N l r(θ) = (1 α)λ l Pl θn l 2 + αλ l θ l 1 (4) l=1 n=1 2 more general penalty that leads to sparsity both at and within group level. NIPS,2016 Presenter: Arshdeep Sekhon 9

13 Training: Proximal Gradient Descent minimize f (x) = g(x) + h(x) (5) proximal gradient algorithm: x k+1 = prox tk h ( ) x k 1 t k ( g(x k 1 )) (6) 1 proximal operator: prox h (x) = arg min h(u) + 1 u 2 x u 2 2 (7) 2 ( x k+1 = arg min h(u) + 1 ) u 2t u x k 1 + t k ( g(x k 1 )) 2 2 (8) NIPS,2016 Presenter: Arshdeep Sekhon 1

14 Training: Proximal Gradient Descent 1 The objective: min Θ 1 N N l(y i, f (x i, Θ)) + r(θ) (9) i=1 2 r(θ) = L N l (1 α)λ l Pl θn l 2 + αλ l θ l 1 (10) l=1 n=1 3 loss function is g(x) and regularizer h(x) in proximal gradient algorithm NIPS,2016 Presenter: Arshdeep Sekhon 1

15 Training: Proximal Gradient Descent 1 Update:Take gradient of loss and apply proximal operator of the regularizer 2 θ n l = arg min θ n l 1 2t θ n l ˆθ n l r(θ) (11) where ˆθ n l is update by gradient of loss function 3 This has a closed form solution: (S(z, τ)) j = sign(z j )( z j τ) + NIPS,2016 Presenter: Arshdeep Sekhon 1

16 Experiments and Model Architectures 1 Dataset: ImageNet, Places Models: 1 VGG-B Net:10 convolutional layers followed by three fully-connected layers 2 DecomposeMe 8 (Dec 8 ): 16 Conv layers with 1D kernels NIPS,2016 Presenter: Arshdeep Sekhon 1

17 Experiments and Model Architectures NIPS,2016 Presenter: Arshdeep Sekhon 1

Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses about the label (Top-5 error) No Bounding Box

ImageNet Classification with Deep Convolutional Neural Networks Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton Motivation Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses