ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference
1 ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference. Neural Networks: A brief touch. Yuejie Chi, Department of Electrical and Computer Engineering. Spring. 1/41
2 Outline: introduction to deep learning; perceptron model (a single neuron); 1-hidden-layer (2-layer) neural network. 2/41
3 The rise of deep neural networks ImageNet Large Scale Visual Recognition Challenge (ILSVRC): Led by Prof. Fei-Fei Li (Stanford). Total number of non-empty synsets (categories): 21841; Total number of images: 14,197,122 3/41
4 The rise of deep neural networks The deeper, the better? 4/41
5 AlexNet Won the 2012 ILSVRC competition by a large margin: top-1 and top-5 error rates of 37.5% and 17.0%. 5/41
6 AlexNet 60 million parameters and 650,000 neurons. takes 5-6 days to train on two GTX 580 3GB GPUs. 5 convolutional layers 3 fully connected layers Softmax output 6/41
7 Rectified linear units (ReLU): y = max(0, x). Faster training with ReLU: compared to tanh and sigmoid, training is much faster, since ReLU doesn't saturate. 7/41
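As a small illustrative sketch (not part of the slides), the saturation difference can be checked numerically: the derivatives of sigmoid and tanh vanish for large |x|, while ReLU keeps a slope of 1 on the positive side.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])

# Derivatives: ReLU has slope 1 for x > 0, while sigmoid'(x) = s(1 - s) and
# tanh'(x) = 1 - tanh(x)^2 both shrink toward 0 as |x| grows (saturation).
print("relu'   :", (x > 0).astype(float))
print("sigmoid':", sigmoid(x) * (1 - sigmoid(x)))
print("tanh'   :", 1.0 - np.tanh(x) ** 2)
```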
8 Reduce overfitting Important to reduce overfitting since: number of training data ≪ number of parameters. Data augmentation: apply label-invariant transforms. 8/41
10 Reduce overfitting Important to reduce overfitting since: number of training data ≪ number of parameters. Dropout. 9/41
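A minimal sketch of (inverted) dropout applied to a layer's activations, assuming a keep probability p_keep; this is an illustration of the idea, not AlexNet's exact recipe.

```python
import numpy as np

def dropout(h, p_keep=0.5, train=True, seed=None):
    """Inverted dropout: zero each unit with prob. 1 - p_keep, rescale survivors by 1/p_keep."""
    if not train:
        return h                          # at test time all units are kept; no rescaling needed
    rng = np.random.default_rng(seed)
    mask = rng.random(h.shape) < p_keep
    return h * mask / p_keep

h = np.ones((2, 6))
print(dropout(h, p_keep=0.5, seed=0))     # roughly half the activations zeroed, the rest doubled
```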
11 Reduce overfitting Important to reduce overfitting since: number of training data ≪ number of parameters. Other ways of implicit regularization: early stopping, weight decay (ridge regression), ... 10/41
12 Learned hierarchical representations Learned representations using a CNN trained on ImageNet. Figure credit: Y. LeCun's slides, with research credit to Zeiler and Fergus. 11/41
13 single-layer networks (perceptron) 12/41
14 Perceptron Input x = [x_1, ..., x_d]^T ∈ R^d, weight w = [w_1, ..., w_d]^T ∈ R^d, output y ∈ R: y = σ(w^T x) = σ( Σ_{i=1}^d w_i x_i ), where σ(·) is a nonlinear activation function, e.g. σ(z) = sign(z) (hard thresholding) or σ(z) = sigmoid(z) = 1/(1 + e^{−z}) (soft thresholding). (Figure: a single neuron with inputs x_1, x_2, x_3 and weights w_1, w_2, w_3.) Decision making at test stage: given a test sample x, calculate y. 13/41
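A minimal sketch of this forward pass (the function names are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def perceptron_forward(w, x, activation=sigmoid):
    """Single neuron: y = sigma(w^T x) for a chosen activation sigma."""
    return activation(np.dot(w, x))

w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.5, 0.25])
print(perceptron_forward(w, x))    # soft-thresholded output in (0, 1)
print(np.sign(np.dot(w, x)))       # hard-thresholding alternative, sigma(z) = sign(z)
```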
15 Nonlinear activation Nonlinear activation is critical for complex decision boundaries. 14/41
16 Training of a perceptron Empirical risk minimization: given training data {x_i, y_i}_{i=1}^n, find the weight vector w: ŵ = argmin_{w ∈ R^d} (1/n) Σ_{i=1}^n ℓ(w; x_i, y_i), i.e. find the weight parameter w that best fits the data; popular choices for the loss function: quadratic, cross entropy, hinge, etc. For the quadratic loss, ℓ(w; x_i, y_i) = ( y_i − σ(w^T x_i) )^2. We'll use the quadratic loss and sigmoid activation as an example... 15/41
17 Training via (stochastic) gradient descent ŵ = argmin_{w ∈ R^d} (1/(2n)) Σ_{i=1}^n ( y_i − σ(w^T x_i) )^2 := argmin_{w ∈ R^d} ℓ_n(w). Gradient descent: w_{t+1} = w_t − η_t ∇ℓ_n(w_t), where η_t is the step size or learning rate. 16/41
18 Training via (stochastic) gradient descent The gradient can be calculated via the chain rule. Write ŷ_i = ŷ_i(w) = σ(w^T x_i); then d/dw [ (1/2)(y_i − ŷ_i)^2 ] = (ŷ_i − y_i) dŷ_i(w)/dw = (ŷ_i − y_i) σ'(w^T x_i) x_i = (ŷ_i − y_i) ŷ_i (1 − ŷ_i) x_i, where the factor (ŷ_i − y_i) ŷ_i (1 − ŷ_i) is a scalar and we used σ'(z) = σ(z)(1 − σ(z)). This is called the delta rule. 16/41
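A sketch of this gradient computation for the quadratic loss with sigmoid activation, vectorized over the whole training set (array names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grad(w, X, y):
    """Empirical risk (1/2n) sum_i (y_i - sigmoid(w^T x_i))^2 and its gradient via the delta rule."""
    y_hat = sigmoid(X @ w)                     # predictions, shape (n,)
    loss = 0.5 * np.mean((y - y_hat) ** 2)
    delta = (y_hat - y) * y_hat * (1 - y_hat)  # per-sample scalar factor
    grad = X.T @ delta / len(y)                # (1/n) sum_i delta_i * x_i
    return loss, grad
```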
19 Stochastic gradient descent Stochastic gradient descent uses only a mini-batch of data at every iteration. At every iteration t, (1) draw a mini-batch of data indexed by S_t ⊆ {1, ..., n}; (2) update w_{t+1} = w_t − η_t Σ_{i ∈ S_t} ∇ℓ(w_t; x_i, y_i). 17/41
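Continuing the sketch above, a minimal mini-batch SGD loop might look as follows (batch size, step size, and iteration count are arbitrary illustrative choices):

```python
import numpy as np

def sgd(X, y, grad_fn, eta=0.1, batch_size=32, num_iters=1000, seed=0):
    """Mini-batch SGD: each step updates w with the gradient computed on a random batch S_t."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(num_iters):
        idx = rng.choice(n, size=min(batch_size, n), replace=False)  # the mini-batch S_t
        _, g = grad_fn(w, X[idx], y[idx])   # e.g. the loss_and_grad sketch above
        w = w - eta * g
    return w
```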
20 Detour: Backpropagation Backpropagation is the basic algorithm to train neural networks, rediscovered several times in the literature before being popularized by the 1986 paper of Rumelhart, Hinton, and Williams. Assuming node operations take unit time, backpropagation takes linear time, specifically O(Network Size) = O(V + E), to compute the gradient, where V is the number of vertices and E is the number of edges in the neural network. Main idea: the chain rule from calculus, d/dx f(g(x)) = f'(g(x)) g'(x). Let's illustrate the process with a single-output, 2-layer NN. 18/41
21 Derivations of backpropagation (Figure: input layer x_j, hidden layer with weights w_{m,j}, output layer.) Network output: ŷ = σ( Σ_m v_m h_m ) = σ( Σ_m v_m σ( Σ_j w_{m,j} x_j ) ), where h_m = σ( Σ_j w_{m,j} x_j ) is the m-th hidden unit. Loss function: f = (1/2)(y − ŷ)^2. 19/41
22 Backpropagation I Optimize the weights for each layer, starting with the layer closest to the outputs and working back to the layer closest to the inputs. (1) To update the v_m's, realize df/dv_m = (df/dŷ)(dŷ/dv_m) = (ŷ − y) dŷ/dv_m = (ŷ − y) σ'( Σ_m v_m h_m ) h_m = (ŷ − y) ŷ (1 − ŷ) h_m = δ h_m, where δ := (ŷ − y) ŷ (1 − ŷ). This is the same as updating the perceptron. 20/41
23 Backpropagation II (2) To update the w_{m,j}'s, realize df/dw_{m,j} = (df/dŷ)(dŷ/dh_m)(dh_m/dw_{m,j}) = (ŷ − y) ŷ (1 − ŷ) v_m h_m (1 − h_m) x_j = δ v_m h_m (1 − h_m) x_j. 21/41
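A sketch of both gradient computations for this single-output, 2-layer network, for one training sample and all-sigmoid activations (array names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_2layer(W, v, x, y):
    """Gradients of f = 0.5 * (y - y_hat)^2 w.r.t. hidden weights W (m-by-d) and output weights v."""
    # Forward pass
    h = sigmoid(W @ x)               # hidden activations h_m
    y_hat = sigmoid(v @ h)           # network output
    # Backward pass
    delta = (y_hat - y) * y_hat * (1 - y_hat)       # delta = (y_hat - y) * y_hat * (1 - y_hat)
    grad_v = delta * h                              # df/dv_m     = delta * h_m
    grad_W = np.outer(delta * v * h * (1 - h), x)   # df/dw_{m,j} = delta * v_m * h_m (1 - h_m) * x_j
    return grad_W, grad_v
```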
24 Questions we may ask (in general) Representation: how well can a given network (fixed activation) approximate / explain the training data? Generalization: how well does the learned ŵ behave in prediction during testing? Optimization: how does the output of (S)GD, w_t, relate to ŵ? (Or really, we should plug w_t into the previous two questions!) 22/41
25 Nonconvex landscape of perceptron can be very bad SGD converges to local minimizers. Are they global? Theorem 12.1 (Auer et al., 1995): Let σ(·) be the sigmoid and ℓ(·) be the quadratic loss function. There exists a sequence of training samples {x_i, y_i}_{i=1}^n such that ℓ_n(w) has ⌊n/d⌋^d distinct local minima. Consequence: there may exist exponentially many bad local minima with arbitrary data! Curse of dimensionality. 23/41
27 Why? Saturation of the sigmoid. (Figure: per-sample losses ℓ(w; 10, 0.55) and ℓ(w; 0.7, 0.25).) Each sample produces a local min plus flat surfaces away from the minimizer; if the local min of sample A falls into the flat region of sample B (and vice versa), the sum of the sample losses preserves both minima. 24/41
29 Why? We get one local minimum per sample in 1D. (Figure: a 1D loss surface with several local minima.) Curse of dimensionality: the samples are constructed to get ⌊n/d⌋^d distinct local minima in d dimensions. 25/41
30 Statistical models come to the rescue Data/measurements follow certain statistical models and hence are not worst-case instances. minimize_w ℓ_n(w) = (1/n) Σ_{i=1}^n ℓ(w; x_i, y_i); as n → ∞, ℓ_n(w) → E[ℓ(w; x, y)] := ℓ(w), i.e. the empirical risk approaches the population risk. (Figure: contours of the empirical risk, with minimizer θ̂_n = [0.816, 0.268], vs. the population risk, with minimizer θ_0 = [1, 0]. Figure credit: Mei, Bai and Montanari.) 26/41
33 Statistical models of training data Assume the training data {x_i, y_i}_{i=1}^n is i.i.d. drawn from some distribution: (x, y) ~ p(x, y). We are using neural networks to fit p(x, y). A planted-truth model: let x ~ N(0, I) and draw the label y as follows. Regression model: y_i = σ(w_*^T x_i). Classification model: y_i ∈ {0, 1}, where P(y_i = 1) = σ(w_*^T x_i). Parameter recovery: can we recover the planted w_* using {x_i, y_i}_{i=1}^n? 27/41
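A sketch of drawing data from this planted-truth model (the ground-truth vector w_star and the sizes are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def planted_data(n, d, w_star, classification=False, seed=0):
    """Draw x_i ~ N(0, I) and labels from the planted model with parameter w_*."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    p = sigmoid(X @ w_star)
    if classification:
        y = rng.binomial(1, p).astype(float)   # P(y_i = 1) = sigmoid(w_*^T x_i)
    else:
        y = p                                  # regression model y_i = sigmoid(w_*^T x_i)
    return X, y

X, y = planted_data(n=1000, d=2, w_star=np.array([1.0, 0.0]))
```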
34 Roadmap Step 1: verify the landscape properties of the population loss; Step 2: translate properties of the population loss to the empirical loss; Step 3: argue that ŵ (minimizer of the empirical loss) is close to w_* (minimizer of the population loss). 28/41
35 Step 1: population risk (Figure: contours of the population risk.) w_* is the unique local minimizer, and it is also global. No bad local minima! Strongly convex near the global optimum; large gradient elsewhere. 29/41
36 Nonconvex landscape: from population to empirical Figure credit: Mei, Bai and Montanari 30/41
37 Step 2: uniform convergence of gradients & Hessians Theorem 12.2 (Mei, Bai, and Montanari, 2017): Under suitable assumptions, for any δ > 0, there exists a positive constant C depending on (R, δ) but independent of n and d, such that as long as n ≥ C d log d, we have (1) preservation of the gradient: P( sup_{‖w‖ ≤ R} ‖∇ℓ_n(w) − ∇ℓ(w)‖_2 ≤ C sqrt(d log n / n) ) ≥ 1 − δ; (2) preservation of the Hessian: P( sup_{‖w‖ ≤ R} ‖∇²ℓ_n(w) − ∇²ℓ(w)‖ ≤ C sqrt(d log n / n) ) ≥ 1 − δ. 31/41
38 Step 3: establish rate of convergence for ERM By the mean-value theorem, there exists some w̃ between ŵ and w_* such that ℓ_n(ŵ) = ℓ_n(w_*) + ⟨∇ℓ_n(w_*), ŵ − w_*⟩ + (1/2)(ŵ − w_*)^T ∇²ℓ_n(w̃)(ŵ − w_*) ≤ ℓ_n(w_*), where the last inequality follows from the optimality of ŵ. Then (1/2) λ_min(∇²ℓ_n(w̃)) ‖ŵ − w_*‖^2 ≤ (1/2)(ŵ − w_*)^T ∇²ℓ_n(w̃)(ŵ − w_*) ≤ −⟨∇ℓ_n(w_*), ŵ − w_*⟩ ≤ ‖∇ℓ_n(w_*)‖ ‖ŵ − w_*‖, and therefore ‖ŵ − w_*‖ ≤ 2 ‖∇ℓ_n(w_*)‖ / λ_min(∇²ℓ_n(w̃)) ≲ sqrt(d log n / n). 32/41
39 two-layer networks 33/41
40 Two-layer linear network Given arbitrary data {x_i, y_i}_{i=1}^n, x_i, y_i ∈ R^d, fit the two-layer linear network with quadratic loss: f(A, B) = Σ_{i=1}^n ‖y_i − A B x_i‖_2^2, where B ∈ R^{p×d}, A ∈ R^{d×p}, and p ≤ d. (Figure: input layer x → hidden layer (B) → output layer (A).) Special case: auto-association (auto-encoding, identity mapping), where y_i = x_i, e.g. for image compression. 34/41
41 Landscape of 2-layer linear network Bears some similarity with the nonconvex matrix factorization problem; lacks identifiability: for any invertible C, AB = (AC)(C^{-1}B). Define X = [x_1, x_2, ..., x_n] and Y = [y_1, y_2, ..., y_n]; then f(A, B) = ‖Y − ABX‖_F^2. Let Σ_XX = Σ_{i=1}^n x_i x_i^T = XX^T, Σ_YY = YY^T, Σ_XY = XY^T, and Σ_YX = YX^T. When X = Y, any optimum satisfies AB = UU^T, where U holds the top p eigenvectors of Σ_XX. 35/41
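A small numerical sketch of the auto-association case (X = Y): with a p-dimensional bottleneck, a global optimum projects onto the span of the top p eigenvectors of Σ_XX, and the residual equals the sum of the remaining eigenvalues. The optimum here is computed directly from the eigendecomposition rather than by training:

```python
import numpy as np

rng = np.random.default_rng(0)
d, p, n = 5, 2, 1000
# Data with decaying variance along the coordinates
X = rng.standard_normal((d, n)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])[:, None]

Sigma_XX = X @ X.T
eigvals, eigvecs = np.linalg.eigh(Sigma_XX)     # eigenvalues in ascending order
U = eigvecs[:, ::-1][:, :p]                     # columns: top-p eigenvectors

# One global optimum of f(A, B) = ||X - A B X||_F^2: A B = U U^T
# (any pair (A C, C^{-1} B) with invertible C is an equivalent optimum).
A, B = U, U.T
residual = np.linalg.norm(X - A @ B @ X, 'fro') ** 2
print(residual, eigvals[:d - p].sum())          # the two numbers should match
```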
42 Landscape of 2-layer linear network Theorem 12.3 (Baldi and Hornik, 1989): Suppose X is full rank, hence Σ_XX is invertible. Further assume Σ := Σ_YX Σ_XX^{-1} Σ_XY is full rank with d distinct eigenvalues λ_1 > ⋯ > λ_d > 0. Then f(A, B) has no spurious local minima, except for equivalent versions of the global minimum due to invertible transformations. No bad local min! Generalizable to multi-layer linear networks. 36/41
43 Critical points of two-layer linear networks Lemma 12.4 (critical points): Any critical point satisfies AB = P_A Σ_YX Σ_XX^{-1}, where A satisfies P_A Σ = P_A Σ P_A = Σ P_A, and P_A is the ortho-projector onto the column span of A. 37/41
44 Critical points of two-layer linear networks Let the EVD of Σ = Σ_YX Σ_XX^{-1} Σ_XY be Σ = UΛU^T. Lemma 12.5 (critical points): At any critical point, A can be written in the form A = [U_J, 0_{d×(p−r)}] C, where rank(A) = r ≤ p, J ⊆ {1, ..., d}, |J| = r, and C is invertible. Correspondingly, B = C^{-1} [ U_J^T Σ_YX Σ_XX^{-1} ; last p − r rows of L ], where L is any d × d matrix. Verify that (A, B) is a global optimum if and only if J = {1, ..., p}. 38/41
45 Two-layer nonlinear network Local strong convexity under the Gaussian model [Fu et al., 2018]. (Figure: input layer x → hidden layer (W) → output layer y.) 39/41
46 Reference I [1] ImageNet. [2] Deep learning, LeCun, Bengio, and Hinton, Nature. [3] ImageNet classification with deep convolutional neural networks, Krizhevsky et al., NIPS. [4] Learning representations by back-propagating errors, Rumelhart, Hinton, and Williams, Nature. [5] Dropout: a simple way to prevent neural networks from overfitting, Srivastava et al., JMLR. [6] Dropout training as adaptive regularization, Wager et al., NIPS. [7] On the importance of initialization and momentum in deep learning, Sutskever et al., ICML. [8] Exponentially many local minima for single neurons, P. Auer, M. Herbster and K. Warmuth, NIPS. 40/41
47 Reference II [9] The landscape of empirical risk for non-convex losses, S. Mei, Y. Bai, and A. Montanari, arXiv preprint. [10] Neural networks and principal component analysis: learning from examples without local minima, P. Baldi and K. Hornik, Neural Networks. [11] Local geometry of one-hidden-layer neural networks for logistic regression, Fu, Chi, and Liang, arXiv preprint. 41/41