ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference

Neural Networks: A Brief Touch

Yuejie Chi, Department of Electrical and Computer Engineering
Spring 2018
Outline

- introduction to deep learning
- perceptron model (a single neuron)
- 1-hidden-layer (2-layer) neural network
The rise of deep neural networks

ImageNet Large Scale Visual Recognition Challenge (ILSVRC), led by Prof. Fei-Fei Li (Stanford):
- total number of non-empty synsets (categories): 21,841
- total number of images: 14,197,122
The rise of deep neural networks

The deeper, the better?
AlexNet

Won the 2012 ILSVRC competition by a large margin: top-1 and top-5 error rates of 37.5% and 17.0%.
AlexNet

- 60 million parameters and 650,000 neurons
- takes 5-6 days to train on two GTX 580 3GB GPUs
- 5 convolutional layers, 3 fully connected layers, and a softmax output
Rectified linear units (ReLU): faster training with ReLU

$$y = \max(0, x)$$

Compared to tanh and sigmoid, training is much faster: ReLU doesn't saturate.

[Figure: ReLU, tanh, and sigmoid activation curves plotted on $[-3, 3]$.]
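A quick numerical illustration (a sketch, not from the slides): the derivatives of sigmoid and tanh vanish for large inputs, while the ReLU derivative stays at 1 on the positive side, which is why gradients keep flowing during training.

```python
# Sketch: compare activation derivatives at a few points;
# sigmoid' and tanh' vanish for large inputs, ReLU' does not.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in [0.0, 2.0, 5.0]:
    d_sigmoid = sigmoid(x) * (1.0 - sigmoid(x))  # sigma'(x) = sigma(x)(1 - sigma(x))
    d_tanh = 1.0 - np.tanh(x) ** 2               # tanh'(x) = 1 - tanh(x)^2
    d_relu = 1.0 if x > 0 else 0.0               # d/dx max(0, x) = 1 for x > 0
    print(f"x={x}: sigmoid'={d_sigmoid:.4f}  tanh'={d_tanh:.4f}  ReLU'={d_relu}")
```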
Reduce overfitting

Important to reduce overfitting since the number of training data $\ll$ the number of parameters. Common remedies (a dropout sketch follows the list):
- data augmentation: apply label-invariant transforms;
- dropout;
- other ways of implicit regularization: early stopping, weight decay (ridge regression), ...
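A minimal sketch of dropout at training time, assuming the "inverted dropout" convention (rescale by $1/p$ during training; the original AlexNet recipe instead scaled outputs at test time):

```python
# Minimal inverted-dropout sketch; keep probability p is an assumed parameter.
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p=0.5, train=True):
    """Zero each unit independently with probability 1 - p and rescale the
    survivors by 1/p, so the expected activation matches test time."""
    if not train:
        return h            # test time: identity map
    mask = rng.random(h.shape) < p
    return h * mask / p

h = rng.standard_normal(8)
print(dropout(h))           # roughly half of the units are zeroed out
```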
Learned hierarchical representations

Learned representations using a CNN trained on ImageNet. [Figure credit: Y. LeCun's slides; research credit: Zeiler and Fergus, 2013.]
Single-layer networks (perceptron)
Perceptron

Input $x = [x_1, \ldots, x_d]^\top \in \mathbb{R}^d$, weight $w = [w_1, \ldots, w_d]^\top \in \mathbb{R}^d$, output $y \in \mathbb{R}$:
$$y = \sigma(w^\top x) = \sigma\Big(\sum_{i=1}^d w_i x_i\Big)$$
where $\sigma(\cdot)$ is a nonlinear activation function, e.g. $\sigma(z) = \mathrm{sign}(z)$ (hard thresholding) or $\sigma(z) = \mathrm{sigmoid}(z) = \frac{1}{1 + e^{-z}}$ (soft thresholding).

[Figure: a perceptron with inputs $x_1, x_2, x_3$ and weights $w_1, w_2, w_3$.]

Decision making at test stage: given a test sample $x$, calculate $y$.
Nonlinear activation

Nonlinear activation is critical for producing complex decision boundaries.
Training of a perceptron

Empirical risk minimization: given training data $\{x_i, y_i\}_{i=1}^n$, find the weight vector $w$:
$$\hat{w} = \arg\min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n \ell(w; x_i, y_i)$$
- find the weight parameter $w$ that best fits the data;
- popular choices for the loss function: quadratic, cross entropy, hinge, etc. For the quadratic loss,
$$\ell(w; x_i, y_i) = \big(y_i - \sigma(w^\top x_i)\big)^2$$

We'll use the quadratic loss and sigmoid activation as a running example.
Training via (stochastic) gradient descent

$$\hat{w} = \arg\min_{w \in \mathbb{R}^d} \frac{1}{2n} \sum_{i=1}^n \big(y_i - \sigma(w^\top x_i)\big)^2 := \arg\min_{w \in \mathbb{R}^d} \ell_n(w)$$

Gradient descent:
$$w_{t+1} = w_t - \eta_t \nabla \ell_n(w_t)$$
where $\eta_t$ is the step size or learning rate.

The gradient can be calculated via the chain rule. Writing $\hat{y}_i = \hat{y}_i(w) = \sigma(w^\top x_i)$,
$$\frac{d}{dw}\, \frac{1}{2}(y_i - \hat{y}_i)^2 = (\hat{y}_i - y_i)\frac{d \hat{y}_i(w)}{dw} = (\hat{y}_i - y_i)\, \sigma'(w^\top x_i)\, x_i = \underbrace{(\hat{y}_i - y_i)\, \hat{y}_i (1 - \hat{y}_i)}_{\text{scalar}}\, x_i$$
where we used $\sigma'(z) = \sigma(z)(1 - \sigma(z))$. This is called the delta rule.
Stochastic gradient descent

Stochastic gradient descent uses only a mini-batch of data at every iteration. At every iteration $t$:
1. Draw a mini-batch of data indexed by $S_t \subseteq \{1, \cdots, n\}$;
2. Update
$$w_{t+1} = w_t - \eta_t \sum_{i \in S_t} \nabla \ell(w_t; x_i, y_i)$$
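Putting the pieces together, here is a minimal NumPy sketch of mini-batch SGD with the delta rule for the sigmoid perceptron under the quadratic loss; the synthetic planted data, batch size, step size, and iteration count are illustrative choices, not from the slides.

```python
# Mini-batch SGD with the delta rule for a sigmoid perceptron.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n, d = 1000, 5
w_star = rng.standard_normal(d)          # planted weights (illustrative)
X = rng.standard_normal((n, d))
y = sigmoid(X @ w_star)                  # noiseless labels y_i = sigma(w*^T x_i)

w, eta, batch = np.zeros(d), 0.5, 16
for t in range(5000):
    S = rng.choice(n, size=batch, replace=False)   # mini-batch S_t
    y_hat = sigmoid(X[S] @ w)
    # delta rule: grad of (1/2)(y_i - yhat_i)^2 is (yhat_i - y_i) yhat_i (1 - yhat_i) x_i
    grad = ((y_hat - y[S]) * y_hat * (1.0 - y_hat)) @ X[S] / batch
    w -= eta * grad

print(np.linalg.norm(w - w_star))        # should be close to 0
```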
Detour: backpropagation

Backpropagation is the basic algorithm to train neural networks, rediscovered several times in the literature in the 1970-80s, but popularized by the 1986 paper by Rumelhart, Hinton, and Williams.

Assuming node operations take unit time, backpropagation takes linear time, specifically $O(\text{Network Size}) = O(V + E)$, to compute the gradient, where $V$ is the number of vertices and $E$ is the number of edges in the neural network.

Main idea: the chain rule from calculus,
$$\frac{d}{dx} f(g(x)) = f'(g(x))\, g'(x).$$

Let's illustrate the process with a single-output, 2-layer NN.
Derivation of backpropagation

[Figure: a 2-layer network with input layer $x_j$, hidden layer, and output layer; hidden weights $w_{m,j}$, output weights $v_m$.]

Network output:
$$\hat{y} = \sigma\Big(\sum_m v_m h_m\Big) = \sigma\Big(\sum_m v_m\, \sigma\Big(\sum_j w_{m,j} x_j\Big)\Big)$$
where $h_m = \sigma\big(\sum_j w_{m,j} x_j\big)$ is the output of the $m$-th hidden unit.

Loss function: $f = \frac{1}{2}(y - \hat{y})^2$.
Backpropagation I

Optimize the weights for each layer, starting with the layer closest to the outputs and working back to the layer closest to the inputs.

1. To update the $v_m$'s, realize
$$\frac{df}{dv_m} = \frac{df}{d\hat{y}} \frac{d\hat{y}}{dv_m} = (\hat{y} - y)\frac{d\hat{y}}{dv_m} = (\hat{y} - y)\, \sigma'\Big(\sum_m v_m h_m\Big) h_m = \underbrace{(\hat{y} - y)\, \hat{y}(1 - \hat{y})}_{\delta}\, h_m.$$
This is the same as updating the perceptron.
Backpropagation II

2. To update the $w_{m,j}$'s, realize
$$\frac{df}{dw_{m,j}} = \frac{df}{d\hat{y}} \frac{d\hat{y}}{dh_m} \frac{dh_m}{dw_{m,j}} = (\hat{y} - y)\, \hat{y}(1 - \hat{y})\, v_m\, h_m (1 - h_m)\, x_j = \delta\, v_m\, h_m (1 - h_m)\, x_j$$
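The two expressions translate directly into code. Below is a sketch of one forward/backward pass for the single-output, 2-layer sigmoid network; the function name and toy dimensions are ours.

```python
# Backprop for the single-output 2-layer sigmoid network; the two gradients
# mirror the delta-rule expressions derived above.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, W, v):
    """One forward/backward pass. W is m-by-d (hidden weights w_{m,j}),
    v has length m (output weights v_m); returns (df/dW, df/dv)."""
    h = sigmoid(W @ x)                              # hidden activations h_m
    y_hat = sigmoid(v @ h)                          # network output
    delta = (y_hat - y) * y_hat * (1 - y_hat)       # the scalar delta above
    grad_v = delta * h                              # df/dv_m = delta h_m
    grad_W = np.outer(delta * v * h * (1 - h), x)   # df/dw_{m,j} = delta v_m h_m (1 - h_m) x_j
    return grad_W, grad_v

# toy usage: one gradient step on a single sample
rng = np.random.default_rng(0)
d, m, eta = 4, 3, 0.1
W, v = rng.standard_normal((m, d)), rng.standard_normal(m)
x, y = rng.standard_normal(d), 1.0
gW, gv = backprop(x, y, W, v)
W, v = W - eta * gW, v - eta * gv
```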
Questions we may ask (in general)

- Representation: how well can a given network (with fixed activation) approximate/explain the training data?
- Generalization: how well does the learned $\hat{w}$ behave in prediction during testing?
- Optimization: how does the output of (S)GD, $w_t$, relate to $\hat{w}$? (Or rather, we should plug $w_t$ into the previous two questions!)
Nonconvex landscape of the perceptron can be very bad

SGD converges to local minimizers. Are they global?

Theorem 12.1 (Auer et al., 1995)
Let $\sigma(\cdot)$ be the sigmoid and $\ell(\cdot)$ be the quadratic loss function. There exists a sequence of training samples $\{x_i, y_i\}_{i=1}^n$ such that $\ell_n(w)$ has $\lfloor n/d \rfloor^d$ distinct local minima.

Consequence: there may exist exponentially many bad local minima under worst-case data! Curse of dimensionality.
Why? Saturation of the sigmoid

[Figure: two single-sample losses in 1-D, $\ell(w; 10, 0.55)$ and $\ell(w; 0.7, 0.25)$, each with one local minimum and flat regions away from it.]

- each sample produces a local min + flat surfaces away from the minimizer;
- if the local min of sample A falls into the flat region of sample B (and vice versa), the sum of the sample losses preserves both minima.
Why?

We get one local minimum per sample in 1-D. [Figure: the summed loss over several samples, with its local minima marked.]

Curse of dimensionality: the samples can be constructed to yield $\lfloor n/d \rfloor^d$ distinct local minima in $d$ dimensions. A numerical sketch of the 1-D picture follows.
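As a sanity check, a short sketch that reproduces the 1-D picture numerically, using the two samples from the figure above; the grid range and resolution are arbitrary choices.

```python
# Each sample's minimizer survives in the summed loss because it lands in
# the other sample's flat (saturated) region; expect 2 local minima.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

samples = [(10.0, 0.55), (0.7, 0.25)]   # (x_i, y_i) from the figure
w = np.linspace(-10, 10, 20001)
loss = sum(0.5 * (y - sigmoid(w * x)) ** 2 for x, y in samples)

# count strict interior local minima on the grid
is_min = (loss[1:-1] < loss[:-2]) & (loss[1:-1] < loss[2:])
print(int(is_min.sum()), w[1:-1][is_min])   # 2 distinct local minima
```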
Statistical models come to the rescue

Data/measurements follow certain statistical models and hence are not worst-case instances:
$$\text{minimize}_w \quad \ell_n(w) = \frac{1}{n} \sum_{i=1}^n \ell(w; x_i, y_i) \;\xrightarrow{n \to \infty}\; \mathbb{E}[\ell(w; x, y)] := \ell(w)$$

[Figure: empirical risk vs. population risk landscapes, with $\theta_0 = [1, 0]$ and $\hat{\theta}_n = [0.816, 0.268]$. Figure credit: Mei, Bai and Montanari.]
Statistical models of training data

Assume the training data $\{x_i, y_i\}_{i=1}^n$ are i.i.d. draws from some distribution:
$$(x, y) \sim p(x, y)$$
We are using neural networks to fit $p(x, y)$.

A planted-truth model: let $x \sim \mathcal{N}(0, I)$ and draw the label $y$ as
- regression model: $y_i = \sigma(w^{\star\top} x_i)$;
- classification model: $y_i \in \{0, 1\}$, where $\mathbb{P}(y_i = 1) = \sigma(w^{\star\top} x_i)$.

Parameter recovery: can we recover $w^\star$ using $\{x_i, y_i\}_{i=1}^n$?
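A sketch of sampling from this planted-truth model (the helper name `planted_data` is ours, not from the slides):

```python
# Generate (X, y) from the planted-truth model with x ~ N(0, I_d).
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def planted_data(w_star, n, model="regression"):
    d = w_star.shape[0]
    X = rng.standard_normal((n, d))              # x_i ~ N(0, I)
    p = sigmoid(X @ w_star)
    if model == "regression":
        y = p                                    # y_i = sigma(w*^T x_i)
    else:                                        # classification
        y = (rng.random(n) < p).astype(float)    # P(y_i = 1) = sigma(w*^T x_i)
    return X, y

X, y = planted_data(np.array([1.0, -2.0, 0.5]), n=100, model="classification")
```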
Roadmap

1. Step 1: verify the landscape properties of the population loss;
2. Step 2: translate properties of the population loss to the empirical loss;
3. Step 3: argue that $\hat{w}$ (minimizer of the empirical loss) is close to $w^\star$ (minimizer of the population loss).
Step 1: population risk

[Figure: contour plot of the population risk, with $w^\star$ marked.]

$w^\star$ is the unique local minimizer, and it is also global. No bad local minima!
- strongly convex near the global optimum;
- large gradient elsewhere.
Nonconvex landscape: from population to empirical

[Figure: the empirical landscape inherits the benign geometry of the population landscape. Figure credit: Mei, Bai and Montanari.]
Step 2: uniform convergence of gradients and Hessians

Theorem 12.2 (Mei, Bai, and Montanari, 2017)
Under suitable assumptions, for any $\delta > 0$, there exists a positive constant $C$ depending on $(R, \delta)$ but independent of $n$ and $d$, such that as long as $n \geq C d \log d$, we have

1. preservation of gradients:
$$\mathbb{P}\left( \sup_{\|w\|_2 \leq R} \big\| \nabla \ell_n(w) - \nabla \ell(w) \big\|_2 \leq C \sqrt{\frac{d \log n}{n}} \right) \geq 1 - \delta;$$

2. preservation of Hessians:
$$\mathbb{P}\left( \sup_{\|w\|_2 \leq R} \big\| \nabla^2 \ell_n(w) - \nabla^2 \ell(w) \big\| \leq C \sqrt{\frac{d \log n}{n}} \right) \geq 1 - \delta.$$
Step 3: establish rate of convergence for ERM

By the mean value theorem, there exists some $\tilde{w}$ between $\hat{w}$ and $w^\star$ such that
$$\ell_n(\hat{w}) = \ell_n(w^\star) + \langle \nabla \ell_n(w^\star), \hat{w} - w^\star \rangle + \frac{1}{2} (\hat{w} - w^\star)^\top \nabla^2 \ell_n(\tilde{w}) (\hat{w} - w^\star) \leq \ell_n(w^\star),$$
where the last inequality follows from the optimality of $\hat{w}$. Then
$$\frac{1}{2} \lambda_{\min}\big(\nabla^2 \ell_n(\tilde{w})\big) \|\hat{w} - w^\star\|_2^2 \leq \frac{1}{2} (\hat{w} - w^\star)^\top \nabla^2 \ell_n(\tilde{w}) (\hat{w} - w^\star) \leq -\langle \nabla \ell_n(w^\star), \hat{w} - w^\star \rangle \leq \|\nabla \ell_n(w^\star)\|_2 \, \|\hat{w} - w^\star\|_2,$$
and therefore
$$\|\hat{w} - w^\star\|_2 \leq \frac{2\,\|\nabla \ell_n(w^\star)\|_2}{\lambda_{\min}\big(\nabla^2 \ell_n(\tilde{w})\big)} \lesssim C\sqrt{\frac{d \log n}{n}}.$$
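A rough empirical check of this rate, as a sketch: we add a small amount of label noise (an assumption on top of the noiseless regression model above, without which $\hat{w} = w^\star$ exactly) and fit by full-batch gradient descent for growing $n$; all constants below are illustrative.

```python
# Compare ||w_hat - w_star|| against sqrt(d log n / n) as n grows.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d = 5
w_star = rng.standard_normal(d) / np.sqrt(d)     # keep ||w*|| moderate

for n in [200, 2000, 20000]:
    X = rng.standard_normal((n, d))
    y = sigmoid(X @ w_star) + 0.05 * rng.standard_normal(n)  # assumed label noise
    w = np.zeros(d)
    for _ in range(3000):                        # full-batch GD on l_n
        y_hat = sigmoid(X @ w)
        w -= 2.0 * ((y_hat - y) * y_hat * (1 - y_hat)) @ X / n
    print(n, np.linalg.norm(w - w_star), np.sqrt(d * np.log(n) / n))
```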
Two-layer networks
Two-layer linear network

Given arbitrary data $\{x_i, y_i\}_{i=1}^n$, $x_i, y_i \in \mathbb{R}^d$, fit the two-layer linear network with the quadratic loss:
$$f(A, B) = \sum_{i=1}^n \|y_i - A B x_i\|_2^2$$
where $B \in \mathbb{R}^{p \times d}$, $A \in \mathbb{R}^{d \times p}$, and $p \leq d$.

[Figure: linear network mapping the input $x$ through $B$ to a $p$-dimensional hidden layer, then through $A$ to the output.]

Special case: auto-association (auto-encoding, identity mapping), where $y_i = x_i$, e.g. for image compression.
Landscape of the 2-layer linear network

- bears some similarity to the nonconvex matrix factorization problem;
- lacks identifiability: for any invertible $C$, $AB = (AC)(C^{-1}B)$.

Define $X = [x_1, x_2, \cdots, x_n]$ and $Y = [y_1, y_2, \cdots, y_n]$; then
$$f(A, B) = \|Y - ABX\|_F^2.$$
Let $\Sigma_{XX} = \sum_{i=1}^n x_i x_i^\top = XX^\top$, $\Sigma_{YY} = YY^\top$, $\Sigma_{XY} = XY^\top$, and $\Sigma_{YX} = YX^\top$.

When $X = Y$, any optimum satisfies $AB = UU^\top$, where $U$ holds the top $p$ eigenvectors of $\Sigma_{XX}$; a numerical sketch follows.
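A numerical sketch of this claim for the auto-association case: gradient descent on $f(A, B)$ from a small random initialization should drive $AB$ to the projector $UU^\top$. Dimensions, step size, and iteration count below are illustrative choices.

```python
# Auto-association (Y = X): the optimal AB is the PCA projector U U^T.
import numpy as np

rng = np.random.default_rng(0)
d, p, n = 6, 2, 50
# anisotropic data so Sigma_XX has well-separated eigenvalues
X = rng.standard_normal((d, n)) * np.array([3.0, 2.0, 1.0, 0.5, 0.3, 0.1])[:, None]

Sigma = X @ X.T                                  # Sigma_XX
eigvals, eigvecs = np.linalg.eigh(Sigma)         # ascending order
U = eigvecs[:, ::-1][:, :p]                      # top-p eigenvectors
P = U @ U.T                                      # claimed optimum AB = U U^T

# gradient descent on f(A, B) = ||X - A B X||_F^2 from a small random init
A = 0.1 * rng.standard_normal((d, p))
B = 0.1 * rng.standard_normal((p, d))
for _ in range(3000):
    R = (A @ B @ X - X) @ X.T                    # residual term (A B X - X) X^T
    A -= 1e-3 * R @ B.T                          # df/dA (factor 2 absorbed in step)
    B -= 1e-3 * A.T @ R                          # df/dB
print(np.linalg.norm(A @ B - P))                 # small: AB converges to U U^T
```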
Landscape of the 2-layer linear network

Theorem 12.3 (Baldi and Hornik, 1989)
Suppose $X$ is full rank and hence $\Sigma_{XX}$ is invertible. Further assume $\Sigma := \Sigma_{YX} \Sigma_{XX}^{-1} \Sigma_{XY}$ is full rank with $d$ distinct eigenvalues $\lambda_1 > \cdots > \lambda_d > 0$. Then $f(A, B)$ has no spurious local minima, except for equivalent versions of the global minimum due to invertible transformations.

- no bad local minima!
- generalizable to multi-layer linear networks.
Critical points of two-layer linear networks

Lemma 12.4 (critical points)
Any critical point satisfies
$$AB = P_A \Sigma_{YX} \Sigma_{XX}^{-1}, \qquad P_A \Sigma = P_A \Sigma P_A = \Sigma P_A,$$
where $P_A$ is the orthogonal projector onto the column span of $A$.
Critical points of two-layer linear networks

Let the eigendecomposition of $\Sigma = \Sigma_{YX} \Sigma_{XX}^{-1} \Sigma_{XY}$ be $\Sigma = U \Lambda U^\top$.

Lemma 12.5 (critical points)
At any critical point, $A$ can be written in the form
$$A = [U_J, \; 0_{d \times (p - r)}]\, C,$$
where $\mathrm{rank}(A) = r \leq p$, $J \subseteq \{1, \ldots, d\}$ with $|J| = r$, and $C$ is invertible. Correspondingly,
$$B = C^{-1} \begin{bmatrix} U_J^\top \Sigma_{YX} \Sigma_{XX}^{-1} \\ \text{last } p - r \text{ rows of } CL \end{bmatrix},$$
where $L$ is any $d \times d$ matrix.

Exercise: verify that $(A, B)$ is a global optimum if and only if $J = \{1, \ldots, p\}$.
Two-layer nonlinear network

Local strong convexity holds under the Gaussian model [Fu et al., 2018].

[Figure: a one-hidden-layer nonlinear network with input $x$, hidden weights $W$, and output $y$.]
Reference I

[1] ImageNet, www.image-net.org/.
[2] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, 2015.
[3] A. Krizhevsky et al., "ImageNet classification with deep convolutional neural networks," NIPS, 2012.
[4] D. Rumelhart, G. Hinton, and R. Williams, "Learning representations by back-propagating errors," Nature, 1986.
[5] N. Srivastava et al., "Dropout: a simple way to prevent neural networks from overfitting," JMLR, 2014.
[6] S. Wager et al., "Dropout training as adaptive regularization," NIPS, 2013.
[7] I. Sutskever et al., "On the importance of initialization and momentum in deep learning," ICML, 2013.
[8] P. Auer, M. Herbster, and M. K. Warmuth, "Exponentially many local minima for single neurons," NIPS, 1995.

Reference II

[9] S. Mei, Y. Bai, and A. Montanari, "The landscape of empirical risk for non-convex losses," arXiv preprint arXiv:1607.06534, 2016.
[10] P. Baldi and K. Hornik, "Neural networks and principal component analysis: learning from examples without local minima," Neural Networks, 1989.
[11] H. Fu, Y. Chi, and Y. Liang, "Local geometry of one-hidden-layer neural networks for logistic regression," arXiv preprint arXiv:1802.06463, 2018.