ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference
Neural Networks: A Brief Touch
Yuejie Chi, Department of Electrical and Computer Engineering, Spring 2018

Outline
- introduction to deep learning
- perceptron model (a single neuron)
- 1-hidden-layer (2-layer) neural network

The rise of deep neural networks
ImageNet Large Scale Visual Recognition Challenge (ILSVRC), led by Prof. Fei-Fei Li (Stanford). Total number of non-empty synsets (categories): 21,841; total number of images: 14,197,122.

The rise of deep neural networks
The deeper, the better?

AlexNet
Won the 2012 ILSVRC competition by a large margin: top-1 and top-5 error rates of 37.5% and 17.0%.

AlexNet
60 million parameters and 650,000 neurons; takes 5-6 days to train on two GTX 580 3GB GPUs.
- 5 convolutional layers
- 3 fully connected layers
- softmax output

Rectified linear units (ReLU): faster training
Compared to tanh and sigmoid activations, training with the ReLU $y = \max(0, x)$ is much faster.
[Figure: plot of the ReLU, tanh, and sigmoid activations on $[-3, 3]$.]
ReLU doesn't saturate.
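To make the saturation point concrete, here is a minimal Python sketch (not from the slides; the sample points are arbitrary) comparing the three activations and their slopes at a large input:

```python
import numpy as np

# Compare how ReLU, tanh, and sigmoid respond to inputs of growing magnitude.
# ReLU keeps a unit slope for positive inputs, while tanh and sigmoid flatten
# out (saturate), which shrinks gradients and slows training.

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print("ReLU   :", relu(z))
print("tanh   :", np.tanh(z))
print("sigmoid:", sigmoid(z))

# Derivatives at z = 5: ReLU has slope 1; the saturating units are near 0.
print("slopes at z=5:", 1.0, 1 - np.tanh(5.0) ** 2, sigmoid(5.0) * (1 - sigmoid(5.0)))
```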

Reduce overfitting
Important to reduce overfitting since the number of training samples is far smaller than the number of parameters.
- Data augmentation: apply label-invariant transforms.
- Dropout.
- Other ways of implicit regularization: early stopping, weight decay (ridge regression), ...
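As a rough illustration of two of these tools, here is a minimal Python sketch (illustrative only; `keep_prob` and `lam` are made-up values, not AlexNet's settings) of inverted dropout on a hidden activation and an L2 weight-decay penalty:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, keep_prob=0.5, train=True):
    """Randomly zero hidden units during training; scale so E[output] = h."""
    if not train:
        return h
    mask = rng.random(h.shape) < keep_prob
    return h * mask / keep_prob

def l2_penalty(weights, lam=1e-4):
    """Weight decay (ridge) term added to the training loss."""
    return lam * sum(np.sum(W ** 2) for W in weights)

h = np.ones(8)
print(dropout(h, keep_prob=0.5))      # roughly half the units survive, scaled by 2
print(l2_penalty([np.ones((3, 3))]))  # 9 * 1e-4
```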

Learned hierarchical representations
Learned representations using a CNN trained on ImageNet.
Figure credit: Y. LeCun's slides, with research credit to Zeiler and Fergus, 2013.

Single-layer networks (perceptron)

Perceptron
Input $x = [x_1, \ldots, x_d]^\top \in \mathbb{R}^d$, weight $w = [w_1, \ldots, w_d]^\top \in \mathbb{R}^d$, output $y \in \mathbb{R}$:
$$y = \sigma(w^\top x) = \sigma\Big(\sum_{i=1}^d w_i x_i\Big),$$
where $\sigma(\cdot)$ is a nonlinear activation function, e.g. $\sigma(z) = \mathrm{sign}(z)$ (hard thresholding) or $\sigma(z) = \mathrm{sigmoid}(z) = \frac{1}{1+e^{-z}}$ (soft thresholding).
[Figure: a single neuron combining inputs $x_1, x_2, x_3$ with weights $w_1, w_2, w_3$.]
Decision making at test stage: given a test sample $x$, calculate $y$.
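A minimal Python sketch of this forward pass (the weights and the test sample below are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def perceptron(x, w, activation="sigmoid"):
    """Single neuron: y = sigma(w^T x) with a hard or soft threshold."""
    z = np.dot(w, x)
    return np.sign(z) if activation == "sign" else sigmoid(z)

# Test-stage decision on a toy sample.
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.3, -0.2])
print(perceptron(x, w, "sign"), perceptron(x, w, "sigmoid"))
```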

Nonlinear activation
Nonlinear activation is critical for modeling complex decision boundaries.

Training of a perceptron
Empirical risk minimization: given training data $\{x_i, y_i\}_{i=1}^n$, find the weight vector
$$\hat{w} = \arg\min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n \ell(w; x_i, y_i),$$
i.e. the weight parameter $w$ that best fits the data. Popular choices for the loss function: quadratic, cross entropy, hinge, etc. We'll use the quadratic loss
$$\ell(w; x_i, y_i) = \big(y_i - \sigma(w^\top x_i)\big)^2$$
and the sigmoid activation as a running example.

Training via (stochastic) gradient descent
$$\hat{w} = \arg\min_{w \in \mathbb{R}^d} \frac{1}{2n} \sum_{i=1}^n \big(y_i - \sigma(w^\top x_i)\big)^2 := \arg\min_{w \in \mathbb{R}^d} \ell_n(w)$$
Gradient descent:
$$w_{t+1} = w_t - \eta_t \nabla \ell_n(w_t),$$
where $\eta_t$ is the step size or learning rate.
The gradient can be calculated via the chain rule. Writing $\hat{y}_i = \hat{y}_i(w) = \sigma(w^\top x_i)$,
$$\frac{d}{dw} \frac{1}{2}(y_i - \hat{y}_i)^2 = (\hat{y}_i - y_i)\frac{d\hat{y}_i(w)}{dw} = (\hat{y}_i - y_i)\,\sigma'(w^\top x_i)\, x_i = \underbrace{(\hat{y}_i - y_i)\hat{y}_i(1-\hat{y}_i)}_{\text{scalar}}\, x_i,$$
where we used $\sigma'(z) = \sigma(z)(1-\sigma(z))$ for the sigmoid. This is called the delta rule.

Stochastic gradient descent
Stochastic gradient descent uses only a mini-batch of data at every iteration. At every iteration $t$:
1. Draw a mini-batch of data indexed by $S_t \subseteq \{1, \ldots, n\}$;
2. Update
$$w_{t+1} = w_t - \eta_t \sum_{i \in S_t} \nabla \ell(w_t; x_i, y_i).$$
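Putting the delta rule and the mini-batch update together, here is a minimal Python sketch (synthetic toy data; the step size, batch size, and iteration count are illustrative, and the mini-batch gradient is averaged rather than summed):

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd(X, y, step_size=0.5, batch_size=8, iters=2000):
    """Mini-batch SGD on the quadratic loss with a sigmoid unit (delta rule)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        S = rng.choice(n, size=batch_size, replace=False)  # mini-batch S_t
        y_hat = sigmoid(X[S] @ w)
        delta = (y_hat - y[S]) * y_hat * (1 - y_hat)       # (y_hat - y) y_hat (1 - y_hat)
        w -= step_size * X[S].T @ delta / batch_size       # averaged over the batch
    return w

# Toy data generated from a planted weight vector (noiseless).
n, d = 200, 5
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = sigmoid(X @ w_true)

w_hat = sgd(X, y)
loss = lambda w: 0.5 * np.mean((y - sigmoid(X @ w)) ** 2)
print(loss(np.zeros(d)), "->", loss(w_hat))                # the empirical risk should drop
```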

Detour: backpropagation
Backpropagation is the basic algorithm for training neural networks. It was rediscovered several times in the literature during the 1970s-80s, but popularized by the 1986 paper of Rumelhart, Hinton, and Williams. Assuming node operations take unit time, backpropagation runs in linear time, specifically $O(\text{network size}) = O(V + E)$, to compute the gradient, where $V$ is the number of vertices and $E$ is the number of edges in the neural network.
Main idea: the chain rule from calculus,
$$\frac{d}{dx} f(g(x)) = f'(g(x))\, g'(x).$$
Let's illustrate the process with a single-output, 2-layer neural network.

Derivations of backpropagation
[Figure: 2-layer network with inputs $x_j$, hidden-layer weights $w_{m,j}$, and an output layer.]
Network output:
$$\hat{y} = \sigma\Big(\sum_m v_m h_m\Big) = \sigma\Big(\sum_m v_m\, \sigma\Big(\sum_j w_{m,j} x_j\Big)\Big)$$
Loss function: $f = \frac{1}{2}(y - \hat{y})^2$.

Backpropagation I
Optimize the weights for each layer, starting with the layer closest to the outputs and working back to the layer closest to the inputs.
1. To update the $v_m$'s, realize
$$\frac{df}{dv_m} = \frac{df}{d\hat{y}}\, \frac{d\hat{y}}{dv_m} = (\hat{y} - y)\frac{d\hat{y}}{dv_m} = (\hat{y} - y)\,\sigma'\Big(\sum_m v_m h_m\Big) h_m = \underbrace{(\hat{y} - y)\hat{y}(1-\hat{y})}_{\delta}\, h_m.$$
This is the same as updating the perceptron.

Backpropagation II
2. To update the $w_{m,j}$'s, realize
$$\frac{df}{dw_{m,j}} = \frac{df}{d\hat{y}}\, \frac{d\hat{y}}{dh_m}\, \frac{dh_m}{dw_{m,j}} = (\hat{y} - y)\hat{y}(1-\hat{y})\, v_m\, h_m(1-h_m)\, x_j = \delta\, v_m h_m(1-h_m)\, x_j.$$
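The two formulas can be checked numerically. Below is a minimal Python sketch (the shapes and the finite-difference check are my own choices) of backprop for this single-output, 2-layer network:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(W, v, x):
    h = sigmoid(W @ x)        # hidden activations h_m, W has shape (m, d)
    y_hat = sigmoid(v @ h)    # network output
    return h, y_hat

def backprop(W, v, x, y):
    h, y_hat = forward(W, v, x)
    delta = (y_hat - y) * y_hat * (1 - y_hat)      # delta = (y_hat - y) y_hat (1 - y_hat)
    grad_v = delta * h                             # df/dv_m    = delta h_m
    grad_W = np.outer(delta * v * h * (1 - h), x)  # df/dw_{m,j} = delta v_m h_m (1 - h_m) x_j
    return grad_W, grad_v

rng = np.random.default_rng(0)
W, v = rng.standard_normal((4, 3)), rng.standard_normal(4)
x, y = rng.standard_normal(3), 0.3

grad_W, grad_v = backprop(W, v, x, y)

# Finite-difference check of one entry of grad_W; the two numbers should agree closely.
f = lambda W_: 0.5 * (y - forward(W_, v, x)[1]) ** 2
eps = 1e-6
E = np.zeros_like(W); E[1, 2] = eps
print(grad_W[1, 2], (f(W + E) - f(W - E)) / (2 * eps))
```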

Questions we may ask (in general)
- Representation: how well can a given network (with fixed activation) approximate / explain the training data?
- Generalization: how well does the learned $\hat{w}$ predict during testing?
- Optimization: how does the output $w_t$ of (S)GD relate to $\hat{w}$? (Or rather, we should plug $w_t$ into the previous two questions!)

Nonconvex landscape of the perceptron can be very bad
SGD converges to local minimizers. Are they global?
Theorem 12.1 (Auer et al., 1995) Let $\sigma(\cdot)$ be the sigmoid and $\ell(\cdot)$ the quadratic loss. There exists a sequence of training samples $\{x_i, y_i\}_{i=1}^n$ such that $\ell_n(w)$ has $\lfloor n/d \rfloor^d$ distinct local minima.
Consequence: there may exist exponentially many bad local minima for worst-case data! — the curse of dimensionality.

Why? Saturation of the sigmoid
[Figure: single-sample losses $\ell(w; 10, 0.55)$ and $\ell(w; 0.7, 0.25)$ as functions of $w$.]
Each sample produces a local minimum plus flat surfaces away from the minimizer. If the local minimum of sample A falls into the flat region of sample B (and vice versa), the sum of the sample losses preserves both minima.

Why?
We get one local minimum per sample in 1D.
[Figure: sum of the per-sample losses in 1D, with the local minima marked.]
Curse of dimensionality: the samples are constructed to yield $\lfloor n/d \rfloor^d$ distinct local minima in $d$ dimensions.
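This construction can be checked numerically in 1-D. The sketch below (grid range and resolution are my own choices) sums the two single-sample losses pictured above and reports the grid points that lie below both neighbors; both per-sample minima should survive in the sum:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def single_loss(w, x, y):
    """Quadratic loss of a 1-D sigmoid neuron (the 1/2 scaling does not move minima)."""
    return 0.5 * (y - sigmoid(w * x)) ** 2

w_grid = np.linspace(-10, 10, 200001)
total = single_loss(w_grid, 10.0, 0.55) + single_loss(w_grid, 0.7, 0.25)

# Interior grid points strictly below both neighbors are (approximate) local minima.
is_min = (total[1:-1] < total[:-2]) & (total[1:-1] < total[2:])
print("local minima found near w =", w_grid[1:-1][is_min])
```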

Statistical models come to the rescue
Data/measurements follow certain statistical models and hence are not worst-case instances.
$$\text{minimize}_{w} \quad \ell_n(w) = \frac{1}{n} \sum_{i=1}^n \ell(w; x_i, y_i) \;\xrightarrow{\; n \to \infty \;}\; \mathbb{E}[\ell(w; x, y)] := \ell(w)$$
[Figure: empirical risk vs. population risk, with $\theta_0 = [1, 0]$ and $\hat{\theta}_n = [0.816, 0.268]$. Figure credit: Mei, Bai and Montanari.]

Statistical models of training data
Assume the training data $\{x_i, y_i\}_{i=1}^n$ are drawn i.i.d. from some distribution $(x, y) \sim p(x, y)$. We are using neural networks to fit $p(x, y)$.
A planted-truth model: let $x \sim \mathcal{N}(0, I)$ and draw the label $y$ as
- regression model: $y_i = \sigma(w^{\star\top} x_i)$;
- classification model: $y_i \in \{0, 1\}$, where $\mathbb{P}(y_i = 1) = \sigma(w^{\star\top} x_i)$.
Parameter recovery: can we recover $w^\star$ using $\{x_i, y_i\}_{i=1}^n$?
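Here is a small Python sketch of the regression version of this planted model (the dimensions, sample size, step size, and iteration count are illustrative), running plain gradient descent on the empirical quadratic risk and reporting how close it gets to $w^\star$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n, d = 5000, 10
w_star = rng.standard_normal(d)
X = rng.standard_normal((n, d))          # rows x_i ~ N(0, I)
y = sigmoid(X @ w_star)                  # noiseless regression labels

# Gradient descent on (1/2n) sum_i (y_i - sigmoid(w^T x_i))^2 from w = 0.
w = np.zeros(d)
for _ in range(3000):
    y_hat = sigmoid(X @ w)
    grad = X.T @ ((y_hat - y) * y_hat * (1 - y_hat)) / n
    w -= 2.0 * grad

# With this much data the distance should be close to zero.
print("||w - w_star|| =", np.linalg.norm(w - w_star))
```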

Roadmap
1. Step 1: verify the landscape properties of the population loss;
2. Step 2: translate properties of the population loss to the empirical loss;
3. Step 3: argue that $\hat{w}$ (minimizer of the empirical loss) is close to $w^\star$ (minimizer of the population loss).

Step 1: population risk
[Figure: contour plot of the population risk over $(\theta_1, \theta_2)$, with the global minimizer marked.]
$w^\star$ is the unique local minimizer, and it is also global. No bad local minima!
- strongly convex near the global optimum;
- large gradient elsewhere.

Nonconvex landscape: from population to empirical
Figure credit: Mei, Bai and Montanari.

Step 2: uniform convergence of gradients and Hessians
Theorem 12.2 (Mei, Bai, and Montanari, 2016) Under suitable assumptions, for any $\delta > 0$ there exists a positive constant $C$ depending on $(R, \delta)$ but independent of $n$ and $d$, such that as long as $n \geq C d \log d$, we have
1. preservation of gradients:
$$\mathbb{P}\left( \sup_{\|w\|_2 \le R} \big\| \nabla \ell_n(w) - \nabla \ell(w) \big\|_2 \le C\sqrt{\frac{d \log n}{n}} \right) \ge 1 - \delta;$$
2. preservation of Hessians:
$$\mathbb{P}\left( \sup_{\|w\|_2 \le R} \big\| \nabla^2 \ell_n(w) - \nabla^2 \ell(w) \big\| \le C\sqrt{\frac{d \log n}{n}} \right) \ge 1 - \delta.$$

Step 3: establish the rate of convergence for ERM
By the mean-value theorem, there exists some $\tilde{w}$ between $\hat{w}$ and $w^\star$ such that
$$\ell_n(\hat{w}) = \ell_n(w^\star) + \langle \nabla \ell_n(w^\star), \hat{w} - w^\star \rangle + \frac{1}{2}(\hat{w} - w^\star)^\top \nabla^2 \ell_n(\tilde{w})(\hat{w} - w^\star) \le \ell_n(w^\star),$$
where the last inequality follows from the optimality of $\hat{w}$. Then
$$\frac{1}{2}\lambda_{\min}\big(\nabla^2 \ell_n(\tilde{w})\big)\, \|\hat{w} - w^\star\|_2^2 \le \frac{1}{2}(\hat{w} - w^\star)^\top \nabla^2 \ell_n(\tilde{w})(\hat{w} - w^\star) \le -\langle \nabla \ell_n(w^\star), \hat{w} - w^\star \rangle \le \|\nabla \ell_n(w^\star)\|_2\, \|\hat{w} - w^\star\|_2,$$
and therefore
$$\|\hat{w} - w^\star\|_2 \le \frac{2\,\|\nabla \ell_n(w^\star)\|_2}{\lambda_{\min}\big(\nabla^2 \ell_n(\tilde{w})\big)} \lesssim C\sqrt{\frac{d \log n}{n}}.$$

Two-layer networks

Two-layer linear network
Given arbitrary data $\{x_i, y_i\}_{i=1}^n$ with $x_i, y_i \in \mathbb{R}^d$, fit the two-layer linear network with the quadratic loss:
$$f(A, B) = \sum_{i=1}^n \|y_i - A B x_i\|_2^2,$$
where $B \in \mathbb{R}^{p \times d}$, $A \in \mathbb{R}^{d \times p}$, and $p \le d$.
[Figure: linear network mapping the input layer through a hidden layer (via $B$, then $A$) to the output layer.]
Special case: auto-association (auto-encoding, identity mapping), where $y_i = x_i$, e.g. for image compression.

Landscape of the 2-layer linear network
- Bears some similarity with the nonconvex matrix factorization problem.
- Lacks identifiability: for any invertible $C$, $AB = (AC)(C^{-1}B)$.
- Define $X = [x_1, x_2, \ldots, x_n]$ and $Y = [y_1, y_2, \ldots, y_n]$; then
$$f(A, B) = \|Y - ABX\|_F^2.$$
- Let $\Sigma_{XX} = \sum_{i=1}^n x_i x_i^\top = XX^\top$, $\Sigma_{YY} = YY^\top$, $\Sigma_{XY} = XY^\top$, and $\Sigma_{YX} = YX^\top$.
- When $X = Y$, any optimum satisfies $AB = UU^\top$, where $U$ holds the top $p$ eigenvectors of $\Sigma_{XX}$ (see the numerical sketch below).
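As a sanity check of the auto-association claim, here is a small Python sketch (the planted spectrum, step size, and iteration count are illustrative choices) that runs gradient descent on a sample-normalized $f(A, B)$ and compares $AB$ with the top-$p$ eigenprojector of $\Sigma_{XX}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p = 2000, 8, 3

# Data with a planted, well-separated spectrum, and Y = X (auto-association).
X = np.diag(np.linspace(3.0, 1.0, d)) @ rng.standard_normal((d, n))
Sigma = X @ X.T / n                              # Sigma_XX, scaled by 1/n
eigvals, eigvecs = np.linalg.eigh(Sigma)         # eigenvalues in ascending order
U = eigvecs[:, -p:]                              # top-p eigenvectors

# Gradient descent on f(A, B) = ||X - A B X||_F^2 / n from a small random init.
A = 0.1 * rng.standard_normal((d, p))
B = 0.1 * rng.standard_normal((p, d))
eta = 0.01
for _ in range(5000):
    R = (X - A @ B @ X) @ X.T / n                # residual correlation matrix
    A, B = A + 2 * eta * R @ B.T, B + 2 * eta * A.T @ R

# AB should approach U U^T, and the loss should approach the PCA optimum
# (the sum of the d - p smallest eigenvalues of Sigma).
print("||AB - UU^T||_F =", np.linalg.norm(A @ B - U @ U.T))
print("loss:", np.trace((np.eye(d) - A @ B) @ Sigma @ (np.eye(d) - A @ B).T),
      "PCA optimum:", eigvals[:d - p].sum())
```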

Landscape of the 2-layer linear network
Theorem 12.3 (Baldi and Hornik, 1989) Suppose $X$ is full rank, so that $\Sigma_{XX}$ is invertible. Further assume $\Sigma := \Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}$ is full rank with $d$ distinct eigenvalues $\lambda_1 > \cdots > \lambda_d > 0$. Then $f(A, B)$ has no spurious local minima, except for equivalent versions of the global minimum obtained through invertible transformations.
- No bad local minima!
- Generalizable to multi-layer linear networks.

Critical points of two-layer linear networks
Lemma 12.4 (critical points) Any critical point satisfies
$$AB = P_A \Sigma_{YX} \Sigma_{XX}^{-1},$$
where $A$ satisfies
$$P_A \Sigma = P_A \Sigma P_A = \Sigma P_A,$$
and $P_A$ denotes the orthogonal projector onto the column span of the sub-indexed matrix.

Critical points of two-layer linear networks
Let the eigendecomposition of $\Sigma = \Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}$ be $\Sigma = U \Lambda U^\top$.
Lemma 12.5 (critical points) At any critical point, $A$ can be written in the form
$$A = [U_J, \ 0_{d \times (p-r)}]\, C,$$
where $\mathrm{rank}(A) = r \le p$, $J \subseteq \{1, \ldots, d\}$ with $|J| = r$, and $C$ is invertible. Correspondingly,
$$B = C^{-1} \begin{bmatrix} U_J^\top \Sigma_{YX} \Sigma_{XX}^{-1} \\ \text{last } p - r \text{ rows of } CL \end{bmatrix},$$
where $L$ is any $p \times d$ matrix.
Exercise: verify that $(A, B)$ is a global optimum if and only if $J = \{1, \ldots, p\}$.

Two-layer nonlinear network
Local strong convexity under the Gaussian model [Fu et al., 2018].
[Figure: one-hidden-layer nonlinear network with weights $W$ mapping the input layer $x$ through the hidden layer to the output $y$.]

References I
[1] ImageNet, www.image-net.org/.
[2] Deep learning, LeCun, Bengio, and Hinton, Nature, 2015.
[3] ImageNet classification with deep convolutional neural networks, Krizhevsky et al., NIPS, 2012.
[4] Learning representations by back-propagating errors, Rumelhart, Hinton, and Williams, Nature, 1986.
[5] Dropout: A simple way to prevent neural networks from overfitting, Srivastava et al., JMLR, 2014.
[6] Dropout training as adaptive regularization, Wager et al., NIPS, 2013.
[7] On the importance of initialization and momentum in deep learning, Sutskever et al., ICML, 2013.
[8] Exponentially many local minima for single neurons, P. Auer, M. Herbster, and M. K. Warmuth, NIPS, 1995.

References II
[9] The landscape of empirical risk for non-convex losses, S. Mei, Y. Bai, and A. Montanari, arXiv preprint arXiv:1607.06534, 2016.
[10] Neural networks and principal component analysis: Learning from examples without local minima, P. Baldi and K. Hornik, Neural Networks, 1989.
[11] Local geometry of one-hidden-layer neural networks for logistic regression, Fu, Chi, and Liang, arXiv preprint arXiv:1802.06463, 2018.