Theories of Deep Learning


Theories of Deep Learning Lecture 02 Donoho, Monajemi, Papyan Department of Statistics Stanford Oct. 4, 2017 1 / 50

Stats 385 Fall 2017 2 / 50

Stats 285 Fall 2017 3 / 50

Course info
Wed 3:00-4:20 PM in 200-002, Sept 27 - Dec 6 (10 weeks)
Website: http://stats385.github.io Twitter: @stats385
Instructors:
+ David Donoho. Email: donoho@stanxxx.edu. Office hours: Mon/Wed 1 PM in Sequoia 128
+ Hatef Monajemi. Email: monajemi@stanxxx.edu. Office hours: Mondays, 11:00 AM in Sequoia 216. Twitter: @hatefmnj
+ Vardan Papyan. Email: papyan@stanxxx.edu. Office hours: TBD
4 / 50

Reminders
Weekly guest lectures, with associated abstracts and readings; projects
Course website (http://stats385.github.io): each week's speaker, readings (links to selected), announcements, lecture slides
Stanford Canvas site: readings (incl. copyrighted), announcements, lecture slides, chat
5 / 50

Basic Information about Deep Learning
Chris Manning: http://web.stanford.edu/class/cs224n/
Sujit Pal's NLP tutorial: https://github.com/sujitpal/eeap-examples
Andrew Ng's deeplearning.ai
CS231n course website: http://cs231n.github.io
PyTorch tutorials (all kinds of examples): http://pytorch.org/tutorials/
Books: Deep Learning, Goodfellow, Bengio, Courville, 2016; Neural Networks and Deep Learning, Michael Nielsen, http://neuralnetworksanddeeplearning.com; many O'Reilly books
http://deeplearning.net/reading-list/
Many NIPS papers
6 / 50

A Look Ahead: https://stats385.github.io Next Two Lectures: Wed Oct 11 Helmut Boelcskei ETH Zuerich Wed Oct 18 Ankit Patel Rice 7 / 50

Wed Oct 11: Helmut Boelcskei. Readings for this lecture:
1 A mathematical theory of deep convolutional neural networks for feature extraction
2 Energy propagation in deep convolutional neural networks
3 Discrete deep feature extraction: A theory and new architectures
4 Topology reduction in deep convolutional feature extraction networks
Possibly also of interest:
S. Mallat, Understanding Deep Convolutional Networks, Phil. Trans. Roy. Soc., 2017
Mallat, Stéphane. Group invariant scattering. Communications on Pure and Applied Mathematics 65, no. 10 (2012): 1331-1398
8 / 50

Lecture 1, in review Global Economy Computing Deep Learning 9 / 50

Lecture 2, in overview 10 / 50

ImageNet dataset 14,197,122 labeled images 21,841 classes Labeling required more than a year of human effort via Amazon Mechanical Turk 11 / 50

The Common Task Framework Crucial methodology driving predictive modeling's success An instance has the following ingredients: a training dataset; competitors whose goal is to learn a predictor from the training set; a scoring referee 12 / 50

Instance of Common Task Framework, 1 ImageNet (subset): 1.2 million training images 100,000 test images 1000 classes ImageNet large-scale visual recognition Challenge source: https://www.linkedin.com/pulse/must-read-path-breaking-papers-image-classification-muktabh-mayank 13 / 50

Instance of Common Task Framework, 2 Source: [Krizhevsky et al., 2012] 14 / 50

Perceptron, the basic block Invented by Frank Rosenblatt (1957) [Diagram: inputs x_1, ..., x_d with weights w_1, ..., w_d and bias b; pre-activation z = w^T x + b; output f(z)] 15 / 50
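
As a minimal sketch of this unit in Python/numpy (the choice of nonlinearity f and the numeric values are illustrative, not from the slides):

import numpy as np

def perceptron(x, w, b, f=np.tanh):
    # Affine map z = w.x + b followed by an elementwise nonlinearity f
    z = np.dot(w, x) + b
    return f(z)

rng = np.random.default_rng(0)
x, w = rng.normal(size=3), rng.normal(size=3)
print(perceptron(x, w, b=0.1))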

Single-layer perceptron 16 / 50

Multi-layer perceptron 17 / 50

Forward pass Cascade of repeated [linear operation followed by coordinatewise nonlinearity]'s Nonlinearities: sigmoid, hyperbolic tangent, (recently) ReLU
Algorithm 1: Forward pass
Input: x_0
Output: x_L
1: for l = 1 to L do
2:   x_l = f_l(W_l x_{l-1} + b_l)
3: end for
18 / 50
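
A rough numpy rendering of Algorithm 1 (the layer sizes, random initialization and choice of ReLU are placeholders):

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(x0, weights, biases, f=relu):
    # Algorithm 1: x_l = f_l(W_l x_{l-1} + b_l), l = 1, ..., L
    x = x0
    for W, b in zip(weights, biases):
        x = f(W @ x + b)
    return x

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 2]                      # toy 3-layer network
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
print(forward(rng.normal(size=4), weights, biases))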

Training neural networks
Training examples {x_0^i}_{i=1}^n and labels {y^i}_{i=1}^n
Output of the network: {x_L^i}_{i=1}^n
Objective: J({W_l}, {b_l}) = (1/n) Σ_{i=1}^n (1/2) ||y^i - x_L^i||_2^2    (1)
Gradient descent: W_l <- W_l - η ∂J/∂W_l, b_l <- b_l - η ∂J/∂b_l
In practice: use Stochastic Gradient Descent (SGD)
19 / 50
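
Written out as code, the update step looks roughly as follows (the gradients are assumed to come from back-propagation, derived on the next slides; the commented mini-batch line indicates the SGD variant):

import numpy as np

def gradient_step(weights, biases, grads_W, grads_b, eta=0.01):
    # W_l <- W_l - eta * dJ/dW_l,  b_l <- b_l - eta * dJ/db_l
    for l in range(len(weights)):
        weights[l] -= eta * grads_W[l]
        biases[l] -= eta * grads_b[l]

# SGD: estimate the gradient of J on a random mini-batch instead of all n examples
# batch = np.random.default_rng(0).choice(n, size=64, replace=False)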

back-propagation derivation, from LeCun et al. 1988
Given n training examples (I_i, y_i) (input, target) and L layers
Constrained optimization:
min_{W,x} Σ_{i=1}^n ||x_i(L) - y_i||^2
subject to x_i(l) = f_l[W_l x_i(l-1)], l = 1, ..., L, i = 1, ..., n, and x_i(0) = I_i
Lagrangian formulation (unconstrained): min_{W,x,B} L(W, x, B)
L(W, x, B) = Σ_{i=1}^n { ||x_i(L) - y_i||_2^2 + Σ_{l=1}^L B_i(l)^T ( x_i(l) - f_l[W_l x_i(l-1)] ) }
http://yann.lecun.com/exdb/publis/pdf/lecun-88.pdf
20 / 50

back-propagation derivation
∂L/∂B recovers the forward pass:
x_i(l) = f_l[A_i(l)], where A_i(l) = W_l x_i(l-1), l = 1, ..., L, i = 1, ..., n
∂L/∂x gives the backward (adjoint) pass:
z_i(L) = 2 ∇f_L[A_i(L)] (y_i - x_i(L))
z_i(l) = ∇f_l[A_i(l)] W_{l+1}^T z_i(l+1), l = 0, ..., L-1
Weight update, W <- W + λ ∂L/∂W:
W_l <- W_l + λ Σ_{i=1}^n z_i(l) x_i(l-1)^T
21 / 50
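
Putting the forward and adjoint passes together for a squared-error loss and an elementwise nonlinearity, a minimal numpy sketch (indexing is 0-based and biases are omitted, as on the slides; this is not LeCun's original code):

import numpy as np

def f(z):
    return np.tanh(z)

def df(z):
    return 1.0 - np.tanh(z) ** 2          # elementwise derivative f'

def backprop_update(x0, y, Ws, lam=0.01):
    # Forward pass: store pre-activations A_l and activations x_l
    xs, As = [x0], []
    for W in Ws:
        As.append(W @ xs[-1])
        xs.append(f(As[-1]))
    # Backward (adjoint) pass
    z = 2 * df(As[-1]) * (y - xs[-1])
    grads = [None] * len(Ws)
    for l in reversed(range(len(Ws))):
        grads[l] = np.outer(z, xs[l])     # z_i(l) x_i(l-1)^T
        if l > 0:
            z = df(As[l - 1]) * (Ws[l].T @ z)
    # Weight update from the slide: W_l <- W_l + lambda * grad
    for W, g in zip(Ws, grads):
        W += lam * g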

Convolutional Neural Network (CNN) Can be traced to the Neocognitron of Kunihiko Fukushima (1979) Yann LeCun combined convolutional neural networks with backpropagation (1989) Imposes shift invariance and locality on the weights Forward pass remains similar Backpropagation changes slightly: gradients must be summed over all spatial positions Source: [LeCun et al., 1998] 22 / 50
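
To make the weight sharing concrete, a naive "valid" 2-D correlation in numpy (what deep-learning libraries call convolution); the example filter is illustrative, and this is not how LeNet or AlexNet actually implement the operation:

import numpy as np

def conv2d_valid(image, kernel):
    # Every output pixel reuses the same small kernel: locality + shift invariance
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

edge_filter = np.array([[1.0, 0.0, -1.0]] * 3)                  # crude vertical-edge detector
print(conv2d_valid(np.random.rand(8, 8), edge_filter).shape)    # (6, 6)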

AlexNet (2012) Architecture 8 layers: first 5 convolutional, rest fully connected ReLU nonlinearity Local response normalization Max-pooling Dropout Source: [Krizhevsky et al., 2012] 23 / 50

AlexNet (2012) ReLU A non-saturating function, hence faster convergence compared to saturating nonlinearities Problem of dying neurons Source: https://ml4a.github.io/ml4a/neural_networks/ 24 / 50
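
A tiny numeric illustration of the dying-neuron issue (sketch only): for negative pre-activations, ReLU outputs zero and its gradient is zero, so no gradient flows back through that unit.

import numpy as np

relu = lambda z: np.maximum(z, 0.0)
relu_grad = lambda z: (z > 0).astype(float)

z = np.array([-2.0, -0.1, 0.0, 0.3, 5.0])
print(relu(z))        # [0.  0.  0.  0.3 5. ]
print(relu_grad(z))   # [0. 0. 0. 1. 1.]  <- zero gradient wherever z <= 0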

AlexNet (2012) Max pooling Chooses the maximal entry in every non-overlapping window of size 2 x 2, for example Source: Stanford's CS231n github 25 / 50
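
A sketch of 2 x 2 non-overlapping max pooling in numpy (assumes the input height and width are even):

import numpy as np

def max_pool_2x2(x):
    # Keep the maximum of every non-overlapping 2x2 window
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.arange(16).reshape(4, 4)
print(max_pool_2x2(x))
# [[ 5  7]
#  [13 15]]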

AlexNet (2012) Dropout Source: [Srivastava et al., 2014] Zero every neuron with probability 1 - p At test time, multiply every neuron by p 26 / 50
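
A hedged sketch of the behaviour just described (libraries usually implement the equivalent "inverted dropout", which rescales at training time instead):

import numpy as np

def dropout(x, p, train, rng=np.random.default_rng()):
    if train:
        mask = rng.random(x.shape) < p    # keep each unit with probability p
        return x * mask
    return x * p                          # test time: scale every unit by p

h = np.ones(10)
print(dropout(h, p=0.5, train=True))
print(dropout(h, p=0.5, train=False))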

AlexNet (2012) Training Stochastic gradient descent Mini-batches Momentum Weight decay (l2 prior on the weights) Filters trained in the first layer Source: [Krizhevsky et al., 2012] 27 / 50
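
One way to write that recipe as a single parameter update (mini-batch SGD with momentum and weight decay); the constants are illustrative defaults, not claimed to be AlexNet's exact values:

import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9, weight_decay=5e-4):
    # v <- momentum * v - lr * (grad + weight_decay * w);  w <- w + v
    velocity = momentum * velocity - lr * (grad + weight_decay * w)   # weight decay = l2 prior
    w = w + velocity
    return w, velocity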

Characteristics of different networks Source: Eugenio Culurciello 28 / 50

The need for regularization The number of training examples is 1.2 million The number of parameters is 5-155 million How does the network manage to generalize? 29 / 50

Implicit and explicit regularization Weight decay (l2 prior on the weights) ReLU: a soft non-negative thresholding operator, giving implicit regularization through sparse feature maps Dropout at test time, when no units are dropped, gives sparser representations [Srivastava et al. '14] Dropout: a particular form of ridge regression The structure of the network itself 30 / 50

Olshausen and Field (1996) Receptive fields in visual cortex are spatially localized, oriented and bandpass Coding natural images while promoting sparse solutions results in a set of filters satisfying these properties:
min_{ {φ_i}, {a_i} } (1/2) || I - Σ_i φ_i a_i ||_2^2 + Σ_i S(a_i)    (2)
Trained filters φ_i Source: [Olshausen and Field, 1996] 31 / 50

AlexNet vs. Olshausen and Field Why does AlexNet learn filters similar to Olshausen/Field's? Is there an implicit sparsity promotion when training the network? How would classification results change if we replaced the learned filters in the first layer with analytically defined wavelets, e.g. Gabors? Filters in the first layer are spatially localized, oriented and bandpass. What properties do filters in the remaining layers satisfy? Can we derive them mathematically? 32 / 50

VGG (2014) [Simonyan and Zisserman, 2014] Deeper than AlexNet: 11-19 layers versus 8 No local response normalization Number of filters multiplied by two every few layers Spatial extent of filters 3 x 3 in all layers Instead of 7 x 7 filters, use three layers of 3 x 3 filters: gain intermediate nonlinearity; impose a regularization on the 7 x 7 filters Source: https://blog.heuritech.com/2016/02/29/ 33 / 50
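
A quick parameter count behind the "three 3 x 3 layers instead of one 7 x 7 layer" argument, assuming C input and C output channels per layer and ignoring biases (C = 64 is just an example):

def conv_params(k, c_in, c_out):
    return k * k * c_in * c_out           # weights in one k x k convolutional layer

C = 64
one_7x7 = conv_params(7, C, C)            # 49 * C^2
three_3x3 = 3 * conv_params(3, C, C)      # 27 * C^2, same 7x7 receptive field
print(one_7x7, three_3x3)                 # 200704 110592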

Optimization problems Formally, deeper networks contain shallower ones (i.e. consider no-op layers) Observation: deeper networks do not always achieve lower training error Conclusion: the optimization process cannot successfully infer the no-op 34 / 50

ResNet (2015) Solves the problem by adding skip connections Very deep: 152 layers No dropout Stride Batch normalization Source: Deep Residual Learning for Image Recognition 35 / 50
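
A minimal PyTorch-style sketch of a basic residual block (simplified: identity shortcut only, no stride or channel changes handled):

import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    # y = relu(F(x) + x), where F is conv-BN-relu-conv-BN
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)            # skip connection: only a residual must be learned

block = BasicBlock(16)
print(block(torch.randn(1, 16, 8, 8)).shape)   # torch.Size([1, 16, 8, 8])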

Stride Source: https://adeshpande3.github.io/a-beginner%27s-Guide-To-Understanding-Convolutional-Neural-Networks-Part-2/ 36 / 50

Batch normalization
Algorithm 2: Batch normalization [Ioffe and Szegedy, 2015]
Input: values of x over a minibatch x_1, ..., x_B, where x is a certain channel in a certain feature vector
Output: normalized, scaled and shifted values y_1, ..., y_B
1: μ = (1/B) Σ_{b=1}^B x_b
2: σ^2 = (1/B) Σ_{b=1}^B (x_b - μ)^2
3: x̂_b = (x_b - μ) / sqrt(σ^2 + ε)
4: y_b = γ x̂_b + β
Accelerates training and makes initialization less sensitive Zero mean and unit variance feature vectors
37 / 50
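
Algorithm 2 in numpy for a single channel, training-time statistics only (the running averages used at test time are omitted from this sketch):

import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x: values of one channel across a mini-batch (steps 1-4 of Algorithm 2)
    mu = x.mean()
    var = x.var()
    xhat = (x - mu) / np.sqrt(var + eps)
    return gamma * xhat + beta

x = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=32)
y = batch_norm(x, gamma=1.0, beta=0.0)
print(round(y.mean(), 6), round(y.std(), 3))   # approximately 0 and 1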

ResNet versus standard architectures Standard architectures: increasingly abstract features at each layer ResNet: a group of successive layers iteratively refines an estimated representation [Klaus Greff et al. '17] Could we formulate a cost function that is being minimized in these successive layers? What is the relation between this cost function and standard architectures? 38 / 50

Depth as function of year [He et al., 2016] 39 / 50

The question of depth Besides increasing depth, one can increase the width of each layer to improve performance [Zagoruyko and Komodakis '17] Is there a reason for increasing depth over width or vice versa? Is having many filters in the same layer somehow detrimental? Is having many layers not beneficial after some point? 40 / 50

Linear separation Inputs are not linearly separable but their deepest representations are What happens during forward pass that makes linear separation possible? Is separation happening gradually with depth or abruptly at a certain point? 41 / 50

Transfer learning Filters learned in the first layers of a network are transferable from one task to another When solving another problem, there is no need to retrain the lower layers; just fine-tune the upper ones Is this simply due to the large number of images in ImageNet? Does solving many classification problems simultaneously result in features that are more easily transferable? Does this imply filters can be learned in an unsupervised manner? Can we characterize the filters mathematically? 42 / 50

Adversarial examples [Goodfellow et al., 2014] Small but malicious perturbations can result in severe misclassification Malicious examples generalize across different architectures What is the source of this instability? Can we robustify the network? 43 / 50
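
The cited paper's fast gradient sign method makes the "small but malicious perturbation" concrete; a hedged one-line sketch, assuming the gradient of the loss with respect to the input is available from a backward pass:

import numpy as np

def fgsm(x, grad_x_loss, eps=0.007):
    # Goodfellow et al. (2014): x_adv = x + eps * sign(grad_x J(x, y))
    return np.clip(x + eps * np.sign(grad_x_loss), 0.0, 1.0)   # keep pixels in a valid range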

Visualizing deep convolutional neural networks using natural pre-images Filters in the first layer of a CNN are easy to visualize, while deeper ones are harder Activation maximization seeks the input image maximizing the output of the i-th neuron in the network:
x* = argmin_x R(x) - <Φ(x), e_i>    (3)
where e_i is an indicator vector and R(x) is a simple natural image prior
44 / 50
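
A rough sketch of solving objective (3) by gradient descent using PyTorch autograd; the model, the input size, and the simple l2 prior standing in for R(x) are all placeholder assumptions, not the priors used in the cited paper:

import torch

def activation_maximization(model, neuron_index, steps=200, lr=0.1, alpha=1e-4):
    # Minimize R(x) - <Phi(x), e_i>, with R(x) = alpha * ||x||^2 as a stand-in prior
    x = torch.zeros(1, 3, 224, 224, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        phi = model(x).flatten()
        loss = alpha * x.pow(2).sum() - phi[neuron_index]
        loss.backward()
        opt.step()
    return x.detach()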

Visualizing VGG Gabor-like images in first layer More sophisticated structures in the rest [Mahendran and Vedaldi, 2016] 45 / 50

Visualizing VGG VD [Mahendran and Vedaldi, 2016] 46 / 50

Visualizing CNN [Mahendran and Vedaldi, 2016] 47 / 50

Geometry of images Activation maximization seeks input image maximizing activation of certain neuron Could we span all images that excite a certain neuron? What geometrical structure would these images create? 48 / 50

Lecture 2, in overview 49 / 50

References I
Goodfellow, I. J., Shlens, J., and Szegedy, C. (2014). Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778.
Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448-456.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324.
Mahendran, A. and Vedaldi, A. (2016). Visualizing deep convolutional neural networks using natural pre-images. International Journal of Computer Vision, 120(3):233-255.
Olshausen, B. A. and Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607.
Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929-1958.
50 / 50