Introduction to Deep Neural Networks
Presenter: Chunyuan Li
Pattern Classification and Recognition (ECE 681.01)
Duke University, April 2016
Outline
1 Background and Preliminaries
  Why DNNs?
  Model: Logistic Regression
  Learning: Optimization with Stochastic Gradient Descent
2 Deep Neural Networks
  How an FNN differs from LR
  Model: Going Deep
  Learning: Back-propagation
3 Advances
  Model: Convolutional/Recurrent Neural Networks
  Learning: Dropout/Batch Normalization
Recent Success
  Surpass human performance on tasks such as classification on ImageNet; AlphaGo
  Interesting applications: Neural Style
[Figure: (a) Style image 1, (b) Style image 2, Content image, (c) Synthesized image 1, (d) Synthesized image 2]
Predictive Models
Assume we are given data $\mathcal{D} = \{d_1, \ldots, d_N\}$, where $d_i = (x_i, y_i)$
  Input object/feature $x_i \in \mathbb{R}^D$
  Output label $y_i \in \mathcal{Y}$, with $\mathcal{Y}$ being the discrete label space.
A model characterizes the relationship from $x$ to $y$ with a mapping parameterized by $\theta$.
In training, find a proper $\hat{\theta}$ on the training set, via maximizing $p(\hat{\theta} \mid \mathcal{D})$.
In testing, given a test input $\tilde{x}$ (with missing label $\tilde{y}$),
$$p(\tilde{y} \mid \tilde{x}, \mathcal{D}) = p(\tilde{y} \mid \tilde{x}, \hat{\theta}). \quad (1)$$
Logistic Regression (LR)
Setup: input $x_i \in \mathbb{R}^D$, output label $y_i \in \{0, 1\}$
For binary classification, the likelihood is:
$$p(y = 1 \mid x, \theta) = g_\theta(x) = \frac{1}{1 + \exp\!\big(-(W x + c)\big)}, \quad (2)$$
Regularizer: $p(\theta)$, e.g., $\ell_2$ weight penalty/decay
Parameters: $\theta = \{W, c\}$
[Figure: (a) Graphical model of LR (input, weights, output); (b) Sigmoid link $y = 1/(1 + \exp(-x))$]
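As a concrete illustration, a minimal NumPy sketch of the LR likelihood in (2); the input, weights, and bias below are hypothetical toy values, not trained parameters.

```python
import numpy as np

def sigmoid(z):
    # Sigmoid link from the figure: y = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def lr_predict(x, W, c):
    # Eq. (2): p(y = 1 | x, theta) = g_theta(x) = sigmoid(W x + c)
    return sigmoid(np.dot(W, x) + c)

# Hypothetical toy example with D = 3 features
x = np.array([0.5, -1.2, 2.0])
W = np.array([0.3, 0.1, -0.4])   # illustrative weights, not trained
c = 0.05
print(lr_predict(x, W, c))       # probability of the positive class
```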
Optimization
In optimization, the regularized loss function is:
$$\mathcal{L}(\theta) = E + R \quad (3)$$
Loss function: $E = -\sum_{n=1}^{N} \log p(d_n \mid \theta)$
Regularization function: $R = -\log p(\theta)$.
Optimization: the process of finding the set of parameters $\hat{\theta}$ that minimize $\mathcal{L}$ on the training set
[Figure: loss surface showing initialization and local optima; adapted from the Deep Learning book by Goodfellow et al.]
Stochastic Gradient Descent (SGD)
$$\theta_{t+1} = \theta_t + \underbrace{\epsilon_t}_{\text{step size}} \underbrace{\Big( \nabla_\theta \log p(\theta_t) + \frac{N}{n} \sum_{i=1}^{n} \nabla_\theta \log p(d_{t_i} \mid \theta_t) \Big)}_{\text{gradient of regularized loss}} \quad (4)$$
Since $N$ is typically prohibitively large, a data mini-batch of size $n$ is randomly chosen to estimate the gradient.
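A minimal sketch of one update of Eq. (4), assuming hypothetical callables grad_log_prior and grad_log_lik that return the gradients of $\log p(\theta)$ and $\log p(d_i \mid \theta)$:

```python
import numpy as np

def sgd_step(theta, data, grad_log_prior, grad_log_lik, step_size, n):
    """One update of Eq. (4): theta <- theta + eps_t * ( grad log p(theta)
    + (N / n) * sum_i grad log p(d_{t_i} | theta) ), using a random mini-batch."""
    N = len(data)
    idx = np.random.choice(N, size=n, replace=False)             # mini-batch of size n
    grad = grad_log_prior(theta)                                  # gradient of the regularizer
    grad = grad + (N / n) * sum(grad_log_lik(data[i], theta) for i in idx)
    return theta + step_size * grad
```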
Limitation of LR
In complex real-world modeling, the simple parametric model of LR is often not expressive enough for robust generalization. More expressive parametric forms are needed.
From LR to FNN
The idea of Deep Neural Networks (DNNs): take the output of one LR as the input of another LR?!
LR is a zero-layer DNN
What's new in FNN
An L-layer FNN for multi-class classification puts a softmax function on the output of a set of function compositions:
$$p(y \mid x, \theta) = \mathrm{softmax}\big( g_{\theta_L} \circ \cdots \circ g_{\theta_0}(x) \big), \quad (5)$$
Parameters: $\theta = \{\theta_l\}_{l=0}^{L}$
Detailed differences:
1 More layers, as a composition of LRs (softmax for multi-class classification)
2 More choices of nonlinear functions
3 More gradient evaluations in back-propagation
Composition of LRs
An L-layer FNN as a set of function compositions:
$$g_{\theta_L} \circ \cdots \circ g_{\theta_0}(x), \quad (6)$$
where $\circ$ denotes function composition
[Figure: LR (input → output) vs. FNN (input → hidden → output)]
Softmax for multi-class classification: $\mathrm{softmax}(x) = e^{x} / \sum_i e^{x_i}$.
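A minimal sketch of the composition in (6) with a softmax output; the layer list and the ReLU nonlinearity below are illustrative choices (a sigmoid could be used instead):

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def fnn_forward(x, layers):
    # layers = [(W_0, c_0), ..., (W_L, c_L)]: the composition g_{theta_L} o ... o g_{theta_0}(x)
    h = x
    for W, c in layers[:-1]:
        h = np.maximum(0.0, W @ h + c)   # hidden layers; ReLU chosen here, sigmoid also works
    W, c = layers[-1]
    return softmax(W @ h + c)            # softmax output for multi-class classification
```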
Choices of nonlinear functions
The Rectified Linear Unit (ReLU) takes the form: $g_{\theta_l}(x) = \max(0, W_l x + c_l)$, with $\theta_l = (W_l, c_l)$.
[Figure: (a) Sigmoid link $y = 1/(1+\exp(-x))$; (b) ReLU link $y = \max(0, x)$]
Back-propagation
Backward learning of parameters, as opposed to forward inference of outputs
Chain rule of gradient computation:
$$\frac{\partial \mathcal{L}}{\partial W_0} = \frac{\partial \mathcal{L}}{\partial g} \frac{\partial g}{\partial W_0}$$
[Figure: LR (input → output) vs. FNN (input → hidden → output)]
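A minimal worked example of the chain rule above, for a one-hidden-layer network with a squared-error loss; the input, weights, and target are hypothetical toy values:

```python
import numpy as np

# Forward pass: one hidden layer (ReLU) and a linear output with squared-error loss
x  = np.array([1.0, -2.0])                  # input
W0 = np.array([[0.1, 0.3], [-0.2, 0.5]])    # hidden-layer weights
W1 = np.array([0.7, -0.4])                  # output-layer weights
y  = 1.0                                    # target

a    = W0 @ x                               # pre-activation
g    = np.maximum(0.0, a)                   # hidden activation g
yhat = W1 @ g                               # network output
L    = 0.5 * (yhat - y) ** 2                # loss

# Backward pass: chain rule dL/dW0 = (dL/dg) (dg/dW0)
dL_dyhat = yhat - y
dL_dg    = dL_dyhat * W1                    # back-propagate through the output layer
dg_da    = (a > 0).astype(float)            # ReLU derivative
dL_da    = dL_dg * dg_da
dL_dW0   = np.outer(dL_da, x)               # gradient w.r.t. the first-layer weights
dL_dW1   = dL_dyhat * g                     # gradient w.r.t. the output weights
```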
Short Notes on FNN
Feedforward Neural Networks (FNNs): the major differences from LR are
1 Model: FNNs as a composition of LRs
2 Learning: back-propagation as the chain rule of gradient evaluation
Extensions to CNN/RNN
From FNN to CNN/RNN, the powerful tools of deep learning
  CNN is a special class of FNN, typically applied to data with spatial covariates. The CNN employs the convolution operation at each layer of the FNN.
  RNN extends the FNN to incorporate time information. It may be used to parameterize the input-output relationship when the input is a sequence.
[Figure: a rough schematic comparison of FNN, CNN, and RNN (input, hidden, output); only major differences are illustrated. $g(\cdot)$ is the nonlinearity, $\ast$ is convolution, and matrix product is used in the FNN layer]
Convolutional Neural Networks
A CNN is typically composed of convolution and pooling operators. The CNN can take advantage of the properties of natural signals such as images and shapes, which exhibit high local correlations and rich shared components.
Given inputs in the form of multiple arrays $(x_{k_{l-1}})_{k_{l-1}=1}^{K_{l-1}}$ from the $(l-1)$-th layer, for the $k_l$-th filter bank $W_{k_l}$ in the $l$-th layer, the output is
$$g_{W_{k_l}}(x) = \mathrm{Pool}\Big( \sum_{k_{l-1}} W_{k_l} \ast x_{k_{l-1}} \Big),$$
where $\ast$ is the convolution operator and Pool is the pooling operator (e.g., max-pooling). The parameters for the $l$-th layer are $\theta_l = \{W_{k_l}\}$.
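A minimal NumPy sketch of one convolution-plus-pooling layer as defined above, assuming single-channel 2D arrays, 'valid' convolution, and non-overlapping max-pooling; a practical CNN would use an optimized library (e.g., Torch) instead of these explicit loops:

```python
import numpy as np

def conv2d(x, w):
    # 'valid' 2D convolution (cross-correlation) of a single-channel array x with filter w
    H, W_ = x.shape
    kh, kw = w.shape
    out = np.zeros((H - kh + 1, W_ - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

def max_pool(x, size=2):
    # non-overlapping max-pooling over size x size blocks
    H, W_ = x.shape
    H2, W2 = H // size, W_ // size
    return x[:H2 * size, :W2 * size].reshape(H2, size, W2, size).max(axis=(1, 3))

def cnn_layer(inputs, filters):
    # g_{W_{k_l}}(x) = Pool( sum over k_{l-1} of W_{k_l} * x_{k_{l-1}} ), one output map per filter
    return [max_pool(sum(conv2d(x, w) for x in inputs)) for w in filters]
```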
Applications of CNN
  Image classification: label images into given categories.
  Image segmentation: make dense predictions for per-pixel tasks like semantic segmentation [3].
  AlphaGo: the policy network takes a representation of the board position as its input, and outputs a probability distribution over legal moves [5].
[Figure: image segmentation; the policy network in AlphaGo. Figure adapted from [3] and [5]]
Recurrent Neural Networks
Consider an input sequence $X = \{x_1, \ldots, x_T\}$, where $x_t$ is the input data vector at time $t$. There is a corresponding hidden state vector $h_t$ at each time $t$, obtained by recursively applying the transition function $h_t = g(h_{t-1}, x_t)$
Weights (parameters: $\theta = \{W, U, V\}$)
  Encoding weights: connect input to hidden units
  Decoding weights: connect hidden units to output
  Recurrent weights: connect consecutive hidden units
Transition function: a gated activation function, such as a Long Short-Term Memory (LSTM) unit or a Gated Recurrent Unit (GRU).
[Figure: RNN with encoding, recurrent, and decoding weights connecting input, hidden, and output units]
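A minimal sketch of the recurrence $h_t = g(h_{t-1}, x_t)$ with a plain tanh transition; an LSTM or GRU would replace this transition with a gated one:

```python
import numpy as np

def rnn_forward(X, W, U, V, h0):
    # Plain RNN: h_t = tanh(W h_{t-1} + U x_t), output y_t = V h_t.
    # An LSTM or GRU would replace the tanh transition with a gated one.
    h, outputs = h0, []
    for x_t in X:                      # X = [x_1, ..., x_T], one input vector per time step
        h = np.tanh(W @ h + U @ x_t)   # recurrent weights W, encoding weights U
        outputs.append(V @ h)          # decoding weights V
    return outputs, h
```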
Applications of RNN
  Language modeling: in word-level language modeling, the network is trained to predict the next word in the sequence.
  Image captioning: learn a generative language model of the caption conditioned on an image.
  Sentiment analysis: sentence classification aims to assign a semantic category label to a whole sentence.
[Figure: language modeling on the sentence "the new york stock exchange did not fall apart"; image captioning with sampled captions such as "a tan dog is playing in the grass"]
Dropout
Problem: overfitting
  Flexibility due to a large number of parameters
  Overly confident decisions at prediction time
Solution: Dropout [6]
  Training stage: a unit is present with probability $p$
  Testing stage: the unit is always present and the weights are multiplied by $p$
[Figure adapted from [6]]
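A minimal sketch of the training/testing behavior described above; the function name dropout_forward is illustrative:

```python
import numpy as np

def dropout_forward(h, p, train=True):
    # Training: each unit in h is present (kept) with probability p.
    # Testing: all units are present and the activations are scaled by p,
    # which is equivalent to multiplying the outgoing weights by p.
    if train:
        mask = (np.random.random(h.shape) < p).astype(h.dtype)
        return h * mask
    return h * p
```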
Batch Normalization
Problem: internal covariate shift
  Internal covariate shift: the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down training by requiring lower learning rates, and makes it hard to train models with saturating nonlinearities.
Solution: Batch Normalization [1]
  Normalize the layer inputs for each training mini-batch
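A minimal sketch of normalizing a layer's inputs over a training mini-batch, with the learned scale (gamma) and shift (beta) of [1]; the running statistics used at test time are omitted:

```python
import numpy as np

def batch_norm(X, gamma, beta, eps=1e-5):
    # X: mini-batch of layer inputs, shape (batch_size, num_units)
    mu = X.mean(axis=0)                       # per-unit mean over the mini-batch
    var = X.var(axis=0)                       # per-unit variance over the mini-batch
    X_hat = (X - mu) / np.sqrt(var + eps)     # normalized inputs
    return gamma * X_hat + beta               # learned scale (gamma) and shift (beta)
```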
Summary
  Shallow -- Model: LR; Learning: SGD
  Deep -- Model: FNN, CNN, RNN, other DNNs; Learning: Back-propagation, Dropout, Batch Normalization
Related Materials
Software
  Torch 7 (http://torch.ch/); tutorial: Deep Learning with Torch: the 60-minute blitz
  Caffe/Theano/TensorFlow/many others...
Courses
  Stanford CS231n: Convolutional Neural Networks for Visual Recognition. http://cs231n.github.io/
Books
  For beginners of deep learning: Deep Learning, by I. Goodfellow, Y. Bengio and A. Courville. http://www.deeplearningbook.org/
Duke-Tsinghua Machine Learning Summer School: Deep Learning for Big Data
Duke Kunshan University, Kunshan, China, August 1-10, 2016
https://dukekunshan.edu.cn/en/events/machine-learning-2016
Organizers: Lawrence Carin (Duke University), Jun Zhu (Tsinghua University)
Instructors: Lawrence Carin (Duke University), David Carlson (Columbia University), Changyou Chen (Duke University), Xiaolin Hu (Tsinghua University), Jian Li (Tsinghua University), John Paisley (Columbia University), Liwei Wang (Peking University), Jun Zhu (Tsinghua University)
Topics: Convolutional Neural Networks, Recurrent Neural Networks, Feedforward Neural Networks, Variational Auto-Encoders, Restricted Boltzmann Machines, Deep Poisson Models, Bayesian Max-Margin Learning, Stochastic Optimization, Stochastic Gradient MCMC, Stochastic Variational Inference
My Research: Scalable Bayesian Methods for Deep Learning. https://sites.google.com/site/chunyuan24/
Thanks!
References
[1] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[2] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 2015.
[3] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In IEEE Conference on CVPR, 2015.
[4] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323:533-536, 1986.
[5] David Silver, Aja Huang, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016.
[6] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 2014.