Introduction to Deep Neural Networks

Presenter: Chunyuan Li
Pattern Classification and Recognition (ECE 681.01), Duke University
April 2016

Outline
1. Background and Preliminaries
   Why DNNs?
   Model: Logistic Regression
   Learning: Optimization with Stochastic Gradient Descent
2. Deep Neural Networks
   What is the difference between an FNN and LR?
   Model: Going Deep
   Learning: Back-propagation
3. Advances
   Model: Convolutional/Recurrent Neural Networks
   Learning: Dropout/Batch Normalization

Recent Success
Surpass human performance on tasks such as classification on ImageNet; AlphaGo.
Interesting applications: Neural Style.
Figure: two style images, a content image, and the two corresponding synthesized images.

Predictive Models
Assume we are given data $D = \{d_1, \ldots, d_N\}$, where $d_i \triangleq (x_i, y_i)$:
Input object/feature: $x_i \in \mathbb{R}^D$.
Output label: $y_i \in \mathcal{Y}$, with $\mathcal{Y}$ being the discrete label space.
A model characterizes the relationship from $x$ to $y$ with a mapping parameterized by $\theta$.
In training, find a proper $\hat{\theta}$ on the training set, e.g., by maximizing $p(\hat{\theta} \mid D)$.
In testing, given a test input $\tilde{x}$ (with missing label $\tilde{y}$),
$p(\tilde{y} \mid \tilde{x}, D) = p(\tilde{y} \mid \tilde{x}, \hat{\theta})$.  (1)

Logistic Regression (LR)
Setup: input $x_i \in \mathbb{R}^D$, output label $y_i \in \{0, 1\}$.
For binary classification, the likelihood is
$p(y = 1 \mid x, \theta) \triangleq g_\theta(x) = \frac{1}{1 + \exp(-(W x + c))}$,  (2)
Regularizer: $p(\theta)$, e.g., an $\ell_2$ weight penalty/decay.
Parameters: $\theta \triangleq \{W, c\}$.
Figure: (a) graphical model of LR (input, weights, output); (b) the sigmoid link $y = 1/(1 + \exp(-x))$.
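As a concrete reference point, here is a minimal NumPy sketch of the model in Eq. (2); the function and variable names are illustrative, not from the slides.

```python
import numpy as np

def sigmoid(z):
    # The sigmoid link of panel (b): y = 1 / (1 + exp(-z)).
    return 1.0 / (1.0 + np.exp(-z))

def lr_predict_proba(x, W, c):
    # Eq. (2): p(y = 1 | x, theta) = sigmoid(W x + c), with theta = {W, c}.
    return sigmoid(W @ x + c)

# Example with D = 3 input features.
rng = np.random.default_rng(0)
W, c = rng.normal(size=3), 0.0
x = rng.normal(size=3)
print(lr_predict_proba(x, W, c))  # a probability in (0, 1)
```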

Optimization
In optimization, the regularized loss function is
$L(\theta) \triangleq E + R$.  (3)
Loss term: $E = -\sum_{n=1}^{N} \log p(d_n \mid \theta)$ (negative log-likelihood).
Regularization term: $R \triangleq -\log p(\theta)$, up to constants.
Optimization: the process of finding the set of parameters $\hat{\theta}$ that minimizes $L$ on the training set.
Figure: a one-dimensional loss surface illustrating initialization and local optima (adapted from the Deep Learning book by Goodfellow et al.).
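The loss in Eq. (3) can be written out for the LR model above. The sketch below assumes a Gaussian prior on W, i.e., an l2 penalty with an assumed hyperparameter `lam`, and reuses `sigmoid` from the previous sketch.

```python
def regularized_loss(W, c, X, y, lam=1e-2):
    # E: negative log-likelihood of the Bernoulli model in Eq. (2),
    # summed over the N training pairs (rows of X, entries of y).
    p = sigmoid(X @ W + c)
    E = -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    # R: -log p(theta) up to constants; here an l2 weight penalty.
    R = 0.5 * lam * np.sum(W ** 2)
    return E + R  # L(theta) = E + R, Eq. (3)
```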

Stochastic Gradient Descent (SGD)
$\theta_{t+1} = \theta_t + \epsilon_t \Big( \nabla_\theta \log p(\theta_t) + \frac{N}{n} \sum_{i=1}^{n} \nabla_\theta \log p(d_{t_i} \mid \theta_t) \Big)$,  (4)
where $\epsilon_t$ is the step size and the term in parentheses is a mini-batch estimate of the gradient of the regularized log-posterior, i.e., the negative gradient of the loss $L$, so the update descends $L$.
Because $N$ is typically prohibitively large, a data mini-batch of size $n$ is randomly chosen to estimate the gradient.
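Below is a minimal mini-batch SGD loop for the regularized LR loss above, written as descent on L(θ), which is equivalent to the ascent form of Eq. (4); the batch size, step size, and iteration count are illustrative defaults.

```python
def lr_minibatch_gradient(W, c, Xb, yb, N, lam=1e-2):
    # Mini-batch estimate of the gradient of L(theta): the likelihood term
    # is rescaled by N / n, as in Eq. (4).
    n = Xb.shape[0]
    p = sigmoid(Xb @ W + c)
    err = p - yb                                  # d(NLL)/d(logit) for each example
    gW = (N / n) * (Xb.T @ err) + lam * W
    gc = (N / n) * np.sum(err)
    return gW, gc

def sgd(X, y, n=32, step=1e-3, num_steps=1000, lam=1e-2, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    W, c = np.zeros(D), 0.0
    for t in range(num_steps):
        idx = rng.choice(N, size=n, replace=False)  # random mini-batch of size n
        gW, gc = lr_minibatch_gradient(W, c, X[idx], y[idx], N, lam)
        W -= step * gW                              # theta_{t+1} = theta_t - eps_t * gradient
        c -= step * gc
    return W, c
```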

Limitation of LR
In complex real-world modeling, the simple parametric form of LR is often not expressive enough for robust generalization. More expressive parametric forms are needed.

From LR to FNN
The idea of Deep Neural Networks (DNNs): take the output of one LR as the input of another LR.
In this view, LR is a zero-layer DNN.

What's new in an FNN
An L-layer FNN for multi-class classification puts a softmax function on the output of a set of function compositions:
$p(y \mid x, \theta) = \mathrm{softmax}\big( g_{\theta_L} \circ \cdots \circ g_{\theta_0}(x) \big)$,  (5)
Parameters: $\theta \triangleq \{\theta_l\}_{l=0}^{L}$.
Detailed differences:
1. More layers, as a composition of LRs (softmax for multi-class classification).
2. More choices of nonlinear functions.
3. More gradient evaluations in back-propagation.

Composition of LRs
An L-layer FNN is a set of function compositions:
$g_{\theta_L} \circ \cdots \circ g_{\theta_0}(x)$,  (6)
where $\circ$ denotes function composition.
Figure: LR (input → output) vs. FNN (input → hidden → output).
Softmax for multi-class classification: $\mathrm{softmax}(x)_j \triangleq e^{x_j} / \sum_i e^{x_i}$.
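A minimal sketch of Eqs. (5)-(6): a stack of LR-style layers with sigmoid nonlinearities and a numerically stable softmax on top. The parameter layout, a list of (W_l, c_l) pairs, is an assumption for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # softmax(z)_j = exp(z_j) / sum_i exp(z_i); shifting by max(z) avoids overflow.
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def fnn_predict_proba(x, params):
    # params = [(W_0, c_0), ..., (W_L, c_L)]: each hidden layer is "an LR feeding
    # the next LR"; the last layer's output goes through the softmax, as in Eq. (5).
    h = x
    for W, c in params[:-1]:
        h = sigmoid(W @ h + c)        # g_{theta_l}(h) for the hidden layers
    W_L, c_L = params[-1]
    return softmax(W_L @ h + c_L)     # class probabilities
```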

Choices of Nonlinear Functions
The Rectified Linear Unit (ReLU) takes the form $g_{\theta_l}(x) = \max(0, W_l x + c_l)$, with $\theta_l = (W_l, c_l)$.
Figure: (a) the sigmoid link $y = 1/(1 + \exp(-x))$; (b) the ReLU link $y = \max(0, x)$.
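In code, swapping the sigmoid for a ReLU only changes the elementwise nonlinearity of a layer; a one-line sketch (name illustrative):

```python
def relu_layer(h, W_l, c_l):
    # g_{theta_l}(h) = max(0, W_l h + c_l), applied elementwise.
    return np.maximum(0.0, W_l @ h + c_l)
```

Any hidden layer in the FNN sketch above could use this in place of the sigmoid layer.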

Back-propagation
Backward learning of parameters, as opposed to forward inference of outputs.
Chain rule of gradient computation:
$\frac{\partial L}{\partial W_0} = \frac{\partial L}{\partial g} \cdot \frac{\partial g}{\partial W_0}$
Figure: LR (input → output) vs. FNN (input → hidden → output).
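A minimal worked example of the chain rule above for a one-hidden-layer FNN with a sigmoid hidden layer, a softmax output, and a cross-entropy loss; the loss choice and all names are assumptions for illustration, and `sigmoid`/`softmax` are as defined in the earlier sketch.

```python
def backprop_one_hidden(x, y_onehot, W0, c0, W1, c1):
    # Forward pass: the "forward inference of outputs".
    h = sigmoid(W0 @ x + c0)           # hidden activations, g = g_{theta_0}(x)
    p = softmax(W1 @ h + c1)           # output class probabilities
    loss = -np.sum(y_onehot * np.log(p))

    # Backward pass: the "backward learning of parameters" via the chain rule.
    d_logits = p - y_onehot            # dL/d(output pre-activation) for softmax + cross-entropy
    dW1 = np.outer(d_logits, h)
    dc1 = d_logits
    dg = W1.T @ d_logits               # dL/dg, pushed back to the hidden layer
    d_pre = dg * h * (1.0 - h)         # times the sigmoid derivative dg/d(pre-activation)
    dW0 = np.outer(d_pre, x)           # dL/dW_0 = (dL/dg) * (dg/dW_0)
    dc0 = d_pre
    return loss, (dW0, dc0, dW1, dc1)
```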

Short Notes on FNNs
Feedforward Neural Networks (FNNs), major differences from LR:
1. Model: FNNs as compositions of LRs.
2. Learning: back-propagation as the chain rule of gradient evaluation.

Extensions to CNNs/RNNs
From FNNs to CNNs/RNNs, which are the powerful tools of deep learning:
A CNN is a special class of FNN, typically applied to data with spatial covariates. The CNN employs a convolution operation at each layer of the FNN.
An RNN extends the FNN to incorporate temporal information. It may be used to parameterize the input-output relationship when the input is a sequence.
Figure: a rough schematic comparison of NNs (input → hidden → output for FNN, CNN, and RNN); only the major differences are illustrated. $g(\cdot)$ is a nonlinearity, $*$ denotes convolution, and $\times$ denotes matrix product.

Convolutional Neural Networks
A CNN is typically composed of convolution and pooling operators. The CNN can take advantage of the properties of natural signals such as images and shapes, which exhibit strong local correlations and rich shared components.
Given inputs in the form of multiple arrays $(x_{k_{l-1}})_{k_{l-1}=1}^{K_{l-1}}$ from the $(l-1)$-th layer, the output of the $k_l$-th filter bank $W_{k_l}$ in the $l$-th layer is
$g_{W_{k_l}}(x) = \mathrm{Pool}\Big( \sum_{k_{l-1}} W_{k_l} * x_{k_{l-1}} \Big)$,
where $*$ is the convolution operator and Pool is the pooling operator (e.g., max-pooling). The parameters for the $l$-th layer are $\theta_l \triangleq \{W_{k_l}\}$.
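For intuition, here is a naive single-channel, single-filter version of the layer above; a real layer sums the convolutions over all input channels k_{l-1} and stacks K_l filters. As is common in deep-learning code, the "convolution" below is implemented as cross-correlation.

```python
import numpy as np

def conv2d_valid(x, w):
    # Naive 2-D "valid" cross-correlation of input x (H x W) with filter w (kh x kw).
    H, Wd = x.shape
    kh, kw = w.shape
    out = np.zeros((H - kh + 1, Wd - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

def max_pool(x, size=2):
    # Non-overlapping max-pooling over size x size windows.
    H, Wd = x.shape
    H2, W2 = H // size, Wd // size
    x = x[:H2 * size, :W2 * size]
    return x.reshape(H2, size, W2, size).max(axis=(1, 3))

def conv_pool_layer(x, w):
    # Single-channel, single-filter version of g_{W_{k_l}}(x) = Pool(W_{k_l} * x).
    return max_pool(conv2d_valid(x, w))
```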

Applications of CNNs
Image classification: label images into given categories.
Image segmentation: make dense predictions for per-pixel tasks such as semantic segmentation [3].
AlphaGo: the policy network takes a representation of the board position as its input and outputs a probability distribution over legal moves [5].
Figure: image segmentation and the policy network in AlphaGo (adapted from [3] and [5]).

Recurrent Neural Networks
Consider an input sequence $X = \{x_1, \ldots, x_T\}$, where $x_t$ is the input data vector at time $t$. There is a corresponding hidden state vector $h_t$ at each time $t$, obtained by recursively applying the transition function $h_t = g(h_{t-1}, x_t)$.
Weights (parameters $\theta \triangleq \{W, U, V\}$):
Encoding weights: connect the input to the hidden units.
Decoding weights: connect the hidden units to the output.
Recurrent weights: connect consecutive hidden units.
Transition function: a gated activation function, such as a Long Short-Term Memory (LSTM) unit or a Gated Recurrent Unit (GRU).
Figure: an RNN with input, hidden, and output layers, annotated with encoding, recurrent, and decoding weights.
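A minimal vanilla-RNN forward pass using the W/U/V naming from the slide, with a tanh transition in place of the gated LSTM/GRU units mentioned above; the bias terms b and c and the softmax output are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def rnn_forward(X_seq, W, U, V, b, c):
    # X_seq: sequence of input vectors x_1, ..., x_T.
    # W: encoding weights (input -> hidden), U: recurrent weights (hidden -> hidden),
    # V: decoding weights (hidden -> output).
    h = np.zeros(U.shape[0])                 # initial hidden state h_0
    outputs = []
    for x_t in X_seq:
        h = np.tanh(W @ x_t + U @ h + b)     # h_t = g(h_{t-1}, x_t)
        outputs.append(softmax(V @ h + c))   # per-step output, e.g. next-word probabilities
    return outputs, h
```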

Applications of RNNs
Language modeling: in word-level language modeling, the network is trained to predict the next word in the sequence.
Image captioning: learn a generative language model of the caption, conditioned on an image.
Sentiment analysis: sentence classification aims to assign a semantic category label to a whole sentence.
Figure: (left) language modeling, with predicted next-word probabilities for the sentence "the new york stock exchange did not fall apart"; (right) image captioning, with sampled captions such as "a tan dog is playing in the grass", "a tan dog is playing with a red ball in the grass", "a tan dog with a red collar is running in the grass", "a yellow dog runs through the grass", "a yellow dog is running through the grass", and "a brown dog is running through the grass".

Dropout
Problem: overfitting.
Flexibility due to a large number of parameters.
The model makes overly confident decisions at prediction time.
Solution: Dropout [6].
Training stage: a unit is present with probability $p$.
Testing stage: the unit is always present and its weights are multiplied by $p$.
Figure adapted from [6].
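A sketch of the two stages described above, with the keep probability p as on the slide; applying the scaling to the activations at test time is equivalent to multiplying the weights by p.

```python
import numpy as np

def dropout_train(h, p, rng):
    # Training: each unit is kept (present) with probability p.
    mask = (rng.random(h.shape) < p).astype(h.dtype)
    return h * mask

def dropout_test(h, p):
    # Testing: every unit is present; scaling by p matches the expected
    # value of the training-time masked activation.
    return h * p

# Example usage on a layer of 5 hidden activations.
rng = np.random.default_rng(0)
h = np.ones(5)
print(dropout_train(h, p=0.8, rng=rng), dropout_test(h, p=0.8))
```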

Batch Normalization
Problem: internal covariate shift.
Internal covariate shift: the distribution of each layer's inputs changes during training as the parameters of the previous layers change. This slows down training by requiring lower learning rates and makes it hard to train models with saturating nonlinearities.
Solution: Batch Normalization [1].
Perform the normalization of layer inputs for each training mini-batch.
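A minimal forward pass of the per-mini-batch normalization described above; gamma and beta are the usual learnable scale and shift, and eps a small constant (names assumed; the running statistics used at test time are omitted).

```python
import numpy as np

def batch_norm_forward(H, gamma, beta, eps=1e-5):
    # H: mini-batch of layer inputs, shape (batch_size, num_units).
    mu = H.mean(axis=0)                      # per-unit mini-batch mean
    var = H.var(axis=0)                      # per-unit mini-batch variance
    H_hat = (H - mu) / np.sqrt(var + eps)    # normalized layer inputs
    return gamma * H_hat + beta              # learnable scale and shift
```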

Summary

            Model          Learning
Shallow     LR             SGD
Deep        FNN            Back-propagation
            CNN, RNN       Dropout
            Other DNNs     Batch Normalization

Related Materials
Software: Torch 7 (http://torch.ch/); tutorial: "Deep Learning with Torch: the 60-minute blitz". Also Caffe, Theano, TensorFlow, and many others.
Courses: Stanford CS231n: Convolutional Neural Networks for Visual Recognition. http://cs231n.github.io/
Books: for beginners of deep learning, Deep Learning by I. Goodfellow, Y. Bengio and A. Courville. http://www.deeplearningbook.org/

Duke-Tsinghua Machine Learning Summer School: Deep Learning for Big Data
Duke Kunshan University, Kunshan, China, August 1-10, 2016
https://dukekunshan.edu.cn/en/events/machine-learning-2016
Organizers: Lawrence Carin (Duke University), Jun Zhu (Tsinghua University)
Instructors: Lawrence Carin (Duke University), David Carlson (Columbia University), Changyou Chen (Duke University), Xiaolin Hu (Tsinghua University), Jian Li (Tsinghua University), John Paisley (Columbia University), Liwei Wang (Peking University), Jun Zhu (Tsinghua University)
Topics: Convolutional Neural Networks, Recurrent Neural Networks, Feedforward Neural Networks, Variational Auto-Encoders, Restricted Boltzmann Machines, Deep Poisson Models, Bayesian Max-Margin Learning, Stochastic Optimization, Stochastic Gradient MCMC, Stochastic Variational Inference

My research: scalable Bayesian methods for deep learning. https://sites.google.com/site/chunyuan24/
Thanks!

References
[1] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[2] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 2015.
[3] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In IEEE Conference on CVPR, 2015.
[4] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323:533-536, 1986.
[5] David Silver, Aja Huang, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016.
[6] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 2014.