CSE 591: Introduction to Deep Learning in Visual Computing - Parag S. Chandakkar - Instructors: Dr. Baoxin Li and Ragav Venkatesan

Overview: Background (why another network structure? vanishing and exploding gradients); Deep Residual Networks; Identity Mappings (a way to make them deeper); Extensions; Future Scope.

Background: Three famous network architectures. Do we yet have a way to make networks arbitrarily deeper, for example a module that can be repeated to increase the network depth? Courtesy: Deep learning gets way deeper, ICML 2016 tutorial, Kaiming He

Background: Just appending layers to increase depth leads to an increase in the training error! Theoretically, the training error of a 56-layer network should be less than or equal to that of its 20-layer counterpart. This is not overfitting, nor can it be fully attributed to the vanishing/exploding gradient problem. Courtesy: He et al., Deep Residual Learning for Image Recognition, CVPR 2016

Background: A naïve solution for building arbitrarily deep networks exists by construction, and our solver/optimizer should be able to find it: copy all the layers from the learned shallower network into the deeper network, and let the remaining layers do nothing but an identity mapping. Do our best solvers ever find this solution in a reasonably deep network?
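To make this by-construction argument concrete, here is a minimal sketch in PyTorch (my own illustration, not from the slides): a deeper network built by copying a shallow MLP and appending extra linear layers initialized to the identity computes exactly the same function as the shallow one. The layer width and the identity_layer helper are illustrative assumptions.

import torch
import torch.nn as nn

d = 16
shallow = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))

def identity_layer(dim):
    # Hypothetical helper: an extra layer whose weights start as an identity mapping.
    layer = nn.Linear(dim, dim)
    with torch.no_grad():
        layer.weight.copy_(torch.eye(dim))
        layer.bias.zero_()
    return layer

# The deeper network reuses the shallow layers and appends identity-initialized ones.
deeper = nn.Sequential(*shallow, identity_layer(d), identity_layer(d))

x = torch.randn(4, d)
# At initialization the deeper network reproduces the shallow one exactly,
# so in principle its training error can be no worse.
assert torch.allclose(shallow(x), deeper(x), atol=1e-6)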

Background: Solvers are not able to find a solution as simple as identity mappings in deeper networks. This is the degradation problem in deep neural networks: it is not overfitting, and it is only partially caused by vanishing/exploding gradients.

Background: Assume a linear activation, a single-layer neural network with cost function C, independently initialized weights, and the same variance for every input feature. Then
Y = W_1 X_1 + W_2 X_2 + ... + W_n X_n
Var(Y) = n_in * Var(W_i) * Var(X_i)   (forward pass)
Var(dC/dX_i) = n_out * Var(W_i) * Var(dC/dY)   (backward pass)
Courtesy: Kaiming He, Deep learning gets way deeper, ICML 2016. Reading: Xavier Glorot, Yoshua Bengio, Understanding the difficulty of training deep feedforward neural networks, AISTATS 2010.
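As a quick numerical check of the forward-pass variance formula above, here is a small NumPy sketch; the layer width n_in and the values of Var(W_i) and Var(X_i) are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(0)
n_in, n_samples = 256, 100_000
var_w, var_x = 0.01, 1.0

# Y = W_1 X_1 + ... + W_n X_n with independent, zero-mean W_i and X_i.
W = rng.normal(0.0, np.sqrt(var_w), size=(n_samples, n_in))
X = rng.normal(0.0, np.sqrt(var_x), size=(n_samples, n_in))
Y = (W * X).sum(axis=1)

print(Y.var())               # empirical Var(Y), close to the prediction below
print(n_in * var_w * var_x)  # predicted Var(Y) = 256 * 0.01 * 1.0 = 2.56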

Deep Residual Networks: The key is to define a residual block that can be stacked to create a network of arbitrary depth.
[Block diagram: X -> W_1 -> BN, ReLU -> W_2 -> BN -> (+ X) -> ReLU, i.e. F(X) + X = H(X).]
With a plain net, we hope to discover the underlying mapping H(X). With a residual net, we hope to discover only the residual mapping F(X). Reading: He et al., Deep Residual Learning for Image Recognition, CVPR 2016.
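A hedged sketch of such a residual block in PyTorch, following the ordering in the diagram above; the use of 3x3 convolutions and the channel count are illustrative assumptions, not specified on the slide.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        # F(X): the residual mapping the block learns (W_1 -> BN, ReLU -> W_2 -> BN).
        residual = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))
        # H(X) = F(X) + X, followed by the final non-linearity f.
        return F.relu(residual + x)

block = ResidualBlock(64)
y = block(torch.randn(1, 64, 32, 32))   # same shape in and out, so blocks can be stacked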

Deep Residual Networks:
[Block diagram: x_l -> W_1 -> BN, ReLU -> W_2 -> BN -> (+ x_l) -> ReLU, with F(x_l, W_l) the residual branch.]
y_l = x_l + F(x_l, W_l)
x_{l+1} = f(y_l) = f(x_l + F(x_l, W_l))
There is an almost uninterrupted flow of gradients from any layer back to the input layer, and each block only has to find the residual mapping, which complements the identity mapping. Reading: He et al., Deep Residual Learning for Image Recognition, CVPR 2016.

Deep Residual Networks:
[Same block diagram as above: y_l = x_l + F(x_l, W_l), x_{l+1} = f(y_l) = f(x_l + F(x_l, W_l)).]
Recursively, x_L = f(f(... f(x_0 + F(x_0, W_0)) ...)), with f applied L-1 times.
dC/dx_l = (dC/dx_L) * (dx_L/dx_l) = (dC/dx_L) * ? (What is the issue? Because f is not an identity mapping, dx_L/dx_l is still a nested product through every f, so the gradient does not simply pass through the skip connections.)
Reading: He et al., Deep Residual Learning for Image Recognition, CVPR 2016.
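A tiny autograd example of the issue being hinted at, using a scalar residual branch F(x) = w*x purely for illustration: the gradient through one block is f'(y_l) * (1 + F'(x_l)), so the outer ReLU f still multiplies in at every layer and can zero out the otherwise direct skip path.

import torch

x = torch.tensor([-2.0, 0.5], requires_grad=True)
w = torch.tensor(0.3)

y = torch.relu(x + w * x)   # x_{l+1} = f(x_l + F(x_l)) with a toy residual F(x) = w*x
y.sum().backward()

# Where the pre-activation is negative, the final ReLU kills the gradient entirely;
# elsewhere the gradient is 1 + w = 1.3 (identity term plus residual term).
print(x.grad)   # tensor([0.0000, 1.3000])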

Identity Mappings in Residual Networks:
[Block diagram: x_l -> W_1 -> BN, ReLU -> W_2 -> BN -> (+ x_l) -> ReLU; y_l = x_l + F(x_l, W_l), x_{l+1} = f(y_l) = f(x_l + F(x_l, W_l)).]
Adverse effects of this phenomenon can be seen in ultra-deep networks of 1000+ layers. Solution: make f an identity mapping too. Options: just remove the last non-linearity, OR change the order of BN, ReLU and W. Reading: He et al., Identity Mappings in Deep Residual Networks, ECCV 2016.

Identity Mappings in Residual Networks: Solution: make f an identity mapping too.
[Pre-activation block diagram: x_l -> BN, ReLU -> W_1 -> BN, ReLU -> W_2 -> (+ x_l), so x_{l+1} = x_l + F(x_l, W_l).]
Options: just remove the last non-linearity, OR change the order to BN, ReLU, W.
dC/dx_l = (dC/dx_L) * (dx_L/dx_l) = (dC/dx_L) * (1 + d/dx_l Sum_{i=l}^{L-1} F(x_i, W_i))
Now the gradient flows more smoothly from a deeper layer L to a shallower layer l. This allows us to construct ultra-deep networks of 1000+ layers. Reading: He et al., Identity Mappings in Deep Residual Networks, ECCV 2016.
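A hedged sketch of the re-ordered ("pre-activation") block in PyTorch: as on the slide, the skip path is a pure identity and nothing follows the addition. Again, the 3x3 convolutions and channel count are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PreActResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)

    def forward(self, x):
        # F(x_l, W_l) with the re-ordered BN -> ReLU -> W arrangement, applied twice.
        residual = self.conv1(F.relu(self.bn1(x)))
        residual = self.conv2(F.relu(self.bn2(residual)))
        # x_{l+1} = x_l + F(x_l, W_l): the identity skip gives the gradient a direct "1" term.
        return x + residual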

Extensions: Deep networks with stochastic depth [1]: only a randomly chosen subset of the layers is executed during the training of each mini-batch, which allows us to train networks of greater depth (sketched below). Residual networks behave like ensembles of relatively shallow networks [2]: this paper claims that a residual network is essentially an ensemble of many shallow networks, providing a completely new perspective on the success of residual networks and on how they avoid the vanishing gradient problem. An intriguing paper! Reading: [1] Huang, Gao, et al. "Deep networks with stochastic depth." ECCV, 2016. [2] Veit, Andreas, et al. "Residual networks behave like ensembles of relatively shallow networks." NIPS, 2016.
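A minimal sketch of the stochastic-depth idea from [1]. The wrapper class, the fixed survival probability, and the test-time rescaling by p_survive are simplifications for illustration; the paper uses a survival probability that decreases linearly with depth.

import torch
import torch.nn as nn

class StochasticDepthBlock(nn.Module):
    def __init__(self, branch, p_survive=0.8):
        super().__init__()
        self.branch = branch          # any module computing the residual branch F(x)
        self.p_survive = p_survive

    def forward(self, x):
        if self.training:
            if torch.rand(1).item() < self.p_survive:
                return x + self.branch(x)   # keep the residual branch for this mini-batch
            return x                        # drop the layer entirely: identity only
        # At test time, scale the branch by its survival probability (expected value).
        return x + self.p_survive * self.branch(x)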

Future Work: Analysis of depth versus error rate: does the error rate keep decreasing as we increase depth? We can now build a 1200-layer neural network, but is it feasible to run inference with such a huge network? And, as always, is there anything we can do to increase accuracy further?