Topics in AI (CPSC 532L): Multimodal Learning with Vision, Language and Sound. Lecture 3: Introduction to Deep Learning (continued)


Course Logistics
- Update on course registrations: 6 seats left now
- Microsoft Azure credits and tutorial
- Assignment 1: any questions?
- Question about constants in a computational graph: f(x) = a x + b (a and b treated as constants folded into the node) vs. f(x, a, b) = a x + b (a and b treated as explicit inputs to the graph)
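To make the constants question concrete, here is a minimal PyTorch sketch (my own illustration, not from the lecture): with f(x) = a x + b only x is a leaf of the graph, while with f(x, a, b) = a x + b the constants become graph inputs and also receive gradients.

```python
import torch

# f(x) = a*x + b with a, b as constants: only x is an input of the graph.
a_const, b_const = 2.0, 1.0
x = torch.tensor(3.0, requires_grad=True)
f = a_const * x + b_const
f.backward()
print(x.grad)                    # df/dx = a = 2.0

# f(x, a, b) = a*x + b: a and b are now leaf nodes of the graph too.
x = torch.tensor(3.0, requires_grad=True)
a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)
f = a * x + b
f.backward()
print(x.grad, a.grad, b.grad)    # df/dx = a = 2.0, df/da = x = 3.0, df/db = 1.0
```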

Short Review
- Introduced the basic building block of Neural Networks: the MLP / fully-connected (FC) layer
- How we stack these layers up to make a Deep NN
- Basic NN operations (implemented using a computational graph), e.g. x_i -> W x + b -> o -> sigmoid(o) -> a -> MSE loss(ŷ, y_i) -> l, with parameters W and b
- Different activation functions and the saturation problem

Prediction / Inference: function evaluation (a.k.a. ForwardProp)

Parameter Learning: (Stochastic) Gradient Descent, which needs derivatives, obtained by one of:
- Numerical differentiation (not accurate)
- Symbolic differentiation (intractable)
- AutoDiff Forward mode (computationally expensive)
- AutoDiff Backward mode / BackProp
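As a quick sanity check on the review above, here is a minimal NumPy sketch (my own, not from the slides) of the forward pass o = Wx + b, a = sigmoid(o), MSE loss, together with the backprop gradient for W and a numerical-difference estimate for comparison.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1))          # input
y = rng.normal(size=(3, 1))          # target
W = rng.normal(size=(3, 4))          # weights of one FC layer
b = np.zeros((3, 1))                 # biases

def sigmoid(o):
    return 1.0 / (1.0 + np.exp(-o))

def forward(W):
    o = W @ x + b                            # affine transformation
    a = sigmoid(o)                           # activation
    return 0.5 * np.mean((a - y) ** 2), a    # MSE loss

# ForwardProp
loss, a = forward(W)

# BackProp: chain rule through MSE -> sigmoid -> affine
dL_da = (a - y) / a.size
dL_do = dL_da * a * (1 - a)          # sigmoid'(o) = a * (1 - a)
dL_dW = dL_do @ x.T

# Numerical differentiation of one entry of W (slow, only approximate)
eps = 1e-6
W_pert = W.copy(); W_pert[0, 0] += eps
num = (forward(W_pert)[0] - loss) / eps
print(dL_dW[0, 0], num)              # the two estimates should closely agree
```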

Follow up (some intuition)
- Each layer is an affine transformation of the input, followed by an activation
- Another interpretation of the NN: its output is an affine combination of activation functions

[Slide plots illustrate this first with a linear activation, a(x) = x, and then with a ReLU activation, a(x) = max(0, x), building the network output as a weighted combination of shifted activations (e.g., with coefficients 1.2 and 0.6).]
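A minimal NumPy sketch of this interpretation (my own illustration; the coefficients 1.2 and 0.6 echo the numbers on the slide plots, the remaining weights are arbitrary): a 1-D network output written as an affine combination of ReLU activations.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

x = np.linspace(-2, 2, 9)

# One hidden layer with two ReLU units, then an affine output layer:
# f(x) = 1.2 * relu(w1*x + b1) + 0.6 * relu(w2*x + b2) + c
h1 = relu(1.0 * x + 0.5)
h2 = relu(-1.0 * x + 0.5)
f = 1.2 * h1 + 0.6 * h2 + 0.1

print(np.round(f, 3))   # a piecewise-linear function: a weighted sum of shifted ReLUs
```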

Regularization: L2 or L1 on the weights

L2 Regularization (learns a more dense, distributed representation):
R(W) = ||W||_2^2 = sum_i sum_j W_{i,j}^2

L1 Regularization (learns a sparse representation, with few non-zero weight elements):
R(W) = ||W||_1 = sum_i sum_j |W_{i,j}|

(Other regularizers are also possible.)

Example: x = [1, 1, 1, 1], W_1 = [1, 0, 0, 0], W_2 = [1/4, 1/4, 1/4, 1/4].
Then W_1 x^T = W_2 x^T = 1, so the two networks have identical data loss, but:
L2 regularizer: R_L2(W_1) = 1, R_L2(W_2) = 0.25
L1 regularizer: R_L1(W_1) = 1, R_L1(W_2) = 1
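The example above is easy to verify numerically; a minimal NumPy sketch (my own) reproducing the numbers:

```python
import numpy as np

x  = np.array([1.0, 1.0, 1.0, 1.0])
W1 = np.array([1.0, 0.0, 0.0, 0.0])
W2 = np.array([0.25, 0.25, 0.25, 0.25])

print(W1 @ x, W2 @ x)                          # identical outputs: 1.0, 1.0
print(np.sum(W1**2), np.sum(W2**2))            # L2 regularizer: 1.0 vs 0.25 (prefers dense W2)
print(np.sum(np.abs(W1)), np.sum(np.abs(W2)))  # L1 regularizer: 1.0 vs 1.0 (indifferent; allows sparse W1)
```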

Computational Graph: 1-layer with PReLU + Regularizer

[Graph on the slide: input x_i -> W x + b -> o -> PReLU(a, o) -> ŷ -> MSE loss(ŷ, y_i) -> l; regularizer R(W) -> r; total loss l_tot = l + r; parameters W, b and the PReLU slope a.]
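A minimal PyTorch sketch of this graph (my own; layer sizes and the regularization coefficient are arbitrary): one FC layer, a PReLU with learnable slope a, an MSE data loss l, an L2 regularizer r on W, and the total loss l_tot = l + r, with all gradients obtained by BackProp.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 4)                  # input x_i
y = torch.randn(1, 3)                  # target y_i

fc    = torch.nn.Linear(4, 3)          # o = W x + b
prelu = torch.nn.PReLU()               # PReLU(a, o), with learnable slope a

o     = fc(x)
y_hat = prelu(o)
l     = F.mse_loss(y_hat, y)           # data loss l
r     = 0.01 * (fc.weight ** 2).sum()  # regularizer r = lambda * ||W||_2^2
l_tot = l + r                          # total loss

l_tot.backward()                       # BackProp through the whole graph
print(fc.weight.grad.shape, fc.bias.grad.shape, prelu.weight.grad)
```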

Regularization: Batch Normalization
Normalize each mini-batch (using a Batch Normalization layer) by subtracting the empirically computed mean and dividing by the standard deviation for every dimension, so that samples are approximately unit Gaussian:
x̂^(k) = (x^(k) − E[x^(k)]) / sqrt(Var[x^(k)])
Benefit: improves learning (better gradients, higher learning rate). Why?
Typically inserted before the activation layer.
In practice, we also learn how to scale and offset: y^(k) = γ^(k) x̂^(k) + β^(k), where γ^(k) and β^(k) are the BN layer parameters.
[Ioffe and Szegedy, ICML 2015]

Activation Function: Sigmoid
a(x) = sigmoid(x) = 1 / (1 + e^(−x))
For a sigmoid gate a = sigmoid(x), backprop gives ∂L/∂x = (∂ sigmoid(x)/∂x) · ∂L/∂a.
Cons:
- Saturated neurons kill the gradients
- Outputs are not zero-centered
- Could be expensive to compute
* slide adopted from Li, Karpathy, Johnson's CS231n at Stanford

Activation Function: Sigmoid vs. Tanh
a(x) = tanh(x) = 2 · sigmoid(2x) − 1 = 2 / (1 + e^(−2x)) − 1
Pros:
- Squashes everything into the range [−1, 1]
- Centered around zero
- Has a well-defined gradient everywhere
Cons:
- Saturated neurons kill the gradients
* slide adopted from Li, Karpathy, Johnson's CS231n at Stanford
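A minimal NumPy sketch (my own) of the train-time BN forward pass for a mini-batch, including the learned scale γ and offset β; a real layer would also keep running statistics for use at test time.

```python
import numpy as np

def batchnorm_forward(X, gamma, beta, eps=1e-5):
    """X: (batch, features). Normalize every feature dimension over the mini-batch."""
    mean = X.mean(axis=0)                     # E[x^(k)] per dimension
    var = X.var(axis=0)                       # Var[x^(k)] per dimension
    X_hat = (X - mean) / np.sqrt(var + eps)   # approximately unit Gaussian
    return gamma * X_hat + beta               # learned scale and offset

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(32, 4))        # mini-batch of 32 samples, 4 features
Y = batchnorm_forward(X, gamma=np.ones(4), beta=np.zeros(4))
print(Y.mean(axis=0).round(3), Y.std(axis=0).round(3))  # ~0 mean, ~1 std per dimension
```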

Regularization: Dropout
Randomly set some neurons to zero in the forward pass, with the probability of dropping controlled by the dropout rate (between 0 and 1):
1. Compute the output of the linear/FC layer: o_i = W_i x + b_i
2. Compute a random binary mask using the dropout rate: m_i = rand(1, |o_i|) < dropout rate
3. Apply the mask to zero out some of the outputs: o_i = o_i ⊙ m_i
[Figure: Standard Neural Network vs. After Applying Dropout]

Why is this a good idea? Dropout trains an ensemble of models that share parameters: each binary mask (generated in the forward pass) is one model, trained on (approximately) one data point.

Dropout at test time: we want to integrate out all the models in the ensemble.
- Monte Carlo approximation: run many forward passes with different masks and average all the predictions.
- Equivalent to a single forward pass with all connections on and the outputs scaled by the same rate used for the mask at training time (for the derivation, see Lecture 6 of CS231n at Stanford).

[Srivastava et al., JMLR 2014] * adopted from slides of CS231n at Stanford
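A minimal NumPy sketch (my own) of the three steps above plus the test-time scaling; here p denotes the probability of keeping a unit (the convention used in the CS231n code these slides follow), so the dropout rate is 1 − p.

```python
import numpy as np

rng = np.random.default_rng(0)
W, b = rng.normal(size=(100, 20)), np.zeros(100)
x = rng.normal(size=20)
p = 0.5                                   # probability of keeping a unit (dropout rate = 1 - p)

# Training-time forward pass
o = W @ x + b                             # 1. output of the linear/FC layer
m = rng.random(o.shape) < p               # 2. random binary mask
o_train = o * m                           # 3. zero out the dropped outputs

# Test-time forward pass: all connections on, outputs scaled by p
o_test = (W @ x + b) * p

print(o_train[:5].round(2), o_test[:5].round(2))
```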

Deep Learning Terminology (example: Google's Inception network)
- Network structure: number and types of layers, forms of activation functions, dimensionality of each layer and connections (defines the computational graph). Generally kept fixed; requires some knowledge of the problem and of NNs to set sensibly (deeper = better).
- Loss function: the objective function being optimized (softmax, cross entropy, etc.). Requires knowledge of the nature of the problem.
- Parameters: trainable parameters of the network, including weights/biases of linear/FC layers, parameters of the activation functions, etc. Optimized using SGD or its variants.
- Hyper-parameters: parameters, including those of the optimization, that are not optimized directly as part of training (e.g., learning rate, batch size, dropout rate). Typically set by grid search.
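A minimal sketch of the grid search mentioned above (my own; evaluate() is a hypothetical stand-in for training the network with a given configuration and returning its validation loss).

```python
from itertools import product

learning_rates = [1e-1, 1e-2, 1e-3]
batch_sizes    = [32, 128]
dropout_rates  = [0.0, 0.5]

def evaluate(lr, batch_size, dropout):
    """Stand-in for: train with these hyper-parameters, return validation loss."""
    return (lr - 1e-2) ** 2 + 0.001 * batch_size + 0.01 * dropout   # dummy score

# Try every combination and keep the one with the lowest validation loss.
best = min(product(learning_rates, batch_sizes, dropout_rates),
           key=lambda cfg: evaluate(*cfg))
print("best (learning rate, batch size, dropout rate):", best)
```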

Loss Functions: this is where all the fun is, but there are too many to cover here; we will return to them later.

Monitoring Learning: Visualizing the (training) loss
* slide from Li, Karpathy, Johnson's CS231n at Stanford

Monitoring Learning: Visualizing the training vs. validation loss
- Big gap = overfitting. Solution: increase regularization
- No gap = underfitting. Solution: increase model capacity
- Small gap = ideal
* slide from Li, Karpathy, Johnson's CS231n at Stanford
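A minimal sketch of this check (my own, with synthetic per-epoch loss values and an arbitrary threshold): track the training and validation losses and read the gap between them.

```python
train_loss = [2.3, 1.4, 0.9, 0.5, 0.3, 0.2]   # synthetic per-epoch training losses
val_loss   = [2.4, 1.6, 1.2, 1.1, 1.1, 1.2]   # synthetic per-epoch validation losses

gap = val_loss[-1] - train_loss[-1]
if gap > 0.5:
    print(f"gap = {gap:.2f}: likely overfitting -> increase regularization")
elif gap < 0.05:
    print(f"gap = {gap:.2f}: likely underfitting -> increase model capacity")
else:
    print(f"gap = {gap:.2f}: looks about right")
```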