Lecture 17: Neural Networks and Deep Learning

UVA CS 6316 / CS 4501-004 Machine Learning Fall 2016 Lecture 17: Neural Networks and Deep Learning Jack Lanchantin Dr. Yanjun Qi 1

Neurons 1-Layer Neural Network Multi-layer Neural Network Loss Functions Backpropagation Nonlinearity Functions NNs in Practice 2

[Diagram: input x = (x1, x2, x3) → weights W1 → W2 → w3 → output ŷ] 3

Logistic Regression: P(Y=1|x) = e^(wx+b) / (1 + e^(wx+b)). The sigmoid function (aka logistic, logit, S, soft-step). 4

Expanded Logistic Regression: input x with p = 3 components x1, x2, x3 plus a constant +1 input; multiply by weights w1, w2, w3 and bias b; the summing function gives z, and the sigmoid function gives ŷ = P(Y=1|x,w). z = w^T x + b (1x1 = 1xp · px1), ŷ = sigmoid(z) = e^z / (1 + e^z). 5
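A minimal NumPy sketch of the computation on this slide; the numeric values are made up for illustration, and the function and variable names are mine rather than the lecture's.

    import numpy as np

    def sigmoid(z):
        # sigmoid(z) = e^z / (1 + e^z) = 1 / (1 + e^-z)
        return 1.0 / (1.0 + np.exp(-z))

    # p = 3 input features (illustrative values)
    x = np.array([0.5, -1.2, 3.0])   # input x, shape (p,)
    w = np.array([0.1, 0.4, -0.3])   # weights, shape (p,)
    b = 0.2                          # bias

    z = w @ x + b                    # z = w^T x + b, a scalar
    y_hat = sigmoid(z)               # y_hat = P(Y=1 | x, w)
    print(z, y_hat)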

[Diagram: a single neuron: inputs x1, x2, x3 and a +1 bias input, weights w1, w2, w3 and b1, summed (Σ) to give z] 6

Neurons 7 http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf

Neuron: input x = (x1, x2, x3), weights w1, w2, w3, summed (Σ) to give z, then a sigmoid gives ŷ. From here on, we leave out the bias for simplicity. z = w^T x (1x1 = 1xp · px1), ŷ = sigmoid(z) = e^z / (1 + e^z). 8

Block View of a Neuron: the input x and a parameterized block w feed a dot-product block producing z, followed by a sigmoid block producing the output ŷ. z = w^T x (1x1 = 1xp · px1), ŷ = sigmoid(z) = e^z / (1 + e^z). 9

Neuron Representation: inputs x1, x2, x3, weights w1, w2, w3, a summing unit Σ, and a nonlinearity producing ŷ. The linear transformation and the nonlinearity together are typically considered a single neuron. 10

Neuron Representation (block view): x → [w, dot product, sigmoid] → ŷ. The linear transformation and the nonlinearity together are typically considered a single neuron. 11

Neurons 1-Layer Neural Network Multi-layer Neural Network Loss Functions Backpropagation Nonlinearity Functions NNs in Practice 12

[Diagram: input x = (x1, x2, x3) → weights W1 → W2 → w3 → output ŷ] 13

1-Layer Neural Network (with 4 neurons): input x = (x1, x2, x3) is multiplied by the weight matrix W; four summing units Σ produce the vector z (the linear part), and element-wise sigmoids produce the four outputs ŷ. This counts as 1 layer. 14

1-Layer Neural Network (with 4 neurons): p = 3 inputs, d = 4 neurons. z = W^T x (dx1 = dxp · px1), ŷ = sigmoid(z) = e^z / (1 + e^z), applied element-wise on the vector z. 15

1-Layer Neural Network (with 4 neurons): same network, highlighting the weight matrix W. z = W^T x (dx1 = dxp · px1), ŷ = sigmoid(z) = e^z / (1 + e^z), element-wise on z. 16

Block View of a Neural Network: the input x and W (now a matrix) feed a dot-product block producing z (now a vector), followed by a sigmoid block producing the output ŷ. z = W^T x (dx1 = dxp · px1), ŷ = sigmoid(z) = e^z / (1 + e^z), element-wise. 17
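A short NumPy sketch of a single fully connected layer as described above; the sizes and random values are illustrative, and the bias is omitted as on the slides.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    p, d = 3, 4                       # p inputs, d neurons
    rng = np.random.default_rng(0)
    W = rng.normal(size=(p, d))       # weight matrix, shape (p, d)
    x = rng.normal(size=(p,))         # input vector, shape (p,)

    z = W.T @ x                       # z = W^T x, shape (d,)
    y_hat = sigmoid(z)                # element-wise sigmoid, shape (d,)
    print(y_hat.shape)                # (4,)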

Neurons 1-Layer Neural Network Multi-layer Neural Network Loss Functions Backpropagation Nonlinearity Functions NNs in Practice 18

W1 x1 x x2 W2 w3 ŷ x3 19

Multi-Layer Neural Network (Multi-Layer Perceptron (MLP) Network): the weight subscript represents the layer number. Input x = (x1, x2, x3) → hidden layer (W1) → output layer (w2) → ŷ. This is a 2-layer NN. 20

Multi-Layer Neural Network (MLP): input x = (x1, x2, x3) → 1st hidden layer (W1) → 2nd hidden layer (W2) → output layer (w3) → ŷ. This is a 3-layer NN. 21

Multi-Layer Neural Network (MLP): hidden layer outputs h1 (via W1) and h2 (via W2), then the output ŷ (via w3). z1 = W1^T x, h1 = sigmoid(z1); z2 = W2^T h1, h2 = sigmoid(z2); z3 = w3^T h2, ŷ = sigmoid(z3). 22
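A compact NumPy sketch of this forward pass; the layer sizes and random weights are placeholders of my own choosing.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    p, d1, d2 = 3, 4, 4                    # illustrative layer sizes
    x  = rng.normal(size=(p,))
    W1 = rng.normal(size=(p, d1))
    W2 = rng.normal(size=(d1, d2))
    w3 = rng.normal(size=(d2,))

    z1 = W1.T @ x;  h1 = sigmoid(z1)       # 1st hidden layer
    z2 = W2.T @ h1; h2 = sigmoid(z2)       # 2nd hidden layer
    z3 = w3 @ h2;   y_hat = sigmoid(z3)    # output layer (scalar output)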

Multi-Class Output MLP: input x → h1 (W1) → h2 (W2) → a vector output ŷ (W3). z1 = W1^T x, h1 = sigmoid(z1); z2 = W2^T h1, h2 = sigmoid(z2); z3 = W3^T h2, ŷ = sigmoid(z3). 23

Block View of an MLP: x and W1 feed a dot product giving z1, and the 1st hidden layer nonlinearity gives h1; W2 gives z2 and h2 (2nd hidden layer); W3 gives z3 and the output y (output layer). 24

Deep Neural Networks (i.e. > 1 hidden layer): researchers have successfully trained object classifiers with as many as 1,000 layers. 25

Neurons 1-Layer Neural Network Multi-layer Neural Network Loss Functions Backpropagation Nonlinearity Functions NNs in Practice 26

[Diagram: input x = (x1, x2, x3) → W1 → W2 → w3 → ŷ, now followed by a loss E(ŷ)] 27

Binary Classification Loss: the network outputs ŷ = P(y=1|X,W); this example is for a single sample x. E = loss = -log P(Y = y | X = x) = -y log(ŷ) - (1 - y) log(1 - ŷ), where y is the true output. 28

Regression Loss: the output unit is a plain sum Σ producing ŷ (no sigmoid). E = loss = (1/2)(y - ŷ)^2, where y is the true output. 29

Multi-Class Classification Loss: the network produces z1, z2, z3 and outputs ŷ1, ŷ2, ŷ3 with ŷi = P(yi = 1 | x), here with K = 3 classes. The softmax function is a normalizing function which converts each class score into a probability. E = loss = -Σ_{j=1..K} yj ln ŷj, where yj is 0 for all classes except the true class. 30
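The three losses above, written out in NumPy as a hedged sketch (the function names are mine; the slides only give the formulas):

    import numpy as np

    def binary_cross_entropy(y, y_hat):
        # E = -y log(y_hat) - (1 - y) log(1 - y_hat), with y in {0, 1}
        return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

    def squared_error(y, y_hat):
        # E = 1/2 (y - y_hat)^2
        return 0.5 * (y - y_hat) ** 2

    def softmax(z):
        e = np.exp(z - np.max(z))        # subtract max for numerical stability
        return e / e.sum()

    def multiclass_cross_entropy(y_onehot, z):
        # E = -sum_j y_j ln(y_hat_j), with y_hat = softmax(z)
        y_hat = softmax(z)
        return -np.sum(y_onehot * np.log(y_hat))

    # toy example: K = 3 classes, true class is index 1
    z = np.array([2.0, 1.0, -1.0])
    y_onehot = np.array([0.0, 1.0, 0.0])
    print(multiclass_cross_entropy(y_onehot, z))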

Neurons 1-Layer Neural Network Multi-layer Neural Network Loss Functions Backpropagation Nonlinearity Functions NNs in Practice 31

[Diagram: input x = (x1, x2, x3) → W1 → W2 → w3 → ŷ → loss E(ŷ)] 32

Training Neural Networks: How do we learn the optimal weights WL for our task? Gradient descent: WL(t+1) = WL(t) - η ∂E/∂WL(t). But how do we get the gradients of the lower layers? Backpropagation! Repeated application of the chain rule of calculus; locally minimizes the objective; requires all blocks of the network to be differentiable. 33 LeCun et al., Efficient BackProp, 1998.

Backpropagation Intro 34 http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf

Backpropagation Intro 35 http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf

Backpropagation Intro 36 http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf

Backpropagation Intro 37 http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf

Backpropagation Intro 38 http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf

Backpropagation Intro 39 http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf

Backpropagation Intro 40 http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf

Backpropagation Intro 41 http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf

Backpropagation Intro 42 http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf

Backpropagation Intro 43 http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf

Backpropagation Intro 44 http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf

Backpropagation Intro 45 http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf

Backpropagation Intro. Tells us: increasing x by 1 decreases f by 4 (to first order), i.e. ∂f/∂x = -4. 46 http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf
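The intro slides above work through a small computational graph that is not reproduced in this transcript. As an assumption, the classic cs231n example consistent with the statement (∂f/∂x = -4) is f(x, y, z) = (x + y)·z evaluated at x = -2, y = 5, z = -4; the sketch below applies the chain rule to it by hand.

    # Forward pass: q = x + y, f = q * z
    x, y, z = -2.0, 5.0, -4.0
    q = x + y          # q = 3
    f = q * z          # f = -12

    # Backward pass (chain rule), starting from df/df = 1
    df_dq = z          # d(q*z)/dq = z = -4
    df_dz = q          # d(q*z)/dz = q = 3
    df_dx = df_dq * 1  # dq/dx = 1, so df/dx = -4
    df_dy = df_dq * 1  # dq/dy = 1, so df/dy = -4

    # df/dx = -4: increasing x by 1 decreases f by about 4, as the slide states.
    print(df_dx, df_dy, df_dz)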

Backpropagation (binary classification example): network with inputs x1, x2, x3, weights W1 and w2, output ŷ = P(y=1|X,W). Example on a 1-hidden-layer NN for binary classification. 47

Backpropagation (binary classification example) x W1 * z1 h1 w2 * z2 y 48

Backpropagation (binary classification example) x W1 * z1 h1 w2 * z2 y. E = loss (the binary cross-entropy from before). Gradient descent to minimize the loss: we need ∂E/∂W1 and ∂E/∂w2. Need to find these! 49

Backpropagation (binary classification example) x W1 * z1 h1 w2 * z2 y 50

Backpropagation (binary classification example) x W1 * z1 h1 w2 * z2 y. ∂E/∂W1 = ?? ∂E/∂w2 = ?? 51

Backpropagation (binary classification example) x W1 * z1 h1 w2 * z2 y. ∂E/∂W1 = ?? ∂E/∂w2 = ?? Exploit the chain rule! 52

Backpropagation (binary classification example) x W1 * z1 h1 w2 * z2 y 53

Backpropagation (binary classification example) x W1 * z1 h1 w2 * z2 y chain rule 54

Backpropagation (binary classification example) x W1 * z1 h1 w2 * z2 y 55

Backpropagation (binary classification example) x W1 * z1 h1 w2 * z2 y 56

Backpropagation (binary classification example) x W1 * z1 h1 w2 * z2 y 57

Backpropagation (binary classification example) x W1 * z1 h1 w2 * z2 y 58

Backpropagation (binary classification example) x W1 * z1 h1 w2 * z2 y 59

Backpropagation (binary classification example) x W1 * z1 h1 w2 * z2 y 60

Backpropagation (binary classification example) x W1 * z1 h1 w2 * z2 y 61

Backpropagation (binary classification example) x W1 * z1 h1 w2 * z2 y already computed 62
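The slides above carry out this derivation step by step in figures that are not captured by the transcript. As a hedged NumPy sketch (the layer sizes, the cross-entropy loss, and all variable names are assumptions on my part), the forward and backward passes for this 1-hidden-layer binary classifier look like:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    p, d = 3, 4
    x  = rng.normal(size=(p,))
    y  = 1.0                              # true label in {0, 1}
    W1 = rng.normal(size=(p, d))
    w2 = rng.normal(size=(d,))

    # Forward: x -> z1 -> h1 -> z2 -> y_hat, E = binary cross-entropy
    z1 = W1.T @ x
    h1 = sigmoid(z1)
    z2 = w2 @ h1
    y_hat = sigmoid(z2)
    E = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

    # Backward: repeated application of the chain rule
    dE_dz2 = y_hat - y                    # sigmoid + cross-entropy combine neatly
    dE_dw2 = dE_dz2 * h1                  # since dz2/dw2 = h1
    dE_dh1 = dE_dz2 * w2                  # since dz2/dh1 = w2
    dE_dz1 = dE_dh1 * h1 * (1 - h1)       # local gradient of the sigmoid
    dE_dW1 = np.outer(x, dE_dz1)          # shape (p, d), matching W1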

Local-ness of Backpropagation: activations x flow into a block f producing y; f has its own local gradients, and gradients from above flow back through it. 63 http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf

Local-ness of Backpropagation: activations x → f → y; local gradients at f; gradients flowing back. 64

Example: Sigmoid Block: sigmoid(x) = σ(x) = 1 / (1 + e^(-x)), with local gradient dσ(x)/dx = σ(x)(1 - σ(x)). 65

Deep Learning = Concatenation of Differentiable Parameterized Layers (linear & nonlinearity functions): x and W1 give z1 and the 1st hidden layer output h1; W2 gives z2 and h2 (2nd hidden layer); W3 gives z3 and the output y (output layer). We want to find the optimal weights W that minimize some loss function E! 66

Backprop Whiteboard Demo: a two-input, two-hidden-unit, one-output network with biases b1, b2, b3.
f1: z1 = x1·w1 + x2·w3 + b1, z2 = x1·w2 + x2·w4 + b2
f2: h1 = exp(z1) / (1 + exp(z1)), h2 = exp(z2) / (1 + exp(z2))
f3: ŷ = h1·w5 + h2·w6 + b3
f4: E = (y - ŷ)^2
Update rule: w(t+1) = w(t) - η ∂E/∂w(t), so we need ∂E/∂w = ?? 67
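For reference, a hedged Python sketch of the whiteboard derivation: the forward pass follows f1-f4 exactly as above, and the backward pass applies the chain rule by hand. The numeric inputs, initial weights, and the learning rate η are made-up placeholders, since the slide leaves them for the whiteboard.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Made-up values; the slide works these out live on the whiteboard.
    x1, x2, y = 1.0, -2.0, 1.0
    w1, w2, w3, w4, w5, w6 = 0.1, -0.2, 0.3, 0.4, -0.5, 0.6
    b1, b2, b3 = 0.0, 0.0, 0.0
    eta = 0.1                                  # learning rate (assumed)

    # Forward pass (f1..f4 from the slide)
    z1 = x1 * w1 + x2 * w3 + b1
    z2 = x1 * w2 + x2 * w4 + b2
    h1, h2 = sigmoid(z1), sigmoid(z2)
    y_hat = h1 * w5 + h2 * w6 + b3
    E = (y - y_hat) ** 2

    # Backward pass (chain rule)
    dE_dyhat = -2.0 * (y - y_hat)
    dE_dw5, dE_dw6, dE_db3 = dE_dyhat * h1, dE_dyhat * h2, dE_dyhat
    dE_dh1, dE_dh2 = dE_dyhat * w5, dE_dyhat * w6
    dE_dz1 = dE_dh1 * h1 * (1 - h1)            # local gradient of the sigmoid
    dE_dz2 = dE_dh2 * h2 * (1 - h2)
    dE_dw1, dE_dw3, dE_db1 = dE_dz1 * x1, dE_dz1 * x2, dE_dz1
    dE_dw2, dE_dw4, dE_db2 = dE_dz2 * x1, dE_dz2 * x2, dE_dz2

    # One gradient-descent step, e.g. for w5:
    w5 = w5 - eta * dE_dw5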

Neurons 1-Layer Neural Network Multi-layer Neural Network Loss Functions Backpropagation Nonlinearity Functions NNs in Practice 68

Nonlinearity Functions (i.e. transfer or activation functions): inputs x1, x2, x3 and a +1 bias input are multiplied by weights w1, w2, w3 and b1, summed (Σ) to give z, and then passed through a sigmoid (or other nonlinearity). 69

Nonlinearity Functions (i.e. transfer or activation functions): [table comparing activation functions: name, plot, equation, derivative w.r.t. x] 70 https://en.wikipedia.org/wiki/activation_function#comparison_of_activation_functions

Nonlinearity Functions (i.e. transfer or activation functions): [table continued: name, plot, equation, derivative w.r.t. x] 71 https://en.wikipedia.org/wiki/activation_function#comparison_of_activation_functions

Nonlinearity Functions (aka transfer or activation functions): [table continued: name, plot, equation, derivative w.r.t. x; one entry is annotated "usually works best in practice"] 72 https://en.wikipedia.org/wiki/activation_function#comparison_of_activation_functions
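A few common entries from that comparison, sketched in NumPy with their derivatives; this is my own selection (the full table has more rows), and which one works best is problem-dependent.

    import numpy as np

    # Common activation functions and their derivatives w.r.t. x
    def sigmoid(x):   return 1.0 / (1.0 + np.exp(-x))
    def d_sigmoid(x): s = sigmoid(x); return s * (1.0 - s)

    def tanh(x):      return np.tanh(x)
    def d_tanh(x):    return 1.0 - np.tanh(x) ** 2

    def relu(x):      return np.maximum(0.0, x)
    def d_relu(x):    return (x > 0).astype(float)

    x = np.linspace(-3, 3, 7)
    print(relu(x), d_relu(x))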

Neurons 1-Layer Neural Network Multi-layer Neural Network Loss Functions Backpropagation Nonlinearity Functions NNs in Practice 73

Neural Net Pipeline: x → W1 → W2 → w3 → ŷ → E
1. Initialize weights
2. For each batch of input samples S:
   a. Run the network forward on S to compute the outputs and the loss
   b. Run the network backward using the outputs and loss to compute gradients
   c. Update the weights using SGD (or a similar method)
3. Repeat step 2 until the loss converges
74
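A minimal end-to-end sketch of this pipeline in NumPy, using a single-layer (logistic-regression) network to keep it short; the synthetic data, batch size, learning rate, and epoch count are all arbitrary choices of mine.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Synthetic data: N samples, p features, binary labels
    rng = np.random.default_rng(0)
    N, p = 256, 3
    X = rng.normal(size=(N, p))
    Y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)

    # Step 1: initialize weights
    w, b, eta, batch = np.zeros(p), 0.0, 0.1, 32

    for epoch in range(20):                   # Step 3: repeat until convergence
        for i in range(0, N, batch):          # Step 2: for each batch S
            xb, yb = X[i:i+batch], Y[i:i+batch]
            y_hat = sigmoid(xb @ w + b)       # 2a: forward pass
            grad_z = (y_hat - yb) / len(yb)   # 2b: backward pass (cross-entropy)
            grad_w, grad_b = xb.T @ grad_z, grad_z.sum()
            w -= eta * grad_w                 # 2c: SGD update
            b -= eta * grad_b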

Non-Convexity of Neural Nets: in very high dimensions, there exist many local minima, which are about the same. Pascanu et al., On the saddle point problem for non-convex optimization, 2014. 75

Building Deep Neural Nets y x f 76 http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf

Building Deep Neural Nets GoogLeNet for Object Classification 77

Block Example Implementation 78 http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf

Advantage of Neural Nets: as long as the model is fully differentiable, we can train it to automatically learn features for us. 79

Advanced Deep Learning Models: Convolutional Neural Networks & Recurrent Neural Networks Most slides from http://cs231n.stanford.edu/ 80

Convolutional Neural Networks (aka CNNs and ConvNets) 81

Challenges in Visual Recognition

Challenges in Visual Recognition

Problems with Fully Connected Networks on Images: each hidden unit connects to every pixel. For a 1000x1000 image (10^6 pixels) with 1M hidden units, that is 10^12 parameters. But spatial correlation is local, so connect units to local regions only. 84

Problems with Fully Connected Networks on Images: each hidden unit connects to every pixel (1000x1000 image, 1M hidden units: 10^12 parameters), even though spatial correlation is local. And how do we deal with multiple dimensions in the input: length, height, and channel (R, G, B)? 85

Convolutional Layer 86

Convolutional Layer 87

Convolutional Layer 88

Convolutional Layer 89

Convolution 90

Convolution 91
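The convolution operation itself is shown in figures on these slides rather than in the transcript; as a hedged sketch, a naive single-channel 2D convolution (implemented as cross-correlation, stride 1, no padding) looks like this in NumPy. The helper name and test values are mine.

    import numpy as np

    def conv2d_valid(image, kernel):
        # Slide a small filter over the image; each output value is the dot
        # product of the filter with one local patch of the image.
        H, W = image.shape
        kH, kW = kernel.shape
        out = np.zeros((H - kH + 1, W - kW + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(image[i:i+kH, j:j+kW] * kernel)
        return out

    image = np.arange(25, dtype=float).reshape(5, 5)
    kernel = np.array([[1.0, 0.0], [0.0, -1.0]])   # arbitrary 2x2 filter
    print(conv2d_valid(image, kernel).shape)        # (4, 4)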

Neuron View of Convolutional Layer 92

Neuron View of Convolutional Layer 93

Convolutional Layer 94

Convolutional Layer 95

Neuron View of Convolutional Layer 96

Convolutional Neural Networks 97

Convolutional Neural Networks 98

Convolutional Neural Networks 99

Convolutional Neural Networks http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/ 100

Pooling Layer 101

Pooling Layer (Max Pooling example) 102
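As with convolution, the max-pooling example on this slide is a figure; a minimal NumPy sketch of 2x2 max pooling with stride 2 (the usual setting, assumed here) is:

    import numpy as np

    def max_pool2d(x, size=2, stride=2):
        # Take the maximum over non-overlapping windows of a single-channel map.
        H, W = x.shape
        outH, outW = (H - size) // stride + 1, (W - size) // stride + 1
        out = np.zeros((outH, outW))
        for i in range(outH):
            for j in range(outW):
                out[i, j] = x[i*stride:i*stride+size, j*stride:j*stride+size].max()
        return out

    x = np.arange(16, dtype=float).reshape(4, 4)
    print(max_pool2d(x))   # 2x2 output: the max of each 2x2 block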

History of ConvNets. 1998: Gradient-Based Learning Applied to Document Recognition [LeCun, Bottou, Bengio, Haffner]. 2012: ImageNet Classification with Deep Convolutional Neural Networks [Krizhevsky, Sutskever, Hinton, 2012]. 103

Recurrent Neural Networks 108

Standard Feed-Forward Neural Network 109

Standard Feed-Forward Neural Network 110

Recurrent Neural Networks (RNNs) RNNs can handle 111

Recurrent Neural Networks (RNNs) RNNs can handle 112

Recurrent Neural Networks (RNNs): a traditional feed-forward neural network maps input → hidden → output once per example; a recurrent neural network predicts a vector at each timestep (input → hidden → output at every step, with the hidden state carried across timesteps).

Recurrent Neural Networks: h_t, h_{t-1} 114

Recurrent Neural Networks: h_t, h_{t-1} 115

Vanilla Recurrent Neural Network: h_t, h_{t-1} 116
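The transcript does not capture the update equation shown on these slides, but the standard vanilla RNN step (which this part of the lecture is describing) is h_t = tanh(W_hh·h_{t-1} + W_xh·x_t), with a prediction y_t = W_hy·h_t at each timestep; here is a hedged NumPy sketch with made-up sizes and weight names.

    import numpy as np

    rng = np.random.default_rng(0)
    hidden, inp, out = 8, 5, 3
    W_hh = rng.normal(scale=0.1, size=(hidden, hidden))
    W_xh = rng.normal(scale=0.1, size=(hidden, inp))
    W_hy = rng.normal(scale=0.1, size=(out, hidden))

    def rnn_step(h_prev, x_t):
        # the new hidden state depends on the previous state and the current input
        h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)
        y_t = W_hy @ h_t                     # a vector is predicted at each timestep
        return h_t, y_t

    h = np.zeros(hidden)
    for x_t in rng.normal(size=(10, inp)):   # a length-10 input sequence
        h, y = rnn_step(h, x_t)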

Character Level Language Model with an RNN 117

Character Level Language Model with an RNN 118

Character Level Language Model with an RNN 119

Character Level Language Model with an RNN 120

Character Level Language Model with an RNN 121

Character Level Language Model with an RNN 122

Character Level Language Model with an RNN 123

Example: Generating Shakespeare with RNNs 124

Example: Generating C Code with RNNs 125

Long Short Term Memory Networks (LSTMs): recurrent networks suffer from the vanishing gradient problem, so they aren't able to model long-term dependencies in sequences. [Diagram: input → hidden → output]

Long Short Term Memory Networks (LSTMs): recurrent networks suffer from the vanishing gradient problem, so they aren't able to model long-term dependencies in sequences. LSTMs use gating units to learn when to remember.
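The gating mechanism itself is only drawn on the slides; the formulation below is the conventional LSTM update (an assumption about what the figure shows), with input, forget, and output gates plus a cell state. A compact NumPy sketch:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, W, b):
        # One standard LSTM step; W stacks the four gate weight matrices.
        hidden = h_prev.shape[0]
        z = W @ np.concatenate([h_prev, x_t]) + b
        i = sigmoid(z[0*hidden:1*hidden])        # input gate
        f = sigmoid(z[1*hidden:2*hidden])        # forget gate: when to remember / forget
        o = sigmoid(z[2*hidden:3*hidden])        # output gate
        g = np.tanh(z[3*hidden:4*hidden])        # candidate cell update
        c_t = f * c_prev + i * g                 # cell state carries long-term information
        h_t = o * np.tanh(c_t)
        return h_t, c_t

    rng = np.random.default_rng(0)
    hidden, inp = 8, 5
    W = rng.normal(scale=0.1, size=(4 * hidden, hidden + inp))
    b = np.zeros(4 * hidden)
    h, c = np.zeros(hidden), np.zeros(hidden)
    h, c = lstm_step(rng.normal(size=inp), h, c, W, b)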

RNNs and CNNs Together 128