DEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY


On-line Resources
http://neuralnetworksanddeeplearning.com/index.html - online book by Michael Nielsen
http://matlabtricks.com/post-5/3x3-convolution-kernelswith-online-demo - demo of convolutions
https://cs.stanford.edu/people/karpathy/convnetjs/demo/mnist.html - demo of a CNN
http://scs.ryerson.ca/~aharley/vis/conv/ - 3D visualization
http://cs231n.github.io/ - Stanford class CS231n: Convolutional Neural Networks for Visual Recognition
http://www.deeplearningbook.org/ - MIT Press book from Bengio et al., free online version

A history of neural networks
1940s-60s: McCulloch & Pitts; Hebb: modeling real neurons. Rosenblatt, Widrow-Hoff: perceptrons.
1969: Minsky & Papert's Perceptrons book showed formal limitations of one-layer linear networks.
1970s-mid-1980s:
Mid-1980s-mid-1990s: backprop and multi-layer networks. Rumelhart and McClelland's PDP book set; Sejnowski's NETtalk, BP-based text-to-speech; the Neural Info Processing Systems (NIPS) conference starts.
Mid-1990s-early 2000s:
Mid-2000s to current: more and more interest and experimental success.


Multilayer networks
Simplest case: the classifier is a multilayer network of logistic units. Each unit takes some inputs and produces one output using a logistic classifier. The output of one unit can be the input of another.
[Figure: input layer (x1, x2, and a constant 1), a hidden layer computing v1 = S(w^T x) and v2 = S(w^T x) with weights w_{i,j}, and an output layer computing z1 = S(w^T v) with weights w1, w2.]
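As a minimal sketch (not from the slides; layer sizes and weight values are made up for illustration), the forward pass of such a 2-input, 2-hidden-unit, 1-output network of logistic units can be written in NumPy as:

import numpy as np

def logistic(t):
    # S(t) = 1 / (1 + exp(-t)), applied componentwise
    return 1.0 / (1.0 + np.exp(-t))

# Hypothetical weights; each row holds one unit's input weights,
# and the last column is the bias weight for the constant-1 input.
W_hidden = np.array([[0.5, -0.3, 0.1],
                     [0.8,  0.2, -0.4]])
w_output = np.array([1.2, -0.7, 0.3])

x = np.array([1.0, 0.0])                     # an example input
v = logistic(W_hidden @ np.append(x, 1.0))   # hidden outputs v_i = S(w^T x)
z = logistic(w_output @ np.append(v, 1.0))   # network output z_1 = S(w^T v)
print(v, z)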

Learning a multilayer network
Define a loss (simplest case: squared error), but over a network of units, and minimize the loss with gradient descent:
J_{X,y}(w) = Σ_i (y_i − ŷ_i)²
You can do this over complex networks if you can take the gradient of each unit: every computation is differentiable.
[Same network figure as above.]

ANNs in the 90's
Mostly 2-layer networks, or else carefully constructed deep networks (e.g., CNNs). They worked well, but training was slow and finicky.
Nov 1998: Yann LeCun, Bottou, Bengio, Haffner.

ANNs in the 90's
Mostly 2-layer networks, or else carefully constructed deep networks. They worked well, but training typically took weeks when guided by an expert.
SVMs: 98.9-99.2% accurate; CNNs: 98.3-99.3% accurate.

Learning a multilayer network
Define a loss (simplest case: squared error), but over a network of units, and minimize the loss with gradient descent:
J_{X,y}(w) = Σ_i (y_i − ŷ_i)²
You can do this over complex networks if you can take the gradient of each unit: every computation is differentiable.

Example: weight updates for a multilayer ANN with square loss and logistic units
For nodes k in the output layer: δ_k ← (t_k − a_k) a_k (1 − a_k)
For nodes j in the hidden layer: δ_j ← (Σ_k δ_k w_{kj}) a_j (1 − a_j)
For all weights: w_{kj} ← w_{kj} + ε δ_k a_j and w_{ji} ← w_{ji} + ε δ_j a_i
Propagate errors backward: BACKPROP. You can carry this recursion out further if you have multiple hidden layers.
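A sketch of one stochastic-gradient step implementing exactly these update rules for a single-hidden-layer network (bias terms omitted and sizes made up to keep the example short):

import numpy as np

def logistic(t):
    return 1.0 / (1.0 + np.exp(-t))

def sgd_step(x, t, W_hid, W_out, eps=0.1):
    # Forward pass through logistic units.
    a_hid = logistic(W_hid @ x)                     # hidden activations a_j
    a_out = logistic(W_out @ a_hid)                 # output activations a_k
    # Output-layer errors: delta_k = (t_k - a_k) a_k (1 - a_k)
    delta_out = (t - a_out) * a_out * (1 - a_out)
    # Hidden-layer errors: delta_j = (sum_k delta_k w_kj) a_j (1 - a_j)
    delta_hid = (W_out.T @ delta_out) * a_hid * (1 - a_hid)
    # Weight updates: w_kj += eps * delta_k * a_j ; w_ji += eps * delta_j * x_i
    W_out += eps * np.outer(delta_out, a_hid)
    W_hid += eps * np.outer(delta_hid, x)
    return W_hid, W_out

rng = np.random.default_rng(0)
W_hid = 0.1 * rng.normal(size=(3, 2))               # 2 inputs, 3 hidden units
W_out = 0.1 * rng.normal(size=(1, 3))               # 1 output unit
W_hid, W_out = sgd_step(np.array([1.0, 0.0]), np.array([1.0]), W_hid, W_out)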

BACKPROP FOR MLPS

BackProp in Matrix-Vector Notation (following Michael Nielsen: http://neuralnetworksanddeeplearning.com/)

Notation
Each digit is 28x28 pixels = 784 inputs.
w^l is the weight matrix for layer l; b^l is the bias vector and a^l the activation vector for layer l; z^l is the vector of pre-sigmoid activations.
σ is the vector→vector function that applies the logistic componentwise.

Computation is feedforward:
for l = 1, 2, …, L:  z^l = w^l a^(l−1) + b^l and a^l = σ(z^l)
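A small sketch of this forward pass in matrix-vector form (the 784-30-10 layer sizes are just an illustration, echoing the MNIST setup above):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(a, weights, biases):
    # For each layer l: z^l = w^l a^(l-1) + b^l, a^l = sigmoid(z^l).
    for w, b in zip(weights, biases):
        a = sigmoid(w @ a + b)
    return a

sizes = [784, 30, 10]
rng = np.random.default_rng(0)
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=m) for m in sizes[1:]]
output = feedforward(rng.normal(size=784), weights, biases)
print(output.shape)   # (10,)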

Notation
The cost function to optimize is a sum over training examples x; for the square loss it is C(w, b) = 1/(2n) Σ_x ||y(x) − a^L(x)||², where y(x) is the vector of target outputs and a^L(x) is the network's output on x.


BackProp: last layer
Matrix form: δ^L = ∇_a C ⊙ σ′(z^L), where ⊙ is the componentwise product of vectors. The components are δ^L_j = (∂C/∂a^L_j) σ′(z^L_j).
[Sidebar, repeated on the next few slides: at level l, for l = 1, …, L, we have the matrix w^l and the vectors b^l (bias), a^l (activation), z^l (pre-sigmoid activation), y (target output), and δ^l (local error).]

BackProp: last layer
Matrix form for square loss: δ^L = (a^L − y) ⊙ σ′(z^L).

BackProp: error at level l in terms of error at level l+1
δ^l = ((w^(l+1))^T δ^(l+1)) ⊙ σ′(z^l), which we can use to compute the gradients ∂C/∂w^l_{jk} = a^(l−1)_k δ^l_j and ∂C/∂b^l_j = δ^l_j.

BackProp: summary
δ^L = ∇_a C ⊙ σ′(z^L);  δ^l = ((w^(l+1))^T δ^(l+1)) ⊙ σ′(z^l);  ∂C/∂b^l_j = δ^l_j;  ∂C/∂w^l_{jk} = a^(l−1)_k δ^l_j.

Computation propagates errors backward
Forward pass, for l = 1, 2, …, L: compute z^l and a^l.
Backward pass, for l = L, L−1, …, 1: compute δ^l and from it the gradients for w^l and b^l.
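Putting the recursions together, here is a compact sketch of backprop for the square loss in this matrix-vector notation (a sketch in the spirit of Nielsen's code, not a copy of it):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

def backprop(x, y, weights, biases):
    # Forward pass: store every z^l and a^l.
    a, activations, zs = x, [x], []
    for w, b in zip(weights, biases):
        z = w @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)
    # Last layer: delta^L = (a^L - y) ⊙ σ'(z^L)  (square loss).
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grad_w = [None] * len(weights)
    grad_b = [None] * len(biases)
    grad_w[-1] = np.outer(delta, activations[-2])
    grad_b[-1] = delta
    # Earlier layers: delta^l = ((w^{l+1})^T delta^{l+1}) ⊙ σ'(z^l).
    for l in range(len(weights) - 2, -1, -1):
        delta = (weights[l + 1].T @ delta) * sigmoid_prime(zs[l])
        grad_w[l] = np.outer(delta, activations[l])
        grad_b[l] = delta
    return grad_w, grad_b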

EXPRESSIVENESS OF DEEP NETWORKS

Deep ANNs are expressive
One logistic unit can implement an AND or an OR of a subset of inputs, e.g., (x3 AND x5 AND … AND x19). Every boolean function can be expressed as an OR of ANDs, e.g., (x3 AND x5) OR (x7 AND x19) OR … So one hidden layer can express any boolean function (but it might need lots and lots of hidden units).

Deep ANNs are expressive
One logistic unit can implement an AND or an OR of a subset of inputs, e.g., (x3 AND x5 AND … AND x19). Every boolean function can be expressed as an OR of ANDs, e.g., (x3 AND x5) OR (x7 AND x19) OR … So one hidden layer can express any boolean function.
Example: parity(x1, …, xn) = 1 iff an odd number of the xi are set to one.
Parity(a,b,c,d) = (a & -b & -c & -d) OR (-a & b & -c & -d) OR …   [list all the terms with one variable true]
                OR (a & b & c & -d) OR (a & b & -c & d) OR …      [list all the terms with three variables true]
The size in general is O(2^N).

Deeper ANNs are more expressive
A two-layer network can express binary XOR, but needs O(2^N) units for the parity of N inputs. A network with 2·log N layers can express the parity of N inputs (even/odd number of 1's), using O(N) units arranged in a binary tree of XORs of depth log N.
Deep network + parameter tying ~= subroutines.
[Figure: a binary XOR tree over inputs x1 x2 x3 x4 x5 x6 x7 x8.]
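As a small illustration (not from the slides) of why depth helps here: a binary tree of pairwise XORs computes parity with N−1 two-input gates and depth about log2(N), whereas the flat OR-of-ANDs form enumerates on the order of 2^(N−1) terms.

def parity_tree(bits):
    # Parity via a binary tree of pairwise XORs: N-1 gates, depth ~ log2(N).
    layer = list(bits)
    while len(layer) > 1:
        # XOR adjacent pairs; an odd element at the end is carried up unchanged.
        nxt = [layer[i] ^ layer[i + 1] for i in range(0, len(layer) - 1, 2)]
        if len(layer) % 2:
            nxt.append(layer[-1])
        layer = nxt
    return layer[0]

# Sanity check against the flat definition (odd number of ones).
for n in range(1, 9):
    for x in range(2 ** n):
        bits = [(x >> i) & 1 for i in range(n)]
        assert parity_tree(bits) == sum(bits) % 2
print("binary-tree parity matches the odd-number-of-ones definition")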

Hypothetical code for face recognition (figure from http://neuralnetworksanddeeplearning.com/chap1.html).

PARALLEL TRAINING FOR ANNS

How are ANNs trained?
Typically, with some variant of streaming SGD: keep the data on disk in a preprocessed form, loop over it multiple times, and keep the model in memory. This is a solution to big data, but it means long training times! However, some parallelism is often used.

Recap: logistic regression with SGD
P(Y=1 | X=x) = p = 1 / (1 + e^(−x·w))
One part of this computes the inner product <x,w>; the other part takes the logistic of <x,w>.

Recap: logistic regression with SGD
P(Y=1 | X=x) = p = 1 / (1 + e^(−x·w))
On one example this computes the inner product <x,w>. There's some chance to compute this in parallel; can we do more?

In ANNs we have many, many logistic regression nodes.

Recap: logistic regression with SGD
Let x be an example and let w_i be the input weights for the i-th hidden unit. Then the output of that unit is a_i = x · w_i.

Recap: logistic regression with SGD
Let x be an example and let w_i be the input weights for the i-th hidden unit. Then a = x W is the output for all m units, where W is the matrix whose columns are w_1, w_2, …, w_m.

Recap: logistic regression with SGD
Let X be a matrix with k examples (one per row) and let w_i be the input weights for the i-th hidden unit. Then A = X W is the output for all m units on all k examples: entry (i, j) of XW is x_i · w_j. There are lots of chances to do this in parallel.
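A quick sketch of this batched computation in NumPy (the minibatch size and layer sizes below are illustrative, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
k, d, m = 128, 784, 30          # minibatch size, number of inputs, number of hidden units

X = rng.normal(size=(k, d))     # k examples, one per row
W = rng.normal(size=(d, m))     # column i holds the weights w_i of hidden unit i

Z = X @ W                       # Z[i, j] = x_i . w_j, all units and examples in one matrix multiply
A = 1.0 / (1.0 + np.exp(-Z))    # componentwise logistic of every entry
print(A.shape)                  # (128, 30)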

ANNs and multicore CPUs
Modern libraries (Matlab, numpy, …) do matrix operations fast and in parallel, and many ANN implementations exploit this parallelism automatically. The key implementation issue is working with matrices comfortably.

ANNs and GPUs
GPUs do matrix operations very fast and in parallel, for dense matrices, not sparse ones! Training ANNs on GPUs is common, typically SGD with minibatch sizes of 128, and modern ANN implementations can exploit this. GPUs are not super-expensive ($500 for a high-end one), and large models with O(10^7) parameters can fit in a large-memory GPU (12 GB). Speedups of 20x-50x have been reported.

ANNs and multi-GPU systems
There are ways to set up ANN computations so that they are spread across multiple GPUs. This sometimes involves some sort of IPM (iterative parameter mixing), and sometimes involves partitioning the model across multiple GPUs. It is often needed for very large networks, but it is not especially easy to implement with most current tools.

WHY ARE DEEP NETWORKS HARD TO TRAIN?

Recap: weight updates for a multilayer ANN
For nodes k in the output layer L: δ^L_k ← (t_k − a_k) a_k (1 − a_k)
For nodes j in hidden layer h: δ^h_j ← (Σ_k δ^(h+1)_k w_{kj}) a_j (1 − a_j)
What happens as the layers get further and further from the output layer? E.g., what's the gradient for the bias term with several layers after it?

Gradients are unstable
The derivative of the sigmoid has its max at 1/4. If the weights are usually < 1, then we are multiplying by many numbers < 1, so the gradients get very small: the vanishing gradient problem. What happens as the layers get further and further from the output layer? E.g., what's the gradient for the bias term with several layers after it in a trivial net?

Gradients are unstable
The derivative of the sigmoid has its max at 1/4. If the weights are usually > 1, then we are multiplying by many numbers > 1, so the gradients get very big: the exploding gradient problem (less common, but possible). What happens as the layers get further and further from the output layer? E.g., what's the gradient for the bias term with several layers after it in a trivial net?
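A small numerical sketch of both effects for the "trivial net" mentioned above: a chain of single sigmoid units a_l = sigmoid(w_l * a_{l-1} + b_l), where the gradient of the output with respect to the first bias is a product of w_l * σ'(z_l) factors (the weights, biases, and depths below are made up for illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bias_gradient_chain(weights, biases, x):
    # d a_L / d b_1 = sigma'(z_1) * prod_{l >= 2} [ w_l * sigma'(z_l) ]
    a, sps = x, []
    for w, b in zip(weights, biases):
        a = sigmoid(w * a + b)
        sps.append(a * (1 - a))          # sigma'(z_l), at most 1/4
    grad = sps[0]
    for w, sp in zip(weights[1:], sps[1:]):
        grad *= w * sp
    return grad

# Biases chosen so every unit stays near the middle of the sigmoid (a ~ 0.5),
# so each later layer contributes a factor of roughly w/4 to the gradient.
for w in (0.8, 6.0):
    for depth in (3, 6, 12):
        g = bias_gradient_chain([w] * depth, [-w / 2] * depth, x=0.5)
        print(f"w={w}, depth={depth}: gradient for b_1 ~ {g:.3g}")
# w=0.8 gives factors of ~0.2 per layer (vanishing); w=6.0 gives ~1.5 (exploding).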

AI Stats 2010 [figure]: histogram of gradients, from input layer to output layer, in a 5-layer network for an artificial image recognition task.

AI Stats 2010 [figure]. We will get to these tricks eventually.

It's easy for sigmoid units to saturate: when a unit saturates, learning approaches zero and the unit is stuck.

It's easy for sigmoid units to saturate. For a big network there are lots of weighted inputs to each neuron; if any of them are too large then the neuron will saturate. So neurons can get stuck because of a few large inputs OR because many small ones add up.

It's easy for sigmoid units to saturate. If there are 500 non-zero inputs with weights initialized from a Gaussian ~N(0,1), then the SD of the weighted sum is √500 ≈ 22.4.

It's easy for sigmoid units to saturate. [Saturation visualization from Glorot & Bengio 2010, using a smarter initialization scheme: the closest-to-output hidden layer is still stuck for the first 100 epochs.]

WHAT'S DIFFERENT ABOUT MODERN ANNS?

Some key differences
Use of softmax and entropic (cross-entropy) loss instead of quadratic loss.
Use of alternate non-linearities: relu and hyperbolic tangent.
Better understanding of weight initialization.
Data augmentation, especially for image data.

Cross-entropy loss
With the cross-entropy loss the gradient for a logistic unit is ∂C/∂w_j = x_j (σ(z) − y). Compare this to the gradient for the square loss, which carries an extra σ′(z) factor: when a ≈ 1, y = 0 and x = 1, the square-loss gradient is nearly zero (the unit is saturated), while the cross-entropy gradient is not.


Softmax output layer
With a softmax output layer the network outputs a probability distribution! Cross-entropy loss after a softmax layer gives a very simple, numerically stable gradient: Δw_ij = (y_i − z_i) y_j.
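A sketch of this "prediction minus target" gradient for a single linear layer followed by a softmax and a cross-entropy loss (the symbols below follow the code, not necessarily the slide's exact variable names, and the sizes are made up):

import numpy as np

def softmax(z):
    z = z - z.max()                  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def softmax_xent_grad(x, target_index, W):
    # Loss is -log p[target]; its gradient w.r.t. W is outer(p - t, x), t one-hot.
    p = softmax(W @ x)
    t = np.zeros_like(p)
    t[target_index] = 1.0
    loss = -np.log(p[target_index])
    grad_W = np.outer(p - t, x)
    return loss, grad_W

# Check one entry of the gradient against a finite difference.
rng = np.random.default_rng(0)
W, x = rng.normal(size=(4, 3)), rng.normal(size=3)
loss, g = softmax_xent_grad(x, 2, W)
eps = 1e-6
W2 = W.copy(); W2[1, 0] += eps
loss2, _ = softmax_xent_grad(x, 2, W2)
print(g[1, 0], (loss2 - loss) / eps)   # the two numbers should nearly agree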

Some key differences
Use of softmax and entropic (cross-entropy) loss instead of quadratic loss: often learning is faster and more stable, as well as getting better accuracies in the limit.
Use of alternate non-linearities.
Better understanding of weight initialization.
Data augmentation, especially for image data.

Some key differences
Use of softmax and entropic (cross-entropy) loss instead of quadratic loss: often learning is faster and more stable, as well as getting better accuracies in the limit.
Use of alternate non-linearities: relu and hyperbolic tangent.
Better understanding of weight initialization.
Data augmentation, especially for image data.

Alternative non-linearities
Changes so far: changed the loss from square error to cross-entropy, and proposed adding another output layer (softmax). A new change: modifying the nonlinearity. The logistic is not widely used in modern ANNs.

Alternative non-linearities
A new change: modifying the nonlinearity. The logistic is not widely used in modern ANNs.
Alternate 1: tanh. Like the logistic function but shifted to the range [-1, +1].

AI Stats 2010 [figure, depth 5]. We will get to these tricks eventually.

Alternative non-linearities
A new change: modifying the nonlinearity; relu is often used in vision tasks.
Alternate 2: rectified linear unit. Linear with a cutoff at zero. (Implement: clip the gradient when you pass zero.)

Alternative non-linearities
A new change: modifying the nonlinearity; relu is often used in vision tasks.
Alternate 2: rectified linear unit. Soft version: log(exp(x)+1). Doesn't saturate (at one end), sparsifies outputs, and helps with the vanishing gradient.
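For reference, here is a sketch of these nonlinearities and the derivatives backprop would use, side by side with the logistic (the sample points are arbitrary):

import numpy as np

def logistic(z):    return 1.0 / (1.0 + np.exp(-z))
def d_logistic(z):  s = logistic(z); return s * (1 - s)      # max value 1/4

def d_tanh(z):      return 1.0 - np.tanh(z) ** 2             # max value 1

def relu(z):        return np.maximum(0.0, z)
def d_relu(z):      return (z > 0).astype(float)             # gradient clipped to 0 below zero

def softplus(z):    return np.log1p(np.exp(z))                # soft version of relu
def d_softplus(z):  return logistic(z)                        # its derivative is the logistic

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
for name, f, df in [("logistic", logistic, d_logistic), ("tanh", np.tanh, d_tanh),
                    ("relu", relu, d_relu), ("softplus", softplus, d_softplus)]:
    print(name, f(z).round(3), df(z).round(3))
# Note how the logistic's derivative is tiny at |z| = 5 (saturation),
# while relu's derivative stays 1 for all positive z.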

Some key differences
Use of softmax and entropic (cross-entropy) loss instead of quadratic loss: often learning is faster and more stable, as well as getting better accuracies in the limit.
Use of alternate non-linearities: relu and hyperbolic tangent.
Better understanding of weight initialization.
Data augmentation, especially for image data.

It's easy for sigmoid units to saturate. For a big network there are lots of weighted inputs to each neuron; if any of them are too large then the neuron will saturate. So neurons can get stuck because of a few large inputs OR because many small ones add up.

It's easy for sigmoid units to saturate. If there are 500 non-zero inputs with weights initialized from a Gaussian ~N(0,1), then the SD of the weighted sum is √500 ≈ 22.4. Common heuristics for initializing weights: sample from N(0, 1/√(#inputs)) or from U[−1/√(#inputs), +1/√(#inputs)].
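An empirical sketch of the saturation argument and these heuristics (the all-ones input vector and trial count are just for illustration):

import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_trials = 500, 10000
x = np.ones(n_inputs)                      # 500 non-zero inputs

inits = [
    ("N(0, 1)",                  lambda: rng.normal(0.0, 1.0, size=(n_trials, n_inputs))),
    ("N(0, 1/sqrt(n))",          lambda: rng.normal(0.0, 1.0 / np.sqrt(n_inputs), size=(n_trials, n_inputs))),
    ("U(-1/sqrt(n), 1/sqrt(n))", lambda: rng.uniform(-1.0 / np.sqrt(n_inputs), 1.0 / np.sqrt(n_inputs),
                                                     size=(n_trials, n_inputs))),
]
for name, sampler in inits:
    z = sampler() @ x                      # weighted sum feeding one sigmoid unit, per trial
    print(f"{name:28s} SD of the weighted sum: {z.std():.2f}")
# N(0,1) gives an SD near 22.4, deep in the saturated region of the sigmoid;
# the 1/sqrt(#inputs) heuristics keep the SD around 1 or below.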

It's easy for sigmoid units to saturate. [Saturation visualization from Glorot & Bengio 2010, using U[−1/√(#inputs), +1/√(#inputs)].]

Initializing to avoid saturation
In Glorot and Bengio they suggest initializing the weights of level j (with n_j inputs) from a scaled uniform distribution, U[−√6/√(n_j + n_{j+1}), +√6/√(n_j + n_{j+1})] (the "normalized" initialization of Glorot & Bengio 2010).
The first breakthrough deep learning results were based on clever pre-training initialization schemes, where deep networks were seeded with weights learned from unsupervised strategies. This is not always the solution, but good initialization is very important for deep nets!