Deep Learning: a gentle introduction

Size: px

Start display at page:

Download "Deep Learning: a gentle introduction"

Tyrone Ross
6 years ago
Views:

1 Deep Learning: a gentle introduction Jamal Atif jamal.atif@dauphine.fr PSL, Université Paris-Dauphine, LAMSADE February 8, 206 Jamal Atif (Université Paris-Dauphine) Deep Learning February 8, 206 /

2 Why a talk about deep learning? Jamal Atif (Université Paris-Dauphine) Deep Learning February 8, /

3 Why a talk about deep learning? Jamal Atif (Université Paris-Dauphine) Deep Learning February 8, /

4 Why a talk about deep learning? Convolutional Networks (Yann Le Cun): challenge ILSVRC : 000 categories and images. Jamal Atif (Université Paris-Dauphine) Deep Learning February 8, /

5 Neuron s basic anatomy Figure: A neuron s basic anatomy consists of four parts: a soma (cell body), dendrites, an axon, and nerve terminals. Information is received by dendrites, gets collected in the cell body, and flows down the axon. Jamal Atif (Université Paris-Dauphine) Deep Learning February 8, /

6 Artificial neuron x w x2 w2 x3 w3 n i= wixi + b Axon Pre-activation Activation Cell body wd b xn Dendrites xn+ = Jamal Atif (Université Paris-Dauphine) Deep Learning February 8, /

7 Perceptron Rosenblatt 957 Jamal Atif (Université Paris-Dauphine) Deep Learning February 8, /

8 Perceptron learning Input: a sample S = {(x (), y ),, (x (n), y n )} Initialize a parameter t to 0 Initialize the weights w i with random values. Repeat Pick an example x (k) = [x (k),, x(k) d ]T from S Let y (k) be the target value and ỹ (k) the computed value using the current perceptron If ỹ (k) y (k) then Update the weights: w i (t + ) = w i (t) + w i (t) where w i (t) = (y (k) ỹ (k) )x (k) i End If t = t + Until all the examples in S were visited and no change occurs in the weights Output: a perceptron for linear discrimination of S Jamal Atif (Université Paris-Dauphine) Deep Learning February 8, /

9 Perceptron learning Example: learning the OR function Initialization: w (0) = w 2 (0) =, w 3 (0) = t w (t) w 2 (t) w 3 (t) x (k) wi x k i ỹ (k) y (k) w (t) w 2 (t) w 3 (t) Jamal Atif (Université Paris-Dauphine) Deep Learning February 8, /

10 Perceptron illustration x () = [0, 0] T t = 0 0 w = 0 w 2 = w 3 = if a > 0 a = ỹ = 0 i wixi 0 elsewhere y = 0 w i = w i + (y ỹ)x i w = w 2 = w 3 = + 0 Jamal Atif (Université Paris-Dauphine) Deep Learning February 8, /

11 Perceptron illustration x (2) = [0, ] T t = 0 w = w 2 = w 3 = if a > 0 a = 0 ỹ = 0 i wixi 0 elsewhere y = w i = w i + (y ỹ)x i w = + 0 = w 2 = + = 2 w 3 = + = 0 Jamal Atif (Université Paris-Dauphine) Deep Learning February 8, 206 /

12 Perceptron illustration x (3) = [, 0] T t = 2 w = 0 w 2 = 2 w 3 = 0 if a > 0 a = ỹ = i wixi 0 elsewhere y = w i = w i + (y ỹ)x i w = = w 2 = = 2 w 3 = = 0 Jamal Atif (Université Paris-Dauphine) Deep Learning February 8, /

13 Perceptron illustration x (4) = [, ] T t = 3 w = w 2 = 2 w 3 = 0 if a > 0 a = 3 ỹ = i wixi 0 elsewhere y = w i = w i + (y ỹ)x i w = = w 2 = = 2 w 3 = = 0 Jamal Atif (Université Paris-Dauphine) Deep Learning February 8, /

14 Perceptron illustration x () = [0, 0] T t = 4 0 w = 0 w 2 = 2 w 3 = 0 if a > 0 a = 0 ỹ = 0 i wixi 0 elsewhere y = 0 w i = w i + (y ỹ)x i w = = w 2 = = 2 w 3 = = 0 Jamal Atif (Université Paris-Dauphine) Deep Learning February 8, /

15 Perceptron illustration x (2) = [0, ] T t = 5 0 w = w 2 = 2 w 3 = 0 if a > 0 a = 2 ỹ = i wixi 0 elsewhere y = w i = w i + (y ỹ)x i w = = w 2 = = 2 w 3 = = 0 Jamal Atif (Université Paris-Dauphine) Deep Learning February 8, /

16 Perceptron capacity OR(x, x 2 ) AND(x, x 2 ) Jamal Atif (Université Paris-Dauphine) Deep Learning February 8, /

17 Perceptron autumn 0 0 XOR(x, x 2 ) Jamal Atif (Université Paris-Dauphine) Deep Learning February 8, /

18 Link with Logistic regression x w x2 w2 a(x) = w3 d i= wixi + b x3 h(x) = +e a(x) Stochastic gradient update rule: wd wj = wj + λ(y (i) h(x (i) ))x (i) j The same as the perceptron update rule! b xd xd+ = Jamal Atif (Université Paris-Dauphine) Deep Learning February 8, /

19 But! XOR(x, x 2 ) AND(x, x2) 0 0 AND( x, x 2 ) Jamal Atif (Université Paris-Dauphine) Deep Learning February 8, /

20 Multilayer Perceptron b Paul Werbos, 84. David Rumelhart, 86 x x2 o x3 om Output layer xd Input Layer Hidden layers Jamal Atif (Université Paris-Dauphine) Deep Learning February 8, /

21 Layer Perceptron b x w w2 h w 2 b 2 w2 w3 x2 w3 w22 h2 w32 w 2 2 w23 x3 w33 h3 w 2 3 w2d wn2 wn3 Output Layer f(x) = o(b 2 + d i= w2 hi(x)) wn w3d wd xn w 2 d Input Layer wnd hd Hidden Layer hj(x) = g(b + n i= wijxi) Jamal Atif (Université Paris-Dauphine) Deep Learning February 8, /

22 MLP training: backpropagation XOR example x i w + ixi b a h h(x) = +e a(x) w w 2 w 2 k w2 k hk + ao b2 ho(x) = +e a(x) ỹ x2 w 2 2 i w i2 x2 + b 2 a2 h2 h(x) = +e a(x) w 2 2 w 2 22 b b 2 b 2 Jamal Atif (Université Paris-Dauphine) Deep Learning February 8, /

23 MLP training: backpropagation XOR example: initialisation x w = i w + ixi b a h h(x) = +e a(x) x2 w 2 = w 2 = w 22 = b = b 2 = i w i2 x2 + b 2 a2 h2 h(x) = +e a(x) w 2 = w 2 2 = b 2 = k w2 k hk + ao b2 ho(x) = +e a(x) ỹ Jamal Atif (Université Paris-Dauphine) Deep Learning February 8, /

24 MLP training: backpropagation XOR example: Feed-forward x 0 w = a = i w + ixi b h(x) = +e a (x) 0.27 x2 0 w 2 = w 2 = w 22 = b = b 2 = a2 = i w i2 x2 + b 2 h2(x) = +e a 2 (x) 0.27 w 2 = w 2 2 = b 2 = k 0.46 ao = ỹ = 0.39 w2 k hk + ho(x) = +e ao(x) y = 0 Jamal Atif (Université Paris-Dauphine) Deep Learning February 8, /

25 MLP training: backpropagation XOR example: Backpropagation w 2 k = w2 k + ηδk δk = (y ỹ)( ỹ)ỹhk x 0 w = a = i w + ixi b h(x) = +e a (x) 0.27 = Ew ỹ ỹ ao ao w 2 k x2 0 w 2 = w 2 = w 22 = b = b 2 = a2 = i w i2 x2 + b 2 h2(x) = +e a 2 (x) 0.27 w 2 = w 2 2 = b 2 = k 0.46 ao = ỹ = 0.39 w2 k hk + ho(x) = +e ao(x) y = 0 Jamal Atif (Université Paris-Dauphine) Deep Learning February 8, /

26 MLP training: backpropagation XOR example: Backpropagation w ij = w ij ηδij δij = Ew wij = Ew ho ho ao δij = (y ỹ)( ỹ)ỹwj 2 hj( hj)xi ao hj hj aj aj wij w 2 k = w2 k ηδk δk = Ew, w 2 Ew = 2 (y ỹ)2 k x 0 w = a = i w + ixi b h(x) = +e a (x) 0.27 = Ew ỹ ỹ ao ao w 2 k δk = (y ỹ)( ỹ)ỹhk x2 w 2 = w 2 = 0 w22 = b =.98 b 2 =.98 a2 = i w i2 x2 + b 2 h2(x) = +e a 2 (x) 0.27 w = + (0.39)(.39) (.27) 0 w22 = + (0.39)(.39) (.27) 0. b = + (0.39)(.39) (.27) k w 2 = ao = ỹ = 0.39 w2 k hk + ho(x) = +e ao(x) w 2 2 = 0.98 b 2 =.07 w 2 = w 2 + (0.39)(.39) w 2 2 = w (0.39)(.39) b 2 = b 2 + (0.39)(.39).3 y = 0 Jamal Atif (Université Paris-Dauphine) Deep Learning February 8, /

27 MLP training: backpropagation XOR example: Feed-forward x 0 w = a = i w + ixi b.02 h(x) = +e a (x) 0.5 x2 w 2 = w 2 = w 22 = b =.98 b 2 =.98 a2 =.02 i w i2 x2 + b 2 h2(x) = +e a 2 (x) 0.5 k w 2 = ao = ỹ 0.5 w2 k hk + ho(x) = +e ao(x) w 2 2 = 0.98 b 2 =.07 y = Jamal Atif (Université Paris-Dauphine) Deep Learning February 8, /

28 Multilayer Perceptron as a deep NN MLP with more than 2/3 layers is a deep network b x x2 o x3 om Output layer xd Input Layer Hidden layers Jamal Atif (Université Paris-Dauphine) Deep Learning February 8, /

29 Training issues Setting the hyperparameters Initialization Number of iterations Learning rate η Activation function Early stopping criterion Overfitting/Underfitting Number of hidden layers Number of neurons Optimization Vanishing gradient problem Jamal Atif (Université Paris-Dauphine) Deep Learning February 8, /

30 Overfitting/Underfitting Source: Bishop, PRML Jamal Atif (Université Paris-Dauphine) Deep Learning February 8, /

31 Overfitting/Underfitting Source: Bishop, PRML Jamal Atif (Université Paris-Dauphine) Deep Learning February 8, /

32 What s new! Before 2006, training deep architectures was unsuccessful! Underfitting New optimization techniques Overfitting Bengio, Hinton, LeCun Unsupervised pre-training Stochastic dropout" training Jamal Atif (Université Paris-Dauphine) Deep Learning February 8, /

33 Unsupervised pre-training Main idea Initialize the network in an unsupervised way. Consequences Network layers encode successive invariant features (latent structure of the data) Better optimization and hence better generalization Jamal Atif (Université Paris-Dauphine) Deep Learning February 8, /

34 Unsupervised pre-training How to? Proceed greedily, layer by layer. x x x x2 x2 x2 x3 x3 x3 xd xd xd Input Layer = Input Layer = Input Layer Unsupervised NN learning techniques Restricted Boltzmann Machines (RBM) Auto-encoders and many many variants since 2006 Jamal Atif (Université Paris-Dauphine) Deep Learning February 8, /

35 Unsupervised pre-training Autoencoder Credit: Hugo Larochelle Jamal Atif (Université Paris-Dauphine) Deep Learning February 8, /

36 Unsupervised pre-training Sparse Autoencoders Credit: Y. Bengio Jamal Atif (Université Paris-Dauphine) Deep Learning February 8, /

37 Fine tuning How to? Add the output layer Initialize its weights with random values initialize the hidden layers with the values from the pre-training Update the layers by Backpropagation x x2 x3 xd Input Layer Output Layer Jamal Atif (Université Paris-Dauphine) Deep Learning February 8, /

38 Drop out Intuition Regularize the network by dropping out stochastically some hidden units. Procedure Assign to each hidden unit a value 0 with probability p (common choice:.5) x x2 x3 xd Input Layer Output Layer Jamal Atif (Université Paris-Dauphine) Deep Learning February 8, /

39 Some applications Jamal Atif (Université Paris-Dauphine) Deep Learning February 8, /

40 Computer vision Convolutional Neural Networks Lecun, 89 State of the art in digit recognition Jamal Atif (Université Paris-Dauphine) Deep Learning February 8, /

41 Jamal Atif (Université Paris-Dauphine) Deep Learning February 8, /

42 Modern CNN A. Krizhevsky, I. Sutskever, and G. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, NIPS hidden layers, 650,000 units, 60,000,000 parameters Drop out 0 6 images GPU implementation Activation function: f(x) = max(0, x) (ReLu) Jamal Atif (Université Paris-Dauphine) Deep Learning February 8, /

43 Computer vision Convolutional Neural Networks LeCun, Hinton, Bengio, Nature 205 Jamal Atif (Université Paris-Dauphine) Deep Learning February 8, /

44 What is deep in deep learning! Old vs New ML paradigms Decision Decision Classifier Hand-crafted features Classifier Learned features Representation learning Raw data Raw data Jamal Atif (Université Paris-Dauphine) Deep Learning February 8, /

45 Softwares TensorFlow (google), Theano (Python CPU/GPU) mathematical and deep learning library http ://deeplearning.net/software/theano. Can do automatic, symbolic differentiation. Senna: NLP by Collobert et al. http ://ronan.collobert.com/senna/ State-of-the-art performance on many tasks 3500 lines of C, extremely fast and using very little memory. Torch ML Library (C+Lua) : http :// Recurrent Neural Network Language Model http :// imikolov/rnnlm Recursive Neural Net and RAE models for paraphrase detection, sentiment analysis, relation classiffcation Jamal Atif (Université Paris-Dauphine) Deep Learning February 8, /

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)