CSCE 636 Neural Networks
Instructor: Yoonsuck Choe

Deep Learning

What Is Deep Learning?

Learning higher-level abstractions/representations from data.
Motivation: how the brain represents sensory information in a hierarchical manner.

The Rise of Deep Learning: Long History (in Hindsight)

Made popular in recent years:
Geoffrey Hinton et al. (2006).
Andrew Ng & Jeff Dean (Google Brain team, 2012).
Schmidhuber et al.'s deep neural networks (won many competitions and in some cases showed superhuman performance; 2011).

Earlier work:
Fukushima's Neocognitron (1980).
LeCun et al.'s convolutional neural networks (1989).
Schmidhuber's work on stacked recurrent neural networks (1993). Vanishing gradient problem.
Fukushima's Neocognitron

Appeared in the journal Biological Cybernetics (1980).
Multiple layers with local receptive fields.
S cells (trainable) and C cells (fixed weight).
Deformation-resistant recognition.

LeCun's Convolutional Neural Nets

Convolution kernel (weight sharing) + subsampling.
Fully connected layers near the end.

Current Trends

Deep belief networks (based on the Boltzmann machine).
Deep neural networks.
Convolutional neural networks.
Deep Q-learning Network (extensions to reinforcement learning).

Boltzmann Machine to Deep Belief Nets

Haykin Chapter 11: Stochastic Methods Rooted in Statistical Mechanics.
Boltzmann Machine

Stochastic binary machine: states are $+1$ or $-1$.
Fully connected, symmetric connections: $w_{ij} = w_{ji}$, and $w_{ii} = 0$.
Visible vs. hidden neurons; clamped vs. free-running.
Goal: learn weights to model the probability distribution of the visible units.
Unsupervised. Pattern completion.

Boltzmann Machine: Energy

Network state: $x$, drawn from random variable $X$.
Energy (in analogy to thermodynamics):
$$E(x) = -\frac{1}{2} \sum_j \sum_{i,\, i \neq j} w_{ji} x_j x_i$$

Boltzmann Machine: Probability of a State x

The probability of a state $x$ given $E(x)$ follows the Gibbs distribution:
$$P(X = x) = \frac{1}{Z} \exp(-E(x)/T)$$
where $Z = \sum_x \exp(-E(x)/T)$ is the partition function (a normalization factor that is hard to compute) and $T$ is the temperature parameter.
Low-energy states are exponentially more probable.
With the above, we can calculate $P(X_j = x \mid \{X_i = x_i\}_{i=1, i \neq j}^K)$, and this can be done without knowing $Z$.

Boltzmann Machine: P(X_j = x | the rest)

Let $A$: $X_j = x$, and $B$: $\{X_i = x_i\}_{i=1, i \neq j}^K$ (the rest). Then
$$P(X_j = x \mid \text{the rest}) = \frac{P(A, B)}{P(B)} = \frac{P(A, B)}{P(A, B) + P(\neg A, B)} = \frac{1}{1 + \exp\left(-\frac{x}{T} \sum_{i,\, i \neq j} w_{ji} x_i\right)} = \text{sigmoid}\!\left(\frac{x}{T} \sum_{i,\, i \neq j} w_{ji} x_i\right)$$
The equilibrium state can be computed based on the above.
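The stochastic unit update above, swept over all units in turn, is exactly one step of the Gibbs sampling procedure described next. A minimal Python sketch (the tiny network, function names, and default temperature are illustrative assumptions, not from the slides):

```python
import math
import random

def sample_unit(x, j, w, T=1.0):
    """Sample a new +/-1 state for unit j given all other units.

    Implements P(X_j = +1 | rest) = sigmoid((1/T) * sum_{i != j} w[j][i] * x[i]),
    matching the conditional-probability formula on the slide.
    """
    field = sum(w[j][i] * x[i] for i in range(len(x)) if i != j)
    p_plus = 1.0 / (1.0 + math.exp(-field / T))
    return 1 if random.random() < p_plus else -1

def gibbs_sweep(x, w, T=1.0):
    """One full Gibbs sampling sweep: resample each unit in turn,
    conditioning on the most recently updated values of the others."""
    for j in range(len(x)):
        x[j] = sample_unit(x, j, w, T)
    return x
```

Running many sweeps while gradually lowering `T` (simulated annealing) drives the state toward the low-energy, high-probability configurations of the Gibbs distribution.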
Boltzmann Machine: Gibbs Sampling

Initialize $x^{(0)}$ to a random vector.
For $j = 1, 2, \ldots, n$ (generate $n$ samples $x \sim P(X)$):
draw $x_1^{(j+1)}$ from $p(x_1 \mid x_2^{(j)}, x_3^{(j)}, \ldots, x_K^{(j)})$
draw $x_2^{(j+1)}$ from $p(x_2 \mid x_1^{(j+1)}, x_3^{(j)}, \ldots, x_K^{(j)})$
draw $x_3^{(j+1)}$ from $p(x_3 \mid x_1^{(j+1)}, x_2^{(j+1)}, x_4^{(j)}, \ldots, x_K^{(j)})$
$\vdots$
draw $x_K^{(j+1)}$ from $p(x_K \mid x_1^{(j+1)}, x_2^{(j+1)}, \ldots, x_{K-1}^{(j+1)})$
This yields one new sample $x^{(j+1)} \sim P(X)$.
Simulated annealing (high to low $T$) is used for faster convergence.

Boltzmann Learning Rule (1)

Probability of an activity pattern being one of the training patterns (visible units: subvector $x_\alpha$; hidden units: subvector $y_\beta$), given the weight vector $w$: $P(X_\alpha = x_\alpha)$.
Log-likelihood of the visible units being any one of the training patterns (assuming they are mutually independent):
$$L(w) = \log \prod_{x_\alpha} P(X_\alpha = x_\alpha) = \sum_{x_\alpha} \log P(X_\alpha = x_\alpha)$$
We want to learn $w$ that maximizes $L(w)$.

Boltzmann Learning Rule (2)

Want to calculate $P(X_\alpha = x_\alpha)$: use the energy function.
$$P(X_\alpha = x_\alpha) = \frac{1}{Z} \sum_{x_\beta} \exp(-E(x)/T)$$
$$\log P(X_\alpha = x_\alpha) = \log \sum_{x_\beta} \exp(-E(x)/T) - \log Z = \log \sum_{x_\beta} \exp(-E(x)/T) - \log \sum_x \exp(-E(x)/T)$$
Note: $Z = \sum_x \exp(-E(x)/T)$.

Boltzmann Learning Rule (3)

Finally, we get:
$$L(w) = \sum_{x_\alpha} \left[ \log \sum_{x_\beta} \exp(-E(x)/T) - \log \sum_x \exp(-E(x)/T) \right]$$
Note that $w$ is involved in:
$$E(x) = -\frac{1}{2} \sum_j \sum_{i,\, i \neq j} w_{ji} x_j x_i$$
Differentiating $L(w)$ with respect to $w_{ji}$, we get:
$$\frac{\partial L(w)}{\partial w_{ji}} = \frac{1}{T} \sum_{x_\alpha} \left[ \sum_{x_\beta} P(X_\beta = x_\beta \mid X_\alpha = x_\alpha)\, x_j x_i - \sum_x P(X = x)\, x_j x_i \right]$$
Boltzmann Learning Rule (4)

Setting:
$$\rho^+_{ji} = \sum_{x_\alpha} \sum_{x_\beta} P(X_\beta = x_\beta \mid X_\alpha = x_\alpha)\, x_j x_i$$
$$\rho^-_{ji} = \sum_{x_\alpha} \sum_x P(X = x)\, x_j x_i$$
we get:
$$\frac{\partial L(w)}{\partial w_{ji}} = \frac{1}{T} \left( \rho^+_{ji} - \rho^-_{ji} \right)$$
Attempting to maximize $L(w)$, we get:
$$\Delta w_{ji} = \epsilon \frac{\partial L(w)}{\partial w_{ji}} = \eta \left( \rho^+_{ji} - \rho^-_{ji} \right)$$
where $\eta = \epsilon / T$. This is gradient ascent.

Boltzmann Machine Summary

Theoretically elegant.
Very slow in practice (especially the unclamped phase).

Logistic (or Directed) Belief Net

Similar to the Boltzmann machine, but with directed, acyclic connections.
$$P(X_j = x_j \mid X_1 = x_1, \ldots, X_{j-1} = x_{j-1}) = P(X_j = x_j \mid \text{parents}(X_j))$$
Same learning rule: $\Delta w_{ji} = \eta \frac{\partial L(w)}{\partial w_{ji}}$.
With dense connections, calculation of $P$ becomes intractable.

Deep Belief Net (1)

Hinton et al. (2006). Overcomes issues with the logistic belief net.
Based on the Restricted Boltzmann Machine (RBM): a visible and a hidden layer, with full layer-to-layer connections but no within-layer connections.
RBM back-and-forth update: update hidden given visible, then visible given hidden, etc., then train $w$ based on
$$\frac{\partial L(w)}{\partial w_{ji}} = \rho^{(0)}_{ji} - \rho^{(\infty)}_{ji}$$
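The RBM back-and-forth update can be sketched as contrastive-divergence-style learning. This is a minimal pure-Python sketch assuming binary 0/1 units, sigmoid activations, and a single reconstruction step (CD-1, which truncates the $\rho^{(0)} - \rho^{(\infty)}$ rule after one back-and-forth pass); all names are illustrative:

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def sample_hidden(v, w, nh):
    # p(h_j = 1 | v) = sigmoid(sum_i w[i][j] * v_i)
    return [1 if random.random() < sigmoid(sum(w[i][j] * v[i] for i in range(len(v)))) else 0
            for j in range(nh)]

def sample_visible(h, w, nv):
    # p(v_i = 1 | h) = sigmoid(sum_j w[i][j] * h_j)  -- same shared weights, used in reverse
    return [1 if random.random() < sigmoid(sum(w[i][j] * h[j] for j in range(len(h)))) else 0
            for i in range(nv)]

def cd1_update(v0, w, eta=0.1):
    """One contrastive-divergence (CD-1) weight update:
    w[i][j] += eta * (correlation at step 0 - correlation after one reconstruction),
    a one-step approximation of the rho(0) - rho(infinity) rule."""
    nv, nh = len(w), len(w[0])
    h0 = sample_hidden(v0, w, nh)   # hidden given clamped visible
    v1 = sample_visible(h0, w, nv)  # reconstructed visible
    h1 = sample_hidden(v1, w, nh)   # hidden given the reconstruction
    for i in range(nv):
        for j in range(nh):
            w[i][j] += eta * (v0[i] * h0[j] - v1[i] * h1[j])
    return w
```

Repeating `cd1_update` over many training vectors nudges the RBM toward modeling the distribution of its visible units, without ever running the chain to equilibrium.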
Deep Belief Net (2)

Deep Belief Net = layer-by-layer training using RBMs.
Hybrid architecture: top layer undirected, lower layers directed.
1. Train an RBM on the input to form a hidden representation.
2. Use the hidden representation as input to train another RBM.
3. Repeat step 2 for each additional layer.
Applications: NIST digit recognition.

Deep Convolutional Neural Networks (1)

Krizhevsky et al. (2012).
Applied to the ImageNet competition (1.2 million images, 1,000 classes).
Network: 60 million parameters and 650,000 neurons.
Top-1 and top-5 error rates of 37.5% and 17.0%.
Trained with backprop.

Deep Convolutional Neural Networks (2)

Learned kernels (first convolutional layer) resemble mammalian receptive fields: oriented Gabor patterns, color opponency (red-green, blue-yellow).

Deep Convolutional Neural Networks (3)

Left: hits, misses, and close calls. Right: test images (1st column) vs. training images with the hidden representations closest to the test data.
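The convolution-plus-subsampling pattern these networks stack can be sketched in a few lines. This is a toy plain-Python illustration, not AlexNet's actual implementation; like most neural-network "convolution" layers it computes a cross-correlation (the kernel is not flipped):

```python
def conv2d_valid(image, kernel):
    """'Valid' 2D convolution with a shared kernel (weight sharing):
    every output position reuses the same kernel weights."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(iw - kw + 1)]
            for r in range(ih - kh + 1)]

def subsample2x2(fmap):
    """2x2 average pooling (subsampling), halving each spatial dimension."""
    return [[(fmap[r][c] + fmap[r][c + 1] + fmap[r + 1][c] + fmap[r + 1][c + 1]) / 4.0
             for c in range(0, len(fmap[0]) - 1, 2)]
            for r in range(0, len(fmap) - 1, 2)]
```

Stacking several such convolution + subsampling stages, followed by fully connected layers, gives the overall shape of LeCun-style and AlexNet-style architectures.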
Deep Q-Network (DQN)

Google DeepMind (Mnih et al., Nature 2015).
Latest application of deep learning to a reinforcement learning domain (Q as in Q-learning).
Applied to Atari 2600 video game playing.

DQN Overview

Input: video screen; Output: Q(s, a); Reward: game score.
Q(s, a): the action-value function, i.e., the value of taking action a when in state s.

DQN Algorithm

Input preprocessing.
Experience replay (collect and replay state, action, reward, and resulting state).
Delayed (periodic) update of Q.
Moving target $\hat{Q}$ values are used to compute the error (loss function $L$, parameterized by weights $\theta_i$).
Gradient descent on $\frac{\partial L}{\partial \theta_i}$.
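The ingredients listed above (experience replay, a periodically synced target network, gradient steps toward a moving target) can be sketched with a tabular stand-in for the Q-network. This is a toy illustration of the update structure only; a real DQN uses a deep convolutional network trained by backprop, and all names here are illustrative:

```python
import random
from collections import defaultdict, deque

class TinyDQN:
    """Tabular stand-in for DQN's training loop: an experience replay
    buffer plus a delayed target table q_hat that is synced periodically."""

    def __init__(self, actions, gamma=0.99, alpha=0.1, sync_every=100):
        self.q = defaultdict(float)        # online Q(s, a)
        self.q_hat = defaultdict(float)    # moving target Q_hat(s, a)
        self.actions = actions
        self.gamma, self.alpha = gamma, alpha
        self.sync_every, self.steps = sync_every, 0
        self.replay = deque(maxlen=10000)  # experience replay buffer

    def store(self, s, a, r, s_next):
        """Collect one transition (state, action, reward, resulting state)."""
        self.replay.append((s, a, r, s_next))

    def train_step(self, batch_size=4):
        """Replay a random minibatch and move Q toward the TD target."""
        batch = random.sample(self.replay, min(batch_size, len(self.replay)))
        for s, a, r, s_next in batch:
            # the TD target uses the frozen target table, not the online one
            target = r + self.gamma * max(self.q_hat[(s_next, b)] for b in self.actions)
            self.q[(s, a)] += self.alpha * (target - self.q[(s, a)])
        self.steps += 1
        if self.steps % self.sync_every == 0:
            self.q_hat = defaultdict(float, self.q)  # delayed (periodic) update
```

Freezing `q_hat` between syncs keeps the regression target stable while the online values change, which is the point of the delayed update in DQN.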
DQN Results

Superhuman performance on over half of the games.

DQN Operation

Value vs. game state; game state vs. action value.

DQN Hidden Layer Representation (t-SNE map)

States with similar perception and similar reward are clustered together.

Summary

Deep belief network: based on the Boltzmann machine. Elegant theory, good performance.
Deep convolutional networks: high computational demand, across-the-board great performance.
Deep Q-Network: a unique approach to reinforcement learning. End-to-end machine learning. Superhuman performance.