CSCE 636 Neural Networks
Instructor: Yoonsuck Choe

Deep Learning

What Is Deep Learning?
Learning higher-level abstractions/representations from data.
Motivation: how the brain represents sensory information in a hierarchical manner.

The Rise of Deep Learning
Made popular in recent years:
Geoffrey Hinton et al. (2006).
Andrew Ng & Jeff Dean (Google Brain team, 2012).
Schmidhuber et al.'s deep neural networks won many competitions and in some cases showed super-human performance (2011).
Google DeepMind: Atari 2600 games (2015), AlphaGo (2016).

Long History in Hindsight
Fukushima's Neocognitron (1980).
LeCun et al.'s convolutional neural networks (1989).
Schmidhuber's work on stacked recurrent neural networks (1993): vanishing gradient problem.
See Schmidhuber's extended review: Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61, 85-117.
Fukushima's Neocognitron
Appeared in the journal Biological Cybernetics (1980).
Multiple layers with local receptive fields.
S cells (trainable) and C cells (fixed weight).
Deformation-resistant recognition.

LeCun's Convolutional Neural Nets
Convolution (kernel weight sharing) + subsampling; a small code sketch of these two operations appears below.
Fully connected layers near the end.

Current Trends
Deep belief networks based on the Boltzmann machine.
Deep neural networks.
Convolutional neural networks.
Deep Q-learning Network (extensions to reinforcement learning).
Applications to diverse domains including natural language.
Lots of open-source tools available.

Boltzmann Machine to Deep Belief Nets
Haykin Chapter 11: Stochastic Methods Rooted in Statistical Mechanics.
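Before turning to the Boltzmann machine, here is a minimal NumPy sketch of the two operations named on the LeCun slide: convolution with a single shared kernel (weight sharing) followed by subsampling (average pooling). This is only an illustration under simplified assumptions, not the Neocognitron's or LeNet's actual implementation; the function names and sizes are made up.

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid 2-D 'convolution' (cross-correlation, as in most conv layers)
    with one kernel whose weights are shared across all image positions."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

def subsample(fmap, size=2):
    """Non-overlapping average pooling (subsampling)."""
    h, w = fmap.shape
    h, w = h - h % size, w - w % size            # trim to a multiple of the pool size
    fmap = fmap[:h, :w]
    return fmap.reshape(h // size, size, w // size, size).mean(axis=(1, 3))

# Toy usage: one feature map from a random image and a random shared kernel.
image = np.random.rand(28, 28)
kernel = np.random.randn(5, 5) * 0.1
feature_map = np.tanh(convolve2d(image, kernel))
pooled = subsample(feature_map, size=2)
print(feature_map.shape, pooled.shape)           # (24, 24) (12, 12)
```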
Boltzmann Machine
Stochastic binary machine: states are +1 or -1.
Fully connected, symmetric connections: w_{ij} = w_{ji}.
Visible vs. hidden neurons; clamped vs. free-running.
Goal: learn weights to model the probability distribution of the visible units.
Unsupervised. Pattern completion.

Boltzmann Machine: Energy
Network state: x, from random variable X.
w_{ij} = w_{ji} and w_{ii} = 0.
Energy (in analogy to thermodynamics):
E(x) = -\frac{1}{2} \sum_{i} \sum_{j,\, j \ne i} w_{ji} x_i x_j

Boltzmann Machine: Prob. of a State x
The probability of a state x, given E(x), follows the Gibbs distribution:
P(X = x) = \frac{1}{Z} \exp(-E(x)/T)
Z: partition function (normalization factor, hard to compute), Z = \sum_x \exp(-E(x)/T).
T: temperature parameter.
Low-energy states are exponentially more probable.
With the above, we can calculate P(X_j = x_j \mid \{X_i = x_i\}_{i=1,\, i \ne j}^{K}). This can be done without knowing Z.

Boltzmann Machine: P(X_j = x_j | the rest)
A: X_j = x_j.  B: \{X_i = x_i\}_{i=1,\, i \ne j}^{K}.
P(X_j = x_j \mid \text{the rest}) = \frac{P(A, B)}{P(B)}
P(A \mid B) = \frac{P(A, B)}{P(A, B) + P(\bar{A}, B)} = \frac{1}{1 + \exp\!\left(-\frac{2 x_j}{T} \sum_{i,\, i \ne j} w_{ji} x_i\right)} = \mathrm{sigmoid}\!\left(\frac{2 x_j}{T} \sum_{i,\, i \ne j} w_{ji} x_i\right)
Can compute the equilibrium state based on the above.
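A minimal NumPy sketch of the energy function and the stochastic unit update above, assuming bipolar (+1/-1) units and a zero-diagonal symmetric weight matrix; the helper names and the toy 5-unit network are illustrative, not part of the slides.

```python
import numpy as np

def energy(x, W):
    """E(x) = -1/2 * sum_{i != j} w_ji x_i x_j (diagonal of W is zero)."""
    return -0.5 * x @ W @ x

def p_unit_on(x, W, j, T=1.0):
    """P(X_j = +1 | all other units), derived from the Gibbs distribution."""
    net = W[j] @ x - W[j, j] * x[j]              # sum over i != j of w_ji * x_i
    return 1.0 / (1.0 + np.exp(-2.0 * net / T))

def update_unit(x, W, j, T=1.0, rng=np.random):
    """Stochastically set unit j to +1 or -1 per its conditional probability."""
    x[j] = 1.0 if rng.rand() < p_unit_on(x, W, j, T) else -1.0
    return x

# Toy usage: a 5-unit machine with symmetric weights and zero diagonal.
rng = np.random.RandomState(0)
W = rng.randn(5, 5) * 0.1
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)
x = rng.choice([-1.0, 1.0], size=5)
print(energy(x, W))
x = update_unit(x, W, j=2, T=1.0, rng=rng)
```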
Boltzmann Machine: Gibbs Sampling
Initialize x^{(0)} to a random vector.
For j = 1, 2, ..., n (generate n samples):
x_1^{(j+1)} from p(x_1 \mid x_2^{(j)}, x_3^{(j)}, ..., x_K^{(j)})
x_2^{(j+1)} from p(x_2 \mid x_1^{(j+1)}, x_3^{(j)}, ..., x_K^{(j)})
x_3^{(j+1)} from p(x_3 \mid x_1^{(j+1)}, x_2^{(j+1)}, x_4^{(j)}, ..., x_K^{(j)})
...
x_K^{(j+1)} from p(x_K \mid x_1^{(j+1)}, x_2^{(j+1)}, x_3^{(j+1)}, ..., x_{K-1}^{(j+1)})
One new sample: x^{(j+1)} ~ P(X).
Simulated annealing (high T to low T) is used for faster convergence.

Boltzmann Learning Rule
Probability of an activity pattern being one of the training patterns (visible units: subvector x_α; hidden units: subvector x_β), given the weight vector w: P(X_α = x_α).
Log-likelihood of the visible units being any one of the training patterns 𝒯 (assuming they are mutually independent):
L(w) = \log \prod_{x_\alpha \in \mathcal{T}} P(X_\alpha = x_\alpha) = \sum_{x_\alpha \in \mathcal{T}} \log P(X_\alpha = x_\alpha)
We want to learn w that maximizes L(w).

Boltzmann Learning Rule 2
Want to calculate P(X_α = x_α), the probability of finding the visible neurons in state x_α (with any x_β): use the energy function.
P(X_\alpha = x_\alpha) = \sum_{x_\beta} P(X_\alpha = x_\alpha, X_\beta = x_\beta) = \frac{1}{Z} \sum_{x_\beta} \exp(-E(x)/T)
\log P(X_\alpha = x_\alpha) = \log \sum_{x_\beta} \exp(-E(x)/T) - \log \sum_{x} \exp(-E(x)/T)
Note: Z = \sum_x \exp(-E(x)/T).

Boltzmann Learning Rule 3
Finally, we get:
L(w) = \sum_{x_\alpha \in \mathcal{T}} \left[ \log \sum_{x_\beta} \exp(-E(x)/T) - \log \sum_{x} \exp(-E(x)/T) \right]
Note that w is involved in:
E(x) = -\frac{1}{2} \sum_{i} \sum_{j,\, j \ne i} w_{ji} x_i x_j
Differentiating L(w) with respect to w_{ji}, we get:
\frac{\partial L(w)}{\partial w_{ji}} = \frac{1}{T} \sum_{x_\alpha \in \mathcal{T}} \left[ \sum_{x_\beta} P(X_\beta = x_\beta \mid X_\alpha = x_\alpha)\, x_j x_i - \sum_{x} P(X = x)\, x_j x_i \right]
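Building on the per-unit update sketched earlier, here is a hedged sketch of one full Gibbs sweep and a simple annealing loop; gibbs_sweep, anneal, and the temperature schedule are illustrative names and values, not the lecture's code.

```python
import numpy as np

def gibbs_sweep(x, W, T=1.0, rng=np.random):
    """One Gibbs sampling sweep: resample every unit in turn, each conditioned
    on the current values of all other units (units already updated in this
    sweep contribute their new values, as on the slide)."""
    for j in range(len(x)):
        net = W[j] @ x - W[j, j] * x[j]                  # sum_{i != j} w_ji x_i
        p_on = 1.0 / (1.0 + np.exp(-2.0 * net / T))
        x[j] = 1.0 if rng.rand() < p_on else -1.0
    return x

def anneal(x, W, T_schedule, rng=np.random):
    """Simulated annealing: run Gibbs sweeps while lowering the temperature."""
    for T in T_schedule:
        x = gibbs_sweep(x, W, T, rng)
    return x

# Toy usage.
rng = np.random.RandomState(1)
W = rng.randn(6, 6) * 0.1
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)
x = rng.choice([-1.0, 1.0], size=6)
x = anneal(x, W, T_schedule=[4.0, 2.0, 1.0, 0.5], rng=rng)
```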
Boltzmann Learning Rule 3 (continued): Some Hints
To derive:
\frac{\partial L(w)}{\partial w_{ji}} = \frac{1}{T} \sum_{x_\alpha \in \mathcal{T}} \left[ \sum_{x_\beta} P(X_\beta = x_\beta \mid X_\alpha = x_\alpha)\, x_j x_i - \sum_x P(X = x)\, x_j x_i \right]
use:
\frac{\partial E(x)}{\partial w_{ji}} = -x_i x_j, \quad i \ne j
P(X_\beta = x_\beta \mid X_\alpha = x_\alpha) = \frac{P(X_\alpha = x_\alpha, X_\beta = x_\beta)}{P(X_\alpha = x_\alpha)} = \frac{\exp(-E(x)/T)/Z}{\sum_{x_\beta} \exp(-E(x)/T)/Z}
P(X = x) = \frac{\exp(-E(x)/T)}{Z}

Boltzmann Learning Rule 4
Setting:
\rho^{+}_{ji} = \sum_{x_\alpha \in \mathcal{T}} \sum_{x_\beta} P(X_\beta = x_\beta \mid X_\alpha = x_\alpha)\, x_j x_i
\rho^{-}_{ji} = \sum_{x_\alpha \in \mathcal{T}} \sum_{x} P(X = x)\, x_j x_i
we get:
\frac{\partial L(w)}{\partial w_{ji}} = \frac{1}{T} \left( \rho^{+}_{ji} - \rho^{-}_{ji} \right)
Attempting to maximize L(w), we get:
\Delta w_{ji} = \epsilon \frac{\partial L(w)}{\partial w_{ji}} = \eta \left( \rho^{+}_{ji} - \rho^{-}_{ji} \right), where \eta = \epsilon / T.
This is gradient ascent.

Boltzmann Machine Summary
Theoretically elegant.
Very slow in practice (especially the unclamped phase).

Logistic or Directed Belief Net
Similar to the Boltzmann Machine, but with directed, acyclic connections.
P(X_j = x_j \mid X_1 = x_1, ..., X_{j-1} = x_{j-1}) = P(X_j = x_j \mid \mathrm{parents}(X_j))
Same learning rule: \Delta w_{ji} = \eta \frac{\partial L(w)}{\partial w_{ji}}
With dense connections, calculation of P becomes intractable.
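A small sketch of the resulting weight update, assuming the clamped-phase and free-running-phase state samples have already been collected (for example, with Gibbs sweeps like the ones above); correlation_estimate and boltzmann_update are illustrative names, not the lecture's code.

```python
import numpy as np

def correlation_estimate(samples):
    """Average pairwise correlation <x_j x_i> over a collection of state vectors."""
    X = np.asarray(samples)
    return X.T @ X / len(X)

def boltzmann_update(W, clamped_samples, free_samples, eta=0.01):
    """One gradient-ascent step on the log-likelihood:
    delta_w_ji = eta * (rho_plus_ji - rho_minus_ji)."""
    rho_plus = correlation_estimate(clamped_samples)   # visible units clamped to data
    rho_minus = correlation_estimate(free_samples)     # free-running phase
    W = W + eta * (rho_plus - rho_minus)
    np.fill_diagonal(W, 0.0)                           # keep w_ii = 0
    return W
```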
Deep Belief Net
Hinton et al. (2006).
Based on the Restricted Boltzmann Machine (RBM): visible and hidden layers, with layer-to-layer full connection but no within-layer connections.
RBM back-and-forth update: update hidden given visible, then update visible given hidden, etc., then train w based on \frac{\partial L(w)}{\partial w_{ji}} \propto \rho^{(0)}_{ji} - \rho^{(\infty)}_{ji}.

Deep Belief Net 2
Layer-by-layer training using RBMs.
Hybrid architecture: top layer undirected, lower layers directed.
Overcomes issues with the Logistic Belief Net.

Deep Belief Net 3
1. Train an RBM on the input to form a hidden representation.
2. Use the hidden representation as input to train another RBM.
3. Repeat steps 1-2.
Applications: MNIST digit recognition.
(A layer-by-layer training sketch appears below, after the convolutional network slides.)

Deep Convolutional Neural Networks
Krizhevsky et al. (2012). Applied to the ImageNet competition: 1.2 million images, 1,000 classes.
Network: 60 million parameters and 650,000 neurons.
Top-1 and top-5 error rates of 37.5% and 17.0%.
Trained with backprop.

Deep Convolutional Neural Networks 2
Learned kernels (first convolutional layer) resemble mammalian RFs: oriented Gabor patterns, color opponency (red-green, blue-yellow).
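Referring back to the Deep Belief Net slides: a compact sketch of greedy layer-by-layer RBM training. It uses the common CD-1 ("back-and-forth") approximation in place of the exact ρ^(0) - ρ^(∞) statistics, omits bias terms for brevity, and all names and sizes are illustrative assumptions, not the lecture's code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_rbm(data, n_hidden, epochs=10, eta=0.05, rng=np.random):
    """Train one RBM with CD-1. data: (n_samples, n_visible) array of 0/1 values."""
    n_visible = data.shape[1]
    W = 0.01 * rng.randn(n_visible, n_hidden)
    for _ in range(epochs):
        for v0 in data:
            p_h0 = sigmoid(v0 @ W)                      # hidden given visible
            h0 = (rng.rand(n_hidden) < p_h0).astype(float)
            p_v1 = sigmoid(W @ h0)                      # visible given hidden
            p_h1 = sigmoid(p_v1 @ W)                    # hidden given reconstruction
            # Contrastive-divergence surrogate for rho^(0) - rho^(infinity)
            W += eta * (np.outer(v0, p_h0) - np.outer(p_v1, p_h1))
    return W

def train_dbn(data, layer_sizes, rng=np.random):
    """Greedy layer-by-layer training: each RBM's hidden representation
    becomes the input for the next RBM."""
    weights, layer_input = [], data
    for n_hidden in layer_sizes:
        W = train_rbm(layer_input, n_hidden, rng=rng)
        weights.append(W)
        layer_input = sigmoid(layer_input @ W)          # propagate up one layer
    return weights

# Toy usage: 100 random binary patterns, a 3-layer stack.
rng = np.random.RandomState(0)
data = (rng.rand(100, 20) > 0.5).astype(float)
weights = train_dbn(data, layer_sizes=[16, 12, 8], rng=rng)
```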
Deep Convolutional Neural Networks 3
Left: hits, misses, and close calls. Right: test images (1st column) vs. training images with the closest hidden representation to the test data.

Deep Q-Network (DQN)
Google DeepMind (Mnih et al., Nature, 2015).
Latest application of deep learning to a reinforcement learning domain (Q as in Q-learning).
Applied to Atari 2600 video game playing.

DQN Overview
Input: video screen; Output: Q(s, a); Reward: game score.
Q(s, a): action-value function, the value of taking action a when in state s.

DQN Overview 2
Input preprocessing.
Experience replay: collect and replay (state, action, reward, resulting state) tuples.
Delayed (periodic) update of Q.
A moving target \hat{Q} value is used to compute the error (loss) function L, parameterized by weights \theta_i.
Gradient descent on L with respect to \theta_i: \frac{\partial L}{\partial \theta_i}.
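A conceptual sketch of two DQN ingredients named above, experience replay and the moving \hat{Q} target. It assumes a generic q_target callable (state -> array of action values) rather than the paper's actual convolutional network, and all names are illustrative.

```python
import random
from collections import deque
import numpy as np

class ReplayBuffer:
    """Experience replay: store (state, action, reward, next_state, done)
    tuples and sample random minibatches to break temporal correlations."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(list(self.buffer), batch_size)

def dqn_targets(batch, q_target, gamma=0.99):
    """Moving target: y = r + gamma * max_a' Q_hat(s', a'), computed with the
    periodically updated target network q_target."""
    ys = []
    for s, a, r, s_next, done in batch:
        y = r if done else r + gamma * np.max(q_target(s_next))
        ys.append((s, a, y))
    return ys

# The online network's weights theta_i are then adjusted by gradient descent on
# L(theta_i) = E[(y - Q(s, a; theta_i))^2], and theta is copied into the target
# network every fixed number of steps (the "delayed periodic update of Q").
```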
DQN Algorithm
(Figure: the DQN training algorithm.)

DQN Results
Superhuman performance on over half of the games.

DQN Hidden Layer Representation
t-SNE map: states with similar perception and similar reward are clustered together.

DQN Operation
Value vs. game state; game state vs. action value.
Deep Learning Tools
Caffe: UC Berkeley's deep learning toolbox.
Google TensorFlow.
Microsoft CNTK (Computational Network Toolkit).
Other: Apache Mahout (MapReduce-based ML).

Summary
Deep belief networks: based on the Boltzmann machine. Elegant theory, good performance.
Deep convolutional networks: high computational demand, across-the-board great performance.
Deep Q-Network: a unique approach to reinforcement learning. End-to-end machine learning. Super-human performance.
Flood of deep learning tools available.