Buck Shlegeris, Matthew Alger
COMP3740, 2014
Outline
- Supervised, unsupervised, and reinforcement learning
- Neural nets
- RMI
- Results with RMI
Types of machine learning
- supervised
- unsupervised
- reinforcement
MNIST (the handwritten-digit dataset)
Supervised learning
Given some pairs $(x, f(x))$, approximate $f$.
- classification
- regression
- prediction
Unsupervised learning
Given data, find patterns.
- Goal is to increase the likelihood of the observed data.
- Related to compression.
- Useful as a preprocessing step.
Reinforcement learning
You see some stuff, and get some reward. What do you want to do?
- Way more general and much harder, so we make assumptions:
  - stationarity
  - MDP structure
- Value estimation? (Use supervised prediction?)
- Explore vs. exploit?
- Preprocessing somehow?
Neural nets
What's a nice class of functions from $\mathbb{R}^n$ to $\mathbb{R}^m$?
Neural nets
$f(x) = Wx + b$ (1)
Let's use the name $\theta$ for our model: the combination of $W$ and $b$.
Logistic regression
$P(Y = i \mid \theta, x) = s(Wx + b)_i$ (2)
($s$ rescales vectors in $\mathbb{R}^n$ to have an L1 norm of 1.)
$\mathrm{prediction}(\theta, x) = \operatorname*{argmax}_i P(Y = i \mid \theta, x)$ (3)
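A minimal numpy sketch of equations (2) and (3), assuming $s$ is the usual softmax; the function names here are ours, not from the slides:

import numpy as np

def softmax(z):
    # Shift by the max for numerical stability; the output sums to 1.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def predict(W, b, x):
    # Equation (3): prediction(theta, x) = argmax_i P(Y = i | theta, x).
    return np.argmax(softmax(W @ x + b))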
Logistic regression
$L(\theta, x, y) = -\log(s(Wx + b)_y)$ (4)
Overfitting?
$L(\theta, x, y) = -\log(s(Wx + b)_y) + R(\theta)$ (5)
What if instead of a single input vector $x$ and a single label $y$, we had a whole list of inputs $D$ and a vector of labels $y$?
$L(\theta, D, y) = \sum_i L(\theta, D_i, y_i) + R(\theta)$ (6)
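A sketch of equations (4) through (6); taking $R$ to be an L2 penalty on $W$ is our assumption, since the slides leave $R$ abstract:

def nll(W, b, x, y):
    # Equation (4): negative log-probability of the correct label y.
    return -np.log(softmax(W @ x + b)[y])

def total_loss(W, b, D, ys, lam=1e-4):
    # Equation (6): sum the per-example losses, then add R(theta).
    return sum(nll(W, b, x, y) for x, y in zip(D, ys)) + lam * np.sum(W ** 2)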
Logistic regression
$\theta^*(D, y) = \operatorname*{argmin}_\theta L(\theta, D, y)$ (7)
$\theta \in \mathbb{R}^{|x| \times |y|} \times \mathbb{R}^{|y|}$, so good luck finding that analytically.
Logistic regression
Gradient descent:
while the last update was bigger than $\epsilon$ do
    $W \leftarrow W - \alpha \frac{\partial L(\theta, D, y)}{\partial W}$
end while
Logistic regression
Stochastic gradient descent:
for each input $x$ and label $y$ in $(D, y)$ do
    $W \leftarrow W - \alpha \frac{\partial L(\theta, x, y)}{\partial W}$
end for
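The SGD loop above in numpy; for the softmax negative log-likelihood, the gradient with respect to $W$ works out to $(p - \mathrm{onehot}(y))\,x^T$. The step size alpha and the epoch count are hypothetical hyperparameters:

def sgd(W, b, D, ys, alpha=0.1, n_epochs=10):
    for _ in range(n_epochs):
        for x, y in zip(D, ys):
            p = softmax(W @ x + b)
            p[y] -= 1.0                   # dL/d(Wx+b) = p - onehot(y)
            W -= alpha * np.outer(p, x)   # dL/dW = (p - onehot(y)) x^T
            b -= alpha * p
    return W, b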
Logistic regression
Only linearly separable classes can be separated!
Multilayer perceptron
Logistic regression: $s(Wx + b)$
Multilayer perceptron: $s(W\, s(W_2 x + b_2) + b) = s(Wh + b)$
Multilayer perceptron
Why stop at 1 layer?
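The one-hidden-layer forward pass from the previous slide; using a sigmoid for the hidden nonlinearity is our assumption:

def mlp_forward(W, b, W2, b2, x):
    # Hidden layer h = sigmoid(W2 x + b2); output = s(W h + b).
    h = 1.0 / (1.0 + np.exp(-(W2 @ x + b2)))
    return softmax(W @ h + b)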
Autoencoders
$\hat{x} = s(W\, s(W_2 x + b_2) + b)$ (8)
$L(\theta, x) = \|x - \hat{x}\|$ (9)
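Equations (8) and (9) as code; reading the norm in (9) as squared error, and decoding with the transpose of the encoder weights (tied weights), are both our assumptions:

def autoencode(W2, b2, b_out, x):
    # Equation (8): encode x to h, then decode h back to x_hat.
    h = 1.0 / (1.0 + np.exp(-(W2 @ x + b2)))
    return 1.0 / (1.0 + np.exp(-(W2.T @ h + b_out)))

def recon_loss(x, x_hat):
    # Equation (9): distance between the input and its reconstruction.
    return np.sum((x - x_hat) ** 2)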
Denoising autoencoders (dAs)
$\hat{x} = s(W\, s(W_2\, n(x) + b_2) + b)$ (10)
(where $n$ is a stochastic noise function)
Without the noise:
$\hat{x} = s(W\, s(W_2 x + b_2) + b)$ (11)
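A sketch of equation (10); masking noise, which zeroes each input dimension with probability p, is one common choice for the stochastic noise function $n$, but the slides don't pin it down:

def corrupt(x, p=0.3, rng=np.random):
    # n(x): keep each coordinate with probability 1 - p, zero it otherwise.
    return x * (rng.rand(len(x)) > p)

def denoising_forward(W2, b2, b_out, x):
    # Train-time pass: reconstruct the clean x from a corrupted copy.
    return autoencode(W2, b2, b_out, corrupt(x))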
Denoising autoencoders (dAs)
Supervised cost: $L(\theta, x, y) = -\log(s(Wx + b)_y)$ (12)
Unsupervised cost: $L(\theta, x) = \|x - \hat{x}\|$ (13)
How do we mix between these?
Reward modulation
Add a time-varying modulation function $\lambda(t)$:
$L(\theta, x, y) = \lambda(t) \cdot (\text{supervised cost}) + (1 - \lambda(t)) \cdot (\text{unsupervised cost})$ (14)
$L(\theta, x, y) = \lambda(t)\left(-\log(s(Wx + b)_y)\right) + (1 - \lambda(t))\,\|x - \hat{x}\|$ (15)
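Equation (15) in code, reusing the pieces above; routing the supervised head through the shared hidden layer is our assumption, and lam is whatever schedule $\lambda(t)$ you pick (see the schedule sketches after the results figures):

def rmi_loss(W, b, W2, b2, b_out, x, y, t, lam):
    # lam(t) * supervised NLL + (1 - lam(t)) * reconstruction error.
    h = 1.0 / (1.0 + np.exp(-(W2 @ x + b2)))
    sup = -np.log(softmax(W @ h + b)[y])
    x_hat = 1.0 / (1.0 + np.exp(-(W2.T @ h + b_out)))
    unsup = np.sum((x - x_hat) ** 2)
    return lam(t) * sup + (1 - lam(t)) * unsup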
Reward modulation
Motivations:
- Information theory
- Fine-tuning
- Extreme learning
Reward modulation
Questions:
- Does it improve performance on problems?
- How should we vary reward modulation?
Results
Hyperparameters
Classification problem
[Figure: error rate vs. epochs (0 to 40) for step reward modulation with $\lambda = 0$ if $t < p$ and $\lambda = 1$ otherwise, for $p \in \{0, 1, 2, 4, 10, 15, 20\}$.]
Classification problem
[Figure: error rate vs. epochs (0 to 40) for linear reward modulation with $\lambda = t/k$ for $k \in \{1, 2, 4, 10, 20, 40, 80\}$.]
Classification problem
[Figure: error rate vs. epochs (0 to 40) for hyperbolic reward modulation with $\lambda = 1 - \frac{k}{k + 2t}$ for $k \in \{1, 5, 9, 13\}$ and $\lambda = 1 - \frac{k}{k + t}$ for $k \in \{10, 12, 20, 100\}$.]
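The three schedule families from the figures above, as we read them off the legends; clamping $\lambda$ to [0, 1] is our assumption:

def step_schedule(p):
    # lambda = 0 until epoch p, then 1 (step figure).
    return lambda t: 0.0 if t < p else 1.0

def linear_schedule(k):
    # lambda grows linearly as t/k, capped at 1 (linear figure).
    return lambda t: min(t / k, 1.0)

def hyperbolic_schedule(k, c):
    # lambda = 1 - k / (k + c*t), rising from 0 toward 1 (hyperbolic figure).
    return lambda t: 1.0 - k / (k + c * t)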
Contextual bandit
No results yet.
MDP
Conclusions
- RMI works on classification (maybe because it's like fine-tuning).
- Haven't got contextual bandit results yet.
- Works nicely on grid world for some reason; more MDP data to come.