Buck Shlegeris, Matthew Alger
COMP3740, 2014
Outline
- Supervised, unsupervised, and reinforcement learning
- Neural nets
- RMI
- Results with RMI
Types of machine learning
- supervised
- unsupervised
- reinforcement
MNIST (the handwritten-digit dataset)
Supervised learning
Given some pairs $(x, f(x))$, approximate $f$.
- classification
- regression
- prediction
Unsupervised learning
Given data, find patterns.
- Goal is to increase the likelihood of the observed data.
- Related to compression.
- Useful as a preprocessing step.
Reinforcement learning
You see some stuff, and get some reward. What do you want to do?
- Way more general and much harder, so we make assumptions:
  - stationarity
  - MDP structure
- Value estimation? (Use supervised prediction?)
- Explore vs. exploit?
- Preprocessing somehow?
Neural nets
What's a nice class of functions from $\mathbb{R}^n$ to $\mathbb{R}^m$?
Neural nets
$f(x) = Wx + b$ (1)
Let's use the name $\theta$ for our model: the combination of $W$ and $b$.
Logistic regression
$P(Y = i \mid \theta, x) = s(Wx + b)_i$ (2)
($s$ rescales vectors in $\mathbb{R}^n$ to have an L1 norm of 1.)
$\mathrm{prediction}(\theta, x) = \operatorname*{argmax}_i P(Y = i \mid \theta, x)$ (3)
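A minimal numpy sketch of equations (2) and (3), assuming $s$ is the usual softmax; the function names here are ours, not from the slides:

import numpy as np

def softmax(z):
    # Shift by the max for numerical stability; the output sums to 1.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def predict(W, b, x):
    # Equation (3): prediction(theta, x) = argmax_i P(Y = i | theta, x).
    return np.argmax(softmax(W @ x + b))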
Logistic regression
$L(\theta, x, y) = -\log(s(Wx + b)_y)$ (4)
Overfitting?
$L(\theta, x, y) = -\log(s(Wx + b)_y) + R(\theta)$ (5)
What if instead of a single input vector $x$ and a single label $y$, we had a whole list of inputs $D$ and a vector of labels $y$?
$L(\theta, D, y) = \sum_i L(\theta, D_i, y_i) + R(\theta)$ (6)
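A sketch of equations (4) through (6); taking $R$ to be an L2 penalty on $W$ is our assumption, since the slides leave $R$ abstract:

def nll(W, b, x, y):
    # Equation (4): negative log-probability of the correct label y.
    return -np.log(softmax(W @ x + b)[y])

def total_loss(W, b, D, ys, lam=1e-4):
    # Equation (6): sum the per-example losses, then add R(theta).
    return sum(nll(W, b, x, y) for x, y in zip(D, ys)) + lam * np.sum(W ** 2)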
Logistic regression
$\theta^*(D, y) = \operatorname*{argmin}_\theta L(\theta, D, y)$ (7)
$\theta \in \mathbb{R}^{|x| \times |y|} \times \mathbb{R}^{|y|}$, so good luck finding that analytically.
Logistic regression
Gradient descent:
while the last update was bigger than $\epsilon$ do
    $W \leftarrow W - \alpha \frac{\partial L(\theta, D, y)}{\partial W}$
end while
Logistic regression
Stochastic gradient descent:
for each input $x$ and label $y$ in $(D, y)$ do
    $W \leftarrow W - \alpha \frac{\partial L(\theta, x, y)}{\partial W}$
end for
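The SGD loop above in numpy; for the softmax negative log-likelihood, the gradient with respect to $W$ works out to $(p - \mathrm{onehot}(y))\,x^T$. The step size alpha and the epoch count are hypothetical hyperparameters:

def sgd(W, b, D, ys, alpha=0.1, n_epochs=10):
    for _ in range(n_epochs):
        for x, y in zip(D, ys):
            p = softmax(W @ x + b)
            p[y] -= 1.0                   # dL/d(Wx+b) = p - onehot(y)
            W -= alpha * np.outer(p, x)   # dL/dW = (p - onehot(y)) x^T
            b -= alpha * p
    return W, b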
Logistic regression
Only linearly separable classes can be separated!
Multilayer perceptron
Logistic regression: $s(Wx + b)$
Multilayer perceptron: $s(W\, s(W_2 x + b_2) + b) = s(Wh + b)$
Multilayer perceptron
Why stop at 1 layer?
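The one-hidden-layer forward pass from the previous slide; using a sigmoid for the hidden nonlinearity is our assumption:

def mlp_forward(W, b, W2, b2, x):
    # Hidden layer h = sigmoid(W2 x + b2); output = s(W h + b).
    h = 1.0 / (1.0 + np.exp(-(W2 @ x + b2)))
    return softmax(W @ h + b)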
Autoencoders
$\hat{x} = s(W\, s(W_2 x + b_2) + b)$ (8)
$L(\theta, x) = \|x - \hat{x}\|$ (9)
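Equations (8) and (9) as code; reading the norm in (9) as squared error, and decoding with the transpose of the encoder weights (tied weights), are both our assumptions:

def autoencode(W2, b2, b_out, x):
    # Equation (8): encode x to h, then decode h back to x_hat.
    h = 1.0 / (1.0 + np.exp(-(W2 @ x + b2)))
    return 1.0 / (1.0 + np.exp(-(W2.T @ h + b_out)))

def recon_loss(x, x_hat):
    # Equation (9): distance between the input and its reconstruction.
    return np.sum((x - x_hat) ** 2)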
Denoising autoencoders (dAs)
$\hat{x} = s(W\, s(W_2\, n(x) + b_2) + b)$ (10)
(where $n$ is a stochastic noise function)
Without the noise:
$\hat{x} = s(W\, s(W_2 x + b_2) + b)$ (11)
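A sketch of equation (10); masking noise, which zeroes each input dimension with probability p, is one common choice for the stochastic noise function $n$, but the slides don't pin it down:

def corrupt(x, p=0.3, rng=np.random):
    # n(x): keep each coordinate with probability 1 - p, zero it otherwise.
    return x * (rng.rand(len(x)) > p)

def denoising_forward(W2, b2, b_out, x):
    # Train-time pass: reconstruct the clean x from a corrupted copy.
    return autoencode(W2, b2, b_out, corrupt(x))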
Denoising autoencoders (dAs)
Supervised cost: $L(\theta, x, y) = -\log(s(Wx + b)_y)$ (12)
Unsupervised cost: $L(\theta, x) = \|x - \hat{x}\|$ (13)
How do we mix between these?
Reward modulation
Add a time-varying modulation function $\lambda(t)$:
$L(\theta, x, y) = \lambda(t) \cdot (\text{supervised cost}) + (1 - \lambda(t)) \cdot (\text{unsupervised cost})$ (14)
$L(\theta, x, y) = \lambda(t)\left(-\log(s(Wx + b)_y)\right) + (1 - \lambda(t))\,\|x - \hat{x}\|$ (15)
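Equation (15) in code, reusing the pieces above; routing the supervised head through the shared hidden layer is our assumption, and lam is whatever schedule $\lambda(t)$ you pick (see the schedule sketches after the results figures):

def rmi_loss(W, b, W2, b2, b_out, x, y, t, lam):
    # lam(t) * supervised NLL + (1 - lam(t)) * reconstruction error.
    h = 1.0 / (1.0 + np.exp(-(W2 @ x + b2)))
    sup = -np.log(softmax(W @ h + b)[y])
    x_hat = 1.0 / (1.0 + np.exp(-(W2.T @ h + b_out)))
    unsup = np.sum((x - x_hat) ** 2)
    return lam(t) * sup + (1 - lam(t)) * unsup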
Reward modulation
Motivations:
- Information theory
- Fine-tuning
- Extreme learning
Reward modulation
Questions:
- Does it improve performance on problems?
- How should we vary reward modulation?
Results
Hyperparameters
Classification problem
[Figure: error rate vs. epochs (0 to 40) for step reward modulation with $\lambda = 0$ if $t < p$ and $\lambda = 1$ otherwise, for $p \in \{0, 1, 2, 4, 10, 15, 20\}$.]
Classification problem
[Figure: error rate vs. epochs (0 to 40) for linear reward modulation with $\lambda = t/k$ for $k \in \{1, 2, 4, 10, 20, 40, 80\}$.]
Classification problem
[Figure: error rate vs. epochs (0 to 40) for hyperbolic reward modulation with $\lambda = 1 - \frac{k}{k + 2t}$ for $k \in \{1, 5, 9, 13\}$ and $\lambda = 1 - \frac{k}{k + t}$ for $k \in \{10, 12, 20, 100\}$.]
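The three schedule families from the figures above, as we read them off the legends; clamping $\lambda$ to [0, 1] is our assumption:

def step_schedule(p):
    # lambda = 0 until epoch p, then 1 (step figure).
    return lambda t: 0.0 if t < p else 1.0

def linear_schedule(k):
    # lambda grows linearly as t/k, capped at 1 (linear figure).
    return lambda t: min(t / k, 1.0)

def hyperbolic_schedule(k, c):
    # lambda = 1 - k / (k + c*t), rising from 0 toward 1 (hyperbolic figure).
    return lambda t: 1.0 - k / (k + c * t)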
Contextual bandit
No results yet.
MDP
Conclusions
- RMI works on classification (maybe because it's like fine-tuning).
- Haven't got contextual bandit results yet.
- Works nicely on grid world for some reason; more MDP data to come.