Reward-modulated inference

Size: px

Start display at page:

Download "Reward-modulated inference"

Ashlyn Doyle
5 years ago
Views:

1 Buck Shlegeris Matthew Alger COMP3740, 2014

2 Outline Supervised, unsupervised, and reinforcement learning Neural nets RMI Results with RMI

3 Types of machine learning supervised unsupervised reinforcement

4 MNIST

5 Supervised learning Given some pairs (x, f (x)), approximate f.

6 Supervised learning Given some pairs (x, f (x)), approximate f. classification

7 Supervised learning Given some pairs (x, f (x)), approximate f. classification regression

8 Supervised learning Given some pairs (x, f (x)), approximate f. classification regression prediction

9 Unsupervised learning Given data, find patterns.

10 Unsupervised learning Given data, find patterns. Goal is to increase likelihood of observed data.

11 Unsupervised learning Given data, find patterns. Goal is to increase likelihood of observed data. Related to compression

12 Unsupervised learning Given data, find patterns. Goal is to increase likelihood of observed data. Related to compression This is useful as a preprocessing step.

13 Reinforcement learning You see some stuff, and get some reward. What do you want to do?

14 Reinforcement learning You see some stuff, and get some reward. What do you want to do? Way more general much harder make assumptions.

15 Reinforcement learning You see some stuff, and get some reward. What do you want to do? Way more general much harder make assumptions. Stationary

16 Reinforcement learning You see some stuff, and get some reward. What do you want to do? Way more general much harder make assumptions. Stationary MDP

17 Reinforcement learning You see some stuff, and get some reward. What do you want to do? Way more general much harder make assumptions. Stationary MDP Value estimation? (Use supervised prediction?)

18 Reinforcement learning You see some stuff, and get some reward. What do you want to do? Way more general much harder make assumptions. Stationary MDP Value estimation? (Use supervised prediction?) Explore vs exploit?

19 Reinforcement learning You see some stuff, and get some reward. What do you want to do? Way more general much harder make assumptions. Stationary MDP Value estimation? (Use supervised prediction?) Explore vs exploit? Preprocessing somehow?

20 Neural nets What s a nice class of functions from R n to R m?

21 Neural nets f ( x) = W x + b (1)

22 Neural nets f ( x) = W x + b (1) Let s use the name θ for our model, the combination of W and b.

23 Logistic regression P(Y = i θ, x) = s(w x + b)i (2) (s rescales vectors in R n to have an L1 norm of 1.)

24 Logistic regression P(Y = i θ, x) = s(w x + b)i (2) (s rescales vectors in R n to have an L1 norm of 1.) prediction(θ, x) = argmax(p(y = i θ, x)) (3) i

25 Logistic regression L(θ, x, y) = log(s(w x + b)y ) (4)

26 Logistic regression L(θ, x, y) = log(s(w x + b)y ) (4) Overfitting?

27 Logistic regression L(θ, x, y) = log(s(w x + b)y ) (4) Overfitting? L(θ, x, y) = log(s(w x + b)y ) R(θ) (5)

28 Logistic regression L(θ, x, y) = log(s(w x + b)y ) (4) Overfitting? L(θ, x, y) = log(s(w x + b)y ) R(θ) (5) What if instead of a single input vector x and single label y, we had a whole list of inputs D and a vector of labels y?

29 Logistic regression L(θ, x, y) = log(s(w x + b)y ) (4) Overfitting? L(θ, x, y) = log(s(w x + b)y ) R(θ) (5) What if instead of a single input vector x and single label y, we had a whole list of inputs D and a vector of labels y? L(θ, D, y) = L(θ, D i, y i ) R(θ) (6) i D

30 Logistic regression θ (D, y) =

31 Logistic regression θ (D, y) = argmin (L(θ, D, y)) (7) θ

32 Logistic regression θ (D, y) = argmin (L(θ, D, y)) (7) θ θ (R x y R y ), so good luck finding that analytically

33 Logistic regression

34 Logistic regression Gradient descent while last update was bigger than ɛ do W new W α L(θ,D, y) W end while

35 Logistic regression Stochastic gradient descent for input x and label y (D, y) do w W α L(θ, x,y) W end for

36 Logistic regression Only linearly separable things can be separated!

37 Multilayer perceptron Logistic regression Multilayer perceptron s(w x + b) s(ws(w 2 x + b2 ) + b) = s(w h + b)

38 Multilayer perceptron Why stop at 1 layer?

39 Multilayer perceptron Why stop at 1 layer?

40 Autoencoders

41 Autoencoders x = s(w s(w 2 x + b2 ) + b) (8)

42 Autoencoders x = s(w s(w 2 x + b2 ) + b) (8) L(θ, x) = x x (9)

43 Denoising autoencoders (das)

44 Denoising autoencoders (das) s(w s(w 2 n( x) + b2 ) + b) (10) (where n is a stochastic noise function)

45 Denoising autoencoders (das)

46 Denoising autoencoders (das) s(w s(w 2 x + b2 ) + b) (11)

47 Denoising autoencoders (das)

48 Denoising autoencoders (das)

49 Denoising autoencoders (das)

50 Denoising autoencoders (das)

51 Denoising autoencoders (das) L(θ, x, y) = log(s(w x + b)y ) (12)

52 Denoising autoencoders (das) L(θ, x, y) = log(s(w x + b)y ) (12) L(θ, x) = x x (13)

53 Denoising autoencoders (das) L(θ, x, y) = log(s(w x + b)y ) (12) L(θ, x) = x x (13) How do we mix between these?

54 Reward modulation Add a time-varying modulation function λ(t):

55 Reward modulation Add a time-varying modulation function λ(t): L(θ, x, y) = λ(t) supervised cost + (1 λ(t)) unsupervised cost (14)

56 Reward modulation Add a time-varying modulation function λ(t): L(θ, x, y) = λ(t) supervised cost + (1 λ(t)) unsupervised cost (14) ( ) ( ) L(θ, x, y) = λ(t) log(s(w x + b)y ) + (1 λ(t)) x x (15)

57 Reward modulation Motivations: Information theory

58 Reward modulation Motivations: Information theory Fine tuning

59 Reward modulation Motivations: Information theory Fine tuning Extreme learning

60 Reward modulation Questions:

61 Reward modulation Questions: Does it improve performance on problems?

62 Reward modulation Questions: Does it improve performance on problems? How should we vary reward modulation?

63 Reward modulation Questions: Does it improve performance on problems? How should we vary reward modulation?

64 Results

65 Hyperparameters

66 Hyperparameters

67 Hyperparameters

68 Classification problem Error rate p = 0 p = 1 p = 2 p = 4 p = 10 p = 15 p = Epochs Figure: Step reward modulation with λ = 0 if t < p, and λ = 0 otherwise.

69 Classification problem 10 2 λ = t λ = t 2 Error rate 6 4 λ = t 4 λ = t 10 λ = t 20 λ = t 40 λ = t Epochs Figure: Linear reward modulation with different changes in λ.

70 Classification problem 10 2 Error rate 6 4 λ = t λ = t λ = t λ = t λ = t λ = t λ = t λ = t Epochs Figure: Hyperbolic reward modulation with different changes in λ.

71 Contextual bandit no results yet.

72 MDP

73 Conclusions

74 Conclusions RMI works on classification (maybe because it s like fine-tuning)

75 Conclusions RMI works on classification (maybe because it s like fine-tuning) Haven t got contextual bandit results yet.

76 Conclusions RMI works on classification (maybe because it s like fine-tuning) Haven t got contextual bandit results yet. Works nicely on grid world for some reason, more MDP data to come.

Regression and Classification" with Linear Models" CMPSCI 383 Nov 15, 2011!

Regression and Classification" with Linear Models" CMPSCI 383 Nov 15, 2011! 1 Todayʼs topics" Learning from Examples: brief review! Univariate Linear Regression! Batch gradient descent! Stochastic gradient