CSC242: Intro to AI. Lecture 23


Prosper Lester
5 years ago
1 CSC242: Intro to AI Lecture 23

2 Administrivia
- Posters! Tue Apr 24 and Thu Apr 26
- Idea! Presentation!
- 2-wide x 4-high landscape pages

3 Learning so far...

4 Input Attributes (Alt through Est) and output Will Wait

Example  Alt  Bar  Fri  Hun  Pat   Price  Rain  Res  Type     Est    Will Wait
x1       Yes  No   No   Yes  Some  $$$    No    Yes  French   0-10   y1 = yes
x2       Yes  No   No   Yes  Full  $      No    No   Thai     ?      y2 = no
x3       No   Yes  No   No   Some  $      No    No   Burger   0-10   y3 = yes
x4       Yes  No   Yes  Yes  Full  $      Yes   No   Thai     ?      y4 = yes
x5       Yes  No   Yes  No   Full  $$$    No    Yes  French   >60    y5 = no
x6       No   Yes  No   Yes  Some  $$     Yes   Yes  Italian  0-10   y6 = yes
x7       No   Yes  No   No   None  $      Yes   No   Burger   0-10   y7 = no
x8       No   No   No   Yes  Some  $$     Yes   Yes  Thai     0-10   y8 = yes
x9       No   Yes  Yes  No   Full  $      Yes   No   Burger   >60    y9 = no
x10      Yes  Yes  Yes  Yes  Full  $$$    No    Yes  Italian  ?      y10 = no
x11      No   No   No   No   None  $      No    No   Thai     0-10   y11 = no
x12      Yes  Yes  Yes  Yes  Full  $      No    No   Burger   ?      y12 = yes

5 Patrons?
    None: No
    Some: Yes
    Full: Hungry?
        No: No
        Yes: Type?
            French: Yes
            Italian: No
            Thai: Fri/Sat?
                No: No
                Yes: Yes
            Burger: Yes

6 $h_w(x) = w_1 x + w_0$
$L(h_w) = \sum_{j=1}^{N} L_2(y_j, h_w(x_j)) = \sum_{j=1}^{N} (y_j - h_w(x_j))^2 = \sum_{j=1}^{N} (y_j - (w_1 x_j + w_0))^2$

7 Carl Friedrich Gauss ( )

8 Linear Regression
Find $w = [w_0, w_1]$ that minimizes $L(h_w)$:
$w^* = \operatorname{argmin}_w L(h_w) = \operatorname{argmin}_w \sum_{j=1}^{N} (y_j - (w_1 x_j + w_0))^2$

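9 Gradient Descent
w ← any point in parameter space
loop until convergence do
    for each $w_i$ in w do
        $w_i \leftarrow w_i - \alpha \frac{\partial}{\partial w_i} L(w)$
Update rule: $\alpha$ is the learning rate, and $\frac{\partial}{\partial w_i} L(w)$ is the gradient of the loss function along the $w_i$ axis.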
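The update rule above can be sketched in a few lines of Python for the univariate squared-error loss from the earlier slides. This is only an illustration: the function name `fit_line`, the fixed step count used in place of a convergence test, the learning rate, and the toy data are all assumptions, not part of the lecture.

```python
def fit_line(xs, ys, alpha=0.01, steps=5000):
    """Batch gradient descent on L(w) = sum_j (y_j - (w1*x_j + w0))**2."""
    w0, w1 = 0.0, 0.0  # any starting point in parameter space
    for _ in range(steps):  # fixed step count stands in for "until convergence"
        errors = [y - (w1 * x + w0) for x, y in zip(xs, ys)]
        # Partial derivatives of the loss along the w0 and w1 axes.
        grad_w0 = -2.0 * sum(errors)
        grad_w1 = -2.0 * sum(e * x for e, x in zip(errors, xs))
        # Update rule: w_i <- w_i - alpha * dL/dw_i
        w0 -= alpha * grad_w0
        w1 -= alpha * grad_w1
    return w0, w1


# Toy data generated from y = 2x + 1 (hypothetical example).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w0, w1 = fit_line(xs, ys)
```

On this noise-free data the weights should approach w0 = 1 and w1 = 2. A learning rate that is too large makes the iteration diverge, so the value 0.01 is tied to this particular data set.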

10 Gradient Descent In Weight Space
[Figure: loss plotted over the weights w0 and w1, with the minimum at w* = [w0, w1]]

12 Linear Classifier
Decision boundary: $w_0 + w_1 x_1 + w_2 x_2 = 0$, that is, $w \cdot x = 0$
- All instances of one class are above the line: $w \cdot x > 0$
- All instances of the other class are below the line: $w \cdot x < 0$
$h_w(x) = \mathrm{Threshold}(w \cdot x)$

13 Hard Threshold
$\mathrm{Threshold}(z) = 1$ if $z \ge 0$, and $0$ otherwise

14 [Figure: proportion correct vs. number of weight updates]

15 Logistic Threshold
$\mathrm{Logistic}(z) = \frac{1}{1 + e^{-z}}$

16 [Figure: three plots of squared error per example vs. number of weight updates]

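17 Neuron
- Bias input $a_0 = 1$ with bias weight $w_{0,j}$
- Input links carry activations $a_i$ with weights $w_{i,j}$
- Input function: $in_j = \sum_i w_{i,j} a_i$
- Activation function: $a_j = g(in_j)$
- Output $a_j$ is sent along the output links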

18 Bags: agent, process, disease, ...
Candies (D1, D2, D3): observations (actions, effects, symptoms, results of tests, ...)
Goal: predict the next candy; more generally, predict the agent's next move, the next output of the process, or the disease given symptoms and tests

19 Bayesian Learning
$P(X \mid d) = \sum_i P(X \mid h_i) P(h_i \mid d) = \alpha \sum_i P(X \mid h_i) P(d \mid h_i) P(h_i)$
- $P(h_i)$: hypothesis prior
- $P(X \mid h_i)$: prediction of the hypothesis
- $P(d \mid h_i)$: likelihood of the data under the hypothesis

20 Maximum A Posteriori (MAP)
$h_{MAP} = \operatorname{argmax}_{h_i} P(h_i \mid d)$
$P(X \mid d) \approx P(X \mid h_{MAP})$

21 Maximum Likelihood Hypothesis
- Assume a uniform hypothesis prior
- No hypothesis is preferred to any other a priori (e.g., all equally complex)
$h_{MAP} = \operatorname{argmax}_{h_i} P(h_i \mid d) = \operatorname{argmax}_{h_i} P(d \mid h_i) = h_{ML}$
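To make the difference between full Bayesian prediction and MAP prediction concrete, here is a small sketch for the candy-bag setting. The five hypotheses, their priors, and their cherry proportions are assumed values in the style of the textbook candy example; they do not appear on the slides, and the function names are hypothetical.

```python
# Assumed hypothesis space: five bag types with different cherry fractions.
priors = [0.1, 0.2, 0.4, 0.2, 0.1]       # P(h_i), assumed
p_cherry = [1.0, 0.75, 0.5, 0.25, 0.0]   # P(cherry | h_i), assumed


def posterior(data):
    """P(h_i | d) = alpha * P(d | h_i) * P(h_i) for i.i.d. observations."""
    unnormalized = []
    for prior, pc in zip(priors, p_cherry):
        likelihood = 1.0
        for candy in data:
            likelihood *= pc if candy == "cherry" else 1.0 - pc
        unnormalized.append(likelihood * prior)
    alpha = 1.0 / sum(unnormalized)
    return [alpha * u for u in unnormalized]


def predict_cherry(data):
    """Full Bayesian prediction: sum_i P(cherry | h_i) * P(h_i | d)."""
    return sum(pc * ph for pc, ph in zip(p_cherry, posterior(data)))


def map_hypothesis(data):
    """Index of the hypothesis with the highest posterior probability."""
    post = posterior(data)
    return max(range(len(post)), key=lambda i: post[i])


data = ["lime", "lime", "lime"]
best = map_hypothesis(data)
bayes_prediction = predict_cherry(data)
map_prediction = p_cherry[best]
```

After three limes the MAP hypothesis is already the all-lime bag, so the MAP prediction for cherry is 0, while the full Bayesian prediction still assigns cherry a probability of roughly 0.2. The two predictions move closer together as more data arrives.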

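22 Burglary network (conditional probability tables)
- Burglary: P(B) = .001
- Earthquake: P(E) = .002
- Alarm: P(A) for (B, E) = (t, t), (t, f), (f, t), (f, f)
- JohnCalls: P(J) for A = t, f
- MaryCalls: P(M) for A = t, f: .70, .01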

23 Maximum Likelihood Hypothesis
$\operatorname{argmax}_\Theta P(d \mid h_\Theta)$

24 Log Likelihood
$P(d \mid h_\Theta) = \prod_j P(d_j \mid h_\Theta) = \Theta^c (1 - \Theta)^l$
$L(d \mid h_\Theta) = \log P(d \mid h_\Theta) = \sum_j \log P(d_j \mid h_\Theta) = c \log \Theta + l \log(1 - \Theta)$
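The slide stops at the log likelihood. The maximizing value of $\Theta$ follows by setting the derivative to zero; the short derivation below uses the same symbols, with $c$ and $l$ read as the counts of the two kinds of observation (an interpretation of the slide's notation).

```latex
\frac{d}{d\Theta} L(d \mid h_\Theta)
  = \frac{c}{\Theta} - \frac{l}{1 - \Theta} = 0
\quad\Longrightarrow\quad
c\,(1 - \Theta) = l\,\Theta
\quad\Longrightarrow\quad
\Theta_{ML} = \frac{c}{c + l}
```

So the maximum likelihood estimate is simply the observed proportion of the first kind of observation.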

25 Naive Bayes Models
- Class: { terrorist, tourist }
- Attributes: Arrival Mode, One-way Ticket, Furtive Manner, ...

26 Learning Naive Bayes Models
- A Naive Bayes model with n Boolean attributes requires 2n+1 parameters
- The maximum likelihood hypothesis $h_{ML}$ can be found with no search
- Scales to large problems
- Robust to noisy or missing data
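"No search" here means the maximum likelihood parameters are just observed frequencies. A minimal sketch, assuming Boolean attributes and a handful of made-up examples; the function names, class labels, and data are hypothetical, and no smoothing is applied, so unseen combinations get probability zero.

```python
from collections import Counter, defaultdict


def fit_naive_bayes(examples):
    """examples: list of (tuple_of_bools, class_label) pairs.

    Returns the class prior P(C) and, per class, P(X_i = True | C),
    both estimated as plain relative frequencies (maximum likelihood).
    """
    class_counts = Counter(label for _, label in examples)
    true_counts = defaultdict(Counter)  # true_counts[label][i]: times attribute i was True
    for attrs, label in examples:
        for i, value in enumerate(attrs):
            if value:
                true_counts[label][i] += 1
    n_attrs = len(examples[0][0])
    prior = {c: k / len(examples) for c, k in class_counts.items()}
    cond = {c: [true_counts[c][i] / class_counts[c] for i in range(n_attrs)]
            for c in class_counts}
    return prior, cond


def classify(prior, cond, attrs):
    """Return the class maximizing P(C) * prod_i P(x_i | C)."""
    def score(c):
        p = prior[c]
        for i, value in enumerate(attrs):
            p *= cond[c][i] if value else 1.0 - cond[c][i]
        return p
    return max(prior, key=score)


# Hypothetical training set with two Boolean attributes.
examples = [
    ((True, True), "pos"),
    ((True, False), "pos"),
    ((False, False), "neg"),
    ((False, True), "neg"),
    ((False, False), "neg"),
]
prior, cond = fit_naive_bayes(examples)
label = classify(prior, cond, (True, False))
```

With a Boolean class, the fitted model holds one prior plus 2n conditional probabilities, which matches the 2n+1 parameter count on the slide.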

27 [Figure: two Bayes nets over Smoking, Diet, Exercise and Symptom 1, Symptom 2, Symptom 3. (a) With the hidden variable HeartDisease: 78 parameters. (b) Without it: 708 parameters]

28 Hidden (Latent) Variables
- Can dramatically reduce the number of parameters required to specify a Bayes net
- This reduces the amount of data required to learn the parameters
- Values of hidden variables are not present in the training data (observations)
- This complicates the learning problem

29 EM: Expectation Maximization
Repeat:
- E: Use the current values of the parameters to compute the expected values of the hidden variables
- M: Recompute the parameters to maximize the log-likelihood of the data given the values of the variables (observed and hidden)
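The two steps can be illustrated on a deliberately small problem that is not from the lecture: two coins with unknown biases, where each trial uses one coin (the hidden variable) for a fixed number of flips and only the head count is observed. The equal chance of picking either coin, the function name, the starting guesses, and the data are all assumptions made for the sketch.

```python
def em_two_coins(head_counts, flips, theta_a, theta_b, iters=50):
    """EM for the biases of two coins when the coin used per trial is hidden."""
    for _ in range(iters):
        heads_a = tails_a = heads_b = tails_b = 0.0
        for h in head_counts:
            t = flips - h
            # E-step: expected value of the hidden variable "this trial used coin A".
            # The binomial coefficient cancels, and the coin prior is assumed uniform.
            like_a = theta_a ** h * (1.0 - theta_a) ** t
            like_b = theta_b ** h * (1.0 - theta_b) ** t
            resp_a = like_a / (like_a + like_b)
            heads_a += resp_a * h
            tails_a += resp_a * t
            heads_b += (1.0 - resp_a) * h
            tails_b += (1.0 - resp_a) * t
        # M-step: maximum likelihood biases given the expected counts.
        theta_a = heads_a / (heads_a + tails_a)
        theta_b = heads_b / (heads_b + tails_b)
    return theta_a, theta_b


# Hypothetical data: heads observed in five trials of 10 flips each.
theta_a, theta_b = em_two_coins([5, 9, 8, 4, 7], flips=10, theta_a=0.6, theta_b=0.5)
```

The result depends on the starting guesses: EM only climbs to a local maximum of the likelihood, and swapping the two initial values swaps the roles of the coins.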

30 Reinforcement Learning

31 B.F. Skinner ( )

34 Reinforcement Learning

35 The Problem with Learning from Examples

36 Where do the examples come from?

37 Forget about examples
[Table: the twelve restaurant examples x1 to x12 from slide 4, shown again]

38 But we need feedback!

42 Reward (a.k.a. Reinforcement)
- The positive or negative feedback one obtains in response to an action
- In animals: pain and hunger are negative reward; pleasure and food are positive reward
- In computers...?

44 [Figure: grid world with a START state and terminal states +1 and -1]

45 Markov Decision Process
- Sequential decision problem
- Fully observable, stochastic environment
- Set of states S with initial state $s_0$
- Markovian transition model: $P(s' \mid s, a)$
- Additive rewards: $R(s)$

46 Policy
- A policy $\pi$ specifies what the agent should do in any state the agent might reach: $\pi(s)$
- Because the environment is stochastic, each execution of a policy may lead to a different history
- The quality of a policy is its expected utility
- The optimal policy $\pi^*$ maximizes expected utility

47 Optimal Policy
$U^\pi(s) = E\left[\sum_{t=0}^{\infty} \gamma^t R(S_t)\right]$
$\pi^*_s = \operatorname{argmax}_\pi U^\pi(s)$

48 Computing Policies
- Value Iteration: easy to understand (AIMA ); converges to the unique set of solutions to the Bellman equations
- Policy Iteration: searches the space of policies rather than refining the values of utilities; more tractable

49 Markov Decision Process
- Sequential decision problem
- Fully observable, stochastic environment
- Set of states S with initial state $s_0$
- Markovian transition model: $P(s' \mid s, a)$
- Additive rewards: $R(s)$
- Learn!

50 Reinforcement Learning
Learn a policy that tells you what to do without knowing:
- How actions work
- How the environment behaves
- How you get rewarded

51 Passive Learning
- Fixed policy $\pi$: $\pi(s)$ says what to do
- Learn $U^\pi(s)$: how good this policy is

52 [Figure: grid world with a START state and terminal states +1 and -1]
$R(s) = -0.04$, $\gamma = 1$, $U^\pi(s)$?

53 Policy Iteration
Repeat:
- Policy Evaluation: given a policy, compute its expected utility
- Policy Improvement: compute a new MEU policy by checking for a better action in any state, given the expected utilities

54 Policy Evaluation
$U_i(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi_i(s)) \, U_i(s')$
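One simple way to solve this system is to apply the equation repeatedly as an update until the values settle. The sketch below does that on a three-state toy MDP invented for illustration: the states, action names, transition probabilities, rewards, and discount of 0.9 are all assumptions, and the fixed number of sweeps stands in for a proper convergence test.

```python
def evaluate_policy(states, policy, P, R, gamma, terminals, sweeps=1000):
    """Iterate U(s) <- R(s) + gamma * sum_s' P(s' | s, pi(s)) * U(s')."""
    U = {s: 0.0 for s in states}
    for _ in range(sweeps):
        U = {
            s: R[s] if s in terminals
            else R[s] + gamma * sum(p * U[s2] for s2, p in P[(s, policy[s])].items())
            for s in states
        }
    return U


# Hypothetical toy MDP: a -> b -> goal, where "go" succeeds 80% of the time.
states = ["a", "b", "goal"]
terminals = {"goal"}
R = {"a": -0.04, "b": -0.04, "goal": 1.0}
P = {
    ("a", "go"): {"b": 0.8, "a": 0.2},
    ("b", "go"): {"goal": 0.8, "b": 0.2},
}
policy = {"a": "go", "b": "go"}
U = evaluate_policy(states, policy, P, R, gamma=0.9, terminals=terminals)
```

Because the policy is fixed, the equations are linear in the utilities, so they could also be solved exactly with a linear solver instead of by iteration.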

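55 [Figure: grid world with a START state and terminal states +1 and -1]
Trials:
(1,1) -> (1,2) -> (1,3) -> (1,2) -> (1,3) -> (2,3) -> (3,3) -> (4,3)
(1,1) -> (1,2) -> (1,3) -> (2,3) -> (3,3) -> (3,2) -> (3,3) -> (4,3)
(1,1) -> (2,1) -> (3,1) -> (4,2)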

56 [Same grid world and the same three trials as slide 55]

57 Direct Utility Estimation
- In each trial, compute the reward-to-go for each state visited in the trial
- Keep track of the average reward-to-go for every state
- In the limit, this converges to the true expected utility of the policy, $U^\pi(s)$
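As a rough sketch of the bookkeeping involved, the code below averages reward-to-go over every visit to a state. It replays the first trial from the grid-world slides, assuming a reward of -0.04 in each nonterminal state, +1 at (4,3), and no discounting; the function name and the data layout (a list of state and reward pairs) are illustrative choices.

```python
from collections import defaultdict


def direct_utility_estimate(trials, gamma=1.0):
    """Average the observed reward-to-go over all visits to each state.

    Each trial is a list of (state, reward) pairs in the order visited.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for trial in trials:
        reward_to_go = 0.0
        # Walk backwards so each state's reward-to-go builds on its successor's.
        for state, reward in reversed(trial):
            reward_to_go = reward + gamma * reward_to_go
            totals[state] += reward_to_go
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}


# First trial from the slides, with assumed rewards attached.
path = [(1, 1), (1, 2), (1, 3), (1, 2), (1, 3), (2, 3), (3, 3)]
trial = [(s, -0.04) for s in path] + [((4, 3), 1.0)]
U = direct_utility_estimate([trial])
```

Under these assumptions the single trial gives 0.72 for (1,1), and the two visits to (1,3) average to 0.84.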

58 Utilities of states are not independent!
$U^\pi(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi(s)) \, U^\pi(s')$

59 Adaptive Dynamic Programming
- Keep track of the observed frequencies of state-action pairs and their outcomes
- Approximate the unknown transition model $P(s' \mid s, a)$ using the observed frequencies
- Use that model in standard policy evaluation to compute the utility of the policy
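The model-learning half of this can be sketched as simple counting. The function name, the tuple layout, and the action name "Right" are hypothetical; the three sample transitions correspond to what happens after state (1,3) in the first two trials shown earlier, assuming the same action was chosen each time.

```python
from collections import Counter, defaultdict


def estimate_transition_model(transitions):
    """Estimate P(s' | s, a) as observed relative frequencies.

    transitions: iterable of (state, action, next_state) triples.
    Returns a dict mapping (state, action) to {next_state: probability}.
    """
    counts = defaultdict(Counter)
    for s, a, s_next in transitions:
        counts[(s, a)][s_next] += 1
    return {
        sa: {s2: k / sum(outcomes.values()) for s2, k in outcomes.items()}
        for sa, outcomes in counts.items()
    }


observed = [
    ((1, 3), "Right", (1, 2)),
    ((1, 3), "Right", (2, 3)),
    ((1, 3), "Right", (2, 3)),
]
model = estimate_transition_model(observed)
```

The estimated model can then be plugged into the policy evaluation equation in place of the true transition probabilities.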

60 [Figure: utility estimates for states (4,3), (3,3), (1,3), (1,1), (3,2) vs. number of trials; RMS error in utility vs. number of trials]

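61 [Figure: grid world with a START state and terminal states +1 and -1]
$U^\pi(1,3) = 0.84$, $U^\pi(2,3) = 0.92$
Trials:
(1,1) -> (1,2) -> (1,3) -> (1,2) -> (1,3) -> (2,3) -> (3,3) -> (4,3)
(1,1) -> (1,2) -> (1,3) -> (2,3) -> (3,3) -> (3,2) -> (3,3) -> (4,3)
(1,1) -> (2,1) -> (3,1) -> (4,2)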

62 Temporal-Difference (TD) Learning
At each step, update the utility estimates using the difference between successive states:
$U^\pi(s) \leftarrow U^\pi(s) + \alpha \left( R(s) + \gamma U^\pi(s') - U^\pi(s) \right)$
where $\alpha$ is the learning rate.
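A single application of the TD update can be written directly from the formula. The example reuses the estimates 0.84 and 0.92 from the preceding slide; the reward of -0.04, the learning rate of 0.5, and the function name are assumptions made for the illustration.

```python
def td_update(U, s, reward, s_next, alpha, gamma=1.0):
    """Apply U(s) <- U(s) + alpha * (R(s) + gamma * U(s') - U(s)) in place."""
    current = U.get(s, 0.0)
    target = reward + gamma * U.get(s_next, 0.0)
    U[s] = current + alpha * (target - current)
    return U[s]


# Current estimates for two neighbouring states, then one observed transition.
U = {(1, 3): 0.84, (2, 3): 0.92}
new_value = td_update(U, (1, 3), reward=-0.04, s_next=(2, 3), alpha=0.5)
```

Here the target is -0.04 + 0.92 = 0.88, so the estimate for (1,3) moves halfway from 0.84 toward 0.88. Unlike ADP, no transition model is needed, only the observed successor state.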

63 [Figure: utility estimates for states (4,3), (3,3), (1,3), (1,1), (2,1) vs. number of trials; RMS error in utility vs. number of trials]

65 Where did that policy come from???

66 Possible Strategy
- Learn an outcome model for actions based on observed frequencies (like ADP)
- Compute the utility of the optimal policy using value or policy iteration (at each observation)
- Use the computed optimal policy to select the next action

67 [Figure: RMS error and policy loss vs. number of trials]

68 How could following the optimal policy not result in optimal behavior? The learned model is just an approximation of the true environment. What is optimal in the learned model may not really be optimal in the environment.

69 Paradox?
- Need to explore unexplored states (since they may be better than where we've been)
- But they may be worse than our current optimal
- And after a while they probably will be

70 Active Learning
Need to trade off:
- Exploitation: maximizing immediate reward by following current utility estimates
- Exploration: improving utility estimates to maximize long-term reward

71 GLIE
- Greedy in the limit of infinite exploration
- Eventually, follow the optimal policy
- Examples:
    - Choose a random action 1/t of the time
    - Give some weight to actions you haven't tried very often, while avoiding actions with strong estimates of low utility
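The first example scheme, a random action 1/t of the time, might look like the following. The function name, its arguments, and the action names are hypothetical; the greedy action is assumed to come from whatever utility estimates the agent currently holds.

```python
import random


def choose_action(t, actions, greedy_action, rng=random):
    """At time step t (starting from 1), explore with probability 1/t."""
    if rng.random() < 1.0 / t:
        return rng.choice(actions)  # explore: pick any action uniformly
    return greedy_action            # exploit: follow current estimates


actions = ["Up", "Down", "Left", "Right"]
early = choose_action(1, actions, "Up", random.Random(0))      # always explores at t = 1
late = choose_action(10**9, actions, "Up", random.Random(0))   # almost surely greedy
```

Exploration never stops entirely, but its probability shrinks toward zero, so the agent's behaviour approaches the greedy policy over time.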

72 Exploration Function
$U^+(s) \leftarrow R(s) + \gamma \max_a f\left( \sum_{s'} P(s' \mid s, a) \, U^+(s'), \; N(s, a) \right)$
$f(u, n) = R^+$ if $n < N_e$, and $u$ otherwise
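A small sketch of how the exploration function changes action selection in a single state. The values chosen for the optimistic reward R+ and the threshold N_e, as well as the function names and sample numbers, are illustrative assumptions.

```python
def exploration_f(u, n, r_plus, n_e):
    """f(u, n): return the optimistic value R+ until tried n_e times, then u."""
    return r_plus if n < n_e else u


def pick_action(expected_utility, visit_counts, r_plus=2.0, n_e=5):
    """Choose the action maximizing f(expected utility, N(s, a)) in one state."""
    return max(
        expected_utility,
        key=lambda a: exploration_f(expected_utility[a], visit_counts[a], r_plus, n_e),
    )


# "Left" looks worse but has barely been tried, so optimism favours it.
expected_utility = {"Up": 0.8, "Left": 0.3}
visit_counts = {"Up": 12, "Left": 1}
choice = pick_action(expected_utility, visit_counts)
```

Once "Left" has been tried N_e times, its real estimate is used and the agent goes back to preferring "Up".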

73 [Figure: utility estimates for states (1,1), (1,2), (1,3), (2,3), (3,2), (3,3), (4,3) vs. number of trials; RMS error and policy loss vs. number of trials]

74 Reinforcement Learning
- Doesn't require labelled examples (training data)
- Learn a policy that tells you what to do without knowing:
    - How actions work
    - How the environment behaves
    - How you get rewarded

75 For Next Time: Posters! (Don't Be Late)
