CSC242: Intro to AI Lecture 23
Administrivia: Posters! Tue Apr 24 and Thu Apr 26. Idea! Presentation! 2-wide x 4-high landscape pages.
Learning so far...
Input attributes and target (Will Wait):

Ex    Alt  Bar  Fri  Hun  Pat   Price  Rain  Res  Type     Est    WillWait
x1    Yes  No   No   Yes  Some  $$$    No    Yes  French   0-10   y1  = yes
x2    Yes  No   No   Yes  Full  $      No    No   Thai     30-60  y2  = no
x3    No   Yes  No   No   Some  $      No    No   Burger   0-10   y3  = yes
x4    Yes  No   Yes  Yes  Full  $      Yes   No   Thai     10-30  y4  = yes
x5    Yes  No   Yes  No   Full  $$$    No    Yes  French   >60    y5  = no
x6    No   Yes  No   Yes  Some  $$     Yes   Yes  Italian  0-10   y6  = yes
x7    No   Yes  No   No   None  $      Yes   No   Burger   0-10   y7  = no
x8    No   No   No   Yes  Some  $$     Yes   Yes  Thai     0-10   y8  = yes
x9    No   Yes  Yes  No   Full  $      Yes   No   Burger   >60    y9  = no
x10   Yes  Yes  Yes  Yes  Full  $$$    No    Yes  Italian  10-30  y10 = no
x11   No   No   No   No   None  $      No    No   Thai     0-10   y11 = no
x12   Yes  Yes  Yes  Yes  Full  $      No    No   Burger   30-60  y12 = yes
[Figure: decision tree learned from the examples above. Patrons? = None -> No; Some -> Yes; Full -> Hungry?. Hungry? = No -> No; Yes -> Type?. Type? = French -> Yes; Italian -> No; Thai -> Fri/Sat?; Burger -> Yes. Fri/Sat? = No -> No; Yes -> Yes.]
$h_w(x) = w_1 x + w_0$
$L(h_w) = \sum_{j=1}^{N} L_2(y_j, h_w(x_j)) = \sum_{j=1}^{N} \big(y_j - h_w(x_j)\big)^2 = \sum_{j=1}^{N} \big(y_j - (w_1 x_j + w_0)\big)^2$
Carl Friedrich Gauss (1777-1855)
Linear Regression
Find $w = [w_0, w_1]$ that minimizes $L(h_w)$:
$w^* = \arg\min_w L(h_w) = \arg\min_w \sum_{j=1}^{N} \big(y_j - (w_1 x_j + w_0)\big)^2$
Gradient Descent
$w \leftarrow$ any point in parameter space
loop until convergence do
    for each $w_i$ in $w$ do
        $w_i \leftarrow w_i - \alpha \,\dfrac{\partial L(w)}{\partial w_i}$
(the update rule: $\alpha$ is the learning rate, and $\partial L(w)/\partial w_i$ is the gradient of the loss function along the $w_i$ axis)
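A minimal sketch (not from the lecture) of this update rule applied to the linear-regression loss above; the data, learning rate, and iteration count are illustrative assumptions, and the gradients are averaged over the examples rather than summed.

```python
# Sketch: batch gradient descent for univariate linear regression,
# minimizing L(h_w) = sum_j (y_j - (w1*x_j + w0))^2.
# Data, alpha, and iteration count are illustrative, not from the lecture.

def gradient_descent(xs, ys, alpha=0.01, iters=1000):
    w0, w1 = 0.0, 0.0
    n = len(xs)
    for _ in range(iters):
        # Partial derivatives of the summed squared loss
        dw0 = sum(-2 * (y - (w1 * x + w0)) for x, y in zip(xs, ys))
        dw1 = sum(-2 * x * (y - (w1 * x + w0)) for x, y in zip(xs, ys))
        # Update rule: step against the gradient, scaled by the learning rate
        # (dividing by n just averages the gradient; it rescales alpha)
        w0 -= alpha * dw0 / n
        w1 -= alpha * dw1 / n
    return w0, w1

if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [2.1, 4.1, 5.9, 8.2]   # roughly y = 2x
    print(gradient_descent(xs, ys))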
Gradient Descent in Weight Space
[Figure: the loss $L(w)$ plotted as a surface over weight space $(w_0, w_1)$; gradient descent moves downhill toward the minimum at $w^* = [w_0^*, w_1^*]$.]
[Figure: example data plotted in the $(x_1, x_2)$ plane: two classes of points that can be separated by a straight line.]
Linear Classifier
Decision boundary: $w_0 + w_1 x_1 + w_2 x_2 = 0$, i.e. $w \cdot x = 0$
All instances of one class are above the line: $w \cdot x > 0$
All instances of the other class are below the line: $w \cdot x < 0$
$h_w(x) = \mathrm{Threshold}(w \cdot x)$
Hard Threshold
$\mathrm{Threshold}(z) = 1$ if $z \ge 0$, $0$ otherwise
[Figure: the hard threshold (step) function plotted for z from -8 to 8.]
[Figure: proportion correct vs. number of weight updates (0-700).]
Logistic Threshold
$\mathrm{Logistic}(z) = \dfrac{1}{1 + e^{-z}}$
[Figure: the logistic (sigmoid) function plotted for z from -6 to 6, rising smoothly from 0 to 1.]
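A small sketch of a soft-threshold classifier $h_w(x) = \mathrm{Logistic}(w \cdot x)$ trained by gradient descent on squared error, in the spirit of the curves that follow; the toy data, learning rate, and epoch count are assumptions for illustration.

```python
# Sketch: a logistic (soft-threshold) classifier h_w(x) = Logistic(w . x),
# trained by gradient descent on squared error. Toy data and hyperparameters
# are illustrative assumptions.
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    # x includes a leading 1 so that w[0] acts as the bias weight
    return logistic(sum(wi * xi for wi, xi in zip(w, x)))

def train(data, alpha=0.1, epochs=1000):
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        for x, y in data:
            h = predict(w, x)
            # d/dw_i of (y - h)^2 is -2 (y - h) h (1 - h) x_i;
            # we step opposite the gradient (the factor 2 is folded into alpha)
            for i, xi in enumerate(x):
                w[i] += alpha * (y - h) * h * (1 - h) * xi
    return w

if __name__ == "__main__":
    # Two toy classes in the (x1, x2) plane, with a leading 1 for the bias
    data = [([1, 5.0, 3.0], 0), ([1, 5.5, 3.5], 0),
            ([1, 6.0, 6.5], 1), ([1, 6.5, 7.0], 1)]
    w = train(data)
    print(predict(w, [1, 5.2, 3.2]), predict(w, [1, 6.3, 6.8]))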
[Figure: three panels of squared error per example vs. number of weight updates (up to 5,000 in the first panel, up to 100,000 in the other two).]
Neuron
[Figure: a single neuron unit. Input links carry activations $a_i$, weighted by $w_{i,j}$; a fixed input $a_0 = 1$ carries the bias weight $w_{0,j}$. The input function computes $in_j = \sum_i w_{i,j}\, a_i$; the activation function $g$ produces the output $a_j = g(in_j)$, which is sent along the output links.]
The candy-bag analogy:
Bags ~ agent, process, disease, ...
Candies (the observed data D1 = ..., D2 = ..., D3 = ...) ~ observations: actions, effects, symptoms, results of tests, ...
Goal: predict the next candy ~ predict the agent's next move, the next output of the process, the disease given symptoms and tests
Bayesian Learning
$P(X \mid d) = \sum_i P(X \mid h_i)\, P(h_i \mid d) = \alpha \sum_i P(X \mid h_i)\, P(d \mid h_i)\, P(h_i)$
where $P(X \mid h_i)$ is the prediction of the hypothesis, $P(d \mid h_i)$ is the likelihood of the data under the hypothesis, and $P(h_i)$ is the hypothesis prior.
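A toy sketch of full Bayesian prediction in the candy-bag setting; the five hypotheses, their lime proportions, and the prior below are illustrative assumptions rather than numbers from the lecture.

```python
# Sketch of full Bayesian learning/prediction on a toy "bags of candy" setup:
# each hypothesis h_i fixes P(next candy = lime). The priors and flavor
# proportions are illustrative assumptions.

hyps = [0.0, 0.25, 0.5, 0.75, 1.0]     # P(lime | h_i) under each hypothesis
prior = [0.1, 0.2, 0.4, 0.2, 0.1]      # hypothesis prior P(h_i)

def posterior(data, prior=prior):
    """P(h_i | d) via alpha * P(d | h_i) * P(h_i); d is a list of 'lime'/'cherry'."""
    post = []
    for p_lime, p_h in zip(hyps, prior):
        likelihood = 1.0
        for candy in data:
            likelihood *= p_lime if candy == "lime" else (1.0 - p_lime)
        post.append(likelihood * p_h)
    alpha = 1.0 / sum(post)            # normalize
    return [alpha * p for p in post]

def predict_lime(data):
    """P(next = lime | d) = sum_i P(lime | h_i) P(h_i | d)."""
    return sum(p_lime * p for p_lime, p in zip(hyps, posterior(data)))

if __name__ == "__main__":
    d = ["lime"] * 5
    print(posterior(d))      # posterior mass shifts toward the all-lime hypothesis
    print(predict_lime(d))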
Maximum A Posteriori (MAP)
$h_{\mathrm{MAP}} = \arg\max_{h_i} P(h_i \mid d)$
$P(X \mid d) \approx P(X \mid h_{\mathrm{MAP}})$
Maximum Likelihood Hypothesis
Assume a uniform hypothesis prior: no hypothesis is preferred to any other a priori (e.g., all are equally complex). Then
$h_{\mathrm{MAP}} = \arg\max_{h_i} P(h_i \mid d) = \arg\max_{h_i} P(d \mid h_i) = h_{\mathrm{ML}}$
Burglary: P(B) = .001
Earthquake: P(E) = .002
Alarm:   B  E   P(A)
         t  t   .95
         t  f   .94
         f  t   .29
         f  f   .001
JohnCalls:  A  P(J)
            t  .90
            f  .05
MaryCalls:  A  P(M)
            t  .70
            f  .01
Maximum Likelihood Hypothesis
$h_{\mathrm{ML}} = \arg\max_{\Theta} P(d \mid h_\Theta)$
Log Likelihood
$P(d \mid h_\Theta) = \prod_j P(d_j \mid h_\Theta) = \Theta^c (1-\Theta)^\ell$
$L(d \mid h_\Theta) = \log P(d \mid h_\Theta) = \sum_j \log P(d_j \mid h_\Theta) = c \log \Theta + \ell \log (1-\Theta)$
(here $c$ and $\ell$ are the counts of the two possible outcomes in the data $d$)
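One step worth filling in (not shown on the slide): setting the derivative of the log likelihood to zero yields the familiar frequency estimate for $\Theta$.

$$\frac{dL(d \mid h_\Theta)}{d\Theta} = \frac{c}{\Theta} - \frac{\ell}{1-\Theta} = 0 \;\Longrightarrow\; c\,(1-\Theta) = \ell\,\Theta \;\Longrightarrow\; \Theta_{\mathrm{ML}} = \frac{c}{c+\ell}$$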
Naive Bayes Models
[Figure: a naive Bayes network. The class variable (e.g., terrorist vs. tourist) is the single parent of the attribute nodes Arrival Mode, One-way Ticket, Furtive Manner, ...]
Learning Naive Bayes Models
A naive Bayes model with n Boolean attributes requires only 2n + 1 parameters
The maximum-likelihood hypothesis $h_{\mathrm{ML}}$ can be found with no search
Scales to large problems
Robust to noisy or missing data
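A sketch of how "no search" works in practice: the maximum-likelihood parameters are just observed frequencies. The learn_nb/predict helpers and the four-example dataset are illustrative assumptions (and no smoothing is applied).

```python
# Sketch: maximum-likelihood learning of a naive Bayes model with Boolean
# attributes by counting frequencies, plus prediction. Tiny dataset is
# illustrative; zero counts are not smoothed.
from collections import defaultdict

def learn_nb(examples):
    """examples: list of (attribute_dict, class_label). Returns ML parameters."""
    class_counts = defaultdict(int)
    attr_counts = defaultdict(lambda: defaultdict(int))  # class -> {attr: count of True}
    for attrs, c in examples:
        class_counts[c] += 1
        for a, v in attrs.items():
            if v:
                attr_counts[c][a] += 1
    n = len(examples)
    prior = {c: k / n for c, k in class_counts.items()}
    cond = {c: {a: attr_counts[c][a] / class_counts[c]
                for a in examples[0][0]}
            for c in class_counts}
    return prior, cond

def predict(prior, cond, attrs):
    """Return the class maximizing P(class) * prod_a P(a | class)."""
    best, best_p = None, -1.0
    for c in prior:
        p = prior[c]
        for a, v in attrs.items():
            p *= cond[c][a] if v else (1.0 - cond[c][a])
        if p > best_p:
            best, best_p = c, p
    return best

if __name__ == "__main__":
    data = [({"one_way": True,  "furtive": True},  "terrorist"),
            ({"one_way": True,  "furtive": False}, "tourist"),
            ({"one_way": False, "furtive": False}, "tourist"),
            ({"one_way": False, "furtive": False}, "tourist")]
    prior, cond = learn_nb(data)
    print(predict(prior, cond, {"one_way": True, "furtive": True}))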
[Figure: (a) a network with the hidden variable HeartDisease between the causes Smoking, Diet, Exercise (2 parameters each) and three Symptom variables (6 parameters each; 54 for HeartDisease): 78 parameters in total. (b) The same domain with HeartDisease removed: the Symptom CPTs grow to 54, 162, and 486 parameters, for 708 in total.]
Hidden (Latent) Variables
Can dramatically reduce the number of parameters required to specify a Bayes net
Reduces the amount of data required to learn the parameters
But the values of hidden variables are not present in the training data (observations)
This complicates the learning problem
EM: Expectation-Maximization
Repeat:
E-step: use the current values of the parameters to compute the expected values of the hidden variables
M-step: recompute the parameters to maximize the log-likelihood of the data, given the values of the variables (observed and hidden)
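A minimal EM sketch for a toy mixture: candies come from one of two hidden bag types, each with its own P(lime), and we observe only flavors. The initial parameter values and the data are illustrative assumptions.

```python
# Sketch: EM for a toy two-component mixture. The hidden variable is which
# bag type each candy came from; parameters are the mixing weight and each
# type's P(lime). Starting values and data are illustrative assumptions.

def em(flavors, iters=50):
    # flavors: list of 1 (lime) / 0 (cherry)
    w = 0.6            # P(bag type 1), initial guess
    p1, p2 = 0.7, 0.3  # P(lime | type 1), P(lime | type 2), initial guesses
    for _ in range(iters):
        # E-step: expected posterior P(type 1 | flavor) for each observation
        resp = []
        for f in flavors:
            a = w * (p1 if f else 1 - p1)
            b = (1 - w) * (p2 if f else 1 - p2)
            resp.append(a / (a + b))
        # M-step: re-estimate parameters to maximize expected log-likelihood
        n = len(flavors)
        w = sum(resp) / n
        p1 = sum(r * f for r, f in zip(resp, flavors)) / sum(resp)
        p2 = sum((1 - r) * f for r, f in zip(resp, flavors)) / (n - sum(resp))
    return w, p1, p2

if __name__ == "__main__":
    data = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0]
    print(em(data))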
Reinforcement Learning
B.F. Skinner (1904-1990)
Reinforcement Learning
The Problem with Learning from Examples
Where do the examples come from?
Forget about examples... (the restaurant WillWait training set shown earlier)
But we need feedback!
Reward (a.k.a. Reinforcement)
The positive or negative feedback one obtains from the environment in response to an action
In animals: pain, hunger: negative reward; pleasure, food: positive reward
In computers...?
[Figure: the 4x3 grid world with terminal states +1 and -1 and START in the lower-left corner. Each action moves in the intended direction with probability 0.8 and at right angles to it with probability 0.1 each.]
Markov Decision Process
Sequential decision problem
Fully observable, stochastic environment
Set of states S with initial state $s_0$
Markovian transition model: $P(s' \mid s, a)$
Additive rewards: $R(s)$
Policy
A policy π specifies what the agent should do in any state the agent might reach: π(s)
Because the environment is stochastic, each time a policy is executed it can lead to a different history
The quality of a policy is its expected utility
The optimal policy π* maximizes expected utility
Optimal Policy
$U^\pi(s) = E\!\left[\sum_{t=0}^{\infty} \gamma^t R(S_t)\right]$
$\pi^*_s = \arg\max_\pi U^\pi(s)$
Computing Policies
Value iteration: easy to understand (AIMA 17.2.2); converges to the unique solution of the Bellman equations
Policy iteration: searches the space of policies rather than refining utility values; often more tractable
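For concreteness, a sketch of value iteration on an explicitly tabulated MDP; the two-state example, discount, and tolerance are illustrative assumptions, not the 4x3 world.

```python
# Sketch: value iteration for a tiny MDP given as explicit tables.
# P[(s, a)] is a list of (probability, next_state); R[s] is the reward.

def value_iteration(states, actions, P, R, gamma=0.9, eps=1e-6):
    U = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Bellman update: U(s) = R(s) + gamma * max_a sum_s' P(s'|s,a) U(s')
            best = max(sum(p * U[s2] for p, s2 in P[(s, a)]) for a in actions)
            new_u = R[s] + gamma * best
            delta = max(delta, abs(new_u - U[s]))
            U[s] = new_u
        if delta < eps:
            return U

if __name__ == "__main__":
    states, actions = ["a", "b"], ["stay", "go"]
    P = {("a", "stay"): [(1.0, "a")], ("a", "go"): [(0.8, "b"), (0.2, "a")],
         ("b", "stay"): [(1.0, "b")], ("b", "go"): [(1.0, "a")]}
    R = {"a": 0.0, "b": 1.0}
    print(value_iteration(states, actions, P, R))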
Markov Decision Process
Sequential decision problem
Fully observable, stochastic environment
Set of states S with initial state $s_0$
Markovian transition model: $P(s' \mid s, a)$
Additive rewards: $R(s)$
Learn! (now the transition model and rewards are not given; they must be learned)
Reinforcement Learning
Learn a policy that tells you what to do, without knowing:
how actions work
how the environment behaves
how you get rewarded
Passive Learning
Fixed policy π: π(s) says what to do
Learn $U^\pi(s)$: how good this policy is
[4x3 grid world with terminal states +1 and -1 and START, as before]
$R(s) = -0.04$ for nonterminal states, $\gamma = 1$
What is $U^\pi(s)$?
Policy Iteration
Repeat:
Policy evaluation: given a policy, compute its expected utility
Policy improvement: compute a new MEU policy by checking, in each state, whether a different action would do better given the computed expected utilities
Policy Evaluation
$U_i(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi_i(s))\, U_i(s')$
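A sketch of this evaluation step done by repeated sweeps over the states; the tiny MDP, the fixed "right" policy, and the discount are illustrative assumptions.

```python
# Sketch: iterative policy evaluation, solving
#   U_i(s) = R(s) + gamma * sum_s' P(s' | s, pi_i(s)) U_i(s')
# by repeated simultaneous sweeps. Tiny MDP and policy are illustrative.

def policy_evaluation(pi, states, P, R, gamma=0.9, sweeps=100):
    U = {s: 0.0 for s in states}
    for _ in range(sweeps):
        U = {s: R[s] + gamma * sum(p * U[s2] for p, s2 in P[(s, pi[s])])
             for s in states}
    return U

if __name__ == "__main__":
    states = ["s0", "s1", "goal"]
    P = {("s0", "right"): [(0.9, "s1"), (0.1, "s0")],
         ("s1", "right"): [(0.9, "goal"), (0.1, "s1")],
         ("goal", "right"): [(1.0, "goal")]}   # goal treated as absorbing
    R = {"s0": -0.04, "s1": -0.04, "goal": 0.0}
    pi = {s: "right" for s in states}          # the fixed policy being evaluated
    print(policy_evaluation(pi, states, P, R))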
[4x3 grid world with terminals +1 and -1 and START. Three observed trials under the fixed policy:]
Trial 1: (1,1) (1,2) (1,3) (1,2) (1,3) (2,3) (3,3) (4,3), reward -0.04 at each nonterminal state, then +1
Trial 2: (1,1) (1,2) (1,3) (2,3) (3,3) (3,2) (3,3) (4,3), reward -0.04 at each nonterminal state, then +1
Trial 3: (1,1) (2,1) (3,1) (4,2), reward -0.04 at each nonterminal state, then -1
[Same three trial sequences as on the previous slide.]
Direct Utility Estimation
In each trial, compute the reward-to-go for each state visited in the trial
Keep track of the average reward-to-go for every state
In the limit, this converges to the true expected utility of the policy, $U^\pi(s)$
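A sketch of direct utility estimation from recorded trials (lists of (state, reward) pairs); the two trials below are illustrative, in the spirit of the sequences on the earlier slide.

```python
# Sketch: direct utility estimation. For each trial, compute the reward-to-go
# of every visited state and average these samples across trials.
from collections import defaultdict

def direct_utility_estimation(trials, gamma=1.0):
    totals, counts = defaultdict(float), defaultdict(int)
    for trial in trials:
        # reward-to-go: accumulate rewards backwards through the trial
        g = 0.0
        for state, reward in reversed(trial):
            g = reward + gamma * g
            totals[state] += g
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}

if __name__ == "__main__":
    trials = [
        [((1, 1), -0.04), ((1, 2), -0.04), ((1, 3), -0.04), ((2, 3), -0.04),
         ((3, 3), -0.04), ((4, 3), +1.0)],
        [((1, 1), -0.04), ((2, 1), -0.04), ((3, 1), -0.04), ((4, 2), -1.0)],
    ]
    print(direct_utility_estimation(trials))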
Utilities of states are not independent!
$U^\pi(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, U^\pi(s')$
Adaptive Dynamic Programming
Keep track of the observed frequencies of state-action pairs and their outcomes
Approximate the unknown transition model $P(s' \mid s, a)$ using those observed frequencies
Use that model in standard policy evaluation to compute the utility of the policy
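A sketch of the model-learning half of ADP: estimate $P(s' \mid s, a)$ by counting observed (s, a, s') transitions. The observations below are illustrative; the resulting table would then feed standard policy evaluation.

```python
# Sketch: estimate the transition model P(s' | s, a) from observed
# (s, a, s') transitions by counting frequencies.
from collections import defaultdict

def estimate_transition_model(transitions):
    counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s': count}
    for s, a, s2 in transitions:
        counts[(s, a)][s2] += 1
    model = {}
    for sa, outcomes in counts.items():
        total = sum(outcomes.values())
        model[sa] = {s2: c / total for s2, c in outcomes.items()}
    return model

if __name__ == "__main__":
    obs = [("s0", "right", "s1"), ("s0", "right", "s1"),
           ("s0", "right", "s0"), ("s1", "right", "goal")]
    print(estimate_transition_model(obs))   # e.g. P(s1 | s0, right) = 2/3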
[Figure: Left: utility estimates for states (4,3), (3,3), (1,3), (1,1), (3,2) vs. number of trials (0-100). Right: RMS error in utility vs. number of trials (0-100).]
$U^\pi(1,3) = 0.84$, $U^\pi(2,3) = 0.92$
[4x3 grid world and the same three trial sequences as before.]
Temporal-Difference (TD) Learning
At each step, update the utility estimate using the difference between successive states:
$U^\pi(s) \leftarrow U^\pi(s) + \alpha\big(R(s) + \gamma\, U^\pi(s') - U^\pi(s)\big)$
where $\alpha$ is the learning rate
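A sketch of passive TD learning applied to recorded trials; the trial data, learning rate, and discount are illustrative assumptions.

```python
# Sketch: passive TD(0) learning of U^pi from trials of (state, reward) pairs,
# using U(s) <- U(s) + alpha * (R(s) + gamma * U(s') - U(s)).
from collections import defaultdict

def td_learn(trials, alpha=0.1, gamma=1.0):
    U = defaultdict(float)
    for trial in trials:
        for (s, r), (s2, _) in zip(trial, trial[1:]):
            U[s] += alpha * (r + gamma * U[s2] - U[s])
        s_last, r_last = trial[-1]
        U[s_last] = r_last        # terminal state's utility is just its reward
    return dict(U)

if __name__ == "__main__":
    trial = [((1, 1), -0.04), ((1, 2), -0.04), ((1, 3), -0.04),
             ((2, 3), -0.04), ((3, 3), -0.04), ((4, 3), +1.0)]
    print(td_learn([trial] * 50))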
[Figure: Left: utility estimates for states (4,3), (3,3), (1,3), (1,1), (2,1) vs. number of trials (0-500). Right: RMS error in utility vs. number of trials (0-100).]
Where did that policy come from???
Possible Strategy
Learn an outcome model for actions based on observed frequencies (as in ADP)
Compute the utility of the optimal policy using value or policy iteration (after each observation)
Use the computed optimal policy to select the next action
[Figure: Left: RMS error and policy loss vs. number of trials (0-500). Right: the 4x3 grid world showing the policy the agent settles on.]
How could following the optimal policy not result in optimal behavior? The learned model is just an approximation of the true environment. What is optimal in the learned model may not really be optimal in the environment.
Paradox?
Need to explore unexplored states (since they may be better than where we've been)
But they may be worse than our current best
And after a while they probably will be
Active Learning
Need to trade off:
Exploitation: maximizing immediate reward by following the current utility estimates
Exploration: improving the utility estimates in order to maximize long-term reward
GLIE: Greedy in the Limit of Infinite Exploration
Eventually, follow the optimal policy
Examples:
Choose a random action 1/t of the time
Give some weight to actions you haven't tried very often, while avoiding actions with strong estimates of low utility
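A sketch of the first GLIE scheme (random action with probability 1/t); the expected_utility callback and the toy action values are hypothetical, just to make the selection rule concrete.

```python
# Sketch: GLIE action selection -- with probability 1/t take a random action,
# otherwise act greedily on the current utility estimates.
import random

def glie_action(state, actions, expected_utility, t):
    """expected_utility(state, action) -> current estimate; t = trial number."""
    if random.random() < 1.0 / t:
        return random.choice(actions)                                  # explore
    return max(actions, key=lambda a: expected_utility(state, a))      # exploit

if __name__ == "__main__":
    actions = ["up", "down", "left", "right"]
    eu = lambda s, a: {"up": 0.5, "down": 0.1, "left": 0.2, "right": 0.8}[a]
    print([glie_action("s", actions, eu, t) for t in range(1, 11)])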
Exploration Function
$U^+(s) \leftarrow R(s) + \gamma \max_a f\!\Big(\sum_{s'} P(s' \mid s, a)\, U^+(s'),\; N(s, a)\Big)$
$f(u, n) = R^+$ if $n < N_e$, $u$ otherwise
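A small sketch of the exploration function f; the constants R_PLUS (the optimistic reward R+) and N_E are illustrative assumptions.

```python
# Sketch: optimistic exploration function f(u, n) -- return an optimistic
# reward estimate R+ until a state-action pair has been tried at least
# N_e times, then trust the utility estimate u.

R_PLUS = 2.0    # optimistic estimate of the best possible reward (assumed)
N_E = 5         # try each state-action pair at least this many times (assumed)

def f(u, n):
    return R_PLUS if n < N_E else u

if __name__ == "__main__":
    print(f(0.3, 2))    # 2.0 -- still exploring this action
    print(f(0.3, 10))   # 0.3 -- enough experience; use the estimate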
[Figure: Left: utility estimates for states (1,1), (1,2), (1,3), (2,3), (3,2), (3,3), (4,3) vs. number of trials (0-100). Right: RMS error and policy loss vs. number of trials (0-100).]
Reinforcement Learning
Doesn't require labelled examples (training data)
Learn a policy that tells you what to do, without knowing:
how actions work
how the environment behaves
how you get rewarded
For Next Time: Posters! (Don't Be Late)