CSC242: Intro to AI Lecture 23

Administrivia
Posters! Tue Apr 24 and Thu Apr 26
Idea! Presentation! 2-wide x 4-high landscape pages

Learning so far...

Input attributes and goal (WillWait) for the 12 restaurant examples:

      Alt  Bar  Fri  Hun  Pat   Price  Rain  Res  Type     Est    WillWait
x1    Yes  No   No   Yes  Some  $$$    No    Yes  French   0-10   y1 = yes
x2    Yes  No   No   Yes  Full  $      No    No   Thai     30-60  y2 = no
x3    No   Yes  No   No   Some  $      No    No   Burger   0-10   y3 = yes
x4    Yes  No   Yes  Yes  Full  $      Yes   No   Thai     10-30  y4 = yes
x5    Yes  No   Yes  No   Full  $$$    No    Yes  French   >60    y5 = no
x6    No   Yes  No   Yes  Some  $$     Yes   Yes  Italian  0-10   y6 = yes
x7    No   Yes  No   No   None  $      Yes   No   Burger   0-10   y7 = no
x8    No   No   No   Yes  Some  $$     Yes   Yes  Thai     0-10   y8 = yes
x9    No   Yes  Yes  No   Full  $      Yes   No   Burger   >60    y9 = no
x10   Yes  Yes  Yes  Yes  Full  $$$    No    Yes  Italian  10-30  y10 = no
x11   No   No   No   No   None  $      No    No   Thai     0-10   y11 = no
x12   Yes  Yes  Yes  Yes  Full  $      No    No   Burger   30-60  y12 = yes

Learned decision tree:
Patrons?
  None -> No
  Some -> Yes
  Full -> Hungry?
    No -> No
    Yes -> Type?
      French -> Yes
      Italian -> No
      Thai -> Fri/Sat?
        No -> No
        Yes -> Yes
      Burger -> Yes

h_w(x) = w_1 x + w_0

L(h_w) = Σ_{j=1}^{N} L_2(y_j, h_w(x_j)) = Σ_{j=1}^{N} (y_j - h_w(x_j))^2 = Σ_{j=1}^{N} (y_j - (w_1 x_j + w_0))^2

Carl Friedrich Gauss (1777-1855)

Linear Regression
Find w* = [w_0, w_1] that minimizes L(h_w):

w* = argmin_w L(h_w) = argmin_w Σ_{j=1}^{N} (y_j - (w_1 x_j + w_0))^2

Gradient Descent
w <- any point in parameter space
loop until convergence do
  for each w_i in w do
    w_i <- w_i - α ∂L(w)/∂w_i      (update rule: move w_i along its axis against the gradient of the loss; α is the learning rate)
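A minimal sketch of this loop for the one-variable linear-regression loss above; the data points, learning rate, and iteration budget are made up for illustration.

```python
import numpy as np

# Batch gradient descent for h_w(x) = w1*x + w0 with squared loss.
# Data, learning rate, and iteration count here are illustrative only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1])

w0, w1 = 0.0, 0.0          # start anywhere in parameter space
alpha = 0.01               # learning rate

for _ in range(5000):      # "loop until convergence" (fixed budget here)
    err = y - (w1 * x + w0)
    # Partial derivatives of L(w) = sum_j (y_j - (w1*x_j + w0))^2
    grad_w0 = -2.0 * err.sum()
    grad_w1 = -2.0 * (err * x).sum()
    w0 -= alpha * grad_w0
    w1 -= alpha * grad_w1

print(w0, w1)              # close to the least-squares fit
```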

Gradient Descent in Weight Space
(figure: bowl-shaped loss surface over w_0 and w_1, with the minimum at w* = [w_0, w_1])

(figure: scatterplot of two classes of examples in the x_1, x_2 plane)

Linear Classifier
Decision boundary: w_0 + w_1 x_1 + w_2 x_2 = 0, i.e., w · x = 0
All instances of one class are above the line: w · x > 0
All instances of the other class are below the line: w · x < 0
h_w(x) = Threshold(w · x)
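A quick sketch of classifying with this hard-threshold linear classifier; the weight vector and test points below are invented for illustration.

```python
import numpy as np

# Hard-threshold linear classifier h_w(x) = Threshold(w . x),
# with x augmented by a constant 1 so that w[0] acts as w_0.
# Weight vector and test points are made up for illustration.
def threshold(z):
    return 1 if z >= 0 else 0

def classify(w, x1, x2):
    x = np.array([1.0, x1, x2])      # [1, x1, x2]
    return threshold(np.dot(w, x))   # 1 if on/above the line, 0 if below

w = np.array([-6.0, 1.0, 0.5])       # boundary: -6 + x1 + 0.5*x2 = 0
print(classify(w, 7.0, 3.0))         # 1 (above the line)
print(classify(w, 4.0, 2.0))         # 0 (below the line)
```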

Hard Threshold
(figure: step function over z from -8 to 8)
Threshold(z) = 1 if z ≥ 0, 0 otherwise

(figure: learning curve, proportion correct vs. number of weight updates)

Logistic Threshold
(figure: sigmoid curve over z from -6 to 6)
Logistic(z) = 1 / (1 + e^{-z})

(figures: squared error per example vs. number of weight updates, for three training runs)

Neuron
(figure: a unit with bias input a_0 = 1 and bias weight w_{0,j}; input links a_i with weights w_{i,j}; input function in_j = Σ_i w_{i,j} a_i; activation function g; output a_j = g(in_j) sent along the output links)
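A sketch of a single unit's forward pass using the logistic activation from the previous slide; the weights and inputs are invented.

```python
import numpy as np

# Forward pass of one unit: a_j = g(in_j), in_j = sum_i w_ij * a_i,
# with g = logistic.  Weights and inputs are illustrative.
def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def unit_output(weights, inputs):
    a = np.concatenate(([1.0], inputs))   # a_0 = 1 carries the bias weight
    in_j = np.dot(weights, a)             # input function
    return logistic(in_j)                 # activation function

w = np.array([-1.5, 2.0, 0.5])            # [w_0j, w_1j, w_2j]
print(unit_output(w, np.array([1.0, 0.0])))   # logistic(0.5) ≈ 0.62
```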

Bags ~ agent, process, disease, ...
Candies ~ observations: actions, effects, symptoms, results of tests, ...
(figure: example candy sequences D1, D2, D3)
Goal: predict the next candy ~ predict the agent's next move, the next output of the process, the disease given symptoms and tests

Bayesian Learning

P(X | d) = Σ_i P(X | h_i) P(h_i | d) = α Σ_i P(X | h_i) P(d | h_i) P(h_i)

where P(h_i) is the hypothesis prior, P(X | h_i) is the prediction of the hypothesis, and P(d | h_i) is the likelihood of the data under the hypothesis.

Maximum A Posteriori (MAP)

h_MAP = argmax_{h_i} P(h_i | d)

P(X | d) ≈ P(X | h_MAP)

Maximum Likelihood Hypothesis
Assume a uniform hypothesis prior: no hypothesis is preferred to any other a priori (e.g., all equally complex). Then

h_MAP = argmax_{h_i} P(h_i | d) = argmax_{h_i} P(d | h_i) = h_ML

Example Bayes net (burglary alarm):
Burglary: P(B) = .001
Earthquake: P(E) = .002
Alarm: P(A | B, E): B=t,E=t: .95; B=t,E=f: .94; B=f,E=t: .29; B=f,E=f: .001
JohnCalls: P(J | A): A=t: .90; A=f: .05
MaryCalls: P(M | A): A=t: .70; A=f: .01

Maximum Likelihood Hypothesis

argmax_Θ P(d | h_Θ)

Log Likelihood

P(d | h_Θ) = Π_j P(d_j | h_Θ) = Θ^c (1 - Θ)^l

L(d | h_Θ) = log P(d | h_Θ) = Σ_j log P(d_j | h_Θ) = c log Θ + l log(1 - Θ)
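One step the slides leave implicit: maximizing this log likelihood by setting its derivative to zero gives the observed fraction (here c and l are the counts of the two candy flavors in d).

```latex
\frac{\partial}{\partial \Theta}\,\bigl(c \log \Theta + l \log(1-\Theta)\bigr)
  = \frac{c}{\Theta} - \frac{l}{1-\Theta} = 0
\quad\Longrightarrow\quad
\Theta_{ML} = \frac{c}{c+l}
```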

Naive Bayes Models
(figure: Class ∈ {terrorist, tourist} with child attributes Arrival Mode, One-way Ticket, Furtive Manner, ...)

Learning Naive Bayes Models
A naive Bayes model with n Boolean attributes requires 2n + 1 parameters
The maximum likelihood hypothesis h_ML can be found with no search
Scales to large problems
Robust to noisy or missing data
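A sketch of that no-search maximum-likelihood estimation, just counting relative frequencies; the tiny Boolean dataset and variable names are made up (and note that raw ML counts can yield zero probabilities in practice).

```python
import numpy as np

# ML parameter estimation for naive Bayes with Boolean class and
# Boolean attributes: count relative frequencies, no search needed.
# The dataset below is invented for illustration.
X = np.array([[1, 0, 1],     # rows: examples, columns: attributes
              [1, 1, 1],
              [0, 0, 1],
              [0, 1, 0],
              [1, 0, 0]])
y = np.array([1, 1, 0, 0, 0])   # class labels

p_class = y.mean()                            # 1 parameter: P(C=1)
p_attr_c1 = X[y == 1].mean(axis=0)            # n parameters: P(X_i=1 | C=1)
p_attr_c0 = X[y == 0].mean(axis=0)            # n parameters: P(X_i=1 | C=0)

def predict(x):
    # P(C=c | x) is proportional to P(C=c) * prod_i P(x_i | C=c)
    like1 = p_class * np.prod(np.where(x, p_attr_c1, 1 - p_attr_c1))
    like0 = (1 - p_class) * np.prod(np.where(x, p_attr_c0, 1 - p_attr_c0))
    return int(like1 > like0)

print(predict(np.array([1, 0, 1])))   # -> 1
```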

(figure: (a) network with a hidden HeartDisease variable between Smoking, Diet, Exercise and three Symptom variables, 78 parameters; (b) the same variables with the symptoms depending directly on Smoking, Diet, Exercise, 708 parameters)

Hidden (Latent) Variables Can dramatically reduce the number of parameters required to specify a Bayes net Reduces amount of data required to learn the parameters Values of hidden variables not present in training data (observations) Complicates the learning problem

EM: Expectation Maximization Repeat E: Use the current values of the parameters to compute the expected values of the hidden variables M: Recompute the parameters to maximize the log-likelihood of the data given the values of the variables (observed and hidden)
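A compact sketch of this loop on a toy hidden-variable problem in the candy spirit: two bags with unknown cherry proportions, where the bag each handful came from is hidden. The data, initial guesses, and names are illustrative, not the lecture's example.

```python
import numpy as np

# EM sketch: two candy bags with unknown cherry proportions theta[k] and
# unknown mixing weights; which bag each handful came from is hidden.
# E-step: expected bag assignments given current parameters.
# M-step: re-estimate parameters from those expected counts.
rng = np.random.default_rng(0)
true_theta = [0.8, 0.3]                 # true P(cherry) for each bag
handfuls = np.array([rng.binomial(1, true_theta[rng.integers(2)], size=20)
                     for _ in range(50)])       # 50 handfuls of 20 candies

theta = np.array([0.6, 0.4])            # initial guesses
mix = np.array([0.5, 0.5])
for _ in range(100):
    cherries = handfuls.sum(axis=1)
    limes = handfuls.shape[1] - cherries
    # E-step: log P(bag k, handful j), normalized into responsibilities
    logp = (np.log(mix) + np.outer(cherries, np.log(theta))
            + np.outer(limes, np.log(1 - theta)))
    resp = np.exp(logp - logp.max(axis=1, keepdims=True))
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: parameters that maximize the expected log likelihood
    mix = resp.mean(axis=0)
    theta = (resp * cherries[:, None]).sum(axis=0) / (resp.sum(axis=0) * handfuls.shape[1])

print(theta, mix)    # approaches the true values (bags possibly swapped)
```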

Reinforcement Learning

B.F. Skinner (1904-1990)

Reinforcement Learning

The Problem with Learning from Examples

Where do the examples come from?

Forget about examples (the table of restaurant examples from earlier, shown again)

But we need feedback!

Reward (a.k.a. Reinforcement)
The positive or negative feedback one obtains from the environment in response to an action
In animals: pain, hunger: negative reward; pleasure, food: positive reward
In computers...?

(figure: 4x3 grid world with terminal states +1 and -1 and a START state; transition model: intended move with probability 0.8, each perpendicular move with probability 0.1)

Markov Decision Process
Sequential decision problem
Fully observable, stochastic environment
Set of states S with initial state s_0
Markovian transition model: P(s' | s, a)
Additive rewards: R(s)

Policy
A policy π specifies what the agent should do for any state the agent might reach: π(s)
Each time a policy is executed, it leads to a different history
Quality of a policy is its expected utility
Optimal policy π* maximizes expected utility

Optimal Policy

U^π(s) = E[ Σ_{t=0}^∞ γ^t R(S_t) ]

π*_s = argmax_π U^π(s)

Computing Policies
Value Iteration: easy to understand (AIMA 17.2.2); converges to the unique solution of the Bellman equations
Policy Iteration: searches the space of policies, rather than refining values of utilities; more tractable

Markov Decision Process
Sequential decision problem
Fully observable, stochastic environment
Set of states S with initial state s_0
Markovian transition model: P(s' | s, a)
Additive rewards: R(s)
Learn!

Reinforcement Learning
Learn a policy that tells you what to do, without knowing:
how actions work, how the environment behaves, how you get rewarded

Passive Learning
Fixed policy π: π(s) says what to do
Learn U^π(s): how good this policy is

(figure: 4x3 grid world with terminal states +1 and -1 and a START state)
R(s) = -0.04 for nonterminal states, γ = 1. What is U^π(s)?

Policy Iteration Repeat Policy Evaluation: Given a policy, compute its expected utility Policy Improvement: Compute a new MEU policy by checking for better action in any state, given EU

Policy Evaluation

U_i(s) = R(s) + γ Σ_{s'} P(s' | s, π_i(s)) U_i(s')
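A sketch of iterative (simplified) policy evaluation for a fixed policy; the dictionary-based MDP representation and all names are my own, not from the slides.

```python
# Iterative (simplified) policy evaluation for a fixed policy pi.
# States, rewards, transition model, and policy are plain dicts;
# the representation is illustrative.
def evaluate_policy(states, R, P, pi, gamma=1.0, iters=100):
    """P[(s, a)] is a list of (probability, next_state) pairs."""
    U = {s: 0.0 for s in states}
    for _ in range(iters):
        U = {s: R[s] + gamma * sum(p * U[s2] for p, s2 in P.get((s, pi.get(s)), []))
             for s in states}
    return U

# Tiny two-state example: 'go' keeps you in 'a' with probability 0.5.
states = ['a', 'b']
R = {'a': -0.04, 'b': 1.0}
P = {('a', 'go'): [(0.5, 'a'), (0.5, 'b')]}   # 'b' is terminal (no actions)
pi = {'a': 'go'}
print(evaluate_policy(states, R, P, pi))       # U(a) converges to 0.92
```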

Three trials under the policy in the 4x3 grid world (each nonterminal state yields reward -0.04):
Trial 1: (1,1) -> (1,2) -> (1,3) -> (1,2) -> (1,3) -> (2,3) -> (3,3) -> (4,3), reaching +1
Trial 2: (1,1) -> (1,2) -> (1,3) -> (2,3) -> (3,3) -> (3,2) -> (3,3) -> (4,3), reaching +1
Trial 3: (1,1) -> (2,1) -> (3,1) -> (4,2), reaching -1

Direct Utility Estimation In each trial, compute reward-to-go for each state visited in the trial Keep track of average reward-to-go for every state In the limit, converges to true expected utility of the policy U π (s)
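A sketch of direct utility estimation on trials like the ones above; the (state, reward) trial format and names are illustrative.

```python
from collections import defaultdict

# Direct utility estimation: average observed reward-to-go per state.
# A trial is a list of (state, reward) pairs; gamma = 1 here, matching
# the grid-world example.
def direct_utility_estimate(trials):
    totals = defaultdict(float)
    counts = defaultdict(int)
    for trial in trials:
        reward_to_go = 0.0
        # Walk backwards so reward-to-go accumulates from the end of the trial.
        for state, reward in reversed(trial):
            reward_to_go += reward
            totals[state] += reward_to_go
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}

trial3 = [((1, 1), -0.04), ((2, 1), -0.04), ((3, 1), -0.04), ((4, 2), -1.0)]
print(direct_utility_estimate([trial3]))   # e.g. estimate for (1,1) is -1.12
```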

Utilities of states are not independent!

U^π(s) = R(s) + γ Σ_{s'} P(s' | s, π(s)) U^π(s')

Adaptive Dynamic Programming
Keep track of observed frequencies of state-action pairs and their outcomes
Approximate the unknown transition model P(s' | s, a) using those observed frequencies
Use that in standard policy evaluation to compute the utility of the policy
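A sketch of the model-estimation step: counting observed outcomes of each state-action pair. The output format matches the policy-evaluation sketch above, so the estimated model could be fed straight into it; the observed transitions are invented.

```python
from collections import defaultdict

# Estimate P(s' | s, a) from observed (s, a, s') transitions by counting.
def estimate_transition_model(transitions):
    counts = defaultdict(lambda: defaultdict(int))
    for s, a, s2 in transitions:
        counts[(s, a)][s2] += 1
    model = {}
    for sa, outcomes in counts.items():
        total = sum(outcomes.values())
        model[sa] = [(n / total, s2) for s2, n in outcomes.items()]
    return model

observed = [((1, 1), 'up', (1, 2)), ((1, 1), 'up', (1, 2)), ((1, 1), 'up', (2, 1))]
print(estimate_transition_model(observed))
# probabilities of roughly 2/3 for (1,2) and 1/3 for (2,1)
```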

(figures: ADP learning curves: utility estimates for states (4,3), (3,3), (1,3), (1,1), (3,2) vs. number of trials; RMS error in utility vs. number of trials)

(figure: the three trials above on the 4x3 grid, annotated with example estimates U^π(1,3) = 0.84 and U^π(2,3) = 0.92)

Temporal-Difference (TD) Learning
At each step, update the utility estimate using the difference between successive states:

U^π(s) <- U^π(s) + α ( R(s) + γ U^π(s') - U^π(s) )

where α is the learning rate.
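A sketch of the TD(0) update replayed over one of the trials above; the trial format is illustrative, and the terminal state's utility is pinned to its reward.

```python
from collections import defaultdict

# TD(0) update for passive policy evaluation: nudge U(s) toward the
# one-step target R(s) + gamma * U(s').
def td_update(U, s, r, s_next, alpha=0.1, gamma=1.0):
    U[s] += alpha * (r + gamma * U[s_next] - U[s])

U = defaultdict(float)
U[(4, 3)] = 1.0          # terminal state's utility is just its reward
trial = [((1, 1), -0.04), ((1, 2), -0.04), ((1, 3), -0.04), ((2, 3), -0.04),
         ((3, 3), -0.04), ((4, 3), 1.0)]
# Replay the trial: each step updates the state just left, using its reward
# and the utility estimate of the state arrived at.
for (s, r), (s_next, _) in zip(trial, trial[1:]):
    td_update(U, s, r, s_next)
print(dict(U))
```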

(figures: TD learning curves: utility estimates for states (4,3), (3,3), (1,3), (1,1), (2,1) vs. number of trials; RMS error in utility vs. number of trials)

Where did that policy come from???

Possible Strategy Learn outcome model for actions based on observed frequencies (like ADP) Compute utility of optimal policy using value or policy iteration (at each observation) Use computed optimal policy to select next action

(figures: RMS error and policy loss vs. number of trials for the greedy agent; 4x3 grid showing the suboptimal policy it settles on)

How could following the optimal policy not result in optimal behavior? The learned model is just an approximation of the true environment. What is optimal in the learned model may not really be optimal in the environment.

Paradox?
Need to explore unexplored states (since they may be better than where we've been)
But they may be worse than our current best
And after a while they probably will be

Active Learning
Need to trade off:
Exploitation: maximizing immediate reward by following current utility estimates
Exploration: improving utility estimates to maximize long-term reward

GLIE: Greedy in the Limit of Infinite Exploration
Eventually, follow the optimal policy
Examples (sketched below): choose a random action 1/t of the time; give some weight to actions you haven't tried very often, while avoiding actions with strong estimates of low utility
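A sketch of the first GLIE scheme above (random action with probability 1/t); the action set and greedy choice are invented stand-ins.

```python
import random

# GLIE scheme from the slide: act randomly with probability 1/t,
# otherwise act greedily with respect to current estimates.
def glie_action(t, actions, greedy_action):
    if random.random() < 1.0 / t:
        return random.choice(actions)   # explore
    return greedy_action                # exploit

actions = ['up', 'down', 'left', 'right']
for t in (1, 10, 100):
    print(t, glie_action(t, actions, greedy_action='right'))
```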

Exploration Function

U+(s) <- R(s) + γ max_a f( Σ_{s'} P(s' | s, a) U+(s'), N(s, a) )

f(u, n) = R+ if n < N_e, u otherwise
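A sketch of the exploration function itself: stay optimistic (return R+) until a state-action pair has been tried N_e times, then trust the learned estimate. The R+ and N_e values are arbitrary.

```python
# Exploration function f(u, n) from the slide.
R_PLUS = 2.0   # optimistic estimate of the best possible reward
N_E = 5        # how many tries before we trust the estimate

def f(u, n):
    return R_PLUS if n < N_E else u

print(f(0.3, 2))   # 2.0 : rarely tried, stay optimistic
print(f(0.3, 9))   # 0.3 : tried enough, use the estimate
```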

(figures: exploration-function agent: utility estimates for states (1,1), (1,2), (1,3), (2,3), (3,2), (3,3), (4,3) vs. number of trials; RMS error and policy loss vs. number of trials)

Reinforcement Learning
Doesn't require labelled examples (training data)
Learn a policy that tells you what to do, without knowing:
how actions work, how the environment behaves, how you get rewarded

For Next Time: Posters! (Don t Be Late)