COMP3702/7702 Artificial Intelligence Lecture 11: Introduction to Machine Learning and Reinforcement Learning Hanna Kurniawati
Today
} What is machine learning?
} Where is it used?
} Types of machine learning algorithms:
  } Supervised learning
  } Unsupervised learning
  } Reinforcement learning
What is machine learning?
} Algorithms that enable agents to improve their behavior with experience.
  } Improve: there's a performance measure.
  } Experience: feedback / observations the agent perceives.
} Components to consider:
  } Performance measure.
  } Representation: Bayes network, real-valued functions, etc.
  } Types of feedback: win/lose at the end of a game, immediate outcome, etc.
Where is it used?
} Lots of places:
  } Games: Go, Dota, etc.
  } Legal: trademark, logo search, etc. (e.g., https://trademark.vision)
  } Medical: prosthesis, reducing tremor in surgeons, etc.
  } Many more
Types of machine learning
} Supervised learning (requires training data):
  } Learn from examples.
  } Sometimes we're not sure whether the examples are correct or not.
    } Often called semi-supervised learning.
    } Becoming more popular with crowdsourcing (e.g., Mechanical Turk).
} Unsupervised learning:
  } Learn by finding structure in data.
} Reinforcement learning:
  } Learn by doing.
Reinforcement Learning
} What is reinforcement learning?
} Methods for solving it.
An Illustration of a Reinforcement Learning Agent
Formally,
} A reinforcement learning problem is an MDP where the transition and/or reward functions are not known:
  } S: state space.
  } A: action space.
  } T: transition function, T(s, a, s') = P(S_{t+1} = s' | S_t = s, A_t = a).
  } R: reward function, R(s, a).
  } At least one of T and R is not known.
} Need to try & explore the environment (see the sketch below).
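To make the setting concrete, here is a minimal Python sketch (not from the slides; the class and its interface are illustrative assumptions) of an environment whose T and R exist but are hidden from the agent, so the agent only ever sees sampled experience:

```python
import random

class UnknownMDP:
    """Wraps an MDP whose T and R the agent cannot inspect;
    the agent only receives sampled experience tuples by acting."""
    def __init__(self, T, R, start):
        self._T = T              # hidden: T[s][a] -> list of (s', prob)
        self._R = R              # hidden: R[(s, a)] -> reward
        self.state = start

    def step(self, a):
        """Sample s' ~ T(s, a, .) and return the <s, a, r, s'> experience."""
        s = self.state
        next_states, probs = zip(*self._T[s][a])
        s_next = random.choices(next_states, weights=probs)[0]
        r = self._R[(s, a)]
        self.state = s_next
        return s, a, r, s_next
```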
Solving?
} Solving a reinforcement learning problem means: computing the best action to perform (recall the MDP definition of best action), even though the transition & reward functions are not known a priori.
} In general, solving boils down to the exploration vs exploitation problem:
  } Should the agent explore less-known states and actions that might generate better values, OR
  } Should the agent exploit states and actions that it has tried before and that have definitely generated good values?
Exploration vs Exploitation
} Simplest setting: the multi-armed bandit.
  } Simplest in the sense that it assumes only 1 state.
} We've touched on this before: when combining sampling strategies in sampling-based motion planning, and when choosing which sampling strategy to use for tree expansion in MCTS.
} Epsilon-greedy
} EXP3
} UCB
Epsilon-greedy
} Assign a weight to each sampling strategy.
} Start with equal weights for all strategies.
} The strategy with the highest weight is selected with probability (1 − ε). The rest are selected with probability ε/N, where N is the number of strategies available.
} Suppose strategy s1 is selected; we'll use s1 to sample and add a vertex and edges to the roadmap. If the addition connects disconnected components of the roadmap OR increases the number of connected components of the roadmap, increment the weight of s1 by 1. A sketch of this scheme follows.
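A minimal sketch of the scheme above (names are illustrative; in this common variant the exploration draw is uniform over all N strategies, so the greedy strategy can also be picked during exploration):

```python
import random

def epsilon_greedy_select(weights, epsilon):
    """Select the highest-weight strategy w.p. (1 - epsilon);
    otherwise pick uniformly among all N strategies (epsilon/N each)."""
    if random.random() < epsilon:
        return random.randrange(len(weights))
    return max(range(len(weights)), key=lambda i: weights[i])

# Illustrative loop: weights start equal and grow on successful additions.
weights = [1.0, 1.0, 1.0]                  # one weight per sampling strategy
chosen = epsilon_greedy_select(weights, epsilon=0.1)
# ... use strategy `chosen` to add a vertex and edges to the roadmap ...
improved_connectivity = True               # placeholder for the roadmap test
if improved_connectivity:
    weights[chosen] += 1                   # reward the successful strategy
```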
EXP3
} At least competitive with the best strategy.
} Sampling probability: p_s(t) = (1 − η) · w_s(t) / Σ_{s'} w_{s'}(t) + η/K, where w_s(t) is the weight of strategy s at time t (based on its sampling history), the sum runs over all samplers, and K is the number of strategies.
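A short sketch of one EXP3 round (the reweighting rule with the importance-weighted reward is the standard EXP3 update; it is not spelled out on the slide, so treat it as an assumption):

```python
import math
import random

def exp3_probs(w, eta):
    """p_s(t) = (1 - eta) * w_s(t) / sum_s' w_s'(t) + eta / K."""
    K, total = len(w), sum(w)
    return [(1 - eta) * ws / total + eta / K for ws in w]

def exp3_update(w, probs, chosen, reward, eta):
    """Exponentially reweight the chosen strategy by its
    importance-weighted reward estimate, reward / p_chosen."""
    K = len(w)
    w[chosen] *= math.exp(eta * reward / (K * probs[chosen]))

# Illustrative round, assuming rewards scaled to [0, 1]:
w, eta = [1.0] * 4, 0.1
p = exp3_probs(w, eta)
arm = random.choices(range(len(w)), weights=p)[0]
exp3_update(w, p, arm, reward=1.0, eta=eta)
```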
Upper Confidence Bound
} Choose an action a to perform at s as: a* = argmax_a [ Q(s, a) + c · sqrt( ln n(s) / n(s, a) ) ], where Q(s, a) is the current value estimate of a at s (the exploitation term) and the square-root term drives exploration.
} c: a constant indicating how to balance exploration & exploitation; it needs to be decided by trial & error.
} n(s): #times node s has been visited.
} n(s, a): #times the out-edge of s with label a has been visited.
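A minimal sketch of UCB action selection at a fixed state s (the tie-handling for untried actions is a common convention, assumed here rather than stated on the slide):

```python
import math

def ucb_action(Q, n_s, n_sa, actions, c):
    """Return argmax_a [ Q(s,a) + c * sqrt(ln n(s) / n(s,a)) ];
    actions never tried at s are selected first."""
    def score(a):
        if n_sa[a] == 0:
            return float("inf")   # force each action to be tried once
        return Q[a] + c * math.sqrt(math.log(n_s) / n_sa[a])
    return max(actions, key=score)
```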
More general approaches for solving RL
} Data from interacting with the world: <s, a, r, s'>
} Model-based vs model-free: what is being learned?
} Passive vs active: how is the data generated?
} The two distinctions combine into a 2×2 grid of approaches: {passive, active} × {model-based, model-free}.
Model-based vs model-free
} Model-based:
  } Use data to learn the missing components of the MDP problem, i.e., T & R.
  } Once we know T & R, solve the MDP problem.
  } Indirect learning, but the most efficient use of data.
} Model-free:
  } Use data to learn the value function & policy directly (see the sketch below).
  } Direct learning; not the most efficient use of data, but can sometimes be fast.
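As an example of the model-free idea, here is a sketch of the classic Q-learning update (Q-learning is a standard model-free method, used here for illustration; the slide itself does not name it). It improves the value function directly from a single <s, a, r, s'> tuple, without ever estimating T or R:

```python
from collections import defaultdict

Q = defaultdict(float)   # Q(s, a) table, initialised to 0

def q_learning_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    """One model-free update from a single <s, a, r, s'> tuple:
    Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```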
Passive vs Active
} Passive:
  } Fixed policy.
  } The agent observes the world while following the policy, OR, given a data set (e.g., from video), the agent observes it to learn the value function or the model (transition and reward functions).
  } Relatively recently introduced.
} Active:
  } The classical reinforcement learning problem.
  } The agent selects which action to perform; the action performed determines the data it receives, which in turn determines how fast the agent converges to the correct MDP model.
  } Exploration vs exploitation.
} Combinations of the two are possible.
Model-based Approach: Overview
} Two steps, run iteratively. Loop over:
  } Learning step: use data to estimate T & R.
  } Solve the MDP as if the learned model is correct, using the methods for solving MDPs we've discussed before.
Model-based: Simple Frequentist
} Learning T & R by counting:
  } T^(s, a, s') = (#data where the agent ends up in s' after performing a from s) / (#data where the agent performs a from s).
  } If R(s, a) has a finite & known range, we can use counting too, the same as learning T(s, a, s').
  } If R(s, a) has an infinite / unknown range, we can fit a function to the data (using methods from regression / supervised learning).
} This simple strategy will converge to the true values. A sketch follows.
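A minimal sketch of the counting estimator (class and method names are illustrative; the reward estimate here is a simple empirical average, which matches the finite-range counting case on the slide):

```python
from collections import defaultdict

class FrequentistModel:
    """Estimate T and R by counting over <s, a, r, s'> data."""
    def __init__(self):
        self.n_sa = defaultdict(int)      # #times a was performed from s
        self.n_sas = defaultdict(int)     # #times that led to s'
        self.r_sum = defaultdict(float)   # summed rewards, for averaging R

    def record(self, s, a, r, s_next):
        self.n_sa[(s, a)] += 1
        self.n_sas[(s, a, s_next)] += 1
        self.r_sum[(s, a)] += r

    def T(self, s, a, s_next):
        """Empirical transition probability: count(s,a,s') / count(s,a)."""
        n = self.n_sa[(s, a)]
        return self.n_sas[(s, a, s_next)] / n if n else 0.0

    def R(self, s, a):
        """Empirical mean reward for performing a from s."""
        n = self.n_sa[(s, a)]
        return self.r_sum[(s, a)] / n if n else 0.0
```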
Model-based, Passive
1. Estimate the transition (T) & reward (R) functions from data; (modified) supervised learning methods can be used to compute them.
2. Solve the MDP (generate an optimal policy) as if the learned model is correct, using methods for solving MDPs as discussed before.
3. Compute the difference between the data and the trajectories generated if this optimal policy is executed.
4. If the difference is large:
  } Improve the MDP model based on the above difference.
  } Go back to step 2.
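The four steps above, as a loop sketch (reusing the FrequentistModel class from the previous sketch; solve_mdp, rollout_diff and improve are hypothetical helpers standing in for steps 2 to 4):

```python
def passive_model_based(data, solve_mdp, rollout_diff, improve, tol):
    """Sketch of the passive model-based loop described above."""
    model = FrequentistModel()            # step 1: estimate T & R from data
    for s, a, r, s_next in data:
        model.record(s, a, r, s_next)
    while True:
        policy = solve_mdp(model)         # step 2: solve the learned MDP
        d = rollout_diff(data, policy)    # step 3: data vs policy trajectories
        if d <= tol:
            return policy
        model = improve(model, d)         # step 4: refine model, repeat step 2
```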
Model-based, Active
} We need a way to decide which data to gather.
} Classic: interact with the world directly.
  } Decide the action to use to interact with the world, so as to balance gaining information and reaching the goal.
} Nowadays: can perform the trials in a high-fidelity simulator.
  } Decide the action to use to interact with the simulated world (the hope, of course, is that the simulation is close to reality), so as to balance gaining information and reaching the goal.
  } Need to consider how well the result transfers from the simulator to the real world.
Bayesian Reinforcement Learning
} Bayesian view:
  } The parameters (T & R) we want to estimate are represented as random variables.
  } Start with a prior over models.
  } Compute the posterior based on data (see the sketch below).
} Quite useful when the agent actively gathers data:
  } Can decide how to balance exploration & exploitation, or how to trade off improving the model vs solving the problem optimally.
} Often represented as a Partially Observable Markov Decision Process (POMDP).
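For a finite MDP, a common concrete choice (an assumption here; the slide does not fix a prior) is a Dirichlet prior over each transition distribution, so the posterior is just prior pseudo-counts plus observed counts:

```python
from collections import defaultdict

class DirichletTransitionPosterior:
    """Maintain a Dirichlet posterior over T(s, a, .): start from prior
    pseudo-counts and add one count per observed transition."""
    def __init__(self, states, prior=1.0):
        self.states = states
        self.alpha = defaultdict(lambda: prior)   # pseudo-counts alpha(s, a, s')

    def update(self, s, a, s_next):
        self.alpha[(s, a, s_next)] += 1           # posterior = prior + counts

    def mean_T(self, s, a, s_next):
        """Posterior-mean estimate of T(s, a, s')."""
        total = sum(self.alpha[(s, a, sp)] for sp in self.states)
        return self.alpha[(s, a, s_next)] / total
```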
Bayesian Reinforcement Learning
} The problem of solving an MDP with unknown T & R can be represented as a POMDP whose state includes the partially observed MDP model:
  } S: MDP states × T × R.
  } A: MDP actions.
  } T(s, a, s'): the transition, assuming the MDP model is as described by POMDP state s.
  } Ω: the resulting next state and reward of the MDP.
  } Z(s, a, o): the perceived next state & reward, assuming the MDP model is as described by POMDP state s.
  } R(s, a): the reward, assuming the MDP model is as described by POMDP state s.
Bayesian Reinforcement Learning
} The optimal policy of the POMDP is optimal exploration vs exploitation:
  } It will try to balance building the most accurate model & working directly towards achieving the goal.
  } It will make the MDP agent receive the maximum reward given the initially unknown T & R.
} Building the best model is just an intermediate step, not the end goal!