Learning in State-Space Reinforcement Learning CIS 32
Functionalia. Syllabus updated: MIDTERM and REVIEW moved up one day. MIDTERM: everything through Evolutionary Agents. HW 2 out - due Sunday before the MIDTERM. EVENING TEA: next Monday, 5pm to 7pm, 0317 N. Today: Training TLUs recap, Neural Networks, Learning a Heuristic, Search-Tree-Less Heuristic Learning, Reinforcement Learning.
Training TLUs: Techniques

Technique         | Gradient Descent? | f
Error Correction  | No                | threshold function: f = 1 if Σ_{i=1..n} x_i w_i ≥ θ, 0 otherwise
Widrow-Hoff       | Yes               | f = s = Σ_{i=1..n} x_i w_i
Generalized Delta | Yes               | f(s) = 1 / (1 + e^-s)
Weight Update Functions

Technique         | Range of d (Desired Output) | Range of f (Actual Training Output) | Weight Update
Error Correction  | 0 or 1                      | 0 or 1                              | w_i ← w_i + c (d - f) x_i
Widrow-Hoff       | -1 or 1                     | (-inf, +inf)                        | w_i ← w_i + c (d - f) x_i
Generalized Delta | 0 or 1                      | [0, 1] (sigmoid)                    | w_i ← w_i + c (d - f) f(1 - f) x_i

c is the learning rate parameter (a small positive fraction).
Error-Correction Technique

d | f | change
0 | 0 | 0
0 | 1 | -c
1 | 0 | +c
1 | 1 | 0

Changes the weights in fixed, finite chunks. For small enough c it terminates after a finite number of steps (if the function is linearly separable). If the function is not linearly separable, it does not terminate (it oscillates).
Example of Error Correction. (Diagram: example TLUs with hand-set weights and thresholds - AND with threshold 1.5 and weights 1, 1; OR with threshold 0.5 and weights 1, 1; a NOT unit - alongside a TLU whose weights are set at random.) Remember that the threshold becomes rolled into the weights. We will start with a random (can also be uniform) set of weights. Set our learning rate to 0.1.
Example of Error Correction. We train the randomly initialized TLU on the training set for AND:

V | X1 | X2 | d
1 |  0 |  0 | 0
2 |  0 |  1 | 0
3 |  1 |  0 | 0
4 |  1 |  1 | 1
Train One Example at a Time

V | X1 | X2 | d | w0  | w1  | w2  | s   | f | c   | d-f | dw0  | dw1 | dw2 | E
1 |  0 |  0 | 0 | 0.4 | 0.4 | 0.6 | 0.4 | 1 | 0.1 | -1  | -0.1 | 0   | 0   | 1

The weights start at the random values (0.4, 0.4, 0.6). s is the weighted sum of the inputs; f is the output of the threshold function (f = 1 if the sum reaches the threshold, 0 otherwise); c is the learning parameter (constant); d - f drives the weight changes dw_i = c (d - f) x_i; E is the error.
After First Round

V | X1 | X2 | d | w0  | w1  | w2  | s   | f | c   | d-f | dw0  | dw1  | dw2  | E
1 |  0 |  0 | 0 | 0.4 | 0.4 | 0.6 | 0.4 | 1 | 0.1 | -1  | -0.1 | 0    | 0    | 1
2 |  0 |  1 | 0 | 0.3 | 0.4 | 0.6 | 0.9 | 1 | 0.1 | -1  | -0.1 | 0    | -0.1 | 1
3 |  1 |  0 | 0 | 0.2 | 0.4 | 0.5 | 0.6 | 1 | 0.1 | -1  | -0.1 | -0.1 | 0    | 1
4 |  1 |  1 | 1 | 0.1 | 0.3 | 0.5 | 0.9 | 1 | 0.1 | 0   | 0    | 0    | 0    | 0
Second Round

V | X1 | X2 | d | w0   | w1  | w2  | s   | f | c   | d-f | dw0  | dw1  | dw2  | E
1 |  0 |  0 | 0 | 0.1  | 0.3 | 0.5 | 0.1 | 1 | 0.1 | -1  | -0.1 | 0    | 0    | 1
2 |  0 |  1 | 0 | 0    | 0.3 | 0.5 | 0.5 | 1 | 0.1 | -1  | -0.1 | 0    | -0.1 | 1
3 |  1 |  0 | 0 | -0.1 | 0.3 | 0.4 | 0.2 | 1 | 0.1 | -1  | -0.1 | -0.1 | 0    | 1
4 |  1 |  1 | 1 | -0.2 | 0.2 | 0.4 | 0.4 | 1 | 0.1 | 0   | 0    | 0    | 0    | 0
Third Round

V | X1 | X2 | d | w0   | w1  | w2  | s    | f | c   | d-f | dw0  | dw1 | dw2  | E
1 |  0 |  0 | 0 | -0.2 | 0.2 | 0.4 | -0.2 | 0 | 0.1 | 0   | 0    | 0   | 0    | 0
2 |  0 |  1 | 0 | -0.2 | 0.2 | 0.4 | 0.2  | 1 | 0.1 | -1  | -0.1 | 0   | -0.1 | 1
3 |  1 |  0 | 0 | -0.3 | 0.2 | 0.3 | -0.1 | 0 | 0.1 | 0   | 0    | 0   | 0    | 0
4 |  1 |  1 | 1 | -0.3 | 0.2 | 0.3 | 0.2  | 1 | 0.1 | 0   | 0    | 0   | 0    | 0
Fourth Round

V | X1 | X2 | d | w0   | w1  | w2  | s    | f | c   | d-f | dw0 | dw1 | dw2 | E
1 |  0 |  0 | 0 | -0.3 | 0.2 | 0.3 | -0.3 | 0 | 0.1 | 0   | 0   | 0   | 0   | 0
2 |  0 |  1 | 0 | -0.3 | 0.2 | 0.3 | 0    | 0 | 0.1 | 0   | 0   | 0   | 0   | 0
3 |  1 |  0 | 0 | -0.3 | 0.2 | 0.3 | -0.1 | 0 | 0.1 | 0   | 0   | 0   | 0   | 0
4 |  1 |  1 | 1 | -0.3 | 0.2 | 0.3 | 0.2  | 1 | 0.1 | 0   | 0   | 0   | 0   | 0

Successfully completed a round with no changes. Done.
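To make the procedure concrete, here is a minimal Python sketch of the error-correction rounds above, assuming the initial weights (0.4, 0.4, 0.6), learning rate 0.1, and the AND training set from the earlier slide. It treats a weighted sum of exactly 0 as "not firing", as the Fourth Round table does.

# Error-correction training of a TLU on AND: a sketch reproducing the rounds above.
# x0 = 1 is the bias input; the threshold is rolled into w0.
training_set = [((1, 0, 0), 0), ((1, 0, 1), 0), ((1, 1, 0), 0), ((1, 1, 1), 1)]
weights = [0.4, 0.4, 0.6]   # the "random" starting weights from the example
c = 0.1                     # learning rate

for round_no in range(1, 101):
    changed = False
    for x, d in training_set:
        s = sum(w * xi for w, xi in zip(weights, x))
        f = 1 if s > 0 else 0                      # s = 0 counts as "not firing" here
        if f != d:                                 # weight change dw_i = c * (d - f) * x_i
            weights = [w + c * (d - f) * xi for w, xi in zip(weights, x)]
            changed = True
    if not changed:                                # a full round with no changes: done
        break

print(round_no, weights)    # 4 rounds, weights roughly (-0.3, 0.2, 0.3) as in the tables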
Gradient Descent in Weight Space. (Figure: the error surface over a TLU's weights Wa and Wb; the gradient of the error with respect to the weights guides the updates, and the weight vector moves through (wa0, wb0), (wa1, wb1), (wa2, wb2).)
Widrow-Hoff Technique. f(s) = s. Changes the weights in variable-sized chunks. d uses -1 (rather than 0) to represent training examples of class 0, which pulls the zero examples below the threshold. (Figure: the linear training output f(s) = s compared with the thresholded output, with training targets at -1 and 1.) The process never terminates, but the differences in error will be minimized.
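A minimal sketch of a single Widrow-Hoff step in Python, assuming the same rolled-in bias input x0 = 1; the training output is the raw sum (f(s) = s) and d is -1 rather than 0 for negative examples.

def widrow_hoff_update(weights, x, d, c=0.1):
    """One Widrow-Hoff step: f(s) = s, weights move by c * (d - s) * x_i."""
    s = sum(w * xi for w, xi in zip(weights, x))   # linear output, no thresholding
    new_weights = [w + c * (d - s) * xi for w, xi in zip(weights, x)]
    return new_weights, s

# e.g. the first row of the table below: d = -1 for a negative example
w, s = widrow_hoff_update([0.4, 0.4, 0.6], (1, 0, 0), d=-1)   # w0 becomes 0.4 - 0.14 = 0.26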
After First Round

V | X1 | X2 | d  | w0      | w1     | w2    | s      | f      | f (thresholded) | c   | d-f     | dw0     | dw1     | dw2     | E
1 |  0 |  0 | -1 | 0.4     | 0.4    | 0.6   | 0.4    | 0.4    | 1 | 0.1 | -1.4    | -0.14   | 0       | 0       | 1.96
2 |  0 |  1 | -1 | 0.26    | 0.4    | 0.6   | 0.86   | 0.86   | 1 | 0.1 | -1.86   | -0.186  | 0       | -0.186  | 3.4596
3 |  1 |  0 | -1 | 0.074   | 0.4    | 0.414 | 0.474  | 0.474  | 1 | 0.1 | -1.474  | -0.1474 | -0.1474 | 0       | 2.172576
4 |  1 |  1 |  1 | -0.0734 | 0.2526 | 0.414 | 0.5932 | 0.5932 | 1 | 0.1 | 0.4068  | 0.04068 | 0.04068 | 0.04068 | 0.1654

f = s. Notice the wide range in the error.
After 10 Rounds

V | X1 | X2 | d  | w0     | w1     | w2    | s       | f       | f (thresholded) | c   | d-f    | dw0     | dw1     | dw2     | E
1 |  0 |  0 | -1 | -0.861 | 0.561  | 0.582 | -0.86   | -0.86   | 0 | 0.1 | -0.139 | -0.0139 | 0       | 0       | 0.019
2 |  0 |  1 | -1 | -0.875 | 0.561  | 0.582 | -0.29   | -0.29   | 0 | 0.1 | -0.707 | -0.0707 | 0       | -0.0706 | 0.4992
3 |  1 |  0 | -1 | -0.946 | 0.561  | 0.511 | -0.385  | -0.385  | 0 | 0.1 | -0.615 | -0.0615 | -0.0615 | 0       | 0.3783
4 |  1 |  1 |  1 | -1.007 | 0.4995 | 0.511 | 0.00325 | 0.00325 | 1 | 0.1 | 0.997  | 0.09967 | 0.09967 | 0.09967 | 0.9935

Good enough.
Round 200-something

V | X1 | X2 | d  | w0      | w1     | w2    | s      | f      | f (thresholded) | c   | d-f    | dw0    | dw1    | dw2    | E
1 |  0 |  0 | -1 | -1.556  | 1.111  | 1.055 | -1.555 | -1.555 | 0 | 0.1 | 0.5555 | 0.055  | 0      | 0      | 0.309
2 |  0 |  1 | -1 | -1.5    | 1.111  | 1.055 | -0.444 | -0.444 | 0 | 0.1 | -0.555 | -0.055 | 0      | -0.055 | 0.309
3 |  1 |  0 | -1 | -1.556  | 1.111  | 0.999 | -0.444 | -0.444 | 0 | 0.1 | -0.555 | -0.055 | -0.055 | 0      | 0.309
4 |  1 |  1 |  1 | -1.6111 | 1.0555 | 0.999 | 0.4444 | 0.4444 | 1 | 0.1 | 0.555  | 0.055  | 0.055  | 0.055  | 0.309

Still good enough. The total error has been decreasing and has now converged.
Generalized Delta Technique. The steeper slope of the sigmoid close to the threshold causes faster change near the boundary. Changes the weights in variable-sized chunks. More fuzzy boundary. d uses 0 to represent training examples of class 0 (instead of -1 as in Widrow-Hoff). A more modern threshold function - used in multi-node networks.
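A corresponding sketch of one generalized-delta step, assuming the sigmoid f(s) = 1 / (1 + e^-s) and the extra f(1 - f) slope factor that appears in the tables below.

import math

def generalized_delta_update(weights, x, d, c=0.2):
    """One generalized-delta step: sigmoid output, update scaled by the slope f(1 - f)."""
    s = sum(w * xi for w, xi in zip(weights, x))
    f = 1.0 / (1.0 + math.exp(-s))                 # sigmoid output in [0, 1]
    delta = c * (d - f) * f * (1 - f)              # largest near the boundary (f close to 0.5)
    new_weights = [w + delta * xi for w, xi in zip(weights, x)]
    return new_weights, f

# first row of the table below: s = 0.4, f ~= 0.599, dw0 ~= -0.029
w, f = generalized_delta_update([0.4, 0.4, 0.6], (1, 0, 0), d=0)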
After First Round

V | X1 | X2 | d | w0     | w1   | w2     | s     | f     | f (thresholded) | c   | d-f    | f(1-f) | dw0     | dw1     | dw2     | E
1 |  0 |  0 | 0 | 0.4    | 0.4  | 0.6    | 0.4   | 0.599 | 1 | 0.2 | -0.599 | 0.2402 | -0.029  | 0       | 0       | 0.358
2 |  0 |  1 | 0 | 0.3712 | 0.4  | 0.6    | 0.971 | 0.725 | 1 | 0.2 | -0.725 | 0.199  | -0.029  | 0       | -0.029  | 0.526
3 |  1 |  0 | 0 | 0.3423 | 0.4  | 0.5712 | 0.742 | 0.677 | 1 | 0.2 | -0.677 | 0.218  | -0.0296 | -0.030  | 0       | 0.459
4 |  1 |  1 | 1 | 0.313  | 0.37 | 0.5712 | 1.254 | 0.778 | 1 | 0.2 | 0.222  | 0.173  | 0.00767 | 0.00767 | 0.00767 | 0.0492

Uses a larger learning rate (c = 0.2). Notice the smaller range in the error.
After 14 Rounds

V | X1 | X2 | d | w0     | w1    | w2    | s       | f     | f (thresholded) | c   | d-f   | f(1-f) | dw0    | dw1    | dw2    | E
1 |  0 |  0 | 0 | -0.427 | 0.256 | 0.437 | -0.43   | 0.394 | 0 | 0.2 | -0.39 | 0.24   | -0.019 | 0      | 0      | 0.156
2 |  0 |  1 | 0 | -0.447 | 0.256 | 0.437 | -0.0096 | 0.498 | 0 | 0.2 | -0.49 | 0.25   | -0.024 | 0      | -0.025 | 0.248
3 |  1 |  0 | 0 | -0.471 | 0.256 | 0.412 | -0.216  | 0.446 | 0 | 0.2 | -0.45 | 0.25   | -0.022 | -0.022 | 0      | 0.199
4 |  1 |  1 | 1 | -0.493 | 0.233 | 0.412 | 0.152   | 0.538 | 1 | 0.2 | 0.46  | 0.25   | 0.023  | 0.023  | 0.023  | 0.213

Always ranges between -0.5 and 0.5.
Network Structures. Two kinds of larger neural network structures:
1. feed-forward networks - acyclic; contain hidden layers and inputs.
2. recurrent networks - cyclic; dynamic systems with oscillations and chaotic behavior; can exhibit short-term memory.
Hidden Units. (Diagram: input units 1 and 2 feed hidden units 3 and 4 via weights W1,3, W1,4, W2,3, W2,4; hidden units 3 and 4 feed output unit 5 via weights W3,5 and W4,5.) The activation of unit 5 is based on the weighted outputs of units 3 and 4. Units 3 and 4 are the hidden units. The activation function depends on the unit (it can use the sigmoid function).
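A small sketch of how unit 5's activation is computed from the diagram; the particular weight values are made up for illustration.

import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

# Hypothetical weights for the network in the diagram (inputs 1, 2; hidden 3, 4; output 5).
w = {(1, 3): 0.5, (1, 4): -0.4, (2, 3): 0.3, (2, 4): 0.8, (3, 5): 1.0, (4, 5): -0.6}

x1, x2 = 1.0, 0.0                                   # activations of input units 1 and 2
a3 = sigmoid(w[(1, 3)] * x1 + w[(2, 3)] * x2)       # hidden unit 3
a4 = sigmoid(w[(1, 4)] * x1 + w[(2, 4)] * x2)       # hidden unit 4
a5 = sigmoid(w[(3, 5)] * a3 + w[(4, 5)] * a4)       # unit 5 sees only the hidden outputs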
Multilayer Feed-Forward. Layers are usually fully connected; numbers of nodes are typically set by hand. A single hidden layer is most common. Trained with back-propagation.
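As a rough illustration of back-propagation through a single hidden layer (the sizes, training data, and learning rate are assumptions, not from the slides), a network like this can learn XOR, which a single TLU cannot:

import math, random

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def train_xor(epochs=20000, c=0.5, hidden=2):
    data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]              # XOR
    wh = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(hidden)]  # hidden weights (incl. bias)
    wo = [random.uniform(-1, 1) for _ in range(hidden + 1)]                  # output weights (incl. bias)
    for _ in range(epochs):
        for (x1, x2), d in data:
            xs = (1, x1, x2)                                             # bias rolled in as x0 = 1
            h = [sigmoid(sum(w * x for w, x in zip(ws, xs))) for ws in wh]
            hs = [1] + h
            y = sigmoid(sum(w * a for w, a in zip(wo, hs)))
            delta_o = (d - y) * y * (1 - y)                              # output error term
            delta_h = [delta_o * wo[j + 1] * h[j] * (1 - h[j]) for j in range(hidden)]
            wo = [w + c * delta_o * a for w, a in zip(wo, hs)]           # generalized-delta updates
            wh = [[w + c * delta_h[j] * x for w, x in zip(wh[j], xs)] for j in range(hidden)]
    return wh, wo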
Larger hypothesis space: combine two opposite-facing threshold functions to make a ridge; combine two perpendicular ridges to make a bump; add bumps of various sizes and locations to fit any surface.
Hopfield Networks - recurrent networks that contain bidirectional connections (units are both inputs and outputs). A stimulus results in the network settling into the activation pattern that most closely resembles a training example. N units can store 0.138 N training examples. Boltzmann Machines - like Hopfield networks, but contain hidden units; the activation functions are stochastic (the probability that a unit outputs a 1 is based on its total weighted input).
Learning in State Space. We return now to heuristics (evaluation functions), used both in search and in minimax search. Having a good heuristic greatly improves an agent's performance (e.g. in A* search, and in evaluating leaf nodes in adversarial search). Good knowledge of the subject domain gives good heuristics; with no knowledge of the subject domain, learn the heuristic.
Levels of Reinforcement Learning (ordered from more knowledge about the problem domain to less):
1. Agent knows its actions, results, and costs; can build an explicit search tree to explore; has a clear short-term goal.
2. Agent does not have a model of its actions; can build an explicit search tree to explore; has a clear short-term goal.
3. Agent does have a model of its actions, but cannot build an explicit search tree to explore (too large); has a clear short-term goal state.
4. Agent knows its actions, results, and costs; cannot build an explicit search tree (too large); does not have a clear short-term goal. Performance is based on reward, not goals.
Explicit Graph Heuristic Learning. Just as we did with previous searches, the agent: knows actions, their results, and costs; has enough space to build an entire search tree. Set the heuristic function h(n) = 0 for all nodes, and do an A* search. Update h(n) once node n is expanded:

h(n) ← min over the set of all children n' of n of [ c(n, n') + h(n') ]

The agent knows the goal state: h(goal) = 0.
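A sketch of one such search in Python (the graph representation and function names are assumptions): on the first call, with h = 0 everywhere, it behaves like the search asked about on the next slide; repeated calls let the learned h values propagate back from the goal.

import heapq

def search_and_learn(graph, start, goal, h):
    """A* over graph[n] = {child: cost}, updating h[n] when n is expanded."""
    frontier = [(h.get(start, 0), 0, start, [start])]
    closed = set()
    while frontier:
        f, g, n, path = heapq.heappop(frontier)
        if n == goal:
            return path, h
        if n in closed:
            continue
        closed.add(n)
        if graph.get(n):                              # h(n) <- min over children of cost + h(child)
            h[n] = min(cost + h.get(child, 0) for child, cost in graph[n].items())
        for child, cost in graph.get(n, {}).items():
            heapq.heappush(frontier, (g + cost + h.get(child, 0), g + cost, child, path + [child]))
    return None, h

# h starts at 0 for every node (and stays 0 at the goal); call repeatedly from the start node to learn.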
Explicit Graph Learning Performance What kind of search is this - when the agent searches for the first time?
Explicit Graph Learning Performance Uniform Cost Search (f = g + 0)
Explicit Graph Learning Performance. Subsequent searches zoom in on the right solution faster and faster. This happens as the true h(n) values propagate back from the goal. (Figure: a small graph with edge costs 1, 2, and 3 and learned values h = 1 and h = 2 marked on the nodes.)
Explicit Graph Learning Performance. Each run propagates the true cost of getting to the goal further back through the search. Eventually the minimal path can be read off the tree. (Same figure as above.) The agent goes through a thought experiment - it uses a model of the state space.
No Model of Action Heuristic Learning. What if there is no clear model of actions for state transitions? Assuming the agent can build, name, and store previous states... the agent can learn heuristics in the real world. This can be perilous... Explore: a robot uses a grid to plan a route and moves randomly about the room. Exploit: it works out which runs about the room are the most optimal, and at what times certain operations were useful.
Updating the heuristic value of states. Start node: the agent knows the cost of an action after taking it. States are named and stored, and can be distinguished if encountered again later. The heuristic function for a state is updated:

h(n_i) ← c(n_i, a) + h(n_j)

where h(n_i) is the heuristic value of the node the agent was just in, c(n_i, a) is the cost of the transition (i.e. action), and h(n_j) is the heuristic value of the node transitioned to (initially 0 if not travelled to previously).
Choosing Actions. Initially actions are chosen randomly. After some exploring, states have h(n) values ascribed to them, and there is a model built of the actions - call it succ(n_i, a) - which describes the state (i.e. node) that is reached from node n_i after carrying out action a. Actions are now chosen by:

a = argmin over actions a of [ c(n_i, a) + h(succ(n_i, a)) ]

Eventually the estimated minimum path to the goal is built up. Keeping some randomness allows for discovery of possibly better paths to the goal.
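A sketch of this explore/exploit loop (names like model, cost, and epsilon are illustrative, not from the slides): the agent updates h for the node it just left, records what the action did, and mostly picks the action that looks cheapest.

import random

def choose_action(n, actions, model, cost, h, epsilon=0.1):
    """Mostly exploit: min over known actions of cost + h(result); sometimes explore."""
    known = [a for a in actions if (n, a) in model]
    if not known or random.random() < epsilon:        # keep some randomness
        return random.choice(actions)
    return min(known, key=lambda a: cost[(n, a)] + h.get(model[(n, a)], 0))

def after_step(n_i, a, n_j, step_cost, model, cost, h):
    """Having taken a from n_i and landed in n_j: record the model, update h(n_i)."""
    model[(n_i, a)] = n_j                             # what the action did
    cost[(n_i, a)] = step_cost                        # what it cost
    h[n_i] = step_cost + h.get(n_j, 0)                # h(n_i) <- c(n_i, a) + h(n_j)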
Learning without a Search Graph (or Node Table). More realistic problems are so large that it is not possible to store all the states/nodes and build the entire search graph. Now, if we have a model of the actions, we can create and search with an evaluation function. Assemble a heuristic function out of as many sub-functions as can describe some value of a state. For the 8-puzzle, a list of such functions could be (see the sketch below):
W(n) : the number of tiles out of place
P(n) : the sum of the distances of each tile from its home position
Any other functions : usually relaxed heuristics.
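For concreteness, sketches of the two 8-puzzle sub-functions, assuming a state is a tuple of 9 tiles in row-major order with 0 as the blank:

def w_feature(state, goal):
    """W(n): number of tiles out of place (the blank, 0, is not counted)."""
    return sum(1 for tile, home in zip(state, goal) if tile != 0 and tile != home)

def p_feature(state, goal):
    """P(n): sum of Manhattan distances of each tile from its home position."""
    total = 0
    for idx, tile in enumerate(state):
        if tile == 0:
            continue
        home = goal.index(tile)
        total += abs(idx // 3 - home // 3) + abs(idx % 3 - home % 3)
    return total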
Weighted Heuristic Function. Write our heuristic function as a linear weighted combination, say

h(n) = w_1 f_1(n) + w_2 f_2(n) + ... + w_k f_k(n)   (e.g. h(n) = w_1 W(n) + w_2 P(n) for the 8-puzzle)

All we have to do now is learn which weights are best. One way to do that is to notice the difference in the heuristic value once we traverse from one node to another, taking that cost into consideration:

[ c(n_i, n_j) + h(n_j) ] - h(n_i)
Updating the Heuristic. We modify h(n_i) by adding some proportion (controlled by the learning rate β) of the difference between what we thought h(n_i) was before expansion and what we think it is after:

h(n_i) ← h(n_i) + β [ min over the set of successor nodes n_j of (c(n_i, n_j) + h(n_j)) - h(n_i) ]

Once we know the change in h(n_i), we adjust the weights, similar to the neural networks (a sketch follows below).
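A sketch of that weight adjustment - a delta-rule-style update for the linear combination; beta, the feature list, and the successor representation are assumptions:

def update_heuristic_weights(weights, features, n_i, successors, beta=0.2):
    """Move each w_k a fraction beta of the temporal difference, scaled by f_k(n_i)."""
    h = lambda n: sum(w * f(n) for w, f in zip(weights, features))   # h(n) = sum_k w_k f_k(n)
    target = min(cost + h(n_j) for cost, n_j in successors)          # best one-step estimate
    error = target - h(n_i)                                          # new estimate minus old estimate
    return [w + beta * error * f(n_i) for w, f in zip(weights, features)]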
Rewritten: h(n_i) ← (1 - β) h(n_i) + β min over successors n_j of [ c(n_i, n_j) + h(n_j) ]

Temporal Learning. β controls how fast the agent learns - how much weight we give to the new estimate of the heuristic.

β    | Effect
0    | no adjustment to h(n_i)
low  | slow learning
high | erratic performance
1    | the old h(n_i) is thrown away
Temporal Learning. Called temporal learning because the difference is based on estimates one time step apart. Note that this temporal-difference approach can also work without a model of the effects of actions (with suitable modification).
Rewards not Goals. For many tasks agents don't have short-term goals, but instead accrue rewards over a period of time. Instead of a plan, we want a policy, which says how the agent should act over time. Typically this is expressed as what action should be carried out in a given state. Express the reward an agent gets as r(n_j) - a special reward for being in state n_j. We want an optimal policy - one which maximizes the (discounted) reward - at every node.
Finding the Optimum Policy. One (non-ideal) solution is to search through all policies (randomly) until a good one is discovered. Instead, given a certain policy, one can calculate the value of a node - the reward an agent will get if it starts at that node and follows the policy. If the agent is at n_i and follows the policy to n_j, then the agent can expect this reward in the long term:

V(n_i) = r(n_j) + γ V(n_j)

where γ, the discounting factor, adds a little long-term goal.
Value Iteration. The optimum policy then gives us the action that maximizes this reward:

π*(n_i) = argmax over actions a of [ r(succ(n_i, a)) + γ V*(succ(n_i, a)) ]

If we knew the values of the nodes under the optimal policy, then we could easily compute the optimal policy this way. The problem is that we don't know these values. But we can find them out using value iteration. We start by guessing (randomly is fine) an estimated value V(n) for each node.
Approximating the Estimated Values. Then when we are at n_i we pick the action a to maximize r(succ(n_i, a)) + γ V(succ(n_i, a)) - the best thing given what we currently know. We then update V(n_i) by:

V(n_i) ← r(n_j) + γ V(n_j)

Progressive iterations of this calculation make V(n) a closer and closer approximation to the true values V*(n). Intuitively this is because we replace the estimate with the actual reward we get for the next state (and the next state, and the next state).
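A sketch of a full-sweep version of value iteration under these assumptions (deterministic successor function succ(n, a), reward r(n_j) for arriving in state n_j; all names are illustrative). The slides describe the online variant, where V(n_i) is updated as the agent actually moves; the sweep below applies the same backup to every node repeatedly.

import random

def value_iteration(nodes, actions, succ, reward, gamma=0.9, sweeps=100):
    """Start from guessed values and repeatedly back up the best one-step estimate."""
    V = {n: random.random() for n in nodes}                    # initial guesses (random is fine)
    for _ in range(sweeps):
        for n in nodes:
            V[n] = max(reward(succ(n, a)) + gamma * V[succ(n, a)] for a in actions(n))
    # read off the (approximately) optimal policy from the converged values
    policy = {n: max(actions(n), key=lambda a: reward(succ(n, a)) + gamma * V[succ(n, a)])
              for n in nodes}
    return V, policy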
Summary This lecture has looked at a number of approaches to learning heuristic functions. We started assuming that the agent knew everything but the heuristic, and progressively relaxed assumptions. This created a battery of reinforcement learning methods that can be applied in a wide variety of situations. These models also tie learning and planning together very closely, and we will revisit them as planning models later in the course.