Learning in State-Space Reinforcement Learning CIS 32


Functionalia

Syllabus updated: MIDTERM and REVIEW moved up one day. MIDTERM: everything through Evolutionary Agents. HW 2 out, due the Sunday before the MIDTERM. EVENING TEA: next Monday, 5pm to 7pm, 0317 N.

Today: Training TLUs (recap), Neural Networks, Learning a Heuristic, Search-Tree-Less Heuristic Learning, Reinforcement Learning.

Training TLUs: Techniques

Technique         | Gradient descent? | Activation
Error Correction  | No                | Threshold: f = 1 if sum_i x_i w_i >= theta, 0 otherwise
Widrow-Hoff       | Yes               | Linear: f(s) = s
Generalized Delta | Yes               | Sigmoid: f(s) = 1 / (1 + e^-s)

Weight Update Functions

Technique         | Range of d (desired output) | Range of f (actual training output) | Weight update
Error Correction  | 0 or 1                      | 0 or 1                              | w_i <- w_i + c (d - f) x_i
Widrow-Hoff       | -1 or 1                     | (-inf, +inf)                        | w_i <- w_i + c (d - f) x_i
Generalized Delta | 0 or 1                      | [0, 1] (sigmoid)                    | w_i <- w_i + c (d - f) f (1 - f) x_i

c is the learning-rate parameter (a small positive fraction).

Error-Correction Technique

d | f | change
0 | 0 | 0
0 | 1 | -c
1 | 0 | +c
1 | 1 | 0

Changes the weights in fixed-size chunks. For small enough c, training terminates after a finite number of steps if the function is linearly separable. If the function is not linearly separable, it does not terminate (it oscillates).

Example of Error Correction

[Figure: TLUs for AND (threshold weight W0 = 1.5, W1 = 1, W2 = 1), OR (W0 = 0.5, W1 = 1, W2 = 1) and NOT, next to a TLU whose weights are initialised at random.]

Remember that the threshold becomes rolled into the weights. We will start with a random (can also be uniform) set of weights, and set our learning rate to 0.1.

Example of Error Correction

Training set for AND:

V | X1 | X2 | d
1 | 0  | 0  | 0
2 | 0  | 1  | 0
3 | 1  | 0  | 0
4 | 1  | 1  | 1

Train One Example at a Time

V | X1 | X2 | d | w0  | w1  | w2  | s   | f | c   | d-f | dw0  | dw1 | dw2 | E
1 | 0  | 0  | 0 | 0.4 | 0.4 | 0.6 | 0.4 | 1 | 0.1 | -1  | -0.1 | 0   | 0   | 1

The initial weights (w0, w1, w2) are random. s = w0 + w1 X1 + w2 X2, and f = 1 if s is above the threshold (now rolled into w0), otherwise 0. c is the learning parameter (a constant), d - f is the error signal, dwi = c (d - f) xi is the change applied to each weight, and E = (d - f)^2 is the error for the example.

After First Round

V | X1 | X2 | d | w0  | w1  | w2  | s   | f | c   | d-f | dw0  | dw1  | dw2  | E
1 | 0  | 0  | 0 | 0.4 | 0.4 | 0.6 | 0.4 | 1 | 0.1 | -1  | -0.1 | 0    | 0    | 1
2 | 0  | 1  | 0 | 0.3 | 0.4 | 0.6 | 0.9 | 1 | 0.1 | -1  | -0.1 | 0    | -0.1 | 1
3 | 1  | 0  | 0 | 0.2 | 0.4 | 0.5 | 0.6 | 1 | 0.1 | -1  | -0.1 | -0.1 | 0    | 1
4 | 1  | 1  | 1 | 0.1 | 0.3 | 0.5 | 0.9 | 1 | 0.1 | 0   | 0    | 0    | 0    | 0

Second Round

V | X1 | X2 | d | w0   | w1  | w2  | s   | f | c   | d-f | dw0  | dw1  | dw2  | E
1 | 0  | 0  | 0 | 0.1  | 0.3 | 0.5 | 0.1 | 1 | 0.1 | -1  | -0.1 | 0    | 0    | 1
2 | 0  | 1  | 0 | 0    | 0.3 | 0.5 | 0.5 | 1 | 0.1 | -1  | -0.1 | 0    | -0.1 | 1
3 | 1  | 0  | 0 | -0.1 | 0.3 | 0.4 | 0.2 | 1 | 0.1 | -1  | -0.1 | -0.1 | 0    | 1
4 | 1  | 1  | 1 | -0.2 | 0.2 | 0.4 | 0.4 | 1 | 0.1 | 0   | 0    | 0    | 0    | 0

Third Round

V | X1 | X2 | d | w0   | w1  | w2  | s    | f | c   | d-f | dw0  | dw1 | dw2  | E
1 | 0  | 0  | 0 | -0.2 | 0.2 | 0.4 | -0.2 | 0 | 0.1 | 0   | 0    | 0   | 0    | 0
2 | 0  | 1  | 0 | -0.2 | 0.2 | 0.4 | 0.2  | 1 | 0.1 | -1  | -0.1 | 0   | -0.1 | 1
3 | 1  | 0  | 0 | -0.3 | 0.2 | 0.3 | -0.1 | 0 | 0.1 | 0   | 0    | 0   | 0    | 0
4 | 1  | 1  | 1 | -0.3 | 0.2 | 0.3 | 0.2  | 1 | 0.1 | 0   | 0    | 0   | 0    | 0

Fourth Round

V | X1 | X2 | d | w0   | w1  | w2  | s    | f | c   | d-f | dw0 | dw1 | dw2 | E
1 | 0  | 0  | 0 | -0.3 | 0.2 | 0.3 | -0.3 | 0 | 0.1 | 0   | 0   | 0   | 0   | 0
2 | 0  | 1  | 0 | -0.3 | 0.2 | 0.3 | 0    | 0 | 0.1 | 0   | 0   | 0   | 0   | 0
3 | 1  | 0  | 0 | -0.3 | 0.2 | 0.3 | -0.1 | 0 | 0.1 | 0   | 0   | 0   | 0   | 0
4 | 1  | 1  | 1 | -0.3 | 0.2 | 0.3 | 0.2  | 1 | 0.1 | 0   | 0   | 0   | 0   | 0

Successfully completed a round with no changes: training is done.
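As a check on the arithmetic above, here is a minimal Python sketch (not from the slides; the names are my own) that reproduces this error-correction run: the same AND training set, initial weights 0.4, 0.4, 0.6 and learning rate 0.1, stopping after the first round with no changes.

```python
# Error-correction training of a single TLU, following the worked tables above.

def threshold(s):
    # the worked tables treat s == 0 as output 0
    return 1 if s > 0 else 0

def train_error_correction(examples, w, c=0.1, max_rounds=100):
    for round_no in range(1, max_rounds + 1):
        changed = False
        for x1, x2, d in examples:
            x = (1, x1, x2)                      # x0 = 1: the threshold is rolled into w0
            s = sum(wi * xi for wi, xi in zip(w, x))
            f = threshold(s)
            if d != f:                           # update only when the output is wrong
                w = [wi + c * (d - f) * xi for wi, xi in zip(w, x)]
                changed = True
        if not changed:                          # a full round with no changes: done
            return w, round_no
    return w, max_rounds

AND = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1)]   # (X1, X2, desired output)
print(train_error_correction(AND, [0.4, 0.4, 0.6]))  # roughly [-0.3, 0.2, 0.3] after 4 rounds
```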

Gradient Descent in Weight Space

[Figure: the gradient of the error with respect to the TLU's weights (Wa, Wb), with successive weight vectors (wa0, wb0), (wa1, wb1), (wa2, wb2) stepping downhill in weight space.]

Widrow-Hoff Technique

f(s) = s. Changes the weights in variable-sized chunks. d uses -1 to represent training examples of 0 (this pulls the zero cases below 0).

[Figure: the linear training activation f(s) = s alongside the threshold function, with training targets at -1 and 1.]

The process never terminates, but the differences in error are minimized.
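Here is a minimal sketch of one Widrow-Hoff round, laid out like the error-correction sketch above; the differences are the linear output f(s) = s, targets of -1 and 1, and the fact that the weights change on every example. Function and variable names are illustrative.

```python
# One round of Widrow-Hoff (LMS) training for a single TLU.

def widrow_hoff_round(examples, w, c=0.1):
    total_error = 0.0
    for x1, x2, d in examples:                       # d is -1 or 1
        x = (1, x1, x2)
        s = sum(wi * xi for wi, xi in zip(w, x))     # f(s) = s
        err = d - s
        w = [wi + c * err * xi for wi, xi in zip(w, x)]
        total_error += err ** 2                      # E = (d - f)^2
    return w, total_error

AND_PM = [(0, 0, -1), (0, 1, -1), (1, 0, -1), (1, 1, 1)]   # 0 cases encoded as -1
w = [0.4, 0.4, 0.6]
for _ in range(10):
    w, E = widrow_hoff_round(AND_PM, w)
print(w, E)
```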

After First Round

V | X1 | X2 | d  | w0      | w1     | w2    | s      | f(s) = s | f (thresh) | c   | d-f    | dw0     | dw1     | dw2     | E
1 | 0  | 0  | -1 | 0.4     | 0.4    | 0.6   | 0.4    | 0.4      | 1          | 0.1 | -1.4   | -0.14   | 0       | 0       | 1.96
2 | 0  | 1  | -1 | 0.26    | 0.4    | 0.6   | 0.86   | 0.86     | 1          | 0.1 | -1.86  | -0.186  | 0       | -0.186  | 3.4596
3 | 1  | 0  | -1 | 0.074   | 0.4    | 0.414 | 0.474  | 0.474    | 1          | 0.1 | -1.474 | -0.1474 | -0.1474 | 0       | 2.172576
4 | 1  | 1  | 1  | -0.0734 | 0.2526 | 0.414 | 0.5932 | 0.5932   | 1          | 0.1 | 0.4068 | 0.04068 | 0.04068 | 0.04068 | 0.1654

f = s. Notice the wide range in error.

After 10 Rounds

V | X1 | X2 | d  | w0     | w1     | w2    | s       | f(s) = s | f (thresh) | c   | d-f    | dw0     | dw1     | dw2     | E
1 | 0  | 0  | -1 | -0.861 | 0.561  | 0.582 | -0.86   | -0.86    | 0          | 0.1 | -0.139 | -0.0139 | 0       | 0       | 0.019
2 | 0  | 1  | -1 | -0.875 | 0.561  | 0.582 | -0.29   | -0.29    | 0          | 0.1 | -0.707 | -0.0707 | 0       | -0.0706 | 0.4992
3 | 1  | 0  | -1 | -0.946 | 0.561  | 0.511 | -0.385  | -0.385   | 0          | 0.1 | -0.615 | -0.0615 | -0.0615 | 0       | 0.3783
4 | 1  | 1  | 1  | -1.007 | 0.4995 | 0.511 | 0.00325 | 0.00325  | 1          | 0.1 | 0.997  | 0.09967 | 0.09967 | 0.09967 | 0.9935

Good enough.

Round 200-something

V | X1 | X2 | d  | w0      | w1     | w2    | s      | f(s) = s | f (thresh) | c   | d-f    | dw0    | dw1    | dw2    | E
1 | 0  | 0  | -1 | -1.556  | 1.111  | 1.055 | -1.555 | -1.555   | 0          | 0.1 | 0.5555 | 0.055  | 0      | 0      | 0.309
2 | 0  | 1  | -1 | -1.5    | 1.111  | 1.055 | -0.444 | -0.444   | 0          | 0.1 | -0.555 | -0.055 | 0      | -0.055 | 0.309
3 | 1  | 0  | -1 | -1.556  | 1.111  | 0.999 | -0.444 | -0.444   | 0          | 0.1 | -0.555 | -0.055 | -0.055 | 0      | 0.309
4 | 1  | 1  | 1  | -1.6111 | 1.0555 | 0.999 | 0.4444 | 0.4444   | 1          | 0.1 | 0.555  | 0.055  | 0.055  | 0.055  | 0.309

Still good enough: the error has converged.

Generalized Delta Technique

f(s) = 1 / (1 + e^-s). The steeper slope close to the threshold causes faster change near the boundary. Changes the weights in variable-sized chunks, giving a fuzzier boundary. d uses 0 to represent training examples of 0 (instead of -1 as in Widrow-Hoff). This is the more modern threshold function, used in multi-node networks.
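A minimal sketch of one round of the generalized delta rule for a single sigmoid unit, assuming the c = 0.2 learning rate used in the tables that follow; names are illustrative.

```python
# Generalized delta rule: the error is scaled by the sigmoid slope f(1 - f).
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def generalized_delta_round(examples, w, c=0.2):
    for x1, x2, d in examples:                       # d is 0 or 1
        x = (1, x1, x2)
        s = sum(wi * xi for wi, xi in zip(w, x))
        f = sigmoid(s)
        delta = (d - f) * f * (1 - f)                # error scaled by the sigmoid slope
        w = [wi + c * delta * xi for wi, xi in zip(w, x)]
    return w

w = [0.4, 0.4, 0.6]
AND = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1)]
for _ in range(14):
    w = generalized_delta_round(AND, w)
print(w)
```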

After First Round

V | X1 | X2 | d | w0     | w1   | w2     | s     | f     | f (thresh) | c   | d-f    | f(1-f) | dw0     | dw1     | dw2     | E
1 | 0  | 0  | 0 | 0.4    | 0.4  | 0.6    | 0.4   | 0.599 | 1          | 0.2 | -0.599 | 0.2402 | -0.029  | 0       | 0       | 0.358
2 | 0  | 1  | 0 | 0.3712 | 0.4  | 0.6    | 0.971 | 0.725 | 1          | 0.2 | -0.725 | 0.199  | -0.029  | 0       | -0.029  | 0.526
3 | 1  | 0  | 0 | 0.3423 | 0.4  | 0.5712 | 0.742 | 0.677 | 1          | 0.2 | -0.677 | 0.218  | -0.0296 | -0.030  | 0       | 0.459
4 | 1  | 1  | 1 | 0.313  | 0.37 | 0.5712 | 1.254 | 0.778 | 1          | 0.2 | 0.222  | 0.173  | 0.00767 | 0.00767 | 0.00767 | 0.0492

Uses a larger learning rate (c = 0.2). Notice the smaller range in error.

After 14 Rounds

V | X1 | X2 | d | w0     | w1    | w2    | s       | f     | f (thresh) | c   | d-f   | f(1-f) | dw0    | dw1    | dw2    | E
1 | 0  | 0  | 0 | -0.427 | 0.256 | 0.437 | -0.43   | 0.394 | 0          | 0.2 | -0.39 | 0.24   | -0.019 | 0      | 0      | 0.156
2 | 0  | 1  | 0 | -0.447 | 0.256 | 0.437 | -0.0096 | 0.498 | 0          | 0.2 | -0.49 | 0.25   | -0.024 | 0      | -0.025 | 0.248
3 | 1  | 0  | 0 | -0.471 | 0.256 | 0.412 | -0.216  | 0.446 | 0          | 0.2 | -0.45 | 0.25   | -0.022 | -0.022 | 0      | 0.199
4 | 1  | 1  | 1 | -0.493 | 0.233 | 0.412 | 0.152   | 0.538 | 1          | 0.2 | 0.46  | 0.25   | 0.023  | 0.023  | 0.023  | 0.213

The error d - f now always lies between -0.5 and 0.5.

Network Structures

Two kinds of larger neural-network structures:
1. Feed-forward networks: acyclic; contain hidden layers and inputs.
2. Recurrent networks: cyclic; dynamic systems with oscillations and chaotic behavior; can exhibit short-term memory.

Hidden Units

[Figure: a small feed-forward network with input units 1 and 2, hidden units 3 and 4, and output unit 5, connected by weights W1,3, W1,4, W2,3, W2,4, W3,5, W4,5.]

The activation of unit 5 is based on the weighted outputs of units 3 and 4. Units 3 and 4 are the hidden units. The activation function depends on the unit (it can use the sigmoid function).
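As a concrete reading of the figure, here is a hedged sketch of the forward pass through that five-unit network, assuming sigmoid activations throughout; the weight values are made up for illustration, and bias/threshold weights are omitted for brevity.

```python
# Forward pass through the network in the figure: inputs 1, 2 -> hidden 3, 4 -> output 5.
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def forward(x1, x2, W):
    a3 = sigmoid(W[(1, 3)] * x1 + W[(2, 3)] * x2)   # hidden unit 3
    a4 = sigmoid(W[(1, 4)] * x1 + W[(2, 4)] * x2)   # hidden unit 4
    a5 = sigmoid(W[(3, 5)] * a3 + W[(4, 5)] * a4)   # output unit 5
    return a5

# illustrative weights only; W[(i, j)] is the weight from unit i to unit j
W = {(1, 3): 0.5, (1, 4): -0.3, (2, 3): 0.8, (2, 4): 0.1, (3, 5): 1.2, (4, 5): -0.7}
print(forward(1.0, 0.0, W))
```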

Multilayer Feed-Forward Networks

Layers are usually fully connected; the numbers of nodes are typically set by hand. A single hidden layer is most common. Such networks are trained with back-propagation.

Larger Hypothesis Space

Combine two opposite-facing threshold functions to make a ridge. Combine two perpendicular ridges to make a bump. Add bumps of various sizes and locations to fit any surface (see the sketch below).
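The following hedged sketch makes that construction concrete in Python; the particular offsets, the steepness factor of 4, and the 1.5 threshold are illustrative choices, not values from the slides.

```python
# Two opposite-facing sigmoids make a ridge; two perpendicular ridges make a bump.
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def ridge_x(x, y):
    # high for roughly -3 < x < 3, low elsewhere, regardless of y
    return sigmoid(x + 3) - sigmoid(x - 3)

def ridge_y(x, y):
    return sigmoid(y + 3) - sigmoid(y - 3)

def bump(x, y):
    # a steeper sigmoid over the sum of the two ridges sharpens the bump
    return sigmoid(4 * (ridge_x(x, y) + ridge_y(x, y) - 1.5))

print(bump(0, 0), bump(10, 0), bump(10, 10))   # high near the origin, low away from it
```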

Hopfield Networks (Recurrent Networks)

Contain bidirectional connections (units are both inputs and outputs). A stimulus results in the network settling into the activation pattern that most closely resembles a training example. N units can store about 0.138 N training examples.

Boltzmann Machines

Like Hopfield networks, but contain hidden units. The activation functions are stochastic: each unit takes the value 1 with a probability based on its total weighted input.
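Since the storage and settling rules are short, here is a minimal Hopfield-network sketch, assuming +1/-1 unit states, the standard Hebbian storage rule, and asynchronous threshold updates; the stored patterns and the probe are illustrative only.

```python
# Store patterns with the Hebbian rule, then let the network settle from a noisy probe.
import random

def store(patterns, n):
    # w[i][j] = sum over patterns of p[i] * p[j], with no self-connections
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j]
    return w

def recall(w, state, steps=100):
    n = len(state)
    state = list(state)
    for _ in range(steps):
        i = random.randrange(n)                        # asynchronous update of one unit
        s = sum(w[i][j] * state[j] for j in range(n))
        state[i] = 1 if s >= 0 else -1
    return state

patterns = [[1, 1, 1, -1, -1, -1], [1, -1, 1, -1, 1, -1]]
w = store(patterns, 6)
print(recall(w, [1, 1, -1, -1, -1, -1]))   # settles onto the first stored pattern
```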

Learning in State Space

We return now to heuristics (evaluation functions), used both in search and in minimax search. Having a good heuristic greatly improves an agent's performance (e.g. in A* search, and in evaluating leaf nodes in adversarial search).

Good knowledge of the subject domain: good heuristics. No knowledge of the subject domain: learn the heuristic.

Levels of Reinforcement Learning

Knowledge about the problem domain decreases from 1 to 4:
1. Agent knows its actions, their results, and costs; can build an explicit search tree to explore; has a clear short-term goal.
2. Agent does not have a model of its actions; can build an explicit search tree to explore; has a clear short-term goal.
3. Agent does have a model of its actions, but cannot build an explicit search tree (too large); has a clear short-term goal state.
4. Agent knows its actions, results, and costs; cannot build an explicit search tree (too large); does not have a clear short-term goal. Performance is based on reward, not goals.

Explicit Graph Heuristic Learning

Just as with the previous searches, the agent knows the actions, their results, and their costs, and has enough space to build the entire search tree. Set the heuristic function h(n) = 0 for all nodes and do an A* search. Update h(n) once node n is expanded:

h(n) <- min over n' in succ(n) of [ c(n, n') + h(n') ]

where succ(n) is the set of all children of n. The agent knows the goal state, so h(goal) = 0.
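A hedged sketch of that update step, assuming the graph is given as an adjacency dictionary; the surrounding A* search is omitted and only the learning step applied on expansion is shown. All names are illustrative.

```python
# Back up h(node) from its children whenever the node is expanded.

def update_h(node, graph, h, goal):
    if node == goal:
        h[node] = 0.0                                  # the goal state is known
        return
    children = graph.get(node, [])
    if children:
        h[node] = min(cost + h.get(child, 0.0)         # h starts at 0 everywhere
                      for child, cost in children)

graph = {'A': [('B', 2), ('C', 3)], 'B': [('G', 3)], 'C': [('G', 1)], 'G': []}
h = {}
for n in ['G', 'B', 'C', 'A']:                         # e.g. in order of expansion
    update_h(n, graph, h, goal='G')
print(h)   # {'G': 0.0, 'B': 3.0, 'C': 1.0, 'A': 4.0}
```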

Explicit Graph Learning Performance What kind of search is this - when the agent searches for the first time?

Explicit Graph Learning Performance Uniform Cost Search (f = g + 0)

Explicit Graph Learning Performance

Subsequent searches zoom in on the right solution faster and faster. This happens as the true h(n) values propagate back from the goal. Each run propagates the true cost of getting to the goal further back through the search; eventually the minimal path can be read off the tree. The agent goes through a thought experiment, using a model of the state space.

[Figure: a small search graph with edge costs of 1 to 3 and learned heuristic values (h = 1, h = 2) attached to the nodes, shown over successive runs.]

No Model of Action: Heuristic Learning

What if there is no clear model of actions for state transitions? Assuming the agent can build, name, and store previous states, it can learn heuristics in the real world. This can be perilous...

Explore: a robot using a grid to plan a route moves randomly about the room.
Exploit: it works out which runs about the room are the most efficient, and when certain operations were useful.

Updating the Heuristic Value of States

Starting from the start node, the agent knows the cost of an action only after taking it. States are named and stored, and can be recognised when reached again later. The heuristic value of a state is updated as

h(n_i) <- c(n_i, a) + h(n_j)

where n_i is the node the agent was just in, c(n_i, a) is the cost of the transition (i.e. the action a), and n_j is the node transitioned to (h(n_j) is initially 0 if it has not been visited before).

Choosing Actions

Initially actions are chosen randomly. After some exploring, states have h(n) values ascribed to them, and a model of the actions is built up: a function (written here as alpha) where alpha(n_i, a) describes the state (i.e. node n) reached from node n_i after carrying out action a. Actions are then chosen by

a = argmin over a of [ c(n_i, a) + h(alpha(n_i, a)) ]

Eventually the estimated minimum path to the goal is built up. Keeping some randomness allows discovery of possibly better paths to the goal (see the sketch below).
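A hedged sketch combining the update rule and the action-choice rule above. The environment interface take_action(state, action) -> (next_state, cost) is an assumption, as are all names; epsilon keeps some randomness in the choice, and the toy corridor at the end is only there to make the sketch runnable.

```python
# Learn heuristic values of named states by acting in the world, mostly greedily.
import random

def learn_heuristic(start, goal, take_action, actions, episodes=50, epsilon=0.1):
    h = {goal: 0.0}                                    # learned heuristic values of visited states
    model = {}                                         # (state, action) -> (next state, cost), from experience
    for _ in range(episodes):
        state = start
        while state != goal:
            known = [a for a in actions if (state, a) in model]
            if known and random.random() > epsilon:    # exploit: minimise estimated cost + h
                action = min(known, key=lambda a: model[(state, a)][1]
                                                  + h.get(model[(state, a)][0], 0.0))
            else:                                      # explore: keep some randomness
                action = random.choice(actions)
            next_state, cost = take_action(state, action)
            model[(state, action)] = (next_state, cost)
            h[state] = cost + h.get(next_state, 0.0)   # h(n_i) <- c(n_i, a) + h(n_j)
            state = next_state
    return h

# a toy 1-D corridor: states 0..4, goal 4, every step costs 1
def toy_step(state, action):
    return max(0, min(4, state + (1 if action == 'right' else -1))), 1.0

print(learn_heuristic(0, 4, toy_step, ['left', 'right']))   # h roughly tracks distance to the goal
```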

Learning without a Search Graph (or Node Table)

More realistic problems are so large that it is not possible to store all the states/nodes and build the entire search graph. If we have a model of the actions, we can still create and search with an evaluation function: assemble a heuristic function out of sub-functions, each describing some property of a state. For the 8-puzzle the list of functions could be:

W(n): the number of tiles out of place
P(n): the sum of the distances of each tile from its home position
Any other functions: usually relaxed heuristics.

Weighted Heuristic Function

Write our heuristic function as a linear weighted combination of the sub-functions (e.g. W(n) and P(n) above):

h(n) = w_1 f_1(n) + w_2 f_2(n) + ... + w_k f_k(n)

All we have to do now is learn which weights are best. One way to do that is to notice the difference in heuristic value when we traverse from one node to another, taking the cost of the move into consideration:

[ c(n_i, n_j) + h(n_j) ] - h(n_i)

Updating the Heuristic

We modify h(n_i) by adding some proportion (controlled by the learning rate beta) of the difference between what we thought h(n_i) was before expansion and what we think it is after:

h(n_i) <- h(n_i) + beta [ min over n_j in succ(n_i) of ( c(n_i, n_j) + h(n_j) ) - h(n_i) ]

where succ(n_i) is the set of successor nodes of n_i. Once we know the change in h(n_i), we adjust the weights in much the same way as in the neural networks.

Temporal Learning

Rewritten:

h(n_i) <- (1 - beta) h(n_i) + beta min over n_j in succ(n_i) of ( c(n_i, n_j) + h(n_j) )

beta controls how fast the agent learns, i.e. how much weight we give to the new estimate of the heuristic.

beta | Effect
0    | no adjustment to h(n_i)
low  | slow learning
high | erratic performance
1    | the old h(n_i) is thrown away
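A hedged sketch of the corresponding weight adjustment for the linear heuristic h(n) = w1 f1(n) + ... + wk fk(n): the one-step temporal difference is computed and each weight is nudged in proportion to its feature value, as in the delta rule for neural networks. The feature functions and the successor list are assumed to be supplied by the caller; all names are illustrative.

```python
# Temporal-difference update of the weights of a linear feature-based heuristic.

def h(features, w, node):
    # linear weighted combination of the sub-functions
    return sum(wi * f(node) for wi, f in zip(w, features))

def td_update(w, features, node, successors, beta=0.1):
    """One temporal-difference step; successors is a list of (child, cost) pairs."""
    backed_up = min(cost + h(features, w, child) for child, cost in successors)
    delta = backed_up - h(features, w, node)           # the one-step temporal difference
    # nudge each weight in proportion to its feature value, as in the delta rule
    return [wi + beta * delta * f(node) for wi, f in zip(w, features)]

# e.g. for the 8-puzzle: td_update([1.0, 1.0], [W, P], node, successors_of(node)),
# where W and P are the tiles-out-of-place and distance-from-home features.
```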

Temporal Learning

Called temporal learning because the difference is based on a single time step. Note that this temporal-difference approach can also work without a model of the effects of actions (with suitable modification).

Rewards, Not Goals

For many tasks agents don't have short-term goals, but instead accrue rewards over a period of time. Instead of a plan, we want a policy, which says how the agent should act over time. Typically this is expressed as which action should be carried out in a given state. Express the reward an agent gets as r(n_j), a special reward for being in state n_j. We want an optimal policy, one which maximizes the (discounted) reward at every node.

Finding the Optimum Policy

One (non-ideal) solution is to search through all policies (randomly) until a good one is discovered. Instead, given a certain policy, one can calculate the value of a node: the reward an agent will get if it starts at that node and follows the policy. If the agent is at n_i and follows the policy to n_j, then it can expect this reward in the long term:

V(n_i) = r(n_i) + gamma V(n_j)

where gamma is the discounting factor, which adds a little long-term goal to the immediate reward.

Value Iteration

The optimum policy then gives us the action that maximizes this reward:

pi*(n_i) = argmax over a of [ r(n_i) + gamma V*(alpha(n_i, a)) ]

If we knew what the values of the nodes were under the optimal policy, then we could easily compute that policy. The problem is that we don't know these values. But we can find them out using value iteration: we start by guessing (randomly is fine) an estimated value V(n) for each node.

Approximating the Estimated Values

Then, when we are at n_i, we pick the action to maximize

r(n_i) + gamma V(alpha(n_i, a))

that is, the best thing given what we currently know. We then update V(n_i) by

V(n_i) <- r(n_i) + gamma V(n_j)

Progressive iterations of this calculation make V(n) a closer and closer approximation to the true value. Intuitively, this is because we replace the estimate with the actual reward we get for the next state (and the next state, and the next state).
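A hedged sketch of value iteration under these definitions, assuming a known reward function r(n), a deterministic transition model alpha(n, a), and a discount factor gamma; all names are illustrative, and the corridor at the end is a toy example only.

```python
# Value iteration: repeatedly replace each estimate with the best backed-up value.

def value_iteration(nodes, actions, alpha, r, gamma=0.9, sweeps=100):
    V = {n: 0.0 for n in nodes}                        # initial guesses (random is also fine)
    for _ in range(sweeps):
        for n in nodes:
            # back up: the best action under the current estimates replaces the old estimate
            V[n] = max(r(n) + gamma * V[alpha(n, a)] for a in actions)
        # repeated sweeps make V(n) a closer and closer approximation to the true values
    return V

# toy example: walk left/right along nodes 0..3, reward only in node 3
nodes = [0, 1, 2, 3]
step = lambda n, a: max(0, min(3, n + (1 if a == 'right' else -1)))
reward = lambda n: 1.0 if n == 3 else 0.0
print(value_iteration(nodes, ['left', 'right'], step, reward))   # approaches {0: 7.29, 1: 8.1, 2: 9.0, 3: 10.0}
```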

Summary This lecture has looked at a number of approaches to learning heuristic functions. We started assuming that the agent knew everything but the heuristic, and progressively relaxed assumptions. This created a battery of reinforcement learning methods that can be applied in a wide variety of situations. These models also tie learning and planning together very closely, and we will revisit them as planning models later in the course.