Autonomous Helicopter Flight via Reinforcement Learning

1 Autonomous Helicopter Flight via Reinforcement Learning Authors: Andrew Y. Ng, H. Jin Kim, Michael I. Jordan, Shankar Sastry Presenters: Shiv Ballianda, Jerrolyn Hebert, Shuiwang Ji, Kenley Malveaux, Huy Pham

2 Presentation Outline Introduction to Reinforcement Learning (Huy) Background Information (Shuiwang) Pegasus Reinforcement Learning Algorithm (Jerri) Using Algorithm for Helicopter Flight (Kenley and Shiv) Conclusions (Shiv)

3 Context of Reinforcement Learning
Reinforcement Learning (RL) sits at the intersection of:
- Artificial Intelligence
- Psychology
- Control Theory and Operations Research
- Neuroscience
- Artificial Neural Networks

4 What is Reinforcement Learning?
- Learning from interaction
- Goal-oriented learning
- Learning about, from, and while interacting with an external environment
- Learning what to do (how to map situations to actions) so as to maximize a numerical reward signal

5 Supervised Learning
- Training info = desired (target) outputs
- Inputs -> Supervised Learning System -> Outputs
- Error = (target output - actual output)

6 Reinforcement Learning
- Training info = evaluations ("rewards" / "penalties")
- Inputs -> RL System -> Outputs ("actions")
- Objective: get as much reward as possible

7 Key Features of RL
- Learner is not told which actions to take
- Trial-and-error search
- Possibility of delayed reward: sacrifice short-term gains for greater long-term gains
- The need to explore and exploit
- Considers the whole problem of a goal-directed agent interacting with an uncertain environment

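8 RL Problem
- The agent receives a state and a reward from the environment and responds with an action
- Interaction sequence: $s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, \ldots$
- Goal: learn to choose actions that maximize $r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots$, where $0 \le \gamma < 1$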

9 Elements of RL
- Set of states S
- Set of actions A
- Actions may be non-deterministic
- Set of reinforcement signals (rewards)
- Rewards may be delayed

10 Example 1: Tic Tac Toe
- State: board
- Action: next move
- Reward: 1 for win, -1 for loss, 0 for draw
- Problem: find $\pi: S \to A$ that maximizes R

11 Example 2: Mobile Robot
- State: location of robot, people
- Action: motion
- Reward: number of happy faces
- Problem: find $\pi: S \to A$ that maximizes R

13 The Agent-Environment Interface
Agent and environment interact at discrete time steps $t = 0, 1, 2, \ldots$
- Agent observes state at step t: $s_t \in S$
- produces action at step t: $a_t \in A(s_t)$
- gets resulting reward $r_{t+1} \in \mathbb{R}$ and resulting next state $s_{t+1}$
Resulting trajectory: $\ldots, s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, a_{t+3}, \ldots$

14 The Agent Learns a Policy
- Policy at step t, $\pi_t$: a mapping from states to action probabilities
- $\pi_t(s, a)$ = probability that $a_t = a$ when $s_t = s$
- Reinforcement learning methods specify how the agent changes its policy as a result of experience.
- Roughly, the agent's goal is to get as much reward as it can over the long run.
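To make the idea of a policy as a table of action probabilities concrete, here is a minimal sketch. The state names, action names, and probabilities are invented for illustration and do not come from the slides.

```python
import random

# Hypothetical policy table: policy[s][a] = probability of taking action a in state s.
policy = {
    "low_battery": {"recharge": 0.9, "search": 0.1},
    "high_battery": {"recharge": 0.0, "search": 1.0},
}

def sample_action(policy, state, rng):
    """Draw one action according to the probabilities pi(state, .)."""
    actions = list(policy[state])
    probs = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=probs, k=1)[0]

rng = random.Random(0)
action = sample_action(policy, "high_battery", rng)
```

A learning method would adjust the numbers in `policy` over time; the sampling step itself stays the same.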

15 Returns
Suppose the sequence of rewards after step t is $r_{t+1}, r_{t+2}, r_{t+3}, \ldots$
What do we want to maximize?

16 Returns
Discounted return:
$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1},$$
where $\gamma$, $0 \le \gamma \le 1$, is the discount rate.
shortsighted: $\gamma \to 0$; farsighted: $\gamma \to 1$
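The discounted return can be computed for a finite reward sequence with a short backward recursion. This is a sketch for illustration; the function name and the sample rewards are assumptions, not part of the presentation.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k] for a finite list of rewards."""
    total = 0.0
    # Work backwards so each step applies R = r + gamma * R_next.
    for r in reversed(rewards):
        total = r + gamma * total
    return total

rewards = [1.0, 1.0, 1.0]
shortsighted = discounted_return(rewards, 0.0)  # only the first reward counts
farsighted = discounted_return(rewards, 1.0)    # every reward counts equally
```

With gamma near 0 the agent values only the immediate reward; with gamma near 1 it weighs distant rewards almost as heavily as near ones.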

17 The Markov Property
What should the state include?
- More than immediate sensations; it may be a complex structure built up over time from the sequence of sensations.
- But not the complete history of past sensations.
- Ideally, a state signal that summarizes past sensations compactly, yet in such a way that all relevant information is retained.

18 The Markov Property
In short, we don't blame an agent for not knowing something that matters, but only for having known something and then forgotten it.

19 The Markov Property
- A state signal that succeeds in retaining all relevant information is said to be Markov, or to have the Markov property.
- For example, the current configuration of the pieces on a checkers board.
- Also referred to as the "independence of path" property.

20 The Markov Property
- The state can include immediate sensations, highly processed sensations, and structures built up over time from sequences of sensations.
- Ideally, a state should summarize past sensations so as to retain all essential information, i.e., it should have the Markov Property:
$$\Pr\{ s_{t+1} = s', r_{t+1} = r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, \ldots, r_1, s_0, a_0 \} = \Pr\{ s_{t+1} = s', r_{t+1} = r \mid s_t, a_t \}$$
for all $s'$, $r$, and histories $s_t, a_t, r_t, s_{t-1}, a_{t-1}, \ldots, r_1, s_0, a_0$.

21 Markov Decision Processes
- If a reinforcement learning task has the Markov Property, it is basically a Markov Decision Process (MDP).
- If state and action sets are finite, it is a finite MDP.
- To define a finite MDP, you need to give:
  - state and action sets
  - one-step dynamics defined by transition probabilities:
$$P^a_{ss'} = \Pr\{ s_{t+1} = s' \mid s_t = s, a_t = a \} \quad \text{for all } s, s' \in S,\ a \in A(s).$$
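A finite MDP's one-step dynamics can be stored as a table of transition probabilities. The sketch below uses a made-up two-state, two-action example; all names and numbers are assumptions chosen for illustration.

```python
import random

states = ["s0", "s1"]
actions = ["stay", "move"]

# P[(s, a)][s2] = Pr{s_{t+1} = s2 | s_t = s, a_t = a}; the values are invented.
P = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 0.8, "s1": 0.2},
}

def is_valid_mdp(P, states, actions):
    """Check that every (s, a) pair has a proper probability distribution."""
    for s in states:
        for a in actions:
            dist = P[(s, a)]
            if any(p < 0 for p in dist.values()):
                return False
            if abs(sum(dist.values()) - 1.0) > 1e-9:
                return False
    return True

def step(P, s, a, rng):
    """Sample the next state from the one-step dynamics."""
    next_states = list(P[(s, a)])
    probs = [P[(s, a)][s2] for s2 in next_states]
    return rng.choices(next_states, weights=probs, k=1)[0]

valid = is_valid_mdp(P, states, actions)
next_state = step(P, "s0", "stay", random.Random(1))
```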

22 Monte Carlo Methods
- Monte Carlo methods require only experience, online or simulated
- Monte Carlo methods are ways of solving RL problems based on averaging sample returns

23 Monte Carlo Policy Evaluation
- Goal: learn $V^\pi(s)$
- Given: some number of transitions under π which contain s
- Idea: average the returns observed after visits to s
- Every-visit MC: average returns for every time s is visited
- First-visit MC: average returns only for the first time s is visited
- Both converge asymptotically
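The difference between first-visit and every-visit averaging can be sketched in a few lines. The episode format used here, a list of (state, reward received after leaving that state) pairs, is an assumption made for this example.

```python
from collections import defaultdict

def mc_evaluate(episodes, gamma, first_visit=True):
    """Estimate V(s) by averaging the returns observed after visits to s."""
    returns = defaultdict(list)
    for episode in episodes:
        # Compute the return following each time step, working backwards.
        G = 0.0
        returns_at = [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):
            _, r = episode[t]
            G = r + gamma * G
            returns_at[t] = G
        seen = set()
        for t, (s, _) in enumerate(episode):
            if first_visit and s in seen:
                continue  # first-visit MC ignores later visits in the same episode
            seen.add(s)
            returns[s].append(returns_at[t])
    return {s: sum(g) / len(g) for s, g in returns.items()}

# One hypothetical episode that visits state "A" twice.
episodes = [[("A", 1.0), ("B", 0.0), ("A", 2.0)]]
v_first = mc_evaluate(episodes, gamma=1.0, first_visit=True)
v_every = mc_evaluate(episodes, gamma=1.0, first_visit=False)
```

On this single episode the two estimates for "A" differ because every-visit MC also averages in the return from the second visit.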

24 Other Applications of Monte Carlo Methods
- Monte Carlo estimation of action values: $Q^\pi(s, a)$ is the average return starting from state s and action a, then following π
- Monte Carlo control: MC policy iteration with a policy improvement step

26 Pegasus Algorithm: Variable Explanation
An MDP is specified by:
- S: state space
- $s_0$: initial state, $s_0 \in S$
- A: action space
- $P_{sa}$: state transition probabilities
- $R: S \to \mathbb{R}$: reward function
- $\gamma$: discount factor
- $\Pi$: family of policies $\pi: S \to A$

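27 Pegasus Algorithm: Policy Discovery
Goal is to find a policy $\pi$ in $\Pi$ that maximizes the utility:
$$U(\pi) = E\left[ R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots \right]$$
- E: the expectation over the random sequence of states visited over time when $\pi$ is executed starting at state $s_0$
- $U(\pi)$: the utility of policy $\pi$, i.e. its expected discounted sum of rewards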

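28 Pegasus Algorithm: Estimating the Utility
Define an estimate $\hat{U}(\pi)$ of $U(\pi)$ via Monte Carlo:
$$\hat{U}(\pi) = \frac{1}{m} \sum_{i=1}^{m} \left[ R(s'^{(i)}_0) + \gamma R(s'^{(i)}_1) + \cdots + \gamma^{H_\varepsilon} R(s'^{(i)}_{H_\varepsilon}) \right]$$
- m: number of scenarios to test
- $s'^{(i)}_k$: the k-th state visited in scenario i
- $H_\varepsilon$: horizon time, a truncation of the number of steps that introduces at most $\varepsilon/2$ error into the approximation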

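29 Pegasus Algorithm: State Transitions
Each scenario i consists of an initial state and a sequence of random numbers: $(s^{(i)}, p^{(i)}_1, p^{(i)}_2, \ldots)$
s' is defined as: $s'^{(i)}_k = g(s'^{(i)}_{k-1}, a, p^{(i)}_k)$, with $s'^{(i)}_0 = s^{(i)}$ and a the action chosen by the policy
- The pseudo-random numbers are drawn once and then reused, which ensures the results are deterministic
- g(s, a, p) transitions deterministically to the next state, which is distributed according to the state transition probabilities $P_{sa}$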
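The scenario idea can be sketched on a toy problem. Everything below (the one-dimensional dynamics in `g`, the reward, the two policies, and the scenario sizes) is invented for illustration and is not the helicopter model; only the structure, fixed random numbers fed to a deterministic transition function, follows the slides.

```python
import random

def g(s, a, p):
    """Hypothetical deterministic transition: p in [0, 1] carries all the randomness."""
    return s + a + 0.2 * (p - 0.5)

def reward(s):
    return -s * s  # toy reward: stay close to the origin

def make_scenarios(m, horizon, seed=0):
    """Draw each scenario (initial state plus random numbers) once, up front."""
    rng = random.Random(seed)
    return [(rng.uniform(-1.0, 1.0), [rng.random() for _ in range(horizon)])
            for _ in range(m)]

def estimate_utility(policy, scenarios, gamma):
    """Average truncated discounted return over the fixed scenarios."""
    total = 0.0
    for s0, ps in scenarios:
        s, discount, ret = s0, 1.0, reward(s0)
        for p in ps:
            s = g(s, policy(s), p)
            discount *= gamma
            ret += discount * reward(s)
        total += ret
    return total / len(scenarios)

def do_nothing(s):
    return 0.0

def push_to_origin(s):
    return -0.8 * s

scenarios = make_scenarios(m=30, horizon=50)
u_idle = estimate_utility(do_nothing, scenarios, gamma=0.95)
u_push = estimate_utility(push_to_origin, scenarios, gamma=0.95)
```

Because the scenarios are reused, evaluating the same policy twice gives exactly the same number, so policies can be compared and optimized without sampling noise between evaluations.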

30 Pegasus Algorithm: Truncation Error
To avoid an infinite sum, a truncation point must be selected. $H_\varepsilon$ is defined as:
$$H_\varepsilon = \log_\gamma \frac{\varepsilon (1 - \gamma)}{2 R_{\max}}$$
This introduces at most $\varepsilon/2$ error into the approximation.
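A quick numerical check of the horizon formula, as a sketch. Rounding up to a whole number of steps and the sample values of ε, γ, and R_max are assumptions made for this example.

```python
import math

def horizon(epsilon, gamma, r_max):
    """H_eps = log_gamma(eps * (1 - gamma) / (2 * R_max)), rounded up to a whole step."""
    return math.ceil(math.log(epsilon * (1.0 - gamma) / (2.0 * r_max), gamma))

def tail_bound(h, gamma, r_max):
    """Upper bound on the discounted reward ignored after step h."""
    return gamma ** (h + 1) * r_max / (1.0 - gamma)

H = horizon(epsilon=0.1, gamma=0.9, r_max=1.0)
ignored = tail_bound(H, gamma=0.9, r_max=1.0)
```

The bound follows from the geometric series: everything after step H is worth at most $\gamma^{H+1} R_{\max} / (1 - \gamma)$, which the chosen H keeps below $\varepsilon/2$.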

31 Pegasus Algorithm: Û Estimation Error
- The number of scenarios m needs to be at most polynomial in all quantities of interest
- If Û is averaged over that many samples, then with high probability Û is a uniformly good estimate of U: $\hat{U}(\pi) \approx U(\pi)$ for all $\pi \in \Pi$

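32 Pegasus Algorithm: Example Reward Function
The reward calculation for hovering is based on where the helicopter currently is and where it is supposed to be hovering.
Reward function for hovering:
$$R(s) = -\left( \alpha_x (x - x')^2 + \alpha_y (y - y')^2 + \alpha_z (z - z')^2 + \alpha_w (w - w')^2 \right) + R_a$$
$$R_a = -\left( \alpha_{\dot{x}} \dot{x}^2 + \alpha_{\dot{y}} \dot{y}^2 + \alpha_{\dot{z}} \dot{z}^2 + \alpha_{\dot{w}} \dot{w}^2 \right)$$
- (x', y', z', w'): desired hovering position and orientation
- $\alpha_k$: weights chosen to scale the terms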
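A quadratic hovering reward of this kind might be coded as below. This is a sketch under stated assumptions: every term is taken to be a weighted squared error that enters with a negative sign, and the dictionary keys (`xd` for the x velocity, and so on) and the unit weights are made up for the example.

```python
POSITION_KEYS = ("x", "y", "z", "w")      # position and yaw orientation
VELOCITY_KEYS = ("xd", "yd", "zd", "wd")  # hypothetical names for the rates

def hover_reward(state, target, weights):
    """Negative weighted squared distance from the hover target, plus a velocity penalty."""
    position_cost = sum(weights[k] * (state[k] - target[k]) ** 2 for k in POSITION_KEYS)
    velocity_cost = sum(weights[k] * state[k] ** 2 for k in VELOCITY_KEYS)
    return -(position_cost + velocity_cost)

weights = {k: 1.0 for k in POSITION_KEYS + VELOCITY_KEYS}
target = {"x": 0.0, "y": 0.0, "z": 5.0, "w": 0.0}
at_target = dict(target, xd=0.0, yd=0.0, zd=0.0, wd=0.0)
off_target = dict(at_target, x=2.0)

r_best = hover_reward(at_target, target, weights)
r_worse = hover_reward(off_target, target, weights)
```

The reward is highest (zero) when the helicopter sits motionless at the target and falls off quadratically with position error and speed.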

34 Helicopter Control
- Longitudinal and latitudinal pitch control: controls the angle of the plane that the helicopter's rotors spin in
- Main rotor blade pitch control: controls the thrust of the main rotor
- Tail rotor blade pitch control: controls the thrust of the tail rotor

35 Yamaha R-50 Helicopter: Helicopter Instrumentation
- Inertial Navigation System (INS)
  - 3 orthogonally positioned (x, y, z) accelerometers
  - 3 orthogonally positioned (x, y, z) rate gyroscopes
- Differential GPS system with a ground station: gives position estimates accurate to within 2 cm
- Digital compass

36 On-board Sensor Information
Sensor data is provided at 50 Hz after integration through a Kalman filter:
- Position $(x, y, z)$
- Velocity $(\dot{x}, \dot{y}, \dot{z})$
- Orientation: $\Phi$ (roll), $\Theta$ (pitch), $\omega$ (yaw)
- Angular velocities: $\dot{\Phi}$ (roll rate), $\dot{\Theta}$ (pitch rate), $\dot{\omega}$ (yaw rate)

37 Helicopter Modeling: Model Development
- Flight data from a human pilot was recorded to obtain the 12-dimensional helicopter state and 4-dimensional control measurements used for model fitting
- To exploit the state and control symmetries found when operating a helicopter facing in different directions, a body coordinate system was developed to replace the external coordinate system
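The symmetry argument can be illustrated with a planar example: a velocity expressed in body coordinates looks the same whichever way the helicopter faces. This sketch rotates by yaw only and ignores roll and pitch, which is a simplifying assumption; the function name is hypothetical.

```python
import math

def world_to_body(vx, vy, yaw):
    """Rotate a horizontal world-frame velocity into the helicopter's body frame."""
    forward = math.cos(yaw) * vx + math.sin(yaw) * vy
    sideways = -math.sin(yaw) * vx + math.cos(yaw) * vy
    return forward, sideways

# Flying straight ahead at 1 m/s while facing along +x, and while facing along +y.
facing_x = world_to_body(1.0, 0.0, yaw=0.0)
facing_y = world_to_body(0.0, 1.0, yaw=math.pi / 2)
```

Both cases map to (approximately) the same body-frame velocity, so data recorded while flying in one direction can inform the model for every direction.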

38 Position Prediction: Model Fitting
- Locally weighted linear regression, with predictions of the form $y = \theta^T x$
- A set of input and output vectors was used to create a prediction system
- This linear regression equation was designed to predict the output state of the helicopter given a control input vector
$$\theta = (X^T W X)^{-1} X^T W y, \qquad W_{ii} = \exp\left( -\tfrac{1}{2} (x - x_i)^T \Sigma^{-1} (x - x_i) \right)$$
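Locally weighted linear regression can be sketched in plain Python for a scalar output. The assumptions here are labeled: Σ is taken to be `bandwidth**2` times the identity, the data is a made-up line rather than flight data, and the small Gaussian-elimination helper stands in for a linear algebra library.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def lwr_predict(X, y, query, bandwidth):
    """Fit theta around `query` with Gaussian weights, then return theta^T query."""
    n, d = len(X), len(query)
    # W_ii = exp(-0.5 * ||query - x_i||^2 / bandwidth^2)
    w = [math.exp(-0.5 * sum((q - xi) ** 2 for q, xi in zip(query, row)) / bandwidth ** 2)
         for row in X]
    # Weighted normal equations: (X^T W X) theta = X^T W y
    A = [[sum(w[i] * X[i][a] * X[i][b] for i in range(n)) for b in range(d)]
         for a in range(d)]
    rhs = [sum(w[i] * X[i][a] * y[i] for i in range(n)) for a in range(d)]
    theta = solve(A, rhs)
    return sum(t * q for t, q in zip(theta, query))

# Toy data on the line y = 2x + 1; the leading 1.0 is a constant feature for the intercept.
X = [[1.0, float(x)] for x in range(5)]
y = [2.0 * row[1] + 1.0 for row in X]
prediction = lwr_predict(X, y, query=[1.0, 2.5], bandwidth=1.0)
```

A new θ is fitted for every query point, which lets the model follow nonlinear dynamics while each individual fit stays linear.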

39 Predicted Position vs. Actual State

41 Previous Attempts
- $\mu$-synthesis
- $H_2$
- $H_\infty$

42 Flying Competition Maneuvers

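43 Reward Functions: Trajectory Following
$$R(s) = -\left( \alpha_x (x - x')^2 + \alpha_y (y - y')^2 + \alpha_z (z - z')^2 + \alpha_w (w - w')^2 \right) + R_a$$
$$R_a = -\left( \alpha_{\dot{x}} \dot{x}^2 + \alpha_{\dot{y}} \dot{y}^2 + \alpha_{\dot{z}} \dot{z}^2 + \alpha_{\dot{w}} \dot{w}^2 \right)$$
- (x', y', z', w'): desired hovering position and orientation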

44 Success

45 Conclusion
Pros:
- Very flexible model
- Performs better than other approaches
- Often converges to the optimum
- Real-world applications
Cons:
- The paper is not very detailed
- Augmentation is not explained in detail

46 Q & A Thank You

47 References
- Ng, Andrew, et al., "Autonomous Helicopter Flight via Reinforcement Learning"
- Ng, Andrew and Michael Jordan, "Pegasus: A Policy Search Method for Large MDPs and POMDPs"
- RLAIcourse.html
