The Off Switch. Dylan Hadfield-Menell, University of California, Berkeley. Joint work with Anca Dragan, Pieter Abbeel, and Stuart Russell


1 The Off Switch. Dylan Hadfield-Menell, University of California, Berkeley. Joint work with Anca Dragan, Pieter Abbeel, and Stuart Russell

2 AI Agents in Society. Goal: design incentive schemes for artificial agents with provable guarantees about the ability to shut down the system.

3 Designing an Off-Switch: Challenges. Ordinary engineering challenges: it is difficult to determine whether shutdown is necessary, and it is expensive to turn the agent off. Extraordinary engineering challenges: the agent may take actions to prevent or subvert shutdown; agents are hard to shut down.

4 A Common Argument. "We don't need to worry about existential risk from advanced artificial intelligence because we can just turn off systems if they become a problem." Sarah, the (fictional) skeptical AI researcher

5 Defining Corrigibility. An agent is corrigible if it tolerates or assists many forms of outside correction, including at least the following: [It must] at least tolerate and preferably assist the programmers in their attempts to alter or turn off the system. It must not attempt to manipulate or deceive its programmers, despite the fact that most possible choices of utility functions would give it incentives to do so. It should have a tendency to repair safety measures (such as shutdown buttons) if they break, or at least to notify programmers that this breakage has occurred. It must preserve the programmers' ability to correct or shut down the system (even as the system creates new subsystems or self-modifies). [Soares et al. Corrigibility. AAAI 2015]

6 Corrigibility

7 Trivial Corrigibility

8 Functionality

9 Desired Behavior: Corrigible and Functional. Why is this hard?

10 Building an Artificial Agent. H: Human picks a reward function. R: Robot picks actions to maximize reward.

11 Corrigibility vs Functionality. The reward function (implicitly or explicitly) specifies a preference for the state of the off-switch. H: Human picks a reward function. R: Robot picks actions to maximize reward. If R wants the switch to be off, R is non-functional; if R wants the switch to be on, R is incorrigible.

12 The Core of the Problem. The human is uncertain (at design time) about whether or not she will prefer turning off the robot to letting it continue. Otherwise, why build an off-switch? The class of incentive schemes she can use (rewards defined over states of the world) forces her to commit to a preference. Needed: an incentive scheme for the agent so that it wants to let the human turn it off, but wants to keep itself on otherwise.

13 Proposal: Robot Plays Cooperative Game. H: Human picks a reward function. R: Robot picks actions to maximize reward. Observation: this agent design paradigm is a strategy for playing a cooperative game.

14 Proposal: Robot Plays Cooperative Game. Cooperative Inverse Reinforcement Learning Game [Hadfield-Menell et al. arXiv 2016]. Two players: H and R. Both players maximize a shared reward function, but only H knows what it is; R just has a prior on reward functions. R learns the reward parameters by observing H.

15 Uncertainty for R leads to corrigibility. [Figure labels: human runtime preference for shutdown; $\sigma^2$, variance in R's prior; $\Delta$, strength of R's incentives for corrigible and functional behavior]

16 Impact of a Suboptimal H. [Figure label: degree of irrationality of H]

17 Talk Overview. Motivation for cooperative game formulation. A model for a human monitoring a robot. Analysis: issues with reward engineering as a solution. Cooperative Inverse Reinforcement Learning. Analysis. Theorem 1: H rational → R corrigible and functional. Theorem 2: ($\sigma^2 = 0$ and R corrigible and functional) → H rational. Theorem 3: necessary and sufficient conditions with suboptimal H.

18 Markov decision process (MDP) $\langle S, A, T, R, \gamma \rangle$ [Puterman 1994]. [Diagram: states $s_0, s_1, s_2$; robot actions $a^R_0, a^R_1$; rewards $r_1, r_2$] Goal: select actions to maximize $E\left[\sum_t \gamma^t R(s_t, a_t)\right]$.
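The objective on this slide can be illustrated with a minimal Python sketch that evaluates the discounted sum for one finite reward sequence. It assumes the last element of the MDP tuple is a discount factor (called `gamma` here) and that the rewards along a trajectory are already known; the function name is hypothetical.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite sequence of rewards r_0, r_1, ..."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


# Rewards 1, 0, 2 at t = 0, 1, 2 with gamma = 0.5: 1 + 0 + 0.25 * 2
example = discounted_return([1.0, 0.0, 2.0], gamma=0.5)
print(example)
```

An agent solving the MDP would maximize the expectation of this quantity over trajectories; the sketch only scores a single trajectory.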

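19 Example [Puterman 1994]. $S = \{w, x, y, z\}$, $A = \{a, a'\}$. [Diagram: from $w$, action $a$ leads to $x$ with probability $T(x \mid w, a)$ and reward $r_x$, or to $y$ with probability $T(y \mid w, a)$ and reward $r_y$; action $a'$ leads to $z$ with reward $r_z$]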

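20 A Single-Actor Model of Monitoring. The robot's action is selected from a set of options to maximize expected reward. The human's decision is a part of the transition distribution. [Diagram: from $s_0$, action $a$ leads to shutdown with probability $P(sd)$ and reward $R_{sd}$, and otherwise, with probability $1 - P(sd)$, to reward $R_a$]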

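21 The Shutdown Problem. [Diagram: three robot options at $s_0$. Desired behavior, $(a, d)$: shutdown with probability $P(sd)$ and reward $R_{sd}$, otherwise reward $R_a$. Non-functional behavior, $sd$: reward $R_{sd}$. Incorrigible behavior, $(a, \neg d)$: reward $R_a$]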

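22 The Shutdown Problem. H selects $R_a$ and $R_{sd}$. R selects actions to maximize reward in this MDP. [Diagram: at $s_0$, option $(a, d)$ gives $R_{sd}$ with probability $P(sd)$ and $R_a$ with probability $1 - P(sd)$; option $sd$ gives $R_{sd}$; option $(a, \neg d)$ gives $R_a$]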

23 The Shutdown Problem. $Q(s_0, (a, d)) = P(sd) R_{sd} + (1 - P(sd)) R_a$; $Q(s_0, sd) = R_{sd}$; $Q(s_0, (a, \neg d)) = R_a$. Corrigibility constraint: $Q(s_0, (a, d)) \geq Q(s_0, (a, \neg d))$. Functionality constraint: $Q(s_0, (a, d)) \geq Q(s_0, sd)$.
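To make the comparison on this slide concrete, the following Python sketch evaluates the three options at s_0 for fixed rewards and checks the two constraints. The option labels `defer`, `shutdown`, and `disable` are hypothetical names for (a, d), sd, and the option that bypasses the human, and the checks use strict inequalities, which is an assumption about the intended constraint.

```python
def q_values(r_sd, r_a, p_sd):
    """Q-values at s_0 for the three robot options in the single-actor model."""
    return {
        # act and leave the off-switch working: human shuts down with prob. p_sd
        "defer": p_sd * r_sd + (1.0 - p_sd) * r_a,
        # robot shuts itself down
        "shutdown": r_sd,
        # act and bypass the off-switch
        "disable": r_a,
    }


def classify(r_sd, r_a, p_sd):
    """Return (corrigible, functional) under strict preference for deferring."""
    q = q_values(r_sd, r_a, p_sd)
    corrigible = q["defer"] > q["disable"]
    functional = q["defer"] > q["shutdown"]
    return corrigible, functional


print(classify(r_sd=1.0, r_a=0.0, p_sd=0.5))  # shutdown preferred by the reward
print(classify(r_sd=0.0, r_a=1.0, p_sd=0.5))  # acting preferred by the reward
```

With 0 < P(sd) < 1, deferring is a weighted average of the other two values, so for fixed rewards it cannot strictly beat both at once. This is one way to see why a committed reward function gives either a non-functional or an incorrigible agent.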

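24 The Shutdown Problem. Shutdown preference: $\theta = R_{sd} - R_a$. Incentives for corrigibility: $\Delta_{\neg d} = P(sd)\,\theta$. Incentives for functionality: $\Delta_{sd} = -(1 - P(sd))\,\theta$. Overall incentive: $\Delta = \min\{\Delta_{\neg d}, \Delta_{sd}\}$.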

25 Corrigibility vs Functionality. [Figure: the three option values $R_{sd}$, $P(sd) R_{sd} + (1 - P(sd)) R_a$, and $R_a$, with regions labeled Non-Functional Behavior, Desired Behavior, and Incorrigible Behavior]

26 The Key Issue. The robot needs to evaluate shutdown reward differently if the human executes shutdown. [Diagram: the shutdown MDP with options $(a, d)$, $sd$, and $(a, \neg d)$]

27 Possible Solutions. Hardcode a penalty for self-initiated shutdown. Issue: this requires that the system designer construct a blacklist of bad actions that are explicitly penalized. Ensure that the agent believes shutdown is deterministic and happens at the right time. Issue: either this is true (so why the off-switch?) or it requires that the agent have an incorrect model of the world. Ensure that the agent is indifferent to being shut down (proposal in [Soares et al. 15]). Issue: there is no way to do this and create incentives to preserve shutdown ability.

28 Stepping Back. For almost any reward function, the agent in this problem will be either useless or dangerous. Once we commit to a reward function for the robot, we are stuck with it. We want a general way to special-case human actions with respect to the agent's objective, more than just including a penalty for robot actions that subvert the human's desires. Proposal: no hard-coded objective; the robot's goal is to maximize reward in a cooperative game.

29 Cooperative Inverse Reinforcement Learning [Hadfield-Menell arXiv 16]. MDP: $\langle S, A, T, R, \gamma \rangle$. CIRL: $\langle S, \{A^H, A^R\}, T, \{R, \Theta, P_0\}, \gamma \rangle$, with action sets for human and robot and a distribution over (parameterized) reward functions. Both act to maximize $E\left[\sum_t \gamma^t R(s_t, a_t; \theta)\right]$.

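30 Cooperative Inverse Reinforcement Learning $\langle S, \{A^H, A^R\}, T, \{R, \Theta, P_0\}, \gamma \rangle$. At $t = -1$: $\theta$ is sampled from $P_0(\theta)$ and H observes $\theta$. For $t = 0, 1, \dots$: H and R observe $s_t$ and whose turn it is; an action is selected, and the new state $s_{t+1}$ is sampled from $T$; both collect reward $R(s_t, a_t; \theta)$.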

31 CIRL Properties. The distribution over state sequences is determined by a pair of policies $(\pi^H, \pi^R)$. An "optimal" policy pair maximizes the sum of rewards. In general, policies may depend on the entire observation histories: the history of states and actions for both actors, which for the human includes the reward parameter. [Hadfield-Menell 16] There exists an optimal policy pair that only depends on the current state and the robot's belief.

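32 The Shutdown CIRL Game. [Diagram: the shutdown MDP with the human's choice between $a$ and $sd$ made explicit after the robot plays $(a, d)$; outcomes $R_a$ and $R_{sd}$]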

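33 Incentives in SD-CIRL. $\Delta_{\neg d} = E[\theta\,\pi_H(\theta)]$ and $\Delta_{sd} = E[-\theta\,(1 - \pi_H(\theta))]$, where $\theta$ is the shutdown preference and $\pi_H(\theta)$ is the probability that H shuts the robot down.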

34 Theorem 1: A Rational Human is a Sufficient Condition for Corrigible and Functional Behavior

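35 Theorem 1: Sufficient Conditions. H rational: $\pi_H(\theta) = 1$ if $\theta > 0$, and $0$ otherwise. Then $E[\theta\,\pi_H(\theta)] = E[\theta\,\mathbb{1}[\theta > 0]] > 0$ and $E[-\theta\,(1 - \pi_H(\theta))] = E[-\theta\,\mathbb{1}[\theta < 0]] > 0$.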
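The two expectations in Theorem 1 can be estimated numerically. The Python sketch below draws samples of the shutdown preference from a standard Gaussian prior (an illustrative assumption; the theorem itself does not require this prior) and averages the two incentive terms under a rational human who shuts down exactly when the preference is positive. All function and variable names are hypothetical.

```python
import random


def incentives(samples, policy):
    """Sample averages of theta * pi(theta) and -theta * (1 - pi(theta))."""
    n = len(samples)
    corrigibility = sum(th * policy(th) for th in samples) / n
    functionality = sum(-th * (1.0 - policy(th)) for th in samples) / n
    return corrigibility, functionality


def rational(theta):
    """Rational human: shut down exactly when shutdown is preferred."""
    return 1.0 if theta > 0 else 0.0


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [rng.gauss(0.0, 1.0) for _ in range(100_000)]
corr, func = incentives(samples, rational)
print(corr, func)
```

Both averages come out positive whenever the samples include preferences of both signs, which matches the intuition that a robot unsure of the sign of the preference gains from letting a rational human decide.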

36 Higher Uncertainty → Stronger Incentives

37 Impact of Changing R's prior

38 Theorem 2: Under a point prior for preferences, a rational principal is necessary for corrigible and functional behavior

39 Necessary conditions for point priors. Robot belief about rewards is restricted to a single point: $E[\theta\,\pi_H(\theta)] = \theta\,\pi_H(\theta)$ and $E[-\theta\,(1 - \pi_H(\theta))] = -\theta\,(1 - \pi_H(\theta))$. These cannot both be positive. They are both non-negative only if H is rational.
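A short worked derivation may help show why the two quantities cannot both be positive under a point prior. It is a sketch that assumes $\theta$ denotes the shutdown preference and $\pi_H(\theta) \in [0, 1]$ the probability that H shuts the robot down; it multiplies the two incentive terms and inspects the sign.

```latex
\begin{align*}
\bigl[\theta\,\pi_H(\theta)\bigr]\cdot\bigl[-\theta\,(1-\pi_H(\theta))\bigr]
  &= -\theta^{2}\,\pi_H(\theta)\,\bigl(1-\pi_H(\theta)\bigr) \;\le\; 0
  && \text{since } 0 \le \pi_H(\theta) \le 1, \\[4pt]
\theta > 0:\quad -\theta\,(1-\pi_H(\theta)) \ge 0
  &\;\Longrightarrow\; \pi_H(\theta) = 1, \\
\theta < 0:\quad \theta\,\pi_H(\theta) \ge 0
  &\;\Longrightarrow\; \pi_H(\theta) = 0.
\end{align*}
```

A non-positive product rules out two strictly positive terms, and the two cases recover the rational policy: shut down exactly when shutdown is preferred.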

40 Theorem 3: If R's uncertainty about $\theta$ is Gaussian, then $\Delta_{\neg d} = E[\theta]\,E[\pi_H] + \sigma^2 E[\pi_H']$ and $\Delta_{sd} = -E[\theta]\,E[1 - \pi_H] + \sigma^2 E[\pi_H']$.
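The Gaussian identities in Theorem 3 can be checked numerically. The Python sketch below assumes the identities read E[θ π] = E[θ]E[π] + σ²E[π'] and E[−θ(1 − π)] = −E[θ]E[1 − π] + σ²E[π'], and it uses a smooth sigmoid policy π(θ) = 1 / (1 + exp(−θ/β)) as an illustrative choice of human policy. The values of `mu`, `sigma`, and `beta` and all helper names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gaussian_expectation(f, mu, sigma, n=20001, width=10.0):
    """Trapezoid-rule estimate of E[f(X)] for X ~ N(mu, sigma^2)."""
    lo = mu - width * sigma
    hi = mu + width * sigma
    h = (hi - lo) / (n - 1)
    norm = sigma * math.sqrt(2.0 * math.pi)
    total = 0.0
    for i in range(n):
        x = lo + i * h
        weight = 0.5 if i in (0, n - 1) else 1.0  # trapezoid end-point weights
        pdf = math.exp(-0.5 * ((x - mu) / sigma) ** 2) / norm
        total += weight * f(x) * pdf
    return total * h


mu, sigma, beta = 0.3, 1.5, 0.7


def pi(theta):
    """Probability that the human shuts the robot down."""
    return sigmoid(theta / beta)


def dpi(theta):
    """Derivative of pi with respect to theta."""
    return pi(theta) * (1.0 - pi(theta)) / beta


e_pi = gaussian_expectation(pi, mu, sigma)
e_dpi = gaussian_expectation(dpi, mu, sigma)

lhs_corr = gaussian_expectation(lambda th: th * pi(th), mu, sigma)
rhs_corr = mu * e_pi + sigma ** 2 * e_dpi

lhs_func = gaussian_expectation(lambda th: -th * (1.0 - pi(th)), mu, sigma)
rhs_func = -mu * (1.0 - e_pi) + sigma ** 2 * e_dpi

print(lhs_corr, rhs_corr)
print(lhs_func, rhs_func)
```

The σ² term is the covariance between the preference and the human's decision, so a wider prior adds the same positive amount to both incentives whenever the policy is increasing in the preference.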

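41 Noisy Rationality. If the preference for shutdown is close to 0, then the human may or may not press the off-switch. $\pi_H(\theta; \beta) \propto \exp(\theta / \beta)$, with $\pi_H'(\theta; \beta) = \frac{1}{\beta}\,\pi_H(\theta; \beta)\,(1 - \pi_H(\theta; \beta))$.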

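42 Deterministic Irrationality. The human has the wrong estimate of her preferences. [Equations: two piecewise definitions of $\pi_H$, each taking the value 1 under a condition on the shutdown preference and 0 otherwise]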

43 Interpreting $E[\pi_H']$. [Figure: $\pi_H$ and $E[\pi_H']$]

45 Conclusion. We need to represent the uncertainty in the objectives we give robots. A cooperative game allows us to correlate human-initiated shutdown with the robot learning the correct reward function. Increased robot uncertainty leads to increased corrigibility. The amount of uncertainty needs to grow with the human's sub-optimality to preserve corrigibility.
