Games Against Nature


Advanced Course in Machine Learning                                                    Spring 2010

Handouts are jointly prepared by Shie Mannor and Shai Shalev-Shwartz

In the previous lectures we talked about experts in different setups and analyzed the regret of the algorithm by comparing its performance to the performance of the best fixed expert (and later the best shifting expert). In this lecture we consider the game theory connection and present games against Nature. Along the way, we present one of the most common tools to analyze prediction problems: approachability theory. The setup in today's lecture is that of full information. The next lecture will be devoted to the partial information setup. We start from a more general model for the game and then show how to apply it to different online learning setups.

1 The Model

The model is comprised of a single player playing against Nature. The game is repeated in time, and at stage $t$ the decision maker has to choose an action $a_t \in A$ and Nature chooses (simultaneously) an action $b_t \in B$. As a result the decision maker obtains a reward $r_t \sim R(a_t, b_t)$ (that is, the reward can be stochastic: we will only need finite second moments). The game continues ad infinitum. We let the average reward be denoted by
$$\hat{r}_t = \frac{1}{t} \sum_{\tau=1}^{t} r_\tau.$$
Note: There is no reward for Nature, therefore this is not a game in the standard sense of the word (or, one can say this is a zero-sum game). The decision maker keeps track of the rewards and of Nature's actions. We consider the empirical frequency of Nature's actions as:
$$\hat{q}_t(b) = \frac{1}{t} \sum_{\tau=1}^{t} \mathbb{1}\{b_\tau = b\}$$
and note that $\hat{q}_t \in \Delta(B)$, the set of distributions over $B$.

1.1 The stationary case

If Nature is stationary (i.e., the actions are generated from an IID source $q^*$) then:
$$\hat{q}_t \to q^* \quad \text{a.s.}$$
(In fact, we have exponentially fast convergence: $\Pr(\|\hat{q}_t - q^*\| > \epsilon) \le C \exp(-C' t \epsilon^2)$.) In that case, one can hope to obtain a reward as high as the best-response reward:
$$r^*(q) = \max_{p \in \Delta(A)} \sum_{a,b} p(a)\, q(b)\, r(a, b) = \max_{a \in A} \sum_{b} q(b)\, r(a, b).$$
By "obtaining" we mean:
$$\hat{r}_t - r^*(q^*) \to 0 \quad \text{a.s.}$$
Here is a simple fictitious play algorithm that obtains that:

1. Observe $b_t$ and form an estimate: $\hat{q}_t(b) = \frac{1}{t} \sum_{\tau=1}^{t} \mathbb{1}\{b_\tau = b\}$.

2. Play $a_{t+1} \in \arg\max_{a} r(a, \hat{q}_t)$.

This algorithm is also based on the celebrated certainty equivalence scheme.

Theorem 1 The Fictitious Play algorithm satisfies that $\hat{r}_t - r^*(q^*) \to 0$ a.s.

But what happens if Nature is not stationary?

1.2 Arbitrary source

Suppose now that the sequence $b_1, b_2, \ldots$ is generated by an arbitrary process. Arbitrary here means not necessarily stochastic. Clearly, we cannot assume that $\hat{q}_t$ converges. Our objective of having the average reward converge to $r^*(q^*)$ is not well defined anymore since $q^*$ may not exist. We can define the average regret as:
$$R_t = r^*(\hat{q}_t) - \hat{r}_t.$$
This is a random variable. Randomness is determined by randomness in the algorithm. The basic question is therefore: Can we find an algorithm such that $\limsup_{t \to \infty} R_t \le 0$ a.s.? If such an algorithm exists we call it 0-regret (we will later call such an algorithm 0 external regret, but this is sufficient for now). This is, of course, the same notion from the previous two lectures where we consider the average regret as opposed to the cumulative regret.

Nature models.

1. Oblivious. Nature writes down the sequence $b_1, b_2, \ldots$ at time 0 (not disclosing it).

2. Non-oblivious. Nature is adversarial and it tries to maximize the regret. Nature may even be aware of any randomization the decision maker does (but not the value of private coin tosses).

Observations:

1. A non-oblivious opponent is a very strong model: it encompasses a worst-case view on disturbances in many systems and it generalizes play against an adversary.

2. Fictitious play would fail since randomization is needed. Fictitious play is called here follow the leader (FL).

3. If the leader does not change (asymptotically), FL does have 0 regret. More interestingly, as long as there are not many switches, FL works. More precisely, we say that FL switches from action $a$ to $a'$ at time $t$ if $a_{t-1} = a$ and $a_t = a'$. We let the number of switches up to time $t$ be $N_t$. We say that FL exhibits infrequent switches along a history if for every $\epsilon > 0$ there exists $T$ such that $N_t / t < \epsilon$ for all $t \ge T$.

Theorem 2 If FL exhibits infrequent switches along a history, it satisfies $\limsup_{t \to \infty} R_t \le 0$ along that history.

Proof: Home exercise. (Note that we do not use almost sure quantifiers since clearly FL is not optimal for every history.)
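The following is a minimal runnable sketch of the fictitious play / follow-the-leader scheme above, tracking the average regret $R_t = r^*(\hat{q}_t) - \hat{r}_t$. The reward matrix, the horizon, and the IID source for Nature are assumptions made up for the illustration.

```python
import numpy as np

# Minimal sketch of fictitious play / follow the leader.  The reward matrix,
# horizon and Nature's IID source below are made-up assumptions for the demo.
rng = np.random.default_rng(0)
A, B = 3, 4                                   # action-set sizes for the player / Nature
r = rng.uniform(size=(A, B))                  # reward r(a, b)

T = 5000
nature = rng.choice(B, size=T, p=[0.5, 0.2, 0.2, 0.1])   # a stationary source q*

counts = np.zeros(B)                          # running counts of Nature's actions
total_reward = 0.0
a = 0                                         # arbitrary first action

for t in range(T):
    b = nature[t]
    total_reward += r[a, b]
    counts[b] += 1
    q_hat = counts / counts.sum()             # empirical frequency of Nature's actions
    a = int(np.argmax(r @ q_hat))             # certainty-equivalence / FL step

q_hat = counts / T
avg_regret = (r @ q_hat).max() - total_reward / T     # r*(q_hat_T) - average reward
print(f"average regret after {T} rounds: {avg_regret:.4f}")
```

Against this stationary source the average regret vanishes, in line with Theorem 1; against a carefully chosen non-stationary sequence the same deterministic rule can be forced to switch often and fail, which is why randomization is needed in the adversarial case.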

1.3 A generalized notion of regret

In general, regret can be defined as the difference between the obtained (cumulative) reward and the reward that would have been obtained by the best strategy in a reference set. That is:
$$R_t = \sup_{\text{strategy } \sigma} r(\sigma, \text{history}) - \hat{r}_t,$$
where $r(\sigma, \text{history})$ is an estimate of the average reward if playing $\sigma$. This is not always well defined or achievable. In the example above, the set of strategies is simply the set of stationary strategies. One can easily think of other sets of strategies, such as the set of strategies that depend on the last observation from Nature. In that case the set of strategies is identified with a conditional distribution $p(a \mid b_{t-1})$, an element of $\Delta(A)^{B}$, and the reward as a function of history is defined as:
$$r(\sigma, \text{history}) = \frac{1}{t} \sum_{\tau=1}^{t} \sum_{a} p(a \mid b_{\tau-1})\, r(a, b_\tau),$$
where $b_0$ is defined as one of the members of $B$. We observe that this comparison class is richer than the comparison class we considered above, which can be identified with $p(a) \in \Delta(A)$. We will show later that there is an asymptotically 0-regret strategy against this particular comparison class.
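As a small illustration of the richer comparison class of Section 1.3, the sketch below computes, for a given history, both the stationary benchmark $r^*(\hat{q}_t)$ and the last-observation benchmark; the latter decomposes into one best action per value of the previous observation. The reward matrix and the history are assumptions made for the demo.

```python
import numpy as np

# Compare the stationary benchmark with the richer "last observation"
# benchmark for a given history.  Reward matrix and history are made up.
rng = np.random.default_rng(1)
A, B = 3, 4
r = rng.uniform(size=(A, B))                  # reward r(a, b)
history = rng.choice(B, size=2000)            # Nature's actions b_1, ..., b_T
T = len(history)
b0 = 0                                        # b_0 is fixed to some member of B

# Best stationary benchmark: r*(q_hat_T).
q_hat = np.bincount(history, minlength=B) / T
stationary_benchmark = (r @ q_hat).max()

# Best last-observation benchmark: for each value of the previous observation,
# pick the action maximizing the summed reward over the rounds that follow it.
prev = np.concatenate(([b0], history[:-1]))
last_obs_benchmark = 0.0
for v in range(B):
    mask = prev == v
    if mask.any():
        last_obs_benchmark += r[:, history[mask]].sum(axis=1).max()
last_obs_benchmark /= T

print(f"stationary benchmark:       {stationary_benchmark:.4f}")
print(f"last-observation benchmark: {last_obs_benchmark:.4f}  (always >= stationary)")
```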

2 Blackwell's Approachability

We now introduce a useful tool in the analysis of repeated games against Nature called Blackwell's approachability theory. Let us define a vector-valued two-player game. We call the players P1 and P2 to distinguish them from the decision makers above. We consider a two-player vector-valued repeated game where both P1 and P2 choose actions as before from finite sets $A$ and $B$. The reward is now a $k$-dimensional vector, $m(a, b) \in \mathbb{R}^k$. As before, the stage game reward is $m_t \sim m(a_t, b_t)$ (the reward can be a random vector). The average reward is $\hat{m}_t = \frac{1}{t} \sum_{\tau=1}^{t} m_\tau$. P1's task is to approach a target set $T$, namely to ensure convergence of the average reward vector to this set irrespectively of P2's actions. Formally, let $T \subseteq \mathbb{R}^k$ denote the target set. In the following, $d$ is the Euclidean distance in $\mathbb{R}^k$. The set-to-point distance between a point $x$ and a set $T$ is $d(x, T) = \inf_{y \in T} d(x, y)$. (We let $P_{\pi,\sigma}$ denote the probability measure when P1 plays the policy $\pi$ and P2 plays policy $\sigma$.)

Definition 1 A policy $\pi^*$ of P1 approaches a set $T \subseteq \mathbb{R}^k$ if
$$\lim_{n \to \infty} d(\hat{m}_n, T) = 0 \quad P_{\pi^*,\sigma}\text{-a.s., for every } \sigma \in \Sigma.$$
A policy $\sigma^* \in \Sigma$ of P2 excludes a set $T$ if for some $\delta > 0$,
$$\liminf_{n \to \infty} d(\hat{m}_n, T) > \delta \quad P_{\pi,\sigma^*}\text{-a.s., for every } \pi \in \Pi.$$
The policy $\pi^*$ ($\sigma^*$) will be called an approaching (excluding) policy for P1 (P2). A set is approachable if there exists an approaching policy.

Noting that approaching a set and its topological closure are the same, we shall henceforth suppose that the set $T$ is closed. The notion of approachability and excludability assumes uniformity with respect to time (and to the strategy of P2 for approachability, or of P1 for excludability).

2.1 The projected game

Let $u$ be a unit vector in the reward space $\mathbb{R}^k$. We often consider the projected game in direction $u$ as the zero-sum game with the same dynamics as above, and scalar rewards $r_n = m_n \cdot u$. Here $\cdot$ stands for the standard inner product in $\mathbb{R}^k$. Denote this game by $\Gamma(u)$.

2.2 The Basic Approachability Results

For any $x \notin T$, denote by $C_x$ a closest point in $T$ to $x$, and let $u_x$ be the unit vector in the direction of $C_x - x$, which points from $x$ to the goal set $T$. The following theorem requires, geometrically, that there exists a (mixed) action $p(x)$ such that the set of all possible (vector-valued) expected rewards is on the other side of the hyperplane supported by $C_x$ in direction $u_x$.

Theorem 3 Assume that for every point $x \notin T$ there exists a strategy $p(x)$ such that:
$$\big(m(p(x), q) - C_x\big) \cdot u_x \ge 0, \quad \forall q \in \Delta(B). \tag{1}$$
Then $T$ is approachable by P1. An approaching policy is given as follows: if $\hat{m}_n \notin T$, play $p(\hat{m}_n)$; otherwise, play arbitrarily.

Proof Let $y_n = C_{\hat{m}_n}$ and denote by $\mathcal{F}_n$ the filtration generated by the history up to time $n$. We further let $d_n = \|\hat{m}_n - y_n\|$. We want to prove that $d_n \to 0$ a.s. We have that:
$$\mathbb{E}(d_{n+1}^2 \mid \mathcal{F}_n) = \mathbb{E}\big(\|\hat{m}_{n+1} - y_{n+1}\|^2 \mid \mathcal{F}_n\big) \le \mathbb{E}\big(\|\hat{m}_{n+1} - y_n\|^2 \mid \mathcal{F}_n\big) = \mathbb{E}\big(\|\hat{m}_{n+1} - \hat{m}_n + \hat{m}_n - y_n\|^2 \mid \mathcal{F}_n\big)$$
$$= \|\hat{m}_n - y_n\|^2 + \mathbb{E}\big(\|\hat{m}_{n+1} - \hat{m}_n\|^2 \mid \mathcal{F}_n\big) + 2\, \mathbb{E}\big((\hat{m}_n - y_n) \cdot (\hat{m}_{n+1} - \hat{m}_n) \mid \mathcal{F}_n\big).$$
Now, since $\hat{m}_{n+1} - \hat{m}_n = m_{n+1}/(n+1) - \hat{m}_n/(n+1)$ we have that:
$$\mathbb{E}(d_{n+1}^2 \mid \mathcal{F}_n) \le d_n^2 + \frac{C}{n^2} + 2\, \mathbb{E}\big((\hat{m}_n - y_n) \cdot (\hat{m}_{n+1} - \hat{m}_n) \mid \mathcal{F}_n\big).$$
Expanding the last term we obtain:
$$(\hat{m}_n - y_n) \cdot (\hat{m}_{n+1} - \hat{m}_n) = (\hat{m}_n - y_n) \cdot \big(m_{n+1}/(n+1) - \hat{m}_n/(n+1)\big)$$
$$= (\hat{m}_n - y_n) \cdot \big(y_n/(n+1) - \hat{m}_n/(n+1) + m_{n+1}/(n+1) - y_n/(n+1)\big)$$
$$= -\frac{d_n^2}{n+1} + \frac{1}{n+1} (\hat{m}_n - y_n) \cdot (m_{n+1} - y_n).$$
Now, the expected value of the last term is non-positive (by condition (1), since $u_x$ is the unit vector pointing from $\hat{m}_n$ towards $y_n$), so we obtain:
$$\mathbb{E}(d_{n+1}^2 \mid \mathcal{F}_n) \le \Big(1 - \frac{2}{n+1}\Big) d_n^2 + \frac{c}{n^2}.$$
It follows by Lemma 1 that $d_n \to 0$ almost surely.
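Below is a minimal sketch of the approaching policy of Theorem 3 on one classical instance: the vector payoff is the per-action regret vector $m_{a'}(a, b) = r(a', b) - r(a, b)$ and the target set is the non-positive orthant $T = \{x : x \le 0\}$. For this instance $C_x$ is the componentwise $\min(x, 0)$, and a mixed action satisfying condition (1) is $p(a) \propto [\hat{m}_n(a)]_+$ (the well-known regret-matching rule). This is an illustration of the theorem, not the particular construction used in Section 3 below; the reward matrix and Nature's behaviour are made-up assumptions.

```python
import numpy as np

# Approaching policy of Theorem 3 for the regret-vector payoff and the
# non-positive orthant target.  Reward matrix and Nature's moves are made up.
rng = np.random.default_rng(2)
A, B = 3, 4
r = rng.uniform(size=(A, B))

N = 20000
m_hat = np.zeros(A)                      # running average of the vector payoff

for n in range(1, N + 1):
    pos = np.maximum(m_hat, 0.0)         # positive part of the average payoff
    if pos.sum() > 0:                    # m_hat outside T: play p(m_hat)
        p = pos / pos.sum()
    else:                                # inside T: play arbitrarily
        p = np.full(A, 1.0 / A)
    a = rng.choice(A, p=p)
    b = rng.integers(B)                  # Nature; any behaviour is fine here
    m = r[:, b] - r[a, b]                # per-action regret vector at stage n
    m_hat += (m - m_hat) / n             # incremental average

dist = np.linalg.norm(np.maximum(m_hat, 0.0))   # d(m_hat, T) for the orthant
print(f"distance of the average payoff vector to T: {dist:.4f}")
```

The printed distance shrinks roughly like $O(1/\sqrt{n})$, in line with the convergence-rate remark below.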

Remarks:

1. Convergence Rates. The convergence rate of the above policy is $O(1/\sqrt{T})$ and is independent of the dimension. The only dependence kicks in through the magnitude of the randomness (the second moment, to be exact).

2. Complexity. There are two distinct elements to computing an approaching strategy as in Theorem 3. The first is finding the closest point $C_x$ and the second is solving the projected game. Solving the projected zero-sum game can easily be done using linear programming (or other methods) with polynomial dependence on the number of actions of both players. Finding $C_x$, however, can in general be a very hard problem, as finding the closest point in a non-convex set is NP-hard. There are, however, some easy instances, such as the case where $T$ is convex and described in some compact form. In fact, it is enough to assume that a convex $T$ has a separation oracle (i.e., we can query in polytime if a point belongs to $T$ or not).

3. Is a set approachable? In general, it is NP-hard even to determine if a point is approachable, where hardness here is measured in the dimension (if the dimension is fixed, it is not hard to decide if a point is approachable).

4. The game theory connection. The above result generalizes the celebrated min-max theorem. To observe that, take a one-dimensional problem. In that case the approachable set is the segment $[v, \infty)$, where $v$ is the value of the game.

For convex target sets, the condition of the last theorem turns out to be both sufficient and necessary. Moreover, this condition may be expressed in a simpler form, which may be considered as a generalization of the minimax theorem for scalar games. Given a stationary policy $q \in \Delta(B)$ for P2, let $\Phi(A, q) \triangleq \mathrm{co}\big(\{m(p, q)\}_{p \in \Delta(A)}\big)$, where $\mathrm{co}$ is the convex hull operator. The Euclidean unit sphere in $\mathbb{R}^k$ is denoted by $\mathbb{B}^k$. The following theorem characterizes convex approachable sets in an elegant way.

Theorem 4 Let $T$ be a closed convex set in $\mathbb{R}^k$.
(i) $T$ is approachable if and only if $\Phi(A, q) \cap T \neq \emptyset$ for every stationary policy $q \in \Delta(B)$.
(ii) If $T$ is not approachable then it is excludable by P2. In fact, any stationary policy $q$ that violates (i) is an excluding policy.
(iii) $T$ is approachable if and only if $\mathrm{val}\,\Gamma(u) \ge \inf_{m \in T} u \cdot m$ for every $u \in \mathbb{B}^k$, where $\mathrm{val}$ is the value of the (scalar) zero-sum game.

Condition (i) in Theorem 4 is sometimes very easy to check, as we see below.

3 Back to regret

We are now ready to use approachability for proving we can minimize the regret. Consider the following vector-valued game. When the decision maker plays $a$, Nature plays $b$, and a reward $r$ is obtained, the vector-valued reward is $m = (r, e_b)$ where $e_b$ is a vector of zeros except for the $b$-th entry which is one. It holds that $\hat{m}_t = (\hat{r}_t, \hat{q}_t)$. Now, define the following target set $T \subseteq \mathbb{R} \times \Delta(B)$:
$$T = \{(r, q) : r \ge r^*(q),\ q \in \Delta(B)\}.$$
We claim that $T$ is convex. Indeed, $r^*(q)$ is convex as a maximum of linear functions, and the set $T$ is convex as the epigraph of a convex function. We now claim that $T$ is approachable. By Theorem 4, a necessary and sufficient condition is that $\Phi(A, q) \cap T \neq \emptyset$ for every $q$. Fix some $q$ and let $p \in \Delta(A)$ be a member of the argmax of $r$, that is: $p \in \arg\max_{p'} r(p', q)$. But this is easy to show since $m(p, q) = (r^*(q), q) \in \Phi(A, q)$ and $m(p, q) \in T$. This means that by using approachability we have that $d(\hat{m}_t, T) \to 0$.
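A quick numerical sanity check of the argument just given: for randomly drawn stationary policies $q$ of Nature, taking $p$ to be a best response yields $m(p, q) = (r^*(q), q)$, which lies in $T$ by construction, so $\Phi(A, q) \cap T \neq \emptyset$. The reward matrix and the sampled $q$'s are assumptions for the demo.

```python
import numpy as np

# Check condition (i) of Theorem 4 for the no-regret target set
# T = {(r, q) : r >= r*(q)} with the vector payoff m = (r, e_b).
rng = np.random.default_rng(3)
A, B = 3, 4
r = rng.uniform(size=(A, B))

def r_star(q):
    """Best-response reward r*(q) = max_a sum_b q(b) r(a, b)."""
    return (r @ q).max()

for _ in range(5):
    q = rng.dirichlet(np.ones(B))            # an arbitrary stationary policy of Nature
    p = np.zeros(A)
    p[np.argmax(r @ q)] = 1.0                # a (pure) best response to q
    m_pq = np.concatenate(([p @ r @ q], q))  # m(p, q) = (r(p, q), q)
    assert m_pq[0] >= r_star(q) - 1e-12      # so m(p, q) lies in T
print("Phi(A, q) intersects T for every sampled q: condition (i) holds.")
```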

What is left is to argue that approaching $T$ implies that $\hat{r}_t - r^*(\hat{q}_t) \ge 0$ asymptotically (that is, $\limsup_{t \to \infty} R_t \le 0$). This holds since $r^*$ is a uniformly continuous function (it is convex, continuous and on a compact domain). We have thus proved:

Theorem 5 There exists a strategy that guarantees that
$$\limsup_{t \to \infty} \big(r^*(\hat{q}_t) - \hat{r}_t\big) \le 0 \quad \text{a.s.}$$

In fact, we have proved that the convergence rate is $O(1/\sqrt{T})$.

We now return to the problem where we considered generalized regret. We claim a 0-regret strategy does exist. Indeed, consider the target set of the form:
$$T = \Big\{(r, \pi) \in \mathbb{R} \times \Delta(B^2) : r \ge \max_{p \in \Delta(A)^{B}} \sum_{b, b' \in B} \pi(b, b') \sum_{a} p(a \mid b)\, r(a, b')\Big\},$$
where we identify $p$ with a conditional probability of choosing an action given the past observation (note that it suffices to choose a pure action). It is easy to see that $T$ is convex as an epigraph of a convex function. Now, we need to define the game: when P1 chooses $a$, P2 chooses $b'$, and the previous action chosen by P2 was $b$, the reward is a vector whose first coordinate is $r(a, b')$ and whose remaining coordinates are zero except for a one at the $b\,|B| + b'$ coordinate (i.e., the coordinate indexed by the pair $(b, b')$). It remains an easy exercise to show that the set $T$ is approachable. (We note that a slight extension of approachability is needed; see "The Empirical Bayes Envelope and Regret Minimization in Competitive Markov Decision Processes," S. Mannor and N. Shimkin, Mathematics of Operations Research 28(1):327-345.)

4 Calibration

The definition of calibration and a very easy proof using approachability is provided in the attached note.

A Appendix

Lemma 1 Assume $e_t$ is a non-negative random variable, measurable according to the sigma algebra $\mathcal{F}_t$ ($\mathcal{F}_t \subseteq \mathcal{F}_{t+1}$) and that
$$\mathbb{E}(e_{t+1} \mid \mathcal{F}_t) \le (1 - d_t) e_t + c\, d_t^2. \tag{2}$$
Further assume that $\sum_{t=1}^{\infty} d_t = \infty$, $d_t \ge 0$, and that $d_t \to 0$. Then $e_t \to 0$ P-a.s.

Proof First note that by taking the expectation of Eq. (2) we get:
$$\mathbb{E} e_{t+1} \le (1 - d_t)\, \mathbb{E} e_t + c\, d_t^2.$$
According to Bertsekas and Tsitsiklis (Neuro-Dynamic Programming, page 117) it follows that $\mathbb{E} e_t \to 0$. Since $e_t$ is non-negative it suffices to show that $e_t$ converges. Fix $\epsilon > 0$ and let $V_t^\epsilon = \max\{\epsilon, e_t\}$. Since $d_t \to 0$ there exists $T(\epsilon)$ such that $c\, d_t < \epsilon$ for $t > T(\epsilon)$. Restrict attention to $t > T(\epsilon)$. If $e_t < \epsilon$ then
$$\mathbb{E}(V_{t+1}^\epsilon \mid \mathcal{F}_t) \le (1 - d_t)\epsilon + c\, d_t^2 \le \epsilon \le V_t^\epsilon.$$
If $e_t \ge \epsilon$ we have:
$$\mathbb{E}(V_{t+1}^\epsilon \mid \mathcal{F}_t) \le (1 - d_t) e_t + c\, d_t^2 \le (1 - d_t) e_t + d_t e_t = e_t = V_t^\epsilon.$$
So $V_t^\epsilon$ is a super-martingale, and by a standard convergence argument we get $V_t^\epsilon \to V^\epsilon$. By definition $V_t^\epsilon \ge \epsilon$ and therefore $\mathbb{E} V^\epsilon \ge \epsilon$. Since $\mathbb{E}[\max(X, Y)] \le \mathbb{E} X + \mathbb{E} Y$ for non-negative $X, Y$, it follows that $\mathbb{E} V_t^\epsilon \le \mathbb{E} e_t + \epsilon$, so that $\mathbb{E} V^\epsilon = \epsilon$. Now we have a non-negative random variable with expectation $\epsilon$ which is at least $\epsilon$ with probability 1. It follows that $V^\epsilon = \epsilon$ a.s. To summarize, we have shown that for every $\epsilon > 0$, with probability 1:
$$\limsup_{t \to \infty} e_t \le \limsup_{t \to \infty} V_t^\epsilon = \lim_{t \to \infty} V_t^\epsilon = \epsilon.$$
Since $\epsilon$ is arbitrary and $e_t$ non-negative, it follows that $e_t \to 0$ almost surely.
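Finally, a tiny numerical illustration of Lemma 1 in the form in which it is used in the proof of Theorem 3: we iterate the worst case of the bound (2), $e_{t+1} = (1 - d_t) e_t + c\, d_t^2$, with the step sizes $d_t = 2/(t+1)$ that appear in the approachability proof. The constants below are arbitrary.

```python
# Iterate the worst case of the recursion in Lemma 1 with d_t = 2/(t+1).
c, e = 5.0, 1.0
for t in range(1, 100001):
    d = 2.0 / (t + 1)
    e = (1 - d) * e + c * d ** 2
    if t in (10, 100, 1000, 10000, 100000):
        print(f"t = {t:6d}   e_t = {e:.6f}")
# e_t decays towards 0, consistent with the conclusion of the lemma.
```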